CIS565-Fall-2021 · oppenheimj · Oct 17, 2021 · Oct 19, 2021 · Oct 19, 2021
diff --git a/README.md b/README.md
@@ -1,13 +1,77 @@
-CUDA Denoiser For CUDA Path Tracer
-==================================
+**University of Pennsylvania, CIS 565: GPU Programming and Architecture**
+# Project 4 - CUDA Denoiser for CUDA Path Tracer
 
-**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 4**
+* Jonas Oppenheim ([LinkedIn](https://www.linkedin.com/in/jonasoppenheim/), [GitHub](https://github.com/oppenheimj/), [personal](http://www.jonasoppenheim.com/))
+* Tested on: Windows 10, Ryzen 9 5950x, 32GB, RTX 3080 (personal machine)
 
-* (TODO) YOUR NAME HERE
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+## Introduction
+We saw during the previous project that it takes many hundreds or even thousands of iterations for the noise to dissipate in a path traced image. The purpose of this project is to implement a clever technique that denoises a path traced image after only a handful of iterations. The technique is described in the paper "[Edge-Avoiding A-Trous Wavelet Transform for fast Global Illumination Filtering](https://jo.dreggn.org/home/2010_atrous.pdf)," by Dammertz, Sewtz, Hanika, and Lensch.
 
-### (TODO: Your README)
+The naive way to reduce noise in a path traced image would be to apply a Gaussian blur filter. This is naive because edges that should be sharp end up blurred. What we _really_ want is to blur only _within regions that form a single continuous piece of the scene_.
 
-*DO NOT* leave the README to the last minute! It is a crucial part of the
-project, and we will not be able to grade you without a good README.
+The idea presented in the paper is to instead store per-pixel information and then use this information to allow pixels to compare themselves to their neighbors in order to selectively apply the blur. First, the path tracer is run for a few iterations and per-pixel information is stored in what is called a "gbuffer". This information includes position and normal vectors. Then, every pixel looks at surrounding pixels and compares its gbuffer data to the neighbors' gbuffer data to see which neighbors are similar and should be blurred.
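The neighbor comparison described above can be sketched as an edge-stopping weight. This is a minimal sketch in plain C++ rather than the project's CUDA kernel, loosely following the form of the GLSL shader in the paper: each gbuffer channel yields a weight of the form `min(exp(-d^2 / phi), 1)`, and the three weights are multiplied. The `Vec3` and `PixelData` types, the function names, and the `phi` parameters are assumptions made for illustration, not identifiers taken from this repository.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical minimal 3-vector; the project itself uses glm types.
struct Vec3 { float x, y, z; };

// Hypothetical per-pixel record: the noisy color plus the gbuffer data.
struct PixelData { Vec3 color, normal, position; };

// Squared Euclidean distance between two vectors.
inline float dist2(const Vec3& a, const Vec3& b) {
    const float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz;
}

// Edge-stopping weight in (0, 1]: exactly 1 when the two values match,
// falling toward 0 as they diverge. A smaller phi makes the falloff sharper.
inline float edgeStoppingWeight(const Vec3& center, const Vec3& neighbor, float phi) {
    assert(phi > 0.0f);
    return std::min(std::exp(-dist2(center, neighbor) / phi), 1.0f);
}

// Combined weight of a neighbor q relative to the center pixel p.
// A neighbor only contributes strongly if color, normal, and position all agree.
inline float combinedWeight(const PixelData& p, const PixelData& q,
                            float colorPhi, float normalPhi, float positionPhi) {
    return edgeStoppingWeight(p.color, q.color, colorPhi)
         * edgeStoppingWeight(p.normal, q.normal, normalPhi)
         * edgeStoppingWeight(p.position, q.position, positionPhi);
}
```

With identical color and gbuffer data the weight is 1, so the neighbor contributes fully. A neighbor whose normal differs sharply gets a weight near 0 and is effectively left out of the blur, which is what keeps edges sharp.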
 
+## Implementation
+The implementation was fairly straightforward. The paper itself provides some implementation hints towards the end in the form of a GLSL fragment shader. I used a separate `void denoise()` CPU-side function wrapping a denoising kernel call. The assignment made it sound like we should denoise after every iteration of path tracing, and it's conceivable that this would have produced the best results. Instead, I chose to run the denoiser a single time, after all path tracing iterations have finished.
+
+## Questions
+### Quantitative
+1. The denoising procedure runs as quickly as a single iteration of path tracing. This is a huge result: the visual gains from denoising are worth hundreds or even thousands of path tracing iterations. The most efficient way to get a good result is therefore to perform a low number of path tracing iterations and then denoise.
+
+    ![qual1](img/qual_1.png)
+
+2. Without denoising, an acceptably smooth result is achieved after 1,000 iterations of path tracing. Note that this is highly subjective. _With_ denoising, only 25 iterations are needed to achieve a comparably smooth result. The graininess in the whites is due to issues with color compression.
+
+    | 1,000 iterations of path tracing w/o denoising | 25 iterations of path tracing w/ denoising |
+    |---|---|
+    |![](img/1000_iter_pt.png)|![](img/25_iter_pt.png)|
+
+3. The runtime complexity of this algorithm is linear in the number of pixels because the work done on each pixel is constant for a given filter size. The slightly upward-bending curve suggests that there is some penalty in terms of hardware efficiency, e.g. block size.
+
+    ![](img/qual_3.png)
+
+
+4. The filter size sets the number of filter passes. The step width for each pass is computed on the CPU in the following way:
+    ```
+    for (int power = 0; power < filterSize; power++) {
+        int stepWidth = 1 << power;
+        ...
+    }
+    ```
+    and then each thread inside the kernel on the GPU uses this `stepWidth`, along with an array of `glm::vec2`s to compute offsets:
+    ```
+    for (int i = 0; i < 25; i++) {
+        glm::vec2 uv = pixelCoord + offset[i] * stepWidth;
+        ...
+    }
+    ```
+    Increasing the filter size changes the number of times the kernel executes, but does not change the cost of each kernel invocation, so the runtime increases linearly with filter size.
+    ![](img/qual_4.png)
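To make the linear scaling concrete, the sketch below counts neighbor reads for the two loops above. It assumes the 25 offsets form a 5x5 grid with components in [-2, 2] (the README does not show the array's contents), and `makeOffsets`, `stepWidths`, and `totalTaps` are hypothetical helpers written for this illustration, not functions from the project.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical integer pixel offset.
struct Offset { int dx, dy; };

// Assumed 5x5 tap pattern: 25 offsets with dx, dy in [-2, 2].
std::vector<Offset> makeOffsets() {
    std::vector<Offset> offsets;
    for (int dy = -2; dy <= 2; ++dy)
        for (int dx = -2; dx <= 2; ++dx)
            offsets.push_back({dx, dy});
    return offsets;
}

// Step width of each pass, doubling every time, mirroring the CPU-side loop.
std::vector<int> stepWidths(int filterSize) {
    assert(filterSize >= 0 && filterSize < 31);  // keep 1 << power in range
    std::vector<int> widths;
    for (int power = 0; power < filterSize; ++power)
        widths.push_back(1 << power);
    return widths;
}

// Total neighbor reads for one denoise: passes * taps per pixel * pixels.
// The step width spreads the taps out but never changes how many there are.
std::int64_t totalTaps(int filterSize, int width, int height) {
    const std::int64_t tapsPerPixel = static_cast<std::int64_t>(makeOffsets().size());
    return static_cast<std::int64_t>(filterSize) * tapsPerPixel * width * height;
}
```

Under these assumptions, doubling the filter size doubles the tap count at a fixed resolution, which is consistent with the linear trend described above.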
+
+
+### Qualitative
+1. Visual quality improves as filter size increases until about five (which translates to roughly 2^5*5=160, so about 160x160), after which point there is little improvement. This makes sense, since the "distance" with respect to position and normal between the center pixel and the farther-out pixels gets large enough that their color contributions are effectively erased.
+
+2. The denoising procedure seems to work best with diffuse, solid-colored materials because the colors of neighboring pixels are most likely to be similar. The diffuse sphere looks essentially perfect, while the edges of the reflective sphere still have some noise that couldn't be smoothed out.
+
+    | Diffuse sphere | Reflective sphere |
+    |---|---|
+    |![](img/diffuse_sphere.png)|![](img/reflective_sphere.png)|
+
+3. The results vary from scene to scene. Because the filter samples so few neighboring pixels compared to a full Gaussian filter, every pixel counts for a lot. In low-light situations where the image is extremely noisy, the denoising procedure struggles.
+
+    | Best cornell large light | Best cornell small light |
+    |---|---|
+    |![](img/best_cornell_biglight.png)|![](img/best_cornell_smalllight.png)|
+
+## Debug images
+| positions | normals |
+|---|---|
+|![](img/debug_pos.png)|![](img/debug_nor.png)|
+
+## Bloopers
+The bloopers were absolutely a highlight of the project. The bottom right is my favorite :)
+
+|  |  |
+|---|---|
+|![](img/blooper1.png)|![](img/blooper2.png)|
+|![](img/blooper3.png)|![](img/blooper4.png)|
diff --git a/img/1000_iter_pt.png b/img/1000_iter_pt.png
diff --git a/img/25_iter_pt.png b/img/25_iter_pt.png
diff --git a/img/best_cornell_biglight.png b/img/best_cornell_biglight.png
diff --git a/img/best_cornell_smalllight.png b/img/best_cornell_smalllight.png
diff --git a/img/blooper1.png b/img/blooper1.png
diff --git a/img/blooper2.png b/img/blooper2.png
diff --git a/img/blooper3.png b/img/blooper3.png
diff --git a/img/blooper4.png b/img/blooper4.png
diff --git a/img/debug_nor.png b/img/debug_nor.png
diff --git a/img/debug_pos.png b/img/debug_pos.png
diff --git a/img/diffuse_sphere.png b/img/diffuse_sphere.png
diff --git a/img/qual_1.png b/img/qual_1.png
diff --git a/img/qual_3.png b/img/qual_3.png
diff --git a/img/qual_4.png b/img/qual_4.png
diff --git a/img/reflective_sphere.png b/img/reflective_sphere.png
diff --git a/scenes/cornell.txt b/scenes/cornell.txt
@@ -52,7 +52,7 @@ EMITTANCE   0
 CAMERA
 RES         800 800
 FOVY        45
-ITERATIONS  5000
+ITERATIONS  10
 DEPTH       8
 FILE        cornell
 EYE         0.0 5 10.5

diff --git a/src/main.cpp b/src/main.cpp
@@ -1,6 +1,7 @@
 #include "main.h"
 #include "preview.h"
 #include <cstring>
+#include <chrono>
 
 #include "../imgui/imgui.h"
 #include "../imgui/imgui_impl_glfw.h"
@@ -23,11 +24,22 @@ int ui_iterations = 0;
 int startupIterations = 0;
 int lastLoopIterations = 0;
 bool ui_showGbuffer = false;
+
 bool ui_denoise = false;
-int ui_filterSize = 80;
-float ui_colorWeight = 0.45f;
-float ui_normalWeight = 0.35f;
-float ui_positionWeight = 0.2f;
+bool lastLoopDenoise = false;
+
+int ui_filterSize = 5;
+int lastLoopFilterSize;
+
+float ui_colorWeight = 0.572f;
+float lastLoopColorWeight;
+
+float ui_normalWeight = 0.021f;
+float lastLoopNormalWeight;
+
+float ui_positionWeight = 0.789f;
+float lastLoopPositionWeight;
+
 bool ui_saveAndExit = false;
 
 static bool camchanged = true;
@@ -45,6 +57,8 @@ int iteration;
 int width;
 int height;
 
+long duration_total_us;
+
 //-------------------------------
 //-------------MAIN--------------
 //-------------------------------
@@ -120,15 +134,41 @@ void saveImage() {
     //img.saveHDR(filename);  // Save a Radiance HDR file
 }
 
+bool denoisingSettingChanged() {
+    bool settingChanged = false;
+
+    if (lastLoopFilterSize != ui_filterSize) {
+        lastLoopFilterSize = ui_filterSize;
+        settingChanged = true;
+    }
+
+    if (lastLoopColorWeight != ui_colorWeight) {
+        lastLoopColorWeight = ui_colorWeight;
+        settingChanged = true;
+    }
+
+    if (lastLoopNormalWeight != ui_normalWeight) {
+        lastLoopNormalWeight = ui_normalWeight;
+        settingChanged = true;
+    }
+
+    if (lastLoopPositionWeight != ui_positionWeight) {
+        lastLoopPositionWeight = ui_positionWeight;
+        settingChanged = true;
+    }
+
+    return settingChanged;
+}
+
 void runCuda() {
     if (lastLoopIterations != ui_iterations) {
-      lastLoopIterations = ui_iterations;
-      camchanged = true;
+        lastLoopIterations = ui_iterations;
+        camchanged = true;
     }
 
     if (camchanged) {
         iteration = 0;
-        Camera &cam = renderState->camera;
+        Camera& cam = renderState->camera;
         cameraPosition.x = zoom * sin(phi) * sin(theta);
         cameraPosition.y = zoom * cos(theta);
         cameraPosition.z = zoom * cos(phi) * sin(theta);
@@ -144,7 +184,7 @@ void runCuda() {
         cameraPosition += cam.lookAt;
         cam.position = cameraPosition;
         camchanged = false;
-      }
+    }
 
     // Map OpenGL buffer object for writing from CUDA on a single GPU
     // No data is moved (Win & Linux). When mapped to CUDA, OpenGL should not use this buffer
@@ -154,21 +194,51 @@ void runCuda() {
         pathtraceInit(scene);
     }
 
-    uchar4 *pbo_dptr = NULL;
+    uchar4* pbo_dptr = NULL;
     cudaGLMapBufferObject((void**)&pbo_dptr, pbo);
 
     if (iteration < ui_iterations) {
         iteration++;
 
         // execute the kernel
         int frame = 0;
+
+        auto start = chrono::high_resolution_clock::now();
         pathtrace(frame, iteration);
+        duration_total_us += chrono::duration_cast<chrono::microseconds>(chrono::high_resolution_clock::now() - start).count();
+
+        if (iteration == ui_iterations) {
+            std::cout << "Pathtrace avg duration " << duration_total_us / ui_iterations << " us" << std::endl;
+            duration_total_us = 0;
+        }
+    }
+
+    if (ui_denoise && iteration == ui_iterations) {
+        if (denoisingSettingChanged() || lastLoopDenoise != ui_denoise) {
+            std::cout << "Need to denoise!" << std::endl;
+
+            lastLoopDenoise = ui_denoise;
+            denoiseFree();
+            denoiseInit(scene);
+
+            auto start = chrono::high_resolution_clock::now();
+            denoise(ui_filterSize, ui_colorWeight, ui_normalWeight, ui_positionWeight);
+            auto duration_us = chrono::duration_cast<chrono::microseconds>(chrono::high_resolution_clock::now() - start).count();
+
+            std::cout << "Denoising duration " << duration_us << " us" << std::endl;
+        }
+    }
+
+    if (lastLoopDenoise != ui_denoise) {
+        lastLoopDenoise = ui_denoise;
     }
 
     if (ui_showGbuffer) {
-      showGBuffer(pbo_dptr);
+        showGBuffer(pbo_dptr);
+    } else if (ui_denoise) {
+        showDenoise(pbo_dptr, iteration);
     } else {
-      showImage(pbo_dptr, iteration);
+        showImage(pbo_dptr, iteration);
     }
 
     // unmap buffer object