50 changes: 45 additions & 5 deletions README.md
@@ -3,11 +3,51 @@ CUDA Denoiser For CUDA Path Tracer

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 4**

* (TODO) YOUR NAME HERE
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Ryan Tong
* [LinkedIn](https://www.linkedin.com/in/ryanctong/)
* Tested on: Windows 10, i7-8750H @ 2.20GHz 16GB, GeForce GTX 1060 6144MB (Personal Laptop)

### (TODO: Your README)
![Denoiser](img/title.png)

*DO NOT* leave the README to the last minute! It is a crucial part of the
project, and we will not be able to grade you without a good README.
### Project Description
This project uses an edge-avoiding à-trous wavelet filter to denoise a path-traced image at an early iteration, reducing the number of iterations needed to produce an acceptably smooth result. The implementation follows Dammertz et al., "Edge-Avoiding À-Trous Wavelet Transform for Fast Global Illumination Filtering": https://jo.dreggn.org/home/2010_atrous.pdf
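The core idea of the à-trous ("with holes") transform can be illustrated with a minimal 1D CPU sketch (a hypothetical helper, not the CUDA kernel in this PR, and without the edge-stopping weights the real denoiser adds):

```cpp
#include <vector>

// Minimal 1D sketch of a single a-trous pass: the 5-tap B3-spline kernel is
// applied with its taps spaced `stepWidth` samples apart ("holes"), so
// successive passes with doubled step widths cover a large footprint while
// each pass reads only 5 taps per sample.
std::vector<float> atrousPass1D(const std::vector<float>& in, int stepWidth) {
    const float kernel[5] = {1.f/16.f, 1.f/4.f, 3.f/8.f, 1.f/4.f, 1.f/16.f};
    std::vector<float> out(in.size(), 0.f);
    for (int x = 0; x < (int)in.size(); ++x) {
        float sum = 0.f, weightSum = 0.f;
        for (int k = -2; k <= 2; ++k) {
            int nx = x + k * stepWidth;
            if (nx < 0 || nx >= (int)in.size()) continue; // skip out-of-bounds taps
            sum       += kernel[k + 2] * in[nx];
            weightSum += kernel[k + 2];
        }
        out[x] = sum / weightSum; // renormalize where taps fell off the edge
    }
    return out;
}
```

The 2D version in `pathtrace.cu` uses the same 5x5 separable B-spline kernel and offsets, but additionally scales each tap by color, normal, and position edge-stopping weights.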

### Performance
To generate an “acceptably smooth” image like the one shown above, the denoiser takes about 13 ms. This measurement was taken at 800x800 resolution with a filter size of 40.

Denoising drastically reduces the number of path-tracing iterations needed: the denoised image uses only 10 iterations, whereas producing a comparably smooth image without denoising takes roughly 500 iterations.

### Denoised vs Not
![Denoised](img/title.png)
![Reference](img/reference.png)

### Performance Analysis
Two parameters that affect the runtime of the denoiser are resolution and filter size. Resolution determines the total number of operations performed: the higher the resolution, the more threads must be launched, since one denoising thread is launched per pixel. Increasing the filter size also increases runtime because it increases the number of à-trous passes: we start with a 5x5 kernel and double the step width each pass until the effective footprint reaches the desired filter size. These effects are reflected in the data below:

### Resolution Performance Impact
![Denoiser](img/resolution.png)

### Filter Size Performance Impact
![Denoiser](img/filter.png)

### Filter Size Visual Analysis
Here are some images comparing the effects of filter size. Filter sizes that are too small are ineffective, while sizes that are too large make the image overly blurry. This makes sense: small filters do not incorporate enough neighboring pixels to smooth the noise, while large filters average over too many neighbors and wash out detail.
### Different Filter Size Visualization
![5x5](img/five.png)
![15x15](img/fifteen.png)
![45x45](img/fourtyfive.png)
![80x80](img/eighty.png)
![100x100](img/hundred.png)

### Material Type
The material type also affects the effectiveness of this technique. Diffuse materials work best, while specular materials fare worse because their reflections are blurred along with the noise.
### Different Material Visualization
![diffuse](img/diffuse.png)
![specular](img/title.png)

### Scene Type
The amount of light also affects the effectiveness of this technique. Brighter scenes work better because the lighting is more uniform and less complex. As you can see in the darker Cornell box scene, dark splotches remain where there is a large transition from light to dark that the filter is unable to smooth.
### Different Lighting Visualization
![dark](img/dark.png)
![light](img/title.png)

Binary file added img/dark.png
Binary file added img/diffuse.png
Binary file added img/eighty.png
Binary file added img/fifteen.png
Binary file added img/filter.png
Binary file added img/five.png
Binary file added img/fourtyfive.png
Binary file added img/hundred.png
Binary file added img/reference.png
Binary file added img/resolution.png
Binary file added img/title.png
2 changes: 1 addition & 1 deletion scenes/cornell.txt
@@ -52,7 +52,7 @@ EMITTANCE 0
CAMERA
RES 800 800
FOVY 45
ITERATIONS 5000
ITERATIONS 10
DEPTH 8
FILE cornell
EYE 0.0 5 10.5
6 changes: 5 additions & 1 deletion src/main.cpp
@@ -167,7 +167,11 @@ void runCuda() {

if (ui_showGbuffer) {
showGBuffer(pbo_dptr);
} else {
}
else if (ui_denoise) {
showDenoise(pbo_dptr, iteration, ui_filterSize, ui_colorWeight, ui_normalWeight, ui_positionWeight);
}
else {
showImage(pbo_dptr, iteration);
}

212 changes: 171 additions & 41 deletions src/pathtrace.cu
@@ -67,18 +67,23 @@ __global__ void sendImageToPBO(uchar4* pbo, glm::ivec2 resolution,
}
}

// TODO Modify this so that we can viz different parts of the gbuffer
__global__ void gbufferToPBO(uchar4* pbo, glm::ivec2 resolution, GBufferPixel* gBuffer) {
int x = (blockIdx.x * blockDim.x) + threadIdx.x;
int y = (blockIdx.y * blockDim.y) + threadIdx.y;

if (x < resolution.x && y < resolution.y) {
int index = x + (y * resolution.x);
float timeToIntersect = gBuffer[index].t * 256.0;

pbo[index].w = 0;
pbo[index].x = timeToIntersect;
pbo[index].y = timeToIntersect;
pbo[index].z = timeToIntersect;
//float timeToIntersect = gBuffer[index].t * 256.0;
//glm::vec3 viz = (gBuffer[index].normal + glm::vec3(1.0)) / glm::vec3(2.0) * glm::vec3(255.0);
if (gBuffer[index].t > 0) {
float position_range = 25.f;
glm::vec3 viz = (glm::clamp(gBuffer[index].position, glm::vec3(-position_range), glm::vec3(position_range)) + position_range) / (position_range * 2.f) * 255.f;
pbo[index].w = 0;
pbo[index].x = viz.r;
pbo[index].y = viz.g;
pbo[index].z = viz.b;
}
}
}

@@ -92,6 +97,29 @@ static GBufferPixel* dev_gBuffer = NULL;
// TODO: static variables for device memory, any extra info you need, etc
// ...

static float* dev_filter = NULL;
static glm::vec2* dev_offsets = NULL;
// Kernel/Filter from https://www.eso.org/sci/software/esomidas/doc/user/18NOV/volb/node317.html
const float filter[25] = { 1.0 / 256.0, 1.0 / 64.0, 3.0 / 128.0, 1.0 / 64.0, 1.0 / 256.0,
1.0 / 64.0, 1.0 / 16.0, 3.0 / 32.0, 1.0 / 16.0, 1.0 / 64.0,
3.0 / 128.0, 3.0 / 32.0, 9.0 / 64.0, 3.0 / 32.0, 3.0 / 128.0,
1.0 / 64.0, 1.0 / 16.0, 3.0 / 32.0, 1.0 / 16.0, 1.0 / 64.0,
1.0 / 256.0, 1.0 / 64.0, 3.0 / 128.0, 1.0 / 64.0, 1.0 / 256.0,
};
// Offsets (x, y)
const glm::vec2 offsets[25] = { glm::vec2(-2, -2), glm::vec2(-1, -2), glm::vec2(0, -2), glm::vec2(1, -2), glm::vec2(2, -2),
glm::vec2(-2, -1), glm::vec2(-1, -1), glm::vec2(0, -1), glm::vec2(1, -1), glm::vec2(2, -1),
glm::vec2(-2, 0), glm::vec2(-1, 0), glm::vec2(0, 0), glm::vec2(1, 0), glm::vec2(2, 0),
glm::vec2(-2, 1), glm::vec2(-1, 1), glm::vec2(0, 1), glm::vec2(1, 1), glm::vec2(2, 1),
glm::vec2(-2, 2), glm::vec2(-1, 2), glm::vec2(0, 2), glm::vec2(1, 2), glm::vec2(2, 2),
};
// Temp denoise output buffer for ping ponging
static glm::vec3* dev_denoise_in = NULL;
static glm::vec3* dev_denoise_out = NULL;
// Stuff for timing
static cudaEvent_t startTime = NULL;
static cudaEvent_t endTime = NULL;

void pathtraceInit(Scene *scene) {
hst_scene = scene;
const Camera &cam = hst_scene->state.camera;
@@ -114,7 +142,17 @@ void pathtraceInit(Scene *scene) {
cudaMalloc(&dev_gBuffer, pixelcount * sizeof(GBufferPixel));

// TODO: initialize any extra device memeory you need
cudaMalloc(&dev_filter, 25 * sizeof(float));
cudaMemcpy(dev_filter, &filter, 25 * sizeof(float), cudaMemcpyHostToDevice);

cudaMalloc(&dev_offsets, 25 * sizeof(glm::vec2));
cudaMemcpy(dev_offsets, &offsets, 25 * sizeof(glm::vec2), cudaMemcpyHostToDevice);

cudaMalloc(&dev_denoise_in, pixelcount * sizeof(glm::vec3));
cudaMalloc(&dev_denoise_out, pixelcount * sizeof(glm::vec3));

cudaEventCreate(&startTime);
cudaEventCreate(&endTime);
checkCUDAError("pathtraceInit");
}

@@ -126,7 +164,17 @@ void pathtraceFree() {
cudaFree(dev_intersections);
cudaFree(dev_gBuffer);
// TODO: clean up any extra device memory you created
cudaFree(dev_filter);
cudaFree(dev_offsets);
cudaFree(dev_denoise_in);
cudaFree(dev_denoise_out);

if (startTime != NULL) {
cudaEventDestroy(startTime);
}
if (endTime != NULL) {
cudaEventDestroy(endTime);
}
checkCUDAError("pathtraceFree");
}

@@ -148,7 +196,7 @@ __global__ void generateRayFromCamera(Camera cam, int iter, int traceDepth, Path
PathSegment & segment = pathSegments[index];

segment.ray.origin = cam.position;
segment.color = glm::vec3(1.0f, 1.0f, 1.0f);
segment.color = glm::vec3(1.0f, 1.0f, 1.0f);

segment.ray.direction = glm::normalize(cam.view
- cam.right * cam.pixelLength.x * ((float)x - (float)cam.resolution.x * 0.5f)
@@ -273,15 +321,19 @@ __global__ void shadeSimpleMaterials (
}
}


// Store t, surface normal, and intersection position in the G-buffer
__global__ void generateGBuffer (
int num_paths,
ShadeableIntersection* shadeableIntersections,
PathSegment* pathSegments,
PathSegment* pathSegments,
GBufferPixel* gBuffer) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < num_paths)
{
gBuffer[idx].t = shadeableIntersections[idx].t;
gBuffer[idx].normal = shadeableIntersections[idx].surfaceNormal;
gBuffer[idx].position = shadeableIntersections[idx].t * pathSegments[idx].ray.direction + pathSegments[idx].ray.origin;
}
}

@@ -356,46 +408,46 @@ void pathtrace(int frame, int iter) {
// --- PathSegment Tracing Stage ---
// Shoot ray into scene, bounce between objects, push shading chunks

// Empty gbuffer
cudaMemset(dev_gBuffer, 0, pixelcount * sizeof(GBufferPixel));
// Empty gbuffer
cudaMemset(dev_gBuffer, 0, pixelcount * sizeof(GBufferPixel));

// clean shading chunks
cudaMemset(dev_intersections, 0, pixelcount * sizeof(ShadeableIntersection));

bool iterationComplete = false;
bool iterationComplete = false;
while (!iterationComplete) {

// tracing
dim3 numblocksPathSegmentTracing = (num_paths + blockSize1d - 1) / blockSize1d;
computeIntersections <<<numblocksPathSegmentTracing, blockSize1d>>> (
depth
, num_paths
, dev_paths
, dev_geoms
, hst_scene->geoms.size()
, dev_intersections
);
checkCUDAError("trace one bounce");
cudaDeviceSynchronize();

if (depth == 0) {
generateGBuffer<<<numblocksPathSegmentTracing, blockSize1d>>>(num_paths, dev_intersections, dev_paths, dev_gBuffer);
}

depth++;

shadeSimpleMaterials<<<numblocksPathSegmentTracing, blockSize1d>>> (
iter,
num_paths,
dev_intersections,
dev_paths,
dev_materials
);
iterationComplete = depth == traceDepth;
// tracing
dim3 numblocksPathSegmentTracing = (num_paths + blockSize1d - 1) / blockSize1d;
computeIntersections <<<numblocksPathSegmentTracing, blockSize1d>>> (
depth
, num_paths
, dev_paths
, dev_geoms
, hst_scene->geoms.size()
, dev_intersections
);
checkCUDAError("trace one bounce");
cudaDeviceSynchronize();

if (depth == 0) {
generateGBuffer<<<numblocksPathSegmentTracing, blockSize1d>>>(num_paths, dev_intersections, dev_paths, dev_gBuffer);
}

depth++;

shadeSimpleMaterials<<<numblocksPathSegmentTracing, blockSize1d>>> (
iter,
num_paths,
dev_intersections,
dev_paths,
dev_materials
);
iterationComplete = depth == traceDepth;
}

// Assemble this iteration and apply it to the image
dim3 numBlocksPixels = (pixelcount + blockSize1d - 1) / blockSize1d;
// Assemble this iteration and apply it to the image
dim3 numBlocksPixels = (pixelcount + blockSize1d - 1) / blockSize1d;
finalGather<<<numBlocksPixels, blockSize1d>>>(num_paths, dev_image, dev_paths);

///////////////////////////////////////////////////////////////////////////
Expand All @@ -422,7 +474,7 @@ void showGBuffer(uchar4* pbo) {
}

void showImage(uchar4* pbo, int iter) {
const Camera &cam = hst_scene->state.camera;
const Camera &cam = hst_scene->state.camera;
const dim3 blockSize2d(8, 8);
const dim3 blocksPerGrid2d(
(cam.resolution.x + blockSize2d.x - 1) / blockSize2d.x,
@@ -431,3 +483,81 @@ const Camera &cam = hst_scene->state.camera;
// Send results to OpenGL buffer for rendering
sendImageToPBO<<<blocksPerGrid2d, blockSize2d>>>(pbo, cam.resolution, iter, dev_image);
}

__global__ void denoise(glm::vec3* dev_imageIn, glm::vec3* dev_imageOut, const int stepWidth, const glm::vec2 resolution,
const glm::vec2* dev_offsets, const float* dev_filter, const float colorSigma, const float normalSigma,
const float positionSigma, const GBufferPixel* dev_gBuffer) {
int x = (blockIdx.x * blockDim.x) + threadIdx.x;
int y = (blockIdx.y * blockDim.y) + threadIdx.y;

if (x < resolution.x && y < resolution.y) {
int originalIndex = x + (y * resolution.x);

//Center point values (current pixel)
glm::vec3 originalColor = dev_imageIn[originalIndex];
glm::vec3 originalNorm = dev_gBuffer[originalIndex].normal;
glm::vec3 originalPos = dev_gBuffer[originalIndex].position;

glm::vec3 sum = glm::vec3(0.0);
float cumW = 0.0;

for (int i = 0; i < 25; ++i) { // Get neighbors
glm::vec2 neighbor_offset = dev_offsets[i] * glm::vec2(stepWidth);
int neighborX = x + neighbor_offset.x;
int neighborY = y + neighbor_offset.y;
if (neighborX >= 0 && neighborX < resolution.x && neighborY >= 0 && neighborY < resolution.y) { // check bounds of image
int neighborIndex = neighborX + (neighborY * resolution.x);

glm::vec3 color = dev_imageIn[neighborIndex];
float colorWeight = min(exp(-(glm::length2(originalColor - color)) / colorSigma), 1.f);

glm::vec3 norm = dev_gBuffer[neighborIndex].normal;
float normWeight = min(exp(-(max(glm::length2(originalNorm - norm) / (stepWidth * stepWidth), 0.f) / normalSigma)), 1.f);

glm::vec3 pos = dev_gBuffer[neighborIndex].position;
float posWeight = min(exp(-(glm::length2(originalPos - pos) / positionSigma)), 1.f); // compare positions, not position vs. normal

float weight = colorWeight * normWeight * posWeight;
sum += color * weight * dev_filter[i];
cumW += weight * dev_filter[i];
//blurred_pix += dev_filter[i] * dev_imageIn[neighbor_index];
}
}
dev_imageOut[originalIndex] = sum / cumW;
}
}

void showDenoise(uchar4* pbo, int iter, const int filterSize, const float colorSigma, const float normalSigma, const float positionSigma) {
const Camera& cam = hst_scene->state.camera;
const dim3 blockSize2d(8, 8);
const dim3 blocksPerGrid2d(
(cam.resolution.x + blockSize2d.x - 1) / blockSize2d.x,
(cam.resolution.y + blockSize2d.y - 1) / blockSize2d.y);
const int pixelcount = cam.resolution.x * cam.resolution.y;

// Copy image to a scratch buffer so denoising doesn't modify the original image
cudaMemcpy(dev_denoise_in, dev_image, pixelcount * sizeof(glm::vec3), cudaMemcpyDeviceToDevice);
const float squaredColor = pow(colorSigma, 2);
const float squaredNormal = pow(normalSigma * .1, 2);
const float squaredPos = pow(positionSigma, 2);
int i = 0;
float time;
cudaEventRecord(startTime);
while (4 * (1 << i) + 1 < filterSize) { // Multiple iterations of denoising
int stepWidth = 1 << i;
denoise << <blocksPerGrid2d, blockSize2d >> > (dev_denoise_in, dev_denoise_out, stepWidth,
cam.resolution, dev_offsets, dev_filter,
squaredColor, squaredNormal, squaredPos, dev_gBuffer);
cudaDeviceSynchronize();
//Ping pong buffers
glm::vec3* temp = dev_denoise_in;
dev_denoise_in = dev_denoise_out;
dev_denoise_out = temp;
++i;
}
cudaEventRecord(endTime);
cudaEventSynchronize(endTime);
cudaEventElapsedTime(&time, startTime, endTime);
std::cout << "Time denoise: " << time << std::endl;
sendImageToPBO << <blocksPerGrid2d, blockSize2d >> > (pbo, cam.resolution, iter, dev_denoise_in);
}
1 change: 1 addition & 0 deletions src/pathtrace.h
@@ -8,3 +8,4 @@ void pathtraceFree();
void pathtrace(int frame, int iteration);
void showGBuffer(uchar4 *pbo);
void showImage(uchar4 *pbo, int iter);
void showDenoise(uchar4* pbo, int iter, const int filterSize, const float colorSigma, const float normalSigma, const float positionSigma);
3 changes: 3 additions & 0 deletions src/sceneStructs.h
@@ -77,6 +77,9 @@ struct ShadeableIntersection {

// CHECKITOUT - a simple struct for storing scene geometry information per-pixel.
// What information might be helpful for guiding a denoising filter?
// Need to store normal, position of intersection
struct GBufferPixel {
float t;
glm::vec3 normal;
glm::vec3 position;
};