diff --git a/README.md b/README.md
index 0e38ddb..1d9c16d 100644
--- a/README.md
+++ b/README.md
@@ -3,12 +3,133 @@ CUDA Stream Compaction
**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
-* (TODO) YOUR NAME HERE
- * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* RHUTA JOSHI
+ * [LinkedIn](https://www.linkedin.com/in/rcj9719/)
+ * [Website](https://sites.google.com/view/rhuta-joshi)
-### (TODO: Your README)
+* Tested on: Windows 10 Home, i5-7200U CPU @ 2.50GHz, NVIDIA GTX 940MX 4096 MB (Personal Laptop), RTX not supported
+* GPU Compute Capability: 5.0
-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+Introduction
+---
+### Stream Compaction
+
+Stream compaction is an important parallel computing primitive that produces a compact output buffer containing only the elements of an input buffer that satisfy some condition. In other words, given an array of elements, we want to create a new array holding the elements that meet a certain criterion while preserving their order.
+The important steps in a parallel stream compaction algorithm are as follows:
+
+![Stream compaction](img/stream-compaction.jpg)
+
+1. **Mapping** - Compute a temporary array containing
+    - 1 if the corresponding element meets the criterion
+    - 0 if it does not
+2. **Scanning** - Run an exclusive scan on the mapped temporary array, using one of the scanning techniques expanded below
+    - Naive scan
+    - Work-efficient scan
+3. **Scattering** - Write each input element that meets the criterion into the output at the index obtained from the scanned buffer
+    - The result of the scan is the index into the final array
+    - Only write an element if the temporary array holds a 1
+
+For this project, the criterion under consideration is that the element must not be equal to 0.
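+
+For example, compacting `[3, 0, 1, 0, 2]` under this criterion works out as follows (a hand-worked trace):
+
+```
+input:   [3, 0, 1, 0, 2]
+mapped:  [1, 0, 1, 0, 1]
+scanned: [0, 1, 1, 2, 2]   (exclusive scan of mapped)
+output:  [3, 1, 2]         (each kept element lands at its scanned index)
+```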
+
+### Parallel Scanning
+
+In this project, I implemented stream compaction on the CPU and GPU using the parallel all-prefix-sum algorithm (commonly known as scan) with CUDA and analyzed the performance of each implementation. The sequential scan algorithm is poorly suited to GPUs because it does not take advantage of the GPU's data parallelism. A parallel version of scan instead utilizes the GPU's parallel processors to speed up the computation. The parallel scan can be performed in two ways:
+
+1. Naive scan - This algorithm performs O(n log n) additions, iteratively adding elements with a doubling offset (see the worked example after this list).
+2. Work-efficient scan - This algorithm performs O(n) additions in two phases:
+    - Step 1: **Upsweep scan** (Parallel reduction phase) - In this phase, we traverse the tree from the leaves to the root, computing partial sums at the internal nodes of the tree. At the end of this phase, the root node (the last element in the array) holds the sum of all elements.
+
+      ![Upsweep](img/upsweep.jpg)
+
+    - Step 2: **Downsweep scan** (Collecting scanned results) - In the down-sweep phase, we traverse back down the tree from the root, using the partial sums from the reduce phase to build the scan in place on the array. We start by replacing the last element of the upsweep output (the root) with the identity value. Then, at each level,
+        - Left child: Copy the parent value
+        - Right child: Add the parent value to the former left child value
+
+      ![Downsweep](img/downsweep.jpg)
+
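+Both scans are easiest to follow on a small input. Scanning `[1, 2, 3, 4]` by hand (an illustrative trace, not program output):
+
+```
+Naive scan (inclusive, shifted at the end):
+  offset 1:             [1, 3, 5, 7]
+  offset 2:             [1, 3, 6, 10]
+  shift right, add 0:   [0, 1, 3, 6]
+
+Work-efficient scan (in place):
+  upsweep d=0:          [1, 3, 3, 7]
+  upsweep d=1:          [1, 3, 3, 10]
+  set root to identity: [1, 3, 3, 0]
+  downsweep d=1:        [1, 0, 3, 3]
+  downsweep d=0:        [0, 1, 3, 6]
+```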
+
+Tests performed
+---
+The following tests are run with a block size of 256 for an array size of 2^23, which is around 8 million elements.
+```
+****************
+** SCAN TESTS **
+****************
+ [ 3 16 30 16 9 8 8 11 41 20 38 34 7 ... 4 0 ]
+==== cpu scan, power-of-two ====
+ elapsed time: 0ms (std::chrono Measured)
+ [ 0 3 19 49 65 74 82 90 101 142 162 200 234 ... 205461733 205461737 ]
+==== cpu scan, non-power-of-two ====
+ elapsed time: 0ms (std::chrono Measured)
+ passed
+==== naive scan, power-of-two ====
+ elapsed time: 327.662ms (CUDA Measured)
+ passed
+==== naive scan, non-power-of-two ====
+ elapsed time: 328.087ms (CUDA Measured)
+ passed
+==== work-efficient scan, power-of-two ====
+ elapsed time: 0ms (CUDA Measured)
+ passed
+==== work-efficient scan, non-power-of-two ====
+ elapsed time: 0ms (CUDA Measured)
+ passed
+
+*****************************
+** STREAM COMPACTION TESTS **
+*****************************
+ [ 3 3 1 0 3 3 3 3 3 0 0 1 0 ... 3 0 ]
+==== cpu compact without scan, power-of-two ====
+ elapsed time: 45.6113ms (std::chrono Measured)
+ passed
+==== cpu compact without scan, non-power-of-two ====
+ elapsed time: 46.366ms (std::chrono Measured)
+ passed
+==== cpu compact with scan ====
+ elapsed time: 142.93ms (std::chrono Measured)
+ passed
+==== work-efficient compact, power-of-two ====
+ elapsed time: 743.851ms (CUDA Measured)
+ passed
+==== work-efficient compact, non-power-of-two ====
+ elapsed time: 739.94ms (CUDA Measured)
+ passed
+Press any key to continue . . .
+```
+
+Performance Analysis
+---
+### Scan algorithm
+
+For block sizes ranging from 4 to 1024, the best performance was observed with a block size of 128 or 256. Performance below a block size of 32 is very poor because the warp size is 32, and smaller block sizes leave most of each warp unused, effectively serializing the work. As the block size grows beyond 256, the number of idle threads per iteration also increases, decreasing performance. The following chart shows test results with a block size of 256.
+
+![Parallel scan performance analysis](img/parallel_scan_performance_analysis.png)
+
+Based on this chart, it appears that the CPU takes less time than the GPU parallel algorithms. This is because the GPU-based parallel algorithms still leave room for further optimization. Some of the things to consider when optimizing include:
+1. In the current implementation, the number of threads launched in each iteration of upsweep and downsweep is the same. We know that in each iteration, many threads are idle and simply return without performing any meaningful operation.
+2. Even if some threads in a warp finish early, they have to wait for the other threads in the warp. When this happens because of conditional branches, it is called warp divergence. It can be reduced by warp partitioning, i.e. grouping threads that are likely to terminate together into the same warp.
+
+I tried to improve the upsweep by making better thread divisions to optimize the parallel reduction process.
+
+![Upsweep optimization attempt](img/upsweep_optimization.jpg)
+I tried to take strides in the reverse order, i.e. ... 4, 2, 1 instead of 1, 2, 4 ..., by setting the offset as follows:
+```int offsetd = pow(2, maxDepth - d - 1);```
+Then, inside the upsweep kernel, instead of using the expensive modulo operator:
+```
+if (k < offsetd) {
+ data[k] += data[k + offsetd];
+}
+```
+However, this implementation is still buggy; I did not get time to finish and test it.
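+
+For reference, a minimal sketch of one way to realize this idea: launch only as many threads as there are active nodes at each level and map each thread index directly onto the node it updates, so the modulo test disappears entirely. This sketch is untested in this project, and `kernIndexedUpSweep` is a hypothetical name:
+
+```
+// One thread per active node at this level, where offsetd1 = 2^(d+1).
+// Launched with activeThreads = extended_n >> (d + 1) threads per level.
+__global__ void kernIndexedUpSweep(int activeThreads, int offsetd1, int* data) {
+    int tid = threadIdx.x + blockIdx.x * blockDim.x;
+    if (tid >= activeThreads) {
+        return;
+    }
+    // Map the compact thread index to the one node this level touches
+    int k = (tid + 1) * offsetd1 - 1;
+    data[k] += data[k - offsetd1 / 2];
+}
+```
+
+With this indexing, every launched thread does useful work, and the divergent modulo branch disappears.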
+
+### Stream compaction
+
+Stream compaction shows a similar trend to scanning. The following graph does not cover stream compaction tests using the naive scan method. The behavior with respect to block size is also similar to that observed for scanning, described above.
+
+![Stream compaction performance analysis](img/stream_compaction_analysis.png)
+
+References
+---
+1. GPU Parallel Algorithms Course Presentation - CIS 5650 - Fall 2022
+2. GPU Gems 3, Chapter 39 - [Parallel Prefix Sum (Scan) with CUDA](https://developer.nvidia.com/gpugems/GPUGems3/gpugems3_ch39.html)
diff --git a/img/downsweep.jpg b/img/downsweep.jpg
new file mode 100644
index 0000000..220cbb9
Binary files /dev/null and b/img/downsweep.jpg differ
diff --git a/img/parallel_scan_performance_analysis.png b/img/parallel_scan_performance_analysis.png
new file mode 100644
index 0000000..8b1b236
Binary files /dev/null and b/img/parallel_scan_performance_analysis.png differ
diff --git a/img/stream-compaction.jpg b/img/stream-compaction.jpg
new file mode 100644
index 0000000..d8879a8
Binary files /dev/null and b/img/stream-compaction.jpg differ
diff --git a/img/stream_compaction_analysis.png b/img/stream_compaction_analysis.png
new file mode 100644
index 0000000..cf19f58
Binary files /dev/null and b/img/stream_compaction_analysis.png differ
diff --git a/img/upsweep.jpg b/img/upsweep.jpg
new file mode 100644
index 0000000..dec124e
Binary files /dev/null and b/img/upsweep.jpg differ
diff --git a/img/upsweep_optimization.jpg b/img/upsweep_optimization.jpg
new file mode 100644
index 0000000..da9eaf3
Binary files /dev/null and b/img/upsweep_optimization.jpg differ
diff --git a/src/main.cpp b/src/main.cpp
index 896ac2b..210f212 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -44,7 +44,7 @@ int main(int argc, char* argv[]) {
printDesc("cpu scan, non-power-of-two");
StreamCompaction::CPU::scan(NPOT, c, a);
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
- printArray(NPOT, b, true);
+ //printArray(NPOT, b, true);
printCmpResult(NPOT, b, c);
zeroArray(SIZE, c);
@@ -81,19 +81,19 @@ int main(int argc, char* argv[]) {
//printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);
- zeroArray(SIZE, c);
- printDesc("thrust scan, power-of-two");
- StreamCompaction::Thrust::scan(SIZE, c, a);
- printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
- //printArray(SIZE, c, true);
- printCmpResult(SIZE, b, c);
+ //zeroArray(SIZE, c);
+ //printDesc("thrust scan, power-of-two");
+ //StreamCompaction::Thrust::scan(SIZE, c, a);
+ //printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+ ////printArray(SIZE, c, true);
+ //printCmpResult(SIZE, b, c);
- zeroArray(SIZE, c);
- printDesc("thrust scan, non-power-of-two");
- StreamCompaction::Thrust::scan(NPOT, c, a);
- printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
- //printArray(NPOT, c, true);
- printCmpResult(NPOT, b, c);
+ //zeroArray(SIZE, c);
+ //printDesc("thrust scan, non-power-of-two");
+ //StreamCompaction::Thrust::scan(NPOT, c, a);
+ //printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+ ////printArray(NPOT, c, true);
+ //printCmpResult(NPOT, b, c);
printf("\n");
printf("*****************************\n");
@@ -115,7 +115,7 @@ int main(int argc, char* argv[]) {
count = StreamCompaction::CPU::compactWithoutScan(SIZE, b, a);
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
expectedCount = count;
- printArray(count, b, true);
+ //printArray(count, b, true);
printCmpLenResult(count, expectedCount, b, b);
zeroArray(SIZE, c);
@@ -123,14 +123,14 @@ int main(int argc, char* argv[]) {
count = StreamCompaction::CPU::compactWithoutScan(NPOT, c, a);
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
expectedNPOT = count;
- printArray(count, c, true);
+ //printArray(count, c, true);
printCmpLenResult(count, expectedNPOT, b, c);
zeroArray(SIZE, c);
printDesc("cpu compact with scan");
count = StreamCompaction::CPU::compactWithScan(SIZE, c, a);
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
- printArray(count, c, true);
+ //printArray(count, c, true);
printCmpLenResult(count, expectedCount, b, c);
zeroArray(SIZE, c);
diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu
index 719fa11..a422137 100644
--- a/stream_compaction/cpu.cu
+++ b/stream_compaction/cpu.cu
@@ -18,9 +18,16 @@ namespace StreamCompaction {
* (Optional) For better understanding before starting moving to GPU, you can simulate your GPU scan in this function first.
*/
void scan(int n, int *odata, const int *idata) {
- timer().startCpuTimer();
+ //timer().startCpuTimer();
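+            // timer left disabled here so that compactWithScan can call scan() inside its own timed region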
// TODO
- timer().endCpuTimer();
+ int identity = 0;
+
+ odata[0] = identity;
+ for (int i = 1; i < n; i++) {
+ odata[i] = odata[i - 1] + idata[i - 1]; // exclusive scan
+ }
+
+ //timer().endCpuTimer();
}
/**
@@ -31,8 +38,22 @@ namespace StreamCompaction {
int compactWithoutScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
+ int nonZeroIdx = 0;
+ for (int i = 0; i < n; i++) {
+ if (idata[i] != 0) {
+ odata[nonZeroIdx] = idata[i];
+ nonZeroIdx++;
+ }
+ }
+
timer().endCpuTimer();
- return -1;
+ return nonZeroIdx;
+ }
+
+ void map(int n, int* odata, const int* idata) {
+ for (int i = 0; i < n; i++) {
+ odata[i] = (idata[i] == 0) ? 0 : 1;
+ }
}
/**
@@ -43,8 +64,23 @@ namespace StreamCompaction {
int compactWithScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
+
+ int* mapped = new int[n];
+ int* scanned = new int[n];
+ map(n, mapped, idata);
+ scan(n, scanned, mapped);
+ int count = 0;
+ for (int i = 0; i < n; i++) {
+ if (mapped[i] == 1) {
+ int index = scanned[i];
+ odata[index] = idata[i];
+ count++;
+ }
+ }
+ delete[] mapped;
+ delete[] scanned;
timer().endCpuTimer();
- return -1;
+ return count;
}
}
}
diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu
index 2db346e..983e6b2 100644
--- a/stream_compaction/efficient.cu
+++ b/stream_compaction/efficient.cu
@@ -2,7 +2,8 @@
#include <cuda_runtime.h>
#include "common.h"
#include "efficient.h"
-
+#include <cmath>
+#define blockSize 256
namespace StreamCompaction {
namespace Efficient {
using StreamCompaction::Common::PerformanceTimer;
@@ -12,15 +13,128 @@ namespace StreamCompaction {
return timer;
}
+ /*
+ * Kernel for parallel reduction with upstream scan
+ */
+ __global__ void kernUpSweepReduction(int n, int d, int offsetd, int offsetd1, int* data) {
+ int k = threadIdx.x + (blockIdx.x * blockDim.x);
+ if (k >= n) {
+ return;
+ }
+ if (k % offsetd1 == 0) {
+ data[k + offsetd1 - 1] = data[k + offsetd1 - 1] + data[k + offsetd - 1];
+ return;
+ }
+
+ //// Tried implementing optimized upsweep
+ //if (k < offsetd) {
+ // data[k] += data[k + offsetd];
+ //}
+ }
+
+ /*
+ * Kernel for collecting results with downsweep scan
+ */
+ __global__ void kernDownSweep(int n, int d, int offsetd, int offsetd1, int* data) {
+ int k = threadIdx.x + (blockIdx.x * blockDim.x);
+ if (k >= n) {
+ return;
+ }
+ if (k % offsetd1 != 0) {
+ return;
+ }
+ int t = data[k - 1 + offsetd]; // Save left child
+ data[k - 1 + offsetd] = data[k - 1 + offsetd1]; // Set left child to this node’s value
+ data[k - 1 + offsetd1] += t;
+ }
+
+ /*
+ * Kernel to parallelly map input data to 0 and 1 based on whether
+ * it meets criteria for stream compaction
+ */
+ __global__ void kernMap(int n, int* odata, const int* idata) {
+ int k = threadIdx.x + (blockIdx.x * blockDim.x);
+ if (k >= n) {
+ return;
+ }
+ odata[k] = (idata[k] == 0) ? 0 : 1;
+ }
+
+ /*
+ * Kernel to scatter
+ */
+ __global__ void kernScatter(int n, int* odata, const int* scandata, const int* criteria, const int* idata) {
+ int k = threadIdx.x + (blockIdx.x * blockDim.x);
+ if (k >= n) {
+ return;
+ }
+ if (criteria[k] == 1) {
+ odata[scandata[k]] = idata[k];
+ }
+ }
+
/**
* Performs prefix-sum (aka scan) on idata, storing the result into odata.
*/
- void scan(int n, int *odata, const int *idata) {
- timer().startGpuTimer();
+ void scan(int n, int* odata, const int* idata) {
+
// TODO
- timer().endGpuTimer();
+ int* dev_data;
+
+ // Extend buffers to handle arrays with lengths which are not a power of two
+ int maxDepth = ilog2ceil(n);
+ int extended_n = pow(2, maxDepth);
+
+ //dim3 gridSize(32, 32);
+ //dim3 blockSize(32, 32);
+
+ dim3 blocksPerGrid((extended_n + blockSize - 1) / blockSize);
+
+ // Memory allocation
+ cudaMalloc((void**)&dev_data, sizeof(int) * extended_n);
+ checkCUDAError("cudaMalloc dev_data failed!");
+ cudaMemset(dev_data, 0, sizeof(int) * extended_n);
+ checkCUDAError("cudaMemset dev_data initialization failed!");
+ cudaMemcpy(dev_data, idata, sizeof(int) * n, cudaMemcpyHostToDevice);
+ checkCUDAError("cudaMemcpy into dev_data failed!");
+
+ //timer().startGpuTimer();
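+            // timer left disabled here so that compact() can reuse scan() inside its own timed region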
+
+ // Upsweep - parallel reduction
+ for (int d = 0; d < maxDepth; d++) { // where d is depth of iteration
+ int offsetd1 = pow(2, d + 1);
+ int offsetd = pow(2, d);
+ //int offsetd = pow(2, maxDepth - d - 1);
+                kernUpSweepReduction<<<blocksPerGrid, blockSize>>>(extended_n, d, offsetd, offsetd1, dev_data);
+                checkCUDAError("kernUpSweepReduction invocation failed!");
+ }
+
+ // Set last element to identity value which is zero
+ cudaMemset(dev_data + extended_n - 1, 0, sizeof(int));
+ checkCUDAError("cudaMemset last value to identity failed!");
+
+ // Downsweep
+ for (int d = maxDepth - 1; d >= 0; d--) { // where d is depth of iteration
+ int offsetd1 = pow(2, d + 1);
+ int offsetd = pow(2, d);
+                kernDownSweep<<<blocksPerGrid, blockSize>>>(extended_n, d, offsetd, offsetd1, dev_data);
+                checkCUDAError("kernDownSweep invocation failed!");
+ }
+ //timer().endGpuTimer();
+
+ //// Getting parallel reduction sum which can be used to convert to inclusive scan
+ //int* lastVal = new int();
+ //cudaMemcpy(lastVal, dev_data + extended_n - 1, sizeof(int), cudaMemcpyDeviceToHost);
+ //checkCUDAError("lastVal memcpy failed!");
+
+ // Copy calculated buffer to output
+            cudaMemcpy(odata, dev_data, sizeof(int) * n, cudaMemcpyDeviceToHost);
+ checkCUDAError("odata memcpy failed!");
+
+ cudaFree(dev_data);
}
+
/**
* Performs stream compaction on idata, storing the result into odata.
* All zeroes are discarded.
@@ -30,11 +144,77 @@ namespace StreamCompaction {
* @param idata The array of elements to compact.
* @returns The number of elements remaining after compaction.
*/
- int compact(int n, int *odata, const int *idata) {
- timer().startGpuTimer();
+ int compact(int n, int* odata, const int* idata) {
+
// TODO
+
+ int* dev_idata;
+ int* dev_odata;
+ int* dev_criteria_buffer;
+ int* dev_scanned_buffer;
+
+ int maxDepth = ilog2ceil(n);
+ int extended_n = pow(2, maxDepth);
+
+ int* criteria_buffer = new int[extended_n];
+ int* scanned_buffer = new int[extended_n];
+
+ dim3 blocksPerGrid((extended_n + blockSize - 1) / blockSize);
+
+ // Memory allocation
+ cudaMalloc((void**)&dev_idata, sizeof(int) * extended_n);
+ checkCUDAError("cudaMalloc dev_idata failed!");
+ cudaMalloc((void**)&dev_criteria_buffer, sizeof(int) * extended_n);
+ checkCUDAError("cudaMalloc dev_criteria_buffer failed!");
+ cudaMalloc((void**)&dev_scanned_buffer, sizeof(int) * extended_n);
+ checkCUDAError("cudaMalloc dev_scanned_buffer failed!");
+
+ cudaMemset(dev_idata, 0, sizeof(int) * extended_n);
+ checkCUDAError("cudaMemset dev_idata initialization failed!");
+ cudaMemcpy(dev_idata, idata, sizeof(int) * n, cudaMemcpyHostToDevice);
+ checkCUDAError("cudaMemcpy into dev_idata failed!");
+
+ timer().startGpuTimer();
+ // Mapping as per criteria
+            kernMap<<<blocksPerGrid, blockSize>>>(extended_n, dev_criteria_buffer, dev_idata);
+ checkCUDAError("kernMap invocation failed!");
+
+ cudaMemcpy(criteria_buffer, dev_criteria_buffer, sizeof(int) * extended_n, cudaMemcpyDeviceToHost);
+ checkCUDAError("memcpy into criteria_buffer failed!");
+
+            // Scan the criteria buffer (round-tripping through the host) to generate the scanned buffer
+ scan(extended_n, scanned_buffer, criteria_buffer);
+ cudaMemcpy(dev_scanned_buffer, scanned_buffer, sizeof(int) * extended_n, cudaMemcpyHostToDevice);
+ checkCUDAError("memcpy into dev_scanned_buffer failed!");
+
+            // Number of kept elements: last exclusive-scan value plus the last
+            // criteria flag, which the exclusive scan does not include
+            int count = scanned_buffer[extended_n - 1] + criteria_buffer[extended_n - 1];
+
+            // Malloc for compacted output data, sized by the element count
+            cudaMalloc((void**)&dev_odata, sizeof(int) * count);
+            checkCUDAError("cudaMalloc dev_odata failed!");
+
+            // Initialize odata to 0
+            cudaMemset(dev_odata, 0, sizeof(int) * count);
+            checkCUDAError("cudaMemset dev_odata initialization failed!");
+
+ // Scatter data - insert input data at index obtained
+ // from scanned buffer if criteria is set to true
+            kernScatter<<<blocksPerGrid, blockSize>>>(n, dev_odata, dev_scanned_buffer, dev_criteria_buffer, dev_idata);
+            checkCUDAError("kernScatter invocation failed!");
+
timer().endGpuTimer();
- return -1;
+
+ // Copy calculated buffer to output
+            cudaMemcpy(odata, dev_odata, sizeof(int) * count, cudaMemcpyDeviceToHost);
+ checkCUDAError("odata memcpy failed!");
+
+ cudaFree(dev_scanned_buffer);
+ cudaFree(dev_criteria_buffer);
+ cudaFree(dev_idata);
+ cudaFree(dev_odata);
+            delete[] criteria_buffer;
+            delete[] scanned_buffer;
+
+            return count;
}
}
}
diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu
index 4308876..f258566 100644
--- a/stream_compaction/naive.cu
+++ b/stream_compaction/naive.cu
@@ -2,6 +2,8 @@
#include <cuda_runtime.h>
#include "common.h"
#include "naive.h"
+#include <cmath>
+#define blockSize 256
namespace StreamCompaction {
namespace Naive {
@@ -11,15 +13,94 @@ namespace StreamCompaction {
static PerformanceTimer timer;
return timer;
}
+
+ //// DEBUGGER TEST
+ //__global__ void kernTestDebugger(int param) {
+ // int index = threadIdx.x + (blockIdx.x * blockDim.x);
+ // index = 1;
+ // index = threadIdx.x + (blockIdx.x * blockDim.x);
+ // param = index;
+ //}
+
// TODO: __global__
+ __global__ void kernNaiveScan(int n, int d, int offset, int *odata, const int *idata) {
+ int k = threadIdx.x + blockIdx.x * blockDim.x;
+ if (k >= n) {
+ return;
+ }
+
+ if (k >= offset) {
+ odata[k] = idata[k - offset] + idata[k];
+ }
+ else {
+ odata[k] = idata[k];
+ }
+ }
+
+ __global__ void kernInclusiveToExclusive(int n, int* odata, const int* idata) {
+ // shift all elements right and keep 1st element as identity 0
+ int k = threadIdx.x + blockIdx.x * blockDim.x;
+ if (k >= n) {
+ return;
+ }
+ if (k == 0) {
+ odata[k] = 0;
+ }
+ else {
+ odata[k] = idata[k - 1];
+ }
+ }
+
/**
* Performs prefix-sum (aka scan) on idata, storing the result into odata.
*/
void scan(int n, int *odata, const int *idata) {
- timer().startGpuTimer();
+
+
+ //// DEBUGGER TEST
+ // int noOfBlocks = 1;
+ // dim3 blockSize(32, 32);
+            // kernTestDebugger<<<noOfBlocks, blockSize>>>(2);
+ //
+
// TODO
+ int* dev_buffer1;
+ int* dev_buffer2;
+
+ /*dim3 gridSize(32, 32);
+ dim3 blockSize(32, 32);*/
+
+ dim3 blocksPerGrid((n + blockSize - 1) / blockSize);
+
+ // Memory allocation
+ cudaMalloc((void**)&dev_buffer1, sizeof(int) * n);
+ checkCUDAError("cudaMalloc dev_buffer1 failed!");
+ cudaMalloc((void**)&dev_buffer2, sizeof(int) * n);
+ checkCUDAError("cudaMalloc dev_buffer2 failed!");
+ cudaMemcpy(dev_buffer1, idata, sizeof(int) * n, cudaMemcpyHostToDevice);
+ checkCUDAError("memcpy into dev_buffer1 failed!");
+
+
+
+ int maxDepth = ilog2ceil(n);
+
+ timer().startGpuTimer();
+ for (int d = 1; d <= maxDepth; d++) { // where d is depth of iteration
+ int offset = pow(2, d - 1);
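+                // ping-pong: read from dev_buffer1, write to dev_buffer2, then copy back for the next pass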
+                kernNaiveScan<<<blocksPerGrid, blockSize>>>(n, d, offset, dev_buffer2, dev_buffer1);
+ cudaMemcpy(dev_buffer1, dev_buffer2, sizeof(int) * n, cudaMemcpyDeviceToDevice);
+ }
+ // converting from inclusive to exclusive scan using same buffers
+            kernInclusiveToExclusive<<<blocksPerGrid, blockSize>>>(n, dev_buffer1, dev_buffer2);
timer().endGpuTimer();
+
+ cudaMemcpy(odata, dev_buffer1, sizeof(int) * (n), cudaMemcpyDeviceToHost);
+ checkCUDAError("memcpy into odata failed!");
+
+ cudaFree(dev_buffer1);
+ cudaFree(dev_buffer2);
+
}
}
}
diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu
index 1def45e..8b8fe3d 100644
--- a/stream_compaction/thrust.cu
+++ b/stream_compaction/thrust.cu
@@ -18,11 +18,29 @@ namespace StreamCompaction {
* Performs prefix-sum (aka scan) on idata, storing the result into odata.
*/
void scan(int n, int *odata, const int *idata) {
- timer().startGpuTimer();
+
// TODO use `thrust::exclusive_scan`
// example: for device_vectors dv_in and dv_out:
- // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
+            thrust::host_vector<int> host_in(n);
+            thrust::host_vector<int> host_out(n);
+
+ // Copy input into host vectors
+ for (int i = 0; i < n; ++i) {
+ host_in[i] = idata[i];
+ }
+
+ // Create device_vectors from host_vectors
+            thrust::device_vector<int> dev_in(host_in);
+            thrust::device_vector<int> dev_out(host_out);
+
+ timer().startGpuTimer();
+ thrust::exclusive_scan(dev_in.begin(), dev_in.end(), dev_out.begin());
timer().endGpuTimer();
+
+ // Write final results
+ for (int i = 0; i < n; i++) {
+ odata[i] = dev_out[i];
+ }
}
}
}