diff --git a/README.md b/README.md index b71c458..ffc62a0 100644 --- a/README.md +++ b/README.md @@ -1,13 +1,368 @@ -CUDA Stream Compaction -====================== +# **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2** -**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2** -* (TODO) YOUR NAME HERE -* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab) -### (TODO: Your README) -Include analysis, etc. (Remember, this is public, so don't put -anything here that you don't want to share with the world.) + +Tested on: Windows 10, Intel Core i7-7700HQ CPU @ 2.80 GHz, 8GB RAM, NVIDIA GeForce GTX 1050 + + ![Built](https://img.shields.io/appveyor/ci/gruntjs/grunt.svg) ![Issues](https://img.shields.io/github/issues-raw/badges/shields/website.svg) ![CUDA 8.0](https://img.shields.io/badge/CUDA-8.0-green.svg?style=flat) ![Platform](https://img.shields.io/badge/platform-Desktop-bcbcbc.svg) ![Developer](https://img.shields.io/badge/Developer-Youssef%20Victor-0f97ff.svg?style=flat) + + + + +- [To-do's](#to-dos) + + + +- [Analysis](#analysis) + + + +- [Radix Sort Analysis](#radix-sort-analysis) + +- [Observations](#observations) + +- [My Block Size Optimization](#my-block-size-optimization) + +- [Raw Test Output](#tests-raw-output) + + +- [Bloopers](#bloopers--lessons-learned) + + + + +____________________________________________________ + + + +The goal of this project was to implement stream compaction, an algorithm that clears out all zeros from an array, on the GPU using CUDA. The compaction is built on a scan algorithm that computes the exclusive prefix sum. I also implemented a parallelized Radix Sort using the exclusive prefix sum algorithm I developed. 
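To pin down the primitive everything below builds on, here is a minimal sequential sketch of an exclusive prefix sum. This is an illustration only; the function name `exclusiveScan` is mine, not the project's:

```cpp
#include <cstddef>
#include <vector>

// Exclusive prefix sum: out[k] holds the sum of in[0..k-1]; out[0] is 0.
std::vector<int> exclusiveScan(const std::vector<int>& in) {
    std::vector<int> out(in.size(), 0);
    for (std::size_t k = 1; k < in.size(); ++k) {
        out[k] = out[k - 1] + in[k - 1];  // single pass, O(n)
    }
    return out;
}
```

Run on the first few elements of the raw test input further down, `[15, 10, 42, 24]`, this yields `[0, 15, 25, 67]`, matching the test output.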
+ + + +### To-do's + + + - [x] **A CPU implementation of the compaction algorithm without scan** + + + - [x] **A CPU implementation of the compaction algorithm that uses scan** + + - [x] **An implementation of the Hillis & Steele parallel scan algorithm** + + - [x] **A work-efficient parallel scan algorithm that uses global memory** + + - [x] **Radix sort using previously developed algorithms** + + - [ ] **An even more efficient scan algorithm that uses shared memory** + + + + + +### Analysis: + +All tests were run on arrays ranging in size from 2^7 (128) all the way to 2^15 (32768). Since the array size doubles at each step, an algorithm running in linear O(n) time should produce a runtime graph that tracks the input growth, in other words an exponential-looking curve such as this: + +![O(n)](/img/correlation.PNG) + +#### Scan Algorithms + +My scan results didn't initially show anything interesting; they just seemed uniformly consistent. Here is what I got with all my scan algorithms, GPU and CPU, side-by-side: + +![All of them](/img/scan-with-thrust.PNG) + +You can see that there is a massive outlier: the Thrust scan with a power-of-two number of elements in the array. Since I didn't implement it, and it's skewing all the results and making them unreadable, let's take it out. We are left with this: + + +![All of them no thrust](/img/scan-without-thrust.PNG) + +We can still see some separation between the top results, which were GPU-based, and the lower, CPU-based results. Let's start with the latter. + +*CPU Runtimes:* + +![CPU Runtimes](/img/scan-cpu.PNG) + +Disregarding the fluke at the end, this algorithm is clearly more or less O(n). This makes perfect sense, as my CPU algorithms run through the array only once. So I've established a baseline to beat. 
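As a concrete picture of what these CPU baselines do, here is a minimal sketch of scan-based compaction (flag the non-zeros, exclusive-scan the flags, scatter). It mirrors the approach only; the name `compactWithScan` is mine, not the project code's:

```cpp
#include <cstddef>
#include <vector>

// Compact via scan: map each element to a 0/1 flag, exclusive-scan the
// flags to get each survivor's output index, then scatter the non-zeros.
std::vector<int> compactWithScan(const std::vector<int>& in) {
    std::size_t n = in.size();
    std::vector<int> flags(n), idx(n, 0);
    for (std::size_t i = 0; i < n; ++i) flags[i] = (in[i] != 0) ? 1 : 0;
    for (std::size_t k = 1; k < n; ++k) idx[k] = idx[k - 1] + flags[k - 1];  // exclusive scan

    std::size_t count = (n == 0) ? 0 : idx[n - 1] + flags[n - 1];  // survivors
    std::vector<int> out(count);
    for (std::size_t i = 0; i < n; ++i)
        if (flags[i]) out[idx[i]] = in[i];  // scatter survivors
    return out;
}
```

Each of the three passes touches the array once, which is why the CPU runtimes above grow roughly linearly with n.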
+ +Here is what it looks like with my GPU algorithms by themselves: + +*GPU Runtimes:* + +![GPU Runtimes](/img/scan-gpu.PNG) + + +Despite my GPU algorithms being slower than my CPU ones strictly in terms of runtime in milliseconds, it is clear that they scale much better; this is the benefit of parallelization. The trendlines (dashed lines) show a clear linear correlation against my exponentially growing input, which means my GPU algorithms win on scaling despite their bottleneck (more on that in the [observations](#observations)). + + +#### Compact Algorithms + +With the compact algorithms, the same trend continues. Here are the CPU algorithms: + +![CPU Compact Runtimes](/img/compact-cpu.PNG) + +and here is what the GPU algorithms look like: + +![GPU Compact Runtimes](/img/compact-gpu.PNG) + +Again, we see a linear GPU trendline and an exponential CPU trend. This is no surprise, as the compact algorithms are built on the scan algorithms. + +For reference, here is the combined CPU+GPU graph of compact algorithm runtimes: + +![All Compact Runtimes](/img/compact.PNG) + + +--------------------------------- + + +### Radix Sort Analysis + +It took me a while to understand radix sort, but I finally figured it out. Debugging was really annoying, since I didn't reuse any of the algorithms I found online: I found them too convoluted, and although they were optimized, they were all limited to small arrays. I wanted mine to scale up. + +#### How Parallel Radix Sort Works + +Understanding it took a while, but once I did, it was super easy to implement. + +*Highest Level Approach:* Basically, it works like this: for each bit, from the least significant to the most significant, sort based on whether that bit is a zero or a one. + +*Less High Approach:* The way you do this is by doing [key counting](https://en.wikipedia.org/wiki/Counting_sort). 
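A minimal sequential sketch of this key-counting split for a single bit may help before the step-by-step breakdown. It is an illustration under my own assumptions, not my CUDA code; the `hists0`/`offsets0` naming follows the arrays described in the steps:

```cpp
#include <cstddef>
#include <vector>

// One pass of LSD radix sort for a single bit: histogram the bit,
// exclusive-scan the zero/one flags into offsets, then scatter.
// Zeros keep their relative order; ones go after all the zeros.
std::vector<int> splitOnBit(const std::vector<int>& in, int bit) {
    std::size_t n = in.size();
    std::vector<int> hists0(n), offsets0(n, 0), offsets1(n, 0);
    for (std::size_t i = 0; i < n; ++i)
        hists0[i] = ((in[i] >> bit) & 1) ? 0 : 1;  // 1 if the bit is zero
    for (std::size_t k = 1; k < n; ++k) {          // exclusive scans
        offsets0[k] = offsets0[k - 1] + hists0[k - 1];
        offsets1[k] = offsets1[k - 1] + (1 - hists0[k - 1]);
    }
    int totalZeros = (n == 0) ? 0 : offsets0[n - 1] + hists0[n - 1];
    std::vector<int> out(n);
    for (std::size_t i = 0; i < n; ++i) {
        if (hists0[i]) out[offsets0[i]] = in[i];              // zeros first
        else           out[totalZeros + offsets1[i]] = in[i]; // then ones
    }
    return out;
}
```

On the example array `a = [7, 3, 4, 11, 10]` from the visualization, splitting on bit 0 moves the even numbers ahead of the odd ones while preserving order within each group.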
+ +- You start off by running an algorithm that tells you whether a given number has a `zero` or a `one` in that particular bit. This leaves you with your `histogram` array. For my purposes I created two: one for zeros (`hists0`) and another for ones (`hists1`). + +- The next step is to run an exclusive prefix scan (which we already implemented) on the histogram arrays to get the "`offset`" of a number. This tells you where that number will end up. Here, I also made two arrays: one for the `0` offsets and one for the `1` offsets. + +- The final step is to place your numbers in the final array based on their offsets. This is done by adding `hists0` + `offsets0` and using that as your index. In code this would look something like: `odata[hists0[i] + offsets0[i]] = idata[i]` + +Because this is fairly abstract, I made a visualization of the algorithm run on an array `a = [7, 3, 4, 11, 10]`. Feel free to use my explanation as a reference to help you understand. + +*Visualized:* + + +![This animation took longer to make than it did to code](/img/radix.gif) + +As far as runtimes go, they were actually closer to O(n), which is about what you'd expect given radix sort's linear runtime. However, this should be faster. My guess is that the slowdown comes from the global memory being accessed back and forth, which probably also accounts for the constant ~4ms overhead across all array sizes. Here are the runtimes: + +![Radix Sort Runtimes](/img/radix.PNG) + + +### My Block Size Optimization: + +I had a setup for an optimized run of this that I think would have helped a lot had I implemented shared memory. The setup is as follows: + +I have 96KB of shared memory per SM\*, which allows roughly 24K ints per SM. An optimally busy SM has 32 blocks, the maximum for an SM\*. To make things even better, you want each thread to read in only one integer. 
+ +If we call the memory we access per thread `x` and the maximum allowed blocks per SM `z`, then we want to optimize the number of threads per block, which we define as `y`. + +Put in equation form, we want: + + `x * y * z <= 96000` + +In this case `x = 4 bytes` and `z = 32 blocks`, so we can rewrite our equation as: + + `4 * y * 32 <= 96000` + +We want `y` to be a multiple of 32 to match our warp size, and we also want the product to stay a bit below `96000`, as each thread probably uses more than exactly 4 bytes of memory. I call this margin the "tone-down" value, and I found its optimal value by trial and error: `0.85f` works best. + + + `4 * y * 32 <= 96000*0.85f` + +The largest multiple of 32 that satisfies this is `608`, and so that is how I determine my block size\**. The number of blocks on the grid is simply `n/block_size`, rounded up. + +\* These numbers are based on my GTX 1050 (CC 6.1). I have them hardcoded into my code at the top of `common.h`. + +\** If an array is smaller than this "optimal block size", I just run everything on a single block, since we definitely have enough shared memory then. + +### Tests Raw Output: + +``` +**************** +** SCAN TESTS ** +**************** + [ 15 10 42 24 29 20 33 5 10 5 16 2 0 ... 42 0 ] +==== cpu scan, power-of-two ==== + elapsed time: 0.053971ms (std::chrono Measured) + [ 0 15 25 67 91 120 140 173 178 188 193 209 211 ... 802690 802732 ] +==== cpu scan, non-power-of-two ==== + elapsed time: 0.053606ms (std::chrono Measured) + [ 0 15 25 67 91 120 140 173 178 188 193 209 211 ... 
802656 802672 ] + passed +==== naive scan, power-of-two ==== + elapsed time: 0.338944ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + elapsed time: 0.300032ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + elapsed time: 0.44544ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + elapsed time: 0.493568ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + elapsed time: 12.0873ms (CUDA Measured) + passed +==== thrust scan, non-power-of-two ==== + elapsed time: 0.230208ms (CUDA Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** + [ 1 2 2 2 3 2 3 3 2 3 2 0 2 ... 2 0 ] +==== cpu compact without scan, power-of-two ==== + elapsed time: 0.087885ms (std::chrono Measured) + [ 1 2 2 2 3 2 3 3 2 3 2 2 3 ... 2 2 ] + passed +==== cpu compact without scan, non-power-of-two ==== + elapsed time: 0.086063ms (std::chrono Measured) + [ 1 2 2 2 3 2 3 3 2 3 2 2 3 ... 2 2 ] + passed +==== cpu compact with scan ==== + elapsed time: 0.158633ms (std::chrono Measured) + [ 1 2 2 2 3 2 3 3 2 3 2 2 3 ... 
2 2 ] + passed +==== work-efficient compact, power-of-two ==== + elapsed time: 0.42496ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + elapsed time: 0.41984ms (CUDA Measured) + passed + +***************************** +** MY OWN RADIX SORT TESTS ** +***************************** +==== radix basic, power-of-two ==== + + elapsed time: 4.00589ms (CUDA Measured) + + +==== radix basic, non power-of-two ==== + + elapsed time: 3.56966ms (CUDA Measured) + +==== radix massive, power-of-two ==== + + elapsed time: 4.9449ms (CUDA Measured) + + +==== radix massive, non power-of-two ==== + + elapsed time: 4.59571ms (CUDA Measured) + + +*********** WITH 128 SIZED ARRAY ********* +==== radix massive, power-of-two ==== + + elapsed time: 4.28851ms (CUDA Measured) + + +==== radix massive, non power-of-two ==== + + elapsed time: 4.08781ms (CUDA Measured) + + +*********** WITH 256 SIZED ARRAY ********* +==== radix massive, power-of-two ==== + + elapsed time: 4.35814ms (CUDA Measured) + + +==== radix massive, non power-of-two ==== + + elapsed time: 4.32435ms (CUDA Measured) + + +*********** WITH 512 SIZED ARRAY ********* +==== radix massive, power-of-two ==== + + elapsed time: 4.61926ms (CUDA Measured) + + +==== radix massive, non power-of-two ==== + + elapsed time: 4.46054ms (CUDA Measured) + + +*********** WITH 1024 SIZED ARRAY ********* +==== radix massive, power-of-two ==== + + elapsed time: 4.82816ms (CUDA Measured) + + +==== radix massive, non power-of-two ==== + + elapsed time: 4.66227ms (CUDA Measured) + + +*********** WITH 2048 SIZED ARRAY ********* +==== radix massive, power-of-two ==== + + elapsed time: 5.04115ms (CUDA Measured) + + +==== radix massive, non power-of-two ==== + + elapsed time: 4.8343ms (CUDA Measured) + + +*********** WITH 4096 SIZED ARRAY ********* +==== radix massive, power-of-two ==== + + elapsed time: 5.43437ms (CUDA Measured) + + +==== radix massive, non power-of-two ==== + + elapsed time: 4.87834ms (CUDA Measured) + + 
+*********** WITH 8192 SIZED ARRAY ********* +==== radix massive, power-of-two ==== + + elapsed time: 5.84192ms (CUDA Measured) + + +==== radix massive, non power-of-two ==== + + elapsed time: 5.13638ms (CUDA Measured) + + +*********** WITH 16384 SIZED ARRAY ********* +==== radix massive, power-of-two ==== + + elapsed time: 6.02829ms (CUDA Measured) + + +==== radix massive, non power-of-two ==== + + elapsed time: 5.45382ms (CUDA Measured) + + +*********** WITH 32768 SIZED ARRAY ********* +==== radix massive, power-of-two ==== + + elapsed time: 6.46963ms (CUDA Measured) + + +==== radix massive, non power-of-two ==== + + elapsed time: 5.96787ms (CUDA Measured) +``` + +### Observations: + +* Bottlenecks: + +My bottleneck was definitely memory access, which added constant lookup latency to every run. My runtimes support this: all of my GPU code scaled linearly against an exponential increase in input, which means a constant delay dominates the time it takes to run the algorithm. That delay is the memory cycle: each global memory access took ~200 cycles, as opposed to ~2 cycles for shared memory and 1 or 2 for the CPU. This is why the GPU runtimes are ~100-200 times slower, even though they scale up linearly. + + + +### Bloopers / Lessons Learned + +Bloopers weren't that interesting this assignment, so here's another thing: + +| **Lesson** | **Cost of Learning** | +|---------------------------------------------------------------------------------------------------------------------------|-----------------------| +| Apparently CUDA doesn't always tell you if you pass in a CPU array by accident | 7 hours | +| If my GPU array is larger than "n", my algorithm will access things past n, despite any amount of checks I put in place. | 2 hours | +| When iterating radix sort from 0 to the MSB, remember off-by-one errors, please. 
| 3 hours maybe I dunno | diff --git a/img/compact-cpu.PNG b/img/compact-cpu.PNG new file mode 100644 index 0000000..fff9002 Binary files /dev/null and b/img/compact-cpu.PNG differ diff --git a/img/compact-gpu.PNG b/img/compact-gpu.PNG new file mode 100644 index 0000000..ea79600 Binary files /dev/null and b/img/compact-gpu.PNG differ diff --git a/img/compact.PNG b/img/compact.PNG new file mode 100644 index 0000000..aabfc62 Binary files /dev/null and b/img/compact.PNG differ diff --git a/img/correlation.PNG b/img/correlation.PNG new file mode 100644 index 0000000..03d418b Binary files /dev/null and b/img/correlation.PNG differ diff --git a/img/radix.PNG b/img/radix.PNG new file mode 100644 index 0000000..56e2259 Binary files /dev/null and b/img/radix.PNG differ diff --git a/img/radix.gif b/img/radix.gif new file mode 100644 index 0000000..2c14d98 Binary files /dev/null and b/img/radix.gif differ diff --git a/img/scan-cpu.PNG b/img/scan-cpu.PNG new file mode 100644 index 0000000..b417ab5 Binary files /dev/null and b/img/scan-cpu.PNG differ diff --git a/img/scan-gpu.PNG b/img/scan-gpu.PNG new file mode 100644 index 0000000..d5bd36f Binary files /dev/null and b/img/scan-gpu.PNG differ diff --git a/img/scan-with-thrust.PNG b/img/scan-with-thrust.PNG new file mode 100644 index 0000000..a637d2a Binary files /dev/null and b/img/scan-with-thrust.PNG differ diff --git a/img/scan-without-thrust.PNG b/img/scan-without-thrust.PNG new file mode 100644 index 0000000..5a67277 Binary files /dev/null and b/img/scan-without-thrust.PNG differ diff --git a/src/main.cpp b/src/main.cpp index 7305641..3f3eb73 100644 --- a/src/main.cpp +++ b/src/main.cpp @@ -11,12 +11,15 @@ #include #include #include +#include #include "testing_helpers.hpp" -const int SIZE = 1 << 8; // feel free to change the size of array +const int SIZE = 1 << 15; // feel free to change the size of array const int NPOT = SIZE - 3; // Non-Power-Of-Two int a[SIZE], b[SIZE], c[SIZE]; +const int const sizes[10] = { 
7,8,9,10,11,12,13,14,15,16 }; + int main(int argc, char* argv[]) { // Scan tests @@ -139,5 +142,78 @@ int main(int argc, char* argv[]) { //printArray(count, c, true); printCmpLenResult(count, expectedNPOT, b, c); + + printf("\n"); + printf("*****************************\n"); + printf("** MY OWN RADIX SORT TESTS **\n"); + printf("*****************************\n"); + printDesc("radix basic, power-of-two"); + const int COUNT_BASIC = 16; + int basic[COUNT_BASIC]; + int basic_out[COUNT_BASIC]; + genArray(COUNT_BASIC, basic, 78); + StreamCompaction::Radix::sort(COUNT_BASIC, basic_out, basic); + //printCPUArray(COUNT_BASIC, basic_out); + printf("\n"); + printElapsedTime(StreamCompaction::Radix::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)\n\n"); + + printDesc("radix basic, non power-of-two"); + const int COUNT_BASIC2 = 18; + int basic2[COUNT_BASIC2]; + int basic_out2[COUNT_BASIC2]; + genArray(COUNT_BASIC2, basic2, 78); + StreamCompaction::Radix::sort(COUNT_BASIC2, basic_out2, basic2); + //printCPUArray(COUNT_BASIC2, basic_out2); + printf("\n"); + printElapsedTime(StreamCompaction::Radix::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)\n"); + + + printDesc("radix massive, power-of-two"); + const int COUNT_MASSIVE = 1 << 10; + int massive[COUNT_MASSIVE]; + int massive_out[COUNT_MASSIVE]; + genArray(COUNT_MASSIVE, massive, 78); + StreamCompaction::Radix::sort(COUNT_MASSIVE, massive_out, massive); + //printCPUArray(COUNT_MASSIVE, massive_out); + printf("\n"); + printElapsedTime(StreamCompaction::Radix::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)\n\n"); + + printDesc("radix massive, non power-of-two"); + const int COUNT_MASSIVE2 = (1 << 10) - 6; + int massive2[COUNT_MASSIVE2]; + int massive_out2[COUNT_MASSIVE2]; + genArray(COUNT_MASSIVE2, massive2, 78); + StreamCompaction::Radix::sort(COUNT_MASSIVE2, massive_out2, massive2); + //printCPUArray(COUNT_MASSIVE2, massive_out2); + printf("\n"); + 
printElapsedTime(StreamCompaction::Radix::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)\n"); + + + + for (int i = 7; i < 16; i++) { + printf("\n*********** WITH %d SIZED ARRAY *********\n", 1 << i); + const int COUNT_MASSIVE = 1 << i; + printDesc("radix massive, power-of-two"); + + int* massive = (int*) malloc(sizeof(int) * COUNT_MASSIVE); + int* massive_out = (int*) malloc(sizeof(int) * COUNT_MASSIVE); + genArray(COUNT_MASSIVE, massive, 78); + StreamCompaction::Radix::sort(COUNT_MASSIVE, massive_out, massive); + //printCPUArray(COUNT_MASSIVE, massive_out); + printf("\n"); + printElapsedTime(StreamCompaction::Radix::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)\n\n"); + + printDesc("radix massive, non power-of-two"); + const int COUNT_MASSIVE2 = (1 << i) - 6; + int* massive2 = (int*)malloc(sizeof(int) * COUNT_MASSIVE2); + int* massive_out2 = (int*)malloc(sizeof(int) * COUNT_MASSIVE2); + genArray(COUNT_MASSIVE2, massive2, 78); + StreamCompaction::Radix::sort(COUNT_MASSIVE2, massive_out2, massive2); + //printCPUArray(COUNT_MASSIVE2, massive_out2); + printf("\n"); + printElapsedTime(StreamCompaction::Radix::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)\n"); + + } + system("pause"); // stop Win32 console from closing on exit } diff --git a/stream_compaction/CMakeLists.txt b/stream_compaction/CMakeLists.txt index cdbef77..bcc484e 100644 --- a/stream_compaction/CMakeLists.txt +++ b/stream_compaction/CMakeLists.txt @@ -9,6 +9,8 @@ set(SOURCE_FILES "efficient.cu" "thrust.h" "thrust.cu" + "radix.h" + "radix.cu" ) cuda_add_library(stream_compaction diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu index 8fc0211..c80ec33 100644 --- a/stream_compaction/common.cu +++ b/stream_compaction/common.cu @@ -1,5 +1,6 @@ #include "common.h" + void checkCUDAErrorFn(const char *msg, const char *file, int line) { cudaError_t err = cudaGetLastError(); if (cudaSuccess == err) { @@ -23,7 +24,10 @@ namespace 
StreamCompaction { * which map to 0 will be removed, and elements which map to 1 will be kept. */ __global__ void kernMapToBoolean(int n, int *bools, const int *idata) { - // TODO + int i = (threadIdx.x + (blockIdx.x * blockDim.x)); + if (i >= n) { return; } + + bools[i] = (int)(idata[i] != 0); } /** @@ -32,7 +36,12 @@ namespace StreamCompaction { */ __global__ void kernScatter(int n, int *odata, const int *idata, const int *bools, const int *indices) { - // TODO + int i = (threadIdx.x + (blockIdx.x * blockDim.x)); + if (i >= n) { return; } + + if (bools[i] == 1) { + odata[indices[i]] = idata[i]; + } } } diff --git a/stream_compaction/common.h b/stream_compaction/common.h index 55f1b38..9e20604 100644 --- a/stream_compaction/common.h +++ b/stream_compaction/common.h @@ -1,23 +1,84 @@ #pragma once -#include -#include - -#include -#include -#include -#include +#include +#include + +#include +#include +#include +#include #include -#include +#include #define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__) #define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__) +//All of the following calculations were done for my GTX 1050 and +//they will probably not work with most other non-6.1CC GPUs + +// "x" in my master equation +#define MEM_PER_THREAD 4 + +// "z" in my master equation +#define OPTIMAL_BLOCKS_PER_SM 32 + +// Inequality/RHS in my master equation +#define SHARED_MEM_MAX 96000 + +// A muting term that tones down the optimality +#define TONE_DOWN 0.85f + +//The number of threads per warp on this GPU +#define WARP_SIZE 32 + /** * Check for CUDA errors; print and exit if there was a problem. 
*/ void checkCUDAErrorFn(const char *msg, const char *file = NULL, int line = -1); +inline int binary(int num) +{ + if (num == 0) + { + return 0; + } + else + { + return (num % 2) + 10 * binary(num / 2); + } +} + +inline void printCPUArrayb(int n, int* arr) { + printf("\n( "); + for (int i = 0; i < n - 1; i++) { + printf("%d, ", binary(arr[i])); + } + printf("%d)", binary(arr[n - 1])); +} + +inline void printGPUArrayb(int n, int* dev_arr) { + int* cpu_arr = (int*)malloc(sizeof(int) * n); + cudaMemcpy(cpu_arr, dev_arr, sizeof(int)* n, cudaMemcpyDeviceToHost); + printCPUArrayb(n, cpu_arr); + free(cpu_arr); +} + + +inline void printCPUArray(int n, int* arr) { + printf("\n( "); + for (int i = 0; i < n-1; i++) { + printf("%d, ", arr[i]); + } + printf("%d)", arr[n-1]); +} + +inline void printGPUArray(int n, int* dev_arr) { + int* cpu_arr = (int*)malloc(sizeof(int) * n); + cudaMemcpy(cpu_arr, dev_arr, sizeof(int)* n, cudaMemcpyDeviceToHost); + printCPUArray(n, cpu_arr); + free(cpu_arr); +} + inline int ilog2(int x) { int lg = 0; while (x >>= 1) { @@ -26,6 +87,19 @@ inline int ilog2(int x) { return lg; } +inline int getThreadsPerBlock() { + //Get theoretical best y value you can based on GPU specs + //TODO: Find a better way to get the specs + int theoretical_y = SHARED_MEM_MAX * TONE_DOWN / (MEM_PER_THREAD * OPTIMAL_BLOCKS_PER_SM); + + //Find closest multiple of 32 to theoretical_y + for (int i = 0; i < SHARED_MEM_MAX / 32; i++) { + if (i * 32 >= theoretical_y) { + return (i - 1) * 32; + } + } +} + inline int ilog2ceil(int x) { return ilog2(x - 1) + 1; } @@ -37,96 +111,96 @@ namespace StreamCompaction { __global__ void kernScatter(int n, int *odata, const int *idata, const int *bools, const int *indices); - /** - * This class is used for timing the performance - * Uncopyable and unmovable - * - * Adapted from WindyDarian(https://github.com/WindyDarian) - */ - class PerformanceTimer - { - public: - PerformanceTimer() - { - cudaEventCreate(&event_start); - 
cudaEventCreate(&event_end); - } - - ~PerformanceTimer() - { - cudaEventDestroy(event_start); - cudaEventDestroy(event_end); - } - - void startCpuTimer() - { - if (cpu_timer_started) { throw std::runtime_error("CPU timer already started"); } - cpu_timer_started = true; - - time_start_cpu = std::chrono::high_resolution_clock::now(); - } - - void endCpuTimer() - { - time_end_cpu = std::chrono::high_resolution_clock::now(); - - if (!cpu_timer_started) { throw std::runtime_error("CPU timer not started"); } - - std::chrono::duration duro = time_end_cpu - time_start_cpu; - prev_elapsed_time_cpu_milliseconds = - static_cast(duro.count()); - - cpu_timer_started = false; - } - - void startGpuTimer() - { - if (gpu_timer_started) { throw std::runtime_error("GPU timer already started"); } - gpu_timer_started = true; - - cudaEventRecord(event_start); - } - - void endGpuTimer() - { - cudaEventRecord(event_end); - cudaEventSynchronize(event_end); - - if (!gpu_timer_started) { throw std::runtime_error("GPU timer not started"); } - - cudaEventElapsedTime(&prev_elapsed_time_gpu_milliseconds, event_start, event_end); - gpu_timer_started = false; - } - - float getCpuElapsedTimeForPreviousOperation() //noexcept //(damn I need VS 2015 - { - return prev_elapsed_time_cpu_milliseconds; - } - - float getGpuElapsedTimeForPreviousOperation() //noexcept - { - return prev_elapsed_time_gpu_milliseconds; - } - - // remove copy and move functions - PerformanceTimer(const PerformanceTimer&) = delete; - PerformanceTimer(PerformanceTimer&&) = delete; - PerformanceTimer& operator=(const PerformanceTimer&) = delete; - PerformanceTimer& operator=(PerformanceTimer&&) = delete; - - private: - cudaEvent_t event_start = nullptr; - cudaEvent_t event_end = nullptr; - - using time_point_t = std::chrono::high_resolution_clock::time_point; - time_point_t time_start_cpu; - time_point_t time_end_cpu; - - bool cpu_timer_started = false; - bool gpu_timer_started = false; - - float prev_elapsed_time_cpu_milliseconds 
= 0.f; - float prev_elapsed_time_gpu_milliseconds = 0.f; + /** + * This class is used for timing the performance + * Uncopyable and unmovable + * + * Adapted from WindyDarian(https://github.com/WindyDarian) + */ + class PerformanceTimer + { + public: + PerformanceTimer() + { + cudaEventCreate(&event_start); + cudaEventCreate(&event_end); + } + + ~PerformanceTimer() + { + cudaEventDestroy(event_start); + cudaEventDestroy(event_end); + } + + void startCpuTimer() + { + if (cpu_timer_started) { throw std::runtime_error("CPU timer already started"); } + cpu_timer_started = true; + + time_start_cpu = std::chrono::high_resolution_clock::now(); + } + + void endCpuTimer() + { + time_end_cpu = std::chrono::high_resolution_clock::now(); + + if (!cpu_timer_started) { throw std::runtime_error("CPU timer not started"); } + + std::chrono::duration duro = time_end_cpu - time_start_cpu; + prev_elapsed_time_cpu_milliseconds = + static_cast(duro.count()); + + cpu_timer_started = false; + } + + void startGpuTimer() + { + if (gpu_timer_started) { throw std::runtime_error("GPU timer already started"); } + gpu_timer_started = true; + + cudaEventRecord(event_start); + } + + void endGpuTimer() + { + cudaEventRecord(event_end); + cudaEventSynchronize(event_end); + + if (!gpu_timer_started) { throw std::runtime_error("GPU timer not started"); } + + cudaEventElapsedTime(&prev_elapsed_time_gpu_milliseconds, event_start, event_end); + gpu_timer_started = false; + } + + float getCpuElapsedTimeForPreviousOperation() //noexcept //(damn I need VS 2015 + { + return prev_elapsed_time_cpu_milliseconds; + } + + float getGpuElapsedTimeForPreviousOperation() //noexcept + { + return prev_elapsed_time_gpu_milliseconds; + } + + // remove copy and move functions + PerformanceTimer(const PerformanceTimer&) = delete; + PerformanceTimer(PerformanceTimer&&) = delete; + PerformanceTimer& operator=(const PerformanceTimer&) = delete; + PerformanceTimer& operator=(PerformanceTimer&&) = delete; + + private: + 
cudaEvent_t event_start = nullptr; + cudaEvent_t event_end = nullptr; + + using time_point_t = std::chrono::high_resolution_clock::time_point; + time_point_t time_start_cpu; + time_point_t time_end_cpu; + + bool cpu_timer_started = false; + bool gpu_timer_started = false; + + float prev_elapsed_time_cpu_milliseconds = 0.f; + float prev_elapsed_time_gpu_milliseconds = 0.f; }; } } diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu index 05ce667..52a0910 100644 --- a/stream_compaction/cpu.cu +++ b/stream_compaction/cpu.cu @@ -1,15 +1,16 @@ #include #include "cpu.h" +#include -#include "common.h" +#include "common.h" namespace StreamCompaction { namespace CPU { - using StreamCompaction::Common::PerformanceTimer; - PerformanceTimer& timer() - { - static PerformanceTimer timer; - return timer; + using StreamCompaction::Common::PerformanceTimer; + PerformanceTimer& timer() + { + static PerformanceTimer timer; + return timer; } /** @@ -17,10 +18,63 @@ namespace StreamCompaction { * For performance analysis, this is supposed to be a simple for loop. * (Optional) For better understanding before starting moving to GPU, you can simulate your GPU scan in this function first. */ - void scan(int n, int *odata, const int *idata) { - timer().startCpuTimer(); - // TODO - timer().endCpuTimer(); + void scan(int n, int *odata, const int *idata) { + try { + timer().startCpuTimer(); + } + catch (...){}; + /* FAILED ATTEMPT AT GPU WAY + //Copy idata into odata + for (int i = 0; i < n; i++) { + odata[i] = idata[i]; + } + + // Create Constants + const int logN = ilog2ceil(n); + + printf("LOG N IS : %d \n", logN); + + //Up-Sweep + for (int d = 0; d < logN; d++) { + for (int k = 0; k < n; k += (int)pow(2, d + 1)) { + odata[k + (int)pow(2, d + 1) - 1] += odata[(k + (int)pow(2, d) - 1)]; + } + } + + printf("UPSWEPT: \n ("); + for (int i = 0; i < 15; i++) { + printf("%d, ", odata[i]); + } + printf(".... 
)\n"); + + //Down-Sweep + odata[n - 1] = 0; + for (int d = logN-1; d >= 0; d--) { + for (int k = 0; k < n; k += (int)pow(2, d + 1)) { + if ((k + (int)pow(2, d+1) - 1) < n) { + int t = odata[(k + (int)pow(2, d) - 1)]; + odata[k + (int)pow(2, d) - 1] = odata[k + (int)pow(2, d + 1) - 1]; + odata[k + (int)pow(2, d + 1) - 1] += t; + } + } + } + + printf("SUMMED: \n ("); + for (int i = 0; i < 15; i++) { + printf("%d, ", odata[i]); + } + printf(".... )\n"); + + */ + odata[0] = 0; + for (int k = 1; k < n; k++) { + odata[k] = idata[k-1] + odata[k-1]; + } + + try { + timer().endCpuTimer(); + } + catch (...) {}; } /** @@ -30,9 +84,15 @@ namespace StreamCompaction { */ int compactWithoutScan(int n, int *odata, const int *idata) { timer().startCpuTimer(); - // TODO + int count = 0; + for (int i = 0; i < n; i++) { + if (idata[i] != 0) { + odata[count++] = idata[i]; + } + } timer().endCpuTimer(); - return -1; + + return count; } /** @@ -42,9 +102,29 @@ namespace StreamCompaction { */ int compactWithScan(int n, int *odata, const int *idata) { timer().startCpuTimer(); - // TODO - timer().endCpuTimer(); - return -1; + int count = 0; + int* temp = (int*) malloc(sizeof(int) * n); + int* scanned = (int*) malloc(sizeof(int) * n); + + for (int i = 0; i < n; i++) { + temp[i] = (int) (idata[i] != 0); + count = idata[i] != 0 ? count + 1 : count; + } + + scan(n, scanned, temp); + + for (int i = 0; i < n; i++) { + if (temp[i] == 1) { + odata[scanned[i]] = idata[i]; + } + } + + try { + timer().endCpuTimer(); + } + catch (...) 
{}; + + return count; } } } diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu index 36c5ef2..b86c237 100644 --- a/stream_compaction/efficient.cu +++ b/stream_compaction/efficient.cu @@ -5,20 +5,116 @@ namespace StreamCompaction { namespace Efficient { - using StreamCompaction::Common::PerformanceTimer; - PerformanceTimer& timer() - { - static PerformanceTimer timer; - return timer; + using StreamCompaction::Common::PerformanceTimer; + PerformanceTimer& timer() + { + static PerformanceTimer timer; + return timer; } + __global__ void kern_upSweep(int n, int d, int* idata) { + int index = (threadIdx.x + (blockIdx.x * blockDim.x)); + int k = index * (1 << d + 1); + if (index >= n || k >= n) { return; } + + idata[k + (1 << d+1) - 1] += idata[k + (1 << d) - 1]; + } + + __global__ void kern_downSweep(int n, int d, int* idata) { + int k = (threadIdx.x + (blockIdx.x * blockDim.x)) * (1 << d + 1); + if (k >= n) { return; } + + int t = idata[k + (1 << d) - 1]; + idata[k + (1 << d) - 1] = idata[k + (1 << d+1) - 1]; + idata[k + (1 << d+1) - 1] += t; + } + + __global__ void roundN(int n, int nRounded, int* idataRounded, const int* idata) { + int i = (threadIdx.x + (blockIdx.x * blockDim.x)); + if (i >= nRounded) { return; } + + idataRounded[i] = i >= n ? 0 : idata[i]; + } + + __global__ void kern_scan_shared(int n, int *odata, const int *idata) { + + extern __shared__ float temp[]; + + + + } + + /** + * Performs prefix-sum (aka scan) on idata, storing the result into odata. + * + ***** THIS IS AN EFFICIENT VERSION USING SHARED MEMORY + */ + void scan_shared(int n, int *odata, const int *idata) { + // Super Hyperthreaded Information Transloading calculation for threads per block + dim3 threadsPerBlock(std::min(getThreadsPerBlock(), n)); + dim3 numBlocks(std::ceilf(((float)n / threadsPerBlock.x))); + + + } + + /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. 
         */
         void scan(int n, int *odata, const int *idata) {
-            timer().startGpuTimer();
-            // TODO
-            timer().endGpuTimer();
+            //Clamp threads per block to n and compute the grid size
+            dim3 threadsPerBlock(std::min(getThreadsPerBlock(), n));
+
+            //Round Up
+            int loops = ilog2ceil(n);
+            int nRounded = 1 << loops;
+            //The sweeps and the padding kernel work on nRounded elements,
+            //so size the grid for nRounded rather than n
+            dim3 numBlocksRounded(std::ceil((float)nRounded / threadsPerBlock.x));
+
+            // A copy of idata on the GPU.
+            // cudaMemcpyDefault lets the runtime infer the direction, since
+            // compact() and Radix::sort() call scan() with device pointers.
+            int* idata_GPU;
+            cudaMalloc((void**)&idata_GPU, sizeof(int) * n);
+            cudaMemcpy(idata_GPU, idata, sizeof(int) * n, cudaMemcpyDefault);
+
+            try { timer().startGpuTimer(); }
+            catch (...) {};
+
+            //Rounded Version of GPU Copy
+            int* idataRounded_GPU;
+            cudaMalloc((void**)&idataRounded_GPU, sizeof(int) * nRounded);
+            //Round the GPU Array
+            roundN<<<numBlocksRounded, threadsPerBlock>>>(n, nRounded, idataRounded_GPU, idata_GPU);
+
+            //Up-Sweep:
+            for (int d = 0; d < loops; d++) {
+                kern_upSweep<<<numBlocksRounded, threadsPerBlock>>>(nRounded, d, idataRounded_GPU);
+                checkCUDAErrorFn("upSweep failed with error");
+            }
+
+            //Set the root to zero before the down-sweep
+            int zero = 0;
+            cudaMemcpy(&idataRounded_GPU[nRounded - 1], &zero, sizeof(int), cudaMemcpyHostToDevice);
+            checkCUDAErrorFn("Zero Copy failed with error");
+
+            //Down-Sweep:
+            for (int d = loops - 1; d >= 0; d--) {
+                kern_downSweep<<<numBlocksRounded, threadsPerBlock>>>(nRounded, d, idataRounded_GPU);
+                checkCUDAErrorFn("downSweep failed with error");
+            }
+
+            cudaMemcpy(odata, idataRounded_GPU, sizeof(int) * n, cudaMemcpyDefault);
+
+            //Free Malloc'd
+            cudaFree(idataRounded_GPU);
+            cudaFree(idata_GPU);
+            /**** PRINTER ******
+            printf("After DownSweep: \n (");
+            for (int i = nRounded-10; i < nRounded -1; i++) {
+                printf("%d = %d, ", i, odata[i]);
+            }
+            printf("%d = %d) \n\n", nRounded-1, odata[nRounded-1]);
+            **/
+            try { timer().endGpuTimer(); }
+            catch (...) {};
         }
 
         /**
@@ -31,10 +127,50 @@ namespace StreamCompaction {
         * @returns The number of elements remaining after compaction.
         */
         int compact(int n, int *odata, const int *idata) {
-            timer().startGpuTimer();
-            // TODO
-            timer().endGpuTimer();
-            return -1;
+            try { timer().startGpuTimer(); }
+            catch (...) {};
+            //Clamp threads per block to n and compute the grid size
+            dim3 threadsPerBlock(std::min(getThreadsPerBlock(), n));
+            dim3 numBlocks(std::ceil((float)n / threadsPerBlock.x));
+
+            // Create Buffers
+            int *temp, *scanned, *idata_GPU, *odata_GPU;
+            cudaMalloc((void**)&temp     , sizeof(int) * n);
+            cudaMalloc((void**)&scanned  , sizeof(int) * n);
+            cudaMalloc((void**)&idata_GPU, sizeof(int) * n);
+            cudaMalloc((void**)&odata_GPU, sizeof(int) * n);
+
+            cudaMemcpy(idata_GPU, idata, sizeof(int) * n, cudaMemcpyHostToDevice);
+            checkCUDAErrorFn("idata memcpy failed with error");
+
+            //Create Temp Array of 0/1 flags
+            Common::kernMapToBoolean<<<numBlocks, threadsPerBlock>>>(n, temp, idata_GPU);
+            checkCUDAErrorFn("kern_boolify failed with error");
+
+            //Scan (temp and scanned are device pointers)
+            scan(n, scanned, temp);
+
+            //Temporarily store "scanned" into odata to get count
+            cudaMemcpy(odata, scanned, sizeof(int) * n, cudaMemcpyDeviceToHost);
+            int count = odata[n - 1] + (int)(idata[n - 1] != 0);
+
+            //Compact
+            Common::kernScatter<<<numBlocks, threadsPerBlock>>>(n, odata_GPU, idata_GPU, temp, scanned);
+            checkCUDAErrorFn("kern_compact failed with error");
+
+            //Bring Back to CPU
+            cudaMemcpy(odata, odata_GPU, sizeof(int) * count, cudaMemcpyDeviceToHost);
+
+            //Free Up All Malloc'd
+            cudaFree(temp);
+            cudaFree(scanned);
+            cudaFree(idata_GPU);
+            cudaFree(odata_GPU);
+
+
+            try { timer().endGpuTimer(); }
+            catch (...)
+            {};
+            return count;
         }
     }
 }
diff --git a/stream_compaction/efficient.h b/stream_compaction/efficient.h
index 803cb4f..7ff29eb 100644
--- a/stream_compaction/efficient.h
+++ b/stream_compaction/efficient.h
@@ -6,6 +6,8 @@ namespace StreamCompaction {
     namespace Efficient {
         StreamCompaction::Common::PerformanceTimer& timer();
 
+        void scan_shared(int n, int *odata, const int *idata);
+
         void scan(int n, int *odata, const int *idata);
 
         int compact(int n, int *odata, const int *idata);
diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu
index 9218f8e..bcf4851 100644
--- a/stream_compaction/naive.cu
+++ b/stream_compaction/naive.cu
@@ -5,21 +5,68 @@ namespace StreamCompaction {
     namespace Naive {
-        using StreamCompaction::Common::PerformanceTimer;
-        PerformanceTimer& timer()
-        {
-            static PerformanceTimer timer;
-            return timer;
+        using StreamCompaction::Common::PerformanceTimer;
+        PerformanceTimer& timer()
+        {
+            static PerformanceTimer timer;
+            return timer;
         }
 
+        __global__ void kern_scan(int n, int d, int* odata, const int* idata) {
+            int k = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (k >= n) { return; }
+
+            int two_powd = 1 << (d - 1);
+            //Using ternary as recommended in lecture
+            odata[k] = (k >= two_powd) ? idata[k - two_powd] + idata[k] : idata[k];
+        }
+
+        __global__ void kern_shiftRight(int n, int* odata, const int* idata) {
+            int k = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (k >= n) { return; }
+
+            //Using ternary as recommended in lecture
+            odata[k] = (k == 0) ? 0 : idata[k - 1];
+        }
 
         /**
         * Performs prefix-sum (aka scan) on idata, storing the result into odata.
         */
         void scan(int n, int *odata, const int *idata) {
             timer().startGpuTimer();
-            // TODO
-            timer().endGpuTimer();
+            //Clamp threads per block to n and compute the grid size
+            dim3 threadsPerBlock(std::min(getThreadsPerBlock(), n));
+            dim3 numBlocks(std::ceil((float)n / threadsPerBlock.x));
+
+            //New GPU Buffers
+            int* idata_GPU, *odata_GPU;
+            cudaMalloc((void**)&idata_GPU, sizeof(int) * n);
+            cudaMalloc((void**)&odata_GPU, sizeof(int) * n);
+            cudaMemcpy(idata_GPU, idata, sizeof(int) * n, cudaMemcpyHostToDevice);
+
+            for (int d = 1; d <= ilog2ceil(n); d++) {
+                kern_scan<<<numBlocks, threadsPerBlock>>>(n, d, odata_GPU, idata_GPU);
+                checkCUDAErrorFn("kern_scan failed with error");
+                std::swap(idata_GPU, odata_GPU);
+            }
+            //After the final swap the inclusive scan lives in idata_GPU;
+            //shift right to convert it to an exclusive scan
+            kern_shiftRight<<<numBlocks, threadsPerBlock>>>(n, odata_GPU, idata_GPU);
+
+            cudaMemcpy(odata, odata_GPU, sizeof(int) * n, cudaMemcpyDeviceToHost);
+            timer().endGpuTimer();
+
+            cudaFree(idata_GPU);
+            cudaFree(odata_GPU);
+
+            //PRINTER
+            /*
+            printf("After: \n ( ");
+            for (int i = 0; i < 15; i++) {
+                printf("%d, ", odata[i]);
+            }
+            printf(" ...
) \n\n");
+            */
         }
     }
 }
diff --git a/stream_compaction/radix.cu b/stream_compaction/radix.cu
new file mode 100644
index 0000000..d9954a3
--- /dev/null
+++ b/stream_compaction/radix.cu
@@ -0,0 +1,142 @@
+#include <cuda.h>
+#include <cuda_runtime.h>
+#include "common.h"
+#include "radix.h"
+#include "efficient.h"
+
+
+namespace StreamCompaction {
+    namespace Radix {
+        using StreamCompaction::Common::PerformanceTimer;
+        PerformanceTimer& timer()
+        {
+            static PerformanceTimer timer;
+            return timer;
+        }
+
+        __global__ void kern_initializeHistogram(int n, int* hists) {
+            int i = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (i >= n) { return; }
+
+            hists[i] = 0;
+        }
+
+
+        __global__ void kern_generateHistogram(int n, int bit, int desired_bit, const int* idata, int* hists) {
+            int i = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (i >= n) { return; }
+
+            int nth_bit = (idata[i] >> bit) & 1;
+
+            //Flag elements whose bit-th bit matches desired_bit
+            hists[i] = (int)(nth_bit == desired_bit);
+        }
+
+        __global__ void kern_placeItems(int n, int* odata, const int* idata, const int* hists0, const int* hists1,
+                                        const int* offset0, const int* offset1) {
+            int i = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (i >= n) { return; }
+
+            //Total number of elements with a zero bit
+            int zero_count = offset0[n - 1] + hists0[n - 1];
+
+            //Zero Bit
+            if (hists0[i] == 1) {
+                odata[offset0[i]] = idata[i];
+            } else {
+                odata[zero_count + offset1[i]] = idata[i];
+            }
+        }
+
+        /**
+         * Performs radix sort on a buffer, returns sorted array in odata
+         *
+         * @param n The number of elements in idata.
+         * @param odata The array into which to sort the elements.
+         * @param idata The array of elements to sort.
+         */
+        void sort(int n, int *odata, const int *idata) {
+            try { timer().startGpuTimer(); }
+            catch (...)
+            {};
+            //Define Thread and Block Counts
+            dim3 threadsPerBlock(std::min(getThreadsPerBlock(), n));
+            dim3 numBlocks(std::ceil((float)n / threadsPerBlock.x));
+
+            //Allocate GPU Buffers
+            int* idata_GPU, *odata_GPU;
+            cudaMalloc((void**)&idata_GPU, sizeof(int) * n);
+            cudaMalloc((void**)&odata_GPU, sizeof(int) * n);
+            cudaMemcpy(idata_GPU, idata, sizeof(int) * n, cudaMemcpyHostToDevice);
+
+            int* emptyCounts = (int*) malloc(sizeof(int) * 2 * n);  //zeroed scratch space (currently unused)
+            //Get Maximum Number
+            int max = INT_MIN;
+            for (int i = 0; i < n; i++) {
+                max = std::max(idata[i], max);
+
+                //Zero out array of counts
+                *(emptyCounts + 0 * n + i) = 0;
+                *(emptyCounts + 1 * n + i) = 0;
+            }
+            //max + 1 so an exact power of two still gets its top bit sorted
+            int max_MSB = ilog2ceil(max + 1);
+
+            // ---- For each bit:
+            for (int bit = 0; bit < max_MSB; bit++) {
+                //Create a hist[n] Array, with each thread writing to its index
+                int* histograms_zero; // Per-element flags for a zero bit
+                cudaMalloc((void**)&histograms_zero, sizeof(int) * n);
+
+                //Create a hist[n] Array, with each thread writing to its index
+                int* histograms_one; // Per-element flags for a one bit
+                cudaMalloc((void**)&histograms_one, sizeof(int) * n);
+
+                //Exclusive prefix sums of the flags give each element's scatter offset
+                int* offset_zero;
+                cudaMalloc((void**)&offset_zero, sizeof(int) * n);
+
+                int* offset_one;
+                cudaMalloc((void**)&offset_one, sizeof(int) * n);
+
+                // kern: Generate Histogram with number of zeros and ones for each bit
+                kern_generateHistogram<<<numBlocks, threadsPerBlock>>>(n, bit, 0, idata_GPU, histograms_zero);
+                kern_generateHistogram<<<numBlocks, threadsPerBlock>>>(n, bit, 1, idata_GPU, histograms_one);
+
+                // kern: Calculate Offsets for each (exclusive prefix sum on the histograms)
+                StreamCompaction::Efficient::scan(n, offset_zero, histograms_zero);
+                StreamCompaction::Efficient::scan(n, offset_one, histograms_one);
+
+                //REMEMBER THE MEMCPY OPTIMIZATION
+                /**
+                int last_zero_elt, last_offset_zero;
+                cudaMemcpy(&last_zero_elt, &histograms_zero[n - 1], sizeof(int),
cudaMemcpyDeviceToHost);
+                cudaMemcpy(&last_offset_zero, &offset_zero[n - 1], sizeof(int), cudaMemcpyDeviceToHost);
+                int zero_count = last_offset_zero + last_zero_elt;
+                **/
+
+                // kern: Place each element into its sorted position for this bit
+                kern_placeItems<<<numBlocks, threadsPerBlock>>>(n, odata_GPU, idata_GPU, histograms_zero, histograms_one,
+                                                                offset_zero, offset_one);
+
+                // Ping Pong Buffers
+                std::swap(idata_GPU, odata_GPU);
+
+                //Free Old Stuff
+                cudaFree(histograms_zero);
+                cudaFree(histograms_one);
+                cudaFree(offset_zero);
+                cudaFree(offset_one);
+            }
+
+            //After the final swap the sorted data lives in idata_GPU
+            cudaMemcpy(odata, idata_GPU, sizeof(int) * n, cudaMemcpyDeviceToHost);
+
+            cudaFree(idata_GPU);
+            cudaFree(odata_GPU);
+            free(emptyCounts);
+
+            try { timer().endGpuTimer(); }
+            catch (...) {};
+
+        }
+    }
+}
diff --git a/stream_compaction/radix.h b/stream_compaction/radix.h
new file mode 100644
index 0000000..aa489d9
--- /dev/null
+++ b/stream_compaction/radix.h
@@ -0,0 +1,11 @@
+#pragma once
+
+#include "common.h"
+
+namespace StreamCompaction {
+    namespace Radix {
+        StreamCompaction::Common::PerformanceTimer& timer();
+
+        void sort(int n, int *odata, const int *idata);
+    }
+}
diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu
index 36b732d..aa0c954 100644
--- a/stream_compaction/thrust.cu
+++ b/stream_compaction/thrust.cu
@@ -8,21 +8,33 @@ namespace StreamCompaction {
     namespace Thrust {
-        using StreamCompaction::Common::PerformanceTimer;
-        PerformanceTimer& timer()
-        {
-            static PerformanceTimer timer;
-            return timer;
+        using StreamCompaction::Common::PerformanceTimer;
+        PerformanceTimer& timer()
+        {
+            static PerformanceTimer timer;
+            return timer;
         }
 
         /**
         * Performs prefix-sum (aka scan) on idata, storing the result into odata.
         */
         void scan(int n, int *odata, const int *idata) {
+            int* dv_in, *dv_out;
+            cudaMalloc((void**)&dv_in, sizeof(int) * n);
+            cudaMalloc((void**)&dv_out, sizeof(int) * n);
+
+            cudaMemcpy(dv_in, idata, sizeof(int) * n, cudaMemcpyHostToDevice);
+
+            thrust::device_ptr<int> dv_in_thrust(dv_in);
+            thrust::device_ptr<int> dv_out_thrust(dv_out);
+
             timer().startGpuTimer();
-            // TODO use `thrust::exclusive_scan`
-            // example: for device_vectors dv_in and dv_out:
-            // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
+            thrust::exclusive_scan(dv_in_thrust, dv_in_thrust + n, dv_out_thrust);
             timer().endGpuTimer();
+
+            cudaMemcpy(odata, dv_out, sizeof(int) * n, cudaMemcpyDeviceToHost);
+
+            cudaFree(dv_in);
+            cudaFree(dv_out);
         }
     }
 }