diff --git a/README.md b/README.md
index d63a6a1..c5a709f 100644
--- a/README.md
+++ b/README.md
@@ -1,11 +1,51 @@
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 1 - Flocking**
-* (TODO) YOUR NAME HERE
- * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+Constance Wang
+- [LinkedIn](https://www.linkedin.com/in/conswang/)
-### (TODO: Your README)
+Tested on AORUS 15P XD laptop with specs:
+- Windows 11 22000.856
+- 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz
+- NVIDIA GeForce RTX 3070 Laptop GPU
+
+![](flocking_gif.gif)
+
+### Analysis
+
+#### For each implementation, how does changing the number of boids affect performance? Why do you think this is?
+Increasing the number of boids decreases performance (frame rate). This is because:
+- preprocessing steps take longer (e.g. sorting 5 million boids instead of 5,000)
+- each boid's own update takes longer, since it must check more neighbours
+
+#### For each implementation, how does changing the block count and block size affect performance? Why do you think this is?
+Changing the block size via the blockSize macro does not have a large effect on performance. This is because block count is calculated as (N + blockSize - 1) / blockSize, so enough blocks are always launched to run the per-boid calculation for each of the N boids in parallel, regardless of block size. The only way to achieve more parallelism might be to also parallelize each boid's check of its neighbouring grid cells - but we would still have to gather those partial results to compute the perceived centre of mass, average velocity, etc. This is probably not worth the overhead, since each boid checks at most 8 neighbouring cells.
+
+Testing the coherent uniform grid with visualization off and 5,000 boids, I noticed that the frame rate increases slightly with increased block size.
This could be noise, but it is also possible that a larger block size - and therefore a smaller block count - schedules better. For example, with N = 5000 and blockSize = 1024, block count = (5000 + 1024 - 1) / 1024 = 5 blocks, whereas (5000 + 128 - 1) / 128 = 40 blocks; perhaps not all warps across 40 blocks can run in parallel at once. Either way, the effect is negligible. In addition, increasing the block size shouldn't hurt performance, since each thread uses few resources (it just loads a few neighbours and does minimal arithmetic).
+
+![](images/framerateblocksize.png)
+
+#### For the coherent uniform grid: did you experience any performance improvements with the more coherent uniform grid? Was this the outcome you expected? Why or why not?
+The coherent uniform grid performs better than the scattered uniform grid at higher boid counts (say, 50k boids), but worse at lower boid counts (5k boids).
+
+This is expected because:
+- at lower boid counts, the overhead of reshuffling the position and velocity arrays outweighs the gain from eliminating random accesses - we aren't accessing the pos/vel arrays that often anyway
+- at higher boid counts, contiguous access to the position/velocity arrays lets us take advantage of GPU hardware caching, which is worth the overhead
+
+Both the coherent and scattered uniform grids performed better than the naive implementation, because the naive O(N) neighbour search per boid (O(N^2) overall) is simply too slow.
+
+#### Did changing cell width and checking 27 vs 8 neighboring cells affect performance? Why or why not? Be careful: it is insufficient (and possibly incorrect) to say that 27-cell is slower simply because there are more cells to check!
+Define the neighbours_27 macro as 1 or 0 to toggle this. Halving the cell width and checking 27 cells instead of 8 does not affect performance much.
I tried it on the coherent grid with 5,000 boids and no visualization, and the frame rate was about the same (~2,000 fps).
+
+In theory, it could make performance worse because:
+- the overhead of computing grid indices increases slightly, since there are more grid cells to check
+- we still access essentially the same set of neighbours (e.g. if with 8 cells we check dev_pos from indices 0 to 10, with 27 cells we will still check dev_pos from 0 to 10), so each boid's neighbourhood calculation costs about the same rather than being improved
+
+On the other hand:
+- the code path that checks 27 cells has less branching, which could improve performance
+
+Both effects are probably negligible, at least at 5k boids.
+
+![](images/frameratesboidcount.png)
+![](images/frameratesboidcountwithviz.png)
-Include screenshots, analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
diff --git a/[477.5 fps] 565 CUDA Intro_ Boids [SM 8.6 NVIDIA GeForce RTX 3070 Laptop GPU] 2022-09-11 23-37-55.mp4 b/[477.5 fps] 565 CUDA Intro_ Boids [SM 8.6 NVIDIA GeForce RTX 3070 Laptop GPU] 2022-09-11 23-37-55.mp4
new file mode 100644
index 0000000..4f37892
Binary files /dev/null and b/[477.5 fps] 565 CUDA Intro_ Boids [SM 8.6 NVIDIA GeForce RTX 3070 Laptop GPU] 2022-09-11 23-37-55.mp4 differ
diff --git a/flocking_gif.gif b/flocking_gif.gif
new file mode 100644
index 0000000..286ff0f
Binary files /dev/null and b/flocking_gif.gif differ
diff --git a/images/framerateblocksize.png b/images/framerateblocksize.png
new file mode 100644
index 0000000..0b4cb04
Binary files /dev/null and b/images/framerateblocksize.png differ
diff --git a/images/frameratesboidcount.png b/images/frameratesboidcount.png
new file mode 100644
index 0000000..1a64229
Binary files /dev/null and b/images/frameratesboidcount.png differ
diff --git a/images/frameratesboidcountwithviz.png b/images/frameratesboidcountwithviz.png
new file mode 100644 index 0000000..b944ec0 Binary files /dev/null and b/images/frameratesboidcountwithviz.png differ diff --git a/src/kernel.cu b/src/kernel.cu index 74dffcb..697f4a2 100644 --- a/src/kernel.cu +++ b/src/kernel.cu @@ -5,6 +5,7 @@ #include #include "utilityCore.hpp" #include "kernel.h" +#include // LOOK-2.1 potentially useful for doing grid-based neighbor search #ifndef imax @@ -20,7 +21,7 @@ /** * Check for CUDA errors; print and exit if there was a problem. */ -void checkCUDAError(const char *msg, int line = -1) { +void checkCUDAError(const char* msg, int line = -1) { cudaError_t err = cudaGetLastError(); if (cudaSuccess != err) { if (line >= 0) { @@ -54,6 +55,8 @@ void checkCUDAError(const char *msg, int line = -1) { /*! Size of the starting area in simulation space. */ #define scene_scale 100.0f +#define neighbours_27 0 + /*********************************************** * Kernel state (pointers are device pointers) * ***********************************************/ @@ -66,25 +69,26 @@ dim3 threadsPerBlock(blockSize); // Consider why you would need two velocity buffers in a simulation where each // boid cares about its neighbors' velocities. // These are called ping-pong buffers. -glm::vec3 *dev_pos; -glm::vec3 *dev_vel1; -glm::vec3 *dev_vel2; +glm::vec3* dev_pos; +glm::vec3* dev_vel1; +glm::vec3* dev_vel2; // LOOK-2.1 - these are NOT allocated for you. You'll have to set up the thrust // pointers on your own too. // For efficient sorting and the uniform grid. These should always be parallel. -int *dev_particleArrayIndices; // What index in dev_pos and dev_velX represents this particle? -int *dev_particleGridIndices; // What grid cell is this particle in? +int* dev_particleArrayIndices; // What index in dev_pos and dev_velX represents this particle? +int* dev_particleGridIndices; // What grid cell is this particle in? 
// needed for use with thrust thrust::device_ptr<int> dev_thrust_particleArrayIndices; thrust::device_ptr<int> dev_thrust_particleGridIndices; -int *dev_gridCellStartIndices; // What part of dev_particleArrayIndices belongs -int *dev_gridCellEndIndices; // to this cell? +int* dev_gridCellStartIndices; // What part of dev_particleArrayIndices belongs +int* dev_gridCellEndIndices; // to this cell? // TODO-2.3 - consider what additional buffers you might need to reshuffle // the position and velocity data to be coherent within cells. +glm::vec3* dev_pos2; // LOOK-2.1 - Grid parameters based on simulation parameters. // These are automatically computed for you in Boids::initSimulation @@ -123,7 +127,7 @@ __host__ __device__ glm::vec3 generateRandomVec3(float time, int index) { * LOOK-1.2 - This is a basic CUDA kernel. * CUDA kernel for generating boids with a specified mass randomly around the star. */ -__global__ void kernGenerateRandomPosArray(int time, int N, glm::vec3 * arr, float scale) { +__global__ void kernGenerateRandomPosArray(int time, int N, glm::vec3* arr, float scale) { int index = (blockIdx.x * blockDim.x) + threadIdx.x; if (index < N) { glm::vec3 rand = generateRandomVec3(time, index); @@ -152,23 +156,45 @@ void Boids::initSimulation(int N) { checkCUDAErrorWithLine("cudaMalloc dev_vel2 failed!"); // LOOK-1.2 - This is a typical CUDA kernel invocation.
- kernGenerateRandomPosArray<<<fullBlocksPerGrid, blockSize>>>(1, numObjects, + kernGenerateRandomPosArray <<<fullBlocksPerGrid, blockSize>>> (1, numObjects, dev_pos, scene_scale); checkCUDAErrorWithLine("kernGenerateRandomPosArray failed!"); // LOOK-2.1 computing grid params + +#if neighbours_27 + gridCellWidth = std::max(std::max(rule1Distance, rule2Distance), rule3Distance); +#else gridCellWidth = 2.0f * std::max(std::max(rule1Distance, rule2Distance), rule3Distance); +#endif int halfSideCount = (int)(scene_scale / gridCellWidth) + 1; gridSideCount = 2 * halfSideCount; gridCellCount = gridSideCount * gridSideCount * gridSideCount; gridInverseCellWidth = 1.0f / gridCellWidth; float halfGridWidth = gridCellWidth * halfSideCount; - gridMinimum.x -= halfGridWidth; + gridMinimum.x -= halfGridWidth; // does this mean gridMinimum is already "auto-initialized" to (0, 0, 0)? gridMinimum.y -= halfGridWidth; gridMinimum.z -= halfGridWidth; // TODO-2.1 TODO-2.3 - Allocate additional buffers here. + cudaMalloc((void**)&dev_particleArrayIndices, N * sizeof(int)); + cudaMalloc((void**)&dev_particleGridIndices, N * sizeof(int)); + + dev_thrust_particleArrayIndices = thrust::device_ptr<int>(dev_particleArrayIndices); + dev_thrust_particleGridIndices = thrust::device_ptr<int>(dev_particleGridIndices); + + cudaMalloc((void**)&dev_gridCellStartIndices, gridCellCount * sizeof(int)); + cudaMalloc((void**)&dev_gridCellEndIndices, gridCellCount * sizeof(int)); + + cudaMalloc((void**)&dev_pos2, N * sizeof(glm::vec3)); + + //thrust::sequence(dev_thrust_particleArrayIndices, dev_thrust_particleArrayIndices + numObjects); + + //for (int i = 0; i < N; ++i) { + // std::cout << dev_thrust_particleArrayIndices[i]; + //} + cudaDeviceSynchronize(); } @@ -180,7 +206,7 @@ void Boids::initSimulation(int N) { /** * Copy the boid positions into the VBO so that they can be drawn by OpenGL.
*/ -__global__ void kernCopyPositionsToVBO(int N, glm::vec3 *pos, float *vbo, float s_scale) { +__global__ void kernCopyPositionsToVBO(int N, glm::vec3* pos, float* vbo, float s_scale) { int index = threadIdx.x + (blockIdx.x * blockDim.x); float c_scale = -1.0f / s_scale; @@ -193,7 +219,7 @@ __global__ void kernCopyPositionsToVBO(int N, glm::vec3 *pos, float *vbo, float } } -__global__ void kernCopyVelocitiesToVBO(int N, glm::vec3 *vel, float *vbo, float s_scale) { +__global__ void kernCopyVelocitiesToVBO(int N, glm::vec3* vel, float* vbo, float s_scale) { int index = threadIdx.x + (blockIdx.x * blockDim.x); if (index < N) { @@ -207,11 +233,11 @@ __global__ void kernCopyVelocitiesToVBO(int N, glm::vec3 *vel, float *vbo, float /** * Wrapper for call to the kernCopyboidsToVBO CUDA kernel. */ -void Boids::copyBoidsToVBO(float *vbodptr_positions, float *vbodptr_velocities) { +void Boids::copyBoidsToVBO(float* vbodptr_positions, float* vbodptr_velocities) { dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize); - kernCopyPositionsToVBO<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_pos, vbodptr_positions, scene_scale); - kernCopyVelocitiesToVBO<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_vel1, vbodptr_velocities, scene_scale); + kernCopyPositionsToVBO <<<fullBlocksPerGrid, blockSize>>> (numObjects, dev_pos, vbodptr_positions, scene_scale); + kernCopyVelocitiesToVBO <<<fullBlocksPerGrid, blockSize>>> (numObjects, dev_vel1, vbodptr_velocities, scene_scale); checkCUDAErrorWithLine("copyBoidsToVBO failed!"); @@ -229,29 +255,96 @@ void Boids::copyBoidsToVBO(float *vbodptr_positions, float *vbodptr_velocities) * Compute the new velocity on the body with index `iSelf` due to the `N` boids * in the `pos` and `vel` arrays.
*/ -__device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *pos, const glm::vec3 *vel) { +__device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3* pos, const glm::vec3* vel) { // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves + int rule1Count = 0; + glm::vec3 perceivedCenter(0); + // Rule 2: boids try to stay a distance d away from each other + glm::vec3 neighboursOffset(0); + // Rule 3: boids try to match the speed of surrounding boids - return glm::vec3(0.0f, 0.0f, 0.0f); + int rule3Count = 0; + glm::vec3 perceivedVel(0); + + for (int i = 0; i < N; ++i) { + if (i == iSelf) { + continue; + } + + float distance = glm::distance(pos[iSelf], pos[i]); + + if (distance < rule1Distance) { + perceivedCenter += pos[i]; + ++rule1Count; + } + + if (distance < rule2Distance) { + neighboursOffset -= pos[i] - pos[iSelf]; + } + + if (distance < rule3Distance) { + perceivedVel += vel[i]; + ++rule3Count; + } + } + + glm::vec3 rule1, rule2, rule3; + + if (rule1Count != 0) { + perceivedCenter /= rule1Count; + rule1 = (perceivedCenter - pos[iSelf]) * rule1Scale; + } + else { + rule1 = glm::vec3(0); + } + + if (rule3Count != 0) { + perceivedVel /= rule3Count; + rule3 = perceivedVel * rule3Scale; + } + else { + rule3 = glm::vec3(0); + } + + rule2 = neighboursOffset * rule2Scale; + + return vel[iSelf] + rule1 + rule2 + rule3; +} + +__device__ glm::vec3 clampSpeed(glm::vec3 vel) { + float speed = glm::length(vel); + if (speed > maxSpeed) { + return vel * maxSpeed / speed; + } + else { + return vel; + } } /** * TODO-1.2 implement basic flocking * For each of the `N` bodies, update its position based on its current velocity. 
*/ -__global__ void kernUpdateVelocityBruteForce(int N, glm::vec3 *pos, - glm::vec3 *vel1, glm::vec3 *vel2) { +__global__ void kernUpdateVelocityBruteForce(int N, glm::vec3* pos, + glm::vec3* vel1, glm::vec3* vel2) { // Compute a new velocity based on pos and vel1 // Clamp the speed // Record the new velocity into vel2. Question: why NOT vel1? + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) { + return; + } + + glm::vec3 newVel = computeVelocityChange(N, index, pos, vel1); + vel2[index] = clampSpeed(newVel); } /** * LOOK-1.2 Since this is pretty trivial, we implemented it for you. * For each of the `N` bodies, update its position based on its current velocity. */ -__global__ void kernUpdatePos(int N, float dt, glm::vec3 *pos, glm::vec3 *vel) { +__global__ void kernUpdatePos(int N, float dt, glm::vec3* pos, glm::vec3* vel) { // Update position by velocity int index = threadIdx.x + (blockIdx.x * blockDim.x); if (index >= N) { @@ -284,36 +377,64 @@ __device__ int gridIndex3Dto1D(int x, int y, int z, int gridResolution) { __global__ void kernComputeIndices(int N, int gridResolution, glm::vec3 gridMin, float inverseCellWidth, - glm::vec3 *pos, int *indices, int *gridIndices) { - // TODO-2.1 - // - Label each boid with the index of its grid cell. - // - Set up a parallel array of integer indices as pointers to the actual - // boid data in pos and vel1/vel2 + glm::vec3* pos, int* indices, int* gridIndices) { + // TODO-2.1 + // - Label each boid with the index of its grid cell. 
+ // - Set up a parallel array of integer indices as pointers to the actual + // boid data in pos and vel1/vel2 + + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index >= N) { + return; + } + glm::vec3 boidPos = pos[index]; + + glm::vec3 gridPos = glm::floor((boidPos - gridMin) * inverseCellWidth); + int gridIndex = gridIndex3Dto1D((int)gridPos.x, (int)gridPos.y, (int)gridPos.z, gridResolution); + + indices[index] = index; + gridIndices[index] = gridIndex; } // LOOK-2.1 Consider how this could be useful for indicating that a cell // does not enclose any boids -__global__ void kernResetIntBuffer(int N, int *intBuffer, int value) { +__global__ void kernResetIntBuffer(int N, int* intBuffer, int value) { int index = (blockIdx.x * blockDim.x) + threadIdx.x; if (index < N) { intBuffer[index] = value; } } -__global__ void kernIdentifyCellStartEnd(int N, int *particleGridIndices, - int *gridCellStartIndices, int *gridCellEndIndices) { +__global__ void kernIdentifyCellStartEnd(int N, int* particleGridIndices, + int* gridCellStartIndices, int* gridCellEndIndices) { // TODO-2.1 // Identify the start point of each cell in the gridIndices array. // This is basically a parallel unrolling of a loop that goes // "this index doesn't match the one before it, must be a new cell!" 
+ int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index >= N) { + return; + } + + int gridIndex = particleGridIndices[index]; + + if (index == 0 || particleGridIndices[index - 1] != particleGridIndices[index]) { + // then index is a start index + gridCellStartIndices[gridIndex] = index; + } + + if (index == N - 1 || particleGridIndices[index] != particleGridIndices[index + 1]) { + // then index is an end index + gridCellEndIndices[gridIndex] = index; + } } __global__ void kernUpdateVelNeighborSearchScattered( int N, int gridResolution, glm::vec3 gridMin, float inverseCellWidth, float cellWidth, - int *gridCellStartIndices, int *gridCellEndIndices, - int *particleArrayIndices, - glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) { + int* gridCellStartIndices, int* gridCellEndIndices, + int* particleArrayIndices, + glm::vec3* pos, glm::vec3* vel1, glm::vec3* vel2) { // TODO-2.1 - Update a boid's velocity using the uniform grid to reduce // the number of boids that need to be checked. // - Identify the grid cell that this particle is in @@ -322,13 +443,111 @@ __global__ void kernUpdateVelNeighborSearchScattered( // - Access each boid in the cell and compute velocity change from // the boids rules, if this boid is within the neighborhood distance. 
// - Clamp the speed change before putting the new speed in vel2 + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index >= N) { + return; + } + + // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves + int rule1Count = 0; + glm::vec3 perceivedCenter(0); + + // Rule 2: boids try to stay a distance d away from each other + glm::vec3 neighboursOffset(0); + + // Rule 3: boids try to match the speed of surrounding boids + int rule3Count = 0; + glm::vec3 perceivedVel(0); + + glm::vec3 boidPos = pos[index]; + glm::vec3 gridPosExact = (boidPos - gridMin) * inverseCellWidth; + glm::vec3 gridPos = glm::floor(gridPosExact); + int gridX = gridPos.x; + int gridY = gridPos.y; + int gridZ = gridPos.z; + +#if neighbours_27 + int xMin = imax(gridX - 1, 0); + int yMin = imax(gridY - 1, 0); + int zMin = imax(gridZ - 1, 0); + int xMax = imin(gridX + 1, gridResolution - 1); + int yMax = imin(gridY + 1, gridResolution - 1); + int zMax = imin(gridZ + 1, gridResolution - 1); +#else + glm::vec3 gridPosRelativeExact = gridPosExact - gridPos; + int xMin = gridPosRelativeExact.x < 0.5 ? imax(gridX - 1, 0) : gridX; + int yMin = gridPosRelativeExact.y < 0.5 ? imax(gridY - 1, 0) : gridY; + int zMin = gridPosRelativeExact.z < 0.5 ? imax(gridZ - 1, 0) : gridZ; + int xMax = gridPosRelativeExact.x >= 0.5 ? imin(gridX + 1, gridResolution - 1) : gridX; + int yMax = gridPosRelativeExact.y >= 0.5 ? imin(gridY + 1, gridResolution - 1) : gridY; + int zMax = gridPosRelativeExact.z >= 0.5 ? 
imin(gridZ+ 1, gridResolution - 1) : gridZ; +#endif + + for (int z = zMin; z <= zMax; ++z) { + for (int y = yMin; y <= yMax; ++y) { + for (int x = xMin; x <= xMax; ++x) { + int neighbourGridIndex = gridIndex3Dto1D(x, y, z, gridResolution); + int startIndex = gridCellStartIndices[neighbourGridIndex]; + int endIndex = gridCellEndIndices[neighbourGridIndex]; + + if (startIndex == -1 || endIndex == -1) { + continue; // no boids in this grid cell + } + + for (int middleManIndex = startIndex; middleManIndex <= endIndex; ++middleManIndex) { + int neighbourIndex = particleArrayIndices[middleManIndex]; + + glm::vec3 neighbourPos = pos[neighbourIndex]; + glm::vec3 neighbourVel = vel1[neighbourIndex]; + float distance = glm::distance(boidPos, neighbourPos); + //float distance = 0; + + if (distance < rule1Distance) { + perceivedCenter += neighbourPos; + ++rule1Count; + } + + if (distance < rule2Distance) { + neighboursOffset -= neighbourPos - boidPos; + } + + if (distance < rule3Distance) { + perceivedVel += neighbourVel; + ++rule3Count; + } + } + } + } + } + + glm::vec3 rule1, rule2, rule3; + + if (rule1Count != 0) { + perceivedCenter /= rule1Count; + rule1 = (perceivedCenter - boidPos) * rule1Scale; + } + else { + rule1 = glm::vec3(0); + } + + if (rule3Count != 0) { + perceivedVel /= rule3Count; + rule3 = perceivedVel * rule3Scale; + } + else { + rule3 = glm::vec3(0); + } + + rule2 = neighboursOffset * rule2Scale; + + vel2[index] = clampSpeed(vel1[index] + rule1 + rule2 + rule3); } __global__ void kernUpdateVelNeighborSearchCoherent( int N, int gridResolution, glm::vec3 gridMin, float inverseCellWidth, float cellWidth, - int *gridCellStartIndices, int *gridCellEndIndices, - glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) { + int* gridCellStartIndices, int* gridCellEndIndices, + glm::vec3* pos, glm::vec3* vel1, glm::vec3* vel2) { // TODO-2.3 - This should be very similar to kernUpdateVelNeighborSearchScattered, // except with one less level of indirection. 
// This should expect gridCellStartIndices and gridCellEndIndices to refer @@ -341,6 +560,103 @@ __global__ void kernUpdateVelNeighborSearchCoherent( // - Access each boid in the cell and compute velocity change from // the boids rules, if this boid is within the neighborhood distance. // - Clamp the speed change before putting the new speed in vel2 + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index >= N) { + return; + } + + // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves + int rule1Count = 0; + glm::vec3 perceivedCenter(0); + + // Rule 2: boids try to stay a distance d away from each other + glm::vec3 neighboursOffset(0); + + // Rule 3: boids try to match the speed of surrounding boids + int rule3Count = 0; + glm::vec3 perceivedVel(0); + + glm::vec3 boidPos = pos[index]; + glm::vec3 gridPosExact = (boidPos - gridMin) * inverseCellWidth; + glm::vec3 gridPos = glm::floor(gridPosExact); + int gridX = gridPos.x; + int gridY = gridPos.y; + int gridZ = gridPos.z; + +#if neighbours_27 + int xMin = imax(gridX - 1, 0); + int yMin = imax(gridY - 1, 0); + int zMin = imax(gridZ - 1, 0); + int xMax = imin(gridX + 1, gridResolution - 1); + int yMax = imin(gridY + 1, gridResolution - 1); + int zMax = imin(gridZ + 1, gridResolution - 1); +#else + glm::vec3 gridPosRelativeExact = gridPosExact - gridPos; + int xMin = gridPosRelativeExact.x < 0.5 ? imax(gridX - 1, 0) : gridX; + int yMin = gridPosRelativeExact.y < 0.5 ? imax(gridY - 1, 0) : gridY; + int zMin = gridPosRelativeExact.z < 0.5 ? imax(gridZ - 1, 0) : gridZ; + int xMax = gridPosRelativeExact.x >= 0.5 ? imin(gridX + 1, gridResolution - 1) : gridX; + int yMax = gridPosRelativeExact.y >= 0.5 ? imin(gridY + 1, gridResolution - 1) : gridY; + int zMax = gridPosRelativeExact.z >= 0.5 ? 
imin(gridZ + 1, gridResolution - 1) : gridZ; +#endif + + for (int z = zMin; z <= zMax; ++z) { + for (int y = yMin; y <= yMax; ++y) { + for (int x = xMin; x <= xMax; ++x) { + int neighbourGridIndex = gridIndex3Dto1D(x, y, z, gridResolution); + int startIndex = gridCellStartIndices[neighbourGridIndex]; + int endIndex = gridCellEndIndices[neighbourGridIndex]; + + if (startIndex == -1 || endIndex == -1) { + continue; // no boids in this grid cell + } + + for (int neighbourIndex = startIndex; neighbourIndex <= endIndex; ++neighbourIndex) { + + glm::vec3 neighbourPos = pos[neighbourIndex]; + glm::vec3 neighbourVel = vel1[neighbourIndex]; + float distance = glm::distance(boidPos, neighbourPos); + //float distance = 0; + + if (distance < rule1Distance) { + perceivedCenter += neighbourPos; + ++rule1Count; + } + + if (distance < rule2Distance) { + neighboursOffset -= neighbourPos - boidPos; + } + + if (distance < rule3Distance) { + perceivedVel += neighbourVel; + ++rule3Count; + } + } + } + } + } + + glm::vec3 rule1, rule2, rule3; + + if (rule1Count != 0) { + perceivedCenter /= rule1Count; + rule1 = (perceivedCenter - boidPos) * rule1Scale; + } + else { + rule1 = glm::vec3(0); + } + + if (rule3Count != 0) { + perceivedVel /= rule3Count; + rule3 = perceivedVel * rule3Scale; + } + else { + rule3 = glm::vec3(0); + } + + rule2 = neighboursOffset * rule2Scale; + + vel2[index] = clampSpeed(vel1[index] + rule1 + rule2 + rule3); } /** @@ -348,7 +664,15 @@ __global__ void kernUpdateVelNeighborSearchCoherent( */ void Boids::stepSimulationNaive(float dt) { // TODO-1.2 - use the kernels you wrote to step the simulation forward in time. 
+ dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize); + + kernUpdateVelocityBruteForce <<<fullBlocksPerGrid, blockSize>>> (numObjects, dev_pos, dev_vel1, dev_vel2); + kernUpdatePos <<<fullBlocksPerGrid, blockSize>>> (numObjects, dt, dev_pos, dev_vel1); + // TODO-1.2 ping-pong the velocity buffers + glm::vec3* temp = dev_vel1; + dev_vel1 = dev_vel2; + dev_vel2 = temp; } void Boids::stepSimulationScatteredGrid(float dt) { @@ -364,6 +688,55 @@ void Boids::stepSimulationScatteredGrid(float dt) { // - Perform velocity updates using neighbor search // - Update positions // - Ping-pong buffers as needed + int N = numObjects; + dim3 fullBlocksPerGrid((N + blockSize - 1) / blockSize); + dim3 fullBlocksPerGridCells((gridCellCount + blockSize - 1) / blockSize); + + kernComputeIndices <<<fullBlocksPerGrid, blockSize>>> (N, gridSideCount, gridMinimum, gridInverseCellWidth, dev_pos, dev_particleArrayIndices, dev_particleGridIndices); + + //for (int i = 0; i < N; ++i) { + // std::cout << "boid #: " << dev_thrust_particleArrayIndices[i] << "is inside cell: " << dev_thrust_particleGridIndices[i] << std::endl; + //} + + thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + N, dev_thrust_particleArrayIndices); + + //std::cout << "AFTER SORTING-------------------------------------------" << std::endl; + //for (int i = 0; i < N; ++i) { + // std::cout << "boid #: " << dev_thrust_particleArrayIndices[i] << "is inside cell: " << dev_thrust_particleGridIndices[i] << std::endl; + // std::cout << "grid #" << gridSideCount << std::endl; + //} + + // -1 means no boid is in the grid cell + // this is called for each cell block, hence fullBlocksPerGridCells and not fullBlocksPerGrid + kernResetIntBuffer <<<fullBlocksPerGridCells, blockSize>>> (gridCellCount, dev_gridCellStartIndices, -1); + kernResetIntBuffer <<<fullBlocksPerGridCells, blockSize>>> (gridCellCount, dev_gridCellEndIndices, -1); + kernIdentifyCellStartEnd <<<fullBlocksPerGrid, blockSize>>> (N, dev_particleGridIndices, dev_gridCellStartIndices, dev_gridCellEndIndices); + + //std::cout << "Start and end indices------------------------------" << std::endl; +
//thrust::device_ptr<int> thrust_startIndices(dev_gridCellStartIndices); + //thrust::device_ptr<int> thrust_endIndices(dev_gridCellEndIndices); + //for (int i = 0; i < gridCellCount; ++i) { + // std::cout << "Grid cell " << i << " starts at index " << thrust_startIndices[i] << " and ends at " << thrust_endIndices[i] << std::endl; + //} + + kernUpdateVelNeighborSearchScattered <<<fullBlocksPerGrid, blockSize>>> (N, gridSideCount, gridMinimum, gridInverseCellWidth, gridCellWidth, dev_gridCellStartIndices, dev_gridCellEndIndices, dev_particleArrayIndices, dev_pos, dev_vel1, dev_vel2); + + kernUpdatePos <<<fullBlocksPerGrid, blockSize>>> (N, dt, dev_pos, dev_vel1); + std::swap(dev_vel1, dev_vel2); +} + +__global__ void kernSortPosAndVel(int N, int* particleArrayIndicesSorted, + glm::vec3 *pos, glm::vec3 *posToSort, + glm::vec3 *vel, glm::vec3 *velToSort) { + // Index is the index into particleArrayIndicesSorted + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index >= N) { + return; + } + int realIndex = particleArrayIndicesSorted[index]; + + posToSort[index] = pos[realIndex]; + velToSort[index] = vel[realIndex]; } void Boids::stepSimulationCoherentGrid(float dt) { @@ -382,6 +755,31 @@ void Boids::stepSimulationCoherentGrid(float dt) { // - Perform velocity updates using neighbor search // - Update positions // - Ping-pong buffers as needed. THIS MAY BE DIFFERENT FROM BEFORE. + int N = numObjects; + dim3 fullBlocksPerGrid((N + blockSize - 1) / blockSize); + dim3 fullBlocksPerGridCells((gridCellCount + blockSize - 1) / blockSize); + + // currently, particleArrayIndices[0] = 0, [1] = 1 ...
+ // dev_pos and dev_vel1 are the original parallel position/velocity arrays + kernComputeIndices <<<fullBlocksPerGrid, blockSize>>> (N, gridSideCount, gridMinimum, gridInverseCellWidth, dev_pos, dev_particleArrayIndices, dev_particleGridIndices); + thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + N, dev_thrust_particleArrayIndices); + + // dev_pos2 and dev_vel2 are parallel arrays, they are sorted according to particleArrayIndices + // dev_pos and dev_vel1 are in their original "real" order + kernSortPosAndVel <<<fullBlocksPerGrid, blockSize>>> (N, dev_particleArrayIndices, dev_pos, dev_pos2, dev_vel1, dev_vel2); + + kernResetIntBuffer <<<fullBlocksPerGridCells, blockSize>>> (gridCellCount, dev_gridCellStartIndices, -1); + kernResetIntBuffer <<<fullBlocksPerGridCells, blockSize>>> (gridCellCount, dev_gridCellEndIndices, -1); + kernIdentifyCellStartEnd <<<fullBlocksPerGrid, blockSize>>> (N, dev_particleGridIndices, dev_gridCellStartIndices, dev_gridCellEndIndices); + + // Now dev_vel2 is the old (sorted) velocity, dev_vel1 is recycled for us to write the new velocity into + kernUpdateVelNeighborSearchCoherent <<<fullBlocksPerGrid, blockSize>>> (N, gridSideCount, gridMinimum, gridInverseCellWidth, gridCellWidth, dev_gridCellStartIndices, dev_gridCellEndIndices, + dev_pos2, dev_vel2, dev_vel1); + + kernUpdatePos <<<fullBlocksPerGrid, blockSize>>> (N, dt, dev_pos2, dev_vel2); + + // Swap dev_pos to be sorted dev_pos2 to match the sorted new dev_vel1. Terrible variable names here + std::swap(dev_pos, dev_pos2); } void Boids::endSimulation() { @@ -390,14 +788,20 @@ void Boids::endSimulation() { cudaFree(dev_pos); // TODO-2.1 TODO-2.3 - Free any additional buffers here. + cudaFree(dev_particleArrayIndices); + cudaFree(dev_particleGridIndices); + cudaFree(dev_gridCellStartIndices); + cudaFree(dev_gridCellEndIndices); + + cudaFree(dev_pos2); } void Boids::unitTest() { // LOOK-1.2 Feel free to write additional tests here.
// test unstable sort - int *dev_intKeys; - int *dev_intValues; + int* dev_intKeys; + int* dev_intValues; int N = 10; std::unique_ptr<int[]> intKeys{ new int[N] }; diff --git a/src/main.cpp b/src/main.cpp index b82c8c6..6641a62 100644 --- a/src/main.cpp +++ b/src/main.cpp @@ -13,9 +13,9 @@ // ================ // LOOK-2.1 LOOK-2.3 - toggles for UNIFORM_GRID and COHERENT_GRID -#define VISUALIZE 1 -#define UNIFORM_GRID 0 -#define COHERENT_GRID 0 +#define VISUALIZE 0 +#define UNIFORM_GRID 1 +#define COHERENT_GRID 1 // LOOK-1.2 - change this to adjust particle count in the simulation const int N_FOR_VIS = 5000;