
Mistral-large:123b-instruct-2411-q2_K (45G) on NVIDIA 48G RTX-A6000 or Apple M4 Max 40 core 48G compared to OpenAI o1 pro #98


Description

@obriensystems
PS C:\Users\michael> ollama run mistral-large:123b-instruct-2411-q2_K --verbose
pulling manifest
pulling 914093065095... 100% ▕███████████████████████████████████████████████████████████▏  45 GB
pulling 96adabcf2c08... 100% ▕███████████████████████████████████████████████████████████▏  896 B
pulling 06a6f77f3e95... 100% ▕███████████████████████████████████████████████████████████▏  11 KB
pulling ac9aa3c4956d... 100% ▕███████████████████████████████████████████████████████████▏   47 B
pulling 99ec6feee8ef... 100% ▕███████████████████████████████████████████████████████████▏  487 B
verifying sha256 digest
writing manifest
success
>>> Can we go over creating a mandelbrot image using cuda c.  I would like to fully utilize all the cores of the GPU.  I am
...  using specific GPUs from NVidia including the RTX-3500 Ada mobile with 5120 cores and the RTX-4090 with 16384 cores.
Certainly! Creating a Mandelbrot set image using CUDA (Compute Unified Device Architecture) in C can be an excellent
way to leverage the parallel processing power of GPUs, including the NVIDIA RTX-3500 Ada mobile and RTX-4090. Below,
I'll guide you through the steps to write a CUDA program that generates a Mandelbrot set image and utilizes all
available GPU cores efficiently.

### Step 1: Setting Up Your Environment
First, ensure you have the NVIDIA CUDA Toolkit installed on your system. You can download it from the [NVIDIA
website](https://developer.nvidia.com/cuda-downloads). Additionally, install a GPU driver compatible with your
hardware and the CUDA Toolkit version.
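
As a quick sanity check before writing the kernel, a minimal program along the following lines can confirm that `nvcc` and the driver are working and report how many streaming multiprocessors (SMs) each device exposes. It uses the standard `cudaGetDeviceProperties` call; the output format is just an example:

```c
#include <stdio.h>
#include <cuda_runtime.h>

// Sketch: enumerate CUDA devices and print the properties relevant to sizing a launch.
int main() {
    int deviceCount = 0;
    cudaError_t err = cudaGetDeviceCount(&deviceCount);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int d = 0; d < deviceCount; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("Device %d: %s\n", d, prop.name);
        printf("  Compute capability: %d.%d\n", prop.major, prop.minor);
        printf("  Multiprocessors (SMs): %d\n", prop.multiProcessorCount);
        printf("  Max threads per block: %d\n", prop.maxThreadsPerBlock);
        printf("  Global memory: %.1f GiB\n",
               prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    }
    return 0;
}
```

The CUDA cores you mention (5120 on the RTX-3500 Ada, 16384 on the RTX-4090) are spread across these SMs, so the Mandelbrot kernel below launches far more threads than there are cores and lets the hardware scheduler keep every SM busy.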

### Step 2: Writing the CUDA Program
Here's a simple example of a CUDA program to generate a Mandelbrot set image. This example will focus on utilizing
all available cores by launching many threads in parallel.

```c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <cuda_runtime.h>

// Kernel function to calculate the Mandelbrot set
__global__ void mandelbrotKernel(int* output, int width, int height, float xMin, float yMin, float xMax, float yMax)
{
    // Calculate pixel's position that this thread will compute
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;

    if (ix < width && iy < height) {
        // Calculate pixel's value in the Mandelbrot set
        float x = xMin + ix * (xMax - xMin) / (width - 1);
        float y = yMin + iy * (yMax - yMin) / (height - 1);

        // Mandelbrot iteration
        int iter = 0;
        float zx = 0.0f, zy = 0.0f;
        while (iter < 256 && (zx * zx + zy * zy) < 4.0f) {
            float newZx = zx * zx - zy * zy + x;
            zy = 2.0f * zx * zy + y;
            zx = newZx;
            iter++;
        }

        // Output the iteration count as a color value
        output[iy * width + ix] = (iter == 256) ? 0 : iter;
    }
}

int main() {
    int width = 800;
    int height = 600;
    float xMin = -2.0f, yMin = -1.5f, xMax = 1.0f, yMax = 1.5f;

    // Allocate memory for the output image on the host
    size_t size = width * height * sizeof(int);
    int* hOutput = (int*)malloc(size);

    // Allocate memory for the output image on the device
    int* dOutput;
    cudaMalloc(&dOutput, size);

    // Define block and grid dimensions (round up so edge pixels are still covered
    // when the image size is not a multiple of the block size)
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks((width + threadsPerBlock.x - 1) / threadsPerBlock.x,
                   (height + threadsPerBlock.y - 1) / threadsPerBlock.y);

    // Launch the kernel
    mandelbrotKernel<<<numBlocks, threadsPerBlock>>>(dOutput, width, height, xMin, yMin, xMax, yMax);

    // Copy the result back to the host
    cudaMemcpy(hOutput, dOutput, size, cudaMemcpyDeviceToHost);

    // Save the output image as a PGM file
    FILE* fp = fopen("mandelbrot.pgm", "wb");
    fprintf(fp, "P5\n%d %d\n255\n", width, height);
    for (int i = 0; i < width * height; ++i) {
        unsigned char val = (unsigned char)(hOutput[i] % 256);
        fwrite(&val, sizeof(unsigned char), 1, fp);
    }
    fclose(fp);

    // Free allocated memory
    free(hOutput);
    cudaFree(dOutput);

    return 0;
}
```

### Step 3: Compiling the CUDA Program
To compile this program, use `nvcc`, the NVIDIA CUDA compiler. For example:

```bash
nvcc -o mandelbrot_cuda mandelbrot_cuda.cu
```

### Step 4: Running the Program
Run the compiled executable:

```bash
./mandelbrot_cuda
```

This will generate a mandelbrot.pgm file, which is a grayscale image of the Mandelbrot set. You can view this image using any PGM viewer or convert it to another format using tools like ImageMagick.
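
To actually compare your two GPUs, it can help to time the kernel. A sketch using CUDA events, dropped in around the existing kernel launch in `main()` (variable names match the example above), might look like this:

```c
// Sketch: timing the kernel with CUDA events.
// Assumes this replaces the "Launch the kernel" section of main() above,
// with dOutput, width, height, xMin, yMin, xMax, yMax already defined.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
mandelbrotKernel<<<numBlocks, threadsPerBlock>>>(dOutput, width, height, xMin, yMin, xMax, yMax);
cudaEventRecord(stop);
cudaEventSynchronize(stop);   // wait for the kernel to finish

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("Mandelbrot kernel took %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

At 800x600 with 256 iterations the kernel finishes quickly on either card; raising the resolution or the iteration cap makes the difference between the RTX-3500 Ada and the RTX-4090 easier to measure.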

### Optimizing for Specific GPUs
To fully utilize all the cores of your specific GPUs (RTX-3500 Ada mobile and RTX-4090), you may need to experiment with different block sizes and grid configurations. The example above uses a 16x16 thread block, which is a common starting point. You can adjust `threadsPerBlock` and `numBlocks` based on your specific GPU's compute capability and memory architecture, as sketched below.
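
One way to avoid hand-tuning is to ask the runtime for an occupancy-based suggestion. The sketch below uses the standard `cudaOccupancyMaxPotentialBlockSize` call and then shapes its 1-D suggestion into the 2-D launch used above; the square-block mapping is an assumption you may want to adjust:

```c
// Sketch: let the runtime suggest a block size for mandelbrotKernel,
// then shape it into the 2-D launch used in main() above
// (assumes mandelbrotKernel, width and height are in scope).
int minGridSize = 0;   // minimum grid size to achieve full occupancy
int blockSize = 0;     // suggested total threads per block
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, mandelbrotKernel, 0, 0);

// Shape the 1-D suggestion into a roughly square 2-D block
// (assumption: square blocks fit this image-parallel kernel well).
int blockX = 16;
while (blockX * blockX * 4 <= blockSize) blockX *= 2;   // e.g. 256 -> 16x16, 1024 -> 32x32
dim3 threadsPerBlock(blockX, blockSize / blockX);
dim3 numBlocks((width + threadsPerBlock.x - 1) / threadsPerBlock.x,
               (height + threadsPerBlock.y - 1) / threadsPerBlock.y);

printf("Suggested block size: %d threads (%ux%u)\n",
       blockSize, threadsPerBlock.x, threadsPerBlock.y);
```

Note that the occupancy calculator only accounts for register and shared-memory limits, so timing the kernel (as in the Step 4 sketch) is still the final arbiter.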

### Additional Considerations
- **Memory Management**: Ensure efficient (coalesced) memory access patterns to avoid bottlenecks.
- **Concurrency**: Maximize concurrency by balancing the workload across threads and blocks.
- **Error Checking**: Add CUDA error checking after each API call to handle potential issues at runtime; a minimal macro is sketched after this list.
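
On the error-checking point, a small wrapper macro is a common pattern. The name `CUDA_CHECK` below is purely illustrative, not something provided by the CUDA Toolkit:

```c
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Illustrative macro (CUDA_CHECK is our own name, not a CUDA API):
// wrap every runtime call so failures report file/line and stop the program.
#define CUDA_CHECK(call)                                                      \
    do {                                                                      \
        cudaError_t err__ = (call);                                           \
        if (err__ != cudaSuccess) {                                           \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                       \
                    cudaGetErrorString(err__), __FILE__, __LINE__);           \
            exit(EXIT_FAILURE);                                               \
        }                                                                     \
    } while (0)

// Usage in main(), for example:
//   CUDA_CHECK(cudaMalloc(&dOutput, size));
//   mandelbrotKernel<<<numBlocks, threadsPerBlock>>>(...);
//   CUDA_CHECK(cudaGetLastError());   // catches launch-configuration errors
//   CUDA_CHECK(cudaMemcpy(hOutput, dOutput, size, cudaMemcpyDeviceToHost));
```

Checking `cudaGetLastError()` right after the launch is particularly useful, since an invalid launch configuration otherwise goes unnoticed until a later API call fails.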

By following these steps, you can create a Mandelbrot set image using CUDA and fully utilize your GPU's processing
power.

total duration: 2m15.2936722s
load duration: 4.8562ms
prompt eval count: 85 token(s)
prompt eval duration: 747ms
prompt eval rate: 113.79 tokens/s
eval count: 1424 token(s)
eval duration: 2m14.539s
eval rate: 10.58 tokens/s

michael@14900c MINGW64 ~
$ nvidia-smi
Sun Jan 26 10:57:03 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 561.17                 Driver Version: 561.17         CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX A6000             WDDM  |   00000000:01:00.0 Off |                  Off |
| 58%   85C    P2            299W /  300W |   44504MiB /  49140MiB |     98%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     14032      C   ...uda_v12_avx\ollama_llama_server.exe      N/A      |


michael@14900c MINGW64 ~
$ nvidia-smi
Sun Jan 26 11:04:59 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 561.17                 Driver Version: 561.17         CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX A6000             WDDM  |   00000000:01:00.0 Off |                  Off |
| 92%   79C    P2            299W /  300W |   44656MiB /  49140MiB |     98%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     14032      C   ...uda_v12_avx\ollama_llama_server.exe      N/A      |
