A quick preview of more ways to implement parallelism
- How much coordination/communication do you need?
- Little or none: the example was sufficient, you just want to scale up
- Large parameter space
- Same operations on many input files
- Don't need to run concurrently
# disBatch: run the script 1000 times, passing a distinct rank to each
#DISBATCH REPEAT 1000 RANK=$DISBATCH_REPEAT_INDEX subdivideCheck.py

# Slurm job array: the same idea, one array task per rank
#SBATCH --array=1-1000
RANK=$SLURM_ARRAY_TASK_ID subdivideCheck.py
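- Inside the script, each task can use its RANK to pick its share of the work; a minimal sketch (the unit-interval-per-rank mapping here is hypothetical, not the exercise's):
import os

# the launcher (disBatch or the Slurm array) sets RANK in the environment
rank = int(os.environ["RANK"])

# hypothetical rank -> work mapping: one unit-wide interval per task
lower, upper = float(rank), float(rank + 1)
# ... run subdivideCheck(lower, upper, func, step, eps) on just this piece ...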
- Need to synchronize with other ranks
- Need to share intermediate values between ranks
MPI_Init(&argc, &argv);
int rank, ranks;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &ranks);
...
float myvalue = ...;
float valuesum;
/* valuesum = sum(myvalue) over all ranks */
MPI_Allreduce(&myvalue, &valuesum, 1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
float allvalues[ranks];
/* allvalues[i] = myvalue from rank i */
MPI_Allgather(&myvalue, 1, MPI_FLOAT, allvalues, 1, MPI_FLOAT, MPI_COMM_WORLD);
...
MPI_Finalize();
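- The same collectives are available from Python via mpi4py; a sketch assuming mpi4py is installed (myvalue is a stand-in per-rank value):
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
ranks = comm.Get_size()

myvalue = float(rank)                  # stand-in per-rank value

# valuesum = sum(myvalue) over all ranks
valuesum = comm.allreduce(myvalue, op=MPI.SUM)

# allvalues[i] = myvalue from rank i
allvalues = comm.allgather(myvalue)
- Launch with, e.g., mpirun -n 4 python script.py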
- Some languages provide features to run loops in parallel
/* C with OpenMP: iterations are split across threads */
#pragma omp parallel for
for (int i = 0; i < 100000; i++) {
    ...
}
% MATLAB: parfor runs iterations on a pool of workers
parfor i=1:100000
    ...
end
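- Python's standard library gives a similar effect with a process pool; a minimal sketch, with f a hypothetical stand-in for the loop body:
from concurrent.futures import ProcessPoolExecutor

def f(i):
    return i * i                       # hypothetical loop body

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        # distribute the 100000 iterations over worker processes
        y = list(pool.map(f, range(100000)))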
- Single stream of main execution
- Sub-calculations are farmed out to a pool of workers
- Inputs and outputs for sub-calculations must be sent over the network
- Many languages have libraries that provide similar functionality
- python asyncio, dask, tensorflow, ...
- MATLAB, Julia
- C OpenMP (threads only), MPI, ... (requires marshalling)
- KVS (disBatch)
- ...
- A common abstraction for "worker pools"
- Given a pure function
- All inputs passed as arguments*
- All results returned
- No access to global data/state
- Request that this function be applied to some arguments in parallel ("asynchronously")
- Operations on a future (result)
- Check if it's done
- Wait for one (or more) futures to complete
- Pass as input to another future evaluation (chaining)
- parfor
- parallel map: apply the same function to a list of values, in parallel
# sequential loop
for i in 1..10000
    y[i] = f(x[i])

# the same loop with futures: submit everything, then wait
for i in 1..10000
    yf[i] = Future(f, x[i])
y = Wait(yf)
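- In dask this parallel-map pattern takes a few lines; a minimal runnable sketch (f is a hypothetical stand-in function) that also exercises the future operations listed above:
from dask import distributed

def f(x):
    return x * x                       # hypothetical stand-in function

if __name__ == "__main__":
    client = distributed.Client()      # local pool of worker processes
    futures = [client.submit(f, x) for x in range(10000)]
    futures[0].done()                  # check whether one has finished
    g = client.submit(f, futures[0])   # chaining: a future as an input
    distributed.wait(futures)          # block until all have completed
    y = client.gather(futures)         # fetch all the results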
(subdivideCheck loop ...):
    if 1 == randint(0, 3):
        x = lower
        while x < upper:
            z = findZero(func, x, min(x+step, upper), eps)
            if z is not None: print(z)
            x += step
def piece(lower, upper, func, step, eps) -> None:
    x = lower
    while x < upper:
        z = findZero(func, x, min(x+step, upper), eps)
        if z is not None: print(z)
        x += step
(subdivideCheck loop ...):
    if 1 == randint(0, 3):
        piece(lower, upper, func, step, eps)
from dask import distributed
client = distributed.Client()

(subdivideCheck loop ...):
    if 1 == randint(0, 3):
        client.submit(piece, lower, upper, func, step, eps)
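- One caveat with this fire-and-forget style: dask can cancel tasks whose futures have been dropped, so when you deliberately discard the future it is safer to say so explicitly:
from dask.distributed import fire_and_forget

# keep the submitted task alive even though we drop its future
fire_and_forget(client.submit(piece, lower, upper, func, step, eps))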
from typing import List

def pieceWithResults(lower, upper, func, step, eps) -> List[float]:
    r = []
    x = lower
    while x < upper:
        z = findZero(func, x, min(x+step, upper), eps)
        if z is not None:
            print(z)
            r.append(z)
        x += step
    return r
def subdivideCheck(lower, upper, func, step, eps) -> None:
    if (upper - lower) <= 1:
        if 1 == randint(0, 3):
            # fire and forget: the returned future is discarded
            client.submit(piece, lower, upper, func, step, eps)
    else:
        mid = (upper + lower)/2.
        subdivideCheck(lower, mid, func, step, eps)
        subdivideCheck(mid, upper, func, step, eps)
def subdivideCheck(client, lower, upper, func, step, eps) -> Future[List[float]]:
    if (upper - lower) <= 1:
        if 1 == randint(0, 3):
            return client.submit(pieceWithResults, lower, upper, func, step, eps)
        else: return []
    else:
        mid = (upper + lower)/2.
        rl = subdivideCheck(client, lower, mid, func, step, eps)
        rr = subdivideCheck(client, mid, upper, func, step, eps)
        return ??? rl + rr ???
def concat(a: list, b: list) -> list:
    return a + b
def subdivideCheck(client, lower, upper, func, step, eps) -> Future[List[float]]:
    if (upper - lower) <= 1:
        if 1 == randint(0, 3):
            return client.submit(pieceWithResults, lower, upper, func, step, eps)
        else: return []
    else:
        mid = (upper + lower)/2.
        rl = subdivideCheck(client, lower, mid, func, step, eps)
        rr = subdivideCheck(client, mid, upper, func, step, eps)
        # submitting concat chains the futures: dask resolves rl and rr
        # on the workers before calling concat
        return client.submit(concat, rl, rr)
r = subdivideCheck(client, ...)
print(client.gather(r))
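- Putting the pieces together: a self-contained sketch that runs locally; the quadratic test function and the bisection-based findZero are hypothetical stand-ins, not the exercise's versions:
from random import randint
from typing import List, Optional

from dask import distributed

def findZero(func, lo, hi, eps) -> Optional[float]:
    # stand-in bisection: a zero of func in [lo, hi], or None if no sign change
    flo, fhi = func(lo), func(hi)
    if flo * fhi > 0:
        return None
    while hi - lo > eps:
        mid = (lo + hi) / 2.
        if flo * func(mid) <= 0:
            hi = mid
        else:
            lo, flo = mid, func(mid)
    return (lo + hi) / 2.

def pieceWithResults(lower, upper, func, step, eps) -> List[float]:
    r, x = [], lower
    while x < upper:
        z = findZero(func, x, min(x + step, upper), eps)
        if z is not None:
            r.append(z)
        x += step
    return r

def concat(a: list, b: list) -> list:
    return a + b

def subdivideCheck(client, lower, upper, func, step, eps):
    if (upper - lower) <= 1:
        if 1 == randint(0, 3):
            return client.submit(pieceWithResults, lower, upper, func, step, eps)
        else:
            return []
    else:
        mid = (upper + lower) / 2.
        rl = subdivideCheck(client, lower, mid, func, step, eps)
        rr = subdivideCheck(client, mid, upper, func, step, eps)
        return client.submit(concat, rl, rr)

def f(x):
    return (x - 3.5) * (x - 42.5)      # hypothetical: zeros at 3.5 and 42.5

if __name__ == "__main__":
    client = distributed.Client()      # local worker pool
    r = subdivideCheck(client, 0., 100., f, 0.1, 1e-6)
    print(client.gather(r))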
| Parallelism (cores) | Sequential (runtime) | Rank (runtime) | Pool (runtime) | Speedup (factor) |
|---|---|---|---|---|
| 1 | 4:34 | 4:34 | 4:39 | 0.98 |
| 2 | | 2:22 | 2:24 | 1.9 |
| 4 | | 1:15 | 1:16 | 3.6 |
| 8 | | 0:39 | 0:39 | 7.0 |
| 16 | | 0:23 | 0:23 | 11.9 |
| 32 | | 0:17 | 0:17 | 16.1 |
- Look at example output and times
- How do your times compare?
- Printed output is out of order (but the pool's gathered result list is not) -- why?
- Decreasing marginal improvement -- why?
- The 1-core pool run is a bit slower (but the 32-core run is not) -- why?
- Identify latent parallelism in your own code
- Convert your own code to use rank-based parallelism or a worker pool