Commit 2f66ddf

add some docs on troubleshooting batch seal perf
1 parent 6daa769 commit 2f66ddf

1 file changed

+69
-3
lines changed

documentation/en/supraseal.md

@@ -23,13 +23,18 @@ SupraSeal is an optimized batch sealing implementation for Filecoin that allows
2323
- NVMe drives with high IOPS (10-20M total IOPS recommended)
2424
- GPU for PC2 phase (NVIDIA RTX 3090 or better recommended)
2525
- 1GB hugepages configured (minimum 36 pages)
26-
- Ubuntu 22.04 or compatible Linux distribution
26+
- Ubuntu 22.04 or compatible Linux distribution (gcc-11 required, doesn't need to be system-wide)
27+
- At least 256GB RAM, ALL MEMORY CHANNELS POPULATED
28+
- Without **all** memory channels populated sealing **performance will suffer drastically**
29+
- NUMA-Per-Socket (NPS) set to 1
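The hugepage requirement above can be checked with a short script. This is a minimal sketch, not part of Curio: the helper name `hugepages_1g_total` is hypothetical, and it assumes 1GB is the *default* hugepage size, so that the count appears in `/proc/meminfo`. If 1GB pages are not the default size, the count is found under `/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages` instead.

```python
REQUIRED_PAGES = 36  # minimum from the requirements list above


def hugepages_1g_total(meminfo_text):
    """Return the number of configured hugepages if the default
    hugepage size is 1GB (1048576 kB), otherwise 0.

    Hypothetical helper for illustration; parses /proc/meminfo-style text.
    """
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    # Hugepagesize is reported as e.g. "1048576 kB"
    size_kb = int(fields.get("Hugepagesize", "0 kB").split()[0])
    total = int(fields.get("HugePages_Total", "0"))
    return total if size_kb == 1048576 else 0


if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        pages = hugepages_1g_total(f.read())
    status = "ok" if pages >= REQUIRED_PAGES else "too few"
    print(f"1GB hugepages configured: {pages} ({status})")
```

A result below 36 would suggest the hugepage setup needs revisiting before starting the node.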
2730

2831
## Setup
2932

3033
### Dependencies
3134

32-
Cuda 12.x is required
35+
CUDA 12.x is required; 11.x won't work.
36+
37+
```bash
3338

3439
The build process depends on GCC 11.x system-wide or gcc-11/g++-11 installed locally.
3540
* On Arch install https://aur.archlinux.org/packages/gcc11
@@ -68,6 +73,13 @@ LayerNVMEDevices = [
6873
# Add PCIe addresses for all NVMe devices to use
6974
]
7075
76+
# Set to your desired batch size (what the batch-cpu command says your CPU supports AND what you have NVMe space for)
77+
BatchSealBatchSize = 32
78+
79+
# Pipelines can be either 1 or 2. Two pipelines double storage requirements, but in a correctly balanced system they keep
80+
# layer hashing running 100% of the time, nearly doubling throughput
81+
BatchSealPipelines = 2
82+
7183
# Set to true for Zen2 or older CPUs for compatibility
7284
SingleHasherPerThread = false
7385
```
@@ -139,10 +151,64 @@ curio seal start --now --cc --count 32 --actor f01234 --layers cluster --duratio
139151
* Monitor hasher core utilisation
140152
141153
## Troubleshooting
154+
155+
### Node doesn't start / isn't visible in the UI
142156
* Ensure hugepages are configured correctly
143157
* Check NVMe device IOPS and capacity
144158
* If spdk setup fails, try to `wipefs -a` the NVMe devices (this will wipe partitions from the devices, be careful!)
145-
* Benchmark iops with:
159+
160+
### Performance issues
161+
162+
You can monitor performance by looking at "hasher" core utilisation in e.g. `htop`.
163+
164+
To identify hasher cores, run `curio calc supraseal-config --batch-size 128` (with the correct batch size) and look for the `coordinators` entries in the output:
165+
166+
```
167+
topology:
168+
...
169+
{
170+
pc1: {
171+
writer = 1;
172+
...
173+
hashers_per_core = 2;
174+
175+
sector_configs: (
176+
{
177+
sectors = 128;
178+
coordinators = (
179+
{ core = 59;
180+
hashers = 8; },
181+
{ core = 64;
182+
hashers = 14; },
183+
{ core = 72;
184+
hashers = 14; },
185+
{ core = 80;
186+
hashers = 14; },
187+
{ core = 88;
188+
hashers = 14; }
189+
)
190+
}
191+
192+
)
193+
},
194+
195+
pc2: {
196+
...
197+
}
198+
199+
```
200+
201+
In this example, cores 59, 64, 72, 80, and 88 are "coordinators", with two hashers per core, meaning that:
202+
* In the first group, core 59 is a coordinator and cores 60-63 are hashers (4 hasher cores / 8 hasher threads)
203+
* In the second group, core 64 is a coordinator and cores 65-71 are hashers (7 hasher cores / 14 hasher threads)
204+
* And so on
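The mapping from coordinators to hasher cores can be sketched in a few lines. This is an illustration only, assuming (as in the example above) that each group's hasher cores immediately follow its coordinator core; the function name `hasher_cores` is hypothetical and the values are copied from the sample output.

```python
import math


def hasher_cores(coordinators, hashers_per_core):
    """Map each coordinator core to the hasher cores that follow it.

    coordinators: list of (core, hashers) pairs from the `coordinators` block.
    Assumes hasher cores are numbered directly after the coordinator core.
    """
    groups = []
    for core, hashers in coordinators:
        # Number of physical cores needed for this group's hasher threads
        n_cores = math.ceil(hashers / hashers_per_core)
        groups.append((core, list(range(core + 1, core + 1 + n_cores))))
    return groups


# Values from the example topology above (hashers_per_core = 2)
coordinators = [(59, 8), (64, 14), (72, 14), (80, 14), (88, 14)]

for coord, cores in hasher_cores(coordinators, 2):
    print(f"coordinator {coord}: hasher cores {cores[0]}-{cores[-1]}")
```

These are the cores to watch in `htop` when checking hasher utilisation.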
205+
206+
Coordinator cores will usually sit at 100% utilisation, and hasher threads **SHOULD** also sit at 100% utilisation. Anything less
207+
indicates a bottleneck in the system, such as insufficient NVMe IOPS, insufficient memory bandwidth, or an incorrect NUMA setup.
208+
209+
To troubleshoot:
210+
* Read the requirements at the top of this page very carefully
211+
* Benchmark IOPS with:
146212
```bash
147213
cd extern/supra_seal/deps/spdk-v22.09/
148214
