Commit a296db5

Update LIFL Documentation (#585)
* add configs for top_aggregator in eager_hier_mnist example
* Add example of Eager Aggregation
* Update .gitignore
* Update lifl.md
1 parent e479cfb commit a296db5

6 files changed, +288 -1 lines changed
.gitignore (+9 -1)

@@ -55,5 +55,13 @@ requirements.lock
 # React dependencies
 node_modules
 
-# Dataset
+# FedScale Dataset
 third_party/benchmark/dataset/data/
+
+# Torchvision built-in datasets
+lib/python/examples/**/data/
+
+# Object and binary from SPRIGHT
+third_party/spright_utility/**/*.o
+third_party/spright_utility/**/*.d
+third_party/spright_utility/bin/

docs/lifl/lifl.md (+101)

@@ -1,3 +1,104 @@
 # LIFL Instructions
 
 This document provides instructions on how to use LIFL in flame.
+
+## Prerequisites
+The target runtime environment of LIFL is Linux **only**. LIFL requires Linux kernel version >= 5.15. We have tested LIFL on Ubuntu 20.
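
The kernel requirement above can be checked before running the setup steps. The following is a minimal sketch and not part of LIFL: the helper name `kernel_at_least` is hypothetical, and it assumes the release string begins with `major.minor`, as Linux release strings such as `5.15.0-91-generic` do.

```python
import platform
import re


def kernel_at_least(release, minimum=(5, 15)):
    """Return True if a release string like '5.15.0-91-generic' meets `minimum`.

    Hypothetical helper for illustration; only the leading 'major.minor'
    part of the release string is compared.
    """
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return (int(match.group(1)), int(match.group(2))) >= minimum


if __name__ == "__main__":
    if platform.system() != "Linux":
        print("LIFL targets Linux only")
    else:
        release = platform.release()
        verdict = "meets" if kernel_at_least(release) else "is below"
        print(f"kernel {release} {verdict} the 5.15 requirement")
```

If the check reports a kernel below 5.15, the kernel upgrade step in the environment setup applies.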
+
+## Environment Setup
+
+### 1. Upgrade kernel
+*Note: if you have kernel version >=5.15, please skip this step*
+
+```bash
+# Execute the kernel upgrade script
+cd third_party/spright_utility/scripts
+./upgrade_kernel.sh
+```
+
+### 2. Install libbpf
+
+```bash
+# Install deps for libbpf
+sudo apt update && sudo apt install -y flex bison build-essential dwarves libssl-dev \
+    libelf-dev pkg-config libconfig-dev clang gcc-multilib
+
+# Execute the libbpf installation script
+cd third_party/spright_utility/scripts
+./libbpf.sh
+```
+
+## Shared Memory Backend in LIFL
+
+The [shared memory backend](../../lib/python/flame/backend/shm.py) in LIFL uses eBPF's sockmap and SK_MSG to pass buffer references between aggregators. We introduce a "[sockmap_manager](../../third_party/spright_utility/src/sockmap_manager.c)" on each node to manage the registration of each aggregator's socket with the in-kernel sockmap. You must run the `sockmap_manager` first.
+
+```bash
+# Execute the sockmap_manager
+cd third_party/spright_utility/
+
+sudo ./bin/sockmap_manager
+```
+
+To enable the shared memory backend in a channel, add `shm` to the `brokers` field in the config:
+
+```yaml
+"brokers": [
+    {
+        "host": "localhost",
+        "sort": "mqtt"
+    },
+    {
+        "host": "localhost:10104",
+        "sort": "p2p"
+    },
+    {
+        "host": "localhost:10105",
+        "sort": "shm"
+    }
+],
+```
+
+You also need to set the channel's backend type to `shm` so that the channel uses the shared memory backend during its initialization.
+
+```yaml
+"channels": [
+    {
+        "name": "top-agg-coord-channel",
+        ...
+    },
+    {
+        "name": "global-channel",
+        ...
+        "backend": "shm",
+        ...
+    }
+],
+```
+
+We offer sample configs in the [coord_3_hier_syncfl_mnist](../../lib/python/examples/coord_3_hier_syncfl_mnist/) and [coord_hier_syncfl_mnist](../../lib/python/examples/coord_hier_syncfl_mnist/) examples.
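
The two config requirements above (an `shm` entry in `brokers` and `"backend": "shm"` on the channel) can be checked together before launching a role. This is a minimal sketch, not Flame's own validation: the function `shm_channels` is hypothetical, and the sample config is a trimmed illustration that uses only the field names shown above.

```python
import json

# Trimmed, illustrative config using only field names from the document.
SAMPLE = """
{
  "brokers": [
    {"host": "localhost", "sort": "mqtt"},
    {"host": "localhost:10104", "sort": "p2p"},
    {"host": "localhost:10105", "sort": "shm"}
  ],
  "channels": [
    {"name": "top-agg-coord-channel"},
    {"name": "global-channel", "backend": "shm"}
  ]
}
"""


def shm_channels(config):
    """Return names of channels that request the shm backend.

    Hypothetical helper: raises ValueError if a channel asks for shm
    but no broker with "sort": "shm" is listed.
    """
    broker_sorts = {broker.get("sort") for broker in config.get("brokers", [])}
    names = [
        channel.get("name")
        for channel in config.get("channels", [])
        if channel.get("backend") == "shm"
    ]
    if names and "shm" not in broker_sorts:
        raise ValueError(f"channels {names} use shm but no shm broker is listed")
    return names


if __name__ == "__main__":
    print(shm_channels(json.loads(SAMPLE)))
```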
+
+## Hierarchical Aggregation in LIFL
+
+Flame originally supported hierarchical aggregation with two levels: a top level and a leaf level. A two-level example is at [coord_hier_syncfl_mnist](../../lib/python/examples/coord_hier_syncfl_mnist/). LIFL extends hierarchical aggregation in Flame to three levels: top, middle, and leaf. A three-level example is at [coord_3_hier_syncfl_mnist](../../lib/python/examples/coord_3_hier_syncfl_mnist/).
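
To illustrate the three-level structure described above, here is a minimal sketch that assumes aggregation is a sample-count-weighted average (FedAvg style). The document does not state which aggregation rule LIFL uses, so that rule, the function `aggregate`, and the toy scalar "models" are all assumptions for illustration. Under this assumption, aggregating leaf, then middle, then top gives the same result as one flat aggregation over all trainers.

```python
def aggregate(updates):
    """Combine (value, count) pairs into one (weighted mean, total count) pair.

    Assumed aggregation rule for illustration: weighted average by count.
    """
    total = sum(count for _, count in updates)
    mean = sum(value * count for value, count in updates) / total
    return mean, total


# Toy trainer updates as (model value, number of samples); scalars stand in
# for model weights.
leaf_a = [(1.0, 10), (3.0, 30)]
leaf_b = [(2.0, 20)]
leaf_c = [(4.0, 40)]

# Leaf level: each leaf aggregator combines its own trainers.
leaf_results = [aggregate(leaf_a), aggregate(leaf_b), aggregate(leaf_c)]

# Middle level: one middle aggregator takes leaves A and B, another takes C.
middle_results = [aggregate(leaf_results[:2]), aggregate(leaf_results[2:])]

# Top level: combine the middle aggregators.
hierarchical_mean, hierarchical_total = aggregate(middle_results)

# Flat reference: aggregate every trainer update in one step.
flat_mean, flat_total = aggregate(leaf_a + leaf_b + leaf_c)

if __name__ == "__main__":
    print(hierarchical_mean, flat_mean, hierarchical_total, flat_total)
```

Because each level forwards the total count along with the mean, the weighting is preserved across levels up to floating-point rounding.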
+
+## Eager Aggregation in LIFL
+
+Flame originally supported lazy aggregation only. LIFL adds support for eager aggregation in Flame, which allows more flexible timing of the aggregation process. An example that runs eager aggregation is available at [eager_hier_mnist](../../lib/python/examples/eager_hier_mnist/). The implementation of eager aggregation is available at [eager_syncfl](../../lib/python/flame/mode/horizontal/eager_syncfl/).
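
The difference between the two modes can be sketched as follows. This is an illustration under stated assumptions, not the `eager_syncfl` implementation: it assumes "lazy" means buffering all updates and aggregating once at the end, "eager" means folding each update into a running aggregate as it arrives, and the aggregation rule is a count-weighted average. The class names are hypothetical.

```python
class LazyAggregator:
    """Buffers every update and aggregates only when the result is requested."""

    def __init__(self):
        self._pending = []

    def receive(self, value, count):
        self._pending.append((value, count))

    def result(self):
        total = sum(count for _, count in self._pending)
        return sum(value * count for value, count in self._pending) / total


class EagerAggregator:
    """Folds each update into a running weighted sum as soon as it arrives."""

    def __init__(self):
        self._weighted_sum = 0.0
        self._total = 0

    def receive(self, value, count):
        self._weighted_sum += value * count
        self._total += count

    def result(self):
        return self._weighted_sum / self._total


updates = [(1.0, 10), (3.0, 30), (2.0, 20)]
lazy, eager = LazyAggregator(), EagerAggregator()
for value, count in updates:
    lazy.receive(value, count)
    eager.receive(value, count)

if __name__ == "__main__":
    print(lazy.result(), eager.result())
```

In this sketch both modes produce the same average; the eager variant keeps only a running sum instead of a buffer of pending updates, so the work is spread across arrivals rather than deferred to the end of the round.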
+
+## Problems when running LIFL
+1. When you run `sudo ./bin/sockmap_manager`, you may receive:
+```
+./bin/sockmap_manager: error while loading shared libraries: libbpf.so.0: cannot open shared object file: No such file or directory
+```
+
+Solution: This may happen on Ubuntu 22, which has libbpf 0.5.0 pre-installed. You need to re-link `/lib/x86_64-linux-gnu/libbpf.so.0` to `libbpf.so.0.6.0`:
+```bash
+# Assume you have executed the libbpf installation script
+cd third_party/spright_utility/scripts/libbpf/src
+
+# Copy libbpf.so.0.6.0 to /lib/x86_64-linux-gnu/
+sudo cp libbpf.so.0.6.0 /lib/x86_64-linux-gnu/
+
+# Re-link libbpf.so.0
+sudo ln -sf /lib/x86_64-linux-gnu/libbpf.so.0.6.0 /lib/x86_64-linux-gnu/libbpf.so.0
+```
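
After re-linking, it can help to confirm which file `libbpf.so.0` actually resolves to. A minimal sketch, using only the Python standard library; the helper name `resolve_link` is hypothetical, and the path is the one given in the troubleshooting step above.

```python
import os


def resolve_link(path):
    """Return the real file behind `path`, or None if the link does not exist.

    Hypothetical helper for illustration; follows symlinks with
    os.path.realpath.
    """
    if not os.path.lexists(path):
        return None
    return os.path.realpath(path)


if __name__ == "__main__":
    link = "/lib/x86_64-linux-gnu/libbpf.so.0"
    target = resolve_link(link)
    if target is None:
        print(f"{link} is missing")
    else:
        print(f"{link} -> {target}")
```

If the re-link step succeeded, the printed target should end in `libbpf.so.0.6.0`.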
@@ -0,0 +1,102 @@
+{
+    "taskid": "49d06b7526964db86cf37c70e8e0cdb6bd7aa744",
+    "backend": "p2p",
+    "brokers": [
+        {
+            "host": "localhost",
+            "sort": "mqtt"
+        },
+        {
+            "host": "localhost:10104",
+            "sort": "p2p"
+        },
+        {
+            "host": "localhost:10105",
+            "sort": "shm"
+        }
+    ],
+    "groupAssociation": {
+        "param-channel": "us",
+        "global-channel": "default"
+    },
+    "channels": [
+        {
+            "description": "Model update is sent from mid aggregator to global aggregator and vice-versa",
+            "groupBy": {
+                "type": "tag",
+                "value": [
+                    "default"
+                ]
+            },
+            "name": "global-channel",
+            "pair": [
+                "top-aggregator",
+                "middle-aggregator"
+            ],
+            "backend": "shm",
+            "funcTags": {
+                "top-aggregator": [
+                    "distribute",
+                    "aggregate"
+                ],
+                "middle-aggregator": [
+                    "fetch",
+                    "upload"
+                ]
+            }
+        },
+        {
+            "description": "Model update is sent from mid aggregator to trainer and vice-versa",
+            "groupBy": {
+                "type": "tag",
+                "value": [
+                    "uk",
+                    "us"
+                ]
+            },
+            "name": "param-channel",
+            "pair": [
+                "middle-aggregator",
+                "trainer"
+            ],
+            "funcTags": {
+                "middle-aggregator": [
+                    "distribute",
+                    "aggregate"
+                ],
+                "trainer": [
+                    "fetch",
+                    "upload"
+                ]
+            }
+        }
+    ],
+    "dataset": "https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz",
+    "dependencies": [
+        "numpy >= 1.2.0"
+    ],
+    "hyperparameters": {
+        "batchSize": 32,
+        "learningRate": 0.01,
+        "rounds": 5
+    },
+    "baseModel": {
+        "name": "",
+        "version": 1
+    },
+    "job": {
+        "id": "622a358619ab59012eabeefb",
+        "name": "mnist"
+    },
+    "registry": {
+        "sort": "dummy",
+        "uri": ""
+    },
+    "selector": {
+        "sort": "default",
+        "kwargs": {}
+    },
+    "maxRunTime": 300,
+    "realm": "default-cluster",
+    "role": "middle-aggregator"
+}
@@ -0,0 +1,76 @@
+{
+    "taskid": "49d06b7526964db86cf37c70e8e0cdb6bd7aa742",
+    "backend": "p2p",
+    "brokers": [
+        {
+            "host": "localhost",
+            "sort": "mqtt"
+        },
+        {
+            "host": "localhost:10104",
+            "sort": "p2p"
+        },
+        {
+            "host": "localhost:10105",
+            "sort": "shm"
+        }
+    ],
+    "groupAssociation": {
+        "global-channel": "default"
+    },
+    "channels": [
+        {
+            "description": "Model update is sent from mid aggregator to global aggregator and vice-versa",
+            "groupBy": {
+                "type": "tag",
+                "value": [
+                    "default"
+                ]
+            },
+            "name": "global-channel",
+            "pair": [
+                "top-aggregator",
+                "middle-aggregator"
+            ],
+            "backend": "shm",
+            "funcTags": {
+                "top-aggregator": [
+                    "distribute",
+                    "aggregate"
+                ],
+                "middle-aggregator": [
+                    "fetch",
+                    "upload"
+                ]
+            }
+        }
+    ],
+    "dataset": "https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz",
+    "dependencies": [
+        "numpy >= 1.2.0"
+    ],
+    "hyperparameters": {
+        "batchSize": 32,
+        "learningRate": 0.01,
+        "rounds": 5
+    },
+    "baseModel": {
+        "name": "",
+        "version": 1
+    },
+    "job": {
+        "id": "622a358619ab59012eabeefb",
+        "name": "mnist"
+    },
+    "registry": {
+        "sort": "dummy",
+        "uri": "http://flame-mlflow:5000"
+    },
+    "selector": {
+        "sort": "default",
+        "kwargs": {}
+    },
+    "maxRunTime": 300,
+    "realm": "",
+    "role": "top-aggregator"
+}
