Commit e7c0c4d

Added support for MinIO and B2 buckets (#620)
* Added support for MinIO and B2 buckets
  - Refactored SilNlpEnv in silnlp/common/environment.py to support connection to either MinIO or B2
  - Kept in support for AWS temporarily
  - Updated readme and other documentation to show instructions on MinIO and B2 bucket setup
* Updated clean_s3 to support MinIO
* Made 'minio' the default bucket_service
1 parent 2669e2f commit e7c0c4d

File tree

14 files changed

+262
-162
lines changed

.devcontainer/Dockerfile

Lines changed: 0 additions & 1 deletion
@@ -43,6 +43,5 @@ ENV SIL_NLP_CACHE_EXPERIMENT_DIR=/root/.cache/silnlp/experiments
4343
ENV SIL_NLP_CACHE_PROJECT_DIR=/root/.cache/silnlp/projects
4444
# Set environment variables
4545
ENV CLEARML_API_HOST="https://api.sil.hosted.allegro.ai"
46-
ENV SIL_NLP_DATA_PATH=/silnlp
4746
ENV EFLOMAL_PATH=/workspaces/silnlp/.venv/lib/python3.10/site-packages/eflomal/bin
4847
CMD ["bash"]

.devcontainer/devcontainer.json

Lines changed: 6 additions & 2 deletions
@@ -12,14 +12,18 @@
1212
"--gpus",
1313
"all",
1414
"-v",
15-
"${env:HOME}/.aws:/root/.aws", // Mount user's AWS credentials into the container
16-
"-v",
1715
"${env:HOME}/clearml/.clearml/hf-cache:/root/.cache/huggingface"
1816
],
1917
"containerEnv": {
2018
"AWS_REGION": "${localEnv:AWS_REGION}",
2119
"AWS_ACCESS_KEY_ID": "${localEnv:AWS_ACCESS_KEY_ID}",
2220
"AWS_SECRET_ACCESS_KEY": "${localEnv:AWS_SECRET_ACCESS_KEY}",
21+
"MINIO_ENDPOINT_URL": "${localEnv:MINIO_ENDPOINT_URL}",
22+
"MINIO_ACCESS_KEY": "${localEnv:MINIO_ACCESS_KEY}",
23+
"MINIO_SECRET_KEY": "${localEnv:MINIO_SECRET_KEY}",
24+
"B2_ENDPOINT_URL": "${localEnv:B2_ENDPOINT_URL}",
25+
"B2_KEY_ID": "${localEnv:B2_KEY_ID}",
26+
"B2_APPLICATION_KEY": "${localEnv:B2_APPLICATION_KEY}",
2327
"CLEARML_API_ACCESS_KEY": "${localEnv:CLEARML_API_ACCESS_KEY}",
2428
"CLEARML_API_SECRET_KEY": "${localEnv:CLEARML_API_SECRET_KEY}"
2529
},

README.md

Lines changed: 28 additions & 22 deletions
@@ -62,15 +62,18 @@ These are the main requirements for the SILNLP code to run on a local machine. S
6262
Create a text file with the following content and edit as necessary:
6363
```
6464
CLEARML_API_HOST="https://api.sil.hosted.allegro.ai"
65-
CLEARML_API_ACCESS_KEY=xxxxx
66-
CLEARML_API_SECRET_KEY=xxxxx
67-
AWS_REGION="us-east-1"
68-
AWS_ACCESS_KEY_ID=xxxxx
69-
AWS_SECRET_ACCESS_KEY=xxxxx
70-
SIL_NLP_DATA_PATH="/silnlp"
71-
```
72-
* If you do not intend to use SILNLP with ClearML and/or AWS, you can leave out the respective variables. If you need to generate ClearML credentials, see [ClearML setup](clear_ml_setup.md).
73-
* Note that this does not give you direct access to an AWS S3 bucket from within the Docker container, it only allows you to run scripts referencing files in the bucket.
65+
CLEARML_API_ACCESS_KEY=xxxxxxx
66+
CLEARML_API_SECRET_KEY=xxxxxxx
67+
MINIO_ENDPOINT_URL=https://truenas.psonet.languagetechnology.org:9000
68+
MINIO_ACCESS_KEY=xxxxxxxxx
69+
MINIO_SECRET_KEY=xxxxxxx
70+
B2_ENDPOINT_URL=https://s3.us-east-005.backblazeb2.com
71+
B2_KEY_ID=xxxxxxxx
72+
B2_APPLICATION_KEY=xxxxxxxx
73+
```
74+
* Include SIL_NLP_DATA_PATH="/silnlp" if you are not using MinIO or B2 and will be storing files locally.
75+
* If you do not intend to use SILNLP with ClearML, MinIO, and/or B2, you can leave out the respective variables. If you need to generate ClearML credentials, see [ClearML setup](clear_ml_setup.md).
76+
* Note that this does not give you direct access to a MinIO or B2 bucket from within the Docker container, it only allows you to run scripts referencing files in the bucket.
7477

7578
6. Start container
7679

@@ -129,22 +132,25 @@ These are the main requirements for the SILNLP code to run on a local machine. S
129132
poetry install
130133
```
131134
132-
10. If using ClearML and/or AWS, set the following environment variables:
135+
10. If using ClearML, MinIO, and/or B2, set the following environment variables:
133136
```
134137
CLEARML_API_HOST="https://api.sil.hosted.allegro.ai"
135-
CLEARML_API_ACCESS_KEY=xxxxx
136-
CLEARML_API_SECRET_KEY=xxxxx
137-
AWS_REGION="us-east-1"
138-
AWS_ACCESS_KEY_ID=xxxxx
139-
AWS_SECRET_ACCESS_KEY=xxxxx
140-
SIL_NLP_DATA_PATH="/silnlp"
141-
```
138+
CLEARML_API_ACCESS_KEY=xxxxxxx
139+
CLEARML_API_SECRET_KEY=xxxxxxx
140+
MINIO_ENDPOINT_URL=https://truenas.psonet.languagetechnology.org:9000
141+
MINIO_ACCESS_KEY=xxxxxxxxx
142+
MINIO_SECRET_KEY=xxxxxxx
143+
B2_ENDPOINT_URL=https://s3.us-east-005.backblazeb2.com
144+
B2_KEY_ID=xxxxxxxx
145+
B2_APPLICATION_KEY=xxxxxxxx
146+
```
147+
* Include SIL_NLP_DATA_PATH="/silnlp" if you are not using MinIO or B2 and will be storing files locally.
142148
* If you need to generate ClearML credentials, see [ClearML setup](clear_ml_setup.md).
143-
* Note that this does not give you direct access to an AWS S3 bucket from within the Docker container, it only allows you to run scripts referencing files in the bucket.
149+
* Note that this does not give you direct access to a MinIO or B2 bucket from within the Docker container, it only allows you to run scripts referencing files in the bucket.
144150
* For instructions on how to permanently set up environment variables for your operating system, see the corresponding section under the Development Environment Setup header below.
145151
146-
11. If using AWS, there are two options:
147-
* Option 1: Mount the bucket to your filesystem following the instructions under [Install and Configure Rclone](https://github.com/sillsdev/silnlp/blob/master/s3_bucket_setup.md#install-and-configure-rclone).
152+
11. If using MinIO or B2, there are two options:
153+
* Option 1: Mount the bucket to your filesystem following the instructions under [Install and Configure Rclone](https://github.com/sillsdev/silnlp/blob/master/bucket_setup.md#install-and-configure-rclone).
148154
* Option 2: Create a local cache for the bucket following the instructions under [Create SILNLP cache](https://github.com/sillsdev/silnlp/blob/master/manual_setup.md#create-silnlp-cache).
149155
150156
## Development Environment Setup
@@ -177,7 +183,7 @@ Follow the instructions below to set up a Dev Container in VS Code. This is the
177183
178184
4. Define environment variables.
179185
180-
Set the following environment variables with your respective credentials: CLEARML_API_ACCESS_KEY, CLEARML_API_SECRET_KEY, AWS_ACCESS_KEY_ID, and AWS_SECRET_ACCESS_KEY. Additionally, set AWS_REGION. The typical value is "us-east-1".
186+
Set the following environment variables with your respective credentials: CLEARML_API_ACCESS_KEY, CLEARML_API_SECRET_KEY, MINIO_ACCESS_KEY, MINIO_SECRET_KEY, B2_KEY_ID, and B2_APPLICATION_KEY. Also set MINIO_ENDPOINT_URL to https://truenas.psonet.languagetechnology.org:9000 and B2_ENDPOINT_URL to https://s3.us-east-005.backblazeb2.com, both without quotation marks.
181187
* Linux / macOS users: To set environment variables permanently, add each variable as a new line to the `.bashrc` file (Linux) or `.profile` file (macOS) in your home directory with the format
182188
```
183189
export VAR="VAL"
@@ -210,7 +216,7 @@ Follow the instructions below to set up a Dev Container in VS Code. This is the
210216
10. Install and activate Poetry environment.
211217
* In the VS Code terminal, run `poetry install` to install the necessary Python libraries, and then run `poetry shell` to enter the environment in the terminal.
212218
213-
11. (Optional) Locally mount the S3 bucket. This will allow you to interact directly with the S3 bucket from your local terminal (outside of the dev container). See instructions [here](s3_bucket_setup.md).
219+
11. (Optional) Locally mount the MinIO and/or B2 bucket(s). This will allow you to interact directly with the bucket(s) from your local terminal (outside of the dev container). See instructions [here](bucket_setup.md).
214220
215221
To get back into the dev container and poetry environment each subsequent time, open the silnlp folder in VS Code, select the "Reopen in Container" option from the Remote Connection menu (bottom left corner), and use the `poetry shell` command in the terminal.
216222

bucket_setup.md

Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
1+
# MinIO/B2 bucket setup
2+
3+
We use MinIO and Backblaze B2 storage for storing our experiment data. Here is some workspace setup to enable a decent workflow.
4+
5+
### Note For MinIO setup
6+
7+
In order to access the MinIO bucket locally, you must have a VPN connected to its network. If you need VPN access, please reach out to an SILNLP dev team member.
8+
9+
### Note For Backblaze B2 usage
10+
11+
Backblaze B2 is only used as a backup storage option when the MinIO bucket is unavailable or when running experiments from the ORU Titan Server.
12+
13+
### Install and configure rclone
14+
15+
**Windows**
16+
17+
The following will mount /silnlp on your B drive or /nlp-research on your M drive and allow you to explore, read and write.
18+
* Install WinFsp: http://www.secfs.net/winfsp/rel/ (Click the button to "Download WinFsp Installer" not the "SSHFS-Win (x64)" installer)
19+
* Download rclone from: https://rclone.org/downloads/
20+
* Unzip to your desktop (or some convenient location).
21+
* Add the folder that contains rclone.exe to your PATH environment variable.
22+
* Take the `scripts/rclone/rclone.conf` file from this SILNLP repo and copy it to `~\AppData\Roaming\rclone` (creating folders if necessary)
23+
* Add your credentials in the appropriate fields in `~\AppData\Roaming\rclone\rclone.conf`
24+
* Take the `scripts/rclone/mount_minio_to_m.bat` and `scripts/rclone/mount_b2_to_b.bat` files from this SILNLP repo and copy them to the folder that contains the unzipped rclone.
25+
* Double-click either bat file. A command window should open and remain open. If you ran mount_minio_to_m.bat, you should see something like:
26+
```
27+
C:\Users\David\Software\rclone>call rclone mount --vfs-cache-mode full --use-server-modtime miniosilnlp:nlp-research M:
28+
The service rclone has been started.
29+
```
30+
31+
**Linux / macOS**
32+
33+
The following will mount /nlp-research to a M folder or /silnlp to a B folder in your home directory and allow you to explore, read and write.
34+
* For macOS, first download and install macFUSE: https://osxfuse.github.io/
35+
* Download rclone from: https://rclone.org/install/
36+
* Take the `scripts/rclone/rclone.conf` file from this SILNLP repo and copy it to `~/.config/rclone/rclone.conf` (creating folders if necessary)
37+
* Add your credentials in the appropriate fields in `~/.config/rclone/rclone.conf`
38+
* Create a folder called "M" or "B" in your user directory
39+
* Run the following command for MinIO:
40+
```
41+
rclone mount --vfs-cache-mode full --use-server-modtime miniosilnlp:nlp-research ~/M
42+
```
43+
* OR run the following command for B2:
44+
```
45+
rclone mount --vfs-cache-mode full --use-server-modtime b2silnlp:silnlp ~/B
46+
```
47+
### To start M: and/or B: drive on start up
48+
49+
**Windows**
50+
51+
Put a shortcut to the mount_minio_to_m.bat and/or mount_b2_to_b.bat file in the Startup folder.
52+
* In Windows Explorer put `shell:startup` in the address bar or open `C:\Users\<Username>\AppData\Roaming\Microsoft\Windows\Start Menu\Programs\Startup`
53+
* Right click to add a new shortcut. Choose `mount_minio_to_m.bat` and/or `mount_b2_to_b.bat` as the target; you can leave the name as the default.
54+
55+
Now your MinIO or B2 bucket should be mounted as M: or B: drive, respectively, when you start Windows.
56+
57+
**Linux / macOS**
58+
* Run `crontab -e`
59+
* For MinIO, paste `@reboot rclone mount --vfs-cache-mode full --use-server-modtime miniosilnlp:nlp-research ~/M` into the file, save and exit
60+
* For B2, paste `@reboot rclone mount --vfs-cache-mode full --use-server-modtime b2silnlp:silnlp ~/B` into the file, save and exit
61+
* Reboot Linux / macOS
62+
63+
Now your MinIO or B2 bucket should be mounted as ~/M or ~/B respectively when you start Linux / macOS.

manual_setup.md

Lines changed: 7 additions & 4 deletions
@@ -73,9 +73,9 @@ __Download and install__ the following before creating any projects or starting
7373
"editor.formatOnSave": true,
7474
```
7575

76-
### S3 bucket setup
76+
### MinIO and/or B2 bucket(s) setup
7777

78-
See [S3 bucket setup](s3_bucket_setup.md).
78+
See [Bucket setup](bucket_setup.md).
7979

8080
### ClearML setup
8181

@@ -88,8 +88,11 @@ See [ClearML setup](clear_ml_setup.md).
8888
* Create the directory "$HOME/.cache/silnlp/projects" and set the environment variable SIL_NLP_CACHE_PROJECT_DIR to that path.
8989

9090
### Additional Environment Variables
91-
* Set the following environment variables with your respective credentials: CLEARML_API_ACCESS_KEY, CLEARML_API_SECRET_KEY, AWS_ACCESS_KEY_ID, and AWS_SECRET_ACCESS_KEY.
92-
* Set SIL_NLP_DATA_PATH to "/silnlp" and CLEARML_API_HOST to "https://api.sil.hosted.allegro.ai".
91+
* Set the following environment variables with your respective credentials: CLEARML_API_ACCESS_KEY, CLEARML_API_SECRET_KEY, MINIO_ACCESS_KEY, MINIO_SECRET_KEY, B2_KEY_ID, and B2_APPLICATION_KEY.
92+
* Set SIL_NLP_DATA_PATH to "/silnlp" if you are not using MinIO or B2 and will be storing files locally.
93+
* Set CLEARML_API_HOST to "https://api.sil.hosted.allegro.ai".
94+
* Set MINIO_ENDPOINT_URL to https://truenas.psonet.languagetechnology.org:9000
95+
* Set B2_ENDPOINT_URL to https://s3.us-east-005.backblazeb2.com
9396

9497
### Setting Up and Running Experiments
9598
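
The environment variables listed in the documentation changes above can be checked before running an experiment. The following is a minimal sketch, not part of this commit: `REQUIRED_VARS` and `missing_variables` are hypothetical names, and grouping the variables by service is an assumption based on the README text.

```python
import os
from typing import Dict, List, Mapping

# Variable names are taken from the README and manual_setup.md changes.
# The grouping by service is an assumption made for this sketch.
REQUIRED_VARS: Dict[str, List[str]] = {
    "clearml": ["CLEARML_API_HOST", "CLEARML_API_ACCESS_KEY", "CLEARML_API_SECRET_KEY"],
    "minio": ["MINIO_ENDPOINT_URL", "MINIO_ACCESS_KEY", "MINIO_SECRET_KEY"],
    "b2": ["B2_ENDPOINT_URL", "B2_KEY_ID", "B2_APPLICATION_KEY"],
}


def missing_variables(env: Mapping[str, str] = os.environ) -> Dict[str, List[str]]:
    """Map each service to the variables that are unset or empty in env."""
    return {
        service: [name for name in names if not env.get(name)]
        for service, names in REQUIRED_VARS.items()
    }


if __name__ == "__main__":
    # Report the status of each service for the current environment.
    for service, missing in missing_variables().items():
        status = "ok" if not missing else "missing " + ", ".join(missing)
        print(f"{service}: {status}")
```

A service with an empty list is fully configured. As the README notes, the variables for any service you do not intend to use can be left out.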

s3_bucket_setup.md

Lines changed: 0 additions & 56 deletions
This file was deleted.

scripts/clean_s3.py

Lines changed: 25 additions & 8 deletions
@@ -1,6 +1,7 @@
11
import argparse
22
import csv
33
import datetime
4+
import os
45
import re
56
import time
67
from typing import Tuple
@@ -39,30 +40,46 @@ def clean_research(max_months: int, dry_run: bool) -> Tuple[int, int]:
3940
)
4041
# create a csv filename to store the deleted files that includes the current datetime
4142
output_csv = f"deleted_research_files_{time.strftime('%Y%m%d-%H%M%S')}" + ("_dryrun" if dry_run else "") + ".csv"
42-
return _delete_data(max_months, dry_run, regex_to_delete, output_csv, checkpoint_protection=True)
43+
return _delete_data(
44+
max_months, dry_run, regex_to_delete, output_csv, bucket_service="minio", checkpoint_protection=True
45+
)
4346

4447

4548
def clean_production(max_months: int, dry_run: bool) -> Tuple[int, int]:
4649
print("Cleaning production")
4750
regex_to_delete = re.compile(r"^(production|dev|int-qa|ext-qa)/builds/.+")
4851
output_csv = f"deleted_production_files_{time.strftime('%Y%m%d-%H%M%S')}" + ("_dryrun" if dry_run else "") + ".csv"
49-
return _delete_data(max_months, dry_run, regex_to_delete, output_csv)
52+
return _delete_data(max_months, dry_run, regex_to_delete, output_csv, bucket_service="aws")
5053

5154

5255
def _delete_data(
53-
max_months: int, dry_run: bool, regex_to_delete: str, output_csv: str, checkpoint_protection: bool = False
56+
max_months: int,
57+
dry_run: bool,
58+
regex_to_delete: str,
59+
output_csv: str,
60+
bucket_service: str,
61+
checkpoint_protection: bool = False,
5462
) -> Tuple[int, int]:
5563
max_age = max_months * MONTH_IN_SECONDS
56-
57-
s3 = boto3.client("s3")
64+
if bucket_service == "minio":
65+
s3 = boto3.client(
66+
"s3",
67+
endpoint_url=os.getenv("MINIO_ENDPOINT_URL"),
68+
aws_access_key_id=os.getenv("MINIO_ACCESS_KEY"),
69+
aws_secret_access_key=os.getenv("MINIO_SECRET_KEY"),
70+
)
71+
bucket_name = "nlp-research"
72+
else:
73+
s3 = boto3.client("s3")
74+
bucket_name = "silnlp"
5875
paginator = s3.get_paginator("list_objects_v2")
5976
total_deleted = 0
6077
storage_space_freed = 0
6178
keep_until_dates = {}
6279
# First pass, identify keep until files
6380
# which must follow the format keep_until_YYYY-MM-DD.lock and be located in the same folder
6481
# as the experiment's config.yml file
65-
for page in paginator.paginate(Bucket="silnlp"):
82+
for page in paginator.paginate(Bucket=bucket_name):
6683
for obj in page["Contents"]:
6784
s3_filename = obj["Key"]
6885
parts = s3_filename.split("/")
@@ -83,7 +100,7 @@ def _delete_data(
83100
csv_writer.writerow(["Filename", "LastModified", "Eligible for Deletion", "Extra Info"])
84101
else:
85102
csv_writer.writerow(["Filename", "LastModified", "Deleted", "Extra Info"])
86-
for page in paginator.paginate(Bucket="silnlp"):
103+
for page in paginator.paginate(Bucket=bucket_name):
87104
for obj in page["Contents"]:
88105
s3_filename = obj["Key"]
89106
if regex_to_delete.search(s3_filename) is None:
@@ -126,7 +143,7 @@ def _delete_data(
126143
print(s3_filename)
127144
print(f"{(now - last_modified) / MONTH_IN_SECONDS} months old")
128145
if not dry_run:
129-
s3.delete_object(Bucket="silnlp", Key=s3_filename)
146+
s3.delete_object(Bucket=bucket_name, Key=s3_filename)
130147
print("Deleted")
131148
total_deleted += 1
132149
storage_space_freed += obj["Size"]
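
The client selection added to `_delete_data` can be isolated into a small helper that is easy to test without network access. The following is a minimal sketch under stated assumptions, not code from this commit: `s3_client_settings` is a hypothetical name, and raising on missing variables or unknown services is an assumption. The commit itself falls back to the default AWS client for any value other than "minio".

```python
import os
from typing import Dict, Mapping, Tuple


def s3_client_settings(
    bucket_service: str, env: Mapping[str, str] = os.environ
) -> Tuple[Dict[str, str], str]:
    """Return (keyword arguments for boto3.client("s3", ...), bucket name)."""
    if bucket_service == "minio":
        # Map boto3 keyword names to the MinIO environment variables used in the diff.
        names = {
            "endpoint_url": "MINIO_ENDPOINT_URL",
            "aws_access_key_id": "MINIO_ACCESS_KEY",
            "aws_secret_access_key": "MINIO_SECRET_KEY",
        }
        missing = [var for var in names.values() if not env.get(var)]
        if missing:
            # Assumption: fail early instead of passing None to boto3.
            raise ValueError("Missing environment variables: " + ", ".join(missing))
        return {key: env[var] for key, var in names.items()}, "nlp-research"
    if bucket_service == "aws":
        # No overrides: boto3 resolves AWS credentials through its default chain.
        return {}, "silnlp"
    # Assumption: reject unknown services rather than defaulting to AWS.
    raise ValueError(f"Unknown bucket service: {bucket_service}")


# Usage (requires boto3 and valid credentials):
#   kwargs, bucket_name = s3_client_settings("minio")
#   s3 = boto3.client("s3", **kwargs)
#   paginator = s3.get_paginator("list_objects_v2")
```

Keeping the settings lookup separate from client creation means the bucket name and credentials can be verified in a unit test, while the call to `boto3.client` stays in one place.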
scripts/rclone/mount_b2_to_b.bat

Lines changed: 1 addition & 1 deletion
@@ -10,4 +10,4 @@ rem copy your key and secret to rclone.conf
1010

1111
rem run rclone - execute this file in the rclone folder
1212

13-
call rclone mount --vfs-cache-mode full --use-server-modtime s3silnlp:silnlp S:
13+
call rclone mount --vfs-cache-mode full --use-server-modtime b2silnlp:silnlp B:
scripts/rclone/mount_minio_to_m.bat

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
1+
rem Install rclone
2+
rem get rclone from https://rclone.org/downloads/
3+
rem extract the files to a folder
4+
rem then move this bat file to the folder where you run this bat file to start the service
5+
rem --use-server-modtime flag speeds up displaying large numbers of files. Not exactly mod time, but close enough.
6+
7+
rem configure rclone
8+
rem copy the adjacent file "rclone.conf" to: C:\Users\<username>\AppData\Roaming\rclone\rclone.conf
9+
rem copy your key and secret to rclone.conf
10+
11+
rem run rclone - execute this file in the rclone folder
12+
13+
call rclone mount --vfs-cache-mode full --use-server-modtime --no-check-certificate miniosilnlp:nlp-research M:
