Batch AI provides managed infrastructure to help data scientists with cluster
management, and with scheduling, scaling, and monitoring of AI jobs.
Batch AI is built on top of virtual machine scale sets and Docker, and can run
training jobs either in Docker containers or directly on the compute nodes.
- Cluster
- Jobs
- Azure File Share - stdout, stderr, may contain Python scripts
- Azure Blob Storage - Python scripts, data
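Putting these pieces together, the overall flow with the Azure CLI looks roughly like this (a condensed sketch; the full commands with all of their flags appear later on this page):

# Condensed workflow - provision a cluster, submit a job, stream its logs.
# <cluster name>, <job name>, <rg name> are placeholders used throughout this page.
az batchai cluster create -n <cluster name> -g <rg name> -l eastus -c cluster.json
az batchai job create -n <job name> -r <cluster name> -g <rg name> -l eastus -c job.json
az batchai job stream-file -j <job name> -d stdouterr -n stdout.txt -g <rg name>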
You Only Look Once (YOLO) is a real-time object detection system. We will be
running YOLOv3 on a single image with Batch AI. If you would like to run YOLO
without a cluster, you can follow the steps on the
YOLO site.
git clone https://github.com/pjreddie/darknet
cd darknet
make
wget https://pjreddie.com/media/files/yolov3.weights
./darknet detect cfg/yolov3.cfg yolov3.weights data/dog.jpg

YOLOv3 should output something like:
...
104 conv 256 3 x 3 / 1 52 x 52 x 128 -> 52 x 52 x 256 1.595 BFLOPs
105 conv 255 1 x 1 / 1 52 x 52 x 256 -> 52 x 52 x 255 0.353 BFLOPs
106 detection
Loading weights from cfg/yolov3.weights...Done!
data/dog.jpg: Predicted in 24.016015 seconds.
dog: 99%
truck: 92%
bicycle: 99%

- Python train and test scripts define the parallel strategy used, not Batch AI.
  For example, CNTK uses an asynchronous data parallel training strategy, while
  TensorFlow uses an asynchronous model parallel training strategy.
- Make sure .sh scripts have LF line endings - use dos2unix to fix them (see the
  example after this list).
- To enable faster communication between the nodes it's necessary to use Intel MPI
  and have InfiniBand on the VM. The NC24r size (which works with Intel MPI and
  InfiniBand) has a quota of 1 core by default in any subscription, so make quota
  increase requests early.
- There's no way to reset the SSH key for nodes.
- Do not put CMD in the Dockerfile used by Batch AI. Since the container runs in
  detached mode, it will exit on CMD.
- Error messages within the container are not very descriptive.
- Clusters take a long time to provision and deallocate
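For the line-ending issue mentioned above, a quick check-and-fix looks like this (the script name is just an example):

# Detect CRLF endings and convert to LF before uploading a shell script
file train.sh       # prints "... with CRLF line terminators" when the endings are wrong
dos2unix train.sh   # rewrites the file in place with LF endings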
- Install Azure CLI 2.0 for WSL
- Batch AI Recipes
- Azure CLI Docs
- Swagger Docs for Batch AI
- Batch AI Environment Variables
- Setting up KeyVault
az account set -s <subscription id>
az account list -o table

az group create -n <rg name> -l eastus

az storage account create \
    -n <storage account name> \
    --sku Standard_LRS \
    -l eastus \
    -g <rg name>

az storage account keys list \
    -n <storage account name> \
    -g <rg name> \
    --query "[0].value"

az storage share create \
    -n <share name> \
    --account-name <storage account name> \
    --account-key <storage account key>

az storage directory create \
    -s <share name> \
    -n yolo \
    --account-name <storage account name> \
    --account-key <storage account key>

az storage file upload \
    -s <share name> \
    --source <python script> \
    -p yolo \
    --account-name <storage account name> \
    --account-key <storage account key>

Config parameters are defined by ClusterCreateParameters in the Batch AI swagger docs.
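For reference, a minimal cluster.json might look something like the sketch below. The field names are taken from the ClusterCreateParameters schema; this is an unvalidated sketch, so verify the exact structure against the swagger docs before using it.

# Rough sketch of cluster.json for a 2-node, manually scaled NC6 cluster.
# Field names/nesting come from ClusterCreateParameters - double-check them in the swagger docs.
cat > cluster.json <<'EOF'
{
    "properties": {
        "vmSize": "STANDARD_NC6",
        "scaleSettings": {
            "manual": { "targetNodeCount": 2 }
        },
        "userAccountSettings": {
            "adminUserName": "<user name>",
            "adminUserPassword": "<password>"
        }
    }
}
EOF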
az batchai cluster create \
    -n <cluster name> \
    -l eastus \
    -g <rg name> \
    -c cluster.json

az batchai cluster create \
    -n <cluster name> \
    -g <rg name> \
    -l eastus \
    --storage-account-name <storage account name> \
    --storage-account-key <storage account key> \
    -i UbuntuDSVM \
    -s Standard_NC6 \
    --min 2 \
    --max 2 \
    --afs-name <share name> \
    --afs-mount-path external \
    -u $USER \
    -k ~/.ssh/id_rsa.pub \
    -p <password>

az batchai cluster show \
    -n <cluster name> \
    -g <rg name> \
    -o table

- View JobBaseProperties in the Batch AI swagger docs for the possible parameters to use in job.json.
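A minimal job.json might look like the sketch below (field names taken from JobBaseProperties; verify them against the swagger docs). It assumes the Python script uploaded to the yolo directory earlier and writes stdout/stderr back to the mounted share:

# Rough sketch of job.json - double-check the schema in the swagger docs before use.
cat > job.json <<'EOF'
{
    "properties": {
        "nodeCount": 1,
        "customToolkitSettings": {
            "commandLine": "python $AZ_BATCHAI_MOUNT_ROOT/external/yolo/<python script>"
        },
        "stdOutErrPathPrefix": "$AZ_BATCHAI_MOUNT_ROOT/external"
    }
}
EOF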
az batchai job create \
    -g <rg name> \
    -l eastus \
    -n <job name> \
    -r <cluster name> \
    -c job.json

az batchai job show \
    -n <job name> \
    -g <rg name> \
    -o table

az batchai job stream-file \
    -j <job name> \
    -n stdout.txt \
    -d stdouterr \
    -g <rg name>

az batchai cluster list-nodes \
    -n <cluster name> \
    -g <rg name>

ssh <ip> -p <port>

$AZ_BATCHAI_MOUNT_ROOT is an environment variable set by Batch AI for each job; its value depends on the image used to create the nodes. For example, on Ubuntu-based images it is /mnt/batch/tasks/shared/LS_root/mounts. You can cd to this directory and view the Python scripts and logs.
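For example, after ssh-ing into a node built from an Ubuntu-based image with the file share mounted at external (as in the cluster create command above), you could inspect the share like this:

# Paths assume an Ubuntu-based image and the "external" relative mount path used above
cd /mnt/batch/tasks/shared/LS_root/mounts/external
ls yolo    # the Python script uploaded earlier
ls         # job stdout/stderr directories also appear here when logs are written to the share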
