2021Tencent Rhino-bird Open-source Training Program—Angel Zeng Shang #98

@earlytobed

Assignment 1

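I am honored to have been selected for the Angel project and to begin the hands-on open-source phase. Studying the architecture and design principles of the Angel distributed machine learning platform alongside the mentors and fellow students is a rare opportunity. These are my hands-on notes from the program. My experience is limited, so mistakes and omissions are inevitable; corrections from readers are welcome.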

Setting up the Angel environment

This project reproduces a paper on top of Angel-ML/PyTorch-On-Angel, so before doing anything else we need a working environment.

https://github.com/Angel-ML/PyTorch-On-Angel/blob/master/docs/img/pytorch_on_angel_framework.png?raw=true

PyTorch on Angel's architecture

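PyTorch-On-Angel consists of three main modules:

  1. Python Client: generates the ScriptModule
  2. Angel PS: the parameter server, responsible for distributed storage of the model, synchronization, and coordinating computation
  3. Spark: the Spark Driver and Spark Executors load the ScriptModule, process the data, and work with the parameter server to train the model and run prediction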

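Sorting out the dependencies:

  • Generating the ScriptModule from Python code requires a Python environment and the torch package
  • The C++ backend requires libtorch_angel
  • Angel PS, the Spark Driver, and the Spark Executors require Spark
  • The project recommends running Spark on YARN, so Hadoop is needed as well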
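
Before starting, it can help to see which of these tools are already on the PATH. The following is a minimal sketch; the function name missing_tools and the tool list in the usage comment are my own choices for illustration, not something the project provides.

```shell
# Print every command from the argument list that is not found on PATH.
missing_tools() (
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
)

# Example (the tool list is an assumption based on the steps in these notes):
# missing_tools git docker java python3 hdfs yarn spark-submit
```

An empty result means everything listed was found; on a fresh machine, hdfs, yarn, and spark-submit will be reported until the later sections are completed.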

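Everything below was done on Ubuntu 20.04 LTS. This is my everyday machine, so the environment is not completely clean and I cannot guarantee that no other problems will come up.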

PyTorch-On-Angel

The first step, of course, is:

git clone https://github.com/Angel-ML/PyTorch-On-Angel.git --depth 1

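The project documentation describes how to build. For convenience, I prepared mirror configuration files and placed them under ./addon for later use: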

Debian 9 sources.list

deb http://mirrors.cloud.tencent.com/debian stretch main contrib non-free
deb http://mirrors.cloud.tencent.com/debian stretch-updates main contrib non-free
#deb http://mirrors.cloud.tencent.com/debian stretch-backports main contrib non-free
#deb http://mirrors.cloud.tencent.com/debian stretch-proposed-updates main contrib non-free
deb-src http://mirrors.cloud.tencent.com/debian stretch main contrib non-free
deb-src http://mirrors.cloud.tencent.com/debian stretch-updates main contrib non-free
#deb-src http://mirrors.cloud.tencent.com/debian stretch-backports main contrib non-free
#deb-src http://mirrors.cloud.tencent.com/debian stretch-proposed-updates main contrib non-free

maven settings.xml

<?xml version="1.0" encoding="UTF-8"?>
<settings xmlns="http://maven.apache.org/SETTINGS/1.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/SETTINGS/1.0.0 http://maven.apache.org/xsd/settings-1.0.0.xsd">
  <mirrors>
    <mirror>
      <id>nexus-tencentyun</id>
      <mirrorOf>*</mirrorOf>
      <name>Nexus tencentyun</name>
      <url>http://mirrors.cloud.tencent.com/nexus/repository/maven-public/</url>
    </mirror>
  </mirrors>
</settings>

The modified Dockerfile:

########################################################################################################################
#                                                       DEV                                                            #
########################################################################################################################
FROM maven:3.6.1-jdk-8 as DEV

##########################
#  install dependencies  #
##########################
COPY ./addon/sources.list /etc/apt/sources.list
RUN apt-get update \
    && apt-get install -y --no-install-recommends \
    curl=7.52.1-5+deb9u9 \
    g++=4:6.3.0-4 \
    make=4.1-9.1 \
    unzip=6.0-21+deb9u1 \
    python3 \
    python3-pip \
    python3-setuptools \
    python3-wheel \
    && rm -rf /var/lib/apt/lists/*

#####################
#  Install PyTorch  #
#####################
RUN python3 -m pip install --no-cache-dir -i https://mirrors.cloud.tencent.com/pypi/simple \
    https://files.pythonhosted.org/packages/24/33/ccfe4e16bfa1f2ca10e22bca05b313cba31800f9597f5f282020cd6ba45e/torch-1.3.1-cp35-cp35m-manylinux1_x86_64.whl \
    https://files.pythonhosted.org/packages/1c/f6/e927f7db4f422af037ca3f80b3391e6224ee3ee86473ea05028b2b026f82/torchvision-0.4.0-cp35-cp35m-manylinux1_x86_64.whl

#######################
#  install new cmake  #
#######################
RUN curl -fsSL --insecure -o /tmp/cmake.tar.gz https://cmake.org/files/v3.13/cmake-3.13.4.tar.gz \
    && tar -xzf /tmp/cmake.tar.gz -C /tmp \
    && rm -rf /tmp/cmake.tar.gz  \
    && mv /tmp/cmake-* /tmp/cmake \
    && cd /tmp/cmake \
    && ./bootstrap \
    && make -j8 \
    && make install \
    && rm -rf /tmp/cmake

#######################
#  download libtorch  #
#######################
WORKDIR /opt
RUN curl -fsSL --insecure -o libtorch.zip https://download.pytorch.org/libtorch/cpu/libtorch-cxx11-abi-shared-wit \
    && unzip -q libtorch.zip \
    && rm libtorch.zip

ENV TORCH_HOME=/opt/libtorch

########################################################################################################################
#                                                     JAVA BUILDER                                                     #
########################################################################################################################
FROM DEV as JAVA_BUILDER

COPY ./addon/settings.xml /usr/share/maven/conf/

WORKDIR /app

COPY ./java/pom.xml /app

RUN mvn -e -B dependency:resolve dependency:resolve-plugins

COPY ./java /app

RUN mvn -e -B -Dmaven.test.skip=true package

########################################################################################################################
#                                                     CPP BUILDER                                                      #
########################################################################################################################
FROM DEV as CPP_BUILDER

RUN apt-get update  \
    && apt-get install -y --no-install-recommends \
    zip=3.0-11+b1 \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

COPY ./cpp ./

RUN ./build.sh \
    && cp ./out/*.so "$TORCH_HOME"/lib \
    && cp /usr/lib/x86_64-linux-gnu/libstdc++.so.6 "$TORCH_HOME"/lib \
    && ln -s "$TORCH_HOME"/lib torch-lib \
    && zip -qr /torch.zip torch-lib

########################################################################################################################
#                                                       Artifacts                                                      #
########################################################################################################################
FROM alpine:3.10 as ARTIFACTS

WORKDIR /dist
COPY --from=CPP_BUILDER /torch.zip ./
COPY --from=JAVA_BUILDER /app/target/*.jar ./

VOLUME /output

CMD [ "/bin/sh", "-c", "cp ./* /output" ]

Modify cpp/CMakeLists.txt:

set(TORCH_HOME $ENV{TORCH_HOME})

Run build.sh and wait a while:

./build.sh

If downloads are slow, you can also fetch the required files into addon ahead of time and modify the corresponding parts of the Dockerfile:

cd addon && wget https://cmake.org/files/v3.13/cmake-3.13.4.tar.gz \
    https://download.pytorch.org/libtorch/cpu/libtorch-cxx11-abi-shared-with-deps-1.3.1%2Bcpu.zip \
    https://files.pythonhosted.org/packages/24/33/ccfe4e16bfa1f2ca10e22bca05b313cba31800f9597f5f282020cd6ba45e/torch-1.3.1-cp35-cp35m-manylinux1_x86_64.whl \
    https://files.pythonhosted.org/packages/1c/f6/e927f7db4f422af037ca3f80b3391e6224ee3ee86473ea05028b2b026f82/torchvision-0.4.0-cp35-cp35m-manylinux1_x86_64.whl

Modify gen_pt_model.sh, changing python to python3:

docker run -it --rm -v $(pwd)/${MODEL_PATH}:/model.py -v $(pwd)/dist:/output -w /output ${IMAGE_NAME} python3 /model.py ${@:2}

The files we need are now under ./dist:

deepfm.pt  pytorch-on-angel-0.2.0.jar  pytorch-on-angel-0.2.0-jar-with-dependencies.jar  torch.zip

That completes the first step.

Hadoop

wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.0/hadoop-2.7.0.tar.gz

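I can't stand leaving useless files around, so I delete the .cmd scripts (the pattern is quoted so the shell does not expand it before find sees it):

find . -name '*.cmd' | xargs rm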

Edit the configuration files:

hadoop-env.sh

export JAVA_HOME="adjust as needed"

core-site.xml

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>

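That completes the HDFS configuration. Format the NameNode:

hdfs namenode -format

Start HDFS to check that it works. Startup requires being able to SSH into master and the workers; the SSH setup is omitted here.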

./start-dfs.sh
jps
# 105141 DataNode
# 104964 NameNode
# 105385 SecondaryNameNode
# If all three are listed, HDFS is running; otherwise check the logs
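
The by-eye check of the jps listing can also be scripted. This is only a sketch: the helper name daemons_running is hypothetical, and it assumes jps prints one "PID Name" pair per line, as in the output above.

```shell
# Succeed only if every named daemon appears as the process name (second
# column) of the given jps output; report each missing one on stderr.
daemons_running() (
  jps_output=$1
  shift
  status=0
  for name in "$@"; do
    # Compare the whole second field so NameNode does not match SecondaryNameNode.
    if ! printf '%s\n' "$jps_output" |
        awk -v n="$name" '$2 == n { found = 1 } END { exit !found }'; then
      echo "missing: $name" >&2
      status=1
    fi
  done
  exit "$status"
)

# Example:
# daemons_running "$(jps)" NameNode DataNode SecondaryNameNode
```

The same helper works for the later checks by passing ResourceManager and NodeManager, or Master and Worker.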

mapred-site.xml: set the framework to yarn

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

yarn-site.xml: YARN's resource configuration. The default memory limit is 8G, which may not be enough to run Angel; adjust the values to match your machine:

<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>master</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>12</value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-vcores</name>
    <value>1</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-vcores</name>
    <value>12</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>30720</value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>1</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>30720</value>
  </property>
</configuration>
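
To get a feel for why 8G can be too little, here is a rough estimate of what a single Spark JVM asks YARN for. It is a sketch under stated assumptions: it uses Spark's documented default memory overhead of max(384 MB, 10% of the heap), it ignores any rounding YARN applies to requests, and it does not cover the Angel PS container, whose overhead I have not verified. The function name container_mb is my own.

```shell
# Container size in MB requested from YARN for a Spark JVM with the given heap:
# the heap plus Spark's default memory overhead, max(384 MB, 10% of the heap).
container_mb() (
  heap_mb=$1
  overhead_mb=$((heap_mb / 10))
  if [ "$overhead_mb" -lt 384 ]; then
    overhead_mb=384
  fi
  echo $((heap_mb + overhead_mb))
)

# Example with 5 GB heaps for a driver and one executor:
# echo $(( $(container_mb 5120) + $(container_mb 5120) ))
```

With 5 GB heaps, the driver and one executor together already exceed 8192 MB before the parameter server is counted, which is why the limits above are raised.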

Start YARN to check that it works:

./start-yarn.sh
jps
# 107761 ResourceManager
# 108141 NodeManager
# If both are listed, YARN is running; otherwise check the logs

Spark

wget https://archive.apache.org/dist/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz

With Hadoop configured, Spark needs very little configuration: Spark on YARN reads its settings directly from the Hadoop configuration, so only one change is needed:

spark-env.sh

export HADOOP_CONF_DIR="adjust as needed"

Start Spark to check that it works:

./start-all.sh
jps
# 2273766 Worker
# 2273463 Master
# If both are listed, Spark is running; otherwise check the logs

Angel

Mind the JDK version, or later steps will fail:

sudo apt install openjdk-8-jdk -y
sudo apt install maven -y

Build and install protobuf 2.5.0 by following its README.txt; remember to run ldconfig at the end:

wget https://github.com/protocolbuffers/protobuf/releases/download/v2.5.0/protobuf-2.5.0.tar.gz
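
After installing and running ldconfig, it is worth confirming that the protoc on the PATH is really the 2.5.0 build and not one shipped by the distribution. A minimal sketch, assuming protoc --version prints "libprotoc 2.5.0" for this release; the helper name is_protoc_2_5_0 is hypothetical.

```shell
# Succeed only if the given `protoc --version` output reports exactly 2.5.0.
is_protoc_2_5_0() {
  [ "$1" = "libprotoc 2.5.0" ]
}

# Example, run after the install and ldconfig:
# is_protoc_2_5_0 "$(protoc --version)" || echo "unexpected protoc version" >&2
```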

Then download Angel and build it by following its instructions:

wget https://github.com/Angel-ML/angel/archive/refs/tags/Release-2.4.0.tar.gz

After the build completes, unpack the result and configure it:

spark-on-angel-env.sh

export SPARK_HOME="adjust as needed"
export ANGEL_HOME="adjust as needed"
export ANGEL_HDFS_HOME="adjust as needed"
export ANGEL_VERSION=2.4.0

# Some jar versions had to be adjusted
angel_ps_external_jar=fastutil-7.1.0.jar,htrace-core-2.05.jar,sizeof-0.3.0.jar,kryo-shaded-4.0.0.jar,minlog-1.3.0.jar,memory-0.8.1.jar,commons-pool-1.6.jar,netty-all-4.1.18.Final.jar,hll-1.6.0.jar
sona_external_jar=fastutil-7.1.0.jar,htrace-core-2.05.jar,sizeof-0.3.0.jar,kryo-shaded-4.0.0.jar,minlog-1.3.0.jar,memory-0.8.1.jar,commons-pool-1.6.jar,netty-all-4.1.18.Final.jar,hll-1.6.0.jar,json4s-jackson_2.11-3.2.11.jar,json4s-ast_2.11-3.2.11.jar,json4s-core_2.11-3.2.11.jar

Create a directory and upload the required files to HDFS for later use:

hdfs dfs -mkdir /angel
hdfs dfs -put ./angel/data/census/census_148d_train.libsvm /angel
hdfs dfs -put ./angel/lib /angel

Place the four files generated earlier in a suitable location:

torch.zip pytorch-on-angel-0.2.0.jar pytorch-on-angel-0.2.0-jar-with-dependencies.jar deepfm.pt

Adjust the spark-submit parameters to match your setup.

--archives torch.zip#torch never worked for me, and searching turned up nothing, so I unpacked torch.zip and uploaded the files with --files instead:

#!/bin/bash
JAVA_LIBRARY_PATH="adjust as needed"
source ./angel/bin/spark-on-angel-env.sh
input="adjust as needed"
output="adjust as needed"
torchlib=torch-lib/libpthreadpool.a,torch-lib/libcpuinfo_internals.a,torch-lib/libCaffe2_perfkernels_avx2.a,torch-lib/libgmock.a,torch-lib/libprotoc.a,torch-lib/libnnpack.a,torch-lib/libgtest.a,torch-lib/libpytorch_qnnpack.a,torch-lib/libcaffe2_detectron_ops.so,torch-lib/libCaffe2_perfkernels_avx512.a,torch-lib/libgomp-753e6e92.so.1,torch-lib/libgloo.a,torch-lib/libonnx.a,torch-lib/libtorch_angel.so,torch-lib/libbenchmark_main.a,torch-lib/libcaffe2_protos.a,torch-lib/libgtest_main.a,torch-lib/libprotobuf-lite.a,torch-lib/libasmjit.a,torch-lib/libCaffe2_perfkernels_avx.a,torch-lib/libonnx_proto.a,torch-lib/libfoxi_loader.a,torch-lib/libfbgemm.a,torch-lib/libc10.so,torch-lib/libclog.a,torch-lib/libbenchmark.a,torch-lib/libgmock_main.a,torch-lib/libnnpack_reference_layers.a,torch-lib/libcaffe2_module_test_dynamic.so,torch-lib/libqnnpack.a,torch-lib/libprotobuf.a,torch-lib/libc10d.a,torch-lib/libtorch.so,torch-lib/libcpuinfo.a,torch-lib/libstdc++.so.6,torch-lib/libmkldnn.a

spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --conf spark.ps.instances=1 \
    --conf spark.ps.cores=1 \
    --conf spark.ps.jars=$SONA_ANGEL_JARS \
    --conf spark.ps.memory=5g \
    --conf spark.ps.log.level=INFO \
    --conf spark.driver.extraJavaOptions=-Djava.library.path=$JAVA_LIBRARY_PATH:. \
    --conf spark.executor.extraJavaOptions=-Djava.library.path=$JAVA_LIBRARY_PATH:. \
    --conf spark.executor.extraLibraryPath=. \
    --conf spark.driver.extraLibraryPath=. \
    --conf spark.executorEnv.OMP_NUM_THREADS=2 \
    --conf spark.executorEnv.MKL_NUM_THREADS=2 \
    --name "deepfm for torch on angel" \
    --jars $SONA_SPARK_JARS \
    --files deepfm.pt,$torchlib \
    --driver-memory 5g \
    --num-executors 1 \
    --executor-cores 1 \
    --executor-memory 5g \
    --class com.tencent.angel.pytorch.examples.supervised.RecommendationExample pytorch-on-angel-0.2.0.jar \
    trainInput:$input batchSize:128 torchModelPath:deepfm.pt \
    stepSize:0.001 numEpoch:10 testRatio:0.1 \
    angelModelOutputPath:$output
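
The long hand-written torchlib list in the script above could probably be generated from the unpacked directory instead. This is a sketch rather than what I actually ran: the function name files_csv is my own, and it assumes no file name contains a comma or a newline.

```shell
# Print the regular files directly inside a directory as one comma-separated
# line, the form that spark-submit's --files option expects.
files_csv() (
  dir=$1
  find "$dir" -maxdepth 1 -type f | LC_ALL=C sort | paste -sd, -
)

# Example:
# torchlib=$(files_csv torch-lib)
```

Only regular files are listed, so any library that exists solely as a symlink would have to be added by hand.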

Check http://master:8088/cluster/apps to see the job succeed!

success
