diff --git "a/Governance/\317\200Flow_Open_Source_Individual_CLA.docx" "b/Governance/\317\200Flow_Open_Source_Individual_CLA.docx" new file mode 100644 index 00000000..6d4a00eb Binary files /dev/null and "b/Governance/\317\200Flow_Open_Source_Individual_CLA.docx" differ diff --git "a/Governance/\317\200Flow_Open_Source_Individual_CLA.pdf" "b/Governance/\317\200Flow_Open_Source_Individual_CLA.pdf" new file mode 100644 index 00000000..72fc2655 Binary files /dev/null and "b/Governance/\317\200Flow_Open_Source_Individual_CLA.pdf" differ diff --git "a/Governance/\345\216\237\345\210\231.md" "b/Governance/\345\216\237\345\210\231.md" index 9173e661..0b75cbb9 100644 --- "a/Governance/\345\216\237\345\210\231.md" +++ "b/Governance/\345\216\237\345\210\231.md" @@ -29,4 +29,4 @@ PifFow社区遵循[社区行为准则](https://github.com/cas-bigdatalab/piflow/ ### CLA -所有贡献者都必须签署PifFow CLA,请具体看[这里](https://github.com/cas-bigdatalab/piflow/blob/master/Governance/image-20211118094103884.png)。 +所有贡献者都必须签署PifFow CLA,请具体看[这里](https://github.com/cas-bigdatalab/piflow/blob/master/Governance/%CF%80Flow_Open_Source_Individual_CLA.docx)。 diff --git a/README.md b/README.md index 837f886c..6342357d 100644 --- a/README.md +++ b/README.md @@ -39,11 +39,12 @@ ![](https://github.com/cas-bigdatalab/piflow/blob/master/doc/architecture.png) ## Requirements * JDK 1.8 -* Scala-2.11.8 +* Scala-2.12.18 * Apache Maven 3.1.0 or newer -* Spark-2.1.0、 Spark-2.2.0、 Spark-2.3.0 -* Hadoop-2.6.0 -* Apache Livy-0.7.1 +* Spark-3.4.0 +* Hadoop-3.3.0 + +Compatible with X86 architecture and ARM architecture, Support CentOS and Kirin system deployment ## Getting Started @@ -319,12 +320,20 @@ ![](https://github.com/cas-bigdatalab/piflow/blob/master/doc/piflow-stophublist.png) ## Contact Us -- Name:吴老师 -- Mobile Phone:18910263390 -- WeChat:18910263390 -- Email: wzs@cnic.cn -- QQ Group:1003489545 - ![](https://github.com/cas-bigdatalab/piflow/blob/master/doc/PiFlowUserGroup_QQ.jpeg) +- Name:Yang Gang, Tian Yao +- Mobile Phone:13253365393, 18501260806 +- WeChat:13253365393, 18501260806 +- Email: ygang@cnic.cn, tianyao@cnic.cn +- Private vulnerability contact information:ygang@cnic.cn +- Wechat User Group +
+ +
+ +- Wechat Official Account +
+ +
diff --git "a/conda-pack\346\211\223\345\214\205\350\231\232\346\213\237\347\216\257\345\242\203.md" "b/conda-pack\346\211\223\345\214\205\350\231\232\346\213\237\347\216\257\345\242\203.md" deleted file mode 100644 index 163b66e5..00000000 --- "a/conda-pack\346\211\223\345\214\205\350\231\232\346\213\237\347\216\257\345\242\203.md" +++ /dev/null @@ -1,55 +0,0 @@ -我使用的是conda这个包管理工具来对anaconda所安装的python虚拟环境进行打包,需要注意的是打包的是anaconda安装的虚拟环境而不是本地环境。 - -优势:可以打包虚拟环境中包括二进制文件等整个环境包括pip安装的python库 - -劣势:conda打包的虚拟环境只能使用于同一个操作系统下,测试了ubuntu和centos可以通用 - -(操作系统ubuntu20.04) - -1.安装Anaconda(使用脚本安装)安装教程链接:https://www.myfreax.com/how-to-install-anaconda-on-ubuntu-20-04/ ,或者自行搜索安装方法。 - -大致过程如下,下载脚本,执行脚本。 - -```bash -wget -P /tmp https://repo.anaconda.com/archive/Anaconda3-2020.02-Linux-x86_64.sh -``` - -安装anconda后建议更新下版本 - -2.安装conda pack工具,这里推荐使用pip安装。 - -```bash -pip install conda-pack -``` - -3.创建python虚拟环境 - -``` -conda create -n vir-name python=x.x. #vir-name换成你的虚拟环境名字 -``` - -4.激活虚拟环境 - -``` -conda activate vir-name -``` - -5.使用pip安装对应的包 - -6.使用conda pack打包环境 - -``` -conda pack -n my_env_name -o out_name.tar.gz -``` - -此处打包只能打包成tar.gz模式,打包成zip会有报错,想打包成zip模式解决办法就是先打包成tar.gz之后解压再重新打包成zip - -7.激活虚拟环境 - -将zip包从本地环境上传到服务器或者其他操作系统相同的环境后解压并进入其中的bin目录 - -使用`source activate`激活环境 - -8.退出虚拟环境 - -使用`source deactivate`退出虚拟环境 \ No newline at end of file diff --git a/config.properties b/config.properties index 49e059fa..ad5b42ad 100644 --- a/config.properties +++ b/config.properties @@ -1,13 +1,13 @@ spark.master=yarn spark.deploy.mode=cluster - +server.ip=172.18.32.1 #hdfs default file system -fs.defaultFS=hdfs://10.0.82.108:9000 +fs.defaultFS=hdfs://172.18.39.41:9000 #yarn resourcemanager hostname -yarn.resourcemanager.hostname=10.0.82.108 +yarn.resourcemanager.hostname=172.18.39.41 #if you want to use hive, set hive metastore uris -hive.metastore.uris=thrift://10.0.82.108:9083 +#hive.metastore.uris=thrift://10.0.82.108:9083 #show data in log, set 0 if you do not show the logs data.show=10 diff --git a/doc/tencent.jpg b/doc/tencent.jpg new file mode 100644 index 00000000..99e30380 Binary files /dev/null and b/doc/tencent.jpg differ diff --git a/doc/wechat_user.png b/doc/wechat_user.png new file mode 100644 index 00000000..ea6a68c2 Binary files /dev/null and b/doc/wechat_user.png differ diff --git a/piflow-bin/config.properties b/piflow-bin/config.properties index 3b6dc841..a6f167b7 100644 --- a/piflow-bin/config.properties +++ b/piflow-bin/config.properties @@ -2,10 +2,10 @@ spark.master=yarn spark.deploy.mode=cluster #hdfs default file system -fs.defaultFS=hdfs://10.0.85.83:9000 +fs.defaultFS=hdfs://172.18.39.41:9000 #yarn resourcemanager hostname -yarn.resourcemanager.hostname=10.0.85.83 +yarn.resourcemanager.hostname=172.18.39.41 #if you want to use hive, set hive metastore uris hive.metastore.uris=thrift://10.0.85.83:9083 diff --git a/piflow-bin/example/flow.json b/piflow-bin/example/flow.json index 6459cad1..eff0558f 100755 --- a/piflow-bin/example/flow.json +++ b/piflow-bin/example/flow.json @@ -7,45 +7,15 @@ "paths": [ { "inport": "", - "from": "XmlParser", - "to": "SelectField", - "outport": "" - }, - { - "inport": "", - "from": "Fork", + "from": "CsvParser", "to": "CsvSave", - "outport": "out1" - }, - { - "inport": "data2", - "from": "SelectField", - "to": "Merge", "outport": "" }, { "inport": "", - "from": "Merge", - "to": "Fork", - "outport": "" - }, - { - "inport": "data1", "from": "CsvParser", - "to": "Merge", + "to": "CsvSave", "outport": "" - }, - { - "inport": "", - "from": "Fork", - "to": "JsonSave", 
- "outport": "out3" - }, - { - "inport": "", - "from": "Fork", - "to": "PutHiveMode", - "outport": "out2" } ], "executorCores": "1", @@ -56,7 +26,7 @@ "bundle": "cn.piflow.bundle.csv.CsvSave", "uuid": "8a80d63f720cdd2301723a4e67a52467", "properties": { - "csvSavePath": "hdfs://master:9000/xjzhu/phdthesis_result.csv", + "csvSavePath": "hdfs://172.18.32.1:9000/user/Yomi/test1.csv", "partition": "", "header": "false", "saveMode": "append", @@ -66,87 +36,18 @@ } }, - { - "name": "PutHiveMode", - "bundle": "cn.piflow.bundle.hive.PutHiveMode", - "uuid": "8a80d63f720cdd2301723a4e67a22461", - "properties": { - "database": "sparktest", - "saveMode": "append", - "table": "dblp_phdthesis" - }, - "customizedProperties": { - - } - }, { "name": "CsvParser", "bundle": "cn.piflow.bundle.csv.CsvParser", "uuid": "8a80d63f720cdd2301723a4e67a82470", "properties": { "schema": "title,author,pages", - "csvPath": "hdfs://master:9000/xjzhu/phdthesis.csv", + "csvPath": "hdfs://172.18.32.1:9000/user/Yomi/test.csv", "delimiter": ",", "header": "false" }, "customizedProperties": { - } - }, - { - "name": "JsonSave", - "bundle": "cn.piflow.bundle.json.JsonSave", - "uuid": "8a80d63f720cdd2301723a4e67a1245f", - "properties": { - "jsonSavePath": "hdfs://10.0.86.191:9000/xjzhu/phdthesis.json" - }, - "customizedProperties": { - - } - }, - { - "name": "XmlParser", - "bundle": "cn.piflow.bundle.xml.XmlParser", - "uuid": "8a80d63f720cdd2301723a4e67a7246d", - "properties": { - "rowTag": "phdthesis", - "xmlpath": "hdfs://master:9000/xjzhu/dblp.mini.xml" - }, - "customizedProperties": { - - } - }, - { - "name": "SelectField", - "bundle": "cn.piflow.bundle.common.SelectField", - "uuid": "8a80d63f720cdd2301723a4e67aa2477", - "properties": { - "columnNames": "title,author,pages" - }, - "customizedProperties": { - - } - }, - { - "name": "Merge", - "bundle": "cn.piflow.bundle.common.Merge", - "uuid": "8a80d63f720cdd2301723a4e67a92475", - "properties": { - "inports": "data1,data2" - }, - "customizedProperties": { - - } - }, - { - "name": "Fork", - "bundle": "cn.piflow.bundle.common.Fork", - "uuid": "8a80d63f720cdd2301723a4e67a42465", - "properties": { - "outports": "out1,out3,out2" - }, - "customizedProperties": { - } } ] diff --git a/piflow-bin/example/flow_2.json b/piflow-bin/example/flow_2.json new file mode 100644 index 00000000..6459cad1 --- /dev/null +++ b/piflow-bin/example/flow_2.json @@ -0,0 +1,154 @@ +{ + "flow": { + "name": "Example", + "executorMemory": "1g", + "executorNumber": "1", + "uuid": "8a80d63f720cdd2301723a4e679e2457", + "paths": [ + { + "inport": "", + "from": "XmlParser", + "to": "SelectField", + "outport": "" + }, + { + "inport": "", + "from": "Fork", + "to": "CsvSave", + "outport": "out1" + }, + { + "inport": "data2", + "from": "SelectField", + "to": "Merge", + "outport": "" + }, + { + "inport": "", + "from": "Merge", + "to": "Fork", + "outport": "" + }, + { + "inport": "data1", + "from": "CsvParser", + "to": "Merge", + "outport": "" + }, + { + "inport": "", + "from": "Fork", + "to": "JsonSave", + "outport": "out3" + }, + { + "inport": "", + "from": "Fork", + "to": "PutHiveMode", + "outport": "out2" + } + ], + "executorCores": "1", + "driverMemory": "1g", + "stops": [ + { + "name": "CsvSave", + "bundle": "cn.piflow.bundle.csv.CsvSave", + "uuid": "8a80d63f720cdd2301723a4e67a52467", + "properties": { + "csvSavePath": "hdfs://master:9000/xjzhu/phdthesis_result.csv", + "partition": "", + "header": "false", + "saveMode": "append", + "delimiter": "," + }, + "customizedProperties": { + + } + }, + { + "name": 
"PutHiveMode", + "bundle": "cn.piflow.bundle.hive.PutHiveMode", + "uuid": "8a80d63f720cdd2301723a4e67a22461", + "properties": { + "database": "sparktest", + "saveMode": "append", + "table": "dblp_phdthesis" + }, + "customizedProperties": { + + } + }, + { + "name": "CsvParser", + "bundle": "cn.piflow.bundle.csv.CsvParser", + "uuid": "8a80d63f720cdd2301723a4e67a82470", + "properties": { + "schema": "title,author,pages", + "csvPath": "hdfs://master:9000/xjzhu/phdthesis.csv", + "delimiter": ",", + "header": "false" + }, + "customizedProperties": { + + } + }, + { + "name": "JsonSave", + "bundle": "cn.piflow.bundle.json.JsonSave", + "uuid": "8a80d63f720cdd2301723a4e67a1245f", + "properties": { + "jsonSavePath": "hdfs://10.0.86.191:9000/xjzhu/phdthesis.json" + }, + "customizedProperties": { + + } + }, + { + "name": "XmlParser", + "bundle": "cn.piflow.bundle.xml.XmlParser", + "uuid": "8a80d63f720cdd2301723a4e67a7246d", + "properties": { + "rowTag": "phdthesis", + "xmlpath": "hdfs://master:9000/xjzhu/dblp.mini.xml" + }, + "customizedProperties": { + + } + }, + { + "name": "SelectField", + "bundle": "cn.piflow.bundle.common.SelectField", + "uuid": "8a80d63f720cdd2301723a4e67aa2477", + "properties": { + "columnNames": "title,author,pages" + }, + "customizedProperties": { + + } + }, + { + "name": "Merge", + "bundle": "cn.piflow.bundle.common.Merge", + "uuid": "8a80d63f720cdd2301723a4e67a92475", + "properties": { + "inports": "data1,data2" + }, + "customizedProperties": { + + } + }, + { + "name": "Fork", + "bundle": "cn.piflow.bundle.common.Fork", + "uuid": "8a80d63f720cdd2301723a4e67a42465", + "properties": { + "outports": "out1,out3,out2" + }, + "customizedProperties": { + + } + } + ] + } +} diff --git a/piflow-bin/server.ip b/piflow-bin/server.ip index 8f2cf70c..f32ec10e 100644 --- a/piflow-bin/server.ip +++ b/piflow-bin/server.ip @@ -1 +1 @@ -server.ip=10.0.85.83 +server.ip=172.18.32.1 diff --git a/piflow-bundle/config.properties b/piflow-bundle/config.properties index 78495b0f..90776c21 100644 --- a/piflow-bundle/config.properties +++ b/piflow-bundle/config.properties @@ -2,9 +2,9 @@ spark.master=yarn spark.deploy.mode=cluster #hdfs default file system -fs.defaultFS=hdfs://10.0.86.191:9000 +fs.defaultFS=hdfs://172.18.39.41:9000 #yarn resourcemanager hostname -yarn.resourcemanager.hostname=10.0.86.191 +yarn.resourcemanager.hostname=172.18.39.41 #if you want to use hive, set hive metastore uris hive.metastore.uris=thrift://10.0.86.191:9083 @@ -19,4 +19,10 @@ monitor.throughput=true server.port=8001 #h2db port -h2.port=50001 \ No newline at end of file +h2.port=50001 + +#ceph config +ceph.accessKey=123456 +ceph.secretKey=123456 +ceph.bucket=***** +ceph.domain.ip=xxxxxx(????????okhttp3.HttpUrl??) 
\ No newline at end of file diff --git a/piflow-bundle/pom.xml b/piflow-bundle/pom.xml index 6a8ee7a4..c4912acd 100644 --- a/piflow-bundle/pom.xml +++ b/piflow-bundle/pom.xml @@ -11,41 +11,14 @@ UTF-8 9.0.0.M0 - 2.12.18 + 2.11.8 1.8 - 0.3.1 + 2.5.32 + 10.1.12 piflow-bundle - - - - org.apache.hbase - hbase-client - 2.5.5 - - - - - org.apache.hbase - hbase-mapreduce - 2.5.5 - - - - - com.crealytics - spark-excel_2.12 - 3.3.1_0.18.7 - - - - org.elasticsearch - elasticsearch-spark-30_2.12 - 8.3.3 - - ch.ethz.ganymed ganymed-ssh2 @@ -53,17 +26,12 @@ - ru.yandex.clickhouse - clickhouse-jdbc - ${clickhouse.version} - - - com.fasterxml.jackson.core - * - - + com.alibaba + fastjson + 1.2.58 + org.neo4j.driver neo4j-java-driver @@ -82,6 +50,19 @@ biojava-structure 4.0.0 + + org.apache.hive + hive-jdbc + 1.2.1 + + + httpclient + org.apache.httpcomponents + + + org.mongodb @@ -94,6 +75,12 @@ org.apache.solr solr-solrj 7.2.0 + + + httpclient + org.apache.httpcomponents + + @@ -116,7 +103,7 @@ org.clapper - classutil_2.12 + classutil_2.11 1.3.0 @@ -128,41 +115,57 @@ com.chuusai - shapeless_2.12 - 2.3.7 + shapeless_2.11 + 2.3.1 com.sksamuel.scrimage - scrimage-core_2.12 - 2.1.8 + scrimage-core_2.11 + 2.1.7 com.sksamuel.scrimage - scrimage-io-extra_2.12 - 2.1.8 + scrimage-io-extra_2.11 + 2.1.7 com.sksamuel.scrimage - scrimage-filters_2.12 - 2.1.8 + scrimage-filters_2.11 + 2.1.7 + + + + org.slf4j + slf4j-api + 1.7.25 - net.liftweb - lift-json_2.12 - 3.3.0 + lift-json_2.11 + 2.6.1 com.databricks - spark-xml_2.12 - 0.5.0 + spark-xml_2.11 + 0.4.1 + + + black.ninia jep @@ -182,6 +185,19 @@ 0.11.0.0 + + org.elasticsearch + elasticsearch-hadoop + 7.6.1 + + + + + org.elasticsearch + elasticsearch + 7.6.1 + + org.jsoup @@ -189,6 +205,7 @@ 1.10.3 + org.json json @@ -201,39 +218,65 @@ 1.9.1 - - - - - + + com.memcached + java_memcached-release + 2.6.6 + - + + - io.netty - netty-all - 4.1.89.Final + org.apache.flume + flume-ng-core + 1.8.0 - + + + + + + + org.apache.hbase hbase-client - 2.5.5-hadoop3 + 1.2.6 + + + httpclient + org.apache.httpcomponents + + - + org.apache.hbase hbase-server - 2.5.5-hadoop3 + 1.2.6 - net.sourceforge.jexcelapi jxl 2.6.12 + + + org.apache.poi + poi-ooxml + + 3.17 + + + org.apache.xmlbeans + xmlbeans + + + + net.sf.json-lib @@ -247,12 +290,23 @@ commons-pool2 2.4.2 - org.apache.commons commons-lang3 3.5 + + + + + + + + ftpClient + edtftp + 1.0.0 + + @@ -264,37 +318,67 @@ org.apache.httpcomponents httpclient - 4.5.13 + 4.5.3 org.apache.httpcomponents httpmime - 4.5.13 + 4.5.3 - + - com.oracle.database.jdbc + oracle ojdbc6 - 11.2.0.4 + 11.2.0.3 - + + - com.taosdata.jdbc - taos-jdbcdriver - 2.0.36 + com.typesafe.akka + akka-remote_2.11 + ${akka.version} - - - - io.hetu.core - hetu-jdbc - 1.6.0 - + + + com.typesafe.akka + akka-actor_2.11 + ${akka.version} + + + + com.typesafe.akka + akka-http_2.11 + ${akka.http.version} + + + + com.crealytics + spark-excel_2.11 + 0.13.7 + + + + org.apache.commons + commons-collections4 + 4.1 + + + + + org.apache.xmlbeans + xmlbeans + 3.1.0 + + + + + + org.apache.maven.plugins maven-install-plugin @@ -315,6 +399,41 @@ true + + + + install-external-2 + + install-file + + install + + ${basedir}/lib/ojdbc6-11.2.0.3.jar + oracle + ojdbc6 + 11.2.0.3 + jar + true + + + + + install-external-4 + + install-file + + install + + ${basedir}/lib/edtftpj.jar + ftpClient + edtftp + 1.0.0 + jar + true + + + + @@ -342,28 +461,37 @@ - - - - - - - - - - - - - - - - - - - - - + + org.apache.maven.plugins + maven-install-plugin + 2.5.2 + + + install-databricks + install-file + 
clean + + ${basedir}/lib/spark-xml_2.11-0.4.2.jar + com.databricks + spark-xml_2.11 + 0.4.1 + jar + true + + + + + + + + + io.netty + netty-all + 4.1.68.Final + + + \ No newline at end of file diff --git a/piflow-bundle/server.ip b/piflow-bundle/server.ip index 39633ba6..3defccaa 100644 --- a/piflow-bundle/server.ip +++ b/piflow-bundle/server.ip @@ -1 +1 @@ -server.ip=10.0.85.83 \ No newline at end of file +server.ip=172.18.32.1 \ No newline at end of file diff --git a/piflow-bundle/src/main/resources/flow/normalization/Discretization.json b/piflow-bundle/src/main/resources/flow/normalization/Discretization.json new file mode 100644 index 00000000..f445a4ff --- /dev/null +++ b/piflow-bundle/src/main/resources/flow/normalization/Discretization.json @@ -0,0 +1,38 @@ +{ + "flow":{ + "name":"test", + "uuid":"1234", + "stops":[ + { + "uuid":"0000", + "name":"SelectHiveQL", + "bundle":"cn.piflow.bundle.hive.SelectHiveQL", + "properties":{ + "hiveQL":"select * from test.clean" + } + }, + { + "uuid":"1111", + "name":"Discretization", + "bundle":"cn.piflow.bundle.normalization.Discretization", + "properties":{ + "inputCol":"pre_normalization", + "outputCol":"finished_normalization", + "method": "EqualWidth", + "numBins": "5", + "k": "4" + } + + } + + ], + "paths":[ + { + "from":"SelectHiveQL", + "outport":"", + "inport":"", + "to":"Discretization" + } + ] + } +} \ No newline at end of file diff --git a/piflow-bundle/src/main/resources/flow/normalization/MaxMinNormalization.json b/piflow-bundle/src/main/resources/flow/normalization/MaxMinNormalization.json new file mode 100644 index 00000000..83a6f085 --- /dev/null +++ b/piflow-bundle/src/main/resources/flow/normalization/MaxMinNormalization.json @@ -0,0 +1,35 @@ +{ + "flow":{ + "name":"test", + "uuid":"1234", + "stops":[ + { + "uuid":"0000", + "name":"SelectHiveQL", + "bundle":"cn.piflow.bundle.hive.SelectHiveQL", + "properties":{ + "hiveQL":"select * from test.clean" + } + }, + { + "uuid":"1111", + "name":"MaxMinNormalization", + "bundle":"cn.piflow.bundle.normalization.MaxMinNormalization", + "properties":{ + "inputCol":"pre_normalization", + "outputCol":"finished_normalization" + } + + } + + ], + "paths":[ + { + "from":"SelectHiveQL", + "outport":"", + "inport":"", + "to":"MaxMinNormalization" + } + ] + } +} \ No newline at end of file diff --git a/piflow-bundle/src/main/resources/flow/normalization/ScopeNormalization.json b/piflow-bundle/src/main/resources/flow/normalization/ScopeNormalization.json new file mode 100644 index 00000000..4d0dac12 --- /dev/null +++ b/piflow-bundle/src/main/resources/flow/normalization/ScopeNormalization.json @@ -0,0 +1,37 @@ +{ + "flow":{ + "name":"test", + "uuid":"1234", + "stops":[ + { + "uuid":"0000", + "name":"SelectHiveQL", + "bundle":"cn.piflow.bundle.hive.SelectHiveQL", + "properties":{ + "hiveQL":"select * from test.clean" + } + }, + { + "uuid":"1111", + "name":"ScopeNormalization", + "bundle":"cn.piflow.bundle.normalization.ScopeNormalization", + "properties":{ + "inputCol":"pre_normalization", + "outputCol":"finished_normalization", + "range": "(0.0, 3.0)" + + } + + } + + ], + "paths":[ + { + "from":"SelectHiveQL", + "outport":"", + "inport":"", + "to":"ScopeNormalization" + } + ] + } +} \ No newline at end of file diff --git a/piflow-bundle/src/main/resources/flow/normalization/ZScore.json b/piflow-bundle/src/main/resources/flow/normalization/ZScore.json new file mode 100644 index 00000000..8a879b81 --- /dev/null +++ b/piflow-bundle/src/main/resources/flow/normalization/ZScore.json @@ -0,0 +1,35 @@ +{ + 
"flow":{ + "name":"test", + "uuid":"1234", + "stops":[ + { + "uuid":"0000", + "name":"SelectHiveQL", + "bundle":"cn.piflow.bundle.hive.SelectHiveQL", + "properties":{ + "hiveQL":"select * from test.clean" + } + }, + { + "uuid":"1112", + "name":"ZScore", + "bundle":"cn.piflow.bundle.normalization.ZScore", + "properties":{ + "inputCols":"pre_normalization", + "outputCols":"finished_normalization" + } + + } + + ], + "paths":[ + { + "from":"SelectHiveQL", + "outport":"", + "inport":"", + "to":"ZScore" + } + ] + } +} \ No newline at end of file diff --git a/piflow-bundle/src/main/resources/flow/script/scala.json b/piflow-bundle/src/main/resources/flow/script/scala.json index d18c5498..208edf49 100644 --- a/piflow-bundle/src/main/resources/flow/script/scala.json +++ b/piflow-bundle/src/main/resources/flow/script/scala.json @@ -20,7 +20,7 @@ "bundle":"cn.piflow.bundle.script.ExecuteScalaFile", "properties":{ "plugin": "ScalaTest_ExecuteScalaFile_123123123", - "script":" val df = in.read()\n df.show()\n df.createOrReplaceTempView(\"people\")\n val df1 = spark.sql(\"select * from people where author like '%xjzhu%'\")\n out.write(df1)" + "script":" val df = in.read().getSparkDf\n df.show()\n df.createOrReplaceTempView(\"people\")\n val df1 = spark.sql(\"select * from people where author like '%xjzhu%'\")\n out.write(df1)" } }, { diff --git a/piflow-bundle/src/main/resources/icon/ceph.png b/piflow-bundle/src/main/resources/icon/ceph.png new file mode 100644 index 00000000..0ef6b557 Binary files /dev/null and b/piflow-bundle/src/main/resources/icon/ceph.png differ diff --git a/piflow-bundle/src/main/resources/icon/ceph/ceph.png b/piflow-bundle/src/main/resources/icon/ceph/ceph.png new file mode 100644 index 00000000..bd877694 Binary files /dev/null and b/piflow-bundle/src/main/resources/icon/ceph/ceph.png differ diff --git a/piflow-bundle/src/main/resources/icon/jdbc/dameng.png b/piflow-bundle/src/main/resources/icon/jdbc/dameng.png new file mode 100644 index 00000000..5a64c1f9 Binary files /dev/null and b/piflow-bundle/src/main/resources/icon/jdbc/dameng.png differ diff --git a/piflow-bundle/src/main/resources/icon/jdbc/tbase.png b/piflow-bundle/src/main/resources/icon/jdbc/tbase.png index 0dc907b8..671477c1 100644 Binary files a/piflow-bundle/src/main/resources/icon/jdbc/tbase.png and b/piflow-bundle/src/main/resources/icon/jdbc/tbase.png differ diff --git a/piflow-bundle/src/main/resources/icon/normalization/DiscretizationNormalization.png b/piflow-bundle/src/main/resources/icon/normalization/DiscretizationNormalization.png new file mode 100644 index 00000000..7c62193a Binary files /dev/null and b/piflow-bundle/src/main/resources/icon/normalization/DiscretizationNormalization.png differ diff --git a/piflow-bundle/src/main/resources/icon/normalization/MaxMinNormalization.png b/piflow-bundle/src/main/resources/icon/normalization/MaxMinNormalization.png new file mode 100644 index 00000000..9a9511e8 Binary files /dev/null and b/piflow-bundle/src/main/resources/icon/normalization/MaxMinNormalization.png differ diff --git a/piflow-bundle/src/main/resources/icon/normalization/ScopeNormalization.png b/piflow-bundle/src/main/resources/icon/normalization/ScopeNormalization.png new file mode 100644 index 00000000..0fc1b6aa Binary files /dev/null and b/piflow-bundle/src/main/resources/icon/normalization/ScopeNormalization.png differ diff --git a/piflow-bundle/src/main/resources/icon/normalization/ZScoreNormalization.png b/piflow-bundle/src/main/resources/icon/normalization/ZScoreNormalization.png new file mode 
100644 index 00000000..1c9b85e4 Binary files /dev/null and b/piflow-bundle/src/main/resources/icon/normalization/ZScoreNormalization.png differ diff --git a/piflow-bundle/src/main/resources/icon/unstructured/DocxParser.png b/piflow-bundle/src/main/resources/icon/unstructured/DocxParser.png new file mode 100644 index 00000000..5a7b42a4 Binary files /dev/null and b/piflow-bundle/src/main/resources/icon/unstructured/DocxParser.png differ diff --git a/piflow-bundle/src/main/resources/icon/unstructured/HtmlParser.png b/piflow-bundle/src/main/resources/icon/unstructured/HtmlParser.png new file mode 100644 index 00000000..ea4c3df4 Binary files /dev/null and b/piflow-bundle/src/main/resources/icon/unstructured/HtmlParser.png differ diff --git a/piflow-bundle/src/main/resources/icon/unstructured/ImageParser.png b/piflow-bundle/src/main/resources/icon/unstructured/ImageParser.png new file mode 100644 index 00000000..f9f63b0a Binary files /dev/null and b/piflow-bundle/src/main/resources/icon/unstructured/ImageParser.png differ diff --git a/piflow-bundle/src/main/resources/icon/unstructured/PdfParser.png b/piflow-bundle/src/main/resources/icon/unstructured/PdfParser.png new file mode 100644 index 00000000..8ee74b9c Binary files /dev/null and b/piflow-bundle/src/main/resources/icon/unstructured/PdfParser.png differ diff --git a/piflow-bundle/src/main/resources/icon/unstructured/PptxParser.png b/piflow-bundle/src/main/resources/icon/unstructured/PptxParser.png new file mode 100644 index 00000000..c9186149 Binary files /dev/null and b/piflow-bundle/src/main/resources/icon/unstructured/PptxParser.png differ diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/TDengine/TDengineRead.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/TDengine/TDengineRead.scala index 9fce6871..70720ab0 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/TDengine/TDengineRead.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/TDengine/TDengineRead.scala @@ -4,6 +4,7 @@ import cn.piflow._ import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} import cn.piflow.conf.{ConfigurableStop, Language, Port, StopGroup} +import cn.piflow.util.SciDataFrame import org.apache.spark.sql.SparkSession @@ -31,7 +32,7 @@ class TDengineRead extends ConfigurableStop{ .option("password",password) .load() - out.write(jdbcDF) + out.write(new SciDataFrame(jdbcDF)) } override def setProperties(map: Map[String, Any]): Unit = { diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/TDengine/TDengineWrite.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/TDengine/TDengineWrite.scala index df90de0c..bbb22220 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/TDengine/TDengineWrite.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/TDengine/TDengineWrite.scala @@ -29,7 +29,7 @@ class TDengineWrite extends ConfigurableStop{ properties.put("user", user) properties.put("password", password) properties.put("driver",driver) - val df = in.read() + val df = in.read().getSparkDf df.write.mode(SaveMode.Append).jdbc(url,dbtable,properties) } diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/arrowflight/ArrowFlightOut.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/arrowflight/ArrowFlightOut.scala new file mode 100644 index 00000000..b561c896 --- /dev/null +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/arrowflight/ArrowFlightOut.scala @@ -0,0 +1,219 @@ +package cn.piflow.bundle.arrowflight + +import cn.piflow.conf.bean.PropertyDescriptor +import 
cn.piflow.conf.util.{ImageUtil, MapUtil} +import cn.piflow.{JobContext, JobInputStream, JobOutputStream, ProcessContext} +import cn.piflow.conf._ +import org.apache.spark.sql.SaveMode +import org.apache.spark.sql.execution.arrow.ArrowConverters +import org.apache.spark.sql.{Row, SparkSession} +import org.apache.spark.sql.types.{DataType, StructType} +import org.apache.spark.sql.util.ArrowUtils +import org.apache.arrow.memory.RootAllocator +import org.apache.arrow.vector.ipc.{ArrowFileWriter, WriteChannel} +import org.apache.arrow.vector.{BigIntVector, BitVector, DateDayVector, Float8Vector, IntVector, ValueVector, VarCharVector, VectorSchemaRoot} +import org.apache.arrow.vector.util.VectorBatchAppender + +import java.io.{File, FileOutputStream} +import java.net.{ServerSocket, Socket} +import java.nio.channels.Channels +import scala.collection.JavaConverters._ +import org.apache.arrow.vector.types.pojo.{ArrowType, Field, FieldType, Schema} +import org.apache.arrow.vector.types.{FloatingPointPrecision, TimeUnit} +import org.apache.spark.sql.types._ + +import java.nio.charset.StandardCharsets + + +class ArrowFlightOut extends ConfigurableStop{ + val authorEmail: String = "zjliang@cnic.cn" + val description: String = "Output the data as arrow file format." + val inportList: List[String] = List(Port.DefaultPort) + val outportList: List[String] = List(Port.DefaultPort) + + var outputIp: String = _ +// var header: Boolean = _ +// var delimiter: String = _ +// var partition :String= _ +// var saveMode:String = _ + + override def setProperties(map: Map[String, Any]): Unit = { + outputIp = MapUtil.get(map,"outputIp").asInstanceOf[String] +// header = MapUtil.get(map,"header").asInstanceOf[String].toBoolean +// delimiter = MapUtil.get(map,"delimiter").asInstanceOf[String] +// partition = MapUtil.get(map,key="partition").asInstanceOf[String] +// saveMode = MapUtil.get(map,"saveMode").asInstanceOf[String] + + } + + override def getPropertyDescriptor(): List[PropertyDescriptor] = { + +// val saveModeOption = Set("append","overwrite","error","ignore") + var descriptor : List[PropertyDescriptor] = List() + + val outputIp = new PropertyDescriptor() + .name("outputIp") + .displayName("outputIp") + .description("The output ip of file") + .defaultValue("") + .required(true) + .example("127.0.0.1") + descriptor = outputIp :: descriptor + +// val header = new PropertyDescriptor() +// .name("header") +// .displayName("Header") +// .description("Whether the csv file has a header") +// .allowableValues(Set("true","false")) +// .defaultValue("false") +// .required(true) +// .example("false") +// descriptor = header :: descriptor +// +// val delimiter = new PropertyDescriptor() +// .name("delimiter") +// .displayName("Delimiter") +// .description("The delimiter of csv file") +// .defaultValue(",") +// .required(true) +// .example(",") +// descriptor = delimiter :: descriptor +// +// val partition = new PropertyDescriptor() +// .name("partition") +// .displayName("Partition") +// .description("The partition of csv file,you can specify the number of partitions saved as csv or not") +// .defaultValue("") +// .required(false) +// .example("3") +// descriptor = partition :: descriptor +// +// val saveMode = new PropertyDescriptor() +// .name("saveMode") +// .displayName("SaveMode") +// .description("The save mode for csv file") +// .allowableValues(saveModeOption) +// .defaultValue("append") +// .required(true) +// .example("append") +// descriptor = saveMode :: descriptor + + descriptor + } + + override def getIcon(): 
Array[Byte] = { + ImageUtil.getImage("icon/csv/CsvSave.png") + } + + override def getGroup(): List[String] = { + List(StopGroup.FlightGroup) + } + + override def initialize(ctx: ProcessContext): Unit = { + + } + def sparkTypeToArrowType(dataType: DataType): ArrowType = dataType match { + case IntegerType => new ArrowType.Int(32, true) + case LongType => new ArrowType.Int(64, true) + case FloatType => new ArrowType.FloatingPoint(FloatingPointPrecision.SINGLE) + case DoubleType => new ArrowType.FloatingPoint(FloatingPointPrecision.DOUBLE) + case StringType => new ArrowType.Utf8() + case BooleanType => ArrowType.Bool.INSTANCE + case BinaryType => ArrowType.Binary.INSTANCE + case TimestampType => new ArrowType.Timestamp(TimeUnit.MILLISECOND, null) + case _ => throw new UnsupportedOperationException(s"Unsupported type: $dataType") + } + + def toArrowSchema(schema: StructType): Schema = { + val fields = schema.fields.map { field => + new Field( + field.name, + FieldType.nullable(sparkTypeToArrowType(field.dataType)), + null + ) + }.toList + new Schema(fields.asJava) + } + + type FieldProcessor = (Int, Any) => Unit + private def createFieldProcessor(sparkType: DataType, vector: ValueVector): FieldProcessor = + (sparkType, vector) match { + // Int 类型 (Integer/Numeric) + case (_: IntegerType, vec: IntVector) => (rowIdx, value) => + if (value == null) vec.setNull(rowIdx) + else vec.setSafe(rowIdx, value.asInstanceOf[Int]) + // 字符串类型 + case (_: StringType, vec: VarCharVector) => (rowIdx, value) => + if (value == null) { + vec.setNull(rowIdx) + } else { + val strValue = value.toString + val bytes = strValue.getBytes(StandardCharsets.UTF_8) + vec.setSafe(rowIdx, bytes, 0, bytes.length) + } + // Double 类型 + case (_: DoubleType, vec: Float8Vector) => (rowIdx, value) => + if (value == null) vec.setNull(rowIdx) + else vec.setSafe(rowIdx, value.asInstanceOf[Double]) + // Long 类型 + case (_: LongType, vec: BigIntVector) => (rowIdx, value) => + if (value == null) vec.setNull(rowIdx) + else vec.setSafe(rowIdx, value.asInstanceOf[Long]) + // Boolean 类型(使用 BitVector) + case (_: BooleanType, vec: BitVector) => (rowIdx, value) => + if (value == null) vec.setNull(rowIdx) + else vec.setSafe(rowIdx, if (value.asInstanceOf[Boolean]) 1 else 0) + // Date类型(示例) + case (_: DateType, vec: DateDayVector) => (rowIdx, value) => + if (value == null) vec.setNull(rowIdx) + else vec.setSafe(rowIdx, value.asInstanceOf[Int]) // 需根据实际日期格式转换 + case _ => throw new IllegalArgumentException( + s"Unsupported type combination: SparkType=$sparkType, VectorType=${vector.getClass}" + ) + } + + override def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { + val df = in.read().getSparkDf + + val allocator = new RootAllocator(Long.MaxValue) + val arrowSchema = toArrowSchema(df.schema) + val root = VectorSchemaRoot.create(arrowSchema, allocator) + + val serverSocket = new ServerSocket(9090) + + println("Server is listening on port 9090") + + + try { + root.allocateNew() + // 创建类型映射的字段处理器 + val fieldProcessors = df.schema.zipWithIndex.map { case (field, idx) => + createFieldProcessor(field.dataType, root.getVector(idx)) // 动态绑定对应 Vector 类型 + } + // 逐行处理数据 + val rows = df.collect().toList + root.setRowCount(rows.size) + for { + (row, rowIndex) <- rows.zipWithIndex + (value, processor) <- row.toSeq.zip(fieldProcessors) + } { + processor(rowIndex, value) // 类型安全地写入数据 + } + val socket: Socket = serverSocket.accept() + val writer = new ArrowFileWriter(root, null, Channels.newChannel(socket.getOutputStream)) + try { + + 
writer.start() + writer.writeBatch() + writer.end() + } finally { + writer.close() + socket.close() + } + } finally { + root.close() + allocator.close() + } + } + +} diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/asr/ChineseSpeechRecognition.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/asr/ChineseSpeechRecognition.scala index 46263250..3a7a8760 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/asr/ChineseSpeechRecognition.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/asr/ChineseSpeechRecognition.scala @@ -1,11 +1,11 @@ package cn.piflow.bundle.asr import java.io.{File, FileNotFoundException} - import cn.piflow._ import cn.piflow.conf._ import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} +import cn.piflow.util.SciDataFrame import org.apache.http.entity.ContentType import org.apache.http.util.EntityUtils import org.apache.spark.rdd.RDD @@ -90,7 +90,7 @@ class ChineseSpeechRecognition extends ConfigurableStop { )) val df: DataFrame = session.createDataFrame(rowRDD,schema) - out.write(df) + out.write(new SciDataFrame(df)) } diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/ceph/CephRead.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/ceph/CephRead.scala new file mode 100644 index 00000000..997392fb --- /dev/null +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/ceph/CephRead.scala @@ -0,0 +1,158 @@ +package cn.piflow.bundle.ceph + +import cn.piflow._ +import cn.piflow.conf._ +import cn.piflow.conf.bean.PropertyDescriptor +import cn.piflow.conf.util.{ImageUtil, MapUtil} +import cn.piflow.util.SciDataFrame +import org.apache.spark.sql.{DataFrame, SparkSession} + +class CephRead extends ConfigurableStop { + + val authorEmail: String = "niuzj@gmqil.com" + val description: String = "Read data from ceph" + val inportList: List[String] = List(Port.DefaultPort) + val outportList: List[String] = List(Port.DefaultPort) + + var cephAccessKey:String = _ + var cephSecretKey:String = _ + var cephEndpoint:String = _ + var types: String = _ + var path: String = _ + var header: Boolean = _ + var delimiter: String = _ + + def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { + val spark = pec.get[SparkSession]() + + spark.conf.set("fs.s3a.access.key", cephAccessKey) + spark.conf.set("fs.s3a.secret.key", cephSecretKey) + spark.conf.set("fs.s3a.endpoint", cephEndpoint) + spark.conf.set("fs.s3a.connection.ssl.enabled", "false") + + var df:DataFrame = null + + if (types == "parquet") { + df = spark.read + .parquet(path) + } + + if (types == "csv") { + + df = spark.read + .option("header", header) + .option("inferSchema", "true") + .option("delimiter", delimiter) + .csv(path) + } + + if (types == "json") { + df = spark.read + .json(path) + } + + out.write(new SciDataFrame(df)) + } + + def initialize(ctx: ProcessContext): Unit = { + + } + + + + override def setProperties(map: Map[String, Any]): Unit = { + cephAccessKey = MapUtil.get(map,"cephAccessKey").asInstanceOf[String] + cephSecretKey = MapUtil.get(map, "cephSecretKey").asInstanceOf[String] + cephEndpoint = MapUtil.get(map,"cephEndpoint").asInstanceOf[String] + types = MapUtil.get(map,"types").asInstanceOf[String] + path = MapUtil.get(map,"path").asInstanceOf[String] + header = MapUtil.get(map, "header").asInstanceOf[String].toBoolean + delimiter = MapUtil.get(map, "delimiter").asInstanceOf[String] + } + + override def getPropertyDescriptor(): List[PropertyDescriptor] = { + + var descriptor : List[PropertyDescriptor] = List() + + val 
cephAccessKey=new PropertyDescriptor() + .name("cephAccessKey") + .displayName("cephAccessKey") + .description("This parameter is of type String and represents the access key used to authenticate with the Ceph storage system.") + .defaultValue("") + .required(true) + .example("") + descriptor = cephAccessKey :: descriptor + + val cephSecretKey=new PropertyDescriptor() + .name("cephSecretKey") + .displayName("cephSecretKey") + .description("This parameter is of type String and represents the secret key used to authenticate with the Ceph storage system") + .defaultValue("") + .required(true) + .example("") + descriptor = cephSecretKey :: descriptor + + + + val cephEndpoint = new PropertyDescriptor() + .name("cephEndpoint") + .displayName("cephEndpoint") + .description("This parameter is of type String and represents the endpoint URL of the Ceph storage system. It is used to establish a connection with the Ceph cluster") + .defaultValue("") + .required(true) + .example("http://cephcluster:7480") + .sensitive(true) + descriptor = cephEndpoint :: descriptor + + val types = new PropertyDescriptor() + .name("types") + .displayName("Types") + .description("The format you want to write is json,csv,parquet") + .defaultValue("csv") + .allowableValues(Set("json", "csv", "parquet")) + .required(true) + .example("csv") + descriptor = types :: descriptor + + val header = new PropertyDescriptor() + .name("header") + .displayName("Header") + .description("Whether the csv file has a header") + .defaultValue("false") + .allowableValues(Set("true", "false")) + .required(true) + .example("true") + descriptor = header :: descriptor + + val delimiter = new PropertyDescriptor() + .name("delimiter") + .displayName("Delimiter") + .description("The delimiter of csv file") + .defaultValue("") + .required(true) + .example(",") + descriptor = delimiter :: descriptor + + + val path = new PropertyDescriptor() + .name("path") + .displayName("Path") + .description("The file path you want to write to") + .defaultValue("") + .required(true) + .example("s3a://radosgw-test/test_df") + descriptor = path :: descriptor + + descriptor + } + + override def getIcon(): Array[Byte] = { + ImageUtil.getImage("icon/ceph/ceph.png") + } + + override def getGroup(): List[String] = { + List(StopGroup.CephGroup) + } + + +} diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/ceph/CephWrite.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/ceph/CephWrite.scala new file mode 100644 index 00000000..9fd5910a --- /dev/null +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/ceph/CephWrite.scala @@ -0,0 +1,162 @@ +package cn.piflow.bundle.ceph + +import cn.piflow._ +import cn.piflow.conf._ +import cn.piflow.conf.bean.PropertyDescriptor +import cn.piflow.conf.util.{ImageUtil,MapUtil} +import org.apache.spark.sql.SparkSession + + +class CephWrite extends ConfigurableStop { + + + val authorEmail: String = "niuzj@gmqil.com" + val description: String = "Read data from ceph" + val inportList: List[String] = List(Port.DefaultPort) + val outportList: List[String] = List(Port.DefaultPort) + + var cephAccessKey:String = _ + var cephSecretKey:String = _ + var cephEndpoint:String = _ + var types: String = _ + var path:String = _ + var header: Boolean = _ + var delimiter: String = _ + + + def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { + val spark = pec.get[SparkSession]() + + spark.conf.set("fs.s3a.access.key", cephAccessKey) + spark.conf.set("fs.s3a.secret.key", cephSecretKey) + spark.conf.set("fs.s3a.endpoint", 
cephEndpoint) + spark.conf.set("fs.s3a.connection.ssl.enabled", "false") + + // Create a DataFrame from the data + val df = in.read().getSparkDf + + if (types == "parquet") { + df.write + .format("parquet") + .mode("overwrite") // only overwrite + .save(path) + } + + if (types == "csv") { + df.write + .format("csv") + .option("header", header) + .option("delimiter",delimiter) + .mode("overwrite") + .save(path) + } + + if (types == "json") { + df.write + .format("json") + .mode("overwrite") + .save(path) + } + + } + + def initialize(ctx: ProcessContext): Unit = { + + } + + override def setProperties(map: Map[String, Any]): Unit = { + cephAccessKey = MapUtil.get(map, "cephAccessKey").asInstanceOf[String] + cephSecretKey = MapUtil.get(map, "cephSecretKey").asInstanceOf[String] + cephEndpoint = MapUtil.get(map, "cephEndpoint").asInstanceOf[String] + types = MapUtil.get(map, "types").asInstanceOf[String] + path = MapUtil.get(map, "path").asInstanceOf[String] + header = MapUtil.get(map, "header").asInstanceOf[String].toBoolean + delimiter = MapUtil.get(map, "delimiter").asInstanceOf[String] + } + + override def getPropertyDescriptor(): List[PropertyDescriptor] = { + + var descriptor : List[PropertyDescriptor] = List() + + val cephAccessKey=new PropertyDescriptor() + .name("cephAccessKey") + .displayName("cephAccessKey") + .description("This parameter is of type String and represents the access key used to authenticate with the Ceph storage system.") + .defaultValue("") + .required(true) + .example("") + descriptor = cephAccessKey :: descriptor + + val cephSecretKey=new PropertyDescriptor() + .name("cephSecretKey") + .displayName("cephSecretKey") + .description("This parameter is of type String and represents the secret key used to authenticate with the Ceph storage system") + .defaultValue("") + .required(true) + .example("") + descriptor = cephSecretKey :: descriptor + + val cephEndpoint = new PropertyDescriptor() + .name("cephEndpoint") + .displayName("cephEndpoint") + .description("This parameter is of type String and represents the endpoint URL of the Ceph storage system. 
It is used to establish a connection with the Ceph cluster") + .defaultValue("") + .required(true) + .example("http://cephcluster:7480") + .sensitive(true) + descriptor = cephEndpoint :: descriptor + + val types = new PropertyDescriptor() + .name("types") + .displayName("Types") + .description("The format you want to write is json,csv,parquet") + .defaultValue("csv") + .allowableValues(Set("json", "csv", "parquet")) + .required(true) + .example("csv") + descriptor = types :: descriptor + + val delimiter = new PropertyDescriptor() + .name("delimiter") + .displayName("Delimiter") + .description("The delimiter of csv file") + .defaultValue(",") + .required(true) + .example(",") + descriptor = delimiter :: descriptor + + + val header = new PropertyDescriptor() + .name("header") + .displayName("Header") + .description("Whether the csv file has a header") + .defaultValue("true") + .allowableValues(Set("true", "false")) + .required(true) + .example("true") + descriptor = header :: descriptor + + + val path= new PropertyDescriptor() + .name("path") + .displayName("Path") + .description("The file path you want to write to") + .defaultValue("s3a://radosgw-test/test_df") + .required(true) + .example("s3a://radosgw-test/test_df") + descriptor = path :: descriptor + + + descriptor + } + + override def getIcon(): Array[Byte] = { + ImageUtil.getImage("icon/ceph/ceph.png") + } + + override def getGroup(): List[String] = { + List(StopGroup.CephGroup) + } + + +} diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/clean/EmailClean.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/clean/EmailClean.scala index 43ac6831..6f006e5e 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/clean/EmailClean.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/clean/EmailClean.scala @@ -5,6 +5,7 @@ import cn.piflow.{JobContext, JobInputStream, JobOutputStream, ProcessContext} import cn.piflow.conf._ import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} +import cn.piflow.util.SciDataFrame import org.apache.spark.sql.SparkSession class EmailClean extends ConfigurableStop{ @@ -19,7 +20,7 @@ class EmailClean extends ConfigurableStop{ def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { val spark = pec.get[SparkSession]() val sqlContext=spark.sqlContext - val dfOld = in.read() + val dfOld = in.read().getSparkDf dfOld.createOrReplaceTempView("thesis") sqlContext.udf.register("regexPro",(str:String)=>CleanUtil.processEmail(str)) val structFields: Array[String] = dfOld.schema.fieldNames @@ -51,7 +52,7 @@ class EmailClean extends ConfigurableStop{ val sqlTextNew:String = "select " + schemaStr.substring(0,schemaStr.length -1) + " from thesis" val dfNew1=sqlContext.sql(sqlTextNew) - out.write(dfNew1) + out.write(new SciDataFrame(dfNew1)) } def initialize(ctx: ProcessContext): Unit = { diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/clean/IdentityNumberClean.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/clean/IdentityNumberClean.scala index 2d2c933b..bc2c45de 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/clean/IdentityNumberClean.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/clean/IdentityNumberClean.scala @@ -6,6 +6,7 @@ import cn.piflow.{JobContext, JobInputStream, JobOutputStream, ProcessContext} import cn.piflow.conf._ import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} +import cn.piflow.util.SciDataFrame import org.apache.spark.sql.{DataFrame, 
SparkSession} @@ -22,7 +23,7 @@ class IdentityNumberClean extends ConfigurableStop{ def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { val spark = pec.get[SparkSession]() val sqlContext=spark.sqlContext - val dfOld = in.read() + val dfOld = in.read().getSparkDf dfOld.createOrReplaceTempView("thesis") sqlContext.udf.register("regexPro",(str:String)=>CleanUtil.processCardCode(str)) val structFields: Array[String] = dfOld.schema.fieldNames @@ -54,7 +55,7 @@ class IdentityNumberClean extends ConfigurableStop{ val sqlTextNew:String = "select " + schemaStr.substring(0,schemaStr.length -1) + " from thesis" val dfNew1=sqlContext.sql(sqlTextNew) - out.write(dfNew1) + out.write(new SciDataFrame(dfNew1)) } diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/clean/PhoneNumberClean.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/clean/PhoneNumberClean.scala index d0a65d38..1895f2de 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/clean/PhoneNumberClean.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/clean/PhoneNumberClean.scala @@ -5,6 +5,7 @@ import cn.piflow.{JobContext, JobInputStream, JobOutputStream, ProcessContext} import cn.piflow.conf._ import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} +import cn.piflow.util.SciDataFrame import org.apache.spark.sql.SparkSession import org.apache.spark.sql.types.StructField @@ -20,7 +21,7 @@ class PhoneNumberClean extends ConfigurableStop{ def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { val spark = pec.get[SparkSession]() val sqlContext=spark.sqlContext - val dfOld = in.read() + val dfOld = in.read().getSparkDf dfOld.createOrReplaceTempView("thesis") sqlContext.udf.register("regexPro",(str:String)=>CleanUtil.processPhonenum(str)) val structFields: Array[String] = dfOld.schema.fieldNames @@ -52,7 +53,7 @@ class PhoneNumberClean extends ConfigurableStop{ val sqlTextNew:String = "select " + schemaStr.substring(0,schemaStr.length -1) + " from thesis" val dfNew1=sqlContext.sql(sqlTextNew) - out.write(dfNew1) + out.write(new SciDataFrame(dfNew1)) } diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/clean/ProvinceClean.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/clean/ProvinceClean.scala index 09188341..9593d32e 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/clean/ProvinceClean.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/clean/ProvinceClean.scala @@ -4,6 +4,7 @@ import cn.piflow.bundle.util.CleanUtil import cn.piflow.conf._ import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} +import cn.piflow.util.SciDataFrame import cn.piflow.{JobContext, JobInputStream, JobOutputStream, ProcessContext} import org.apache.spark.sql.SparkSession @@ -18,7 +19,7 @@ class ProvinceClean extends ConfigurableStop{ def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { val spark = pec.get[SparkSession]() val sqlContext=spark.sqlContext - val dfOld = in.read() + val dfOld = in.read().getSparkDf dfOld.createOrReplaceTempView("thesis") sqlContext.udf.register("regexPro",(str:String)=>CleanUtil.processProvince(str)) val structFields: Array[String] = dfOld.schema.fieldNames @@ -50,7 +51,7 @@ class ProvinceClean extends ConfigurableStop{ val sqlTextNew:String = "select " + schemaStr.substring(0,schemaStr.length -1) + " from thesis" val dfNew1=sqlContext.sql(sqlTextNew) - out.write(dfNew1) + out.write(new SciDataFrame(dfNew1)) } diff --git 
a/piflow-bundle/src/main/scala/cn/piflow/bundle/clean/TitleClean.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/clean/TitleClean.scala index a176980f..e1145bdf 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/clean/TitleClean.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/clean/TitleClean.scala @@ -5,6 +5,7 @@ import cn.piflow.{JobContext, JobInputStream, JobOutputStream, ProcessContext} import cn.piflow.conf._ import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} +import cn.piflow.util.SciDataFrame import org.apache.spark.sql.SparkSession import org.apache.spark.sql.types.StructField @@ -19,7 +20,7 @@ class TitleClean extends ConfigurableStop{ def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { val spark = pec.get[SparkSession]() val sqlContext=spark.sqlContext - val dfOld = in.read() + val dfOld = in.read().getSparkDf dfOld.createOrReplaceTempView("thesis") sqlContext.udf.register("regexPro",(str:String)=>CleanUtil.processTitle(str)) val structFields: Array[String] = dfOld.schema.fieldNames @@ -51,7 +52,7 @@ class TitleClean extends ConfigurableStop{ val sqlTextNew:String = "select " + schemaStr.substring(0,schemaStr.length -1) + " from thesis" val dfNew1=sqlContext.sql(sqlTextNew) - out.write(dfNew1) + out.write(new SciDataFrame(dfNew1)) } def initialize(ctx: ProcessContext): Unit = { diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/clickhouse/ClickhouseRead.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/clickhouse/ClickhouseRead.scala index ca8aa3cd..53d8d69b 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/clickhouse/ClickhouseRead.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/clickhouse/ClickhouseRead.scala @@ -3,6 +3,7 @@ package cn.piflow.bundle.clickhouse import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} import cn.piflow.conf.{ConfigurableStop, Language, Port, StopGroup} +import cn.piflow.util.SciDataFrame import cn.piflow.{JobContext, JobInputStream, JobOutputStream, ProcessContext} import org.apache.spark.sql.{DataFrame, SparkSession} @@ -36,7 +37,7 @@ class ClickhouseRead extends ConfigurableStop { .options(options) .load() jdbcDF.show() - out.write(jdbcDF) + out.write(new SciDataFrame(jdbcDF)) } diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/clickhouse/ClickhouseWrite.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/clickhouse/ClickhouseWrite.scala index aff3c031..b64575bf 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/clickhouse/ClickhouseWrite.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/clickhouse/ClickhouseWrite.scala @@ -4,6 +4,7 @@ import cn.piflow._ import cn.piflow.conf._ import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} +import cn.piflow.util.SciDataFrame import org.apache.spark.sql.{DataFrame, SaveMode} import java.util.Properties @@ -22,7 +23,7 @@ class ClickhouseWrite extends ConfigurableStop{ var dbtable:String = _ def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { - val jdbcDF: DataFrame = in.read() + val jdbcDF: DataFrame = in.read().getSparkDf val properties: Properties = new Properties() properties.put("driver", driver) if (user != null && user.nonEmpty) { @@ -35,7 +36,7 @@ class ClickhouseWrite extends ConfigurableStop{ "numPartitions" -> "1" ) jdbcDF.write.mode(SaveMode.Append).options(options).jdbc(url, dbtable, properties) - out.write(jdbcDF) + out.write(new 
SciDataFrame(jdbcDF)) } def initialize(ctx: ProcessContext): Unit = { diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/common/AddUUIDStop.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/common/AddUUIDStop.scala index d01c3756..c1c89066 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/common/AddUUIDStop.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/common/AddUUIDStop.scala @@ -1,10 +1,10 @@ package cn.piflow.bundle.common import java.util.UUID - import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} import cn.piflow.conf.{ConfigurableStop, Port, StopGroup} +import cn.piflow.util.SciDataFrame import cn.piflow.{JobContext, JobInputStream, JobOutputStream, ProcessContext} import org.apache.spark.sql.{DataFrame, SparkSession} @@ -19,14 +19,14 @@ class AddUUIDStop extends ConfigurableStop{ override def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { val spark = pec.get[SparkSession]() - var df = in.read() + var df = in.read().getSparkDf spark.udf.register("generateUUID",()=>UUID.randomUUID().toString.replace("-","")) df.createOrReplaceTempView("temp") df = spark.sql(s"select generateUUID() as ${column},* from temp") - out.write(df) + out.write(new SciDataFrame(df)) } diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/common/ConvertSchema.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/common/ConvertSchema.scala index a8fb9493..c1fa16f6 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/common/ConvertSchema.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/common/ConvertSchema.scala @@ -4,6 +4,7 @@ import cn.piflow._ import cn.piflow.conf._ import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} +import cn.piflow.util.SciDataFrame class ConvertSchema extends ConfigurableStop { @@ -15,7 +16,7 @@ class ConvertSchema extends ConfigurableStop { var schema:String = _ def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { - var df = in.read() + var df = in.read().getSparkDf val field = schema.split(",").map(x => x.trim) @@ -24,7 +25,7 @@ class ConvertSchema extends ConfigurableStop { df = df.withColumnRenamed(old_new(0),old_new(1)) }) - out.write(df) + out.write(new SciDataFrame(df)) } diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/common/Distinct.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/common/Distinct.scala index 38e493e3..bd6a2d80 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/common/Distinct.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/common/Distinct.scala @@ -3,6 +3,7 @@ package cn.piflow.bundle.common import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} import cn.piflow.conf.{ConfigurableStop, Port, StopGroup} +import cn.piflow.util.SciDataFrame import cn.piflow.{JobContext, JobInputStream, JobOutputStream, ProcessContext} import org.apache.spark.sql.DataFrame @@ -47,7 +48,7 @@ class Distinct extends ConfigurableStop{ } override def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { - val inDf: DataFrame = in.read() + val inDf: DataFrame = in.read().getSparkDf var outDf: DataFrame = null if(columnNames.length > 0){ val fileArr: Array[String] = columnNames.split(",") @@ -55,6 +56,6 @@ class Distinct extends ConfigurableStop{ }else{ outDf = inDf.distinct() } - out.write(outDf) + out.write(new SciDataFrame(outDf)) } } diff --git 
a/piflow-bundle/src/main/scala/cn/piflow/bundle/common/DropField.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/common/DropField.scala index 1378af71..19001cdd 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/common/DropField.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/common/DropField.scala @@ -4,6 +4,7 @@ import cn.piflow._ import cn.piflow.conf._ import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} +import cn.piflow.util.SciDataFrame class DropField extends ConfigurableStop { @@ -16,14 +17,14 @@ class DropField extends ConfigurableStop { var columnNames:String = _ def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { - var df = in.read() + var df = in.read().getSparkDf val field = columnNames.split(",").map(x => x.trim) for( x <- 0 until field.size){ df = df.drop(field(x)) } - out.write(df) + out.write(new SciDataFrame(df)) } def initialize(ctx: ProcessContext): Unit = { diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/common/ExecuteSQLStop.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/common/ExecuteSQLStop.scala index ac128fd8..bfda3185 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/common/ExecuteSQLStop.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/common/ExecuteSQLStop.scala @@ -7,6 +7,7 @@ import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} import cn.piflow.lib._ import cn.piflow.lib.io.{FileFormat, TextFile} +import cn.piflow.util.SciDataFrame import org.apache.spark.sql.{DataFrame, SparkSession} class ExecuteSQLStop extends ConfigurableStop{ @@ -23,11 +24,11 @@ class ExecuteSQLStop extends ConfigurableStop{ override def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { val spark = pec.get[SparkSession]() - val inDF = in.read() + val inDF = in.read().getSparkDf inDF.createOrReplaceTempView(ViewName) val frame: DataFrame = spark.sql(sql) - out.write(frame) + out.write(new SciDataFrame(frame)) } diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/common/Filter.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/common/Filter.scala index fe2b164f..755270fb 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/common/Filter.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/common/Filter.scala @@ -4,6 +4,7 @@ import cn.piflow.{JobContext, JobInputStream, JobOutputStream, ProcessContext} import cn.piflow.conf.{ConfigurableStop, Port, StopGroup} import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} +import cn.piflow.util.SciDataFrame import org.apache.spark.sql.{Column, DataFrame} class Filter extends ConfigurableStop{ @@ -45,10 +46,10 @@ class Filter extends ConfigurableStop{ override def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { - val df = in.read() + val df = in.read().getSparkDf var filterDF : DataFrame = df.filter(condition) - out.write(filterDF) + out.write(new SciDataFrame(filterDF)) } } diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/common/Fork.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/common/Fork.scala index 279e67e8..3f11cc93 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/common/Fork.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/common/Fork.scala @@ -3,6 +3,7 @@ package cn.piflow.bundle.common import cn.piflow.conf._ import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} +import 
cn.piflow.util.SciDataFrame import cn.piflow.{JobContext, JobInputStream, JobOutputStream, ProcessContext} @@ -26,8 +27,8 @@ class Fork extends ConfigurableStop{ } override def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { - val df = in.read().cache() - outports.foreach(out.write(_, df)); + val df = in.read().getSparkDf.cache() + outports.foreach(out.write(_, new SciDataFrame(df))); } override def getPropertyDescriptor(): List[PropertyDescriptor] = { diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/common/Join.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/common/Join.scala index 6dffbfaa..7f1c2002 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/common/Join.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/common/Join.scala @@ -3,6 +3,7 @@ package cn.piflow.bundle.common import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} import cn.piflow.conf.{ConfigurableStop, Port, StopGroup} +import cn.piflow.util.SciDataFrame import cn.piflow.{JobContext, JobInputStream, JobOutputStream, ProcessContext} import org.apache.spark.sql.{Column, DataFrame} @@ -17,8 +18,8 @@ class Join extends ConfigurableStop{ override def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { - val leftDF = in.read(Port.LeftPort) - val rightDF = in.read(Port.RightPort) + val leftDF = in.read(Port.LeftPort).getSparkDf + val rightDF = in.read(Port.RightPort).getSparkDf var seq: Seq[String]= Seq() correlationColumn.split(",").foreach(x=>{ @@ -32,7 +33,7 @@ class Join extends ConfigurableStop{ case "right" => df = leftDF.join(rightDF,seq,"right_outer") case "full_outer" => df = leftDF.join(rightDF,seq,"outer") } - out.write(df) + out.write(new SciDataFrame(df)) } override def setProperties(map: Map[String, Any]): Unit = { diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/common/Merge.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/common/Merge.scala index e5be9ead..bbb107f6 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/common/Merge.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/common/Merge.scala @@ -3,6 +3,7 @@ package cn.piflow.bundle.common import cn.piflow.conf._ import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} +import cn.piflow.util.SciDataFrame import cn.piflow.{JobContext, JobInputStream, JobOutputStream, ProcessContext} class Merge extends ConfigurableStop{ @@ -15,7 +16,7 @@ class Merge extends ConfigurableStop{ var inports : List[String] = _ def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { - out.write(in.ports().map(in.read(_)).reduce((x, y) => x.union(y))); + out.write(new SciDataFrame(in.ports().map(in.read(_).getSparkDf).reduce((x, y) => x.union(y)))); } def initialize(ctx: ProcessContext): Unit = { diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/common/MockData.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/common/MockData.scala index e3aa4a94..396a632f 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/common/MockData.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/common/MockData.scala @@ -5,6 +5,7 @@ import cn.piflow.{JobContext, JobInputStream, JobOutputStream, ProcessContext} import cn.piflow.conf.{ConfigurableStop, Port, StopGroup} import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} +import cn.piflow.util.SciDataFrame import org.apache.spark.sql.{DataFrame, SparkSession} import 
org.apache.spark.sql.types._ import org.json4s @@ -91,7 +92,7 @@ class MockData extends ConfigurableStop{ val schemaStructType = StructType(structFieldArray) val rnd : Random = new Random() val df = spark.read.schema(schemaStructType).json((0 to count -1 ).map{ _ => compact(randomJson(rnd,schemaStructType))}.toDS()) - out.write(df) + out.write(new SciDataFrame(df)) } private def randomJson( rnd: Random, dataType : DataType): JValue ={ diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/common/Route.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/common/Route.scala index 955a73d6..36028a5d 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/common/Route.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/common/Route.scala @@ -3,6 +3,7 @@ package cn.piflow.bundle.common import cn.piflow.conf._ import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} +import cn.piflow.util.SciDataFrame import cn.piflow.{JobContext, JobInputStream, JobOutputStream, ProcessContext} class Route extends ConfigurableStop{ @@ -23,7 +24,7 @@ class Route extends ConfigurableStop{ } override def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { - val df = in.read().cache() + val df = in.read().getSparkDf.cache() if(this.customizedProperties != null || this.customizedProperties.size != 0){ val it = this.customizedProperties.keySet.iterator @@ -31,10 +32,10 @@ class Route extends ConfigurableStop{ val port = it.next() val filterCondition = MapUtil.get(this.customizedProperties,port).asInstanceOf[String] val filterDf = df.filter(filterCondition) - out.write(port,filterDf) + out.write(port,new SciDataFrame(filterDf)) } } - out.write(df); + out.write(new SciDataFrame(df)); } override def getPropertyDescriptor(): List[PropertyDescriptor] = { diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/common/SelectField.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/common/SelectField.scala index b45cfe54..47b8f9fe 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/common/SelectField.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/common/SelectField.scala @@ -4,6 +4,7 @@ import cn.piflow._ import cn.piflow.conf._ import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} +import cn.piflow.util.SciDataFrame import org.apache.spark.sql.{Column, DataFrame} import scala.beans.BeanProperty @@ -19,7 +20,7 @@ class SelectField extends ConfigurableStop { var columnNames:String = _ def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { - val df = in.read() + val df = in.read().getSparkDf val field = columnNames.split(",").map(x => x.trim) val columnArray : Array[Column] = new Array[Column](field.size) @@ -28,7 +29,7 @@ class SelectField extends ConfigurableStop { } var finalFieldDF : DataFrame = df.select(columnArray:_*) - out.write(finalFieldDF) + out.write(new SciDataFrame(finalFieldDF)) } def initialize(ctx: ProcessContext): Unit = { diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/common/Subtract.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/common/Subtract.scala index 1f99f91f..f2ca25be 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/common/Subtract.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/common/Subtract.scala @@ -3,6 +3,7 @@ package cn.piflow.bundle.common import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.ImageUtil import cn.piflow.conf.{ConfigurableStop, Port, StopGroup} +import 
cn.piflow.util.SciDataFrame import cn.piflow.{JobContext, JobInputStream, JobOutputStream, ProcessContext} import org.apache.spark.api.java.JavaRDD import org.apache.spark.sql.types.StructType @@ -38,11 +39,11 @@ class Subtract extends ConfigurableStop{ override def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { val spark = pec.get[SparkSession]() - val leftDF = in.read(Port.LeftPort) - val rightDF = in.read(Port.RightPort) + val leftDF = in.read(Port.LeftPort).getSparkDf + val rightDF = in.read(Port.RightPort).getSparkDf val outDF = leftDF.except(rightDF) - out.write(outDF) + out.write(new SciDataFrame(outDF)) } } diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/csv/CsvParser.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/csv/CsvParser.scala index 9031b196..b3b459ac 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/csv/CsvParser.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/csv/CsvParser.scala @@ -1,6 +1,7 @@ package cn.piflow.bundle.csv import cn.piflow._ +import cn.piflow.util.SciDataFrame import cn.piflow.conf._ import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} @@ -23,13 +24,13 @@ class CsvParser extends ConfigurableStop{ def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { val spark = pec.get[SparkSession]() - var csvDF:DataFrame = null + var csvDF:SciDataFrame = new SciDataFrame(null) // start from a non-null wrapper so the setSparkDf calls below do not throw a NullPointerException (assumes the DataFrame constructor tolerates null) if (header){ - csvDF = spark.read + csvDF.setSparkDf(spark.read .option("header",header) .option("inferSchema","true") .option("delimiter",delimiter) - .csv(csvPath) + .csv(csvPath)) }else{ @@ -40,13 +41,13 @@ class CsvParser extends ConfigurableStop{ } val schemaStructType = StructType(structFieldArray) - csvDF = spark.read + csvDF.setSparkDf(spark.read .option("header",header) .option("inferSchema","false") .option("delimiter",delimiter) .option("timestampFormat","yyyy/MM/dd HH:mm:ss ZZ") .schema(schemaStructType) - .csv(csvPath) + .csv(csvPath)) } out.write(csvDF) diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/csv/CsvSave.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/csv/CsvSave.scala index ff422cc2..31bfae7b 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/csv/CsvSave.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/csv/CsvSave.scala @@ -5,6 +5,36 @@ import cn.piflow.conf.util.{ImageUtil, MapUtil} import cn.piflow.{JobContext, JobInputStream, JobOutputStream, ProcessContext} import cn.piflow.conf._ import org.apache.spark.sql.SaveMode +import org.apache.spark.sql.execution.arrow.ArrowConverters +import org.apache.spark.sql.{Row, SparkSession} +import org.apache.spark.sql.types.{DataType, StructType} +import org.apache.spark.sql.util.ArrowUtils +import org.apache.arrow.memory.RootAllocator +import org.apache.arrow.vector.ipc.{ArrowFileWriter, WriteChannel} +import org.apache.arrow.vector.{BigIntVector, BitVector, DateDayVector, Float8Vector, IntVector, ValueVector, VarCharVector, VectorSchemaRoot} +import org.apache.arrow.vector.util.VectorBatchAppender + +import java.io.{File, FileOutputStream} +import java.net.{ServerSocket, Socket} +import java.nio.channels.Channels +import scala.collection.JavaConverters._ +import org.apache.arrow.vector.types.pojo.{ArrowType, Field, FieldType, Schema} +import org.apache.arrow.vector.types.{FloatingPointPrecision, TimeUnit} +import org.apache.spark.sql.types._ + +import java.nio.charset.StandardCharsets + +import cn.piflow.conf.bean.PropertyDescriptor +import cn.piflow.conf.util.{ImageUtil, MapUtil} 
+import cn.piflow.{JobContext, JobInputStream, JobOutputStream, ProcessContext} +import cn.piflow.conf._ +import org.apache.arrow.vector.types.pojo.{ArrowType, Field, FieldType, Schema} +import org.apache.arrow.vector.types.{FloatingPointPrecision, TimeUnit} +import org.apache.arrow.vector.{BigIntVector, BitVector, DateDayVector, Float8Vector, IntVector, ValueVector, VarCharVector} +import org.apache.spark.sql.SaveMode +import org.apache.spark.sql.types.{BinaryType, BooleanType, DataType, DateType, DoubleType, FloatType, IntegerType, LongType, StringType, StructType, TimestampType} + +import java.nio.charset.StandardCharsets class CsvSave extends ConfigurableStop{ val authorEmail: String = "xjzhu@cnic.cn" @@ -94,9 +124,109 @@ class CsvSave extends ConfigurableStop{ override def initialize(ctx: ProcessContext): Unit = { } + def sparkTypeToArrowType(dataType: DataType): ArrowType = dataType match { + case IntegerType => new ArrowType.Int(32, true) + case LongType => new ArrowType.Int(64, true) + case FloatType => new ArrowType.FloatingPoint(FloatingPointPrecision.SINGLE) + case DoubleType => new ArrowType.FloatingPoint(FloatingPointPrecision.DOUBLE) + case StringType => new ArrowType.Utf8() + case BooleanType => ArrowType.Bool.INSTANCE + case BinaryType => ArrowType.Binary.INSTANCE + case TimestampType => new ArrowType.Timestamp(TimeUnit.MILLISECOND, null) + case _ => throw new UnsupportedOperationException(s"Unsupported type: $dataType") + } + + def toArrowSchema(schema: StructType): Schema = { + val fields = schema.fields.map { field => + new Field( + field.name, + FieldType.nullable(sparkTypeToArrowType(field.dataType)), + null + ) + }.toList + new Schema(fields.asJava) + } + + type FieldProcessor = (Int, Any) => Unit + private def createFieldProcessor(sparkType: DataType, vector: ValueVector): FieldProcessor = + (sparkType, vector) match { + // Int type (Integer/Numeric) + case (_: IntegerType, vec: IntVector) => (rowIdx, value) => + if (value == null) vec.setNull(rowIdx) + else vec.setSafe(rowIdx, value.asInstanceOf[Int]) + // String type + case (_: StringType, vec: VarCharVector) => (rowIdx, value) => + if (value == null) { + vec.setNull(rowIdx) + } else { + val strValue = value.toString + val bytes = strValue.getBytes(StandardCharsets.UTF_8) + vec.setSafe(rowIdx, bytes, 0, bytes.length) + } + // Double type + case (_: DoubleType, vec: Float8Vector) => (rowIdx, value) => + if (value == null) vec.setNull(rowIdx) + else vec.setSafe(rowIdx, value.asInstanceOf[Double]) + // Long type + case (_: LongType, vec: BigIntVector) => (rowIdx, value) => + if (value == null) vec.setNull(rowIdx) + else vec.setSafe(rowIdx, value.asInstanceOf[Long]) + // Boolean type (backed by BitVector) + case (_: BooleanType, vec: BitVector) => (rowIdx, value) => + if (value == null) vec.setNull(rowIdx) + else vec.setSafe(rowIdx, if (value.asInstanceOf[Boolean]) 1 else 0) + // Date type (example) + case (_: DateType, vec: DateDayVector) => (rowIdx, value) => + if (value == null) vec.setNull(rowIdx) + else vec.setSafe(rowIdx, value.asInstanceOf[Int]) // convert according to the actual date format if needed + case _ => throw new IllegalArgumentException( + s"Unsupported type combination: SparkType=$sparkType, VectorType=${vector.getClass}" + ) + } override def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { - val df = in.read() + val df = in.read().getSparkDf + + val allocator = new RootAllocator(Long.MaxValue) + val arrowSchema = toArrowSchema(df.schema) + val root = VectorSchemaRoot.create(arrowSchema, allocator) + + val serverSocket = new ServerSocket(9090) 
+ + println("Server is listening on port 9090") + + + try { + root.allocateNew() + // 创建类型映射的字段处理器 + val fieldProcessors = df.schema.zipWithIndex.map { case (field, idx) => + createFieldProcessor(field.dataType, root.getVector(idx)) // 动态绑定对应 Vector 类型 + } + // 逐行处理数据 + val rows = df.collect().toList + root.setRowCount(rows.size) + for { + (row, rowIndex) <- rows.zipWithIndex + (value, processor) <- row.toSeq.zip(fieldProcessors) + } { + processor(rowIndex, value) // 类型安全地写入数据 + } + val socket: Socket = serverSocket.accept() + val writer = new ArrowFileWriter(root, null, Channels.newChannel(socket.getOutputStream)) + try { + + writer.start() + writer.writeBatch() + writer.end() + } finally { + writer.close() + socket.close() + } + } finally { + root.close() + allocator.close() + } + if("".equals(partition)){ df.write diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/csv/CsvStringParser.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/csv/CsvStringParser.scala index 13e6abc1..60ac8e95 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/csv/CsvStringParser.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/csv/CsvStringParser.scala @@ -4,6 +4,7 @@ import cn.piflow.{JobContext, JobInputStream, JobOutputStream, ProcessContext} import cn.piflow.conf.{ConfigurableStop, Port, StopGroup} import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} +import cn.piflow.util.SciDataFrame import org.apache.spark.SparkContext import org.apache.spark.rdd.RDD import org.apache.spark.sql.types.{StringType, StructField, StructType} @@ -45,7 +46,7 @@ class CsvStringParser extends ConfigurableStop{ val fields: Array[StructField] = schema.split(",").map(d=>StructField(d.trim,StringType,nullable = true)) val NewSchema: StructType = StructType(fields) Fdf = session.createDataFrame(rowRDD,NewSchema) - out.write(Fdf) + out.write(new SciDataFrame(Fdf)) } diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/elasticsearch/PutElasticsearch.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/elasticsearch/PutElasticsearch.scala index 22738c4a..0771630a 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/elasticsearch/PutElasticsearch.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/elasticsearch/PutElasticsearch.scala @@ -20,7 +20,7 @@ class PutElasticsearch extends ConfigurableStop { var saveMode : String = _ def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { - val inDfES = in.read() + val inDfES = in.read().getSparkDf inDfES.write.format("org.elasticsearch.spark.sql") .option("es.nodes", es_nodes) diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/elasticsearch/ReadElasticsearch.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/elasticsearch/ReadElasticsearch.scala index d5c7a88b..80a76697 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/elasticsearch/ReadElasticsearch.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/elasticsearch/ReadElasticsearch.scala @@ -3,6 +3,7 @@ package cn.piflow.bundle.elasticsearch import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} import cn.piflow.conf.{ConfigurableStop, Port, StopGroup} +import cn.piflow.util.SciDataFrame import cn.piflow.{JobContext, JobInputStream, JobOutputStream, ProcessContext} import org.apache.spark.sql.SparkSession @@ -26,7 +27,7 @@ class ReadElasticsearch extends ConfigurableStop { .option("es.port", es_port) .load(s"${es_index}/${es_type}") - out.write(esDF) + out.write(new 
SciDataFrame(esDF)) } def initialize(ctx: ProcessContext): Unit = { diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/excel/ExcelRead.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/excel/ExcelRead.scala index 3faa7fea..ba77c9b0 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/excel/ExcelRead.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/excel/ExcelRead.scala @@ -3,6 +3,7 @@ package cn.piflow.bundle.excel import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} import cn.piflow.conf.{ConfigurableStop, Port, StopGroup} +import cn.piflow.util.SciDataFrame import cn.piflow.{JobContext, JobInputStream, JobOutputStream, ProcessContext} import org.apache.spark.sql.SparkSession @@ -31,7 +32,7 @@ class ExcelRead extends ConfigurableStop{ .option("header", header) .load(filePath) - out.write(frame) + out.write(new SciDataFrame(frame)) } override def setProperties(map: Map[String, Any]): Unit = { @@ -112,7 +113,7 @@ class ExcelRead extends ConfigurableStop{ } override def getIcon(): Array[Byte] = { - ImageUtil.getImage("icon/excel/excelParse.png",this.getClass.getName) + ImageUtil.getImage("icon/excel/excelParse.png") } override def getGroup(): List[String] = { diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/excel/ExcelWrite.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/excel/ExcelWrite.scala index 82e2e6dd..0c3237af 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/excel/ExcelWrite.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/excel/ExcelWrite.scala @@ -17,7 +17,7 @@ class ExcelWrite extends ConfigurableStop{ var saveMode: String = _ override def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { - val df = in.read() + val df = in.read().getSparkDf df.write .format("com.crealytics.spark.excel") .option("dataAddress",dataAddress) @@ -84,7 +84,7 @@ class ExcelWrite extends ConfigurableStop{ } override def getIcon(): Array[Byte] = { - ImageUtil.getImage("icon/excel/excelParse.png",this.getClass.getName) + ImageUtil.getImage("icon/excel/excelParse.png") } override def getGroup(): List[String] = { diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/excel/ExcelWriteMultipleSheets.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/excel/ExcelWriteMultipleSheets.scala new file mode 100644 index 00000000..1ef00677 --- /dev/null +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/excel/ExcelWriteMultipleSheets.scala @@ -0,0 +1,85 @@ +package cn.piflow.bundle.excel + +import cn.piflow.conf.bean.PropertyDescriptor +import cn.piflow.conf.util.{ImageUtil, MapUtil} +import cn.piflow.conf.{ConfigurableStop, Port, StopGroup} +import cn.piflow.{JobContext, JobInputStream, JobOutputStream, ProcessContext} + +class ExcelWriteMultipleSheets extends ConfigurableStop{ + val authorEmail: String = "ygang@cnic.cn" + val description: String = "Write multiple DataFrames into multiple sheets of the same Excel file" + val inportList: List[String] = List(Port.AnyPort) + val outportList: List[String] = List(Port.DefaultPort) + + var filePath: String = _ + var header: String = _ + + var inports : List[String] = _ + + override def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { + + inports.foreach(x=>{ + val df = in.read(x).getSparkDf + df.write + .format("com.crealytics.spark.excel") + .option("dataAddress",s"'${x}'!A1") + .option("header", header) + .mode("append") + .save(filePath) + }) + } + + override def setProperties(map: Map[String, Any]): Unit = { 
+ val inportStr = MapUtil.get(map,"inports").asInstanceOf[String] + inports = inportStr.split(",").map(x => x.trim).toList + + filePath = MapUtil.get(map,"filePath").asInstanceOf[String] + header = MapUtil.get(map,"header").asInstanceOf[String] + } + + override def getPropertyDescriptor(): List[PropertyDescriptor] = { + var descriptor : List[PropertyDescriptor] = List() + + val filePath = new PropertyDescriptor() + .name("filePath") + .displayName("FilePath") + .description("The path of excel file") + .defaultValue("") + .required(true) + .example("/test/test.xlsx") + descriptor = filePath :: descriptor + + val header = new PropertyDescriptor() + .name("header") + .displayName("Header") + .description("Whether the excel file has a header") + .defaultValue("true") + .allowableValues(Set("true","false")) + .required(true) + .example("true") + descriptor = header :: descriptor + + val inports = new PropertyDescriptor() + .name("inports") + .displayName("inports") + .description("Inports string are separated by commas") + .defaultValue("") + .required(true) + descriptor = inports :: descriptor + + descriptor + } + + override def getIcon(): Array[Byte] = { + ImageUtil.getImage("icon/excel/excelParse.png") + } + + override def getGroup(): List[String] = { + List(StopGroup.ExcelGroup) + } + + override def initialize(ctx: ProcessContext): Unit = { + + } + +} diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/file/RegexTextProcess.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/file/RegexTextProcess.scala index ff4f0f50..352cb6c5 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/file/RegexTextProcess.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/file/RegexTextProcess.scala @@ -4,6 +4,7 @@ import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} import cn.piflow.{JobContext, JobInputStream, JobOutputStream, ProcessContext} import cn.piflow.conf._ +import cn.piflow.util.SciDataFrame import org.apache.spark.sql.SparkSession class RegexTextProcess extends ConfigurableStop{ @@ -19,14 +20,14 @@ class RegexTextProcess extends ConfigurableStop{ def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { val spark = pec.get[SparkSession]() val sqlContext=spark.sqlContext - val dfOld = in.read() + val dfOld = in.read().getSparkDf val regexText=regex val replaceText=replaceStr dfOld.createOrReplaceTempView("thesis") sqlContext.udf.register("regexPro",(str:String)=>str.replaceAll(regexText,replaceText)) val sqlText:String="select *,regexPro("+columnName+") as "+columnName+"_new from thesis" val dfNew=sqlContext.sql(sqlText) - out.write(dfNew) + out.write(new SciDataFrame(dfNew)) } def initialize(ctx: ProcessContext): Unit = { diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/graphx/LabelPropagation.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/graphx/LabelPropagation.scala index 8867f67d..f95a0541 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/graphx/LabelPropagation.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/graphx/LabelPropagation.scala @@ -7,7 +7,7 @@ import cn.piflow.conf.util.{ImageUtil, MapUtil} import org.apache.spark.graphx._ import org.apache.spark.graphx.lib.LabelPropagation import org.apache.spark.sql.SparkSession - +import cn.piflow.SciDataFrameImplicits.autoWrapDataFrame class LabelPropagation extends ConfigurableStop { val authorEmail: String = "06whuxx@163.com" diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/graphx/LoadGraph.scala 
b/piflow-bundle/src/main/scala/cn/piflow/bundle/graphx/LoadGraph.scala index 1a2135f3..4b8755f2 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/graphx/LoadGraph.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/graphx/LoadGraph.scala @@ -6,6 +6,7 @@ import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} import org.apache.spark.sql.SparkSession import org.apache.spark.graphx.{GraphLoader, PartitionStrategy} +import cn.piflow.SciDataFrameImplicits.autoWrapDataFrame class LoadGraph extends ConfigurableStop { val authorEmail: String = "06whuxx@163.com" diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/hbase/PutHbase.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/hbase/PutHbase.scala index b2072299..3d0b3def 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/hbase/PutHbase.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/hbase/PutHbase.scala @@ -13,7 +13,26 @@ import org.apache.hadoop.mapred.JobConf import org.apache.hadoop.mapreduce.Job import org.apache.spark.sql.SparkSession - +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + * + * Copyright (c) 2022 πFlow. All rights reserved. + */ class PutHbase extends ConfigurableStop{ override val authorEmail: String = "ygang@cnic.cn" @@ -28,7 +47,7 @@ class PutHbase extends ConfigurableStop{ var columnFamily: String = _ override def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { - val df = in.read() + val df = in.read().getSparkDf val hbaseConf = HBaseConfiguration.create() hbaseConf.set("hbase.zookeeper.quorum",zookeeperQuorum) //设置zooKeeper集群地址,也可以通过将hbase-site.xml导入classpath,但是建议在程序里这样设置 diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/hbase/ReadHbase.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/hbase/ReadHbase.scala index 595fe93b..9512ca5a 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/hbase/ReadHbase.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/hbase/ReadHbase.scala @@ -4,6 +4,7 @@ import cn.piflow.{JobContext, JobInputStream, JobOutputStream, ProcessContext} import cn.piflow.conf.{ConfigurableStop, Port, StopGroup} import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} +import cn.piflow.util.SciDataFrame import org.apache.hadoop.hbase.mapreduce.TableInputFormat import org.apache.hadoop.hbase.util.Bytes import org.apache.spark.sql.types.{StringType, StructField, StructType} @@ -12,7 +13,26 @@ import org.apache.hadoop.hbase.HBaseConfiguration import scala.collection.mutable.ArrayBuffer - +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. 
See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + * + * Copyright (c) 2022 πFlow. All rights reserved. + */ class ReadHbase extends ConfigurableStop{ override val authorEmail: String = "ygang@cnic.cn" @@ -78,7 +98,7 @@ class ReadHbase extends ConfigurableStop{ }) val df=spark.createDataFrame(kv,dfSchema) - out.write(df) + out.write(new SciDataFrame(df)) } override def setProperties(map: Map[String, Any]): Unit = { diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/hdfs/DeleteHdfs.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/hdfs/DeleteHdfs.scala index 94dbdf8e..3338b939 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/hdfs/DeleteHdfs.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/hdfs/DeleteHdfs.scala @@ -27,7 +27,7 @@ class DeleteHdfs extends ConfigurableStop{ val spark = pec.get[SparkSession]() if (isCustomize.equals("false")){ - val inDf = in.read() + val inDf = in.read().getSparkDf val configuration: Configuration = new Configuration() var pathStr: String =inDf.take(1)(0).get(0).asInstanceOf[String] diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/hdfs/FileDownHdfs.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/hdfs/FileDownHdfs.scala index 6e436d5d..6bc05c9d 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/hdfs/FileDownHdfs.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/hdfs/FileDownHdfs.scala @@ -2,10 +2,10 @@ package cn.piflow.bundle.hdfs import java.io.InputStream import java.net.{HttpURLConnection, URL} - import cn.piflow.conf._ import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} +import cn.piflow.util.SciDataFrame import cn.piflow.{JobContext, JobInputStream, JobOutputStream, ProcessContext} import org.apache.hadoop.conf.Configuration import org.apache.hadoop.fs.{FSDataOutputStream, FileSystem, Path} @@ -59,7 +59,7 @@ class FileDownHdfs extends ConfigurableStop{ val schema: StructType = StructType(fields) val df: DataFrame = spark.createDataFrame(rdd,schema) - out.write(df) + out.write(new SciDataFrame(df)) } diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/hdfs/GetHdfs.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/hdfs/GetHdfs.scala index 444454d7..7bfadf2f 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/hdfs/GetHdfs.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/hdfs/GetHdfs.scala @@ -4,6 +4,7 @@ import cn.piflow._ import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} import cn.piflow.conf.{ConfigurableStop, Port, StopGroup} +import cn.piflow.util.SciDataFrame import org.apache.spark.sql.SparkSession @@ -27,28 +28,28 @@ class GetHdfs extends ConfigurableStop{ if (types == "json") { val df = spark.read.json(path) df.schema.printTreeString() - out.write(df) + out.write(new SciDataFrame(df)) } else if (types == "csv") { val df = 
spark.read.csv(path) df.schema.printTreeString() - out.write(df) + out.write(new SciDataFrame(df)) }else if (types == "parquet") { val df = spark.read.csv(path) df.schema.printTreeString() - out.write(df) + out.write(new SciDataFrame(df)) }else if (types == "orc"){ val df = spark.read.orc(path) df.schema.printTreeString() - out.write(df) + out.write(new SciDataFrame(df)) } else { val rdd = sc.textFile(path) val outDf = rdd.toDF() outDf.schema.printTreeString() - out.write(outDf) + out.write(new SciDataFrame(outDf)) } } diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/hdfs/ListHdfs.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/hdfs/ListHdfs.scala index ac58b6ec..7b7ea30b 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/hdfs/ListHdfs.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/hdfs/ListHdfs.scala @@ -4,6 +4,7 @@ import cn.piflow._ import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} import cn.piflow.conf.{ConfigurableStop, Port, StopGroup} +import cn.piflow.util.SciDataFrame import org.apache.hadoop.conf.Configuration import org.apache.hadoop.fs.{FileStatus, FileSystem, Path} import org.apache.spark.rdd.RDD @@ -44,7 +45,7 @@ class ListHdfs extends ConfigurableStop{ StructField("path",StringType) )) val outDF: DataFrame = spark.createDataFrame(rowRDD,schema) - out.write(outDF) + out.write(new SciDataFrame(outDF)) } // recursively traverse the folder diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/hdfs/PutHdfs.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/hdfs/PutHdfs.scala index 310db5d6..80e7c135 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/hdfs/PutHdfs.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/hdfs/PutHdfs.scala @@ -23,7 +23,7 @@ class PutHdfs extends ConfigurableStop{ override def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { val spark = pec.get[SparkSession]() - val inDF = in.read() + val inDF = in.read().getSparkDf val config = new Configuration() config.set("fs.defaultFS",hdfsUrl) diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/hdfs/SaveToHdfs.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/hdfs/SaveToHdfs.scala index 86f16518..7d8d3f8e 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/hdfs/SaveToHdfs.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/hdfs/SaveToHdfs.scala @@ -3,6 +3,7 @@ package cn.piflow.bundle.hdfs import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} import cn.piflow.conf.{ConfigurableStop, Port, StopGroup} +import cn.piflow.util.SciDataFrame import cn.piflow.{JobContext, JobInputStream, JobOutputStream, ProcessContext} import org.apache.hadoop.conf.Configuration import org.apache.hadoop.fs.{FileStatus, FileSystem, Path} @@ -38,7 +39,7 @@ class SaveToHdfs extends ConfigurableStop { config.set("fs.defaultFS",hdfsUrl) val fs = FileSystem.get(config) - val inDF = in.read() + val inDF = in.read().getSparkDf if (types=="json"){ @@ -76,7 +77,7 @@ class SaveToHdfs extends ConfigurableStop { )) val outDF: DataFrame = spark.createDataFrame(rowRDD,schema) - out.write(outDF) + out.write(new SciDataFrame(outDF)) } // recursively traverse the folder diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/hdfs/SelectFilesByName.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/hdfs/SelectFilesByName.scala index 8df119e2..cb0d5fea 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/hdfs/SelectFilesByName.scala +++ 
b/piflow-bundle/src/main/scala/cn/piflow/bundle/hdfs/SelectFilesByName.scala @@ -1,10 +1,10 @@ package cn.piflow.bundle.hdfs import java.util.regex.Pattern - import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} import cn.piflow.conf.{ConfigurableStop, Port, StopGroup} +import cn.piflow.util.SciDataFrame import cn.piflow.{JobContext, JobInputStream, JobOutputStream, ProcessContext} import org.apache.hadoop.conf.Configuration import org.apache.hadoop.fs.{FileStatus, FileSystem, Path} @@ -69,7 +69,7 @@ class SelectFilesByName extends ConfigurableStop{ val df: DataFrame = session.createDataFrame(rowRDD,schema) df.collect().foreach(println) - out.write(df) + out.write(new SciDataFrame(df)) } override def setProperties(map: Map[String, Any]): Unit = { diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/hive/PutHiveMode.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/hive/PutHiveMode.scala index 784f5edf..bbac9f8c 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/hive/PutHiveMode.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/hive/PutHiveMode.scala @@ -19,7 +19,7 @@ class PutHiveMode extends ConfigurableStop { def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { val spark = pec.get[SparkSession]() - val inDF = in.read() + val inDF = in.read().getSparkDf inDF.write.format("hive").mode(saveMode).saveAsTable(database + "." + table) } diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/hive/PutHiveStreaming.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/hive/PutHiveStreaming.scala index fb43edae..59507f08 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/hive/PutHiveStreaming.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/hive/PutHiveStreaming.scala @@ -20,7 +20,7 @@ class PutHiveStreaming extends ConfigurableStop { def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { val spark = pec.get[SparkSession]() - val inDF = in.read() + val inDF = in.read().getSparkDf val dfTempTable = table + "_temp" inDF.createOrReplaceTempView(dfTempTable) diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/hive/SelectHiveQL.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/hive/SelectHiveQL.scala index a3d11de5..007557f9 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/hive/SelectHiveQL.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/hive/SelectHiveQL.scala @@ -4,6 +4,7 @@ import cn.piflow._ import cn.piflow.conf._ import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} +import cn.piflow.util.SciDataFrame import org.apache.spark.sql.SparkSession import scala.beans.BeanProperty @@ -25,7 +26,7 @@ class SelectHiveQL extends ConfigurableStop { import spark.sql val df = sql(hiveQL) - out.write(df) + out.write(new SciDataFrame(df)) } def initialize(ctx: ProcessContext): Unit = { diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/hive/SelectHiveQLByJDBC.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/hive/SelectHiveQLByJDBC.scala index 1e896c15..75852f39 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/hive/SelectHiveQLByJDBC.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/hive/SelectHiveQLByJDBC.scala @@ -4,6 +4,7 @@ import cn.piflow.{JobContext, JobInputStream, JobOutputStream, ProcessContext} import cn.piflow.conf.{ConfigurableStop, Language, Port, StopGroup} import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} +import 
cn.piflow.util.SciDataFrame import org.apache.spark.SparkContext import org.apache.spark.sql.{DataFrame, Row, SQLContext, SparkSession} @@ -88,7 +89,7 @@ class SelectHiveQLByJDBC extends ConfigurableStop { override def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { val sc = pec.get[SparkSession]() val df = getDF (sc.sqlContext, sc.sparkContext, sql) - out.write(df) + out.write(new SciDataFrame(df)) } def getDF(sqlContext : SQLContext, sc : SparkContext, tableName : String) : DataFrame = { diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/http/GetUrl.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/http/GetUrl.scala index 40000a14..46995e57 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/http/GetUrl.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/http/GetUrl.scala @@ -1,7 +1,6 @@ package cn.piflow.bundle.http import java.util - import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} import cn.piflow.conf.{ConfigurableStop, Port, StopGroup} @@ -13,6 +12,8 @@ import org.apache.spark.rdd.RDD import org.apache.spark.sql.types.{StringType, StructField, StructType} import org.apache.spark.sql.{DataFrame, Row, SparkSession} import org.dom4j.{Document, DocumentHelper, Element} +import cn.piflow.SciDataFrameImplicits.autoWrapDataFrame +import cn.piflow.util.SciDataFrame import scala.collection.JavaConverters._ import scala.collection.mutable.{ArrayBuffer, ListBuffer} @@ -112,7 +113,7 @@ class GetUrl extends ConfigurableStop{ val outDf: DataFrame = ss.createDataFrame(rowRDD,structType) - out.write(outDf) + out.write(new SciDataFrame(outDf)) } diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/imageProcess/AnimalClassification.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/imageProcess/AnimalClassification.scala index 7819f6ac..ff20e347 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/imageProcess/AnimalClassification.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/imageProcess/AnimalClassification.scala @@ -1,11 +1,11 @@ package cn.piflow.bundle.imageProcess import java.io.{File, FileNotFoundException} - import cn.piflow._ import cn.piflow.conf._ import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} +import cn.piflow.util.SciDataFrame import org.apache.http.entity.ContentType import org.apache.http.util.EntityUtils import org.apache.spark.rdd.RDD @@ -89,7 +89,7 @@ class AnimalClassification extends ConfigurableStop { StructField("res",StringType) )) val df: DataFrame = session.createDataFrame(rowRDD,schema) - out.write(df) + out.write(new SciDataFrame(df)) } diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/internetWorm/spider.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/internetWorm/spider.scala index f6794590..fa1f7e1b 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/internetWorm/spider.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/internetWorm/spider.scala @@ -4,10 +4,10 @@ import java.io.{BufferedOutputStream, File, FileOutputStream, InputStream} import java.net.URL import java.text.SimpleDateFormat import java.util.Date - import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} import cn.piflow.conf.{ConfigurableStop, Port, StopGroup} +import cn.piflow.util.SciDataFrame import cn.piflow.{JobContext, JobInputStream, JobOutputStream, ProcessContext} import org.apache.spark.rdd.RDD import org.apache.spark.sql.types.{StringType, StructField, 
StructType} @@ -92,7 +92,7 @@ class spider extends ConfigurableStop{ val schema: StructType = StructType(fields) val df: DataFrame = session.createDataFrame(rowRDD,schema) - out.write(df) + out.write(new SciDataFrame(df)) } diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/jdbc/DamengRead.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/jdbc/DamengRead.scala new file mode 100644 index 00000000..283a9d23 --- /dev/null +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/jdbc/DamengRead.scala @@ -0,0 +1,111 @@ +package cn.piflow.bundle.jdbc + +import cn.piflow._ +import cn.piflow.conf._ +import cn.piflow.conf.bean.PropertyDescriptor +import cn.piflow.conf.util.{ImageUtil, MapUtil} +import cn.piflow.util.SciDataFrame +import org.apache.spark.sql.SparkSession + +class DamengRead extends ConfigurableStop { + + val authorEmail: String = "ygang@cnic.cn" + val description: String = "Read data from dameng database with jdbc" + val inportList: List[String] = List(Port.DefaultPort) + val outportList: List[String] = List(Port.DefaultPort) + + var url:String = _ + var user:String = _ + var password:String = _ + var selectedContent:String = _ + var tableName:String = _ + + def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { + + val spark = pec.get[SparkSession]() + val dbtable = "( select " + selectedContent + " from " + tableName + " ) AS Temp" + val jdbcDF = spark.read.format("jdbc") + .option("url", url) + .option("driver", "dm.jdbc.driver.DmDriver") + .option("dbtable", dbtable) + .option("user", user) + .option("password",password) + .load() + + out.write(new SciDataFrame(jdbcDF)) + } + + def initialize(ctx: ProcessContext): Unit = { + + } + + override def setProperties(map: Map[String, Any]): Unit = { + + url = MapUtil.get(map,"url").asInstanceOf[String] + user = MapUtil.get(map,"user").asInstanceOf[String] + password = MapUtil.get(map,"password").asInstanceOf[String] + selectedContent= MapUtil.get(map,"selectedContent").asInstanceOf[String] + tableName= MapUtil.get(map,"tableName").asInstanceOf[String] + } + + override def getPropertyDescriptor(): List[PropertyDescriptor] = { + var descriptor : List[PropertyDescriptor] = List() + + val url=new PropertyDescriptor() + .name("url") + .displayName("Url") + .description("The Url of dameng database") + .defaultValue("") + .required(true) + .example("jdbc:dm://127.0.0.1:5236/DAMENG") + descriptor = url :: descriptor + + + val user=new PropertyDescriptor() + .name("user") + .displayName("User") + .description("The user name of dameng") + .defaultValue("") + .required(true) + .example("") + descriptor = user :: descriptor + + val password=new PropertyDescriptor() + .name("password") + .displayName("Password") + .description("The password of dameng") + .defaultValue("") + .required(true) + .example("") + .sensitive(true) + descriptor = password :: descriptor + + val selectedContent =new PropertyDescriptor() + .name("selectedContent") + .displayName("SelectedContent") + .description("The content you selected to read in the DBTable") + .defaultValue("*") + .required(true) + .example("*") + descriptor = selectedContent :: descriptor + + val tableName =new PropertyDescriptor() + .name("tableName") + .displayName("TableName") + .description("The table you want to read") + .defaultValue("") + .required(true) + .example("") + descriptor = tableName :: descriptor + descriptor + } + + override def getIcon(): Array[Byte] = { + ImageUtil.getImage("icon/jdbc/dameng.png") + } + + override def getGroup(): List[String] = { + 
List(StopGroup.JdbcGroup) + } + +} \ No newline at end of file diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/jdbc/DamengWrite.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/jdbc/DamengWrite.scala new file mode 100644 index 00000000..d971a083 --- /dev/null +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/jdbc/DamengWrite.scala @@ -0,0 +1,109 @@ +package cn.piflow.bundle.jdbc + +import cn.piflow._ +import cn.piflow.conf._ +import cn.piflow.conf.bean.PropertyDescriptor +import cn.piflow.conf.util.{ImageUtil, MapUtil} + +class DamengWrite extends ConfigurableStop{ + + val authorEmail: String = "ygang@cnic.cn" + val description: String = "Write data into dameng database with jdbc" + val inportList: List[String] = List(Port.DefaultPort) + val outportList: List[String] = List(Port.DefaultPort) + + var url:String = _ + var user:String = _ + var password:String = _ + var dbtable:String = _ + var saveMode:String = _ + + def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { + val jdbcDF = in.read().getSparkDf + + jdbcDF.write.format("jdbc") + .option("url", url) + .option("driver", "dm.jdbc.driver.DmDriver") + .option("user", user) + .option("password", password) + .option("dbtable", dbtable) + .mode(saveMode) + .save() + } + + def initialize(ctx: ProcessContext): Unit = { + + } + + override def setProperties(map: Map[String, Any]): Unit = { + url = MapUtil.get(map,"url").asInstanceOf[String] + user = MapUtil.get(map,"user").asInstanceOf[String] + password = MapUtil.get(map,"password").asInstanceOf[String] + dbtable = MapUtil.get(map,"dbtable").asInstanceOf[String] + saveMode = MapUtil.get(map,"saveMode").asInstanceOf[String] + } + + override def getPropertyDescriptor(): List[PropertyDescriptor] = { + var descriptor : List[PropertyDescriptor] = List() + val saveModeOption = Set("Append", "Overwrite", "Ignore") + + val url=new PropertyDescriptor() + .name("url") + .displayName("Url") + .description("The Url of dameng database") + .defaultValue("") + .required(true) + .example("jdbc:dm://127.0.0.1:5236/DAMENG") + descriptor = url :: descriptor + + + val user=new PropertyDescriptor() + .name("user") + .displayName("User") + .description("The user name of dameng") + .defaultValue("") + .required(true) + .example("") + descriptor = user :: descriptor + + val password=new PropertyDescriptor() + .name("password") + .displayName("Password") + .description("The password of dameng") + .defaultValue("") + .required(true) + .example("") + .sensitive(true) + descriptor = password :: descriptor + + val dbtable=new PropertyDescriptor() + .name("dbtable") + .displayName("DBTable") + .description("The table you want to write") + .defaultValue("") + .required(true) + .example("") + descriptor = dbtable :: descriptor + + val saveMode = new PropertyDescriptor() + .name("saveMode") + .displayName("SaveMode") + .description("The save mode for table") + .allowableValues(saveModeOption) + .defaultValue("Append") + .required(true) + .example("Append") + descriptor = saveMode :: descriptor + + descriptor + } + + override def getIcon(): Array[Byte] = { + ImageUtil.getImage("icon/jdbc/dameng.png") + } + + override def getGroup(): List[String] = { + List(StopGroup.JdbcGroup) + } + +} diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/jdbc/ExcuteSql.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/jdbc/ExcuteSql.scala index bcd0a029..71c84ec3 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/jdbc/ExcuteSql.scala +++ 
b/piflow-bundle/src/main/scala/cn/piflow/bundle/jdbc/ExcuteSql.scala @@ -4,6 +4,7 @@ import cn.piflow._ import cn.piflow.conf._ import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} +import cn.piflow.util.SciDataFrame import org.apache.spark.sql.SparkSession import java.sql.{Connection, DriverManager, ResultSet} @@ -33,7 +34,7 @@ class ExcuteSql extends ConfigurableStop { conn.close() statement.close() - out.write(jdbcDF) + out.write(new SciDataFrame(jdbcDF)) } def initialize(ctx: ProcessContext): Unit = { diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/jdbc/ImpalaRead.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/jdbc/ImpalaRead.scala index b42115e8..336ce54c 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/jdbc/ImpalaRead.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/jdbc/ImpalaRead.scala @@ -1,15 +1,15 @@ package cn.piflow.bundle.jdbc -import java.sql.{Connection, DriverManager, ResultSet, Statement} - import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} import cn.piflow.conf.{ConfigurableStop, Language, Port, StopGroup} +import cn.piflow.util.SciDataFrame import cn.piflow.{JobContext, JobInputStream, JobOutputStream, ProcessContext} import org.apache.spark.rdd.RDD import org.apache.spark.sql.types.{StringType, StructField, StructType} import org.apache.spark.sql.{DataFrame, Row, SparkSession} +import java.sql.{Connection, DriverManager, ResultSet, Statement} import scala.collection.mutable.ArrayBuffer @@ -55,7 +55,7 @@ class ImpalaRead extends ConfigurableStop{ val rdd: RDD[Row] = session.sparkContext.makeRDD(rows) val df: DataFrame = session.createDataFrame(rdd,schema) - out.write(df) + out.write(new SciDataFrame(df)) } diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/jdbc/JdbcReadFromOracle.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/jdbc/JdbcReadFromOracle.scala deleted file mode 100644 index 53dac916..00000000 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/jdbc/JdbcReadFromOracle.scala +++ /dev/null @@ -1,210 +0,0 @@ -package cn.piflow.bundle.jdbc - -import java.io._ -import java.sql.{Blob, Clob, Connection, Date, DriverManager, NClob, PreparedStatement, ResultSet, SQLXML} - -import cn.piflow._ -import cn.piflow.conf._ -import cn.piflow.conf.bean.PropertyDescriptor -import cn.piflow.conf.util.{ImageUtil, MapUtil} -import org.apache.spark.rdd.RDD -import org.apache.spark.sql._ -import org.apache.spark.sql.types._ - -import scala.collection.mutable.ArrayBuffer - -class JdbcReadFromOracle extends ConfigurableStop{ - - val authorEmail: String = "yangqidong@cnic.cn" - val description: String = "Read from oracle" - val inportList: List[String] = List(Port.DefaultPort) - val outportList: List[String] = List(Port.DefaultPort) - - var url:String = _ - var user:String = _ - var password:String = _ - var sql:String = _ - var schema:String=_ - - - def toByteArray(in: InputStream): Array[Byte] = { - var byteArray:Array[Byte]=new Array[Byte](1024*1024) - val out: ByteArrayOutputStream = new ByteArrayOutputStream() - var n:Int=0 - while ((n=in.read(byteArray)) != -1 && (n != -1)){ - out.write(byteArray,0,n) - } - val arr: Array[Byte] = out.toByteArray - out.close() - arr - } - - def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { - val session = pec.get[SparkSession]() - - Class.forName("oracle.jdbc.driver.OracleDriver") - val con: Connection = DriverManager.getConnection(url,user,password) - val pre: PreparedStatement = 
con.prepareStatement(sql) - val rs: ResultSet = pre.executeQuery() - - - val filedNames: Array[String] = schema.split(",").map(x => x.trim) - var rowsArr:ArrayBuffer[ArrayBuffer[Any]]=ArrayBuffer() - var rowArr:ArrayBuffer[Any]=ArrayBuffer() - while (rs.next()){ - rowArr.clear() - for(fileName <- filedNames){ - val name_type: Array[String] = fileName.split("\\.") - val name: String = name_type(0) - val typestr: String = name_type(1) - if(typestr.toUpperCase.equals("BLOB")){ - val blob: Blob = rs.getBlob(name) - var byteArr : Array[Byte] =Array() - if(blob != null){ - val stream: InputStream = blob.getBinaryStream - byteArr = toByteArray(stream) - stream.close() - } - rowArr+=byteArr - }else if(typestr.toUpperCase.equals("CLOB") || typestr.toUpperCase.equals("XMLTYPE")){ - val clob: Clob = rs.getClob(name) - var byteArr : Array[Byte] =Array() - if(clob != null){ - val stream: InputStream = clob.getAsciiStream - byteArr = toByteArray(stream) - stream.close() - } - rowArr+=byteArr - }else if(typestr.toUpperCase.equals("NCLOB")){ - val nclob: NClob = rs.getNClob(name) - var byteArr : Array[Byte] =Array() - if(nclob != null){ - val stream: InputStream = nclob.getAsciiStream - byteArr = toByteArray(stream) - stream.close() - } - rowArr+=byteArr - }else if(typestr.toUpperCase.equals("DATE")){ - val date: Date = rs.getDate(name) - rowArr+=date - }else if(typestr.toUpperCase.equals("NUMBER")){ - val int: Int = rs.getInt(name) - rowArr+=int - }else{ - rowArr+=rs.getString(name) - } - } - rowsArr+=rowArr - } - - var nameArrBuff:ArrayBuffer[String]=ArrayBuffer() - var typeArrBuff:ArrayBuffer[String]=ArrayBuffer() - filedNames.foreach(x => { - nameArrBuff+=x.split("\\.")(0) - typeArrBuff+=x.split("\\.")(1) - }) - var num:Int=0 - val fields: ArrayBuffer[StructField] = nameArrBuff.map(x => { - var sf: StructField = null - val typeName: String = typeArrBuff(num) - if (typeName.toUpperCase.equals("BLOB") || typeName.toUpperCase.equals("CLOB") || typeName.toUpperCase.equals("NCLOB") || typeName.toUpperCase.equals("XMLTYPE")) { - sf = StructField(x, DataTypes.createArrayType(ByteType), nullable = true) - }else if( typeName.toUpperCase.equals("DATE")) { - sf = StructField(x, DateType, nullable = true) - }else if( typeName.toUpperCase.equals("NUMBER")) { - sf = StructField(x, IntegerType, nullable = true) - }else if( typeName.toUpperCase.equals("XMLTYPE")) { - sf = StructField(x, IntegerType, nullable = true) - }else { - sf = StructField(x, StringType, nullable = true) - } - num+=1 - sf - }) - - val schemaNew: StructType = StructType(fields) - val rows: List[Row] = rowsArr.toList.map(arr => { - - val row: Row = Row.fromSeq(arr) - row - }) - val rdd: RDD[Row] = session.sparkContext.makeRDD(rows) - val df: DataFrame = session.createDataFrame(rdd,schemaNew) - - out.write(df) - } - - def initialize(ctx: ProcessContext): Unit = { - - } - - override def setProperties(map: Map[String, Any]): Unit = { - url = MapUtil.get(map,"url").asInstanceOf[String] - user = MapUtil.get(map,"user").asInstanceOf[String] - password = MapUtil.get(map,"password").asInstanceOf[String] - sql = MapUtil.get(map,"sql").asInstanceOf[String] - schema = MapUtil.get(map,"schema").asInstanceOf[String] - } - - override def getPropertyDescriptor(): List[PropertyDescriptor] = { - var descriptor : List[PropertyDescriptor] = List() - - val url=new PropertyDescriptor() - .name("url") - .displayName("Url") - .description("The Url, for example jdbc:oracle:thin:@192.168.0.1:1521/newdb") - .defaultValue("") - .required(true) - 
.example("jdbc:oracle:thin:@192.168.0.1:1521/newdb") - descriptor = url :: descriptor - - val user=new PropertyDescriptor() - .name("user") - .displayName("User") - .description("The user name of database") - .defaultValue("") - .required(true) - .example("root") - descriptor = user :: descriptor - - val password=new PropertyDescriptor() - .name("password") - .displayName("Password") - .description("The password of database") - .defaultValue("") - .required(true) - .example("123456") - descriptor = password :: descriptor - - val sql=new PropertyDescriptor() - .name("sql") - .displayName("Sql") - .description("The sql you want") - .defaultValue("") - .required(true) - .language(Language.Sql) - .example("select * from type") - descriptor = sql :: descriptor - - val schema=new PropertyDescriptor() - .name("schema") - .displayName("Schema") - .description("The name of the field of your SQL statement query, such as: ID.number, name.varchar") - .defaultValue("") - .required(true) - .example("ID.number, name.varchar") - descriptor = schema :: descriptor - - descriptor - } - - override def getIcon(): Array[Byte] = { - ImageUtil.getImage("icon/jdbc/jdbcReadFromOracle.png") - } - - override def getGroup(): List[String] = { - List(StopGroup.JdbcGroup) - } - - -} diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/jdbc/MysqlRead.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/jdbc/MysqlRead.scala index 34746f77..ec1fd6f7 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/jdbc/MysqlRead.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/jdbc/MysqlRead.scala @@ -4,6 +4,7 @@ import cn.piflow._ import cn.piflow.conf._ import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} +import cn.piflow.util.SciDataFrame import org.apache.spark.sql.SparkSession @@ -31,7 +32,7 @@ class MysqlRead extends ConfigurableStop { .option("password",password) .load() - out.write(jdbcDF) + out.write(new SciDataFrame(jdbcDF)) } diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/jdbc/MysqlReadIncremental.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/jdbc/MysqlReadIncremental.scala index 8a88b0e4..5d9fbe7b 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/jdbc/MysqlReadIncremental.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/jdbc/MysqlReadIncremental.scala @@ -4,6 +4,7 @@ import cn.piflow.{JobContext, JobInputStream, JobOutputStream, ProcessContext} import cn.piflow.conf.{ConfigurableIncrementalStop, Language, Port, StopGroup} import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} +import cn.piflow.util.SciDataFrame import org.apache.spark.sql.SparkSession /** @@ -33,7 +34,7 @@ class MysqlReadIncremental extends ConfigurableIncrementalStop{ .option("password",password) .load() - out.write(jdbcDF) + out.write(new SciDataFrame(jdbcDF)) } override def setProperties(map: Map[String, Any]): Unit = { url = MapUtil.get(map,"url").asInstanceOf[String] diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/jdbc/MysqlWrite.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/jdbc/MysqlWrite.scala index 4f5a5e6a..1034fad1 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/jdbc/MysqlWrite.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/jdbc/MysqlWrite.scala @@ -1,11 +1,11 @@ package cn.piflow.bundle.jdbc import java.util.Properties - import cn.piflow._ import cn.piflow.conf._ import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} +import 
cn.piflow.util.SciDataFrame import org.apache.spark.sql.{SaveMode, SparkSession} import scala.beans.BeanProperty @@ -26,13 +26,13 @@ class MysqlWrite extends ConfigurableStop{ def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { val spark = pec.get[SparkSession]() - val jdbcDF = in.read() + val jdbcDF = in.read().getSparkDf val properties = new Properties() properties.put("user", user) properties.put("password", password) properties.put("driver", driver) jdbcDF.write.mode(SaveMode.valueOf(saveMode)).jdbc(url,dbtable,properties) - out.write(jdbcDF) + out.write(new SciDataFrame(jdbcDF)) } def initialize(ctx: ProcessContext): Unit = { diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/jdbc/TbaseRead.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/jdbc/OpenTenBaseRead.scala similarity index 84% rename from piflow-bundle/src/main/scala/cn/piflow/bundle/jdbc/TbaseRead.scala rename to piflow-bundle/src/main/scala/cn/piflow/bundle/jdbc/OpenTenBaseRead.scala index 3f24de77..56877a16 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/jdbc/TbaseRead.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/jdbc/OpenTenBaseRead.scala @@ -4,13 +4,14 @@ import cn.piflow._ import cn.piflow.conf._ import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} +import cn.piflow.util.SciDataFrame import org.apache.spark.sql.SparkSession -class TbaseRead extends ConfigurableStop { +class OpenTenBaseRead extends ConfigurableStop { - val authorEmail: String = "bbbbbbyz1110@163.com" - val description: String = "Read data from Tbase database with jdbc" + val authorEmail: String = "ygang@cnic.cn" + val description: String = "Read data from OpenTenBase database with jdbc" val inportList: List[String] = List(Port.DefaultPort) val outportList: List[String] = List(Port.DefaultPort) @@ -32,7 +33,7 @@ class TbaseRead extends ConfigurableStop { .option("password",password) .load() - out.write(jdbcDF) + out.write(new SciDataFrame(jdbcDF)) } @@ -55,8 +56,8 @@ class TbaseRead extends ConfigurableStop { val url=new PropertyDescriptor() .name("url") .displayName("Url") - .description("The Url of postgresql database") - .defaultValue("jdbc:postgresql://127.0.0.1:30004/tbase") + .description("The Url of OpenTenBase database") + .defaultValue("") .required(true) .example("jdbc:postgresql://127.0.0.1:30004/tbase") descriptor = url :: descriptor @@ -65,19 +66,19 @@ class TbaseRead extends ConfigurableStop { val user=new PropertyDescriptor() .name("user") .displayName("User") - .description("The user name of postgresql") - .defaultValue("tbase") + .description("The user name of OpenTenBase") + .defaultValue("") .required(true) - .example("tbase") + .example("") descriptor = user :: descriptor val password=new PropertyDescriptor() .name("password") .displayName("Password") - .description("The password of postgresql") + .description("The password of OpenTenBase") .defaultValue("") .required(true) - .example("123456") + .example("") .sensitive(true) descriptor = password :: descriptor @@ -96,7 +97,7 @@ class TbaseRead extends ConfigurableStop { .description("The table you want to read") .defaultValue("") .required(true) - .example("test") + .example("") descriptor = tableName :: descriptor descriptor } diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/jdbc/TbaseWrite.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/jdbc/OpenTenBaseWrite.scala similarity index 84% rename from 
piflow-bundle/src/main/scala/cn/piflow/bundle/jdbc/TbaseWrite.scala rename to piflow-bundle/src/main/scala/cn/piflow/bundle/jdbc/OpenTenBaseWrite.scala index 45a527d6..377e628a 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/jdbc/TbaseWrite.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/jdbc/OpenTenBaseWrite.scala @@ -4,15 +4,16 @@ import cn.piflow._ import cn.piflow.conf._ import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} +import cn.piflow.util.SciDataFrame import org.apache.spark.sql.{SaveMode, SparkSession} import java.util.Properties -class TbaseWrite extends ConfigurableStop{ +class OpenTenBaseWrite extends ConfigurableStop{ - val authorEmail: String = "bbbbbbyz1110@163.com" - val description: String = "Write data into Tbase database with jdbc" + val authorEmail: String = "ygang@cnic.cn" + val description: String = "Write data into OpenTenBase database with jdbc" val inportList: List[String] = List(Port.DefaultPort) val outportList: List[String] = List(Port.DefaultPort) @@ -24,7 +25,7 @@ class TbaseWrite extends ConfigurableStop{ def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { val spark = pec.get[SparkSession]() - val jdbcDF = in.read() + val jdbcDF = in.read().getSparkDf val properties = new Properties() properties.put("user", user) properties.put("password", password) @@ -32,7 +33,7 @@ class TbaseWrite extends ConfigurableStop{ jdbcDF.write .mode(SaveMode.valueOf(saveMode)).jdbc(url,dbtable,properties) - out.write(jdbcDF) + out.write(new SciDataFrame(jdbcDF)) } def initialize(ctx: ProcessContext): Unit = { @@ -54,8 +55,8 @@ class TbaseWrite extends ConfigurableStop{ val url=new PropertyDescriptor() .name("url") .displayName("Url") - .description("The Url of postgresql database") - .defaultValue("jdbc:postgresql://127.0.0.1:30004/tbase") + .description("The Url of OpenTenBase database") + .defaultValue("") .required(true) .example("jdbc:postgresql://127.0.0.1:30004/tbase") descriptor = url :: descriptor @@ -64,16 +65,16 @@ class TbaseWrite extends ConfigurableStop{ val user=new PropertyDescriptor() .name("user") .displayName("User") - .description("The user name of postgresql") - .defaultValue("tbase") + .description("The user name of OpenTenBase") + .defaultValue("") .required(true) - .example("tbase") + .example("") descriptor = user :: descriptor val password=new PropertyDescriptor() .name("password") .displayName("Password") - .description("The password of postgresql") + .description("The password of OpenTenBase") .defaultValue("") .required(true) .example("123456") diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/jdbc/OracleRead.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/jdbc/OracleRead.scala index 34b30a48..9792dd8e 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/jdbc/OracleRead.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/jdbc/OracleRead.scala @@ -4,6 +4,7 @@ import cn.piflow.{JobContext, JobInputStream, JobOutputStream, ProcessContext} import cn.piflow.conf.{ConfigurableStop, Language, Port, StopGroup} import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} +import cn.piflow.util.SciDataFrame import org.apache.spark.sql.SparkSession /** @@ -31,7 +32,7 @@ class OracleRead extends ConfigurableStop{ .option("password",password) .load() - out.write(jdbcDF) + out.write(new SciDataFrame(jdbcDF)) } override def setProperties(map: Map[String, Any]): Unit = { diff --git 
a/piflow-bundle/src/main/scala/cn/piflow/bundle/jdbc/OracleReadByPartition.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/jdbc/OracleReadByPartition.scala index 06a3d94e..f4b49659 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/jdbc/OracleReadByPartition.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/jdbc/OracleReadByPartition.scala @@ -3,6 +3,7 @@ package cn.piflow.bundle.jdbc import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} import cn.piflow.conf.{ConfigurableStop, Language, Port, StopGroup} +import cn.piflow.util.SciDataFrame import cn.piflow.{JobContext, JobInputStream, JobOutputStream, ProcessContext} import org.apache.spark.sql.SparkSession @@ -140,6 +141,6 @@ class OracleReadByPartition extends ConfigurableStop{ .option("numPartitions",numPartitions) .load() - out.write(jdbcDF) + out.write(new SciDataFrame(jdbcDF)) } } diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/jdbc/OracleWrite.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/jdbc/OracleWrite.scala index d5a56bad..92aaa453 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/jdbc/OracleWrite.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/jdbc/OracleWrite.scala @@ -22,7 +22,7 @@ class OracleWrite extends ConfigurableStop{ def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { val session = pec.get[SparkSession]() - val inDF: DataFrame = in.read() + val inDF: DataFrame = in.read().getSparkDf Class.forName("oracle.jdbc.driver.OracleDriver") val con: Connection = DriverManager.getConnection(url,user,password) diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/jdbc/PostgresqlRead.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/jdbc/PostgresqlRead.scala index f6c372f0..2d8ac4e9 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/jdbc/PostgresqlRead.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/jdbc/PostgresqlRead.scala @@ -4,6 +4,7 @@ import cn.piflow._ import cn.piflow.conf._ import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} +import cn.piflow.util.SciDataFrame import org.apache.spark.sql.SparkSession @@ -32,7 +33,7 @@ class PostgresqlRead extends ConfigurableStop { .option("password",password) .load() - out.write(jdbcDF) + out.write(new SciDataFrame(jdbcDF)) } diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/jdbc/PostgresqlWrite.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/jdbc/PostgresqlWrite.scala index b9b3210a..b87457c0 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/jdbc/PostgresqlWrite.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/jdbc/PostgresqlWrite.scala @@ -1,11 +1,11 @@ package cn.piflow.bundle.jdbc import java.util.Properties - import cn.piflow._ import cn.piflow.conf._ import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} +import cn.piflow.util.SciDataFrame import org.apache.spark.sql.{SaveMode, SparkSession} @@ -24,7 +24,7 @@ class PostgresqlWrite extends ConfigurableStop{ def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { val spark = pec.get[SparkSession]() - val jdbcDF = in.read() + val jdbcDF = in.read().getSparkDf val properties = new Properties() properties.put("user", user) properties.put("password", password) @@ -32,7 +32,7 @@ class PostgresqlWrite extends ConfigurableStop{ jdbcDF.write .mode(SaveMode.valueOf(saveMode)).jdbc(url,dbtable,properties) - out.write(jdbcDF) + out.write(new 
SciDataFrame(jdbcDF)) } def initialize(ctx: ProcessContext): Unit = { diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/json/JsonParser.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/json/JsonParser.scala index 50b6e83e..4bf1670f 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/json/JsonParser.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/json/JsonParser.scala @@ -6,7 +6,7 @@ import cn.piflow.conf._ import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} import org.apache.spark.sql.{DataFrame, SparkSession} - +import cn.piflow.SciDataFrameImplicits.autoWrapDataFrame class JsonParser extends ConfigurableStop{ val authorEmail: String = "xjzhu@cnic.cn" diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/json/JsonSave.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/json/JsonSave.scala index 7a09449d..32b031f2 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/json/JsonSave.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/json/JsonSave.scala @@ -19,7 +19,7 @@ class JsonSave extends ConfigurableStop{ def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { - val jsonDF = in.read() + val jsonDF = in.read().getSparkDf jsonDF.write.format("json").mode(SaveMode.Overwrite).save(jsonSavePath) } @@ -39,7 +39,7 @@ class JsonSave extends ConfigurableStop{ .description("The save path of the json file") .defaultValue("") .required(true) - .example("hdfs://192.168.3.138:8020/work/testJson/test/") + .example("/test/test.json") descriptor = jsonSavePath :: descriptor descriptor diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/json/JsonStringParser.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/json/JsonStringParser.scala index c4e45bec..5eff6dd0 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/json/JsonStringParser.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/json/JsonStringParser.scala @@ -5,7 +5,7 @@ import cn.piflow.conf.{ConfigurableStop, Port, StopGroup} import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} import org.apache.spark.sql.SparkSession - +import cn.piflow.SciDataFrameImplicits.autoWrapDataFrame class JsonStringParser extends ConfigurableStop{ val authorEmail: String = "xjzhu@cnic.cn" val description: String = "Parse json string" diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/kafka/ReadFromKafka.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/kafka/ReadFromKafka.scala index b07e588e..330b0946 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/kafka/ReadFromKafka.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/kafka/ReadFromKafka.scala @@ -2,12 +2,12 @@ package cn.piflow.bundle.kafka import java.util import java.util.{Collections, Properties} - import cn.piflow.{JobContext, JobInputStream, JobOutputStream, ProcessContext} import cn.piflow.bundle.util.JedisClusterImplSer import cn.piflow.conf._ import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} +import cn.piflow.util.SciDataFrame import org.apache.spark.sql.{Row, SparkSession} import org.apache.spark.sql.types.{StructField, StructType} @@ -67,7 +67,7 @@ class ReadFromKafka extends ConfigurableStop{ //val newRdd=rdd.map(line=>Row.fromSeq(line.toSeq)) val df=spark.sqlContext.createDataFrame(rdd,dfSchema) //df.show(20) - out.write(df) + out.write(new SciDataFrame(df)) } def initialize(ctx: ProcessContext): Unit = { diff --git 
a/piflow-bundle/src/main/scala/cn/piflow/bundle/kafka/WriteToKafka.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/kafka/WriteToKafka.scala index b72b0932..8dd8dea8 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/kafka/WriteToKafka.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/kafka/WriteToKafka.scala @@ -24,7 +24,7 @@ class WriteToKafka extends ConfigurableStop{ def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { val spark = pec.get[SparkSession]() - val df = in.read() + val df = in.read().getSparkDf val properties:Properties = new Properties() properties.put("bootstrap.servers", kafka_host) properties.put("acks", "all") diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/memcached/ComplementByMemcache.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/memcached/ComplementByMemcache.scala index ce05e15a..397f6a97 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/memcached/ComplementByMemcache.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/memcached/ComplementByMemcache.scala @@ -30,7 +30,7 @@ // // override def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { // val session: SparkSession = pec.get[SparkSession]() -// val inDF: DataFrame = in.read() +// val inDF: DataFrame = in.read().getSparkDf // // val mcc: MemCachedClient =getMcc() // @@ -75,7 +75,7 @@ // val schema: StructType = StructType(fields) // val df: DataFrame = session.createDataFrame(rowRDD,schema) // -// out.write(df) +// out.write(new SciDataFrame(df)) // } // // diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/memcached/GetMemcache.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/memcached/GetMemcache.scala index b77a3b14..642227b2 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/memcached/GetMemcache.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/memcached/GetMemcache.scala @@ -31,7 +31,7 @@ // // override def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { // val session: SparkSession = pec.get[SparkSession]() -// val inDF: DataFrame = in.read() +// val inDF: DataFrame = in.read().getSparkDf // // val mcc: MemCachedClient =getMcc() // @@ -74,7 +74,7 @@ // val s: StructType = StructType(fields) // val df: DataFrame = session.createDataFrame(rowRDD,s) // -// out.write(df) +// out.write(new SciDataFrame(df)) // } // // def getMcc(): MemCachedClient = { diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/memcached/PutMemcache.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/memcached/PutMemcache.scala index c901b0b6..5873f57d 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/memcached/PutMemcache.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/memcached/PutMemcache.scala @@ -26,7 +26,7 @@ // override def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { // // val session: SparkSession = pec.get[SparkSession]() -// val inDF: DataFrame = in.read() +// val inDF: DataFrame = in.read().getSparkDf // // val pool: SockIOPool = SockIOPool.getInstance() // var serversArr:Array[String]=servers.split(",").map(x => x.trim) diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_classification/DecisionTreePrediction.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_classification/DecisionTreePrediction.scala index 1c4139ec..e0ab3955 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_classification/DecisionTreePrediction.scala +++ 
b/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_classification/DecisionTreePrediction.scala @@ -6,7 +6,7 @@ import cn.piflow.conf.{ConfigurableStop, Port, StopGroup} import cn.piflow.{JobContext, JobInputStream, JobOutputStream, ProcessContext} import org.apache.spark.ml.classification.DecisionTreeClassificationModel import org.apache.spark.sql.SparkSession - +import cn.piflow.SciDataFrameImplicits.autoWrapDataFrame class DecisionTreePrediction extends ConfigurableStop{ val authorEmail: String = "06whuxx@163.com" val description: String = "Use an existing decision tree model to predict." diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_classification/DecisionTreeTraining.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_classification/DecisionTreeTraining.scala index a73fba54..0da6d966 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_classification/DecisionTreeTraining.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_classification/DecisionTreeTraining.scala @@ -6,7 +6,7 @@ import cn.piflow.conf.{ConfigurableStop, Port, StopGroup} import cn.piflow.{JobContext, JobInputStream, JobOutputStream, ProcessContext} import org.apache.spark.ml.classification.DecisionTreeClassifier import org.apache.spark.sql.SparkSession - +import cn.piflow.SciDataFrameImplicits.autoWrapDataFrame class DecisionTreeTraining extends ConfigurableStop{ val authorEmail: String = "06whuxx@163.com" val description: String = "Train a decision tree model" diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_classification/GBTPrediction.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_classification/GBTPrediction.scala index 024283a4..f795fbf5 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_classification/GBTPrediction.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_classification/GBTPrediction.scala @@ -6,7 +6,7 @@ import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} import org.apache.spark.ml.classification.GBTClassificationModel import org.apache.spark.sql.SparkSession - +import cn.piflow.SciDataFrameImplicits.autoWrapDataFrame class GBTPrediction extends ConfigurableStop{ val authorEmail: String = "06whuxx@163.com" val description: String = "Use an existing GBT Model to predict" diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_classification/GBTTraining.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_classification/GBTTraining.scala index e11c8927..49441455 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_classification/GBTTraining.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_classification/GBTTraining.scala @@ -6,7 +6,7 @@ import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} import org.apache.spark.ml.classification.GBTClassifier import org.apache.spark.sql.SparkSession - +import cn.piflow.SciDataFrameImplicits.autoWrapDataFrame class GBTTraining extends ConfigurableStop{ val authorEmail: String = "06whuxx@163.com" val description: String = "Train a GBT Model" diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_classification/LogisticRegressionPrediction.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_classification/LogisticRegressionPrediction.scala index 0f157625..a3b5a646 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_classification/LogisticRegressionPrediction.scala +++ 
b/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_classification/LogisticRegressionPrediction.scala @@ -6,7 +6,7 @@ import cn.piflow.conf.{ConfigurableStop, Port, StopGroup} import cn.piflow.{JobContext, JobInputStream, JobOutputStream, ProcessContext} import org.apache.spark.ml.classification.LogisticRegressionModel import org.apache.spark.sql.SparkSession - +import cn.piflow.SciDataFrameImplicits.autoWrapDataFrame class LogisticRegressionPrediction extends ConfigurableStop{ val authorEmail: String = "06whuxx@163.com" val description: String = "Use an existing logistic regression model to predict" diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_classification/LogisticRegressionTraining.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_classification/LogisticRegressionTraining.scala index 17e5c562..414c7c32 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_classification/LogisticRegressionTraining.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_classification/LogisticRegressionTraining.scala @@ -6,7 +6,7 @@ import cn.piflow.conf.{ConfigurableStop, Port, StopGroup} import cn.piflow.{JobContext, JobInputStream, JobOutputStream, ProcessContext} import org.apache.spark.sql.SparkSession import org.apache.spark.ml.classification.LogisticRegression - +import cn.piflow.SciDataFrameImplicits.autoWrapDataFrame class LogisticRegressionTraining extends ConfigurableStop{ val authorEmail: String = "06whuxx@163.com" val description: String = "Train a logistic regression model" diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_classification/MultilayerPerceptronPrediction.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_classification/MultilayerPerceptronPrediction.scala index 7ca9ade2..7e4a6e74 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_classification/MultilayerPerceptronPrediction.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_classification/MultilayerPerceptronPrediction.scala @@ -6,7 +6,7 @@ import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} import org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel import org.apache.spark.sql.SparkSession - +import cn.piflow.SciDataFrameImplicits.autoWrapDataFrame class MultilayerPerceptronPrediction extends ConfigurableStop{ val authorEmail: String = "06whuxx@163.com" val description: String = "Use an existing multilayer perceptron model to predict" diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_classification/MultilayerPerceptronTraining.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_classification/MultilayerPerceptronTraining.scala index 481cbea9..bde0b055 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_classification/MultilayerPerceptronTraining.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_classification/MultilayerPerceptronTraining.scala @@ -6,7 +6,7 @@ import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} import org.apache.spark.ml.classification.MultilayerPerceptronClassifier import org.apache.spark.sql.SparkSession - +import cn.piflow.SciDataFrameImplicits.autoWrapDataFrame class MultilayerPerceptronTraining extends ConfigurableStop{ val authorEmail: String = "xiaoxiao@cnic.cn" val description: String = "Train a multilayer perceptron model" diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_classification/NaiveBayesPrediction.scala 
b/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_classification/NaiveBayesPrediction.scala index 7a8e5c18..efbe507d 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_classification/NaiveBayesPrediction.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_classification/NaiveBayesPrediction.scala @@ -6,7 +6,7 @@ import cn.piflow.conf.{ConfigurableStop, Port, StopGroup} import cn.piflow.{JobContext, JobInputStream, JobOutputStream, ProcessContext} import org.apache.spark.ml.classification.NaiveBayesModel import org.apache.spark.sql.SparkSession - +import cn.piflow.SciDataFrameImplicits.autoWrapDataFrame class NaiveBayesPrediction extends ConfigurableStop{ val authorEmail: String = "06whuxx@163.com" val description: String = "Use an existing NaiveBayes model to predict" diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_classification/NaiveBayesTraining.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_classification/NaiveBayesTraining.scala index 4060042c..7deb3070 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_classification/NaiveBayesTraining.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_classification/NaiveBayesTraining.scala @@ -6,7 +6,7 @@ import cn.piflow.conf.{ConfigurableStop, Port, StopGroup} import cn.piflow.{JobContext, JobInputStream, JobOutputStream, ProcessContext} import org.apache.spark.ml.classification.NaiveBayes import org.apache.spark.sql.SparkSession - +import cn.piflow.SciDataFrameImplicits.autoWrapDataFrame class NaiveBayesTraining extends ConfigurableStop{ val authorEmail: String = "06whuxx@163.com" val description: String = "Train a NaiveBayes model" diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_classification/RandomForestPrediction.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_classification/RandomForestPrediction.scala index f6c16169..667c2c3c 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_classification/RandomForestPrediction.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_classification/RandomForestPrediction.scala @@ -6,7 +6,7 @@ import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} import org.apache.spark.ml.classification.RandomForestClassificationModel import org.apache.spark.sql.SparkSession - +import cn.piflow.SciDataFrameImplicits.autoWrapDataFrame class RandomForestPrediction extends ConfigurableStop{ val authorEmail: String = "06whuxx@163.com" val description: String = "use an existing RandomForest Model to predict" diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_classification/RandomForestTraining.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_classification/RandomForestTraining.scala index dc999bc6..8813fef2 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_classification/RandomForestTraining.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_classification/RandomForestTraining.scala @@ -6,7 +6,7 @@ import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} import org.apache.spark.ml.classification.RandomForestClassifier import org.apache.spark.sql.SparkSession - +import cn.piflow.SciDataFrameImplicits.autoWrapDataFrame class RandomForestTraining extends ConfigurableStop{ val authorEmail: String = "06whuxx@163.com" val description: String = "Train a RandomForest model" diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_clustering/BisectingKMeansPrediction.scala 
b/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_clustering/BisectingKMeansPrediction.scala index 0768fd31..7d9bde27 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_clustering/BisectingKMeansPrediction.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_clustering/BisectingKMeansPrediction.scala @@ -6,7 +6,7 @@ import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} import org.apache.spark.ml.clustering.BisectingKMeansModel import org.apache.spark.sql.SparkSession - +import cn.piflow.SciDataFrameImplicits.autoWrapDataFrame class BisectingKMeansPrediction extends ConfigurableStop{ val authorEmail: String = "06whuxx@163.com" val description: String = "use an existing BisectingKMeans model to predict" diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_clustering/BisectingKMeansTraining.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_clustering/BisectingKMeansTraining.scala index cdcad59d..33971423 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_clustering/BisectingKMeansTraining.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_clustering/BisectingKMeansTraining.scala @@ -6,7 +6,7 @@ import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} import org.apache.spark.ml.clustering.BisectingKMeans import org.apache.spark.sql.SparkSession - +import cn.piflow.SciDataFrameImplicits.autoWrapDataFrame class BisectingKMeansTraining extends ConfigurableStop{ val authorEmail: String = "06whuxx@163.com" val description: String = "BisectingKMeans clustering" diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_clustering/GaussianMixturePrediction.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_clustering/GaussianMixturePrediction.scala index 6115b74b..1b2fcd0b 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_clustering/GaussianMixturePrediction.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_clustering/GaussianMixturePrediction.scala @@ -6,7 +6,7 @@ import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} import org.apache.spark.ml.clustering.GaussianMixtureModel import org.apache.spark.sql.SparkSession - +import cn.piflow.SciDataFrameImplicits.autoWrapDataFrame class GaussianMixturePrediction extends ConfigurableStop{ val authorEmail: String = "06whuxx@163.com" val description: String = "Use an existing GaussianMixture Model to predict" diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_clustering/GaussianMixtureTraining.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_clustering/GaussianMixtureTraining.scala index 8a4abf3e..ceb4e35e 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_clustering/GaussianMixtureTraining.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_clustering/GaussianMixtureTraining.scala @@ -6,7 +6,7 @@ import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} import org.apache.spark.ml.clustering.GaussianMixture import org.apache.spark.sql.SparkSession - +import cn.piflow.SciDataFrameImplicits.autoWrapDataFrame class GaussianMixtureTraining extends ConfigurableStop{ val authorEmail: String = "06whuxx@163.com" val description: String = "GaussianMixture clustering" diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_clustering/KmeansPrediction.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_clustering/KmeansPrediction.scala index db17533d..18cfc2ac 100644 --- 
a/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_clustering/KmeansPrediction.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_clustering/KmeansPrediction.scala @@ -6,7 +6,7 @@ import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} import org.apache.spark.ml.clustering.KMeansModel import org.apache.spark.sql.SparkSession - +import cn.piflow.SciDataFrameImplicits.autoWrapDataFrame class KmeansPrediction extends ConfigurableStop{ val authorEmail: String = "06whuxx@163.com" val description: String = "Use an existing KmeansModel to predict" diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_clustering/KmeansTraining.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_clustering/KmeansTraining.scala index 89382070..5be64f31 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_clustering/KmeansTraining.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_clustering/KmeansTraining.scala @@ -6,7 +6,7 @@ import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} import org.apache.spark.ml.clustering.KMeans import org.apache.spark.sql.SparkSession - +import cn.piflow.SciDataFrameImplicits.autoWrapDataFrame class KmeansTraining extends ConfigurableStop{ val authorEmail: String = "06whuxx@163.com" val description: String = "Kmeans clustering" diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_clustering/LDAPrediction.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_clustering/LDAPrediction.scala index bd513553..ac9b9e52 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_clustering/LDAPrediction.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_clustering/LDAPrediction.scala @@ -6,7 +6,7 @@ import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} import org.apache.spark.ml.clustering.{DistributedLDAModel, LDAModel, LocalLDAModel} import org.apache.spark.sql.SparkSession - +import cn.piflow.SciDataFrameImplicits.autoWrapDataFrame class LDAPrediction extends ConfigurableStop{ val authorEmail: String = "06whuxx@163.com" val description: String = "Use an existing LDAModel to predict" diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_clustering/LDATraining.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_clustering/LDATraining.scala index ce451cd0..e60e48f3 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_clustering/LDATraining.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_clustering/LDATraining.scala @@ -6,7 +6,7 @@ import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} import org.apache.spark.ml.clustering.LDA import org.apache.spark.sql.SparkSession - +import cn.piflow.SciDataFrameImplicits.autoWrapDataFrame class LDATraining extends ConfigurableStop{ val authorEmail: String = "06whuxx@163.com" val description: String = "LDA clustering" diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_feature/WordToVec.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_feature/WordToVec.scala index 48f7a6a9..fab6edaf 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_feature/WordToVec.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/ml_feature/WordToVec.scala @@ -7,7 +7,7 @@ import cn.piflow.conf.util.{ImageUtil, MapUtil} import org.apache.spark.ml.feature.Word2Vec import org.apache.spark.ml.feature.Word2VecModel import org.apache.spark.sql.SparkSession - +import 
cn.piflow.SciDataFrameImplicits.autoWrapDataFrame class WordToVec extends ConfigurableStop{ val authorEmail: String = "06whuxx@163.com" val description: String = "Transfer word to vector" @@ -25,7 +25,7 @@ class WordToVec extends ConfigurableStop{ def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { val spark = pec.get[SparkSession]() val sqlContext=spark.sqlContext - val df=in.read() + val df=in.read().getSparkDf df.createOrReplaceTempView("doc") sqlContext.udf.register("split",(str:String)=>str.split(" ")) val sqlText:String="select split("+colName+") as "+colName+"_new from doc" diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/mongodb/GetMongo.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/mongodb/GetMongo.scala index 527fe6ad..4ad4468d 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/mongodb/GetMongo.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/mongodb/GetMongo.scala @@ -14,7 +14,7 @@ import org.apache.spark.sql.{DataFrame, Row, SparkSession} import org.bson.Document import scala.collection.mutable.ArrayBuffer - +import cn.piflow.SciDataFrameImplicits.autoWrapDataFrame class GetMongo extends ConfigurableStop{ override val authorEmail: String = "yangqidong@cnic.cn" diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/mongodb/GetMongoDB.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/mongodb/GetMongoDB.scala index 49b8d336..e0535501 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/mongodb/GetMongoDB.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/mongodb/GetMongoDB.scala @@ -5,7 +5,7 @@ import cn.piflow.conf.{ConfigurableStop, Port, StopGroup} import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} import org.apache.spark.sql.{DataFrame, SparkSession} - +import cn.piflow.SciDataFrameImplicits.autoWrapDataFrame class GetMongoDB extends ConfigurableStop{ override val authorEmail: String = "yangqidong@cnic.cn" override val description: String = "Get data from mongodb" diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/mongodb/PutMongo.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/mongodb/PutMongo.scala index 3a20f819..a85f4038 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/mongodb/PutMongo.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/mongodb/PutMongo.scala @@ -25,7 +25,7 @@ class PutMongo extends ConfigurableStop{ override def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { val spark: SparkSession = pec.get[SparkSession]() - val df: DataFrame = in.read() + val df: DataFrame = in.read().getSparkDf var addressesArr: util.ArrayList[ServerAddress] = new util.ArrayList[ServerAddress]() val ipANDport: Array[String] = addresses.split(",").map(x => x.trim) diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/mongodb/PutMongoDB.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/mongodb/PutMongoDB.scala index be54bb16..fef0f22b 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/mongodb/PutMongoDB.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/mongodb/PutMongoDB.scala @@ -20,7 +20,7 @@ class PutMongoDB extends ConfigurableStop{ override def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { val spark: SparkSession = pec.get[SparkSession]() - val df: DataFrame = in.read() + val df: DataFrame = in.read().getSparkDf df.write.options( Map("spark.mongodb.output.uri" -> ("mongodb://" + ip + ":" + port + "/" + dataBase + "." 
+ collection)) diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/neo4j/PutNeo4j.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/neo4j/PutNeo4j.scala index 945befe3..2c80f81e 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/neo4j/PutNeo4j.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/neo4j/PutNeo4j.scala @@ -21,7 +21,7 @@ class PutNeo4j extends ConfigurableStop{ override def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { val spark: SparkSession = pec.get[SparkSession]() - val inDf: DataFrame = in.read() + val inDf: DataFrame = in.read().getSparkDf val fileNames: Array[String] = inDf.columns diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/nlp/WordSpliter.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/nlp/WordSpliter.scala index 6e722fc3..c4f757c0 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/nlp/WordSpliter.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/nlp/WordSpliter.scala @@ -4,6 +4,7 @@ import cn.piflow._ import cn.piflow.conf._ import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} +import cn.piflow.util.SciDataFrame import com.huaban.analysis.jieba.JiebaSegmenter.SegMode import com.huaban.analysis.jieba._ import org.apache.spark.rdd.RDD @@ -63,7 +64,7 @@ class WordSpliter extends ConfigurableStop { )) val df: DataFrame = session.createDataFrame(rowRDD,schema) - out.write(df) + out.write(new SciDataFrame(df)) } def initialize(ctx: ProcessContext): Unit = { diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/normalization/Discretization.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/normalization/Discretization.scala new file mode 100644 index 00000000..96e7f9bd --- /dev/null +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/normalization/Discretization.scala @@ -0,0 +1,171 @@ +package cn.piflow.bundle.normalization + +import cn.piflow.{JobContext, JobInputStream, JobOutputStream, ProcessContext} +import cn.piflow.conf._ +import cn.piflow.conf.bean.PropertyDescriptor +import cn.piflow.conf.util.{ImageUtil, MapUtil} +import org.apache.spark.ml.clustering.{KMeans, KMeansModel} +import org.apache.spark.ml.feature.VectorAssembler +import org.apache.spark.ml.feature.Bucketizer +import org.apache.spark.sql.{DataFrame, SparkSession} +import org.apache.spark.ml.feature.QuantileDiscretizer +import cn.piflow.SciDataFrameImplicits.autoWrapDataFrame + +class Discretization extends ConfigurableStop { + + val authorEmail: String = "zljxnu@163.com" + val description: String = "continuous numerical discretization" + val inportList: List[String] = List(Port.DefaultPort) + val outportList: List[String] = List(Port.DefaultPort) + + var inputCol: String = _ + var outputCol: String = _ + var method: String = _ + var numBins: Int = _ + var k: Int = _ + + def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { + val spark = pec.get[SparkSession]() + val df = in.read().getSparkDf + + // 根据用户选择的方法进行相应的离散化 + val discretizedDF = method match { + case "EqualWidth" => equalWidthDiscretization(df, inputCol, outputCol, numBins) + case "EqualFrequency" => equalFrequencyDiscretization(df, inputCol, outputCol, numBins) + case "KMeans" => kMeansDiscretization(df, inputCol, outputCol, k) + case _ => df // 默认情况下不进行任何处理 + } + + out.write(discretizedDF) + } + + // 等宽法离散化 + def equalWidthDiscretization(df: DataFrame, inputCol: String, outputCol: String, numBins: Int): DataFrame = { + val bucketizer = new Bucketizer() + .setInputCol(inputCol) 
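+      // Note: with splits 0.0, 1.0, ..., numBins the Bucketizer produces the fixed-width
+      // buckets [0,1), [1,2), ..., [numBins-1, numBins], so this branch assumes the input
+      // column already falls inside [0, numBins]; out-of-range values fail under the
+      // default handleInvalid = "error" setting.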
+ .setOutputCol(outputCol) +// .setSplits((0 to numBins).map(_.toDouble)) + .setSplits((0 to numBins).map(_.toDouble).toArray) + bucketizer.transform(df) + } + +// // 等频离散化 +// def equalFrequencyDiscretization(df: DataFrame, inputCol: String, outputCol: String, numBins: Int): DataFrame = { +// val discretizer = new QuantileDiscretizer() +// .setInputCol(inputCol) +// .setOutputCol(outputCol) +// .setNumBins(numBins) +// discretizer.fit(df).transform(df) +// } +// +// // 定义一个方法来执行等频离散化 +// def equalFrequencyDiscretization(df: DataFrame, inputCol: String, outputCol: String, numBins: Int ): DataFrame = { +// // 使用QuantileDiscretizer进行等频离散化 +// val discretizer = new QuantileDiscretizer() +// .setInputCol(inputCol) +// .setOutputCol(outputCol) +// .setNumBins(numBins) +// +// val dfNew = discretizer.fit(df).transform(df) +// dfNew +// } + + // 等频离散化 + def equalFrequencyDiscretization(df: DataFrame, inputCol: String, outputCol: String, numBins: Int): DataFrame = { + // 创建一个QuantileDiscretizer实例,用于等频离散化 + val discretizer = new QuantileDiscretizer() + .setInputCol(inputCol) // 设置输入列 + .setOutputCol(outputCol) // 设置输出列 + .setNumBuckets(numBins) // 设置桶的数量 + + // 使用数据来拟合(discretizer.fit)并进行离散化转换(discretizer.transform) + val dfNew = discretizer.fit(df).transform(df) + dfNew // 返回离散化后的DataFrame + } + + // 聚类离散化 + def kMeansDiscretization(df: DataFrame, inputCol: String, outputCol: String, k: Int): DataFrame = { + // 使用KMeans算法将数值列映射到[0, k-1]的整数 + val assembler = new VectorAssembler() + .setInputCols(Array(inputCol)) + .setOutputCol("features") + val vectorizedDF = assembler.transform(df) + + val kmeans = new KMeans() + .setK(k) + .setSeed(1L) + .setFeaturesCol("features") + .setPredictionCol(outputCol) + val model = kmeans.fit(vectorizedDF) + + val clusteredDF = model.transform(vectorizedDF) + clusteredDF.drop("features") + } + + def initialize(ctx: ProcessContext): Unit = {} + + def setProperties(map: Map[String, Any]): Unit = { + inputCol = MapUtil.get(map, "inputCol").asInstanceOf[String] + outputCol = MapUtil.get(map, "outputCol").asInstanceOf[String] + method = MapUtil.get(map, "method").asInstanceOf[String] + numBins = MapUtil.get(map, "numBins").asInstanceOf[String].toInt + k = MapUtil.get(map, "k").asInstanceOf[String].toInt + } + + override def getPropertyDescriptor(): List[PropertyDescriptor] = { + var descriptor: List[PropertyDescriptor] = List() + + val inputColDescriptor = new PropertyDescriptor() + .name("inputCol") + .displayName("Input Column") + .description("The name of the input column to be discretized.") + .defaultValue("") + .required(true) + + val outputColDescriptor = new PropertyDescriptor() + .name("outputCol") + .displayName("Output Column") + .description("The name of the output column to store discretized values.") + .defaultValue("") + .required(true) + + val methodDescriptor = new PropertyDescriptor() + .name("method") + .displayName("Discretization Method") + .description("Choose the discretization method: EqualWidth, EqualFrequency, or KMeans.") + .allowableValues(Set("EqualWidth", "EqualFrequency", "KMeans")) + .defaultValue("EqualWidth") + .required(true) + + val numBinsDescriptor = new PropertyDescriptor() + .name("numBins") + .displayName("Number of Bins") + .description("The number of bins to use for EqualWidth and EqualFrequency methods.") + .defaultValue("10") + .required(false) + + val kDescriptor = new PropertyDescriptor() + .name("k") + .displayName("Number of Clusters (KMeans only)") + .description("The number of clusters to use for the KMeans method.") + 
.defaultValue("3") + .required(false) + + descriptor = inputColDescriptor :: descriptor + descriptor = outputColDescriptor :: descriptor + descriptor = methodDescriptor :: descriptor + descriptor = numBinsDescriptor :: descriptor + descriptor = kDescriptor :: descriptor + + descriptor + } + + override def getIcon(): Array[Byte] = { + // 返回组件图标 + ImageUtil.getImage("icon/normalization/DiscretizationNormalization.png") + } + + override def getGroup(): List[String] = { + List(StopGroup.NormalizationGroup) + } +} diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/normalization/MaxMinNormalization.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/normalization/MaxMinNormalization.scala new file mode 100644 index 00000000..2b655c73 --- /dev/null +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/normalization/MaxMinNormalization.scala @@ -0,0 +1,180 @@ +package cn.piflow.bundle.normalization + +import cn.piflow._ +import cn.piflow.conf._ +import cn.piflow.conf.bean.PropertyDescriptor +import cn.piflow.conf.util.{ImageUtil, MapUtil} +import org.apache.spark.sql.{DataFrame, SparkSession} +import cn.piflow.SciDataFrameImplicits.autoWrapDataFrame +class MaxMinNormalization extends ConfigurableStop { + // 作者信息 + val authorEmail: String = "zljxnu@163.com" + // 组件描述 + val description: String = "maximum and minimum value standardization" + // 输入端口列表 + val inportList: List[String] = List(Port.DefaultPort) + // 输出端口列表 + val outportList: List[String] = List(Port.DefaultPort) + + // 定义属性:要标准化的列名 + var inputCol: String = _ + + // 定义属性:输出列名 + var outputCol: String = _ + + // 初始化方法 + def initialize(ctx: ProcessContext): Unit = {} + + // 执行方法 + def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { + // 获取 SparkSession + val spark = pec.get[SparkSession]() + + // 从输入端口读取数据 + val df = in.read().getSparkDf + + // 计算列的最大值和最小值 + val max = df.agg(Map(inputCol -> "max")).collect()(0)(0).asInstanceOf[Double] + val min = df.agg(Map(inputCol -> "min")).collect()(0)(0).asInstanceOf[Double] + + // 使用公式进行最小-最大值标准化 + val scaledDf: DataFrame = df.withColumn(outputCol, (df(inputCol) - min) / (max - min)) + + // 将标准化后的数据写入输出端口 + out.write(scaledDf) + } + + // 设置属性 + def setProperties(map: Map[String, Any]): Unit = { + inputCol = MapUtil.get(map, "inputCol").asInstanceOf[String] + outputCol = MapUtil.get(map, "outputCol").asInstanceOf[String] + } + + // 获取属性描述 + override def getPropertyDescriptor(): List[PropertyDescriptor] = { + var descriptor: List[PropertyDescriptor] = List() + val inputCol = new PropertyDescriptor() + .name("inputCol") + .displayName("输入列名") + .description("要进行最小-最大值标准化的列名") + .defaultValue("") + .required(true) + + val outputCol = new PropertyDescriptor() + .name("outputCol") + .displayName("Column_Name输出列名") + .description("Column names with numerical data to be scaled 标准化后的列名") + .defaultValue("") + .required(true) + + descriptor = inputCol :: outputCol :: descriptor + descriptor + } + + // 获取组件图标 + override def getIcon(): Array[Byte] = { + ImageUtil.getImage("icon/normalization/MaxMinNormalization.png") + } + + // 获取组件所属的组 + override def getGroup(): List[String] = { + List(StopGroup.NormalizationGroup) + } +} + + +//package cn.piflow.bundle.normalization +// +//import cn.piflow.bundle.util.CleanUtil +//import cn.piflow.{JobContext, JobInputStream, JobOutputStream, ProcessContext} +//import cn.piflow.conf._ +//import cn.piflow.conf.bean.PropertyDescriptor +//import cn.piflow.conf.util.{ImageUtil, MapUtil} +//import org.apache.spark.sql.SparkSession +// +//class 
MaxMinNormalization extends ConfigurableStop { +// +// // 作者邮箱 +// val authorEmail: String = "zljxnu@163.com" +// // 描述 +// val description: String = "MinMax scaling for numerical data" +// // 输入端口列表 +// val inportList: List[String] = List(Port.DefaultPort) +// // 输出端口列表 +// val outportList: List[String] = List(Port.DefaultPort) +// +// // 需要标准化的列名,从属性中设置 +// var columnName: String = _ +// +// // 执行标准化操作 +// def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { +// val spark = pec.get[SparkSession]() +// val sqlContext = spark.sqlContext +// // 读取输入数据 +// val dfOld = in.read().getSparkDf +// // 将输入数据创建为临时表 +// dfOld.createOrReplaceTempView("data") +// // 解析需要标准化的列名 +// val columnNames = columnName.split(",").toSet +// +// val sqlNewFieldStr = new StringBuilder +// // 针对每个指定的列名,生成标准化的 SQL 代码 +// columnNames.foreach(c => { +// sqlNewFieldStr ++= ",(((" +// sqlNewFieldStr ++= c +// sqlNewFieldStr ++= " - min(" +// sqlNewFieldStr ++= c +// sqlNewFieldStr ++= ")) / (max(" +// sqlNewFieldStr ++= c +// sqlNewFieldStr ++= ") - min(" +// sqlNewFieldStr ++= c +// sqlNewFieldStr ++= "))) as " +// sqlNewFieldStr ++= c +// sqlNewFieldStr ++= "_scaled " +// }) +// +// // 构建最终的 SQL 查询文本 +// val sqlText: String = "select * " + sqlNewFieldStr + " from data" +// +// // 执行 SQL 查询,得到标准化后的 DataFrame +// val dfNew = sqlContext.sql(sqlText) +// dfNew.createOrReplaceTempView("scaled_data") +// +// // 将标准化后的数据写入输出 +// out.write(dfNew) +// } +// +// // 初始化方法 +// def initialize(ctx: ProcessContext): Unit = {} +// +// // 设置属性 +// def setProperties(map: Map[String, Any]): Unit = { +// // 从属性映射中获取需要标准化的列名 +// columnName = MapUtil.get(map, key = "columnName").asInstanceOf[String] +// } +// +// // 定义属性描述符 +// override def getPropertyDescriptor(): List[PropertyDescriptor] = { +// var descriptor: List[PropertyDescriptor] = List() +// val columnNameDesc = new PropertyDescriptor() +// .name("columnName") +// .displayName("Column_Name") +// .description("Column names with numerical data to be scaled (comma-separated)") +// .defaultValue("") +// .required(true) +// .example("feature1,feature2") +// +// descriptor = columnNameDesc :: descriptor +// descriptor +// } +// +// // 获取图标 +// override def getIcon(): Array[Byte] = { +// ImageUtil.getImage("icon/normalization/MaxMinNormalization.png") +// } +// +// // 获取所属组 +// override def getGroup(): List[String] = { +// List(StopGroup.NormalizationGroup) +// } +//} diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/normalization/ScopeNormalization.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/normalization/ScopeNormalization.scala new file mode 100644 index 00000000..902e7ea0 --- /dev/null +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/normalization/ScopeNormalization.scala @@ -0,0 +1,114 @@ +package cn.piflow.bundle.normalization + +import cn.piflow.{JobContext, JobInputStream, JobOutputStream, ProcessContext} +import cn.piflow.conf._ +import cn.piflow.conf.bean.PropertyDescriptor +import cn.piflow.conf.util.{ImageUtil, MapUtil} +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.{DataFrame, SparkSession} +import cn.piflow.SciDataFrameImplicits.autoWrapDataFrame +class ScopeNormalization extends ConfigurableStop { + + // 组件的作者信息 + val authorEmail: String = "zljxnu@163.com" + // 组件的描述信息 + val description: String = "Scope standardization" + // 定义输入端口 + val inportList: List[String] = List(Port.DefaultPort) + // 定义输出端口 + val outportList: List[String] = List(Port.DefaultPort) + + // 定义输入列名称 + var inputCol: String = _ + // 
定义输出列名称 + var outputCol: String = _ + // 定义目标范围 [a, b] + var range: (Double, Double) = (0.0, 1.0) + + // 实际的数据处理逻辑 + def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { + // 获取SparkSession + val spark = pec.get[SparkSession]() + + // 读取输入数据 + val dfOld = in.read().getSparkDf + + // 使用范围映射公式进行数据处理 + val dfNew = mapToRange(dfOld, inputCol, outputCol, range) + + // 将处理后的数据写出 + out.write(dfNew) + } + + // 初始化方法 + def initialize(ctx: ProcessContext): Unit = {} + + // 设置组件属性 + def setProperties(map: Map[String, Any]): Unit = { + inputCol = MapUtil.get(map, key = "inputCol").asInstanceOf[String] + outputCol = MapUtil.get(map, key = "outputCol").asInstanceOf[String] + val values = MapUtil.get(map, key = "range").asInstanceOf[String].stripPrefix("(").stripSuffix(")").split(",").map(_.toDouble) + range = (values(0), values(1)) + +//// range = MapUtil.get(map, key = "range").asInstanceOf[(Double, Double)] +// //把string解析成元组映射给range +// val jsonString: String = MapUtil.get(map, key = "range").asInstanceOf[String] +// // 移除括号并分割字符串 +// val values = jsonString.stripPrefix("(").stripSuffix(")").split(",").map(_.toDouble) +// // 创建 Scala 元组 +// val range: (Double, Double) = (values(0), values(1)) + + + + } + + // 定义组件的属性描述 + override def getPropertyDescriptor(): List[PropertyDescriptor] = { + var descriptor: List[PropertyDescriptor] = List() + val inputCol = new PropertyDescriptor() + .name("inputCol") + .displayName("Input Column") + .description("要映射的输入列的名称") + .defaultValue("") + .required(true) + .example("input_data") + + val outputCol = new PropertyDescriptor() + .name("outputCol") + .displayName("Output Column") + .description("映射后的输出列的名称") + .defaultValue("") + .required(true) + .example("normalized_data") + + val range = new PropertyDescriptor() + .name("range") + .displayName("Range") + .description("目标范围 [a, b],以元组的形式表示") + .defaultValue("") + .required(true) + .example("(0.0, 1.0)") + + descriptor = inputCol :: outputCol :: range :: descriptor + descriptor + } + + // 定义组件的图标(可选) + override def getIcon(): Array[Byte] = { + ImageUtil.getImage("icon/normalization/ScopeNormalization.png") + } + + // 定义组件所属的分组(可选) + override def getGroup(): List[String] = { + List(StopGroup.NormalizationGroup) + } + + // 实现范围映射的方法 + private def mapToRange(df: DataFrame, inputCol: String, outputCol: String, range: (Double, Double)): DataFrame = { + // 使用Spark SQL的functions库来进行数据处理 + val min = df.agg(Map(inputCol -> "min")).collect()(0)(0).asInstanceOf[Double] + val max = df.agg(Map(inputCol -> "max")).collect()(0)(0).asInstanceOf[Double] + val dfNew = df.withColumn(outputCol, (col(inputCol) - min) / (max - min) * (range._2 - range._1) + range._1) + dfNew + } +} \ No newline at end of file diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/normalization/ZScore.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/normalization/ZScore.scala new file mode 100644 index 00000000..43887ba3 --- /dev/null +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/normalization/ZScore.scala @@ -0,0 +1,91 @@ +package cn.piflow.bundle.normalization + +import cn.piflow.conf._ +import cn.piflow.conf.bean.PropertyDescriptor +import cn.piflow.conf.util.{ImageUtil, MapUtil} +import cn.piflow.{JobContext, JobInputStream, JobOutputStream, ProcessContext} +import org.apache.spark.sql.{DataFrame, SparkSession} +import org.apache.spark.sql.functions._ +import cn.piflow.SciDataFrameImplicits.autoWrapDataFrame +class ZScore extends ConfigurableStop { + + // 作者邮箱 + val authorEmail: String = "zljxnu@163.cn" + 
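+  // Each requested column is standardized to z = (x - mean) / stddev in perform(), where the
+  // mean comes from avg() and stddev is Spark's sample standard deviation (stddev_samp).
+  // out.write(finalDf) below passes a plain DataFrame; the imported
+  // SciDataFrameImplicits.autoWrapDataFrame conversion presumably wraps it into the same
+  // SciDataFrame type that the JDBC stops in this patch now construct explicitly.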
// 描述 + val description: String = "ZScore standardization" + // 输入端口 + val inportList: List[String] = List(Port.DefaultPort) + // 输出端口 + val outportList: List[String] = List(Port.DefaultPort) + + // 输入列名称 + var inputCols: String = _ + // 输出列名称 + var outputCols: String = _ + + def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { + val spark = pec.get[SparkSession]() + val df = in.read().getSparkDf + + // 将逗号分隔的输入和输出列名称拆分为列表 + val inputColList = inputCols.split(",").map(_.trim) + val outputColList = outputCols.split(",").map(_.trim) + + // 计算均值和标准差 + val stats = inputColList.foldLeft(df) { + case (currentDf, inputCol) => + val mean = currentDf.select(avg(col(inputCol))).first().getDouble(0) + val stdDev = currentDf.select(stddev(col(inputCol))).first().getDouble(0) + // 创建一个新列名:{inputCol}_zscore + val zScoreCol = s"${inputCol}_zscore" + + // 使用公式进行 z-score 标准化 + currentDf.withColumn(zScoreCol, (col(inputCol) - mean) / stdDev) + } + + // 重命名输出列以匹配原始列名称 + val finalDf = inputColList.zip(outputColList).foldLeft(stats) { + case (currentDf, (inputCol, outputCol)) => + currentDf.withColumnRenamed(s"${inputCol}_zscore", outputCol) + } + + out.write(finalDf) + } + + def initialize(ctx: ProcessContext): Unit = {} + + def setProperties(map: Map[String, Any]): Unit = { + inputCols = MapUtil.get(map, key = "inputCols").asInstanceOf[String] + outputCols = MapUtil.get(map, key = "outputCols").asInstanceOf[String] + } + + override def getPropertyDescriptor(): List[PropertyDescriptor] = { + var descriptor: List[PropertyDescriptor] = List() + val inputCols = new PropertyDescriptor() + .name("inputCols") + .displayName("输入列") + .description("要标准化的列,用逗号分隔。") + .defaultValue("") + .required(true) + .example("特征1, 特征2") + + val outputCols = new PropertyDescriptor() + .name("outputCols") + .displayName("输出列") + .description("用于存储标准化值的相应输出列,用逗号分隔。") + .defaultValue("") + .required(true) + .example("标准化特征1, 标准化特征2") + + descriptor = inputCols :: outputCols :: descriptor + descriptor + } + + override def getIcon(): Array[Byte] = { + ImageUtil.getImage("icon/normalization/ZScoreNormalization.png") + } + + override def getGroup(): List[String] = { + List(StopGroup.NormalizationGroup) + } +} diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/oceanbase/OceanBaseRead.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/oceanbase/OceanBaseRead.scala index ecc0631c..1b84c931 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/oceanbase/OceanBaseRead.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/oceanbase/OceanBaseRead.scala @@ -4,6 +4,7 @@ import cn.piflow._ import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} import cn.piflow.conf.{ConfigurableStop, Language, Port, StopGroup} +import cn.piflow.util.SciDataFrame import org.apache.spark.sql.SparkSession @@ -32,7 +33,7 @@ class OceanBaseRead extends ConfigurableStop{ .option("password",password) .load() - out.write(jdbcDF) + out.write(new SciDataFrame(jdbcDF)) } override def setProperties(map: Map[String, Any]): Unit = { diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/oceanbase/OceanBaseWrite.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/oceanbase/OceanBaseWrite.scala index ff92191f..38fe8c06 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/oceanbase/OceanBaseWrite.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/oceanbase/OceanBaseWrite.scala @@ -30,7 +30,7 @@ class OceanBaseWrite extends ConfigurableStop{ properties.put("password", password) 
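+    // The connection options are collected into java.util.Properties and handed to
+    // DataFrameWriter.jdbc below; isolationLevel is forced to NONE to avoid the write
+    // exception noted in the existing comment.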
properties.put("driver",driver) properties.put("isolationLevel","NONE") //if not set this value, throw expection - val df = in.read() + val df = in.read().getSparkDf df.write.mode(SaveMode.Append).jdbc(url,dbtable,properties) } diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/openLooKeng/OpenLooKengRead.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/openLooKeng/OpenLooKengRead.scala index 5004dcfb..27903873 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/openLooKeng/OpenLooKengRead.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/openLooKeng/OpenLooKengRead.scala @@ -4,6 +4,7 @@ import cn.piflow._ import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} import cn.piflow.conf.{ConfigurableStop, Language, Port, StopGroup} +import cn.piflow.util.SciDataFrame import org.apache.spark.sql.SparkSession @@ -32,7 +33,7 @@ class OpenLooKengRead extends ConfigurableStop{ .option("password",password) .load() - out.write(jdbcDF) + out.write(new SciDataFrame(jdbcDF)) } override def setProperties(map: Map[String, Any]): Unit = { diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/rdf/RdfToDF.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/rdf/RdfToDF.scala index b3bb3eee..20bea70f 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/rdf/RdfToDF.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/rdf/RdfToDF.scala @@ -10,7 +10,7 @@ import cn.piflow.{JobContext, JobInputStream, JobOutputStream, ProcessContext} import org.apache.spark.rdd.RDD import org.apache.spark.sql.types.{DataTypes, StringType, StructField, StructType} import org.apache.spark.sql.{Row, SparkSession} - +import cn.piflow.SciDataFrameImplicits.autoWrapDataFrame class RdfToDF extends ConfigurableStop{ override val authorEmail: String = "xiaomeng7890@gmail.com" @@ -162,7 +162,7 @@ class RdfToDF extends ConfigurableStop{ //in if (isFront == "true") { val inDF : Array[String] = in - .read() + .read().getSparkDf .collect() .map(r => r.getAs[String](1)) var index = 0 diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/redis/ReadFromRedis.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/redis/ReadFromRedis.scala index 439c8046..b5e8c7bb 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/redis/ReadFromRedis.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/redis/ReadFromRedis.scala @@ -2,12 +2,12 @@ package cn.piflow.bundle.redis import java.util - import cn.piflow.bundle.util.{JedisClusterImplSer, RedisUtil} import cn.piflow.{JobContext, JobInputStream, JobOutputStream, ProcessContext} import cn.piflow.conf._ import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} +import cn.piflow.util.SciDataFrame import org.apache.avro.generic.GenericData.StringType import org.apache.spark.sql.catalyst.encoders.RowEncoder import org.apache.spark.sql.types.{DataType, StructField, StructType} @@ -33,7 +33,7 @@ class ReadFromRedis extends ConfigurableStop{ def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { val spark = pec.get[SparkSession]() - var dfIn=in.read() + var dfIn=in.read().getSparkDf var colName=column_name //connect to redis @@ -57,7 +57,7 @@ class ReadFromRedis extends ConfigurableStop{ Row.fromSeq(row.toArray.toSeq) }) val df=spark.createDataFrame(newRDD,dfSchema) - out.write(df) + out.write(new SciDataFrame(df)) } def initialize(ctx: ProcessContext): Unit = { diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/redis/WriteToRedis.scala 
b/piflow-bundle/src/main/scala/cn/piflow/bundle/redis/WriteToRedis.scala index 1cf362af..aaa3a9f1 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/redis/WriteToRedis.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/redis/WriteToRedis.scala @@ -23,7 +23,7 @@ class WriteToRedis extends ConfigurableStop{ def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { val spark = pec.get[SparkSession]() - val df = in.read() + val df = in.read().getSparkDf var col_name:String=column_name df.printSchema() diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/script/DataFrameRowParser.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/script/DataFrameRowParser.scala index af1f29c1..8ad612b5 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/script/DataFrameRowParser.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/script/DataFrameRowParser.scala @@ -3,6 +3,7 @@ package cn.piflow.bundle.script import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} import cn.piflow.conf._ +import cn.piflow.util.SciDataFrame import cn.piflow.{JobContext, JobInputStream, JobOutputStream, ProcessContext} import org.apache.spark.sql.types.{StringType, StructField, StructType} import org.apache.spark.sql.{Row, SparkSession} @@ -46,7 +47,7 @@ class DataFrameRowParser extends ConfigurableStop{ override def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { val spark = pec.get[SparkSession]() - val inDF = in.read() + val inDF = in.read().getSparkDf //parse RDD val rdd = inDF.rdd.map(row => { @@ -65,7 +66,7 @@ class DataFrameRowParser extends ConfigurableStop{ //create DataFrame val df = spark.createDataFrame(rdd,schemaStructType) //df.show() - out.write(df) + out.write(new SciDataFrame(df)) } } diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/script/DockerExecute.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/script/DockerExecute.scala index a50e84dd..e877f3e9 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/script/DockerExecute.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/script/DockerExecute.scala @@ -6,20 +6,20 @@ import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} import cn.piflow.util.PropertyUtil import cn.piflow.{JobContext, JobInputStream, JobOutputStream, ProcessContext} -import org.apache.spark.sql.SparkSession +import org.apache.spark.sql.{SaveMode, SparkSession} +import cn.piflow.SciDataFrameImplicits.autoWrapDataFrame - -class DockerExecute extends ConfigurableStop{ +class DockerExecute extends ConfigurableStop { val authorEmail: String = "ygang@cnic.cn" val description: String = "docker runs Python" val inportList: List[String] = List(Port.AnyPort) val outportList: List[String] = List(Port.AnyPort) - var outports : List[String] = _ - var inports : List[String] = _ + var outports: List[String] = _ + var inports: List[String] = _ - var ymlContent:String =_ + var ymlContent: String = _ override def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { val spark = pec.get[SparkSession]() @@ -34,37 +34,44 @@ class DockerExecute extends ConfigurableStop{ ymlContent = ymlContent.replace("piflow_hdfs_url", PropertyUtil.getPropertyValue("hdfs.web.url")) .replace("piflow_extra_hosts", stringBuffer.toString) + val embedModelsPath = PropertyUtil.getPropertyValue("embed_models_path") + if (embedModelsPath != null) { + ymlContent = ymlContent.replace("embed_models_path", embedModelsPath) + } 
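// [Editor's note] Illustrative sketch only; not part of the original patch. The block above
// substitutes the optional embed_models_path placeholder in the docker-compose template only when
// the property is actually set. If more optional placeholders are added, the same null-guarded
// replace could be factored into a small helper; the name fillPlaceholder is hypothetical:
//
//   def fillPlaceholder(yml: String, placeholder: String, value: String): String =
//     if (value == null || value.isEmpty) yml else yml.replace(placeholder, value)
//
//   ymlContent = fillPlaceholder(ymlContent, "embed_models_path",
//     PropertyUtil.getPropertyValue("embed_models_path"))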
println(ymlContent) - + DockerStreamUtil.execRuntime("mkdir app") - val ymlName = uuid+".yml" + val ymlName = uuid + ".yml" println("执行命令:=============================执行创建app文件夹命令=================") DockerStreamUtil.execRuntime(s"echo '${ymlContent}'> app/${ymlName}") - val dockerShellString=s"docker-compose -f app/${ymlName} up" - val dockerDownShellString=s"docker-compose -f app/${ymlName} down" + val dockerShellString = s"docker-compose -f app/${ymlName} up --force-recreate" + val dockerDownShellString = s"docker-compose -f app/${ymlName} down" val inputPath = "/piflow/docker/" + appID + s"/inport_${uuid}/" val outputPath = "/piflow/docker/" + appID + s"/outport_${uuid}/" val inputPathStringBuffer = new StringBuffer() - if(!(inports.contains("Default") || inports.contains("DefaultPort"))){ + if (!(inports.contains("Default") || inports.contains("DefaultPort"))) { inports.foreach(x => { - println("输入端口:============================="+x+"=================") + println("输入端口:=============================" + x + "=================") val hdfsSavePath = inputPath + x inputPathStringBuffer.append(hdfsSavePath + ",") - in.read(x).write.format("csv").mode("overwrite") - .option("delimiter", "\t") - .option("header", true).save(hdfsSavePath) + // in.read(x).write.format("csv").mode("overwrite") + // .option("delimiter", "\t") + // .option("header", true).save(hdfsSavePath) + in.read(x).getSparkDf.write + .mode("overwrite") // 指定写入模式,这里是覆盖已存在的文件 + .parquet(hdfsSavePath) }) println("执行命令:======================输入路径写入app/inputPath.txt 文件========================") DockerStreamUtil.execRuntime(s"echo ${inputPath}> app/inputPath.txt") } - if(!(outports.contains("Default") || outports.contains("DefaultPort"))){ + if (!(outports.contains("Default") || outports.contains("DefaultPort"))) { println("执行命令:======================输出路径写入app/outputPath.txt 文件========================") DockerStreamUtil.execRuntime(s"echo ${outputPath}> app/outputPath.txt") } @@ -72,9 +79,9 @@ class DockerExecute extends ConfigurableStop{ println("执行命令:======================创建镜像命令========================") DockerStreamUtil.execRuntime(dockerShellString) - if(!(outports.contains("Default") || outports.contains("DefaultPort"))){ + if (!(outports.contains("Default") || outports.contains("DefaultPort"))) { outports.foreach(x => { - println("输出端口:============================="+x+"=================") + println("输出端口:=============================" + x + "=================") val outDF = spark.read.format("csv") .option("header", true) .option("mode", "FAILFAST") @@ -92,7 +99,7 @@ class DockerExecute extends ConfigurableStop{ val inportStr = MapUtil.get(map, "inports").asInstanceOf[String] inports = inportStr.split(",").map(x => x.trim).toList - ymlContent =MapUtil.get(map, key = "ymlContent").asInstanceOf[String] + ymlContent = MapUtil.get(map, key = "ymlContent").asInstanceOf[String] } @@ -102,7 +109,7 @@ class DockerExecute extends ConfigurableStop{ override def getPropertyDescriptor(): List[PropertyDescriptor] = { - var descriptor : List[PropertyDescriptor] = List() + var descriptor: List[PropertyDescriptor] = List() val inports = new PropertyDescriptor() .name("inports") .displayName("Inports") diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/script/ExecutePythonWithDataFrame.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/script/ExecutePythonWithDataFrame.scala index acf7fa87..01b32e84 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/script/ExecutePythonWithDataFrame.scala +++ 
b/piflow-bundle/src/main/scala/cn/piflow/bundle/script/ExecutePythonWithDataFrame.scala @@ -13,7 +13,7 @@ import org.apache.spark.sql.types.{StringType, StructField, StructType} import org.apache.spark.sql.{Row, SparkSession} import scala.collection.JavaConversions._ - +import cn.piflow.SciDataFrameImplicits.autoWrapDataFrame /** * Created by xjzhu@cnic.cn on 2/24/20 */ @@ -64,7 +64,7 @@ class ExecutePythonWithDataFrame extends ConfigurableStop{ val spark = pec.get[SparkSession]() - val df = in.read() + val df = in.read().getSparkDf val jep = new Jep() val scriptPath = "/tmp/pythonExcutor-"+ UUID.randomUUID() +".py" diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/script/ExecuteScala.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/script/ExecuteScala.scala index 59e62b93..9e8e026d 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/script/ExecuteScala.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/script/ExecuteScala.scala @@ -42,10 +42,10 @@ class ExecuteScala extends ConfigurableStop{ val script = new PropertyDescriptor() .name("script") .displayName("script") - .description("The code of scala. \nUse in.read() to get dataframe from upstream component. \nUse out.write() to write datafram to downstream component.") + .description("The code of scala. \nUse in.read().getSparkDf to get dataframe from upstream component. \nUse out.write() to write datafram to downstream component.") .defaultValue("") .required(true) - .example("val df = in.read() \nval df1 = df.select(\"author\").filter($\"author\".like(\"%xjzhu%\")) \ndf1.show() \ndf.createOrReplaceTempView(\"person\") \nval df2 = spark.sql(\"select * from person where author like '%xjzhu%'\") \ndf2.show() \nout.write(df2)") + .example("val df = in.read().getSparkDf \nval df1 = df.select(\"author\").filter($\"author\".like(\"%xjzhu%\")) \ndf1.show() \ndf.createOrReplaceTempView(\"person\") \nval df2 = spark.sql(\"select * from person where author like '%xjzhu%'\") \ndf2.show() \nout.write(df2)") .language(Language.Scala) descriptor = script :: descriptor diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/script/PythonExecutor.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/script/PythonExecutor.scala index 95c07fcc..7010d35e 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/script/PythonExecutor.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/script/PythonExecutor.scala @@ -5,7 +5,7 @@ import cn.piflow.{JobContext, JobInputStream, JobOutputStream, ProcessContext} import cn.piflow.conf.{ConfigurableStop, Language, Port, StopGroup} import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} -import cn.piflow.util.{FileUtil, PropertyUtil, PythonScriptUtil} +import cn.piflow.util.{FileUtil, PropertyUtil, PythonScriptUtil, SciDataFrame} import org.apache.spark.SparkFiles import org.apache.spark.deploy.PythonRunner import org.apache.spark.sql.SparkSession @@ -82,7 +82,7 @@ class PythonExecutor extends ConfigurableStop{ val inputPath = "/piflow/python/" + appID + "/inport/default/" var outputPath = "/piflow/python/" + appID + "/outport/default/" - val df = in.read() + val df = in.read().getSparkDf df.write.format("csv").mode("overwrite").option("set","\t").save(inputPath) PythonRunner.main(Array(pyFilePath, pyFiles, "-i " + inputPath, "-o " + outputPath)) @@ -93,7 +93,7 @@ class PythonExecutor extends ConfigurableStop{ .option("mode","FAILFAST") .load(outputPath) outDF.show() - out.write(outDF) + out.write(new SciDataFrame(outDF)) } } diff --git 
a/piflow-bundle/src/main/scala/cn/piflow/bundle/script/PythonRun.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/script/PythonRun.scala index 3532d099..8f97ad76 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/script/PythonRun.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/script/PythonRun.scala @@ -8,7 +8,7 @@ import cn.piflow.conf.util.{ImageUtil, MapUtil} import org.apache.spark.deploy.PythonRunner import org.apache.spark.sql.SparkSession - +import cn.piflow.SciDataFrameImplicits.autoWrapDataFrame class PythonRun extends ConfigurableStop{ override val authorEmail: String = "" override val description: String = "" @@ -60,7 +60,7 @@ class PythonRun extends ConfigurableStop{ val inputPath = "/piflow/python/" + ID + "/inport/default/" var outputPath = "/piflow/python/" + ID + "/outport/default/" - val dataFrame = in.read() + val dataFrame = in.read().getSparkDf dataFrame.write.format("csv").mode("overwrite").option("set","\t").save(inputPath) PythonRunner.main(Array(pyPath, pyFileshelp, "-i " + inputPath, "-o " + outputPath)) diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/solr/GetFromSolr.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/solr/GetFromSolr.scala index 99572ab0..676a7b4e 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/solr/GetFromSolr.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/solr/GetFromSolr.scala @@ -17,7 +17,7 @@ import org.apache.spark.sql.types.{StringType, StructField, StructType} import org.apache.spark.sql.{DataFrame, Row, SparkSession} import scala.collection.mutable.ListBuffer - +import cn.piflow.SciDataFrameImplicits.autoWrapDataFrame class GetFromSolr extends ConfigurableStop{ override val authorEmail: String ="yangqidong@cnic.cn" diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/solr/PutIntoSolr.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/solr/PutIntoSolr.scala index 31ecf084..9f9e215b 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/solr/PutIntoSolr.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/solr/PutIntoSolr.scala @@ -31,7 +31,7 @@ class PutIntoSolr extends ConfigurableStop{ override def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { - val df: DataFrame = in.read() + val df: DataFrame = in.read().getSparkDf val SchemaList: List[StructField] = df.schema.toList diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/streaming/SocketTextStream.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/streaming/SocketTextStream.scala index 82970e48..3fe9923e 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/streaming/SocketTextStream.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/streaming/SocketTextStream.scala @@ -8,7 +8,7 @@ import org.apache.spark.sql.SparkSession import org.apache.spark.storage.StorageLevel import org.apache.spark.streaming.dstream.{DStream, InputDStream, ReceiverInputDStream, SocketReceiver} import org.apache.spark.streaming.{Seconds, StreamingContext} - +import cn.piflow.SciDataFrameImplicits.autoWrapDataFrame class SocketTextStream extends ConfigurableStreamingStop { override val authorEmail: String = "xjzhu@cnic.cn" override val description: String = "Receive text data from socket" diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/tidb/TidbRead.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/tidb/TidbRead.scala index 5625d893..cddbd99d 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/tidb/TidbRead.scala +++ 
b/piflow-bundle/src/main/scala/cn/piflow/bundle/tidb/TidbRead.scala @@ -5,7 +5,8 @@ import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} import cn.piflow.conf.{ConfigurableStop, Language, Port, StopGroup} import org.apache.spark.sql.SparkSession - +import cn.piflow.SciDataFrameImplicits.autoWrapDataFrame +import cn.piflow.util.SciDataFrame class TidbRead extends ConfigurableStop{ override val authorEmail: String = "llei@cnic.com" @@ -32,7 +33,7 @@ class TidbRead extends ConfigurableStop{ .option("password",password) .load() - out.write(jdbcDF) + out.write(new SciDataFrame(jdbcDF)) } override def setProperties(map: Map[String, Any]): Unit = { diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/tidb/TidbWrite.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/tidb/TidbWrite.scala index c6a93c54..9bb55b39 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/tidb/TidbWrite.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/tidb/TidbWrite.scala @@ -30,7 +30,7 @@ class TidbWrite extends ConfigurableStop{ properties.put("password", password) properties.put("driver",driver) properties.put("isolationLevel","NONE") //if not set this value, throw expection - val df = in.read() + val df = in.read().getSparkDf df.write.mode(SaveMode.Append).jdbc(url,dbtable,properties) } diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/unstructured/DocxParser.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/unstructured/DocxParser.scala new file mode 100644 index 00000000..0e44a1d9 --- /dev/null +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/unstructured/DocxParser.scala @@ -0,0 +1,145 @@ +package cn.piflow.bundle.unstructured + +import cn.piflow.conf.bean.PropertyDescriptor +import cn.piflow.conf.util.{ImageUtil, MapUtil, ProcessUtil} +import cn.piflow.conf.{ConfigurableStop, Port} +import cn.piflow.util.{SciDataFrame, UnstructuredUtils} +import cn.piflow.{JobContext, JobInputStream, JobOutputStream, ProcessContext} +import com.alibaba.fastjson2.{JSON, JSONArray} +import org.apache.spark.sql.{DataFrame, SparkSession} + +import scala.collection.mutable.ArrayBuffer +import cn.piflow.SciDataFrameImplicits.autoWrapDataFrame +class DocxParser extends ConfigurableStop { + val authorEmail: String = "tianyao@cnic.cn" + val description: String = "parse docx to structured data." 
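// [Editor's note] Illustrative sketch only; not part of the original patch. perform() below
// stages HDFS files locally (when fileSource is "hdfs") and shells out to an Unstructured API
// endpoint resolved via UnstructuredUtils. With placeholder host/port and file path, the command
// it assembles is equivalent to:
//
//   curl -X POST <unstructured-host>:<unstructured-port>/general/v0/general \
//     -H 'accept: application/json' \
//     -H 'Content-Type: multipart/form-data' \
//     -F 'files=@/<local-temp-dir>/example.docx'
//
// The JSON returned for each document element is then loaded with spark.read.json and written to
// the default output port (responses for multiple files are unioned into a single DataFrame).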
+ val inportList: List[String] = List(Port.DefaultPort) + val outportList: List[String] = List(Port.DefaultPort) + + var filePath: String = _ + var fileSource: String = _ + + + override def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { + val spark = pec.get[SparkSession]() + + val unstructuredHost: String = UnstructuredUtils.unstructuredHost() + val unstructuredPort: String = UnstructuredUtils.unstructuredPort() + if (unstructuredHost == null || unstructuredHost.isEmpty) { + println("########## Exception: can not parse, unstructured host is null!!!") + throw new Exception("########## Exception: can not parse, unstructured host is null!!!") + } else if ("127.0.0.1".equals(unstructuredHost) || "localhost".equals(unstructuredHost)) { + println("########## Exception: can not parse, the unstructured host cannot be set to localhost!!!") + throw new Exception("########## Exception: can not parse, the unstructured host cannot be set to localhost!!!") + } + var localDir = "" + if ("hdfs".equals(fileSource)) { + //Download the file to the location, + localDir = UnstructuredUtils.downloadFilesFromHdfs(filePath) + } + + //Create a mutable ArrayBuffer to store the parameters of the curl command + println("curl start==========================================================================") + val curlCommandParams = new ArrayBuffer[String]() + curlCommandParams += "curl" + curlCommandParams += "-X" + curlCommandParams += "POST" + curlCommandParams += s"$unstructuredHost:$unstructuredPort/general/v0/general" + curlCommandParams += "-H" + curlCommandParams += "accept: application/json" + curlCommandParams += "-H" + curlCommandParams += "Content-Type: multipart/form-data" + var fileListSize = 0; + if ("hdfs".equals(fileSource)) { + val fileList = UnstructuredUtils.getLocalFilePaths(localDir) + fileListSize = fileList.size + fileList.foreach { path => + curlCommandParams += "-F" + curlCommandParams += s"files=@$path" + } + } + if ("nfs".equals(fileSource)) { + val fileList = UnstructuredUtils.getLocalFilePaths(filePath) + fileListSize = fileList.size + fileList.foreach { path => + curlCommandParams += "-F" + curlCommandParams += s"files=@$path" + } + } + val (output, error): (String, String) = ProcessUtil.executeCommand(curlCommandParams.toSeq) + if (output.nonEmpty) { + // println(output) + import spark.implicits._ + if (fileListSize > 1) { + val array: JSONArray = JSON.parseArray(output) + var combinedDF: DataFrame = null + array.forEach { + o => + val jsonString = o.toString + val df = spark.read.json(Seq(jsonString).toDS) + if (combinedDF == null) { + combinedDF = df + } else { + combinedDF = combinedDF.union(df) + } + } + combinedDF.show(10) + out.write(combinedDF) + } else { + val df = spark.read.json(Seq(output).toDS()) + df.show(10) + out.write(new SciDataFrame(df)) + } + } else { + println(s"########## Exception: $error") + throw new Exception(s"########## Exception: $error") + } + //delete local temp file + if ("hdfs".equals(fileSource)) { + UnstructuredUtils.deleteTempFiles(localDir) + } + } + + override def setProperties(map: Map[String, Any]): Unit = { + filePath = MapUtil.get(map, "filePath").asInstanceOf[String] + fileSource = MapUtil.get(map, "fileSource").asInstanceOf[String] + } + + override def getPropertyDescriptor(): List[PropertyDescriptor] = { + var descriptor: List[PropertyDescriptor] = List() + val filePath = new PropertyDescriptor() + .name("filePath") + .displayName("FilePath") + .description("The path of the file(.docx)") + 
.defaultValue("/test/test.docx") + .required(true) + .example("/test/test.docx") + descriptor = descriptor :+ filePath + + val fileSource = new PropertyDescriptor() + .name("fileSource") + .displayName("FileSource") + .description("The source of the file ") + .defaultValue("hdfs") + .allowableValues(Set("hdfs", "nfs")) + .required(true) + .example("hdfs") + descriptor = descriptor :+ fileSource + + descriptor + } + + override def getIcon(): Array[Byte] = { + ImageUtil.getImage("icon/unstructured/DocxParser.png") + } + + override def getGroup(): List[String] = { + List("unstructured") + } + + + override def initialize(ctx: ProcessContext): Unit = { + + } + +} diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/unstructured/HtmlParser.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/unstructured/HtmlParser.scala new file mode 100644 index 00000000..dcfbc783 --- /dev/null +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/unstructured/HtmlParser.scala @@ -0,0 +1,145 @@ +package cn.piflow.bundle.unstructured + +import cn.piflow.conf.bean.PropertyDescriptor +import cn.piflow.conf.util.{ImageUtil, MapUtil, ProcessUtil} +import cn.piflow.conf.{ConfigurableStop, Port} +import cn.piflow.util.{SciDataFrame, UnstructuredUtils} +import cn.piflow.{JobContext, JobInputStream, JobOutputStream, ProcessContext} +import com.alibaba.fastjson2.{JSON, JSONArray} +import org.apache.spark.sql.{DataFrame, SparkSession} + +import scala.collection.mutable.ArrayBuffer +import cn.piflow.SciDataFrameImplicits.autoWrapDataFrame +class HtmlParser extends ConfigurableStop { + val authorEmail: String = "tianyao@cnic.cn" + val description: String = "parse html to structured data." + val inportList: List[String] = List(Port.DefaultPort) + val outportList: List[String] = List(Port.DefaultPort) + + var filePath: String = _ + var fileSource: String = _ + + + override def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { + val spark = pec.get[SparkSession]() + + val unstructuredHost: String = UnstructuredUtils.unstructuredHost() + val unstructuredPort: String = UnstructuredUtils.unstructuredPort() + if (unstructuredHost == null || unstructuredHost.isEmpty) { + println("########## Exception: can not parse, unstructured host is null!!!") + throw new Exception("########## Exception: can not parse, unstructured host is null!!!") + } else if ("127.0.0.1".equals(unstructuredHost) || "localhost".equals(unstructuredHost)) { + println("########## Exception: can not parse, the unstructured host cannot be set to localhost!!!") + throw new Exception("########## Exception: can not parse, the unstructured host cannot be set to localhost!!!") + } + var localDir = "" + if ("hdfs".equals(fileSource)) { + //Download the file to the location, + localDir = UnstructuredUtils.downloadFilesFromHdfs(filePath) + } + + //Create a mutable ArrayBuffer to store the parameters of the curl command + println("curl start==========================================================================") + val curlCommandParams = new ArrayBuffer[String]() + curlCommandParams += "curl" + curlCommandParams += "-X" + curlCommandParams += "POST" + curlCommandParams += s"$unstructuredHost:$unstructuredPort/general/v0/general" + curlCommandParams += "-H" + curlCommandParams += "accept: application/json" + curlCommandParams += "-H" + curlCommandParams += "Content-Type: multipart/form-data" + var fileListSize = 0; + if ("hdfs".equals(fileSource)) { + val fileList = UnstructuredUtils.getLocalFilePaths(localDir) + fileListSize = 
fileList.size + fileList.foreach { path => + curlCommandParams += "-F" + curlCommandParams += s"files=@$path" + } + } + if ("nfs".equals(fileSource)) { + val fileList = UnstructuredUtils.getLocalFilePaths(filePath) + fileListSize = fileList.size + fileList.foreach { path => + curlCommandParams += "-F" + curlCommandParams += s"files=@$path" + } + } + val (output, error): (String, String) = ProcessUtil.executeCommand(curlCommandParams.toSeq) + if (output.nonEmpty) { + // println(output) + import spark.implicits._ + if (fileListSize > 1) { + val array: JSONArray = JSON.parseArray(output) + var combinedDF: DataFrame = null + array.forEach { + o => + val jsonString = o.toString + val df = spark.read.json(Seq(jsonString).toDS) + if (combinedDF == null) { + combinedDF = df + } else { + combinedDF = combinedDF.union(df) + } + } + combinedDF.show(10) + out.write(combinedDF) + } else { + val df = spark.read.json(Seq(output).toDS()) + df.show(10) + out.write(new SciDataFrame(df)) + } + } else { + println(s"########## Exception: $error") + throw new Exception(s"########## Exception: $error") + } + //delete local temp file + if ("hdfs".equals(fileSource)) { + UnstructuredUtils.deleteTempFiles(localDir) + } + } + + override def setProperties(map: Map[String, Any]): Unit = { + filePath = MapUtil.get(map, "filePath").asInstanceOf[String] + fileSource = MapUtil.get(map, "fileSource").asInstanceOf[String] + } + + override def getPropertyDescriptor(): List[PropertyDescriptor] = { + var descriptor: List[PropertyDescriptor] = List() + val filePath = new PropertyDescriptor() + .name("filePath") + .displayName("FilePath") + .description("The path of the file(.html/.htm)") + .defaultValue("/test/test.html") + .required(true) + .example("/test/test.html") + descriptor = descriptor :+ filePath + + val fileSource = new PropertyDescriptor() + .name("fileSource") + .displayName("FileSource") + .description("The source of the file ") + .defaultValue("hdfs") + .allowableValues(Set("hdfs", "nfs")) + .required(true) + .example("hdfs") + descriptor = descriptor :+ fileSource + + descriptor + } + + override def getIcon(): Array[Byte] = { + ImageUtil.getImage("icon/unstructured/HtmlParser.png") + } + + override def getGroup(): List[String] = { + List("unstructured") + } + + + override def initialize(ctx: ProcessContext): Unit = { + + } + +} diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/unstructured/ImageParser.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/unstructured/ImageParser.scala new file mode 100644 index 00000000..7560e503 --- /dev/null +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/unstructured/ImageParser.scala @@ -0,0 +1,162 @@ +package cn.piflow.bundle.unstructured + +import cn.piflow.conf.bean.PropertyDescriptor +import cn.piflow.conf.util.{ImageUtil, MapUtil, ProcessUtil} +import cn.piflow.conf.{ConfigurableStop, Port} +import cn.piflow.util.{SciDataFrame, UnstructuredUtils} +import cn.piflow.{JobContext, JobInputStream, JobOutputStream, ProcessContext} +import com.alibaba.fastjson2.{JSON, JSONArray} +import org.apache.spark.sql.{DataFrame, SparkSession} + +import scala.collection.mutable.ArrayBuffer +import cn.piflow.SciDataFrameImplicits.autoWrapDataFrame +class ImageParser extends ConfigurableStop { + val authorEmail: String = "tianyao@cnic.cn" + val description: String = "parse image to structured data." 
+ val inportList: List[String] = List(Port.DefaultPort) + val outportList: List[String] = List(Port.DefaultPort) + + var filePath: String = _ + var fileSource: String = _ + var strategy: String = _ + + override def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { + val spark = pec.get[SparkSession]() + + val unstructuredHost: String = UnstructuredUtils.unstructuredHost() + val unstructuredPort: String = UnstructuredUtils.unstructuredPort() + if (unstructuredHost == null || unstructuredHost.isEmpty) { + println("########## Exception: can not parse, unstructured host is null!!!") + throw new Exception("########## Exception: can not parse, unstructured host is null!!!") + } else if ("127.0.0.1".equals(unstructuredHost) || "localhost".equals(unstructuredHost)) { + println("########## Exception: can not parse, the unstructured host cannot be set to localhost!!!") + throw new Exception("########## Exception: can not parse, the unstructured host cannot be set to localhost!!!") + } + var localDir = "" + if ("hdfs".equals(fileSource)) { + //Download the file to the location, + localDir = UnstructuredUtils.downloadFilesFromHdfs(filePath) + } + + //Create a mutable ArrayBuffer to store the parameters of the curl command + println("curl start==========================================================================") + val curlCommandParams = new ArrayBuffer[String]() + curlCommandParams += "curl" + curlCommandParams += "-X" + curlCommandParams += "POST" + curlCommandParams += s"$unstructuredHost:$unstructuredPort/general/v0/general" + curlCommandParams += "-H" + curlCommandParams += "accept: application/json" + curlCommandParams += "-H" + curlCommandParams += "Content-Type: multipart/form-data" + curlCommandParams += "-F" + curlCommandParams += "pdf_infer_table_structure=false" + curlCommandParams += "-F" + curlCommandParams += s"strategy=$strategy" + curlCommandParams += "-F" + curlCommandParams += "hi_res_model_name=detectron2_lp" + var fileListSize = 0; + if ("hdfs".equals(fileSource)) { + val fileList = UnstructuredUtils.getLocalFilePaths(localDir) + fileListSize = fileList.size + fileList.foreach { path => + curlCommandParams += "-F" + curlCommandParams += s"files=@$path" + } + } + if ("nfs".equals(fileSource)) { + val fileList = UnstructuredUtils.getLocalFilePaths(filePath) + fileListSize = fileList.size + fileList.foreach { path => + curlCommandParams += "-F" + curlCommandParams += s"files=@$path" + } + } + val (output, error): (String, String) = ProcessUtil.executeCommand(curlCommandParams.toSeq) + if (output.nonEmpty) { + // println(output) + import spark.implicits._ + if (fileListSize > 1) { + val array: JSONArray = JSON.parseArray(output) + var combinedDF: DataFrame = null + array.forEach { + o => + val jsonString = o.toString + val df = spark.read.json(Seq(jsonString).toDS) + if (combinedDF == null) { + combinedDF = df + } else { + combinedDF = combinedDF.union(df) + } + } + combinedDF.show(10) + out.write(combinedDF) + } else { + val df = spark.read.json(Seq(output).toDS()) + df.show(10) + out.write(new SciDataFrame(df)) + } + } else { + println(s"########## Exception: $error") + throw new Exception(s"########## Exception: $error") + } + //delete local temp file + if ("hdfs".equals(fileSource)) { + UnstructuredUtils.deleteTempFiles(localDir) + } + } + + override def setProperties(map: Map[String, Any]): Unit = { + filePath = MapUtil.get(map, "filePath").asInstanceOf[String] + fileSource = MapUtil.get(map, "fileSource").asInstanceOf[String] + strategy = 
MapUtil.get(map, "strategy").asInstanceOf[String] + } + + override def getPropertyDescriptor(): List[PropertyDescriptor] = { + var descriptor: List[PropertyDescriptor] = List() + val filePath = new PropertyDescriptor() + .name("filePath") + .displayName("FilePath") + .description("The path of the file(.png/.jpg/.jpeg/.tiff/.bmp/.heic)") + .defaultValue("/test/test.png") + .required(true) + .example("/test/test.png") + descriptor = descriptor :+ filePath + + val fileSource = new PropertyDescriptor() + .name("fileSource") + .displayName("FileSource") + .description("The source of the file ") + .defaultValue("hdfs") + .allowableValues(Set("hdfs", "nfs")) + .required(true) + .example("hdfs") + descriptor = descriptor :+ fileSource + + val strategy = new PropertyDescriptor() + .name("strategy") + .displayName("strategy") + .description("The method the method that will be used to process the file ") + .defaultValue("ocr_only") + .allowableValues(Set("ocr_only")) + .required(true) + .example("ocr_only") + descriptor = descriptor :+ strategy + + descriptor + } + + override def getIcon(): Array[Byte] = { + ImageUtil.getImage("icon/unstructured/ImageParser.png") + } + + override def getGroup(): List[String] = { + List("unstructured") + } + + + override def initialize(ctx: ProcessContext): Unit = { + + } + +} diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/unstructured/PdfParser.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/unstructured/PdfParser.scala new file mode 100644 index 00000000..9edbdfdc --- /dev/null +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/unstructured/PdfParser.scala @@ -0,0 +1,165 @@ +package cn.piflow.bundle.unstructured + +import cn.piflow.conf.bean.PropertyDescriptor +import cn.piflow.conf.util.{ImageUtil, MapUtil, ProcessUtil} +import cn.piflow.conf.{ConfigurableStop, Port} +import cn.piflow.util.{SciDataFrame, UnstructuredUtils} +import cn.piflow.{JobContext, JobInputStream, JobOutputStream, ProcessContext} +import com.alibaba.fastjson2.{JSON, JSONArray} +import org.apache.spark.sql.{DataFrame, SparkSession} + +import scala.collection.mutable.ArrayBuffer +import cn.piflow.SciDataFrameImplicits.autoWrapDataFrame +class PdfParser extends ConfigurableStop { + val authorEmail: String = "tianyao@cnic.cn" + val description: String = "parse pdf to structured data." 
+ val inportList: List[String] = List(Port.DefaultPort) + val outportList: List[String] = List(Port.DefaultPort) + + var filePath: String = _ + var fileSource: String = _ + var strategy: String = _ + + + override def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { + val spark = pec.get[SparkSession]() + + val unstructuredHost: String = UnstructuredUtils.unstructuredHost() + val unstructuredPort: String = UnstructuredUtils.unstructuredPort() + if (unstructuredHost == null || unstructuredHost.isEmpty) { + println("########## Exception: can not parse, unstructured host is null!!!") + throw new Exception("########## Exception: can not parse, unstructured host is null!!!") + } else if ("127.0.0.1".equals(unstructuredHost) || "localhost".equals(unstructuredHost)) { + println("########## Exception: can not parse, the unstructured host cannot be set to localhost!!!") + throw new Exception("########## Exception: can not parse, the unstructured host cannot be set to localhost!!!") + } + var localDir = "" + if ("hdfs".equals(fileSource)) { + //Download the file to the location, + localDir = UnstructuredUtils.downloadFilesFromHdfs(filePath) + } + + //Create a mutable ArrayBuffer to store the parameters of the curl command + println("curl start==========================================================================") + val curlCommandParams = new ArrayBuffer[String]() + curlCommandParams += "curl" + curlCommandParams += "-X" + curlCommandParams += "POST" + curlCommandParams += s"$unstructuredHost:$unstructuredPort/general/v0/general" + curlCommandParams += "-H" + curlCommandParams += "accept: application/json" + curlCommandParams += "-H" + curlCommandParams += "Content-Type: multipart/form-data" + curlCommandParams += "-F" + curlCommandParams += "pdf_infer_table_structure=false" + curlCommandParams += "-F" + curlCommandParams += s"strategy=$strategy" + curlCommandParams += "-F" + curlCommandParams += "hi_res_model_name=detectron2_lp" + var fileListSize = 0; + if ("hdfs".equals(fileSource)) { + val fileList = UnstructuredUtils.getLocalFilePaths(localDir) + fileListSize = fileList.size + fileList.foreach { path => + println(s"local path:$path") + curlCommandParams += "-F" + curlCommandParams += s"files=@$path" + } + } + if ("nfs".equals(fileSource)) { + val fileList = UnstructuredUtils.getLocalFilePaths(filePath) + fileListSize = fileList.size + fileList.foreach { path => + println(s"local path:$path") + curlCommandParams += "-F" + curlCommandParams += s"files=@$path" + } + } + val (output, error): (String, String) = ProcessUtil.executeCommand(curlCommandParams.toSeq) + if (output.nonEmpty) { + // println(output) + import spark.implicits._ + if (fileListSize > 1) { + val array: JSONArray = JSON.parseArray(output) + var combinedDF: DataFrame = null + array.forEach { + o => + val jsonString = o.toString + val df = spark.read.json(Seq(jsonString).toDS) + if (combinedDF == null) { + combinedDF = df + } else { + combinedDF = combinedDF.union(df) + } + } + combinedDF.show(10) + out.write(combinedDF) + } else { + val df = spark.read.json(Seq(output).toDS()) + df.show(10) + out.write(new SciDataFrame(df)) + } + } else { + println(s"########## Exception: $error") + throw new Exception(s"########## Exception: $error") + } + //delete local temp file + if ("hdfs".equals(fileSource)) { + UnstructuredUtils.deleteTempFiles(localDir) + } + } + + override def setProperties(map: Map[String, Any]): Unit = { + filePath = MapUtil.get(map, "filePath").asInstanceOf[String] + fileSource = 
MapUtil.get(map, "fileSource").asInstanceOf[String] + strategy = MapUtil.get(map, "strategy").asInstanceOf[String] + } + + override def getPropertyDescriptor(): List[PropertyDescriptor] = { + var descriptor: List[PropertyDescriptor] = List() + val filePath = new PropertyDescriptor() + .name("filePath") + .displayName("FilePath") + .description("The path of the file(.pdf)") + .defaultValue("") + .required(true) + .example("/test/test.pdf") + descriptor = descriptor :+ filePath + + val fileSource = new PropertyDescriptor() + .name("fileSource") + .displayName("FileSource") + .description("The source of the file ") + .defaultValue("true") + .allowableValues(Set("hdfs", "nfs")) + .required(true) + .example("hdfs") + descriptor = descriptor :+ fileSource + + val strategy = new PropertyDescriptor() + .name("strategy") + .displayName("strategy") + .description("The method the method that will be used to process the file ") + .defaultValue("true") + .allowableValues(Set("auto", "hi_res", "ocr_only", "fast")) + .required(true) + .example("auto") + descriptor = descriptor :+ strategy + + descriptor + } + + override def getIcon(): Array[Byte] = { + ImageUtil.getImage("icon/unstructured/PdfParser.png") + } + + override def getGroup(): List[String] = { + List("unstructured") + } + + + override def initialize(ctx: ProcessContext): Unit = { + + } + +} diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/unstructured/PptxParser.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/unstructured/PptxParser.scala new file mode 100644 index 00000000..1e6dc772 --- /dev/null +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/unstructured/PptxParser.scala @@ -0,0 +1,146 @@ +package cn.piflow.bundle.unstructured + +import cn.piflow.conf.bean.PropertyDescriptor +import cn.piflow.conf.util.{ImageUtil, MapUtil, ProcessUtil} +import cn.piflow.conf.{ConfigurableStop, Port} +import cn.piflow.util.{SciDataFrame, UnstructuredUtils} +import cn.piflow.{JobContext, JobInputStream, JobOutputStream, ProcessContext} +import com.alibaba.fastjson2.{JSON, JSONArray} +import org.apache.spark.sql.{DataFrame, SparkSession} + +import scala.collection.mutable.ArrayBuffer +import cn.piflow.SciDataFrameImplicits.autoWrapDataFrame + +class PptxParser extends ConfigurableStop { + val authorEmail: String = "tianyao@cnic.cn" + val description: String = "parse pptx to structured data." 
+ val inportList: List[String] = List(Port.DefaultPort) + val outportList: List[String] = List(Port.DefaultPort) + + var filePath: String = _ + var fileSource: String = _ + + + override def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { + val spark = pec.get[SparkSession]() + + val unstructuredHost: String = UnstructuredUtils.unstructuredHost() + val unstructuredPort: String = UnstructuredUtils.unstructuredPort() + if (unstructuredHost == null || unstructuredHost.isEmpty) { + println("########## Exception: can not parse, unstructured host is null!!!") + throw new Exception("########## Exception: can not parse, unstructured host is null!!!") + } else if ("127.0.0.1".equals(unstructuredHost) || "localhost".equals(unstructuredHost)) { + println("########## Exception: can not parse, the unstructured host cannot be set to localhost!!!") + throw new Exception("########## Exception: can not parse, the unstructured host cannot be set to localhost!!!") + } + var localDir = "" + if ("hdfs".equals(fileSource)) { + //Download the file to the location, + localDir = UnstructuredUtils.downloadFilesFromHdfs(filePath) + } + + //Create a mutable ArrayBuffer to store the parameters of the curl command + println("curl start==========================================================================") + val curlCommandParams = new ArrayBuffer[String]() + curlCommandParams += "curl" + curlCommandParams += "-X" + curlCommandParams += "POST" + curlCommandParams += s"$unstructuredHost:$unstructuredPort/general/v0/general" + curlCommandParams += "-H" + curlCommandParams += "accept: application/json" + curlCommandParams += "-H" + curlCommandParams += "Content-Type: multipart/form-data" + var fileListSize = 0; + if ("hdfs".equals(fileSource)) { + val fileList = UnstructuredUtils.getLocalFilePaths(localDir) + fileListSize = fileList.size + fileList.foreach { path => + curlCommandParams += "-F" + curlCommandParams += s"files=@$path" + } + } + if ("nfs".equals(fileSource)) { + val fileList = UnstructuredUtils.getLocalFilePaths(filePath) + fileListSize = fileList.size + fileList.foreach { path => + curlCommandParams += "-F" + curlCommandParams += s"files=@$path" + } + } + val (output, error): (String, String) = ProcessUtil.executeCommand(curlCommandParams.toSeq) + if (output.nonEmpty) { + // println(output) + import spark.implicits._ + if (fileListSize > 1) { + val array: JSONArray = JSON.parseArray(output) + var combinedDF: DataFrame = null + array.forEach { + o => + val jsonString = o.toString + val df = spark.read.json(Seq(jsonString).toDS) + if (combinedDF == null) { + combinedDF = df + } else { + combinedDF = combinedDF.union(df) + } + } + combinedDF.show(10) + out.write(combinedDF) + } else { + val df = spark.read.json(Seq(output).toDS()) + df.show(10) + out.write(new SciDataFrame(df)) + } + } else { + println(s"########## Exception: $error") + throw new Exception(s"########## Exception: $error") + } + //delete local temp file + if ("hdfs".equals(fileSource)) { + UnstructuredUtils.deleteTempFiles(localDir) + } + } + + override def setProperties(map: Map[String, Any]): Unit = { + filePath = MapUtil.get(map, "filePath").asInstanceOf[String] + fileSource = MapUtil.get(map, "fileSource").asInstanceOf[String] + } + + override def getPropertyDescriptor(): List[PropertyDescriptor] = { + var descriptor: List[PropertyDescriptor] = List() + val filePath = new PropertyDescriptor() + .name("filePath") + .displayName("FilePath") + .description("The path of the file(.pptx)") + .defaultValue("") + 
.required(true) + .example("/test/test.pptx") + descriptor = descriptor :+ filePath + + val fileSource = new PropertyDescriptor() + .name("fileSource") + .displayName("FileSource") + .description("The source of the file ") + .defaultValue("true") + .allowableValues(Set("hdfs", "nfs")) + .required(true) + .example("hdfs") + descriptor = descriptor :+ fileSource + + descriptor + } + + override def getIcon(): Array[Byte] = { + ImageUtil.getImage("icon/unstructured/PptxParser.png") + } + + override def getGroup(): List[String] = { + List("unstructured") + } + + + override def initialize(ctx: ProcessContext): Unit = { + + } + +} diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/visualization/CustomView.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/visualization/CustomView.scala index 3d9681d0..d7663b13 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/visualization/CustomView.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/visualization/CustomView.scala @@ -39,7 +39,7 @@ class CustomView extends ConfigurableVisualizationStop { val hdfs = PropertyUtil.getVisualDataDirectoryPath() val appID = spark.sparkContext.applicationId - val df = in.read() + val df = in.read().getSparkDf val filePath= hdfs + appID + "/" + pec.getStopJob().getStopName() df.repartition(1).write .format("csv") diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/visualization/Histogram.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/visualization/Histogram.scala index a4753b56..a9e40bbe 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/visualization/Histogram.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/visualization/Histogram.scala @@ -5,7 +5,7 @@ import cn.piflow.conf.util.{ImageUtil, MapUtil} import cn.piflow.conf.{ConfigurableVisualizationStop, Port, StopGroup, VisualizationType} import cn.piflow.{JobContext, JobInputStream, JobOutputStream, ProcessContext} import org.apache.spark.sql.SparkSession - +import cn.piflow.SciDataFrameImplicits.autoWrapDataFrame class Histogram extends ConfigurableVisualizationStop{ override val authorEmail: String = "xjzhu@cnic.cn" override val description: String = "Show data with histogram. " + @@ -59,7 +59,7 @@ class Histogram extends ConfigurableVisualizationStop{ override def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { val spark = pec.get[SparkSession]() val sqlContext=spark.sqlContext - val dataFrame = in.read() + val dataFrame = in.read().getSparkDf dataFrame.createOrReplaceTempView("Histoqram") if(this.customizedProperties != null || this.customizedProperties.size != 0){ diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/visualization/LineChart.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/visualization/LineChart.scala index 12a5fccf..b3e6ce2b 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/visualization/LineChart.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/visualization/LineChart.scala @@ -5,7 +5,7 @@ import cn.piflow.conf.{ConfigurableVisualizationStop, Port, StopGroup, Visualiza import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} import org.apache.spark.sql.SparkSession - +import cn.piflow.SciDataFrameImplicits.autoWrapDataFrame class LineChart extends ConfigurableVisualizationStop{ override val authorEmail: String = "xjzhu@cnic.cn" override val description: String = "Show data with scatter plot. 
" + @@ -53,7 +53,7 @@ class LineChart extends ConfigurableVisualizationStop{ override def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { val spark = pec.get[SparkSession]() val sqlContext=spark.sqlContext - val dataFrame = in.read() + val dataFrame = in.read().getSparkDf dataFrame.createOrReplaceTempView("LineChart") if(this.customizedProperties != null || this.customizedProperties.size != 0){ diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/visualization/PieChart.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/visualization/PieChart.scala index 8c6ab4ad..dc11f192 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/visualization/PieChart.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/visualization/PieChart.scala @@ -5,7 +5,7 @@ import cn.piflow.conf.util.{ImageUtil, MapUtil} import cn.piflow.conf.{ConfigurableVisualizationStop, Port, StopGroup, VisualizationType} import cn.piflow.{JobContext, JobInputStream, JobOutputStream, ProcessContext} import org.apache.spark.sql.SparkSession - +import cn.piflow.SciDataFrameImplicits.autoWrapDataFrame class PieChart extends ConfigurableVisualizationStop { override val authorEmail: String = "xjzhu@cnic.cn" override val description: String = "Show data with pie chart. " @@ -74,7 +74,7 @@ class PieChart extends ConfigurableVisualizationStop { override def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { val spark = pec.get[SparkSession]() val sqlContext = spark.sqlContext - val dataFrame = in.read() + val dataFrame = in.read().getSparkDf dataFrame.createOrReplaceTempView("PieChart") val sqlText = "select " + dimension + "," +indicatorOption+ "(" + indicator + ") from PieChart group by " + dimension; println("PieChart Sql: " + sqlText) diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/visualization/ScatterPlotChart.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/visualization/ScatterPlotChart.scala index b02dcc43..452e3dfe 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/visualization/ScatterPlotChart.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/visualization/ScatterPlotChart.scala @@ -5,7 +5,7 @@ import cn.piflow.conf.{ConfigurableVisualizationStop, Port, StopGroup, Visualiza import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} import org.apache.spark.sql.SparkSession - +import cn.piflow.SciDataFrameImplicits.autoWrapDataFrame class ScatterPlotChart extends ConfigurableVisualizationStop{ override val authorEmail: String = "xjzhu@cnic.cn" override val description: String = "Show data with scatter plot chart." 
+ @@ -51,7 +51,7 @@ class ScatterPlotChart extends ConfigurableVisualizationStop{ override def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { val spark = pec.get[SparkSession]() val sqlContext=spark.sqlContext - val dataFrame = in.read() + val dataFrame = in.read().getSparkDf dataFrame.createOrReplaceTempView("ScatterPlot") if(this.customizedProperties != null || this.customizedProperties.size != 0){ diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/visualization/TableShow.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/visualization/TableShow.scala index ffe3dbd3..5fc187ac 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/visualization/TableShow.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/visualization/TableShow.scala @@ -5,7 +5,7 @@ import cn.piflow.conf.{ConfigurableVisualizationStop, Port, StopGroup, Visualiza import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} import org.apache.spark.sql.SparkSession - +import cn.piflow.SciDataFrameImplicits.autoWrapDataFrame class TableShow extends ConfigurableVisualizationStop{ override var visualizationType: String = VisualizationType.Table override val authorEmail: String = "xjzhu@cnic.cn" @@ -46,7 +46,7 @@ class TableShow extends ConfigurableVisualizationStop{ override def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { val spark = pec.get[SparkSession]() val sqlContext=spark.sqlContext - val dataFrame = in.read() + val dataFrame = in.read().getSparkDf dataFrame.createOrReplaceTempView("TableShow") val sqlText = "select " + showField+ " from TableShow" println("TableShow Sql: " + sqlText) diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/xml/XmlParser.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/xml/XmlParser.scala index 4ebb36c6..0e403f5d 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/xml/XmlParser.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/xml/XmlParser.scala @@ -8,7 +8,7 @@ import org.apache.spark.sql.SparkSession import org.apache.spark.sql.types.StructType import scala.beans.BeanProperty - +import cn.piflow.SciDataFrameImplicits.autoWrapDataFrame class XmlParser extends ConfigurableStop { val authorEmail: String = "xjzhu@cnic.cn" diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/xml/XmlParserColumns.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/xml/XmlParserColumns.scala index 517b2ec4..2f70ea8e 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/xml/XmlParserColumns.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/xml/XmlParserColumns.scala @@ -5,9 +5,10 @@ import cn.piflow.bundle.util.XmlToJson import cn.piflow.conf._ import cn.piflow.conf.bean.PropertyDescriptor import cn.piflow.conf.util.{ImageUtil, MapUtil} +import cn.piflow.util.SciDataFrame import org.apache.spark.rdd.RDD import org.apache.spark.sql.{DataFrame, SparkSession} - +import cn.piflow.SciDataFrameImplicits.autoWrapDataFrame class XmlParserColumns extends ConfigurableStop { @@ -22,7 +23,7 @@ class XmlParserColumns extends ConfigurableStop { val spark = pec.get[SparkSession]() - val df = in.read() + val df = in.read().getSparkDf spark.sqlContext.udf.register("xmlToJson",(str:String)=>{ XmlToJson.xmlParse(str.replaceAll("\n","\t")) @@ -50,7 +51,7 @@ class XmlParserColumns extends ConfigurableStop { val outDF: DataFrame = spark.read.json(rdd) outDF.printSchema() - out.write(outDF) + out.write(new SciDataFrame(outDF)) } diff --git 
a/piflow-bundle/src/main/scala/cn/piflow/bundle/xml/XmlParserFolder.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/xml/XmlParserFolder.scala index fab166d9..390ffa5d 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/xml/XmlParserFolder.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/xml/XmlParserFolder.scala @@ -12,7 +12,7 @@ import org.apache.spark.sql.{DataFrame, SparkSession} import scala.collection.mutable.ArrayBuffer import scala.util.control.Breaks._ - +import cn.piflow.SciDataFrameImplicits.autoWrapDataFrame /** * Created by admin on 2018/8/27. */ diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/xml/XmlSave.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/xml/XmlSave.scala index ab6f6e25..7bacfa1a 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/xml/XmlSave.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/xml/XmlSave.scala @@ -18,7 +18,7 @@ class XmlSave extends ConfigurableStop{ var xmlSavePath:String = _ def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { - val xmlDF = in.read() + val xmlDF = in.read().getSparkDf xmlDF.write.format("xml").save(xmlSavePath) } diff --git a/piflow-bundle/src/main/scala/cn/piflow/bundle/xml/XmlStringParser.scala b/piflow-bundle/src/main/scala/cn/piflow/bundle/xml/XmlStringParser.scala index 4932b08a..8f8a54cd 100644 --- a/piflow-bundle/src/main/scala/cn/piflow/bundle/xml/XmlStringParser.scala +++ b/piflow-bundle/src/main/scala/cn/piflow/bundle/xml/XmlStringParser.scala @@ -13,7 +13,7 @@ import org.dom4j.{Document, DocumentHelper, Element} import scala.collection.JavaConverters._ import scala.collection.mutable.{ArrayBuffer, ListBuffer} - +import cn.piflow.SciDataFrameImplicits.autoWrapDataFrame class XmlStringParser extends ConfigurableStop { override val authorEmail: String = "yangqidong@cnic.cn" val inportList: List[String] = List(Port.DefaultPort) diff --git a/piflow-bundle/src/test/scala/cn/piflow/bundle/ceph/CephReadTest.scala b/piflow-bundle/src/test/scala/cn/piflow/bundle/ceph/CephReadTest.scala new file mode 100644 index 00000000..3a4be9e0 --- /dev/null +++ b/piflow-bundle/src/test/scala/cn/piflow/bundle/ceph/CephReadTest.scala @@ -0,0 +1,50 @@ +package cn.piflow.bundle.ceph + +import org.apache.spark.sql.{DataFrame, SparkSession} + +object CephReadTest { + + var cephAccessKey: String = _ + var cephSecretKey: String = _ + var cephEndpoint: String = _ + var types: String = _ + var path: String = _ + var header: Boolean = _ + var delimiter: String = _ + + def main(args: Array[String]): Unit = { + val spark = SparkSession.builder(). + master("local[*]"). + appName("CephReadTest"). 
+ getOrCreate() + + spark.conf.set("fs.s3a.access.key", cephAccessKey) + spark.conf.set("fs.s3a.secret.key", cephSecretKey) + spark.conf.set("fs.s3a.endpoint", cephEndpoint) + spark.conf.set("fs.s3a.connection.ssl.enabled", "false") + + var df:DataFrame = null + + if (types == "parquet") { + df = spark.read + .parquet(path) + } + + if (types == "csv") { + + df = spark.read + .option("header", header) + .option("inferSchema", "true") + .option("delimiter", delimiter) + .csv(path) + } + + if (types == "json") { + df = spark.read + .json(path) + } + df.show() + + } + +} diff --git a/piflow-bundle/src/test/scala/cn/piflow/bundle/ceph/CephWriteTest.scala b/piflow-bundle/src/test/scala/cn/piflow/bundle/ceph/CephWriteTest.scala new file mode 100644 index 00000000..3f4237ff --- /dev/null +++ b/piflow-bundle/src/test/scala/cn/piflow/bundle/ceph/CephWriteTest.scala @@ -0,0 +1,55 @@ +package com.dkl.s3.spark + +import org.apache.spark.sql.{DataFrame, SparkSession} + +object CephWriteTest { + var cephAccessKey: String = _ + var cephSecretKey: String = _ + var cephEndpoint: String = _ + var types: String = _ + var path: String = _ + var header: Boolean = _ + var delimiter: String = _ + + + def main(args: Array[String]): Unit = { + val spark = SparkSession.builder(). + master("local[*]"). + appName("SparkS3Demo"). + getOrCreate() + + spark.conf.set("fs.s3a.access.key", cephAccessKey) + spark.conf.set("fs.s3a.secret.key", cephSecretKey) + spark.conf.set("fs.s3a.endpoint", cephEndpoint) + spark.conf.set("fs.s3a.connection.ssl.enabled","false") + + + import spark.implicits._ + val df = Seq((1, "json", 10, 1000, "2022-09-27")).toDF("id", "name", "value", "ts", "dt") + + if (types == "parquet") { + df.write + .format("parquet") + .mode("overwrite") // only overwrite + .save(path) + } + + if (types == "csv") { + df.write + .format("csv") + .option("header", header) + .option("delimiter", delimiter) + .mode("overwrite") + .save(path) + } + + if (types == "json") { + df.write + .format("json") + .mode("overwrite") + .save(path) + } + + } + +} diff --git a/piflow-bundle/src/test/scala/cn/piflow/bundle/normalization/DiscretizationTest.scala b/piflow-bundle/src/test/scala/cn/piflow/bundle/normalization/DiscretizationTest.scala new file mode 100644 index 00000000..745b8d90 --- /dev/null +++ b/piflow-bundle/src/test/scala/cn/piflow/bundle/normalization/DiscretizationTest.scala @@ -0,0 +1,52 @@ +//package cn.piflow.bundle.normalization +// +//import cn.piflow.Runner +//import cn.piflow.conf.bean.FlowBean +//import cn.piflow.conf.util.{FileUtil, OptionUtil} +//import cn.piflow.util.PropertyUtil +//import org.apache.spark.sql.SparkSession +//import org.h2.tools.Server +//import org.junit.Test +// +//import scala.util.parsing.json.JSON +// +//class DiscretizationTest { +// +// @Test +// def DiscretizationFlow(): Unit = { +// +// //parse flow json +// val file = "src/main/resources/flow/normalization/Discretization.json" +// val flowJsonStr = FileUtil.fileReader(file) +// val map = OptionUtil.getAny(JSON.parseFull(flowJsonStr)).asInstanceOf[Map[String, Any]] +// println(map) +// +// //create flow +// val flowBean = FlowBean(map) +// val flow = flowBean.constructFlow() +// +// val h2Server = Server.createTcpServer("-tcp", "-tcpAllowOthers", "-tcpPort", "50001").start() +// +// //execute flow +// val spark = SparkSession.builder() +// .master("local[*]") +// .appName("DiscretizationTest") +// .config("spark.driver.memory", "1g") +// .config("spark.executor.memory", "2g") +// .config("spark.cores.max", "2") +// 
.config("hive.metastore.uris",PropertyUtil.getPropertyValue("hive.metastore.uris")) +// .enableHiveSupport() +// .getOrCreate() +// +// val process = Runner.create() +// .bind(classOf[SparkSession].getName, spark) +// .bind("checkpoint.path", "") +// .bind("debug.path","") +// .start(flow); +// +// process.awaitTermination(); +// val pid = process.pid(); +// println(pid + "!!!!!!!!!!!!!!!!!!!!!") +// spark.close(); +// } +//} \ No newline at end of file diff --git a/piflow-bundle/src/test/scala/cn/piflow/bundle/normalization/MaxMinNormalizationTest.scala b/piflow-bundle/src/test/scala/cn/piflow/bundle/normalization/MaxMinNormalizationTest.scala new file mode 100644 index 00000000..03d9a024 --- /dev/null +++ b/piflow-bundle/src/test/scala/cn/piflow/bundle/normalization/MaxMinNormalizationTest.scala @@ -0,0 +1,101 @@ +////package cn.piflow.bundle.normalization +//// +////import cn.piflow.Runner +////import cn.piflow.conf.bean.FlowBean +////import cn.piflow.conf.util.{FileUtil, OptionUtil} +////import org.apache.spark.sql.SparkSession +////import org.junit.Test +////import scala.util.parsing.json.JSON +//// +////class MaxMinNormalizationTest { +//// +//// @Test +//// def MaxMinNormalizationTest(): Unit = { +//// // Parse flow JSON +//// val file = "src/main/resources/flow/normalization/MaxMinNormalization.json" +//// val flowJsonStr = FileUtil.fileReader(file) +//// val map = OptionUtil.getAny(JSON.parseFull(flowJsonStr)).asInstanceOf[Map[String, Any]] +//// println(map) +//// +//// // Create SparkSession +//// val spark = SparkSession.builder() +//// .master("local[*]") +//// .appName("MaxMinNormalizationTest") +//// .config("spark.driver.memory", "1g") +//// .config("spark.executor.memory", "2g") +//// .config("spark.cores.max", "2") +//// .getOrCreate() +//// +//// // Create flow +//// val flowBean = FlowBean(map) +//// val flow = flowBean.constructFlow() +//// +//// // Execute flow +//// val process = Runner.create() +//// .bind(classOf[SparkSession].getName, spark) +//// .bind("checkpoint.path", "") +//// .bind("debug.path", "") +//// .start(flow) +//// +//// process.awaitTermination() +//// val pid = process.pid() +//// println(s"Flow execution completed. 
PID: $pid") +//// +//// // Close SparkSession +//// spark.close() +//// } +////} +// +// +//package cn.piflow.bundle.normalization +// +//import cn.piflow.Runner +//import cn.piflow.conf.bean.FlowBean +//import cn.piflow.conf.util.{FileUtil, OptionUtil} +//import cn.piflow.util.PropertyUtil +//import org.apache.spark.sql.SparkSession +//import org.h2.tools.Server +//import org.junit.Test +// +//import scala.util.parsing.json.JSON +// +//class MaxMinNormalizationTest { +// +// @Test +// def MaxMinNormalizationFlow(): Unit = { +// +// //parse flow json +// val file = "src/main/resources/flow/normalization/MaxMinNormalization.json" +// val flowJsonStr = FileUtil.fileReader(file) +// val map = OptionUtil.getAny(JSON.parseFull(flowJsonStr)).asInstanceOf[Map[String, Any]] +// println(map) +// +// //create flow +// val flowBean = FlowBean(map) +// val flow = flowBean.constructFlow() +// +// val h2Server = Server.createTcpServer("-tcp", "-tcpAllowOthers", "-tcpPort", "50001").start() +// +// //execute flow +// val spark = SparkSession.builder() +// .master("local[*]") +// .appName("MaxMinNormalizationTest") +// .config("spark.driver.memory", "1g") +// .config("spark.executor.memory", "2g") +// .config("spark.cores.max", "2") +// .config("hive.metastore.uris",PropertyUtil.getPropertyValue("hive.metastore.uris")) +// .enableHiveSupport() +// .getOrCreate() +// +// val process = Runner.create() +// .bind(classOf[SparkSession].getName, spark) +// .bind("checkpoint.path", "") +// .bind("debug.path","") +// .start(flow); +// +// process.awaitTermination(); +// val pid = process.pid(); +// println(pid + "!!!!!!!!!!!!!!!!!!!!!") +// spark.close(); +// } +//} diff --git a/piflow-bundle/src/test/scala/cn/piflow/bundle/normalization/ScopeNormalizationTest.scala b/piflow-bundle/src/test/scala/cn/piflow/bundle/normalization/ScopeNormalizationTest.scala new file mode 100644 index 00000000..965067b5 --- /dev/null +++ b/piflow-bundle/src/test/scala/cn/piflow/bundle/normalization/ScopeNormalizationTest.scala @@ -0,0 +1,52 @@ +//package cn.piflow.bundle.normalization +// +//import cn.piflow.Runner +//import cn.piflow.conf.bean.FlowBean +//import cn.piflow.conf.util.{FileUtil, OptionUtil} +//import cn.piflow.util.PropertyUtil +//import org.apache.spark.sql.SparkSession +//import org.h2.tools.Server +//import org.junit.Test +// +//import scala.util.parsing.json.JSON +// +//class ScopeNormalizationTest { +// +// @Test +// def ScopeNormalizationFlow(): Unit = { +// +// //parse flow json +// val file = "src/main/resources/flow/normalization/ScopeNormalization.json" +// val flowJsonStr = FileUtil.fileReader(file) +// val map = OptionUtil.getAny(JSON.parseFull(flowJsonStr)).asInstanceOf[Map[String, Any]] +// println(map) +// +// //create flow +// val flowBean = FlowBean(map) +// val flow = flowBean.constructFlow() +// +// val h2Server = Server.createTcpServer("-tcp", "-tcpAllowOthers", "-tcpPort", "50001").start() +// +// //execute flow +// val spark = SparkSession.builder() +// .master("local[*]") +// .appName("MaxMinNormalizationTest") +// .config("spark.driver.memory", "1g") +// .config("spark.executor.memory", "2g") +// .config("spark.cores.max", "2") +// .config("hive.metastore.uris",PropertyUtil.getPropertyValue("hive.metastore.uris")) +// .enableHiveSupport() +// .getOrCreate() +// +// val process = Runner.create() +// .bind(classOf[SparkSession].getName, spark) +// .bind("checkpoint.path", "") +// .bind("debug.path","") +// .start(flow); +// +// process.awaitTermination(); +// val pid = process.pid(); +// 
println(pid + "!!!!!!!!!!!!!!!!!!!!!") +// spark.close(); +// } +//} diff --git a/piflow-bundle/src/test/scala/cn/piflow/bundle/normalization/ZScoreTest.scala b/piflow-bundle/src/test/scala/cn/piflow/bundle/normalization/ZScoreTest.scala new file mode 100644 index 00000000..12587327 --- /dev/null +++ b/piflow-bundle/src/test/scala/cn/piflow/bundle/normalization/ZScoreTest.scala @@ -0,0 +1,52 @@ +//package cn.piflow.bundle.normalization +// +//import cn.piflow.Runner +//import cn.piflow.conf.bean.FlowBean +//import cn.piflow.conf.util.{FileUtil, OptionUtil} +//import cn.piflow.util.PropertyUtil +//import org.apache.spark.sql.SparkSession +//import org.h2.tools.Server +//import org.junit.Test +// +//import scala.util.parsing.json.JSON +// +//class ZScoreTest { +// +// @Test +// def ZScoreFlow(): Unit = { +// +// //parse flow json +// val file = "src/main/resources/flow/normalization/ZScore.json" +// val flowJsonStr = FileUtil.fileReader(file) +// val map = OptionUtil.getAny(JSON.parseFull(flowJsonStr)).asInstanceOf[Map[String, Any]] +// println(map) +// +// //create flow +// val flowBean = FlowBean(map) +// val flow = flowBean.constructFlow() +// +// val h2Server = Server.createTcpServer("-tcp", "-tcpAllowOthers", "-tcpPort", "50001").start() +// +// //execute flow +// val spark = SparkSession.builder() +// .master("local[*]") +// .appName("ZScoreTest") +// .config("spark.driver.memory", "1g") +// .config("spark.executor.memory", "2g") +// .config("spark.cores.max", "2") +// .config("hive.metastore.uris",PropertyUtil.getPropertyValue("hive.metastore.uris")) +// .enableHiveSupport() +// .getOrCreate() +// +// val process = Runner.create() +// .bind(classOf[SparkSession].getName, spark) +// .bind("checkpoint.path", "") +// .bind("debug.path","") +// .start(flow); +// +// process.awaitTermination(); +// val pid = process.pid(); +// println(pid + "!!!!!!!!!!!!!!!!!!!!!") +// spark.close(); +// } +//} \ No newline at end of file diff --git a/piflow-configure/src/main/scala/cn/piflow/conf/StopGroup.scala b/piflow-configure/src/main/scala/cn/piflow/conf/StopGroup.scala index acbbff78..48e722cb 100644 --- a/piflow-configure/src/main/scala/cn/piflow/conf/StopGroup.scala +++ b/piflow-configure/src/main/scala/cn/piflow/conf/StopGroup.scala @@ -4,6 +4,7 @@ object StopGroup { val NSFC = "NSFC" val CommonGroup = "Common" val CsvGroup = "CSV" + val FlightGroup = "Flight" val HiveGroup = "Hive" val JdbcGroup = "Jdbc" val JsonGroup = "Json" @@ -34,4 +35,6 @@ object StopGroup { val Alg_ASRGroup = "Algorithms_ASR" val Python = "Python" val Visualization = "Visualization" + val CephGroup="ceph" + val NormalizationGroup = "Normalization" } diff --git a/piflow-configure/src/main/scala/cn/piflow/conf/util/ClassUtil.scala b/piflow-configure/src/main/scala/cn/piflow/conf/util/ClassUtil.scala index 53e0c0db..e753f5d7 100644 --- a/piflow-configure/src/main/scala/cn/piflow/conf/util/ClassUtil.scala +++ b/piflow-configure/src/main/scala/cn/piflow/conf/util/ClassUtil.scala @@ -11,7 +11,7 @@ import net.liftweb.json.{JValue, compactRender} import org.clapper.classutil.ClassFinder import org.reflections.Reflections import net.liftweb.json.JsonDSL._ -import sun.misc.BASE64Encoder +import java.util.Base64 import util.control.Breaks._ @@ -202,7 +202,6 @@ object ClassUtil { val stopName = bundle.split("\\.").last val propertyDescriptorList:List[PropertyDescriptor] = stop.getPropertyDescriptor() propertyDescriptorList.foreach(p=> if (p.allowableValues == null || p.allowableValues == None) p.allowableValues = List("")) - val 
base64Encoder = new BASE64Encoder() var iconArrayByte : Array[Byte]= Array[Byte]() try{ iconArrayByte = stop.getIcon() @@ -230,7 +229,7 @@ object ClassUtil { ("customizedAllowValue" -> "")*/ ("visualizationType" -> visualizationType) ~ ("description" -> stop.description) ~ - ("icon" -> base64Encoder.encode(iconArrayByte)) ~ + ("icon" -> Base64.getEncoder.encodeToString(iconArrayByte)) ~ ("properties" -> propertyDescriptorList.map { property =>( ("name" -> property.name) ~ diff --git a/piflow-configure/src/main/scala/cn/piflow/conf/util/ProcessUtil.scala b/piflow-configure/src/main/scala/cn/piflow/conf/util/ProcessUtil.scala new file mode 100644 index 00000000..618080be --- /dev/null +++ b/piflow-configure/src/main/scala/cn/piflow/conf/util/ProcessUtil.scala @@ -0,0 +1,42 @@ +package cn.piflow.conf.util + +import java.io.{ByteArrayOutputStream, PrintStream} +object ProcessUtil { + + /** + * 执行外部命令并返回标准输出和标准错误输出 + * + * @param command 要执行的命令及其参数 + * @return 一个包含标准输出和标准错误输出的元组 + */ + def executeCommand(command: Seq[String]): (String, String) = { + val processBuilder = new ProcessBuilder(command: _*) + val outBuffer = new ByteArrayOutputStream() + val errBuffer = new ByteArrayOutputStream() + val outStream = new PrintStream(outBuffer) + val errStream = new PrintStream(errBuffer) + + val process = processBuilder.start() + val threadOut = new Thread(() => scala.io.Source.fromInputStream(process.getInputStream()).getLines().foreach(outStream.println)) + val threadErr = new Thread(() => scala.io.Source.fromInputStream(process.getErrorStream()).getLines().foreach(errStream.println)) + + threadOut.start() + threadErr.start() + + // 等待进程结束 + process.waitFor() + threadOut.join() + threadErr.join() + + // 关闭流 + outStream.close() + errStream.close() + + // 获取输出和错误字符串 + val output = outBuffer.toString("UTF-8") + val error = errBuffer.toString("UTF-8") + + // 返回输出和错误 + (output, error) + } +} diff --git a/piflow-configure/src/main/scala/cn/piflow/conf/util/ScalaExecutorUtil.scala b/piflow-configure/src/main/scala/cn/piflow/conf/util/ScalaExecutorUtil.scala index c463a634..17328ba3 100644 --- a/piflow-configure/src/main/scala/cn/piflow/conf/util/ScalaExecutorUtil.scala +++ b/piflow-configure/src/main/scala/cn/piflow/conf/util/ScalaExecutorUtil.scala @@ -73,7 +73,7 @@ object ScalaExecutorUtil { def main(args: Array[String]): Unit = { val script = """ - |val df = in.read() + |val df = in.read().getSparkDf |df.show() |val df1 = df.select("title") |out.write(df1) diff --git a/piflow-core/pom.xml b/piflow-core/pom.xml index b84642aa..39dc899c 100644 --- a/piflow-core/pom.xml +++ b/piflow-core/pom.xml @@ -22,6 +22,16 @@ lift-json_2.12 3.3.0 + + org.apache.spark + spark-core_2.12 + ${spark.version} + + + org.apache.spark + spark-sql_2.12 + ${spark.version} + diff --git a/piflow-core/src/main/java/cn/piflow/util/SciDataFrame.java b/piflow-core/src/main/java/cn/piflow/util/SciDataFrame.java new file mode 100644 index 00000000..c68f829a --- /dev/null +++ b/piflow-core/src/main/java/cn/piflow/util/SciDataFrame.java @@ -0,0 +1,189 @@ +package cn.piflow.util; + +import org.apache.arrow.vector.VectorSchemaRoot; +import org.apache.arrow.vector.ipc.ArrowFileReader; +import org.apache.arrow.vector.ipc.ArrowFileWriter; +import org.apache.arrow.vector.types.pojo.Schema; +import org.apache.spark.api.java.function.FlatMapFunction; +import org.apache.spark.sql.*; +import org.apache.spark.api.java.function.MapFunction; +import java.util.*; + +import org.apache.spark.sql.Dataset; +import org.apache.spark.sql.Row; +import 
org.apache.spark.sql.types.StructType; +import scala.Function1; +import scala.collection.TraversableOnce; + +//public class SciDataFrame implements Iterable { +public class SciDataFrame { + public enum Level { FOLDER, FILE } + public enum FileFormat { + TEXT, JSON, PARQUET; + public String getFormatString() { + return this.name().toLowerCase(); + } + } + + // Fields + private UUID id; + private List> schema; + private Long nbytes; + private VectorSchemaRoot data; + private Dataset df_data; + private Integer batchSize; +// private Client client; + private int counter; + private ArrowFileReader reader; + private Level level; + private String datasetId; + private Boolean isIterate; + private Map loadKwargs; + + + + + + public SciDataFrame(Dataset dataFrame) { + this.df_data = dataFrame; + } + + + // Constructor (Builder 模式处理复杂参数) + public static class Builder { + private String datasetId; + private List> schema; + private Long nbytes; + private Level level = Level.FOLDER; + private VectorSchemaRoot data; + private Integer batchSize; +// private Client client = new Client("127.0.0.1", 8815); // 默认客户端 + private Boolean isIterate; + private Map loadKwargs = new HashMap<>(); + + public Builder datasetId(String id) { this.datasetId = id; return this; } + public Builder schema(List> s) { this.schema = s; return this; } + // 其他参数类似... + +// public SciDataFrame build() { +// return new SciDataFrame(this); +// } + } + + + + public void setSparkDf(Dataset df){ + this.df_data = df; + } + + public Dataset getSparkDf(){ + return df_data; + } + + public void show(){ + df_data.show(); + } + // 显示指定行数 + public void show(int numRows) { + df_data.show(numRows); + } + // 显示指定行数,并控制字符串截断 + public void show(int numRows, boolean truncate) { + df_data.show(numRows, truncate); + } + // 显示全部行(谨慎使用!) 
+ public void showAll(boolean truncate) { + long totalRows = df_data.count(); + df_data.show((int) totalRows, truncate); + + } + public void save(String path, String format) { + df_data.write().format(format).save(path); + + } + + public void write(SaveMode saveMode, String url, String dbtable, Properties props) { + df_data.write().mode(saveMode).jdbc(url, dbtable, props); + } + + public StructType getSchema() { + return df_data.schema(); + } + + public SciDataFrame map(MapFunction mapFunction, Encoder encoder){ + Dataset mappedDataset = df_data.map(mapFunction, encoder); + Dataset resultDF = mappedDataset.toDF(); + return new SciDataFrame(resultDF); + } + + public SciDataFrame flatMap(FlatMapFunction flatMapFunction, Encoder encoder){ + Dataset flatMappedDataset = df_data.flatMap(flatMapFunction, encoder); + return new SciDataFrame(flatMappedDataset.toDF()); + } + public SciDataFrame flatMap(Function1> flatMapFunction, Encoder encoder){ + Dataset flatMappedDataset = df_data.flatMap(flatMapFunction, encoder); + return new SciDataFrame(flatMappedDataset.toDF()); + } + +// private SciDataFrame(Builder builder) { +// this.id = UUID.randomUUID(); +// this.datasetId = builder.datasetId; +// this.schema = builder.schema; +// this.nbytes = builder.nbytes; +// this.data = builder.data; +// this.batchSize = builder.batchSize; +//// this.client = builder.client; +// this.level = builder.level; +// this.isIterate = builder.isIterate; +// this.loadKwargs = builder.loadKwargs; +// } + // 实现迭代器接口 +// @Override +// public Iterator iterator() { +// if (!isIterate) { +// throw new UnsupportedOperationException("Batch iteration not enabled"); +// } +// return new Iterator<>() { +// @Override +// public boolean hasNext() { +// return reader.hasNext(); +// } +// +// @Override +// public SciDataFrame next() { +//// VectorSchemaRoot batch = reader.next(); +//// return new SciDataFrame.Builder() +//// .datasetId(datasetId) +//// .data(batch) +// } +// }; + + // 流式数据处理初始化 +// public void flatOpen(String paths) { +// try { +// this.client.loadInit(loadKwargs); +// this.reader = client.flatOpen(isPathsFile(paths)); +// if (!isIterate) { +// this.data = readAllBatches(reader); +// } +// } catch (Exception e) { +// throw new RuntimeException("Flat open failed", e); +// } +// } + +// private VectorSchemaRoot readAllBatches(ArrowFileReader reader) { +// VectorSchemaRoot result = null; +// while (reader.hasNext()) { +// VectorSchemaRoot batch = reader.next(); +// if (result == null) { +// result = VectorSchemaRoot.create(batch.getSchema(), new RootAllocator()); +// } +// // 合并 batches 到 result(需实现具体合并逻辑) +// } +// return result; +// } + private Level isPathsFile(String paths) { + return paths.split(",").length == 1 ? 
Level.FILE : Level.FOLDER; + } + +} diff --git a/piflow-core/src/main/java/cn/piflow/util/SecurityUtil.java b/piflow-core/src/main/java/cn/piflow/util/SecurityUtil.java index 5fd107ac..f80c387a 100644 --- a/piflow-core/src/main/java/cn/piflow/util/SecurityUtil.java +++ b/piflow-core/src/main/java/cn/piflow/util/SecurityUtil.java @@ -2,8 +2,7 @@ import org.slf4j.Logger; import org.slf4j.LoggerFactory; -import sun.misc.BASE64Decoder; -import sun.misc.BASE64Encoder; +import java.util.Base64; import javax.crypto.Cipher; import javax.crypto.KeyGenerator; import javax.crypto.SecretKey; @@ -37,12 +36,11 @@ public static String decryptAES(String encryptResultStr) { } private static String ebotongEncrypto(String str) { - BASE64Encoder base64encoder = new BASE64Encoder(); String result = str; if (str != null && str.length() > 0) { try { byte[] encodeByte = str.getBytes(ENCODING); - result = base64encoder.encode(encodeByte); + result = Base64.getEncoder().encodeToString(encodeByte); } catch (Exception e) { e.printStackTrace(); } @@ -51,10 +49,9 @@ private static String ebotongEncrypto(String str) { } private static String ebotongDecrypto(String str) { - BASE64Decoder base64decoder = new BASE64Decoder(); try { - byte[] encodeByte = base64decoder.decodeBuffer(str); - return new String(encodeByte); + byte[] encodeByte = Base64.getDecoder().decode(str); + return new String(encodeByte, ENCODING); } catch (IOException e) { logger.error("IO 异Exception",e); return str; diff --git a/piflow-core/src/main/scala/cn/piflow/SciDataFrameImplicits.scala b/piflow-core/src/main/scala/cn/piflow/SciDataFrameImplicits.scala new file mode 100644 index 00000000..e06bd145 --- /dev/null +++ b/piflow-core/src/main/scala/cn/piflow/SciDataFrameImplicits.scala @@ -0,0 +1,9 @@ +package cn.piflow + +import cn.piflow.util.SciDataFrame +import org.apache.spark.sql.DataFrame + +object SciDataFrameImplicits { + implicit def autoWrapDataFrame(df: DataFrame): SciDataFrame = + new SciDataFrame(df) +} diff --git a/piflow-core/src/main/scala/cn/piflow/lib/etl.scala b/piflow-core/src/main/scala/cn/piflow/lib/etl.scala index e0e65428..1fccf0c3 100644 --- a/piflow-core/src/main/scala/cn/piflow/lib/etl.scala +++ b/piflow-core/src/main/scala/cn/piflow/lib/etl.scala @@ -4,7 +4,7 @@ package cn.piflow.lib import cn.piflow._ -import cn.piflow.util.{FunctionLogic, Logging} +import cn.piflow.util.{FunctionLogic, Logging,SciDataFrame} import org.apache.spark.sql._ import org.apache.spark.sql.catalyst.encoders.RowEncoder import org.apache.spark.sql.types.StructType @@ -30,11 +30,11 @@ class WriteStream(streamSink: Sink) extends Stop with Logging { } trait Source { - def load(ctx: JobContext): DataFrame; + def load(ctx: JobContext): SciDataFrame; } trait Sink { - def save(data: DataFrame, ctx: JobContext): Unit; + def save(data: SciDataFrame, ctx: JobContext): Unit; } class DoMap(func: FunctionLogic, targetSchema: StructType = null) extends Stop with Logging with Serializable { @@ -45,14 +45,14 @@ class DoMap(func: FunctionLogic, targetSchema: StructType = null) extends Stop w val input = in.read(); val encoder = RowEncoder { if (targetSchema == null) { - input.schema; + input.getSchema } else { targetSchema; } }; - val output = input.map(x => func.perform(Seq(x)).asInstanceOf[Row])(encoder); + val output = input.map(x => func.perform(Seq(x)).asInstanceOf[Row],encoder); out.write(output); } } @@ -66,7 +66,7 @@ class DoFlatMap(func: FunctionLogic, targetSchema: StructType = null) extends St val data = in.read(); val encoder = RowEncoder { if 
(targetSchema == null) { - data.schema; + data.getSchema; } else { targetSchema; @@ -74,7 +74,7 @@ class DoFlatMap(func: FunctionLogic, targetSchema: StructType = null) extends St }; val output = data.flatMap(x => - JavaConversions.iterableAsScalaIterable(func.perform(Seq(x)).asInstanceOf[java.util.ArrayList[Row]]))(encoder); + JavaConversions.iterableAsScalaIterable(func.perform(Seq(x)).asInstanceOf[java.util.ArrayList[Row]]),encoder); out.write(output); } } @@ -89,12 +89,12 @@ class ExecuteSQL(sql: String, bundle2TableName: (String, String)*) extends Stop val tableName = x._2; logger.debug(s"registering sql table: $tableName"); - in.read(x._1).createOrReplaceTempView(tableName); +// in.read(x._1).createOrReplaceTempView(tableName); } try { val output = pec.get[SparkSession].sql(sql); - out.write(output); + out.write(new SciDataFrame(output)); } catch { case e: Throwable => diff --git a/piflow-core/src/main/scala/cn/piflow/lib/gate.scala b/piflow-core/src/main/scala/cn/piflow/lib/gate.scala index eb4b9adb..f77db912 100644 --- a/piflow-core/src/main/scala/cn/piflow/lib/gate.scala +++ b/piflow-core/src/main/scala/cn/piflow/lib/gate.scala @@ -1,12 +1,13 @@ package cn.piflow.lib import cn.piflow._ +import cn.piflow.util.SciDataFrame class DoMerge extends Stop { override def initialize(ctx: ProcessContext): Unit = {} override def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { - out.write(in.ports().map(in.read(_)).reduce((x, y) => x.union(y))); + out.write(new SciDataFrame(in.ports().map(in.read(_).getSparkDf).reduce((x, y) => x.union(y)))); } } diff --git a/piflow-core/src/main/scala/cn/piflow/lib/io.scala b/piflow-core/src/main/scala/cn/piflow/lib/io.scala index d94a0ab5..eb2c469f 100644 --- a/piflow-core/src/main/scala/cn/piflow/lib/io.scala +++ b/piflow-core/src/main/scala/cn/piflow/lib/io.scala @@ -3,6 +3,7 @@ package cn.piflow.lib.io import java.io.File import cn.piflow.JobContext +import cn.piflow.util.SciDataFrame import cn.piflow.lib._ import cn.piflow.util.Logging import org.apache.spark.sql._ @@ -11,12 +12,13 @@ import org.apache.spark.sql._ * Created by bluejoe on 2018/5/13. 
*/ case class TextFile(path: String, format: String = FileFormat.TEXT) extends Source with Sink { - override def load(ctx: JobContext): DataFrame = { - ctx.get[SparkSession].read.format(format).load(path).asInstanceOf[DataFrame]; - } + override def load(ctx: JobContext): SciDataFrame = { + val sparkDf=ctx.get[SparkSession].read.format(format).load(path).asInstanceOf[DataFrame]; + new SciDataFrame(sparkDf) + } - override def save(data: DataFrame, ctx: JobContext): Unit = { - data.write.format(format).save(path); + override def save(data: SciDataFrame, ctx: JobContext): Unit = { + data.save(path, format) } } @@ -27,7 +29,7 @@ object FileFormat { } case class Console(nlimit: Int = 20) extends Sink { - override def save(data: DataFrame, ctx: JobContext): Unit = { + override def save(data: SciDataFrame, ctx: JobContext): Unit = { data.show(nlimit); } } \ No newline at end of file diff --git a/piflow-core/src/main/scala/cn/piflow/main.scala b/piflow-core/src/main/scala/cn/piflow/main.scala index 7214b227..2ef6358c 100644 --- a/piflow-core/src/main/scala/cn/piflow/main.scala +++ b/piflow-core/src/main/scala/cn/piflow/main.scala @@ -5,6 +5,7 @@ import java.net.URI import java.util.concurrent.{CountDownLatch, TimeUnit} import cn.piflow.util._ +import cn.piflow.util.SciDataFrame import org.apache.hadoop.conf.Configuration import org.apache.hadoop.fs.FileSystem import org.apache.spark.sql._ @@ -17,11 +18,11 @@ import org.apache.spark.sql.functions.{col, max} trait JobInputStream { def isEmpty(): Boolean; - def read(): DataFrame; + def read(): SciDataFrame; def ports(): Seq[String]; - def read(inport: String): DataFrame; + def read(inport: String): SciDataFrame; def readProperties() : MMap[String, String]; @@ -33,9 +34,9 @@ trait JobOutputStream { def loadCheckPoint(pec: JobContext, path : String) : Unit; - def write(data: DataFrame); + def write(data: SciDataFrame); - def write(bundle: String, data: DataFrame); + def write(bundle: String, data: SciDataFrame); def writeProperties(properties : MMap[String, String]); @@ -426,7 +427,7 @@ trait GroupContext extends Context { class JobInputStreamImpl() extends JobInputStream { //only returns DataFrame on calling read() - val inputs = MMap[String, () => DataFrame](); + val inputs = MMap[String, () => SciDataFrame](); val inputsProperties = MMap[String, () => MMap[String, String]]() override def isEmpty(): Boolean = inputs.isEmpty; @@ -444,14 +445,14 @@ class JobInputStreamImpl() extends JobInputStream { inputs.keySet.toSeq; } - override def read(): DataFrame = { + override def read(): SciDataFrame = { if (inputs.isEmpty) throw new NoInputAvailableException(); read(inputs.head._1); }; - override def read(inport: String): DataFrame = { + override def read(inport: String): SciDataFrame = { inputs(inport)(); } @@ -480,10 +481,10 @@ class JobOutputStreamImpl() extends JobOutputStream with Logging { //val path = getCheckPointPath(pec) logger.debug(s"writing data on checkpoint: $path"); - en._2.apply().write.parquet(path); + en._2.apply().getSparkDf.write.parquet(path); mapDataFrame(en._1) = () => { logger.debug(s"loading data from checkpoint: $path"); - pec.get[SparkSession].read.parquet(path)//default port? + new SciDataFrame(pec.get[SparkSession].read.parquet(path))//default port? 
}; }) } @@ -498,7 +499,7 @@ class JobOutputStreamImpl() extends JobOutputStream with Logging { val checkpointPortPath = checkpointPath + "/" + port logger.debug(s"loading data from checkpoint: $checkpointPortPath") println(s"loading data from checkpoint: $checkpointPortPath") - pec.get[SparkSession].read.parquet(checkpointPortPath) + new SciDataFrame(pec.get[SparkSession].read.parquet(checkpointPortPath)) }; val newPort = if(port.equals(defaultPort)) "" else port @@ -533,15 +534,15 @@ class JobOutputStreamImpl() extends JobOutputStream with Logging { HdfsUtil.getFiles(checkpointPath) } - val mapDataFrame = MMap[String, () => DataFrame](); + val mapDataFrame = MMap[String, () => SciDataFrame](); val mapDataFrameProperties = MMap[String, () => MMap[String, String]](); - override def write(data: DataFrame): Unit = write("", data); + override def write(data: SciDataFrame): Unit = write("", data); override def sendError(): Unit = ??? - override def write(outport: String, data: DataFrame): Unit = { + override def write(outport: String, data: SciDataFrame): Unit = { mapDataFrame(outport) = () => data; } @@ -568,7 +569,7 @@ class JobOutputStreamImpl() extends JobOutputStream with Logging { val portSchemaPath = debugPath + "/" + portName + "_schema" //println(portDataPath) //println(en._2.apply().schema) - val jsonDF = en._2.apply().na.fill("") + val jsonDF = en._2.apply().getSparkDf.na.fill("") var schemaStr = "" val schema = jsonDF.schema.foreach(f => { schemaStr = schemaStr + "," + f.name @@ -587,7 +588,7 @@ class JobOutputStreamImpl() extends JobOutputStream with Logging { val portName = if(en._1.equals("")) "default" else en._1 val portDataPath = visualizationPath + "/data" val portSchemaPath = visualizationPath + "/schema" - val jsonDF = en._2.apply().na.fill("") + val jsonDF = en._2.apply().getSparkDf.na.fill("") var schemaStr = "" val schema = jsonDF.schema.foreach(f => { schemaStr = schemaStr + "," + f.name @@ -607,7 +608,7 @@ class JobOutputStreamImpl() extends JobOutputStream with Logging { mapDataFrame.foreach(en => { val portName = if(en._1.equals("")) "default" else en._1 result - result(portName) = en._2.apply.count() + result(portName) = en._2.apply.getSparkDf.count() }) result } @@ -621,8 +622,8 @@ class JobOutputStreamImpl() extends JobOutputStream with Logging { var incrementalValue : String = "" mapDataFrame.foreach(en => { - if(!en._2.apply().head(1).isEmpty){ - val Row(maxValue : Any) = en._2.apply().agg(max(incrementalField)).head() + if(!en._2.apply().getSparkDf.head(1).isEmpty){ + val Row(maxValue : Any) = en._2.apply().getSparkDf.agg(max(incrementalField)).head() incrementalValue = maxValue.toString } }) @@ -704,7 +705,7 @@ class ProcessImpl(flow: Flow, runnerContext: Context, runner: Runner, parentProc df.show(showDataCount) } val streamingData = new JobOutputStreamImpl() - streamingData.write(df) + streamingData.write(new SciDataFrame(df)) analyzed.visitStreaming[JobOutputStreamImpl](flow, streamingStopName, streamingData, performStreamingStop) } @@ -734,6 +735,8 @@ class ProcessImpl(flow: Flow, runnerContext: Context, runner: Runner, parentProc catch { case e: Throwable => runnerListener.onJobFailed(pe.getContext()); + println("---------------performStreamingStop----update flow state failed!!!----------------") + runnerListener.onProcessFailed(processContext); throw e; } @@ -815,6 +818,8 @@ class ProcessImpl(flow: Flow, runnerContext: Context, runner: Runner, parentProc catch { case e: Throwable => runnerListener.onJobFailed(pe.getContext()); + 
println("---------------performStopByCheckpoint--------------update flow state failed!!!----------------") + runnerListener.onProcessFailed(processContext); throw e; } diff --git a/piflow-core/src/main/scala/cn/piflow/util/FileUtil.scala b/piflow-core/src/main/scala/cn/piflow/util/FileUtil.scala index 3da2151f..50895ce8 100644 --- a/piflow-core/src/main/scala/cn/piflow/util/FileUtil.scala +++ b/piflow-core/src/main/scala/cn/piflow/util/FileUtil.scala @@ -1,11 +1,65 @@ package cn.piflow.util -import java.io.{File, PrintWriter} - +import java.io.{File, IOException, PrintWriter} +import java.nio.file.{Files, Paths} import scala.io.Source +import org.apache.hadoop.conf.Configuration +import org.apache.hadoop.fs.{FSDataInputStream, FSDataOutputStream, FileStatus, FileSystem, Path} + +import scala.util.Try object FileUtil { + val LOCAL_FILE_PREFIX = "/data/temp/files/" + + def downloadFileFromHdfs(fsDefaultName: String, hdfsFilePath: String) = { + + var fs: FileSystem = null + var result: Boolean = false + try { + val conf = new Configuration() + conf.set("fs.default.name", fsDefaultName) + fs = FileSystem.get(conf) + val hdfsPath = new Path(hdfsFilePath) + val localPath = new Path(LOCAL_FILE_PREFIX + hdfsPath.getName()) // 本地目录和文件名 + + if (fs.exists(hdfsPath)) { + fs.copyToLocalFile(hdfsPath, localPath) + println(s"File ${hdfsPath.getName()} downloaded successfully.") + result = true + } else { + throw new Exception(s"File ${hdfsPath.getName()} does not exist in HDFS.") + } + } catch { + case ex: IOException => println(ex) + } finally { + HdfsUtil.close(fs) + } + result + } + + + def exists(localFilePath: String): Boolean = { + Files.exists(Paths.get(localFilePath)) + } + + def extractFileNameWithExtension(filePath: String): String = { + val lastSeparatorIndex = filePath.lastIndexOf('/') + val lastBackslashIndex = filePath.lastIndexOf('\\') + val separatorIndex = Math.max(lastSeparatorIndex, lastBackslashIndex) + if (separatorIndex == -1) { + filePath // 如果没有找到分隔符,则整个字符串就是文件名 + } else { + filePath.substring(separatorIndex + 1) // 从分隔符后面开始截取,得到文件名 + } + } + + def deleteFile(filePath: String): Try[Unit] = Try { + Files.delete(Paths.get(filePath)) + println(s"File $filePath deleted successfully.") + } + + def getJarFile(file:File): Array[File] ={ val files = file.listFiles().filter(! 
_.isDirectory) .filter(t => t.toString.endsWith(".jar") ) //此处读取.txt and .md文件 diff --git a/piflow-core/src/main/scala/cn/piflow/util/H2Util.scala b/piflow-core/src/main/scala/cn/piflow/util/H2Util.scala index 8e922f29..cd0339c6 100644 --- a/piflow-core/src/main/scala/cn/piflow/util/H2Util.scala +++ b/piflow-core/src/main/scala/cn/piflow/util/H2Util.scala @@ -25,11 +25,19 @@ object H2Util { val CREATE_FLAG_TABLE = "create table if not exists configFlag(id bigint auto_increment, item varchar(255), flag int, createTime varchar(255))" val CREATE_SCHEDULE_TABLE = "create table if not exists schedule(id bigint auto_increment, scheduleId varchar(255), scheduleEntryId varchar(255), scheduleEntryType varchar(255))" val CREATE_PLUGIN_TABLE = "create table if not exists plugin (id varchar(255), name varchar(255), state varchar(255), createTime varchar(255), updateTime varchar(255))" - val serverIP = ServerIpUtil.getServerIp() + ":" + PropertyUtil.getPropertyValue("h2.port") - val CONNECTION_URL = "jdbc:h2:tcp://" + serverIP + "/~/piflow;AUTO_SERVER=true" - var connection : Connection= null + val serverIP = PropertyUtil.getPropertyValue("server.ip") + ":" + PropertyUtil.getPropertyValue("h2.port") +// val serverIP = ServerIpUtil.getServerIp() + ":" + PropertyUtil.getPropertyValue("h2.port") + var CONNECTION_URL = ""; + val h2Path: String = PropertyUtil.getPropertyValue("h2.path") + if (h2Path != null && h2Path.nonEmpty) { + CONNECTION_URL = "jdbc:h2:tcp://" + serverIP + "/~/piflow/" + h2Path + ";AUTO_SERVER=true;DB_CLOSE_DELAY=-1" + } else { + CONNECTION_URL = "jdbc:h2:tcp://" + serverIP + "/~/piflow;AUTO_SERVER=true;DB_CLOSE_DELAY=-1" + } +// val CONNECTION_URL = "jdbc:h2:tcp://" + serverIP + "/~/piflow;AUTO_SERVER=true" + var connection: Connection = null - try{ + try { val statement = getConnectionInstance().createStatement() statement.setQueryTimeout(QUERY_TIME) @@ -42,12 +50,12 @@ object H2Util { statement.executeUpdate(CREATE_SCHEDULE_TABLE) statement.executeUpdate(CREATE_PLUGIN_TABLE) statement.close() - }catch { + } catch { case ex => println(ex) } - def getConnectionInstance() : Connection = { - if(connection == null){ + def getConnectionInstance(): Connection = { + if (connection == null) { Class.forName("org.h2.Driver") println(CONNECTION_URL) connection = DriverManager.getConnection(CONNECTION_URL) @@ -57,8 +65,8 @@ object H2Util { def cleanDatabase() = { - val h2Server = Server.createTcpServer("-tcp", "-tcpAllowOthers", "-tcpPort",PropertyUtil.getPropertyValue("h2.port")).start() - try{ + val h2Server = Server.createTcpServer("-tcp", "-tcpAllowOthers", "-tcpPort", PropertyUtil.getPropertyValue("h2.port")).start() + try { val statement = getConnectionInstance().createStatement() statement.setQueryTimeout(QUERY_TIME) @@ -71,9 +79,9 @@ object H2Util { statement.executeUpdate("drop table if exists plugin") statement.close() - } catch{ - case ex => println(ex) - }finally { + } catch { + case ex => println(ex) + } finally { h2Server.shutdown() } @@ -99,23 +107,24 @@ object H2Util { } }*/ - def addFlow(appId:String,pId:String, name:String)={ + def addFlow(appId: String, pId: String, name: String) = { val startTime = new Date().toString val statement = getConnectionInstance().createStatement() statement.setQueryTimeout(QUERY_TIME) statement.executeUpdate("insert into flow(id, pid, name) values('" + appId + "','" + pId + "','" + name + "')") statement.close() } - def updateFlowState(appId:String, state:String) = { + + def updateFlowState(appId: String, state: String) = { val statement = 
getConnectionInstance().createStatement() statement.setQueryTimeout(QUERY_TIME) val updateSql = "update flow set state='" + state + "' where id='" + appId + "'" println(updateSql) //update related stop stop when flow state is KILLED - if(state.equals(FlowState.KILLED)){ + if (state.equals(FlowState.KILLED)) { val startedStopList = getStartedStop(appId) startedStopList.foreach(stopName => { - updateStopState(appId,stopName,StopState.KILLED) + updateStopState(appId, stopName, StopState.KILLED) updateStopFinishedTime(appId, stopName, new Date().toString) }) } @@ -123,7 +132,8 @@ object H2Util { statement.executeUpdate(updateSql) statement.close() } - def updateFlowStartTime(appId:String, startTime:String) = { + + def updateFlowStartTime(appId: String, startTime: String) = { val statement = getConnectionInstance().createStatement() statement.setQueryTimeout(QUERY_TIME) val updateSql = "update flow set startTime='" + startTime + "' where id='" + appId + "'" @@ -131,7 +141,8 @@ object H2Util { statement.executeUpdate(updateSql) statement.close() } - def updateFlowFinishedTime(appId:String, endTime:String) = { + + def updateFlowFinishedTime(appId: String, endTime: String) = { val statement = getConnectionInstance().createStatement() statement.setQueryTimeout(QUERY_TIME) val updateSql = "update flow set endTime='" + endTime + "' where id='" + appId + "'" @@ -140,7 +151,7 @@ object H2Util { statement.close() } - def updateFlowGroupId(appId:String, groupId:String) = { + def updateFlowGroupId(appId: String, groupId: String) = { val statement = getConnectionInstance().createStatement() statement.setQueryTimeout(QUERY_TIME) val updateSql = "update flow set groupId='" + groupId + "' where id='" + appId + "'" @@ -158,12 +169,12 @@ object H2Util { statement.close() }*/ - def isFlowExist(appId : String) : Boolean = { + def isFlowExist(appId: String): Boolean = { var isExist = false val statement = getConnectionInstance().createStatement() statement.setQueryTimeout(QUERY_TIME) - val rs : ResultSet = statement.executeQuery("select * from flow where id='" + appId +"'") - while(rs.next()){ + val rs: ResultSet = statement.executeQuery("select * from flow where id='" + appId + "'") + while (rs.next()) { val id = rs.getString("id") println("Flow id exist: " + id) isExist = true @@ -177,8 +188,8 @@ object H2Util { var state = "" val statement = getConnectionInstance().createStatement() statement.setQueryTimeout(QUERY_TIME) - val rs : ResultSet = statement.executeQuery("select * from flow where id='" + appId +"'") - while(rs.next()){ + val rs: ResultSet = statement.executeQuery("select * from flow where id='" + appId + "'") + while (rs.next()) { state = rs.getString("state") //println("id:" + rs.getString("id") + "\tname:" + rs.getString("name") + "\tstate:" + rs.getString("state")) } @@ -187,12 +198,12 @@ object H2Util { state } - def getFlowProcessId(appId:String) : String = { + def getFlowProcessId(appId: String): String = { var pid = "" val statement = getConnectionInstance().createStatement() statement.setQueryTimeout(QUERY_TIME) - val rs : ResultSet = statement.executeQuery("select pid from flow where id='" + appId +"'") - while(rs.next()){ + val rs: ResultSet = statement.executeQuery("select pid from flow where id='" + appId + "'") + while (rs.next()) { pid = rs.getString("pid") } rs.close() @@ -200,7 +211,7 @@ object H2Util { pid } - def getFlowInfo(appId:String) : String = { + def getFlowInfo(appId: String): String = { /*val statement = getConnectionInstance().createStatement() 
statement.setQueryTimeout(QUERY_TIME) var flowInfo = "" @@ -240,16 +251,16 @@ object H2Util { JsonUtil.format(JsonUtil.toJson(flowInfoMap)) } - def getFlowInfoMap(appId:String) : Map[String, Any] = { + def getFlowInfoMap(appId: String): Map[String, Any] = { val statement = getConnectionInstance().createStatement() statement.setQueryTimeout(QUERY_TIME) var flowInfoMap = Map[String, Any]() //get flow basic info - val flowRS : ResultSet = statement.executeQuery("select * from flow where id='" + appId +"'") - while (flowRS.next()){ - val progress = getFlowProgressPercent(appId:String) + val flowRS: ResultSet = statement.executeQuery("select * from flow where id='" + appId + "'") + while (flowRS.next()) { + val progress = getFlowProgressPercent(appId: String) flowInfoMap += ("id" -> flowRS.getString("id")) flowInfoMap += ("pid" -> flowRS.getString("pid")) flowInfoMap += ("name" -> flowRS.getString("name")) @@ -261,9 +272,9 @@ object H2Util { flowRS.close() //get flow stops info - var stopList:List[Map[String, Any]] = List() - val rs : ResultSet = statement.executeQuery("select * from stop where flowId='" + appId +"'") - while(rs.next()){ + var stopList: List[Map[String, Any]] = List() + val rs: ResultSet = statement.executeQuery("select * from stop where flowId='" + appId + "'") + while (rs.next()) { var stopMap = Map[String, Any]() stopMap += ("name" -> rs.getString("name")) stopMap += ("state" -> rs.getString("state")) @@ -283,62 +294,62 @@ object H2Util { Map[String, Any]("flow" -> flowInfoMap) } - def getFlowProgressPercent(appId:String) : String = { + def getFlowProgressPercent(appId: String): String = { val statement = getConnectionInstance().createStatement() statement.setQueryTimeout(QUERY_TIME) var stopCount = 0 var completedStopCount = 0 - val totalRS : ResultSet = statement.executeQuery("select count(*) as stopCount from stop where flowId='" + appId + "'") - while(totalRS.next()){ + val totalRS: ResultSet = statement.executeQuery("select count(*) as stopCount from stop where flowId='" + appId + "'") + while (totalRS.next()) { stopCount = totalRS.getInt("stopCount") //println("stopCount:" + stopCount) } totalRS.close() - val completedRS : ResultSet = statement.executeQuery("select count(*) as completedStopCount from stop where flowId='" + appId +"' and state='" + StopState.COMPLETED + "'") - while(completedRS.next()){ + val completedRS: ResultSet = statement.executeQuery("select count(*) as completedStopCount from stop where flowId='" + appId + "' and state='" + StopState.COMPLETED + "'") + while (completedRS.next()) { completedStopCount = completedRS.getInt("completedStopCount") //println("completedStopCount:" + completedStopCount) } completedRS.close() - val flowRS : ResultSet = statement.executeQuery("select * from flow where id='" + appId +"'") + val flowRS: ResultSet = statement.executeQuery("select * from flow where id='" + appId + "'") var flowState = "" - while (flowRS.next()){ + while (flowRS.next()) { flowState = flowRS.getString("state") } flowRS.close() statement.close() - val progress:Double = completedStopCount.asInstanceOf[Double] / stopCount * 100 - if(flowState.equals(FlowState.COMPLETED)){ + val progress: Double = completedStopCount.asInstanceOf[Double] / stopCount * 100 + if (flowState.equals(FlowState.COMPLETED)) { "100" - }else{ + } else { progress.toString } } - def getFlowProgress(appId:String) : String = { + def getFlowProgress(appId: String): String = { val progress = getFlowProgressPercent(appId) val statement = getConnectionInstance().createStatement() 
statement.setQueryTimeout(QUERY_TIME) - val flowRS : ResultSet = statement.executeQuery("select * from flow where id='" + appId +"'") + val flowRS: ResultSet = statement.executeQuery("select * from flow where id='" + appId + "'") var id = "" var name = "" var state = "" - while (flowRS.next()){ + while (flowRS.next()) { id = flowRS.getString("id") - name = flowRS.getString("name") + name = flowRS.getString("name") state = flowRS.getString("state") } flowRS.close() val json = ("FlowInfo" -> - ("appId" -> id)~ + ("appId" -> id) ~ ("name" -> name) ~ ("state" -> state) ~ ("progress" -> progress.toString)) @@ -347,13 +358,14 @@ object H2Util { } //Stop related API - def addStop(appId:String,name:String)={ + def addStop(appId: String, name: String) = { val statement = getConnectionInstance().createStatement() statement.setQueryTimeout(QUERY_TIME) statement.executeUpdate("insert into stop(flowId, name) values('" + appId + "','" + name + "')") statement.close() } - def updateStopState(appId:String, name:String, state:String) = { + + def updateStopState(appId: String, name: String, state: String) = { val statement = getConnectionInstance().createStatement() statement.setQueryTimeout(QUERY_TIME) val updateSql = "update stop set state='" + state + "' where flowId='" + appId + "' and name='" + name + "'" @@ -362,7 +374,7 @@ object H2Util { statement.close() } - def updateStopStartTime(appId:String, name:String, startTime:String) = { + def updateStopStartTime(appId: String, name: String, startTime: String) = { val statement = getConnectionInstance().createStatement() statement.setQueryTimeout(QUERY_TIME) val updateSql = "update stop set startTime='" + startTime + "' where flowId='" + appId + "' and name='" + name + "'" @@ -371,7 +383,7 @@ object H2Util { statement.close() } - def updateStopFinishedTime(appId:String, name:String, endTime:String) = { + def updateStopFinishedTime(appId: String, name: String, endTime: String) = { val statement = getConnectionInstance().createStatement() statement.setQueryTimeout(QUERY_TIME) val updateSql = "update stop set endTime='" + endTime + "' where flowId='" + appId + "' and name='" + name + "'" @@ -380,13 +392,13 @@ object H2Util { statement.close() } - def getStartedStop(appId:String) : List[String] = { + def getStartedStop(appId: String): List[String] = { val statement = getConnectionInstance().createStatement() statement.setQueryTimeout(QUERY_TIME) - var stopList:List[String] = List() - val rs : ResultSet = statement.executeQuery("select * from stop where flowId='" + appId +"' and state = '" + StopState.STARTED + "'") - while(rs.next()){ + var stopList: List[String] = List() + val rs: ResultSet = statement.executeQuery("select * from stop where flowId='" + appId + "' and state = '" + StopState.STARTED + "'") + while (rs.next()) { stopList = rs.getString("name") +: stopList } @@ -396,25 +408,27 @@ object H2Util { } // Throughput related API - def addThroughput(appId:String, stopName:String, portName:String, count:Long) = { + def addThroughput(appId: String, stopName: String, portName: String, count: Long) = { val statement = getConnectionInstance().createStatement() statement.setQueryTimeout(QUERY_TIME) statement.executeUpdate("insert into thoughput(flowId, stopName, portName, count) values('" + appId + "','" + stopName + "','" + portName + "','" + count + "')") statement.close() } - def getThroughput(appId:String, stopName:String, portName:String) = { + + def getThroughput(appId: String, stopName: String, portName: String) = { var count = "" val statement = 
getConnectionInstance().createStatement() statement.setQueryTimeout(QUERY_TIME) - val rs : ResultSet = statement.executeQuery("select count from thoughput where flowId='" + appId +"' and stopName = '" + stopName + "' and portName = '" + portName + "'") - while(rs.next()){ + val rs: ResultSet = statement.executeQuery("select count from thoughput where flowId='" + appId + "' and stopName = '" + stopName + "' and portName = '" + portName + "'") + while (rs.next()) { count = rs.getString("count") } rs.close() statement.close() count } - def updateThroughput(appId:String, stopName:String, portName:String, count:Long) = { + + def updateThroughput(appId: String, stopName: String, portName: String, count: Long) = { val statement = getConnectionInstance().createStatement() statement.setQueryTimeout(QUERY_TIME) val updateSql = "update thoughput set count='" + count + "' where flowId='" + appId + "' and stopName='" + stopName + "' and portName='" + portName + "'" @@ -423,14 +437,15 @@ object H2Util { } //Group related api - def addGroup(groupId:String, name:String, childCount: Int)={ + def addGroup(groupId: String, name: String, childCount: Int) = { val startTime = new Date().toString val statement = getConnectionInstance().createStatement() statement.setQueryTimeout(QUERY_TIME) - statement.executeUpdate("insert into flowGroup(id, name, childCount) values('" + groupId + "','" + name + "','" + childCount + "')") + statement.executeUpdate("insert into flowGroup(id, name, childCount) values('" + groupId + "','" + name + "','" + childCount + "')") statement.close() } - def updateGroupState(groupId:String, state:String) = { + + def updateGroupState(groupId: String, state: String) = { val statement = getConnectionInstance().createStatement() statement.setQueryTimeout(QUERY_TIME) val updateSql = "update flowGroup set state='" + state + "' where id='" + groupId + "'" @@ -447,7 +462,8 @@ object H2Util { statement.executeUpdate(updateSql) statement.close() } - def updateGroupStartTime(groupId:String, startTime:String) = { + + def updateGroupStartTime(groupId: String, startTime: String) = { val statement = getConnectionInstance().createStatement() statement.setQueryTimeout(QUERY_TIME) val updateSql = "update flowGroup set startTime='" + startTime + "' where id='" + groupId + "'" @@ -455,7 +471,8 @@ object H2Util { statement.executeUpdate(updateSql) statement.close() } - def updateGroupFinishedTime(groupId:String, endTime:String) = { + + def updateGroupFinishedTime(groupId: String, endTime: String) = { val statement = getConnectionInstance().createStatement() statement.setQueryTimeout(QUERY_TIME) val updateSql = "update flowGroup set endTime='" + endTime + "' where id='" + groupId + "'" @@ -463,7 +480,8 @@ object H2Util { statement.executeUpdate(updateSql) statement.close() } - def updateGroupParent(groupId:String, parentId:String) = { + + def updateGroupParent(groupId: String, parentId: String) = { val statement = getConnectionInstance().createStatement() statement.setQueryTimeout(QUERY_TIME) val updateSql = "update flowGroup set parentId='" + parentId + "' where id='" + groupId + "'" @@ -472,51 +490,51 @@ object H2Util { statement.close() } - def getGroupState(groupId:String) : String = { + def getGroupState(groupId: String): String = { val statement = getConnectionInstance().createStatement() statement.setQueryTimeout(QUERY_TIME) var groupState = "" - val groupRS : ResultSet = statement.executeQuery("select state from flowGroup where id='" + groupId +"'") - if (groupRS.next()){ + val groupRS: ResultSet = 
statement.executeQuery("select state from flowGroup where id='" + groupId + "'") + if (groupRS.next()) { groupState = groupRS.getString("state") } return groupState } - def isGroupChildError( groupId : String) : Boolean = { + def isGroupChildError(groupId: String): Boolean = { - if(getGroupChildByStatus(groupId, GroupState.FAILED).size > 0 || getGroupChildByStatus(groupId, GroupState.KILLED).size > 0) + if (getGroupChildByStatus(groupId, GroupState.FAILED).size > 0 || getGroupChildByStatus(groupId, GroupState.KILLED).size > 0) return true - else if(getFlowChildByStatus(groupId, FlowState.FAILED).size > 0 || getFlowChildByStatus(groupId, FlowState.KILLED).size > 0) + else if (getFlowChildByStatus(groupId, FlowState.FAILED).size > 0 || getFlowChildByStatus(groupId, FlowState.KILLED).size > 0) return true else return false } - def isGroupChildRunning( groupId : String) : Boolean = { + def isGroupChildRunning(groupId: String): Boolean = { - if(getGroupChildByStatus(groupId, GroupState.STARTED).size > 0 ) + if (getGroupChildByStatus(groupId, GroupState.STARTED).size > 0) return true - else if(getFlowChildByStatus(groupId, FlowState.STARTED).size > 0 ) + else if (getFlowChildByStatus(groupId, FlowState.STARTED).size > 0) return true else return false } - def getGroupChildByStatus(groupId: String, status : String) : List[String] = { + def getGroupChildByStatus(groupId: String, status: String): List[String] = { val statement = getConnectionInstance().createStatement() statement.setQueryTimeout(QUERY_TIME) var failedList = List[String]() //group children state - val groupRS : ResultSet = statement.executeQuery("select * from flowGroup where parentId='" + groupId +"'") - breakable{ - while (groupRS.next()){ + val groupRS: ResultSet = statement.executeQuery("select * from flowGroup where parentId='" + groupId + "'") + breakable { + while (groupRS.next()) { val groupName = groupRS.getString("name") val groupState = groupRS.getString("state") - if(groupState == status){ + if (groupState == status) { failedList = groupName +: failedList } } @@ -525,18 +543,19 @@ object H2Util { statement.close() return failedList } - def getFlowChildByStatus(groupId: String, status : String) : List[String] = { + + def getFlowChildByStatus(groupId: String, status: String): List[String] = { val statement = getConnectionInstance().createStatement() statement.setQueryTimeout(QUERY_TIME) var failedList = List[String]() //flow children state - val rs : ResultSet = statement.executeQuery("select * from flow where groupId='" + groupId +"'") - breakable{ - while(rs.next()){ + val rs: ResultSet = statement.executeQuery("select * from flow where groupId='" + groupId + "'") + breakable { + while (rs.next()) { val flowName = rs.getString("name") val flowState = rs.getString("state") - if(flowState == status){ + if (flowState == status) { failedList = flowName +: failedList } } @@ -547,7 +566,7 @@ object H2Util { return failedList } - def getFlowGroupInfo(groupId:String) : String = { + def getFlowGroupInfo(groupId: String): String = { val flowGroupInfoMap = getGroupInfoMap(groupId) JsonUtil.format(JsonUtil.toJson(flowGroupInfoMap)) @@ -555,15 +574,15 @@ object H2Util { } //TODO need to get group - def getGroupInfoMap(groupId:String) : Map[String, Any] = { + def getGroupInfoMap(groupId: String): Map[String, Any] = { val statement = getConnectionInstance().createStatement() statement.setQueryTimeout(QUERY_TIME) var flowGroupInfoMap = Map[String, Any]() - val flowGroupRS : ResultSet = statement.executeQuery("select * from flowGroup 
where id='" + groupId +"'") - while (flowGroupRS.next()){ + val flowGroupRS: ResultSet = statement.executeQuery("select * from flowGroup where id='" + groupId + "'") + while (flowGroupRS.next()) { flowGroupInfoMap += ("id" -> flowGroupRS.getString("id")) flowGroupInfoMap += ("name" -> flowGroupRS.getString("name")) @@ -573,9 +592,9 @@ object H2Util { } flowGroupRS.close() - var groupList:List[Map[String, Any]] = List() - val childGroupRS : ResultSet = statement.executeQuery("select * from flowGroup where parentId='" + groupId +"'") - while (childGroupRS.next()){ + var groupList: List[Map[String, Any]] = List() + val childGroupRS: ResultSet = statement.executeQuery("select * from flowGroup where parentId='" + groupId + "'") + while (childGroupRS.next()) { val childGroupId = childGroupRS.getString("id") val childGroupMapInfo = getGroupInfoMap(childGroupId) groupList = childGroupMapInfo +: groupList @@ -583,9 +602,9 @@ object H2Util { childGroupRS.close() flowGroupInfoMap += ("groups" -> groupList) - var flowList:List[Map[String, Any]] = List() - val flowRS : ResultSet = statement.executeQuery("select * from flow where groupId='" + groupId +"'") - while (flowRS.next()){ + var flowList: List[Map[String, Any]] = List() + val flowRS: ResultSet = statement.executeQuery("select * from flow where groupId='" + groupId + "'") + while (flowRS.next()) { val appId = flowRS.getString("id") flowList = getFlowInfoMap(appId) +: flowList } @@ -595,11 +614,10 @@ object H2Util { statement.close() - Map[String, Any]("group" -> flowGroupInfoMap) } - def getGroupProgressPercent(groupId:String) : String = { + def getGroupProgressPercent(groupId: String): String = { val statement = getConnectionInstance().createStatement() statement.setQueryTimeout(QUERY_TIME) @@ -608,29 +626,29 @@ object H2Util { var completedGroupCount = 0 var completedFlowCount = 0 - val groupRSALL : ResultSet = statement.executeQuery("select * from flowGroup where id='" + groupId +"'") + val groupRSALL: ResultSet = statement.executeQuery("select * from flowGroup where id='" + groupId + "'") var groupState = "" - while (groupRSALL.next()){ + while (groupRSALL.next()) { groupState = groupRSALL.getString("state") childCount = groupRSALL.getInt("childCount") } groupRSALL.close() - if(groupState.equals(FlowState.COMPLETED)){ + if (groupState.equals(FlowState.COMPLETED)) { statement.close() return "100" - }else{ + } else { - val completedGroupRS : ResultSet = statement.executeQuery("select count(*) as completedGroupCount from flowGroup where parentId='" + groupId +"' and state='" + GroupState.COMPLETED+ "'") - while(completedGroupRS.next()){ + val completedGroupRS: ResultSet = statement.executeQuery("select count(*) as completedGroupCount from flowGroup where parentId='" + groupId + "' and state='" + GroupState.COMPLETED + "'") + while (completedGroupRS.next()) { completedGroupCount = completedGroupRS.getInt("completedGroupCount") println("completedGroupCount:" + completedGroupCount) } completedGroupRS.close() - val completedFlowRS : ResultSet = statement.executeQuery("select count(*) as completedFlowCount from flow where GroupId='" + groupId +"' and state='" + FlowState.COMPLETED + "'") - while(completedFlowRS.next()){ + val completedFlowRS: ResultSet = statement.executeQuery("select count(*) as completedFlowCount from flow where GroupId='" + groupId + "' and state='" + FlowState.COMPLETED + "'") + while (completedFlowRS.next()) { completedFlowCount = completedFlowRS.getInt("completedFlowCount") println("completedFlowCount:" + completedFlowCount) } 
@@ -638,7 +656,7 @@ object H2Util { statement.close() - val progress:Double = (completedFlowCount.asInstanceOf[Double] + completedGroupCount.asInstanceOf[Double])/ childCount * 100 + val progress: Double = (completedFlowCount.asInstanceOf[Double] + completedGroupCount.asInstanceOf[Double]) / childCount * 100 return progress.toString } @@ -736,36 +754,36 @@ object H2Util { Map[String, Any]("project" -> projectInfoMap) }*/ - def addFlag(item:String, flag:Int) : Unit = { + def addFlag(item: String, flag: Int): Unit = { val createTime = new Date().toString val statement = getConnectionInstance().createStatement() statement.setQueryTimeout(QUERY_TIME) - statement.executeUpdate("insert into configFlag(item, flag, createTime) values('" + item + "','" + flag + "','" + createTime +"')") + statement.executeUpdate("insert into configFlag(item, flag, createTime) values('" + item + "','" + flag + "','" + createTime + "')") statement.close() } - def getFlag(item : String) : Int = { + def getFlag(item: String): Int = { val statement = getConnectionInstance().createStatement() statement.setQueryTimeout(QUERY_TIME) var flag = 0 - val flowGroupRS : ResultSet = statement.executeQuery("select flag from configFlag where item='" + item +"'") - if (flowGroupRS.next()){ + val flowGroupRS: ResultSet = statement.executeQuery("select flag from configFlag where item='" + item + "'") + if (flowGroupRS.next()) { flag = flowGroupRS.getInt("flag") } return flag } - def addScheduleInstance(scheduleId : String, cronExpression : String, startDate : String, endDate : String, state : String): Unit ={ + def addScheduleInstance(scheduleId: String, cronExpression: String, startDate: String, endDate: String, state: String): Unit = { val statement = getConnectionInstance().createStatement() statement.setQueryTimeout(QUERY_TIME) val time = new Date().toString - statement.executeUpdate("insert into scheduleInstance(id, cronExpression, startDate, endDate, state, createTime, updateTime) values('" + scheduleId + "','" + cronExpression + "','" + startDate + "','" + endDate + "','" + state + "','" + time + "','" + time + "')") + statement.executeUpdate("insert into scheduleInstance(id, cronExpression, startDate, endDate, state, createTime, updateTime) values('" + scheduleId + "','" + cronExpression + "','" + startDate + "','" + endDate + "','" + state + "','" + time + "','" + time + "')") statement.close() } - def updateScheduleInstanceStatus(scheduleId : String, state : String) = { + def updateScheduleInstanceStatus(scheduleId: String, state: String) = { val statement = getConnectionInstance().createStatement() statement.setQueryTimeout(QUERY_TIME) val time = new Date().toString @@ -776,15 +794,15 @@ object H2Util { } def getNeedStopSchedule(): List[String] = { - var resultList : List[String]= List() + var resultList: List[String] = List() val statement = getConnectionInstance().createStatement() statement.setQueryTimeout(QUERY_TIME) val nowDate: String = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(new Date()) - val updateSql = "select id from scheduleInstance where state = '" + ScheduleState.STARTED + "' and endDate != '' and endDate <= '" + nowDate + "'" + val updateSql = "select id from scheduleInstance where state = '" + ScheduleState.STARTED + "' and endDate != '' and endDate <= '" + nowDate + "'" println(updateSql) - val scheduleRS : ResultSet = statement.executeQuery(updateSql) - while (scheduleRS.next()){ + val scheduleRS: ResultSet = statement.executeQuery(updateSql) + while (scheduleRS.next()) { val id = 
scheduleRS.getString("id") resultList = id +: resultList @@ -794,23 +812,23 @@ object H2Util { resultList } - def addScheduleEntry(scheduleId : String, scheduleEntryId : String, scheduleEntryType : String): Unit ={ + def addScheduleEntry(scheduleId: String, scheduleEntryId: String, scheduleEntryType: String): Unit = { val createTime = new Date().toString val statement = getConnectionInstance().createStatement() statement.setQueryTimeout(QUERY_TIME) - statement.executeUpdate("insert into schedule(scheduleId, scheduleEntryId, scheduleEntryType) values('" + scheduleId + "','" + scheduleEntryId + "','" + scheduleEntryType +"')") + statement.executeUpdate("insert into schedule(scheduleId, scheduleEntryId, scheduleEntryType) values('" + scheduleId + "','" + scheduleEntryId + "','" + scheduleEntryType + "')") statement.close() } - def getScheduleInfo(scheduleId: String) : String = { + def getScheduleInfo(scheduleId: String): String = { val statement = getConnectionInstance().createStatement() statement.setQueryTimeout(QUERY_TIME) var scheduleInfoMap = Map[String, Any]() //get flow basic info - val scheduleInstanceRS : ResultSet = statement.executeQuery("select * from scheduleInstance where id='" + scheduleId +"'") - while (scheduleInstanceRS.next()){ + val scheduleInstanceRS: ResultSet = statement.executeQuery("select * from scheduleInstance where id='" + scheduleId + "'") + while (scheduleInstanceRS.next()) { scheduleInfoMap += ("id" -> scheduleInstanceRS.getString("id")) scheduleInfoMap += ("cronExpression" -> scheduleInstanceRS.getString("cronExpression")) @@ -822,9 +840,9 @@ object H2Util { } scheduleInstanceRS.close() - var scheduleEntryList : List[Map[String, String]] = List() - val scheduleRS : ResultSet = statement.executeQuery("select * from schedule where scheduleId='" + scheduleId +"'") - while (scheduleRS.next()){ + var scheduleEntryList: List[Map[String, String]] = List() + val scheduleRS: ResultSet = statement.executeQuery("select * from schedule where scheduleId='" + scheduleId + "'") + while (scheduleRS.next()) { var scheduleEntryMap = Map[String, String]() scheduleEntryMap += ("scheduleEntryId" -> scheduleRS.getString("scheduleEntryId")) @@ -839,7 +857,7 @@ object H2Util { } - def getStartedSchedule() : List[String] ={ + def getStartedSchedule(): List[String] = { var scheduleList = List[String]() val statement = getConnectionInstance().createStatement() @@ -847,8 +865,8 @@ object H2Util { var scheduleInfoMap = Map[String, Any]() //get flow basic info - val scheduleInstanceRS : ResultSet = statement.executeQuery("select * from scheduleInstance where state='" + ScheduleState.STARTED +"'") - while (scheduleInstanceRS.next()){ + val scheduleInstanceRS: ResultSet = statement.executeQuery("select * from scheduleInstance where state='" + ScheduleState.STARTED + "'") + while (scheduleInstanceRS.next()) { scheduleList = scheduleInstanceRS.getString("id") +: scheduleList } @@ -856,27 +874,27 @@ object H2Util { scheduleList } - def addPlugin(name:String)={ + def addPlugin(name: String) = { var id = "" var state = "" val statement = getConnectionInstance().createStatement() statement.setQueryTimeout(QUERY_TIME) - val rs : ResultSet = statement.executeQuery("select * from plugin where name='" + name +"'") - if(!rs.isBeforeFirst){ + val rs: ResultSet = statement.executeQuery("select * from plugin where name='" + name + "'") + if (!rs.isBeforeFirst) { id = IdGenerator.uuid() val time = new Date().toString - statement.executeUpdate("insert into plugin(id, name, state, createTime, updateTime) 
values('" + id + "','" + name + "','" + PluginState.ON + "','" + time + "','" + time + "')") + statement.executeUpdate("insert into plugin(id, name, state, createTime, updateTime) values('" + id + "','" + name + "','" + PluginState.ON + "','" + time + "','" + time + "')") state = PluginState.ON - }else{ + } else { - breakable{ - while(rs.next()){ + breakable { + while (rs.next()) { id = rs.getString("id") state = rs.getString("state") val time = new Date().toString - if(state == PluginState.OFF){ - val updateSql = "update plugin set state='" + PluginState.ON + "'," + "updateTime='" + time +"' where name='" + name + "'" + if (state == PluginState.OFF) { + val updateSql = "update plugin set state='" + PluginState.ON + "'," + "updateTime='" + time + "' where name='" + name + "'" statement.executeUpdate(updateSql) state = PluginState.ON break @@ -890,22 +908,22 @@ object H2Util { id } - def removePlugin(name:String)={ + def removePlugin(name: String) = { var state = "" val statement = getConnectionInstance().createStatement() statement.setQueryTimeout(QUERY_TIME) - val rs : ResultSet = statement.executeQuery("select * from plugin where name='" + name +"'") - if(!rs.isBeforeFirst){ + val rs: ResultSet = statement.executeQuery("select * from plugin where name='" + name + "'") + if (!rs.isBeforeFirst) { state = PluginState.NONE - }else{ + } else { - breakable{ - while(rs.next()){ + breakable { + while (rs.next()) { state = rs.getString("state") val time = new Date().toString - if(state == PluginState.ON){ - val updateSql = "update plugin set state='" + PluginState.OFF + "'," + "updateTime='" + time +"' where name='" + name + "'" + if (state == PluginState.ON) { + val updateSql = "update plugin set state='" + PluginState.OFF + "'," + "updateTime='" + time + "' where name='" + name + "'" statement.executeUpdate(updateSql) state = PluginState.OFF break @@ -918,18 +936,18 @@ object H2Util { state } - def getPluginInfo(pluginId : String) : String ={ + def getPluginInfo(pluginId: String): String = { val pluginMap = getPluginInfoMap(pluginId) JsonUtil.format(JsonUtil.toJson(pluginMap)) } - def getPluginInfoMap(pluginId : String) : Map[String, String] ={ + def getPluginInfoMap(pluginId: String): Map[String, String] = { var pluginMap = Map[String, String]() val statement = getConnectionInstance().createStatement() - val rs : ResultSet = statement.executeQuery("select * from plugin where id='" + pluginId + "'") - while(rs.next()){ + val rs: ResultSet = statement.executeQuery("select * from plugin where id='" + pluginId + "'") + while (rs.next()) { val path = PropertyUtil.getClassPath() + "/" + rs.getString("name") pluginMap += ("id" -> rs.getString("id")) @@ -944,12 +962,12 @@ object H2Util { pluginMap } - def getPluginOn() : List[String] ={ + def getPluginOn(): List[String] = { var pluginList = List[String]() val statement = getConnectionInstance().createStatement() - val rs : ResultSet = statement.executeQuery("select * from plugin where state='" + PluginState.ON + "'") - while(rs.next()){ + val rs: ResultSet = statement.executeQuery("select * from plugin where state='" + PluginState.ON + "'") + while (rs.next()) { pluginList = rs.getString("name") +: pluginList } rs.close() @@ -958,27 +976,27 @@ object H2Util { } - def addSparkJar(sparkJarName:String)={ + def addSparkJar(sparkJarName: String) = { var id = "" var state = "" val statement = getConnectionInstance().createStatement() statement.setQueryTimeout(QUERY_TIME) - val rs : ResultSet = statement.executeQuery("select * from sparkJar where name='" + 
sparkJarName +"'") - if(!rs.isBeforeFirst){ + val rs: ResultSet = statement.executeQuery("select * from sparkJar where name='" + sparkJarName + "'") + if (!rs.isBeforeFirst) { id = IdGenerator.uuid() val time = new Date().toString - statement.executeUpdate("insert into sparkJar(id, name, state, createTime, updateTime) values('" + id + "','" + sparkJarName + "','" + PluginState.ON + "','" + time + "','" + time + "')") + statement.executeUpdate("insert into sparkJar(id, name, state, createTime, updateTime) values('" + id + "','" + sparkJarName + "','" + PluginState.ON + "','" + time + "','" + time + "')") state = SparkJarState.ON - }else{ + } else { - breakable{ - while(rs.next()){ + breakable { + while (rs.next()) { id = rs.getString("id") state = rs.getString("state") val time = new Date().toString - if(state == SparkJarState.OFF){ - val updateSql = "update sparkJar set state='" + SparkJarState.ON + "'," + "updateTime='" + time +"' where name='" + sparkJarName + "'" + if (state == SparkJarState.OFF) { + val updateSql = "update sparkJar set state='" + SparkJarState.ON + "'," + "updateTime='" + time + "' where name='" + sparkJarName + "'" statement.executeUpdate(updateSql) state = SparkJarState.ON break @@ -992,22 +1010,22 @@ object H2Util { id } - def removeSparkJar(sparkJarId:String) : String={ + def removeSparkJar(sparkJarId: String): String = { var state = "" val statement = getConnectionInstance().createStatement() statement.setQueryTimeout(QUERY_TIME) - val rs : ResultSet = statement.executeQuery("select * from sparkJar where id='" + sparkJarId +"'") - if(!rs.isBeforeFirst){ + val rs: ResultSet = statement.executeQuery("select * from sparkJar where id='" + sparkJarId + "'") + if (!rs.isBeforeFirst) { state = SparkJarState.NONE - }else{ + } else { - breakable{ - while(rs.next()){ + breakable { + while (rs.next()) { state = rs.getString("state") val time = new Date().toString - if(state == SparkJarState.ON){ - val updateSql = "update sparkJar set state='" + SparkJarState.OFF + "'," + "updateTime='" + time +"' where id='" + sparkJarId + "'" + if (state == SparkJarState.ON) { + val updateSql = "update sparkJar set state='" + SparkJarState.OFF + "'," + "updateTime='" + time + "' where id='" + sparkJarId + "'" statement.executeUpdate(updateSql) state = SparkJarState.OFF break @@ -1020,18 +1038,18 @@ object H2Util { state } - def getSparkJarInfo(sparkJarId : String) : String ={ + def getSparkJarInfo(sparkJarId: String): String = { val sparkJarMap = getSparkJarInfoMap(sparkJarId) JsonUtil.format(JsonUtil.toJson(sparkJarMap)) } - def getSparkJarInfoMap(sparkJarId : String) : Map[String, String] ={ + def getSparkJarInfoMap(sparkJarId: String): Map[String, String] = { var sparkJarMap = Map[String, String]() val statement = getConnectionInstance().createStatement() - val rs : ResultSet = statement.executeQuery("select * from sparkJar where id='" + sparkJarId + "'") - while(rs.next()){ + val rs: ResultSet = statement.executeQuery("select * from sparkJar where id='" + sparkJarId + "'") + while (rs.next()) { val path = PropertyUtil.getSpartJarPath() + "/" + rs.getString("name") sparkJarMap += ("id" -> rs.getString("id")) @@ -1046,12 +1064,12 @@ object H2Util { sparkJarMap } - def getSparkJarOn() : List[String] ={ + def getSparkJarOn(): List[String] = { var pluginList = List[String]() val statement = getConnectionInstance().createStatement() - val rs : ResultSet = statement.executeQuery("select * from sparkJar where state='" + SparkJarState.ON + "'") - while(rs.next()){ + val rs: ResultSet = 
statement.executeQuery("select * from sparkJar where state='" + SparkJarState.ON + "'") + while (rs.next()) { pluginList = rs.getString("name") +: pluginList } rs.close() @@ -1084,7 +1102,7 @@ object H2Util { case ex => println(ex) }*/ val needStopSchedule = H2Util.getNeedStopSchedule() - if (args.size != 1){ + if (args.size != 1) { println("Error args!!! Please enter Clean or UpdateToVersion6") } /*val operation = args(0) diff --git a/piflow-core/src/main/scala/cn/piflow/util/UnstructuredUtils.scala b/piflow-core/src/main/scala/cn/piflow/util/UnstructuredUtils.scala new file mode 100644 index 00000000..a43e0af9 --- /dev/null +++ b/piflow-core/src/main/scala/cn/piflow/util/UnstructuredUtils.scala @@ -0,0 +1,138 @@ +package cn.piflow.util + +import org.apache.hadoop.conf.Configuration +import org.apache.hadoop.fs.{FileStatus, FileSystem, Path} + +import java.io.{File, IOException} +import java.nio.file.Files.isRegularFile +import java.nio.file.{Files, Paths} +import scala.:: +import scala.collection.mutable.ListBuffer + +object UnstructuredUtils { + + def unstructuredHost(): String = { +// val unstructuredHost: String = PropertyUtil.getPropertyValue("unstructured.host") + val unstructuredHost: String = PropertyUtil.getPropertyValue("server.ip") + unstructuredHost + } + + def unstructuredPort(): String = { + var unstructuredPort: String = PropertyUtil.getPropertyValue("unstructured.port") + if (unstructuredPort == null || unstructuredPort.isEmpty) unstructuredPort = "8000" + unstructuredPort + } + + + def deleteTempFile(filePath: String) = { + var result = false + FileUtil.deleteFile(filePath).recover { + case ex: Exception => + println(s"Failed to delete file $filePath: ${ex.getMessage}") + }.get + result = true + result + } + + + def extractFileNameWithExtension(filePath: String): String = { + val lastSeparatorIndex = filePath.lastIndexOf('/') + val lastBackslashIndex = filePath.lastIndexOf('\\') + val separatorIndex = Math.max(lastSeparatorIndex, lastBackslashIndex) + if (separatorIndex == -1) { + filePath // 如果没有找到分隔符,则整个字符串就是文件名 + } else { + filePath.substring(separatorIndex + 1) // 从分隔符后面开始截取,得到文件名 + } + } + + def downloadFileFromHdfs(filePath: String) = { + var result = false + //先检验file是否已经存在在本地 + val localFilePath = FileUtil.LOCAL_FILE_PREFIX + FileUtil.extractFileNameWithExtension(filePath) + val exists = FileUtil.exists(localFilePath) + if (!exists) { + val hdfsFS = PropertyUtil.getPropertyValue("fs.defaultFS") + result = FileUtil.downloadFileFromHdfs(hdfsFS, filePath) + } else { + result = true + } + result + } + + def downloadFilesFromHdfs(hdfsFilePath: String) = { + // val hdfsFilePath = "/test;/test1/a.pdf" // HDFS路径,用分号隔开 + val localDir = FileUtil.LOCAL_FILE_PREFIX + IdGenerator.uuid() // 本地服务器目录 + val hdfsPaths = hdfsFilePath.split(";") // 将路径用分号分割成数组 + val conf = new Configuration() + conf.set("fs.defaultFS", PropertyUtil.getPropertyValue("fs.defaultFS")) // 设置HDFS的namenode地址 + val fs = FileSystem.get(conf) + hdfsPaths.foreach { hdfsPath => + downloadFiles(fs, new Path(hdfsPath), new Path(localDir)) + } + fs.close() + localDir + } + + def downloadFiles(fs: FileSystem, srcPath: Path, localDir: Path): Unit = { + val status = fs.getFileStatus(srcPath) + if (status.isDirectory) { + val files = fs.listStatus(srcPath) + files.foreach { fileStatus => + val path = fileStatus.getPath + if (fileStatus.isFile) { + val localFilePath = new Path(localDir, path.getName) + fs.copyToLocalFile(false, path, localFilePath) + } else { + val newLocalDir = new Path(localDir, path.getName) + 
downloadFiles(fs, path, newLocalDir) + } + } + } else { + val localFilePath = new Path(localDir, srcPath.getName) + fs.copyToLocalFile(false, srcPath, localFilePath) + } + } + + def getLocalFilePaths(filePaths: String) = { + val paths = filePaths.split(";").toList + paths.flatMap { path => + val file = new File(path) + if (file.exists) { + if (file.isDirectory) { + listFiles(file) + } else { + List(file.getAbsolutePath).filterNot(_.endsWith(".crc")) + } + } else { + println("filePath is empty") + List.empty + } + }.filterNot(_.endsWith(".crc")) + } + + def listFiles(directory: File): List[String] = { + if (directory.exists && directory.isDirectory) { + directory.listFiles.flatMap { + case f if f.isFile => List(f.getAbsolutePath) + case d if d.isDirectory => listFiles(d) + }.toList.filterNot(_.endsWith(".crc")) // filter out file paths ending with .crc + } else { + List.empty + } + } + + + def deleteTempFiles(localDir: String) = { + val directory = new File(localDir) + if (directory.exists() && directory.isDirectory) { + directory.listFiles.foreach { file => + file.delete() + } + directory.delete() + println(s"Directory $localDir deleted successfully") + } else { + println(s"Directory $localDir does not exist or is not a directory") + } + } +} diff --git a/piflow-core/src/test/scala/FlowTest.scala b/piflow-core/src/test/scala/FlowTest.scala index 9f62c3ec..7e678321 100644 --- a/piflow-core/src/test/scala/FlowTest.scala +++ b/piflow-core/src/test/scala/FlowTest.scala @@ -1,11 +1,10 @@ import java.io.{File, FileInputStream, FileOutputStream} import java.util.Date import java.util.concurrent.TimeUnit - import cn.piflow._ import cn.piflow.lib._ import cn.piflow.lib.io._ -import cn.piflow.util.ScriptEngine +import cn.piflow.util.{SciDataFrame, ScriptEngine} import org.apache.commons.io.{FileUtils, IOUtils} import org.apache.spark.sql.SparkSession import org.junit.Test @@ -369,7 +368,7 @@ class PipedReadTextFile extends Stop { def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { val spark = pec.get[SparkSession](); val df = spark.read.json("../testdata/honglou.txt"); - out.write(df); + out.write(new SciDataFrame(df)); } def initialize(ctx: ProcessContext): Unit = { @@ -381,13 +380,13 @@ class PipedCountWords extends Stop { def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { val spark = pec.get[SparkSession](); import spark.implicits._ - val df = in.read(); + val df = in.read().getSparkDf; val count = df.as[String] .map(_.replaceAll("[\\x00-\\xff]|,|。|:|.|“|”|?|!| ", "")) .flatMap(s => s.zip(s.drop(1)).map(t => "" + t._1 + t._2)) .groupBy("value").count.sort($"count".desc); - out.write(count); + out.write(new SciDataFrame(count)); } def initialize(ctx: ProcessContext): Unit = { @@ -401,10 +400,10 @@ class PipedPrintCount extends Stop { def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { val spark = pec.get[SparkSession](); import spark.implicits._ - val df = in.read(); + val df = in.read().getSparkDf; val count = df.sort($"count".desc); count.show(40); - out.write(df); + out.write(new SciDataFrame(df)); } def initialize(ctx: ProcessContext): Unit = { @@ -416,7 +415,7 @@ class PrintDataFrameStop extends Stop { def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { val spark = pec.get[SparkSession](); - val df = in.read(); + val df = in.read().getSparkDf; df.show(40); } @@ -431,7 +430,7 @@ class TestDataGeneratorStop(seq: Seq[String]) extends Stop { override def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { val spark = pec.get[SparkSession](); import 
spark.implicits._ - out.write(seq.toDF()); + out.write(new SciDataFrame(seq.toDF())); } } @@ -439,7 +438,7 @@ class ZipStop extends Stop { override def initialize(ctx: ProcessContext): Unit = {} override def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { - out.write(in.read("data1").union(in.read("data2"))); + out.write(new SciDataFrame(in.read("data1").getSparkDf.union(in.read("data2").getSparkDf))); } } @@ -447,10 +446,10 @@ class ForkStop extends Stop { override def initialize(ctx: ProcessContext): Unit = {} override def perform(in: JobInputStream, out: JobOutputStream, pec: JobContext): Unit = { - val ds = in.read(); + val ds = in.read().getSparkDf; val spark = pec.get[SparkSession](); import spark.implicits._ - out.write("data1", ds.as[String].filter(_.head % 2 == 0).toDF()); - out.write("data2", ds.as[String].filter(_.head % 2 == 1).toDF()); + out.write("data1", new SciDataFrame(ds.as[String].filter(_.head % 2 == 0).toDF())); + out.write("data2", new SciDataFrame(ds.as[String].filter(_.head % 2 == 1).toDF())); } } \ No newline at end of file diff --git a/piflow-docker-divided/piflow-server/config.properties b/piflow-docker-divided/piflow-server/config.properties index f8ab0be5..59e870e2 100644 --- a/piflow-docker-divided/piflow-server/config.properties +++ b/piflow-docker-divided/piflow-server/config.properties @@ -3,10 +3,10 @@ spark.master=yarn spark.deploy.mode=cluster #hdfs default file system -fs.defaultFS=hdfs://10.0.86.191:9000 +fs.defaultFS=hdfs://172.18.39.41:9000 #yarn resourcemanager.hostname -yarn.resourcemanager.hostname=10.0.86.191 +yarn.resourcemanager.hostname=172.18.39.41 #if you want to use hive, set hive metastore uris #hive.metastore.uris=thrift://10.0.88.71:9083 diff --git a/piflow-server/src/main/scala/cn/piflow/api/API.scala b/piflow-server/src/main/scala/cn/piflow/api/API.scala index d7af0371..0aec61a1 100644 --- a/piflow-server/src/main/scala/cn/piflow/api/API.scala +++ b/piflow-server/src/main/scala/cn/piflow/api/API.scala @@ -1,17 +1,11 @@ package cn.piflow.api -import java.io.{ByteArrayInputStream, ByteArrayOutputStream, File, FileOutputStream} -import java.net.URI -import java.text.SimpleDateFormat -import java.util.{Date, Properties} -import java.util.concurrent.CountDownLatch import cn.piflow.conf.VisualizationType -import org.apache.spark.sql.SparkSession -import cn.piflow.conf.util.{ClassUtil, MapUtil, OptionUtil, PluginManager, ScalaExecutorUtil} -import cn.piflow.{GroupExecution, Process, Runner} import cn.piflow.conf.bean.{FlowBean, GroupBean} +import cn.piflow.conf.util.{ClassUtil, MapUtil, OptionUtil, PluginManager} import cn.piflow.util.HdfsUtil.{getJsonMapList, getLine} import cn.piflow.util._ +import cn.piflow.{GroupExecution, Process, Runner} import org.apache.hadoop.conf.Configuration import org.apache.hadoop.fs.{FSDataInputStream, FileStatus, FileSystem, Path} import org.apache.hadoop.io.IOUtils @@ -20,20 +14,26 @@ import org.apache.http.entity.StringEntity import org.apache.http.impl.client.HttpClients import org.apache.http.util.EntityUtils import org.apache.spark.launcher.SparkAppHandle +import org.apache.spark.sql.SparkSession +import java.io.{ByteArrayInputStream, ByteArrayOutputStream, File} +import java.text.SimpleDateFormat +import java.util.Date +import java.util.concurrent.CountDownLatch import java.util.zip.{ZipEntry, ZipOutputStream} +import scala.collection.mutable.{HashMap, ListBuffer} import scala.util.control.Breaks._ -import scala.collection.mutable.{HashMap, ListBuffer, Map => MMap} object 
API { - def addSparkJar(addSparkJarName: String) : String = { + + def addSparkJar(addSparkJarName: String): String = { var id = "" val sparkJarFile = new File(PropertyUtil.getSpartJarPath()) val jarFile = FileUtil.getJarFile(sparkJarFile) - breakable{ - jarFile.foreach( i => { - if(i.getName.equals(addSparkJarName)) { + breakable { + jarFile.foreach(i => { + if (i.getName.equals(addSparkJarName)) { id = H2Util.addSparkJar(addSparkJarName) break @@ -42,12 +42,13 @@ object API { } id } - def removeSparkJar(sparkJarId : String) : Boolean = { + + def removeSparkJar(sparkJarId: String): Boolean = { var result = false val sparkJarState = H2Util.removeSparkJar(sparkJarId) - if(sparkJarState == SparkJarState.ON){ + if (sparkJarState == SparkJarState.ON) { false - }else{ + } else { true } @@ -58,13 +59,13 @@ object API { pluginInfo }*/ - def addPlugin(pluginManager:PluginManager, pluginName : String) : String = { + def addPlugin(pluginManager: PluginManager, pluginName: String): String = { var id = "" val classpathFile = new File(pluginManager.getPluginPath()) val jarFile = FileUtil.getJarFile(classpathFile) - breakable{ - jarFile.foreach( i => { - if(i.getName.equals(pluginName)) { + breakable { + jarFile.foreach(i => { + if (i.getName.equals(pluginName)) { pluginManager.unloadPlugin(i.getAbsolutePath) pluginManager.loadPlugin(i.getAbsolutePath) @@ -76,16 +77,16 @@ object API { id } - def removePlugin(pluginManager:PluginManager, pluginId : String) : Boolean = { + def removePlugin(pluginManager: PluginManager, pluginId: String): Boolean = { var result = false - val pluginName = H2Util.getPluginInfoMap(pluginId).getOrElse("name","") - if(pluginName != ""){ + val pluginName = H2Util.getPluginInfoMap(pluginId).getOrElse("name", "") + if (pluginName != "") { val classpathFile = new File(pluginManager.getPluginPath()) val jarFile = FileUtil.getJarFile(classpathFile) - breakable{ - jarFile.foreach( i => { + breakable { + jarFile.foreach(i => { println(i.getAbsolutePath) - if(i.getName.equals(pluginName)) { + if (i.getName.equals(pluginName)) { pluginManager.unloadPlugin(i.getAbsolutePath) H2Util.removePlugin(pluginName) result = true @@ -98,12 +99,12 @@ object API { result } - def getPluginInfo(pluginId : String) : String = { + def getPluginInfo(pluginId: String): String = { val pluginInfo = H2Util.getPluginInfo(pluginId) pluginInfo } - def getConfigurableStopInPlugin(pluginManager:PluginManager, pluginName : String) : String = { + def getConfigurableStopInPlugin(pluginManager: PluginManager, pluginName: String): String = { var bundleList = List[String]() val stops = pluginManager.getPluginConfigurableStops(pluginName) stops.foreach(s => { @@ -113,7 +114,7 @@ object API { """{"bundles":"""" + bundleList.mkString(",") + """"}""" } - def getConfigurableStopInfoInPlugin(pluginManager:PluginManager, pluginName : String) : String = { + def getConfigurableStopInfoInPlugin(pluginManager: PluginManager, pluginName: String): String = { var bundleList = List[String]() val stops = pluginManager.getPluginConfigurableStops(pluginName) stops.foreach(s => { @@ -123,33 +124,33 @@ object API { jsonString } - def getResourceInfo() : String = { + def getResourceInfo(): String = { - try{ + try { val matricsURL = ConfigureUtil.getYarnResourceMatrics() val client = HttpClients.createDefault() - val get:HttpGet = new HttpGet(matricsURL) + val get: HttpGet = new HttpGet(matricsURL) - val response:CloseableHttpResponse = client.execute(get) + val response: CloseableHttpResponse = client.execute(get) val entity = 
response.getEntity - val str = EntityUtils.toString(entity,"UTF-8") -// val yarnInfo = OptionUtil.getAny(JSON.parseFull(str)).asInstanceOf[Map[String, Any]] + val str = EntityUtils.toString(entity, "UTF-8") + // val yarnInfo = OptionUtil.getAny(JSON.parseFull(str)).asInstanceOf[Map[String, Any]] val yarnInfo = JsonUtil.jsonToMap(str) val matricInfo = MapUtil.get(yarnInfo, "clusterMetrics").asInstanceOf[Map[String, Any]] - val totalVirtualCores = matricInfo.getOrElse("totalVirtualCores",""); - val allocatedVirtualCores = matricInfo.getOrElse("allocatedVirtualCores",""); - val remainingVirtualCores = totalVirtualCores.asInstanceOf[Double] - allocatedVirtualCores.asInstanceOf[Double]; + val totalVirtualCores = matricInfo.getOrElse("totalVirtualCores", ""); + val allocatedVirtualCores = matricInfo.getOrElse("allocatedVirtualCores", "") + val remainingVirtualCores = totalVirtualCores.toString.toDouble - allocatedVirtualCores.toString.toDouble; val cpuInfo = Map( "totalVirtualCores" -> totalVirtualCores, "allocatedVirtualCores" -> allocatedVirtualCores, "remainingVirtualCores" -> remainingVirtualCores ) - val totalMemoryGB = matricInfo.getOrElse("totalMB","").asInstanceOf[Double]/1024; - val allocatedMemoryGB = matricInfo.getOrElse("allocatedMB","").asInstanceOf[Double]/1024; - val remainingMemoryGB = totalMemoryGB - allocatedMemoryGB; + val totalMemoryGB = matricInfo.getOrElse("totalMB", "").toString.toDouble / 1024; + val allocatedMemoryGB = matricInfo.getOrElse("allocatedMB", "").toString.toDouble / 1024; + val remainingMemoryGB = totalMemoryGB - allocatedMemoryGB; val memoryInfo = Map( "totalMemoryGB" -> totalMemoryGB, "allocatedMemoryGB" -> allocatedMemoryGB, @@ -161,24 +162,24 @@ object API { val resultMap = Map("resource" -> map) JsonUtil.format(JsonUtil.toJson(resultMap)) - }catch{ - case ex:Exception => "" + } catch { + case ex: Exception => "" } } - def getScheduleInfo(scheduleId : String) : String = { + def getScheduleInfo(scheduleId: String): String = { val scheduleInfo = H2Util.getScheduleInfo(scheduleId) scheduleInfo } - def startGroup(groupJson : String) = { + def startGroup(groupJson: String) = { - println("StartGroup API get json: \n" + groupJson ) + println("StartGroup API get json: \n" + groupJson) - var appId:String = null -// val map = OptionUtil.getAny(JSON.parseFull(groupJson)).asInstanceOf[Map[String, Any]] + var appId: String = null + // val map = OptionUtil.getAny(JSON.parseFull(groupJson)).asInstanceOf[Map[String, Any]] val map = JsonUtil.jsonToMap(groupJson) val flowGroupMap = MapUtil.get(map, "group").asInstanceOf[Map[String, Any]] @@ -187,32 +188,33 @@ object API { val group = groupBean.constructGroup() val flowGroupExecution = Runner.create() - .bind("checkpoint.path",ConfigureUtil.getCheckpointPath()) - .bind("debug.path",ConfigureUtil.getDebugPath()) + .bind("checkpoint.path", ConfigureUtil.getCheckpointPath()) + .bind("debug.path", ConfigureUtil.getDebugPath()) .start(group); flowGroupExecution } - def stopGroup(flowGroupExecution : GroupExecution): String ={ + def stopGroup(flowGroupExecution: GroupExecution): String = { flowGroupExecution.stop() "ok" } - def getFlowGroupInfo(groupId : String) : String = { + def getFlowGroupInfo(groupId: String): String = { val flowGroupInfo = H2Util.getFlowGroupInfo(groupId) flowGroupInfo } - def getFlowGroupProgress(flowGroupID : String) : String = { + + def getFlowGroupProgress(flowGroupID: String): String = { val progress = H2Util.getGroupProgressPercent(flowGroupID) progress } - def startFlow(flowJson : 
String):(String,SparkAppHandle) = { + def startFlow(flowJson: String): (String, SparkAppHandle) = { - var appId:String = null -// val flowMap = OptionUtil.getAny(JSON.parseFull(flowJson)).asInstanceOf[Map[String, Any]] - val flowMap = JsonUtil.jsonToMap(flowJson) + var appId: String = null + // val flowMap = OptionUtil.getAny(JSON.parseFull(flowJson)).asInstanceOf[Map[String, Any]] + val flowMap = JsonUtil.jsonToMap(flowJson) //create flow @@ -223,34 +225,35 @@ object API { val appName = flow.getFlowName() val (stdout, stderr) = getLogFile(uuid, appName) - println("StartFlow API get json: \n" + flowJson ) + println("StartFlow API get json: \n" + flowJson) val countDownLatch = new CountDownLatch(1) - val handle = FlowLauncher.launch(flow).startApplication( new SparkAppHandle.Listener { + val handle = FlowLauncher.launch(flow).startApplication(new SparkAppHandle.Listener { override def stateChanged(handle: SparkAppHandle): Unit = { appId = handle.getAppId val sparkAppState = handle.getState - if(appId != null){ + if (appId != null) { println("Spark job with app id: " + appId + ",\t State changed to: " + sparkAppState) - }else{ + } else { println("Spark job's state changed to: " + sparkAppState) } - if (handle.getState().isFinal){ + if (handle.getState().isFinal) { countDownLatch.countDown() println("Task is finished!") } } + override def infoChanged(handle: SparkAppHandle): Unit = { //println("Info:" + handle.getState().toString) } }) - while (handle.getAppId == null){ + while (handle.getAppId == null) { Thread.sleep(100) } - while (!H2Util.isFlowExist(handle.getAppId)){ + while (!H2Util.isFlowExist(handle.getAppId)) { Thread.sleep(1000) } appId = handle.getAppId @@ -259,7 +262,7 @@ object API { } - def stopFlow(appID : String, process : SparkAppHandle) : String = { + def stopFlow(appID: String, process: SparkAppHandle): String = { //yarn application kill appId stopFlowOnYarn(appID) @@ -274,79 +277,79 @@ object API { "ok" } - def stopFlowOnYarn(appID : String) : String = { + def stopFlowOnYarn(appID: String): String = { //yarn application kill appId val url = ConfigureUtil.getYarnResourceManagerWebAppAddress() + appID + "/state" val client = HttpClients.createDefault() - val put:HttpPut = new HttpPut(url) - val body ="{\"state\":\"KILLED\"}" + val put: HttpPut = new HttpPut(url) + val body = "{\"state\":\"KILLED\"}" put.addHeader("Content-Type", "application/json") put.setEntity(new StringEntity(body)) - val response:CloseableHttpResponse = client.execute(put) + val response: CloseableHttpResponse = client.execute(put) val entity = response.getEntity - val str = EntityUtils.toString(entity,"UTF-8") + val str = EntityUtils.toString(entity, "UTF-8") str } - def getFlowInfo(appID : String) : String = { + def getFlowInfo(appID: String): String = { val flowInfo = H2Util.getFlowInfo(appID) flowInfo } - def getFlowProgress(appID : String) : String = { + def getFlowProgress(appID: String): String = { val progress = H2Util.getFlowProgress(appID) progress } - def getFlowYarnInfo(appID : String) : String = { + def getFlowYarnInfo(appID: String): String = { val url = ConfigureUtil.getYarnResourceManagerWebAppAddress() + appID val client = HttpClients.createDefault() - val get:HttpGet = new HttpGet(url) + val get: HttpGet = new HttpGet(url) - val response:CloseableHttpResponse = client.execute(get) + val response: CloseableHttpResponse = client.execute(get) val entity = response.getEntity - val str = EntityUtils.toString(entity,"UTF-8") + val str = EntityUtils.toString(entity, "UTF-8") str } - def 
getFlowCheckpoint(appId:String) : String = { + def getFlowCheckpoint(appId: String): String = { val checkpointPath = ConfigureUtil.getCheckpointPath().stripSuffix("/") + "/" + appId val checkpointList = HdfsUtil.getFiles(checkpointPath) """{"checkpoints":"""" + checkpointList.mkString(",") + """"}""" } - def getFlowDebugData(appId : String, stopName : String, port : String) : String = { + def getFlowDebugData(appId: String, stopName: String, port: String): String = { - val debugPath :String = ConfigureUtil.getDebugPath().stripSuffix("/") + "/" + appId + "/" + stopName + "/" + port; + val debugPath: String = ConfigureUtil.getDebugPath().stripSuffix("/") + "/" + appId + "/" + stopName + "/" + port; val schema = HdfsUtil.getLine(debugPath + "_schema") - val result ="{\"schema\":\"" + schema+ "\", \"debugDataPath\": \""+ debugPath + "\"}" + val result = "{\"schema\":\"" + schema + "\", \"debugDataPath\": \"" + debugPath + "\"}" result } - def getFlowVisualizationData(appId : String, stopName : String, visualizationType : String) : String = { + def getFlowVisualizationData(appId: String, stopName: String, visualizationType: String): String = { var dimensionMap = Map[String, List[String]]() - val visuanlizationPath :String = ConfigureUtil.getVisualizationPath().stripSuffix("/") + "/" + appId + "/" + stopName + "/" + val visuanlizationPath: String = ConfigureUtil.getVisualizationPath().stripSuffix("/") + "/" + appId + "/" + stopName + "/" val visualizationSchema = getLine(visuanlizationPath + "/schema") val schemaArray = visualizationSchema.split(",") val jsonMapList = getJsonMapList(visuanlizationPath + "/data") - if(VisualizationType.LineChart == visualizationType || - VisualizationType.Histogram == visualizationType ){ + if (VisualizationType.LineChart == visualizationType || + VisualizationType.Histogram == visualizationType) { - var visualizationTuple = List[Tuple2[String,String]]() + var visualizationTuple = List[Tuple2[String, String]]() - val jsonTupleList = jsonMapList.flatMap( map => map.toSeq) + val jsonTupleList = jsonMapList.flatMap(map => map.toSeq) val visualizationInfo = jsonTupleList.groupBy(_._1) visualizationInfo.foreach(dimension => { var valueList = List[String]() val dimensionList = dimension._2 - dimensionList.foreach( dimensionAndCountPair => { + dimensionList.foreach(dimensionAndCountPair => { val v = String.valueOf(dimensionAndCountPair._2) println(v) valueList = valueList :+ v @@ -357,10 +360,12 @@ object API { var lineChartMap = Map[String, Any]() var legend = List[String]() val x = schemaArray(0) - lineChartMap += {"xAxis" -> Map("type" -> x, "data" -> OptionUtil.getAny(dimensionMap.get(schemaArray(0))) )} + lineChartMap += { + "xAxis" -> Map("type" -> x, "data" -> OptionUtil.getAny(dimensionMap.get(schemaArray(0)))) + } //lineChartMap += {"yAxis" -> Map("type" -> "value")} var seritesList = List[Map[String, Any]]() - dimensionMap.filterKeys(!_.equals(x)).foreach(item =>{ + dimensionMap.filterKeys(!_.equals(x)).foreach(item => { val name_action = item._1 val data = item._2 val name = name_action.split("_")(0) @@ -369,17 +374,21 @@ object API { case VisualizationType.LineChart => "line" case VisualizationType.Histogram => "bar" } - val map = Map("name" -> name, "type" -> vType,"stack" -> action, "data" -> data) + val map = Map("name" -> name, "type" -> vType, "stack" -> action, "data" -> data) seritesList = map +: seritesList legend = name +: legend }) - lineChartMap += {"series" -> seritesList} - lineChartMap += {"legent" -> legend} + lineChartMap += { + "series" -> 
seritesList + } + lineChartMap += { + "legent" -> legend + } val visualizationJsonData = JsonUtil.format(JsonUtil.toJson(lineChartMap)) println(visualizationJsonData) visualizationJsonData - }else if (VisualizationType.ScatterPlot == visualizationType){ - var visualizationTuple = List[Tuple2[String,String]]() + } else if (VisualizationType.ScatterPlot == visualizationType) { + var visualizationTuple = List[Tuple2[String, String]]() val legendColumn = schemaArray(0) val abscissaColumn = schemaArray(1) @@ -387,30 +396,30 @@ object API { //get legend - val legendList = jsonMapList.map(item =>{ - item.getOrElse(legendColumn,"").asInstanceOf[String] + val legendList = jsonMapList.map(item => { + item.getOrElse(legendColumn, "").asInstanceOf[String] }).distinct //get schema - val newSchema = schemaArray.filter(_ != legendColumn ) + val newSchema = schemaArray.filter(_ != legendColumn) val schemaList = ListBuffer[Map[String, Any]]() var index = 0 - newSchema.foreach(column =>{ - val schemaMap = Map("name" -> column, "index" -> index, "text" ->column) + newSchema.foreach(column => { + val schemaMap = Map("name" -> column, "index" -> index, "text" -> column) schemaList.append(schemaMap) index = index + 1 }) //get series val seriesList = ListBuffer[Map[String, Any]]() - legendList.foreach( legend => { + legendList.foreach(legend => { var legendDataList = ListBuffer[List[String]]() jsonMapList.foreach(item => { - if(item.getOrElse(legendColumn,"").asInstanceOf[String].equals(legend)){ + if (item.getOrElse(legendColumn, "").asInstanceOf[String].equals(legend)) { var dataList = ListBuffer[String]() - newSchema.foreach(column =>{ - val value = item.getOrElse(column,"").asInstanceOf[String] + newSchema.foreach(column => { + val value = item.getOrElse(column, "").asInstanceOf[String] dataList.append(value) }) legendDataList.append(dataList.toList) @@ -421,33 +430,33 @@ object API { seriesList.append(legendMap) }) - val resultMap = Map[String, Any]("legend" -> legendList, "schema" -> schemaList.toList, "series" -> seriesList.toList) + val resultMap = Map[String, Any]("legend" -> legendList, "schema" -> schemaList.toList, "series" -> seriesList.toList) val visualizationJsonData = JsonUtil.format(JsonUtil.toJson(resultMap)) println(visualizationJsonData) visualizationJsonData - }else if(VisualizationType.PieChart == visualizationType ){ + } else if (VisualizationType.PieChart == visualizationType) { var legend = List[String]() val schemaArray = visualizationSchema.split(",") - val schemaReplaceMap = Map(schemaArray(1)->"value", schemaArray(0)->"name") + val schemaReplaceMap = Map(schemaArray(1) -> "value", schemaArray(0) -> "name") val jsonMapList = getJsonMapList(visuanlizationPath + "/data") var pieChartList = List[Map[String, Any]]() jsonMapList.foreach(map => { var lineMap = Map[String, Any]() - for(i <- 0 to schemaArray.size-1){ + for (i <- 0 to schemaArray.size - 1) { val column = schemaArray(i) - lineMap += (schemaReplaceMap.getOrElse(column,"")-> map.getOrElse(column,"")) + lineMap += (schemaReplaceMap.getOrElse(column, "") -> map.getOrElse(column, "")) } pieChartList = lineMap +: pieChartList }) - pieChartList.foreach( item => { - legend = item.getOrElse("name","").toString +: legend + pieChartList.foreach(item => { + legend = item.getOrElse("name", "").toString +: legend }) val pieChartMap = Map("legend" -> legend, "series" -> pieChartList) val visualizationJsonData = JsonUtil.format(JsonUtil.toJson(pieChartMap)) println(visualizationJsonData) visualizationJsonData - }else 
if(VisualizationType.Table == visualizationType ){ + } else if (VisualizationType.Table == visualizationType) { //println(visualizationSchema) //println(jsonMapList) val resultMap = Map[String, Any]("schema" -> schemaArray.toList, "data" -> jsonMapList) @@ -455,20 +464,20 @@ object API { println(visualizationJsonData) visualizationJsonData } - else{ + else { "" } } - def getStopInfo(bundle : String) : String = { - try{ + def getStopInfo(bundle: String): String = { + try { val str = ClassUtil.findConfigurableStopInfo(bundle) str - }catch{ - case ex : Exception => println(ex);throw ex + } catch { + case ex: Exception => println(ex); throw ex } } @@ -478,24 +487,24 @@ object API { """{"groups":"""" + groups + """"}""" } - def getAllStops() : String = { - var stops : List[String] = List() + def getAllStops(): String = { + var stops: List[String] = List() val stopList = ClassUtil.findAllConfigurableStop() - stopList.foreach(s => stops = s.getClass.getName +: stops ) + stopList.foreach(s => stops = s.getClass.getName +: stops) """{"stops":"""" + stops.mkString(",") + """"}""" } - def getAllStopsWithGroup() : String = { + def getAllStopsWithGroup(): String = { - var resultList:List[String] = List() - var stops = List[Tuple2[String,String]]() + var resultList: List[String] = List() + var stops = List[Tuple2[String, String]]() val configurableStopList = ClassUtil.findAllConfigurableStop() configurableStopList.foreach(s => { //generate (group,bundle) pair and put into stops val groupList = s.getGroup() groupList.foreach(group => { - val tuple = (group , s.getClass.getName) - stops = tuple +: stops + val tuple = (group, s.getClass.getName) + stops = tuple +: stops }) }) @@ -503,7 +512,7 @@ object API { val groupsInfo = stops.groupBy(_._1) groupsInfo.foreach(group => { val stopList = group._2 - stopList.foreach( groupAndstopPair => { + stopList.foreach(groupAndstopPair => { println(groupAndstopPair._1 + ":\t\t" + groupAndstopPair._2) var groupAndstop = groupAndstopPair._1 + ":" + groupAndstopPair._2 resultList = groupAndstop +: resultList @@ -522,10 +531,10 @@ object API { // "out 200" } */ - - private def getLogFile(uuid : String, appName : String) : (File,File) = { - val now : Date = new Date() - val dataFormat : SimpleDateFormat = new SimpleDateFormat("yyyy-MM-dd_HH:mm:ss") + + private def getLogFile(uuid: String, appName: String): (File, File) = { + val now: Date = new Date() + val dataFormat: SimpleDateFormat = new SimpleDateFormat("yyyy-MM-dd_HH:mm:ss") val nowDate = dataFormat.format(now) val stdoutPathString = PropertyUtil.getPropertyValue("log.path") + "/" + appName + "_" + uuid + "_stdout_" + nowDate @@ -537,25 +546,25 @@ object API { (stdout, stderr) } - def getHdfsDataByPath(hdfsPath:String) : ByteArrayInputStream={ + def getHdfsDataByPath(hdfsPath: String): ByteArrayInputStream = { val conf = new Configuration() conf.set("fs.defaultFS", PropertyUtil.getPropertyValue("fs.defaultFS")) val fs: FileSystem = FileSystem.get(conf) val fileStatusArr: Array[FileStatus] = fs.listStatus(new Path(hdfsPath)) val map = HashMap[String, FSDataInputStream]() for (elem <- fileStatusArr) { - val name =elem.getPath.getName + val name = elem.getPath.getName val inputStream = fs.open(elem.getPath) - map.put(name,inputStream) + map.put(name, inputStream) } val byteArrayOutputStream = new ByteArrayOutputStream() val zos = new ZipOutputStream(byteArrayOutputStream) - var zipEntry:ZipEntry = null + var zipEntry: ZipEntry = null for (elem <- map) { zipEntry = new ZipEntry(elem._1) zos.putNextEntry(zipEntry) - 
IOUtils.copyBytes(elem._2,zos,1024*1024*50,false) + IOUtils.copyBytes(elem._2, zos, 1024 * 1024 * 50, false) zos.closeEntry() } zos.close() @@ -566,7 +575,7 @@ } -class WaitProcessTerminateRunnable(spark : SparkSession, process: Process) extends Runnable { +class WaitProcessTerminateRunnable(spark: SparkSession, process: Process) extends Runnable { override def run(): Unit = { process.awaitTermination() //spark.close() diff --git a/piflow-server/src/main/scala/cn/piflow/api/HTTPClientStartMockDataFlow.scala b/piflow-server/src/main/scala/cn/piflow/api/HTTPClientStartMockDataFlow.scala index c661c185..43f32155 100644 --- a/piflow-server/src/main/scala/cn/piflow/api/HTTPClientStartMockDataFlow.scala +++ b/piflow-server/src/main/scala/cn/piflow/api/HTTPClientStartMockDataFlow.scala @@ -56,8 +56,109 @@ object HTTPClientStartMockDataFlow { | } |} """.stripMargin - - val url = "http://10.0.85.83:8001/flow/start" + val json_2 = + """ + |{ + | "flow": { + | "name": "Example", + | "executorMemory": "1g", + | "executorNumber": "1", + | "uuid": "8a80d63f720cdd2301723a4e679e2457", + | "paths": [ + | { + | "inport": "", + | "from": "CsvParser", + | "to": "CsvSave", + | "outport": "" + | } + | ], + | "executorCores": "1", + | "driverMemory": "1g", + | "stops": [ + | { + | "name": "CsvSave", + | "bundle": "cn.piflow.bundle.csv.CsvSave", + | "uuid": "8a80d63f720cdd2301723a4e67a52467", + | "properties": { + | "csvSavePath": "hdfs://172.18.39.41:9000/user/Yomi/test1.csv", + | "partition": "", + | "header": "false", + | "saveMode": "append", + | "delimiter": "," + | }, + | "customizedProperties": { + | + | } + | }, + | { + | "name": "CsvParser", + | "bundle": "cn.piflow.bundle.csv.CsvParser", + | "uuid": "8a80d63f720cdd2301723a4e67a82470", + | "properties": { + | "schema": "title,author,pages", + | "csvPath": "hdfs://172.18.39.41:9000/user/Yomi/test.csv", + | "delimiter": ",", + | "header": "false" + | }, + | "customizedProperties": { + | + | } + | } + | ] + | } + |} + | + |""".stripMargin + val json3 = + """ + |{ + | "flow": { + | "name": "Example", + | "executorMemory": "1g", + | "executorNumber": "1", + | "uuid": "8a80d63f720cdd2301723a4e679e2457", + | "paths": [ + | { + | "inport": "", + | "from": "CsvParser", + | "to": "ArrowFlightOut", + | "outport": "" + | } + | ], + | "executorCores": "1", + | "driverMemory": "1g", + | "stops": [ + | { + | "name": "ArrowFlightOut", + | "bundle": "cn.piflow.bundle.arrowflight.ArrowFlightOut", + | "uuid": "8a80d63f720cdd2301723a4e67a52467", + | "properties": { + | "outputIp": "127.0.0.1" + | }, + | "customizedProperties": { + | + | } + | }, + | { + | "name": "CsvParser", + | "bundle": "cn.piflow.bundle.csv.CsvParser", + | "uuid": "8a80d63f720cdd2301723a4e67a82470", + | "properties": { + | "schema": "title,author,pages", + | "csvPath": "hdfs://172.18.39.41:9000/user/Yomi/test.csv", + | "delimiter": ",", + | "header": "false" + | }, + | "customizedProperties": { + | + | } + | } + | ] + | } + |} + | + |""".stripMargin + val url = "http://172.18.32.1:8002/flow/start" val timeout = 1800 val requestConfig = RequestConfig.custom() .setConnectTimeout(timeout*1000) @@ -68,7 +169,7 @@ val post:HttpPost = new HttpPost(url) post.addHeader("Content-Type", "application/json") - post.setEntity(new StringEntity(json)) + post.setEntity(new StringEntity(json3)) val response:CloseableHttpResponse = client.execute(post) diff --git a/piflow-server/src/main/scala/cn/piflow/api/HTTPClientStartScalaFlow.scala 
b/piflow-server/src/main/scala/cn/piflow/api/HTTPClientStartScalaFlow.scala index 6ea518ee..19010a4a 100644 --- a/piflow-server/src/main/scala/cn/piflow/api/HTTPClientStartScalaFlow.scala +++ b/piflow-server/src/main/scala/cn/piflow/api/HTTPClientStartScalaFlow.scala @@ -36,7 +36,7 @@ object HTTPClientStartScalaFlow { | "bundle":"cn.piflow.bundle.script.ExecuteScalaFile", | "properties":{ | "plugin":"stop_scalaTest_ExecuteScalaFile_4444", - | "script":"val df = in.read() \n df.createOrReplaceTempView(\"people\") \n val df1 = spark.sql(\"select * from prople where author like 'xjzhu'\") \n out.write(df1);" + | "script":"val df = in.read().getSparkDf \n df.createOrReplaceTempView(\"people\") \n val df1 = spark.sql(\"select * from prople where author like 'xjzhu'\") \n out.write(df1);" | } | }, | { diff --git a/piflow-server/src/main/scala/cn/piflow/api/HTTPService.scala b/piflow-server/src/main/scala/cn/piflow/api/HTTPService.scala index 03938274..4fb4ebbf 100644 --- a/piflow-server/src/main/scala/cn/piflow/api/HTTPService.scala +++ b/piflow-server/src/main/scala/cn/piflow/api/HTTPService.scala @@ -17,6 +17,7 @@ import cn.piflow.util._ import com.typesafe.akka.extension.quartz.QuartzSchedulerExtension import com.typesafe.config.ConfigFactory +import java.net.InetAddress import scala.concurrent.{Await, Future} //import scala.util.parsing.json.JSON @@ -27,7 +28,6 @@ import org.h2.tools.Server import spray.json.DefaultJsonProtocol import java.io.File -import java.net.InetAddress import java.text.SimpleDateFormat import java.util.Date @@ -54,9 +54,9 @@ object HTTPService extends DefaultJsonProtocol with Directives with SprayJsonSup def toJson(entity: RequestEntity): Map[String, Any] = { entity match { - case HttpEntity.Strict(_, data) =>{ -// val temp = JSON.parseFull(data.utf8String) -// temp.get.asInstanceOf[Map[String, Any]] + case HttpEntity.Strict(_, data) => { + // val temp = JSON.parseFull(data.utf8String) + // temp.get.asInstanceOf[Map[String, Any]] val temp = JsonUtil.jsonToMap(data.utf8String) temp } @@ -72,38 +72,45 @@ object HTTPService extends DefaultJsonProtocol with Directives with SprayJsonSup case HttpRequest(GET, Uri.Path("/flow/info"), headers, entity, protocol) => { - val appID = req.getUri().query().getOrElse("appID","") - if(!appID.equals("")){ - //server state in h2db - var result = API.getFlowInfo(appID) - println("getFlowInfo result: " + result) -// val resultMap = OptionUtil.getAny(JSON.parseFull(result)).asInstanceOf[Map[String, Any]] - val resultMap = JsonUtil.jsonToMap(result) - val flowInfoMap = MapUtil.get(resultMap, "flow").asInstanceOf[Map[String, Any]] - val flowState = MapUtil.get(flowInfoMap,"state").asInstanceOf[String] - - //yarn flow state - val flowYarnInfoJson = API.getFlowYarnInfo(appID) -// val map = OptionUtil.getAny(JSON.parseFull(flowYarnInfoJson)).asInstanceOf[Map[String, Any]] - val map = JsonUtil.jsonToMap(flowYarnInfoJson) - val yanrFlowInfoMap = MapUtil.get(map, "app").asInstanceOf[Map[String, Any]] - val name = MapUtil.get(yanrFlowInfoMap,"name").asInstanceOf[String] - val flowYarnState = MapUtil.get(yanrFlowInfoMap,"state").asInstanceOf[String] + val appID = req.getUri().query().getOrElse("appID", "") + if (!appID.equals("")) { + //server state in h2db + var result = API.getFlowInfo(appID) + println("getFlowInfo result: " + result) + // val resultMap = OptionUtil.getAny(JSON.parseFull(result)).asInstanceOf[Map[String, Any]] + val resultMap = JsonUtil.jsonToMap(result) + val flowInfoMap = MapUtil.get(resultMap, "flow").asInstanceOf[Map[String, 
Any]] + val flowState = MapUtil.get(flowInfoMap, "state").asInstanceOf[String] + + // println("----------------getFlowYarnInfo--------------------start") + //yarn flow state + val flowYarnInfoJson = API.getFlowYarnInfo(appID) + // println("----------------getFlowYarnInfo--------------------finish") + // println("----------------getFlowYarnInfo--------------------"+flowYarnInfoJson) + + // val map = OptionUtil.getAny(JSON.parseFull(flowYarnInfoJson)).asInstanceOf[Map[String, Any]] + val map = JsonUtil.jsonToMap(flowYarnInfoJson) + val yanrFlowInfoMap = MapUtil.get(map, "app").asInstanceOf[Map[String, Any]] + val name = MapUtil.get(yanrFlowInfoMap, "name").asInstanceOf[String] + val flowYarnState = MapUtil.get(yanrFlowInfoMap, "state").asInstanceOf[String] if (flowInfoMap.contains("state")) { - val checkState = StateUtil.FlowStateCheck(flowState, flowYarnState) - if (checkState == true) { - Future.successful(HttpResponse(SUCCESS_CODE, entity = result)) - } else { - val newflowState = StateUtil.getNewFlowState(flowState, flowYarnState) - if (newflowState != flowState) { - H2Util.updateFlowState(appID, newflowState) - } - result = API.getFlowInfo(appID) - Future.successful(HttpResponse(SUCCESS_CODE, entity = result)) - } + println("----------------flowInfoMap.state--------------------" + flowState) + // val checkState = StateUtil.FlowStateCheck(flowState, flowYarnState) + // if (checkState == true) { + // Future.successful(HttpResponse(SUCCESS_CODE, entity = result)) + // } else { + // val newflowState = StateUtil.getNewFlowState(flowState, flowYarnState) + // if (newflowState != flowState) { + // H2Util.updateFlowState(appID, newflowState) + // } + // result = API.getFlowInfo(appID) + // Future.successful(HttpResponse(SUCCESS_CODE, entity = result)) + // } + Future.successful(HttpResponse(SUCCESS_CODE, entity = result)) } else if (yanrFlowInfoMap.contains("state")) { + println("----------------yanrFlowInfoMap.state--------------------" + flowYarnState) var flowInfoMap = Map[String, Any]() flowInfoMap += ("id" -> appID) flowInfoMap += ("name" -> name) @@ -268,7 +275,7 @@ object HTTPService extends DefaultJsonProtocol with Directives with SprayJsonSup // } // responseFuture - } + } case HttpRequest(POST, Uri.Path("/flow/stop"), headers, entity, protocol) => { @@ -513,7 +520,7 @@ object HTTPService extends DefaultJsonProtocol with Directives with SprayJsonSup val startDateStr = dataMap.get("startDate").getOrElse("").asInstanceOf[String] val endDateStr = dataMap.get("endDate").getOrElse("").asInstanceOf[String] val scheduleInstance = dataMap.get("schedule").getOrElse(Map[String, Any]()).asInstanceOf[Map[String, Any]] - println("scheduleInstance:"+scheduleInstance) + println("scheduleInstance:" + scheduleInstance) val id: String = "schedule_" + IdGenerator.uuid(); @@ -828,9 +835,10 @@ object HTTPService extends DefaultJsonProtocol with Directives with SprayJsonSup def run = { - val ip = InetAddress.getLocalHost.getHostAddress + // val ip = InetAddress.getLocalHost.getHostAddress + val ip = PropertyUtil.getPropertyValue("server.ip") //write ip to server.ip file - FileUtil.writeFile("server.ip=" + ip, ServerIpUtil.getServerIpFile()) + // FileUtil.writeFile("server.ip=" + ip, ServerIpUtil.getServerIpFile()) val port = PropertyUtil.getIntPropertyValue("server.port") Http().bindAndHandleAsync(route, ip, port) @@ -867,7 +875,7 @@ object HTTPService extends DefaultJsonProtocol with Directives with SprayJsonSup val scheduleList = H2Util.getStartedSchedule() scheduleList.foreach(id => { val scheduleContent 
= FlowFileUtil.readFlowFile(FlowFileUtil.getScheduleFilePath(id)) -// val dataMap = JSON.parseFull(scheduleContent).get.asInstanceOf[Map[String, Any]] + // val dataMap = JSON.parseFull(scheduleContent).get.asInstanceOf[Map[String, Any]] val dataMap = JsonUtil.jsonToMap(scheduleContent) val expression = dataMap.get("expression").getOrElse("").asInstanceOf[String] @@ -901,10 +909,18 @@ object Main { def flywayInit() = { - val ip = InetAddress.getLocalHost.getHostAddress + val ip = "127.0.0.1" + // val ip = PropertyUtil.getPropertyValue("server.ip") // Create the Flyway instance val flyway: Flyway = new Flyway(); - var url = "jdbc:h2:tcp://" + ip + ":" + PropertyUtil.getPropertyValue("h2.port") + "/~/piflow" + val h2Path: String = PropertyUtil.getPropertyValue("h2.path") + var url: String = "" + if (h2Path != null && h2Path.nonEmpty) { + url = "jdbc:h2:tcp://" + ip + ":" + PropertyUtil.getPropertyValue("h2.port") + "/~/piflow/" + h2Path + } else { + url = "jdbc:h2:tcp://" + ip + ":" + PropertyUtil.getPropertyValue("h2.port") + "/~/piflow" + } + // var url = "jdbc:h2:tcp://" + ip + ":" + PropertyUtil.getPropertyValue("h2.port") + "/~/piflow" // Point it to the database flyway.setDataSource(url, null, null); flyway.setLocations("db/migrations"); @@ -936,8 +952,8 @@ object Main { }) } - def main(argv: Array[String]):Unit = { - val h2Server = Server.createTcpServer("-tcp", "-tcpAllowOthers","-ifNotExists", "-tcpPort",PropertyUtil.getPropertyValue("h2.port")).start() + def main(argv: Array[String]): Unit = { + val h2Server = Server.createTcpServer("-tcp", "-tcpAllowOthers", "-ifNotExists", "-tcpPort", PropertyUtil.getPropertyValue("h2.port")).start() flywayInit() HTTPService.run initPlugin() diff --git "a/piflow\344\275\277\347\224\250\346\226\207\346\241\243-v1.1.md" "b/piflow\344\275\277\347\224\250\346\226\207\346\241\243-v1.1.md" deleted file mode 100644 index 394f97c1..00000000 --- "a/piflow\344\275\277\347\224\250\346\226\207\346\241\243-v1.1.md" +++ /dev/null @@ -1,674 +0,0 @@ -# 大数据流水线系统PiFlow 使用说明书v1.0 - -# 引言 - -## 编写目的 - -该文档主要用于介绍大数据流水线系统PiFlow 的使用 - -## 建设范围 - -PiFlow server 及PiFlow web的使用说明 - -## 术语 - -- PiFlow :大数据流水线系统; - -- Flow:大数据流水线; - -- Stop:大数据流水线数据处理组件; - -- Path: 每个大数据流水线数据处理组件之间的连接线; - -- Group:大数据流水线组,支持流水线/流水线组的顺序调度; - -- Template:大数据流水线模板,支持将流水线/流水线组保存成模板、下载、上传和加载; - -- DataSource:数据源,支持FTP、JDBC、ElasticSearch、Hive等数据源注册,支持自定义数据源; - -- Schedule:大数据流水线调度,支持流水线/流水线组的调度及定时调度 - -- StopsHub:组件热插拔,支持用户开发自定义组件一键上传 - -- SparkJar: spark jar依赖包管理 - -# 项目概述 - -大数据流水线系统PiFlow 主要是针对大数据的ETL工具,它具有如下特性: - -- 简单易用 - - - 提供所见即所得页面配置流水线 - - - 监控流水线状态 - - - 查看流水线日志 - - - 检查点功能 - - - 调度功能 - - - 组件热插拔功能 - -- 可扩展性 - - - 支持用户自定义开发组件 - -- 性能优越 - - - 基于分布式计算引擎Spark开发 - -- 功能强大 - - - 提供100+数据处理组件 - - - 包括 - spark、mllib、hadoop、hive、hbase、solr、redis、memcache、elasticSearch、jdbc、mongodb、http、ftp、xml、csv、json等. 
- -# 使用说明 - -## 3.1 界面说明 - -### 3.1.1 注册 - -![](media/804bb965d44bc42d78980ab035863bf6.png) - -### 3.1.2 登录 - -![](media/0fb8f864b8b8771cb4b112f01b99a1b4.png) - -### 3.1.4 首页 - -首页展示了资源使用情况,包括CPU、内存和磁盘。同时,展示了流水线Flow的总体情况、Group的总体情况、调度Schedule的总体情况、数据源DataSource的总体情况、数据处理组件的基本情况。其中Processor为运行态流水线/流水线组,状态可分为Started开始、完成Completed、失败Failed、Killed杀死、其他Other状态。 - -![](media/2ba4f0258d7c1ce88b5d9a62951d69ec.png) - -同时,支持了国际化,如下图所示。 - -![](media/37c5fd88e21160456829b81ef0da6e17.png) - -### 3.1.4 流水线Flow - -#### 3.1.4.1 流水线列表 - -![](media/ce3c14c203f683dbb6352bc65bd485c3.png) - -- 可点击进入流水线配置页面按钮,对流水线进行配置。 - -![](media/3370e78c6400dd8da9a74c6aad38e26c.png) - -- 可编辑流水线信息 - -![](media/314bdac455964e5c9147aa60f196e38c.png) - -- 可运行流水线 - -![](media/c8026939816d22732be62f3ee74a69d2.png) - -- 可以debug模式运行流水线 - -![](media/98188fa9ff6e34c5ed847a9b8a6a8f06.png) - -- 可删除流水线 - -![](media/e0fe4e3ce81df3a2c6bcbf9011665d0b.png) - -- 可对流水线保存模板 - -![](media/4b55ce202cb56a84b9f3c379f52457d4.png) - -#### 3.1.4.2 创建流水线 - -用户点击创建按钮,创建流水线。需要输入流水线名称及描述信息,同时可设置流水线需要的资源。 - -![](media/3773464b6a0681231693e93907a14913.png) - -#### 3.1.4.3 配置流水线 - -- 用户可通过拖拽方式进行流水线的配置,方式类似visio,如下图所示。 - -![](media/196a7a99a6675b13de7b16280a771ed1.png) - -- 画布左边栏显示组件组和组件,可按关键字搜索。用户选择好组件后可拖至画布中央。 - -![](media/5e41f29adc56f48a11e366cb71964b1e.png) - -- 画布右侧显示流水线基本信息,包括流水线名称及描述。 - -![](media/9a8d4dcf9ec91763d84da112503b8746.png) - -- 画布中央选择任一数据处理组件,右侧显示该数据处理组件的基本信息,包括名称,描述,作者等信息。选择AttributeInfo - Tab,显示该数据处理组件的属性信息,用户可根据实际需求进行配置。鼠标浮动到问号上会出现对应属性的说明,同时可以选择已设置好的数据源进行属性填充。 - - 数据处理组件基本信息如下图所示,点击StopName可对数据处理组件进行改名。 - -![](media/224b0701fe371ea188c486a965479af7.png) - -数据处理组件属性信息设置如下图所示。“问号”按钮提示该属性描述信息,“红星”表示必填项。 - -![](media/ac4a2020b57ad8a89baf3a5baff79041.png) - -数据处理组件属性样例信息如下图所示: - -![](media/fb46f7d107154685d461da70d58fee95.png) - -数据处理组件数据源填充如下图所示。已选择数据源相关数据会自动填充到所选数据处理组件中。数据源变更后,相应组件的属性也会随之更新。 - -![](media/c9aca3f7fdf518b303dfd89eb531583e.png) - -#### 3.1.4.4 运行流水线 - -用户配置好流水线后,可点击运行按钮运行流水线。 - -![](media/298f03ec66e58bbffa4b39bfde561a73.png) - -支持运行单个数据处理组件和当前及以下数据处理组件。 - -![](media/ca448a8e10ab5c9d6f3d7784186ef450.png) - -针对选中数据处理组件,需要给端口指定数据来源(测试数据管理,详见3.1.12)后运行 - -![](media/c6ba04b0642ecba881e6781052bca0ce.png) - -#### 3.1.4.5 流水线监控 - -加载完成之后,进入流水线监控页面。监控页面会显示整条流水线的执行状况,包括运行状态、执行进度、执行时间等。 - -![](media/4b7aa225d667724cf4d131203ef4b38a.png) - -点击具体数据处理组件,显示该数据处理组件的运行状况,包括运行状态、执行时间。 - -![](media/94819e5c8445407656ab2a448f056d89.png) - -#### 3.1.4.6 流水线日志 - -![](media/f19aec5e6da383e3f4fd82deeaa2d73d.png) - -#### 3.1.4.7 调试流水线 - -- 可以以Debug模式运行流水线,运行后可查看流经每条线上的数据信息,实现数据可溯源 - -![](media/bea241c3479deca8085bd1480ca46c7d.png) - -![](media/458067709215b1f7a74448ada1c562da.png) - -![](media/3fb3c84ac73286d9f9292cf635a8ef78.png) - -#### 3.1.4.8 检查点 - -- 流水线可设置检查点,再次运行时可选择从检查点运行 - -![](media/d8fef3b9b572baf509ef6c005f6377ad.png) - -#### 3.1.4.9 可视化组件 - -![](media/938789ca2232d5a79f7c0b6ed5aa349c.png) - -![](media/20ab6f45ca3a76fe1fe1e854eae1e6a3.png) - -![](media/02c5f7d5142128486bec25629ad12298.png) - -#### 3.1.4.10 表格组件 - -TableShow组件可以按表格形式展示数据。 - -![](media/97b1b2e80e9d2d451c87de60f8c7d221.png) - -数据可以直接导出 - -![](media/7b1ff06a83237e146cb5a578c2d64f82.png) - -### 3.1.5 流水线组Group - -#### 3.1.5.1 流水线组列表 - -- 流水线组支持流水线的顺序调度功能,组嵌套功能。列表功能与流水线列表功能一致。列表支持进入、编辑、运行/停止、删除、保存模板功能。 - -![](media/3e2b75a920dab6e8e80223ea55e4138c.png) - -#### 3.1.5.2 新建流水线组 - -- 点击创建按钮,输入流水线组名称和基本信息可创建流水线组Group。 - -![](media/2098c1b0bc900e3ef6784f190b90ea03.png) - -#### 3.1.5.3 配置流水线组 - -##### 3.1.5.3.1 创建group - -- 拖动左侧group图标 - 
-![](media/962d857fa6d81841b68e7d66fec1fab7.png) - -##### 3.1.5.3.2 创建flow - -- 拖动flow图标创建流水线flow - -![](media/0ce7214d28dfafa361eb71797b420208.png) - -##### 3.1.5.3.3 创建label - -- 拖动Label可添加标签,用于备注说明 - -![](media/691a0860ccc3550de400e40fff27d9aa.png) - -##### 3.1.5.3.4 创建调度关系 - -- 连线实现调度顺序 - -![](media/5d252f945b58e243b67e17db0bdd3266.png) - -##### 3.1.5.3.5 创建子group - -- Group可双击进入,配置组内流水线组group、流水线flow、以及之间的调度顺序。下部有导航栏,可退出该级目录,返回上一级。同时可以返回根目录。 - -![](media/c37366a416edcbdbc8d3e156b2c17728.png) - -##### 3.1.5.3.6 配置流水线flow - -- 双击flow图标,可进入具体流水线的配置界面 - -![](media/74004a9c950b9938546b9bc4d60d1a6d.png) - -##### 3.1.5.3.7 导入流水线flow - -- 可导入flow列表中已配置的流水线 - -![](media/e40afa0d5300a994c13fb71d60edb02d.png) - -##### 3.1.5.3.8 更换图标 - -- 右键group或flow,可支持更换图标 - -![](media/723edc23349f5db4e0206ab84f9183f3.png) - -- Group图标列表,支持用户上传 - -![](media/778fc60b703609d7bed36c1d3ddf1b65.png) - -- Flow图标列表,支持用户上传 - -![](media/9a586cd3d63b62ebe0468a556cbb48c8.png) - -#### 3.1.5.4 运行流水线组 - -- 运行 - -![](media/ba6ace4a165ebe4103022d51d925a809.png) - -- 可右键运行单个group或flow - -![](media/6fcd81e765e4b35e6856d7e256a52c93.png) - -![](media/2cc6a0aa5fae6bf352e05b520f688a79.png) - -#### 3.1.5.5 监控流水线组 - -- 默认显示流水线组监控信息 - -![](media/eac334b51f911c2ab6485637ec3a61a8.png) - -- 单击group或flow,显示点击组件的监控信息 - -![](media/5d46f397da6d8fe00d59881e51e0489f.png) - -可双击进入group或flow,查看进一步监控信息 - -#### 3.1.5.6 流水线组日志 - -![](media/fd427e60a1bf0cf972af6def8de99d62.png) - -### 3.1.6 运行态流水线Process - -已运行流水线组和流水线会显示在Process -List中,包括开始时间、结束时间、进度、状态等。同时可对已运行流水线进行查看,在运行,停止,和删除操作。 - -![](media/d08ecc7d6809d724e193abb87bee4579.png) - -### 3.1.7 模板Template - -流水线组和流水线可保存成模板 - -#### 3.1.7.1 模板保存 - -- 流水线保存模板 - -![](media/843ef0e7d2ef3d8ae9b25fb7643e124a.png) - -- 流水线组保存模板 - -![](media/222be5da30a64ff4a5f0ea6a775fda84.png) - -#### 3.1.7.2 模板列表 - -![](media/551a4be7c3ed491e61b33ca06fc95a75.png) - -#### 3.1.7.3 模板下载 - -![](media/5a303edbe77897f3affcaf8b92ed66cb.png) - -#### 3.1.7.4 模板上传 - -![](media/3f7d71f9be11017717a6657f2acb378c.png) - -#### 3.1.7.5 模板加载 - -- 流水线模板加载 - -![](media/69f40f0793cf7ac591a41a693811d7b3.png) - -- 流水线组模板加载 - -![](media/22734e828f3636d889c561d2779fe09a.png) - -#### 3.1.7.6 模板删除 - -![](media/5974d3c28593d699e9c5e05326273b0b.png) - -### 3.1.8 数据源 - -#### 3.1.8.1 创建数据源 - -- 支持JDBC、ElasticSearch、等数据源的创建。同时支持自定义数据源(other) - -![](media/697b4d7a34548ae8cc38d2ea2b678830.png) - -#### 3.1.8.1 使用数据源 - -在流水线配置页面,选择某个组件,设置该组件属性时,可从已配置数据源填充相关属性。同时数据源变更时,该组件属性也随之变更。 - -![](media/6131d5e3b230691195e7ffe8e2cf158a.png) - -### 3.1.9 定时调度 - -支持对已配置流水线/流水线组进行定时调度,调度采用Cron表达式进行设置,支持设置调度的开始时间和结束时间。 - -调度列表如下图所示,支持进入流水线/流水线组进行配置,支持编辑调度信息,启动调度、停止调度和删除调度。 - -![](media/74506b31ea2a9b14757b1383d5907443.png) - -新建调度如下图所示。需选择调度的类型(Flow/FlowGroup),Cron表达式,调度开始时间,调度结束时间,以及被调度的流水线/流水线组。 - -![](media/a7ad4b6f8ad9ea4bd7b019dc0d7fe51c.png) - -被调度成功的流水线会在Processor列表页显示。 - -![](media/6ce32509960f6ce33a48a8c52d2bbd35.png) - -### 3.1.10 组件热插拔 - -支持上传自定义开发组件jar包,mount成功后流水线配置页面会自动显示自定义开发组件。Unmount后,自定义开发组件会消失。 - -![](media/3f7ba0f389717bf0da295ff4ccdc2553.png) - -### 3.1.11 依赖包管理 - -支持上传Spark依赖的jar包,mount成功后运行流水线时会自动加载jar包。 - -![](media/aa3f92f8d2b247fbe522c6375c628e94.png) - -### 3.1.12 测试数据管理 - -支持创建测试数据,包括定义Schema、手工录入数据和CSV导入数据。测试数据管理可支持右键运行Stop功能(详见3.1.4.4)。 - -![](media/331054a587b3c84e992f663700b25f61.png) - -测试数据支持Manual和CSV导入两种方式 - -![](media/1cca1dbef9c3aa61f1083414dfd95fde.png) - -手动录入模式,支持Schema定义和数据录入 - -![](media/867bf8b3751b2fb22c0601159ada2d74.png) - -![](media/23f3085d3856920014ebdae487e8a037.png) - -CSV导入模式支持带入CSV文件 - 
-![](media/d84d68e4c6ec6cdaa83e49edda9784f6.png) - -### 3.1.13 数据处理组件显隐管理 - -支持设置数据处理组件的显示和隐藏,效果实时同步至流水线配置页面。 - -![](media/b39c91588796724aeb4213106591caae.png) - -### 3.1.14 交互式编程 - -支持在线代码编程 - -![](media/9dbc86aa5dfdc5ec4ebc07a7ae82a2b2.png) - -![](media/13ca4a51697916ec7a5a8e3aeff645fd.png) - -### 3.1.15 全局变量 - -支持添加全局变量,下图展示全局变量列表 - -![](media/a33196069b19b98dfe6a5597e6754ddb.png) - -添加全局变量 - -![](media/4f373310a568c9f6700f8191c9f3f095.png) - -创建流水线时可选择需要加载的全局变量,流水内部可对全局变量进行引用。 - -引用方式为”\${SPARK_HOME}” - -![](media/a97eaf4ab336500604e394082b978816.png) - -## 3.2 Restful API - -接口采用REST设计风格,目前需求如下接口: - -### 3.2.1 getAllGroups - -| **基本信息** | | | | -|----------------|--------------------------------|------------------------------------------|----------| -| 接口名称 | getAllGroups | | | -| 接口描述 | 获取所有数据处理组件Stop所在组 | | | -| 接口URL | GET /stop/ groups | | | -| **参数说明** | | | | -| 名称 | 描述 | 类型 | 数据类型 | -| 无 | | | | -| **返回值说明** | | | | -| 描述 | 返回代码 | 实例 | | -| 返回所有组信息 | 200 | {“groups”:”Common,Hive,Http,…”} | | -| | 500 | “getGroup Method Not Implemented Error!” | | - -### 3.2.2 getAllStops - -| **基本信息** | | | | -|----------------|----------------------|---------------------------------------------|----------| -| 接口名称 | getAllStops | | | -| 接口描述 | 获取所有数据处理组件 | | | -| 接口URL | GET /stop/list | | | -| **参数说明** | | | | -| 名称 | 描述 | 类型 | 数据类型 | -| 无 | | | | -| **返回值说明** | | | | -| 描述 | 返回代码 | 实例 | | -| 返回所有Stop | 200 | {“stops”:”cn.PiFlow .bundle.Common.Fork,…”} | | -| | 500 | “Can not found stop !” | | - -### 3.2.3 getStopInfo - -| **基本信息** | | | | -|------------------|--------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------| -| 接口名称 | getStopInfo | | | -| 接口描述 | 获取数据处理组件Stop的详细信息 | | | -| 接口URL | GET /stop/info?bundle=\*\*\* | | | -| **参数说明** | | | | -| 名称 | 描述 | 类型 | 数据类型 | -| bundle | Stop的类名 | Query | String | -| **返回值说明** | | | | -| 描述 | 返回代码 | 实例 | | -| 返回Stop详细信息 | 200 | {"name":"LoadFromFtp","bundle":"cn.PiFlow .bundle.ftp.LoadFromFtp",“groups”:"ftp,load","description":"load data from ftp 
server","properties":[{"property":{"name":"url_str","displayName":"URL","description":null,"defaultValue":"","allowableValues":"","required":"true","sensitive":"false"}},{"property":{"name":"port","displayName":"PORT","description":null,"defaultValue":"","allowableValues":"","required":"true","sensitive":"false"}},{"property":{"name":"username","displayName":"USER_NAME","description":null,"defaultValue":"","allowableValues":"","required":"true","sensitive":"false"}},{"property":{"name":"password","displayName":"PASSWORD","description":null,"defaultValue":"","allowableValues":"","required":"true","sensitive":"false"}},{"property":{"name":"ftpFile","displayName":"FTP_File","description":null,"defaultValue":"","allowableValues":"","required":"true","sensitive":"false"}},{"property":{"name":"localPath","displayName":"Local_Path","description":null,"defaultValue":"","allowableValues":"","required":"true","sensitive":"false"}}]} | | -| | 500 | “get PropertyDescriptor or getIcon Method Not Implemented Error!” | | - -### 3.2.4 startFlow - -| **基本信息** | | | -|----------------------|------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| 接口名称 | startFlow | | -| 接口描述 | 运行流水线 | | -| 接口URL | POST /flow/start | | -| **参数说明** | | | -| 描述 | 类型 | 实例 | -| Flow的json配置字符串 | String | {"flow":{"name":"test","uuid":"1234","stops":[{"uuid":"1111","name":"XmlParser","bundle":"cn.PiFlow .bundle.xml.XmlParser","properties":{"xmlpath":"hdfs://10.0.86.89:9000/xjzhu/dblp.mini.xml","rowTag":"phdthesis"}},{"uuid":"2222","name":"SelectField","bundle":"cn.PiFlow .bundle.common.SelectField","properties":{"schema":"title,author,pages"}},{"uuid":"3333","name":"PutHiveStreaming","bundle":"cn.PiFlow .bundle.hive.PutHiveStreaming","properties":{"database":"sparktest","table":"dblp_phdthesis"}},{"uuid":"4444","name":"CsvParser","bundle":"cn.PiFlow 
.bundle.csv.CsvParser","properties":{"csvPath":"hdfs://10.0.86.89:9000/xjzhu/phdthesis.csv","header":"false","delimiter":",","schema":"title,author,pages"}},{"uuid":"555","name":"Merge","bundle":"cn.PiFlow .bundle.common.Merge","properties":{}},{"uuid":"666","name":"Fork","bundle":"cn.PiFlow .bundle.common.Fork","properties":{"outports":["out1","out2","out3"]}},{"uuid":"777","name":"JsonSave","bundle":"cn.PiFlow .bundle.json.JsonSave","properties":{"jsonSavePath":"hdfs://10.0.86.89:9000/xjzhu/phdthesis.json"}},{"uuid":"888","name":"CsvSave","bundle":"cn.PiFlow .bundle.csv.CsvSave","properties":{"csvSavePath":"hdfs://10.0.86.89:9000/xjzhu/phdthesis_result.csv","header":"true","delimiter":","}}],"paths":[{"from":"XmlParser","outport":"","inport":"","to":"SelectField"},{"from":"SelectField","outport":"","inport":"data1","to":"Merge"},{"from":"CsvParser","outport":"","inport":"data2","to":"Merge"},{"from":"Merge","outport":"","inport":"","to":"Fork"},{"from":"Fork","outport":"out1","inport":"","to":"PutHiveStreaming"},{"from":"Fork","outport":"out2","inport":"","to":"JsonSave"},{"from":"Fork","outport":"out3","inport":"","to":"CsvSave"}]} | -| **返回值说明** | | | -| 描述 | 返回代码 | 实例 | -| 返回flow的appId | 200 | {“flow”:{“id”:”\*\*\*”,”pid”:””\*\*\*}} | -| | 500 | “Can not start flow!” | - -### 3.2.5 stopFlow - -| **基本信息** | | | -|----------------|-----------------|--------------------------------| -| 接口名称 | stopFlow | | -| 接口描述 | 停止流水线 | | -| 接口URL | POST /flow/stop | | -| **参数说明** | | | -| 描述 | 类型 | 实例 | -| Flow的appID | String | {“appID”:”\*\*\*”} | -| **返回值说明** | | | -| 描述 | 返回代码 | 实例 | -| 返回执行状态 | 200 | “ok” | -| | 500 | “Can not found process Error!” | - -### 3.2.6 getFlowInfo - -| **基本信息** | | | | -|------------------|-----------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------| -| 接口名称 | getFlowInfo | | | -| 接口描述 | 获取流水线Flow的信息 | | | -| 接口URL | GET /flow/info?appID=\*\*\* | | | -| **参数说明** | | | | -| 名称 | 描述 | 类型 | 数据类型 | -| appID | Flow的Id | Query | String | -| **返回值说明** | | | | -| 描述 | 返回代码 | 实例 | | -| 返回Flow详细信息 | 200 | {"flow":{"id":"application_1540442049798_0297","pid":"process_372bd7da-a53e-46b4-8c44-edc0463064f5_1","name":"xml,csv-merge-fork-hive,json,csv","state":"COMPLETED","startTime":"Tue Nov 27 
14:37:03 CST 2018","endTime":"Tue Nov 27 14:37:28 CST 2018","stops":[{"stop":{"name":"JsonSave","state":"COMPLETED","startTime":"Tue Nov 27 14:37:24 CST 2018","endTime":"Tue Nov 27 14:37:28 CST 2018"}},{"stop":{"name":"CsvSave","state":"COMPLETED","startTime":"Tue Nov 27 14:37:20 CST 2018","endTime":"Tue Nov 27 14:37:24 CST 2018"}},{"stop":{"name":"PutHiveStreaming","state":"COMPLETED","startTime":"Tue Nov 27 14:37:13 CST 2018","endTime":"Tue Nov 27 14:37:20 CST 2018"}},{"stop":{"name":"Fork","state":"COMPLETED","startTime":"Tue Nov 27 14:37:13 CST 2018","endTime":"Tue Nov 27 14:37:13 CST 2018"}},{"stop":{"name":"Merge","state":"COMPLETED","startTime":"Tue Nov 27 14:37:11 CST 2018","endTime":"Tue Nov 27 14:37:11 CST 2018"}},{"stop":{"name":"SelectField","state":"COMPLETED","startTime":"Tue Nov 27 14:37:11 CST 2018","endTime":"Tue Nov 27 14:37:11 CST 2018"}},{"stop":{"name":"XmlParser","state":"COMPLETED","startTime":"Tue Nov 27 14:37:09 CST 2018","endTime":"Tue Nov 27 14:37:11 CST 2018"}},{"stop":{"name":"CsvParser","state":"COMPLETED","startTime":"Tue Nov 27 14:37:03 CST 2018","endTime":"Tue Nov 27 14:37:09 CST 2018"}}]}} | | -| | 500 | “appID is null or flow run failed!” | | - -### 3.2.7 getFlowProgress - -| **基本信息** | | | | -|----------------|---------------------------------|-------------------------------------------------------------------------------------------------------------------------------------|----------| -| 接口名称 | getFlowProgress | | | -| 接口描述 | 获取流水线Flow的执行进度 | | | -| 接口URL | GET /flow/progress?appID=\*\*\* | | | -| **参数说明** | | | | -| 名称 | 描述 | 类型 | 数据类型 | -| appID | Flow的Id | Query | String | -| **返回值说明** | | | | -| 描述 | 返回代码 | 实例 | | -| 返回Flow的进度 | 200 | {"flow":{"appId":"application_1540442049798_0297","name":"xml,csv-merge-fork-hive,json,csv","state":"COMPLETED","progress":"100%"}} | | -| | 500 | “appId is null or flow run failed!” | | - -### 3.2.8 getFlowLog - -| **基本信息** | | | | -|---------------------|----------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------| -| 接口名称 | getFlowProgress | | | -| 接口描述 | 获取流水线Flow的执行进度 | | | -| 接口URL | GET /flow/log?appID=\*\*\* | | | -| **参数说明** | | | | -| 名称 | 描述 | 类型 | 数据类型 | -| appID | Flow的Id | Query | String | -| **返回值说明** | | | | -| 描述 | 返回代码 | 实例 | | -| 返回Flow的log的地址 | 200 | 
{"app":{"id":"application_1540442049798_0297","user":"root","name":"xml,csv-merge-fork-hive,json,csv","queue":"default","state":"FINISHED","finalStatus":"SUCCEEDED","progress":100.0,"trackingUI":"History","trackingUrl":"http://master:8088/proxy/application_1540442049798_0297/A","diagnostics":"","clusterId":1540442049798,"applicationType":"SPARK","applicationTags":"","startedTime":1543300611067,"finishedTime":1543300648590,"elapsedTime":37523,"amContainerLogs":"http://master:8042/node/containerlogs/container_1540442049798_0297_01_000001/root","amHostHttpAddress":"master:8042","allocatedMB":-1,"allocatedVCores":-1,"runningContainers":-1,"memorySeconds":217375,"vcoreSeconds":105,"preemptedResourceMB":0,"preemptedResourceVCores":0,"numNonAMContainerPreempted":0,"numAMContainerPreempted":0}} | | -| | 500 | “appID is null or flow does not exist!” | | - -### 3.2.9 getFlowCheckPoints - -| **基本信息** | | | | -|-----------------------|------------------------------------|-----------------------------------------|----------| -| 接口名称 | getFlowCheckPoints | | | -| 接口描述 | 获取流水线Flow的checkPoints | | | -| 接口URL | GET /flow/checkpoints?appID=\*\*\* | | | -| **参数说明** | | | | -| 名称 | 描述 | 类型 | 数据类型 | -| appID | Flow的appID | Query | String | -| **返回值说明** | | | | -| 描述 | 返回代码 | 实例 | | -| 返回Flow的checkpoints | 200 | {"checkpoints":"Merge,Fork"} | | -| | 500 | “appID is null or flow does not exist!” | | - -### 3.2.10 getFlowDebugData - -| **基本信息** | | | | -|----------------------------------------------|----------------------------------|-----------------------------------------|----------| -| 接口名称 | getFlowDebugData | | | -| 接口描述 | 获取流水线Flow的调试数据 | | | -| 接口URL | GET /flow/debugData?appID=\*\*\* | | | -| **参数说明** | | | | -| 名称 | 描述 | 类型 | 数据类型 | -| appID | Flow的appID | Query | String | -| stopName | stop的名称 | Query | String | -| Port | Stop的端口名 | Query | String | -| **返回值说明** | | | | -| 描述 | 返回代码 | 实例 | | -| 返回Flow的指定stop和端口的调试数据的hdfs路径 | 200 | {"schema":””, “debugDataPath”:" "} | | -| | 500 | “appID is null or flow does not exist!” | | - -### 3.2.11 startFlowGroup - -| **基本信息** | | | 
-|---------------------------|-------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| 接口名称 | startFlowGroup | | -| 接口描述 | 运行流水线组 | | -| 接口URL | POST /group/start | | -| **参数说明** | | | -| 描述 | 类型 | 实例 | -| FlowGroup的json配置字符串 | String | {"group":{"flows":[{"flow":{"executorNumber":"1","driverMemory":"1g","executorMemory":"1g","executorCores":"1","paths":[{"inport":"","from":"MockData","to":"ShowData","outport":""}],"name":"f4","stops":[{"customizedProperties":{},"name":"MockData","uuid":"8a80d63f720cdd2301723b7745b72649","bundle":"cn.PiFlow .bundle.common.MockData","properties":{"schema":"title:String,author:String,age:Int","count":"10"}},{"customizedProperties":{},"name":"ShowData","uuid":"8a80d63f720cdd2301723b7745b72647","bundle":"cn.PiFlow 
.bundle.external.ShowData","properties":{"showNumber":"5"}}],"uuid":"8a80d63f720cdd2301723b7745b62645"}},{"flow":{"executorNumber":"1","driverMemory":"1g","executorMemory":"1g","executorCores":"1","paths":[{"inport":"","from":"MockData","to":"ShowData","outport":""}],"name":"f3","stops":[{"customizedProperties":{},"name":"MockData","uuid":"8a80d63f720cdd2301723b7745b9265d","bundle":"cn.PiFlow .bundle.common.MockData","properties":{"schema":"title:String,author:String,age:Int","count":"10"}},{"customizedProperties":{},"name":"ShowData","uuid":"8a80d63f720cdd2301723b7745b9265b","bundle":"cn.PiFlow .bundle.external.ShowData","properties":{"showNumber":"5"}}],"uuid":"8a80d63f720cdd2301723b7745b82659"}}],"name":"SimpleGroup","groups":[{"group":{"flows":[{"flow":{"executorNumber":"1","driverMemory":"1g","executorMemory":"1g","executorCores":"1","paths":[{"inport":"","from":"MockData","to":"ShowData","outport":""}],"name":"MockData","stops":[{"customizedProperties":{},"name":"MockData","uuid":"8a80d63f720cdd2301723b7745b4261a","bundle":"cn.PiFlow .bundle.common.MockData","properties":{"schema":"title:String,author:String,age:Int","count":"10"}},{"customizedProperties":{},"name":"ShowData","uuid":"8a80d63f720cdd2301723b7745b32618","bundle":"cn.PiFlow .bundle.external.ShowData","properties":{"showNumber":"5"}}],"uuid":"8a80d63f720cdd2301723b7745b32616"}},{"flow":{"executorNumber":"1","driverMemory":"1g","executorMemory":"1g","executorCores":"1","paths":[{"inport":"","from":"MockData","to":"ShowData","outport":""}],"name":"MockData","stops":[{"customizedProperties":{},"name":"MockData","uuid":"8a80d63f720cdd2301723b7745b5262e","bundle":"cn.PiFlow .bundle.common.MockData","properties":{"schema":"title:String,author:String,age:Int","count":"10"}},{"customizedProperties":{},"name":"ShowData","uuid":"8a80d63f720cdd2301723b7745b5262c","bundle":"cn.PiFlow .bundle.external.ShowData","properties":{"showNumber":"5"}}],"uuid":"8a80d63f720cdd2301723b7745b4262a"}}],"name":"g1","uuid":"8a80d63f720cdd2301723b7745b22615"}}],"conditions":[{"entry":"f4","after":"g1"},{"entry":"f3","after":"g1"}],"uuid":"8a80d63f720cdd2301723b7745b22614"}} | -| **返回值说明** | | | -| 描述 | 返回代码 | 实例 | -| 返回flowGroup的Id | 200 | {"group":{"id":"group_fc1cb223-9c44-467f-a063-e959ffb6bcd8"}} | -| | 500 | “Can not start group!” | - -### 3.2.12 stopFlowGroup - -| **基本信息** | | | -|---------------------------|------------------|----------------------------------| -| 接口名称 | stopFlowGroup | | -| 接口描述 | 停止流水线组 | | -| 接口URL | POST /group/stop | | -| **参数说明** | | | -| 描述 | 类型 | 实例 | -| FlowGroup的json配置字符串 | String | {“groupId”:”\*\*\*”} | -| **返回值说明** | | | -| 描述 | 返回代码 | 实例 | -| 返回停止操作的状态 | 200 | “Stop FlowGroup OK!!!” | -| | 500 | “Can not found FlowGroup Error!” | - -### 3.2.13 getFlowGroupInfo - -| **基本信息** | | | 
-|----------------|--------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| 接口名称 | getFlowGroupInfo | | -| 接口描述 | 获取流水线组信息 | | -| 接口URL | GET /group/info?groupID=\*\*\* | | -| **参数说明** | | | -| 名称 | 描述 | 类型 | -| groupId | GroupId | String | -| **返回值说明** | | | -| 描述 | 返回代码 | 实例 | -| 返回group信息 | 200 | {"group":{"name":"SimpleGroup","startTime":"FriMay2918:10:50CST2020","state":"STARTED","flows":[],"groups":[{"group":{"name":"g1","startTime":"FriMay2918:10:50CST2020","state":"COMPLETED","flows":[{"flow":{"name":"MockData","startTime":"FriMay2918:11:03CST2020","state":"COMPLETED","endTime":"FriMay2918:11:07CST2020","id":"application_1589249052248_0440","pid":"process_b3c96bf0-c9b4-41b1-b0e0-06fb2d5e4be5_1","progress":"100","stops":[{"stop":{"name":"ShowData","state":"COMPLETED","startTime":"FriMay2918:11:07CST2020","endTime":"FriMay2918:11:07CST2020"}},{"stop":{"name":"MockData","state":"COMPLETED","startTime":"FriMay2918:11:03CST2020","endTime":"FriMay2918:11:07CST2020"}}]}}],"groups":[],"endTime":"FriMay2918:11:20CST2020","id":"group_2322a41d-7b69-4fe7-9a87-a78c50f26e09"}}],"endTime":"","id":"group_0a7abbd3-9c9a-4dfa-9a0b-7f77fdacf3d4"}} | -| | 500 | “Can not found FlowGroup Error!” | - -### 3.2.14 getFlowGroupProgress - -| **基本信息** | | | -|----------------|------------------------------------|----------------------------------------------------| -| 接口名称 | getFlowGroupProgress | | -| 接口描述 | 获取流水线group的执行进度 | | -| 接口URL | GET /group/progress?groupId=\*\*\* | | -| **参数说明** | | | -| 名称 | 描述 | 类型 | -| groupId | Group的id | String | -| **返回值说明** | | | -| 描述 | 返回代码 | 实例 | -| 返回执行进度 | 200 | “100” | -| | 500 | “groupId is null or flowGroup progress exception!” | diff --git a/pom.xml b/pom.xml index 85f99e70..47c58cc9 100644 --- a/pom.xml +++ b/pom.xml @@ -8,65 +8,18 @@ UTF-8 9.0.0.M0 - 3.4.0 - 2.12.18 - 10.2.0 - 3.3.0 + 2.3.4 + 2.11.8 1.8 + 1.8 + 1.8 + 0.9.1 - - - org.apache.hadoop - hadoop-client - ${hadoop.version} - - - - org.apache.hadoop - hadoop-hdfs - ${hadoop.version} - - - - org.apache.hadoop - hadoop-auth - ${hadoop.version} - - - - org.apache.logging.log4j - log4j-slf4j-impl - 2.19.0 - - - - - com.alibaba.fastjson2 - fastjson2 - 2.0.33 - - - - - org.json4s - json4s-native_2.12 - 3.7.0-M11 - - - - - org.json4s - json4s-jackson_2.12 - 3.7.0-M11 - - - - org.scala-lang scala-library - 2.12.18 + ${scala.version} org.scala-lang @@ -78,8 +31,6 @@ scala-compiler ${scala.version} - - junit junit @@ -88,47 +39,47 @@ org.apache.spark - spark-core_2.12 + spark-core_2.11 + ${spark.version} + + + org.apache.spark + spark-sql_2.11 ${spark.version} org.apache.spark - spark-sql_2.12 + spark-hive_2.11 ${spark.version} org.apache.spark - spark-hive_2.12 + spark-yarn_2.11 ${spark.version} 
org.apache.spark - spark-yarn_2.12 + spark-streaming_2.11 ${spark.version} org.apache.spark - spark-streaming_2.12 + spark-streaming-kafka-0-10_2.11 ${spark.version} org.apache.spark - spark-streaming-kafka-0-10_2.12 + spark-streaming-flume_2.11 ${spark.version} - - - - - org.mongodb.spark - mongo-spark-connector_2.12 - ${mongo.version} + mongo-spark-connector_2.11 + ${spark.version} org.apache.spark - spark-mllib_2.12 + spark-mllib_2.11 ${spark.version} @@ -138,36 +89,39 @@ org.apache.kafka - kafka_2.12 + kafka_2.11 2.1.1 + + org.apache.spark + spark-sql-kafka-0-10_2.11 + 2.3.4 + + + net.jpountz.lz4 + lz4 + + + net.jpountz.lz4 lz4 1.3.0 - - - - - - com.h2database h2 - 2.1.214 - + 1.4.197 - org.apache.httpcomponents httpcore 4.4 - org.postgresql - postgresql - 9.4.1212 + io.jsonwebtoken + jjwt + ${jjwt.version} @@ -190,7 +144,7 @@ - org.apache.maven.plugins + org.apache.maven.plugins maven-compiler-plugin 3.7.0 @@ -264,7 +218,6 @@ piflow-core - piflow-bundle piflow-server piflow-configure diff --git a/python/embed/README.md b/python/embed/README.md new file mode 100644 index 00000000..e69de29b diff --git a/python/embed/embed.zip b/python/embed/embed.zip new file mode 100644 index 00000000..cb93fb22 Binary files /dev/null and b/python/embed/embed.zip differ diff --git a/python/embed/pinecone/README.md b/python/embed/pinecone/README.md new file mode 100644 index 00000000..72ff11f0 --- /dev/null +++ b/python/embed/pinecone/README.md @@ -0,0 +1,99 @@ +#
Pinecone向量数据库存储组件使用说明书 + + + +### 1.组件介绍 + +​ 该组件是一个用于将文本数据向量化并存储到 Pinecone 向量数据库 中的工具。它利用 Hugging Face 的嵌入模型,将来自PdfParser 、 ImageParser 等解析器⽣成的非结构化文本数据转换为高维向量,并通过 Pinecone 的索引功能进行管理和高效检索。组件的设计支持灵活的配置和扩展,允许用户根据不同的需求调整模型、索引、度量标准等参数,适用于各种基于向量的文本检索和存储场景。 + +**组件功能:** + +1. **文本向量化**:通过 Hugging Face 的预训练模型(如 MiniLM、RoBERTa 等)将文本转化为固定维度的向量表示,支持多种嵌入模型。 +2. **向量存储与检索**:使用 Pinecone 构建和管理向量索引,支持高效的向量插入、存储和检索 +3. **可配置参数**: + - **API 密钥**:通过 API 密钥连接 Pinecone 服务。 + - **嵌入模型**:允许用户根据需求选择不同的 Hugging Face 嵌入模型。 + - **索引管理**:用户可自定义索引名称、向量维度、度量标准等,未创建的索引会自动创建。 + - **度量标准选择**:支持三种度量方式(欧几里得距离、余弦相似度、点积),根据场景灵活选择。 +4. **批量处理**:支持大规模数据的批量向量化和插入,用户可根据数据量和性能需求调整批量大小。 + +### 2.环境要求 + +- python版本:3.9或更高版本 +- pinecone云数据库:访问pinecone官网,注册账户 +- 所需python库: + - numpy + - pandas + - langchain_community + - langchain_huggingface + - pinecone-client + +### 3.参数说明 + +可以通过命令行传入多个参数来控制 Pinecone 索引的创建、向量存储及模型选择等行为。以下是每个参数的使用说明: + +| 参数名称 | 类型 | 默认值 | 说明 | +| :---------- | ---- | ------------------- | ------------------------------------------------------------ | +| api_key | str | | Pinecone API 密钥,用于验证和访问 Pinecone 服务 | +| embed_model | str | all_MiniLM_L6_v2 | 嵌入模型的名称,用于将文本转换为向量 | +| index_name | str | document-embeddings | Pinecone 中的索引名称,用于存储和检索向量 | +| dimension | int | 384 | 向量的维度,通常与选择的嵌入模型输出的向量维度一致 | +| metric | str | cosine | 向量检索的度量标准,支持的度量标准有 `"euclidean"`(欧几里得距离)、`"cosine"`(余弦相似度)和 `"dotproduct"`(点积) | +| cloud | str | aws | 指定 Pinecone 所使用的云平台,支持的值有 `"aws"`、`"gcp"` | +| region | str | us-east-1 | 指定 Pinecone 所使用的区域 | +| batch_size | int | | 在将向量插入 Pinecone 时的批处理大小 | + + + +### 4.支持的向量化模型 + +在该组件中,嵌入模型的选择和加载是通过 helpers 脚本中的 `embed_change` 函数实现的。具体过程如下: + +1. **模型选择**:`embed_change` 函数用于根据外界传入的模型名称来选择相应的 Hugging Face 嵌入模型。该函数会将模型名称映射到具体的模型路径,确保正确加载对应的模型。 +2. **模型加载**:根据用户选择的模型名称,组件会调用 `embed_change` 函数,返回对应模型的路径或名称。然后使用 `HuggingFaceEmbeddings` 来加载该模型,用于后续的文本向量化。 +3. 
**文本向量化**:加载完指定的嵌入模型后,组件将文本数据传递给 Hugging Face 模型进行向量化处理,生成高维向量。这些向量可以用于存储或相似性检索。 + +通过这种方式,`helper` 脚本中的 `embed_change` 函数使组件能够灵活选择和切换不同的嵌入模型,适应不同的任务场景和需求。 + +​ 目前集成了7个模型,支持自定义扩展(源码在helpers.py),具体信息如下表所示。用户可以自行去hugging face 官网下载,也可以用网盘下载(https://pan.quark.cn/s/fc0de5220493),下载后放到/data的目录下。 + +![image-20240927170843969](./pictures/image-20240927170843969.png) + +### 4.使用说明 + +(1)配置基础镜像:在docker中拉取python3.9镜像,点击基础镜像管理菜单,配置基础镜像python:3.9即可,配置详情如下图所示: + +![image-20240921060606680](./pictures/image-20240921060606680.png) + +(2)从公开连接下载对应的嵌入模型放到data\testingStuff\models路径下【https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/v0.2/】 + +(3)配置数据库存储组件:从 GitHub 下载包含向量数据库存储组件的 zip ⽂件,然后将zip包上传到系统并挂载(mount)。挂载成功后,选择组件并编辑其基本信息和图标。配置的详细步骤如下图所示: + +![image-20240921060858879](./pictures/image-20240921060858879.png) + +(4)配置流水线,详情如下图所示: + +![image-20240927172024264](./pictures/image-20240927172024264.png) + +![image-20240927172040390](./pictures/image-20240927172040390.png) + +(5)运行流水线 + +![image-20240921061009045](./pictures/image-20240921061009045.png) + +(6)运行成功以后日志显示: + +![image-20240921061421703](./pictures/image-20240921061421703.png) + +(7)查看pinecone数据库,可以看到创建了新的索引并且存储了数据 +![image-20240921053728775](./pictures/image-20240921053728775.png) + +![image-20240921053803937](./pictures/image-20240921053803937.png) + +### 5.注意事项 + +1.需要在pinecone官网先注册账户,从而得到api_key,通过api_key连接数据库 + +2.免费的账户云空间会有一定的限制,当索引量超多免费容量时便无法添加 + +3.pinecone_finnal.py组件名文件不可以用pinecone.py命名,因为开头有import pinecone会产生干扰 \ No newline at end of file diff --git a/python/embed/pinecone/data_connect.py b/python/embed/pinecone/data_connect.py new file mode 100644 index 00000000..bcf3d813 --- /dev/null +++ b/python/embed/pinecone/data_connect.py @@ -0,0 +1,96 @@ +from pyhdfs import HdfsClient +from hdfs.client import Client +import pandas as pd +import os +import uuid +import shutil + + +class DATAConnect: + def __init__(self): + env_dist = os.environ + self.HdfsClientHost = env_dist.get("hdfs_url") + print(self.HdfsClientHost) + self.client_read = HdfsClient(self.HdfsClientHost) + self.client_wirte = Client('http://' + self.HdfsClientHost) + + def dataInputStream(self, port="1"): + df = pd.DataFrame() + with open('/app/inputPath.txt', 'r', encoding='utf-8') as f: + input_path_dir = f.readline().strip("\n") + input_path = input_path_dir + port + '/' + + print("--------------------") + print(input_path) + print("--------------------") + + # flag = self.client_read.get_content_summary(input_path).get('directoryCount') + # if flag == 0 : df = pd.concat([df, pd.read_table(self.client_read.open(input_path))]) + # else: df = pd.concat([df,pd.concat([pd.read_table(self.client_read.open(input_path+i)) for i in self.client_read.listdir(input_path) if i.endswith('.parquet')])]) + # print(df.head(5)) + # 初始化一个空的 DataFrame 用于存储所有结果 + + # 使用 client_read 的方法列出目录中的文件,并检查扩展名 + _path_id = str(uuid.uuid4()) + _COPYPATH: str = "/data/piflow/tmp/" + _path_id + os.makedirs(_COPYPATH, exist_ok=True) + for i in self.client_read.listdir(input_path): + + FILEPATH: str = _COPYPATH + "/copy.parquet" + if i.endswith('.parquet'): + # 构建完整的文件路径 + file_path = input_path + i + # 使用 pd.read_parquet 读取文件 + print("CURRENT_DIRECTORY:", os.getcwd()) + self.client_read.copy_to_local(file_path, FILEPATH) + temp_df = pd.read_parquet(FILEPATH) # self.client_read.open(file_path)) + # 将读取的 DataFrame 追加到主 DataFrame + df = pd.concat([df, temp_df], ignore_index=True) + shutil.rmtree(_COPYPATH) + print(df.head(5)) + return df + + def dataOutputStream(self, df, port="1"): + 
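+        # Hands the DataFrame to the downstream stop over HDFS: the framework writes
+        # the HDFS output prefix to /app/outputPath.txt, each output port gets its own
+        # subdirectory (output_path_dir + port), and the frame is saved there as a
+        # single CSV file (demo.csv) through the `hdfs` WebHDFS client.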
with open('/app/outputPath.txt', 'r', encoding='utf-8') as f: + output_path_dir = f.readline().strip("\n") + output_path = output_path_dir + port + + print("--------------------") + print(output_path) + print("--------------------") + + self.client_wirte.makedirs(output_path, '777') + self.client_wirte.write(output_path + '/demo.csv', df.to_csv(index=False, sep=','), overwrite=True, + encoding='utf-8') + + def putFileToHdfs(self, hdfs_path, local_path, isDelete=False): + # 如果文件已存在,自动删除,默认为 False + # if isDelete : self.client_wirte.delete(hdfs_path) + # hdfs 路径会自动创建 + self.client_wirte.upload(hdfs_path, local_path) + + def downloadFileFromHdfs(self, hdfs_path, local_path, overwrite=False): + # 自动创建文件夹 + parentDir = os.path.dirname(local_path) + print(parentDir) + # 判断本地文件夹是否存在 + isExists = os.path.exists(parentDir) + # 不存在,自动创建 + if not isExists: os.makedirs(parentDir) + + # 本地路径不会自动创建 + self.client_wirte.download(hdfs_path, local_path, overwrite) + + # 文件夹下载 + def downloadFolderFromHdfs(self, hdfs_path, local_path, overwrite=False): + # 自动创建文件夹 + parentDir = os.path.dirname(local_path) + print(parentDir) + # 判断本地文件夹是否存在 + isExists = os.path.exists(parentDir) + # 不存在,自动创建 + if not isExists: os.makedirs(parentDir) + + # 本地路径不会自动创建 + # 本地路径少一级才会正确下载,如 hdfs_path=/a/b/ parentDir=/a + self.client_wirte.download(hdfs_path, parentDir, overwrite) diff --git a/python/embed/pinecone/helpers.py b/python/embed/pinecone/helpers.py new file mode 100644 index 00000000..4018e039 --- /dev/null +++ b/python/embed/pinecone/helpers.py @@ -0,0 +1,71 @@ +import itertools, pandas as pd +import os + +embed_models_path = os.environ.get("embed_model", "/data/models/") + +if (embed_models_path == "embed_models_path"): + embed_models_path = "/data/models/" + +# 如果embed_models_path不以'/'结尾,则加上'/' +if not embed_models_path.endswith('/'): + embed_models_path += '/' + + +def chunk_generator(lis: list, batch_size: int = 100): + lis = iter(lis) + chunk = tuple(itertools.islice(lis, batch_size)) + while chunk: # amogus + yield chunk + chunk = tuple(itertools.islice(lis, batch_size)) + + +def transpose(data: tuple[dict]) -> dict[str, list]: + df = pd.DataFrame(data) + retD = {} + for c in df.columns: + retD[c] = df[c].to_list() + return retD + + +def embed_change(name: str) -> str: + if name == "all_MiniLM_L6_v2": + return embed_models_path + "all_MiniLM" + elif name == "sentence-transformers/all-MiniLM-L6-v2": + return embed_models_path + "all_MiniLM" + + elif name == "all-roberta-large-v1": + return embed_models_path + "all_RoBERTa_large" + elif name == "sentence-transformers/all-roberta-large-v1": + return embed_models_path + "all_RoBERTa_large" + + elif name == "average_word_embeddings_glove.840B.300d": + return embed_models_path + "glove_avg_word" + elif name == "sentence-transformers/average_word_embeddings_glove.840B.300d": + return embed_models_path + "glove_avg_word" + + elif name == "gte-small": + return embed_models_path + "gteSmallModel" + elif name == "thenlper/gte-small": + return embed_models_path + "gteSmallModel" + + elif name == "sentence-t5-xl": + return embed_models_path + "sentence_t5" + elif name == "sentence-transformers/sentence-t5-xl": + return embed_models_path + "sentence_t5" + + elif name == "snowflake-arctic-embed-m": + return embed_models_path + "snowflake_arctic" + elif name == "Snowflake/snowflake-arctic-embed-m": + return embed_models_path + "snowflake_arctic" + + elif name == "sentence-transformers-e5-large-v2": + return embed_models_path + "ste_embaas_e5_large" + elif name == 
"embaas/sentence-transformers-e5-large-v2": + return embed_models_path + "ste_embaas_e5_large" + + else: + raise ValueError("Bad Model Name") + +# remove duplicate rows +# def purge(data:pd.DataFrame) -> pd.DataFrame: +# return data.drop_duplicates() \ No newline at end of file diff --git a/python/embed/pinecone/pictures/image-20240921053728775.png b/python/embed/pinecone/pictures/image-20240921053728775.png new file mode 100644 index 00000000..2792c5ee Binary files /dev/null and b/python/embed/pinecone/pictures/image-20240921053728775.png differ diff --git a/python/embed/pinecone/pictures/image-20240921053803937.png b/python/embed/pinecone/pictures/image-20240921053803937.png new file mode 100644 index 00000000..a7de06ec Binary files /dev/null and b/python/embed/pinecone/pictures/image-20240921053803937.png differ diff --git a/python/embed/pinecone/pictures/image-20240921060606680.png b/python/embed/pinecone/pictures/image-20240921060606680.png new file mode 100644 index 00000000..bc0538cb Binary files /dev/null and b/python/embed/pinecone/pictures/image-20240921060606680.png differ diff --git a/python/embed/pinecone/pictures/image-20240921060858879.png b/python/embed/pinecone/pictures/image-20240921060858879.png new file mode 100644 index 00000000..0b645e64 Binary files /dev/null and b/python/embed/pinecone/pictures/image-20240921060858879.png differ diff --git a/python/embed/pinecone/pictures/image-20240921061009045.png b/python/embed/pinecone/pictures/image-20240921061009045.png new file mode 100644 index 00000000..b3764c05 Binary files /dev/null and b/python/embed/pinecone/pictures/image-20240921061009045.png differ diff --git a/python/embed/pinecone/pictures/image-20240921061421703.png b/python/embed/pinecone/pictures/image-20240921061421703.png new file mode 100644 index 00000000..84a3e9aa Binary files /dev/null and b/python/embed/pinecone/pictures/image-20240921061421703.png differ diff --git a/python/embed/pinecone/pictures/image-20240927170843969.png b/python/embed/pinecone/pictures/image-20240927170843969.png new file mode 100644 index 00000000..cd090813 Binary files /dev/null and b/python/embed/pinecone/pictures/image-20240927170843969.png differ diff --git a/python/embed/pinecone/pictures/image-20240927172024264.png b/python/embed/pinecone/pictures/image-20240927172024264.png new file mode 100644 index 00000000..3e1908b7 Binary files /dev/null and b/python/embed/pinecone/pictures/image-20240927172024264.png differ diff --git a/python/embed/pinecone/pictures/image-20240927172040390.png b/python/embed/pinecone/pictures/image-20240927172040390.png new file mode 100644 index 00000000..408ff05a Binary files /dev/null and b/python/embed/pinecone/pictures/image-20240927172040390.png differ diff --git a/python/embed/pinecone/pinecone_finnal.py b/python/embed/pinecone/pinecone_finnal.py new file mode 100644 index 00000000..663480b8 --- /dev/null +++ b/python/embed/pinecone/pinecone_finnal.py @@ -0,0 +1,114 @@ +from langchain_huggingface import HuggingFaceEmbeddings as hfe +import os +from data_connect import DATAConnect # 引入 data_connect 中的 DATAConnect 类 +import pandas as pd +from helpers import *# 从 helpers.py 引入 embed_change 函数 +import sys +from pinecone import Pinecone, ServerlessSpec # 引入 Pinecone 和 ServerlessSpec +import numpy as np + +# 支持的 metric 类型 +SUPPORTED_METRICS = {"euclidean", "cosine", "dotproduct"} + + +def init_pinecone_index(api_key: str, + index_name: str = "hhh", + dimension: int = 384, # 外部传入 dimension + metric: str = "cosine", + cloud: str = "aws", region: str = 
"us-east-1"): + # 检查传入的 metric 是否支持 + if metric not in SUPPORTED_METRICS: + print(f"不支持这种 metric: {metric},已修改为默认的 metric: 'cosine'") + metric = "cosine" # 设置为默认 metric + + # 初始化 Pinecone 客户端 + pc = Pinecone(api_key=api_key) + + # 如果索引不存在,则创建索引 + if index_name not in pc.list_indexes().names(): + pc.create_index( + name=index_name, + dimension=dimension, # 通过外部参数传入 dimension + metric=metric, # 使用用户传入或默认的距离度量标准 + spec=ServerlessSpec( + cloud=cloud, # 指定云平台 + region=region # 指定区域 + ) + ) + # 返回连接到的索引 + return pc.Index(index_name) + + +def vectorize_texts(df: pd.DataFrame, index, embed_provider: str, batch_size: int = 100): + # 初始化嵌入模型 + embed_func = hfe(model_name=embed_provider) + + # 批量存储向量 + vectors_to_insert = [] + + for idx, record in df.iterrows(): # 迭代 DataFrame 每一行 + element_id = record.get('element_id', 'N/A') + print(f"Processing Element ID: {element_id}") + + # 只对 `text` 字段进行向量化 + if 'text' in record: + # 使用 embed_documents 来替代 encode,直接获取嵌入向量 + text_vector = embed_func.embed_documents([str(record['text'])])[0] # 返回的是一个列表,取第一个元素 + vectors_to_insert.append((element_id, text_vector, { # 不需要再调用 .tolist() + "element_id": element_id, + "type": record.get('type', 'N/A') # 作为元数据存储 + })) + + # 处理 `metadata` 字段,将元数据直接存储 + metadata = {} + if 'metadata' in record: + metadata = { + "filename": record['metadata'].get('filename', 'N/A'), + "filetype": record['metadata'].get('filetype', 'N/A'), + # 确保 languages 是 list 或 str,而不是 ndarray + "languages": record['metadata'].get('languages', 'N/A') + if not isinstance(record['metadata'].get('languages'), np.ndarray) + else record['metadata'].get('languages').tolist(), + "page_number": record['metadata'].get('page_number', 'N/A') + } + # 将 metadata 作为元数据存储,但不向量化 + vectors_to_insert[-1][2].update(metadata) # 把元数据加入最新的向量 + + # 检查是否达到批量大小限制 + if len(vectors_to_insert) >= batch_size: + # 批量插入到 Pinecone 中 + index.upsert(vectors=vectors_to_insert) + print(f"Batch inserted {len(vectors_to_insert)} vectors") + vectors_to_insert = [] # 清空列表以便继续下一批处理 + + # 插入剩余的向量 + if vectors_to_insert: + index.upsert(vectors=vectors_to_insert) + print(f"Batch inserted {len(vectors_to_insert)} vectors (final batch)") + + +if __name__ == "__main__": + # 从命令行或外部配置中获取 Pinecone 连接信息和模型名称,设置默认值 + api_key = sys.argv[1] # Pinecone API key + + # 可选的命令行参数,有默认值 + embed_provider = embed_change(sys.argv[2]) if len(sys.argv) > 2 else "all_MiniLM_L6_v2" # 嵌入模型名称 + index_name = sys.argv[3] if len(sys.argv) > 3 else "document-embeddings" # 索引名称 + dimension = int(sys.argv[4]) if len(sys.argv) > 4 else 384 # 向量维度 + metric = sys.argv[5] if len(sys.argv) > 5 else "cosine" # 距离度量方法 + cloud = sys.argv[6] if len(sys.argv) > 6 else "aws" # 云平台 + region = sys.argv[7] if len(sys.argv) > 7 else "us-east-1" # 区域 + batch_size = int(sys.argv[8]) if len(sys.argv) > 8 else 100 # 批量大小 + + print(f"Parameters received: {api_key}, {embed_provider}, {index_name},{dimension}, {metric},{cloud},{region},{batch_size}") + # 初始化数据库并创建或连接到索引 + index = init_pinecone_index(api_key, index_name, dimension, metric, cloud, region) + + # 使用 data_connect 中的方法从 HDFS 获取数据 + dataConnect = DATAConnect() + + # 从 HDFS 读取数据 + df = dataConnect.dataInputStream(port="input_read") + df.drop_duplicates(subset=["element_id"], inplace=True) + + vectorize_texts(df,index,embed_provider,batch_size=batch_size) diff --git "a/python/embed/pinecone/pinecone\345\233\276\346\240\207.png" "b/python/embed/pinecone/pinecone\345\233\276\346\240\207.png" new file mode 100644 index 00000000..a82f7d77 Binary files /dev/null and 
"b/python/embed/pinecone/pinecone\345\233\276\346\240\207.png" differ diff --git a/python/embed/pinecone/requirements.txt b/python/embed/pinecone/requirements.txt new file mode 100644 index 00000000..0001b060 --- /dev/null +++ b/python/embed/pinecone/requirements.txt @@ -0,0 +1,9 @@ +numpy +hdfs +pyhdfs +pandas +langchain_community +langchain_huggingface +protobuf==3.20.1 +pinecone-client +pyarrow \ No newline at end of file diff --git a/python/embed/qdrant/README.md b/python/embed/qdrant/README.md new file mode 100644 index 00000000..0a285277 --- /dev/null +++ b/python/embed/qdrant/README.md @@ -0,0 +1,277 @@ + + + + + + + + + + + + + + + + +
Qdrant 向量数据库存储组件使用说明书
+
+----
+
目录
+ +---- + +[TOC] + +---- + + + +# Qdrant 向量数据库存储组件使用说明 + +## 1. 组件简介 + +该组件基于 Qdrant 向量数据库,专门为 `πFlow` 系统开发,旨在处理 `PdfParser`、`ImageParser` 等解析器生成的非结构化数据(如文本和图像),将其转换为向量并存储到 Qdrant 中。该组件支持通过不同的嵌入模型进行向量化处理,并提供高效的存储、检索解决方案,以满足大规模数据(如用于大模型训练)的存储和检索需求。 + +### 主要功能: +- **向量化非结构化数据**:通过 Hugging Face 预训练模型将文本或图像转换为向量。 +- **数据批量写入**:支持批量写入到 Qdrant 数据库,优化数据存储效率。 +- **多种距离度量支持**:支持多种向量相似度计算方法(欧几里得距离、余弦相似度、点积等)。 +- **集成数据流**:与 `DATAConnect` 集成,读取数据并进行处理和存储。 + +--- + +## 2. 环境要求 + +### 依赖项 + +- **Python 版本**:需要 Python 3.10 或更高版本。 +- **Qdrant 向量数据库环境**: + +- **所需python库**: + - `qdrant-client`:用于连接和操作 Qdrant 向量数据库 + - `transformers`:用于加载 Hugging Face 的预训练模型 + - `pandas`:用于数据处理 + - `langchain_huggingface`:用于嵌入向量生成 + - 其他依赖:`sys`、`warnings` + +### Qdrant 向量数据库: +- **本地实例**:可以使用本地数据库,或者通过 Docker 运行 Qdrant 实例。 + +- **远程实例**:支持通过指定 `host` 和 `port` 连接远程 Qdrant 实例。 + +- **Docker运行 Qdrant 示例:** 建议使用 Docker 来搭建 Qdrant 环境。可以使用以下命令来运行最新版本的 Qdrant Docker 镜像: + + ```bash + docker run -it qdrant/qdrant:latest -p 6333:6333 -p 6334:6334 + ``` + + 该命令会启动 Qdrant 实例,并将本地的 6333 和 6334 端口映射到 Docker 容器中的相应端口。可以通过web访问http://localhost:6333/dashboard,查看qdrant数据库状态. + + ![image-20240916135327817](./README_picture/image_20240916135327817.png) + +--- + +## 3. 参数说明 + +组件通过命令行参数接收配置参数。以下是各参数的详细说明: + +| 参数名 | 类型 | 默认值 | 说明 | +| ----------------- | ----- | -------------------- | ------------------------------------------------------------ | +| `collection_name` | `str` | `default_collection` | Qdrant 中的集合名称,用于存储向量化数据。 | +| `batch_size` | `int` | `100` | 批量写入 Qdrant 的数据条数。 | +| `host` | `str` | `127.0.0.1` | 连接的 Qdrant 数据库主机地址。 | +| `port` | `int` | `6333` | Qdrant 数据库的端口号。 | +| `grpc_port` | `int` | `6334` | Qdrant gRPC 服务的端口号。 | +| `embed_model` | `str` | `all_MiniLM_L6_v2` | 使用的文本嵌入模型。 | +| `distance_metric` | `str` | `cosine` | 向量距离度量方式,支持 `cosine`、`euclid`、`dot`、`manhattan`。 | + +### 支持的嵌入模型: +支持 Hugging Face 上的多种预训练模型,如 `all-MiniLM-L6-v2`、`all-roberta-large-v1`、`sentence-t5-xl` 等。 + +以下是已集成的向量化模型表格: + +| 序号 | 名称 | 参数 | 介绍 | +| :--: | :----------------------------------------------------------: | :---------------------------------------: | :----------------------------------------------------------: | +| 1 | `sentence-transformers/all-MiniLM-L6-v2` | `all_MiniLM_L6_v2` | 通用文本嵌入(GTE)模型:将句子和段落映射到384维的密集向量空间,可以用于聚类或语义搜索等任务。 | +| 2 | `sentence-transformers/all-roberta-large-v1` | `all-roberta-large-v1` | 通用文本嵌入(GTE)模型:将句子和段落映射到1024维的密集向量空间,可以用于聚类或语义搜索等任务。 | +| 3 | `sentence-transformers/average_word_embeddings_glove.840B.300d` | `average_word_embeddings_glove.840B.300d` | 通用文本嵌入(GTE)模型:将句子和段落映射到300维的密集向量空间,可以用于聚类或语义搜索等任务。 | +| 4 | `thenlper/gte-small` | `gte-small` | 通用文本嵌入(GTE)模型:基于多阶段对比学习的通用文本嵌入,由阿里巴巴达摩学院训练。 | +| 5 | `sentence-transformers/sentence-t5-xl` | `sentence-t5-xl` | 通用文本嵌入(GTE)模型:将句子和段落映射到768维的密集向量空间,可以用于聚类或语义搜索等任务。 | +| 6 | `Snowflake/snowflake-arctic-embed-m` | `snowflake-arctic-embed-m` | 通用文本嵌入(GTE)模型,专注于创建针对性能优化的高质量检索模型。 | +| 7 | `embaas/sentence-transformers-e5-large-v2` | `sentence-transformers-e5-large-v2` | 通用文本嵌入(GTE)模型:将句子和段落映射到1024维的密集向量空间,可用于聚类或语义搜索等任务。 | + +这个表格简明地列出了支持的向量化模型及其介绍,帮助用户理解每个模型的特点和应用场景。 + +**一些常用模型汇总: https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/v0.2/** + + + +### 支持的距离度量方式: +- `cosine`:余弦相似度 +- `euclid`:欧几里得距离 +- `dot`:点积 +- `manhattan`:曼哈顿距离 + +--- + +## 4. 工作流程 + +1. **配置连接**:组件会尝试连接到本地或远程的 Qdrant 实例。如果没有提供 `host` 和 `port`,则会使用本地嵌入式 Qdrant 数据库。 + +2. **数据加载**:通过 `DATAConnect` 从数据流端口中读取输入数据流(如 PDF 文本或图像数据),并去重处理。 + +3. 
**数据向量化**:使用 Hugging Face 的 `transformers` 模型将文本或图像转换为向量。模型可配置,默认使用 `all_MiniLM_L6_v2` 模型。 + +4. **数据存储**:向量化的数据批量写入到 Qdrant 中,集合名称通过参数指定。写入完成后,显示上传进度及状态。 + +5. **检索支持**:写入后的数据可以通过 Qdrant 提供的接口进行检索,支持通过向量化搜索进行相似性查询。 + +--- + +## 5. 代码说明 + +### 主程序结构: +```python +DATAConnectif __name__ == "__main__": + # 获取组件配置参数, 使用命令行参数或默认值进行赋值 + collection_name = sys.argv[1] if len(sys.argv) > 1 else DEFAULT_COLLECTION_NAME + batch_size = int(sys.argv[2]) if len(sys.argv) > 2 else DEFAULT_BATCH_SIZE + host = sys.argv[3] if len(sys.argv) > 3 else DEFAULT_HOST + port = int(sys.argv[4]) if len(sys.argv) > 4 else DEFAULT_PORT + grpc_port = int(sys.argv[5]) if len(sys.argv) > 5 else DEFAULT_GRPC_PORT + embed_model = sys.argv[6] if len(sys.argv) > 6 else DEFAULT_EMBED_MODEL + distance_metric = sys.argv[7] if len(sys.argv) > 7 else DEFAULT_DISTANCE_METRIC + + # 初始化 Qdrant 客户端 + if host and port: + client = QdrantClient(host=host, port=port, grpc_port=grpc_port, prefer_grpc=True, headers=DEFAULT_HEADERS) + else: + client = QdrantClient(path="./qdrant_local.db") + + # 数据输入和去重 + dataConnect = DATAConncet() + df = dataConnect.dataInputStream(port="input_read") + df.drop_duplicates(subset=["element_id"], inplace=True) + + # 向量化和写入 + write_dict(collection_name=collection_name, elements_dict=datas, client=client, batch_size=batch_size) + + # 关闭客户端 + client.close() +``` + +### 关键函数: +- **`write_dict`**:负责将数据转换为向量并批量写入 Qdrant。 + +```python +def write_dict(collection_name: str, elements_dict: t.List[t.Dict[str, str]], client: QdrantClient, batch_size: int): + embedFunc = hfe(model_name=embed_model) + + def _embedText(s: str) -> t.List[float]: + return embedFunc.embed_documents(texts=[s])[0] + + points = [] + for i in range(len(elements_dict)): + content = elements_dict[i] + vector = _embedText(str(content['text'])) + points.append(PointStruct(id=i, vector=vector, payload=content)) + + if (i + 1) % batch_size == 0 or i == len(elements_dict) - 1: + try: + client.upsert(collection_name=collection_name, points=points) + points.clear() + except Exception as e: + print(f"Error during upsert: {e}") +``` + +--- + +## 6. 使用示例 + +1)**配置基础镜像**:在基础镜像管理菜单中,可以选择已有镜像或从官方镜像拉取制定版本的python镜像(python本版3.10以上,此处我们设置基础镜像为 `registry.cn-hangzhou.aliyuncs.com/cnic-piflow/embed-base:v1`)。配置的具体步骤请参考下图: + +![image-20240916140241146](./README_picture/image_20240916140241146.png) + +2)**安装向量数据库存储组件**:首先,从 [GitHub](https://github.com/cas-bigdatalab/piflow/blob/master/doc/embed/embed.zip) 下载包含向量数据库存储组件的 ZIP 文件。然后,将 ZIP 文件上传到系统并进行挂载(mount)。挂载成功后,选择组件并编辑其基本信息和图标。配置的详细步骤请参考下图: + +![image-20240916140334408](./README_picture/image_20240916140334408.png) + +![image-20240927113942764](./README_picture/image_20240927113942764.png) + +![image-20240927105740270](./README_picture/image_20240927105740270.png) + +![image-20240927105806488](./README_picture/image_20240927105806488.png) + +![image-20240927110000207](./README_picture/image_20240927110000207.png) + + + +--- + +## 7. 注意事项 + +- 运行前请确保 Qdrant 数据库已经启动并正确配置(本地或远程)。 +- `batch_size` 应根据实际数据量设置,建议在大批量数据时适当调整该值以优化性能。 +- 默认情况下,组件会尝试连接本地 Qdrant 实例;如果要连接远程实例,请确保提供正确的 `host` 和 `port`。 +- **禁止将不同预训练模型得到的嵌入存储到同一个数据库中:** qdrant数据库在创建时要求设置向量维度,不同预训练模型获得的嵌入向量维度有所差异.选择了不同的预训练模型时,`collection_name`参数不能一致. +--- + +## 8. 扩展功能 + +- **模型切换**:通过修改 `embed_model` 参数可以选择不同的预训练模型,支持对不同类型的文本和图像进行向量化处理。 +- **距离度量方式**:可以通过修改 `distance_metric` 参数切换不同的距离度量方法,如余弦相似度、欧几里得距离、曼哈顿距离等。 +- **模型扩展:**关于文本嵌入模型还有很多可选择,可根据自身任务需要选择合适的预训练模型.(一些常用模型汇总:https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/v0.2/) +--- + +## 9. 
+
+---
+
+## 9. Conclusion
+
+With the steps above, the required base image has been configured and the vector database storage component has been integrated into the system. This setup provides solid support for processing and storing the unstructured data produced by the various parser components, and ensures efficient storage and retrieval of that data in the Qdrant vector database.
+
+In subsequent work, the component can be used to prepare data for large-model training and to speed up data retrieval and analysis. Qdrant's vector processing and storage capabilities make it easier to manage and exploit unstructured data, providing a solid foundation for data-driven decisions.
+
+## 10. Source code
+
+https://github.com/cas-bigdatalab/piflow/tree/master/python/embed/qdrant
+
diff --git a/python/embed/qdrant/README.pdf b/python/embed/qdrant/README.pdf
new file mode 100644
index 00000000..415d902f
Binary files /dev/null and b/python/embed/qdrant/README.pdf differ
diff --git a/python/embed/qdrant/README_picture/image_20240916135327817.png b/python/embed/qdrant/README_picture/image_20240916135327817.png
new file mode 100644
index 00000000..19131dd0
Binary files /dev/null and b/python/embed/qdrant/README_picture/image_20240916135327817.png differ
diff --git a/python/embed/qdrant/README_picture/image_20240916140241146.png b/python/embed/qdrant/README_picture/image_20240916140241146.png
new file mode 100644
index 00000000..b261bebe
Binary files /dev/null and b/python/embed/qdrant/README_picture/image_20240916140241146.png differ
diff --git a/python/embed/qdrant/README_picture/image_20240916140334408.png b/python/embed/qdrant/README_picture/image_20240916140334408.png
new file mode 100644
index 00000000..55392e38
Binary files /dev/null and b/python/embed/qdrant/README_picture/image_20240916140334408.png differ
diff --git a/python/embed/qdrant/README_picture/image_20240927105740270.png b/python/embed/qdrant/README_picture/image_20240927105740270.png
new file mode 100644
index 00000000..2d2877f3
Binary files /dev/null and b/python/embed/qdrant/README_picture/image_20240927105740270.png differ
diff --git a/python/embed/qdrant/README_picture/image_20240927105806488.png b/python/embed/qdrant/README_picture/image_20240927105806488.png
new file mode 100644
index 00000000..1612164d
Binary files /dev/null and b/python/embed/qdrant/README_picture/image_20240927105806488.png differ
diff --git a/python/embed/qdrant/README_picture/image_20240927110000207.png b/python/embed/qdrant/README_picture/image_20240927110000207.png
new file mode 100644
index 00000000..402cb9e8
Binary files /dev/null and b/python/embed/qdrant/README_picture/image_20240927110000207.png differ
diff --git a/python/embed/qdrant/README_picture/image_20240927113942764.png b/python/embed/qdrant/README_picture/image_20240927113942764.png
new file mode 100644
index 00000000..863f837b
Binary files /dev/null and b/python/embed/qdrant/README_picture/image_20240927113942764.png differ
diff --git a/python/embed/qdrant/data_connect.py b/python/embed/qdrant/data_connect.py
new file mode 100644
index 00000000..9b028567
--- /dev/null
+++ b/python/embed/qdrant/data_connect.py
@@ -0,0 +1,96 @@
+from pyhdfs import HdfsClient
+from hdfs.client import Client
+import pandas as pd
+import os
+import uuid
+import shutil
+
+
+class DATAConnect:
+    def __init__(self):
+        env_dist = os.environ
+        self.HdfsClientHost = env_dist.get("hdfs_url")
+        print(self.HdfsClientHost)
+        self.client_read = HdfsClient(self.HdfsClientHost)
+        self.client_wirte = Client('http://'+self.HdfsClientHost)
+
+    def dataInputStream(self, port="1"):
+        df = pd.DataFrame()
+        with open('/app/inputPath.txt','r', encoding='utf-8') as f:
+            input_path_dir = f.readline().strip("\n")
+        input_path = input_path_dir+port+'/'
+
+        print("--------------------")
+        print(input_path)
+        print("--------------------")
+
+
+        # flag = self.client_read.get_content_summary(input_path).get('directoryCount')
+        # if flag == 0 : df = pd.concat([df, pd.read_table(self.client_read.open(input_path))])
+        
# else: df = pd.concat([df,pd.concat([pd.read_table(self.client_read.open(input_path+i)) for i in self.client_read.listdir(input_path) if i.endswith('.parquet')])]) + # print(df.head(5)) + # 初始化一个空的 DataFrame 用于存储所有结果 + + # 使用 client_read 的方法列出目录中的文件,并检查扩展名 + _path_id = str(uuid.uuid4()) + _COPYPATH:str = "/data/piflow/tmp/"+_path_id + os.makedirs(_COPYPATH, exist_ok=True) + for i in self.client_read.listdir(input_path): + + FILEPATH:str = _COPYPATH+"/copy.parquet" + if i.endswith('.parquet'): + # 构建完整的文件路径 + file_path = input_path + i + # 使用 pd.read_parquet 读取文件 + print("CURRENT_DIRECTORY:", os.getcwd()) + self.client_read.copy_to_local(file_path, FILEPATH) + temp_df = pd.read_parquet(FILEPATH) # self.client_read.open(file_path)) + # 将读取的 DataFrame 追加到主 DataFrame + df = pd.concat([df, temp_df], ignore_index=True) + shutil.rmtree(_COPYPATH) + print(df.head(5)) + return df + + def dataOutputStream(self, df, port="1"): + with open('/app/outputPath.txt','r', encoding='utf-8') as f: + output_path_dir= f.readline().strip("\n") + output_path= output_path_dir + port + + print("--------------------") + print(output_path) + print("--------------------") + + self.client_wirte.makedirs(output_path, '777') + self.client_wirte.write(output_path +'/demo.csv', df.to_csv(index=False, sep=','), overwrite=True, encoding='utf-8') + + + def putFileToHdfs(self, hdfs_path, local_path, isDelete=False): + # 如果文件已存在,自动删除,默认为 False + # if isDelete : self.client_wirte.delete(hdfs_path) + # hdfs 路径会自动创建 + self.client_wirte.upload(hdfs_path, local_path) + + def downloadFileFromHdfs(self, hdfs_path, local_path, overwrite=False): + # 自动创建文件夹 + parentDir = os.path.dirname(local_path) + print(parentDir) + # 判断本地文件夹是否存在 + isExists = os.path.exists(parentDir) + # 不存在,自动创建 + if not isExists : os.makedirs(parentDir) + + # 本地路径不会自动创建 + self.client_wirte.download(hdfs_path, local_path, overwrite) + #文件夹下载 + def downloadFolderFromHdfs(self, hdfs_path, local_path, overwrite=False): + # 自动创建文件夹 + parentDir = os.path.dirname(local_path) + print(parentDir) + # 判断本地文件夹是否存在 + isExists = os.path.exists(parentDir) + # 不存在,自动创建 + if not isExists : os.makedirs(parentDir) + + # 本地路径不会自动创建 + # 本地路径少一级才会正确下载,如 hdfs_path=/a/b/ parentDir=/a + self.client_wirte.download(hdfs_path, parentDir, overwrite) diff --git a/python/embed/qdrant/helpers.py b/python/embed/qdrant/helpers.py new file mode 100644 index 00000000..6460038d --- /dev/null +++ b/python/embed/qdrant/helpers.py @@ -0,0 +1,70 @@ +import itertools, pandas as pd +import os + +embed_models_path = os.environ.get("embed_model","/data/models/") + +if(embed_models_path == "embed_models_path"): + embed_models_path = "/data/models/" + +# 如果embed_models_path不以'/'结尾,则加上'/' +if not embed_models_path.endswith('/'): + embed_models_path += '/' + +def chunk_generator(lis:list, batch_size:int = 100): + lis = iter(lis) + chunk = tuple(itertools.islice(lis, batch_size)) + while chunk: # amogus + yield chunk + chunk = tuple(itertools.islice(lis, batch_size)) + +def transpose(data:tuple[dict]) -> dict[str,list]: + df = pd.DataFrame(data) + retD = {} + for c in df.columns: + retD[c] = df[c].to_list() + return retD + +def embed_change(name:str) -> str: + match name: + case "all_MiniLM_L6_v2": + return embed_models_path + "all_MiniLM" + case "sentence-transformers/all-MiniLM-L6-v2": + return embed_models_path + "all_MiniLM" + + case "all-roberta-large-v1": + return embed_models_path + "all_RoBERTa_large" + case "sentence-transformers/all-roberta-large-v1": + return embed_models_path + "all_RoBERTa_large" + 
+ case "average_word_embeddings_glove.840B.300d": + return embed_models_path + "glove_avg_word" + case "sentence-transformers/average_word_embeddings_glove.840B.300d": + return embed_models_path + "glove_avg_word" + + case "gte-small": + return embed_models_path + "gteSmallModel" + case "thenlper/gte-small": + return embed_models_path + "gteSmallModel" + + case "sentence-t5-xl": + return embed_models_path + "sentence_t5" + case "sentence-transformers/sentence-t5-xl": + return embed_models_path + "sentence_t5" + + case "snowflake-arctic-embed-m": + return embed_models_path + "snowflake_arctic" + case "Snowflake/snowflake-arctic-embed-m": + return embed_models_path + "snowflake_arctic" + + case "sentence-transformers-e5-large-v2": + return embed_models_path + "ste_embaas_e5_large" + case "embaas/sentence-transformers-e5-large-v2": + return embed_models_path + "ste_embaas_e5_large" + + case _: + raise ValueError("Bad Model Name") + + +# remove duplicate rows +# def purge(data:pd.DataFrame) -> pd.DataFrame: +# return data.drop_duplicates() \ No newline at end of file diff --git a/python/embed/qdrant/qdrant.png b/python/embed/qdrant/qdrant.png new file mode 100644 index 00000000..d1b63a64 Binary files /dev/null and b/python/embed/qdrant/qdrant.png differ diff --git a/python/embed/qdrant/qdrant.py b/python/embed/qdrant/qdrant.py new file mode 100644 index 00000000..8a20d593 --- /dev/null +++ b/python/embed/qdrant/qdrant.py @@ -0,0 +1,174 @@ +import typing as t +from qdrant_client import QdrantClient +from qdrant_client.models import PointStruct, VectorParams, Distance +import pandas as pd +from langchain_huggingface import HuggingFaceEmbeddings as hfe +from transformers import AutoConfig +from sys import argv +from helpers import * +from data_connect import DATAConnect +import sys +import warnings + +# 默认参数值 +DEFAULT_COLLECTION_NAME = "default_collection" +DEFAULT_BATCH_SIZE = 100 +DEFAULT_HOST = "localhost" +DEFAULT_PORT = 6333 +DEFAULT_GRPC_PORT = 6334 +DEFAULT_EMBED_MODEL = "all_MiniLM_L6_v2" +DEFAULT_DISTANCE_METRIC = "cosine" +DEFAULT_HEADERS = "" + +def write_dict(collection_name: str, elements_dict: t.List[t.Dict[str, str]], client: QdrantClient, batch_size: int): + embedFunc = hfe(model_name=embed_model) + + def _embedText(s: str) -> t.List[float]: + return embedFunc.embed_documents(texts=[s])[0] + + points = [] + print("开始转化为向量嵌入了") + for i in range(len(elements_dict)): + content = elements_dict[i] + vector = _embedText(str(content['text'])) + points.append(PointStruct(id=i, vector=vector, payload=content)) + + if (i + 1) % batch_size == 0 or i == len(elements_dict) - 1: + try: + print(f"Uploading {len(points)} points to Qdrant") + client.upsert(collection_name=collection_name, points=points) + points.clear() + except Exception as e: + print(f"Error during upsert: {e}") + + print(f"{len(elements_dict)}条数据全部写入完成!") + +# ## 选择嵌入模型 +# def embed_change(name: str) -> str: +# embed_models_path = "your/base/path/" # 假设你有一个基础路径 +# +# # 使用字典映射模型名称到嵌入模型路径 +# model_mapping = { +# "all_MiniLM_L6_v2": embed_models_path + "all_MiniLM", +# "sentence-transformers/all-MiniLM-L6-v2": embed_models_path + "all_MiniLM", +# +# "all-roberta-large-v1": embed_models_path + "all_RoBERTa_large", +# "sentence-transformers/all-roberta-large-v1": embed_models_path + "all_RoBERTa_large", +# +# "average_word_embeddings_glove.840B.300d": embed_models_path + "glove_avg_word", +# "sentence-transformers/average_word_embeddings_glove.840B.300d": embed_models_path + "glove_avg_word", +# +# "gte-small": embed_models_path + 
"gteSmallModel", +# "thenlper/gte-small": embed_models_path + "gteSmallModel", +# +# "sentence-t5-xl": embed_models_path + "sentence_t5", +# "sentence-transformers/sentence-t5-xl": embed_models_path + "sentence_t5", +# +# "snowflake-arctic-embed-m": embed_models_path + "snowflake_arctic", +# "Snowflake/snowflake-arctic-embed-m": embed_models_path + "snowflake_arctic", +# +# "sentence-transformers-e5-large-v2": embed_models_path + "ste_embaas_e5_large", +# "embaas/sentence-transformers-e5-large-v2": embed_models_path + "ste_embaas_e5_large" +# } +# +# # 使用 .get() 方法查找模型对应的路径,若不存在则抛出异常 +# path = model_mapping.get(name) +# if path is not None: +# return path +# else: +# raise ValueError("Bad Model Name") + +if __name__ == "__main__": + # 获取组件配置参数,使用命令行参数或默认值进行赋值 + collection_name: str = sys.argv[1] if len(sys.argv) > 1 else DEFAULT_COLLECTION_NAME + batch_size: int = int(sys.argv[2]) if len(sys.argv) > 2 else DEFAULT_BATCH_SIZE + host: str = sys.argv[3] if len(sys.argv) > 3 else DEFAULT_HOST + port: int = int(sys.argv[4]) if len(sys.argv) > 4 else DEFAULT_PORT + grpc_port: int = int(sys.argv[5]) if len(sys.argv) > 5 else DEFAULT_GRPC_PORT + embed_model = embed_change(sys.argv[6]) if len(sys.argv) > 6 else DEFAULT_EMBED_MODEL + # embed_model = sys.argv[6] if len(sys.argv) > 6 else DEFAULT_EMBED_MODEL + distance_metric = sys.argv[7] if len(sys.argv) > 7 else DEFAULT_DISTANCE_METRIC + + # distance_metric = "manhattan" + # 距离度量方式选择. + distance_switch = { + 'euclid': Distance.EUCLID, + 'cosine': Distance.COSINE, + 'dot': Distance.DOT, + 'manhattan': Distance.MANHATTAN + } + # 检查输入的距离度量是否有效 + if distance_metric in distance_switch: + distance_metric = distance_switch[distance_metric] + print(f"选择的距离度量方式是: {distance_metric}") + else: + # 获取距离度量,如果输入不在字典内,则默认选择 Distance.COSINE + distance_metric = distance_switch.get(distance_metric, Distance.COSINE) + print(f"不支持的距离度量方式: {distance_metric}, 系统默认选择了{distance_metric}余弦度量!!") + + + dataConnect = DATAConnect() + print("dataConnect = DATAConnect()成功!") + + # df = pd.read_parquet("./test2.parquet") + df = dataConnect.dataInputStream(port="input_read") + df.drop_duplicates(subset=["element_id"], inplace=True) + print(df.head(5)) + + # collection_name = "test_collection4" + + # embed_model = "/data/model/all-MiniLM-L12-v1" + + + print(f"Parameters received: {collection_name}, {batch_size}, {host},{port}, {grpc_port},{embed_model},{distance_metric}") + + config = AutoConfig.from_pretrained(embed_model) + output_dim = config.hidden_size + print(f"该模型的输出向量维度为: {output_dim}!") + + # client = QdrantClient(host='localhost', port=6333) + # #连接到Qdrant数据库。如果提供了host和port,则连接远程Qdrant实例。否则,将连接到本地实例。 + + if host and port: + # 如果提供了host和port,则连接远程Qdrant实例 + client = QdrantClient( + host=host, + port=port, + # grpc_host=grpc_host if grpc_host else host, # 如果没有提供grpc_host,则使用host + grpc_port=grpc_port, # 默认GRPC端口为6334 + prefer_grpc=True, # 优先使用gRPC + headers=DEFAULT_HEADERS + ) + print(f"连接到远程Qdrant实例: {host}:{port}") + else: + # 如果没有提供host和port,连接到本地实例 + client = QdrantClient(path="./qdrant_local.db") # 使用本地路径的嵌入式Qdrant数据库 + warnings.warn("由于缺少host或port参数,创建了一个本地临时Qdrant客户端,进程结束后数据库将被删除。") + + + datas = [] + print("向量写入前") + for i in range(len(df)): + dic = { + "filename": df.iloc[i].get("metadata", {}).get("filename", ""), + "filetype": df.iloc[i].get("metadata", {}).get("filetype", ""), + "languages": str(df.iloc[i].get("metadata", {}).get("languages", "")), + "page_number": str(df.iloc[i].get("metadata", {}).get("page_number", "")), + "text": str(df.iloc[i].get("text", 
"")), + "type": df.iloc[i].get("type", "") + } + datas.append(dic) + + collections = [col.name for col in client.get_collections().collections] + if collection_name not in collections: + client.create_collection( + collection_name=collection_name, + vectors_config=VectorParams( + size=output_dim, + distance=distance_metric + ) + ) + + write_dict(collection_name=collection_name, elements_dict=datas, client=client, batch_size=batch_size) + client.close() + print("向量写入结束") diff --git a/python/embed/qdrant/requirements.txt b/python/embed/qdrant/requirements.txt new file mode 100644 index 00000000..e0559867 --- /dev/null +++ b/python/embed/qdrant/requirements.txt @@ -0,0 +1,10 @@ +numpy +hdfs +pyhdfs +pandas +langchain_community +langchain_huggingface +protobuf==3.20.1 +transformers +qdrant_client +pyarrow diff --git a/readMe.txt b/readMe.txt deleted file mode 100644 index d50cffac..00000000 --- a/readMe.txt +++ /dev/null @@ -1,20 +0,0 @@ -1.maven error - apt-get install maven - mvn install:install-file -Dfile=/opt/project/piflow/piflow-bundle/lib/spark-xml_2.11-0.4.2.jar -DgroupId=com.databricks -DartifactId=spark-xml_2.11 -Dversion=0.4.2 -Dpackaging=jar - mvn install:install-file -Dfile=/opt/project/piflow/piflow-bundle/lib/java_memcached-release_2.6.6.jar -DgroupId=com.memcached -DartifactId=java_memcached-release -Dversion=2.6.6 -Dpackaging=jar - mvn install:install-file -Dfile=/opt/project/piflow/piflow-bundle/lib/ojdbc6-11.2.0.3.jar -DgroupId=oracle -DartifactId=ojdbc6 -Dversion=11.2.0.3 -Dpackaging=jar - mvn install:install-file -Dfile=/opt/project/piflow/piflow-bundle/lib/edtftpj.jar -DgroupId=ftpClient -DartifactId=edtftp -Dversion=1.0.0 -Dpackaging=jar - -2.Packaging by Intellij - 1)Edit Configurations --> add Maven - Command line: clean package -Dmaven.test.skip=true -X - 2)Build piflow-server-0.9.jar - -3.run main class in Intellij - - - 1)Edit Configurations --> Application - Main class: cn.piflow.api.Main - Environment Variable: SPARK_HOME=/opt/spark-2.2.0-bin-hadoop2.6; - - diff --git "a/\346\234\250\345\205\260\347\244\276\345\214\272-PiFlow.md" "b/\346\234\250\345\205\260\347\244\276\345\214\272-PiFlow.md" deleted file mode 100644 index c97c0370..00000000 --- "a/\346\234\250\345\205\260\347\244\276\345\214\272-PiFlow.md" +++ /dev/null @@ -1,76 +0,0 @@ -### 项目名称或建议名称(在木兰开源社区中必须唯一) -- 大数据流水线系统PiFlow - -### 要求的项目成熟度级别:孵化|毕业 -- 孵化 - -### 项目描述 -- PiFlow是一个基于分布式计算框架技术开发的大数据流水线处理与调度系统。该系统将大数据采集、清洗、存储与分析进行抽象和组件化开发,以所见即所得、拖拽配置的简洁方式实现大数据处理流程化配置、运行与智能监控。提供100+的数据处理组件,包括Hadoop 、Spark、MLlib、Hive、Solr、Redis、MemCache、ElasticSearch、JDBC、MongoDB、HTTP、FTP、XML、CSV、JSON等,更支持面向领域的二次组件开发。数据可溯源,性能优越。 - -### 是否与当前木兰开源社区托管项目有合作机会 -- 有 - -### 许可证名称,版本和许可证文本的URL - - Apache License 2.0 - - https://github.com/cas-bigdatalab/piflow/blob/master/LICENSE - -### 源代码控制(Trustie、GitHub、Gitee等)-请确认使用的工具 - - GitHub:https://github.com/cas-bigdatalab/piflow - - Gitee: https://gitee.com/opensci/piflow - -### 问题追踪器(Trustie、GitHub、Gitee、JIRA等)-请确认使用的工具 - - GitHub:https://github.com/cas-bigdatalab/piflow/issues - - JIRA - -### 协作工具(Mail List,Wiki,IRC,Slack,WeChat,QQ等)-请确认正在使用的工具,并注明您想要使用的工具的要求 - - Wiki - - WeChat:PiFlow User Group - - QQ群:1003489545 - - -### 外部依赖关系,包括这些依赖关系的许可证(名称和版本) - - Spark 2.3.4 (Apache-2.0 License) - - hadoop 2.6.0 (Apache-2.0 License) - -### 最初的提交者(姓名,电子邮件,组织)以及他们从事该项目已有多长时间 - - - PiFlow server - | 姓名 | 邮箱 | 组织 | 从事该项目时间 | - | ------ | --------------- | ----------------------------- | ---- | - | 沈志宏 | bluejoe@cnic.cn | 中国科学院网络信息中心 | 2018.05-- 至今 | - | 朱小杰 | xjzhu@cnic.cn | 中国科学院网络信息中心 | 2018.07-- 
至今 | - - - PiFlow web - | 姓名 | 邮箱 | 组织 | 从事该项目时间 | - | ------ | ----------------|-------------------------------| ---- | - | 周健鹏 | zjp@cnic.cn | 中国科学院网络信息中心 | 2018.09-- 至今 | - | 孙静芳 | sxideal@163.com | 中国科学院网络信息中心 | 2020.08 -- 至今 | - - - -### 项目是否定义了贡献者,提交者,维护者等角色?如果是,请在MAINTAINERS.md中记录它 -- 无 - -### 该项目的贡献者总数,包括其从属关系: -- 17人 - -### 该项目有发布方法吗?如果是,请在RELEASES.md中进行记录 -- GitHub中进行Release - -### 该项目是否有行为准则?如果是,请共享URL。如果否,请创建CODE_OF_CONDUCT.md并指向。 -- 否 - -### 在木兰开源社区中托管项目时,您是否需要基础架构(域名、邮箱、论坛等)请求 -- 需要 - -### 项目网站-您是否有网站?如果没有,您是否保留了一个域名,并希望您创建一个网站 -- 无 - -### 项目治理-您是否有该项目的有效治理模型?请提供URL到它的记录位置,通常是GOVERNANCE.md -- 无 - -### 社交媒体帐户-您是否有任何Twitter/Facebook/微博/公众号 -- 无 - -### 现有赞助(例如,是否有任何组织迄今为止提供了资金或其他支持,以及对该支持的描述) -- 无