Commit a05f66c

Idan707, Idan Benaun, and idan707 authored Dec 16, 2020
round 2 updates (#77)
* add function
* describe presto
* add describe spark notebook
* no need
* add iris csv to describe
* add iris dataset
* add readme
* add readme
* add readme and split fix
* add model
* fix ability to run stand alone
* add readme and update nb
* add .py
* update model server and readme
* add read me and iris parquet
* update model server test and readme
* remove describe spark
* update naming
* update to wasabi
* round 2 improvements
* update model tester
* round 2 updates
* update model server
* add sklearn classifier with dask
* change get header loction
* make classifier genreric
* Update README.md
* Update README.md
* add .py .yaml files again
* .DS_Store banished
* add .py, .yaml, spark describe
* add .py .yaml files
* update with pip install and .yaml
* add .yaml
* add .yaml .py files
* add .py .yaml files
* add readme and docstring
* update readme
* change work with dask init
* update functions - model server pending
* update model server
* update model servers

Co-authored-by: Idan Benaun <idanb@Idans-MacBook-Pro.local>
Co-authored-by: idan707 <lotart707@gmail.com>
1 parent 779a3c2 commit a05f66c

38 files changed: +6440 -763 lines changed

‎.gitignore

+1
@@ -8,6 +8,7 @@ images/
 .idea/
 .vscode/
 .empty/
+.DS_Store/
 # Byte-compiled / optimized / DLL files
 __pycache__/
 *.py[cod]

‎describe/README.md

+26-1
@@ -1 +1,26 @@
-# describe
+# Describe
+
+Get the table's summary statistics and summary plots
+
+The functions will require the following parameters:
+
+```markdown
+
+:param context: the function context
+:param table: MLRun input pointing to pandas dataframe (csv/parquet file path)
+:param label_column: ground truth column label
+:param class_labels: label for each class in tables and plots
+:param plot_hist: (True) set this to False for large tables
+:param plots_dest: destination folder of summary plots (relative to artifact_path)
+:param update_dataset: when the table is a registered dataset update the charts in-place
+
+```
+
+The function will output the following artifacts per column within the data frame (based on data types):
+
+1. histogram chart
+2. violin chart
+3. imbalance chart
+4. correlation-matrix chart
+5. correlation-matrix csv
+6. imbalance-weights-vec csv
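
Editor's sketch (not part of the commit): the `imbalance-weights-vec` artifact listed above is, as far as the encoded function source suggests, each class's share of the rows in `label_column`, ordered by class. The snippet below illustrates that idea with the standard library only. The helper name `imbalance_weights` and the sample labels are assumptions made for illustration; the committed function works on a pandas dataframe instead.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable


def imbalance_weights(labels: Iterable[Hashable]) -> Dict[Hashable, float]:
    """Return each class's proportion of the total, ordered by class value.

    Hypothetical helper for illustration; not the committed implementation.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    # sorted() fixes the class order so the output is stable across runs.
    return {cls: counts[cls] / total for cls in sorted(counts)}


# Made-up sample labels, not taken from the iris dataset in the commit.
weights = imbalance_weights(["setosa", "virginica", "setosa", "versicolor"])
print(weights)
```

A heavily skewed result (one class close to 1.0) is the kind of signal the imbalance chart is meant to surface.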

‎describe/describe.ipynb

+93-100
Large diffs are not rendered by default.

‎describe/describe.py

+1
@@ -13,6 +13,7 @@
 from mlrun.datastore import DataItem
 from mlrun.artifacts import PlotArtifact, TableArtifact
 from mlrun.mlutils import gcf_clear
+import mlrun
 
 from typing import List
 

‎describe/function.yaml

+4-4
@@ -2,7 +2,7 @@ kind: job
 metadata:
 name: describe
 tag: ''
-hash: d16383b0300193bddbf3095cbaf6747555eaa368
+hash: 8d69cb5fafe7a48795f5a2b043956cff9bba47c2
 project: default
 labels:
 author: yjb
@@ -48,9 +48,9 @@ spec:
 default: false
 outputs:
 - default: ''
-lineno: 21
+lineno: 22
 description: describe and visualizes dataset stats
 build:
54-
functionSourceCode: IyBHZW5lcmF0ZWQgYnkgbnVjbGlvLmV4cG9ydC5OdWNsaW9FeHBvcnRlcgoKaW1wb3J0IHdhcm5pbmdzCndhcm5pbmdzLnNpbXBsZWZpbHRlcihhY3Rpb249J2lnbm9yZScsIGNhdGVnb3J5PUZ1dHVyZVdhcm5pbmcpCgppbXBvcnQgb3MKaW1wb3J0IG51bXB5IGFzIG5wCmltcG9ydCBwYW5kYXMgYXMgcGQKaW1wb3J0IG1hdHBsb3RsaWIucHlwbG90IGFzIHBsdAppbXBvcnQgc2VhYm9ybiBhcyBzbnMKCmZyb20gbWxydW4uZXhlY3V0aW9uIGltcG9ydCBNTENsaWVudEN0eApmcm9tIG1scnVuLmRhdGFzdG9yZSBpbXBvcnQgRGF0YUl0ZW0KZnJvbSBtbHJ1bi5hcnRpZmFjdHMgaW1wb3J0IFBsb3RBcnRpZmFjdCwgVGFibGVBcnRpZmFjdApmcm9tIG1scnVuLm1sdXRpbHMgaW1wb3J0IGdjZl9jbGVhcgoKZnJvbSB0eXBpbmcgaW1wb3J0IExpc3QKCnBkLnNldF9vcHRpb24oImRpc3BsYXkuZmxvYXRfZm9ybWF0IiwgbGFtYmRhIHg6ICIlLjJmIiAlIHgpCgpkZWYgc3VtbWFyaXplKAogICAgY29udGV4dDogTUxDbGllbnRDdHgsCiAgICB0YWJsZTogRGF0YUl0ZW0sCiAgICBsYWJlbF9jb2x1bW46IHN0ciA9IE5vbmUsCiAgICBjbGFzc19sYWJlbHM6IExpc3Rbc3RyXSA9IFtdLAogICAgcGxvdF9oaXN0OiBib29sID0gVHJ1ZSwKICAgIHBsb3RzX2Rlc3Q6IHN0ciA9ICJwbG90cyIsCiAgICB1cGRhdGVfZGF0YXNldCA9IEZhbHNlLAopIC0+IE5vbmU6CiAgICAiIiJTdW1tYXJpemUgYSB0YWJsZQoKICAgIDpwYXJhbSBjb250ZXh0OiAgICAgICAgIHRoZSBmdW5jdGlvbiBjb250ZXh0CiAgICA6cGFyYW0gdGFibGU6ICAgICAgICAgICBNTFJ1biBpbnB1dCBwb2ludGluZyB0byBwYW5kYXMgZGF0YWZyYW1lIChjc3YvcGFycXVldCBmaWxlIHBhdGgpCiAgICA6cGFyYW0gbGFiZWxfY29sdW1uOiAgICBncm91bmQgdHJ1dGggY29sdW1uIGxhYmVsCiAgICA6cGFyYW0gY2xhc3NfbGFiZWxzOiAgICBsYWJlbCBmb3IgZWFjaCBjbGFzcyBpbiB0YWJsZXMgYW5kIHBsb3RzCiAgICA6cGFyYW0gcGxvdF9oaXN0OiAgICAgICAoVHJ1ZSkgc2V0IHRoaXMgdG8gRmFsc2UgZm9yIGxhcmdlIHRhYmxlcwogICAgOnBhcmFtIHBsb3RzX2Rlc3Q6ICAgICAgZGVzdGluYXRpb24gZm9sZGVyIG9mIHN1bW1hcnkgcGxvdHMgKHJlbGF0aXZlIHRvIGFydGlmYWN0X3BhdGgpCiAgICA6cGFyYW0gdXBkYXRlX2RhdGFzZXQ6ICB3aGVuIHRoZSB0YWJsZSBpcyBhIHJlZ2lzdGVyZWQgZGF0YXNldCB1cGRhdGUgdGhlIGNoYXJ0cyBpbi1wbGFjZSAKICAgICIiIgogICAgZGYgPSB0YWJsZS5hc19kZigpCiAgICBoZWFkZXIgPSBkZi5jb2x1bW5zLnZhbHVlcwogICAgZXh0cmFfZGF0YSA9IHt9CiAgICAKICAgIHRyeToKICAgICAgICBnY2ZfY2xlYXIocGx0KQogICAgICAgIHNuc3BsdCA9IHNucy5wYWlycGxvdChkZiwgaHVlPWxhYmVsX2NvbHVtbikjLCBkaWFnX2t3cz17ImJ3IjogMS41fSkKICAgICAgICBleHRyYV9kYXRhWyJoaXN0b2dyYW1zIl0g
PSBjb250ZXh0LmxvZ19hcnRpZmFjdChQbG90QXJ0aWZhY3QoImhpc3RvZ3JhbXMiLCAgYm9keT1wbHQuZ2NmKCkpLAogICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIGxvY2FsX3BhdGg9ZiJ7cGxvdHNfZGVzdH0vaGlzdC5odG1sIiwgZGJfa2V5PUZhbHNlKQogICAgZXhjZXB0IEV4Y2VwdGlvbiBhcyBlOgogICAgICAgIGNvbnRleHQubG9nZ2VyLmVycm9yKGYnRmFpbGVkIHRvIGNyZWF0ZSBwYWlycGxvdCBoaXN0b2dyYW1zIGR1ZSB0bzoge2V9JykKICAgIAogICAgdHJ5OgogICAgICAgIGdjZl9jbGVhcihwbHQpCiAgICAgICAgcGxvdF9jb2xzID0gMwogICAgICAgIHBsb3Rfcm93cyA9IGludCgobGVuKGhlYWRlcikgLSAxKSAvIHBsb3RfY29scykrMQogICAgICAgIGZpZywgYXggPSBwbHQuc3VicGxvdHMocGxvdF9yb3dzLCBwbG90X2NvbHMsIGZpZ3NpemU9KDE1LCA0KSkKICAgICAgICBmaWcudGlnaHRfbGF5b3V0KHBhZD0yLjApCiAgICAgICAgZm9yIGkgaW4gcmFuZ2UocGxvdF9yb3dzICogcGxvdF9jb2xzKToKICAgICAgICAgICAgaWYgaSA8IGxlbihoZWFkZXIpOgogICAgICAgICAgICAgICAgc25zLnZpb2xpbnBsb3QoeD1kZltoZWFkZXJbaV1dLCBheD1heFtpbnQoaSAvIHBsb3RfY29scyldW2kgJSBwbG90X2NvbHNdLCAKICAgICAgICAgICAgICAgICAgICAgICAgICAgIG9yaWVudD0naCcsIHdpZHRoPTAuNywgaW5uZXI9InF1YXJ0aWxlIikKICAgICAgICAgICAgZWxzZToKICAgICAgICAgICAgICAgIGZpZy5kZWxheGVzKGF4W2ludChpIC8gcGxvdF9jb2xzKV1baSAlIHBsb3RfY29sc10pICAgICAgICAKICAgICAgICAgICAgaSs9MQogICAgICAgIGV4dHJhX2RhdGFbInZpb2xpbiJdID0gY29udGV4dC5sb2dfYXJ0aWZhY3QoUGxvdEFydGlmYWN0KCJ2aW9saW4iLCAgYm9keT1wbHQuZ2NmKCksIHRpdGxlPSdWaW9saW4gUGxvdCcpLAogICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgbG9jYWxfcGF0aD1mIntwbG90c19kZXN0fS92aW9saW4uaHRtbCIsIGRiX2tleT1GYWxzZSkKICAgIGV4Y2VwdCBFeGNlcHRpb24gYXMgZToKICAgICAgICBjb250ZXh0LmxvZ2dlci53YXJuKGYnRmFpbGVkIHRvIGNyZWF0ZSB2aW9saW4gZGlzdHJpYnV0aW9uIHBsb3RzIGR1ZSB0bzoge2V9JykKCiAgICBpZiBsYWJlbF9jb2x1bW46IAogICAgICAgIGxhYmVscyA9IGRmLnBvcChsYWJlbF9jb2x1bW4pCiAgICAgICAgaW1idGFibGUgPSBsYWJlbHMudmFsdWVfY291bnRzKG5vcm1hbGl6ZT1UcnVlKS5zb3J0X2luZGV4KCkKICAgICAgICB0cnk6CiAgICAgICAgICAgIGdjZl9jbGVhcihwbHQpICAKICAgICAgICAgICAgYmFsYW5jZWJhciA9IGltYnRhYmxlLnBsb3Qoa2luZD0nYmFyJywgdGl0bGU9J2NsYXNzIGltYmFsYW5jZSAtIGxhYmVscycpCiAgICAgICAgICAgIGJhbGFuY2ViYXIuc2V0X3hsYWJlbCgnY2xhc3MnKQogICAgICAgICAgICBi
YWxhbmNlYmFyLnNldF95bGFiZWwoInByb3BvcnRpb24gb2YgdG90YWwiKQogICAgICAgICAgICBleHRyYV9kYXRhWyJpbWJhbGFuY2UiXSA9IGNvbnRleHQubG9nX2FydGlmYWN0KFBsb3RBcnRpZmFjdCgiaW1iYWxhbmNlIiwgYm9keT1wbHQuZ2NmKCkpLCAKICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICBsb2NhbF9wYXRoPWYie3Bsb3RzX2Rlc3R9L2ltYmFsYW5jZS5odG1sIikKICAgICAgICBleGNlcHQgRXhjZXB0aW9uIGFzIGU6CiAgICAgICAgICAgIGNvbnRleHQubG9nZ2VyLndhcm4oZidGYWlsZWQgdG8gY3JlYXRlIGNsYXNzIGltYmFsYW5jZSBwbG90IGR1ZSB0bzoge2V9JykKICAgICAgICBjb250ZXh0LmxvZ19hcnRpZmFjdChUYWJsZUFydGlmYWN0KCJpbWJhbGFuY2Utd2VpZ2h0cy12ZWMiLCAKICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIGRmPXBkLkRhdGFGcmFtZSh7IndlaWdodHMiOiBpbWJ0YWJsZX0pKSwKICAgICAgICAgICAgICAgICAgICAgICAgICAgICBsb2NhbF9wYXRoPWYie3Bsb3RzX2Rlc3R9L2ltYmFsYW5jZS13ZWlnaHRzLXZlYy5jc3YiLCBkYl9rZXk9RmFsc2UpCgogICAgdGJsY29yciA9IGRmLmNvcnIoKQogICAgbWFzayA9IG5wLnplcm9zX2xpa2UodGJsY29yciwgZHR5cGU9bnAuYm9vbCkKICAgIG1hc2tbbnAudHJpdV9pbmRpY2VzX2Zyb20obWFzayldID0gVHJ1ZQogICAgCiAgICBkZmNvcnIgPSBwZC5EYXRhRnJhbWUoZGF0YT10Ymxjb3JyLCBjb2x1bW5zPWhlYWRlciwgaW5kZXg9aGVhZGVyKQogICAgZGZjb3JyID0gZGZjb3JyW25wLmFyYW5nZShkZmNvcnIuc2hhcGVbMF0pWzosIE5vbmVdID4gbnAuYXJhbmdlKGRmY29yci5zaGFwZVsxXSldCiAgICBjb250ZXh0LmxvZ19hcnRpZmFjdChUYWJsZUFydGlmYWN0KCJjb3JyZWxhdGlvbi1tYXRyaXgiLCBkZj10Ymxjb3JyLCB2aXNpYmxlPVRydWUpLCAKICAgICAgICAgICAgICAgICAgICAgICAgIGxvY2FsX3BhdGg9ZiJ7cGxvdHNfZGVzdH0vY29ycmVsYXRpb24tbWF0cml4LmNzdiIsIGRiX2tleT1GYWxzZSkKICAgIAogICAgdHJ5OgogICAgICAgIGdjZl9jbGVhcihwbHQpCiAgICAgICAgYXggPSBwbHQuYXhlcygpCiAgICAgICAgc25zLmhlYXRtYXAodGJsY29yciwgYXg9YXgsIG1hc2s9bWFzaywgYW5ub3Q9RmFsc2UsIGNtYXA9cGx0LmNtLlJlZHMpCiAgICAgICAgYXguc2V0X3RpdGxlKCJmZWF0dXJlcyBjb3JyZWxhdGlvbiIpCiAgICAgICAgZXh0cmFfZGF0YVsiY29ycmVsYXRpb24iXSA9IGNvbnRleHQubG9nX2FydGlmYWN0KFBsb3RBcnRpZmFjdCgiY29ycmVsYXRpb24iLCAgYm9keT1wbHQuZ2NmKCksIHRpdGxlPSdDb3JyZWxhdGlvbiBNYXRyaXgnKSwKICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICBsb2NhbF9wYXRoPWYie3Bsb3RzX2Rlc3R9L2NvcnIuaHRtbCIsIGRiX2tleT1GYWxzZSkKICAgIGV4
Y2VwdCBFeGNlcHRpb24gYXMgZToKICAgICAgICAgICAgY29udGV4dC5sb2dnZXIud2FybihmJ0ZhaWxlZCB0byBjcmVhdGUgZmVhdHVyZXMgY29ycmVsYXRpb24gcGxvdCBkdWUgdG86IHtlfScpCiAgICAKCiAgICBnY2ZfY2xlYXIocGx0KQogICAgaWYgdXBkYXRlX2RhdGFzZXQgYW5kIHRhYmxlLm1ldGEgYW5kIHRhYmxlLm1ldGEua2luZCA9PSAnZGF0YXNldCc6CiAgICAgICAgZnJvbSBtbHJ1bi5hcnRpZmFjdHMgaW1wb3J0IHVwZGF0ZV9kYXRhc2V0X21ldGEKICAgICAgICB1cGRhdGVfZGF0YXNldF9tZXRhKHRhYmxlLm1ldGEsIGV4dHJhX2RhdGE9ZXh0cmFfZGF0YSkKICAgICAgICAKCg==
54+
functionSourceCode: IyBHZW5lcmF0ZWQgYnkgbnVjbGlvLmV4cG9ydC5OdWNsaW9FeHBvcnRlcgoKaW1wb3J0IHdhcm5pbmdzCndhcm5pbmdzLnNpbXBsZWZpbHRlcihhY3Rpb249J2lnbm9yZScsIGNhdGVnb3J5PUZ1dHVyZVdhcm5pbmcpCgppbXBvcnQgb3MKaW1wb3J0IG51bXB5IGFzIG5wCmltcG9ydCBwYW5kYXMgYXMgcGQKaW1wb3J0IG1hdHBsb3RsaWIucHlwbG90IGFzIHBsdAppbXBvcnQgc2VhYm9ybiBhcyBzbnMKCmZyb20gbWxydW4uZXhlY3V0aW9uIGltcG9ydCBNTENsaWVudEN0eApmcm9tIG1scnVuLmRhdGFzdG9yZSBpbXBvcnQgRGF0YUl0ZW0KZnJvbSBtbHJ1bi5hcnRpZmFjdHMgaW1wb3J0IFBsb3RBcnRpZmFjdCwgVGFibGVBcnRpZmFjdApmcm9tIG1scnVuLm1sdXRpbHMgaW1wb3J0IGdjZl9jbGVhcgppbXBvcnQgbWxydW4KCmZyb20gdHlwaW5nIGltcG9ydCBMaXN0CgpwZC5zZXRfb3B0aW9uKCJkaXNwbGF5LmZsb2F0X2Zvcm1hdCIsIGxhbWJkYSB4OiAiJS4yZiIgJSB4KQoKZGVmIHN1bW1hcml6ZSgKICAgIGNvbnRleHQ6IE1MQ2xpZW50Q3R4LAogICAgdGFibGU6IERhdGFJdGVtLAogICAgbGFiZWxfY29sdW1uOiBzdHIgPSBOb25lLAogICAgY2xhc3NfbGFiZWxzOiBMaXN0W3N0cl0gPSBbXSwKICAgIHBsb3RfaGlzdDogYm9vbCA9IFRydWUsCiAgICBwbG90c19kZXN0OiBzdHIgPSAicGxvdHMiLAogICAgdXBkYXRlX2RhdGFzZXQgPSBGYWxzZSwKKSAtPiBOb25lOgogICAgIiIiU3VtbWFyaXplIGEgdGFibGUKCiAgICA6cGFyYW0gY29udGV4dDogICAgICAgICB0aGUgZnVuY3Rpb24gY29udGV4dAogICAgOnBhcmFtIHRhYmxlOiAgICAgICAgICAgTUxSdW4gaW5wdXQgcG9pbnRpbmcgdG8gcGFuZGFzIGRhdGFmcmFtZSAoY3N2L3BhcnF1ZXQgZmlsZSBwYXRoKQogICAgOnBhcmFtIGxhYmVsX2NvbHVtbjogICAgZ3JvdW5kIHRydXRoIGNvbHVtbiBsYWJlbAogICAgOnBhcmFtIGNsYXNzX2xhYmVsczogICAgbGFiZWwgZm9yIGVhY2ggY2xhc3MgaW4gdGFibGVzIGFuZCBwbG90cwogICAgOnBhcmFtIHBsb3RfaGlzdDogICAgICAgKFRydWUpIHNldCB0aGlzIHRvIEZhbHNlIGZvciBsYXJnZSB0YWJsZXMKICAgIDpwYXJhbSBwbG90c19kZXN0OiAgICAgIGRlc3RpbmF0aW9uIGZvbGRlciBvZiBzdW1tYXJ5IHBsb3RzIChyZWxhdGl2ZSB0byBhcnRpZmFjdF9wYXRoKQogICAgOnBhcmFtIHVwZGF0ZV9kYXRhc2V0OiAgd2hlbiB0aGUgdGFibGUgaXMgYSByZWdpc3RlcmVkIGRhdGFzZXQgdXBkYXRlIHRoZSBjaGFydHMgaW4tcGxhY2UgCiAgICAiIiIKICAgIGRmID0gdGFibGUuYXNfZGYoKQogICAgaGVhZGVyID0gZGYuY29sdW1ucy52YWx1ZXMKICAgIGV4dHJhX2RhdGEgPSB7fQogICAgCiAgICB0cnk6CiAgICAgICAgZ2NmX2NsZWFyKHBsdCkKICAgICAgICBzbnNwbHQgPSBzbnMucGFpcnBsb3QoZGYsIGh1ZT1sYWJlbF9jb2x1bW4pIywgZGlhZ19rd3M9eyJidyI6IDEuNX0pCiAgICAgICAgZXh0cmFfZGF0YVsi
aGlzdG9ncmFtcyJdID0gY29udGV4dC5sb2dfYXJ0aWZhY3QoUGxvdEFydGlmYWN0KCJoaXN0b2dyYW1zIiwgIGJvZHk9cGx0LmdjZigpKSwKICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICBsb2NhbF9wYXRoPWYie3Bsb3RzX2Rlc3R9L2hpc3QuaHRtbCIsIGRiX2tleT1GYWxzZSkKICAgIGV4Y2VwdCBFeGNlcHRpb24gYXMgZToKICAgICAgICBjb250ZXh0LmxvZ2dlci5lcnJvcihmJ0ZhaWxlZCB0byBjcmVhdGUgcGFpcnBsb3QgaGlzdG9ncmFtcyBkdWUgdG86IHtlfScpCiAgICAKICAgIHRyeToKICAgICAgICBnY2ZfY2xlYXIocGx0KQogICAgICAgIHBsb3RfY29scyA9IDMKICAgICAgICBwbG90X3Jvd3MgPSBpbnQoKGxlbihoZWFkZXIpIC0gMSkgLyBwbG90X2NvbHMpKzEKICAgICAgICBmaWcsIGF4ID0gcGx0LnN1YnBsb3RzKHBsb3Rfcm93cywgcGxvdF9jb2xzLCBmaWdzaXplPSgxNSwgNCkpCiAgICAgICAgZmlnLnRpZ2h0X2xheW91dChwYWQ9Mi4wKQogICAgICAgIGZvciBpIGluIHJhbmdlKHBsb3Rfcm93cyAqIHBsb3RfY29scyk6CiAgICAgICAgICAgIGlmIGkgPCBsZW4oaGVhZGVyKToKICAgICAgICAgICAgICAgIHNucy52aW9saW5wbG90KHg9ZGZbaGVhZGVyW2ldXSwgYXg9YXhbaW50KGkgLyBwbG90X2NvbHMpXVtpICUgcGxvdF9jb2xzXSwgCiAgICAgICAgICAgICAgICAgICAgICAgICAgICBvcmllbnQ9J2gnLCB3aWR0aD0wLjcsIGlubmVyPSJxdWFydGlsZSIpCiAgICAgICAgICAgIGVsc2U6CiAgICAgICAgICAgICAgICBmaWcuZGVsYXhlcyhheFtpbnQoaSAvIHBsb3RfY29scyldW2kgJSBwbG90X2NvbHNdKSAgICAgICAgCiAgICAgICAgICAgIGkrPTEKICAgICAgICBleHRyYV9kYXRhWyJ2aW9saW4iXSA9IGNvbnRleHQubG9nX2FydGlmYWN0KFBsb3RBcnRpZmFjdCgidmlvbGluIiwgIGJvZHk9cGx0LmdjZigpLCB0aXRsZT0nVmlvbGluIFBsb3QnKSwKICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIGxvY2FsX3BhdGg9ZiJ7cGxvdHNfZGVzdH0vdmlvbGluLmh0bWwiLCBkYl9rZXk9RmFsc2UpCiAgICBleGNlcHQgRXhjZXB0aW9uIGFzIGU6CiAgICAgICAgY29udGV4dC5sb2dnZXIud2FybihmJ0ZhaWxlZCB0byBjcmVhdGUgdmlvbGluIGRpc3RyaWJ1dGlvbiBwbG90cyBkdWUgdG86IHtlfScpCgogICAgaWYgbGFiZWxfY29sdW1uOiAKICAgICAgICBsYWJlbHMgPSBkZi5wb3AobGFiZWxfY29sdW1uKQogICAgICAgIGltYnRhYmxlID0gbGFiZWxzLnZhbHVlX2NvdW50cyhub3JtYWxpemU9VHJ1ZSkuc29ydF9pbmRleCgpCiAgICAgICAgdHJ5OgogICAgICAgICAgICBnY2ZfY2xlYXIocGx0KSAgCiAgICAgICAgICAgIGJhbGFuY2ViYXIgPSBpbWJ0YWJsZS5wbG90KGtpbmQ9J2JhcicsIHRpdGxlPSdjbGFzcyBpbWJhbGFuY2UgLSBsYWJlbHMnKQogICAgICAgICAgICBiYWxhbmNlYmFyLnNldF94bGFiZWwoJ2NsYXNzJykK
ICAgICAgICAgICAgYmFsYW5jZWJhci5zZXRfeWxhYmVsKCJwcm9wb3J0aW9uIG9mIHRvdGFsIikKICAgICAgICAgICAgZXh0cmFfZGF0YVsiaW1iYWxhbmNlIl0gPSBjb250ZXh0LmxvZ19hcnRpZmFjdChQbG90QXJ0aWZhY3QoImltYmFsYW5jZSIsIGJvZHk9cGx0LmdjZigpKSwgCiAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgbG9jYWxfcGF0aD1mIntwbG90c19kZXN0fS9pbWJhbGFuY2UuaHRtbCIpCiAgICAgICAgZXhjZXB0IEV4Y2VwdGlvbiBhcyBlOgogICAgICAgICAgICBjb250ZXh0LmxvZ2dlci53YXJuKGYnRmFpbGVkIHRvIGNyZWF0ZSBjbGFzcyBpbWJhbGFuY2UgcGxvdCBkdWUgdG86IHtlfScpCiAgICAgICAgY29udGV4dC5sb2dfYXJ0aWZhY3QoVGFibGVBcnRpZmFjdCgiaW1iYWxhbmNlLXdlaWdodHMtdmVjIiwgCiAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICBkZj1wZC5EYXRhRnJhbWUoeyJ3ZWlnaHRzIjogaW1idGFibGV9KSksCiAgICAgICAgICAgICAgICAgICAgICAgICAgICAgbG9jYWxfcGF0aD1mIntwbG90c19kZXN0fS9pbWJhbGFuY2Utd2VpZ2h0cy12ZWMuY3N2IiwgZGJfa2V5PUZhbHNlKQoKICAgIHRibGNvcnIgPSBkZi5jb3JyKCkKICAgIG1hc2sgPSBucC56ZXJvc19saWtlKHRibGNvcnIsIGR0eXBlPW5wLmJvb2wpCiAgICBtYXNrW25wLnRyaXVfaW5kaWNlc19mcm9tKG1hc2spXSA9IFRydWUKICAgIAogICAgZGZjb3JyID0gcGQuRGF0YUZyYW1lKGRhdGE9dGJsY29yciwgY29sdW1ucz1oZWFkZXIsIGluZGV4PWhlYWRlcikKICAgIGRmY29yciA9IGRmY29ycltucC5hcmFuZ2UoZGZjb3JyLnNoYXBlWzBdKVs6LCBOb25lXSA+IG5wLmFyYW5nZShkZmNvcnIuc2hhcGVbMV0pXQogICAgY29udGV4dC5sb2dfYXJ0aWZhY3QoVGFibGVBcnRpZmFjdCgiY29ycmVsYXRpb24tbWF0cml4IiwgZGY9dGJsY29yciwgdmlzaWJsZT1UcnVlKSwgCiAgICAgICAgICAgICAgICAgICAgICAgICBsb2NhbF9wYXRoPWYie3Bsb3RzX2Rlc3R9L2NvcnJlbGF0aW9uLW1hdHJpeC5jc3YiLCBkYl9rZXk9RmFsc2UpCiAgICAKICAgIHRyeToKICAgICAgICBnY2ZfY2xlYXIocGx0KQogICAgICAgIGF4ID0gcGx0LmF4ZXMoKQogICAgICAgIHNucy5oZWF0bWFwKHRibGNvcnIsIGF4PWF4LCBtYXNrPW1hc2ssIGFubm90PUZhbHNlLCBjbWFwPXBsdC5jbS5SZWRzKQogICAgICAgIGF4LnNldF90aXRsZSgiZmVhdHVyZXMgY29ycmVsYXRpb24iKQogICAgICAgIGV4dHJhX2RhdGFbImNvcnJlbGF0aW9uIl0gPSBjb250ZXh0LmxvZ19hcnRpZmFjdChQbG90QXJ0aWZhY3QoImNvcnJlbGF0aW9uIiwgIGJvZHk9cGx0LmdjZigpLCB0aXRsZT0nQ29ycmVsYXRpb24gTWF0cml4JyksCiAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgbG9jYWxfcGF0aD1mIntwbG90c19kZXN0fS9jb3JyLmh0bWwiLCBkYl9rZXk9
RmFsc2UpCiAgICBleGNlcHQgRXhjZXB0aW9uIGFzIGU6CiAgICAgICAgICAgIGNvbnRleHQubG9nZ2VyLndhcm4oZidGYWlsZWQgdG8gY3JlYXRlIGZlYXR1cmVzIGNvcnJlbGF0aW9uIHBsb3QgZHVlIHRvOiB7ZX0nKQogICAgCgogICAgZ2NmX2NsZWFyKHBsdCkKICAgIGlmIHVwZGF0ZV9kYXRhc2V0IGFuZCB0YWJsZS5tZXRhIGFuZCB0YWJsZS5tZXRhLmtpbmQgPT0gJ2RhdGFzZXQnOgogICAgICAgIGZyb20gbWxydW4uYXJ0aWZhY3RzIGltcG9ydCB1cGRhdGVfZGF0YXNldF9tZXRhCiAgICAgICAgdXBkYXRlX2RhdGFzZXRfbWV0YSh0YWJsZS5tZXRhLCBleHRyYV9kYXRhPWV4dHJhX2RhdGEpCiAgICAgICAgCgo=
 commands: []
-code_origin: https://github.com/mlrun/functions#33ca010bd29d557802f88f2c5c3bd2f289452cc4:describe.ipynb
+code_origin: https://github.com/Idan707/functions.git#b113aaa99964591e9e2500de0411a6c0029fbe05:describe.ipynb
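
Editor's sketch (not part of the commit): the `functionSourceCode` values in this file are base64 text, so the two versions are easier to compare once decoded. The snippet below shows one way to do that with the standard library. The helper name `decode_function_source` is an assumption, and `sample` holds only the first 60 characters of the value from the diff, not the whole field.

```python
import base64


def decode_function_source(encoded: str) -> str:
    """Decode base64 text to UTF-8, tolerating spaces and line breaks.

    Hypothetical helper for illustration; not part of MLRun.
    """
    # The page wraps the value across several lines, so drop all whitespace
    # before decoding.
    compact = "".join(encoded.split())
    return base64.b64decode(compact, validate=True).decode("utf-8")


# First 60 characters of the functionSourceCode value shown in this diff.
sample = "IyBHZW5lcmF0ZWQgYnkgbnVjbGlvLmV4cG9ydC5OdWNsaW9FeHBvcnRlcgoK"
decoded = decode_function_source(sample)
print(decoded)
```

Decoding both the removed and the added value in full and diffing the results should show the source-level change, which the `describe.py` hunk suggests is the added `import mlrun` line.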

‎describe_spark/README.md

+86
@@ -0,0 +1,86 @@
+# WIP - Spark Describe Function with MLRun (non-sparkoperator)
+
+## Run .py file Using Spark
+### Steps:
+1. Deploy spark-operator on the cluster (create service from dashboard).
+This is required at this stage in order to create a configmap for the daemon.
+2. In Jupyter:
+Save the followin code under my.py in fuse (in this case /v3io/users/admin/my.py):
+
+```python
+#!/usr/local/bin/python
+
+# Locate v3iod:
+from subprocess import run
+run(["/bin/bash", "/etc/config/v3io/v3io-spark-operator.sh"])
+
+# The pyspark code:
+import os
+from pyspark.sql import SparkSession
+os.environ["PYSPARK_SUBMIT_ARGS"] = "--packages mysql:mysql-connector-java:5.1.39 pyspark-shell"
+
+spark = (SparkSession.builder.appName("Spark JDBC to Databases - ipynb")
+.config("spark.driver.extraClassPath", "/v3io/users/admin/mysql-connector-java-5.1.45.jar")
+.config("spark.executor.extraClassPath", "/v3io/users/admin/mysql-connector-java-5.1.45.jar")
+.getOrCreate())
+
+dfMySQL = (spark.read.format("jdbc")
+.option("url", "jdbc:mysql://mysql-rfam-public.ebi.ac.uk:4497/Rfam")
+.option("dbtable", "Rfam.family")
+.option("user", "rfamro")
+.option("password", "")
+.option("driver", "com.mysql.jdbc.Driver")
+.load())
+
+dfMySQL.write.format("io.iguaz.v3io.spark.sql.kv").mode("overwrite").option("key", "rfam_id").save("v3io://users/admin/frommysql")
+
+spark.stop()
+```
+
+3. Make sure that your script has execution permissions.
+4. Execute the following block in a notebook:
+
+```python
+from mlrun import new_function
+from mlrun.platforms.iguazio import mount_v3io, mount_v3iod
+import os
+image_name = 'iguazio/shell:' + os.environ.get("IGZ_VERSION")
+run = new_function(name='my-spark', image=image_name , command='/v3io/users/admin/my.py', kind='job', mode='pass')
+run.apply(mount_v3io(name="v3io-fuse", remote="/", mount_path="/v3io"))
+run.apply(mount_v3iod(namespace="default-tenant", v3io_config_configmap="spark-operator-v3io-config"))
+run.run(artifact_path="/User/artifacts")
+```
+---
+
+## Create Simple Read CSV Function Using Spark
+Please refer to the read_csv_spark notebook
+
+---
+
+## Create Describe Function Using Spark
+Generates profile reports from an Apache Spark DataFrame.
+Based on pandas_profiling, but for Spark's DataFrames instead of pandas.
+
+For each column the following statistics - if relevant for the column type - are presented:
+
+* `Essentials:` type, unique values, missing values
+* `Quantile statistics:` minimum value, Q1, median, Q3, maximum, range, interquartile range
+* `Descriptive statistics:` mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
+* `Most frequent values:` for categorical data
+
+```
+Function params
+
+:param context: Function context.
+:param dataset: Raw data file (currently needs to be a local file located in v3io://User/bigdata)
+:param bins: Number of bin in histograms
+:param describe_extended: (True) set to False if the aim is to get a simple .describe() infomration
+```
+
+* All operations are done efficiently, which means that **no** Python UDFs or .map() transformations are used at all;
+* only Spark SQL's Catalyst is used for the retrieval of all statistics.
+
+---
+### TODO:
+1. Add plots
+2. Add ability to generte html report
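
Editor's sketch (not part of the commit): the "Quantile statistics" entry in the README above lists minimum, Q1, median, Q3, maximum, range, and interquartile range. The snippet below shows how those values relate to each other for one numeric column, using plain Python only. It is not how the committed function computes them (the README says it relies on Spark SQL); the helper name `quantile_summary` and the sample values are assumptions made for illustration.

```python
import statistics
from typing import Dict, Sequence


def quantile_summary(values: Sequence[float]) -> Dict[str, float]:
    """Quantile statistics for one numeric column (needs at least two values).

    Hypothetical helper for illustration; uses the default 'exclusive'
    method of statistics.quantiles, which may differ slightly from the
    approximate quantiles a Spark job would return.
    """
    q1, median, q3 = statistics.quantiles(values, n=4)
    low, high = min(values), max(values)
    return {
        "min": low,
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": high,
        "range": high - low,   # maximum minus minimum
        "iqr": q3 - q1,        # interquartile range
    }


# Made-up sample column.
summary = quantile_summary([1, 2, 3, 4, 5, 6, 7, 8, 9])
print(summary)
```

The range and interquartile range are derived values, so a report only needs the five quantile points to reproduce them.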
