round 2 updates #77

Merged
merged 44 commits into from
Dec 16, 2020
Commits
a45b9ba
add function
Idan707 Oct 12, 2020
903f48d
describe presto
Idan707 Oct 12, 2020
396f2f6
add describe spark notebook
Oct 13, 2020
04bb72a
Merge branch 'master' of https://github.com/Idan707/functions into ma…
Oct 13, 2020
b5c49f1
no need
Oct 13, 2020
6409eca
add iris csv to describe
Oct 15, 2020
04a45af
add iris dataset
Oct 15, 2020
bc6b4bc
add readme
Oct 15, 2020
d688c10
add readme
Oct 15, 2020
eb12951
add readme and split fix
Oct 15, 2020
8a508cd
add model
Oct 15, 2020
f287c68
fix ability to run stand alone
Oct 15, 2020
a8c87b6
add readme and update nb
Oct 15, 2020
f2eceb7
add .py
Oct 15, 2020
7618037
update model server and readme
Oct 16, 2020
7a86d49
add read me and iris parquet
Oct 17, 2020
7bd0b57
update model server test and readme
Oct 17, 2020
cb0f048
remove describe spark
Oct 18, 2020
c580708
update naming
Oct 18, 2020
64e22a3
Merge branch 'master' of https://github.com/Idan707/functions into fu…
Oct 18, 2020
abb8d0b
update to wasabi
Oct 18, 2020
380d3a8
round 2 improvements
Oct 18, 2020
7608d5e
update model tester
Oct 19, 2020
a36bfe8
round 2 updates
Oct 19, 2020
b0bbb46
update model server
Oct 19, 2020
114f8c6
add sklearn classifier with dask
Oct 25, 2020
dcfaaa2
change get header loction
Oct 25, 2020
7c1c8d2
make classifier genreric
Oct 26, 2020
3d7d4f7
Update README.md
Idan707 Oct 26, 2020
a68e6f7
Update README.md
Idan707 Oct 26, 2020
30812b3
add .py .yaml files again
Idan707 Oct 28, 2020
156b414
.DS_Store banished
Idan707 Oct 28, 2020
877277c
add .py, .yaml, spark describe
Idan707 Oct 28, 2020
a62c798
add .py .yaml files
Idan707 Oct 28, 2020
7175ca7
update with pip install and .yaml
Idan707 Oct 28, 2020
1513dcd
add .yaml
Idan707 Oct 28, 2020
a0e559d
add .yaml .py files
Idan707 Oct 28, 2020
eaff08c
add .py .yaml files
Idan707 Oct 28, 2020
9c28177
add readme and docstring
Idan707 Oct 29, 2020
16f89cb
update readme
Idan707 Oct 29, 2020
b113aaa
change work with dask init
Nov 23, 2020
c5ad89a
update functions - model server pending
Dec 6, 2020
d6ffb94
update model server
Idan707 Dec 6, 2020
6cc0754
update model servers
Idan707 Dec 6, 2020
1 change: 1 addition & 0 deletions .gitignore
@@ -8,6 +8,7 @@ images/
.idea/
.vscode/
.empty/
.DS_Store/
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
27 changes: 26 additions & 1 deletion describe/README.md
@@ -1 +1,26 @@
# describe
# Describe

Get the table's summary statistics and summary plots

The function takes the following parameters:

```markdown

:param context: the function context
:param table: MLRun input pointing to pandas dataframe (csv/parquet file path)
:param label_column: ground truth column label
:param class_labels: label for each class in tables and plots
:param plot_hist: (True) set this to False for large tables
:param plots_dest: destination folder of summary plots (relative to artifact_path)
:param update_dataset: when the table is a registered dataset update the charts in-place

```

The function will output the following artifacts per column within the data frame (based on data types):

1. histogram chart
2. violin chart
3. imbalance chart
4. correlation-matrix chart
5. correlation-matrix csv
6. imbalance-weights-vec csv
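
As a rough illustration of what the `imbalance-weights-vec` artifact listed above contains, the sketch below computes the share of rows per class for a label column, ordered by class value. It uses only the Python standard library rather than pandas, so it is not the function's own code; the helper name `class_proportions` and the sample labels are assumptions made for this example.

```python
from collections import Counter


def class_proportions(labels):
    """Return {class: share of rows}, ordered by class value.

    Hypothetical helper for illustration; it mirrors the idea of a
    normalized value count, not the describe function's implementation.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        # An empty label column has no classes to report.
        return {}
    return {cls: counts[cls] / total for cls in sorted(counts)}


# Hypothetical label column with three classes.
labels = ["setosa", "versicolor", "setosa", "virginica", "setosa", "versicolor"]
weights = class_proportions(labels)
print(weights)
```

A strongly skewed result (one class holding most of the weight) is the situation the imbalance chart is meant to make visible.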
193 changes: 93 additions & 100 deletions describe/describe.ipynb

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions describe/describe.py
@@ -13,6 +13,7 @@
from mlrun.datastore import DataItem
from mlrun.artifacts import PlotArtifact, TableArtifact
from mlrun.mlutils import gcf_clear
import mlrun

from typing import List

8 changes: 4 additions & 4 deletions describe/function.yaml
@@ -2,7 +2,7 @@ kind: job
metadata:
name: describe
tag: ''
hash: d16383b0300193bddbf3095cbaf6747555eaa368
hash: 8d69cb5fafe7a48795f5a2b043956cff9bba47c2
project: default
labels:
author: yjb
@@ -48,9 +48,9 @@ spec:
default: false
outputs:
- default: ''
lineno: 21
lineno: 22
description: describe and visualizes dataset stats
build:
functionSourceCode: IyBHZW5lcmF0ZWQgYnkgbnVjbGlvLmV4cG9ydC5OdWNsaW9FeHBvcnRlcgoKaW1wb3J0IHdhcm5pbmdzCndhcm5pbmdzLnNpbXBsZWZpbHRlcihhY3Rpb249J2lnbm9yZScsIGNhdGVnb3J5PUZ1dHVyZVdhcm5pbmcpCgppbXBvcnQgb3MKaW1wb3J0IG51bXB5IGFzIG5wCmltcG9ydCBwYW5kYXMgYXMgcGQKaW1wb3J0IG1hdHBsb3RsaWIucHlwbG90IGFzIHBsdAppbXBvcnQgc2VhYm9ybiBhcyBzbnMKCmZyb20gbWxydW4uZXhlY3V0aW9uIGltcG9ydCBNTENsaWVudEN0eApmcm9tIG1scnVuLmRhdGFzdG9yZSBpbXBvcnQgRGF0YUl0ZW0KZnJvbSBtbHJ1bi5hcnRpZmFjdHMgaW1wb3J0IFBsb3RBcnRpZmFjdCwgVGFibGVBcnRpZmFjdApmcm9tIG1scnVuLm1sdXRpbHMgaW1wb3J0IGdjZl9jbGVhcgoKZnJvbSB0eXBpbmcgaW1wb3J0IExpc3QKCnBkLnNldF9vcHRpb24oImRpc3BsYXkuZmxvYXRfZm9ybWF0IiwgbGFtYmRhIHg6ICIlLjJmIiAlIHgpCgpkZWYgc3VtbWFyaXplKAogICAgY29udGV4dDogTUxDbGllbnRDdHgsCiAgICB0YWJsZTogRGF0YUl0ZW0sCiAgICBsYWJlbF9jb2x1bW46IHN0ciA9IE5vbmUsCiAgICBjbGFzc19sYWJlbHM6IExpc3Rbc3RyXSA9IFtdLAogICAgcGxvdF9oaXN0OiBib29sID0gVHJ1ZSwKICAgIHBsb3RzX2Rlc3Q6IHN0ciA9ICJwbG90cyIsCiAgICB1cGRhdGVfZGF0YXNldCA9IEZhbHNlLAopIC0+IE5vbmU6CiAgICAiIiJTdW1tYXJpemUgYSB0YWJsZQoKICAgIDpwYXJhbSBjb250ZXh0OiAgICAgICAgIHRoZSBmdW5jdGlvbiBjb250ZXh0CiAgICA6cGFyYW0gdGFibGU6ICAgICAgICAgICBNTFJ1biBpbnB1dCBwb2ludGluZyB0byBwYW5kYXMgZGF0YWZyYW1lIChjc3YvcGFycXVldCBmaWxlIHBhdGgpCiAgICA6cGFyYW0gbGFiZWxfY29sdW1uOiAgICBncm91bmQgdHJ1dGggY29sdW1uIGxhYmVsCiAgICA6cGFyYW0gY2xhc3NfbGFiZWxzOiAgICBsYWJlbCBmb3IgZWFjaCBjbGFzcyBpbiB0YWJsZXMgYW5kIHBsb3RzCiAgICA6cGFyYW0gcGxvdF9oaXN0OiAgICAgICAoVHJ1ZSkgc2V0IHRoaXMgdG8gRmFsc2UgZm9yIGxhcmdlIHRhYmxlcwogICAgOnBhcmFtIHBsb3RzX2Rlc3Q6ICAgICAgZGVzdGluYXRpb24gZm9sZGVyIG9mIHN1bW1hcnkgcGxvdHMgKHJlbGF0aXZlIHRvIGFydGlmYWN0X3BhdGgpCiAgICA6cGFyYW0gdXBkYXRlX2RhdGFzZXQ6ICB3aGVuIHRoZSB0YWJsZSBpcyBhIHJlZ2lzdGVyZWQgZGF0YXNldCB1cGRhdGUgdGhlIGNoYXJ0cyBpbi1wbGFjZSAKICAgICIiIgogICAgZGYgPSB0YWJsZS5hc19kZigpCiAgICBoZWFkZXIgPSBkZi5jb2x1bW5zLnZhbHVlcwogICAgZXh0cmFfZGF0YSA9IHt9CiAgICAKICAgIHRyeToKICAgICAgICBnY2ZfY2xlYXIocGx0KQogICAgICAgIHNuc3BsdCA9IHNucy5wYWlycGxvdChkZiwgaHVlPWxhYmVsX2NvbHVtbikjLCBkaWFnX2t3cz17ImJ3IjogMS41fSkKICAgICAgICBleHRyYV9kYXRhWyJoaXN0b2dyYW1zIl0g
PSBjb250ZXh0LmxvZ19hcnRpZmFjdChQbG90QXJ0aWZhY3QoImhpc3RvZ3JhbXMiLCAgYm9keT1wbHQuZ2NmKCkpLAogICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIGxvY2FsX3BhdGg9ZiJ7cGxvdHNfZGVzdH0vaGlzdC5odG1sIiwgZGJfa2V5PUZhbHNlKQogICAgZXhjZXB0IEV4Y2VwdGlvbiBhcyBlOgogICAgICAgIGNvbnRleHQubG9nZ2VyLmVycm9yKGYnRmFpbGVkIHRvIGNyZWF0ZSBwYWlycGxvdCBoaXN0b2dyYW1zIGR1ZSB0bzoge2V9JykKICAgIAogICAgdHJ5OgogICAgICAgIGdjZl9jbGVhcihwbHQpCiAgICAgICAgcGxvdF9jb2xzID0gMwogICAgICAgIHBsb3Rfcm93cyA9IGludCgobGVuKGhlYWRlcikgLSAxKSAvIHBsb3RfY29scykrMQogICAgICAgIGZpZywgYXggPSBwbHQuc3VicGxvdHMocGxvdF9yb3dzLCBwbG90X2NvbHMsIGZpZ3NpemU9KDE1LCA0KSkKICAgICAgICBmaWcudGlnaHRfbGF5b3V0KHBhZD0yLjApCiAgICAgICAgZm9yIGkgaW4gcmFuZ2UocGxvdF9yb3dzICogcGxvdF9jb2xzKToKICAgICAgICAgICAgaWYgaSA8IGxlbihoZWFkZXIpOgogICAgICAgICAgICAgICAgc25zLnZpb2xpbnBsb3QoeD1kZltoZWFkZXJbaV1dLCBheD1heFtpbnQoaSAvIHBsb3RfY29scyldW2kgJSBwbG90X2NvbHNdLCAKICAgICAgICAgICAgICAgICAgICAgICAgICAgIG9yaWVudD0naCcsIHdpZHRoPTAuNywgaW5uZXI9InF1YXJ0aWxlIikKICAgICAgICAgICAgZWxzZToKICAgICAgICAgICAgICAgIGZpZy5kZWxheGVzKGF4W2ludChpIC8gcGxvdF9jb2xzKV1baSAlIHBsb3RfY29sc10pICAgICAgICAKICAgICAgICAgICAgaSs9MQogICAgICAgIGV4dHJhX2RhdGFbInZpb2xpbiJdID0gY29udGV4dC5sb2dfYXJ0aWZhY3QoUGxvdEFydGlmYWN0KCJ2aW9saW4iLCAgYm9keT1wbHQuZ2NmKCksIHRpdGxlPSdWaW9saW4gUGxvdCcpLAogICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgbG9jYWxfcGF0aD1mIntwbG90c19kZXN0fS92aW9saW4uaHRtbCIsIGRiX2tleT1GYWxzZSkKICAgIGV4Y2VwdCBFeGNlcHRpb24gYXMgZToKICAgICAgICBjb250ZXh0LmxvZ2dlci53YXJuKGYnRmFpbGVkIHRvIGNyZWF0ZSB2aW9saW4gZGlzdHJpYnV0aW9uIHBsb3RzIGR1ZSB0bzoge2V9JykKCiAgICBpZiBsYWJlbF9jb2x1bW46IAogICAgICAgIGxhYmVscyA9IGRmLnBvcChsYWJlbF9jb2x1bW4pCiAgICAgICAgaW1idGFibGUgPSBsYWJlbHMudmFsdWVfY291bnRzKG5vcm1hbGl6ZT1UcnVlKS5zb3J0X2luZGV4KCkKICAgICAgICB0cnk6CiAgICAgICAgICAgIGdjZl9jbGVhcihwbHQpICAKICAgICAgICAgICAgYmFsYW5jZWJhciA9IGltYnRhYmxlLnBsb3Qoa2luZD0nYmFyJywgdGl0bGU9J2NsYXNzIGltYmFsYW5jZSAtIGxhYmVscycpCiAgICAgICAgICAgIGJhbGFuY2ViYXIuc2V0X3hsYWJlbCgnY2xhc3MnKQogICAgICAgICAgICBi
YWxhbmNlYmFyLnNldF95bGFiZWwoInByb3BvcnRpb24gb2YgdG90YWwiKQogICAgICAgICAgICBleHRyYV9kYXRhWyJpbWJhbGFuY2UiXSA9IGNvbnRleHQubG9nX2FydGlmYWN0KFBsb3RBcnRpZmFjdCgiaW1iYWxhbmNlIiwgYm9keT1wbHQuZ2NmKCkpLCAKICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICBsb2NhbF9wYXRoPWYie3Bsb3RzX2Rlc3R9L2ltYmFsYW5jZS5odG1sIikKICAgICAgICBleGNlcHQgRXhjZXB0aW9uIGFzIGU6CiAgICAgICAgICAgIGNvbnRleHQubG9nZ2VyLndhcm4oZidGYWlsZWQgdG8gY3JlYXRlIGNsYXNzIGltYmFsYW5jZSBwbG90IGR1ZSB0bzoge2V9JykKICAgICAgICBjb250ZXh0LmxvZ19hcnRpZmFjdChUYWJsZUFydGlmYWN0KCJpbWJhbGFuY2Utd2VpZ2h0cy12ZWMiLCAKICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIGRmPXBkLkRhdGFGcmFtZSh7IndlaWdodHMiOiBpbWJ0YWJsZX0pKSwKICAgICAgICAgICAgICAgICAgICAgICAgICAgICBsb2NhbF9wYXRoPWYie3Bsb3RzX2Rlc3R9L2ltYmFsYW5jZS13ZWlnaHRzLXZlYy5jc3YiLCBkYl9rZXk9RmFsc2UpCgogICAgdGJsY29yciA9IGRmLmNvcnIoKQogICAgbWFzayA9IG5wLnplcm9zX2xpa2UodGJsY29yciwgZHR5cGU9bnAuYm9vbCkKICAgIG1hc2tbbnAudHJpdV9pbmRpY2VzX2Zyb20obWFzayldID0gVHJ1ZQogICAgCiAgICBkZmNvcnIgPSBwZC5EYXRhRnJhbWUoZGF0YT10Ymxjb3JyLCBjb2x1bW5zPWhlYWRlciwgaW5kZXg9aGVhZGVyKQogICAgZGZjb3JyID0gZGZjb3JyW25wLmFyYW5nZShkZmNvcnIuc2hhcGVbMF0pWzosIE5vbmVdID4gbnAuYXJhbmdlKGRmY29yci5zaGFwZVsxXSldCiAgICBjb250ZXh0LmxvZ19hcnRpZmFjdChUYWJsZUFydGlmYWN0KCJjb3JyZWxhdGlvbi1tYXRyaXgiLCBkZj10Ymxjb3JyLCB2aXNpYmxlPVRydWUpLCAKICAgICAgICAgICAgICAgICAgICAgICAgIGxvY2FsX3BhdGg9ZiJ7cGxvdHNfZGVzdH0vY29ycmVsYXRpb24tbWF0cml4LmNzdiIsIGRiX2tleT1GYWxzZSkKICAgIAogICAgdHJ5OgogICAgICAgIGdjZl9jbGVhcihwbHQpCiAgICAgICAgYXggPSBwbHQuYXhlcygpCiAgICAgICAgc25zLmhlYXRtYXAodGJsY29yciwgYXg9YXgsIG1hc2s9bWFzaywgYW5ub3Q9RmFsc2UsIGNtYXA9cGx0LmNtLlJlZHMpCiAgICAgICAgYXguc2V0X3RpdGxlKCJmZWF0dXJlcyBjb3JyZWxhdGlvbiIpCiAgICAgICAgZXh0cmFfZGF0YVsiY29ycmVsYXRpb24iXSA9IGNvbnRleHQubG9nX2FydGlmYWN0KFBsb3RBcnRpZmFjdCgiY29ycmVsYXRpb24iLCAgYm9keT1wbHQuZ2NmKCksIHRpdGxlPSdDb3JyZWxhdGlvbiBNYXRyaXgnKSwKICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICBsb2NhbF9wYXRoPWYie3Bsb3RzX2Rlc3R9L2NvcnIuaHRtbCIsIGRiX2tleT1GYWxzZSkKICAgIGV4
Y2VwdCBFeGNlcHRpb24gYXMgZToKICAgICAgICAgICAgY29udGV4dC5sb2dnZXIud2FybihmJ0ZhaWxlZCB0byBjcmVhdGUgZmVhdHVyZXMgY29ycmVsYXRpb24gcGxvdCBkdWUgdG86IHtlfScpCiAgICAKCiAgICBnY2ZfY2xlYXIocGx0KQogICAgaWYgdXBkYXRlX2RhdGFzZXQgYW5kIHRhYmxlLm1ldGEgYW5kIHRhYmxlLm1ldGEua2luZCA9PSAnZGF0YXNldCc6CiAgICAgICAgZnJvbSBtbHJ1bi5hcnRpZmFjdHMgaW1wb3J0IHVwZGF0ZV9kYXRhc2V0X21ldGEKICAgICAgICB1cGRhdGVfZGF0YXNldF9tZXRhKHRhYmxlLm1ldGEsIGV4dHJhX2RhdGE9ZXh0cmFfZGF0YSkKICAgICAgICAKCg==
functionSourceCode: IyBHZW5lcmF0ZWQgYnkgbnVjbGlvLmV4cG9ydC5OdWNsaW9FeHBvcnRlcgoKaW1wb3J0IHdhcm5pbmdzCndhcm5pbmdzLnNpbXBsZWZpbHRlcihhY3Rpb249J2lnbm9yZScsIGNhdGVnb3J5PUZ1dHVyZVdhcm5pbmcpCgppbXBvcnQgb3MKaW1wb3J0IG51bXB5IGFzIG5wCmltcG9ydCBwYW5kYXMgYXMgcGQKaW1wb3J0IG1hdHBsb3RsaWIucHlwbG90IGFzIHBsdAppbXBvcnQgc2VhYm9ybiBhcyBzbnMKCmZyb20gbWxydW4uZXhlY3V0aW9uIGltcG9ydCBNTENsaWVudEN0eApmcm9tIG1scnVuLmRhdGFzdG9yZSBpbXBvcnQgRGF0YUl0ZW0KZnJvbSBtbHJ1bi5hcnRpZmFjdHMgaW1wb3J0IFBsb3RBcnRpZmFjdCwgVGFibGVBcnRpZmFjdApmcm9tIG1scnVuLm1sdXRpbHMgaW1wb3J0IGdjZl9jbGVhcgppbXBvcnQgbWxydW4KCmZyb20gdHlwaW5nIGltcG9ydCBMaXN0CgpwZC5zZXRfb3B0aW9uKCJkaXNwbGF5LmZsb2F0X2Zvcm1hdCIsIGxhbWJkYSB4OiAiJS4yZiIgJSB4KQoKZGVmIHN1bW1hcml6ZSgKICAgIGNvbnRleHQ6IE1MQ2xpZW50Q3R4LAogICAgdGFibGU6IERhdGFJdGVtLAogICAgbGFiZWxfY29sdW1uOiBzdHIgPSBOb25lLAogICAgY2xhc3NfbGFiZWxzOiBMaXN0W3N0cl0gPSBbXSwKICAgIHBsb3RfaGlzdDogYm9vbCA9IFRydWUsCiAgICBwbG90c19kZXN0OiBzdHIgPSAicGxvdHMiLAogICAgdXBkYXRlX2RhdGFzZXQgPSBGYWxzZSwKKSAtPiBOb25lOgogICAgIiIiU3VtbWFyaXplIGEgdGFibGUKCiAgICA6cGFyYW0gY29udGV4dDogICAgICAgICB0aGUgZnVuY3Rpb24gY29udGV4dAogICAgOnBhcmFtIHRhYmxlOiAgICAgICAgICAgTUxSdW4gaW5wdXQgcG9pbnRpbmcgdG8gcGFuZGFzIGRhdGFmcmFtZSAoY3N2L3BhcnF1ZXQgZmlsZSBwYXRoKQogICAgOnBhcmFtIGxhYmVsX2NvbHVtbjogICAgZ3JvdW5kIHRydXRoIGNvbHVtbiBsYWJlbAogICAgOnBhcmFtIGNsYXNzX2xhYmVsczogICAgbGFiZWwgZm9yIGVhY2ggY2xhc3MgaW4gdGFibGVzIGFuZCBwbG90cwogICAgOnBhcmFtIHBsb3RfaGlzdDogICAgICAgKFRydWUpIHNldCB0aGlzIHRvIEZhbHNlIGZvciBsYXJnZSB0YWJsZXMKICAgIDpwYXJhbSBwbG90c19kZXN0OiAgICAgIGRlc3RpbmF0aW9uIGZvbGRlciBvZiBzdW1tYXJ5IHBsb3RzIChyZWxhdGl2ZSB0byBhcnRpZmFjdF9wYXRoKQogICAgOnBhcmFtIHVwZGF0ZV9kYXRhc2V0OiAgd2hlbiB0aGUgdGFibGUgaXMgYSByZWdpc3RlcmVkIGRhdGFzZXQgdXBkYXRlIHRoZSBjaGFydHMgaW4tcGxhY2UgCiAgICAiIiIKICAgIGRmID0gdGFibGUuYXNfZGYoKQogICAgaGVhZGVyID0gZGYuY29sdW1ucy52YWx1ZXMKICAgIGV4dHJhX2RhdGEgPSB7fQogICAgCiAgICB0cnk6CiAgICAgICAgZ2NmX2NsZWFyKHBsdCkKICAgICAgICBzbnNwbHQgPSBzbnMucGFpcnBsb3QoZGYsIGh1ZT1sYWJlbF9jb2x1bW4pIywgZGlhZ19rd3M9eyJidyI6IDEuNX0pCiAgICAgICAgZXh0cmFfZGF0YVsi
aGlzdG9ncmFtcyJdID0gY29udGV4dC5sb2dfYXJ0aWZhY3QoUGxvdEFydGlmYWN0KCJoaXN0b2dyYW1zIiwgIGJvZHk9cGx0LmdjZigpKSwKICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICBsb2NhbF9wYXRoPWYie3Bsb3RzX2Rlc3R9L2hpc3QuaHRtbCIsIGRiX2tleT1GYWxzZSkKICAgIGV4Y2VwdCBFeGNlcHRpb24gYXMgZToKICAgICAgICBjb250ZXh0LmxvZ2dlci5lcnJvcihmJ0ZhaWxlZCB0byBjcmVhdGUgcGFpcnBsb3QgaGlzdG9ncmFtcyBkdWUgdG86IHtlfScpCiAgICAKICAgIHRyeToKICAgICAgICBnY2ZfY2xlYXIocGx0KQogICAgICAgIHBsb3RfY29scyA9IDMKICAgICAgICBwbG90X3Jvd3MgPSBpbnQoKGxlbihoZWFkZXIpIC0gMSkgLyBwbG90X2NvbHMpKzEKICAgICAgICBmaWcsIGF4ID0gcGx0LnN1YnBsb3RzKHBsb3Rfcm93cywgcGxvdF9jb2xzLCBmaWdzaXplPSgxNSwgNCkpCiAgICAgICAgZmlnLnRpZ2h0X2xheW91dChwYWQ9Mi4wKQogICAgICAgIGZvciBpIGluIHJhbmdlKHBsb3Rfcm93cyAqIHBsb3RfY29scyk6CiAgICAgICAgICAgIGlmIGkgPCBsZW4oaGVhZGVyKToKICAgICAgICAgICAgICAgIHNucy52aW9saW5wbG90KHg9ZGZbaGVhZGVyW2ldXSwgYXg9YXhbaW50KGkgLyBwbG90X2NvbHMpXVtpICUgcGxvdF9jb2xzXSwgCiAgICAgICAgICAgICAgICAgICAgICAgICAgICBvcmllbnQ9J2gnLCB3aWR0aD0wLjcsIGlubmVyPSJxdWFydGlsZSIpCiAgICAgICAgICAgIGVsc2U6CiAgICAgICAgICAgICAgICBmaWcuZGVsYXhlcyhheFtpbnQoaSAvIHBsb3RfY29scyldW2kgJSBwbG90X2NvbHNdKSAgICAgICAgCiAgICAgICAgICAgIGkrPTEKICAgICAgICBleHRyYV9kYXRhWyJ2aW9saW4iXSA9IGNvbnRleHQubG9nX2FydGlmYWN0KFBsb3RBcnRpZmFjdCgidmlvbGluIiwgIGJvZHk9cGx0LmdjZigpLCB0aXRsZT0nVmlvbGluIFBsb3QnKSwKICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIGxvY2FsX3BhdGg9ZiJ7cGxvdHNfZGVzdH0vdmlvbGluLmh0bWwiLCBkYl9rZXk9RmFsc2UpCiAgICBleGNlcHQgRXhjZXB0aW9uIGFzIGU6CiAgICAgICAgY29udGV4dC5sb2dnZXIud2FybihmJ0ZhaWxlZCB0byBjcmVhdGUgdmlvbGluIGRpc3RyaWJ1dGlvbiBwbG90cyBkdWUgdG86IHtlfScpCgogICAgaWYgbGFiZWxfY29sdW1uOiAKICAgICAgICBsYWJlbHMgPSBkZi5wb3AobGFiZWxfY29sdW1uKQogICAgICAgIGltYnRhYmxlID0gbGFiZWxzLnZhbHVlX2NvdW50cyhub3JtYWxpemU9VHJ1ZSkuc29ydF9pbmRleCgpCiAgICAgICAgdHJ5OgogICAgICAgICAgICBnY2ZfY2xlYXIocGx0KSAgCiAgICAgICAgICAgIGJhbGFuY2ViYXIgPSBpbWJ0YWJsZS5wbG90KGtpbmQ9J2JhcicsIHRpdGxlPSdjbGFzcyBpbWJhbGFuY2UgLSBsYWJlbHMnKQogICAgICAgICAgICBiYWxhbmNlYmFyLnNldF94bGFiZWwoJ2NsYXNzJykK
ICAgICAgICAgICAgYmFsYW5jZWJhci5zZXRfeWxhYmVsKCJwcm9wb3J0aW9uIG9mIHRvdGFsIikKICAgICAgICAgICAgZXh0cmFfZGF0YVsiaW1iYWxhbmNlIl0gPSBjb250ZXh0LmxvZ19hcnRpZmFjdChQbG90QXJ0aWZhY3QoImltYmFsYW5jZSIsIGJvZHk9cGx0LmdjZigpKSwgCiAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgbG9jYWxfcGF0aD1mIntwbG90c19kZXN0fS9pbWJhbGFuY2UuaHRtbCIpCiAgICAgICAgZXhjZXB0IEV4Y2VwdGlvbiBhcyBlOgogICAgICAgICAgICBjb250ZXh0LmxvZ2dlci53YXJuKGYnRmFpbGVkIHRvIGNyZWF0ZSBjbGFzcyBpbWJhbGFuY2UgcGxvdCBkdWUgdG86IHtlfScpCiAgICAgICAgY29udGV4dC5sb2dfYXJ0aWZhY3QoVGFibGVBcnRpZmFjdCgiaW1iYWxhbmNlLXdlaWdodHMtdmVjIiwgCiAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICBkZj1wZC5EYXRhRnJhbWUoeyJ3ZWlnaHRzIjogaW1idGFibGV9KSksCiAgICAgICAgICAgICAgICAgICAgICAgICAgICAgbG9jYWxfcGF0aD1mIntwbG90c19kZXN0fS9pbWJhbGFuY2Utd2VpZ2h0cy12ZWMuY3N2IiwgZGJfa2V5PUZhbHNlKQoKICAgIHRibGNvcnIgPSBkZi5jb3JyKCkKICAgIG1hc2sgPSBucC56ZXJvc19saWtlKHRibGNvcnIsIGR0eXBlPW5wLmJvb2wpCiAgICBtYXNrW25wLnRyaXVfaW5kaWNlc19mcm9tKG1hc2spXSA9IFRydWUKICAgIAogICAgZGZjb3JyID0gcGQuRGF0YUZyYW1lKGRhdGE9dGJsY29yciwgY29sdW1ucz1oZWFkZXIsIGluZGV4PWhlYWRlcikKICAgIGRmY29yciA9IGRmY29ycltucC5hcmFuZ2UoZGZjb3JyLnNoYXBlWzBdKVs6LCBOb25lXSA+IG5wLmFyYW5nZShkZmNvcnIuc2hhcGVbMV0pXQogICAgY29udGV4dC5sb2dfYXJ0aWZhY3QoVGFibGVBcnRpZmFjdCgiY29ycmVsYXRpb24tbWF0cml4IiwgZGY9dGJsY29yciwgdmlzaWJsZT1UcnVlKSwgCiAgICAgICAgICAgICAgICAgICAgICAgICBsb2NhbF9wYXRoPWYie3Bsb3RzX2Rlc3R9L2NvcnJlbGF0aW9uLW1hdHJpeC5jc3YiLCBkYl9rZXk9RmFsc2UpCiAgICAKICAgIHRyeToKICAgICAgICBnY2ZfY2xlYXIocGx0KQogICAgICAgIGF4ID0gcGx0LmF4ZXMoKQogICAgICAgIHNucy5oZWF0bWFwKHRibGNvcnIsIGF4PWF4LCBtYXNrPW1hc2ssIGFubm90PUZhbHNlLCBjbWFwPXBsdC5jbS5SZWRzKQogICAgICAgIGF4LnNldF90aXRsZSgiZmVhdHVyZXMgY29ycmVsYXRpb24iKQogICAgICAgIGV4dHJhX2RhdGFbImNvcnJlbGF0aW9uIl0gPSBjb250ZXh0LmxvZ19hcnRpZmFjdChQbG90QXJ0aWZhY3QoImNvcnJlbGF0aW9uIiwgIGJvZHk9cGx0LmdjZigpLCB0aXRsZT0nQ29ycmVsYXRpb24gTWF0cml4JyksCiAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgbG9jYWxfcGF0aD1mIntwbG90c19kZXN0fS9jb3JyLmh0bWwiLCBkYl9rZXk9
RmFsc2UpCiAgICBleGNlcHQgRXhjZXB0aW9uIGFzIGU6CiAgICAgICAgICAgIGNvbnRleHQubG9nZ2VyLndhcm4oZidGYWlsZWQgdG8gY3JlYXRlIGZlYXR1cmVzIGNvcnJlbGF0aW9uIHBsb3QgZHVlIHRvOiB7ZX0nKQogICAgCgogICAgZ2NmX2NsZWFyKHBsdCkKICAgIGlmIHVwZGF0ZV9kYXRhc2V0IGFuZCB0YWJsZS5tZXRhIGFuZCB0YWJsZS5tZXRhLmtpbmQgPT0gJ2RhdGFzZXQnOgogICAgICAgIGZyb20gbWxydW4uYXJ0aWZhY3RzIGltcG9ydCB1cGRhdGVfZGF0YXNldF9tZXRhCiAgICAgICAgdXBkYXRlX2RhdGFzZXRfbWV0YSh0YWJsZS5tZXRhLCBleHRyYV9kYXRhPWV4dHJhX2RhdGEpCiAgICAgICAgCgo=
commands: []
code_origin: https://github.com/mlrun/functions#33ca010bd29d557802f88f2c5c3bd2f289452cc4:describe.ipynb
code_origin: https://github.com/Idan707/functions.git#b113aaa99964591e9e2500de0411a6c0029fbe05:describe.ipynb
86 changes: 86 additions & 0 deletions describe_spark/README.md
@@ -0,0 +1,86 @@
# WIP - Spark Describe Function with MLRun (non-sparkoperator)

## Run .py file Using Spark
### Steps:
1. Deploy spark-operator on the cluster (create service from dashboard).
This is required at this stage in order to create a configmap for the daemon.
2. In Jupyter:
Save the following code as my.py in fuse (in this case /v3io/users/admin/my.py):

```python
#!/usr/local/bin/python

# Locate v3iod:
from subprocess import run
run(["/bin/bash", "/etc/config/v3io/v3io-spark-operator.sh"])

# The pyspark code:
import os
from pyspark.sql import SparkSession
os.environ["PYSPARK_SUBMIT_ARGS"] = "--packages mysql:mysql-connector-java:5.1.39 pyspark-shell"

spark = (SparkSession.builder.appName("Spark JDBC to Databases - ipynb")
.config("spark.driver.extraClassPath", "/v3io/users/admin/mysql-connector-java-5.1.45.jar")
.config("spark.executor.extraClassPath", "/v3io/users/admin/mysql-connector-java-5.1.45.jar")
.getOrCreate())

dfMySQL = (spark.read.format("jdbc")
.option("url", "jdbc:mysql://mysql-rfam-public.ebi.ac.uk:4497/Rfam")
.option("dbtable", "Rfam.family")
.option("user", "rfamro")
.option("password", "")
.option("driver", "com.mysql.jdbc.Driver")
.load())

dfMySQL.write.format("io.iguaz.v3io.spark.sql.kv").mode("overwrite").option("key", "rfam_id").save("v3io://users/admin/frommysql")

spark.stop()
```

3. Make sure that your script has execution permissions.
4. Execute the following block in a notebook:

```python
from mlrun import new_function
from mlrun.platforms.iguazio import mount_v3io, mount_v3iod
import os

# IGZ_VERSION must be set in the environment; indexing fails fast with a KeyError if it is missing.
image_name = 'iguazio/shell:' + os.environ["IGZ_VERSION"]
run = new_function(name='my-spark', image=image_name, command='/v3io/users/admin/my.py', kind='job', mode='pass')
run.apply(mount_v3io(name="v3io-fuse", remote="/", mount_path="/v3io"))
run.apply(mount_v3iod(namespace="default-tenant", v3io_config_configmap="spark-operator-v3io-config"))
run.run(artifact_path="/User/artifacts")
```
---

## Create Simple Read CSV Function Using Spark
Please refer to the read_csv_spark notebook

---

## Create Describe Function Using Spark
Generates profile reports from an Apache Spark DataFrame.
Based on pandas_profiling, but for Spark's DataFrames instead of pandas.

For each column the following statistics - if relevant for the column type - are presented:

* `Essentials:` type, unique values, missing values
* `Quantile statistics:` minimum value, Q1, median, Q3, maximum, range, interquartile range
* `Descriptive statistics:` mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
* `Most frequent values:` for categorical data
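
To pin down what the quantile statistics in the list above mean, here is a minimal sketch in plain Python (not Spark) that computes them for a single numeric column. The helper name `quantile_summary` and the sample values are assumptions for illustration; a Spark implementation may use approximate quantiles and a different interpolation rule, so exact values can differ.

```python
import statistics


def quantile_summary(values):
    """Quantile statistics for one numeric column (needs at least two values)."""
    data = sorted(values)
    # method="inclusive" treats the data as the whole population, so the
    # quartile cut points are interpolated between the observed min and max.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "min": data[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": data[-1],
        "range": data[-1] - data[0],
        "iqr": q3 - q1,
    }


# Hypothetical column values.
summary = quantile_summary([1, 2, 3, 4, 5])
print(summary)
```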

```
Function params

:param context: Function context.
:param dataset: Raw data file (currently needs to be a local file located in v3io://User/bigdata)
:param bins: Number of bins in histograms
:param describe_extended: (True) set to False to get only the simple .describe() information
```

* All operations are done efficiently, which means that **no** Python UDFs or .map() transformations are used at all;
* only Spark SQL's Catalyst is used for the retrieval of all statistics.

---
### TODO:
1. Add plots
2. Add ability to generate HTML report