diff --git a/README.md b/README.md
index b530cc8..b79a89a 100644
--- a/README.md
+++ b/README.md
@@ -55,9 +55,22 @@ by active learning (by developers of Spacy), text and image
(Feature extraction can be computationally expensive and nearly impossible to scale, so re-using features across different models and teams is key for high-performing ML teams; see the retrieval sketch after this list).
* [FEAST](https://github.com/gojek/feast) (Google cloud, Open Source)
* [Michelangelo Palette](https://eng.uber.com/michelangelo/) (Uber)
+ * [Hopsworks](https://github.com/logicalclocks/hopsworks) (Hopsworks, Open Source)
* Suggestion: At training time, copy data into a local or networked **filesystem** (NFS). [1](#fsdl)
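+
+To make feature re-use concrete, below is a minimal, hedged sketch of fetching point-in-time correct training data from a feature store, assuming [Feast](https://github.com/gojek/feast)'s Python SDK (feature names, entity ids and the repository layout are illustrative, and the exact API differs between Feast versions):
+
+```python
+from datetime import datetime
+
+import pandas as pd
+from feast import FeatureStore  # assumes a Feast feature repository in the current directory
+
+# Entities and event timestamps used for point-in-time correct joins
+entity_df = pd.DataFrame({
+    "driver_id": [1001, 1002],
+    "event_timestamp": [datetime(2021, 4, 12, 10), datetime(2021, 4, 12, 8)],
+})
+
+store = FeatureStore(repo_path=".")
+training_df = store.get_historical_features(
+    entity_df=entity_df,
+    features=["driver_hourly_stats:conv_rate", "driver_hourly_stats:avg_daily_trips"],
+).to_df()
+print(training_df.head())
+```
+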
-### 1.4. Data Versioning
+### 1.4. File Formats
+* Different file formats are used at different stages of the ML pipeline, as shown in the figure below:
+
+![File formats across the ML pipeline](images/file-formats.png)
+
+* Feature engineering may be performed on legacy tabular file formats such as .csv, or on modern columnar formats such as .parquet and .orc. Nested file formats such as .json and .avro are used less often.
+
+* Models are typically trained on data read from files, and different frameworks have different native formats: TensorFlow favors .tfrecords, PyTorch favors .npy, and Scikit-Learn favors .csv files. Uber released [Petastorm](https://github.com/uber/petastorm), a Parquet-based data access library with native readers for TensorFlow/Keras and PyTorch.
+
+* File formats for model serving include .pb (TensorFlow), .onnx (framework-independent), .pkl (Scikit-Learn, pickled Python objects), and legacy formats such as .pmml; a short sketch covering a few of these formats follows below.
+
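+To make these formats concrete, here is a minimal sketch that moves the same data and model through a few of them; pandas (with pyarrow) and scikit-learn are assumed, and the file names are illustrative:
+
+```python
+import pickle
+
+import pandas as pd
+from sklearn.linear_model import LogisticRegression
+
+# Feature engineering: read a legacy tabular .csv and store it as columnar .parquet
+df = pd.read_csv("raw_features.csv")      # illustrative file with columns f1, f2, label
+df.to_parquet("features.parquet")         # requires pyarrow or fastparquet
+
+# Training: read the columnar file back and fit a model
+train_df = pd.read_parquet("features.parquet")
+model = LogisticRegression().fit(train_df[["f1", "f2"]], train_df["label"])
+
+# Serving: persist the model as .pkl (a pickled Python object)
+with open("model.pkl", "wb") as f:
+    pickle.dump(model, f)
+```
+
+For TensorFlow, the analogous steps would be serializing records to .tfrecords and exporting the trained model as a SavedModel (.pb); Petastorm plays a similar role for Parquet-backed training data.
+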
+### 1.5. Data Versioning
* It's a "MUST" for deployed ML models:
**Deployed ML models are part code, part data**. [1](#fsdl) No data versioning means no model versioning.
* Data versioning platforms:
@@ -65,7 +78,7 @@ by active learning (by developers of Spacy), text and image
* [Pachyderm](https://www.pachyderm.com/): version control for data
* [Dolt](https://www.liquidata.co/): versioning for SQL database
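+
+As an illustration of what consuming versioned data looks like in code, here is a minimal sketch using DVC's Python API (shown purely as an example of a data versioning tool; the repository URL, path and tag are placeholders):
+
+```python
+import pandas as pd
+import dvc.api
+
+# Read a specific, versioned revision of a dataset tracked with DVC;
+# rev can be any git reference (tag, branch or commit) of the data repo
+with dvc.api.open(
+    "data/features.csv",                            # placeholder path inside the repo
+    repo="https://github.com/example/ml-project",   # placeholder repository
+    rev="v1.2",                                     # placeholder dataset version tag
+) as f:
+    df = pd.read_csv(f)
+```
+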
-### 1.5. Data Processing
+### 1.6. Data Processing
* Training data for production models may come from different sources, including *stored data in databases and object stores*, *log processing*, and *outputs of other classifiers*.
* There are dependencies between tasks; each needs to be kicked off after its dependencies have finished. For example, training on new log data requires a preprocessing step before training.
* Makefiles are not scalable; "workflow managers" become essential in this regard (a minimal example follows below).
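+
+A workflow manager makes these dependencies explicit; below is a minimal sketch of a preprocess-then-train pipeline, assuming Apache Airflow (one common choice) and placeholder scripts:
+
+```python
+from datetime import datetime
+
+from airflow import DAG
+from airflow.operators.bash import BashOperator  # Airflow 2.x import path
+
+with DAG(
+    dag_id="train_on_new_logs",
+    start_date=datetime(2021, 1, 1),
+    schedule_interval="@daily",
+    catchup=False,
+) as dag:
+    preprocess = BashOperator(
+        task_id="preprocess_logs",
+        bash_command="python preprocess.py",  # placeholder preprocessing script
+    )
+    train = BashOperator(
+        task_id="train_model",
+        bash_command="python train.py",       # placeholder training script
+    )
+
+    # Training is kicked off only after preprocessing has finished
+    preprocess >> train
+```
+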
@@ -133,6 +146,7 @@ by active learning (by developers of Spacy), text and image
* [Comet](https://www.comet.ml/): lets you track code, experiments, and results on ML projects
* [Weights & Biases](https://www.wandb.com/): Record and visualize every detail of your research with easy collaboration
* [MLFlow Tracking](https://www.mlflow.org/docs/latest/tracking.html#tracking): for logging parameters, code versions, metrics, and output files as well as visualization of the results.
+ * [Hopsworks Experiments](https://github.com/logicalclocks/hopsworks): for logging hyperparameters, results, notebooks, datasets/features used for training, and any output files/images.
* Automatic experiment tracking with one line of code in python
* Side by side comparison of experiments
* Hyper parameter tuning
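+
+To make the tracking workflow concrete, here is a minimal sketch using the MLflow Tracking API listed above (experiment name, values and artifact path are placeholders):
+
+```python
+import mlflow
+
+mlflow.set_experiment("demo-experiment")  # placeholder experiment name
+
+with mlflow.start_run():
+    # Hyperparameters and resulting metrics for this run
+    mlflow.log_param("learning_rate", 0.01)
+    mlflow.log_param("n_epochs", 10)
+    mlflow.log_metric("val_accuracy", 0.93)
+    # Output files (plots, configs, models) can be attached as artifacts
+    mlflow.log_artifact("confusion_matrix.png")  # placeholder output file
+```
+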
@@ -144,6 +158,7 @@ by active learning (by developers of Spacy), text and image
* Random search
* Bayesian optimization
* HyperBand
+ * Asynchronous Successive Halving
* Platforms:
* [Katib](https://github.com/kubeflow/katib): a Kubernetes-native system for hyperparameter tuning and neural architecture search, inspired by [Google Vizier](https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/bcb15507f4b52991a0783013df4222240e942381.pdf); supports multiple ML/DL frameworks (e.g. TensorFlow, MXNet, and PyTorch).
@@ -152,14 +167,16 @@ by active learning (by developers of Spacy), text and image
* [Ray-Tune](https://github.com/ray-project/ray/tree/master/python/ray/tune): A scalable research platform for distributed model selection (with a focus on deep learning and deep reinforcement learning)
* [Sweeps](https://docs.wandb.com/library/sweeps) from [Weights & Biases](https://www.wandb.com/): Parameters are not explicitly specified by a developer. Instead they are approximated and learned by a machine learning model.
* [Keras Tuner](https://github.com/keras-team/keras-tuner): A hyperparameter tuner for Keras, specifically for tf.keras with TensorFlow 2.0.
-
+ * [Maggy](https://github.com/logicalclocks/maggy): An asynchronous parallel hyperparameter tuning framework for TensorFlow/Keras, built on PySpark.
+
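+As a minimal illustration of the simplest method above (random search), here is a sketch using scikit-learn's RandomizedSearchCV (a generic example, not one of the platforms listed; the data and search space are synthetic):
+
+```python
+from scipy.stats import loguniform
+from sklearn.datasets import make_classification
+from sklearn.linear_model import LogisticRegression
+from sklearn.model_selection import RandomizedSearchCV
+
+X, y = make_classification(n_samples=500, n_features=10, random_state=0)
+
+# Sample 20 random hyperparameter configurations and keep the best by CV score
+search = RandomizedSearchCV(
+    LogisticRegression(max_iter=1000),
+    param_distributions={"C": loguniform(1e-3, 1e2)},
+    n_iter=20,
+    cv=3,
+    random_state=0,
+)
+search.fit(X, y)
+print(search.best_params_, search.best_score_)
+```
+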
### 2.6. Distributed Training
* Data parallelism: Use it when iteration time is too long (both TensorFlow and PyTorch support it); a minimal sketch follows this list
* Model parallelism: Use it when the model does not fit on a single GPU
* Other solutions:
* Ray
* Horovod
-
+ * [TensorFlow CollectiveAllReduce on PySpark](https://www.logicalclocks.com/blog/goodbye-horovod-hello-collectiveallreduce)
+
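+A minimal sketch of the data-parallel case using TensorFlow's built-in MirroredStrategy (one option alongside Horovod/Ray; the model and data are placeholders):
+
+```python
+import tensorflow as tf
+
+# Data parallelism: replicate the model on every local GPU, split each batch
+# across the replicas, and all-reduce the gradients
+strategy = tf.distribute.MirroredStrategy()
+print("Replicas in sync:", strategy.num_replicas_in_sync)
+
+with strategy.scope():
+    # Variables created inside the scope are mirrored across devices
+    model = tf.keras.Sequential([
+        tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
+        tf.keras.layers.Dense(1, activation="sigmoid"),
+    ])
+    model.compile(optimizer="adam", loss="binary_crossentropy")
+
+# Placeholder data; in practice scale the global batch size with the replica count
+x = tf.random.normal((1024, 20))
+y = tf.cast(tf.random.uniform((1024, 1)) > 0.5, tf.float32)
+model.fit(x, y, batch_size=64 * strategy.num_replicas_in_sync, epochs=2)
+```
+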
## 3. Troubleshooting [TBD]
## 4. Testing and Deployment
@@ -231,6 +248,7 @@ Machine Learning production software requires a more diverse set of test suites
* Catching service and data regressions
* Cloud providers' solutions are decent
* [Kiali](https://kiali.io/): an observability console for Istio with service mesh configuration capabilities. It answers these questions: How are the microservices connected? How are they performing?
+* [Hopsworks](https://github.com/logicalclocks/hopsworks): online models have their predictions logged to a Kafka topic, and a Spark Streaming or Flink application monitors the model for concept drift, data drift, model drift, and other anomalies.
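+
+The drift checks themselves can be as simple as comparing recent prediction (or feature) distributions against a training-time baseline; below is a minimal, generic sketch using a two-sample Kolmogorov-Smirnov test (not tied to any platform above; the threshold and data are illustrative):
+
+```python
+import numpy as np
+from scipy.stats import ks_2samp
+
+def detect_drift(baseline: np.ndarray, recent: np.ndarray, p_threshold: float = 0.01) -> bool:
+    """Flag drift when the recent window differs significantly from the baseline."""
+    _, p_value = ks_2samp(baseline, recent)
+    return p_value < p_threshold
+
+# Placeholder data: baseline captured at training time, recent window from serving logs
+baseline_scores = np.random.normal(0.0, 1.0, size=10_000)
+recent_scores = np.random.normal(0.4, 1.0, size=1_000)  # shifted, so drift should be flagged
+
+if detect_drift(baseline_scores, recent_scores):
+    print("Possible drift detected: trigger an alert or a retraining job")
+```
+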
#### Are we done?
diff --git a/images/file-formats.png b/images/file-formats.png
new file mode 100644
index 0000000..d9277f7
Binary files /dev/null and b/images/file-formats.png differ