Commit 63ea36c (1 parent: 19dca46)

Commit message: added notes and pointers to spacy model training and the Marathi example

16 files changed: +176 -75 lines

Diff for: .idea/misc.xml (+1 -1)

Generated file; diff not rendered.

Diff for: .idea/python-tutorial-notebooks.iml (+4 -2)

Generated file; diff not rendered.

Diff for: README.md (+2)

@@ -65,6 +65,8 @@
 
 - [spaCy Tutorial](https://github.com/dcavar/python-tutorial-notebooks/blob/master/notebooks/spaCy%20Tutorial.ipynb)
 - [spaCy 3.x Tutorial: Transformers Spanish](https://github.com/dcavar/python-tutorial-notebooks/blob/master/notebooks/spaCy%203.x%20Tutorial%20Transformers%20Spanish.ipynb)
+- [spaCy Model from CoNLL Data](https://github.com/dcavar/python-tutorial-notebooks/blob/master/notebooks/spaCy_CoNLL_Training.ipynb)
+- [Train spaCy Model for Marathi (mr)](https://github.com/dcavar/python-tutorial-notebooks/blob/master/notebooks/Marathi/train_model.ipynb)
 - [Linear Algebra and Embeddings - spaCy](https://github.com/dcavar/python-tutorial-notebooks/blob/master/notebooks/Embeddings_and_Vectors.ipynb)
 
 
Diff for: notebooks/BERT_vectors.ipynb (+57 -29)

Large diff; not rendered.

Diff for: notebooks/Combinatory Categorial Grammar Parsing with NLTK.ipynb (+8 -4)

@@ -81,12 +81,16 @@
 },
 {
  "cell_type": "code",
- "execution_count": 1,
- "metadata": {},
- "outputs": [],
+ "metadata": {
+  "jupyter": {
+   "is_executing": true
+  }
+ },
  "source": [
   "from nltk.ccg import chart, lexicon"
- ]
+ ],
+ "outputs": [],
+ "execution_count": null
 },
 {
  "cell_type": "markdown",

Diff for: notebooks/Multilayer_Perceptron.ipynb (+32 -32)

@@ -39,7 +39,7 @@
 },
 {
  "cell_type": "code",
- "execution_count": 5,
+ "execution_count": 2,
  "metadata": {},
  "outputs": [],
  "source": [
@@ -57,7 +57,7 @@
 },
 {
  "cell_type": "code",
- "execution_count": 6,
+ "execution_count": 3,
  "metadata": {},
  "outputs": [],
  "source": [
@@ -85,23 +85,23 @@
 },
 {
  "cell_type": "code",
- "execution_count": 37,
+ "execution_count": 4,
  "metadata": {},
  "outputs": [
   {
    "name": "stdout",
    "output_type": "stream",
    "text": [
-    "W [[0.57916493 0.1989773 0.71685006]\n",
-    " [0.06420334 0.23917944 0.03679699]]\n",
-    "U [[0.44530666 0.60784364]\n",
-    " [0.77164787 0.40612112]\n",
-    " [0.83222563 0.69558143]]\n",
-    "bias_W [[0.90328775 0.89391968 0.63126251]]\n",
-    "bias_U [[0.93231218 0.7755912 ]]\n",
-    "O [[0.6369282 ]\n",
-    " [0.36734706]]\n",
-    "bias_O [[0.93714153]]\n"
+    "W [[0.72620524 0.25526523 0.69675275]\n",
+    " [0.2365146 0.02996081 0.50613528]]\n",
+    "U [[0.63461337 0.06771906]\n",
+    " [0.86606937 0.3349142 ]\n",
+    " [0.91925414 0.75621645]]\n",
+    "bias_W [[0.71746436 0.42482447 0.26262425]]\n",
+    "bias_U [[0.68904939 0.59691488]]\n",
+    "O [[0.04374218]\n",
+    " [0.10052295]]\n",
+    "bias_O [[0.52142174]]\n"
    ]
   }
  ],
@@ -122,7 +122,7 @@
 },
 {
  "cell_type": "code",
- "execution_count": 16,
+ "execution_count": 5,
  "metadata": {},
  "outputs": [
   {
@@ -157,28 +157,28 @@
 },
 {
  "cell_type": "code",
- "execution_count": 17,
+ "execution_count": 7,
  "metadata": {},
  "outputs": [
   {
    "data": {
     "text/plain": [
-     "array([1, 0])"
+     "array([3, 3])"
     ]
    },
-   "execution_count": 17,
+   "execution_count": 7,
    "metadata": {},
    "output_type": "execute_result"
   }
  ],
  "source": [
-  "one_hot = np.array([0, 1, 0, 0, 0, 0, 0, 0])\n",
+  "one_hot = np.array([0, 0, 0, 0, 0, 0, 0, 1])\n",
   "one_hot.dot(input_data)"
  ]
 },
 {
  "cell_type": "code",
- "execution_count": 18,
+ "execution_count": 8,
  "metadata": {},
  "outputs": [
   {
@@ -203,7 +203,7 @@
 },
 {
  "cell_type": "code",
- "execution_count": 38,
+ "execution_count": 9,
  "metadata": {},
  "outputs": [],
  "source": [
@@ -213,7 +213,7 @@
 },
 {
  "cell_type": "code",
- "execution_count": 42,
+ "execution_count": 10,
  "metadata": {},
  "outputs": [],
  "source": [
@@ -223,7 +223,7 @@
 },
 {
  "cell_type": "code",
- "execution_count": null,
+ "execution_count": 11,
  "metadata": {},
  "outputs": [],
  "source": [
@@ -232,21 +232,21 @@
 },
 {
  "cell_type": "code",
- "execution_count": 50,
+ "execution_count": 12,
  "metadata": {},
  "outputs": [
   {
    "name": "stdout",
    "output_type": "stream",
    "text": [
-    "output 0.9658545034605426 - true score: 1 - loss -0.03474207364924937\n",
-    "output 0.986959889282255 - true score: 0 - loss -4.3397252318950565\n",
-    "output 0.9894527613414252 - true score: 0 - loss -4.5518911918432865\n",
-    "output 0.995086368253607 - true score: 0 - loss -5.315741947375225\n",
-    "output 0.9985133193959704 - true score: 1 - loss -0.0014877868101581678\n",
-    "output 0.9988002123932317 - true score: 1 - loss -0.0012005079281317262\n",
-    "output 0.9974135571146144 - true score: 1 - loss -0.002589793507494032\n",
-    "output 0.9990317957413032 - true score: 0 - loss -6.940067481896969\n"
+    "output 0.6675859293553982 - true score: 1 - loss -0.4040871638764277\n",
+    "output 0.6945833449779889 - true score: 0 - loss -1.1860783525764986\n",
+    "output 0.7090513591078905 - true score: 0 - loss -1.2346085191668439\n",
+    "output 0.7203067606618183 - true score: 0 - loss -1.2740618501847558\n",
+    "output 0.7575838283922055 - true score: 1 - loss -0.2776210831773161\n",
+    "output 0.7700554259317871 - true score: 1 - loss -0.2612927849953743\n",
+    "output 0.760135340291323 - true score: 1 - loss -0.27425878222531397\n",
+    "output 0.782069969876003 - true score: 0 - loss -1.523581230446569\n"
    ]
   }
  ],
@@ -360,7 +360,7 @@
  "name": "python",
  "nbconvert_exporter": "python",
  "pygments_lexer": "ipython3",
- "version": "3.12.7"
+ "version": "3.12.3"
  }
 },
 "nbformat": 4,
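The `one_hot` edit in the diff above (moving the 1 from index 1 to index 7) changes which row of `input_data` the dot product returns: a one-hot row vector dotted with a matrix acts as a row selector. A minimal sketch of that behavior, with made-up data rather than the notebook's `input_data`:

```python
import numpy as np

# Dotting a one-hot row vector with a matrix returns the matrix row at the
# index of the 1. Illustrative data: 8 rows of 2 columns.
input_data = np.arange(16).reshape(8, 2)

one_hot = np.zeros(8)
one_hot[7] = 1.0  # hot index 7 -> selects the last row

row = one_hot.dot(input_data)
print(row)        # same values as input_data[7]
```

This is why flipping the hot index in the notebook changes the result from `array([1, 0])` to `array([3, 3])`: a different row of its `input_data` is picked out.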

Diff for: notebooks/N-gram Models for Language Models.ipynb (+7 -7)

@@ -206,7 +206,7 @@
 },
 {
  "cell_type": "code",
- "execution_count": 14,
+ "execution_count": 4,
  "metadata": {
   "scrolled": true
  },
@@ -265,7 +265,7 @@
 },
 {
  "cell_type": "code",
- "execution_count": 15,
+ "execution_count": 7,
  "metadata": {},
  "outputs": [],
  "source": [
@@ -274,7 +274,7 @@
 },
 {
  "cell_type": "code",
- "execution_count": 16,
+ "execution_count": 8,
  "metadata": {},
  "outputs": [],
  "source": [
@@ -306,7 +306,7 @@
 },
 {
  "cell_type": "code",
- "execution_count": 17,
+ "execution_count": 9,
  "metadata": {},
  "outputs": [
   {
@@ -330,7 +330,7 @@
 },
 {
  "cell_type": "code",
- "execution_count": 18,
+ "execution_count": 10,
  "metadata": {},
  "outputs": [],
  "source": [
@@ -353,7 +353,7 @@
 },
 {
  "cell_type": "code",
- "execution_count": 19,
+ "execution_count": 11,
  "metadata": {},
  "outputs": [
   {
@@ -377,7 +377,7 @@
 },
 {
  "cell_type": "code",
- "execution_count": 20,
+ "execution_count": 12,
  "metadata": {},
  "outputs": [
   {

The following files changed mode from 100755 to 100644 (no content changes):

- notebooks/data/StanfordSentimentTreebank/README.txt
- notebooks/data/StanfordSentimentTreebank/SOStr.txt
- notebooks/data/StanfordSentimentTreebank/STree.txt
- notebooks/data/StanfordSentimentTreebank/datasetSentences.txt
- notebooks/data/StanfordSentimentTreebank/datasetSplit.txt
- notebooks/data/StanfordSentimentTreebank/dictionary.txt
- notebooks/data/StanfordSentimentTreebank/original_rt_snippets.txt
- notebooks/data/StanfordSentimentTreebank/sentiment_labels.txt

Diff for: notebooks/spaCy_CoNLL_Training.ipynb (+65, new file)

@@ -0,0 +1,65 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# spaCy Model from CoNLL Data\n",
+    "\n",
+    "(C) 2024 by [Damir Cavar](http://damir)\n",
+    "\n",
+    "The spaCy documentation provides a good introduction into [training a model](https://spacy.io/usage/training) and in particular using CoNLL data. The following code is based on this [spaCy training documentation](https://spacy.io/usage/training) and the code provided there."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Converting CoNLL (and [CoNLL-U](https://universaldependencies.org/format.html)) files to the necessary spaCy corpus format can be achieved using the following command:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!python -m spacy convert ./Marathi/mr_ufal-ud-train.conllu ./Marathi/train.spacy --converter conllu --file-type spacy --seg-sents --morphology --merge-subtokens --lang mr"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Check the `train_model.ipynb` Jupyter notebook in the `Marathi` subfolder here for details on training a model."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "While [Prodigy](https://prodi.gy/) is an excellent tool for creating training data for spaCy models, [CoNLL-U](https://universaldependencies.org/format.html) files can be created using different tools. One such tool is [INCEpTION](https://inception-project.github.io/). A good resource for CoNLL files for different languages can be found on the [Universal Dependencies](https://universaldependencies.org/) website."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "(C) 2024 by [Damir Cavar](http://damir.cavar.me/)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python",
+   "version": "3.12.3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
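The new notebook above converts [CoNLL-U](https://universaldependencies.org/format.html) files with `spacy convert`. CoNLL-U itself is plain text: one token per line, ten tab-separated columns, blank lines separating sentences, `#` lines for metadata. A minimal stdlib-only parsing sketch; the `parse_conllu` helper and the two-token Marathi sample below are illustrative, not part of this commit:

```python
# Hypothetical two-token Marathi sentence in CoNLL-U shape (made-up sample,
# not taken from mr_ufal-ud-train.conllu).
sample = "\n".join([
    "# sent_id = 1",
    "1\tही\tहा\tDET\t_\t_\t2\tdet\t_\t_",
    "2\tगोष्ट\tगोष्ट\tNOUN\t_\t_\t0\troot\t_\t_",
    "",
])

def parse_conllu(text):
    """Parse CoNLL-U text into a list of sentences (lists of token dicts)."""
    sentences, tokens = [], []
    for line in text.splitlines():
        if not line.strip():          # blank line ends a sentence
            if tokens:
                sentences.append(tokens)
                tokens = []
            continue
        if line.startswith("#"):      # metadata/comment line
            continue
        fields = line.split("\t")
        if "-" in fields[0] or "." in fields[0]:
            continue                  # skip multiword tokens and empty nodes
        tokens.append({"id": int(fields[0]), "form": fields[1],
                       "lemma": fields[2], "upos": fields[3],
                       "head": int(fields[6]), "deprel": fields[7]})
    if tokens:
        sentences.append(tokens)
    return sentences

sents = parse_conllu(sample)
print(sents[0][1]["upos"], sents[0][1]["deprel"])  # prints: NOUN root
```

This is only meant to show what the `conllu` converter consumes; for actual training, the `.spacy` files produced by `spacy convert` are what the pipeline expects.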
