Commit 63ea36c (1 parent: 19dca46)

Commit message: added notes and pointers to spacy model training and the Marathi example

16 files changed: +176 -75 lines

Diff for: .idea/misc.xml (+1 -1)

Generated file; diff not rendered.

Diff for: .idea/python-tutorial-notebooks.iml (+4 -2)

Generated file; diff not rendered.

Diff for: README.md (+2)

@@ -65,6 +65,8 @@
 
 - [spaCy Tutorial](https://github.com/dcavar/python-tutorial-notebooks/blob/master/notebooks/spaCy%20Tutorial.ipynb)
 - [spaCy 3.x Tutorial: Transformers Spanish](https://github.com/dcavar/python-tutorial-notebooks/blob/master/notebooks/spaCy%203.x%20Tutorial%20Transformers%20Spanish.ipynb)
+- [spaCy Model from CoNLL Data](https://github.com/dcavar/python-tutorial-notebooks/blob/master/notebooks/spaCy_CoNLL_Training.ipynb)
+- [Train spaCy Model for Marathi (mr)](https://github.com/dcavar/python-tutorial-notebooks/blob/master/notebooks/Marathi/train_model.ipynb)
 - [Linear Algebra and Embeddings - spaCy](https://github.com/dcavar/python-tutorial-notebooks/blob/master/notebooks/Embeddings_and_Vectors.ipynb)
 
 
Diff for: notebooks/BERT_vectors.ipynb (+57 -29)

Large diff; not rendered.

Diff for: notebooks/Combinatory Categorial Grammar Parsing with NLTK.ipynb (+8 -4)

@@ -81,12 +81,16 @@
 },
 {
  "cell_type": "code",
- "execution_count": 1,
- "metadata": {},
- "outputs": [],
+ "metadata": {
+  "jupyter": {
+   "is_executing": true
+  }
+ },
  "source": [
   "from nltk.ccg import chart, lexicon"
- ]
+ ],
+ "outputs": [],
+ "execution_count": null
 },
 {
  "cell_type": "markdown",

Diff for: notebooks/Multilayer_Perceptron.ipynb (+32 -32)

@@ -39,7 +39,7 @@
 },
 {
  "cell_type": "code",
- "execution_count": 5,
+ "execution_count": 2,
  "metadata": {},
  "outputs": [],
  "source": [
@@ -57,7 +57,7 @@
 },
 {
  "cell_type": "code",
- "execution_count": 6,
+ "execution_count": 3,
  "metadata": {},
  "outputs": [],
  "source": [
@@ -85,23 +85,23 @@
 },
 {
  "cell_type": "code",
- "execution_count": 37,
+ "execution_count": 4,
  "metadata": {},
  "outputs": [
   {
    "name": "stdout",
    "output_type": "stream",
    "text": [
-    "W [[0.57916493 0.1989773 0.71685006]\n",
-    " [0.06420334 0.23917944 0.03679699]]\n",
-    "U [[0.44530666 0.60784364]\n",
-    " [0.77164787 0.40612112]\n",
-    " [0.83222563 0.69558143]]\n",
-    "bias_W [[0.90328775 0.89391968 0.63126251]]\n",
-    "bias_U [[0.93231218 0.7755912 ]]\n",
-    "O [[0.6369282 ]\n",
-    " [0.36734706]]\n",
-    "bias_O [[0.93714153]]\n"
+    "W [[0.72620524 0.25526523 0.69675275]\n",
+    " [0.2365146 0.02996081 0.50613528]]\n",
+    "U [[0.63461337 0.06771906]\n",
+    " [0.86606937 0.3349142 ]\n",
+    " [0.91925414 0.75621645]]\n",
+    "bias_W [[0.71746436 0.42482447 0.26262425]]\n",
+    "bias_U [[0.68904939 0.59691488]]\n",
+    "O [[0.04374218]\n",
+    " [0.10052295]]\n",
+    "bias_O [[0.52142174]]\n"
    ]
   }
  ],
@@ -122,7 +122,7 @@
 },
 {
  "cell_type": "code",
- "execution_count": 16,
+ "execution_count": 5,
  "metadata": {},
  "outputs": [
   {
@@ -157,28 +157,28 @@
 },
 {
  "cell_type": "code",
- "execution_count": 17,
+ "execution_count": 7,
  "metadata": {},
  "outputs": [
   {
    "data": {
     "text/plain": [
-     "array([1, 0])"
+     "array([3, 3])"
     ]
    },
-   "execution_count": 17,
+   "execution_count": 7,
    "metadata": {},
    "output_type": "execute_result"
   }
  ],
  "source": [
-  "one_hot = np.array([0, 1, 0, 0, 0, 0, 0, 0])\n",
+  "one_hot = np.array([0, 0, 0, 0, 0, 0, 0, 1])\n",
   "one_hot.dot(input_data)"
  ]
 },
 {
  "cell_type": "code",
- "execution_count": 18,
+ "execution_count": 8,
  "metadata": {},
  "outputs": [
   {
@@ -203,7 +203,7 @@
 },
 {
  "cell_type": "code",
- "execution_count": 38,
+ "execution_count": 9,
  "metadata": {},
  "outputs": [],
  "source": [
@@ -213,7 +213,7 @@
 },
 {
  "cell_type": "code",
- "execution_count": 42,
+ "execution_count": 10,
  "metadata": {},
  "outputs": [],
  "source": [
@@ -223,7 +223,7 @@
 },
 {
  "cell_type": "code",
- "execution_count": null,
+ "execution_count": 11,
  "metadata": {},
  "outputs": [],
  "source": [
@@ -232,21 +232,21 @@
 },
 {
  "cell_type": "code",
- "execution_count": 50,
+ "execution_count": 12,
  "metadata": {},
  "outputs": [
   {
    "name": "stdout",
    "output_type": "stream",
    "text": [
-    "output 0.9658545034605426 - true score: 1 - loss -0.03474207364924937\n",
-    "output 0.986959889282255 - true score: 0 - loss -4.3397252318950565\n",
-    "output 0.9894527613414252 - true score: 0 - loss -4.5518911918432865\n",
-    "output 0.995086368253607 - true score: 0 - loss -5.315741947375225\n",
-    "output 0.9985133193959704 - true score: 1 - loss -0.0014877868101581678\n",
-    "output 0.9988002123932317 - true score: 1 - loss -0.0012005079281317262\n",
-    "output 0.9974135571146144 - true score: 1 - loss -0.002589793507494032\n",
-    "output 0.9990317957413032 - true score: 0 - loss -6.940067481896969\n"
+    "output 0.6675859293553982 - true score: 1 - loss -0.4040871638764277\n",
+    "output 0.6945833449779889 - true score: 0 - loss -1.1860783525764986\n",
+    "output 0.7090513591078905 - true score: 0 - loss -1.2346085191668439\n",
+    "output 0.7203067606618183 - true score: 0 - loss -1.2740618501847558\n",
+    "output 0.7575838283922055 - true score: 1 - loss -0.2776210831773161\n",
+    "output 0.7700554259317871 - true score: 1 - loss -0.2612927849953743\n",
+    "output 0.760135340291323 - true score: 1 - loss -0.27425878222531397\n",
+    "output 0.782069969876003 - true score: 0 - loss -1.523581230446569\n"
    ]
   }
  ],
@@ -360,7 +360,7 @@
  "name": "python",
  "nbconvert_exporter": "python",
  "pygments_lexer": "ipython3",
- "version": "3.12.7"
+ "version": "3.12.3"
  }
 },
 "nbformat": 4,
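The `one_hot` edit in the diff above (moving the 1 from index 1 to index 7) changes which row of `input_data` the dot product returns: a one-hot row vector dotted with a matrix acts as a row selector. A minimal sketch of that behavior, with made-up data rather than the notebook's `input_data`:

```python
import numpy as np

# Dotting a one-hot row vector with a matrix returns the matrix row at the
# index of the 1. Illustrative data: 8 rows of 2 columns.
input_data = np.arange(16).reshape(8, 2)

one_hot = np.zeros(8)
one_hot[7] = 1.0  # hot index 7 -> selects the last row

row = one_hot.dot(input_data)
print(row)        # same values as input_data[7]
```

This is why flipping the hot index in the notebook changes the result from `array([1, 0])` to `array([3, 3])`: a different row of its `input_data` is picked out.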

Diff for: notebooks/N-gram Models for Language Models.ipynb (+7 -7)

@@ -206,7 +206,7 @@
 },
 {
  "cell_type": "code",
- "execution_count": 14,
+ "execution_count": 4,
  "metadata": {
   "scrolled": true
  },
@@ -265,7 +265,7 @@
 },
 {
  "cell_type": "code",
- "execution_count": 15,
+ "execution_count": 7,
  "metadata": {},
  "outputs": [],
  "source": [
@@ -274,7 +274,7 @@
 },
 {
  "cell_type": "code",
- "execution_count": 16,
+ "execution_count": 8,
  "metadata": {},
  "outputs": [],
  "source": [
@@ -306,7 +306,7 @@
 },
 {
  "cell_type": "code",
- "execution_count": 17,
+ "execution_count": 9,
  "metadata": {},
  "outputs": [
   {
@@ -330,7 +330,7 @@
 },
 {
  "cell_type": "code",
- "execution_count": 18,
+ "execution_count": 10,
  "metadata": {},
  "outputs": [],
  "source": [
@@ -353,7 +353,7 @@
 },
 {
  "cell_type": "code",
- "execution_count": 19,
+ "execution_count": 11,
  "metadata": {},
  "outputs": [
   {
@@ -377,7 +377,7 @@
 },
 {
  "cell_type": "code",
- "execution_count": 20,
+ "execution_count": 12,
  "metadata": {},
  "outputs": [
   {

The following files changed mode from 100755 to 100644 (no content changes):

- notebooks/data/StanfordSentimentTreebank/README.txt
- notebooks/data/StanfordSentimentTreebank/SOStr.txt
- notebooks/data/StanfordSentimentTreebank/STree.txt
- notebooks/data/StanfordSentimentTreebank/datasetSentences.txt
- notebooks/data/StanfordSentimentTreebank/datasetSplit.txt
- notebooks/data/StanfordSentimentTreebank/dictionary.txt
- notebooks/data/StanfordSentimentTreebank/original_rt_snippets.txt
- notebooks/data/StanfordSentimentTreebank/sentiment_labels.txt

Diff for: notebooks/spaCy_CoNLL_Training.ipynb (+65, new file)

@@ -0,0 +1,65 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# spaCy Model from CoNLL Data\n",
+    "\n",
+    "(C) 2024 by [Damir Cavar](http://damir)\n",
+    "\n",
+    "The spaCy documentation provides a good introduction into [training a model](https://spacy.io/usage/training) and in particular using CoNLL data. The following code is based on this [spaCy training documentation](https://spacy.io/usage/training) and the code provided there."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Converting CoNLL (and [CoNLL-U](https://universaldependencies.org/format.html)) files to the necessary spaCy corpus format can be achieved using the following command:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!python -m spacy convert ./Marathi/mr_ufal-ud-train.conllu ./Marathi/train.spacy --converter conllu --file-type spacy --seg-sents --morphology --merge-subtokens --lang mr"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Check the `train_model.ipynb` Jupyter notebook in the `Marathi` subfolder here for details on training a model."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "While [Prodigy](https://prodi.gy/) is an excellent tool for creating training data for spaCy models, [CoNLL-U](https://universaldependencies.org/format.html) files can be created using different tools. One such tool is [INCEpTION](https://inception-project.github.io/). A good resource for CoNLL files for different languages can be found on the [Universal Dependencies](https://universaldependencies.org/) website."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "(C) 2024 by [Damir Cavar](http://damir.cavar.me/)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python",
+   "version": "3.12.3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
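The new notebook above converts [CoNLL-U](https://universaldependencies.org/format.html) files with `spacy convert`. CoNLL-U itself is plain text: one token per line, ten tab-separated columns, blank lines separating sentences, `#` lines for metadata. A minimal stdlib-only parsing sketch; the `parse_conllu` helper and the two-token Marathi sample below are illustrative, not part of this commit:

```python
# Hypothetical two-token Marathi sentence in CoNLL-U shape (made-up sample,
# not taken from mr_ufal-ud-train.conllu).
sample = "\n".join([
    "# sent_id = 1",
    "1\tही\tहा\tDET\t_\t_\t2\tdet\t_\t_",
    "2\tगोष्ट\tगोष्ट\tNOUN\t_\t_\t0\troot\t_\t_",
    "",
])

def parse_conllu(text):
    """Parse CoNLL-U text into a list of sentences (lists of token dicts)."""
    sentences, tokens = [], []
    for line in text.splitlines():
        if not line.strip():          # blank line ends a sentence
            if tokens:
                sentences.append(tokens)
                tokens = []
            continue
        if line.startswith("#"):      # metadata/comment line
            continue
        fields = line.split("\t")
        if "-" in fields[0] or "." in fields[0]:
            continue                  # skip multiword tokens and empty nodes
        tokens.append({"id": int(fields[0]), "form": fields[1],
                       "lemma": fields[2], "upos": fields[3],
                       "head": int(fields[6]), "deprel": fields[7]})
    if tokens:
        sentences.append(tokens)
    return sentences

sents = parse_conllu(sample)
print(sents[0][1]["upos"], sents[0][1]["deprel"])  # prints: NOUN root
```

This is only meant to show what the `conllu` converter consumes; for actual training, the `.spacy` files produced by `spacy convert` are what the pipeline expects.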
