---
output:
  revealjs::revealjs_presentation:
    theme: white
    transition: none
    css: custom.css
    self_contained: true
    center: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(warning = FALSE, message = FALSE, error = FALSE, dpi = 400, fig.cap = "", cache = TRUE, echo = FALSE)
```
## ELAN, R and Python {data-background="white"}
### Thoughts on how these go together
### Niko Partanen
# Introduction
---------------------------------------------------------------------
## Who am I?
- A linguist with an MA in Finno-Ugristics
- Doing my PhD (supervisor Michael Rießler)
- Topic: Variation in Komi dialects
- Komi is a Uralic language
- Occasionally I also work with Udmurt and Karelian
- I am currently at the LATTICE laboratory in Paris
- Work there focuses on dependency parsing
---------------------------------------------------------------------
## What am I not?
- A professional programmer
- I know R rather well, Python less so
- I work regularly with both, plus a bit with JavaScript
- I genuinely like programming
- I believe formulating our research questions programmatically is the way to go
---------------------------------------------------------------------
## What is this course?
- One way to discuss my work with an audience
- Lots of R courses dive directly into statistical analysis
- In this workshop we stay in shallower waters
- We don't go far at all
- But I hope this opens new directions
- Almost everything I work with is somewhere online!
- GitHub issues as a cooperation channel
## {data-background-image="https://i.imgur.com/hH0xro2.png" data-background-size="70%"}
---------------------------------------------------------------------
## We will learn
- Parsing ELAN files and metadata into R
- Adapting this to your needs
- Manipulating that data in R
- Building some interactive workflows around R, ELAN and Praat
- Using Python to manipulate tier structures and exploring Pympi
- A little bit of that with R as well…
- Basic concepts for creating visualizations from the data
# What is ELAN?
## ELAN
* Annotation tool developed in Nijmegen
* [Open source](https://tla.mpi.nl/tools/tla-tools/elan/elan-old-versions) Java application
* Used widely in language documentation projects and elsewhere
* Main focus on utterance-long annotations
---------------------------------------------------------------------
## ELAN corpora
* Often data from endangered languages
    - Limited resources
    - Language technology underdeveloped
    - NLP tools usually target larger languages
* Data often collected over a prolonged period of time
    - Research projects usually span three years
    - Not created by a large number of people, but rarely by just one
---------------------------------------------------------------------
* Interlinearized glosses may be included
    - Created through a round trip to FLEX or Toolbox
    - Done manually within ELAN
    - Time will tell what the new Interlinearization Mode brings
---------------------------------------------------------------------
## What follows…
* Typos
* Misclicks
* Overlaps when several people work on the same file
    - Random hacks to keep things together
* Inconsistencies between files
* Different tier templates over the years
    - More hacks and tricks
---------------------------------------------------------------------
## How are they used?
* Examples in grammatical descriptions, links to corpus


---------------------------------------------------------------------
## ELAN corpora?
* Some people refuse to call their language documentation materials a corpus
* The fact that data is referred to doesn't mean that the corpus contains those annotations
    - The reference usually means **that this example exists**
* Others must have already had this conversation
---------------------------------------------------------------------
### ELAN corpus
### =
### anything that is in an ELAN file
---------------------------------------------------------------------
## What is there?
- Transcriptions
- Tokenized and/or annotated layers
- Linked files
- Participant IDs
    - In tier names or `PARTICIPANT` attributes
- Session name (?)
- Comments and notes
- Translations
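
A minimal sketch of how these pieces can be read out of the XML inside an ELAN file, using only Python's standard library. The embedded document is a hand-written stand-in, not real corpus data, and the function name `read_annotations` is made up for illustration. Only time-aligned annotations are collected; annotations on dependent tiers are ignored here.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written stand-in for an .eaf file; real files carry more elements.
EAF = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1500"/>
  </TIME_ORDER>
  <TIER TIER_ID="orth@S1" PARTICIPANT="S1" LINGUISTIC_TYPE_REF="orthT">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1"
                            TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>example utterance</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""

def read_annotations(eaf_xml):
    """Return one dict per time-aligned annotation (hypothetical helper)."""
    root = ET.fromstring(eaf_xml)
    # Map time slot ids to milliseconds; slots without a value become None.
    times = {}
    for slot in root.iter("TIME_SLOT"):
        value = slot.get("TIME_VALUE")
        times[slot.get("TIME_SLOT_ID")] = int(value) if value is not None else None
    rows = []
    for tier in root.iter("TIER"):
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            rows.append({
                "tier": tier.get("TIER_ID"),
                "participant": tier.get("PARTICIPANT"),
                "start": times[ann.get("TIME_SLOT_REF1")],
                "end": times[ann.get("TIME_SLOT_REF2")],
                "text": ann.findtext("ANNOTATION_VALUE"),
            })
    return rows

rows = read_annotations(EAF)
```

Each row keeps the tier name and the `PARTICIPANT` attribute side by side, so the two ways of storing participant IDs can be compared.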
---------------------------------------------------------------------
## What's the problem?
- Language documentation corpora are rarely used in a corpus-linguistic fashion, compare:
> "Finding an example of phenomenon X"
> "Find all instances of phenomenon X, do something with those"
---------------------------------------------------------------------
## Why does this matter?
- The corpora are rarely thoroughly tested
- It is not certain all files share the same structure and conventions
- The questions of representativeness are easily skipped
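
One way such a structure test could look, as a rough sketch: collect the linguistic types used by the tiers of each file and report the files that differ from the most common set. The helper names and the three stand-in documents are invented for illustration; a real check would read the files from disk and would likely compare more than the type names.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def tier_types(eaf_xml):
    """Set of linguistic type names used by the tiers of one file."""
    root = ET.fromstring(eaf_xml)
    return frozenset(t.get("LINGUISTIC_TYPE_REF") for t in root.iter("TIER"))

def deviating_files(corpus):
    """Names of files whose tier types differ from the most common set."""
    types = {name: tier_types(xml) for name, xml in corpus.items()}
    expected, _ = Counter(types.values()).most_common(1)[0]
    return sorted(name for name, found in types.items() if found != expected)

def fake_eaf(*types):
    """Build a tiny stand-in document with one tier per linguistic type."""
    tiers = "".join(
        f'<TIER TIER_ID="t{i}" LINGUISTIC_TYPE_REF="{t}"/>'
        for i, t in enumerate(types)
    )
    return f"<ANNOTATION_DOCUMENT>{tiers}</ANNOTATION_DOCUMENT>"

# Hypothetical corpus: the third file was made with another tier template.
corpus = {
    "a.eaf": fake_eaf("orthT", "ftT"),
    "b.eaf": fake_eaf("orthT", "ftT"),
    "c.eaf": fake_eaf("orthT", "translationT"),
}
odd = deviating_files(corpus)
```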
---------------------------------------------------------------------
# R and Python
---------------------------------------------------------------------
```{r out.width = "20%"}
knitr::include_graphics('images/Rlogo.png')
```
```{r out.width = "20%"}
knitr::include_graphics('images/200px-Python.svg.png')
```
---------------------------------------------------------------------
* Programming languages
* Active communities around them (#rstats on Twitter)
* Data manipulation and visualization are typical uses
* R is oriented toward statistics, Python is more general-purpose
* "Sort of similar" at the end of the day (my opinion)
---------------------------------------------------------------------
## Notes about R
- R is currently going through a large transformation
- Tidyverse: collection of packages that operate consistently with one another
- Makes R kind of a moving target at the moment
- Opinionated, but clearly the direction to go
- Without a doubt R is getting less cumbersome
---------------------------------------------------------------------
## {data-background="black"}
<img src="https://i.imgur.com/pyRnT7a.png" />
---------------------------------------------------------------------
## Notes about Python
- Python module [Pympi](http://dopefishh.github.io/pympi/) is very useful for working with ELAN and Praat files
- Hides the murky details a bit
- Probably has solved many problems -- no need to reinvent the wheel
- More generic signal processing tools
- [pyannote](http://pyannote.github.io/)
- Good NLP ecosystem ([nltk](http://www.nltk.org/))
---------------------------------------------------------------------
## Notebooks
- [RMarkdown](http://rmarkdown.rstudio.com/) and [Jupyter Notebook](http://jupyter.org/)
- Can be run interactively on a server
- Allows combining text, code and citations into one document
- At least with R, can also be combined into a LaTeX document
- If you really want to go down that road!
- It is also easy to generate LaTeX fragments or HTML
---------------------------------------------------------------------
## Why R or Python?
- Easy to build data validation tools
- Easy to automate some tedious tasks
- Leverages some other tools that can enrich our data
- Good collection of HTML and PDF outputs
- High level of [reproducibility](https://www.biorxiv.org/content/early/2016/07/29/066803)
- Including **you** in a few months
- We will see advantages of this in the course
- Tasks can be automated
- We humans are bad at repeating tasks!
- More a shift in workload than total freedom
- But ideally more time for thinking and important tasks
## How to learn more?
## {data-background-iframe="http://r4ds.had.co.nz/"}
## {data-background-iframe="https://adv-r.hadley.nz/"}
## {data-background-iframe="http://socviz.co/"}
## {data-background-iframe="https://www.degruyter.com/view/product/203826"}
## {data-background-iframe="https://benjamins.com/#catalog/books/z.195/main"}
## {data-background-iframe="http://www.nltk.org/"}
## Please send me good Python resources!
## Python's role
- Lots of NLP tools work around Python
- Bindings to morphological analyzers, [hfst]()
- Syntactic parsers
- It is much more widely used than R
- Pympi is already a rather mature tool
- If the most generic parts of the workflows are implemented in Python, the potential for reuse is greater
- Although, if all we do is send command line calls around, who cares
## Example: Tier creation
Do we approach it as:
- create xml node, add attributes x, y and z, add child, add other child, blaablaablaa
Or as:
- create_tier(...)
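
The second approach could be sketched roughly like this. The function `create_tier` and its parameters are hypothetical, written here with the standard library only; Pympi offers its own tier-creation method, which would normally be the better choice. The sketch also skips what a real ELAN file requires, such as the linguistic type definitions and the element order that ELAN expects.

```python
import xml.etree.ElementTree as ET

def create_tier(root, tier_id, linguistic_type, participant=None, parent=None):
    """Hypothetical helper: append a TIER element and return it."""
    attrs = {"TIER_ID": tier_id, "LINGUISTIC_TYPE_REF": linguistic_type}
    if participant is not None:
        attrs["PARTICIPANT"] = participant
    if parent is not None:
        # Dependent tiers point to their parent tier by id.
        attrs["PARENT_REF"] = parent
    return ET.SubElement(root, "TIER", attrs)

# All the node-and-attribute work now sits behind one call per tier.
root = ET.Element("ANNOTATION_DOCUMENT")
create_tier(root, "orth@S1", "orthT", participant="S1")
create_tier(root, "ft@S1", "ftT", participant="S1", parent="orth@S1")
tier_ids = [t.get("TIER_ID") for t in root.iter("TIER")]
```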
## Comparison
1. Works in a specific use case with a specific kind of file
2. Is general, bugs can be solved together
- ELAN always does things the same way, so we must be able to replicate exactly that
## My point:
### Ideally more general than atomistic solutions
## Next: The perils of exporting
# Evils of exporting
---------------------------------------------------------------------
## ELAN export as part of the workflow
##
- [Naomi Nagy's workflows]
- ELAN-Toolbox interaction scripts
- etc.
##

---------------------------------------------------------------------
## Exporting is dangerous!
- You create a new version (a branch, so to say)
- When the file changes you need to repeat the export
- Will you remember?
- Are all exports done identically?
- Export in ELAN has quite a few boxes to tick
- Export cannot contain data that was not already in the ELAN file
- It takes a lot of time to export tens or hundreds of files
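
If an export cannot be avoided, one possible safeguard is to store a checksum of the ELAN file at export time and compare it later. This is only a sketch under that assumption: the function names are invented, and the byte strings stand in for file contents that would normally be read from disk.

```python
import hashlib

def digest(data: bytes) -> str:
    """SHA-256 checksum of the file contents."""
    return hashlib.sha256(data).hexdigest()

def export_is_stale(eaf_bytes: bytes, digest_at_export: str) -> bool:
    """True if the ELAN file has changed since the export was made."""
    return digest(eaf_bytes) != digest_at_export

# Stand-in contents; in practice these come from reading the .eaf file.
original = b"<ANNOTATION_DOCUMENT/>"
recorded = digest(original)  # stored next to the export when it is made
edited = b'<ANNOTATION_DOCUMENT AUTHOR="someone"/>'

fresh = export_is_stale(original, recorded)
stale = export_is_stale(edited, recorded)
```

This only detects that an export is out of date. It does not tell whether the export was made with the same settings as the others.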
---------------------------------------------------------------------
# Thank you!
## Up next: Our test corpus & Parsing ELAN file