---
title: "Advanced analysis and manipulation of ELAN corpus data with R and Python"
author: "Niko Partanen"
date: "11/16/2017"
output:
  html_document:
    toc: true
    toc_float: true
editor_options:
  chunk_output_type: console
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Advanced analysis and manipulation of ELAN corpus data with R and Python
This is a set of course materials that Niko Partanen ([me](https://github.com/nikopartanen)) is using while teaching this course. It is supposed to be used with a [Komi-Zyrian Test Corpus](https://github.com/langdoc/kpv-test-corpus), and the materials have an accompanying [website](https://nikopartanen.github.io/adv_elan_draft/), which is more thorough but also still more at the draft level. I currently work in the LATTICE laboratory in Paris, and I'm also working on the [IKDP-2](https://github.com/langdoc/IKDP-2) research project.
If you have comments, suggestions, or any other feedback, my email address is [email protected]. In principle I'm also on Twitter, @nikopartanen, but I don't understand very well how to use it.
The lectures assume that you have done something like:

    git clone https://github.com/langdoc/elan_lectures
    git clone https://github.com/langdoc/testcorpus
    git clone https://github.com/langdoc/praat-stuff
So when you are in the main directory, what you see is:

    elan_lectures
    testcorpus
    praat-stuff

It doesn't matter if there are other folders around.
## Slides
The course is split into the following slide sets:
- [Introduction](https://langdoc.github.io/elan_lectures/lecture-1)
- [Getting test corpus into R](https://langdoc.github.io/elan_lectures/lecture-2)
- [How does ELAN work?](https://langdoc.github.io/elan_lectures/lecture-3)
- [More advanced example with emuR, Shiny and PraatScript](https://langdoc.github.io/elan_lectures/lecture-4)
- [Some file manipulation examples](https://langdoc.github.io/elan_lectures/lecture-5)
## Preparatory work
Before the course starts, it is a good idea to install the following tools:
- [R](https://www.r-project.org/)
- [RStudio](https://www.rstudio.com/)
- [Python](https://www.python.org/downloads/)
## Getting course materials
- Download or clone the lecture Rmd files from [here](https://github.com/langdoc/elan_lectures)
- Download or clone the demo corpus from [here](https://github.com/langdoc/testcorpus)
You can do this from the command line:

    git clone https://github.com/langdoc/elan_lectures
    git clone https://github.com/langdoc/testcorpus
Course materials, and related materials, are also slowly being collected on the accompanying [website](https://nikopartanen.github.io/adv_elan_draft/), though it does not usually reflect the current state of what I want to discuss. That work is still at draft level, but the idea is to have one place where things can be discussed further.
## R packages
You need (at least) the following R packages:
- [tidyverse](https://www.tidyverse.org/)
- [shiny](https://shiny.rstudio.com/)
- [leaflet](https://rstudio.github.io/leaflet/)
- [emuR](https://github.com/IPS-LMU/emuR)
- [meow](https://github.com/achubaty/meow)
## Python
    pip install pympi-ling
## How to load R packages
```{r}
# Uncomment the install lines on the first run
# install.packages("devtools")
# install.packages("tidyverse")
library(devtools)
library(tidyverse)
library(xml2)
# FRelan and meow are installed from GitHub with devtools
# install_github(repo = "langdoc/FRelan")
# install_github(repo = "achubaty/meow")
library(FRelan)
library(meow)
```
## Projects in RStudio
In RStudio you can easily arrange your work around projects, which can be accessed in the upper-right corner of the program. If you work inside a project, your working directory is always the project's root directory, and you avoid lots of hassle.
In case you need it, this is how you can check and change the working directory.
```{r, eval=FALSE}
getwd()                           # check the current working directory
setwd("~/path/to/some/directory") # change it
```
## Expected file structure
Throughout the course we assume that you have the following file structure:

    higher_level_folder
        testcorpus
        elan_lectures
        praat-stuff
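To verify the layout from inside the `elan_lectures` folder, a quick check like this works (a minimal sketch):

```{r, eval=FALSE}
# Should list at least elan_lectures, testcorpus and praat-stuff
dir("..")
# TRUE when the sibling corpus folder is where the examples expect it
file.exists("../testcorpus")
```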
## Examples
- Reading ELAN files or individual tiers
```{r}
FRelan::read_eaf(eaf_file = '../testcorpus/kpv_udo20120330SazinaJS-encounter.eaf')
FRelan::read_tier(eaf_file = '../testcorpus/kpv_izva20140330-1-fragment.eaf', linguistic_type = "refT")
FRelan::read_tier(eaf_file = '../testcorpus/kpv_izva20140330-1-fragment.eaf', linguistic_type = "orthT")
```
## Observations
- On Windows, backslashes in file paths can behave strangely
- How do folder paths work on Windows?
- Unicode issues: easy to fix with some command, or is it about some default settings?
- Some ways of fixing Cyrillic not being displayed properly still left some characters broken
## Filenames
Things will generally go better if your filenames never contain:

- non-ASCII characters
- double underscores (`__`)
- spaces
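A quick way to flag offending filenames (a minimal sketch, assuming the `../testcorpus` folder used above):

```{r, eval=FALSE}
library(stringr)
files <- dir("../testcorpus/", recursive = TRUE)
# Filenames containing spaces or double underscores
files[str_detect(files, " |__")]
# Filenames with non-ASCII characters: iconv() returns NA
# when a name cannot be converted to plain ASCII
files[is.na(iconv(files, from = "UTF-8", to = "ASCII"))]
```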
## Cases that do not work yet
```{r, eval=FALSE}
corpus <- FRelan::read_tier(eaf_file = '../testcorpus/kpv_izva20140330-1-fragment.eaf', linguistic_type = "orthT")
corpus %>% select(conte) # fails: the column is called `content`
## DOES NOT WORK
## This should be customized somehow
FRelan::read_tier(eaf_file = '../testcorpus/kpv_izva20140330-1-fragment.eaf', linguistic_type = "orthT",
                  participant_location = "attribute|prefix|suffix")
FRelan::read_eaf('pite.eaf', ind_tier = "reference tier type", sa_tier = "transcription tier type", ss_tier = "token level tier type")
```
## Writing files
In principle you can write the results into a CSV file like this.
```{r}
FRelan::read_tier(eaf_file = '../testcorpus/kpv_izva20140330-1-fragment.eaf', linguistic_type = "orthT") %>%
write_csv('results.csv')
read_csv('results.csv')
```
There are also other ways:
- The [googlesheets](https://github.com/jennybc/googlesheets) package allows writing into Google Sheets
- [readxl](http://readxl.tidyverse.org/) works really well with Excel files -- see the example below
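Reading an Excel sheet back into R is a one-liner (a hedged sketch, assuming a hypothetical `results.xlsx` in the working directory):

```{r, eval=FALSE}
library(readxl)
# read_excel() handles both .xls and .xlsx files
read_excel("results.xlsx")
```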
## Improvements
- `ind_tier` and similar arguments in FRelan functions should get better names
- Some parts are very messy
- Lots of things are experimental and not supposed to be used by anyone
- Some parts use the [XML](https://cran.r-project.org/web/packages/XML/index.html) R package instead of the newer [xml2](https://cran.r-project.org/web/packages/xml2/index.html)
## Custom read_eaf function
This should be adaptable to most ELAN tier structures, although I admit one may have to think a bit about what is going on there. In principle the logic of ELAN's internal structure can always be recovered like this.
```{r}
read_custom_eaf <- function(path_to_file = '../testcorpus/kpv_udo20120330SazinaJS-encounter.eaf'){

  ref <- FRelan::read_tier(eaf_file = path_to_file, linguistic_type = "refT") %>%
    dplyr::select(content, annot_id, participant, time_slot_1, time_slot_2) %>%
    dplyr::rename(ref = content) %>%
    dplyr::rename(ref_id = annot_id)

  orth <- FRelan::read_tier(eaf_file = path_to_file, linguistic_type = "orthT") %>%
    dplyr::select(content, annot_id, ref_id, participant) %>%
    dplyr::rename(orth = content) %>%
    dplyr::rename(orth_id = annot_id) # %>%
    # dplyr::rename(ref_id = ref_id) # This is here just as a note

  token <- FRelan::read_tier(eaf_file = path_to_file, linguistic_type = "wordT") %>%
    dplyr::select(content, annot_id, ref_id, participant) %>%
    dplyr::rename(token = content) %>%
    dplyr::rename(token_id = annot_id) %>%
    dplyr::rename(orth_id = ref_id)

  lemma <- FRelan::read_tier(eaf_file = path_to_file, linguistic_type = "lemmaT") %>%
    dplyr::select(content, annot_id, ref_id, participant) %>%
    dplyr::rename(lemma = content) %>%
    dplyr::rename(lemma_id = annot_id) %>%
    dplyr::rename(token_id = ref_id)

  pos <- FRelan::read_tier(eaf_file = path_to_file, linguistic_type = "posT") %>%
    dplyr::select(content, ref_id, participant) %>%
    dplyr::rename(pos = content) %>%
    dplyr::rename(lemma_id = ref_id)

  # Join the tiers from the reference level downwards along their shared id columns
  elan <- left_join(ref, orth) %>%
    left_join(token) %>%
    left_join(lemma) %>%
    left_join(pos) %>%
    select(token, lemma, pos, time_slot_1, time_slot_2, everything(), -ends_with('_id'))

  # Replace time slot ids with the actual start and end times
  time_slots <- FRelan::read_timeslots(path_to_file)

  corpus <- elan %>%
    left_join(time_slots %>% rename(time_slot_1 = time_slot_id)) %>%
    rename(time_start = time_value) %>%
    left_join(time_slots %>% rename(time_slot_2 = time_slot_id)) %>%
    rename(time_end = time_value) %>%
    select(token, lemma, pos, participant, time_start, time_end, everything(), -starts_with('time_slot_'))

  corpus %>% mutate(filename = path_to_file)
}
```
## Reading the whole corpus
```{r}
library(FRelan)
dir("../testcorpus/", "kpv.+eaf$", full.names = TRUE) %>%
map(read_custom_eaf)
```
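Since `map()` returns a list with one data frame per file, you may want a single corpus-level data frame instead; `dplyr::bind_rows()` stacks them (a minimal sketch):

```{r, eval=FALSE}
corpus_all <- dir("../testcorpus/", "kpv.+eaf$", full.names = TRUE) %>%
  map(read_custom_eaf) %>%
  bind_rows() # stack the per-file data frames into one
corpus_all
```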
## Pite case
- Joshua Wilbur has a special case that gave errors; the general starting point for checking it is below, and in principle it boils down to an XPath query
- The `wordT` linguistic type is used incorrectly in places where it should be `notesT`
```{r, eval=FALSE}
library(tidyverse)
library(xml2)
elan_file <- "sje20150329b.eaf"
# Find tiers whose TIER_ID starts with 'notes' but whose type is still 'wordT'
read_xml(elan_file) %>%
  xml_find_all("//TIER[@LINGUISTIC_TYPE_REF='wordT' and starts-with(@TIER_ID, 'notes')]") %>%
  map(~ glue::glue("Fix tier {xml_attr(.x, 'TIER_ID')} in file {elan_file}"))
```
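Going one step further, the offending attribute can also be rewritten in place with xml2 (a hedged sketch under the same assumptions as above; keep a backup of the file first):

```{r, eval=FALSE}
doc <- read_xml(elan_file)
bad_tiers <- xml_find_all(doc, "//TIER[@LINGUISTIC_TYPE_REF='wordT' and starts-with(@TIER_ID, 'notes')]")
# Point the mistyped tiers to the notesT linguistic type and save the file
xml_set_attr(bad_tiers, "LINGUISTIC_TYPE_REF", "notesT")
write_xml(doc, elan_file)
```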
## Notes from Ruprecht von Waldenfels' discussion part
- Basic metadata is in itself a clear and simple issue
- Some users, e.g. sociolinguists, want particularly detailed metadata:
    - exact education
    - army service
    - where a person has lived
- Multilingualism, or speaking the majority language, can also be a focus in some projects, which changes things a bit too
    - How much information is collected on language use and knowledge?
    - Personal relations and language use between individuals
- How do we map values onto a basic core that everyone shares?
    - CMDI is a subset of what we actually deal with
    - CMDI is public and meant for finding things online -- but where do we store the real thing?
- Things get totally messy -- how do we constrain the chaos?
- There is a need for some sort of mapping between the different metadata concepts in use, but in a manner different from CMDI
## Useful packages
- tidyverse
- xml2
- stringr
- leaflet
- ggplot2
- sf (for maps; it is new and I haven't studied it at all yet, but it looks awesome)
- tidytext
    - `unnest_tokens()` -- see the sketch below
    - [Tidy Text Mining book](http://tidytextmining.com/)
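As a taste of tidytext, something like this tokenizes the `content` column that `FRelan::read_tier()` returns and counts word frequencies (a minimal sketch, using the test corpus file from earlier):

```{r, eval=FALSE}
library(tidytext)
corpus <- FRelan::read_tier(eaf_file = '../testcorpus/kpv_izva20140330-1-fragment.eaf', linguistic_type = "orthT")
corpus %>%
  unnest_tokens(word, content) %>% # one row per token
  count(word, sort = TRUE)         # word frequency list
```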
## License
CC-BY Niko Partanen 2017