---
title: "Ur III administrative texts proveniance classification"
output: html_notebook
---
This notebook will guide you through the classification task of Ur III administrative texts, assigning unprovenanced tablets to a site.

TODO: insert here a function to cross-reference the catalogue and the ATF dump for provenience and prepare the adequate CSV file.
## Grab the CDLI dumps
```{r}
# download the CDLI catalogue and ATF dumps; mode = "wb" is needed for
# binary zip files (especially on Windows), and method = "auto" lets R
# pick whatever download method is available
download.file('https://raw.githubusercontent.com/cdli-gh/data/master/cdli_catalogue.csv.zip',
              'catalogue.zip', method = "auto", quiet = FALSE, mode = "wb")
download.file('https://raw.githubusercontent.com/cdli-gh/data/master/cdliatf_unblocked.atf.zip',
              'atf.zip', method = "auto", quiet = FALSE, mode = "wb")
```
## Unzip
Here we unzip the catalogue and textual data we just downloaded.
```{r}
unzip('catalogue.zip', exdir = ".", overwrite = TRUE)
unzip('atf.zip', exdir = ".", overwrite = TRUE)
```
## Read and chop
This chunk reads the catalogue into a data frame, keeps only administrative texts, and selects the needed columns.
```{r}
catalogue <- read.csv("cdli_catalogue.csv")             # read the catalogue
catalogue <- subset(catalogue, grepl("Admin", genre))   # keep administrative texts only
catalogue <- catalogue[, c('id_text', 'provenience', 'period')]
catalogue2 <- catalogue                                 # keep an untouched copy
```
## Choose period
This chunk selects only Ur III texts and discards the rest.
```{r}
# keep only rows whose period contains "Ur III"
catalogue <- subset(catalogue, grepl("Ur III", period))
catalogue$period <- NULL   # the period column is no longer needed
```
```{r}
# add two empty columns for the assumed and guessed proveniences
catalogue$assumed_prov <- NA
catalogue$guessed_prov <- NA
```
```{r}
library(stringr)
cat2 <- catalogue
# coerce the factor columns to character before string matching
cat2$provenience  <- as.character(cat2$provenience)
cat2$assumed_prov <- as.character(cat2$assumed_prov)
# copy uncertain proveniences (those marked with a "?") into assumed_prov
cat2$assumed_prov <- ifelse(str_detect(cat2$provenience, fixed("?")), cat2$provenience, '')
write.csv(cat2, "cat2.csv")
```
## Clean up ATF and match with catalogue
- convert the file to a two-column table: text id + atf
- clean up the atf by dropping any lines that do not start with a numeral
- remove the numerals
- clean up the tokens
- match the atf entries with the catalogue
- remove catalogue entries with no atf

A sketch of these steps follows the chunk below.
```{r}
# read the ATF dump as raw lines (it is a plain-text file, not a CSV)
atf_lines <- readLines("cdliatf_unblocked.atf")
```
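Below is a minimal sketch of those cleanup steps, assuming CDLI ATF conventions (each text opens with a header line such as `&P100001 = ...`, and transliteration lines begin with a line number like `1.` or `2'.`). The names `atf_lines`, `atf_table` and `cat_atf` are illustrative, and token-level cleanup is left out.
```{r}
text_headers <- grepl("^&P", atf_lines)
ids <- sub("^&(P[0-9]+).*", "\\1", atf_lines[text_headers])
# split the dump into one chunk of lines per text (assumes the dump opens
# with a &P header line, so every line belongs to some text)
chunks <- split(atf_lines, cumsum(text_headers))
atf_table <- data.frame(
  id_text = ids,
  atf = vapply(chunks, function(lines) {
    translit <- lines[grepl("^[0-9]", lines)]          # keep lines starting with a numeral
    translit <- sub("^[0-9]+'*\\.\\s*", "", translit)  # strip the leading line numbers
    paste(translit, collapse = " ")
  }, character(1)),
  stringsAsFactors = FALSE
)
# CDLI P-numbers are zero-padded to six digits; pad the catalogue ids to match
cat2$id_text <- sprintf("P%06d", cat2$id_text)
# an inner merge keeps only catalogue entries that actually have ATF text
cat_atf <- merge(cat2, atf_table, by = "id_text")
```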
## Clean up catalogue
- Remove the provenience from entries with "?" in the provenience field and move it to the other column
- Create a table with text_id, provenience and an empty column for the textual data
- Should I also add dates referenced and size fields?

Target columns: text_id, atf, provenience, assumed_prov, guessed_prov.
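A hedged sketch of that table's shape, assuming `cat2` from the chunk above (the `working` name is illustrative):
```{r}
# sketch only: the working table described above, built from cat2
working <- data.frame(
  text_id = cat2$id_text,
  atf = "",   # to be filled in from the ATF dump
  # blank out proveniences that were uncertain (they live in assumed_prov)
  provenience = ifelse(cat2$assumed_prov == "", cat2$provenience, ""),
  assumed_prov = cat2$assumed_prov,
  guessed_prov = NA_character_,
  stringsAsFactors = FALSE
)
```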
## At this point we manually cleaned the data to get a chance to apply our algorithm before the end of Friday...
## Classification
At this point, we have one table of three columns: text_id, provenience and atf.
- Set aside a copy of 33% of the texts with known provenience for testing
```{r}
# convert catalogue entry ids to P-numbers and match the catalogue with the atf
cat <- read.csv("cat.csv")   # the manually cleaned catalogue
atf <- read.csv("transcat-3.csv", header = FALSE)
colnames(atf) <- c("id_text", "atf")
atf$id_text <- paste("P", atf$id_text, sep = "")
```
```{r}
full_data <- merge(cat, atf, by = "id_text")
# merge() already drops catalogue entries with no matching atf, so this is redundant:
# full_data <- subset(full_data, !is.na(full_data$atf))
write.csv(full_data, "full_data.csv")
```
Remaining steps:

- train on 2/3 of the known texts
- test on 1/3 of the known texts
- extract features for each provenience:
  - word frequencies
  - word lengths
  - bigram frequencies
  - topic modeling
- compute the same features on the unknown texts
- classify with the Support Vector Machine algorithm

A sketch of this pipeline follows.
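A minimal sketch of that pipeline, covering only the word-frequency feature for brevity. It assumes `full_data` has the columns id_text, provenience and atf, that unprovenanced rows carry an empty provenience string, and it uses the e1071 package for the SVM; all names here are illustrative, not the final implementation.
```{r}
library(e1071)

set.seed(42)
known   <- full_data[full_data$provenience != "", ]
unknown <- full_data[full_data$provenience == "", ]

# hold out one third of the provenienced texts for testing
test_idx <- sample(nrow(known), size = floor(nrow(known) / 3))
train <- known[-test_idx, ]
test  <- known[test_idx, ]

# crude word-frequency features: relative frequencies of the 500 most
# common tokens seen in the training texts
tokens <- strsplit(tolower(train$atf), "\\s+")
vocab  <- names(sort(table(unlist(tokens)), decreasing = TRUE))[1:500]
featurize <- function(texts) {
  t(vapply(strsplit(tolower(texts), "\\s+"), function(tok) {
    counts <- table(factor(tok, levels = vocab))
    as.numeric(counts) / max(1, length(tok))
  }, numeric(length(vocab))))
}

x_train <- featurize(train$atf)
x_test  <- featurize(test$atf)

model <- svm(x_train, factor(train$provenience), kernel = "linear")
pred  <- predict(model, x_test)
mean(pred == test$provenience)   # held-out accuracy

# classify the unprovenanced tablets
unknown$guessed_prov <- as.character(predict(model, featurize(unknown$atf)))
```
The same `featurize` scaffold can be extended with the word-length, bigram and topic-model features listed above by column-binding extra feature matrices before training.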