---
title: "Ur III administrative texts proveniance classification"
output: html_notebook
---
This notebook will guide you through the classification task of Ur III administrative texts, assigning unprovenanced tablets to a site.

TODO: insert here a function to cross-reference the catalogue and the ATF dump for provenience and prepare the adequate CSV file.
## Grab the CDLI dumps
```{r}
# download the CDLI catalogue and ATF dumps; mode = "wb" is needed for
# binary zip files (especially on Windows), and method = "auto" lets R
# pick whatever download method is available
download.file('https://raw.githubusercontent.com/cdli-gh/data/master/cdli_catalogue.csv.zip',
              'catalogue.zip', method = "auto", quiet = FALSE, mode = "wb")
download.file('https://raw.githubusercontent.com/cdli-gh/data/master/cdliatf_unblocked.atf.zip',
              'atf.zip', method = "auto", quiet = FALSE, mode = "wb")
```
## Unzip
Here we unzip the catalogue and textual data we just downloaded.
```{r}
unzip('catalogue.zip', exdir = ".", overwrite = TRUE)
unzip('atf.zip', exdir = ".", overwrite = TRUE)
```
## Read and chop
This chunk reads the catalogue into a data frame, keeps only administrative texts, and selects the needed columns.
```{r}
catalogue <- read.csv("cdli_catalogue.csv")             # read the catalogue
catalogue <- subset(catalogue, grepl("Admin", genre))   # keep administrative texts only
catalogue <- catalogue[, c('id_text', 'provenience', 'period')]
catalogue2 <- catalogue                                 # keep an untouched copy
```
## Choose period
This chunk selects only Ur III texts and discards the rest.
```{r}
# keep only rows whose period contains "Ur III"
catalogue <- subset(catalogue, grepl("Ur III", period))
catalogue$period <- NULL   # the period column is no longer needed
```
```{r}
# add two empty columns for the assumed and guessed proveniences
catalogue$assumed_prov <- NA
catalogue$guessed_prov <- NA
```
```{r}
library(stringr)
cat2 <- catalogue
# coerce the factor columns to character before string matching
cat2$provenience  <- as.character(cat2$provenience)
cat2$assumed_prov <- as.character(cat2$assumed_prov)
# copy uncertain proveniences (those marked with a "?") into assumed_prov
cat2$assumed_prov <- ifelse(str_detect(cat2$provenience, fixed("?")), cat2$provenience, '')
write.csv(cat2, "cat2.csv")
```
## Clean up ATF and match with catalogue
- convert the file to a two-column table: text id + atf
- clean up the atf by dropping any lines that do not start with a numeral
- remove the numerals
- clean up the tokens
- match the atf entries with the catalogue
- remove catalogue entries with no atf

A sketch of these steps follows the chunk below.
```{r}
# read the ATF dump as raw lines (it is a plain-text file, not a CSV)
atf_lines <- readLines("cdliatf_unblocked.atf")
```
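Below is a minimal sketch of those cleanup steps, assuming CDLI ATF conventions (each text opens with a header line such as `&P100001 = ...`, and transliteration lines begin with a line number like `1.` or `2'.`). The names `atf_lines`, `atf_table` and `cat_atf` are illustrative, and token-level cleanup is left out.
```{r}
text_headers <- grepl("^&P", atf_lines)
ids <- sub("^&(P[0-9]+).*", "\\1", atf_lines[text_headers])
# split the dump into one chunk of lines per text (assumes the dump opens
# with a &P header line, so every line belongs to some text)
chunks <- split(atf_lines, cumsum(text_headers))
atf_table <- data.frame(
  id_text = ids,
  atf = vapply(chunks, function(lines) {
    translit <- lines[grepl("^[0-9]", lines)]          # keep lines starting with a numeral
    translit <- sub("^[0-9]+'*\\.\\s*", "", translit)  # strip the leading line numbers
    paste(translit, collapse = " ")
  }, character(1)),
  stringsAsFactors = FALSE
)
# CDLI P-numbers are zero-padded to six digits; pad the catalogue ids to match
cat2$id_text <- sprintf("P%06d", cat2$id_text)
# an inner merge keeps only catalogue entries that actually have ATF text
cat_atf <- merge(cat2, atf_table, by = "id_text")
```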
## Clean up catalogue
- Remove the provenience from entries with "?" in the provenience field and move it to the other column
- Create a table with text_id, provenience and an empty column for the textual data
- Should I also add dates referenced and size fields?

Target columns: text_id, atf, provenience, assumed_prov, guessed_prov.
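A hedged sketch of that table's shape, assuming `cat2` from the chunk above (the `working` name is illustrative):
```{r}
# sketch only: the working table described above, built from cat2
working <- data.frame(
  text_id = cat2$id_text,
  atf = "",   # to be filled in from the ATF dump
  # blank out proveniences that were uncertain (they live in assumed_prov)
  provenience = ifelse(cat2$assumed_prov == "", cat2$provenience, ""),
  assumed_prov = cat2$assumed_prov,
  guessed_prov = NA_character_,
  stringsAsFactors = FALSE
)
```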
## At this point we manually cleaned the data to get a chance to apply our algorithm before the end of Friday...
## Classification
At this point, we have one table of three columns: text_id, provenience and atf.
- Set aside a copy of 33% of the texts with known provenience for testing
```{r}
# convert catalogue entry ids to P-numbers and match the catalogue with the atf
cat <- read.csv("cat.csv")   # the manually cleaned catalogue
atf <- read.csv("transcat-3.csv", header = FALSE)
colnames(atf) <- c("id_text", "atf")
atf$id_text <- paste("P", atf$id_text, sep = "")
```
```{r}
full_data <- merge(cat, atf, by = "id_text")
# merge() already drops catalogue entries with no matching atf, so this is redundant:
# full_data <- subset(full_data, !is.na(full_data$atf))
write.csv(full_data, "full_data.csv")
```
Remaining steps:

- train on 2/3 of the known texts
- test on 1/3 of the known texts
- extract features for each provenience:
  - word frequencies
  - word lengths
  - bigram frequencies
  - topic modeling
- compute the same features on the unknown texts
- classify with the Support Vector Machine algorithm

A sketch of this pipeline follows.
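A minimal sketch of that pipeline, covering only the word-frequency feature for brevity. It assumes `full_data` has the columns id_text, provenience and atf, that unprovenanced rows carry an empty provenience string, and it uses the e1071 package for the SVM; all names here are illustrative, not the final implementation.
```{r}
library(e1071)

set.seed(42)
known   <- full_data[full_data$provenience != "", ]
unknown <- full_data[full_data$provenience == "", ]

# hold out one third of the provenienced texts for testing
test_idx <- sample(nrow(known), size = floor(nrow(known) / 3))
train <- known[-test_idx, ]
test  <- known[test_idx, ]

# crude word-frequency features: relative frequencies of the 500 most
# common tokens seen in the training texts
tokens <- strsplit(tolower(train$atf), "\\s+")
vocab  <- names(sort(table(unlist(tokens)), decreasing = TRUE))[1:500]
featurize <- function(texts) {
  t(vapply(strsplit(tolower(texts), "\\s+"), function(tok) {
    counts <- table(factor(tok, levels = vocab))
    as.numeric(counts) / max(1, length(tok))
  }, numeric(length(vocab))))
}

x_train <- featurize(train$atf)
x_test  <- featurize(test$atf)

model <- svm(x_train, factor(train$provenience), kernel = "linear")
pred  <- predict(model, x_test)
mean(pred == test$provenience)   # held-out accuracy

# classify the unprovenanced tablets
unknown$guessed_prov <- as.character(predict(model, featurize(unknown$atf)))
```
The same `featurize` scaffold can be extended with the word-length, bigram and topic-model features listed above by column-binding extra feature matrices before training.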