Commit 3636ca8

Solved regex warnings, restored TF 1.13.1 for compatibility, created generate.sh, updated readme.
Tommaso Soru committed May 6, 2020
1 parent 76898db commit 3636ca8
Showing 7 changed files with 40 additions and 64 deletions.
51 changes: 18 additions & 33 deletions README.md
@@ -1,12 +1,11 @@
# 🤖 Neural SPARQL Machines

- A LSTM-based Machine Translation Approach for Question Answering.
+ An LSTM-based Machine Translation Approach for Question Answering over Knowledge Graphs.

![British flag.](http://www.liberai.org/img/flag-uk-160px.png "English")
![Seq2Seq neural network.](http://www.liberai.org/img/seq2seq-webexport-160px.png "seq2seq")
![Semantic triple flag.](http://www.liberai.org/img/flag-sparql-160px.png "SPARQL")

- This is a fork of the original repo which i updated to work with python3 and tensorflow 1.14.0

## Code

@@ -18,75 +17,58 @@ git lfs checkout
git submodule update --init
```

- ### Python Setup
+ ### Python setup

Codebase: Python 3.7+

```bash
pip install -r requirements.txt
```

- ### Data preparation
+ ### The Generator module

#### Pre-generated data

You can extract pre-generated data from `data/monument_300.zip` and `data/monument_600.zip` into folders with the respective names.

#### Manual Generation (Alternative to using pre-generated data)

- The template used in the paper can be found in a file such as `annotations_monument.tsv`. To generate the training data, launch the following command.
+ The template used in the paper can be found in a file such as `annotations_monument.tsv`. `data/monument_300` is the ID of the working dataset used throughout this tutorial. To generate the training data, launch the following command.

- <!-- Made monument_300 directory in data directory due to absence of monument_300 folder in data directory -->
```bash
- mkdir data/monument_300
python generator.py --templates data/annotations_monument.csv --output data/monument_300
```
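For intuition, each template row pairs a natural-language question with a SPARQL query over a shared placeholder, which the generator instantiates with entities from the knowledge graph. A minimal sketch of the idea, with a made-up placeholder syntax (the real templates live in `data/annotations_monument.csv`):

```python
# Illustrative sketch only: the <A> placeholder syntax and this
# example pair are assumptions; the real templates are defined in
# data/annotations_monument.csv and consumed by generator.py.
question_template = "where is <A> located in?"
query_template = "select ?x where { <A> dbo:location ?x }"

# The generator fills the placeholder from the knowledge graph,
# using the entity label in the question and its URI in the query.
for label, uri in [("edward vii monument", "dbr:Edward_VII_Monument")]:
    print(question_template.replace("<A>", label))
    print(query_template.replace("<A>", uri))
```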

- Build the vocabularies for the two languages (i.e., English and SPARQL) with:
+ Launch the following command to build the vocabularies for the two languages (i.e., English and SPARQL) and split the data into train, dev, and test sets.

```bash
- python build_vocab.py data/monument_300/data_300.en > data/monument_300/vocab.en
- python build_vocab.py data/monument_300/data_300.sparql > data/monument_300/vocab.sparql
+ ./generate.sh data/monument_300
```
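Conceptually, building a vocabulary just collects the distinct tokens of each language file. A rough sketch of the idea (not the actual `build_vocab.py` implementation):

```python
import sys
from collections import Counter

# Rough sketch of vocabulary extraction (not the actual
# build_vocab.py): count whitespace-separated tokens and print
# one token per line, most frequent first.
counts = Counter()
with open(sys.argv[1], encoding="utf-8") as f:
    for line in f:
        counts.update(line.split())
for token, _ in counts.most_common():
    print(token)
```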

- Count lines in `data_.*`
- <!-- Fixing the bash related error pertaining to assigning value to NUMLINES here -->
- ```bash
- NUMLINES=$(echo awk '{ print $1}' | cat data/monument_300/data_300.sparql | wc -l)
- echo $NUMLINES
- # 7097 (Don't worry if it varies)
- ```
-
- Split the `data_.*` files into `train_.*`, `dev_.*`, and `test_.*` (usually 80-10-10%).
-
- <!-- Making this instruction consistent with the previous instructions by changing data.sparql to data_300.sparql -->
- ```bash
- cd data/monument_300/
- python ../../split_in_train_dev_test.py --lines $NUMLINES --dataset data_300.sparql
- ```

- ### Training
+ ### The Learner module

- <!-- Just a simple note to go back to the initial directory.-->
+ Now launch `train.sh` from the repository root to train the model. The first parameter is the prefix of the data directory and the second parameter is the number of training epochs.

```bash
- sh train.sh data/monument_300 12000
+ ./train.sh data/monument_300 12000
```

This command will create a model directory called `data/monument_300_model`.

- ### Inference
+ ### The Interpreter module

Predict the SPARQL query for a given question with a given model.

```bash
- sh ask.sh data/monument_300 "where is edward vii monument located in?"
+ ./ask.sh data/monument_300 "where is edward vii monument located in?"
```
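Under the hood, the model emits SPARQL as a flat token sequence that is decoded back into a query. The sketch below assumes a simplified encoding; the real token mapping is defined in `generator_utils.py`, and actual model output will differ:

```python
# Hypothetical decoder for a simplified SPARQL token encoding; the
# token names and the example sequence below are assumptions for
# illustration, not the model's actual output.
ENCODING = {"brack_open": "{", "brack_close": "}", "var_x": "?x"}

def decode(sequence: str) -> str:
    return " ".join(ENCODING.get(token, token) for token in sequence.split())

print(decode("select var_x where brack_open "
             "dbr:Edward_VII_Monument dbo:location var_x brack_close"))
```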

- ### Chatbots, Integration & Cia
+ ## Use cases & Integrations

- - Telegram: The [Telegram NSpM chatbot](https://github.com/AKSW/NSpM/wiki/NSpM-Telegram-Bot) offers an integration of NSpM with the Telegram message platform.
+ * [The Telegram NSpM chatbot](https://github.com/AKSW/NSpM/wiki/NSpM-Telegram-Bot) offers an integration of NSpM with the Telegram messaging platform.

## Papers

@@ -123,4 +105,7 @@ sh ask.sh data/monument_300 "where is edward vii monument located in?"

* Primary contacts: [Tommaso Soru](http://tommaso-soru.it) and [Edgard Marx](http://emarx.org).
* Neural SPARQL Machines [mailing list](https://groups.google.com/forum/#!forum/neural-sparql-machines).
- * Follow the [project on ResearchGate](https://www.researchgate.net/project/Neural-SPARQL-Machines).
+ * Follow the [project on ResearchGate](https://www.researchgate.net/project/Neural-SPARQL-Machines).
+ * Follow [Liber AI Research](http://liberai.org) on [Twitter](https://twitter.com/theLiberAI).
+
+ ![Liber AI logo.](http://www.liberai.org/img/Liber-AI-logo-name-200px.png "Liber AI")
4 changes: 2 additions & 2 deletions analyse.py
@@ -42,8 +42,8 @@ def validate( translation ):
match = re.search(entity_with_attribute, query)
if match:
entity = match.group(0)
- entity_encoded = re.sub(r'\(<?', '\(', entity)
- entity_encoded = re.sub(r'>?\)', '\)', entity_encoded)
+ entity_encoded = re.sub(r'\(<?', r'\(', entity)
+ entity_encoded = re.sub(r'>?\)', r'\)', entity_encoded)
query = query.replace(entity, entity_encoded)
try:
parser.parseQuery(query)
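Context for the change above: in Python 3, a non-raw string literal containing an unrecognized escape sequence such as `'\('` emits a DeprecationWarning (invalid escape sequence), even though the regex still behaves the same. Raw strings express the identical pattern warning-free; a small demonstration with a made-up entity:

```python
import re

# Made-up entity of the shape analyse.py escapes: a URI whose
# parentheses must be backslash-escaped before SPARQL parsing.
entity = "<http://dbpedia.org/resource/Monument_(Berlin)>"

# Raw strings hand the backslashes to the regex engine unchanged,
# so no "invalid escape sequence" warning is emitted at compile time.
encoded = re.sub(r'\(<?', r'\(', entity)
encoded = re.sub(r'>?\)', r'\)', encoded)
print(encoded)  # <http://dbpedia.org/resource/Monument_\(Berlin\)>
```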
16 changes: 16 additions & 0 deletions generate.sh
@@ -0,0 +1,16 @@
#!/usr/bin/env bash
DATASET=$1

echo "Building English vocabulary..."
python build_vocab.py ${DATASET}/data.en > ${DATASET}/vocab.en
echo "Building SPARQL vocabulary..."
python build_vocab.py ${DATASET}/data.sparql > ${DATASET}/vocab.sparql

NUM_LINES=$(wc -l < ${DATASET}/data.sparql)

NSPM_HOME=`pwd`
cd ${DATASET}
echo "Splitting data into train, dev, and test sets..."
python ${NSPM_HOME}/split_in_train_dev_test.py --lines $NUM_LINES --dataset data.sparql
cd ${NSPM_HOME}
echo "Done."
2 changes: 1 addition & 1 deletion generator.py
@@ -180,7 +180,7 @@ def generate_dataset(templates, output_dir, file_mode):
if not os.path.exists(output_dir):
os.makedirs(output_dir)
it = 0
- with io.open(output_dir + '/data_300.en', file_mode, encoding="utf-8") as english_questions, io.open(output_dir + '/data_300.sparql', file_mode, encoding="utf-8") as sparql_queries:
+ with io.open(output_dir + '/data.en', file_mode, encoding="utf-8") as english_questions, io.open(output_dir + '/data.sparql', file_mode, encoding="utf-8") as sparql_queries:
for template in tqdm(templates):
it = it + 1
print("for {}th template".format(it))
2 changes: 1 addition & 1 deletion generator_utils.py
@@ -243,7 +243,7 @@ def extractTriples (sparqlQuery):


def splitIntoTriples (whereStatement):
- tripleAndSeparators = re.split('(\.[\s\?\<$])', whereStatement)
+ tripleAndSeparators = re.split(r'(\.[\s\?\<$])', whereStatement)
trimmed = [str.strip() for str in tripleAndSeparators]

def repair (list, element):
4 changes: 2 additions & 2 deletions requirements.txt
@@ -18,8 +18,8 @@ numpy==1.16.6
protobuf==3.11.3
six==1.14.0
soupsieve==1.9.5
- tensorboard==1.14.0
- tensorflow==1.14.0
+ tensorboard==1.13.1
+ tensorflow==1.13.1
termcolor==1.1.0
tqdm==4.43.0
Werkzeug==1.0.0
25 changes: 0 additions & 25 deletions sparql.grammar

This file was deleted.
