Commit 3636ca8

Solved regex warnings, restored TF 1.13.1 for compatibility, created generate.sh, updated readme.
Tommaso Soru committed May 6, 2020
1 parent 76898db commit 3636ca8
Showing 7 changed files with 40 additions and 64 deletions.
51 changes: 18 additions & 33 deletions README.md
@@ -1,12 +1,11 @@
# 🤖 Neural SPARQL Machines

- A LSTM-based Machine Translation Approach for Question Answering.
+ An LSTM-based Machine Translation Approach for Question Answering over Knowledge Graphs.

![British flag.](http://www.liberai.org/img/flag-uk-160px.png "English")
![Seq2Seq neural network.](http://www.liberai.org/img/seq2seq-webexport-160px.png "seq2seq")
![Semantic triple flag.](http://www.liberai.org/img/flag-sparql-160px.png "SPARQL")

- This is a fork of the original repo which i updated to work with python3 and tensorflow 1.14.0

## Code

@@ -18,75 +17,58 @@ git lfs checkout
git submodule update --init
```

- ### Python Setup
+ ### Python setup

Codebase: Python 3.7+

```bash
pip install -r requirements.txt
```

- ### Data preparation
+ ### The Generator module

#### Pre-generated data

You can extract pre-generated data from `data/monument_300.zip` and `data/monument_600.zip` into folders with the respective names.

#### Manual Generation (Alternative to using pre-generated data)

- The template used in the paper can be found in a file such as `annotations_monument.tsv`. To generate the training data, launch the following command.
+ The template used in the paper can be found in a file such as `annotations_monument.tsv`. `data/monument_300` is the ID of the working dataset used throughout this tutorial. To generate the training data, launch the following command.

- <!-- Made monument_300 directory in data directory due to absence of monument_300 folder in data directory -->
```bash
- mkdir data/monument_300
python generator.py --templates data/annotations_monument.csv --output data/monument_300
```
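For intuition, each template row pairs a natural-language question with a SPARQL query over a shared placeholder, which the generator instantiates with entities from the knowledge graph. A minimal sketch of the idea, with a made-up placeholder syntax (the real templates live in `data/annotations_monument.csv`):

```python
# Illustrative sketch only: the <A> placeholder syntax and this
# example pair are assumptions; the real templates are defined in
# data/annotations_monument.csv and consumed by generator.py.
question_template = "where is <A> located in?"
query_template = "select ?x where { <A> dbo:location ?x }"

# The generator fills the placeholder from the knowledge graph,
# using the entity label in the question and its URI in the query.
for label, uri in [("edward vii monument", "dbr:Edward_VII_Monument")]:
    print(question_template.replace("<A>", label))
    print(query_template.replace("<A>", uri))
```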

- Build the vocabularies for the two languages (i.e., English and SPARQL) with:
+ Launch the following command to build the vocabularies for the two languages (i.e., English and SPARQL) and split the data into train, dev, and test sets.

```bash
- python build_vocab.py data/monument_300/data_300.en > data/monument_300/vocab.en
- python build_vocab.py data/monument_300/data_300.sparql > data/monument_300/vocab.sparql
+ ./generate.sh data/monument_300
```
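Conceptually, building a vocabulary just collects the distinct tokens of each language file. A rough sketch of the idea (not the actual `build_vocab.py` implementation):

```python
import sys
from collections import Counter

# Rough sketch of vocabulary extraction (not the actual
# build_vocab.py): count whitespace-separated tokens and print
# one token per line, most frequent first.
counts = Counter()
with open(sys.argv[1], encoding="utf-8") as f:
    for line in f:
        counts.update(line.split())
for token, _ in counts.most_common():
    print(token)
```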

- Count lines in `data_.*`
- <!-- Fixing the bash related error pertaining to assigning value to NUMLINES here -->
- ```bash
- NUMLINES=$(echo awk '{ print $1}' | cat data/monument_300/data_300.sparql | wc -l)
- echo $NUMLINES
- # 7097 (Don't worry if it varies)
- ```
-
- Split the `data_.*` files into `train_.*`, `dev_.*`, and `test_.*` (usually 80-10-10%).
-
- <!-- Making this instruction consistent with the previous instructions by changing data.sparql to data_300.sparql -->
- ```bash
- cd data/monument_300/
- python ../../split_in_train_dev_test.py --lines $NUMLINES --dataset data_300.sparql
- ```

- ### Training
+ ### The Learner module

- <!-- Just a simple note to go back to the initial directory.-->
+ Now launch `train.sh` from the repository root to train the model. The first parameter is the prefix of the data directory and the second parameter is the number of training epochs.

```bash
- sh train.sh data/monument_300 12000
+ ./train.sh data/monument_300 12000
```

This command will create a model directory called `data/monument_300_model`.

- ### Inference
+ ### The Interpreter module

Predict the SPARQL query for a given question with a given model.

```bash
- sh ask.sh data/monument_300 "where is edward vii monument located in?"
+ ./ask.sh data/monument_300 "where is edward vii monument located in?"
```
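Under the hood, the model emits SPARQL as a flat token sequence that is decoded back into a query. The sketch below assumes a simplified encoding; the real token mapping is defined in `generator_utils.py`, and actual model output will differ:

```python
# Hypothetical decoder for a simplified SPARQL token encoding; the
# token names and the example sequence below are assumptions for
# illustration, not the model's actual output.
ENCODING = {"brack_open": "{", "brack_close": "}", "var_x": "?x"}

def decode(sequence: str) -> str:
    return " ".join(ENCODING.get(token, token) for token in sequence.split())

print(decode("select var_x where brack_open "
             "dbr:Edward_VII_Monument dbo:location var_x brack_close"))
```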

- ### Chatbots, Integration & Cia
+ ## Use cases & Integrations

- - Telegram: The [Telegram NSpM chatbot](https://github.com/AKSW/NSpM/wiki/NSpM-Telegram-Bot) offers an integration of NSpM with the Telegram message platform.
+ * [The Telegram NSpM chatbot](https://github.com/AKSW/NSpM/wiki/NSpM-Telegram-Bot) offers an integration of NSpM with the Telegram messaging platform.

## Papers

@@ -123,4 +105,7 @@ sh ask.sh data/monument_300 "where is edward vii monument located in?"

* Primary contacts: [Tommaso Soru](http://tommaso-soru.it) and [Edgard Marx](http://emarx.org).
* Neural SPARQL Machines [mailing list](https://groups.google.com/forum/#!forum/neural-sparql-machines).
- * Follow the [project on ResearchGate](https://www.researchgate.net/project/Neural-SPARQL-Machines).
+ * Follow the [project on ResearchGate](https://www.researchgate.net/project/Neural-SPARQL-Machines).
+ * Follow [Liber AI Research](http://liberai.org) on [Twitter](https://twitter.com/theLiberAI).
+
+ ![Liber AI logo.](http://www.liberai.org/img/Liber-AI-logo-name-200px.png "Liber AI")
4 changes: 2 additions & 2 deletions analyse.py
@@ -42,8 +42,8 @@ def validate( translation ):
match = re.search(entity_with_attribute, query)
if match:
entity = match.group(0)
- entity_encoded = re.sub(r'\(<?', '\(', entity)
- entity_encoded = re.sub(r'>?\)', '\)', entity_encoded)
+ entity_encoded = re.sub(r'\(<?', r'\(', entity)
+ entity_encoded = re.sub(r'>?\)', r'\)', entity_encoded)
query = query.replace(entity, entity_encoded)
try:
parser.parseQuery(query)
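Context for the change above: in Python 3, a non-raw string literal containing an unrecognized escape sequence such as `'\('` emits a DeprecationWarning (invalid escape sequence), even though the regex still behaves the same. Raw strings express the identical pattern warning-free; a small demonstration with a made-up entity:

```python
import re

# Made-up entity of the shape analyse.py escapes: a URI whose
# parentheses must be backslash-escaped before SPARQL parsing.
entity = "<http://dbpedia.org/resource/Monument_(Berlin)>"

# Raw strings hand the backslashes to the regex engine unchanged,
# so no "invalid escape sequence" warning is emitted at compile time.
encoded = re.sub(r'\(<?', r'\(', entity)
encoded = re.sub(r'>?\)', r'\)', encoded)
print(encoded)  # <http://dbpedia.org/resource/Monument_\(Berlin\)>
```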
16 changes: 16 additions & 0 deletions generate.sh
@@ -0,0 +1,16 @@
#!/usr/bin/env bash
DATASET=$1

echo "Building English vocabulary..."
python build_vocab.py ${DATASET}/data.en > ${DATASET}/vocab.en
echo "Building SPARQL vocabulary..."
python build_vocab.py ${DATASET}/data.sparql > ${DATASET}/vocab.sparql

NUM_LINES=$(wc -l < ${DATASET}/data.sparql)

NSPM_HOME=`pwd`
cd ${DATASET}
echo "Splitting data into train, dev, and test sets..."
python ${NSPM_HOME}/split_in_train_dev_test.py --lines $NUM_LINES --dataset data.sparql
cd ${NSPM_HOME}
echo "Done."
2 changes: 1 addition & 1 deletion generator.py
@@ -180,7 +180,7 @@ def generate_dataset(templates, output_dir, file_mode):
if not os.path.exists(output_dir):
os.makedirs(output_dir)
it = 0
- with io.open(output_dir + '/data_300.en', file_mode, encoding="utf-8") as english_questions, io.open(output_dir + '/data_300.sparql', file_mode, encoding="utf-8") as sparql_queries:
+ with io.open(output_dir + '/data.en', file_mode, encoding="utf-8") as english_questions, io.open(output_dir + '/data.sparql', file_mode, encoding="utf-8") as sparql_queries:
for template in tqdm(templates):
it = it + 1
print("for {}th template".format(it))
2 changes: 1 addition & 1 deletion generator_utils.py
@@ -243,7 +243,7 @@ def extractTriples (sparqlQuery):


def splitIntoTriples (whereStatement):
- tripleAndSeparators = re.split('(\.[\s\?\<$])', whereStatement)
+ tripleAndSeparators = re.split(r'(\.[\s\?\<$])', whereStatement)
trimmed = [str.strip() for str in tripleAndSeparators]

def repair (list, element):
4 changes: 2 additions & 2 deletions requirements.txt
@@ -18,8 +18,8 @@ numpy==1.16.6
protobuf==3.11.3
six==1.14.0
soupsieve==1.9.5
- tensorboard==1.14.0
- tensorflow==1.14.0
+ tensorboard==1.13.1
+ tensorflow==1.13.1
termcolor==1.1.0
tqdm==4.43.0
Werkzeug==1.0.0
25 changes: 0 additions & 25 deletions sparql.grammar

This file was deleted.
