# Question-Answering over DBpedia with Pretrained Auto-regressive Models

## Overview
This project addresses the challenge of translating natural language questions into formal SPARQL queries, enabling efficient information retrieval from the DBpedia knowledge graph. Traditionally, interfacing with knowledge graphs has required a working knowledge of complex formal query languages, a barrier for non-expert users; natural language access makes fact-based interaction with open knowledge repositories such as DBpedia far more approachable. To ease this process, the project builds on the foundation laid by the [Neural SPARQL Machines (NSpM)](https://github.com/LiberAI/NSpM) project, taking it a step further by leveraging pre-trained Large Language Models (LLMs).

To learn more about the project scope, objectives and weekly progress, check out the designated [blog page](https://mehrzadshm.github.io/GSoC-2023-blog/).

## Approach
The cornerstone of the methodology is fine-tuning LLMs that are specialized for code generation. To compare different architectures and model sizes, three base models were included in the experiments:

| Model Architecture | Model Size | Checkpoint Link |
|--------------------|------------|-----------------|
| CodeGen | 350M | https://huggingface.co/Salesforce/codegen-350M-multi |
| StarCoder | 1B | https://huggingface.co/bigcode/starcoderbase-1b |
| Code Llama | 7B | https://huggingface.co/codellama/CodeLlama-7b-hf |

All fine-tuning runs leveraged [PEFT](https://github.com/huggingface/peft) (Parameter-Efficient Fine-Tuning) and [QLoRA](https://github.com/artidoro/qlora) (Low-Rank Adaptation of quantized Large Language Models).
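For illustration, a QLoRA setup with the `peft` and `bitsandbytes` libraries typically looks like the sketch below. This is a minimal sketch, not the project's exact training script; the LoRA hyperparameters (`r`, `lora_alpha`, dropout) and the `qkv_proj` target module are illustrative assumptions for the CodeGen architecture:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the base model in 4-bit NF4 precision (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "Salesforce/codegen-350M-multi",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Attach trainable low-rank adapters to the frozen quantized model.
# `qkv_proj` is CodeGen's fused attention projection; the rank and
# scaling values here are illustrative, not the project's exact settings.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["qkv_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction is trainable
```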

## Getting Started
The following instructions will get you a copy of the fine-tuned LLMs on your local machine. The instructions below refer to the CodeGen-based model; follow the same steps for the other two models.


Create and activate a Python virtual environment; then, with the environment active, install the project dependencies from the `requirements.txt` file:
```
pip install -r requirements.txt
```

Next, download the base model and its corresponding tokenizer from Hugging Face:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-multi")
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-350M-multi", device_map='auto')
```

After downloading the base model, run the following script to merge the fine-tuned checkpoint with the base model weights:
```
python merge_model.py \
    --base_model_name_or_path="Salesforce/codegen-350M-multi" \
    --merged_model_name_suffix="codegen-350M" \
    --peft_model_path="./final_checkpoints/nspm-codegen-350M/"
```

The above script saves a separate model checkpoint named `dbpedia/nspm-codegen-350M`.
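
For reference, merging a PEFT adapter into base weights typically amounts to the sketch below (a minimal sketch using the standard `peft` API; the project's actual logic lives in `merge_model.py`, and the paths simply mirror the arguments above):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the base model, attach the fine-tuned LoRA adapter, and fold the
# low-rank updates into the base weights (merge_and_unload), yielding a
# plain transformers checkpoint that no longer needs peft at load time.
base = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-350M-multi")
model = PeftModel.from_pretrained(base, "./final_checkpoints/nspm-codegen-350M/")
merged = model.merge_and_unload()
merged.save_pretrained("dbpedia/nspm-codegen-350M")
```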

Now you can load the fine-tuned checkpoint:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Select a device; the original snippet assumed `device` was already defined.
device = "cuda" if torch.cuda.is_available() else "cpu"

checkpoint = "dbpedia/nspm-codegen-350M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, use_auth_token=True)
model = AutoModelForCausalLM.from_pretrained(checkpoint, use_auth_token=True).to(device)
```

For inference, follow the same format used for fine-tuning data:
```python
prompt = """# write a SPARQL query for:
# what is the population of Italy?
"""

inputs = tokenizer(prompt, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0]))
```
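
The decoded output echoes the prompt followed by the generated query, so a small post-processing step is usually needed before the query can be executed or scored. A minimal sketch (illustrative; the project's actual post-processing lives in `evaluate_model.py`):

```python
def extract_sparql(decoded: str, prompt: str) -> str:
    # Drop the echoed prompt prefix, if present, and strip the
    # end-of-sequence token left by `tokenizer.decode`.
    text = decoded[len(prompt):] if decoded.startswith(prompt) else decoded
    return text.replace(tokenizer.eos_token, "").strip()

sparql_query = extract_sparql(tokenizer.decode(outputs[0]), prompt)
print(sparql_query)
```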


To evaluate the performance of the fine-tuned model, run the `evaluate_model.py` script:
```
python evaluate_model.py \
    --model_name_or_path="dbpedia/nspm-codegen-350M" \
    --data_path="./data/nspm-fine-tuning/test.csv" \
    --metric_name="bleu" \
    --report_file_name="eval_results_codegen350M_multi_nspm" \
    --verbose
```
The script saves an evaluation report that includes the raw and post-processed generated SPARQL queries, along with a BLEU score for each pair of natural language question and ground-truth SPARQL query.
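
A BLEU comparison between a generated query and its ground truth can be reproduced with the Hugging Face `evaluate` library; a minimal sketch (the query strings are illustrative, and the project's report format may differ):

```python
import evaluate

# Compare a generated SPARQL query against its ground-truth reference.
bleu = evaluate.load("bleu")
result = bleu.compute(
    predictions=["select ?x where { dbr:Italy dbo:populationTotal ?x }"],
    references=[["select ?x where { dbr:Italy dbo:populationTotal ?x }"]],
)
print(result["bleu"])  # 1.0 for an exact match
```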

## Acknowledgments
The scripts for fine-tuning and merging PEFT adapters were adapted from the [StarCoder repository](https://github.com/bigcode-project/starcoder).
