Merge pull request #51 from dbpedia/gsoc-mehrzad
GSoC Mehrzad final submission preview
# Question-Answering over DBpedia with Pretrained Auto-regressive Models
Read more about the project and weekly progress updates at https://mehrzadshm.github.io/GSoC-2023-blog/

## Overview
This project addresses the challenge of translating natural language questions into formal SPARQL queries to facilitate efficient information retrieval from the DBpedia knowledge graph. It aims to enable user-friendly, fact-based interaction with open knowledge repositories such as DBpedia. Traditionally, interfacing with knowledge graphs has required a working knowledge of complex formal query languages, a barrier for non-expert users. To lower this barrier, this project builds upon the foundation laid by the [Neural SPARQL Machines (NSpM)](https://github.com/LiberAI/NSpM) project, taking it a step further by leveraging pre-trained Large Language Models (LLMs).

To learn more about the project scope, objectives, and weekly progress, check out the designated [blog page](https://mehrzadshm.github.io/GSoC-2023-blog/).

## Approach
The cornerstone of the methodology is the fine-tuning of LLMs specialized for the code generation task. To compare different architectures and model sizes, three base models were included in the experiments:

| Model Architecture | Model Size | Checkpoint Link |
|--------------------|------------|-----------------|
| CodeGen | 350M | https://huggingface.co/Salesforce/codegen-350M-multi |
| StarCoder | 1B | https://huggingface.co/bigcode/starcoderbase-1b |
| Code Llama | 7B | https://huggingface.co/codellama/CodeLlama-7b-hf |

All fine-tuning attempts were conducted by leveraging [PEFT](https://github.com/huggingface/peft) (Parameter-Efficient Fine-Tuning) and [QLoRA](https://github.com/artidoro/qlora) (Low-Rank Adaptation of "Quantized" Large Language Models).

## Getting Started
The following instructions will get you a copy of the fine-tuned LLMs on your local machine. The instructions below refer to the CodeGen-based model; the same steps apply to the other two models.
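The setup of the virtual environment itself is not spelled out in this README; a minimal sketch, assuming Python 3 and a POSIX shell (the environment name `venv` is an arbitrary choice):

```shell
# create an isolated Python environment for the project
python3 -m venv venv
# activate it for the current shell session
. venv/bin/activate
```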
With the virtual environment active, install the project dependencies using the `requirements.txt` file:
```
pip install -r requirements.txt
```

Next, download the base model and its corresponding tokenizer from Hugging Face:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-multi")
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-350M-multi", device_map="auto")
```

After downloading the base model to your local machine, run the following script to merge the fine-tuned checkpoint with the base model weights:
```
python merge_model.py \
  --base_model_name_or_path="Salesforce/codegen-350M-multi" \
  --merged_model_name_suffix="codegen-350M" \
  --peft_model_path="./final_checkpoints/nspm-codegen-350M/"
```

The above script saves a separate model checkpoint named `dbpedia/nspm-codegen-350M`.

Now you can load the fine-tuned checkpoint:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

checkpoint = "dbpedia/nspm-codegen-350M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
```

For inference, follow the same format used for the fine-tuning data:
```python
prompt = """# write a SPARQL query for:
# what is the population of Italy?
"""

inputs = tokenizer.encode_plus(prompt, return_tensors="pt").to(device)
outputs = model.generate(inputs["input_ids"], max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
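Since causal LMs echo the prompt, the decoded text usually needs light post-processing before the query can be used. `build_prompt` and `extract_query` below are hypothetical helpers (not part of this repository) sketching that step with plain string handling:

```python
def build_prompt(question: str) -> str:
    """Format a question in the comment style used for fine-tuning."""
    return f"# write a SPARQL query for:\n# {question}\n"

def extract_query(decoded: str, prompt: str) -> str:
    """Drop the echoed prompt and keep only the first generated block."""
    body = decoded[len(prompt):] if decoded.startswith(prompt) else decoded
    return body.split("\n\n")[0].strip()

# simulated decoder output: prompt echo, the query, then trailing text
prompt = build_prompt("what is the population of Italy?")
decoded = prompt + "SELECT ?pop WHERE { dbr:Italy dbo:populationTotal ?pop }\n\n# extra"
print(extract_query(decoded, prompt))
# → SELECT ?pop WHERE { dbr:Italy dbo:populationTotal ?pop }
```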
To evaluate the performance of the fine-tuned model, run the `evaluate_model.py` script:
```
python evaluate_model.py \
  --model_name_or_path="dbpedia/nspm-codegen-350M" \
  --data_path="./data/nspm-fine-tuning/test.csv" \
  --metric_name="bleu" \
  --report_file_name="eval_results_codegen350M_multi_nspm" \
  --verbose
```
The script saves an evaluation report that includes the raw and post-processed generated SPARQL queries, along with a BLEU score for each pair of natural language question and ground-truth SPARQL query.
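`evaluate_model.py` itself is not reproduced here; as a rough illustration of the metric, an unsmoothed sentence-level BLEU over token lists can be computed with the standard library alone (a sketch, not the project's actual implementation):

```python
import math
from collections import Counter

def sentence_bleu(candidate, reference, max_n=4):
    """Unsmoothed sentence-level BLEU for two token lists."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
        overlap = sum((cand & ref).values())  # clipped n-gram matches
        total = sum(cand.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, any empty precision zeroes the score
        log_precisions.append(math.log(overlap / total))
    # brevity penalty for candidates shorter than the reference
    bp = 1.0 if len(candidate) >= len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(log_precisions) / max_n)

gold = "SELECT ?x WHERE { dbr:Italy dbo:populationTotal ?x }".split()
print(sentence_bleu(gold, gold))  # a perfect match scores 1.0
```

Note that BLEU rewards surface overlap with the reference query; two syntactically different but semantically equivalent SPARQL queries can still score low.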

### Acknowledgments
The scripts for fine-tuning and merging PEFT adapters were adapted from the [starcoder repo](https://github.com/bigcode-project/starcoder).