
Commit feb1bd1

Merge pull request #51 from dbpedia/gsoc-mehrzad
gsoc Mehrzad final submission preview
2 parents f8d9d0b + cb171a9 commit feb1bd1

27 files changed: +1207169 -1 lines changed

gsoc/mehrzad/README.md

Lines changed: 78 additions & 1 deletion
@@ -1,3 +1,80 @@
# Question-Answering over DBpedia with Pretrained Auto-regressive Models

## Overview
This project addresses the challenge of translating natural language questions into formal SPARQL queries to facilitate efficient information retrieval from the DBpedia knowledge graph. This initiative is pivotal in fostering user-friendly, fact-based interactions supported by open knowledge repositories such as DBpedia. Traditionally, interfacing with knowledge graphs has required users to have a working knowledge of complex formal query languages, a barrier for non-expert users. To ease this process, this project builds upon the foundation laid by the [Neural SPARQL Machines (NSpM)](https://github.com/LiberAI/NSpM) project, taking it a step further by leveraging pre-trained Large Language Models (LLMs).

To learn more about the project scope, objectives and weekly progress, check out the designated [blog page](https://mehrzadshm.github.io/GSoC-2023-blog/).
## Approach
The cornerstone of the methodology is fine-tuning LLMs that are specialized for code generation. To compare different architectures and model sizes, three base models were included in the experiments:

| Model Architecture | Model Size | Checkpoint Link |
|--------------------|------------|-----------------|
| CodeGen | 350M | https://huggingface.co/Salesforce/codegen-350M-multi |
| StarCoder | 1B | https://huggingface.co/bigcode/starcoderbase-1b |
| Code Llama | 7B | https://huggingface.co/codellama/CodeLlama-7b-hf |

All fine-tuning runs were conducted using [PEFT](https://github.com/huggingface/peft) (Parameter-Efficient Fine-Tuning) and [QLoRA](https://github.com/artidoro/qlora) (Low-Rank Adaptation of quantized Large Language Models).
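
A typical QLoRA setup with these libraries looks roughly like the sketch below. The hyperparameters (4-bit NF4 quantization, LoRA rank 16, alpha 32) are illustrative assumptions rather than the exact values used in this project, and running it requires the `bitsandbytes` package and a CUDA GPU.

```python
# Minimal QLoRA-style setup (illustrative hyperparameters, not the project's exact config)
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize the frozen base weights to 4-bit (QLoRA)
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model = AutoModelForCausalLM.from_pretrained(
    "Salesforce/codegen-350M-multi",
    quantization_config=bnb_config,
    device_map="auto",
)
base_model = prepare_model_for_kbit_training(base_model)

lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(base_model, lora_config)  # only the small LoRA adapters are trainable
model.print_trainable_parameters()
```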
## Getting Started
The following instructions will get you a copy of the fine-tuned LLMs on your local machine. The instructions below refer to the CodeGen-based model; follow the same logic for the other two models.

With the virtual environment created and activated, install the project dependencies using the `requirements.txt` file:
```
pip install -r requirements.txt
```
With the dependencies installed, download the base model and its corresponding tokenizer from Hugging Face:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-multi")
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-350M-multi", device_map='auto')
```
After downloading the base model to your local machine, run the following script to merge the fine-tuned checkpoint with the base model weights:
```
python merge_model.py \
    --base_model_name_or_path="Salesforce/codegen-350M-multi" \
    --merged_model_name_suffix="codegen-350M" \
    --peft_model_path="./final_checkpoints/nspm-codegen-350M/"
```
The above script saves a separate model checkpoint named `dbpedia/nspm-codegen-350M`.
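
Under the hood, the merge step amounts to loading the LoRA adapter on top of the base model and folding the adapter weights back into it. A minimal sketch with the `peft` API (not the repo's exact `merge_model.py` code; the paths mirror the command above):

```python
# Sketch of the merge step using peft (not the repo's exact implementation)
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-350M-multi")
peft_model = PeftModel.from_pretrained(base, "./final_checkpoints/nspm-codegen-350M/")
merged = peft_model.merge_and_unload()  # folds the LoRA weights into the base weights
merged.save_pretrained("dbpedia/nspm-codegen-350M")
```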
Now you can load the fine-tuned checkpoint:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "dbpedia/nspm-codegen-350M"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(checkpoint, use_auth_token=True)
model = AutoModelForCausalLM.from_pretrained(checkpoint, use_auth_token=True).to(device)
```
For inference, follow the same format used for fine-tuning data:
```python
prompt = """# write a SPARQL query for:
# what is the population of Italy?
"""

inputs = tokenizer.encode_plus(prompt, return_tensors="pt").to(device)
outputs = model.generate(inputs['input_ids'], max_new_tokens=100)
print(tokenizer.decode(outputs[0]))
```
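
Because the decoded output contains the echoed prompt and the model may keep generating past the first query, a simple post-processing step (a hypothetical sketch continuing from the snippet above) is to strip the prompt and cut the text at the next prompt-style comment:

```python
# Hypothetical post-processing: keep only the first generated SPARQL query
decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
generated = decoded[len(prompt):]  # drop the echoed prompt
sparql_query = generated.split("# write a SPARQL query")[0].strip()
print(sparql_query)
```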
To evaluate the performance of the fine-tuned model, run the `evaluate_model.py` script:
```
python evaluate_model.py \
    --model_name_or_path="dbpedia/nspm-codegen-350M" \
    --data_path="./data/nspm-fine-tuning/test.csv" \
    --metric_name="bleu" \
    --report_file_name="eval_results_codegen350M_multi_nspm" \
    --verbose
```
It saves an evaluation report that includes the raw and post-processed generated SPARQL queries, along with a BLEU score for each pair of natural language question and ground-truth SPARQL query.
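
For reference, a single question/query pair can be scored with the Hugging Face `evaluate` library roughly as follows; the example strings are made up, and the repo's `evaluate_model.py` may compute BLEU differently.

```python
# Illustrative BLEU computation for one pair (not necessarily how evaluate_model.py does it)
import evaluate

bleu = evaluate.load("bleu")
prediction = "SELECT ?pop WHERE { dbr:Italy dbo:populationTotal ?pop }"
reference = "SELECT ?population WHERE { dbr:Italy dbo:populationTotal ?population }"
result = bleu.compute(predictions=[prediction], references=[[reference]])
print(result["bleu"])
```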
## Acknowledgments
The scripts for fine-tuning and merging PEFT adapters were adapted from the [starcoder repo](https://github.com/bigcode-project/starcoder).

Read more about the project and weekly progress updates at https://mehrzadshm.github.io/GSoC-2023-blog/
