You can download llamadatasets
directly using llamaindex-cli
, which comes installed with the llama-index
python package:
llamaindex-cli download-llamadataset {NAME_CAMELCASE} --download-dir ./data
You can then inspect the files at ./data
. When you're ready to load the data into
python, you can use the below snippet of code:
from llama_index.core import SimpleDirectoryReader
from llama_index.core.llama_dataset import LabelledRagDataset
rag_dataset = LabelledRagDataset.from_json("./data/rag_dataset.json")
documents = SimpleDirectoryReader(input_dir="./data/source_files").load_data()
You can download the dataset to a directory, say ./data
directly in Python
as well. From there, you can use the convenient RagEvaluatorPack
llamapack to
run your own LlamaIndex RAG pipeline with the llamadataset
.
from llama_index.core.llama_dataset import download_llama_dataset
from llama_index.core.llama_pack import download_llama_pack
from llama_index.core import VectorStoreIndex
# download and install dependencies for benchmark dataset
rag_dataset, documents = download_llama_dataset("{NAME_CAMELCASE}", "./data")
# build basic RAG system
index = VectorStoreIndex.from_documents(documents=documents)
query_engine = index.as_query_engine()
# evaluate using the RagEvaluatorPack
RagEvaluatorPack = download_llama_pack(
"RagEvaluatorPack", "./rag_evaluator_pack"
)
rag_evaluator_pack = RagEvaluatorPack(
rag_dataset=rag_dataset, query_engine=query_engine
)
############################################################################
# NOTE: If have a lower tier subscription for OpenAI API like Usage Tier 1 #
# then you'll need to use different batch_size and sleep_time_in_seconds. #
# For Usage Tier 1, settings that seemed to work well were batch_size=5, #
# and sleep_time_in_seconds=15 (as of December 2023.) #
############################################################################
benchmark_df = await rag_evaluator_pack.arun(
batch_size=20, # batches the number of openai api calls to make
sleep_time_in_seconds=1, # seconds to sleep before making an api call
)