
Conversation

@Delacrobix
Contributor

No description provided.

@gitnotebooks

gitnotebooks bot commented Sep 18, 2025

Contributor

Can you amend the output to ask the LLM to include sources? This will make it easier for the audience to find the applicable document from the dataset.

Contributor Author

I added citations to the LLM’s responses!
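One common way to get citations from the model is to number the retrieved documents in the context and ask for `[n]` markers in the answer. A minimal sketch of that idea, assuming documents with `title`/`content` keys; the prompt actually used in the commit may differ:

```python
def build_prompt(question, documents):
    """Build a RAG prompt that asks the model to cite numbered sources.

    Hypothetical helper for illustration; not the exact committed code.
    """
    # Number each retrieved document so the model can reference it as [n]
    context = "\n\n".join(
        f"[{i + 1}] {doc['title']}: {doc['content']}"
        for i, doc in enumerate(documents)
    )
    return (
        "Answer the question using only the context below. "
        "Cite the sources you used as [n] markers, and list the "
        "cited document titles at the end of your answer.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```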

Contributor

I'm slightly worried about including an example suggesting that a particular technology (specifically Elasticsearch) is slow, since this is on Elasticsearch Labs, especially since one of the other models suggests it's inefficient. It might be worth amending the transcripts to include a common slowness use case (such as sharding or node issues at high volume) and regenerating the answer. Alternatively, I would change it to different technologies.

Contributor Author

You’re right, I will make sure to avoid mentioning specific names in the datasets. I removed those names and used generic ones; for example, I replaced Elasticsearch with “Database” and Redis with “Cache implementation.” Related commit.



## Stats
✅ Indexed 5 documents in 250ms
Contributor

Why does the indexing differ between models? I would expect indexing to be a one-off operation independent of the model. This doesn't make sense to me. Should it be removed or clarified?

Contributor Author

I’m not sure why it differs between tests; I think it’s better to remove it. I did.

Contributor

I would change this example to something generic such as "Why is the sky blue?" or something else. As a developer this comes across as quite cringy to me.

Contributor Author

Changed. I added a new file showing the output response. Related commit.

from openai import OpenAI

ES_URL = "http://localhost:9200"
ES_API_KEY = "your-api-key-here"
Contributor

I would change the URL, API key and LOCAL_AI_URL values to environment variables that are loaded via something like dotenv and a local .env file. While this is fine for local development, developers will need to tidy it up when they move this to production, so let's set the example now.

Contributor Author

Ok! Added dotenv support! Related commit.
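For reference, the dotenv-based configuration suggested above might look roughly like this (a sketch, not the exact committed code; the variable names mirror the excerpt, and the `LOCAL_AI_URL` default is an assumption):

```python
import os

try:
    # pip install python-dotenv
    from dotenv import load_dotenv
    load_dotenv()  # reads key=value pairs from a local .env file
except ImportError:
    pass  # fall back to plain environment variables

# Environment-driven config with local-development defaults
ES_URL = os.getenv("ES_URL", "http://localhost:9200")
ES_API_KEY = os.getenv("ES_API_KEY")
LOCAL_AI_URL = os.getenv("LOCAL_AI_URL", "http://localhost:8080/v1")
```

A matching `.env` file would then hold the real values and stay out of version control.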

ai_client = OpenAI(base_url=LOCAL_AI_URL, api_key="sk-x")


def build_documents(dataset_folder, index_name):
Contributor

Should this be called load_documents, as it's opening text files? It's not really building the documents from scratch, which is misleading.

Contributor Author

Done! Method renamed!
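After the rename, the function might look roughly like this, reconstructed from the excerpts in this thread (a sketch, not the exact committed code):

```python
import os


def load_documents(dataset_folder, index_name):
    """Load .txt files from dataset_folder as bulk-indexable documents."""
    documents = []
    for filename in os.listdir(dataset_folder):
        if not filename.endswith(".txt"):
            continue
        filepath = os.path.join(dataset_folder, filename)
        # utf-8 so non-ASCII characters in the dataset read correctly
        with open(filepath, "r", encoding="utf-8") as file:
            documents.append({
                "_index": index_name,
                "_source": {"title": filename, "content": file.read()},
            })
    return documents
```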

if filename.endswith(".txt"):
    filepath = os.path.join(dataset_folder, filename)

    with open(filepath, "r", encoding="utf-8") as file:
Contributor

I would add a comment explaining why you've used utf-8 encoding here.

Contributor Author

Comment added!

}


def index_documents():
Contributor

Add top level comments for each function explaining what they do.

Contributor Author

Added function descriptions! Related commit.
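A sketch of the kind of top-level description added per function; the wording in the actual commit may differ, and the body is omitted here:

```python
def index_documents():
    """Read the .txt dataset from disk and bulk-index it into
    Elasticsearch, timing the operation.

    Hypothetical docstring sketch for illustration only.
    """
    pass  # body omitted in this sketch
```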

start_time = time.time()

try:
    response = ai_client.chat.completions.create(
Contributor

I would perhaps add a comment making clear that this is a simple generate rather than streaming of the response token by token.

Contributor Author

Added a clarifying comment!
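The distinction being clarified is roughly this (a sketch assuming the OpenAI-compatible client from the excerpt; the helper name `generate_answer` is hypothetical):

```python
def generate_answer(ai_client, model_name, prompt):
    # Simple (non-streaming) generation: the call blocks until the model
    # has produced the whole answer, then returns it in one response
    # object. Passing stream=True would instead yield chunks to iterate
    # over token by token.
    response = ai_client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```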

try:
    start_time = time.time()

    success, _ = helpers.bulk(
Contributor

You should add the index creation code either here based on the condition that the index doesn't exist, or in a separate utility function. For semantic text you'll need to specify that mapping when creating the index, and that step is missing here.

Contributor Author

I added the index creation as a method in the script. I also added a verification step before bulk-indexing the data.
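The checked creation step might look something like this, assuming the 8.x Elasticsearch Python client's `indices.exists`/`indices.create` API; the helper name and field names are illustrative, not the exact committed code:

```python
def create_index_if_missing(es_client, index_name):
    """Create the index with a semantic_text mapping if it doesn't exist."""
    # Verification step: skip creation when the index is already there
    if es_client.indices.exists(index=index_name):
        return
    # semantic_text fields must be declared in the mapping at creation
    # time so Elasticsearch generates embeddings when documents are indexed
    es_client.indices.create(
        index=index_name,
        mappings={
            "properties": {
                "title": {"type": "text"},
                "content": {"type": "semantic_text"},
            }
        },
    )
```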
