### Description

### Is this a new bug?

- [X] I believe this is a new bug
- [X] I have searched the existing issues, and I could not find an existing issue for this bug
### Current Behavior

```
 49%|████████████████████████████████████████▍ | 4935/10000 [05:52<02:22, 35.66it/s]E0807 11:33:40.238867000 140704491832896 ssl_transport_security_utils.cc:105] Corruption detected.
E0807 11:33:40.238921000 140704491832896 ssl_transport_security_utils.cc:61] error:100003fc:SSL routines:OPENSSL_internal:SSLV3_ALERT_BAD_RECORD_MAC
E0807 11:33:40.238930000 140704491832896 secure_endpoint.cc:305] Decryption error: TSI_DATA_CORRUPTED
 49%|████████████████████████████████████████▍ | 4937/10000 [05:54<06:03, 13.95it/s]
```
This happens when the following code runs:

```python
for i, record in enumerate(tqdm(data)):
    # first get metadata fields for this record
    metadata = {
        'wiki-id': str(record['id']),
        'source': record['url'],
        'title': record['title']
    }
    # now we create chunks from the record text
    record_texts = text_splitter.split_text(record['text'])
    # create individual metadata dicts for each chunk
    record_metadatas = [{
        "chunk": j, "text": text, **metadata
    } for j, text in enumerate(record_texts)]
    # append these to current batches
    texts.extend(record_texts)
    metadatas.extend(record_metadatas)
    # if we have reached the batch_limit we can add texts
    if len(texts) >= batch_limit:
        ids = [str(uuid4()) for _ in range(len(texts))]
        embeds = embed.embed_documents(texts)
        index.upsert(vectors=zip(ids, embeds, metadatas))
        texts = []
        metadatas = []
```
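The accumulate-and-flush pattern in that loop can be sketched as a standalone helper (the function name `upsert_in_batches` and the `upsert_fn` callable are my own illustration, not part of the tutorial or the pinecone client; `upsert_fn` stands in for the embed-and-upsert step). Note that, as written in the snippet above, any leftover records smaller than `batch_limit` still need a final flush after the loop:

```python
from uuid import uuid4

def upsert_in_batches(items, batch_limit, upsert_fn):
    """Accumulate (text, metadata) pairs and flush them through
    upsert_fn(ids, texts, metadatas) whenever batch_limit is reached;
    flush the remaining partial batch at the end."""
    texts, metadatas = [], []
    for text, metadata in items:
        texts.append(text)
        metadatas.append(metadata)
        if len(texts) >= batch_limit:
            ids = [str(uuid4()) for _ in texts]
            upsert_fn(ids, texts, metadatas)
            texts, metadatas = [], []
    if texts:  # don't silently drop the final partial batch
        ids = [str(uuid4()) for _ in texts]
        upsert_fn(ids, texts, metadatas)
```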
The reference example that runs into this issue is https://www.pinecone.io/learn/series/langchain/langchain-retrieval-augmentation/.
The implementation uses Python 3.8 on a macOS (Intel) box.
### Expected Behavior

The indexing process should iterate through the data we'd like to add to our knowledge base, creating IDs, embeddings, and metadata in batches, then adding these to the index.

This is from: https://www.pinecone.io/learn/series/langchain/langchain-retrieval-augmentation/
### Steps To Reproduce

- Activate a conda env using Python 3.8 (to be compatible with tiktoken).
- Run the code in a Jupyter notebook.
- The error occurs when this part of the code runs inside the for-loop:

```python
if len(texts) >= batch_limit:
    ids = [str(uuid4()) for _ in range(len(texts))]
    embeds = embed.embed_documents(texts)
    index.upsert(vectors=zip(ids, embeds, metadatas))
    texts = []
    metadatas = []
```
The error is around this part, I think, but I might be wrong:

```
PineconeException                          Traceback (most recent call last)
Cell In[28], line 21
     19 ids = [str(uuid4()) for _ in range(len(texts))]
     20 embeds = embed.embed_documents(texts)
---> 21 index.upsert(vectors=zip(ids, embeds, metadatas))
     22 texts = []
     23 metadatas = []
```
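Since the failure surfaces as transient TLS-level corruption (`TSI_DATA_CORRUPTED`, possibly VPN-related), one workaround worth trying is wrapping the `index.upsert(...)` call in a retry with backoff, and/or reducing `batch_limit`. A minimal sketch (the helper name and defaults are my own, not part of the pinecone client):

```python
import time

def upsert_with_retry(do_upsert, max_attempts=3, backoff=2.0):
    """Retry a flaky zero-argument callable (e.g. a lambda wrapping
    index.upsert) with exponential backoff; re-raise on the last attempt."""
    for attempt in range(max_attempts):
        try:
            return do_upsert()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(backoff * (2 ** attempt))
```

Usage would look like `upsert_with_retry(lambda: index.upsert(vectors=list(zip(ids, embeds, metadatas))))`; materializing the `zip` into a list also matters here, since a generator would be exhausted after the first failed attempt.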
### Relevant log output

No response
### Environment

- **macOS**: Ventura 13.4.1 (c) (22F770820d)
- **LangChain version**: `langchain 0.0.162`
- **Pinecone client version**: `pinecone-client 2.2.2`
### Additional Context

I am doing this while connected to a VPN.