
[Bug] Indexing problem with langchain retrieval augmentation  #250

Open
@abitofalchemy

Description


Is this a new bug?

  • I believe this is a new bug
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

```
 49%|████████████████████████████████████████▍                                         | 4935/10000 [05:52<02:22, 35.66it/s]E0807 11:33:40.238867000 140704491832896 ssl_transport_security_utils.cc:105] Corruption detected.
E0807 11:33:40.238921000 140704491832896 ssl_transport_security_utils.cc:61] error:100003fc:SSL routines:OPENSSL_internal:SSLV3_ALERT_BAD_RECORD_MAC
E0807 11:33:40.238930000 140704491832896 secure_endpoint.cc:305]       Decryption error: TSI_DATA_CORRUPTED
 49%|████████████████████████████████████████▍                                         | 4937/10000 [05:54<06:03, 13.95it/s]
```

When the following code runs:

```python
for i, record in enumerate(tqdm(data)):
    # first get metadata fields for this record
    metadata = {
        'wiki-id': str(record['id']),
        'source': record['url'],
        'title': record['title']
    }
    # now we create chunks from the record text
    record_texts = text_splitter.split_text(record['text'])
    # create individual metadata dicts for each chunk
    record_metadatas = [{
        "chunk": j, "text": text, **metadata
    } for j, text in enumerate(record_texts)]
    # append these to current batches
    texts.extend(record_texts)
    metadatas.extend(record_metadatas)
    # if we have reached the batch_limit we can add texts
    if len(texts) >= batch_limit:
        ids = [str(uuid4()) for _ in range(len(texts))]
        embeds = embed.embed_documents(texts)
        index.upsert(vectors=zip(ids, embeds, metadatas))
        texts = []
        metadatas = []
```
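Unrelated to the crash itself, but worth noting for anyone copying the loop: as written, any chunks left in `texts` after the last iteration are never upserted, since the final partial batch can be smaller than `batch_limit`. The reference tutorial flushes the remainder after the loop. A minimal, self-contained sketch of the batching pattern with that final flush (using a stub in place of `index.upsert` and plain strings in place of embeddings):

```python
from uuid import uuid4

batch_limit = 3
texts, metadatas = [], []
upserted = []  # records what would have been sent to the index

def upsert_stub(vectors):
    # stand-in for index.upsert(vectors=...) so the sketch runs offline
    upserted.extend(vectors)

# 7 chunks with a batch limit of 3 leaves a partial batch of 1 at the end
chunks = [f"chunk-{i}" for i in range(7)]

for text in chunks:
    texts.append(text)
    metadatas.append({"text": text})
    if len(texts) >= batch_limit:
        ids = [str(uuid4()) for _ in texts]
        upsert_stub(list(zip(ids, texts, metadatas)))
        texts, metadatas = [], []

# flush the final partial batch -- without this, the last
# len(chunks) % batch_limit chunks would silently never be indexed
if texts:
    ids = [str(uuid4()) for _ in texts]
    upsert_stub(list(zip(ids, texts, metadatas)))
    texts, metadatas = [], []

print(len(upserted))  # all 7 chunks accounted for
```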

The reference example that is running into this issue is https://www.pinecone.io/learn/series/langchain/langchain-retrieval-augmentation/.

The implementation uses Python 3.8 on a macOS (Intel) box.

Expected Behavior

The indexing process should iterate through the data we'd like to add to our knowledge base, creating IDs, embeddings, and metadata, then adding these to the index in batches.

This expected behavior is described in: https://www.pinecone.io/learn/series/langchain/langchain-retrieval-augmentation/

Steps To Reproduce

  1. Activate a conda env using Python 3.8 (to be compatible with tiktoken).
  2. Run the reference notebook in Jupyter.
  3. The error occurs when this part of the code is in the for-loop:
```python
if len(texts) >= batch_limit:
    ids = [str(uuid4()) for _ in range(len(texts))]
    embeds = embed.embed_documents(texts)
    index.upsert(vectors=zip(ids, embeds, metadatas))
    texts = []
    metadatas = []
```

The error is around this, I think ... but I might be wrong:

```
PineconeException                         Traceback (most recent call last)

Cell In[28], line 21
     19 ids = [str(uuid4()) for _ in range(len(texts))]
     20 embeds = embed.embed_documents(texts)
---> 21 index.upsert(vectors=zip(ids, embeds, metadatas))
     22 texts = []
     23 metadatas = []
```

Relevant log output

No response

Environment

- **macOS**: Ventura 13.4.1 (c) (22F770820d)
- **LangChain version**: `langchain                 0.0.162`
- **Pinecone client version**: `pinecone-client           2.2.2`

Additional Context

I am doing this while connected to a VPN.
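The VPN may well be relevant here: `SSLV3_ALERT_BAD_RECORD_MAC` / `TSI_DATA_CORRUPTED` are transport-level gRPC/TLS failures, which can be triggered by middleboxes interfering with the connection rather than by anything in the indexing code. While waiting on a fix, one workaround is to retry failed upserts with backoff. The `retry_upsert` helper below is hypothetical (not part of the pinecone client); `upsert_fn` stands in for `index.upsert`, and the demo uses a stub that fails twice before succeeding:

```python
import time

def retry_upsert(upsert_fn, vectors, retries=3, backoff=1.0):
    """Retry a flaky upsert call with exponential backoff.

    Hypothetical helper, not part of the pinecone client.
    `upsert_fn` stands in for index.upsert; `vectors` should be a
    materialised list of (id, embedding, metadata) tuples so it can
    safely be re-sent on retry (a zip() generator is exhausted after
    the first attempt).
    """
    for attempt in range(retries):
        try:
            return upsert_fn(vectors=vectors)
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries, surface the original error
            time.sleep(backoff * (2 ** attempt))

# demo: a stub that raises twice, then succeeds on the third call
calls = {"n": 0}
def flaky_upsert(vectors):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("TSI_DATA_CORRUPTED")
    return len(vectors)

result = retry_upsert(flaky_upsert, [("id-0", [0.1], {})], backoff=0.01)
print(result)  # 1 vector upserted after 3 attempts
```

Note the comment about materialising the `zip(...)` before retrying: re-sending an already-consumed generator would silently upsert nothing on the second attempt.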
