[Question]: Why status for full_docs and chunks storage in OracleDB #1008

Open

lselch opened this issue Mar 5, 2025 · 0 comments
Labels
question Further information is requested

Comments

lselch commented Mar 5, 2025

Do you need to ask a question?

  • I have searched the existing questions and discussions, and this question is not already answered.
  • I believe this is a legitimate question, not just a bug or feature request.

Your Question

What is the purpose of storing the document status in the LIGHTRAG_DOC_FULL and LIGHTRAG_DOC_CHUNKS tables in the Oracle implementation? I haven't seen a status column in any of the other storage implementations, only in the dedicated document status storage, which is currently not implemented for Oracle. When I try to run the graph indexing, I get a KeyError in the upsert method of OracleKVStorage, because the document status is not provided here:

async def upsert(self, data: dict[str, dict[str, Any]]) -> None:
    logger.info(f"Inserting {len(data)} to {self.namespace}")
    if not data:
        return

    if is_namespace(self.namespace, NameSpace.KV_STORE_TEXT_CHUNKS):
        list_data = [
            {
                "id": k,
                **{k1: v1 for k1, v1 in v.items()},
            }
            for k, v in data.items()
        ]
        contents = [v["content"] for v in data.values()]
        batches = [
            contents[i : i + self._max_batch_size]
            for i in range(0, len(contents), self._max_batch_size)
        ]
        embeddings_list = await asyncio.gather(
            *[self.embedding_func(batch) for batch in batches]
        )
        embeddings = np.concatenate(embeddings_list)
        for i, d in enumerate(list_data):
            d["__vector__"] = str(embeddings[i].tolist())

        merge_sql = SQL_TEMPLATES["merge_chunk"]
        for item in list_data:
            _data = {
                "id": item["id"],
                "content": item["content"],
                "workspace": self.db.workspace,
                "tokens": item["tokens"],
                "chunk_order_index": item["chunk_order_index"],
                "full_doc_id": item["full_doc_id"],
                "content_vector": item["__vector__"],
                "status": item["status"],  # KeyError here: the chunk dicts passed in carry no "status" key
            }
            await self.db.execute(merge_sql, _data)
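
For reference, here is a minimal sketch of one possible workaround, assuming the chunk dicts built upstream never carry a "status" key: fall back to a default instead of indexing the key directly. The helper name and the default value are hypothetical, not part of LightRAG:

DEFAULT_CHUNK_STATUS = "processed"  # assumed default; would need to match whatever values the column expects

def build_merge_params(item: dict, workspace: str) -> dict:
    """Build the bind-parameter dict for the merge_chunk SQL, tolerating a missing "status" key."""
    return {
        "id": item["id"],
        "content": item["content"],
        "workspace": workspace,
        "tokens": item["tokens"],
        "chunk_order_index": item["chunk_order_index"],
        "full_doc_id": item["full_doc_id"],
        "content_vector": item["__vector__"],
        # .get() avoids the KeyError raised when upstream code never sets "status"
        "status": item.get("status", DEFAULT_CHUNK_STATUS),
    }

Whether defaulting like this is correct, or whether the status column should instead be dropped from these two tables to match the other storage backends, is exactly what this issue asks the maintainers to clarify.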

Additional Context

No response

@lselch lselch added the question Further information is requested label Mar 5, 2025