From 4e8263efbc593115ff055d25a5b6df096d1a12fb Mon Sep 17 00:00:00 2001
From: Kyle Knapp
Date: Tue, 9 Sep 2025 13:44:37 -0700
Subject: [PATCH 1/3] Add document loader proposal

The proposal outlines the design and implementation of Azure Blob
Storage document loaders for the langchain-azure-storage package. These
document loaders are intended to replace the current document loaders
in the langchain-community package.
---
 .../proposals/document_loaders.md | 730 ++++++++++++++++++
 1 file changed, 730 insertions(+)
 create mode 100644 libs/azure-storage/proposals/document_loaders.md

diff --git a/libs/azure-storage/proposals/document_loaders.md b/libs/azure-storage/proposals/document_loaders.md
new file mode 100644
index 00000000..6f55b736
--- /dev/null
+++ b/libs/azure-storage/proposals/document_loaders.md
@@ -0,0 +1,730 @@
# Azure Storage document loaders

| Proposal    | Metadata          |
|-------------|-------------------|
| **Author**  | Kyle Knapp        |
| **Status**  | Proposed          |
| **Created** | 12-September-2025 |

## Abstract
This proposal outlines the design and implementation of Azure Blob Storage document loaders. These
document loaders will be part of the [`langchain-azure-storage`][langchain-azure-storage-pkg] package and
provide first-party support for loading LangChain [`Document`][langchain-document-ref] objects from either
a container or a specific blob in Azure Blob Storage. These document loaders aim to replace the existing
community-sourced Azure Blob Storage document loaders, which can be improved in several areas and are not
directly maintained by Azure Storage.

## Background and motivation

### What are document loaders?
[Document loaders][langchain-document-loader-concept] load data from a variety of sources (e.g., local
filesystem, cloud storage, etc.) and parse the data (e.g., from JSON, PDF, DOCX, etc.) to human-readable
text. Typically, the loaded text is then stored along with an [embedding][langchain-embedding-concept]
representation in a [vector store][langchain-vector-store-concept] for use in [RAG][langchain-rag-concept].
Using [semantic search][wiki-semantic-search], an LLM can then retrieve information from sources outside
its training data to generate responses.


Below shows a basic example of how document loaders fit into a LangChain workflow. Outside the
runtime of an AI application, document loaders are utilized to seed a vector store with
documents:

```python
import os

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_azure_ai.embeddings import AzureAIEmbeddingsModel
from langchain_core.vectorstores import InMemoryVectorStore

# Load a local PDF using a PDF document loader. This will return
# the text of each page in the PDF as `Document` objects.
documents = PyPDFLoader('azure-storage-user-guide.pdf').load()

# Split the documents by character chunks to create smaller documents
# to be stored in the vector store. This allows for retrieving smaller,
# more relevant chunks of text to use as context for the LLM.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
chunked_documents = text_splitter.split_documents(documents)

# Configure an embedding model to embed each chunk of text.
embed_model = AzureAIEmbeddingsModel(
    endpoint=os.environ["AZURE_INFERENCE_ENDPOINT"],
    credential=os.environ["AZURE_INFERENCE_CREDENTIAL"],
    model="text-embedding-3-large",
)
# Create a vector store to store chunks and their embeddings. Using an in-memory
# vector store here for example simplicity, but in production a more robust vector store
# (e.g., Azure AI Search, etc.) should be used.
db = InMemoryVectorStore(embed_model)
db.add_documents(chunked_documents)
```
Then, within the runtime of the AI application, the vector store can be queried to retrieve
relevant documents to use as context for the LLM:

```python
# At AI application runtime, the vector store will be connected to and accessed in its
# retriever mode to be able to perform semantic search on documents stored in the vector store.
db = get_vector_store()
retriever = db.as_retriever()

# Retrieves original PDF chunks that are relevant to the query as `Document` objects.
docs = retriever.invoke("What are the different types of blobs in Azure Blob Storage?")
```

All document loader implementations follow the [`BaseLoader`][langchain-document-loader-base-ref] interface, whose methods
return [`Document`][langchain-document-ref] objects representing the parsed text of a loaded document. `Document` objects
are also widely expected in downstream LangChain RAG components such as [text splitters][langchain-text-splitter-concept] and
[vector stores][langchain-vector-store-concept].

Today, there is a wide variety of [community-maintained document loader][langchain-document-loader-integrations] implementations
available for LangChain. In general, each document loader implementation:

* Typically loads data from only a single logical source (e.g., a local directory, a cloud storage bucket, a social media platform, etc.). For example,
  there is a [`DirectoryLoader`][langchain-directory-loader] that loads files from a local directory and an [`S3DirectoryLoader`][langchain-s3-directory-loader]
  that loads files from an S3 bucket, but there is no superset loader implementation that loads files from both a local directory and an S3 bucket.
* Supports parsing data for 1 to N different file types (e.g., `.pdf`, `.docx`, `.txt`, etc.). For example, the [`JSONLoader`][langchain-json-loader] supports only
  parsing JSON content to text, but the [`UnstructuredLoader`][langchain-unstructured-loader] supports a variety of formats (e.g., `.txt`, `.pdf`, `.docx`, etc.).
* May provide built-in chunking of documents into smaller documents to be stored in a vector store. If supported, chunking behavior is configurable
  (e.g., whether chunking is enabled, how to chunk, etc.).

While document loaders are technically not needed to store text in a vector store (i.e., an individual could load and parse documents themselves and then insert
the text directly into a vector store), document loaders provide value in that they:

* Abstract away the details of loading and parsing documents from various sources and formats. Developers just need to provide configuration to the document loader
  (e.g., path to directory, connection details to cloud storage, etc.) and the document loader handles the logic needed to extract a resource's text.
* Provide a consistent interface for loading documents. This makes it easy to load documents across a variety of different sources and formats
  and use them across other LangChain components (e.g., text splitters, vector stores, etc.) without needing custom adapters, as the sketch below illustrates.
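To make the shared `BaseLoader` contract concrete, below is a minimal, hypothetical loader for local text files. It is a sketch, not part of the proposal: the class name and file handling are illustrative. Implementing `lazy_load()` is sufficient, since the base class derives `load()` from it:

```python
from typing import Iterator

from langchain_core.document_loaders import BaseLoader
from langchain_core.documents import Document


class TextFileLoader(BaseLoader):
    """Hypothetical loader that yields one Document per local text file."""

    def __init__(self, file_paths: list[str]):
        self._file_paths = file_paths

    def lazy_load(self) -> Iterator[Document]:
        # Yielding one document at a time keeps memory usage flat;
        # BaseLoader's default load() simply materializes this iterator.
        for path in self._file_paths:
            with open(path, encoding="utf-8") as f:
                yield Document(page_content=f.read(), metadata={"source": path})
```

Any downstream component that consumes `Document` objects (text splitters, vector stores, etc.) can use this loader the same way it would use any other, which is the interchangeability the bullets above describe.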

### How do document loaders fit with Azure Storage?
Azure Blob Storage is heavily used for storing unstructured data, whether it
be documents, CSVs, logs, etc. This sort of data is commonly used in RAG scenarios to
provide context to LLMs (e.g., searching company documents to answer questions about company policies).
Having purpose-built Azure Blob Storage document loaders allows
AI application developers to easily retrieve these documents from Azure Blob Storage,
augment them with embeddings, and store them in a vector store of their choice for use in RAG scenarios.

While for larger AI applications a developer would likely opt for a full-fledged embedding ingestion pipeline (e.g., one that includes scheduled/event-driven re-embeddings,
content enrichment stages, data lineage, etc.), document loaders provide the opportunity to quickly prototype and get up and running with a RAG application.


### Community Azure Storage document loaders
Prior to this proposal, two community-built Azure Blob Storage document loaders already existed:

* [`AzureBlobStorageFileLoader`][community-azure-blob-storage-file] - Loads `Document` objects from a single blob from Azure Blob Storage.
* [`AzureBlobStorageContainerLoader`][community-azure-blob-storage-container] - Loads `Document` objects from all blobs from a container in Azure Blob Storage.

These document loaders were contributed to LangChain as part of this [pull request][community-pr] to satisfy a [feature request][community-feature-request] asking
for Azure Blob Storage document loaders to match the already-offered S3 and GCS document loaders. Today, these document loaders reside in
the [langchain-community repository][community-repository], which houses the de facto package for third-party LangChain integrations that do not
have an owner outside the LangChain core team.


#### Areas of improvement with current community document loaders

With the current community document loaders, there are several opportunities for
improvement:

* There is no active owner or maintainer of these document loaders. Recently, Azure
created the [`langchain-azure`][langchain-azure-repo] repository to host and take
first-party ownership of Azure-related LangChain integrations. Pulling these
document loaders into this repository would allow Azure Storage to provide
first-party support and maintenance. This approach also aligns
with the direction from the LangChain core team to push community integrations to dedicated
`langchain-*` packages for individual integrations instead of continuing to maintain them in the
`langchain-community` package.
* The document loaders do not implement [`lazy_load()`][langchain-lazy-load-ref]. This
method allows customers to lazily iterate through all documents without loading them all into memory at once. This is especially
important in the context of Azure Blob Storage, where billions of documents totaling TiBs of data could be stored.
* The document loaders only support connection strings for authentication. This
prevents customers from following best practices around using Microsoft Entra ID
and managed identities for authentication and authorization.
* By default, the loaders use the [`UnstructuredLoader`][langchain-unstructured-loader] for parsing documents. Unstructured provides
wide coverage for parsing different file formats. However, this decision forces Azure Storage customers to rely on Unstructured
and any of its underlying dependencies.
Furthermore, there are no options to customize how to load and parse blobs that are
downloaded from Azure Blob Storage (e.g., using a different parser).
* The loaders use the default async implementations of `aload()` and `alazy_load()`, which just
run the synchronous implementations in an executor. While this prevents blocking the
event loop, the Azure Storage SDK offers asynchronous interfaces that could further improve performance when accessing document loaders asynchronously.

See [Appendix A](#appendix-a-reported-customer-issues-with-community-azure-storage-document-loaders) for a list of
reported customer issues with the existing community Azure Storage document loaders that echo the areas of improvement listed above.

### Goals
Based on the background, the goals of this proposal and design are to:
* Port the existing community Azure Blob Storage document loaders into the `langchain-azure` repository so that they can receive first-party support and maintenance.
* Address the areas of improvement listed [above](#areas-of-improvement-with-current-community-document-loaders) as part of the implementation port.
* Minimize interface changes from the existing community document loaders to only those that are necessary. Ideally, changes needed to migrate away from the current
community document loaders should be minimal, requiring at most changes to import statements and constructor call patterns.
* Prioritize ease of use and sensible defaults. Ideally, customers should only need
to provide minimal configuration (e.g., the resource URL) to get up and running
quickly. It should not require in-depth knowledge of either Azure Blob
Storage or the SDKs in order to use the loaders.

### Non-goals
Below are non-goals for this proposal:
* Adding Azure Storage integrations for LangChain components outside of
document loaders (e.g., [blob loaders][langchain-blob-loader-ref] and
[memory stores][langgraph-memory-store]). This is intended to keep the scope of
the proposal narrow and manageable. These additional integration opportunities should be researched and considered in future proposals.
* Maximizing available network throughput to the environment. The implementation should
still use the SDK effectively to download blobs quickly, but this first iteration
should not target additional optimization layers to further boost download
throughput (e.g., eagerly downloading batches of blobs in parallel), especially when
blobs are expected to be lazily loaded one at a time to be chunked, embedded,
and stored in a vector store. Based on feedback from the initial implementation,
we can explore speed optimizations in future iterations.


## Specification
Below is the proposed specification for the Azure Blob Storage document loaders.

### Public interfaces
All Azure Storage document loaders will live in the [`langchain_azure_storage` package][langchain-azure-storage-pkg]
under a new `document_loaders` module.

There will be two document loaders introduced:

* `AzureBlobStorageFileLoader` - Loads a `Document` from a single blob in Azure Blob Storage.
* `AzureBlobStorageContainerLoader` - Loads `Document` objects from all blobs in a given container in Azure Blob Storage.
Assuming no chunking is happening, each `Document` loaded will correspond 1:1 with a blob in the container.

Each document loader will subclass from [`BaseLoader`][langchain-document-loader-base-ref] and support both synchronous
and asynchronous loading of documents, as well as lazy loading of documents.
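For context on how these methods relate, the sketch below shows roughly how `BaseLoader` derives the eager methods from the lazy ones (simplified from langchain-core's default behavior; the class here is illustrative, not the proposed implementation):

```python
from typing import AsyncIterator, Iterator, List

from langchain_core.documents import Document


class LoaderContractSketch:
    """Illustrative only; mirrors how BaseLoader's defaults compose."""

    def lazy_load(self) -> Iterator[Document]:
        # What a concrete loader implements: yield one Document at a time
        # so callers never need the whole corpus in memory.
        raise NotImplementedError

    async def alazy_load(self) -> AsyncIterator[Document]:
        # Async twin; real loaders implement this as an async generator.
        raise NotImplementedError
        yield  # makes this function an async generator

    def load(self) -> List[Document]:
        # Eager variant simply materializes the lazy iterator.
        return list(self.lazy_load())

    async def aload(self) -> List[Document]:
        # Async eager variant drains the async iterator.
        return [doc async for doc in self.alazy_load()]
```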

Below shows the proposed constructor signatures for each document loader:

```python
from typing import Optional, Union, Callable
import azure.core.credentials
import azure.core.credentials_async
from langchain_core.document_loaders import BaseLoader


class AzureBlobStorageFileLoader(BaseLoader):
    def __init__(self,
                 account_url: str,
                 container_name: str,
                 blob_name: str,
                 *,
                 credential: Optional[
                     Union[
                         azure.core.credentials.AzureSasCredential,
                         azure.core.credentials.TokenCredential,
                         azure.core.credentials_async.AsyncTokenCredential,
                     ]
                 ] = None,
                 loader_factory: Optional[Callable[[str], BaseLoader]] = None,
        ): ...


class AzureBlobStorageContainerLoader(BaseLoader):
    def __init__(self,
                 account_url: str,
                 container_name: str,
                 *,
                 prefix: str = "",
                 credential: Optional[
                     Union[
                         azure.core.credentials.AzureSasCredential,
                         azure.core.credentials.TokenCredential,
                         azure.core.credentials_async.AsyncTokenCredential,
                     ]
                 ] = None,
                 loader_factory: Optional[Callable[[str], BaseLoader]] = None,
        ): ...
```

In terms of parameters supported:
* `account_url` - The URL to the storage account (e.g., `https://<account-name>.blob.core.windows.net`)
* `container_name` - The name of the container within the storage account
* `blob_name` - (File loader only) The name of the blob within the container to load.
* `credential` - The credential object to use for authentication. If not provided,
the loader will use [Azure default credentials][azure-default-credentials]. The
`credential` field only supports token-based credentials and SAS credentials. It does
not support access key-based credentials or anonymous access.
* `prefix` - (Container loader only) An optional prefix to filter blobs within the container.
Only blobs whose names start with the specified prefix will be loaded.
* `loader_factory` - A callable that returns a custom document loader (e.g., `UnstructuredLoader`) to use
for parsing blobs downloaded. When provided, the Azure Storage document loader will download each blob to
a temporary local file and then call `loader_factory` with the path to the temporary file to get a document
loader for loading and parsing the local file into `Document` objects. If `loader_factory` is not provided,
the loader will return the content as-is in a single `Document` object for each blob. The blob content will be
treated as UTF-8-encoded text in this default case.


### Usage examples
Below are some example usage patterns for the Azure Blob Storage document loaders.

#### Load from a blob
Below shows how to load a document from a single blob in Azure Blob Storage:

```python
from langchain_azure_storage.document_loaders import AzureBlobStorageFileLoader

loader = AzureBlobStorageFileLoader("https://<account-name>.blob.core.windows.net", "<container-name>", "<blob-name>")
for doc in loader.lazy_load():
    print(doc.page_content)  # Prints content of blob. There should only be one document loaded.
```

#### Load from a container

Below shows how to load documents from all blobs in a given container in Azure Blob Storage:

```python
from langchain_azure_storage.document_loaders import AzureBlobStorageContainerLoader

loader = AzureBlobStorageContainerLoader("https://<account-name>.blob.core.windows.net", "<container-name>")
for doc in loader.lazy_load():
    print(doc.page_content)  # Prints content of each blob in the container.
+``` + +Below shows how to load documents from blobs in a container with a given prefix: + +```python +from langchain_azure_storage.document_loaders import AzureBlobStorageContainerLoader +loader = AzureBlobStorageContainerLoader( + "https://.blob.core.windows.net", "", prefix="some/prefix/" +) +for doc in loader.lazy_load(): + print(doc.page_content) # Prints content of each blob whose name starts with "some/prefix/" . +``` + +#### Load asynchronously +Below shows how to load documents asynchronously. This is acheived by calling the `aload()` or `alazy_load()` methods on the document loader. For example: + +```python +from langchain_azure_storage.document_loaders import AzureBlobStorageContainerLoader + + +async def main(): + loader = AzureBlobStorageContainerLoader("https://.blob.core.windows.net", "") + async for doc in loader.alazy_load(): + print(doc.page_content) # Prints content of each blob in the container. +``` + +#### Override credentials +Below shows how to override the default credentials used by the document loader: + +```python +from azure.core.credentials import AzureSasCredential +from azure.idenity import ManagedIdentityCredential + +# Override with SAS token +loader = AzureBlobStorageContainerLoader( + "https://.blob.core.windows.net", + "", + credential=AzureSasCredential("") +) + + +# Override with more specific token credential than the entire +# default credential chain (e.g., system-assigned managed identity) +loader = AzureBlobStorageContainerLoader( + "https://.blob.core.windows.net", + "", + credential=ManagedIdentityCredential() +) +``` + +#### Override loader +Below shows how to override the default document loader used to parse downloaded blobs. + +In the below example, the blobs are downloaded to a temporary local file and uses +the `UnstructuredLoader` to parse the local file and return `Document` objects +on behalf of the Azure Storage document loader: + +```python +from langchain_azure_storage.document_loaders import AzureBlobStorageContainerLoader +from langchain_unstructured import UnstructuredLoader + +loader = AzureBlobStorageContainerLoader( + "https://.blob.core.windows.net", + "", + # The UnstructuredLoader class accepts a string to the local file path to its constructor, + # so the class can be provided directly as the loader_factory. + loader_factory=UnstructuredLoader +) +for doc in loader.lazy_load(): + # Document returned are generated directly from UnstructuredLoader and + # are not just the unmodified blob content. + print(doc.page_content) +``` + +If a customer wants to provide additional configuration to the document loader, they can +define a callable that returns an instantiated document loader. For example, to provide +custom configuration to the `UnstructuredLoader`: +```python +from langchain_azure_storage.document_loaders import AzureBlobStorageContainerLoader +from langchain_unstructured import UnstructuredLoader + + +def loader_factory(file_path: str) -> UnstructuredLoader: + return UnstructuredLoader( + file_path, + mode="by_title", # Custom configuration + strategy="fast", # Custom configuration + ) + + +loader = AzureBlobStorageContainerLoader( + "https://.blob.core.windows.net", "", + loader_factory=loader_factory +) +``` + + +### Migration details + +In migrating from the existing community document loaders to the new Azure Storage document loaders, +customers will need to perform the following changes: + +1. Depend on the `langchain-azure-storage` package instead of `langchain-community`. +2. 
   `langchain_azure_storage.document_loaders`.
3. Update document loader constructor calls to:
   1. Use an account URL instead of a connection string.
   2. Specify `UnstructuredLoader` as the `loader_factory` if they continue to want to use
      Unstructured for parsing documents.
4. Ensure the environment has proper credentials (e.g., running the `az login` command, setting up
   managed identity, etc.) as the connection string would have previously contained the credentials.

Below shows code snippets of what usage patterns look like before and after the proposed migration:

**Before migration:**

```python
from langchain_community.document_loaders import AzureBlobStorageContainerLoader, AzureBlobStorageFileLoader

container_loader = AzureBlobStorageContainerLoader(
    "DefaultEndpointsProtocol=https;AccountName=<account-name>;AccountKey=<account-key>;EndpointSuffix=core.windows.net",
    "<container-name>",
)

file_loader = AzureBlobStorageFileLoader(
    "DefaultEndpointsProtocol=https;AccountName=<account-name>;AccountKey=<account-key>;EndpointSuffix=core.windows.net",
    "<container-name>",
    "<blob-name>"
)
```

**After migration:**

```python
from langchain_azure_storage.document_loaders import AzureBlobStorageContainerLoader, AzureBlobStorageFileLoader
from langchain_unstructured import UnstructuredLoader

container_loader = AzureBlobStorageContainerLoader(
    "https://<account-name>.blob.core.windows.net",
    "<container-name>",
    loader_factory=UnstructuredLoader
)

file_loader = AzureBlobStorageFileLoader(
    "https://<account-name>.blob.core.windows.net",
    "<container-name>",
    "<blob-name>",
    loader_factory=UnstructuredLoader
)
```


### Implementation details

This section dives into implementation details stemming from the public interface outlined above.

### `BaseLoader` methods to implement
For each of the Azure Storage document loaders, the following methods from `BaseLoader` will be implemented:

* [`lazy_load()`][langchain-lazy-load-ref] - Lazily loads documents one at a time.
* [`alazy_load()`][langchain-alazy-load-ref] - Asynchronously and lazily loads documents one at a time. This
should use the asynchronous SDK clients instead of relying on the default `alazy_load()` implementation, which just runs
the synchronous implementation in an executor.

For the rest of the methods offered by `BaseLoader` (e.g., `load()`, `aload()`), the default
document loader implementations will be used, which call into the lazy loading methods and do not
require any overrides.

### Credential details

When no `credential` is provided, the document loaders will use [Azure default credentials][azure-default-credentials].
Specifically, the document loaders will instantiate `azure.identity.DefaultAzureCredential` and
`azure.identity.aio.DefaultAzureCredential` credential objects and provide them to the synchronous and
asynchronous SDK clients respectively.

When a `credential` is provided, the credential will be:
* Validated to ensure it is one of the supported credential types. If not, a `ValueError` will be raised.
* Passed directly to the SDK client's `credential` parameter.
* If a synchronous token credential (e.g., `TokenCredential`) is provided and asynchronous methods are called
  (e.g., `aload()`), the method will raise a `ValueError`. The reverse applies as well (i.e., a `ValueError` is
  raised when an asynchronous token credential is provided and synchronous methods are called).
  For example:
  ```python
  import azure.identity
  import azure.identity.aio
  from langchain_azure_storage.document_loaders import AzureBlobStorageContainerLoader

  sync_doc_loader = AzureBlobStorageContainerLoader(
      "https://<account-name>.blob.core.windows.net",
      "<container-name>",
      credential=azure.identity.ManagedIdentityCredential()
  )
  sync_doc_loader.aload()  # Raises ValueError because a sync credential was provided

  async_doc_loader = AzureBlobStorageContainerLoader(
      "https://<account-name>.blob.core.windows.net",
      "<container-name>",
      credential=azure.identity.aio.ManagedIdentityCredential()
  )
  async_doc_loader.load()  # Raises ValueError because an async credential was provided
  ```
  If a customer wants to access both the synchronous and asynchronous methods, they will need to
  instantiate two separate document loaders: one for sync operations and one for async operations.
  Also note that this restriction does not apply when no credential is provided and default credentials are used;
  the document loader implementation will automatically handle matching the correct credential to the client type.

### Document metadata

By default, the document loaders will populate the `source` metadata field of each `Document`
object with the URL of the blob (e.g., `https://<account-name>.blob.core.windows.net/<container-name>/<blob-name>`). For example:
```python
from langchain_azure_storage.document_loaders import AzureBlobStorageContainerLoader

loader = AzureBlobStorageContainerLoader("https://<account-name>.blob.core.windows.net", "<container-name>")
for doc in loader.lazy_load():
    print(doc.metadata["source"])  # Prints URL of each blob in the container.
```
If a custom `loader_factory` is provided, the document loaders will override any `source` metadata field
set by the custom loader to be the URL of the blob, but retain all other metadata fields set by the custom loader.


### `loader_factory` mechanics

When no `loader_factory` is provided, the document loaders will download the blob content
and return a single `Document` object per blob, with the blob content treated as UTF-8. This
will happen entirely in memory without writing the blob to a temporary file.

If a `loader_factory` is provided, the document loaders will download each blob to a temporary local file
and then call `loader_factory` with the path to the temporary file to get a document loader for loading and parsing the local file. Below shows a simplified example of how this would work:
```python
import tempfile
from langchain_core.document_loaders import BaseLoader
from langchain_core.documents import Document
from typing import Iterator


class AzureBlobStorageContainerLoader(BaseLoader):
    ...
    def _lazy_load_from_custom_loader(self, blob_name: str) -> Iterator[Document]:
        with tempfile.NamedTemporaryFile() as temp_file:
            self._download_blob_to_file(blob_name, temp_file.name)
            loader = self._loader_factory(temp_file.name)
            yield from loader.lazy_load()
```
It's important to note that the temporary file should be deleted after it has been loaded by the custom loader
so that disk usage does not continue to grow over the duration of the `lazy_load()` or `load()` invocation.


## Rationale

This section discusses the rationale behind design decisions made in the specification above.

### Alternatives considered

Below discusses alternatives considered for key design decisions made in the specification above.

#### Exposing Azure Storage integrations as a blob loader instead of a document loader
A [blob loader][langchain-blob-loader-ref] is another type of LangChain loader that just loads
the raw bytes from storage and does not parse the data to text. Instead, a [blob parser][langchain-blob-parser-ref]
is used to parse the raw bytes to text. Together, blob loaders and parsers effectively help decouple how
data is loaded from how data is converted to text for an LLM. Blob loaders are arguably even a cleaner
integration point for Azure Storage since data stored in Azure Blob Storage is generally unstructured and could
be in any format.

However, the decision to support document loaders instead is a reflection of meeting customers where they
are:

* Document loaders are more widely used and have far more 3rd party integrations than blob loaders and
  blob parsers. For these reasons, customers will expect that the integration is in the form of a document
  loader.
* The current community Azure Storage integrations are document loaders. If we were to support only blob
  loaders, the migration to `langchain-azure-storage` would be more involved; customers would need to learn how
  blob loaders and parsers work and possibly implement their own blob parser if the LangChain community
  does not offer a parser that meets their needs.

In the future, we should consider exposing blob loaders if requested by customers, especially if we see
customers wanting to customize loading behavior beyond what is offered by the `loader_factory` parameter.


#### Exposing a `blob_parser` parameter instead of `loader_factory`
In order to customize how content is parsed to text, it was considered to expose a `blob_parser` parameter,
which would accept a [blob parser][langchain-blob-parser-ref] to use to parse the raw bytes of a blob to text.
The advantages of this approach are that:
1. The implementation would not need to write to disk since blob parsers accept blobs loaded into memory. With
   the `loader_factory` approach, the blob needs to be written to disk because most document loaders expect a
   path to a local file to load from.
2. There would be more customization available to the customer as we could more naturally proxy metadata of the blob
   stored in Azure Blob Storage (e.g., content type, last modified time, tags, metadata, etc.) to the `blob_parser`.

However, similar to why document loaders were chosen over blob loaders, blob parsers do not have as wide of
3rd party support as document loaders, which would require customers to write their own blob parser wrappers
over libraries like Unstructured and take away from the batteries-included value proposition that LangChain document
loaders provide.


#### Alternatives to default parsing to UTF-8 text
The default parsing logic when no `loader_factory` is provided is to treat the blob content as UTF-8 text
and return a single `Document` object per blob. This was chosen to provide some default for getting up and running
with document loaders with minimal configuration. Alternatives considered included:
* Require a `loader_factory` to be provided. While it is likely customers will want to provide their own loader
factory (unless all of their documents are just plain text), not having a default parser would add friction
to getting started: customers would need to both properly set up their connection to Azure Blob Storage and decide which document loader to use.
* Default to a more full-featured document loader like [`UnstructuredLoader`][langchain-unstructured-loader].
However, this would require taking a 3rd party dependency outside the Azure Storage and
LangChain ecosystem, which customers would likely not expect from a first-party Azure Storage integration.
Furthermore, LangChain does not offer a recommended default document loader or blob parser to use for integrations.


#### Alternatives to defaulting to Azure default credentials
Using [Azure default credentials][azure-default-credentials] when no credential is provided was chosen to
minimize the configuration needed to get up and running, in addition to defaulting to OAuth2 token-based
credentials, which is the recommended best practice for authenticating to Azure services. Furthermore,
many of the other Azure LangChain integration packages (e.g., `langchain-azure-dynamic-sessions` and
`langchain-azure-postgresql`) default to Azure default credentials. In terms of alternative defaults
considered, these included:
* Requiring a credential to be provided. While this would force customers to think about authentication
and authorization, it would add friction to getting started as customers would now need to learn about
the various in-code credential types and select the proper one to use. Furthermore, in most cases,
customers will want to use the Azure default credentials anyway, especially for getting started.
* Matching the SDK default of anonymous access when no credential is provided. Most customers will want to
make authenticated requests to Azure Blob Storage, so customers would likely need to provide a credential
anyway, with the added undesired side effect that, to resolve auth issues, customers may just try to make
the container public.


### FAQs

Below are some additional questions and answers about the design not covered by the alternatives considered
above.

#### Q: Why not support access key, anonymous, or connection string based credentials?
These are purposely not supported to encourage customers to follow best practices around using
Microsoft Entra ID and managed identities for authentication and authorization. While the community
document loaders only supported connection string based authentication, it is a relatively small
configuration value change to switch to using an account URL, especially since the account URL is a
prevalent configuration value in Azure Blob Storage.

#### Q: Why not support synchronous token credentials when calling asynchronous methods and vice versa?
In general, it is not recommended to interchange synchronous credentials with asynchronous clients and
vice versa (e.g., synchronous credentials can block the async event loop). Furthermore, there are no
adapter mechanisms to use them interchangeably. Enforcing this restriction ensures customers
do not accidentally use the wrong credential type for their sync or async usage pattern.

Technically, we could support this use case by accepting both sync and async token credentials
at instantiation (e.g., expose an `async_credential` parameter or a tuple of sync and async credentials).
However, this could introduce more complexity to the interface, when we expect that, for a single document loader
instance, customers will use only sync or only async methods, but not both.

If we get the feedback that customers want to use both sync and async methods with a single
document loader instance, we can revisit this decision in the future. The sketch below shows the
intended pattern in the meantime: two loader instances, one per credential type.
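A sketch of that pattern under this proposal's interface (the account and container placeholders are illustrative; `DefaultAzureCredential` exists in both `azure.identity` and `azure.identity.aio`):

```python
import azure.identity
import azure.identity.aio

from langchain_azure_storage.document_loaders import AzureBlobStorageContainerLoader

# One loader wired with a sync credential for the sync methods...
sync_loader = AzureBlobStorageContainerLoader(
    "https://<account-name>.blob.core.windows.net",
    "<container-name>",
    credential=azure.identity.DefaultAzureCredential(),
)
docs = sync_loader.load()

# ...and a second loader wired with an async credential for the async methods.
async_loader = AzureBlobStorageContainerLoader(
    "https://<account-name>.blob.core.windows.net",
    "<container-name>",
    credential=azure.identity.aio.DefaultAzureCredential(),
)
# async for doc in async_loader.alazy_load(): ...
```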

#### Q: How would the `loader_factory` expand in the future?
In general, the `loader_factory` is intended to be a simple escape hatch to allow customers to
customize how blobs are parsed to text. However, possible requested extension points may include:
* Needing blob properties to make decisions on how to parse the blob.
* Wanting the blob data to be passed using an in-memory representation rather than a file on disk

If we ever plan to extend the interface, we should strongly consider exposing blob loaders
instead as discussed in the [alternatives considered](#exposing-a-blob_parser-parameter-instead-of-loader_factory)
section above.

If blob loaders do not suffice, we could consider expanding the `loader_factory` to:

* Inspect signature arguments of the callable provided to `loader_factory` and call the callable with
  additional parameters if detected (e.g., detect if a `blob_properties` parameter is present and
  proxy it when calling the callable).
* Introduce additional opt-in parameters to the document loader constructor to control how the
  `loader_factory` is called. For example, we could expose an `include_blob_properties` parameter to include
  blob properties when calling the `loader_factory`.


#### Q: Why are blob properties not exposed in the `Document` metadata?
This was done to simplify the initial implementation and provide flexibility in how blob properties are
represented as document metadata in the future. The `source` field is a widely adopted metadata field and
is generally expected to be set. However, there is no strong guidance on other metadata fields
beyond `source`.

Based on customer requests, in the future, we could consider exposing these properties by either:
* Adding a `blob_properties` field in the `Document.metadata` dictionary to store all blob properties
* Mapping certain blob properties to commonly expected metadata fields (e.g., map last modified time to
  `Document.metadata["last_modified"]` or content type to `Document.metadata["type"]`)


## Future work
Below are some possible future work ideas that could be considered after the initial implementation based on customer feedback:

* Expose blob loader integrations for Azure Blob Storage (see [alternatives considered](#exposing-a-blob_parser-parameter-instead-of-loader_factory) section).
* Proxy additional blob properties as document metadata (see [FAQs](#q-why-are-blob-properties-not-exposed-in-the-document-metadata) section).
* Support `async_credential` parameter to allow using both sync and async token credentials with a single document loader instance
  (see [FAQs](#q-why-not-support-synchronous-token-credentials-when-calling-asynchronous-methods-and-vice-versa) section).
* Support integrations that streamline loading documents directly from Azure Storage as part of the
vector store or memory store interface instead of needing to do it manually with document loaders (see
[non-goals](#non-goals) section).


## Appendix

### Appendix A: Reported customer issues with community Azure Storage document loaders
Below is an audit of some of the issues reported by customers with the existing community Azure Storage document
loaders. Community-reported issues were located by searching the LangChain discussion board for ["AzureBlobStorageFileLoader"][issue-tracker-langchain-1] and ["AzureBlobStorageContainerLoader"][issue-tracker-langchain-2]. Note there were no reports filed in the LangChain GitHub issue queue.

* [#7883](https://github.com/langchain-ai/langchain/discussions/7883) - Customer wants the entire document
returned from the container loader and not something that is already chunked using Unstructured.
* [#8615](https://github.com/langchain-ai/langchain/discussions/8615) - Customer wants to use a PDF parser
to load blobs from a container.
* [#9743](https://github.com/langchain-ai/langchain/discussions/9743) - Customer wants to be able to load
the markdown as-is without using Unstructured to parse the markdown.
* [#9934](https://github.com/langchain-ai/langchain/discussions/9934) - Customer wants to be able to use
token-based authentication instead of being forced to use access keys via a connection string.
* [#17812](https://github.com/langchain-ai/langchain/discussions/17812) - Customer wants to load blobs
using `PyPDFLoader` instead of the default behavior from `UnstructuredLoader`.
* [#19992](https://github.com/langchain-ai/langchain/discussions/19992) - Customer notes that the process
of parsing blobs using Unstructured is very slow in their environment. Being able to customize the loader
would unblock them.



[langchain-document-loader-concept]: https://python.langchain.com/docs/concepts/document_loaders/
[langchain-text-splitter-concept]: https://python.langchain.com/docs/concepts/text_splitters/
[langchain-embedding-concept]: https://python.langchain.com/docs/concepts/embedding_models/
[langchain-vector-store-concept]: https://python.langchain.com/docs/concepts/vectorstores/
[langchain-rag-concept]: https://python.langchain.com/docs/concepts/rag/
[langchain-document-loader-integrations]: https://python.langchain.com/docs/integrations/document_loaders/
[wiki-semantic-search]: https://en.wikipedia.org/wiki/Semantic_search
[langchain-document-ref]: https://python.langchain.com/api_reference/core/documents/langchain_core.documents.base.Document.html
[langchain-document-loader-base-ref]: https://python.langchain.com/api_reference/core/document_loaders/langchain_core.document_loaders.base.BaseLoader.html
[langchain-lazy-load-ref]: https://python.langchain.com/api_reference/core/document_loaders/langchain_core.document_loaders.base.BaseLoader.html#langchain_core.document_loaders.base.BaseLoader.lazy_load
[langchain-alazy-load-ref]: https://python.langchain.com/api_reference/core/document_loaders/langchain_core.document_loaders.base.BaseLoader.html#langchain_core.document_loaders.base.BaseLoader.alazy_load
[langchain-blob-loader-ref]: https://python.langchain.com/api_reference/core/document_loaders/langchain_core.document_loaders.blob_loaders.BlobLoader.html
[langchain-blob-parser-ref]: https://python.langchain.com/api_reference/core/document_loaders/langchain_core.document_loaders.base.BaseBlobParser.html
[langchain-directory-loader]: https://python.langchain.com/docs/how_to/document_loader_directory/
[langchain-s3-directory-loader]: https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.s3_directory.S3DirectoryLoader.html
[langchain-json-loader]: https://python.langchain.com/docs/integrations/document_loaders/json/
[langchain-unstructured-loader]: https://python.langchain.com/docs/integrations/document_loaders/unstructured_file/
[langgraph-memory-store]: https://langchain-ai.github.io/langgraph/concepts/persistence/#memory-store

[community-repository]: https://github.com/langchain-ai/langchain-community
[community-azure-blob-storage-file]: https://python.langchain.com/docs/integrations/document_loaders/azure_blob_storage_file/
[community-azure-blob-storage-container]: https://python.langchain.com/docs/integrations/document_loaders/azure_blob_storage_container/
[issue-tracker-langchain-1]: https://github.com/langchain-ai/langchain/discussions?discussions_q=is%3Aopen+AzureBlobStorageFileLoader+
[issue-tracker-langchain-2]: https://github.com/langchain-ai/langchain/discussions?discussions_q=is%3Aopen+AzureBlobStorageContainerLoader+
[community-pr]: https://github.com/langchain-ai/langchain/pull/1890
[community-feature-request]: https://github.com/langchain-ai/langchain/issues/1805

[langchain-azure-repo]: https://github.com/langchain-ai/langchain-azure
[langchain-azure-storage-pkg]: https://pypi.org/project/langchain-azure-storage/

[azure-default-credentials]: https://learn.microsoft.com/en-us/azure/developer/python/sdk/authentication/credential-chains?tabs=dac#defaultazurecredential-overview
\ No newline at end of file

From 468d71054313a2a5040a02698d3096360e690616 Mon Sep 17 00:00:00 2001
From: Kyle Knapp
Date: Tue, 23 Sep 2025 14:55:36 -0700
Subject: [PATCH 2/3] Update document loader proposal based on feedback

Specifically, added the following:

* Combine the two document loaders into a single AzureBlobStorageLoader
  class.
* Encourage the blob_parser parameter more as a future consideration.
  This would be helpful if a customer did not necessarily want to use
  the blob loader interfaces but still wanted more control over how to
  parse blobs.
---
 .../proposals/document_loaders.md | 141 +++++++++++-------
 1 file changed, 83 insertions(+), 58 deletions(-)

diff --git a/libs/azure-storage/proposals/document_loaders.md b/libs/azure-storage/proposals/document_loaders.md
index 6f55b736..de8c6a9c 100644
--- a/libs/azure-storage/proposals/document_loaders.md
+++ b/libs/azure-storage/proposals/document_loaders.md
@@ -186,47 +186,29 @@ Below is the proposed specification for the Azure Blob Storage document loaders.
All Azure Storage document loaders will live in the [`langchain_azure_storage` package][langchain-azure-storage-pkg]
under a new `document_loaders` module.

-There will be two document loaders introduced:
+There will be a single document loader introduced, `AzureBlobStorageLoader`. This single loader will encompass
+functionality from both the community-sourced `AzureBlobStorageFileLoader` and `AzureBlobStorageContainerLoader`
+document loaders.

-* `AzureBlobStorageFileLoader` - Loads a `Document` from a single blob in Azure Blob Storage.
-* `AzureBlobStorageContainerLoader` - Loads `Document` objects from all blobs in a given container in Azure Blob Storage.
-Assuming no chunking is happening, each `Document` loaded will correspond 1:1 with a blob in the container.
-
-Each document loader will subclass from [`BaseLoader`][langchain-document-loader-base-ref] and support both synchronous
+The document loader will subclass from [`BaseLoader`][langchain-document-loader-base-ref] and support both synchronous
and asynchronous loading of documents, as well as lazy loading of documents.
-Below shows the proposed constructor signatures for each document loader:
+Below shows the proposed constructor signature for the document loader:

```python
-from typing import Optional, Union, Callable
+from typing import Optional, Union, Callable, Iterable
import azure.core.credentials
import azure.core.credentials_async
from langchain_core.document_loaders import BaseLoader


-class AzureBlobStorageFileLoader(BaseLoader):
+class AzureBlobStorageLoader(BaseLoader):
    def __init__(self,
                 account_url: str,
                 container_name: str,
-                 blob_name: str,
+                 blob_names: Optional[Union[str, Iterable[str]]] = None,
                 *,
-                 credential: Optional[
-                     Union[
-                         azure.core.credentials.AzureSasCredential,
-                         azure.core.credentials.TokenCredential,
-                         azure.core.credentials_async.AsyncTokenCredential,
-                     ]
-                 ] = None,
-                 loader_factory: Optional[Callable[[str], BaseLoader]] = None,
-        ): ...
-
-
-class AzureBlobStorageContainerLoader(BaseLoader):
-    def __init__(self,
-                 account_url: str,
-                 container_name: str,
-                 *,
-                 prefix: str = "",
+                 prefix: Optional[str] = None,
                 credential: Optional[
                     Union[
                         azure.core.credentials.AzureSasCredential,
@@ -241,13 +223,16 @@
In terms of parameters supported:
* `account_url` - The URL to the storage account (e.g., `https://<account-name>.blob.core.windows.net`)
* `container_name` - The name of the container within the storage account
-* `blob_name` - (File loader only) The name of the blob within the container to load.
+* `blob_names` - The name of the blob(s) within the container to load. If provided, only the specified blob(s)
+in the container will be loaded. If not provided, the loader will list blobs from the container to load, which
+will be all blobs unless `prefix` is specified.
* `credential` - The credential object to use for authentication. If not provided,
the loader will use [Azure default credentials][azure-default-credentials]. The
`credential` field only supports token-based credentials and SAS credentials. It does
not support access key-based credentials or anonymous access.
-* `prefix` - (Container loader only) An optional prefix to filter blobs within the container.
-Only blobs whose names start with the specified prefix will be loaded.
+* `prefix` - An optional prefix to filter blobs when listing from the container. Only blobs whose names start with the
+specified prefix will be loaded. This parameter is incompatible with `blob_names` and will raise a `ValueError` if both
+are provided.
* `loader_factory` - A callable that returns a custom document loader (e.g., `UnstructuredLoader`) to use
for parsing blobs downloaded. When provided, the Azure Storage document loader will download each blob to
a temporary local file and then call `loader_factory` with the path to the temporary file to get a document
loader for loading and parsing the local file into `Document` objects. If `loader_factory` is not provided,
the loader will return the content as-is in a single `Document` object for each blob. The blob content will be
treated as UTF-8-encoded text in this default case.
@@ -263,21 +248,36 @@ Below are some example usage patterns for the Azure Blob Storage document loader.
Below shows how to load a document from a single blob in Azure Blob Storage:

```python
-from langchain_azure_storage.document_loaders import AzureBlobStorageFileLoader
+from langchain_azure_storage.document_loaders import AzureBlobStorageLoader

-loader = AzureBlobStorageFileLoader("https://<account-name>.blob.core.windows.net", "<container-name>", "<blob-name>")
+loader = AzureBlobStorageLoader("https://<account-name>.blob.core.windows.net", "<container-name>", "<blob-name>")
for doc in loader.lazy_load():
    print(doc.page_content)  # Prints content of blob. There should only be one document loaded.
```

+#### Load from a list of blobs
+Below shows how to load documents from a list of blobs in Azure Blob Storage:
+
+```python
+from langchain_azure_storage.document_loaders import AzureBlobStorageLoader
+
+loader = AzureBlobStorageLoader(
+    "https://<account-name>.blob.core.windows.net",
+    "<container-name>",
+    ["blob1", "blob2", "blob3"]
+)
+for doc in loader.lazy_load():
+    print(doc.page_content)  # Prints content of each blob from the list.
+```
+
#### Load from a container

Below shows how to load documents from all blobs in a given container in Azure Blob Storage:

```python
-from langchain_azure_storage.document_loaders import AzureBlobStorageContainerLoader
+from langchain_azure_storage.document_loaders import AzureBlobStorageLoader

-loader = AzureBlobStorageContainerLoader("https://<account-name>.blob.core.windows.net", "<container-name>")
+loader = AzureBlobStorageLoader("https://<account-name>.blob.core.windows.net", "<container-name>")
for doc in loader.lazy_load():
    print(doc.page_content)  # Prints content of each blob in the container.
```
@@ -285,8 +285,9 @@ for doc in loader.lazy_load():
Below shows how to load documents from blobs in a container with a given prefix:

```python
-from langchain_azure_storage.document_loaders import AzureBlobStorageContainerLoader
-loader = AzureBlobStorageContainerLoader(
+from langchain_azure_storage.document_loaders import AzureBlobStorageLoader
+
+loader = AzureBlobStorageLoader(
    "https://<account-name>.blob.core.windows.net", "<container-name>", prefix="some/prefix/"
)
for doc in loader.lazy_load():
@@ -297,11 +298,11 @@ for doc in loader.lazy_load():
Below shows how to load documents asynchronously. This is achieved by calling the `aload()` or `alazy_load()` methods on the document loader. For example:

```python
-from langchain_azure_storage.document_loaders import AzureBlobStorageContainerLoader
+from langchain_azure_storage.document_loaders import AzureBlobStorageLoader


async def main():
-    loader = AzureBlobStorageContainerLoader("https://<account-name>.blob.core.windows.net", "<container-name>")
+    loader = AzureBlobStorageLoader("https://<account-name>.blob.core.windows.net", "<container-name>")
    async for doc in loader.alazy_load():
        print(doc.page_content)  # Prints content of each blob in the container.
```
@@ -312,9 +313,10 @@ Below shows how to override the default credentials used by the document loader:

```python
from azure.core.credentials import AzureSasCredential
from azure.identity import ManagedIdentityCredential
-from langchain_azure_storage.document_loaders import AzureBlobStorageContainerLoader
+from langchain_azure_storage.document_loaders import AzureBlobStorageLoader

# Override with SAS token
-loader = AzureBlobStorageContainerLoader(
+loader = AzureBlobStorageLoader(
    "https://<account-name>.blob.core.windows.net",
    "<container-name>",
    credential=AzureSasCredential("<sas-token>")
@@ -323,7 +325,7 @@ loader = AzureBlobStorageContainerLoader(
)

# Override with a more specific token credential than the entire
# default credential chain (e.g., system-assigned managed identity)
-loader = AzureBlobStorageContainerLoader(
+loader = AzureBlobStorageLoader(
    "https://<account-name>.blob.core.windows.net",
    "<container-name>",
    credential=ManagedIdentityCredential()
@@ -338,10 +340,10 @@ the `UnstructuredLoader` to parse the local file and return `Document` objects
on behalf of the Azure Storage document loader:

```python
-from langchain_azure_storage.document_loaders import AzureBlobStorageContainerLoader
+from langchain_azure_storage.document_loaders import AzureBlobStorageLoader
from langchain_unstructured import UnstructuredLoader

-loader = AzureBlobStorageContainerLoader(
+loader = AzureBlobStorageLoader(
    "https://<account-name>.blob.core.windows.net",
    "<container-name>",
    # The UnstructuredLoader class accepts a local file path string in its constructor,
@@ -358,7 +360,7 @@ If a customer wants to provide additional configuration to the document loader, they can
define a callable that returns an instantiated document loader. For example, to provide
custom configuration to the `UnstructuredLoader`:
```python
-from langchain_azure_storage.document_loaders import AzureBlobStorageContainerLoader
+from langchain_azure_storage.document_loaders import AzureBlobStorageLoader
from langchain_unstructured import UnstructuredLoader
@@ -370,7 +372,7 @@ def loader_factory(file_path: str) -> UnstructuredLoader:
    )


-loader = AzureBlobStorageContainerLoader(
+loader = AzureBlobStorageLoader(
    "https://<account-name>.blob.core.windows.net", "<container-name>",
    loader_factory=loader_factory
)
@@ -385,11 +387,13 @@ customers will need to perform the following changes:

1. Depend on the `langchain-azure-storage` package instead of `langchain-community`.
2. Update import statements from `langchain_community.document_loaders` to
   `langchain_azure_storage.document_loaders`.
-3. Update document loader constructor calls to:
+3. Change class names from `AzureBlobStorageFileLoader` and `AzureBlobStorageContainerLoader`
+   to `AzureBlobStorageLoader`.
+4. Update document loader constructor calls to:
   1. Use an account URL instead of a connection string.
   2. Specify `UnstructuredLoader` as the `loader_factory` if they continue to want to use
      Unstructured for parsing documents.
-4. Ensure the environment has proper credentials (e.g., running the `az login` command, setting up
+5. Ensure the environment has proper credentials (e.g., running the `az login` command, setting up
   managed identity, etc.) as the connection string would have previously contained the credentials.
Below shows code snippets of what usage patterns look like before and after the proposed migration:

**Before migration:**

```python
from langchain_community.document_loaders import AzureBlobStorageContainerLoader, AzureBlobStorageFileLoader

container_loader = AzureBlobStorageContainerLoader(
    "DefaultEndpointsProtocol=https;AccountName=<account-name>;AccountKey=<account-key>;EndpointSuffix=core.windows.net",
    "<container-name>",
)

file_loader = AzureBlobStorageFileLoader(
    "DefaultEndpointsProtocol=https;AccountName=<account-name>;AccountKey=<account-key>;EndpointSuffix=core.windows.net",
    "<container-name>",
    "<blob-name>"
)
```

**After migration:**

```python
-from langchain_azure_storage.document_loaders import AzureBlobStorageContainerLoader, AzureBlobStorageFileLoader
+from langchain_azure_storage.document_loaders import AzureBlobStorageLoader
from langchain_unstructured import UnstructuredLoader

-container_loader = AzureBlobStorageContainerLoader(
+container_loader = AzureBlobStorageLoader(
    "https://<account-name>.blob.core.windows.net",
    "<container-name>",
    loader_factory=UnstructuredLoader
)

-file_loader = AzureBlobStorageFileLoader(
+file_loader = AzureBlobStorageLoader(
    "https://<account-name>.blob.core.windows.net",
    "<container-name>",
    "<blob-name>",
    loader_factory=UnstructuredLoader
)
@@ -464,16 +468,16 @@ When a `credential` is provided, the credential will be:
  ```python
  import azure.identity
  import azure.identity.aio
-  from langchain_azure_storage.document_loaders import AzureBlobStorageContainerLoader
+  from langchain_azure_storage.document_loaders import AzureBlobStorageLoader

-  sync_doc_loader = AzureBlobStorageContainerLoader(
+  sync_doc_loader = AzureBlobStorageLoader(
      "https://<account-name>.blob.core.windows.net",
      "<container-name>",
      credential=azure.identity.ManagedIdentityCredential()
  )
  sync_doc_loader.aload()  # Raises ValueError because a sync credential was provided

-  async_doc_loader = AzureBlobStorageContainerLoader(
+  async_doc_loader = AzureBlobStorageLoader(
      "https://<account-name>.blob.core.windows.net",
      "<container-name>",
      credential=azure.identity.aio.ManagedIdentityCredential()
@@ -490,9 +494,9 @@ When a `credential` is provided, the credential will be:

By default, the document loaders will populate the `source` metadata field of each `Document`
object with the URL of the blob (e.g., `https://<account-name>.blob.core.windows.net/<container-name>/<blob-name>`). For example:
```python
-from langchain_azure_storage.document_loaders import AzureBlobStorageContainerLoader
+from langchain_azure_storage.document_loaders import AzureBlobStorageLoader

-loader = AzureBlobStorageContainerLoader("https://<account-name>.blob.core.windows.net", "<container-name>")
+loader = AzureBlobStorageLoader("https://<account-name>.blob.core.windows.net", "<container-name>")
for doc in loader.lazy_load():
    print(doc.metadata["source"])  # Prints URL of each blob in the container.
```
@@ -515,7 +519,7 @@
from langchain_core.documents import Document
from typing import Iterator


-class AzureBlobStorageContainerLoader(BaseLoader):
+class AzureBlobStorageLoader(BaseLoader):
    ...
    def _lazy_load_from_custom_loader(self, blob_name: str) -> Iterator[Document]:
        with tempfile.NamedTemporaryFile() as temp_file:
@@ -573,6 +577,27 @@ However, similar to why document loaders were chosen over blob loaders, blob par
over libraries like Unstructured and take away from the batteries-included value proposition that LangChain document
loaders provide.

+It's important to note that this decision does not prevent us from exposing a `blob_parser` parameter in the future.
+Specifically, this would be useful if we see customers wanting to customize loading behavior more but not necessarily
+wanting to drop down to using a blob loader interface.
+
+
+#### Exposing document loaders as two classes, `AzureBlobStorageFileLoader` and `AzureBlobStorageContainerLoader`, instead of a single `AzureBlobStorageLoader`
+Exposing the document loaders as these two classes would be beneficial in that they would match the existing community
+document loaders and lessen the number of changes needed to migrate. However, combining them into a single class
+has the following advantages:
+
+* It simplifies the getting started experience.
+Customers will no longer have to make a decision on which Azure Storage document loader class to use,
+as there will be only one document loader class to choose from.
+* It simplifies class names by removing the additional `File` and `Container` qualifiers, which could lead to
+misinterpretations of what the classes do.
+* It is easier to maintain as there is only one class to maintain, and less code will likely need to
+be duplicated.
+
+While this will introduce an additional step in migrating (i.e., changing class names), the impact is limited,
+as customers will still be providing the same positional parameters even after changing class names
+(i.e., account + container for the container loader and account + container + blob for the file loader).
+

#### Alternatives to default parsing to UTF-8 text
The default parsing logic when no `loader_factory` is provided is to treat the blob content as UTF-8 text
@@ -638,10 +663,10 @@ customize how blobs are parsed to text. However, possible requested extension po
* Needing blob properties to make decisions on how to parse the blob.
* Wanting the blob data to be passed using an in-memory representation rather than a file on disk

If we ever plan to extend the interface, we should strongly consider exposing blob loaders
-instead as discussed in the [alternatives considered](#exposing-a-blob_parser-parameter-instead-of-loader_factory)
+and/or a `blob_parser` parameter instead as discussed in the [alternatives considered](#exposing-a-blob_parser-parameter-instead-of-loader_factory)
section above.

-If blob loaders do not suffice, we could consider expanding the `loader_factory` to:
+If neither blob loaders nor a `blob_parser` parameter suffices, we could consider expanding the `loader_factory` to:

* Inspect signature arguments of the callable provided to `loader_factory` and call the callable with
  additional parameters if detected (e.g., detect if a `blob_properties` parameter is present and
@@ -666,7 +691,7 @@ Based on customer requests, in the future, we could consider exposing these prop

## Future work
Below are some possible future work ideas that could be considered after the initial implementation based on customer feedback:

-* Expose blob loader integrations for Azure Blob Storage (see [alternatives considered](#exposing-a-blob_parser-parameter-instead-of-loader_factory) section).
+* Expose blob loader and/or blob parser integrations (see [alternatives considered](#exposing-a-blob_parser-parameter-instead-of-loader_factory) section).
* Proxy additional blob properties as document metadata (see [FAQs](#q-why-are-blob-properties-not-exposed-in-the-document-metadata) section).
* Support `async_credential` parameter to allow using both sync and async token credentials with a single document loader instance
  (see [FAQs](#q-why-not-support-synchronous-token-credentials-when-calling-asynchronous-methods-and-vice-versa) section).
From e419d85e2762f80b9d5109cb1a071eebfcc05030 Mon Sep 17 00:00:00 2001
From: Kyle Knapp
Date: Mon, 27 Oct 2025 10:50:58 -0700
Subject: [PATCH 3/3] Move status to public preview

---
 libs/azure-storage/proposals/document_loaders.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/libs/azure-storage/proposals/document_loaders.md b/libs/azure-storage/proposals/document_loaders.md
index de8c6a9c..5282d7cc 100644
--- a/libs/azure-storage/proposals/document_loaders.md
+++ b/libs/azure-storage/proposals/document_loaders.md
@@ -1,10 +1,10 @@
# Azure Storage document loaders

-| Proposal    | Metadata          |
-|-------------|-------------------|
-| **Author**  | Kyle Knapp        |
-| **Status**  | Proposed          |
-| **Created** | 12-September-2025 |
+| Proposal    | Metadata                                   |
+|-------------|--------------------------------------------|
+| **Author**  | Kyle Knapp                                 |
+| **Status**  | Public Preview (released in version 1.0.0) |
+| **Created** | 12-September-2025                          |

## Abstract
This proposal outlines the design and implementation of Azure Blob Storage document loaders. These