🆕✨ New in documentation!

2mo ago


Ever wonder how to get original document elements and their metadata after applying smart chunking methods? Learn how to access the document elements and all of their metadata: page numbers, coordinates, and more!

https://docs.unstructured.io/open-source/core-functionality/chunking#recovering-chunk-elements

In LLM RAG, the primary role of chunking is to split long documents into smaller, semantically coherent segments, improving the precision and efficiency of retrieval. Today, let's explore how Unstructured, with its extensive experience in parsing and processing unstructured documents, handles chunking.

## Common Chunking Methods ##

Typically, chunking starts with the text extracted from a document and forms chunks based on plain-text features, such as character sequences that may indicate paragraph or list-item boundaries, like "\n\n" or "\n".

## Unstructured Chunking Methods ##

Unstructured's chunking approach works on document structure rather than raw text, so it better preserves the document's semantic structure and key information for downstream processing.

1. **Based on Semantic Units:** Unstructured uses specific knowledge of each document format to divide documents into semantic units (document elements), rather than relying solely on plain-text features (such as line breaks) for chunking.
2. **Preserving Semantic Integrity:** Unless a single element exceeds the maximum chunk size, every chunk contains one or more complete semantic units, which maintains the semantic coherence established during partitioning.
3. **Flexible Chunking Strategies:** Two chunking strategies, "basic" and "by_title", are available, so you can pick the one that fits your needs. The "by_title" strategy also preserves section boundaries, so a chunk never mixes text from two sections.
4. **Adjustable Parameters:** Multiple adjustable parameters (such as max_characters, new_after_n_chars, overlap, etc.) are provided, allowing users to fine-tune chunking behavior according to specific requirements.
5. **Metadata Preservation:** Through the .metadata.orig_elements field, it is possible to access the original elements that make up each chunk, preserving important metadata (such as page numbers, coordinates, etc.).
6. **Integration with the Partitioning Process:** Chunking can be performed directly during partitioning or as a separate step afterwards, offering greater flexibility.
7. **Table Processing:** Special handling for table elements ensures the integrity and readability of tables.
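To make the element-based idea concrete, here is a minimal pure-Python sketch of the behavior described in points 1, 2, 3 and 5. It is an illustration under stated assumptions, not Unstructured's actual implementation: `Element`, `Chunk`, and `chunk_elements_sketch` are hypothetical names invented for this example, only `max_characters` is modeled (no `new_after_n_chars` or `overlap`), and oversized elements are split naively by character count.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Element:
    """Hypothetical stand-in for a partitioned document element."""
    category: str                      # e.g. "Title", "NarrativeText"
    text: str
    page_number: Optional[int] = None  # example of per-element metadata


@dataclass
class Chunk:
    """A chunk that remembers the elements it was built from."""
    text: str
    orig_elements: List[Element] = field(default_factory=list)


def chunk_elements_sketch(elements, max_characters=500, by_title=False):
    """Greedily pack whole elements into chunks of at most max_characters.

    With by_title=True, a "Title" element always starts a new chunk, so no
    chunk spans two sections. Only an element longer than max_characters
    ever has its text split.
    """
    chunks, current = [], []

    def flush():
        # Close the chunk being built, keeping references to its elements.
        if current:
            text = "\n\n".join(e.text for e in current)
            chunks.append(Chunk(text, list(current)))
            current.clear()

    for el in elements:
        if by_title and el.category == "Title":
            flush()
        if len(el.text) > max_characters:
            # Oversized element: isolate it and split its text into windows.
            flush()
            for i in range(0, len(el.text), max_characters):
                chunks.append(Chunk(el.text[i:i + max_characters], [el]))
            continue
        # Length of the joined text if el were added ("\n\n" separators).
        joined_len = (sum(len(e.text) for e in current)
                      + 2 * len(current) + len(el.text))
        if current and joined_len > max_characters:
            flush()
        current.append(el)
    flush()
    return chunks


elements = [
    Element("Title", "Introduction", page_number=1),
    Element("NarrativeText", "Chunking splits long documents.", page_number=1),
    Element("Title", "Methods", page_number=2),
    Element("NarrativeText", "Elements are packed whole.", page_number=2),
]

chunks = chunk_elements_sketch(elements, max_characters=60, by_title=True)
for chunk in chunks:
    # Recover metadata from the original elements behind each chunk.
    pages = sorted({e.page_number for e in chunk.orig_elements})
    print(pages, repr(chunk.text))
```

Running the same input with `by_title=False` packs the "Methods" title into the first chunk, which is the section-boundary mixing that the "by_title" strategy avoids. In the real library, the recovered elements sit under each chunk's `.metadata.orig_elements` rather than directly on the chunk as in this sketch.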
