Beyond basic text splitting
Disclaimer: This series explores concepts and creative solutions for building Advanced Retrieval Augmented Generation (RAG) Systems. If you’re new to RAG systems, I recommend exploring some introductory materials first. Microsoft’s Introduction to RAG in AI development and Google Cloud’s free course on Retrieval Augmented Generation are good places to start.
Welcome to the first instalment of my multi-part series on building advanced RAG (Retrieval-Augmented Generation) systems. Over the past two years, I’ve had the opportunity to implement these systems for clients across a range of real-world environments and contexts. I’m excited to share the insights, challenges, and solutions I’ve encountered along the way. The strategies I will present in this series have consistently produced higher-quality results in real-world applications when compared to standard RAG implementations. *
In this first post, we’ll dive into a fundamental yet often overlooked aspect of RAG systems: document splitting. The manner in which documents are ingested into a RAG system is critical for the quality of the system as a whole. Yet some industry-standard implementations rely on simplistic approaches that can compromise the quality of retrieved information.
The limitations of traditional text splitters
The most common approach to RAG document ingestion has two steps. The first step is to convert the document to plain text using a document loader. Next, the plain text is split into chunks by a text splitter. The most commonly used text splitter is LangChain’s RecursiveCharacterTextSplitter. This tool chops documents into smaller pieces based on character count and can keep some amount of overlapping text between chunks to help preserve context.
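To make the mechanics concrete, here is a minimal sketch of what character-count splitting with overlap does. This is not LangChain’s actual implementation (which also tries a hierarchy of separators); it is a simplified illustration, and the sample text is invented:

```python
def split_text(text: str, chunk_size: int = 100, overlap: int = 20) -> list[str]:
    """Cut text into fixed-size chunks, repeating the last `overlap`
    characters of each chunk at the start of the next one."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

sample = "RAG systems retrieve relevant chunks from a document store. " * 5
chunks = split_text(sample, chunk_size=100, overlap=20)
```

Note that the boundaries fall at arbitrary character positions, which is exactly how this style of splitter ends up cutting sentences and tables in half.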
The reasons for splitting documents into chunks are both technical and practical, but the two primary ones are:
- It helps reduce cost by limiting the number of input tokens (text length) sent to the LLM.
- It enables more granular retrieval of information.
However, even when properly calibrated, the most commonly used document loaders and text splitters have some major drawbacks. They ignore the structure of documents and will split text mid-sentence, disrupting the natural flow of information. This can break apart closely related concepts and lose important contextual relationships contained within a document. Other document elements like tables, diagrams and graphs often become an incoherent jumble of symbols and text that is unusable to the LLM.
The downstream effect is that RAG systems perform well during prototyping, but performance degrades in production. For practical reasons, in the initial phases of development there is a tendency to select and tune document ingestion for a small set of sample documents. The reality, however, is that the target production databases are usually filled with documents of varying formats, quality, and consistency. Many documents end up with sub-optimal chunking patterns, which negatively impacts retrieval performance.
Improve splitting: Use the document’s structure
A more effective approach to document splitting leverages the inherent structure of documents rather than relying on word counts. Documents are organized into headings, sections, sub-sections, and paragraphs for a reason. These divisions already represent logical groupings of information created by the authors themselves.
Breaking up documents based on these natural boundaries offers several advantages:
- It preserves the intended context and relationship of information.
- It maintains the coherence of individual sections.
- It becomes easier to track and reference data sources in the original document.
- It is easier to extract tables, graphs, charts and images.
- The quality of metadata is improved.
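As a sketch of the idea, the snippet below splits a Markdown document at heading boundaries so each section stays intact with its heading. It is a deliberately minimal illustration (real documents need handling for preamble text, code fences, and oversized sections), and the sample document is invented:

```python
import re

def split_by_headings(markdown: str) -> list[dict]:
    """Split a Markdown document at heading lines, keeping each
    section's heading together with its body text."""
    sections = []
    current = {"heading": None, "level": None, "body": []}
    for line in markdown.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)", line)
        if match:
            # A new heading closes the previous section.
            if current["heading"] is not None or current["body"]:
                sections.append(current)
            current = {"heading": match.group(2),
                       "level": len(match.group(1)),
                       "body": []}
        else:
            current["body"].append(line)
    sections.append(current)
    return sections

doc = "# Title\nIntro text.\n## Methods\nDetails here."
sections = split_by_headings(doc)
```

Each resulting section is a coherent unit authored as such, rather than an arbitrary 1,000-character window.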
Preserve table structure with markdown
Tables present a particular challenge for basic text splitters. Text splitters often destroy a table’s structure by treating it as plain text. This results in the loss of the relationship between data cells, rows and column headings. Ultimately, text splitters can leave table data practically unusable for an LLM.
Markdown offers an elegant solution to this problem. By converting tables to Markdown format, we can:
- Preserve the relationship between headers, rows, and columns.
- Maintain the tabular structure in a format that LLMs can effectively process.
- Achieve a lower token count than HTML and XML.
- Keep the data easily readable for humans and machines.
Here’s a quick example of how a table maintains its structure in Markdown:
| Feature | Traditional Splitter | Structure-Based Splitting |
|---------|----------------------|---------------------------|
| Context Preservation | Limited | High |
| Metadata Quality | Basic | Detailed |
| Implementation Complexity | Low | Medium |
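When you extract a table yourself (for example, as a header row plus data rows), serialising it back to Markdown is straightforward. The helper below is a hypothetical sketch of that step, assuming the cells have already been extracted as strings:

```python
def rows_to_markdown(headers: list[str], rows: list[list[str]]) -> str:
    """Render extracted table data as a Markdown table so the
    relationship between headers and cells survives chunking."""
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(row) + " |")
    return "\n".join(lines)

table = rows_to_markdown(["Feature", "Supported"], [["Tables", "Yes"]])
```

Because the whole table is emitted as one block, it can be kept in a single chunk instead of being scattered across several.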
Implementation considerations
There isn’t currently a widely available open-source tool that handles structure-based document splitting perfectly. This means, for now, you will need to implement this yourself. Here are a few tips to help you:
- Convert all documents to a common format and build your splitter for that format.
- Split documents at logical boundaries (section breaks, headings, paragraphs, tables and complete sentences).
- Extract visual elements and maintain a reference to their position in the original document.
- Generate metadata that reflects the document’s structure.
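The last tip can be sketched as follows: track the heading hierarchy while walking the document, and stamp each chunk with the path of headings above it plus its source. This is an illustrative sketch that assumes headings sit in their own paragraphs; the field names (`section_path`, `source`) are my own invention, not a standard schema:

```python
import re

def chunks_with_structure_metadata(markdown: str, source: str) -> list[dict]:
    """Split Markdown into paragraph chunks, each tagged with the
    trail of headings above it and the source document name."""
    path: dict[int, str] = {}  # heading level -> heading text
    chunks = []
    for block in markdown.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        match = re.match(r"^(#{1,6})\s+(.*)", block)
        if match:
            level = len(match.group(1))
            path[level] = match.group(2)
            # Drop deeper headings left over from the previous section.
            path = {k: v for k, v in path.items() if k <= level}
        else:
            chunks.append({
                "text": block,
                "metadata": {
                    "source": source,
                    "section_path": " > ".join(path[k] for k in sorted(path)),
                },
            })
    return chunks
```

Metadata like this makes it trivial to cite the exact section a retrieved chunk came from, and it can also be used for filtered retrieval.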
For converting documents, open-source tools like Pandoc and LibreOffice will work, with some limitations. At additional cost, closed-source libraries offer better support for proprietary document formats.
Looking ahead
In the next instalment of this series, I will present strategies for ingesting visual elements like images, diagrams, and flowcharts. We’ll look at how modern vision models can help us capture and preserve this information in our RAG systems.
*By “standard RAG implementation” I am referring to a RAG system built using LangChain’s basic document loaders and text splitters. The observed improvements are based on my own benchmarking tools and user feedback.