Youdiowei Eteimorde

Posted on Aug 12, 2023 • Edited on Aug 16, 2023

Understanding LangChain's RecursiveCharacterTextSplitter

#python #ai #chatgpt #langchain

Large language models are powerful tools, but they share a distinct limitation known as the context window: the maximum amount of text, measured in tokens, that the model can process at once. Take, for example, gpt-3.5-turbo, which has a context length of 4,096 tokens, corresponding to roughly 3,000 words of English text.

But what occurs when you present these models with a document that exceeds their context window? This is where a clever strategy known as "chunking" comes into play. Chunking involves dividing the document into smaller, more manageable sections that fit comfortably within the context window of the large language model.

LangChain provides a range of chunking techniques to choose from. Among these options, the RecursiveCharacterTextSplitter is the one its documentation recommends for generic text.

Quick overview

The RecursiveCharacterTextSplitter takes a large text and splits it into chunks that fit within a specified chunk size. It does this by working through an ordered list of separators. The default separators are ["\n\n", "\n", " ", ""].

It takes in the large text and first tries to split it by the first separator, \n\n. If a resulting piece is still larger than the chunk size, it moves to the next separator, \n, and splits that piece by it. If the piece is still too large, it moves to the next separator in the list, and so on, until every piece fits within our specified chunk size.
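To make the idea concrete, here is a minimal sketch of the "try the next separator" logic in plain Python. It is a simplified illustration, not LangChain's actual implementation: recursive_split is a hypothetical helper, and it leaves out the step that merges small pieces back together as well as chunk overlap.

```python
def recursive_split(text, separators, chunk_size):
    """Simplified sketch: break `text` into pieces of at most `chunk_size` characters.

    Hypothetical helper for illustration only. It does not merge small
    pieces back together and it ignores chunk overlap.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        return [text]  # no separators left, so the piece stays oversized

    separator, remaining = separators[0], separators[1:]
    # An empty separator means "split into individual characters".
    parts = list(text) if separator == "" else text.split(separator)

    pieces = []
    for part in parts:
        if len(part) <= chunk_size:
            if part:  # skip empty strings left behind by the split
                pieces.append(part)
        else:
            # Still too long: recurse with the next separator in the list.
            pieces.extend(recursive_split(part, remaining, chunk_size))
    return pieces


pieces = recursive_split("aaaa bbbb\n\ncccc", ["\n\n", "\n", " ", ""], chunk_size=5)
print(pieces)
```

In this toy input the first piece is too long after the \n\n split, contains no \n, and is finally broken on the space, which is the same fall-through behaviour described above.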

Code implementation

What I Worked On

February 2021

Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.

The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.

The text above is extracted from an article written by Paul Graham, titled: What I Worked On. Let's utilize the RecursiveCharacterTextSplitter to break it into small chunks, each with a maximum size of 100 characters.

First we import it from langchain:



from langchain.text_splitter import RecursiveCharacterTextSplitter

Let's load the text we wish to create chunks from into a variable called text.



text = """What I Worked On

February 2021

Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.

The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.
"""

Next we create a RecursiveCharacterTextSplitter instance, configuring it with a chunk_size of 100 and a chunk_overlap of zero. We pass Python's built-in len as the length_function, so the size of each chunk is measured in characters.



text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=0,
    length_function=len,
)

The RecursiveCharacterTextSplitter offers several methods for performing splits. In our case, we will use the split_text method. It takes the text as a string and returns a list of strings, one for each chunk.



texts = text_splitter.split_text(text)
print(len(texts)) # 11
print(repr(texts[0])) # 'What I Worked On\n\nFebruary 2021'

After the split, our text is divided into a total of 11 chunks.

In-Depth Explanation

Just as its name suggests, the RecursiveCharacterTextSplitter uses recursion as its core mechanism for splitting text. Let's walk through, step by step, how our earlier code arrived at its chunks.

For our walkthrough, we'll utilize the same text and parameters that we employed during the code implementation. This involves a segment from Paul Graham's essay, and we'll consider a chunk size of 100 characters. The characters we use for splitting will be ['\n\n', '\n', ' ', '' ].

Let's begin with our initial text. Currently presented in human-readable form, our next step involves transforming it into a format that computers can readily comprehend.

Now, the new lines have been converted to \n, which is precisely what we need in order to carry out our splitting process.
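If you want to see this representation yourself, Python's built-in repr shows a string with its newlines written as \n. The snippet below is a small sketch that uses only the first two lines of our text as a stand-in; sample is an illustrative variable name.

```python
# A short stand-in for the essay text: just the title and the date.
sample = """What I Worked On

February 2021
"""

# repr() displays escape sequences such as \n instead of rendering them.
escaped = repr(sample)
print(escaped)  # 'What I Worked On\n\nFebruary 2021\n'
```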

Let's select our text. This can be likened to invoking the split_text method on our text.

As mentioned earlier, the RecursiveCharacterTextSplitter attempts to initiate splits using a predefined set of characters. Its first attempt involves the \n\n character, which serves as a means to split by paragraphs. Let's now identify all occurrences of this character within our text.

Once we've located all instances of the \n\n characters, the subsequent step involves executing a split using this character as our designated separator.

We now have four splits. Our next step is to check whether each split fits within our specified chunk size, which is set at 100 characters.
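This first pass can be reproduced with plain Python string methods. The sketch below is only an illustration: to keep it short, text is a shortened stand-in that keeps the title and date but only the opening sentences of each paragraph, so the exact lengths of the last two splits differ from the full excerpt.

```python
chunk_size = 100

# Shortened stand-in for the essay excerpt (an assumption made for brevity):
# same title and date, but only the start of each paragraph.
text = (
    "What I Worked On\n\n"
    "February 2021\n\n"
    "Before college the two main things I worked on, outside of school, "
    "were writing and programming. I didn't write essays.\n\n"
    "The first programs I tried writing were on the IBM 1401 that our "
    "school district used for what was then called \"data processing.\"\n"
)

# Split on the first separator and check each split against the chunk size.
splits = text.split("\n\n")
fits = [len(split) <= chunk_size for split in splits]

for split, ok in zip(splits, fits):
    label = "good" if ok else "too long"
    print(len(split), label, repr(split[:30]))
```

The title and date come out as good splits, while both paragraphs are flagged as too long and would need to be split again with the next separator.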

The first two splits satisfy this condition, thus earning them the label of good splits. Since both segments consist of fewer than 100 characters, we can combine them to create our initial chunk.

Proceeding to the next split, we find ourselves in a situation where further reduction isn't achievable using the \n\n separator. Therefore, we proceed to the next separator: \n. Our objective here is to split using \n and determine whether that reduces the split's size.

This operation is akin to invoking split_text on the text of that split, but starting from the \n separator. This is where the concept of recursion comes into play.

Upon executing the split using the \n character, we end up with two splits. The first split qualifies as a good split, given that it contains only one character. However, the second split surpasses our designated chunk size.

Consequently, we need to invoke the split_text method on this particular split once again. However, this time we'll employ a split using the next character in our character list, which happens to be the ' ' character.

Finally, we have successfully decreased the split size. Now, we proceed to iterate through each split in order to perform a merge. The guiding principle for these merges is that no resulting merged split should exceed our designated chunk size of 100 characters.

Following the merge, we end up with four chunks, each adhering to our condition that a chunk should not surpass 100 characters.
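The merge step can be pictured as a greedy loop: keep adding pieces to the current chunk until the next piece would push it past the limit, then start a new chunk. The sketch below is a simplified illustration under that assumption; merge_splits is a hypothetical helper, not LangChain's own function, and it ignores chunk_overlap.

```python
def merge_splits(splits, separator, chunk_size):
    """Greedily join `splits` with `separator` into chunks of at most `chunk_size`.

    Hypothetical helper for illustration; chunk overlap is not handled.
    """
    chunks = []
    current = ""
    for piece in splits:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate  # the piece still fits in the current chunk
        else:
            if current:
                chunks.append(current)  # close the current chunk
            current = piece  # start a new chunk with this piece
    if current:
        chunks.append(current)
    return chunks


sentence = (
    "Before college the two main things I worked on, outside of school, "
    "were writing and programming. I didn't write essays."
)
chunks = merge_splits(sentence.split(" "), " ", chunk_size=100)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

Here the words produced by splitting on spaces are merged back into two chunks, each within the 100-character limit.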

Now, let's revisit the original text splits and identify which split remains to be processed.

We still have one split that is greater than our chunk size. We repeat the same procedures again.

We initiate the split using the new line character as the separator.

We perform a split using spaces as the separators.

Next, we proceed with a merge, ensuring that no merged segments exceed the defined chunk size.

After going through the entire process, we arrive at generating eleven individual chunks. Each of these eleven chunks successfully adheres to the 100-character limit. This outcome aligns precisely with what we achieved programmatically.

And there we have it. We've delved into the inner workings of LangChain's RecursiveCharacterTextSplitter. For those who are intrigued, you can explore the source code here. If you found this article informative, please consider showing your appreciation with a reaction: 💖 🦄 🤯 🙌 🔥

Top comments (24)

scitlec • Aug 31 '23

The maintainers of the Langchain documentation should link to your useful explanation.

Thanks!

Pham Minh Quang • Sep 10 '23

I totally agree! The LangChain documentation just sucks.

Youdiowei Eteimorde • Sep 1 '23

Thanks for your kind words 🥰 Who knows they might.

Cristian Molina • Jan 25

What about the chunk_overlap param?

Youdiowei Eteimorde • Jan 27 • Edited

The chunk_overlap parameter determines how much the chunks overlap with each other.

For example let's split your comment into three chunks.

What about | the chunk_ | overlap param?

Let's overlap each chunk with a few characters from its neighbor:

What about the | about the chunk_ | chunk_overlap param?

If we didn't use chunk overlapping, your comment would have lost its meaning when split.

Cristian Molina • Jan 28

Thanks! That makes sense, but what value should I use if, for instance, I need to save the texts in a vector DB later to augment a RAG?
Does it matter? If this is significant I'd add this information to the article.
Thanks again.

Youdiowei Eteimorde • Jan 29

It all depends on your data and what you are trying to achieve. The whole practice of augmenting LLMs with external knowledge is still in its infancy, so you can experiment with different params to see how your LLM performs during RAG.

James Stover • Jan 17

Something doesn’t quite work right as I see some words throughout my text after splitting are broken apart with a space making 2 non-words of each of them. They have quite a few characters in between, so it isn’t frequent, but in a large body of text, these add up. I am concerned about the detrimental impact to the vector embeddings and retrieval then.