Taki089.Dang

Posted on Dec 13 • Edited on Dec 14

What is Chunk Size and Chunk Overlap

#programming #ai #langchain #rag

What is Chunk Size?

The chunk size refers to the maximum number of characters or tokens allowed in a single chunk. For example, if the chunk size is 300, each chunk will have up to 300 characters.

Example:
Imagine this text:

Chunking is the process of breaking a large piece of text into smaller, manageable pieces to make it easier to process and analyze.

if the chunk size is 40 characters:

Chunk 1: "Chunking is the process of breaking a"
Chunk 2: "large piece of the text into smaller,"
Chunk 3: "manageable piece to make it easier"
Chunk 4: "to process and analyze."

Notes: you can use tool to count of tokens Tokenizer or you can use tiktoken like that:

import { encoding_for_model } from "tiktoken";
const text = "Chunking is the process of breaking a large piece of text into smaller, manageable pieces to make it easier to process and analyze.";
const tokenizer = encoding_for_model("gpt-3.5-turbo");
const tokens = tokenizer.encode(text);

console.log(`Total count of tokens: ${tokens.length}`);

What is Chunk Overlap?

Chunk Overlap refers to the number of characters or tokens shared between consecutive chunks. Overlapping ensures that important context is not lost when diving the text into smaller parts.

Example:
Using the same text:

Chunking is the process of breaking a large piece of text into smaller, manageable pieces to make it easier to process and analyze.

With a chunk size of 40 and an overlap of 10:

Chunk 1: "Chunking is the process of breaking a"
Chunk 2: "breaking a large piece of the text into,"
Chunk 3: "text into, manageable piece"
Chunk 4: "piece to make it easier to process"
Chunk 5: "process and analyze."

Here, "breaking a" is part of both Chunk 1 and Chunk 2, ensuring that no context is lost between chunks.

What should you use overlap?

Long sentences: Prevents sentences from being split in a way that loses meaning.
Preserving context: Ensures context is maintained for task like embedding or searching.

How to Chunk Text Programmatically (with Langchain):

import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

const text = "Chunking is the process of breaking a large piece of text into smaller, manageable pieces to make it easier to process and analyze.";
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 40,      // Maximum size of each chunk
  chunkOverlap: 10,   // Overlap between chunks
});

const chunks = await splitter.splitText(text);
console.log(chunks);

output:

[
  'Chunking is the process of breaking a',
  'a large piece of text into smaller,',
  'smaller, manageable pieces to make it',
  'make it easier to process and analyze.'
]

Why Chunking Matters

For Embedding Models: Many models have a token limit, so chunking ensures the text fits.
Improves Retrieval: Overlapping chunk helps capture text, making retrieval results more accurate.
Scalability: Large texts are broken into smaller pieces, allowing efficient processing.

How to set Chunk Size and Chunk Overlap

Chunk Size: Aim to fit within the model's token limit (e.g., Open AI models like text-davinci-003 or gpt-3.5-turbo typically have token limits ranging from 4000 to 6000) but small enough for efficient processing. Common values range from 200 to 500 tokens per chunk for embedding tasks.
Chunk Overlap: Generally, overlap is set to 10%-20% of the chunk size to ensure continuity. This overlap ensures that the last few tokens of one chunk carry over to the next chunk, preserving context for embedding or search tasks

DEV Community

What is Chunk Size and Chunk Overlap

What is Chunk Size?

What is Chunk Overlap?

What should you use overlap?

How to Chunk Text Programmatically (with Langchain):

Why Chunking Matters

How to set Chunk Size and Chunk Overlap

Top comments (0)

Read next

How to Nail Technical AI Questions in Interviews (2025 Edition)

Is the EU Falling Behind in the AI Race?

NeurIPS 2024 - What Matters When Building Vision Language Models

WTF is shift-left testing? 🗟