MeatballWiki

ContextualChunking

RetrievalAugmentedGeneration relies on bringing external knowledge into the context window. Usually one chunks the background knowledge into strings of a few hundred tokens at most. The problem is that natural language often refers to things indirectly, so a chunk cut out of its document can lose the referent it depends on.
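A minimal sketch of naive fixed-size chunking, to show how a chunk gets separated from its referent. It assumes whitespace-separated words as a rough stand-in for tokens (a real system would count tokens with the model's tokenizer). The name `chunk_words` and the sample report text are illustrative only, and `max_words` is set unrealistically low so the effect is visible.

```python
def chunk_words(text: str, max_words: int = 200) -> list[str]:
    """Split text into consecutive chunks of at most max_words words.

    Words are only an approximation of tokens (assumption for this sketch).
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


# Illustrative document: the company is named once, at the top.
document = (
    "Acme Corp quarterly report. "
    "The company had outstanding performance in 2024Q3 "
    "growing 3% year over year."
)

chunks = chunk_words(document, max_words=4)
for chunk in chunks:
    print(repr(chunk))
```

Only the first chunk mentions Acme Corp; every later chunk has to be read without it.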

For example, “The company had outstanding performance in 2024Q3 growing 3% year over year” mentions “the company” without saying which one. The chunk was pulled from a larger document that, earlier on, identified the company being reported on as Acme Corp.

A better chunk would be “Acme Corp had outstanding performance in 2024Q3 growing 3% year over year”.

Being specific has another advantage: the chunk becomes indexable by keyword methods like TfIdf. A query such as “how did Acme Corp perform this year?” can then pull up only documents about Acme Corp and not ones about Ajax Corp.
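The keyword effect can be sketched with a small hand-rolled TF-IDF scorer. This is an illustration under stated assumptions, not a production index: the tokenizer is a simple lowercase regex with no stemming, the IDF uses one common smoothed form, and the Ajax Corp sentence is made up purely for contrast. The function names `tokenize` and `tfidf_scores` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-alphanumerics (no stemming)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_scores(query: str, chunks: list[str]) -> list[float]:
    """Score each chunk as the sum of tf * idf over the query terms."""
    docs = [Counter(tokenize(c)) for c in chunks]
    n = len(docs)
    scores = []
    for doc in docs:
        total = sum(doc.values())
        score = 0.0
        for term in set(tokenize(query)):
            df = sum(1 for d in docs if term in d)
            if df == 0 or total == 0:
                continue  # term appears nowhere, contributes nothing
            idf = math.log((1 + n) / (1 + df)) + 1.0  # smoothed IDF
            score += (doc[term] / total) * idf
        scores.append(score)
    return scores


vague = "The company had outstanding performance in 2024Q3 growing 3% year over year"
contextual = "Acme Corp had outstanding performance in 2024Q3 growing 3% year over year"
# Made-up competing chunk, for illustration only.
other = "Ajax Corp had weak performance in 2024Q3 shrinking 1% year over year"

query = "how did Acme Corp perform this year?"
before = tfidf_scores(query, [vague, other])      # index with the vague chunk
after = tfidf_scores(query, [contextual, other])  # index with the rewritten chunk
print("before:", before)
print("after: ", after)
```

With these inputs the Ajax chunk outscores the vague Acme chunk in the first index, because “Corp” is the only distinguishing query term that matches anything. In the second index the rewritten chunk ranks first, since it alone matches “Acme”.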

One can use an LLM to rewrite chunks contextually, as described in:

https://www.anthropic.com/news/contextual-retrieval
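A rough sketch of what such a rewriting step might look like. The prompt wording, the names `build_prompt` and `contextualize`, and the `complete` callable are all assumptions made for illustration; they are not taken from the linked article or from any particular client library. `complete` stands for whatever function sends a prompt to an LLM and returns its text reply, and a canned stub is used here so the example runs without a model.

```python
from typing import Callable

# Hypothetical prompt; the wording is an assumption, tune it for your model.
PROMPT_TEMPLATE = """<document>
{document}
</document>

Here is a chunk taken from that document:
<chunk>
{chunk}
</chunk>

Rewrite the chunk so that it stands on its own: replace pronouns and vague
references such as "the company" with the specific names used in the
document. Return only the rewritten chunk."""


def build_prompt(document: str, chunk: str) -> str:
    """Fill the template with the full document and one of its chunks."""
    return PROMPT_TEMPLATE.format(document=document, chunk=chunk)


def contextualize(document: str, chunk: str,
                  complete: Callable[[str], str]) -> str:
    """Ask the LLM (via the caller-supplied `complete`) to rewrite the chunk."""
    return complete(build_prompt(document, chunk)).strip()


def fake_complete(prompt: str) -> str:
    """Stand-in for a real LLM call; returns a canned reply."""
    return ("Acme Corp had outstanding performance in 2024Q3 "
            "growing 3% year over year\n")


document = (
    "Acme Corp quarterly report. The company had outstanding "
    "performance in 2024Q3 growing 3% year over year."
)
chunk = ("The company had outstanding performance in 2024Q3 "
         "growing 3% year over year")

rewritten = contextualize(document, chunk, fake_complete)
print(rewritten)
```

Passing the model call in as a parameter keeps the sketch independent of any vendor API. In practice one would run this once per chunk at indexing time and store the rewritten text in the index.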

