LargeLanguageModels (LLMs) are very good at generating coherent text about general knowledge.
However, for most use cases, people want to ask or generate text about contexts local and specific to them.
For example, you may want to answer customer support questions about your product from your help center; help customers find products for sale in your store; or search through your internal corporate wiki for who is responsible for implementing a given project.
Adding new information to an LLM FoundationModel is very difficult and expensive.
You can use FineTuning techniques to improve the style of the output, tighten the output to a smaller range (e.g. only emit Yes or No), or prioritize some knowledge in the LLM over others (or even knock out some concepts entirely). However, you can only really remove information from an LLM, not add your own local and specific information. This is because most FineTuning techniques only modify the final layer of the LLM or provide a layer that modifies the model from the "outside" (e.g. LoRA). This is like chiseling away at a marble statue.
It's not possible to add your own specific facts, like "Sarah Jones is responsible for regional sales in the Northeast."
The common solution is RetrievalAugmentedGeneration (RAG), which lets users add their own dynamic and specific knowledge to the static and general knowledge base of a trained LLM. It works by selectively quoting snippets from your own corpus of text (e.g. a help center, a FAQ, a wiki) and including those snippets in the ContextWindow provided to the LLM to inform the generated output.
For example, instead of prompting the LLM with only "Who is the regional sales director in New England?", which a general model such as one from OpenAI cannot answer because it knows nothing about your organization, the prompt is extended with additional context from your own knowledge base:
Context: Sarah Jones is responsible for regional sales in the Northeast. Raj Mehta is responsible for regional sales in the Southwest. Simon Clerk was promoted to regional sales director in the Northwest. Veronica Brown moved on as regional sales director in the Southeast.
Query: Who is the regional sales director in New England?
ChatGPT 4o-mini returns:
Based on the information provided:
Sarah Jones is responsible for regional sales in the Northeast. Since New England is part of the Northeast region, Sarah Jones is the regional sales director for New England.
This raises the question: how do we even know what context to inject into the prompt?
One truly amazing quality of LLMs is that they organize human knowledge spatially, not just by word, but by concept. By spatially, I mean you can literally get the (x, y, z, ...) coordinates of a concept from the LLM. Typically the number of dimensions is large, say 768 or 1536 or even higher. This set of coordinates is an EmbeddingVector.
Therefore, the common technique is to index your entire corpus in a special VectorDatabase by calculating the EmbeddingVector for each text in the corpus.
Or, rather, we don't do that. A given text may be very large, like an entire EBook. A book will have dozens, hundreds, or thousands of ideas inside it, each of which may be very far from the others in the vector space of the LLM. Furthermore, LLMs often have LLMContextWindows that are much smaller than the texts of a given corpus. Those with larger context windows are very expensive to operate, and not necessarily more coherent (see the LLMNeedleInAHaystackTest).
Instead, we chunk the texts into smaller pieces, such as at the paragraph level. This helps ensure the entire chunk relates to a single point in the vector space of the LLM.
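A minimal sketch of paragraph-level chunking follows. The function name chunk_by_paragraph and the max_chars limit are assumptions made for illustration; real systems often chunk by token count or with overlapping windows instead.

```python
import re


def chunk_by_paragraph(text: str, max_chars: int = 1000) -> list[str]:
    """Split text into paragraph-level chunks.

    Paragraphs are assumed to be separated by blank lines. Any paragraph
    longer than max_chars (an arbitrary illustrative limit) is cut into
    fixed-size pieces so that no chunk grows too large.
    """
    chunks = []
    for paragraph in re.split(r"\n\s*\n", text):
        # Collapse line breaks and runs of spaces inside the paragraph.
        paragraph = " ".join(paragraph.split())
        if not paragraph:
            continue
        for start in range(0, len(paragraph), max_chars):
            chunks.append(paragraph[start:start + max_chars])
    return chunks


sample = "First idea.\n\nSecond idea\nspans two lines.\n\n\n\nThird idea."
sample_chunks = chunk_by_paragraph(sample)
```

Each resulting chunk would then get its own embedding vector and its own entry in the index.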
Then you can calculate the EmbeddingVector of the original prompt and find the chunks in the VectorDatabase whose embedding vectors are closest to it (e.g. using CosineSimilarity). Pick, say, the top 10 chunks that relate to the prompt, and you have more than likely found relevant context to answer the prompt.
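The retrieval step can be sketched roughly as below. The three-dimensional vectors are made up by hand purely for illustration; real embedding vectors come from an embedding model and have hundreds or thousands of dimensions. A plain list stands in for the vector database, and the names cosine_similarity and top_k are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], index: list[tuple[str, list[float]]], k: int = 10) -> list[str]:
    """Return the k chunk texts whose vectors are most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy index of (chunk text, made-up embedding vector) pairs.
index = [
    ("Sarah Jones is responsible for regional sales in the Northeast.", [0.9, 0.1, 0.0]),
    ("The cafeteria is closed on Fridays.", [0.0, 0.2, 0.9]),
    ("Raj Mehta is responsible for regional sales in the Southwest.", [0.7, 0.6, 0.1]),
]
query_vec = [1.0, 0.2, 0.0]  # pretend embedding of the user's question
results = top_k(query_vec, index, k=2)
```

A real vector database replaces the linear scan with an approximate nearest-neighbour index, but the ranking idea is the same.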
Then, when prompting the LLM, you simply add the chunks at the top and ask the LLM to answer the query based on the context you provided; or, in a DocumentSearch task, you ask the LLM to provide a prose summary of the search results that came back.
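Assembling the augmented prompt is mostly string concatenation. The sketch below mirrors the Context/Query layout of the earlier example; the function name build_prompt and the instruction wording are assumptions, not a fixed format that any particular LLM requires.

```python
def build_prompt(query: str, chunks: list[str]) -> str:
    """Place the retrieved chunks above the user's query."""
    context = " ".join(chunk.strip() for chunk in chunks)
    return (
        "Answer the query using only the context provided.\n"
        f"Context: {context}\n"
        f"Query: {query}"
    )


retrieved = [
    "Sarah Jones is responsible for regional sales in the Northeast.",
    "Raj Mehta is responsible for regional sales in the Southwest.",
]
prompt = build_prompt("Who is the regional sales director in New England?", retrieved)
print(prompt)
```

The resulting string is what gets sent to the LLM in place of the bare question.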
It's pretty amazing.
Because we usually index snippets instead of entire texts, the information can become chopped up. Each snippet is isolated, even though in normal writing, each paragraph depends on the text that came before. Often, information references or implies knowledge of information outside the text it was written in.
For instance, if you had built a search engine over a news archive of the 20th century, you may find articles about an ongoing election. Such an article may refer cursorily to what each candidate is doing without explaining who the candidates are or why they are stumping in one city or another. If you're searching through the old archives of the news, the snippet method won't provide enough context to answer the query.
Baseline RAG cannot answer questions that require synthesizing information across the entire dataset, such as "What are the top 5 ideas?", because the system only pulls in the snippets most semantically similar to the query itself. With a very abstract query like that example, the snippets that come back are likely to be nearly random.
GraphRAG uses a more sophisticated method than simple indexing of EmbeddingVectors. It works from the ground up: it identifies entities and relationships in the source data, clusters related entities together, uses the LLM to generate pre-summaries of these groups, and then indexes those summaries.
This makes it easier to pull in not only a snippet related to a query, but also snippets from nearby relationships and higher-order understandings of the dataset.
A visualization of Microsoft's LLM-generated knowledge graph built from a private dataset using GPT-4 Turbo. [1]
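The indexing idea described above can be sketched in miniature. Everything here is an assumption for illustration: the triples are written by hand (a real system would have the LLM extract them from the source text), connected components stand in for proper community detection, and summarize simply joins facts where a real system would ask the LLM for a prose summary.

```python
from collections import defaultdict

# Hypothetical (entity, relationship, entity) triples.
triples = [
    ("Sarah Jones", "directs sales in", "Northeast"),
    ("Northeast", "includes", "New England"),
    ("Raj Mehta", "directs sales in", "Southwest"),
]


def build_graph(triples):
    """Undirected adjacency map: entity -> set of directly related entities."""
    graph = defaultdict(set)
    for source, _, target in triples:
        graph[source].add(target)
        graph[target].add(source)
    return graph


def find_communities(graph):
    """Group entities into connected components (a stand-in for clustering)."""
    seen, communities = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, members = [start], set()
        while stack:
            node = stack.pop()
            if node in members:
                continue
            members.add(node)
            stack.extend(graph[node] - members)
        seen |= members
        communities.append(members)
    return communities


def summarize(community, triples):
    """Placeholder summary: join the facts whose subject is in the community."""
    facts = [f"{s} {r} {t}" for s, r, t in triples if s in community]
    return ". ".join(facts) + "."


graph = build_graph(triples)
communities = find_communities(graph)
summaries = [summarize(c, triples) for c in communities]
```

Because "New England" sits in the same community as "Sarah Jones", a query about New England can pull in the summary that mentions her even if no single snippet names both.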
The KnowledgeGraph in GraphRAG feels very, very similar to a wiki. The relationships between texts in the corpus are like WikiLinks. The hierarchical communities of texts feel like CategoriesAndTopics. The idea of synthesizing summaries over communities feels like IndexPages, and like the organic process of synthesis common in wikis, such as the formation of PatternLanguages.
My question is: why does the synthesis have to be LLM only? Why can't the texts be open to edits by humans?
Contributors: SunirShah