Guides
Vector Databases vs. Knowledge Graphs for RAG
Most AI SaaS companies today are implementing RAG with vector databases, but knowledge graphs are starting to pick up steam as the importance of retrieval accuracy becomes more apparent. We'll compare both options to help you pick the right approach for your use case.
Mathew Pregasen, Developer Relations
12 mins to read
It doesn't matter which database you pick if you haven't solved the data ingestion challenge (especially with unstructured data). Paragon enables you to connect to and ingest data from your users' hundreds of 3rd party applications, without worrying about the plumbing (auth, webhooks, rate limits etc.). Try it out - you can be up and running with an MVP in minutes.
Retrieval-Augmented Generation (RAG) has become the default way companies are building AI SaaS products and features, and the selection of the right database and data model is crucial. Unsurprisingly, most teams have adopted vector databases due to how easy they are to spin up and the speed of retrieval they offer. But knowledge graphs are starting to gain traction, particularly in enterprise AI, where understanding complex data relationships is critical for better accuracy.
In fact, Microsoft’s launch of GraphRAG highlights the growing importance of knowledge graphs in complex AI applications, and I'm sure this space will continue to develop over time.
We'll explore both architectures and walk through the pros and cons of both vector database and knowledge graph implementations to give you a clearer understanding of which approach best suits your AI product's use case.
Vector Database Overview
A vector database is a specialized database designed to store, index, and query vector embeddings — numerical representations of unstructured data such as text, images, and audio. Without getting too into the weeds, vector embeddings enable you to capture meaning and contextual characteristics from the unstructured data, making them a valuable and expressive data format.
How it works
To use a vector database, text or other unstructured data is first divided into smaller, more manageable chunks. These chunks are then processed by an embedding model, a type of machine learning model that transforms unstructured data into numerical vectors. The model is trained to capture the semantic meaning and features of the data so that important concepts and relationships are preserved, enabling it to be much more useful than strings when it comes to search and retrieval.
Querying a vector database involves converting the query into a vector using the same embedding model used to populate the database. This vector is then compared to the stored vectors in the database using similarity metrics like cosine similarity. The results are ranked based on their similarity to the query and the closest matches are returned.
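To make the retrieval step concrete, here's a minimal pure-Python sketch of cosine similarity ranking. The document names and vectors are invented for illustration; a real system would get vectors from an embedding model and use an optimized index rather than a linear scan:

```python
import math

def cosine_similarity(a, b):
    # cosine = dot(a, b) / (|a| * |b|); 1.0 means the vectors point the same way.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy stored embeddings -- in practice these come from an embedding model.
doc_vectors = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.2],
}
query_vector = [0.85, 0.15, 0.05]  # toy embedding of "how do I get my money back?"

# Rank stored vectors by similarity to the query and keep the closest match.
best_doc = max(doc_vectors, key=lambda d: cosine_similarity(query_vector, doc_vectors[d]))
```

Even with made-up numbers, the query lands closest to the semantically related document, which is the core mechanic behind vector retrieval.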
Advantages of Vector Databases
Speed
Vector databases excel at managing and querying unstructured data. By storing vector embeddings, vector databases are optimized to provide lightning-fast search and retrieval - after all, comparing the distance between vectors (numbers) is extremely cheap.
Search via semantic meaning
Additionally, what makes vector databases particularly useful within the context of RAG and AI is their ability to perform semantic search. Semantic search accounts for the meaning and context of data rather than literal text matches, which produces more accurate and relevant search results than traditional text similarity search.
High data volumes
Vector databases are also built to handle large-scale datasets, making them ideal for applications that need to process and analyze vast amounts of complex data efficiently. The scalability of vector databases ensures they can manage the demands of extensive data processing tasks without compromising performance.
Limitations of Vector Databases
Lack of hierarchy and relationships between data
Think of a vector database as a single, flat store of unstructured data. For the most part there's no structure, which means you lose the hierarchical or relational organization found in traditional databases. This may not be a problem for simple RAG-based AI applications, but it can be limiting for use cases that require organized data relationships.
For example, in a CRM, hierarchical structure is often crucial for understanding customer information, sales leads, and customer support interactions. Without this structure, managing and navigating these interconnected data points becomes cumbersome and inefficient.
Limited results
Additionally, applications that use vector databases often face the challenge of limited search results. To avoid returning an overwhelming amount of data, search results are usually restricted to the top several results using a ranking algorithm. This limitation can hinder the effectiveness of searches, especially for use cases where more comprehensive data retrieval is necessary for thorough analysis or decision-making.
Knowledge Graph Overview
For those who have taken CS 101, a graph database will be a very familiar concept. But for those who may not have, a knowledge graph is a structured way to organize data, representing entities as nodes and relationships as edges within a graph. This method allows for a clear, visual representation of how different data points are connected to each other, making it easier to understand and analyze complex information.
How it works
Knowledge graphs are constructed by first identifying key entities and their relationships within a domain to define the ontology of the graph, which usually involves extensive data modeling. Metadata and additional attributes can be added to nodes to provide context or add functionality to the graph, such as respecting permissions during retrieval.
For example, say you are building an AI application for the healthcare industry. A knowledge graph could be used to unify patient data, medical research, treatment protocols, and other important data sources. When a doctor queries the system for treatment options, the knowledge graph could explore relationships between similar cases, relevant research, and potential drug interactions, providing a comprehensive understanding of the available treatments.
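As a minimal sketch of that healthcare example, here's the graph represented as plain Python dictionaries and triples. All entity names and relationships are invented for illustration, and a real system would use a graph database rather than in-memory lists:

```python
# Nodes keyed by id; edges as (source, relationship, target) triples.
nodes = {
    "patient_1": {"type": "Patient", "age": 54},
    "condition_a": {"type": "Condition", "name": "hypertension"},
    "drug_x": {"type": "Drug", "name": "Drug X"},
    "drug_y": {"type": "Drug", "name": "Drug Y"},
}
edges = [
    ("patient_1", "DIAGNOSED_WITH", "condition_a"),
    ("drug_x", "TREATS", "condition_a"),
    ("drug_y", "TREATS", "condition_a"),
    ("drug_x", "INTERACTS_WITH", "drug_y"),
]

def neighbors(node, relationship):
    """Follow edges of a given type from a node, in either direction."""
    out = [t for s, r, t in edges if s == node and r == relationship]
    out += [s for s, r, t in edges if t == node and r == relationship]
    return out

# "What treats this patient's condition?" -> a two-hop traversal.
condition = neighbors("patient_1", "DIAGNOSED_WITH")[0]
treatments = neighbors(condition, "TREATS")
```

Note that answering the question required following explicit relationships, not matching text, which is exactly what a vector database cannot do on its own.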
Advantages of Knowledge Graphs
Retrieval accuracy
Knowledge graphs can help improve search relevance by accounting for relationships between data points, vs. a generic semantic search across all available data. This enhanced contextual understanding can lead to more accurate responses in AI applications.
Lineage
Additionally, knowledge graphs provide transparency and data lineage by adding metadata to nodes and edges. This makes it easier to trace the origin and evolution of data, and makes AI responses and problem-solving more explainable. By clearly documenting how data is connected and where it comes from, knowledge graphs can help make AI systems more trustworthy.
Limitations of Knowledge Graphs
Set up and maintenance costs
The biggest reason why knowledge graphs haven't been as widely adopted is the complexity of setting one up and maintaining it over time. Doing so requires significant effort in data modeling and ontology design, as well as specialized expertise and ongoing management (you'll need a dedicated team of data and software engineers with experience in both knowledge graphs and the business domain).
Slower data retrieval
Knowledge graphs are also slower at retrieving data, as they have to traverse the graph to reach the data required to answer a query. Imagine if you ingested all of your customers' business data from hundreds of files and applications - that would be a lot of data to traverse through! This is definitely a factor to consider, as it will impact the performance of your AI application.
Complexity with changes in data
Another challenge is handling data updates. Integrating new data and updating existing information can be difficult, potentially leading to inconsistencies and errors. Ensuring the knowledge graph remains consistent and accurate amidst frequent data changes is a recurrent and demanding task.
Implement a Vector DB or Knowledge Graph in your RAG application
Vector Database Implementation
It is relatively simple to spin up a basic vector database implementation (which is why most AI startups go down this path). But the true challenge comes with optimizing each part of the process for performance and accuracy, such as pre-processing the data after ingestion, picking the right embedding models and chunk sizes, and handling permissions. That said, here's the high-level summary of the process.
Data ingestion and pre-processing: Raw data (text, images, audio, etc.) is collected from 3rd party sources, whether in the form of files or application data, then cleaned and segmented into manageable chunks. The optimal size of your chunks will depend on what embedding model you use and how you expect to search for content.
Creating and indexing vector embeddings: Your data chunks are converted into vector embeddings using an embedding model, capturing their semantic meaning and context. These embeddings are stored and indexed in the vector database. A good place to look for the latest and greatest embedding models is Hugging Face's MTEB leaderboard.
Query processing: When a query is made, you'll use the same embedding model that was applied to your chunks in step 2. This query embedding is then matched against the stored vector embeddings in the database to find similar vectors.
Generating results: You'll combine the retrieved data with the original query, and pass it to the LLM of your choice to return a much more holistic and contextual answer to the original query.
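The four steps above can be sketched end to end with a toy in-memory "vector database". The embedding function here is a fake stand-in that just counts keyword occurrences per dimension; a real pipeline would call an embedding model and a vector store instead:

```python
import math

VOCAB = ["refund", "shipping", "invoice", "password"]

def fake_embed(text):
    # Stand-in for a real embedding model: one dimension per vocab word.
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Steps 1-2: chunk the source data and index the embeddings.
chunks = [
    "our refund policy allows a refund within 30 days",
    "shipping usually takes 3 to 5 business days",
]
index = [(chunk, fake_embed(chunk)) for chunk in chunks]

# Step 3: embed the query with the SAME model, then retrieve the top match.
query = "how do I request a refund"
query_vec = fake_embed(query)
top_chunk, _ = max(index, key=lambda item: cosine(query_vec, item[1]))

# Step 4: the retrieved chunk augments the prompt sent to the LLM.
prompt = f"Context: {top_chunk}\n\nQuestion: {query}"
```

The key detail is in step 3: the query must go through the same embedding model as the indexed chunks, or the similarity comparison is meaningless.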
If you're looking for a vector database, you can easily get started with Pinecone. Or, if you use Postgres, look into the pgvector extension.
Knowledge Graph Implementation
Implementing a knowledge graph in RAG pipelines is a lot more complex, as we shared above. We won't get into the nitty-gritty of ontology design here, but here's a quick overview of the process.
Data extraction and integration: Data is extracted from various sources (databases, documents, APIs, etc.), and using an LLM or extraction algorithm, you'll need to identify the entities, relationships, and metadata from that source data.
For example, if your primary sources are project reports for your company in Google Drive, you could instruct an LLM to 1) read in each document; 2) identify key people, meetings, and action items within the document; and 3) output them in a structured, tabular format.
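For illustration, the structured output of such an extraction might look like the rows below. The document name, people, and action items here are entirely invented:

```python
# Hypothetical structured output from the extraction step: each row links an
# entity found in a document to its type and one of its relationships.
extracted_rows = [
    {"document": "Q3 Project Report", "entity": "Alice Chen", "type": "Person",
     "relationship": "OWNS_ACTION_ITEM", "target": "Finalize vendor contract"},
    {"document": "Q3 Project Report", "entity": "Kickoff Sync", "type": "Meeting",
     "relationship": "ATTENDED_BY", "target": "Alice Chen"},
]
```

Getting the LLM to emit a consistent schema like this is what makes the next step (graph construction) mostly mechanical.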
Building the knowledge graph: Using the data extracted in the previous step, a node is created for each entity, an edge is created for each relationship, and all nodes and relationships are tagged with their associated metadata. Most graph databases like Neo4j provide standard APIs and libraries you can use to easily create new nodes and edges from input data, especially when the extracted data is already structured.
Query processing: When a query is made, the LLM identifies relevant entities and relationships within the query. It then constructs a graph query (e.g. a Cypher query) to retrieve nodes and edges that relate to the entities and relationships from the user query.
Generating results: The LLM combines its own training knowledge with the retrieved entity and relationship information to create a comprehensive context to answer the query.
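As a sketch of the query-processing step, here's how an extracted entity and relationship might be turned into a Cypher query string. The node labels and properties are hypothetical, and a production app would execute this through a graph database driver with query parameters rather than string interpolation:

```python
def build_cypher(entity_name, relationship):
    # Builds an illustrative Cypher MATCH; real code should pass entity_name
    # as a query parameter to avoid injection issues.
    return (
        f"MATCH (e {{name: '{entity_name}'}})-[:{relationship}]->(related) "
        "RETURN related"
    )

# Entities/relationships here would come from the LLM's analysis of the query.
cypher_query = build_cypher("Alice Chen", "OWNS_ACTION_ITEM")
```

The LLM's job in this step is the entity and relationship extraction; the graph query itself is usually generated from a template or by the LLM directly.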
Hybrid Implementation
You can also combine both vector databases and knowledge graphs to get the benefits of both, if you're up for the added complexity, of course. In this approach, knowledge graphs are used to capture and query the complex relationships between different data points. Meanwhile, any unstructured content or metadata is converted into vector embeddings and stored in a vector database, enabling efficient semantic searches.
For instance, Neo4j is a popular graph database that natively supports creating embeddings for graph nodes, facilitating vector searches within the graph. This hybrid method allows for both advanced semantic search capabilities and the rich, interconnected data structure of knowledge graphs. Consequently, it offers the benefits of both a knowledge graph and a vector database.
Implementing a hybrid approach using a knowledge graph and vector search in RAG models involves the following steps:
Populating a knowledge graph and vector database: The steps outlined in the previous sections are used to construct a knowledge graph and populate a vector database of your source data.
Entity extraction with vector search: When a query is made, the LLM extracts key entities and relationships from the query. A vector search is then performed on the properties of the knowledge graph to narrow down to relevant nodes of interest.
Graph traversal: The LLM generates graph queries to traverse the graph and extract any entities, relationships, and metadata that may be relevant when answering the original query.
Response generation: The retrieved graph information is used to augment the LLM’s context, which the LLM uses to generate a response to the query.
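The hybrid steps above can be sketched by combining the two earlier ideas: a vector search over node properties picks the entry point, then a graph traversal pulls in connected context. All data here is invented and the "embeddings" are toy two-dimensional vectors:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Graph nodes carrying toy embeddings of their text properties.
nodes = {
    "proj_alpha": {"text": "Project Alpha launch plan", "vec": [0.9, 0.1]},
    "proj_beta": {"text": "Project Beta budget review", "vec": [0.1, 0.9]},
}
edges = [("proj_alpha", "OWNED_BY", "alice"), ("proj_beta", "OWNED_BY", "bob")]

# Step 2: vector search over node properties to find the relevant entry node.
query_vec = [0.8, 0.2]  # toy embedding of "who owns the launch plan?"
entry = max(nodes, key=lambda n: cosine(query_vec, nodes[n]["vec"]))

# Step 3: traverse from that node to collect related facts for the LLM context.
context = [(rel, target) for src, rel, target in edges if src == entry]
```

This is the essence of the hybrid approach: the vector search handles the fuzzy "which node does this query mean?" problem, and the graph handles the precise "what is connected to it?" problem.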
The decision framework
When determining whether to use a vector database or a knowledge graph for your RAG applications, it really comes down to the use case your application is trying to solve, and the type of data you're working with. If you're just getting started, a vector database should be your weapon of choice because it's easier to implement and still provides pretty relevant answers to queries.
However, there are definitely cases where using a knowledge graph can help. Is a structured representation of your data possible and beneficial? Do relationships between data points play a large role in answering questions about your data? Do you need to trace the decision-making process of your model?
These are some of the questions that you need to answer, but we'll walk through some examples to make it easier to visualize.
When to Choose a Knowledge Graph
If your RAG application requires an in-depth understanding of relationships and hierarchies within your data, a knowledge graph might be the right choice. This is particularly true for use cases like the following:
Recommendation systems: Your use case involves making recommendations based on complex relationships within your customers’ data. For example, if you provide an e-commerce search product, you can use a knowledge graph to intelligently help e-commerce companies suggest products to their customers based on their customers’ past purchases and similar users' buying patterns.
Hierarchy recognition: You need your RAG model to recognize and utilize hierarchies between different types of data, such as understanding nested folders in a Google Drive directory or associating sales data with different regions in a CRM application.
Explainability and data lineage: It is important that your model’s responses are explainable and that the data lineage is clear and traceable. This can be especially crucial when dealing with sensitive data (e.g. financial or healthcare data that may be regulated), and can also be helpful for debugging and improving decisions made by your model over time.
Before choosing a knowledge graph approach, keep in mind that creating and maintaining knowledge graphs can be a complex and resource-intensive process. Initial setup requires extensive data modeling and ontology design, demanding a deep understanding of the domain. This complexity becomes even more challenging when building an AI SaaS application, where each customer may have different data hierarchies and unique domain-specific requirements. Maintaining the knowledge graph adds another layer of difficulty, requiring continuous updates to ensure data accuracy and relevance while avoiding inconsistencies.
When to Choose a Vector Database
If the above factors are not hard requirements for your use case, then definitely go with a vector database. They're flexible and support most data types, which means a much lower barrier to entry and lower maintenance cost compared to a knowledge graph.
Additionally, vector databases currently benefit from strong industry and community support and there are many open-source vector database options available. Most RAG research until now has also been focused on the “traditional” RAG approach that relies on semantic search via a vector database to augment LLM context, so there is more known about the performance of these models.
When to Choose a Hybrid Approach
Choosing a hybrid approach over using just a vector database or a knowledge graph in RAG applications is beneficial under many circumstances. For instance:
Large data sets: When dealing with large datasets, the hybrid approach allows you to offload structural queries to a knowledge graph while also using a vector database to index large volumes of unstructured data.
Diverse data sets: For diverse datasets, the hybrid approach can efficiently handle any unstructured data through vector databases while organizing and querying any structured data via knowledge graphs.
Complex queries: If your queries are complex and multifaceted, requiring both contextual understanding and detailed relationship mapping, the hybrid approach can provide a more robust solution that neither vector databases nor knowledge graphs alone can offer.
Although the hybrid approach is very powerful, it also comes with drawbacks. It requires you to construct a knowledge graph, populate a vector database, and keep the two in sync over time. Accordingly, this approach can be costly and can come with a high maintenance burden. It ultimately comes with the drawbacks of both knowledge graphs and vector databases, while potentially slowing down retrieval because it must execute two types of searches.
Conclusion
When using RAG in your AI applications, choosing the right database technology is pivotal. Vector databases excel in handling unstructured data and performing semantic searches, while knowledge graphs provide a structured way to understand complex relationships and track data lineage. The recent emergence of hybrid approaches like GraphRAG demonstrates the potential to combine these technologies, harnessing the strengths of both to achieve a more optimal RAG architecture. By carefully considering the specific needs of your use case, you can select the most effective strategy to optimize your AI applications.