Getting you started on the Graph Journey
We’ll cover what you need to know to get started with graph databases. To make the ideas concrete, we’ll take a relational database schema as an example and see how it maps to a graph use case. We’ll also look at how Retrieval Augmented Generation (RAG) can be performed on a graph database.
Graph Database Features:
- Built for scale: billions of records (nodes and relationships)
- Complex retrieval queries that traverse the interconnected graph
- Flexible schema: the existing schema can be updated as requirements change
- Intuitive data representation (exploring and visualising)
- Graph algorithms, e.g. centrality, shortest path, BFS, Dijkstra, community detection
Neo4j, Azure Cosmos DB (graph), Amazon Neptune Analytics, TigerGraph and Memgraph are some of the available graph databases.
Nodes, Edges (Relationships) & Their Properties:
A graph database consists of nodes (entities) and relationships (connections between entities), which allows a more intuitive representation of complex, connected data.
Nodes: Represent entities (e.g., Employee, Company, Location).
Labels: Used to categorise nodes, allowing for more efficient querying and organisation.
Ex: Employee and Manager can be two labels attached to the same Employee node.
Relationships: Define how nodes are connected (e.g., “HAS_CEO,” “LOCATED_IN”). The direction of the connection must be specified explicitly.
Self-referencing relationships, where both ends are nodes with the same label, are also allowed.
Ex: (Employee)-[:MANAGES]->(Employee), where the managing Employee is the CEO.
Properties: Key-value pairs associated with nodes and relationships, providing additional context (e.g., a HAS_CEO relationship might have a property holding the date the CEO started working).
Data types supported: Boolean, Int, Float, String, Point, Date, Time, DateTime, List[Float] (e.g. embeddings)
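To make the vocabulary concrete, here is a minimal in-memory sketch of nodes, labels, directed relationships and properties. It is not tied to any graph database; the class names, the sample people and company, and the start date are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Any


@dataclass
class Node:
    """An entity with one or more labels and key-value properties."""
    node_id: str
    labels: set[str]
    properties: dict[str, Any] = field(default_factory=dict)


@dataclass
class Relationship:
    """A directed, typed connection from `start` to `end` with its own properties."""
    rel_type: str
    start: str  # node_id of the source node
    end: str    # node_id of the target node
    properties: dict[str, Any] = field(default_factory=dict)


# A node may carry several labels: this person is both an Employee and a Manager.
alice = Node("e1", {"Employee", "Manager"}, {"name": "Alice"})
bob = Node("e2", {"Employee"}, {"name": "Bob"})
acme = Node("c1", {"Company"}, {"name": "Acme"})

relationships = [
    # Direction matters: the company points to its CEO.
    # The "since" date is a made-up relationship property.
    Relationship("HAS_CEO", acme.node_id, alice.node_id, {"since": date(2020, 1, 1)}),
    # A relationship between two nodes that share the Employee label.
    Relationship("MANAGES", alice.node_id, bob.node_id),
]

# Follow outgoing MANAGES relationships from Alice.
managed_by_alice = [
    r.end for r in relationships
    if r.rel_type == "MANAGES" and r.start == alice.node_id
]
```

Here `managed_by_alice` holds the ids of the nodes reached by following MANAGES from Alice, which is the kind of traversal a graph query language expresses directly.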
Translating RDBMS Schema to Graph Structure
Most of us come from an RDBMS background, so let's see how a relational schema can be modelled as a graph schema. Imagine the SQL schema shown in the figure below.
To migrate it to a graph schema, consider the following three points:
- A row is a node, with the row's column values attached as node properties.
- A table name is a label name.
- A join or foreign key is a relationship, which can have properties of its own.
The figure below depicts how the resulting graph schema would look.
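The three mapping rules can be sketched in plain Python. This is only an illustration under assumed inputs: the `employees` and `companies` tables, their columns, and the `WORKS_FOR` relationship type are hypothetical and do not come from the schema in the figures.

```python
# Hypothetical relational rows; table and column names are illustrative only.
employees = [
    {"id": 1, "name": "Alice", "company_id": 10},
    {"id": 2, "name": "Bob", "company_id": 10},
]
companies = [{"id": 10, "name": "Acme"}]


def rows_to_nodes(table_label, rows, foreign_keys=()):
    """Each row becomes a node and the table name becomes its label.

    Foreign-key columns are left out of the node properties because
    they are turned into relationships instead.
    """
    return [
        {
            "key": (table_label, row["id"]),
            "label": table_label,
            "properties": {k: v for k, v in row.items() if k not in foreign_keys},
        }
        for row in rows
    ]


def foreign_key_to_relationships(rows, source_label, fk_column, target_label, rel_type):
    """Each non-null foreign-key value becomes one directed relationship."""
    return [
        {
            "type": rel_type,
            "start": (source_label, row["id"]),
            "end": (target_label, row[fk_column]),
            "properties": {},
        }
        for row in rows
        if row.get(fk_column) is not None
    ]


nodes = (
    rows_to_nodes("Employee", employees, foreign_keys={"company_id"})
    + rows_to_nodes("Company", companies)
)
relationships = foreign_key_to_relationships(
    employees, "Employee", "company_id", "Company", "WORKS_FOR"
)
```

A many-to-many join table would follow the same idea, except that the join row itself becomes the relationship and its extra columns become relationship properties.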
Graph Query Language
Just as SQL is used to work with relational databases, there are graph-specific query languages. Cypher, Gremlin and SPARQL are used for data management across different graph database engines.
These languages are mostly intuitive to read and support a range of graph-specific functionality.
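As a rough comparison, here is the same question ("which employees work for a given company?") written in SQL and in Cypher, held as Python strings. The table, label, property and relationship names are hypothetical, and `missing_parameters` is a small helper written for this sketch, not part of any driver.

```python
import re

# Relational version: the connection is expressed as a join on a foreign key.
SQL_QUERY = """
SELECT e.name
FROM employee AS e
JOIN company AS c ON c.id = e.company_id
WHERE c.name = :company_name
"""

# Graph version: the connection is a relationship pattern in the MATCH clause.
CYPHER_QUERY = """
MATCH (e:Employee)-[:WORKS_FOR]->(c:Company {name: $company_name})
RETURN e.name AS name
"""

# Parameters are passed separately from the query text rather than
# being formatted into the string.
params = {"company_name": "Acme"}


def missing_parameters(cypher, supplied):
    """Return names referenced as $name in the query but absent from `supplied`."""
    referenced = set(re.findall(r"\$(\w+)", cypher))
    return sorted(referenced - set(supplied))


unresolved = missing_parameters(CYPHER_QUERY, params)
```

With a real driver, the query text and the `params` dictionary would be handed over together, for example to a session's run method.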
Graph Data Science
Neo4j offers a package of native graph algorithms (Graph Data Science) for running analytics on top of the data. These are often useful for recommendation engines, clustering communities, identifying community leaders, or finding the shortest driving route between two locations.
- PageRank: Measures the importance of nodes in a graph based on the number and quality of links to them.
- Community Detection (Louvain Method): Groups nodes into communities based on their connectivity, revealing clusters within the graph.
- Link Prediction: Estimates the likelihood of future connections between nodes based on existing relationships, commonly used in social networks.
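To give a feel for what such an algorithm does, here is a minimal pure-Python sketch of PageRank using power iteration. It is a teaching example, not the implementation used by any graph database; the damping factor of 0.85, the iteration count and the sample edges are assumptions.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of directed (source, target) edges."""
    nodes = sorted({name for edge in edges for name in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)

    count = len(nodes)
    rank = {node: 1.0 / count for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {
            node: (1.0 - damping) / count + damping * dangling / count
            for node in nodes
        }
        # Every node shares its rank equally among the nodes it links to.
        for node in nodes:
            for target in out_links[node]:
                new_rank[target] += damping * rank[node] / len(out_links[node])
        rank = new_rank
    return rank


# Hypothetical "follows" edges: three people link to carol, carol links to alice.
edges = [("alice", "carol"), ("bob", "carol"), ("carol", "alice"), ("dave", "carol")]
scores = pagerank(edges)
most_important = max(scores, key=scores.get)
```

The scores always sum to 1, so each value can be read as the share of importance a node holds within the graph.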
Graph RAG
GraphRAG has become practical recently thanks to the rise of LLMs and vector databases. Frameworks such as LangChain or LlamaIndex can be used for chunking and triplet construction.
Some graph databases, such as Neo4j, have native support for index creation. Microsoft Azure Cosmos graph, however, relies on an index created externally in an Azure AI Search service.
LLM: Helps extract triplets tailored to the business domain with just a few Pydantic-guided prompts, as opposed to earlier, less effective methods such as:
- https://huggingface.co/ibm/knowgl-large
- https://huggingface.co/Babelscape/mdeberta-v3-base-triplet-critic-xnli
Vector DB: Stores high-dimensional vector embeddings produced by an embedding model (e.g. a CLIP model for images or an ada model for text).
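The core operation a vector store performs can be sketched as a nearest-neighbour search by cosine similarity. The three-dimensional vectors and chunk ids below are toy values invented for illustration; real embeddings come from an embedding model and have hundreds or thousands of dimensions, and production stores use approximate indexes rather than a full scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, index, k=2):
    """Return the keys of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query_vector, vec), key) for key, vec in index.items()]
    return [key for _, key in sorted(scored, reverse=True)[:k]]


# Toy embeddings keyed by a hypothetical chunk id.
index = {
    "chunk-1": [0.9, 0.1, 0.0],
    "chunk-2": [0.0, 1.0, 0.0],
    "chunk-3": [0.8, 0.2, 0.1],
}
results = top_k([1.0, 0.0, 0.0], index, k=2)
```

In a GraphRAG setup the returned keys would typically identify chunk or entity nodes, which then serve as entry points into the graph.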
Graph Construction:
1. Document Processing: Start with a collection of documents from which to extract structured information. Chunk the text using a chunking strategy that fits the business needs, then extract triplets from each chunk.
Reference: https://docs.llamaindex.ai/en/stable/module_guides/indexing/lpg_index_guide/
Implicit Path Extractor: Creates a linked list of text chunks.
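Simple LLM Path Extractor: Uses prompt engineering to extract entities and relations from each chunk.
Schema LLM Path Extractor: Constrains extraction to a predefined set of entity and relation types, as in the example below.
    from typing import Literal

    from llama_index.core.indices.property_graph import SchemaLLMPathExtractor

    # Entity and relation types the LLM may emit (uppercase, underscore separated).
    entities = Literal["PERSON", "PLACE", "THING"]
    relations = Literal["PART_OF", "HAS", "IS_A"]

    # For each entity type, the relation types it is allowed to start.
    schema = {
        "PERSON": ["PART_OF", "HAS", "IS_A"],
        "PLACE": ["PART_OF", "HAS"],
        "THING": ["IS_A"],
    }

    # Initialize the schema-guided extractor. `llm` and `extract_prompt` are
    # assumed to be defined earlier (a LlamaIndex LLM and an extraction prompt).
    kg_extractor = SchemaLLMPathExtractor(
        llm=llm,
        extract_prompt=extract_prompt,
        possible_entities=entities,
        possible_relations=relations,
        kg_validation_schema=schema,
        strict=True,  # reject triplets that fall outside the schema
        num_workers=4,
        max_triplets_per_chunk=10,
    )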
2. Vector indexes can be maintained on chunks, entities, or entities plus relations. Additionally, a full-text index can be set up on the required fields to support complex text search, including keyword, phrase, wildcard and fuzzy searches.
3. Retrievers can be based on different parameters such as depth, breadth, node labels, properties and filter conditions. Specifically for the GraphRAG case:
- Vector Context Retriever: Uses vector search for more robust results. The search can run against the chunk, entity, or entity + relation vector index, depending on which index the embedded user query is compared with.
- Text-to-Cypher Retriever: Generates Cypher queries from natural language.
- Cypher Template Retriever: Executes predefined Cypher templates with parameters.
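The depth and filter parameters mentioned for retrievers can be illustrated with a small in-memory sketch: starting from entry nodes (for example, the entities returned by a vector search), it expands breadth-first along outgoing relationships and collects the triplets it crosses as context. The triplets, the function name `retrieve_context` and its parameters are hypothetical and are not the API of any framework.

```python
from collections import deque

# Hypothetical (subject, relation, object) triplets from graph construction.
triplets = [
    ("Alice", "WORKS_FOR", "Acme"),
    ("Acme", "LOCATED_IN", "Berlin"),
    ("Berlin", "PART_OF", "Germany"),
    ("Bob", "WORKS_FOR", "Globex"),
]


def retrieve_context(triplets, entry_nodes, max_depth=1, allowed_relations=None):
    """Breadth-first expansion from the entry nodes along outgoing edges.

    Stops after `max_depth` hops and, if `allowed_relations` is given,
    follows only relationships of those types.
    """
    outgoing = {}
    for subject, relation, obj in triplets:
        if allowed_relations is None or relation in allowed_relations:
            outgoing.setdefault(subject, []).append((relation, obj))

    visited = set(entry_nodes)
    queue = deque((node, 0) for node in entry_nodes)
    context = []
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # depth limit reached, do not expand further
        for relation, neighbour in outgoing.get(node, []):
            context.append((node, relation, neighbour))
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, depth + 1))
    return context


two_hop_context = retrieve_context(triplets, ["Alice"], max_depth=2)
filtered_context = retrieve_context(
    triplets, ["Alice"], max_depth=2, allowed_relations={"WORKS_FOR"}
)
```

The collected triplets would then be serialised into the prompt as context for the LLM; a larger `max_depth` brings in more context at the cost of more noise.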
There are many other possibilities with graphs; this was just a gentle glimpse of what can be achieved.
REFERENCE
https://neo4j.com/developer-blog/graphrag-field-guide-rag-patterns/
https://neo4j.com/docs/getting-started/appendix/tutorials/guide-import-relational-and-etl/
https://dataheadhunters.com/academy/graph-theory-in-data-science-applications-and-algorithms/