Querying TRIRIGA Data With AI

Wednesday, February 21, 2024

NOTE: You can find the entire code repository covered in this post over here on GitHub.


Quick Background

For anyone who has worked with the TRIRIGA data scheme, they may have noticed how a simple representation on the frontend can become complex with the backend SQL. For example, a space record will be associated to a parent floor via the association "Is Child Of" and also "Parent Floor". Whenever I design SQL schema, I heavily utilize foreign keys for associations, but does this mean that every association requires its own field? TRIRIGA gets around this by utilizing a middle table called IBS_SPEC_ASSIGNMENTS (and soon, module-level tables to break it down) for tracking these associations. It works pretty well, but i can become sluggish with large data sets. Another way we can represent this data is to utilize a graph database.


A graph database stores data as nodes and relationships. A node is similar to a table as it will contain the attributes and relationships are used to relate nodes against each other. Back in 2012, I was part of a failed startup that aimed to use the Neo4J graph database to aggregate relationships from social networks to display relevant feeds to an end user. Graph databases are so powerful that Facebook even built a faux graph interface against their backend MySQL database which speeds up data retrieval through their query engine and caching techniques.


One last topic that we'll cover in this post is retrieval augmented generation (RAG). Although this solution is not a true RAG since it requires the user to train their large language model (LLM) on their data, it does use a lot of RAG techniques with off-the-shelf models. If you've seen other RAG tutorials, you may see solutions requiring splitting PDFs into chunks and then embedding the results in a vector database. The LLM will then take the user query and compare it against those embeddings to return the closest matches. This is fine, but unstructured data has a lot of variables and these solutions are never great for production. I will follow up this post with a strategy for unstructured data, but this post will focus on taking structured TRIRIGA data and allowing user queries against it.


Why Do This?

When the first local LLMs were published, I tried finding good use cases for them. Right now, with Mixtral 8x-7b, I believe that we have reached a point where someone can now utilize a local LLM on consumer-grade hardware. What this means is that there isn't any data sent to OpenAI which is a big concern with many large companies today. From my own tests, this model performs very close to GPT-3.5 in terms of generating the queries. I will continue to play around with other models and may even find something smaller that can run on lower grade hardware!



The following video is a quick demo with some queries against the sample data. As you can see, the first few queries return data, then we run into an issue with a badly generated cypher query. Finally, the last query did run, but due to the way it was constructed, it double counted the data. Issues like this will require additional tuning on the code and like any other AI-powered tool, it's always good for users to verify the responses.


Representing Data

For this application, I created a simple schema involving the space, floor and building records. The schema looks as follows:

flowchart TD

    S -->|SpaceChildOfBuilding| B
    S -->|SpaceChildOfFloor| F
    F -->|FloorChildOfBuilding| B


Within TRIRIGA, this is similar to querying using the Parent Floor and Parent Building locator fields/associations. Each node has a few test attributes that we will use in our queries down the line.

The graph database Kùzu is used for this solution. It is a fast, embeddable graph database that has been developed by some smart folks at my alma mater. With some simple python code, the schema is generated:

import kuzu

db = kuzu.Database('./location_data')
conn = kuzu.Connection(db)

conn.execute("CREATE NODE TABLE Space(id STRING, name STRING, spaceClass STRING, area FLOAT, PRIMARY KEY (id))")
conn.execute("CREATE NODE TABLE Floor(id STRING, name STRING, level INT32, PRIMARY KEY (id))")
conn.execute("CREATE NODE TABLE Building(id STRING, name STRING, PRIMARY KEY (id))")
conn.execute("CREATE REL TABLE SpaceChildOfFloor(FROM Space TO Floor, MANY_ONE)")
conn.execute("CREATE REL TABLE FloorChildOfBuilding(FROM Floor TO Building, MANY_ONE)")
conn.execute("CREATE REL TABLE SpaceChildOfBuilding(FROM Space TO Building, MANY_ONE)")


Data Flow with RAG

Now that you've read through a lot of background and data set up, let me discuss the overall architecture of this application. The user will send the LLM a query and the system prompts the LLM to generate a cypher statement based on the data schema. This query is executed against the graph database and the results (if the query was valid) will be returned to the user. In short, the flow is as follows:

    participant TRIRIGA
    participant Middleware
    participant Graph DB
    participant Local LLM
    participant FastAPI Backend
    actor User

    TRIRIGA ->> Middleware: Automated data feed (TBD)
    Middleware ->> Graph DB: Transform & load data (TBD)
    User ->> FastAPI Backend: Send text query
    Graph DB ->> FastAPI Backend: Generate prompt from schema
    FastAPI Backend ->> Local LLM: Send query with prompt template
    FastAPI Backend ->> Graph DB: Execute Cypher statement
    Graph DB ->> FastAPI Backend: Return Data
    FastAPI Backend ->> User: Display results


Areas of Improvement

This has been a fun project to throw together, but there are definitely some areas of improvement:

  1. 1. When a cypher statement is not valid, try altering the query and resubmit it to the LLM for regeneration. This will involve some prompt updates
  2. 2. Send the results back to the LLM for processing. This will make the solution more complete. LangChain had some issues with the latest version of Kùzu when I started this, but I did resolve those issues with a PR and it should be working in the latest release
  3. 3. Charts! Everyone loves AI generated charts, so why not take the data and generate a nice chart?
Running on recycled hardware from my closet