Study Note/Data Analysis

Grounding Generative AI: Avoiding Hallucinations When Working with Databases

jhleeatl 2025. 6. 20. 13:55

Why Hallucinations Occur

One of the most remarkable aspects of generative AI is its creative capacity—its ability to generate novel ideas, stories, and solutions. However, this same creative power can lead to problems when working with databases and SQL, where precision is paramount.

When generative AI lacks sufficient context about your database environment, it attempts to "fill in the blanks" with plausible-sounding information—much like a student guessing on an exam. This phenomenon, known as hallucination, can manifest as the AI inventing column names that don't exist, creating table relationships that aren't real, or generating syntactically correct but functionally incorrect queries.

Think of it this way: if you asked a new employee to write a query without showing them your database schema or explaining your environment, they would have to make educated guesses. Generative AI operates similarly—without adequate information, it will produce what seems reasonable based on its training, not what's accurate for your specific database.

Key Techniques for Establishing Database Context

Working with generative AI is a conversation, and it is critical to start every conversation with prompts that give the AI enough information to establish a sound context. If we want it to generate SQL, our prompts need very specific information.

Provide Technical Environment Details

The foundation of successful AI-generated SQL begins with clear specification of your technical environment:

  • Database Type and Version: Explicitly state whether you're using PostgreSQL, MySQL, SQL Server, Oracle, or another database system, including the specific version number. Sample prompt: "I'm working with MySQL version 8.0.32 and need to create a query that shows customer purchase history."
  • Client Information: Identify the exact client or tool you're using to connect to the database. Specifying this prevents client-specific syntax problems, such as psql metacommands that fail in any client other than the psql command line. Sample prompt: "I'm using DBeaver 23.1.0 as my database client to connect to PostgreSQL 14. The query I'm trying to fix is..."
  • Connection Method: Describe how you're accessing the database—through a programming language (Python, Java, etc.), a GUI tool, or a command-line interface. Sample prompt: "I'm connecting to SQL Server 2019 through pyodbc in Python 3.10. I need to retrieve customer data that meets these conditions..."
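These environment details are easy to gather programmatically and paste into a prompt. A minimal sketch using Python's built-in sqlite3 module (substitute your own database and driver; the output format mirrors the sample prompts above):

```python
import sqlite3
import sys

def environment_context(conn: sqlite3.Connection) -> str:
    """Collect environment details to paste at the top of a prompt."""
    db_version = conn.execute("SELECT sqlite_version()").fetchone()[0]
    return (
        "Technical Environment:\n"
        f"- Database: SQLite {db_version}\n"
        f"- Client: Python {sys.version_info.major}.{sys.version_info.minor} "
        "via the built-in sqlite3 module\n"
        "- Connection: direct in-memory connection"
    )

conn = sqlite3.connect(":memory:")
print(environment_context(conn))
```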

Share Detailed Database Structure

Beyond the technical environment, the AI needs comprehensive information about your database structure:

  • Database and Table Names: List all relevant databases and tables that may be involved in your query. Sample prompt: "I'm working with the 'sales_analytics' database which contains four main tables: customers, orders, products, and order_items. I need to analyze customer purchasing patterns."
  • Column Definitions: Provide complete column information including names, data types, constraints, and any special characteristics. Sample prompt: "In the customers table, we have customer_id (INT PRIMARY KEY), name (VARCHAR(100)), email (VARCHAR(100)), signup_date (DATE), and status (ENUM with values 'active', 'inactive', 'suspended')."
  • Table Relationships: Describe primary and foreign key relationships between tables to enable proper joins. Sample prompt: "The orders table has order_id as its primary key and customer_id as a foreign key linking to the customers table. Each order can have multiple items in the order_items table, which has order_id and product_id as a composite key."
  • Indexing Information: When performance matters, share details about indexed columns. Sample prompt: "We have indexes on customer_id in the orders table, and on order_date to optimize date-range queries. The products table has a full-text search index on the product_description column."
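Typing a schema out by hand is tedious, so it helps to query it from the database itself and paste the result into the prompt. A sketch using SQLite's PRAGMA table_info for portability; PostgreSQL and MySQL expose the same information through information_schema.columns:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        VARCHAR(100) NOT NULL,
    email       VARCHAR(100)
);
""")

def describe_table(conn: sqlite3.Connection, table: str) -> str:
    """Render a table's columns in a prompt-friendly format."""
    lines = [f"Table: {table}"]
    for cid, name, col_type, notnull, default, pk in conn.execute(
        f"PRAGMA table_info({table})"
    ):
        flags = []
        if pk:
            flags.append("PRIMARY KEY")
        if notnull:
            flags.append("NOT NULL")
        suffix = f", {', '.join(flags)}" if flags else ""
        lines.append(f"  - {name} ({col_type}{suffix})")
    return "\n".join(lines)

print(describe_table(conn, "customers"))
```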

Here is what a detailed prompt might look like at the start of the conversation:

 

I need help creating an SQL query for customer purchase analysis. Here's my database context:

Technical Environment:
- Database: PostgreSQL 13.4
- Client: pgAdmin 4 (version 6.8)
- Connection: Direct connection through the pgAdmin interface

Database Structure:
- Database name: retail_analytics

Tables and Columns:
1. customers
   - customer_id (INT, PRIMARY KEY)
   - first_name (VARCHAR(50), NOT NULL)
   - last_name (VARCHAR(50), NOT NULL)
   - email (VARCHAR(100), UNIQUE)
   - signup_date (DATE)
   - status (VARCHAR(20), CHECK (status IN ('active', 'inactive', 'suspended')))

2. orders
   - order_id (INT, PRIMARY KEY)
   - customer_id (INT, FOREIGN KEY references customers.customer_id)
   - order_date (TIMESTAMP, NOT NULL)
   - total_amount (DECIMAL(10,2))
   - payment_method (VARCHAR(20))
   - status (VARCHAR(20), CHECK (status IN ('pending', 'processing', 'shipped', 'delivered', 'canceled')))

3. order_items
   - order_id (INT, part of COMPOSITE PRIMARY KEY)
   - product_id (INT, part of COMPOSITE PRIMARY KEY)
   - quantity (INT, NOT NULL)
   - unit_price (DECIMAL(10,2), NOT NULL)

4. products
   - product_id (INT, PRIMARY KEY)
   - name (VARCHAR(100), NOT NULL)
   - category (VARCHAR(50))
   - price (DECIMAL(10,2), NOT NULL)
   - in_stock (BOOLEAN, DEFAULT true)

Table Relationships:
- customers.customer_id → orders.customer_id (one-to-many)
- orders.order_id → order_items.order_id (one-to-many)
- products.product_id → order_items.product_id (one-to-many)

Indexes:
- orders.order_date is indexed for date-range query optimization
- orders.customer_id is indexed to speed up customer-based lookups
- products.category has a non-unique index

I need a query that identifies our top 10 most valuable customers based on total purchase amount in the last 6 months, showing their purchase frequency, average order value, and preferred product categories.
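For comparison, one query answering that final request might look like the sketch below. This is an illustrative answer, not verified AI output: it is adapted to SQLite so it runs anywhere (PostgreSQL would use INTERVAL '6 months' for the date window), it uses the schema from the prompt above, and it omits the preferred-category column, which would need an extra join through order_items and products:

```python
import sqlite3

# Build a tiny stand-in for the retail_analytics schema with sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY,
                        first_name TEXT, last_name TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER,
                     order_date TEXT, total_amount REAL, status TEXT);
INSERT INTO customers VALUES (1, 'John', 'Smith'), (2, 'Sarah', 'Johnson');
INSERT INTO orders VALUES
  (10, 1, date('now', '-1 month'),  250.00, 'delivered'),
  (11, 1, date('now', '-2 months'), 150.00, 'delivered'),
  (12, 2, date('now', '-1 month'),   90.00, 'delivered'),
  (13, 2, date('now', '-8 months'), 999.00, 'delivered');  -- outside the window
""")

TOP_CUSTOMERS_SQL = """
SELECT c.customer_id,
       c.first_name || ' ' || c.last_name AS customer,
       COUNT(o.order_id)                  AS purchase_frequency,
       SUM(o.total_amount)                AS total_spent,
       ROUND(AVG(o.total_amount), 2)      AS avg_order_value
FROM customers c
JOIN orders o ON o.customer_id = c.customer_id
WHERE o.order_date >= date('now', '-6 months')
  AND o.status != 'canceled'
GROUP BY c.customer_id
ORDER BY total_spent DESC
LIMIT 10;
"""

for row in conn.execute(TOP_CUSTOMERS_SQL):
    print(row)
```

Note how every table name, column name, and status value in the query comes straight from the context in the prompt; without that context, each one is an opportunity for the AI to hallucinate.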


Include Data Samples and Patterns

The structure alone often isn't sufficient; the AI also needs to understand the actual data patterns. You may have variations in how the same thing is represented in a column, such as "Vanderbilt University" vs. "VU" both standing for Vanderbilt University in a string column. A simple way to account for this is to give the AI sample rows, although it may be hard to fully sample all the variations in the data (we will cover other techniques for handling this later).

  • Representative Data: Share sample rows that illustrate typical data patterns. Sample prompt: "Here are a few sample rows from our customers table:
    customer_id: 1001, first_name: 'John', last_name: 'Smith', state: 'NY', signup_date: '2023-01-15'
    customer_id: 1002, first_name: 'Sarah', last_name: 'Johnson', state: 'California', signup_date: '2023/02/21'
    customer_id: 1003, first_name: 'Robert', last_name: 'Williams', state: 'TX', signup_date: '03-15-2023'"
  • Data Formats: Clarify how certain data is formatted, especially when variations exist. Sample prompt: "In our state column, we have inconsistent formats—some entries use two-letter codes like 'NY' and 'TX', while others use full state names like 'California'. Any query filtering by state needs to account for both formats." Sample prompt: "Our system has dates in three different formats: some are 'YYYY-MM-DD', others are 'MM/DD/YYYY', and some older records use 'DD-MM-YYYY'. This inconsistency needs to be handled in any date-based queries."
  • Data Anomalies: Mention any inconsistencies or special cases in your data that queries need to accommodate. Sample prompt: "Some customer records have NULL values in the email field, which should be treated as valid records, not errors. Additionally, about 5% of our order records show a total_amount of $0.00, which typically indicates a fully-refunded order that should be excluded from sales analyses." Sample prompt: "For product categories, we have a special value 'MISC' that was used before we implemented proper categorization. These products should be counted separately in any category-based analysis rather than grouped with current categories."

Understanding these data patterns is crucial because a query that works perfectly for one data format might fail entirely for another. For example, a query searching for customers in "Tennessee" would miss customers recorded as "TN", while a query filtering dates with a specific format would exclude dates stored differently. By sharing these details, you enable the AI to generate queries that handle all the variations and special cases in your actual data.

From Context to Accurate Queries

When you provide comprehensive context, generative AI can shift from guessing to reasoning. Instead of hallucinating possible database structures, it can focus on crafting queries that precisely match your actual environment.

This context-setting process may seem time-consuming, but it's far more efficient than debugging mysterious errors or unexpected results. Consider it an investment that pays dividends in the form of accurate, effective queries that work the first time.

Review of Contextual Prompt Structures

  1. Environment-specific prompt: "I'm using MySQL 8.0 with the MySQL Workbench client version 8.0.28. Generate a query that will..."
  2. Structure-sharing prompt: "Here's my database schema: [schema details]. I need a query that connects the Customers table to the Orders table using the customer_id field which is an integer in both tables."
  3. Data pattern prompt: "In my database, dates are stored in 'YYYY-MM-DD' format, and state codes are two-letter abbreviations like 'CA' and 'NY'. Generate a query that finds all transactions from California in January 2023."
  4. Client limitation prompt: "I'm using DBeaver version 22.3 to connect to a PostgreSQL 14 database. I need a query that creates a new table, but my client doesn't support PostgreSQL-specific metacommands like \c."
  5. Examples in prompt: "My query is returning incorrect results. Here's my table structure: [structure details]. Here are some sample rows: [data samples]. I expected to see [expected result], but instead I'm getting [actual result]."

Remember that generative AI's effectiveness with databases is directly proportional to the quality and completeness of the context you provide. By thoroughly establishing database context before asking for queries, you can harness AI's power while avoiding the pitfalls of hallucination.