Querying AWS Athena Using Natural Language with Llamaindex and ChatGPT

Terris Linenbach
3 min read · Jan 6, 2024

Let’s use GPT programmatically to query, in natural language, virtually any data that is accessible to AWS Athena (CSV files, SQL databases, CloudWatch Logs, …). This solution uses LlamaIndex. There is also a LangChain version.

Can you hear me now, Data?

The questions don’t need to refer to table or column names. GPT-4 does its best to answer, and the results are usually pretty good. For example, if your data is about sales, you can ask “Which customers most recently made a purchase exceeding $2 million?” Perhaps you saw a similar demo (using Amazon Q) at re:Invent 2023, except it used a Bedrock model (probably Titan).

First, install Python 3.

Install dependencies and specify your OpenAI API key:

python -m ensurepip
pip install --upgrade pip setuptools
pip install llama-index "PyAthena[SQLAlchemy]"

export OPENAI_API_KEY=YOUR-KEY-HERE

Finally, modify and run the following script. You can ignore the deprecation warning below, which is the result of PyAthena remaining compatible with SQLAlchemy 1.x:

SADeprecationWarning: The dbapi() classmethod on
dialect classes has been renamed to import_dbapi().
... backwards compatibility.

# Inspiration from https://awstip.com/llamaindex-with-structured-data-rds-and-s3-an-alternative-to-quicksight-q-e7441633ce74
from urllib.parse import quote_plus

from sqlalchemy import create_engine, MetaData
from llama_index import SQLDatabase
from llama_index.indices.struct_store import NLSQLTableQueryEngine

AWS_REGION = "us-west-2" # Change me
SCHEMA_NAME = "schema-name" # Change me -- Athena calls this a database name
S3_STAGING_DIR = "s3://bucket/temp-athena-queries" # Change me
QUESTION = "Which table has the most rows" # Change me

connect_str = "awsathena+rest://athena.{region_name}.amazonaws.com:443/{schema_name}?s3_staging_dir={s3_staging_dir}"

engine = create_engine(connect_str.format(
    region_name=AWS_REGION,
    schema_name=SCHEMA_NAME,
    s3_staging_dir=quote_plus(S3_STAGING_DIR),
))

# Reflect the Athena schema so we know which tables exist
metadata_obj = MetaData()
metadata_obj.reflect(engine)

sql_database = SQLDatabase(engine)

# Pass table names (strings) to the query engine
table_names = list(metadata_obj.tables.keys())

query_engine = NLSQLTableQueryEngine(
    sql_database=sql_database,
    tables=table_names,
)

# See https://docs.llamaindex.ai/en/stable/examples/index_structs/struct_indices/duckdb_sql_query.html
response = query_engine.query(QUESTION)

# The response metadata includes the generated SQL statement
print(response.metadata)

# The natural-language answer
print(response)
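
If you only want the generated SQL, for example to hand it to PyAthena later, it is exposed in the metadata dictionary. A minimal sketch, assuming the key is named sql_query as in LlamaIndex’s text-to-SQL examples:

# Sketch: extract just the generated SQL from the response metadata.
# Assumption: the key is "sql_query", as in LlamaIndex's text-to-SQL examples.
generated_sql = response.metadata.get("sql_query")
print("Generated SQL:", generated_sql)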

Overkill?

Although this solution is slow and OpenAI isn’t free (a task for readers: use Mixtral instead; a sketch follows), what we get for writing virtually no code is astonishing. All those tips and tricks in our heads don’t have the half-life they used to.
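
Here is roughly what that swap could look like. This is an untested sketch: it assumes the llama-index 0.9.x API and a local Ollama server with the mixtral model already pulled, and it reuses sql_database, table_names, and QUESTION from the script above.

# Sketch: swap OpenAI for a locally served Mixtral via Ollama.
# Assumptions: llama-index 0.9.x and an Ollama server with `ollama pull mixtral` done.
from llama_index import ServiceContext
from llama_index.llms import Ollama

llm = Ollama(model="mixtral", request_timeout=120.0)
# Note: the default embed_model may still resolve to OpenAI; NL-to-SQL itself doesn't use embeddings.
service_context = ServiceContext.from_defaults(llm=llm)

query_engine = NLSQLTableQueryEngine(
    sql_database=sql_database,
    tables=table_names,
    service_context=service_context,
)
print(query_engine.query(QUESTION))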

Of course, complex questions involving multiple queries would require more advanced chains, for now: asking AI to generate a LlamaIndex program from a high-level description is within the realm of feasibility.

For queries that are not entered by humans, GPT-4 is overkill. Just use PyAthena directly; the program above conveniently outputs the SQL statement so you can reuse it:

import pandas as pd
from pyathena import connect

conn = connect(s3_staging_dir='s3://YOUR_S3_BUCKET/path/to/',
               region_name='YOUR_AWS_REGION')

cursor = conn.cursor()
query = "SELECT * FROM your_table LIMIT 10"
cursor.execute(query)

# Include column names from the DB-API cursor description
df = pd.DataFrame(cursor.fetchall(),
                  columns=[col[0] for col in cursor.description])
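
If a DataFrame is the end goal anyway, PyAthena also ships a PandasCursor that returns one directly. A minimal sketch, assuming the pandas extra is installed (pip install "PyAthena[Pandas]"):

# Sketch: PyAthena's PandasCursor returns a DataFrame directly.
# Assumption: installed with the pandas extra, e.g. pip install "PyAthena[Pandas]"
from pyathena import connect
from pyathena.pandas.cursor import PandasCursor

cursor = connect(s3_staging_dir='s3://YOUR_S3_BUCKET/path/to/',
                 region_name='YOUR_AWS_REGION',
                 cursor_class=PandasCursor).cursor()

df = cursor.execute("SELECT * FROM your_table LIMIT 10").as_pandas()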

RAG?

This is a RAG solution in the sense that it generates SQL queries to retrieve data, which then augments the prompt used to produce a human-readable response. However, RAG typically involves generating embeddings up front and keeping them in sync as the data changes. Such embeddings could be used to locate data elements via vector search instead of SQL, but that is a very hard problem to solve for an arbitrarily complex schema with a large amount of data.
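
That said, LlamaIndex offers a middle ground: embed the table schemas (not the data) up front and retrieve only the relevant tables before generating SQL. A minimal sketch, assuming the llama-index 0.9.x object-index API and reusing sql_database from the script above:

# Sketch: embed table schemas so only relevant tables are sent to the LLM.
# Assumptions: llama-index 0.9.x API; sql_database comes from the script above.
from llama_index import VectorStoreIndex
from llama_index.indices.struct_store.sql_query import SQLTableRetrieverQueryEngine
from llama_index.objects import ObjectIndex, SQLTableNodeMapping, SQLTableSchema

table_node_mapping = SQLTableNodeMapping(sql_database)
table_schema_objs = [SQLTableSchema(table_name=name)
                     for name in sql_database.get_usable_table_names()]

obj_index = ObjectIndex.from_objects(table_schema_objs, table_node_mapping, VectorStoreIndex)
query_engine = SQLTableRetrieverQueryEngine(sql_database, obj_index.as_retriever(similarity_top_k=3))
print(query_engine.query("Which table has the most rows?"))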
