Coaxing GenAI to Create Grid Tables in Pandoc Markdown
Large language models can generate rich grid tables in several formats, including HTML and Pandoc’s Markdown extensions. This article focuses on the latter because Markdown is easier for humans to read and include in prompts.
Unfortunately, today’s LLMs struggle with Pandoc’s strict Markdown grid table format, necessitating retries. But how do we know when the generated Markdown is valid?
Grid Tables
Grid tables are more advanced than standard Markdown tables. Like HTML tables, they support cells that span multiple rows as well as multiple columns.
The following example grid table contains no data and can be used as a template:
+---------------------+:---------------------:+
| Location            | Temperature 1961-1990 |
|                     | in degree Celsius     |
|                     +-------+-------+-------+
|                     | min   | mean  | max   |
+=====================+=======+=======+=======+
|                     |       |       |       |
+---------------------+-------+-------+-------+
Here’s the same table containing two rows of data:
+---------------------+:---------------------:+
| Location            | Temperature 1961-1990 |
|                     | in degree Celsius     |
|                     +-------+-------+-------+
|                     | min   | mean  | max   |
+=====================+=======+=======+=======+
| Antarctica          | -89.2 | N/A   | 19.8  |
+---------------------+-------+-------+-------+
| Earth               | -89.2 | 14    | 56.7  |
+---------------------+-------+-------+-------+
You can test this template via the Try Pandoc! web page. Add some values to the empty cells, click the download link, and open the downloaded HTML file in your browser. Also try generating a Docx file.
Docx output doesn’t center the Temperature 1961–1990 header. This seems to be a bug.
Pandoc’s Grid Table Markdown Format
The layout of a Pandoc grid table is defined using just five characters:
+ - = : |
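Colons in a column’s border set the alignment of the values in that column: left-aligned (one colon on the left side, after the +), right-aligned (one colon on the right side, before the +), or centered (a colon on both sides). A column without colons gets the default alignment of the output format.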
In the above example, the “Temperature 1961–1990 …” heading is specified to be centered by the colons at both ends of its top border:
+---------------------+:---------------------:+
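The colon convention can be read mechanically from a border line. The following is a minimal sketch, not part of Pandoc: the function name column_alignments is made up for illustration, and it assumes the border line contains only +, -, =, and : characters.

```python
def column_alignments(border: str) -> list[str]:
    """Read the alignment of each column from a grid table border line."""
    alignments = []
    # Drop the outer + characters, then split into one segment per column.
    for segment in border.strip().strip("+").split("+"):
        left = segment.startswith(":")
        right = segment.endswith(":")
        if left and right:
            alignments.append("center")
        elif right:
            alignments.append("right")
        elif left:
            alignments.append("left")
        else:
            alignments.append("default")
    return alignments


print(column_alignments("+---------------------+:---------------------:+"))
```

For the template's top border, the first column has no colons and the second has one on each side, so the sketch reports the default alignment followed by centered.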
Pandoc Grid Table Markdown Rules
Pandoc converts Markdown to other formats such as HTML and Docx.
The basic requirements for grid table Markdown are:
- Each line must have the same length
- The first and last lines must not contain |
- Each line must begin and end with either + or |
- Each | must have either a + or | above and below it
These rules are not comprehensive — good luck finding them! However, they are sufficient for validating the output of large language models, particularly GPT-4o.
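The four rules above translate fairly directly into code. Here is a minimal sketch of such a check; the name grid_table_errors is hypothetical (it is not the validate_grid_table_markup() function discussed later), and it assumes cell contents never contain a literal | character.

```python
def grid_table_errors(table: str) -> list[str]:
    """Return a list of rule violations; an empty list means the table passed."""
    lines = table.strip("\n").splitlines()
    if not lines:
        return ["table is empty"]
    errors = []
    width = len(lines[0])
    for number, line in enumerate(lines, start=1):
        # Rule 1: each line must have the same length.
        if len(line) != width:
            errors.append(f"line {number}: length {len(line)}, expected {width}")
        # Rule 3: each line must begin and end with either + or |.
        if not line or line[0] not in "+|" or line[-1] not in "+|":
            errors.append(f"line {number}: must begin and end with + or |")
    # Rule 2: the first and last lines must not contain |.
    if "|" in lines[0] or "|" in lines[-1]:
        errors.append("first and last lines must not contain |")
    # Rule 4: each | must have either a + or | directly above and below it.
    for row, line in enumerate(lines):
        for col, char in enumerate(line):
            if char != "|":
                continue
            for other in (row - 1, row + 1):
                exists = 0 <= other < len(lines) and col < len(lines[other])
                if not exists or lines[other][col] not in "+|":
                    errors.append(f"line {row + 1}, column {col + 1}: misaligned |")
                    break
    return errors
```

Returning a list of messages rather than a boolean makes it easier to log why a generated table was rejected before retrying.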
Prompting GenAI to Create Pandoc Grid Table Markdown Documents
Grid table Markdown documents can be generated by LLM APIs in three steps:
1. Create a prompt containing some instructions, a grid table template, and data to add to the table:
Use DATA to create a Pandoc Markdown grid table.
Only output the table.
Create a grid table using Pandoc Markdown using the following template. Rules:
- Each line must have the same length
- Vertically align | and + characters
- Output \ characters as-is
+---------------------+-----------------------+
| Location            | Temperature \         |
|                     | 1961-1990             |
|                     | in degree Celsius     |
|                     +-------+-------+-------+
|                     | min   | mean  | max   |
+=====================+=======+=======+=======+
|                     |       |       |       |
+---------------------+-------+-------+-------+
DATA:
<add input data here (in JSON, for example)>
- The instruction “Only output the table” keeps the model, most of the time anyway, from writing an introductory sentence about the table, something it otherwise does intermittently. You could instead instruct it to add an introduction if desired.
- The instruction “Output \ characters as-is” is needed because LLMs intermittently drop backslashes. As a demonstration, I added a backslash after ‘Temperature’ in the top-right header. Pandoc converts a backslash at the end of a line in a grid table cell into a line break.
2. Validate the output
3. If the returned grid table Markdown is invalid, go to 1
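The three steps can be sketched as a small retry loop. This is an illustration under stated assumptions, not a definitive implementation: build_prompt and generate_grid_table are hypothetical names, generate stands in for whatever LLM API call you use (it takes a prompt and returns text), and is_valid is any validation function that returns True for an acceptable table. A retry limit is added so the loop cannot run forever.

```python
import json
from typing import Callable, Optional


def build_prompt(template: str, data: object) -> str:
    """Step 1: combine the instructions, the grid table template, and the data."""
    return "\n".join([
        "Use DATA to create a Pandoc Markdown grid table.",
        "Only output the table.",
        "Create a grid table using Pandoc Markdown using the following template. Rules:",
        "- Each line must have the same length",
        "- Vertically align | and + characters",
        "- Output \\ characters as-is",
        template,
        "DATA:",
        json.dumps(data, indent=2),
    ])


def generate_grid_table(
    generate: Callable[[str], str],
    is_valid: Callable[[str], bool],
    prompt: str,
    max_attempts: int = 3,
) -> Optional[str]:
    """Steps 2 and 3: validate the output and retry until it passes."""
    for _ in range(max_attempts):
        table = generate(prompt)
        if is_valid(table):
            return table
    return None  # every attempt failed validation
```

Passing generate and is_valid in as parameters keeps the loop independent of any particular model provider and makes it easy to test with a stub.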
Why use an LLM for this?
It’s not because I like using gold-plated bulldozers to swat flies!
Grid tables are more sophisticated (or troublesome, if you have to write the code) than meets the eye. Consider the case in which a cell’s contents are too big to fit in a single line. Long lines should be word-wrapped, which requires adding additional “rows” without adding grid lines for them. For example, let’s replace “Antarctica” with “Antarctica is a cold place.” Should we make the Location column wider or maintain the columns’ proportions specified in the template? Here’s what GPT-4o outputs:
+---------------------+:---------------------:+
| Location            | Temperature 1961-1990 |
|                     | in degree Celsius     |
|                     +-------+-------+-------+
|                     | min   | mean  | max   |
+=====================+=======+=======+=======+
| Antarctica is a     | -89.2 | N/A   | 19.8  |
| cold place          |       |       |       |
+---------------------+-------+-------+-------+
| Earth               | -89.2 | 14    | 56.7  |
+---------------------+-------+-------+-------+
This is exactly what we want, rather than:
+----------------------------+:---------------------:+
| Location                   | Temperature 1961-1990 |
|                            | in degree Celsius     |
|                            +-------+-------+-------+
|                            | min   | mean  | max   |
+============================+=======+=======+=======+
| Antarctica is a cold place | -89.2 | N/A   | 19.8  |
+----------------------------+-------+-------+-------+
| Earth                      | -89.2 | 14    | 56.7  |
+----------------------------+-------+-------+-------+
I’ll wait while you code this behavior. Be sure to support every possible grid table template.
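To give a sense of the work involved, here is a sketch of only the simplest case: word-wrapping one body row that has no spanning cells. The function render_row is hypothetical, and it assumes each width is the number of characters between two | characters and is at least 3. Headers, spans, alignment colons, and borders are all left out.

```python
import textwrap


def render_row(cells: list[str], widths: list[int]) -> list[str]:
    """Render one body row, wrapping long cell text onto extra lines."""
    # Leave one space of padding on each side of the cell text.
    wrapped = [textwrap.wrap(cell, width=w - 2) or [""] for cell, w in zip(cells, widths)]
    height = max(len(chunks) for chunks in wrapped)
    lines = []
    for i in range(height):
        parts = []
        for chunks, w in zip(wrapped, widths):
            text = chunks[i] if i < len(chunks) else ""
            parts.append(" " + text.ljust(w - 1))
        lines.append("|" + "|".join(parts) + "|")
    return lines


for line in render_row(["Antarctica is a cold place", "-89.2", "N/A", "19.8"], [21, 7, 7, 7]):
    print(line)
```

With the template's column widths, the long location wraps onto a second line while the other cells stay blank on that line, which matches the layout GPT-4o chose above.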
Perhaps a better strategy is to have an LLM write the code, or to search PyPI and similar package repositories for a suitable library.
Using LLMs is overkill for most use cases. It’s slow and can cost money when using cloud-hosted models.
I can’t say why I’ve been using GPT-4o to create grid tables, but it’s been fun and productive. I’m happy to share what I have learned along the way.
Troubleshooting Consistent Data-Induced Failures
Models tend to maintain the column proportions specified in the provided grid table template. For example, in the above table, the Location column is considerably wider than the min, mean, and max columns. But sometimes LLMs are unable to fill table cells with data while maintaining these proportions, resulting in undesired word wrapping or invalid grid tables. The solution to this problem is to change the template and make some, if not all, of the columns wider.
Close, But No Grid Table
GenAI’s inherent randomness inevitably produces grid tables that fail validation. For example, the following output is from GPT-4o (August 2024 version) in mid-September 2024. The second line contains an extra space.
+---------------------+-----------------------------+
| Location             | Temperature 1961-1990 in °C |
|                     +---------+---------+---------+
|                     | min     | mean    | max     |
+=====================+=========+=========+=========+
|                     |         |         |         |
+---------------------+---------+---------+---------+
This type of failure will occur occasionally regardless of the temperature and top-p settings provided. As of September 2024, it’s impossible to prompt LLMs to create perfect grid tables 100% of the time. However, common failures like these can be fixed automatically, thus reducing the need for retries considerably.
Working Around GenAI’s Foibles
The function validate_grid_table_markup() adds or removes -, =, and spaces at the end of each line in a grid table that has the wrong length. For example, the table:
+:------------:+:--------------------:+
| Heading 1 | Heading 2 |
+---+----+-----+----+------+--------------:+
| A | B | C | D | E | ABC |
+---+----+-----+----+------+------------+
can be transformed to:
+:------------:+:-------------------:+
| Heading 1 | Heading 2 |
+---+----+-----+----+------+--------:+
| A | B | C | D | E | ABC |
+---+----+-----+----+------+---------+
However, the following table cannot be repaired because the Heading 2 column is too narrow to fit the value in the bottom right cell:
+:------------:+:--------------------:+
| Heading 1 | Heading 2 |
+---+----+-----+----+------+---------:+
| A | B | C | D | E | FFFFFFFFFFFFFFFF |
+---+----+-----+----+------+----------+
A grid table’s maximum line length is determined via the first line of the table, specifically:
+:------------:+:--------------------:+
Necessary Complexity
The last grid table presented in the previous section can be repaired by changing each line, if necessary, to conform to the length of the table’s longest line. However, drastic changes to column widths can produce undesired results and therefore should be avoided.
That said, experience with GPT-4o has shown that the rightmost columns in generated grid table Markdown often need to be widened by a single character. GPT-4o intermittently fills these columns with one too many characters.
It is possible to work around this quirk in today’s public models. In other words, the table:
+-----+----------------------:+
| Red | Little Red Riding Hood|
+-----+-------------------------------+
can be fixed by widening the rightmost column by one character:
+-----+----------------------:+
| Red | Little Red Riding Hood|
+-----+-----------------------+
This fix should be performed only when at least one line in the grid table is longer than the first. However, validate_grid_table_markup() doesn’t check this condition because adding only one space could negatively impact the layout of extra wide tables, and it’s better to find out immediately instead of eventually. Therefore, this feature is optional.
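For illustration, here is one way the optional widening step could look, including the condition that at least one line is longer than the first. This is a sketch under assumptions, not the article's implementation: widen_rightmost_column is a hypothetical name, and it returns None when no line overflows or when a line overflows by more than the allowed amount.

```python
from typing import Optional


def widen_rightmost_column(table: str, extra: int = 1) -> Optional[str]:
    """Widen the last column by `extra` characters if a line overflows the first."""
    lines = table.strip("\n").splitlines()
    if not lines or not all(lines):
        return None
    first = len(lines[0])
    longest = max(len(line) for line in lines)
    # Apply the fix only when some line is longer than the first,
    # and by no more than `extra` characters.
    if not first < longest <= first + extra:
        return None
    target = first + extra
    widened = []
    for line in lines:
        body, last = line[:-1], line[-1]
        pad = target - len(line)
        if last == "+":
            fill = "=" if "=" in body else "-"
            if body.endswith(":"):
                # Insert the filler before the trailing alignment colon.
                body = body[:-1] + fill * pad + ":"
            else:
                body += fill * pad
        else:
            body += " " * pad
        widened.append(body + last)
    return "\n".join(widened)
```

Keeping this step separate from the length repair makes it easy to switch off for extra wide tables, where the added character may not be welcome.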