Large language models can process small spreadsheets, but very large ones often exceed their input-length limits. Researchers devised a method that compresses large spreadsheets so LLMs can answer questions about them.
What’s new: Yuzhang Tian, Jianbo Zhao, and colleagues at Microsoft proposed SheetCompressor, a way to represent spreadsheets that enables LLMs to identify and request the parts they need to answer specific questions.
Key insight: Most spreadsheets can be broken down into a set of tables that may be bordered by visual dividers like thick lines or empty rows and/or columns. But detecting these tables isn’t trivial, since tables may contain the same kinds of markers internally. (See the illustration above, in which tables are denoted by red dashes.) To answer many questions, you don’t need the whole spreadsheet, only the relevant table. Moreover, given a question, an LLM can recognize the table it needs to produce an answer. However, to identify the correct table, the model must see the whole spreadsheet, which may exceed its input context window, and it must parse the tables, which may not be clearly separated. The solution is to compress the spreadsheet, feed the compressed representation to the LLM along with the question, and ask the LLM to identify the boundaries of the table it needs to answer the question. Then, given an uncompressed version of that table, the LLM can produce an answer.
How it works: The authors built software that prepared spreadsheets by (i) parsing them into tables and (ii) compressing them while maintaining the table structure. Then they fine-tuned LLMs to detect tables in the compressed spreadsheets and prompted the fine-tuned LLMs to identify the tables relevant to a given question.
- Given a spreadsheet, the authors removed rows and columns that weren’t near likely table boundaries defined by empty cells, thick lines, changes in color, and so on.
- To compress a parsed spreadsheet, they represented each table as a JSON dictionary, using cell values as dictionary keys and cell addresses as dictionary values. (This reduces the sequence length, since duplicate cell values share a single dictionary key.) To compress it further, within each table, they detected types of values (for instance temperature, age, percentage, and so on) and merged adjacent cells that shared the same type into a single dictionary key that represented the type rather than the values. For example, dates that appear in the same column merge into a single entry: {"yyyy-mm-dd" : <cell addresses>}. (A minimal sketch of this encoding appears after this list.)
- They compressed a dataset of spreadsheets with annotated table boundaries according to this method. They used the compressed dataset to fine-tune GPT-4, Llama 3, and other LLMs to detect tables within compressed spreadsheets.
- Inference was a two-step process (sketched below): (i) prompt the LLM, given a compressed spreadsheet and a question, to output the boundaries of the table(s) most relevant to the question, and (ii) prompt the LLM, given an uncompressed version of the relevant table(s), to answer the question.
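The inverted-index encoding can be illustrated in a few lines of Python. The snippet below is a minimal sketch, not the authors' code: it assumes cells arrive as an address-to-value dictionary, uses a crude regular expression as a stand-in for the paper's value-type detectors, and skips further optimizations such as merging addresses into ranges.

```python
import json
import re
from collections import defaultdict

# Crude stand-in for a value-type detector (here, ISO-style dates only).
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def compress(cells: dict[str, str]) -> str:
    """Build an inverted index: cell values (or type tokens) map to addresses."""
    index = defaultdict(list)
    for address, value in cells.items():
        # Cells whose values share a type collapse to one format-string key,
        # so many distinct dates cost only one dictionary entry.
        key = "yyyy-mm-dd" if DATE_RE.match(value) else value
        index[key].append(address)
    return json.dumps(index)

# Duplicate headers and same-typed dates collapse to just two keys.
sheet = {"A1": "Date", "B1": "Date",
         "A2": "2024-05-01", "A3": "2024-05-02", "A4": "2024-05-03"}
print(compress(sheet))
# {"Date": ["A1", "B1"], "yyyy-mm-dd": ["A2", "A3", "A4"]}
```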
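The two-step inference loop can be sketched the same way. Everything here is hypothetical: `chat` stands in for a call to whichever fine-tuned LLM is in use, the prompt wording is invented, and `slice_range` is a small helper for pulling the identified table back out of the raw sheet.

```python
import re

def chat(prompt: str) -> str:
    """Placeholder for a call to the fine-tuned LLM of your choice."""
    raise NotImplementedError

def _addr(address: str) -> tuple[int, int]:
    """Convert an address like 'B3' into (column, row) numbers."""
    letters, row = re.match(r"([A-Z]+)(\d+)", address).groups()
    col = 0
    for ch in letters:
        col = col * 26 + ord(ch) - ord("A") + 1
    return col, int(row)

def slice_range(cells: dict[str, str], cell_range: str) -> dict[str, str]:
    """Keep only the cells inside a range such as 'A1:D20'."""
    (c1, r1), (c2, r2) = (_addr(a) for a in cell_range.split(":"))
    return {a: v for a, v in cells.items()
            if c1 <= _addr(a)[0] <= c2 and r1 <= _addr(a)[1] <= r2}

def answer(compressed: str, cells: dict[str, str], question: str) -> str:
    # Step 1: the compressed sheet fits in context; ask for the relevant table.
    cell_range = chat(f"Spreadsheet: {compressed}\n"
                      f"Which cell range is needed to answer: {question}")
    # Step 2: answer from the uncompressed version of just that table.
    table = slice_range(cells, cell_range)
    return chat(f"Table: {table}\nQuestion: {question}")
```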
Results: The authors compared the fine-tuned LLMs’ ability to detect tables in spreadsheets that were compressed using their method and in their original uncompressed form. They fed the models spreadsheets of various sizes that ranged from small (up to 4,000 tokens) to huge (more than 32,000 tokens). They gauged the models’ performance according to F1 score (higher is better).
- Small spreadsheets: Fed compressed spreadsheets, the fine-tuned Llama 3 achieved 83 percent F1 score, and the fine-tuned GPT-4 achieved 81 percent F1 score. By contrast, fed uncompressed spreadsheets, Llama 3 achieved 72 percent F1 score, and GPT-4 achieved 78 percent F1 score.
- Huge spreadsheets: Fed compressed spreadsheets, the fine-tuned Llama 3 achieved 62 percent F1 score, and the fine-tuned GPT-4 achieved 69 percent F1 score. Fed uncompressed spreadsheets, both models achieved 0 percent F1 score.
- Answering questions: The authors also tested the fine-tuned models on their own dataset of questions about 64 spreadsheets that spanned the same range of sizes, posing questions that involved fundamental tasks like searching, comparing, and basic arithmetic. Fed compressed spreadsheets, the fine-tuned GPT-4 achieved 74 percent accuracy on zero-shot question answering. Fed uncompressed spreadsheets, it achieved 47 percent accuracy.
Why it matters: By giving LLMs the ability to detect a spreadsheet’s functional components, this approach enables them to process a wide variety of spreadsheets regardless of their size and complexity.
We’re thinking: When considering the strengths of LLMs, we no longer have to take spreadsheets off the table.