For builders of interactive AI like ChatGPT, this new method makes Large Language Models economical

Karthik Dinakar

Pienso's interactive generative AI platform makes LLMs harnessable and consumable by enterprises.

A seismic shift is well underway in the world of large language models.

For those of us who have been immersed in large language models (LLMs) for the past couple of years, the public reception of LLMs since the arrival of ChatGPT on the world stage feels like an arangetram of sorts, like a first solo piano concert after years and years of training.

Interactive AI is the elegant coupling of Deep Learning with intuitive, responsive user interfaces.

A user interacts with the underlying machine learning algorithm using powerful user interfaces, where every user interaction guides the machine learning algorithm running in the background.

At Pienso, we are passionate about Interactive AI Training. Why? A compelling and valuable machine learning model requires careful and often subtle insights about the data: insights that come not directly from the data but from a nuanced understanding of how its owners interpret it.

Data owners have tacit and subtle insights about their domain in a way that is hard to represent merely by data labeling. For example, a cardiologist’s perspective on a research article about angina will likely be more nuanced than that of a deep learning engineer. But the cardiologist may not have a computer science and deep learning background, and therefore relies on a data science team to build a model for her – a lossy, time-consuming, expensive, and often frustration-inducing process.

The same people who need AI models should train them.

But the status quo is one of barriers and impediments. The technical nature of the deep learning process (coding, data cleansing, data labeling, model training, and model deployment, a body of work increasingly referred to as “ML Ops”) has typically blocked domain data owners from doing their own AI modeling and surfacing their own AI-derived insights.

What’s needed is Interactive AI: powerful under-the-hood AI cloaked in an approachable, point-and-click interface, allowing data owners to embed their insights into the underlying AI without needing to code or have any experience with deep learning.

This is precisely what we have built at Pienso.

Enter ChatGPT

On November 30, 2022, OpenAI released ChatGPT. While the text-based interactive component (the “Chat” part of ChatGPT) grabbed headlines and our imaginations, it’s the underlying LLM that deserves the spotlight.

Why? Because massive things are heavy and hard (and costly) to move.

And LLMs are massive. They’ve been trained on gargantuan amounts of text data (think: most of the internet), giving them enormous capacities for a variety of text processing tasks that can supercharge productivity.

Organizations can “fine-tune” these generic LLMs by training them with additional text data that’s specific to their own domain. Business users may be interested in adapting a generic LLM to help understand customer intention, feedback, and emotions in real-time. Physicians may train an LLM to help structure their medical notes.

While LLMs can be powerful when fine-tuned to a particular business area, they are computationally expensive to train and deploy for ongoing inference — and they require large numbers of costly hardware accelerators to function.

Building intuitive interfaces to equip Interactive AI for domain experts requires prompt responses to user input — no one has time to wait minutes, let alone hours, for their AI-powered analysis.

For LLMs to be adopted, fast and scalable inference is essential. Despite the huge size of these models, response times must be nearly instantaneous. This is what was so incredible about ChatGPT: the speed with which the gargantuan LLM underlying each interaction responded.

One way to increase inference speed from LLMs, and what OpenAI and Microsoft have done, is to add more hardware accelerators.

However, not only is this extremely expensive, there simply aren’t enough hardware accelerators available on the cloud providers where the LLMs are hosted to meet the need.

So what’s the alternative?

Introducing Packing

At Pienso, we improve LLMs’ efficiency and scalability by reducing their computational footprint.

One crucial technique is called packing.

First developed by Graphcore for faster training, packing is a method to drastically reduce the computational waste in arranging input text data — and in doing so, radically reduce the time it takes to pre-train, fine-tune, and infer from LLMs.

We have collaborated with Graphcore to productionize packing for large-scale and efficient inference for BERT-flavored LLMs in Pienso on Graphcore IPUs.

While there are a variety of techniques to make inference faster and more scalable, most current practices involve compressing the original LLM into a smaller, more efficient version.

But this lossy compression approach effectively emaciates the model, creating a withered version of the original and sacrificing inference quality for a smaller computational footprint.

Packing is different. We don’t diminish the original model through lossy compression. Instead, we minimize padding to reduce computational waste.

For example, suppose every input text document can contain a maximum of 512 tokens (roughly, words or word fragments). For every document with fewer than 512 tokens, the remaining positions are “padded” with dummy values so that every document is represented by exactly 512 tokens.

Very often, text documents have fewer than 512 tokens, so the final corpus of documents has a massive amount of dummy padding values. These contribute nothing to the training or inference process, but add plenty of weight.
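To make the cost of padding concrete, here is a minimal Python sketch that counts how many positions in a padded batch are real tokens and how many are dummy values. The function name `padding_stats` and the five document lengths are hypothetical, chosen only for illustration; they are not taken from any real corpus.

```python
MAX_LEN = 512  # maximum sequence length, as in the example above


def padding_stats(doc_lengths, max_len=MAX_LEN):
    """Return (real_tokens, pad_tokens, pad_fraction) when every
    document is padded (or truncated) to exactly max_len positions."""
    real = sum(min(n, max_len) for n in doc_lengths)
    total = len(doc_lengths) * max_len
    pad = total - real
    return real, pad, pad / total


# Hypothetical token counts for five short documents.
lengths = [30, 120, 64, 500, 200]
real, pad, frac = padding_stats(lengths)
print(f"real tokens: {real}, padding: {pad}, wasted: {frac:.1%}")
```

For these made-up lengths, roughly two thirds of the positions the model processes are padding.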

Packing is a method to “pack” tokens as tightly as possible, eliminating extraneous and unnecessary padding, and dramatically reducing the computational overhead required during training or inference.
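One simple way to picture packing is as a bin-packing problem: each sequence of 512 positions is a bin, and documents are placed together until a bin is full. The sketch below uses a greedy first-fit-decreasing heuristic, which is an assumption made for illustration and not necessarily the algorithm used by Graphcore or Pienso. The function name and the document lengths are hypothetical. A real implementation would also need to adjust attention masks and position information so that documents sharing a sequence do not influence one another; that part is omitted here.

```python
def pack_first_fit_decreasing(doc_lengths, max_len=512):
    """Greedily group documents so each group's total length <= max_len.

    Returns a list of packs; each pack is a list of document indices.
    """
    packs = []  # document indices in each pack
    free = []   # remaining free positions in each pack
    # Place the longest documents first; short ones fill the gaps later.
    order = sorted(range(len(doc_lengths)),
                   key=lambda i: doc_lengths[i], reverse=True)
    for i in order:
        n = doc_lengths[i]
        if n > max_len:
            raise ValueError(f"document {i} has {n} tokens, more than {max_len}")
        for p, space in enumerate(free):
            if n <= space:  # first pack with enough room
                packs[p].append(i)
                free[p] -= n
                break
        else:  # no existing pack had room: open a new one
            packs.append([i])
            free.append(max_len - n)
    return packs


# Hypothetical token counts for five short documents.
lengths = [30, 120, 64, 500, 200]
packs = pack_first_fit_decreasing(lengths)
padding = len(packs) * 512 - sum(lengths)
print(f"{len(lengths)} documents fit in {len(packs)} sequences, "
      f"{padding} padding positions remain")
```

With these made-up lengths, five padded sequences shrink to two packed ones, so the model processes far fewer dummy positions for the same documents.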

In production, packing means fine-tuning LLMs without delay or cost-prohibitive measures.

With packing, users can fine-tune a BERT-architecture LLM with increased speed. Since fine-tuning BERT-architecture models on IPUs is already faster than on GPUs, fine-tuning with packing on IPUs results in dramatic reductions in training time, making per-epoch times virtually instantaneous.

Packing for fast training reduces response times and promotes a richer Interactive AI experience. The faster one can train a model, the more time spent experimenting. And more experimentation leads to better models.

Equally important, packing allows for robust and efficient inference of LLMs.

Organizations with massive volumes of data to analyze on a daily basis will regularly use LLMs for inference. If they have deployed an LLM tuned with their historical data, they’ll want to analyze high-volume new data in real-time using the deployed LLMs for inference.

Since packing makes inference scalable and virtually instantaneous, using additional LLMs to explore the same data is possible.

For example, consider an enterprise with its own foundation LLM for customer relationship management. If they’d like to infer customer intention and feelings for every sentence spoken or written by the customer, and each sentence needs one LLM inference, 20 million daily customer sentences will require 20 million inference API calls.

If inference is slow, there is a constant backlog of data waiting to be analyzed. But with packing, inference is now an order of magnitude faster, without any compromise on quality.

What used to take hours can now be achieved in a few minutes.
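As a rough back-of-the-envelope illustration of the 20 million sentence example above, the sketch below estimates the smallest number of 512-token sequences needed if sentences are packed perfectly. The average of 25 tokens per sentence is a hypothetical figure, not a measurement, and the helper name is made up for this example. The comparison assumes a baseline in which each sentence is padded to 512 positions and sent as its own sequence; a real packer will need somewhat more sequences than this lower bound.

```python
def min_packed_sequences(num_sentences, avg_tokens, max_len=512):
    """Lower bound on the number of packed sequences: total tokens
    divided by the sequence length, rounded up."""
    total_tokens = num_sentences * avg_tokens
    return -(-total_tokens // max_len)  # integer ceiling division


SENTENCES_PER_DAY = 20_000_000  # from the example above
AVG_TOKENS = 25                 # hypothetical average sentence length

packed = min_packed_sequences(SENTENCES_PER_DAY, AVG_TOKENS)
print(f"unpacked sequences: {SENTENCES_PER_DAY:,}")
print(f"packed sequences (best case): {packed:,}")
print(f"best-case reduction: {SENTENCES_PER_DAY / packed:.1f}x")
```

Under these assumptions the number of sequences the model must process drops by more than an order of magnitude, which is the kind of saving that turns hours of inference into minutes.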

This near real-time ability to fetch inferences promotes new uses of AI, unlocking LLMs for mainstream use.

There is only user experience

As deep learning engineers, we pay attention to the quality of our models. We invest time and effort in making better-quality models that deliver accurate insights and generalize well to new data.

As Interactive AI makes it more and more likely that the person using AI is a domain data expert, not a deep learning engineer, user experience matters ever more.

A high-quality LLM does not deliver if the user experience is slow, frustrating, or otherwise discourages experimentation. Tedious training and sluggish inference will stymie AI adoption.

We must make both training and inference faster and more scalable — and do so economically. Packing is one promising method.

We’re pleased to productionize packing for BERT in our next Pienso release, and we’ll apply packing to new LLMs in 2023.

About the author

Karthik Dinakar is a computer scientist specializing in machine learning, natural language processing, and human-computer interaction. A Reid Hoffman fellow, Karthik holds a doctoral degree from the Massachusetts Institute of Technology. He is the Chief Technology Officer and co-founder of Pienso, an interactive deep learning company.