
Offline reinforcement learning can beat generative AI

This is a special kind of machine learning that’s better suited for tasks that require complex training

Photo: Shutterstock

While ChatGPT and GPT-4 are all the rage, there are other generative Artificial Intelligence (AI) models, both proprietary and open source, that are being honed to compete in the field, like Google's Bert and the open-source Bloom. While these models compete among themselves for supremacy in the generative AI space, let us remember that despite the hype, the field of AI is by no means circumscribed by generative AI alone. There are plenty of other AI models available to solve our problems. I would caution against obsessing over generative AI models while ignoring the others.

These other models use much smaller data sets to focus on a specific kind of problem: say, an engineering parameter such as the tensile strength of an electric scooter's front fork. Using generative AI for a problem like this would, in my view, be a waste of resources and time; worse, it would most likely deliver an unworkable solution. Smaller, more focused data sets are what such problems need.

For this column, I will focus on another method: offline reinforcement learning, or offline RL as it is known in the trade. I also expect Microsoft/OpenAI's competitors to use this sort of model to become much more precise in their business problem-solving, though that will take several months. What follows is simply a contrast between the two approaches, offline RL and generative AI.

Offline RL is a type of machine learning that trains an agent to make decisions using a fixed data-set of past experiences, each recording the rewards and punishments received from the environment, rather than through live interaction with it. This approach is particularly useful in scenarios where it is impractical or expensive to collect data in real time, or where the environment is complex and potentially dangerous.
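
To make the idea concrete, here is a minimal sketch of offline (batch) Q-learning in Python. The tiny chain-world, the logged tuples and every name in it are illustrative assumptions, not any particular library's API; the point is only that the agent updates its value estimates purely from a fixed log of (state, action, reward, next state) experiences, never touching the environment itself.

```python
import random

N_STATES, N_ACTIONS = 5, 2      # toy chain-world; actions: 0 = stay, 1 = step right
GAMMA, ALPHA = 0.9, 0.1         # discount factor and learning rate

def logged_transition(s, a):
    """What the logging policy once observed: stepping right earns a reward at the end."""
    s_next = min(s + a, N_STATES - 1)
    reward = 1.0 if s_next == N_STATES - 1 else 0.0
    return s, a, reward, s_next

# The fixed, pre-collected data-set of experiences. Offline RL never adds to it.
dataset = [logged_transition(s, a)
           for s in range(N_STATES)
           for a in range(N_ACTIONS)
           for _ in range(20)]

Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]   # tabular value estimates

random.seed(0)
for _ in range(200):                     # repeatedly replay the same logged data
    random.shuffle(dataset)
    for s, a, r, s_next in dataset:
        target = r + GAMMA * max(Q[s_next])       # reward plus discounted future value
        Q[s][a] += ALPHA * (target - Q[s][a])     # nudge the estimate toward the target

# The learnt policy: pick the highest-valued action in each state.
policy = [max(range(N_ACTIONS), key=lambda a: Q[s][a]) for s in range(N_STATES)]
print(policy)   # expected to prefer 'step right' (1) in the non-terminal states
```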

Offline RL has several strengths that make it a powerful tool for solving complex problems. Firstly, it is computationally efficient, since it does not require real-time interaction with the environment; it learns from pre-collected data, making it suitable for large data-sets. Secondly, offline RL can learn from a wide range of data-sets, including expert demonstrations, human preferences and simulations. This flexibility lets the agent learn from diverse data sources, improving the quality of the policy it learns. Finally, offline RL lets the agent learn in a safe, controlled setting, without acting in the real environment. This is very useful in applications such as robotics, where real-time interaction with the environment can be dangerous.

I came across a blog post by Ignacio DeGregorio (bit.ly/3JRMFHd) that claims reinforcement learning is the way forward and suggests it may indeed be Google's hidden weapon in winning the AI wars, rather than Bert or Bard, its generative AI models.

DeGregorio provides an elegant definition for this alternate method: “Reinforcement Learning, or RL, is a multi-step process that requires ‘interaction’. For each step in the process, the model acknowledges its state (its situation in the environment), performs an action, and if the action implies an approximation to the desired final state, it receives a reward. For each action the model makes in say, a video game, it understands the impact of that action, potentially receiving a reward and reshaping its parameters in order to maximize those rewards. This way, the model learns what actions yield rewards and defines the policy—the strategy—it will follow to maximize them.” To me, it is this precision that makes RL valuable for point solutions.
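
Read as code, that loop looks something like the sketch below. Everything in it (the tiny 'walk right to the goal' environment, the epsilon-greedy exploration, the Q-table) is an illustrative assumption rather than anything from Google's systems; unlike the offline sketch above, here the agent interacts with the environment at every step, exactly the state-action-reward-update cycle DeGregorio describes.

```python
import random

N_STATES, N_ACTIONS = 5, 2            # toy environment; actions: 0 = stay, 1 = step right
GAMMA, ALPHA, EPSILON = 0.9, 0.1, 0.2 # discount, learning rate, exploration rate

def step(state, action):
    """Environment dynamics: reaching the last state yields a reward of 1."""
    next_state = min(state + action, N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward

Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]   # the model's value estimates

random.seed(0)
for episode in range(500):
    state, done, steps = 0, False, 0
    while not done and steps < 100:
        # Acknowledge the state, then act: mostly exploit, occasionally explore.
        if random.random() < EPSILON:
            action = random.randrange(N_ACTIONS)
        else:
            action = max(range(N_ACTIONS), key=lambda a: Q[state][a])
        next_state, reward = step(state, action)
        # Reshape the estimates so that reward-yielding actions are preferred.
        target = reward + GAMMA * max(Q[next_state])
        Q[state][action] += ALPHA * (target - Q[state][action])
        done = next_state == N_STATES - 1
        state, steps = next_state, steps + 1

# The resulting policy: the best-known action for every state.
print([max(range(N_ACTIONS), key=lambda a: Q[s][a]) for s in range(N_STATES)])
```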

However, offline RL also has some limitations that must be considered. Firstly, the quality of the policy learnt depends heavily on the quality and diversity of the training data-set. Biased data-sets can lead to biased policies, limiting the agent's ability to generalize to new environments. Secondly, offline RL, at least so far, does not explore the environment in real time, limiting the agent's ability to learn from new experiences. This can result in a sub-optimal policy that is unable to adapt to changing environments. Finally, the distribution of the data-set used for training may differ from the distribution of the environment in which the policy is eventually deployed, a mismatch known as 'distribution shift', under which the learnt policy does not perform well in the real environment.
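
A small, deliberately contrived sketch of that last failure mode, reusing the toy chain-world from above (all names and numbers here are assumptions made for illustration): the logged data covers only part of the state space, so the agent's value estimates say nothing about the states it later encounters.

```python
N_STATES, N_ACTIONS = 6, 2
GAMMA, ALPHA = 0.9, 0.1

# The logging policy only ever visited states 0-2, so that is all the data covers.
dataset = []
for s in range(3):
    for a in range(N_ACTIONS):
        s_next = min(s + a, 2)
        reward = 1.0 if s_next == 2 else 0.0
        dataset.extend([(s, a, reward, s_next)] * 10)

Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
for _ in range(200):
    for s, a, r, s_next in dataset:
        Q[s][a] += ALPHA * (r + GAMMA * max(Q[s_next]) - Q[s][a])

# At deployment the environment pushes the agent into states 3-5, which the data
# never covered: their value estimates are still the untrained zeros, so the
# policy there is essentially a guess.
print(Q[1], Q[4])   # informative values for a covered state, [0.0, 0.0] for an unseen one
```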

Generative AI, by its very nature, can produce new data samples, which can be used to augment a training data-set. This can improve the quality of the learnt policy by providing more diverse and representative data. Generative models can also be trained on unlabelled data, allowing for unsupervised learning. And they can generate new and novel data samples, making them useful in creative applications such as music, writing and computer programming.
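
As a deliberately simple illustration of the augmentation idea (the 'generative model' here is just a fitted Gaussian, an assumption chosen to keep the sketch self-contained), one could fit a model to the real measurements and then sample synthetic points to enlarge the training set:

```python
import random
import statistics

random.seed(0)
# A small "real" data-set of one-dimensional measurements.
real_data = [random.gauss(50.0, 5.0) for _ in range(30)]

# "Train" the toy generative model: estimate the mean and standard deviation.
mu = statistics.mean(real_data)
sigma = statistics.stdev(real_data)

# Generate new, plausible samples and augment the training set with them.
synthetic = [random.gauss(mu, sigma) for _ in range(100)]
augmented_data = real_data + synthetic

print(len(real_data), len(augmented_data))   # 30 real points, 130 after augmentation
```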

However, generative models can suffer from a phenomenon known as 'mode collapse', where the model generates only a narrow slice of the variety present in its training data, reducing the quality of what is learnt. These models also offer little direct control over the data they generate, which makes it hard to ensure that the output is suitable for the application at hand. And they can be computationally expensive to train.

Both offline RL and generative AI are powerful tools for solving problems, but the choice between them depends on the specific problem at hand and the resources available. And then there is Google's humble search application, which, when used in tandem with either of these models, can still yield a workable result, be it art, an essay or even computer code.

Siddharth Pai is co-founder of Siana Capital, a venture fund manager.

 
