Machine learning underpins data-driven AI: Una-May O’Reilly

Mumbai: Una-May O’Reilly, principal research scientist at Anyscale Learning For All (ALFA) group at the Massachusetts Institute of Technology Computer Science and Artificial Intelligence Laboratory, has expertise in scalable machine learning, evolutionary algorithms, and frameworks for large-scale, automated knowledge mining, prediction and analytics. O’Reilly is one of the keynote speakers at the two-day EmTech India 2016 event, to be held in New Delhi on 18 March.

In an email interview, she spoke, among other things, about how machine learning underpins data-driven artificial intelligence (AI), giving the ability to predict complex events from predictive cues within streams of data. Edited excerpts:

When you say that the ALFA group aims at solving the most challenging Big Data problems—questions that go beyond the scope of typical analytics—what do you exactly mean?

Typical analytics visualize and retrieve direct information in the data. This can be very helpful. Visualizations allow one to discern relationships and correlations, for example. Graphs and charts plotting trends and comparing segments are informative. Beyond its value for typical analytics, one should also be aware that the data has latent (that is, hidden) predictive power. By using historical examples, machine learning makes it possible to build predictive models from data. What segments are likely to spend next month? Which students are likely to drop out? Which patient may suffer an acute health episode? Predictive models of this sort rely upon historical data and are vital. Predictive analytics is new, exciting and what my group aims to enable technologically.

How does ALFA work with raw data, and how does the data support machine learning? What are the challenges you face in this process?

Raw data is sometimes incomplete (sensors fail, records get lost), sometimes noisy (incorrect data is recorded or entries are transcribed inaccurately), and it is often dispersed and collected from different sources, then stored along different axes. It also often describes very low-level observations collected from a complex system.

In contrast, humans observe and relate to the same system frequently in more complex, nuanced ways but are confounded by low-level details. This gap implies that one major set of challenges of machine learning is the software transformation of raw data into influential explanatory variables or sophisticated response variables to enable effective predictive modelling.

It takes a lot of work to close the gap between an educational expert saying “a student is likely to drop out because she is procrastinating” and the definition and extraction of an operational variable that identifies how early a student started the problem set. Is an early or late start good evidence of “procrastination”? The same question arises for relational trends: Is a patient in the lower decile acutely ill? Is a student’s current grade 10% below last month’s?
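As a rough illustration of what such operational variables might look like in code, the sketch below computes two hypothetical features: how far into the available window a student first touched a problem set, and whether a grade has fallen at least 10% relative to last month. The function names, the timestamps and the choice of a relative (rather than absolute) 10% drop are assumptions made for illustration, not definitions from the ALFA group.

```python
from datetime import datetime


def start_fraction(released, deadline, first_activity):
    """Share of the available window that elapsed before the student started.

    0.0 means the student started at release, 1.0 means at the deadline.
    Whether a high value really signals "procrastination" is exactly the
    open question; this only makes the variable measurable.
    """
    window = (deadline - released).total_seconds()
    elapsed = (first_activity - released).total_seconds()
    return elapsed / window


def grade_dropped(current, last_month, threshold=0.10):
    """True if the current grade is at least `threshold` below last month's,
    measured relative to last month's grade (an assumed definition)."""
    return current <= last_month * (1 - threshold)


released = datetime(2016, 3, 1)
deadline = datetime(2016, 3, 8)
first_activity = datetime(2016, 3, 7, 12)  # hypothetical log entry

late_start = start_fraction(released, deadline, first_activity)
falling = grade_dropped(current=70, last_month=80)
```

Even in this small case, the expert’s phrase has to be pinned down as a number and a threshold before a model can use it.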

Facilitating the natural and efficient translation of human conceptual descriptions into machine learning-ready data is both challenging and motivating because this is at the crux of human-data interaction.

Another exciting challenge is that experts frequently have hypotheses about predictive variables and problems that the data might be able to solve. Formerly, one had to close down many early investigations of a data set’s power and somewhat prematurely commit to a specific problem and its predictors because the cost of systematic tools and workflows was prohibitive versus the foreseen value of a data set. Given that we now have bigger data sets that can serve multiple purposes, we have invented a means of predictably mining physiological waveform data to confirm or refine multiple hypotheses in an optimized, efficient manner.

Please give us some examples.

One problem we’ve addressed is wind farm site selection. Committing to a particular site is a critical decision because the wind speeds and directions at a site determine how much energy can be harvested from it. Of course, it is uncertain how wind will blow in the future, but very-long-term (30-40 years) predictive estimations are possible. They require site wind velocity measurements, access to neighbouring sensing locations (for example, airports with decades of historical data), and the use of probabilistic models. We have developed a new probabilistic modelling technique for estimating wind resources.

We have also collaborated to develop technology that optimizes the placement of wind turbines on a wind farm. This is challenging because turbines downwind suffer from wake effects of others upwind that diminish the wind energy they receive.

Additionally, there are topological properties such as roads, buildings or ponds that constrain where a turbine can be placed. Pre-existing solutions dealt with the problem for modest numbers of turbines and simplified topological properties. Our challenge was to design a better algorithm that scaled both to more turbines and more topological complexity. Our solution, documented in “A Continuous Developmental Model for Wind Farm Layout Optimization”, uses a developmental model based on a model of gene regulatory networks to control cells that act in a continuous rather than discretized grid space.

In healthcare, we have been addressing how to shrink the time it takes a clinical researcher to generate a predictive model from data. We focus on physiological waveforms and demonstrate our ideas with the forecasting of potential acute physiological episodes. One approach we’ve taken is efficient predictability mining.

Another approach is to “let the data speak” by finding “patients like me”. We focus on efficiently finding archived time series that are similar to the critical time series of a patient. Then we leverage these “nearest matches” to make predictions for the patient.
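The “nearest matches” idea can be sketched in a few lines. The example below is a minimal illustration, not the group’s method: it assumes equal-length time series, uses plain Euclidean distance as the similarity measure, and scores a patient by the share of the k closest archived series that were followed by an episode. The interview does not say which similarity measure or prediction rule is actually used, and all names and data here are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict_from_nearest(query, archive, k=3):
    """Estimate risk from the k archived series closest to `query`.

    `archive` is a list of (series, outcome) pairs, where outcome is 1 if
    an acute episode followed the series and 0 otherwise. Returns the
    fraction of the k nearest matches with outcome 1.
    """
    ranked = sorted(archive, key=lambda item: euclidean(query, item[0]))
    nearest = ranked[:k]
    return sum(outcome for _, outcome in nearest) / len(nearest)


# Toy archive: made-up readings, not real patient data.
archive = [
    ([1.0, 1.0, 1.0], 0),
    ([1.0, 2.0, 1.0], 0),
    ([5.0, 5.0, 5.0], 1),
    ([5.0, 6.0, 5.0], 1),
    ([5.0, 5.0, 6.0], 1),
]

risk = predict_from_nearest([5.0, 5.0, 5.5], archive, k=3)
```

A linear scan like this compares the query against every archived series, so a large archive would need indexing or approximate search to stay efficient.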

You train data scientists. What are the main challenges they face today?

Data scientists will always have to balance how much they become a specialist in the data domain in which they work with how they master general data science tools. Experience always makes a data scientist better. Data scientists need to communicate well so they can grasp, from domain experts, the key challenges and opportunities around mining a specific domain’s data for insights.

How should they prioritize the humongous amounts of data from various sources?

Our rule of thumb is to start with a pilot project and use it to prioritize the data sources that lend value to an actual predictive problem. It’s better to link, cross-reference and efficiently organize multiple data streams with clear use cases in mind.

What precautions ought they to take so that they do not violate data privacy in law and spirit?

All data scientists should be informed about the most up-to-date standards for data privacy. Newer approaches to data privacy consider how it is “built in” rather than “tacked on” to a system. Data scientists need to stay on top of new tools and designs for applications that allow privacy to be protected.

Would you agree that the world is gradually moving towards an algorithmic economy?

Data without algorithms is not useful. As increasing capabilities to sense systems around us in more detail converge with capacity for data storage and processing, data-driven analyses and efficiencies will strongly influence how business is conducted.

Many technology luminaries like Bill Gates, Elon Musk and even physicist Stephen Hawking have expressed fears that robots with AI could rule mankind. Do you share these concerns?

I’m not concerned about this occurring in my lifetime. It’s not that robot engineers and AI experts won’t make tremendous progress! However, acting sensibly in the world (and controlling humans!) requires a lot of experience (which depends on very complex learning), a lot of complex integrative reasoning and long-term intelligent robustness that goes beyond the rather siloed, sheltered capabilities we’re still focused upon.

Your research also focuses on genetic programming which you define as “the evolution of programs”. Please elaborate on this subject and its application.

Genetic programming is an exciting kind of evolutionary algorithm. All evolutionary algorithms transform and abstract nature’s algorithm of neo-Darwinian evolution into a computer program. The program starts with a population of random solutions. The solutions are assessed for quality, and the fittest among them are selected as parents. Offspring are “bred” from these parents: they inherit some of their parents’ characteristics while also undergoing random genetic variation. The algorithm then loops back with the new population.

These offspring are the next (and better) generation. By iterating generation after generation, an evolutionary algorithm typically captures the adaptive power of evolution and channels it into solving optimization problems.
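The loop described above can be sketched as a small program. This is a generic evolutionary algorithm on a toy problem (maximizing the number of 1s in a bit string), not genetic programming itself and not ALFA’s code. The problem, the parameter values, truncation selection, one-point crossover and the elitism step (copying the best solution unchanged into the next generation) are all assumptions chosen to keep the example short.

```python
import random


def fitness(bits):
    """Quality of a solution: the number of 1s (toy objective)."""
    return sum(bits)


def evolve(n_bits=20, pop_size=30, generations=40, mutation_rate=0.05, seed=0):
    rng = random.Random(seed)
    # Start with a population of random solutions.
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    best_per_generation = []
    for _ in range(generations):
        # Assess quality and select the fitter half as parents.
        ranked = sorted(population, key=fitness, reverse=True)
        best_per_generation.append(fitness(ranked[0]))
        parents = ranked[: pop_size // 2]
        # Elitism (an assumption): keep the best solution unchanged.
        offspring = [ranked[0][:]]
        while len(offspring) < pop_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, n_bits)
            # Inheritance: one-point crossover of the two parents.
            child = mother[:cut] + father[cut:]
            # Random variation: flip each bit with a small probability.
            child = [1 - b if rng.random() < mutation_rate else b
                     for b in child]
            offspring.append(child)
        # Loop back with the new population.
        population = offspring
    best = max(population, key=fitness)
    best_per_generation.append(fitness(best))
    return best, best_per_generation


best, history = evolve()
```

Because of the elitism step, the best fitness recorded in `history` never decreases from one generation to the next. Without it, a generation can occasionally be worse than its predecessor.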

Genetic programming is an evolutionary algorithm that addresses machine learning problems via a technique called symbolic regression. Effectively, genetic programming can identify accurate predictive models by simultaneously integrating model structure and parameter search. This lets the models capture non-linearities. Evolutionary algorithms are well suited to parallelism. We’re interested in scaling genetic programming effectively on the cloud for large-scale machine learning. See FCUBE, a system we have developed for collaborative research and development of machine learning algorithms towards solving large-scale problems.

We’re also using genetic programming for predictive feature engineering which is one of the less emphasized but challenging tasks with machine learning.

Tell us a little about the STEALTH (Simulating Tax Evasion And Law Through Heuristics) program and its progress.

The STEALTH project uses computation to elucidate the adversarial dynamics that lead to power oscillations between tax evasion strategies and regulatory checks for detecting non-compliance. In cybersecurity terms, we treat the specific portion of the tax code related to partnerships as an “attack surface” and we devise ways to model auditing to “defend” the tax regulations and to model financial networks that serve as “exploits” that attack loopholes in the code. Rather than being data driven, STEALTH relies upon model-based reasoning to help people think about how adversarial strategies interact. See this New York Times article for more information.
