One approach is to focus on data quality rather than quantity. AI labs do not simply train their models on the entire internet. They filter and sequence data to maximise how much their models learn. Naveen Rao of Databricks, an AI firm, says that this is the “main differentiator” between AI models on the market. “True information” about the world obviously matters; so does lots of “reasoning”. That makes academic textbooks, for example, especially valuable. But setting the balance between data sources remains something of a dark art. What is more, the order in which the system encounters different types of data matters too. Lump all the data on one topic, like maths, at the end of the training process, and your model may become specialised in maths but forget some other concepts.
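
The point about ordering can be made concrete with a small sketch. It does not reproduce any lab's recipe: the source names, the mixing weights and the helper functions (`blocked_schedule`, `mixed_schedule`, `share_in_final_window`) are all hypothetical, chosen only to contrast a schedule that lumps each topic together with one that interleaves topics according to their weights.

```python
import random

# Hypothetical mixing weights, invented for illustration only.
WEIGHTS = {"web": 0.5, "textbooks": 0.3, "maths": 0.2}


def blocked_schedule(weights, n_steps):
    """Lump each source into one contiguous block, in dictionary order."""
    order = []
    for name, weight in weights.items():
        order.extend([name] * round(weight * n_steps))
    return order


def mixed_schedule(weights, n_steps, seed=0):
    """Pick a source at every step with probability proportional to its weight."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=n_steps)


def share_in_final_window(schedule, source, window):
    """Fraction of the last `window` steps drawn from `source`."""
    tail = schedule[-window:]
    return tail.count(source) / len(tail)


blocked = blocked_schedule(WEIGHTS, 1000)
mixed = mixed_schedule(WEIGHTS, 1000)

# With maths lumped at the end, the final steps contain nothing else.
print(share_in_final_window(blocked, "maths", 100))
# With interleaving, maths keeps roughly its overall share to the end.
print(share_in_final_window(mixed, "maths", 100))
```

In the blocked schedule the model's last updates come from maths alone, which is the situation in which it may forget other concepts; interleaving keeps every source in view throughout training. Real pipelines are far more elaborate, and the right weights remain, as the text says, something of a dark art.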