A new AI method directs scientists toward promising avenues of inquiry.
What's new: Jamshid Sourati and James A. Evans at University of Chicago proposed a method to predict new scientific discoveries by building a graph that connects researchers, their objects of study, and the scientific properties thereof. They evaluated their approach using data from materials science.
Key insight: Overlapping interests among researchers may indicate areas where further research would be fruitful. For example, if one group of researchers studies a material A and its property P, a second group studies materials A and B, and another group studies materials B and C, it may turn out that material C exhibits property P.
How it works: The authors tried to predict whether certain inorganic materials have certain electrical properties based on scientific literature through the year 2000. From 1.5 million articles that described 100,000 inorganic compounds, they extracted the author names, materials mentioned (for example, sodium nitrite), and their properties (for example, thermoelectricity, the ability to convert heat into electricity and vice versa). They used this data to construct a graph whose nodes were authors, materials, and properties. Edges connected the nodes that appeared in the same paper, for example a particular author whose paper covered specific material or property.
- The authors conducted random walks through the graph, stepping from node to node, to produce sequences of authors, materials, and properties. Then they removed the authors from the sequences, because they were interested mainly in establishing possible connections between materials and properties.
- They trained Word2Vec, which computes word embeddings, on their sequences, treating materials and properties as words and sequences as documents. This yielded an embedding for each material and property.
- To predict possible discoveries — that is, which material might exhibit a given property — the authors scored each material based on (i) the similarity between the material’s embedding and the given property’s embedding and (ii) the smallest number of edges in the path that connected each material and the property. Then they summed scores (i) and (ii). The 50 highest-scoring materials were predicted to have the property (that weren’t directly connected in the graph; that is, excluding materials that already were known to have the property).
Results: The authors predicted which materials possessed each of three properties. They compared their results with predictions obtained in a similar way using a Word2Vec model trained exclusively on text from scientific papers. They used papers from 2001 through 2018 to evaluate the predictions. For thermoelectricity, the cumulative precision (percentage of predicted discoveries proven correct) was 76 percent, while the cumulative precision of the alternative method was 48 percent. The cumulative precision of random guesses was about 3 percent. The authors obtained similar results for the other two properties.
Why it matters: Science is a social endeavor, where the connections between people and their work can be represented as a graph that reflects the collective attention of the scientific community. The collective attention acts as a signal that predicts promising avenues for further research — a signal that machine learning can help to tease out.
We're thinking: The authors also predicted drug discoveries with similarly good results. Their method may be useful for identifying fruitful directions in other scientific areas, and perhaps in other domains entirely.