Introduction

Machine learning has found an enormous range of applications since the birth of this ever-growing field. Constant advances in theory, growing funding and investment, increasing usefulness and market value, and exponential growth in computing power are making machine learning systems ever more advanced, powerful, and useful. Their use in science is growing rapidly as well, and they are proving valuable for a wide range of scientific problems.

But not all problems are suited for machine learning, at least not yet. Discovering new species in rain forests, or conducting geological field work, for example, are not yet good matches for machine learning algorithms. Why not? Which problems are? We would like to know which properties a scientific problem must have for contemporary machine learning methods to be reasonably expected to handle it. Asking this question lets us identify the problems that should be approached with machine learning algorithms, and recognize when a problem is not yet a good fit for machine learning methods. Some problems are already solved to some degree, but machine learning could simply solve them better, whether in terms of computational cost, efficiency, or the human effort saved.

We will first take a look at how these machine learning algorithms solve problems, and describe some of the main methods in machine learning. This will give us a better intuition for why these systems are good at what they do. Next, we will explore the fundamental question of this paper: which general properties make a scientific problem suited for a machine learning approach? Finally, we will look at some scientific problems for which machine learning approaches have worked reasonably well in recent decades, and discuss which of our criteria these problems possess. If the attributes of these problems match our criteria well, this will give (at least mild) support to our general claims.

How do learning systems solve problems?

As opposed to expert systems (or regular programs, for that matter), machine learning systems learn by themselves. These systems are most often built to make predictions, to make decisions to act upon, or both. To learn to make these predictions or decisions, there must be some kind of function or structure that takes relevant data as input (images, text, sensory data, etc.) and produces output in the correct format. This function/structure is called a model. The model contains (often very many) parameters that control its output. [1] In this paper, we will primarily focus on supervised learning, one of the main types of machine learning alongside unsupervised learning and reinforcement learning. However, we will also take a brief look at the latter two.

Supervised learning

Systems utilizing supervised learning learn by employing a machine learning algorithm that modifies the model parameters so that the model produces output close to the desired output. This is called training the model, and it is done using training data. The training data consists of input data together with the corresponding desired output for each input; the desired-output part of the training data is called the target output. A utility function (often called a cost function, loss function, or objective function) is constructed that can evaluate the model's prediction/decision by comparing the model's output to the desired output. [1]

To train the model, the model's parameters are first initialized randomly from some probability density function. Then a batch of input data is fed into the model, the model produces an output, and the cost function assesses its performance. Next, the machine learning algorithm figures out in which direction the model's parameters should be tweaked to decrease the cost returned by the cost function, and the parameters are nudged some amount in that direction. This is most often done using gradient descent. Then the next batch of data is fed into the model, and the process is repeated. [1]
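
To make the loop concrete, here is a minimal sketch (our own illustration, not taken from [1]) of batch gradient descent fitting a simple linear model to synthetic data, with a mean-squared-error cost function:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data: inputs x and target outputs y = 3x + 2 plus noise.
x = rng.uniform(-1, 1, size=(256, 1))
y = 3.0 * x + 2.0 + 0.1 * rng.normal(size=(256, 1))

# Randomly initialized model parameters (a weight and a bias).
w = rng.normal()
b = rng.normal()
lr = 0.1  # step size for gradient descent

for epoch in range(200):
    for start in range(0, len(x), 32):           # feed one batch at a time
        xb, yb = x[start:start + 32], y[start:start + 32]
        pred = w * xb + b                        # model output
        cost = np.mean((pred - yb) ** 2)         # cost function (MSE)
        # Gradients of the cost with respect to each parameter.
        grad_w = np.mean(2 * (pred - yb) * xb)
        grad_b = np.mean(2 * (pred - yb))
        # Nudge the parameters a small amount against the gradient.
        w -= lr * grad_w
        b -= lr * grad_b

print(f"learned w={w:.2f}, b={b:.2f}")  # approaches w=3, b=2
```

Each pass nudges the parameters against the gradient of the cost, which is exactly the tweaking step described above.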

If the model and machine learning algorithm are sufficient, and the data good enough, repeating this tweaking step will make the model's output approach the desired output. The model can then be used to make predictions on new data. [1] We see that with supervised learning, we provide the system with the output we desire, and the system is tasked with finding out how to produce that output. We know what we want from the system.

We shall not be strict in our terminology in this paper. We will most often refer to the model and the training algorithm together, and to refer to the whole system we will use terms such as “machine learning algorithm” or simply “system”.

Unsupervised learning

Another type of machine learning method is unsupervised learning. In contrast to supervised learning, an unsupervised learning algorithm is not provided with any target output along with its input data. The unsupervised learning system is instead tasked with representing the statistical structure of the input data. [2]
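
As a small self-contained illustration (ours; the data and cluster count are made up), a clustering algorithm such as k-means captures one simple kind of statistical structure, grouping inputs without ever seeing a target output:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Unlabeled data drawn from three well-separated blobs.
data = np.vstack([
    rng.normal(loc=c, scale=0.3, size=(100, 2))
    for c in [(-2, 0), (0, 2), (2, 0)]
])

# No target outputs are given: the algorithm infers the group structure itself.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(data)
print(labels[:10])
```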

The brain appears to use something akin to unsupervised learning extensively, and it might prove critical in the advent of artificial general intelligence [2]. For more information about unsupervised learning, we refer the reader to a standard textbook on the matter.

Reinforcement learning

Reinforcement learning is another type of machine learning. It does not involve the usual training data, but instead some type of agent that acts in an environment to maximize some objective function. The environment might be a chessboard, where the objective function reflects the outcome of the game. The reinforcement learning system learns from the actions it took, nudging its parameters in a direction such that it will take actions with a higher expected score. The agent might also learn actively, taking actions that yield observations that will improve its own learning. [3]
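
A minimal sketch of this idea (ours; a toy one-dimensional world rather than a chessboard) is tabular Q-learning, where the agent nudges its action-value estimates toward the observed reward plus the estimated future value:

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy 1-D world: states 0..4, reward 1 for reaching state 4, else 0.
n_states, n_actions = 5, 2           # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))  # the agent's learned action values
alpha, gamma, eps = 0.5, 0.9, 0.3    # learning rate, discount, exploration

for episode in range(500):
    s = 0
    while s != 4:
        # Explore occasionally, otherwise act greedily on current estimates.
        a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
        s_next = max(0, s - 1) if a == 0 else min(4, s + 1)
        r = 1.0 if s_next == 4 else 0.0
        # Nudge the value estimate toward the reward plus discounted future value.
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q.round(2))  # the "go right" action should dominate in every state
```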

Neural Networks

Machine learning algorithms often include neural networks. We will describe these only briefly, as there are many sources that provide excellent descriptions of neural networks. Though there are other types, we will only describe feed-forward neural networks. A neural network is really a series of layers of functions that the model's input data passes through. Each layer consists of one layer of neurons and an activation function. Each neuron in a layer takes as input all the outputs from the previous layer (or the original data, if this is the first layer) and returns a weighted sum of these. The weights used in the weighted sums are the parameters of the model that we mentioned above. The output of the neuron then passes through a so-called activation function before being sent on to the next layer. [4]
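
The following sketch (ours; the weights are random and untrained, and we add the customary bias terms, which the description above leaves implicit) shows such a forward pass through a small feed-forward network:

```python
import numpy as np

def relu(z):
    """A common activation function: max(0, z), applied elementwise."""
    return np.maximum(0.0, z)

def forward(x, layers):
    """Pass input x through a list of (weights, biases) layers."""
    for W, b in layers[:-1]:
        x = relu(x @ W + b)   # weighted sum of previous outputs, then activation
    W, b = layers[-1]
    return x @ W + b          # final layer left linear in this sketch

rng = np.random.default_rng(0)
# A network with 4 inputs, one hidden layer of 8 neurons, and 2 outputs.
layers = [
    (rng.normal(size=(4, 8)), np.zeros(8)),
    (rng.normal(size=(8, 2)), np.zeros(2)),
]
print(forward(rng.normal(size=(1, 4)), layers))
```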

In recent years, these neural networks have proven incredibly effective for a plethora of applications, and many if not most of the applications that we will talk about in this paper involve neural networks.

What makes machine learning different from “regular” algorithms?

An important thing to note about the problem-solving process of a supervised machine learning algorithm is that the learning process is really just optimization. The algorithm adjusts the model parameters to optimize the utility function. As we shall see, many hard problems in science can be treated as optimization problems, and they can then be attacked with machine learning algorithms. Google DeepMind has made significant progress on the famous protein folding problem by finding a way to treat it as an optimization problem, and then developing a machine learning algorithm to do the optimization [5]. We will take a closer look at AlphaFold later in this paper.

It is important to notice that at no step of the model training process do we tell the system how to solve the problem. We just give it some examples of what we want from it (the training data), and let the system find its own way to optimize the utility function. A consequence of this is that these systems often find solutions that we could not have found on our own, or that would have been very hard to produce with hard-coded algorithms. For example, it is very hard to hand-code a good algorithm that recognizes spam, but machine learning algorithms have proven to be up to 99.9% effective at detecting spam [6].

A feature of machine learning algorithms is that they can find approximate solutions to problems that have large combinatorial solution spaces [7], and they do not need to know much about the mechanics of the system they are making predictions on in order to make good predictions [8]. As we shall see, this can be of immense use in science.

What makes a scientific problem suited for machine learning?

Not all scientific problems are suited for machine learning (yet!). We would like to know which features make a problem appropriate for machine learning algorithms to tackle. We would also, of course, require some benefit from solving these problems with machine learning instead of “regular” methods. These benefits might include faster computation of the solution, fewer person-hours spent, and more automation. If the problem is impossible (or infeasible) to solve using regular methods, but possibly solvable (partially or completely) with machine learning methods, this is of course itself a reason to employ machine learning on the problem. In this section, we will explore which features of a problem make it suited for machine learning. Of course, a problem does not need to have all of these characteristics to be fit for machine learning methods, and some problems that have many of these characteristics might still be very hard to approach and solve with machine learning, at least in the near future. That said, we will try to find some properties that make it more likely that a problem could be solved, or our understanding of it advanced, using machine learning methods. In the next chapter we will then take a look at some problems that fit our criteria, and that we think machine learning methods will be very helpful in solving.

The attributes we talk about in this chapter are not all on equal footing. Some are attributes that a problem absolutely must have if we hope to approach it with machine learning methods with any success. Others are not necessary, but can be good indicators that the problem should be attempted with machine learning methods. For each attribute we mention, we will indicate its level of importance.

Large combinatorial solution spaces

Many problems in science have large combinatorial solution spaces that cannot be feasibly iterated through.

Consider the famous protein folding problem. Because of the large number of degrees of freedom in an unfolded peptide chain, the molecule will have an astronomical number of possible conformations [9]. It is then very hard to predict the structure that the protein will assume.

Another example is the traveling salesman problem. In this problem, we are presented with a graph in which each edge is associated with a number (a distance). We must find the tour that visits every node exactly once and returns to the start, such that the sum of the numbers along the edges travelled is smallest. The naive way to find that shortest tour is to iterate through every possible tour, but because the number of possible tours grows factorially with the number of nodes, this is computationally infeasible. Finding exact solutions to problems like this quickly becomes intractable as the instances grow. For these problems, we must create algorithms that find approximate, or “good enough”, solutions in a reasonable time.
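
To see how quickly brute force breaks down, here is a sketch (ours, with hypothetical city coordinates) that enumerates every tour of a small instance:

```python
from itertools import permutations
from math import dist

# Hypothetical city coordinates; a real instance could have thousands.
cities = [(0, 0), (1, 5), (4, 1), (6, 4), (3, 3), (5, 0), (2, 6), (7, 2)]

def tour_length(order):
    """Total length of a closed tour visiting the cities in this order."""
    return sum(dist(cities[a], cities[b])
               for a, b in zip(order, order[1:] + order[:1]))

# Brute force: check every tour that starts at city 0.
best = min(permutations(range(1, len(cities))),
           key=lambda rest: tour_length((0,) + rest))
print((0,) + best, tour_length((0,) + best))
# 8 cities -> 7! = 5040 tours; 20 cities -> 19! ≈ 1.2e17 tours. Hopeless.
```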

Problems like these are called combinatorial optimization problems. Regular algorithms for such problems rely on handcrafted heuristics for making predictions and decisions. These algorithms usually take very long to develop, and are most likely not as good as they could be, considering the space of possible algorithms for the problem. There can also be cases where expert knowledge of the problem at hand is simply insufficient to make an acceptable algorithm. This can happen when the underlying mechanics or data distribution stubbornly resists analytic attack. [10]

Furthermore, researchers working on these problems have to spend a great deal of time learning about the specific problem and its underlying structure to make their algorithms. They then continue to fine-tune the algorithm's parameters as they learn more about the underlying structure of their problem's domain. This is itself a learning process. [10]

Why would machine learning be a good approach to these problems? Combinatorial optimization problems are such that finding exact solutions is infeasible, so we must approximate them. Moreover, the underlying data distributions of these problems are not known, and/or there are no good heuristics we can conjure up to approximate the solution. Machine learning is by nature approximate, and machine learning algorithms are remarkably good at finding very good solutions to problems without knowing in advance the underlying data distribution and/or the mechanics of the system they are working on. [10]

With the right setup, a machine learning system could be developed for many combinatorial optimization problems. The researchers first need to find a good way to represent the problem for a machine learning system. For example, the AlphaFold system by Google DeepMind predicts the structure of a protein given its chain of amino acids by predicting the angles between successive amino acids and the distance between each pair of amino acids [5]. Given enough data, a machine learning system can then, through learning, approximate the underlying data distribution of the problem it faces, and machine learning systems have proven very good at this even when the solution space is combinatorially large. According to Yoshua Bengio, a pioneer in the field of machine learning, the learning process that researchers and scientists go through to produce approximation algorithms for combinatorial optimization problems could itself be automated by machine learning algorithms [10].

Both Yoshua Bengio and Demis Hassabis, the CEO of Google DeepMind, maintain that machine learning algorithms are very much suited to combinatorial optimization problems [7, 10].

This attribute is by no means necessary for a problem to be tackled with machine learning methods. But it is a strong indicator that, if the problem satisfies the critical requirements, we should attempt it with machine learning methods.

Clear objective function

Supervised learning algorithms and reinforcement learning algorithms require a clear, mathematical objective function [1]. (The search for a clear objective function that does not cause catastrophes (like the AI turning the whole solar system into paperclips) is a notorious problem in the quest for artificial general intelligence [11].) Since these algorithms require objective functions that can be expressed concretely in mathematical terms, we must be able to find such an objective function that is a good guide to how well our algorithm is solving our problem. If we employ a bad objective function, our machine learning algorithm will optimize for the wrong output, and will thus not solve the problem adequately.

The objective function needs to be painfully accurate, as the machine learning algorithm will, in any way it can, try to optimize the objective function by the easiest route. In a majority of cases where the objective function is bad, the machine learning algorithm will “cheat”: it finds a way to make the objective function return a higher value without doing what we want it to do. This effect is called reward hacking; it can be rather comedic when machine learning algorithms are tasked with playing video games, but it can be a very hard problem (arguably the hardest problem) when deciding on a utility function [12]. We therefore conclude that for machine learning to be suited to a problem, we must be able to find a good objective function for it to maximize, one that accurately describes what we want the solution to look like. When we have training data with target outputs, the objective function most often (fully or partly) consists of the similarity of the model output to the training data.
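
A tiny illustration (ours, with hypothetical screening data) of how a badly chosen objective rewards “cheating”: a model that always predicts “no disease” scores almost perfectly on raw accuracy when the disease is rare, while detecting nobody:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical screening data: the disease occurs in only 1% of patients.
y_true = (rng.random(10_000) < 0.01).astype(int)

# A "model" that has found the easiest way to score well: always predict 0.
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
recall = y_pred[y_true == 1].mean()  # fraction of sick patients detected

print(f"accuracy = {accuracy:.1%}, recall = {recall:.1%}")
# ~99% accuracy, 0% recall: optimizing raw accuracy rewards a useless model.
```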

Another approach has, however, been developed in recent years. In situations where it proves very difficult to produce a good objective function, researchers have instead made the system learn the objective function. Thus the system learns both the objective function and how to optimize it. This approach consists in humans observing the output/actions of the machine learning system and scoring the system themselves. Then the part of the system that learns the objective function (we will call this part the evaluator) will recognize what the humans want and assign the system a performance score. In cases where the evaluator is very uncertain about the score it should give, it asks the humans, from which it learns even further [13, 14]. This approach, however, is often very problematic, as humans are often not perfect judges of what they want the machine to do, and the evaluator will frequently fail to correctly encode what the humans actually want [15].

That we are able to find a good objective function for our model is absolutely crucial for the machine learning algorithm to succeed. This attribute is thus usually a requirement, not merely an indicator. However, we sometimes do not need a bona fide utility function, as when we use unsupervised learning to discover some structure in our data.

Classification

We have seen that machine learning has proven very good at classification problems. It has proven excellent at detecting spam, recognizing various objects (like cats) in pictures, detecting credit card fraud, and so on. Machine learning algorithms do this by learning to predict the probability that the input data belongs to each class [1]. An example of a scientific classification problem that has already been automated with machine learning is galaxy classification: deep neural networks can rapidly look at images of galaxies and classify them with around 97% accuracy [16].

There are many classification problems in science with ideal machine learning characteristics: they have large data sets in which each data point needs to be assigned to some class, and classifying all this data would require a vast number of professional human hours. Some of these classification problems are also such that the classification must be done accurately. An example would be recognizing, from some medical scan, whether a patient has a disease. We are already starting to automate this process with machine learning methods, where deep learning is used to examine medical scans to detect many kinds of cancer, for example breast cancer and prostate cancer [17]. The professional person-hours that can be saved by automating this kind of classification are incredibly valuable.

We have talked about classification problems that humans can do pretty well, but there are also scientific classification problems where humans cannot do the classification with any acceptable accuracy. For these problems, we absolutely require methods that can classify accurately, and machine learning is a very good candidate for many of them. We will look at some examples of scientific classification problems in chapter four.
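
As an illustration of the basic workflow (ours; the synthetic features merely stand in for, say, pixel data from scans or galaxy images), note how the classifier outputs a probability per class, as described above:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for real features extracted from, e.g., medical scans.
X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1_000).fit(X_train, y_train)

# The model outputs a probability for each class, as described above.
print(clf.predict_proba(X_test[:3]).round(3))
print(f"held-out accuracy: {clf.score(X_test, y_test):.1%}")
```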

Of course, it is not necessary for a problem to be a classification problem for us to solve it with machine learning, but being one should increase the chances that it can be.

Lots of data or an accurate simulator

As we know, machine learning algorithms learn from data. This means that whatever machine learning model we choose to apply to a given problem, we will need sufficient amounts of data to train it. For example, to train a model that predicts the protein structure that a given chain of amino acids will assume, we need a large dataset of examples, with the amino acid sequence as the input variable and the structure of the protein as the target variable. A large problem for contemporary machine learning is that it needs an incredibly large amount of data to learn. Humans require very little data to learn, for example, associations between objects and words, or to understand simple instructions or explanations of concepts. Contemporary machine learning is very bad at this, and requires very large amounts of data to learn such things. We hope that future breakthroughs will give our models the ability to learn from smaller amounts of data, but for now, we need very large amounts of data if we hope to employ machine learning algorithms on a problem. Thus, when searching for problems that machine learning could solve, we should at the moment dismiss any problem for which there does not exist enough data and for which we cannot in some way create data. For some problems, we can make an efficient simulator to produce or complement our data. For example, to train a model to predict desired aspects of quantum systems, we can make simulators of those systems, or of simpler components of these systems, and train our models with the data those simulators produce. This often leads to a much more efficient simulator and/or predictor of the system [18].
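
A sketch of this idea (ours; a damped harmonic oscillator stands in for the far more involved systems discussed above) shows how a cheap simulator can mass-produce training pairs:

```python
import numpy as np

def simulate(x0, v0, steps=100, dt=0.05, k=1.0, damping=0.1):
    """A crude simulator: a damped harmonic oscillator, integrated with Euler steps."""
    x, v, states = x0, v0, []
    for _ in range(steps):
        a = -k * x - damping * v
        v += a * dt
        x += v * dt
        states.append((x, v))
    return np.array(states)

rng = np.random.default_rng(0)
# Build a training set of (current state -> next state) pairs from many runs.
inputs, targets = [], []
for _ in range(200):
    traj = simulate(rng.uniform(-1, 1), rng.uniform(-1, 1))
    inputs.append(traj[:-1])
    targets.append(traj[1:])
X, Y = np.vstack(inputs), np.vstack(targets)
print(X.shape, Y.shape)  # thousands of training examples from a cheap simulator
```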

Underlying distribution is not clear, or cannot be found analytically

In very many problems in science, we want to know (often implicitly) the probability distribution our data is coming from. For example, we often want to predict some target output: protein structure given a chain of amino acids, disease presence given some medical scan, the future state of a physical system given its initial state, or the likelihood that a person will have some disease given their genome. In these cases, if we cannot find the answer analytically, we are implicitly trying to estimate the probability distribution of the target variable(s) given the data. Modern machine learning excels at estimating the probability distribution of its data, as the plethora of image-recognizing neural networks shows. By observing data, they learn to predict the probability that there is, say, a cat in the image; this is implicit learning of some underlying data distribution. When we train the model, we are inching it ever closer to an accurate representation of the underlying density function [1]. If the underlying data distribution of the system cannot be found analytically, or is painful to approximate, machine learning seems like an ideal solution, provided we can find the right model for the situation and the problem has the other necessary attributes of a machine learning candidate.

Another type of machine learning system is the generative model. These are generative in the sense that they generate new samples from some distribution we are interested in. A famous type of generative model is the Generative Adversarial Network, or GAN. These are neural networks explicitly made to learn the underlying distribution of the data they receive [19]. They can observe data, for example pictures of cats, and learn the underlying distribution from which the data came. After the system has learned that distribution, it can sample from it to generate, for example, synthetic images of cats. This is very useful in many kinds of problems, for example in drug discovery, where GANs can generate designs of new drugs that we can then test with various methods [20]. For many problems that require generating new samples from some unknown probability distribution, these generative models will prove very useful.
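
Below is a heavily condensed sketch of the GAN training loop (ours; it learns a one-dimensional Gaussian rather than images of cats or drug candidates, and the architectures are arbitrary). The discriminator learns to tell real samples from generated ones, while the generator learns to fool it:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# The real data distribution we want the generator to learn: N(4, 1.25).
def real_samples(n):
    return 4.0 + 1.25 * torch.randn(n, 1)

G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))  # generator
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))  # discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(2000):
    # Train D to tell real samples (label 1) from generated ones (label 0).
    real, fake = real_samples(64), G(torch.randn(64, 8)).detach()
    loss_d = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # Train G to produce samples that D classifies as real.
    fake = G(torch.randn(64, 8))
    loss_g = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()

samples = G(torch.randn(1000, 8))
print(samples.mean().item(), samples.std().item())  # should approach 4 and 1.25
```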

We conclude that the underlying distribution of the data not being analytically known is a good indicator that the problem might be approached with machine learning, provided the critical requirements are met.

Simulations

There are incredibly many systems in science that we would like to simulate. For example, we would very much like to be able to quickly, accurately, and efficiently simulate how some type of physical system will evolve, given some initial state. We might want to simulate how a fluid would behave given an initial condition, or how some cosmic system will evolve given initial conditions (for these systems, we might want to simulate the evolution over a very, very long time). The problem is that many of these systems are very hard to simulate. Simulation might be far too computationally expensive, we might not know the underlying mechanics exactly, we might not have any good approximation methods, or we might not have any good analytic methods to simulate the system at all. Machine learning is proving very useful in this regard. The generative systems we touched on in the previous section can be, and are, very useful for creating simulators of systems that are otherwise very computationally expensive or completely infeasible to simulate. For example, many cosmic simulations, such as classical N-body simulations, are incredibly computationally expensive, and without machine learning are very slow and resource-intensive. These simulations can be, and in some cases are, carried out by generative adversarial networks [21].

That the problem is a simulation problem is an indicator that machine learning is a good candidate for it, but it is of course not a requirement.

Problems suited for self-learning systems

We have now seen which attributes can make a problem a good candidate for machine learning algorithms. We will now take a look at some specific problems in science that machine learning is appropriate for. We will look at problems on which researchers are already applying machine learning algorithms with success, and investigate which of our criteria they satisfy. We will also look at problems that are not yet being solved with machine learning, but that, according to our criteria, machine learning would be an appropriate way to approach.

Physics Simulations

We suggested that creating simulators is a good class of problems for machine learning. We therefore look at the problem of simulating physical systems.

Good simulators of physical systems are incredibly important to many fields of science and engineering. However, these simulators have traditionally been very expensive to create and use. The traditional way to build them is to program the rules of physics into the simulator. This can be very time-consuming, and many design and optimization decisions have to be made when building the simulator. The designers need to settle on rules for approximating the laws of physics, and these approximations are often not accurate enough to produce sufficiently accurate results. Furthermore, running a given simulation is very computationally expensive if it is to be sufficiently accurate. This problem is, however, being tackled with machine learning methods, with great success [22]. Researchers have made a machine learning model that observes how physical systems behave and learns to simulate that behavior. Building the simulator then becomes a problem of creating a sufficient model. With this method, the researchers no longer need to approximate the laws of physics themselves, so building the simulator becomes much faster once the methods for making such simulators mature. Furthermore, the simulations then run much faster, because the machine learning model has learned very accurate and efficient ways of approximating the behavior of the physical systems, and the model's predictions are more accurate than those of traditionally built simulators [22].
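
A toy version of this recipe (ours; the cited work [22] uses graph networks on particle systems, while we fit an off-the-shelf network to a damped oscillator) trains a model on simulator-produced state transitions and then rolls it forward on its own:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def simulate(x0, v0, steps=100, dt=0.05):
    """Ground-truth dynamics: a damped harmonic oscillator, Euler-integrated."""
    x, v, states = x0, v0, []
    for _ in range(steps):
        v += (-x - 0.1 * v) * dt
        x += v * dt
        states.append((x, v))
    return np.array(states)

rng = np.random.default_rng(0)
trajs = [simulate(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(200)]
X = np.vstack([t[:-1] for t in trajs])  # current states
Y = np.vstack([t[1:] for t in trajs])   # next states

# The learned simulator: a neural network fit to the observed dynamics.
model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500).fit(X, Y)

# Roll the learned model forward from a new initial state.
state = np.array([[0.5, 0.0]])
for _ in range(5):
    state = model.predict(state).reshape(1, -1)
    print(state.round(3))
```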

We see that this problem fits many of our criteria beautifully. It is not difficult to create abundant data for these machine learning systems, as we can simply show the model examples of how physical systems behave in the real world, so the problem fits our data requirement. The solution space is combinatorial, which is to say that the number of possible ways the system could evolve grows combinatorially; we saw that this is an indicator that it could be smart to apply machine learning methods to the problem. This supports our claim that machine learning can be a good fit for these kinds of problems. The objective function for this problem is also clear: we simply compare the model's predictions to how the system evolves in the real world. Thus this problem fits our objective function criterion as well.

Quantum Chemistry

Demis Hassabis, the cofounder and CEO of Google DeepMind, has suggested that quantum chemistry is a good domain in which to employ machine learning models [7]. This is borne out by advances in applying machine learning to quantum chemistry in recent years [23]. We will try to figure out why, and compare the computational problems of quantum chemistry with the attributes we decided could indicate a good machine learning problem.

Quantum chemistry is used, among other things, to discover novel materials with desirable properties. What chemists need to do in these situations is first find some chemical structure to test, and then actually test it for the properties they desire. To do this, chemists usually rely on trial-and-error methods to find a chemical structure, and then use incredibly computationally expensive methods to predict its properties [23]. What researchers have tried, and partly succeeded in doing, is making machine learning algorithms that learn the relationship between chemical structure and material properties, which is much more efficient than previous methods. Thus the current machine learning methods accelerate the identification of materials with the requested properties. Why would machine learning be a candidate for this? First, the solution space is combinatorially large, which we know is an indicator that machine learning might succeed. There also exists a lot of data, both from calculations made by the computationally expensive methods currently in use, and from experiments. The underlying connection between chemical structure and material properties is also very hard to approximate efficiently, if this can be done with traditional methods at all. Lastly, researchers have not had great trouble finding good objective functions. Our properties would thus predict that this is a good machine learning problem.

Protein Folding

We have used the protein folding problem as an example many times in this paper. We will now try to understand why it would be a good target for machine learning algorithms.

As of 2020, Google DeepMind's AlphaFold system vastly outperforms all other techniques for predicting a final protein structure given a chain of amino acids. To achieve this, grossly oversimplifying, they trained a deep neural network to perform the task. Why is protein folding a good problem to tackle with machine learning? It turns out that it fits many of our predictors of a good machine learning problem. The solution space of the protein folding problem is combinatorially large, and we saw that this is an indicator that machine learning might be a good approach. Sufficient data exists for protein folding: to oversimplify, the data is the accumulated observations of the structures of proteins with known amino acid sequences [24]. Thus our data requirement is satisfied. The objective function is also clear: we simply measure how close the predicted structure is to the observed structure.

Disease detection

We have already seen that machine learning excels at observing medical scans or data, and predicting if a given disease is present. We can, for example, detect many types of cancer by observing various types of scans of the patient [17], we can detect retinal diseases from pictures using deep learning [25], and we can detect melanoma from pictures of moles [26].

It seems the problem of detecting diseases from medical scans is very well suited for machine learning. We again see that our indicators agree. First, observe that there is an abundance of data on most of these diseases: large datasets exist of scans along with the diagnoses that followed, so this problem satisfies our data requirement. The objective function is clear in this case: predict the same diagnosis as the training data does. This satisfies the objective function requirement. Disease detection is a class of classification problems, as we are either asking the system for a yes/no answer to the question of whether a disease is present, or the model may be able to predict among several diseases. We saw that a problem being a classification problem increases the likelihood that it could be solved with machine learning. In this class of problems, we presumably cannot analytically find the distribution of scans that indicate the disease we are looking for; this is why we have had to rely on humans to look at the scans and diagnose the patient. However, we saw that machine learning approaches are often incredibly good at approximating underlying data distributions given only data. This is a perfect example of machine learning approximating the underlying distribution very well in a case where this is otherwise very hard to do.

Conclusions

In this article, we wanted to see if we could find some properties that indicate whether a problem can be approached with machine learning methods with any hope of success. With such properties, we could presumably find the problems that could be solved with machine learning methods, and find them efficiently and systematically.

We found two attributes that are necessary but not sufficient (at least with contemporary methods) for a problem to be approached with machine learning: there must be a sufficient amount of data for the machine learning algorithm to learn from, and we must be able to create a good objective function for the model (either explicitly, or by training an evaluator to learn the objective function). These conditions are absolutely critical if we are to employ a machine learning algorithm on our problem, so we can efficiently dismiss the possibility that a problem could be solved with machine learning methods if it lacks them. We also found some attributes that indicate that machine learning could be an appropriate approach. If a problem has a large combinatorial solution space, we should consider machine learning as a candidate solution. If the underlying data distribution is not clear, or cannot be found analytically, we should also consider machine learning, as machine learning algorithms have proven good at approximating distributions that are very hard to find analytically or to approximate with other methods. Finally, simulation problems can be attempted (fully or partly) with machine learning, as that can decrease the computational demand of the simulation, improve accuracy, and reduce the expert hours needed to build the simulator.

Finally, we looked at some scientific problems that have recently seen much progress using machine learning, and compared their attributes to the attributes we found. We saw that these correlated adequately well. However, we performed no mathematical analysis of these comparisons; they were purely illustrative examples.

This process of systematically finding the attributes that make problems especially suited for machine learning approaches could yield a systematic and efficient way to evaluate whether a problem could be solved with machine learning, and could thus make it easier to find the problems that contemporary machine learning is suited for. This would let us employ machine learning techniques on useful scientific problems sooner. Undoubtedly, many scientific problems today are ripe for solving with machine learning, and this article hopes to be a step in the general direction of developing a systematic way to find these problems efficiently.

References

[1] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics). Berlin, Heidelberg: Springer-Verlag, 2006.

[2] P. Dayan, M. Sahani, and G. Deback, “Unsupervised learning,” The MIT Encyclopedia of the Cognitive Sciences, pp. 857–859, 1999.

[3] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.

[4] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.

[5] A. W. Senior, R. Evans, J. Jumper, J. Kirkpatrick, L. Sifre, T. Green, C. Qin, A. Žídek, A. W. R. Nelson, A. Bridgland, H. Penedones, S. Petersen, K. Simonyan, S. Crossan, P. Kohli, D. T. Jones, D. Silver, K. Kavukcuoglu, and D. Hassabis, “Improved protein structure prediction using potentials from deep learning,” Nature, vol. 577, no. 7792, pp. 706–710, 2020.

[6] E. G. Dada, J. S. Bassi, H. Chiroma, S. M. Abdulhamid, A. O. Adetunmbi, and O. E. Ajibuwa, “Machine learning for email spam filtering: review, approaches and open research problems,” Heliyon, vol. 5, no. 6, p. e01802, 2019. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S

[7] D. Hassabis, “The power of self-learning systems,” IAS. [YouTube video]. [Online]. Available: https://www.youtube.com/watch?v=wxis9FrCHbw&t=622s

[8] V. Duros, J. Grizou, A. Sharma, S. H. M. Mehr, A. Bubliauskas, P. Frei, H. N. Miras, and L. Cronin, “Intuition-enabled machine learning beats the competition when joint human-robot teams perform inorganic chemical experiments,” Journal of Chemical Information and Modeling, vol. 59, no. 6, pp. 2664–2671, 2019, PMID: 31025861. [Online]. Available: https://doi.org/10.1021/acs.jcim.9b

[9] Wikibooks, “Structural Biochemistry/Proteins/Protein Folding — Wikibooks, the free textbook project,” 2020, [Online; accessed 14-March-2021]. [Online]. Available: https://en.wikibooks.org/w/index.php?title=StructuralBiochemistry/Proteins/ProteinFolding&oldid=3701092

[10] Y. Bengio, A. Lodi, and A. Prouvost, “Machine learning for combinatorial optimization: A methodological tour d’horizon,” European Journal of Operational Research, vol. 290, no. 2, pp. 405–421, 2021. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S

[11] N. Bostrom, Superintelligence: Paths, Dangers, Strategies, 1st ed. USA: Oxford University Press, Inc., 2014.

[12] R. Miles, “Reward hacking: Concrete problems in AI safety part 3,” Aug. 12, 2017. [YouTube video]. [Online]. Available: https://www.youtube.com/watch?v=92qDfT8pENs

[13] M. Wimmer, F. Stulp, S. Pietzsch, and B. Radig, “Learning local objective functions for robust face model fitting,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 8, pp. 1357–1370, 2008.

[14] C.-C. Lo, S.-W. Fu, W.-C. Huang, X. Wang, J. Yamagishi, Y. Tsao, and H.-M. Wang, “Mosnet: Deep learning based objective assessment for voice conversion,” 2019.

[15] R. Miles, “Training AI without writing a reward function, with reward modelling.” [YouTube video]. [Online]. Available: https://www.youtube.com/watch?v=PYylPRX6z4Q

[16] N. E. M. Khalifa, M. H. N. Taha, A. E. Hassanien, and I. M. Selim, “Deep galaxy: Classification of galaxies based on deep convolutional neural networks,” CoRR, vol. abs/1709.02245, 2017. [Online]. Available: http://arxiv.org/abs/1709.02245

[17] G. Litjens, C. I. Sánchez, N. Timofeeva, M. Hermsen, I. Nagtegaal, I. Kovacs, C. Hulsbergen-Van De Kaa, P. Bult, B. Van Ginneken, and J. Van Der Laak, “Deep learning as a tool for increased accuracy and efficiency of histopathological diagnosis,” Scientific Reports, vol. 6, no. 1, pp. 1–11, 2016.

[18] J. Carrasquilla, “Machine learning for quantum matter,” Advances in Physics: X, vol. 5, no. 1, p. 1797528, 2020. [Online]. Available: https://doi.org/10.1080/23746149.2020.1797528

[19] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial networks,” Communications of the ACM, vol. 63, no. 11, pp. 139–144, 2020.

[20] A. Zhavoronkov, “Artificial intelligence for drug discovery, biomarker development, and generation of novel chemistry,” Molecular Pharmaceutics, vol. 15, no. 10, pp. 4311–4313, 2018. [Online]. Available: https://doi.org/10.1021/acs.molpharmaceut.8b

[21] A. C. Rodríguez, T. Kacprzak, A. Lucchi, A. Amara, R. Sgier, J. Fluri, T. Hofmann, and A. Réfrégier, “Fast cosmic web simulations with generative adversarial networks,” Computational Astrophysics and Cosmology, vol. 5, no. 1, pp. 1–11, 2018.

[22] A. Sanchez-Gonzalez, J. Godwin, T. Pfaff, R. Ying, J. Leskovec, and P. W. Battaglia, “Learning to simulate complex physics with graph networks,” 2020.

[23] Y. Kang, L. Li, and B. Li, “Recent progress on discovery and properties prediction of energy materials: Simple machine learning meets complex quantum chemistry,” Journal of Energy Chemistry, 2020.

[24] B. Manavalan, K. Kuwajima, and J. Lee, “PFDB: A standardized protein folding database with temperature correction,” Scientific Reports, vol. 9, no. 1, pp. 1–9, 2019.

[25] J. De Fauw, J. R. Ledsam, B. Romera-Paredes, S. Nikolov, N. Tomasev, S. Blackwell, H. Askham, X. Glorot, B. O’Donoghue, D. Visentin et al., “Clinically applicable deep learning for diagnosis and referral in retinal disease,” Nature Medicine, vol. 24, no. 9, pp. 1342–1350, 2018.

[26] S. Jinnai, N. Yamazaki, Y. Hirano, Y. Sugawara, Y. Ohe, and R. Hamamoto, “The development of a skin cancer classification system for pigmented skin lesions using deep learning,” Biomolecules, vol. 10, no. 8, p. 1123, 2020.