Notice: No-Free-Lunches for Anyone, Bayesians Included

This page was last edited on 09/04/02 by Malcolm R Forster

Abstract

The problem with the no-free-lunch theorems in machine learning is that they are rather complicated and difficult to follow because they are proven under the most general conditions possible (Wolpert 1992, 1995, 1996). The purpose of this note is to present the idea behind them in a very simple example.
    Consider an imaginary universe that lasts for exactly 2 days, where on each day there exists exactly one object, which is either a sphere or a cube. The object may or may not be the same shape on both days. There are exactly 4 possible histories that this world may have: (sphere, sphere), (sphere, cube), (cube, sphere) and (cube, cube). A learning rule tells us what to predict on the second day given what is observed on the first day. In this world, there are exactly 4 learning rules: Same = "same shape on both days", Diff = "different shapes on the two days", Sphere = "sphere on the second day no matter what", and Cube = "cube on the second day no matter what". Given that the four histories have the same probability (1/4), what is the probability that each learning rule will make a correct prediction on the second day? If you work it out, you will find that the probability is 1/2 for all four learning rules.
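    As a quick check, the calculation can be run as a minimal Python sketch; the shapes, histories, and rule names are simply the ones introduced above.

        from itertools import product

        SHAPES = ("sphere", "cube")

        # Each learning rule maps the day-1 observation to a day-2 prediction.
        rules = {
            "Same":   lambda day1: day1,
            "Diff":   lambda day1: "cube" if day1 == "sphere" else "sphere",
            "Sphere": lambda day1: "sphere",
            "Cube":   lambda day1: "cube",
        }

        # The four equiprobable histories (day 1, day 2).
        histories = list(product(SHAPES, repeat=2))

        for name, rule in rules.items():
            # Fraction of histories in which the rule predicts day 2 correctly.
            p = sum(rule(d1) == d2 for d1, d2 in histories) / len(histories)
            print(name, p)  # every rule prints 0.5
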
    In our simple imaginary universe, we might insist that whatever appears on the first day will also appear on the second day. This uniformity of nature assumption implies that there are only two possible histories of the universe: (sphere, sphere) and (cube, cube). Assuming that each of these possible worlds has a probability of 1/2, the learning rules Same, Diff, Sphere, and Cube have probabilities of successful prediction on the second day of 1, 0, 1/2, and 1/2, respectively. Now Same is the best learning rule. But this conclusion is only reached by assuming that nature is uniform. Once the theorems are understood in this simple example, it is plausible that they extend to more general circumstances. Philosophers will recognize a connection with Hume's problem of induction, the main difference being that the no-free-lunch theorems are explicitly formulated within a probabilistic framework.
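    The same sketch, restricted to the two uniform-nature histories (and repeated here so it stands on its own), gives the numbers just quoted.

        # The same four rules, now evaluated only on the uniform-nature histories,
        # each carrying probability 1/2.
        rules = {
            "Same":   lambda day1: day1,
            "Diff":   lambda day1: "cube" if day1 == "sphere" else "sphere",
            "Sphere": lambda day1: "sphere",
            "Cube":   lambda day1: "cube",
        }
        uniform_histories = [("sphere", "sphere"), ("cube", "cube")]

        for name, rule in rules.items():
            p = sum(rule(d1) == d2 for d1, d2 in uniform_histories) / len(uniform_histories)
            print(name, p)  # Same 1.0, Diff 0.0, Sphere 0.5, Cube 0.5
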
    The philosophical lesson of the no-free-lunch theorems is that we should not try to approach epistemology from an a prioristic point of view. Rather, we should begin by accurately describing the learning methods that have been successful, and then ask what real-world assumptions might explain their success, to the extent that they have been successful.
    Bayesian conditionalization is a learning rule that falls within the scope of the no-free-lunch theorems. So it is puzzling to see Bayesians seek a priori justifications for the rule in terms of coherence when coherent and incoherent rules have the same a priori probability of predictive success. It may be that coherent rules are, in fact, more successful, but their coherence does not automatically explain that success.
    No assumption can guarantee the predictive success of Bayesian conditionalization across the board; its success surely depends on the choice of prior. The no-free-lunch theorems now imply that there is no a priori justification for the choice of prior. Perhaps Bayesians can only wait and see which learning rules work, and then determine post hoc which prior is needed to implement those rules? If so, then what normative role does Bayesian learning play? And what is its descriptive value? The no-free-lunch theorems promise to put a fresh face on some age-old questions.
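    To illustrate how the choice of prior enters, here is a small, purely illustrative sketch (the helper name posterior_day2 is made up for this note): with a uniform prior over the four histories, conditionalization leaves the second day at probability 1/2 whichever shape is observed, whereas a prior concentrated on the uniform-nature histories makes conditionalization implement the Same rule.

        def posterior_day2(prior, observed_day1):
            # P(day-2 shape | day-1 shape), obtained by conditionalizing a prior
            # over histories on the day-1 observation.
            evidence = sum(p for (d1, _), p in prior.items() if d1 == observed_day1)
            return {shape: sum(p for (d1, d2), p in prior.items()
                               if d1 == observed_day1 and d2 == shape) / evidence
                    for shape in ("sphere", "cube")}

        # Uniform prior over all four histories.
        uniform_prior = {("sphere", "sphere"): 0.25, ("sphere", "cube"): 0.25,
                         ("cube", "sphere"): 0.25, ("cube", "cube"): 0.25}

        # Prior that encodes the uniformity of nature.
        uniformity_prior = {("sphere", "sphere"): 0.5, ("cube", "cube"): 0.5}

        print(posterior_day2(uniform_prior, "sphere"))     # {'sphere': 0.5, 'cube': 0.5}
        print(posterior_day2(uniformity_prior, "sphere"))  # {'sphere': 1.0, 'cube': 0.0}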

Download

 


Abstract, formatted as submitted to the 11th International Congress of Logic, Methodology and Philosophy of Science, held in Cracow, Poland, August 20-26, 1999.

Publication Data

Appeared in the conference abstracts.  Only the abstract exists, but see section 6 of How Simple Rules Fit to Reality for more on this subject.

References

Wolpert, David H. (1992): "On the Connection Between In-Sample Testing and Generalization Error," Complex Systems 6: 47-94.

Wolpert, David H. (1995): "The Relationship Between PAC, the Statistical Physics Framework, the Bayesian Framework, and the VC Framework," in Wolpert (ed.), The Mathematics of Generalization, Reading, MA: Addison-Wesley, 117-214.

Wolpert, David H. (1996): "The Lack of A Priori Distinctions Between Learning Algorithms," Neural Computation 8: 1341-1390.