The problem with the no-free-lunch theorems in machine
learning is that they are rather complicated and difficult to follow because they are
proven under the most general conditions possible (Wolpert 1992, 1995, 1996). The purpose
of this note is to present the idea behind them in a very simple example.
Consider an imaginary universe that lasts for exactly 2 days, where on
each day there exists exactly one object, which is either a sphere or a cube. The object
may or may not be the same shape on both days. There are exactly 4 possible histories that
this world may have: (sphere, sphere), (sphere, cube), (cube, sphere) and (cube, cube). A
learning rule tells us what to predict on the second day given what is observed on the
first day. In this world, there are 4 learning rules: *Same* = "same on both
days," *Diff* = "different on both days", *Sphere* = "sphere
on second day no matter what", and *Cube* = "cube on second day no matter
what". Given the four histories have the same probability (1/4), what is the
probability that each learning rule will make a correct prediction on the second day? If
you work it out, you will find that the probability is 1/2 for all four learning rules.
In our simple imaginary universe, we might insist that whatever appears
on the first day will also appear on the second day. This *uniformity of nature*
assumption implies that there are only two possible states of the universe: (sphere,
sphere) and (cube, cube). Assuming that these each of these possible worlds has a
probability of 1/2, the learning rules *Same*, *Diff*, *Sphere*, and *Cube*
have probabilities of successful predictions on the second day of 1, 0, 1/2 and 1/2,
respectively. Now *Same* is the best learning rule. But this conclusion is only
reached by assuming that nature is uniform. Once the theorem is understood in this simple
example, it is plausible that they extend to more general circumstances. Philosophers will
recognize a connection with Hume’s problem of induction, where the main difference is
that the no-free-lunch theorems are explicitly formulated within a probabilistic
framework.
The philosophical lesson of the no-free-lunch theorems is that we
should not try to approach epistemology from an a prioristic point of view. Rather, we
should begin by accurately *describing* the learning methods that have been
successful, and then ask what real-world assumptions might *explain* their success,
to the extent that they have been successful.
Bayesian conditionalization is a learning rule that falls within the
scope the no-free-lunch theorems. So it is puzzling to see Bayesians seek *a priori*
justifications for the rule in terms of coherence if coherent and incoherent rules have
the same *a priori *probability of predictive success. It may be that coherent rules
are more successful, in fact, but their coherence does not automatically explain that
success.
No assumption can guarantee the predictive success of Bayesian
conditionalization *carte blanche*; it surely depends on the choice of prior. The
no-free-lunch theorems now imply that there is no *a priori* justification of the
choice of priors. Perhaps Bayesians can only wait and see which learning rules work, and
then determine *post hoc* which prior is needed to implement the rule? If so, then
what normative role does Bayesian learning play? And what is its descriptive value? The
no-free-lunch theorems promise to put a fresh face on some age-old questions. |
**Note**: You need Adobe Acrobat Reader 3.0,
or later, to read and print this article. It is free.
Abstract, formatted as submitted to the 11th
International Congress of Logic, Methodology and Philosophy of Science, to be held in
Cracow, Poland, August 20-26, 1999.
Appeared in the conference abstracts.
Only the abstract exists, but see section 6 of
How Simple Rules Fit to Reality
for more on this subject.
Wolpert, David H. (1992): "On the Connection
Between In-Sample Testing and Generalization Error," *Complex Systems* **6**:
47-94.
Wolpert, David H. (1995): "The Relationship Between PAC, the
Statistical Physics Framework, the Bayesian Framework, and the VC Framework," in
Wolpert (ed.) (1995): *The Mathematics of Generalization*, Reading, MA:
Addison-Wesley 117-214.
Wolpert, David H. (1996): "The lack of a priori distinctions
between learning algorithms," *Neural Computation* **8**: 1341 - 1390. |