JCM Ensemble Learning Notes |
Source/References |
Introduction |
|
Examples |
|
Bayesian voting |
|
"optimal" in a sense |
|
difficult to construct prior |
|
Manipulating Training Examples |
|
especially good for unstable learning algorithms: NN, decision-tree, rule learning algorithms |
|
Manipulating Input Features |
|
Manipulating Output Targets |
|
Inject Randomness |
|
Bagging (bootstrap aggregation) - training set consists of a sample of N training examples drawn randomly with replacement from the original training set - called a bootstrap replicate. |
|
cross-validated committees - leave out disjoint subsets of the training data |
|
Bagging Predictors, Breiman |
|
Abstract. Bagging predictors is a method for generating multiple versions of a predictor and using these to get an aggregated predictor. The aggregation averages over the versions when predicting a numerical outcome and does a plurality vote when predicting a class. The multiple versions are formed by making bootstrap replicates of the learning set and using these as new learning sets. Tests on real and simulated data sets using classification and regression trees and subset selection in linear regression show that bagging can give substantial gains in accuracy. The vital element is the instability of the prediction method. If perturbing the learning set can cause significant changes in the predictor constructed, then bagging can improve accuracy. 1. Introduction A learning set of L consists of data f(y n ; x n ), n = 1; : : : ; Ng where the y's are either class labels or a numerical response. We have a procedure for using this learning set to form a predictor '(x; L) --- if the input is x we ... |
@article{ breiman96bagging, author = "Leo Breiman", title = "Bagging Predictors", journal = "Machine Learning", volume = "24", number = "2", pages = "123-140", year = "1996" } |
Bagging Equalizes Influence, Y. Grandvalet |
|
Boosting - produces a series of classifiers where the training set used for each member of the series is chosen based on the performance of earlier classifiers. |
|
can be viewed as a stage-wise algorithm for minimizing a particular error function (Brieman(1997)) |
|
J(H) = ∑ [ exp(-yi H(xi))], where yiH(xi) is called the margin because it is the amount by which xi is correctly classified |
|
strong ties to Support Vector Machines |
|
can be viewed as a generalized additive model (Hastie and Tibshirani, 1990) |
|
The Boosting Approach to Machine Learning: An Overview, R.E. Schapire |
|
Abstract. Boosting is a general method for improving the accuracy of any given learning algorithm. Focusing primarily on the AdaBoost algorithm, this chapter overviews some of the recent work on boosting including analyses of AdaBoost’s training error and generalization error; boosting’s connection to game theory and linear programming; the relationship between boosting and logistic regression; extensions of AdaBoost for multiclass classification problems; methods of incorporating human knowledge into boosting; and experimental and applied work using boosting. |
|
A Short Introduction to Boosting, Freund and Schapire |
|
Abstract. Boosting is a general method for improving the accuracy of any given learning algorithm. This short overview paper introduces the boosting algorithm AdaBoost, and explains the underlying theory of boosting, including an explanation of why boosting often does not suffer from overfitting as well as boosting’s relationship to support-vector machines. Some examples of recent applications of boosting are also described. |
@misc{ freund99short, author = "Y. Freund and R. Schapire", title = "A short introduction to boosting", text = "Y. Freund and R. Schapire. A short introduction to boosting, J. Japan. Soc. for Artif. Intel. 14(5) (1999), 771-780. 11", year = "1999", url = "citeseer.ist.psu.edu/freund99short.html" } |
ADABOOST - maintains a set of (voting) weights that are updated at each iteration based on whether a training example is misclassified or not. |
|
Adaboost gives best performance but reasons for its success are not fully understood |
|
Empirical Studies |
|
Popular Ensemble Methods: An Empirical Study, D. Opitz and R. Maclin |
|
Empirical evaluation of bagging and boosting for NN and decision trees |
|
Bagging almost always outperforms a single classifier |
|
Boosting can outperform bagging and single clasifier, but not always |
|
Boosting may suffer from overfitting in presence of noise |
|
Simple ensemble approach using NN that differ only in their random intial weight settings often does as well as Bagging |
|
Performance of Boosting depends on data more strongly than Bagging |
|
Abstract |
|
An ensemble consists of a set of individually trained classifers (such as neural networks or decision trees) whose predictions are combined when classifying novel instances. Previous research has shown that an ensemble is often more accurate than any of the single classi ers in the ensemble. Bagging (Breiman, 1996c) and Boosting (Freund & Schapire, 1996; Schapire, 1990) are two relatively new but popular methods for producing ensembles. In this paper we evaluate these methods on 23 data sets using both neural networks and decision trees as our classi cation algorithm. Our results clearly indicate a number of conclusions. First, while Bagging is almost always more accurate than a single classi er, it is sometimes much less accurate than Boosting. On the other hand, Boosting can create ensembles that are less accurate than a single classifer, especially when using neural networks. Analysis indicates that the performance of the Boosting methods is dependent on the characteristics of the data set being examined. In fact, further results show that Boosting ensembles may over t noisy data sets, thus decreasing its performance. Finally, consistent with previous studies, our work suggests that most of the gain in an ensemble's performance comes in the rst few classi ers combined; however, relatively large gains can be seen up to 25 classi ers when Boosting decision trees. |
|
Experiments with a New Boosting Algorithm, Freund and Schapire |
|
Applications of Ensemble Methods |
|
Rapid-response Adaptive Computing Environments for LHC Physics Analysis |
|
The University of Wisconsin-Madison and Computer Science groups proposed to further strengthen this productive collaboration and work as one interdisciplinary team to define and build a Rapid-response Adaptive Computing Environment (RACE) for enabling timely extraction of LHC physics. In particular, they propose to build a RACE for CMS trigger system data validation and data forensics, in online data acquisition and offline data processing stages of CMS computing. |
|
In order to improve the accuracy of the prediction and reduce the number of false alarms, we are investigating ensemble methods, which apply prediction models for various chunks of the stream and make the prediction by summarizing the results from individual classifiers. We propose two types of ensembles: The first ensemble approach makes prediction over individual events in the output stream and aggregates the prediction result across a fixed time duration. The second approach first generates data summaries for nearby events and makes prediction based on these data summaries |
|
Separting Signal from Background Using Ensembles of Rules, Jerome H. Friedman, Department of Statistics and Stanford Linear Accelerator Center |
|
Abstract |
|
Machine learning has emerged as a important tool for separating signal events from associated background in high energy particle physics experiments. This paper describes a new machine learning method based on ensembles of rules. Each rule consists of a conjuction of a small number of simple statements (“cuts”) concerning the values of individual input variables. These rule ensembles produce predictive accuracy comparable to the best methods. However their principal advantage lies in interpretation. Because of its simple form, each rule is easy to understand, as is its influence on the predictive model. Similarly, the degree of relevance of each of the respective input variables can be assessed. Graphical representations are presented that can be used to ascertain the dependence of the model jointly on the variables used for prediction |
|
Interesting Side Relations |
|
Wisdom of Crowds - Why the Many Are Smarter Than the Few and How Collective Wisdom Shapes Business, Economies, Societies and Nations, James Surowiecki, Doubleday, 2004 |
|
Review from Publishers Weekly. While our culture generally trusts experts and distrusts the wisdom of the masses, New Yorker business columnist Surowiecki argues that "under the right circumstances, groups are remarkably intelligent, and are often smarter than the smartest people in them." To support this almost counterintuitive proposition, Surowiecki explores problems involving cognition (we're all trying to identify a correct answer), coordination (we need to synchronize our individual activities with others) and cooperation (we have to act together despite our self-interest). His rubric, then, covers a range of problems, including driving in traffic, competing on TV game shows, maximizing stock market performance, voting for political candidates, navigating busy sidewalks, tracking SARS and designing Internet search engines like Google. If four basic conditions are met, a crowd's "collective intelligence" will produce better outcomes than a small group of experts, Surowiecki says, even if members of the crowd don't know all the facts or choose, individually, to act irrationally. "Wise crowds" need (1) diversity of opinion; (2) independence of members from one another; (3) decentralization; and (4) a good method for aggregating opinions. The diversity brings in different information; independence keeps people from being swayed by a single opinion leader; people's errors balance each other out; and including all opinions guarantees that the results are "smarter" than if a single expert had been in charge. Surowiecki's style is pleasantly informal, a tactical disguise for what might otherwise be rather dense material. He offers a great introduction to applied behavioral economics and game theory. Copyright © Reed Business Information, a division of Reed Elsevier Inc. All rights reserved. |
|
Smarter Than the CEO, Surowiecki, Wired, Issue 12.06, June 2004 |
|
Artificial Markets |
|
Assessing the probabilities of future events is a problem often faced by science policymakers. For example, CERN, the European laboratory for particle physics, recently had to judge whether the probability of discovering a Higgs boson was high enough to justify extending the operation of its collider (see Science, 22 Sept., p. 2014 and 29 Sept., p. 2260). At the Foresight Exchange (FX) Web site (www.ideosphere.com—), traders can actually bet on the outcomes of unresolved scientific questions, including whether physicists will discover the Higgs boson by 2005. The going price of the security (0.77 as of 24 Jan) can be seen as the market's assessment of the probability of the particle's discovery. FX is only a game, run with play money (FX dollars). Empirical studies [1], laboratory investigations [2], and policy proposals [3] argue that prices of real-money securities do constitute accurate likelihoods, since traders have strong (monetary) incentives to leverage pertinent information. But can we place legitimate credence on the accuracy of FX prices, which are determined solely through competition in a play-money market game? |
|
Foresight Exchange Prediction Market |
|
NewsFutures |
|
Prediction Markets BasicsIn the simplest case, you define an outcome for which you would like a reliable estimate. Then you invite people with relevant knowledge to trade "virtual" stock based on their confidence in this outcome. The result is a trading price that tracks the consensus opinion (in contrast with the average opinion that a poll would yield). Because the market is online, it involves any number of participants, from anywhere, at any time. |
|
More References |
|
Ensemble Methods in Machine Learning26 Jul 2006 10:57 • Pedro Domingos, "The Role of Occam's Razor in Knowledge Discovery," Data Mining and Knowledge Discovery,3 (1999) [Online. Ensemble methods as an apparent violation of Occam's Razor.] • G. Langer and U. Parlitz, "Modeling parameter dependence from time series", Physical Review E70 (2004): 056217 [Interesting use of ensemble methods in state space modeling] • Laurence K. Saul and Michael I. Jordan, "Mixed Memory Markov Models: Decomposing Complex Stochastic Processes as Mixtures of Simpler Ones", Machine Learning37 (1999): 75--87 • Robert E. Schapire, Yoav Freund, Peter Bartlett and Wee Sun Lee, "Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods", Annals of Statistics26 (1998): 1651--1686 • To read: • Ran Avnimelech and Nathan Intrator, "Boosted Mixture of Experts: An Ensemble Learning Scheme", Neural Computation11 (1999): 483--497 [Link] • M. Di Marzio and C. C. Taylor, "Kernel density classification and boosting: an L2 analysis", Statistics and Computing15 (2005): 113--123 • Yoav Freund, Yishay Mansour and Robert E. Schapire, "Generalization bounds for averaged classifiers", Annals of Statistics32 (2004): 1698--1722 = math.ST/0410092 • Yoav Freund, Robert E. Schapire, Yoram Singer and Manfred K. Warmuth, "Using and combining predictors that specialize" [PDF preprint] • G. Fumera and F. Roli, "A Theoretical and Experimental Analysis of Linear Combiners for Multiple Classifier Systems", IEEE Transactions on Pattern Analysis and Machine Intelligence27 (2005): 942--956 • Etienne Grossmann, "A Theory of Probabilistic Boosting, Decision Trees and Matryoshki", cs.LG/0607110 • Jakob Vogdrup Hansen, Combining Predictors: Meta Machine Learning Methods and Bias/Variance & Ambiguity Decompositions [Ph.D. thesis, University of Aarhus, 2000; on-line] • Geoffrey E. Hinton, "Training Products of Experts by Minimizing Contrastive Divergence," Neural Computation14 (2002): 1771--1800. • Marcus Hutter and Jan Poland, "Adaptive Online Prediction by Following the Perturbed Leader", cs.AI/0504078 = Journal of Machine Learning Research6 (2005): 639--660 • Robert A. Jacobs, "Bias/Variance Analyses of Mixtures-of-Experts Architectures", Neural Computation9 (1997): 369--383 ["This article investigates the bias and variance of mixtures-of-experts (ME) architectures. The variance of an ME architecture can be expressed as the sum of two terms: the first term is related to the variances of the expert networks that comprise the architecture and the second term is related to the expert networks' covariances. One goal of this article is to study and quantify a number of properties of ME architectures via the metrics of bias and variance. A second goal is to clarify the relationships between this class of systems and other systems that have recently been proposed. It is shown that in contrast to systems that produce unbiased experts whose estimation errors are uncorrelated, ME architectures produce biased experts whose estimates are negatively correlated."] • Wenxin Jiang, "Boosting with Noisy Data: Some Views from Statistical Theory", Neural Computation16 (2004): 789--810 • Ludmila I. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms • Nicole Kraemer, "Boosting for Functional Data", math.ST/0605751 • Guillaume Lecu&eaucte;, "Lower Bounds and Aggregation in Density Estimation", Journal of Machine Learning Research7 (2006): 971--981 • Seiji Miyoshi, Kazuyuki Hara, and Masato Okada, "Analysis of ensemble learning using simple perceptrons based on online learning theory", Physical Review E71 (2005): 036116 • L. Nunes and E. Oliveira, "On Learning by Exchanging Advice," cs.LG/0203010 • Frenando C. Pereira and Yoram Singer, "An Efficient Extension to Mixture Techniques for Prediction and Decision Trees", Machine Learning36 (1999): 183--199 • Evgueni Petrov, "Constraint-based analysis of composite solvers," cs.AI/0302036 • Yoram Singer, "Adaptive Mixtures of Probabilistic Transducers", Neural Computation9 (1997): 1711--1733 [PS.gz preprint] • Eiji Takimoto and Akira Maruoka, "Top-down decision tree learning as information based boosting," Theoretical Computer Science292 (2002): 447-464 • Héla Zouari, Laurent Heutte and Yves Lecourtier, "Controlling the diversity in classifier ensembles through a measure of agreement", Pattern Recognition38 (2005): 2195--2199 |