Partition Maps

Nicolai Meinshausen
Department of Statistics, University of Oxford, UK

[email protected]

June 15, 2010

Abstract

Tree ensembles, notably Random Forests, have been shown to deliver very accurate predictions on a wide range of regression and classification tasks. A common, yet maybe unjustified, criticism is that they operate as black boxes and provide very little understanding of the data beyond accurate predictions. We focus on multiclass classification and show that Homogeneity Analysis, a technique mostly used in psychometrics, can be leveraged to provide interesting and meaningful visualizations of tree ensemble predictions. Observations and nodes of the tree ensemble are placed in a bipartite graph, connecting each observation to all nodes it falls into. The graph layout is then chosen by minimizing the sum of the squared edge lengths under certain constraints. We propose a variation of Homogeneity Analysis, called Partition Maps, and analyze advantages and shortcomings compared with multidimensional scaling of proximity matrices. Partition Maps has as potential advantages that (a) the influence of the original nodes and variables is visible in the low-dimensional embedding, similar to biplots, (b) new observations can be placed very easily and (c) the test error is very similar to that of the original tree ensemble when using simple nearest-neighbour classification in the two-dimensional Partition Map embedding. Class boundaries, as found by the original tree ensemble algorithm, are thus reflected accurately in Partition Maps, and the low-dimensional visualizations allow meaningful exploratory analysis of tree ensembles.

1 Introduction

For multiclass classification with $K$ classes, let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be $n$ observations of a predictor variable $X \in \mathcal{X}$ and a class response variable $Y \in \{1, \ldots, K\}$. The predictor variable $X$ is assumed to be $p$-dimensional and can include continuous or factorial variables.
Visualizations and low-dimensional embeddings of data can take many forms, from unsupervised manifold learning (Roweis and Saul, 2000; Tenenbaum et al., 2000) to supervised dimensionality reduction or, used here equivalently, low-dimensional embeddings such as Neighbourhood Component Analysis (Goldberger et al., 2005) and related methods (Sugiyama, 2007; Weinberger and Saul, 2009). Our goal here is different: we want to start from an existing algorithm, or rather a family of algorithms that includes Random Forests and bagged classification trees, and provide an effective visualization for members of this family.

Tree ensembles have been shown to deliver very accurate predictions, and Random Forests is arguably one of the best off-the-shelf machine learning algorithms, in the sense that predictive accuracy is in general very close to optimal even without much tuning of parameters. We will mostly be working with Random Forests (Breiman, 2001), bagged decision trees (Breiman, 1996) and some recent developments such as Rule Ensembles (Friedman and Popescu, 2008) that aim to keep the predictive power of Random Forests while reducing the number of final leaf nodes dramatically.

Some existing work on visualizations of trees (Breiman et al., 1984) and tree ensembles has been nicely summarized in Urbanek (2008) and includes mountain plots and trace plots to assess, for example, the stability of trees and which variables have been used as splitting criteria. For low-dimensional embeddings of observations in a classification setting, so-called proximity matrices are often used (Breiman, 2001; Liaw and Wiener, 2002; Lin and Jeon, 2006). For an application to unsupervised learning with Random Forests, see Shi and Horvath (2006). Each entry in the $n \times n$-dimensional proximity matrix measures the fraction of trees in an ensemble for which a pair of observations falls into the same leaf node.
A proximity matrix can be turned into a distance matrix, and there has been work on producing meaningful low-dimensional embeddings, mostly using some form of multidimensional scaling (MDS) (Kruskal, 1964; Borg and Groenen, 1997; De Leeuw, 1988). A potential shortcoming of MDS for visualization is that the original nodes of the tree are not shown in the visualization and in fact do not influence the low-dimensional embedding once the proximity matrix is extracted. Class separation is also often markedly less sharp than for the original tree ensemble.

Here we extend ideas of Homogeneity Analysis (Meulman, 1982; Michailidis and De Leeuw, 1998; Tenenhaus and Young, 1985) to visualizations of tree ensembles. While Homogeneity Analysis was developed in a different context, mainly psychometrics, it is an interesting tool for the visualization of tree ensembles. Observations and nodes or (used here synonymously) rules then form a bipartite graph. Minimizing squared edge distances in this graph yields a very useful low-dimensional embedding of the data. Having nodes (or rules) present in the same plot as the observations, somewhat similar to biplots (Gower and Hand, 1996), has a number of advantages: it adds interpretability to the embedding and allows fast placing of new observations. It also generally leads to sharper class boundaries that reflect the predictive accuracy of the underlying tree ensemble. If the number of classes is small to moderate, it will be shown empirically that a simple nearest-neighbour classification rule is roughly as accurate as the original Random Forest prediction.

2 Homogeneity Analysis

Homogeneity Analysis (Meulman, 1982; Michailidis and De Leeuw, 1998; Tenenhaus and Young, 1985) was developed in the social sciences for the analysis and visual representation of datasets with factorial variables. Suppose there are $f$ factorial variables $h = 1, \ldots, f$, each with $\ell_h$ factor levels.
The data can be written as a binary matrix by encoding each of the $h = 1, \ldots, f$ variables as an $n \times \ell_h$ binary indicator matrix $G^{(h)}$, where the $k$-th column in this matrix contains 1 entries for all observations that have factor level $k$ in variable $h$. These matrices can then be summarized by one $n \times m$-matrix $G = (G^{(1)}, \ldots, G^{(f)})$, where $m = \sum_h \ell_h$ is the total number of dummy variables.

Before giving a short description of Homogeneity Analysis, it is worthwhile to explain how this context ties in with tree ensembles in machine learning, and also to point out that Homogeneity Analysis is very similar, indeed sometimes equivalent in terms of computations, to multiple correspondence analysis and related methods (Tenenhaus and Young, 1985; Greenacre and Hastie, 1987).

Connection to tree ensembles. Each leaf node in a tree can be represented by a binary indicator variable, with 1 indicating that an observation falls into the leaf node and 0 indicating that it does not. Viewing the leaf nodes or 'rules' in this light as generalized dummy variables, one can construct the indicator matrix $G$ in a similar way for tree ensembles. For a given tree ensemble with in total $m$ leaf nodes, also called 'rules' as in Friedman and Popescu (2008), let $P_j \subseteq \mathcal{X}$ for $j = 1, \ldots, m$ be the hyperrectangles of the predictor space $\mathcal{X}$ that correspond to leaf node $j$. Observation $i$ falls into leaf node $j$ if and only if $X_i \in P_j$. The results of a tree ensemble with $m$ leaf nodes across all trees can be summarized in an $n \times m$ indicator matrix $G$, where $G_{ij} = 1$ if the $i$-th observation falls into the $j$-th leaf node and 0 otherwise,

$$G_{ij} = \begin{cases} 1 & X_i \in P_j \\ 0 & X_i \notin P_j. \end{cases}$$

This matrix is now very similar to the indicator matrix in Homogeneity Analysis. The row-sums of $G$ are identical to the number $f$ of factorial variables in Homogeneity Analysis, while the row-sums of $G$ for tree ensembles are equal to the number of trees, since each observation falls into exactly one final leaf node in each tree.
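As an illustration, the indicator matrix $G$ can be assembled directly from a fitted forest. The sketch below uses scikit-learn's RandomForestClassifier and its apply() method, which returns the leaf index of every observation in every tree; this is an implementation choice of the sketch, not something prescribed in the text.

```python
# Sketch (assumes scikit-learn/SciPy): build the n x m indicator matrix G
# from a fitted tree ensemble, one one-hot block per tree.
import numpy as np
from scipy import sparse
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

leaves = forest.apply(X)        # shape (n, T): leaf index of each sample per tree
n, T = leaves.shape

blocks = []
for t in range(T):
    # One-hot encode leaf membership in tree t (only leaves that actually occur).
    _, cols = np.unique(leaves[:, t], return_inverse=True)
    blocks.append(sparse.csr_matrix(
        (np.ones(n), (np.arange(n), cols)), shape=(n, cols.max() + 1)
    ))
G = sparse.hstack(blocks, format="csr")   # n x m, m = total number of leaf nodes

# Each row sums to T: every observation falls into exactly one leaf per tree.
assert np.all(np.asarray(G.sum(axis=1)).ravel() == T)
```

The resulting sparse matrix plays the role of the Homogeneity Analysis indicator matrix, with leaf nodes in place of factor levels.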
We will be a bit more general in the following, though, and do not assume that row-sums are constant, which will allow some generalizations without unduly complicating the notation and computations. It is only assumed that all row- and column-sums of $G$ are strictly positive, such that each observation falls into at least one rule and each rule contains at least one observation. We will moreover assume that the root node, containing all observations, is added to the set of rules. This ensures that the bipartite graph corresponding to $G$ is connected.

Bipartite graph and Homogeneity Analysis. In both contexts, Homogeneity Analysis can be thought of as forming a bipartite graph, where each of the $n$ observations and each of the $m$ rules or dummy variables is represented by a node in the graph. There is an edge between an observation and a node (or rule) if and only if the observation is contained in the rule. More precisely, there is an edge between observation $i$ and rule $j$ if and only if $G_{ij} = 1$. For an example, see Figure 1.

Homogeneity Analysis is now concerned with finding a $q$-dimensional embedding (with typically $q = 2$) of both observations and rules such that the sum of the squared edge lengths is as small as possible. The goal is that an observation is close to all rules that contain it. And, conversely, a rule is supposed to be close to all observations it contains. Let $U$ be the $n \times q$-matrix of the coordinates of the $n$ observations in the $q$-dimensional embedding. Let $R$ be the $m \times q$-matrix of projected rules. Let $U_i$ be the $i$-th row of $U$ and $R_j$ the $j$-th row of $R$. Homogeneity Analysis chooses a projection by minimizing squared edge lengths,

$$\operatorname{argmin}_{U,R} \sum_{i,j: G_{ij}=1} \| U_i - R_j \|_2^2 . \qquad (1)$$

Let $e_n$ be the $n$-dimensional column vector with all entries 1 and $\mathbf{1}_q$ the $q$-dimensional identity matrix. To avoid trivial solutions, typically constraints of the form

$$U^T W U = \mathbf{1}_q \qquad (2)$$
$$e_n^T U = 0 \qquad (3)$$

are imposed (De Leeuw, 2009) for some positive definite weighting matrix $W$.
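As a quick numerical check of the objective (1), the sketch below evaluates the sum of squared edge lengths for an arbitrary layout twice: once directly over the edges of the bipartite graph, and once in the equivalent matrix (trace) form used later for the eigenvalue solution. All data and variable names are synthetic and purely illustrative.

```python
# Sketch: evaluate objective (1) directly over edges and via the trace identity.
import numpy as np

rng = np.random.default_rng(2)
n, m, q = 12, 5, 2
G = (rng.random((n, m)) < 0.5).astype(float)
G[:, 0] = 1.0                      # root node: keeps the bipartite graph connected

U = rng.standard_normal((n, q))    # observation coordinates (arbitrary layout)
R = rng.standard_normal((m, q))    # rule coordinates (arbitrary layout)

# Direct sum over edges: sum_{i,j: G_ij=1} ||U_i - R_j||^2
direct = sum(
    np.sum((U[i] - R[j]) ** 2)
    for i in range(n) for j in range(m) if G[i, j] == 1
)

# Matrix form: tr(U^T D_u U) + tr(R^T D_r R) - 2 tr(U^T G R)
Du, Dr = G.sum(axis=1), G.sum(axis=0)
matrix_form = ((U ** 2).sum(axis=1) * Du).sum() \
    + ((R ** 2).sum(axis=1) * Dr).sum() \
    - 2.0 * np.trace(U.T @ G @ R)

assert np.isclose(direct, matrix_form)
```

Both evaluations agree, confirming the expansion of the squared edge lengths used in the derivation.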
In the following, we weight samples by the number of rules to which they are connected by letting $W$ be the diagonal matrix with entries $W_{ii} = \sum_j G_{ij}$, so that $W = \operatorname{diag}(GG^T)$ is the diagonal part of $GG^T$. In standard Homogeneity Analysis, every sample is part of exactly the same number of rules, since they correspond to factor levels, and the weighting matrix $W = f \mathbf{1}_n$ would thus be the identity matrix multiplied by the number $f$ of factorial variables. For most tree ensembles, it will likewise hold that $W = T \mathbf{1}_n$ is a multiple of the identity, since every observation falls into the same number $T$ of leaf nodes if there are $T$ trees in the ensemble.

2.1 Optimization

If the coordinates $U$ of the observations are held constant, the rule positions $R$ in problem (1) under constraints (2) can easily be found, since the constraints do not depend on $R$. Each projected rule $R_j$, $j = 1, \ldots, m$, is at the center of all observations it contains,

$$R_j = \frac{\sum_i G_{ij} U_i}{\sum_i G_{ij}}.$$

Writing $\operatorname{diag}(M)$ for the diagonal part of a matrix $M$, with all non-diagonal elements set to 0,

$$R = \operatorname{diag}(G^T G)^{-1} G^T U. \qquad (4)$$

The values of $U$ are found either by alternating least squares or by solving an eigenvalue problem.

Alternating least squares. The optimization problem (1) under constraints (2) can be solved by alternating between optimizing the positions $U$ of the $n$ samples and optimizing the positions $R$ of the $m$ rules, keeping in each case all other parameters constant (Michailidis and De Leeuw, 1998). The optimization over the rule positions $R$, given fixed positions $U$ of the samples, is solved by (4). The optimization over $U$ under constraints (2) is solved by a least-squares step analogous to (4),

$$U = \operatorname{diag}(GG^T)^{-1} G R, \qquad (5)$$

putting each sample into the center of all rules that contain it. This is followed by an orthogonalization step for $U$ (De Leeuw and Mair, 2008). Unlike in the following eigenvalue approach, the computations consist only of matrix multiplications.
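The alternating least squares iteration (4)-(5) can be sketched as follows. This is a minimal illustration on synthetic data; the centring and Cholesky-based orthogonalization used here to restore constraints (2)-(3) after each step is one possible choice, not necessarily the scheme of De Leeuw and Mair (2008).

```python
# Sketch: alternating least squares for problem (1) under constraints (2)-(3).
import numpy as np

rng = np.random.default_rng(0)
n, m, q = 30, 8, 2
G = (rng.random((n, m)) < 0.4).astype(float)
G[:, 0] = 1.0                     # root node: every row sum is strictly positive

Du = G.sum(axis=1)                # diagonal of diag(G G^T), i.e. W
Dr = G.sum(axis=0)                # diagonal of diag(G^T G)

U = rng.standard_normal((n, q))
for _ in range(100):
    R = (G.T @ U) / Dr[:, None]   # step (4): rule at the centre of its observations
    U = (G @ R) / Du[:, None]     # step (5): sample at the centre of its rules
    U -= U.mean(axis=0)           # enforce e_n^T U = 0, constraint (3)
    # Orthogonalization: restore U^T W U = 1_q via a Cholesky factor of U^T D_u U.
    L = np.linalg.cholesky((U.T * Du) @ U)
    U = U @ np.linalg.inv(L).T    # right-multiplication preserves the centring
```

Because every update is a multiplication with the sparse matrix $G$ (plus a cheap $q \times q$ factorization), each iteration is inexpensive even for large $n$ and $m$.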
Moreover, $G$ is in general a very sparse matrix and this sparsity can be leveraged for computational efficiency.

Eigenvalue solution. For the original Homogeneity Analysis, the following eigenvalue approach to solving (1) is not advisable in practice, since the alternating least squares method is much faster and more efficient at finding an approximate solution (De Leeuw and Mair, 2008). The eigenvalue solution does, however, provide some additional insight into the solution and will be useful for the extension to Partition Maps. The objective function (1) can alternatively be written in matrix notation as

$$\sum_{i,j: G_{ij}=1} \|U_i - R_j\|_2^2 = \sum_i \|U_i\|_2^2 \sum_j G_{ij} + \sum_j \|R_j\|_2^2 \sum_i G_{ij} - 2 \sum_{ij} G_{ij} U_i^T R_j$$
$$= \operatorname{tr}(U^T \operatorname{diag}(GG^T) U) + \operatorname{tr}(R^T \operatorname{diag}(G^T G) R) - 2 \operatorname{tr}(U^T G R),$$

where $\operatorname{tr}(M)$ is the trace of a matrix $M$. Let

$$D_u = \operatorname{diag}(GG^T), \qquad D_r = \operatorname{diag}(G^T G).$$

Using now (4), the rule positions at the optimal solution are given by $R = \operatorname{diag}(G^T G)^{-1} G^T U = D_r^{-1} G^T U$. It follows that the objective function (1) can be written as

$$\operatorname{tr}(U^T D_u U + R^T D_r R - 2 U^T G R) = \operatorname{tr}(U^T D_u U + U^T G D_r^{-1} D_r D_r^{-1} G^T U - 2 U^T G D_r^{-1} G^T U) = \operatorname{tr}(U^T D_u U - U^T G D_r^{-1} G^T U).$$

Since the constraint $U^T D_u U = \mathbf{1}_q$ is imposed in (2), the solution to (1) under the constraints (2) can be obtained as

$$\operatorname{argmax}_U \; \operatorname{tr}(U^T (G D_r^{-1} G^T) U) \quad \text{such that } U^T D_u U = \mathbf{1}_q \text{ and } e_n^T U = 0. \qquad (6)$$

Let $S_n = \mathbf{1}_n - n^{-1} e_n e_n^T$ be the projection onto the space orthogonal to $e_n$. Letting $\tilde{U} = D_u^{1/2} U$, the problem is equivalent to

$$\operatorname{argmax}_{\tilde{U}} \; \operatorname{tr}(\tilde{U}^T D_u^{-1/2} S_n^T (G D_r^{-1} G^T) S_n D_u^{-1/2} \tilde{U}) \quad \text{such that } \tilde{U}^T \tilde{U} = \mathbf{1}_q. \qquad (7)$$

Let $v_1, \ldots, v_q$ be the first $q$ eigenvectors of the symmetric $n \times n$-matrix

$$A := D_u^{-1/2} S_n^T (G D_r^{-1} G^T) S_n D_u^{-1/2} \qquad (8)$$

and let $V$ be the matrix $V = (v_1, \ldots, v_q)$. Solving the eigenvalue decomposition of $A$ in (8) can clearly also be achieved by an SVD of the matrix $D_r^{-1/2} G^T S_n D_u^{-1/2}$, keeping the first $q$ right-singular vectors. The solution to (6), and hence the original problem (1), is then given by setting

$$U = D_u^{-1/2} V \qquad (9)$$
$$R = D_r^{-1} G^T U, \qquad (10)$$

having used (4) for the position of the rules.
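A minimal sketch of this SVD route, written against equations (6)-(10) on synthetic data (all variable names are illustrative):

```python
# Sketch: solve (6) via the SVD of D_r^{-1/2} G^T S_n D_u^{-1/2}, then (9)-(10).
import numpy as np

rng = np.random.default_rng(1)
n, m, q = 30, 8, 2
G = (rng.random((n, m)) < 0.4).astype(float)
G[:, 0] = 1.0                                 # root node keeps the graph connected

Du = G.sum(axis=1)                            # diag(G G^T)
Dr = G.sum(axis=0)                            # diag(G^T G)
Sn = np.eye(n) - np.ones((n, n)) / n          # projection orthogonal to e_n

# B = D_r^{-1/2} G^T S_n D_u^{-1/2}; its first q right-singular vectors form V.
B = Dr[:, None] ** -0.5 * (G.T @ Sn) * Du[None, :] ** -0.5
_, _, Vt = np.linalg.svd(B)
V = Vt[:q].T                                  # n x q

U = V / np.sqrt(Du)[:, None]                  # (9): U = D_u^{-1/2} V
R = (G.T @ U) / Dr[:, None]                   # (10): R = D_r^{-1} G^T U
```

Since the right-singular vectors are orthonormal, $U^T D_u U = V^T V = \mathbf{1}_q$ holds by construction, so constraint (2) is satisfied exactly.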
The matrix $A$ is a dense $n \times n$-matrix and computation of $V$ can be computationally challenging for larger datasets. In contrast, the alternating least squares solution, while computing only an approximate solution, uses only matrix multiplications with very sparse matrices and is thus preferred in practice.

2.2 Fixed rule positions and new observations

Having obtained a graph layout, we propose to simplify in a second step and fix the rule positions $R$ at their solution (10). Homogeneity Analysis then has a beneficial feature: it is trivial to add new observations to the plot without recomputing the solution. Minimizing the sum of squared edge lengths (1) when rule positions are kept fixed is simple: an observation is placed into the center of all rules that apply to it. In matrix notation, in analogy to (5),

$$U = \operatorname{diag}(GG^T)^{-1} G R = D_u^{-1} G R. \qquad (11)$$

A simple example is shown in Figure 1. Using the Zoo dataset (Asuncion and Newman, 2007), the Node Harvest estimator (Meinshausen, 2010) is trained on the class labels (converting to binary classification one class at a time and collecting all selected rules) to select some interesting rules. To generate relevant rules, rule ensembles (Friedman and Popescu, 2008) and Random Forests (Breiman, 2001) are possible alternatives. The rules are binary indicator variables of single or combinations of two features ('Has the animal teeth? Is it aquat