% NOTE -- ONLY EDIT THE .Rnw FILE!!! The .tex file is % likely to be overwritten. % % \VignetteIndexEntry{RankAggreg Overview} % \VignetteDepends{clValid, xtable} % \VignetteKeywords{rank aggregation, Cross-Entropy Monte Carlo algorithm, Genetic algorithm, meta analysis} % \VignettePackage{RankAggreg} \SweaveOpts{engine=R,eps=FALSE,echo=TRUE} \documentclass[11pt]{article} %\documentclass[article, shortnames]{jss} %\setlength{\oddsidemargin}{0.25truein} %\setlength{\evensidemargin}{0.25truein} %\setlength{\textwidth}{6.0truein} \setlength{\topmargin}{0.25truein} %\setlength{\textheight}{8.5truein} \setlength{\headsep}{0.0truein} %\setlength{\headheight}{0.0truein} \setlength{\topskip}{10.0pt} %\pagestyle{empty} %\baselineskip=0.6cm \usepackage{amsmath,amssymb,amsfonts} % Typical maths resource packages \usepackage{graphics} % Packages to allow inclusion of graphics \usepackage{color} % For creating coloured text and background \usepackage{hyperref} % For creating hyperlinks in cross references \usepackage{relsize} \usepackage[mathcal]{euscript} \usepackage[round]{natbib} \usepackage{Sweave} \begin{document} \author{Vasyl Pihur, Somnath Datta, and Susmita Datta \\ \\ Department of Bioinformatics and Biostatistics, University of Louisville \\ \url{http://vpihur.com/biostat} } \title{\emph{RankAggreg}, an R package for weighted rank aggregation} \maketitle \tableofcontents \begin{abstract} Rank aggregation plays an important role in our daily lives. Ordered lists are ubiquitous and we, consciously or unconsciously, attempt to make sense of them. Unfortunately, for more complex problems, where either the number of lists is large or the lists are long or both, the aggregation becomes more than a simple brainteaser, often requiring advanced computational techniques. The \emph{RankAggreg} package provides two methods for combining the ordered lists: the Cross-Entropy method and the Genetic Algorithm. Two examples of rank aggregation are given in the manuscript, one a moderately large problem in the context of clustering and the other relatively difficult one in the context of meta-analysis of microarray experiments. \end{abstract} %\Keywords{rank aggregation, Cross-Entropy Monte Carlo algorithm, Genetic algorithm, meta analysis} %\Address{ %Vasyl Pihur\\ %Department of Bioinformatics and Biostatistics\\ %School of Public Health and Information Sciences\\ %University of Louisville\\ %555 S Floyd St.\\ %Louisville, KY, USA 40292\\ %E-mail: \email{v0pihu01@louisville.edu}\\ %URL: \url{http://louisville.edu/~v0pihu01/} %} \section{Introduction} \label{sec:intro} If you had to fill out at least one survey in your life, there is a good chance that you were asked to rank a collection of items in the order of preference from the most favorable to the least favorable. Certain voting schemes, particularly the ones used in the past, required voters to rank all candidates, making the final decision based on the obtained ordered lists. Many statistical and data mining procedures applied to large amounts of biological data usually produce a list, ordered according to some importance measure, of biologically meaningful entities such as, for example, a list of genes from a microarray experiment indicative of the cancer status. These are just a few examples that illustrate the abundance of ordered lists in all aspects of our lives, both scientific and mundane. Franklin D. Roosevelt once said that "there are as many opinions as there are experts" and, thus, as many ordered lists on any given subject. Rank aggregation techniques are indispensable tools for combining individual ordered lists into a single "super"-list reflective of the overall preference or importance within the population. The idea is quite simple and, ideally, everyone's opinion should be accounted for. Different rank aggregation schemes, however, differ greatly in the underlying philosophy, as well as mathematical complexity. Two radically different philosophies on rank aggregation exist. The first one is based on the majoritarian principles and attempts to accommodate the "majority" of individual preferences putting less or no weight on the relatively infrequent ones. The final aggregate ranking is usually based on the number of pairwise wins between items within individual lists. If item "A" is ranked higher than item "B" more often than not, then item "A" should also be ranked higher than item "B" in the overall list. The second philosophical approach to rank aggregation seeks the consensus among individual ordered lists and is usually based on some form of rank averaging. It is possible that the two approaches will produce different aggregated lists if applied to the same problem. Conceptually, rank aggregation techniques range from quite simple (based on rank average or on a number of pairwise wins) to fairly complex and may employ advanced computational methodologies to find a solution. Simple solutions are not necessarily desirable as they usually rely on "ad hoc" principles and lack any formal justification. Mathematical rigor brings certain satisfaction and "security" at the expense of increased complexity and intensive computation. In this paper, we present an R \emph{RankAggreg} package which provides two distinct algorithms for rank aggregation: the Cross-Entropy Monte Carlo algorithm (CE) \cite{Rub99, Deb05} and the Genetic algorithm (GA) \cite{Gol89}. Both methods are available through the main function \emph{RankAggreg}. In addition, a brute force algorithm is also provided through the \emph{BruteAggreg} function which simply tries all possible solutions and selects the one which is optimal. What is meant by "optimal" and how to find the "optimal" solution will be the discussion of the next sections. \section{Rank aggregation as an optimization problem} If we are to cast the rank aggregation in the framework of an optimization problem, we first would need to define our objective function. In this context, we would like to find a "super"-list which would be as "close" as possible to all individual ordered lists simultaneously. This is a natural requirement and the objective function, at least in its most abstract form, is very simple and intuitive \begin{eqnarray*} \Phi(\delta) = \sum_{i=1}^m w_id(\delta, L_i), \end{eqnarray*} where $\delta$ is a proposed ordered list of length $k=|L_i|$, $w_i$ is the importance weight associated with list $L_i$, $d$ is a distance function which will be discussed in details below, and $L_i$ is the $i^{th}$ ordered list \cite{Pih07}. The idea is to find $\delta^*$ which would minimize the total distance between $\delta^*$ and $L_i$'s \begin{eqnarray*} \delta^*=arg \hspace{.1cm} min \, \sum_{i=1}^m w_id(\delta, L_i). \end{eqnarray*} Selecting the appropriate distance function $d$ which would measure the "distance" between ordered lists is very important and two choices of a distance function are available: Spearman footrule distance and Kendall's tau distance. The two distances usually produce slightly different aggregated lists which is mainly due to the differences in the two philosophical paradigms discussed in the Introduction (\ref{sec:intro}). Now let us take a closer look at how "distances" between ordered lists are measured. \subsection{Spearman footrule distance} Before defining the two distance measures, let us introduce some necessary notation. Let $M_i(1), \ldots, M_i(k)$ be the scores associated with the ordered list $L_i$, where $M_i(1)$ is the best (can be the largest or the smallest depending on the context) score, $M_i(2)$ is the second best, and so on. Let $r^{L_i}(A)$ be the rank of $A$ in the list $L_i$ (1 means "best") if $A$ is within the top $k$, and be equal to $k+1$, otherwise; $r^\delta(A)$ is defined likewise. The Spearman's footrule distance between $L_i$ and any ordered list $\delta$ can be defined as $$ S(\delta, L_i)=\sum_{t \in L_i \cup \delta} |r^\delta(t)-r^{L_i}(t)|. $$ It is nothing more than the summation of the absolute differences between the ranks of all unique elements from both ordered lists combined. It is rather a very intuitive metric for comparing two ordered lists of arbitrary length. The smaller the value of the metric, the more similar the lists. For Spearman's footrule distance, the maximum value when comparing two top-$k$ lists is $k(k+1)$ which is attained when the two lists have no elements in common. The appeal of the Spearman footrule distance comes from its simplicity and it is adequate in many situations when the only information available about the individual lists is the rank order of their elements. In a case when additional information which was used to rank the lists in the first place is available, it would be beneficial and prudent to incorporate this information into our aggregation scheme \cite{Pih07}. Even though in soccer a win is a win, a win by 5 goals is more "convincing" than a marginal victory secured by a penalty kick on the last minute of the game. The \emph{qualitative} difference in terms of ranks has an objective \emph{quantitative} difference underlying it. This is probably true in most cases. Thus, we define the Weighted Spearman's footrule distance between $L_i$ and any ordered list $\delta$ which makes use of the quantitative information available in many cases. It is given by this weighted sum representation \begin{eqnarray*} WS(\delta, L_i)&=& \sum_{t \in L_i \cup \delta} |M(r^\delta(t))-M(r^{L_i}(t))| \times |r^\delta(t)-r^{L_i}(t)|. \end{eqnarray*} One can intuitively think of $WS(\delta, L_i)$ in terms of sum of penalties for moving an arbitrary element of the list $L_i$, $t$, from the position $r^\delta(t)$ to another position $r^{L_i}(t)$ within the list (second term of the products) adjusted by the difference in scores between the two positions (first term). \subsection{Kendall's tau distance} The Kendall's tau distance takes a different approach at measuring the distance between two ordered lists. It utilizes pairs of elements from the union of two lists and is defined $$ K(\delta, L_i)=\sum_{t,u \in L_i \cup \delta} K^p_{tu}, $$ where $$ K^p_{tu} = \begin{cases} 0 & \textnormal{if $r^\delta(t)r^\delta(u),r^{L_i}(t)>r^{L_i}(u)$} \\ 1 & \textnormal{if $r^\delta(t)>r^\delta(u),r^{L_i}(t)r^{L_i}(u)$} \\ p & \textnormal{if $r^\delta(t)=r^\delta(u)=k+1$ or $r^{L_i}(t)=r^{L_i}(u)=k+1$}. \end{cases} $$ Here, $p \in [0,1]$ is a parameter that needs to be specified for Kendall's tau. If $p$ is set to 0, the maximum value that the distance can achieve is $k^2$ and this happens when the intersection of the two lists compared is an empty set. Intuitively, Kendall's tau can be thought about in the following way. If the two elements $t$ and $u$ have the same ordering in both lists, then no penalty is incurred (a good scenario). If the element $t$ precedes $u$ in the first list and $u$ precedes $t$ in the second list, then a penalty of 1 is imposed (a bad scenario). A case when both $t$ and $u$ do not appear in either one of the lists (their ranks are $k+1$) can be handled by selecting $p$ on a spectrum ranging from very liberal (0) to very conservative (1). That is, if we have no knowledge of the relative position of $t$ and $u$ in one of the lists, we have several choices in the matter. We can either impose no penalty (0), full penalty (1), or a partial penalty ($0 < p < 1$). The following three choices are common: 0, 1, and 0.5. It is a matter of a philosophical taste as to which option one chooses. We use $p=0$ in the internal Kendall function of the package. Somewhat analogously to the Weighted Spearman distance, the Weighted Kendall's tau is defined by $$ WK(\delta, L_i)=\sum_{t,u \in L_i \cup \delta} |M(r^{L_i}(t))-M(r^{L_i}(u))|\times K^p_{tu}, $$ in which the penalty imposed is adjusted by the absolute difference in the scores for elements $t$ and $u$. Here, $K^p_{tu}$ is defined identically as above. Normalization of scores from each list $L_i$ before computing $WS$ and $WK$ is necessary. The weights must be comparable otherwise disproportionately large or small weights can benefit a particular list and pull the "optimal" list $\delta^*$ towards it. A number of normalization schemes that map the scores from the real line to the interval $[0,1]$ were considered. Unfortunately, most of them resulted in transformed scores occupying a very narrow portion of the interval. We settled for a simple normalization which spread the scores "evenly" between 0 and 1 \begin{eqnarray*} M_i^* = \frac{M_i-\min(M_i)}{\max(M_i)-\min(M_i)}, \hspace{1cm} i=1,\ldots,n. \end{eqnarray*} We would like to make one last comment on the reasons behind introducing weighted distance measures here. Quite obviously they are motivated by the desire for a more efficient use of the data, in this case, the numerical scores which underlie the rankings. But that is not their sole purpose. When using the original Spearman and Kendall distances we noticed that in many situations no clear winner exists as two or more ordered lists have the same objective function score due to the discrete nature of the ranks. This brought computational instability into the iterative aggregation process. The algorithm would never converge but would simply oscillate between the two "best" lists, understandably not knowing which one to pick. When continuous weights are used to adjust the discrete ranks, the possibility of such ties is almost eliminated and the algorithm is much more computationally stable. In addition, we obtain a clear winner in an objective way. \section{Cross-Entropy Monte Carlo algorithm} Due to practical considerations and computational convenience, we represent an ordered list as an $(X)_{n\times k}$ random matrix whose entries are 0 or 1 with the constraints of its columns summing up to 1 and its row summing to at most 1. Here, $n$ is the total number of unique elements in all ordered lists to be combined and $k$ is usually the length of ordered lists (but can be smaller if necessary). Under this setup, each realization of $X$, $x$, uniquely determines an ordered list of size $k$ by the position of 1's in each column from left to right \cite{Lin06}. For example, if the full list was $(A, B, C)$, a $3\times 3$ matrix \begin{equation*} x= \begin{bmatrix} 0 & 1 & 0\\ 0 & 0 & 1\\ 1 & 0 & 0 \end{bmatrix} \end{equation*} would translate into a candidate list of $(C, A, B)$. Having defined $X$ as a matrix of size $n\times k$ with the two constraints regarding the sums of columns and rows, we thereby define the space of solutions, $\mathcal{X}$, to our minimization problem. It is of interest to find the minimum of the objective function $\Phi$ over $\mathcal{X}$ and the corresponding $x$ at which that minimum is attained. Assume that $X$ comes from a pmf $P(x)$ that is indexed by the parameter matrix $(v)_{n\times k} = ((p_{jr}))$. More precisely, we assume that the joint distribution $P_v(X=x)$ satisfying the above two constraints is given by \begin{eqnarray*} P_v(x) &\propto & \prod_{j=1}^n \prod_{r=1}^k(p_{jr})^{x_{jr}} \\ &&\times I\left(\sum_{r=1}^k x_{jr} \le 1, 1 \le j \le n; \sum_{j=1}^n x_{jr} = 1, 1 \le r \le k\right) \end{eqnarray*} \noindent The CE algorithm proceeds in the following fashion: \begin{enumerate} \item{\textbf{Initialization}: Set $t=0$. Set the initial parameter matrix $v^0$ of the random distribution of $X$ to constant values; i.e., let each $p_{jr}^0=1/n$. Thus, at this stage, each of the $n$ unique elements has equal chances of being included in the overall lists of size $k$ for which the objective function $\Phi$ will be evaluated. This default behavior can be overridden by specifying the initial probability matrix $v1$ in the \emph{RankAggreg} function.} \item{\textbf{Sampling}: At each iteration $t$, draw a sample of size $N$ from $P_{v^t}(x)$. Find the corresponding top-$k$ lists $\delta_i$'s and the values of the objective function $\Phi(\delta_i)$. Sort the $\Phi(\delta_i)$'s in ascending order and find a $\rho$-quantile $y^t=\Phi_{([\rho N])}$, where $[a]$, for any real number $a$, is the integer part of $a$.} \item{\textbf{Updating}: Update the parameter vector as follows $$ p_{jr}^{(t+1)}=(1-w)p_{jr}^t+w\frac{\sum^N_{i=1}I(\Phi(\delta_i) \le y^t)x_{ijr}}{\sum^N_{i=1}I(\Phi(\delta_i) \le y^t)}, $$ where $x_{ijr}$ is the value at the $jr^{th}$ position of the $i^{th}$ sample and $w$ is a weight parameter introduced to avoid convergence to a local maxima (specified by the \emph{weight} argument with the default value of .25).} \item{\textbf{Convergence}: The algorithm is stopped if the optimal list does not change in a user-specified number of iterations. The default value is 7 for the CE algorithm and it is specified by the \emph{convIn} argument.} \end{enumerate} The CE algorithm requires users to set a number of parameters. Convergence to a global optimal solution in many ways depends on the parameters chosen. It is recommended that the number of samples $N$ is to be set to at least $10k^2$ (in case, $n>>k$, $10kn$) and the rarity parameter $\rho$ is to be set to 0.01 if $N$ is relatively large or 0.1 if $N$ is small (less than 100). \section{Genetic algorithm} Genetic algorithms are another set of tools suitable for solving complex combinatorial problems \cite{Gol89}. Their main advantage is their inherent simplicity in both conceptual understanding and software implementation. In our experience, the GA perform reasonably well for the aggregation problem but one has to be careful with the selection of important parameters defining the rate at which the learning proceeds. As implemented in this package, the GA has the following steps: \begin{enumerate} \item{\textbf{Initialization:} Randomly select \emph{popSize} ordered lists of size $k$ which form the initial population of possible solutions to our optimization problem. The population size \emph{popSize} is important and, obviously, the larger the population size, the better chance of it containing, at some point, the optimal solution. It should ideally be a function of $k$ and the number of unique elements in the original ordered lists $L_i$, but computational feasibility has to be considered here.} \item{\textbf{Selection:} Depending on which distance is used, compute the objective function for each member of the population. Then randomly select current members for the next generation using weighted random sampling where the weights are determined by the member's fitness (the objective function score).} \item{\textbf{Cross-over:} The selected members are then crossed-over with the probability of \emph{CP} (the cross-over probability), i.e. two random ordered lists can swap their tails which start at a random position with the \emph{CP} probability. Only 1-point cross-overs are allowed.} \item{\textbf{Mutation:} Crossing-over will allow only for the mixing of ordered lists but a rather drastic event is required to bring radically new solutions to the population pool. These are introduced by mutations which happen with the probability of MP (mutation probability). Thus, any list in the pool can randomly change one or more of its elements.} \item{\textbf{Convergence:} The algorithm is stopped if the "optimal" list remains optimal for \emph{convIn} consecutive generations (default is 30). To ensure that the algorithm stops running eventually, the maximum number of generations can be set in advance which will terminate the execution regardless of the first condition being true. If neither the maximum number of iterations has been reached nor the "optimal" list stayed untouched during the last \emph{convIn} generations, continue to step Selection.} \end{enumerate} As was mentioned previously, the choice of the parameters \emph{popSize}, \emph{CP}, and \emph{MP} is crucial for the success of the GA. If one is too conservative and selects small \emph{CP} and \emph{MP} probabilities, the GA will have a hard time exploring the space of possible solutions in a reasonable time, particularly, when the space is extremely large. On the other hand, choosing large values for \emph{CP} and \emph{MP} will results in a "haste" decision, perhaps getting stuck in a local minimum without a chance to explore the whole search space. \section{Examples of rank aggregation} We present two rank different rank aggregation problems, one in the context of unsupervised learning where there is an intrinsic difficulty of choosing the best clustering algorithm for a particular problem, and the other one in the context of meta-analysis of different microarray cancer studies where the goal is to determine the combined set of genes indicative of cancer status. To start using the \emph{RankAggreg} package, it must be loaded into R with the regular \emph{library()} function. Package documentation, examples, and additional information are available through \emph{help()} and \emph{vignette()} functions. <>= library(RankAggreg) @ <>= help(package="RankAggreg") vignette("RankAggreg") @ \subsection{Aggregation of clustering validation measures} Rank aggregation in the clustering context was introduced by \cite{Pih07}. Numerous clustering algorithms are available in R and other statistical and data mining software packages, each one having its relative strength and weaknesses in terms of how successfully they can handle certain types of data. Thus, it is often difficult to select the "best" algorithm for a particular clustering task. Validation (performance) measures come to rescue to some extent and offer an objective way of ranking clustering algorithms according to their assessment of what a "good" clustering result is. If $k$ clustering algorithms are validated with $m$ validation measures, $m$ ordered lists of size $k$ are produced as a result. Even though desirable, the order of clustering algorithms within each list is rarely the same. Rank aggregation is helpful in reconciling the ranks and producing the "super"-list which determines the overall winner and also ranks all clustering algorithms based on their performance as determined by all $m$ validation measures simultaneously. Clustering validation is implemented in the \emph{clValid} package \cite{Bro07}. After loading the package, we bring in a mouse microarray dataset available and select the first 100 genes from it. Assuming that those 100 genes form 5 natural clusters (this is a big assumption but it is not essential for the rank aggregation demonstration), we evaluate 10 clustering algorithms with 6 validation measures. Available clustering algorithms are: SOTA (ST), FANNY (FN), K-Means(KM), PAM(PM), Hierarchical(HR), Agnes(AG), CLARA(CL), Diana(DI), and Model-based(MO). Further details can be obtained from the clValid package documentation. <>= options(width=55) library(xtable) options(warn = -1) @ <>= options(warn = -1) library(clValid) library(mclust) library(kohonen) data(mouse) express <- mouse[1:100,c("M1","M2","M3","NC1","NC2","NC3")] rownames(express) <- mouse$ID[1:100] set.seed(100) result <- clValid(express, 5, clMethods=c("hierarchical","fanny","model", "kmeans","sota","pam","clara", "agnes", "diana"), validation=c("internal","stability")) @ The \emph{result} object contains a $7\times9$ matrix of scores which measure the performance of the algorithms. For each validation measure, 9 clustering algorithms can now be ranked based on these scores which are sorted either in ascending or descending order depending on whether larger or smaller scores correspond to better performance under the measure. Here, the Dunn index and the Silhouette Width measure give higher scores with better performance and for the other measures the smaller scores are desirable. <>= getRanks_Weights <- function(clObj){ temp <- clObj@measures[,1,] ranks <- matrix(0, nrow(temp), ncol(temp)) colnames(ranks) <- 1:ncol(ranks) rownames(ranks) <- rownames(temp) weights <- ranks algs <- colnames(temp) measures <- rownames(temp) for(i in 1:nrow(temp)){ if(measures[i] %in% c("Dunn", "Silhouette")) incr <- TRUE else incr <- FALSE sorted <- sort(temp[i,], ind=TRUE, decreasing=incr) ranks[i,] <- algs[sorted$ix] weights[i,] <- sorted$x } list(ranks=ranks, weights=weights) } res <- getRanks_Weights(result) shortNames <- sort(c("ST", "FN", "KM", "PM", "HR", "AG", "CL", "DI", "MO")) longNames <- sort(unique(as.vector(res$ranks))) for(i in 1:length(shortNames)) res$ranks[res$ranks==longNames[i]] <- shortNames[i] @ This is how the 7 ordered lists of 9 algorithms look like. <>= print(res$ranks, quote=FALSE) @ The underlying validation measures' scores are given below. <>= print(res$ranks, quote=FALSE) @ We can see that K-Means and Hierarchical clustering are performing quite well. Since the number of possible solutions is not that large in this case $(k!=9! = 362,880)$, it is feasible to use the brute force approach to find the optimal solution. This can be done using the \emph{BruteAggreg} function provided in the package. Please note that even for this relatively small problem it takes hours to perform the necessary computations. The approach is limited to toy examples only and should not be attempted if $k$ is larger than 10. <>= load("aggr_res.Rdata") @ <>= BruteAggreg(res$ranks, 9, res$weights, "Spearman") @ The best overall list as determine by trying all possible solutions with the weighted Spearman footrule distance is KM HR AG FN PM CL DI ST MO with the minimum objective function score of 6.378781. As expected, Hierarchical clustering and the K-Means algorithm are the top two performers. We will now see if the CE algorithm can quickly discover the solution without resorting to an exhaustive search. <>= (CEWS <- RankAggreg(res$ranks, 9, res$weights, seed=123)) @ <>= CEWS @ We get exactly the same solution in only 22 iterations and in about 40 seconds by examining mere 13000 potential candidates. To get a visual representation of the results, a convenient \emph{plot} function is provided. It takes the object returned by the \emph{RankAggreg} function as its first argument and outputs three side-by-side plots with useful information on the convergence properties and the final ranking. <>= plot(CEWS) @ \begin{figure}[hp] \begin{center} <>= <> @ \caption{Visual Representation of the aggregation results through the \emph{plot()} function. The first plot in the top row shows the path of minimum values of the objective function over time. The global minimum is shown in the top right corner. The histogram of the objective function scores at the last iteration is displayed in the second plot. Looking at these two plots, one can get a general idea about the rate of convergence and the distribution of candidate lists at the last iteration. The third plot at the bottom shows the individual lists and the obtained solution along with optional average ranking.} \end{center} \end{figure} Weighted Kendall's tau distance can also be used, though it is much more expensive to compute. If the \emph{verbose} argument of the \emph{RankAggreg} function is set to TRUE (it is by default), R console window outputs information at each iteration to keep the user updated. In addition, a plot similar to Figure 1 is shown and updated at each iteration to monitor convergence. <>= (CEWK <- RankAggreg(res$ranks, 9, res$weights, "CE", "Kendall", seed=123, verbose=FALSE)) @ <>= CEWK @ This time K-means is still ahead, while the Hierarchical clusters got pushed down to number 4. The Genetic Algorithm can also be used with both the weighted Spearman and Kendall distances. Unfortunately, it seems to lack the monotonicity property that the CE algorithm exhibits to some extent. This can be seen in the first plot of Figure 2. Due to that fact, the convergence criteria needs to be stricter to avoid sporadic local solutions. The default value for the \emph{convIn} arguments for the GA is 30. <>= (GAWS <- RankAggreg(res$ranks, 9, res$weights, "GA", "Spearman",seed=123, verbose=FALSE)) @ <>= GAWS @ <>= (GAWK <- RankAggreg(res$ranks, 9, res$weights, "GA", "Kendall",seed=123, verbose=FALSE)) @ <>= GAWK @ Both results agree with the ones obtained using the CE algorithm. Besides the jaggedness of the minimum path in the first plot, it is easy to notice that the GA algorithm takes significantly larger amount of cycles to converge. Even given that, the population distribution of the last generation is much more heterogeneous than that of the CE. <>= plot(GAWS) @ \begin{figure}[hp] \begin{center} <>= <> @ \caption{Visual representation of rank aggregation using the GA algorithm with the Weighted Spearman distance. } \end{center} \end{figure} \subsection{Meta-analysis of microarray experiments} Microarray cancer studies often attempt to identify genes related to a specific cancer. Their most common output is a list of genes ordered by corresponding p-values. Different studies, even the ones analyzing the same cancer type (for example, lung cancer), almost never produce identical gene lists. Meta-analysis of multiple microarray studies is difficult, especially if different experimental platforms have been used. Rank aggregation, however, avoids the issue of multiple experimental conditions by dealing with the final product: the ordered list of genes. Recently, we have carried out the meta-analysis of 20 microarray studies on multiple cancers using the proposed rank aggregation algorithms \cite{Pih08}. Our goal was to identify genes which would be important in development of multiple cancers. Further details on the rank aggregation details can be found in the original article. Here, we present a smaller example described by \cite{Dec06} who used three different Monte Carlo algorithms for rank aggregation of 5 prostate cancer microarray datasets. Two experiments were conducted using the Affymetrix chip technology and the other three studies used custom cDNA chips. Each individual study tried to identify genes which are either up or down-regulated in prostate cancer patients, coming up with ordered lists of upregualated genes shown in Table 1 (the lists appear in Table 4 in \cite{Dec06}). <>= data(geneLists) @ <>= xtable(t(geneLists), caption="Top-25 upregulated genes from 5 prostate microarray experiments.") @ There are 89 unique genes in all 5 gene lists. The only gene that appears in all of them is HPN, while genes AMACR, GDF15, and NME1 appear in 4 lists. 66 genes appear in just one list. The goal of rank aggregation is to combine these lists into the overall top-25 gene list which hopefully would be more accurate than any individual list by itself. Since no p-values are reported, we will use the regular Spearman distance for both the CE and the GA algorithms. <>= top25CE <- RankAggreg(geneLists, 25, seed=100, rho=.01) @ <>= top25CE @ The CE algorithm converges in 38 iterations with the minimum of 319.6. The overall list is perhaps not surprising, putting HPN in the first place, followed closely by the two other genes that appear in four lists. Using the GA algorithm we get the similar results. In case when there would be an indication that some microarray studies are more reliable than others, we could set the \emph{importance} parameter available in the \emph{RankAggreg} function to reflect these beliefs. By default, it assigns equal weights to all ordered lists, but one, for example, could set importance=c(1,2,1,1,2) placing stronger emphasis on the Affymetrix arrays which are considered to have higher sensitivity rates. <>= top25CEw <- RankAggreg(geneLists, 25, seed=100, importance=c(1,2,1,1,2), rho=.01) @ <>= top25CEw @ This produces the combined list which is slightly different from the one obtained treating all five studies equally. The objective function score here is 295.43, being a little smaller than 319.6. Clearly, OACT2 is ranked higher now (3rd) due to being at the top (also 3rd) in the Welsh study which received more weight. Similarly, the KRT18 gene moved up a couple spots due to being present in both Welsh and Singh top lists which are both Affymetrix. The GA algorithm can also be applied. We increase the maximum number of iterations to allow for a longer evolution process. Increasing the \emph{convIn} (converge in) argument to 50 will assure that we do not stop the algorithm too soon. <>= top25GA <- RankAggreg(geneLists, 25, seed=100, method="GA", maxIter=3000, convIn=50) @ <>= top25GA @ The algorithm did not converge (due to setting a rather stringent criteria) and was stopped after 3000 generations. The final list had an objective function score of 320.8 which is slightly worse than what we obtained using the CE algorithm. These lists are almost identical in terms of which genes where included in the top 25 (22 genes are the same), but they are somewhat different in the actual ordering. This should not come as a huge surprise, taking into consideration the enormous solution space. Both of the obtained lists are most likely very close to the true minimum solution. \section{Discussion} The \emph{RankAggreg} package provides an easy and convenient interface to handle complex rank aggregation problems. It provides the user with two choices of methods for aggregation as well as two different distance functions. The brute force approach is also available for small-scale problems. A simple plot function helps to visualize the rank aggregation problem and the obtained solution. We would like to stress that using either the CE or the GA algorithms for large problems does not "guarantee" an optimal solution. Performance of both of these algorithms is quite sensitive to the tuning parameters, in particular the sample size \emph{N} for the CE algorithm and the cross-over (CP) and mutation (MP) probabilities for the GA algorithm. The user is encouraged to run the RankAggreg function several times. If different optimal lists are produced, increasing sample size is probably necessary. Tuning additional parameters as discussed above may also prevent local minima traps. That said, however, we are quite impressed by the ability of both algorithms, the CE in particular, in discovering the optimal ordering of the elements in the combined list. {\small \bibliographystyle{abbrvnat} \bibliography{refs} } \end{document}