\documentclass[article,shortnames,nojss]{jss}
\usepackage{amsmath,amssymb,amsfonts,subfigure,mathrsfs,thumbpdf}
\newcommand{\defi}{\mathop{=}\limits^{\Delta}}
%% need no \usepackage{Sweave.sty}
\SweaveOpts{engine=R, eps=FALSE}
%\VignetteIndexEntry{Isotone Optimization in R: Pool-Adjacent-Violators Algorithm (PAVA) and Active Set Methods}
%\VignetteDepends{isotone}
%\VignetteKeywords{isotone optimization, PAVA, monotone regression, active set, R}
%\VignettePackage{isotone}

\author{Jan de Leeuw\\University of California,\\Los Angeles \And Kurt Hornik\\WU Wirtschafts-\\universit\"at Wien \And Patrick Mair \\Harvard University}
\Plainauthor{Jan de Leeuw, Kurt Hornik, Patrick Mair}
\title{Isotone Optimization in \proglang{R}: Pool-Adjacent-Violators Algorithm (PAVA) and Active Set Methods}
\Plaintitle{Isotone Optimization in R: Pool-Adjacent-Violators Algorithm (PAVA) and Active Set Methods}
\Shorttitle{PAVA and Active Set Methods in \proglang{R}}

\Abstract{
This introduction to the \proglang{R} package \pkg{isotone} is a (slightly) modified version of \cite{Leeuw+Hornik+Mair:2009}, published in the \emph{Journal of Statistical Software}. In this paper we give a general framework for isotone optimization. First we discuss a generalized version of the pool-adjacent-violators algorithm (PAVA) to minimize a separable convex function with simple chain constraints. Besides general convex functions, we extend existing PAVA implementations in terms of observation weights, approaches for tie handling, and responses from repeated measurement designs. Since isotone optimization problems can be formulated as convex programming problems with linear constraints, we then develop a primal active set method to solve such problems. This methodology is applied to specific loss functions relevant in statistics. Both approaches are implemented in the \proglang{R} package \pkg{isotone}.
}

\Keywords{isotone optimization, PAVA, monotone regression, active set, \proglang{R}}
\Plainkeywords{isotone optimization, PAVA, monotone regression, active set, R}

\Address{
Jan de Leeuw\\
Department of Statistics\\
University of California, Los Angeles\\
E-mail: \email{deleeuw@stat.ucla.edu}\\
URL: \url{http://www.stat.ucla.edu/~deleeuw/}
}

\begin{document}

\section{Introduction: History of monotone regression}

<<echo=FALSE, results=hide>>=
library("isotone")
options(prompt = "R> ", continue = "+ ")
@

In monotone (or isotone) regression we fit an increasing or decreasing function to a set of points in the plane. This generalization of linear regression has a fairly complicated history, which we now discuss.

On June 4, 1958, Constance van Eeden defended her dissertation at the University of Amsterdam. The dissertation \emph{Testing and Estimating Ordered Parameters of Probability Distributions} \citep{vanEeden:1958} summarized and extended four articles published in 1956--1957 in \emph{Indagationes Mathematicae}. Van Eeden's promotor Van Dantzig said in his speech at the promotion
\begin{quotation}
As often happens, just before you were able to obtain a final version of your result, a publication of an American professor appeared, which treated the same problem, while in a second publication by him, joint with four co-authors, the special case of a complete order was investigated in more detail.
\end{quotation}
He referred to \citet{Brunk:1955} and \citet{Ayer+Brunk+Ewing+Reid+Silverman:1955}, which both appeared in the \emph{Annals of Mathematical Statistics}.
To complicate the situation, there were also the papers by \citet{Miles:1959} and \citet{Bartholomew:1959a,Bartholomew:1959b}, which can be thought of as yet another independent discovery. Some of the interconnectivity, and some of the early history, is reviewed in the excellent book by the four B's~\citep{Barlow+Batholomew+Brenner+Brunk:1972}.

Of the classical treatments of monotone regression, Van Eeden's is the most general one. Moreover, she very clearly separates the optimization problem, treated in full detail in Chapter~1 of her dissertation, from the statistical applications. Of course we must realize that this work predates the advent of convex analysis by about 15 years, while it was done at the time that statisticians started showing interest in the (Karush-)Kuhn-Tucker conditions and in quadratic programming. In Van Eeden's setup, the problem is to minimize a strictly unimodal function \(f(x_1,\ldots,x_n)\) under the bound constraints \(\alpha_i\leq x_i\leq\beta_i\) and the partial order constraints \(\sigma_{ij}(x_i-x_j)\geq 0\). Here the \(\sigma_{ij}\) are either zero or \(\pm 1\), and we suppose the system of inequalities defined by the bound and order constraints is consistent. By careful classical analysis, Van Eeden derives her algorithms, and shows how they specialize if the objective function is separable and/or a sum of squares, and if the order constraints define a complete order.

The next major contributions were by Robertson and Dykstra, summarized in the book by~\citet{Robertson+Wright+Dykstra:1988}. At the same time, starting with~\citet{Barlow+Brunk:1972} and~\citet{Dykstra:1981}, the isotonic regression problem was embedded more and more in quadratic and convex programming frameworks by, among others,~\citet{Best+Chakravarti:1990} and~\citet{Best+Chakravarti+Ubhaya:2000}. Recent major contributions, relying heavily on mathematical programming results, are~\citet{Stroemberg:1991}, \citet{Ahuja+Orlin:2001}, and~\citet{Hansohm:2007}. Extensions to particular partial orders defined by the variables of a multiple regression are in~\citet{Dykstra:1982} and~\citet{Burdakov+Grimvall+Hussian:2004}.

In this paper we present the \proglang{R} \citep{R:09} package \pkg{isotone}, which implements PAVA and active set algorithms for solving monotone regression and more general isotone optimization problems. The package is available from the Comprehensive \proglang{R} Archive Network at \url{http://CRAN.R-project.org/package=isotone}.

%---------------------- theoretical part ----------------------------
\section{A general isotone optimization framework}

\subsection{Basics of monotone regression}
\label{sec:monreg}

In simple regression we assume a linear relationship between a predictor $z = (z_1, \ldots, z_i, \ldots, z_n)$ and a response $y = (y_1, \ldots, y_i, \ldots, y_n)$. Note that for the predictors we use $z$ instead of the common notation $x$ since later on we embed this algorithm into a convex programming problem where the target variables are typically denoted by $x$. The loss function to be minimized is a least squares problem of the form
\begin{equation}
\label{eq:ls}
f(\alpha,\beta)=\sum_{i=1}^n w_i\left(y_i-\alpha-\beta z_i\right)^2 \rightarrow \min!
\end{equation}
where $\alpha$ and $\beta$ are the regression parameters and the $w_i$ are optional observation weights. Extensions can be formulated in terms of polynomial or other nonlinear parametric regression specifications.
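As a small illustration (not part of the package), the weighted least squares baseline of Equation~\ref{eq:ls} can be fitted with \code{lm()}; the vectors \code{z}, \code{y}, and \code{w} below are artificial and only serve to make the formula concrete.
<<eval=FALSE>>=
## Illustrative sketch: the weighted least squares baseline above, fitted
## with lm(); z, y, and w are made-up data.
z <- 1:10
set.seed(1)
y <- 0.5 * z + rnorm(10)
w <- runif(10, 0.5, 2)
fit <- lm(y ~ z, weights = w)   # minimizes sum(w * (y - alpha - beta * z)^2)
coef(fit)                       # estimates of alpha and beta
@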
In many situations the researcher has no information regarding the mathematical specification of the true regression function. Rather, she can assume a particular shape which can be characterized by certain order restrictions. Typically, this means that the $y_i$'s increase with the ordered $z_i$'s. Such a situation is called \emph{isotonic regression}; the decreasing case is called \emph{antitonic regression}. The corresponding umbrella term for both cases is \emph{monotonic regression} \citep[for a compact description, see][]{deLeeuw:2005}.

Suppose that $Z$ is the finite set $\{z_1,z_2,\ldots,z_n\}$ of the ordered predictors with no ties, i.e., $z_1 < z_2 < \cdots < z_n$. The monotone regression problem is to find fitted values $x_1, x_2, \ldots, x_n$ that follow this order and minimize the weighted least squares loss
\begin{equation}
\label{eq:monreg}
f(x)=\sum_{i=1}^n w_i\left(y_i-x_i\right)^2 \rightarrow \min! \quad \textrm{subject to } x_1 \leq x_2 \leq \cdots \leq x_n.
\end{equation}

\subsection{Generalization to convex functions and repeated measurements}
\label{sec:general}

The least squares problem above can be generalized in two directions: the squared error can be replaced by other separable convex functions, and each predictor value $z_i$ may come with $m_i$ repeated observations $y_{ij}$, $j = 1, \ldots, m_i$, with weights $w_{ij}$. A convenient family is the weighted $\ell_p$ loss
\begin{equation}
\label{eq:loss}
f(x)=\sum_{i=1}^n \sum_{j=1}^{m_i} w_{ij}\left|y_{ij}-x_i\right|^p,
\end{equation}
again minimized subject to $x_1 \leq x_2 \leq \cdots \leq x_n$. For $p = 2$ the unconstrained minimizer over a set of pooled observations is the (weighted) mean, and for $p = 1$ it is the (weighted) median. More generally, we can weight positive and negative residuals asymmetrically and minimize
\begin{equation}
\label{eq:quantile}
f(x)=\sum_{i=1}^n \left( a\sum_{j:\, y_{ij} > x_i} w_{ij}\left(y_{ij}-x_i\right) + b\sum_{j:\, y_{ij} < x_i} w_{ij}\left(x_i-y_{ij}\right) \right)
\end{equation}
with $a, b > 0$, whose unconstrained minimizer over a pooled set of observations is a weighted quantile. The corresponding proportions are $a/(a+b)$ and $b/(a+b)$. Note that for $a = b$ Equation~\ref{eq:quantile} reduces again to the weighted median.

Let us refer to the expression for the conditional (weighted) mean and (weighted) quantiles as \emph{solver} $s(x_i)$. This function plays a central role for the parameter updates in PAVA described in the next section. Note that in our general framework the user can specify other convex functions and their corresponding solvers. As we will see in Section~\ref{sec:packpava}, the \code{gpava()} function takes a (user-defined) solver as argument, which is sufficient for defining the PAVA optimization problem.

\subsection{Isotone optimization as a convex programming problem}

The whole theory in this section is based on the fact that isotone optimization can be formulated as a convex programming problem with inequality constraints. For instance, the least squares problem in Equation~\ref{eq:monreg} is one particular example of such a convex program. The isotonicity condition is contained in the side constraints. These inequalities defining isotonicity can be written in matrix form as $Ax \geq 0$ with $x \in \mathbb R^n$ as the target variable to be optimized. $A$ is an $m \times n$ matrix in which each row corresponds to an index pair $i,j$ such that $i\succeq j$. Such a row of length $n$ has element $i$ equal to $+1$, element $j$ equal to $-1$, and the rest of the elements equal to zero. Formally, such an isotone optimization problem, written as a convex programming problem with linear inequality constraints, looks as follows:
\begin{eqnarray}
\label{eq:convexAx}
f(x) \rightarrow \min! \nonumber \\
\textrm{subject to } Ax \geq 0
\end{eqnarray}

The constraints in isotone regression are a total order. We now consider the more general case of partial orders. In order to eliminate redundancies we include a row in $A$ for a pair $(i,j)$ iff $i$ \emph{covers} $j$, which means that $i\succeq j$ and there is no $k\not= i,j$ such that $i\succeq k\succeq j$. This allows us to specify isotonicity patterns in a flexible manner. Some examples are given in Figure~\ref{fig:hasse} by means of \emph{Hasse diagrams} (also known as \emph{cover graphs}).

\begin{figure}[p]
\centering
\subfigure[Total order]{\label{F:total}\includegraphics[height = 130mm,keepaspectratio]{total.pdf}}
\subfigure[Tree order]{\label{F:tree}\includegraphics[width = 80mm]{tree.pdf}}
\subfigure[Loop order]{\label{F:loop}\includegraphics[width = 30mm]{loop.pdf}}
\subfigure[Block order]{\label{F:block}\includegraphics[width = 80mm]{block.pdf}}
\caption{\label{fig:hasse} Hasse diagrams for different types of orders.}
\end{figure}

The Lagrangian associated with problem~\ref{eq:convexAx} is
\begin{equation}
\label{eq:lagrange}
L(x,\lambda) = f(x)-\lambda^\top Ax
\end{equation}
with $\lambda$ as the \emph{Lagrange multiplier vector}.
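Before turning to the dual problem, here is a small sketch of how the constraint matrix $A$ just introduced can be constructed from order restrictions coded as index pairs, where the element in the second column dominates the element in the first column (the same convention used for the \code{isomat} argument of \code{activeSet()} later on). The helper \code{pairs2A()} is ad hoc and not a function of the package.
<<eval=FALSE>>=
## Sketch: build the m x n matrix A from index pairs (column 2 dominates
## column 1, as in the isomat argument of activeSet() used later).
## pairs2A() is an illustrative helper, not part of the isotone package.
pairs2A <- function(pairs, n) {
  A <- matrix(0, nrow(pairs), n)
  A[cbind(seq_len(nrow(pairs)), pairs[, 2])] <-  1    # dominating element: +1
  A[cbind(seq_len(nrow(pairs)), pairs[, 1])] <- -1    # dominated element:  -1
  A
}
pairs2A(cbind(1:3, 2:4), n = 4)   # total order x_1 <= x_2 <= x_3 <= x_4
@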
The associated \emph{Lagrange dual function} can be expressed as
\begin{equation}
\label{eq:lagrangeD}
g(\lambda)=\inf_{x} L(x,\lambda) = \inf_{x}\left(f(x)-\lambda^\top Ax\right)
\end{equation}
and the corresponding dual optimization problem is
\begin{eqnarray}
\label{eq:lagrangeDP}
g(\lambda) \rightarrow \max! \nonumber \\
\textrm{subject to } \lambda \geq 0.
\end{eqnarray}
The Lagrange dual can be reformulated using the \emph{convex conjugate}. Let $x^{\star} \in \mathbb{R}^n$. The \emph{conjugate function} of $f(x)$, denoted by $f^\star(x^\star)$, is the maximum gap between the linear function $\left\langle x^\star, x\right\rangle$ and the target function $f(x)$ and can be expressed as
\begin{equation}
f^\star(x^\star)= \sup_x \left(\left\langle x^\star, x\right\rangle-f(x)\right).
\end{equation}
It can be shown \citep[see][p.~221]{Boyd+Vandenberghe:2004} that
\begin{equation}
g(\lambda)=-f^\star(A^\top\lambda).
\end{equation}
In other words: The Lagrange dual is determined by the convex conjugate with $A^\top\lambda$. This relates the convex primal and the dual optimization problem in the following manner:
\[
\min \left\{f(x):Ax \geq 0 \right\} \leftrightarrow \min \left\{f^\star(A^\top\lambda):\lambda \geq 0 \right\}
\]

As necessary conditions for the minimum we give the \emph{Karush-Kuhn-Tucker} (KKT) conditions for this problem. The convex function $f$ is minimized on a convex set $\{x\mid Ax\geq 0\}$ at $\hat x$ iff there exists a vector of Lagrange multipliers $\hat\lambda$ (i.e., the KKT vector) such that~\citep[see][Chapter~28]{Rockafellar:1970}
\[
A^\top\hat\lambda\in\partial f(\hat x),\quad A\hat x\geq 0,\quad\hat\lambda\geq 0,\quad\hat\lambda^\top A\hat x=0.
\]
Here $\partial f(\hat{x})$ is the \emph{subdifferential} of $f$ at $\hat{x}$. In general, the subdifferential at $x$ is the set of all \emph{subgradients} of $f$ at $x$, where $y$ is a subgradient at $x$ if
\begin{equation}
f(z)\geq f(x)+(z-x)^\top y\quad\forall z.
\end{equation}
The subdifferential is a convex compact set. If $f$ is differentiable at $x$, there is a unique subgradient, the gradient $\nabla f(x)$. Thus, the necessary and sufficient conditions for $\hat x$ to be a minimizer in the differentiable case are the existence of a KKT vector $\hat\lambda$ such that
\[
\nabla f(\hat x)=A^\top\hat\lambda,\quad A\hat{x}\geq 0,\quad\hat\lambda\geq 0,\quad\hat\lambda^\top A\hat{x}=0.
\]
These conditions are referred to as stationarity, primal feasibility, dual feasibility, and complementary slackness.

\subsection{Active set method for convex programming problems}
\label{sec:activestep}

Basically, there are two classes of algorithms to solve convex optimization problems: The first class are \emph{interior point methods} aiming for complementary slackness while maintaining primal and dual feasibility. The second class are active set strategies where we distinguish between two variants:
\begin{description}
\item[\emph{Primal active set methods}:] They aim for dual feasibility while maintaining primal feasibility and complementary slackness.
\item[\emph{Dual active set methods}:] They aim for primal feasibility while maintaining dual feasibility and complementary slackness.
\end{description}
Active set methods in general are approaches for solving linear, quadratic, and convex programming problems with inequality constraints and are most effective for solving small- to medium-scale problems.
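Returning briefly to the KKT conditions above, the following sketch checks the four conditions numerically for an isotonic least squares fit computed with \code{gpava()}. It assumes unit weights, a total order, and the least squares loss; the cumulative-sum formula for the multipliers is specific to this setting and is used here for illustration only.
<<eval=FALSE>>=
## Sketch: numerical KKT check for the differentiable least squares case
## under a total order with unit weights (assumptions of this illustration).
set.seed(1)
n <- 8
y <- rnorm(n)
xhat <- gpava(1:n, y)$x                       # isotonic least squares fit
A <- matrix(0, n - 1, n)                      # rows encode x_{i+1} - x_i >= 0
A[cbind(1:(n - 1), 2:n)]       <-  1
A[cbind(1:(n - 1), 1:(n - 1))] <- -1
grad <- 2 * (xhat - y)                        # gradient of sum((y - x)^2)
lambda <- -cumsum(grad)[1:(n - 1)]            # solves t(A) %*% lambda = grad here
c(stationarity = max(abs(t(A) %*% lambda - grad)),
  primal       = min(A %*% xhat),             # should be >= 0
  dual         = min(lambda),                 # should be >= 0
  slackness    = abs(sum(lambda * (A %*% xhat))))  # should be close to 0
@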
In our implementation, by means of the function \code{activeSet()}, we provide a primal active set strategy which \citet[][Chapter~8]{Zangwill:1969} denotes as \emph{manifold suboptimization}. Recall problem $\mathcal{P}$ given in Equation~\ref{eq:convexAx} where the aim is to minimize $f(x)$ over $Ax\geq 0$. The minimum value is $f(\hat x)$ and the corresponding minimizer is $\hat x$. Write $I$ for subsets of the index set \(\mathcal{I}=\{1,2,\ldots,m\}\) for the constraints. Then $A_I$ is the corresponding $\mathbf{card}(I)\times n$ submatrix of $A$, and $A_{ \overline{I} }$ is the \((m-\mathbf{card}(I))\times n\) complementary submatrix.

The \emph{active constraints} at $x$, which we write as $\mathscr{A}$, are the indices $i$ for which $a_i^\top x=0$. That is, at a given point $x$ a constraint is called
\begin{description}
\item[\emph{active}] if $a_i^\top x = 0$, $i \in \mathscr{A}$, and
\item[\emph{inactive}] if $a_i^\top x > 0$, $i \notin \mathscr{A}$.
\end{description}
Each iteration in the active set algorithm attempts to locate the solution to an equality problem in which only the active constraints occur. There are several ways to state the active set optimization problem for our general convex case with linearity restrictions. Let us first consider an auxiliary problem denoted by $\mathcal{P}_I^+$,
\begin{eqnarray}
\label{eq:asAx}
f(x) \rightarrow \min! \\
\textrm{subject to } A_Ix = 0 \nonumber \\
A_{\overline{I}}x > 0 \nonumber
\end{eqnarray}
which splits the constraints into equality and strict inequality restrictions. Note that this partitions the constraint set $\{x\mid Ax \geq 0\}$ into $2^m$ faces, some of which may be empty. Solution $\hat x_I^+$ is optimal for $\mathcal{P}_I^+$ iff there exists a KKT vector $\hat\lambda_I^+$ such that
\[
A_I^\top \hat\lambda_I^+\in\partial f(\hat x_I^+),\quad A_I\hat x_I^+=0,\quad A_{ \overline{I} }\hat x_I^+> 0.
\]
The crucial condition for $\hat x_I^+$ to be the optimum also for $\mathcal{P}$ is the dual feasibility condition \(\hat\lambda_I^+\geq 0\). If this holds then $\hat x_I^+$ is optimal for $\mathcal{P}$. Conversely, if $\mathscr{A}^{\star}$ are the indices of the active constraints at the solution $\hat x$ of $\mathcal{P}$, then $\hat x$ solves $\mathcal{P}_{\mathscr{A}^{\star}}^+$ as well.

A second way to give an active set formulation for our setting is problem $\mathcal{P}_I$ that includes equality restrictions only:
\begin{eqnarray}
\label{eq:asAx2}
f(x) \rightarrow \min! \\
\textrm{subject to } A_Ix = 0 \nonumber
\end{eqnarray}
Solution $\hat x_I$ is optimal for $\mathcal{P}_I$ iff there exists a KKT vector $\hat\lambda_I$ such that
\[
A_I^\top \hat\lambda_I\in\partial f(\hat x_I),\quad A_I\hat x_I=0.
\]
To ensure that $\hat x_I$ is optimal for $\mathcal{P}$ as well we have to check whether the primal feasibility $A_{ \overline{I} }\hat x_I\geq 0$ and, again, the dual feasibility $\hat\lambda_I\geq 0$ hold. Conversely, the solution $\hat x$ of $\mathcal{P}$ with active constraints $\mathscr{A}^{\star}$ solves $\mathcal{P}_{\mathscr{A}^{\star}}$. Thus, if we knew $\mathscr{A}^{\star}$ we could solve $\mathcal{P}$ by solving $\mathcal{P}_I$ with $I=\mathscr{A}^{\star}$.

In order to achieve this we discuss an equivalent formulation of $\mathcal{P}_I$. Basically, the equality constraints $A_Ix=0$ define a relation $\approx_I$ on $\{1,2,\ldots,n\}$, with $i \approx_I k$ if there is a row $j$ of $A_I$ in which both $a_{ji}$ and $a_{jk}$ are non-zero.
The reflexive and transitive closure \(\overline{\approx}_I\) of \(\approx_I\) is an equivalence relation, which can be coded as an \emph{indicator matrix} $G_I$, i.e., a binary matrix in which all $n$ rows have exactly one element equal to one. In the package we compute $G_I$ from $A_I$ in two steps. We first make the adjacency matrix of $\approx_I$ and add the identity to make it reflexive. We then apply Warshall's Algorithm \citep{Warshall:1962} to replace the adjacency matrix by that of the transitive closure $\overline{\approx}_I$, which is an equivalence relation. Thus the transitive adjacency matrix has blocks of ones for the equivalence classes. We then select the unique rows of the transitive adjacency matrix (using the \code{unique()} function), and transpose to get $G_I$. Note that $G_I$ is of full column-rank, even if $A_I$ is singular.

We write $r_I$ for the number of equivalence classes of $\overline{\approx}_I$. Thus $G_I$ is an $n\times r_I$ matrix satisfying $A_IG_I=0$; in fact the columns of $G_I$ form a basis for the null space of $A_I$. Moreover $D_I\defi G_I^\top G_I$ is diagonal and indicates the number of elements in each of the equivalence classes. Finally, problem $\mathcal{P}_I$ can be rewritten as the unconstrained convex optimization problem
\begin{equation}
\label{eq:gxi}
f(G_I\xi_I) \rightarrow \min!
\end{equation}
with $\xi_I \in\mathbb{R}^{r_I}$. The vector $\hat\xi_I$ is a solution iff $0\in G_I^\top \partial f(G_I\hat\xi_I)$. Then $\hat x_I=G_I\hat\xi_I$ solves $\mathcal{P}_I$. If $0\in G_I^\top \partial f(G_I\hat\xi_I)$ it follows that there is a non-empty intersection of the subdifferential $\partial f(G_I\hat\xi_I)$ and the row space of $A_I$, i.e., there is a KKT vector $A_I^\top \hat\lambda_I\in\partial f(G_I\hat\xi_I)$.

In \proglang{R} we can simply use the \code{optim()} function---e.g., with Broyden-Fletcher-Goldfarb-Shanno (BFGS) quasi-Newton---to minimize $f(G_I\xi)$ over $\xi$, and then set $\hat x=G_I\hat\xi$. This guarantees (if the optimum is found with sufficient precision) that the gradient at $\hat x$ is orthogonal to the indicator matrix $G_I$, and consequently that Lagrange multipliers can be computed. By making sure that $A_I$ has full row-rank, the Kuhn-Tucker vector is actually unique.

To conclude, in this section we elaborated auxiliary formulations of the active set problem. In the algorithmic implementation described in the next section we tackle optimization of problem $\mathcal{P}$ by defining and solving iteratively subproblems $\mathcal{P}_I$ (including the indicator matrix $G_I$).

%--------------- Description algorithms --------------
\section{PAVA and active set algorithm}

\subsection{A generalized PAVA approach}
\label{sec:gpavath}

\citet{Barlow+Batholomew+Brenner+Brunk:1972} present a graphical interpretation of monotonic regression in terms of the \emph{greatest convex minorant} (GCM), and the method of successive approximation to the GCM can be described algebraically as PAVA \citep[see e.g.][p.~9--10]{Robertson+Wright+Dykstra:1988}. In this paper we focus on the algorithmic description of our PAVA implementation following \citet[][p.~15]{Barlow+Batholomew+Brenner+Brunk:1972}.

PAVA is a very simple iterative algorithm for solving monotone regression problems: it computes a non-decreasing sequence of values $x_i$ that minimizes the underlying separable objective. In the simple isotonic regression case we have the measurement pairs $(z_{i}, y_{i})$. Let us assume that these pairs are ordered with respect to the predictors.
The initial solution at iteration $l=0$ is $x_i^{(0)}:=y_i$. The index for the blocks is $r = 1,\ldots,B$ where at step 0, $B:=n$, i.e., each observation $x_r^{(0)}$ forms a block.
\begin{enumerate}
\item Merge $x^{(l)}$-values into blocks if $x_{r+1}^{(l)} < x_r^{(l)}$ (adjacent pooling).
\item Minimize $f(x)$ within each block $r$, i.e., compute the update based on the solver, which gives $x_r^{(l+1)}:=s(x_r^{(l)})$.
\item If $x_{r+1}^{(l+1)} < x_r^{(l+1)}$ for some $r$, increase $l:=l+1$ and go back to step 1.
\end{enumerate}
The algorithm stops when the $x$-blocks are increasing, i.e., $x_{r+1}^{(l)} \geq x_r^{(l)}$ for all $r$. Finally the block values are expanded with respect to the observations $i=1,\ldots,n$ such that the final result is the vector $x$ of length $n$ with elements $\hat{x}_i$ in non-decreasing order. In the case of repeated measurements as described in Section~\ref{sec:general} the starting values are $x_i^{(0)}:=s(y_{ij})$. Then we proceed as described above.

A final note concerns the relation between the generalized PAVA and the active set strategy. \citet[][Chapter~6]{Robertson+Wright+Dykstra:1988} provide a chapter about duality in monotone regression. For the simple least squares case \citet{Best+Chakravarti:1990} show that PAVA can be considered as a dual active set method. In a subsequent paper, \citet{Best+Chakravarti+Ubhaya:2000} extend this relation to the general $\ell_p$ case. Therefore it can be concluded that PAVA for general $\ell_p$ cases belongs to the family of dual active set methods.

\subsection{Algorithmic implementation of the active set formulation}

Active set descriptions for convex (quadratic) programming problems can be found in \citet[][Section 10.3]{Fletcher:1987} and \citet[][Section 16.4]{Nocedal+Wright:1999}. In general, based on a feasible starting point, the active set strategy works as follows:
\begin{enumerate}
\item Solve the equality problem defined by the active set $\mathcal{A}$.
\item Compute the Lagrange multipliers $\lambda$ for $\mathcal{A}$.
\item Remove a subset of constraints with $\lambda < 0$.
\item Search for infeasible constraints and return to step 1.
\end{enumerate}
These steps are repeated until we find a solution that is ``optimal enough''. In practice (and in our package), the zero-boundaries for the constraints and the dual feasibility checking are relaxed in favor of a small value $\varepsilon$, which leads to $Ax \geq -\varepsilon$ and $\lambda \geq -\varepsilon$.

Again, our goal is to solve problem $\mathcal P$ given in (\ref{eq:convexAx}). Based on the elaborations above, in our active set approach we solve a finite sequence of subproblems $\mathcal{P}_I$ that minimize $f(x)$ over all $x$ satisfying $A_Ix = 0$. After solving each of the problems we change the active set $\mathscr{A}$, either by adding or by dropping a constraint. The algorithm can be expected to be efficient if minimizing $f(x)$ under simple equality constraints, or equivalently minimizing $f(G_I\xi_I)$ over $\xi_I$, can be done quickly and reliably. We start with a feasible point $x^{(0)}$ which defines the index sets $I^{(0)}=\mathscr{A}^{(0)}$ and $\overline{I}^{(0)}=\mathcal{I}-I^{(0)}$. For each iteration $l$:
\begin{enumerate}
\item Suppose $\tilde x^{(l)}$ is a solution of $\mathcal{P}_{I^{(l)}}$. Then $\tilde x^{(l)}$ can be either feasible or infeasible for $\mathcal{P}$, depending on whether $A_{\overline{I}^{(l)}}\tilde x^{(l)} \geq 0$ or not.
\item If $\tilde x^{(l)}$ is \emph{infeasible} we choose $x^{(l+1)}$ on the line between $x^{(l)}$ and $\tilde x^{(l)}$, where it crosses the boundary of the feasible region (see Equation~\ref{eq:alpha} below). This defines a new and larger set of active constraints $\mathscr{A}^{(l+1)}$. Go back to step 1.
\item If $\tilde x^{(l)}$ is \emph{feasible} we determine the corresponding Lagrange multipliers $\lambda_I^{(l)}$ for $\mathcal{P}_{I^{(l)}}$. If \(\lambda_I^{(l)}\geq 0\) we have solved \(\mathcal{P}\) and the algorithm stops. If $\min\lambda_I^{(l)}<0$, we find the most negative Lagrange multiplier and drop the corresponding equality constraint from $\mathscr{A}^{(l)}$ to define a new and smaller set of active constraints $\mathscr{A}^{(l+1)}$ and go back to step 1.
\end{enumerate}
In step 2 we compute the largest step length $\alpha$ that keeps the new point feasible, i.e., we solve
\begin{eqnarray}
\label{eq:alpha}
\max\ \alpha \nonumber \\
\textrm{subject to } \min_{i\in \overline{I}^{(l)}}\left(a_i^\top x^{(l)}+\alpha\left(a_i^\top \tilde x^{(l)}-a_i^\top x^{(l)}\right)\right)\geq 0
\end{eqnarray}
and set $x^{(l+1)}=x^{(l)}+\alpha\left(\tilde x^{(l)}-x^{(l)}\right)$.

Finding the smallest Lagrange multiplier in step 3 is straightforward to implement in the differentiable case. We have to solve $A_{I^{(l)}}^\top \lambda_I=\nabla f(\tilde x^{(l)})$. This is achieved by using the formulation given in (\ref{eq:gxi}). Because $G_{I^{(l)}}^\top \nabla f(\tilde x^{(l)})=0$ and $A_{I^{(l)}}$ is of full row-rank, there is a unique solution $\lambda_{I^{(l)}}$.

\section{Special cases for active set optimization}

In the convex non-differentiable case, matters are more complicated. We have to deal with the fact that in general $\partial f(x)$ may not be a singleton. It is possible to develop a general theory for active set methods in this case \citep{Panier:1987}, but we will just look at two important special cases.

\subsection{The weighted Chebyshev norm}

The first case we consider is the $\ell_\infty$ or Chebyshev norm. We have to minimize
\begin{equation}
\label{eq:cheby}
f(\xi)=\|h(\xi)\|_\infty=\max_{i=1}^n |w_ih_i(\xi)|
\end{equation}
where $h(\xi)=y-G\xi$ are the \emph{residuals} based on the observed response values $y$. We assume, without loss of generality, that $w_i>0$ for all $i$. The minimization can be done for each of the $r$ columns of the indicator matrix $G$ with elements $g_{ij}$ separately. The solution $\hat\xi_j$ is the corresponding \emph{weighted mid-range}. More specifically, let $I_j=\{i\mid g_{ij}=1\}$. Then
\begin{equation}
f_j(\hat\xi_j)=\min_{\xi_j}\max_{i\in I_j}w_i|y_i-\xi_j|=\max_{i,k\in I_j}\frac{w_iw_k}{w_i+w_k}|y_i-y_k|.
\end{equation}
If the (not necessarily unique) maximum over $(i,k)\in I_j$ is attained at $(i_j,k_j)$, then the minimum of $f$ over $\xi$ is attained at
\begin{equation}
\hat\xi_j=\frac{w_{i_j}y_{i_j}+w_{k_j}y_{k_j}}{w_{i_j}+w_{k_j}},
\end{equation}
where we choose the order within the pair $(i_j,k_j)$ such that $y_{i_j}\leq\hat\xi_j\leq y_{k_j}$. Now
\begin{equation}
\min_\xi f(\xi)=\max_{j=1}^r f_j(\hat\xi_j).
\end{equation}
These results also apply if $I_j$ is a singleton $\{i\}$, in which case $\hat\xi_j=y_i$ and $f_j(\hat\xi_j)=0$.

Set $\hat x=G\hat\xi$. Next we must compute a subgradient in $\partial f(\hat x)$ orthogonal to $G$. Suppose that $e_i$ is a unit weight vector, i.e., a vector with all elements equal to zero, except element $i$ which is equal to either plus or minus $w_i$. Consider the set $\mathcal{E}$ of the $2n$ unit weight vectors. Then $f(\xi)=\max_{e_i\in\mathcal{E}} e_i^\top h(\xi)$. Let $\mathcal{E}(\xi)=\{e_i\mid e_i^\top h(\xi)=f(\xi)\}$.
Then, by the formula for the subdifferential of the pointwise maximum of a finite number of convex functions \citep[also known as Danskin's Theorem; see][]{Danskin:1966}, we have $\partial f(\xi)=\mathbf{conv}(\mathcal{E}(\xi))$ with $\mathbf{conv}()$ as the convex hull. Choose any $j$ for which $f_j(\hat\xi_j)$ is maximal. Such a $j$ need not be unique in general. The index pair $(i_j,k_j)$ corresponds with the two unit weight vectors with non-zero elements $-w_{i_j}$ and $+w_{k_j}$. The subgradient we choose is the convex combination which has element $-1$ at position $i_j$ and element $+1$ at position $k_j$. It is orthogonal to $G$, and thus we can find a corresponding Kuhn-Tucker vector. In the \pkg{isotone} package the Chebyshev norm is provided by the solver function \code{mSolver()}.

\subsection{The weighted absolute value norm}

For the $\ell_1$ or weighted absolute value norm
\begin{equation}
\label{eq:wabsval}
f(\xi)=\|h(\xi)\|_1=\sum_{i=1}^n |w_ih_i(\xi)|
\end{equation}
we find the optimum $\hat\xi$ by computing \emph{weighted medians} instead of weighted mid-ranges. Uniqueness problems, and the subdifferentials, will generally be smaller than in the case of $\ell_\infty$. For $\ell_1$ we define $\mathcal{E}$ to be the set of $2^n$ vectors $(\pm w_1,\pm w_2,\cdots,\pm w_n)$. The subdifferential is the convex hull of the vectors $e\in\mathcal{E}$ for which $e^\top h(\xi)=f(\xi)$. If $h_i(\xi)\not=0$ then $e_i=\mathbf{sign}(h_i(\xi))w_i$, but if $h_i(\xi)=0$ element $e_i$ can be any number in $[-w_i,+w_i]$. Thus the subdifferential is a multidimensional rectangle. If the medians are not equal to the observations the loss function is differentiable. If $h_i(\xi)=0$ for some $i$ in $I_j$ then we select the corresponding elements in the subgradient in such a way that they add up to zero over all $i\in I_j$. In the package the \code{dSolver()} function implements the weighted absolute value norm.

\subsection{Some additional solvers}
\label{sec:addsolvers}

The \pkg{isotone} package provides some pre-specified loss functions but also allows the user to define their own functions. The corresponding argument in the function \code{activeSet()} is \code{mySolver}. This can be either the function name or the equivalent string expression given in parentheses below. Each solver has extra arguments which we describe in this section.

A user-defined specification of a convex differentiable function can be achieved by setting \code{mySolver = fSolver}. The target function itself is passed to the function by means of the \code{fobj} argument and the corresponding gradient using \code{gobj}. Note that it is not at all necessary that the problems are of the regression or projection type, i.e., minimize some norm $\|y-x\|$. In fact, the driver can be easily modified to deal with general convex optimization problems with linear inequality constraints which are not necessarily of the isotone type. We give examples in the next section.

Now let us describe some functions pre-specified in the package. We start with the cases which solve the equivalent formulation of problem $\mathcal P_I$ given in (\ref{eq:gxi}), that is, the one which minimizes over $\xi$. Because of the structure of these problems it is more efficient to use this formulation. The solvers passed to the \code{mySolver} argument are
\begin{description}
\item[\code{lsSolver} (\code{"LS"}):] Solves the least squares $\ell_2$ norm given in (\ref{eq:monreg}).
\item[\code{dSolver} (\code{"L1"}):] Minimizes the weighted absolute value $\ell_1$ norm as given in (\ref{eq:wabsval}).
\item[\code{mSolver} (\code{"chebyshev"}):] Solves the Chebyshev $\ell_\infty$ norm given in (\ref{eq:cheby}).
\item[\code{pSolver} (\code{"quantile"}):] Solves the quantile regression problem as given in (\ref{eq:quantile}).
\item[\code{lfSolver} (\code{"GLS"}):] Tackles a general least squares problem of the form $f(x)=$\linebreak $(y-x)^\top W(y-x)$ where $W$ is a not necessarily positive definite matrix of order $n$.
\end{description}
The extra arguments for all these solvers are \code{y} for the observed response vector and \code{weights} for optional observation weights. The functions return a list containing \code{x} as the fitted values, \code{lbd} as the Lagrange vector, \code{f} as the value of the target function, and \code{gx} as the gradient at point $x$.

Furthermore, we provide some hybrids which are just wrapper functions and call \code{fSolver()} internally. In this case \code{fobj} and \code{gobj} do not need to be provided by the user.
\begin{description}
\item[\code{sSolver} (\code{"poisson"}):] Solves the negative Poisson log-likelihood with loss function $f(x)=\sum_{i=1}^n x_i-y_i\log(x_i)$.
\item[\code{oSolver} (\code{"Lp"}):] Minimizes the $\ell_p$ norm as given in (\ref{eq:loss}) with $m_i = 1\ \ \forall i$.
\item[\code{aSolver} (\code{"asyLS"}):] \citet{Efron:1991} gives an asymmetric least squares formulation based on~(\ref{eq:monreg}). The weights $b_w$ for $(y_i-x_i) \leq 0$ (extra argument \code{bw}) and $a_w$ for $(y_i-x_i) > 0$ (extra argument \code{aw}) are introduced. Consequently, $h_i(y_i,x_i) = (y_i-x_i)^2b_w$ for $(y_i-x_i)\leq 0$ and $h_i(y_i,x_i) = (y_i-x_i)^2a_w$ for $(y_i-x_i) > 0$.
\item[\code{eSolver} (\code{"L1eps"}):] Minimizes an $\ell_1$ approximation, $f(x)=\sum_{i=1}^n w_i\sqrt{(y_i-x_i)^2+\varepsilon}$. Correspondingly, the additional extra argument is \code{eps}.
\item[\code{hSolver} (\code{"huber"}):] Solves the Huber loss function \citep{Huber:1981} with extra argument \code{eps}:
\[
h_i(y_i,x_i)=
\left\{
\begin{array}{ll}
(y_i-x_i)^2/4\varepsilon & |y_i-x_i| \leq 2\varepsilon \\
|y_i-x_i|-\varepsilon & \textrm{otherwise}
\end{array}
\right.
\]
\item[\code{iSolver} (\code{"SILF"}):] Minimizes the soft insensitive loss function (SILF) defined in \citet{Chu+Keerthi+Ong:2004} with the two extra parameters $\varepsilon$ (argument \code{eps}) and $\beta$ (argument \code{beta}). Writing $\delta_i = y_i-x_i$, we have to distinguish as follows:
\[
h_i(y_i,x_i)=
\left\{
\begin{array}{ll}
-\delta_i-\varepsilon & \delta_i \in \left(-\infty,-(1+\beta)\varepsilon \right)\\
(\delta_i+(1-\beta)\varepsilon)^2/4\beta\varepsilon & \delta_i \in \left[-(1+\beta)\varepsilon,-(1-\beta)\varepsilon \right] \\
0 & \delta_i \in \left(-(1-\beta)\varepsilon,(1-\beta)\varepsilon \right)\\
(\delta_i-(1-\beta)\varepsilon)^2/4\beta\varepsilon & \delta_i \in \left[(1-\beta)\varepsilon,(1+\beta)\varepsilon \right]\\
\delta_i-\varepsilon & \delta_i \in \left((1+\beta)\varepsilon,\infty \right)
\end{array}
\right.
\]
\end{description}
With a little extra effort various other fashionable support vector machines (SVM) and lasso isotone regressions could be added.

%------------------------- end algorithms --------------------------
%-------------------------- END THEORY ----------------------------------

\section{Package description and examples}

\begin{table}[b!]
\begin{center}
\begin{tabular}{|p{40mm}|p{27mm}|l|p{32mm}|l|}
\hline
{Package} & {Function} & {Location} & {Description} & {Language} \\
\hline
\pkg{stats} \citep{R:09} & \code{isoreg()} & external & isotonic regression & \proglang{C} \\
\pkg{monreg} \citep{Pilz+Titoff:2009} & \code{monreg()} & external & monotonic regression & \proglang{C} \\
\pkg{fdrtool} \citep{Strimmer:2008} & \code{monoreg()} & external & monotonic regression, weights & \proglang{C} \\
\pkg{Cir} \citep{Oron:2008} & \code{pava()}, \code{cir.pava()} & external & isotonic regression, weights & \proglang{R} \\
\pkg{Iso} \citep{Turner:2009} & \code{pava()} & external & isotonic regression, weights & \proglang{Fortran} \\
\pkg{clue} \citep{Hornik:2005} & \code{pava()} & internal & isotonic regression (mean/median) & \proglang{R} \\
\pkg{logcondens} \citep{Rufibach+Duembgen:2009} & \code{isoMean()} & external & isotonic regression, weights & \proglang{R} \\
\pkg{sandwich} \citep{Zeileis:2004} & \code{pava.blocks()} & internal & isotonic regression, weights & \proglang{R}\\
\pkg{intcox} \citep{Henschel+Heiss+Mansmann:2009} & \code{intcox.pavaC()} & internal & isotonic regression, weights & \proglang{C} \\
\pkg{SAGx} \citep{Broberg:2008} & \code{pava.fdr()} & external & false discovery rate using isotonic regression & \proglang{R} \\
\pkg{smacof} \citep{deLeeuw+Mair:2009} & \code{pavasmacof()} & internal & nonmetric multidimensional scaling, ties & \proglang{R}\\
\hline
\end{tabular}
\caption{\label{tab:pava} Existing PAVA implementations in \proglang{R}.}
\end{center}
\end{table}

The \pkg{isotone} package consists of two main components: the PAVA component with its main function \code{gpava()} and the active set component with its main function \code{activeSet()}. In the following sections we describe their basic functionalities and give corresponding examples.

\pagebreak

\subsection{The PAVA component}
\label{sec:packpava}

Table~\ref{tab:pava} gives an overview of available PAVA implementations in \proglang{R}. It lists the package name, the corresponding function, whether this function is externally accessible, a brief description, and in which programming language the core computations are implemented. The functions that use \proglang{C} or \proglang{Fortran} can be expected to be faster for large observation vectors. None of the functions in Table~\ref{tab:pava} allows for the specification of general convex functions, repeated measurements, and the handling of ties for such structures. Our package, which is available on CRAN, allows for the computation of rather general target functions with additional options in terms of repeated measurements, tie handling, and weights.

The main function for this generalized PAVA computation is \code{gpava()}. The argument \code{solver} takes the solver function, which can be \code{weighted.mean}, \code{weighted.median}, \code{weighted.fractile}, or a user-specified function. \proglang{S}3 methods such as \code{print()} and \code{plot()} are provided, and in the following subsections we present two simple examples.

\subsection{Monotone regression with ties for pituitary data}

The first example is based on a dataset from \citet{Pothoff+Roy:1964} which is also analyzed in \citet{Robertson+Wright+Dykstra:1988}. The Dental School at the University of North Carolina measured the size of the pituitary fissure (in millimeters) on 11 subjects aged 8 to 14 years. The predictor is age, and it can be expected that the size of the fissure increases with age.
The data are of the following structure:
<<>>=
require("isotone")
data("pituitary")
head(pituitary)
@
As we see, we have several ties on the predictor. By means of this example we present the different results caused by the different approaches for handling ties (i.e., primary, secondary, tertiary). We compute the isotonic regression using the function \code{gpava()}:
<<>>=
res1 <- with(pituitary, gpava(age, size, ties = "primary"))
res2 <- with(pituitary, gpava(age, size, ties = "secondary"))
res3 <- with(pituitary, gpava(age, size, ties = "tertiary"))
@
<<pitplot, eval=FALSE>>=
layout(matrix(c(1,1,2,2,0,3,3,0), 2, 4, byrow = TRUE))
plot(res1, main = "PAVA plot (primary)")
plot(res2, main = "PAVA plot (secondary)")
plot(res3, main = "PAVA plot (tertiary)")
@
\setkeys{Gin}{width=\textwidth}
\begin{figure}[t!]
\begin{center}
<<fig=TRUE, echo=FALSE>>=
<<pitplot>>
@
\caption{\label{fig:pitties} Different tie approaches for pituitary data.}
\end{center}
\end{figure}
For the primary method we can have different (monotonic) fitted values within tied observations, for the secondary the fitted values are the same within ties, and the tertiary approach only requires monotonicity on the means. This can be examined by doing
<<>>=
tapply(res3$x, res3$z, mean)
@
Using the tertiary approach the fitted values within ties can even decrease (see, e.g., within the 10 year old children). The plots in Figure~\ref{fig:pitties} show the fitted step functions for each of the approaches.

\subsection{Repeated measures using posturographic data}

To demonstrate PAVA on longitudinal data we use a subset of the data collected in \citet{Leitner+Mair:2008}. The sample consists of 50 subjects (healthy controls and patients with chronic low back pain). The subjects' task in a sensory organization test (SOT) was to keep their balance on a dynamic $18 \times 18$ inch dual force plate. Measurements under various testing conditions were collected and the SOT equilibrium scores computed. They were based on the maximum anterior-posterior sway angle during the SOT trials and reflected the overall coordination of the visual, proprioceptive, and vestibular components of balance needed to maintain standing posture. These equilibrium scores represent the angular difference between the subject's calculated anterior/posterior center of gravity displacements and the theoretical maximum of $12.5^{\circ}$. A score of 100 theoretically indicates no anterior/posterior excursion. A score of 0 indicates a fall of the subject. Thus, the higher the score, the better a person is able to maintain balance. The subset we select for this example is based on the condition where both proprioceptive and visual stimuli are altered by moving a surrounding visual screen and the platform with the subject's anterior/posterior body sway. We examine the relationship between body height and three repeated SOT scores.
<<>>=
data("posturo")
head(posturo)
@
PAVA is performed using both the weighted mean and the weighted median target function, with the secondary approach for ties.
<<>>=
res.mean <- with(posturo, gpava(height, cbind(SOT.1, SOT.2, SOT.3),
  solver = weighted.mean, ties = "secondary"))
res.median <- with(posturo, gpava(height, cbind(SOT.1, SOT.2, SOT.3),
  solver = weighted.median, ties = "secondary"))
@
<<postplot, eval=FALSE>>=
plot(res.mean)
plot(res.median)
@
\setkeys{Gin}{width=\textwidth}
\begin{figure}
\begin{center}
<<fig=TRUE, echo=FALSE>>=
par(mar = c(5,4,4,2))
par(mfrow = c(1, 2))
<<postplot>>
@
\caption{\label{fig:post} Step functions for posturographic data.}
\end{center}
\end{figure}
The fitted step functions for the two different target functions are given in Figure~\ref{fig:post}.
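To make the pooling steps of Section~\ref{sec:gpavath} concrete, the following didactic sketch implements a minimal PAVA loop in base \proglang{R} for the least squares case under a total order (unit weights by default). The function \code{pavaMean()} is an illustration only and not the implementation used by \code{gpava()}; its result should nevertheless agree with the package fit.
<<eval=FALSE>>=
## Didactic sketch of PAVA for the least squares case (weighted means),
## total order, pooling one adjacent violator per pass.
## pavaMean() is illustrative only, not the gpava() implementation.
pavaMean <- function(y, w = rep(1, length(y))) {
  blocks <- lapply(seq_along(y), function(i) list(val = y[i], w = w[i], idx = i))
  repeat {
    vals <- sapply(blocks, function(b) b$val)
    viol <- which(diff(vals) < 0)[1]            # first adjacent violator
    if (is.na(viol)) break                      # monotone: done
    b1 <- blocks[[viol]]; b2 <- blocks[[viol + 1]]
    merged <- list(val = (b1$w * b1$val + b2$w * b2$val) / (b1$w + b2$w),
                   w = b1$w + b2$w, idx = c(b1$idx, b2$idx))
    blocks <- append(blocks[-c(viol, viol + 1)], list(merged), after = viol - 1)
  }
  x <- numeric(length(y))
  for (b in blocks) x[b$idx] <- b$val
  x
}
set.seed(12345)
y <- rnorm(9)
all.equal(pavaMean(y), gpava(1:9, y)$x)   # should agree with the package fit
@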
%--------------------- end pava component -------------------------

\subsection{The active set component}

As mentioned above, a primal active set implementation according to the elaborations in Section~\ref{sec:activestep} is provided by the function \code{activeSet()}, which has the following arguments:
\begin{description}
\item[\code{isomat}:] Matrix with 2 columns that contains the isotonicity conditions (see examples).
\item[\code{mySolver}:] Either a user-specified solver function or one of the pre-specified solvers as described in Section~\ref{sec:addsolvers}.
\item[\code{x0}:] Optional vector containing starting values. The response values $y$ are passed to the solver function (see below).
\item[\code{ups}:] Upper boundary $\varepsilon$ for KKT feasibility checking.
\item[\code{check}:] If \code{TRUE}, a final KKT feasibility check is performed.
\end{description}
Each of the solvers described in Section~\ref{sec:addsolvers} requires various extra arguments (see the corresponding package help files for a detailed description). Each of the solvers needs to have the response vector $y$ as argument.

Now we use some small artificial examples to show the application of different solver routines. The data are
<<>>=
set.seed(12345)
y <- rnorm(9)
w1 <- rep(1, 9)
Atot <- cbind(1:8, 2:9)
@
The variable \code{Atot} defines the pairwise isotonicity matrix which needs to have 2 columns and as many rows as unique isotonicity constraints.
<<>>=
Atot
@
We see that this matrix defines a total order. The specification is always as follows: the element in column 2 $\succeq$ the element in column 1, e.g., the first row states $x_2 \geq x_1$.

Let us start with a simple least squares model using \code{lsSolver()} and the equivalent user-specification of the $\ell_2$ norm using \code{fSolver()} with corresponding target function \code{fobj} and its gradient \code{gobj}.
<<>>=
fit.ls1 <- activeSet(Atot, "LS", y = y, weights = w1)
fit.ls2 <- activeSet(Atot, fSolver, y = y, weights = w1,
  fobj = function(x) sum(w1 * (x - y)^2),
  gobj = function(x) 2 * drop(w1 * (x - y)))
@
This model has unit weights. We will refit this model with a different weight vector and, subsequently, fit its generalization using a non-diagonal weight matrix by means of \code{lfSolver()}.
<<>>=
set.seed(12345)
wvec <- 1:9
wmat <- crossprod(matrix(rnorm(81), 9, 9))/9
fit.wls <- activeSet(Atot, "LS", y = y, weights = wvec)
fit.gls <- activeSet(Atot, "GLS", y = y, weights = wmat)
@
Quantile regression can be performed by means of \code{pSolver()}:
<<>>=
fit.qua <- activeSet(Atot, "quantile", y = y, weights = wvec, aw = 0.3, bw = 0.7)
@
Now let us generalize the LS problems in terms of other norms, using again unit weights. Let us start with the $\ell_1$ norm provided by \code{dSolver()}.
<<>>=
fit.abs <- activeSet(Atot, "L1", y = y, weights = w1)
@
This exact absolute value norm can be approximated with either \code{eSolver()}, which requires $\varepsilon$, or \code{oSolver()}, which requires the power $p$.
<<>>=
fit.eps <- activeSet(Atot, "L1eps", y = y, weights = w1, eps = 1e-04)
fit.pow <- activeSet(Atot, "Lp", y = y, weights = w1, p = 1.2)
@
Furthermore, the Chebyshev $\ell_{\infty}$ norm can be calculated using \code{mSolver()}.
<<>>=
fit.che <- activeSet(Atot, "chebyshev", y = y, weights = w1)
@
Efron's asymmetric LS problem can be solved by means of \code{aSolver()} with the extra arguments \code{aw} for the $y_i > x_i$ case and \code{bw} for the $y_i \leq x_i$ case:
<<>>=
fit.asy <- activeSet(Atot, "asyLS", y = y, weights = w1, aw = 2, bw = 1)
@
The Huber loss function with the $\varepsilon$ parameter and the SILF SVM extension with additionally $\beta$ as parameter can be computed as follows:
<<>>=
fit.hub <- activeSet(Atot, "huber", y = y, weights = w1, eps = 1)
fit.svm <- activeSet(Atot, "SILF", y = y, weights = w1, beta = 0.8, eps = 0.2)
@
As a final norm let us consider the negative Poisson log-likelihood specified in \code{sSolver()}, using Poisson distributed random numbers with parameter $\lambda = 5$.
<<>>=
set.seed(12345)
yp <- rpois(9, 5)
x0 <- 1:9
fit.poi <- activeSet(Atot, "poisson", x0 = x0, y = yp)
@
So far we have focused on total orders only. Now we go back to the LS case and specify different types of orders according to the Hasse diagrams in Figure~\ref{fig:hasse}. The tree order in Figure~\ref{F:tree} can be defined as follows:
<<>>=
Atree <- matrix(c(1, 1, 2, 2, 2, 3, 3, 8, 2, 3, 4, 5, 6, 7, 8, 9), 8, 2)
Atree
fit.tree <- activeSet(Atree, "LS", y = y, weights = w1)
@
The loop order given in Figure~\ref{F:loop} and the corresponding LS-fit are
<<>>=
Aloop <- matrix(c(1, 2, 3, 3, 4, 5, 6, 6, 7, 8, 3, 3, 4, 5, 6, 6, 7, 8, 9, 9), 10, 2)
Aloop
fit.loop <- activeSet(Aloop, "LS", y = y, weights = w1)
@
Finally, using the same data we fit the block order given in Figure~\ref{F:block}.
<<>>=
Ablock <- cbind(c(rep(1, 3), rep(2, 3), rep(3, 3), rep(4, 3), rep(5, 3),
  rep(6, 3)), c(rep(c(4, 5, 6), 3), rep(c(7, 8, 9), 3)))
Ablock
fit.block <- activeSet(Ablock, "LS", y = y, weights = w1)
@
Other types of orders can be defined easily by a proper specification of the constraint matrix.

\subsection{PAVA computation using active set}

In Section~\ref{sec:gpavath} we mentioned that PAVA can be considered as a dual active set method. Using the simulated data again, we show that LS PAVA and LS active set lead to the same results; consequently, the mean squared error is essentially 0.
<<>>=
pava.fitted <- gpava(1:9, y)$x
aset.fitted <- activeSet(Atot, "LS", weights = w1, y = y)$x
mse <- mean((pava.fitted - aset.fitted)^2)
mse
@
Obviously, the active set approach is somewhat more general than PAVA in terms of loss functions and different types of orders. But, given the way we have implemented both functions, \code{gpava()} is more convenient and efficient than \code{activeSet()} in the cases to which it applies. In the \code{gpava()} function we provide an easy way to handle multiple measurements. Basically, users can achieve this in \code{activeSet()} as well by a corresponding solver specification. Furthermore, \code{gpava()} offers different approaches to handle ties whereas \code{activeSet()} does not. Finally, for large data problems, PAVA is computationally more efficient than the active set method.

\section{Discussion}

After a historical outline we presented the theory and the algorithmic descriptions of approaches for solving isotone optimization problems: PAVA and active set. We started with the classical representation of the monotone LS regression problem with simple chain constraints and extended it to general convex programming problems with linear inequality constraints. This leads to a general isotone optimization framework that is implemented in the \proglang{R} package \pkg{isotone}.
Applications to real and simulated data sets were shown. One option for extending the PAVA algorithm is to allow for multiple predictors. This approach is elaborated in \citet{Burdakov+Grimvall+Hussian:2004}. Of course, the field of convex analysis is rather extensive and our \code{activeSet()} function covers only a certain fraction of it \citep[cf.][]{Boyd+Vandenberghe:2004}. Since we use \code{optim()} in the \code{fSolver()} function, this may cause trouble for non-differentiable convex functions. Therefore, replacing this optimizer would facilitate the optimization of such non-differentiable problems. In addition, for a target function $f(x)$ we allow for only one particular type of solver $s(x)$; that is, the target function is separable, but the components must all be of the same type. This could be extended to allow different solvers within one particular optimization problem.

\newpage

\bibliography{isotone}

\end{document}