title: “softImpute Vignette” author: “Trevor Hastie” date: “September 5, 2014”

## output: html_document

softImpute is a package for matrix completion using nuclear norm regularization. It offers two algorithms:

• One iteratively computes the soft-thresholded SVD of a filled in matrix - an algorithm described in Mazumder et al (2010). This is option type="svd" in the call to softImpute().

• The other uses alternating ridge regression, at each stage filling in the missing entries with the latest estimates. This is described in Hastie et al (2014). This we believe is the faster option, and is option type="als" in the call to softImpute.

The package can deal with both small and very large matrices; the latter are stored in sparse-matrix format using a new S4 class "Incomplete". For example, softImpute can happily fit a rank 100 SVD to the netflix data (480,189 x 17,770, 99% missing) using a machine with about 32Gb of memory. For smaller matrices with missing data, the usual full matrix with NA suffices.

The package has two other notable features. It can compute a (low-rank) SVD of a large sparse matrix, stored in sparse matrix format. This can be done either with nuclear-norm regularization or not, the former being somewhat faster.

There is a function biScale() that generalizes the scale() function in R. It can simultaneously scale a matrix to have row and column means zero, and row and column variances one. Of course, any subset of these can be chosen as well. This function can deal with the missing values, if present. It is also smart enough to deal with large sparse matrices. In this case, it does not actually apply the centering operation and destroy the sparsity; instead it stores the resulting matrix in the new matrix class "SparseplusLowRank". This is a special class used by softImpute, originally created to allow it to avoid ever storing a large, filled in matrix, when the original matrix was stored in sparse matrix format via class "Incomplete".

This vignette is a simple guide to using the package.

## What softImpute solves

Here we briefly describe the problem solved. Suppose $$X$$ is a large $$m\times n$$ matrix, with many missing entries. Let $$\Omega$$ contain the pairs of indices $$(i,j)$$ where $$X$$ is observed, and let $$P_\Omega(X)$$ denote a matrix with the entries in $$\Omega$$ left alone, and all other entries set to zero. So when $$X$$ has missing entries in $$\Omega^\perp$$, $$P_\Omega(X)$$ would set the missing values to zero.

Consider the criterion $\min_M\frac12\|P_\Omega(X)-P_\Omega(M)\|^2_F+\lambda\|M\|_*,$ where $$\|M\|_*$$ is the nucelar norm of $$M$$ (sum of singular values).

If $$\widehat M$$ solves this convex-cone problem, then it satisfies the following stationarity condition: ${\widehat M}=S_\lambda(Z)$ where $Z=P_\Omega(X)+P_{\Omega^\perp}(\widehat M).$ Hence $$Z$$ is the “filled in”“ version of $$X$$. The operator $$S_\lambda(Z)$$ applied to matrix $$Z$$ does the following:

1. Compute the SVD of $$Z=UDV^T$$, and let $$d_i$$ be the singular values in $$D$$.
2. Soft-threshold the singular values: $$d_i^*= (d_i-\lambda)_+$$.
3. Reconstruct: $$S_\lambda(Z)=UD^*V^T$$. We call this operation the "soft-thresholded SVD”. Notice that for sufficiently large $$\lambda$$, $$D^*$$ will be rank-reduced, and hence so will be $$UD^*V^T$$.

This suggests the obvious iterative algorithm: using the current estimate for $$M$$, create $$Z$$, and update $$M$$ by the soft-thresholded SVD of $$Z$$.

This is exactly what softImpute does on (small) matrices with missing values stored as NAs. By small we mean small enough that the SVD can be computed by R in a small amount of time.

This is not tenable for very large matrices, like those stored as class "Incomplete". Here we make two very important changes to the recipe:

• Re-express $$Z$$ at each iteration as as $Z=P_\Omega(X)-P_\Omega(\widehat M) + \widehat M.$ This is of the form "SparseplusLowRank" (assuming $$\widehat M$$ is low rank), and hence can be stored. Left and right matrix multiplies by skinny matrices can also be efficiently performed.
• Anticipating that the solution $$\widehat M$$ will have low rank, compute only a low-rank SVD of $$Z$$, using alternating subspace methods.

Indeed, softImpute has a rank argument that allows one to limit the rank of the solution; if the algorithm returns a solution with rank lower than the specified rank $$r$$, you know it has solved the unrestricted problem.

Consider the alternative criterion $\min_{A,B}\frac12\|P_\Omega(X)-P_\Omega(AB^T)\|^2_F+\frac{\lambda}2(\|A\|_F^2 +\|B\|_F^2),$ where $$A$$ and $$B$$ have each $$r$$ columns, and let us suppose that $$r$$ is bigger than or equal to the solution rank of the earlier criterion. This problem is not convex, but remarkably, it has a solution that satisfies $${\widehat A}{\widehat B}^T=\widehat M$$!

We can once again characterize the stationarity conditions, now in terms of $$\widehat A$$ and $$\widehat B$$. Let $Z=P_\Omega(X)+P_{\Omega^\perp}({\widehat A}{\widehat B}^T),$ the filled in version of $$X$$. Then $\widehat B= ({\widehat A}^T{\widehat A}+\lambda I)^{-1}{\widehat A}^T Z.$ We get $$\widehat B$$ by ridge regressions of the columns of $$Z$$ on $$\widehat A$$. For $$\widehat A$$ its the same, with the roles of $$A$$ and $$B$$ reversed. This again suggests an obvious alternation, now by ridged regressions. After each regression, we update the component $$A$$ or $$B$$, and the filled in $$Z$$. If $$r$$ is sufficiently large, this again solves the same problem as before.

This last algorithm (softImpute ALS) can be seen as combining the alternating subspace SVD algorithm for computing the SVD with the iterative filling in and SVD calculation. It turns out that this interweaving leads to computational savings, and allows for a very efficient distributed implementation (not covered here).

## A simple example

We will start with a small and simple example. Lets generate a small matrix and make some values missing.

require(softImpute)

## Loading required package: softImpute

## Loading required package: Matrix

## Loaded softImpute 1.4-1

set.seed(1011)
x=matrix(rnorm(30),6,5)
x[sample(1:30,10,replace=FALSE)]=NA
x

##            [,1]        [,2]       [,3]       [,4]       [,5]
## [1,]  0.8654889  0.01565179         NA         NA  0.7868103
## [2,] -0.6004172 -0.39411039         NA 0.11949527         NA
## [3,]         NA          NA -1.0545575         NA         NA
## [4,]  0.6965558 -0.50331812         NA 1.54375663  0.4265161
## [5,]  1.2311610 -0.34232368 -0.8102688         NA -0.1325694
## [6,]  0.2664415  0.14486388  0.6051380 0.03406604 -2.2408786

fits=softImpute(x,trace=TRUE,type="svd")

## 1 : obj 0.06702 ratio 0.06610829
## 2 : obj 0.04042 ratio 0.02307993
## 3 : obj 0.02846 ratio 0.008449936
## 4 : obj 0.02306 ratio 0.0037347
## 5 : obj 0.02028 ratio 0.002094635
## 6 : obj 0.01855 ratio 0.001436751
## 7 : obj 0.01727 ratio 0.001120993
## 8 : obj 0.01621 ratio 0.0009378269
## 9 : obj 0.0153 ratio 0.0008121361
## 10 : obj 0.01447 ratio 0.000714582
## 11 : obj 0.01372 ratio 0.0006329802
## 12 : obj 0.01304 ratio 0.0005620697
## 13 : obj 0.01242 ratio 0.0004994695
## 14 : obj 0.01185 ratio 0.0004439727
## 15 : obj 0.01133 ratio 0.0003948185
## 16 : obj 0.01086 ratio 0.0003513964
## 17 : obj 0.01043 ratio 0.0003131414
## 18 : obj 0.01004 ratio 0.0002795089
## 19 : obj 0.00969 ratio 0.0002499776
## 20 : obj 0.00936 ratio 0.0002240586
## 21 : obj 0.00907 ratio 0.0002013033
## 22 : obj 0.0088 ratio 0.0001813079
## 23 : obj 0.00855 ratio 0.0001637137
## 24 : obj 0.00832 ratio 0.0001482057
## 25 : obj 0.00811 ratio 0.0001345095
## 26 : obj 0.00792 ratio 0.000122387
## 27 : obj 0.00774 ratio 0.0001116329
## 28 : obj 0.00758 ratio 0.0001020699
## 29 : obj 0.00743 ratio 9.35457e-05
## 30 : obj 0.00728 ratio 8.592893e-05
## 31 : obj 0.00715 ratio 7.910658e-05
## 32 : obj 0.00703 ratio 7.29812e-05
## 33 : obj 0.00692 ratio 6.746868e-05
## 34 : obj 0.00681 ratio 6.249629e-05
## 35 : obj 0.00671 ratio 5.800103e-05
## 36 : obj 0.00662 ratio 5.392822e-05
## 37 : obj 0.00653 ratio 5.02303e-05
## 38 : obj 0.00645 ratio 4.686583e-05
## 39 : obj 0.00637 ratio 4.379861e-05
## 40 : obj 0.0063 ratio 4.099692e-05
## 41 : obj 0.00623 ratio 3.843298e-05
## 42 : obj 0.00616 ratio 3.608232e-05
## 43 : obj 0.0061 ratio 3.39234e-05
## 44 : obj 0.00604 ratio 3.193718e-05
## 45 : obj 0.00599 ratio 3.010683e-05
## 46 : obj 0.00593 ratio 2.841741e-05
## 47 : obj 0.00588 ratio 2.685564e-05
## 48 : obj 0.00584 ratio 2.540973e-05
## 49 : obj 0.00579 ratio 2.406911e-05
## 50 : obj 0.00575 ratio 2.282437e-05
## 51 : obj 0.00571 ratio 2.166707e-05
## 52 : obj 0.00567 ratio 2.058965e-05
## 53 : obj 0.00563 ratio 1.958529e-05
## 54 : obj 0.00559 ratio 1.864788e-05
## 55 : obj 0.00556 ratio 1.777189e-05
## 56 : obj 0.00552 ratio 1.695234e-05
## 57 : obj 0.00549 ratio 1.618472e-05
## 58 : obj 0.00546 ratio 1.546493e-05
## 59 : obj 0.00543 ratio 1.478928e-05
## 60 : obj 0.0054 ratio 1.415439e-05
## 61 : obj 0.00538 ratio 1.355719e-05
## 62 : obj 0.00535 ratio 1.299489e-05
## 63 : obj 0.00533 ratio 1.246494e-05
## 64 : obj 0.0053 ratio 1.196501e-05
## 65 : obj 0.00528 ratio 1.149297e-05
## 66 : obj 0.00526 ratio 1.104687e-05
## 67 : obj 0.00523 ratio 1.062492e-05
## 68 : obj 0.00521 ratio 1.022547e-05
## 69 : obj 0.00519 ratio 9.847008e-06

fits

## $u ## [,1] [,2] ## [1,] -0.3797451 0.2586730 ## [2,] -0.5335339 -0.5958327 ## [3,] -0.4909516 0.1821492 ## [4,] -0.3171147 0.2827295 ## [5,] -0.3300830 0.5487556 ## [6,] 0.3472537 0.4047526 ## ##$d
## [1] 5.106126 3.762716
##
## $v ## [,1] [,2] ## [1,] -0.1792890 0.45184864 ## [2,] 0.1383413 -0.02634153 ## [3,] 0.3991341 -0.06952173 ## [4,] -0.5419099 0.62015693 ## [5,] -0.7040900 -0.63695451 ## ## attr(,"lambda") ## [1] 0 ## attr(,"call") ## softImpute(x = x, type = "svd", trace.it = TRUE)  Since this is a small matrix, it has solved it using repeated SVDs. There is no penalization here ($$\lambda=0$$), and by default the rank was taken to be 2. Since there is no penalization, if the rank was given to be $$\min(m,n)$$, then there is no restriction, and any values for the missing data would give the same minimum loss of 0. In other words, either penalization, or a rank restriction (or both) are needed for sensible imputation. We could use ALS instead here (the default for type= argument) fita=softImpute(x,trace=TRUE)  ## 1 : obj 0.25203 ratio 4.586827 ## 2 : obj 0.16171 ratio 0.1327814 ## 3 : obj 0.15175 ratio 0.04043357 ## 4 : obj 0.13433 ratio 0.0662755 ## 5 : obj 0.10898 ratio 0.06977428 ## 6 : obj 0.0856 ratio 0.05503238 ## 7 : obj 0.06107 ratio 0.05359341 ## 8 : obj 0.03855 ratio 0.03282198 ## 9 : obj 0.02771 ratio 0.01165608 ## 10 : obj 0.02336 ratio 0.0043565 ## 11 : obj 0.02122 ratio 0.002039977 ## 12 : obj 0.01999 ratio 0.001152274 ## 13 : obj 0.01918 ratio 0.0007496857 ## 14 : obj 0.01861 ratio 0.0005395284 ## 15 : obj 0.01816 ratio 0.0004171491 ## 16 : obj 0.0178 ratio 0.0003399138 ## 17 : obj 0.01749 ratio 0.0002881306 ## 18 : obj 0.01722 ratio 0.0002516812 ## 19 : obj 0.01697 ratio 0.0002249267 ## 20 : obj 0.01674 ratio 0.0002045315 ## 21 : obj 0.01652 ratio 0.0001884323 ## 22 : obj 0.01632 ratio 0.0001753094 ## 23 : obj 0.01612 ratio 0.0001642956 ## 24 : obj 0.01594 ratio 0.0001548095 ## 25 : obj 0.01576 ratio 0.000146455 ## 26 : obj 0.01559 ratio 0.0001389589 ## 27 : obj 0.01542 ratio 0.0001321311 ## 28 : obj 0.01526 ratio 0.0001258384 ## 29 : obj 0.0151 ratio 0.0001199861 ## 30 : obj 0.01495 ratio 0.0001145071 ## 31 : obj 0.0148 ratio 0.0001093526 ## 32 : obj 0.01466 ratio 0.0001044868 ## 33 : obj 0.01453 ratio 9.988287e-05 ## 34 : obj 0.01439 ratio 9.552008e-05 ## 35 : obj 0.01426 ratio 9.138181e-05 ## 36 : obj 0.01414 ratio 8.745431e-05 ## 37 : obj 0.01402 ratio 8.372574e-05 ## 38 : obj 0.01391 ratio 8.01856e-05 ## 39 : obj 0.01379 ratio 7.682427e-05 ## 40 : obj 0.01368 ratio 7.363282e-05 ## 41 : obj 0.01358 ratio 7.060276e-05 ## 42 : obj 0.01348 ratio 6.772603e-05 ## 43 : obj 0.01338 ratio 6.499487e-05 ## 44 : obj 0.01328 ratio 6.240184e-05 ## 45 : obj 0.01319 ratio 5.993977e-05 ## 46 : obj 0.0131 ratio 5.760178e-05 ## 47 : obj 0.01302 ratio 5.538126e-05 ## 48 : obj 0.01293 ratio 5.327189e-05 ## 49 : obj 0.01285 ratio 5.126763e-05 ## 50 : obj 0.01277 ratio 4.936272e-05 ## 51 : obj 0.01269 ratio 4.755166e-05 ## 52 : obj 0.01262 ratio 4.582927e-05 ## 53 : obj 0.01255 ratio 4.419059e-05 ## 54 : obj 0.01248 ratio 4.263095e-05 ## 55 : obj 0.01241 ratio 4.114594e-05 ## 56 : obj 0.01234 ratio 3.973139e-05 ## 57 : obj 0.01228 ratio 3.838335e-05 ## 58 : obj 0.01221 ratio 3.709812e-05 ## 59 : obj 0.01215 ratio 3.587221e-05 ## 60 : obj 0.01209 ratio 3.470232e-05 ## 61 : obj 0.01203 ratio 3.358538e-05 ## 62 : obj 0.01198 ratio 3.251847e-05 ## 63 : obj 0.01192 ratio 3.149886e-05 ## 64 : obj 0.01187 ratio 3.052399e-05 ## 65 : obj 0.01181 ratio 2.959143e-05 ## 66 : obj 0.01176 ratio 2.869892e-05 ## 67 : obj 0.01171 ratio 2.784434e-05 ## 68 : obj 0.01166 ratio 2.702567e-05 ## 69 : obj 0.01162 ratio 2.624105e-05 ## 70 : obj 0.01157 ratio 2.548869e-05 ## 71 : obj 0.01152 ratio 2.476693e-05 ## 72 : obj 0.01148 ratio 2.407422e-05 ## 73 : obj 0.01144 ratio 2.340907e-05 ## 74 : obj 0.01139 ratio 2.277011e-05 ## 75 : obj 0.01135 ratio 2.215603e-05 ## 76 : obj 0.01131 ratio 2.15656e-05 ## 77 : obj 0.01127 ratio 2.099767e-05 ## 78 : obj 0.01123 ratio 2.045114e-05 ## 79 : obj 0.01119 ratio 1.992499e-05 ## 80 : obj 0.01116 ratio 1.941826e-05 ## 81 : obj 0.01112 ratio 1.893002e-05 ## 82 : obj 0.01108 ratio 1.845941e-05 ## 83 : obj 0.01105 ratio 1.800561e-05 ## 84 : obj 0.01101 ratio 1.756787e-05 ## 85 : obj 0.01098 ratio 1.714544e-05 ## 86 : obj 0.01095 ratio 1.673764e-05 ## 87 : obj 0.01091 ratio 1.634383e-05 ## 88 : obj 0.01088 ratio 1.596337e-05 ## 89 : obj 0.01085 ratio 1.55957e-05 ## 90 : obj 0.01082 ratio 1.524025e-05 ## 91 : obj 0.01079 ratio 1.48965e-05 ## 92 : obj 0.01076 ratio 1.456395e-05 ## 93 : obj 0.01073 ratio 1.424214e-05 ## 94 : obj 0.0107 ratio 1.393062e-05 ## 95 : obj 0.01067 ratio 1.362895e-05 ## 96 : obj 0.01065 ratio 1.333674e-05 ## 97 : obj 0.01062 ratio 1.30536e-05 ## 98 : obj 0.01059 ratio 1.277917e-05 ## 99 : obj 0.01057 ratio 1.251311e-05 ## 100 : obj 0.01054 ratio 1.225507e-05  ## Warning in simpute.als(x, J, thresh, lambda, maxit, trace.it, warm.start, : ## Convergence not achieved by 100 iterations  The objectives are different! At this point we are playing with non-convex optimization, and so the solutions can be local minima. Lets use some regularization now, choosing a value for lambda that will give a rank 2 solution (this required trial and error to get it right). fits2=softImpute(x,rank.max=3,lambda=1.9,trace=TRUE,type="svd")  ## 1 : obj 0.34108 ratio 0.01044924 ## 2 : obj 0.34087 ratio 0.0008944817 ## 3 : obj 0.34085 ratio 9.099969e-05 ## 4 : obj 0.34085 ratio 9.739077e-06  fita2=softImpute(x,rank.max=3,lambda=1.9,trace=TRUE)  ## 1 : obj 0.39531 ratio 1.162386 ## 2 : obj 0.35119 ratio 0.175445 ## 3 : obj 0.34429 ratio 0.04351293 ## 4 : obj 0.34224 ratio 0.01285728 ## 5 : obj 0.34151 ratio 0.004385971 ## 6 : obj 0.34121 ratio 0.001679768 ## 7 : obj 0.34106 ratio 0.0007045559 ## 8 : obj 0.34098 ratio 0.0003235072 ## 9 : obj 0.34093 ratio 0.0001641758 ## 10 : obj 0.3409 ratio 9.236062e-05 ## 11 : obj 0.34089 ratio 5.698217e-05 ## 12 : obj 0.34088 ratio 3.777872e-05 ## 13 : obj 0.34087 ratio 2.634852e-05 ## 14 : obj 0.34086 ratio 1.900665e-05 ## 15 : obj 0.34086 ratio 1.401783e-05 ## 16 : obj 0.34086 ratio 1.049385e-05 ## 17 : obj 0.34086 ratio 7.938875e-06 ## final SVD: obj 0.34085  fits2$d

## [1] 0.7058351 0.1224937 0.0000000


These two are the same (modulo convergence criterion). Because the smallest singular value is zero, we know we are in the convex regime, and so both algorithms give the same solution. We can impute the missing values using complete(), which returns the full matrix:

complete(x,fits2)

##             [,1]        [,2]        [,3]       [,4]         [,5]
## [1,]  0.86548894  0.01565179 -0.09395368 0.06964884  0.786810255
## [2,] -0.60041722 -0.39411039  0.01072968 0.11949527 -0.009833786
## [3,]  0.04108268 -0.02061187 -1.05455746 0.03611551  0.109137401
## [4,]  0.69655580 -0.50331812 -0.10903638 1.54375663  0.426516062
## [5,]  1.23116102 -0.34232368 -0.81026875 0.06793878 -0.132569422
## [6,]  0.26644155  0.14486388  0.60513797 0.03406604 -2.240878626


We can first double center our matrix before completion

xc=biScale(x,col.scale=FALSE,row.scale=FALSE,trace=TRUE)

## Iter 1 Total Changes 2.154972
## Iter 2 Total Changes 0.2193935
## Iter 3 Total Changes 0.02088204
## Iter 4 Total Changes 0.003046592
## Iter 5 Total Changes 0.0004756816
## Iter 6 Total Changes 7.498127e-05
## Iter 7 Total Changes 1.183915e-05
## Iter 8 Total Changes 1.870081e-06
## Iter 9 Total Changes 2.954266e-07
## Iter 10 Total Changes 4.667172e-08
## Iter 11 Total Changes 7.373316e-09
## Iter 12 Total Changes 1.164859e-09
## Iter 13 Total Changes 1.840282e-10

xc

##            [,1]       [,2]       [,3]        [,4]       [,5]
## [1,] -0.2404891 -0.3826329         NA          NA  0.6231220
## [2,] -0.4506120  0.4633881         NA -0.01277613         NA
## [3,]         NA         NA  0.0000000          NA         NA
## [4,] -0.1862984 -0.6784790         NA  0.37882582  0.4859516
## [5,]  0.7289970 -0.1367944 -0.8997589          NA  0.3075563
## [6,]  0.1483993  0.7345150  0.8997696 -0.36605279 -1.4166311
## attr(,"biScale:row")
## attr(,"biScale:row")$center ## [1] 0.68813207 -0.56765115 -1.05972954 0.46500826 0.08431808 -0.29980372 ## ## attr(,"biScale:row")$scale
## [1] 1 1 1 1 1 1
##
## attr(,"biScale:column")
## attr(,"biScale:column")$center ## [1] 0.417845944 -0.289847375 0.005172072 0.699922549 -0.524443794 ## ## attr(,"biScale:column")$scale
## [1] 1 1 1 1 1
##
## attr(,"critmat")
##       iter         crit
##  [1,]    1 2.154972e+00
##  [2,]    2 2.193935e-01
##  [3,]    3 2.088204e-02
##  [4,]    4 3.046592e-03
##  [5,]    5 4.756816e-04
##  [6,]    6 7.498127e-05
##  [7,]    7 1.183915e-05
##  [8,]    8 1.870081e-06
##  [9,]    9 2.954266e-07
## [10,]   10 4.667172e-08
## [11,]   11 7.373316e-09
## [12,]   12 1.164859e-09
## [13,]   13 1.840282e-10

fits3=softImpute(xc,rank.max=3,lambda=1,type="svd")
fits3$d  ## [1] 1.32863296 0.09314895 0.00000000  complete(x,fits3,unscale=TRUE)  ## [,1] [,2] [,3] [,4] [,5] ## [1,] 0.8654889 0.01565179 0.4977103 1.47716505 0.7868103 ## [2,] -0.6004172 -0.39411039 -0.4352420 0.11949527 -1.2431261 ## [3,] -0.6418836 -1.34957691 -1.0545575 -0.35980699 -1.5841733 ## [4,] 0.6965558 -0.50331812 0.2393086 1.54375663 0.4265161 ## [5,] 1.2311610 -0.34232368 -0.8102688 0.86688289 -0.1325694 ## [6,] 0.2664415 0.14486388 0.6051380 0.03406604 -2.2408786  Notice that we completed x with fits3, which was run on the centered version xc. The scaling info is stored on the SVD object as an attribute, and with unscale=TRUE (actually the default), the centering is reversed. ## Debiasing the fit We have recently added a function deBias (since version 1.4) that allows one to upscale the elements d of the SVD object returned by softImpute. This is achieved by linear regression using the observed values of x. fits2$d

## [1] 0.7058351 0.1224937 0.0000000

deBias(x,fits2)$d  ## [1] 2.664118 2.020497 1.790697  ## Using the sparse matrix version So far we have not been worried about matrix size, because x is small. We can convert it to a sparse matrix object xs=as(x,"Incomplete") xs  ## 6 x 5 sparse Matrix of class "Incomplete" ## ## [1,] 0.8654889 0.01565179 . . 0.7868103 ## [2,] -0.6004172 -0.39411039 . 0.11949527 . ## [3,] . . -1.0545575 . . ## [4,] 0.6965558 -0.50331812 . 1.54375663 0.4265161 ## [5,] 1.2311610 -0.34232368 -0.8102688 . -0.1325694 ## [6,] 0.2664415 0.14486388 0.6051380 0.03406604 -2.2408786  Notice that it stores the missing entries as “zeros”, but because it is class "Incomplete", the object “knows”“ this is not really the case. In practice, we would not run as() on a huge matrix with tons of missing values, because we probably could not fit it in memory. So we would need a way of getting this matrix into R. This is typically stored on disk in what is known as "market matrix” format, as the triples (i,j,value). We can reverse engineer this here just for a demo: i=row(x)[!is.na(x)] j=col(x)[!is.na(x)] value=x[!is.na(x)] cbind(i,j,value)  ## i j value ## [1,] 1 1 0.86548894 ## [2,] 2 1 -0.60041722 ## [3,] 4 1 0.69655580 ## [4,] 5 1 1.23116102 ## [5,] 6 1 0.26644155 ## [6,] 1 2 0.01565179 ## [7,] 2 2 -0.39411039 ## [8,] 4 2 -0.50331812 ## [9,] 5 2 -0.34232368 ## [10,] 6 2 0.14486388 ## [11,] 3 3 -1.05455746 ## [12,] 5 3 -0.81026875 ## [13,] 6 3 0.60513797 ## [14,] 2 4 0.11949527 ## [15,] 4 4 1.54375663 ## [16,] 6 4 0.03406604 ## [17,] 1 5 0.78681026 ## [18,] 4 5 0.42651606 ## [19,] 5 5 -0.13256942 ## [20,] 6 5 -2.24087863  For the Netflix data dimensions, there are 99 million non-missing entries, much less than 8.5 trillion entries in the full matrix. We would input the data via the (i,j,value) representation, and then construct xs Incomplete(i=i,j=j,x=value)  ## 6 x 5 sparse Matrix of class "Incomplete" ## ## [1,] 0.8654889 0.01565179 . . 0.7868103 ## [2,] -0.6004172 -0.39411039 . 0.11949527 . ## [3,] . . -1.0545575 . . ## [4,] 0.6965558 -0.50331812 . 1.54375663 0.4265161 ## [5,] 1.2311610 -0.34232368 -0.8102688 . -0.1325694 ## [6,] 0.2664415 0.14486388 0.6051380 0.03406604 -2.2408786  Lets pretend this object is huge, for the purposes of our demonstration. We can double center it just as before, and run softImpute() xsc=biScale(xs,col.scale=FALSE,row.scale=FALSE) fitss=softImpute(xsc,rank.max=3,lambda=1,trace=TRUE,type="svd")  ## Ssvd.als: 1 : obj 0.12729 ratio 1.22473 ## Ssvd.als: 2 : obj 0.09205 ratio 0.2689606 ## Ssvd.als: 3 : obj 0.08931 ratio 0.01720473 ## Ssvd.als: 4 : obj 0.08886 ratio 0.001233977 ## Ssvd.als: 5 : obj 0.08872 ratio 0.0002149125 ## Ssvd.als: 6 : obj 0.08866 ratio 7.444629e-05 ## Ssvd.als: 7 : obj 0.08863 ratio 3.225671e-05 ## Ssvd.als: 8 : obj 0.08862 ratio 1.521757e-05 ## Ssvd.als: 9 : obj 0.08861 ratio 7.525504e-06 ## Ssvd.als: 1 : obj 0.08743 ratio 0.006149798 ## Ssvd.als: 2 : obj 0.08724 ratio 0.0003029866 ## Ssvd.als: 3 : obj 0.08723 ratio 2.870728e-05 ## Ssvd.als: 4 : obj 0.08723 ratio 1.306851e-05 ## Ssvd.als: 5 : obj 0.08722 ratio 7.363171e-06 ## 1 : obj 0.13827 ratio 0.008883289 ## Ssvd.als: 1 : obj 0.08708 ratio 0.0007860491 ## Ssvd.als: 2 : obj 0.08705 ratio 3.502276e-05 ## Ssvd.als: 3 : obj 0.08705 ratio 2.347268e-06 ## 2 : obj 0.13752 ratio 0.001064519 ## Ssvd.als: 1 : obj 0.08703 ratio 0.0001129927 ## Ssvd.als: 2 : obj 0.08703 ratio 5.227221e-06 ## 3 : obj 0.13743 ratio 0.0001448351 ## Ssvd.als: 1 : obj 0.08703 ratio 1.829268e-05 ## Ssvd.als: 2 : obj 0.08703 ratio 9.062267e-07 ## 4 : obj 0.13741 ratio 2.275408e-05 ## Ssvd.als: 1 : obj 0.08703 ratio 3.248486e-06 ## 5 : obj 0.13741 ratio 3.819071e-06  fitss$d

## [1] 1.32862991 0.09315031 0.00000000

fits3$d  ## [1] 1.32863296 0.09314895 0.00000000  Notice here that we get additional trace info with trace=TRUE. Since xs is “huge”, the SVD is computed using alternating subspace methods (with warm starts), and so we are seeing that as inner loops as well. With huge matrices we would not use the complete() function, but rather request individual predictions. For example entries (2,2), and (3,2) could be imputed via impute(fitss,i=c(2,3),j=c(2,2))  ## [1] -0.7580316 -1.3495769  Again, the unscale=TRUE default for impute() means that the centering (stored on the object fitss) was reversed. ## Reduced rank SVD of large sparse matrices This is almost an aside, but the tools we developed for large matrix completion problems are also useful for working with large (but sparse) matrices. Suppose we had such a beast, and we wanted to compute its principal components. We would need to column-center the matrix, which would render it no longer sparse. We would also want to compute a few of the largest singular vectors for this beast. Lets see how we do that. First we will read in our large matrix, again in market-matrix format. For simplicity we use our same matrix x, except now the missing values are zeros. x0=sparseMatrix(i=i,j=j,x=value) x0  ## 6 x 5 sparse Matrix of class "dgCMatrix" ## ## [1,] 0.8654889 0.01565179 . . 0.7868103 ## [2,] -0.6004172 -0.39411039 . 0.11949527 . ## [3,] . . -1.0545575 . . ## [4,] 0.6965558 -0.50331812 . 1.54375663 0.4265161 ## [5,] 1.2311610 -0.34232368 -0.8102688 . -0.1325694 ## [6,] 0.2664415 0.14486388 0.6051380 0.03406604 -2.2408786  x0c=biScale(x0,col.scale=FALSE,row.scale=FALSE,row.center=FALSE) x0c  ## An object of class "SparseplusLowRank" ## Slot "x": ## 6 x 5 sparse Matrix of class "dgCMatrix" ## ## [1,] 0.8654889 0.01565179 . . 0.7868103 ## [2,] -0.6004172 -0.39411039 . 0.11949527 . ## [3,] . . -1.0545575 . . ## [4,] 0.6965558 -0.50331812 . 1.54375663 0.4265161 ## [5,] 1.2311610 -0.34232368 -0.8102688 . -0.1325694 ## [6,] 0.2664415 0.14486388 0.6051380 0.03406604 -2.2408786 ## ## Slot "a": ## [,1] [,2] ## [1,] 0 1 ## [2,] 0 1 ## [3,] 0 1 ## [4,] 0 1 ## [5,] 0 1 ## [6,] 0 1 ## ## Slot "b": ## [,1] [,2] ## [1,] -1 -0.4098717 ## [2,] -1 0.1798728 ## [3,] -1 0.2099480 ## [4,] -1 -0.2828863 ## [5,] -1 0.1933536  So the column centered matrix is still stored in sparse format, but now it has class "SparseplusLowRank", with slots a and b, and slot x the original matrix (x0). In fact, the centered matrix is $$x+ab^T$$. Now we compute the SVD of this matrix, using our function svd.als() ( a workhorse for softImpute()) svdx0c=svd.als(x0c,rank=3,trace=TRUE)  ## Ssvd.als: 1 : obj 0.13251 ratio 4.040324 ## Ssvd.als: 2 : obj 0.02205 ratio 0.1877227 ## Ssvd.als: 3 : obj 0.01766 ratio 0.004319109 ## Ssvd.als: 4 : obj 0.01748 ratio 0.0002819854 ## Ssvd.als: 5 : obj 0.01746 ratio 3.586265e-05 ## Ssvd.als: 6 : obj 0.01746 ratio 5.368189e-06  svdx0c$d

## [1] 2.552159 1.528909 1.425626


We can compare this to the SVD of the full matrix version of this

x02=as.matrix(x0)
svd(scale(x02,TRUE,FALSE))$d  ## [1] 2.5521593 1.5289123 1.4256270 0.9474538 0.3871998  One can actually use regularization here as well. For large problems, this can speed up convergence, and biases the convergence criterion toward the larger singular values. Note that if $$Z=S_\lambda(X)$$ has rank $$r$$, then the rank-$$r$$ SVD of $$X$$ will have the same left and right singular vectors as $$Z$$, and singular values $$\lambda$$ units higher. ## Warm starts and regularization paths Typically we don't have a clue about what values of $$\lambda$$ are reasonable. One useful function is lambda0(), which identifies the smallest value of $$\lambda$$ for which the rank of the solution is zero. lam0=lambda0(xs) lam0  ## [1] 2.554681  fit0=softImpute(xs,lambda=lam0+.2) fit0$d

## [1] 0


(If we had used lam0 itself, we would have to increase the number of iterations and decrease the threshold for softImpute to achieve an exact zero with ALS)

This value is actually the largest singular value for the version of xs with the missing values replaced by zeros. Lets check:

xs0=as(xs,"sparseMatrix")
fit0=svd.als(xs0)
fit0$d  ## [1] 2.554681 1.986540  Now, armed with lam0 we could make a sequence of lambda values, decreasing toward zero. lamseq=exp(seq(from=log(lam0),to=log(1),length=10)) lamseq  ## [1] 2.554681 2.301850 2.074041 1.868778 1.683830 1.517185 1.367033 1.231741 ## [9] 1.109838 1.000000  Now the idea is to fit a sequence of models, using these values of lambda, and warms starts. For large matrices, we also want to be clever with the rank, because we could not fit models with very large rank. Here is an example of what we could do. fits=as.list(lamseq) ranks=as.integer(lamseq) rank.max=2 warm=NULL for( i in seq(along=lamseq)){ fiti=softImpute(xs,lambda=lamseq[i],rank=rank.max,warm=warm) ranks[i]=sum(round(fiti$d,4)>0)
rank.max=min(ranks[i]+2,4)
warm=fiti
fits[[i]]=fiti
cat(i,"lambda=",lamseq[i],"rank.max",rank.max,"rank",ranks[i],"\n")
}

## Warning in Ssimpute.als(x, J, thresh, lambda, maxit, trace.it, warm.start, :
## Convergence not achieved by 100 iterations

## 1 lambda= 2.554681 rank.max 3 rank 1
## 2 lambda= 2.30185 rank.max 3 rank 1
## 3 lambda= 2.074041 rank.max 3 rank 1
## 4 lambda= 1.868778 rank.max 4 rank 2
## 5 lambda= 1.68383 rank.max 4 rank 2
## 6 lambda= 1.517185 rank.max 4 rank 2
## 7 lambda= 1.367033 rank.max 4 rank 2
## 8 lambda= 1.231741 rank.max 4 rank 2
## 9 lambda= 1.109838 rank.max 4 rank 3
## 10 lambda= 1 rank.max 4 rank 3


Notes:

• The warning message is for the first fit; it needs more iterations to get convergence to the NULL solution, and in fact we see that it didn't quite get there, since the rank is 1 (fits[[1]]\$d)
• We added 2 to the rank achieved for the previous lambda. That is fine for this small problem, but for large problems (Netflix size), you don't want to be too cavalier with large ranks. Ideally, the rank you use should be sufficiently large to achieve a solution with some soft-thresholded singular values set to zero. This would guarantee that you have solved the regularized problem.