Workflow

Debug messages

The package can print informative (but sometimes verbose) messages to the console. To turn these messages on or off, use:

JMatrixSetDebug(TRUE)
#> Debugging for jmatrix package set to ON.

# Debugging is initially off (FALSE).

Data storage

As stated before, binary matrix files should normally be created from C++, taking the data from an external source such as a data file in a format used in bioinformatics or a .csv file. These files should be read in chunks. As an example, see the function CsvToJMat in the package scellpam.
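To make the idea of reading in chunks concrete, here is a minimal base-R sketch. It is not how the package's C++ code is implemented; the helper name read_csv_by_chunks and its arguments are hypothetical and exist only for this illustration.

```r
# Hypothetical helper (not part of jmatrix): process a .csv file in blocks
# of `chunk_size` data lines, so the whole table is never held in memory.
read_csv_by_chunks <- function(path, chunk_size, process_chunk) {
  con <- file(path, open = "r")
  on.exit(close(con))
  header <- readLines(con, n = 1)             # first line: column names
  total <- 0
  repeat {
    lines <- readLines(con, n = chunk_size)   # next block of data lines
    if (length(lines) == 0) break
    chunk <- read.csv(text = c(header, lines), row.names = 1)
    process_chunk(as.matrix(chunk))
    total <- total + length(lines)
  }
  total                                       # number of data lines read
}

# Usage: accumulate column sums over chunks of 2 rows.
m <- matrix(1:12, nrow = 6, dimnames = list(LETTERS[1:6], c("a", "b")))
tmp <- tempfile(fileext = ".csv")
write.csv(m, tmp)
sums <- c(a = 0, b = 0)
n_read <- read_csv_by_chunks(tmp, 2, function(x) sums <<- sums + colSums(x))
```

Only one chunk is in memory at any time, which is the property that matters when the source table is larger than the available RAM.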

As a convenience and only for testing purposes (to be used in this vignette), we provide the function JWriteBin to write an R matrix as a jmatrix file.

# Create a 6x8 matrix of random values
Rf <- matrix(runif(48),nrow=6)
# Set row and column names for it
rownames(Rf) <- c("A","B","C","D","E","F")
colnames(Rf) <- c("a","b","c","d","e","f","g","h")
# Let's see the matrix
Rf
#>           a         b         c          d          e         f         g
#> A 0.4086237 0.1735683 0.8904162 0.18355700 0.79628552 0.2181740 0.8552968
#> B 0.8252862 0.4333349 0.5618043 0.62894191 0.12233887 0.4244371 0.3433047
#> C 0.2818962 0.8077596 0.5595075 0.54155193 0.79166619 0.4528630 0.7961829
#> D 0.6025945 0.7676627 0.6791392 0.85251282 0.46285139 0.7730553 0.9917301
#> E 0.7431097 0.1997096 0.5127008 0.06159526 0.02837436 0.7405783 0.1347855
#> F 0.5042047 0.3465926 0.7883318 0.63079981 0.60034070 0.9642359 0.7196796
#>           h
#> A 0.6456435
#> B 0.7039185
#> C 0.6782883
#> D 0.1317270
#> E 0.6084986
#> F 0.4670445

# and write it as the binary file Rfullfloat.bin
JWriteBin(Rf,"Rfullfloat.bin",dtype="float",dmtype="full",
          comment="Full matrix of floats")
#> The passed matrix has row names for the 6 rows and they will be used.
#> The passed matrix has column names for the 8 columns and they will be used.
#> Writing binary matrix Rfullfloat.bin of (6x8)
#> End of block of binary data at offset 320
#>    Writing row names (6 strings written, from A to F).
#>    Writing column names (8 strings written, from a to h).
#>    Writing comment: Full matrix of floats

# Also, you can write it with double data type:
JWriteBin(Rf,"Rfulldouble.bin",dtype="double",dmtype="full",
          comment="Full matrix of doubles")
#> The passed matrix has row names for the 6 rows and they will be used.
#> The passed matrix has column names for the 8 columns and they will be used.
#> Writing binary matrix Rfulldouble.bin of (6x8)
#> End of block of binary data at offset 512
#>    Writing row names (6 strings written, from A to F).
#>    Writing column names (8 strings written, from a to h).
#>    Writing comment: Full matrix of doubles

To get information about a stored file, the function JMatInfo is provided. This function does not read the complete file into memory, only its header.

# Information about the float binary file
JMatInfo("Rfullfloat.bin")
#> File:               Rfullfloat.bin
#> Matrix type:        FullMatrix
#> Number of elements: 48
#> Data type:          float
#> Endianness:         little endian (same as this machine)
#> Number of rows:     6
#> Number of columns:  8
#> Metadata:           Stored names of rows and columns.
#> Metadata comment:  "Full matrix of floats"

# Same information about the double binary file
JMatInfo("Rfulldouble.bin")
#> File:               Rfulldouble.bin
#> Matrix type:        FullMatrix
#> Number of elements: 48
#> Data type:          double
#> Endianness:         little endian (same as this machine)
#> Number of rows:     6
#> Number of columns:  8
#> Metadata:           Stored names of rows and columns.
#> Metadata comment:  "Full matrix of doubles"
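The "End of block of binary data" offsets printed by JWriteBin can be checked with a little arithmetic. The sketch below assumes 4 bytes per float and 8 bytes per double. The header size it derives is only an inference from the offsets reported in this vignette, not a documented property of the format.

```r
# Bytes per element for the two data types used above.
bytes_per_element <- c(float = 4, double = 8)
n_elements <- 6 * 8
data_bytes <- n_elements * bytes_per_element     # 192 and 384 bytes

# End-of-data offsets as printed by JWriteBin in the examples above.
reported_offset <- c(float = 320, double = 512)

# Whatever precedes the data block (an inferred value, see the lead-in).
header_bytes <- reported_offset - data_bytes
header_bytes
```

Both cases leave the same remainder, which is consistent with a fixed-size header placed before the data block.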

A jmatrix binary file can be exported to a .csv/.tsv table. This is done with the function JMatToCsv:

# Create a 6x8 matrix of random values
Rf <- matrix(runif(48),nrow=6)
# Set row and column names for it
rownames(Rf) <- c("A","B","C","D","E","F")
colnames(Rf) <- c("a","b","c","d","e","f","g","h")
# Store it as the binary file Rfullfloat.bin
JWriteBin(Rf,"Rfullfloat.bin",dtype="float",dmtype="full",
          comment="Full matrix of floats")
#> The passed matrix has row names for the 6 rows and they will be used.
#> The passed matrix has column names for the 8 columns and they will be used.
#> Writing binary matrix Rfullfloat.bin of (6x8)
#> End of block of binary data at offset 320
#>    Writing row names (6 strings written, from A to F).
#>    Writing column names (8 strings written, from a to h).
#>    Writing comment: Full matrix of floats

# Save the content of this .bin as a .csv file
JMatToCsv("Rfullfloat.bin","Rfullfloat.csv",csep=",",withquotes=FALSE)
#> Read full matrix with size (6,8)

The generated file will have quotes neither around the column names (in its first line) nor around the row names (at the beginning of each line), since withquotes is FALSE; set it to TRUE for the opposite behavior. A .tsv (tab-separated values) file would be generated by using csep="\t".

A jmatrix binary file can also be generated from a .csv/.tsv file. Such a file must have a first line with the names of the columns (possibly surrounded by double quotes), beginning with an empty field (possibly an empty pair of double quotes), since the column of row names has no name itself. Each remaining line must start with the row name (a string, possibly surrounded by double quotes) followed by the values. In all lines (header and data), each column must be separated from the next by a separator character (usually a comma), and no separator may be added at the end of a line. This format is compatible with the .csv generated by R with the function write.csv.
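The expected layout is easiest to see by inspecting what write.csv produces for a tiny matrix. This base-R sketch only illustrates the text format; the variable names are chosen for the example.

```r
m <- matrix(c(0.5, 1.5, 2.5, 3.5), nrow = 2,
            dimnames = list(c("A", "B"), c("a", "b")))
tmp <- tempfile(fileext = ".csv")
write.csv(m, tmp)

# Show the raw text lines of the file
csv_lines <- readLines(tmp)
print(csv_lines)
#> [1] "\"\",\"a\",\"b\"" "\"A\",0.5,2.5"    "\"B\",1.5,3.5"
```

The header starts with an empty quoted field for the unnamed column of row names, and no line ends with a separator.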

The function to read .csv files is CsvToJMat:

# Create a 6x8 matrix of random values
Rf <- matrix(runif(48),nrow=6)
# Set row and column names for it
rownames(Rf) <- c("A","B","C","D","E","F")
colnames(Rf) <- c("a","b","c","d","e","f","g","h")
# Save it as a .csv file with the standard R function...
write.csv(Rf,"rf.csv")
# ...and read it to create a jmatrix binary file
CsvToJMat("rf.csv","rf.bin",mtype="full",csep=",",ctype="raw",
          valuetype="float",transpose=FALSE,
          comment="Test matrix generated reading a .csv file")
#> 8 columns of values (not including the column of names) in file rf.csv.
#> 6 lines (excluding header) in file rf.csv
#> Data will be read from each line and stored as float values.
#> Reading line... 0 
#> Read 6 data lines of file rf.csv, as expected.
#> Writing binary matrix rf.bin of (6x8)
#> End of block of binary data at offset 320
#>    Writing row names (6 strings written, from A to F).
#>    Writing column names (8 strings written, from a to h).
#>    Writing comment: Test matrix generated reading a .csv file

# Let's see the characteristics of the binary file
JMatInfo("rf.bin")
#> File:               rf.bin
#> Matrix type:        FullMatrix
#> Number of elements: 48
#> Data type:          float
#> Endianness:         little endian (same as this machine)
#> Number of rows:     6
#> Number of columns:  8
#> Metadata:           Stored names of rows and columns.
#> Metadata comment:  "Test matrix generated reading a .csv file"

Special note for symmetric matrices:

The parameter mtype="symmetric" makes CsvToJMat treat the content of the .csv file as a symmetric matrix. The matrix must therefore be square (same number of rows and columns). The upper triangle must still be present in the file (its values do not matter), but it is read and immediately ignored: only the lower triangle (including the main diagonal) is stored.

Data load

As stated before, no function is provided to read the whole matrix into memory, since that would contradict the philosophy of this package, but you can get rows or columns from a file.

# Reads row 1 into vector vf. Float values inside the file are
# promoted to double.
(vf<-GetJRow("Rfullfloat.bin",1))
#>         a         b         c         d         e         f         g         h 
#> 0.5210876 0.2671642 0.1054070 0.3952717 0.4201307 0.5736179 0.2767342 0.1742007

Storing values as float obviously loses precision. We have observed that this is not relevant for the PAM (partitioning around medoids) algorithm, but it can matter in other cases. It is the price to pay for halving the required space.

# Checks the precision lost
max(abs(Rf[1,]-vf))
#> [1] 0.4428462

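Storing as double, by contrast, keeps the data intact. Note that Rf was regenerated with runif() in the .csv examples after Rfullfloat.bin and Rfulldouble.bin were written, so the differences printed in this section compare the files against a different random matrix; they are not rounding errors.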

vd<-GetJRow("Rfulldouble.bin",1)
max(abs(Rf[1,]-vd))
#> [1] 0.3855508
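The size of the rounding error that 32-bit storage introduces can be estimated in base R, independently of jmatrix. The sketch below round-trips a vector through 4-byte and 8-byte binary representations with writeBin and readBin. It assumes this mirrors what happens when values are stored as float or double; it does not use the package itself.

```r
set.seed(1)
x <- runif(8)

# Round-trip through a 4-byte (float) and an 8-byte (double) representation.
as_float  <- readBin(writeBin(x, raw(), size = 4), "numeric",
                     n = length(x), size = 4)
as_double <- readBin(writeBin(x, raw(), size = 8), "numeric",
                     n = length(x), size = 8)

err_float  <- max(abs(x - as_float))    # tiny, but generally not zero
err_double <- max(abs(x - as_double))   # exactly zero
c(float = err_float, double = err_double)
```

For values in [0, 1) the float error stays below 1e-7, which gives an idea of the magnitude to expect when the comparison is made against the same matrix that was written.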

Now, let us see examples of some functions to read rows or columns by number or by name, or to read several rows/columns as an R matrix. In all examples, row and column numbers follow the R convention (i.e., they start at 1).

# Read column number 3
(vf<-GetJCol("Rfullfloat.bin",3))
#>          A          B          C          D          E          F 
#> 0.10540701 0.62464148 0.86758912 0.02637640 0.52538973 0.04534603

# Test precision
max(abs(Rf[,3]-vf))
#> [1] 0.8853732

# Read row with name C
(vf<-GetJRowByName("Rfullfloat.bin","C"))
#>          a          b          c          d          e          f          g 
#> 0.29295090 0.58392364 0.86758912 0.37287527 0.65189719 0.20174821 0.03082207 
#>          h 
#> 0.42795238

# Read column with name c
(vf<-GetJColByName("Rfullfloat.bin","c"))
#>          A          B          C          D          E          F 
#> 0.10540701 0.62464148 0.86758912 0.02637640 0.52538973 0.04534603

# Get the names of all rows or columns as vectors of R strings
(rn<-GetJRowNames("Rfullfloat.bin"))
#> [1] "A" "B" "C" "D" "E" "F"

(cn<-GetJColNames("Rfullfloat.bin"))
#> [1] "a" "b" "c" "d" "e" "f" "g" "h"

# Get the names of rows and columns simultaneously as a list of two elements
(l<-GetJNames("Rfullfloat.bin"))
#> $rownames
#> [1] "A" "B" "C" "D" "E" "F"
#> 
#> $colnames
#> [1] "a" "b" "c" "d" "e" "f" "g" "h"

# Get several rows at once. The returned matrix has the rows in the
# same order as the passed list,
# and this list can contain even repeated values
(vm<-GetJManyRows("Rfullfloat.bin",c(1,4)))
#>           a         b         c         d         e         f         g
#> A 0.5210876 0.2671642 0.1054070 0.3952717 0.4201307 0.5736179 0.2767342
#> D 0.1648742 0.9979452 0.0263764 0.7945443 0.6296650 0.7041845 0.7152251
#>            h
#> A 0.17420073
#> D 0.09749809


# Of course, columns can be extracted in the same way
(vc<-GetJManyCols("Rfulldouble.bin",c(1,4)))
#>           a          d
#> A 0.4086237 0.18355700
#> B 0.8252862 0.62894191
#> C 0.2818962 0.54155193
#> D 0.6025945 0.85251282
#> E 0.7431097 0.06159526
#> F 0.5042047 0.63079981

# and similar functions are provided for extracting by names:
(vm<-GetJManyRowsByNames("Rfulldouble.bin",c("A","D")))
#>           a         b         c         d         e         f         g
#> A 0.4086237 0.1735683 0.8904162 0.1835570 0.7962855 0.2181740 0.8552968
#> D 0.6025945 0.7676627 0.6791392 0.8525128 0.4628514 0.7730553 0.9917301
#>           h
#> A 0.6456435
#> D 0.1317270

(vc<-GetJManyColsByNames("Rfulldouble.bin",c("a","d")))
#>           a          d
#> A 0.4086237 0.18355700
#> B 0.8252862 0.62894191
#> C 0.2818962 0.54155193
#> D 0.6025945 0.85251282
#> E 0.7431097 0.06159526
#> F 0.5042047 0.63079981
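The selection semantics described above (result in the order requested, repeats allowed, names translated to positions) have a direct in-memory analogue in base R. The sketch below works on an ordinary matrix rather than on a jmatrix file and is meant only as an illustration of those semantics.

```r
# In-memory stand-in for a stored matrix with row and column names.
m <- matrix(1:48, nrow = 6,
            dimnames = list(LETTERS[1:6], letters[1:8]))

# By number: rows come back in the requested order, repeats included.
by_number <- m[c(4, 1, 4), ]
rownames(by_number)

# By name: translate names to positions first, then proceed as above.
positions <- match(c("D", "A"), rownames(m))
by_name <- m[positions, ]
```

A name that is not present yields NA in match(), so checking the positions before indexing is a cheap way to detect misspelled names.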

The package can manage and store sparse and symmetric matrices, too.

# Generation of a 6x8 sparse matrix
Rsp <- matrix(rep(0,48),nrow=6)
sparsity <- 0.1
nnz <- round(48*sparsity)
where <- floor(47*runif(nnz))
val <- runif(nnz)
for (i in 1:nnz)
{
 Rsp[floor(where[i]/8)+1,(where[i]%%8)+1] <- val[i]
}
rownames(Rsp) <- c("A","B","C","D","E","F")
colnames(Rsp) <- c("a","b","c","d","e","f","g","h")
# Let's see the matrix
Rsp
#>   a b         c         d e f g          h
#> A 0 0 0.0000000 0.0000000 0 0 0 0.00000000
#> B 0 0 0.0000000 0.0000000 0 0 0 0.04418851
#> C 0 0 0.0000000 0.0000000 0 0 0 0.00000000
#> D 0 0 0.2236608 0.0000000 0 0 0 0.00000000
#> E 0 0 0.0000000 0.6672965 0 0 0 0.00000000
#> F 0 0 0.0000000 0.2683676 0 0 0 0.00000000

# Write the matrix as sparse with type float
JWriteBin(Rsp,"Rspafloat.bin",dtype="float",dmtype="sparse",
          comment="Sparse matrix of floats")
#> The passed matrix has row names for the 6 rows and they will be used.
#> The passed matrix has column names for the 8 columns and they will be used.
#> Writing binary matrix Rspafloat.bin of (6x8)
#> End of block of binary data at offset 184
#>    Writing row names (6 strings written, from A to F).
#>    Writing column names (8 strings written, from a to h).
#>    Writing comment: Sparse matrix of floats

Notice that the matrix info reports both that the matrix is sparse and the storage space it uses.

JMatInfo("Rspafloat.bin")
#> File:               Rspafloat.bin
#> Matrix type:        SparseMatrix
#> Number of elements: 48
#> Data type:          float
#> Endianness:         little endian (same as this machine)
#> Number of rows:     6
#> Number of columns:  8
#> Metadata:           Stored names of rows and columns.
#> Metadata comment:  "Sparse matrix of floats"
#> Binary data size:   56 bytes, which is 29.1667% of the full matrix size (which would be 192 bytes).

Be careful: storing as sparse a matrix that is not actually sparse (i.e., one in which zeros are not the majority of entries) works, but produces a file larger than the corresponding full matrix.
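A rough size model shows where the break-even point lies. The layout assumed below (per row a 4-byte counter, then a 4-byte column index and a 4-byte value for each nonzero) is not documented here; it is an assumption chosen because it reproduces the 56 bytes reported above for a 6x8 matrix with 4 nonzero entries. The function names are hypothetical.

```r
# Assumed layout (inferred, not documented): per row a 4-byte counter, then
# for each nonzero entry a 4-byte column index and a 4-byte float value.
sparse_bytes <- function(nrow, nnz, value_bytes = 4, index_bytes = 4) {
  nrow * index_bytes + nnz * (index_bytes + value_bytes)
}
full_bytes <- function(nrow, ncol, value_bytes = 4) nrow * ncol * value_bytes

sparse_bytes(6, 4)   # the matrix shown above has 4 nonzero entries
full_bytes(6, 8)

# Largest number of nonzeros for which the sparse form is still smaller
max_nnz <- max(which(sparse_bytes(6, 1:48) < full_bytes(6, 8)))
max_nnz
```

Under this assumed layout, the sparse form of a 6x8 float matrix stops paying off once more than 20 of its 48 entries are nonzero.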

With respect to symmetric matrices, JWriteBin works the same way. Let us generate a 7x7 symmetric matrix.

Rns <- matrix(runif(49),nrow=7)
Rsym <- 0.5*(Rns+t(Rns))
rownames(Rsym) <- c("A","B","C","D","E","F","G")
colnames(Rsym) <- c("a","b","c","d","e","f","g")
# Let's see the matrix
Rsym
#>           a          b         c         d         e         f          g
#> A 0.6118869 0.64849899 0.4305555 0.2787304 0.6546710 0.1221248 0.10522165
#> B 0.6484990 0.03567519 0.4347802 0.4173582 0.5173779 0.7196819 0.78907129
#> C 0.4305555 0.43478024 0.5014423 0.5601479 0.1329801 0.5881830 0.56032566
#> D 0.2787304 0.41735824 0.5601479 0.7526650 0.6372126 0.5563559 0.32704139
#> E 0.6546710 0.51737794 0.1329801 0.6372126 0.1874310 0.6442623 0.57706707
#> F 0.1221248 0.71968185 0.5881830 0.5563559 0.6442623 0.1972290 0.59367026
#> G 0.1052217 0.78907129 0.5603257 0.3270414 0.5770671 0.5936703 0.02290495

# Write the matrix as symmetric with type float
JWriteBin(Rsym,"Rsymfloat.bin",dtype="float",dmtype="symmetric",
          comment="Symmetric matrix of floats")
#> The passed matrix has row names for the 7 rows and they will be used.
#> Writing binary matrix Rsymfloat.bin
#> End of block of binary data at offset 240
#>    Writing row names (7 strings written, from A to G).
#>    Writing comment: Symmetric matrix of floats

# Get the information 
JMatInfo("Rsymfloat.bin")
#> File:               Rsymfloat.bin
#> Matrix type:        SymmetricMatrix
#> Number of elements: 49 (28 really stored)
#> Data type:          float
#> Endianness:         little endian (same as this machine)
#> Number of rows:     7
#> Number of columns:  7
#> Metadata:           Stored only names of rows.
#> Metadata comment:  "Symmetric matrix of floats"

Notice that if you store an R matrix which is NOT symmetric as a symmetric jmatrix, only the lower triangular part (including the main diagonal) will be saved. The upper triangular part will be lost.
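The effect of keeping only the lower triangle can be reproduced in base R. The sketch below does not use jmatrix; it mimics the described behavior on an in-memory matrix to show how many values are kept and what a matrix rebuilt from them looks like.

```r
set.seed(2)
M <- matrix(runif(49), nrow = 7)            # NOT symmetric

# Keep only the lower triangle (including the diagonal).
stored <- M[lower.tri(M, diag = TRUE)]
length(stored)                               # 7 * 8 / 2 = 28 values

# Rebuild a full matrix from the stored values by mirroring them.
R <- matrix(0, 7, 7)
R[lower.tri(R, diag = TRUE)] <- stored
R[upper.tri(R)] <- t(R)[upper.tri(R)]

isSymmetric(R)                               # TRUE: the upper part of M is gone
```

The count of 28 matches the "28 really stored" line printed by JMatInfo for the 7x7 symmetric matrix.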

The functions to read rows/columns described before work the same way regardless of the matrix type (full, sparse or symmetric), so you can try them on the Rspafloat.bin and Rsymfloat.bin files to check that they work.

Finally, if the jmatrix stored in a binary file has names associated with its rows or columns, you can filter it using them and generate another jmatrix file with only the rows or columns you wish to keep. The function to do so is FilterJMatByName.

Rns <- matrix(runif(49),nrow=7)
rownames(Rns) <- c("A","B","C","D","E","F","G")
colnames(Rns) <- c("a","b","c","d","e","f","g")
# Let's see the matrix
Rns
#>              a          b         c         d         e         f          g
#> A 0.4953492514 0.62467222 0.6740876 0.2625262 0.3254563 0.5369447 0.17946897
#> B 0.0008792523 0.35239357 0.9329283 0.7631227 0.5239688 0.3685424 0.55470043
#> C 0.7806307387 0.22010257 0.6760207 0.1036505 0.1555564 0.3443474 0.07926985
#> D 0.1286137626 0.17195774 0.1188026 0.8371029 0.8243827 0.7235768 0.19740728
#> E 0.0090754919 0.10733668 0.3446555 0.2758209 0.5698018 0.8007540 0.91526358
#> F 0.5653575899 0.99604244 0.6753018 0.7799929 0.4983178 0.8067427 0.78722573
#> G 0.1050717619 0.03787626 0.6105914 0.8752175 0.4422261 0.9264210 0.23399173

# Write the matrix as full with type float
JWriteBin(Rns,"Rfullfloat.bin",dtype="float",dmtype="full",
          comment="Full matrix of floats")
#> The passed matrix has row names for the 7 rows and they will be used.
#> The passed matrix has column names for the 7 columns and they will be used.
#> Writing binary matrix Rfullfloat.bin of (7x7)
#> End of block of binary data at offset 324
#>    Writing row names (7 strings written, from A to G).
#>    Writing column names (7 strings written, from a to g).
#>    Writing comment: Full matrix of floats

# Extract the first two and the last two columns
FilterJMatByName("Rfullfloat.bin",c("a","b","f","g"),"Rfullfloat_fourcolumns.bin",namesat="cols")
#> Read full matrix with size (7,7)
#> Writing binary matrix Rfullfloat_fourcolumns.bin of (7x4)
#> End of block of binary data at offset 240
#>    Writing row names (7 strings written, from A to G).
#>    Writing column names (4 strings written, from a to g).
#>    Writing comment: Full matrix of floats

# Let's load the matrix and let's see it
vm<-GetJManyRows("Rfullfloat_fourcolumns.bin",c(1,7))
vm
#>           a          b         f         g
#> A 0.4953493 0.62467223 0.5369447 0.1794690
#> G 0.1050718 0.03787626 0.9264210 0.2339917
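For comparison, the in-memory analogue of this filtering in base R is plain subsetting by column name. The sketch keeps the columns in their original order. Whether FilterJMatByName follows the file order or the order of the requested names is not stated here, so that detail is an assumption of the example.

```r
Rm <- matrix(1:49, nrow = 7,
             dimnames = list(LETTERS[1:7], letters[1:7]))
keep <- c("a", "b", "f", "g")

# Keep only the named columns, preserving their original order.
filtered <- Rm[, colnames(Rm) %in% keep, drop = FALSE]
dim(filtered)
colnames(filtered)
```

The difference in practice is that the base-R version needs the whole matrix in memory, whereas the file-based filter writes a new jmatrix file.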

