--- title: "Using Parquet Storage for Large Datasets" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Using Parquet Storage for Large Datasets} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ## Introduction When working with large brain connectome datasets, storing all matrices in memory as R lists can be inefficient. The `riemtan` package now supports Apache Arrow Parquet format for efficient storage and lazy loading of SPD matrices. This vignette demonstrates how to: - Export connectomes to Parquet format - Load connectomes using the Parquet backend - Work with CSample and CSuperSample using Parquet storage ## Benefits of Parquet Storage - **Memory Efficiency**: Matrices are loaded on-demand rather than all at once - **Fast Access**: Columnar storage format optimized for numerical data - **Metadata Support**: Store subject IDs, provenance, and other metadata - **Interoperability**: Standard format compatible with Python, Julia, and other tools ## Installation Make sure you have the `arrow` package installed: ```{r eval=FALSE} install.packages("arrow") ``` ## Basic Workflow ### 1. Writing Connectomes to Parquet First, let's create some example SPD matrices and save them to Parquet format: ```{r eval=FALSE} library(riemtan) library(Matrix) # Create example connectomes (4x4 SPD matrices) set.seed(42) connectomes <- lapply(1:50, function(i) { mat <- diag(4) + matrix(rnorm(16, 0, 0.1), 4, 4) mat <- (mat + t(mat)) / 2 # Make symmetric mat <- mat + diag(4) * 0.5 # Ensure positive definite Matrix::pack(Matrix::Matrix(mat, sparse = FALSE)) }) # Write to Parquet format write_connectomes_to_parquet( connectomes, output_dir = "my_connectomes", subject_ids = paste0("subject_", 1:50), provenance = list( study = "Example Study", acquisition_date = "2024-01-01", preprocessing = "Standard pipeline v1.0" ) ) ``` This creates a directory structure: ``` my_connectomes/ ├── metadata.json ├── matrix_0001.parquet ├── matrix_0002.parquet ├── ... └── matrix_0050.parquet ``` ### 2. Validating Parquet Directory Before using a Parquet directory, you can validate its structure: ```{r eval=FALSE} # Detailed validation with verbose output validate_parquet_directory("my_connectomes", verbose = TRUE) ``` ### 3. Creating a CSample with Parquet Backend Now use the Parquet backend to create a CSample: ```{r eval=FALSE} # Load AIRM metric data(airm) # Create Parquet backend (default cache size: 10 matrices) backend <- create_parquet_backend( "my_connectomes", cache_size = 10 ) # Create CSample with the backend sample <- CSample$new( backend = backend, metric_obj = airm ) # Sample info print(paste("Sample size:", sample$sample_size)) print(paste("Matrix dimension:", sample$matrix_size)) ``` ### 4. Computing with Parquet-backed CSample All standard CSample operations work transparently with the Parquet backend: ```{r eval=FALSE} # Compute tangent images sample$compute_tangents() # Compute vectorized images sample$compute_vecs() # Compute Frechet mean sample$compute_fmean(tol = 0.01, max_iter = 50) # Center the sample sample$center() # Compute variation sample$compute_variation() print(paste("Variation:", sample$variation)) ``` ### 5. 
### 2. Validating Parquet Directory

Before using a Parquet directory, you can validate its structure:

```{r eval=FALSE}
# Detailed validation with verbose output
validate_parquet_directory("my_connectomes", verbose = TRUE)
```

### 3. Creating a CSample with Parquet Backend

Now use the Parquet backend to create a CSample:

```{r eval=FALSE}
# Load AIRM metric
data(airm)

# Create Parquet backend (default cache size: 10 matrices)
backend <- create_parquet_backend(
  "my_connectomes",
  cache_size = 10
)

# Create CSample with the backend
sample <- CSample$new(
  backend = backend,
  metric_obj = airm
)

# Sample info
print(paste("Sample size:", sample$sample_size))
print(paste("Matrix dimension:", sample$matrix_size))
```

### 4. Computing with Parquet-backed CSample

All standard CSample operations work transparently with the Parquet backend:

```{r eval=FALSE}
# Compute tangent images
sample$compute_tangents()

# Compute vectorized images
sample$compute_vecs()

# Compute Frechet mean
sample$compute_fmean(tol = 0.01, max_iter = 50)

# Center the sample
sample$center()

# Compute variation
sample$compute_variation()
print(paste("Variation:", sample$variation))
```

### 5. Lazy Loading Behavior

The Parquet backend loads matrices on demand:

```{r eval=FALSE}
# Access specific connectome (loads from disk if not cached)
conn_1 <- sample$connectomes[[1]]

# Access all connectomes (loads all from disk)
all_conns <- sample$connectomes

# Cache management
backend$get_cache_size()  # Check current cache usage
backend$clear_cache()     # Clear cache to free memory
```

## Working with CSuperSample

CSuperSample works seamlessly with Parquet-backed samples:

```{r eval=FALSE}
# Create two samples with different backends
backend1 <- create_parquet_backend("study1_connectomes")
backend2 <- create_parquet_backend("study2_connectomes")

sample1 <- CSample$new(backend = backend1, metric_obj = airm)
sample2 <- CSample$new(backend = backend2, metric_obj = airm)

# Create super sample
super_sample <- CSuperSample$new(list(sample1, sample2))

# Gather all connectomes
super_sample$gather()

# Compute statistics
super_sample$compute_fmean()
super_sample$compute_variation()
```

## Backwards Compatibility

The Parquet backend is fully compatible with the traditional list-based approach:

```{r eval=FALSE}
# Traditional approach (all in memory)
sample_memory <- CSample$new(
  conns = connectomes,
  metric_obj = airm
)

# Parquet approach (lazy loading)
backend <- create_parquet_backend("my_connectomes")
sample_parquet <- CSample$new(
  backend = backend,
  metric_obj = airm
)

# Both work identically
sample_memory$compute_fmean()
sample_parquet$compute_fmean()
```

## Performance Tips

### Cache Size Tuning

Adjust cache size based on available memory:

```{r eval=FALSE}
# Small cache (memory-constrained environments)
backend_small <- ParquetBackend$new("my_connectomes", cache_size = 5)

# Large cache (memory-rich environments)
backend_large <- ParquetBackend$new("my_connectomes", cache_size = 50)
```

### Batch Operations

For operations that need all matrices, consider loading in batches:

```{r eval=FALSE}
# Process in chunks
n <- sample$sample_size
batch_size <- 10

for (start in seq(1, n, by = batch_size)) {
  end <- min(start + batch_size - 1, n)

  # Load batch
  batch <- lapply(start:end, function(i) backend$get_matrix(i))

  # Process batch...

  # Clear cache to free memory
  backend$clear_cache()
}
```
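As a concrete illustration of the placeholder step above, the loop could accumulate a simple per-matrix summary. The log-determinant below is a hypothetical stand-in for whatever analysis you actually need; only `backend$get_matrix()`, `backend$clear_cache()`, and `sample$sample_size` come from riemtan itself.

```{r eval=FALSE}
# Hypothetical per-matrix summary computed batch by batch
n <- sample$sample_size
batch_size <- 10
logdets <- numeric(n)

for (start in seq(1, n, by = batch_size)) {
  end <- min(start + batch_size - 1, n)
  batch <- lapply(start:end, function(i) backend$get_matrix(i))

  # Log-determinant as a stand-in for a real per-matrix computation
  logdets[start:end] <- vapply(
    batch,
    function(m) as.numeric(determinant(as.matrix(m), logarithm = TRUE)$modulus),
    numeric(1)
  )

  backend$clear_cache()  # Free memory between batches
}
```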
set_parallel_plan("multisession", workers = 4) # Check status is_parallel_enabled() # TRUE get_n_workers() # 4 # Create Parquet-backed sample backend <- create_parquet_backend("large_dataset", cache_size = 20) sample <- CSample$new(backend = backend, metric_obj = airm) ``` #### Parallel I/O and Computation All major operations automatically use parallel processing when beneficial: ```{r eval=FALSE} # Parallel tangent computations with progress bar sample$compute_tangents(progress = TRUE) # 3-8x faster # Parallel vectorization sample$compute_vecs(progress = TRUE) # 2-4x speedup # Parallel Frechet mean computation sample$compute_fmean(progress = TRUE) # 2-5x faster for large samples ``` #### Batch Loading with Parallel I/O For very large Parquet datasets, use batch loading with parallel I/O and memory management: ```{r eval=FALSE} # Load specific subset in batches subset_conns <- sample$load_connectomes_batched( indices = 1:500, # Load first 500 matrices batch_size = 50, # 50 matrices per batch progress = TRUE # Show progress ) # This loads 500 matrices in 10 batches, clearing cache between batches # Each batch is loaded in parallel for 5-10x speedup ``` #### Performance Comparison ```{r eval=FALSE} # Sequential (default if not configured) set_parallel_plan("sequential") system.time(sample$compute_tangents()) # Baseline # Parallel with 4 workers set_parallel_plan("multisession", workers = 4) system.time(sample$compute_tangents()) # 3-4x faster # Parallel with 8 workers set_parallel_plan("multisession", workers = 8) system.time(sample$compute_tangents()) # 6-8x faster ``` #### Progress Reporting Enable progress bars to monitor long-running operations: ```{r eval=FALSE} # Install progressr for progress bars (optional) install.packages("progressr") # Enable parallel processing set_parallel_plan("multisession", workers = 4) # All operations support progress parameter sample$compute_tangents(progress = TRUE) sample$compute_vecs(progress = TRUE) sample$compute_fmean(progress = TRUE) # Batch loading with progress conns <- sample$load_connectomes_batched( indices = 1:1000, batch_size = 100, progress = TRUE # Shows "Batch 1/10: loading matrices 1-100" ) ``` #### Best Practices **1. Choose appropriate worker count**: ```{r eval=FALSE} # Conservative (leave cores for system) set_parallel_plan("multisession", workers = parallel::detectCores() - 1) # Maximum performance (use all cores) set_parallel_plan("multisession", workers = parallel::detectCores()) ``` **2. Memory management with parallelization**: ```{r eval=FALSE} # Each worker loads its own data copy # For 4 workers with 100 matrices of 200x200, expect: # Memory = 4 workers × cache_size × matrix_size ≈ 4 × 20 × 320 KB ≈ 25 MB # Use smaller cache with more workers backend <- create_parquet_backend("dataset", cache_size = 10) set_parallel_plan("multisession", workers = 8) ``` **3. 
**3. Reset when done**:

```{r eval=FALSE}
# Compute with parallelization
set_parallel_plan("multisession", workers = 4)
sample$compute_tangents(progress = TRUE)
sample$compute_fmean(progress = TRUE)

# Reset to free worker resources
reset_parallel_plan()
```

#### Auto-Detection

Parallel processing is **automatically disabled** for small datasets to avoid overhead:

```{r eval=FALSE}
# Small dataset (n < 10): uses sequential processing automatically
small_sample <- CSample$new(conns = small_connectomes[1:5], metric_obj = airm)
small_sample$compute_tangents()  # Sequential (no overhead)

# Large dataset (n >= 10): uses parallel processing automatically
large_sample <- CSample$new(backend = backend, metric_obj = airm)
large_sample$compute_tangents()  # Parallel (if plan is active)
```

#### Parallel Strategies

Different strategies for different environments:

```{r eval=FALSE}
# multisession: works on all platforms (Windows, Mac, Linux)
set_parallel_plan("multisession", workers = 4)

# multicore: Unix only, lower overhead
set_parallel_plan("multicore", workers = 4)  # Auto-fallback to multisession on Windows

# cluster: for remote/distributed computing
set_parallel_plan("cluster", workers = c("node1", "node2"))
```

## Metadata Access

Access metadata stored with your Parquet data:

```{r eval=FALSE}
# Get metadata
metadata <- backend$get_metadata()

# Access subject IDs
subject_ids <- metadata$subject_ids

# Access provenance information
provenance <- metadata$provenance
print(provenance$study)
print(provenance$preprocessing)
```

## Advanced: Custom File Patterns

Customize the Parquet file naming pattern:

```{r eval=FALSE}
write_connectomes_to_parquet(
  connectomes,
  output_dir = "custom_naming",
  file_pattern = "conn_%03d.parquet"
)

# Files will be named: conn_001.parquet, conn_002.parquet, ...
```

## Troubleshooting

### Large Datasets

For very large datasets (1000+ matrices):

```{r eval=FALSE}
# Use minimal cache
backend <- ParquetBackend$new("large_dataset", cache_size = 3)

# Compute statistics without loading all matrices at once
sample <- CSample$new(backend = backend, metric_obj = airm)
sample$compute_tangents()
sample$compute_vecs()

# Operations that don't need all matrices in memory
sample$compute_fmean(batch_size = 32)  # Uses batching
```

### Memory Monitoring

```{r eval=FALSE}
# Check cache usage
cache_size <- backend$get_cache_size()
print(paste("Cached matrices:", cache_size))

# Free memory when needed
backend$clear_cache()
```

## Summary

The Parquet backend provides:

- **Scalability**: Handle datasets too large for memory
- **Efficiency**: Lazy loading minimizes memory footprint
- **Flexibility**: Fully compatible with existing riemtan workflows
- **Metadata**: Rich metadata support for reproducible research

For small to medium datasets that fit in memory, the traditional list approach remains perfectly valid. For large brain connectome studies, the Parquet backend enables analysis at scale.