--- title: "Getting Started with compositional.mle" author: "Alexander Towell" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true toc_depth: 3 vignette: > %\VignetteIndexEntry{Getting Started with compositional.mle} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 6, fig.height = 4 ) ``` ## Introduction The `compositional.mle` package provides **composable** optimization strategies for maximum likelihood estimation (MLE). Following SICP principles, it offers: 1. **Primitive solvers** - `gradient_ascent()`, `newton_raphson()`, `bfgs()`, `nelder_mead()`, etc. 2. **Composition operators** - `%>>%` (sequential), `%|%` (race), `with_restarts()` 3. **Closure property** - Combining solvers yields a solver ## Installation ```{r install, eval=FALSE} devtools::install_github("queelius/compositional.mle") ``` ```{r load} library(compositional.mle) ``` ## Quick Start: Normal Distribution MLE ```{r quickstart} # Generate sample data set.seed(123) data <- rnorm(100, mean = 5, sd = 2) # Define the problem (separate from solver strategy) problem <- mle_problem( loglike = function(theta) { if (theta[2] <= 0) return(-Inf) sum(dnorm(data, theta[1], theta[2], log = TRUE)) }, score = function(theta) { mu <- theta[1]; sigma <- theta[2]; n <- length(data) c(sum(data - mu) / sigma^2, -n / sigma + sum((data - mu)^2) / sigma^3) }, constraint = mle_constraint( support = function(theta) theta[2] > 0, project = function(theta) c(theta[1], max(theta[2], 1e-6)) ), theta_names = c("mu", "sigma") ) # Solve with gradient ascent result <- gradient_ascent()(problem, theta0 = c(0, 1)) cat("Estimated mean:", result$theta.hat[1], "(true: 5)\n") cat("Estimated sd:", result$theta.hat[2], "(true: 2)\n") ``` ## The Problem-Solver Separation The key design principle is separating **what** you're estimating from **how** you estimate it: ```{r problem-solver} # The problem encapsulates the statistical model print(problem) # Solvers are independent strategies solver1 <- gradient_ascent(max_iter = 100) solver2 <- newton_raphson(max_iter = 50) solver3 <- bfgs() # Same problem, different solvers result1 <- solver1(problem, c(0, 1)) result2 <- solver2(problem, c(0, 1)) result3 <- solver3(problem, c(0, 1)) cat("Gradient ascent:", result1$theta.hat, "\n") cat("Newton-Raphson:", result2$theta.hat, "\n") cat("BFGS:", result3$theta.hat, "\n") ``` ## Composing Solvers ### Sequential Chaining (`%>>%`) Chain solvers for coarse-to-fine optimization: ```{r sequential} # Grid search finds a good region, then gradient ascent refines strategy <- grid_search(lower = c(-10, 0.5), upper = c(10, 5), n = 5) %>>% gradient_ascent(max_iter = 50) result <- strategy(problem, theta0 = c(0, 1)) cat("Result:", result$theta.hat, "\n") ``` ### Three-Stage Refinement ```{r three-stage} # Coarse grid -> gradient ascent -> Newton-Raphson polish strategy <- grid_search(lower = c(-10, 0.5), upper = c(10, 5), n = 5) %>>% gradient_ascent(max_iter = 30) %>>% newton_raphson(max_iter = 10) result <- strategy(problem, theta0 = c(0, 1)) cat("Result:", result$theta.hat, "\n") cat("Converged:", result$converged, "\n") ``` ### Parallel Racing (`%|%`) Race multiple methods and keep the best result: ```{r race} # Try gradient-based and derivative-free methods strategy <- gradient_ascent() %|% bfgs() %|% nelder_mead() result <- strategy(problem, theta0 = c(0, 1)) cat("Winner:", result$solver, "\n") cat("Result:", result$theta.hat, "\n") ``` ### Random 

### Random Restarts

Escape local optima with multiple starting points:

```{r restarts}
strategy <- with_restarts(
  gradient_ascent(),
  n = 10,
  sampler = uniform_sampler(c(-10, 0.5), c(10, 5))
)

result <- strategy(problem, theta0 = c(0, 1))
cat("Best restart:", result$best_restart, "of", result$n_restarts, "\n")
cat("Result:", result$theta.hat, "\n")
```

### Conditional Refinement

Only refine if the first solver didn't converge:

```{r conditional}
strategy <- unless_converged(
  gradient_ascent(max_iter = 10),  # Quick attempt
  newton_raphson(max_iter = 50)    # Refine if needed
)

result <- strategy(problem, theta0 = c(0, 1))
cat("Result:", result$theta.hat, "\n")
```

## Available Solvers

| Factory | Method | Requires |
|---------|--------|----------|
| `gradient_ascent()` | Steepest ascent with line search | score (or numerical) |
| `newton_raphson()` | Second-order Newton | score, fisher (or numerical) |
| `bfgs()` | Quasi-Newton BFGS | score (or numerical) |
| `lbfgsb()` | L-BFGS-B with box constraints | score (or numerical) |
| `nelder_mead()` | Simplex (derivative-free) | - |
| `grid_search()` | Exhaustive grid | - |
| `random_search()` | Random sampling | - |

## Constraints

Define domain constraints with support checking and projection:

```{r constraints}
# Positive parameters
pos_constraint <- mle_constraint(
  support = function(theta) all(theta > 0),
  project = function(theta) pmax(theta, 1e-8)
)

# Box constraints [0, 10]
box_constraint <- mle_constraint(
  support = function(theta) all(theta >= 0 & theta <= 10),
  project = function(theta) pmax(0, pmin(10, theta))
)

# Use in problem definition
problem_constrained <- mle_problem(
  loglike = function(theta) -sum((theta - 5)^2),
  constraint = pos_constraint
)
```

## Function Transformers

### Stochastic Gradient (Mini-batching)

For large datasets, subsample observations:

```{r subsampling, eval=FALSE}
# Original log-likelihood uses all data
loglike_full <- function(theta, obs = large_data) {
  sum(dnorm(obs, theta[1], theta[2], log = TRUE))
}

# Stochastic version uses random subsets
loglike_sgd <- with_subsampling(loglike_full, data = large_data,
                                subsample_size = 100)
```

### Regularization

Add penalty terms for regularization:

```{r penalties}
loglike <- function(theta) -sum(theta^2)

# L1 (Lasso), L2 (Ridge), Elastic Net
loglike_l1 <- with_penalty(loglike, penalty_l1(), lambda = 0.1)
loglike_l2 <- with_penalty(loglike, penalty_l2(), lambda = 0.1)
loglike_enet <- with_penalty(loglike, penalty_elastic_net(alpha = 0.5), lambda = 0.1)

theta <- c(1, 2, 3)
cat("Original:", loglike(theta), "\n")
cat("With L1:", loglike_l1(theta), "\n")
cat("With L2:", loglike_l2(theta), "\n")
```

## Tracing Optimization

Track the optimization path for diagnostics:

```{r trace}
trace_config <- mle_trace(values = TRUE, path = TRUE, gradients = TRUE)

result <- gradient_ascent(max_iter = 20)(
  problem,
  theta0 = c(0, 1),
  trace = trace_config
)

if (!is.null(result$trace_data)) {
  cat("Iterations:", result$trace_data$total_iterations, "\n")
  cat("Final log-likelihood:", tail(result$trace_data$values, 1), "\n")
}
```
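
When per-iteration values are recorded, plotting the log-likelihood trajectory is often the quickest diagnostic. A minimal sketch, assuming `trace_data$values` holds the log-likelihood at each iteration as accessed in the chunk above:

```{r trace-plot, eval=FALSE}
# Plot recorded log-likelihood values against iteration number
# (assumes trace_data$values was populated by the traced run above).
if (!is.null(result$trace_data) && length(result$trace_data$values) > 0) {
  plot(result$trace_data$values, type = "b",
       xlab = "Iteration", ylab = "Log-likelihood",
       main = "Gradient ascent trace")
}
```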

## API Summary

**Problem Specification:**

- `mle_problem()` - Define the estimation problem
- `mle_constraint()` - Domain constraints

**Solver Factories:**

- `gradient_ascent()`, `newton_raphson()`, `bfgs()`, `lbfgsb()`, `nelder_mead()`
- `grid_search()`, `random_search()`

**Composition:**

- `%>>%` - Sequential chaining
- `%|%` - Parallel racing
- `with_restarts()` - Multiple starting points
- `unless_converged()` - Conditional refinement
- `compose()` - Compose multiple solvers

**Samplers:**

- `uniform_sampler()`, `normal_sampler()`

**Transformers:**

- `with_subsampling()`, `with_penalty()`
- `penalty_l1()`, `penalty_l2()`, `penalty_elastic_net()`
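
To close, here is a sketch of how the pieces above can combine into a single reusable strategy. `compose()` is listed in the summary but not demonstrated in this vignette, so treat the call below as an assumption that it chains solvers in the order given (equivalent to repeated `%>>%`); the sampler call mirrors the earlier restart example.

```{r putting-it-together, eval=FALSE}
# Assumed: compose() chains solvers in the order given, like %>>%.
full_strategy <- with_restarts(
  compose(
    grid_search(lower = c(-10, 0.5), upper = c(10, 5), n = 5),
    gradient_ascent(max_iter = 30),
    newton_raphson(max_iter = 10)
  ),
  n = 5,
  sampler = uniform_sampler(c(-10, 0.5), c(10, 5))
)

result <- full_strategy(problem, theta0 = c(0, 1))
cat("Result:", result$theta.hat, "\n")
```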