Introduction
Using a ZIP file or repository
Creating your own ZIP file
Frequently asked questions
Default repositories
- Lecture Notes Sigbert Klinke, HU Berlin
- Lecture Notes Bernd Rönz, HU Berlin (in German)

Introduction

Aim

This package was designed with the aim of distributing educational resources for statistics courses targeted at students.

Once the teaching materials have been downloaded, the primary functions of this package include:

Opening Shiny Apps, R, Python, and R Markdown files directly within RStudio.
Loading data files seamlessly.
Viewing HTML and PDF files conveniently in a web browser.

With this feature, students can access various educational materials such as interactive apps, R code, data files, and other resources that can be helpful in learning statistical concepts. By providing easy access to these materials, the package aims to facilitate the learning process for students and make it more interactive and engaging.

GitHub allows you to download the repository as a ZIP file. You can find the option to download under the Code button (Download ZIP) in the repository. mmstat4 works with this ZIP file, but you can also use one of your own ZIP files.

In my courses, I assume that all R programs run in a freshly started R environment, meaning there are no path dependencies, and all necessary libraries are loaded within the R program. My repositories contain not only the example programs for the students but also the programs I use to create images and tables, as well as the Shiny Apps I demonstrate.

Installation

You can install mmstat4 from CRAN using:

install.packages("mmstat4")

Alternatively, you can install the development version from GitHub using devtools:

# install.packages("devtools")
devtools::install_github("sigbertklinke/mmstat4")

Getting started

The package ships with a small ZIP file containing educational materials. First, we instruct mmstat4 to use this ZIP file instead of the larger ZIP file for my Data Analysis I and II courses.

ghget('local') 
ghopen("example_mcnemar.R")  # open an R example file

ghget returns the key (local) associated with the currently active ZIP file. ghopen launches the example file in RStudio. To access the equivalent Python script, use:

ghopen("example_mcnemar.py")  # open a Python example file

Note: Running Python scripts requires a local Python installation. Scripts are executed within a virtual environment named mmstat4.xxxx, which is created the first time a script is run or opened; you will be asked to approve its creation. During setup, mmstat4 checks whether the ZIP file contains a file init_py.R. If it does, the file is executed; it typically installs Python modules with reticulate::py_install('module name').

Shiny apps can also be launched in RStudio and run locally.

ghopen("pca_best_line/app.R")  # open a Shiny app

Data files can be loaded with:

x <- ghload("TelefonDaten.csv")  # load a data set
head(x)

#>     V1    V2
#> 1 1876  2600
#> 2 1877  9300
#> 3 1878 26300
#> 4 1879 30900
#> 5 1880 47900
#> 6 1881 71400

HTML and PDF files will open in the default application:

ghopen("Formelsammlung.pdf")  # open a PDF file

Using a ZIP file or repository

`ghget`

A ZIP file or repository can be stored locally or on the internet. A key-value approach is used to specify the location of the source ZIP file. If no key is given, ghget uses the base name of the source ZIP file as the key.

ghget(dummy="https://github.com/sigbertklinke/mmstat4.dummy/archive/refs/heads/main.zip")

Three repository keys are predefined: hu.data, hu.stat and dummy. You can activate them with:

ghget('dummy')
ghget('hu.stat')
ghget('hu.data')

If you do not supply a key, ghget creates one and returns it as the result.

ghget(system.file("zip/mmstat4.dummy.zip", package = "mmstat4"))
#> [1] "mmstat4.dummy"
ghget("https://github.com/sigbertklinke/mmstat4.dummy/archive/refs/heads/main.zip")
#> [1] "main"
# tries https://github.com/my/github_repo/archive/refs/heads/[main|master].zip 
ghget("my/github_repo")  # will fail
#> my/github_repo
#> https://github.com/my/github_repo/archive/refs/heads/main.zip
#> https://github.com/my/github_repo/archive/refs/heads/master.zip
#> Error in ghget("my/github_repo"): None of the previously displayed possible ZIP files were found!
#
ghget()                  # uses 'hu.data'
#> [1] "hu.data"

ghget downloads the ZIP file, saves it to a temporary location and unpacks it. For non-temporary locations, see the FAQ.
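
The key and URL conventions visible in the output above can be illustrated in base R. This is only a sketch: default_key and candidate_urls are hypothetical helper names, not mmstat4 functions, and the package may derive keys and URLs differently internally.

```r
# Hypothetical helpers for illustration; not part of mmstat4.

# Default key: base name of the ZIP file without its extension.
default_key <- function(zip) {
  tools::file_path_sans_ext(basename(zip))
}

# Candidate ZIP URLs for a "user/repo" GitHub shorthand.
candidate_urls <- function(repo, branches = c("main", "master")) {
  sprintf("https://github.com/%s/archive/refs/heads/%s.zip", repo, branches)
}

default_key("https://github.com/sigbertklinke/mmstat4.dummy/archive/refs/heads/main.zip")
default_key("zip/mmstat4.dummy.zip")
candidate_urls("my/github_repo")
```

Under these assumptions, the first two calls give the same keys as the ghget output above (main and mmstat4.dummy), and the last call lists the two URLs that ghget tried for my/github_repo.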

Full and short names for files

After unpacking the ZIP file, unique short names are generated for its files from their path components.

ghget('dummy')
#> [1] "dummy"
gd <- ghdecompose(ghlist(full.names=TRUE))
head(gd)
#>                              outpath inpath minpath              filename
#> 1 /tmp/RtmpGONbxJ/mmstat4.dummy-main                              LICENSE
#> 2 /tmp/RtmpGONbxJ/mmstat4.dummy-main                            README.md
#> 3 /tmp/RtmpGONbxJ/mmstat4.dummy-main   data                12411-0006.csv
#> 4 /tmp/RtmpGONbxJ/mmstat4.dummy-main   data         ArbeitsloseBerlin.csv
#> 5 /tmp/RtmpGONbxJ/mmstat4.dummy-main           data             BANK2.sav
#> 6 /tmp/RtmpGONbxJ/mmstat4.dummy-main   data                Preisindex.csv
#>                                                          source
#> 1                    /tmp/RtmpGONbxJ/mmstat4.dummy-main/LICENSE
#> 2                  /tmp/RtmpGONbxJ/mmstat4.dummy-main/README.md
#> 3        /tmp/RtmpGONbxJ/mmstat4.dummy-main/data/12411-0006.csv
#> 4 /tmp/RtmpGONbxJ/mmstat4.dummy-main/data/ArbeitsloseBerlin.csv
#> 5             /tmp/RtmpGONbxJ/mmstat4.dummy-main/data/BANK2.sav
#> 6        /tmp/RtmpGONbxJ/mmstat4.dummy-main/data/Preisindex.csv

The file name is split into four parts. The last two parts, minpath and filename, are used to create short names:

The short name for /tmp/RtmpXXXXXX/mmstat4.dummy-main/LICENSE is LICENSE. No other file in the ZIP file is named LICENSE, so the bare file name is sufficient to address it.
The short name for /tmp/RtmpXXXXXX/mmstat4.dummy-main/data/BANK2.sav is data/BANK2.sav. The ZIP file contains a second file named BANK2.sav (short name dbscan/BANK2.sav), so one additional path component is needed to address each of them uniquely. Currently, no check is made whether two files with identical base names are also identical in content.

ghlist("BANK2", full.names=TRUE) # full names
#> [1] "/tmp/RtmpGONbxJ/mmstat4.dummy-main/data/BANK2.sav"                        
#> [2] "/tmp/RtmpGONbxJ/mmstat4.dummy-main/examples/data/cluster/dbscan/BANK2.sav"
ghlist("BANK2")                  # short names
#> [1] "data/BANK2.sav"   "dbscan/BANK2.sav"
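
How such short names can arise may be sketched in base R: for each file, take the shortest run of trailing path components that no other file shares. The function short_names below is a hypothetical illustration, not the algorithm mmstat4 actually uses, and the paths are given relative to the extraction directory.

```r
# Hypothetical sketch; not mmstat4's implementation.
# For each path, return the shortest trailing part that is unique among all paths.
short_names <- function(paths) {
  parts <- strsplit(paths, "/", fixed = TRUE)
  # suffix built from the last k path components
  suffix <- function(p, k) paste(utils::tail(p, k), collapse = "/")
  vapply(seq_along(parts), function(i) {
    for (k in seq_along(parts[[i]])) {
      cand   <- suffix(parts[[i]], k)
      others <- vapply(parts[-i], suffix, character(1), k = k)
      if (!(cand %in% others)) return(cand)
    }
    paths[i]  # no unique suffix found: fall back to the full path
  }, character(1))
}

paths <- c("data/BANK2.sav",
           "examples/data/cluster/dbscan/BANK2.sav",
           "LICENSE")
short_names(paths)
```

For these three paths the sketch yields data/BANK2.sav, dbscan/BANK2.sav and LICENSE, which agrees with the short names listed above.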

`ghopen`, `ghload`, `ghsource`

The short names (or the full names) can be used to work with the files:

## x <- ghload("data/BANK2.sav")          # load data via rio::import
## ghopen("univariate/example_ecdf.R")    # open file in RStudio editor
## ghsource("univariate/example_ecdf.R")  # execute file via source
ghlist("example_ecdf")                 # "univariate/" was unnecessary
#> [1] "example_ecdf.R"

`ghlist`, `ghquery`

With ghlist you can list the unique (short) names of all files in the repository, or of a subset selected by a regular expression pattern:

str(ghlist())     # get all short names
#>  chr [1:473] "LICENSE" "README.md" "12411-0006.csv" "ArbeitsloseBerlin.csv" ...
ghlist("\\.pdf$") # get all short names of PDF files
#> [1] "Aufgaben.pdf"       "Formelsammlung.pdf" "Loesungen.pdf"

With ghquery you can search the list of unique (short) names for the files closest to a query string, based on the overlap distance:

ghlist("bnk")  # pattern = "bnk"
#> character(0)
ghquery("bnk") # nearest string matching to "bnk"
#> [1] "data/BANK2.sav"        "dbscan/BANK2.sav"      "AverageGroupLinkage.R"
#> [4] "AverageLinkage.R"      "CentroidLinkage.R"     "CompleteLinkage.R"
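
The difference between pattern matching and nearest-string matching can be illustrated with base R alone. The sketch below uses utils::adist (a generalized edit distance) as a stand-in; this is an assumption made for illustration, and ghquery's overlap distance may rank names differently. The file names are a small hand-picked sample.

```r
# Illustration only: adist is a stand-in, not the distance used by ghquery.
files <- c("data/BANK2.sav", "dbscan/BANK2.sav", "LICENSE", "README.md")

# Regular expression matching is exact: "bnk" occurs in none of the names.
hits <- grepl("bnk", files, ignore.case = TRUE)

# Approximate matching: edit distance between the query and the best-matching
# substring of each name (smaller means closer).
d <- drop(utils::adist("bnk", files, partial = TRUE, ignore.case = TRUE))
files[order(d)]
```

Here the two BANK2.sav files come first because "bnk" is only one inserted letter away from "bank", whereas the pattern search finds nothing.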

`ghfile`, `ghpath`, `ghdecompose`

ghfile tries to find a unique match for a given file and returns the full path. If there is no unique match, an error is returned with some possible matches.

ghdecompose builds a data frame and decomposes the full names of the files into

outpath: the path part that is the same for all files (essentially the directory into which the ZIP file is extracted),
inpath: the path part inside the ZIP file that is not needed in minpath,
minpath: the minimal path part needed so that all files are uniquely addressable,
filename: the base name of the file, and
source: the input for shortpath.

The short names for the files are built from the components minpath and filename.

ghpath builds up the short name with various path components from a ghdecompose object.

ghfile('data/BANK2.sav')
#> [1] "/tmp/RtmpGONbxJ/mmstat4.dummy-main/data/BANK2.sav"
ghget(local=system.file("zip", "mmstat4.dummy.zip", package="mmstat4"))
#> [1] "local"
fnf <- ghlist(full.names=TRUE)
dfn <- ghdecompose(fnf)
head(dfn)
#>                         outpath inpath minpath       filename
#> 1 /tmp/RtmpGONbxJ/mmstat4.dummy   data           hhberlin.csv
#> 2 /tmp/RtmpGONbxJ/mmstat4.dummy   data         Preisindex.csv
#> 3 /tmp/RtmpGONbxJ/mmstat4.dummy           data      BANK2.sav
#> 4 /tmp/RtmpGONbxJ/mmstat4.dummy   data         12411-0006.csv
#> 5 /tmp/RtmpGONbxJ/mmstat4.dummy   data         child_data.sav
#> 6 /tmp/RtmpGONbxJ/mmstat4.dummy   data                hhD.rda
#>                                              source
#> 1   /tmp/RtmpGONbxJ/mmstat4.dummy/data/hhberlin.csv
#> 2 /tmp/RtmpGONbxJ/mmstat4.dummy/data/Preisindex.csv
#> 3      /tmp/RtmpGONbxJ/mmstat4.dummy/data/BANK2.sav
#> 4 /tmp/RtmpGONbxJ/mmstat4.dummy/data/12411-0006.csv
#> 5 /tmp/RtmpGONbxJ/mmstat4.dummy/data/child_data.sav
#> 6        /tmp/RtmpGONbxJ/mmstat4.dummy/data/hhD.rda
head(ghpath(dfn))
#>                                                   1 
#>   "/tmp/RtmpGONbxJ/mmstat4.dummy/data/hhberlin.csv" 
#>                                                   2 
#> "/tmp/RtmpGONbxJ/mmstat4.dummy/data/Preisindex.csv" 
#>                                                   3 
#>      "/tmp/RtmpGONbxJ/mmstat4.dummy/data/BANK2.sav" 
#>                                                   4 
#> "/tmp/RtmpGONbxJ/mmstat4.dummy/data/12411-0006.csv" 
#>                                                   5 
#> "/tmp/RtmpGONbxJ/mmstat4.dummy/data/child_data.sav" 
#>                                                   6 
#>        "/tmp/RtmpGONbxJ/mmstat4.dummy/data/hhD.rda"

RStudio addins

The package comes with two RStudio addins (see under Addins -> MMSTAT4):

Open a file from a zip file (ghopenAddin), which gives access to the unzipped zip file and opens the selected file in an RStudio editor window.
Execute a Shiny app from a zip file (ghappAddin), which extracts all directories containing Shiny apps and opens the selected app in a web browser (using the default browser).

Creating your own ZIP file

Preparation 1: Libraries/Modules used

The following routines currently support working with R and Python code snippets:

pkglist and modlist extract all library/require/import calls from code snippets and return a frequency table of the packages or modules called.

ghget(local=system.file("zip", "mmstat4.dummy.zip", package="mmstat4"))
#> [1] "local"
files <- ghlist(pattern="\\.R$", full.names = TRUE)
cat(head(pkglist(files, repos="https://cloud.r-project.org"), 12))
#> if(!require("Amelia")) install.packages("Amelia", repos="https://cloud.r-project.org/src/contrib")
#>  # if(!require("CHAID")) install.packages("CHAID")
#>  if(!require("DescTools")) install.packages("DescTools", repos="https://cloud.r-project.org/src/contrib")
#>  if(!require("GGally")) install.packages("GGally", repos="https://cloud.r-project.org/src/contrib")
#>  if(!require("HMMpa")) install.packages("HMMpa", repos="https://cloud.r-project.org/src/contrib")
#>  if(!require("Hmisc")) install.packages("Hmisc", repos="https://cloud.r-project.org/src/contrib")
#>  # if(!require("MASS")) install.packages("MASS")
#>  # if(!require("MissingDataGUI")) install.packages("MissingDataGUI")
#>  if(!require("NbClust")) install.packages("NbClust", repos="https://cloud.r-project.org/src/contrib")
#>  if(!require("QuantPsyc")) install.packages("QuantPsyc", repos="https://cloud.r-project.org/src/contrib")
#>  if(!require("RColorBrewer")) install.packages("RColorBrewer", repos="https://cloud.r-project.org/src/contrib")
#>  if(!require("TeachingDemos")) install.packages("TeachingDemos", repos="https://cloud.r-project.org/src/contrib")

Note that the line for CHAID is commented out. The package cannot be found on CRAN, but you can install it from R-Forge by adding that repository:

cat(head(pkglist(files, repos=c("https://cloud.r-project.org", "http://R-Forge.R-project.org")), 12))
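
The basic idea of collecting package names from library/require calls can be sketched with a regular expression in base R. The function extract_pkgs is a hypothetical illustration, not the implementation of pkglist; among other simplifications, it ignores comments, pkg::fun calls, and Python imports.

```r
# Hypothetical sketch; not mmstat4's pkglist().
# Collect the package names used in library()/require() calls.
extract_pkgs <- function(code) {
  pat <- "(?:library|require)\\(\\s*[\"']?([A-Za-z][A-Za-z0-9.]*)[\"']?"
  calls <- unlist(regmatches(code, gregexpr(pat, code, perl = TRUE)))
  sub(pat, "\\1", calls, perl = TRUE)  # keep only the captured package name
}

code <- c('library(MASS)',
          'if (!require("Hmisc")) stop("Hmisc missing")',
          'library("MASS")',
          'x <- 1')
pkgs <- extract_pkgs(code)
table(pkgs)  # frequency table of the packages called
```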

You can add a file init_R.R or init_py.R to your ZIP file, which installs the necessary R packages or Python modules.

Preparation 2: Scripts run independently

checkFiles checks whether each R code snippet runs without errors in a freshly started R session.

# just check the last files from the list
# Note that the R console will show more output (warnings etc.)
checkFiles(files, start=435)  # alternatively: Rsolo

Three modes are available for checking a file:

exist: Does the source file exist?
parse: Is parse(file) or python -m "file" successful? (default)
run: Is Rscript "file" or python3 "file" successful?
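
For R files, the three modes could be approximated in base R as follows. This is a minimal sketch under stated assumptions: check_r_file is a hypothetical name, the modes are treated as cumulative (run implies parse, parse implies exist), and the package's own checks may differ in detail.

```r
# Hypothetical sketch; not mmstat4's checkFiles().
check_r_file <- function(file, mode = c("parse", "exist", "run")) {
  mode <- match.arg(mode)
  if (!file.exists(file)) return(FALSE)
  if (mode == "exist") return(TRUE)
  # parse: is the file syntactically valid R code?
  ok <- !inherits(try(parse(file), silent = TRUE), "try-error")
  if (!ok || mode == "parse") return(ok)
  # run: execute the file in a fresh R process and check the exit status
  rscript <- file.path(R.home("bin"), "Rscript")
  status  <- system2(rscript, shQuote(file), stdout = FALSE, stderr = FALSE)
  status == 0
}

good <- tempfile(fileext = ".R"); writeLines("x <- 1 + 1", good)
bad  <- tempfile(fileext = ".R"); writeLines("x <- (", bad)
fail <- tempfile(fileext = ".R"); writeLines("stop('boom')", fail)

check_r_file(good, "run")    # parses and runs
check_r_file(bad,  "parse")  # syntax error
check_r_file(fail, "parse")  # parses fine ...
check_r_file(fail, "run")    # ... but stops with an error when run
```

Running each file in a separate process is what makes the check meaningful: objects or packages loaded in the current session cannot mask a missing library call.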

Preparation 3: Searching for (and removing) duplicate files

dupFiles uses checksums to detect files that occur more than once, i.e. files with identical content.

files <- ghlist(full.names = TRUE)
head(dupFiles(files))  # alternatively: Rdups
#> $c300e8fe6f0bc562256e81670c23d8c0
#> [1] "/tmp/RtmpGONbxJ/mmstat4.dummy/data/BANK2.sav"                        
#> [2] "/tmp/RtmpGONbxJ/mmstat4.dummy/examples/data/cluster/dbscan/BANK2.sav"
#> 
#> $`4efddb6dc6c7ed743221295d55133817`
#> [1] "/tmp/RtmpGONbxJ/mmstat4.dummy/examples/data/nnet/mincer_nnet3.R"
#> [2] "/tmp/RtmpGONbxJ/mmstat4.dummy/examples/data/nnet/mincer_nnet5.R"
#> 
#> $`9f9fe7603aa82f33bbc85a9d32e39d03`
#> [1] "/tmp/RtmpGONbxJ/mmstat4.dummy/examples/data/cluster/dbscan/app.tmpl"       
#> [2] "/tmp/RtmpGONbxJ/mmstat4.dummy/examples/data/mgraphics/scagnostics/app.tmpl"
#> 
#> $`0b74b824367df429803599708daf2e2e`
#> [1] "/tmp/RtmpGONbxJ/mmstat4.dummy/examples/data/subgroup/example_mosaic.R" 
#> [2] "/tmp/RtmpGONbxJ/mmstat4.dummy/examples/data/mgraphics/example_mosaic.R"
#> 
#> $`8eaa4f89e233ba69fcda053d238699aa`
#> [1] "/tmp/RtmpGONbxJ/mmstat4.dummy/examples/data/subgroup/example_mosaic_cotabplot.R" 
#> [2] "/tmp/RtmpGONbxJ/mmstat4.dummy/examples/data/mgraphics/example_mosaic_cotabplot.R"
#> 
#> $`8ed6128aab796148df5e71cbeab547da`
#> [1] "/tmp/RtmpGONbxJ/mmstat4.dummy/examples/data/subgroup/example_mosaic_graphics.R" 
#> [2] "/tmp/RtmpGONbxJ/mmstat4.dummy/examples/data/mgraphics/example_mosaic_graphics.R"

Note: there is also an error message if the necessary libraries are not installed!
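
The checksum approach to finding duplicates can be reproduced with tools::md5sum from base R. The following is a sketch on three temporary files, not the implementation of dupFiles; the file names are made up for the example.

```r
# Sketch of checksum-based duplicate detection; not mmstat4's dupFiles().
dir <- tempfile("dups")
dir.create(dir)
f <- file.path(dir, c("a.txt", "b.txt", "c.txt"))
writeLines("same content", f[1])
writeLines("same content", f[2])
writeLines("different content", f[3])

sums   <- tools::md5sum(f)             # named vector: file name -> MD5 checksum
groups <- split(names(sums), sums)     # group file names by checksum
dups   <- groups[lengths(groups) > 1]  # keep checksums shared by several files
dups
```

As in the dupFiles output above, the result is a list with one element per checksum, each holding the files that share it.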

ZIP file and access names

Once you have created your ZIP file, you need to know the names under which a specific file can be accessed. The example uses a ZIP file that ships with the mmstat4 package:

ghget(local=system.file("zip", "mmstat4.dummy.zip", package="mmstat4"))
#> [1] "local"
ghnames <- ghdecompose(ghlist(full.names=TRUE))
ghnames[58,]
#>                          outpath                inpath minpath filename
#> 58 /tmp/RtmpGONbxJ/mmstat4.dummy examples/data/cluster         pcplot.R
#>                                                          source
#> 58 /tmp/RtmpGONbxJ/mmstat4.dummy/examples/data/cluster/pcplot.R

The shortest possible name is determined by minpath and filename. Longer names that prepend trailing parts of inpath to minpath and filename should also work.

For the file examples/data/cluster/dbscan/BANK2.sav, the following access names are possible:

BANK2.sav will not work because the ZIP file contains more than one file named BANK2.sav.
dbscan/BANK2.sav will work since this is the shortest possible name.
cluster/dbscan/BANK2.sav, data/cluster/dbscan/BANK2.sav, and examples/data/cluster/dbscan/BANK2.sav will work as well.

x1 <- ghload("BANK2.sav")
#> Best matches: 
#>   ghload(x = "data/BANK2.sav")
#>   ghload(x = "dbscan/BANK2.sav")
#> Error in ghfile(x, msg = msg): No (unique) file 'BANK2.sav' found, check matches!
x2 <- ghload("dbscan/BANK2.sav")
x3 <- ghload("cluster/dbscan/BANK2.sav")
x4 <- ghload("data/cluster/dbscan/BANK2.sav")
x5 <- ghload("examples/data/cluster/dbscan/BANK2.sav")

Frequently asked questions

Something is not working properly. Where can I get help?

Please email me at sigbert@hu-berlin.de. You can also try the current development version of the package from GitHub:

# install.packages("devtools")
devtools::install_github("sigbertklinke/mmstat4")

Can I use a password protected ZIP file?

No, this is not supported.

How can I force a reload of a zip file?

ghget("dummy", .force=TRUE)

How can I store a zip file permanently?

ghget("dummy", .tempdir=FALSE)        # install non-temporarily
ghget("dummy", .tempdir="~/mmstat4")  # install non-temporarily to ~/mmstat4
ghget("dummy", .tempdir=TRUE)         # install again temporarily

Note: If a repository was installed permanently and you switch back to temporary storage, the previously downloaded files will not be deleted.

How can I find all directories with Shiny apps?

ghget("dummy", .tempdir=TRUE)
ghlist(pattern="/(app|server)\\.R$")
ghopen("dbscan") # open the app
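
What the pattern selects can be checked on a few example paths in base R. The paths below are made up for illustration and are not taken from a repository.

```r
# Made-up example paths; only files named app.R or server.R inside a directory match.
files <- c("examples/cluster/dbscan/app.R",
           "examples/pca/server.R",
           "examples/pca/ui.R",
           "examples/univariate/example_ecdf.R",
           "examples/cluster/dbscan/app.tmpl")

is_app   <- grepl("/(app|server)\\.R$", files)
app_dirs <- unique(dirname(files[is_app]))  # directories containing a Shiny app
app_dirs
```

Note that the leading slash in the pattern requires at least one directory component, so a file app.R at the top level of the listing would not match.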

How can I find all `csv` data files?

ghget("dummy", .tempdir=TRUE)
#> [1] "dummy"
ghlist(pattern="\\.csv$", ignore.case=TRUE, full.names=TRUE)
#>  [1] "/tmp/RtmpGONbxJ/mmstat4.dummy-main/data/12411-0006.csv"       
#>  [2] "/tmp/RtmpGONbxJ/mmstat4.dummy-main/data/ArbeitsloseBerlin.csv"
#>  [3] "/tmp/RtmpGONbxJ/mmstat4.dummy-main/data/Preisindex.csv"       
#>  [4] "/tmp/RtmpGONbxJ/mmstat4.dummy-main/data/TelefonDaten.csv"     
#>  [5] "/tmp/RtmpGONbxJ/mmstat4.dummy-main/data/haushalte.csv"        
#>  [6] "/tmp/RtmpGONbxJ/mmstat4.dummy-main/data/haushalte_berlin.csv" 
#>  [7] "/tmp/RtmpGONbxJ/mmstat4.dummy-main/data/hhberlin.csv"         
#>  [8] "/tmp/RtmpGONbxJ/mmstat4.dummy-main/data/hhberlin_2017.csv"    
#>  [9] "/tmp/RtmpGONbxJ/mmstat4.dummy-main/data/pechstein.csv"        
#> [10] "/tmp/RtmpGONbxJ/mmstat4.dummy-main/data/rentcap.csv"
# use mmstat4::ghload for importing
ghlist(pattern="\\.csv$")
#>  [1] "12411-0006.csv"        "ArbeitsloseBerlin.csv" "Preisindex.csv"       
#>  [4] "TelefonDaten.csv"      "haushalte.csv"         "haushalte_berlin.csv" 
#>  [7] "hhberlin.csv"          "hhberlin_2017.csv"     "pechstein.csv"        
#> [10] "rentcap.csv"
pechstein <- ghload("pechstein.csv")
str(pechstein)
#> 'data.frame':    29 obs. of  3 variables:
#>  $ Datum        : chr  "04.02.00" "01.02.01" "10.11.01" "06.02.02" ...
#>  $ Tag          : int  34 397 679 767 771 783 1043 1160 1166 1421 ...
#>  $ Retikulozyten: chr  "2,3" "2,5" "2,45" "2,1" ...

What should I install to use Python scripts?

For Ubuntu (Linux) install:

sudo apt-get install python3 python3-dev python3-pip python3-venv libbz2-dev

Note: By default, mmstat4 installs the Python modules numpy, scipy, statsmodels, pandas, scikit-learn, matplotlib, and seaborn.

`init_py.R` is only called when the virtual environment is created. Can I force a new call?

Yes, delete the virtual environment and recreate it:

reticulate::virtualenv_remove('mmstat4')
ghinstall('py', force=TRUE)

Default repositories

The package recognises three standard repositories: dummy, hu.stat, and hu.data.

Repository	Size	ZIP file location
`dummy`	3 MB	`https://github.com/sigbertklinke/mmstat4.dummy/archive/refs/heads/main.zip`
`hu.data`	29 MB	`https://github.com/sigbertklinke/mmstat4.data/archive/refs/heads/main.zip`
`hu.stat`	31 MB	`https://github.com/sigbertklinke/mmstat4.stat/archive/refs/heads/main.zip`

dummy is a small subsample of hu.stat and hu.data, intended for examples and testing.

Lecture Notes Sigbert Klinke, HU Berlin

Basic statistics I+II (in German)

Mathematical foundations - Introduction - Basic concepts - Univariate distributions - Parameters of univariate distributions - Bivariate distributions - Parameters of bivariate distributions - Regression analysis - Time series analysis - Index numbers - Probability theory - Random variables - How to lie with statistics - Important distribution models - Sampling theory - Statistical estimation methods - Regression model - Confidence intervals - Statistical tests - Parametric tests - Nonparametric tests

ghget("hu.stat")
ghopen("Statistik.pdf")
ghopen("Aufgaben.pdf")
ghopen("Loesungen.pdf")
ghopen("Formelsammlung.pdf")

Data analysis

General - R - Basics and data generation - Test and estimation theory - Parameter of distributions - Distribution - Transformations - Robust statistics - Missing values - Subgroup analysis - Correlation and association - Multivariate graphics - Principal component analysis - Exploratory factor analysis - Reliability - Cluster analysis - Regression analysis - Linear regression - Nonparametric regression - Classification and regression trees - Neural networks

ghget("hu.data")
ghopen("dataanalysis.pdf")

Lecture Notes Bernd Rönz, HU Berlin (in German)

Computergestützte Statistik I mit SPSS 10 (2001)

Introduction - Detection and identification of outliers - Testing the distributional form of variables - Parameter comparisons for independent samples - Appendices A-D, bibliography, index

ghget("hu.data")
ghopen("cs1_roenz.pdf")

Computergestützte Statistik II mit SPSS 10 (2000)

Preface - Testing relationships - Regression analysis - Reliability and homogeneity analysis of constructs - Appendices A-H, bibliography, subject index

ghget("hu.data")
ghopen("cs2_roenz.pdf")

Generalisierte lineare Modelle mit SPSS 10 (2001)

Introduction - Generalized linear models (GLM) - Modelling binary data - The multinomial logit model - Modelling multinomial data (log-linear models) - Bibliography, index

ghget("hu.data")
ghopen("glm_roenz.pdf")

mmstat4

Sigbert Klinke

29 April, 2024