General

This package provides teaching materials for various statistics courses.

The main function of this package is called gh. It covers several tasks, such as opening, loading, and sourcing files; the corresponding functions ghopen, ghload, and ghsource are described below.

With this function, students can access teaching materials such as interactive apps, R code, data files, and other resources for learning statistical concepts. Easy access to these materials is meant to make learning more interactive and engaging.

GitHub lets you download a repository as a ZIP file: in the repository, choose Download ZIP under the Code button. mmstat4 works with such ZIP files, but you can also use any ZIP file of your own.

In my courses I assume that all R programs run in a freshly started R session, i.e. there are no path dependencies, each program loads all the libraries it needs, and so on. My repositories contain not only the example programs for the students, but also the programs I use to generate images and tables, as well as the Shiny apps I show.
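The "freshly started R" requirement can be checked by running a script in a separate R process. The following is only a minimal base-R sketch of that idea, not the package's implementation; the helper name run_fresh and the two temporary scripts are made up for illustration.

```r
# Hypothetical helper: run one script in a new R process via Rscript --vanilla,
# so that nothing from the current session (objects, attached packages) leaks in.
run_fresh <- function(file) {
  rscript <- file.path(R.home("bin"), "Rscript")
  out <- suppressWarnings(
    system2(rscript, c("--vanilla", shQuote(file)), stdout = TRUE, stderr = TRUE)
  )
  status <- attr(out, "status")  # only set when the exit status is non-zero
  list(ok = is.null(status) || status == 0, output = out)
}

# Two throwaway scripts: one standalone, one that needs a package that does not exist
good <- tempfile(fileext = ".R")
bad  <- tempfile(fileext = ".R")
writeLines('cat("standalone\\n")', good)
writeLines("library(thisPackageDoesNotExist123)", bad)

res_good <- run_fresh(good)
res_bad  <- run_fresh(bad)
c(good = res_good$ok, bad = res_bad$ok)
```

The captured output in res_bad$output contains the error message of the failing script, which is usually enough to see which library or path is missing.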

Using a ZIP file or repository

ghget

A ZIP file or repository can be stored locally or on the internet. A user-defined key has to be given for the location of the ZIP file:

ghget(dummy="https://github.com/sigbertklinke/mmstat4.dummy/archive/refs/heads/main.zip")

Three repositories are predefined: hu.data, hu.stat and dummy. You can retrieve them via

ghget('dummy')
ghget('hu.stat')
ghget('hu.data')
ghget()           # uses hu.data

ghget downloads the ZIP file, saves it to a temporary location and unpacks it. For non-temporary locations, see the FAQ.
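The three call styles shown above (no argument, a known key, a user-defined key with a location) can be pictured as a lookup in a small key-to-location table. The sketch below is an assumption about how such a lookup could work, not the code of ghget; known_repos, resolve_repo and is_remote are hypothetical names, and the URLs are the ones listed under "Default repositories".

```r
# Hypothetical registry: key -> ZIP location (the three predefined repositories)
known_repos <- c(
  dummy   = "https://github.com/sigbertklinke/mmstat4.dummy/archive/refs/heads/main.zip",
  hu.data = "https://github.com/sigbertklinke/mmstat4.data/archive/refs/heads/main.zip",
  hu.stat = "https://github.com/sigbertklinke/mmstat4.stat/archive/refs/heads/main.zip"
)

# Hypothetical helper mirroring the three call styles:
#   resolve_repo()              -> default key
#   resolve_repo("dummy")       -> look up a known key
#   resolve_repo(mykey = "...") -> user-defined key for a new location
resolve_repo <- function(..., default = "hu.data") {
  args <- list(...)
  if (length(args) == 0) return(known_repos[default])
  key <- names(args)[1]
  if (is.null(key) || !nzchar(key)) {
    key <- args[[1]]
    if (!key %in% names(known_repos)) stop("unknown key: ", key)
    return(known_repos[key])
  }
  stats::setNames(args[[1]], key)
}

# A location counts as remote if it starts with a URL scheme, otherwise as a local file
is_remote <- function(location) grepl("^(https?|ftp)://", location)

resolve_repo()                      # default repository
resolve_repo("dummy")               # predefined key
resolve_repo(mine = "~/my.zip")     # user-defined key, local ZIP file
```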

Full and short names for files

After unpacking the ZIP file, unique short names are generated for its files from their path components.

ghget('dummy')
gd <- ghdecompose(ghlist(full.names=TRUE))
head(gd)
#>                           commonpath uniquepath minpath              filename
#> 1 /tmp/RtmpiCpOOd/mmstat4.dummy-main                                  LICENSE
#> 2 /tmp/RtmpiCpOOd/mmstat4.dummy-main                                README.md
#> 3 /tmp/RtmpiCpOOd/mmstat4.dummy-main       data                12411-0006.csv
#> 4 /tmp/RtmpiCpOOd/mmstat4.dummy-main       data         ArbeitsloseBerlin.csv
#> 5 /tmp/RtmpiCpOOd/mmstat4.dummy-main               data             BANK2.sav
#> 6 /tmp/RtmpiCpOOd/mmstat4.dummy-main       data                Preisindex.csv
#>                                                          source
#> 1                    /tmp/RtmpiCpOOd/mmstat4.dummy-main/LICENSE
#> 2                  /tmp/RtmpiCpOOd/mmstat4.dummy-main/README.md
#> 3        /tmp/RtmpiCpOOd/mmstat4.dummy-main/data/12411-0006.csv
#> 4 /tmp/RtmpiCpOOd/mmstat4.dummy-main/data/ArbeitsloseBerlin.csv
#> 5             /tmp/RtmpiCpOOd/mmstat4.dummy-main/data/BANK2.sav
#> 6        /tmp/RtmpiCpOOd/mmstat4.dummy-main/data/Preisindex.csv

Each full name is split into four parts: commonpath, uniquepath, minpath, and filename. The last two parts, minpath and filename, are used to create short names:

  1. the short name for /tmp/RtmpXXXXXX/mmstat4.dummy-main/LICENSE is LICENSE. No other file in the ZIP file is named LICENSE, so the file name alone is sufficient to address this file.
  2. the short name for /tmp/RtmpXXXXXX/mmstat4.dummy-main/data/BANK2.sav is data/BANK2.sav. The ZIP file contains another file called BANK2.sav (its short name is dbscan/BANK2.sav), so the file name alone is ambiguous, but data/BANK2.sav addresses this file uniquely. Currently, no check is made whether two files with identical basenames are also identical in content.
ghlist("BANK2", full.names=TRUE) # full names
#> [1] "/tmp/RtmpiCpOOd/mmstat4.dummy-main/data/BANK2.sav"                        
#> [2] "/tmp/RtmpiCpOOd/mmstat4.dummy-main/examples/data/cluster/dbscan/BANK2.sav"
ghlist("BANK2")                  # short names
#> [1] "data/BANK2.sav"   "dbscan/BANK2.sav"
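The rule behind the short names (take as few trailing path components as are needed to make a file unique) can be sketched in base R. This is an illustration under that assumption, not the package's code; short_names is a hypothetical helper and the paths are shortened versions of the ones above.

```r
# Hypothetical helper: for each path, return the shortest trailing part
# (last k components) that no other path ends with.
short_names <- function(paths) {
  parts  <- strsplit(paths, "/", fixed = TRUE)
  suffix <- function(p, k) paste(tail(p, k), collapse = "/")
  vapply(seq_along(parts), function(i) {
    k <- 1
    repeat {
      cand   <- suffix(parts[[i]], k)
      others <- vapply(parts[-i], suffix, character(1), k = k)
      # stop when the candidate is unique or the whole path has been used
      if (!(cand %in% others) || k >= length(parts[[i]])) return(cand)
      k <- k + 1
    }
  }, character(1))
}

paths <- c("repo/LICENSE",
           "repo/data/BANK2.sav",
           "repo/examples/data/cluster/dbscan/BANK2.sav")
sn <- short_names(paths)
sn
```

For these three paths the sketch yields LICENSE, data/BANK2.sav and dbscan/BANK2.sav, in line with the short names listed above.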

ghopen, ghload, ghsource

The short names (or the full names) can be used to work with the files:

## x <- ghload("data/BANK2.sav")          # load data via rio::import
## ghopen("univariate/example_ecdf.R")    # open file in RStudio editor
## ghsource("univariate/example_ecdf.R")  # execute file via source
ghlist("example_ecdf")                 # the short name does not need "univariate/"
#> [1] "example_ecdf.R"

ghlist, ghquery

With ghlist you can get a list of unique (short) names for all files in the repository, or for a subset selected by a regular expression pattern:

str(ghlist())     # get all short names
#>  chr [1:473] "LICENSE" "README.md" "12411-0006.csv" "ArbeitsloseBerlin.csv" ...
ghlist("\\.pdf$") # get all short names of PDF files
#> [1] "Aufgaben.pdf"       "Formelsammlung.pdf" "Loesungen.pdf"
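Because the pattern is a regular expression and not a shell wildcard, a literal dot has to be escaped and $ anchors the match at the end of the name. The base-R grep calls below only illustrate these pattern rules on a made-up vector of short names; they are a stand-in and do not call ghlist itself.

```r
# Hypothetical short names, a subset of those appearing in this document
short <- c("LICENSE", "README.md", "Aufgaben.pdf", "dbscan.R", "data/BANK2.sav",
           "example_ecdf.R", "pechstein.csv")

pdfs  <- grep("\\.pdf$", short, value = TRUE)                   # names ending in ".pdf"
rprog <- grep("\\.r$", short, value = TRUE, ignore.case = TRUE) # ".R" or ".r" at the end
bank  <- grep("BANK2", short, value = TRUE)                     # unanchored: matches anywhere
none  <- grep("bnk", short, value = TRUE)                       # no name contains "bnk"
list(pdfs = pdfs, rprog = rprog, bank = bank, none = none)
```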

With ghquery you can search the unique (short) names of all files for the nearest matches to a string, based on the overlap distance.

ghlist("bnk")  # pattern = "bnk"
#> character(0)
ghquery("bnk") # nearest string matching to "bnk"
#> [1] "data/BANK2.sav"   "dbscan/BANK2.sav" "dbscan.R"         "kernel.R"        
#> [5] "dbscan2.R"        "linkage.R"
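How exactly ghquery computes its overlap distance is not spelled out here. One plausible reading, used purely as an assumption in the sketch below, is an overlap coefficient on the sets of characters of the two strings: the share of common characters relative to the smaller set, turned into a distance. The helpers overlap_distance and nearest are hypothetical.

```r
# Assumed metric: 1 - |A ∩ B| / min(|A|, |B|), where A and B are the sets of
# (lower-cased) characters of the query and of a candidate name.
# Both strings are assumed to be non-empty.
overlap_distance <- function(query, candidates) {
  chars <- function(s) unique(strsplit(tolower(s), "", fixed = TRUE)[[1]])
  q <- chars(query)
  vapply(candidates, function(s) {
    cs <- chars(s)
    1 - length(intersect(q, cs)) / min(length(q), length(cs))
  }, numeric(1))
}

# Hypothetical helper: the n candidates with the smallest distance
nearest <- function(query, candidates, n = 6) {
  d <- overlap_distance(query, candidates)
  head(candidates[order(d)], n)
}

cand <- c("README.md", "kernel.R", "data/BANK2.sav")
d <- overlap_distance("bnk", cand)
d
nearest("bnk", cand, n = 2)
```

Under this assumed metric, data/BANK2.sav contains all three characters of "bnk" and gets distance 0, while kernel.R contains two of them; this is at least consistent with the ranking in the output above.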

ghfile, ghpath, ghdecompose

ghfile tries to find a unique match for a given file and returns the full path. If there is no unique match, ghfile stops with an error that lists some possible matches.
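The "unique match or error" behaviour can be mimicked in a few lines of base R. The sketch below assumes that a name matches a file if the full path equals the name or ends with "/" followed by the name; find_file is a hypothetical helper, not ghfile itself.

```r
# Hypothetical helper: return the single full path matching `name`,
# otherwise stop and list the candidates.
find_file <- function(name, paths) {
  hits <- paths[paths == name | endsWith(paths, paste0("/", name))]
  if (length(hits) == 1) return(hits)
  if (length(hits) == 0) stop("No file found for '", name, "'")
  stop("Several files for '", name, "' found:\n  ", paste(hits, collapse = "\n  "))
}

paths <- c("/tmp/repo/LICENSE",
           "/tmp/repo/data/BANK2.sav",
           "/tmp/repo/examples/data/cluster/dbscan/BANK2.sav")

hit <- find_file("data/BANK2.sav", paths)
msg <- tryCatch(find_file("BANK2.sav", paths), error = function(e) conditionMessage(e))
hit
cat(msg, "\n")
```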

ghdecompose builds a data frame and decomposes the full names of the files into

  • commonpath: the path part which is the same for all files,
  • uniquepath: the remaining path part that is specific to a file but not needed to address it uniquely,
  • minpath: the minimal path part needed so that all files are uniquely addressable,
  • filename: the base name of the file, and
  • source: the full name that was given as input.

The short names for the files are built from the components minpath and filename.

ghpath builds up the short name with various path components from a ghdecompose object.

ghfile('data/BANK2.sav')
#> [1] "/tmp/RtmpiCpOOd/mmstat4.dummy-main/data/BANK2.sav"
ghget(local=system.file("zip", "mmstat4.dummy.zip", package="mmstat4"))
fnf <- ghlist(full.names=TRUE)
dfn <- ghdecompose(fnf)
head(dfn)
#>                      commonpath uniquepath minpath       filename
#> 1 /tmp/RtmpiCpOOd/mmstat4.dummy       data           hhberlin.csv
#> 2 /tmp/RtmpiCpOOd/mmstat4.dummy       data         Preisindex.csv
#> 3 /tmp/RtmpiCpOOd/mmstat4.dummy               data      BANK2.sav
#> 4 /tmp/RtmpiCpOOd/mmstat4.dummy       data         12411-0006.csv
#> 5 /tmp/RtmpiCpOOd/mmstat4.dummy       data         child_data.sav
#> 6 /tmp/RtmpiCpOOd/mmstat4.dummy       data                hhD.rda
#>                                              source
#> 1   /tmp/RtmpiCpOOd/mmstat4.dummy/data/hhberlin.csv
#> 2 /tmp/RtmpiCpOOd/mmstat4.dummy/data/Preisindex.csv
#> 3      /tmp/RtmpiCpOOd/mmstat4.dummy/data/BANK2.sav
#> 4 /tmp/RtmpiCpOOd/mmstat4.dummy/data/12411-0006.csv
#> 5 /tmp/RtmpiCpOOd/mmstat4.dummy/data/child_data.sav
#> 6        /tmp/RtmpiCpOOd/mmstat4.dummy/data/hhD.rda
head(ghpath(dfn))
#>                1                2                3                4 
#>   "hhberlin.csv" "Preisindex.csv" "data/BANK2.sav" "12411-0006.csv" 
#>                5                6 
#> "child_data.sav"        "hhD.rda"
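The first step of such a decomposition, splitting off the directory part shared by all files, can be sketched in base R as follows. The helper decompose is hypothetical and simpler than ghdecompose: it returns a single restpath column that corresponds to uniquepath and minpath taken together.

```r
# Hypothetical helper: split full names into the common directory prefix,
# the remaining directory part, and the file name.
decompose <- function(paths) {
  dirs <- strsplit(dirname(paths), "/", fixed = TRUE)
  # count how many leading directory components all paths share
  k <- 0
  while (k < min(lengths(dirs)) &&
         length(unique(vapply(dirs, `[`, character(1), k + 1))) == 1) {
    k <- k + 1
  }
  collapse <- function(p) paste(p, collapse = "/")
  data.frame(
    commonpath = collapse(dirs[[1]][seq_len(k)]),
    restpath   = vapply(dirs, function(p) collapse(p[seq_along(p) > k]), character(1)),
    filename   = basename(paths),
    source     = paths,
    stringsAsFactors = FALSE
  )
}

paths <- c("/tmp/repo/LICENSE",
           "/tmp/repo/data/BANK2.sav",
           "/tmp/repo/examples/data/cluster/dbscan/BANK2.sav")
dec <- decompose(paths)
dec[, c("commonpath", "restpath", "filename")]
```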

RStudio addins

The package comes with two RStudio addins (see under Addins -> MMSTAT4):

  • Open a file from a zip file (ghopenAddin), which gives access to the unzipped zip file and opens the selected file in an RStudio editor window.

  • Execute a Shiny app from a zip file (ghappAddin), which extracts all directories containing Shiny apps and opens the selected app in the default web browser.

Creating your own ZIP file

Preparation: Libraries used and R programs run standalone

Currently there are the following routines to support R code snippets:

  • Rlibs, which extracts all library and require calls from the R code snippets and returns a frequency table of the packages called.
ghget(local=system.file("zip", "mmstat4.dummy.zip", package="mmstat4"))
files <- ghlist(pattern="\\.R$", full.names = TRUE)  # regular expression: names ending in ".R"
head(Rlibs(files), 30)
#> 
#>          Amelia           CHAID       DescTools          GGally           Hmisc 
#>               1               1               6               4               1 
#>            MASS  MissingDataGUI         NbClust       QuantPsyc    RColorBrewer 
#>             130               1               1               6               2 
#>   TeachingDemos          UsingR             VIM additivityTests       agricolae 
#>               1               1               2               1               5 
#>       alphahull         andrews             ape         aplpack             ash 
#>               1               4               1               3               2 
#>            boot             car         cluster            coin          dbscan 
#>               4              13              17               1               3 
#>          deldir        devtools           e1071         effsize         entropy 
#>               1               3               5               1               3
  • Rsolo, which checks whether each R code snippet runs without errors in a freshly started R.
# just check the last files from the list 
# Note that the R console will show more output (warnings etc.)
Rsolo(files, start=435)  
  • Rdups, which finds duplicate files based on checksums.
files <- ghlist(full.names = TRUE)
head(Rdups(files))
#> $c300e8fe6f0bc562256e81670c23d8c0
#> [1] "/tmp/RtmpiCpOOd/mmstat4.dummy/data/BANK2.sav"                        
#> [2] "/tmp/RtmpiCpOOd/mmstat4.dummy/examples/data/cluster/dbscan/BANK2.sav"
#> 
#> $`4efddb6dc6c7ed743221295d55133817`
#> [1] "/tmp/RtmpiCpOOd/mmstat4.dummy/examples/data/nnet/mincer_nnet3.R"
#> [2] "/tmp/RtmpiCpOOd/mmstat4.dummy/examples/data/nnet/mincer_nnet5.R"
#> 
#> $`9f9fe7603aa82f33bbc85a9d32e39d03`
#> [1] "/tmp/RtmpiCpOOd/mmstat4.dummy/examples/data/cluster/dbscan/app.tmpl"       
#> [2] "/tmp/RtmpiCpOOd/mmstat4.dummy/examples/data/mgraphics/scagnostics/app.tmpl"
#> 
#> $`0b74b824367df429803599708daf2e2e`
#> [1] "/tmp/RtmpiCpOOd/mmstat4.dummy/examples/data/subgroup/example_mosaic.R" 
#> [2] "/tmp/RtmpiCpOOd/mmstat4.dummy/examples/data/mgraphics/example_mosaic.R"
#> 
#> $`8eaa4f89e233ba69fcda053d238699aa`
#> [1] "/tmp/RtmpiCpOOd/mmstat4.dummy/examples/data/subgroup/example_mosaic_cotabplot.R" 
#> [2] "/tmp/RtmpiCpOOd/mmstat4.dummy/examples/data/mgraphics/example_mosaic_cotabplot.R"
#> 
#> $`8ed6128aab796148df5e71cbeab547da`
#> [1] "/tmp/RtmpiCpOOd/mmstat4.dummy/examples/data/subgroup/example_mosaic_graphics.R" 
#> [2] "/tmp/RtmpiCpOOd/mmstat4.dummy/examples/data/mgraphics/example_mosaic_graphics.R"
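Checksum-based duplicate detection, as in the Rdups output above, can be reproduced with tools::md5sum from base R. The sketch below is an illustration of the technique on three temporary files, not the code of Rdups; find_duplicates is a hypothetical helper.

```r
# Hypothetical helper: group files by MD5 checksum and keep groups with
# more than one member, i.e. files with identical content.
find_duplicates <- function(files) {
  sums   <- tools::md5sum(files)             # named vector: file -> checksum
  groups <- split(names(sums), unname(sums))
  groups[lengths(groups) > 1]
}

# Three throwaway files; a.R and b.R have the same content
dir <- tempfile("dups")
dir.create(dir)
writeLines("x <- 1", file.path(dir, "a.R"))
writeLines("x <- 1", file.path(dir, "b.R"))
writeLines("x <- 2", file.path(dir, "c.R"))

dups <- find_duplicates(list.files(dir, full.names = TRUE))
lapply(dups, basename)
```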

Note: an error message also appears if the necessary libraries are not installed!
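Missing libraries can be spotted before running anything by collecting the library and require calls from the code and checking which packages are not installed. The following base-R sketch shows the idea; used_packages is a hypothetical helper, its regular expression only catches simple calls such as library(pkg) or require("pkg"), and the code lines are made up.

```r
# Hypothetical helper: frequency table of packages named in library()/require() calls
used_packages <- function(code_lines) {
  pat   <- "\\b(library|require)\\s*\\(\\s*[\"']?([A-Za-z][A-Za-z0-9.]*)"
  calls <- unlist(regmatches(code_lines, gregexpr(pat, code_lines, perl = TRUE)))
  table(sub(pat, "\\2", calls, perl = TRUE))
}

code <- c("library(stats)",
          "require(\"utils\")",
          "x <- 1",
          "library(stats); library('thisPackageDoesNotExist123')")

tab <- used_packages(code)
tab

# Packages that are used but cannot be loaded on this machine
installed     <- vapply(names(tab), requireNamespace, logical(1), quietly = TRUE)
not_installed <- names(tab)[!installed]
not_installed
```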

ZIP file and access names

Once you have created your ZIP file, you need to know under which names a specific file can be accessed. In the example we use a ZIP file which comes with the package mmstat4:

ghget(local=system.file("zip", "mmstat4.dummy.zip", package="mmstat4"))
ghnames <- ghdecompose(ghlist(full.names=TRUE))
ghnames[58,]
#>                       commonpath            uniquepath minpath  filename
#> 58 /tmp/RtmpiCpOOd/mmstat4.dummy examples/data/cluster  dbscan BANK2.sav
#>                                                                  source
#> 58 /tmp/RtmpiCpOOd/mmstat4.dummy/examples/data/cluster/dbscan/BANK2.sav

The shortest possible name is determined by minpath and filename. Longer names that additionally prepend trailing components of uniquepath should also work.

For file number 58, the following access names are possible:

  • BANK2.sav will not work, since there is more than one file named BANK2.sav in the ZIP file.
  • dbscan/BANK2.sav will work, since this is the shortest possible name.
  • cluster/dbscan/BANK2.sav, data/cluster/dbscan/BANK2.sav, and examples/data/cluster/dbscan/BANK2.sav will also work.
x1 <- ghload("BANK2.sav")
#> Possible matches: 
#>   data/BANK2.sav
#>   dbscan/BANK2.sav
#> Error in ghfile(x): Several files for 'BANK2.sav' found, check matches!
x2 <- ghload("dbscan/BANK2.sav")
x3 <- ghload("cluster/dbscan/BANK2.sav")
x4 <- ghload("data/cluster/dbscan/BANK2.sav")
x5 <- ghload("examples/data/cluster/dbscan/BANK2.sav")
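The set of valid access names for one file can be enumerated by testing every trailing part of its path for uniqueness. This is a base-R sketch under the assumption that a name is valid when exactly one file equals it or ends with "/" plus the name; access_names is a hypothetical helper and the paths are given relative to the common path.

```r
# Hypothetical helper: all trailing parts of `target` that identify exactly one file
access_names <- function(target, rel_paths) {
  parts <- strsplit(target, "/", fixed = TRUE)[[1]]
  cands <- vapply(seq_along(parts),
                  function(k) paste(tail(parts, k), collapse = "/"),
                  character(1))
  hits <- vapply(cands, function(s) {
    sum(rel_paths == s | endsWith(rel_paths, paste0("/", s)))
  }, integer(1))
  unname(cands[hits == 1])
}

rel_paths <- c("LICENSE",
               "data/BANK2.sav",
               "examples/data/cluster/dbscan/BANK2.sav")
an <- access_names("examples/data/cluster/dbscan/BANK2.sav", rel_paths)
an
```

BANK2.sav is not part of the result because two files end with it, which corresponds to the error shown for x1 above.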

Frequently asked questions

Something is not working properly. Where can I get help?

Please email me at sigbert@hu-berlin.de. You can also try the current development version of the package from GitHub:

# install.packages("devtools")
devtools::install_github("sigbertklinke/mmstat4")

How can I force a reload of a zip file?

ghget("dummy", .force=TRUE)

How can I store a zip file permanently?

ghget("dummy", .tempdir=FALSE)        # install non-temporarily
ghget("dummy", .tempdir="~/mmstat4")  # install non-temporarily to ~/mmstat4
ghget("dummy", .tempdir=TRUE)         # install again temporarily

Note: If a repository was installed permanently and you switch back to temporary storage, the downloaded files will not be deleted.

How can I find all directories with Shiny apps?

ghget("dummy", .tempdir=TRUE)
ghlist(pattern="/(app|server)\\.R$")
ghopen("dbscan") # open the app
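The same pattern can be applied to any unpacked directory tree with base R alone. The sketch below builds a small temporary tree and returns the directories that contain an app.R or server.R; shiny_dirs is a hypothetical helper and the directory names are made up.

```r
# Hypothetical helper: directories below `root` that contain app.R or server.R
shiny_dirs <- function(root) {
  files <- list.files(root, recursive = TRUE, full.names = TRUE)
  unique(dirname(grep("/(app|server)\\.R$", files, value = TRUE)))
}

# Throwaway tree: two app directories and one directory with an ordinary script
root <- tempfile("apps")
for (d in c("dbscan", "scagnostics", "univariate")) dir.create(file.path(root, d), recursive = TRUE)
writeLines("# single-file app", file.path(root, "dbscan", "app.R"))
writeLines("# server part",     file.path(root, "scagnostics", "server.R"))
writeLines("# ui part",         file.path(root, "scagnostics", "ui.R"))
writeLines("# plain script",    file.path(root, "univariate", "example_ecdf.R"))

apps <- shiny_dirs(root)
basename(apps)
```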

How can I find all csv data files?

ghget("dummy", .tempdir=TRUE)
ghlist(pattern="\\.csv$", ignore.case=TRUE, full.names=TRUE)
#>  [1] "/tmp/RtmpiCpOOd/mmstat4.dummy-main/data/12411-0006.csv"       
#>  [2] "/tmp/RtmpiCpOOd/mmstat4.dummy-main/data/ArbeitsloseBerlin.csv"
#>  [3] "/tmp/RtmpiCpOOd/mmstat4.dummy-main/data/Preisindex.csv"       
#>  [4] "/tmp/RtmpiCpOOd/mmstat4.dummy-main/data/TelefonDaten.csv"     
#>  [5] "/tmp/RtmpiCpOOd/mmstat4.dummy-main/data/haushalte.csv"        
#>  [6] "/tmp/RtmpiCpOOd/mmstat4.dummy-main/data/haushalte_berlin.csv" 
#>  [7] "/tmp/RtmpiCpOOd/mmstat4.dummy-main/data/hhberlin.csv"         
#>  [8] "/tmp/RtmpiCpOOd/mmstat4.dummy-main/data/hhberlin_2017.csv"    
#>  [9] "/tmp/RtmpiCpOOd/mmstat4.dummy-main/data/pechstein.csv"        
#> [10] "/tmp/RtmpiCpOOd/mmstat4.dummy-main/data/rentcap.csv"
# use mmstat4::ghload for importing
ghlist(pattern="\\.csv$")
#>  [1] "12411-0006.csv"        "ArbeitsloseBerlin.csv" "Preisindex.csv"       
#>  [4] "TelefonDaten.csv"      "haushalte.csv"         "haushalte_berlin.csv" 
#>  [7] "hhberlin.csv"          "hhberlin_2017.csv"     "pechstein.csv"        
#> [10] "rentcap.csv"
pechstein <- ghload("pechstein.csv")
str(pechstein)
#> 'data.frame':    29 obs. of  3 variables:
#>  $ Datum        : chr  "04.02.00" "01.02.01" "10.11.01" "06.02.02" ...
#>  $ Tag          : int  34 397 679 767 771 783 1043 1160 1166 1421 ...
#>  $ Retikulozyten: chr  "2,3" "2,5" "2,45" "2,1" ...
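In the str() output, Datum and Retikulozyten are character columns, because the dates are written as dd.mm.yy and the numbers use a decimal comma. A possible conversion in base R is sketched below on the first four rows shown above; it is only one way to do it, and the data frame pechstein_head is typed in by hand for the example.

```r
# The first four rows as shown in the str() output
pechstein_head <- data.frame(
  Datum         = c("04.02.00", "01.02.01", "10.11.01", "06.02.02"),
  Tag           = c(34L, 397L, 679L, 767L),
  Retikulozyten = c("2,3", "2,5", "2,45", "2,1"),
  stringsAsFactors = FALSE
)

# dd.mm.yy -> Date; decimal comma -> decimal point -> numeric
pechstein_head$Datum <- as.Date(pechstein_head$Datum, format = "%d.%m.%y")
pechstein_head$Retikulozyten <-
  as.numeric(sub(",", ".", pechstein_head$Retikulozyten, fixed = TRUE))

str(pechstein_head)

# For these four rows, Tag agrees with the number of days since 2000-01-01
days <- as.integer(pechstein_head$Datum - as.Date("2000-01-01"))
all(days == pechstein_head$Tag)
```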

Default repositories

The package has three default repositories: dummy, hu.stat, and hu.data.

Repository  Size   ZIP file location
dummy       3 MB   https://github.com/sigbertklinke/mmstat4.dummy/archive/refs/heads/main.zip
hu.data     29 MB  https://github.com/sigbertklinke/mmstat4.data/archive/refs/heads/main.zip
hu.stat     31 MB  https://github.com/sigbertklinke/mmstat4.stat/archive/refs/heads/main.zip

dummy is a small subsample of hu.stat and hu.data, intended for examples and test purposes.

Lecture Notes Sigbert Klinke, HU Berlin

Basic statistics I+II (in German)

Mathematical foundations - Introduction - Basic concepts - Univariate distributions - Parameters of univariate distributions - Bivariate distributions - Parameters of bivariate distributions - Regression analysis - Time series analysis - Index numbers - Probability calculus - Random variables - How to lie with statistics - Important distribution models - Sampling theory - Statistical estimation methods - Regression model - Confidence intervals - Statistical tests - Parametric tests - Nonparametric tests

ghget("hu.stat")
ghopen("Statistik.pdf")
ghopen("Aufgaben.pdf")
ghopen("Loesungen.pdf")
ghopen("Formelsammlung.pdf")

Data analysis

General - R - Basics and data generation - Test and estimation theory - Parameter of distributions - Distribution - Transformations - Robust statistics - Missing values - Subgroup analysis - Correlation and association - Multivariate graphics - Principal component analysis - Exploratory factor analysis - Reliability - Cluster analysis - Regression analysis - Linear regression - Nonparametric regression - Classification and regression trees - Neural networks

ghget("hu.data")
ghopen("dataanalysis.pdf")

Lecture Notes Bernd Rönz, HU Berlin (in German)

Computergestützte Statistik I mit SPSS 10 (2001)

Introduction - Detection and identification of outliers - Testing the distributional form of variables - Parameter comparisons for independent samples - Appendices A-D, bibliography, index

ghget("hu.data")
ghopen("cs1_roenz.pdf")

Computergestützte Statistik II mit SPSS 10 (2000)

Preface - Testing relationships - Regression analysis - Reliability and homogeneity analysis of constructs - Appendices A-H, bibliography, subject index

ghget("hu.data")
ghopen("cs2_roenz.pdf")

Generalisierte lineare Modelle mit SPSS 10 (2001)

Introduction - Generalized linear models (GLM) - Modelling binary data - The multinomial logit model - Modelling multinomial data (log-linear models) - Bibliography, index

ghget("hu.data")
ghopen("glm_roenz.pdf")