MIDAS offers a scalable solution to multiple imputation of very large datasets. Such applications can, of course, have sizable memory requirements that push the limits of a personal computer. Users may therefore wish to “offshore” model training to a virtual server that has greater memory capacity.
This guide runs through how to configure an Amazon AWS server instance to run MIDAS in R (and Python).
Amazon AWS (and other server providers) provide a range of different hardware configurations. In our experience, the largest bottleneck to training times is memory as sizable datasets are memory-intensive. Both the t5 and c6 classes offer cost-efficient and memory-scalable solutions, which we have used successfully in our own testing and workflows.
Users may also wish to use GPU-acceleration. More information on GPU-compatible classes can be found here. Note, MIDASpy is not configured to take advantage of distributed training at present.
When configuring the instance initially, we recommend choosing the Ubuntu operating system (the latest version at time of writing was 22.04 LTS).
Once instantiated, tunnel into the instance from the command line (making sure any authentication key is stored in the current directory):
ssh -i [your_authentication_key_name.pem] ubuntu@[server.address]
At this point, we are ready to configure the software on the server instance. First, we install a series of common tools required for compiling software:
Then, we install C++ compilers:
Next, we download and install miniconda to allow us
to isolate the Python dependencies for rMIDAS. Note,
you may be asked whether to initialise miniconda after install. Type
yes when prompted:
Finally, we can setup a conda environment using the YAML file available on GitHub or by using the hard-coded URL:
To setup R on the server instance, we need to install a few extra helper packages, as well as R itself, by running the following code at the command line:
sudo apt update -qq sudo apt install --no-install-recommends software-properties-common dirmngr wget -qO- https://cloud.r-project.org/bin/linux/ubuntu/marutter_pubkey.asc | sudo tee -a /etc/apt/trusted.gpg.d/cran_ubuntu_key.asc sudo add-apt-repository "deb https://cloud.r-project.org/bin/linux/ubuntu $(lsb_release -cs)-cran40/" sudo apt update sudo apt install r-base sudo apt install build-essential
The server is now configured and ready to be used. Remember to include a line at the beginning of your script that points R to the conda environment we set up above: