---
title: "Introduction to dataSDA"
author: "Po-Wei Chen, Chun-houh Chen and Han-Ming Wu*"
date: "2025-04-30"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
    number_sections: true
vignette: >
  %\VignetteIndexEntry{Introduction to dataSDA}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(warning = FALSE)
knitr::opts_chunk$set(message = FALSE)
knitr::opts_chunk$set(out.width = "100%")
knitr::opts_chunk$set(fig.align = 'center')
library(knitr)
library(dataSDA)
library(RSDA)
```

$$\\[0.5in]$$

# Introduction

The primary aim of the `dataSDA` package is to gather symbolic datasets tailored to different research themes, to read, write, and convert symbolic data in diverse formats, and to compute descriptive statistics of symbolic variables. The `dataSDA` package is available on the Comprehensive R Archive Network (CRAN) at <https://CRAN.R-project.org/package=dataSDA> and at <https://hmwu.idv.tw/dataSDA/>.

**Current version** (built on April 30, 2025): [dataSDA_0.1.1.zip](https://hmwu.idv.tw/dataSDA/dataSDA_0.1.1.zip)

$$\\[0.5in]$$

# Symbolic data formats conversion

## Example: convert interval-valued datasets into the `symbolic_tbl` class

To illustrate two distinct classes of interval-valued datasets in `R`, we use two datasets: `Abalone` and `mushroom`. The `Abalone` dataset includes 24 observations, each featuring 7 interval-valued variables. It is an object of the `symbolic_tbl` class, and each variable within it belongs to the `symbolic_n` class. Both of these classes are defined by the `RSDA` package. Storing a dataset as a `symbolic_tbl` object makes it easier to apply the symbolic data methods provided by `RSDA`.
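
As a rough illustration of what "interval-valued" means before any conversion, the sketch below stores each interval as a pair of `_min`/`_max` columns in a plain `data.frame` and derives interval centers and ranges with base `R`. The object `toy` and its column names are made up for illustration and are not part of `dataSDA` or `RSDA`; the result is an ordinary `data.frame`, not a `symbolic_tbl`.

```r
# Hypothetical toy data: two interval-valued variables stored as min/max pairs.
toy <- data.frame(
  length_min = c(4.0, 5.5, 3.2),
  length_max = c(9.0, 8.5, 7.8),
  width_min  = c(0.5, 1.0, 0.4),
  width_max  = c(2.5, 3.0, 1.6),
  row.names  = c("obs1", "obs2", "obs3")
)

# Every interval must satisfy min <= max.
stopifnot(all(toy$length_min <= toy$length_max),
          all(toy$width_min <= toy$width_max))

# Interval centers (midpoints) and ranges for one variable.
length_center <- (toy$length_min + toy$length_max) / 2
length_range  <- toy$length_max - toy$length_min

data.frame(center = length_center, range = length_range,
           row.names = rownames(toy))
```

A layout of this kind, with one lower-bound and one upper-bound column per variable, is the sort of input that the conversion steps in this section start from.
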
The `Abalone` dataset is a built-in dataset in the `RSDA` package. A copy of it is also included in the `dataSDA` package, renamed `Abalone.int`.

```{r prompt=TRUE}
library(dataSDA)
data(Abalone.iGAP)
head(Abalone.iGAP)
class(Abalone.iGAP)
data(Abalone)
head(Abalone)
class(Abalone)
```

The `mushroom` dataset consists of 23 species described by 3 interval-valued variables: stipe length, stipe thickness, and pileus cap width. The dataset also contains two other categorical variables: Species and Edibility. The dataset is of class `data.frame` by default, and we would like to convert it into the `symbolic_tbl` class.

```{r prompt=TRUE}
data(mushroom)
head(mushroom)
```

Within the framework of symbolic data analysis (SDA), the variables "Species" and "Edibility" are treated as set variables. We have developed the `set_variable_format` function to create pseudo-variables that correspond to the categories of a specified categorical variable, using one-hot encoding. The `location` argument denotes the position of the set variable in the data. Following the restructuring of the dataset, the values assigned to the "Species" and "Edibility" variables are modified to reflect the number of categories associated with each variable.

```{r prompt = TRUE}
mushroom_set <- set_variable_format(data = mushroom, location = 8, var = "Species")
head(mushroom_set, 3)
```

To adhere to the formatting conventions of `RSDA`, we have implemented the `RSDA_format` function. This function prefixes each variable with a `$` symbol to indicate the type of symbolic variable. Specifically, set variables are prefixed with `$S`, and interval-valued variables are prefixed with `$I`. The syntax of the `RSDA_format` function with its arguments is as follows.

```
RSDA_format(data, sym_type1, location, sym_type2, var)
```

* `data`: a conventional data set.
* `sym_type1`, `sym_type2`: the variable type labels, where `I` denotes an interval variable and `S` denotes a set variable.
* `location`: the location of each `sym_type1` variable in the data.
* `var`: the name of each `sym_type2` variable in the data.

```{r prompt = TRUE}
mushroom_tmp <- RSDA_format(data = mushroom_set,
                            sym_type1 = c("I", "I", "I", "S"),
                            location = c(25, 27, 29, 31),
                            sym_type2 = c("S"),
                            var = c("Species"))
head(mushroom_tmp, 3)
```

The suffixes `min` and `max` are removed from the variable names using the `clean_colnames` function. Subsequently, the modified dataset is written out using the `write_csv_table` function. This external data file is then read using the `read.sym.table` function, which is provided by the `RSDA` package. Upon import, the dataset is of class `symbolic_tbl`.

```{r prompt = TRUE}
mushroom_clean <- clean_colnames(data = mushroom_tmp)
head(mushroom_clean, 3)
```

Write the data object to a CSV file and read it back as a `symbolic_tbl` object.

```{r prompt = TRUE}
write_csv_table(data = mushroom_clean, file = "mushroom_interval.csv")
mushroom_int <- read.sym.table(file = 'mushroom_interval.csv', header = TRUE, sep = ';', dec = '.', row.names = 1)
head(mushroom_int, 3)
class(mushroom_int)
```

## Example: the conversion of histogram-valued datasets into the `MatH` class

To demonstrate the process of converting a dataset into a matrix of histogram-valued data, specifically the `MatH` class, we use two datasets: `BLOOD` and `Weight`. The `MatH` class is provided by the `HistDAWass` package. The `BLOOD` dataset is a `MatH` object, supplied by the `HistDAWass` package, and it encompasses 14 groups of patients, each characterized by three distributional variables. Each distribution within a cell is depicted by its mean and standard deviation.

```{r prompt = TRUE, eval = FALSE}
library(dataSDA)
data(BLOOD)
BLOOD[1:3, 1:2]
```

Below, we illustrate the process of transforming a `list` object into an instance of the `MatH` class, named `Weight`. We use the `distributionH` function from the `HistDAWass` package to encapsulate the histogram-valued data present in the dataset.
Subsequently, the constructed `Weight` dataset comprises 3 observations, each with 1 variable. We then use the `new` function from the `methods` package to create the data object as a member of the `MatH` class. As a result, we can apply the analysis methods offered by the `HistDAWass` package to objects of the `MatH` class.

```{r prompt = TRUE}
library(HistDAWass)
BLOOD[1:3, 1:2]
```

```{r prompt = TRUE}
# A1-A3: bin boundaries; B1-B3: cumulative probabilities at those boundaries
A1 <- c(50, 60, 70, 80, 90, 100, 110, 120)
B1 <- c(0.00, 0.02, 0.08, 0.32, 0.62, 0.86, 0.92, 1.00)
A2 <- c(50, 60, 70, 80, 90, 100, 110, 120)
B2 <- c(0.00, 0.05, 0.12, 0.42, 0.68, 0.88, 0.94, 1.00)
A3 <- c(50, 60, 70, 80, 90, 100, 110, 120)
B3 <- c(0.00, 0.03, 0.24, 0.36, 0.75, 0.85, 0.98, 1.00)
List <- list(A1, B1, A2, B2, A3, B3)
List

# Build one distributionH object per (A, B) pair
ListOfWeight <- vector("list", 3)
for (i in seq_along(ListOfWeight)) {
  # Elements 2i - 1 and 2i of List hold the i-th (A, B) pair
  ListOfWeight[[i]] <- distributionH(List[[2 * i - 1]], List[[2 * i]])
}

# Assemble the distributions into a 3 x 1 MatH object
Weight <- methods::new("MatH",
                       nrows = 3, ncols = 1, ListOfDist = ListOfWeight,
                       names.rows = c("20s", "30s", "40s"),
                       names.cols = c("weight"), by.row = FALSE)
Weight
```

## Example: convert iGAP format to MM format and RSDA format.

To convert iGAP files to MM format:

```{r prompt = TRUE}
data(Face.iGAP)
class(Face.iGAP)
head(Face.iGAP)
Face <- iGAP_to_MM(data = Face.iGAP, location = 1:6)
head(Face)
```

Change the format of the data to conform to the RSDA format.

```{r prompt = TRUE}
Face.tmp <- RSDA_format(data = Face,
                        sym_type1 = c("I", "I", "I", "I", "I", "I"),
                        location = c(1, 3, 5, 7, 9, 11))
head(Face.tmp)
```

Clean up variable names to conform to the RSDA format.

```{r prompt = TRUE}
Face.clean <- clean_colnames(data = Face.tmp)
head(Face.clean)
```

Write a symbolic data table to a CSV data file.

```{r prompt = TRUE}
write_csv_table(data = Face.clean, file = 'Face_interval.csv')
```

Read the symbolic data table and check the format.

```{r prompt = TRUE}
Face.interval <- read.sym.table(file = 'Face_interval.csv', header = TRUE, sep = ';', dec = '.', row.names = 1)
head(Face.interval)
```

## Example: convert RSDA format to MM format and iGAP format.

Convert an RSDA-format interval data frame to MM format.

```{r prompt = TRUE}
Face.MM <- RSDA_to_MM(Face.interval, RSDA = TRUE)
head(Face.MM)
```

Convert an MM-format interval data frame to iGAP format.

```{r prompt = TRUE}
Face.iGAP_trans <- MM_to_iGAP(Face.MM)
head(Face.iGAP_trans)
```

$$\\[0.5in]$$

# Descriptive statistics

## For interval-valued data

```{r prompt = TRUE}
data(mushroom.int)

# Mean of a single variable, selected by name
int_mean(mushroom.int, var_name = "Pileus.Cap.Width")
# Means of several variables, selected by column index
int_mean(mushroom.int, var_name = 2:3)

# Means and variances of two variables under several methods
var_name <- c("Stipe.Length", "Stipe.Thickness")
method <- c("CM", "FV", "EJD")
int_mean(mushroom.int, var_name, method)
int_var(mushroom.int, var_name, method)

# Covariances and correlations between one variable and two others
var_name1 <- "Pileus.Cap.Width"
var_name2 <- c("Stipe.Length", "Stipe.Thickness")
method <- c("CM", "VM", "QM", "SE", "FV", "EJD", "GQ", "SPT")
int_cov(mushroom.int, var_name1, var_name2, method)
int_cor(mushroom.int, var_name1, var_name2, method)
```

## For histogram-valued data

```{r prompt = TRUE}
data(BLOOD)

# Mean and variance of one histogram-valued variable
hist_mean(BLOOD, "Cholesterol")
hist_var(BLOOD, "Cholesterol")

# Covariance and correlation between two histogram-valued variables
hist_cov(BLOOD, 'Cholesterol', 'Hemoglobin', method = "B")
hist_cor(BLOOD, 'Cholesterol', 'Hemoglobin', method = "L2W")
```

$$\\[0.5in]$$

# Symbolic dataset donation/submission guidelines

We welcome contributions of high-quality datasets for symbolic data analysis. Submitted datasets will be made publicly available (or released under specified constraints) to support research in machine learning, statistics, and related fields. You can submit the related files via email to [wuhm@g.nccu.edu.tw](mailto:wuhm@g.nccu.edu.tw) or through the Google Form at [Symbolic Dataset Submission Form](https://forms.gle/AB6UCsNkrTzqDTp97). The submission requirements are as follows.

1. **Dataset Format**:
    - Preferred formats: `.csv`, `.xlsx`, or any symbolic format in plain text.
    - Compressed (`.zip` or `.gz`) if multiple files are included.

2. **Required Metadata**: Contributors must provide the following details:

    | **Field** | **Description** | **Example** |
    |-----------|-----------------|-------------|
    | **Dataset Name** | A clear, descriptive title. | "face recognition data" |
    | **Dataset Short Name** | A clear, abbreviated title. | "face data" |
    | **Authors** | Full names of the donors. | "First name, Last name" |
    | **E-mail** | Contact email. | "abc123@gmail.com" |
    | **Institutes** | Affiliated organizations. | "-" |
    | **Country** | Origin of the dataset. | "France" |
    | **Dataset Descriptions** | Description of the data. | See 'README' |
    | **Sample Size** | Number of instances/rows. | 27 |
    | **Number of Variables** | Total features/columns (categorical/numeric). | 6 (interval) |
    | **Missing Values** | Indicate whether missing values exist and how they are handled. | "None" / "Yes, marked as NA" |
    | **Variable Descriptions** | Detailed description of each column (name, type, units, range). | See 'README' |
    | **Source** | Original data source (if applicable). | "Leroy et al. (1996)" |
    | **References** | Citations for prior work using the dataset. | "Douzal-Chouakria, Billard, and Diday (2011)" |
    | **Applied Areas** | Relevant fields (e.g., biology, finance). | "Machine Learning" |
    | **Usage Constraints** | Licensing (CC-BY, MIT) or restrictions. | "Public domain" |
    | **Data Link** | URL to download the dataset (Google Drive, GitHub, etc.). | "(https)" |

3. **Quality Assurance**:
    - Datasets should be **clean** (no sensitive/private data).

4. **Optional (Recommended)**:
    - A companion `README` file with:
        - Dataset background.
        - Suggested use cases.
        - Known limitations.

$$\\[0.5in]$$

# Citation

Po-Wei Chen, Chun-houh Chen, Han-Ming Wu (2025), dataSDA: datasets and basic statistics for symbolic data analysis in R. Technical report.