This package contains R functions corresponding to useful Stata commands.

The package includes:

- panel data functions (monthly/quarterly dates, lead/lag, fillin)
- data.frame functions (tabulate, merge)
- vector functions (xtile, pctile, winsorize)
- graph functions (binscatter)

`sum_up` prints detailed summary statistics (corresponds to Stata `summarize`).

```
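library(dplyr)
library(statar)

N <- 100
df <- tibble(
  id = 1:N,
  v1 = sample(5, N, TRUE),
  v2 = sample(1e6, N, TRUE)
)
# summary statistics for all variables
sum_up(df)
# detailed statistics for variables starting with "v"
df %>% sum_up(starts_with("v"), d = TRUE)
# summary statistics by group
df %>% group_by(v1) %>% sum_up()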
```

`tab` prints distinct rows with their count. Compared to the dplyr function `count`, this command adds frequency, percent, and cumulative percent.

```
N <- 1e2; K <- 10
df <- tibble(
  id = sample(c(NA, 1:5), N/K, TRUE),
  v1 = sample(1:5, N/K, TRUE)
)
tab(df, id)
tab(df, id, na.rm = TRUE)
tab(df, id, v1)
```

`join` is a wrapper for dplyr merge functionalities, with the following added options:

- The option `check` checks that there are no duplicates in the master or using datasets (as in Stata). For example, Stata's `merge m:1 v1` corresponds to `join(x, y, kind = "full", check = m~1)`.
- The option `gen` specifies the name of a new variable that identifies non-matched and matched rows (as in Stata). For example, Stata's `merge m:1 v1, gen(_merge)` corresponds to `join(x, y, kind = "full", gen = "_merge")`.
- The option `update` replaces missing values in the master dataset with the corresponding values from the using dataset.
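To make the idea behind `update` concrete, here is a minimal base R sketch of filling missing master values from the using dataset. It does not call `join` itself, and the data frames `master` and `using` and their columns are made up for illustration.

```r
# Hypothetical data: `master` has a missing value that `using` can supply.
master <- data.frame(id = c(1, 2, 3), v = c(10, NA, 30))
using  <- data.frame(id = c(2, 3, 4), v = c(20, 99, 40))

# Full merge on id; the using copy of v gets the suffix ".using".
merged <- merge(master, using, by = "id", all = TRUE, suffixes = c("", ".using"))

# Keep the master value when it is present, otherwise take the using value.
merged$v <- ifelse(is.na(merged$v), merged$v.using, merged$v)
merged$v.using <- NULL
merged
```

Note that non-missing master values are left untouched: for `id == 3` the master value 30 is kept even though the using dataset holds 99.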

```
# pctile computes quantile and weighted quantile of type 2 (similarly to Stata _pctile)
v <- c(NA, 1:10)
pctile(v, probs = c(0.3, 0.7), na.rm = TRUE)
# xtile creates integer variable for quantile categories (corresponds to Stata xtile)
v <- c(NA, 1:10)
xtile(v, n_quantiles = 3) # 3 groups based on terciles
xtile(v, probs = c(0.3, 0.7)) # 3 groups based on two quantiles
xtile(v, cutpoints = c(2, 3)) # 3 groups based on two cutpoints
# winsorize (default based on 5 x interquartile range)
v <- c(1:4, 99)
winsorize(v)
winsorize(v, replace = NA)
winsorize(v, probs = c(0.01, 0.99))
winsorize(v, cutpoints = c(1, 50))
```

The classes “monthly” and “quarterly” print as dates and are compatible with usual time extraction (i.e. `month`, `year`, etc.). Yet, they are stored as integers representing the number of elapsed periods since 1970/01/01 (in months and quarters, respectively). This is particularly handy for simple algebra:

```
# elapsed dates
library(lubridate)
date <- mdy(c("04/03/1992", "01/04/1992", "03/15/1992"))
datem <- as.monthly(date)
# displays as a period
datem
#> [1] "1992m04" "1992m01" "1992m03"
# behaves as an integer for numerical operations:
datem + 1
#> [1] "1992m05" "1992m02" "1992m04"
# behaves as a date for period extractions:
year(datem)
#> [1] 1992 1992 1992
```
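The integer encoding can be illustrated with a small base R sketch. The helper `months_since_1970` is hypothetical (it is not part of the package), and it assumes the month count starts at zero in January 1970, as the description above suggests.

```r
# Hypothetical helper: number of months elapsed since January 1970.
months_since_1970 <- function(d) {
  lt <- as.POSIXlt(d)
  # lt$year counts years since 1900; lt$mon is 0-based (January = 0).
  (lt$year + 1900 - 1970) * 12 + lt$mon
}

d <- as.Date(c("1992-04-03", "1992-01-04", "1992-03-15"))
m <- months_since_1970(d)
m      # one month count per date; the day of the month is ignored
m + 1  # adding 1 moves every date forward by exactly one month
```

Because the day of the month is dropped, arithmetic on these counts never runs into month-length issues.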

`tlag`/`tlead` lag/lead a vector with respect to a number of periods, **not** with respect to the number of rows:

```
year <- c(1989, 1991, 1992)
value <- c(4.1, 4.5, 3.3)
tlag(value, 1, time = year)
library(lubridate)
date <- mdy(c("01/04/1992", "03/15/1992", "04/03/1992"))
datem <- as.monthly(date)
value <- c(4.1, 4.5, 3.3)
tlag(value, time = datem)
```
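The difference between lagging by periods and lagging by rows can be sketched in base R. The helper `lag_by_time` below is hypothetical and only illustrates the idea; it is not how `tlag` is necessarily implemented.

```r
# Hypothetical helper: lag `x` by `n` periods of `time` (not by n rows).
lag_by_time <- function(x, n = 1, time) {
  # For each observation, look up the row whose time equals time - n;
  # match() returns NA when that period is absent from the data.
  x[match(time - n, time)]
}

year  <- c(1989, 1991, 1992)
value <- c(4.1, 4.5, 3.3)
# 1988 and 1990 are not in the data, so only 1992 gets a lagged value (4.5).
lag_by_time(value, 1, time = year)
```

A row-based lag would instead assign 4.1 to 1991, even though 4.1 was observed two years earlier.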

In contrast to comparable functions in `zoo` and `xts`, these functions can be applied to any vector and be used within a `dplyr` chain:

```
df <- tibble(
  id = c(1, 1, 1, 2, 2),
  year = c(1989, 1991, 1992, 1991, 1992),
  value = c(4.1, 4.5, 3.3, 3.2, 5.2)
)
df %>% group_by(id) %>% mutate(value_l = tlag(value, time = year))
```

`is.panel` checks whether a dataset is a panel, i.e. the time variable is never missing and the combinations (id, time) are unique.

```
df <- tibble(
  id1 = c(1, 1, 1, 2, 2),
  id2 = 1:5,
  year = c(1991, 1993, NA, 1992, 1992),
  value = c(4.1, 4.5, 3.3, 3.2, 5.2)
)
df %>% group_by(id1) %>% is.panel(year)
df1 <- df %>% filter(!is.na(year))
df1 %>% is.panel(year)
df1 %>% group_by(id1) %>% is.panel(year)
df1 %>% group_by(id1, id2) %>% is.panel(year)
```
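The two conditions can be written out as a short base R check. The function `is_panel_sketch` is a hypothetical stand-in that takes plain vectors; it only mirrors the definition given above and is not the package's implementation.

```r
# Hypothetical helper mirroring the two conditions of a panel.
is_panel_sketch <- function(id, time) {
  # Condition 1: the time variable is never missing.
  no_missing_time <- !anyNA(time)
  # Condition 2: each (id, time) combination appears at most once.
  unique_pairs <- anyDuplicated(data.frame(id = id, time = time)) == 0
  no_missing_time && unique_pairs
}

id1  <- c(1, 1, 1, 2, 2)
year <- c(1991, 1993, NA, 1992, 1992)
is_panel_sketch(id1, year)              # FALSE: year has a missing value
keep <- !is.na(year)
is_panel_sketch(id1[keep], year[keep])  # FALSE: (2, 1992) appears twice
is_panel_sketch(1:4, year[keep])        # TRUE: every (id, time) pair is unique
```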

`fill_gap` transforms an unbalanced panel into a balanced panel. It corresponds to the Stata command `tsfill`. Missing observations are added as rows with missing values.

```
df <- tibble(
  id = c(1, 1, 1, 2),
  datem = as.monthly(mdy(c("04/03/1992", "01/04/1992", "03/15/1992", "05/11/1992"))),
  value = c(4.1, 4.5, 3.3, 3.2)
)
df %>% group_by(id) %>% fill_gap(datem)
df %>% group_by(id) %>% fill_gap(datem, full = TRUE)
df %>% group_by(id) %>% fill_gap(datem, roll = "nearest")
```
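As a rough illustration of what filling gaps means, the base R sketch below expands each id to every period between its first and last observation. The helper `fill_gap_sketch` and the example data (integer years rather than monthly dates) are assumptions made for illustration; the sketch does not cover the `full` or `roll` options.

```r
# Hypothetical helper: add rows for periods missing within each id.
fill_gap_sketch <- function(df, id, time) {
  pieces <- lapply(split(df, df[[id]]), function(g) {
    # All periods between the first and last observation of this id.
    grid <- data.frame(unique(g[[id]]), seq(min(g[[time]]), max(g[[time]])))
    names(grid) <- c(id, time)
    # Left join: periods absent from the data get NA in the other columns.
    merge(grid, g, by = c(id, time), all.x = TRUE)
  })
  out <- do.call(rbind, pieces)
  out <- out[order(out[[id]], out[[time]]), ]
  rownames(out) <- NULL
  out
}

df <- data.frame(
  id    = c(1, 1, 1, 2),
  year  = c(1989, 1991, 1992, 1991),
  value = c(4.1, 4.5, 3.3, 3.2)
)
filled <- fill_gap_sketch(df, "id", "year")
filled  # id 1 gains a row for 1990 with a missing value
```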

`stat_binmean()` (a `stat` for ggplot2) returns the mean of `y` and `x` within 20 bins of `x`. It’s a barebones version of the Stata command `binscatter`.

```
ggplot(iris, aes(x = Sepal.Width , y = Sepal.Length)) + stat_binmean()
# change number of bins
ggplot(iris, aes(x = Sepal.Width , y = Sepal.Length, color = Species)) + stat_binmean(n = 10)
# add regression line
ggplot(iris, aes(x = Sepal.Width , y = Sepal.Length, color = Species)) + stat_binmean() + stat_smooth(method = "lm", se = FALSE)
```

You can install