# Main function for the estimation of domain-level (nonlinear) indicators with MERFs

Source:`R/SAEforest_model.R`

`SAEforest_model.Rd`

This function enables the use of Mixed Effects Random Forests (MERFs)
for applications of Small Area Estimation (SAE). Unit-level survey data on a target and
auxiliary covariates is required to produce reliable estimates of various
disaggregated economic and inequality indicators. Option `meanOnly`

saves computational
time for users that are only interested in the estimation of domain-specific means using
unit-level and aggregated auxiliary data. Predefined indicators include
the mean, median, quantiles (`10%, 25%, 75% and 90%`

), the head count ratio, the poverty gap,
the Gini-coefficient and the quintile share ratio. The MERF algorithm is an algorithmic procedure
reminiscent of an EM-algorithm (see Details). Overall, the function serves as a coherent framework
for the estimation of point estimates and if requested uncertainty estimates for the indicators.
Methodological details are found in Krennmair & Schmid (2022) and Krennmair et al. (2022b).
The following examples showcase further potential applications.

## Usage

```
SAEforest_model(
Y,
X,
dName,
smp_data,
pop_data,
MSE = "none",
meanOnly = TRUE,
aggData = FALSE,
smearing = TRUE,
popnsize = NULL,
importance = "impurity",
OOsample_obs = 25,
ADDsamp_obs = 0,
w_min = 3,
B = 100,
B_adj = 100,
B_MC = 100,
threshold = NULL,
custom_indicator = NULL,
initialRandomEffects = 0,
ErrorTolerance = 1e-04,
MaxIterations = 25,
aggregate_to = NULL,
na.rm = TRUE,
...
)
```

## Arguments

- Y
Continuous input value of target variable.

- X
Matrix or data.frame of predictive covariates.

- dName
Character specifying the name of the domain identifier, for which random intercepts are modeled.

- smp_data
data.frame of survey sample data including the specified elements of

`Y`

and`X`

.- pop_data
data.frame of unit-level population covariate data

`X`

. Please note that the column names of predictive covariates must match column names of`smp_data`

. This holds especially for the name of the domain identifier.- MSE
Character input specifying the type of uncertainty estimates. Available options are: (i) "none" if only point estimates are requested, (ii) "nonparametric" following the MSE bootstrap procedure proposed by Krennmair & Schmid (2022) or by Krennmair et al. (2022a) if

`aggData = TRUE`

. (iii) "wild" only for nonlinear indicators proposed by Krennmair et al. (2022b). Defaults to "none".- meanOnly
Logical. Calculating domain-level means only. Defaults to

`TRUE`

.- aggData
Logical input indicating whether aggregated covariate information or unit-level covariate information is used for domain-level means. Defaults to

`FALSE`

, assuming unit-level covariate data.- smearing
Logical input indicating whether a smearing based approach or a Monte Carlo (MC) version for point estimates should be obtained to estimate (nonlinear) indicators. MC should be used if computational constraints prohibit a smearing approach. For theoretical details see Krennmair et al (2022b). Defaults to

`TRUE`

.- popnsize
data.frame, comprising information of population size of domains. Only needed if

`aggData = TRUE`

and a MSE is requested. Please note that the name of the domain identifier must match the column name of`smp_data`

.- importance
Variable importance mode processed by the random forest from ranger. Must be 'none', 'impurity', 'impurity_corrected' or 'permutation'. A concept of variable importance is needed for the production of generic plots

`plot`

. For the estimation of domain-level means under aggregated covariate data, variable importance is needed to rank information in the process of finding suitable calibration weights (Krennmair et al., 2022b). For further information regarding measures of importance see ranger.- OOsample_obs
Number of out-of-sample observations taken from the closest area for potentially unsampled areas. Only needed if

`aggData = TRUE`

with defaults to 25.- ADDsamp_obs
Number of out-of-sample observations taken from the closest area if first iteration for the calculation of calibration weights fails. Only needed if

`aggData = TRUE`

with defaults to 0.- w_min
Minimal number of covariates from which informative weights are calculated. Only needed if

`aggData = TRUE`

. Defaults to 3.- B
Number of bootstrap replications for MSE estimation procedures. Defaults to 100.

- B_adj
Number of bootstrap replications for the adjustment of residual variance proposed by Mendez and Lohr (2001). Defaults to 100.

- B_MC
Number of bootstrap populations in the MC version for point estimates of (nonlinear) indicators. Defaults to 100.

- threshold
Set a custom threshold for indicators, such as the head count ratio. The threshold can be a known numeric value or function of

`Y`

. If the threshold is`NULL`

,`60 %`

of the median of`Y`

is taken as threshold. Defaults to`NULL`

.- custom_indicator
A list of additional functions containing the indicators to be calculated. These functions must only depend on the target variable

`Y`

and optionally the`threshold`

. Defaults to`NULL`

.- initialRandomEffects
Numeric value or vector of initial estimates of random effects. Defaults to 0.

- ErrorTolerance
Numeric value to monitor the MERF algorithm's convergence. Defaults to 1e-04.

- MaxIterations
Numeric value specifying the maximal amount of iterations for the MERF algorithm. Defaults to 25.

- aggregate_to
Character containing the name of a variable from population data that indicates the target domain level for which the results are to be displayed. Only available if

`aggData = FALSE`

. Defaults to`NULL`

.- na.rm
Logical. Whether missing values should be removed. Defaults to

`TRUE`

.- ...
Additional parameters are directly passed to the random forest ranger. Most important parameters are for instance

`mtry`

(number of variables to possibly split at in each node), or`num.trees`

(number of trees). For further details on possible parameters see ranger and the example below.

## Value

An object of class `SAEforest`

includes point estimates for disaggregated indicators
as well as information on the MERF-model. Optionally corresponding MSE estimates are returned.
Several generic functions have methods for the returned object of class `SAEforest`

. For a full
list and explanation of components and possibilities for objects of class `SAEforest`

,
see `SAEforestObject`

.

## Details

MERFs combine advantages of regression forests (such as implicit model-selection and robustness properties) with the ability to model hierarchical dependencies.

The MERF algorithm iteratively optimizes two separate steps: a) the random forest function, assuming the random effects term to be correct and b) estimates the random effects part, assuming the OOB-predictions from the forest to be correct. Overall convergence of the algorithm is monitored by log-likelihood of a joint model of both components. For further details see Krennmair and Schmid (2022).

Users that are only interested in the estimation of domain-level means should set `meanOnly = TRUE`

.
The MERF requires covariate micro-data. This function, however also
allows for the use of aggregated covariate information, by setting `aggData = TRUE`

. Aggregated
covariate information is adaptively incorporated through calibration-weights based on empirical likelihood
for the estimation of area-level means. See methodological details in Krennmair et al. (2022a)

For the estimation of (nonlinear) poverty indicators and/or quantiles, we need information on the
area-specific cumulative distribution function (CDF) of `Y`

. Krennmair et al. (2022b) propose a smearing
approach originated by Duan (1983). Alternatively, Monte-Carlo methods are used to simulate the domain-specific
CDF of `Y`

.

For the estimation of the MSE, the bootstrap population is built based on a bias-corrected residual variance as discussed in Krennmair and Schmid (2022). The bootstrap bias correction follows Mendez and Lohr (2011).

Note that the `MERFmodel`

object is a composition of elements from a random forest of class
`ranger`

and a random effects model of class `merMod`

. Thus, all generic functions are
applicable to corresponding objects. For further details on generic functions see `ranger`

and `lmer`

as well as the examples below.

## References

Duan, N. (1983). Smearing Estimate: A Nonparametric Retransformation Method. Journal of the American Statistical Association, 78(383), 605–610.

Krennmair, P., & Schmid, T. (2022). Flexible Domain Prediction Using Mixed Effects Random Forests. Journal of Royal Statistical Society: Series C (Applied Statistics) (forthcoming).

Krennmair, P., & Würz, N. & Schmid, T. (2022a). Analysing Opportunity Cost of Care Work using Mixed Effects Random Forests under Aggregated Census Data.

Krennmair, P., & Schmid, T & Tzavidis, Nikos. (2022b). The Estimation of Poverty Indicators Using Mixed Effects Random Forests. Working Paper.

Mendez, G., & Lohr, S. (2011). Estimating residual variance in random forest regression. Computational Statistics & Data Analysis, 55 (11), 2937–2950.

## Examples

```
# \donttest{
# Loading data
data("eusilcA_pop")
data("eusilcA_smp")
income <- eusilcA_smp$eqIncome
X_covar <- eusilcA_smp[,-c(1, 16, 17, 18)]
# Example 1:
# Calculating point estimates and discussing basic generic functions
model1 <- SAEforest_model(Y = income, X = X_covar, dName = "district",
smp_data = eusilcA_smp, pop_data = eusilcA_pop)
#> Error in initializePtr(): function 'cholmod_factor_ldetA' not provided by package 'Matrix'
# SAEforest generics:
summary(model1)
#> Error in eval(expr, envir, enclos): object 'model1' not found
# Example 2:
# Calculating point + MSE estimates for aggregated covariate data and passing
# arguments to the random forest.
# Note that B is unrealistically low to improve example speed
# remove factor for gender
X_covar <- X_covar[,-1]
model2 <- SAEforest_model(Y = income, X = X_covar, dName = "district",
smp_data = eusilcA_smp, pop_data = eusilcA_popAgg,
MSE = "nonparametric", popnsize = popNsize,B = 5, mtry = 5,
num.trees = 100, aggData = TRUE)
#> Error in initializePtr(): function 'cholmod_factor_ldetA' not provided by package 'Matrix'
# SAEforest generics:
summary(model2)
#> Error in eval(expr, envir, enclos): object 'model2' not found
summarize_indicators(model2, MSE = TRUE, CV = TRUE)
#> Error in eval(expr, envir, enclos): object 'model2' not found
# Example 3:
# Calculating point + MSE estimates and passing arguments to the forest.
# Two additional custom indicators and the threshold is defined as a custom function of Y.
# Note that B is unrealistically low to improve example speed.
model3 <- SAEforest_model(Y = income, X = X_covar, dName = "district", smp_data = eusilcA_smp,
pop_data = eusilcA_pop, meanOnly = FALSE, MSE = "nonparametric",
B = 5, mtry = 5, num.trees = 100, threshold = function(Y){0.5 *
median(Y)}, custom_indicator = list(my_max = function(Y,
threshold){max(Y)}, mean40 = function(Y, threshold){
mean(Y[Y<=quantile(Y,0.4)])}), smearing = FALSE)
#> Error in initializePtr(): function 'cholmod_factor_ldetA' not provided by package 'Matrix'
# SAEforest generics:
summary(model3)
#> Error in eval(expr, envir, enclos): object 'model3' not found
summarize_indicators(model3, MSE = FALSE, CV = TRUE, indicator = c("Gini", "my_max", "mean40"))
#> Error in eval(expr, envir, enclos): object 'model3' not found
# Example 4:
# Calculating point + MSE estimates with random effect on the district level
# while the output is at state level.
# Note that B is unrealistically low to improve example speed
model4 <- SAEforest_model(Y = income, X = X_covar, dName = "district", smp_data = eusilcA_smp,
pop_data = eusilcA_pop, meanOnly = FALSE, MSE = "wild", B = 5,
num.trees = 50, aggregate_to = "state")
#> Error in initializePtr(): function 'cholmod_factor_ldetA' not provided by package 'Matrix'
summary(model4)
#> Error in eval(expr, envir, enclos): object 'model4' not found
summarize_indicators(model4, CV = TRUE)
#> Error in eval(expr, envir, enclos): object 'model4' not found
# }
```