Main function for the estimation of domain-level (nonlinear) indicators with MERFs

This function enables the use of Mixed Effects Random Forests (MERFs) for applications of Small Area Estimation (SAE). Unit-level survey data on a target and auxiliary covariates is required to produce reliable estimates of various disaggregated economic and inequality indicators. Option meanOnly saves computational time for users that are only interested in the estimation of domain-specific means using unit-level and aggregated auxiliary data. Predefined indicators include the mean, median, quantiles (10%, 25%, 75% and 90%), the head count ratio, the poverty gap, the Gini-coefficient and the quintile share ratio. The MERF algorithm is an algorithmic procedure reminiscent of an EM-algorithm (see Details). Overall, the function serves as a coherent framework for the estimation of point estimates and if requested uncertainty estimates for the indicators. Methodological details are found in Krennmair & Schmid (2022) and Krennmair et al. (2022b). The following examples showcase further potential applications.

Usage

SAEforest_model(
  Y,
  X,
  dName,
  smp_data,
  pop_data,
  MSE = "none",
  meanOnly = TRUE,
  aggData = FALSE,
  smearing = TRUE,
  popnsize = NULL,
  importance = "impurity",
  OOsample_obs = 25,
  ADDsamp_obs = 0,
  w_min = 3,
  B = 100,
  B_adj = 100,
  B_MC = 100,
  threshold = NULL,
  custom_indicator = NULL,
  initialRandomEffects = 0,
  ErrorTolerance = 1e-04,
  MaxIterations = 25,
  aggregate_to = NULL,
  na.rm = TRUE,
  ...
)

Arguments

Y: Continuous input value of target variable.
X: Matrix or data.frame of predictive covariates.
dName: Character specifying the name of the domain identifier, for which random intercepts are modeled.
smp_data: data.frame of survey sample data including the specified elements of Y and X.
pop_data: data.frame of unit-level population covariate data X. Please note that the column names of predictive covariates must match column names of smp_data. This holds especially for the name of the domain identifier.
MSE: Character input specifying the type of uncertainty estimates. Available options are: (i) "none" if only point estimates are requested, (ii) "nonparametric" following the MSE bootstrap procedure proposed by Krennmair & Schmid (2022) or by Krennmair et al. (2022a) if aggData = TRUE. (iii) "wild" only for nonlinear indicators proposed by Krennmair et al. (2022b). Defaults to "none".
meanOnly: Logical. Calculating domain-level means only. Defaults to TRUE.
aggData: Logical input indicating whether aggregated covariate information or unit-level covariate information is used for domain-level means. Defaults to FALSE, assuming unit-level covariate data.
smearing: Logical input indicating whether a smearing based approach or a Monte Carlo (MC) version for point estimates should be obtained to estimate (nonlinear) indicators. MC should be used if computational constraints prohibit a smearing approach. For theoretical details see Krennmair et al (2022b). Defaults to TRUE.
popnsize: data.frame, comprising information of population size of domains. Only needed if aggData = TRUE and a MSE is requested. Please note that the name of the domain identifier must match the column name of smp_data.
importance: Variable importance mode processed by the random forest from ranger. Must be 'none', 'impurity', 'impurity_corrected' or 'permutation'. A concept of variable importance is needed for the production of generic plots plot. For the estimation of domain-level means under aggregated covariate data, variable importance is needed to rank information in the process of finding suitable calibration weights (Krennmair et al., 2022b). For further information regarding measures of importance see ranger.
OOsample_obs: Number of out-of-sample observations taken from the closest area for potentially unsampled areas. Only needed if aggData = TRUE with defaults to 25.
ADDsamp_obs: Number of out-of-sample observations taken from the closest area if first iteration for the calculation of calibration weights fails. Only needed if aggData = TRUE with defaults to 0.
w_min: Minimal number of covariates from which informative weights are calculated. Only needed if aggData = TRUE. Defaults to 3.
B: Number of bootstrap replications for MSE estimation procedures. Defaults to 100.
B_adj: Number of bootstrap replications for the adjustment of residual variance proposed by Mendez and Lohr (2001). Defaults to 100.
B_MC: Number of bootstrap populations in the MC version for point estimates of (nonlinear) indicators. Defaults to 100.
threshold: Set a custom threshold for indicators, such as the head count ratio. The threshold can be a known numeric value or function of Y. If the threshold is NULL, 60 % of the median of Y is taken as threshold. Defaults to NULL.
custom_indicator: A list of additional functions containing the indicators to be calculated. These functions must only depend on the target variable Y and optionally the threshold. Defaults to NULL.
initialRandomEffects: Numeric value or vector of initial estimates of random effects. Defaults to 0.
ErrorTolerance: Numeric value to monitor the MERF algorithm's convergence. Defaults to 1e-04.
MaxIterations: Numeric value specifying the maximal amount of iterations for the MERF algorithm. Defaults to 25.
aggregate_to: Character containing the name of a variable from population data that indicates the target domain level for which the results are to be displayed. Only available if aggData = FALSE. Defaults to NULL.
na.rm: Logical. Whether missing values should be removed. Defaults to TRUE.
...: Additional parameters are directly passed to the random forest ranger. Most important parameters are for instance mtry (number of variables to possibly split at in each node), or num.trees (number of trees). For further details on possible parameters see ranger and the example below.

Value

An object of class SAEforest includes point estimates for disaggregated indicators as well as information on the MERF-model. Optionally corresponding MSE estimates are returned. Several generic functions have methods for the returned object of class SAEforest. For a full list and explanation of components and possibilities for objects of class SAEforest, see SAEforestObject.

Details

MERFs combine advantages of regression forests (such as implicit model-selection and robustness properties) with the ability to model hierarchical dependencies.

The MERF algorithm iteratively optimizes two separate steps: a) the random forest function, assuming the random effects term to be correct and b) estimates the random effects part, assuming the OOB-predictions from the forest to be correct. Overall convergence of the algorithm is monitored by log-likelihood of a joint model of both components. For further details see Krennmair and Schmid (2022).

Users that are only interested in the estimation of domain-level means should set meanOnly = TRUE. The MERF requires covariate micro-data. This function, however also allows for the use of aggregated covariate information, by setting aggData = TRUE. Aggregated covariate information is adaptively incorporated through calibration-weights based on empirical likelihood for the estimation of area-level means. See methodological details in Krennmair et al. (2022a)

For the estimation of (nonlinear) poverty indicators and/or quantiles, we need information on the area-specific cumulative distribution function (CDF) of Y. Krennmair et al. (2022b) propose a smearing approach originated by Duan (1983). Alternatively, Monte-Carlo methods are used to simulate the domain-specific CDF of Y.

For the estimation of the MSE, the bootstrap population is built based on a bias-corrected residual variance as discussed in Krennmair and Schmid (2022). The bootstrap bias correction follows Mendez and Lohr (2011).

Note that the MERFmodel object is a composition of elements from a random forest of class ranger and a random effects model of class merMod. Thus, all generic functions are applicable to corresponding objects. For further details on generic functions see ranger and lmer as well as the examples below.

References

Duan, N. (1983). Smearing Estimate: A Nonparametric Retransformation Method. Journal of the American Statistical Association, 78(383), 605–610.

Krennmair, P., & Schmid, T. (2022). Flexible Domain Prediction Using Mixed Effects Random Forests. Journal of Royal Statistical Society: Series C (Applied Statistics) (forthcoming).

Krennmair, P., & Würz, N. & Schmid, T. (2022a). Analysing Opportunity Cost of Care Work using Mixed Effects Random Forests under Aggregated Census Data.

Krennmair, P., & Schmid, T & Tzavidis, Nikos. (2022b). The Estimation of Poverty Indicators Using Mixed Effects Random Forests. Working Paper.

Mendez, G., & Lohr, S. (2011). Estimating residual variance in random forest regression. Computational Statistics & Data Analysis, 55 (11), 2937–2950.

Examples

# \donttest{
# Loading data
data("eusilcA_pop")
data("eusilcA_smp")

income <- eusilcA_smp$eqIncome
X_covar <- eusilcA_smp[,-c(1, 16, 17, 18)]

# Example 1:
# Calculating point estimates and discussing basic generic functions

model1 <- SAEforest_model(Y = income, X = X_covar, dName = "district",
                          smp_data = eusilcA_smp, pop_data = eusilcA_pop)
#> Error in initializePtr(): function 'cholmod_factor_ldetA' not provided by package 'Matrix'

# SAEforest generics:
summary(model1)
#> Error in eval(expr, envir, enclos): object 'model1' not found

# Example 2:
# Calculating point + MSE estimates for aggregated covariate data and passing
# arguments to the random forest.
# Note that B is unrealistically low to improve example speed

# remove factor for gender
X_covar <- X_covar[,-1]
model2 <- SAEforest_model(Y = income, X = X_covar, dName = "district",
                          smp_data = eusilcA_smp, pop_data = eusilcA_popAgg,
                          MSE = "nonparametric", popnsize = popNsize,B = 5, mtry = 5,
                          num.trees = 100, aggData = TRUE)
#> Error in initializePtr(): function 'cholmod_factor_ldetA' not provided by package 'Matrix'

# SAEforest generics:
summary(model2)
#> Error in eval(expr, envir, enclos): object 'model2' not found
summarize_indicators(model2, MSE = TRUE, CV = TRUE)
#> Error in eval(expr, envir, enclos): object 'model2' not found

# Example 3:
# Calculating point + MSE estimates and passing arguments to the forest.
# Two additional custom indicators and the threshold is defined as a custom function of Y.
# Note that B is unrealistically low to improve example speed.

model3 <- SAEforest_model(Y = income, X = X_covar, dName = "district", smp_data = eusilcA_smp,
                          pop_data = eusilcA_pop, meanOnly = FALSE, MSE = "nonparametric",
                          B = 5, mtry = 5, num.trees = 100, threshold = function(Y){0.5 *
                          median(Y)}, custom_indicator = list(my_max = function(Y,
                          threshold){max(Y)}, mean40 = function(Y, threshold){
                          mean(Y[Y<=quantile(Y,0.4)])}), smearing = FALSE)
#> Error in initializePtr(): function 'cholmod_factor_ldetA' not provided by package 'Matrix'

# SAEforest generics:
summary(model3)
#> Error in eval(expr, envir, enclos): object 'model3' not found
summarize_indicators(model3, MSE = FALSE, CV = TRUE, indicator = c("Gini", "my_max", "mean40"))
#> Error in eval(expr, envir, enclos): object 'model3' not found

# Example 4:
# Calculating point + MSE estimates with random effect on the district level
# while the output is at state level.
# Note that B is unrealistically low to improve example speed

model4 <- SAEforest_model(Y = income, X = X_covar, dName = "district", smp_data = eusilcA_smp,
                          pop_data = eusilcA_pop, meanOnly = FALSE, MSE = "wild", B = 5,
                          num.trees = 50, aggregate_to = "state")
#> Error in initializePtr(): function 'cholmod_factor_ldetA' not provided by package 'Matrix'

summary(model4)
#> Error in eval(expr, envir, enclos): object 'model4' not found
summarize_indicators(model4, CV = TRUE)
#> Error in eval(expr, envir, enclos): object 'model4' not found
# }