Mixed Effect Random Forests for Small Area Estimation • SAEforest

The package promotes the use of Mixed Effects Random Forests (MERFs) for applications of Small Area Estimation (SAE). The package effectively combines functions for the estimation of regionally disaggregated linear and nonlinear economic and inequality indicators using survey sample data. Estimated models increase the precision of direct estimates from survey data, combining unit-level and aggregated population level covariate information from census or register data. Apart from point estimates, MSE estimates for requested indicators can be easily obtained. The package provides procedures to facilitate the analysis of model performance of MERFs and visualizes predictive relations from covariates and variable importance. Additionally, users can summarize and map indicators and corresponding measures of uncertainty.

Installation

You can install the development version of SAEforest from Github with:

# install.packages("devtools")
devtools::install_github("krennpa/SAEforest")

Example

This is a basic example which demonstrates the functionality of this package:

library(SAEforest)

data("eusilcA_pop")
data("eusilcA_smp")

income <- eusilcA_smp$eqIncome
X_covar <- eusilcA_smp[,-c(1,16,17,18)]

#Example 1:
#Calculating point estimates and discussing basic generic functions

model1 <- SAEforest_model(Y = income, X = X_covar, dName = "district",
                         smp_data = eusilcA_smp, pop_data = eusilcA_pop)

#SAEforest generics:
summary(model1)
#> ________________________________________________________________
#> Mixed Effects Random Forest for Small Area Estimation
#> ________________________________________________________________
#> Call:
#> SAEforest_model(Y = income, X = X_covar, dName = "district", 
#>     smp_data = eusilcA_smp, pop_data = eusilcA_pop)
#> 
#> Domains
#> ________________________________________________________________
#>  In-sample Out-of-sample Total
#>         70            24    94
#> 
#> Totals:
#> Units in sample: 1945 
#> Units in population: 25000 
#> 
#>                    Min. 1st Qu. Median      Mean 3rd Qu. Max.
#> Sample_domains       14    17.0   22.5  27.78571   29.00  200
#> Population_domains    5   126.5  181.5 265.95745  265.75 5857
#> 
#> Random forest component: 
#> ________________________________________________________________
#>                                            
#> Type:                            Regression
#> Number of trees:                        500
#> Number of independent variables:         14
#> Mtry:                                     3
#> Minimal node size:                        5
#> Variable importance mode:          impurity
#> Splitrule:                         variance
#> Rsquared (OOB):                     0.62976
#> 
#> Structural component of random effects:
#> ________________________________________________________________
#> Linear mixed model fit by maximum likelihood  ['lmerMod']
#> Formula: Target ~ -1 + (1 | district)
#>    Data: data
#>  Offset: forest_preds
#> 
#>      AIC      BIC   logLik deviance df.resid 
#>  39193.1  39204.2 -19594.5  39189.1     1943 
#> 
#> Scaled residuals: 
#>     Min      1Q  Median      3Q     Max 
#> -2.9730 -0.5194 -0.0759  0.4448 11.8159 
#> 
#> Random effects:
#>  Groups   Name        Variance Std.Dev.
#>  district (Intercept) 11157235 3340    
#>  Residual             30335770 5508    
#> Number of obs: 1945, groups:  district, 70
#> 
#> ICC:  0.2688944 
#> 
#> Convergence of MERF algorithm: 
#> ________________________________________________________________
#> Convergence achieved after 4 iterations.
#> A maximum of 25 iterations used and tolerance set to: 1e-04 
#> 
#> Monitored Log-Likelihood:                                          
#>  0 -19545.67 -19573.45 -19593.59 -19594.53

I included some further features to inspect the model graphically. For instance look at the following output from the generic function plot, which shows a so-called variable importance plot:

#> Press [enter] to continue

We cannot only inspect the model graphically, but also map our indicators. Take a look at this example on Austrian pseudo-data for district-level mean income produced by the function map_indicators:

I hope you like this presentation and the package. If you are interested in model-based SAE you should definitely also check out package emdi.