Gianluca Boo1, Roland Hosner2, Pierre Z Akilimali3, Edith Darin1, Heather R Chamberlain1, Warren C Jochem1, Patricia Jones1, Roger Shulungu Runika4, Henri Marie Kazadi Mutombo4,5, Attila N Lazar1 and Andrew J Tatem1

1WorldPop Research Group, University of Southampton, Southampton, United Kingdom
2Flowminder Foundation, Stockholm, Sweden
3École de Santé Publique de Kinshasa, Kinshasa, Democratic Republic of the Congo
4Institut National de la Statistique, Kinshasa, Democratic Republic of the Congo
5Bureau Central du Recensement, Kinshasa, Democratic Republic of the Congo

Introduction

This report is a supplement to the modelled gridded population estimates for the Haut-Katanga, Haut-Lomami, Ituri, Kasaï, Kasaï-Oriental, Lomami and Sud-Kivu provinces in the Democratic Republic of the Congo (DRC) (2021). The report describes the Bayesian statistical model used to produce the gridded population estimates, following an approach described by Wardrop et al. (2018). The report is organised in four main parts: 1) input data, 2) model specification, 3) model fit and evaluation and 4) discussion of the results. These steps were inspired by previous work carried out in Nigeria (Leasure et al. 2020), the Democratic Republic of the Congo (Boo et al. 2021) and other countries that are part of the GRID3 programme.

1) Input data

All the spatial data presented below were georeferenced using the WGS84 datum (World Geodetic System 1984: EPSG 4326) with a consistent spatial resolution of 0.0008333 decimal degrees (i.e. approximately 100m).
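As a rough check on the "approximately 100m" figure, the sketch below converts the stated cell size from decimal degrees to metres. It assumes a spherical Earth with a mean radius of 6,371,008.8 m and a few illustrative latitudes; neither assumption comes from the report, and the function name is hypothetical.

```python
import math

# Assumption: spherical Earth with this mean radius (not taken from the report).
EARTH_RADIUS_M = 6_371_008.8
CELL_SIZE_DEG = 0.0008333  # spatial resolution stated in the report


def cell_size_metres(lat_deg, cell_deg=CELL_SIZE_DEG):
    """Return the (east-west, north-south) cell size in metres at a latitude."""
    metres_per_degree = math.pi * EARTH_RADIUS_M / 180.0
    north_south = cell_deg * metres_per_degree
    # Meridians converge towards the poles, so east-west extent shrinks.
    east_west = north_south * math.cos(math.radians(lat_deg))
    return east_west, north_south


# Illustrative latitudes only; they are not taken from the report.
for lat in (0.0, -5.0, -11.0):
    ew, ns = cell_size_metres(lat)
    print(f"lat {lat:6.1f}: {ew:5.1f} m x {ns:5.1f} m")
```

Under these assumptions a cell is a little under 100 m on each side, and its east-west extent narrows slightly away from the equator.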

Microcensus data: we derived total population counts and age and sex breakdowns for the 1,397 clusters surveyed in a microcensus led by the Flowminder Foundation et al. (2021) between March and April 2021. Inclusion criteria, imputation and other data processing steps are described in detail in the COD_population_v3_0_methods_microcensus.html report.

Building footprints: we accessed the latest building footprints for the DRC produced by Ecopia.AI and Maxar Technologies (2021) from satellite imagery collected between 2010 and 2021, and rasterized them with a resolution of approximately 100m. In doing so, we counted the building centroids falling within each grid cell, similar to the work carried out by Dooley et al. (2021). Building footprints were also used to derive some of the model covariates presented below.
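The centroid-counting step can be illustrated with a minimal sketch. It assumes centroids are given as (longitude, latitude) pairs and that the grid is anchored at a hypothetical upper-left corner; the function and parameter names are illustrative and do not reflect the actual processing workflow.

```python
import math
from collections import Counter

CELL_SIZE_DEG = 0.0008333  # spatial resolution stated in the report


def count_centroids(centroids, origin_lon, origin_lat, cell_deg=CELL_SIZE_DEG):
    """Count building centroids per grid cell.

    centroids: iterable of (lon, lat) pairs in decimal degrees.
    origin_lon, origin_lat: upper-left corner of the grid (hypothetical).
    Returns a Counter keyed by (row, col), rows increasing southwards.
    """
    counts = Counter()
    for lon, lat in centroids:
        col = math.floor((lon - origin_lon) / cell_deg)
        row = math.floor((origin_lat - lat) / cell_deg)
        counts[(row, col)] += 1
    return counts


# Three made-up centroids: two share a cell, one falls in the next column.
example = [(27.0001, -11.0001), (27.0002, -11.0003), (27.0010, -11.0001)]
building_counts = count_centroids(example, origin_lon=27.0, origin_lat=-11.0)
print(dict(building_counts))
```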

Administrative boundaries: we accessed vector boundaries for the Haut-Katanga, Haut-Lomami, Ituri, Kasaï, Kasaï-Oriental, Lomami and Sud-Kivu provinces in the DRC produced by the Bureau Central du Recensement (2018) and rasterized them with a resolution of approximately 100m. The original administrative boundaries are being consolidated and are currently not publicly available.

Settlement classes: we derived four settlement classes (i.e. urban, periurban, village and hamlet) by reclassifying GHS-SMOD data (GHSL 2019) and rasterized them with a resolution of approximately 100m. Original classes 10 and 11 were reclassified as hamlet, classes 12 and 13 as village, classes 21, 22 and 23 as periurban and class 30 as urban.
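The reclassification rule amounts to a lookup table. The sketch below encodes the mapping stated above; the handling of any other code (returned as `None`) is an assumption, as the report does not say how such cells were treated.

```python
# Mapping from GHS-SMOD class codes to the four settlement classes,
# as listed in the report.
SMOD_TO_SETTLEMENT = {
    10: "hamlet",
    11: "hamlet",
    12: "village",
    13: "village",
    21: "periurban",
    22: "periurban",
    23: "periurban",
    30: "urban",
}


def reclassify(smod_values):
    """Map GHS-SMOD codes to settlement classes.

    Codes outside the table map to None (an assumption of this sketch).
    """
    return [SMOD_TO_SETTLEMENT.get(value) for value in smod_values]


print(reclassify([10, 13, 22, 30]))
```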

Model covariates: we created 1,178 model covariates, which were standardized using the mean and standard deviation derived at the grid cell level. We selected the eight covariates with the highest correlation coefficient with population densities (log-transformed) at the microcensus cluster level. We then discarded four covariates that were highly correlated with one another (e.g. average area and average perimeter of building footprints). The four covariates implemented in the model are: 1) average building length (log-transformed), 2) average building compactness (log-transformed), 3) dry matter productivity variation and 4) surface temperature variation.
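The standardization and selection steps can be sketched as follows. This is a minimal illustration rather than the procedure actually used: the ranking on absolute Pearson correlation, the greedy removal of collinear covariates, the 0.8 collinearity threshold and all function names are assumptions made for the example.

```python
import math
from statistics import mean, pstdev


def standardize(values):
    """Centre and scale values to mean 0 and standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def select_covariates(covariates, log_density, n_keep=8, max_corr=0.8):
    """Rank covariates by |correlation| with log density, then drop
    any covariate that is highly correlated with one already kept.

    max_corr is a hypothetical threshold, not a value from the report.
    """
    ranked = sorted(
        covariates,
        key=lambda name: abs(pearson(covariates[name], log_density)),
        reverse=True,
    )[:n_keep]
    kept = []
    for name in ranked:
        if all(abs(pearson(covariates[name], covariates[k])) < max_corr for k in kept):
            kept.append(name)
    return kept


# Toy data: "b" is nearly collinear with "a", "c" is weakly related to both.
toy = {"a": [1, 2, 3, 4, 5.5], "b": [2, 4, 6, 8, 12], "c": [5, 1, 4, 2, 3]}
print(select_covariates(toy, log_density=[1, 2, 3, 4, 5], n_keep=3))
```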

2) Model specification

We developed a Bayesian hierarchical model to estimate total population counts within grid cells of approximately 100m, similarly to Leasure et al. (2020) and Boo et al. (2021). The model estimates the total number of people \(N_i\) in each microcensus cluster \(i\) as a Poisson process, where \(D_i\) is the population density (people/100 buildings) and \(B_i\) is the total number of buildings (buildings/100). \(D_i\) is modelled as a log-normal process, where \(\bar{D}_i\) is the expected population density on a log-scale and \(\sigma_{t,p}\) is a hierarchical precision term estimated by settlement type \(t\) and province \(p\).

\[ \begin{equation} N_i \sim Poisson(D_i B_i) \end{equation} \]

\[ \begin{equation} D_i \sim LogNormal(\bar{D}_i, \sigma_{t,p}) \end{equation} \]

Expected population densities \(\bar{D}_i\) within each microcensus cluster \(i\) are estimated using a linear regression and \(K\) covariates: 1) Average Building Length (log-transformed), 2) Average Building Compactness (log-transformed), 3) Dry Matter Productivity Variation, and 4) Surface Temperature Variation. The random intercept \(\alpha_{t,p}\) is estimated hierarchically by settlement type \(t\) and province \(p\) and the parameter \(\beta_{k}\) is estimated on the covariates \(x_{k,i}\) within each microcensus cluster \(i\).

\[ \begin{equation} \bar{D}_i = \alpha_{t,p} + \sum_{k=1}^K \beta_{k} x_{k,i} \end{equation} \]

The random intercept \(\alpha_{t,p}\) is modelled as a nested hierarchy composed of settlement types \(t\) and provinces \(p\) with uninformative priors on \(\alpha\), \(\nu\) and \(\upsilon\). \(\alpha\) is centred on 6, the approximate mean observed population density (on a log scale), while \(\nu\) and \(\upsilon\) are modelled using a positive half-normal distribution truncated at 0.

\[ \begin{equation} \begin{aligned} \alpha_{t,p} &\sim Normal(\alpha_{t}, \nu) \\ \alpha_{t} &\sim Normal(\alpha, \upsilon) \\ \alpha &\sim Normal(6, 10) \\ \nu &\sim HalfNormal(0, 10) \\ \upsilon &\sim HalfNormal(0, 10) \end{aligned} \end{equation} \]

A similar hierarchical structure is also used for \(\sigma_{t,p}\), which is modelled by settlement type \(t\) and province \(p\) using uninformative priors on \(\mu\), \(\xi\) and \(\zeta\). \(\xi\) and \(\zeta\) are modelled using a positive half-normal distribution truncated at 0.

\[ \begin{equation} \begin{aligned} \sigma_{t,p} &\sim Normal(\mu_{t}, \xi) \\ \mu_{t} &\sim Normal(\mu, \zeta) \\ \mu &\sim Normal(0, 10) \\ \xi &\sim HalfNormal(0, 10) \\ \zeta &\sim HalfNormal(0, 10) \end{aligned} \end{equation} \]

The parameter \(\beta_k\) is modelled using an uninformative normal distribution.

\[ \begin{equation} \beta_k \sim Normal(0, 10) \end{equation} \]
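To make the likelihood concrete, the sketch below simulates a population count for a single cluster from fixed, made-up parameter values. It is a forward simulation only, not the Stan model that was fitted. It assumes that \(\sigma_{t,p}\) acts as a standard deviation on the log scale, that the building count is supplied as a raw number and divided by 100 inside the function, and that the parameter values shown are purely illustrative.

```python
import math
import random


def expected_log_density(alpha_tp, betas, covariates):
    """Linear predictor: random intercept plus covariate effects."""
    return alpha_tp + sum(b * x for b, x in zip(betas, covariates))


def draw_poisson(lam, rng):
    """Draw a Poisson count by counting unit-rate exponential
    arrivals that fall in the interval [0, lam]."""
    count = 0
    total = rng.expovariate(1.0)
    while total <= lam:
        count += 1
        total += rng.expovariate(1.0)
    return count


def simulate_cluster(alpha_tp, betas, covariates, sigma_tp, buildings, rng):
    """Simulate N_i ~ Poisson(D_i * B_i) with D_i ~ LogNormal(D_bar_i, sigma)."""
    log_mean = expected_log_density(alpha_tp, betas, covariates)
    # Density in people per 100 buildings; sigma_tp is treated as a
    # log-scale standard deviation (an assumption of this sketch).
    density = rng.lognormvariate(log_mean, sigma_tp)
    return draw_poisson(density * buildings / 100.0, rng)


rng = random.Random(42)
# Hypothetical values: intercept 6, four small covariate effects, 150 buildings.
people = simulate_cluster(
    alpha_tp=6.0,
    betas=[0.1, -0.05, 0.02, 0.03],
    covariates=[0.5, -1.0, 0.2, 0.0],
    sigma_tp=0.3,
    buildings=150,
    rng=rng,
)
print(people)
```

With an intercept of 6, the median density is exp(6), roughly 403 people per 100 buildings, or about four people per building.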

3) Model fit and evaluation

We fit the model using the Stan software in R version 4.0.2 and achieved model convergence in 1,000 iterations using three MCMC chains. We ran each chain for an additional 1,000 iterations to provide a more stable posterior distribution. We assessed standard convergence statistics: none of the iterations ended with a divergence or saturated the maximum tree depth of 15, and the energy Bayesian fraction of missing information (E-BFMI) indicated no pathological behaviour (Gelman et al. 2020).
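For readers unfamiliar with E-BFMI, the sketch below shows one common estimator computed from the energy values of a single chain: the sum of squared differences between successive energies divided by the sum of squared deviations from the mean energy. Stan reports this diagnostic itself, possibly with a slightly different normalization, so the function is an illustration rather than the code used here. Low values are generally read as a warning sign.

```python
def e_bfmi(energies):
    """Estimate the energy Bayesian fraction of missing information
    for one chain from its sequence of energy values."""
    mean_energy = sum(energies) / len(energies)
    numerator = sum((b - a) ** 2 for a, b in zip(energies, energies[1:]))
    denominator = sum((e - mean_energy) ** 2 for e in energies)
    return numerator / denominator


# Made-up energy trace, for illustration only.
print(e_bfmi([10.2, 11.5, 9.8, 10.9, 10.1, 11.0]))
```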

The table below shows goodness-of-fit statistics derived from the model residuals (i.e. predicted minus observed total population counts), both raw and standardized (i.e. residuals divided by predicted total population counts). Bias is the mean of the residuals, imprecision is the standard deviation of the residuals, inaccuracy is the mean of the absolute residuals, and R2 is the squared Pearson correlation coefficient between observed and predicted total population counts. The analysis was carried out for in-sample predictions.

| Residuals | Bias | Imprecision | Inaccuracy | R2 |
|---|---|---|---|---|
| Raw | 20.10 | 108.54 | 79.56 | 0.58 |
| Standardized | 0.03 | 0.53 | 0.34 | |
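The statistics in the table can be computed along the following lines. This is a minimal sketch: it assumes the sample standard deviation for imprecision and computes R2 between observed and predicted counts, and the function names and toy data are hypothetical.

```python
import math
from statistics import mean, stdev


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def summarise(residuals):
    """Bias, imprecision and inaccuracy of a list of residuals."""
    return {
        "bias": mean(residuals),
        "imprecision": stdev(residuals),  # sample SD (an assumption)
        "inaccuracy": mean([abs(r) for r in residuals]),
    }


def fit_statistics(observed, predicted):
    """Goodness-of-fit statistics for raw and standardized residuals."""
    raw = [p - o for o, p in zip(observed, predicted)]
    standardized = [r / p for r, p in zip(raw, predicted)]
    return {
        "raw": summarise(raw),
        "standardized": summarise(standardized),
        "r2": pearson(observed, predicted) ** 2,
    }


# Toy counts for three clusters, not taken from the microcensus.
print(fit_statistics(observed=[100, 200, 300], predicted=[110, 190, 330]))
```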

The plot below shows the observed versus predicted total population counts at the microcensus cluster level. The 95% credible intervals include 95.6% of the observations, suggesting a robust error structure. Of the 1,397 observations, approximately six show substantial model over-prediction and nine show substantial under-prediction.
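The coverage figure can be reproduced in principle by taking the 2.5th and 97.5th percentiles of the posterior predictive draws for each cluster and counting how many observed values fall inside. The sketch below assumes equal-tailed intervals and uses hypothetical function names and made-up data.

```python
from statistics import quantiles


def credible_interval(draws):
    """Equal-tailed 95% interval from posterior predictive draws."""
    # n=40 yields cut points at every 2.5%; keep the first and last.
    cuts = quantiles(draws, n=40, method="inclusive")
    return cuts[0], cuts[-1]


def interval_coverage(observed, intervals):
    """Share of observations lying inside their (lower, upper) interval."""
    inside = sum(1 for obs, (low, high) in zip(observed, intervals) if low <= obs <= high)
    return inside / len(observed)


# Made-up draws for three clusters sharing the same predictive distribution.
draws = list(range(101))
intervals = [credible_interval(draws) for _ in range(3)]
print(interval_coverage([1, 50, 99], intervals))
```

A coverage close to the nominal 95% suggests that the predictive intervals are neither too narrow nor too wide.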

After comparing the number of surveyed buildings with both the building footprint count raster and the buildings visible on satellite imagery, it appears that over-prediction occurs where a large share of the buildings in the building footprint layer were missing from the survey records, possibly because they no longer existed. Conversely, under-prediction occurs where new residential buildings were built in the period between the delineation of the building footprints and the survey. Importantly, these sources of uncertainty were captured by the error structure of the model.

The map below shows the standardized residuals computed as the predicted minus observed total population counts divided by the predicted population counts at the microcensus cluster level. Values above zero highlight clusters with model over-prediction and values below zero under-prediction. Most clusters have standardized residuals close to zero with some exceptions.

The map also shows that the model residuals have a random spatial distribution, except for the northern part of the Ituri province, where all the microcensus clusters display slight over-prediction. This observation is confirmed by spatial-autocorrelation tests carried out using different distance classes, where weak (Moran’s I=0.06) but statistically significant (p<0.05) spatial autocorrelation was detected. Over-prediction in the northern part of the Ituri province may be linked to the complicated security situation, which results in complex household structures and fluid residential patterns.
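Moran's I for a single distance class can be sketched as below. The example assumes binary weights (1 for pairs of clusters within a distance threshold, 0 otherwise) and planar coordinates; the weighting scheme and distance classes actually used are not described in this report, and the significance test is omitted.

```python
import math


def morans_i(values, coords, max_distance):
    """Moran's I with binary distance-band weights.

    values: one residual per cluster.
    coords: (x, y) positions in planar units (an assumption).
    max_distance: pairs closer than this are treated as neighbours.
    """
    n = len(values)
    mean_value = sum(values) / n
    dev = [v - mean_value for v in values]
    weight_sum = 0.0
    cross_product = 0.0
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(coords[i], coords[j]) <= max_distance:
                weight_sum += 1.0
                cross_product += dev[i] * dev[j]
    return (n / weight_sum) * (cross_product / sum(d * d for d in dev))


# Four made-up clusters on a line: similar residuals sit next to each other.
line = [(0, 0), (1, 0), (2, 0), (3, 0)]
print(morans_i([1, 1, -1, -1], line, max_distance=1))
```

Values near zero, such as the 0.06 reported above, indicate little spatial structure in the residuals; positive values indicate that similar residuals tend to be neighbours.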

4) Discussion of the results

This statistical model builds on several assumptions on the input data.

The model also implies the following assumptions.

Finally, the statistical model assumes that the processes observed within the microcensus clusters are reflective of the ones occurring at the grid cell level. However, it is expected that for larger areas this relationship is subject to higher degrees of uncertainty because of the modifiable areal unit problem (MAUP), as some mismatches between total population counts estimated at the microcensus-cluster and grid-cell level have been observed.

Conclusions

The statistical model described in this report enabled us to produce gridded estimates of total population counts for the seven provinces included in the GRID3 Mapping for Health Project. These estimates were used to derive post-hoc breakdowns for different age and sex groups, by multiplying the total population counts by age and sex proportions at the province level. The age and sex proportions were derived from the microcensus data aggregated at the province level and were weighted by design weights. The design weights were calculated as the inverse sampling probability of each microcensus cluster because clusters were sampled based on a stratification by settlement type (and province). The design weights account for the varying sampling probabilities of clusters within each province.
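The post-hoc breakdown can be illustrated with a short sketch. It assumes each cluster record carries its sampling probability and its counts by age and sex group; the record layout, group labels and function names are hypothetical.

```python
from collections import defaultdict


def design_weight(sampling_probability):
    """Design weight as the inverse of the sampling probability."""
    return 1.0 / sampling_probability


def weighted_proportions(clusters):
    """Design-weighted age and sex proportions for one province.

    clusters: list of dicts with 'probability' and 'counts' ({group: people}).
    """
    totals = defaultdict(float)
    for cluster in clusters:
        weight = design_weight(cluster["probability"])
        for group, people in cluster["counts"].items():
            totals[group] += weight * people
    grand_total = sum(totals.values())
    return {group: total / grand_total for group, total in totals.items()}


def disaggregate(total_population, proportions):
    """Split an estimated total population count across groups."""
    return {group: total_population * share for group, share in proportions.items()}


# Two made-up clusters with different sampling probabilities.
clusters = [
    {"probability": 0.5, "counts": {"female": 10, "male": 10}},
    {"probability": 1.0, "counts": {"female": 30, "male": 10}},
]
shares = weighted_proportions(clusters)
print(disaggregate(800, shares))
```

Because the proportions are computed at the province level, every grid cell in a province receives the same age and sex structure under this approach.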

Cited datasets [not publicly available]

Acknowledgements

These data were produced by the WorldPop Research Group at the University of Southampton as part of the GRID3 Mapping for Health Project. This project was delivered under the leadership of the Ministry of Public Health, Hygiene and Prevention of the DRC and funded by Gavi, the Vaccine Alliance (RM 86720420A2). The project was led by the Flowminder Foundation and the Center for International Earth Science Information Network (CIESIN) at Columbia University, in collaboration with the WorldPop Research Group at the University of Southampton and national partners including, but not limited to, the École de Santé Publique de Kinshasa and both the Bureau Central du Recensement and the Institut National de la Statistique. This work was a continuation of the GRID3 (Geo-Referenced Infrastructure and Demographic Data for Development) programme funded by the Bill and Melinda Gates Foundation (BMGF) and the United Kingdom’s Foreign, Commonwealth & Development Office (INV 009579, formerly OPP 1182425). The study was approved by the Faculty Ethics Committee of the University of Southampton (ERGO II 62716).

Suggested citation

G Boo, R Hosner, PZ Akilimali, E Darin, HR Chamberlain, WC Jochem, P Jones, R Shulungu Runika, HM Kazadi Mutombo, AN Lazar and AJ Tatem. 2021. Modelled gridded population estimates for the Haut-Katanga, Haut-Lomami, Ituri, Kasaï, Kasaï-Oriental, Lomami and Sud-Kivu provinces in the Democratic Republic of the Congo (2021), version 3.0. WorldPop, University of Southampton, Flowminder Foundation, École de Santé Publique de Kinshasa, Bureau Central du Recensement and Institut National de la Statistique. DOI: 10.5258/SOTON/WP00720

License

This report may be redistributed following the terms of a Creative Commons Attribution 4.0 International (CC BY 4.0) license.