Predictive models of malaria vector larval habitat locations may provide a basis for understanding the spatial determinants of malaria transmission.

We used four landscape variables (topographic wetness index [TWI], soil type, land use-land cover, and distance to stream) and accumulated precipitation to model larval habitat locations in a region of western Kenya through two methods: logistic regression and random forest. Additionally, we used two separate data sets to account for variation in habitat locations across space and over time.

Larval habitats were more likely to be present in locations with a lower slope-to-contributing-area ratio (i.e., a lower TWI), closer to streams, with agricultural land use relative to nonagricultural land use, and in friable clay/sandy clay loam soil and firm, silty clay/clay soil relative to friable clay soil. The probability of larval habitat presence increased with increasing accumulated precipitation. The random forest models were more accurate than the logistic regression models, especially when accumulated precipitation was included to account for seasonal differences in precipitation. The most accurate models for the two data sets had area under the curve (AUC) values of 0.864 and 0.871, respectively. TWI, distance to the nearest stream, and precipitation had the greatest mean decrease in Gini impurity criteria in these models.

This study demonstrates the usefulness of random forest models for larval malaria vector habitat modeling. TWI and distance to the nearest stream were the two most important landscape variables in these models. Including accumulated precipitation in our models improved the accuracy of larval habitat location predictions by accounting for seasonal variation in precipitation. Finally, the sampling strategy employed here for model parameterization could serve as a framework for creating predictive larval habitat models to assist in larval control efforts.

Malaria is one of the most significant infectious diseases affecting people in poverty, with an estimated 219 million cases of malaria worldwide in 2010 killing 660,000 people [

The vast majority of deaths from malaria (91%) occur in Africa [

The locations of larval

The objectives of this study were to create a model for predicting larval

The Asembo region of Rarieda District in western Kenya (Figure ^{2}. Most of the residents are subsistence farmers, and the landscape is largely dominated by small-scale agriculture. Small plots of land generally surround family-based groups of houses, or compounds, further arranged into villages. While the compounds are highly dispersed within villages, the boundaries between villages are often discernable only by residents [

Malaria is holoendemic in Asembo, with parasitemia rates in children under 5 being around 50% in 2009 [^{2} of actual landmass in the 10 by 10 km site.

The 10 by 10 km study site was divided into 500 by 500 m quadrats for larval

All potential larval

To capture variation in habitat location across time due to seasonal rainfall patterns, additional ground surveys were conducted monthly in two neighboring villages, Aduoyo-Miyare and Nguka, covering 6.22 km^{2} within the 10 by 10 km study site (Figure

Spatial data for soils, land use-land cover (LULC), distance to the nearest stream, and TWI were created across the study site. These data were assembled in ArcGIS 10.0 (ESRI, Redlands, CA) in raster data structures with a spatial resolution of 20 m. All four datasets were treated as constant over time. Soil data were taken from the 1:1,000,000 exploratory soil map of Kenya, compiled by the Kenya Soil Survey in 1980 [

The TWI data were derived from a digital elevation model (DEM) of the study site. The DEM was created using local universal kriging to interpolate 11,130 GPS elevation records previously taken within Asembo [

Because the TWI was calculated as the ratio of slope to contributing area, the lowest value (0) represented the wettest locations, while the highest TWI value (100) represented the driest areas.
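The direction of the index described above can be illustrated with a small sketch that computes a slope-to-contributing-area ratio per cell and rescales it to 0-100. The min-max rescaling, the function name `slope_area_index`, and the sample cell values are assumptions for illustration only; this text does not state how the 0-100 range was actually obtained.

```python
def slope_area_index(slopes, areas):
    """Return slope / contributing-area ratios rescaled to the range 0-100.

    slopes: slope per cell (any consistent unit; hypothetical values)
    areas:  upslope contributing area per cell (must be > 0)
    The min-max rescaling is an assumption, not the study's documented method.
    """
    ratios = [s / a for s, a in zip(slopes, areas)]
    lo, hi = min(ratios), max(ratios)
    if hi == lo:
        # All cells identical: no spread to rescale.
        return [0.0 for _ in ratios]
    return [100.0 * (r - lo) / (hi - lo) for r in ratios]


# Hypothetical cells: flat valley bottom, mid-slope, steep ridge.
index = slope_area_index([0.01, 0.05, 0.20], [5000.0, 800.0, 40.0])
```

Under this convention the flat cell with a large contributing area receives the lowest value (wettest) and the steep cell with a small contributing area receives the highest (driest).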

Daily precipitation totals for March 2011 to July 2012, as measured by the weather station at the Kisumu Airport (about 40 km east of Asembo), were downloaded from the National Climatic Data Center’s Global Summary of Day (GSoD) database (Figures

We used two approaches for modeling the distribution of larval habitats across the landscape, logistic regression and

Both methods were used separately on the two datasets (one from the 10 by 10 km area and the other from the 15 monthly ground surveys in Aduoyo-Miyare and Nguka). For both methods, the unit of analysis was a 20 m pixel. Because each of the 31 quadrats in the 10 by 10 km area was surveyed once, each pixel in that dataset had a single value for all of the variables described above (habitat presence/absence, TWI, soil, LULC, distance to stream, and 31 values of

We built a series of candidate logistic regression models to select the most useful predictor variables. To determine which

We implemented the random forest approach using the R package ‘randomForest’ [

The top models from both approaches within each dataset were evaluated by determining their accuracy at predicting larval habitat presence and absence for holdout data. Fifty percent of each dataset was randomly selected as a holdout dataset before model building. Evaluation of model accuracy required the selection of a threshold at which to convert predicted probabilities into larval habitat presence or absence. Because threshold-specific accuracy statistics can be sensitive to the threshold used for conversion, we generated an optimal threshold value by minimizing the absolute value of the difference between sensitivity and specificity [
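The threshold criterion described above can be sketched as follows. This is a minimal illustration, assuming predicted probabilities at or above the threshold are classified as habitat presence and that candidate thresholds are the observed probabilities themselves; the function names and sample data are hypothetical and are not taken from the study.

```python
def sensitivity_specificity(probs, labels, threshold):
    """Classify probs >= threshold as presence; return (sensitivity, specificity)."""
    tp = sum(1 for p, y in zip(probs, labels) if y == 1 and p >= threshold)
    fn = sum(1 for p, y in zip(probs, labels) if y == 1 and p < threshold)
    tn = sum(1 for p, y in zip(probs, labels) if y == 0 and p < threshold)
    fp = sum(1 for p, y in zip(probs, labels) if y == 0 and p >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def balanced_threshold(probs, labels):
    """Return the candidate threshold minimizing |sensitivity - specificity|."""
    candidates = sorted(set(probs))

    def gap(t):
        sens, spec = sensitivity_specificity(probs, labels, t)
        return abs(sens - spec)

    return min(candidates, key=gap)


# Hypothetical holdout predictions (1 = habitat present, 0 = absent).
probs = [0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.90]
labels = [0, 0, 0, 1, 0, 1, 0, 1]
best = balanced_threshold(probs, labels)
sens, spec = sensitivity_specificity(probs, labels, best)
```

With presences much rarer than absences, as is typical for habitat data, this criterion keeps the two error rates comparable rather than letting overall accuracy be dominated by the absence class.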

Finally, we calculated Pearson’s correlation coefficient among the cumulative precipitation measures to assess differences among the temporal-resolution/modeling-approach combinations. To quantify the contribution of cumulative 30-day precipitation to variation in the number of habitats found each month in Aduoyo-Miyare and Nguka, we used simple linear regression.
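Both calculations mentioned above are standard and can be sketched in a few lines. The example below is illustrative only: the monthly precipitation totals and habitat counts are hypothetical values, not data from the study, and the function names are assumptions.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simple_linear_regression(x, y):
    """Least-squares fit y = intercept + slope * x; returns (slope, intercept, r_squared)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Hypothetical monthly values, NOT data from the study.
precip_30day = [12.0, 40.0, 85.0, 150.0, 60.0, 20.0]
habitat_counts = [310, 380, 520, 610, 450, 400]
r = pearson_r(precip_30day, habitat_counts)
slope, intercept, r_squared = simple_linear_regression(precip_30day, habitat_counts)
```

For a regression with a single predictor, the coefficient of determination equals the square of Pearson's r, which links the two statistics.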

In the 31 sampling quadrats selected from the 10 by 10 km study site, we recorded the locations of 1,673 larval

In 15 monthly ground surveys in Aduoyo-Miyare and Nguka, a total of 6,770 larval ^{2} = 0.1931, p = 0.1012; Figure

R^{2} = 0.1931, p = 0.1012. The red and blue boxes highlight variation in the residual error discussed further in the text.

The best cumulative precipitation total to use in the models differed between the datasets and between the modeling approaches. For the 15 monthly ground surveys in Aduoyo-Miyare and Nguka, the logistic regression model for 30-day cumulative precipitation had the lowest BIC within the precipitation candidate models (Figure

Pearson’s correlation coefficient (r) matrix of cumulative precipitation measures

A) 17 May - 4 July 2011 (temporal scale for 10 by 10 km dataset)

Measure | 0-Day | 6-Day | 14-Day | 21-Day | 30-Day |

6-Day | 0.411 | 1.000 | 0.721 | 0.477 | 0.611 |

14-Day | 0.332 | 0.721 | 1.000 | 0.794 | 0.578 |

B) April 2011 - June 2012 (temporal scale for Aduoyo-Miyare and Nguka dataset)

Measure | 0-Day | 6-Day | 14-Day | 21-Day | 30-Day |

21-Day | 0.408 | 0.555 | 0.841 | 1.000 | 0.979 |

30-Day | 0.336 | 0.537 | 0.819 | 0.979 | 1.000 |

(A) For the 22 days of larval habitat ground surveys in the 10 by 10 km area from 17 May to 4 July 2011, and (B) For the last day of larval habitat ground surveys each month in Aduoyo-Miyare and Nguka from April 2011 to June 2012. 0-day refers to precipitation total for the day of ground surveys, while 6-day, 14-day, 21-day, and 30-day refer to the cumulative precipitation total for the day of the ground surveys plus the previous 6 days, 14 days, 21 days, or 30 days, respectively.
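The cumulative totals defined in the caption above (the survey day plus the previous N days) can be computed with a simple window sum. The sketch below is illustrative: the daily series, the function name, and the survey index are hypothetical, and the series is deliberately short, so the longer windows are cut off at the start of the record.

```python
def cumulative_precip(daily_totals, survey_index, days_back):
    """Sum precipitation for the survey day plus the previous `days_back` days.

    If fewer than `days_back` earlier days exist in the record, the window
    is truncated at the start of the series.
    """
    start = max(0, survey_index - days_back)
    return sum(daily_totals[start:survey_index + 1])


# Hypothetical daily precipitation record (10 days), NOT GSoD data.
daily = [0.0, 2.5, 0.0, 10.0, 0.0, 0.0, 4.0, 1.5, 0.0, 7.0]
survey_day = 9  # index of the ground-survey date in `daily`

totals = {n: cumulative_precip(daily, survey_day, n) for n in (0, 6, 14, 21, 30)}
```

Note that under this definition the "6-day" measure spans seven calendar days, because the survey day itself is included.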

The environmental variables used in the best logistic regression models for predicting the locations of larval

Top four logistic regression candidate models for the 15 monthly ground surveys in Aduoyo-Miyare and Nguka

Model | BIC | ΔBIC | BIC weight |

TWI + LULC + DS + Soil + Precip. | 45933.1 | NA | 0.9999 |

TWI + DS + Soil + Precip. | 45955.6 | 22.5 | <0.001 |

TWI + LULC + DS + Precip. | 46045.5 | 112.3 | <0.001 |

TWI + DS + Precip. | 46073.7 | 140.5 | <0.001 |

Based on Bayesian information criterion (BIC), in order of increasing difference in BIC from the top model (ΔBIC) and decreasing BIC weight,

Odds ratios for top logistic regression model from the Aduoyo-Miyare and Nguka data (15 monthly surveys)

Variable | Odds ratio | Lower 95% CI | Upper 95% CI |

(Intercept) | 0.0378 | 0.0334 | 0.0426 |

TWI | 0.9365 | 0.9324 | 0.9405 |

LULC, Ag:NonAg | 1.3371 | 1.2096 | 1.4780 |

DS | 0.9980 | 0.9978 | 0.9981 |

Soil, Type3:Type2 | 0.7127 | 0.6715 | 0.7564 |

Precip. | 1.0033 | 1.0029 | 1.0036 |

Odds ratios are presented with 95% CI. TWI, topographic wetness index; LULC, land use-land cover; DS, distance to the nearest stream; Precip., cumulative 30-day precipitation total; Ag:NonAg, odds ratio of agricultural to nonagricultural LULC; Type3:Type2, odds ratio of the firm, silty clay/clay soil type to the friable clay/sandy clay loam soil type.

For the 10 by 10 km dataset, the logistic regression model with the lowest BIC included four of the variables (TWI, distance to stream, soil type, and cumulative 6-day precipitation). No other model had a ΔBIC less than 9 (Table

The top five logistic regression candidate models for the 10 by 10 km area

Model | BIC | ΔBIC | BIC weight |

TWI + DS + Soil + Precip. | 6482.6 | 0 | 0.9822 |

TWI + LULC + DS + Soil + Precip. | 6491.8 | 9.3 | 0.0096 |

TWI + DS + Precip. | 6492.2 | 9.6 | 0.0082 |

TWI + LULC + DS + Precip. | 6502.0 | 19.4 | < 0.001 |

TWI + DS + Soil | 6658.0 | 175.4 | < 0.001 |

Based on Bayesian information criterion (BIC), in order of increasing difference in BIC from the top model (ΔBIC) and decreasing BIC weight,
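The relationship between BIC, ΔBIC, and BIC weight in the table above can be sketched as follows, assuming the weights were computed with the commonly used formula w_i = exp(-ΔBIC_i / 2) / Σ_j exp(-ΔBIC_j / 2). The formula choice and the function name are assumptions; the BIC values are the rounded figures reported for the five candidate models in the 10 by 10 km area.

```python
import math


def bic_weights(bic_values):
    """Return (delta_bic, weights) for a set of candidate-model BIC values."""
    best = min(bic_values)
    deltas = [b - best for b in bic_values]
    relative = [math.exp(-0.5 * d) for d in deltas]
    total = sum(relative)
    return deltas, [r / total for r in relative]


# Rounded BIC values as reported for the five candidate models.
bics = [6482.6, 6491.8, 6492.2, 6502.0, 6658.0]
deltas, weights = bic_weights(bics)
```

Recomputing from the rounded BIC values gives weights that agree with the reported ones to roughly three decimal places; small discrepancies are expected because the published BIC values are themselves rounded.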

Odds ratios for the top logistic regression model for the 10 by 10 km area data

Variable | Odds ratio | Lower 95% CI | Upper 95% CI |

(Intercept) | 0.3156 | 0.2472 | 0.4030 |

TWI | 0.9117 | 0.8988 | 0.9248 |

DS | 0.9973 | 0.9969 | 0.9977 |

Soil, Type2:Type1 | 1.9105 | 1.4994 | 2.4343 |

Soil, Type3:Type1 | 1.4970 | 1.2024 | 1.8639 |

Precip. | 0.9700 | 0.9656 | 0.9745 |

Odds ratios are presented with 95% CI. TWI, topographic wetness index; DS, distance to the nearest stream; Precip., cumulative 6-day precipitation total; Type2:Type1, odds ratio of the friable clay/sandy clay loam soil type to the friable clay soil type; Type3:Type1, odds ratio of the firm, silty clay/clay soil type to the friable clay soil type.
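Because odds ratios for continuous predictors apply per unit of the predictor, values close to 1 can still imply sizeable effects over realistic ranges. The sketch below illustrates this by compounding the reported per-unit odds ratios; the chosen increments (100 units of distance to stream, 10 units of TWI) and the function names are illustrative assumptions, and the measurement unit for distance is not stated in this table.

```python
def scaled_odds_ratio(or_per_unit, units):
    """Odds ratio implied by a change of `units` in a continuous predictor."""
    return or_per_unit ** units


def apply_odds_ratio(probability, odds_ratio):
    """Convert a baseline probability to the probability after applying an odds ratio."""
    odds = probability / (1.0 - probability)
    new_odds = odds * odds_ratio
    return new_odds / (1.0 + new_odds)


# Per-unit odds ratios from the table above (10 by 10 km area model).
or_ds_100 = scaled_odds_ratio(0.9973, 100)  # 100 units farther from a stream
or_twi_10 = scaled_odds_ratio(0.9117, 10)   # 10-unit increase in TWI

# Hypothetical baseline probability of 0.20, for illustration only.
p_after = apply_odds_ratio(0.20, or_ds_100)
```

Under these assumptions, moving 100 distance units away from a stream multiplies the odds of habitat presence by roughly 0.76, and a 10-unit increase in TWI multiplies them by roughly 0.40.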

The most accurate model for predicting larval

Comparison of models predicting the presence of larval habitats in the 10 by 10 km area

RF: TWI + DS + Soil + LULC + Precip. | 0.864 | 0.806 | 0.789 | 0.790 | 0.216 |

RF: TWI + DS + Soil + LULC | 0.808 | 0.750 | 0.725 | 0.726 | 0.145 |

LR: TWI + DS + Soil + Precip. | 0.799 | 0.748 | 0.709 | 0.711 | 0.133 |

Two random forest (RF) models are shown, with and without Precip. (the cumulative 14-day precipitation total). The best logistic regression (LR) model is also shown. TWI, topographic wetness index; LULC, land use-land cover; DS, distance to the nearest stream; AUC, area under the receiver operating characteristic curve; PCC, percent correctly classified.
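For readers unfamiliar with the accuracy statistics in this table, the sketch below computes AUC and PCC for a small set of predictions. AUC is computed here through its rank interpretation (the probability that a randomly chosen presence receives a higher predicted probability than a randomly chosen absence, with ties counted as one half). The data and function names are hypothetical and this is not the evaluation code used in the study.

```python
def auc(probs, labels):
    """Threshold-independent AUC via pairwise comparison of presences and absences."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def percent_correctly_classified(probs, labels, threshold):
    """Fraction of cases classified correctly when probs >= threshold means presence."""
    correct = sum(1 for p, y in zip(probs, labels) if (p >= threshold) == (y == 1))
    return correct / len(labels)


# Hypothetical holdout predictions (1 = habitat present, 0 = absent).
probs = [0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.90]
labels = [0, 0, 0, 1, 0, 1, 0, 1]
model_auc = auc(probs, labels)
model_pcc = percent_correctly_classified(probs, labels, 0.40)
```

Unlike PCC, AUC does not depend on the threshold chosen to convert probabilities into presence or absence, which is why it is useful for comparing models fitted by different methods.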

The most accurate model for predicting larval

Comparison of models predicting the presence of larval habitats in Aduoyo-Miyare and Nguka

RF: TWI + DS + Soil + LULC + Precip. | 0.871 | 0.820 | 0.773 | 0.774 | 0.102 |

RF: TWI + DS + Soil + LULC | 0.827 | 0.659 | 0.936 | 0.930 | 0.268 |

LR: TWI + DS + Soil + Precip. | 0.733 | 0.621 | 0.704 | 0.703 | 0.045 |

Two random forest (RF) models are shown, with and without Precip. (the cumulative 14-day precipitation total). The best logistic regression (LR) model is also shown. TWI, topographic wetness index; LULC, land use-land cover; DS, distance to the nearest stream; AUC, area under the receiver operating characteristic curve; PCC, percent correctly classified.

The use of models to predict the distribution of species is common in ecology [

The most important landscape variables for predicting larval habitat presence in these models were TWI and distance to the nearest stream. In the 10 by 10 km random forest model, the mean decreases in the Gini impurity criteria of TWI and distance to the nearest stream were much larger than those of LULC and soil (Figure
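The variable importance measure referred to above is built from decreases in Gini impurity at individual tree splits. The sketch below shows that building block for a single binary split on one predictor; a random forest aggregates such decreases over every split that uses a variable across all trees. The data, split point, and function names are hypothetical, and this is a simplified illustration rather than the internals of any particular random forest implementation.

```python
def gini(labels):
    """Gini impurity of a set of binary (0/1) labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)  # equals 1 - p^2 - (1 - p)^2 for two classes


def gini_decrease(values, labels, split):
    """Decrease in Gini impurity from splitting on `values <= split`."""
    left = [y for v, y in zip(values, labels) if v <= split]
    right = [y for v, y in zip(values, labels) if v > split]
    n = len(labels)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted_children


# Hypothetical pixels: distance to nearest stream and habitat presence (1) or absence (0).
distance = [10, 20, 30, 200, 300, 400]
presence = [1, 1, 1, 0, 0, 0]
decrease = gini_decrease(distance, presence, split=100)
```

In this contrived case the split separates presences from absences perfectly, so the decrease equals the full parent impurity of 0.5; a predictor that separates the classes poorly would yield a decrease near zero.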

An important question in the application of predictive larval habitat models is whether models parameterized with data for habitat locations in one season are applicable to another season [

The

In addition to the use of precipitation data from one location, there were other limitations to this study. First, we did not account for spatial autocorrelation in the logistic regression models. Doing so may have slightly widened the confidence intervals associated with the parameters of those models, but it is unlikely to have changed the model comparisons or accuracy evaluations presented here. Previous studies modeling

Finally, the models developed here exclusively used physical and environmental factors as predictor variables, but the formation of larval

The sampling designs of these two datasets allowed us to address two complementary goals. The monthly surveys in Aduoyo-Miyare and Nguka captured variation in precipitation across both dry and rainy seasons in the same landscape. This provided a stronger logical basis for inferences about the relationship between seasonal variation in precipitation and variation in the location and number of larval habitats. The small spatial extent of Aduoyo-Miyare and Nguka made monthly surveys more feasible, but it also limited the applicability of the model results across a larger area. Conversely, because the ground surveys of the 31 quadrats in the 10 by 10 km study site were limited to one season, our ability to infer the effect of precipitation from these data was likely restricted. However, concentrating our sampling effort on replication across space in the 31 quadrats captured more variation in landscape variables, allowing us to apply the results of models based on these data to a larger area.

As a general application, the spatially stratified sampling strategy used in the 10 by 10 km site could serve as a framework for creating predictive larval habitat models for larval control. Targeted larval control is often cited as a useful application of predictive larval habitat models [

The authors declare that they have no competing interests.

RSM, MNB, JMV, JEG and EDW designed the study and implemented the data collection. RSM, JPM and DWM analyzed the data and assessed the models. All authors participated in the preparation of the manuscript, and read and approved the final manuscript.

We thank George Olang’ and Maurice Ombok for logistical support; Richard Owerah, Jared Sudhe, Evans Owino, Peter Owera and Michael Nyonga for assistance with field work; Nicole Smith for assistance with DEM interpolation; Saul Daniel Ddumba and Nathan Moore for advice about the GSoD precipitation data; the staff from the KEMRI/CDC field station in Kisian for support in organizing field work and for assistance with mosquito identification; and the residents of Asembo for their cooperation during field surveys. We also thank four anonymous reviewers for constructive criticisms of an earlier version of the manuscript. This work is published with the permission of the Director of the Kenya Medical Research Institute. This study was supported by a National Science Foundation Ecology of Infectious Diseases grant (grant no. EF-0723770) with additional support from the Rhodes Thompson Memorial Fellowship Fund.

The opinions expressed by the authors of this article do not necessarily reflect the opinions of the U.S. Centers for Disease Control and Prevention.