De La Fuente Abstract: Robust data splitting for hydrological modeling: Implementation of machine learning techniques

Hydrological modeling is typically characterized by a deterioration in model performance when models are applied to the evaluation-period data. This situation has only sporadically been studied and, in general, has become accepted as the “norm” in hydrological modeling. However, Machine Learning techniques are now available that can help us better extract information from the data and improve our understanding of the models. Notably, the highly skewed nature of the streamflow distribution can result in significant differences among the data distributions produced by the training/validation/testing split. To explore this hypothesis, we created random subsets of the hydrological data having essentially the same distribution. The original dataset was split into clusters using the K-means algorithm, and the Kolmogorov-Smirnov hypothesis test was used to check the consistency of the splits. From those clusters, data values were selected randomly, without replacement, to create three datasets that preserve the user-defined percentage for each period. In addition, the maximum streamflow event was kept in the training period to ensure the maximum information gain in each model. Hundreds of hydrological models were constructed using an autoregressive Random Forest model, and the change in performance dispersion between the traditional data split and the new technique was examined. The outcome indicates an improvement in the robustness of the hydrological models identified using the random data-splitting method. This can be particularly useful for modeling catchments characterized by high skewness, such as those in arid and semi-arid zones.
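
The sketch below illustrates one possible reading of the splitting procedure described above (K-means strata, random sampling without replacement per cluster, the peak flow forced into the training set, and a Kolmogorov-Smirnov consistency check). The function name, number of clusters, and the 60/20/20 fractions are illustrative assumptions, not details taken from the study.

```python
# Minimal sketch of a distribution-preserving train/validation/test split,
# assuming scikit-learn K-means and SciPy's two-sample K-S test.
import numpy as np
from scipy.stats import ks_2samp
from sklearn.cluster import KMeans


def distribution_preserving_split(flow, fractions=(0.6, 0.2, 0.2),
                                  n_clusters=5, seed=0):
    """Split a streamflow series so the three subsets share a similar
    distribution: cluster the values, then sample each cluster without
    replacement according to the user-defined fractions."""
    rng = np.random.default_rng(seed)
    flow = np.asarray(flow, dtype=float)
    idx = np.arange(flow.size)

    # 1. Cluster the streamflow magnitudes with K-means (1-D feature).
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(flow.reshape(-1, 1))

    splits = {"train": [], "valid": [], "test": []}
    for c in range(n_clusters):
        members = idx[labels == c]
        rng.shuffle(members)
        # 2. Allocate this cluster's members by the requested fractions,
        #    sampling without replacement (shuffle + slice).
        n_train = int(round(fractions[0] * members.size))
        n_valid = int(round(fractions[1] * members.size))
        splits["train"].extend(members[:n_train])
        splits["valid"].extend(members[n_train:n_train + n_valid])
        splits["test"].extend(members[n_train + n_valid:])

    splits = {k: np.sort(np.array(v, dtype=int)) for k, v in splits.items()}

    # 3. Keep the maximum streamflow value in the training subset.
    peak = int(np.argmax(flow))
    for k in ("valid", "test"):
        splits[k] = splits[k][splits[k] != peak]
    if peak not in splits["train"]:
        splits["train"] = np.sort(np.append(splits["train"], peak))

    # 4. Check distributional consistency with two-sample K-S tests.
    ks_valid = ks_2samp(flow[splits["train"]], flow[splits["valid"]])
    ks_test = ks_2samp(flow[splits["train"]], flow[splits["test"]])
    return splits, ks_valid, ks_test
```

In this reading, repeating the call with different seeds yields the many random splits over which the dispersion of model performance can be compared against a single chronological split.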