More Data or a Better Model?

A new paper published in the Soil Science Society of America Journal by Sanjeewani Nimalka Somarathna Pallegedara Dewage, written as part of her PhD, tries to answer this question by working out what matters most for digital soil mapping (DSM) of soil carbon.

In this study, Sanjee examined how different spatial modelling techniques perform at predicting soil organic carbon (SOC) when trained on varying sample sizes. The algorithms range from simple linear models to complex machine learning techniques.

The models compared were multiple linear regression (MLR), geographically weighted regression (GWR), linear mixed models (LMMs), Cubist regression trees, quantile regression forests (QRFs), and the ‘deep learning’ extreme learning machine regression (ELMR).

The results showed that, for the study site in the Hunter Valley, Australia, the accuracy of spatial prediction of soil carbon is more sensitive to training sample size than to the model type used. Prediction accuracy initially increases exponentially with increasing sample size and eventually reaches a plateau. Different models reach their maximum predictive potential at different sample sizes. Furthermore, the uncertainty of model predictions decreases as the training sample size increases.
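To make the idea of a learning curve concrete, here is a minimal sketch of how one can be computed: fit a model on nested training subsets of increasing size and score each fit on the same held-out set. Everything in it is an assumption for illustration only: the data are synthetic, the function name `learning_curve` is hypothetical, and an ordinary least-squares fit stands in for the models compared in the paper. It does not reproduce the study's data or results.

```python
import numpy as np


def learning_curve(X_train, y_train, X_test, y_test, sizes, seed=0):
    """Return the test RMSE of an ordinary least-squares fit for each training size."""
    rng = np.random.default_rng(seed)
    # Shuffle once so the subsets are nested: each size extends the previous one.
    order = rng.permutation(len(y_train))

    def with_intercept(X):
        return np.column_stack([np.ones(len(X)), X])

    A_test = with_intercept(X_test)
    rmse = []
    for n in sizes:
        idx = order[:n]
        coef, *_ = np.linalg.lstsq(with_intercept(X_train[idx]), y_train[idx], rcond=None)
        residuals = y_test - A_test @ coef
        rmse.append(float(np.sqrt(np.mean(residuals ** 2))))
    return rmse


# Synthetic stand-in for covariates and a soil property (not real soil data).
rng = np.random.default_rng(42)
n_total, n_covariates = 3000, 5
X = rng.normal(size=(n_total, n_covariates))
true_coef = rng.normal(size=n_covariates)
y = X @ true_coef + rng.normal(scale=1.0, size=n_total)

# Fixed held-out set so every training size is scored on the same points.
X_train, y_train = X[:2000], y[:2000]
X_test, y_test = X[2000:], y[2000:]

sizes = [10, 25, 50, 100, 300, 1000]
curve = learning_curve(X_train, y_train, X_test, y_test, sizes)
for n, r in zip(sizes, curve):
    print(f"n = {n:4d}  test RMSE = {r:.3f}")
```

Scoring every training size on the same held-out points keeps the curve comparable across sizes; swapping in a different model only requires replacing the least-squares step.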

The conclusion from this Hunter Valley study site is that the prediction accuracy of SOC is determined mostly by sample size. Model type also influences accuracy, though to a lesser extent. A complex deep learning model such as ELMR performs worse than the simple linear mixed model.

This study also shows diminishing returns in model accuracy as the number of samples increases. For this area of 2200 hectares, the optimum number of samples is 300; beyond that, adding more samples does not improve accuracy significantly.
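One simple way to read a plateau off a learning curve is to look for the smallest sample size after which every further step improves the error by less than some tolerance. The sketch below assumes a relative tolerance of 2%; that threshold, the function name `plateau_size`, and the numbers in the example are all made up for illustration, and this is not the criterion or the data used in the paper.

```python
def plateau_size(sizes, rmse, rel_tol=0.02):
    """Smallest size after which each further step lowers RMSE by less than rel_tol."""
    if not sizes or len(sizes) != len(rmse):
        raise ValueError("sizes and rmse must be non-empty and of equal length")
    for i in range(len(sizes)):
        # Relative RMSE improvement of every later step, starting from size i.
        gains = [(rmse[j] - rmse[j + 1]) / rmse[j] for j in range(i, len(sizes) - 1)]
        if all(g < rel_tol for g in gains):
            return sizes[i]


# Illustrative numbers only (not taken from the study).
example_sizes = [10, 20, 40, 80, 160]
example_rmse = [2.0, 1.5, 1.2, 1.19, 1.185]
print(plateau_size(example_sizes, example_rmse))  # prints 40
```

The result depends directly on the tolerance chosen, so in practice it is worth weighing the threshold against the cost of collecting additional samples.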

The learning curve for different models trained at various sample sizes. Note that although ELMR has a low RMSE, it has a large bias, which gives it a low accuracy. There are diminishing returns from increasing the sample size.

Soil carbon content (%) (top) and the associated prediction variance (bottom) according to the linear mixed model for sample sizes of 300, 500 and 1000, for topsoil.
