Why is my model so bad?: Tales of a wandering PhD student - CANCELLED DUE TO LOCKDOWN


George Box is often credited with the phrase ‘…all models are wrong, but some are useful.’ Whilst this makes sense to me now, it was frustrating in my early career, when I didn’t know whether I was wrong, my model was the wrong choice, I had coded it badly, or it was working as well as it could. This talk summarises part of my PhD work, in which I attempted to build predictive models on NHS Incident Reporting data in R. Firstly, the data quality was terrible. This is to be expected once you understand the context in which the data were generated, and I’ll discuss data generating mechanisms and some of our assumptions about them. To get around my poor data, I ‘borrowed strength’ from other datasets and used them at the same level of aggregation. This became a ‘panel’ dataset, and I was modelling count data. I will discuss our default assumption of Poisson regression (GLM), why it is not perfect with real-world data, and what overdispersion is. There are a variety of reasons for overdispersion, and my modelling proceeded by testing different sources and assumptions, including clustering, aggregation, and ‘noise’ from poor data quality. I will discuss methods for fitting models under these assumptions, including GEE, GLMM, bootstrapping / cross-validation, GAMs with smoothers, and a Random Forest thrown in for good measure. The ultimate question here is: ‘is my model any good?’ I will briefly discuss how we can assess predictive model accuracy, and a few different schools of thought on it. All examples will be presented with a why and a how, including R code made available on GitHub. I will focus both on how I chose and learnt the methods and on practical tips for analysing count data with overdispersion.
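
As a rough illustration of what overdispersion means for count data, here is a minimal sketch in Python (not the talk's R code). A Poisson model assumes the variance equals the mean, so for an intercept-only model the Pearson dispersion statistic reduces to the sample variance divided by the sample mean; values well above 1 suggest overdispersion. The function name `dispersion_ratio` and the example counts are made up for illustration and are not taken from the NHS data.

```python
from statistics import mean, variance


def dispersion_ratio(counts):
    """Sample variance divided by sample mean for a list of counts.

    For an intercept-only Poisson model this equals the Pearson
    chi-square statistic divided by its residual degrees of freedom
    (n - 1). A value near 1 is consistent with the Poisson assumption;
    a value well above 1 suggests overdispersion.
    """
    if len(counts) < 2:
        raise ValueError("need at least two observations")
    m = mean(counts)
    if m == 0:
        raise ValueError("mean count is zero; ratio is undefined")
    return variance(counts) / m


# Hypothetical monthly incident counts: mostly small, with a few spikes.
example_counts = [0, 1, 0, 2, 15, 0, 1, 30, 0, 3]
print(f"dispersion ratio: {dispersion_ratio(example_counts):.1f}")
```

With covariates in the model, the same idea applies to the Pearson residuals of the fitted GLM rather than to the raw counts.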

Apr 4, 2020