Wednesday, April 1, 2020

Science, Statistics, Misinterpretation, and Missing Data

From a medical and immunological standpoint, we know a great deal about the novel coronavirus that causes COVID-19.  We know how it works, how it puts the body's immune system into high gear, how it kills, and how it's primarily transmitted.

What we don't know, or refuse to know, is the true data behind the outbreak.  In science, we know that statistics, when misapplied, can obscure the true nature of the data and, worse, can mislead.  Misinterpreted statistics lead to false conclusions and to actions based on bad data.  Missing data will, to varying degrees, skew statistics and the inferences drawn from them.  In hypothesis testing, there are two kinds of error: Type I and Type II.  A Type I error is a false positive: seeing an effect in the data that is not actually present.  A Type II error is a false negative: an effect is present and you miss it.  It is often the more insidious of the two and, in cases like the following, carries the greater consequence.  Let's say that you're being tested for antibodies that indicate cancer may be present.  If you get a false positive (Type I error), it may scare you and cause an emotional reaction, but you don't have cancer and subsequent tests will reveal that.  However, if your test comes back negative but you actually have cancer (Type II error), the consequences are potentially very bad.  When hypothesizing about and estimating the impact of the coronavirus, we have no idea whether we are committing either type of error, or none at all, because we are not scientifically testing the data; there is not yet a reliable data source from which to draw samples.
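To make the two error types concrete, here is a minimal sketch in Python that computes both rates from the four possible outcomes of a screening test.  The function name `error_rates` and all of the counts are hypothetical, invented purely for illustration; they are not coronavirus data.

```python
def error_rates(true_pos, false_pos, true_neg, false_neg):
    """Return (type_i_rate, type_ii_rate) from the counts of test outcomes."""
    # Type I rate: share of people WITHOUT the condition who test positive.
    type_i = false_pos / (false_pos + true_neg)
    # Type II rate: share of people WITH the condition who test negative.
    type_ii = false_neg / (false_neg + true_pos)
    return type_i, type_ii


# Hypothetical screening of 1,000 people, 50 of whom actually have the disease.
type_i, type_ii = error_rates(true_pos=45, false_pos=95, true_neg=855, false_neg=5)
print(f"Type I (false positive) rate:  {type_i:.1%}")
print(f"Type II (false negative) rate: {type_ii:.1%}")
```

Both rates are 10% in this made-up example, yet the consequences differ: the 95 false positives get a scare and a follow-up test, while the 5 false negatives walk away with an undetected disease.  Note that neither rate can be computed at all unless we know who truly has the condition, which is exactly the information that is missing right now.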

We are seeing lots of statistics from many different sources on the coronavirus outbreak; some confirm other data and some contradict it.  So which statistics are correct?  The long and short of it is that we don't know.  There is no single source of truth.  The numerators and denominators for basic rates and ratios are inconsistent at best and completely wrong at worst, with everything in between.  We do not know the actual transmission rates, we do not know the rate of natural immunity in the population, and we don't even know how many true cases there are in the population because not everybody is being tested.  Data on recovery rates have only just started to appear, so the true impact of this virus on the population is not yet known.
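The denominator problem is easy to see with a little arithmetic.  The sketch below, in Python, shows how the same death count produces very different death rates depending on what fraction of true infections were actually confirmed by testing.  The function name `fatality_rate`, the `detection_fraction` parameter, and every number are assumptions made up for illustration, not real outbreak figures.

```python
def fatality_rate(deaths, confirmed_cases, detection_fraction):
    """Deaths per true infection, assuming only `detection_fraction`
    of all infections show up as confirmed cases."""
    if not 0 < detection_fraction <= 1:
        raise ValueError("detection_fraction must be in (0, 1]")
    # Scale the confirmed count up to an estimate of all infections.
    estimated_infections = confirmed_cases / detection_fraction
    return deaths / estimated_infections


# Hypothetical: 20 deaths among 1,000 confirmed cases.
deaths, confirmed = 20, 1000
for fraction in (1.0, 0.5, 0.1):
    rate = fatality_rate(deaths, confirmed, fraction)
    print(f"If {fraction:.0%} of infections are confirmed, death rate = {rate:.2%}")
```

With these invented numbers the rate falls from 2% to 0.2% as the assumed share of detected infections drops from 100% to 10%.  The numerator has its own problems (deaths can be missed or misattributed too), so this sketch only illustrates one side of the uncertainty.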

What does the data tell us that we can trust?  Not much; however, there is mounting evidence (though still anecdotal) that the transmission rates are not as bad as we thought.  The story of the Grand Princess cruise ship is a case study in bad data, but the data that we do have indicates that even in a closed environment like a cruise ship the rates of infection are low.  There were over 3,400 people on that ship.  It was isolated for a time at sea and then the passengers were quarantined in close quarters at Travis Air Force Base.  Of those passengers, there are 712 confirmed cases (updated 5 Apr 20; previously 103) and two confirmed deaths; however, half of those who tested positive did not present symptoms.  Granted, only 1,103 of the passengers and crew were tested (again, giving us missing data), but the transmission rate does not appear to match what we are hearing from various sources.  The death rate doesn't seem to be much higher than that of a serious influenza outbreak, and most estimates of the death rate are not much higher either.  Some estimates are based on untested models, so here again, we have issues with the data.

In the end, keep up the social distancing, keep up with the recommended hygiene practices, and don't panic; there is nothing in the data to suggest that this is the end of the world.  Our behaviors and reactions will determine how long this will last and when we can expect a return to normalcy.