Some descriptive COVID-19 regressions

Having raised the bar so incredibly high with my last post, I now want to bring it down again and show you some unsophisticated data analysis.

Every day now you can see people all over the place comparing how countries are performing in this pandemic. How well is Germany doing? How does the UK compare to France? And what about Sweden: should we have followed their hands-off approach?

All of these comparisons lack one fundamental ingredient: meaningful data. The data everybody is using (and which I will be using in a minute) is riddled with measurement issues. Most important among them is the issue of testing: who gets tested, how fast, how many get tested – all of that varies from country to country and across time within a given country. Not even the death statistics are reliable, as we learned only this week when the UK drastically corrected its numbers upwards.

But I thought to myself, what the heck. If everyone’s doing it, I might be forgiven for having some fun as well. And so, in between grading final exams, I pulled together some country-level data and ran some regressions.

It goes without saying that this analysis has some, shall we say, shortcomings. All I’m doing is using regressions to describe some patterns in the data. Although I did have some mental model in mind when deciding which variables to include in my regressions, it was of the sort “I imagine X could have an effect on COVID deaths” rather than any deep causal understanding of how the epidemic works (but, frankly, does anyone have that?).

So without further ado, here’s what I did. I took the data from the European Centre for Disease Prevention and Control (ECDC), which gives me daily new cases and new deaths for each reporting country. I summed these up to April 30th to get cumulative cases and deaths, then divided by population to get cases and deaths per capita. These are my dependent variables.
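For readers who want to see the data step spelled out, here is a minimal sketch of the summing and dividing. The records, populations and the function name are made up for illustration; the actual ECDC file has more columns and this is not the STATA code used for the post.

```python
from datetime import date

# Toy daily records in the spirit of the ECDC file:
# (country, day, new_cases, new_deaths). All numbers are made up.
daily = [
    ("A", date(2020, 4, 29), 120, 4),
    ("A", date(2020, 4, 30), 80, 6),
    ("A", date(2020, 5, 1), 50, 2),   # after the cutoff, so it is ignored
    ("B", date(2020, 4, 30), 10, 0),
]
population = {"A": 1_000_000, "B": 500_000}  # hypothetical populations

CUTOFF = date(2020, 4, 30)


def cumulative_per_capita(records, population, cutoff):
    """Sum daily counts up to and including the cutoff, then divide by population."""
    totals = {}
    for country, day, cases, deaths in records:
        if day <= cutoff:
            c, d = totals.get(country, (0, 0))
            totals[country] = (c + cases, d + deaths)
    return {
        country: (c / population[country], d / population[country])
        for country, (c, d) in totals.items()
    }


# Maps each country to (cases per capita, deaths per capita).
per_capita = cumulative_per_capita(daily, population, CUTOFF)
```

Taking logs of these per-capita numbers then gives the dependent variable for a model in logs.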

For my regressors I went on a wild hunt on the World Bank and OECD databases and downloaded everything that I thought would be interesting to regress on COVID-19. After some fooling around, I settled on the following two models:

Model 1: cumulative COVID-19 cases per capita (in logs)

The first variable (lrgdp_pc) here is PPP-adjusted GDP per capita (in logs). This is the single most important variable in “explaining” the number of cases: richer countries have more official cases. The relationship is 1:1, i.e. one percent more income is associated with one percent more cases. It is almost useless to speculate about the “causal channels” for this effect. If I were to guess, I’d say that rich countries got the virus earlier and perform more tests per capita and therefore detect more cases.

The second variable (pop65) is the share of population above the age of 65. We know that seniors are more susceptible to this disease, so any sensible model must take the age structure into account. It’s reassuring that the coefficient is positive and significant. I take this as a sanity check for my model.

The next two variables are population density (pop_dens) and the share of urban population (urban). My “theory” here is that denser, more urban countries provide a more fertile environment for the virus to spread. Somewhat disappointingly, population density seems to have no effect and urbanization has only a small one (a 1 percentage point higher urban share gives you 1.4% more cases per capita). And no, density and urbanization are not highly correlated (corr=0.17), glad you asked.
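A quick aside on how percentage statements like these come out of a model in logs. The sketch below assumes the urban coefficient is about 0.014 (my reading of the “1.4%” figure, with urban measured in percentage points) and the income elasticity is 1; the function names are mine.

```python
import math


def pct_effect_semilog(coef, delta=1.0):
    """Exact percent change in y when x rises by delta and log(y) = ... + coef * x."""
    return (math.exp(coef * delta) - 1.0) * 100.0


def pct_effect_loglog(elasticity, pct_change_x):
    """Exact percent change in y when x rises by pct_change_x percent
    and log(y) = ... + elasticity * log(x)."""
    return ((1.0 + pct_change_x / 100.0) ** elasticity - 1.0) * 100.0


# Assumed coefficient of 0.014 on the urban share (in percentage points).
urban_effect = pct_effect_semilog(0.014)

# Unit elasticity: one percent more income, one percent more cases.
income_effect = pct_effect_loglog(1.0, 1.0)
```

For small coefficients the exact effect and the coefficient times 100 are practically the same, which is why one can read the numbers straight off the regression table.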

Lastly, I wanted to check if more open countries are more exposed. I tried to capture that with the trade share (exports plus imports divided by GDP). The answer seems to be a clear no. Being more open to international trade is not associated with more infections. In an alternative specification I checked if imports from China had a positive effect and was disappointed.

I direct your attention to the fact that the R-squared of this regression is 68.5%. I have seen papers published in decent journals with much worse goodness of fit given the sample size and number of regressors. Just saying.

Model 2: cumulative COVID-19 deaths per capita

Turning to coronavirus deaths, the first important “explanatory” variable is the number of cases (lcases_pc). Again, this is nothing more than a sanity check.

I then add all the variables from the previous model to see if they have an effect on deaths over and above the effect they have through the number of cases. Unsurprisingly, an older population has the expected positive effect on deaths: raising the share of old people by 1 percentage point raises deaths per capita by 11% (in addition to the effect through cases).

More surprising are the effects of population density and urbanization. It looks like, after controlling for the number of cases, being a denser, more urban country reduces the number of deaths. I suppose this can make sense: given the number of infections, living closer together and in cities means living closer to hospitals, which might improve the chances of getting timely and effective treatment. But this is getting dangerously close to over-interpretation of weak effect estimates (small, barely significant coefficients).

The last variable is the number of hospital beds per 1000 people. The estimated coefficient suggests that each additional bed per 1000 inhabitants lowers the number of deaths by about 15%. Austria has 7.37 beds per 1000 people, the European average is 5. So bringing all the countries of Europe to the level of Austria would cut the death rate by about 36%. That’s a big effect.
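The back-of-the-envelope arithmetic behind that number can be spelled out as follows. The sketch assumes the “about 15%” corresponds to a coefficient of -0.15 on beds in a regression with log deaths per capita on the left-hand side (the post does not say so explicitly for Model 2). Under that assumption, the linear reading gives the 36%, while compounding the effect gives a somewhat smaller cut.

```python
import math

beds_austria = 7.37     # beds per 1000 people, from the post
beds_europe = 5.0       # European average, from the post
effect_per_bed = 0.15   # "about 15%" per bed, read as a coefficient of -0.15 (assumption)

gap = beds_austria - beds_europe  # 2.37 additional beds per 1000 people

# Linear approximation: 15% per bed times the gap, about 36%.
linear_cut = effect_per_bed * gap * 100.0

# Exact effect if the dependent variable is in logs: about 30%.
compound_cut = (1.0 - math.exp(-effect_per_bed * gap)) * 100.0
```

Either way it is a big effect; the two numbers differ because a change of 2.37 beds is too large for the “coefficient times 100” shortcut to be accurate.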

I also toyed around with various measures of health care spending (per capita or as a share of GDP). In all the regressions I checked, health spending had a positive effect, which I couldn’t make sense of. My best guess is that, conditional on hospital beds per capita, spending more on health is a sign that your health system is too expensive and inefficient, which is associated with both more cases and more deaths. But it’s still kind of a head scratcher.

Excess Cases and Deaths

OK. Having run these regressions and found some interesting patterns, what else can we learn from them?

One thing is that the regression model provides a benchmark to evaluate how individual countries are doing. Admittedly, this is risky business, given the poor data quality. But I’m putting it out there nevertheless.

Below, I’m plotting the excess cases and excess deaths per capita for a number of countries. Excess cases is the difference between the actual cases and the number of cases predicted by the model. Excess deaths are calculated analogously. (Attentive readers will realize that these are just the regression residuals.) The vertical axis shows cases and deaths per 100,000 people.
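To make the “excess equals residual” point concrete, here is a minimal sketch with a single regressor and made-up numbers. The conversion from log residuals to cases per 100,000 people is one possible way of doing it and is my assumption; it is not necessarily how the plots were produced.

```python
import math

# Made-up toy data: log GDP per capita and log cumulative cases per capita.
log_gdp = [8.0, 9.0, 10.0, 11.0]
log_cases = [-9.0, -8.2, -6.8, -6.0]


def ols_fit(x, y):
    """Least-squares intercept and slope for a single regressor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = ols_fit(log_gdp, log_cases)
predicted = [intercept + slope * xi for xi in log_gdp]

# Residuals in log units: actual minus predicted.
residuals = [yi - pi for yi, pi in zip(log_cases, predicted)]

# One possible conversion to excess cases per 100,000 people (assumption):
# exponentiate actual and predicted values, take the difference, rescale.
excess_per_100k = [
    (math.exp(yi) - math.exp(pi)) * 100_000
    for yi, pi in zip(log_cases, predicted)
]
```

A positive residual means more cases than the model predicts and a negative one means fewer; by construction the residuals in log units sum to zero across the sample.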

Three countries stand out in terms of excess cases: Italy, the UK and the US. Their case numbers are far higher than what one would expect on the basis of their country characteristics.

The “worst performers” among the selected countries in terms of excess deaths are France, Britain and Italy.

China and Korea have negative excess cases and no excess deaths. That is, these countries have fewer cases (and neither fewer nor more deaths) than the model predicts.

Notice that Sweden has excess cases similar to those of Germany and Austria, but far higher excess deaths. Make of that what you will.

(Data file and STATA code are available on request.)

Recreational econometrics with GERD

In her last post, Katharina pointed to a great data source on R&D expenditure, GERD. In the comment section of that post we discussed the issue of cuts in government R&D expenditures. An interesting question in this context is whether public R&D expenditure is a complement to private R&D expenditure or a substitute. If it is a complement, cuts in the public R&D budget are very bad, because they can be expected to be followed by cuts in private research budgets. If it is a substitute, public R&D ‘crowds out’ private R&D, so that public cuts are not that bad because they can be expected to be replaced by private R&D.

There is an extensive literature on this question, yielding mixed results. So I asked myself: what does GERD say? The figure below shows a scatterplot of government and private expenditure on R&D as a share of GDP for 36 countries in 2008. You can see that the data points are pretty much all over the place (Austria is marked red). It turns out that if you regress private on public R&D expenditure, you get a positive coefficient indicating complementarity. However, the coefficient is not statistically significant (t-ratio of 1.63) and the R-squared is very low. So we have no strong evidence for complementarity, but also no evidence for substitutability. Instead, what my recreational econometrics exercise suggests is that private R&D expenditure is pretty much independent of government research budgets.
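For the curious, a regression like this takes only a few lines. The sketch below computes the slope, its t-ratio and the R-squared for a simple regression; the six data points are invented for illustration and are not the GERD numbers.

```python
import math


def simple_ols(x, y):
    """OLS of y on a constant and x; returns slope, its t-ratio and R-squared."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Sum of squared residuals and total sum of squares.
    ssr = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    # Standard error of the slope with n - 2 degrees of freedom.
    se_slope = math.sqrt(ssr / (n - 2) / sxx)
    return {
        "slope": slope,
        "t_ratio": slope / se_slope,
        "r_squared": 1.0 - ssr / sst,
        "n": n,
    }


# Invented shares of GDP (in percent): public and private R&D expenditure.
public_rd = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
private_rd = [1.0, 2.0, 0.8, 1.9, 1.1, 2.1]

result = simple_ols(public_rd, private_rd)
```

With 36 observations the two-sided 5% critical value for the t-ratio is roughly 2.03, so a value of 1.63 falls short of conventional significance.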