## A restart?

The COVID-19 epidemic was over.

Throughout the month of May, analyses of the wastewater from the Cape Canaveral treatment plant in Florida detected no trace of the new coronavirus, at least until the week of May 27. With the historic launch of American astronauts Bob Behnken and Doug Hurley to the ISS, crowds converged on the Cape, and tests that week revealed a concentration of the virus corresponding to at least 85 COVID-19 patients.

Four weeks later, between June 29 and July 12, the people dying from COVID-19 in Florida were much younger than before, with more than 20% under the age of 65.

In early June, according to ABC News, the COVID-19 epidemic seemed to have restarted everywhere and American experts were warning of a possible second wave. In Texas, the governor was forced to reinstate the requirement to wear a mask, and his Louisiana counterpart warned that “progress against the coronavirus [had] been wiped out in the past three weeks”, with intensive care operating at full capacity in certain hospitals. Not only were cases and hospitalizations increasing (up to 97% in Orange County, California), but deaths were also increasing.

Israel, which had handled the crisis relatively well, closed its bars, nightclubs and gyms. Morocco placed the city of Safi and its 300,000 inhabitants in quarantine, subjecting them to a total lockdown.

In Spain, which handled the crisis in as disorderly and calamitous a fashion as France and Belgium, the government “now finds itself with new outbreaks which force it to return to confinements which, this time, are localized” explains Arnaud Fontanet, an epidemiologist at the Pasteur Institute (and a member of the French Scientific Council on the Coronavirus), adding that the situation in Spain “is really a warning signal for [the French]”.

Are we dealing with a second wave? Or is it the beginning of the end? Nothing is certain, and the answer depends on two factors: the degree of contagion R_{0} and the actual serology.

## Basic epidemiology

In epidemiology, the basic reproduction number or R_{0} of an infection is the *average* number of new cases generated *by each patient* in a population where *all individuals are susceptible* and *before* specific *prophylactic measures are taken*:

- *average* because some individuals do not infect anyone while others are the precursors to multiple chains of infection (the “super-spreaders”),
- a *susceptible population* because R_{0} only has meaning at the start of an epidemic when all the individuals are healthy except one,
- and *before* behavior changes, either voluntarily or by State diktats, i.e. at the start of an epidemic.

For COVID-19, 80% of new transmissions are caused by fewer than 20% of carriers, according to a recent article (preprint) on transmissions in Hong Kong. The vast majority of patients infect few or no people. Only a small minority of individuals, the “super-spreaders”, spread the virus aggressively, as happened in churches in South Korea and Washington State.

R_{0} is therefore not a single fixed number: it is in fact a random variable, summarized by its mean. Epidemiologists also measure its dispersion with a parameter k (a kind of inverse of its variance), which is lower when the disease spreads in many clusters. When the disease has few or no clusters, k is close to 1.0, as for seasonal influenza.

In 2005, in a fundamental article in Nature, Lloyd-Smith and his co-authors estimated that SARS-CoV (from 2003) – in which over-propagation played a major role – had a k of 0.16. The estimated k for MERS, which appeared in 2012, is thought to be around 0.25. For SARS-CoV-2, estimates of k range from 0.10 to 0.20 depending on the source, with a 95% confidence interval (95% CI) of 0.04 to 0.20 for the lower estimate.

If k is truly less than or equal to 0.10, most infections do not give rise to other infections. However, in a few cases, one patient infects dozens of others.
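To put a rough number on this, one common modelling choice (the one used by Lloyd-Smith and his co-authors) is to assume that the number of people each patient infects follows a negative binomial distribution with mean R_{0} and dispersion k. The sketch below is only an illustration under that assumption; the function name and the value R_{0} = 3.0 are illustrative choices, not results from any study.

```python
def prob_no_transmission(r0: float, k: float) -> float:
    """Probability that a patient infects nobody, assuming a negative
    binomial number of secondary cases with mean r0 and dispersion k."""
    return (1.0 + r0 / k) ** (-k)


# Illustrative comparison: same average contagion, different dispersion.
for k in (0.10, 0.16, 1.0):
    share = prob_no_transmission(3.0, k)
    print(f"k = {k:4.2f}: {share:.0%} of patients infect nobody")
```

Under these assumptions, roughly 70% of patients infect nobody when k = 0.10, against 25% when k = 1.0, even though the average number of infections per patient is the same.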

To illustrate the problem, consider two phylogenetic trees (which I made with R). In both, I fixed the transmission at R_{0}=3.0 starting from the “patient zero” in red.

For the first, on the left, I assumed that there was no variance (zero standard deviation): patient zero infects 3 people (in blue) who, in turn, each infect 3 more, giving {1, 3, 9, 27, 81, …}. The incubation period is fixed at one unit of time.

For the second tree, on the right, I introduced a non-zero standard deviation. The R_{0} is also three, but the patients infect 3 people *on average* with a non-zero variance. The incubation period is also variable but *with the same average*.

Clearly, in both cases, the R_{0} of the disease is 3.0, but in the second case, the distribution of the random variable represented by R_{0} is actually different. Many patients transmit the disease only once. Others transmit it more than ten times: in the end, it is the same (average) rate of contagion.

Except that we understand that, if the disease behaves as in the graph on the left, it cannot be dormant for a long time. In fact, *it cannot be dormant at all!*

On the contrary, if the k of SARS-CoV-2 is close to 0.10, that is to say if the disease behaves more like the phylogenetic tree on the right, there is a distinct possibility that the chain that started from Wuhan’s patient zero was just a long, thin branch of under-propagators before it finally encountered a super-spreader who ultimately detonated COVID-19 into a global pandemic:

In the example above, patient zero (in green) infects only one other person, who in turn infects only one, and so on until the chain reaches a super-spreader (in blue). Instead of having {1, 6, 7, 26, 81, 243, …} sick people in each generation, we have {1, 1, 1, 1, 1, 1, 6, 7, 26, 81, 243, …}. This changes (almost) nothing for R_{0} or for the empirical variance, because both are dominated by the later (large) generations.

On the other hand, this changes everything for the history of the pandemic. In the case of COVID-19, patients take an average of 5.2 days to incubate and are sick for 14 days (*on average*). A “generation” of patients is therefore 5.2 + 7 days long (the incubation period plus half of the illness): the start of our series {1, 1, 1, 1, 1, 1, 6} therefore unfolds (*on average*) over 73 days instead of the 12 days of the previous example {1, 6}!
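The arithmetic can be checked in a few lines. This is a minimal sketch that assumes, as the 5.2 + 7 above suggests, that transmission occurs on average halfway through the 14 days of illness; the constant and function names are illustrative.

```python
INCUBATION_DAYS = 5.2   # average incubation period quoted in the text
ILLNESS_DAYS = 14.0     # average duration of illness quoted in the text

# Assumption: transmission happens, on average, halfway through the illness.
GENERATION_DAYS = INCUBATION_DAYS + ILLNESS_DAYS / 2


def days_until_last_generation(generation_sizes):
    """Average time elapsed between the first and last generation of a chain."""
    transmissions = len(generation_sizes) - 1
    return transmissions * GENERATION_DAYS


slow = days_until_last_generation([1, 1, 1, 1, 1, 1, 6])
fast = days_until_last_generation([1, 6])
print(f"slow chain: {slow:.0f} days, fast chain: {fast:.0f} days")
```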

If the coefficient k is exceptionally low, it is possible (but unlikely) to have such phylogenetic chains: a disease that lies dormant for weeks, with each patient transmitting it to only one or two others. A phylogenetic tree without branches or with very small branches that have no leaves. Or almost none…

On the other hand, as soon as cases become numerous, that is to say as soon as this tree has many branches, the probability that all of them die out becomes close to zero.

This leaves room for a few possible cases outside of China, for example one case in Paris, two in Brazil, one in Milan, etc. between patient zero and December 31, 2019. Each of these branches died out or stagnated.

But these branches, in the countries I have just mentioned, cannot have had more than a few cases. Otherwise, we would have seen the exponential growth of Wuhan in January, of Milan in February, of Paris in March, of New York City in April…

Either way, while any hypothetical phylogenetic tree must be reconciled with the clinical and genetic data we have, it is obvious that a simple simulation by the Monte Carlo method would reveal multiple scenarios – implausible but not impossible – where, without changing either R_{0} or k, SARS-CoV-2 passes from an animal to the first human months before fifty cases appeared in Wuhan at the end of December 2019.
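Such a Monte Carlo simulation could look like the sketch below. It is not the R code behind the trees above: it assumes a negative binomial number of secondary cases (drawn as a gamma-Poisson mixture), treats 50 cumulative cases as the moment the outbreak “takes off”, and calls a start “slow” when that takes 6 generations or more. All function names and these thresholds are illustrative assumptions.

```python
import random


def draw_secondary_cases(rng, r0, k):
    """Negative binomial draw (mean r0, dispersion k) as a gamma-Poisson mixture."""
    rate = rng.gammavariate(k, r0 / k)  # this patient's own infectiousness
    # Poisson draw: count unit-rate exponential arrivals before time `rate`.
    count, elapsed = 0, rng.expovariate(1.0)
    while elapsed < rate:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count


def simulate_chain(rng, r0, k, takeoff=50):
    """Follow one introduction; return (took_off, generations elapsed)."""
    current, total, generation = 1, 1, 0
    # Ends quickly: the chain dies out or gains at least one case per generation.
    while True:
        generation += 1
        current = sum(draw_secondary_cases(rng, r0, k) for _ in range(current))
        total += current
        if current == 0:
            return False, generation
        if total >= takeoff:
            return True, generation


def summarize(r0, k, trials=2000, seed=1):
    """Share of chains that die out, and share that take off after 6+ generations."""
    rng = random.Random(seed)
    results = [simulate_chain(rng, r0, k) for _ in range(trials)]
    waits = [g for took_off, g in results if took_off]
    died_out = 1 - len(waits) / trials
    slow_starts = sum(g >= 6 for g in waits) / trials
    return died_out, slow_starts


for k in (0.10, 10.0):
    died_out, slow_starts = summarize(3.0, k)
    print(f"k = {k}: {died_out:.0%} of chains die out, "
          f"{slow_starts:.1%} need 6+ generations to reach 50 cases")
```

Running it with the same R_{0} and two very different values of k gives a feel for how much more often a low-k disease fizzles out or smolders before exploding.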

What is important to remember is that the lower k is, i.e. the more variable the R_{0}, the less implausible such scenarios really are.

Recent calculations of the variance in the transmission of SARS-CoV-2 – which appears to be very high – would therefore explain many odd observations about COVID-19:

- First of all, the virus would have had to be exported on average at least 4 times from country X to country Y for the epidemic to start in country Y. This would explain the slow initial spread of the virus from one region to another.
- This reinforces the cluster effects: geography plays a preponderant part (here) in COVID-19. As my former colleagues at the Center for Data Analysis have shown, COVID-19 in New York and New Jersey spread almost entirely along the dense network of commuter trains that lead to Manhattan. This implies that *undifferentiated confinement policies are ineffective* (here and there).
- This would also confirm the genetic data (which we quoted in an article at the beginning of February) indicating a 90% probability that SARS-CoV-2 appeared between June 27 and October 29, 2019, approximately 3 to 6 months before the epidemic, according to Bayesian coalescent phylogenetic analysis and the estimated nucleotide substitution rate (molecular clock). Of course, this leaves a 5% probability that it appeared *before* and a 5% probability that it appeared *after* these dates.
- This would allow for the possibility that traces of SARS-CoV-2 really were found in a wastewater sample from Santa Catarina, Brazil, in November 2019!
- In the same vein, SARS-CoV-2 could have been present in Barcelona’s wastewater on March 12, 2019! If this was not a handling error or a false positive of the PCR tests (unlikely but possible), a sick resident of Wuhan would have had to visit the Catalan capital at the time. This seems very unlikely to me but, given the very low k of SARS-CoV-2, not completely impossible.

If these last two results are correct and, *a fortiori*, if we discover new ones, then a great many branches of the chains of infection die out by themselves. And there are (necessarily) many more super-spreaders than we think!
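The first point in the list above, the number of times the virus must be exported before an epidemic starts, can be approximated without simulation. The sketch below assumes a negative binomial number of secondary cases and iterates its generating function to find the probability that a single imported case leads nowhere. The values R_{0} = 3.0 and k = 0.10 and the function name are illustrative assumptions.

```python
def extinction_probability(r0, k, iterations=200):
    """Smallest fixed point of the negative binomial generating function,
    i.e. the chance that the chain started by one case eventually dies out."""
    q = 0.0
    for _ in range(iterations):
        q = (1.0 + (r0 / k) * (1.0 - q)) ** (-k)
    return q


q = extinction_probability(3.0, 0.10)
print(f"chance that one imported case fizzles out: {q:.0%}")
print(f"average number of imports before one takes off: {1 / (1 - q):.1f}")
```

With these assumed values, more than four imported cases out of five lead nowhere, so about six imports are needed on average, which is in line with the “at least 4 times” quoted above.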

## The SIR and SEIR models

Epidemiologists classify a given population into “compartments”: at each instant t, there are healthy susceptible people S(t), infected patients I(t), and recovered people R(t). Sometimes, we also consider exposed patients E(t) who are in the incubation period. For certain diseases, we need other compartments, which I will pass over.

The models are therefore called SIR or SEIR, depending on their compartments.

People move from one compartment to another. At every instant t, susceptible people S(t) are exposed E(t) then become infected I(t) and recover R(t) from the disease or die D(t) (if there is a compartment for deaths). The infected population I(t) cannot infect people who have recovered R(t) (nor, of course, those who have died).

Evidently, the sum S(t) + E(t) + I(t) + R(t) (plus, possibly, D(t) for deaths) is equal to the starting population N that we can take as constant for COVID-19.

Initially, I(0) = 1 for the patient zero, and S(0) = N – 1.

The derivative S’(t) of S(t) with respect to time is -β.I(t).S(t)/N, where β is *the average number of infection-producing contacts per unit of time*. As β, I(t) and S(t) are positive, the number of susceptible people decreases.

The derivative of infections I’(t) is equal to the flow of new patients, -S’(t), minus those who recover at a rate γ (which is also the *inverse of the average period of infection*). As a result, I’(t) = β.I(t).S(t)/N – γ.I(t).

In plain English, the change in the number of infected people is equal to the new infections, i.e. a constant β multiplied by the number of patients I(t) multiplied by the proportion of healthy people, which is S(t)/N, minus the number of people who recover, which is a fraction γ of those who are sick, I(t).

Of course, the derivative of those who heal R’(t) is equal to γ.I(t), i.e. the portion of the infected who heal at each period of time.

People move from one compartment to another without disappearing: the sum of the changes is zero. If I have +n people in a compartment, I have -n people in the others. The sum of the derivatives of these functions is therefore equal to 0!
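The three equations above can be integrated numerically in a few lines. This is a minimal sketch using a simple forward Euler scheme, not a production solver; the population size, the 14-day infectious period and the ratio β/γ = 3 are illustrative values, and the function name is hypothetical.

```python
def simulate_sir(population, beta, gamma, days, dt=0.1):
    """Integrate the SIR equations with a simple forward Euler scheme."""
    s, i, r = population - 1.0, 1.0, 0.0  # S(0) = N - 1, I(0) = 1
    history = [(0.0, s, i, r)]
    for step in range(1, round(days / dt) + 1):
        new_infections = beta * i * s / population * dt  # -S'(t) dt
        new_recoveries = gamma * i * dt                  # R'(t) dt
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((step * dt, s, i, r))
    return history


# Illustrative values only: a 14-day infectious period and beta / gamma = 3.
N, GAMMA = 1_000_000, 1 / 14
BETA = 3.0 * GAMMA

history = simulate_sir(N, BETA, GAMMA, days=500)
peak_day, s_peak, i_peak, _ = max(history, key=lambda row: row[2])
print(f"infections peak around day {peak_day:.0f}, when S/N = {s_peak / N:.2f}")
```

Because every person who leaves one compartment enters another, S + I + R stays equal to N at every step, which is a convenient sanity check on any implementation.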

These ordinary differential equations (ODE) are extremely common and are found in all demographic models, in actuarial life insurance models, in economics, in chemistry, in econometrics, in medicine and in some artificial neural networks.

What is the expected number of new infections in a population where all subjects are susceptible except one, the patient zero? It is obviously R_{0} by its very definition!

But since β is the average number of infection-producing contacts per unit of time and since 1/γ is the average period of infection, R_{0} = β/γ.

Without solving a single equation! There! We are now all R_{0} specialists, like every other Facebook user!

## R_{0} is quintessential

The R_{0} ratio is quintessential: simple and completely intuitive – “how many people will I infect?” –, it thus appears naturally in the mathematics of epidemiological models.

If R_{0} is less than 1.0, then the derivative I’(t) above is negative and I(t), the number of people infected, decreases: the epidemic stops!

Conversely, if R_{0} is greater than 1.0, then the epidemic does not stop until a fraction 1 - 1/R_{0} of the population has been infected or immunized. This is a simple consequence of the differential equations that govern the dynamics of the epidemic.
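That consequence takes only a line of algebra. Factoring the equation for I’(t) given earlier and substituting R_{0} = β/γ gives the following (a short derivation from the equations already stated, assuming I(t) > 0):

```latex
\begin{aligned}
I'(t) &= \beta\,\frac{I(t)\,S(t)}{N} - \gamma\,I(t)
       = \gamma\,I(t)\left(R_{0}\,\frac{S(t)}{N} - 1\right),\\
I'(t) < 0 &\iff \frac{S(t)}{N} < \frac{1}{R_{0}}
          \iff 1 - \frac{S(t)}{N} > 1 - \frac{1}{R_{0}}.
\end{aligned}
```

Since S(t)/N can never exceed 1, an R_{0} below 1.0 makes I’(t) negative from the start. Above 1.0, infections keep growing until the share of the population that is no longer susceptible exceeds 1 - 1/R_{0}.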

The R_{0} determines the outcome: if no one is immunized and if I pass my virus to three people on average (i.e. if R_{0} = 3.0) while I am infectious, I will infect three new patients. Conversely, as soon as two thirds of the people I meet are immunized, I will only infect one, since the other two will be immune.

Thus, if R_{0} is 2.0, then the epidemic ends when 50% of the population has been infected and recovered. If R_{0} is 3.0, then almost 67% of the population must have been infected before things get back to normal. If R_{0} is 4.0, then 75% of the population will be infected.

Again, *the processes involved are random and the infection develops in clusters* (as we can see in these fascinating simulations).

When the 2009 H1N1 flu pandemic started, its R_{0} was estimated at between 1.2 and 1.6, which implied that 16.7% to 37.5% of the population would end up sick. The United States has 325 million inhabitants, so 16.7% to 37.5% represents 54 to 122 million people. In the end, 61 million Americans were infected according to the CDC.
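The percentages quoted in the last few paragraphs all come from the same formula, 1 - 1/R_{0}, and are easy to reproduce. The sketch below only redoes that arithmetic with the figures given in the text; the function name is illustrative.

```python
def herd_immunity_threshold(r0):
    """Share of the population that must be immune for infections to decline."""
    return 1.0 - 1.0 / r0


for r0 in (2.0, 3.0, 4.0):
    print(f"R0 = {r0}: {herd_immunity_threshold(r0):.1%}")

# 2009 H1N1 in the United States, using the figures quoted in the text.
US_POPULATION_MILLIONS = 325
low = herd_immunity_threshold(1.2) * US_POPULATION_MILLIONS
high = herd_immunity_threshold(1.6) * US_POPULATION_MILLIONS
print(f"H1N1: {low:.0f} to {high:.0f} million people")
```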

## The basic reproduction number of SARS-CoV-2

For SARS-CoV-2, current estimates of R_{0} in the scientific literature range from 0.91 (in Lithuania, end of March) to 7.4 (in Turkey, in April).

Of all the scientific articles on the subject of R_{0} – about 120 since the start of the crisis – it is possible to retain 96 estimates based on solid studies (for example here, here or there). To summarize, we can refer to the period covered by the data from each study and examine the results for the different regions (with China in red and Wuhan in purple):

As can be seen, the vast majority of studies find an R_{0} of approximately 3.0 for SARS-CoV-2. There are, however, some notable exceptions: either low values in rural areas, or high values that can be attributed to bad habits (political or sanitary).

Certain countries like Croatia, Lithuania, Slovakia, or Bulgaria clearly benefited from the experience of their neighbors: in fact, the population had already changed their habits when SARS-CoV-2 appeared and the R_{0} calculated using the early data is very low.

Notably, this is also the case in India, where the R_{0} is very low (even if doubts persist regarding data quality). In practice, this is excellent news, since it means that the disease is less contagious there. In theory, however, it is not a good thing: R_{0} is the basic reproduction number, that is to say the one that would prevail *naturally without behavioral changes*.

Studies using data at the start of the epidemic in a given region are therefore more reliable than those that would use data after the epidemic began to change habits.

As a fascinating article from Harvard Magazine explains, these low measures from April and May potentially hide a less rosy reality: epidemiologist Marc Lipsitch, a professor at the Harvard T.H. Chan School of Public Health, points out that “there is a view that [the R_{0}] is as high [as 5.7]. It still seems to be a minority view, but I think it is a credible one”, even if he cautiously adds that he has not yet decided “how strongly to weight [the] possibility” that the number reaches 5 or 6.

There are many empirical methods for calculating R_{0}. These 96 estimates come from half a dozen of these methods. In fact, even with the same data applied to the same SIR model, it is possible to obtain small differences of a few tenths of a point depending on the calculation method. In practice, each result should be interpreted according to the underlying theory.

That being said, there is little chance that all of these studies are wrong at the same time:

- First of all, not all epidemiologists are systematically incompetent.
- We would have discovered that the SIR or SEIR models are faulty a long time ago.
- The mathematics they contain are very well understood and used in many areas of the physical, biological and human sciences.
- Finally, the SIR and SEIR models work very well for all other epidemics.

## What do these studies tell us?

Even if we recognize that R_{0} is not the only important measure of the epidemic, we have to accept the idea that there is some truth to these results.

Paradoxically, all the more so because the range of estimates is wide: there is little chance that none of them has “guessed” the “real” R_{0} correctly.

Since 80% of the aforementioned studies give an R_{0} between 1.62 and 4.71, at least 38% to 79% of the general population would have to be immunized – either by infection, naturally, or by a vaccine – in order to be reasonably sure that the epidemic is coming to an end.

If we consider the (unweighted) average of all these studies, 3.14, then 68% of the population would need to be immune before the COVID-19 disaster really ends.

The question is not neutral. In the case of SARS-CoV-2, an R_{0} of 5 instead of 2 would require billions of additional doses of vaccine worldwide. Worse, if R_{0} is high, a sufficient percentage of people refusing the vaccine would be enough for the disease to become endemic. And if collective immunity is only obtained through infection, millions of deaths worldwide would occur before the end of the pandemic.
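To give an order of magnitude for what is at stake, the sketch below applies 1 - 1/R_{0} to the estimates discussed above and compares an R_{0} of 5 with an R_{0} of 2. It assumes a world population of about 7.8 billion and one dose per person, neither of which comes from the studies cited; the names are illustrative.

```python
def immune_share_needed(r0):
    """Share of the population that must be immune, i.e. 1 - 1/R0."""
    return 1.0 - 1.0 / r0


WORLD_POPULATION = 7.8e9  # assumption: rough 2020 figure, not from the text

estimates = (("low end", 1.62), ("unweighted mean", 3.14), ("high end", 4.71))
for label, r0 in estimates:
    print(f"{label} (R0 = {r0}): {immune_share_needed(r0):.1%} immune needed")

# Assumption: one dose per person, purely to get an order of magnitude.
extra_people = (immune_share_needed(5.0) - immune_share_needed(2.0)) * WORLD_POPULATION
print(f"R0 of 5 instead of 2: about {extra_people / 1e9:.1f} billion more people to immunize")
```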

Everyone who thinks it’s over implicitly claims that the R_{0} is very low.

But is it true?
