This article is updated regularly.
All data presented are as of August 09, 2020; the last update was on August 10, 2020.

Introduction

The ongoing COVID-19 pandemic needs no explanation because the whole world is currently badly affected. The high infectivity of the SARS-CoV-2 pathogen and the potentially severe course of the disease are putting healthcare systems around the world to the test. However, to take the right steps to contain the pandemic, analyzing the number of cases is an essential step.

Many of us have now made it a habit to look at the current state of the case numbers at least once a day. The good availability of the data and new visualization technologies enable us to update graphs in the shortest possible time. Especially the data of the Johns Hopkins University and of the Robert Koch Institute are used intensively for reporting cases in Germany. On different platforms, for example Kaggle.com, a lot of data is available to everyone, in the hope that the large global community of data scientists will generate valuable information and predictions from the data.

We at StatSoft Europe GmbH offer a variety of R training courses and have also used the COVID-19 data to generate knowledge through visualization. We have used the Open-Source statistics software R and disclosed the code below to provide interested analysts with free help for visualizing data in R. We used the data from the Johns Hopkins University (downloaded from this Kaggle Link). The original data, as used in Kaggle, originated from the Git Repository of the Johns Hopkins University.

At this point we would like to take the opportunity to draw your attention to various support platforms for the Corona crisis. Perhaps you are affected yourself or know people who are particularly at risk and should stay at home for their own safety. In order to make life a little easier for those people, many support platforms have emerged, for example to provide free grocery shopping. This is a list of German aid platforms, where you can register as a person affected or as a potential helper. There you will also find links to Germany-wide help platforms. Please also remember that animal shelters in particular may have a large influx of pets during this time, because their owners are no longer able to take care of them due to illness. You can support the shelters by donating food or money.

Recommended literature

For data analysis and visualization we can warmly recommend the excellent books R for Data Science and Hands-On Programming with R by Hadley Wickham and Garrett Grolemund. Incidentally, both authors are also the authors of most packages that we need in this project.

The Current Situation

In this section we would like to use various graphics to show you how the COVID-19 pandemic is spreading throughout the world. The data are as of August 09, 2020. In the next section, we then show how we created these graphics using R.

Worldwide

The following figure shows the number of confirmed cases, divided into active cases, recoveries and deaths, for all countries that had registered at least 50000 confirmed cases by August 09, 2020.

The following figure shows the number of confirmed cases on a log scale by days after the 100th confirmed case was registered in each country. This way, it is possible to see how fast the outbreak developed in the respective countries. For this figure, only the countries which registered the most cases as of August 09, 2020 as well as South Korea were selected. South Korea is a special case, because their fast reaction and testing regime caused a strong decrease in further infections. By choosing the log scale, it is easier to compare the increase in cases irrespective of their order of magnitude.

China, Europe and USA

The following figure shows how the number of cases developed in China, Europe, the USA and all other countries (other). It becomes clear how quickly China managed to get the outbreak under control by taking very strict measures. About five weeks after the number of cases in China reached a plateau, the number of cases in Europe was already higher than in China. Just two more weeks later, there were five times as many cases in Europe as in China. The USA also showed a rapid increase, which follows Europe’s trend, delayed by about a week and a half. Europe and the USA are currently the new epicentres of the COVID-19 pandemic.

The following figure shows the daily increase in case numbers in China, Europe, the USA and in all other countries. This clearly shows how quickly the Chinese government reacted and took drastic measures, because the quarantine of Hubei already began on January 23, at a time when there were only 639 confirmed cases and 18 deaths in China. Nevertheless, despite the quarantine, the confirmed cases in China rose to 84668, including 4634 deaths (as of August 09, 2020). This makes it very clear that a strong restriction on travel and freedom of movement (however painful this may be) a) can effectively limit the spread, b) should be carried out sooner rather than later, and c) may flatten the curve only after a certain delay.

In Europe, the highest daily increase in cases (54896) was recorded on 04.04.2020; in the USA, the highest increase was 78310 cases on 16.07.2020. In China, on the other hand, the highest increase was only 15133 cases on 13.02.2020.

The following figure shows the total active cases of all defined regions over time; the width of each colored band represents the share of active cases in the respective region. The spread began in China, then continued in Europe and other countries while decreasing again in China. Around two weeks after the outbreak in Europe, the disease spread to the United States. The United States is the country with the highest number of active cases.

The following figure shows the confirmed cases, deaths and active cases for the different regions on August 09, 2020.

Europe

In the following figure we see the confirmed cases of all European countries in which at least 5000 confirmed cases were registered on 30.03.2020. The width of each colored band shows what proportion of the confirmed cases can be assigned to the respective country and how this proportion changes over time. This shows that Italy was the first country to register a high number of confirmed cases. A few weeks later, the number of infections also rose sharply in surrounding European countries. As of August 09, 2020, most infections had been recorded in Italy, Spain, Germany and France.

In the following figure we see the deaths in all European countries where at least 5000 confirmed cases were registered on 30.03.2020. The number of deaths in Germany is still very small compared to the relatively high number of confirmed cases.

The following figure shows the number of recoveries within the different European countries. The number of recoveries is highest in Spain, Italy and Germany.

The following figure shows the course of active cases within the European countries.

Germany

The following figure shows the active cases, deaths and recoveries in Germany over time.

The following figure shows the daily percentage increase in confirmed cases in Germany. The highest daily increases (> 50%) were registered when there were fewer than 5000 cases in Germany. Subsequently, the daily increases were between 25 and 35% for 7 days and then decreased further to 10-17% for the following 8 days. Since March 29, 2020 daily increases have remained below the 10% limit.

This reduction in new infections in Germany since March 21, 2020 was very important, because otherwise the number of infections on March 30, 2020 would have been more than three times as high as the actual value. The following figure shows the actual development (light blue bars) of the case numbers in Germany. The red line simulates the number of cases that would have resulted if the daily increase of 25-35%, as observed from 03/14/2020 to 03/20/2020, had continued. In that scenario, 229108 infections would have occurred by March 30, 2020. In fact, only 66885 cases were registered thanks to this reduction.

The Code

In this section we will show you how we prepared the data to generate the figures shown above.

Loading Packages and Data

For this project we need the packages tidyverse and lubridate as well as scales, RColorBrewer and countrycode. These must be installed beforehand.

library(tidyverse)
library(lubridate)
library(scales)
library(RColorBrewer)
library(countrycode)

Now we load the dataset (downloaded from this Kaggle Link) using the Function read_csv():

data <- read_csv("data/covid_19_data.csv")
Parsed with column specification:
cols(
  SNo = col_double(),
  ObservationDate = col_character(),
  `Province/State` = col_character(),
  `Country/Region` = col_character(),
  `Last Update` = col_character(),
  Confirmed = col_double(),
  Deaths = col_double(),
  Recovered = col_double()
)

Data Cleaning

Now we change the format of the variable ObservationDate into a valid form using lubridate::mdy() and save it as a new variable called Date using the function dplyr::mutate(). Then we group the data according to Date and Country/Region and sum up the confirmed cases (Confirmed) within each group using the summarize() function. This gives us the sums of all cases within a country and day. We do this mainly because, for example, all provinces of China or all states of the USA are available as individual entries, but we want to consider the case numbers for each whole country. Then we sort by the variable Date and Confirmed Cases to see which is the most current date of our data set and in which countries the highest number of cases was registered:

data <- data %>% 
  mutate(Date = mdy(ObservationDate)) %>% 
  group_by(Date, `Country/Region`) %>% 
  summarize(`Confirmed Cases` = sum(Confirmed),
            `Deaths` = sum(Deaths),
            `Recoveries` = sum(Recovered)) %>% 
  arrange(desc(Date), desc(`Confirmed Cases`))

head(data)
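
To make the aggregation step easier to follow, here is a minimal sketch on a made-up toy table. The object names toy and toy_by_country and all values are hypothetical and only mimic the layout of the Kaggle file; the .groups argument assumes dplyr 1.0.0 or newer.

```r
library(dplyr)
library(tibble)
library(lubridate)

# Hypothetical toy rows in the same layout as covid_19_data.csv
toy <- tibble(
  ObservationDate  = c("03/30/2020", "03/30/2020", "03/30/2020", "03/31/2020"),
  `Province/State` = c("Hubei", "Guangdong", NA, NA),
  `Country/Region` = c("Mainland China", "Mainland China", "Germany", "Germany"),
  Confirmed        = c(100, 20, 50, 60)
)

toy_by_country <- toy %>%
  mutate(Date = mdy(ObservationDate)) %>%      # parse month/day/year strings into Date objects
  group_by(Date, `Country/Region`) %>%         # one group per day and country
  summarize(`Confirmed Cases` = sum(Confirmed),
            .groups = "drop")                  # return an ungrouped table

toy_by_country
```

The two provinces of Mainland China are collapsed into a single row per day, which is exactly what we want for the country-level figures.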

Later we would like to compare the progression in China with all of Europe. However, to do this, we need a list of European countries if we do not want to assign them manually because the variable continent does not exist in the data set. The package countrycode helps us here, which contains a list of all countries in the world and the associated country codes and continents. With the function countrycode() we create a new variable called continent. At the same time, we can take the opportunity to use the countrycode function to convert the English country names into German ones (only used in the German version of this article). We then call this variable Staat, which is the German word for country:

library(countrycode)
#data3
data <- data %>% 
  mutate(continent = countrycode(`Country/Region`,
                                 origin = "country.name",
                                 destination = "continent"),
         Staat = countrycode(`Country/Region`,
                             origin = "country.name",
                             destination = "country.name.de"))
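
Because countrycode() returns NA for every name it cannot match, it can be worth checking which entries were not assigned a continent. The following is only a sketch: the vector names_in_data is a hypothetical selection, and "Others" stands in for any entry that is not a country name.

```r
library(countrycode)

# Hypothetical selection of names; "Others" is not a country name
names_in_data <- c("Germany", "US", "Mainland China", "Others")

# countrycode() warns about unmatched names; we silence the warning here
# because we inspect the NAs ourselves
continents <- suppressWarnings(
  countrycode(names_in_data,
              origin = "country.name",
              destination = "continent")
)

# Names that could not be assigned to a continent
unmatched <- names_in_data[is.na(continents)]
unmatched
```

Entries listed in unmatched could then be assigned manually or removed later.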

Now we create another variable, which we call Region. In this variable we will define the regions Europe, China, United States and Others. We do this using the base::ifelse() function within the dplyr::mutate() function.

Written out in full, the following code means: If the variable continent equals Europe, then the variable Region is set to Europe; otherwise it is set to China if the variable Country/Region equals Mainland China; otherwise to USA if the variable Country/Region equals US; otherwise the variable Region is set to Others.

We can now use the two variables (continent and Country/Region) to group the regions Europe, China, United States and Others:

#data4
data <- data %>% 
  mutate(Region = ifelse(continent == "Europe", "Europe",
                                ifelse(`Country/Region` == "Mainland China",
                                       "China",
                                       ifelse(`Country/Region` == "US",
                                              "USA",
                                              "Others"))))

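As an aside, the same logic can also be sketched with dplyr::case_when(), which is not what we used for the article. The toy table and the column names Region_ifelse and Region_case_when are hypothetical. Note the one difference in behavior: with nested ifelse(), a missing continent yields NA for Region, whereas case_when() treats the missing comparison as not fulfilled and falls through to Others.

```r
library(dplyr)
library(tibble)

# Hypothetical toy table; the last row has no continent
toy <- tibble(
  `Country/Region` = c("Germany", "Mainland China", "US", "Brazil", "Others"),
  continent        = c("Europe", "Asia", "Americas", "Americas", NA)
)

toy <- toy %>%
  mutate(
    # nested ifelse() as in the article: NA in continent propagates to Region
    Region_ifelse = ifelse(continent == "Europe", "Europe",
                           ifelse(`Country/Region` == "Mainland China", "China",
                                  ifelse(`Country/Region` == "US", "USA", "Others"))),
    # case_when() alternative: conditions are checked from top to bottom
    Region_case_when = case_when(
      continent == "Europe"                ~ "Europe",
      `Country/Region` == "Mainland China" ~ "China",
      `Country/Region` == "US"             ~ "USA",
      TRUE                                 ~ "Others"
    )
  )

toy
```

Which behavior is preferable depends on whether unmatched entries should be dropped or counted as Others.
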
Now we can take a look at our data set to check whether the new variables Region, continent and Staat were created correctly:

arrange(data, desc(Date), desc(`Confirmed Cases`))

Next we create the new variable Active Cases, which can be calculated from the other variables as follows:

data <- mutate(data, `Active Cases` = `Confirmed Cases` - `Deaths` - `Recoveries`)
data

The rough work is now complete and we can now use the data to generate the figures.

Cases Worldwide

For our first figure, we first have to filter the dataset according to the latest date and the number of confirmed cases of at least 50000. Then we use pivot_longer() to convert the table into the long form and create the variable Category and collect the number of cases in the variable Cases. Now the changed data record looks like this:

Countries_50000 <- data %>% 
  filter(Date == max(data$Date),   # compare Date with Date; no conversion to character needed
         `Confirmed Cases` >= 50000) %>%  
  pivot_longer(cols = c(`Active Cases`, 
                        `Deaths`, 
                        `Recoveries`), 
               names_to = "Category", 
               values_to = "Cases")

head(Countries_50000)
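
If the long form is unfamiliar, the following sketch shows what pivot_longer() does on a tiny made-up table. The objects wide and long and all numbers are hypothetical.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical wide table: one row per country, one column per category
wide <- tibble(
  `Country/Region`  = c("A", "B"),
  `Confirmed Cases` = c(100, 50),
  `Active Cases`    = c(60, 20),
  Deaths            = c(10, 5),
  Recoveries        = c(30, 25)
)

# Long form: one row per country and category.
# Columns that are not listed in cols (here Confirmed Cases) are repeated on every row.
long <- wide %>%
  pivot_longer(cols = c(`Active Cases`, Deaths, Recoveries),
               names_to = "Category",
               values_to = "Cases")

long
```

Each country now occupies three rows, and Confirmed Cases is still available on every row, which is what allows sorting the bars by it later.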

Now we can create a bar graph with ggplot() and geom_col(). First we define the X-axis as state and the Y-axis as number of cases, then we flip the two axes with the command coord_flip(), so that the Y-axis becomes the X-axis and vice versa. Now the last step is to sort the bars according to the total number of confirmed cases. Since we have not converted the variable Confirmed Cases into the long form, this variable is still available to us and we can sort the bar chart with reorder() according to the variable Confirmed Cases.

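# date_dataset is used in the plot titles but is not defined in the code shown here;
# the following definition is an assumption
date_dataset <- format(max(data$Date), "%B %d, %Y")

Countries_50000$Category <- factor(Countries_50000$Category, levels = c('Active Cases', 'Recoveries', 'Deaths'))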

g1 <- ggplot(data = Countries_50000) +
  geom_col(mapping = aes(x = reorder(`Country/Region`, 
                                     `Confirmed Cases`), 
                         y = `Cases`, 
                         fill = Category)) +
  coord_flip() +
  theme(legend.position = c(0.8, 0.3),
        legend.background = element_rect(fill=alpha(0.4)),
        plot.title = element_text(hjust = 0.5)) +
  scale_y_continuous(labels = number) +
  labs(x = "Country", 
       y = "Cases", 
       title = paste("Confirmed cases by country as of ", 
                     date_dataset,
                     sep=""))
g1

Creating the following figure is a bit tricky. First, we need to select all the cases from our original dataset and filter by those that have more than or equal to 100 cases. Then we group by Country/Region and use the summarize() function to calculate the minimum value of Confirmed Cases, which gives us only one observation per country (in most cases). These values are then set to 0 and stored in the new variable Days after 100th case using the mutate() function. We call this object worldwide_log_1.

worldwide_log_1 <- data %>%
  arrange(Date) %>% 
  filter(`Confirmed Cases` >=100) %>% 
  group_by(`Country/Region`) %>% 
  summarize(`Confirmed Cases` = min(`Confirmed Cases`)) %>% 
  mutate(`Days after 100th case` = 0)

The next step is to join this summarized dataset back to the original dataset, calling the new object worldwide_log_2. Then we filter again by those observations where Confirmed Cases are more than or equal to 100, ungroup (important for cumsum() later) the dataset and then arrange by Date. By joining the datasets together, only one row of each country has the value 0, the others have NA. These NAs are then replaced by 1s. Then, after grouping by Country/Region again, we can calculate the cumulative sum of the Days after 100th case and arrange the data by Country/Region and Date.

worldwide_log_2 <- full_join(data, worldwide_log_1,
                             by = c("Country/Region", "Confirmed Cases")) %>% 
  filter(`Confirmed Cases` >= 100) %>% 
  ungroup() %>%   # drop the existing grouping; the object being created must not be referenced here
  arrange(Date) %>%  
  mutate(`Days after 100th case` = ifelse(is.na(`Days after 100th case`), 1, 0)) %>% 
  group_by(`Country/Region`) %>% 
  mutate(`Days after 100th case` = cumsum(`Days after 100th case`)) %>% 
  arrange(`Country/Region`, Date)
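
For comparison, the day counter can also be computed directly from the dates, without a join. This is an alternative sketch, not the approach used for the figures; the toy data and the object name days_since_100 are hypothetical. It also behaves differently in one edge case: if a country reports its first value above 100 on two consecutive days, the join-based approach assigns 0 to both days, while the date difference keeps counting.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data: country A reports 120 cases on two consecutive days
toy <- tibble(
  `Country/Region`  = rep(c("A", "B"), each = 4),
  Date              = rep(as.Date("2020-03-01") + 0:3, times = 2),
  `Confirmed Cases` = c(50, 120, 120, 300,
                        10, 40, 150, 400)
)

days_since_100 <- toy %>%
  filter(`Confirmed Cases` >= 100) %>%          # keep only days at or above 100 cases
  group_by(`Country/Region`) %>%
  arrange(Date, .by_group = TRUE) %>%
  # days elapsed since the first day with at least 100 cases in each country
  mutate(`Days after 100th case` = as.integer(Date - min(Date))) %>%
  ungroup()

days_since_100
```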

Now we calculate the number of cases which would have occurred at a Doubling Time of 2, 7, 14, and 60 days. In order to do this, we create a new dataset (DBLT) and define the variable Days after 100th case as ranging from 1 to 60. Then we create four new variables (2 Days, 7 Days etc.) and calculate the cases that would have occurred with the respective doubling time. Next we transform this dataset using pivot_longer(). In the end, we filter this dataset so that the comparison lines do not become too long.

DBLT <- tibble(`Days after 100th case` = 1:60) %>% 
  mutate(`2 Days` = 100*(2^(1/2))^`Days after 100th case`,
         `7 Days` = 100*(2^(1/7))^`Days after 100th case`,
         `14 Days` = 100*(2^(1/14))^`Days after 100th case`,
         `60 Days` = 100*(2^(1/60))^`Days after 100th case`) %>% 
  pivot_longer(cols = c("2 Days", "7 Days", "14 Days", "60 Days"), names_to = "Doubling Time", values_to = "Cases") %>% 
  mutate(`Doubling Time` = as_factor(`Doubling Time`)) %>% 
  filter(Cases <= 4000000 & `Days after 100th case` <= 50)
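
The formula behind these reference lines is exponential growth starting from 100 cases: after t days with a doubling time of T days we expect 100 * 2^(t/T) cases, which is the same as 100 * (2^(1/T))^t in the code above. A small sketch to check this; the helper doubling_cases() is a hypothetical name introduced only for illustration.

```r
# Hypothetical helper: expected cases after `days` days, starting from `start` cases
doubling_cases <- function(days, doubling_time, start = 100) {
  start * 2^(days / doubling_time)
}

# With a doubling time of 7 days, the cases double every 7 days
check <- doubling_cases(c(0, 7, 14), doubling_time = 7)
check

# Same result as the form used in DBLT
all.equal(doubling_cases(14, 7), 100 * (2^(1/7))^14)
```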

Now we can plot the Confirmed Cases against Days after 100th case using ggplot(). By choosing scale_y_continuous(trans="log10"), we can display the Y-axis on a log scale. This makes it easier to compare the growth of cases among different countries. Then we add another line plot which contains the calculated values from the step above and set the linetype to Doubling Time.

log_cases <- filter(worldwide_log_2, `Country/Region` %in% (c("Mainland China", "Germany", "Spain", "France", "Italy", "US", "Iran", "South Korea")))

g2 <- ggplot(data = log_cases) +
  geom_line(mapping = aes(x = `Days after 100th case`, y = `Confirmed Cases`, color = `Country/Region`), size = 1.5) +
  geom_line(data = DBLT, mapping = aes(x = `Days after 100th case`, y = Cases, linetype = `Doubling Time`), color = "royalblue4") +
  scale_y_continuous(trans='log10', labels = number) +
  labs(title = paste("Confirmed cases (log scale) by days after 100th case as of ", 
                     date_dataset,
                     sep="")) +
  theme(legend.position = c(0.85, 0.5),
        legend.background = element_rect(fill=alpha(0.4)),
        plot.title = element_text(hjust = 0.5)) +
    scale_color_brewer(palette="Set3") +
  guides(linetype = guide_legend(order = 1), color = guide_legend(order = 2))
g2

Cases by Region

Now we group our data set by Date and Region and add up the variables Confirmed Cases, Deaths and Recoveries with sum() to get the sum of all cases within each day and within each region. All missing values are then removed using filter() and !is.na(). Now we group by Region and use mutate() to create the variable Increase in Confirmed Cases by Region with the lag() function and the variable Active Cases by Region:

Regions <- data %>% 
  group_by(Date, Region) %>% 
  summarize(`Confirmed Cases by Region` = sum(`Confirmed Cases`),
            `Deaths by Region` = sum(`Deaths`),
            `Recoveries by Region` = sum(`Recoveries`)) %>% 
  filter(!is.na(Region)) %>% 
  group_by(Region) %>% 
  mutate(`Increase in Confirmed Cases by Region` = `Confirmed Cases by Region` - lag(`Confirmed Cases by Region`),
         `Active Cases by Region` = `Confirmed Cases by Region` - `Recoveries by Region` - `Deaths by Region`)

head(arrange(Regions, desc(Date), desc(`Confirmed Cases by Region`)))
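
The role of lag() is perhaps easiest to see on a small example: within each group it returns the value of the previous row, so subtracting it from the current value gives the daily increase. The following sketch uses made-up numbers; toy and toy_increase are hypothetical names.

```r
library(dplyr)
library(tibble)

# Hypothetical cumulative case counts for two regions on three days
toy <- tibble(
  Date   = rep(as.Date("2020-03-01") + 0:2, times = 2),
  Region = rep(c("Europe", "China"), each = 3),
  Cases  = c(10, 15, 30, 100, 100, 105)
)

toy_increase <- toy %>%
  group_by(Region) %>%
  arrange(Date, .by_group = TRUE) %>%           # lag() depends on the row order
  mutate(Increase = Cases - lag(Cases)) %>%     # first day of each region has no predecessor: NA
  ungroup()

toy_increase
```

Grouping by Region before calling lag() matters: without it, the first day of one region would be compared with the last day of another.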

Now we can plot the confirmed cases for China, Europe, US and others with ggplot(). The X-axis is Date and the Y-axis is the sum of the confirmed cases per region (Confirmed Cases by Region). Note: The backticks are necessary if variable names have special characters (here spaces in the variable name Confirmed Cases by Region).

g3a <- ggplot(data = Regions) +
  geom_line(mapping = aes(x = Date, 
                          y = `Confirmed Cases by Region`, 
                          color = Region), 
            size = 1) +
  labs(x = "Date", 
       y = "Confirmed Cases by Region",
       title = paste("Confirmed Cases by Region as of ",
                     date_dataset,
                     sep="")) +
  scale_y_continuous(labels = number) +
  scale_x_date(date_breaks = "2 week", date_labels = "%d.%m") +
  theme(legend.position = c(0.13, 0.6),
        legend.background = element_rect(fill=alpha(0.4)),
        plot.title = element_text(hjust = 0.5))
g3a

If we also want to display the other two variables (Deaths by Region and Recoveries by Region) in the same plot, we first have to transfer the data record to the long form with tidyr::pivot_longer(). Here we create two new variables, which we call Category and Cases. The Category variable contains information as to whether it is a confirmed case, a death or a recovered case. The variable Cases contains the count values that were previously among the three combined variables.

Regions_long <- Regions %>% 
  pivot_longer(cols = c(`Confirmed Cases by Region`,
                        `Deaths by Region`,
                        `Recoveries by Region`,
                        `Increase in Confirmed Cases by Region`,
                        `Active Cases by Region`),
               names_to = "Category",
               values_to = "Cases")
head(Regions_long)

Then we add the new variable Category as an aesthetic in ggplot() (here as linetype), so that the line type changes depending on the variable Category. We select the newly created variable Cases, which contains the counted values, as the Y-axis. Since the function countrycode() used above could not assign all countries to a continent and thus generated some NAs, these rows would otherwise also be shown in the figure; we already removed them with filter(!is.na(Region)) when creating Regions (drop_na() would work as well, and alternatively we could assign the missing countries manually to the correct continent). With the function scale_x_date() we change the date format of the X-axis and set the breaks to two-week steps; with theme() we position the legend inside the plot and give it a transparent background.

g3 <- ggplot(data = filter(Regions_long, 
                     Category %in% c("Confirmed Cases by Region",
                                      "Deaths by Region",
                                      "Recoveries by Region"))) +
  geom_line(mapping = aes(x = Date, 
                          y = `Cases`, 
                          color = Region, 
                          linetype = Category), 
            size = 1)+
  labs(x = "Date", 
       y = "Cases", 
       title = paste("Cases by Category and Region as of ", 
                                               date_dataset,
                                               sep="")) +
   scale_y_continuous(labels = number) +
  scale_x_date(date_breaks = "2 week", 
               date_labels = "%d.%m") +
  theme(legend.position = c(0.2, 0.6),
        legend.background = element_rect(fill=alpha(0.4)),
        plot.title = element_text(hjust = 0.5))
g3

Now we can plot the daily increase in the number of cases. In addition, we use geom_vline() to add markers and geom_label() to place some labels where events have taken place that could affect the spread of the disease, namely the times at which different countries started the quarantine measures. China, for example, started the Hubei lockdown on January 23, at a time when there were only 639 confirmed cases and 18 deaths in China. Italy started the curfew on March 9th.

g4 <- ggplot(data = Regions) +
  geom_line(mapping = aes(x = Date, 
                          y = `Increase in Confirmed Cases by Region`, 
                          color = Region), size = 1) +
  geom_vline(xintercept = as.Date("2020-01-23")) +
  geom_label(label="Hubei \n quarantine", y=15000, x=as.Date("2020-01-24")+1)+
  geom_vline(xintercept = as.Date("2020-03-09"), color = "black") +
  geom_label(label="Italy \n lockdown", y=18000, x=as.Date("2020-03-09")) +
  scale_x_date(date_breaks = "2 week", date_labels = "%d.%m") +  
  theme(legend.position = c(0.2, 0.8),
        legend.background = element_rect(fill=alpha(0.4)),
        plot.title = element_text(hjust = 0.5)) +
  labs(title = paste("Daily Increase in Cases as of ", 
                                               date_dataset,
                                               sep=""))
g4

Now we want to plot the active cases as an area plot over time. An area plot is created with geom_area within ggplot().

g5 <- ggplot(data = Regions) +
  geom_area(mapping = aes(x = Date, 
                          y = `Active Cases by Region`, 
                          fill = Region), 
            size = 1) +
  labs(x = "Date", 
       y = "Active COVID-19 Cases", 
       title = paste("Active Cases by Region as of ",
                    date_dataset,
                    sep="")) +
  scale_y_continuous(labels = number) +
  scale_x_date(date_breaks = "2 week", 
               date_labels = "%d.%m") +
  theme(legend.position = c(0.1, 0.8),
        legend.background = element_rect(fill=alpha(0.4)),
        plot.title = element_text(hjust = 0.5)) +
  scale_fill_brewer(palette="Set3")
g5

We can also generate a bar chart in ggplot() with geom_col(), in which the X-axis represents the region, the Y-axis the number of cases and the color (fill) again the region. The command facet_wrap(~Category) creates a separate panel for each category in the dataset as a further dimension. To do this, however, we first need to create an object that contains the levels of the Category variable that we want to use (we don’t want to map all levels from Category):

Selected_Category <- c("Active Cases by Region",
                            "Recoveries by Region",
                            "Deaths by Region")
g6 <- ggplot(data = filter(Regions_long, 
                     Date %in% max(Regions_long$Date),
                     Category %in% Selected_Category)
       ) +
  geom_col(mapping = aes(x = Region, 
                         y = `Cases`, 
                         fill = Region), 
           position = "dodge") +
  facet_wrap(~Category, ncol=6) +
  theme(axis.text.x = element_text(angle = 90)) +
  scale_y_continuous(labels = number) +
  labs(title = paste("Cases by Region and Category as of ",
                    date_dataset,
                    sep="")) +
  theme(legend.position = "top",
        legend.background = element_rect(fill=alpha(0.4)),
        plot.title = element_text(hjust = 0.5))
g6

Cases in Europe

This graphic shows the case numbers of all European countries in which at least 5000 confirmed cases were registered on March 30, 2020. To do this, we first have to filter the data set with filter() for the region Europe, the date 2020-03-30 and `Confirmed Cases` >= 5000 (don’t forget the backticks). Then we extract the variable Country/Region from this object and save the list of countries that had at least 5000 confirmed cases on March 30, 2020 in the object Europe_5000_list. Now we filter our data set by the countries that appear in this list. Then we create an area plot:

Europe_5000_list <- filter(data, Region == "Europe" &
           Date == as.Date("2020-03-30") &
           `Confirmed Cases` >= 5000)$`Country/Region`

Europe5000 <- filter(data, 
                  Region =="Europe" & 
                    `Country/Region` %in% Europe_5000_list & 
                    Date >= as.Date("2020-02-24"))
g7 <- ggplot(data = Europe5000) +
  geom_area(mapping = aes(x = Date, 
                          y = `Confirmed Cases`, 
                          fill = `Country/Region`)) +
  scale_x_date(date_breaks = "2 week", date_labels = "%d.%m") +
  scale_y_continuous(labels = number) +
  labs(x = "Date", 
       y = "Confirmed Cases", 
       title = paste("Confirmed Cases in Europe as of ",
                     date_dataset,
                     sep="")) +
  theme(legend.position = c(0.15, 0.6),
        legend.background = element_rect(fill=alpha(0.4)),
        plot.title = element_text(hjust = 0.5)) + 
  scale_fill_brewer(palette="Set3")
g7

We create the three other graphics in the same way, except that we do not have to do the previous filtering again:

g8 <- ggplot(data = Europe5000) +
  geom_area(mapping = aes(x = Date, 
                          y = `Deaths`, 
                          fill = `Country/Region`)) +
  scale_x_date(date_breaks = "2 week", date_labels = "%d.%m") +
  scale_y_continuous(labels = number) +
  labs(x = "Date", 
       y = "Deaths", 
       title = paste("Deaths in Europe as of ",
                    date_dataset,
                    sep="")) +
  theme(legend.position = c(0.15, 0.6),
        legend.background = element_rect(fill=alpha(0.4)),
        plot.title = element_text(hjust = 0.5)) + 
  scale_fill_brewer(palette="Set3")
g8

g9 <- ggplot(data = Europe5000) +
  geom_area(mapping = aes(x = Date, 
                          y = `Recoveries`, 
                          fill = `Country/Region`)) +
  scale_x_date(date_breaks = "2 week", date_labels = "%d.%m") +
  scale_y_continuous(labels = number) +
  labs(x = "Date", 
       y = "Recoveries", 
       title = paste("Recoveries in Europe as of ",
                    date_dataset,
                    sep="")) +
  theme(legend.position = c(0.15, 0.6),
        legend.background = element_rect(fill=alpha(0.4)),
        plot.title = element_text(hjust = 0.5)) + 
  scale_fill_brewer(palette="Set3")
g9

g10 <- ggplot(data = Europe5000) +
  geom_area(mapping = aes(x = Date, 
                          y = `Active Cases`, 
                          fill = `Country/Region`)) +
  scale_x_date(date_breaks = "2 week", date_labels = "%d.%m") +
  scale_y_continuous(labels = number) +
  labs(title = paste("Active Cases in Europe as of ",
                    date_dataset,
                    sep="")) +
  theme(legend.position = c(0.15, 0.6),
        legend.background = element_rect(fill=alpha(0.4)),
        plot.title = element_text(hjust = 0.5)) + 
  scale_fill_brewer(palette="Set3")
g10

Cases in Germany

In order to map all three categories of cases in Germany in a summarizing area plot, we first have to group the data set according to Country/Region and then filter it according to Germany. Then we sort the data set by Date in ascending order and create the variable Absolute Increase in Confirmed Cases using the lag() function and the variable Percentage Increase in Confirmed Cases.

Now we can convert this data record into the long form with pivot_longer().

Germany <- data %>% 
  group_by(`Country/Region`) %>% 
  filter(`Country/Region` == "Germany") %>% 
  arrange(Date) %>% 
  # lag() returns the previous day's value, so the difference is the absolute daily increase
  mutate(`Absolute Increase in Confirmed Cases` = `Confirmed Cases` - lag(`Confirmed Cases`),
         `Percentage Increase in Confirmed Cases` = ((`Confirmed Cases` / lag(`Confirmed Cases`)) - 1) * 100)

Germany_long <- pivot_longer(data = Germany, 
                                 cols = c(`Confirmed Cases`,
                                          `Deaths`,
                                          `Recoveries`,
                                          `Active Cases`,
                                          `Absolute Increase in Confirmed Cases`,
                                          `Percentage Increase in Confirmed Cases`),
                                 names_to = "Category",
                                 values_to = "Cases")

Before creating an area plot for Germany, we have to wrangle a little more. Because the categories are usually colored in an alphabetical order, this would distort the order in which we would like to present the data. By using factor(), we reorder the factor levels of the Category variable in the order that we would like to present them. Since we have more levels within the Category variable in our long dataset than we want to map, we filter again in ggplot for the three variables Active Cases, Recoveries and Deaths.

Germany_long$Category <- factor(Germany_long$Category, levels = c("Active Cases", "Recoveries", "Deaths", "Absolute Increase in Confirmed Cases", "Percentage Increase in Confirmed Cases", "Confirmed Cases"))
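The effect of setting the levels explicitly can be seen in isolation with a small base R sketch. The vector category is a hypothetical stand-in for the Category column.

```r
# Hypothetical category values in arbitrary row order
category <- c("Deaths", "Active Cases", "Recoveries", "Deaths")

# Without explicit levels, factor() sorts the levels alphabetically
default_levels <- levels(factor(category))

# With explicit levels, the order we pass in is kept
ordered_category <- factor(category,
                           levels = c("Active Cases", "Recoveries", "Deaths"))
custom_levels <- levels(ordered_category)

default_levels
custom_levels
```

ggplot2 follows the level order when stacking the areas and assigning fill colors, which is why the reordering has to happen before the plot is built.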

g11 <- ggplot(data = filter(Germany_long, 
                     Category == "Active Cases" |
                       Category == "Recoveries" |
                       Category == "Deaths", Date >= as.Date("2020-03-01"))) +
  geom_area(mapping = aes(x = Date, 
                          y = `Cases`, 
                          fill = Category)) +
  scale_x_date(date_breaks = "2 week", date_labels = "%d.%m") +
  labs(x = "Date", 
       y = "Confirmed Cases in Germany",
       title = paste("Confirmed Cases in Germany as of ",
                    date_dataset,
                    sep="")) + 
  theme(legend.position = c(0.2, 0.8),
        legend.background = element_rect(fill = alpha("white", 0.4)),
        plot.title = element_text(hjust = 0.5)) + 
  scale_fill_brewer(palette="Set3")
g11

Next we use ggplot() and geom_col() to draw a bar chart of the daily percentage increase in confirmed cases and add a few labels. The case numbers shown in the labels can be read from the filtered data set. Within ggplot() we keep only the data from Feb 24, 2020 onward.

g12 <- ggplot(data = filter(Germany, Date >= as.Date("2020-02-24"))) +
  geom_col(mapping = aes(x = Date, 
                         y = `Percentage Increase in Confirmed Cases`), 
           fill = "#6699ff") +
  scale_y_continuous(breaks = seq(0, max(Germany$`Percentage Increase in Confirmed Cases`, na.rm = TRUE), 10)) +
  geom_vline(xintercept = as.Date("2020-03-06")) +
  geom_label(label="670 Cases", y=70, x=as.Date("2020-03-06")) +
  geom_vline(xintercept = as.Date("2020-03-14")) +
  geom_label(label="4585 Cases", y=60, x=as.Date("2020-03-14")) +
  geom_vline(xintercept = as.Date("2020-03-21")) +
  geom_vline(xintercept = as.Date("2020-03-23")) +
  geom_label(label="Bavaria \n lockdown, \n 22213 Cases", y=75, x=as.Date("2020-03-20")) +
  geom_label(label="Countrywide \n Restraining \n Order, \n 29056 Cases", y=40, x=as.Date("2020-03-25")) +
  scale_x_date(date_breaks = "2 week", date_labels = "%d.%m") + 
  labs(y = "Percent Increase", 
       title = paste("Percentage Increase in Confirmed Cases in Germany as of ",
                     date_dataset,
                     sep="")) +
  theme(legend.position = c(0.9, 0.8),
        legend.background = element_rect(fill = alpha("white", 0.4)),
        plot.title = element_text(hjust = 0.5))
g12
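A note on the labels: as far as we can tell, geom_label() with a fixed label, x and y is evaluated once per row of the plot data, so the same label is drawn many times on top of itself. The sketch below shows annotate() as one possible alternative that adds a single label. This is not the approach used in the code above, and the data frame toy_pct, its values and the label text are made up for illustration.

```r
library(ggplot2)

# Hypothetical daily percentage increases for ten days
toy_pct <- data.frame(
  Date = as.Date("2020-03-01") + 0:9,
  Increase = c(40, 35, 30, 28, 25, 20, 18, 15, 12, 10)
)

p <- ggplot(data = toy_pct) +
  geom_col(mapping = aes(x = Date, y = Increase), fill = "#6699ff") +
  geom_vline(xintercept = as.Date("2020-03-06")) +
  # annotate() builds its own one-row data set, so exactly one label is drawn
  annotate("label", x = as.Date("2020-03-06"), y = 38, label = "Example label")

# The third layer (the label) contains a single row
nrow(layer_data(p, 3))
```

The visual result is the same as with geom_label(), but a single label avoids overplotting artifacts when the plot is exported.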

Next, we create a bar chart of the confirmed cases in Germany and fit an exponential function to the case numbers from March 14 to March 20, 2020. The bars from March 1 to March 20, 2020 should be dark blue; all other bars should be light blue.

For this we first filter the data set to the period from Feb 23, 2020 to the most recent date in the data. For the dark blue bars, we filter again and keep only the dates between Mar 1, 2020 and Mar 20, 2020, when cases rose more steeply than after March 20, 2020.

We then filter the data set a third time, to March 14 through March 20, 2020, because only these days are used to fit the model. On this subset we calculate the natural logarithm of the confirmed cases and fit a linear regression to the log-transformed values with lm(). A straight line on the log scale corresponds to exponential growth on the original scale. With seq() we create the sequence of days (March 14 to March 30, 2020) for which the model should predict case numbers. predict.lm() returns the predictions on the log scale; we back-transform them with exp() and store them as the variable modell in the data frame prediction. Finally, we join prediction to the previously filtered data set all_data using full_join().

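# Data shown as light blue bars: Feb 23, 2020 up to the most recent date
all_data <- Germany %>% 
  filter(Date >= as.Date("2020-02-23") & Date <= max(data$Date))

# Data shown as dark blue bars: Mar 1 to Mar 20, 2020
darkblue <- Germany %>% 
  filter(Date >= as.Date("2020-03-01") & Date <= as.Date("2020-03-20"))

# Data used to fit the model: Mar 14 to Mar 20, 2020, with log-transformed cases
modeldata <- Germany %>% 
  filter(Date >= as.Date("2020-03-14") & Date <= as.Date("2020-03-20")) %>% 
  mutate(`Log Bestätigte Cases` = log(`Confirmed Cases`))

# Linear regression on the log scale, i.e. an exponential model on the original scale
modell <- lm(`Log Bestätigte Cases` ~ Date, data = modeldata)
# Days for which the model should predict case numbers
Date <- seq(as.Date("2020-03-14"), as.Date("2020-03-30"), by = 1)
prediction <- data.frame(Date)
# Predict on the log scale and back-transform with exp()
prediction$modell <- exp(predict.lm(modell, newdata = prediction)) 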

all_data_prediction <- full_join(all_data, prediction, by = "Date")
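The log-linear approach can be checked on synthetic data where the answer is known. The sketch below assumes made-up case counts that double exactly every three days. The names toy_model_data, toy_fit and future are hypothetical. If log(Cases) = a + b * Date, then Cases = exp(a) * exp(b * Date), and the doubling time is log(2) / b days.

```r
# Synthetic cumulative case counts that double exactly every 3 days
# (an assumption for illustration, not real data)
toy_model_data <- data.frame(Date = as.Date("2020-03-14") + 0:6)
toy_model_data$Cases <- 1000 * 2^((0:6) / 3)

# Linear regression on the log scale: log(Cases) = a + b * Date
toy_fit <- lm(log(Cases) ~ Date, data = toy_model_data)

# The slope b is the daily growth rate on the log scale
growth_rate <- unname(coef(toy_fit)["Date"])
doubling_time <- log(2) / growth_rate

# Predict on the log scale for later days, then back-transform with exp()
future <- data.frame(Date = as.Date("2020-03-21") + 0:2)
future$Predicted <- exp(predict(toy_fit, newdata = future))

doubling_time
future
```

Because the synthetic data are exactly exponential, the fit recovers the doubling time of three days. With real case numbers the fit is only an approximation, and extrapolating it beyond the fitted period assumes that growth continues unchanged.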

Using ggplot(), we first draw a bar chart of the data set all_data_prediction in light blue (royalblue1). On top of it we draw a second bar chart from the object darkblue, which contains only the values from Mar 1 to Mar 20, 2020, in dark blue (royalblue4). Then we add a line with geom_line() showing the predicted case numbers (modell) based on the data from March 14 to March 20, 2020.

Finally, we add some labels (geom_label()) and vertical lines (geom_vline()), set the date axis breaks to an interval of two weeks and change the format of the dates on the x-axis.

g13 <- ggplot(data = all_data_prediction) +
  geom_col(mapping = aes(x = Date, 
                         y = `Confirmed Cases`), 
           fill ="royalblue1") +
  geom_col(data = darkblue, 
           mapping = aes(x = Date, 
                         y = `Confirmed Cases`), 
           fill ="royalblue4") +
  geom_line(mapping = aes(x = Date, 
                          y = modell), 
            color = "red4", size = 1) + 
  
  geom_vline(xintercept = as.Date("2020-03-06")) +
  geom_label(label="670 Cases", 
             y=20000, 
             x=as.Date("2020-03-06")) +
    
  geom_vline(xintercept = as.Date("2020-03-14")) +
  geom_label(label="4585 Cases", 
             y=50000, 
             x=as.Date("2020-03-14")) +
  
  geom_vline(xintercept = as.Date("2020-03-21")) +
  geom_vline(xintercept = as.Date("2020-03-23")) +
  
  geom_label(label="Bavaria \n lockdown, \n 22213 Cases", 
             y=80000, 
             x=as.Date("2020-03-20")) +
  geom_label(label="Countrywide \n Restraining \n Order, \n 29056 Cases", 
             y=150000, 
             x=as.Date("2020-03-23")) +
  
  geom_vline(xintercept = as.Date("2020-03-30")) +
  geom_label(label="229108 \n Cases", 
             y=225000, 
             x=as.Date("2020-03-30"), 
             color = "red") +
  geom_label(label="66885 \n Cases", 
             y=95000, 
             x=as.Date("2020-03-30")) +
  
  labs(x = "Date", 
       y = "Confirmed Cases",  
       title = paste("Confirmed Cases in Germany as of ",
                     date_dataset,
                     sep="")) +
  scale_x_date(date_breaks = "2 week", date_labels = "%d.%m") +
  theme(plot.title = element_text(hjust = 0.5))
g13

This page is constantly being updated. Last edited on August 10, 2020.

Date of the dataset: August 09, 2020
