Wed Jun 03 2020

Combining data to understand the impact of COVID-19

The problem

By now we all know about the Corona crisis and the major impact it has on society as a whole and on the lives of most people. Because humanity has only recently been exposed to this pandemic, there are still many open questions about how life has been affected, and about how the answers can help us anticipate and avoid further damage, or even find alternatives that will benefit us in the future. We believe that Covid-19 can only be beaten together. Insights differ and can be reached in different ways, so collaboration among engineers and scientists is key.

The goal

By creating new insights related to the Covid-19 virus, we are able to contribute to solving the problem as a whole. The work is divided into three main tasks, the first two of which come from Kaggle:

  1. creating insights from referenced papers related to the Covid-19 disease;
  2. forecasting infections, hospitalizations, and deaths; and
  3. aggregating usable datasets about the social and economic impact of the crisis.

This blog focuses on the third task. It emphasizes enriching the data with external data sources and providing access to the processed data, so that the social and economic effects of the crisis in the Netherlands can be analysed in more detail.

Approaching the problem

We first created a plan of approach outlining the idea, along with an Excel sheet listing all the available data, ranked by feasibility and importance. This made it easier to get a grasp of the amount of data and of which data could add value to future analysis. Consequently, we decided to contribute through the CORD-19 Kaggle challenge by first collecting interesting data and then analysing it to observe how the virus has affected electricity consumption, air quality, and the country's economy, specifically the number of companies that went bankrupt.

To approach our goal, we used data provided by Kaggle and collected data from the external sources listed in the Excel sheet. After analysing this data, we were able to form an idea of how these sectors were affected and to what extent.

Tackling the problem

Data Overview

CoronaNL Project

The Dutch CoronaNL project had already scraped some of the data on the number of infections, hospitalizations, and deaths in the Netherlands. We use this data as the starting point for analysing the consequences of the Corona crisis.

Bankruptcies

The virus has affected many sectors, and one of them is the country's economy. To get an impression of the difficulties that industries face during this crisis, we look at the number of companies that file for bankruptcy. It is interesting to see whether this number surged compared with last year's data or fluctuates around similar levels. The data is collected daily from Dutch websites on analytics and the economy and then stored so that it can be cleaned, processed, and explored.

Electricity Consumption

Across the country, companies are forced to close and people are advised to work from home. We assume that this has a large impact on energy consumption. And if there is a difference in energy consumption, is the Corona crisis the cause? The energy consumption data comes from an open API provided by ENTSO-E, which keeps track of the overall energy load of most European states. The energy load is reported on an hourly basis and is gathered daily for use in further analytics.
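To illustrate the step from hourly load values to daily figures, here is a minimal sketch in plain Python. The record shape (an ISO timestamp string paired with a load in MW) and the function name `daily_load_totals` are assumptions made for this example, and the sample values are made up; they are not the project's actual schema or data.

```python
from collections import defaultdict
from datetime import datetime


def daily_load_totals(hourly_records):
    """Sum hourly load values (MW) into per-day totals (MWh).

    Each record is assumed to be (iso_timestamp, load_mw), with one
    record per hour, so summing MW over hours gives MWh.
    """
    totals = defaultdict(float)
    for timestamp, load_mw in hourly_records:
        day = datetime.fromisoformat(timestamp).date().isoformat()
        totals[day] += load_mw
    return dict(totals)


# Made-up sample values, for illustration only.
sample = [
    ("2020-04-01T00:00", 10000.0),
    ("2020-04-01T01:00", 9800.0),
    ("2020-04-02T00:00", 10100.0),
]
daily = daily_load_totals(sample)
```

Daily totals like these are easier to compare across years than raw hourly values, because the strong day and night pattern in the load is averaged out.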

Air Quality

After the government decided that the country should work from home whenever possible, the total distance travelled by car decreased. We wondered whether this led to a noticeable improvement in air quality. To investigate, we collected open data from measuring stations throughout the Netherlands. Several years of hourly air composition measurements resulted in a dataset of more than 30 million data points. Thankfully, we could use AWS S3 for cloud storage, PySpark for big data processing, and Tableau for data visualization to answer this question.
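The kind of aggregation involved can be sketched on a small scale. The example below uses plain Python rather than PySpark, purely to show the grouping logic; the tuple layout (station, ISO timestamp, measured value), the station names, and the values are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def monthly_mean_by_station(measurements):
    """Average hourly measurements per (station, "YYYY-MM") pair.

    Each measurement is assumed to be (station, iso_timestamp, value).
    """
    grouped = defaultdict(list)
    for station, timestamp, value in measurements:
        month = timestamp[:7]  # the "YYYY-MM" prefix of an ISO timestamp
        grouped[(station, month)].append(value)
    return {key: mean(values) for key, values in grouped.items()}


# Hypothetical readings, for illustration only.
readings = [
    ("station_a", "2019-04-01T08:00", 30.0),
    ("station_a", "2019-04-01T09:00", 20.0),
    ("station_a", "2020-04-01T08:00", 18.0),
]
monthly = monthly_mean_by_station(readings)
```

Comparing the same month across years, as in the two April entries here, limits the influence of seasonal effects on the comparison.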

Keep the analytics up to date

To scrape the data and keep the analytics up to date, we also needed an infrastructure capable of supporting it. We used virtual machine services from Amazon and Azure for this. A $100 MongoDB Atlas subscription was provided to support analysis of the Corona crisis, and we used it to store the gathered and cleaned data before serving it to the public.

Bringing it together

To help data scientists analyse the data, we used MongoDB Atlas to bring all the data together in one place. We then created an API so that data scientists can easily retrieve data that is already cleaned and ready to use.

Analysing the data

Now that the data has been collected, analytics can give new insights about the virus. Most of this work was done in Jupyter Notebooks, but we also used more complete programs like Tableau and Python frameworks like Django. With data engineers and data scientists working together, we managed to find out quite a lot. The next step is to report our findings.

Showcasing our findings

After conducting an analysis of our data, the results so far led us to the following observations:

As far as bankruptcies are concerned, the virus does not seem to have had a major impact on the economy so far, since bankruptcies fluctuate at about the same level as last year. The chart illustrates the number of companies that went bankrupt in February, March, and April of 2019 and 2020. Data for February was also gathered to give an impression of a month that had not yet been affected by the virus. There is a relatively small increase in bankruptcies from 2019 to 2020 for each of the months shown. More specifically, March shows a slight increase of around 2.4%, while April has just two more bankruptcies than April 2019. However, it might still be too early to draw a conclusion about the economic impact of the virus, as the consequences might only become visible in the upcoming period.
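Percentages like the one above come from a simple year-over-year comparison, which can be sketched as follows. The function name `percent_change` is chosen for this example, and the counts are placeholders, not the project's bankruptcy figures.

```python
def percent_change(previous, current):
    """Return the change from previous to current as a percentage."""
    if previous == 0:
        raise ValueError("previous value must be non-zero")
    return (current - previous) / previous * 100


# Placeholder counts, not the real numbers behind the chart.
march_2019 = 200
march_2020 = 210
change = percent_change(march_2019, march_2020)
print(f"Change: {change:.1f}%")
```

A positive result means an increase relative to the earlier year, and a negative result means a decrease, so the same function covers a drop such as the one in electricity consumption.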

Furthermore, we analysed to what extent each province has been affected by the virus. The figure above shows the bankruptcies that occurred per province. The province with the most bankruptcies is North Holland, followed by South Holland, North Brabant, and Gelderland. The number of bankruptcies thus seems to track the total number of companies in each province, since ranking these provinces by number of companies puts them in the same order.
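The ordering argument can be checked mechanically by ranking the provinces on both measures and comparing the two rankings. The sketch below does this with generic province labels and invented counts; neither reflects the real figures.

```python
def rank_by(values):
    """Return the keys of a dict ordered from largest to smallest value."""
    return sorted(values, key=values.get, reverse=True)


# Invented counts with generic labels, for illustration only.
bankruptcies = {"province_a": 40, "province_b": 30, "province_c": 10}
companies = {"province_a": 900, "province_b": 700, "province_c": 200}

# True when both measures put the provinces in the same order.
same_order = rank_by(bankruptcies) == rank_by(companies)
```

An identical ordering only shows that the two rankings agree. It does not show that bankruptcies scale at the same rate as the number of companies in every province.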

With regard to electricity consumption, we found a decrease of around 9.7% in comparison with last year. In 2015 the average was even lower, but that was because consumption in January of that year was much lower than in 2020.

From our analysis of energy generation, such as solar and wind energy, we found that the recent blue skies in the Netherlands do seem to have a significant impact on the sharp drop in energy consumption, with 37.28% more solar production, 7.17% more offshore wind energy, and 4.92% more onshore wind energy. Future analysis should show whether the Corona crisis itself contributed to the drop in energy consumption in the Netherlands.

In the figure below, based on the distribution of consumed kWh in 2018, one can see that companies that are closed due to the Corona crisis consume 3409 kWh on average. Households show an average of 2790 kWh. It is assumed that household consumption in kWh increases during the crisis. This also means that the sudden drop in energy consumption by businesses will almost certainly contribute to the overall drop in energy consumption during the Corona crisis.

The data and results are served via a public website, hosted on Amazon and supported by MongoDB, so anyone can obtain the data used in our analytics for their own use. The website also presents the results of the analytics performed on the data.

Reflection

Throughout the project we had a daily stand-up meeting in which we explained what we had worked on the day before and what we were aiming for that day. This greatly helped us know which tasks other group members were working on, giving a better understanding of the project's development. Structuring a project as large as this is a challenge in and of itself, but by dividing it into three smaller tasks we kept it manageable, and we helped each other out whenever needed. To sum up, each sub-project was a challenge in its own right, as everyone involved had their own development setup, which made the learning process more challenging and more collaborative.

Takeaway

By collaborating throughout the project and improving our data gathering and data analysis step by step, we managed to contribute to some interesting questions regarding the social and economic impact of the crisis. Our findings show that, for the bankruptcies that occurred during the past period, the impact was not as large as one might expect, as the numbers fluctuate at about the same level as last year. As far as electricity consumption is concerned, there was a decrease of around 9.7% in comparison with last year, although household consumption in kWh is assumed to have increased. Although it might still be too early to draw a conclusion about the impact that the virus had on society and the economy, this period was undoubtedly hard and characterized by uncertainty. It is our belief that through such trials and tribulations people unite to find a solution and face every difficulty.
