
Applications of web scraping in official statistics

Summary

Large amounts of data are available on the World Wide Web (“Web” for short) that official statistics can also make use of. The extraction of this data by web scraping offers a wide range of potential, for example reducing the costs of data collection, relieving respondents, improving the quality of official data or identifying sample-relevant units for surveys. Using the example of price, tourism, labor market and business statistics, this article shows how official statistics in Germany are already using web scraping. Many of the applications listed here are still in the early stages of development. In other national statistical offices, data from the web are already being used to a greater extent for experimental statistics and in productive operation. The hesitant use in Germany is due, among other things, to an inadequate legal basis for web scraping in official statistics, an IT infrastructure that is not designed for the method and a lack of employees with the necessary qualifications.

Abstract

Accessing the increasing amount of data available on the World Wide Web (“web” in short) by means of web scraping offers new possibilities for official statistics. Possible benefits of automatically extracting data from the web include a reduction of costs for data collection, a decreased burden for respondents, an improved quality of statistical products as well as a more targeted approach to identifying units of interest for surveys. This article uses official statistics about prices, tourism, employment and enterprises as an example to showcase how web scraping is used in official statistics in Germany. Many of these applications are still in an early stage of development. In comparison, some other national statistical institutes use data from the web more intensely for experimental statistics as well as in production. The major reasons for the hesitant usage of web scraping in official statistics in Germany are the absence of a broad legal basis for web scraping, an inadequacy of the IT infrastructure as well as a lack of personnel with the necessary qualifications.

Introduction

The web as a data source can open up great potential for official statistics. The use of web scraping in official statistics, that is, a method for the automated extraction of data from the web (Mitchell 2018, p. IXf), gives rise to the hope that this new access to information will reduce both the effort of data collection and the burden on those obliged to provide information (Blaudow and Ostermann 2020). In addition, new methods of analysis, insights and content can be obtained, for example when text classification is applied to company websites to examine e-commerce (Peters 2018b) or when the analysis of online job advertisements yields new insights into the structure of the online job market (Rengers 2018a). The quality of official statistics also has the potential to be further improved through web scraping. Internet data can be collected at high frequency and processed automatically, so that the resulting statistics are very timely (Hackl 2016). In addition, web scraping can often be used to cover (approximately) entire populations, for example the websites of all companies with a web presence in Germany. A comparison of data from the web with administrative and survey data is also conceivable in order to check the quality of the respective data source and to estimate the correct value of a characteristic of interest (Peters 2018a). Web scraping can also be used to support surveys, for example to select respondents who are of particular interest. In exceptional circumstances, for example during the Corona crisis of 2020/21, in which surveys, especially among companies, were only possible to a limited extent, data from the Internet can help meet the demand for up-to-date statistics (see, for example, Kinne et al. 2020).

This article presents various existing or emerging applications of web scraping in official statistics. The following section first describes what web scraping is and which data can be obtained with it. This is followed by an overview of existing fields of application of web scraping in official statistics. The challenges for the use of web scraping in official statistics in Germany are then addressed. Finally, a conclusion is drawn.

What is web scraping?

The automated extraction of data from the web is known as web scraping (Condron et al. 2019). As a rule, it is not the webpage as it is directly visible to human users that is scraped; instead, the underlying HTML code, which determines the appearance and content of the page, is evaluated. A webpage is a document on the web that consists of structured text (mostly HTML code) and can be reached via a web address, the Uniform Resource Locator (URL). A website usually consists of several of these webpages, all of which share the same URL stem, called a domain (e.g. amazon.com, spiegel.de). The homepage of a website is the webpage that the browser navigates to when the domain is entered.

Web scraping programs simulate the behavior of human users in order to obtain certain information from websites (for an introduction see e.g. Mitchell 2018). The interaction with the websites is very similar to that of human users. Human interaction with the web generally takes place through a browser that communicates with the web server and renders the HTML code in a visually appealing way. In web scraping, browsers are sometimes programmed to visit certain webpages and extract certain parts of them. Filling out forms (e.g. search or login fields) can also be automated. However, web scraping is also possible without a browser: the web scraping program then communicates with the web server directly, and the HTML code of the webpage is evaluated locally.
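
The browser-less variant can be illustrated with a few lines of Python. The sketch below sends an HTTP request directly, parses the returned HTML locally and prints some basic properties of the page; the URL is a placeholder and the libraries used (requests, BeautifulSoup) are one common choice among many.

    # A minimal sketch of browser-less web scraping: the HTTP request is sent
    # directly and the returned HTML is parsed locally. The URL is a placeholder.
    import requests
    from bs4 import BeautifulSoup

    url = "https://www.example.com"  # placeholder URL
    response = requests.get(url, timeout=10)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")
    print(soup.title.get_text())    # content of the <title> tag
    print(len(soup.find_all("a")))  # number of links found in the HTML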

In German usage, screen scraping is often used synonymously with web scraping (von Schönfeld 2018, p. 25). However, the term screen scraping implies that the representation of a webpage on the screen is extracted, not the underlying code or database. Screen scraping also encompasses other applications that record what is displayed on digital screens, such as extracting data from sources outside the web. Web scraping is therefore the more precise term for the applications presented in this article.

Another term that is often used as an alternative to web scraping is web mining. Web mining is a collective term for the entire process of automatic exploration and extraction of information from the web (Kosala and Blockeel 2000). On the one hand, this term is broader than web scraping, since it also includes the analysis of the use and structure of the web, not just the extraction of information from the content of webpages (the latter is referred to as web content mining). On the other hand, web content mining is strongly focused on the preparation of unstructured data, such as text or images. Web scraping, however, is not limited to such unstructured data, but is also used to retrieve very specific information from websites that requires only a few additional processing steps. An example of this is the extraction of prices from online shops. Overall, the terms web mining, web scraping and screen scraping overlap considerably and in some cases can be used interchangeably. The term web scraping is used in this article.

Web crawling is another term that is often used interchangeably with web scraping. While web scraping refers to the extraction of data from a single webpage, the goal of web crawling is to find links on a webpage; these links are then followed to find further links (Mitchell 2018, chap. 3). Web crawling is mainly used by search engines that crawl and index the web. There are also applications of web crawling in official statistics, for example when the web presence of a company, including all of its webpages, is to be scraped in order to derive latent properties from the extracted text. Web crawling is then the process of identifying which webpages should be scraped.
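
The following sketch shows the crawling idea in Python: starting from one URL, links are collected and followed as long as they stay within the same domain. The start URL and the page limit are placeholders, and a productive crawler would additionally need politeness rules such as request delays (see the section on the legal framework below).

    # Sketch of a simple crawler that stays within one domain and collects the
    # URLs of all reachable webpages; these could then be scraped one by one.
    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin, urlparse

    def crawl(start_url, max_pages=50):
        domain = urlparse(start_url).netloc
        to_visit, seen = [start_url], set()
        while to_visit and len(seen) < max_pages:
            url = to_visit.pop()
            if url in seen:
                continue
            seen.add(url)
            try:
                html = requests.get(url, timeout=10).text
            except requests.RequestException:
                continue  # skip pages that cannot be retrieved
            for link in BeautifulSoup(html, "html.parser").find_all("a", href=True):
                target = urljoin(url, link["href"])
                if urlparse(target).netloc == domain and target not in seen:
                    to_visit.append(target)
        return seen

    pages = crawl("https://www.example.com")  # placeholder domain
    print(len(pages))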

Furthermore, specific and generic web scraping can be distinguished from one another (Stateva et al. 2018). With specific web scraping, the structure of the website is known in advance and only certain information is extracted in a targeted manner. This is the case, for example, with web scraping in price statistics: within a given online shop, prices are usually always in the same place in the HTML code and can therefore be extracted directly. With generic web scraping, in contrast, the content and structure of the website are not known. In this case, the entire content of the website is extracted. One use case here is, for example, the derivation of latent properties from company websites. Since it is not known how individual companies build their web presence, all of the HTML code is extracted.

A significant aspect of web scraping is data extraction. The locally available HTML code is still unstructured and does not in itself represent a gain in information. With specific web scraping, the data of interest are usually extracted based on their position in the HTML code. HTML is structured with so-called tags, which assign functions and properties to the components of the code. With the help of these tags, parts of the code can be selected very precisely.
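
A minimal illustration of this tag-based selection in Python: the HTML snippet and the class names are invented, and the price is assumed to sit in a fixed element that can be addressed with a CSS selector.

    # Sketch of specific web scraping: the price is assumed to sit in a fixed
    # position in the HTML, here in a hypothetical <span class="price"> element.
    from bs4 import BeautifulSoup

    html = """
    <html><body>
      <div class="product">
        <h2>Example product</h2>
        <span class="price">19,99 &euro;</span>
      </div>
    </body></html>
    """

    soup = BeautifulSoup(html, "html.parser")
    price_tag = soup.select_one("div.product span.price")  # CSS selector on the tags
    print(price_tag.get_text(strip=True))  # "19,99 €"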

In generic web scraping, all of the text on a webpage is often of interest. This text is then brought into a form that can be used for statistical analysis by means of text mining methods. Text mining is a collective term for various algorithm-based and statistical methods for the analysis of text data (for an introduction see Gentzkow et al. 2019).
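
As an illustration of such a preparation step, the sketch below converts scraped page texts into a TF-IDF document-term matrix, one common text mining representation; the two example texts are invented stand-ins for scraped website content.

    # Sketch: after generic scraping, the raw page texts are turned into a
    # document-term matrix with TF-IDF weights, a common first text-mining step.
    from sklearn.feature_extraction.text import TfidfVectorizer

    page_texts = [  # placeholder texts from scraped webpages
        "Onlineshop für Schuhe und Bekleidung, Versand und Retoure kostenlos",
        "Unser Ingenieurbüro bietet Beratung für Maschinenbau und Konstruktion",
    ]
    vectorizer = TfidfVectorizer(max_features=1000)
    X = vectorizer.fit_transform(page_texts)
    print(X.shape)  # (number of pages, number of terms)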

Some website operators offer direct automated access to the databases on which the website is based. This takes place in the form of a so-called web API ("application programming interface"; Mitchell 2018, chapter 12). The use of an API is generally preferable to regular web scraping, among other things because an API transmits the data of interest in a structured form. However, only a small part of the information on the web is available through an API. Where applications of web scraping in official statistics are presented in the following, this can also include applications that use an API to obtain information, since the same data source is tapped with it.
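
Retrieving data via an API typically looks like the following sketch: the endpoint and parameters here are purely hypothetical, but the returned JSON is already structured and needs no HTML parsing.

    # Sketch of retrieving data via a web API instead of scraping HTML: the
    # server returns structured JSON. URL and parameters are hypothetical.
    import requests

    response = requests.get(
        "https://api.example.org/v1/prices",  # hypothetical API endpoint
        params={"product_id": "12345", "format": "json"},
        timeout=10,
    )
    response.raise_for_status()
    data = response.json()  # already structured, no HTML parsing needed
    print(data)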

Areas of application of web scraping

In the following, current areas of application of web scraping in official statistics in Germany are presented by subject area. The order in which the application examples are listed corresponds to their degree of implementation in production: applications that are already used in statistics production are listed first.

Price statistics

Web scraping is already well established as a method of collecting prices for the consumer price index. Due to the great importance of online trading in Germany, around 10,000 products are collected online every month for consumer price statistics, albeit partly still manually. The prices of some product groups are already recorded fully automatically. This applies, for example, to rental cars as well as long-distance bus and train trips. The goal is to collect all products in the online price collection automatically using web scraping by the end of 2021. For this purpose, the generic program "ScraT" (scraping tool) was developed, which makes web scraping of prices possible even for employees without in-depth IT knowledge (Blaudow and Ostermann 2020). Price collection is faster and more cost-efficient than manual collection, and the frequency of price collection can be increased considerably as required. This improves the quality of consumer price statistics.

In addition to collecting prices for the consumer price index, web scraping is also used in price statistics to investigate dynamic pricing in online retail. Dynamic pricing poses a particular challenge for price statistics, because the traditional monthly collection of one price per product is then no longer representative. Web scraping is used to examine the frequency of price changes and the variance of prices. It has been shown that some online retailers make more use of dynamic pricing than others and that the variation in prices also differs between retailers. The frequency of price collection should therefore depend on the retailer (Blaudow and Burg 2018). The extent of hourly, daily and seasonal price changes could also be examined with web scraping (Hansen 2020a, b). From this, Hansen (2020b) derived recommendations for the timing of automated price queries in order to avoid times of strong price volatility.
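
Once a scraped price series per product and retailer is available, such indicators can be computed with a few lines of code. The sketch below, using invented observations, calculates the share of price changes and the standard deviation of prices per retailer; column names and data are illustrative only.

    # Sketch: from a scraped price series per product, the frequency of price
    # changes and the dispersion of prices can be computed per retailer.
    import pandas as pd

    # invented scraped observations: retailer, product, timestamp, price
    df = pd.DataFrame({
        "retailer": ["A", "A", "A", "B", "B", "B"],
        "product":  ["x", "x", "x", "x", "x", "x"],
        "time":     pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"] * 2),
        "price":    [9.99, 10.49, 10.49, 12.99, 12.99, 12.99],
    })

    def change_share(prices):
        # share of observations at which the price differs from the previous one
        return prices.diff().ne(0).iloc[1:].mean()

    summary = df.sort_values("time").groupby(["retailer", "product"])["price"].agg(
        price_changes=change_share, price_std="std"
    )
    print(summary)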

The phenomenon of personalized pricing, in which prices are adjusted to the presumed willingness to pay of potential buyers, has not yet been examined in official statistics (Zander-Hayat et al. 2016). Features for predicting willingness to pay are, for example, the device used, previously visited websites or the length of time spent on the respective website. This poses new challenges for official statistics, as not only the time and frequency of price collection have to be changed, but typical buyer profiles must also be simulated so that the prices actually paid can be collected. So far this has not happened. In a study from 2018, however, only very few of the online shops examined used personalized pricing, and only to a small extent (Dautzenberg et al. 2018).

Web scraping is thus well suited to automating price collection on the Internet. In addition, it is an efficient method to expand online price collection, since fewer human resources are required in the medium term than with traditional price collection. Some recent developments, such as dynamic pricing, can be addressed by more frequent price collection through web scraping. Personalized pricing, however, has not yet been investigated, which restricts the representativeness of the collected prices, although presumably only to a small extent.

Insolvency announcements

An already established use of web scraping in official statistics is the scraping of insolvency announcements on the website "insolvenzbekanntmachungen.de", which is operated by the Ministry of Justice of the State of North Rhine-Westphalia for the entire federal territory. All announcements of insolvency proceedings from the last two weeks are available there and can be extracted automatically. These insolvency announcements are scraped by the Federal Statistical Office (Destatis) once a week and searched using text mining for keywords of interest and for file numbers and court identifiers.
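
The text mining step can be as simple as keyword matching plus a regular expression for the file number. The sketch below uses an invented announcement text and a simplified, hypothetical pattern for German insolvency file numbers; the actual formats used by the courts are more varied.

    # Sketch: searching the text of a scraped insolvency notice for keywords and
    # a court file number. The regular expression is a simplified, hypothetical
    # pattern of the form "<number><letters> IN <number>/<year>".
    import re

    notice = ("Amtsgericht Musterstadt, Beschluss vom 01.02.2021, "
              "Aktenzeichen 36a IN 1234/20: Das Insolvenzverfahren wird eröffnet.")

    keywords = ["eröffnet", "Insolvenzverfahren"]
    found = [kw for kw in keywords if kw.lower() in notice.lower()]

    file_number = re.search(r"\b\d+\s?[a-zA-Z]*\s?IN\s?\d+/\d{2}\b", notice)
    print(found, file_number.group(0) if file_number else None)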

This application of web scraping serves the quality assurance of the insolvency statistics, as it ensures that all insolvency proceedings have been recorded in the statistics. In addition, the data obtained in this way serve as a basis for estimating the quarterly business demography. During the Corona crisis of 2020/21, this data could be used to offer statistics on opened insolvency proceedings with increased timeliness (Footnote 1).

However, the procedure currently depends on the "insolvenzbekanntmachungen.de" website remaining unchanged. If something fundamental changes in the HTML code of the page, the web scraping program no longer works and must be adapted to the changes. This is particularly problematic because an unrestricted search in the insolvency announcements is only possible for two weeks after publication. Adjustments to the web scraping program would then have to be made under great time pressure. This problem could be solved if the Ministry of Justice of North Rhine-Westphalia provided official statistics with an API or if regular data deliveries to official statistics took place in another form.

Tourism statistics

Web scraping is used at the Hessian State Statistical Office (HSL) to enrich and quality-check the monthly tourism statistics. The Hessian accommodation statistics cover around 3500 establishments with at least 10 available beds every month. Many of these establishments not only have their own company website, but are also listed on a commercial online portal, such as Booking.com, HRS-Holiday or Hotel.com, on which they offer their services. There, information about the accommodation providers with many attributes can be retrieved and, thanks to the consistent structure of the associated webpages within an online portal, extracted without time-consuming text processing steps. Linking these data with the reporting population of the accommodation statistics enables a comparison and a check of the completeness of the reporting population. Information on the number of rooms offered is available in many tourism-related online portals and is used to determine the number of beds. This variable is one of the survey characteristics in the accommodation statistics and is also decisive for an establishment's obligation to provide information. In addition, auxiliary information such as email addresses or telephone numbers is often available in the online portals. Web scraping can thus provide important support for maintaining and updating the set of establishments to be covered by the statistics. Since the offers on tourism-related online platforms can be searched for different regions, these portals are also a promising data source for the nationwide tourism statistics.

As part of a feasibility study, the HSL succeeded in automatically extracting data on 2438 Hessian accommodation establishments, including further information such as the number of rooms and beds, from a commercial online booking portal (Peters 2018a). In this context, it was also possible to identify Hessian accommodation providers who, based on the specified number of rooms, were at the threshold of the obligation to provide information. The reporting population of the accommodation statistics could thus be supplemented by these establishments. The automatically collected data could also be extremely useful for checking the plausibility of the turnover of accommodation establishments, since the number of sleeping accommodations offered generally correlates positively with an establishment's turnover.
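
Linking scraped portal entries with the reporting population requires matching names and addresses that are rarely spelled identically. A minimal sketch of such approximate matching is shown below; the establishment names, the room counts and the similarity threshold are invented, and a production system would use more robust record linkage methods.

    # Sketch: linking scraped accommodation offers with the reporting population
    # of the accommodation statistics via approximate name matching.
    from difflib import SequenceMatcher

    register = ["Hotel Sonne Kassel", "Pension am Markt Fulda"]       # invented
    scraped  = [("Hotel Sonne, Kassel", 42), ("Gasthof Adler Wetzlar", 18)]  # (name, rooms)

    def best_match(name, candidates, threshold=0.8):
        scored = [(c, SequenceMatcher(None, name.lower(), c.lower()).ratio())
                  for c in candidates]
        cand, score = max(scored, key=lambda x: x[1])
        return cand if score >= threshold else None

    for name, rooms in scraped:
        print(name, rooms, "->", best_match(name, register))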

Online job advertisements

The use of web scraping for labor market statistics has been the subject of research at Destatis for several years. As part of the ESSnet Big Data projects 2016–2018 and 2018–2020 (Footnote 2), it is being examined how online job advertisements can be analyzed and how new indicators on the job market and labor demand can be derived from them (Rengers 2018a, b).

These job advertisements are primarily obtained from online job portals. One of the challenges is the large number of job portals in Germany. Two data sources are used for the analysis in the projects: on the one hand, job advertisements from two data deliveries of the Federal Employment Agency (BA), which operates the largest German job portal, and on the other hand, data from the European Centre for the Development of Vocational Training (Cedefop, Footnote 3). Cedefop has been scraping job portals in several European countries since 2013 and already had experience in web scraping, an existing infrastructure and well-equipped staff and IT resources before the project started. It was therefore decided to fall back on this existing infrastructure in order to make better use of the limited resources of the national statistical authorities. For the second phase of the study, data from 134 German job portals, identified through market observation, were available there.

Data quality is a particular challenge in this project, among other things because information has to be obtained from the running text of heterogeneous advertisements. Job advertisements are often published on several portals and sometimes even appear several times on the same portal. At the same time, however, this redundancy also leads to better coverage of the online job market, as not all portals can be covered continuously for technical reasons, for example because changes to the websites require the web scraping programs to be adapted or because computing capacity is insufficient. It is therefore necessary to identify and remove duplicates (Rengers 2018a). So far, the bias that arises because job offers that are attractive to applicants are taken offline again after a short time, once enough applications have been received, has not been investigated. The scraped data would in that case contain a disproportionately large number of unattractive job advertisements. This bias can be avoided by a sufficiently high frequency of scraping attempts, since then job advertisements with a very short dwell time on the job portals are also captured. At this point in time, however, it has not yet been investigated whether the frequency of scraping by Cedefop is sufficient to minimize this bias.
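
A simple form of the deduplication step is sketched below: advertisements whose normalized title and text yield the same hash are treated as duplicates. The example advertisements are invented, and real pipelines additionally need near-duplicate detection, since reposted advertisements are often slightly reworded.

    # Sketch of a simple deduplication step: job advertisements whose normalised
    # title and text produce the same hash are treated as duplicates.
    import hashlib
    import re

    ads = [
        {"title": "Softwareentwickler (m/w/d)", "text": "Wir suchen Verstärkung..."},
        {"title": "Softwareentwickler (m/w/d)", "text": "Wir suchen  Verstärkung..."},
        {"title": "Pflegekraft", "text": "Für unser Team in Kassel..."},
    ]

    def fingerprint(ad):
        normalised = re.sub(r"\s+", " ", (ad["title"] + " " + ad["text"]).lower()).strip()
        return hashlib.sha1(normalised.encode("utf-8")).hexdigest()

    seen, unique_ads = set(), []
    for ad in ads:
        fp = fingerprint(ad)
        if fp not in seen:
            seen.add(fp)
            unique_ads.append(ad)

    print(len(unique_ads))  # 2: the first two ads collapse to one entry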

Both the Cedefop data and the data from the BA lack precise information on how up-to-date a job advertisement is, as its publication date either could not be or was not recorded. Positions that have already been filled could therefore also be included in the dataset. Some characteristics of the job advertisements that are of great interest for the analysis, such as the number of jobs per advertisement, are missing in particular from the Cedefop data (Rengers 2018a).

It can be assumed that the online job market differs systematically from the job market as a whole. However, the online job market is of great importance in Germany: according to the IAB job vacancy survey 2015, around 41% of all companies cite online job portals as a means of recruiting new employees. A job advertisement on the company's own website is even more popular; 52% of all companies stated that they use this route (Brenzel et al. 2016). Analyses from Switzerland confirm this assessment and also suggest that advertisements placed exclusively in print media no longer play a role (Sacchi 2014). An overview of the structure of the online job market therefore helps to better understand the job market in Germany. The data from the Federal Employment Agency could be used to examine the distribution of online job advertisements and vacancies across industries, occupations and company size. One challenge was that the Federal Employment Agency's definitions of these characteristics deviate from the definitions used in official statistics, so that a comparison with the entire labor market was only possible to a limited extent or not at all for some characteristics. The Cedefop dataset, by contrast, is based on common international classifications.

For the continuation of the investigation within the scope of ESSnet Big Data II, the online job market will be examined longitudinally. For this continuation, Cedefop data are used, for which significantly more job portals were included than in the study by Rengers (2018a) and for which the frequency of scraping was increased, so that the German online job market can be mapped much more completely.

In the second phase of the project, two types of indicators based on online job advertisements are being developed. The first are monthly indices of labor demand by region, industry and qualification, comparable to the Internet Vacancy Index of the Australian Ministry of Labor (Australian Government 2020). Cedefop's data are used for this. Since the exact period of validity of individual job advertisements is not known, a synthetic stock is calculated on the basis of an average validity period. In addition, indices are calculated that show the relative change in labor demand compared to the starting value, because despite deduplication, advertisements are most likely still counted more than once.
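
The construction of such a synthetic stock can be illustrated with toy numbers: under the assumption of a two-month average validity, an advertisement posted in month t is counted as live in t and t+1, and the resulting stock is expressed as an index relative to the first month. Both the figures and the assumed validity are invented.

    # Sketch: a synthetic stock of online job advertisements derived from new
    # postings and an assumed average validity period, expressed as an index.
    import pandas as pd

    new_ads = pd.Series(
        [1000, 1200, 900, 1100],
        index=pd.period_range("2021-01", periods=4, freq="M"),
    )
    avg_validity_months = 2  # assumption for illustration only

    # an advertisement posted in month t is assumed to be live in t and t+1
    synthetic_stock = new_ads.rolling(window=avg_validity_months, min_periods=1).sum()
    index = 100 * synthetic_stock / synthetic_stock.iloc[0]
    print(pd.DataFrame({"new_ads": new_ads, "stock": synthetic_stock, "index": index}))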

The second indicator concerns labor market concentration on the demand side. It shows how many companies in a region offer vacancies for a specific occupation. If only a few potential employers in a region offer jobs in certain occupations, these employers have market power and can offer lower wages. Online job advertisements can thus be used to identify monopsonies or oligopsonies in the labor market (Azar et al. 2018).
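
Azar et al. (2018) measure this concentration with a Herfindahl-Hirschman index over employers' shares of vacancies per labor market. The sketch below computes such an index per region and occupation from invented vacancy counts; it illustrates the general approach, not the exact specification used in the project.

    # Sketch: labour demand concentration per region and occupation, measured
    # with a Herfindahl-Hirschman index over employers' shares of online vacancies.
    import pandas as pd

    vacancies = pd.DataFrame({
        "region":     ["Kassel", "Kassel", "Kassel", "Fulda", "Fulda"],
        "occupation": ["Pflege", "Pflege", "Pflege", "Pflege", "Pflege"],
        "employer":   ["A", "B", "C", "D", "D"],
        "n_ads":      [10, 5, 5, 8, 2],
    })

    def hhi(group):
        shares = group.groupby("employer")["n_ads"].sum() / group["n_ads"].sum()
        return (shares ** 2).sum()

    concentration = vacancies.groupby(["region", "occupation"]).apply(hhi)
    print(concentration)  # values close to 1 indicate a highly concentrated market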

Extraction of company properties based on the company website

A large number of companies in Germany have their own web presence. This web presence can be used as a source of information about companies, as numerous company characteristics can be derived from it. Many of these characteristics are latent: only features that point to such properties can be observed directly on the company website, but usually not the properties themselves. Examples include innovative activity, charitable work or membership of a certain industry. If the relationship between the occurrence of such features on the company website and the latent property is known, it can be determined for any newly assigned, automatically extracted company website whether the company has this latent property (see, for example, Kühnemann et al. 2020). Web scraping can therefore contribute to enriching, updating and checking available company data.

The prerequisite for enriching company data with new Internet content is knowledge of the URL of the company's own website. However, this is not available in the business register, a statistical database that aims to provide a complete list of all economically active companies in Germany. Based on a procedure published by the Italian statistical office (ISTAT) (Barcaroli et al. 2016), an algorithm was developed at the HSL that searches for URLs using the address data of a company database with the help of a search engine such as Google and assigns them to the matching company. With this method, around 1200 websites of companies covered by the Hessian survey on information and communication technology (IKTU 2017) have already been found and assigned. About 90% of the assignments turned out to be correct in a manual check of a sample of 100 companies (Peters 2018b).
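
The basic idea of such a URL search can be sketched as follows: candidate URLs returned by a search engine are scored by whether the company name and postcode appear in the page text, and the best-scoring candidate is assigned. The search function here is a placeholder returning fixed example URLs, and the scoring rule is a deliberately simple stand-in for the procedures described by Barcaroli et al. (2016) and Peters (2018b).

    # Sketch of the URL-finding idea: query a search engine with company name and
    # address, then score candidate pages by whether name and postcode appear in
    # the page text. All names, URLs and the scoring rule are hypothetical.
    import requests
    from bs4 import BeautifulSoup

    def search_web(query, n_results=5):
        # placeholder for a search engine query; returns fixed example URLs here
        return ["https://www.example-company.de", "https://www.unrelated-site.de"]

    def score_candidate(url, company_name, postcode):
        try:
            text = BeautifulSoup(requests.get(url, timeout=10).text,
                                 "html.parser").get_text().lower()
        except requests.RequestException:
            return 0
        return (company_name.lower() in text) + (postcode in text)

    def find_company_url(company_name, postcode):
        candidates = search_web(f"{company_name} {postcode}")
        scored = [(url, score_candidate(url, company_name, postcode)) for url in candidates]
        best_url, best_score = max(scored, key=lambda x: x[1], default=(None, 0))
        return best_url if best_score > 0 else None

    print(find_company_url("Musterfirma GmbH", "34117"))  # invented company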

Starting from known company domains, these web presences can be scraped and used as a basis for the extraction of latent properties. An example of an existing application in this area is a study measuring the Internet economy in the Netherlands (Oostrom et al. 2016). There, data from company websites were used as a basis for classifying companies into categories of Internet usage. These categories included, for example, the existence of an online shop or the offer of online services. It could be shown that 4.4% of all jobs and 7.7% of total turnover in the Netherlands belong to the Internet economy.

The HSL has been working in a similar direction since 2019, investigating how the operation of an online shop can be automatically verified on German company websites. In a first attempt with 8422 researched websites of companies from different registers, 86% of all companies with online shops were correctly identified by the classification algorithm. Around 10% were wrongly classified as operating an online shop.
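
The core of such a verification is a text classifier trained on labelled websites. The sketch below uses TF-IDF features and a logistic regression as one plausible setup; the training texts and labels are invented stand-ins for scraped website content, and the model choice is not necessarily the one used by the HSL.

    # Sketch: a supervised model learns from labelled company website texts
    # whether an online shop is present. Data and labels are illustrative only.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    texts = [
        "Warenkorb Versandkosten jetzt kaufen Zahlung per Kreditkarte",
        "Produkte online bestellen kostenloser Versand ab 50 Euro Warenkorb",
        "Unsere Kanzlei berät Sie in allen Fragen des Arbeitsrechts",
        "Impressum Kontakt Anfahrt Öffnungszeiten unseres Büros",
    ]
    has_online_shop = [1, 1, 0, 0]

    model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    model.fit(texts, has_online_shop)
    print(model.predict(["Jetzt im Shop bestellen und in den Warenkorb legen"]))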

Once company URLs are available, entirely new characteristics can be examined as needed. At the Centre for European Economic Research, for example, shortly after the introduction of far-reaching contact restrictions due to the corona pandemic in Germany, it was analyzed how badly German companies were affected by the crisis. For this purpose, it was examined which companies mention the Corona crisis on their website and in which context, for example whether companies had to close or merely adjusted their opening hours (Kinne et al. 2020). The websites were scraped several times a week in order to track the development over time. Regional differences were also examined. This shows that access to company URLs offers the potential to respond quickly to new information needs.

Relocation of economic activities abroad

A completely new, currently only conceptual application of web scraping in business statistics concerns the identification of companies that relocate business activities abroad. These companies are the focus of the statistics on global value chains to be introduced for the reporting years 2021–2023. Official statistics in Germany face the challenge that, at just under 2%, only a very small proportion of companies in the population actually relocate economic activities abroad (Kaus 2019a). In a random sample, the majority of the questionnaire is therefore not relevant for most of the companies surveyed. This presumably has a negative effect on the acceptance and the response rate of the survey, which was very low, for example, in the 2016 survey (Kaus 2019b). If web scraping could be used to identify relocating companies before the sample is drawn, they could be included in the sample in a much more targeted manner. This would reduce both the reporting burden on the companies surveyed and the costs of the survey. A qualitative analysis of various data sources available online has already shown that it is in principle possible to identify companies with relocations abroad. To identify relocating companies, company names from the statistical business register (URS) would have to be used in searches of such sources and the results scraped.

Summary of previous experiences

These six examples illustrate the great potential of web scraping for official statistics. They show that web scraping can be used in official statistics in at least three different areas:

  • data collection,

  • quality assurance and plausibility checks,

  • determination of the population or of sample-relevant units.

Tab. 1 gives an overview of the functions web scraping fulfills in the examples from this article.

It is also interesting to consider which different types of Internet sources are scraped. The type of website used as a data source has a major impact on the approach to web scraping and data preparation. The degree of structure of the Internet source is of particular importance here. Websites with an identical HTML structure, on which the information of interest can always be found in a consistent format in the same position, can be described as highly structured. Websites that are all structured differently and on which the information of interest is available in unstandardized form, for example as free text in various formats, can be described as having a low degree of structure. Tab. 2 shows examples of some types of websites together with their (approximate) degree of structure and indicates how these are used for official statistics. Internet sources that are considered partially structured often have a constant HTML structure, but the information of interest is not yet available in standardized form. An example is the address of an accommodation establishment, which is always in the same position in an online booking portal but appears in different spellings or with spelling errors.

At the moment, German official statistics primarily scrape websites that already have a high degree of structure. These are, on the one hand, online shops, in which product prices are generally listed in the same position in the HTML code of the respective webpage. The prices are available in one or a few standardized formats (e.g. as a number with a maximum of two decimal places and a euro symbol), so that the data can be transferred with little or no further processing. Insolvency announcements on the Internet also have a consistent structure, in which fixed legal formulations are used to present the information of interest. The accommodation statistics also benefit from the fact that there are relatively few online booking portals, within which the offers for accommodation are consistently structured. This constant structure makes it possible to extract the data of interest without machine learning methods or complex preparation of text data.
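
For such standardized formats, a short parsing rule is usually sufficient. The sketch below converts scraped price strings of the form "number with up to two decimal places plus euro sign" into numeric values; the example strings are invented, and real shops may require a few additional format variants.

    # Sketch: turning scraped German price strings into machine-readable values.
    import re
    from decimal import Decimal

    def parse_price(raw):
        match = re.search(r"(\d+(?:[.,]\d{1,2})?)\s*€", raw)
        if not match:
            return None
        return Decimal(match.group(1).replace(",", "."))

    print(parse_price("19,99 €"), parse_price("ab 5€"), parse_price("Preis auf Anfrage"))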

The use of sources with a lower degree of structure is currently hardly established in German official statistics. The derivation of latent properties on the basis of company websites is still at the very beginning, especially because company URLs are not yet available in the URS. Scraped data from social media are also currently not used in German official statistics. There are, however, models for this in the European Statistical System. The Dutch statistical office (CBS) offers a dashboard on social tensions and emotions related to security and justice under the heading “experimental data” (Daas and Puts 2014). For this purpose, CBS acquires posts from numerous social media, primarily Twitter and Facebook, which have already been evaluated for their sentiment. Sentiment here means whether a post conveys a positive, negative or neutral emotion. Results are provided online within 24 hours (Footnote 4). It has been shown that these aggregated emotions are strongly related to the consumer confidence index (Daas and Puts 2014).

Despite all the advantages shown, web scraping in Germany - with the exception of price statistics - is still in its infancy. The reasons for this are explained in more detail in the following section.

Challenges

Most of the examples of web scraping applications in official statistics in Germany listed above are currently not used in regular production. The reason for this is that numerous challenges make it difficult to carry out web scraping in official statistics in Germany. These are essentially inadequate or non-existent statistics-specific legal provisions, an inadequate IT infrastructure and a lack of employees qualified in this new method.

Legal framework

Publicly and freely available information, such as data obtained through web scraping, may be collected by official statistics in Germany and linked with data from other sources for the production of economic and environmental statistics. According to Section 5 (5) of the Federal Statistics Act (BStatG), the statistical offices may collect data from “generally accessible sources” even without a specific law or ordinance. For the maintenance and management of the statistical register, information from generally accessible sources may also be used in accordance with Section 13 (2) sentence 4 BStatG. In addition, the BStatG allows data from surveys, statistical registers and other publicly accessible sources to be combined for economic and environmental statistics (Section 13a BStatG). The particular aim of this provision is to reduce the burden on those obliged to provide information.

A challenge for web scraping in official statistics are technical barriers on websites whose purpose is to block access by web scraping programs (so-called "bots"). These technical barriers are used, for example, to prevent frequent requests from bots that could jeopardize the functionality of the website. It goes without saying that it is not in the interest of official statistics to disrupt the normal operation of websites through web scraping. To prevent this, guidelines for ethical web scraping were drawn up in ESSnet Big Data II to ensure that the method is used responsibly (Condron et al. 2019). These include, for example, minimizing the burden on website operators (e.g. by not making too many requests to the same website within a short period of time), identifying oneself to the website as a statistical office (via the so-called user-agent string) and providing transparent information about the methods and processes used for web scraping and data preparation (Footnote 5).
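
In code, these guidelines translate into a few simple habits, sketched below: the scraper announces itself via the user-agent string, checks the website's robots.txt and pauses between requests. The base URL, the contact address in the user-agent string and the delay are placeholders.

    # Sketch of "ethical" scraping in the sense of the guidelines above:
    # identify the scraper, respect robots.txt and limit the request rate.
    import time
    import requests
    from urllib import robotparser
    from urllib.parse import urljoin

    BASE_URL = "https://www.example.com"                          # placeholder
    USER_AGENT = "StatistikAmt-Scraper/1.0 (kontakt@example.de)"  # placeholder identity

    robots = robotparser.RobotFileParser(urljoin(BASE_URL, "/robots.txt"))
    robots.read()

    def polite_get(path, delay_seconds=5):
        url = urljoin(BASE_URL, path)
        if not robots.can_fetch(USER_AGENT, url):
            return None                # page is off-limits for bots
        time.sleep(delay_seconds)      # limit the load on the website
        return requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)

    response = polite_get("/produkte")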

Technical barriers that prevent bots from accessing website content make it more difficult for official statistics to collect web data, even if the guidelines on ethical web scraping are observed. Price statistics are an exception here, as a statistics-specific legal basis for the use of web scraping already exists. The Price Statistics Act (PreisStatG) has explicitly allowed the use of "automated retrieval procedures" for generally accessible price information since 1 January 2020 and obliges the "holder of the data [...] to allow the retrieval of the data" (Section 7b PreisStatG). The legislature thereby recognizes that the use of web scraping as a new collection method can secure or even improve the quality of price statistics (German Federal Council 2019). The change in the law is justified, among other things, by the fact that online trading has increased significantly and, as a result, companies' pricing policies have also changed, which is primarily expressed in greater price volatility. The new price statistics law enables the statistical offices of the federal and state governments to use web scraping to ensure the representativeness of the prices collected. Website operators consequently cannot prevent the statistical offices from automatically extracting price data. This makes web scraping much easier for price statistics, as website operators can no longer use technical means to block scraping activities. There is as yet no comparable provision in other statistical laws.

IT infrastructure

The hardware required for web scraping depends on the type of application. For many smaller projects that aim to collect data once, a simple laptop with unrestricted Internet access is sufficient (Footnote 6). For the implementation of web scraping in statistics production, however, the hardware requirements increase with the scope, the complexity and the methods used. Downloading websites and, even more so, processing large amounts of unstructured data, such as those that arise when processing company websites, place high demands on main memory as well as on the number and performance of processor cores. A future-proof IT infrastructure for web scraping would also include powerful graphics processors with which artificial neural networks can be trained. Artificial neural networks have emerged as particularly suitable for processing text data (see, for example, Yang et al. 2016).

Requirements for the software depend both on the application goal and on the experience of the employees. Stateva et al. (2018) found that, at least in the area of web scraping of company characteristics, the necessary web scraping methodology can be implemented in numerous programming languages or with the support of a wide variety of software. The program or script for performing the web scraping can be written in Java, Python or R, for example. In addition, database software is often required for efficient management of the scraped data. Depending on the degree of structure of the data source, both relational databases (e.g. MySQL) and non-relational databases (e.g. MongoDB) are suitable for this. Web scraping software is often open source and free of charge.
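
Storing scraped observations in a relational database can look like the following sketch, which uses SQLite as a lightweight stand-in for a server-based system such as MySQL; table name and columns are invented.

    # Sketch: storing scraped observations in a relational database. SQLite is
    # used here as a lightweight stand-in for a server-based system such as MySQL.
    import sqlite3

    conn = sqlite3.connect("scraped_prices.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS prices (
            scraped_at TEXT,
            shop       TEXT,
            product    TEXT,
            price      REAL
        )
    """)
    conn.execute(
        "INSERT INTO prices VALUES (?, ?, ?, ?)",
        ("2021-03-01T10:00:00", "example-shop.de", "Beispielprodukt", 19.99),
    )
    conn.commit()
    conn.close()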

The choice of software can depend, on the one hand, on what employees already have experience with; on the other hand, a multitude of different software packages hinders exchange and the use of synergy effects. If different software is available for the development of web scraping applications, this promotes innovation, since completely new methods can be implemented quickly and employees' existing programming skills can be used. Using too many different programming languages, however, means that the experience gained cannot be transferred beyond a specific application. In addition, the handling of error messages, the creation of logs and the like must then be regulated individually for each web scraping program. Ideally, a few common IT tools and programming languages for web scraping would be identified, with which web scraping programs can be developed in as generalized a form as possible in order to increase the chance of their reusability.

In order to enable feasibility studies on the use of web scraping in official statistics in Germany and its implementation in production, a server with free access to the Internet and scalable computing and working memory capacity is required. An automatable interface between this web scraping server and the secure network of the respective statistical office should also be available in order to allow the data collected to be transmitted. The IT infrastructure required for this in the German statistical offices does not yet exist or is still being set up.

Employee qualification

Web scraping is not a method that is taught in common business and social science degree programs. Nor is web scraping a central part of the curriculum in data science programs in Germany. Web scraping requires employees to have good programming skills, currently for example in R, Python or Java, and computer science skills, for example in the parallelization of processes or system administration. These skills are rarely found in combination with a sufficiently deep understanding of the subject-matter statistics in which the data extracted from the web are needed. Project groups for the development of web scraping applications with members from subject-matter statistics and IT (ideally supplemented by a data scientist), so that the necessary knowledge does not have to be combined in one person, are currently still rare in official statistics. Employees who do web scraping in official statistics have often acquired the necessary knowledge on their own. Training courses in this field are mostly online courses that are not tailored to the specifics of official statistics. Some courses in the European Statistical Training Programme (ESTP) now deal with aspects of web scraping; in particular, they provide an overview of application examples and existing IT tools for processing large amounts of data (Footnote 7). However, the offer does not cover all the necessary qualifications in sufficient depth.

Conclusion and outlook

In this article, areas of application for web scraping in official statistics were presented. The method offers new potential and can improve the quality of official statistics. Web scraping also reduces the burden on respondents, as information that is available on the Internet may no longer have to be collected via a questionnaire in the future. Collection costs can also be reduced through web scraping, for example because prices from online shops no longer have to be collected manually. Overall, it can be expected that further areas of application of web scraping for official statistics will be opened up in the future. Despite this diverse potential, the method can so far only be used to a limited extent in the German statistical system, and most applications are still in the early stages of development. With the exception of price statistics, this is due to a lack of statistics-specific legal bases. In addition, the IT infrastructure in the German statistical system is not designed for the requirements of web scraping, and there is a shortage of employees with the necessary knowledge and skills.

Restrictions in the quality of data from web scraping are also a challenge for the use of this data in official statistics. The same quality requirements are placed on web data in official statistics as on data from surveys or administrative sources. Qualitative aspects of the scraped data must be checked separately for each application. Restrictions in the quality of data from web scraping and other new digital data were examined, for example, in the ESSnet Big Data II for the various applications, and guidelines for dealing with them were drawn up (Quaresma et al. 2020). Since insufficient representativeness poses a particular challenge for the use of new digital data sources, this quality aspect was dealt with in detail in Beręsewicz et al. (2018).

Some national statistical offices in the ESS have largely overcome these challenges. One example is CBS, which has been using data from the web for experimental statistics for years (see e.g. Daas and Puts 2014; Oostrom et al. 2016). What these early studies have in common, however, is that the data obtained through web scraping were acquired by CBS from global commercial companies. For Daas and Puts (2014), a dataset with billions of social media posts was purchased. The acquired data already contained the characteristic of interest, because all posts had been classified according to their sentiment by a commercial company; no further text classification steps were therefore necessary. Oostrom et al. (2016) used data that had already been largely prepared for the analysis. As a result, the authors' task was limited to linking the web data with official data sources and to traditional statistical analyses, that is, to core competencies of official statistics.

In addition to collecting data themselves using web scraping, statistical offices can acquire scraped and at least partially processed data as a way of tapping the potential of the data volumes available on the web. Numerous global companies operate web scraping on a large scale and hold data that is potentially interesting for official statistics (e.g. Google and other data providers) or offer web scraping products tailored to the needs of customers (Scrapinghub, Octoparse, etc.). When purchasing scraped and (partly) processed data, however, not all steps of data acquisition and processing are transparent to the statistical office. Compliance with the high quality requirements of official statistics therefore sometimes cannot be verified. Since the transparency of the methods used plays a major role in official statistics (Footnote 8), this is a particular problem.

However, this problem exists for many new digital data sources and is addressed by Eurostat with the concept of Trusted Smart Statistics (Ricciato et al. 2019). The computing steps required to get from the raw data to the desired output are flexibly divided between the data owners and the statistical office. Two extreme cases are conceivable: on the one hand, the statistical office could acquire the raw data (or collect it through web scraping) and carry out all intermediate steps itself, as has been done in the previous web scraping applications in German official statistics. On the other hand, the data owners could bring the data into the desired output form and deliver only this output to the statistical office. Between these two extremes, it is conceivable that data owners carry out certain steps of data preparation and aggregation and provide the statistical office with an intermediate product. The execution of the programs that generate a particular output should be separated from their development; the latter should continue to be carried out under the responsibility of the statistical office in order to ensure compliance with quality and data protection principles and to make them transparent. Trusted Smart Statistics therefore continues to require a high level of methodological competence with regard to the respective data source on the part of the staff of the statistical office. With this concept, technological hurdles in the statistical offices could be bypassed and the long-standing experience of some commercial companies with web scraping could be built on. In order to realize this concept for web data, Eurostat is currently working on establishing a Web Intelligence Hub, which is to provide a uniform IT infrastructure for web scraping activities throughout the ESS (DIME/ITDG SG 2020). In the Web Intelligence Hub, part of the new Trusted Smart Statistics Centre to be set up, software components for the acquisition and analysis of data from the web as well as sufficient computing capacity are to be made available to the members of the ESS. The functionality of the Web Intelligence Hub is to be tested first in the application areas of online job advertisements and web scraping of company data before being extended to other areas of application.

This article has shown numerous ways of utilizing data from the web for official statistics in Germany: collecting data oneself through web scraping, using an API offered by website operators, using the Web Intelligence Hub planned by Eurostat, which is currently under development, and acquiring scraped data from external organizations, especially commercial companies. Which of these access routes can and should be used must be decided individually for each area of application.

Notes

  1. For example, see the press release of the Federal Statistical Office of May 11, 2020: https://www.destatis.de/DE/Presse/Pressemitteilungen/2020/05/PD20_163_52411.html.

  2. More information at https://webgate.ec.europa.eu/fpfis/mwikis/essnetbigdata/index.php/ESSnet_Big_Data.

  3. The abbreviation stands for “Centre européen pour le développement de la formation professionnelle”.

  4. Available at https://dashboards.cbs.nl/beta/experimenteel_SocialeSpanningen/ (in Dutch).

  5. See, for example, the implementation by the HSL: https://statistik.hessen.de/ua/.

  6. Example configuration: Intel Core i5, 2.4 GHz clock frequency, 2 cores, 16 GB RAM.

  7. See the courses “Introduction to Big Data in Official Statistics” and “Big Data Tools for Data Scientists” on the ESTP website (https://ec.europa.eu/eurostat/cros/content/estp-training-offer_en).

  8. See the European Statistics Code of Practice (https://ec.europa.eu/eurostat/de/web/products-catalogues/-/ks-02-18-142).

References

  1. Australian Government (2020) Vacancy report. Labour Market Information Portal. http://lmip.gov.au/default.aspx?LMIP/GainInsights/VacancyReport. Accessed March 30, 2020

  2. Azar JA, Marinescu I, Steinbaum MI, Taska B (2018) Concentration in US labor markets: evidence from online vacancy data. Working Paper No. 24395. National Bureau of Economic Research. https://doi.org/10.3386/w24395

  3. Barcaroli G, Scannapieco M, Summa D (2016) On the use of Internet as a data source for official statistics: a strategy for identifying enterprises on the web. Italian Rev Econ Demogr Stat 70(4):25–41

  4. Beręsewicz M, Lehtonen R, Reis F, Di Consiglio L, Karlberg M (2018) An overview of methods for treating selectivity in big data sources. Eurostat Statistical Working Paper. https://doi.org/10.2785/312232

  5. Blaudow C, Burg F (2018) Dynamic pricing as a challenge for consumer price statistics. WISTA 2/2018: 11–22

  6. Blaudow C, Ostermann H (2020) Development of a generic program for the use of web scraping in consumer price statistics. WISTA 5/2020: 103–113

  7. Brenzel H, Czepek J, Kubis A, Moczall A, Rebien M, Röttger C, Szameitat J, Warning A, Weber E (2016) New hires in 2015: Positions are often filled through personal contacts (IAB short report, p. 6). Institute for Employment Research. http://doku.iab.de/kurzber/2016/kb0416.pdf. Accessed March 25, 2020