What are the best books for Hadoop?

Big Data Analytics with Hadoop - book recommendation

Big data is one of the buzzwords of recent years and stands for the potential that ever-growing data volumes hold for business and science. This potential arises from capturing and collecting data from a wide variety of sources. However, it is only realized through data analysis, which is why people often say big data when they actually mean big data analytics.

The database model most frequently encountered in practice is the relational database, which stores data in linked table structures. Relational databases are not directly limited to a certain size (for a MySQL database, for example, only the operating system limits the file size), but large SQL queries often fail in ways that are difficult to understand. An SQL query that terminates after running for hours is hard to debug.

Apache Hadoop promises a remedy with its distributed file system (HDFS), a NoSQL approach, and the MapReduce algorithm, which also enables the analysis of unstructured data. HDFS makes it possible to store and analyze data collections in the petabyte range on commodity hardware. Expensive, backup-critical mainframes are therefore no longer required for evaluating such data volumes: MapReduce jobs run in parallel on different, spatially separated machines, and only the results are compiled and processed on a dedicated server. Data is kept redundantly across the distributed hardware, which (under certain conditions) gives such a data warehouse reliability.
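The map, shuffle, and reduce phases described above can be sketched in plain Python. This is a minimal, single-machine simulation of the word-count example commonly used to explain MapReduce; the function names are illustrative and not part of Hadoop's API:

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in an input line.
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as Hadoop does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: aggregate all values collected for one key.
    return (key, sum(values))

lines = ["big data needs big tools", "hadoop handles big data"]
mapped = [pair for line in lines for pair in map_phase(line)]
reduced = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(reduced["big"])  # → 3
```

In a real Hadoop cluster, the map and reduce calls run on different machines and the shuffle happens over the network; the logical flow, however, is exactly this.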

Hadoop is a free, open-source framework whose development is driven by the Apache Software Foundation. It is used by the companies pioneering big data today, for example Facebook, Yahoo!, Twitter, Amazon and Google. Facebook and Yahoo! in particular have contributed large parts of Hadoop, while Google's publications on GFS and MapReduce inspired its design.

Hadoop: Reliable, distributed and scalable big data applications

The book Hadoop - Reliable, distributed and scalable big data applications by the author Ramon Wartala offers a broad and deep insight into Hadoop and its modular ancillary systems:

  • Data flow languages
  • Column-oriented databases
  • Data serialization
    • Avro
    • Thrift
    • Google Protocol Buffer
    • Sqoop
  • Workflow systems
    • Azkaban
    • Oozie
    • Cascading
    • Hue
  • ZooKeeper
  • Mahout
  • Whirr

The book guides you in detail through the installation of Hadoop on a Linux system, through the first steps with Hadoop's distributed file system (HDFS), and through the implementation of MapReduce algorithms. The recommended development environment, Eclipse (with plug-in), is also covered adequately. Finally, the author gives tips on managing and monitoring MapReduce jobs and the Hadoop ecosystem, and presents four examples of Hadoop in practice. If you want to get hands-on with Hadoop, you can install it as a standalone application and simulate the data distribution, or rent Linux servers from a provider.

Hadoop essentially consists of Java code (Java being maintained by Oracle, formerly Sun Microsystems), so at least a basic knowledge of Java is necessary to delve deeper into Hadoop and to understand the source-code examples in the book. Knowledge of Linux (especially Ubuntu/Debian) is also an advantage. If you are not a fan of the Java programming language, you can use Hadoop Streaming to implement MapReduce functions in C, Ruby, PHP or Python; Hadoop offers a standard input/output interface for this.
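With Hadoop Streaming, the mapper and reducer are ordinary scripts that read lines from standard input and write tab-separated key/value pairs to standard output. The sketch below shows a minimal word-count pair in Python; in a real job the two roles would be separate scripts passed to the Streaming jar via its -mapper and -reducer options, and here both phases run in one process purely for demonstration:

```python
from itertools import groupby

def mapper(lines):
    # Mapper: emit "word<TAB>1" for each word; Hadoop Streaming reads
    # these key/value pairs from the script's standard output.
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reducer(lines):
    # Reducer: Hadoop's shuffle delivers input sorted by key, so lines
    # with the same word are consecutive and can be summed with groupby.
    parsed = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

if __name__ == "__main__":
    # Simulate the shuffle's sort step locally, then reduce.
    mapped = sorted(mapper(["big data", "big hadoop"]))
    for line in reducer(mapped):
        print(line)
```

The only Hadoop-specific convention here is the tab-separated line protocol; everything else is plain Python, which is exactly why Streaming makes Hadoop accessible from languages other than Java.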

Hadoop: The Definitive Guide

The community around Hadoop is almost entirely English-speaking, so English-language literature is recommended for a deeper insight into Hadoop, for example the book Hadoop: The Definitive Guide. For a first introduction to the Hadoop system, however, the above German-language book is absolutely recommended.
