Scaling Big Data with Hadoop and Solr

Learn exciting new ways to build efficient, high performance enterprise search repositories for Big Data using Hadoop and Solr

Hrishikesh Karambelkar

BIRMINGHAM - MUMBAI
Scaling Big Data with Hadoop and Solr

Copyright © 2013 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval
system, or transmitted in any form or by any means, without the prior written
permission of the publisher, except in the case of brief quotations embedded in
critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy
of the information presented. However, the information contained in this book is
sold without warranty, either express or implied. Neither the author, nor Packt
Publishing, and its dealers and distributors will be held liable for any damages
caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the
companies and products mentioned in this book by the appropriate use of capitals.
However, Packt Publishing cannot guarantee the accuracy of this information.

First published: August 2013

Production Reference: 1190813

Published by Packt Publishing Ltd.


Livery Place
35 Livery Street
Birmingham B3 2PB, UK.

ISBN 978-1-78328-137-4

www.packtpub.com

Cover Image by Prashant Timappa Shetty (sparkling.spectrum.123@gmail.com)


Credits

Author: Hrishikesh Karambelkar

Reviewer: Parvin Gasimzade

Acquisition Editor: Kartikey Pandey

Commissioning Editor: Shaon Basu

Technical Editors: Pratik More, Amit Ramadas, Shali Sasidharan

Project Coordinator: Akash Poojary

Proofreader: Lauren Harkins

Indexer: Tejal Soni

Graphics: Ronak Dhruv

Production Coordinator: Prachali Bhiwandkar

Cover Work: Prachali Bhiwandkar
About the Author

Hrishikesh Karambelkar is a software architect with a blend of entrepreneurial
and professional experience. His core expertise involves working with multiple
technologies such as Apache Hadoop and Solr, and architecting new solutions
for the next generation of a product line for his organization. He has published
research papers in the domain of graph searches in databases at various
international conferences. On a technical note, Hrishikesh has worked
on many challenging problems in the industry involving Apache Hadoop and Solr.

While writing this book, I spent my late nights and weekends
bringing in value for the readers. There were a few who stood
by me during good and bad times: my lovely wife, Dhanashree;
my younger brother, Rupesh; and my parents. I dedicate this book
to them. I would like to thank the Apache community users who
have added a lot of interesting content on this topic; without them,
I would not have had the opportunity to add new and interesting
information to this book.
About the Reviewer

Parvin Gasimzade is an MSc student in the Department of Computer Engineering
at Ozyegin University. He is also a Research Assistant and a member of the Cloud
Computing Research Group (CCRG) at Ozyegin University. He is currently working
on the Social Media Analysis as a Service concept. His research interests include
Cloud Computing, Big Data, Social and Data Mining, information retrieval, and
NoSQL databases. He received his BSc degree in Computer Engineering from
Bogazici University in 2009, where he mainly worked on web technologies and
distributed systems. He is also a professional Software Engineer with more than five
years of working experience. Currently, he works at the Inomera Research Company
as a Software Engineer. He can be contacted at parvin.gasimzade@gmail.com.
www.PacktPub.com

Support files, eBooks, discount offers and more


You might want to visit www.PacktPub.com for support files and downloads related
to your book.

Did you know that Packt offers eBook versions of every book published, with PDF
and ePub files available? You can upgrade to the eBook version at www.PacktPub.com
and as a print book customer, you are entitled to a discount on the eBook copy.
Get in touch with us at service@packtpub.com for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign
up for a range of free newsletters and receive exclusive discounts and offers on Packt
books and eBooks.

http://PacktLib.PacktPub.com

Do you need instant solutions to your IT questions? PacktLib is Packt's online


digital book library. Here, you can access, read and search across Packt's entire
library of books.

Why Subscribe?
• Fully searchable across every book published by Packt
• Copy and paste, print and bookmark content
• On demand and accessible via web browser

Free Access for Packt account holders


If you have an account with Packt at www.PacktPub.com, you can use this to access
PacktLib today and view nine entirely free books. Simply use your login credentials
for immediate access.
Table of Contents
Preface 1
Chapter 1: Processing Big Data Using Hadoop and MapReduce 7
Understanding Apache Hadoop and its ecosystem 9
The ecosystem of Apache Hadoop 9
Apache HBase 10
Apache Pig 11
Apache Hive 11
Apache ZooKeeper 11
Apache Mahout 11
Apache HCatalog 12
Apache Ambari 12
Apache Avro 12
Apache Sqoop 12
Apache Flume 13
Storing large data in HDFS 13
HDFS architecture 13
NameNode 14
DataNode 15
Secondary NameNode 15
Organizing data 16
Accessing HDFS 16
Creating MapReduce to analyze Hadoop data 18
MapReduce architecture 18
JobTracker 19
TaskTracker 20
Installing and running Hadoop 20
Prerequisites 21
Setting up SSH without passphrases 21
Installing Hadoop on machines 22
Hadoop configuration 22
Running a program on Hadoop 23


Managing a Hadoop cluster 24
Summary 25
Chapter 2: Understanding Solr 27
Installing Solr 28
Apache Solr architecture 29
Storage 29
Solr engine 30
The query parser 30
Interaction 33
Client APIs and SolrJ client 33
Other interfaces 33
Configuring Apache Solr search 33
Defining a Schema for your instance 34
Configuring a Solr instance 35
Configuration files 36
Request handlers and search components 38
Facet 40
MoreLikeThis 41
Highlight 41
SpellCheck 41
Metadata management 41
Loading your data for search 42
ExtractingRequestHandler/Solr Cell 43
SolrJ 43
Summary 44
Chapter 3: Making Big Data Work for Hadoop and Solr 45
The problem 45
Understanding data-processing workflows 46
The standalone machine 47
Distributed setup 47
The replicated mode 48
The sharded mode 48
Using Solr 1045 patch – map-side indexing 49
Benefits and drawbacks 50
Benefits 50
Drawbacks 50
Using Solr 1301 patch – reduce-side indexing 50
Benefits and drawbacks 52
Benefits 52
Drawbacks 52
Using SolrCloud for distributed search 53

SolrCloud architecture 53
Configuring SolrCloud 54
Using multicore Solr search on SolrCloud 56
Benefits and drawbacks 58
Benefits 58
Drawbacks 58
Using Katta for Big Data search (Solr-1395 patch) 59
Katta architecture 59
Configuring Katta cluster 60
Creating Katta indexes 60
Benefits and drawbacks 61
Benefits 61
Drawbacks 61
Summary 61
Chapter 4: Using Big Data to Build Your Large Indexing 63
Understanding the concept of NOSQL 63
The CAP theorem 64
What is a NOSQL database? 64
The key-value store or column store 65
The document-oriented store 66
The graph database 66
Why NOSQL databases for Big Data? 67
How Solr can be used for Big Data storage? 67
Understanding the concepts of distributed search 68
Distributed search architecture 68
Distributed search scenarios 69
Lily – running Solr and Hadoop together 70
The architecture 70
Write-ahead Logging 72
The message queue 72
Querying using Lily 72
Updating records using Lily 72
Installing and running Lily 73
Deep dive – shards and indexing data of Apache Solr 74
The sharding algorithm 75
Adding a document to the distributed shard 77
Configuring SolrCloud to work with large indexes 77
Setting up the ZooKeeper ensemble 78
Setting up the Apache Solr instance 79
Creating shards, collections, and replicas in SolrCloud 80
Summary 81


Chapter 5: Improving Performance of Search while Scaling with Big Data 83
Understanding the limits 84
Optimizing the search schema 85
Specifying the default search field 85
Configuring search schema fields 85
Stop words 86
Stemming 86
Index optimization 88
Limiting the indexing buffer size 89
When to commit changes? 89
Optimizing the index merge 91
Optimize an option for index merging 92
Optimizing the container 92
Optimizing concurrent clients 93
Optimizing the Java virtual memory 93
Optimizing the search runtime 95
Optimizing through search queries 95
Filter queries 95
Optimizing the Solr cache 96
The filter cache 97
The query result cache 97
The document cache 98
The field value cache 98
Lazy field loading 99
Optimizing search on Hadoop 99
Monitoring the Solr instance 100
Using SolrMeter 101
Summary 102
Appendix A: Use Cases for Big Data Search 103
E-commerce websites 103
Log management for banking 104
The problem 104
How can it be tackled? 105
High-level design 107


Appendix B: Creating Enterprise Search Using Apache Solr 109
schema.xml 109
solrconfig.xml 110
spellings.txt 113
synonyms.txt 114
protwords.txt 115
stopwords.txt 115
Appendix C: Sample MapReduce Programs to Build the Solr Indexes 117
The Solr-1045 patch – map program 118
The Solr-1301 patch – reduce-side indexing 119
Katta 120
Index 123

Preface
This book will provide users with a step-by-step guide to working with Big Data using
Hadoop and Solr. It starts with a basic understanding of Hadoop and Solr, and
gradually gets into building efficient, high performance enterprise search repositories
for Big Data.

You will learn various architectures and data workflows for a distributed search
system. In the later chapters, this book provides information about optimizing the
Big Data search instance while ensuring high availability and reliability.

This book later demonstrates two real-world use cases showing how Hadoop and Solr
can be used together for distributed enterprise search.

What this book covers


Chapter 1, Processing Big Data Using Hadoop and MapReduce, introduces you to
Apache Hadoop and its ecosystem, HDFS, and MapReduce. You will also learn
how to write MapReduce programs, configure a Hadoop cluster, work with the
configuration files, and administer your cluster.

Chapter 2, Understanding Solr, introduces you to Apache Solr. It explains how you
can configure the Solr instance, how to create indexes and load your data in the
Solr repository, and how you can use Solr effectively for searching. It also discusses
interesting features of Apache Solr.

Chapter 3, Making Big Data Work for Hadoop and Solr, brings the two worlds together;
it walks you through different approaches for making Big Data work with Hadoop
and Solr, along with their architectures, benefits, and applicability.

Chapter 4, Using Big Data to Build Your Large Indexing, explains NoSQL and the
concepts of distributed search. It then takes you through different algorithms
for Big Data search, covering shards and indexing. It also covers SolrCloud
configuration and Lily.

Chapter 5, Improving Performance of Search while Scaling with Big Data, covers different
levels of optimization that you can perform on your Big Data search instance as the
data keeps growing. It discusses different performance improvement techniques
which users can implement for their deployments.

Appendix A, Use Cases for Big Data Search, describes some industry use cases and
case studies for Big Data using Solr and Hadoop.

Appendix B, Creating Enterprise Search Using Apache Solr, shares a sample Solr
schema which can be used by the users for experimenting with Apache Solr.

Appendix C, Sample MapReduce Programs to Build the Solr Indexes, provides a sample
MapReduce program to build distributed Solr indexes for different approaches.

What you need for this book


This book discusses different approaches; each approach needs a different set
of software. To run an Apache Hadoop/Solr instance, you need:

• JDK 6
• Apache Hadoop
• Apache Solr 4.0 or above
• Patch sets, depending upon which setup you intend to run
• Katta (only if you are setting up Katta)
• Lily (only if you are setting up Lily)

Who this book is for


This book provides guidance for developers who wish to build a high-speed enterprise
search platform using Hadoop and Solr. It is primarily aimed at Java programmers
who wish to extend the Hadoop platform to run as an enterprise search engine,
without prior knowledge of Apache Hadoop and Solr.


Conventions
In this book, you will find a number of styles of text that distinguish between
different kinds of information. Here are some examples of these styles, and an
explanation of their meaning.

Code words in text are shown as follows: "You will typically find the
hadoop-example jar in /usr/share/hadoop, or in $HADOOP_HOME."

A block of code is set as follows:


public static class IndexReducer {
protected void setup(Context context) throws IOException,
InterruptedException {
super.setup(context);
SolrRecordWriter.addReducerContext(context);
}
}

When we wish to draw your attention to a particular part of a code block, the
relevant lines or items are set in bold:

A programming task is divided into multiple identical subtasks, and when it is


distributed among multiple machines for processing, it is called a map task. The
results of these map tasks are combined together into one or many reduce tasks.
Overall, this approach of computing tasks is called the MapReduce approach.

Any command-line input or output is written as follows:


java -Durl=http://node1:8983/solr/clusterCollection/update -jar
post.jar ipod_video.xml

New terms and important words are shown in bold. Words that you see on
the screen, in menus or dialog boxes for example, appear in the text like this:
"The admin UI will start showing the Cloud tab."

Warnings or important notes appear in a box like this.

Tips and tricks appear like this.


Reader feedback
Feedback from our readers is always welcome. Let us know what you think about
this book—what you liked or may have disliked. Reader feedback is important for
us to develop titles that you really get the most out of.

To send us general feedback, simply send an e-mail to feedback@packtpub.com,


and mention the book title via the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing
or contributing to a book, see our author guide on www.packtpub.com/authors.

Customer support
Now that you are the proud owner of a Packt book, we have a number of things
to help you to get the most from your purchase.

Downloading the example code


You can download the example code files for all Packt books you have purchased
from your account at http://www.packtpub.com. If you purchased this book
elsewhere, you can visit http://www.packtpub.com/support and register to
have the files e-mailed directly to you.

Errata
Although we have taken every care to ensure the accuracy of our content, mistakes
do happen. If you find a mistake in one of our books—maybe a mistake in the text or
the code—we would be grateful if you would report this to us. By doing so, you can
save other readers from frustration and help us improve subsequent versions of this
book. If you find any errata, please report them by visiting
http://www.packtpub.com/submit-errata, selecting your book, clicking on the errata submission form link,
and entering the details of your errata. Once your errata are verified, your submission
will be accepted and the errata will be uploaded on our website, or added to any list of
existing errata, under the Errata section of that title. Any existing errata can be viewed
by selecting your title from http://www.packtpub.com/support.


Piracy
Piracy of copyright material on the Internet is an ongoing problem across all media.
At Packt, we take the protection of our copyright and licenses very seriously. If you
come across any illegal copies of our works, in any form, on the Internet, please
provide us with the location address or website name immediately so that we can
pursue a remedy.

Please contact us at copyright@packtpub.com with a link to the suspected


pirated material.

We appreciate your help in protecting our authors, and our ability to bring you
valuable content.

Questions
You can contact us at questions@packtpub.com if you are having a problem
with any aspect of the book, and we will do our best to address it.

Processing Big Data Using Hadoop and MapReduce
Traditionally, computation has been processor driven. As the data grew, the industry
focused on increasing processor speed and memory to get better performance
for computation. This gave birth to distributed systems. In today's world,
different applications create hundreds and thousands of gigabytes of data every day.
This data comes from disparate sources such as application software, sensors, social
media, mobile devices, logs, and so on. Such huge data is difficult to operate upon
using the standard software available for data processing, mainly because the data
size grows exponentially with time. Traditional distributed systems were not
sufficient to manage this data, and there was a need for modern systems that could
handle heavy data loads with scalability and high availability. Data of this kind is
called Big Data.

Big data is usually associated with high volumes of heavily growing data with
unpredictable content. The video gaming industry, for example, needs to predict
performance over 500 GB of data structures and analyze over 4 TB of operational
logs every day; many gaming companies use Big Data based technologies to do so.
The IT advisory firm Gartner defines big data using 3Vs (high volume of data, high
velocity of processing speed, and high variety of information). IBM added a fourth V
(high veracity) to its definition to make sure the data is accurate and helps you make
your business decisions.

While the potential benefits of big data are real and significant, there remain many
challenges. So, organizations which deal with such high volumes of data face the
following problems:

• Data acquisition: There is a lot of raw data that gets generated from various
data sources. The challenge is to filter and compress the data, and to extract
the information from it once it is cleaned.
• Information storage and organization: Once the information is captured
from raw data, a data model is created and stored in a storage device.
Traditional relational systems stop being effective when storing huge
datasets at such a high scale. A new breed of databases, called NOSQL
databases, has emerged to work with big data; NOSQL databases are
non-relational databases.
• Information search and analytics: Storing data is only a part of building a
warehouse. Data is useful only when it is computed. Big data is often noisy,
dynamic, and heterogeneous. This information is searched, mined, and
analyzed for behavioral modeling.
• Data security and privacy: While bringing in linked data from multiple
sources, organizations need to worry most about data security
and privacy.

Big data poses many challenges to the technologies in use today. It requires
processing large quantities of data within a finite timeframe, which brings in
technologies such as massively parallel processing (MPP) and distributed
file systems.

Big data is catching more and more attention from various organizations, and many
of them have already started exploring it. Recently, Gartner (http://www.gartner.com/newsroom/id/2304615)
published an executive program survey report, which
reveals that Big Data and analytics are among the top 10 business priorities for CIOs.
Similarly, analytics and BI top the list of CIOs' technical priorities. We
will try to understand Apache Hadoop in this chapter. We will cover the following:

• Understanding Apache Hadoop and its ecosystem


• Storing large data in HDFS
• Creating MapReduce to analyze the Hadoop data
• Installing and running Hadoop
• Managing and viewing a Hadoop cluster
• Administration tools


Understanding Apache Hadoop and its ecosystem
Google faced the problem of storing and processing big data, and they came up
with the MapReduce approach, which is basically a divide-and-conquer strategy for
distributed data processing.

A programming task which is divided into multiple identical subtasks,
and which is distributed among multiple machines for processing, is
called a map task. The results of these map tasks are combined
together into one or many reduce tasks. Overall, this approach of
computing tasks is called the MapReduce approach.
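
To make this concrete, the following is a minimal sketch of a word-count job written
against the Hadoop MapReduce Java API; the class and field names are illustrative
only. The map task emits a (word, 1) pair for every word it reads, and the reduce
task sums those counts:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // The map task: emits a (word, 1) pair for every word in its input split.
  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (token.length() > 0) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // The reduce task: sums the counts produced by all map tasks for each word.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable value : values) {
        sum += value.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }
}

Many such map tasks run in parallel across the cluster, one per input split, and the
framework groups their output by key before handing it to the reduce tasks.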

MapReduce is widely accepted by many organizations to run their Big Data
computations. Apache Hadoop is the most popular open source, Apache-licensed
implementation of MapReduce. Apache Hadoop is based on the work done by
Google in the early 2000s, more specifically on papers describing the Google File
System, published in 2003, and MapReduce, published in 2004. Apache Hadoop
enables distributed processing of large datasets across clusters of commodity
servers. It is designed to scale up from a single server to thousands of commodity
hardware machines, each offering partial computational units and data storage.

Apache Hadoop mainly consists of two major components:

• The Hadoop Distributed File System (HDFS)


• The MapReduce software framework

HDFS is responsible for storing the data in a distributed manner across multiple
Hadoop cluster nodes. The MapReduce framework provides rich computational
APIs for developers to code, which eventually run as map and reduce tasks on the
Hadoop cluster.
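
To illustrate the HDFS side, the following sketch uses the Hadoop FileSystem API to
write a small file into HDFS and read it back; the path and contents are just examples,
and the cluster location is assumed to come from the core-site.xml on the classpath:

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
  public static void main(String[] args) throws Exception {
    // Reads the HDFS location from the Hadoop configuration files
    // (core-site.xml) found on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Write a small file into HDFS; the path is an example.
    Path file = new Path("/user/demo/sample.txt");
    FSDataOutputStream out = fs.create(file);
    out.write("hello hdfs\n".getBytes("UTF-8"));
    out.close();

    // Read the same file back.
    BufferedReader reader =
        new BufferedReader(new InputStreamReader(fs.open(file)));
    System.out.println(reader.readLine());
    reader.close();
  }
}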

The ecosystem of Apache Hadoop


Understanding the Apache Hadoop ecosystem enables us to effectively apply the
concepts of the MapReduce paradigm to different requirements. It also provides
end-to-end solutions to various problems that we face every day.


The Apache Hadoop ecosystem is vast. It has grown drastically over time as
different organizations have contributed to this open source initiative. Thanks to
this huge ecosystem, it meets the needs of different organizations for high
performance analytics. To understand the ecosystem, let's look at the following diagram:

[Diagram: The Apache Hadoop ecosystem. Mahout, Pig, Hive, and Flume/Sqoop sit alongside HBase, MapReduce, and HCatalog on top of the Hadoop Distributed File System (HDFS), with ZooKeeper, Ambari, and Avro supporting the stack.]

Apache Hadoop ecosystem consists of the following major components:


• Core Hadoop framework: HDFS and MapReduce
• Metadata management: HCatalog
• Data storage and querying: HBase, Hive, and Pig
• Data import/export: Flume, Sqoop
• Analytics and machine learning: Mahout
• Distributed coordination: Zookeeper
• Cluster management: Ambari
• Data storage and serialization: Avro

Apache HBase
HDFS is an append-only file system; it does not allow data modification. Apache HBase
is a distributed, random-access, and column-oriented database. HBase runs directly
on top of HDFS, and it allows application developers to read/write the HDFS data
directly. HBase does not support SQL; hence, it is also called a NOSQL database.
However, it provides a command-line interface, as well as a rich set of APIs to
update the data. The data in HBase gets stored as key-value pairs in HDFS.
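
A minimal sketch of the HBase Java client is shown below. It assumes a users table
with an info column family already exists, and it uses the classic HTable client API
that shipped with HBase releases contemporary with this book; later HBase versions
replaced this API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();

    // The "users" table with an "info" column family is assumed to exist.
    HTable table = new HTable(conf, "users");

    // Write one cell: row key -> column family:qualifier -> value.
    Put put = new Put(Bytes.toBytes("user-1001"));
    put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
    table.put(put);

    // Random-access read of the same row.
    Result result = table.get(new Get(Bytes.toBytes("user-1001")));
    byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
    System.out.println(Bytes.toString(name));

    table.close();
  }
}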


Apache Pig
Apache Pig provides another abstraction layer on top of MapReduce. It provides
a language called Pig Latin: a high-level language in which developers write data
analysis programs that Pig turns into MapReduce jobs. Pig code generates parallel
execution tasks, and therefore effectively uses the distributed Hadoop cluster. Pig
was initially developed at Yahoo! Research to enable developers to create ad-hoc
MapReduce jobs for Hadoop. Since then, many big organizations such as eBay,
LinkedIn, and Twitter have started using Apache Pig.
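
For illustration, the following sketch embeds a couple of hypothetical Pig Latin
statements in a Java program using the PigServer API; the input path, field names,
and filter condition are assumptions made for this example:

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
  public static void main(String[] args) throws Exception {
    // Run the Pig Latin statements as MapReduce jobs on the cluster;
    // ExecType.LOCAL can be used instead for quick local testing.
    PigServer pig = new PigServer(ExecType.MAPREDUCE);

    // Load a comma-separated log file from HDFS, keep only the error lines,
    // and store the result back into HDFS.
    pig.registerQuery("logs = LOAD '/data/app.log' USING PigStorage(',') "
        + "AS (level:chararray, message:chararray);");
    pig.registerQuery("errors = FILTER logs BY level == 'ERROR';");
    pig.store("errors", "/data/errors");

    pig.shutdown();
  }
}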

Apache Hive
Apache Hive provides data warehouse capabilities for Big Data. Hive runs on
top of Apache Hadoop and uses HDFS for storing its data. The Apache Hadoop
framework is difficult to understand, and it requires a different approach from
traditional programming to write MapReduce-based programs. With Hive,
developers do not write MapReduce at all. Hive provides a SQL-like query language
called HiveQL to application developers, enabling them to quickly write ad-hoc
queries similar to RDBMS SQL queries.
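
As a sketch, the following Java snippet runs an ad-hoc HiveQL query over JDBC.
It assumes a HiveServer2 instance on localhost:10000 and a hypothetical access_logs
table; older Hive releases use a different driver class and a jdbc:hive:// URL:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
  public static void main(String[] args) throws Exception {
    // HiveServer2 JDBC driver and connection URL.
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    Connection con = DriverManager.getConnection(
        "jdbc:hive2://localhost:10000/default", "", "");

    Statement stmt = con.createStatement();
    // An ad-hoc HiveQL query; the access_logs table is assumed to exist.
    ResultSet rs = stmt.executeQuery(
        "SELECT page, COUNT(*) AS hits FROM access_logs GROUP BY page");
    while (rs.next()) {
      System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
    }
    con.close();
  }
}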

Apache ZooKeeper
Apache Hadoop nodes communicate with each other through Apache ZooKeeper.
It forms a mandatory part of the Apache Hadoop ecosystem. Apache ZooKeeper is
responsible for maintaining coordination among various nodes. Besides coordinating
among nodes, it also maintains configuration information and provides group services
to the distributed system. Apache ZooKeeper can be used independently of Hadoop,
unlike other components of the ecosystem. Due to its in-memory management of
information, it offers distributed coordination at high speed.
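
The following sketch shows the ZooKeeper Java client publishing and reading a small
piece of shared configuration; the connection string, znode path, and data are
placeholders, and production code would wait for the connection event before issuing
requests:

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperExample {
  public static void main(String[] args) throws Exception {
    // Connect to a ZooKeeper server with a 3-second session timeout.
    ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, new Watcher() {
      public void process(WatchedEvent event) {
        // Connection and node events arrive here.
      }
    });

    // Publish a piece of shared configuration as a persistent znode.
    zk.create("/app-config", "v1".getBytes(),
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

    // Any node in the cluster can now read the same value.
    byte[] data = zk.getData("/app-config", false, null);
    System.out.println(new String(data));

    zk.close();
  }
}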

Apache Mahout
Apache Mahout is an open source machine learning software library that can
effectively empower Hadoop users with analytical capabilities, such as clustering
and data mining, over a distributed Hadoop cluster. Mahout is highly effective over
large datasets; the algorithms it provides are highly optimized to run on the
MapReduce framework over HDFS.


Apache HCatalog
Apache HCatalog provides metadata management services on top of Apache
Hadoop. This means all the software that runs on Hadoop can effectively use HCatalog
to store its schemas in HDFS. HCatalog helps any third-party software to create,
edit, and expose (using REST APIs) the generated metadata or table definitions. So,
any user or script can run Hadoop effectively without actually knowing where
the data is physically stored in HDFS. HCatalog provides DDL (Data Definition
Language) commands with which the requested MapReduce, Pig, and Hive jobs can
be queued for execution and later monitored for progress as and when required.

Apache Ambari
Apache Ambari provides a set of tools to monitor an Apache Hadoop cluster, hiding
the complexities of the Hadoop framework. It offers features such as an installation
wizard, system alerts and metrics, provisioning and management of the Hadoop
cluster, job performance, and so on. Ambari exposes RESTful APIs for administrators
to allow integration with any other software.

Apache Avro
Since Hadoop deals with large datasets, it becomes very important to process and
store the data optimally and effectively on disk. This large data should be
efficiently organized so that different programming languages can read large
datasets; Apache Avro helps you do that. Avro effectively provides data
compression and storage at the various nodes of Apache Hadoop. Avro-based
stores can easily be read using scripting languages as well as Java. Avro provides
dynamic access to data, which in turn allows software to access any arbitrary data
dynamically. Avro can be effectively used in the Apache Hadoop MapReduce
framework for data serialization.
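
A minimal sketch of Avro's Java API is shown below; the record schema and field
values are invented for this example, and real projects usually keep the schema in a
separate .avsc file:

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroExample {
  public static void main(String[] args) throws Exception {
    // A small record schema defined inline for the example.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
        + "{\"name\":\"name\",\"type\":\"string\"},"
        + "{\"name\":\"age\",\"type\":\"int\"}]}");

    GenericRecord user = new GenericData.Record(schema);
    user.put("name", "Alice");
    user.put("age", 30);

    // Write the record to a compact, self-describing Avro container file.
    DataFileWriter<GenericRecord> writer = new DataFileWriter<GenericRecord>(
        new GenericDatumWriter<GenericRecord>(schema));
    writer.create(schema, new File("users.avro"));
    writer.append(user);
    writer.close();
  }
}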

Apache Sqoop
Apache Sqoop is a tool designed to load large datasets into Hadoop efficiently.
Apache Sqoop allows application developers to easily import/export data from
specific data sources such as relational databases, enterprise data warehouses, and
custom applications. Apache Sqoop internally uses map tasks to perform the data
import/export effectively on the Hadoop cluster. Each mapper loads/unloads a slice
of data between HDFS and the data source. Apache Sqoop establishes connectivity
between non-Hadoop data sources and HDFS.
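
For example, a typical Sqoop import can be invoked from the command line as
follows; the JDBC URL, table name, and target directory are placeholders for this
sketch:

sqoop import --connect jdbc:mysql://dbserver/sales --table orders \
  --target-dir /data/sales/orders --num-mappers 4

Here, four map tasks would pull slices of the orders table in parallel and write them
into the given HDFS directory.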
