Scaling Big Data with
Hadoop and Solr
Hrishikesh Karambelkar
BIRMINGHAM - MUMBAI
Scaling Big Data with Hadoop and Solr
All rights reserved. No part of this book may be reproduced, stored in a retrieval
system, or transmitted in any form or by any means, without the prior written
permission of the publisher, except in the case of brief quotations embedded in
critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy
of the information presented. However, the information contained in this book is
sold without warranty, either express or implied. Neither the author, nor Packt
Publishing, nor its dealers and distributors will be held liable for any damages
caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the
companies and products mentioned in this book by the appropriate use of capitals.
However, Packt Publishing cannot guarantee the accuracy of this information.
ISBN 978-1-78328-137-4
www.packtpub.com
Reviewer
Parvin Gasimzade

Proofreader
Lauren Harkins
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at service@packtpub.com for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign
up for a range of free newsletters and receive exclusive discounts and offers on Packt
books and eBooks.
http://PacktLib.PacktPub.com
Why Subscribe?
• Fully searchable across every book published by Packt
• Copy and paste, print and bookmark content
• On demand and accessible via web browser
Table of Contents
SolrCloud architecture 53
Configuring SolrCloud 54
Using multicore Solr search on SolrCloud 56
Benefits and drawbacks 58
Benefits 58
Drawbacks 58
Using Katta for Big Data search (Solr-1395 patch) 59
Katta architecture 59
Configuring Katta cluster 60
Creating Katta indexes 60
Benefits and drawbacks 61
Benefits 61
Drawbacks 61
Summary 61
Chapter 4: Using Big Data to Build Your Large Indexing 63
Understanding the concept of NOSQL 63
The CAP theorem 64
What is a NOSQL database? 64
The key-value store or column store 65
The document-oriented store 66
The graph database 66
Why NOSQL databases for Big Data? 67
How Solr can be used for Big Data storage? 67
Understanding the concepts of distributed search 68
Distributed search architecture 68
Distributed search scenarios 69
Lily – running Solr and Hadoop together 70
The architecture 70
Write-ahead Logging 72
The message queue 72
Querying using Lily 72
Updating records using Lily 72
Installing and running Lily 73
Deep dive – shards and indexing data of Apache Solr 74
The sharding algorithm 75
Adding a document to the distributed shard 77
Configuring SolrCloud to work with large indexes 77
Setting up the ZooKeeper ensemble 78
Setting up the Apache Solr instance 79
Creating shards, collections, and replicas in SolrCloud 80
Summary 81
Preface
This book will provide users with a step-by-step guide to working with Big Data using Hadoop and Solr. It starts with a basic understanding of Hadoop and Solr, and gradually gets into building an efficient, high-performance enterprise search repository for Big Data.

You will learn about various architectures and data workflows for a distributed search system. In the later chapters, this book provides information about optimizing the Big Data search instance while ensuring high availability and reliability.

Later, this book demonstrates two real-world use cases showing how Hadoop and Solr can be used together for distributed enterprise search.
Chapter 2, Understanding Solr, introduces you to Apache Solr. It explains how you
can configure the Solr instance, how to create indexes and load your data in the
Solr repository, and how you can use Solr effectively for searching. It also discusses
interesting features of Apache Solr.
Chapter 3, Making Big Data Work for Hadoop and Solr, brings the two worlds together; it takes you through different approaches for making Big Data search work, along with their architectures, benefits, and applicability.
Chapter 4, Using Big Data to Build Your Large Indexing, explains NoSQL and the concepts of distributed search. It then takes you through the different algorithms used for Big Data search, covering shards and indexing. It also talks about SolrCloud configuration and Lily.
Chapter 5, Improving Performance of Search while Scaling with Big Data, covers the different levels of optimization that you can perform on your Big Data search instance as the data keeps growing. It discusses different performance improvement techniques which users can implement for their deployments.
Appendix A, Use Cases for Big Data Search, describes some industry use cases and
case studies for Big Data using Solr and Hadoop.
Appendix B, Creating Enterprise Search Using Apache Solr, shares a sample Solr schema which users can use for experimenting with Apache Solr.
Appendix C, Sample MapReduce Programs to Build the Solr Indexes, provides a sample
MapReduce program to build distributed Solr indexes for different approaches.
What you need for this book
• JDK 6
• Apache Hadoop
• Apache Solr 4.0 or above
• Patch sets, depending upon which setup you intend to run
• Katta (only if you are setting up Katta)
• Lily (only if you are setting up Lily)
Conventions
In this book, you will find a number of styles of text that distinguish between
different kinds of information. Here are some examples of these styles, and an
explanation of their meaning.
Code words in text are shown as follows: "You will typically find the
hadoop-example jar in /usr/share/hadoop, or in $HADOOP_HOME."
When we wish to draw your attention to a particular part of a code block, the
relevant lines or items are set in bold:
New terms and important words are shown in bold. Words that you see on
the screen, in menus or dialog boxes for example, appear in the text like this:
"The admin UI will start showing the Cloud tab."
Reader feedback
Feedback from our readers is always welcome. Let us know what you think about
this book—what you liked or may have disliked. Reader feedback is important for
us to develop titles that you really get the most out of.
If there is a topic that you have expertise in and you are interested in either writing
or contributing to a book, see our author guide on www.packtpub.com/authors.
Customer support
Now that you are the proud owner of a Packt book, we have a number of things
to help you to get the most from your purchase.
Errata
Although we have taken every care to ensure the accuracy of our content, mistakes
do happen. If you find a mistake in one of our books—maybe a mistake in the text or
the code—we would be grateful if you would report this to us. By doing so, you can
save other readers from frustration and help us improve subsequent versions of this
book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the errata submission form link,
and entering the details of your errata. Once your errata are verified, your submission
will be accepted and the errata will be uploaded on our website, or added to any list of
existing errata, under the Errata section of that title. Any existing errata can be viewed
by selecting your title from http://www.packtpub.com/support.
Piracy
Piracy of copyright material on the Internet is an ongoing problem across all media.
At Packt, we take the protection of our copyright and licenses very seriously. If you
come across any illegal copies of our works, in any form, on the Internet, please
provide us with the location address or website name immediately so that we can
pursue a remedy.
We appreciate your help in protecting our authors, and our ability to bring you
valuable content.
Questions
You can contact us at questions@packtpub.com if you are having a problem
with any aspect of the book, and we will do our best to address it.
Processing Big Data Using
Hadoop and MapReduce
Traditionally, computation has been processor-driven. As data grew, the industry focused on increasing processor speed and memory to get better computational performance. This gave birth to distributed systems. In today's world, different applications create hundreds or thousands of gigabytes of data every day. This data comes from disparate sources such as application software, sensors, social media, mobile devices, logs, and so on. Such huge volumes of data are difficult to process using the standard software available for data processing, mainly because the data size grows exponentially with time. Traditional distributed systems were not sufficient to manage this data, and there was a need for modern systems that could handle heavy data loads with scalability and high availability. This kind of data is called Big Data.
Big Data is usually associated with high-volume, heavily growing data with unpredictable content. For example, the video gaming industry needs to predict performance over 500 GB of structured data and analyze over 4 TB of operational logs every day; many gaming companies use Big Data based technologies to do so. The IT advisory firm Gartner defines Big Data using three Vs: high volume of data, high velocity of processing, and high variety of information. IBM added a fourth V (veracity) to its definition, to ensure that the data is accurate and helps you make sound business decisions.
While the potential benefits of Big Data are real and significant, many challenges remain. Organizations that deal with such high volumes of data face the following problems:
• Data acquisition: There is a lot of raw data that gets generated from various data sources. The challenge is to filter and compress this data, and extract the information out of it once it is cleaned.
• Information storage and organization: Once the information is captured out of the raw data, a data model is created and stored on a storage device. When it comes to storing huge datasets effectively, traditional relational systems stop being effective at such a high scale. A new breed of databases called NOSQL databases has emerged, which are mainly used to work with Big Data. NOSQL databases are non-relational databases.
• Information search and analytics: Storing data is only a part of building a
warehouse. Data is useful only when it is computed. Big data is often noisy,
dynamic, and heterogeneous. This information is searched, mined, and
analyzed for behavioral modeling.
• Data security and privacy: While bringing in linked data from multiple sources, organizations need to worry about data security and privacy the most.
Big Data poses a lot of challenges to the technologies in use today. It requires processing large quantities of data within a finite timeframe, which brings in technologies such as massively parallel processing (MPP) and distributed file systems.
Big Data is catching more and more attention from various organizations, and many of them have already started exploring it. Recently, Gartner (http://www.gartner.com/newsroom/id/2304615) published an executive program survey report, which reveals that Big Data and analytics are among the top 10 business priorities for CIOs. Similarly, analytics and BI stand at the top of CIOs' technical priorities. We will try to understand Apache Hadoop in this chapter. We will cover the following:
HDFS is responsible for storing data in a distributed manner across multiple Hadoop cluster nodes. The MapReduce framework provides rich computational APIs for developers to code against; the resulting code eventually runs as map and reduce tasks on the Hadoop cluster.
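
To make this division of labour concrete, the following is a minimal word-count sketch written against the newer org.apache.hadoop.mapreduce API (Hadoop 2.x style); the class name and the input/output paths are illustrative only and are not taken from this book's examples. The mapper emits a (word, 1) pair for every token it sees, and the reducer sums the counts for each word.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);   // emit (word, 1) for every token in the line
        }
      }
    }
  }

  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();               // add up all the 1s for this word
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a jar, such a job is typically submitted with hadoop jar wordcount.jar WordCount <input> <output>, and the framework takes care of distributing the map and reduce tasks across the cluster.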
The Apache Hadoop ecosystem is vast. It has grown drastically over time, as different organizations have contributed to this open source initiative. Thanks to this huge ecosystem, it meets the needs of different organizations for high-performance analytics. To understand the ecosystem, let's look at the following diagram:
[Diagram: the Apache Hadoop ecosystem, showing components such as Flume/Sqoop, Mahout, Pig, Hive, ZooKeeper, Ambari, Avro, HBase, MapReduce, and HCatalog]
Apache HBase
HDFS is an append-only file system; it does not allow data modification. Apache HBase is a distributed, random-access, column-oriented database. HBase runs directly on top of HDFS, and it allows application developers to read/write the HDFS data directly. HBase does not support SQL; hence, it is also called a NOSQL database. However, it provides a command-line interface as well as a rich set of APIs to update the data. The data in HBase is stored as key-value pairs in HDFS.
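
As a rough sketch of this key-value model, the following snippet writes and reads a single cell through the HBase 1.x (or later) Java client API; the table name users and the column family info are hypothetical and would need to exist, for example created through the HBase shell, before this runs.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseKeyValueExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("users"))) {
      // Write one cell: row key "user100", column family "info", qualifier "email"
      Put put = new Put(Bytes.toBytes("user100"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"),
          Bytes.toBytes("someone@example.com"));
      table.put(put);

      // Read the same cell back by row key
      Result result = table.get(new Get(Bytes.toBytes("user100")));
      byte[] email = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"));
      System.out.println("email = " + Bytes.toString(email));
    }
  }
}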
Apache Pig
Apache Pig provides another abstraction layer on top of MapReduce. It offers Pig Latin, a high-level language in which developers write data analysis programs; Pig compiles these programs into MapReduce jobs. Pig code generates parallel execution tasks, and therefore uses the distributed Hadoop cluster effectively. Pig was initially developed at Yahoo! Research to enable developers to create ad-hoc MapReduce jobs for Hadoop. Since then, many big organizations such as eBay, LinkedIn, and Twitter have started using Apache Pig.
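
Pig Latin scripts are usually run through the pig command-line tool, but they can also be driven from Java using the PigServer class. The sketch below is a hedged word-count example; the input and output HDFS paths are placeholders.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigWordCount {
  public static void main(String[] args) throws Exception {
    // ExecType.MAPREDUCE submits the generated jobs to the Hadoop cluster;
    // use ExecType.LOCAL to test on a single machine.
    PigServer pig = new PigServer(ExecType.MAPREDUCE);
    pig.registerQuery("lines = LOAD '/data/input' AS (line:chararray);");
    pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
    pig.registerQuery("grouped = GROUP words BY word;");
    pig.registerQuery("counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS total;");
    pig.store("counts", "/data/wordcount-output");   // triggers compilation and execution
  }
}

Each registerQuery call only builds up the logical plan; it is the store call that makes Pig compile the script into MapReduce jobs and run them on the cluster.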
Apache Hive
Apache Hive provides data warehouse capabilities on top of Big Data. Hive runs on top of Apache Hadoop and uses HDFS for storing its data. The Apache Hadoop framework is difficult to understand, and writing MapReduce-based programs requires a different approach from traditional programming. With Hive, developers do not need to write MapReduce code at all. Hive provides a SQL-like query language called HiveQL to application developers, enabling them to quickly write ad-hoc queries similar to RDBMS SQL queries.
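
HiveQL is most often typed into the Hive shell, but applications can also submit it through Hive's JDBC driver. The following is a small sketch, assuming a HiveServer2 instance at localhost:10000 and a hypothetical access_logs table; adjust the URL, credentials, and schema for a real deployment.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQlExample {
  public static void main(String[] args) throws Exception {
    // Register the HiveServer2 JDBC driver and connect (host, port, and database are assumptions)
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection con = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "hive", "");
         Statement stmt = con.createStatement()) {
      stmt.execute("CREATE TABLE IF NOT EXISTS access_logs (ip STRING, url STRING, ts STRING) "
          + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'");
      // Hive compiles this ad-hoc query into one or more MapReduce jobs behind the scenes
      ResultSet rs = stmt.executeQuery("SELECT url, COUNT(*) AS hits FROM access_logs GROUP BY url");
      while (rs.next()) {
        System.out.println(rs.getString("url") + " -> " + rs.getLong("hits"));
      }
    }
  }
}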
Apache ZooKeeper
Apache Hadoop nodes communicate with each other through Apache ZooKeeper. It forms a mandatory part of the Apache Hadoop ecosystem. Apache ZooKeeper is responsible for maintaining coordination among the various nodes. Besides coordinating nodes, it also maintains configuration information and provides group services to the distributed system. Unlike the other components of the ecosystem, Apache ZooKeeper can be used independently of Hadoop. Because it manages its information in memory, it offers distributed coordination at high speed.
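
The following sketch shows the kind of coordination primitive ZooKeeper exposes through its Java client: a small piece of configuration stored in a znode that every node in the cluster can read. The connection string and the znode path are assumptions made for illustration.

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperConfigExample {
  public static void main(String[] args) throws Exception {
    CountDownLatch connected = new CountDownLatch(1);
    // Connect to a (hypothetical) ensemble member; the watcher fires once the session is established
    ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {
      if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
        connected.countDown();
      }
    });
    connected.await();

    // Store a small piece of shared configuration as a persistent znode
    if (zk.exists("/app-config", false) == null) {
      zk.create("/app-config", "batch.size=500".getBytes(),
          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }
    // Any process connected to the ensemble can now read the same value
    byte[] data = zk.getData("/app-config", false, null);
    System.out.println(new String(data));
    zk.close();
  }
}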
Apache Mahout
Apache Mahout is an open source machine learning library that can effectively empower Hadoop users with analytical capabilities, such as clustering and data mining, over a distributed Hadoop cluster. Mahout is highly effective over large datasets; the algorithms it provides are highly optimized to run as MapReduce jobs over HDFS.
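
As a taste of Mahout's APIs, the sketch below builds a simple user-based recommender from Mahout's Taste library; it runs on a single machine against a hypothetical ratings.csv file (lines of the form userID,itemID,rating). Mahout also ships MapReduce-based implementations of many of its algorithms for running at Hadoop scale.

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class MahoutRecommenderExample {
  public static void main(String[] args) throws Exception {
    // ratings.csv (hypothetical) holds lines of the form: userID,itemID,rating
    DataModel model = new FileDataModel(new File("ratings.csv"));
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

    // Top three item recommendations for user 42
    List<RecommendedItem> items = recommender.recommend(42, 3);
    for (RecommendedItem item : items) {
      System.out.println(item.getItemID() + " scored " + item.getValue());
    }
  }
}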
Apache HCatalog
Apache HCatalog provides metadata management services on top of Apache Hadoop. This means that all the software that runs on Hadoop can effectively use HCatalog to store its schemas in HDFS. HCatalog helps any third-party software to create, edit, and expose (using REST APIs) the generated metadata or table definitions. So, any user or script can run Hadoop effectively without actually knowing where the data is physically stored in HDFS. HCatalog provides DDL (Data Definition Language) commands with which the requested MapReduce, Pig, and Hive jobs can be queued for execution and later monitored for progress as and when required.
Apache Ambari
Apache Ambari provides a set of tools to monitor an Apache Hadoop cluster while hiding the complexities of the Hadoop framework. It offers features such as an installation wizard, system alerts and metrics, provisioning and management of the Hadoop cluster, job performance monitoring, and so on. Ambari exposes RESTful APIs that allow administrators to integrate it with any other software.
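
For example, a monitoring script can list the clusters an Ambari server manages with a plain HTTP call to its REST API. In this hedged sketch the host, port, and admin credentials are placeholders; the /api/v1/clusters endpoint returns a JSON listing.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Base64;

public class AmbariClustersExample {
  public static void main(String[] args) throws Exception {
    // Host, port, and credentials are assumptions; adjust them for your Ambari server
    URL url = new URL("http://ambari-host:8080/api/v1/clusters");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    String auth = Base64.getEncoder().encodeToString("admin:admin".getBytes("UTF-8"));
    conn.setRequestProperty("Authorization", "Basic " + auth);
    conn.setRequestProperty("X-Requested-By", "ambari");   // required by Ambari on modifying calls, harmless on GET

    try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line);   // JSON description of the managed clusters
      }
    }
  }
}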
Apache Avro
Since Hadoop deals with large datasets, it becomes very important to process and store the data optimally on disk. This large data should be organized efficiently so that different programming languages can read the datasets; Apache Avro helps you do that. Avro effectively provides data compression and storage at the various nodes of Apache Hadoop. Avro-based stores can easily be read using scripting languages as well as Java. Avro provides dynamic access to data, which in turn allows software to access any arbitrary data dynamically. Avro can be effectively used in the Apache Hadoop MapReduce framework for data serialization.
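
The short sketch below shows Avro's generic API writing a record to an Avro container file and reading it back; the LogEvent schema and the events.avro filename are made up for illustration. Because the schema travels inside the file, any Avro-capable language can read it later without extra metadata.

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroSerializationExample {
  public static void main(String[] args) throws Exception {
    // A small schema defined inline; real projects usually keep .avsc files alongside the code
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"LogEvent\",\"fields\":["
      + "{\"name\":\"host\",\"type\":\"string\"},"
      + "{\"name\":\"bytes\",\"type\":\"long\"}]}");

    File file = new File("events.avro");

    // Write one record in Avro's compact binary container format
    GenericRecord event = new GenericData.Record(schema);
    event.put("host", "web01");
    event.put("bytes", 5120L);
    try (DataFileWriter<GenericRecord> writer =
             new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
      writer.create(schema, file);
      writer.append(event);
    }

    // Read it back; the schema is embedded in the file, so no external metadata is needed
    try (DataFileReader<GenericRecord> reader =
             new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
      for (GenericRecord rec : reader) {
        System.out.println(rec.get("host") + " sent " + rec.get("bytes") + " bytes");
      }
    }
  }
}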
Apache Sqoop
Apache Sqoop is a tool designed to load large datasets into Hadoop efficiently. Apache Sqoop allows application developers to easily import/export data from specific data sources such as relational databases, enterprise data warehouses, and custom applications. Internally, Apache Sqoop uses map tasks to perform the import/export effectively on the Hadoop cluster. Each mapper loads/unloads a slice of data between HDFS and the data source. Apache Sqoop establishes connectivity between non-Hadoop data sources and HDFS.
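
Sqoop is normally invoked from the command line, but the same import can also be launched from Java through Sqoop.runTool, as in this hedged sketch; the JDBC URL, credentials, table, and target directory are placeholders.

import org.apache.sqoop.Sqoop;

public class SqoopImportExample {
  public static void main(String[] args) {
    // Connection details and table name are placeholders for this sketch
    String[] importArgs = new String[] {
        "import",
        "--connect", "jdbc:mysql://db-host/sales",
        "--username", "etl",
        "--password", "secret",
        "--table", "orders",
        "--target-dir", "/data/orders",
        "--num-mappers", "4"          // four parallel map tasks each copy a slice of the table
    };
    int exitCode = Sqoop.runTool(importArgs);
    System.exit(exitCode);
  }
}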