Scaling Big Data with
Hadoop and Solr
Hrishikesh Karambelkar
BIRMINGHAM - MUMBAI
Scaling Big Data with Hadoop and Solr
All rights reserved. No part of this book may be reproduced, stored in a retrieval
system, or transmitted in any form or by any means, without the prior written
permission of the publisher, except in the case of brief quotations embedded in
critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy
of the information presented. However, the information contained in this book is
sold without warranty, either express or implied. Neither the author, nor Packt
Publishing, nor its dealers and distributors will be held liable for any damages
caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the
companies and products mentioned in this book by the appropriate use of capitals.
However, Packt Publishing cannot guarantee the accuracy of this information.
ISBN 978-1-78328-137-4
www.packtpub.com
Reviewer
Parvin Gasimzade

Proofreader
Lauren Harkins
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at service@packtpub.com for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign
up for a range of free newsletters and receive exclusive discounts and offers on Packt
books and eBooks.
http://PacktLib.PacktPub.com
Why Subscribe?
• Fully searchable across every book published by Packt
• Copy and paste, print and bookmark content
• On demand and accessible via web browser
Table of Contents
SolrCloud architecture 53
Configuring SolrCloud 54
Using multicore Solr search on SolrCloud 56
Benefits and drawbacks 58
Benefits 58
Drawbacks 58
Using Katta for Big Data search (Solr-1395 patch) 59
Katta architecture 59
Configuring Katta cluster 60
Creating Katta indexes 60
Benefits and drawbacks 61
Benefits 61
Drawbacks 61
Summary 61
Chapter 4: Using Big Data to Build Your Large Indexing 63
Understanding the concept of NOSQL 63
The CAP theorem 64
What is a NOSQL database? 64
The key-value store or column store 65
The document-oriented store 66
The graph database 66
Why NOSQL databases for Big Data? 67
How Solr can be used for Big Data storage? 67
Understanding the concepts of distributed search 68
Distributed search architecture 68
Distributed search scenarios 69
Lily – running Solr and Hadoop together 70
The architecture 70
Write-ahead Logging 72
The message queue 72
Querying using Lily 72
Updating records using Lily 72
Installing and running Lily 73
Deep dive – shards and indexing data of Apache Solr 74
The sharding algorithm 75
Adding a document to the distributed shard 77
Configuring SolrCloud to work with large indexes 77
Setting up the ZooKeeper ensemble 78
Setting up the Apache Solr instance 79
Creating shards, collections, and replicas in SolrCloud 80
Summary 81
Preface
This book will provide users with a step-by-step guide to working with Big Data using Hadoop and Solr. It starts with a basic understanding of Hadoop and Solr, and gradually gets into building an efficient, high-performance enterprise search repository for Big Data.

You will learn about various architectures and data workflows for a distributed search system. In the later chapters, this book provides information about optimizing the Big Data search instance while ensuring high availability and reliability.

Later, this book demonstrates two real-world use cases showing how Hadoop and Solr can be used together for distributed enterprise search.
Chapter 2, Understanding Solr, introduces you to Apache Solr. It explains how you
can configure the Solr instance, how to create indexes and load your data in the
Solr repository, and how you can use Solr effectively for searching. It also discusses
interesting features of Apache Solr.
Chapter 3, Making Big Data Work for Hadoop and Solr, brings the two worlds together; it takes you through different approaches for making Big Data search work, along with their architectures, benefits, and applicability.
Chapter 4, Using Big Data to Build Your Large Indexing, explains NoSQL and the concepts of distributed search. It then takes you through the different algorithms used for Big Data search, covering shards and indexing. It also talks about SolrCloud configuration and Lily.
Chapter 5, Improving Performance of Search while Scaling with Big Data, covers the different levels of optimization that you can perform on your Big Data search instance as the data keeps growing. It discusses different performance improvement techniques which users can implement for their deployments.
Appendix A, Use Cases for Big Data Search, describes some industry use cases and
case studies for Big Data using Solr and Hadoop.
Appendix B, Creating Enterprise Search Using Apache Solr, shares a sample Solr schema which users can use for experimenting with Apache Solr.
Appendix C, Sample MapReduce Programs to Build the Solr Indexes, provides a sample
MapReduce program to build distributed Solr indexes for different approaches.
What you need for this book
• JDK 6
• Apache Hadoop
• Apache Solr 4.0 or above
• Patch sets, depending upon which setup you intend to run
• Katta (only if you are setting up Katta)
• Lily (only if you are setting up Lily)
Conventions
In this book, you will find a number of styles of text that distinguish between
different kinds of information. Here are some examples of these styles, and an
explanation of their meaning.
Code words in text are shown as follows: "You will typically find the
hadoop-example jar in /usr/share/hadoop, or in $HADOOP_HOME."
When we wish to draw your attention to a particular part of a code block, the
relevant lines or items are set in bold:
New terms and important words are shown in bold. Words that you see on
the screen, in menus or dialog boxes for example, appear in the text like this:
"The admin UI will start showing the Cloud tab."
Reader feedback
Feedback from our readers is always welcome. Let us know what you think about
this book—what you liked or may have disliked. Reader feedback is important for
us to develop titles that you really get the most out of.
If there is a topic that you have expertise in and you are interested in either writing
or contributing to a book, see our author guide on www.packtpub.com/authors.
Customer support
Now that you are the proud owner of a Packt book, we have a number of things
to help you to get the most from your purchase.
Errata
Although we have taken every care to ensure the accuracy of our content, mistakes
do happen. If you find a mistake in one of our books—maybe a mistake in the text or
the code—we would be grateful if you would report this to us. By doing so, you can
save other readers from frustration and help us improve subsequent versions of this
book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the errata submission form link,
and entering the details of your errata. Once your errata are verified, your submission
will be accepted and the errata will be uploaded on our website, or added to any list of
existing errata, under the Errata section of that title. Any existing errata can be viewed
by selecting your title from http://www.packtpub.com/support.
Piracy
Piracy of copyright material on the Internet is an ongoing problem across all media.
At Packt, we take the protection of our copyright and licenses very seriously. If you
come across any illegal copies of our works, in any form, on the Internet, please
provide us with the location address or website name immediately so that we can
pursue a remedy.
We appreciate your help in protecting our authors, and our ability to bring you
valuable content.
Questions
You can contact us at questions@packtpub.com if you are having a problem
with any aspect of the book, and we will do our best to address it.
Processing Big Data Using
Hadoop and MapReduce
Traditionally, computation has been processor-driven. As data grew, the industry focused on increasing processor speed and memory to get better computational performance. This gave birth to distributed systems. In today's world, different applications create hundreds or thousands of gigabytes of data every day. This data comes from disparate sources such as application software, sensors, social media, mobile devices, logs, and so on. Such huge volumes of data are difficult to process using the standard software available for data processing, mainly because the data size grows exponentially with time. Traditional distributed systems were not sufficient to manage this data, and there was a need for modern systems that could handle heavy data loads with scalability and high availability. This kind of data is called Big Data.
Big Data is usually associated with high-volume, heavily growing data with unpredictable content. For example, the video gaming industry needs to predict performance over 500 GB of structured data and analyze over 4 TB of operational logs every day; many gaming companies use Big Data based technologies to do so. The IT advisory firm Gartner defines Big Data using three Vs: high volume of data, high velocity of processing, and high variety of information. IBM added a fourth V (veracity) to its definition, to ensure that the data is accurate and helps you make sound business decisions.
While the potential benefits of Big Data are real and significant, many challenges remain. Organizations that deal with such high volumes of data face the following problems:
• Data acquisition: There is a lot of raw data that gets generated from various data sources. The challenge is to filter and compress this data, and extract the information out of it once it is cleaned.
• Information storage and organization: Once the information is captured out of the raw data, a data model is created and stored on a storage device. When it comes to storing huge datasets effectively, traditional relational systems stop being effective at such a high scale. A new breed of databases called NOSQL databases has emerged, which are mainly used to work with Big Data. NOSQL databases are non-relational databases.
• Information search and analytics: Storing data is only a part of building a
warehouse. Data is useful only when it is computed. Big data is often noisy,
dynamic, and heterogeneous. This information is searched, mined, and
analyzed for behavioral modeling.
• Data security and privacy: While bringing in linked data from multiple sources, organizations need to worry about data security and privacy the most.
Big Data poses a lot of challenges to the technologies in use today. It requires processing large quantities of data within a finite timeframe, which brings in technologies such as massively parallel processing (MPP) and distributed file systems.
Big Data is catching more and more attention from various organizations, and many of them have already started exploring it. Recently, Gartner (http://www.gartner.com/newsroom/id/2304615) published an executive program survey report, which reveals that Big Data and analytics are among the top 10 business priorities for CIOs. Similarly, analytics and BI stand at the top of CIOs' technical priorities. We will try to understand Apache Hadoop in this chapter. We will cover the following:
HDFS is responsible for storing data in a distributed manner across multiple Hadoop cluster nodes. The MapReduce framework provides rich computational APIs for developers to code against; the resulting code eventually runs as map and reduce tasks on the Hadoop cluster.
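
To make this division of labour concrete, the following is a minimal word-count sketch written against the newer org.apache.hadoop.mapreduce API (Hadoop 2.x style); the class name and the input/output paths are illustrative only and are not taken from this book's examples. The mapper emits a (word, 1) pair for every token it sees, and the reducer sums the counts for each word.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);   // emit (word, 1) for every token in the line
        }
      }
    }
  }

  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();               // add up all the 1s for this word
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a jar, such a job is typically submitted with hadoop jar wordcount.jar WordCount <input> <output>, and the framework takes care of distributing the map and reduce tasks across the cluster.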
The Apache Hadoop ecosystem is vast. It has grown drastically over time, as different organizations have contributed to this open source initiative. Thanks to this huge ecosystem, it meets the needs of different organizations for high-performance analytics. To understand the ecosystem, let's look at the following diagram:
[Diagram: the Apache Hadoop ecosystem, showing components such as Flume/Sqoop, Mahout, Pig, Hive, ZooKeeper, Ambari, Avro, HBase, MapReduce, and HCatalog]
Apache HBase
HDFS is an append-only file system; it does not allow data modification. Apache HBase is a distributed, random-access, column-oriented database. HBase runs directly on top of HDFS, and it allows application developers to read/write the HDFS data directly. HBase does not support SQL; hence, it is also called a NOSQL database. However, it provides a command-line interface as well as a rich set of APIs to update the data. The data in HBase is stored as key-value pairs in HDFS.
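
As a rough sketch of this key-value model, the following snippet writes and reads a single cell through the HBase 1.x (or later) Java client API; the table name users and the column family info are hypothetical and would need to exist, for example created through the HBase shell, before this runs.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseKeyValueExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("users"))) {
      // Write one cell: row key "user100", column family "info", qualifier "email"
      Put put = new Put(Bytes.toBytes("user100"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"),
          Bytes.toBytes("someone@example.com"));
      table.put(put);

      // Read the same cell back by row key
      Result result = table.get(new Get(Bytes.toBytes("user100")));
      byte[] email = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"));
      System.out.println("email = " + Bytes.toString(email));
    }
  }
}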
Apache Pig
Apache Pig provides another abstraction layer on top of MapReduce. It offers Pig Latin, a high-level language in which developers write data analysis programs; Pig compiles these programs into MapReduce jobs. Pig code generates parallel execution tasks, and therefore uses the distributed Hadoop cluster effectively. Pig was initially developed at Yahoo! Research to enable developers to create ad-hoc MapReduce jobs for Hadoop. Since then, many big organizations such as eBay, LinkedIn, and Twitter have started using Apache Pig.
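
Pig Latin scripts are usually run through the pig command-line tool, but they can also be driven from Java using the PigServer class. The sketch below is a hedged word-count example; the input and output HDFS paths are placeholders.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigWordCount {
  public static void main(String[] args) throws Exception {
    // ExecType.MAPREDUCE submits the generated jobs to the Hadoop cluster;
    // use ExecType.LOCAL to test on a single machine.
    PigServer pig = new PigServer(ExecType.MAPREDUCE);
    pig.registerQuery("lines = LOAD '/data/input' AS (line:chararray);");
    pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
    pig.registerQuery("grouped = GROUP words BY word;");
    pig.registerQuery("counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS total;");
    pig.store("counts", "/data/wordcount-output");   // triggers compilation and execution
  }
}

Each registerQuery call only builds up the logical plan; it is the store call that makes Pig compile the script into MapReduce jobs and run them on the cluster.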
Apache Hive
Apache Hive provides data warehouse capabilities on top of Big Data. Hive runs on top of Apache Hadoop and uses HDFS for storing its data. The Apache Hadoop framework is difficult to understand, and writing MapReduce-based programs requires a different approach from traditional programming. With Hive, developers do not need to write MapReduce code at all. Hive provides a SQL-like query language called HiveQL to application developers, enabling them to quickly write ad-hoc queries similar to RDBMS SQL queries.
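
HiveQL is most often typed into the Hive shell, but applications can also submit it through Hive's JDBC driver. The following is a small sketch, assuming a HiveServer2 instance at localhost:10000 and a hypothetical access_logs table; adjust the URL, credentials, and schema for a real deployment.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQlExample {
  public static void main(String[] args) throws Exception {
    // Register the HiveServer2 JDBC driver and connect (host, port, and database are assumptions)
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection con = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "hive", "");
         Statement stmt = con.createStatement()) {
      stmt.execute("CREATE TABLE IF NOT EXISTS access_logs (ip STRING, url STRING, ts STRING) "
          + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'");
      // Hive compiles this ad-hoc query into one or more MapReduce jobs behind the scenes
      ResultSet rs = stmt.executeQuery("SELECT url, COUNT(*) AS hits FROM access_logs GROUP BY url");
      while (rs.next()) {
        System.out.println(rs.getString("url") + " -> " + rs.getLong("hits"));
      }
    }
  }
}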
Apache ZooKeeper
Apache Hadoop nodes communicate with each other through Apache ZooKeeper. It forms a mandatory part of the Apache Hadoop ecosystem. Apache ZooKeeper is responsible for maintaining coordination among the various nodes. Besides coordinating nodes, it also maintains configuration information and provides group services to the distributed system. Unlike the other components of the ecosystem, Apache ZooKeeper can be used independently of Hadoop. Because it manages its information in memory, it offers distributed coordination at high speed.
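
The following sketch shows the kind of coordination primitive ZooKeeper exposes through its Java client: a small piece of configuration stored in a znode that every node in the cluster can read. The connection string and the znode path are assumptions made for illustration.

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperConfigExample {
  public static void main(String[] args) throws Exception {
    CountDownLatch connected = new CountDownLatch(1);
    // Connect to a (hypothetical) ensemble member; the watcher fires once the session is established
    ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {
      if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
        connected.countDown();
      }
    });
    connected.await();

    // Store a small piece of shared configuration as a persistent znode
    if (zk.exists("/app-config", false) == null) {
      zk.create("/app-config", "batch.size=500".getBytes(),
          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }
    // Any process connected to the ensemble can now read the same value
    byte[] data = zk.getData("/app-config", false, null);
    System.out.println(new String(data));
    zk.close();
  }
}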
Apache Mahout
Apache Mahout is an open source machine learning library that can effectively empower Hadoop users with analytical capabilities, such as clustering and data mining, over a distributed Hadoop cluster. Mahout is highly effective over large datasets; the algorithms it provides are highly optimized to run as MapReduce jobs over HDFS.
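
As a taste of Mahout's APIs, the sketch below builds a simple user-based recommender from Mahout's Taste library; it runs on a single machine against a hypothetical ratings.csv file (lines of the form userID,itemID,rating). Mahout also ships MapReduce-based implementations of many of its algorithms for running at Hadoop scale.

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class MahoutRecommenderExample {
  public static void main(String[] args) throws Exception {
    // ratings.csv (hypothetical) holds lines of the form: userID,itemID,rating
    DataModel model = new FileDataModel(new File("ratings.csv"));
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

    // Top three item recommendations for user 42
    List<RecommendedItem> items = recommender.recommend(42, 3);
    for (RecommendedItem item : items) {
      System.out.println(item.getItemID() + " scored " + item.getValue());
    }
  }
}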
Apache HCatalog
Apache HCatalog provides metadata management services on top of Apache Hadoop. This means that all the software that runs on Hadoop can effectively use HCatalog to store its schemas in HDFS. HCatalog helps any third-party software to create, edit, and expose (using REST APIs) the generated metadata or table definitions. So, any user or script can run Hadoop effectively without actually knowing where the data is physically stored in HDFS. HCatalog provides DDL (Data Definition Language) commands with which the requested MapReduce, Pig, and Hive jobs can be queued for execution and later monitored for progress as and when required.
Apache Ambari
Apache Ambari provides a set of tools to monitor an Apache Hadoop cluster while hiding the complexities of the Hadoop framework. It offers features such as an installation wizard, system alerts and metrics, provisioning and management of the Hadoop cluster, job performance monitoring, and so on. Ambari exposes RESTful APIs that allow administrators to integrate it with any other software.
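
For example, a monitoring script can list the clusters an Ambari server manages with a plain HTTP call to its REST API. In this hedged sketch the host, port, and admin credentials are placeholders; the /api/v1/clusters endpoint returns a JSON listing.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Base64;

public class AmbariClustersExample {
  public static void main(String[] args) throws Exception {
    // Host, port, and credentials are assumptions; adjust them for your Ambari server
    URL url = new URL("http://ambari-host:8080/api/v1/clusters");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    String auth = Base64.getEncoder().encodeToString("admin:admin".getBytes("UTF-8"));
    conn.setRequestProperty("Authorization", "Basic " + auth);
    conn.setRequestProperty("X-Requested-By", "ambari");   // required by Ambari on modifying calls, harmless on GET

    try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line);   // JSON description of the managed clusters
      }
    }
  }
}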
Apache Avro
Since Hadoop deals with large datasets, it becomes very important to process and store the data optimally on disk. This large data should be organized efficiently so that different programming languages can read the datasets; Apache Avro helps you do that. Avro effectively provides data compression and storage at the various nodes of Apache Hadoop. Avro-based stores can easily be read using scripting languages as well as Java. Avro provides dynamic access to data, which in turn allows software to access any arbitrary data dynamically. Avro can be effectively used in the Apache Hadoop MapReduce framework for data serialization.
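
The short sketch below shows Avro's generic API writing a record to an Avro container file and reading it back; the LogEvent schema and the events.avro filename are made up for illustration. Because the schema travels inside the file, any Avro-capable language can read it later without extra metadata.

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroSerializationExample {
  public static void main(String[] args) throws Exception {
    // A small schema defined inline; real projects usually keep .avsc files alongside the code
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"LogEvent\",\"fields\":["
      + "{\"name\":\"host\",\"type\":\"string\"},"
      + "{\"name\":\"bytes\",\"type\":\"long\"}]}");

    File file = new File("events.avro");

    // Write one record in Avro's compact binary container format
    GenericRecord event = new GenericData.Record(schema);
    event.put("host", "web01");
    event.put("bytes", 5120L);
    try (DataFileWriter<GenericRecord> writer =
             new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
      writer.create(schema, file);
      writer.append(event);
    }

    // Read it back; the schema is embedded in the file, so no external metadata is needed
    try (DataFileReader<GenericRecord> reader =
             new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
      for (GenericRecord rec : reader) {
        System.out.println(rec.get("host") + " sent " + rec.get("bytes") + " bytes");
      }
    }
  }
}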
Apache Sqoop
Apache Sqoop is a tool designed to load large datasets into Hadoop efficiently. Apache Sqoop allows application developers to easily import/export data from specific data sources such as relational databases, enterprise data warehouses, and custom applications. Internally, Apache Sqoop uses map tasks to perform the import/export effectively on the Hadoop cluster. Each mapper loads/unloads a slice of data between HDFS and the data source. Apache Sqoop establishes connectivity between non-Hadoop data sources and HDFS.
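
Sqoop is normally invoked from the command line, but the same import can also be launched from Java through Sqoop.runTool, as in this hedged sketch; the JDBC URL, credentials, table, and target directory are placeholders.

import org.apache.sqoop.Sqoop;

public class SqoopImportExample {
  public static void main(String[] args) {
    // Connection details and table name are placeholders for this sketch
    String[] importArgs = new String[] {
        "import",
        "--connect", "jdbc:mysql://db-host/sales",
        "--username", "etl",
        "--password", "secret",
        "--table", "orders",
        "--target-dir", "/data/orders",
        "--num-mappers", "4"          // four parallel map tasks each copy a slice of the table
    };
    int exitCode = Sqoop.runTool(importArgs);
    System.exit(exitCode);
  }
}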