Machine Learning with Spark and Python: Essential Techniques for Predictive Analytics, 2nd Edition, Michael Bowles
Second Edition
Michael Bowles
Machine Learning with Spark™ and Python®: Essential Techniques for Predictive Analytics, Second Edition
Published by
John Wiley & Sons, Inc.
10475 Crosspoint Boulevard
Indianapolis, IN 46256
www.wiley.com
ISBN: 978‐1‐119‐56193‐4
ISBN: 978‐1‐119‐56201‐6 (ebk)
ISBN: 978‐1‐119‐56195‐8 (ebk)
No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or
by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted
under Sections 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission
of the Publisher, or authorization through payment of the appropriate per‐copy fee to the Copyright
Clearance Center, 222 Rosewood Drive, Danvers, MA 01923, (978) 750‐8400, fax (978) 646‐8600. Requests to the
Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111
River Street, Hoboken, NJ 07030, (201) 748‐6011, fax (201) 748‐6008, or online at
http://www.wiley.com/go/permissions.
Limit of Liability/Disclaimer of Warranty: The publisher and the author make no representations or
warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all
warranties, including without limitation warranties of fitness for a particular purpose. No warranty may be
created or extended by sales or promotional materials. The advice and strategies contained herein may not
be suitable for every situation. This work is sold with the understanding that the publisher is not engaged in
rendering legal, accounting, or other professional services. If professional assistance is required, the services
of a competent professional person should be sought. Neither the publisher nor the author shall be liable for
damages arising herefrom. The fact that an organization or Web site is referred to in this work as a citation
and/or a potential source of further information does not mean that the author or the publisher endorses
the information the organization or website may provide or recommendations it may make. Further, readers
should be aware that Internet websites listed in this work may have changed or disappeared between when
this work was written and when it is read.
For general information on our other products and services please contact our Customer Care Department
within the United States at (877) 762‐2974, outside the United States at (317) 572‐3993 or fax (317) 572‐4002.
Wiley publishes in a variety of print and electronic formats and by print‐on‐demand. Some material included
with standard print versions of this book may not be included in e‐books or in print‐on‐demand. If this book
refers to media such as a CD or DVD that is not included in the version you purchased, you may
download this material at http://booksupport.wiley.com. For more information about Wiley products, visit
www.wiley.com.
Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc.
and/or its affiliates, in the United States and other countries, and may not be used without written
permission. Spark is a trademark of the Apache Software Foundation, Inc. Python is a registered trademark of the
Python Software Foundation. All other trademarks are the property of their respective owners. John Wiley &
Sons, Inc. is not associated with any product or vendor mentioned in this book.
I dedicate this book to my expanding family of children and grandchildren, Scott,
Seth, Cayley, Rees, and Lia. Being included in their lives is a constant source of joy for
me. I hope it makes them smile to see their names in print. I also dedicate it to my
close friend Dave, whose friendship remains steadfast in spite of my best efforts.
I hope this makes him smile too.
Dr. Michael Bowles (Mike) holds bachelor’s and master’s degrees in mechanical
engineering, an ScD in instrumentation, and an MBA. He has worked in academia,
technology, and business. Mike currently works with companies where
artificial intelligence or machine learning is integral to success. He serves variously
as part of the management team, a consultant, or an advisor. He also teaches
machine learning courses at UC Berkeley and Hacker Dojo, a co-working space
and startup incubator in Mountain View, CA.
Mike was born in Oklahoma and took his bachelor’s and master’s degrees
there, then after a stint in Southeast Asia went to Cambridge for his ScD and,
after graduation, the C. Stark Draper Chair at MIT. Mike left Boston to work on
communications satellites at Hughes Aircraft Company in Southern California, and
then after completing an MBA at UCLA moved to the San Francisco Bay Area
to take roles as founder and CEO of two successful venture-backed startups.
Mike remains actively involved in technical and startup-related work. Recent
projects include the use of machine learning in industrial inspection and automation,
financial prediction, predicting biological outcomes on the basis of
molecular graph structures, and financial risk estimation. He has participated
in due diligence work on companies in the artificial intelligence and machine
learning arenas. Mike can be reached through mbowles.com.
About the Technical Editor
Acknowledgments
I’d like to acknowledge the splendid support that people at Wiley have offered
during the course of writing this book and making the revisions for this second
edition. It began with Robert Elliot, the acquisitions editor who first contacted me
about writing a book—very easy to work with. Tom Dinse has done a splendid
job editing this second edition. He’s been responsive, thorough, flexible, and
completely professional, as I’ve come to expect from Wiley. I thank you.
I’d also like to acknowledge the enormous comfort that comes from having
such a quick, capable computer scientist as James Winegar doing the technical
editing on the book. James has brought a more consistent style and has made
a number of improvements that will make the code that comes along with the
book easier to use and understand. Thank you for that.
The example problems used in the book come from the University of California
at Irvine’s data repository. UCI does the machine learning community a great
service by gathering these data sets, curating them, and making them freely
available. The reference for this material is:
Bache, K. & Lichman, M. (2013). UCI Machine Learning Repository
(http://archive.ics.uci.edu/ml). Irvine, CA: University of California, School of
Information and Computer Science.
Contents at a Glance
Introduction
Chapter 1 The Two Essential Algorithms for Making Predictions
Chapter 2 Understand the Problem by Understanding the Data
Chapter 3 Predictive Model Building: Balancing Performance, Complexity, and Big Data
Chapter 4 Penalized Linear Regression
Chapter 5 Building Predictive Models Using Penalized Linear Methods
Chapter 6 Ensemble Methods
Chapter 7 Building Ensemble Models with Python
Index
Contents
Introduction
Chapter 1 The Two Essential Algorithms for Making Predictions
Why Are These Two Algorithms So Useful?
What Are Penalized Regression Methods?
What Are Ensemble Methods?
How to Decide Which Algorithm to Use
The Process Steps for Building a Predictive Model
Framing a Machine Learning Problem
Feature Extraction and Feature Engineering
Determining Performance of a Trained Model
Chapter Contents and Dependencies
Summary
Chapter 2 Understand the Problem by Understanding the Data
The Anatomy of a New Problem
Different Types of Attributes and Labels Drive Modeling Choices
Things to Notice about Your New Data Set
Classification Problems: Detecting Unexploded Mines Using Sonar
Physical Characteristics of the Rocks Versus Mines Data Set
Statistical Summaries of the Rocks Versus Mines Data Set
Visualization of Outliers Using a Quantile-Quantile Plot
Statistical Characterization of Categorical Attributes
How to Use Python Pandas to Summarize the Rocks Versus Mines Data Set
Visualizing Properties of the Rocks Versus Mines Data Set
Visualizing with Parallel Coordinates Plots
Visualizing Interrelationships between Attributes and Labels