
Machine Learning with
Python Cookbook
SECOND EDITION
Practical Solutions from Preprocessing to Deep Learning

With Early Release ebooks, you get books in their earliest form—the author’s
raw and unedited content as they write—so you can take advantage of these
technologies long before the official release of these titles.

Kyle Gallatin and Chris Albon


Machine Learning with Python Cookbook
by Kyle Gallatin and Chris Albon
Copyright © 2023 Kyle Gallatin. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North,
Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales
promotional use. Online editions are also available for most titles
(http://oreilly.com). For more information, contact our
corporate/institutional sales department: 800-998-9938 or
corporate@oreilly.com.
Acquisitions Editor: Nicole Butterfield
Development Editor: Jeff Bleiel
Production Editor: Christopher Faucher
Interior Designer: David Futato
Cover Designer: Karen Montgomery
April 2018: First Edition
October 2023: Second Edition
Revision History for the Early Release
2022-08-24: First Release
See http://oreilly.com/catalog/errata.csp?isbn=9781098135720 for
release details.
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc.
Machine Learning with Python Cookbook, the cover image, and
related trade dress are trademarks of O’Reilly Media, Inc.
The views expressed in this work are those of the authors and do
not represent the publisher’s views. While the publisher and the
authors have used good faith efforts to ensure that the information
and instructions contained in this work are accurate, the publisher
and the authors disclaim all responsibility for errors or omissions,
including without limitation responsibility for damages resulting from
the use of or reliance on this work. Use of the information and
instructions contained in this work is at your own risk. If any code
samples or other technology this work contains or describes is
subject to open source licenses or the intellectual property rights of
others, it is your responsibility to ensure that your use thereof
complies with such licenses and/or rights.
978-1-098-13566-9
Chapter 1. Working with Vectors, Matrices and Arrays in NumPy

A NOTE FOR EARLY RELEASE READERS


With Early Release ebooks, you get books in their earliest form—the authors’
raw and unedited content as they write—so you can take advantage of these
technologies long before the official release of these titles.
This will be the 1st chapter of the final book.
If you have comments about how we might improve the content and/or
examples in this book, or if you notice missing material within this chapter,
please reach out to the authors at feedback.mlpythoncookbook@gmail.com.

1.0 Introduction
NumPy is a foundational tool of the Python machine learning stack.
NumPy allows for efficient operations on the data structures often
used in machine learning: vectors, matrices, and tensors. While
NumPy is not the focus of this book, it will show up frequently
throughout the following chapters. This chapter covers the most
common NumPy operations we are likely to run into while working
on machine learning workflows.

1.1 Creating a Vector

Problem
You need to create a vector.
Solution
Use NumPy to create a one-dimensional array:

# Load library
import numpy as np

# Create a vector as a row
vector_row = np.array([1, 2, 3])

# Create a vector as a column
vector_column = np.array([[1],
                          [2],
                          [3]])

Discussion
NumPy’s main data structure is the multidimensional array. A vector
is just an array with a single dimension. In order to create a vector,
we simply create a one-dimensional array. Just like vectors, these
arrays can be represented horizontally (i.e., rows) or vertically (i.e.,
columns).
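
As a quick check of the two layouts, the sketch below prints the shape of each array from the solution. It reuses the variable names from the solution and only illustrates one point: NumPy stores the row version as a one-dimensional array, while the column version is a two-dimensional array with three rows and one column.

```python
import numpy as np

vector_row = np.array([1, 2, 3])
vector_column = np.array([[1],
                          [2],
                          [3]])

# Shape and number of dimensions of each layout
print(vector_row.shape)     # (3,)
print(vector_column.shape)  # (3, 1)
print(vector_row.ndim, vector_column.ndim)  # 1 2
```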

See Also
Vectors, Math Is Fun
Euclidean vector, Wikipedia

1.2 Creating a Matrix

Problem
You need to create a matrix.

Solution
Use NumPy to create a two-dimensional array:
# Load library
import numpy as np

# Create a matrix
matrix = np.array([[1, 2],
                   [1, 2],
                   [1, 2]])

Discussion
To create a matrix we can use a NumPy two-dimensional array. In
our solution, the matrix contains three rows and two columns (a
column of 1s and a column of 2s).
NumPy actually has a dedicated matrix data structure:

# np.asmatrix is used here because the np.mat alias was removed in NumPy 2.0
matrix_object = np.asmatrix([[1, 2],
                             [1, 2],
                             [1, 2]])

matrix([[1, 2],
        [1, 2],
        [1, 2]])

However, the matrix data structure is not recommended for two
reasons. First, arrays are the de facto standard data structure of
NumPy. Second, the vast majority of NumPy operations return
arrays, not matrix objects.

See Also
Matrix, Wikipedia
Matrix, Wolfram MathWorld
1.3 Creating a Sparse Matrix

Problem
Given data with very few nonzero values, you want to efficiently
represent it.

Solution
Create a sparse matrix:

# Load libraries
import numpy as np
from scipy import sparse

# Create a matrix
matrix = np.array([[0, 0],
                   [0, 1],
                   [3, 0]])

# Create compressed sparse row (CSR) matrix
matrix_sparse = sparse.csr_matrix(matrix)

Discussion
A frequent situation in machine learning is having a huge amount of
data; however, most of the elements in the data are zeros. For
example, imagine a matrix where the columns are every movie on
Netflix, the rows are every Netflix user, and the values are how many
times a user has watched that particular movie. This matrix would
have tens of thousands of columns and millions of rows! However,
since most users do not watch most movies, the vast majority of
elements would be zero.
A sparse matrix is a matrix in which most elements are 0. Sparse
matrices only store nonzero elements and assume all other values
will be zero, leading to significant computational savings. In our
solution, we created a NumPy array with two nonzero values, then
converted it into a sparse matrix. If we view the sparse matrix we
can see that only the nonzero values are stored:

# View sparse matrix
print(matrix_sparse)

(1, 1) 1
(2, 0) 3

There are a number of types of sparse matrices. In compressed
sparse row (CSR) matrices, (1, 1) and (2, 0) represent the
(zero-indexed) indices of the nonzero values 1 and 3, respectively.
For example, the element 1 is in the second row and second column.
We can see the advantage of sparse matrices if we create a much
larger matrix with many more zero elements and then compare this
larger matrix with our original sparse matrix:

# Create larger matrix
matrix_large = np.array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                         [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
                         [3, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

# Create compressed sparse row (CSR) matrix
matrix_large_sparse = sparse.csr_matrix(matrix_large)

# View original sparse matrix
print(matrix_sparse)

(1, 1) 1
(2, 0) 3

# View larger sparse matrix
print(matrix_large_sparse)

(1, 1) 1
(2, 0) 3
As we can see, despite the fact that we added many more zero
elements in the larger matrix, its sparse representation is exactly the
same as our original sparse matrix. That is, the addition of zero
elements did not change the size of the sparse matrix.
As mentioned, there are many different types of sparse matrices,
such as compressed sparse column, list of lists, and dictionary of
keys. While an explanation of the different types and their
implications is outside the scope of this book, it is worth noting that
while there is no “best” sparse matrix type, there are meaningful
differences between them and we should be conscious about why
we are choosing one type over another.
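
To make the storage claim concrete, here is a minimal sketch that compares the two matrices from this recipe. It counts the explicitly stored values with the nnz attribute and compares the bytes used by the dense arrays against the bytes used by the stored values. The exact byte counts depend on the platform's default integer type, so none are quoted here; a CSR matrix also stores index arrays, which this sketch does not measure.

```python
import numpy as np
from scipy import sparse

# Small and large dense matrices with the same two nonzero values
matrix = np.array([[0, 0],
                   [0, 1],
                   [3, 0]])
matrix_large = np.zeros((3, 10), dtype=matrix.dtype)
matrix_large[1, 1] = 1
matrix_large[2, 0] = 3

matrix_sparse = sparse.csr_matrix(matrix)
matrix_large_sparse = sparse.csr_matrix(matrix_large)

# Number of explicitly stored values in each sparse matrix
print(matrix_sparse.nnz, matrix_large_sparse.nnz)  # 2 2

# Bytes used by the dense arrays grow with the number of zeros...
print(matrix.nbytes, matrix_large.nbytes)

# ...while the bytes used by the stored nonzero values stay the same
print(matrix_sparse.data.nbytes, matrix_large_sparse.data.nbytes)
```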

See Also
Sparse matrices, SciPy documentation
101 Ways to Store a Sparse Matrix

1.4 Pre-allocating NumPy Arrays

Problem
You need to pre-allocate arrays of a given size with some value.

Solution
NumPy has functions for generating vectors and matrices of any size
using 0s, 1s, or values of your choice.

# Load library
import numpy as np

# Generate a vector of shape (5,) containing all zeros
vector = np.zeros(shape=5)

# View the vector
print(vector)

[0. 0. 0. 0. 0.]

# Generate a matrix of shape (3,3) filled with the value 1.0
matrix = np.full(shape=(3, 3), fill_value=1.0)

# View the matrix
print(matrix)

[[1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]]

Discussion
Generating arrays prefilled with data is useful for a number of
purposes, such as making code more performant or having synthetic
data to test algorithms with. In many programming languages, pre-
allocating an array of default values (such as 0s) is considered
common practice.
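
The sketch below shows a few related constructors and the pre-allocate-then-fill pattern the discussion refers to. The variable names are illustrative only; np.ones, np.full, and np.zeros_like are standard NumPy functions.

```python
import numpy as np

# 2x3 array of 1.0
ones = np.ones(shape=(2, 3))

# 2x3 array filled with a value of our choice
sevens = np.full(shape=(2, 3), fill_value=7)

# Zeros with the same shape and dtype as an existing array
zeros_like_sevens = np.zeros_like(sevens)

# Pre-allocate a result vector, then fill it in place
squares = np.zeros(shape=5)
for i in range(5):
    squares[i] = i ** 2

print(squares)
```

Filling a pre-allocated array in place avoids growing an array one element at a time, which would otherwise create a new array on every step.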

1.5 Selecting Elements

Problem
You need to select one or more elements in a vector or matrix.

Solution
NumPy’s arrays make it easy to select elements in vectors or
matrices:

# Load library
import numpy as np

# Create row vector
vector = np.array([1, 2, 3, 4, 5, 6])

# Create matrix
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

# Select third element of vector
vector[2]

3

# Select second row, second column
matrix[1,1]

5

Discussion
Like most things in Python, NumPy arrays are zero-indexed, meaning
that the index of the first element is 0, not 1. With that caveat,
NumPy offers a wide variety of methods for selecting (i.e., indexing
and slicing) elements or groups of elements in arrays:

# Select all elements of a vector
vector[:]

array([1, 2, 3, 4, 5, 6])

# Select everything up to and including the third element
vector[:3]

array([1, 2, 3])

# Select everything after the third element
vector[3:]

array([4, 5, 6])

# Select the last element
vector[-1]

6

# Reverse the vector
vector[::-1]

array([6, 5, 4, 3, 2, 1])

# Select the first two rows and all columns of a matrix
matrix[:2,:]

array([[1, 2, 3],
       [4, 5, 6]])

# Select all rows and the second column
matrix[:,1:2]

array([[2],
       [5],
       [8]])
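
One detail worth seeing side by side is how a slice and a plain integer index differ when selecting a column. The following is a small sketch using the same matrix as the solution; the names column_2d and column_1d are illustrative only.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

# A slice keeps the column axis, so the result is still two-dimensional
column_2d = matrix[:, 1:2]

# An integer index drops the axis, so the result is one-dimensional
column_1d = matrix[:, 1]

print(column_2d.shape)  # (3, 1)
print(column_1d.shape)  # (3,)
print(column_1d)        # [2 5 8]
```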

1.6 Describing a Matrix

Problem
You want to describe the shape, size, and dimensions of the matrix.

Solution
Use the shape, size, and ndim attributes of a NumPy object:

# Load library
import numpy as np

# Create matrix
matrix = np.array([[1, 2, 3, 4],
                   [5, 6, 7, 8],
                   [9, 10, 11, 12]])

# View number of rows and columns
matrix.shape

(3, 4)

# View number of elements (rows * columns)
matrix.size

12

# View number of dimensions
matrix.ndim

2

Discussion
This might seem basic (and it is); however, time and again it will be
valuable to check the shape and size of an array both for further
calculations and simply as a gut check after some operation.

1.7 Applying Functions Over Each Element

Problem
You want to apply some function to all elements in an array.

Solution
Use NumPy’s vectorize method:

# Load library
import numpy as np

# Create matrix
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

# Create function that adds 100 to something
add_100 = lambda i: i + 100

# Create vectorized function
vectorized_add_100 = np.vectorize(add_100)

# Apply function to all elements in matrix
vectorized_add_100(matrix)

array([[101, 102, 103],
       [104, 105, 106],
       [107, 108, 109]])

Discussion
NumPy’s vectorize class converts a function into a function that
can apply to all elements in an array or slice of an array. It’s worth
noting that vectorize is essentially a for loop over the elements
and does not increase performance. Furthermore, NumPy arrays
allow us to perform operations between arrays even if their
dimensions are not the same (a process called broadcasting). For
example, we can create a much simpler version of our solution using
broadcasting:

# Add 100 to all elements
matrix + 100

array([[101, 102, 103],
       [104, 105, 106],
       [107, 108, 109]])
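
Broadcasting is not limited to scalars. As a minimal sketch (the row vector and its values are made up for illustration), a one-dimensional array of length 3 can be added to a 3x3 matrix, in which case it is applied to every row:

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])
row = np.array([10, 20, 30])

# The (3,) row is broadcast across each of the three rows of the (3, 3) matrix
result = matrix + row
print(result)
```

Each column of the result is the corresponding column of the matrix shifted by 10, 20, or 30.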
1.8 Finding the Maximum and Minimum Values

Problem
You need to find the maximum or minimum value in an array.

Solution
Use NumPy’s max and min methods:

# Load library
import numpy as np

# Create matrix
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

# Return maximum element
np.max(matrix)

9

# Return minimum element
np.min(matrix)

1

Discussion
Often we want to know the maximum and minimum value in an
array or subset of an array. This can be accomplished with the max
and min methods. Using the axis parameter we can also apply the
operation along a certain axis:

# Find maximum element in each column
np.max(matrix, axis=0)

array([7, 8, 9])

# Find maximum element in each row
np.max(matrix, axis=1)

array([3, 6, 9])

1.9 Calculating the Average, Variance, and Standard Deviation

Problem
You want to calculate some descriptive statistics about an array.

Solution
Use NumPy’s mean, var, and std:

# Load library
import numpy as np

# Create matrix
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

# Return mean
np.mean(matrix)

5.0

# Return variance
np.var(matrix)

6.666666666666667

# Return standard deviation
np.std(matrix)

2.5819888974716112

Discussion
Just like with max and min, we can easily get descriptive statistics
about the whole matrix or do calculations along a single axis:

# Find the mean value in each column
np.mean(matrix, axis=0)

array([ 4., 5., 6.])
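
One point the outputs above do not make explicit: np.var and np.std divide by the number of elements N by default (the population statistics). If the sample statistics are wanted instead, the ddof parameter changes the divisor to N - ddof. A minimal sketch on the same matrix:

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

# Default: population variance, sum of squared deviations divided by N = 9
population_var = np.var(matrix)

# ddof=1: sample variance, the same sum divided by N - 1 = 8
sample_var = np.var(matrix, ddof=1)
sample_std = np.std(matrix, ddof=1)

print(population_var, sample_var, sample_std)
```

For this matrix the squared deviations from the mean sum to 60, so the population variance is 60 / 9 and the sample variance is 60 / 8 = 7.5.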

1.10 Reshaping Arrays

Problem
You want to change the shape (number of rows and columns) of an
array without changing the element values.

Solution
Use NumPy’s reshape:

# Load library
import numpy as np

# Create 4x3 matrix
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9],
                   [10, 11, 12]])

# Reshape matrix into 2x6 matrix
matrix.reshape(2, 6)

array([[ 1, 2, 3, 4, 5, 6],
       [ 7, 8, 9, 10, 11, 12]])
Discussion
reshape allows us to restructure an array so that we maintain the
same data but it is organized as a different number of rows and
columns. The only requirement is that the shape of the original and
new matrix contain the same number of elements (i.e., the same
size). We can see the size of a matrix using size:

matrix.size

12

One useful argument in reshape is -1, which effectively means “as
many as needed,” so reshape(1, -1) means one row and as many
columns as needed:

matrix.reshape(1, -1)

array([[ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]])

Finally, if we provide one integer, reshape will return a 1D array of
that length:

matrix.reshape(12)

array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
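
The same -1 placeholder works for the row count, and the size requirement is enforced at runtime. The sketch below builds the 4x3 matrix with np.arange for brevity (an assumption of convenience, not the recipe's construction) and shows both behaviors; the reshape_failed flag is only there to record what happened.

```python
import numpy as np

# Same values as the 4x3 matrix in the solution
matrix = np.arange(1, 13).reshape(4, 3)

# As many rows as needed, one column
column = matrix.reshape(-1, 1)
print(column.shape)  # (12, 1)

# A shape with a different number of elements is rejected
reshape_failed = False
try:
    matrix.reshape(5, 3)  # 15 slots for 12 elements
except ValueError as error:
    reshape_failed = True
    print("Cannot reshape:", error)
```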

1.11 Transposing a Vector or Matrix

Problem
You need to transpose a vector or matrix.
Solution
Use the T attribute:

# Load library
import numpy as np

# Create matrix
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

# Transpose matrix
matrix.T

array([[1, 4, 7],
       [2, 5, 8],
       [3, 6, 9]])

Discussion
Transposing is a common operation in linear algebra where the
column and row indices of each element are swapped. One nuanced
point that is typically overlooked outside of a linear algebra class is
that, technically, a vector cannot be transposed because it is just a
collection of values:

# Transpose vector
np.array([1, 2, 3, 4, 5, 6]).T

array([1, 2, 3, 4, 5, 6])

However, it is common to refer to transposing a vector as converting
a row vector to a column vector (notice the second pair of brackets)
or vice versa:

# Transpose row vector
np.array([[1, 2, 3, 4, 5, 6]]).T

array([[1],
       [2],
       [3],
       [4],
       [5],
       [6]])

1.12 Flattening a Matrix

Problem
You need to transform a matrix into a one-dimensional array.

Solution
Use flatten:

# Load library
import numpy as np

# Create matrix
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

# Flatten matrix
matrix.flatten()

array([1, 2, 3, 4, 5, 6, 7, 8, 9])

Discussion
flatten is a simple method to transform a matrix into a one-
dimensional array. Alternatively, we can use reshape to create a
row vector:

matrix.reshape(1, -1)

array([[1, 2, 3, 4, 5, 6, 7, 8, 9]])

One more common method to flatten arrays is the ravel method.
Unlike flatten, which always returns a copy of the original array,
ravel returns a view of the original array whenever possible and is
therefore slightly faster. The np.ravel function also lets us flatten
lists of arrays, which we can’t do with the flatten method. This
operation is useful for flattening very large arrays and speeding up
code.

# Create one matrix
matrix_a = np.array([[1, 2],
                     [3, 4]])

# Create a second matrix
matrix_b = np.array([[5, 6],
                     [7, 8]])

# Create a list of matrices
matrix_list = [matrix_a, matrix_b]

# Flatten the entire list of matrices
np.ravel(matrix_list)

array([1, 2, 3, 4, 5, 6, 7, 8])
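
The copy-versus-view difference between flatten and ravel can be checked directly. The following is a small sketch using np.shares_memory; it assumes a freshly created, contiguous array, which is the case in which ravel can return a view.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

flattened = matrix.flatten()  # always a copy
raveled = matrix.ravel()      # a view here, because the data is contiguous

print(np.shares_memory(matrix, flattened))  # False
print(np.shares_memory(matrix, raveled))    # True

# Writing through the view changes the original matrix
raveled[0] = 100
print(matrix[0, 0])     # 100
print(flattened[0])     # 1
```

This is the practical consequence to keep in mind: modifying the result of ravel can modify the original array, while the result of flatten is independent.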

1.13 Finding the Rank of a Matrix

Problem
You need to know the rank of a matrix.

Solution
Use NumPy’s linear algebra method matrix_rank:

# Load library
import numpy as np

# Create matrix
matrix = np.array([[1, 1, 1],
                   [1, 1, 10],
                   [1, 1, 15]])

# Return matrix rank
np.linalg.matrix_rank(matrix)

2

Discussion
The rank of a matrix is the dimension of the vector space spanned
by its columns or rows. Finding the rank of a matrix is easy in
NumPy thanks to matrix_rank.
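
To connect the definition to the solution's matrix, the sketch below compares it with an identity matrix. In the solution's matrix the first two columns are identical, so only two columns are linearly independent.

```python
import numpy as np

matrix = np.array([[1, 1, 1],
                   [1, 1, 10],
                   [1, 1, 15]])

# The first two columns are identical, so the columns span only two dimensions
rank = np.linalg.matrix_rank(matrix)
print(rank)  # 2

# An identity matrix has three independent columns, so it has full rank
identity_rank = np.linalg.matrix_rank(np.eye(3))
print(identity_rank)  # 3
```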

See Also
The Rank of a Matrix, CliffsNotes

1.14 Getting the Diagonal of a Matrix

Problem
You need to get the diagonal elements of a matrix.

Solution
Use diagonal:

# Load library
import numpy as np

# Create matrix
matrix = np.array([[1, 2, 3],
                   [2, 4, 6],
                   [3, 8, 9]])

# Return diagonal elements
matrix.diagonal()

array([1, 4, 9])

Discussion
NumPy makes getting the diagonal elements of a matrix easy with
diagonal. It is also possible to get a diagonal off from the main
diagonal by using the offset parameter:

# Return diagonal one above the main diagonal
matrix.diagonal(offset=1)

array([2, 6])

# Return diagonal one below the main diagonal
matrix.diagonal(offset=-1)

array([2, 8])

1.15 Calculating the Trace of a Matrix

Problem
You need to calculate the trace of a matrix.

Solution
Use trace:

# Load library
import numpy as np

# Create matrix
matrix = np.array([[1, 2, 3],
                   [2, 4, 6],
                   [3, 8, 9]])

# Return trace
matrix.trace()

14

Discussion
The trace of a matrix is the sum of the diagonal elements and is
often used under the hood in machine learning methods. Given a
NumPy multidimensional array, we can calculate the trace using
trace. We can also return the diagonal of a matrix and calculate its
sum:

# Return diagonal and sum elements
sum(matrix.diagonal())

14

See Also
The Trace of a Square Matrix

1.16 Calculating Dot Products

Problem
You need to calculate the dot product of two vectors.

Solution
Use NumPy’s dot:

# Load library
import numpy as np

# Create two vectors
vector_a = np.array([1,2,3])
vector_b = np.array([4,5,6])

# Calculate dot product
np.dot(vector_a, vector_b)

32

Discussion
The dot product of two vectors, a and b, is defined as:

$$a \cdot b = \sum_{i=1}^{n} a_i b_i$$

where a_i is the ith element of vector a, b_i is the ith element of
vector b, and n is their shared length. We can use NumPy’s dot
function to calculate the dot product. Alternatively, in Python 3.5+
we can use the @ operator:

# Calculate dot product
vector_a @ vector_b

32
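
The definition can be checked by hand with an element-wise multiplication followed by a sum. This is only an illustrative sketch; np.dot or @ remains the idiomatic way to compute it.

```python
import numpy as np

vector_a = np.array([1, 2, 3])
vector_b = np.array([4, 5, 6])

# Multiply element-wise, then sum: 1*4 + 2*5 + 3*6
manual = np.sum(vector_a * vector_b)

print(manual)                                # 32
print(manual == np.dot(vector_a, vector_b))  # True
```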

See Also
Vector dot product and vector length, Khan Academy
Dot Product, Paul’s Online Math Notes

1.17 Adding and Subtracting Matrices

Problem
You want to add or subtract two matrices.
Solution
Use NumPy’s add and subtract:

# Load library
import numpy as np

# Create matrix
matrix_a = np.array([[1, 1, 1],
                     [1, 1, 1],
                     [1, 1, 2]])

# Create matrix
matrix_b = np.array([[1, 3, 1],
                     [1, 3, 1],
                     [1, 3, 8]])

# Add two matrices
np.add(matrix_a, matrix_b)

array([[ 2, 4, 2],
       [ 2, 4, 2],
       [ 2, 4, 10]])

# Subtract two matrices
np.subtract(matrix_a, matrix_b)

array([[ 0, -2, 0],
       [ 0, -2, 0],
       [ 0, -2, -6]])

Discussion
Alternatively, we can simply use the + and - operators:

# Add two matrices
matrix_a + matrix_b

array([[ 2, 4, 2],
       [ 2, 4, 2],
       [ 2, 4, 10]])
1.18 Multiplying Matrices

Problem
You want to multiply two matrices.

Solution
Use NumPy’s dot:

# Load library
import numpy as np

# Create matrix
matrix_a = np.array([[1, 1],
                     [1, 2]])

# Create matrix
matrix_b = np.array([[1, 3],
                     [1, 2]])

# Multiply two matrices
np.dot(matrix_a, matrix_b)

array([[2, 5],
       [3, 7]])

Discussion
Alternatively, in Python 3.5+ we can use the @ operator:

# Multiply two matrices
matrix_a @ matrix_b

array([[2, 5],
       [3, 7]])

If we want to do element-wise multiplication, we can use the *
operator:

# Multiply two matrices element-wise
matrix_a * matrix_b

array([[1, 3],
       [1, 4]])

See Also
Array vs. Matrix Operations, MathWorks

1.19 Inverting a Matrix

Problem
You want to calculate the inverse of a square matrix.

Solution
Use NumPy’s linear algebra inv method:

# Load library
import numpy as np

# Create matrix
matrix = np.array([[1, 4],
                   [2, 5]])

# Calculate inverse of matrix
np.linalg.inv(matrix)

array([[-1.66666667, 1.33333333],
       [ 0.66666667, -0.33333333]])

Discussion
The inverse of a square matrix, A, is a second matrix A^{-1}, such that:

$$A A^{-1} = I$$

where I is the identity matrix. In NumPy we can use linalg.inv
to calculate A^{-1} if it exists. To see this in action, we can multiply a
matrix by its inverse and the result is the identity matrix (up to
floating-point rounding):

# Multiply matrix and its inverse
matrix @ np.linalg.inv(matrix)

array([[ 1., 0.],
       [ 0., 1.]])
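
Because the inverse is computed in floating point, it is safer to compare against the identity matrix with a tolerance than to expect exact equality. The sketch below does that with np.allclose, and also shows what happens when no inverse exists; the singular matrix and the is_singular flag are made up for illustration.

```python
import numpy as np

matrix = np.array([[1, 4],
                   [2, 5]])
inverse = np.linalg.inv(matrix)

# Compare against the identity with a tolerance rather than exact equality
is_identity = np.allclose(matrix @ inverse, np.eye(2))
print(is_identity)  # True

# A singular matrix (second row is twice the first) has no inverse
singular = np.array([[1, 2],
                     [2, 4]])
is_singular = False
try:
    np.linalg.inv(singular)
except np.linalg.LinAlgError as error:
    is_singular = True
    print("Not invertible:", error)
```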

See Also
Inverse of a Matrix

1.20 Generating Random Values

Problem
You want to generate pseudorandom values.

Solution
Use NumPy’s random:

# Load library
import numpy as np

# Set seed
np.random.seed(0)

# Generate three random floats between 0.0 and 1.0
np.random.random(3)

array([ 0.5488135 , 0.71518937, 0.60276338])


Discussion
NumPy offers a wide variety of means to generate random numbers,
many more than can be covered here. In our solution we generated
floats; however, it is also common to generate integers:

# Generate three random integers between 0 and 10
np.random.randint(0, 11, 3)

array([3, 7, 9])

Alternatively, we can generate numbers by drawing them from a
distribution:

# Draw three numbers from a normal distribution with mean 0.0
# and standard deviation of 1.0
np.random.normal(0.0, 1.0, 3)

array([-1.42232584, 1.52006949, -0.29139398])

# Draw three numbers from a logistic distribution with mean 0.0
# and scale of 1.0
np.random.logistic(0.0, 1.0, 3)

array([-0.98118713, -0.08939902, 1.46416405])

# Draw three numbers greater than or equal to 1.0 and less than 2.0
np.random.uniform(1.0, 2.0, 3)

array([ 1.47997717, 1.3927848 , 1.83607876])

Finally, it can sometimes be useful to return the same random
numbers multiple times to get predictable, repeatable results. We
can do this by setting the “seed” (an integer) of the pseudorandom
generator. Random processes with the same seed will always
produce the same output. We will use seeds throughout this book so
that the code you see in the book and the code you run on your
computer produce the same results.
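
As an aside, newer NumPy versions also provide a Generator interface through np.random.default_rng, which keeps the seed local to a generator object instead of setting global state. The sketch below is not the approach used in this recipe; it only illustrates the same seeding idea with that interface, and its values will differ from the np.random.seed outputs shown above.

```python
import numpy as np

# Two independent generators created with the same seed
rng_a = np.random.default_rng(seed=0)
rng_b = np.random.default_rng(seed=0)

# The same seed yields the same sequence of floats in [0.0, 1.0)
floats_a = rng_a.random(3)
floats_b = rng_b.random(3)
print(np.array_equal(floats_a, floats_b))  # True

# Three random integers between 0 and 10 (the upper bound 11 is exclusive)
integers = rng_a.integers(low=0, high=11, size=3)
print(integers)
```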
Chapter 2. Loading Data

A NOTE FOR EARLY RELEASE READERS


With Early Release ebooks, you get books in their earliest form—the authors’
raw and unedited content as they write—so you can take advantage of these
technologies long before the official release of these titles.
This will be the 2nd chapter of the final book.
If you have comments about how we might improve the content and/or
examples in this book, or if you notice missing material within this chapter,
please reach out to the authors at feedback.mlpythoncookbook@gmail.com.

2.0 Introduction
The first step in any machine learning endeavor is to get the raw
data into our system. The raw data might be a logfile, dataset file,
database, or cloud blob store such as Amazon S3. Furthermore,
often we will want to retrieve data from multiple sources.
The recipes in this chapter look at methods of loading data from a
variety of sources, including CSV files and SQL databases. We also
cover methods of generating simulated data with desirable
properties for experimentation. Finally, while there are many ways to
load data in the Python ecosystem, we will focus on using the
pandas library’s extensive set of methods for loading external data,
and using scikit-learn—an open source machine learning library in
Python—for generating simulated data.
2.1 Loading a Sample Dataset

Problem
You want to load a preexisting sample dataset from the scikit-learn
library.

Solution
scikit-learn comes with a number of popular datasets for you to use:

# Load scikit-learn's datasets
from sklearn import datasets

# Load digits dataset
digits = datasets.load_digits()

# Create features matrix
features = digits.data

# Create target vector
target = digits.target

# View first observation
features[0]

array([ 0.,  0.,  5., 13.,  9.,  1.,  0.,  0.,  0.,  0., 13.,
       15., 10., 15.,  5.,  0.,  0.,  3., 15.,  2.,  0., 11.,
        8.,  0.,  0.,  4., 12.,  0.,  0.,  8.,  8.,  0.,  0.,
        5.,  8.,  0.,  0.,  9.,  8.,  0.,  0.,  4., 11.,  0.,
        1., 12.,  7.,  0.,  0.,  2., 14.,  5., 10., 12.,  0.,
        0.,  0.,  0.,  6., 13., 10.,  0.,  0.,  0.])
Discussion
Often we do not want to go through the work of loading,
transforming, and cleaning a real-world dataset before we can
explore some machine learning algorithm or method. Luckily, scikit-
learn comes with some common datasets we can quickly load. These
datasets are often called “toy” datasets because they are far smaller
and cleaner than a dataset we would see in the real world. Some
popular sample datasets in scikit-learn are:
load_boston
Contains 506 observations on Boston housing prices. It is a good
dataset for exploring regression algorithms. Note that this loader
was deprecated in scikit-learn 1.0 and removed in version 1.2.

load_iris
Contains 150 observations on the measurements of Iris flowers.
It is a good dataset for exploring classification algorithms.

load_digits
Contains 1,797 observations from images of handwritten digits. It
is a good dataset for teaching image classification.
To see more details on any of the datasets above, you can print the
DESCR attribute:

# Load scikit-learn's datasets
from sklearn import datasets

# Load digits dataset
digits = datasets.load_digits()

# Print the attribute
print(digits.DESCR)

.. _digits_dataset:

Optical recognition of handwritten digits dataset
--------------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 1797
    :Number of Attributes: 64
    :Attribute Information: 8x8 image of integer pixels in the range 0..16.
    :Missing Attribute Values: None
    :Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)
    :Date: July; 1998
...
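
The description above can be tied back to the arrays themselves. As a minimal sketch, the code below checks the shapes of the feature matrix and target vector and reshapes the first observation back into its 8x8 pixel grid; the name first_image is illustrative only.

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()
features = digits.data
target = digits.target

# 1,797 observations, each with 64 pixel features
print(features.shape)  # (1797, 64)
print(target.shape)    # (1797,)

# Each row is a flattened 8x8 image; reshape to recover the pixel grid
first_image = features[0].reshape(8, 8)
print(first_image.shape)  # (8, 8)

# The label of the first observation
print(target[0])  # 0
```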

See Also
scikit-learn toy datasets
The Digit Dataset

2.2 Creating a Simulated Dataset

Problem
You need to generate a dataset of simulated data.

Solution
scikit-learn offers many methods for creating simulated data. Of
those, three methods are particularly useful: make_regression,
make_classification, and make_blobs.
When we want a dataset designed to be used with linear regression,
make_regression is a good choice:

# Load library
from sklearn.datasets import make_regression

# Generate features matrix, target vector, and the true coefficients
features, target, coefficients = make_regression(n_samples = 100,
                                                 n_features = 3,
                                                 n_informative = 3,
                                                 n_targets = 1,
                                                 noise = 0.0,
                                                 coef = True,
                                                 random_state = 1)

# View feature matrix and target vector
print('Feature Matrix\n', features[:3])
print('Target Vector\n', target[:3])

Feature Matrix
[[ 1.29322588 -0.61736206 -0.11044703]
[-2.793085 0.36633201 1.93752881]
[ 0.80186103 -0.18656977 0.0465673 ]]
Target Vector
[-10.37865986 25.5124503 19.67705609]

If we are interested in creating a simulated dataset for classification,
we can use make_classification:

# Load library
from sklearn.datasets import make_classification

# Generate features matrix and target vector
features, target = make_classification(n_samples = 100,
                                       n_features = 3,
                                       n_informative = 3,
                                       n_redundant = 0,
                                       n_classes = 2,
                                       weights = [.25, .75],
                                       random_state = 1)

# View feature matrix and target vector
print('Feature Matrix\n', features[:3])
print('Target Vector\n', target[:3])

Feature Matrix
[[ 1.06354768 -1.42632219 1.02163151]
[ 0.23156977 1.49535261 0.33251578]
[ 0.15972951 0.83533515 -0.40869554]]
Target Vector
[1 0 0]

Finally, if we want a dataset designed to work well with clustering
techniques, scikit-learn offers make_blobs:

# Load library
from sklearn.datasets import make_blobs

# Generate feature matrix and target vector
features, target = make_blobs(n_samples = 100,
                              n_features = 2,
                              centers = 3,
                              cluster_std = 0.5,
                              shuffle = True,
                              random_state = 1)

# View feature matrix and target vector
print('Feature Matrix\n', features[:3])
print('Target Vector\n', target[:3])

Feature Matrix
[[ -1.22685609 3.25572052]
[ -9.57463218 -4.38310652]
[-10.71976941 -4.20558148]]
Target Vector
[0 1 1]

Discussion
As might be apparent from the solutions, make_regression
returns a feature matrix of float values and a target vector of float
values, while make_classification and make_blobs return a
feature matrix of float values and a target vector of integers
representing membership in a class.
scikit-learn’s simulated datasets offer extensive options to control the
type of data generated. scikit-learn’s documentation contains a full
description of all the parameters, but a few are worth noting.
In make_regression and make_classification,
n_informative determines the number of features that are used
to generate the target vector. If n_informative is less than the
total number of features (n_features), the resulting dataset will
have redundant features that can be identified through feature
selection techniques.
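A minimal sketch of the effect of n_informative, assuming scikit-learn and NumPy are installed. The sample and feature counts are illustrative choices, not values from the recipe. With coef = True, make_regression also returns the coefficients of the underlying linear model, and the features that were not used to generate the target should have a coefficient of zero:

```python
# Load libraries
import numpy as np
from sklearn.datasets import make_regression

# Generate five features, only two of which drive the target
# (the counts are illustrative assumptions)
features, target, coefficients = make_regression(n_samples = 100,
                                                 n_features = 5,
                                                 n_informative = 2,
                                                 noise = 0.0,
                                                 coef = True,
                                                 random_state = 1)

# Count the features with a nonzero true coefficient
n_used = np.count_nonzero(coefficients)
print('Informative features:', n_used)
```

Comparing the returned coefficients with the output of a feature selection method is one way to check whether that method recovers the informative features.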
In addition, make_classification contains a weights
parameter that allows us to simulate datasets with imbalanced
classes. For example, weights = [.25, .75] would return a
dataset with 25% of observations belonging to one class and 75% of
observations belonging to a second class.
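The class balance can be checked directly. The sketch below assumes scikit-learn and NumPy are installed and uses a larger n_samples than the recipe (an illustrative choice) so the shares are easier to read. The shares are only approximately 25% and 75%, because make_classification by default reassigns a small random fraction of labels (its flip_y parameter):

```python
# Load libraries
import numpy as np
from sklearn.datasets import make_classification

# Generate an imbalanced two-class dataset
# (n_samples = 1000 is an illustrative choice)
features, target = make_classification(n_samples = 1000,
                                       n_features = 3,
                                       n_informative = 3,
                                       n_redundant = 0,
                                       n_classes = 2,
                                       weights = [.25, .75],
                                       random_state = 1)

# Share of observations in each class
class_shares = np.bincount(target) / len(target)
print('Class shares:', class_shares)
```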
For make_blobs, the centers parameter determines the number
of clusters generated. Using the matplotlib visualization library,
we can visualize the clusters generated by make_blobs:

# Load library
import matplotlib.pyplot as plt

# View scatterplot
plt.scatter(features[:,0], features[:,1], c=target)
plt.show()

See Also
make_regression documentation
make_classification documentation
make_blobs documentation

2.3 Loading a CSV File

Problem
You need to import a comma-separated values (CSV) file.

Solution
Use the pandas library’s read_csv to load a local or hosted CSV
file:

# Load library
import pandas as pd

# Create URL
url = 'https://raw.githubusercontent.com/chrisalbon/sim_data/master/data.csv'

# Load dataset
dataframe = pd.read_csv(url)

# View first two rows
dataframe.head(2)

   integer             datetime  category
0        5  2015-01-01 00:00:00         0
1        5  2015-01-01 00:00:01         0

Discussion
There are two things to note about loading CSV files. First, it is often
useful to take a quick look at the contents of the file before loading.
It can be very helpful to see how a dataset is structured beforehand
and what parameters we need to set to load in the file. Second,
read_csv has over 30 parameters and therefore the documentation
can be daunting. Fortunately, those parameters are mostly there to
allow it to handle a wide variety of CSV formats. For example, CSV
files get their names from the fact that the values are literally
separated by commas (e.g., one row might be 2,"2015-01-01
00:00:00",0); however, it is common for “CSV” files to use other
characters as separators, like tabs. pandas’ sep parameter allows us
to define the delimiter used in the file. Although it is not always the
case, a common formatting issue with CSV files is that the first line
of the file is used to define column headers (e.g., integer,
datetime, category in our solution). The header parameter
allows us to specify if or where a header row exists. If a header row
does not exist, we set header=None.
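As a small illustration of sep and header, the sketch below reads tab-separated text that has no header row. The sample rows and the column names passed to names are made up for the example, and the text is read from an in-memory string rather than a file:

```python
# Load libraries
from io import StringIO
import pandas as pd

# Tab-separated text with no header row (made-up sample data)
raw_text = "5\t2015-01-01 00:00:00\t0\n5\t2015-01-01 00:00:01\t0\n"

# Set the delimiter, state that there is no header row,
# and supply our own column names
dataframe = pd.read_csv(StringIO(raw_text),
                        sep = '\t',
                        header = None,
                        names = ['integer', 'datetime', 'category'])

# View the rows
print(dataframe)
```

Without header = None, pandas would treat the first data row as the column names.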
2.4 Loading an Excel File

Problem
You need to import an Excel spreadsheet.

Solution
Use the pandas library’s read_excel to load an Excel spreadsheet:

# Load library
import pandas as pd

# Create URL
url = 'https://raw.githubusercontent.com/chrisalbon/sim_data/master/data.xlsx'

# Load data
dataframe = pd.read_excel(url, sheet_name=0, header=1)

# View the first two rows
dataframe.head(2)

   5  2015-01-01 00:00:00  0
0  5  2015-01-01 00:00:01  0
1  9  2015-01-01 00:00:02  0

Discussion
This solution is similar to our solution for reading CSV files. The main
difference is the additional parameter, sheet_name, that specifies
which sheet in the Excel file we wish to load. sheet_name can
accept both strings containing the name of the sheet and integers
pointing to sheet positions (zero-indexed). If we need to load
multiple sheets, we can pass them as a list. For example,
sheet_name=[0, 1, 2, "Monthly Sales"] will return a dictionary of pandas
DataFrames containing the first, second, and third sheets and the
sheet named Monthly Sales.

2.5 Loading a JSON File

Problem
You need to load a JSON file for data preprocessing.

Solution
The pandas library provides read_json to convert a JSON file into
a pandas object:

# Load library
import pandas as pd

# Create URL
url = 'https://raw.githubusercontent.com/chrisalbon/sim_data/master/data.json'

# Load data
dataframe = pd.read_json(url, orient='columns')

# View the first two rows
dataframe.head(2)

   category             datetime  integer
0         0  2015-01-01 00:00:00        5
1         0  2015-01-01 00:00:01        5

Discussion
Importing JSON files into pandas is similar to the last few recipes we
have seen. The key difference is the orient parameter, which
indicates to pandas how the JSON file is structured. However, it
might take some experimenting to figure out which argument
(split, records, index, columns, or values) is the right one.
Another helpful tool pandas offers is json_normalize, which can
help convert semistructured JSON data into a pandas DataFrame.
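
To make the orient argument concrete, the sketch below loads the same two rows from two different JSON layouts and then flattens a nested record with json_normalize. All of the JSON strings are made-up sample data, read from in-memory strings rather than files:

```python
# Load libraries
from io import StringIO
import pandas as pd

# The same two rows in two layouts (made-up sample data)
records_json = '[{"integer": 5, "category": 0}, {"integer": 9, "category": 0}]'
columns_json = '{"integer": {"0": 5, "1": 9}, "category": {"0": 0, "1": 0}}'

# Each layout needs the matching orient argument
from_records = pd.read_json(StringIO(records_json), orient='records')
from_columns = pd.read_json(StringIO(columns_json), orient='columns')

# Flatten nested records into dotted column names
nested = [{"id": 1, "info": {"name": "a", "score": 2}}]
flattened = pd.json_normalize(nested)

# View the results
print(from_records)
print(from_columns)
print(flattened.columns.tolist())
```

Looking at the first few characters of the file (a list of objects versus an object keyed by column name) is often enough to tell which orient applies.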

See Also
json_normalize documentation

2.6 Loading a Parquet File

Problem
You need to load a Parquet file.

Solution
The pandas read_parquet function allows us to read in Parquet
files:

# Load library
import pandas as pd

# Create URL
url = 'https://machine-learning-python-cookbook.s3.amazonaws.com/data.parquet'

# Load data
dataframe = pd.read_parquet(url)

# View the first two rows
dataframe.head(2)

   category             datetime  integer
0         0  2015-01-01 00:00:00        5
1         0  2015-01-01 00:00:01        5

Discussion
Parquet is a popular data storage format in the large data space. It
is often used with big data tools such as Hadoop and Spark. While
PySpark is outside the focus of this book, it’s highly likely that companies
operating at a large scale will use an efficient data storage format such
as Parquet, and it’s valuable to know how to read it into a dataframe
and manipulate it.

See Also
Apache Parquet Documentation

2.7 Querying a SQLite Database

Problem
You need to load data from a database using the structured query
language (SQL).

Solution
pandas’ read_sql_query allows us to make a SQL query to a
database and load it:

# Load libraries
import pandas as pd
from sqlalchemy import create_engine

# Create a connection to the database
database_connection = create_engine('sqlite:///sample.db')

# Load data
materially different from that of other inventors: he may have
been kept for years on the threshold of success, vainly trying to
remove some obstruction which blocked up his way. If we suppose
that Gutenberg began, as a novice would probably begin, by
founding types of soft lead in moulds of sand, the printer will
understand why he would condemn the types made by this method.
If he afterward made a mould of hard metal, and founded types in
matrices of brass, we can understand that, in the beginning, he
had abundant reason to reject his first types for inaccuracies of
body and irregularities of height and lining. To him as to all true
inventors, there could be no patching up of defects in plan or in
construction. It was necessary to throw away all the defective
work and to begin anew. Experiments like these consume a great
deal of time and quite as much of money. The testimony shows
that the money contributed by some of the partners in
the association had been collected with difficulty. We may
suppose that when this had been spent to no purpose,
they were unable or unwilling to contribute any more.

Fac-simile of the Types of a Donatus attributed to Gutenberg at Strasburg. [From Bernard.]

It may be that the failure of the Strasburg associates
was due solely to the audacity of Gutenberg, whose plans
were always beyond his pecuniary ability. Even then he
may have purposed the printing of the great Bible of 36
lines in three volumes, which he afterward completed in
an admirable manner. In trying to accomplish much, he
may have failed to do anything of value. Whatever
the reason, it is certain that his partners abandoned
Gutenberg and his invention. We read no more of Riffe
and Heilmann in connection with typography.
There is evidence that Gutenberg was financially
embarrassed after the trial. On the second day of
January, 1441, Gutenberg and the knight Luthold von
Ramstein gave security for the annual payment of five
pounds to the Chapter of St. Thomas at Strasburg, in
consideration of the present sum of one hundred pounds
paid by the chapter to Gutenberg. On the fifteenth day of
December, 1442, John Gutenberg and Martin Brether sold
to the same corporation for the present sum of eighty
pounds, an annual income of four pounds, from the
revenues of the town of Mentz. Gutenberg had inherited
this income from his uncle, Johan Lehheimer, secular
judge of that city. The tax-book of the city shows that he
was in arrear for taxes between the years 1436 and 1440.
In the tax-book for 1443, it is plainly recorded that
Gutenberg’s tax was paid by the Ennel Gutenbergen who
is supposed to have been his wife. Gutenberg had reason
to be disheartened. He had spent all his money; had
alienated his partners; had apparently wasted a great
deal of time in fruitless experiments; had damaged his
reputation as a man of business, and seemed further
from success than when he revealed his plans to his
partners.
It is the common belief that Gutenberg went direct
from Strasburg to Mentz. Winaricky, on the contrary, says
that he forsook Strasburg for the University of Prague, at
which institution he took the degree of bachelor of arts in
1445, and in which city he resided, until it was besieged,
and he was obliged to leave, in 1448. There is no
trustworthy authority for either statement. The period in
his life between 1442 and 1448 is blank, but it is not
probable that he was idle.
XXI

Gutenberg appears in Mentz as a Borrower of Money . . . Was then Ready to
Begin as a Printer . . . Donatus of 1451 . . . Letters of Indulgence of 1454
and 1455 . . . Made from Founded Types . . . Circumstances attending
their sale . . . Fac-simile of Holbein’s Satire . . . Fac-simile of the Letter
dated 1454, with a Translation . . . Almanac of 1455 . . . Gutenberg’s two
Bibles . . . Dates of Publication Uncertain . . . Bible of 36 lines, with Fac-
simile . . . Evidences of its probable Priority . . . Apparently an
Unsuccessful Book . . . John Fust, with Portrait . . . Fust’s Contract with
Gutenberg in 1450 . . . Probable Beginning of the Bible of 42 lines . . .
Description of Book, with Fac-simile . . . Colophon of the Illuminator . . .
Must have been Printed before 1456 . . . Fust brings Suit against
Gutenberg . . . Official Record of the Trial . . . Gutenberg’s Inability to pay
his debt . . . Suit was a Surprise . . . Portrait of Gutenberg . . . Fust
deposes Gutenberg and installs Schœffer at the head of the Office.

GUTENBERG’S last act upon record in Strasburg was
the selling out of the last remnant of his inheritance. The
first evidence we have of his return to Mentz is an entry,
on the sixth day of October, 1448, in a record of legal
contracts, in which he appears as a borrower of money. It
seems that Gutenberg had persuaded his kinsman, Arnold
Gelthus, to borrow from Rynhard Brömser and John
Rodenstein, the sum of 150 guilders, for the use of which
Gutenberg promised to pay the yearly interest of 8½
guilders. Gutenberg had no securities to offer; Gelthus
had to pledge the rents of some houses for this purpose.
How this money was to be used is not stated, but it may
be presumed that Gutenberg needed it for the
development of his grand invention. His plans,
whatever they were, met with the approbation of his
uncle John Gensfleisch, by whose permission he occupied
the leased house[248] Zum Jungen, which he used not only
for a dwelling, but as a printing office.
At this time Gutenberg was, no doubt, nearly perfect in
his knowledge of the correct theory of type-founding, and
had also acquired fair practice as a printer. Helbig thinks
that he had ready the types of the Bible of 36 lines.
Madden says that he was then, or very soon after,
engaged in printing a small edition of this book. There is
evidence that these types were in use at least as early as
1451. Two leaves of an early typographic edition of the
Donatus, 27 lines to the page, printed on vellum from the
types of the Bible of 36 lines, have been discovered near
Mentz, in the original binding of an old account book of
1451.[249] In one word the letter i is reversed, a positive
proof that it was printed from types, and not from blocks.
The ink is still very black, but Fischer says that it will not
resist water.[250] As this fragment shows the large types of
the Bible of 36 lines in their most primitive form, it
authorizes the belief that it should have been printed by
Gutenberg soon after his return to Mentz.
During the interval between 1440 and 1451, about
which history records so little, Gutenberg may have
printed many trifles. He could not have been always
unsuccessful: he could not have borrowed money for
more than ten years, without a demonstration of his
ability to print and to sell printed work.
It is probable that he had to postpone his grand plans,
and that his necessities compelled him to begin the
practice of his new art with the printing of trivial work.
There is evidence that the branch of typography which is
now known as job printing is as old as, if not older than,
book printing. This evidence is furnished in the Letters of
Indulgence, which have distinction as the first works with
type-printed dates.
Three distinct editions of the Letters of Indulgence are
known. The copies are dated 1454 or 1455, but are more
clearly defined by the number of the lines in each edition,
as Letters of 30, or 31, or 32 lines. Each Letter is
printed from movable types, in black ink, upon one side of
a stout piece of parchment, about nine inches high and
thirteen inches wide. The form of words is substantially
the same in all editions, and all copies present the same
general typographical features, as if they were the work of
the same printing office. In all copies, the presswork is
good; they seem to have been printed by a properly
constructed press on damp vellum with ink mixed in oil.

Fac-simile of the Types of the Donatus of 1451. [From Fischer.]

The types of the three editions have a general
resemblance,[251] yet they differ seriously as to face and
body. They were certainly cast from different matrices
and adjustments of the mould,[252] and were composed by
different compositors. In the edition of 30 lines, the types
of the text are on a body smaller than English, and those
of the large lines are on Paragon body; in the edition of
31 lines the types of the text are on English body, and
those of the large lines approximate Double-pica body.

Illustration of type bodies: Pica body, English body, Paragon body, Double-pica body. [From De la Borde.]

The types on Double-pica body are those of the
Donatus of 1451 and the Bible of 36 lines; the types on
Paragon body are those of the Bible of 42 lines. The
appearance of these types in the Bibles is presumptive
evidence that the printer of the Bibles was the printer of the
Letters. The small types are unique; they were never
used, so far as we know, for any other work. The large
initials may have been engraved on wood, but the text
and the display lines were founded types. The
illustration on the previous page shows that although the
matrices were fitted with closeness, each type was
founded on a square body.
The circumstances connected with the publication of
the Letters require more than a passing notice, for they
present the first specific indication of a demand for
printing. These circumstances give us a glimmer of the
corruption of some of the men who sold the indulgences, a
corruption which, in the next century, brought down upon
the sellers and the system the scorn of Holbein and the
wrath of Luther.

Fac-simile of Holbein’s Satire on the Sale of Indulgences. [From Woltmann.]
The canon at the right absolves the kneeling young man, but points significantly to the
huge money-chest into which the widow puts her mite. Three Dominicans, seated at
the table, are preparing and selling indulgences: one of them, holding back the letter,
greedily counts the money as it is paid down; another pauses in his writing, to repulse
the penitent but penniless cripple; another is leering at the woman whose letter he
delays. The pope, enthroned in the nave, and surrounded by cardinals, is giving a
commission for the sale of the letters.

On the twelfth day of April, 1451, a plenary indulgence
of three years was accorded by Pope Nicholas V to all
who, from May 1, 1452, to May 1, 1455, should properly
contribute with money to the aid of the alarmed king of
Cyprus, then threatened by the Turks. Paul Zappe, an
ambassador of the king of Cyprus, selected John de
Castro as chief commissioner for the sale of the
indulgences in Germany. Theodoric, archbishop of Mentz,
gave him full permission to sell them, but held the
commissioner accountable for the moneys collected. The
precaution was justified. When the dreaded news of the
capture of Constantinople (May 29, 1453) was received,
John de Castro, thinking that Cyprus had also been taken,
squandered the money he had collected. De Castro was
arrested, convicted and sent to prison, but the scandal
that had been created by the embezzlement greatly
injured the sale of the indulgences. As the permission to
sell indulgences expired by limitation on May 1, 1455,
Zappe, the chief commissioner, made renewed and more
vigorous efforts to promote the sale. It was found that, in
the limited time allowed for sale, the customary process
of copying was entirely too slow. There was, also, the
liability that a hurried copyist would produce inexact
copies; that an unscrupulous copyist or seller would issue
spurious copies. These seem to have been the reasons
that led Zappe to have the documents printed, which was
accordingly done, with blank spaces for the insertion of
the name of the buyer and the signature of the seller.
The typography of this Letter of 31 lines is much better
than that of the Donatus, but it has many blemishes. The
text is deformed with abbreviations; the lines are not
evenly spaced out; the capital letters of the text are
rudely drawn and carelessly cut. The white space below
the sixteenth line, and the space and the crookedness in
the three lines at the foot, are evidences that the types
were not securely fastened in the chase. These faults
provoke notice, but it must be admitted that the types
were fairly fitted and stand in decent line. They were
obviously cast in moulds of metal; it would be
impracticable to make types so small in moulds of sand.

Reduced Fac-simile of a Letter of Indulgence, dated 1454. [From De la Borde.]

Translation.
To all the faithful followers of Christ who may read this letter, Paul
Zappe, counselor, ambassador, and administrator-general of his
gracious majesty, the king of Cyprus, sends greeting:
Whereas the Most Holy Father in Christ, our Lord, Nicholas V, by
divine grace, pope, mercifully compassionating the afflictions of the
kingdom of Cyprus from those most treacherous enemies of the
Cross of Christ, the Turks and Saracens, in an earnest exhortation,
by the sprinkling of the blood of our Lord Jesus Christ, freely
granted to all those faithful followers of Christ, wheresoever
established, who, within three years from the first day of May, in
the year of our Lord 1452, should piously contribute, according to
their ability, more or less, as it should seem good to their own
consciences, to the procurators, or their deputies, for the defense of
the Catholic religion and the aforementioned kingdom,—that
confessors, secular and regular, chosen by themselves, having
heard their confessions for excesses, crimes, and faults, however
great, even for those hitherto reserved exclusively for the apostolic
see to remit, should be licensed to pronounce due absolution upon
them, and enjoin salutary penance; and, also, that they might
absolve those persons, if they should humbly beseech it, who,
perchance might be suffering excommunication, suspension, and
other sentences, censures, and ecclesiastical punishments,
instituted by canon law, or promulgated by man,—salutary penance
being required, or other satisfaction which might be enjoined by
canon law, varying according to the nature of the offence; and,
also, that they might be empowered by apostolic authority to grant
to those who were truly penitent, and confessed their guilt, or if
perchance, on account of the loss of speech, they could not
confess, those who gave outward demonstrations of contrition—the
fullest indulgence of all their sins, and a full remission, as well
during life as in the hour of death—reparation being made by them
if they should survive, or by their heirs if they should then die: And
the penance required after the granting of the indulgence is this—
that they should fast throughout a whole year on every Friday, or
some other day of the week, the lawful hindrances to performance
being prescribed by the regular usage of the Church, a vow or any
other thing not standing in the way of it; and as for those
prevented from so doing in the stated year, or any part of it, they
should fast in the following year, or in any year they can; and if they
should not be able conveniently to fulfill the required fast in any of
the years, or any part of them, the confessor, for that purpose shall
be at liberty to commute it for other acts of charity, which they
should be equally bound to do: And all this, so that they presume
not, which God forbid, to sin from the assurance of remission of this
kind, for otherwise, that which is called concession, whereby they
are admitted to full remission in the hour of death, and remission,
which, as it is promised, leads them to sin with assurance, would be
of no weight and validity: And whereas the devout Judocus Ott von
Apspach, in order to obtain the promised indulgence, according to
his ability hath piously contributed to the above-named laudable
purpose, he is entitled to enjoy the benefit of indulgence, of this
nature. In witness of the truth of the above concession, the seal
ordained for this purpose is affixed. Given at Mentz in the year of
our Lord 1454, on the last day of December.
THE FULLEST FORM OF ABSOLUTION AND REMISSION DURING
LIFE: May our Lord Jesus Christ bestow on thee his most holy and
gracious mercy; may he absolve thee, both by his own authority
and that of the blessed Peter and Paul, His apostles; and by the
authority apostolic committed unto me, and conceded on thy behalf,
I absolve thee from all thy sins repented for with contrition,
confessed and forgotten, as also from all carnal sins, excesses,
crimes and delinquencies ever so grievous, and whose cognizance is
reserved to the Holy See, as well as from any ecclesiastical
judgment, censure, and punishment, promulgated either by law or
by man, if thou hast incurred any,—giving thee plenary indulgence
and remission of all thy sins, inasmuch as in this matter the keys of
the Holy Mother Church do avail. In the name of the Father, and the
Son, and the Holy Ghost. Amen.
THE PLENARY FORM OF REMISSION AT THE POINT OF DEATH: May
our Lord [as above]. I absolve thee from all thy sins, with contrition
repented for, confessed and forgotten, restoring thee to the unity of
the faithful, and the partaking of the sacraments of the Church,
releasing thee from the torments of purgatory, which thou hast
incurred, by giving thee plenary remission of all thy sins, inasmuch
as in this matter the keys of the Mother Church do avail. In the
name of the Father, and the Son, and the Holy Ghost. Amen.
Joseph, abbot of the Monastery of Saint Burckard,
Duly qualified to make this engagement.

Eighteen copies of these Letters of Indulgence are
known, all bearing the printed date of 1454 or of 1455.
The places where they were sold having been written on
the document by the seller, we discover that they must
have been sold over a large territory, for one was issued
at Copenhagen, another at Nuremberg, and another at
Cologne. The large number of copies preserved is
evidence that many copies must have been printed.
It is probable that Gutenberg was required to compose
and print the form at three different times; but we do not
know why he found it necessary to make a new face of
text type for the second and third editions,[253] for it is
very plain that the types of the first edition were not
worn out.
The Appeal of Christianity against the Turks,
sometimes called the Almanac of 1455, is another small
work attributed to Gutenberg. It is a little quarto of six
printed leaves, in German verse, in the large type of the
Bible of 36 lines. As it contains a calendar for the year
1455, it is supposed that it was printed at the close of
1454. Its typographical appearance is curious: the type
was large, the page was narrow, and the compositor ran
the lines together as in prose, marking the beginning of
every verse with a capital, and its ending by a fanciful
arrangement of four full points. It is the first
typographic work in German, and the first work in that
language which can be attributed to Gutenberg. But one
copy of this book is known.
Gutenberg’s fame as a great printer is more justly
based on his two editions in folio of the Holy Bible in
Latin. The breadth of his mind, and his faith in the
comprehensiveness of his invention, are more fully set
forth by his selection of a book of so formidable a nature.
There was an admirable propriety in his determination
that his new art should be fairly introduced to the reading
world by the book known throughout Christendom as
The Book. These two editions of the Bible are most
clearly defined by the specification of the number of lines
to the page in the columns of each book: one is the Bible
of 42 lines,[254] in types of Paragon body, usually bound in
two volumes; the other is the Bible of 36 lines,[255] in
types of Double-pica body, usually bound in three
volumes.
It is not certainly known which was printed first. Each
edition was published without printed date, and, like all
other works by Gutenberg, without name or place of
printer. They were not accurately described by any
contemporary author. In the sixteenth century they were
obsolete, and the tradition that they had been printed by
Gutenberg was entirely lost. When a copy of the Bible of
42 lines was discovered in the library of Cardinal Mazarin,
and was identified as the work of John Gutenberg, it was
not known that there was another edition. The Bible of 42
lines was consequently regarded as the first—as the book
described by Zell, which, he says, was printed in 1450.
This belief was strengthened by the subsequent
discovery, in another copy of this edition, of the
certificate of an illuminator that, in the year 1456, he had
finished his task of illumination in the book. More than
twenty copies of this edition (seven of which are on
vellum) have been found, and they have generally been
sold and bought as copies of the first edition.
The Bible of 36 lines was definitely described for the
first time by the bibliographer Schwartz, who, in 1728,
discovered a copy in the library of a monastery near
Mentz. In the old manuscript catalogue of this library was
a note, stating that this book had been given to the
monastery by John Gutenberg and his associates.
Schwartz said that this must have been the first edition. A
still more exact description of this edition was published
by Schelhorn in 1760, under the title of The Oldest
Edition of the Latin Bible. He said that this must have
been the edition described by Zell.
The Bible of 36 lines is a large demy folio of 1764
pages, made up, for the most part, in sections of ten
leaves, and usually bound in three volumes. Each page
has two columns of 36 lines each. In some sections, a
leaf torn out, possibly on account of some error, has been
replaced by the insertion of a single leaf or a half sheet.
The workmanship of the first section is inferior: the
indentation of paper by too hard pressure is very strongly
marked; the pages are sadly out of register; on one page
the margins and white space between the columns show
the marks of a wooden chase and bearers, which were
used to equalize impressions and prevent undue wear of
types. This section has the appearance of experimental or
unpractised workmanship. It is apparent, almost at a
glance, that the printer did not use a proper chase and
bearers, nor a frisket, nor points for making register.[256]
All other sections were printed with the proper
appliances, with uncommon neatness of presswork, in
black ink, with exact register, and with a nicely graduated
impression, which shows the sharp edges of the types
with clearness.

Fac-simile of the Types of the Bible of 36 Lines, with the Rubricator’s Marks on
the Capitals. Verses 17 to 22 of the Sixth Chapter of the Book of Wisdom.
[Photographed from a Fragment of the Original in the Collection of Mr. David Wolfe Bruce.]

The types of this book closely resemble, in face and
body, many letters being identically the same, the types
of the display line in the Letter of Indulgence of 31 lines,
and of the Donatus of 1451. In some features they
resemble the types of the Bible of 42 lines. It is possible
that the types of each edition were designed and made
by the same letter cutter, and that they were made for
and used by the same printer. This opinion is
strengthened after an inspection of the mannerisms of
the composition, which are those of the Bible of 42 lines.
The colon, period, and hyphen are the only marks of
punctuation. The lines of the text are always full:
the hyphen is frequently seen projecting beyond the
letters. A blank space was left for every large initial
which, it was expected, would be inserted by the
calligrapher. Red ink was not used by the printer; the
rubricated letters were dabbed over with a stroke from
the brush of the illuminator.

Some of the Abbreviations of the Bible of 36 lines. [From Duverger.]

One copy of the book contains a written annotation
dated 1461. An account book of the Abbey of Saint
Michael of Bamberg, which begins with the date March
21, 1460, has in its original binding some of the waste
leaves of this Bible. These, the earliest evidences of date,
prove that this edition could not have been printed later
than 1459. That it was done in 1450, as asserted by
Madden, has not been decisively proved, but the
evidence favoring this conclusion deserves consideration.
Ulric Zell’s testimony that the first Bible was printed in
1450 from missal-like types,[257] points with directness to
the Bible of 36 lines, for there is no other printed Bible to
which Zell’s description can be applied. Its close imitation
of the large and generous style in which the choicer
manuscripts of that period are written marks the period
of transition between the old and the new style of book-
making. The prodigality in the use of paper seems the
work of a man who had not counted the cost, or who
thought that he was obliged to disregard the expense. As
not more than half a dozen copies are known, it is
probable that the number printed was small. Nearly all
the copies and leaves of this edition were found in the
neighborhood of Bamberg. This curious circumstance may
be explained by the supposition that the entire edition,
probably small, had been printed at the order of, or
had been mortgaged to, one of the many ecclesiastical
bodies of that town. There is evidence that Gutenberg
frequently borrowed money from wealthy monasteries.
The imperfect workmanship of the first section is,
apparently, the work of a printer in the beginning of his
practice, when he had not discovered all the tools and
implements which he afterward used with so much
success.258
The Bible of 36 lines should have been in press a long
time, for it cannot be supposed that Gutenberg had the
means to do this work with regularity. His office was
destitute of composing sticks and rules, iron chases,
galleys, and imposing stones. Deprived of these and
other labor-saving tools, without the expertness acquired
by practice, frequently delayed by the corrections of the
reader, the failures of the type-founder and the errors of
pressmen, it is not probable that the compositor
perfected more than one page a day. He may have done
less. Even if, as Madden supposes, two or more
compositors were engaged on this, as they were upon
other early work, the Bible of 36 lines should have been
in press about three years.259
The newness of the types seems to favor the opinion
that this must be the earlier edition. The same types, or
types cast from the same matrices, were frequently used
in little books printed between the years 1451 and 1462,
but they always appear with worn and blunted faces, as if
they had been rounded under the long-continued
pressure of a press, or had been founded in old and
clogged matrices.
Gutenberg deceived himself as much as he did his
Strasburg partners, in his over-sanguine estimate of the
profits of printing and the difficulties connected with its
practice. His printed work did not meet with the rapid
sale he had anticipated, or the cost of doing the work
was very much in excess of the price he received. The
great success which Andrew Dritzehen hoped to have
within one year, or in 1440, had not been attained in
1450. During this year Gutenberg comes before us again
as the borrower of money. If he had been only an
ordinary dreamer about great inventions, he would have
abandoned an enterprise so hedged in with mechanical
and financial difficulties. But he was an inventor in the full
sense of the word, an inventor of means as well as of
ends, as resolute in bending indifferent men as he was in
fashioning obdurate metal. After spending, ineffectually,
all the money he had acquired from his industry, from his
partners, from his inheritance, from his friends,—still
unable to forego his great project,—he went, as a last
resort, to one of the professional money-lenders of
Mentz. “Heaven or hell,” says Lacroix, “sent him the
partner John Fust.”260
John Fust. [From Maittaire.]

The character and services of John Fust have been put
before us in strange lights. By some of the earlier writers
he was most untruly represented as the inventor of
typography, as the instructor, as well as the partner, of
Gutenberg. By another class of authors he has been
regarded as the patron and benefactor of Gutenberg, a
man of public spirit, who had the wit to see the great
value of Gutenberg’s new art, and the courage to unite
his fortunes with those of the needy inventor. This latter
view has been popular: to this day, Fust is thoroughly
identified with all the honors of the invention. The
unreasonableness of this pretension has sent other
writers to the opposite extreme. During the present
century, Fust has been frequently painted as a greedy
and crafty speculator, who took a mean advantage of the
needs of Gutenberg, and basely robbed him of the fruits
of his invention.261
It is possible that Gutenberg knew John Fust, the
money-lender, through business relations with Fust’s
brother, James, the goldsmith; for we have seen that,
during his experiments in Strasburg, Gutenberg had work
done by two goldsmiths. What projects Gutenberg
unfolded to John Fust, and what allurements he set forth,
are not known; but the wary money-lender would not
have hazarded a guilder on Gutenberg’s invention, if he
had not been convinced of its value and of Gutenberg’s
ability. John Fust knew that there was some risk in the
enterprise, for it is probable that he had heard of the
losses of Dritzehen, Riffe and Heilmann. In making an
alliance with the inventor, Fust neglected none of the
precautions of a money-lender. He really added to them,
insisting on terms through which he expected to receive
all the advantages of a partnership without its
liabilities.262
The terms were hard. But Gutenberg had the firmest
faith in the success of his invention: in his view it was not
only to be successful, but so enormously profitable that
he could well afford to pay all the exactions of the
money-lender. The object of the partnership is not
explicitly stated, but it was, without doubt, the business
of printing and publishing text books, and, more
especially, the production of a grand edition of the Bible,
the price of a fair manuscript copy of which, at that time,
was five hundred guilders. The expense that would be
made in printing a large edition of this work seemed
trivial in comparison with the sum which Gutenberg
dreamed would be readily paid for the new books. But
the expected profit was not the only allurement.
Gutenberg was, no doubt, completely dominated by the
idea that necessity was laid on him—that he must
demonstrate the utility and grandeur of his invention,—
and this must be done whether the demonstration
beggared or enriched him. After sixteen years of labor,
almost if not entirely fruitless, he snatched at the
partnership with Fust as the only means by which he
could realize the great purpose of his life. The overruling
power of the money-lender was shown in the
beginning of the partnership. Gutenberg had ready the
types of the Bible of 36 lines, and had, perhaps, printed a
few copies of the work—too few to supply the demand.
Another edition could have been printed without delay,
but it was decided that this new edition should be in a
smaller type and in two volumes. It was intended that the
cost of the new edition should be about one-third less
than that of the Bible of 36 lines. Gutenberg was,
consequently, obliged to cut a new face and found a new
font of types, which, by the terms of the agreement,
were to be mortgaged to Fust.
Fust did not assist Gutenberg as he should have done.
Instead of paying the 800 guilders at once, as was
implied in the agreement, he allowed two years to pass
before this amount was fully paid. The equipment of the
printing office with new types was sadly delayed. At the
end of the two years, when Gutenberg was ready to
print, he needed for the next year’s expenses, and for the
paper and vellum for the entire edition, more than the
300 guilders allowed to him by the agreement of 1450.
Fust, perceiving the need of Gutenberg, saw also his
opportunity for a stroke in finance, which would assist
him in the designs which he seems to have entertained
from the beginning. He proposed a modification of the
contract—to commute the annual payment of 300
guilders for the three successive years by the immediate
payment of 800 guilders. As an offset to the loss
Gutenberg would sustain by this departure from the
contract, Fust proposed to remit his claim to interest on
the 800 guilders that had been paid. Gutenberg, eager
for the money, and credulous, assented to these
modifications.
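
The arithmetic of this modified contract can be set out in a short sketch. It is an illustration only, using the sums named in the passage (three annual payments of 300 guilders, commuted to one payment of 800 guilders). The function names are hypothetical, and the rate of interest is not stated in this passage, so `rate_percent` is left as a parameter that the reader must supply as an assumption.

```python
def commutation_shortfall(annual_payment: int, years: int, lump_sum: int) -> int:
    """Nominal amount given up by taking one lump sum instead of the annual payments."""
    return annual_payment * years - lump_sum


def interest_remitted(principal: int, rate_percent: float, years: float) -> float:
    """Simple interest waived on a principal; rate_percent is an assumed input."""
    return principal * rate_percent * years / 100


if __name__ == "__main__":
    # Figures from the passage: 300 guilders a year for three years, against 800 at once.
    loss = commutation_shortfall(annual_payment=300, years=3, lump_sum=800)
    print(f"Nominal loss to Gutenberg by the commutation: {loss} guilders")
```

Whether the remitted interest was a fair offset for this nominal loss depends on the rate and on the time the first 800 guilders had been outstanding, neither of which can be fixed from this passage alone.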
The delays and difficulties which Gutenberg
encountered in the printing of this edition were great, but
no part of the work was done hastily or unadvisedly. He
may not have received practical education as a book-
maker, but he had the rare good sense to accept
instruction from those who had. The Bible of 42 lines was
obviously planned by an adept in all the book-making skill
of his time. It was laid out in 66 sections, for the
most part of 10 leaves each. To facilitate the division of
the book in parts (so that it could be bound, if necessary
for the convenience of the reader, in ten thin volumes),
some of the sections have but 4, some 11, and some 12
leaves. The book proper, without the summary of
contents, consists of 1282 printed pages, 2 columns to
the page, and, for the most part, with 42 lines to the
column.263
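
The counts in this description can be checked with a little arithmetic. The sketch below is illustrative only: it takes the figures given in the text (66 sections, 1282 printed pages, 2 columns to the page, 42 lines to the column) and treats every section as having 10 leaves and every column as having 42 lines, which the text says is true only "for the most part". The function names are hypothetical.

```python
def leaves_for_pages(printed_pages: int) -> int:
    """Leaves needed when each leaf carries two printed pages (rounded up)."""
    return (printed_pages + 1) // 2


def nominal_leaves(sections: int, leaves_per_section: int) -> int:
    """Leaves in the book if every section had the same number of leaves."""
    return sections * leaves_per_section


def max_text_lines(printed_pages: int, columns_per_page: int, lines_per_column: int) -> int:
    """Upper bound on lines of type, assuming every column is full."""
    return printed_pages * columns_per_page * lines_per_column


if __name__ == "__main__":
    print("Leaves for 1282 printed pages:", leaves_for_pages(1282))
    print("Leaves in 66 sections of 10:", nominal_leaves(66, 10))
    print("Lines of type at most:", max_text_lines(1282, 2, 42))
```

Under these simplifying assumptions the uniform-section figure and the printed-page figure do not agree exactly. The text does not account for the difference; the irregular sections of 4, 11, and 12 leaves, any blank pages, and the summary of contents, which is excluded from the page count, would all bear on it.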
A wide margin was allowed for the ornamental borders,
without which no book of that time was complete, and
large spaces were also left in the text for the great initial
letters. It was expected that the purchaser of the book
would have the margins and spaces covered with the
fanciful designs and bright colors of the illuminator. In
some copies, this work of illumination was admirably
done; in others it was badly done or entirely neglected.
The rubrics were roughly made by dabbing a brush filled
with red ink over a letter printed in black. On the pages
of 40 lines, the summaries of chapters were printed in
red ink; on other pages the summaries were written,
sometimes in red and sometimes in black ink. It
would seem that it was Gutenberg’s original intention to
print all the summaries in red ink, and that he was
obliged, for some unknown reason, to have them written
in.
The general effect of the typography is that of
excessive blackness,—an effect which seems to have
been made of set purpose, for the designer of the types
made but sparing use of hair lines. It may be that the
avoidance of hair lines was caused by difficulties of type-
founding. The type-founding was properly done: the
types have solid faces and stand in line. The letters are
not only black but condensed, and are so closely
connected that they seem to have been spread by
pressure. Double letters and abbreviations were freely
used. Judged by modern standards, the types are
ungraceful; the text letters are too dense and black, and
the capitals are of rude form, obscure, and too small for
the text. The presswork is unequal: on some vellum
copies, the types are clearly and sharply printed; on other
copies, they show muddily from excess of ink. On the
paper copies, the ink is usually of a full black, but there
are pages on paper and on vellum, in which, for lack of
ink and impression,264 the color is of a grimy gray-black.
Van der Linde and others say that the ink will not resist
water, but the ink on the fragments of vellum belonging
to Mr. Bruce stood a severe test by water, without any
weakening of color. The register on the paper copies is
very good; on the vellum copies it is offensively irregular,
a plain proof that the vellum had been dampened, and
had shrunk or twisted before the second side was
printed.
It has been said that this Bible of 42 lines was printed
with intent to cheat purchasers, so that it might be sold
as a manuscript. There is a legend that Fust did attempt
the cheat at Paris, but there is no good authority for the
