Text Processing with Ruby: Extract Value from the Data That Surrounds You, 1st Edition, by Rob Miller
This is a fun, readable, and very useful book. I’d recommend it to anyone who
needs to deal with text—which is probably everyone.
➤ Paul Battley
Developer, maintainer of text gem
I’d recommend this book to anyone who wants to get started with text processing.
Ruby has powerful tools and libraries for the whole ETL workflow, and this book
describes everything you need to get started and succeed in learning.
➤ Hajba Gábor László
Developer
A lot of people get into Ruby via Rails. This book is really well suited to anyone
who knows Rails, but wants to know more Ruby.
➤ Drew Neil
Director, Studio Nelstrom, and author of Practical Vim
Rob Miller
Acknowledgments . . . . . . . . . . . ix
Introduction . . . . . . . . . . . . . xi
3. Shell One-Liners . . . . . . . . . . . 29
Arguments to the Ruby Interpreter 30
Prepending and Appending Code 35
Example: Parsing Log Files 37
Wrapping Up 39
5. Delimited Data . . . . . . . . . . . . 51
Parsing a TSV 52
Delimited Data and the Command Line 56
The CSV Format 58
Wrapping Up 62
6. Scraping HTML . . . . . . . . . . . . 63
The Right Tool for the Job: Nokogiri 63
Searching the Document 64
Working with Elements 72
Exploring a Page 77
Example: Reading a League Table 80
Wrapping Up 88
7. Encodings . . . . . . . . . . . . . 89
A Brief Introduction to Character Encodings 90
Ruby’s Support for Character Encodings 92
Detecting Encodings 98
Wrapping Up 99
Part IV — Appendices
A1. A Shell Primer . . . . . . . . . . . . 229
Running Commands 229
Controlling Output 230
Exit Statuses and Flow Control 232
Many thanks to Alessandro Bahgat, Paul Battley, Jacob Chae, Peter Cooper,
Iris Faraway, Kevin Gisi, Derek Graham, James Edward Gray II, Avdi Grimm,
Hajba Gábor László, Jeremy Hinegardner, Kerri Miller, and Drew Neil for their
helpful technical review comments, questions, and suggestions—all of which
shaped this book for the better.
Thanks to Rob Griffiths, Mark Rogerson, Samuel Ryzycki, David Webb, Lewis
Wilkinson, Alex Windett, and Mike Wright for ensuring there was no chance
I got too big for my football boots.
Unlike binary formats, text has the pleasing quality of being readable by
humans as well as computers, making it easy to debug and requiring no
distinction between output that’s for human consumption and output that’s
to be used as the input for another step in a process.
The second concern is with actually processing the text once we’ve got it into
the program. This usually means either extracting data from within the text,
parsing it into a Ruby data structure, or transforming it into another format.
The most important subject in this second stage is, without a doubt, regular
expressions. We’ll look at regular expression syntax, how Ruby uses regular
expressions in particular, and, importantly, when not to use them and instead
reach for solutions such as parsers.
We’ll also look at the subject of natural language processing in this part of
the book, and how we can use tools from computational linguistics to make
our programs smarter and to process data that we otherwise couldn’t.
The final step is outputting the transformed text or the extracted data some-
where—to a file, to a network service, or just to the screen. Part of this process
is concerned with the actual writing process, and part of it is concerned with
the form of the written data. We’ll look at both of these aspects in the third
part of the book.
Together, these three steps are often described as “extract, transform, and
load” (ETL). It’s a term especially popular with the “big data” folks. Many text
processing tasks, even ones that seem on the surface to be very different from
one another, fall into this pattern of three steps, so I’ve tried to mirror that
structure in the book.
In general, we’re going to explore why Ruby is an excellent tool to reach for
when working with text. I also hope to persuade you that you might reach
for Ruby sooner than you think—not necessarily just for more complex tasks,
but also for quick one-liners.
Most of all, I hope this book offers you some useful techniques that help you
in your day-to-day programming tasks. Where possible, I’ve erred toward the
practical rather than the theoretical: if it does anything, I’d like this book to
point you in the direction of practical solutions to real-world problems. If your
day job is anything like mine, you probably find yourself trawling through
text files, CSVs, and command-line output more often than you might like.
Helping to make that process quick and—dare I say it?—fun would be fantas-
tic.
While the book starts with material likely to be familiar to anyone who’s
written a command-line application in Ruby, there’s still something here for
the more advanced user. Even people who’ve worked with Ruby a lot aren’t
necessarily aware of the material covered in Chapter 3, Shell One-Liners, on
page 29, for example, and I see far too many developers reaching for regular
expressions to parse HTML rather than using the techniques outlined in
Chapter 6, Scraping HTML, on page 63.
Even experienced developers might not have written parsers before (covered
in Chapter 10, Writing Parsers, on page 127), or dabbled in natural language
processing (as we do in Chapter 11, Natural Language Processing, on page
155)—so hopefully those subjects will be interesting regardless of your level of
experience.
I’ve tried to include in each of the chapters material of interest even to more
advanced Rubyists, so there aren’t any chapters that are obvious candidates
to skip if you’re at that end of the skill spectrum.
If you’re not familiar with how to use the command line, there’s a beginner’s
tutorial in Appendix 1, A Shell Primer, on page 229, and a guide to various
commands in Appendix 2, Useful Shell Commands, on page 235. These
appendixes will give you more than enough command-line knowledge to follow
all of the examples in the book.
Online Resources
The page for this book on the Pragmatic Bookshelf website3 contains a discus-
sion forum, where you can post any comments or questions you might have
about the book and make suggestions for any changes or expansions you’d
like to see in future editions. If you discover any errors in the book, you can
submit them there, too.
Rob Miller
August 2015
2. https://www.cygwin.com/
3. https://pragprog.com/book/rmtpruby/text-processing-with-ruby
The first part of our text processing journey is concerned with getting text into our
program. This text might reside in files, might be entered by the user, or might come
from other processes; wherever it comes from, we’ll learn how to read it.
We’ll also look at taking structure from the text that we read, learning how to parse
CSV files and even scrape information from web pages.
Throughout the course of this chapter, we’ll look at how we can use Ruby to
reach text that resides in files. We’ll look at the basics you might expect, with
some methods to straightforwardly read files in one go. We’ll then look at a
technique that will allow us to read even the biggest files in a memory-efficient
way, by treating files as streams, and look at how this can give us random
access into even the largest files. Let’s take a look.
Opening a File
Before we can do something with a file, we need to open it. This signals our
intent to read from or write to the file, allowing Ruby to do the low-level work
that makes that intention actually happen on the filesystem. Once it's done those
things, Ruby gives us a File object that we can use to manipulate the file.
Once we have this File object, we can do all sorts of things with it: read from
the file, write to it, inspect its permissions, find out its path on the filesystem,
check when it was last modified, and much more.
To open a file in Ruby, we use the open method of the File class, telling it the
path to the file we’d like to open. We pass a block to the open method, in which
we can do whatever we like with the file. Here’s an example:
File.open("file.txt") do |file|
  # ...
end
Because we passed a block to open, Ruby will automatically close the file for
us after the block finishes, freeing us from doing that cleanup work ourselves.
The argument that open passes to our block, which in this example I’ve called
file, is a File object that points to the file we’ve requested access to (in this case,
file.txt). Unless we tell Ruby otherwise, it will open files in read-only mode, so
we can’t write to them accidentally—a safe default.
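To open a file for writing instead, we can pass a mode as a second argument. A minimal sketch, with an illustrative filename: "w" opens the file writable, creating it if necessary and truncating any existing contents.

```ruby
# "w" switches from the read-only default to write mode; the file is
# created (or emptied) and we can write to it inside the block.
File.open("output.txt", "w") do |file|
  file.write("some text\n")
end
```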
Kernel#open
In the real world, it’s common to see people using the global open method rather than
explicitly using File.open:
open("file.txt") do |file|
  # ...
end
As well as being shorter, which is always nice, this convenient method is actually a
wrapper for a number of different types of IO objects, not just files. You can use it to
open URLs, other processes, and more. We’ll cover some more uses of open later; for
now, use either File.open or regular open as you prefer.
There’s nothing in our block yet, so this code isn’t very useful; it doesn’t
actually do anything with the file once it’s opened. Let’s take a look at how
we can read content from the file.
We can achieve this by using the read method on our File object:
File.open("file.txt") do |file|
  contents = file.read
end
The read method returns for us a string containing the file’s contents, no
matter how large they might be.
Alternatively, if all we’re doing is reading the file and we have no further use
for the File object once we’ve done so, Ruby offers us a shortcut. There’s a read
method on the File class itself, and if we pass it the name of a file, then it will
open the file, read it, and close it for us, returning the contents:
contents = File.read("file.txt")
Whichever method we use, the result is that we have the entire contents of
the file stored in a string. This is useful if we want to blindly pass those con-
tents over to something else for processing—to a Markdown parser, for
example, or to insert it into a database, or to parse it as JSON. These are all
very common things to want to do, so read is a widely used method.
For example, if our file contained some JSON data, we could parse it using
Ruby’s built-in JSON library:
require "json"
json = File.read("file.json")
data = JSON.parse(json)
Line-by-line Processing
Lots of plain-text formats—log files, for instance—use the lines of a file as a
way of structuring the content within them. In files like this, each line repre-
sents a distinct item or record. It’s about the simplest way to separate data,
but this kind of structure is more than enough for many use cases, so it’s
something you’ll run into frequently when processing text.
One example of this sort of log file that you might have encountered before
is from the popular web server Apache. For each request made to it, Apache
will log some information: things like the IP address the request came from,
the date and time that the request was made, the URL that was requested,
and so on. The end result looks like this:
127.0.0.1 - - [10/Oct/2014:13:55:36] "GET / HTTP/1.1" 200 561
127.0.0.1 - - [10/Oct/2014:13:55:36] "GET /images/logo.png HTTP/1.1" 200 23260
192.168.0.42 - - [10/Oct/2014:14:10:21] "GET / HTTP/1.1" 200 561
192.168.0.91 - - [10/Oct/2014:14:20:51] "GET /person.jpg HTTP/1.1" 200 46780
192.168.0.42 - - [10/Oct/2014:14:20:54] "GET /about.html HTTP/1.1" 200 483
Let’s imagine we wanted to process this log file so that we could see all the
requests made by a certain IP address. Because each line in the file represents
one request, we need some way to loop over the lines in the file and check
whether each one matches our conditions—that is, whether the IP address
at the start of the line is the one we’re interested in.
One way to do this would be to use the readlines method on our File object. This
method reads the file in its entirety, breaking the content up into individual
lines and returning an array:
File.open("access_log") do |log_file|
  requests = log_file.readlines
end
At this point, we’ve got an array—requests—that contains every line in the file.
The next step is to loop over those lines and only output the ones that match
our conditions:
File.open("access_log") do |log_file|
  requests = log_file.readlines

  requests.each do |request|
    if request.start_with?("127.0.0.1 ")
      puts request
    end
  end
end
Using each, we loop over each request. We then ask the request if it starts
with 127.0.0.1, and if the response is true, we output it. Lines that don’t start
with 127.0.0.1 will simply be ignored.
While this solution works, it has a problem. Because it reads the whole file
at once, it consumes an amount of memory at least equal to the size of the
file. This will hold up okay for small files, but as our log file grows, so will the
memory consumed by our script.
If you think about it, though, we don’t actually need to have the whole file in
memory to solve our problem. We’re only ever dealing with one line of the file
at any given moment, so we only really need to have that particular line in
memory. For some problems it’s necessary to read the whole file at once, but
this isn’t one of them. Let’s look at how we can rework this example so that
we only read one line at a time.
The solution is to treat the file as a stream. Instead of reading from the
beginning of the file to the end in one go, and keeping all of that information
in memory, we read only a small amount at a time. We might read the first
line, for example, then discard it and move onto the second, then discard that
and move onto the third, and so on until we reach the end of the file. Or we
might instead read it character by character, or word by word. The important
thing is that at no point do we have the full file in memory: we only ever store
the little bit that we’re processing.
The File object yielded to our block has a method called each_line. This method
accepts a block and will step through the file one line at a time, executing
that block once for each line.
File.open("access_log") do |log_file|
  log_file.each_line do |request|
    if request.start_with?("127.0.0.1 ")
      puts request
    end
  end
end
That’s it. The each_line method allows us to step through each line in the file
without ever having more than one line of the file in memory at a time. This
method will consume the same amount of memory no matter how large the
file is, unlike our first solution.
Just like with File.read, Ruby offers us a shortcut that doesn’t require us to
open the file ourselves: File.foreach. Using it trims the previous example down
a little:
File.foreach("access_log") do |request|
  if request.start_with?("127.0.0.1 ")
    puts request
  end
end
Enumerable Streams
The each_line method of the File class is aliased to each. This might not seem
particularly remarkable, but it’s actually tremendously useful. This is because
Ruby has a module called Enumerable that defines methods like map, find_all,
count, reduce, and many more. The purpose of Enumerable is to make it easy to
search within, add to, delete from, iterate over, and otherwise manipulate
collections. (You’ve probably used methods like these when working with
arrays, for example.)
Well, a file is a collection too. By default, Ruby considers the elements of that
collection to be the lines within the file, so because File includes the Enumerable
module, we can use all of its methods on those lines. This can make many
processing operations simple and expressive, and because many of Enumerable’s
methods don’t require us to consume the whole file—they’re lazy, in other
words—we often retain the performance benefits of streaming, too.
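To make that laziness concrete, here's a small sketch using sample data modeled on the chapter's log (the filename is illustrative). Enumerable's find stops iterating as soon as a line satisfies the block, so the rest of the file is never read.

```ruby
# Write a small sample log to search (hypothetical path).
File.write("sample_log", <<~LOG)
  127.0.0.1 - - [10/Oct/2014:13:55:36] "GET / HTTP/1.1" 200 561
  192.168.0.42 - - [10/Oct/2014:14:10:21] "GET / HTTP/1.1" 200 561
LOG

match = File.open("sample_log") do |file|
  # find comes from Enumerable: it returns the first line for which
  # the block is true and stops reading the file at that point.
  file.find { |line| line.start_with?("192.") }
end
# match is now the second line of the log, newline included
```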
To explore what this means, we can revisit our log example. Let’s imagine you
wanted to group all of the requests made by each IP address, and within that
group them by the URL requested. In other words, you want to end up with
a data structure that looks something like this:
{
  "127.0.0.1" => [
    "/",
    "/images/logo.png"
  ],
  "192.168.0.42" => [
    "/",
    "/about.html"
  ],
  "192.168.0.91" => [
    "/person.jpg"
  ]
}
Here’s a script that uses the methods offered by Enumerable to achieve this:
requests-by-ip.rb
requests =
  File.open("data/access_log") do |file|
    file
      .map { |line| { ip: line.split[0], url: line.split[5] } }
      .group_by { |request| request[:ip] }
      .each { |ip, requests| requests.map! { |r| r[:url] } }
  end
We open the file just like we did previously. But instead of using each_line to
iterate over the lines of the file, we use map. This loops over the lines of the
file, building up an array as it does so by taking the return value of the block
we pass to it. Here our block is using split to separate the lines on whitespace.
The first of these whitespace-separated fields contains the IP, and the sixth
contains the URL that was requested, so the block returns a hash. The result
of our map operation is therefore an array of hashes that contain only the
information about the request that we’re interested in—the IP address and
the URL.
Next, we use the group_by method. This transforms our array of hashes into a
single hash. It does so by checking the return value of the block that we pass
to it; all the elements of the array that return the same value will be grouped
together. In this case, our block returns the IP part of the request, which
means that all of the requests made by the same IP address will be grouped
together.
The data structure after the group_by operation looks something like this:
{
  "127.0.0.1" => [
    {:ip=>"127.0.0.1", :url=>"/"},
    {:ip=>"127.0.0.1", :url=>"/images/logo.png"}
  ],
  "192.168.0.42" => [
    {:ip=>"192.168.0.42", :url=>"/"},
    {:ip=>"192.168.0.42", :url=>"/about.html"}
  ],
  "192.168.0.91" => [
    {:ip=>"192.168.0.91", :url=>"/person.jpg"}
  ]
}
This is almost what we were after. The problem is that we have both the IP
address and the URL of the request, rather than just the URL. So the next
step in our chain uses each to loop over these IP address and request combi-
nations. It then uses map! to replace the array of hashes with just the URL
portion, leaving us with an array of strings.
By default, each behaves exactly like each_line, looping over the lines in the file.
But it also accepts an argument allowing you to change the character on
which it will split, from a newline to anything else you might like.
Let’s imagine we had a file with only a single line in it, but that contained
many different records separated by commas:
this is field one,this is field two,this is field three
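Passing "," to each lets us step through those comma-separated records one at a time. A minimal sketch, with an illustrative filename:

```ruby
# A single line containing several comma-separated records.
File.write("fields.txt", "this is field one,this is field two,this is field three")

File.open("fields.txt") do |file|
  # With "," as its argument, each treats commas, not newlines, as the
  # record separator; chomp(",") trims the separator from each record.
  file.each(",") do |field|
    puts field.chomp(",")
  end
end
```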
Again, this method has all of the benefits of other streaming examples; we
only ever have a single character in memory at one time, so we can process
even the largest of files.
For example, we could rewrite the previous example, where we quite verbosely
initialized our n variable and incremented it manually, by using Enumerable’s
count method:
character-count.rb
n =
  File.open("file.txt") do |file|
    file.each_char.count { |char| char == "b" }
  end
The count method accepts a block and will return the number of values for
which the block returned true. This is exactly what our previous code was
doing, but this way is a little shorter and a little neater, and reveals our
intentions more clearly.
This might seem like an academic distinction, but it has an important benefit: it
means that other types of IO in Ruby have those same methods, too. Files, streams,
sockets, Unix pipelines—all of these things are fundamentally similar, and it’s in IO
that these similarities are gathered into one abstraction. In the words of Ruby’s own
documentation, IO is “the basis for all input and output in Ruby.” By learning to read
from files, then, you’ll learn both principles and concrete methods that will translate
to all the other places from which you might want to acquire text.
If you know how to write output to the screen, then—using puts—you already know
how to write to a file: by calling puts on the file object. Our screen and a file are both
IO objects—of two different kinds—so the way we interact with them is the same.
This similarity will be very useful throughout our text processing journey.
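As a minimal sketch of that symmetry (the filename is illustrative), writing to a file looks just like printing to the screen:

```ruby
File.open("greeting.txt", "w") do |file|
  # puts on a File object behaves exactly like puts on the screen:
  # it writes the string followed by a newline.
  file.puts "Hello, file!"
end
```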
This might seem an inflexible and impractical way of doing things. After all,
how can we know how many bytes from the start of the file we’ll find the
record we’re looking for?
Let’s imagine we wanted to dig into this data. We might want to find out what
the warmest week was, or plot the results on a graph, or just show what the
temperature of a particular region was last week. To do any of these things,
we need to parse the data and get it into our script.
First, a quick explanation of the data. The first column contains the date of
the week in which the measurements were taken. The other four columns
represent different regions of the ocean. For each of them we have two num-
bers: the first representing the recorded temperature, and the second repre-
senting the departure from the expected temperature that this recording
represents (the “sea surface temperature anomaly”). In the first row, then,
the first region recorded a temperature in the week of 3 January 1990 of 23.4
degrees, which is an anomaly of -0.4 degrees.
The pleasing visual quality that this data has—the fact that all the columns
in the table line up neatly—will help us in this task. If we were to count the
characters across each line, we’d see that each field started at exactly the
same place in each row. The first column, containing the date of the week in
question, is always ten characters long. The next number is nine characters
long, always, and the following number is always four characters, regardless
of whether it has a negative sign. This nine/four pattern repeats three more
times for the other three regions.
In trying to get this data into our script, let’s look at how to read the first row
of data.
Previously, we used read in its basic form, without any arguments, which read
the entire file into memory. But if we pass an integer as the first argument,
read will read only that many bytes forward from the current position in the
file.
So, from the start of the file, we could read in each field in the first row as
follows:
noaa-first-row-simple.rb
File.open("data/wksst8110.for") do |file|
  puts file.read(10)

  4.times do
    puts file.read(9)
    puts file.read(4)
  end
end
# >> 03JAN1990
# >> 23.4
# >> -0.4
# >> 25.1
# >> -0.3
# >> 26.6
# >> 0.0
# >> 28.6
# >> 0.3
We first read ten bytes, to get the date of the week. Then we read nine bytes
followed by four bytes to extract the numbers, doing this four times so that
we extract all four regions.
From here, it’s not much work to have our script continue through the rest
of the file, slurping up all of the data within and converting it into a Ruby
data structure—in this case, a hash:
noaa-all-rows.rb
File.open("data/wksst8110.for") do |file|
  weeks = []
  until file.eof?
    week = { date: file.read(10).strip, temps: {} }
    [:nino12, :nino3, :nino34, :nino4].each do |region|
      week[:temps][region] = { temp: file.read(9).to_f, change: file.read(4).to_f }
    end
    file.read(1)
    weeks << week
  end
  weeks
  # => [{:date=>"03JAN1990",
  #      :temps=>
  #       {:nino12=>{:temp=>23.4, :change=>-0.4},
  #        :nino3=>{:temp=>25.1, :change=>-0.3},
  #        :nino34=>{:temp=>26.6, :change=>0.0},
  #        :nino4=>{:temp=>28.6, :change=>0.3}}},
  #     {:date=>"10JAN1990",
  #      :temps=>
  #       {:nino12=>{:temp=>23.4, :change=>-0.8},
  #        :nino3=>{:temp=>25.2, :change=>-0.3},
  #        :nino34=>{:temp=>26.6, :change=>0.1},
  #        :nino4=>{:temp=>28.6, :change=>0.3}}},
  #     {:date=>"17JAN1990",
  #      :temps=>
  #       {:nino12=>{:temp=>24.2, :change=>-0.3},
  #        :nino3=>{:temp=>25.3, :change=>-0.3},
  #        :nino34=>{:temp=>26.5, :change=>-0.1},
  #        :nino4=>{:temp=>28.6, :change=>0.3}}},
  #     ...snip...
end
The logic is fundamentally the same as when reading the first row. To loop
over all the rows in the file, there are two main changes: first, we loop until we
hit the end of the file by checking file.eof?; it will return true when the end of
the file is reached and therefore end our loop. The other addition is the call
to file.read(1) at the end of the row; this will consume the newline character at
the end of each line. We’re also using strip to strip the whitespace from the
week name, and to_f to convert the temperature numbers to floats.
This method works and is fast. But by only using read to consume a fixed
number of bytes, we haven’t seen the most important advantage of treating
the file in this way: the fact that it offers us random access to the records
within the file.
Because each of the columns within the data has a fixed width, all of the
rows have a fixed width, too. Adding up the columns, including the
newline at the end, gives us 10 + 4 * (9 + 4) + 1 = 63 characters, so we know that
each of our records is 63 bytes long.
If we used seek to skip 63 bytes into the file, then our first call to read would
begin reading from the second record:
noaa-skip-first-row.rb
File.open("data/wksst8110.for") do |file|
  file.seek(63)
  file.read(10)
  # => " 10JAN1990"
end
As we can see, our first call to read returns for us the date of the second week
in the file, not the first. Using this method, we can now skip to arbitrary
records—the first, the tenth, the thousandth, whatever we like—and read
their data.
The most important part of this is that seeking happens in constant time.
That means that it takes the same amount of time no matter how large the
file is and no matter how far into the file we want to seek. We’ve finally
uncovered the amazing benefit of fixed-width files like this—that we gain the
ability to access records within them at random, so it’s no slower to find the
303rd record than it is to find the third—or even the 300,003rd.
In the final version of our script, then, we can write a get_week method that
will retrieve a record for us given an index for that record (1 for the first, 2 for
the second, and so on):
noaa-seek.rb
def get_week(file, week)
  file.seek((week - 1) * 63)
  week = { date: file.read(10).strip, temps: {} }
  [:nino12, :nino3, :nino34, :nino4].each do |region|
    week[:temps][region] = { temp: file.read(9).to_f, change: file.read(4).to_f }
  end
  week
end
File.open("data/wksst8110.for") do |file|
  get_week(file, 3)
  # => {:date=>"17JAN1990",
  #     :temps=>
  #      {:nino12=>{:temp=>24.2, :change=>-0.3},
  #       :nino3=>{:temp=>25.3, :change=>-0.3},
  #       :nino34=>{:temp=>26.5, :change=>-0.1},
  #       :nino4=>{:temp=>28.6, :change=>0.3}}}

  get_week(file, 303)
  # => {:date=>"18OCT1995",
  #     :temps=>
  #      {:nino12=>{:temp=>20.0, :change=>-0.8},
  #       :nino3=>{:temp=>24.1, :change=>-0.9},
  #       :nino34=>{:temp=>25.8, :change=>-0.9},
  #       :nino4=>{:temp=>28.2, :change=>-0.5}}}

  get_week(file, 1303)
  # => {:date=>"17DEC2014",
  #     :temps=>
  #      {:nino12=>{:temp=>22.9, :change=>0.1},
  #       :nino3=>{:temp=>26.0, :change=>0.8},
  #       :nino34=>{:temp=>27.4, :change=>0.8},
  #       :nino4=>{:temp=>29.4, :change=>1.0}}}
end
Here we use the get_week method to fetch the third, 303rd, and 1,303rd records.
With this method we can treat the data within the file almost as though it
was a data structure within our script—like an array—even though we haven’t
had to read any of it in. This allows us to randomly access data within even
the largest of files in a very fast and efficient way.
One important caveat is that read and seek operate at the level of bytes, not
characters. You’ll learn more about the difference between the two in Chapter
7, Encodings, on page 89, but it’s worth noting that if you’re using a multibyte
character encoding, like UTF-8, then using seek carelessly might leave you in
the middle of a multibyte character and might mean that you get some gib-
berish when you try to read data.
You should therefore use these methods only when you know that you’re
dealing solely with single-byte characters or when you know that the location
you’re seeking to will never be in the middle of a character—as in our temper-
ature data example, where we’re seeking to the boundaries between records.
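A small sketch of the hazard, with an illustrative filename: "é" occupies two bytes in UTF-8, so seeking one byte into the file lands between them.

```ruby
File.write("utf8.txt", "é is multibyte")  # "é" is the two bytes 0xC3 0xA9

fragment = File.open("utf8.txt") do |file|
  file.seek(1)  # lands between the two bytes of "é"
  # read with a length returns raw bytes (ASCII-8BIT), so we tag them
  # as UTF-8 to check whether they form valid characters.
  file.read(3).force_encoding("UTF-8")
end

fragment.valid_encoding?  # false: we started reading mid-character
```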
Despite this limitation of seek, hopefully you can see the benefit of using a
fixed-width file like this. We can retrieve any value, no matter how big the file
is, without reading any unnecessary data; we have what’s called random
access to the data within. To retrieve the tenth record, we just need to seek
567 bytes from the start of the file; to retrieve the 703rd, we just need to seek
44,226 bytes from the start; and so on. The wonderful thing is that no matter
how large our file gets, this operation will always take the exact same amount
of time—even if we’ve got hundreds of megabytes of data. That’s why it’s
sometimes worth putting up with the limitations of such a format: it’s both
very simple and very fast.
Wrapping Up
That’s about it for reading files. We looked at how to open a file and what we
can do with the resulting File object. We covered reading files in one go and
processing them like streams, and why you’d prefer one or the other. We
explored how we can use the methods offered by Enumerable to transform and
manipulate the content of files. We looked at line-by-line processing and
reading arbitrary numbers of bytes, and how we can seek to arbitrary locations
in the file to replicate some of the functionality of a database.
With these techniques, we’ve gained an impressive arsenal for reading text
from files large and small. Next, we’ll take our newfound knowledge of streams
and apply it to another source of text: standard input.
This source of input is called standard input, and it’s one of the foundations
of text processing. Along with its output equivalents standard output and
standard error, it enables different programs to communicate with one
another in a way that doesn’t rely on complex protocols or unintelligible
binary data, but instead on straightforward, human-readable text.
Learning how to process standard input will allow you to write flexible and
powerful utilities for processing text, primarily by enabling you to write pro-
grams that form part of text processing pipelines. These chains of programs,
linked together so that each one’s output flows into the input of the next, are
incredibly powerful. Mastering them will allow you to make the most both of
your own utilities and of those that already exist, giving you the most text
processing ability for the least amount of typing possible.
Let’s take a look at how we can write scripts that process text from standard
input, slotting into pipeline chains and giving us flexibility, reusability, and
power.
Here we ask standard input—$stdin—for a line of input using the gets method,
using chomp to remove the trailing newline. This gives us a string, which we
store in name.
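A minimal version of such a prompt might look like this (a sketch; it guards against gets returning nil when input has ended):

```ruby
print "What's your name? "
name = ($stdin.gets || "").chomp  # gets returns nil at end of input
puts "Hello, #{name}!"
```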
This simplistic use of standard input isn’t particularly useful, let’s face it.
But it’s actually only half of the story. Standard input isn’t just used to read
from the keyboard interactively; it can also read from input that’s been redi-
rected—or piped—to your script from another process.
The ultimate goal here is to be able to use your scripts in pipeline chains.
These are chains of programs strung together so that the output from the
first is fed into the input of the second, the output of the second becomes the
input of the third, and so on. Here’s an example:
$ ps ax | grep ruby | cut -d' ' -f1
That example used preexisting commands to do its work. But we can write
our own programs that slot into such workflows. Imagine that we frequently
wanted to convert sections of text to uppercase. We know how to convert text
to uppercase in Ruby, so we could write a script that works like this:
$ echo "hello world" | ruby to-uppercase.rb
HELLO WORLD
In other words, we could write a program that converts any text it receives
on standard input to uppercase, then outputs that converted text. It won’t
know where the text is coming from (for example, the echo command we saw
previously versus the hostname command)—it accepts anything you pass to it.
This gives you great flexibility in how you use the script, opening up ways of
using it that you might not have foreseen when writing it.
This flexibility is what makes such scripts useful. Your goal, or at least a
pleasant side effect of processing text in this way, is to build up a library of
such scripts so that, if you encounter the same problem again, you can just
slot the script you wrote last time into the new pipeline chain and be on your
way. The to-uppercase.rb script is a good example of this: you might need to write
it from scratch the first time you encounter the problem of converting input
to uppercase, but after that it can be used again and again in completely
different situations.
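A minimal version of to-uppercase.rb consistent with this description might read as follows (a sketch, not necessarily the book’s exact listing):

```ruby
#!/usr/bin/env ruby

# to-uppercase.rb: read standard input line by line, writing an
# uppercased copy of each line to standard output.
$stdin.each_line do |line|
  puts line.upcase
end
```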
Saving this script as to-uppercase.rb, we’ve got everything we need. We can run
it like this:
$ echo "hello world" | ruby to-uppercase.rb
HELLO WORLD
$ hostname | ruby to-uppercase.rb
ROB.LOCAL
$ whoami | ruby to-uppercase.rb
ROB
We now have a script that reads from standard input, modifies what it receives,
and outputs it to standard output. It’s general purpose. It doesn’t know or
care where its input comes from, but it processes the input happily regardless.
Countless examples of this type of tool already exist, distributed with Unix-
like operating systems: grep, for example, which outputs only lines that match
a given pattern, or sort, which outputs an alphabetically sorted version of its
input. The scripts you write yourself will be right at home with these standard
Unix utilities as part of your text processing pipelines.
It’s also annoying that in the previous examples we had to type ruby
to-uppercase.rb. Other commands are short and snappy—cut, grep—but we had to
type what feels like a lot of superfluous information.
For our next example, we’re going to write a script that extracts URLs from
the input passed to it, outputting any that it finds and ignoring the rest of
the input. So, if we passed it the following text:
Alice's website is at http://www.example.com
While Jane's website is at https://example.net and contains a useful blog.
This script will be called urls, and once we’ve written it we’ll be able to use it
in any pipeline we like. Because it will treat its input as a stream, we’ll be
able to use it on whatever input we like, no matter how large it is. So we’ll be
able to extract the URLs from a text file:
$ cat file.txt | urls
The Shebang
Up until now we’ve only run our Ruby scripts by telling the Ruby interpreter
the name of the file to execute. But when we’re using ordinary Unix commands,
such as grep or uniq, we just specify them as commands in their own right.
Ideally, we want to be able to do the same with our URL extractor. It would
be annoying if we had to type ruby urls.rb or something similar each time we
wanted to use it, especially if we’re going to be using it a lot.
But if we just called our script urls, how would our shell know that it was a
Ruby script and know to pass its contents to Ruby to execute? The answer
is, because we tell it to, and we tell it using a special line at the top of our
script called the shebang. In this case, we’d use:
#!/usr/bin/env ruby
The special part is the #!—it’s this that gives the line its name (“hash” +
“bang”). Since the Ruby interpreter might be in different places on different
people’s computers, we use a command called env to tell the shell to use ruby,
wherever ruby might be.
The presence of this shebang allows us to save our script as a file called urls
and run it directly, rather than as ruby urls. The final step in this process is to
allow the file to be executed. We can do this with the chmod command:
$ chmod +x urls
That’s it. We can now call ./urls from within the directory our urls file resides
in, and it will execute our script as Ruby code.
If we wanted to be able to call our version from anywhere, not just from the
directory in which it’s saved, we could put it into a directory that’s within our
PATH—/usr/local/bin, for example. Many people create a directory under their
home directory—typically called bin—and put that into their path, so that they
have a place to keep all of their own scripts without losing them among the
ones that came with their system or that they’ve installed from elsewhere.
Putting the script in a directory that’s in your PATH will make it feel just like
any other text processing command and make it really easy to use wherever
you are. If you think you’ll use a particular script regularly, then don’t hesitate
to put it there. The only thing you need to do is to make sure the name of the
script doesn’t clash with an existing command that you still want to be able
to use—otherwise, you’ll run your script when you type the command, rather
than whatever command originally had that name. So don’t call it ls or mv!
Just like the File objects we saw in the previous chapter, $stdin has an each_line
method that allows us to iterate over the lines in our input:
$stdin.each_line do |line|
  # ...
end
Processing each line as it arrives, rather than waiting to read all of the
input first, means that we can pass our output along to the next stage in the process as
and when we process it. If our script is the last stage in the pipeline, that
means the user sees output more quickly; and if we’re earlier in the pipeline,
then it means the next part of the pipeline can be doing its processing while
we’re working on our next chunk.
The Logic
Unlike our to-uppercase.rb example, we’re not actually interested in printing the
line of output, even in a modified form. Instead we want to extract any URLs
we find in it and then output those. To do that, we’ll use a regular expression.
We’ll be covering these in depth in Chapter 8, Regular Expressions Basics,
on page 103, so don’t worry too much about them now:
urls
#!/usr/bin/env ruby

$stdin.each_line do |line|
  urls = line.scan(%r{https?://\S+})
  urls.each do |url|
    puts url
  end
end
Here we use String’s scan method to extract everything that looks like a URL.
Then, we loop over them—after all, there might be multiple URLs in a single
line—and output each one of them.
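As a quick sanity check of that pattern (a sketch, not a listing from the book), we can run it against the sample text from earlier:

```ruby
text = <<~TEXT
  Alice's website is at http://www.example.com
  While Jane's website is at https://example.net and contains a useful blog.
TEXT

text.each_line do |line|
  # scan returns every non-overlapping match of the pattern in the line
  line.scan(%r{https?://\S+}).each { |url| puts url }
end
# Prints:
#   http://www.example.com
#   https://example.net
```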
Of course, we’re not limited to having our script be the final stage in the
pipeline. We could use it as an intermediary step—for example, to fetch a web
page, extract the URLs from it, and then download each of those URLs:
$ curl http://example.com | urls | xargs wget
Hopefully you can imagine many scenarios where having such a script and
other tools like it would come in handy. Before long, if you’re anything like
me, you’ll have built up quite the collection of them, each in true Unix fashion
built to do one thing—but to do it well.
You might assume that each command in a pipeline runs to completion before
the next one starts, with its output stored up and handed over in one go.
In reality, though, that’s not the case. All of the programs in the pipeline
chain run simultaneously, and data flows between them bit by bit—just like water
through a real pipe. While the second process is working with the first chunk
of information, the first process is generating another chunk; by the time the
first chunk is through to the third or fourth process in the pipeline, the first
process may be onto the third, tenth, or hundredth chunk.
The amazing thing about this concurrency is that the processes themselves
need know nothing about it. It’s all taken care of by the operating system and
the shell, leaving the individual process to worry only about fetching input
and producing output.
We can prove this concurrency by typing the following into our command
line:
$ sleep 5 | echo "hello, world"
hello, world
If the tasks were executed in series, we’d see nothing for five seconds, and
only then would hello, world appear on our screen. But instead, because the
echo command starts at the same time as sleep, we see the output immediately.
When we request more data from standard input—when calling $stdin.gets, for
example—Ruby will do one of two things. If it has input available in its buffer,
it will pass it on immediately. If it doesn’t, though, it will block, waiting until
the process before it in the pipeline has generated enough output.
This can be frustrating when the input you’re receiving is in many small
chunks, especially if those small chunks are slow to generate. One example
is the find command, which searches the filesystem for files matching given
conditions. It might generate hundreds of filenames per second, or it might
generate one per minute, depending on how many files you’re searching
through and how many of them match your conditions.
If we pipe the result of a find into this script, it will be a long time before the
script actually receives any input, and because this buffering happens at the
output stage, not the input stage, there’s nothing we can do about it. Our
supposedly concurrent pipeline sometimes doesn’t behave concurrently at
all.
While we have no control over the behavior of other programs, if we’re writing
programs ourselves that generate slow output like find does, then we can
remove this buffering by telling our standard output stream to behave
synchronously. To illustrate the change, here’s a script that uses the default
behavior and therefore has its output buffered:
stdout-async.rb
100.times do
  "hello world".each_char do |c|
    print c
    sleep 0.1
  end
  print "\n"
end
If we run this script and pipe its output into another program, such as cat,
then we’ll see the problem: nothing happens for a very, very long time. Because
we’re outputting a character only every 0.1 seconds, the script takes nearly
two minutes to finish, and buffering means we see none of its output until the
buffer is flushed. Making the output stream synchronous fixes this:
$stdout.sync = true

100.times do
  "hello world".each_char do |c|
    print c
    sleep 0.1
  end
  print "\n"
end
Here we set $stdout.sync to true, telling our standard output stream not to buffer
but instead to flush constantly. If we pipe the output of this script into cat,
we’ll see a character appear every 0.1 seconds. Although the script will take
the same amount of time in total to execute, the next program in the pipeline
will have the chance to work with the output immediately, potentially speeding
up the overall time the pipeline takes.
Wrapping Up
We’ve now looked at how to use standard input to obtain input from users’
keyboards, how to redirect the output of other programs into our own, and
how powerful text processing pipelines can be. We saw the value of small
tools that perform a single task and how they can be composed together in
different ways to perform complex text processing tasks. We learned how to
write scripts that can be directly executed and that can process standard
input as a stream and so can work with large quantities of input.
Shell One-Liners
We’ve looked at processing text in Ruby scripts, but there exists a stage of
text processing in which writing full-blown scripts isn’t the correct approach.
It might be because the problem you’re trying to solve is temporary, where
you don’t want the solution hanging around. It might be that the problem is
particularly lightweight or simple, unworthy of being committed to a file. Or
it might be that you’re in the early stages of formulating a solution and are
just trying to explore things for now.
Such processing pipelines will inevitably make use of standard Unix utilities,
such as cat, grep, cut, and so on. In fact, those utilities might actually be
sufficient—tasks like these are, after all, what they’re designed for. But it’s common
to encounter problems that get just a little too complex for them, or that for
some reason aren’t well suited to the way they work. At times like these, it
would be nice if we could introduce Ruby into this workflow, allowing us to
perform the more complex parts of the processing in a language that’s familiar
to us.
It turns out that Ruby comes with a whole host of features that make it a
cinch to integrate it into such workflows. First, we need to discover how we
can use it to execute code from the command line. Then we can explore
different ways to process input within pipelines and some tricks for avoiding
lengthy boilerplate—something that’s very important when we’re writing
scripts as many times as we run them!
This will execute the code found in foo.rb, but otherwise it won’t do anything
too special. If you’ve ever written Ruby on the command line, you’ll definitely
have started Ruby in this way.
What you might not know is that by passing options to the ruby command,
you can alter the behavior of the interpreter. There are three key options that
will make life much easier when writing one-liners in the shell. The first is
essential, freeing you from having to store code in files; the second and third
allow you to skip a lot of boilerplate code when working with input. Let’s take
a look at each of them in turn.
When it comes to using Ruby in the shell, this is hugely limiting. We don’t
want to have to store code in files; we want to be able to compose it on the
command line as we go.
By using the -e flag when invoking Ruby, we can execute code that we pass
in directly on the command line—removing the need to commit our script to
a file on disk. (It might be helpful to remember -e as standing for evaluate,
because Ruby evaluates the code contained within this option.)
The universal “hello world” example, then, would be as follows:
$ ruby -e 'puts "Hello world"'
Hello world
Any code that we could write in a script file can be passed on the command
line in this way. We could, though it wouldn’t be much fun, define classes
and methods, require libraries, and generally write a full-blown script, but
in all likelihood we’ll limit our code to relatively short snippets that just do a
couple of things. Indeed, this desire to keep things short will lead us to
make choices that favor terseness even over readability, which isn’t usually
the choice we make when writing scripts.
This is the first step toward being able to use Ruby in an ad hoc pipeline: it
frees us from having to write our scripts to the filesystem. The second step
is to be able to read from input. After all, if we want our script to be able to
behave as part of a pipeline, as we saw in the previous chapter, then it needs
to be able to read from standard input.
The obvious solution might be to read from STDIN in the code that we pass in
to Ruby, looping over it line by line as we did in the previous chapter:
$ printf "foo\nbar\n" | ruby -e 'STDIN.each { |line| puts line.upcase }'
FOO
BAR
But this is a bit clunky. Considering how often we’ll want to process input
line by line, it would be much nicer if we didn’t have to write this tedious
boilerplate every time. Luckily, we don’t. Ruby offers a shortcut for just this
use case.
The -n flag is that shortcut: it wraps the code we pass in an implicit loop
that reads standard input line by line. This means that the code we pass in
the -e argument is executed once for each line in our input. The content of
the line is stored in the $_ variable. This is one of Ruby’s many global
variables, sometimes referred to as cryptic globals, and it always points to
the last line that was read by gets.
$ printf "foo\nbar\n" | ruby -ne 'print'
foo
bar
This implicit behavior is particularly useful for filtering down the input to
only those lines that match a certain condition—only those that start with f,
for example:
$ printf "foo\nbar\n" | ruby -ne 'print if $_.start_with? "f"'
foo
This kind of conditional output can be made even more terse with another
shortcut. As well as print, regular expressions also operate implicitly on $_.
We’ll be covering regular expressions in depth in Chapter 8, Regular
Expressions Basics, on page 103, but if in the previous example we changed our
start_with? call to use a regular expression instead, it would read:
This one-liner is brief almost to the point of being magical; the subjects of
both the print statement and the if are completely implicit. But one-liners like
this are optimized more for typing speed than for clarity, and so tricks like
this—which have a subtlety that might be frowned upon in more permanent
scripts—are a boon.
There are also shortcut methods for manipulating input. If we invoke Ruby
with either the -n or -p flag, Ruby creates two global methods for us: sub and
gsub. These act just like their ordinary string counterparts, but they operate
on $_ implicitly.
This means we can perform search and replace operations on our lines of
input in a really simple way. For example, to replace all instances of COBOL
with Ruby:
$ echo 'COBOL is the best!' | ruby -ne 'print gsub("COBOL", "Ruby")'
Ruby is the best!
We didn’t need to call $_.gsub, as you might expect, since the gsub method
operates on $_ automatically. This is a really handy shortcut.
Sulphurous Acid.
Water.
Black Varnish.
PRACTICAL DETAILS
OF THE
POSITIVE
OR
AMBROTYPE PROCESS.
CHAPTER IV.
Manipulations.
The glass is held between the thumb and forefinger of the left
hand by the corner 1, Fig. A., 3 and 4 towards and nearest the body,
and as nearly level as possible. I find this the best position to hold
the glass; as, in the case of the larger ones, they can be rested on
the end of the little finger, which should be placed as near the edge
as possible. Then, from the collodion vial, pour on the collodion,
commencing a little beyond the centre and towards 1, continuing
pouring in the same place until the collodion nearly reaches the
thumb—the glass slightly inclined that way; then let the glass incline
towards 4, and continue to pour towards 2.
As soon as enough has been put on to liberally flow the glass,
rapidly and steadily raise corner 1, and hold it directly over 3, where
the excess will flow off into the mouth of the vial, which should be
placed there to receive it. In case of a speck of dust falling at the
time of coating, it can often be prevented from injuring the surface
by changing the direction of the flowing collodion, so as to stop it in
some place where it will not be seen when the picture is finished.
Now, with the thumb and finger of the right hand, I wipe off any
drops or lines of collodion that may be found upon the outer edge or
side of the glass, being careful not to disturb that connected with
the face.
When the coating has become sufficiently dry, so that when I put
my finger against it, it does not break the film, but only leaves a
print, I put it into the silvering bath [see Fig. p. 34]. I generally try
corners 2 and 3. The time, from the first commencement of pouring
on collodion to its being put into the bath, should not exceed about
half a minute, at a temperature of 60°. The finger test is the best I
have found. The glass is to be rested on a dipper [see Fig. p. 34],
and placed steadily and firmly into the nitrate of silver bath—this in a
dark room. It should not be allowed to rest for an instant as it is
entering the solution, or it would cause a line. The time for the glass
to remain in the bath depends upon the age and amount of silver
the bath contains; for a new solution, from two to three minutes will
be sufficient to give the proper action. If it be old, three to five
minutes will be better. When it is properly coated, it can be raised up
and taken by the corner, and allowed to drain for a few seconds, and
then should be placed in the tablet, and is ready for the camera. The
time of exposure will depend upon the amount of light present. If
the bath is newly mixed, and the collodion recently iodized, it should
produce a sufficiently strong impression by an exposure of about
one-third of the time required for a daguerreotype. If the collodion
has been iodized some time, and the bath is old, about one-half of
the time necessary to produce a daguerreian image will be required.
The plate should in no case be allowed to become dry from the
time it is taken from the bath up to the time of pouring on the
developer. At a temperature of about 70°, I have had the glass out
of the bath ten minutes without drying. After exposure, the glass
should be taken again into the dark room, and removed from the
tablet and held over a sink, pail, or basin and the developing solution
poured on it as follows: hold the glass between the thumb and
finger of the left hand, by the opposite end corner from that in
coating with collodion, i. e., 2, and let 3 and 4 be from you.
Commence pouring on the developing solution at the end by the thumb, and
let it flow quickly and
evenly over the entire surface, the first flooding washing off any
excess of nitrate of silver there may be about the edges or corners
of the glass (if this silver is not washed off, it flows over the edges
and on the surface of the impression, producing white wavy clouds
of scum), and then hold the glass as nearly level as possible, it
having upon its surface a thin covering of solution (care should be
observed not to pour the developing solution on the plate in one
place, as it would remove all the nitrate of silver and prevent the
development of the image, leaving only a dark or black spot where it
is poured on). Put down the bottle containing the developing
solution, and take up a quart pitcher previously filled with water, and
as soon as the outline of the image can be plainly seen by the weak
or subdued light of an oil or fluid lamp or candle, pour the water
over copiously and rapidly. Continue this until all the iron solution
has been removed. If this is not done, the plate will be covered with
blue scum on the application of the washing solution. Then the glass
can be taken into a light room, and the iodide of silver coating
washed off with the cyanide solution, and then rinsed with clear
pure water, and stood in a position to drain and dry. I place a little
blotting paper under them: it aids in absorbing the water, and
facilitates the operation.
Place the face of the glass against the wall, in order to prevent
dust from falling upon it. I have often dried the coating by holding or
standing the glass adjacent to a stove. A steady heat is advisable, as
it leaves the surface in a more perfect state, and free from any
scum. After the coating is perfectly dry, it is ready for the preserving
process. It should be warmed evenly, and when about milk warm,
"Humphrey's Collodion Gilding" is poured on the image in precisely
the same manner as the collodion. In a few seconds the coating
sets, and after three-quarters of a minute, if it has not become dry,
the blaze of a spirit lamp may be applied to the back and it will
immediately become perfectly transparent, and nearly as hard as the
glass itself: the effect is fully equal, if not superior, to that of chloride
of gold in gilding the daguerreotype image. The surface becomes
brilliant and permanent. The back of the glass can now be wiped
and cleaned with paper or cloth, and gently warmed, and then with
a common small brush one coat of black varnish can be applied. This
brush should be drawn from side to side across the glass, and on the
side opposite to that which has received the image.
This is in order not to make streaks in the coating of varnish, but
to have uniform lines across the entire length or breadth of the
glass. If the varnish is of the proper consistency, it will flow into a
smooth, even coating. After this first coating is dry, apply a second
in the same manner, only in an opposite direction, so as to cross the
lines of the first, uniting at right angles; when this last coating is
very nearly dry, a piece of paper, glazed black on one side, and cut
to the proper size, can be put next the varnish; it gives it a clean
finish, at the same time that it aids towards a dense blackening.
I sometimes apply the black varnish by flowing, in the same
manner as in putting on the collodion.
This picture is to be colored and put up in the same manner as
the daguerreotype image, with a mat and glass. The last glass may
be dispensed with by first using the collodion gilding, and then upon
its surface apply the black varnish, as before. In this case the image
is seen through the same glass it is on, and without being reversed:
in this case the mat goes on the outside of the glass.
When the image is seen through the glass upon which it is taken,
it cannot be colored with very great success, as it cannot be seen
through the reduced silver forming it. This forms a more or less
opaque surface; but in point of economy the single glass is
preferable. Yet I would not recommend such economy, for I consider
that a good impression ought to be well put up, and the welfare of
the art fully substantiates that consideration.
Many ways have been devised for putting up pictures. I have
produced pleasing effects upon colored glasses: for instance, a
picture on a light purple glass has a very pleasing effect; also in
some other colors. I have also used patent leather for backing the
image.
I have produced curious and interesting results by placing a piece
of white paper, or coloring white the back of the whites of the
image, and then blackening over or around this. By this means the
whites are preserved very clear.
Positives for Pins, Lockets, etc.—I employ mica for floating the
collodion on, as it can be as easily cut and fitted as the metallic plate
in the daguerreotype; and positives taken upon fine, clear,
transparent mica, are fully equal to those taken upon glass, and yet
they are ambrotypes.
Mica is an article familiar to every one, as being used in stoves,
gratings, etc.
The method of using it, is to take the impression on a thick piece,
and then split it off, which can readily be done in the most perfect,
thin, transparent plates; it is equally as thin as tissue paper, and can
be cut as easily. The thickness of the piece upon which the
impression is taken is of no moment, since it can be reduced at
pleasure and is more easily handled while thick.
PRACTICAL DETAILS
OF THE
NEGATIVE PROCESS.
CHAPTER V.
Negative Process.
The method for preparing this has been given in page 41. It is
prepared in the same manner for both positives and negatives.
Plain Collodion.
Re-developing Solution.
This solution is for the purpose of giving increased intensity to
the negative, but as its use in the hands of beginners is attended
with some difficulty, I would not recommend the operator to try it
until he has had considerable experience in the developing process,
or he will undoubtedly spoil his proofs. Its use requires promptness
of action and quick observation.
The following is the formula for its preparation:
Water 4 ounces.
Protosulphate of iron 400 grains.
Put this into a bottle, and when the crystals are dissolved, it is
ready for use. It should be kept filtered, and can be used only once.
Now in another bottle put
Water 4 ounces.
Nitrate of silver 48 grains.
Remarks.—The impression is to be well washed after the
developing solution has been poured off, and then the re-developing
solution (that portion containing the protosulphate of iron) can be
poured on—the plate being held perfectly level: the surface is
completely covered; the water containing the nitrate of silver should
then be poured rapidly on, to mix with the iron, when the surface of
the impression will instantly commence to blacken; and if the action
be allowed to continue for a lengthened period, say one minute, the
impression will be ruined.
It is a matter worthy of notice, that there is no perceptible action
when the iron solution is poured over the glass; but the action is
very energetic the instant the nitrate of silver solution comes in
contact with the iron salt and the silver.
As soon as any change can be observed, after the re-developer
has been poured over the plate, it should be quickly and copiously
washed off with clean water, and then it is ready for the fixing
process.
I would dissuade novices in the art from practising with the re-
developing solution, until they have first thoroughly mastered the
entire process of taking negatives. The developing solution is the
only one used by operators generally, and will, with proper care,
produce satisfactory results.
Water 8 ounces.
Hyposulphite of soda 4 ounces.
This is done with the same material, and in the same manner, as
that given for positives—page 134.
Remarks.—The glass negatives, when not wanted for use, should
be carefully put aside in a box, and kept free from dust and
dampness: by so doing, it is believed that they will remain good for
any length of time.
Nitrate of Silver Bath.
This solution differs only from the positive bath, by omitting the
nitric acid: in all other respects it is precisely the same, and is
prepared by the same formula, as given at page 64.
This is called the neutral bath, and is best adapted to the
negative process. The nitrate of silver employed in its preparation
should be perfectly free from excess of nitric acid, otherwise the
whole solution will be slightly acid.
If it should not be convenient to obtain nitrate of silver without
this objection, the acid may be neutralized by putting into the
solution a small quantity of common washing soda—say 1 grain to
each 100 grains of nitrate of silver—previously dissolved in about
half an ounce of water. This may be put in at the same time that the
iodide of potassium is, and it would save one filtration.
In twenty samples of nitrate of silver that I have tried the above
quantity of soda has been found sufficient; if, however, the white
precipitate first formed is re-dissolved on shaking the mixture, free
nitric acid is present, and more of the soda may be added.
This bath will improve by age, and be less liable to fog after
having been in constant use for one or two weeks.
Operators who have the means, and design following the art
professionally, will find it to their advantage to make from two to
three times the quantity of solution they require for immediate use:
by this means they will be enabled to replenish their stock, which
may be used up or otherwise lost.
PRACTICAL DETAILS
OF THE
PRINTING PROCESS.
CHAPTER VI.
Salting Paper.
Water 1 quart.
Muriate of ammonia 65 grains.
Silvering Paper.
Water 2¼ ounces.
Nitrate of silver 75 grains.
Dissolve (in a 4-ounce vial) the nitrate of silver in the water, and
then pour one-fourth of the solution into an ounce graduate or any
convenient vessel: this keep for farther use in preventing the
presence of an excess of ammonia. Now, into the bottle containing
the three-fourths put about 4 drops of aqua-ammonia; shake well
and a brown precipitate will be given. Continue adding the ammonia,
drop by drop, and shake after each addition, until the brown
precipitate is re-dissolved and the solution is clear; then pour back
into the bottle the one-fourth taken out at first: this will leave the
solution slightly turbid, and when so, there is no excess of ammonia
which would be objectionable. It may now be filtered through
filtering paper, and it (the clear liquid) is ready for use. This should
be kept in the dark, as it decomposes rapidly when exposed to light.
The method of silvering the paper with ammonio-nitrate of silver,
is as follows: take a tuft of clean cotton, roll it into a ball-shape, then
wet it by holding it against the mouth of the bottle containing the
ammonio-nitrate, and when well wet, apply it to the paper (which
should be placed flat on a clean board) by gently rubbing it over the
surface, care being taken not to roughen it.
If the solution has not been filtered for some time, it would be
advisable to pour a little on the centre of the paper, and then
distribute it over the surface by means of the cotton, which is held in
the fingers: by this last method any sediment which may be in the
bottom of the bottle is prevented from getting upon the paper, and
causing spots.
I have used a brush for the purpose of distributing the solution,
by which plan there is less liability of getting it on the fingers and
staining them. Care must be taken to cover the entire surface of the
paper, or there will be light streaks, occasioned by the absence of
the silvering solution.
This want of silver will appear on the paper in light parts, as seen
in the accompanying cut:
Fig. 36.
After the paper has been perfectly coated, or washed with the
silvering solution, it should be placed in a perpendicular position to
dry. I usually tack the paper on a board of the requisite size, and
then stand it on one edge until it has drained and dried. As soon as
dry, it is ready for use. This paper will not keep more than twelve
hours, therefore the operator should silver in the morning the
quantity required for the day. It is imperatively necessary that the
silvered paper be kept in the dark. It is extremely sensitive to light,
and a very brief exposure of the prepared sheet would render it unfit
for use.
The several kinds of apparatus used for holding the negative and
the sensitive paper together, have already been given on page 36,
Figs. 31, 32, 33. The paper having been salted and silvered, as just
described, should be placed on the pad of the printing frame or
glasses, with its sensitive surface up, and then the negative placed
directly upon and in contact with it; then it is to be fastened
together, when it will be ready for exposure to the direct rays of the
sun. From 10 to 40 seconds will be found enough to give a
sufficiently intense print.
The paper first changes to a slate color, and then to a brown or
copper color; when of a dark slate color is about the proper time to
take it out and immerse in the toning bath.
Mounting of Positives.