Lectures on
Mathematical Computing
with Python
Jay Gopalakrishnan
• Share: copy and redistribute the material in any medium or format, and
• Adapt: remix, transform, and build upon the material for any purpose, even commercially.
The licensor cannot revoke these freedoms as long as you follow the license terms.
Under the following terms:
• Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes
were made. You may do so in any reasonable manner, but not in any way that suggests the licensor
endorses you or your use.
• ShareAlike: If you remix, transform, or build upon the material, you must distribute your contributions
under the same license as the original.
• No additional restrictions: You may not apply legal terms or technological measures that legally restrict
others from doing anything the license permits.
Recommended citation:
Gopalakrishnan, J., Lectures on Mathematical Computing with Python, PDXOpen: Open Educational Resource
28, Portland State University Library, DOI: 10.15760/pdxopen-28, July 2020.
Preface
These lectures were prepared for a class of (mostly) second year mathematics and statis-
tics undergraduate students at Portland State University during Spring 2020. The term
was unlike any other. The onslaught of COVID-19 moved the course meetings online, an
emergency transition that few of us were prepared for. Many lectures reflect our preoccu-
pations with the damage inflicted by the virus. I have not attempted to edit these out since
I felt that a utilitarian course on computing need not be divorced from the real world.
These materials offer class activities for studying basics of mathematical computing using
the python programming language, with glimpses into modern topics in scientific com-
putation and data science. The lectures attempt to illustrate computational thinking by
examples. They do not attempt to introduce programming from the ground up, although
students, by necessity, will learn programming skills from external materials. In my expe-
rience, students are able and eager to learn programming by themselves using the abun-
dant free online resources that introduce python programming. In particular, my students
and I found the two (free and online) books of Jake VanderPlas invaluable. Many sec-
tions of these two books, hyperlinked throughout these lectures, were assigned as required
preparatory reading materials during the course (see List of Preparatory Materials).
Materials usually covered in a first undergraduate linear algebra course and in a one-
variable differential calculus course form mathematical prerequisites for some lectures.
Concepts like convergence may not be covered rigorously in such prerequisites, but I have
not shied away from talking about them: I feel it is entirely appropriate that a first en-
counter with such concepts is via computation.
Each lecture has a date of preparation. It may help the reader understand the context in
relation to current events and news headlines. The timestamp also serves as an indicator
of the state of the modules in the ever-changing python ecosystem of modules for scientific
computation. The specific version numbers of the modules used are listed overleaf. The
codes may need tinkering with to ensure compatibility with future versions. The materials
are best viewed as offering a starting point for your own adaptation.
If you are an instructor declaring these materials as a resource in your course syllabus, I
would be happy to provide any needed solutions to exercises or datafiles. If you find errors
please alert me. If you wish to contribute by updating or adding materials, please fork the
public GitHub Repository where these materials reside and send me a pull request.
Jay Gopalakrishnan
(gjay@pdx.edu)
Software Requirements:
• Python >= 3.7
• Jupyter >= 1
Main modules used:
• cartopy==0.18.0b2.dev48+
• geopandas==0.7.0
• gitpython==3.1.0
• matplotlib==3.2.1
• numpy==1.18.2
• pandas==1.0.4
• scipy==1.4.1
• scikit-learn==0.23.1
• seaborn==0.10.0
• spacy==2.2.4
Other (optional) facilities used include line_profiler, memory_profiler, numexpr, pandas-
datareader, and primesieve.
Table of Contents
Lecture Notebooks
• 01 Overview of some tools
• 02 Interacting with python
• 03 Working with git
• 04 Conversion table
• 05 Approximating derivatives
• 06 Genome of SARS-CoV-2 virus
• 07 Fibonacci primes
• 08 Numpy blitz
• 09 The SEIR model of infectious diseases
• 10 Singular value decomposition
• 11 Bikes on Tilikum Crossing
• 12 Visualizing geospatial data
• 13 Gambler’s ruin
• 14 Google’s PageRank
• 15 Supervised learning by regression
• 16 Unsupervised learning by PCA
• 17 Latent semantic analysis
Exercises
• Power sum
• Graph functions
• Argument passing
• Piecewise functions
• Row swap
• Averaging matrix
• Differentiation matrix
• Pairwise differences
• Hausdorff distance
• k-nearest neighbors
• Predator-prey model
• Column space
• Null space
• Pandas from dictionaries
• Iris flowers
• Stock prices
• Passengers on the Titanic
• Animate functions
• Insurance company
• Probabilities on small graphs
• Ehrenfest urns
• Power method for large graphs
• Google’s toy graph
• Atmospheric carbon dioxide
• Ovarian cancer data
• Eigenfaces
• Word vectors
Projects
• Bisection
• Rise of CO2 in the atmosphere
• COVID-19 cases in the west coast
• World map of COVID-19 cases
• Neighbor’s color
List of Preparatory Materials for Each Activity
The activities in the table of contents are enumerated again below in a linear ordering with
hyperlinks to external online preparatory materials for each.
Required Preparation                         Activity

Learn sorting, partitioning [JV-H], and      Exercise: Row Swap
quick ways to make matrices from             Exercise: Averaging Matrix
[numpy.org].                                 Exercise: Differentiation Matrix

Be acquainted with scipy.sparse’s matrix     Exercise: Power method for large graphs
format, specifically COO and CSR             Exercise: Google’s toy graph
formats.
I
Overview of some tools
This lecture is an introductory overview to give you a sense of the broad utility of a few
python tools you will encounter in later lectures. Each lecture or class activity is guided by
a Jupyter Notebook (like this document), which combines executable code, mathematical
formulae, and text notes. This overview notebook also serves to check and verify that you
have a working installation of some of the python modules we will need to use later. We
shall delve into basic programming using python (after this overview and a few further
start-up notes) starting from a later lecture.
The abilities to program, analyze, and compute with data are life skills. They are useful well
beyond your mathematics curriculum. To illustrate this claim, let us begin by considering
the most pressing current issue in our minds as we begin these lectures: the progression
of COVID-19 disease worldwide. The skills you will learn in depth later can be applied
to understand many types of data, including the data on COVID-19 disease progression.
In this overview, we shall use a few python tools to quickly obtain and visualize data on
COVID-19 disease worldwide. The live data on COVID-19 (which is changing in as yet
unknown ways) will also be used in several later activities.
Specifically, this notebook contains all the code needed to perform these tasks:
• download today’s data on COVID-19 from a cloud repository,
• make a data frame object out of the data,
• use a geospatial module to put the data on a world map,
• download county maps from US Census Bureau, and
• visualize the COVID-19 data restricted to Oregon.
The material here is intended just to give you an overview of the various tools we will learn
in depth later. There is no expectation that you can immediately digest the code here. The
goal of this overview is merely to whet your appetite and motivate you to allocate time to
learn the materials yet to come.
Please install these modules if you do not have them already. (If you do not have these
installed, attempting to run the next cell will give you an error.)
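Before running the cells that follow, it can help to check which of the needed modules are importable. The sketch below is one way to do so using only the standard library. The list of module names is an assumption based on the modules mentioned in this overview (note that gitpython is imported under the name git), and the helper name missing_modules is hypothetical.

```python
import importlib.util

# Import names (an assumed list): gitpython is imported as "git".
required = ["pandas", "geopandas", "matplotlib", "git"]

def missing_modules(names):
    """Return those names for which no importable module can be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]

print("Modules still to be installed:", missing_modules(required))
```

An empty list in the output suggests that the listed modules can at least be found by your python installation.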
[2]: # your local folder into which you want to download the covid data
covidfolder = '../../data_external/covid19'
Remember this location where you have stored the COVID-19 data. You will need to return
to it when you use the data during activities in later days, including assignment projects.
covidfolder)
datadir = repo.working_dir + '/csse_covid_19_data/csse_covid_19_daily_reports'
The folder datadir contains many files (all of which can be listed here using the command
os.listdir(datadir) if needed). The filenames begin with a date like 03-27-2020 and
end in .csv. The suffix csv stands for “comma separated values”, a common
simple format for storing uncompressed data.
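Because the filenames encode dates, they can be ordered chronologically once the date part is parsed. Here is a minimal sketch, assuming the month-day-year pattern described above. The short list of filenames is made up for illustration (in practice it would come from os.listdir(datadir)), and the helper name report_dates is hypothetical.

```python
from datetime import datetime

# Hypothetical sample of what os.listdir(datadir) might return.
filenames = ["03-27-2020.csv", "README.md", "01-22-2020.csv", "03-30-2020.csv"]

def report_dates(names):
    """Map each date-named csv file to its date, skipping other files."""
    dates = {}
    for name in names:
        stem, dot, suffix = name.rpartition(".")
        if suffix != "csv":
            continue
        try:
            dates[name] = datetime.strptime(stem, "%m-%d-%Y").date()
        except ValueError:
            pass  # a csv file whose name is not a date
    return dates

dates = report_dates(filenames)
latest = max(dates, key=dates.get)  # filename with the most recent date
print(latest)
```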
I.3 Examine the data for a specific date
The python module pandas, the workhorse for all data science tasks in python, can make a
DataFrame object out of each such .csv file. You will learn more about pandas later in the
course. For now, let us pick a recent date, say March 27, 2020, and examine the COVID-19
data for that date.
[4]: c = pd.read_csv(datadir+'/03-27-2020.csv')
The DataFrame object c has over 3000 rows. An examination of the first five rows already
tells us a lot about the data layout:
[5]: c.head()
Combined_Key
0 Abbeville, South Carolina, US
1 Acadia, Louisiana, US
2 Accomack, Virginia, US
3 Ada, Idaho, US
4 Adair, Iowa, US
Note that depending on how the output is rendered where you are reading this, the later
columns may be line-wrapped or may be visible only after scrolling to the edges. This
object c, whose head part is printed above, looks like a structured array. There are features
corresponding to locations, as specified in latitude Lat and longitude Long_. The columns
Confirmed, Deaths, and Recovered represent the numbers of confirmed cases, deaths, and
recovered cases due to COVID-19 at a corresponding location.
The geopandas module (gpd) is well-suited for visualizing geospatial data. It is built on top of the pandas library.
So it is easy to convert our pandas object c to a geopandas object.
Combined_Key geometry
0 Abbeville, South Carolina, US POINT (-82.46171 34.22333)
1 Acadia, Louisiana, US POINT (-92.41420 30.29506)
2 Accomack, Virginia, US POINT (-75.63235 37.76707)
3 Ada, Idaho, US POINT (-116.24155 43.45266)
4 Adair, Iowa, US POINT (-94.47106 41.33076)
The only difference between gc and c is the last column, which contains the new geometry
objects representing points on the globe. Next, in order to place markers at these points on
a map of the world, we need to get a simple low resolution world map:
You can download and use maps with better resolution from Natural Earth, but that would
be too much of a digression for this overview. On top of the above low resolution map, we
can now put the markers whose sizes are proportional to the number of confirmed cases.
These python tools have made it incredibly easy for us to immediately identify the COVID-
19 trouble spots in the world. Moreover, these visualizations can be updated easily by
re-running this code as data becomes available for other days.
[9]: co = c[c['Province_State']=='Oregon']
The variable co now contains the data restricted to Oregon. However, we are now pre-
sented with a problem. To visualize the restricted data, we need a map of Oregon. The
module geopandas does not carry any information about Oregon and its counties. How-
ever this information is available from the United States Census Bureau. (By the way, the
2020 census is happening now! Do not forget to respond to their survey. They are one of
our authoritative sources of quality data.)
To visualize the COVID-19 information on a map of Oregon, we need to get the county
boundary information from the census bureau. This illustrates a common situation that
arises when trying to analyze data: it is often necessary to procure and merge data from
multiple sources in order to understand a real-world phenomenon.
A quick internet search reveals the census page with county information. The information
is available in an online file cb_2018_us_county_500k.zip at the URL below. Python al-
lows you to download this file using its urllib module without even needing to leave this
notebook.
[10]: import os
import shutil
import urllib.request

# url of the data
census_url = ('https://www2.census.gov/geo/tiger/GENZ2018/shp/'
              'cb_2018_us_county_500k.zip')

# location of your download
your_download_folder = '../../data_external'
if not os.path.isdir(your_download_folder):
    os.mkdir(your_download_folder)
us_county_file = your_download_folder + '/cb_2018_us_county_500k.zip'

# download the zip file and save a local copy
with urllib.request.urlopen(census_url) as response, \
        open(us_county_file, 'wb') as out_file:
    shutil.copyfileobj(response, out_file)
Now, your local computer has a zip file, which has among its contents, files with geometry
information on the county boundaries, which can be read by geopandas. We let geopandas
directly read in the zip file: it knows which information to extract from the zip archive to
make a data frame with geometry.
AWATER geometry
0 69473325 POLYGON ((-89.18137 37.04630, -89.17938 37.053...
1 4829777 POLYGON ((-84.44266 38.28324, -84.44114 38.283...
2 13943044 POLYGON ((-86.94486 37.07341, -86.94346 37.074...
3 6516335 POLYGON ((-84.12662 37.64540, -84.12483 37.646...
4 7182793 POLYGON ((-83.98428 38.44549, -83.98246 38.450...
The object us_counties has information about all the counties. Now, we need to re-
strict this data to just that of Oregon. Looking at the columns, we find something called
STATEFP. Searching through the government pages, we find that STATEFP refers to a 2-
character state FIPS code. The FIPS code refers to Federal Information Processing Standard
which was a “standard” at one time, then deemed obsolete, but still continues to be used
today. All that aside, it suffices to note that Oregon’s FIPS code is 41. Once we know this,
python makes it easy to restrict the data to Oregon:
Now we have the Oregon data in two data frames, ore and co. We must combine the two
data frames. This is again a situation so often encountered when dealing with real data
that there is a facility for it in pandas called merge. Both data frames have FIPS codes: in ore you
find it under column GEOID, and in co you find it called FIPS. The merged data frame is
represented by the variable orco below:
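Keys coming from different sources often differ in type even when they denote the same thing, and they must be brought to a common form before a merge can match them. The following sketch illustrates the idea in plain python, without pandas. It assumes, hypothetically, that one source stores the county code as a string such as '41001' and the other as a number such as 41001.0. The records, the counts, and the helper names normalize_fips and inner_join are all invented for illustration.

```python
def normalize_fips(value):
    """Convert '41001', 41001 or 41001.0 to the 5-character string '41001'."""
    return str(int(float(value))).zfill(5)

# Made-up records standing in for rows of the two data frames.
counties = [{"GEOID": "41001", "NAME": "County A"},
            {"GEOID": "41003", "NAME": "County B"}]
cases = [{"FIPS": 41001.0, "Confirmed": 10},
         {"FIPS": 41003.0, "Confirmed": 20}]

def inner_join(left, right, left_key, right_key):
    """Combine rows of left and right whose normalized keys agree."""
    lookup = {normalize_fips(row[right_key]): row for row in right}
    merged = []
    for row in left:
        key = normalize_fips(row[left_key])
        if key in lookup:
            merged.append({**row, **lookup[key]})
    return merged

for row in inner_join(counties, cases, "GEOID", "FIPS"):
    print(row["NAME"], row["Confirmed"])
```

With pandas, the analogous precaution is to convert both key columns to a common type before calling merge.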
The orco object now has both the geometry information as well as the COVID-19 informa-
tion, making it extremely easy to visualize.
This is an example of a choropleth map, a map where regions are colored or shaded in
proportion to some data variable. It is an often-used data visualization tool.
[Figure: Confirmed COVID-19 cases until 2020-03-30 for Oregon, Washington, and California, plotted against dates.]
How does the progression of infections in New York compare with Hubei where the dis-
ease started? Again the answer based on the data we have up to today is easy to extract,
and is displayed next.
[Figure: progression of infections in New York and Hubei, plotted against dates.]
Of course, the COVID-19 situation is evolving, so these figures are immediately outdated
after today’s class. This situation is evolving in as yet unknown ways. I am sure that you,
like me, want to know more about how these plots will change in the next few months.
You will be able to generate plots like this and learn many more data analysis skills from
these lectures. As you amass more technical skills, let me encourage you to answer your
own questions on COVID-19 by returning to this overview, pulling the most recent data,
and modifying the code here to your needs. In fact, some later assignments will require
you to work further with this Johns Hopkins COVID-19 worldwide dataset. Visualizing
the COVID-19 data for any other state, or indeed, any other region in the world, is easily
accomplished by some small modifications to the code of this lecture.
II
Interacting with Python
Note the following from the interactive session displayed in the figure above:
• Computing the square root of a number using sqrt is successful only after import-
ing math. Most of the functionality in Python is provided by modules, like the math
module. Some modules, like math, come with python, while others must be installed
after python is installed.
• Strings that begin with # (like “# works!” in the figure) differentiate comments from
code. This is the case in a python shell and also in the other forms of interacting with
python discussed below.
• The dir command shows the facilities provided by a module. As you can see, the
math module contains many functions in addition to sqrt.
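The points above can be tried out in a few lines, either in a python shell or in any of the other interfaces discussed below. This is only a minimal sketch, and the variable names are arbitrary.

```python
import math

x = math.sqrt(2)           # works only after importing math
print(x)

# dir lists the names a module provides; keep the public ones
names = [name for name in dir(math) if not name.startswith("_")]
print(len(names), "names, including:", names[:5])
```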
II.3 Jupyter Notebook
The Jupyter notebook is a web-browser based graphical environment consisting of cells,
each of which can contain either code or text. The text cells should contain text in markdown syntax,
which allows you to type not just words in bold and italic, but also tables, mathematical
formulae using latex, etc. The code cells of Jupyter can contain code in various languages,
but here we will exclusively focus on code cells with Python 3.
For example, this block of text that begins with this sentence marks the beginning of a
jupyter notebook cell containing markdown content. If you are viewing this from jupyter,
click on jupyter’s top menu -> Cell -> Cell Type to see what is the type of the current cell,
or to change the cell type. Computations must be done in a code cell, not a markdown cell.
For example, to compute $\cos(\pi\sqrt{\pi})^7$,
we open a code cell next with the following two lines of python code:
[1]: from math import cos, sqrt, pi
cos(pi*sqrt(pi))**7
[1]: 0.14008146171564725
This seamless integration of text and code makes Jupyter attractive for developing a repro-
ducible environment for scientific computing.
II.4 Python file
Open your favorite text editor, type in some python code, and then save the file as
myfirstpy.py. Here is a simple example of such a file.
#------- myfirstpy.py ---------------------------------
from math import cos, sqrt, pi
The above output cell should display the same output as what one would have obtained
if we executed the python file on the command line.
For larger projects (including take-home assignments), you will need to create such python
files with many lines of python code. Therefore it is essential that you know how to create
and execute python files in your system.
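As a sketch of what a complete file of this kind might look like, consider the following. The function name compute is a hypothetical choice, and the computation is simply the one from the Jupyter example earlier in this lecture.

```python
#------- myfirstpy.py (a sketch) ----------------------
from math import cos, sqrt, pi

def compute():
    """Return cos(pi * sqrt(pi)) raised to the 7th power."""
    return cos(pi * sqrt(pi))**7

if __name__ == "__main__":
    # Runs only when the file is executed, not when it is imported.
    print(compute())
```

Saving this as myfirstpy.py and typing python3 myfirstpy.py (or python myfirstpy.py, depending on your installation) on the command line should print the same number as the Jupyter cell.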
III
Working with git
April 2, 2020
Git is a distributed version control system (and is a program often used independently of
python). A version control system tracks the history of changes in projects with many files,
including data files, and codes, which many people access simultaneously. Git facilitates
identification of changes made, fetching revisions from a cloud repository in git format,
and pushing revisions to the cloud.
GitHub is a cloud server that specializes in serving data in the form of git repositories.
Many other such cloud services exist, such as Atlassian’s BitBucket.
The notebooks that form these lectures are in a git repository served from GitHub. In this
notebook, we describe how to access materials from this remote git repository. We will
also use this opportunity to introduce some object-oriented terminology like classes, objects,
constructor, data members, and methods, which are pervasive in python. Those already
familiar with this terminology and GitHub may skip to the next activity.
III.2 Git Repo class in python
We shall use the python module gitpython to work with git. (We already used this module
in the first overview lecture.) The documentation of gitpython contains a lot of information
on how to use its facilities. The main facility is the class called Repo, which it uses
to represent git repositories.
A class has a special method called a constructor, which you would find listed among its
methods as __init__.
[3]: help(Repo.__init__)
:param path:
the path to either the root git directory or the bare git repo::
repo = Repo("/Users/mtrier/Development/git-python")
repo = Repo("/Users/mtrier/Development/git-python.git")
repo = Repo("~/Development/git-python.git")
repo = Repo("$REPOSITORIES/Development/git-python.git")
repo = Repo("C:\Users\mtrier\Development\git-python\.git")
Please note that this was the default behaviour in older versions of GitPython,
which is considered a bug though.
:raise InvalidGitRepositoryError:
:raise NoSuchPathError:
:return: git.Repo
The __init__ method is called when you type in Repo(...) with the arguments allowed
in __init__. Below, we will see how to initialize a Repo object using our github repository.
[5]: import os
os.path.abspath(coursefolder)
[5]: '/Users/Jay/tmpdir'
Please double-check that the output is what you expected on your operating system: if not,
please go back and revise coursefolder before proceeding. (Windows users should see
forward slashes converted to double backslashes, while mac and linux users will usually
retain the forward slashes.)
We proceed to download the course materials from GitHub. These materials will be stored
in a subfolder of coursefolder called mth271content, which is the name of the git reposi-
tory.
[6]: repodir = os.path.join(os.path.abspath(coursefolder), 'mth271content')
repodir # full path name of the subfolder
[6]: '/Users/Jay/tmpdir/mth271content'
Again, the value of the string variable repodir output above describes the location on your
computer where your copy of the course materials from GitHub will reside.
[7]: True
The output above should be False if you are running this notebook for the first time, per
my assumption above. When you run it after you have executed this notebook successfully
at least once, you would already have cloned the repository, so the folder will exist.
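The path manipulations used above can be tried safely in a temporary folder. This sketch uses the standard tempfile module so that nothing on your computer is overwritten. The subfolder name mth271content is taken from the text, while the call to os.mkdir merely stands in for cloning and is an assumption made for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as coursefolder:
    repodir = os.path.join(os.path.abspath(coursefolder), 'mth271content')
    before = os.path.isdir(repodir)   # False: nothing has been created yet
    os.mkdir(repodir)                 # stands in for cloning the repository
    after = os.path.isdir(repodir)    # True: the folder now exists
    print(before, after)
```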
• Repo.clone_from(...) calls the clone_from(...) method.
Now you have the updated course materials in your computer in a local folder. The object
repo stores information about this folder, which you gave to the constructor in the string
variable repodir, in a data member called working_dir. You can access any data members
of an object in memory, and you do so just like you access a method, using a dot . followed
by the member name. Here is an example:
[9]: repo.working_dir
[9]: '/Users/Jay/tmpdir/mth271content'
Note how the Repo object was either initialized with repodir (if that folder exists) or set to
clone a remote repository at a URL.
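The terminology of this lecture can be seen in miniature in the following toy class. It is not how gitpython implements Repo: the class name Folder and its members are hypothetical, chosen only to mirror the roles of a constructor, a data member, a method, and a way of creating an object by calling a method on the class itself, in the spirit of Repo.clone_from.

```python
import os

class Folder:
    """A toy class that remembers the location of a folder."""

    def __init__(self, path):
        # Constructor: called when you type Folder(path)
        self.working_dir = os.path.abspath(path)   # a data member

    def exists(self):
        # A method: a function attached to each object of the class
        return os.path.isdir(self.working_dir)

    @classmethod
    def create_in(cls, parent, name):
        # Called on the class itself, as in Folder.create_in(...)
        path = os.path.join(parent, name)
        os.makedirs(path, exist_ok=True)
        return cls(path)

f = Folder('.')            # construct an object
print(f.working_dir)       # access a data member with a dot
print(f.exists())          # call a method with a dot and parentheses
```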
IV
Conversion table
April 2, 2020
This elementary activity is intended to check and consolidate your understanding of very
basic python language features. It is modeled after a similar activity in [HPL] and involves
a simple temperature conversion formula. You may have seen kitchen cheat sheets (or have
one yourself) like the following:
                        Fahrenheit    Celsius
cool oven               200 F         90 C
very slow oven          250 F         120 C
slow oven               300-325 F     150-160 C
moderately slow oven    325-350 F     160-180 C
moderate oven           350-375 F     180-190 C
moderately hot oven     375-400 F     190-200 C
hot oven                400-450 F     200-230 C
very hot oven           450-500 F     230-260 C
This is modeled after a conversion table at the website Cooking Conversions for Old Time
Recipes, which many found particularly useful for translating old recipes from Europe.
Of course, the “old continent” has already moved on to the newer, more rational, metric
system, so all European recipes, old and new, are bound to have temperatures in Celsius
(C). Even if recipes don’t pique your interest, do know that every scientist must learn to
work with the metric system.
Celsius values can be converted to the Fahrenheit system by the formula
$$F = \frac{9}{5} C + 32.$$
The task in this activity is to print a table of F and C values per this formula. While ac-
complishing this task, you will recall basic python language features, like while loop, for
loop, range, print, list and tuples, zip, and list comprehension.
print('F C')               # header row
C = 0
while C <= 250:
    F = 9 * C / 5 + 32
    print(F, C)
    C += 10
F C
32.0 0
50.0 10
68.0 20
86.0 30
104.0 40
122.0 50
140.0 60
158.0 70
176.0 80
194.0 90
212.0 100
230.0 110
248.0 120
266.0 130
284.0 140
302.0 150
320.0 160
338.0 170
356.0 180
374.0 190
392.0 200
410.0 210
428.0 220
446.0 230
464.0 240
482.0 250
This cell shows how to add, multiply, assign, increment, print, and run a while loop. Such
basic language features are introduced very well in the prerequisite reading for this lecture,
the official python tutorial’s section titled “An informal introduction to Python.” (Note
that all pointers to prerequisite reading materials are listed together just after the table of
contents in the beginning.)
print('   F    C')         # header row
C = 0
while C <= 250:
    F = 9 * C / 5 + 32
    print('%4.0f %4.0f' % (F, C))
    C += 10
F C
32 0
50 10
68 20
86 30
104 40
122 50
140 60
158 70
176 80
194 90
212 100
230 110
248 120
266 130
284 140
302 150
320 160
338 170
356 180
374 190
392 200
410 210
428 220
446 230
464 240
482 250
F C
32 0
50 10
68 20
86 30
104 40
122 50
140 60
158 70
176 80
194 90
212 100
230 110
248 120
266 130
284 140
302 150
320 160
338 170
356 180
374 190
392 200
410 210
428 220
446 230
464 240
F C
-58 -50
-49 -45
-40 -40
-31 -35
-22 -30
-13 -25
-4 -20
5 -15
14 -10
23 -5
32 0
41 5
50 10
59 15
68 20
77 25
86 30
95 35
104 40
113 45
As you see from the output above, at −40 degrees, the Fahrenheit scale and the Celsius
scale coincide. If you have lived in Minnesota, you probably know what −40 feels like, and
you likely already know the fact we just discovered above (it’s common for Minnesotans
to throw around this tidbit while commiserating in the cold).
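The last table above is consistent with a for loop over a range of Celsius values. Here is a minimal sketch of such a loop. The bounds in range(-50, 50, 5) are inferred from the printed output rather than taken from the original cell, and the list fixed_points is a hypothetical addition that records where the two scales agree.

```python
# Celsius values from -50 to 45 in steps of 5 (bounds inferred from the table above)
print(' F C')
fixed_points = []          # hypothetical helper: Celsius values where F equals C
for C in range(-50, 50, 5):
    F = 9 * C / 5 + 32
    print('%4.0f %4.0f' % (F, C))
    if F == C:
        fixed_points.append(C)
print(fixed_points)
```

Note that range excludes its upper bound, which is why the table stops at 45 rather than 50.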
IV.5 Store in a list
If we want to use the above-printed tables later, we would have to run a loop again. Our
conversion problem is so small that there is no cost to run as many loops as we like, but
in many practical problems, loops contain expensive computations. So one often wants
to store the quantities computed in the loop in order to reuse them later. Lists are good
constructs for this.
First we should note that python has lists and also tuples. Only the former can be modified
after creation. Here is an example of a list:
You access a tuple element just like a list element, so Cs[0] will give the first element
whether or not Cs is a list or a tuple. But the statement Cs[0] = -10 that changes an
element of the container will work only if Cs is a list. We say that a list is mutable, while
a tuple is immutable. Tuples are generally faster than lists, but lists are more flexible than
tuples.
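To make the distinction concrete, here is a small sketch. The names Cs_list and Cs_tuple and their values are made up for illustration and are not from the original cells.

```python
Cs_list = [0, 10, 20]      # a list: square brackets
Cs_tuple = (0, 10, 20)     # a tuple: parentheses

# Element access looks the same for both containers
print(Cs_list[0], Cs_tuple[0])

# Assignment to an element works only for the mutable list
Cs_list[0] = -10
try:
    Cs_tuple[0] = -10
    tuple_changed = True
except TypeError as err:
    tuple_changed = False
    print('tuple assignment failed:', err)

print(Cs_list, Cs_tuple)
```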
Here is an example of how to store the computed C and F values within a loop into lists.
[0, 25, 50, 75, 100, 125, 150, 175, 200, 225]
[9]: print(Fs)
[32.0, 77.0, 122.0, 167.0, 212.0, 257.0, 302.0, 347.0, 392.0, 437.0]
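The two lists printed above could be produced by a loop along the following lines. This is a sketch: the step of 25 and the bounds are read off from the printed values, and the use of append is an assumption about how the original cell stored them.

```python
Cs = []   # Celsius values
Fs = []   # corresponding Fahrenheit values
for C in range(0, 250, 25):
    Cs.append(C)                  # grow each list by one element per iteration
    Fs.append(9 * C / 5 + 32)
print(Cs)
print(Fs)
```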
This is not as pretty an output as before. But we can easily run a loop and print the stored
values in any format we like. This is a good opportunity to show off a pythonic feature zip
that allows you to traverse two lists simultaneously:
[10]: print(' F C')
      for C, F in zip(Cs, Fs):
          print('%4.0f %4.0f' % (F, C))
F C
32 0
77 25
122 50
167 75
212 100
257 125
302 150
347 175
392 200
437 225
Note how this makes for compact code without sacrificing readability: constructs like this
are why you hear so much praise for python’s expressiveness. For mathematicians, the
list comprehension syntax is also reminiscent of the set notation in mathematics: the set
(list) Fs is described in mathematical notation by
$$\mathtt{Fs} = \left\{ \frac{9}{5}\, C + 32 \;:\; C \in \mathtt{Cs} \right\}.$$
Note how similar it is to the list comprehension code. (Feel free to check that the Fs computed
by the above one-liner is the same as the Fs we computed previously within a loop.)
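As a sketch of the check suggested here, the following builds Fs both ways and compares them. The name Fs_loop is introduced only for this comparison, and the Cs values are assumed to be the ones printed earlier.

```python
Cs = list(range(0, 250, 25))

# Fs built with an explicit loop
Fs_loop = []
for C in Cs:
    Fs_loop.append(9 * C / 5 + 32)

# Fs built with a list comprehension, mirroring the set notation
Fs = [9 * C / 5 + 32 for C in Cs]

print(Fs == Fs_loop)
```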
V
Approximating the derivative
April 7, 2020
In calculus, you learnt about the derivative and its central role in modeling processes
where a rate of change is important. How do we compute the derivative on a computer?
Recall what you did in your first calculus course to compute the derivative. You memorized
derivatives of simple functions like cos x, sin x, exp x, $x^n$, etc. Then you learnt rules
like the product rule, quotient rule, chain rule, etc. In the end you could systematically
compute derivatives of complicated functions by reducing them to simpler components and
applying the rules. We could teach the computer the same rules and come up with an
algorithm for computing derivatives. This is the idea behind automatic differentiation. Python
modules like sympy can compute derivatives symbolically in this fashion. However, this
approach has its limits.
In the real world, we often encounter complicated functions, such as functions that cannot
be represented in terms of simple component functions, or functions whose values you can
only query from some proprietary company code, or functions whose values are based off
a table, like for instance this function.
[Figure: Tesla stock price plotted against date, together with rolling weekly means.]
This function represents Tesla’s stock prices this year until yesterday (which I got, in case
you are curious, using just a few lines of python code). The function is complicated (not
to mention depressing - it reflects the market downturn due to the pandemic). But its
rate of change drives some investment decisions. Instead of the oscillatory daily stock
values, analysts often look at the rate of change of trend lines (like the rolling weekly
means above), a function certainly not expressible in terms of a few simple functions like
sines or cosines.
In this activity, we look at computing a numerical approximation to the derivative using
something you learnt in calculus.
$$f'(x) \approx \frac{f(x + h/2) - f(x - h/2)}{h}$$
Below is a plot of the tangent line of some function f at x, whose slope is f ′ ( x ), together
with the secant line whose slope is the approximation on the right hand side above. Clearly
as the spacing h decreases, the secant line becomes a better and better approximation to
the tangent line.
[Figure: graph of f(x) with its tangent line at x and a secant line.]
This is the Central Difference Formula for the second derivative.
The first task in this activity is to write a function to compute the above-stated second
derivative approximation,
$$\frac{f(x - h) - 2 f(x) + f(x + h)}{h^2},$$
given any function f of a single variable x. The parameter h should also be input, but can
take a default value of $10^{-6}$.
The prerequisite reading for this activity included python functions, keyword arguments,
positional arguments, and lambda functions. Let’s apply all of these concepts while computing
the derivative approximation. Note that python allows you to pass functions themselves
as arguments to other functions. Therefore, without knowing the specific function
f to which the central difference formula will be applied, we can write a generic function D2
implementing the formula for any f.
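A minimal sketch of such a generic function follows. The name D2 and the default value of h come from the text. The exact signature (f and x positional, h a keyword argument with a default) is an assumption, since the original cell is not reproduced here.

```python
from math import sin

def D2(f, x, h=1e-6):
    """Approximate f''(x) by the central difference formula with spacing h."""
    return (f(x - h) - 2 * f(x) + f(x + h)) / h**2

# f and x are passed positionally; h is passed by keyword, overriding the default
approx = D2(sin, 0.2, h=1e-4)
print(approx, -sin(0.2))
```

Because h has a default value, a call like D2(sin, 0.2) also works and uses h = 1e-6.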
Let’s apply the formula to some nice function, say the sine function.
D2(sin, 0.2)
[2]: -0.19864665468105613
Of course we know that the second derivative of sin(x) is the negative of itself, so a quick
test of correctness is to compare the above value to that of −sin(0.2).
[3]: -sin(0.2)
[3]: -0.19866933079506122
How do we apply D2 to, say, sin(2x )? One way is to define a function returning sin(2 ∗ x )
and then pass it to D2, as follows.
D2(g, 0.2)
[4]: -1.5576429035490946
An alternate way is using a lambda function. This gives a one-liner without damaging code
readability.
[5]: -1.5576429035490946
Of course, in either case the computed value approximates the actual second derivative of
sin(2x), namely −4 sin(2x), thus verifying our code.
[6]: -4*sin(2* 0.2) # actual 2nd derivative value
[6]: -1.557673369234602
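The two ways of passing sin(2x) described above can be sketched as follows. D2 is redefined here so that the block runs on its own, and its signature is the same assumed one as before. The names via_def, via_lambda, and exact are introduced only for this illustration.

```python
from math import sin

def D2(f, x, h=1e-6):
    """Central difference approximation of f''(x) (assumed signature)."""
    return (f(x - h) - 2 * f(x) + f(x + h)) / h**2

# Way 1: a named function
def g(x):
    return sin(2 * x)

via_def = D2(g, 0.2)

# Way 2: an anonymous (lambda) function defined inline
via_lambda = D2(lambda x: sin(2 * x), 0.2)

exact = -4 * sin(2 * 0.2)
print(via_def, via_lambda, exact)
```

Both calls perform the same arithmetic, so they return the same approximation.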
V.3 Error
The error in the approximation formula we just implemented is
$$\varepsilon(x) = f''(x) - \frac{f(x - h) - 2 f(x) + f(x + h)}{h^2}.$$
Although we can’t know the error ε( x ) without knowing the true value f ′′ ( x ), calculus
gives you all the tools to bound this error.
Substituting the Taylor expansions
$$f(x + h) = f(x) + h f'(x) + \frac{h^2}{2} f''(x) + \frac{h^3}{6} f'''(x) + \frac{h^4}{24} f''''(x) + \cdots$$
and
$$f(x - h) = f(x) - h f'(x) + \frac{h^2}{2} f''(x) - \frac{h^3}{6} f'''(x) + \frac{h^4}{24} f''''(x) + \cdots$$
into the definition of ε(x), we find that, after several cancellations, the dominant term
is $O(h^2)$ as h → 0.
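To spell out the cancellation, here is a worked step that uses only the two expansions above. Adding them removes the odd-order terms, and dividing by the square of h isolates the second derivative.

```latex
f(x+h) + f(x-h) = 2 f(x) + h^2 f''(x) + \frac{h^4}{12} f''''(x) + \cdots
\quad\Longrightarrow\quad
\varepsilon(x) = f''(x) - \frac{f(x-h) - 2 f(x) + f(x+h)}{h^2}
               = -\frac{h^2}{12} f''''(x) + \cdots
```

The leading term is proportional to the square of h, which is the precise sense in which the error is second order.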
This means that if h is halved, the error should decrease by a factor of 4. Let us take a look
at the error in the derivative approximations applied to a simple function
$$f(x) = x^{-6}$$
at, say x = 1. I am sure you can compute the exact derivative using your calculus knowl-
edge. In the code below, we subtract this exact derivative from the computed derivative
approximation to obtain the error.
h D2 Result Error
6e-02 42.99863 0.998629
3e-02 42.24698 0.246977
2e-02 42.06158 0.061579
8e-03 42.01538 0.015384
Clearly, we observe that the error decreases by a factor of 4 when h is halved. This is in
accordance with what we expected from the Taylor expansion analysis above.
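A table like the one above can be generated by a sketch such as the following. The values of h are taken to be powers of 1/2 from 2^-4 to 2^-7, which is an inference from the rounded entries in the h column. The exact value 42 is f''(1) for f(x) = x^-6, and the lists errors and ratios are hypothetical additions used to check the factor of 4.

```python
def D2(f, x, h=1e-6):
    """Central difference approximation of f''(x) (assumed signature)."""
    return (f(x - h) - 2 * f(x) + f(x + h)) / h**2

def f(x):
    return x**-6

exact = 42.0      # f''(x) = 42 x^(-8), evaluated at x = 1

errors = []
print(' h D2 Result Error')
for k in range(4, 8):
    h = 2.0**-k                     # halve h in each iteration
    d2f = D2(f, 1, h)
    errors.append(d2f - exact)
    print('%.0e %12.5f %10.6f' % (h, d2f, d2f - exact))

# Ratio of successive errors; second order accuracy predicts values near 4
ratios = [errors[j] / errors[j + 1] for j in range(len(errors) - 1)]
print(ratios)
```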
V.4 Limitations
A serious limitation of numerical differentiation formulas like this can be seen when we
take values of h really close to 0. Although the limiting process in calculus relies on h going
to 0, your computer is not equipped to deal with very small numbers. This creates issues.
Instead of halving h, let us aggressively reduce h by a factor of 10, going down to $10^{-13}$,
and look at the results.
[8]: for k in range(1,14):
         h = 10**(-k)
         d2g = D2(lambda x: x**-6, 1, h)
         print('%.0e %18.5f' % (h, d2g))
1e-01 44.61504
1e-02 42.02521
1e-03 42.00025
1e-04 42.00000
1e-05 41.99999
1e-06 42.00074
1e-07 41.94423
1e-08 47.73959
1e-09 -666.13381
1e-10 0.00000
1e-11 0.00000
1e-12 -666133814.77509
1e-13 66613381477.50939
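One way to see why the results above deteriorate: double precision numbers carry only about 16 significant digits, so for tiny h the values f(x − h), f(x), and f(x + h) agree in nearly all stored digits, and the numerator of the formula is mostly rounding noise, which the division by h² then amplifies. The sketch below illustrates this with standard floating-point facts. It is offered as a plausible explanation of the output, not as a quotation of the original text, and the estimate eps / h**2 is only a rough order of magnitude.

```python
import sys

eps = sys.float_info.epsilon     # spacing of floating-point numbers near 1
print(eps)

# An increment much smaller than eps is lost entirely when added to 1
lost = (1.0 + 1e-17) == 1.0
print(lost)

# Rough size of rounding noise in the second difference quotient: eps / h**2
for h in [1e-2, 1e-4, 1e-6, 1e-8]:
    print('%.0e %.1e' % (h, eps / h**2))

noise_at_1e8 = eps / (1e-8)**2   # already of order 1 when h = 1e-8
```

Under this estimate, shrinking h reduces the truncation error but inflates the rounding error, so there is a moderate value of h where the total error is smallest.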
VI
Genome of SARS-CoV-2
April 7, 2020
Since most data come in files and streams, a data scientist must be able to effectively work
with them. Python provides many facilities to make this easy. In this class activity, we
will review some of python’s file, string, and dictionary facilities by examining a file con-
taining the genetic code of the virus that has been disrupting our lives this term. Here is
a transmission electron micrograph showing the virus (a public domain image from the
CDC, credited to H. A. Bullock and A. Tamin).
The genetic code of each living organism is a long sequence of simple molecules called nu-
cleotides or bases. Although many nucleotides exist in nature, only 4 nucleotides, labeled
A, C, G, and T, have been found in DNA. They are abbreviations of Adenine, Cytosine,
Guanine, and Thymine. Although it is difficult to put viruses in the category of living
organisms, they also have genetic codes made up of nucleotides.
name of the virus is SARS-CoV-2 (which is different from the name of the disease, COVID-
19), or “Severe Acute Respiratory Syndrome Coronavirus 2” in full. Searching the NCBI
website with the proper virus name will help you locate many publicly available data sets.
Let’s download NCBI’s Reference Sequence NC_045512 giving the complete genome extracted
from a sample of SARS-CoV-2 from the Wuhan seafood market, called the Wuhan-Hu-1
isolate. Here is code using urllib that will attempt to download it directly from the
url specified below. It is unclear if this url will serve as a stable permanent link. In the
event you have problems executing the next cell, please just head over to the webpage for
NC_045512, click on “FASTA” (a data format) and then click on “Send to” a file. Then save
the file in the same relative location mentioned below in f within the folder where we have
been putting all the data files in this course.
url = 'https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?tool=portal&' + \
'save=file&log$=seqview&db=nuccore&report=fasta&id=1798174254&' + \
'extrafeat=null&conwithfeat=on&hide-cdd=on'
f = '../../data_external/SARS-CoV-2-Wuhan-NC_045512.2.fasta'
[2]: import os
     import urllib.request
     import shutil
     if not os.path.isdir('../../data_external/'):
         os.mkdir('../../data_external/')
     r = urllib.request.urlopen(url)
     fo = open(f, 'wb')
     shutil.copyfileobj(r, fo)
     fo.close()
As mentioned in the page describing the data, this file gives the RNA of the virus.
The file has been opened in read-only mode. The variable lines contains a list of all the
lines of the file. Here are the first five lines:
[4]: lines[0:5]
 'ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAA\n',
 'CGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAAC\n',
 'TAATTACTGTCGTTGACAGGACACGAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTG\n',
 'TTGCAGCCGATCATCAGCACATCTAGGTTTCGTCCGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTC\n']
The first line is a description of the data. The long genetic code is broken up into the
following lines. We need to strip end-of-line characters from each such line to re-assemble
the RNA string. Here is a way to strip off the end-of-line character:
[5]: lines[1].strip()
[5]: 'ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAA'
Let’s do so for every line, ignoring the first. Since lines is a list object, ignoring the
first element of the list is done by lines[1:]. (If you don’t know this already, you must
review the list access constructs.) The following code uses the string operation join to put
together the lines into one long string. This is the RNA of the virus.
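A minimal sketch of the reading and joining steps follows. The helper name read_fasta_sequence and the tiny temporary file are hypothetical, introduced so that the example runs without a download. With the real data, the path f defined earlier would be passed instead.

```python
import os
import tempfile

def read_fasta_sequence(path):
    """Return the sequence in a single-record FASTA file as one string."""
    with open(path, 'r') as fo:      # 'r' opens the file in read-only mode
        lines = fo.readlines()
    # lines[0] is the description; strip newlines from the rest and join them
    return ''.join(line.strip() for line in lines[1:])

# A made-up two-line sequence, written to a temporary file for illustration
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, 'toy.fasta')
with open(path, 'w') as fo:
    fo.write('>toy description line\nATTAAAGG\nTTTATACC\n')

rna_toy = read_fasta_sequence(path)
print(rna_toy)
```

Calling join on the empty string concatenates the pieces with nothing between them, which is what reassembling the sequence requires.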
The first thousand characters and the last thousand characters of the RNA of the coron-
avirus are printed below:
[7]: rna[:1000]
[7]: 'ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTA
AAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAACTAATTACTGTCGTTGACAG
GACACGAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTGTTGCAGCCGATCATCAGCACATCTAGGTT
TCGTCCGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTCCCTGGTTTCAACGAGAAAACACACGTCCAACTCAGTTTG
CCTGTTTTACAGGTTCGCGACGTGCTCGTACGTGGCTTTGGAGACTCCGTGGAGGAGGTCTTATCAGAGGCACGTCAACA
TCTTAAAGATGGCACTTGTGGCTTAGTAGAAGTTGAAAAAGGCGTTTTGCCTCAACTTGAACAGCCCTATGTGTTCATCA
AACGTTCGGATGCTCGAACTGCACCTCATGGTCATGTTATGGTTGAGCTGGTAGCAGAACTCGAAGGCATTCAGTACGGT
CGTAGTGGTGAGACACTTGGTGTCCTTGTCCCTCATGTGGGCGAAATACCAGTGGCTTACCGCAAGGTTCTTCTTCGTAA
GAACGGTAATAAAGGAGCTGGTGGCCATAGTTACGGCGCCGATCTAAAGTCATTTGACTTAGGCGACGAGCTTGGCACTG
ATCCTTATGAAGATTTTCAAGAAAACTGGAACACTAAACATAGCAGTGGTGTTACCCGTGAACTCATGCGTGAGCTTAAC
GGAGGGGCATACACTCGCTATGTCGATAACAACTTCTGTGGCCCTGATGGCTACCCTCTTGAGTGCATTAAAGACCTTCT
AGCACGTGCTGGTAAAGCTTCATGCACTTTGTCCGAACAACTGGACTTTATTGACACTAAGAGGGGTGTATACTGCTGCC
GTGAACATGAGCATGAAATTGCTTGGTACACGGAACGTTCT'
[8]: rna[-1000:]
[8]: 'GCTGGCAATGGCGGTGATGCTGCTCTTGCTTTGCTGCTGCTTGACAGATTGAACCAGCTTGAGAGCAAAATGTCTGGTA
AAGGCCAACAACAACAAGGCCAAACTGTCACTAAGAAATCTGCTGCTGAGGCTTCTAAGAAGCCTCGGCAAAAACGTACT
GCCACTAAAGCATACAATGTAACACAAGCTTTCGGCAGACGTGGTCCAGAACAAACCCAAGGAAATTTTGGGGACCAGGA
ACTAATCAGACAAGGAACTGATTACAAACATTGGCCGCAAATTGCACAATTTGCCCCCAGCGCTTCAGCGTTCTTCGGAA
TGTCGCGCATTGGCATGGAAGTCACACCTTCGGGAACGTGGTTGACCTACACAGGTGCCATCAAATTGGATGACAAAGAT
CCAAATTTCAAAGATCAAGTCATTTTGCTGAATAAGCATATTGACGCATACAAAACATTCCCACCAACAGAGCCTAAAAA
GGACAAAAAGAAGAAGGCTGATGAAACTCAAGCCTTACCGCAGAGACAGAAGAAACAGCAAACTGTGACTCTTCTTCCTG
CTGCAGATTTGGATGATTTCTCCAAACAATTGCAACAATCCATGAGCAGTGCTGACTCAACTCAGGCCTAAACTCATGCA
GACCACACAAGGCAGATGGGCTATATAAACGTTTTCGCTTTTCCGTTTACGATATATAGTCTACTCTTGTGCAGAATGAA
TTCTCGTAACTACATAGCACAAGTAGATGTAGTTAACTTTAATCTCACATAGCAATCTTTAATCAGTGTGTAACATTAGG
GAGGACTTGAAAGAGCCACCACATTTTCACCGAGGCCACGCGGAGTACGATCGAGTGTACAGTGAACAATGCTAGGGAGA
GCTGCCTATATGGAAGAGCCCTAATGTGTAAAATTAATTTTAGTAGTGCTATCCCCATGTGATTTTAATAGCTTCTTAGG
AGAATGACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA'
[9]: len(rna)
[9]: 29903
While the human genome is over 3 billion in length, the genome of this virus does not even
reach the length of 30000.
The next task in this class activity is to find if this sequence occurs in the RNA we just
downloaded, and if it does, where it occurs. To this end, we first make the replacements
required to read the string in terms of A, T, G, and C.
[11]: 'ATGAAAATTATTCTTTTCTTGGCACTGATAACACTCGCTACTTGTGAGCTTTATCACTACCAAGAGTGTGTTAGAGGTA
CAACAGTACTTTTAAAAGAACCTTGCTCTTCTGGAACATACGAGGGCAATTCACCATTTCATCCTCTAGCTGATAACAAA
TTTGCACTGACTTGCTTTAGCACTCAATTTGCTTTTGCTTGTCCTGACGGCGTAAAACACGTCTATCAGTTACGTGCCAG
ATCAGTTTCACCTAAACTGTTCATCAGACAAGAGGAAGTTCAAGAACTTTACTCTCCAATTTTTCTTATTGTTGCGGCAA
TAGTGTTTATAACACTTTGCTTCACACTCAAAAGAAAGACAGAATGATTGAACTTTCATTAATTGACTTCTATTTGTGCT
TTTTAGCCTTTCTGCTATTCCTTGTTTTAATTATGCTTATTATCTTTTGGTTCTCACTTGAACTGCAAGATCATAATGAA
ACTTGTCACGCCTAAACGAAC'
The next step is now a triviality in view of python’s exceptional string handling mecha-
nisms:
[12]: s in rna
[12]: True
We may also easily find the location of the ORF7a sequence and read off the entire string
beginning with the sequence.
[13]: rna.find(s)
[13]: 27393
[14]: rna[27393:]
[14]: 'ATGAAAATTATTCTTTTCTTGGCACTGATAACACTCGCTACTTGTGAGCTTTATCACTACCAAGAGTGTGTTAGAGGTA
CAACAGTACTTTTAAAAGAACCTTGCTCTTCTGGAACATACGAGGGCAATTCACCATTTCATCCTCTAGCTGATAACAAA
TTTGCACTGACTTGCTTTAGCACTCAATTTGCTTTTGCTTGTCCTGACGGCGTAAAACACGTCTATCAGTTACGTGCCAG
ATCAGTTTCACCTAAACTGTTCATCAGACAAGAGGAAGTTCAAGAACTTTACTCTCCAATTTTTCTTATTGTTGCGGCAA
TAGTGTTTATAACACTTTGCTTCACACTCAAAAGAAAGACAGAATGATTGAACTTTCATTAATTGACTTCTATTTGTGCT
TTTTAGCCTTTCTGCTATTCCTTGTTTTAATTATGCTTATTATCTTTTGGTTCTCACTTGAACTGCAAGATCATAATGAA
ACTTGTCACGCCTAAACGAACATGAAATTTCTTGTTTTCTTAGGAATCATCACAACTGTAGCTGCATTTCACCAAGAATG
TAGTTTACAGTCATGTACTCAACATCAACCATATGTAGTTGATGACCCGTGTCCTATTCACTTCTATTCTAAATGGTATA
TTAGAGTAGGAGCTAGAAAATCAGCACCTTTAATTGAATTGTGCGTGGATGAGGCTGGTTCTAAATCACCCATTCAGTAC
ATCGATATCGGTAATTATACAGTTTCCTGTTTACCTTTTACAATTAATTGCCAGGAACCTAAATTGGGTAGTCTTGTAGT
GCGTTGTTCGTTCTATGAAGACTTTTTAGAGTATCATGACGTTCGTGTTGTTTTAGATTTCATCTAAACGAACAAACTAA
AATGTCTGATAATGGACCCCAAAATCAGCGAAATGCACCCCGCATTACGTTTGGTGGACCCTCAGATTCAACTGGCAGTA
ACCAGAATGGAGAACGCAGTGGGGCGCGATCAAAACAACGTCGGCCCCAAGGTTTACCCAATAATACTGCGTCTTGGTTC
ACCGCTCTCACTCAACATGGCAAGGAAGACCTTAAATTCCCTCGAGGACAAGGCGTTCCAATTAACACCAATAGCAGTCC
AGATGACCAAATTGGCTACTACCGAAGAGCTACCAGACGAATTCGTGGTGGTGACGGTAAAATGAAAGATCTCAGTCCAA
GATGGTATTTCTACTACCTAGGAACTGGGCCAGAAGCTGGACTTCCCTATGGTGCTAACAAAGACGGCATCATATGGGTT
GCAACTGAGGGAGCCTTGAATACACCAAAAGATCACATTGGCACCCGCAATCCTGCTAACAATGCTGCAATCGTGCTACA
ACTTCCTCAAGGAACAACATTGCCAAAAGGCTTCTACGCAGAAGGGAGCAGAGGCGGCAGTCAAGCCTCTTCTCGTTCCT
CATCACGTAGTCGCAACAGTTCAAGAAATTCAACTCCAGGCAGCAGTAGGGGAACTTCTCCTGCTAGAATGGCTGGCAAT
GGCGGTGATGCTGCTCTTGCTTTGCTGCTGCTTGACAGATTGAACCAGCTTGAGAGCAAAATGTCTGGTAAAGGCCAACA
ACAACAAGGCCAAACTGTCACTAAGAAATCTGCTGCTGAGGCTTCTAAGAAGCCTCGGCAAAAACGTACTGCCACTAAAG
CATACAATGTAACACAAGCTTTCGGCAGACGTGGTCCAGAACAAACCCAAGGAAATTTTGGGGACCAGGAACTAATCAGA
CAAGGAACTGATTACAAACATTGGCCGCAAATTGCACAATTTGCCCCCAGCGCTTCAGCGTTCTTCGGAATGTCGCGCAT
TGGCATGGAAGTCACACCTTCGGGAACGTGGTTGACCTACACAGGTGCCATCAAATTGGATGACAAAGATCCAAATTTCA
AAGATCAAGTCATTTTGCTGAATAAGCATATTGACGCATACAAAACATTCCCACCAACAGAGCCTAAAAAGGACAAAAAG
AAGAAGGCTGATGAAACTCAAGCCTTACCGCAGAGACAGAAGAAACAGCAAACTGTGACTCTTCTTCCTGCTGCAGATTT
GGATGATTTCTCCAAACAATTGCAACAATCCATGAGCAGTGCTGACTCAACTCAGGCCTAAACTCATGCAGACCACACAA
GGCAGATGGGCTATATAAACGTTTTCGCTTTTCCGTTTACGATATATAGTCTACTCTTGTGCAGAATGAATTCTCGTAAC
TACATAGCACAAGTAGATGTAGTTAACTTTAATCTCACATAGCAATCTTTAATCAGTGTGTAACATTAGGGAGGACTTGA
AAGAGCCACCACATTTTCACCGAGGCCACGCGGAGTACGATCGAGTGTACAGTGAACAATGCTAGGGAGAGCTGCCTATA
TGGAAGAGCCCTAATGTGTAAAATTAATTTTAGTAGTGCTATCCCCATGTGATTTTAATAGCTTCTTAGGAGAATGACAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA'
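The string operations used above (the in operator, find, and slicing) can be tried on a small example. The strings rna_toy and s_toy below are made up for illustration and are not part of the genome data.

```python
rna_toy = 'ATTAAAGGTTTATACCTTCC'
s_toy = 'TTTATA'

present = s_toy in rna_toy        # membership test for substrings
start = rna_toy.find(s_toy)       # index where the first occurrence begins
tail = rna_toy[start:]            # everything from that index onward
missing = rna_toy.find('GGGG')    # find signals "not found" with -1

print(present, start, tail, missing)
```

Since find returns -1 rather than raising an error when the substring is absent, it is worth testing with in before slicing with the result.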
[16]: freq
[17]: url2 = 'https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?' + \
'tool=portal&save=file&log$=seqview&db=nuccore&report=fasta&' + \
'id=1828694245&extrafeat=null&conwithfeat=on&hide-cdd=on'
f2 = '../../data_external/SARS-CoV-2-Washington_MT293201.1.fasta'
[18]: r2 = urllib.request.urlopen(url2)
      fo2 = open(f2, 'wb')
      shutil.copyfileobj(r2, fo2)
      fo2.close()
You might have already heard in the news that there are multiple strains of the virus
around the globe. Let’s investigate this genetic code a bit closer.
Is this the same genetic code as from the Wuhan sample? Repeating the previous pro-
cedure on this new file, we now make a string object that contains the RNA from the
Washington sample. We shall call it rna2 below.
We should note that not all data sets use just ATGC. There is a standard notation that extends
the four letters, e.g., N is used to indicate any nucleotide. So, it might be a good idea
to answer this question first: what are the distinct characters in the new rna2? This can be
done very simply in python if you use the set data structure, which removes duplicates.
[20]: set(rna2)
The next natural question might be this. Are the lengths of rna and rna2 the same?
We could also look at the first and last 30 characters and check if they are the same, like so:
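These comparisons can be sketched on two short made-up strings. The names rna_a and rna_b stand in for rna and rna2, and the slice lengths are shortened from 30 to fit the toy data.

```python
rna_a = 'ATTAAAGGTTTATACC'
rna_b = 'ATTAAAGGNTTATACCAA'

letters = set(rna_b)                      # distinct characters, duplicates removed
same_length = len(rna_a) == len(rna_b)
same_start = rna_a[:8] == rna_b[:8]       # compare the first 8 characters
same_end = rna_a[-4:] == rna_b[-4:]       # compare the last 4 characters

print(sorted(letters), same_length, same_start, same_end)
```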
[25]: freq2
Although the Washington genome is not identical to the Wuhan one, its nucleotide frequencies
are very close to those of the Wuhan one, reproduced here:
[26]: freq
[27]: True
[28]: rna2.find(s)
[28]: 27364
Thus, we located the same ORF7a instruction in this virus at a different location. Although
the genetic code from the Washington sample and the Wuhan sample are different, they
can make the same protein ORF7a and their nucleotide frequencies are very close.
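The frequency objects freq and freq2 referred to above could be computed with a dictionary along the following lines. This is a sketch under assumptions: the function name nucleotide_frequencies is hypothetical, and whether the original stored raw counts or fractions is not known from the text, so fractions are used here.

```python
def nucleotide_frequencies(seq):
    """Return a dictionary mapping each character of seq to its relative frequency."""
    counts = {}
    for base in seq:
        # get returns 0 for a base not seen before
        counts[base] = counts.get(base, 0) + 1
    return {base: n / len(seq) for base, n in counts.items()}

freq_toy = nucleotide_frequencies('ATTAAAGGTT')
print(freq_toy)
```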
This activity provided you with just a glimpse into the large field of bioinformatics, which
studies, among other things, patterns of nucleotide arrangements. If you are interested in
this field, you should take a look at Biopython, a bioinformatics python package.
VII
Fibonacci primes
April 9, 2020
Fibonacci numbers appear in so many unexpected places that I am sure you have already
seen them. They are elements of the Fibonacci sequence $F_n$ defined by
$$F_0 = 0, \quad F_1 = 1, \qquad F_n = F_{n-1} + F_{n-2}, \quad \text{for } n > 1.$$
Obviously, this recursive formula gives infinitely many Fibonacci numbers. We also know
that there are infinitely many prime numbers: the ancient Greeks knew it (actually proved
it) in 300 BC!
But, to this day, we still do not know if there are infinitely many prime numbers in the Fibonacci
sequence. These numbers form the set of Fibonacci primes. Its (in)finiteness is one of the
still unsolved problems in mathematics.
In this activity, you will compute a few initial Fibonacci primes, while reviewing some
python features along the way, such as generator expressions, yield, next, all, line mag-
ics, modules, and test functions. Packages we shall come across include memory_profiler,
primesieve, and pytest.
$$n^i, \qquad n = 0, 1, 2, \ldots, N-1$$
succinctly:
If you change the brackets to parentheses, then instead of a list comprehension, you get a
different object called a generator expression.
Both L and G are examples of iterables, an abstraction of a sequence of things that can be
traversed one element at a time. (Strictly speaking, only G is an iterator, an object with the
ability to tell what the next element of the sequence is; python makes an iterator from the
list L automatically whenever you loop over it.) Since both L and G can be iterated over, you
will generally not see a difference in the results if you run a loop to
print their values, or if you use them within a list comprehension.
[3]: [l for l in L]
[4]: [g for g in G]
[5]: [g for g in G]
[5]: []
The difference between the generator expression G and the list L is that a generator expres-
sion does not actually compute the values until they are needed. Once an element of the
sequence is computed, the next time, the generator can only compute the next element in
the sequence. If the end of a finite sequence was already reached in a previous use of the
generator, then there are no more elements of the sequence to compute. This is why we
got the empty output above.
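The behavior just described can be reproduced with a short sketch. The values i = 2 and N = 10 are assumptions, chosen so that the entries are squares of small integers.

```python
i, N = 2, 10
L = [n**i for n in range(N)]      # list comprehension: all values stored at once
G = (n**i for n in range(N))      # generator expression: values computed on demand

first_pass = [g for g in G]       # consumes the generator
second_pass = [g for g in G]      # nothing left to compute
list_again = [l for l in L]       # a list can be traversed repeatedly

print(first_pass)
print(second_pass)
print(list_again)
```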
[7]: G2 = GG()
print(*G2) # see that you get the same values as before
1 4 9 16 25 36 49 64 81
The yield statement tells python that this function does not just return a value, but rather
a value that is an element of a sequence, or an iterator. Internally, in order for something
to be an iterator in python, it must have a well-defined __next__() method: even though
you did not explicitly define anything called __next__ when you defined GG, python seeing
yield defines one for you behind the scenes.
Recall that you have seen another method whose name also began with two underscores,
the special __init__ method, which allows you to construct an object using the name of the
class followed by parentheses. The __next__ method is also a “special” method in that it
allows you to call next on the iterator to get its next value, like so:
[8]: G2 = GG()
[8]: (1, 4, 9)
16 25 36 49 64 81
As you can see, a generator “remembers” where it left off in a prior iteration.
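Here is a sketch of a generator function with yield, together with the use of next. The name GG comes from the cells above, but its body is a guess: the range starting at 1 and the values i = 2, N = 10 are chosen so that the output matches the numbers shown, and the names first_three and rest are introduced for illustration.

```python
i, N = 2, 10

def GG():
    # Each yield hands back one element and pauses until the next one is requested
    for n in range(1, N):
        yield n**i

G2 = GG()
has_next = hasattr(G2, '__next__')              # python supplied __next__ for us
first_three = (next(G2), next(G2), next(G2))    # advance the generator three times
rest = list(G2)                                 # continues from where it left off

print(has_next)
print(first_three)
print(*rest)
```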
[10]: i = -20
N = 10**8