Website change!

Hi all! I’ve recently used GitHub to build a new website and copied all of the content from here over to it. I’ll be updating that site with new content in the future, and I’d be stoked if you checked it out!

I don’t have plans to shut this site down, but I have no idea how long it will be maintained by the university in the future.

P.S. – I built my site by forking the Beautiful Jekyll repository. It was a lot easier than I thought it would be! There’s a great tutorial by the author, but if you’re curious about any of the details, feel free to ask me about it!



We’re using programs like BLAST and HMMER now, and these are just the beginning of an entire world of bioinformatic algorithms. We unfortunately don’t have a lot of time to go over these during class, but I compiled some resources here if you’d like to explore them later.


Nature Biotech primer  –  good place to start ->

Markov Chains – play around on this so you can get a feel for the underlying dynamics of a Hidden Markov Model:

Hidden Markov Models ->

More Hidden Markov Models & Bayesian info ->

Many different algorithms (with some awesome example code in the form of Jupyter notebooks)

More examples (look at dishonest casino under the Viterbi section) ->
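If you’d like to poke at the dishonest casino idea yourself, here is a minimal sketch of the Viterbi algorithm in plain Python. The probabilities below are illustrative values I picked for the example (a fair die, a loaded die that favors 6, and a small chance of switching between them). They are assumptions, not numbers from class, and the function name viterbi is just a label for this sketch.

```python
import math

# Hypothetical "dishonest casino" parameters (illustrative values only)
states = ["F", "L"]  # F = fair die, L = loaded die
start_p = {"F": 0.5, "L": 0.5}
trans_p = {
    "F": {"F": 0.95, "L": 0.05},
    "L": {"F": 0.10, "L": 0.90},
}
emit_p = {
    "F": {roll: 1 / 6 for roll in range(1, 7)},
    "L": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5},
}


def viterbi(rolls):
    """Return the most probable hidden state path (e.g. 'FFLL') for a list of die rolls."""
    # best[t][s] = log-probability of the best path that ends in state s at roll t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][rolls[0]]) for s in states}]
    # back[t][s] = the state at roll t-1 on that best path
    back = [{}]
    for t in range(1, len(rolls)):
        best.append({})
        back.append({})
        for s in states:
            # choose the previous state that gives the highest score for arriving in s
            prev = max(states, key=lambda p: best[t - 1][p] + math.log(trans_p[p][s]))
            best[t][s] = (best[t - 1][prev]
                          + math.log(trans_p[prev][s])
                          + math.log(emit_p[s][rolls[t]]))
            back[t][s] = prev
    # start from the best final state and follow the back-pointers to the beginning
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(rolls) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return "".join(path)


rolls = [1, 3, 2, 6, 6, 6, 6, 6, 6, 6, 6, 4, 2, 5]
print(viterbi(rolls))
```

The sketch works with log-probabilities so that long sequences of rolls don’t underflow to zero. Try changing the switching probabilities and see how long a run of 6’s has to be before the path flips to the loaded die.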

Whether or not you are comfortable with R at this point, there is a wealth of information to be found in the package documentation. Try searching the repository in Google like this:

hidden markov model site:

You’ll find a lot of PDF links. These are generally documents written to accompany packages (aka vignettes), and they will tell you more than you ever wanted to know about the algorithms we’re getting into these days (which is the natural progression of the introduction to biocomputing).

YouTube playlist of  lectures explaining these concepts

Ready for more?


Holiday Announcements

Thxgiving break announcements

Hi Biocomputing folx! For those of you who will be around during the Thanksgiving break, note that I will not be in my usual spot for office hours (Jordan Lobby, Thursday 5-6pm), but I will be around all week and am happy to meet – just bounce me an email.

For group projects, I am the contact person for the bioinformatics topic (This includes R folx – we aren’t doing a lot of R/python things since most of this work will be on the remote machines or at least L/unix). I’m happy to help with the other two, but your point person will be better prepared to answer questions about specific grading/requirements.

Plotting in Python

Plotting in (mostly) Python*

*Note to class: some of these may be python 3 & most of what we’ve done so far has been in Python 2. Copy and paste with caution*

Installing the plotnine package (ggplot for Python)

Class Specific notes/announcements – 11/1/18 – plotting links continued below

You do not need to have plotnine installed and working correctly to complete exercise 8. There is some example code in the exercise. We’ll go over this in office hours later today if it would be helpful.


We have another thing to try for plotnine installation. We think that maybe if you start a fresh conda environment some of the dependency issues we’ve been having will be solved. Dr. Jones is waiting to see if we can get this working and then will figure out what to do from there.

In a fresh terminal, type

conda activate

Your terminal should process this for a minute and then your prompt should change from something like this: criva$ to something like this: (base) criva$

From there, you can try to use conda to install plotnine again: conda install -c conda-forge plotnine

Now open spyder and try out some of the plotnine commands. You can use the dataset mpg.txt from class (if you don’t have this dataset, you can download it here.)
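import numpy
import pandas as pd
from plotnine import *

# load the dataset first (this assumes mpg.txt is tab-delimited and sits in your working directory)
mpg = pd.read_csv("mpg.txt", sep="\t")

a = ggplot(mpg,aes(x="displ",y="cty"))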

Let me know what happens – hopefully this forces a new environment which should fix the dependency nonsense we’ve been going through.

Some people were having issues with getting numpy re-installed on Wednesday – here is some code that should work for you. Let me know if this doesn’t work. If we didn’t talk about this, ignore it.

Code for numpy installation

sudo pip install --upgrade --ignore-installed --install-option '--install-data=/usr/local' numpy

For what it’s worth, this dependency issue is not something unique to the class. See comic.


Links to plotting tutorials/resources

  • Plotting in Pandas (python code):

  • Bunch of different types of graphs and the code to make them (Python code):

  • Useful Python libraries:

  • All Python libraries (similar to CRAN for R)

  • Lots of examples of ggplot2 graphs (R code):

  • Advanced data visualization ideas and examples (not necessarily tutorials and not exclusive to python)

Python In class challenge answers 10-26-18

Here are the answers to the Python challenges this week in class. The formatting isn’t great here, so the link below shows them in another format, which might be easier for you to read.

# cumulative sum


for i in range(0,len(x)):



for i in range(0,len(x)):


#how long to 5 5’s



while count < 5:
    if x[i]==5:
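Since the formatting above dropped some lines, here is a minimal sketch of how both challenges could be written. The list x is made-up example data, and the variable names cumsum and total are my own choices for this sketch, so treat it as one possible answer rather than the only one.

```python
# hypothetical example data (in class, x came from the exercise)
x = [5, 2, 5, 5, 1, 3, 5, 4, 5, 5]

# cumulative sum: entry i holds the sum of x[0] through x[i]
cumsum = []
total = 0
for i in range(0, len(x)):
    total = total + x[i]
    cumsum.append(total)
print(cumsum)

# how long to 5 5's: how many entries do we read before we have seen five 5's?
# (this assumes x really does contain at least five 5's; otherwise x[i] runs off the end)
count = 0  # number of 5's seen so far
i = 0      # current position in x
while count < 5:
    if x[i] == 5:
        count = count + 1
    i = i + 1
print(i)
```

With this example list, the fifth 5 sits at index 8, so the loop stops after reading 9 entries.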


Answers for Lecture 11 extra practice


import pandas as pd

# 0 – calculate sum of female and male wages in wages.csv


for i in range(0,len(wages),1):
    if wages.gender[i]=="female":


#the two commands below won’t work – why?


 Find runs –

This is the super abstract one we went over in Tutorial!

# load file – available on Sakai

# create a variable out that is currently undefined

# I will use this variable cur to hold onto the previous number in the vector;
# this is analogous to using findRuns[i-1]
# this is a counter that I use to keep track of how long a run of repeated values is;
# if there are no repeated values then this count equals 1

# loop through each entry of our vector (except the 1st one, which we set to cur above)
for i in range(1,50,1):
# test if the ith value in the vector findRuns equals the previous (stored in cur)
if findRuns.iloc[i,0]==cur:
# test whether count is 1 (we aren’t in the middle of a run) or >1 (in the middle of a run)
if count==1:
# if the ith value in the vector equals the previous (stored in cur) and count is 1, we
# are at the beginning of a run and we want to store this value (we temporarily store it in ‘start’)

# we add one to count because the run continued based on the ith value of findRuns being equal to
# the previous (stored in cur)
# if the ith value in findRuns is not the same as the previous (stored in cur) we either are not in a run
# or we are ending a run
# if count is greater than 1 it means we were in a run and must be exiting one
if count>1:
# add a row to ‘out’ that will hold the starting positions in the first column and the length
# of runs in the second column; this appends rows to out after finding and counting each run
# reset count to 1 because we just exited a run
# remember cur holds the previous element in findRuns, so we need to update this after each time
# we go through the for loop
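Putting those comments together, here is a minimal sketch of the whole thing. To keep it self-contained it uses a short made-up list instead of the 50-row file from Sakai, so findRuns[i] stands in for findRuns.iloc[i,0]. Storing the run’s starting position as a 0-based index and the final check after the loop are my own choices for this sketch.

```python
# hypothetical stand-in for the class file (a plain list instead of a data frame)
findRuns = [2, 2, 7, 4, 4, 4, 1, 9, 9]

out = []           # one [start position, run length] pair per run
cur = findRuns[0]  # holds the previous value in the list
count = 1          # length of the current run (1 means we are not in a run)

# loop through each entry except the first, which is already stored in cur
for i in range(1, len(findRuns)):
    if findRuns[i] == cur:
        if count == 1:
            # beginning of a run: it started at the previous position
            start = i - 1
        # the run continues, so it is one entry longer
        count = count + 1
    else:
        if count > 1:
            # we just exited a run: record where it started and how long it was
            out.append([start, count])
            # reset the counter because we are no longer in a run
            count = 1
    # update cur so that it holds the previous entry next time around
    cur = findRuns[i]

# a run that reaches the very end of the list is never closed inside the loop
if count > 1:
    out.append([start, count])

print(out)
```

Positions here are 0-based, so the first run of 2’s is reported as starting at position 0 with length 2.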

Python intro

Getting started with Python

Helpful, lengthy explanation of indexes and slices in arrays:


Numpy is a library for arrays.

Numpy basics:

More explanation of data types (numpy):

Why do we want to use numpy vs regular python lists?


Pandas is a python library for data frames. Understanding the basics of numpy will be helpful before getting into pandas.

Pandas introduction:

And another:

Selecting data in a dataframe (iloc):

Tutorial Challenge from 10/05/18 – Prompt and corresponding code

Git – Version Control and Reproducibility


Everything is fine.

Git & github links/instructions

Since we’re going to be using git for the rest of the semester to turn in assignments, we should get it down pretty good now.

Here’s a nice explanation of git & github that uses an analogy of an author writing a book.

When git goes wrong:

Resolve a conflict

Turning in assignments – only one per pair/group

  1. Complete the exercise on your local machine. Commit your changes.
  2. Push your repo to your own GitHub repository.
  3. Go to the GitHub website and your own repository (which is a forked repo of my original repo).
  4. Use “Pull Request” to turn in your assignment (upper middle of the screen, click “New Pull Request”).
  5. Make sure to type your name and your collaborator’s name into the text box (ex: jones-rivaldi submission).
  6. Click “Create Pull Request”
  7. Done!

String manipulation with sed and grep

What is string manipulation and why do we care about this?

As biologists, or more generally as people interested in working with data, a lot of what we want to do will require us to manipulate many characters at the same time.

By definition:

  • A character is a class whose instances can hold a single character value.
  • A string is an immutable class for working with multiple characters.


For our purposes, we can consider strings as information that we don’t want to use for numerical calculations. DNA sequences, column or row names, and categorical/qualitative data values will generally be strings. You might want to remove the primers from a lot (like a lot) of sequences at the same time, or you might want to remove whitespaces from a dataset you found online.
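To make those two examples concrete, here is a minimal sketch in Python (the language we’ve been using elsewhere in class). The primer and the sequences are made up for illustration. The regular expression does the same kind of job as a sed substitution anchored to the start of the line.

```python
import re

# hypothetical primer and reads, made up for this example
primer = "ACGTAC"
reads = ["ACGTACGGATTC", "ACGTACTTGACA", "GGCATTAC"]

# remove the primer only when it sits at the very start of a read;
# re.escape keeps any special characters in the primer from being read as regex syntax
pattern = "^" + re.escape(primer)
trimmed = [re.sub(pattern, "", read) for read in reads]
print(trimmed)

# strip leading and trailing whitespace from messy text values
messy = ["  Homo sapiens", "Mus musculus  ", " E. coli "]
clean = [value.strip() for value in messy]
print(clean)
```

The third read doesn’t start with the primer, so it comes back unchanged. Note that strip() only removes whitespace at the ends, not the spaces inside a name.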

Most of our string manipulation is covered by the previous links that are tied in with the for loops, but here are a couple of useful comics for some of those commands. ‘awk’ is new, but you’re probably going to run into it during Google adventures. In some ways it is its own programming language and is very much worth learning; we just didn’t quite have time to fit it into our classwork.