Website change!

Hi all! I’ve recently used GitHub to build a website and copied all the content from here over to it. I’ll be updating that site with new content in the future, and I’d be stoked if you checked it out!

crivaldi.github.io

I don’t have plans to shut this site down, but I have no idea how long it will be maintained by the university in the future.

P.S. – I built my site by forking the Beautiful Jekyll repository. It was a lot easier than I thought it would be! There’s a great tutorial by the author, but if you’re curious about any of the details, feel free to ask me about it!

Working on Clusters (general)

*Note to all users: As of 07/14/2019, all posts have been moved to (and are being updated at) crivaldi.github.io. I don’t have plans to shut this site down, but I have no idea how long it will be maintained by the university in the future.*

I taught at this workshop and made this lesson with the brilliant Shannon Joslin (@IntrprtngGnmcs). If you have questions about getting started with cluster computing, this may answer some of them! Feel free to contact me via email or Twitter with more questions or suggestions for the lesson.

Link to lesson!

Bioinformatics

We’re using programs like BLAST and HMMER now, and these are just the beginning of an entire world of bioinformatic algorithms. We unfortunately don’t have a lot of time to go over these during class, but I compiled some resources here if you’d like to explore them later.

Algorithms

Nature Biotech primer – good place to start -> https://www.nature.com/articles/nbt1004-1315

Markov Chains – play around on this so you can get a feel for the underlying dynamics of a Hidden Markov Model: http://setosa.io/blog/2014/07/26/markov-chains/

Hidden Markov Models -> http://www.cs.cmu.edu/~./awm/tutorials/hmm.html

More Hidden Markov Models & Bayesian info -> https://github.com/laryamamoto/BayesianCourseNotes/blob/master/tex/bayesian.pdf

Many different algorithms (with some awesome example code in the form of Jupyter notebooks) http://www.langmead-lab.org/teaching-materials/

More examples (look at dishonest casino under the Viterbi section) -> http://comprna.upf.edu/courses/Master_AGB/
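If you want to poke at the dishonest casino yourself, here is a minimal Viterbi sketch in Python 3. The probabilities below are made-up illustrative values (they are not taken from the course linked above), and the names `viterbi`, `START`, `TRANS`, and `EMIT` are just my own labels. It assumes at least one roll and works in log space so long sequences don't underflow.

```python
import math

# Hypothetical parameters: a fair die and a die loaded toward 6.
STATES = ("fair", "loaded")
START = {"fair": 0.5, "loaded": 0.5}
TRANS = {
    "fair": {"fair": 0.95, "loaded": 0.05},
    "loaded": {"fair": 0.10, "loaded": 0.90},
}
EMIT = {
    "fair": {face: 1 / 6 for face in range(1, 7)},
    "loaded": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5},
}


def viterbi(rolls):
    """Return the most probable sequence of hidden states for a list of rolls."""
    # best[t][s] is the log-probability of the best path that ends in state s at roll t
    best = [{s: math.log(START[s]) + math.log(EMIT[s][rolls[0]]) for s in STATES}]
    # back[t][s] remembers which previous state that best path came from
    back = [{}]
    for t in range(1, len(rolls)):
        best.append({})
        back.append({})
        for s in STATES:
            prev, score = max(
                ((p, best[t - 1][p] + math.log(TRANS[p][s])) for p in STATES),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + math.log(EMIT[s][rolls[t]])
            back[t][s] = prev
    # Start from the best final state and follow the back-pointers to the first roll
    state = max(STATES, key=lambda s: best[-1][s])
    path = [state]
    for t in range(len(rolls) - 1, 0, -1):
        state = back[t][state]
        path.append(state)
    path.reverse()
    return path


rolls = [1, 3, 2, 6, 6, 6, 6, 6, 6, 4, 2, 5]
for roll, state in zip(rolls, viterbi(rolls)):
    print(roll, state)
```

Try changing the transition probabilities and watch how sticky the "loaded" calls become.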

Whether or not you are comfortable with R at this point, there is a wealth of information to be found in the package documentation. Try searching the CRAN repository in Google like this:

hidden markov model site:https://cran.r-project.org

You’ll find a lot of PDF links. These are generally documents written to accompany packages (aka vignettes), and they will tell you more than you ever wanted to know about the algorithms we’re getting into these days (which is the natural progression of the introduction to biocomputing).

YouTube playlist of lectures explaining these concepts https://www.youtube.com/playlist?list=PL2mpR0RYFQsBiCWVJSvVAO3OJ2t7DzoHA


Holiday Announcements

Thxgiving break announcements

Hi Biocomputing folx! For those of you who will be around during the Thanksgiving break, note that I will not be in my usual spot for office hours (Jordan Lobby, Thursday 5-6pm), but I will be around all week and am happy to meet – just bounce me an email.

For group projects, I am the contact person for the bioinformatics topic (This includes R folx – we aren’t doing a lot of R/python things since most of this work will be on the remote machines or at least L/unix). I’m happy to help with the other two, but your point person will be better prepared to answer questions about specific grading/requirements.

Intro to working on remote machines

Quick tutorial for logging on to the remote student computers at Notre Dame

https://hackmd.io/s/HyPnKtMjm#

Tutorial I wrote up for muscle and hmmer installations (general installation instructions, although they tend to vary with every program).

https://hackmd.io/s/SJUdL7DpX#

Plotting in Python

Plotting in (mostly) Python*

*Note to class: some of these may be python 3 & most of what we’ve done so far has been in Python 2. Copy and paste with caution*

Installing the PlotNine Package (ggplots for python)

Class Specific notes/announcements – 11/1/18 – plotting links continued below

You do not need to have plotnine installed and working correctly to complete exercise 8. There is some example code in the exercise. We’ll go over this in office hours later today if it would be helpful.

MAC PEOPLE HAVING TROUBLE WITH PLOTNINE INSTALLATION

We have another thing to try for plotnine installation. We think that maybe if you start a fresh conda environment some of the dependency issues we’ve been having will be solved. Dr. Jones is waiting to see if we can get this working and then will figure out what to do from there.

In a fresh terminal, type

conda activate

Your terminal should process this for a minute and then your prompt should change from something like this: criva$ to something like this: (base) criva$

From there, you can try to use conda to install plotnine again: conda install -c conda-forge plotnine

Now open Spyder and try out some of the plotnine commands. You can use the dataset mpg.txt from class (if you don’t have this dataset, you can download it here.)

import numpy
import pandas as pd
from plotnine import *

mpg=pd.read_csv("mpg.txt",sep="\t",header=0)

a = ggplot(mpg,aes(x="displ",y="cty"))
a+geom_point()+coord_cartesian()

Let me know what happens – hopefully this forces a new environment which should fix the dependency nonsense we’ve been going through.
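If you do email me about it, it helps to know which Python and which conda environment you are actually running. Here is a small sketch for that; `describe_environment` is just a name I made up, and it relies on conda setting the `CONDA_DEFAULT_ENV` variable when an environment is activated (if it comes back as None, no environment is active).

```python
import os
import sys


def describe_environment():
    """Collect a few facts about the interpreter that is running this code."""
    return {
        "python_version": sys.version.split()[0],
        "executable": sys.executable,
        # Set by `conda activate`; None when no conda environment is active
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print("{}: {}".format(key, value))
```

Run it inside Spyder and in your terminal; if the two disagree, they are using different Python installations.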

Some people were having issues with getting numpy re-installed on Wednesday – here is some code that should work for you. Let me know if this doesn’t work. If we didn’t talk about this, ignore it.

Code for numpy installation

Plotting in Pandas (python code): sudo pip install --upgrade --ignore-installed --install-option '--install-data=/usr/local' numpy

For what it’s worth, this dependency issue is not something unique to the class. See comic. (https://xkcd.com/1987/)

Links to plotting tutorials/resources

https://lmicke.github.io/first-ramen-post.html

https://lmicke.github.io/second-ramen-post.html#second-ramen-post

Bunch of different types of graphs and the code to make them (Python code):

https://python-graph-gallery.com/

Useful Python libraries:

https://blog.modeanalytics.com/python-data-visualization-libraries/

All python libraries (similar to R-cran)

https://pypi.org/

Lots of examples of ggplot2 graphs (R code):

http://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html#Slope%20Chart

Advanced data visualization ideas and examples (not necessarily tutorials and not exclusive to python)

http://sxywu.com/

If you’re using jupyter notebooks, here are some tips: https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/

Python In class challenge answers 10-26-18

Here are the answers to the Python challenges from class this week. The formatting isn’t great here, so the link below shows them in another format, which might be easier for you to read.

https://hackmd.io/s/S1oyAag3Q#

# cumulative sum

x=[3,10,4,12,55]
cs=[0]*5

for i in range(0,len(x)):
    cs[i]=sum(x[0:(i+1)])

print(cs)

x=[3,10,4,12,55]
cs=list()

for i in range(0,len(x)):
    cs.append(sum(x[0:(i+1)]))

print(cs)
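Both answers above re-sum a slice on every pass through the loop. As an alternative sketch (not what we did in class, and written for Python 3), you can keep a running total, or let `itertools.accumulate` from the standard library do the same thing. The variable names here are my own.

```python
from itertools import accumulate

x = [3, 10, 4, 12, 55]

# Version 1: keep a running total instead of re-summing a slice each time
running = 0
cs_loop = []
for value in x:
    running += value
    cs_loop.append(running)

# Version 2: accumulate yields the running totals one at a time
cs_accumulate = list(accumulate(x))

print(cs_loop)
print(cs_accumulate)
```

All of these give the same list; the running-total versions just do less repeated work on long inputs.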

#how long to 5 5’s

x=[5,3,2,5,5,1,2,5,3,5,1,5,1]

count=0
i=0

while count < 5:
    if x[i]==5:
        count+=1
    i+=1

print(i)
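The while loop above assumes the list really does contain five 5's; if it doesn't, `x[i]` eventually runs off the end and raises an IndexError. Here is one possible way to guard against that, wrapped in a function. The name `rolls_until` and its parameters are my own invention, not something from class.

```python
def rolls_until(values, target=5, needed=5):
    """Return how many entries are read before `target` has shown up
    `needed` times, or None if that never happens."""
    count = 0
    for position, value in enumerate(values, start=1):
        if value == target:
            count += 1
            if count == needed:
                return position
    return None


x = [5, 3, 2, 5, 5, 1, 2, 5, 3, 5, 1, 5, 1]
print(rolls_until(x))
print(rolls_until([5, 5, 1]))
```

For the list from class this matches the answer from the while loop; the second call returns None instead of crashing.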

Answers for Lecture 11 extra practice

import pandas as pd

# 0 – calculate sum of female and male wages in wages.csv
wages=pd.read_csv("wages.csv",header=0,sep=",")

femaleSum=0
maleSum=0

for i in range(0,len(wages),1):
    if wages.gender[i]=="female":
        femaleSum=femaleSum+wages.wage[i]
    else:
        maleSum=maleSum+wages.wage[i]

femaleSum
maleSum

#the two commands below won’t work – why?

sum(wages.gender=="female")
sum(wages.gender=="male")
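One way to think about the "why?" above, sketched in plain Python so you can run it without the file: a comparison gives you True/False values, and Python counts True as 1 when you sum, so you end up counting rows instead of adding wages. The numbers below are made up for illustration and are not from wages.csv.

```python
# Made-up records standing in for the gender and wage columns
genders = ["female", "male", "female", "male", "female"]
wages = [10.0, 12.5, 8.0, 9.5, 11.0]

# Summing the comparison counts how many rows are "female"
female_count = sum(g == "female" for g in genders)

# To add up wages, use the comparison to pick which wages go into the sum
female_sum = sum(w for g, w in zip(genders, wages) if g == "female")
male_sum = sum(w for g, w in zip(genders, wages) if g == "male")

print(female_count)
print(female_sum, male_sum)

# With the pandas data frame, the same idea is usually written with a mask:
# wages.wage[wages.gender == "female"].sum()
```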

Find runs –

This is the super abstract one we went over in Tutorial!

# load file – available on Sakai
findRuns=pd.read_csv("findRuns.txt",header=None,sep="\t")

# create an empty data frame 'out' that will hold the results
out=pd.DataFrame(columns=['startIndex','runLength'])

# I will use this variable cur to hold onto the previous number in the vector;
# this is analogous to using findRuns[i-1]
cur=findRuns.iloc[0,0]
#cur=findRuns[i-1]
# this is a counter that I use to keep track of how long a run of repeated values is;
# if there are no repeated values then this count equals 1
count=1

# loop through each entry of our vector (except the 1st one, which we set to cur above)
for i in range(1,50,1):
    # test if the ith value in the vector findRuns equals the previous (stored in cur)
    if findRuns.iloc[i,0]==cur:
        # test whether count is 1 (we aren’t in the middle of a run) or >1 (in the middle of a run)
        if count==1:
            # if the ith value in the vector equals the previous (stored in cur) and count is 1, we
            # are at the beginning of a run and we want to store this value (we temporarily store it in ‘start’)
            start=(i-1)

        # we add one to count because the run continued based on the ith value of findRuns being equal to
        # the previous (stored in cur)
        count=count+1
    # if the ith value in findRuns is not the same as the previous (stored in cur) we either are not in a run
    # or we are ending a run
    else:
        # if count is greater than 1 it means we were in a run and must be exiting one
        if count>1:
            # add a row to ‘out’ that will hold the starting positions in the first column and the length
            # of runs in the second column; this appends rows to out after finding and counting each run
            out.loc[len(out)]=[start,count]
        # reset count to 1 because we just exited a run
        count=1
    # remember cur holds the previous element in findRuns, so we need to update this after each time
    # we go through the for loop
    cur=findRuns.iloc[i,0]
# a run that reaches the very end of the vector never hits the else branch above, so record it here
if count>1:
    out.loc[len(out)]=[start,count]
cur
out
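If the loop above feels like a lot of bookkeeping, here is a shorter sketch of the same idea using `itertools.groupby` from the standard library, which bundles consecutive equal values together. This is an alternative to the class answer, not a replacement for understanding it; `find_runs` and the example list are my own.

```python
from itertools import groupby


def find_runs(values):
    """Return (start_index, run_length) for every run of two or more
    equal neighbouring values."""
    runs = []
    index = 0
    for _, group in groupby(values):
        length = len(list(group))
        if length > 1:
            runs.append((index, length))
        # move the index past this group, whether or not it was a run
        index += length
    return runs


example = [1, 1, 2, 3, 3, 3, 4, 5, 5]
print(find_runs(example))
```

Notice that the run of 5's at the very end of the example is picked up without any special handling.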

Python intro

Getting started with Python

Helpful, lengthy explanation of indexes and slices in arrays: https://stackoverflow.com/a/24713353
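A quick sketch of the main indexing and slicing patterns on a plain list (the variable names are just for illustration). Remember that Python counts from 0 and that the end of a slice is not included.

```python
letters = ["a", "b", "c", "d", "e"]

first = letters[0]             # indexing starts at 0
last = letters[-1]             # negative indexes count from the end
middle = letters[1:4]          # start is included, stop is not
every_other = letters[::2]     # the third number is the step
reversed_copy = letters[::-1]  # a step of -1 walks backwards

print(first, last)
print(middle)
print(every_other)
print(reversed_copy)
```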

Numpy

Numpy is a library for arrays.

Numpy basics: https://docs.scipy.org/doc/numpy-1.15.0/user/quickstart.html

More explanation of data types (numpy): https://docs.scipy.org/doc/numpy-1.15.0/user/basics.types.html

Why do we want to use numpy vs regular python lists? https://stackoverflow.com/questions/993984/what-are-the-advantages-of-numpy-over-regular-python-lists

Pandas

Pandas is a python library for data frames. Understanding the basics of numpy will be helpful before getting into pandas.

Pandas introduction: http://pandas.pydata.org/pandas-docs/stable/10min.html

And another: https://www.learnpython.org/en/Pandas_Basics

Selecting data in a dataframe (iloc): https://www.shanelynn.ie/select-pandas-dataframe-rows-and-columns-using-iloc-loc-and-ix/

Tutorial Challenge from 10/05/18 – Prompt and corresponding code

Git – Version Control and Reproducibility

Welcome.

Everything is fine.

Git & github links/instructions

Since we’re going to be using git for the rest of the semester to turn in assignments, we should get it down pretty good now.

Here’s a nice explanation of git & github that uses an analogy of an author writing a book.

https://blog.red-badger.com/blog/2016/11/29/gitgithub-in-plain-english

When git goes wrong:

http://ohshitgit.com

Resolve a conflict

https://help.github.com/articles/resolving-a-merge-conflict-using-the-command-line/

Turning in assignments – only one per pair/group

Complete the exercise on your local machine. Commit your changes.
Push your repo to your own github repository.
Go to the GitHub website and your own repository (which is a forked repo of my original repo).
Use “Pull Request” to turn in your assignment (upper middle of the screen, click “New Pull Request”).
Make sure to type your and your collaborator’s names into the text box (ex: jones-rivaldi submission).
Click “Create Pull Request”
Done!

String manipulation with sed and grep

What is string manipulation and why do we care about this?

As biologists, or more generally as people interested in working with data, a lot of what we want to do will require us to manipulate many characters at the same time.

By definition:

A character is a class whose instances can hold a single character value.
A string is an immutable class for working with multiple characters.

For our purposes, we can consider strings as information that we don’t want to use for numerical calculations. DNA sequences, column or row names, and categorical/qualitative data values will generally be strings. You might want to remove the primers from a lot (like a lot) of sequences at the same time, or you might want to remove whitespaces from a dataset you found online.
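Here is a rough Python sketch of those two examples. The primer and the reads are invented for illustration, and `strip_primer` is a made-up helper that only removes the primer when it sits at the very start of a sequence.

```python
# Hypothetical primer and reads, for illustration only
PRIMER = "ACGT"
reads = ["ACGTTTGACCA", "ACGTGGCATTA", "TTGACGTCA"]


def strip_primer(sequence, primer=PRIMER):
    """Remove the primer only if the sequence starts with it."""
    if sequence.startswith(primer):
        return sequence[len(primer):]
    return sequence


trimmed = [strip_primer(read) for read in reads]
print(trimmed)

# Whitespace: strip() trims the ends, split() plus join() removes all of it
messy = "  sample 01 \t"
no_edges = messy.strip()
no_whitespace = "".join(messy.split())
print(repr(no_edges), repr(no_whitespace))
```

The third read contains the primer in the middle, so it is left alone. The same loop works whether you have three sequences or three million.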

Most of our string manipulation is covered by the previous links that are tied in with the for loops, but here are a couple of useful links for some of those commands. ‘awk’ is new, but you’re probably going to run into it during Google adventures. In some ways it is its own programming language, so it’s very much worth learning; we just didn’t quite have time to fit it into our classwork.

https://www.hackerearth.com/practice/algorithms/string-algorithm/basics-of-string-manipulation/tutorial/

https://pythonforbiologists.com/printing-and-manipulating-text/

Chissa Rivaldi
