Bioinformatics


We’re using programs like BLAST and HMMER now, and these are just the beginning of an entire world of bioinformatic algorithms. Unfortunately, we don’t have a lot of time to go over these during class, but I compiled some resources here if you’d like to explore them later.

Algorithms

Nature Biotech primer  –  good place to start -> https://www.nature.com/articles/nbt1004-1315

Markov Chains – play around on this so you can get a feel for the underlying dynamics of a Hidden Markov Model: http://setosa.io/blog/2014/07/26/markov-chains/
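If you would rather poke at a Markov chain in code, here is a minimal sketch (Python 3). The two weather states and their transition probabilities are made up for illustration, and the function name simulate_chain is just a placeholder. The point is that the next state depends only on the current one.

```python
import random

# Hypothetical two-state chain; these probabilities are invented for the example.
TRANSITIONS = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def simulate_chain(start, n_steps, seed=None):
    """Return a list of n_steps + 1 states, beginning with `start`."""
    rng = random.Random(seed)
    state = start
    path = [state]
    for _ in range(n_steps):
        next_states = list(TRANSITIONS[state])
        weights = [TRANSITIONS[state][s] for s in next_states]
        # pick the next state using only the current state's row of probabilities
        state = rng.choices(next_states, weights=weights)[0]
        path.append(state)
    return path

path = simulate_chain("sunny", 20, seed=1)
print(" -> ".join(path))
```

Try changing the probabilities and watch how the length of the sunny and rainy streaks changes.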

Hidden Markov Models -> http://www.cs.cmu.edu/~./awm/tutorials/hmm.html

More Hidden Markov Models & Bayesian info ->  https://github.com/laryamamoto/BayesianCourseNotes/blob/master/tex/bayesian.pdf

Many different algorithms (with some awesome example code in the form of Jupyter notebooks) http://www.langmead-lab.org/teaching-materials/

More examples (look at dishonest casino under the Viterbi section) -> http://comprna.upf.edu/courses/Master_AGB/
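For a feel of what the Viterbi algorithm does with the dishonest casino, here is a small sketch in Python 3. All of the probabilities are assumptions picked for illustration (they are not taken from the course linked above), and the names viterbi, START, TRANS and EMIT are placeholders. It works in log space so that long sequences don't underflow.

```python
import math

STATES = ("F", "L")  # F = fair die, L = loaded die

# Illustrative assumptions only: start, transition and emission probabilities.
START = {"F": 0.5, "L": 0.5}
TRANS = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
EMIT = {
    "F": {face: 1.0 / 6 for face in range(1, 7)},
    "L": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5},
}

def viterbi(rolls):
    """Return the most probable hidden-state path (a string of F/L) for the rolls."""
    if not rolls:
        return ""
    # score[s] = log probability of the best path so far that ends in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][rolls[0]]) for s in STATES}
    backpointers = []
    for roll in rolls[1:]:
        new_score, pointer = {}, {}
        for s in STATES:
            # best previous state to have come from if we are in s now
            best_prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            pointer[s] = best_prev
            new_score[s] = (score[best_prev] + math.log(TRANS[best_prev][s])
                            + math.log(EMIT[s][roll]))
        backpointers.append(pointer)
        score = new_score
    # trace back from the best final state
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    return "".join(reversed(path))

print(viterbi([1, 3, 2, 6, 6, 6, 6, 6, 6, 4, 2, 5]))
```

A long streak of sixes gets labelled L (loaded), while rolls that look uniform get labelled F (fair).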

Whether or not you are comfortable with R at this point, there is a wealth of information to be found in the package documentation. Try searching the repository in google like this:

hidden markov model site:https://cran.r-project.org

You’ll find a lot of PDF links. These are generally documents written to accompany packages (aka vignettes), and they will tell you more than you ever wanted to know about the algorithms we’re getting into these days (which is the natural progression of the introduction to biocomputing).

YouTube playlist of  lectures explaining these concepts https://www.youtube.com/playlist?list=PL2mpR0RYFQsBiCWVJSvVAO3OJ2t7DzoHA

Ready for more?

 

Holiday Announcements

Thxgiving break announcements

Hi Biocomputing folx! For those of you who will be around during the Thanksgiving break, note that I will not be in my usual spot for office hours (Jordan Lobby, Thursday 5-6pm), but I will be around all week and am happy to meet – just bounce me an email.

For group projects, I am the contact person for the bioinformatics topic (This includes R folx – we aren’t doing a lot of R/python things since most of this work will be on the remote machines or at least L/unix). I’m happy to help with the other two, but your point person will be better prepared to answer questions about specific grading/requirements.

Plotting in Python

Plotting in (mostly) Python*

*Note to class: some of these may be python 3 & most of what we’ve done so far has been in Python 2. Copy and paste with caution*


Installing the plotnine package (ggplot2 for Python)

Class Specific notes/announcements – 11/1/18 – plotting links continued below

You do not need to have plotnine installed and working correctly to complete exercise 8. There is some example code in the exercise. We’ll go over this in office hours later today if it would be helpful.

MAC PEOPLE HAVING TROUBLE WITH PLOTNINE INSTALLATION

We have another thing to try for plotnine installation. We think that maybe if you start a fresh conda environment some of the dependency issues we’ve been having will be solved. Dr. Jones is waiting to see if we can get this working and then will figure out what to do from there.

In a fresh terminal, type

conda activate

Your terminal should process this for a minute and then your prompt should change from something like this: criva$ to something like this: (base) criva$

From there, you can try to use conda to install plotnine again: conda install -c conda-forge plotnine

Now open spyder and try out some of the plotnine commands. You can use the dataset mpg.txt from class (if you don’t have this dataset, you can download it here.)
import numpy
import pandas as pd
from plotnine import *

mpg=pd.read_csv("mpg.txt",sep="\t",header=0)

a = ggplot(mpg,aes(x="displ",y="cty"))
a+geom_point()+coord_cartesian()

Let me know what happens – hopefully this forces a new environment which should fix the dependency nonsense we’ve been going through.
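If the import fails, it can help to check which Python you are actually running and whether it can see plotnine at all. This is a small sketch using only the standard library (Python 3); the helper name is_installed is made up for this example.

```python
import importlib.util
import sys

def is_installed(package_name):
    """Return True if this interpreter can find `package_name` for import."""
    return importlib.util.find_spec(package_name) is not None

# The path printed here should point inside your conda installation.
print("Python executable:", sys.executable)
print("plotnine visible:", is_installed("plotnine"))
```

If the executable path is not inside your conda folder, Spyder or your terminal is probably using a different Python than the one conda installed plotnine into.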

Some people were having issues with getting numpy re-installed on Wednesday – here is some code that should work for you. Let me know if this doesn’t work. If we didn’t talk about this, ignore it.

Code for numpy installation

sudo pip install --upgrade --ignore-installed --install-option '--install-data=/usr/local' numpy

For what it’s worth, this dependency issue is not something unique to the class. See comic. (https://xkcd.com/1987/)

Links to plotting tutorials/resources

  • Plotting in Pandas (python code):

https://lmicke.github.io/first-ramen-post.html

https://lmicke.github.io/second-ramen-post.html#second-ramen-post

  • Bunch of different types of graphs and the code to make them (Python code):

https://python-graph-gallery.com/

  • Useful Python libraries:

https://blog.modeanalytics.com/python-data-visualization-libraries/

  • All python libraries (similar to R-cran)

https://pypi.org/

  • Lots of examples of ggplot2 graphs (R code):

http://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html#Slope%20Chart

  • Advanced data visualization ideas and examples (not necessarily tutorials and not exclusive to python)

http://sxywu.com/

Python In class challenge answers 10-26-18


Here are the answers to the Python challenges from class this week. The formatting isn’t great here, so the link below shows the same answers in another format, which might be easier for you to read.

https://hackmd.io/s/S1oyAag3Q#

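# cumulative sum

# version 1: pre-allocate the output list, then fill it in by index
x=[3,10,4,12,55]
cs=[0]*len(x)

for i in range(0,len(x)):
    cs[i]=sum(x[0:(i+1)])

print(cs)

# version 2: start with an empty list and append to it
x=[3,10,4,12,55]
cs=list()

for i in range(0,len(x)):
    cs.append(sum(x[0:(i+1)]))

print(cs)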
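For comparison, here is a sketch of the same cumulative sum using the standard library instead of a hand-written loop. It needs Python 3, since itertools.accumulate does not exist in Python 2.

```python
from itertools import accumulate

x = [3, 10, 4, 12, 55]
# accumulate yields the running total after each element
cs = list(accumulate(x))
print(cs)  # [3, 13, 17, 29, 84]
```

The loop versions re-sum the start of the list on every pass; accumulate keeps a running total, which matters once the list gets long.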

# how long to 5 5’s

x=[5,3,2,5,5,1,2,5,3,5,1,5,1]

count=0   # number of 5’s seen so far
i=0       # number of elements examined so far

while count < 5:
    if x[i]==5:
        count+=1
    i+=1

print(i)
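One thing to notice about the while loop: if the list holds fewer than five 5's, i runs off the end and Python raises an IndexError. Here is a sketch of a version that stops cleanly instead. The function name position_of_nth is made up for this example.

```python
def position_of_nth(values, target, n):
    """Return how many elements must be read to see `target` n times, or None."""
    count = 0
    # enumerate from 1 so `position` is a count of elements, not an index
    for position, value in enumerate(values, start=1):
        if value == target:
            count += 1
            if count == n:
                return position
    return None  # ran out of elements before seeing `target` n times

x = [5, 3, 2, 5, 5, 1, 2, 5, 3, 5, 1, 5, 1]
print(position_of_nth(x, 5, 5))  # 10
print(position_of_nth(x, 5, 7))  # None
```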

Answers for Lecture 11 extra practice

#

import pandas as pd

# 0 – calculate sum of female and male wages in wages.csv
wages=pd.read_csv("wages.csv",header=0,sep=",")

femaleSum=0
maleSum=0

for i in range(0,len(wages),1):
    if wages.gender[i]=="female":
        femaleSum=femaleSum+wages.wage[i]
    else:
        maleSum=maleSum+wages.wage[i]

femaleSum
maleSum

#the two commands below won’t work – why?

sum(wages.gender=="female")
sum(wages.gender=="male")

 Find runs –

This is the super abstract one we went over in Tutorial!

# load file – available on Sakai
findRuns=pd.read_csv("findRuns.txt",header=None,sep="\t")

# create an empty data frame 'out' that will hold one row per run
out=pd.DataFrame(columns=['startIndex','runLength'])

# I will use this variable cur to hold onto the previous number in the vector;
# this is analogous to using findRuns[i-1]
cur=findRuns.iloc[0,0]
#cur=findRuns[i-1]
# this is a counter that I use to keep track of how long a run of repeated values is;
# if there are no repeated values then this count equals 1
count=1

# loop through each entry of our vector (except the 1st one, which we set to cur above)
for i in range(1,50,1):
    # test if the ith value in the vector findRuns equals the previous (stored in cur)
    if findRuns.iloc[i,0]==cur:
        # test whether count is 1 (we aren’t in the middle of a run) or >1 (in the middle of a run)
        if count==1:
            # if the ith value in the vector equals the previous (stored in cur) and count is 1, we
            # are at the beginning of a run and we want to store this value (we temporarily store it in 'start')
            start=(i-1)

        # we add one to count because the run continued based on the ith value of findRuns being equal to
        # the previous (stored in cur)
        count=count+1
    # if the ith value in findRuns is not the same as the previous (stored in cur) we either are not in a run
    # or we are ending a run
    else:
        # if count is greater than 1 it means we were in a run and must be exiting one
        if count>1:
            # add a row to 'out' that will hold the starting positions in the first column and the length
            # of runs in the second column; this appends rows to out after finding and counting each run
            out.loc[len(out)]=[start,count]
        # reset count to 1 because we just exited a run
        count=1
    # remember cur holds the previous element in findRuns, so we need to update this after each time
    # we go through the for loop
    cur=findRuns.iloc[i,0]

# a run that reaches the last entry is never closed inside the loop, so record it here
if count>1:
    out.loc[len(out)]=[start,count]

cur
out
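If you want to check your answer against something shorter, here is a sketch of the same idea on a plain Python list using itertools.groupby, which groups neighbouring equal values. The function name find_runs and the test list are made up for this example; it is not the version we wrote in Tutorial.

```python
from itertools import groupby

def find_runs(values):
    """Return (start_index, run_length) for every run of 2 or more equal neighbours."""
    runs = []
    index = 0  # position where the current group starts
    for _, group in groupby(values):
        length = len(list(group))
        if length > 1:
            runs.append((index, length))
        index += length
    return runs

print(find_runs([1, 1, 2, 3, 3, 3, 4]))  # [(0, 2), (3, 3)]
```

To use it on the class file, you could pass in the first column as a list, for example find_runs(list(findRuns.iloc[:,0])).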

Python intro

Getting started with Python

Helpful, lengthy explanation of indexes and slices in arrays: https://stackoverflow.com/a/24713353
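A few slices to try yourself before reading the long version. This is only a sketch and the sequence is made up. Remember that Python counts from 0 and that the end of a slice is not included.

```python
seq = "ATGCGTAC"  # made-up sequence for illustration

first_codon = seq[0:3]    # positions 0, 1, 2
last_three = seq[-3:]     # negative indexes count from the end
every_other = seq[::2]    # the third number is the step size
backwards = seq[::-1]     # a step of -1 walks the string in reverse

print(first_codon, last_three, every_other, backwards)
# ATG TAC AGGA CATGCGTA
```

The same start:stop:step pattern works on lists and numpy arrays.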

Numpy

Numpy is a library for arrays.

Numpy basics: https://docs.scipy.org/doc/numpy-1.15.0/user/quickstart.html

More explanation of data types (numpy): https://docs.scipy.org/doc/numpy-1.15.0/user/basics.types.html

Why do we want to use numpy vs regular python lists? https://stackoverflow.com/questions/993984/what-are-the-advantages-of-numpy-over-regular-python-lists

Pandas

Pandas is a python library for data frames. Understanding the basics of numpy will be helpful before getting into pandas.

Pandas introduction: http://pandas.pydata.org/pandas-docs/stable/10min.html

And another: https://www.learnpython.org/en/Pandas_Basics

Selecting data in a dataframe (iloc): https://www.shanelynn.ie/select-pandas-dataframe-rows-and-columns-using-iloc-loc-and-ix/

Tutorial Challenge from 10/05/18 – Prompt and corresponding code

Git – Version Control and Reproducibility

Welcome.

Everything is fine.

Git & github links/instructions

Since we’re going to be using git for the rest of the semester to turn in assignments, we should get it down pretty good now.

Here’s a nice explanation of git & github that uses an analogy of an author writing a book.

https://blog.red-badger.com/blog/2016/11/29/gitgithub-in-plain-english

When git goes wrong:

http://ohshitgit.com

Resolve a conflict

https://help.github.com/articles/resolving-a-merge-conflict-using-the-command-line/

Turning in assignments – only one per pair/group

  1. Complete the exercise on your local machine. Commit your changes.
  2. Push your repo to your own github repository.
  3. Go to the GitHub website and your own repository (which is a forked repo of my original repo).
  4. Use “Pull Request” to turn in your assignment (upper middle of the screen, click “New Pull Request”).
  5. Make sure to type your name and your collaborator’s name into the text box (ex: jones-rivaldi submission).
  6. Click “Create Pull Request”
  7. Done!

String manipulation with sed and grep


What is string manipulation and why do we care about this?

As biologists, or more generally as people interested in working with data, a lot of what we want to do will require us to manipulate many characters at the same time.

By definition:

  • A character is a class whose instances can hold a single character value.
  • A string is an immutable class for working with multiple characters.

 

For our purposes, we can consider strings as information that we don’t want to use for numerical calculations. DNA sequences, column or row names, and categorical/qualitative data values will generally be strings. You might want to remove the primers from a lot (like a lot) of sequences at the same time, or you might want to remove whitespaces from a dataset you found online.
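Here is a small sketch of both of those jobs in Python 3. The primer and the reads are invented for the example, and the helper names strip_primer and squeeze_whitespace are placeholders. In class we do this kind of thing with sed and grep, so treat this as the same idea in a different tool.

```python
import re

PRIMER = "GTGCCAGC"  # hypothetical primer sequence, made up for this example

def strip_primer(sequence, primer=PRIMER):
    """Remove `primer` from the start of `sequence` if it is there."""
    if sequence.startswith(primer):
        return sequence[len(primer):]
    return sequence

def squeeze_whitespace(text):
    """Trim both ends and collapse runs of spaces or tabs into one space."""
    return re.sub(r"\s+", " ", text.strip())

reads = ["GTGCCAGCATTAGC", "GTGCCAGCGGGTTT", "ACGTACGT"]
print([strip_primer(r) for r in reads])  # ['ATTAGC', 'GGGTTT', 'ACGTACGT']
print(squeeze_whitespace("  Homo   sapiens \t"))  # Homo sapiens
```

The last read has no primer, so it comes back unchanged.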

Most of our string manipulation is covered by the previous links that are tied in with the for loops – here are a couple of useful comics for some of those commands though. ‘awk’ is new, but you’re probably going to run into it during google adventures. In some ways it is its own programming language, so it’s very much worth learning; we just didn’t quite have time to fit it into our classwork.

https://www.hackerearth.com/practice/algorithms/string-algorithm/basics-of-string-manipulation/tutorial/

https://pythonforbiologists.com/printing-and-manipulating-text/

For loops in bash

A collection of really useful links for bash scripting

For loops:

Here are some links to tutorials I’ve compiled so you can get some extra practice using/crafting for loops. All of these will contain information we haven’t covered yet in addition to the basic for loop.

https://jvns.ca/blog/2017/03/26/bash-quirks/

https://astrobiomike.github.io/bash/for_loops

https://ryanstutorials.net/bash-scripting-tutorial/bash-loops.php

http://tldp.org/LDP/abs/html/loops1.html#EX22

Warning about using the output of ‘ls’ as a set for a for loop:

http://mywiki.wooledge.org/ParsingLs

More bash goodies:

http://www.kfirlavi.com/blog/2012/11/14/defensive-bash-programming/

https://google.github.io/styleguide/shell.xml

Test your skills!!

https://cmdchallenge.com/

Environment to test out code if you think something weird might be going on with your setup (warning – there might also be something weird with this setup, I haven’t played with it a whole lot).

https://repl.it/languages

Regex Practice

(Comic: www.xkcd.com/208)

Lots of options for practice – choose your favorite!

https://regexr.com/

https://regexone.com/

http://rubular.com/

Regex combined with sed and awk: https://likegeeks.com/regex-tutorial-linux/?epik=0wDgLEvIWHzZ9

Regex golf – match a string with the shortest possible expression:

https://alf.nu/RegexGolf

Bash scripting cheatsheet: https://devhints.io/bash
Common/useful bash one-liners: http://www.bashoneliners.com/
Friend’s github page with too much awesome information to put it into any other category – spend some time digging around: https://github.com/crazyhottommy/getting-started-with-genomics-tools-and-resources

 

Biocomputing Students: hello!

Beginning of the Semester Info

My office is located at Galvin 175. My lab is Galvin 004. My office hours are Thursdays from 5-6PM (recently changed from 4-5). I’ll be in Jordan near the coffee.

My email is crivaldi@nd.edu

Here’s a friendly link to get you started with installation of UNIX (if you have Windows. Mac users, you don’t need to install anything. Yet.):
https://hackmd.io/s/rkfhUOP8m#

(Note: if you have already installed Cygwin/Ubuntu/etc., there are some tips at the bottom so check it out.)


Slides from Week 1 (8/23):