I don’t have plans to shut this site down, but I have no idea how long it will be maintained by the university in the future.

P.S. – I built my site by forking the Beautiful Jekyll repository. It was a lot easier than I thought it would be! There’s a great tutorial by the author, but if you’re curious about any of the details, feel free to ask me about it!

]]>

I taught at this workshop and made this lesson with the brilliant Shannon Joslin (@IntrprtngGnmcs). If you have questions about getting started with cluster computing, this may answer some of them! Feel free to contact me via email or Twitter with more questions or suggestions for the lesson.

]]>

We’re using programs like BLAST and HMMER now, and these are just the beginning of an entire world of bioinformatic algorithms. We unfortunately don’t have a lot of time to go over these during class, but I compiled some resources here if you’d like to explore them later.

Nature Biotech primer – a good place to start: https://www.nature.com/articles/nbt1004-1315

Markov Chains – play around with this so you can get a feel for the underlying dynamics of a Hidden Markov Model: http://setosa.io/blog/2014/07/26/markov-chains/
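
If you’d rather poke at those dynamics in code, here’s a minimal two-state Markov chain simulation in Python – the states and transition probabilities are invented for illustration:

```python
import random

# hypothetical two-state weather chain; the transition probabilities are invented
P = {"sunny": {"sunny": 0.9, "rainy": 0.1},
     "rainy": {"sunny": 0.5, "rainy": 0.5}}

def simulate(start, n):
    """Walk the chain for n steps and return the visited states."""
    chain = [start]
    for _ in range(n - 1):
        # draw the next state according to the current state's row of P
        p_sunny = P[chain[-1]]["sunny"]
        chain.append("sunny" if random.random() < p_sunny else "rainy")
    return chain

random.seed(42)
walk = simulate("sunny", 10)
print(walk)
```

Run it a few times with different seeds – you’ll notice sunny days tend to clump together because of the 0.9 self-transition, which is exactly the kind of “memory” an HMM exploits.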

Hidden Markov Models: http://www.cs.cmu.edu/~./awm/tutorials/hmm.html

More Hidden Markov Models & Bayesian info: https://github.com/laryamamoto/BayesianCourseNotes/blob/master/tex/bayesian.pdf

Many different algorithms (with some awesome example code in the form of Jupyter notebooks): http://www.langmead-lab.org/teaching-materials/

More examples (look at the dishonest casino under the Viterbi section): http://comprna.upf.edu/courses/Master_AGB/
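
If you’d like to see the dishonest casino as code rather than slides, here’s a minimal Viterbi sketch in Python. The transition and emission probabilities are illustrative textbook-style values, not taken from the course materials:

```python
import math

# dishonest casino: a fair die, and a loaded die that favors 6
states = ["fair", "loaded"]
start = {"fair": 0.5, "loaded": 0.5}
trans = {"fair": {"fair": 0.95, "loaded": 0.05},
         "loaded": {"fair": 0.1, "loaded": 0.9}}
emit = {"fair": {r: 1 / 6 for r in range(1, 7)},
        "loaded": {r: (0.5 if r == 6 else 0.1) for r in range(1, 7)}}

def viterbi(rolls):
    """Most likely hidden state path for a sequence of die rolls."""
    # V[state] = best log-probability of any path ending in state
    V = {s: math.log(start[s]) + math.log(emit[s][rolls[0]]) for s in states}
    back = []  # back[t][state] = best predecessor of state at time t
    for r in rolls[1:]:
        newV, ptr = {}, {}
        for s in states:
            # pick the predecessor that maximizes the path probability
            prev = max(states, key=lambda p: V[p] + math.log(trans[p][s]))
            newV[s] = V[prev] + math.log(trans[prev][s]) + math.log(emit[s][r])
            ptr[s] = prev
        V, back = newV, back + [ptr]
    # trace the best path backwards from the best final state
    path = [max(states, key=V.get)]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

rolls = [1, 2, 4, 5, 2, 3, 6, 6, 6, 6, 6, 6]
print(viterbi(rolls))  # six 'fair' labels followed by six 'loaded' labels
```

The run of sixes at the end gets labeled “loaded” because the 0.5 emission probability for a six eventually outweighs the cost of switching states.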

Whether or not you are comfortable with R at this point, there is a wealth of information in the package documentation. Try a Google search restricted to the CRAN repository, like this:

*hidden markov model site:cran.r-project.org*

You’ll find a lot of PDF links. These are generally documents written to accompany packages (aka vignettes), and they will tell you more than you ever wanted to know about the algorithms we’re getting into these days (the natural progression of an introduction to biocomputing).

YouTube playlist of lectures explaining these concepts: https://www.youtube.com/playlist?list=PL2mpR0RYFQsBiCWVJSvVAO3OJ2t7DzoHA

]]>

Hi Biocomputing folx! For those of you who will be around during the Thanksgiving break, note that I will not be in my usual spot for office hours (Jordan Lobby, Thursday 5-6pm), but I will be around all week and am happy to meet – just bounce me an email.

For group projects, I am the contact person for the bioinformatics topic (this includes the R folx – we aren’t doing a lot of R/Python things, since most of this work will be on the remote machines, or at least in Linux/Unix). I’m happy to help with the other two topics, but your point person will be better prepared to answer questions about specific grading/requirements.

]]>

Quick tutorial for logging on to the remote student computers at Notre Dame

Tutorial I wrote up for MUSCLE and HMMER installations (general installation instructions, although the details tend to vary with every program).

*Note to class: some of these may be Python 3, and most of what we’ve done so far has been in Python 2. Copy and paste with caution*

Installing the PlotNine Package (ggplot2 for Python)

You do not need to have plotnine installed and working correctly to complete exercise 8. There is some example code in the exercise. We’ll go over this in office hours later today if it would be helpful.

MAC PEOPLE HAVING TROUBLE WITH PLOTNINE INSTALLATION

We have another thing to try for plotnine installation. We think that maybe if you start a fresh conda environment some of the dependency issues we’ve been having will be solved. Dr. Jones is waiting to see if we can get this working and then will figure out what to do from there.

In a fresh terminal, type

`conda activate`

Your terminal should process this for a minute and then your prompt should change from something like this: `criva$`

to something like this: `(base) criva$`

From there, you can try to use conda to install plotnine again: `conda install -c conda-forge plotnine`

Now open Spyder and try out some of the plotnine commands. You can use the dataset mpg.txt from class (if you don’t have this dataset, you can download it here.)

`import numpy`

`import pandas as pd`

`from plotnine import *`

`mpg=pd.read_csv("mpg.txt",sep="\t",header=0)`

`a = ggplot(mpg,aes(x="displ",y="cty"))`

`a+geom_point()+coord_cartesian()`

Let me know what happens – hopefully this forces a new environment which should fix the dependency nonsense we’ve been going through.

Some people were having issues with getting numpy re-installed on Wednesday – here is some code that should work for you. Let me know if this doesn’t work. If we didn’t talk about this, ignore it.

Code for numpy installation:

`sudo pip install --upgrade --ignore-installed --install-option '--install-data=/usr/local' numpy`

For what it’s worth, this dependency issue is not something unique to the class. See comic. (https://xkcd.com/1987/)

- Plotting in Pandas (Python code):

https://lmicke.github.io/first-ramen-post.html

https://lmicke.github.io/second-ramen-post.html#second-ramen-post

- Bunch of different types of graphs and the code to make them (Python code):

https://python-graph-gallery.com/

- Useful Python libraries:

https://blog.modeanalytics.com/python-data-visualization-libraries/

- All Python libraries (similar to CRAN for R)

- Lots of examples of ggplot2 graphs (R code):

http://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html#Slope%20Chart

- Advanced data visualization ideas and examples (not necessarily tutorials and not exclusive to python)

- If you’re using jupyter notebooks, here are some tips: https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/

]]>

https://hackmd.io/s/S1oyAag3Q#

    x=[3,10,4,12,55]
    cs=[0]*5
    for i in range(0,len(x)):
        cs[i]=sum(x[0:(i+1)])
    print(cs)

    x=[3,10,4,12,55]
    cs=list()
    for i in range(0,len(x)):
        cs.append(sum(x[0:(i+1)]))
    print(cs)

    x=[5,3,2,5,5,1,2,5,3,5,1,5,1]
    count=0
    i=0
    while count < 5:
        if x[i]==5:
            count+=1
        i+=1
    print(i)
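
For what it’s worth, the first two snippets above recompute `sum(x[0:(i+1)])` on every pass, which is quadratic; carrying a running total (or using numpy’s `cumsum`) does the same job in one pass. A sketch:

```python
import numpy

x = [3, 10, 4, 12, 55]

# carry a running total instead of re-summing the slice on every pass
cs = list()
total = 0
for value in x:
    total += value
    cs.append(total)
print(cs)   # [3, 13, 17, 29, 84]

# numpy does the same thing in one call
print(list(numpy.cumsum(x)))
```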


    import pandas as pd

    # 0 – calculate the sum of female and male wages in wages.csv
    wages=pd.read_csv("wages.csv",header=0,sep=",")
    femaleSum=0
    maleSum=0
    for i in range(0,len(wages),1):
        if wages.gender[i]=="female":
            femaleSum=femaleSum+wages.wage[i]
        else:
            maleSum=maleSum+wages.wage[i]
    femaleSum
    maleSum

    # the two commands below won't work – why?
    sum(wages.gender=="female")
    sum(wages.gender=="male")

    # load file – available on Sakai
    findRuns=pd.read_csv("findRuns.txt",header=None,sep="\t")

    # create an empty dataframe 'out' to hold our results
    out=pd.DataFrame(columns=['startIndex','runLength'])

    # I will use this variable cur to hold onto the previous number in the vector;
    # this is analogous to using findRuns[i-1]
    cur=findRuns.iloc[0,0]

    # this is a counter that I use to keep track of how long a run of repeated values is;
    # if there are no repeated values then this count equals 1
    count=1

    # loop through each entry of our vector (except the 1st one, which we set to cur above)
    for i in range(1,50,1):
        # test if the ith value in the vector findRuns equals the previous (stored in cur)
        if findRuns.iloc[i,0]==cur:
            # test whether count is 1 (we aren't in the middle of a run) or >1 (in the middle of a run)
            if count==1:
                # if the ith value equals the previous one and count is 1, we are at the
                # beginning of a run; temporarily store its starting index in 'start'
                start=(i-1)
            # add one to count because the run continued
            count=count+1
        # if the ith value in findRuns is not the same as the previous (stored in cur),
        # we either are not in a run or we are ending a run
        else:
            # if count is greater than 1 it means we were in a run and must be exiting one
            if count>1:
                # add a row to 'out' holding the starting position in the first column
                # and the run length in the second column; this appends a row after
                # finding and counting each run
                out.loc[len(out)]=[start,count]
            # reset count to 1 because we just exited a run
            count=1
        # remember cur holds the previous element in findRuns, so we need to
        # update it each time through the for loop
        cur=findRuns.iloc[i,0]

    cur
    out
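
Once the loop above makes sense, it’s worth seeing that pandas can find runs without explicit index bookkeeping. This is an alternative sketch (the example data and the shift/cumsum trick are mine, not part of the exercise):

```python
import pandas as pd

# same idea as the loop version: find runs of repeated values in a vector
s = pd.Series([5, 5, 5, 2, 3, 3, 7, 7, 7, 7])

# a new run starts wherever the value differs from the previous one;
# cumsum turns those change-points into a run id for every element
run_id = (s != s.shift()).cumsum()

# for each run, record where it starts and how long it is
runs = pd.DataFrame({
    "startIndex": s.groupby(run_id).apply(lambda g: g.index[0]),
    "runLength": s.groupby(run_id).size(),
})
# keep only real runs (length > 1), as the loop version does
runs = runs[runs.runLength > 1].reset_index(drop=True)
print(runs)
```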

Helpful, lengthy explanation of indexes and slices in arrays: https://stackoverflow.com/a/24713353

Numpy is a library for arrays.

Numpy basics: https://docs.scipy.org/doc/numpy-1.15.0/user/quickstart.html

More explanation of data types (numpy): https://docs.scipy.org/doc/numpy-1.15.0/user/basics.types.html

Why do we want to use numpy vs regular python lists? https://stackoverflow.com/questions/993984/what-are-the-advantages-of-numpy-over-regular-python-lists

Pandas is a python library for data frames. Understanding the basics of numpy will be helpful before getting into pandas.

Pandas introduction: http://pandas.pydata.org/pandas-docs/stable/10min.html

And another: https://www.learnpython.org/en/Pandas_Basics

Selecting data in a dataframe (iloc): https://www.shanelynn.ie/select-pandas-dataframe-rows-and-columns-using-iloc-loc-and-ix/
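
As a quick taste of what that last link covers, here’s a tiny made-up dataframe and a few selections (all names and values are invented):

```python
import pandas as pd

# tiny made-up dataframe to practice selection on
df = pd.DataFrame({"gene": ["actB", "gapdh", "tubA"],
                   "counts": [120, 85, 42]})

print(df.iloc[0, 0])                    # first row, first column, by position
print(df.iloc[0:2, 1])                  # first two rows of the second column
print(df.loc[df.counts > 50, "gene"])   # loc selects by label / boolean mask
```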

Since we’re going to be using git for the rest of the semester to turn in assignments, we should get comfortable with it now.

Here’s a nice explanation of git & github that uses an analogy of an author writing a book.

https://blog.red-badger.com/blog/2016/11/29/gitgithub-in-plain-english

When git goes wrong:

https://help.github.com/articles/resolving-a-merge-conflict-using-the-command-line/

- Complete the exercise on your local machine. Commit your changes.
- Push your repo to your own GitHub repository.
- Go to the GitHub website and your own repository (which is a fork of my original repo).
- Use “Pull Request” to turn in your assignment (upper middle of the screen, click “New Pull Request”).
- Make sure to type your and your collaborator’s names into the text box (ex: jones-rivaldi submission).
- Click “Create Pull Request”.
- Done!
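
The local half of those steps looks roughly like this in the terminal. The directory, file name, and commit message below are made up for illustration:

```shell
# practice run in a throwaway directory (names are invented)
rm -rf /tmp/exercise-repo            # start clean so this can be re-run
mkdir -p /tmp/exercise-repo && cd /tmp/exercise-repo
git init -q
echo 'print("exercise 8")' > exercise8.py
git add exercise8.py                 # stage the file(s) you changed
git -c user.name="student" -c user.email="student@nd.edu" commit -q -m "Complete exercise 8"
git log --oneline                    # confirm the commit is there
# in your real fork you would then run: git push origin master
# and open the Pull Request on the GitHub website
```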

What is string manipulation and why do we care about this?

As biologists, or more generally as people interested in working with data, a lot of what we want to do will require us to manipulate many characters at the same time.

By definition:

- A **character** is a class whose instances can hold a single character value.
- A **string** is an *immutable* class for working with multiple characters.

For our purposes, we can consider strings as information that we don’t want to use for numerical calculations. DNA sequences, column or row names, and categorical/qualitative data values will generally be strings. You might want to remove the primers from a lot (like *a lot*) of sequences at the same time, or you might want to remove whitespaces from a dataset you found online.
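
To make those two use cases concrete, here’s a small Python sketch (the primer sequence and data values are invented):

```python
# trimming a made-up primer off the front of many sequences at once
primer = "ACGT"
reads = ["ACGTTTGGA", "ACGTCCCAT", "GGGTACCAT"]   # the last read has no primer
trimmed = [r[len(primer):] if r.startswith(primer) else r for r in reads]
print(trimmed)   # ['TTGGA', 'CCCAT', 'GGGTACCAT']

# stripping stray whitespace from messy categorical values
raw = ["  female", "male  ", " female "]
clean = [v.strip() for v in raw]
print(clean)     # ['female', 'male', 'female']
```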

Most of our string manipulation is covered by the previous links tied in with the for loops – here are a couple of useful comics for some of those commands, though. `awk` is new, but you’re probably going to run into it during your Google adventures. Since it is in some ways its own programming language, it’s very much worth learning; we just didn’t quite have time to fit it into our classwork.

https://pythonforbiologists.com/printing-and-manipulating-text/

]]>