Phylogenetic Tree Construction
From Spring 2010 Biocomputing
To run, unzip the file (7-Zip) and run sim.py.
How to run:
Once you've downloaded the files, you'll notice that the "bioinformatics" folder contains quite a few files. You can ignore most of them; they are the files needed for MUSCLE, DendroPy, and Biopython. The two files you need to know about are "sequences.fasta" and "sim.py". "sim.py" is the main file that you need to run, and "sequences.fasta" is the data file that "sim.py" reads from. Simply cd to the "bioinformatics" directory and type "python sim.py".
- Michael S
- Mike L
Our first meeting is Tuesday, May 4th, after class in CF416.
We hope to meet every Tuesday after class.
June 1st @ 4pm
CF 416 or 162
Helpful Project Links
http://hginit.com/ --Learn everything you need to know about Mercurial. It's a long read but totally worth it.
http://biopython.org/wiki/Main_Page -- Biopython website (and tutorial)
http://packages.python.org/DendroPy/index.html -- Phylogenetic tree library for Python.
June 1st Meeting Minutes -MichaelS
During the meeting:
Showed us the similarity matrix he built over the weekend
Attempted to pass the matrix built from the sequence data to the similarity matrix function
Merged successfully and got a resulting tree out of comparisons
Currently the code only gives us one tree, even if there is more than one possibility (because of the min function)
We could tweak the algorithm. When it recurses and combines two items, the grouping can grow to three, but the code still divides by 2 every time. We would have to count how many sequences are in that part of the list so the weights aren't off.
We want to try to add tick marks before the final due date. It's the main priority if we are able to add anything before Thursday.
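A minimal sketch of the weighting issue noted above, assuming the merge step averages the distances of the two groups being combined. The function names and the numbers are made up for illustration and are not taken from sim.py.

```python
def simple_average(dist_a, dist_b):
    """Average that always divides by 2, regardless of group size."""
    return (dist_a + dist_b) / 2


def size_weighted_average(dist_a, size_a, dist_b, size_b):
    """Average weighted by how many sequences each group already holds."""
    return (dist_a * size_a + dist_b * size_b) / (size_a + size_b)


# Hypothetical numbers: group AB holds 2 sequences, group C holds 1.
# Distances from each group to some other sequence D:
dist_ab_to_d = 4.0
dist_c_to_d = 10.0

unweighted = simple_average(dist_ab_to_d, dist_c_to_d)
weighted = size_weighted_average(dist_ab_to_d, 2, dist_c_to_d, 1)
print(unweighted, weighted)  # 7.0 6.0
```

With equal group sizes the two functions agree. They only diverge once a group of two or more is merged with a smaller one, which is the case described in the notes.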
May 27th Meeting Minutes -AndrewL
Great coding session guys!
Outline for meeting:
MikeS and Andrew reformatted main loop structure, and main sim.py file
MikeS and Andrew fixed a bug so that the parent waits on the child process for MUSCLE
We all looked at Mark's pseudo code
MikeL and Tom discussed tree structure more
MikeL and Mark collaborated to get Mark's pseudo code structure to work with MikeL's tree code
Tom helped out and gave good direction for the work flow
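For reference, one way to make a parent wait on a child process, sketched with the modern Python 3 subprocess module (an assumption; the project's own code may do this differently). The helper name run_and_wait is hypothetical, and a small Python child stands in for MUSCLE so the sketch runs without the aligner installed.

```python
import subprocess
import sys


def run_and_wait(command):
    """Run an external command and block until it exits.

    subprocess.run() does not return until the child has finished, so any
    output the child produces is complete when this function returns.
    """
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError("command failed: " + result.stderr)
    return result.stdout


# Stand-in child process. For the real aligner, the command list would name
# the MUSCLE executable and its arguments, which depend on the installed
# version and are not shown here.
output = run_and_wait([sys.executable, "-c", "print('aligned')"])
print(output.strip())
```

The point is only that the parent does not read the alignment until the child has exited; reading earlier risks seeing a partial or missing file.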
To-dos (by Tuesday):
Andrew and MikeS: wait on Mark's pseudo code function to materialize by Tuesday.
Mark: Let's finish last time's to-do up no later than Tuesday.
MikeL: write the tree code to integrate with Mark's code.
May 25th Meeting Minutes -AndrewL
Outline for meeting:
Go through code and answer questions:
Switch to objects instead of dict?
Output type for trees?
To-dos (by Thursday @ 11 am):
Andrew: convert data type to an object vs a dictionary of strings.
Mark: write the successive function to work with Michael's matrix.
Mike: write the tree code without tick marks.
MikeS: you deserve a break. :)
If we finish that up we'll be in good shape for a coding session tomorrow.
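A rough sketch of what moving from a dictionary of strings to objects could look like. The class name SequenceRecord and its fields are hypothetical, chosen only to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass
class SequenceRecord:
    """Hypothetical record type; the field names are illustrative."""
    name: str
    bases: str

    def __len__(self):
        return len(self.bases)


# The same data as a dictionary of strings, then as a list of objects.
as_dict = {"seq1": "ACGT", "seq2": "ACGA"}
records = [SequenceRecord(name, bases) for name, bases in as_dict.items()]
print(records[0].name, len(records[0]))
```

An object keeps the name and the sequence together, and gives us a place to hang extra fields later without changing every function that takes the data.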
May 18th Meeting Minutes -MichaelS
Pointed us to dendropy as our phylogenetic tree builder (3:55 went home sick)
For next week: find the type of input dendropy needs so we know what to output
Done for this week: Function to compare each DNA string to every other one, finding the best combination and keeping a large tuple of the items to be changed
To do for next week: Write the succession function. The current tuple system is inefficient and confusing.
I'm going to be looking into NumPy to build a matrix that will be a better data structure for keeping track of our comparisons.
Done for this week: A great small example for us to run through our program to determine if we are going about
our program correctly.
Next week: Will be looking at code, especially our string comparison function. He will try to code up a better
comparison function so our output will be more accurate. What he doesn't know how to code he will comment
with what he wants (pseudocode) so we can go in and code it up.
To Do: Link up our program with MUSCLE for the cases where our DNA sequences are different lengths
Combine our input.py with our sim.py
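A small sketch of the all-against-all comparison described in these notes, kept in a matrix instead of a tuple. It uses a plain list of lists as a stand-in for a NumPy array, and assumes the sequences are already the same length (for example after alignment). All function names and the example sequences are made up.

```python
from itertools import combinations


def count_differences(seq_a, seq_b):
    """Number of positions where two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length first")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


def difference_matrix(sequences):
    """Symmetric matrix (list of lists) of pairwise difference counts."""
    n = len(sequences)
    matrix = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d = count_differences(sequences[i], sequences[j])
        matrix[i][j] = matrix[j][i] = d
    return matrix


def closest_pair(matrix):
    """Indices (i, j), i < j, of the smallest off-diagonal entry."""
    n = len(matrix)
    return min(combinations(range(n), 2), key=lambda ij: matrix[ij[0]][ij[1]])


seqs = ["ACGTAC", "ACGTAA", "TTGTAC", "TTGTGG"]
m = difference_matrix(seqs)
print(m)
print(closest_pair(m))  # (0, 1)
```

Note that min() returns only the first smallest pair it sees, so ties are silently broken by position in the list.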
May 11th Meeting Minutes -MichaelS
parsing text files to make key/value pairs and creating a dictionary to be passed to the similarity matrix
To Do: reading in from the FASTA format, or using the FASTA ID from a database to pull and parse for creating the dictionary
Another example of phylogenetic tree construction and another walk through of a simple similarity matrix by hand
To Do: Create a more complex example of a phylogenetic tree to be run through our algorithm
To Do: Explore EuGene and how it can be applied to our project.
To Do: Explore Phylo and how it can be applied to our project. Discover the type of output we want from our similarity matrix algorithm to make use of it.
Created a simple similarity matrix that evaluates the similarities between DNA sequences using a basic algorithm comparing character-to-character and sequence length.
Explained current state of the algorithm and output.
To Do: Expand algorithm to create successive pieces of the similarity matrix. Remove the constraint of DNA sequence length giving poor scores.
Improving the similarity matrix algorithm
Handling spaces in differing DNA sequence lengths
Deciding how to use our similarity matrix output to create a phylogenetic tree
Figuring out how to determine where tick marks are placed on the tree
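As a starting point for the FASTA to-do above, here is a hand-rolled sketch that reads FASTA text into a dictionary of ID/sequence pairs. It is not the Biopython parser, and the name read_fasta is hypothetical. It assumes the ID is the first word after ">" on each header line.

```python
import io


def read_fasta(handle):
    """Parse FASTA text into {identifier: sequence}."""
    sequences = {}
    current_id = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The ID is the first whitespace-delimited token after ">".
            parts = line[1:].split()
            current_id = parts[0] if parts else ""
            sequences[current_id] = ""
        elif current_id is not None:
            # Sequence data may be wrapped over several lines; join them.
            sequences[current_id] += line.upper()
    return sequences


example = io.StringIO(">seq1 first example\nACGT\nACGT\n>seq2\nACGA\n")
print(read_fasta(example))
```

The resulting dictionary has the same key/value shape as the one built from plain text files, so it could be passed to the similarity matrix code in the same way.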
May 4th Meeting Minutes -MichaelS
-Review of phylogenetic tree construction
What it is we are accomplishing, how to do it by hand
Used a matrix to go through a simple example containing # of differences
-Looked at a calendar
-Random string of nucleotides
*introduce mutations (week 1)
-Focus on Matrix (similarity), Get it producing (week 1)
-Find graphical package for trees in position
-Compare to other possible trees
-Shoot for sequences of 1023
*but start small
- Set up repository on bitbucket.org
- Everyone creates account / given write access
- Program overview
-From file to Matrix
-Algorithm for similarity matrix
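A possible starting point for the week 1 items above (random string of nucleotides, then introduce mutations). The function names, the seed, and the sequence length are arbitrary choices for illustration.

```python
import random

BASES = "ACGT"


def random_sequence(length, rng):
    """Random string of nucleotides of the given length."""
    return "".join(rng.choice(BASES) for _ in range(length))


def mutate(sequence, n_mutations, rng):
    """Copy of sequence with n_mutations distinct positions changed to a different base."""
    positions = rng.sample(range(len(sequence)), n_mutations)
    bases = list(sequence)
    for pos in positions:
        bases[pos] = rng.choice([b for b in BASES if b != bases[pos]])
    return "".join(bases)


rng = random.Random(2010)  # fixed seed so runs are repeatable
ancestor = random_sequence(20, rng)
child = mutate(ancestor, 3, rng)
differences = sum(1 for a, b in zip(ancestor, child) if a != b)
print(ancestor, child, differences)
```

Because each mutation hits a distinct position and always picks a different base, the number of differences equals the number of mutations. That gives us a known answer to check the similarity matrix against.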