Phylogenetic Tree Construction

From Spring 2010 Biocomputing

Resulting Project

To run the project, extract the archive below with 7-Zip and run sim.py.
File:Bioinformatics.7z

How to run:
Once you've downloaded and extracted the archive, you'll find quite a few files in the "bioinformatics" folder. You can ignore most of them; they are supporting files for MUSCLE, DendroPy, and Biopython. The two files you need to know about are "sequences.fasta" and "sim.py": "sim.py" is the main program, and "sequences.fasta" is the data file it reads from. To run it, cd to the "bioinformatics" directory and type "python sim.py".

media:Phylogenetic_Tree_Construction.pdf -- Phylogenetic_Tree_Construction.pdf
media:Phylogenetic_Tree_Construction.docx -- Phylogenetic_Tree_Construction.docx
media:Phylogenetic_Tree_Construction.ppt -- Phylogenetic_Tree_Construction.ppt

Members

  • Tom
  • Michael S
  • Mike L
  • Andrew
  • Mark

Meeting Time

Our first meeting is Tuesday, May 4th, after class in CF416.
We hope to meet every Tuesday after class.
Next meeting:
June 1st @ 4pm
CF 416 or 162

Helpful Project Links

http://hginit.com/ --Learn everything you need to know about Mercurial. It's a long read but totally worth it.
http://biopython.org/wiki/Main_Page -- Biopython website (and tutorial)
http://packages.python.org/DendroPy/index.html -- Phylogenetic tree library for Python.

Project Files

media:sequences.zip --Some hemoglobin sequences for testing.
media:Simple_Similarity_Matrix_Example.JPG --A sample similarity matrix analysis for testing purposes.

June 1st Meeting Minutes -MichaelS

Location: CF162
Start: 3:30pm
End: 4:45

    Present:
  • MikeL
  • MichaelS
  • Tom
  • Andrew
  • Mark

During The Meeting:
Showed us the similarity matrix he built over the weekend
Attempted to pass the matrix built from the sequence data to the similarity matrix function
Merged successfully and got a tree out of the comparisons
Currently the code only gives us one tree, even if there is more than one possibility (the min function picks a single minimum, so ties are not explored)
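
To illustrate the single-tree limitation, here is a minimal sketch of how tied minimums could be found. It assumes the matrix holds distances (number of differences) as a list of lists; the function name closest_pairs and the example matrix are hypothetical and not taken from sim.py.

```python
def closest_pairs(dist):
    """Return every pair (i, j), i < j, whose distance equals the minimum."""
    n = len(dist)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    smallest = min(dist[i][j] for i, j in pairs)
    return [(i, j) for i, j in pairs if dist[i][j] == smallest]


# Hypothetical 4-sequence distance matrix with a tie between (0, 1) and (2, 3).
example = [
    [0, 2, 5, 5],
    [2, 0, 5, 5],
    [5, 5, 0, 2],
    [5, 5, 2, 0],
]
tied = closest_pairs(example)
```

Each tied pair is a possible first merge, so exploring all of them rather than only the first is one way a program could end up with more than one tree.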

To Do:
We could tweak the algorithm. When it recurses and combines two items, and a third item later joins that grouping, it still divides by 2 every time. We would need to count how many sequences are in that part of the list so the weights aren't off.
We want to try to add tick marks before the final due date; it's the main priority if we are able to add anything before Thursday.
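
The weighting tweak described above could look something like the sketch below. It assumes the merge step averages distances and that each group tracks how many sequences it holds; merged_distance and the numbers are hypothetical, not taken from sim.py. This size-weighted average is the rule used by UPGMA-style clustering.

```python
def merged_distance(d_a, d_b, size_a, size_b):
    """Distance from the merged group (A + B) to another group C.

    d_a and d_b are the distances from A and B to C; size_a and size_b
    are the number of sequences already inside A and B.
    """
    return (d_a * size_a + d_b * size_b) / (size_a + size_b)


# Group A already holds 2 sequences, B holds 1.
# Their distances to a third group C are 4 and 7.
plain = (4 + 7) / 2                     # always dividing by 2
weighted = merged_distance(4, 7, 2, 1)  # weighting by group size
```

Here dividing by 2 gives 5.5, while the size-weighted version gives 5.0, because the two sequences in A count twice as much as the single sequence in B.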

May 27th Meeting Minutes -AndrewL

Location: CF162

    Start: 11:00am
    End: 2:00pm

Great coding session, guys!


Present:
MikeL
MichaelS
Tom
Andrew
Mark

Outline for meeting:

    MikeS and Andrew restructured the main loop and the main sim.py file
    MikeS and Andrew fixed a bug so that the parent process waits on the MUSCLE child process
    We all looked at Mark's pseudocode
    MikeL and Tom discussed the tree structure further
    MikeL and Mark collaborated to get Mark's pseudocode structure to work with MikeL's tree code
    Tom helped out and gave good direction for the workflow
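
For reference, waiting on a child process can be sketched as below. This is not the fix that went into sim.py; it assumes a modern Python 3 where subprocess.run blocks until the child exits. The helper run_and_wait is hypothetical, and a harmless stand-in command is used so the sketch runs anywhere. A real call would pass the MUSCLE executable and its arguments, which depend on the MUSCLE version installed.

```python
import subprocess
import sys


def run_and_wait(command):
    """Run an external command, block until it exits, and return its stdout.

    check=True raises CalledProcessError if the child exits with an error,
    so the parent never carries on with missing output.
    """
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout


# Stand-in for the aligner: a child Python process that prints one word.
output = run_and_wait([sys.executable, "-c", "print('aligned')"])
```

Because the call only returns after the child has finished, any file the child writes is complete by the time the parent reads it.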

To-dos (by Tuesday):

Andrew and MikeS: wait on Mark's pseudo code function to materialize by Tuesday.
Mark: Let's finish last time's to-do up no later than Tuesday.
MikeL: write the tree code to integrate with Mark's code.


May 25th Meeting Minutes -AndrewL

CF416

    Start: 3:40pm
    End: 4:30pm

Present:
MikeL
MichaelS
Tom
Andrew
Mark

Outline for meeting:
Go through code and answer questions:

    Switch to objects instead of a dict?
    Output type for trees?

To-dos (by Thursday @ 11 am):

Andrew: convert data type to an object vs a dictionary of strings.
Mark: write the successive function to work with Michael's matrix.
Mike: write the tree code without tick marks.
MikeS: you deserve a break. :)

If we finish that up we'll be in good shape for a coding session tomorrow.
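
As a rough picture of the dictionary-to-object switch, here is a small sketch. The class name Sequence, its fields, and the helper from_dict are all hypothetical; the project's actual object may look quite different.

```python
from dataclasses import dataclass


@dataclass
class Sequence:
    """One named DNA sequence (hypothetical replacement for a dict entry)."""
    name: str
    bases: str

    def __len__(self):
        return len(self.bases)


def from_dict(raw):
    """Convert a {name: bases} dictionary of strings into Sequence objects."""
    return [Sequence(name, bases) for name, bases in raw.items()]


records = from_dict({"human": "ACGT", "mouse": "ACGA"})
```

An object gives each sequence a place for extra attributes later (for example, an aligned version of the string) without changing every function that takes the data.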


May 18th Meeting Minutes -MichaelS

CF416

    Start: 3:45pm
    End: 4:45pm

Present:
MikeL
MichaelS
Tom
Andrew
Mark

(MikeL)
Pointed us to DendroPy as our phylogenetic tree builder (went home sick at 3:55)
For next week: Find the type of input DendroPy needs so we know what to output
(MichaelS)
Done for this week: A function that compares each DNA string to every other one, finds the best combination, and keeps a large tuple of the items to be changed
To do for next week: Write the succession function. The current tuple system is inefficient and confusing. I'm going to look into NumPy to build a matrix that will be a better data structure for keeping track of our comparisons.
(Mark)
Learn Python.
(Tom)
Done for this week: A great small example for us to run through our program to determine whether we are going about it correctly.
Next week: Will be looking at the code, especially our string comparison function. He will try to code up a better comparison function so our output will be more accurate. What he doesn't know how to code he will comment with what he wants (pseudocode) so we can go in and code it up.
(Andrew)
To Do: Link up our program with MUSCLE for the cases where our DNA sequences are different lengths
Combine our input.py with our sim.py

May 11th Meeting Minutes -MichaelS

CF314

Start: 3pm
End: 4:00pm

Present:
MikeL
MichaelS
Tom
Andrew
Mark

(Andrew)
Parsed text files into key/value pairs and created a dictionary to be passed to the similarity matrix
To Do: Read in from the FASTA format, or use the FASTA ID from a database to pull and parse sequences for creating the dictionary
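
Reading FASTA into a dictionary can be sketched in plain Python as below. This is only an illustration of the format (a ">" header line followed by one or more sequence lines); parse_fasta and the sample records are hypothetical. Biopython's SeqIO.parse handles the same job more robustly.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a {sequence_id: sequence} dictionary."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The ID is the first word after '>'; the rest is a description.
            words = line[1:].split()
            current_id = words[0] if words else ""
            records[current_id] = ""
        elif current_id is not None:
            # Sequence data may be wrapped over several lines; join them.
            records[current_id] += line
    return records


sample = ">seq1 hypothetical example\nACGT\nACGT\n>seq2\nTTGA\n"
parsed = parse_fasta(sample)
```

The resulting dictionary maps each ID to its full sequence, which is the key/value shape described above.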
(Tom)
Another example of phylogenetic tree construction and another walkthrough of a simple similarity matrix by hand
To Do: Create a more complex example of a phylogenetic tree to be run through our algorithm
(Mark)
Researched EuGene
To Do: Explore EuGene and how it can be applied to our project.
(MikeL)
Researched Phylo
To Do: Explore Phylo and how it can be applied to our project. Discover the type of output we want from our similarity matrix algorithm to make use of it.
(MichaelS)
Created a simple similarity matrix that evaluates the similarities between DNA sequences using a basic algorithm that compares them character by character and by sequence length.
Explained the current state of the algorithm and its output.
To Do: Expand the algorithm to create successive pieces of the similarity matrix. Remove the constraint that differing DNA sequence lengths give poor scores.
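
A character-to-character comparison of this kind might look like the sketch below. It is not the code from sim.py; difference_count, difference_matrix, and the scoring rule (one point per mismatch plus one point per unmatched position) are assumptions chosen to show why unequal lengths score poorly.

```python
def difference_count(a, b):
    """Count mismatched positions, plus one per position only one sequence has."""
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches + abs(len(a) - len(b))


def difference_matrix(seqs):
    """Build the full pairwise matrix of difference counts."""
    return [[difference_count(a, b) for b in seqs] for a in seqs]


matrix = difference_matrix(["ACGT", "ACGA", "AC"])
```

Here "AC" differs from both longer sequences by 2 purely because of its length, even though its two bases match them exactly. That is the length penalty that aligning the sequences first (for example with MUSCLE) is meant to avoid.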

-Obstacles
Improving the similarity matrix algorithm
Handling spaces in differing DNA sequence lengths
Deciding how to use our similarity matrix output to create a phylogenetic tree
Figuring out how to determine where tick marks are placed on the tree

May 4th Meeting Minutes -MichaelS

CF416

Start: 4pm
End: 5:30pm

Present:
MikeL
MichaelS
Tom
Andrew
Mark

(Tom)
-Review of phylogenetic tree construction
What we are accomplishing and how to do it by hand
Used a matrix of the number of differences to walk through a simple example

(MichaelS)
-Looked at a calendar
(Group Input)
Goals
-Random string of nucleotides
*introduce mutations (week 1)
-Focus on Matrix (similarity), Get it producing (week 1)
-Find graphical package for trees in position
-Compare to other possible trees
-Shoot for sequences of 1023
*but start small
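
The random-string goal could be prototyped along these lines. This is a sketch under assumptions: random_sequence and mutate are hypothetical names, mutations are limited to substitutions, and a fixed seed is used only so repeated runs give the same test data.

```python
import random


def random_sequence(length, rng):
    """Return a random string of nucleotides of the given length."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def mutate(seq, count, rng):
    """Substitute `count` distinct positions, each with a different base."""
    bases = list(seq)
    for pos in rng.sample(range(len(bases)), count):
        bases[pos] = rng.choice([b for b in "ACGT" if b != bases[pos]])
    return "".join(bases)


rng = random.Random(0)  # fixed seed so test data is reproducible
ancestor = random_sequence(20, rng)
descendant = mutate(ancestor, 3, rng)
```

Because the number of introduced mutations is known, the ancestor and descendant make a handy check for the similarity matrix: it should report exactly 3 differences between them.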

(Andrew)
- Set up repository on bitbucket.org
- Everyone creates account / given write access
- Program overview
Tasks
-From file to Matrix
-Algorithm for similarity matrix
-Library Trees
-Library Display

Prototyping
-Unrooted Trees
-Ancestry
-Multiple Trees
