Monday, August 9, 2010

Crossing the 180

I've implemented an algorithm to deal with branches that cross the 180 meridian. The basic algorithm is as follows:

1. get the mathematical midpoint of the two points (or mean if the node has more than 2 children)
2. subtract the midpoint from 180, or if the midpoint is negative subtract the absolute value of the midpoint
3. if the original midpoint was negative leave the value from 2 as is, else change the sign (if the midpoint is positive the new midpoint is (180-midpoint)*-1

I only use this method if one of the decedents of a node has a different sign than the others and is less than -90 or between 90 and 180. Otherwise I just use the regular midpoint. If a node has more than 2 children I use the mean rather than midpoint.

Monday, July 26, 2010

Cosmopolitan Trees

I've been playing around with globally expansive trees and noticed some funny behavior of Google Earth. When trees are very large the lines pass though the Earth instead of following the curvature of the Earth. In order to get curved lines I have to use tessellated lines. These are fine, but the altitude of these lines must be "clamped to earth." This means all 3-D structure of the tree is lost i.e. it produces a flat tree along the planet's surface. I have come up with a few ways to solve this issue:

1. Use the 3-D lines for all trees but if they get too large add points along the line to elevate it so it doesn't pass through the earth. One drawback of this is there will be a lot of meaningless points in the KML and it adds clutter to the tree when viewed in Google Earth.

2. Use the 3-D style tree for compact trees and the tessellated lines for cosmopolitan trees. This is the approach I'm partital to using because I think cosmopolitan trees would look too cluttered with the 3-D style.

3. A hybrid system of 3-D lines for the branches that are close together and tessellated lines for far reaching branches. One problem I see with this is attaching a tessellated line to a 3-D line. The 3-D line would have an absolute altitude above the earth whereas the tessellated line would be clamped to the earth. This would add discontinuity to the tree.

4. Create a tree far above the earth's surface with "drop lines" to mark the position of the leafs.

Thursday, July 8, 2010

KML tree and working code

Trees now look decent, check it out on github. I'm continually adding little touch ups and such so watch for changes. I still have to fix some bugs, but it's getting close to being 100% functional (at least with the test files).

Friday, July 2, 2010

First tree

I've got a very basic tree to appear on Google Earth. It's not much to look at, but the code is finally starting to come together. I'm leaving for Chicago for the holiday weekend so I won't be able to get much done over the weekend, but I'll put the code up on github so you can have a look.

EDIT: the KML file testfile.kml in the phyloGeoRef directory on github has the test tree in it. Just copy that file into Google Earth and the tree should appear.

Thursday, July 1, 2010

KML placemarks working

I've managed to get the KML placemarks working. There is now a pin for each node. Now I will be working on connecting the place marks and styling the nodes. Files will be up on github momentarily.

Monday, June 28, 2010

Jak working

I finally got Jak working (mostly, it compiles with errors, but if you ignore them the sample file will work and you get a very basic kml file, but it's kml non the less). To get it to work, must be installed following the instructions here. This is not an ideal situation for my library as I would like it to be self contained and I don't want to make the user install other libraries in addition to phyloGeoRef. This may mean I'll have to find another KML writing library or write my own. Considering the time frame I think it is still early enough that I would be willing to give writing my own KML writers a shot. This would cut into time I would have otherwise spent writing more parsers to support more filetypes. This doesn't mean more filetypes won't be supported it just makes for more work. At the moment I feel motivated to add this extra bit of work and I think I can still work in additional parser writing. I'll work on it this week in addition to my planned work on the tree location algorithms and see how it goes.

EDIT: The more I play around with jak, the more I discover the cool things it can do. It might be a good idea to keep it after all.

Wednesday, June 23, 2010

Mean Position Algorithm Breakdown

I'm between World Cup matches so I'll take some time to explain how the mean node position function works. Here's the important part of the function:

if ( !node.isExternal() ) {
node.getNodeData().setDistribution(new Distribution(""));
Distribution dist = data.getDistribution();

for (int i=0; i node.getNumberOfDescendants(); i++){ //do mean calcs
PhylogenyNode childNode = node.getChildNode(i);
NodeData childData = childNode.getNodeData();
Distribution childDist = childData.getDistribution();
BigDecimal childLat = childDist.getLatitude();
BigDecimal childLong = childDist.getLongitude();


latSum = latSum.add(childLat);
longSum = longSum.add(childLong);



BigDecimal count = new BigDecimal(c);
BigDecimal meanLat = latSum.divide(count,BigDecimal.ROUND_CEILING);
BigDecimal meanLong = longSum.divide(count,BigDecimal.ROUND_CEILING);


Now to break it down.

if ( !node.isExternal() ) {
node.getNodeData().setDistribution(new Distribution(""));
Distribution dist = data.getDistribution();

We're only assigning non external nodes for this function, so first we check to make sure the node is internal. Next, since this node doesn't have a distribution yet we need to give it an empty one. Then we can get at the distribution of the node, here called dist.

Next we iterate through all the descendants of the node collecting the lat and long of each descendent:

for (int i=0; i node.getNumberOfDescendants(); i++){
PhylogenyNode childNode = node.getChildNode(i);
NodeData childData = childNode.getNodeData();
Distribution childDist = childData.getDistribution();
BigDecimal childLat = childDist.getLatitude();
BigDecimal childLong = childDist.getLongitude();


Now we need to get the sum of all the children, we simply add:

latSum = latSum.add(childLat);
longSum = longSum.add(childLong);

and keep track of how many descendants we've seen: c++;

This last bit is just a convoluted way of getting the mean:

BigDecimal count = new BigDecimal(c);
BigDecimal meanLat = latSum.divide(count,BigDecimal.ROUND_CEILING);
BigDecimal meanLong = longSum.divide(count,BigDecimal.ROUND_CEILING);


It's convoluted because we're dealing with BigDecimal objects so we have to do a bunch of stuff to keep the data types happy.

Sunday, June 20, 2010

Pro Git

I've been using git as my version control system for this project and this is my first foray into the git world. Yesterday I came across this online book, Pro Git which has been very helpful. There's a lot of useful tips and tricks so do check it out if you're new to git like me.

Saturday, June 19, 2010

Mean Position Algorithm

I've implemented the mean position algorithm for assigning coordinates to the nodes. Basically, the function looks at all the children of the node and takes the mean of the latitude and longitude of the children and assigns it to the node.

Friday, June 18, 2010

Binary Tree Algorithm

Here's a brief outline of the binary tree algorithm:

For trees that are completely binary (each node has exactly two children) the leafs are assigned latitude and longitude from the coordinate file. The other nodes are assigned their latitude and longitude based on the midpoint of the two children. The altitudes of all nodes are assigned based on the height of the tree.

EDIT: I'm playing around with this function a lot so check github to make sure you have the latest version.

Thursday, June 17, 2010


I've added a main class for testing library functions and some test files, they're up on github. It's been really helpful for tracking down bugs. I'll keep adding features to this class as the project progresses. It's really basic at the moment, I'll post detailed instructions once everything is looking nice later today. If you're feeling adventurous go ahead and try it out now!

Wednesday, June 9, 2010


I've implemented a function to assign lat/long/alt coordinates to nodes of binary trees. These are trees where each node has exactly two children. This same function can be used for 'lazy' node coordinate assignment. The basic idea is given the height of the tree, altitudes are assigned to each node using altitude = a + ((n-1)*b) from from Janies et al. 2007. Check out the paper or the project site for more info on that algorithm. To assign latitude and longitude the function simply takes the midpoint of the two children. This is exactly how the 'lazy' node assigning function works. Neither of these functions take branch length into consideration.

EDIT: these functions have been pushed to GitHub

Monday, June 7, 2010

Worse is better

This is a less technical update, if you're not interested in my project musings keep calm and carry on. There will be a more technical update later today (if blogger ever comes back up). It's the start of the third week of coding now and my project is well underway. Now that I've gotten back into the Java state of mind and I have a few classes written I've noticed how much being a Python programmer had affected my coding style. To a Java programmer my library probably looks a little strange. Why did you make that a class? I'm asking myself this every once in a while. The answer: it seemed like a good idea at the time. I'm very used to the way Python works and all its idiosyncrasies. This is my first "big" Java project and it's been a really fun learning experience. While not as explosive as the Java robots (my first Java experience) I've definitely learned a ton and this project is probably going to be a lot more useful.

I'm a long time *nix geek and a big fan of the Unix philosophy. As such, when I'm developing any kind of software I tend to follow the Worse is Better philosophy. I take it to mean make sure the code works then make it pretty. That's the attitude I've had from the beginning of this project. If you look at the source on GitHub it looks a little ragged. I'm going back and refining things as I see fit, but I plan to leave the bulk of that until the end of the coding period or even after the project is officially over. The reason: I'd rather deliver something ugly that works well then something that's pretty but broken.

Thursday, June 3, 2010

Triple, Quad and new parsers

Today marks the introduction of two new classes, Triple and Quad. Both are simple classes that define an object that holds three other objects and four other objects respectively. I'm using Triple like so (species name, latitude, longitude) and similarly Quad (species name, latitude, longitude, metadata). The new parsers will take any kind of deliminated file (csv, tsv, basic shapefile) and return an array of Triples or Quads. These arrays can then be passed to the latitude/longitude/altitude calculating functions which will iterate through the nodes assigning latitude/longitude/altitude.

If it seems convoluted right now that's because it is. I'm still working out flow control (see my previous thoughts on storeTree). My approach is to get all the functions I need written and have them all called in a particular order. Magic class takes care of this for the user with the lovely toKML(). Alternatively, the user can call most of the library's functions from their own program. I'm not sure if this is an ideal solution. Another approach is to have the functions private and call each other when needed. I'm hesitant to do this because it will limit the flexibility of the library. The code will be up on github later this evening.

Java's lack of tuples

Switching between Python in my 'day job' and Java for my SoC project is making me wish I could combine the two and have the best of both worlds. I'm currently working on functions that assign altitude, latitude and longitude to nodes in a tree. Currently this info is stored in a List object but I think that may be a mistake. This is a situation I wish I could use Python's tuple or C++'s Pair. I may end up just defining a simple class that imitates this behavior to store this spatial data. This is from Wikipedia:

public class Pair<T, S>{
public Pair(T f, S s){
first = f;
second = s;

public T getFirst(){
return first;

public S getSecond() {
return second;

public String toString() {
return "(" + first.toString() + ", " + second.toString() + ")";

private final T first;
private final S second;
It does basically what I want, a simple object to store lat/long pairs. With a list iterator to parse the lines from the CSV file to get the leaf lat/long this should work out well for storing lat/long pairs. Altitude is determined by nodeAltitude = a + ((n-1)*b) where a is the altitude of a node with only terminal leafs (based on the size of the tree), b is the altitude of nodes that are ancestral to more than one node and n is the number of node from the current node to the leaf. I'm still working on how to calculate lat/long for non terminal nodes. Suggestions welcome!

Tuesday, June 1, 2010

Much ado about storeTree

I'm not sure what to do with the storeTree class. It seems like a necessary class for my library but the more I think about it, the more superfluous it becomes. At the moment it seems I can avoid it with some creative function calling and only retrieving information as needed instead of storing information about the trees in some variable in the hope that it will be useful later in the code. I'm able to get away from it with the creation of the magic class. Magic defines one function, toKML() which takes a tree file and coordinate file and writes a kml file. This is sort of a pseudo main class. Most of the functions in the library are public and your application can call them as needed, but toKML() takes care of this for you. It calls all the necessary functions as needed and tries to guess at the file types you're using without having to specify them. This may turn out to be really useful or really annoying. In any case this class has helped me add structure to my library and it helps shape the way the functions in the library should be used.

Slightly new direction

I've changed up the project plan a little bit. A draft of the new outline is up on the wiki. I also wanted to give a quick, high-level outline of how I want the library to work:

myTree = openTree( treeFile ) //myTree is a phylogeny object

myCoords = parseCSV( inCSVfile ) //myCoords is a List of coords

treeForKML = calculate_node_coords(myTree, myCoords ) //treeForKML is a kml tree object
//this function will actually be quite complex and probably a series of functions by the time it's complete, but this should be hidden from the user.

writeKML( tree_for_kml )

The first two functions have already been written and I'm currently working on iterators for moving through or "walking" the trees. Phylogeny objects already define spatial information for nodes so it will simply be a matter of assigning the proper coordinates at each node. Actually calculating the coordinates will be a much bigger challenge and the focus of next week and the following week of my project.

Thursday, May 27, 2010

getTree and github

I've got the openTree functions working in the getTree class. openTree takes in a tree file and returns an array of phylogeny objects. These objects can be accessed by storeTree.

The csv parsers seem to be working as well. I'm thinking about replacing these with parsers from the GeoTools library. There's tons of stuff there, so I'll need to spend time picking out what I need for phyloGeoRef.

A side note, I'm having some problems with git and github so if you're watching the repo I wouldn't rely on anything that shows up there until I give the OK on this blog.

UPDATE: everything in the github repo should be good to go now :-)

Wednesday, May 26, 2010

CSV Parsers

I've been working on the parseCoord class. This class contains several parsers for parsing delimited files. There are built in csv and tsv parsers and a function to define your own delimiter. These will be utilized to parse files containing the latitude and longitude coordinates. I've been using the opencsv library of csv parsers extensively.

Monday, May 24, 2010

Day 1: Tree classes

A brief summary of day 1:

I'm currently focusing on crating the tree classes. getTree and storeTree are the two I'm working on now. Currently I'm relying on the forester package for my parsers. This package is part of the biojava library. I'm currently implementing newick, nexus and phyloxml parsers. More will come as time permits, but these three are the priority right now.

storeTree, as the name suggests, is the class that defines how a tree object is stored in the library. It mainly defines a tree objects and holds information such as parent-child relationships, number of nodes, branch lengths and leafs.

getTree deals with the parsers and outlines functions that call the appropriate parser. It will probably end up interacting heavily with storeTree.

Wednesday, May 19, 2010

Hello Code

An update of what I've been doing during the SoC bonding period.

First of all, information about the project will be kept mainly in 3 places:

Github will host the actual code and a rough draft of the documentation will be kept on the github wiki.

This blog will have updates of what I'm currently working on and the current state of the project.

Phyloinformatics wiki is a semi-static page that has general info about the project and the project plan. This will be updated as I complete tasks, but this blog should be considered the most up to date record of the project.

Currently trying to decide between NetBeans and Eclipse, any suggestions are welcome.

Open source java libraries
Jak is a java API for kml
BioJava has methods for reading and building phylogenetic trees
GeoPhyloBuilder implements trees in 3D space
freegis has many open source java libraries and applications
java-source has some more open source geo-java libraries and programs