phyloGeoRef Project Blog: June 2010

Monday, June 28, 2010

Jak working

I finally got Jak working (mostly, it compiles with errors, but if you ignore them the sample file will work and you get a very basic kml file, but it's kml non the less). To get it to work, JAXB2.xxx must be installed following the instructions here. This is not an ideal situation for my library as I would like it to be self contained and I don't want to make the user install other libraries in addition to phyloGeoRef. This may mean I'll have to find another KML writing library or write my own. Considering the time frame I think it is still early enough that I would be willing to give writing my own KML writers a shot. This would cut into time I would have otherwise spent writing more parsers to support more filetypes. This doesn't mean more filetypes won't be supported it just makes for more work. At the moment I feel motivated to add this extra bit of work and I think I can still work in additional parser writing. I'll work on it this week in addition to my planned work on the tree location algorithms and see how it goes.

EDIT: The more I play around with jak, the more I discover the cool things it can do. It might be a good idea to keep it after all.

Wednesday, June 23, 2010

Mean Position Algorithm Breakdown

I'm between World Cup matches so I'll take some time to explain how the mean node position function works. Here's the important part of the function:

if ( !node.isExternal() ) {

node.getNodeData().setDistribution(new Distribution(""));

Distribution dist = data.getDistribution();

for (int i=0; i node.getNumberOfDescendants(); i++){ //do mean calcs

PhylogenyNode childNode = node.getChildNode(i);

NodeData childData = childNode.getNodeData();

Distribution childDist = childData.getDistribution();

BigDecimal childLat = childDist.getLatitude();

BigDecimal childLong = childDist.getLongitude();

childCoordsLat.add(childLat);

childCoordsLong.add(childLong);

latSum = latSum.add(childLat);

longSum = longSum.add(childLong);

c++;

}

BigDecimal count = new BigDecimal(c);

BigDecimal meanLat = latSum.divide(count,BigDecimal.ROUND_CEILING);

BigDecimal meanLong = longSum.divide(count,BigDecimal.ROUND_CEILING);

dist.setLatitude(meanLat);

dist.setLongitude(meanLong);

Now to break it down.

if ( !node.isExternal() ) {
node.getNodeData().setDistribution(new Distribution(""));
Distribution dist = data.getDistribution();

We're only assigning non external nodes for this function, so first we check to make sure the node is internal. Next, since this node doesn't have a distribution yet we need to give it an empty one. Then we can get at the distribution of the node, here called dist.

Next we iterate through all the descendants of the node collecting the lat and long of each descendent:

for (int i=0; i node.getNumberOfDescendants(); i++){
PhylogenyNode childNode = node.getChildNode(i);
NodeData childData = childNode.getNodeData();
Distribution childDist = childData.getDistribution();
BigDecimal childLat = childDist.getLatitude();
BigDecimal childLong = childDist.getLongitude();

childCoordsLat.add(childLat);
childCoordsLong.add(childLong);

Now we need to get the sum of all the children, we simply add:

latSum = latSum.add(childLat);
longSum = longSum.add(childLong);

and keep track of how many descendants we've seen: c++;

This last bit is just a convoluted way of getting the mean:

BigDecimal count = new BigDecimal(c);
BigDecimal meanLat = latSum.divide(count,BigDecimal.ROUND_CEILING);
BigDecimal meanLong = longSum.divide(count,BigDecimal.ROUND_CEILING);

dist.setLatitude(meanLat);
dist.setLongitude(meanLong);

It's convoluted because we're dealing with BigDecimal objects so we have to do a bunch of stuff to keep the data types happy.

Sunday, June 20, 2010

Pro Git

I've been using git as my version control system for this project and this is my first foray into the git world. Yesterday I came across this online book, Pro Git which has been very helpful. There's a lot of useful tips and tricks so do check it out if you're new to git like me.

Saturday, June 19, 2010

Mean Position Algorithm

I've implemented the mean position algorithm for assigning coordinates to the nodes. Basically, the function looks at all the children of the node and takes the mean of the latitude and longitude of the children and assigns it to the node.

Friday, June 18, 2010

Binary Tree Algorithm

Here's a brief outline of the binary tree algorithm:

For trees that are completely binary (each node has exactly two children) the leafs are assigned latitude and longitude from the coordinate file. The other nodes are assigned their latitude and longitude based on the midpoint of the two children. The altitudes of all nodes are assigned based on the height of the tree.

EDIT: I'm playing around with this function a lot so check github to make sure you have the latest version.

Thursday, June 17, 2010

testMain

I've added a main class for testing library functions and some test files, they're up on github. It's been really helpful for tracking down bugs. I'll keep adding features to this class as the project progresses. It's really basic at the moment, I'll post detailed instructions once everything is looking nice later today. If you're feeling adventurous go ahead and try it out now!

Wednesday, June 9, 2010

lat/long/alt

I've implemented a function to assign lat/long/alt coordinates to nodes of binary trees. These are trees where each node has exactly two children. This same function can be used for 'lazy' node coordinate assignment. The basic idea is given the height of the tree, altitudes are assigned to each node using altitude = a + ((n-1)*b) from from Janies et al. 2007. Check out the paper or the project site for more info on that algorithm. To assign latitude and longitude the function simply takes the midpoint of the two children. This is exactly how the 'lazy' node assigning function works. Neither of these functions take branch length into consideration.

EDIT: these functions have been pushed to GitHub

Monday, June 7, 2010

Worse is better

This is a less technical update, if you're not interested in my project musings keep calm and carry on. There will be a more technical update later today (if blogger ever comes back up). It's the start of the third week of coding now and my project is well underway. Now that I've gotten back into the Java state of mind and I have a few classes written I've noticed how much being a Python programmer had affected my coding style. To a Java programmer my library probably looks a little strange. Why did you make that a class? I'm asking myself this every once in a while. The answer: it seemed like a good idea at the time. I'm very used to the way Python works and all its idiosyncrasies. This is my first "big" Java project and it's been a really fun learning experience. While not as explosive as the Java robots (my first Java experience) I've definitely learned a ton and this project is probably going to be a lot more useful.

I'm a long time *nix geek and a big fan of the Unix philosophy. As such, when I'm developing any kind of software I tend to follow the Worse is Better philosophy. I take it to mean make sure the code works then make it pretty. That's the attitude I've had from the beginning of this project. If you look at the source on GitHub it looks a little ragged. I'm going back and refining things as I see fit, but I plan to leave the bulk of that until the end of the coding period or even after the project is officially over. The reason: I'd rather deliver something ugly that works well then something that's pretty but broken.

Thursday, June 3, 2010

Triple, Quad and new parsers

Today marks the introduction of two new classes, Triple and Quad. Both are simple classes that define an object that holds three other objects and four other objects respectively. I'm using Triple like so (species name, latitude, longitude) and similarly Quad (species name, latitude, longitude, metadata). The new parsers will take any kind of deliminated file (csv, tsv, basic shapefile) and return an array of Triples or Quads. These arrays can then be passed to the latitude/longitude/altitude calculating functions which will iterate through the nodes assigning latitude/longitude/altitude.

If it seems convoluted right now that's because it is. I'm still working out flow control (see my previous thoughts on storeTree). My approach is to get all the functions I need written and have them all called in a particular order. Magic class takes care of this for the user with the lovely toKML(). Alternatively, the user can call most of the library's functions from their own program. I'm not sure if this is an ideal solution. Another approach is to have the functions private and call each other when needed. I'm hesitant to do this because it will limit the flexibility of the library. The code will be up on github later this evening.

Java's lack of tuples

Switching between Python in my 'day job' and Java for my SoC project is making me wish I could combine the two and have the best of both worlds. I'm currently working on functions that assign altitude, latitude and longitude to nodes in a tree. Currently this info is stored in a List object but I think that may be a mistake. This is a situation I wish I could use Python's tuple or C++'s Pair. I may end up just defining a simple class that imitates this behavior to store this spatial data. This is from Wikipedia:

public class Pair<T, S>{
public Pair(T f, S s){
first = f;
second = s;
}

public T getFirst(){
return first;
}

public S getSecond()   {
return second;
}

public String toString()  {
return "(" + first.toString() + ", " + second.toString() + ")";
}

private final T first;
private final S second;
}

It does basically what I want, a simple object to store lat/long pairs. With a list iterator to parse the lines from the CSV file to get the leaf lat/long this should work out well for storing lat/long pairs. Altitude is determined by nodeAltitude = a + ((n-1)*b) where a is the altitude of a node with only terminal leafs (based on the size of the tree), b is the altitude of nodes that are ancestral to more than one node and n is the number of node from the current node to the leaf. I'm still working on how to calculate lat/long for non terminal nodes. Suggestions welcome!

Tuesday, June 1, 2010

Much ado about storeTree

I'm not sure what to do with the storeTree class. It seems like a necessary class for my library but the more I think about it, the more superfluous it becomes. At the moment it seems I can avoid it with some creative function calling and only retrieving information as needed instead of storing information about the trees in some variable in the hope that it will be useful later in the code. I'm able to get away from it with the creation of the magic class. Magic defines one function, toKML() which takes a tree file and coordinate file and writes a kml file. This is sort of a pseudo main class. Most of the functions in the library are public and your application can call them as needed, but toKML() takes care of this for you. It calls all the necessary functions as needed and tries to guess at the file types you're using without having to specify them. This may turn out to be really useful or really annoying. In any case this class has helped me add structure to my library and it helps shape the way the functions in the library should be used.

Slightly new direction

I've changed up the project plan a little bit. A draft of the new outline is up on the wiki. I also wanted to give a quick, high-level outline of how I want the library to work:

myTree = openTree( treeFile ) //myTree is a phylogeny object

myCoords = parseCSV( inCSVfile ) //myCoords is a List of coords

treeForKML = calculate_node_coords(myTree, myCoords ) //treeForKML is a kml tree object
//this function will actually be quite complex and probably a series of functions by the time it's complete, but this should be hidden from the user.

writeKML( tree_for_kml )

The first two functions have already been written and I'm currently working on iterators for moving through or "walking" the trees. Phylogeny objects already define spatial information for nodes so it will simply be a matter of assigning the proper coordinates at each node. Actually calculating the coordinates will be a much bigger challenge and the focus of next week and the following week of my project.

phyloGeoRef Project Blog