K-means Clustering on Hiking Adventures
Six years of hiking in the Upper Peninsula has taken me to a lot of cool places. Mulligan Plains, High Rock Bay, Ives Hill (Lookout Mountain), and the McCormick Tract are all locations that I never dreamed existed during my years at Michigan Tech, places I only found after research and exploring done once I graduated. The other day I began to wonder about where I had been. What lands did I explore the most? Were there sections of the Upper Peninsula that I had hiked more than others?
This question had two large pieces. First, I needed to fetch geodata for my hikes: for each and every hike, whether it was a waterfall visit or a mountain climb, I would need to pull a lat/lng pair that represents the visit. Then I needed to cluster that data - find a mathematical way to spatially group the hikes.
Spoiler: clustering means using k-means.
Fetching the data turned out to be fairly easy. I keep a separate Google Map for each year of hiking, from 2009 - 2014 (2008 was only waterfalls, so I ignore that data). Each map offers the option of downloading a KML file with all the included points. All I had to do was download six different files.
Parsing KMLs was also trivial. Two of the years (2013 and 2014) were recorded in the new Google Maps interface, so the format of the file is slightly different, which was easy to code around. Here's the basic script I used to pull geodata information from the files.
// relative links to the yearly KML files
$kml_list = [
    'data/2009.kml',
    'data/2010.kml',
    'data/2011.kml',
    'data/2012.kml',
    'data/2013.kml',
    'data/2014.kml',
];

$geo_data = [];
foreach ($kml_list as $kml) {
    $kml_data = simplexml_load_file($kml);

    // 2013 and 2014 used the newer Google Maps interface
    if (in_array($kml, ['data/2013.kml', 'data/2014.kml'])) {
        $placemarks = $kml_data->Document->Folder->Placemark;
    } else {
        $placemarks = $kml_data->Document->Placemark;
    }

    // only the coordinates are important
    foreach ($placemarks as $placemark) {
        $coordinate = (string) $placemark->Point->coordinates;
        $coordinate = explode(',', $coordinate);
        array_push($geo_data, array_slice($coordinate, 0, 2));
    }
}
Now I had a nicely formatted multi-dimensional list of points, a container of lat/lng pairs - 311 pairs, to be exact. Next I had to find a way to cluster the points into digestible, understandable sets. Enter k-means.
K-means clustering is mostly used for data mining and signal processing. It attempts to group observations into clusters based on proximity; as long as the observations can be compared numerically, they can be clustered. Geodata works perfectly for this and is easy to visualize. All I wanted to do was group my hikes into areas that I tend to visit often.
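To make the idea concrete, here is a stripped-down sketch of the k-means loop (Lloyd's algorithm). This is not the library's actual code - the function name `naiveKMeans` and the deterministic initialization are my own stand-ins for illustration - but it shows the two steps that repeat until things settle: assign each point to its nearest centroid, then move each centroid to the mean of its cluster.

```php
// A minimal sketch of the k-means loop, not the library's implementation.
// Points are [x, y] pairs; squared Euclidean distance is enough for comparison.
function naiveKMeans(array $points, int $k, int $iterations = 20): array
{
    // stand-in initialization: take the first k observations as centroids
    // (a real implementation would pick randomly, e.g. 'forgy' style)
    $centroids = array_slice($points, 0, $k);

    for ($i = 0; $i < $iterations; $i++) {
        // assignment step: each point joins its nearest centroid
        $clusters = array_fill(0, $k, []);
        foreach ($points as $point) {
            $best = 0;
            $bestDistance = PHP_FLOAT_MAX;
            foreach ($centroids as $index => $centroid) {
                $distance = pow($point[0] - $centroid[0], 2)
                          + pow($point[1] - $centroid[1], 2);
                if ($distance < $bestDistance) {
                    $bestDistance = $distance;
                    $best = $index;
                }
            }
            $clusters[$best][] = $point;
        }

        // update step: move each centroid to the mean of its cluster
        foreach ($clusters as $index => $cluster) {
            if (empty($cluster)) {
                continue;
            }
            $centroids[$index] = [
                array_sum(array_column($cluster, 0)) / count($cluster),
                array_sum(array_column($cluster, 1)) / count($cluster),
            ];
        }
    }

    return $centroids;
}
```

Run on two obvious blobs of points, the centroids drift to the middle of each blob within a couple of iterations.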
First I needed to come up with an algorithm. There are a few written in PHP that I just didn't like - some had hard-coded dependencies on 2-variable observations, some assumed an initialization method, others made the centroids (centers of the clusters) inaccessible. So I built my own and threw it on Packagist. It's on Github: kmeans via php.
Armed with the library, it was easy to calculate the centroids. Here is the rest of the script, along with the final output.
use Jacobemerick\KMeans\KMeans;

$kmeans = new KMeans($geo_data);
$kmeans->cluster(10); // find the 10 most common hike areas

$url = 'http://maps.googleapis.com/maps/api/staticmap';
$url .= '?center=47,-88.1';
$url .= '&zoom=8';
$url .= '&size=640x400';
$url .= '&maptype=roadmap';

// KML stores coordinates as lng,lat; static map markers expect lat,lng
foreach ($kmeans->getCentroids() as $centroid) {
    $url .= "&markers={$centroid[1]},{$centroid[0]}";
}

echo $url;
From the map of results I can see that I spend some time in the Keweenaw (especially the very tip), a lot of time in the Huron Mountains, and some time in Ontonagon County. Outliers, like my Pictured Rocks hike this spring and a visit to Black River Falls a few years ago, are skewing the result set a bit.
There are a few caveats involved. K-means is dandy for exploring and playing with data but is not useful for drawing strong conclusions. There is an initialization step, done by either 'forgy' (pick random observations from the set) or 'random' (pick random locations within the range of the set), that will heavily influence the speed and outcome of the algorithm. For the best results you probably don't want to exceed five clusters.
There is also the problem of using Euclidean distances. The clusters are grouped by relative distance on a flat plane. There is no weighting done on the size of the cluster (more points within a cluster should create a larger 'sphere of influence'). Ideally you would play with different distance methods and weighting factors if you wanted to tease more information out of your data… But this example works for my case.
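As one example of a different distance method, the haversine formula measures great-circle distance over a sphere rather than straight-line distance on a flat plane. The `haversine` helper below is a hypothetical sketch (not part of the library), taking the same lng,lat pairs used throughout and returning kilometers:

```php
// Great-circle distance between two [lng, lat] points, in kilometers.
// A possible drop-in replacement for flat Euclidean distance.
function haversine(array $a, array $b): float
{
    $earthRadius = 6371; // mean earth radius in kilometers

    [$lngA, $latA] = array_map('deg2rad', $a);
    [$lngB, $latB] = array_map('deg2rad', $b);

    $h = sin(($latB - $latA) / 2) ** 2
       + cos($latA) * cos($latB) * sin(($lngB - $lngA) / 2) ** 2;

    return 2 * $earthRadius * asin(sqrt($h));
}
```

At Upper Peninsula scale the difference from flat Euclidean clustering is small, which is part of why the simple version works fine here.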