R Analysis on Git Log Data

A few weeks ago at work, one of the managers asked me an interesting question. They were preparing an end-of-year presentation and needed some stats for a few slides. Paraphrased: 'How many commits did the team deploy in 2014, and is there anything else interesting in those numbers?'

Please note: number of commits is a slimy metric to play with and should not (and is not at my workplace) be used for serious decisions.

Anyway, the number of commits is relatively easy to figure out. Git has a nice command, 'git log', that fetches commit messages and stats on a per-branch basis and also accepts filters to limit results to certain authors, timeframes, etc. The other part of the question is what struck me. Was there a way to analyze the data in the git log in a scientific manner and pull interesting statistics out? A way to use data science - R - on git logs? Well, of course there is.

The first step is to pull the data out of git. Like I mentioned earlier, 'git log' is a simple and easy way to produce a stream of output describing the commit history of a given branch. For example, here is a simple bash line to pull a nicely-formatted stream of history.

    git log --date=local --no-merges --shortstat \
      --pretty=format:"%h,%an,%ad,%s"

Most of the params here are self-explanatory. I wanted the dates to be formatted based on my computer so I didn't have to translate between timezones, which R really stinks at doing, so I used the local setting. I didn't want any merges included in my stream. Also, I wanted to know how many files were affected, the number of lines inserted, and the number of lines deleted, which is what the shortstat flag gives. The --pretty=format option is a bit more tricky. Each %-code is a placeholder: %h is the abbreviated commit hash, %an the author name, %ad the author date, and %s the subject line. For more information on the format placeholders I recommend checking the documentation for git log.

This works great for a quick view but I needed more. I needed something that I could parse in R. The author name, commit message, and shortstat fields could each include commas, so delimiting by comma was useless. Plus git has an annoying way of dropping the shortstat to the next line, which further complicates parsing. I needed something more powerful.

    git log --date=local --no-merges --shortstat \
      --pretty=format:"%h%x09%an%x09%ad%x09%sEOL" |
      perl -pe 's/EOL\n/\t/g' > git.log

By using tabs ('%x09') as delimiters, a token marking the formatted end of line, and Perl to catch that token and pull the shortstat up onto the previous line, I finally had something that was ready for analysis. Tab-delimited files are pretty easy for R to crunch through, even if comma-delimited makes more sense to us. Of course, tabs could still be embedded in author names and commit messages, but I wasn't worried about that case yet.
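To see the Perl step in isolation, here's a quick sanity check on a made-up two-line sample (the hash, author, and message are invented for illustration):

```shell
# simulate two lines of git output: the formatted commit line ending in the
# EOL marker, followed by the shortstat line that git drops onto its own line
printf 'abc1234\tJane\tMon Jan 6 10:00:00 2014\tfix parserEOL\n 1 file changed, 2 insertions(+)\n' |
  perl -pe 's/EOL\n/\t/g'
# the marker and its newline collapse into a tab, leaving one tab-delimited line
```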

    log <- read.delim(
      file = 'git.log',
      header = FALSE,  # git.log has no header row
      sep = "\t",
      col.names = c('commit', 'author', 'time', 'message', 'effect'),
      stringsAsFactors = FALSE
    )

Loading the file into R is easy. Each column will be a character vector, which makes things a bit difficult to play with, and the last column is an annoying cluster of text. Some examples of that last column: '1 file changed, 34 insertions(+), 3 deletions(-)' or '1 file changed, 303 deletions(-)'. Plus, the time column is a flat string with no date information stored in it. This data needs to be cleaned up in R before moving forward.

    # format time column into date format
    log$time <- strptime(log$time, format = '%a %b %e %H:%M:%S %Y')

    # set up three new columns for stat information
    log$files_changed <- numeric(nrow(log))
    log$insertions <- numeric(nrow(log))
    log$deletions <- numeric(nrow(log))

    # loop through the shortstat text to backfill the stat columns
    for (row in 1:nrow(log)) {
      log$files_changed[row] <- regmatches(
        log$effect[row],
        regexec('([0-9]+) files? changed', log$effect[row])
      )[[1]][2]
      log$insertions[row] <- regmatches(
        log$effect[row],
        regexec('([0-9]+) insertions?', log$effect[row])
      )[[1]][2]
      log$deletions[row] <- regmatches(
        log$effect[row],
        regexec('([0-9]+) deletions?', log$effect[row])
      )[[1]][2]
    }

    # re-cast the stat columns as numeric (missing stats become NA)
    log$files_changed <- as.numeric(log$files_changed)
    log$insertions <- as.numeric(log$insertions)
    log$deletions <- as.numeric(log$deletions)
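Before trusting the loop above, the same number-extraction idea can be spot-checked at the shell; this sketch uses grep -oE (not part of the original pipeline) on one sample shortstat string:

```shell
# one sample of the shortstat text described earlier
effect='3 files changed, 34 insertions(+), 3 deletions(-)'
# isolate each stat phrase, then keep only its leading number
echo "$effect" | grep -oE '[0-9]+ insertions?' | grep -oE '[0-9]+'  # prints 34
echo "$effect" | grep -oE '[0-9]+ deletions?' | grep -oE '[0-9]+'   # prints 3
```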

There. Now the data is nice and clean and ready to run analysis on. For example, you can easily figure out a breakdown of inserts and deletes by author like this:

    # limit to 2014 data
    log2014 <- log[
      format(log$time, '%Y') == '2014',
      c('author', 'insertions', 'deletions')
    ]

    # output breakdown of insertions by author
    aggregate(insertions ~ author, log2014, sum)

    # output breakdown of deletions by author
    aggregate(deletions ~ author, log2014, sum)

Or, for something more visual, you can graph the number of commits per author over the year. Here's an example using ggplot2 that will make a pretty graph for the year's data, using one of Bigstock's internal repos as a data source.

    library(ggplot2)

    # limit to 2014 data
    log2014month <- log[
      format(log$time, '%Y') == '2014',
      c('commit', 'author', 'time')
    ]

    # add column for month/year aggregation
    log2014month$short_time <- strftime(
      log2014month$time,
      format = '%Y/%m'
    )

    # aggregate commits by author by month
    log2014month <- aggregate(
      commit ~ short_time + author,
      log2014month,
      length
    )

    # constrain result to at least five commits per month
    log2014month <- log2014month[
      log2014month$commit >= 5,
    ]

    # graph the result using ggplot
    ggplot(log2014month, aes(short_time, commit)) +
      geom_point(aes(color = author))

There is a lot more that one can do with this data, just by tweaking and playing with the fields we brought in. Plus, we didn't even try to parse the commit messages. Or the merged branches. Either one of those facets could bring in a lot of different data and insights into codebase history and flow. Still, this isn't a bad start to playing with git history data.