GitHub 2016 round up
Almost a month into the new year, the season of yearly summaries and New Year’s resolution posts is definitely over.
We have been working on Data Analysis for a while (in stealth mode for the most part, shame on us!) but we felt this was a good time to give you a sneak peek at what we have been up to. A bit late for a 2016 recap? Maybe! One thing is for sure: the Last Mover Advantage is on our side!
As you may already know, we are really passionate about all things Open Source. We have been working on many open source tools to improve developers’ lives by providing well-tested, clean & extensible components, which are used by a thriving community of users and contributors.
GitHub has been a great tool to share this passion as it enables seamless collaboration with fellow developers, so we thought it would be interesting to study the use of GitHub for open source projects during 2016. In this post we want to share our findings to give you insight into trends and fun facts in the OSS ecosystem, and pay a deserved tribute to some of the OSS heroes out there!
We will also discuss the tech stack and techniques we used and walk you through some of the challenges we faced – this may come in handy if you want to perform this type of Data Analysis. Hint: we played around with first-class Data Analysis tools such as Apache Spark, Databricks, Apache Zeppelin and Google BigQuery.
The Tech Stack
Before diving into the data insights, let us walk you through the steps that got us there and the tools we tried out in the process. The GitHub Archive records all the public activity on GitHub and can be accessed for free. On its website you can grab historical data for the activity registered on GitHub since 2/12/2011. For that purpose, it provides an endpoint that lets you request the historical data files by hour, each of which has an average size of over 80MB when unzipped. With 8,784 hourly files in 2016, that adds up to more than 700GB for a single year, which is a lot of data if you want to store and process it yourself. This was a job for Spark!
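To make the size estimate concrete, here is a minimal sketch that lists the hourly archive files for a year and multiplies them by the 80MB average mentioned above. The host name and file naming pattern are assumptions based on how the GitHub Archive currently publishes its files, and this is not the gh2s3 script itself.

```python
from datetime import datetime, timedelta

# Assumption: host and file naming used by the GitHub Archive today.
BASE_URL = "https://data.gharchive.org"
# Average unzipped size per hourly file, as stated in the post.
AVG_UNZIPPED_MB = 80


def hourly_archive_urls(year):
    """Yield one archive URL per hour of the given year."""
    current = datetime(year, 1, 1)
    end = datetime(year + 1, 1, 1)
    while current < end:
        # The hour is appended without zero padding (0..23).
        yield f"{BASE_URL}/{current:%Y-%m-%d}-{current.hour}.json.gz"
        current += timedelta(hours=1)


urls = list(hourly_archive_urls(2016))
estimated_gb = len(urls) * AVG_UNZIPPED_MB / 1000

print(f"{len(urls)} hourly files, roughly {estimated_gb:.0f} GB unzipped")
```

Since 2016 was a leap year, the list contains 366 × 24 hourly files, which is where the 700GB+ figure comes from.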
Spark lets us process data on a computer cluster in a fast and efficient manner. There are several ways to use it, but we chose Databricks because of the high level of abstraction it provides: it uses web notebooks (very similar to Jupyter Notebooks) and makes it easy to run on Amazon EC2 instances. It also lets you load files from S3 out of the box, so we crafted a script called gh2s3 to transfer the GitHub Archive’s 2016 data to S3. In order to get the users’ locations, we made use of Scrapy (a Python library for building crawlers) to extract this data from GitHub’s API. Among many other things, Scrapy lets us throttle the request rate to stay within GitHub’s rate limits. Our initial setup using Databricks is represented in the diagram below.
But then we found out that the GitHub Archive is also published on Google BigQuery as a public dataset, which made us realize that using Google BigQuery as an end-to-end solution for data storage, analysis and visualization was the smarter choice.
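As a side note on the throttling mentioned above, this is a rough sketch of how a request delay could be derived for Scrapy. The limit of 5,000 requests per hour is an assumption (GitHub’s documented limit for authenticated API requests) and the values are illustrative, not the settings we actually ran with.

```python
# Assumption: rate limit for authenticated requests to GitHub's REST API.
REQUESTS_PER_HOUR = 5000

# Minimum spacing between requests (in seconds) to stay under the limit.
min_delay_seconds = 3600 / REQUESTS_PER_HOUR

# Real Scrapy setting names; the values are illustrative.
THROTTLE_SETTINGS = {
    "DOWNLOAD_DELAY": min_delay_seconds,
    # One request at a time, so the delay really bounds the request rate.
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,
    # Keep the delay fixed instead of letting Scrapy randomize it.
    "RANDOMIZE_DOWNLOAD_DELAY": False,
}

print(THROTTLE_SETTINGS)
```

A dictionary like this can be assigned to the `custom_settings` attribute of a spider class, or the same keys can be placed in the project’s settings module.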
So we got exposed to different top-tier tools but ended up finding a great shortcut that saved us time and effort. Epic win!
GitHub 2016 Topographical Depiction
We started off analyzing the commits based on their geographic location by mapping them to the location that the committers declare in their own profile. This work was inspired by similar previous work by Ramiro Gómez, and we used a script based on his with some minor modifications. This script reads JSON files with the user information from GitHub, takes the user’s location string and tries to map it to a country. Since the location field on GitHub is a regular string with no restrictions (for instance, it could be ‘Earth’, ‘localhost’ or ‘Milky Way’), it’s not always mappable to a country. The script tries to find the name of a country or city in the location string and can map it to a country in more than 96% of the cases where the location is not empty. This is a pretty impressive result, but we need to take into account that the location string is a user-generated data point, so even when it is mappable its accuracy is not guaranteed. In this case we plotted the commits from the users by country, then compared them with each country’s population. We also took a look at the number of committers in addition to the commits. To plot the data on the map, we segmented it into 8 levels.
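The mapping idea described above can be sketched as follows. This is only an illustration: the lookup tables are tiny and hypothetical, the function name is ours, and the actual script based on Ramiro Gómez’s work is more thorough.

```python
import re

# Hypothetical, deliberately tiny lookup tables; a real run needs full lists.
COUNTRIES = {
    "uruguay": "Uruguay",
    "germany": "Germany",
    "united states": "United States",
    "usa": "United States",
}
CITIES = {
    "montevideo": "Uruguay",
    "berlin": "Germany",
    "san francisco": "United States",
}


def map_location(location):
    """Return the country found in a free-text location, or None."""
    # Lowercase and turn punctuation into spaces before matching.
    normalized = re.sub(r"[^\w\s]", " ", (location or "").lower())
    # Country names take precedence over city names.
    for table in (COUNTRIES, CITIES):
        for name, country in table.items():
            # Whole-word match, so "usa" does not match "Jerusalem".
            if re.search(rf"\b{re.escape(name)}\b", normalized):
                return country
    return None


for raw in ["Montevideo, Uruguay", "Berlin", "Milky Way", "localhost"]:
    print(repr(raw), "->", map_location(raw))
```

Strings such as ‘Milky Way’ or ‘localhost’ simply fall through and are left unmapped, which is the small share of non-empty locations that cannot be assigned to a country.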
Commits: the usual suspects and a big surprise!
The first map shows the total number of commits by country. It comes as no surprise that the US takes first place by a big margin: it has more commits than the rest of the countries in the top 8 combined! The country with the second most commits in 2016 is Germany, closely followed by China and the United Kingdom, with Canada completing the top 5. Congratulations to the 2016 winners!
China, India and the US, among other countries with huge populations, have a clear advantage over the rest when talking about the total number of commits. When we consider their populations, we get a better sense of the relative performance of different countries. What happens when we compare the number of commits per capita? Doubt no more! We went on and created a new map (use the dropdown next to the map above to switch among the different maps) that displays the commits per capita of every country. Here Switzerland, the Netherlands, Sweden and Canada stand out among the rest. Well done!
Not surprisingly, most of the countries leading the total commits and commits per capita charts are highly developed countries. These are countries with a big impact on the technology industry that also attract technical talent from abroad. Will the 2016 data reflect a relationship between the number of commits per capita and a country’s HDI?
For that purpose we created yet another graph that compares commits per capita against each country’s HDI (Human Development Index) from the last report. The data seem to confirm our initial perception, but we also found out that countries such as Greece, New Zealand and Finland do very well on this one! Other remarkable data points in this graph include Namibia, Uruguay, Ukraine, Bulgaria and Hungary, which perform way better than expected when taking a deeper look at the numbers. At Xmartlabs we are glad to contribute to these stats from our engineering HQ in Montevideo, Uruguay :D.
2016 Biggest Surprise: Cocos Islands
Yes! The Cocos Islands. This tiny country of only 14 km² and 600 inhabitants, located in the Indian Ocean, has an HDI of 0.829 and 11,036 commits during 2016! Woah. Are we in the presence of a programming heaven? We don’t really know :) This and other outliers happen to be countries with very small populations, where the scarcity of data and errors in users’ stated locations can explain the high deviation in their stats. More recognizable countries such as Monaco and Vatican City display a similar pattern.
Up to now we have seen the number of commits, but how many committers do these countries have with respect to their population? By taking a look at the corresponding map we see some differences. This time Iceland, Norway, Denmark and Ireland fare better. Can you spot a trend?
Developers from African countries such as Guinea-Bissau, the Democratic Republic of the Congo, Botswana and Algeria have a significant number of commits. However, we have only found 1, 5, 6 and 6 committers respectively, so they are clearly few but good! Big shout-out to ivandrofly, jniles, tsetsiba and assem-ch, who are great examples of this.
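To put the figures above in perspective, here is a small worked example of the per capita calculation. The commit and population numbers are the ones quoted in this section, while the single misplaced user and their 5,000 commits are purely hypothetical.

```python
def commits_per_capita(commits, population):
    """Commits divided by population; population must be positive."""
    if population <= 0:
        raise ValueError("population must be positive")
    return commits / population


# Figures quoted in the post for the Cocos Islands.
cocos = commits_per_capita(11_036, 600)

# Hypothetical: one prolific user with 5,000 commits wrongly located here.
inflated = commits_per_capita(11_036 + 5_000, 600)

print(f"Commits per capita: {cocos:.2f}")
print(f"With one misplaced user: {inflated:.2f}")
```

With only 600 inhabitants, a single wrongly located account is enough to shift the ratio by several commits per person, which is why we treat these tiny outliers with caution.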
Commit Messages Analysis
The available dataset included commit messages, so we wanted to get some insights out of them. In that regard, we defined some metrics and compared the global values to those of some popular repositories.
The metrics we took were the following:
- Messages with ‘fix’: Messages that include the string fix. These commits should represent bug-fixing commits rather than commits that change documentation or add a new feature.
- Messages with link to an issue or pull request: These are messages containing references to GitHub issues or pull requests like
- Messages shorter than 15 characters: Following the good practices for commit messages posted on several sites like OpenStack, we looked at commit message length to see how many messages are too short. It is not easy to pick a number of characters below which a message counts as too short, but in general commit messages should describe the solved problem or the new feature, so a message with fewer than 15 characters is unlikely to be a good one. We could also have tried a higher number than 15.
- Average message length: The average length of commit messages, to get an insight into how messages are structured for a repo in general.
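The metrics above can be sketched in a few lines of Python. This is an approximation rather than the exact queries we ran: the case-insensitive match on fix and the `#123`-style pattern for issue and pull request references are assumptions, and the sample messages are made up.

```python
import re
from statistics import mean

# Assumption: an issue or pull request reference looks like "#123".
ISSUE_REF = re.compile(r"#\d+")


def message_metrics(messages, short_limit=15):
    """Compute the four commit message metrics for a list of messages."""
    if not messages:
        raise ValueError("at least one commit message is required")
    total = len(messages)
    # Assumption: the search for "fix" ignores case.
    with_fix = sum("fix" in m.lower() for m in messages)
    with_ref = sum(bool(ISSUE_REF.search(m)) for m in messages)
    too_short = sum(len(m) < short_limit for m in messages)
    return {
        "pct_with_fix": 100 * with_fix / total,
        "pct_with_issue_or_pr_ref": 100 * with_ref / total,
        "pct_shorter_than_limit": 100 * too_short / total,
        "avg_length": mean(len(m) for m in messages),
    }


# Made-up sample messages, only to show the shape of the output.
sample = [
    "Fix typo",
    "Add parser, closes #12",
    "wip",
    "Update README with install steps",
]
metrics = message_metrics(sample)
print(metrics)
```

The `short_limit` parameter makes it easy to try a threshold other than 15 characters.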
We then chose some repositories with a lot of stars, from different programming languages and communities. We chose Linux, as one of the biggest repos, as well as Bootstrap and JSON-Server, which share a language but are maintained quite differently. We also compared these to the results for all commit messages, with some interesting results:
|Metric|Linux|Bootstrap|JSON-Server|All repositories|
|---|---|---|---|---|
|Messages with ‘fix’|42%|17%|8%|10%|
|Messages with link to issue or pull request|2%|37%|16%|8%|
|Messages shorter than 15 characters|<1%|5%|26%|17%|
|Average message length (characters)|664.7|82.4|37.5|60.3|
The first thing that caught our eye was the high standard Linux keeps for its commit messages: not even 1 in 100 is shorter than 15 characters, and the average length exceeds 664 characters. This contrasts sharply with the relatively high percentage of short commit messages in JSON-Server, and also in general.
It is not surprising that almost half of the commits in Linux do fix something and that fix appears in those long and complete commit messages. This is consistent with the fact that Linux receives more bug fixes than new features.
There is also a great difference between Bootstrap and Linux in terms of linking to issues and pull requests, as the Linux repo has issue reporting disabled on GitHub and merges commits that do not always come from GitHub pull requests but from the SCM. If that were not the case, the low number of links to issues or pull requests would mean a lot of direct pushes to the master branch (as pull request merges would be caught by this rule).
These are the top 20 repos by stars received in 2016:
We can see that Free Code Camp still gains a lot of traction! It currently has 216,340 stars, so most of them (86%!) were achieved in 2016. I mean, that’s an average of 577 stars per day! However, this outlier can be explained by taking into account that the number of people interested in programming increases considerably every year, that Free Code Camp is the mainstream entry point for it and that one of its first tasks asks users to star the repo.
It’s interesting to see how certain repos that have been around for some time are still among the most starred, like Twitter Bootstrap, gitignore from GitHub and free-programming-books. But incredibly, second place goes to google-interview-university, which reached this position even though it’s way younger than the others. As a result, one can wonder how well new repos perform.
These are the top 20 by stars among the repos created in 2016:
2016 saw amazing repos appear, such as the new dependency manager yarn, public-apis, which has a list of open JSON APIs for web development, and neural-doodle, which generates masterpiece art images from very simple doodles!
|Language|Sum of stars|
For example, take a look at the stars timeline of google-interview-university:
Star events for a repo tend to come in big spikes, as they spread virally, like rumors on social media. The day it reached its peak was the same day that it was shared on Hacker News and from there shared on Reddit. And according to the author, it all started on October 4th when Amit Agarwal shared it on Twitter.
What do we learn from this and from other cases? There is nothing new under the sun: if you have something that’s interesting for people in general or for a niche, and you are able to get someone influential to share it, you will surely make it into the trending list. Chances are that your repo will stay among the top for some time, because the GitHub trending repos are a showcase too :D
2016 was a great year for GitHub. Even though developed countries are the ones that contribute the most, there is a commitment to Open Source from all over the world, including islands with very small populations! And there are countries such as Greece and Namibia that make the most of their potential.
Few commits seem to be fixes overall, whereas in projects such as the Linux kernel there are plenty of them. Besides, new projects seem to be less structured, and some of them go to the other extreme, with meaningless commit messages.
Hope you liked this post! If you have any questions, comments or complaints, don’t hesitate to leave a comment below or drop us a line at [email protected].