
Site Maintenance August 31st 2013

This Saturday, August 31st, 2013 at 5AM PDT we will be upgrading a large portion of our database infrastructure in order to provide a faster and more reliable GitHub experience.

We estimate that the upgrades should take no longer than 20 minutes. In order to minimize risk we will be putting the site into maintenance mode while the upgrade is performed. Thus, HTTP, API and Git access to GitHub.com will be unavailable for the duration of the upgrade.

We will update our status page and @githubstatus at the beginning of maintenance and again at the end.

IP Address Changes

As we continue to expand the infrastructure that powers GitHub, we want to make everyone aware of some changes to the IP addresses that we use. Most customers won't have to do anything as a result of these changes.

We mentioned these new addresses back in April and updated the Meta API to reflect them. Some GitHub services have already been migrated to the new addresses, including:

  • api.github.com
  • gist.github.com
  • ssh.github.com

Our next step is to begin using these IP addresses for the main GitHub site, so we're reminding everyone about this change. There are a few gotchas that might affect some people:

  1. If you have explicit firewall rules in place that allow access to GitHub from your network, you'll want to make sure that all of the IP ranges listed in this article are included.
  2. If you have an entry in your /etc/hosts file that points github.com at a specific IP address, you should remove it and instead rely on DNS to give you the most accurate set of addresses.
  3. If you are accessing your repositories over the SSH protocol, you will receive a warning message each time your client connects to a new IP address for github.com. As long as the IP address from the warning is in the range of IP addresses in the previously mentioned Help page, you shouldn't be concerned. Specifically, the new addresses that are being added this time are in the range from 192.30.252.0 to 192.30.255.255 (see the snippet after this list for one way to check). The warning message looks like this:
Warning: Permanently added the RSA host key for IP address '$IP' to the list of known hosts.
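
For the curious, here is a minimal Ruby sketch of one way to check an address against that range; 192.30.252.0/22 is simply the 192.30.252.0 to 192.30.255.255 range written in CIDR notation, and the sample address is made up. The authoritative list is always the Meta API and the Help article mentioned above.

require 'ipaddr'

# The new range mentioned above (192.30.252.0 - 192.30.255.255) as a CIDR block.
NEW_GITHUB_RANGE = IPAddr.new("192.30.252.0/22")

# Hypothetical address copied out of the SSH warning message.
ip = "192.30.252.131"

if NEW_GITHUB_RANGE.include?(IPAddr.new(ip))
  puts "#{ip} is inside the new GitHub range; nothing to worry about."
else
  puts "#{ip} is NOT in the new range; double-check the published IP list."
end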

Thanks for your patience and continued support as we work to make GitHub faster and more reliable!

GitHub Flow in the Browser

Now that you can delete files directly on GitHub we’ve reached a very exciting milestone—the entire GitHub Flow™ is now possible using nothing but a web browser.

What is GitHub Flow?

A little while ago our very own @schacon wrote an article outlining the workflow we use here at GitHub to collaborate using Git. It’s a deceptively simple workflow, but it works brilliantly for us across a huge range of projects, and has adapted extremely well as our team has grown from being only a handful of people to our current size of 183 GitHubbers.

Since we use this workflow every day ourselves, we wanted to bring as many parts of the workflow to our web-based interface as possible. Here's a quick outline of how the GitHub Flow works using just a browser:

  1. Create a branch right from the repository.
  2. Create, edit, and delete files, rename them, or move them around.
  3. Send a pull request from your branch with your changes to kick off a discussion.
  4. Continue making changes on your branch as needed, updating the pull request automatically.
  5. Once the branch is ready to go, the pull request can be merged using the big green button.
  6. Branches can then be tidied up using the delete buttons in the pull request, or on the branches page.
  7. Repeat.

Even before some of these steps were possible in the browser, this simple workflow made it possible for us to iterate extremely quickly, deploy dozens of times each day, and address critical issues with minimal delay. Probably the most compelling aspect of GitHub Flow however is that there is no complicated branching model to have to wrap your head around.

Now that this workflow is available entirely via the web interface though, a whole new set of benefits start to emerge. Let’s dive into a few of them.

Lowering barriers to collaboration

Perhaps the most interesting consequence of having these tools available in the browser is that people don’t have to interact with Git or the command-line at all—let alone understand them—in order to contribute to projects in meaningful ways. This makes for a much easier learning curve for anyone new to Git and GitHub, whether they’re non-technical team members, or experienced developers learning something new.

Lowering barriers to learning

These web-based workflows have been especially helpful for our training team too. For some classes they’ve been able to have people begin learning the basic concepts of Git and GitHub using just the browser, saving students the complexity of installing Git and learning their way around the terminal commands until after they’ve got their heads around the fundamental ideas. When a browser is all you need, hallway demos and super-short classes can also more effectively convey the GitHub Flow to people in less time, and without the distractions of preparing computers with all the necessary software.

Less disruptive workflows

Even if you are a command-line wizard, there are always times when you just need to make a quick tweak to a project. Even if it's just a single typo, doing this in the terminal requires a crazy number of steps. So now, instead of:

  • Switching contexts from your browser into your terminal.
  • Pulling down the code to work on.
  • Making the change.
  • Staging the change.
  • Committing the change.
  • Pushing the change.

…you can simply make the change right there without leaving your browser, saving you time and maintaining your zen-like frame of mind. What’s more, if you have any kind of continuous integration server set up to integrate with our Status API, you’ll even see that your typo fix built successfully and is ready to go!

Writing documentation

When editing Markdown files in repositories, being able to quickly preview how your change will look before you commit is very powerful. Combining that with things like the ability to easily create relative links between Markdown documents, and the availability of fullscreen zen mode for enhanced focus when composing content which isn’t code, the end result is a set of tools which make creating and maintaining documentation a real pleasure indeed.

Working with GitHub Pages websites

For projects that make use of GitHub Pages for their websites, tasks like writing new blog posts, or adding pages to an existing site are incredibly simple now that these workflows are possible in the browser. Whether you use Jekyll or just regular static HTML for your GitHub Pages site, these new workflows mean that you’re able to make changes using GitHub in your browser, have your site rebuilt and deployed automatically, and see your changes reflected live on <username>.github.io or your custom domain before you know it.

To learn more about building websites using GitHub Pages, you should check out these articles.

Get your GitHub Flow on!

In using them ourselves, we’ve found that these web-based workflows have come in handy on an increasingly regular basis, and we’re noticing real benefits from the reduced barrier to contributing. We hope you find these workflows useful too, and we can’t wait to see the interesting ways we know you’ll put them to good use.

Git Merge Berlin 2013

Last month GitHub was proud to host the first Git Merge conference, a place for Git core developers and Git users to meet, talk about Git and share what they've been working on or interested in. The first Git Merge was held in Berlin at the amazing Radisson Blu Berlin on May 9-11, 2013.

[Photo: Git Merge Berlin 2013]

The Git Merge conference came out of the GitTogether meetings that several Git developers held for several years at Google's campus directly after their Google Summer of Code Mentors Summit. We felt that we should hold a similar conference of the Git minds in the EU to accomplish the same things - get Git developers together to meet in person, talk about interesting things they're working on and meet some users.

[Photo: developer dinner]

This conference was run a little differently than most. It was split up into three days - a Developer Day, a User Day and a Hack Day.

The first day was the developer day, limited to individuals who have made contributions to core Git or one of its implementations such as libgit2 or JGit. About 30 developers came and had discussions ranging from an incremental merge tool, to our participation and success in the Google Summer of Code program, to fixing race conditions in the Git server code.

[Photo: developers in discussion]

The second day was the User Day, meant to allow everyone to share tools they were working on or issues they have with Git. The first half of the day was set up entirely in lightning talk format and over 40 talks were given, ranging in duration from a few seconds to nearly 20 minutes. After the lightning talks were done and everyone who wanted to had spoken, we broke up into small group discussions about more specific topics - Laws on GitHub, Git migration issues, tools and tips for teaching Git and more.

The final day was the Hack Day which gave attendees a chance to sit down with people they had met the previous day or two and start working on something interesting.

Notes for the entire conference, collaborated on by attendees, can be found here.

Recorded talks from each day can be found here. Some really interesting examples are Roberto Tyley's bfg-repo-cleaner talk, a tool to clean up bad history in git repositories, and this talk which covers the German federal law repository on GitHub.

Thanks to everyone who attended!

Hey Judy, don't make it bad

Last week we explained how we greatly reduced the rendering time of our web views by switching our escaping routines from Ruby to C. This speed-up was two-fold: the C code for escaping HTML was significantly faster than its Ruby equivalent, and on top of that, the C code was generating a lot fewer objects on the Ruby heap, which meant that subsequent garbage collection runs would run faster.

When working with a mark and sweep garbage collector (like the one in MRI), the number of objects in the heap at any given moment matters a lot. The more objects there are, the longer each GC pause will take (all the objects must be traversed during the mark phase!), and since MRI's garbage collector is also "stop the world", no Ruby code can execute while GC is running, and hence no web requests can be served.

In Ruby 1.9 and 2.0, the ObjectSpace module contains useful metadata regarding the current state of the garbage collector and the Ruby heap. Probably the most useful method provided by this module is count_objects, which returns the number of objects allocated in the Ruby heap, broken down by type: this offers a very insightful bird's-eye view of the current state of the heap.

We tried running count_objects on a fresh instance of our main Rails application, as soon as all the libraries and dependencies were loaded:

GitHub.preload_all
GC.start
count = ObjectSpace.count_objects

puts count[:TOTAL] - count[:FREE]
#=> 605183

Whelp! More than 600k Ruby objects allocated just after boot! That's a lotta heap, like we say in my country. The obvious question now is whether all those objects on the heap are actually necessary, and whether we can free some of them, or avoid allocating them in the first place, to reduce our garbage collection times.

This question, however, is rather hard to answer by using only the ObjectSpace module. Although it offers an ObjectSpace#each_object method to enumerate all the objects that have been allocated, this enumeration is of very little use because we cannot tell where each object was allocated and why.

Fortunately, @tmm1 had a master plan one more time. With a few lines of code, he added a __sourcefile__ and __sourceline__ method to every single object in the Kernel, which kept track of the file and line in which the object was allocated. This is priceless: we are now able to iterate through every single object in the Ruby Heap and pinpoint and aggregate its source of allocation.

GitHub.preload_all
GC.start
ObjectSpace.each_object.to_a.inject(Hash.new 0){ |h,o| h["#{o.__sourcefile__}:#{o.class}"] += 1; h }.
  sort_by{ |k,v| -v }.
  first(10).
  each{ |k,v| printf "% 6d  |  %s\n", v, k }

 36244  |  lib/ruby/1.9.1/psych/visitors/to_ruby.rb:String
 28560  |  gems/activesupport-2.3.14.github21/lib/active_support/dependencies.rb:String
 26038  |  gems/actionpack-2.3.14.github21/lib/action_controller/routing/route_set.rb:String
 19337  |  gems/activesupport-2.3.14.github21/lib/active_support/multibyte/unicode_database.rb:ActiveSupport::Multibyte::Codepoint
 17279  |  gems/mime-types-1.19/lib/mime/types.rb:String
 10762  |  gems/tzinfo-0.3.36/lib/tzinfo/data_timezone_info.rb:TZInfo::TimezoneTransitionInfo
 10419  |  gems/actionpack-2.3.14.github21/lib/action_controller/routing/route.rb:String
  9486  |  gems/activesupport-2.3.14.github21/lib/active_support/dependencies.rb:RubyVM::InstructionSequence
  8459  |  gems/actionpack-2.3.14.github21/lib/action_controller/routing/route_set.rb:RubyVM::InstructionSequence
  5569  |  gems/actionpack-2.3.14.github21/lib/action_controller/routing/builder.rb:String

Oh boy, let's take a look at this in more detail. Clearly, there are allocation sources which we can do nothing about (the Rails core libraries, for example), but the biggest offender here looks very interesting. Psych is the YAML parser that ships with Ruby 1.9+, so apparently something is parsing a lot of YAML and keeping it in memory at all times. Who could this be?

A Pocket-size Library of Babel

Linguist is an open-source Ruby gem which we developed to power our language statistics for GitHub.com.

People push a lot of code to GitHub, and we needed a reliable way to identify and classify all the text files which we display on our web interface. Are they actually source code? What language are they written in? Do they need to be highlighted? Are they auto-generated?

The first versions of Linguist took a pretty straightforward approach towards solving these problems: definitions for all languages we know of were stored in a YAML file, with metadata such as the file extensions for such language, the type of language, the lexer for syntax highlighting and so on.

However, this approach fails in many important corner cases. What's in a file extension? That which we call .h by any other extension would take just as long to compile. It could be C, or it could be C++, or it could be Objective-C. We needed a more reliable way to resolve these cases, and the hundreds of other ambiguous situations in which file extensions are related to more than one programming language, or source files do not even have an extension.

That's why we decided to augment Linguist with a very simple classifier: Armed with a pocket-size Library of Babel of Code Samples (that is, a collection of source code files from different languages hosted in GitHub) we attempted to perform a weighted classification of all the new source code files we encounter.

The idea is simple: when faced with a source code file which we cannot recognize, we tokenize it, and then use a weighted classifier to find out the likelihood that the tokens in the file belong to a given programming language. For example, an #include token is very likely to belong to a C or a C++ file, and not to a Ruby file. A class token could belong to either a C++ file or a Ruby file, but if we find both an #include and a class token in the same file, then the answer is most definitely C++.
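
To make the idea concrete, here is a toy sketch of a token-weighted classifier in Ruby. This is an illustration only, not Linguist's actual code: the token counts, helper names, and smoothing are all made up for the example.

# Hypothetical training data: token counts observed per language.
TOKENS = {
  "C++"  => { "#include" => 300, "class" => 180, "template" => 90 },
  "Ruby" => { "def" => 400, "class" => 250, "end" => 420 },
}

# Total token observations per language, used to normalize the counts.
TOTALS = TOKENS.each_with_object({}) { |(lang, toks), h| h[lang] = toks.values.reduce(:+) }

def classify(tokens)
  scores = TOKENS.keys.map do |lang|
    # Sum log-probabilities so that many small factors don't underflow.
    score = tokens.reduce(0.0) do |sum, tok|
      count = TOKENS[lang].fetch(tok, 0) + 1            # +1 smoothing for unseen tokens
      sum + Math.log(count.to_f / (TOTALS[lang] + 1))
    end
    [lang, score]
  end
  scores.max_by { |_, score| score }.first
end

puts classify(["#include", "class"])  #=> "C++"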

Of course, to perform this classification we need to keep in memory a large list of tokens for every programming language hosted on GitHub, together with their respective probabilities. It was this collection of tokens that was topping the allocation meters for the Ruby garbage collector. For the classifier to be accurate, it needs to be trained with a large dataset (the bigger the better), and although 36,000 token samples are barely enough to train a classifier, they are a lot for the poor Ruby heap.

Take a slow classifier and make it better

We had a very obvious plan to fix this issue: move the massive token dataset out of the Ruby Heap and into native C-land, where it doesn't need to be garbage collected, and keep it as compact as possible in memory.

For this, we decided to store the tokens in a Judy Array, a trie-like data structure that acts as an associative array or key-value store with some very interesting performance characteristics.

As opposed to traditional trie-like data structures that store strings, branching happens at the bit level (i.e. the Judy Array acts as a 256-ary trie), and the nodes are highly compressed: the claim is that thanks to this compression, Judy Arrays can be packed extremely tightly in cache lines, minimizing the number of cache misses per lookup. The supposed result is lookup times that can compete with a hash table's, even though the algorithmic complexity of Judy Arrays is O(log n), like any other trie-like structure.

Of course, there is no real-world silver bullet when it comes to algorithmic performance, and Judy Arrays are no exception. Despite the claims in Judy's original whitepaper, cache misses in modern CPU architectures do not fetch data stored in the Prison of Azkaban; they fetch it from the L2 cache, which happens to be oh-not-that-far-away.

In practice, this means that the constant loss of time caused by a few (certainly not many) cache misses in hash table lookups (O(1)) is not enough to offset the lookup time in a Judy array (O(log n)), no matter how tightly packed it is. On top of that, in hash tables with linear probing and a small step size, the point of reduced cache misses becomes moot, as most of the time collisions can be resolved in the same cache line where they happened. These practical results have been proven over and over again in real-world tests. At the end of the day, a properly tuned hash table will always be faster than a Judy Array.

Why did we choose Judy arrays for the implementation, then? For starters, our goal right now is not related to performance (classification is usually not a performance critical operation), but to maximizing the size of the training dataset while minimizing its memory usage. Judy Arrays, thanks to their remarkable compression techniques, store the keys of our dataset in a much smaller chunk of memory and with much less redundancy than a hash table.

Furthermore, we are pitting Judy Arrays against MRI's Hash Table implementation, which is known to be not particularly performant. With some thought on the way the dataset is stored in memory, it becomes feasible to beat Ruby's hash tables at their own game, even if we are performing logarithmic lookups.

The main design constraint for this problem is that the tokens in the dataset need to be separated by language. The YAML file we load in memory takes the straightforward approach of creating one hash table per language, containing all of its tokens. We can do better using a trie structure, however: we can store all the tokens in the same Judy Array, but prefix them with a unique 2-byte prefix that identifies their language. This creates independent subtrees of tokens inside the same global data structure for each different language, which increases cache locality and reduces the logarithmic cost of lookups.
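
To make the key layout concrete, here is a rough Ruby sketch of how such language-prefixed keys could be built. The real store is a C Judy Array; the language ids and pack format here are illustrative assumptions.

# Hypothetical mapping of languages to small integer ids.
LANGUAGE_IDS = { "C" => 1, "C++" => 2, "Ruby" => 3 }

# Prepend a fixed 2-byte, big-endian language id to the token. All tokens for
# one language then share a common prefix and form a contiguous subtree.
def token_key(language, token)
  [LANGUAGE_IDS.fetch(language)].pack("S>") + token
end

token_key("Ruby", "def")      #=> "\x00\x03def"
token_key("C++", "#include")  #=> "\x00\x02#include"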

[Diagram: Judy Array key structure]

For the average query behavior of the dataset (burst lookups of thousands of tokens of the same language in a row), having these subtrees means keeping the cache permanently warm and minimizing the number of traversals around the Array, since the internal Judy cursor never leaves a language's subtree between queries.

The results of this optimization are much more positive than what we'd expect from benchmarking a logarithmic time structure against one which allegedly performs lookups in constant time:

[Graph: Lookup times (GC disabled)]

In this benchmark, where we have disabled MRI's garbage collector, we can see how the lookup of 3.5 million tokens in the database stays more than 50% faster than the Hash Table, even as we artificially increase the dataset with random tokens. Thanks to the locality of the token subtrees per language, lookup times remain mostly constant and don't exhibit a logarithmic behavior.

Things get even better for Judy Arrays when we enable the garbage collector and GC cycles start being triggered between lookups:

[Graph: Lookup times (GC enabled)]

Here we can see how the massive size of the data structures in the Ruby Heap causes the garbage collector to go bananas, with huge spikes in lookup times as the dataset increases and GC runs are triggered. The Judy Array (stored outside the Ruby Heap) remains completely unfazed by it, and what's more, manages to maintain its constant lookup time while Hash Table lookups become more and more expensive because of the higher garbage collection times.

The cherry on top comes from graphing the RSS usage of our Ruby process as we increase the size of our dataset:

[Graph: RSS usage as the dataset grows]

Once again (and this time as anticipated), Judy Arrays throw MRI's Hash Table implementation under a bus. Their growth remains very much linear and increases extremely slowly, while the hash tables show considerable bumps and very fast growth every time they get resized.

GC for the jilted generation

With the new storage engine for tokens on Linguist's classifier, we are now able to dramatically expand our sampling dataset. A bigger dataset means more accurate classification of programming languages and more accurate language graphs on all repositories; this makes GitHub more awesome.

The elephant in the room still lives on in the shape of MRI's garbage collector, however. Without a generational GC capable of finding and marking roots of the heap that are very unlikely to be freed (if at all), we must pay constant attention to the number of objects we allocate in our main app. More objects not only mean higher memory usage: they also mean higher garbage collection times and slower requests.

The good news is that Koichi Sasada has recently proposed a Generational Garbage Collector for inclusion in MRI 2.1. This prototype is remarkable because it allows a subset of generational garbage collection to happen while maintaining compatibility with MRI's current C extension API, which in its current iteration has several trade-offs (for the sake of simplicity when writing extensions) that make memory management for internal objects extremely difficult.

This compatibility with older versions, of course, comes at a price. Objects in the heap now need to be separated into "shady" and "sunny" objects, depending on whether they have write barriers or not, and hence whether they can be generationally collected. This forces an overly complicated implementation of the GC interfaces (several Ruby C APIs must drop the write barrier from objects when they are used), and the additional bookkeeping needed to separate the different kinds of objects creates performance regressions under lighter GC loads. On top of that, this new garbage collector is also forced to run expensive Mark & Sweep phases for the young generation (as opposed to e.g. a copying phase) because of the design choices that make the current C API support only conservative garbage collection.

Despite the best efforts of Koichi and other contributors, Ruby Core's concern with backwards compatibility (particularly regarding the C Extension API) keeps MRI lagging more than a decade behind Ruby implementations like Rubinius and JRuby which already have precise, generational and incremental garbage collectors.

It is unclear at the moment whether this new GC in its current state will make it into the next version of MRI, and whether it will be a case of "too little, too late" given the many handicaps of the current implementation. The only thing we can do for now is wait and see... Or more like wait and C. HAH. Amirite guys? Amirite?

Heads up: nosniff header support coming to Chrome and Firefox

Both GitHub and Gist offer ways to view "raw" versions of user content. Instead of viewing files in the visual context of the website, the user can see the actual text content as it was committed by the author. This can be useful if you want to select-all-and-copy a file or just see a Markdown file without having it be rendered. The key point is that this is a feature to improve the experience of our human users.

Some pesky non-human users (namely computers) have taken to "hotlinking" assets via the raw view feature -- using the raw URL as the src for a <script> or <img> tag. The problem is that these are not static assets. The raw file view, like any other view in a Rails app, must be rendered before being returned to the user. This quickly adds up to a big toll on performance. In the past we've been forced to block popular content served this way because it put excessive strain on our servers.

We added the X-Content-Type-Options: nosniff header to our raw URL responses way back in 2011 as a first step in combating hotlinking. This has the effect of forcing the browser to treat content in accordance with the Content-Type header. That means that when we set Content-Type: text/plain for raw views of files, the browser will refuse to treat that file as JavaScript or CSS.
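
If you want to do the same in your own Rails application, setting the header is a one-liner in a filter. This is a generic sketch, not our production code:

before_filter :set_nosniff_header

def set_nosniff_header
  response.headers['X-Content-Type-Options'] = 'nosniff'
end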

Until recently, Internet Explorer has been the only browser to respect this header, so this method of hotlinking prevention has not been effective for many users. We're happy to report that the good people at Google and Mozilla are moving towards adoption as well. As nosniff support is added to Chrome and Firefox, hotlinking will stop working in those browsers, and we wanted our beloved users, human and otherwise, to know why.

Content Security Policy

We've started rolling out a new security feature called "Content Security Policy" or CSP. As a user, it will better protect your account against XSS attacks. But, be aware, it may cause issues with some browser extensions and bookmarklets.

Content Security Policy is a new HTTP header that provides a solid safety net against XSS attacks. It does this by blocking inline scripts and limiting the domains that other scripts can be loaded from. This doesn't mean you can forget about escaping user data on the server side, but if you screw up, CSP will give you a last layer of defense.

Preparing your app

CSP header

Activating CSP in a Rails app is trivial since it's just a simple header. You don't need any separate libraries; a simple before filter should do.

before_filter :set_csp

def set_csp
  response.headers['Content-Security-Policy'] = "default-src *; script-src https://assets.example.com; style-src https://assets.example.com"
end

The header defines whitelisted URLs that content can be loaded from. The script-src and style-src directives are both configured with our asset host's (or CDN's) base URL, so no scripts can be loaded from hosts other than ours. Lastly, default-src is a catch-all for all the other directives we didn't define. For example, img-src and media-src can be used to restrict the URLs that images, video, and audio can be loaded from.

If you want to broaden your browser support, set the same header value for X-Content-Security-Policy and X-WebKit-CSP as well. Going forward, you should only have to worry about the Content-Security-Policy standard.
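
A sketch of what that might look like, building on the filter above (the policy string is whatever you settled on for your app):

def set_csp
  policy = "default-src *; script-src https://assets.example.com; style-src https://assets.example.com"

  # The standard header, plus the prefixed variants for older browsers.
  response.headers['Content-Security-Policy']   = policy
  response.headers['X-Content-Security-Policy'] = policy
  response.headers['X-WebKit-CSP']              = policy
end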

As CSP implementations mature, this might become an out of the box feature built into Rails itself.

Turning on CSP is easy, getting your app CSP ready is the real challenge.

Inline scripts

Unless unsafe-inline is set, all inline script tags are blocked. This is the main protection you'll want against XSS.

Most of our prior inline script usage was page specific configuration.

<script type="text/javascript">
GitHub.user = 'josh'
GitHub.repo = 'rails'
GitHub.branch = 'master'
</script>

A better place to put configuration like this would be in a relevant data-* attribute.

<div data-user="josh" data-repo="rails" data-branch="master">
</div>

Inline event handlers

Like inline script tags, inline event handlers are now out too.

If you've written any JS after 2008, you've probably used an unobtrusive style of attaching event handlers. But you may still have some inline handlers lurking around your codebase.

<a href="" onclick="handleClick();"></a>
<a href="javascript:handleClick();"></a>

Until Rails 3, Rails itself generated inline handlers for certain link_to and form_tag options.

<%= link_to "Delete", "/", :confirm => "Are you sure?" %>

would output

<a href="/" onclick="return confirm('Are you sure?');">Delete</a>

With Rails 3, it now emits a declarative data attribute.

<a href="/" data-confirm="Are you sure?">Delete</a>

You'll need to be using a UJS driver like jquery-ujs or rails-behaviors for these data attributes to have any effect.

Eval

The use of eval() is also disabled unless unsafe-eval is set.

Though you may not be using eval() directly in your app code, if you are using any sort of client-side templating library, it might be. Typically, string templates are parsed and compiled into JS functions which are eval'd on the client side for better performance. Take @jeresig's classic micro-templating script as an example. A better approach would be precompiling these templates on the server side using a library like sstephenson/ruby-ejs.
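
For instance, with the EJS gem you can compile a template into a plain JavaScript function at asset-build time, so nothing needs to be eval()'d in the browser. A rough sketch, assuming an EJS.compile-style API and a made-up template and output path:

require 'ejs'

# Compile the template on the server; EJS.compile returns a string of
# JavaScript function source that can be served as a static asset.
template = "Hello <%= name %>"
File.write("app/assets/javascripts/templates/hello.js",
           "var helloTemplate = #{EJS.compile(template)};")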

Another gotcha is returning JavaScript from the server side via RJS or a ".js.erb" template. These would be actions using format.js in a respond_to block. Both jQuery and Prototype need to use eval() to run this code from the XHR response. It's unfortunate that this doesn't work, since your own server is whitelisted in the script-src directive. Browsers would need native support for evaluating text/javascript bodies in order to enforce the CSP policy correctly.
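
For context, this is the kind of perfectly ordinary Rails action that stops working under CSP, because the returned .js.erb body has to be eval()'d by the client-side driver (the controller and model names are made up):

# POST /posts/:post_id/comments
def create
  @comment = @post.comments.create(params[:comment])

  respond_to do |format|
    format.html { redirect_to @post }
    format.js   # renders create.js.erb; jQuery/Prototype eval() the response
  end
end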

Inline CSS

Unless unsafe-inline is set on style-src, all inline style attributes are blocked.

The most common use case is to hide an element on load.

<div class="tab"></div>
<div class="tab" style="display:none"></div>
<div class="tab" style="display:none"></div>

A better approach here would be using a CSS state class.

<div class="tab selected"></div>
<div class="tab"></div>
<div class="tab"></div>
tab { display: none }
tab.selected { display: block }

There are caveats to actually using this feature, though. Libraries that do any sort of feature detection, like jQuery or Modernizr, typically generate and inject custom CSS into the page, which sets off CSP alarms. So for now, most applications will probably need to just disable this feature.

Shortcomings

Bookmarklets

As made clear by the CSP spec, browser bookmarklets shouldn't be affected by CSP.

Enforcing a CSP policy should not interfere with the operation of user-supplied scripts such as third-party user-agent add-ons and JavaScript bookmarklets.

http://www.w3.org/TR/CSP/#processing-model

Whenever the user agent would execute script contained in a javascript URI, instead the user agent must not execute the script. (The user agent should execute script contained in "bookmarklets" even when enforcing this restriction.)

http://www.w3.org/TR/CSP/#script-src

But, none of the browsers get this correct. All cause CSP violations and prevent the bookmarklet from functioning.

Though it's highly discouraged, you can disable CSP in Firefox as a temporary workaround. Open up about:config and set security.csp.enable to false.

Extensions

As with bookmarklets, CSP isn't supposed to interfere with any extensions either. But in reality, this isn't always the case. Specifically, in Chrome and Safari, where extensions are built in JS themselves, it's typical for them to make modifications to the current page, which may trigger a CSP exception.

The Chrome LastPass extension has some issues with CSP compatibility since it attempts to inject inline <script> tags into the current document. We've contacted the LastPass developers about the issue.

CSSOM limitations

As part of the default CSP restrictions, inline CSS is disabled unless unsafe-inline is set on the style-src directive. At this time, only Chrome actually implements this restriction.

You can still dynamically change styles via the CSSOM.

The user agent is also not prevented from applying style from Cascading Style Sheets Object Model (CSSOM).

http://www.w3.org/TR/CSP/#style-src

This is pretty much a requirement if you intend to implement something like custom tooltips on your site which need to be dynamically absolutely positioned.

There still seem to be some bugs regarding inline style serialization, though.

An example of a specific bug is cloning an element with a style attribute.

var el = document.createElement('div');
el.style.display = 'none'
el.cloneNode(true);
> Refused to apply inline style because it violates the following Content Security Policy directive: "style-src http://localhost".

Also, as noted above, libraries that do feature detection like jQuery and Modernizr are going to trigger this exception as they generate and inject custom styles to test if they work. Hopefully, these issues can be resolved in the libraries themselves.

Reporting

The CSP reporting feature is actually a pretty neat idea. If an attacker found a legit XSS escaping bug on your site, victims with CSP enabled would report the violation back to the server when they visit the page. This could act as a sort of XSS intrusion detection system.
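
Reporting is enabled by adding a report-uri directive that points at an endpoint you control. A minimal sketch, extending the earlier filter (the /csp_reports path is made up):

def set_csp
  response.headers['Content-Security-Policy'] =
    "default-src *; " \
    "script-src https://assets.example.com; " \
    "style-src https://assets.example.com; " \
    "report-uri /csp_reports"
end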

However, because of the current state of bookmarklet and extension issues, most CSP violations are false positives that flood your reporting backend. Depending on the browser, the report payload can be pretty vague. You're lucky to get a line number (without any offset) in a minified JS file when a script triggers a violation. It's usually impossible to tell whether the error is happening in your own JS or in code injected by an extension. This makes any sort of filtering impossible.

Conclusion

Even with these issues, we are still committed to rolling out CSP. Hopefully, wider CSP adoption will help smooth out these issues in the upcoming CSP 1.1 draft.

Also, special thanks to @mikewest at Google for helping us out.

Escape Velocity

We work very hard to keep GitHub fast. Ruby is not the fastest programming language, so we go to great lengths benchmarking and optimizing our large codebase: our goal is to keep bringing down response times for our website even as we add more features every day. Usually this means thinking about and implementing new features with a deep concern for performance (our motto has always been "It's not fully shipped until it's fast"), but sometimes optimizing means digging deep into the codebase to find old pieces of code that are not as performant as they could be.

The key to performance tuning is always profiling, but unfortunately the current situation when it comes to profiling under Ruby/MRI is not ideal. We've been using @tmm1's experimental rblineprof for it. This little bundle of programming joy hooks into the Ruby VM and traces the stack of your process at a high frequency. This way, as your Ruby code executes, rblineprof can gather the accumulated time spent on each line of your codebase, and dump informative listings with the data. This is incredibly useful for finding hotspots on any Ruby application and optimizing them away.

Last week, we traced a standard request to our main Rails app trying to find bottlenecks in our view rendering code, and got some interesting results:

# File actionpack/lib/action_view/helpers/url_helper.rb, line 231
                   |       def link_to(*args, &block)
     0.2ms (  897) |         if block_given?
     0.0ms (    1) |           options      = args.first || {}
     0.0ms (    1) |           html_options = args.second
     0.2ms (    3) |           concat(link_to(capture(&block), options, html_options))
                   |         else
     0.3ms (  896) |           name         = args.first
     1.1ms (  896) |           options      = args.second || {}
     0.9ms (  896) |           html_options = args.third
                   | 
->  18.9ms (  896) |           url = url_for(options)
                   | 
                   |           if html_options
     9.6ms (  887) |             html_options = html_options.stringify_keys
                   |             href = html_options['href']
     0.9ms (  887) |             convert_options_to_javascript!(html_options, url)
->  66.4ms (  887) |             tag_options = tag_options(html_options)
                   |           else
                   |             tag_options = nil
                   |           end
                   | 
                   |           href_attr = "href=\"#{url}\"" unless href
     6.8ms ( 1801) |           "<a #{href_attr}#{tag_options}>#{ERB::Util.h(name || url)}</a>".html_safe
                   |         end
                   |         def tag_options(options, escape = true)
     0.2ms (  977) |           unless options.blank?
                   |             attrs = []
                   |             if escape
    59.1ms (  974) |               options.each_pair do |key, value|
     3.9ms ( 3612) |                 if BOOLEAN_ATTRIBUTES.include?(key)
     0.0ms (    1) |                   attrs << %(#{key}="#{key}") if value
                   |                 else
->  44.4ms (10971) |                   attrs << %(#{key}="#{escape_once(value)}") if !value.nil?
                   |                 end
                   |               end
                   |             else
     0.0ms (    9) |               attrs = options.map { |key, value| %(#{key}="#{value}") }
                   |             end
     6.5ms ( 3908) |             " #{attrs.sort * ' '}".html_safe unless attrs.empty?
                   |           end
                   |         end

Surprisingly enough, the biggest hotspots in the view rendering code all had the same origin: the escape_once helper that performs HTML escaping for insertion into the view. Digging into the source code for that method, we saw that it was indeed not optimal:

def escape_once(html)
  ActiveSupport::Multibyte.clean(html.to_s).gsub(/[\"><]|&(?!([a-zA-Z]+|(#\d+));)/) { |special| ERB::Util::HTML_ESCAPE[special] }
end

escape_once performs a Regex replacement (with a rather complex regex), with table lookups in Rubyland for each replaced character. This means very expensive computation times for the regex matching, and a lot of temporary Ruby objects allocated which will have to be freed by the garbage collector later on.

Introducing Houdini, the Escapist

Houdini is a set of C APIs for performing escaping for the web. This includes HTML, hrefs, JavaScript, URIs/URLs, and XML. It also performs unescaping, but we don't talk about that because it spoils the joke on the project name. It has been designed with a focus on security (both ensuring the proper and safe escaping of all input strings, and avoiding buffer overflows or segmentation faults), but it is also highly performant.

Houdini uses different approaches for escaping and unescaping different data types: for instance, when unescaping HTML, it uses a perfect hash (generated at compile time) to match every HTML entity with the character it represents. When escaping HTML, it uses a lookup table to output escaped entities without branching, and so on.

We wrote Houdini as a C library so we could reuse it from the many programming languages we use internally at GitHub. The first implementation using them is @brianmario's EscapeUtils gem, whose custom internal escaping functions were discarded and replaced with Houdini's API, while keeping the well-known and stable external API.

We had been using EscapeUtils in some places of our codebase already, so it was an obvious choice to simply replace the default escape_once with a call to EscapeUtils.escape_html and see if we could reap any performance benefits.
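
The swap itself is tiny. A sketch of what such an override might look like in a Rails 2.3/3.x app follows; this is illustrative rather than our exact patch, and note that escape_html, unlike escape_once, will also re-escape entities that are already escaped, which may or may not matter for your views.

require 'escape_utils'

module ActionView
  module Helpers
    module TagHelper
      # Replace the regex-based helper with Houdini-backed escaping.
      # EscapeUtils.escape_html returns the original string untouched
      # when there is nothing to escape.
      def escape_once(html)
        EscapeUtils.escape_html(html.to_s)
      end
    end
  end
end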

When it comes to real-world performance in Ruby programs, EscapeUtils' biggest advantage (besides the clearly performant C implementation behind it) is that Houdini is able to lazily escape strings. This means it will only allocate memory for the resulting string if the input contains escapable characters. Otherwise, it will flag the string as clean and return the original version, with no extra allocations and no objects to clean up by the GC. This is a massive performance win on an average Rails view render, which escapes thousands of small strings, most of which don't need to be escaped at all.

The result

Once the escaping method was replaced with a call to EscapeUtils, we ran our helpful ./script/bench. This benchmarking script allows us to compare different branches of our main app and different Ruby versions or VM tweaks to see if the optimizations we are performing have any effect. It runs a specified number of GETs on any route of the Rails app, and measures the average time per request and the amount of Ruby objects allocated.

$ BENCH_RUBIES="tcs" BENCH_BRANCHES="master faster-erb" ./script/bench --user tmm1 --url /github/rails -n 250
-- Switching to origin/master

250 requests to https://github.com/github/rails as tmm1
       cpu time:  57,972ms total                 (231ms avg/req,  223ms - 262ms)
    allocations:  73,387,940 objs total          (293,551 objs avg/req,  293,550 objs - 293,558 objs)

-- Switching to origin/faster-erb

250 requests to https://github.com/github/rails as tmm1
       cpu time:  46,525ms total                 (186ms avg/req,  181ms - 221ms)
    allocations:  68,251,672 objs total          (273,006 objs avg/req,  273,005 objs - 273,013 objs)

Not bad at all! Just by replacing the escaping function with a more optimized one, we've reduced the average request time by 45ms, and we're allocating 20,000 fewer Ruby objects per request. That was a lot of escaped HTML right there!

rblineprof is still experimental, but if you're working with Ruby, make sure to check it out: @tmm1 has just added support for Ruby 2.0.

And for those of you not running Ruby, we are also open-sourcing the escaping implementation we're now using in GitHub.com as a C library, so you can wrap it and use it from your language of choice. You can find it at vmg/houdini.

Yummy cookies across domains

Last Friday we announced and performed a migration of all GitHub Pages to their own github.io domain. This was a long-planned migration, with the specific goal of mitigating phishing attacks and cross-domain cookie vulnerabilities arising from hosting custom user content in a subdomain of our main website.

There's been, however, some confusion regarding the implications and impact of these cross-domain cookie attacks. We hope this technical blog post will help clear things up.

Cookie tossing from a subdomain

When you log in on GitHub.com, we set a session cookie through the HTTP headers of the response. This cookie contains the session data that uniquely identifies you:

Set-Cookie: _session=THIS_IS_A_SESSION_TOKEN; path=/; expires=Sun, 01-Jan-2023 00:00:00 GMT; secure; HttpOnly

The session cookies that GitHub sends to web browsers are set on the default domain (github.com), which means they are not accessible from any subdomain at *.github.com. We also specify the HttpOnly attribute, which means they cannot be read through the document.cookie JavaScript API. Lastly, we specify the Secure attribute, which means that they will only be transferred through HTTPS.
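
For reference, this is roughly how a Rack application might configure a session cookie with those attributes. It's a generic sketch, not GitHub's configuration; note that no Domain attribute is set, so the cookie stays on the default domain.

# config.ru
require 'rack'

use Rack::Session::Cookie,
  :key      => '_session',
  :path     => '/',
  :secure   => true,    # only sent over HTTPS
  :httponly => true,    # not readable from document.cookie
  :secret   => 'change_me'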

Hence, it's never been possible to read or "steal" session cookies from a GitHub Pages hosted site. Session cookies are simply not accessible from the user code running in GitHub Pages, but because of the way web browsers send cookies in HTTP requests, it was possible to "throw" cookies from a GitHub Pages site to the GitHub parent domain.

When the web browser performs an HTTP request, it sends the matching cookies for the URL in a single Cookie: header, as key-value pairs. Only the cookies that match the request URL will be sent. For example, when performing a request to github.com, a cookie set for the domain github.io will not be sent, but a cookie set for .github.com will.

GET / HTTP/1.1
Host: github.com
Cookie: logged_in=yes; _session=THIS_IS_A_SESSION_TOKEN;

Cookie tossing issues arise from the fact that the Cookie header only contains the name and value for each of the cookies, and none of the extra information with which the cookies were set, such as the Path or Domain.

The most straightforward cookie-tossing attack would have involved using the document.cookie JavaScript API to set a _session cookie on a GitHub Pages hosted website. Given that the website was hosted under *.github.com, this cookie would have been sent to all requests to the parent domain, despite the fact it was set in a subdomain.

/* set a cookie in the .github.com subdomain */
document.cookie = "_session=EVIL_SESSION_TOKEN; Path=/; Domain=.github.com"

GET / HTTP/1.1
Host: github.com
Cookie: logged_in=yes; _session=EVIL_SESSION_TOKEN; _session=THIS_IS_A_SESSION_TOKEN;

In this example, the cookie set through JavaScript in the subdomain is sent next to the legitimate cookie set in the parent domain, and there is no way to tell which one is coming from where given that the Domain, Path, Secure and HttpOnly attributes are not sent to the server.

This is a big issue for most web servers, because the ordering of the cookies set in a domain and in its subdomains is not specified by RFC 6265, and web browsers can choose to send them in any order they please.

In the case of Rack, the web server interface that powers Rails and Sinatra, amongst others, cookie parsing happens as follows:

def cookies
  hash = {}
  cookies = Utils.parse_query(cookie_header, ';,')
  cookies.each { |k,v| hash[k] = Array === v ? v.first : v }
  hash
end

If there is more than one cookie with the same name in the Cookie: header, the first one will be arbitrarily assumed to be the value of the cookie.

This is a very well-known attack: A couple weeks ago, security researcher Egor Homakov blogged about a proof-of-concept attack just like this one. The impact of the vulnerability was not critical (CSRF tokens get reset after each log-in, so they cannot be permanently fixated), but it's a very practical example that people could easily reproduce to log out users and be generally annoying. This forced us to rush our migration of GitHub Pages to their own domain, but left us with a few weeks' gap (until the migration was complete), during which we had to mitigate the disclosed attack vector.

Fortunately, the style of the disclosed attack was simple enough to mitigate on the server side. We anticipated, however, several other attacks that were either trickier to stop, or simply impossible. Let's take a look at them.

Protecting from simple cookie tossing

The first step was mitigating the attack vector of simple cookie tossing. Again, this attack exploits the fact that web browsers will send two cookie tokens with the same name without letting us know the domain in which they were actually set.

We cannot see where each cookie is coming from, but if we skip the cookie parsing of Rack, we can see whether any given request has two duplicate _session cookies. The only possible cause for this is that somebody is attempting to throw cookies from a subdomain, so instead of trying to guess which cookie is legitimate and which cookie is being tossed, we simply instruct the web browser to drop the cookie set in the subdomain before proceeding.

To accomplish this, we craft a very specific response: we instruct the web browser to redirect to the same URL that was just requested, but with a Set-Cookie header that drops the subdomain cookie.

GET /libgit2/libgit2 HTTP/1.1
Host: github.com
Cookie: logged_in=yes; _session=EVIL_SESSION_TOKEN; _session=THIS_IS_A_SESSION_TOKEN;

HTTP/1.1 302 Found
Location: /libgit2/libgit2
Content-Type: text/html
Set-Cookie: _session=; Expires=Thu, 01-Jan-1970 00:00:01 GMT; Path=/; Domain=.github.com;

We decided to implement this as a Rack middleware. This way the cookie check and consequent redirect could be performed before the application code gets to run.

When the Rack middleware triggers, the redirect will happen transparently without the user noticing, and the second request will contain only one _session cookie: the legitimate one.
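
A minimal sketch of what such a middleware could look like (illustrative only, not our production code; it handles just the simple duplicate-cookie case described here):

class DropTossedSessionCookie
  def initialize(app)
    @app = app
  end

  def call(env)
    # Count raw _session cookies ourselves instead of relying on Rack's
    # parsed (and deduplicated) cookie hash.
    raw = env['HTTP_COOKIE'].to_s
    session_cookies = raw.split(/[;,] */).select { |pair| pair.split('=', 2).first == '_session' }

    if session_cookies.length > 1
      # Same URL, plus a Set-Cookie that drops the subdomain cookie.
      [302, {
        'Location'     => env['PATH_INFO'],
        'Content-Type' => 'text/html',
        'Set-Cookie'   => '_session=; Expires=Thu, 01-Jan-1970 00:00:01 GMT; Path=/; Domain=.github.com;'
      }, []]
    else
      @app.call(env)
    end
  end
end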

This "hack" is enough to mitigate the straightforward cookie tossing attack that most people would attempt, but there are more complex attacks that we also need to consider.

Cookie paths workaround

If the malicious cookie is set for a specific path which is not the root (e.g. /notifications), the web browser will send that cookie when the user visits github.com/notifications, and when we try to clear it on the root path, our header will have no effect.

document.cookie = "_session=EVIL_SESSION_TOKEN; Path=/notifications; Domain=.github.com"
GET /notifications HTTP/1.1
Host: github.com
Cookie: logged_in=yes; _session=EVIL_SESSION_TOKEN; _session=THIS_IS_A_SESSION_TOKEN;
HTTP/1.1 302 Found
Location: /notifications
Content-Type: text/html
# This header has no effect; the _session cookie was set
# with `Path=/notifications` and won't be cleared by this,
# causing an infinite redirect loop
Set-Cookie: _session=; Expires=Thu, 01-Jan-1970 00:00:01 GMT; Path=/; Domain=.github.com;

The solution is pretty straightforward, albeit rather inelegant: for any given request URL, the web browser will only send a malicious JavaScript cookie if its Path is a prefix of the request URL's path. Hence, we only need to attempt to drop the cookie once for each component of the path:

HTTP/1.1 302 Found
Location: /libgit2/libgit2/pull/1457
Content-Type: text/html
Set-Cookie: _session=; Expires=Thu, 01-Jan-1970 00:00:01 GMT; Path=/; Domain=.github.com;
Set-Cookie: _session=; Expires=Thu, 01-Jan-1970 00:00:01 GMT; Path=/libgit2; Domain=.github.com;
Set-Cookie: _session=; Expires=Thu, 01-Jan-1970 00:00:01 GMT; Path=/libgit2/libgit2; Domain=.github.com;
Set-Cookie: _session=; Expires=Thu, 01-Jan-1970 00:00:01 GMT; Path=/libgit2/libgit2/pull; Domain=.github.com;
Set-Cookie: _session=; Expires=Thu, 01-Jan-1970 00:00:01 GMT; Path=/libgit2/libgit2/pull/1457; Domain=.github.com;
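
Generating that list of Set-Cookie headers is just a matter of enumerating every path prefix of the request URL. A rough sketch:

# Build one expiring Set-Cookie value per path prefix of the request.
def cookie_dropping_headers(request_path, name = '_session')
  segments = request_path.split('/').reject(&:empty?)
  prefixes = ['/'] + (1..segments.length).map { |i| '/' + segments.first(i).join('/') }

  prefixes.map do |path|
    "#{name}=; Expires=Thu, 01-Jan-1970 00:00:01 GMT; Path=#{path}; Domain=.github.com;"
  end
end

cookie_dropping_headers('/libgit2/libgit2/pull/1457')
#=> one header each for /, /libgit2, /libgit2/libgit2, /libgit2/libgit2/pull, /libgit2/libgit2/pull/1457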

Again, we're blind on the server side when it comes to cookies. Our only option is this brute-force approach to clearing the cookies, which, despite its roughness, worked surprisingly well while we completed the github.io migration.

Cookie escaping

Let's step up our game: Another attack can be performed by exploiting the fact that RFC 6265 doesn't specify an escaping behavior for cookies. Most web servers/interfaces, including Rack, assume that cookie names can be URL-encoded (which is a rather sane assumption to make, if they contain non-ASCII characters), and hence will unescape them when generating the cookie list:

cookies = Utils.parse_query(string, ';,') { |s| Rack::Utils.unescape(s) rescue s }

This allows a malicious user to set a cookie that the web framework will interpret as _session despite the fact that its name in the web browser is not _session. The attacker simply has to escape characters that don't necessarily need to be escaped:

GET / HTTP/1.1
Host: github.com
Cookie: logged_in=yes; _session=chocolate-cookie; _%73ession=bad-cookie;

{
  "_session" : ["chocolate-cookie", "bad-cookie"]
}

If we try to drop the second cookie from the list of cookies that Rack generated, our header will have no effect. We've lost crucial information after Rack's parsing: the fact that the name of the cookie was URL-encoded to a different value than the one our web framework received.

# This header has no effect: the cookie in
# the browser is actually named `_%73ession`
Set-Cookie: _session=; Expires=Thu, 01-Jan-1970 00:00:01 GMT; Path=/; Domain=.github.com;

To work around this, we had to skip Rack's cookie parsing by disabling the unescaping and finding all the cookie names that would match our target after unescaping.

# Skip Rack's unescaping: parse the raw Cookie header, keeping names as-is
cookie_pairs = Rack::Utils.parse_query(cookies, ';,') { |s| s }

bad_cookies = []
cookie_pairs.each do |k, v|
  if k == '_session' && Array === v
    # duplicate _session cookies: one of them was tossed from a subdomain
    bad_cookies << k
  elsif k != '_session' && Rack::Utils.unescape(k) == '_session'
    # an escaped name (e.g. _%73ession) that unescapes to _session
    bad_cookies << k
  end
end

This way we can actually drop the right cookie (whether it was set as _session or as an escaped variation). With this kind of middleware in place, we were able to tackle all the cookie tossing attacks that can be handled on the server side. Unfortunately, we were aware of another vector which renders middleware protection useless.

Cookie overflow

If you're having cookie problems I feel bad for you, son.
I've got 99 cookies and my domain ain't one.

This is a slightly more advanced attack that exploits the hard limit that all web browsers have on the number of cookies that can be set per domain.

Firefox, for example, sets this hard limit to 150 cookies, while Chrome sets it to 180. The problem is that this limit is not defined per cookie Domain attribute, but by the actual domain where the cookie was set. A single HTTP request to any page on the main domain and subdomains will send a maximum number of cookies, and the rules for which ones are picked are, once again, undefined.

Chrome for instance doesn't care about the cookies of the parent domain, the ones set through HTTP or the ones set as Secure: it'll send the 180 newest ones. This makes it trivially easy to "knock out" every single cookie from the parent domain and replace them with fake cookies, all by running JavaScript on a subdomain:

for (i = 0; i < 180; i++) {
    document.cookie = "cookie" + i + "=chocolate-chips; Path=/; Domain=.github.com"
}

After setting these 180 cookies in the subdomain, all the cookies from the parent domain vanish. If we now expire the cookies we just set, also from JavaScript, the cookie list for both the subdomain and the parent domain becomes empty:

for (i = 0; i < 180; i++) {
    document.cookie = "cookie" + i + "=chocolate-chips; Path=/; Domain=.github.com; Expires=Thu, 01-Jan-1970 00:00:01 GMT;"
}

/* all cookies are gone now; plant the evil one */
document.cookie = "_session=EVIL_SESSION_TOKEN; Path=/; Domain=.github.com"

This allows us to perform a single request with just one _session cookie: the one we've crafted in JavaScript. The original Secure and HttpOnly _session cookie is now gone, and there is no way for the web server to detect that the cookie being sent is neither Secure nor HttpOnly, was not set in the parent domain, and is in fact fully fabricated.

With only one _session cookie sent to the server, there is no way to know whether the cookie is being tossed at all. Even if we could detect an invalid cookie, the same attack can be used to simply annoy users by logging them out of GitHub.

Conclusion

As we've seen, by overflowing the cookie jar in the web browser, we can craft requests with evil cookies that cannot be blocked server-side. There's nothing particularly new here: Both Egor's original proof of concept and the variations exposed here have been known for a while.

As it stands right now, hosting custom user content under a subdomain is simply security suicide, a risk particularly accentuated by Chrome's current implementation choices. While Firefox handles the distinction between parent-domain and subdomain cookies more gracefully (sending them in a more consistent order, and separating their storage to prevent overflows from a subdomain), Chrome makes no such distinction and treats session cookies set through JavaScript the same way as Secure, HttpOnly cookies set from the server, leading to a very enticing playground for tossing attacks.

Regardless, the behavior of cookie transmission through HTTP headers is so ill-defined and implementation-dependent that it's just a matter of time until somebody comes up with yet another way of tossing cookies across domains, independent of the targeted web browser.

While cookie tossing attacks are not necessarily critical (i.e. it is not possible to hijack user sessions, or to accomplish much of anything besides phishing), they are worryingly straightforward to perform, and can be quite annoying.

We hope that this article will help raise awareness of the issue, and of how difficult it is to protect against these attacks by means that don't involve a full domain migration: a drastic, but ultimately necessary, measure.

New GitHub Pages domain: github.io

Beginning today, all GitHub Pages sites are moving to a new, dedicated domain: github.io. This is a security measure aimed at removing potential vectors for cross-domain attacks targeting the main github.com session, as well as vectors for phishing attacks that rely on the presence of the "github.com" domain to build a false sense of trust in malicious websites.

If you've configured a custom domain for your Pages site ("yoursite.com" instead of "yoursite.github.com") then you are not affected by this change and may stop reading now.

If your Pages site was previously served from a username.github.com domain, all traffic will be redirected to the new username.github.io location indefinitely, so you won't have to change any links. For example, newmerator.github.com now redirects to newmerator.github.io.

From this point on, any website hosted under the github.com domain may be assumed to be an official GitHub product or service.

Please contact support if you experience any issues due to these changes. We've taken measures to prevent any serious breakage but this is a major change and could have unexpected consequences. Do not hesitate to contact support for assistance.

Technical details

Changes to Pages sites and custom domains:

  • All User, Organization, and Project Pages not configured with a custom domain are now hosted on github.io instead of github.com. For instance, username.github.com is now served canonically from username.github.io.

  • An HTTP 301 Moved Permanently redirect has been added for all *.github.com sites, pointing to their new *.github.io locations (see the example after this list).

  • Pages sites configured with a custom domain are not affected.

  • The Pages IP address has not changed. Existing A records pointing to the Pages IP are not affected.
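If you're curious whether a given site has switched over, you can inspect the redirect directly. A quick check might look like this (the username is a placeholder, and the response headers are trimmed):

$ curl -I http://username.github.com
HTTP/1.1 301 Moved Permanently
Location: http://username.github.io/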

Changes to GitHub repositories:

  • User Pages repositories may now be named using the new username/username.github.io convention or the older username/username.github.com convention.

  • Existing User Pages repositories named like username/username.github.com do not need to be renamed and will continue to be published indefinitely.

  • If both a username.github.io and a username.github.com repository exist, the username.github.io version wins.

Security vulnerability

There are two broad categories of potential security vulnerabilities that led to this change.

  1. Session fixation and CSRF vulnerabilities resulting from a browser security issue sometimes referred to as "Related Domain Cookies". Because Pages sites may include custom JavaScript and were hosted on github.com subdomains, it was possible to write (but not read) github.com domain cookies in a way that could allow an attacker to deny access to github.com and/or fixate a user's CSRF token.

  2. Phishing attacks relying on the presence of the "github.com" domain to create a false sense of trust in malicious websites. For instance, an attacker could set up a Pages site at "account-security.github.com" and ask that users input password, billing, or other sensitive information.

We have no evidence of an account being compromised due to either type of vulnerability and have mitigated all known attack vectors.

Upcoming IP address changes

As we expand our infrastructure, we are making changes to the IP addresses currently in use.

If you are explicitly whitelisting GitHub in your firewall rules, please make sure that you include all of the IP ranges as described in this article.

Additional IP for GitHub.com

Soon, we will begin using additional IP addresses (A records) for GitHub.com in order to add more load balancers.

Starting Thursday, April 4, 2013, GitHub.com will also resolve to 204.232.175.90, in addition to the current IP (207.97.227.239). Both addresses resolve to GitHub.com via reverse DNS.
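If you want to confirm what your own resolver returns, a quick check with dig might look like this (actual output will vary with DNS rotation and caching):

$ dig +short github.com
204.232.175.90
207.97.227.239

$ dig +short -x 204.232.175.90
github.com.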

If you are using SSH, you will see this message the first time your client connects to the new IP address:

Warning: Permanently added the RSA host key for IP address '204.232.175.90' to the list of known hosts.

This warning is the expected behavior of SSH clients.
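If you'd like to verify that everything still works after accepting the new host key, one simple option is to test SSH authentication directly (this assumes your SSH key is already set up; "username" is a placeholder):

$ ssh -T git@github.com
Hi username! You've successfully authenticated, but GitHub does not provide shell access.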

Please note that in the near future, we will be adding an additional IP subnet: 192.30.252.0/22 (or, as a range, 192.30.252.0 - 192.30.255.255).

GitHub for Windows Recent Improvements

It’s been almost a year since we first released GitHub for Windows. Today we just shipped version 1.0.38. That’s 38 updates since 1.0!

As we’ve said before, we ship early and often.

Since we’ve been mostly quiet about the work we’ve done with GitHub for Windows, I thought I’d summarize some of the improvements we’ve made recently.

Visual Studio Git integration

You may have heard that Microsoft announced an early preview of their Git integration for Visual Studio. The easiest way to get a Git repository into Visual Studio is to clone it via GitHub for Windows. When you navigate to the Team Explorer, it’ll be listed there.

[Screenshot: Visual Studio Team Explorer integration]

Repository Private/Public glyphs

Repositories now have an indicator that shows whether they are private, public, or non-GitHub repositories.

[Screenshot: private/public repository indicators]

Drag and Drop URLs to Clone Repositories

If you have a URL to a Git repository, you can now simply drag and drop it onto GitHub for Windows’ dashboard to clone that repository. This is handy if you have repositories not on GitHub that you need to work with.

[Screenshot: dragging a repository URL onto the dashboard to clone it]

Drag a folder to create a repository

Say you have some code that’s not yet in a Git repository. Simply drag it onto the dashboard and we’ll create a new Git repository with the code.

[Screenshot: dragging a folder onto the dashboard to create a repository]

If you haven’t installed GitHub for Windows, download it from windows.github.com. It’s the easiest way to get Git on a Windows machine.

TCMalloc and MySQL

Over the last month or so, we noticed steady week-over-week rises in overall MySQL query time. A poorly-performing database makes for a slow site and a slow site makes for a terrible experience, so we dug in to see what was going on.

The investigation is probably more interesting than the actual solution. We started with the most basic things: taking a look at long-running queries, database size, and indexes. None of that was especially fruitful.

We turned to some extremely helpful open source tools from Percona to figure out what was going on. I set up pt-stalk to trigger alerts and dumps of diagnostic data whenever MySQL's running thread count got inordinately high. We sent stats about frequency of pt-stalk triggers to our Graphite instance, and I set up an alert in our Campfire that audibly pinged me on every trigger. The constant sound of the "submarine" alert in my Campfire client was driving me to Poe-esque insanity, so I was doubly motivated to fix this.
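For reference, a pt-stalk invocation along these lines watches Threads_running and collects diagnostics when it spikes — the threshold and destination directory here are illustrative, not our production values:

$ pt-stalk --daemonize \
    --variable Threads_running --threshold 50 \
    --cycles 5 --dest /var/lib/pt-stalk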

With more investigation, we were able to rule out a number of other potential causes, including physical problems on the machine, network issues, and IO issues. Naturally, the submarine noises kept going off, and at one point I may or may not have heard Sean Connery say "Give me a ping, Vasily. One ping only, please."

Luckily, I noticed pt-stalk data dumps and SHOW ENGINE INNODB STATUS revealing hundreds of these bad things:

--Thread 140164252591872 has waited at trx/trx0trx.c line 807 for 1.0000 seconds the semaphore:
Mutex at 0x24aa298 '&kernel_mutex', lock var 0

Issues in MySQL 5.1 and 5.5 (we run 5.1 for now) with kernel_mutex around transactions have been known for some time. They have also been known to my friend and increasingly well-regarded yoga expert @jamesgolick, who last year wrote about the exact issue we were encountering.

So we took a page from Golick's work and from our own thinking on the issue, and went forward with the switch from stock malloc to TCMalloc.

Different allocators are good at different things, and TCMalloc (available as part of gperftools) is particularly good at reducing lock contention in programs with a lot of threads. So @tmm1 dropped in a straightforward export LD_PRELOAD="/usr/lib/libtcmalloc_minimal.so", and with a database restart we were off to the races.
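If you want to try the same thing, here's a minimal sketch — the library path and service layout will differ by distribution — that preloads TCMalloc into whatever environment launches mysqld and then verifies the allocator is actually mapped into the process:

# in the environment that starts mysqld (e.g. its init script):
export LD_PRELOAD="/usr/lib/libtcmalloc_minimal.so"

# restart MySQL so mysqld picks up the preloaded allocator,
# then confirm tcmalloc shows up in the process's memory maps
$ sudo /etc/init.d/mysql restart
$ grep tcmalloc /proc/$(pidof mysqld)/maps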

Good things happened. After the switchover, those pt-stalk triggers declined to about 1/7th of their original frequency, and the previously-seen contention dropped significantly. I had expected at best a stop in rising query times with this change, but we ended up with something much better — a rather shocking 30 percent improvement in query performance across the board and a general decline in spikes.

I allowed myself a small celebration by reading The Raven in silence and enjoying a non-trivial amount of Nutella.

Introducing Boxen

Today we’re proud to open source Boxen, our tool for automating and managing Macs at GitHub.

Boxen started nearly a year ago as a project called “The Setup” — a pipe dream to let anyone at GitHub run GitHub.com on their development machine with a single command. Now every new GitHubber’s first day starts with unboxing a new laptop and hacking on GitHub in about 30 minutes.

What is it?

Boxen is a framework for managing almost every aspect of your Mac. We built a massive standard library of Puppet modules optimized for Boxen to manage everything from running MySQL to installing Minecraft.

We designed Boxen with teams that work like GitHub in mind. Boxen automatically updates itself every run and opens and closes GitHub Issues as problems arise and get fixed. With Boxen, we treat our development environments with the same care we give production: we test our code and rely on Continuous Integration to deploy changes.

Your Boxen

boxen/our-boxen

Start here to build a boxen for your team. This repo is our recommended template for a basic web development environment. The README shows you how to get started, and the basic configuration shows off some of Boxen's standard modules for managing system services and your team’s projects.

boxen/boxen-web

Once you’ve built a boxen for your team, boxen-web is the easiest way to roll it out. It’s a simple Rails app that allows you to onboard folks with a single Terminal command. It's easy to run on Heroku and uses OAuth to limit access to only members of your GitHub organization.

boxen/puppet-template

Now that your team uses Boxen, you probably want to add support for new services and tools. Starting a new Puppet module can be a little complex, so we’ve wrapped everything you need to write one up into this example module. It follows our own best practices about Puppet module development with tools like cardboard, puppet-lint, and rspec-puppet.

Using Boxen

Once your team writes project manifests for your applications, any team member can run them locally with ease. At any time, you can run a single command to get a project and all its dependencies ready to go. If any GitHubber wants to hack on GitHub for Mac, all they need to do is run:

$ boxen mac

<3

All sorts of GitHubbers use and improve Boxen: shippers, HR, lawyers, designers, and developers. More than 50 people have contributed internal fixes and features, and updates ship almost every day.

We love Boxen. We hope you and your team love it too. Happy automating!

Recent Code Search Outages

Last week, between Thursday, January 24 and Friday, January 25, we experienced a critical outage of our newly-launched Code Search service. As always, we strive to provide detailed, transparent post-mortems about these incidents. We’ll do our best to explain what happened and how we’ve mitigated the problems to prevent the cause of this outage from occurring again.

But first, I’d like to apologize on behalf of GitHub for this outage. While it did not affect the availability of any component but Code Search, the severity and the length of the outage are both completely unacceptable to us. I'm very sorry this happened, especially so soon after the launch of a feature we’ve been working on for a very long time.

Background

Our previous search implementation used a technology called Solr. With the launch of our new and improved search, we had finally finished migrating all search results served by GitHub to multiple new search clusters built on elasticsearch.

Since the code search index is quite large, we have a cluster dedicated to it. The cluster currently consists of 26 storage nodes and 8 client nodes. The storage nodes are responsible for holding the data that comprises the search index, while the client nodes are responsible for coordinating query activity. Each of the storage nodes has 2TB of SSD based storage.

At the time of the outage, we were storing roughly 17TB of code in this cluster. The data is sharded across the cluster, and each shard has a single replica on another node for redundancy, bringing the total to around 34TB of space in use. This put the total storage utilization of the cluster at around 67%. This Code Search cluster operated on Java 6 and elasticsearch 0.19.9, and had been running without problems for several months while we backfilled all the code into the index.

On Thursday, January 17 we were preparing to launch our Code Search service to complete the rollout of our new, unified search implementation. Prior to doing so, we noted that elasticsearch had since released version 0.20.2, which contained a number of fixes and some performance improvements.

We decided that delaying the Code Search launch to upgrade our elasticsearch cluster from version 0.19.9 to 0.20.2 before launching it publicly would help ensure a smooth launch.

We were able to complete this upgrade successfully on Thursday, January 17. All nodes in the cluster were successfully online and recovering the cluster state.

What went wrong?

Since this upgrade, we have experienced two outages in the Code Search cluster.

Unlike some other search services that use massive, single indexes to store data, elasticsearch uses a sharding pattern to divide data up so it can be easily distributed around the cluster in manageable chunks. Each of these shards is itself a Lucene index, and elasticsearch aggregates search queries across these shards using Lucene merge indexes.

The first outage occurred roughly 2 hours after the upgrade, during the recovery process that takes place as part of a cluster restart. We found error messages in the index logs indicating that some shards could not be assigned or allocated to certain nodes. Upon further inspection, we discovered that while some of these data shards had corrupted segment cache files, others were missing on disk entirely. elasticsearch was able to recover any shards with corrupted segment cache files and shards where only one of the replicas was missing, but 7 shards (out of 510) were missing both the primary copy and the replica.

We reviewed the circumstances of the outage and determined at the time that the problems we saw stemmed from the high load during the cluster recovery. Our research into this problem did not turn up other elasticsearch users encountering these sorts of problems. The cluster was happy and healthy over the weekend, so we decided to send it out to the world.

The second outage began on Thursday, January 24. We first noticed problems as our exception tracking and monitoring systems detected a large spike in exceptions. Further review indicated that the majority of these exceptions were coming from timeouts in code search queries and from the background jobs that update our code search indexes with data from new pushes.

At this time, we began to examine both the overall state of all members of the cluster and elasticsearch's logs. We were able to identify massive levels of load on a seemingly random subset of storage nodes. While most nodes were using single digit percentages of CPU, several were consuming nearly 100% of all of the available CPU cores. We were able to eliminate system-induced load and IO-induced load as culprits: the only thing contributing to the massive load on these servers was the java process elasticsearch was running in. With the search and index timeouts still occurring, we also noticed in the logs that a number of nodes were being rapidly elected to and later removed from the master role in the cluster. In order to mitigate potential problems resulting from this rapid exchanging of master role around the cluster, we determined that the best course of action was to full-stop the cluster and bring it back up in "maintenance mode", which disables allocation and rebalancing of shards.
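For the curious, "maintenance mode" here essentially means disabling shard allocation before the restart. On an elasticsearch of that era, one way to do this is through the cluster settings API — the setting name below is from the 0.20.x documentation and the host is a placeholder:

$ curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
    "transient": { "cluster.routing.allocation.disable_allocation": true }
  }'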

We were able to bring the cluster back online this way, but we noted a number of problems in the elasticsearch logs.

Recovery

After the cluster restart, we noticed that some nodes were completely unable to rejoin the cluster, and some data shards were trying to double-allocate to the same node. At this point, we reached out to Shay and Drew from elasticsearch, the company that develops and supports elasticsearch.

We were able to confirm with Shay and Drew that these un-allocatable shards (23 primaries plus replicas) had all suffered data loss. In addition to the data loss, the cluster spent a great deal of time trying to recover the remaining shards. During the course of this recovery, we had to restart the cluster several times as we rolled out further upgrades and configuration changes, which resulted in having to verify and recover shards again. This ended up being the most time consuming part of the outage as loading 17TB of indexed data off of disk multiple times is a slow process.

With Shay and Drew, we were able to discover some areas where our cluster was either misconfigured or where the configuration required further tuning for optimal performance. They were also able to identify two bugs in elasticsearch itself (see these two commits for further details on those bugs) based on the problems we encountered, and within a few hours released a new version with fixes included. Lastly, we were running a version of Java 6 that was released in early 2009. That version contains multiple critical bugs affecting both elasticsearch and Lucene, as well as problems with large memory allocations that can lead to high load.

Based on their suggestions, we immediately rolled out upgrades for Java and elasticsearch, and updated our configuration with their recommendations. This was done by creating a topic branch and environment on our Puppetmaster for these specific changes, and running Puppet on each of these nodes in that environment.
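Concretely, that amounted to running the agent against the topic environment on each affected node, something along these lines (the environment name is made up for illustration):

$ sudo puppet agent --test --environment=codesearch_es_upgrade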

While these audits increased the length of the outage by a few hours, we believe the time was well spent gathering feedback from experts in large elasticsearch deployments.

With the updated configuration, new elasticsearch version with the fixes for the bugs we encountered, and the performance improvements in Java 7, we have not been able to reproduce any of the erratic load or rapid master election problems we witnessed in the two outages discussed so far.

Outage Monday

We suffered an additional outage to our Code Search cluster on Monday, January 28. This outage was unrelated to any of the previous incidents and was the result of human error.

An engineer was merging the feature branch containing the Java and elasticsearch upgrades back into our production environment. In the process, the engineer rolled the Puppet environment on the Code Search nodes back to the production environment before deploying the merged code. This resulted in elasticsearch being restarted on nodes as Puppet ran on them. We immediately recognized the source of the problem and stopped the cluster to prevent any problems caused by running multiple versions of Java and elasticsearch in the same cluster. Once the merged code was deployed, we ran Puppet on all the Code Search nodes again and brought the cluster back online. Rather than enabling Code Search indexing and querying while the cluster was in a degraded state, we opted to wait for full recovery. Once the cluster finished recovering, we turned Code Search back on.

Mitigating the problem

We did not sufficiently test the 0.20.2 release of elasticsearch on our infrastructure prior to rolling this upgrade out to our code search cluster, nor had we tested it on any other clusters beforehand. A contributing factor to this was the lack of a proper staging environment for the code search cluster. We are in the process of provisioning a staging environment for the code search cluster so we can better test infrastructure changes surrounding it.

The bug fixes included in elasticsearch 0.20.3 do make us confident that we won’t encounter the particular problems they caused again. We’re also running a Java version now that is actively tested by the elasticsearch team and is known to be more stable and performant running elasticsearch. Additionally, our code search cluster configuration has been audited by the team at elasticsearch with future audits scheduled to ensure it remains optimal for our use case.

As for Monday’s outage, we are currently working on automation to make a Puppet run in a given environment impossible in cases where the branch on GitHub is ahead of the environment on the Puppetmaster.

Finally, there are some specific notes from the elasticsearch team regarding our configuration that we'd like to share in hopes of helping others who may be running large clusters:

  1. Set the ES_HEAP_SIZE environment variable so that the JVM uses the same value for minimum and maximum memory. Configuring the JVM with different minimum and maximum values means that each time the JVM needs additional memory (up to the maximum), it blocks the Java process to allocate it. Combined with the old Java version, this explains the pauses our nodes exhibited when they were opened up to public searches and faced higher load and continuous memory allocation. The elasticsearch team recommends a setting of 50% of system RAM (see the sketch after this list).
  2. Our cluster was configured with a recover_after_time set to 30 minutes. The elasticsearch team recommended a change so that recovery would begin immediately rather than after a timed period.
  3. We did not have minimum_master_nodes configured, so the cluster became unstable when nodes experienced long pauses as subsets of nodes would attempt to form their own clusters.
  4. During the initial recovery, some of our nodes ran out of disk space. It's unclear why this happened since our cluster was only operating at 67% utilization before the initial event, but it's believed this is related to the high load and old Java version. The elasticsearch team continues to investigate to understand the exact circumstances.
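To make the first three recommendations concrete, here's a rough sketch of what they can look like on a single node — the heap size, config path, and quorum value below are placeholders, not our production settings:

# pin the JVM heap (min == max) to roughly half of the system RAM
export ES_HEAP_SIZE=16g

# 0.20-era settings: start recovery immediately, and require a quorum of
# master-eligible nodes ((N / 2) + 1) before a master can be elected
$ cat >> /etc/elasticsearch/elasticsearch.yml <<'EOF'
gateway.recover_after_time: 0s
discovery.zen.minimum_master_nodes: 18
EOF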

Summary

I’m terribly sorry about the availability problems of our Code Search feature since its launch. It has not been up to our standards, and we are taking each and every one of the lessons these outages have taught us to heart. We can and will do better. Thank you for supporting us at GitHub, especially during the difficult times like this.
