
Rolling out the Redcarpet

Here at GitHub, we love Markdown. We use it everywhere: to render the wikis, issues, pull requests, and all user-generated comments. We even encourage developers to write their READMEs in this awesome markup language. In fact, we use it so much that we've learnt a few lessons on Markdown parsing the hard way.

Every day, GitHub renders thousands of Markdown documents with all kinds of user-submitted content, ranging from poorly formatted to downright malicious. Your average Markdown parser is not prepared to deal with potentially pathological inputs, and hence is vulnerable to DoS attacks. That's why we've decided to take Natacha Porté's awesome library, Upskirt, and pimp it with everything you'd expect in a Markdown library for the web - both in features and in security.

Our fork of the library also comes with a Ruby wrapper, aptly named Redcarpet. Redcarpet works as a drop-in replacement for BlueCloth and RDiscount; we've been slowly deploying it through all our frontend machines, and so far none of them has caught :fire:. We consider this a tremendous success, but since we strive for perfection, please report any rendering errors you may encounter in your Markdown documents to help us improve the library.

Finally, to celebrate the release of the new library we're enabling syntax highlighted code blocks in GitHub Flavored Markdown.

Four space indentation is now no longer required when including code, backtraces and other text in a comment, issue, Gist or any other Markdown-enabled text. Instead, simply create a fenced block with ```. An optional language identifier after the backticks will syntax highlight the code in that language.

``` ruby
require 'redcarpet'
markdown = Redcarpet.new("Hello World!")
puts markdown.to_html
```

GitHub + Rebase Support in IntelliJ

The IntelliJ platform has had GitHub integration for a few months, but today they rolled out Advanced GitHub Integration: Rebase My GitHub Fork.

What is it? Rebase support.

Now staying up to date with upstream repos is easier than ever in IntelliJ! Best of all, it's available in the free community edition of IDEA as well as the commercial edition. Integration for other products is underway.

Check their blog post for more information and a screencast, and keep watching for more awesome stuff from the JetBrains team in the future.

Download IntelliJ

Mind Control with Frickin Lasers

Several months ago I hosted a GitHub meetup in Boston, which tons of local geeks attended to drink free beer. During that meetup, I talked to Andrew Leifer, a local graduate student in biophysics at Harvard, who told me that he loved GitHub and was in fact using it to collaborate on a program that accomplished mind control. with lasers. on worms.

Well, it turns out that I had not in fact been drinking too much and the project is real. Andrew's research is called CoLBeRT: Controlling Locomotion and Behavior in Real-Time and works by running real-time analysis on video of a 1mm long specially bred light-sensitive C. elegans worm. The CoLBeRT system tracks the worm as it moves and shines laser light on specific neurons as the worm is moving to stimulate or inhibit those neurons.

The system can make the worm paralyzed, lay eggs, back up, speed up or sense touch in different areas of its body, all by directing laser light into specific neurons. That's right, I said lay eggs. Check out this kick-ass laser:

If you aimed that at me, I'd probably lay eggs too.

Andrew's research has recently been published in Nature Methods and covered in Science News and Scientific American. True to his word, the source code for laser worm mind control is open source and on GitHub, aptly named MindControl.

So, though the human brain has over 100 billion neurons instead of the worm's 302, and we're not photosensitive, this project brings us 1 evil pull request closer to total world domination. with frickin lasers.

3D GitHub Badge with Pure HTML/CSS

Nico Hagenburger released a nifty bit of HTML/CSS today that allows you to get the famous "Fork me on GitHub" banner on the corner of your site. But this time, with a twist. Literally. If you hover over the banner in Safari, the banner will flip around and show some alternate text on the other side. Fancy!

Go take a look and try it out for yourself!

Online Git Training Series

This December 14th we'll be kicking off the first in a series of all-day online Git training sessions. The series will be taught by our partner Matthew McCullough of Ambient Ideas, who has been doing excellent Git training and talking all over the world lately.

If you'd like a one-day crash course in Git for yourself or your colleagues, check out our new training page and sign up for the first course. Pants are, of course, optional, as they are in our in-person trainings.

https://github.com/training/online

Hope to see you there!

Get Good with Git

If you have little or no experience on the command line, and no experience with Git, this might be the ebook for you: Getting Good With Git. From command line basics to using GitHub for collaboration, there's a ton of great stuff in this thing.

With great content and a beautiful layout, you should definitely check it out if you want to learn Git.

Sidejack Prevention Phase 3: SSL Proxied Assets

This is the third, and hopefully final, response to session hijacking on github.com. We've been safe from session hijacking for a while now but we were still serving pages with mixed-content warnings. People have complained about these warnings in the past but it still remains an issue in most browsers. We want our users to focus on getting things done and we want them to feel secure while they use our site.

A few of our pages allowed people to embed images directly via github flavored markdown. Our users find this really useful and we wanted to avoid leaving people's browsers looking like this:

insecure

You can now link to remote images in your comments/readmes/issues without creating mixed content warnings.

issue preview

We did this by rewriting the src attribute on img tags when we render github flavored markdown. The src attribute is rewritten to proxy through our normal asset servers so it appears to come from a secure source. On the backend we wrote a simple HTTP proxy in node that runs behind our normal nginx setup. The code is available here.
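The src rewriting step could look roughly like the sketch below. It is an illustration only, not the code linked above: the `PROXY_BASE` endpoint, the `url` query parameter and the `proxy_insecure_images` helper are assumptions made up for this example, and a regular expression stands in for whatever HTML processing the real renderer uses.

```ruby
require 'cgi'

# Hypothetical proxy endpoint; the real asset host and URL scheme are not
# given in the post.
PROXY_BASE = 'https://assets.example.com/proxy?url='

# Rewrite the src of every <img> tag that points at a plain-HTTP URL so the
# browser fetches the image through the HTTPS proxy instead.
# Images already served over HTTPS are left untouched.
def proxy_insecure_images(html)
  html.gsub(%r{(<img\b[^>]*?\bsrc=")(http://[^"]+)(")}i) do
    "#{$1}#{PROXY_BASE}#{CGI.escape($2)}#{$3}"
  end
end

html = '<p><img src="http://example.org/cat.png" alt="cat"></p>'
puts proxy_insecure_images(html)
```

Because the page only ever references the proxy host, the browser sees a single secure origin and has no reason to raise a mixed content warning.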

Please open a support ticket if you find pages on the site that are still generating mixed content warnings. So far the system seems to be holding up well and we're ready to get back to hacking on features for GitHub. Thanks for your patience over the last few weeks.

Sidejack Prevention Phase 2: SSL Everywhere

Last Tuesday, we rolled out secure cookies for all SSL-protected pages. This meant that all private repositories, user dashboards, and admin settings (even for free users and repositories) were protected against sidejacking attempts. However, any user actions on gists and public repositories (such as issues, wikis, downloads) were still vulnerable.

Last night, we rolled out the next phase from our latest security audit: SSL everywhere. Every hit to the website, whether you're logged in or not, is over HTTPS with a secure cookie.

This is a big step, but we're still seeing some resources being served directly from other sites and giving SSL warnings. We're going to address this issue next. In the meantime your browsers might give warnings that look like this.

Insecure Resources

Our next step will be to fix these insecure assets that you might see in commit and issue comments. We're hoping to have the remaining issues fixed over the next few days.

Our new build status indicator

If you've been following us on the twitters, you know that we recently got our first office space in San Francisco. Our first course of action was to immediately purchase a stoplight.

But what do you do with a stoplight? Hook up an Arduino to CI Joe to show the status of our build, of course!

Luckily our friend Greg flew out and did the Arduino magic to get it all working. He even wrote a post explaining the process and open sourced the Arduino firmware.

Thanks Greg!

Introducing Resque

Resque is our Redis-backed library for creating background jobs, placing those jobs on multiple queues, and processing them later.

Background jobs can be any Ruby class or module that responds to perform. Your existing classes can easily be converted to background jobs or you can create new classes specifically to do work. Or, you can do both.
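As a rough illustration of "anything that responds to perform", here is a minimal sketch. It does not use Resque or Redis: the `QUEUES` hash and the `enqueue` and `work_one` helpers are hypothetical in-memory stand-ins invented for this example, and the `Archive` job body is made up. See the README for the library's actual API.

```ruby
# In-memory stand-in for the queue store; the real library keeps jobs in Redis.
QUEUES = Hash.new { |hash, name| hash[name] = [] }

# Any class or module that responds to `perform` can act as a job.
class Archive
  def self.perform(repo_id, format)
    "archived repo #{repo_id} as #{format}"
  end
end

# Hypothetical helper: record the job class and its arguments on a named queue.
def enqueue(queue, job_class, *args)
  unless job_class.respond_to?(:perform)
    raise ArgumentError, "#{job_class} must respond to perform"
  end
  QUEUES[queue] << [job_class, args]
end

# Hypothetical helper: take the oldest job off the queue and run it.
# Returns nil when the queue is empty.
def work_one(queue)
  job_class, args = QUEUES[queue].shift
  job_class && job_class.perform(*args)
end

enqueue(:file_serve, Archive, 42, 'tar.gz')
puts work_one(:file_serve)
```

The job itself knows nothing about the queue, which is why existing classes can be converted simply by giving them a `perform` method.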

All the details are in the README. We've used it to process over 10m jobs since our move to Rackspace and are extremely happy with it.

But why another background library?

A Brief History of Background Jobs

We've used many different background job systems at GitHub: SQS, Starling, ActiveMessaging, BackgroundJob, DelayedJob, and beanstalkd. Each change was out of necessity: we were running into a limitation of the current system and needed to either fix it or move to something designed with that limitation in mind.

With SQS, the limitation was latency. We were a young site and heard stories on Amazon forums of multiple minute lag times between push and pop. That is, once you put something on a queue you wouldn't be able to get it back for what could be a while. That scared us so we moved.

ActiveMessaging was next, but only briefly. We wanted something focused more on Ruby itself and less on libraries. That is, our jobs should be Ruby classes or objects, whatever makes sense for our app, and not subclasses of some framework's design.

BackgroundJob (bj) was a perfect compromise: you could process Ruby jobs or Rails jobs in the background. How you structured the jobs was largely up to you. It even included priority levels, which would let us make "repo create" and "fork" jobs run faster than the "warm some caches" jobs.

However, bj loaded the entire Rails environment for each job. Loading Rails is no small feat: it is CPU-expensive and takes a few seconds. So for a job that may take less than a second, you could have 8 - 20s of added overhead depending on how big your app is, how many dependencies it requires, and how bogged down your CPU is at that time.

DelayedJob (dj) fixed this problem: it is similar to bj, with a database-backed queue and priorities, but its workers are persistent. They only load Rails when started, then process jobs in a loop.

Jobs are just YAML-marshalled Ruby objects. With some magic you can turn any method call into a job to be processed later.

Perfect. DJ lacked a few features we needed but we added them and contributed the changes back.

We used DJ very successfully for a few months before running into some issues. First: backed up queues. DJ works great with small datasets, but once your site starts overloading and the queue backs up (to, say, 30,000 pending jobs) its queries become expensive. Creating jobs can take 2s+ and acquiring locks on jobs can take 2s+, as well. This means an added 2s per job created for each page load. On a page that fires off two jobs, you're at a baseline of 4s before doing anything else.

If your queue is backed up because your site is overloaded, this added overhead just makes the problem worse.

Solution: move to beanstalkd. beanstalkd is great because it's fast, supports multiple queues, supports priorities, and speaks YAML natively. A huge queue has constant time push and pop operations, unlike a database-backed queue.

beanstalkd also has experimental persistence - we need persistence.

However, we quickly missed DJ features: seeing failed jobs, seeing pending jobs (beanstalkd only allows you to 'peek' ahead at the next pending job), manipulating the queue (e.g. running through and removing all jobs that were created by a bug or with a bad job name), etc. A database-queue gives you a lot of cool features. So we moved back to DJ - the tradeoff was worth it.

Second: if a worker gets stuck, or is processing a job that will take hours, DJ has facilities to release a lock and retry that job when another worker is looking for work. But that stuck worker, even though his work has been released, is still processing a job that you most likely want to abort or fail.

You want that worker to fail or restart. We added code so that, instead of simply retrying a job that failed due to timeout, other workers will a) fail that job permanently then b) restart the locked worker.

In a sense, all the workers were babysitting each other.

But what happens when all the workers are processing stuck or long jobs? Your queue quickly backs up.

What you really need is a manager: someone like monit or god who can watch workers and kill stale ones.

Also, your workers will probably grow in memory a lot during the course of their life. So you need to either make sure you never create too many objects or "leak" memory, or you need to kill them when they get too large (just like you do with your frontend web instances).

At this point we have workers processing jobs with god watching them and killing any that are a) bloated or b) stale.

But how do we know all this is going on? How do we know what's sitting on the queue? As I mentioned earlier, we had a web interface which would show us pending items and try to infer how many workers are working. But that's not easy - how do you have a worker you just kill -9'd gracefully manage its own state? We added a process to inspect workers and add their info to memcached, which our web frontend would then read from.

But who monitors that process? And do we have one running on each server? This is quickly becoming very complicated.

Also, we have another problem: startup time. There's a multi-second startup cost when loading a Rails environment, not to mention the added CPU time. With lots of workers doing lots of jobs and being restarted on a non-trivial basis, that adds up.

It boils down to this: GitHub is a warzone. We are constantly overloaded and rely very, very heavily on our queue. If it's backed up, we need to know why. We need to know if we can fix it. We need workers to not get stuck and we need to know when they are stuck.

We need to see what the queue is doing. We need to see what jobs have failed. We need stats: how long are workers living, how many jobs are they processing, how many jobs have been processed total, how many errors have there been, are errors being repeated, did a deploy introduce a new one?

We need a background job system as serious as our web framework. I highly recommend DelayedJob to anyone whose site is not 50% background work.

But GitHub is 50% background work.

In Search of a Solution

In the Old Architecture, GitHub had one slice dedicated to processing background jobs. We ran 25 DJ workers on it and all they did was run jobs. It was known as our "utility" slice.

In the New Architecture, certain jobs needed to be run on certain machines. With our emphasis on sharding data and high availability, a single utility slice no longer fit the bill.

Both beanstalkd and bj supported named queues or "tags," but DelayedJob did not. Basically we needed a way to say "this job has a tag of X" and then, when starting workers, tell them to only be interested in jobs with a tag of X.

For example, our "archive" background job creates tarballs and zip files for download. It needs to be run on the machine which serves tarballs and zip files. We'd tag the archive job with "file-serve" and only run it on the file serving slice. We could then re-use this tag with other jobs that needed to only be run on the file serving slice.
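The tagging idea can be sketched in a few lines. This is a toy in-memory model, not the feature we added to DelayedJob: the `Job` struct, the `TaggedQueue` class and its `pop_for` method are hypothetical names used only for illustration.

```ruby
# Hypothetical in-memory model of tagged jobs; names are illustrative only.
Job = Struct.new(:name, :tag)

class TaggedQueue
  def initialize
    @jobs = []
  end

  def push(job)
    @jobs << job
  end

  # Remove and return the oldest job whose tag the worker is interested in,
  # or nil when nothing matches.
  def pop_for(tags)
    index = @jobs.index { |job| tags.include?(job.tag) }
    index && @jobs.delete_at(index)
  end
end

queue = TaggedQueue.new
queue.push(Job.new('warm caches', 'cache'))
queue.push(Job.new('archive', 'file-serve'))

# A worker started on the file serving slice only asks for "file-serve" jobs,
# so it skips the cache job even though that one was queued first.
job = queue.pop_for(['file-serve'])
puts job.name
```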

We added this feature to DelayedJob but then realized it was an opportunity to re-evaluate our background job situation. Did someone else support this already? Was there a system which met our upcoming needs (distributed worker management - god/monit for workers on multiple machines along with visibility into the state)? Should we continue adding features to DelayedJob? Our fork had deviated from master and the merge (plus subsequent testing) was not going to be fun.

We made a list of all the things we needed on paper and started re-evaluating a lot of the existing solutions. Kestrel, AMQP, beanstalkd (persistence still hadn't been rolled into an official release a year after being pushed to master).

Here's that list:

  • Persistence
  • See what's pending
  • Modify pending jobs in-place
  • Tags
  • Priorities
  • Fast pushing and popping
  • See what workers are doing
  • See what workers have done
  • See failed jobs
  • Kill fat workers
  • Kill stale workers
  • Kill workers that are running too long
  • Keep Rails loaded / persistent workers
  • Distributed workers (run them on multiple machines)
  • Workers can watch multiple (or all) tags
  • Don't retry failed jobs
  • Don't "release" failed jobs

Redis to the Rescue

Can you name a system with all of these features:

  • Atomic, O(1) list push and pop
  • Ability to paginate over lists without mutating them
  • Queryable keyspace, high visibility
  • Fast
  • Easy to install - no dependencies
  • Reliable Ruby client library
  • Store arbitrary strings
  • Support for integer counters
  • Persistent
  • Master-slave replication
  • Network aware

I can. Redis.

If we let Redis handle the hard queue problems, we can focus on the hard worker problems: visibility, reliability, and stats.
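To make the mapping from queue to Redis list concrete, here is a small sketch. It runs without a Redis server: `FakeList` is a stand-in written for this example whose methods mirror the semantics of the Redis commands RPUSH, LPOP and LRANGE. Encoding each job as a JSON string with `class` and `args` keys is an assumption made for illustration.

```ruby
require 'json'

# Stand-in for a Redis list so the sketch runs without a server. The method
# names mirror the Redis commands RPUSH, LPOP and LRANGE.
class FakeList
  def initialize
    @items = []
  end

  # Append to the tail and return the new length, as RPUSH does.
  def rpush(value)
    @items.push(value)
    @items.length
  end

  # Remove and return the head, or nil when the list is empty.
  def lpop
    @items.shift
  end

  # Inclusive range read that leaves the list untouched, as LRANGE does.
  def lrange(start, stop)
    @items[start..stop] || []
  end
end

queue = FakeList.new
queue.rpush(JSON.generate('class' => 'Archive', 'args' => [42]))
queue.rpush(JSON.generate('class' => 'WarmCache', 'args' => []))

# See what's pending without removing anything.
pending = queue.lrange(0, -1).map { |raw| JSON.parse(raw)['class'] }
puts pending.inspect

# Pop the oldest job for processing.
job = JSON.parse(queue.lpop)
puts job['class']
```

Pushing, popping and paging over pending jobs cover the "fast pushing and popping" and "see what's pending" items from our list, which is the part we wanted the datastore to handle for us.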

And that's Resque.

With a web interface for monitoring workers, a parent / child forking model for responsiveness, swappable failure backends (so we can send exceptions to, say, Hoptoad), and the power of Redis, we've found Resque to be a perfect fit for our architecture and needs.
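The parent / child forking idea can be sketched as follows. This is a simplified illustration and not Resque's worker code: the `run_in_child` helper is a hypothetical name, and the sketch assumes a Unix-like system where `fork` is available. The parent stays small and responsive while each job runs in a short-lived child whose memory is returned to the system when it exits.

```ruby
# Run one job in a forked child and hand its result back to the parent
# through a pipe. Returns the child's output and whether it exited cleanly.
def run_in_child
  reader, writer = IO.pipe
  pid = fork do
    reader.close
    writer.write(yield.to_s) # the job body runs only in the child process
    writer.close
  end
  writer.close               # the parent keeps only the read end
  output = reader.read       # returns once the child closes its write end
  reader.close
  Process.wait(pid)
  [output, $?.success?]
end

output, ok = run_in_child { "processed job in pid #{Process.pid}" }
puts output
puts "child exited cleanly: #{ok}"
```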

web ui

We hope you enjoy it. We certainly do!

NY State Senate Code on GitHub

If you’re in New York, or are interested in Open Government initiatives, you may be excited to know that the New York State Senate has opened up to the online community in a big way. They have put up a Free and Open-Source Software & Services website that provides and documents an API for all of their legislative data, feeds and widgets for that data, a browser for that data that even uses Disqus to allow you to comment on legislation, and open source software projects that help consume that data.

The cool thing for us is that they’ve put all their open source projects up on GitHub at github.com/nysenate for you to use and improve.

As a user of Open-Source software the New York Senate wants to help give back to the community that has given it so much – including this website. To meet its needs the Senate is constantly developing new code and fixing existing bugs. Not only does the Senate recognize that it has a responsibility to give back to the Open Source community, but public developments, made with public money should be public.

We are very happy that we can help them share these projects, and I hope more local and federal government efforts will open up to this degree. Congratulations to the New York Senate for moving forward with openness and accountability.

Perl Mirror on GitHub

For those of you who had not heard, Perl has moved its 20 year source code history for Perl 5 from Perforce to Git (also, the bird is in fact the word).

At the request of the Perl Git transition team, we’ve set up a mirror of the main repository here on GitHub for Perl developers to fork should they prefer collaborating on GitHub. It is automatically updated every hour, so fork away.

http://github.com/github/perl

Also, as a little background – our friend Sam Vilain, whom we met and had a few beers with at GitTogether 2008, has completed a heroic job of moving the Perl source code over.

From the Use Perl article:

He spent more than a year building
custom tools to transform 21 years of Perl history into the first
ever unified repository of every single change to Perl. In addition
to changes from Perforce, Sam patched together a comprehensive view
of Perl’s history incorporating publicly available snapshot releases,
changes from historical mailing list archives and patch sets recovered
from the hard drives of previous Perl release engineers.

Wow.

Holiday Shirt Super Sale!

Super Shirt Sale!

As a special holiday treat, we’re putting all of our remaining shirt stock on sale. You can now get Fork You for $18.00 USD and All Your Rebase for $16.00 USD. We ship to most places worldwide. Grab one while supplies last. Once these are all gone, we’ll be printing up some new shirt designs!


New Create/Delete Events

You’ve created a repo on GitHub, written a bunch of great code, and now it’s time to push it:

Here’s what happens on GitHub now:

Along with the create events, if you have any service hooks set up, we will pipe one commit to them so others will be notified of the new branch or tag creation. We’ve also added delete events if you remotely delete a branch or tag.

This isn’t the second big feature launch Scott mentioned in his Fork Queue post yesterday (stay tuned about that), but we’re still pretty happy about these additions.

Traffic Graphs

We all love stats and graphs, and I especially love bringing them to you. So without further ado I present the long asked for traffic graph for each and every repo on the site. Simply go to the Graphs tab and select the Traffic graph. We’re making the pageview numbers for the last 90 days available to YOU!
