Mad Marmot A blog about programming, ruby, rails.

Capistrano Tip to Avoid Disk Intensive Removal of Files

Posted on May 14, 2009

At TST Media we host our Rails app at Engine Yard on four slices that all share the same disk via GFS. Any time there is heavy disk activity, our sites slow to a crawl. This makes removing files a bit tricky, such as when we want to empty our cache files or when Capistrano removes a release at the end of a deploy. The strategy we came up with is to move the files to a specific directory we call the "caches_to_remove" directory, and use a cron task to empty this directory at night when most of our users are asleep. Moving files is extremely fast and not disk intensive, as long as the source and destination are on the same filesystem, since the move is then just a rename rather than a copy.
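The move-now, delete-later idea can be sketched in plain Ruby. This is only an illustration, not the code we run: the helper names same_filesystem? and defer_removal are made up for this example, and it assumes both directories already exist.

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical helper: true when both paths live on the same device,
# which is the case where a move is a cheap rename instead of a copy.
def same_filesystem?(path_a, path_b)
  File.stat(path_a).dev == File.stat(path_b).dev
end

# Hypothetical helper: move +dir+ into +trash_dir+ under a unique name and
# recreate an empty +dir+ in its place. Returns the new location.
def defer_removal(dir, trash_dir)
  unless same_filesystem?(dir, trash_dir)
    raise ArgumentError, "#{dir} and #{trash_dir} are on different filesystems"
  end
  target = File.join(trash_dir, "#{File.basename(dir)}_#{Time.now.to_i}_#{Process.pid}")
  File.rename(dir, target)   # fast: no file contents are read or written
  FileUtils.mkdir_p(dir)     # the app keeps writing to a fresh, empty directory
  target
end
```

The expensive recursive delete of everything under the trash directory is then left to the nightly cron job.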

The shell script that the cron runs nightly is simple:

# remove_cache_dirs.sh
rm -rf /data/tst/caches_to_remove
mkdir /data/tst/caches_to_remove

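We have a Capistrano task to empty the cache files, which simply moves the cache directory into the caches_to_remove directory. The moved directory gets a timestamp suffix so that emptying the cache twice in one day does not collide with the copy moved earlier.

set :remove_files_dir, '/data/tst/caches_to_remove/'
namespace :cache do
  desc "Delete all cache files on disk."
  task :empty, :roles => :web do
    # Unique target name: mv fails if a non-empty "cache" directory is already waiting there.
    target = "#{remove_files_dir}cache_#{Time.now.utc.strftime('%Y%m%d%H%M%S')}"
    sudo "mv /data/tst/cache #{target} && mkdir /data/tst/cache"
  end
end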

We also want a deploy to do a mv instead of an rm -rf on the release to be "removed". Our application has quite a few files: some 33,000 lines of code, not counting all the plugins, gems and Rails itself, which are vendored. Doing an rm -rf at the end of a deploy therefore slows our app down considerably for several minutes, so we overrode the built-in Capistrano deploy:cleanup task to do a mv instead.

namespace(:deploy) do
  # overriding cleanup task, changing it from a rm -rf to a mv
  desc <<-DESC
    Clean up old releases. By default, the last 5 releases are kept on each \
    server (though you can change this with the keep_releases variable). All \
    other deployed revisions are removed from the servers. By default, this \
    will use sudo to clean up the old releases, but if sudo is not available \
    for your environment, set the :use_sudo variable to false instead.
  DESC
  task :cleanup, :except => { :no_release => true } do
    count = fetch(:keep_releases, 5).to_i
    if count >= releases.length
      logger.important "no old releases to clean up"
    else
      logger.info "keeping #{count} of #{releases.length} deployed releases"

      # COMMENT OUT THIS CODE
      # directories = (releases - releases.last(count)).map { |release|
      #   File.join(releases_path, release) }.join(" ")
      # invoke_command "rm -rf #{directories}", :via => run_method
     
      # ADD THIS CODE
      # Move each old release into the caches_to_remove directory; the nightly cron deletes it.
      (releases - releases.last(count)).each do |release|
        directory = File.join(releases_path, release)
        invoke_command "mv #{directory} #{remove_files_dir}", :via => run_method
      end
    end
  end
end
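
The selection logic in the override is plain array arithmetic, and it may help to see it in isolation. The sketch below is standalone Ruby rather than Capistrano code: old_release_paths is a hypothetical helper, and the release names and the /data/tst/releases path are made-up examples. It assumes the release names are sorted oldest first.

```ruby
# Hypothetical helper: given release directory names sorted oldest first,
# return the full paths of every release except the newest +keep+.
def old_release_paths(releases, releases_path, keep)
  return [] if keep >= releases.length
  (releases - releases.last(keep)).map { |release| File.join(releases_path, release) }
end

# Made-up release names for illustration.
releases = %w[20090501120000 20090505120000 20090510120000 20090514120000]
old_release_paths(releases, "/data/tst/releases", 2).each do |directory|
  # In the real task this is where the mv command is issued on the servers.
  puts "mv #{directory} /data/tst/caches_to_remove/"
end
```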

And the world is a happier place!

A Capistrano task for a rolling Mongrel restart and deploy

Posted on May 13, 2009

At TST Media we have our Rails app hosted at Engine Yard. Currently we use Nginx, haproxy, and Mongrel, and have 4 slices, each with 4 mongrels. When an HTTP request first comes in to our system it hits the load balancer, which chooses a slice to send it to. The nginx on the given slice picks the request up and passes it on to haproxy. Haproxy chooses a mongrel to send the request to based on availability. When we roll out bug fixes, which we do once every other day or so, the mongrels all restart at once and all the users browsing our sites experience 20-30 seconds of... basically downtime. The browser spins and waits until the mongrels are ready to go. Depending on when a request arrives during the restart, the user may instead see a 502 Bad Gateway or a 503 Service Unavailable response, both of which started showing up once we started using haproxy. Clearly this is unacceptable. Soon we hope to switch to Nginx with Phusion Passenger, which may not have this problem. Until then we have started doing rolling restarts, where only one slice is down at a time, which allows us to do small deploys without impact to our users.

To accomplish this rolling restart with our setup we have to stop nginx on the slice that is being restarted. This prevents the load balancer from sending requests to that slice. If we leave nginx up and only stop the mongrels, requests will still be routed to the slice and will hang just as if we had restarted all the mongrels at once. We put together this Capistrano task:
     
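namespace :mongrel do
  desc <<-DESC
  Rolling restart, 1 server at a time.
  DESC
  task :rolling_restart do
    original_hosts = ENV['HOSTS']
    begin
      find_servers(:roles => :app).each do |server|
        # Restrict the tasks invoked below to this one server.
        ENV['HOSTS'] = "#{server.host}:#{server.port}"
        nginx.stop
        puts "Sleeping 10 seconds to wait for mongrels to finish."
        sleep 10
        mongrel.restart
        puts "Sleeping 30 seconds to wait for mongrels to start up."
        sleep 30
        nginx.start
      end
    ensure
      # Restore HOSTS so that any later tasks run against all servers again.
      ENV['HOSTS'] = original_hosts
    end
  end
end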

This task iterates over each server/slice and stops nginx, waits 10 seconds to let the mongrels finish what they are doing, restarts the mongrels, waits 30 seconds for the mongrels to boot up, and then starts nginx up again. This Capistrano task assumes the existence of nginx.stop, nginx.start, and mongrel.restart tasks.
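
To make the ordering explicit, here is the same pattern as a Capistrano-free Ruby sketch. Everything in it is hypothetical and for illustration only: rolling_restart takes any object that responds to stop_proxy, restart_app, start_proxy and wait, and RecordingActions is a stand-in that logs the steps instead of touching real servers.

```ruby
# Hypothetical sketch of the rolling pattern: take one server out of rotation,
# restart it, wait for it to boot, then put it back before moving on.
def rolling_restart(servers, actions, drain_seconds = 10, boot_seconds = 30)
  servers.each do |server|
    actions.stop_proxy(server)    # the load balancer stops sending traffic here
    actions.wait(drain_seconds)   # let in-flight requests finish
    actions.restart_app(server)
    actions.wait(boot_seconds)    # give the app servers time to boot
    actions.start_proxy(server)   # back in rotation before the next server goes down
  end
end

# Stand-in that only records what would happen, so the ordering is visible.
class RecordingActions
  attr_reader :log

  def initialize
    @log = []
  end

  def stop_proxy(server)
    @log << "stop nginx on #{server}"
  end

  def restart_app(server)
    @log << "restart mongrels on #{server}"
  end

  def start_proxy(server)
    @log << "start nginx on #{server}"
  end

  def wait(seconds)
    @log << "wait #{seconds}s"
  end
end

actions = RecordingActions.new
rolling_restart(%w[slice1 slice2], actions)
puts actions.log
```

The point to notice in the log is that slice2 is never touched until slice1 has nginx running again, so at most one slice is out of rotation at any time.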

With this mongrel:rolling_restart task in place, we then defined a deploy:rolling task like this:

namespace :deploy do
  desc <<-DESC
    A deploy without migrations where the mongrels are restarted in a rolling manner.
  DESC
  task :rolling do
    update
    mongrel.rolling_restart
  end
end

When using this deploy:rolling task our site remains up and responsive during the entire deploy. This approach is useful for small bug-fix rollouts where there are no migrations that need to be run. There is a short window of time in which some of your servers will be out of date. For example, you may see issues if your bug fix includes changes to both a view file and a controller: a user may hit an updated mongrel and be served the new view, and then post to an out-of-date mongrel that is still running the old controller. However, this is usually preferable to forcing all of your users to wait 30 seconds while all the mongrels restart. I would rather impact a very small percentage of our users than 100% of them.

RailsConf 2009 and the Danger of Remote Mob Mentality

Posted on May 10, 2009

My first Ruby on Rails Conference was a positive experience.  RailsConf was in Vegas this year, and while I didn't win any money gambling, I did see several good talks and met some interesting Rails developers.

During the Wednesday morning keynote, as Chad Fowler was introducing Chris Wanstrath of GitHub, he asked who uses Git. Basically everyone in the room raised their hand. He went on to say that Rails programmers are like lemmings, which I think is a very interesting observation. It wasn't too long ago that most Rails developers used Subversion, and as soon as the Rails core team switched to Git everyone followed. It wasn't too long ago that test-driven development was an obscure programming practice only used by "Extreme" programmers. Now, if you are working on a Rails project it is a given that you have a decent test suite. And don't forget about REST architecture... people love REST architecture.

After Timothy Ferriss's disappointing keynote Tuesday night, which became the source of many jokes throughout the remainder of the conference, everyone was ready for a real hardcore motivational speech. Wow, did Robert Martin deliver in his talk, "What Killed Smalltalk Could Kill Ruby Too." No slides, just Robert Martin pacing on the stage and flinging his note cards into the air when he was done with them. Being a great speaker, he had everyone's rapt attention. He recapped a short history of Smalltalk and why it "died", and outlined what the Ruby and Rails community can do to avoid the same fate. This included doing test-driven development, acting professionally, not being arrogant towards non-Ruby programmers, and developing more powerful Ruby Integrated Development Environments. He stressed test-driven development quite a bit, as I knew he would given his Extreme Programming background. When the speech finished the crowd gave him a standing ovation. Everyone loved it.

At RailsConf it was apparent to me that Rails developers are a young crowd. I knew this before the conference, but seeing 1300 Ruby on Rails nerds all in the same room made it even more obvious. An analogy to lemmings is clearly extreme, but Rails developers certainly are impressionable. There definitely seems to be a sort of remote mob mentality thing going on, which is a little disturbing. You know those Simpsons episodes where the townspeople group together in a mob and everyone wants to kill Bart? Then someone yells some other new purpose and the mob follows without thinking. Anyway, the point is that I'd like to see Rails developers and other programmers think more for themselves. Everyone's circumstances and projects are different, and pretending that there are a few programming practices, such as test-driven development, that absolutely must be done to succeed, as Robert Martin implied, is absurd. I would add "Think for Yourself" to Robert Martin's list of what the Rails community must do to avoid the fate of Smalltalk.

At TST Media we spend very little time writing tests and have a weak test suite. Our application code comes to 33,435 lines and our test code to 1,811 lines, a test-to-code ratio of about 0.05. While I would like to see this improved marginally, given our current situation it is simply not worth trading features for a slightly higher quality code base, which is what a better test suite would give us.
 

Rails patch for caching ‘SHOW FIELDS’ for has_and_belongs_to_many associations

Posted on January 8, 2009

Last week I was examining the MySQL slow query logs at work and discovered the following, which led to an easy Rails patch that improved the performance of our app by about 25%.

# Time: 090108 11:05:02
# Query_time: 14.412306  Lock_time: 0.000521  Rows_sent: 2  Rows_examined: 2
SHOW FIELDS FROM `events_page_nodes`;
# Query_time: 14.390774  Lock_time: 0.000556  Rows_sent: 2  Rows_examined: 2
SHOW FIELDS FROM `events_page_nodes`;

Normally 'SHOW FIELDS' queries are moderately fast. I ran it manually just now and it took 0.16 seconds. However, here you can see that these 'SHOW FIELDS' queries took 14 seconds to complete! It turns out that MySQL creates a temporary table on disk for 'SHOW FIELDS' queries, so if the disk is busy with something else these queries can take a while to complete, as seen here.

In development mode these 'SHOW FIELDS' queries are not cached and occur very frequently, but in production mode Rails caches the result the first time it is requested for each model. I noticed that our database was receiving a large number of these 'SHOW FIELDS' queries, which I thought should only occur when a Rails environment is loaded or shortly thereafter when the models are first loaded (e.g. mongrel restarts, a background job, or a cron job).

However, upon inspection it turns out that Rails DOES NOT cache 'SHOW FIELDS' queries for has_and_belongs_to_many associations. So every time a select or an insert is done via a Rails has_and_belongs_to_many association, a 'SHOW FIELDS' on the join table is executed. One way to solve this problem would be to switch to the has_many :through approach, which involves adding a primary key id column to the join table and creating an ActiveRecord model for it, which would then take advantage of the built-in Rails caching of 'SHOW FIELDS'. However, we have 20-some join tables in our application. So instead I patched Rails to cache the 'SHOW FIELDS' queries, which turned out to be rather simple and noticeably improved the performance of our app (see charts below).

The 'SHOW FIELDS' queries are wrapped inside the database connection's columns method, and columns is called from insert_record and finding_with_ambiguous_select? in the has_and_belongs_to_many_association.rb file. In this patch I replace @owner.connection.columns with a call to @owner.class.habtm_columns, which has the caching logic within it. The cache is a hash with the table_name as the key and the result of connection.columns as the value. Since there isn't a class associated with the join table, I simply store the cache within the owner's class, which is one of the two classes that are part of the association. So it is possible that 'SHOW FIELDS' will be called twice and cached twice for a single join table, once for each class that is a part of the association.

Here are the changes I made to Rails for this patch.

1. In insert_record of has_and_belongs_to_many_association.rb replace this line:
  columns = @owner.connection.columns(@reflection.options[:join_table], "#{@reflection.options[:join_table]} Columns")
  # with this line:
  columns = @owner.class.habtm_columns(@reflection.options[:join_table], "#{@reflection.options[:join_table]} Columns")

2. In has_and_belongs_to_many_association.rb change the method finding_with_ambiguous_select? to be:
  def finding_with_ambiguous_select?(select_clause)
    !select_clause && @owner.class.habtm_columns(@reflection.options[:join_table], "Join Table Columns").size != 2
  end

3. In base.rb define this method as a class method:
  def habtm_columns(table_name, name)
    @habtm_columns_hash = {} unless defined?(@habtm_columns_hash) && @habtm_columns_hash
    unless @habtm_columns_hash.has_key? table_name
      @habtm_columns_hash[table_name] = connection.columns(table_name, name)
    end
    @habtm_columns_hash[table_name]
  end
 
4. And in reset_column_information be sure to clear out @habtm_columns_hash by adding it to the list of variables to nil out:
  def reset_column_information
    generated_methods.each { |name| undef_method(name) }
    @column_names = @columns = @columns_hash = @content_columns = @dynamic_methods_hash = @generated_methods = @inheritance_column = @habtm_columns_hash = nil
  end  
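
To see the caching behavior without a database, here is a small standalone sketch. It is not part of the patch: FakeConnection and FakeBase are stand-ins I made up for the connection and for ActiveRecord::Base, and Event and PageNode are hypothetical model names guessed from the events_page_nodes join table.

```ruby
# Stand-in for the database connection: counts how often columns is called,
# i.e. how many SHOW FIELDS queries would have been sent.
class FakeConnection
  attr_reader :calls

  def initialize
    @calls = 0
  end

  def columns(table_name, name = nil)
    @calls += 1
    ["#{table_name}.left_id", "#{table_name}.right_id"]
  end
end

# Minimal stand-in for ActiveRecord::Base carrying only the caching method.
class FakeBase
  CONNECTION = FakeConnection.new

  def self.connection
    CONNECTION
  end

  # Same caching idea as habtm_columns in the patch: the hash lives in a
  # class-level instance variable, so each class keeps its own cache.
  def self.habtm_columns(table_name, name)
    @habtm_columns_hash ||= {}
    unless @habtm_columns_hash.has_key?(table_name)
      @habtm_columns_hash[table_name] = connection.columns(table_name, name)
    end
    @habtm_columns_hash[table_name]
  end
end

class Event < FakeBase; end
class PageNode < FakeBase; end

3.times { Event.habtm_columns("events_page_nodes", "events_page_nodes Columns") }
2.times { PageNode.habtm_columns("events_page_nodes", "events_page_nodes Columns") }
puts FakeBase.connection.calls # prints 2: one lookup per class, not per call
```

Five lookups reach the connection only twice, once for each class in the association, which matches the "called twice and cached twice" case described above.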

Here is the complete patch for Rails 2.1.1.

Here is the complete patch for Rails 2.0.2.

After rolling this change out we have seen a very noticeable improvement in performance. By using New Relic's Compare With Yesterday feature (see chart below) we can see that our application response time dropped from an average of about 560 ms per request to 420 ms per request. This is a drop of roughly 25% in response time. The yellow line shows today, the blue line is yesterday. CPU and database load decreased by about 25% as well. This patch was rolled out at 13:30 yesterday, which is why the response time, CPU, and database graphs converge around 13:30. Throughput is basically the same across both days. (Note that the big dip in throughput, CPU, and database right around 13:30 is inaccurate. When the mongrels restarted after rolling this change out, several mongrels failed to communicate with New Relic for some unknown reason. Restarting these mongrels seemed to fix the communication problem with New Relic, which hasn't occurred since.)
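
For the record, the 25% figure is the change relative to yesterday's average. A tiny sketch of the arithmetic (percent_drop is just a name picked for this example):

```ruby
# Relative change between two measurements, as a percentage of the "before" value.
def percent_drop(before, after)
  (before - after) * 100.0 / before
end

puts percent_drop(560, 420) # prints 25.0
```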

[Chart: New Relic "Compare With Yesterday" (compare_with_yesterday)]
