
Speeding Up Bulk Imports in Rails

Let’s imagine we have 50,000 products to import… not an absurd amount, but enough that we start to notice some pain, and enough that we wouldn’t want to keep a user waiting while the system imports them all.

That is, unless we can figure out a way to import these records quickly.

The Data We’re Working With

We’ll be working with a Product model which has the columns sku, name, origin, and msrp_cents. To populate our database, we receive a nightly dump from the manufacturer, which arrives as a CSV file. The data looks like this:

0abcd,0 Product,China,3000
1abcd,1 Product,Mexico,4000
2abcd,2 Product,Canada,2000
3abcd,3 Product,USA,5000
4abcd,4 Product,Canada,4000

Here is the schema:

create_table "products", force: :cascade do |t|
  t.string   "sku",        null: false
  t.string   "name",       null: false
  t.string   "origin",     null: false
  t.integer  "msrp_cents", null: false
  t.datetime "created_at", null: false
  t.datetime "updated_at", null: false
end

add_index "products", ["sku"], name: "index_products_on_sku", unique: true, using: :btree

The data is obviously fake, but that doesn’t matter; the principle is the same. The only validation our model has is a uniqueness validation on the sku field; the database requires the other fields to not be null.
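As a minimal sketch, assuming nothing beyond what’s described so far, the model looks something like this:

class Product < ActiveRecord::Base
  # The only model-level validation: sku must be unique. The not-null
  # requirements are enforced by the database schema shown above.
  validates :sku, uniqueness: true
end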

Attempt 1 – The original (59 seconds)

My first attempt was the most straightforward… nothing fancy is going on. I’m taking the CSV file, looping over the records, and saving each one.

task import_products1: :environment do
  require 'csv'

  CSV.foreach("#{Rails.root}/tmp/products.csv") do |p|
    product = Product.new(sku: p[0], name: p[1], origin: p[2], msrp_cents: p[3])
    product.save!
  end
end

Running this with the time program produces the following output:

rake speedy:import_products1  59.45s user 4.44s system 77% cpu 1:22.59 total

Is 59 seconds fast? Slow? At this point we don’t really know. But let’s take a look at what’s going on for each of the records.

(0.1ms)  BEGIN
Product Exists (0.6ms)  SELECT  1 AS one FROM "products" WHERE "products"."sku" = '123' LIMIT 1
SQL (0.3ms)  INSERT INTO "products" ("sku", "name", "origin", "msrp_cents", "created_at", "updated_at") VALUES ($1, $2, $3, $4, $5, $6) RETURNING "id"  [["sku", "123"], ["name", "n"], ["origin", "Russia"], ["msrp_cents", 1000], ["created_at", "2015-12-05 01:34:41.896705"], ["updated_at", "2015-12-05 01:34:41.896705"]]
(6.1ms)  COMMIT

Here is what’s happening step by step:

  1. BEGIN starts a database transaction. If there is a failure of any kind, Rails issues a ROLLBACK so that none of these changes are finalized in the database.
  2. Next, because we have a uniqueness validation in the model, Rails queries to check whether this sku already exists.
  3. At this point it runs the actual INSERT statement, inserting the data into the database.
  4. Because there were no failures, it runs COMMIT to finalize all changes made within this database transaction.

What we don’t see is that there is also a lot going on behind the scenes. Rails has to talk over TCP to the database four times for each record, and it has to build two different SQL statements each time. Luckily, that last part is quick thanks to some work by Tenderlove on AdequateRecord Pro, where the query construction is cached.

Attempt 2 – Rely on DB for unique validation (44 seconds)

The first thing we’ll do to speed up the import is to rely on our database, rather than Rails, to ensure the skus are unique. This will allow us to avoid running that extra SELECT statement for every record.

You should always have a unique index on any field you actually want to be unique. Rails will do its best, but it can’t guarantee uniqueness due to race conditions: two processes can each run the uniqueness check, find no existing row, and then both insert the same value.

The migration to add this unique index looks like this:

add_index :products, [:sku], unique: true

If there is a failure, Rails will raise an ActiveRecord::RecordNotUnique exception, which you can rescue from. To avoid this validation being run, we could pass the save! method the option validate: false, but that would stop all validations from being run, not just the one we want to avoid.
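For illustration, relying on the database and rescuing that exception could look something like this (the duplicate_skus array is hypothetical, just one way to collect failures for reporting):

require 'csv'

duplicate_skus = []

CSV.foreach("#{Rails.root}/tmp/products.csv") do |p|
  product = Product.new(sku: p[0], name: p[1], origin: p[2], msrp_cents: p[3])
  begin
    # Skip all model validations and let the unique index do the work.
    product.save!(validate: false)
  rescue ActiveRecord::RecordNotUnique
    # Hypothetical: remember which skus collided so we can report them later.
    duplicate_skus << product.sku
  end
end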

Below is a technique for avoiding a single validation. It creates an attr_accessor called skip_uniqueness, which we use in conjunction with the unless option that we can pass to the validates method.

class Product < ActiveRecord::Base
  attr_accessor :skip_uniqueness
  validates :sku, uniqueness: true, unless: :skip_uniqueness

  def save_for_bulk_import!(options = {})
    self.skip_uniqueness = true
    save! options
  end
end

Using this technique, we created our own version of the save! method which sets the skip_uniqueness attribute to true, thereby skipping the uniqueness validation.

And finally, here’s the code that was run to produce this result:

task import_products2: :environment do
  require 'csv'

  CSV.foreach("#{Rails.root}/tmp/products.csv") do |p|
    product = Product.new(sku: p[0], name: p[1], origin: p[2], msrp_cents: p[3])
    product.save_for_bulk_import!
  end
end

Now after running the import, we get a time of 44 seconds… about 15 seconds faster than the original!

rake speedy:import_products2  44.10s user 3.56s system 78% cpu 1:00.57 total

Attempt 3 – Wrap it in a transaction (34 seconds)

The way we have our code now, each of the 50,000 products gets its own database transaction. This is how Rails handles saving a model by default… unless of course we wrap it in our own transaction, a single one that we can use for all 50,000 records.

This means that if there is any failure, none of them will be inserted.

task import_products3: :environment do
  require 'csv'

  Product.transaction do
    CSV.foreach("#{Rails.root}/tmp/products.csv") do |p|
      product = Product.new(sku: p[0], name: p[1], origin: p[2], msrp_cents: p[3])
      product.save_for_bulk_import!
    end
  end
end

Now when we run the code, we see that it is down to 34 seconds… another 10 seconds faster!

rake speedy:import_products3  34.14s user 1.47s system 89% cpu 39.696 total

Attempt 4 – Bulk insert (4.6 seconds)

Our code is still connecting to the database 50,000 times. I had a feeling this was the real source of the slowness, because if I instantiate 50,000 products without saving them, the script takes only about 5 seconds.
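That sanity check looked roughly like this (a sketch; the task name is made up for illustration):

task instantiate_products_only: :environment do
  require 'csv'

  # Build every Product in memory, but never call save,
  # so the database is never touched.
  CSV.foreach("#{Rails.root}/tmp/products.csv") do |p|
    Product.new(sku: p[0], name: p[1], origin: p[2], msrp_cents: p[3])
  end
end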

To solve this, let’s take advantage of something both MySQL and PostgreSQL support: Inserting multiple values with a single INSERT statement.

INSERT INTO films (code, title, did, date_prod, kind) VALUES
    ('B6717', 'Tampopo', 110, '1985-02-10', 'Comedy'),
    ('HG120', 'The Dinner Game', 140, DEFAULT, 'Comedy');

As with many things in Ruby, there is a gem that helps us do this: activerecord-import. With it, we’ll read all of the products into an array and insert them into the database all at once:

task import_products4: :environment do
  require 'csv'

  Product.transaction do
    products = CSV.read("#{Rails.root}/tmp/products.csv")
    columns = [:sku, :name, :origin, :msrp_cents]
    Product.import columns, products, validate: false
  end
end

This lowers our time all the way down to 4.6 seconds!

Keep in mind that in this case I’ve turned off validations entirely. As a result, you should be sure that your incoming data is valid, perhaps by running it through a cleanse phase before it reaches this point. That said, even with the validations running, the import took about 25 seconds, which is the second fastest time I was able to get.

rake speedy:import_products4  4.64s user 0.30s system 75% cpu 6.527 total
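If you do want to keep validations on, or if memory becomes a concern with a much larger file, activerecord-import can handle both; the batch_size value below is just an example:

Product.transaction do
  products = CSV.read("#{Rails.root}/tmp/products.csv")
  columns = [:sku, :name, :origin, :msrp_cents]
  # validate: true runs the model validations for each record;
  # batch_size splits the work into multiple INSERT statements.
  Product.import columns, products, validate: true, batch_size: 10_000
end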

Conclusion

Take advantage of your database! It’s very efficient at what it does, and by relying on its functionality, you’ll be able to figure out different ways to make your code more efficient.

In this article, we relied upon our not-null database constraints along with a unique index, and then we used the ability to insert multiple records with a single SQL query. All in all, we brought the time to import 50,000 records from 59 seconds all the way down to 4.6 seconds, around 12 times faster.

Reference: Speeding Up Bulk Imports in Rails from our WCG partner Leigh Halliday at the Codeship Blog.

Leigh Halliday

Leigh is a developer at theScore. He writes about Ruby, Rails, and software development on his personal site.