Converting Multiple Microsoft Access Databases into a Single MySQL Database – Rails, Ruby, MDBTools

Some of the data I’ve been dealing with lately is in Microsoft Access databases (.mdb files). I’ve been surprised with the lack of support for manipulating these files outside of Access. There are a couple solutions, but nothing satisfied me needs. The worst data set came to me in a few hundred different .mdb files. It looked like the person that exported the database exported a copy of the database for each unique facility. So each facility had a file, but the schema for each file was exactly the same. It was apparent that in order to do any sort of complex queries on the data, the files would have to be combined. I was recommended a tool called mdbtools by somebody at a local Ruby meetup. I ended up using this along with Ruby inside of a Rake task to convert all my .mdb files into a single MySQL database.

Prerequisites

The first thing I needed to do was to successfully configure and compile mdbtools. This probably took the longest as there are a few different version out there. I had the most success with these instructions from the author’s git repository, http://www.automatthew.com/2008/02/how-to-compile-mdbtools-on-mac-os-x.html?showComment=1219246620000. I was also able to compile this version on my ubuntu machine pretty easily. The biggest hangup was with a couple of dependencies. I also had a weird error where the autogen.sh was not detecting my libtools library, but I knew this was installed. I commented out this check in the autogen.sh and things all worked, luckily.

Understanding mdb-tools was not very difficult. It contains various libaries for interfacing with it through C, etc. The only thing I cared about were the command line tools that it provided, listed here. The commands I will use in this post are mdb-tables, mdb-schema, mdb-export.

The other requirement is to have a MySQL table created ready with permissions, etc. that will end up holding the data from the combined .mdb files.

Step 1: Extracting the Schema

One important thing to note about my situation is that all the .mdb files have the same schema. Knowing this, I was able to simply find a random .mdb file and base my MySQL schema on this.

I have my .mdb files stored in a shared directory and so within my Rake task I used the following code to grab the first one (I’m sure there is a better way).

first_mdb_file = ''
#get first mdb file
Dir.entries("#{RAILS_ROOT}/../shared/system/data_sources/SC/sc_mdb_files/").each do |mdb_file|
  if mdb_file =~ /\.mdb/
    first_mdb_file = "#{RAILS_ROOT}/../shared/system/data_sources/SC/sc_mdb_files/#{mdb_file}"
    break
  end
end

Knowing the path to the first .mdb file the schema can be extracted by getting the result of the following shell command (the code surrounded in backticks is executed on the command line of the system). The -S flag tells it to santize some of the table names by replacing spaces with underscore, etc. I needed this for my tables because there were some problems without it.

schema_sql = `mdb-schema -S #{first_mdb_file} mysql`

Step 2: Building the Tables

Since I’m doing this in a Rake task in Rails (you could do this in plain ruby), I am using an ActiveRecord connection to interface with my MySQL database. I have a class called ExternalDatabase that allows me to connect to a different database without modifying the one that my main Rails program uses. I will list it below, I don’t remember exactly where I got it from.

#Class used to connect to external databases without modifying the exiisting connection used by Rails
class Data::ExternalDatabase < ActiveRecord::Base
  self.abstract_class = true
 
  def self.set_source(configuration_name)
    source = ActiveRecord::Base.configurations["sources"][configuration_name.to_s]
    establish_connection source
  end
 
  def self.set_configuration(config)
    source = config
    establish_connection source
  end
end

As you can see, I also have a section in my database.yml where i list connection details for various source databases. This isn’t related to this post though.

So, assuming we have this class, we can use the following code to load our new schema into out MySQL database.

Data::ExternalDatabase.set_configuration(:adapter  => "mysql",
:host     => "localhost",
:username => YOUR_DATABASE_USERNAME,
:password => YOUR_DATABASE_PASSWORD,
:database => YOUR_DATABASE_NAME,
)
db =  Data::ExternalDatabase.connection 
 
puts "adding if exists clause to drop tables"
 
schema_sql = schema_sql.gsub(/DROP TABLE (.*);/, 'DROP TABLE IF EXISTS \1;')
schema_sql = schema_sql.gsub(/--.*$/, '')
puts "executing schema sql"
lines = schema_sql.split(';')
 
lines.each do |sql_line|
  next if sql_line.blank?
  db.execute(sql_line)
end

The first part builds my database configuration hash that I pass in to ExternalDatabase. I save the connection into a variable called ‘db’. One of the important things to note about this and mdb-tools is that it does not check if the table currently exists before trying to drop the table. So I added a snippet of code that removes the comments and adds ‘IF EXISTS’ to the ‘DROP TABLE’ clause in the schema sql.

For some reason, trying to execute this entire schema sql string with db.execute causes various errors that I couldn’t figure out. By separating the schema sql into individual sql statements terminated with semicolons seemed to fix this problem. The last bit of code does the separation into lines and executes the sql on the database. After all this, the database should have the correct schema.

Step 3: Load Data

One of the last things we want to do is actually load all the data from the different mdb files into the single MySQL database.

tables = (`mdb-tables #{first_mdb_file}`).split(' ')
Dir.entries("#{RAILS_ROOT}/../shared/system/data_sources/SC/sc_mdb_files/").each do |mdb_file|
  if mdb_file =~ /\.mdb/
    puts "Current file: #{mdb_file}"
    tables.each do |table|
      puts "Current table: #{table}"
      insert_sql = `mdb-export -I -S #{RAILS_ROOT}/../shared/system/data_sources/SC/sc_mdb_files/#{mdb_file} #{table}`
      insert_sql = insert_sql.gsub(/;/, ',')
      insert_sql = insert_sql.gsub(/(.*?\))(\s+INSERT)/, '\1;\2')
      lines = insert_sql.split(';')
      lines.each do |sql_line|
        next if sql_line.blank?
        db.execute(sql_line)
      end
    end
  end
end

So the first line runs mdb-tools command line tool ‘mdb-tables’ to get a list of the tables separated by spaces. I could have done this through my previously created schema as well.

This code goes through every .mdb file in my directory and uses mdb-tools to extract the data as sql ‘INSERT’ statements from the command-line.

In the inner-most loop, I clean up the returned value from mdb-export because it does not include semi-colons after each INSERT statement. I also remove any semi-colons before adding any because some of my fields had semi-colons in them. Again, I separate each INSERT statement into a single statement and execute these individually in the last section of the code.

After all this you should have all your data combined into a single MySQL database. One problem might exists though. Certain tables, such as tables that contain general purpose labels, etc. for other tables to refer to, might contain duplicates. This wouldn’t have caused a big problem for me because I just needed to extract data, but if you needed to continue using this database you would need to remove the duplicates.

Final Step: Removing Duplicate Rows

There seem to be a lot of different ways to remove duplicate rows from a MySQL table. I wish there was a simple solution, but I couldn’t find one. The following code does the following:

  1. Create a temporary table that has the same schema as the table we are removing duplicates from.
  2. Insert all distinct rows into this new temporary table.
  3. remove all values from the original table.
  4. Insert all of the rows from the temporary table back into the original table.
  5. Finally, drop the temporary table.
tables.each do |table|
  db.execute("CREATE TABLE temp LIKE #{table}")
  db.execute("INSERT INTO temp select distinct * FROM #{table}")
  db.execute("TRUNCATE #{table}")
  db.execute("INSERT INTO #{table} select * from temp")
  db.execute("DROP TABLE temp")
end

Thanks for reading, I hope this was helpful to others that have some similar problems.

- Chase

blog comments powered by Disqus
© Chasing • Powered by Wordpress • Using the Swiss Cool theme.