Setting up a local copy of the OpenStreetMap database, kept up to date with minutely diffs
Here's my experience setting up a local copy of the OpenStreetMap database, kept up to date with minutely diffs. This is the raw data database, not a database suited for generating map tiles (that would be PostGIS).
I started this as part of a school project involving database research, in which I built a robot. I didn't want to hammer the main servers for data, so I wanted to set up a local copy to query against instead.
Note that this should be considered a read-only database. When I submit changes, I submit them directly to the main database. You get into some difficult consistency issues if you apply updates to your local database and then try to extract them and apply them to the master database.
Hardware
My hardware is pretty basic. I used my existing Linux machine, which was built from just cheap (but mostly reliable) components.
- Intel Celeron dual-core 64-bit E3200 2.4GHz
- 2GB memory
- 1x 500GB SATA drive (main system)
- 1x 250GB SATA drive (database)
Software
- Unix/Linux - You'll need some flavor of *nix to work on. Use whatever you're comfortable with... Fedora/RedHat, Ubuntu, etc. I use Gentoo: it's a package-based system, and its package system (portage) installs almost all packages by automatically downloading and compiling source, rather than installing binaries.
- PostgreSQL - OSM supports either MySQL or PostgreSQL as its backing database, but PostgreSQL is what's used currently, and so it's probably best to follow their lead to avoid unnecessary complications.
- Osmosis - Osmosis is a Java tool for reading and writing XML changesets and data dumps.
- The Rails Port - This is a port of the API to Ruby on Rails. It isn't necessary for most of this, except for being the only "official" way to load the schema into the database.
Overview
We'll do roughly the following, not necessarily in this order:
- Install a bunch of software
- Download a database dump
- Prep the database
- Load the dump into the database
- Set up automatic updating to keep the database recent
Step-by-step
Step 1: Download planet.osm
planet.osm is a dump of the database to the OSM XML format. These dumps are done about once a week, and we'll use this to load the database with something relatively recent.
Go to the planet.osm page, and pick a mirror to download from. Look for a file with the name "planet-100915.osm.bz2", with the numbers representing the date in YYMMDD format. I ended up going with one that was only a week or two behind the present.
(Make sure that whichever planet file you use, you can find a state.txt in the minute-replicate/ directory that is close to the date of the dump. The state.txt files only seem to go back about 1 year from current.)
Start this downloading. On a cable connection capped at around 500KB/s, it took me about 8-9 hours.
Screen
You will probably want to use the "screen" command, which creates a virtual terminal that you can disconnect from and leave running. You can then use "screen -r" to reattach to the screen from any other login session. (Otherwise, you'd have to leave a terminal window connected, and if you happen to drop the connection... oops, there goes the shell and the download process. Trust me, screen will be your friend through this.)
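For example:
screen -S osm        # start a named session and kick off the download inside it
                     # detach with Ctrl-a d, log out, come back later...
screen -r osm        # ...and reattach from any other login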
Axel
Axel is a command-line file downloader like wget, but it takes multiple URLs of mirrors and downloads multiple chunks of the file simultaneously. If your download isn't maxing out your Internet pipe, you might consider using this.
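Something like this, where the mirror URLs are just placeholders for whichever mirrors you picked:
axel -n 4 -o planet-100915.osm.bz2 \
    http://mirror-one.example.org/planet/planet-100915.osm.bz2 \
    http://mirror-two.example.org/planet/planet-100915.osm.bz2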
MD5
You want to make sure you got the file intact, so download the .md5 checksum and run
md5sum -c *.md5
to have it verify the integrity of the file. (Unfortunately, if it fails, your only choice is to redownload the whole thing, since you don't know which part of the file is bad.)
Step 2: Meanwhile: Install Postgres
Although OpenStreetMap can run on either MySQL or Postgres, currently everything runs on and is primarily targeted for Postgres, so that's what we'll use.
On my system (Gentoo Linux), this was just
emerge dev-db/postgresql-server
This is where I popped in a second hard drive I wasn't using and mounted the whole thing under /var/lib/postgresql/. Then I did the database set-up (on Gentoo, "emerge --config dev-db/postgresql-server").
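For reference, the mount step looked something like this; the device name is just whatever your spare disk shows up as, so adjust accordingly:
mkfs.ext4 /dev/sdb1                       # format the spare drive (destroys whatever was on it)
mount /dev/sdb1 /var/lib/postgresql       # mount it where Postgres keeps its data
# plus an /etc/fstab entry so it survives reboots:
# /dev/sdb1   /var/lib/postgresql   ext4   noatime   0 2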
To create the empty database, do this: (copied from here)
sudo -u postgres -i
createuser openstreetmap -s -P
createdb -E UTF8 -O openstreetmap openstreetmap
createdb -E UTF8 -O openstreetmap osm_test
createdb -E UTF8 -O openstreetmap osm
psql -d openstreetmap < /usr/share/postgresql/8.4/contrib/btree_gist.sql
Tuning the server
By default, Postgres is not tuned for such a large database. I changed two settings that vastly improved its performance.
shared_buffers = 512MB    # Set this to 25% of your total memory (2GB in my case)
bgwriter_delay = 2000ms   # I don't know what a good value is, but since we're loading a lot of data, larger sounds better
You can set these in the usual manner for your distro or Postgres (probably in postgresql.conf in the database directory, /var/lib/postgresql/8.4/data in my case).
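shared_buffers in particular only takes effect after a restart, so bounce the server and confirm the value stuck. On my Gentoo box that was something like this (the init script name will vary by distro and version):
/etc/init.d/postgresql-8.4 restart
psql -U openstreetmap -d openstreetmap -c 'SHOW shared_buffers;'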
pg_top
There's a tool called pg_top which is basically "top" for Postgres. Install it with "emerge dev-db/pg_top".
Step 3: Meanwhile: Install Osmosis
Osmosis is a tool that can read and write a variety of OSM-compatible sources, including OSM XML files, databases, and diff mirrors.
Gentoo Portage already has a package for Osmosis, but it's out of date. Fortunately, you can create a local portage tree and add your own packages to it. I did this and created an ebuild for osmosis-0.36.
I also created a tiny patch file to fix a problem in the launcher script which was causing it to look in the wrong directory for some files.
Gentoo users: You may require some Java builds that aren't available in the main portage tree yet. Install layman, which manages portage overlay trees, and add the java overlay with "layman -a java-overlay".
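If you've never made a local overlay before, the rough shape of it is below; the overlay path and the category name are just my choices, nothing official:
mkdir -p /usr/local/portage/dev-java/osmosis
cp osmosis-0.36.ebuild /usr/local/portage/dev-java/osmosis/
cd /usr/local/portage/dev-java/osmosis
ebuild osmosis-0.36.ebuild manifest       # generate the Manifest for the new ebuild
# add PORTDIR_OVERLAY="/usr/local/portage" to /etc/make.conf, then:
emerge -av osmosis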
Step 4: Meanwhile: Install Rails port
The Rails port is a port of the OSM API to Rails, a web framework built on Ruby. Although I intended to query the database directly, the Rails port is the only way to create the empty tables in the database. Thus, you have to have it.
Like with Osmosis, I created ebuilds for all the packages I didn't have. However (in the interest of laziness and flexibility) I actually just had the ebuild for the Rails port manage the dependencies, and took care of actually getting the source myself.
The wiki page for the Rails port is pretty complete, so these are largely duplicates of what's on there, with a few modifications and caveats. So, here we go.
Dependencies
According to the wiki page (and a little trial-and-error), the Rails Port requires:
- >= Ruby 1.8.6
- >= RubyGems 1.3.1
- Either of:
- >= Postgres 8.3 and bindings (we'll use this one)
- >= MySQL 4.1.0 and bindings
- RMagick
It also requires the following Gems to be installed before starting:
- Rails 2.3.8.*
- Rack 1.1.0.*
- >= Ruby libxml bindings 1.1.1
- Timecop
- OAuth
I got complaints if I didn't use certain versions (e.g., Rails 2.3.9 wouldn't work, it wanted 2.3.8 instead), and these may change depending on the version of the Rails port you end up with.
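If it helps, pinning the versions at install time looked roughly like this for me (the exact versions to use are whatever your checkout of the Rails port asks for):
gem install rails -v 2.3.8
gem install rack -v 1.1.0
gem install libxml-ruby        # needs to be >= 1.1.1
gem install timecop oauth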
Installation
As per the wiki page, just do
git clone git://git.openstreetmap.org/rails.git
Configuration
In the config/ directory, we have a couple things to do. You need a database.yml file, which doesn't exist yet. Copy postgres.example.database.yml (or mysql.example.database.yml). Likewise, you need to create an application.yml file from example.application.yml (you can probably leave this unmodified).
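Concretely, assuming you cloned into a directory called rails:
cd rails
cp config/postgres.example.database.yml config/database.yml
cp config/example.application.yml config/application.yml
# then edit config/database.yml so the database name, user, and password match
# what you created in Postgres earlier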
As per the README in the db/ directory, there's some code we need to load in to PostgreSQL.
cd db/functions/
make libpgosm.so
psql openstreetmap -U openstreetmap
> CREATE FUNCTION maptile_for_point(int8, int8, int4) RETURNS int4
    AS '/path/to/rails-port/db/functions/libpgosm.so', 'maptile_for_point'
    LANGUAGE C STRICT;
Note: I'm not sure whether this is necessary if you're only running PostgreSQL, and not PostGIS.
Hacks/Fixes
I had some issues running the Rails port. It always complained about an error in config/i18n.rb. I ended up deleting it (rather, renaming it to "i18n.rb.ignore" just in case I needed it) and everything worked fine.
It also didn't like my languages.yml file. I searched for an older version (add link later) from git and copied it in; that seemed to fix things.
Step 5: Initialize database
From the Rails port directory, run "rake db:migrate". This will only take a minute.
Vacuuming
Databases require periodic maintenance to ensure that the data is packed in the most efficient format, and that not too much free space is being wasted. They also periodically collect statistics on the tables which guides their decisions about how to deal with data.
It doesn't hurt to do this after you create the tables. Go into Postgres ("psql -h hostname database -U username") and run "VACUUM ANALYZE;". If all you have in Postgres is the OpenStreetMap database you just created, it will be done in under a minute.
Step 6: Recompress database
We're told that Osmosis is much slower at reading bz2 files than gz files. (Also, when I tried to read the bz2 file, it complained that it was malformed.) So, we recompress it. The GZip version will take up 15GB (plus the 12GB to keep the BZip2 version around).
bunzip2 -c planet-100915.osm.bz2 | gzip -9 > planet-100915.osm.gz
I didn't time this, but I think it took about an hour. At maximum compression it's CPU-bound on gzip, so it probably doesn't matter whether both files are on the same physical drive.
If you have 160GB available, you could just uncompress it and deal with the raw file. (If so, you probably want pbunzip2, which is multi-threaded and will probably be faster.) I don't have that much space. :-)
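If you do go that route, the parallel tool I'm familiar with is pbzip2; decompressing with it would look something like this (untested on my end):
pbzip2 -d -c planet-100915.osm.bz2 > planet-100915.osm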
Step 7: Excerpt that part of the database you want
This step is strictly optional, but... the whole planet will take up more than 300GB once it's loaded into the Postgres server. And unless you have a pretty beefy machine, it will take you several weeks to import.
Nah, you didn't want the whole planet anyway.
Osmosis has --bounding-box and --bounding-polygon options. You can go to http://downloads.cloudmade.com/ and get a .poly file for the continent/country/state you want, and use that to trim the dataset.
Or, you can just set a bounding box. You'll get a little extra, but who cares. I went with this option because the Cloudmade download site happened to be down when I was doing this. Just drop by Wikipedia and look up the latitude and longitude of the thing you're looking for.
gunzip -c planet-100915.osm.gz | osmosis --read-xml - --bounding-box left=-89 right=-80 bottom=25 top=30 clipIncompleteEntities=true --write-xml florida.osm.gz
Make sure to include the clipIncompleteEntities=true setting. By default this is false, and it will leave references in ways to nodes that weren't in the bounding box. Several hours into your data import, Postgres will fail on a foreign key constraint because of this, and you'll have to start over.
Sort and merge
If you're doing more than one bounding box/polygon, you probably want to combine them into a single file to load. This is easy, but merging requires that you sort the files first.
osmosis --read-xml florida.osm.gz --sort --write-xml florida-sorted.osm.gz
osmosis --read-xml florida-sorted.osm.gz --read-xml kentucky-sorted.osm.gz --merge --write-xml fl-ky.osm.gz
You can merge lots of files in one command... just read 2 files, merge (that leaves 1), read 1 more, merge, read 1 more, merge, etc.
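For example, a three-file merge (the third extract here is purely illustrative) would look like:
osmosis --read-xml florida-sorted.osm.gz \
        --read-xml kentucky-sorted.osm.gz --merge \
        --read-xml tennessee-sorted.osm.gz --merge \
        --write-xml fl-ky-tn.osm.gz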
Step 8: Load database into Postgres
Now we're ready to load the data. This is easy, but still really time-consuming.
osmosis --read-xml filename="planet-100915.osm.gz" --write-apidb host="localhost" database="openstreetmap" user="openstreetmap" password="secret" populateCurrentTables=yes validateSchemaVersion=no
Step 9: Prepare Osmosis for reading diffs
Create an empty directory somewhere; Osmosis will use this as a working directory. Download a state.txt file that's close to the same vintage as the planet.osm dump you got (it's better to be behind than ahead). Save this file as "state.txt" (remove the date from it).
Set up the directory by running
osmosis --rrii workingDirectory=./
This will create a couple other files. Edit the configuration.txt file to point to the mirror of your choice; you can point it at minutely or hourly diffs (but make sure the mirror you want actually has them; not all mirrors do). You can also adjust the maxInterval setting to your liking (I'll have more advice about this once I've made it to this point... >_>)
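For reference, a minimal configuration.txt along the lines of what I ended up with (the baseUrl should be whichever minute-replicate mirror you chose; maxInterval is in seconds):
baseUrl=http://planet.openstreetmap.org/minute-replicate
maxInterval=3600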