Osmosis/Tuning


This page contains information on the options available for measuring and improving performance in Osmosis.

JVM Options

A number of JVM options can improve the performance of Osmosis processing. They are passed through environment variables, which you may wish to save in config files; see Osmosis/Installation#Environment variables and config files.

Server JVM

The default java command invokes a "client" version of Java. This is optimised for fast startup and is not ideally suited to long-running Osmosis tasks. The "server" version can provide measurable performance improvements for such tasks. To enable it, edit the ~/.osmosis file (creating it if it does not exist) and specify the server option as a JVM argument:

JAVACMD_OPTIONS=-server

Memory

Most Osmosis tasks use little memory because most processing is done on streams. However, some cases require additional memory, such as area extraction tasks (e.g. --bounding-box and --bounding-polygon) or the nodeLocationStoreType=InMemory argument to the --write-pgsql task.

To change the amount of memory allocated to Osmosis, edit the ~/.osmosis file and set the JAVACMD_OPTIONS parameter with the memory arguments. For example, to allocate 2GB of RAM to the Osmosis process, use the following entry:

JAVACMD_OPTIONS=-Xmx2G

(This may be a good quick fix to try if you're getting "java.lang.OutOfMemoryError: Java heap space".)

To change where Java stores temporary files, add the following JVM argument. This is vital when using mapfile-writer with -hd mode enabled while the standard temp disk is low on space.

-Djava.io.tmpdir=<path_to_dir>
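The options above can be combined on a single line. The following is a minimal sketch, assuming the launcher reads ~/.osmosis as shell variable assignments (in which case a second JAVACMD_OPTIONS line would replace the first rather than add to it). OSMOSIS_CONF and TMP_DIR are hypothetical names used only for this illustration, and the script writes to a scratch file by default so that it is safe to run.

```shell
# Sketch only: build a combined JAVACMD_OPTIONS line.
# OSMOSIS_CONF and TMP_DIR are illustrative names, not Osmosis settings.
# Set OSMOSIS_CONF="$HOME/.osmosis" once you are happy with the values.
OSMOSIS_CONF="${OSMOSIS_CONF:-$(mktemp)}"
TMP_DIR="${TMP_DIR:-/tmp}"   # a directory with enough free space

# The value contains spaces, so it is wrapped in double quotes.
printf 'JAVACMD_OPTIONS="-server -Xmx2G -Djava.io.tmpdir=%s"\n' "$TMP_DIR" > "$OSMOSIS_CONF"

# Source the file as a shell script would, then show the result.
. "$OSMOSIS_CONF"
echo "$JAVACMD_OPTIONS"
```

Adjust the -Xmx value to the memory actually available on the machine.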



Threading

Osmosis is a multi-threaded application, allowing processing to be distributed across multiple cores in certain cases. Osmosis has two types of tasks: active and passive. Active tasks run within their own thread, whereas passive tasks simply run within the thread of the input task passing them data. If run with the -v option, Osmosis will provide debugging information during pipeline initialisation that specifies whether each task is active or passive.

For example, the following command line:

osmosis -v --read-xml input.osm --write-xml output.osm

will produce, among others, the following lines of console output:

INFO: Launching pipeline execution.
FINE: Launching task 1-read-xml in a new thread.
FINE: Task 2-write-xml is passive, no execution required.

It is possible to make otherwise passive tasks run in their own thread by adding a --buffer task before them. A --buffer task runs in its own thread and therefore processes subsequent passive tasks in the context of a new thread.

For example, the following command line:

osmosis -v --read-xml input.osm --buffer --write-xml output.osm

will split the processing across two threads as indicated by the following lines of console output:

INFO: Launching pipeline execution.
FINE: Launching task 1-read-xml in a new thread.
FINE: Launching task 2-buffer in a new thread.
FINE: Task 3-write-xml is passive, no execution required. 

In the above case, the --read-xml task will run within its own thread, then the subsequent --buffer and --write-xml tasks will run within a separate thread. In many cases the reading task will consume a large amount of CPU, so dedicating a thread to it can provide noticeable performance improvements on a multi-CPU machine.
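When pipelines grow longer, it can help to reduce the -v output to a list of active and passive tasks. The following is a small sketch that matches the two message formats shown above. The function name classify_tasks is made up for this example, and the commented real-use line assumes that Osmosis writes its log to standard error.

```shell
# Summarise which tasks are active (own thread) or passive.
# Real use (assumption: log output goes to stderr):
#   osmosis -v --read-xml input.osm --buffer --write-xml output.osm 2>&1 | classify_tasks
classify_tasks() {
  sed -n \
    -e 's/.*Launching task \(.*\) in a new thread\..*/active  \1/p' \
    -e 's/.*Task \(.*\) is passive, no execution required\..*/passive \1/p'
}

# Demonstration using the console output quoted in this section.
sample_log='INFO: Launching pipeline execution.
FINE: Launching task 1-read-xml in a new thread.
FINE: Launching task 2-buffer in a new thread.
FINE: Task 3-write-xml is passive, no execution required.'

printf '%s\n' "$sample_log" | classify_tasks
```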

Note that each additional thread incurs additional overhead and won't always provide the gains that might be expected. On a single-CPU machine extra threads will typically slow things down, and even on a multi-CPU machine they may not improve performance because of the overhead of exchanging data between threads. As always, some trial and error is required to determine the best combination of additional threads and pipeline layout.
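One simple way to carry out that trial and error is to time the same job with and without a --buffer task. The helper below is a rough sketch: time_run is a hypothetical name, it measures wall-clock time in whole seconds only, and the comparison runs only if an osmosis command and an input.osm file are present.

```shell
# Time a command and report elapsed whole seconds under a label.
# time_run is an illustrative helper, not part of Osmosis.
time_run() {
  label=$1
  shift
  start=$(date +%s)
  "$@" >/dev/null 2>&1
  status=$?
  end=$(date +%s)
  echo "$label: $((end - start))s (exit $status)"
}

# Compare the two pipelines from this section, if the tools are available.
if command -v osmosis >/dev/null 2>&1 && [ -f input.osm ]; then
  time_run "without buffer" osmosis --read-xml input.osm --write-xml output.osm
  time_run "with buffer" osmosis --read-xml input.osm --buffer --write-xml output.osm
fi
```

Repeat each run a few times on a realistically sized input, since a single short run says little about a long-running job.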

Measuring CPU Usage

In a large number of cases, CPU will be the bottleneck for Osmosis. Some database tasks will instead incur heavy IO overhead, but in that case the tuning is required in the database rather than in Osmosis, so it is not discussed here. Some tasks using temporary files may also incur heavy IO overhead; these are typically not suitable for large-scale file processing and should be avoided where possible, because little can be done to speed them up.

Where CPU is the bottleneck, it can be difficult to determine which parts of the pipeline are CPU-bound due to the potentially large number of threads being utilised.

A great tool for measuring per-thread CPU usage is the Top Threads plugin for JConsole. Download the jar file provided on the Top Threads page, then run JConsole with the following command:

jconsole -pluginpath /path/to/topthreads.jar

From there, select a running Osmosis instance (Osmosis must be launched first) and connect to it. JConsole will provide a number of JVM statistics, and the "Top threads" tab will provide details on which threads are consuming the most CPU. The threads are labelled according to the task that instantiated them. For example, a thread named "Thread-2-buffer" was created by the second task on the command line, which was a --buffer task.
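JConsole can also be pointed directly at a process ID on the command line. The sketch below picks a PID out of jps-style output. find_jvm_pid is a hypothetical helper, the sample lines are invented for the demonstration, and matching on the word "osmosis" is an assumption: check what jps prints for your installation before relying on it.

```shell
# Print the PID (first field) of the first line containing a pattern,
# ignoring case. Intended for the output of the JDK's jps tool.
find_jvm_pid() {
  awk -v pat="$1" 'index(tolower($0), tolower(pat)) { print $1; exit }'
}

# Real use (assumes a JDK with jps and jconsole on the PATH):
#   pid=$(jps -lv | find_jvm_pid osmosis)
#   jconsole -pluginpath /path/to/topthreads.jar "$pid"

# Demonstration with invented jps-style lines.
sample_jps='4242 sun.tools.jps.Jps -lv
31337 org.example.Launcher -Dapp.home=/opt/osmosis -Xmx2G'

printf '%s\n' "$sample_jps" | find_jvm_pid osmosis
```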

Once the problem threads have been identified, it may be possible to insert additional --buffer tasks to reduce the number of tasks utilising that thread. If large numbers of --buffer tasks have already been used, it may be possible to remove some that are using small amounts of CPU in order to reduce the thread synchronisation overhead.