Basic project structure is set up.
I have been looking at the OSM data further and thinking about how to handle such a large amount of data within the WPS. The main problem is that the planet.osm data is fairly large, and will therefore take time to download and then process. This breaks down into two main subissues:
- how to download/store the data such that WPS has access to it immediately
- how to modify the datastore quickly such that the WPS provides the most up-to-date information from OSM
The latter issue is easy to solve, as planet.osm provides incremental daily or weekly updates of roughly 300MB in size. This small amount of data (compared to the 21GB compressed, 303GB uncompressed full planet.osm data) should be easy and fast to process given average server resources/speed (benchmarks for this still need to be provided). Integrating these updates into the datastore is the potential problem.
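To get a feel for what applying such a diff would involve, here is a minimal sketch (class and method names are my own, not from WPS or osmosis) that streams an osmChange (.osc) document with the standard StAX API and tallies creates/modifies/deletes. Since it never builds the full document in memory, this style of processing should scale to the ~300MB daily diffs:

```java
import java.io.StringReader;
import java.util.EnumMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Streams an osmChange (.osc) document and tallies create/modify/delete
 *  operations without ever building a DOM, so memory use stays constant. */
public class OscChangeCounter {

    public enum Op { CREATE, MODIFY, DELETE }

    public static Map<Op, Integer> count(String oscXml) throws XMLStreamException {
        Map<Op, Integer> tally = new EnumMap<>(Op.class);
        for (Op op : Op.values()) tally.put(op, 0);

        XMLStreamReader r = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(oscXml));
        Op current = null; // which <create>/<modify>/<delete> block we are inside
        while (r.hasNext()) {
            int event = r.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                switch (r.getLocalName()) {
                    case "create": current = Op.CREATE; break;
                    case "modify": current = Op.MODIFY; break;
                    case "delete": current = Op.DELETE; break;
                    case "node": case "way": case "relation":
                        if (current != null) tally.merge(current, 1, Integer::sum);
                        break;
                    default: break; // <osmChange>, <tag>, <nd>, <member>, ...
                }
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                String name = r.getLocalName();
                if (name.equals("create") || name.equals("modify") || name.equals("delete")) {
                    current = null;
                }
            }
        }
        return tally;
    }

    public static void main(String[] args) throws Exception {
        String sample =
            "<osmChange version=\"0.6\">" +
            "<create><node id=\"1\" lat=\"52.0\" lon=\"7.6\"/></create>" +
            "<modify><way id=\"2\"><nd ref=\"1\"/></way></modify>" +
            "<delete><node id=\"3\" lat=\"0\" lon=\"0\"/></delete>" +
            "</osmChange>";
        Map<Op, Integer> tally = count(sample);
        System.out.println(tally.get(Op.CREATE) + " " + tally.get(Op.MODIFY) + " " + tally.get(Op.DELETE));
    }
}
```

A real integrator would of course apply each change to the datastore instead of counting it, but the dispatch structure would be the same.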
The former issue is more difficult, since it is effectively the initialization step. In terms of storage, there are several options:
- Use a database-backed datastore, e.g. a PostgreSQL database with the PostGIS extensions installed. There is osm2pgsql for imports, but its conversion is lossy; there is also osmosis.
- Use an OSM-native backend, e.g. the Overpass API or other APIs.
- Roll our own datastore/backend within the WPS. This is easily feasible for small OSM data, but much less so for large OSM data. It would probably be built on GeoTools, as this is what the WPS primarily uses for conversion to different formats. This might be slow, however.
If we go with the last option, it would probably involve writing our own class extending GeoTools' AbstractFileDataStore, or implementing the WPS IData interface. The specifics depend on what sort of defaults we'd want: store general features first, then build some sort of hierarchy from general to specific for particular regions? That depends on how much data is available for a given region, and would require a significant preprocessing stage. It is currently unknown whether the planet.osm data is sorted.
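One way the "general to specific" hierarchy could work is to bucket each feature by a slippy-map-style tile key at increasing zoom levels, so coarse queries hit low-zoom buckets and regional queries hit high-zoom ones. A pure-Java sketch of that idea (class and method names are hypothetical, not part of WPS or GeoTools; the tile math is the standard Web-Mercator scheme):

```java
/** Hypothetical sketch of hierarchical bucketing: map a coordinate to a
 *  slippy-map tile index at a given zoom level. Sorting features by this key
 *  during preprocessing would cluster nearby features together on disk. */
public class TileKey {

    /** Tile x index for a longitude at a zoom level (lon = 180 would yield
     *  2^zoom, one past the last tile; real code should clamp). */
    public static int xTile(double lon, int zoom) {
        return (int) Math.floor((lon + 180.0) / 360.0 * (1 << zoom));
    }

    /** Tile y index for a latitude at a zoom level (Web-Mercator formula). */
    public static int yTile(double lat, int zoom) {
        double latRad = Math.toRadians(lat);
        double y = (1.0 - Math.log(Math.tan(latRad) + 1.0 / Math.cos(latRad)) / Math.PI) / 2.0;
        return (int) Math.floor(y * (1 << zoom));
    }

    /** A single sortable key packing zoom, then y, then x into one long. */
    public static long key(double lat, double lon, int zoom) {
        return ((long) zoom << 56) | ((long) yTile(lat, zoom) << 28) | xTile(lon, zoom);
    }

    public static void main(String[] args) {
        // Example: the tile containing Muenster, Germany, at zoom 10.
        System.out.println(xTile(7.63, 10) + "/" + yTile(51.96, 10));
    }
}
```

The appeal of such a key is that the hierarchy comes for free: the parent of a tile at zoom z is simply its indices halved at zoom z-1, so "general" and "specific" views are just different zoom levels over the same sorted data.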
It also turns out that the osmosis tool is written in Java and comes with some useful benchmarks, so it is worth looking at the osmosis classes on GitHub to see how fast processing of OSM data is done. Despite the large amount of data, it appears possible to process it rather quickly without a significant memory footprint.
It looks more and more as though this will have to become a fork of trunk at some point. I can think of one way to organize the different components of osmtransform in the project:
- the preprocessing part of osmtransform would be part of the IO module, namely in the parsers package. Depending on user needs, it may be worth setting up a separate backend/process to handle this. Basically:
- Two possible parsers: one for file-based OSM data, one for API-based OSM data.
- The file-based OSM data will need some preprocessing, depending on its size: it would be converted into some kind of intermediary format, with preprocessing to make requests faster. Some hierarchical sorting (i.e. by blocks, or by region) will probably be useful.
- The API-based OSM data will probably need minimal preprocessing, and would instead serve more as a passthrough interface to the OSM process.
- Two possible processes/AlgorithmRepositories: one for file-based OSM data, one for API-based OSM data
- The file-based OSM data will require an AlgorithmRepository that can handle the internal converted format. Which capabilities it supports is still an open question. There will be options for which mirror to pull the OSM data from, and how often it updates.
- The API-based OSM data will require an AlgorithmRepository that handles the OSM data directly as queried from the OSM provider. There will be options for the source of the API-based OSM data, whether it has write access, and which API format it uses.
- the ui part of osmtransform would likely be a separate web page that performs some basic operations on OSM data, which are forwarded to the WPS itself.
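The file-based vs. API-based split described above could be sketched roughly as follows. All names here are purely illustrative (the real interfaces would come from the WPS AlgorithmRepository machinery); the point is only that both parsers expose the same query surface, with the file-backed one answering from a preprocessed local store and the API-backed one forwarding the request:

```java
import java.util.List;

/** Illustrative sketch (hypothetical names, not actual WPS interfaces) of the
 *  two-parser design: a common query surface with two backends. */
public class OsmSourceSketch {

    /** What both parsers would hand up to the AlgorithmRepository layer. */
    interface OsmSource {
        List<String> featuresIn(double minLon, double minLat, double maxLon, double maxLat);
    }

    /** File-based: assumes the (as yet unspecified) intermediary format has
     *  already been built by the preprocessing stage; faked here with a list. */
    static class FileOsmSource implements OsmSource {
        private final List<String> preprocessedStore;
        FileOsmSource(List<String> store) { this.preprocessedStore = store; }
        @Override public List<String> featuresIn(double a, double b, double c, double d) {
            return preprocessedStore; // real code would consult a spatial index
        }
    }

    /** API-based: a passthrough that would issue an Overpass/OSM3S query;
     *  here it only formats the bounding-box query it would send. */
    static class ApiOsmSource implements OsmSource {
        @Override public List<String> featuresIn(double minLon, double minLat,
                                                 double maxLon, double maxLat) {
            String bbox = minLat + "," + minLon + "," + maxLat + "," + maxLon;
            return List.of("passthrough-query:bbox=" + bbox);
        }
    }

    public static void main(String[] args) {
        OsmSource file = new FileOsmSource(List.of("node/1", "way/2"));
        OsmSource api = new ApiOsmSource();
        System.out.println(file.featuresIn(7, 51, 8, 52));
        System.out.println(api.featuresIn(7, 51, 8, 52));
    }
}
```

Keeping the two behind one interface would let the AlgorithmRepositories stay agnostic about where the data actually lives.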
Diagram of workflow would look like this:
Some details of the ui may not be as clear-cut as they initially seem. As mentioned before, there was discussion about whether the UI component would be Java Swing-based or web-based. In the former case, it would act like one of the existing clients that interact with the WPS; in the latter case, it would probably be an additional web page served by the WPS. Assuming we go with the latter option, we could quite possibly have certain kinds of requests served over a web interface, e.g. by integrating OpenLayers into the WPS. This would facilitate easy testing, and allow uploads of smaller OSM data for conversion.
TODO FOR THE WEEK
Start writing the Parser classes for planet.osm data and OSM3S data. Look at the osmosis code for hints about possible speedups. Document all the facilities that OSM3S provides, and start composing the AlgorithmRepository subclass that exposes the capabilities of OSM3S servers.