A Column-Oriented Data Explorer
Clive Page, University of Leicester, UK
Abstract
Astronomical data mining will mostly involve tabular datasets; data exploration and cleaning operations need to be carried out interactively.
Relational DBMS are very inefficient for many of these operations, especially those involving sequential scans of large tables. Performance can be improved by one or two orders of magnitude using techniques such as column-oriented storage and parallel execution, which makes all the difference between batch and interactive working. The features of a proposed system will be outlined, and a basic prototype demonstrated.
The Astronomical Data Deluge
It has been common for presentations on the Virtual Observatory to start with scary stories about the astronomical data deluge (alternatively avalanche, tidal wave, tsunami...). Perhaps "rising tide" is a more appropriate metaphor, since what we have is not a sudden increase but a continuing expansion of astronomical data, amounting to exponential growth in the long term. This expansion has many causes: the increased interest in conducting sky surveys such as those from WFCAM and VISTA, the use of larger-format CCD detectors (new instruments such as MegaCAM and OmegaCAM have hundreds of megapixels), the increased interest in high-time resolution phenomena leading to systems which take images of big chunks of the sky every minute or less (e.g. SuperWASP), and also the wealth of new observational facilities at all wavelengths from the radio to the gamma ray band.
The most spectacular figures apply to the
raw data from telescopes, but in due course we shall find that volumes of reduced data rise in much the same proportion. Telescopes in different bands produce raw data in very different forms: radio telescopes produce complex visibilities, infra-red and optical telescopes typically produce CCD frames, whereas in the X-ray and gamma-ray bands one gets a stream of measurements of individual photons. Specialised software is needed to handle these formats and apply the corrections for the characteristics of the diverse instruments, and there seems little scope for general-purpose software to help here.
In the case of reduced data, however, there is much more homegeneity, since tabular datasets predominate. One gets, for example, lists (or catalogues) of detected sources, spectra, and time-series (light-curves) which are all essentially tables of numbers. The largest tabular datasets tend to be the source catalogues, with several recent examples around the billion row mark. It is worth noting, however, that most of these large tables come from rather old technology: the scanning of photographic plates from sky surveys which were started back in the 1950s. The new generation of surveys using electronic detectors are only just starting to feed through into source catalogues.
Another important fact is that source catalogues are not just getting longer, they are getting
wider too, as astronomers record more parameters of each source. Here are some recent examples:
| Table | Number of Columns |
| USNO-B catalog | 50 |
| 2MASS point source catalog | 61 |
| 1XMM x-ray catalogue | 379 |
| SDSSS DR3 PhotObjAll? table | 446 |
The expansion is to some extent all part of the phenomenon generally known as Moore's Law, which observes the fact that computing speeds have been doubling every 18 months or so for the last two decades. Disc capacities have also been growing exponentially at much the same rate. Estimates of the growth in astronomical data vary, but my guess is that the data volume doubling time is similar, so that in two basic areas we can just about keep up: processing the data, and storing it on disc. Unfortunately disc bandwidth and seek times are improving much more slowly, perhaps only by 10% or 15% per year. This means that the I/O bottleneck is become more severe, year by year.
How to Process Tabular Data
The principal choices are between using relational database management systems (RDBMS), object-oriented DBMS, or something else.
Relational Databases
There are many products to choose from all different in detail, but they have many attractive features in common:
- RDBMS technology is mature and the products are generally robust and reliable.
- There are several open source products available, most notably Postgres and MySQL, which makes them attractive to astronomers with limited budgets for software.
- RDBMS can handle large datasets well beyond the 2 GB boundary of 32-bit file systems.
- They provide efficient indexing using B-trees so that any small chunk of data can be located in a few tens of milliseconds.
- They handle null (missing) values correctly using three-valued logic (yes,no,unknown), a valuable feature in astronomy where many tables contain null fields.
- They provide a powerful query language in the form of SQL.
But RDBMS are designed to support the needs of commerce and industry, and are not all that well-matched to the requirements of astronomical data archives:
* Facilities for handling scientific data are generally poor, e.g. it requires messy and non-standard syntax just to print out numbers with a fixed number of decimal digits, there are no facilities for handling values with errors attached, nor upper limits, and input and output of angles in sexagesimal notation are completely unsupported.
- Only a few RDBMS can handle spatial data efficiently using R-trees (or similar) and some of these do so only with expensive add-ons.
- It is normally impossible to attach adequate metadata to columns or tables, not even something as basic as physical units.
- SQL is a goal-oriented language poorly standardised between products. It is hard to use it to carry out iterative data exploration. Astronomers usually find it hard to use SQL except in the most basic way.
Object-oriented databases
An object-oriented database can handle arbitrary data structures, there is no need to force everything to fit into a table. The ability to navigate through the structure using links instead having to do this with by joining whole tables also makes them potentially much more efficient. A few astronomy groups have therefore chosen to use an OO-DBMS to handle scientific data. These include the Sloan Digital Sky Survey project which used Objectivity/DB, and the XMM-Newton Survey Science Centre where O2 was installed at Leicester, MSSL and Strasbourg. In both projects, however, the software products turned out to be harder to use, less efficient, and less well supported than expected, and both are switching back to purely relational technology.
Statistical and Visualisation Packages
There are many other types of software which handle tabular datasets and provide useful functionality. These include statistical packages (SAS, SPSS, MINITAB, BDMP, MATLAB, S-PLUS, etc.) and those providing data visualisation (IDL, PVWave, AVS, etc.). In nearly all cases, however, these software packages are designed to handle datasets which are small enough that they will fit entirely in memory. Although hardware advances mean that memory sizes are continually growing, more or less in line with the growth in astronomical datasets, there remains a substantial gap. Many astronomical datasets will indeed fit into memory easily, but not all of them. The algorithms suitable for disc-resident data are very different, and for the most part these packages are not sufficiently scalable.
Other Software
My guess is that about nine out of ten astronomical data archives use an RDBMS to manage their data. So what do the other ten percent use? Usually a package written by astronomers themselves.
The BROWSE software was originally written at ESOC in the late 1970s to support data from the ESA's EXOSAT spacecraft. It was adopted at a number of other sites (including LEDAS at Leicester) and software support was provided for some years by HEASARC at NASA/GSFC. It is basically a flat-file data manager but supports the ability to select subsets and then carry out further operations on these filtered datasets - just what is needed for data exploration. It also allows celestial catalogues to be joined on celestial position, and has very basic graphical display facilities built in. But the package was written originally in Fortran66, and I believe that maintenance has now ceased.
The WCStools package comes from the Smithsonian Astrophysical Observatory and includes several tools for handling tabular data, including specific support for data in the forms in which large source catalogues are distributed. The functionality is basic in database terms, but includes cone-search and cross-match functions which are important to astronomers.
It is perhaps also worth noting that the SIMBAD database at Centre de Donées de Strasbourg was also written in-house. This holds the astronomical world's premier collection of bibliographical information. Home-grown software should not be ignored, therefore, as way of providing astronomers with the functionality that they require.
Data Exploration Requirements
In a short article it is hardly possible to list all the operations that astronomers might want to carry out on tabular data. Indeed the list needs to be open-ended. But it is not too difficult to list the main operations that astronomers are likely to need. These include
- Basic database operations: select subsets, compute new columns, sort, group, compute equi-joins with other tables. RDBMS can do these quite well.
- Statistical operations: find means, medians and other quantiles, find outliers, compute regressions. The simplest operations are easy to do with an RDBMS, the others take a bit more effort.
- Cross-matching: the spatial join is easiest if a spatial index such as an R-tree is provided, but can be done alternatively using a pixel-mapping code such as HTM or HEALPix.
- Graphics and visualisation: plotting histograms, scatter-plots and other line graphs; overlaying sources on images, producing density maps, multi-dimensional plots, and other visualisation techniques. These cannot be done at all by DBMS, but require data to be exported, normally losing all context and metadata.
- Advanced mining algorithms, such as clustering and classification. These are well beyond what any DBMS can handle, and require data export also.
Main types of query
As far as the operations go which can be handled easily by a relational DBMS, it is worth noting that for large datasets they fall into two main groups according to how fast they run.
- Queries which use an index to locate their data usually run in a matter of milliseconds, or at most a few seconds, so can be used in a fully interactive manner.
- Queries which require the scanning of an entire table or a substantial part of one have a running time proportional to the length of the table. For tables of a billion rows on current hardware this means upwards of half an hour.
There are, of course, more complex queries which involve both types of operation, for example joins. These often require at least the smaller table to be scanned sequentially, with indexed lookups on the larger one (but there are many join algorithms, not all of them work like this).
Sequential Scan Queries
Here is a short list of queries which in most cases will involve a sequential scan:
Accelerating the Sequential Scan
Since these operations are so slow, it should be a priority to speed them up. There are three fairly obvious ways of doing this:
Using Data Compression
It is fairly easy to compress even a binary dataset by a factor of two, somewhat higher factors are possible with specially-tailored algorithms. This reduces the I/O load, but it is hard to compress data much in a database without affecting indexed access. MySQL has a utility called
mysiampack which converts any default (ISAM) table into a compressed version. I tried this on a sample of data from 2MASS. It did indeed approximately halve the size of the disc file, but I found that query speeds were hardly improved: I think the cpu overheads of doing the decompression more or less balanced the reduction in I/O.
Using Parallel I/O
Many types database operations of interest could be carried out perfectly easily on a system in which the relevant table was divided between a number discs each with their own channel to the processor, or on a cluster of machines each with its own disc. This seems a good way of avoiding the I/O bottleneck.
Parallelism in databases is in fact a hot topic of research. The main problem is the difficulty of supporting transactions, as these need reliable, synchronised, and consistent updates of all the discs involved. Solutions are available from the main commercial vendors such as Oracle and IBM, but usually require special hardware configurations, as well as expensive software. For data mining, astronomers mostly only need read access to essentially static datasets, which should be very much simpler, but no off-the-shelf solutions seem to be available. I have experimented with the DBLINK package (an add-on to Postgres) which allows querying of two or more servers at the same time. In principle this could be used to spread out a query over a number of independent database instances, but some work is needed to work out the best way of segmenting the tables, how to support indexing over this collection, and how to combine the results. This might be an interesting development project for someone.
Column-oriented Storage
Just about all RDBMS store their data with all the fields of a row stored in the same disc block. Updates and deletions of whole rows are obviously much more efficient when the data are co-located on disc. This structure is less efficient when a query involves scanning a whole table but the extraction of data from just a few columns, as the whole table has to be read from disc each time. In commercial practice, however, the queries are mostly pre-defined so a suitable set of indices can be set up to avoid sequential scans, and most tables have relatively few columns, so even scanning queries are not too inefficient. In astronomical archives, however, the number of columns is often much larger (up to hundreds as noted above) and the range of queries used by astronomers is very wide, so that it is not feasible to ensure that all of them can be handled via an index.
A fairly obvious solution is to store all the values in each column adjacently. Queries involving just one or two columns out of many than only need to access a small fraction of the disc blocks occupied by the whole table, and might then be faster by a factor of ten or more. Updates and deletions are much more expensive, as many different disc blocks are involved, but if these are rare or absent, then this is no drawback. The idea of using column-based storage is not a recent one. The ESO/MIDAS table system, dating back to the 1980s at least, uses column-based storage. The STSDAS Table System invented at STScI (and an optional component of IRAF) allows the user to choose row or column-based storage whenever a table is created.
Column-based storage is also used by a number of specialised DBMS used for example in the financial services or telecoms industries, but there seem to be very few general-purpose packages on the market. The only product we have discovered which seems at all suitable is Sybase-IQ. This is a completely separate product from the well-known RDBMS called Sybase-ASE, but produced by the same company, and the query languages are very similar. This has been under evaluation by the Cambridge
IoA for some time, and appears to have a performance in line with theoretical expectations. The drawbacks to Sybase-IQ are the licencing cost, which is very high (compared to the funds usually available to astronomers), the lack of a version for Linux at present, and the absence of any spatial indexing functionality.
The Story so far
To summarise what has been established so far. Tabular datasets are very important in astronomical data analysis, especially in data exploration and mining. Tables are getting not only longer but wider, with hundreds of columns not unusual. Many, indeed most, queries will only use a handful of columns out of all these. Although an index should be used whenever possible, there are many types of operations which require a sequential scan of all or part of a large table, and it is these types of operation which take up almost all of the elapsed time. As processors get faster and tables larger, the I/O bottleneck will loom larger. Column-based storage would greatly reduce the I/O load of many queries, but only a few products have implemented this.
It seemed to me time to experiment with column-based storage in a form which could be of immediate use to astronomers. Being a member of the IAU's FITS Working Group, I thought I ought to start by using the FITS format, if it were at all feasible.
Column-based Storage Experiments
FITS Tables
The FITS Standard includes two formats for tabular data, but both store data by rows. A bit of lateral thinking was required. I managed to devise two ways in which a FITS file could be forced to implement column-based storage. One way is to make use of the fact that the columns in a
binary table can be vectors: if one defines a table with just one row, but with each column a vector of length N, where N is the number of rows in the equivalent normal table, then one has achieved column-based storage. The other is to make use of the ability of a FITS file to contain multiple header-data units (HDU): one can simple have each column as a separate table, each in its own HDU. I was pleased to find that the well-known cFITSIO library from HEASARC worked well on both of these structures, and some simple experiments showed gains in elapsed time in line with expectations. I ingested a sample of the 2MASS catalogue with some 61 columns and found that queries using only two columns were nearly 30 times faster. But this solution, while very promising, had certain disadvantages. The FITS structure is fairly complex and on little-endian hardware the big-endian storage requires numerical values above byte-size to be byte-swapped. I found that read operations occupied nearly 100% of a fast cpu. Another limitation is that the cFITSIO library still uses some counters and pointers which are only 32 bits long, which imposes rather serious limits on file sizes. Another serious drawback was that my column-based tables, while conforming to the letter of the FITS Standard, were very unusual, and practically illegible to all other FITS-reading software. This more or less negated the value of using an existing astronomical data format.
HDF5
The Hierarchical Data Format has been used in data rich sciences for many years, although it is much less popular than FITS in most branches of astronomy. The latest version, HDF5, is supported by NSCA and a number of other well-known US laboratories. Its other advantages include
- Flexible structure, supporting arrays and tables.
- Designed for efficient access to files well over 2 GB in size.
- Compatible with Globus and GridFTP.
- Has a simple API, with bindings for C, Fortran90, Java, Python.
- An unlimited amount of metadata can be attached to any HDF5 object.
Noting all this, I decided to build an HDF5-based prototype, and managed to do this in a few days. I needed a query parser and executor and used one which I wrote over a decade earlier for a Starlink package called CURSA.
I ingested a sample of 55 million rows of 2MASS (about 10% of the total) and ran some trials, comparing it to MySQL which is generally thought to be the fastest relational DBMS for simple queries.
Here are the timings for the same operations using hydra (2.2 GHz Xeon):
mysql> SELECT COUNT(phi_opt),MIN(phi_opt),MAX(phi_opt),AVG(phi_opt) FROM twomass;
+----------------+--------------+--------------+--------------+
| COUNT(phi_opt) | MIN(phi_opt) | MAX(phi_opt) | AVG(phi_opt) |
+----------------+--------------+--------------+--------------+
| 33971487 | 0 | 360 | 195.9702 |
+----------------+--------------+--------------+--------------+
1 row in set (4 min 20.11 sec)
hydra:~/hdf> hstats phi_opt
--Column-- points minimum maximum average
phi_opt 33971487 0.0000 360.00 195.97
8.37 seconds
This shows a gain of a factor of 32 over the RDBMS, which is not unreasonable when we are using only one column out of 61. In data exploration queries based on a similar set of columns will often be run repeatedly. When the same MySQL query is run again it takes the same time as before, but when the HDF5-based query is run again, it is now over 70 times faster:
hydra:~/hdf> hstats phi_opt
--Column-- points minimum maximum average
phi_opt 33971487 0.0000 360.00 195.97
3.51 seconds
This is because the whole column has been cached in memory by the operating system (Linux). The remaining time is largely due to the overheads of the CURSA-vintage expression parser; a modern replacement could be substantially faster.
How to build a Data Explorer using HDF5
It might be thought that building a new database management system would be a serious undertaking, and that is probably true. What I am proposing here is something much less difficult, just a system for reading data, not for updating it, handling transactions, and so forth. A very large number of simplifications can be made when the database has only read-access. I think that it would not be very difficult to build a data exploration infrastructure based on HDF5 files. Many of the required software elements are readily available. The main components that are needed are these:
- HDF5 library for data access: already exists and known to be robust, reliable, and efficient.
- Query parser and evaluator: could be a refactoring of the code from CURSA, or something based on the JEL (Java Execution Library), or something else.
- Graphics package: there are so many of them available that the main problem would be to sift through them to choose a good one.
- Statistics routines: again a huge range of source code is available, so that the main problem is selecting what to use.
- Indexing: B-tree code is widely available.
- Spatial indexing: there is at least one R-tree code that can be borrowed, or one could be written from the published algorithm. Alternatively a pixel-code method based on HTM or HEALPix code could be used.
- Integration with the Virtual Observatory: I assume that it would be easy to use the code created by AstroGrid and other VO projects to provide access control (authentication, authorization), virtual storage (MySpace etc.) and other services.
- User interface: this is perhaps the most difficult area, as it is desirable to give the user facilities to iteratively explore data, to back-track, and to save commands for future use, e.g. in a script. The user also needs facilities to browse the metadata attached to each table and column.
The provision of an HDF5-based data explorer brings with it a number of incidental advantages:
- The user interface can be made astronomer-friendly, e.g. so that it can accept and display angles in various ways, including sexagesimal formats.
- It should be much easier to support a distributed cross-match (spatial join) capability, as only a few columns of information (typically RA, dec, position-error) need to be transferred from one site to another; with column-based storage this is much easier to implement.
- Users will always want to have algorithms available, often ones they have written themselves, which are not provided in the base distribution: with a conventional DBMS it is necessary to export the data, here it would be feasible for the user to integrate their own code with the HDF5 files, since the API is reasonable simple and well documented.
- It should not be difficult to allow read-only access to tables stored in other formats, such as FITS tables, VOTable, or CSV, or indeed tables stored in conventional RDBMS using a link based on JDBC. Such access would not be as fast as to data stored in the native HDF5 format, but this would give the Data Explorer access to a much wider range of data sources, and would also be a good way of ingesting data from these formats into internal HDF5 format.
Conclusions
I think it would be feasible to build a Data Explorer with column-based storage using HDF5 files. This would give the astronomer integrated access to database, statistical and graphical functions, and allow more advanced data mining algorithms to be added easily. By adopting column-based storage the many operations involving sequential scans would be speeded up by one or two orders of magnitude, bringing what are normally batch operations into the realm of interactive ones, and bringing a huge reward in terms of usability.
This Data Explorer would have the basic characteristics of a database manager, but would essentially only provide read access to the original datasets. It is not intended to take over the role of the RDBMS in storing archive data, but to supplement it when interactive data exploring and data mining facilities are required. This means that these datasets would be stored in two different forms: this is an added cost, but disc space is now so cheap that it is probably not a serious obstacle to implementation.