SSA Ranking prototype installation and testing notes
Purpose
This package consists of an implementation of a scoring algorithm.
The IVOA Simple Spectral Access Protocol introduces the concept of a scoring mechanism to rank matching records by relevance. The intention of this little project was to build a simple prototype of a ranking algorithm.
Overview
The concept is to place a scoring service between the SSA client and server; this service analyses the query, sends it to the SSA server, and upon reception of the result, adds the score to it before delivery to the client.
Such a service needs the following input:
- Query parameters. In order to adjust the score to the wishes of the user, the query parameters must be analysed.
- Result VOTable. The VOTable returned by the SSA server, unaware of the scoring service.
- Scoring configuration. The scoring service must be configured by an archive expert so that the algorithm is adapted to each dataset.
- Precomputed statistics. These might be used to enhance performance.
Design
While the algorithm was designed with SSA in mind, one of the design goals was to make it applicable to other VO services. Rather than binding it to SSA, this prototype is divided into three modules, minimising dependencies on the server. Also, a SkyNode was used as the server. This option makes it more general than using an SSA server (assuming that SSA and other data access protocols can be implemented using ADQL).
The prototype's modules are:
- scoring core. This module receives a VOTable as input, calculates a score for each row, depending on the configuration defined by an archive expert, and produces a "scored" VOTable as output (basically, a "score" column is added containing the score itself).
- skynode proxy. This module behaves like a proxy to a SkyNode, submitting queries to the real SkyNode, feeding the core with the returned VOTable, and returning scored VOTables.
- scoring webui. A Java webapp that provides a simple user interface to the skynode proxy. This interface accepts ADQL/s queries, submits the corresponding ADQL/x and displays the returned scored VOTable.
Flow of data:
- the user submits a query
- the query is forwarded to the SkyNode
- the SkyNode retrieves the data from storage
- the SkyNode returns the queried data as VOTable
- the VOTable is forwarded to the core module, where the score is computed and added to the VOTable
- the scored VOTable is returned
- the scored VOTable is forwarded to the user
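The flow above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `query_skynode` and `score_row` are hypothetical stand-ins for the skynode proxy and the scoring core, and rows are plain dictionaries rather than a real VOTable.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def handle_query(
    adql: str,
    query_skynode: Callable[[str], List[Row]],
    score_row: Callable[[Row, str], float],
) -> List[Row]:
    """Forward the query, then add a "score" entry to every returned row."""
    rows = query_skynode(adql)  # the SkyNode returns the queried data
    scored = []
    for row in rows:
        scored_row = dict(row)  # copy, so the SkyNode result is left untouched
        scored_row["score"] = score_row(row, adql)
        scored.append(scored_row)
    return scored  # the scored table is what goes back to the user


def fake_skynode(adql: str) -> List[Row]:
    # Hypothetical stand-in for a real SkyNode: always returns two fixed rows.
    return [{"recno": 1, "ExpTime": 100.0}, {"recno": 2, "ExpTime": 900.0}]


def fake_score(row: Row, adql: str) -> float:
    # Hypothetical scoring: longer exposures score higher.
    return float(row["ExpTime"]) / 1000.0


result = handle_query("SELECT * FROM t", fake_skynode, fake_score)
for row in result:
    print(row["recno"], row["score"])
```

Passing the data source and the scoring function as arguments mirrors the separation between proxy and core: either side can be swapped without touching the other.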
The "skynode proxy" both enables the use of a SkyNode as a data source (other systems could be used as a data source for the scoring, as long as a proxy wrapper is developed) and provides a rich set of information on what kind of data the user is looking for. The "scoring core" isn't dependent on the SkyNode protocol: it simply consumes and produces VOTables.
Setup
Publish data through a SkyNode interface
If you have an existing SkyNode available, you can jump to the next step.
This example setup uses:
NVO's Java FullSkyNode (from the NVO Summer School 2005), MySQL 4.1.14 and observation logs from CDS
Obtain and ingest observations logs
- You can use the sample catalog provided (B-fuse-fuse). It was downloaded from CDS's VizieR. The following steps were taken to obtain it:
- go to http://vizier.u-strasbg.fr/viz-bin/VizieR
- select "Obs_Log" as "controlled Astronomical keyword", and "Find Catalogues"
- select catalog B/fuse/fuse
- get all rows (Maximum Entries: unlimited), all columns, Output layout: Tab-Separated-Values, Position in: decimal degrees; save as B-fuse-fuse.tsv
- get one row and all columns as XML-VOTable; save as B-fuse-fuse.xml (this file contains extra metadata)
- Prepare the data for ingestion into MySQL: the asu2sql.py script will create an SQL script to create a table and a data file suitable for MySQL
- asu2sql.py B-fuse-fuse.tsv
- install/start MySQL (http://dev.mysql.com/downloads)
- create scoring db (connecting as root; the db has no security constraints):
- bin/mysql -u root
- CREATE DATABASE scoring;
- USE scoring
- GRANT ALL ON scoring.* TO ''@'localhost';
- ingest VizieR tsv files into db (connecting as a regular user):
- asu2sql.py *.tsv
- bin/mysql --password="" scoring
- source B-fuse-fuse.tsv.sql
Obtain, install and configure a SkyNode
- obtain the NVO 2005 Summer School software packages
- proceedings: http://us-vo.org/summer-school/2005/proceedings/index.html
- download: http://nvo.caltech.edu/nvoss2005.zip
- extract the bin and java folders
- apply the patch: patch -p0 -u < nvoss2005.patch
- install either the "basicskynode" or the "fullskynode", following the instructions in the appropriate README, but note that:
- edit build.xml: property[name="config"]
- make sure to fill in all of the following optional configuration items: IncludeTable, IncludeColumn, MetaTable, MetaColumn
- the columns for right ascension and declination must be named "ra" and "declination"
- use 'ant deploy' to deploy; 'ant olddeploy' deploys the WS to the Axis webapp. The README describes olddeploy functionality (like the Axis URL) as belonging to deploy.
- keep it on the safe side: use all-lowercase characters for table and column names.
Compile and deploy the scoring modules
- install maven 2
- configure the application options (file: skynodeproxy\src\main\resources\options.properties)
- debug.folder: staging area; if a file named scoring-config.xml is present in this folder, it will be used as scoring configuration (see section below)
- edit deploy.bat to point it to your application server's webapps folder (I use the same one as the SkyNode)
- run deploy.bat
Configure the scoring engine
The scoring configuration resides in the file scoring-webui\src\main\resources\scoring-config.xml, but it is overridden by the configuration file at DEBUG_FOLDER\scoring-config.xml if one is present (see section above).
The configuration defines the rules to apply to each dataset, so that each dataset can be configured in a sensible way:
<?xml version="1.0" encoding="utf-8"?>
<configuration>
<rules name="B_eso_safcat">
[ ... rules ... ]
</rules>
<rules name="B_fuse_fuse">
[ ... rules ... ]
</rules>
</configuration>
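As an illustration of how such a file might be read, the following Python sketch looks up the rules block for one dataset using the standard library. It is not the prototype's own loader (which is part of the Java modules); the function name `rules_for` and the inline sample are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Small sample in the layout shown above; the rule inside is only a placeholder.
CONFIG = """<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <rules name="B_eso_safcat">
    <ChoiceRule field="ObsTech" activate="always" weight="1.0"/>
  </rules>
  <rules name="B_fuse_fuse"/>
</configuration>
"""


def rules_for(config_xml: str, dataset: str):
    """Return (rule tag, attributes) pairs configured for the given dataset."""
    root = ET.fromstring(config_xml.encode("utf-8"))
    for rules in root.findall("rules"):
        if rules.get("name") == dataset:
            return [(rule.tag, dict(rule.attrib)) for rule in rules]
    return []  # no rules block for this dataset


print(rules_for(CONFIG, "B_eso_safcat"))
```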
Generic rules configuration
Rules are defined within the rules element. All rules share the following configuration:
- field: the dataset's column to which this rule applies (most rules use this attribute, except those that use more than one column)
- activate: defines when the rule is applied
- query: only apply rule when the field is used on the query to filter the results.
- always: always apply rule, regardless of query
- weight: relative importance of the rule within the dataset. This value is used as a factor when each rule value is combined into the resulting score.
<...Rule field="ExpTime" activate="always" weight="0.5">
[... rule-specific configuration ...]
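The notes say that each weight is used as a factor when rule values are combined, but do not give the exact formula. The sketch below assumes a weighted average of per-rule values in the range 0 to 1; the function name `combine` and the normalisation by the total weight are assumptions, not the prototype's documented behaviour.

```python
def combine(rule_values):
    """Combine (value, weight) pairs into one score.

    Assumption: each value lies in [0, 1] and the result is the
    weighted average, so it also lies in [0, 1].
    """
    total_weight = sum(weight for _, weight in rule_values)
    if total_weight == 0:
        return 0.0  # no applicable rules: nothing to score
    return sum(value * weight for value, weight in rule_values) / total_weight


# A rule with weight 10.0 dominates a rule with weight 0.5.
print(combine([(1.0, 10.0), (0.0, 0.5)]))
```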
Numerical Inequality
This rule specifies that a higher (or lower) value of a numerical field should result in a better score. If activate="query", the rule is triggered only when the query contains field > value or field < value.
- extremeLow: lowest value that the field might return
- extremeHigh: highest value that the field might return
- direction: only used if activate="always"
- GREATER_THAN: high values will have a better score
- LESS_THAN: low values will have a better score
When activate="query" the rule is triggered when the query contains:
- field > value: high values will have a better score
- field < value: low values will have a better score
<NumericalInequalityRule field="ExpTime" activate="always" weight="0.5">
<extremeLow>0</extremeLow>
<extremeHigh>10800</extremeHigh>
<direction>GREATER_THAN</direction>
</NumericalInequalityRule>
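One plausible way to turn this configuration into a rule value is a linear mapping between the two extremes. The sketch below assumes exactly that, with clamping to the range 0 to 1; the actual curve used by the prototype is not stated in these notes, and `inequality_score` is a hypothetical name.

```python
def inequality_score(value, extreme_low, extreme_high, direction="GREATER_THAN"):
    """Map a field value linearly onto [0, 1] between the configured extremes."""
    span = extreme_high - extreme_low
    if span <= 0:
        raise ValueError("extremeHigh must be greater than extremeLow")
    fraction = (value - extreme_low) / span
    fraction = min(1.0, max(0.0, fraction))  # clamp values outside the extremes
    if direction == "GREATER_THAN":
        return fraction          # high values score better
    if direction == "LESS_THAN":
        return 1.0 - fraction    # low values score better
    raise ValueError("unknown direction: %r" % direction)


# With the configuration above, a 5400 s exposure lands halfway up the scale.
print(inequality_score(5400, 0, 10800, "GREATER_THAN"))
```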
Angular Distance
This rule specifies that an object near the specified target should have a better score. This rule is only applicable with activate="query", since there must be a target position to compare against.
- fieldRA: field that contains right ascension, in degrees
- fieldDec: field that contains declination, in degrees.
- size: distance (in degrees) beyond which the score is null.
Note that since this rule uses two columns, the field attribute is not applicable.
<AngularDistanceRule activate="query" weight="10.0">
<fieldRA>ra</fieldRA>
<fieldDec>declination</fieldDec>
<size>0.02</size>
</AngularDistanceRule>
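A minimal sketch of such a rule, assuming the separation is computed with the haversine formula and that the score falls off linearly from 1 at the target to 0 at the configured size. Both choices, the function names, and treating a "null" score as 0.0 are assumptions for illustration.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two positions given in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))


def angular_distance_score(ra, dec, target_ra, target_dec, size):
    """1.0 at the target, decreasing linearly to 0.0 at `size` degrees and beyond."""
    distance = angular_separation_deg(ra, dec, target_ra, target_dec)
    if distance >= size:
        return 0.0  # assumption: a "null" score is represented as zero
    return 1.0 - distance / size


# A pointing 0.01 degrees from the target, with size 0.02, scores about 0.5.
print(angular_distance_score(197.8925, -1.34, 197.8925, -1.35, 0.02))
```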
Choice Rule
This rule contains a finite list of possible values a field might return, and their respective scores:
- value: value to compare against
- weight: resulting score
<ChoiceRule field="ObsTech" activate="always" weight="1.0">
<choices>
<choice value="IMAGE" weight="0.2"/>
<choice value="SPECTRUM" weight="1.0"/>
</choices>
</ChoiceRule>
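In code, a choice rule amounts to a table lookup. The sketch below assumes that a value missing from the list scores 0.0; the notes do not say what the prototype does in that case, and `choice_score` is a hypothetical name.

```python
def choice_score(value, choices, default=0.0):
    """Look up the configured score for a field value.

    Assumption: values that are not listed fall back to `default`.
    """
    return choices.get(value, default)


# The choices from the example configuration above.
obs_tech = {"IMAGE": 0.2, "SPECTRUM": 1.0}
print(choice_score("SPECTRUM", obs_tech))
print(choice_score("ECHELLE", obs_tech))
```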
Numerical Equality
This rule specifies that the closer a numerical field is to the specified value, the better the score.
- size: distance (difference between target and field value) beyond which the score is null.
<NumericalEqualityRule field="ExpTime" activate="query" weight="0.1">
<size>20000</size>
</NumericalEqualityRule>
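A sketch of this rule, assuming a linear fall-off from 1 (exact match) to 0 at a difference of size. The linear shape, the function name, and the use of 0.0 for a "null" score are assumptions.

```python
def equality_score(value, target, size):
    """1.0 when value equals target, decreasing linearly to 0.0 at `size`."""
    if size <= 0:
        raise ValueError("size must be positive")
    difference = abs(value - target)
    if difference >= size:
        return 0.0  # assumption: a "null" score is represented as zero
    return 1.0 - difference / size


# With size 20000, a value 10000 away from the target scores 0.5.
print(equality_score(10000, 0, 20000))
```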
Vector Equality
Multi-dimensional equivalent of Numerical Equality. More than one field is evaluated, so the field attribute is ignored.
<VectorEqualityRule activate="query" weight="0.1">
<fields>
<field>RAJ2000</field>
<field>DEJ2000</field>
</fields>
<size>20000</size>
</VectorEqualityRule>
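The multi-dimensional case can be sketched in the same way, assuming the distance between the field values and the target values is Euclidean. That metric is an assumption, as is the name `vector_equality_score`; the notes only describe the rule as the multi-dimensional equivalent of Numerical Equality.

```python
import math


def vector_equality_score(values, targets, size):
    """Score by Euclidean distance between field values and target values."""
    if len(values) != len(targets):
        raise ValueError("values and targets must have the same length")
    distance = math.sqrt(sum((v - t) ** 2 for v, t in zip(values, targets)))
    if distance >= size:
        return 0.0  # assumption: a "null" score is represented as zero
    return 1.0 - distance / size


# Distance between (3, 4) and (0, 0) is 5, so with size 10 the score is 0.5.
print(vector_equality_score((3.0, 4.0), (0.0, 0.0), 10.0))
```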
Interval
Checks how well intervals overlap.
- fieldLow: field used as low boundary
- fieldHigh: field used as high boundary
- rate: comparison method:
- STRICT: intervals must match exactly
- LOOSE: the user-specified interval needs to be a subinterval of the interval defined in the data
- PARTIAL: overlapping intervals are also used
<IntervalRule fieldLow="RAJ2000" fieldHigh="DEJ2000" activate="query" weight="0.1">
<rate>PARTIAL</rate>
</IntervalRule>
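The three rate values can be read as progressively more permissive tests. The sketch below is one possible interpretation: STRICT and LOOSE give an all-or-nothing score, and PARTIAL scores by the fraction of the user's interval that is covered by the data. These scoring choices and the function name are assumptions, not the prototype's documented behaviour.

```python
def interval_score(data_low, data_high, query_low, query_high, rate="PARTIAL"):
    """Score how well the data interval matches the user-specified interval."""
    if rate == "STRICT":
        # Both boundaries must coincide.
        return 1.0 if (data_low, data_high) == (query_low, query_high) else 0.0
    if rate == "LOOSE":
        # The user's interval must lie entirely inside the data interval.
        contained = data_low <= query_low and query_high <= data_high
        return 1.0 if contained else 0.0
    if rate == "PARTIAL":
        # Any overlap counts; score by the covered fraction of the user's interval.
        overlap = min(data_high, query_high) - max(data_low, query_low)
        query_length = query_high - query_low
        if overlap <= 0 or query_length <= 0:
            return 0.0
        return overlap / query_length
    raise ValueError("unknown rate: %r" % rate)


# Data covers [0, 10]; the user asks for [5, 15]: half of the request is covered.
print(interval_score(0, 10, 5, 15, "PARTIAL"))
```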
Text Matching
This rule specifies that the more textually similar a field is to the specified value, the better the score.
- rate: comparison method:
- PARTIAL: the field value is a substring of the target, and the closer it is to being the entire string, the better the score
<TextMatchingRule field="Target" activate="query" weight="0.1">
<rate>PARTIAL</rate>
</TextMatchingRule>
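A sketch of the PARTIAL method as described, assuming the score is the ratio of the two string lengths and that the comparison is case-sensitive. Both are assumptions, and `text_partial_score` is a hypothetical name.

```python
def text_partial_score(field_value, target):
    """Score a field value that is a substring of the target.

    The score is the fraction of the target covered by the field value,
    so an exact match gives 1.0 and a non-substring gives 0.0.
    """
    if not field_value or not target:
        return 0.0
    if field_value not in target:
        return 0.0
    return len(field_value) / len(target)


print(text_partial_score("A1689", "A1689"))
print(text_partial_score("ab", "abcd"))
```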
Experiment!
The scoring web application will be accessible at:
http://localhost:8080/scoring-webui (change the webserver hostname and port to yours, if applicable).
The configuration file located at DEBUG_FOLDER\scoring-config.xml is dynamically loaded on every query, so changes made to it are immediately visible.
Test Scenario
The following test scenario was used as a demo at the Stage 03 Planning Meetings. It makes use of the ESO SAF observation logs, available through VizieR, as "B/eso/safcat" (see Obtain and ingest observations logs for details on how to obtain it).
Goal: Search ESO archive for high quality (galaxy) data
Criteria:
- exposure time: bigger is better
- angular distance: the closer the pointing to the specified target, the better
- airmass: the smaller the airmass the better
Sample ADQL query: searching for A1689...
SELECT a.recno, a.Target, a.ObsTech, a.ExpTime, a.AirMass, a.RAJ2000, a.DEJ2000
FROM B_eso_safcat a
WHERE Region('CIRCLE J2000 197.8925 -1.35 1.5')
Sample configuration file:
scoring-config.xml
Known Issues
- Some complex queries (notably when joining "Region" with other constraints) are not correctly handled by the SkyNode, and return no results when there should be some.
- Queries with string constant values are not correctly handled.
- The current implementation of skynodeproxy doesn't expose a SOAP interface; instead, it is used as a library by the webui
--
BrunoRino - 11 May 2006