Photometric Quasar Detection

I'm working on software which will classify star-like objects in the SDSS as either stars or quasars, based on colour data (u-g,g-m,r-m,i-z) and in combined SDSS and UKIDSS data using the colours u-g, g-r, r-i, j-h and h-k. The advantage of this method over spectrograph measurements is that it will quickly provide a lot more objects. The methods I am using are based on Richards et. all (2004)

Kernel Density Estimation

The classifier works on the principal of Kernel Density Estimation, which approximates the density of the points using a sum of simple kernel functions, centered on a set of training points. They are similar to histograms with irregular bins. If we have N training points {x_1,...,x_N}, then p(x), the density at point x is 1/N \sum_n K(|x-x_n|/_h_), where h is known as the width, and K is the kernel function.

Different kernel functions can be used, but they make very little difference to to the quality of the model; the most important free parameter is the width of the kernel function, which controls its smoothness, as the following images illustrate.

undersmooth.jpeg oversmooth.jpeg optsmooth.jpeg

Classifying M test points with N training points takes O(NM) time as each of the M test points must be compared with N training points.

Fast KDE

The algorithm used to perform fast kernel density estimation is based on Multi Resolution kD trees, space partitioning data structures which organise points in k-dimensional space. In addition to partitioning the data, mrkd trees record statistics on each partition, including its bounding box, and the number points of each class it contains. This information can be used to cut down on the need to re-perform many calculations.

The algorithm relies on the fact that the kernel functions are monotonic as a function of distance. If we know the bounding box of two partitions, then we can work out the maximum and minimum distances between any two points they contain, which in turn gives us the maximum and minimum amount that the points can contribute to the density sum.

The algorithm works as follows:

  1. Build a tree for both the training ant test data sets, call these R and Q respectively
  2. For every node in the Q tree, record the minimum and maximum kernel sum for each class thee points in it's partition could take.
  3. By recursively comparing the nodes of the two trees, these min and max values can be refined.
  4. If at any stage the min value for one class A in a node exceeds the max value for the class B, every point in that node can be classified as A, and the recursion can stop

Simplified Version

My initial implementation of the Fast Kernel Density Algorithm didn't provide that great a speedup. One problem is that the calculations needed to refine the upper and lower bounds for the class probabilities are quite demanding and usually a significant amount of comparisons must be made before any pruning can take place. In addition, the classifier performs a hard classification, and does not give class probabilities. This makes it harder to develop a systematic way of finding the optimal bandwidths for the class kernels.

I have come up with a simplified version of the algorithm which uses the Epichnikov kernel. This kernel is 0 for points that are furhter than h apart, so the density sum for each test point, x is only dependent on the training points located in a ball with width h centered on x. Finding these points is easy using the kD-tree allowing the exact density can be calculated directly, without performing any summation over the points which do not contribute. In practice, as the number of training points increases, the optimal bandwidth should decrease, so the algorithm should scale well. TODO: get a ref for this.

Implementation

The classifier is implemented in C using the KDtree library available from AutonLab (www.autonlab.org). It contains two executables, one to perform classification and one to perform bandwidth search.

In addition to the classification software there are scripts that use the STILTS library to perform cleaning and transformations on the data used to perform classification.

Example: Classifing Quasars Using Combined UKIDSS and SDSS Data.

Extracting the data

The raw data for the training sets were acquired using the free form SQL interface to the WFCAM Science Archive (http://surveys.roe.ac.uk/wsa/) using the following queries.

For the sample of training quasars, we used the existing catalog of SDSS quasars and crossmatched it with the photoPrimary table of the SDSS and the lasSource table of UKIDSS as follows:

select p.ra,
p.dec,
p.objID, 
p.psfMag_u - p.extinction_u as u,
p.psfMag_g - p.extinction_g as g, p.psfMag_r - p.extinction_r as r,
p.psfMag_i - p.extinction_i as i, p.psfMag_z - p.extinction_z as z,
l.yAperMag3 as Y, 
l.ymj_1Pnt as YmJ, 
l.j_1mhPnt as JmH, 
l.hmkPnt as  HmK, 
n.distancemins  
from 
bestdr5..photoprimary as p,
bestdr5..quasarcatalog as qb, 
bestdr5..specObjAll as sp,
lasSourceXDR5PhotoObj as n, 
lasSource as l 
where
qb.specobjid=sp.specobjid AND 
p.objid=sp.bestObjID AND
n.slaveObjID=sp.bestObjID AND 
n.masterobjid=l.sourceID AND yClass=-1
AND l.yppErrBits<256 AND j_1Class=-1 AND 
l.j_1ppErrBits<256 AND
n.distancemins in (select min(distancemins) from
lasSourceXDR5PhotoObj where masterObjID=n.masterObjID)

For the stars in our training set we selected a sample of objects that were not in the quasar catalog or the FIRST catalog of radio sources as they have a high chance of being undetected quasars.

select p.ra, p.dec, p.objID, 
(p.psfMag_u - p.extinction_u) as u, 
(p.psfMag_g - p.extinction_g) as g, 
(p.psfMag_r - p.extinction_r) as r, 
(p.psfMag_i - p.extinction_i) as i, 
(p.psfMag_z - p.extinction_i) as z, 
p.psfmagErr_u, 
p.psfmagErr_g, 
p.psfmagErr_r, 
p.psfmagErr_i, 
p.psfmagErr_z, 
l.yAperMag3 as Y, 
l.ymj_1Pnt as YmJ, 
l.j_1mhPnt as JmH, 
l.hmkPnt as HmK,  
p.flags, 
p.status, 
p.flags_u, 
p.flags_g, 
p.flags_r, 
p.flags_i, 
p.flags_z, 
n.distancemins as dist from bestdr5..photoPrimary as p, 
lasSourceXDR5PhotoObj as n, 
reliableLasPointSource as l 
Where (p.type=6) And 
(p.camcol!=2 or p.colc<1383 or p.colc>1387) AND 
(p.camcol!=5 or p.colc<1019 or p.colc>1031) AND 
(p.psfMag_g - p.extinction_g)>14.5 AND 
(p.psfMag_g - p.extinction_g)<21.0 and 
l.yAperMag3>0 and p.objID%10=4 and
p.objid=n.slaveobjid and 
n.masterobjid=l.sourceID and 
p.objid not in
(select objid from bestdr5..first) and  
n.distancemins in (select
    min(distancemins) from  lasSourceXDR5PhotoObj where
    masterObjID=n.masterObjID)

In addition to colour data these queries also extract error information which is used to filter the data in the next stage.

Filtering the data

The data is filtered using the STILTS tool kit. as the command used to perform the filtering is complicated it is generated programatically using a python script.

import os
import sys
from tempfile import mkdtemp

BRIGHT_F = int("0x0000000000000002",16)
SATURATED_F = int("0x0000000000040000",16)
EDGE_F = int("0x0000000000000004",16)
BLENDED_F = int("0x0000000000000008",16)

FATAL_FLAGS = BRIGHT_F | EDGE_F | BLENDED_F | SATURATED_F 

PEAKCENTER_F = int("0x0000000000000020",16)
NOTCHECKED_F = int("0x0000000000080000",16)
DEBLEND_NOPEAK_F = int("0x0000400000000000",16)
INTERP_CENTER_F = int("0x0000100000000000",16)

NONFATAL_FLAGS = PEAKCENTER_F | NOTCHECKED_F  | DEBLEND_NOPEAK_F

OK_SCANLINE_F = int("200",16)
CHILD_F = int("10",16);

fatal_clause = "((flags&%i)==0)&&((status&%i)!=0)&&("%(FATAL_FLAGS,OK_SCANLINE_F)+\
      "||".join("(psfMagErr_%s<=0.2)"%x for x in ('u','g','r','i'))+")"


nonfatal_clauses="((flags&%il)==0)||"%CHILD_F+"&&".join(["(((flags_%s&%il)==0)||!(%s<23&&psfMagErr_%s<0.12))"\
      %(x,NONFATAL_FLAGS,x,x) for x in ('u','g','r','i')])

deblend_clause="((%s) || (%s))"\
      %("(flags & %i)==0"%CHILD_F,\
      "&&".join(['((psfMagErr_%s<0.25) || ((flags_%s&%il)!=0) && (psfMagErr_%s<=1.0))'\
      %(x,x,DEBLEND_NOPEAK_F,x) for x in ["u","g","r","i"]]))


tmpdir=mkdtemp(prefix="filt",dir="/fiag/bcw/data/tmp")
interp_bad="%s/interp_bad.fits"%tmpdir
interp_ok="%s/interp_ok.fits"%tmpdir
interp_fixed="%s/interp_fixed.fits"%tmpdir
combined="%s/combined.fits"%tmpdir


os.system("""java -Xmx512m -jar ~/bin/stilts.jar -disk tpipe cmd="select '(%s)&&(%s)&&(%s)'" """ \
      %(fatal_clause,nonfatal_clauses,deblend_clause) + \
      "in=%s out=%s"%(sys.argv[1],sys.argv[2]))

Estimating bandwidth and performing classification

In order to perform classification we must estimate the best bandwidth to use. This is performed by running a hillwalking algorithm on a leave one out cross-validation of the training set. This is performed by the bandwidth_search executable, which opperates as follows
bandwidth_search -train TRAINING_FILE -out OUTPUT_FILE -labels CLASS_LABELS  [-priors CLASS_PRIORS]

where
   TRAINING_FILE is the CSV file containing the training data, the class label must be the last attribute
   OUTPUT_FILE is a file where the results will be stored
   CLASS_LABELS is a comma separated list of the possible labels the objects can take
   CLASS_PRIORS is an optional comma separated list of prior probabilities for each class. It must have the same length as CLASS_LABELS

Once the bandwidth has been estimated classification can take place.This is done using the fast_kda executable which operates as follows.

fast_kda -train TRAINING_FILE -test TEST_FILE -out OUTPUT_FILE -labels CLASS_LABELS  -widths CLASS_BANDWIDTHS [-priors CLASS_PRIORS] [-cross]
where
   TRAINING_FILE is the CSV file containing the training data, the class label must be the last attribute.
   TEST_FILE is the CSV file containing the data to be classified, the class label must be the last attribute (though it will be ignored).
   OUTPUT_FILE is a file where the results will be stored
   CLASS_LABELS is a comma separated list of the possible labels the objects can take
   CLASS_BANDWIDTHS is a comma separated list of the bandwidth for each class kernel. It must have the same length as CLASS_LABELS.
   CLASS_PRIORS is an optional comma separated list of prior probabilities for each class. It must have the same length as CLASS_LABELS.
   -cross can be used if cross validation is required.

-- BobMann - 13 Nov 2008

Topic revision: r2 - 2008-12-15 - 15:11:27 - BrianWalshe
 
This site is powered by the TWiki collaboration platformCopyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback