
Astronomy and Computing 23 (2018) 72–82

Contents lists available at ScienceDirect

Astronomy and Computing


journal homepage: www.elsevier.com/locate/ascom

Full length article

Computer-aided discovery of debris disk candidates: A case study using the Wide-Field Infrared Survey Explorer (WISE) catalog
T. Nguyen a,*, V. Pankratius b, L. Eckman c, S. Seager d
a Massachusetts Institute of Technology, Department of Aeronautics and Astronautics, USA
b Massachusetts Institute of Technology, Haystack Observatory, USA
c Massachusetts Institute of Technology, Computer Science and Artificial Intelligence Lab, USA
d Massachusetts Institute of Technology, Department of Earth and Planetary Sciences, USA

Article info

Article history: Received 10 February 2017; Accepted 25 February 2018; Available online 14 March 2018

Keywords: Debris disk; WISE; Machine learning; Classification

Abstract

Debris disks around stars other than the Sun have received significant attention in studies of exoplanets, specifically exoplanetary system formation. Since debris disks are major sources of infrared emission, infrared survey data such as the Wide-Field Infrared Survey Explorer (WISE) catalog potentially harbor numerous debris disk candidates. However, it is currently challenging to perform disk candidate searches for over 747 million sources in the WISE catalog due to the high probability of false positives caused by interstellar matter, galaxies, and other background artifacts. Crowdsourcing techniques have thus started to harness citizen scientists for debris disk identification, since humans can be easily trained to distinguish between desired artifacts and irrelevant noise. With a limited number of citizen scientists, however, increasing data volumes from large surveys will inevitably lead to analysis bottlenecks. To overcome this scalability problem and push the current limits of automated debris disk candidate identification, we present a novel approach that uses citizen science results as a seed to train machine learning based classification. In this paper, we detail a case study with a computer-aided discovery pipeline demonstrating such feasibility based on WISE catalog data and NASA’s Disk Detective project. Our approach to debris disk candidate classification was shown to be robust under a wide range of image quality and features. Our hybrid approach of citizen science with algorithmic scalability can facilitate big data processing for future detections as envisioned in future missions such as the Transiting Exoplanet Survey Satellite (TESS) and the Wide-Field Infrared Survey Telescope (WFIRST).
© 2018 Elsevier B.V. All rights reserved.

1. Introduction

1.1. Background and motivation

Debris disks around stars other than the Sun, sometimes referred to as exozodi or exozodiacal dust due to the similarity to the solar system’s zodiacal cloud, are of interest to scientists as they are essential to understanding the foundation of planetary systems (Morales et al., 2009). These disks are believed to have been formed by collisions between planets and planetesimals, remnants of the planetary formation process (Backman and Paresce, 1993b). Multiple circumstellar debris disks have been directly imaged with high spatial resolution using the Hubble Space Telescope, such as the circumstellar disk around Fomalhaut shown in Fig. 1.

Debris disks have gained significant interest among astronomers due to their importance in exoplanet detection. Stars with circumstellar debris disks are often promising candidates for exoplanet discovery due to the common origin of debris disks and planets as well as their interactions (Janson et al., 2013; Kóspál et al., 2009). Debris disk structures often have gaps and cavities, revealing the possible existence of exoplanets along with constraints on their properties (Janson et al., 2013; Greaves et al., 2005). Multiple exoplanet discoveries have been made around stars with debris disks in recent years (Lisse et al., 2007; Marois et al., 2008; Liseau et al., 2010; Dodson-Robinson et al., 2011; Moór et al., 2013). A positive correlation has been suggested between the presence of exozodi and exoplanets (Kóspál et al., 2009; Raymond et al., 2011). These findings provide information for future surveys dedicated to high-resolution imaging of debris disks to further understand the interaction and correlation between exoplanets and debris disks (Janson et al., 2013). In addition, understanding the properties of extrasolar debris disks is essential for the target selection process of future exoplanet direct imaging and spectroscopy missions due to the dominant photon noise produced by exozodiacal light (Beichman et al., 2006; Weinberger et al., 2015; Kennedy et al., 2014).

* Corresponding author.
E-mail address: tamz@mit.edu (T. Nguyen).

https://doi.org/10.1016/j.ascom.2018.02.004
2213-1337/© 2018 Elsevier B.V. All rights reserved.
Fig. 1. Debris disk around Fomalhaut as captured by the Hubble Space Telescope. Credit: NASA, ESA, and P. Kalas (University of California, Berkeley and SETI Institute).

A common technique used to identify the existence of debris disks around stars is based upon the detection of an infrared excess caused by the circumstellar disk’s infrared emission above that of the stellar photosphere, sometimes referred to as the “Vega Phenomenon” (Backman and Paresce, 1993a; Dodson-Robinson et al., 2011). The “Vega Phenomenon” was first observed using the Infrared Astronomical Satellite (IRAS) (Aumann et al., 1984) and has since been observed by the Spitzer Space Telescope (Su et al., 2005; Chen et al., 2006; Morales et al., 2009). A recent all-sky mid-infrared survey was conducted by the Wide-field Infrared Survey Explorer (WISE) mission. The WISE observatory is a NASA-funded medium-class Explorer mission managed by the Jet Propulsion Laboratory (JPL) and launched in December 2009. The WISE payload consists of a cryogenically cooled 40-cm diameter telescope and a 4-Mpixel detector array, providing a field-of-view of 47 arcminutes. WISE provides higher sensitivity and resolution than previous comparable missions such as IRAS (Wright et al., 2010).

As debris disk detection is difficult to automate due to a variety of artifacts causing false positive identifications, NASA has explored a crowdsourced effort in the Disk Detective project (Kuchner et al., 2016; Kuchner, 2016). It was developed to leverage the visual perception of participants going through catalogs of stellar targets and identifying potential disk candidates using image data from the WISE survey, with additional cross-checking from the 2 Micron All-Sky Survey (2MASS) in near-infrared wavelengths and the Digitized Sky Survey (DSS) in primarily optical and some near-IR wavelengths (Kuchner et al., 2015). To use Disk Detective, users access a Web environment and look at a slideshow of the available images for a given target of interest with an overlaid red diffraction-limited circle at 22 µm. There are six options to describe each image, of which users can choose all that apply: “Multiple objects in the Red Circle”, “Extended beyond circle in WISE images”, “Not round in DSS2 or 2MASS images”, “Object moves off the crosshairs”, “Empty circle in WISE images”, “None of the above/Good candidate” (see Kuchner (2016) for details). The citizen science approach has been able to identify a number of good candidates for follow-up studies with ground and space telescopes. The closest, brightest candidates are being used to search for exoplanets in the disk region while other targets are helping astronomers to better understand the distribution and characteristics of disks in the galaxy.

The citizen science approach has led to interesting detections but inevitably suffers from an inherent scalability drawback because the number of participants and their productivity are limited. In addition, Daniel (2015) reports that the distribution of work and productivity is unequal among participants: about half of the online work was done by 30 “superusers” out of a pool of 30,000 registered participants. Many of these superusers had science degrees and were described as very enthusiastic about this particular project topic. The experience report reveals additional “soft” scalability biases in citizen science data analysis, such as the enthusiasm of a group of people and their perceived interest in the project that keeps them engaged. As data volumes from surveys and exoplanet imaging missions continue to increase, it becomes apparent that the data analysis stage can turn into a bottleneck with unpredictable performance if it relies on citizen science alone.

Addressing these problems, we propose a novel approach that builds on the success of citizen science while achieving scalability by integrating crowdsourced data in a machine-supported search for astronomical phenomena and objects, inspired by the computer-aided discovery approach of Pankratius et al. (2016). In our process model, we include (1) confirmed citizen science classification results, which are the outcome of meticulous WISE data inspections, and (2) algorithmic detection. We utilize citizen science classification results to train and validate our machine-based debris disk classification algorithm, which enables scalability and parallel operation on large data sets. In addition, our process facilitates the creation of a feedback loop to iteratively improve search parameters by taking result samples from the trained algorithm and presenting them to citizen scientists for validation.

1.2. Objectives

This article presents a computer-aided approach to identifying circumstellar debris disk candidates through the detection of infrared-flux excess in the WISE catalog database. The work presented here focuses on demonstrating the feasibility of automated debris disk detection leveraging existing crowd-sourced data and basic image processing and machine learning techniques. The method presented in this article is designed to be scalable and re-configurable, providing a time-efficient and agile software platform for the detection of astronomical phenomena and objects, specifically debris disks in this case study.

The case study used in this work utilizes the input target list of all confirmed planet–host stars, acquired from NASA’s Exoplanet database, and citizen science results from NASA’s Disk Detective crowdsourcing project. In Section 2, we provide detailed descriptions of the data discovery and processing pipeline, including data query, image processing, and robustness benchmarks for the techniques. In Section 3, we describe the training and classification process. In Section 4, we present the intermediate image processing results and the final candidate selection results of the automated debris disk detection case study. We provide additional development, including a candidate ranking mechanism along with input target list expansion and an intuitive graphical user interface for scientists, in Section 5. Lastly, we provide a conclusion and outlook in Section 6.

2. Data discovery and processing pipeline

2.1. Catalog and image query

The data used in this work were queried and verified from publicly available online astronomy databases through a series of Python scripts. For this demonstration, we used the WISE database for infrared image data and the confirmed planet–host stars database for target input. Both databases are made publicly available by NASA.

Images from the WISE catalog are queried through NASA’s Infrared Science Archive (IRSA) image server (NASA, 2017b). Requests for the URL location of images can be made given the target’s right ascension (ra), declination (dec), and angular size of interest. A list of confirmed exoplanets along with their host stars can be
found through NASA’s Exoplanet Archive (NASA, 2017a). This list can be reduced to a list of planet–host stars and their corresponding ra and dec. Lastly, a Python script reads the ra and dec of each planet–host star in the list and sends image requests to the corresponding URL for that star’s images. For each target, 4 images corresponding to the 4 WISE bands are saved locally for subsequent image processing and analysis. Fig. 2 summarizes the data query process. To ensure that all images are downloaded correctly, a checking protocol was implemented to re-initiate the query process when an error occurs on the image server, decide which image cut-outs to keep, and remove duplicate image files.

Fig. 2. Summary of the image query framework, which uses data from the NASA exoplanet archive and the WISE image server to download infrared images of the planet–host stars.

2.2. Image processing and feature extraction

In this step, image processing and feature extraction techniques are applied to raw images from the WISE catalog to isolate the target and extract relevant features for classification. The image processing pipeline includes three main steps: noise reduction, image segmentation, and central object isolation. In the noise reduction process, a threshold level is automatically selected using a simple mean of gray-scale values in the input image (Glasbey, 1993). Since only the central target is of interest, we implemented an image segmentation technique to find the central object, if it exists, and remove all other resolved objects in the image. To identify each object, a watershed segmentation method was used with markers at local maxima. The image processing implementations used in this analysis are part of the scikit-image Python package (van der Walt et al., 2014). Fig. 3 shows the image processing pipeline as applied to a sample image with 2 objects, a main target at the center of the image and a neighboring off-center object. The image processing steps described above correctly transform the contaminated input image into a clean image of only the target of interest.

Next, feature extraction is applied to the processed images to enable binary decisions on the potential existence of circumstellar disks. Images of an ideal star candidate with a potential debris disk would consist of a bright spot across all 4 bands at the image center that does not extend beyond the diffraction-limit scale, as described in Kuchner et al. (2016). To quantify these features, the figure-of-merit parameters chosen for this analysis are:

• The spot centroid displacement ∆c
• The percentage of spot outside of the diffraction-limit scale p.

The first parameter is used to check whether the location of the infrared (IR) object, if it exists, coincides with that of the input target star to avoid misidentification. The second parameter ensures that the IR object represents a single point source and is not contaminated by other extended IR sources. Similar parameters are employed by Disk Detective as metrics for debris disk detection.

The centroid displacement can be found by comparing the location of the image first moment with the image center, which represents the target star location as specified in the image query. The WISE imaging system diffraction scale was specified as 12′′ at 22 µm (Wright et al., 2010). This value is consistent with diffraction-limited system estimates using the telescope aperture diameter and operating wavelength. In the case study, we added a 20% margin to the diffraction-limited scale to account for systematic variations in the size of the point-spread functions. For each target, a single centroid displacement parameter is computed as the average displacement of the target in the 4 WISE bands, with the exception of images that include invalid pixel values. The second parameter, the out-of-diffraction percentage, is applied only to images from WISE band 4, to reduce the probability of a missed detection caused by the bright stellar source at shorter wavelengths. Similar figures of merit are used in the crowd-sourcing project Disk Detective. Fig. 4 shows the two figure-of-merit parameters computed for a sample star target from the WISE catalog. Both original and processed images are presented along with figure-of-merit parameters for each WISE band and overall parameters for the target.

2.3. Robustness evaluation benchmarks

A series of benchmark experiments was conducted to evaluate the performance of the image processing pipeline under varying parameters, including signal-to-noise ratio, neighboring object brightness, and distance to the primary target object. The goals of these benchmark experiments are not only to verify the image processing approach for the case study application but also to provide an evaluation framework for the performance of the method when applied to potential future data sets of given data quality and features.

In this benchmark analysis, controlled images are generated, consisting of two circular objects: one primary target object and one secondary neighboring object, which acts as a contamination source to the primary target. Gaussian random noise is added to each image to simulate systematic noise sources such as background noise and sensor noise. Next, the image processing technique described in Section 2.2, which was designed to reduce the noise level and remove the secondary object such that the primary target can be recovered, is applied to the image. The output image is then compared with the image of the original primary target object to determine whether the primary target has been identified correctly. The experiment is repeated for multiple values of signal-to-noise ratio (SNR), separation between the objects, and relative brightness of the secondary object to the primary object.

The controlled images in this analysis were generated in Python, where each object is modeled as a 2D Gaussian with standard deviation σ, representing the point-spread function size. The resolution of the image is defined by the total number of pixels along one
dimension N. The separation ∆r is defined as the distance between the circumferences of the 1σ circles around the two objects, as illustrated in Fig. 5(a). The noise process follows a normal distribution with variance computed from the input SNR. Fig. 5(b) shows a central object at SNR levels of 10 dB, 15 dB, and 20 dB, respectively. The relative brightness M2/M1 is defined as the ratio between the Gaussian amplitudes of the neighboring secondary object and the central target object. The varying parameters in this experiment are SNR, ∆r, and M2/M1. The selected parameter values are: SNR ∈ {10, 15, 20} dB, ∆r/N ∈ {0, 0.1, ..., 0.5} (∆r ranges from 0 to 18 pixels), and M2/M1 ∈ {0, 0.1, ..., 2.0}. The fixed parameters are the image resolution N (36 pixels) and σ (0.1N), which were selected to approximately match the WISE images used in the case study.

Fig. 3. Illustration of the image processing pipeline as applied to a sample image with a central object and a secondary off-center object.

Fig. 4. Original WISE images and processed images of a sample target star in 4 WISE bands with corresponding figure-of-merit parameters for centroid displacement (∆c) and out-of-diffraction percentage (p).
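The controlled images described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' script: the function names, the placement of the secondary object along the x axis, and the SNR convention (primary peak amplitude power over noise variance, in dB) are assumptions, since the text does not specify them.

```python
import numpy as np


def gaussian_spot(n, center, sigma, amplitude=1.0):
    """2D Gaussian on an n x n grid; center is (x, y) in pixel units."""
    y, x = np.mgrid[0:n, 0:n]
    r2 = (x - center[0]) ** 2 + (y - center[1]) ** 2
    return amplitude * np.exp(-r2 / (2.0 * sigma ** 2))


def make_benchmark_image(n=36, sigma_frac=0.1, sep_frac=0.2,
                         brightness_ratio=0.5, snr_db=20.0, rng=None):
    """Return (noisy_image, clean_primary) for one benchmark condition.

    sep_frac is Delta r / N, brightness_ratio is M2 / M1.
    """
    rng = np.random.default_rng() if rng is None else rng
    sigma = sigma_frac * n
    c = (n - 1) / 2.0
    # Delta r is the gap between the two 1-sigma circles, so the
    # center-to-center distance is Delta r + 2 * sigma.
    distance = sep_frac * n + 2.0 * sigma
    primary = gaussian_spot(n, (c, c), sigma, amplitude=1.0)
    # Assumption: the neighbor is offset along +x; for large separations
    # it may fall partly or fully outside the frame.
    secondary = gaussian_spot(n, (c + distance, c), sigma,
                              amplitude=brightness_ratio)
    # Assumption: SNR[dB] = 10 * log10(peak_amplitude**2 / noise_variance).
    noise_std = 1.0 / (10.0 ** (snr_db / 20.0))
    noise = rng.normal(0.0, noise_std, size=(n, n))
    return primary + secondary + noise, primary


image, primary = make_benchmark_image(rng=np.random.default_rng(0))
print(image.shape, round(float(primary.max()), 3))
```

Returning the clean primary alongside the noisy image makes it easy to compare the pipeline output with the original target, as the benchmark requires.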
For each set of conditions, 1000 independent simulated images are generated and processed. A successful detection is recorded when the conditions shown in Eq. (1) are met, where I1 and I2 are the first and second moments of the post-processed image, respectively, I2x and I2y are the second moments in the x and y directions, respectively, and σ is the standard deviation of the original object. Fig. 6 shows an example of a successful detection of the central object. The detection probability is calculated as the ratio between the number of instances of correct detection and the total number of runs for each set of conditions.

\[
\begin{aligned}
\lVert I_1 \rVert &< 0.1\sigma && \text{(centroid location)} \\
0.5\sigma < \lVert I_2 \rVert &< 1.5\sigma && \text{(spot size)} \\
\left| \sqrt{I_{2x}} - \sqrt{I_{2y}} \right| &< 0.1\sigma && \text{(symmetry)}
\end{aligned}
\tag{1}
\]
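The detection test of Eq. (1) can be illustrated with intensity-weighted image moments. The sketch below is only one plausible reading of the conditions: it assumes that I1 is the centroid offset from the image center, that I2x and I2y are the intensity-weighted variances along x and y, and that the spot size ‖I2‖ is the RMS width sqrt((I2x + I2y)/2). These definitions and the function names are assumptions, not taken from the paper.

```python
import numpy as np


def image_moments(image):
    """Centroid offset from the image center and variances along x and y.

    Returns None when the image holds no positive flux.
    """
    img = np.clip(np.asarray(image, dtype=float), 0.0, None)
    total = img.sum()
    if total == 0:
        return None
    n_rows, n_cols = img.shape
    y, x = np.mgrid[0:n_rows, 0:n_cols]
    cx = (x * img).sum() / total
    cy = (y * img).sum() / total
    var_x = (((x - cx) ** 2) * img).sum() / total
    var_y = (((y - cy) ** 2) * img).sum() / total
    offset = (cx - (n_cols - 1) / 2.0, cy - (n_rows - 1) / 2.0)
    return offset, var_x, var_y


def is_successful_detection(image, sigma):
    """Apply the three conditions of Eq. (1) under the stated assumptions."""
    moments = image_moments(image)
    if moments is None:
        return False
    (dx, dy), var_x, var_y = moments
    centroid_ok = np.hypot(dx, dy) < 0.1 * sigma
    spot_size = np.sqrt(0.5 * (var_x + var_y))  # assumed meaning of ||I2||
    size_ok = 0.5 * sigma < spot_size < 1.5 * sigma
    symmetry_ok = abs(np.sqrt(var_x) - np.sqrt(var_y)) < 0.1 * sigma
    return bool(centroid_ok and size_ok and symmetry_ok)


# Usage: an ideal centered Gaussian with N = 36 and sigma = 0.1 N passes.
n, sigma = 36, 3.6
yy, xx = np.mgrid[0:n, 0:n]
centered = np.exp(-((xx - 17.5) ** 2 + (yy - 17.5) ** 2) / (2 * sigma ** 2))
print(is_successful_detection(centered, sigma))
```

The detection probability for one condition is then the fraction of simulated images for which this function returns True.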

Fig. 5. Illustration of benchmark image properties: (a) object separation ∆r, spot size σ, number of pixels N, and (b) signal-to-noise ratio (SNR).

The benchmark results are shown in Fig. 7. Each subfigure shows the contour plot of detection probability for a specific SNR value with varying separation normalized to the image size (∆r/N) and relative brightness (M2/M1). The results show that detection can be achieved reliably for ∆r/N > 0.3 (∆r > 3σ). When the two objects are too close together, their combined image resembles a single extended object and is treated correspondingly by the image processing pipeline. The detection probability is improved when M2/M1 ≈ 1 and degrades when there is an imbalance between the magnitudes of the 2 objects due to the nature of the local
maximum watershed algorithm. The detection probability map differs insignificantly between SNR = 15 dB and SNR = 20 dB and converges for SNR > 20 dB.

Fig. 6. Example of a successful detection of a central object (SNR = 20 dB, N = 36 pixels, ∆r/N = 0.2, M2/M1 = 0.5).
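The three-step pipeline of Section 2.2 that these benchmarks exercise (mean thresholding, watershed segmentation with markers at local maxima, central object isolation) might look as follows with scikit-image. This is a hedged sketch rather than the authors' code: the function name, the `min_distance` value, and the rule of keeping the segment that covers the central pixel are assumptions.

```python
import numpy as np
from skimage.feature import peak_local_max
from skimage.filters import threshold_mean
from skimage.segmentation import watershed


def isolate_central_object(image, min_distance=3):
    """Keep only the segmented object that covers the image center."""
    image = np.asarray(image, dtype=float)
    # 1) Noise reduction: keep pixels above the mean gray level.
    mask = image > threshold_mean(image)
    # 2) Segmentation: one watershed marker per local maximum in the mask.
    peaks = [tuple(p) for p in peak_local_max(image, min_distance=min_distance)
             if mask[tuple(p)]]
    if not peaks:
        return np.zeros_like(image)
    markers = np.zeros(image.shape, dtype=int)
    for label, (row, col) in enumerate(peaks, start=1):
        markers[row, col] = label
    labels = watershed(-image, markers, mask=mask)
    # 3) Central object isolation (assumed rule: segment at the center pixel).
    central_label = labels[image.shape[0] // 2, image.shape[1] // 2]
    if central_label == 0:
        return np.zeros_like(image)
    return np.where(labels == central_label, image, 0.0)


# Usage on a noiseless two-object image (hypothetical positions).
yy, xx = np.mgrid[0:36, 0:36]
spot = lambda cx, cy, a: a * np.exp(-((xx - cx) ** 2 + (yy - cy) ** 2) / (2 * 3.6 ** 2))
cleaned = isolate_central_object(spot(18, 18, 1.0) + spot(30, 18, 0.5))
print(cleaned[18, 18] > 0, cleaned[18, 30] == 0)
```

On noisy images every noise peak above the threshold becomes a marker, so in practice a larger `min_distance` or light smoothing before peak detection may be needed; that tuning is not described in the text.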

3. Classification approach

3.1. Training sets

In order to apply machine learning techniques for candidate selection, training sets are necessary as inputs to classifier algorithms. Confirmed debris disk data are sparse at the moment, so we followed a best-effort approach for our proof of concept. Our training sets identifying “good” candidates include a Disk Detective disk candidate set and a disk candidate set hypothesized in the literature. The first set was acquired from Disk Detective’s preliminary crowd-sourcing results (thanks to support from Disk Detective’s creator, Dr. Marc Kuchner). This training data set consists of 114 star locations in the Southern Hemisphere that were determined to be good debris disk candidates by Disk Detective’s users. The targets are to be followed up for confirmation by the Leoncito Astronomical Complex (CASLEO) observatory in Argentina. The second “good” training set is a list of 13 debris disk targets as suggested in the literature. Since targets with uninteresting features are rarely documented, the “bad” target list was constructed by the authors through visual inspection of 450 randomly selected images from the WISE catalog, resulting in a “bad” candidate list of 138 targets. To avoid introducing biases to the analysis, no outliers were removed from these training sets.

3.2. Classifier

To identify good candidates from the input list, we employ machine learning classifiers to generate the decision boundary using the training sets specified in Section 3.1. The 2D parameter space is constructed from the two features presented in Section 2.2: centroid displacement (∆c) and out-of-range percentage (p).

Fig. 8 shows the decision boundaries computed by a number of basic classifiers, including: (1) Support Vector Machine (SVM) with radial basis function (RBF) kernel, (2) SVM with linear kernel, (3) Gaussian Process, (4) Gaussian Naive Bayes (NB), (5) Logistic Regression, and (6) Quadratic Discriminant Analysis (QDA). The corresponding accuracy score for each algorithm is presented under the classifier name (training sets were used for scoring). The implementations of these algorithms are described in more detail in the documentation of the scikit-learn Python module (Pedregosa et al., 2011). It can be observed that all classifiers presented here yield reasonable accuracy scores and logical decision boundaries and, hence, can potentially be good options for debris disk detection.

Fig. 7. Contour plots of detection probability for different signal-to-noise ratio, object separation, and relative brightness.
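A comparison of the six classifiers listed above can be sketched with scikit-learn as follows. The feature values are synthetic stand-ins generated for illustration only (the real Disk Detective and visual-inspection features are not reproduced here), and all hyperparameters are library defaults or assumptions. As in the text, the training set itself is used for scoring.

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Hypothetical features: column 0 is centroid displacement (pixels),
# column 1 is out-of-range percentage. NOT the paper's data.
rng = np.random.default_rng(0)
good = np.column_stack([np.abs(rng.normal(1.0, 0.5, 127)),
                        np.clip(rng.normal(10.0, 5.0, 127), 0, 100)])
bad = np.column_stack([np.abs(rng.normal(12.0, 4.0, 138)),
                       np.clip(rng.normal(60.0, 15.0, 138), 0, 100)])
X = np.vstack([good, bad])
y = np.concatenate([np.ones(len(good), dtype=int),
                    np.zeros(len(bad), dtype=int)])

classifiers = {
    "SVM (RBF)": SVC(kernel="rbf", gamma="scale"),
    "SVM (linear)": SVC(kernel="linear"),
    "Gaussian Process": GaussianProcessClassifier(random_state=0),
    "Gaussian NB": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "QDA": QuadraticDiscriminantAnalysis(),
}

# Accuracy on the training set, mirroring how the scores are reported.
scores = {name: clf.fit(X, y).score(X, y) for name, clf in classifiers.items()}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

Scoring on the training data overstates generalization performance, which is why the case study later adds a separate, visually inspected validation set.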
Fig. 8. Decision boundaries generated with specified training sets from various classifiers: (1) Support Vector Machine (SVM) with radial basis function (RBF) kernel, (2) SVM with linear kernel, (3) Gaussian Process, (4) Gaussian Naive Bayes (NB), (5) Logistic Regression, and (6) Quadratic Discriminant Analysis (QDA).

Since it is not our intention to compare different classifiers in this paper, we chose to proceed with SVM with RBF kernel for the following case study due to its intuitive decision boundary given a number of likely outliers in the training set. We employ a single parameter ν to control an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors. A detailed description of the parameter and technique can be found in Section 2.2 of Chang and Lin (2011). The specific SVM training results and parameter selection for our case study are described in detail in Section 4.

4. Automated debris disk detection

We now present the details of our case study: a debris disk search around confirmed planet–host stars with crowdsourced training sets. The input target list used in this case study consists of confirmed planet–host star targets from the NASA Exoplanet Archive. The query process uses this target list to request and download WISE images at the location of the planet–host stars, as described in detail in Section 2.1. The images are processed through our pipeline to compute the figures of merit, as presented in Section 2.2. Lastly, we use the SVM method with RBF kernel to identify the debris disk candidates from the input list using preliminary crowdsourcing results for training. The output of the case study is a list of confirmed planet–host star candidates with potential debris disks to be further observed and studied.

Fig. 9. Representative images of planet–host star locations, queried from the WISE image archive, illustrating a variety of features: (a) single central object in all bands, (b) single central object, saturated in band 1–2, (c) one neighboring object resolved in band 1–3, contamination in band 4, (d) multiple neighboring objects resolved in band 1–2, noisy in band 3–4, (e) single object in band 1–2, neighboring object resolved in band 3, contamination/misidentification in band 4, (f) single central object in band 1–3, extended background object in band 4.

The input target list consists of 2612 confirmed planet–host stars from NASA’s Exoplanet Archive as of August 2017. WISE images at these star locations were queried through the IRSA image server. Fig. 9 shows representative samples of the images acquired from the WISE catalog, looking at planet–host star targets, with an image size of 0.8′ × 0.8′ (36 × 36 pixels), showing various features encountered in the real data sets. It can be seen that the images of the planet–host stars as seen in the WISE catalog have a wide range of features, which also differ at different wavelengths for the same target. In most cases, a central target can be identified in the shorter wavelengths since the input
list consists of relatively well-understood star targets. For bright stars, the target images at shorter wavelengths can appear to be saturated. In some cases, the primary target is off-center, which could be explained by inaccuracies in target location in either the input list or the WISE location tag. In addition to the central targets, neighboring objects can be resolved in many image cut-outs in the shorter wavelengths due to shorter diffraction scales. At the longer wavelengths, especially in band 4, the image cut-outs have varying features, including bright central spots (good debris disk candidates), noisy/no target, or bright and extended objects, which is often an indication that the star targets in this band are contaminated by neighboring stars or bright background objects. In many cases, the star target images are ambiguous with unclear features, motivating our use of machine learning with visually inspected targets from crowdsourcing projects as training sets.

Fig. 10. Original WISE images and images at various stages of the image processing pipeline for various image features, as described in Fig. 9.

The WISE images of the planet–host stars were piped through our image processing software, which consists of noise reduction, image segmentation, and central object isolation, before feature extraction techniques were applied. Fig. 10 illustrates examples of the image processing stages when applied to different WISE images of the planet–host star targets. Once each image has been processed, the centroid displacement (∆c) and out-of-range percentage (p) parameters are computed for each star target. In the case where no central object is identified, the two parameters are set to their maximum values (∆c = 25.5, p = 100).

The feature parameters ∆c and p are used to construct a parameter space in the training process, which uses SVM to select “good” candidates from the input target list. As mentioned in Section 3, the training phase uses the two “good” training sets (Disk Detective - CASLEO and hypothesized disks from literature) and one “bad” set (WISE visual inspection). Fig. 11 shows the training sets and corresponding decision boundary for ν = 0.5 using the radial basis function (RBF) kernel SVM. The “good” candidates region is highlighted in dark blue in the lower left along with the classification boundary of the classifier. This region was determined algorithmically, enclosing 113/114 targets from the Disk Detective - CASLEO list, 6/9 targets from the known disk list, and 3/138 from the “bad” target list (total accuracy score is 0.97). The classification boundary generated by the SVM-based machine learning can then be used to classify targets from the confirmed planet–host stars input list and identify star candidates with potential circumstellar debris disks.

Fig. 11. Classification with support vector machines (SVM) with radial basis function (RBF) kernel (ν = 0.5). The classification boundary obtained by machine learning separates the “good” and “bad” regions on the parameter space and shows promising results for our known data. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 12 shows the planet–host stars feature data in the same parameter space as Fig. 11, overlaying the same classification boundary found in the training process. The data points enclosed by the classification boundary represent the star targets classified by the machine learning algorithm as being “good” candidates for having debris disks because of their similarities with the “good” training sets and dissimilarities with the “bad” training set. The number of debris disk candidates found in the input list is a function of the ν parameter, which controls the number of support vectors in the SVM and bounds the training error from above. In this case study, the number of candidates found varies between 346 and 393 depending on the ν value selected in the SVM algorithm. In practice, scientists can use this parameter to control the number of candidates found in the input set. For ν = 0.5, a total of 367 star candidates were selected as potentially having debris disks.
T. Nguyen et al. / Astronomy and Computing 23 (2018) 72–82 79

Fig. 12. Classification of the input data set of planet-host stars. The candidates that we predict to have potential debris disks are enclosed by the classification boundary generated by SVM (RBF kernel, ν = 0.5). The cross-validation score is 0.98 for a randomly selected validation set of 100 samples.

For validation, we created a set of 100 randomly selected targets from the IPAC planet–host star list. We visually inspected these validation targets and assigned each target a status of either being a candidate that we would like to further investigate or not, without looking at the SVM classifications. We then compared our manual classification with the SVM classification results and found that the results agreed for 98/100 targets. We found that there were 2 false negatives and no false positives. We inspected the 2 false negatives further and found that the first false negative was the result of an unexpected image feature (a ring-shaped star image), while the second false negative was very close to the edge of the decision boundary in the feature parameter space but had just missed the mark. We expect that these limitations will be alleviated when more training data become available. For this demonstration, we believe that a 0.98 cross-validation score is sufficient as a proof of concept.

We present in the Appendix a full list of planet–host star targets with the potential of having debris disks as discovered by SVM as trained with available crowd-sourcing data. These targets can be followed up by future ground and space missions to further investigate the existence of debris disks. The targets with confirmed debris disks can in turn be added to the training sets, providing improved classification for future discovery.

5. Debris disk candidate ranking and visualization

5.1. Ranking approach

In addition to the presented binary classification approach, we have also explored ranking of debris disk candidates as an alternative that presents scientists with an ordered list that can be used for follow-up studies. To enhance the relevance of rank computation, we have cross-referenced data from several catalogs.

The computer-aided discovery approach presented in this paper can be easily expanded to include more target input lists as well as cross-checking astronomical databases when evaluating candidates. As a proof of concept we have cross-referenced the Tycho-2 star catalog (Høg et al., 2000) with the WISE and 2MASS catalogs for debris-disk searching. This approach acts as an additional filter for the larger WISE catalog, narrowing the objects of interest to exclude as many non-stellar objects as possible. We employed the cross-match service provided by CDS, Strasbourg (University of

Fig. 13. A Graphical User Interface (GUI) showing candidate ranks and a visual inspection of processed images in the data discovery pipeline.

Table 1
Debris-disk candidates generated with the computer-aided discovery approach.
# Star name/ID # Star name/ID # Star name/ID # Star name/ID # Star name/ID # Star name/ID
1 11 Com 63 HD 111232 125 HD 160691 187 HD 216437 249 HD 45364 311 HD 93083
2 14 And 64 HD 111998 126 HD 16175 188 HD 216536 250 HD 45652 312 HD 9446
3 16 Cyg B 65 HD 113337 127 HD 163607 189 HD 216770 251 HD 46375 313 HD 95086
4 18 Del 66 HD 114613 128 HD 16417 190 HD 217107 252 HD 47186 314 HD 95089
5 24 Sex 67 HD 114729 129 HD 164509 191 HD 217786 253 HD 4732 315 HD 95127
6 30 Ari B 68 HD 114783 130 HD 164595 192 HD 219415 254 HD 47536 316 HD 96063
7 4 UMa 69 HD 11506 131 HD 165155 193 HD 219828 255 HD 48265 317 HD 96127
8 42 Dra 70 HD 116029 132 HD 1666 194 HD 220074 256 HD 49674 318 HD 96167
9 47 UMa 71 HD 117207 133 HD 166724 195 HD 220689 257 HD 50499 319 HD 97658
10 51 Eri 72 HD 11755 134 HD 167042 196 HD 220773 258 HD 50554 320 HD 98219
11 51 Peg 73 HD 117618 135 HD 168443 197 HD 220842 259 HD 52265 321 HD 98649
12 6 Lyn 74 HD 118203 136 HD 1690 198 HD 221287 260 HD 5319 322 HD 99706
13 75 Cet 75 HD 11977 137 HD 169830 199 HD 222076 261 HD 5583 323 HIP 105854
14 8 UMi 76 HD 120084 138 HD 170469 200 HD 222155 262 HD 5608 324 HIP 107773
15 81 Cet 77 HD 121504 139 HD 17092 201 HD 222582 263 HD 564 325 HIP 116454
16 AB Pic 78 HD 12484 140 HD 171028 202 HD 224538 264 HD 5891 326 HIP 14810
17 BD+03 2562 79 HD 125612 141 HD 171238 203 HD 224693 265 HD 59686 A 327 HIP 57274
18 BD+15 2375 80 HD 12648 142 HD 17156 204 HD 23079 266 HD 60532 328 HIP 63242
19 BD+15 2940 81 HD 12661 143 HD 173416 205 HD 23127 267 HD 63454 329 HIP 65407
20 BD+20 1790 82 HD 126614 144 HD 175167 206 HD 23596 268 HD 65216 330 HIP 65426
21 BD+20 2457 83 HD 128356 145 HD 17674 207 HD 240210 269 HD 66141 331 HIP 65891
22 BD+20 274 84 HD 129445 146 HD 177565 208 HD 240237 270 HD 66428 332 HIP 67537
23 BD+48 738 85 HD 130322 147 HD 177830 209 HD 24040 271 HD 67087 333 HIP 67851
24 BD+49 828 86 HD 131496 148 HD 179079 210 HD 24064 272 HD 6718 334 HIP 70849
25 BD-06 1339 87 HD 131664 149 HD 179949 211 HD 25171 273 HD 68402 335 HIP 74890
26 BD-13 2130 88 HD 13189 150 HD 180314 212 HD 2638 274 HD 68988 336 HIP 78530
27 CT Cha 89 HD 132406 151 HD 180902 213 HD 27442 275 HD 70642 337 HIP 79431
28 DH Tau 90 HD 132563 152 HD 181342 214 HD 27631 276 HD 7199 338 HIP 8541
29 GJ 3470 91 HD 134987 153 HD 181433 215 HD 28185 277 HD 72659 339 HIP 91258
30 GJ 504 92 HD 136418 154 HD 183263 216 HD 28254 278 HD 73256 340 HIP 97233
31 GJ 676 A 93 HD 13908 155 HD 185269 217 HD 28678 279 HD 73267 341 HN Peg
32 GQ Lup 94 HD 13931 156 HD 187085 218 HD 29021 280 HD 73534 342 HR 2562
33 HAT-P-11 95 HD 139357 157 HD 187123 219 HD 290327 281 HD 74156 343 HR 8799
34 HAT-P-2 96 HD 14067 158 HD 18742 220 HD 2952 282 HD 7449 344 KELT-11
35 HD 100655 97 HD 141399 159 HD 188015 221 HD 30177 283 HD 75289 345 KELT-2 A
36 HD 100777 98 HD 141937 160 HD 189733 222 HD 30669 284 HD 75784 346 KELT-9
37 HD 10180 99 HD 142245 161 HD 190647 223 HD 30856 285 HD 75898 347 Kepler-21
38 HD 101930 100 HD 142415 162 HD 190984 224 HD 31253 286 HD 76700 348 Kepler-408
39 HD 102117 101 HD 143105 163 HD 191806 225 HD 32963 287 HD 7924 349 Kepler-409
40 HD 102195 102 HD 143361 164 HD 192263 226 HD 33142 288 HD 79498 350 Kepler-410 A
41 HD 102272 103 HD 145377 165 HD 192699 227 HD 33283 289 HD 80606 351 LkCa 15
42 HD 102329 104 HD 145457 166 HD 196050 228 HD 33564 290 HD 81040 352 NGC 2682 Sand 364
43 HD 102956 105 HD 145934 167 HD 196885 229 HD 33844 291 HD 81688 353 NGC 2682 Sand 978
44 HD 103197 106 HD 147513 168 HD 19994 230 HD 34445 292 HD 82886 354 ROXs 12
45 HD 103720 107 HD 147873 169 HD 200964 231 HD 35759 293 HD 82943 355 ROXs 42 B
46 HD 103774 108 HD 148427 170 HD 203030 232 HD 37605 294 HD 83443 356 TYC 3667-1280-1
47 HD 10442 109 HD 149026 171 HD 2039 233 HD 38283 295 HD 8535 357 TYC 4282-00605-1
48 HD 104985 110 HD 149143 172 HD 204313 234 HD 38529 296 HD 85390 358 WASP-18
49 HD 106252 111 HD 1502 173 HD 204941 235 HD 38801 297 HD 8574 359 WASP-33
50 HD 106270 112 HD 150706 174 HD 205739 236 HD 40307 298 HD 86081 360 WASP-7
51 HD 10647 113 HD 152581 175 HD 206610 237 HD 40979 299 HD 86226 361 WASP-8
52 HD 106906 114 HD 154857 176 HD 20782 238 HD 41004 A 300 HD 86264 362 bet Pic
53 HD 10697 115 HD 155233 177 HD 208487 239 HD 41004 B 301 HD 8673 363 kap And
54 HD 107148 116 HD 155358 178 HD 208527 240 HD 4113 302 HD 86950 364 ome Ser
55 HD 108147 117 HD 156279 179 HD 209458 241 HD 42012 303 HD 87646 365 omi CrB
56 HD 108341 118 HD 156411 180 HD 210702 242 HD 4203 304 HD 87883 366 tau Boo
57 HD 108863 119 HD 156668 181 HD 212301 243 HD 4208 305 HD 88133 367 xi Aql
58 HD 108874 120 HD 156846 182 HD 212771 244 HD 4313 306 HD 89307
59 HD 109246 121 HD 158038 183 HD 213240 245 HD 43197 307 HD 89744
60 HD 109271 122 HD 159243 184 HD 214823 246 HD 43691 308 HD 90156
61 HD 109749 123 HD 159868 185 HD 215497 247 HD 44219 309 HD 9174
62 HD 110014 124 HD 1605 186 HD 216435 248 HD 45350 310 HD 92788

Strasbourg, 2016). The resulting catalog contained approximately 2.5 million targets.

For a proof of concept, images of a randomly selected subset of these targets were downloaded in all available bands for the WISE, 2MASS, and DSS surveys. Even though the DSS survey catalog was not included in the initial cross reference, it was found to have imaged all targets in the subset catalog in at least one band. The image download is by far the rate-limiting factor in developing catalogs, and it is the main concern preventing scaling up to a larger subset or analyzing the entire constructed catalog.

Each target is ranked based on the following metrics:

1. The number of targets in the image recognized by the processing algorithm with centers within the diffraction-limited circles (“Number of Central Targets” column in Fig. 13)
2. The identifying coordinates of the target with the center closest to the center of the image (“Main Target Coordinates”)
3. The displacement of the center of the central target from the image center (“Main Target Displacement”)
4. The percentage of the main target that extends beyond the diffraction-limited circle (“Percent Outside Diffraction”)

5. The value returned by the thresholding algorithm in the initial processing (“Threshold Value”)
6. The percentage of original pixels with values equal to or above that threshold (“Percent of Image White”).

A number of other data are included in the returned analysis of each target for the purposes of identification, for example, the right ascension and declination of the object as well as the file path of its related unprocessed images and its 0-indexed position in the input catalog. We generate a spreadsheet of data including all ranking metrics to allow researchers to select the features most important to their research and/or to condition on other constraints, such as identifying all candidates present in relatively noisy or quiet regions (determined by “Number of Central Targets”, see Fig. 13).

The ranked data approach has proven useful for (a) quick elimination of undesirable targets, and (b) providing metrics of fit for targets that fall somewhere between ideal candidates for follow-up and bad targets to exclude from future analysis. Ideally, it would be possible for researchers to identify the best candidates to look at in a given region of the sky to expedite follow-up confirmation.

5.2. Graphical user interface

We developed a graphical user interface to allow scientists a more hands-on overview of the analysis. Users can open catalogs, download images, and run the discovery analysis on selected targets of interest. Currently, the GUI analyzes a data set one target at a time, which allows users to easily and quickly view all of the existing information for a particular star without analyzing an entire catalog, providing a useful alternative to the large-scale processing approach. The GUI also has the unique ability to generate a display of the processed images associated with each stellar target; this functionality is useful for “debugging” and better understanding why certain candidates are chosen. The ability to return to the processed images is useful in further examination of anomalous data points as well as refinement of thresholds in the analysis process. A screenshot is shown in Fig. 13, which also illustrates a ranked list of candidates with their corresponding ranking parameters.

6. Conclusion and outlook

This article presents a study on a computer-aided approach for debris disk candidate search that leverages data from several astronomical databases and crowd-sourcing results. The case study presented here employs crowd-sourcing data from NASA’s Disk Detective project to train machine learning algorithms and achieve scalability on a larger data set, providing predictions of debris disk candidates from the list of all planet–host stars. Furthermore, the algorithmic approach also facilitates the easy creation of ranking lists with debris disk candidates, which can help human scientists query candidates with different properties and assess potential probabilities for detections in follow-up studies. To our knowledge, this is the first attempt to combine crowd sourcing and machine learning for debris disk search in infrared bands in the presented way.

With upcoming missions such as the Transiting Exoplanet Survey Satellite (TESS), the Wide-Field Infrared Survey Telescope (WFIRST), as well as the James Webb Space Telescope (JWST), the astronomical data volume will inevitably lead to a stronger need for automation. The case study presented in this work demonstrates that this hybrid approach has the potential to provide target candidates for these missions as well as to support astronomical discoveries from future data. Lastly, as more data are collected and validated, better training sets for machine learning and additional human classification records are going to increase the fidelity of the approach presented in this paper.

Acknowledgments

We would like to thank Dr. Marc Kuchner for discussions about Disk Detective as well as for providing data and access to the Disk Detective Web site. We also acknowledge support from the National Science Foundation ACI-1442997 and NASA AIST14 NNX15AG84G for computer-aided discovery and the National Science Foundation Graduate Research Fellowship Grant No. 1122374. The results of this paper are based on projects completed during the 2015 graduate-level Astroinformatics course at the MIT Department of Earth, Atmospheric and Planetary Sciences (EAPS) and the 2015 summer MIT Undergraduate Research Opportunities Program (UROP) supporting the work of Laura Eckman. This research has made use of the NASA Exoplanet Archive, which is operated by the California Institute of Technology, under contract with the National Aeronautics and Space Administration under the Exoplanet Exploration Program.

Appendix

See Table 1.

References

Aumann, H., Beichman, C., Gillett, F., De Jong, T., Houck, J., Low, F., Neugebauer, G., Walker, R., Wesselius, P., 1984. Discovery of a shell around Alpha Lyrae. Astrophys. J. 278, L23–L27.
Backman, D.E., Paresce, F., 1993a. Main-sequence stars with circumstellar solid material-the Vega phenomenon. In: Protostars and planets III, Vol. 1, pp. 1253–1304.
Backman, D., Paresce, F., 1993b. Protostars and Planets III, ed.
Beichman, C., Bryden, G., Stapelfeldt, K., Gautier, T., Grogan, K., Shao, M., Velusamy, T., Lawler, S., Blaylock, M., Rieke, G., et al., 2006. New debris disks around nearby main-sequence stars: impact on the direct detection of planets. Astrophys. J. 652 (2), 1674.
Chang, C.-C., Lin, C.-J., 2011. LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. (TIST) 2 (3), 27.
Chen, C.H., Sargent, B., Bohac, C., Kim, K., Leibensperger, E., Jura, M., Najita, J., Forrest, W., Watson, D., Sloan, G., et al., 2006. Spitzer IRS spectroscopy of IRAS-discovered debris disks. Astrophys. J. Suppl. Ser. 166 (1), 351.
Daniel, K., 2015. Disk detective: Crowdsourcing new planets. August 11. https://www.citizenscience.gov/2015/08/11/disk-detective/.
Dodson-Robinson, S.E., Beichman, C., Carpenter, J.M., Bryden, G., 2011. A Spitzer infrared spectrograph study of debris disks around planet-host stars. Astron. J. 141 (1), 11.
Glasbey, C.A., 1993. An analysis of histogram-based thresholding algorithms. CVGIP, Graph. Models Image Process. 55 (6), 532–537.
Greaves, J., Holland, W., Wyatt, M., Dent, W., Robson, E., Coulson, I., Jenness, T., Moriarty-Schieven, G., Davis, G., Butner, H., et al., 2005. Structure in the ε Eridani debris disk. Astrophys. J. Lett. 619 (2), L187.
Høg, E., Fabricius, C., Makarov, V.V., Urban, S., Corbin, T., Wycoff, G., Bastian, U., Schwekendiek, P., Wicenec, A., 2000. The Tycho-2 catalogue of the 2.5 million brightest stars. Astron. Astrophys. 355, L27–L30.
Janson, M., Brandt, T.D., Moro-Martín, A., Usuda, T., Thalmann, C., Carson, J.C., Goto, M., Currie, T., McElwain, M., Itoh, Y., et al., 2013. The SEEDS direct imaging survey for planets and scattered dust emission in debris disk systems. Astrophys. J. 773 (1), 73.
Kennedy, G.M., Wyatt, M.C., Bailey, V., Bryden, G., Danchi, W.C., Defrère, D., Haniff, C., Hinz, P.M., Lebreton, J., Mennesson, B., et al., 2015. EXO-zodi modeling for the large binocular telescope interferometer. Astrophys. J. 216 (2), 23.
Kóspál, Á., Ardila, D.R., Moór, A., Ábrahám, P., 2009. On the relationship between debris disks and planets. Astrophys. J. Lett. 700 (2), L73.
Kuchner, M., 2016. NASA Disk Detective. https://www.diskdetective.org, last accessed September 2016.
Kuchner, M.J., Silverberg, S.M., Bans, A.S., Bhattacharjee, S., Kenyon, S.J., Debes, J.H., Currie, T., Garcia, L., Jung, D., Lintott, C., et al., 2016. Disk detective: Discovery of new circumstellar disk candidates through citizen science. Astrophys. J. 830 (2), 84.
Kuchner, M.J., Silverberg, S., Bans, A., Team, D.D., 2015. Diskdetective.org: The first 1,000,000 classifications. In: American Astronomical Society Meeting Abstracts, Vol. 225.

Liseau, R., Eiroa, C., Fedele, D., Augereau, J.-C., Olofsson, G., González, B., Maldonado, J., Montesinos, B., Mora, A., Absil, O., et al., 2010. Resolving the cold debris disc around a planet-hosting star-PACS photometric imaging observations of q1 Eridani (HD 10647, HR 506). Astron. Astrophys. 518, L132.
Lisse, C., Beichman, C., Bryden, G., Wyatt, M., 2007. On the nature of the dust in the debris disk around HD 69830. Astrophys. J. 658 (1), 584.
Marois, C., Macintosh, B., Barman, T., Zuckerman, B., Song, I., Patience, J., Lafrenière, D., Doyon, R., 2008. Direct imaging of multiple planets orbiting the star HR 8799. Science 322 (5906), 1348–1352.
Moór, A., Ábrahám, P., Kóspál, Á., Szabó, G.M., Apai, D., Balog, Z., Csengeri, T., Grady, C., Henning, T., Juhász, A., et al., 2013. A resolved debris disk around the candidate planet-hosting star HD 95086. Astrophys. J. Lett. 775 (2), L51.
Morales, F.Y., Werner, M., Bryden, G., Plavchan, P., Stapelfeldt, K., Rieke, G., Su, K., Beichman, C., Chen, C., Grogan, K., et al., 2009. Spitzer mid-IR spectra of dust debris around A and late B type stars: asteroid belt analogs and power-law dust distributions. Astrophys. J. 699 (2), 1067.
NASA, 2017a. NASA exoplanet archive. http://exoplanetarchive.ipac.caltech.edu/docs/program_interfaces.html (last accessed 16.08.17).
NASA, 2017b. NASA/IPAC Infrared Science Archive. http://irsa.ipac.caltech.edu/ibe/index.html (last accessed 16.08.17).
Pankratius, V., Li, J., Gowanlock, M., Blair, D.M., Rude, C., Herring, T., Lind, F., Erickson, P.J., Lonsdale, C., 2016. Computer-aided discovery: Toward scientific insight generation with machine support. IEEE Intell. Syst. 31 (4), 3–10.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E., 2011. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830.
Raymond, S.N., Armitage, P.J., Moro-Martín, A., Booth, M., Wyatt, M.C., Armstrong, J.C., Mandell, A.M., Selsis, F., West, A.A., 2011. Debris disks as signposts of terrestrial planet formation. Astron. Astrophys. 530, A62.
Su, K., Rieke, G., Misselt, K., Stansberry, J., Moro-Martin, A., Stapelfeldt, K., Werner, M., Trilling, D., Bendo, G., Gordon, K., et al., 2005. The Vega debris disk: A surprise from Spitzer. Astrophys. J. 628 (1), 487.
University of Strasbourg, 2016. CDS X-match service. http://cdsxmatch.u-strasbg.fr/xmatch (last accessed 05.09.16).
van der Walt, S., Schönberger, J.L., Nunez-Iglesias, J., Boulogne, F., Warner, J.D., Yager, N., Gouillart, E., Yu, T., 2014. scikit-image: image processing in Python. PeerJ 2, e453.
Weinberger, A.J., Bryden, G., Kennedy, G.M., Roberge, A., Defrère, D., Hinz, P.M., Millan-Gabet, R., Rieke, G., Bailey, V.P., Danchi, W.C., et al., 2015. Target selection for the LBTI exozodi key science program. Astrophys. J. 216 (2), 24.
Wright, E.L., Eisenhardt, P.R., Mainzer, A.K., Ressler, M.E., Cutri, R.M., Jarrett, T., Kirkpatrick, J.D., Padgett, D., McMillan, R.S., Skrutskie, M., et al., 2010. The Wide-field Infrared Survey Explorer (WISE): mission description and initial on-orbit performance. Astron. J. 140 (6), 1868.
