AIC Seminar Series
Scalable Partitioning and Exploration of Chemical Spaces
Debojyoti Dutta, University of Southern California
Date: 2006-03-15 at 13:00
Location: EK255
Due to technological advances, there has been a rapid growth in the
amount of bio-chemical data. Two examples of such large collections of
data are small molecule libraries and collections of peptide mass
spectra. There is an urgent need to design algorithms to efficiently learn,
mine and interpret this ever-increasing corpus of data.
In this talk, I will first present a framework to scale traditional
applications such as classification, partitioning, and outlier
detection for large chemical data sets without a significant loss of
accuracy by applying locality sensitive hashing (LSH). We hash
chemical descriptors so that points close to each other in the
descriptor space are also close to each other in the hashed space.
Using this data structure, one can perform approximate
nearest-neighbor searches very quickly, in sub-linear time. We validate
the accuracy and performance of our framework on three real data sets
of sizes ranging from 4,337 to 249,071 molecules. Results indicate that
the identification of nearest neighbors using the LSH algorithm is at
least 2 orders of magnitude faster than the traditional
k-nearest-neighbor method and is over 94% accurate for most query parameters.
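As a rough illustration of the hashing idea described above, and not the implementation used in this work, the following Python sketch indexes descriptor vectors with one common LSH family for Euclidean distance (random Gaussian projections cut into fixed-width buckets). The class name EuclideanLSH, its parameters (num_tables, hashes_per_table, bucket_width), and the helper brute_force_knn are all hypothetical and chosen only for illustration.

```python
import math
import random
from collections import defaultdict


class EuclideanLSH:
    """Toy LSH index (assumed scheme): h(v) = floor((a . v + b) / w), a Gaussian."""

    def __init__(self, dim, num_tables=8, hashes_per_table=4,
                 bucket_width=4.0, seed=0):
        rng = random.Random(seed)
        self.w = bucket_width
        # Each table gets its own (a, b) pairs: a ~ N(0, 1)^dim, b ~ U[0, w).
        self.projections = [
            [([rng.gauss(0.0, 1.0) for _ in range(dim)],
              rng.uniform(0.0, bucket_width))
             for _ in range(hashes_per_table)]
            for _ in range(num_tables)
        ]
        self.tables = [defaultdict(list) for _ in range(num_tables)]
        self.points = []

    def _key(self, table_idx, v):
        # Concatenate the table's hash values into one bucket key.
        return tuple(
            math.floor((sum(ai * vi for ai, vi in zip(a, v)) + b) / self.w)
            for a, b in self.projections[table_idx]
        )

    def add(self, v):
        """Store v in every table and return its index."""
        idx = len(self.points)
        self.points.append(v)
        for t, table in enumerate(self.tables):
            table[self._key(t, v)].append(idx)
        return idx

    def query(self, q, k=1):
        """Return up to k indices of approximate nearest neighbors of q."""
        # Candidates are points sharing a bucket with q in at least one table.
        candidates = set()
        for t, table in enumerate(self.tables):
            candidates.update(table.get(self._key(t, q), ()))
        # Exact distances are computed only for the candidates.
        ranked = sorted(candidates, key=lambda i: math.dist(q, self.points[i]))
        return ranked[:k]


def brute_force_knn(points, q, k=1):
    """Exact k-nearest-neighbor baseline: scans every point."""
    order = sorted(range(len(points)), key=lambda i: math.dist(q, points[i]))
    return order[:k]
```

In this sketch, query computes exact distances only for points that collide with the query in some table, which is where the speedup over the full scan in brute_force_knn would come from. Wider buckets or more tables raise recall at the cost of more candidates per query, so those settings trade accuracy against speed.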
Next, I will discuss my ongoing work on prefiltering large
small-molecule databases to find new inhibitors for drug targets
using ensemble learning techniques. Finally, I will present a brief
summary of my work on mining mass spectrometry data, as well as my
previous work on low-state schemes to fight selfish and malicious
traffic within a network.
Debojyoti graduated from IIT Kharagpur with a B.Tech. in
Computer Science in 1999. He received his PhD in Computer
Science from USC/ISI, with Ashish Goel (now at Stanford) as his advisor.
Since completing his PhD in summer 2004, he has been a postdoc in the
Department of Computational Biology (Ting Chen's group). His
interests include large-scale machine learning and data mining for
proteomics and chemoinformatics, and networking.
Please arrive at least 10 minutes early in order to sign in and be escorted to the conference room. SRI is located at 333 Ravenswood Avenue in Menlo Park. Visitors may park in the visitors' lot in front of Building E, and should follow the instructions by the lobby phone to be escorted to the meeting room. Detailed directions to SRI, as well as maps, are available from the Visiting AIC web page.