I am David Grangier, welcome to my homepage. I am currently with Microsoft Research, in Redmond, WA, USA. I am interested in large scale Machine Learning at its application to pattern analysis tasks, such as Information Retrieval, Speech Recognition and Text Analysis. My current work is described in the following, while my resumé or my linkedin will give you an overview of my previous experiences.
Research Interests
Learning to rank has increasingly received attention during the last
decade, due to applications in domains such as information retrieval.
These types of problems aim at assigning a confidence value to each
example of a set, such that the values induce a specific
ordering over the set. This task hence significantly differs
from classical machine learning problem, since the example
outputs cannot be considered independently.
Many effective learning algorithms, such as nearest neighbor classifiers or
support vector machines, crucially relies on a function to compare
examples. Such a function can be referred to as a distance metric, a similarity measure or
a kernel, depending on its mathematical properties. Rather than defining this
function relying only on prior knowledge, new learning techniques have been
introduced to learn it from training data providing information on the desired
example proximity.
In many problems, specific relations exist between the different inputs
or between the labels. For instance, computer vision features extracted
from the same image share spatial relations, or the phoneme classes
in speech belong to a hierarchical structure. Encoding prior information
about such structures, or learning such relationships offer great opportunities
to improve machine learning approaches for several pattern recognition problems,
and several techniques has been introduced toward this objective in the recent
years.
Publications
Multiple data sources containing different types of features may be available for a given task. For instance, users’ profiles can be used to build recommendation systems. In addition, a model can also use users’ historical behaviors and social networks to infer users’ interests on related products. We argue that it is desirable to collectively use any available multiple heterogeneous data sources in order to build effective learning models. We call this framework heterogeneous learning. In our proposed setting, data sources can include (i) non-overlapping features, (ii) non-overlapping instances, and (iii) multiple networks (i.e. graphs) that connect instances. In this paper, we propose a general optimization framework for heterogeneous learning, and devise a corresponding learning model from gradient boosting. The idea is to minimize the empirical loss with two constraints: (1) There should be consensus among the predictions of overlapping instances (if any) from different data sources; (2) Connected instances in graph datasets may have similar predictions. The objective function is solved by stochastic gradient boosting trees. Furthermore, a weighting strategy is designed to emphasize informative data sources, and deemphasize the noisy ones. We formally prove that the proposed strategy leads to a tighter error bound. This approach consistently outperforms a standard concatenation of data sources on movie rating prediction, number recognition and terrorist attack detection tasks. We observe that the proposed model can improve out-of-sample error rate by as much as 80%.@inproceedings{shi:2012:heterogeneous_gbdt_sdm,
author = "X. Shi and JF. Paiement and D. Grangier and P. Yu",
title = "Learning from Heterogeneous Sources via Gradient Boosting Consensus",
booktitle = "SIAM International Conference on Data Mining (SDM)",
year = "2012",
}
We present a new learning strategy for classification problems in which train and/or test data suffer from missing features. In previous work, instances are represented as vectors from some feature space and one is forced to impute missing values or to consider an instance-specific subspace. In contrast, our method considers instances as sets of (feature,value) pairs which naturally handle the missing value case. Building onto this framework, we propose a classification strategy for sets. Our proposal maps (feature,value) pairs into an embedding space and then non-linearly combines the set of embedded vectors. The embedding and the combination parameters are learned jointly on the final classification objective. This simple strategy allows great flexibility in encoding prior knowledge about the features in the embedding step and yields advantageous results compared to alternative solutions over several datasets.@inproceedings{grangier:2010:missing_nips,
author = "D. Grangier and I. Melvin",
title = "Feature Set Embedding for Incomplete Data",
booktitle = "Advances in Neural Information Processing Systems (NIPS)",
year = "2010",
}
Multi-class classification becomes challenging at test time when the number of classes is very large and testing against every possible class can become computationally infeasible. This problem can be alleviated by imposing (or learning) a structure over the set of classes. We propose an algorithm for learning a tree-structure of classifiers which, by optimizing the overall tree loss, provides superior accuracy to existing tree labeling methods. We also propose a method that learns to embed labels in a low dimensional space that is faster than non-embedding approaches and has superior accuracy to existing embedding approaches. Finally we combine the two ideas resulting in the label embedding tree that outperforms alternative methods including One-vs-Rest while being orders of magnitude faster.@inproceedings{weston:2010:label_trees_nips,
author = "J. Weston and S. Bengio and D. Grangier",
title = "Label Embedding Trees for Large Multi-Class Tasks",
booktitle = "Advances in Neural Information Processing Systems (NIPS)",
year = "2010",
}
We study the standard retrieval task of ranking a fixed set of items given a previously unseen query and pose it as the half-transductive ranking problem. Transductive representations (where the vector representation of each example is learned) allow the generation of highly nonlinear embeddings that capture the characteristics of object relationships without relying on a specific choice of features, and require only relatively simple optimization. Unfortunately, they have no direct out-of-sample extension. Inductive approaches on the other hand allow for the representation of unknown queries. We describe algorithms for this setting which have the advantages of both transductive and inductive approaches, and can be applied in unsupervised (either reconstruction-based or graph-based) and supervised ranking setups. We show empirically that our methods give strong performance on all three tasks.@inproceedings{bai:2010:halftrans_aistats,
author = "B. Bai and J. Weston and D. Grangier and R. Collobert and C. Cortes and M. Mohri",
title = "Half Transductive Ranking",
booktitle = "Artificial Intelligence and Statistics (AISTATS)",
year = "2010",
}
We study the standard retrieval task of ranking a fixed set of documents given a previously unseen query and pose it as the half-transductive ranking problem. The task is partly transductive as the document set is fixed. Existing transductive approaches are natural non-linear methods for this set, but have no direct out-of-sample extension. Functional approaches, on the other hand, can be applied to the unseen queries, but fail to exploit the availability of the document set in its full extent. This work introduces a half-transductive approach to benefit from the advantages of both transductive and functional approaches and show its empirical advantage in supervised ranking setups.@inproceedings{bai:2009:halftrans_nips,
author = "B. Bai and J. Weston and D. Grangier and R. Collobert and C. Cortes and M. Mohri",
title = "Ranking with Half Transductive Models",
booktitle = "NIPS Workshop on Advances in Ranking",
year = "2009",
}
We present a class of nonlinear (polynomial) models that are discriminatively trained to directly map from the word content in a query-document or document-document pair to a ranking score. Dealing with polynomial models on word features is computationally challenging. We propose a low-rank (but diagonal preserving) representation of our polynomial models to induce feasible memory and computation requirements. We provide an empirical study on retrieval tasks based on Wikipedia documents, where we obtain state-of-the-art performance while providing realistically scalable methods.@inproceedings{bai:2009:psi_nips,
author = "B. Bai and J. Weston and D. Grangier and R. Collobert and K. Sadamasa and Y. Qi and C. Cortes and M. Mohri",
title = "Polynomial Semantic Indexing",
booktitle = "Advances in Neural Information Processing Systems (NIPS)",
year = "2009",
}
In this article we present Supervised Semantic Indexing (SSI) which defines a class of nonlinear (quadratic) models that are discriminatively trained to directly map from the word content in a query-document or document-document pair to a ranking score. Like Latent Semantic Indexing (LSI), our models take account of correlations between words (synonymy, polysemy). However, unlike LSI our models are trained from a supervised signal directly on the ranking task of interest, which we argue is the reason for our superior results. As the query and target texts are modeled separately, our approach is easily generalized to different retrieval tasks, such as cross-language retrieval or online advertising placement. Dealing with models on all pairs of words features is computationally challenging. We propose several improvements to our basic model for addressing this issue, including low rank (but diagonal preserving) representations, correlated feature hashing (CFH) and sparsification. We provide an empirical study of all these methods on retrieval tasks based on Wikipedia documents as well as an Internet advertisement task. We obtain state-of-the-art performance while providing realistically scalable methods.@article{bai:2009:ssi_jir,
author = "B. Bai and J. Weston and D. Grangier and R. Collobert and Y. Qi and K. Sadamasa and O. Chapelle and K. Weinberger",
title = "Learning to Rank with (a Lot of) Word Features",
journal = "Information Retrieval -- Special Issue on Learning to Rank",
publisher = "Springer",
year = "2009",
}
In this article, we propose Supervised Semantic Indexing (SSI), an algorithm that is trained on (query, document) pairs of text documents to predict the quality of their match. Like Latent Semantic Indexing (LSI), our models take account of correlations between words (synonymy, polysemy). However, unlike LSI our models are trained with a supervised signal directly on the ranking task of interest, which we argue is the reason for our superior results. As the query and target texts are modeled separately, our approach is easily generalized to different retrieval tasks, such as online advertising placement. Dealing with models on all pairs of words features is computationally challenging. We propose several improvements to our basic model for addressing this issue, including low rank (but diagonal preserving) representations, and correlated feature hashing (CFH).We provide an empirical study of all these methods on retrieval tasks based on Wikipedia documents as well as an Internet advertisement task. We obtain state-of-the-art performance while providing realistically scalable methods.@inproceedings{bai:2009:ssi_cikm,
author = "B. Bai and J. Weston and D. Grangier and R. Collobert and Y. Qi and K. Sadamasa and O. Chapelle and K. Weinberger",
title = "Supervised Semantic Indexing",
booktitle = "ACM Conference on Information and Knowledge Management (CIKM)",
year = "2009",
}
We present a class of models that are discriminatively trained to directly map from the word content in a query-document or document-document pair to a ranking score. Like Latent Semantic Indexing (LSI), our models take account of correlations between words (synonymy, polysemy). However, unlike LSI our models are trained with a supervised signal directly on the task of interest, which we argue is the reason for our superior results. We provide an empirical study on Wikipedia documents, using the links to define document-document or query-document pairs, where we obtain state-of-the-art performance using our method.@inproceedings{bai:2009:ssi_ecir,
author = "B. Bai and J.Weston and R. Collobert and D. Grangier",
title = "Supervised Semantic Indexing",
booktitle = "European Conference on Information Retrieval (ECIR)",
year = "2009",
}
This chapter introduces a discriminative approach for the detection of keywords in speech utterances. Specifically, this work proposes a learning algorithm, which aims at maximizing the area under the receiver operating curve, given a set of training spotting problems. This algorithm relies on a large margin formulation of the spotting task, and adopts an efficient online learning strategy. This approach contrasts with standard spotting strategies based on Hidden Markov Models (HMMs), for which the training procedure does not maximize a loss directly related to the spotting performance. Different experiments performed over TIMIT and WSJ data show the advantage of our approach over HMM-based alternatives.@incollection{grangier:2009:kws_book,
author = "D. Grangier, J. Keshet and S. Bengio",
title = "Discriminative Keyword Spotting",
booktitle = "Automatic Speech and Speaker Recognition: Large Margin and Kernel Methods",
editor = "J. Keshet and S. Bengio",
publisher = "Wiley",
year = "2009",
}
In this thesis, we explore the use of machine learning techniques for information retrieval. More specifically, we focus on ad-hoc retrieval, which is concerned with searching large corpora to identify the documents relevant to user queries. This identification is performed through a ranking task. Given a user query, an ad-hoc retrieval system ranks the corpus documents, so that the documents relevant to the query ideally appear above the others. In a machine learning framework, we are interested in proposing learning algorithms that can benefit from limited training data in order to identify a ranker likely to achieve high retrieval performance over unseen documents and queries. This problem presents novel challenges compared to traditional learning tasks, such as regression or classification. First, our task is a ranking problem, which means that the loss for a given query cannot be measured as a sum of an individual loss suffered for each corpus document. Second, most retrieval queries present a highly unbalanced setup, with a set of relevant documents accounting only for a very small fraction of the corpus. Third, ad-hoc retrieval corresponds to a kind of ``double'' generalization problem, since the learned model should not only generalize to new documents but also to new queries. Finally, our task also presents challenging efficiency constraints, since ad-hoc retrieval is typically applied to large corpora. The main objective of this thesis is to investigate the discriminative learning of ad-hoc retrieval models. For that purpose, we propose different models based on kernel machines or neural networks adapted to different retrieval contexts. The proposed approaches rely on different online learning algorithms that allow efficient learning over large corpora.@phdthesis{grangier:2008:phd_thesis,
author = "D. Grangier",
title = "Machine Learning for Information Retrieval",
number = "4088",
school = "Ecole Polytechnique Federale de Lausanne",
year = "2008",
}
This paper proposes a new approach for keyword spotting, which is not based on HMMs. Unlike previous approaches, the proposed method employs a discriminative learning procedure, in which the learning phase aims at maximizing the area under the ROC curve, as this quantity is the most common measure to evaluate keyword spotters. The keyword spotter we devise is based on mapping the input acoustic representation of the speech utterance along with the target keyword into a vector space. Building on techniques used for large margin and kernel methods for predicting whole sequences, our keyword spotter distills to a classifier in this vector-space, which separates speech utterances in which the keyword is uttered from speech utterances in which the keyword is not uttered. We describe a simple iterative algorithm for training a keyword spotter and discuss its formal properties. Experiments with the TIMIT corpus show that our method outperforms the conventional HMM-based approach. Further experiments using the TIMIT trained model, but tested on the WSJ dataset, show that without further training our method outperforms the conventional HMM-based approach.@article{grangier:2008:kws_journal,
author = "J. Keshet and D. Grangier and S. Bengio",
title = "Discriminative Keyword Spotting",
journal = "Speech Communication",
year = "2008",
}
This paper introduces a discriminative model for the retrieval of images from text queries. Our approach formalizes the retrieval task as a ranking problem, and introduces a learning procedure optimizing a criterion related to the ranking performance. The proposed model hence addresses the retrieval problem directly and does not rely on an intermediate image annotation task, which contrasts with previous research. Moreover, our learning procedure builds upon recent work on the online learning of kernel-based classifiers. This yields an efficient, scalable algorithm, which can benefit from recent kernels developed for image comparison. The experiments performed over stock photography data show the advantage of our discriminative ranking approach over state-of-the-art alternatives (e.g. our model yields 26.3% average precision over the Corel dataset, which should be compared to 22.0%, for the best alternative model evaluated). Further analysis of the results shows that our model is especially advantageous over difficult queries such as queries with few relevant pictures or multiple-word queries.@article{grangier:2008:tpami,
author = "D. Grangier and S. Bengio",
title = "A Discriminative Kernel-based Model to Rank Images from Text Queries",
journal = "IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)",
year = "2008",
}
This paper proposes a discriminative approach to template-based keyword detection. We introduce a method to learn the distance used to compare acoustic frames, a crucial element for template matching approaches. The proposed algorithm estimates the distance from data, with the objective to produce a detector maximizing the Area Under the receiver operating Curve (AUC), i.e. the standard evaluation measure for the keyword detection problem. The experiments performed over a large corpus, SpeechDatII, suggest that our model is effective compared to an HMM system, e.g. the proposed approach reaches 93.8% of averaged AUC compared to 87.9% for the HMM.@inproceedings{grangier:2007:rr_07-15,
author = "D. Grangier and S. Bengio",
title = "Learning the Inter-frame Distance for Discriminative Template-based Keyword Detection",
booktitle = "International Conference on Speech Processing (INTERSPEECH)",
year = "2007",
}
This paper proposes a new approach for keyword spotting, which is not based on HMMs. The proposed method employs a new discriminative learning procedure, in which the learning phase aims at maximizing the area under the ROC curve, as this quantity is the most common measure to evaluate keyword spotters. The keyword spotter we devise is based on non-linearly mapping the input acoustic representation of the speech utterance along with the target keyword into an abstract vector space. Building on techniques used for large margin methods for predicting whole sequences, our keyword spotter distills to a classifier in the abstract vector-space which separates speech utterances in which the keyword was uttered from speech utterances in which the keyword was not uttered. We describe a simple iterative algorithm for learning the keyword spotter and discuss its formal properties. Experiments with the TIMIT corpus show that our method outperforms the conventional HMM-based approach.@inproceedings{keshet:2007:nolisp,
author = "J. Keshet, D. Grangier and S. Bengio",
title = "Discriminative Keyword Spotting",
booktitle = "International Workshop on Non-LInear Speech Processing (NOLISP)",
year = "2007",
}
This work presents a neural network for the retrieval of images from text queries. The proposed network is composed of two main modules: the first one extracts a global picture representation from local block descriptors while the second one aims at solving the retrieval problem from the extracted representation. Both modules are trained jointly to minimize a loss related to the retrieval performance. This approach is shown to be advantageous when compared to previous models relying on unsupervised feature extraction: average precision over Corel queries reaches 26.2% for our model, which should be compared to 21.6% for PAMIR, the best alternative.@inproceedings{grangier:2006:icann,
author = "D. Grangier and S. Bengio",
title = "A Neural Network to Retrieve Images from Text Queries",
booktitle = "International Conference on Artificial Neural Networks (ICANN)}",
year = "2006",
}
This work presents a discriminative model for the retrieval of pictures from text queries. The core idea of this approach is to minimize a loss directly related to the retrieval performance of the model. For that purpose, we rely on a ranking loss which has recently been successfully applied to text retrieval problems. The experiments performed over the Corel dataset show that our approach compares favorably with generative models that constitute the state-of-the-art (e.g. our model reaches 21.6% mean average precision with Blob and SIFT features, compared to 16.7% for PLSA, the best alternative).@inproceedings{grangier:2006:amr,
author = "D. Grangier and F. Monay and S. Bengio",
title = "Learning to Retrieve Images from Text Queries with a Discriminative Model",
booktitle = "International Workshop on Adaptive Multimedia Retrieval (AMR)",
year = "2006",
}
This work proposes a new approach to the retrieval of images from text queries. Contrasting with previous work, this method relies on a discriminative model: the parameters are selected in order to minimize a loss related to the ranking performance of the model, i.e. its ability to rank the relevant pictures above the non-relevant ones when given a text query. In order to minimize this loss, we introduce an adaptation of the recently proposed Passive-Aggressive algorithm. The generalization performance of this approach is then compared with alternative models over the Corel dataset. These experiments show that our method outperforms the current state-of-the-art approaches, e.g. the average precision over Corel test data is 21.6% for our model versus 16.7% for the best alternative, Probabilistic Latent Semantic Analysis.@inproceedings{grangier:2006:ecml,
author = "D. Grangier and F. Monay and S. Bengio",
title = "A Discriminative Approach for the Retrieval of Images from Text Queries",
booktitle = "European Conference on Machine Learning (ECML)",
year = "2006",
}
In this report, we propose a discriminative decoder for phoneme recognition, i.e. the identification of the uttered phoneme sequence from a speech recording. This task is solved as a 3 step process: a phoneme classifier first classifies each accoustic frame, then temporal consistency features (TCF) are extracted from the phoneme classifier outputs, and finally a sequence decoder identifies the phoneme sequence according to the TCF.@techreport{grangier:2005:idiap-05-67,
author = "D. Grangier and S. Bengio",
title = "A Discriminative Decoder for the Recognition of Phoneme Sequences",
number = "67",
institution = "IDIAP",
year = "2005",
}
Information Retrieval (IR) aims at solving a ranking problem: given a query q and a corpus C, the documents of C should be ranked such that the documents relevant to q appear above the others. This task is generally performed by ranking the documents d in C according to their similarity with respect to q, sim (q,d). The identification of an effective function a,b -> sim(a,b) could be performed using a large set of queries with their corresponding relevance assessments. However, such data are especially expensive to label, thus, as an alternative, we propose to rely on hyperlink data which convey analogous semantic relationships. We then empirically show that a measure sim inferred from hyperlinked documents can actually outperform the state-of-the-art Okapi approach, when applied over a non-hyperlinked retrieval corpus.@inproceedings{grangier:2005:nips_workshop,
author = "D. Grangier and S. Bengio",
title = "Exploiting Hyperlinks to Learn a Retrieval Model",
booktitle = "NIPS Workshop on Learning to Rank",
year = "2005",
}
Assessing semantic similarity between text documents is a crucial aspect in Information Retrieval systems. In this work, we propose to use hyperlink information to derive a similarity measure that can then be applied to compare any text documents, with or without hyperlinks. As linked documents are generally semantically closer than unlinked documents, we use a training corpus with hyperlinks to infer a function a,b -> sim(a,b) that assigns a higher value to linked documents than to unlinked ones. Two sets of experiments on different corpora show that this function compares favorably with OKAPI matching on document retrieval tasks.@inproceedings{grangier:2005:cikm,
author = "D. Grangier and S. Bengio",
title = "Inferring Document Similarity from Hyperlinks",
booktitle = "ACM Conference on Information and Knowledge Management (CIKM)",
year = "2005",
}
This paper presents experiments that evaluate the effect of different video segmentation methods on text-based video retrieval. Segmentations relying on modalities like speech, video and text or their combination are compared with a baseline sliding window segmentation. The results suggest that even with the sliding window segmentation, acceptable performance can be obtained on a broadcast news retrieval task. Moreover, in the case where manually segmented data are available for training, the approach combining the different modalities can lead to IR results close to those obtained with a manual segmentation.@inproceedings{grangier:2005:icme,
author = "D. Grangier and A. Vinciarelli",
title = "Effect of Segmentation Method on Video Retrieval Performance",
booktitle = "IEEE International Conference on Multimedia and Expo (ICME)",
year = "2005",
}
This paper presents clustering experiments performed over noisy texts (i.e. texts that have been extracted through an automatic process like character or speech recognition). The effect of recognition errors is investigated by comparing clustering results performed over both clean (manually typed data) and noisy (automatic speech transcriptions) versions of the same speech recording corpus.@techreport{grangier:2004:idiap-04-82,
author = "D. Grangier and A. Vinciarelli",
title = "Effect of Recognition Errors on Text Clustering",
number = "82",
institution = "IDIAP",
year = "2004",
}
Spoken Document Retrieval (SDR) consists in retrieving segments of a speech database that are relevant to a query. The state-of-the-art approach to the SDR problem consists in transcribing the speech data into digital text before applying common Information Retrieval (IR) techniques. The transcription, produced by an Automatic Speech Recognition system, contains recognition errors. These errors can be referred to as noise. This thesis investigates the effect of this noise on the retrieval process. We compare the results obtained with clean and noisy data at different steps of the retrieval process. To perform such a task, standard IR measures (precision, recall, break-even point, etc.) are used. It is shown that even with very different error rates (10% vs 30%), the performances obtained over noisy text are only slightly lower than those over clean text (9% degradation of average precision for our complete IR system, 45.2% vs 41.2%).@techreport{com-03-08,
author = "D. Grangier and A. Vinciarelli and H. Bourlard",
title = "Information Retrieval on Noisy Text",
number = "8",
Keywords = "Information Retrieval, Speech, Spoken Documents Retrieval, Noisy Text",
institution = "IDIAP",
year = "2003",
}
Code & Demos
Tutorials
Colleagues and Co-Authors