
I am David Grangier, welcome to my homepage. I am currently with the Machine Learning Group at Apple Research, in Cupertino, California. I am interested in large-scale Machine Learning and its application to pattern analysis tasks, such as Information Retrieval, Speech Recognition and Natural Language Processing. My research interests are described below, while my resume and my LinkedIn profile give an overview of my previous experience.
Research Interests
Learning to rank has received increasing attention over the last
decade, due to applications in domains such as information retrieval.
These problems aim at assigning a confidence value to each
example of a set, such that the values induce a specific
ordering over the set. This task hence differs significantly
from classical machine learning problems, since the example
outputs cannot be considered independently.
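As a toy illustration of this ordering constraint, here is a minimal Python sketch (using numpy, and not taken from any specific paper) of a pairwise hinge loss for a linear scoring function: the loss depends only on score differences between examples, rather than treating each output independently.

import numpy as np

def pairwise_hinge_loss(w, X, y, margin=1.0):
    """Hinge loss over all pairs (i, j) where item i should rank above item j."""
    scores = X @ w
    loss, grad = 0.0, np.zeros_like(w)
    for i in range(len(y)):
        for j in range(len(y)):
            if y[i] > y[j]:  # item i is more relevant than item j
                violation = margin - (scores[i] - scores[j])
                if violation > 0:
                    loss += violation
                    grad += -(X[i] - X[j])  # push score[i] up and score[j] down
    return loss, grad

# Tiny example: 4 items with 3 features and graded relevance labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
y = np.array([2, 1, 0, 1])
w = np.zeros(3)
for _ in range(100):
    loss, grad = pairwise_hinge_loss(w, X, y)
    w -= 0.1 * grad
print("final ranking:", np.argsort(-(X @ w)), "labels:", y)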
Many effective learning algorithms, such as nearest neighbor classifiers or
support vector machines, crucially rely on a function to compare
examples. Such a function can be referred to as a distance metric, a similarity measure or
a kernel, depending on its mathematical properties. Rather than defining this
function from prior knowledge alone, new learning techniques have been
introduced to learn it from training data that provides information about the
desired proximity between examples.
In many problems, specific relations exist between the different inputs
or between the labels. For instance, computer vision features extracted
from the same image share spatial relations, and the phoneme classes
in speech belong to a hierarchical structure. Encoding prior information
about such structures, or learning such relationships, offers great opportunities
to improve machine learning approaches for several pattern recognition problems,
and several techniques have been introduced toward this objective in recent
years.
Publications
Specialist language models (LMs) focus on a specific task or domain on which they often outperform generalist LMs
of the same size. However, the specialist data needed to pretrain these models is only available in limited amounts
for most tasks. In this work, we build specialist models from large generalist training sets instead. We adjust
the training distribution of the generalist data with guidance from the limited domain-specific data. We explore
several approaches, with clustered importance sampling standing out. This method clusters the generalist dataset
and samples from these clusters based on their frequencies in the smaller specialist dataset. It is scalable,
suitable for pretraining and continued pretraining, and works well in multi-task settings. Our findings demonstrate
improvements across different domains in terms of language modeling perplexity and accuracy on multiple-choice
question tasks. We also present ablation studies that examine the impact of dataset sizes, clustering
configurations, and model sizes.
@inproceedings{grangier2025crisp,
title = {Task-Adaptive Pretrained Language Models via Clustered-Importance Sampling},
author = {Grangier, David and Fan, Simin and Seto, Skyler and Ablin, Pierre},
booktitle = {Proceedings of the International Conference on Learning Representations (ICLR)},
year = {2025},
url={https://doi.org/10.48550/arXiv.2410.03735},
}
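A rough numpy sketch of the clustered importance sampling idea described above, assuming both datasets already come with precomputed cluster assignments (the clustering pipeline and the exact procedure are described in the paper): generalist examples are resampled so that cluster frequencies match those observed in the small specialist set.

import numpy as np

def clustered_importance_weights(generalist_clusters, specialist_clusters, num_clusters):
    """Per-example sampling weights that reweight generalist clusters
    towards the cluster histogram of the specialist data."""
    # Cluster frequencies in each dataset (with a small floor to avoid division by zero).
    p_gen = np.bincount(generalist_clusters, minlength=num_clusters) + 1e-9
    p_spec = np.bincount(specialist_clusters, minlength=num_clusters) + 1e-9
    p_gen, p_spec = p_gen / p_gen.sum(), p_spec / p_spec.sum()
    # Importance ratio per cluster, then per generalist example.
    ratio = p_spec / p_gen
    weights = ratio[generalist_clusters]
    return weights / weights.sum()

rng = np.random.default_rng(0)
generalist_clusters = rng.integers(0, 8, size=100_000)   # e.g. k-means cluster ids
specialist_clusters = rng.choice(8, size=500, p=[0.5, 0.2, 0.1, 0.1, 0.05, 0.03, 0.01, 0.01])
w = clustered_importance_weights(generalist_clusters, specialist_clusters, num_clusters=8)
resampled = rng.choice(len(generalist_clusters), size=10_000, p=w)   # indices to train on
print(np.bincount(generalist_clusters[resampled], minlength=8) / 10_000)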
Momentum based optimizers are central to a wide range of machine learning applications. These typically rely on an
Exponential Moving Average (EMA) of gradients, which exponentially decays the contribution of older
gradients. This accounts for gradients being local linear approximations which lose their relevance as the iterate
moves along the loss landscape. This work questions the use of a single EMA to accumulate past gradients and
empirically demonstrates how this choice can be sub-optimal: a single EMA cannot simultaneously give a high weight
to the immediate past, and a non-negligible weight to older gradients. Building on this observation, we propose
AdEMAMix, a simple modification of the Adam optimizer with a mixture of two EMAs to better take advantage of past
gradients. Our experiments on language modeling and image classification show -- quite surprisingly -- that
gradients can stay relevant for tens of thousands of steps. They help to converge faster, and often to lower
minima: e.g., a 1.3B parameter AdEMAMix LLM trained on 101B tokens performs comparably to an AdamW model trained
on 197B tokens (+95%). Moreover, our method significantly slows down model forgetting during training. Our work
motivates further exploration of different types of functions to leverage past gradients, beyond EMAs.
@inproceedings{pagliardini2025ademamix,
author={Matteo Pagliardini and Pierre Ablin and David Grangier},
title={The AdEMAMix Optimizer: Better, Faster, Older},
booktitle = {Proceedings of the International Conference on Learning Representations (ICLR)},
year={2025},
url={https://doi.org/10.48550/arXiv.2409.03137},
}
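A hedged numpy sketch of the core idea above, a momentum term mixing a fast and a slow EMA of gradients. The coefficients and the exact combination below are illustrative placeholders, not the precise AdEMAMix update rule from the paper.

import numpy as np

def two_ema_adam_step(param, grad, state, lr=1e-3, beta1=0.9, beta2=0.999,
                      beta3=0.9999, alpha=5.0, eps=1e-8):
    """One optimizer step mixing a fast EMA (m1) and a slow EMA (m2) of gradients."""
    state["t"] += 1
    state["m1"] = beta1 * state["m1"] + (1 - beta1) * grad      # fast EMA: recent past
    state["m2"] = beta3 * state["m2"] + (1 - beta3) * grad      # slow EMA: older gradients
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2   # second moment, as in Adam
    m1_hat = state["m1"] / (1 - beta1 ** state["t"])            # bias correction for the fast EMA
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    update = (m1_hat + alpha * state["m2"]) / (np.sqrt(v_hat) + eps)
    return param - lr * update

# Toy quadratic: minimize 0.5 * ||x||^2, whose gradient is x itself.
x = np.ones(4)
state = {"m1": np.zeros(4), "m2": np.zeros(4), "v": np.zeros(4), "t": 0}
for _ in range(2000):
    x = two_ema_adam_step(x, grad=x, state=state)
print(np.linalg.norm(x))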
We introduce SmallTalk LM, an innovative method for training a mixture of language models in an almost
asynchronous manner. Each model of the mixture specializes in distinct parts of the data distribution, without the
need for high-bandwidth communication between the nodes training each model. At inference, a lightweight router
directs a given sequence to a single expert, according to a short prefix. This inference scheme naturally uses a
fraction of the parameters from the overall mixture model. Our experiments on language modeling demonstrate that
SmallTalk LM achieves significantly lower perplexity than dense model baselines for the same total training FLOPs
and an almost identical inference cost. Finally, in our downstream evaluations we outperform the dense baseline on
75% of the tasks.
@inproceedings{filippova2025no,
title={No Need to Talk: Asynchronous Mixture of Language Models},
author={Anastasiia Filippova and Angelos Katharopoulos and David Grangier and Ronan Collobert},
booktitle = {Proceedings of the International Conference on Learning Representations (ICLR)},
year={2025},
url={https://arxiv.org/abs/2410.03529},
}
Large language models are trained on massive scrapes of the web, as required by current scaling laws. Most
progress is made for English, given its abundance of high-quality pretraining data. For most other languages,
however, such high-quality pretraining data is unavailable. In this work, we study how to boost pretrained model
performance in a data-constrained target language by enlisting data from an auxiliary language for which
high-quality data is available. We study this by quantifying the performance gap between training with data in a
data-rich auxiliary language compared with training in the target language, exploring the benefits of translation
systems, studying the limitations of model scaling for data-constrained languages, and proposing new methods for
upsampling data from the auxiliary language. Our results show that stronger auxiliary datasets result in
performance gains without modification to the model or training objective for close languages, and, in particular,
that performance gains due to the development of more information-rich English pretraining datasets can extend to
targeted language settings with limited data.
@inproceedings{seto2024bilingual,
title={Training Bilingual LMs with Data Constraints in the Targeted Language},
author={Seto, Skyler and ter Hoeve, Maartje and Bai, He and Schluter, Natalie and Grangier, David},
booktitle = {Findings of the Association for Computational Linguistics (ACL)},
year={2025},
url={https://arxiv.org/abs/2411.12986},
}
Machine learning models are routinely trained on a mixture of different data domains. Different domain weights
yield very different downstream performances. We propose the Soup-of-Experts, a novel architecture that can
instantiate a model at test time for any domain weights with minimal computational cost and without re-training
the model. Our architecture consists of a bank of expert parameters, which are linearly combined to instantiate
one model. We learn the linear combination coefficients as a function of the input domain weights. To train this
architecture, we sample random domain weights, instantiate the corresponding model, and backprop through one batch
of data sampled with these domain weights. We demonstrate how our approach obtains small specialized models on
several language modeling tasks quickly. Soup-of-Experts are particularly appealing when one needs to ship many
different specialist models quickly under a model size constraint.
@inproceedings{ablin2025soup,
author={Pierre Ablin and Angelos Katharopoulos and Skyler Seto and David Grangier},
title={Soup-of-Experts: Pretraining Specialist Models via Parameters Averaging},
booktitle = {Proceedings of the International Conference on Machine Learning (ICML)},
year={2025},
url={https://arxiv.org/abs/2502.01804},
}
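A minimal numpy sketch of the parameter-soup idea for a single weight matrix, with a made-up coefficient function: expert parameters are combined linearly, and the combination coefficients are produced from the input domain weights (here a simple softmax of a random linear map, purely for illustration).

import numpy as np

rng = np.random.default_rng(0)
num_experts, d_in, d_out, num_domains = 4, 8, 8, 3

# Bank of expert parameters (one weight matrix per expert) and a coefficient head.
expert_bank = rng.normal(size=(num_experts, d_in, d_out))
coeff_head = rng.normal(size=(num_domains, num_experts)) * 0.1

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def instantiate_model(domain_weights):
    """Instantiate one weight matrix for the given domain histogram at negligible cost."""
    coeffs = softmax(domain_weights @ coeff_head)     # (num_experts,)
    W = np.tensordot(coeffs, expert_bank, axes=1)     # linear combination of the experts
    return W

# Two different specialist mixtures yield two different instantiated models.
W_legal = instantiate_model(np.array([0.8, 0.1, 0.1]))
W_medical = instantiate_model(np.array([0.1, 0.1, 0.8]))
print(np.linalg.norm(W_legal - W_medical))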
A widespread strategy to obtain a language model that performs well on a target domain is to finetune a pretrained
model to perform unsupervised next-token prediction on data from that target domain. Finetuning presents two
challenges: (i) if the amount of target data is limited, as in most practical applications, the model will quickly
overfit, and (ii) the model will drift away from the original model, forgetting the pretraining data and the
generic
knowledge that comes with it. We aim to derive scaling laws that quantify these two phenomena for various target
domains, amounts of available target data, and model scales. We measure the efficiency of injecting pretraining
data
into the finetuning data mixture to avoid forgetting and mitigate overfitting. A key practical takeaway from our
study is that injecting as little as 1% of pretraining data in the finetuning data mixture prevents the model from
forgetting the pretraining set.
@inproceedings{bethune2025finetuning,
title={Scaling Laws for Forgetting during Finetuning with Pretraining Data Injection},
author={Louis Bethune and David Grangier and Dan Busbridge and Eleonora Gualdoni and Marco Cuturi and Pierre Ablin},
booktitle = {Proceedings of the International Conference on Machine Learning (ICML)},
year={2025},
url={https://arxiv.org/abs/2502.06042},
}
@article{fan2024dga,
author={Simin Fan and David Grangier and Pierre Ablin},
title={Dynamic Gradient Alignment for Online Data Mixing},
journal={arXiv},
volume={2410.02498},
year={2024},
url={https://doi.org/10.48550/arXiv.2410.02498},
}
Large language models are versatile tools but are not suitable for small inference budgets. Small models have more
efficient inference, but their lower capacity means that their performance can be good only if one limits their
scope to a specialized domain. This paper explores how to get good specialized small language models using a
large, generic, pretraining set and a limited amount of specialized data. We consider two scenarios, depending on
whether (i) one can afford pretraining a model for each specialization task, or (ii) one wants to cheaply adapt a
single pretrained model for each task. In the first scenario, we propose an effective solution based on importance
sampling: we resample the pretraining set to imitate the specialization data and train a small model on it. In the
second scenario, we propose a novel architecture, projected networks (PN). PN is a large network whose parameters
can be linearly projected into a small network for specialization. For both scenarios, we demonstrate the
empirical effectiveness of our solutions across various domains, training set sizes, and training budgets.
@article{grangier2024slm,
author={David Grangier and Angelos Katharopoulos and Pierre Ablin and Awni Hannun},
title={Need a Small Specialized Language Model? Plan Early!},
journal={arXiv},
volume={2402.01093},
year={2024},
url={https://doi.org/10.48550/arXiv.2402.01093},
}
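A toy numpy sketch of the projected-network idea mentioned above for a single linear layer: a large weight matrix is linearly projected into a smaller one for specialization. The projection matrices and dimensions below are illustrative assumptions, not the parameterization used in the paper.

import numpy as np

rng = np.random.default_rng(0)
d_large, d_small = 1024, 256

# Large, generically pretrained weight matrix (stand-in for one layer of the big model).
W_large = rng.normal(size=(d_large, d_large)) / np.sqrt(d_large)

# Learned linear projections that cut a small specialist layer out of the large one.
P_in = rng.normal(size=(d_small, d_large)) / np.sqrt(d_large)
P_out = rng.normal(size=(d_large, d_small)) / np.sqrt(d_large)

W_small = P_in @ W_large @ P_out    # (d_small, d_small) specialist weight
x = rng.normal(size=d_small)
print(W_small.shape, (W_small @ x).shape)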
Large language models are versatile tools but are not suitable for small inference budgets. Small models have more
efficient inference but their lower capacity means that their performance can be good only if one limits their
scope to a specialized domain. This paper explores how to get a small language model with good specialized
accuracy, even when specialization data is unknown during pretraining. We propose a novel architecture, projected
networks (PN). PN is a high capacity network whose parameters can be linearly projected into a small network for
fine tuning. We assess the empirical effectiveness of our solution compared to small model training, distillation
and hard mixture of experts.
@inproceedings{grangier2024projected,
title={Projected Language Models: A Large Model Pre-Segmented Into Smaller Ones},
author={Grangier, David and Katharopoulos, Angelos and Ablin, Pierre and Hannun, Awni},
booktitle={ICML 2024 FM-Wild Workshop},
year={2024},
url={https://openreview.net/forum?id=Wi88giKi7N},
}
Large pretrained vision-language models like CLIP have shown promising generalization capability, but may struggle
in specialized domains (e.g., satellite imagery) or fine-grained classification (e.g., car models) where the
visual concepts are unseen or under-represented during pretraining. Prompt learning offers a parameter-efficient
finetuning framework that can adapt CLIP to downstream tasks even when limited annotation data are available. In
this paper, we improve prompt learning by distilling the textual knowledge from natural language prompts (either
human- or LLM-generated) to provide rich priors for those under-represented concepts. We first obtain a prompt
"summary" aligned to each input image via a learned prompt aggregator. Then we jointly train a prompt generator,
optimized to produce a prompt embedding that stays close to the aggregated summary while minimizing task loss at
the same time. We dub such prompt embedding as Aggregate-and-Adapted Prompt Embedding (AAPE). AAPE is shown to be
able to generalize to different downstream data distributions and tasks, including vision-language understanding
tasks (e.g., few-shot classification, VQA) and generation tasks (image captioning) where AAPE achieves competitive
performance. We also show AAPE is particularly helpful to handle non-canonical and OOD examples. Furthermore, AAPE
learning eliminates LLM-based inference cost as required by baselines, and scales better with data and LLM model
size.
@inproceedings{huang2024aggregate,
author={Chen Huang and Skyler Seto and Samira Abnar and David Grangier and Navdeep Jaitly and Josh Susskind},
title={Aggregate-and-Adapt Natural Language Prompts for Downstream Generalization of CLIP},
booktitle={Advances in Neural Information Processing Systems},
year={2024},
url={https://neurips.cc/virtual/2024/poster/94659},
}
@inproceedings{maini2024rephrase,
author={Pratyush Maini and Skyler Seto and Richard He Bai and David Grangier and Yizhe Zhang and Navdeep Jaitly},
title={Rephrasing the Web: {A} Recipe for Compute and Data-Efficient Language Modeling},
booktitle={Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL)},
year={2024},
pages={14044--14072},
}
Large neural networks pretrained on web-scale corpora are central to modern machine learning. In this paradigm,
the distribution of the large, heterogeneous pretraining data rarely matches that of the application domain. This
work considers modifying the pretraining distribution in the case where one has a small sample of data reflecting
the targeted test conditions. We propose an algorithm motivated by a recent formulation of this setting as an
online, bilevel optimization problem. With scalability in mind, our algorithm prioritizes computing gradients at
training points which are likely to most improve the loss on the targeted distribution. Empirically, we show that
in some cases this approach is beneficial over existing strategies from the domain adaptation literature but may
not succeed in other cases. We propose a simple test to evaluate when our approach can be expected to work well
and point towards further research to address current limitations.
@article{grangier2024adaptive,
title={Adaptive Training Distributions with Scalable Online Bilevel Optimization},
author={Grangier, David and Ablin, Pierre and Hannun, Awni},
journal={Transactions on Machine Learning Research (TMLR)},
year={2024},
url={https://openreview.net/forum?id=JP1GVyF5i5},
}
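A simplified numpy sketch of the gradient-prioritization idea described above, using a linear regression model so per-example gradients have a closed form: each candidate training point is scored by the dot product between its gradient and the gradient of the loss on the small target sample, and the best-aligned points are selected for the next update. The actual algorithm in the paper operates online on large neural networks; this is only a sketch of the selection rule.

import numpy as np

rng = np.random.default_rng(0)
d = 5
w_target = rng.normal(size=d)

# Heterogeneous pretraining pool: half matches the target task, half does not.
X_pool = rng.normal(size=(1000, d))
y_pool = np.concatenate([
    X_pool[:500] @ w_target,                 # on-target portion of the pool
    X_pool[500:] @ rng.normal(size=d),       # off-target portion
])
X_tgt = rng.normal(size=(20, d))             # small sample of targeted test conditions
y_tgt = X_tgt @ w_target

w = np.zeros(d)
for step in range(200):
    g_tgt = 2 * X_tgt.T @ (X_tgt @ w - y_tgt) / len(X_tgt)      # target-loss gradient
    idx = rng.choice(len(X_pool), size=64, replace=False)       # candidate batch from the pool
    residual = X_pool[idx] @ w - y_pool[idx]
    per_example_grads = 2 * residual[:, None] * X_pool[idx]     # (64, d)
    align = per_example_grads @ g_tgt                           # alignment with the target gradient
    keep = idx[np.argsort(-align)[:16]]                         # prioritize best-aligned points
    g = 2 * X_pool[keep].T @ (X_pool[keep] @ w - y_pool[keep]) / len(keep)
    w -= 0.05 * g
print("target loss:", float(np.mean((X_tgt @ w - y_tgt) ** 2)))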
We introduce AudioLM, a framework for high-quality audio generation with long-term consistency. AudioLM maps the
input audio to a sequence of discrete tokens and casts audio generation as a language modeling task in this
representation space. We show how existing audio tokenizers provide different trade-offs between reconstruction
quality and long-term structure, and we propose a hybrid tokenization scheme to achieve both objectives. Namely,
we leverage the discretized activations of a masked language model pre-trained on audio to capture long-term
structure and the discrete codes produced by a neural audio codec to achieve high-quality synthesis. By training
on large corpora of raw audio waveforms, AudioLM learns to generate natural and coherent continuations given short
prompts. When trained on speech, and without any transcript or annotation, AudioLM generates syntactically and
semantically plausible speech continuations while also maintaining speaker identity and prosody for unseen
speakers. Furthermore, we demonstrate how our approach extends beyond speech by generating coherent piano music
continuations, despite being trained without any symbolic representation of music.
@article{audio-lm-generation-2023,
author={Borsos, Zalán and Marinier, Raphaël and Vincent, Damien and Kharitonov, Eugene and Pietquin, Olivier and Sharifi, Matt and Roblek, Dominik and Teboul, Olivier and Grangier, David and Tagliasacchi, Marco and Zeghidour, Neil},
journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
title={AudioLM: A Language Modeling Approach to Audio Generation},
year={2023},
volume={31},
number={},
pages={2523-2533},
keywords={Semantics;Acoustics;Training;Computational modeling;Codecs;Predictive models;Task analysis;Computer generated music;speech synthesis},
doi={10.1109/TASLP.2023.3288409}
}
Pre-trained models are growing increasingly large which can be problematic for applications with strong inference
constraints. Fortunately, task-aware structured pruning offers a solution. While existing pruning algorithms can
be efficient, the common practical setting where task-specific data is limited is yet to be addressed. To
ameliorate the data scarcity problem, we propose a structured pruning strategy that leverages transfer learning.
Detailed analyses of simple transfer learning based remedies lead us to a simple, flexible formulation of what,
how and when to transfer, resulting in pruned models with improved generalization over strong baselines.
@inproceedings{dery-grangier-hannun-structured-pruning,
title={Transfer Learning for Structured Pruning under Limited Task Data},
author={Lucio Dery and David Grangier and Awni Hannun},
year={2023},
booktitle = {Third Workshop on Efficient Natural Language and Speech Processing (ENLSP-III)},
}
Language models trained on very large web corpora have become a central piece of modern language processing. In
this paradigm, the large, heterogeneous training set rarely matches the distribution of the application domain.
This work considers modifying the training distribution in the case where one can observe a small sample of data
reflecting the test conditions. We propose an algorithm based on a recent formulation of this problem as an online,
bilevel optimization problem. We show that this approach compares favorably with alternative strategies from the
domain adaptation literature.
@inproceedings{grangier-ablin-hannun-bilevel-learn-lm-distribution,
title={Bilevel Optimization to Learn Training Distributions for Language Modeling under Domain Shift},
author={David Grangier and Pierre Ablin and Awni Hannun},
year={2023},
booktitle = {NeurIPS 2023 Workshop on Distribution Shifts},
}
This work connects language model adaptation with concepts of machine learning theory. We consider a training
setup with a large out-of-domain set and a small in-domain set. We derive how the benefit of training a model on
either set depends on the size of the sets and the distance between their underlying distributions. We analyze how
out-of-domain pre-training before in-domain fine-tuning achieves better generalization than either solution
independently. Finally, we present how adaptation techniques based on data selection, such as importance sampling,
intelligent data selection and influence functions, can be cast in a common framework which highlights their
similarity and also their subtle differences.
@inproceedings{grangier2022tradeoffs,
title={The Trade-offs of Domain Adaptation for Neural Language Models},
author={David Grangier and Dan Iter},
year={2022},
booktitle = {Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL)},
}
Machine translation (MT) evaluation often focuses on accuracy and fluency, without paying much attention to
translation style. This means that, even when considered accurate and fluent, MT output can still sound less
natural than high quality human translations or text originally written in the target language.
Machine translation output notably exhibits lower lexical diversity, and employs constructs that mirror those in
the source sentence. In this work we propose a method for training MT systems to achieve a more natural style,
i.e. mirroring the style of text originally written in the target language. Our method tags parallel training data
according to the naturalness of the target side by contrasting language models trained on natural and translated
data.
Tagging data allows us to put greater emphasis on target sentences originally written in the target language.
Automatic metrics show that the resulting models achieve lexical richness on par with human translations,
mimicking a style much closer to sentences originally written in the target language. Furthermore, we find that
their output is preferred by human experts when compared to the baseline translations.
@inproceedings{freitag-natural-diet-translation-2022,
title={A Natural Diet: Towards Improving Naturalness of Machine Translation},
author={Markus Freitag and David Vilar and David Grangier and Colin Cherry and George Foster},
booktitle={Findings of the Annual Meeting of the Association for Computational Linguistics (ACL)},
year={2022},
}
Convolutional neural networks typically contain several downsampling operators, such as strided convolutions or
pooling layers, that progressively reduce the resolution of intermediate representations. This provides some
shift-invariance while reducing the computational complexity of the whole architecture. A critical hyperparameter
of such layers is their stride: the integer factor of downsampling. As strides are not differentiable, finding the
best configuration either requires cross-validation or discrete optimization (e.g. architecture search), which
rapidly become prohibitive as the search space grows exponentially with the number of downsampling layers. Hence,
exploring this search space by gradient descent would allow finding better configurations at a lower computational
cost. This work introduces DiffStride, the first downsampling layer with learnable strides. Our layer learns the
size of a cropping mask in the Fourier domain, that effectively performs resizing in a differentiable way.
Experiments on audio and image classification show the generality and effectiveness of our solution: we use
DiffStride as a drop-in replacement to standard downsampling layers and outperform them. In particular, we show
that introducing our layer into a ResNet-18 architecture allows keeping consistent high performance on CIFAR10,
CIFAR100 and ImageNet even when training starts from poor random stride configurations. Moreover, formulating
strides as learnable variables allows us to introduce a regularization term that controls the computational
complexity of the architecture. We show how this regularization allows trading off accuracy for efficiency on
ImageNet.
@inproceedings{riad-learning-strides-convnets-2022,
title={Learning Strides in Convolutional Neural Networks},
author={Rachid Riad and Olivier Teboul and David Grangier and Neil Zeghidour},
booktitle={International Conference on Learning Representations (ICLR)},
year={2022},
}
In Neural Machine Translation, it is typically assumed that the sentence with the highest estimated probability
should also be the translation with the highest quality as measured by humans. In this work, we question this
assumption and show that model estimates and translation quality only vaguely correlate. We apply Minimum Bayes
Risk (MBR) decoding on unbiased samples to optimize diverse automated metrics of translation quality as an
alternative inference strategy to beam search. Instead of targeting the hypotheses with the highest model
probability, MBR decoding extracts the hypotheses with the highest estimated quality. Our experiments show that
the combination of a neural translation model with a neural reference-based metric, BLEURT, results in significant
improvement in human evaluations. This improvement is obtained with translations different from classical
beam-search output: These translations have much lower model likelihood and are less favored by surface metrics
like BLEU.
@article{freitag-etal-2022-high,
title = "High Quality Rather than High Model Probability: Minimum {B}ayes Risk Decoding with Neural Metrics",
author = "Freitag, Markus and Grangier, David and Tan, Qijun and Liang, Bowen",
editor = "Roark, Brian and Nenkova, Ani",
journal = "Transactions of the Association for Computational Linguistics",
volume = "10",
year = "2022",
url = "https://aclanthology.org/2022.tacl-1.47/"
}
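A small Python sketch of Minimum Bayes Risk decoding over sampled candidates. The utility function below is a stand-in token-level F1, chosen only to keep the example self-contained; the paper uses a neural metric such as BLEURT. Each hypothesis is scored by its average utility against the other samples used as pseudo-references.

from collections import Counter

def token_f1(hyp, ref):
    """Toy utility: token-level F1 between two strings (stand-in for a neural metric)."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    overlap = sum((h & r).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / sum(h.values()), overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def mbr_decode(samples, utility=token_f1):
    """Pick the sample with the highest average utility against all other samples."""
    best, best_score = None, float("-inf")
    for hyp in samples:
        score = sum(utility(hyp, ref) for ref in samples if ref is not hyp) / (len(samples) - 1)
        if score > best_score:
            best, best_score = hyp, score
    return best, best_score

# Unbiased samples from a (hypothetical) translation model for one source sentence.
samples = [
    "the cat sits on the mat",
    "the cat is sitting on the mat",
    "a cat sat on a mat",
    "on the mat sits the cat",
]
print(mbr_decode(samples))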
Human evaluation of modern high-quality machine translation systems is a difficult problem, and there is
increasing evidence that inadequate evaluation procedures can lead to erroneous conclusions. While there has been
considerable research on human evaluation, the field still lacks a commonly-accepted standard procedure. As a step
toward this goal, we propose an evaluation methodology grounded in explicit error analysis, based on the
Multidimensional Quality Metrics (MQM) framework. We carry out the largest MQM research study to date, scoring the
outputs of top systems from the WMT 2020 shared task in two language pairs using annotations provided by
professional translators with access to full document context. We analyze the resulting data extensively, finding
among other results a substantially different ranking of evaluated systems from the one established by the WMT
crowd workers, exhibiting a clear preference for human over machine output. Surprisingly, we also find that
automatic metrics based on pre-trained embeddings can outperform human crowd workers. We make our corpus publicly
available for further research.
@article{freitag2021experts,
title={Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation},
author={Markus Freitag and George Foster and David Grangier and Viresh Ratnakar and Qijun Tan and Wolfgang Macherey},
journal={Transactions of the Association for Computational Linguistics (TACL)},
year={2021},
}
We introduce DIVE, an end-to-end speaker diarization algorithm. Our neural algorithm presents the diarization task
as an iterative process: it repeatedly builds a representation for each speaker before predicting the voice
activity of each speaker conditioned on the extracted representations. This strategy intrinsically resolves the
speaker ordering ambiguity without requiring the classical permutation invariant training loss. In contrast with
prior work, our model does not rely on pretrained speaker representations and optimizes all parameters of the
system with a multi-speaker voice activity loss. Importantly, our loss explicitly excludes unreliable speaker turn
boundaries from training, which is adapted to the standard collar-based Diarization Error Rate (DER) evaluation.
Overall, these contributions yield a system redefining the state-of-the-art on the standard CALLHOME benchmark,
with 6.7% DER compared to 7.8% for the best alternative.
@inproceedings{zeghidour-teboul-grangier-dive-2021,
title={DIVE: End-to-end Speech Diarization via Iterative Speaker Embedding},
author={Neil Zeghidour and Olivier Teboul and David Grangier},
booktitle = {{IEEE} Automatic Speech Recognition and Understanding Workshop ({ASRU})},
year={2021},
}
While deep learning has been very beneficial in data-rich settings, tasks with smaller training sets
often resort to pre-training or multitask learning to leverage data from other tasks. In this case, careful
consideration is needed to select tasks and model parameterizations such that updates from the auxiliary tasks
actually help the primary task. We seek to alleviate this burden by formulating a model-agnostic framework that
performs fine-grained manipulation of the auxiliary task gradients. We propose to decompose auxiliary updates into
directions which help, damage or leave the primary task loss unchanged. This allows weighting the update
directions
differently depending on their impact on the problem of interest. We present a novel and efficient algorithm for
that purpose and show its advantage in practice. Our method leverages efficient automatic differentiation
procedures and randomized singular value decomposition for scalability. We show that our framework is generic and
encompasses some prior work as particular cases. Our approach consistently outperforms strong and widely used
baselines when leveraging out-of-distribution data for Text and Image classification tasks.
@inproceedings{ldery-aux-taks-iclr21,
title={Auxiliary Task Update Decomposition: The Good, The Bad and The Neutral},
author={Lucio Dery and Yann Dauphin and David Grangier},
booktitle={International Conference on Learning Representations (ICLR)},
year={2021},
}
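A minimal numpy sketch of decomposing an auxiliary gradient with respect to the primary gradient. This uses a plain vector projection for a single parameter vector; the paper's algorithm relies on automatic differentiation and randomized SVD to scale to full models. The component that helps the primary loss, the component that damages it, and the neutral orthogonal component can then be reweighted separately.

import numpy as np

def decompose_aux_update(g_aux, g_primary, w_help=1.0, w_damage=0.0, w_neutral=0.5):
    """Split the auxiliary gradient into helpful / damaging / neutral parts
    relative to the primary-task gradient, then reweight them."""
    direction = g_primary / (np.linalg.norm(g_primary) + 1e-12)
    parallel = (g_aux @ direction) * direction    # component along the primary gradient
    orthogonal = g_aux - parallel                 # leaves the primary loss unchanged (to first order)
    if g_aux @ direction >= 0:                    # descending on g_aux also lowers the primary loss
        helpful, damaging = parallel, np.zeros_like(parallel)
    else:
        helpful, damaging = np.zeros_like(parallel), parallel
    return w_help * helpful + w_damage * damaging + w_neutral * orthogonal

rng = np.random.default_rng(0)
g_primary = rng.normal(size=10)
g_aux = rng.normal(size=10)
update = decompose_aux_update(g_aux, g_primary)
print(update @ g_primary / (np.linalg.norm(update) * np.linalg.norm(g_primary)))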
We introduce Wavesplit, an end-to-end source separation system. From a single mixture, the model infers a
representation for each source and then estimates each source signal given the inferred representations. The model
is trained to jointly perform both tasks from the raw waveform. Wavesplit infers a set of source representations
via clustering, which addresses the fundamental permutation problem of separation. For speech separation, our
sequence-wide speaker representations provide a more robust separation of long, challenging recordings compared to
prior work. Wavesplit redefines the state-of-the-art on clean mixtures of 2 or 3 speakers (WSJ0-2/3mix), as well
as in noisy and reverberated settings (WHAM/WHAMR). We also set a new benchmark on the recent LibriMix dataset.
Finally, we show that Wavesplit is also applicable to other domains, by separating fetal and maternal heart rates
from a single abdominal electrocardiogram.
@article{zeghidour-grangier-wavesplit-2021,
title={Wavesplit: End-to-End Speech Separation by Speaker Clustering},
author={Neil Zeghidour and David Grangier},
journal = {{IEEE}/{ACM} Transactions on Audio, Speech, and Language Processing (TASLP)},
year={2021},
}
We introduce COLA, a self-supervised pre-training approach for learning a general-purpose representation of audio.
Our approach is based on contrastive learning: it learns a representation which assigns high similarity to audio
segments extracted from the same recording while assigning lower similarity to segments from different recordings.
We build on top of recent advances in contrastive learning for computer vision and reinforcement learning to
design a lightweight, easy-to-implement self-supervised model of audio. We pre-train embeddings on the large-scale
Audioset database and transfer these representations to 9 diverse classification tasks, including speech, music,
animal sounds, and acoustic scenes. We show that despite its simplicity, our method significantly outperforms
previous self-supervised systems. We furthermore conduct ablation studies to identify key design choices and
release a library to pre-train and fine-tune COLA models.
@inproceedings{saeeds-contrastive-audio-2021,
title={Contrastive Learning of General-Purpose Audio Representations},
author={Aaqib Saeed and David Grangier and Neil Zeghidour},
booktitle={International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
year={2021},
}
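A compact numpy sketch of the contrastive objective described above: two segments from the same recording form a positive pair, segments from other recordings in the batch serve as negatives, and a softmax cross-entropy pulls each anchor towards its positive. The encoder is a random linear map here purely for illustration; the actual model and similarity function are described in the paper.

import numpy as np

def contrastive_loss(anchors, positives, temperature=0.1):
    """InfoNCE-style loss: anchors[i] should be most similar to positives[i]."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                  # (batch, batch) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))             # positive pairs lie on the diagonal

rng = np.random.default_rng(0)
batch, feat_dim, emb_dim = 8, 64, 16
encoder = rng.normal(size=(feat_dim, emb_dim)) / np.sqrt(feat_dim)  # stand-in audio encoder
recordings = rng.normal(size=(batch, feat_dim))
segment_a = recordings + 0.1 * rng.normal(size=(batch, feat_dim))   # two segments per recording
segment_b = recordings + 0.1 * rng.normal(size=(batch, feat_dim))
print(contrastive_loss(segment_a @ encoder, segment_b @ encoder))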
We propose CHARM, a method for training a single neural network across inconsistent input channels. Our work is
motivated by Electroencephalography (EEG), where data collection protocols from different headsets result in
varying channel ordering and number, which limits the feasibility of transferring trained systems across datasets.
Our approach builds upon attention mechanisms to estimate a latent reordering matrix from each input signal and
map input channels to a canonical order. CHARM is differentiable and can be composed further with architectures
expecting a consistent channel ordering to build end-to-end trainable classifiers. We perform experiments on four
EEG classification datasets and demonstrate the efficacy of CHARM via simulated shuffling and masking of input
channels. Moreover, our method improves the transfer of pre-trained representations between datasets collected
with different protocols.
@inproceedings{saeeds-eeg-reordering-2021,
title={Learning from Heterogeneous EEG Signals with Differentiable Channel Reordering},
author={Aaqib Saeed and David Grangier and Olivier Pietquin and Neil Zeghidour},
booktitle={International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
year={2021},
}
Domain adaptation of neural networks commonly relies on three training phases: pretraining, selected data training
and then fine tuning. Data selection improves target domain generalization by training further on pretraining data
identified by relying on a small sample of target domain data. This work examines the benefit of data selection
for language modeling and machine translation. Our experiments assess the complementarity of selection with fine
tuning and result in practical recommendations: (i) selected data must be similar to the fine-tuning domain but
not so much as to erode the complementary effect of fine-tuning; (ii) there is a trade-off between selecting
little data for fast but limited progress or much data for slow but long lasting progress; (iii) data selection
can be applied early during pretraining, with performance gains comparable to a long pretraining session; (iv) data
selection from domain classifiers is often more effective than the popular contrastive data selection method.
@misc{iter2021complementarity,
title={On the Complementarity of Data Selection and Fine Tuning for Domain Adaptation},
author={Dan Iter and David Grangier},
year={2021},
eprint={2109.07591},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Whereas existing literature on unsupervised machine translation (MT) focuses on exploiting unsupervised techniques
for low-resource language pairs where bilingual training data is scarce or unavailable, we investigate whether
unsupervised MT can also improve translation quality of high-resource language pairs where sufficient bitext does
exist. We compare the style of correct translations generated by either supervised or unsupervised MT and find
that the unsupervised output is less monotonic and more natural than supervised output. We demonstrate a way to
combine the benefits of unsupervised and supervised MT into a single system, resulting in better human evaluation
of quality and fluency. Our results open the door to discussions about the potential contributions of unsupervised
MT in high-resource settings, and how supervised and unsupervised systems might be mutually-beneficial.
@misc{marchisio2021unsupervised,
title={What Can Unsupervised Machine Translation Contribute to High-Resource Language Pairs?},
author={Kelly Marchisio and Markus Freitag and David Grangier},
year={2021},
eprint={2106.15818},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Self-attention has recently been adopted for a wide range of sequence modeling problems. Despite its
effectiveness, self-attention suffers from quadratic compute and memory requirements with respect to sequence
length. Successful approaches to reduce this complexity focused on attending to local sliding windows or a small
set of locations independent of content. Our work proposes to learn dynamic sparse attention patterns that avoid
allocating computation and memory to attend to content unrelated to the query of interest. This work builds upon
two lines of research: it combines the modeling flexibility of prior work on content-based sparse attention with
the efficiency gains from approaches based on local, temporal sparse attention. Our model, the Routing
Transformer, endows self-attention with a sparse routing module based on online k-means while reducing the overall
complexity of attention to O(n^1.5 d) from O(n^2 d) for sequence length n and hidden dimension d. We show that our
model outperforms comparable sparse attention models on language modeling on Wikitext-103 (15.8 vs 18.3
perplexity) as well as on image generation on ImageNet-64 (3.43 vs 3.44 bits/dim) while using fewer self-attention
layers.
@article{aurkoroy-routing-transformer-2020,
title={Efficient Content-Based Sparse Attention with Routing Transformers},
author={Aurko Roy and Mohammad Saffar and Ashish Vaswani and David Grangier},
journal={Transactions of the Association for Computational Linguistics (TACL)},
year={2020},
}
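A toy numpy sketch of content-based clustered attention in the spirit of the paper above: queries and keys are assigned to k-means clusters and attention is computed only within each cluster. This is only a sketch of the routing idea; the Routing Transformer's actual online k-means, balanced cluster sizes and causal masking are described in the paper.

import numpy as np

def clustered_attention(Q, K, V, num_clusters=4, iters=5):
    """Route queries and keys to k-means clusters and attend only within clusters."""
    rng = np.random.default_rng(0)
    X = np.concatenate([Q, K], axis=0)
    centroids = X[rng.choice(len(X), num_clusters, replace=False)]
    for _ in range(iters):                       # plain k-means over queries and keys
        assign = np.argmin(((X[:, None] - centroids) ** 2).sum(-1), axis=1)
        for c in range(num_clusters):
            if np.any(assign == c):
                centroids[c] = X[assign == c].mean(axis=0)
    q_assign, k_assign = assign[: len(Q)], assign[len(Q):]
    out = np.zeros_like(Q)
    for c in range(num_clusters):
        qi, ki = np.where(q_assign == c)[0], np.where(k_assign == c)[0]
        if len(qi) == 0 or len(ki) == 0:
            continue
        scores = Q[qi] @ K[ki].T / np.sqrt(Q.shape[1])
        scores -= scores.max(axis=1, keepdims=True)
        attn = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
        out[qi] = attn @ V[ki]                   # each query attends only to keys in its cluster
    return out

rng = np.random.default_rng(1)
n, d = 64, 8
Q, K, V = rng.normal(size=(3, n, d))
print(clustered_attention(Q, K, V).shape)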
Automatic evaluation comparing candidate translations to human-generated paraphrases of reference translations has
recently been proposed by Freitag et al. (2020). When used in place of original references, the paraphrased
versions produce metric scores that correlate better with human judgment. This effect holds for a variety of
different automatic metrics, and tends to favor natural formulations over more literal (translationese) ones. In
this paper we compare the results of performing end-to-end system development using standard and paraphrased
references. With state-of-the-art English-German NMT components, we show that tuning to paraphrased references
produces a system that is significantly better according to human judgment, but 5 BLEU points worse when tested on
standard references. Our work confirms the finding that paraphrased references yield metric scores that correlate
better with human judgment, and demonstrates for the first time that using these scores for system development can
lead to significant improvements.
@inproceedings{freitag-paraphrase-mert-2020,
title={Human-Paraphrased References Improve Neural Machine Translation},
author={Markus Freitag and George Foster and David Grangier and Colin Cherry},
booktitle={Conference on Machine Translation (WMT)},
year={2020},
}
The quality of automatic metrics for machine translation has been increasingly called into question, especially
for high-quality systems. This paper demonstrates that, while choice of metric is important, the nature of the
references is also critical. We study different methods to collect references and compare their value in automated
evaluation by reporting correlation with human evaluation for a variety of systems and metrics. Motivated by the
finding that typical references exhibit poor diversity, concentrating around translationese language, we develop a
paraphrasing task for linguists to perform on existing reference translations, which counteracts this bias. Our
method yields higher correlation with human judgment not only for the submissions of WMT 2019 English to German,
but also for Back-translation and APE augmented MT output, which have been shown to have low correlation with
automatic metrics using standard references. We demonstrate that our methodology improves correlation with all
modern evaluation metrics we look at, including embedding-based methods. To complete this picture, we reveal that
multi-reference BLEU does not improve the correlation for high quality output, and present an alternative
multi-reference formulation that is more effective.
@inproceedings{freitag-bleu-paraphrase-references-2020,
title={BLEU might be Guilty but References are not Innocent},
author={Markus Freitag and David Grangier and Isaac Caswell},
booktitle={Conference on Empirical Methods in Natural Language Processing (EMNLP)},
year={2020},
}
Machine translation has an undesirable propensity to produce "translationese" artifacts, which can lead to higher
BLEU scores while being liked less by human raters. Motivated by this, we model translationese and original (i.e.
natural) text as separate languages in a multilingual model, and pose the question: can we perform zero-shot
translation between original source text and original target text? There is no data with original source and
original target, so we train sentence-level classifiers to distinguish translationese from original target text,
and use this classifier to tag the training data for an NMT model. Using this technique we bias the model to
produce more natural outputs at test time, yielding gains in human evaluation scores on both accuracy and fluency.
Additionally, we demonstrate that it is possible to bias the model to produce translationese and game the BLEU
score, increasing it while decreasing human-rated quality. We analyze these models using metrics to measure the
degree of translationese in the output, and present an analysis of the capriciousness of heuristically-based
train-data tagging.
@inproceedings{riley-translationese-2020,
title={Translationese as a Language in "Multilingual" {NMT}},
author={Parker Riley and Isaac Caswell and Markus Freitag and David Grangier},
booktitle = {Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
year = {2020},
}
This work proposes a sentence-level language model which predicts the next sentence in a story given the
embeddings of the previous sentences. The model operates at the sentence-level and selects the next sentence
within a finite set of fluent alternatives. By working with sentence embeddings instead of word embeddings, our
model is able to efficiently consider a large number of alternative sentences. By considering only fluent
sentences, our model is relieved from modeling fluency and can focus on longer range dependencies. Our method
achieves state-of-the-art accuracy on the StoryCloze task in the unsupervised setting.
@inproceedings{ippolito-storyline-2020,
title={Towards Better Storylines with Sentence-Level Language Models},
author={Daphne Ippolito and David Grangier and Douglas Eck and Chris Callison-Burch},
booktitle={Annual Meeting of the Association for Computational Linguistics (ACL)},
year={2020},
}
We introduce the first large-scale corpus for long form question answering, a task requiring elaborate and
in-depth answers to open-ended questions. The dataset comprises 270K threads from the Reddit forum “Explain Like
I’m Five” (ELI5) where an online community provides answers to questions which are comprehensible by five year
olds. Compared to existing datasets, ELI5 comprises diverse questions requiring multi-sentence answers. We provide
a large set of web documents to help answer the question. Automatic and human evaluations show that an abstractive
model trained with a multi-task objective outperforms conventional Seq2Seq, language modeling, as well as a strong
extractive baseline. However, our best model is still far from human performance since raters prefer gold responses
in over 86% of cases, leaving ample opportunity for future improvement.
@inproceedings{fan-etal-2019-eli5,
title = "{ELI}5: Long Form Question Answering",
author = "Fan, Angela and Jernite, Yacine and Perez, Ethan and Grangier, David and Weston, Jason and Auli, Michael",
booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
year = "2019",
url = "https://www.aclweb.org/anthology/P19-1346"
}
Recent work in Neural Machine Translation (NMT) has shown significant quality gains from noised-beam decoding
during back-translation, a method to generate synthetic parallel data. We show that the main role of such
synthetic noise is not to diversify the source side, as previously suggested, but simply to indicate to the model
that the given source is synthetic. We propose a simpler alternative to noising techniques, consisting of tagging
back-translated source sentences with an extra token. Our results on WMT outperform noised back-translation in
English-Romanian and match performance on English-German, re-defining state-of-the-art in the former.
@inproceedings{caswell:tagbt:2019,
author = {Isaac Caswell and Ciprian Chelba and David Grangier},
title = {Tagged Back-Translation},
year = {2019},
booktitle = {Conference on Machine Translation (WMT)},
}
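The proposed tagging scheme is simple enough to show in a couple of lines of Python; the tag token name below is an arbitrary placeholder.

BT_TAG = "<BT>"  # reserved token marking synthetic sources

def tag_back_translated(source_sentences):
    """Prepend a tag so the model can tell synthetic sources from genuine bitext."""
    return [f"{BT_TAG} {s}" for s in source_sentences]

genuine = ["ein kleiner Hund läuft im Park"]
synthetic = tag_back_translated(["der Hund rennt durch den Park"])  # from back-translation
print(genuine + synthetic)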
Paraphrasing exemplifies the ability to abstract semantic content from surface forms. Recent work on automatic
paraphrasing is dominated by methods leveraging Machine Translation (MT) as an intermediate step. This contrasts
with humans, who can paraphrase without being bilingual. This work proposes to learn paraphrasing models from an
unlabeled monolingual corpus only. To that end, we propose a residual variant of vector-quantized variational
auto-encoder. We compare with MT-based approaches on paraphrase identification, generation, and training
augmentation. Monolingual paraphrasing outperforms unsupervised translation in all settings. Comparisons with
supervised translation are more mixed: monolingual paraphrasing is interesting for identification and
augmentation; supervised translation is superior for generation.
@inproceedings{aurkoroy:paraphrase:2019,
author = {Aurko Roy and David Grangier},
title = {Unsupervised Paraphrasing without Translation},
year = {2019},
booktitle = {Conference of the Association for Computational Linguistics (ACL)},
}
fairseq is an open-source sequence modeling toolkit that allows researchers and developers to train custom models
for translation, summarization, language modeling, and other text generation tasks. The toolkit is based on
PyTorch and supports distributed training across multiple GPUs and machines. We also support fast mixed-precision
training and inference on modern GPUs.
@inproceedings{ottedunov:fairseq:2019,
author = {Myle Ott and Sergey Edunov and Alexei Baevski and Angela Fan and Sam Gross and Nathan Ng and David Grangier and Michael Auli},
title = {fairseq: A Fast, Extensible Toolkit for Sequence Modeling},
year = {2019},
booktitle = {Demo of Conference of the North American Chapter of the Association for Computational Linguistics (NAACL Demo)},
}
Story infilling involves predicting words to go into a missing span from a story. This challenging task has the
potential to transform interactive tools for creative writing. However, state-of-the-art conditional language
models have trouble balancing fluency and coherence with novelty and diversity. We address this limitation with a
hierarchical model which first selects a set of rare words and then generates text conditioned on that set. By
relegating the high entropy task of picking rare words to a word-sampling model, the second-stage model
conditioned on those words can achieve high fluency and coherence by searching for likely sentences, without
sacrificing diversity.
@inproceedings{ippolito-etal-2019-unsupervised,
title = "Unsupervised Hierarchical Story Infilling",
author = "Ippolito, Daphne and Grangier, David and Callison-Burch, Chris and Eck, Douglas",
booktitle = "Proceedings of the First Workshop on Narrative Understanding @ NAACL",
year = "2019",
url = "https://www.aclweb.org/anthology/W19-2405"}
In this work, we demonstrate that 3D poses in video can be effectively estimated with a fully convolutional model
based on dilated temporal convolutions over 2D keypoints. We also introduce back-projection, a simple and
effective semi-supervised training method that leverages unlabeled video data. We start with predicted 2D
keypoints for unlabeled video, then estimate 3D poses and finally back-project to the input 2D keypoints. In the
supervised setting, our fully-convolutional model outperforms the previous best result from the literature by 6 mm
mean per-joint position error on Human3.6M, corresponding to an error reduction of 11%, and the model also shows
significant improvements on HumanEva-I. Moreover, experiments with back-projection show that it comfortably
outperforms previous state-of-the-art results in semi-supervised settings where labeled data is scarce. Code and
models are available.
@inproceedings{pavllo:3dpose:2018,
author = {Dario Pavllo and Christoph Feichtenhofer and David Grangier and Michael Auli},
title = {3D human pose estimation in video with temporal convolutions and semi-supervised training},
year = {2019},
booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
}
An effective method to improve neural machine translation with monolingual data is to augment the parallel
training corpus with back-translations of target language sentences.
This work broadens the understanding of back-translation and investigates a number of methods to generate
synthetic source sentences.
We find that in all but resource poor settings back-translations obtained via sampling or noised beam outputs are
most effective.
Our analysis shows that sampling or noisy synthetic data gives a much stronger training signal than data generated
by beam or greedy search.
We also compare how synthetic data compares to genuine bitext and study various domain effects.
Finally, we scale to hundreds of millions of monolingual sentences and achieve a new state of the art of 35 BLEU
on the WMT'14 English-German test set.
@inproceedings{edunov:backtranslation:2018,
author = {Sergey Edunov and Myle Ott and Michael Auli and David Grangier},
title = {Understanding Back-Translation at Scale},
year = {2018},
booktitle = {Conference on Empirical Methods in Natural Language Processing ({EMNLP})},
}
Sequence to sequence learning models still require several days to reach state of the art performance on large
benchmark datasets using a single machine. This paper shows that reduced precision and large batch training can
speedup training by nearly 5x on a single 8-GPU machine with careful tuning and implementation. On WMT'14
English-German translation, we match the accuracy of Vaswani et al. (2017) in under 5 hours when training on 8 GPUs
and we obtain a new state of the art of 29.3 BLEU after training for 91 minutes on 128 GPUs. We further improve
these results to 29.8 BLEU by training on the much larger Paracrawl dataset.
@inproceedings{ott:scaling:2018,
author = {Myle Ott and Sergey Edunov and David Grangier and Michael Auli},
title = {Scaling Neural Machine Translation},
year = {2018},
booktitle = {Workshop on Machine Translation ({WMT@EMNLP})},
}
Deep learning for predicting or generating 3D human pose sequences is an active research area. Previous work
regresses either joint rotations or joint positions. The former strategy is prone to error accumulation along the
kinematic chain, as well as discontinuities when using Euler angle or exponential map parameterizations. The
latter requires re-projection onto skeleton constraints to avoid bone stretching and invalid configurations. This
work addresses both limitations. Our recurrent network, QuaterNet, represents rotations with quaternions and our
loss function performs forward kinematics on a skeleton to penalize absolute position errors instead of angle
errors. On short-term predictions, QuaterNet improves the state-of-the-art quantitatively. For long-term
generation, our approach is qualitatively judged as realistic as recent neural strategies from the graphics
literature.
@inproceedings{pavllo:quaternet:2018,
author = {Dario Pavllo and David Grangier and Michael Auli},
title = {QuaterNet: A Quaternion-based Recurrent Model for Human Motion},
year = {2018},
booktitle = {British Machine Vision Conference (BMVC)},
}
Machine translation is a popular test bed for research in neural sequence-to-sequence models but despite much
recent research, there is still a lack of understanding of these models. Practitioners report performance
degradation with large beams, the under-estimation of rare words and a lack of diversity in the final
translations. Our study relates some of these issues to the inherent uncertainty of the task, due to the existence
of multiple valid translations for a single source sentence, and to the extrinsic uncertainty caused by noisy
training data. We propose tools and metrics to assess how uncertainty in the data is captured by the model
distribution and how it affects search strategies that generate translations. Our results show that search works
remarkably well but that the models tend to spread too much probability mass over the hypothesis space. Next, we
propose tools to assess model calibration and show how to easily fix some shortcomings of current models. We
release both code and multiple human reference translations for two popular benchmarks.
@inproceedings{ott:uncertainty:2018,
author = {Myle Ott and Michael Auli and David Grangier and Marc’Aurelio Ranzato},
title = {Analyzing Uncertainty in Neural Machine Translation},
year = {2018},
booktitle = {International Conference on Machine Learning {(ICML)}},
}
Current models for document summarization ignore user preferences such as the desired length, style or entities
that the user has a preference for. We present a neural summarization model that enables users to specify such
high level attributes in order to control the shape of the final summaries to better suit their needs. With user
input, we show that our system can produce high quality summaries that are true to user preference. Without user
input, we can set the control variables automatically and outperform comparable state of the art summarization
systems despite the relative simplicity of our model.
@inproceedings{fan:summarization:2018,
author = {Angela Fan and David Grangier and Michael Auli},
title = {Controllable Abstractive Summarization},
year = {2018},
booktitle = {ACL Workshop on Neural Machine Translation and Generation (NMT@ACL)},
}
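A tiny Python sketch of the control mechanism described above: high-level attributes such as the desired length bucket or entities of interest are exposed as special prefix tokens prepended to the source document. The token names below are invented for illustration.

def build_controlled_input(document, length_bucket=None, entities=()):
    """Prepend control tokens encoding user preferences to the source document."""
    controls = []
    if length_bucket is not None:
        controls.append(f"<len_{length_bucket}>")       # e.g. <len_short>, <len_long>
    controls.extend(f"<ent_{e}>" for e in entities)     # entities the summary should keep
    return " ".join(controls + [document])

doc = "The central bank raised interest rates by 25 basis points on Tuesday ..."
print(build_controlled_input(doc, length_bucket="short", entities=["central_bank"]))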
There has been much recent work on training neural attention models at the sequence-level using either
reinforcement learning-style methods or by optimizing the beam. In this paper, we survey a range of classical
objective functions that have been widely used to train linear models for structured prediction and apply them to
neural sequence to sequence models. Our experiments show that these losses can perform surprisingly well by
slightly outperforming beam search optimization in a like-for-like setup. We also report new state-of-the-art
results on both IWSLT 2014 German-English translation as well as Gigaword abstractive summarization.
@inproceedings{edunovott:structured:2018,
author = {Sergey Edunov and Myle Ott and Michael Auli and David Grangier and Marc’Aurelio Ranzato},
title = {Classical Structured Prediction Losses for Sequence to Sequence Learning},
year = {2018},
booktitle = {Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)},
}
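As an illustration of the family of objectives surveyed, here is a hedged sketch of two sequence-level losses computed over an n-best list of candidate outputs; candidate generation and the task cost (e.g. 1 - sentence BLEU) are assumed to come from the surrounding training loop, and this is not the paper's exact implementation.
import torch
import torch.nn.functional as F

def expected_risk(cand_scores, cand_costs):
    # cand_scores: (batch, n_best) model log-scores for candidate sequences.
    # cand_costs: (batch, n_best) task costs, e.g. 1 - sentence-level BLEU.
    # Renormalize the model over the candidate set, then minimize expected cost.
    probs = F.softmax(cand_scores, dim=-1)
    return (probs * cand_costs).sum(dim=-1).mean()

def sequence_margin_loss(cand_scores, cand_costs):
    # Structured hinge restricted to the candidate set: the lowest-cost
    # candidate (pseudo-reference) should beat every cost-augmented competitor.
    target = cand_costs.argmin(dim=-1)
    gold_score = cand_scores.gather(1, target.unsqueeze(1))
    gold_cost = cand_costs.gather(1, target.unsqueeze(1))
    violation = (cand_scores + cand_costs) - (gold_score + gold_cost)
    return violation.max(dim=-1).values.clamp(min=0).mean()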
We propose a framework for computer-assisted text editing. It applies to translation post-editing and to
paraphrasing and relies on very simple interactions: a human editor modifies a sentence by marking tokens they
would like the system to change. Our model then generates a new sentence which reformulates the initial sentence
by avoiding the words from the marked tokens. Our approach builds upon neural sequence-to-sequence modeling and
introduces a neural network which takes as input a sentence along with deleted token markers. Our model is trained
on translation bitext by simulating post-edits. Our results on post-editing for machine translation and
paraphrasing evaluate the performance of our approach. We show +11.4 BLEU with limited post-editing effort on the
WMT-14 English-German translation task (25.2 to 36.6), which represents +5.9 BLEU over the post-editing baseline
(30.7 to 36.6).
@inproceedings{grangier:quickedit:2018,
author = {David Grangier and Michael Auli},
title = {QuickEdit: Editing Text and Translations by Crossing Words Out},
year = {2018},
booktitle = {Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)},
}
The prevalent approach to sequence to sequence learning maps an input sequence to a variable length output
sequence via recurrent neural networks. We introduce an architecture based entirely on convolutional neural
networks. Compared to recurrent models, computations over all elements can be fully parallelized during training
and optimization is easier since the number of non-linearities is fixed and independent of the input length. Our
use of gated linear units eases gradient propagation and we equip each decoder layer with a separate attention
module. We outperform the accuracy of the deep LSTM setup of Wu et al. (2016) on both WMT'14 English-German and
WMT'14 English-French translation at an order of magnitude faster speed, both on GPU and CPU.
@inproceedings{gehring:convs2s:2017,
author = {Jonas Gehring and Michael Auli and David Grangier and Denis Yarats and Yann N. Dauphin},
title = {Convolutional Sequence to Sequence Learning},
year = {2017},
booktitle = {International Conference on Machine Learning {(ICML)}}
}
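A simplified sketch of the per-decoder-layer attention module mentioned above, reduced to its dot-product core; the combination of decoder states with target embeddings and the convolutional blocks themselves are omitted, so treat this as an illustration rather than the paper's implementation.
import torch
import torch.nn.functional as F

def decoder_layer_attention(decoder_state, encoder_out, encoder_emb):
    # decoder_state: (batch, tgt_len, dim) output of one decoder conv layer
    # encoder_out: (batch, src_len, dim) top encoder states (used as keys)
    # encoder_emb: (batch, src_len, dim) encoder states plus input embeddings (values)
    scores = torch.bmm(decoder_state, encoder_out.transpose(1, 2))  # (batch, tgt, src)
    attn = F.softmax(scores, dim=-1)
    context = torch.bmm(attn, encoder_emb)                          # (batch, tgt, dim)
    return decoder_state + context  # residual combination, one module per decoder layer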
The pre-dominant approach to language modeling to date is based on recurrent neural networks. In this paper we
present a convolutional approach to language modeling. We introduce a novel gating mechanism that eases gradient
propagation and which performs better than the LSTM-style gating of Oord et al. (2016b) despite being simpler. We
achieve a new state of the art on WikiText-103 as well as a new best single-GPU result on the Google Billion Word
benchmark. In settings where latency is important, our model achieves an order of magnitude speed-up compared to a
recurrent baseline since computation can be parallelized over time. To our knowledge, this is the first time a
non-recurrent approach outperforms strong recurrent models on these tasks.
@inproceedings{dauphin:gatedlm:2017,
author = {Yann N. Dauphin and Angela Fan and Michael Auli and David Grangier},
title = {Language Modeling with Gated Convolutional Networks},
year = {2017},
booktitle = {International Conference on Machine Learning {(ICML)}}
}
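A minimal sketch of the gating mechanism, h(X) = (X*W + b) ⊗ sigmoid(X*V + c), implemented here with a single causal convolution producing twice the channels and split by torch.nn.functional.glu; the hyperparameters are made up and the block is only an illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedConvBlock(nn.Module):
    # One causal convolutional layer with gated linear units: a single conv
    # produces 2x channels, which F.glu splits into a linear part and a gate.
    def __init__(self, channels, kernel_size):
        super().__init__()
        self.pad = kernel_size - 1  # left-pad so the model stays causal
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size)

    def forward(self, x):            # x: (batch, channels, time)
        h = self.conv(F.pad(x, (self.pad, 0)))
        return F.glu(h, dim=1) + x   # residual connection around the gated unit

# toy usage over embedded tokens
block = GatedConvBlock(channels=128, kernel_size=4)
tokens = torch.randn(2, 128, 50)     # (batch, channels, time)
out = block(tokens)                  # same shape as the input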
We propose an approximate strategy to efficiently train neural network based language models over very large
vocabularies. Our approach, called adaptive softmax, circumvents the linear dependency on the vocabulary size by
exploiting the unbalanced word distribution to form clusters that explicitly minimize the expectation of
computational complexity. Our approach further reduces the computational cost by exploiting the specificities of
modern architectures and matrix-matrix vector operations, making it particularly suited for graphical processing
units. Our experiments carried out on standard benchmarks, such as EuroParl and One Billion Word, show that our
approach brings a large gain in efficiency over standard approximations while achieving an accuracy close to that
of the full softmax.
@inproceedings{grave:softmax:2017,
author = {Edouard Grave and Armand Joulin and Moustapha Cisse and David Grangier and Herve Jegou},
title = {Efficient Softmax Approximation for GPUs},
year = {2017},
booktitle = {International Conference on Machine Learning {(ICML)}}
}
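PyTorch now ships an implementation of this idea as torch.nn.AdaptiveLogSoftmaxWithLoss; the usage sketch below uses made-up dimensions and cutoffs.
import torch
import torch.nn as nn

# Hypothetical sizes: 512-dimensional hidden states, 100k-word vocabulary.
# The cutoffs split the vocabulary into a small head of frequent words and
# progressively larger, lower-capacity tail clusters.
adaptive = nn.AdaptiveLogSoftmaxWithLoss(
    in_features=512, n_classes=100_000, cutoffs=[2_000, 10_000, 50_000])

hidden = torch.randn(32, 512)               # one hidden state per target word
targets = torch.randint(0, 100_000, (32,))  # gold word indices
output = adaptive(hidden, targets)
loss = output.loss                           # negative log-likelihood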
The prevalent approach to neural machine translation relies on bi-directional LSTMs to encode the source sentence.
In this paper we present a faster and conceptually simpler architecture based on a succession of convolutional
layers. This allows the entire source sentence to be encoded simultaneously, in contrast to recurrent networks for which
computation is constrained by temporal dependencies. We achieve a new state-of-the-art on WMT'16 English-Romanian
translation and outperform several recently published results on the WMT'15 English-German task. We also achieve
almost the same accuracy as a very deep LSTM setup on WMT'14 English-French translation. Our convolutional encoder
speeds up CPU decoding by more than two times at the same or higher accuracy as a strong bi-directional LSTM
baseline.
@inproceedings{gehring:convnmt:2017,
author = {Jonas Gehring and Michael Auli and David Grangier and Yann N. Dauphin},
title = {A Convolutional Encoder Model for Neural Machine Translation},
year = {2017},
booktitle = {Conference of the Association for Computational Linguistics ({ACL})}
}
Existing machine translation decoding algorithms generate translations in a strictly monotonic fashion and never
revisit previous decisions. As a result, earlier mistakes cannot be corrected at a later stage. In this paper, we
present a translation scheme that starts from an initial guess and then makes iterative improvements that may
revisit previous decisions. We parameterize our model as a convolutional neural network that predicts discrete
substitutions to an existing translation based on an attention mechanism over both the source sentence as well as
the current translation output. By making less than one modification per sentence, we improve the output of a
phrase-based translation system by up to 0.4 BLEU on WMT15 German-English translation.
@article{novak:refinement:2016,
author = {Roman Novak and Michael Auli and David Grangier},
title = {Iterative Refinement for Machine Translation},
year = {2016},
journal = {arXiv preprint}
}
Classical translation models constrain the space of possible outputs by selecting a subset of translation rules
based on the input sentence. Recent work on improving the efficiency of neural translation models adopted a
similar strategy by restricting the output vocabulary to a subset of likely candidates given the source. In this
paper we experiment with context and embedding-based selection methods and extend previous work by examining speed
and accuracy trade-offs in more detail. We show that decoding time on CPUs can be reduced by up to 90% and
training time by 25% on the WMT15 English-German and WMT16 English-Romanian tasks at the same or only negligible
change in accuracy. This brings the time to decode with a state of the art neural translation system to just over
140 msec per sentence on a single CPU core for English-German.
@article{lhostis:selectionmt:2016,
author = {Gurvan L'Hostis and David Grangier and Michael Auli},
title = {Vocabulary Selection Strategies for Neural Machine Translation},
year = {2016},
journal = {arXiv preprint}
}
This paper introduces a neural model for concept-to-text generation that scales to large, rich domains. We
experiment with a new dataset of biographies from Wikipedia that is an order of magnitude larger than existing
resources with over 700k samples. The dataset is also vastly more diverse with a 400k vocabulary, compared to a
few hundred words for Weathergov or Robocup. Our model builds upon recent work on conditional neural language
models for text generation. To deal with the large vocabulary, we extend these models to mix a fixed vocabulary
with copy actions that transfer sample-specific words from the input database to the generated output sentence.
Our neural model significantly outperforms a classical Kneser-Ney language model adapted to this task by nearly
15 BLEU.
@inproceedings{lebret:emnlp:2016,
author = {Remi Lebret and David Grangier and Michael Auli},
title = {Neural Text Generation from Structured Data with Application to the Biography Domain},
year = {2016},
booktitle = {Conference on Empirical Methods in Natural Language Processing (EMNLP)}
}
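As a generic illustration of mixing a fixed vocabulary with copy actions (a pointer-style sketch of my own, not the paper's exact parameterization), one can scatter a copy distribution over input positions back onto word ids and interpolate it with the generation distribution.
import torch
import torch.nn.functional as F

def mix_generate_and_copy(vocab_logits, copy_scores, src_token_ids, gate_logit):
    # vocab_logits: (batch, vocab_size) scores over the fixed vocabulary.
    # copy_scores: (batch, src_len) scores over positions of the input table/fields.
    # src_token_ids: (batch, src_len) word ids of the input tokens in the output vocabulary.
    # gate_logit: (batch, 1) controls how much mass goes to copying vs generating.
    p_gen = torch.sigmoid(gate_logit)
    p_vocab = F.softmax(vocab_logits, dim=-1)
    p_copy = F.softmax(copy_scores, dim=-1)
    # Scatter copy probabilities back onto word ids and mix the two distributions.
    mixed = p_gen * p_vocab
    mixed = mixed.scatter_add(1, src_token_ids, (1 - p_gen) * p_copy)
    return mixed  # (batch, vocab_size), a proper distribution over output words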
Training neural network language models over large vocabularies is still computationally very costly compared to
count-based models such as Kneser-Ney. At the same time, neural language models are gaining popularity for many
applications such as speech recognition and machine translation whose success depends on scalability. We present a
systematic comparison of strategies to represent and train large vocabularies, including softmax, hierarchical
softmax, target sampling, noise contrastive estimation and self normalization. We further extend self
normalization to be a proper estimator of likelihood and introduce an efficient variant of softmax. We evaluate
each method on three popular benchmarks, examining performance on rare words, the speed/accuracy trade-off and
complementarity to Kneser-Ney.
@inproceedings{chen:acl:2016,
author = {Wenlin Chen and David Grangier and Michael Auli},
title = {Strategies for Training Large Vocabulary Neural Language Models},
year = {2016},
booktitle = {Conference of the Association for Computational Linguistics (ACL)}
}
In text classification, dictionaries can be used to define human-comprehensible features. We propose an
improvement to dictionary features called smoothed dictionary features. These features recognize document contexts
instead of n-grams. We describe a principled methodology to solicit dictionary features from a teacher, and
present results showing that models built using these human-comprehensible features are competitive with models
trained with Bag of Words features.
@inproceedings{jandot:2016:whi,
author = {Camille Jandot and Patrice Simard and Max Chickering and David Grangier and Jina Suh},
title = {Interactive Semantic Featuring for Text Classification},
year = {2016},
booktitle = { ICML Workshop on Human Interpretability in Machine Learning (WHI)}
}
Conditional belief networks introduce stochastic binary variables in neural networks. Contrary to a classical
neural network, a belief network can predict more than the expected value of the output Y given the input X. It
can predict a distribution of outputs Y which is useful when an input can admit multiple outputs whose average is
not necessarily a valid answer. Such networks are particularly relevant to inverse problems such as image
prediction for denoising, or text to speech. However, traditional sigmoid belief networks are hard to train and
are not suited to continuous problems. This work introduces a new family of networks called linearizing belief
nets, or LBNs. An LBN decomposes into a deep linear network where each linear unit can be turned on or off by
non-deterministic binary latent units. It is a universal approximator of real-valued conditional distributions and
can be trained using gradient descent. Moreover, the linear pathways efficiently propagate continuous information
and they act as multiplicative skip-connections that help optimization by removing gradient diffusion. This yields
a model which trains efficiently and improves the state-of-the-art on image denoising and facial expression
generation with the Toronto faces dataset.
@inproceedings{dauphin:2016:iclr,
author = {Yann N. Dauphin and David Grangier},
title = {Predicting distributions with Linearizing Belief Networks},
year = {2016},
booktitle = {International Conference on Learning Representations (ICLR)}
}
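A toy sketch of one such layer under my own simplifications: a deterministic linear pathway multiplied elementwise by stochastic binary gates, with a straight-through estimator standing in for the paper's actual training procedure.
import torch
import torch.nn as nn

class LBNLayer(nn.Module):
    # One linearizing-belief-net-style layer (sketch): a linear pathway whose
    # units are switched on or off by stochastic binary gating variables.
    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.linear = nn.Linear(dim_in, dim_out)   # continuous pathway
        self.gater = nn.Linear(dim_in, dim_out)    # predicts gate probabilities

    def forward(self, x, sample=True):
        h = self.linear(x)
        p = torch.sigmoid(self.gater(x))
        if sample:
            z = torch.bernoulli(p)
            # straight-through style estimator so gradients reach the gater
            z = z + p - p.detach()
        else:
            z = p  # use expected gates at evaluation time
        return h * z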
Quick interaction between a human teacher and a learning machine presents numerous benefits and challenges when
working with web-scale data. The human teacher guides the machine towards accomplishing the task of interest. The
learning machine leverages big data to find examples that maximize the training value of its interaction with the
teacher. When the teacher is restricted to labeling examples selected by the machine, this problem is an instance
of active learning. When the teacher can provide additional information to the machine (e.g., suggestions on what
examples or predictive features should be used) as the learning task progresses, then the problem becomes one of
interactive learning. To accommodate the two-way communication channel needed for efficient interactive learning,
the teacher and the machine need an environment that supports an interaction language. The machine can access,
process, and summarize more examples than the teacher can see in a lifetime. Based on the machine's output, the
teacher can revise the definition of the task or make it more precise. Both the teacher and the machine
continuously learn and benefit from the interaction. We have built a platform to (1) produce valuable and
deployable models and (2) support research on both the machine learning and user interface challenges of the
interactive learning problem. The platform relies on a dedicated, low-latency, distributed, in-memory architecture
that allows us to construct web-scale learning machines with quick interaction speed. The purpose of this paper is
to describe this architecture and demonstrate how it supports our research efforts. Preliminary results are
presented as illustrations of the architecture but are not the primary focus of the paper.
@article{simard:2014:ice,
author = {Patrice Y. Simard and
David Maxwell Chickering and
Aparna Lakshmiratan and
Denis Xavier Charles and
L{\'{e}}on Bottou and
Carlos Garcia Jurado Suarez and
David Grangier and
Saleema Amershi and
Johan Verwey and
Jina Suh},
title = {{ICE:} Enabling Non-Experts to Build Models Interactively for Large-Scale
Lopsided Problems},
journal = {CoRR},
volume = {abs/1409.4814},
year = {2014},
url = {http://arxiv.org/abs/1409.4814},
}
With the rapid development of database technologies, multiple data sources may be available for a given learning
task (e.g. collaborative filtering). However, the data sources may contain different types of features. For example,
users' profiles can be used to build recommendation systems. In addition, a model can also use users' historical
behaviors and social networks to infer users' interests on related products. We argue that it is desirable to
collectively use any available multiple heterogeneous data sources in order to build effective learning models. We
call this framework heterogeneous learning. In our proposed setting, data sources can include (i) nonoverlapping
features, (ii) nonoverlapping instances, and (iii) multiple networks (i.e. graphs) that connect instances. In this
paper, we propose a general optimization framework for heterogeneous learning, and devise a corresponding learning
model from gradient boosting. The idea is to minimize the empirical loss with two constraints: (1) there should be
consensus among the predictions of overlapping instances (if any) from different data sources; (2) connected
instances in graph datasets may have similar predictions. The objective function is solved by stochastic gradient
boosting trees. Furthermore, a weighting strategy is designed to emphasize informative data sources, and
deemphasize the noisy ones. We formally prove that the proposed strategy leads to a tighter error bound. This
approach consistently outperforms a standard concatenation of data sources on movie rating prediction, number
recognition, and terrorist attack detection tasks. Furthermore, the approach is evaluated on AT&T's distributed
database with over 500,000 instances, 91 different data sources, and over 45,000,000 joined features. We observe
that the proposed model can improve out-of-sample error rate substantially.
@article{shi:2013:gbc,
title={GBC: gradient boosting consensus model for heterogeneous data},
author={Shi, Xiaoxiao and Paiement, Jean-Francois and Grangier, David and Yu, Philip S},
journal={Statistical Analysis and Data Mining},
year={2013},
publisher={Wiley Online Library}
}
Multiple data sources containing different types of features may be available for a given task. For instance,
users’ profiles can be used to build recommendation systems. In addition, a model can also use users’ historical
behaviors and social networks to infer users’ interests on related products. We argue that it is desirable to
collectively use any available multiple heterogeneous data sources in order to build effective learning models. We
call this framework heterogeneous learning. In our proposed setting, data sources can include (i) non-overlapping
features, (ii) non-overlapping instances, and (iii) multiple networks (i.e. graphs) that connect instances. In
this paper, we propose a general optimization framework for heterogeneous learning, and devise a corresponding
learning model from gradient boosting. The idea is to minimize the empirical loss with two constraints: (1) There
should be consensus among the predictions of overlapping instances (if any) from different data sources; (2)
Connected instances in graph datasets may have similar predictions. The objective function is solved by stochastic
gradient boosting trees. Furthermore, a weighting strategy is designed to emphasize informative data sources, and
deemphasize the noisy ones. We formally prove that the proposed strategy leads to a tighter error bound. This
approach consistently outperforms a standard concatenation of data sources on movie rating prediction, number
recognition and terrorist attack detection tasks. We observe that the proposed model can improve out-of-sample
error rate by as much as 80%.
@inproceedings{shi:2012:heterogeneous_gbdt_sdm,
author = "X. Shi and JF. Paiement and D. Grangier and P. Yu",
title = "Learning from Heterogeneous Sources via Gradient Boosting Consensus",
booktitle = "SIAM International Conference on Data Mining (SDM)",
year = "2012",
}
We present a new learning strategy for classification problems in which train and/or test data suffer from missing
features. In previous work, instances are represented as vectors from some feature space and one is forced to
impute missing values or to consider an instance-specific subspace. In contrast, our method considers instances as
sets of (feature,value) pairs which naturally handle the missing value case. Building onto this framework, we
propose a classification strategy for sets. Our proposal maps (feature,value) pairs into an embedding space and
then non-linearly combines the set of embedded vectors. The embedding and the combination parameters are learned
jointly on the final classification objective. This simple strategy allows great flexibility in encoding prior
knowledge about the features in the embedding step and yields advantageous results compared to alternative
solutions over several datasets.
@inproceedings{grangier:2010:missing_nips,
author = "D. Grangier and I. Melvin",
title = "Feature Set Embedding for Incomplete Data",
booktitle = "Advances in Neural Information Processing Systems (NIPS)",
year = "2010",
}
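A minimal sketch of the feature set embedding idea, assuming each observed (feature, value) pair has been mapped to an integer id; the combination step here is a simple masked mean followed by a nonlinearity, which is only one possible choice rather than the paper's exact model.
import torch
import torch.nn as nn

class FeatureSetEmbedding(nn.Module):
    # Represent an instance as a set of (feature, value) pairs, embed each pair,
    # combine the embeddings non-linearly, then classify. Missing features are
    # simply absent from the set, so no imputation is required.
    def __init__(self, num_pairs, dim, num_classes):
        super().__init__()
        self.embed = nn.Embedding(num_pairs, dim)   # one row per (feature, value) pair
        self.hidden = nn.Linear(dim, dim)
        self.classify = nn.Linear(dim, num_classes)

    def forward(self, pair_ids, mask):
        # pair_ids: (batch, max_pairs) ids of the observed (feature, value) pairs
        # mask: (batch, max_pairs) float tensor, 1 for observed pairs, 0 for padding
        e = self.embed(pair_ids) * mask.unsqueeze(-1)
        pooled = e.sum(dim=1) / mask.sum(dim=1, keepdim=True).clamp(min=1)
        return self.classify(torch.tanh(self.hidden(pooled)))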
Multi-class classification becomes challenging at test time when the number of classes is very large and testing
against every possible class can become computationally infeasible. This problem can be alleviated by imposing (or
learning) a structure over the set of classes. We propose an algorithm for learning a tree-structure of
classifiers which, by optimizing the overall tree loss, provides superior accuracy to existing tree labeling
methods. We also propose a method that learns to embed labels in a low dimensional space that is faster than
non-embedding approaches and has superior accuracy to existing embedding approaches. Finally we combine the two
ideas resulting in the label embedding tree that outperforms alternative methods including One-vs-Rest while being
orders of magnitude faster.
@inproceedings{weston:2010:label_trees_nips,
author = "J. Weston and S. Bengio and D. Grangier",
title = "Label Embedding Trees for Large Multi-Class Tasks",
booktitle = "Advances in Neural Information Processing Systems (NIPS)",
year = "2010",
}
We study the standard retrieval task of ranking a fixed set of items given a previously unseen query and pose it
as the half-transductive ranking problem. Transductive representations (where the vector representation of each
example is learned) allow the generation of highly nonlinear embeddings that capture the characteristics of object
relationships without relying on a specific choice of features, and require only relatively simple optimization.
Unfortunately, they have no direct out-of-sample extension. Inductive approaches on the other hand allow for the
representation of unknown queries. We describe algorithms for this setting which have the advantages of both
transductive and inductive approaches, and can be applied in unsupervised (either reconstruction-based or
graph-based) and supervised ranking setups. We show empirically that our methods give strong performance on all
three tasks.
@inproceedings{bai:2010:halftrans_aistats,
author = "B. Bai and J. Weston and D. Grangier and R. Collobert and C. Cortes and M. Mohri",
title = "Half Transductive Ranking",
booktitle = "Artificial Intelligence and Statistics (AISTATS)",
year = "2010",
}
We study the standard retrieval task of ranking a fixed set of documents given a previously unseen query and pose
it as the half-transductive ranking problem. The task is partly transductive as the document set is fixed.
Existing transductive approaches are natural non-linear methods for this set, but have no direct out-of-sample
extension. Functional approaches, on the other hand, can be applied to the unseen queries, but fail to exploit the
availability of the document set to its full extent. This work introduces a half-transductive approach that benefits
from the advantages of both transductive and functional approaches, and shows its empirical advantage in supervised
ranking setups.
@inproceedings{bai:2009:halftrans_nips,
author = "B. Bai and J. Weston and D. Grangier and R. Collobert and C. Cortes and M. Mohri",
title = "Ranking with Half Transductive Models",
booktitle = "NIPS Workshop on Advances in Ranking",
year = "2009",
}
We present a class of nonlinear (polynomial) models that are discriminatively trained to directly map from the
word content in a query-document or document-document pair to a ranking score. Dealing with polynomial models on
word features is computationally challenging. We propose a low-rank (but diagonal preserving) representation of
our polynomial models to induce feasible memory and computation requirements. We provide an empirical study on
retrieval tasks based on Wikipedia documents, where we obtain state-of-the-art performance while providing
realistically scalable methods.
@inproceedings{bai:2009:psi_nips,
author = "B. Bai and J. Weston and D. Grangier and R. Collobert and K. Sadamasa and Y. Qi and C. Cortes and M. Mohri",
title = "Polynomial Semantic Indexing",
booktitle = "Advances in Neural Information Processing Systems (NIPS)",
year = "2009",
}
In this article we present Supervised Semantic Indexing (SSI) which defines a class of nonlinear (quadratic)
models that are discriminatively trained to directly map from the word content in a query-document or
document-document pair to a ranking score. Like Latent Semantic Indexing (LSI), our models take account of
correlations between words (synonymy, polysemy). However, unlike LSI our models are trained from a supervised
signal directly on the ranking task of interest, which we argue is the reason for our superior results. As the
query and target texts are modeled separately, our approach is easily generalized to different retrieval tasks,
such as cross-language retrieval or online advertising placement. Dealing with models on all pairs of word
features is computationally challenging. We propose several improvements to our basic model for addressing this
issue, including low rank (but diagonal preserving) representations, correlated feature hashing (CFH) and
sparsification. We provide an empirical study of all these methods on retrieval tasks based on Wikipedia documents
as well as an Internet advertisement task. We obtain state-of-the-art performance while providing realistically
scalable methods.
@article{bai:2009:ssi_jir,
author = "B. Bai and J. Weston and D. Grangier and R. Collobert and Y. Qi and K. Sadamasa and O. Chapelle and K. Weinberger",
title = "Learning to Rank with (a Lot of) Word Features",
journal = "Information Retrieval -- Special Issue on Learning to Rank",
publisher = "Springer",
year = "2009",
}
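The low-rank, diagonal-preserving model can be scored without materializing the full word-by-word matrix, since q^T (U^T V + I) d = (U q) . (V d) + q . d; a small numpy sketch with made-up vocabulary size and rank follows.
import numpy as np

vocab, rank = 30_000, 100
rng = np.random.default_rng(0)
U = rng.normal(scale=0.01, size=(rank, vocab))
V = rng.normal(scale=0.01, size=(rank, vocab))

def ssi_score(q, d):
    # q, d: bag-of-words vectors of dimension `vocab`.
    # f(q, d) = q^T (U^T V + I) d = (U q) . (V d) + q . d
    return float((U @ q) @ (V @ d) + q @ d)

q = np.zeros(vocab); q[[10, 57, 900]] = 1.0          # toy query
d = np.zeros(vocab); d[[57, 900, 1500, 4000]] = 1.0  # toy document
print(ssi_score(q, d))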
In this article, we propose Supervised Semantic Indexing (SSI), an algorithm that is trained on (query, document)
pairs of text documents to predict the quality of their match. Like Latent Semantic Indexing (LSI), our models
take account of correlations between words (synonymy, polysemy). However, unlike LSI our models are trained with a
supervised signal directly on the ranking task of interest, which we argue is the reason for our superior results.
As the query and target texts are modeled separately, our approach is easily generalized to different retrieval
tasks, such as online advertising placement. Dealing with models on all pairs of word features is computationally
challenging. We propose several improvements to our basic model for addressing this issue, including low rank (but
diagonal preserving) representations, and correlated feature hashing (CFH). We provide an empirical study of all
these methods on retrieval tasks based on Wikipedia documents as well as an Internet advertisement task. We obtain
state-of-the-art performance while providing realistically scalable methods.
@inproceedings{bai:2009:ssi_cikm,
author = "B. Bai and J. Weston and D. Grangier and R. Collobert and Y. Qi and K. Sadamasa and O. Chapelle and K. Weinberger",
title = "Supervised Semantic Indexing",
booktitle = "ACM Conference on Information and Knowledge Management (CIKM)",
year = "2009",
}
We present a class of models that are discriminatively trained to directly map from the word content in a
query-document or document-document pair to a ranking score. Like Latent Semantic Indexing (LSI), our models take
account of correlations between words (synonymy, polysemy). However, unlike LSI our models are trained with a
supervised signal directly on the task of interest, which we argue is the reason for our superior results. We
provide an empirical study on Wikipedia documents, using the links to define document-document or query-document
pairs, where we obtain state-of-the-art performance using our method.
@inproceedings{bai:2009:ssi_ecir,
author = "B. Bai and J. Weston and R. Collobert and D. Grangier",
title = "Supervised Semantic Indexing",
booktitle = "European Conference on Information Retrieval (ECIR)",
year = "2009",
}
This chapter introduces a discriminative approach for the detection of keywords in speech utterances.
Specifically, this work proposes a learning algorithm, which aims at maximizing the area under the receiver
operating curve, given a set of training spotting problems. This algorithm relies on a large margin formulation of
the spotting task, and adopts an efficient online learning strategy. This approach contrasts with standard
spotting strategies based on Hidden Markov Models (HMMs), for which the training procedure does not maximize a
loss directly related to the spotting performance. Different experiments performed over TIMIT and WSJ data show
the advantage of our approach over HMM-based alternatives.
@incollection{grangier:2009:kws_book,
author = "D. Grangier and J. Keshet and S. Bengio",
title = "Discriminative Keyword Spotting",
booktitle = "Automatic Speech and Speaker Recognition: Large Margin and Kernel Methods",
editor = "J. Keshet and S. Bengio",
publisher = "Wiley",
year = "2009",
}
In this thesis, we explore the use of machine learning techniques for information retrieval. More specifically, we
focus on ad-hoc retrieval, which is concerned with searching large corpora to identify the documents relevant to
user queries. This identification is performed through a ranking task. Given a user query, an ad-hoc retrieval
system ranks the corpus documents, so that the documents relevant to the query ideally appear above the others. In
a machine learning framework, we are interested in proposing learning algorithms that can benefit from limited
training data in order to identify a ranker likely to achieve high retrieval performance over unseen documents and
queries. This problem presents novel challenges compared to traditional learning tasks, such as regression or
classification. First, our task is a ranking problem, which means that the loss for a given query cannot be
measured as a sum of an individual loss suffered for each corpus document. Second, most retrieval queries present
a highly unbalanced setup, with a set of relevant documents accounting only for a very small fraction of the
corpus. Third, ad-hoc retrieval corresponds to a kind of ``double'' generalization problem, since the learned
model should not only generalize to new documents but also to new queries. Finally, our task also presents
challenging efficiency constraints, since ad-hoc retrieval is typically applied to large corpora. The main
objective of this thesis is to investigate the discriminative learning of ad-hoc retrieval models. For that
purpose, we propose different models based on kernel machines or neural networks adapted to different retrieval
contexts. The proposed approaches rely on different online learning algorithms that allow efficient learning over
large corpora.
@phdthesis{grangier:2008:phd_thesis,
author = "D. Grangier",
title = "Machine Learning for Information Retrieval",
number = "4088",
school = "Ecole Polytechnique Federale de Lausanne",
year = "2008",
}
This paper proposes a new approach for keyword spotting, which is not based on HMMs. Unlike previous approaches,
the proposed method employs a discriminative learning procedure, in which the learning phase aims at maximizing
the area under the ROC curve, as this quantity is the most common measure to evaluate keyword spotters. The
keyword spotter we devise is based on mapping the input acoustic representation of the speech utterance along with
the target keyword into a vector space. Building on techniques used for large margin and kernel methods for
predicting whole sequences, our keyword spotter distills to a classifier in this vector-space, which separates
speech utterances in which the keyword is uttered from speech utterances in which the keyword is not uttered. We
describe a simple iterative algorithm for training a keyword spotter and discuss its formal properties.
Experiments with the TIMIT corpus show that our method outperforms the conventional HMM-based approach. Further
experiments using the TIMIT trained model, but tested on the WSJ dataset, show that without further training our
method outperforms the conventional HMM-based approach.
@article{grangier:2008:kws_journal,
author = "J. Keshet and D. Grangier and S. Bengio",
title = "Discriminative Keyword Spotting",
journal = "Speech Communication",
year = "2008",
}
This paper introduces a discriminative model for the retrieval of images from text queries. Our approach
formalizes the retrieval task as a ranking problem, and introduces a learning procedure optimizing a criterion
related to the ranking performance. The proposed model hence addresses the retrieval problem directly and does not
rely on an intermediate image annotation task, which contrasts with previous research. Moreover, our learning
procedure builds upon recent work on the online learning of kernel-based classifiers. This yields an efficient,
scalable algorithm, which can benefit from recent kernels developed for image comparison. The experiments
performed over stock photography data show the advantage of our discriminative ranking approach over
state-of-the-art alternatives (e.g. our model yields 26.3% average precision over the Corel dataset, which should
be compared to 22.0%, for the best alternative model evaluated). Further analysis of the results shows that our
model is especially advantageous over difficult queries such as queries with few relevant pictures or
multiple-word queries.
@article{grangier:2008:tpami,
author = "D. Grangier and S. Bengio",
title = "A Discriminative Kernel-based Model to Rank Images from Text Queries",
journal = "IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)",
year = "2008",
}
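A sketch of one Passive-Aggressive step for a triplet ranking constraint, in the spirit of the online learning procedure above; the joint feature map phi(query, image) is assumed to be computed elsewhere, and this is an illustration rather than the exact update used in the paper.
import numpy as np

def pa_ranking_update(w, phi_pos, phi_neg, C=1.0, margin=1.0):
    # One Passive-Aggressive step for the constraint
    #   w . phi(q, relevant) >= w . phi(q, non-relevant) + margin.
    # phi_pos / phi_neg are joint feature vectors for a (query, image) pair.
    delta = phi_pos - phi_neg
    loss = max(0.0, margin - w @ delta)
    if loss > 0:
        tau = min(C, loss / (delta @ delta + 1e-12))  # aggressive but bounded step
        w = w + tau * delta
    return w

# toy usage with random joint features
rng = np.random.default_rng(1)
w = np.zeros(64)
for _ in range(100):
    w = pa_ranking_update(w, rng.normal(size=64), rng.normal(size=64))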
This paper proposes a discriminative approach to template-based keyword detection. We introduce a method to learn
the distance used to compare acoustic frames, a crucial element for template matching approaches. The proposed
algorithm estimates the distance from data, with the objective to produce a detector maximizing the Area Under the
receiver operating Curve (AUC), i.e. the standard evaluation measure for the keyword detection problem. The
experiments performed over a large corpus, SpeechDatII, suggest that our model is effective compared to an HMM
system, e.g. the proposed approach reaches an averaged AUC of 93.8%, compared to 87.9% for the HMM.
@inproceedings{grangier:2007:rr_07-15,
author = "D. Grangier and S. Bengio",
title = "Learning the Inter-frame Distance for Discriminative Template-based Keyword Detection",
booktitle = "International Conference on Speech Processing (INTERSPEECH)",
year = "2007",
}
This paper proposes a new approach for keyword spotting, which is not based on HMMs. The proposed method employs a
new discriminative learning procedure, in which the learning phase aims at maximizing the area under the ROC
curve, as this quantity is the most common measure to evaluate keyword spotters. The keyword spotter we devise is
based on non-linearly mapping the input acoustic representation of the speech utterance along with the target
keyword into an abstract vector space. Building on techniques used for large margin methods for predicting whole
sequences, our keyword spotter distills to a classifier in the abstract vector-space which separates speech
utterances in which the keyword was uttered from speech utterances in which the keyword was not uttered. We
describe a simple iterative algorithm for learning the keyword spotter and discuss its formal properties.
Experiments with the TIMIT corpus show that our method outperforms the conventional HMM-based approach.
@inproceedings{keshet:2007:nolisp,
author = "J. Keshet and D. Grangier and S. Bengio",
title = "Discriminative Keyword Spotting",
booktitle = "International Workshop on Non-LInear Speech Processing (NOLISP)",
year = "2007",
}
This work presents a neural network for the retrieval of images from text queries. The proposed network is
composed of two main modules: the first one extracts a global picture representation from local block descriptors
while the second one aims at solving the retrieval problem from the extracted representation. Both modules are
trained jointly to minimize a loss related to the retrieval performance. This approach is shown to be advantageous
when compared to previous models relying on unsupervised feature extraction: average precision over Corel queries
reaches 26.2% for our model, which should be compared to 21.6% for PAMIR, the best alternative.
@inproceedings{grangier:2006:icann,
author = "D. Grangier and S. Bengio",
title = "A Neural Network to Retrieve Images from Text Queries",
booktitle = "International Conference on Artificial Neural Networks (ICANN)}",
year = "2006",
}
This work presents a discriminative model for the retrieval of pictures from text queries. The core idea of this
approach is to minimize a loss directly related to the retrieval performance of the model. For that purpose, we
rely on a ranking loss which has recently been successfully applied to text retrieval problems. The experiments
performed over the Corel dataset show that our approach compares favorably with generative models that constitute
the state-of-the-art (e.g. our model reaches 21.6% mean average precision with Blob and SIFT features, compared to
16.7% for PLSA, the best alternative).
@inproceedings{grangier:2006:amr,
author = "D. Grangier and F. Monay and S. Bengio",
title = "Learning to Retrieve Images from Text Queries with a Discriminative Model",
booktitle = "International Workshop on Adaptive Multimedia Retrieval (AMR)",
year = "2006",
}
This work proposes a new approach to the retrieval of images from text queries. Contrasting with previous work,
this method relies on a discriminative model: the parameters are selected in order to minimize a loss related to
the ranking performance of the model, i.e. its ability to rank the relevant pictures above the non-relevant ones
when given a text query. In order to minimize this loss, we introduce an adaptation of the recently proposed
Passive-Aggressive algorithm. The generalization performance of this approach is then compared with alternative
models over the Corel dataset. These experiments show that our method outperforms the current state-of-the-art
approaches, e.g. the average precision over Corel test data is 21.6% for our model versus 16.7% for the best
alternative, Probabilistic Latent Semantic Analysis.
@inproceedings{grangier:2006:ecml,
author = "D. Grangier and F. Monay and S. Bengio",
title = "A Discriminative Approach for the Retrieval of Images from Text Queries",
booktitle = "European Conference on Machine Learning (ECML)",
year = "2006",
}
In this report, we propose a discriminative decoder for phoneme recognition, i.e. the identification of the
uttered phoneme sequence from a speech recording. This task is solved as a three-step process: a phoneme classifier
first classifies each acoustic frame, then temporal consistency features (TCF) are extracted from the phoneme
classifier outputs, and finally a sequence decoder identifies the phoneme sequence according to the TCF.
@techreport{grangier:2005:idiap-05-67,
author = "D. Grangier and S. Bengio",
title = "A Discriminative Decoder for the Recognition of Phoneme Sequences",
number = "67",
institution = "IDIAP",
year = "2005",
}
Information Retrieval (IR) aims at solving a ranking problem: given a query q and a corpus C, the documents of C
should be ranked such that the documents relevant to q appear above the others. This task is generally performed
by ranking the documents d in C according to their similarity with respect to q, sim(q, d). The identification of
an effective function (a, b) -> sim(a, b) could be performed using a large set of queries with their corresponding
relevance assessments. However, such data are especially expensive to label, thus, as an alternative, we propose
to rely on hyperlink data which convey analogous semantic relationships. We then empirically show that a measure
sim inferred from hyperlinked documents can actually outperform the state-of-the-art Okapi approach, when applied
over a non-hyperlinked retrieval corpus.
@inproceedings{grangier:2005:nips_workshop,
author = "D. Grangier and S. Bengio",
title = "Exploiting Hyperlinks to Learn a Retrieval Model",
booktitle = "NIPS Workshop on Learning to Rank",
year = "2005",
}
Assessing semantic similarity between text documents is a crucial aspect in Information Retrieval systems. In this
work, we propose to use hyperlink information to derive a similarity measure that can then be applied to compare
any text documents, with or without hyperlinks. As linked documents are generally semantically closer than
unlinked documents, we use a training corpus with hyperlinks to infer a function (a, b) -> sim(a, b) that assigns a
higher value to linked documents than to unlinked ones. Two sets of experiments on different corpora show that
this function compares favorably with OKAPI matching on document retrieval tasks.
@inproceedings{grangier:2005:cikm,
author = "D. Grangier and S. Bengio",
title = "Inferring Document Similarity from Hyperlinks",
booktitle = "ACM Conference on Information and Knowledge Management (CIKM)",
year = "2005",
}
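A hedged numpy sketch of the idea: learn a bilinear similarity sim(a, b) = a^T W b with a margin criterion that scores a linked pair above an unlinked one, using a plain subgradient step (not necessarily the training procedure used in the paper).
import numpy as np

def similarity(a, b, W):
    # Bilinear similarity over bag-of-words vectors: sim(a, b) = a^T W b.
    return a @ W @ b

def margin_update(W, doc, linked, unlinked, lr=0.1, margin=1.0):
    # Push sim(doc, linked) above sim(doc, unlinked) by a margin,
    # taking a subgradient step on the hinge loss when it is violated.
    if margin - similarity(doc, linked, W) + similarity(doc, unlinked, W) > 0:
        W = W + lr * (np.outer(doc, linked) - np.outer(doc, unlinked))
    return W

# toy usage: start from identity (cosine-like matching) and refine from one triplet
rng = np.random.default_rng(0)
W = np.identity(200)
doc, linked, unlinked = rng.random((3, 200))
W = margin_update(W, doc, linked, unlinked)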
This paper presents experiments that evaluate the effect of different video segmentation methods on text-based
video retrieval. Segmentations relying on modalities like speech, video and text or their combination are compared
with a baseline sliding window segmentation. The results suggest that even with the sliding window segmentation,
acceptable performance can be obtained on a broadcast news retrieval task. Moreover, in the case where manually
segmented data are available for training, the approach combining the different modalities can lead to IR results
close to those obtained with a manual segmentation.
@inproceedings{grangier:2005:icme,
author = "D. Grangier and A. Vinciarelli",
title = "Effect of Segmentation Method on Video Retrieval Performance",
booktitle = "IEEE International Conference on Multimedia and Expo (ICME)",
year = "2005",
}
This paper presents clustering experiments performed over noisy texts (i.e. texts that have been extracted through
an automatic process like character or speech recognition). The effect of recognition errors is investigated by
comparing clustering results performed over both clean (manually typed data) and noisy (automatic speech
transcriptions) versions of the same speech recording corpus.
@techreport{grangier:2004:idiap-04-82,
author = "D. Grangier and A. Vinciarelli",
title = "Effect of Recognition Errors on Text Clustering",
number = "82",
institution = "IDIAP",
year = "2004",
}
Spoken Document Retrieval (SDR) consists in retrieving segments of a speech database that are relevant to a query.
The state-of-the-art approach to the SDR problem consists in transcribing the speech data into digital text before
applying common Information Retrieval (IR) techniques. The transcription, produced by an Automatic Speech
Recognition system, contains recognition errors. These errors can be referred to as noise. This thesis
investigates the effect of this noise on the retrieval process. We compare the results obtained with clean and
noisy data at different steps of the retrieval process. To perform such a task, standard IR measures (precision,
recall, break-even point, etc.) are used. It is shown that even with very different error rates (10% vs 30%), the
performances obtained over noisy text are only slightly lower than those over clean text (9% degradation of
average precision for our complete IR system, 45.2% vs 41.2%).
@techreport{com-03-08,
author = "D. Grangier and A. Vinciarelli and H. Bourlard",
title = "Information Retrieval on Noisy Text",
number = "8",
Keywords = "Information Retrieval, Speech, Spoken Documents Retrieval, Noisy Text",
institution = "IDIAP",
year = "2003",
}
Code, Data & Demos
Tutorials and Workshops
Colleagues and Co-Authors