
I am David Grangier, welcome to my homepage. I am currently with the Machine Learning Group at Apple Research, in Cupertino, California. I am interested in large-scale Machine Learning and its application to pattern analysis tasks, such as Information Retrieval, Speech Recognition and Natural Language Processing. My research interests are described below, while my resume and my LinkedIn profile give an overview of my previous experience.
Research Interests
Learning to rank has received increasing attention over the last
decade, due to applications in domains such as information retrieval.
These problems aim at assigning a confidence value to each
example of a set, such that the values induce a specific
ordering over the set. This task hence differs significantly
from classical machine learning problems, since the example
outputs cannot be considered independently.
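As a toy illustration of this ordering constraint, here is a minimal Python sketch (using numpy, and not taken from any specific paper) of a pairwise hinge loss for a linear scoring function: the loss depends only on score differences between examples, rather than treating each output independently.

import numpy as np

def pairwise_hinge_loss(w, X, y, margin=1.0):
    """Hinge loss over all pairs (i, j) where item i should rank above item j."""
    scores = X @ w
    loss, grad = 0.0, np.zeros_like(w)
    for i in range(len(y)):
        for j in range(len(y)):
            if y[i] > y[j]:  # item i is more relevant than item j
                violation = margin - (scores[i] - scores[j])
                if violation > 0:
                    loss += violation
                    grad += -(X[i] - X[j])  # push score[i] up and score[j] down
    return loss, grad

# Tiny example: 4 items with 3 features and graded relevance labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
y = np.array([2, 1, 0, 1])
w = np.zeros(3)
for _ in range(100):
    loss, grad = pairwise_hinge_loss(w, X, y)
    w -= 0.1 * grad
print("final ranking:", np.argsort(-(X @ w)), "labels:", y)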
Many effective learning algorithms, such as nearest neighbor classifiers or
support vector machines, crucially rely on a function to compare
examples. Such a function can be referred to as a distance metric, a similarity measure or
a kernel, depending on its mathematical properties. Rather than defining this
function from prior knowledge alone, new learning techniques have been
introduced to learn it from training data that provides information about the
desired proximity between examples.
In many problems, specific relations exist between the different inputs
or between the labels. For instance, computer vision features extracted
from the same image share spatial relations, and the phoneme classes
in speech belong to a hierarchical structure. Encoding prior information
about such structures, or learning such relationships, offers great opportunities
to improve machine learning approaches for several pattern recognition problems,
and several techniques have been introduced toward this objective in recent
years.
Publications
Specialist language models (LMs) focus on a specific task or domain on which they often outperform generalist LMs
of the same size. However, the specialist data needed to pretrain these models is only available in limited amounts
for most tasks. In this work, we build specialist models from large generalist training sets instead. We adjust
the training distribution of the generalist data with guidance from the limited domain-specific data. We explore
several approaches, with clustered importance sampling standing out. This method clusters the generalist dataset
and samples from these clusters based on their frequencies in the smaller specialist dataset. It is scalable,
suitable for pretraining and continued pretraining, and works well in multi-task settings. Our findings demonstrate
improvements across different domains in terms of language modeling perplexity and accuracy on multiple-choice
question tasks. We also present ablation studies that examine the impact of dataset sizes, clustering
configurations, and model sizes.
@inproceedings{grangier2025crisp,
title = {Task-Adaptive Pretrained Language Models via Clustered-Importance Sampling},
author = {Grangier, David and Fan, Simin and Seto, Skyler and Ablin, Pierre},
booktitle = {Proceedings of the International Conference on Learning Representations (ICLR)},
year = {2025},
url={https://doi.org/10.48550/arXiv.2410.03735},
}
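A rough numpy sketch of the clustered importance sampling idea described above, assuming both datasets already come with precomputed cluster assignments (the clustering pipeline and the exact procedure are described in the paper): generalist examples are resampled so that cluster frequencies match those observed in the small specialist set.

import numpy as np

def clustered_importance_weights(generalist_clusters, specialist_clusters, num_clusters):
    """Per-example sampling weights that reweight generalist clusters
    towards the cluster histogram of the specialist data."""
    # Cluster frequencies in each dataset (with a small floor to avoid division by zero).
    p_gen = np.bincount(generalist_clusters, minlength=num_clusters) + 1e-9
    p_spec = np.bincount(specialist_clusters, minlength=num_clusters) + 1e-9
    p_gen, p_spec = p_gen / p_gen.sum(), p_spec / p_spec.sum()
    # Importance ratio per cluster, then per generalist example.
    ratio = p_spec / p_gen
    weights = ratio[generalist_clusters]
    return weights / weights.sum()

rng = np.random.default_rng(0)
generalist_clusters = rng.integers(0, 8, size=100_000)   # e.g. k-means cluster ids
specialist_clusters = rng.choice(8, size=500, p=[0.5, 0.2, 0.1, 0.1, 0.05, 0.03, 0.01, 0.01])
w = clustered_importance_weights(generalist_clusters, specialist_clusters, num_clusters=8)
resampled = rng.choice(len(generalist_clusters), size=10_000, p=w)   # indices to train on
print(np.bincount(generalist_clusters[resampled], minlength=8) / 10_000)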
Momentum based optimizers are central to a wide range of machine learning applications. These typically rely on an
Exponential Moving Average (EMA) of gradients, which exponentially decays the contribution of older
gradients. This accounts for gradients being local linear approximations which lose their relevance as the iterate
moves along the loss landscape. This work questions the use of a single EMA to accumulate past gradients and
empirically demonstrates how this choice can be sub-optimal: a single EMA cannot simultaneously give a high weight
to the immediate past, and a non-negligible weight to older gradients. Building on this observation, we propose
AdEMAMix, a simple modification of the Adam optimizer with a mixture of two EMAs to better take advantage of past
gradients. Our experiments on language modeling and image classification show -- quite surprisingly -- that
gradients can stay relevant for tens of thousands of steps. They help to converge faster, and often to lower
minima: e.g., a 1.3B parameter AdEMAMix LLM trained on 101B tokens performs comparably to an AdamW model trained
on 197B tokens (+95%). Moreover, our method significantly slows down model forgetting during training. Our work
motivates further exploration of different types of functions to leverage past gradients, beyond EMAs.
@inproceedings{pagliardini2025ademamix,
author={Matteo Pagliardini and Pierre Ablin and David Grangier},
title={The AdEMAMix Optimizer: Better, Faster, Older},
booktitle = {Proceedings of the International Conference on Learning Representations (ICLR)},
year={2025},
url={https://doi.org/10.48550/arXiv.2409.03137},
}
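A hedged numpy sketch of the core idea above, a momentum term mixing a fast and a slow EMA of gradients. The coefficients and the exact combination below are illustrative placeholders, not the precise AdEMAMix update rule from the paper.

import numpy as np

def two_ema_adam_step(param, grad, state, lr=1e-3, beta1=0.9, beta2=0.999,
                      beta3=0.9999, alpha=5.0, eps=1e-8):
    """One optimizer step mixing a fast EMA (m1) and a slow EMA (m2) of gradients."""
    state["t"] += 1
    state["m1"] = beta1 * state["m1"] + (1 - beta1) * grad      # fast EMA: recent past
    state["m2"] = beta3 * state["m2"] + (1 - beta3) * grad      # slow EMA: older gradients
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2   # second moment, as in Adam
    m1_hat = state["m1"] / (1 - beta1 ** state["t"])            # bias correction for the fast EMA
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    update = (m1_hat + alpha * state["m2"]) / (np.sqrt(v_hat) + eps)
    return param - lr * update

# Toy quadratic: minimize 0.5 * ||x||^2, whose gradient is x itself.
x = np.ones(4)
state = {"m1": np.zeros(4), "m2": np.zeros(4), "v": np.zeros(4), "t": 0}
for _ in range(2000):
    x = two_ema_adam_step(x, grad=x, state=state)
print(np.linalg.norm(x))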
We introduce SmallTalk LM, an innovative method for training a mixture of language models in an almost
asynchronous manner. Each model of the mixture specializes in distinct parts of the data distribution, without the
need for high-bandwidth communication between the nodes training each model. At inference, a lightweight router
directs a given sequence to a single expert, according to a short prefix. This inference scheme naturally uses a
fraction of the parameters from the overall mixture model. Our experiments on language modeling demonstrate that
SmallTalk LM achieves significantly lower perplexity than dense model baselines for the same total training FLOPs
and an almost identical inference cost. Finally, in our downstream evaluations we outperform the dense baseline on
75% of the tasks.
@inproceedings{filippova2025no,
title={No Need to Talk: Asynchronous Mixture of Language Models},
author={Anastasiia Filippova and Angelos Katharopoulos and David Grangier and Ronan Collobert},
booktitle = {Proceedings of the International Conference on Learning Representations (ICLR)},
year={2025},
url={https://arxiv.org/abs/2410.03529},
}
Large language models are trained on massive scrapes of the web, as required by current scaling laws. Most
progress is made for English, given its abundance of high-quality pretraining data. For most other languages,
however, such high-quality pretraining data is unavailable. In this work, we study how to boost pretrained model
performance in a data-constrained target language by enlisting data from an auxiliary language for which
high-quality data is available. We study this by quantifying the performance gap between training with data in a
data-rich auxiliary language compared with training in the target language, exploring the benefits of translation
systems, studying the limitations of model scaling for data-constrained languages, and proposing new methods for
upsampling data from the auxiliary language. Our results show that stronger auxiliary datasets result in
performance gains without modification to the model or training objective for close languages, and, in particular,
that performance gains due to the development of more information-rich English pretraining datasets can extend to
targeted language settings with limited data.
@inproceedings{seto2024bilingual,
title={Training Bilingual LMs with Data Constraints in the Targeted Language},
author={Seto, Skyler and ter Hoeve, Maartje and Bai, He and Schluter, Natalie and Grangier, David},
booktitle = {Findings of the Association for Computational Linguistics (ACL)},
year={2025},
url={https://arxiv.org/abs/2411.12986},
}
Machine learning models are routinely trained on a mixture of different data domains. Different domain weights
yield very different downstream performances. We propose the Soup-of-Experts, a novel architecture that can
instantiate a model at test time for any domain weights with minimal computational cost and without re-training
the model. Our architecture consists of a bank of expert parameters, which are linearly combined to instantiate
one model. We learn the linear combination coefficients as a function of the input domain weights. To train this
architecture, we sample random domain weights, instantiate the corresponding model, and backprop through one batch
of data sampled with these domain weights. We demonstrate how our approach obtains small specialized models on
several language modeling tasks quickly. Soup-of-Experts are particularly appealing when one needs to ship many
different specialist models quickly under a model size constraint.
@inproceedings{ablin2025soup,
author={Pierre Ablin and Angelos Katharopoulos and Skyler Seto and David Grangier},
title={Soup-of-Experts: Pretraining Specialist Models via Parameters Averaging},
booktitle = {Proceedings of the International Conference on Machine Learning (ICML)},
year={2025},
url={https://arxiv.org/abs/2502.01804},
}
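A minimal numpy sketch of the parameter-soup idea for a single weight matrix, with a made-up coefficient function: expert parameters are combined linearly, and the combination coefficients are produced from the input domain weights (here a simple softmax of a random linear map, purely for illustration).

import numpy as np

rng = np.random.default_rng(0)
num_experts, d_in, d_out, num_domains = 4, 8, 8, 3

# Bank of expert parameters (one weight matrix per expert) and a coefficient head.
expert_bank = rng.normal(size=(num_experts, d_in, d_out))
coeff_head = rng.normal(size=(num_domains, num_experts)) * 0.1

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def instantiate_model(domain_weights):
    """Instantiate one weight matrix for the given domain histogram at negligible cost."""
    coeffs = softmax(domain_weights @ coeff_head)     # (num_experts,)
    W = np.tensordot(coeffs, expert_bank, axes=1)     # linear combination of the experts
    return W

# Two different specialist mixtures yield two different instantiated models.
W_legal = instantiate_model(np.array([0.8, 0.1, 0.1]))
W_medical = instantiate_model(np.array([0.1, 0.1, 0.8]))
print(np.linalg.norm(W_legal - W_medical))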
A widespread strategy to obtain a language model that performs well on a target domain is to finetune a pretrained
model to perform unsupervised next-token prediction on data from that target domain. Finetuning presents two
challenges: (i) if the amount of target data is limited, as in most practical applications, the model will quickly
overfit, and (ii) the model will drift away from the original model, forgetting the pretraining data and the
generic
knowledge that comes with it. We aim to derive scaling laws that quantify these two phenomena for various target
domains, amounts of available target data, and model scales. We measure the efficiency of injecting pretraining
data
into the finetuning data mixture to avoid forgetting and mitigate overfitting. A key practical takeaway from our
study is that injecting as little as 1% of pretraining data in the finetuning data mixture prevents the model from
forgetting the pretraining set.
@inproceedings{bethune2025finetuning,
title={Scaling Laws for Forgetting during Finetuning with Pretraining Data Injection},
author={Louis Bethune and David Grangier and Dan Busbridge and Eleonora Gualdoni and Marco Cuturi and Pierre Ablin},
booktitle = {Proceedings of the International Conference on Machine Learning (ICML)},
year={2025},
url={https://arxiv.org/abs/2502.06042},
}
@article{fan2024dga,
author={Simin Fan and David Grangier and Pierre Ablin},
title={Dynamic Gradient Alignment for Online Data Mixing},
journal={arXiv},
volume={2410.02498},
year={2024},
url={https://doi.org/10.48550/arXiv.2410.02498},
}
Large language models are versatile tools but are not suitable for small inference budgets. Small models have more
efficient inference, but their lower capacity means that their performance can be good only if one limits their
scope to a specialized domain. This paper explores how to get good specialized small language models using a
large, generic, pretraining set and a limited amount of specialized data. We consider two scenarios, depending on
whether (i) one can afford pretraining a model for each specialization task, or (ii) one wants to cheaply adapt a
single pretrained model for each task. In the first scenario, we propose an effective solution based on importance
sampling: we resample the pretraining set to imitate the specialization data and train a small model on it. In the
second scenario, we propose a novel architecture, projected networks (PN). PN is a large network whose parameters
can be linearly projected into a small network for specialization. For both scenarios, we demonstrate the
empirical effectiveness of our solutions across various domains, training set sizes, and training budgets.
@article{grangier2024slm,
author={David Grangier and Angelos Katharopoulos and Pierre Ablin and Awni Hannun},
title={Need a Small Specialized Language Model? Plan Early!},
journal={arXiv},
volume={2402.01093},
year={2024},
url={https://doi.org/10.48550/arXiv.2402.01093},
}
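A toy numpy sketch of the projected-network idea mentioned above for a single linear layer: a large weight matrix is linearly projected into a smaller one for specialization. The projection matrices and dimensions below are illustrative assumptions, not the parameterization used in the paper.

import numpy as np

rng = np.random.default_rng(0)
d_large, d_small = 1024, 256

# Large, generically pretrained weight matrix (stand-in for one layer of the big model).
W_large = rng.normal(size=(d_large, d_large)) / np.sqrt(d_large)

# Learned linear projections that cut a small specialist layer out of the large one.
P_in = rng.normal(size=(d_small, d_large)) / np.sqrt(d_large)
P_out = rng.normal(size=(d_large, d_small)) / np.sqrt(d_large)

W_small = P_in @ W_large @ P_out    # (d_small, d_small) specialist weight
x = rng.normal(size=d_small)
print(W_small.shape, (W_small @ x).shape)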
Large language models are versatile tools but are not suitable for small inference budgets. Small models have more
efficient inference but their lower capacity means that their performance can be good only if one limits their
scope to a specialized domain. This paper explores how to get a small language model with good specialized
accuracy, even when specialization data is unknown during pretraining. We propose a novel architecture, projected
networks (PN). PN is a high capacity network whose parameters can be linearly projected into a small network for
fine tuning. We assess the empirical effectiveness of our solution compared to small model training, distillation
and hard mixture of experts.
@inproceedings{grangier2024projected,
title={Projected Language Models: A Large Model Pre-Segmented Into Smaller Ones},
author={Grangier, David and Katharopoulos, Angelos and Ablin, Pierre and Hannun, Awni},
booktitle={ICML 2024 FM-Wild Workshop},
year={2024},
url={https://openreview.net/forum?id=Wi88giKi7N},
}
Large pretrained vision-language models like CLIP have shown promising generalization capability, but may struggle
in specialized domains (e.g., satellite imagery) or fine-grained classification (e.g., car models) where the
visual concepts are unseen or under-represented during pretraining. Prompt learning offers a parameter-efficient
finetuning framework that can adapt CLIP to downstream tasks even when limited annotation data are available. In
this paper, we improve prompt learning by distilling the textual knowledge from natural language prompts (either
human- or LLM-generated) to provide rich priors for those under-represented concepts. We first obtain a prompt
"summary" aligned to each input image via a learned prompt aggregator. Then we jointly train a prompt generator,
optimized to produce a prompt embedding that stays close to the aggregated summary while minimizing task loss at
the same time. We dub such prompt embedding as Aggregate-and-Adapted Prompt Embedding (AAPE). AAPE is shown to be
able to generalize to different downstream data distributions and tasks, including vision-language understanding
tasks (e.g., few-shot classification, VQA) and generation tasks (image captioning) where AAPE achieves competitive
performance. We also show AAPE is particularly helpful to handle non-canonical and OOD examples. Furthermore, AAPE
learning eliminates LLM-based inference cost as required by baselines, and scales better with data and LLM model
size.
@inproceedings{huang2024aggregate,
author={Chen Huang and Skyler Seto and Samira Abnar and David Grangier and Navdeep Jaitly and Josh Susskind},
title={Aggregate-and-Adapt Natural Language Prompts for Downstream Generalization of CLIP},
booktitle={Advances in Neural Information Processing Systems},
year={2024},
url={https://neurips.cc/virtual/2024/poster/94659},
}
@inproceedings{maini2024rephrase,
author={Pratyush Maini and Skyler Seto and Richard He Bai and David Grangier and Yizhe Zhang and Navdeep Jaitly},
title={Rephrasing the Web: {A} Recipe for Compute and Data-Efficient Language Modeling},
booktitle={Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL)},
year={2024},
pages={14044--14072},
}
Large neural networks pretrained on web-scale corpora are central to modern machine learning. In this paradigm,
the distribution of the large, heterogeneous pretraining data rarely matches that of the application domain. This
work considers modifying the pretraining distribution in the case where one has a small sample of data reflecting
the targeted test conditions. We propose an algorithm motivated by a recent formulation of this setting as an
online, bilevel optimization problem. With scalability in mind, our algorithm prioritizes computing gradients at
training points which are likely to most improve the loss on the targeted distribution. Empirically, we show that
in some cases this approach is beneficial over existing strategies from the domain adaptation literature but may
not succeed in other cases. We propose a simple test to evaluate when our approach can be expected to work well
and point towards further research to address current limitations.
@article{grangier2024adaptive,
title={Adaptive Training Distributions with Scalable Online Bilevel Optimization},
author={Grangier, David and Ablin, Pierre and Hannun, Awni},
journal={Transactions on Machine Learning Research (TMLR)},
year={2024},
url={https://openreview.net/forum?id=JP1GVyF5i5},
}
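A simplified numpy sketch of the gradient-prioritization idea described above, using a linear regression model so per-example gradients have a closed form: each candidate training point is scored by the dot product between its gradient and the gradient of the loss on the small target sample, and the best-aligned points are selected for the next update. The actual algorithm in the paper operates online on large neural networks; this is only a sketch of the selection rule.

import numpy as np

rng = np.random.default_rng(0)
d = 5
w_target = rng.normal(size=d)

# Heterogeneous pretraining pool: half matches the target task, half does not.
X_pool = rng.normal(size=(1000, d))
y_pool = np.concatenate([
    X_pool[:500] @ w_target,                 # on-target portion of the pool
    X_pool[500:] @ rng.normal(size=d),       # off-target portion
])
X_tgt = rng.normal(size=(20, d))             # small sample of targeted test conditions
y_tgt = X_tgt @ w_target

w = np.zeros(d)
for step in range(200):
    g_tgt = 2 * X_tgt.T @ (X_tgt @ w - y_tgt) / len(X_tgt)      # target-loss gradient
    idx = rng.choice(len(X_pool), size=64, replace=False)       # candidate batch from the pool
    residual = X_pool[idx] @ w - y_pool[idx]
    per_example_grads = 2 * residual[:, None] * X_pool[idx]     # (64, d)
    align = per_example_grads @ g_tgt                           # alignment with the target gradient
    keep = idx[np.argsort(-align)[:16]]                         # prioritize best-aligned points
    g = 2 * X_pool[keep].T @ (X_pool[keep] @ w - y_pool[keep]) / len(keep)
    w -= 0.05 * g
print("target loss:", float(np.mean((X_tgt @ w - y_tgt) ** 2)))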
We introduce AudioLM, a framework for high-quality audio generation with long-term consistency. AudioLM maps the
input audio to a sequence of discrete tokens and casts audio generation as a language modeling task in this
representation space. We show how existing audio tokenizers provide different trade-offs between reconstruction
quality and long-term structure, and we propose a hybrid tokenization scheme to achieve both objectives. Namely,
we leverage the discretized activations of a masked language model pre-trained on audio to capture long-term
structure and the discrete codes produced by a neural audio codec to achieve high-quality synthesis. By training
on large corpora of raw audio waveforms, AudioLM learns to generate natural and coherent continuations given short
prompts. When trained on speech, and without any transcript or annotation, AudioLM generates syntactically and
semantically plausible speech continuations while also maintaining speaker identity and prosody for unseen
speakers. Furthermore, we demonstrate how our approach extends beyond speech by generating coherent piano music
continuations, despite being trained without any symbolic representation of music.
@article{audio-lm-generation-2023,
author={Borsos, Zalán and Marinier, Raphaël and Vincent, Damien and Kharitonov, Eugene and Pietquin, Olivier and Sharifi, Matt and Roblek, Dominik and Teboul, Olivier and Grangier, David and Tagliasacchi, Marco and Zeghidour, Neil},
journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
title={AudioLM: A Language Modeling Approach to Audio Generation},
year={2023},
volume={31},
number={},
pages={2523-2533},
keywords={Semantics;Acoustics;Training;Computational modeling;Codecs;Predictive models;Task analysis;Computer generated music;speech synthesis},
doi={10.1109/TASLP.2023.3288409}
}
Pre-trained models are growing increasingly large which can be problematic for applications with strong inference
constraints. Fortunately, task-aware structured pruning offers a solution. While existing pruning algorithms can
be efficient, the common practical setting where task-specific data is limited is yet to be addressed. To
ameliorate the data scarcity problem, we propose a structured pruning strategy that leverages transfer learning.
Detailed analyses of simple transfer learning based remedies lead us to a simple, flexible formulation of what,
how and when to transfer, resulting in pruned models with improved generalization over strong baselines.
@inproceedings{dery-grangier-hannun-structured-pruning,
title={Transfer Learning for Structured Pruning under Limited Task Data},
author={Lucio Dery and David Grangier and Awni Hannun},
year={2023},
booktitle = {Third Workshop on Efficient Natural Language and Speech Processing (ENLSP-III)},
}
Language models trained on very large web corpora have become a central piece of modern language processing. In
this paradigm, the large, heterogeneous training set rarely matches the distribution of the application domain.
This work considers modifying the training distribution in the case where one can observe a small sample of data
reflecting the test conditions. We propose an algorithm based on a recent formulation of this problem as an online,
bilevel optimization problem. We show that this approach compares favorably with alternative strategies from the
domain adaptation literature.
@inproceedings{grangier-ablin-hannun-bilevel-learn-lm-distribution,
title={Bilevel Optimization to Learn Training Distributions for Language Modeling under Domain Shift},
author={David Grangier and Pierre Ablin and Awni Hannun},
year={2023},
booktitle = {NeurIPS 2023 Workshop on Distribution Shifts},
}
This work connects language model adaptation with concepts of machine learning theory. We consider a training
setup with a large out-of-domain set and a small in-domain set. We derive how the benefit of training a model on
either set depends on the size of the sets and the distance between their underlying distributions. We analyze how
out-of-domain pre-training before in-domain fine-tuning achieves better generalization than either solution
independently. Finally, we present how adaptation techniques based on data selection, such as importance sampling,
intelligent data selection and influence functions, can be cast in a common framework which highlights their
similarity and also their subtle differences.
@inproceedings{grangier2022tradeoffs,
title={The Trade-offs of Domain Adaptation for Neural Language Models},
author={David Grangier and Dan Iter},
year={2022},
booktitle = {Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL)},
}
Machine translation (MT) evaluation often focuses on accuracy and fluency, without paying much attention to
translation style. This means that, even when considered accurate and fluent, MT output can still sound less
natural than high quality human translations or text originally written in the target language.
Machine translation output notably exhibits lower lexical diversity, and employs constructs that mirror those in
the source sentence. In this work we propose a method for training MT systems to achieve a more natural style,
i.e. mirroring the style of text originally written in the target language. Our method tags parallel training data
according to the naturalness of the target side by contrasting language models trained on natural and translated
data.
Tagging data allows us to put greater emphasis on target sentences originally written in the target language.
Automatic metrics show that the resulting models achieve lexical richness on par with human translations,
mimicking a style much closer to sentences originally written in the target language. Furthermore, we find that
their output is preferred by human experts when compared to the baseline translations.
@inproceedings{freitag-natural-diet-translation-2022,
title={A Natural Diet: Towards Improving Naturalness of Machine Translation},
author={Markus Freitag and David Vilar and David Grangier and Colin Cherry and George Foster},
booktitle={Findings of the Annual Meeting of the Association for Computational Linguistics (ACL)},
year={2022},
}
Convolutional neural networks typically contain several downsampling operators, such as strided convolutions or
pooling layers, that progressively reduce the resolution of intermediate representations. This provides some
shift-invariance while reducing the computational complexity of the whole architecture. A critical hyperparameter
of such layers is their stride: the integer factor of downsampling. As strides are not differentiable, finding the
best configuration either requires cross-validation or discrete optimization (e.g. architecture search), which
rapidly become prohibitive as the search space grows exponentially with the number of downsampling layers. Hence,
exploring this search space by gradient descent would allow finding better configurations at a lower computational
cost. This work introduces DiffStride, the first downsampling layer with learnable strides. Our layer learns the
size of a cropping mask in the Fourier domain, that effectively performs resizing in a differentiable way.
Experiments on audio and image classification show the generality and effectiveness of our solution: we use
DiffStride as a drop-in replacement to standard downsampling layers and outperform them. In particular, we show
that introducing our layer into a ResNet-18 architecture allows keeping consistent high performance on CIFAR10,
CIFAR100 and ImageNet even when training starts from poor random stride configurations. Moreover, formulating
strides as learnable variables allows us to introduce a regularization term that controls the computational
complexity of the architecture. We show how this regularization allows trading off accuracy for efficiency on
ImageNet.
@inproceedings{riad-learning-strides-convnets-2022,
title={Learning Strides in Convolutional Neural Networks},
author={Rachid Riad and Olivier Teboul and David Grangier and Neil Zeghidour},
booktitle={International Conference on Learning Representations (ICLR)},
year={2022},
}
In Neural Machine Translation, it is typically assumed that the sentence with the highest estimated probability
should also be the translation with the highest quality as measured by humans. In this work, we question this
assumption and show that model estimates and translation quality only vaguely correlate. We apply Minimum Bayes
Risk (MBR) decoding on unbiased samples to optimize diverse automated metrics of translation quality as an
alternative inference strategy to beam search. Instead of targeting the hypotheses with the highest model
probability, MBR decoding extracts the hypotheses with the highest estimated quality. Our experiments show that
the combination of a neural translation model with a neural reference-based metric, BLEURT, results in significant
improvement in human evaluations. This improvement is obtained with translations different from classical
beam-search output: These translations have much lower model likelihood and are less favored by surface metrics
like BLEU.
@article{freitag-etal-2022-high,
title = "High Quality Rather than High Model Probability: Minimum {B}ayes Risk Decoding with Neural Metrics",
author = "Freitag, Markus and Grangier, David and Tan, Qijun and Liang, Bowen",
editor = "Roark, Brian and Nenkova, Ani",
journal = "Transactions of the Association for Computational Linguistics",
volume = "10",
year = "2022",
url = "https://aclanthology.org/2022.tacl-1.47/"
}
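A small Python sketch of Minimum Bayes Risk decoding over sampled candidates. The utility function below is a stand-in token-level F1, chosen only to keep the example self-contained; the paper uses a neural metric such as BLEURT. Each hypothesis is scored by its average utility against the other samples used as pseudo-references.

from collections import Counter

def token_f1(hyp, ref):
    """Toy utility: token-level F1 between two strings (stand-in for a neural metric)."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    overlap = sum((h & r).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / sum(h.values()), overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def mbr_decode(samples, utility=token_f1):
    """Pick the sample with the highest average utility against all other samples."""
    best, best_score = None, float("-inf")
    for hyp in samples:
        score = sum(utility(hyp, ref) for ref in samples if ref is not hyp) / (len(samples) - 1)
        if score > best_score:
            best, best_score = hyp, score
    return best, best_score

# Unbiased samples from a (hypothetical) translation model for one source sentence.
samples = [
    "the cat sits on the mat",
    "the cat is sitting on the mat",
    "a cat sat on a mat",
    "on the mat sits the cat",
]
print(mbr_decode(samples))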
Human evaluation of modern high-quality machine translation systems is a difficult problem, and there is
increasing evidence that inadequate evaluation procedures can lead to erroneous conclusions. While there has been
considerable research on human evaluation, the field still lacks a commonly-accepted standard procedure. As a step
toward this goal, we propose an evaluation methodology grounded in explicit error analysis, based on the
Multidimensional Quality Metrics (MQM) framework. We carry out the largest MQM research study to date, scoring the
outputs of top systems from the WMT 2020 shared task in two language pairs using annotations provided by
professional translators with access to full document context. We analyze the resulting data extensively, finding
among other results a substantially different ranking of evaluated systems from the one established by the WMT
crowd workers, exhibiting a clear preference for human over machine output. Surprisingly, we also find that
automatic metrics based on pre-trained embeddings can outperform human crowd workers. We make our corpus publicly
available for further research.
@article{freitag2021experts,
title={Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation},
author={Markus Freitag and George Foster and David Grangier and Viresh Ratnakar and Qijun Tan and Wolfgang Macherey},
journal={Transactions of the Association for Computational Linguistics (TACL)},
year={2021},
}
We introduce DIVE, an end-to-end speaker diarization algorithm. Our neural algorithm presents the diarization task
as an iterative process: it repeatedly builds a representation for each speaker before predicting the voice
activity of each speaker conditioned on the extracted representations. This strategy intrinsically resolves the
speaker ordering ambiguity without requiring the classical permutation invariant training loss. In contrast with
prior work, our model does not rely on pretrained speaker representations and optimizes all parameters of the
system with a multi-speaker voice activity loss. Importantly, our loss explicitly excludes unreliable speaker turn
boundaries from training, which is adapted to the standard collar-based Diarization Error Rate (DER) evaluation.
Overall, these contributions yield a system redefining the state-of-the-art on the standard CALLHOME benchmark,
with 6.7% DER compared to 7.8% for the best alternative.
@inproceedings{zeghidour-teboul-grangier-dive-2021,
title={DIVE: End-to-end Speech Diarization via Iterative Speaker Embedding},
author={Neil Zeghidour and Olivier Teboul and David Grangier},
booktitle = {{IEEE} Automatic Speech Recognition and Understanding Workshop ({ASRU})},
year={2021},
}
While deep learning has been very beneficial in data-rich settings, tasks with smaller training sets
often resort to pre-training or multitask learning to leverage data from other tasks. In this case, careful
consideration is needed to select tasks and model parameterizations such that updates from the auxiliary tasks
actually help the primary task. We seek to alleviate this burden by formulating a model-agnostic framework that
performs fine-grained manipulation of the auxiliary task gradients. We propose to decompose auxiliary updates into
directions which help, damage or leave the primary task loss unchanged. This allows weighting the update
directions
differently depending on their impact on the problem of interest. We present a novel and efficient algorithm for
that purpose and show its advantage in practice. Our method leverages efficient automatic differentiation
procedures and randomized singular value decomposition for scalability. We show that our framework is generic and
encompasses some prior work as particular cases. Our approach consistently outperforms strong and widely used
baselines when leveraging out-of-distribution data for Text and Image classification tasks.
@inproceedings{ldery-aux-taks-iclr21,
title={Auxiliary Task Update Decomposition: The Good, The Bad and The Neutral},
author={Lucio Dery and Yann Dauphin and David Grangier},
booktitle={International Conference on Learning Representations (ICLR)},
year={2021},
}
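A minimal numpy sketch of decomposing an auxiliary gradient with respect to the primary gradient. This uses a plain vector projection for a single parameter vector; the paper's algorithm relies on automatic differentiation and randomized SVD to scale to full models. The component that helps the primary loss, the component that damages it, and the neutral orthogonal component can then be reweighted separately.

import numpy as np

def decompose_aux_update(g_aux, g_primary, w_help=1.0, w_damage=0.0, w_neutral=0.5):
    """Split the auxiliary gradient into helpful / damaging / neutral parts
    relative to the primary-task gradient, then reweight them."""
    direction = g_primary / (np.linalg.norm(g_primary) + 1e-12)
    parallel = (g_aux @ direction) * direction    # component along the primary gradient
    orthogonal = g_aux - parallel                 # leaves the primary loss unchanged (to first order)
    if g_aux @ direction >= 0:                    # descending on g_aux also lowers the primary loss
        helpful, damaging = parallel, np.zeros_like(parallel)
    else:
        helpful, damaging = np.zeros_like(parallel), parallel
    return w_help * helpful + w_damage * damaging + w_neutral * orthogonal

rng = np.random.default_rng(0)
g_primary = rng.normal(size=10)
g_aux = rng.normal(size=10)
update = decompose_aux_update(g_aux, g_primary)
print(update @ g_primary / (np.linalg.norm(update) * np.linalg.norm(g_primary)))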
We introduce Wavesplit, an end-to-end source separation system. From a single mixture, the model infers a
representation for each source and then estimates each source signal given the inferred representations. The model
is trained to jointly perform both tasks from the raw waveform. Wavesplit infers a set of source representations
via clustering, which addresses the fundamental permutation problem of separation. For speech separation, our
sequence-wide speaker representations provide a more robust separation of long, challenging recordings compared to
prior work. Wavesplit redefines the state-of-the-art on clean mixtures of 2 or 3 speakers (WSJ0-2/3mix), as well
as in noisy and reverberated settings (WHAM/WHAMR). We also set a new benchmark on the recent LibriMix dataset.
Finally, we show that Wavesplit is also applicable to other domains, by separating fetal and maternal heart rates
from a single abdominal electrocardiogram.
@article{zeghidour-grangier-wavesplit-2021,
title={Wavesplit: End-to-End Speech Separation by Speaker Clustering},
author={Neil Zeghidour and David Grangier},
journal = {{IEEE}/{ACM} Transactions on Audio, Speech, and Language Processing (TASLP)},
year={2021},
}
We introduce COLA, a self-supervised pre-training approach for learning a general-purpose representation of audio.
Our approach is based on contrastive learning: it learns a representation which assigns high similarity to audio
segments extracted from the same recording while assigning lower similarity to segments from different recordings.
We build on top of recent advances in contrastive learning for computer vision and reinforcement learning to
design a lightweight, easy-to-implement self-supervised model of audio. We pre-train embeddings on the large-scale
Audioset database and transfer these representations to 9 diverse classification tasks, including speech, music,
animal sounds, and acoustic scenes. We show that despite its simplicity, our method significantly outperforms
previous self-supervised systems. We furthermore conduct ablation studies to identify key design choices and
release a library to pre-train and fine-tune COLA models.
@inproceedings{saeeds-contrastive-audio-2021,
title={Contrastive Learning of General-Purpose Audio Representations},
author={Aaqib Saeed and David Grangier and Neil Zeghidour},
booktitle={International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
year={2021},
}
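A compact numpy sketch of the contrastive objective described above: two segments from the same recording form a positive pair, segments from other recordings in the batch serve as negatives, and a softmax cross-entropy pulls each anchor towards its positive. The encoder is a random linear map here purely for illustration; the actual model and similarity function are described in the paper.

import numpy as np

def contrastive_loss(anchors, positives, temperature=0.1):
    """InfoNCE-style loss: anchors[i] should be most similar to positives[i]."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                  # (batch, batch) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))             # positive pairs lie on the diagonal

rng = np.random.default_rng(0)
batch, feat_dim, emb_dim = 8, 64, 16
encoder = rng.normal(size=(feat_dim, emb_dim)) / np.sqrt(feat_dim)  # stand-in audio encoder
recordings = rng.normal(size=(batch, feat_dim))
segment_a = recordings + 0.1 * rng.normal(size=(batch, feat_dim))   # two segments per recording
segment_b = recordings + 0.1 * rng.normal(size=(batch, feat_dim))
print(contrastive_loss(segment_a @ encoder, segment_b @ encoder))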
We propose CHARM, a method for training a single neural network across inconsistent input channels. Our work is
motivated by Electroencephalography (EEG), where data collection protocols from different headsets result in
varying channel ordering and number, which limits the feasibility of transferring trained systems across datasets.
Our approach builds upon attention mechanisms to estimate a latent reordering matrix from each input signal and
map input channels to a canonical order. CHARM is differentiable and can be composed further with architectures
expecting a consistent channel ordering to build end-to-end trainable classifiers. We perform experiments on four
EEG classification datasets and demonstrate the efficacy of CHARM via simulated shuffling and masking of input
channels. Moreover, our method improves the transfer of pre-trained representations between datasets collected
with different protocols.
@inproceedings{saeeds-eeg-reordering-2021,
title={Learning from Heterogeneous EEG Signals with Differentiable Channel Reordering},
author={Aaqib Saeed and David Grangier and Olivier Pietquin and Neil Zeghidour},
booktitle={International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
year={2021},
}
Domain adaptation of neural networks commonly relies on three training phases: pretraining, selected data training
and then fine tuning. Data selection improves target domain generalization by training further on pretraining data
identified by relying on a small sample of target domain data. This work examines the benefit of data selection
for language modeling and machine translation. Our experiments assess the complementarity of selection with fine
tuning and result in practical recommendations: (i) selected data must be similar to the fine-tuning domain but
not so much as to erode the complementary effect of fine-tuning; (ii) there is a trade-off between selecting
little data for fast but limited progress or much data for slow but long lasting progress; (iii) data selection
can be applied early during pretraining, with performance gains comparable to a long pretraining session; (iv) data
selection from domain classifiers is often more effective than the popular contrastive data selection method.
@misc{iter2021complementarity,
title={On the Complementarity of Data Selection and Fine Tuning for Domain Adaptation},
author={Dan Iter and David Grangier},
year={2021},
eprint={2109.07591},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Whereas existing literature on unsupervised machine translation (MT) focuses on exploiting unsupervised techniques
for low-resource language pairs where bilingual training data is scarce or unavailable, we investigate whether
unsupervised MT can also improve translation quality of high-resource language pairs where sufficient bitext does
exist. We compare the style of correct translations generated by either supervised or unsupervised MT and find
that the unsupervised output is less monotonic and more natural than supervised output. We demonstrate a way to
combine the benefits of unsupervised and supervised MT into a single system, resulting in better human evaluation
of quality and fluency. Our results open the door to discussions about the potential contributions of unsupervised
MT in high-resource settings, and how supervised and unsupervised systems might be mutually-beneficial.
@misc{marchisio2021unsupervised,
title={What Can Unsupervised Machine Translation Contribute to High-Resource Language Pairs?},
author={Kelly Marchisio and Markus Freitag and David Grangier},
year={2021},
eprint={2106.15818},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Self-attention has recently been adopted for a wide range of sequence modeling problems. Despite its
effectiveness, self-attention suffers from quadratic compute and memory requirements with respect to sequence
length. Successful approaches to reduce this complexity focused on attending to local sliding windows or a small
set of locations independent of content. Our work proposes to learn dynamic sparse attention patterns that avoid
allocating computation and memory to attend to content unrelated to the query of interest. This work builds upon
two lines of research: it combines the modeling flexibility of prior work on content-based sparse attention with
the efficiency gains from approaches based on local, temporal sparse attention. Our model, the Routing
Transformer, endows self-attention with a sparse routing module based on online k-means while reducing the overall
complexity of attention to O(n^1.5 d) from O(n^2 d) for sequence length n and hidden dimension d. We show that our
model outperforms comparable sparse attention models on language modeling on Wikitext-103 (15.8 vs 18.3
perplexity) as well as on image generation on ImageNet-64 (3.43 vs 3.44 bits/dim) while using fewer self-attention
layers.
@article{aurkoroy-routing-transformer-2020,
title={Efficient Content-Based Sparse Attention with Routing Transformers},
author={Aurko Roy and Mohammad Saffar and Ashish Vaswani and David Grangier},
journal={Transactions of the Association for Computational Linguistics (TACL)},
year={2020},
}
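A toy numpy sketch of content-based clustered attention in the spirit of the paper above: queries and keys are assigned to k-means clusters and attention is computed only within each cluster. This is only a sketch of the routing idea; the Routing Transformer's actual online k-means, balanced cluster sizes and causal masking are described in the paper.

import numpy as np

def clustered_attention(Q, K, V, num_clusters=4, iters=5):
    """Route queries and keys to k-means clusters and attend only within clusters."""
    rng = np.random.default_rng(0)
    X = np.concatenate([Q, K], axis=0)
    centroids = X[rng.choice(len(X), num_clusters, replace=False)]
    for _ in range(iters):                       # plain k-means over queries and keys
        assign = np.argmin(((X[:, None] - centroids) ** 2).sum(-1), axis=1)
        for c in range(num_clusters):
            if np.any(assign == c):
                centroids[c] = X[assign == c].mean(axis=0)
    q_assign, k_assign = assign[: len(Q)], assign[len(Q):]
    out = np.zeros_like(Q)
    for c in range(num_clusters):
        qi, ki = np.where(q_assign == c)[0], np.where(k_assign == c)[0]
        if len(qi) == 0 or len(ki) == 0:
            continue
        scores = Q[qi] @ K[ki].T / np.sqrt(Q.shape[1])
        scores -= scores.max(axis=1, keepdims=True)
        attn = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
        out[qi] = attn @ V[ki]                   # each query attends only to keys in its cluster
    return out

rng = np.random.default_rng(1)
n, d = 64, 8
Q, K, V = rng.normal(size=(3, n, d))
print(clustered_attention(Q, K, V).shape)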
Automatic evaluation comparing candidate translations to human-generated paraphrases of reference translations has
recently been proposed by Freitag et al. (2020). When used in place of original references, the paraphrased
versions produce metric scores that correlate better with human judgment. This effect holds for a variety of
different automatic metrics, and tends to favor natural formulations over more literal (translationese) ones. In
this paper we compare the results of performing end-to-end system development using standard and paraphrased
references. With state-of-the-art English-German NMT components, we show that tuning to paraphrased references
produces a system that is significantly better according to human judgment, but 5 BLEU points worse when tested on
standard references. Our work confirms the finding that paraphrased references yield metric scores that correlate
better with human judgment, and demonstrates for the first time that using these scores for system development can
lead to significant improvements.
@inproceedings{freitag-paraphrase-mert-2020,
title={Human-Paraphrased References Improve Neural Machine Translation},
author={Markus Freitag and George Foster and David Grangier and Colin Cherry},
booktitle={Conference on Machine Translation (WMT)},
year={2020},
}
The quality of automatic metrics for machine translation has been increasingly called into question, especially
for high-quality systems. This paper demonstrates that, while choice of metric is important, the nature of the
references is also critical. We study different methods to collect references and compare their value in automated
evaluation by reporting correlation with human evaluation for a variety of systems and metrics. Motivated by the
finding that typical references exhibit poor diversity, concentrating around translationese language, we develop a
paraphrasing task for linguists to perform on existing reference translations, which counteracts this bias. Our
method yields higher correlation with human judgment not only for the submissions of WMT 2019 English to German,
but also for Back-translation and APE augmented MT output, which have been shown to have low correlation with
automatic metrics using standard references. We demonstrate that our methodology improves correlation with all
modern evaluation metrics we look at, including embedding-based methods. To complete this picture, we reveal that
multi-reference BLEU does not improve the correlation for high quality output, and present an alternative
multi-reference formulation that is more effective.
@inproceedings{freitag-bleu-paraphrase-references-2020,
title={BLEU might be Guilty but References are not Innocent},
author={Markus Freitag and David Grangier and Isaac Caswell},
booktitle={Conference on Empirical Methods in Natural Language Processing (EMNLP)},
year={2020},
}
Machine translation has an undesirable propensity to produce "translationese" artifacts, which can lead to higher
BLEU scores while being liked less by human raters. Motivated by this, we model translationese and original (i.e.
natural) text as separate languages in a multilingual model, and pose the question: can we perform zero-shot
translation between original source text and original target text? There is no data with original source and
original target, so we train sentence-level classifiers to distinguish translationese from original target text,
and use this classifier to tag the training data for an NMT model. Using this technique we bias the model to
produce more natural outputs at test time, yielding gains in human evaluation scores on both accuracy and fluency.
Additionally, we demonstrate that it is possible to bias the model to produce translationese and game the BLEU
score, increasing it while decreasing human-rated quality. We analyze these models using metrics to measure the
degree of translationese in the output, and present an analysis of the capriciousness of heuristically-based
train-data tagging.
@inproceedings{riley-translationese-2020,
title={Translationese as a Language in "Multilingual" {NMT}},
author={Parker Riley and Isaac Caswell and Markus Freitag and David Grangier},
booktitle = {Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
year = {2020},
}
This work proposes a sentence-level language model which predicts the next sentence in a story given the
embeddings of the previous sentences. The model operates at the sentence-level and selects the next sentence
within a finite set of fluent alternatives. By working with sentence embeddings instead of word embeddings, our
model is able to efficiently consider a large number of alternative sentences. By considering only fluent
sentences, our model is relieved from modeling fluency and can focus on longer range dependencies. Our method
achieves state-of-the-art accuracy on the StoryCloze task in the unsupervised setting.
@inproceedings{ippolito-storyline-2020,
title={Towards Better Storylines with Sentence-Level Language Models},
author={Daphne Ippolito and David Grangier and Douglas Eck and Chris Callison-Burch},
booktitle={Annual Meeting of the Association for Computational Linguistics (ACL)},
year={2020},
}
We introduce the first large-scale corpus for long form question answering, a task requiring elaborate and
in-depth answers to open-ended questions. The dataset comprises 270K threads from the Reddit forum “Explain Like
I’m Five” (ELI5) where an online community provides answers to questions which are comprehensible by five year
olds. Compared to existing datasets, ELI5 comprises diverse questions requiring multi-sentence answers. We provide
a large set of web documents to help answer the question. Automatic and human evaluations show that an abstractive
model trained with a multi-task objective outperforms conventional Seq2Seq, language modeling, as well as a strong
extractive baseline. However, our best model is still far from human performance since raters prefer gold responses
in over 86% of cases, leaving ample opportunity for future improvement.
@inproceedings{fan-etal-2019-eli5,
title = "{ELI}5: Long Form Question Answering",
author = "Fan, Angela and Jernite, Yacine and Perez, Ethan and Grangier, David and Weston, Jason and Auli, Michael",
booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
year = "2019",
url = "https://www.aclweb.org/anthology/P19-1346"
}
Recent work in Neural Machine Translation (NMT) has shown significant quality gains from noised-beam decoding
during back-translation, a method to generate synthetic parallel data. We show that the main role of such
synthetic noise is not to diversify the source side, as previously suggested, but simply to indicate to the model
that the given source is synthetic. We propose a simpler alternative to noising techniques, consisting of tagging
back-translated source sentences with an extra token. Our results on WMT outperform noised back-translation in
English-Romanian and match performance on English-German, re-defining state-of-the-art in the former.
@inproceedings{caswell:tagbt:2019,
author = {Isaac Caswell and Ciprian Chelba and David Grangier},
title = {Tagged Back-Translation},
year = {2019},
booktitle = {Conference on Machine Translation (WMT)},
}
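The proposed tagging scheme is simple enough to show in a couple of lines of Python; the tag token name below is an arbitrary placeholder.

BT_TAG = "<BT>"  # reserved token marking synthetic sources

def tag_back_translated(source_sentences):
    """Prepend a tag so the model can tell synthetic sources from genuine bitext."""
    return [f"{BT_TAG} {s}" for s in source_sentences]

genuine = ["ein kleiner Hund läuft im Park"]
synthetic = tag_back_translated(["der Hund rennt durch den Park"])  # from back-translation
print(genuine + synthetic)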
Paraphrasing exemplifies the ability to abstract semantic content from surface forms. Recent work on automatic
paraphrasing is dominated by methods leveraging Machine Translation (MT) as an intermediate step. This contrasts
with humans, who can paraphrase without being bilingual. This work proposes to learn paraphrasing models from an
unlabeled monolingual corpus only. To that end, we propose a residual variant of vector-quantized variational
auto-encoder. We compare with MT-based approaches on paraphrase identification, generation, and training
augmentation. Monolingual paraphrasing outperforms unsupervised translation in all settings. Comparisons with
supervised translation are more mixed: monolingual paraphrasing is interesting for identification and
augmentation; supervised translation is superior for generation.
@inproceedings{aurkoroy:paraphrase:2019,
author = {Aurko Roy and David Grangier},
title = {Unsupervised Paraphrasing without Translation},
year = {2019},
booktitle = {Conference of the Association for Computational Linguistics (ACL)},
}
fairseq is an open-source sequence modeling toolkit that allows researchers and developers to train custom models
for translation, summarization, language modeling, and other text generation tasks. The toolkit is based on
PyTorch and supports distributed training across multiple GPUs and machines. We also support fast mixed-precision
training and inference on modern GPUs.
@inproceedings{ottedunov:fairseq:2019,
author = {Myle Ott and Sergey Edunov and Alexei Baevski and Angela Fan and Sam Gross and Nathan Ng and David Grangier and Michael Auli},
title = {fairseq: A Fast, Extensible Toolkit for Sequence Modeling},
year = {2019},
booktitle = {Demo of Conference of the North American Chapter of the Association for Computational Linguistics (NAACL Demo)},
}
Story infilling involves predicting words to go into a missing span from a story. This challenging task has the
potential to transform interactive tools for creative writing. However, state-of-the-art conditional language
models have trouble balancing fluency and coherence with novelty and diversity. We address this limitation with a
hierarchical model which first selects a set of rare words and then generates text conditioned on that set. By
relegating the high entropy task of picking rare words to a word-sampling model, the second-stage model
conditioned on those words can achieve high fluency and coherence by searching for likely sentences, without
sacrificing diversity.
@inproceedings{ippolito-etal-2019-unsupervised,
title = "Unsupervised Hierarchical Story Infilling",
author = "Ippolito, Daphne and Grangier, David and Callison-Burch, Chris and Eck, Douglas",
booktitle = "Proceedings of the First Workshop on Narrative Understanding @ NAACL",
year = "2019",
url = "https://www.aclweb.org/anthology/W19-2405"}
In this work, we demonstrate that 3D poses in video can be effectively estimated with a fully convolutional model
based on dilated temporal convolutions over 2D keypoints. We also introduce back-projection, a simple and
effective semi-supervised training method that leverages unlabeled video data. We start with predicted 2D
keypoints for unlabeled video, then estimate 3D poses and finally back-project to the input 2D keypoints. In the
supervised setting, our fully-convolutional model outperforms the previous best result from the literature by 6 mm
mean per-joint position error on Human3.6M, corresponding to an error reduction of 11%, and the model also shows
significant improvements on HumanEva-I. Moreover, experiments with back-projection show that it comfortably
outperforms previous state-of-the-art results in semi-supervised settings where labeled data is scarce. Code and
models are available.
@inproceedings{pavllo:3dpose:2018,
author = {Dario Pavllo and Christoph Feichtenhofer and David Grangier and Michael Auli},
title = {3D human pose estimation in video with temporal convolutions and semi-supervised training},
year = {2019},
booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
}
An effective method to improve neural machine translation with monolingual data is to augment the parallel
training corpus with back-translations of target language sentences.
This work broadens the understanding of back-translation and investigates a number of methods to generate
synthetic source sentences.
We find that in all but resource poor settings back-translations obtained via sampling or noised beam outputs are
most effective.
Our analysis shows that sampling or noisy synthetic data gives a much stronger training signal than data generated
by beam or greedy search.
We also compare how synthetic data compares to genuine bitext and study various domain effects.
Finally, we scale to hundreds of millions of monolingual sentences and achieve a new state of the art of 35 BLEU
on the WMT'14 English-German test set.
@inproceedings{edunov:backtranslation:2018,
author = {Sergey Edunov and Myle Ott and Michael Auli and David Grangier},
title = {Understanding Back-Translation at Scale},
year = {2018},
booktitle = {Conference on Empirical Methods in Natural Language Processing ({EMNLP})},
}
Sequence to sequence learning models still require several days to reach state of the art performance on large
benchmark datasets using a single machine. This paper shows that reduced precision and large batch training can
speedup training by nearly 5x on a single 8-GPU machine with careful tuning and implementation. On WMT'14
English-German translation, we match the accuracy of Vaswani et al. (2017) in under 5 hours when training on 8 GPUs
and we obtain a new state of the art of 29.3 BLEU after training for 91 minutes on 128 GPUs. We further improve
these results to 29.8 BLEU by training on the much larger Paracrawl dataset.
@inproceedings{ott:scaling:2018,
author = {Myle Ott and Sergey Edunov and David Grangier and Michael Auli},
title = {Scaling Neural Machine Translation},
year = {2018},
booktitle = {Workshop on Machine Translation ({WMT@EMNLP})},
}
Deep learning for predicting or generating 3D human pose sequences is an active research area. Previous work
regresses either joint rotations or joint positions. The former strategy is prone to error accumulation along the
kinematic chain, as well as discontinuities when using Euler angle or exponential map parameterizations. The
latter requires re-projection onto skeleton constraints to avoid bone stretching and invalid configurations. This
work addresses both limitations. Our recurrent network, QuaterNet, represents rotations with quaternions and our
loss function performs forward kinematics on a skeleton to penalize absolute position errors instead of angle
errors. On short-term predictions, QuaterNet improves the state-of-the-art quantitatively. For long-term
generation, our approach is qualitatively judged as realistic as recent neural strategies from the graphics
literature.
@inproceedings{pavllo:quaternet:2018,
author = {Dario Pavllo and David Grangier and Michael Auli},
title = {QuaterNet: A Quaternion-based Recurrent Model for Human Motion},
year = {2018},
booktitle = {British Machine Vision Conference (BMVC)},
}
Machine translation is a popular test bed for research in neural sequence-to-sequence models but despite much
recent research, there is still a lack of understanding of these models. Practitioners report performance
degradation with large beams, the under-estimation of rare words and a lack of diversity in the final
translations. Our study relates some of these issues to the inherent uncertainty of the task, due to the existence
of multiple valid translations for a single source sentence, and to the extrinsic uncertainty caused by noisy
training data. We propose tools and metrics to assess how uncertainty in the data is captured by the model
distribution and how it affects search strategies that generate translations. Our results show that search works
remarkably well but that the models tend to spread too much probability mass over the hypothesis space. Next, we
propose tools to assess model calibration and show how to easily fix some shortcomings of current models. We
release both code and multiple human reference translations for two popular benchmarks.
@inproceedings{ott:uncertainty:2018,
author = {Myle Ott and Michael Auli and David Grangier and Marc’Aurelio Ranzato},
title = {Analyzing Uncertainty in Neural Machine Translation},
year = {2018},
booktitle = {International Conference on Machine Learning {(ICML)}},
}
Current models for document summarization ignore user preferences such as the desired length, style or entities
that the user has a preference for. We present a neural summarization model that enables users to specify such
high level attributes in order to control the shape of the final summaries to better suit their needs. With user
input, we show that our system can produce high quality summaries that are true to user preference. Without user
input, we can set the control variables automatically and outperform comparable state of the art summarization
systems despite the relative simplicity of our model.
@inproceedings{fan:summarization:2018,
author = {Angela Fan and David Grangier and Michael Auli},
title = {Controllable Abstractive Summarization},
year = {2018},
booktitle = {ACL Workshop on Neural Machine Translation and Generation (NMT@ACL)},
}
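A tiny Python sketch of the control mechanism described above: high-level attributes such as the desired length bucket or entities of interest are exposed as special prefix tokens prepended to the source document. The token names below are invented for illustration.

def build_controlled_input(document, length_bucket=None, entities=()):
    """Prepend control tokens encoding user preferences to the source document."""
    controls = []
    if length_bucket is not None:
        controls.append(f"<len_{length_bucket}>")       # e.g. <len_short>, <len_long>
    controls.extend(f"<ent_{e}>" for e in entities)     # entities the summary should keep
    return " ".join(controls + [document])

doc = "The central bank raised interest rates by 25 basis points on Tuesday ..."
print(build_controlled_input(doc, length_bucket="short", entities=["central_bank"]))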
There has been much recent work on training neural attention models at the sequence-level using either
reinforcement learning-style methods or by optimizing the beam. In this paper, we survey a range of classical
objective functions that have been widely used to train linear models for structured prediction and apply them to
neural sequence to sequence models. Our experiments show that these losses can perform surprisingly well by
slightly outperforming beam search optimization in a like-for-like setup. We also report new state-of-the-art
results on both IWSLT 2014 German-English translation as well as Gigaword abstractive summarization.
@inproceedings{edunovott:structured:2018,
author = {Sergey Edunov and Myle Ott and Michael Auli and David Grangier and Marc’Aurelio Ranzato},
title = {Classical Structured Prediction Losses for Sequence to Sequence Learning},
year = {2018},
booktitle = {Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)},
}
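As an illustration of the family of objectives surveyed, here is a hedged sketch of two sequence-level losses computed over an n-best list of candidate outputs; candidate generation and the task cost (e.g. 1 - sentence BLEU) are assumed to come from the surrounding training loop, and this is not the paper's exact implementation.
import torch
import torch.nn.functional as F

def expected_risk(cand_scores, cand_costs):
    # cand_scores: (batch, n_best) model log-scores for candidate sequences.
    # cand_costs: (batch, n_best) task costs, e.g. 1 - sentence-level BLEU.
    # Renormalize the model over the candidate set, then minimize expected cost.
    probs = F.softmax(cand_scores, dim=-1)
    return (probs * cand_costs).sum(dim=-1).mean()

def sequence_margin_loss(cand_scores, cand_costs):
    # Structured hinge restricted to the candidate set: the lowest-cost
    # candidate (pseudo-reference) should beat every cost-augmented competitor.
    target = cand_costs.argmin(dim=-1)
    gold_score = cand_scores.gather(1, target.unsqueeze(1))
    gold_cost = cand_costs.gather(1, target.unsqueeze(1))
    violation = (cand_scores + cand_costs) - (gold_score + gold_cost)
    return violation.max(dim=-1).values.clamp(min=0).mean()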
We propose a framework for computer-assisted text editing. It applies to translation post-editing and to
paraphrasing and relies on very simple interactions: a human editor modifies a sentence by marking tokens they
would like the system to change. Our model then generates a new sentence which reformulates the initial sentence
by avoiding the words from the marked tokens. Our approach builds upon neural sequence-to-sequence modeling and
introduces a neural network which takes as input a sentence along with deleted token markers. Our model is trained
on translation bitext by simulating post-edits. Our results on post-editing for machine translation and
paraphrasing evaluate the performance of our approach. We show +11.4 BLEU with limited post-editing effort on the
WMT-14 English-German translation task (25.2 to 36.6), which represents +5.9 BLEU over the post-editing baseline
(30.7 to 36.6).
@inproceedings{grangier:quickedit:2018,
author = {David Grangier and Michael Auli},
title = {QuickEdit: Editing Text and Translations by Crossing Words Out},
year = {2018},
booktitle = {Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)},
}
The prevalent approach to sequence to sequence learning maps an input sequence to a variable length output
sequence via recurrent neural networks. We introduce an architecture based entirely on convolutional neural
networks. Compared to recurrent models, computations over all elements can be fully parallelized during training
and optimization is easier since the number of non-linearities is fixed and independent of the input length. Our
use of gated linear units eases gradient propagation and we equip each decoder layer with a separate attention
module. We outperform the accuracy of the deep LSTM setup of Wu et al. (2016) on both WMT'14 English-German and
WMT'14 English-French translation at an order of magnitude faster speed, both on GPU and CPU.
@inproceedings{gehring:convs2s:2017,
author = {Jonas Gehring and Michael Auli and David Grangier and Denis Yarats and Yann N. Dauphin},
title = {Convolutional Sequence to Sequence Learning},
year = {2017},
booktitle = {International Conference on Machine Learning {(ICML)}}
}
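A simplified sketch of the per-decoder-layer attention module mentioned above, reduced to its dot-product core; the combination of decoder states with target embeddings and the convolutional blocks themselves are omitted, so treat this as an illustration rather than the paper's implementation.
import torch
import torch.nn.functional as F

def decoder_layer_attention(decoder_state, encoder_out, encoder_emb):
    # decoder_state: (batch, tgt_len, dim) output of one decoder conv layer
    # encoder_out: (batch, src_len, dim) top encoder states (used as keys)
    # encoder_emb: (batch, src_len, dim) encoder states plus input embeddings (values)
    scores = torch.bmm(decoder_state, encoder_out.transpose(1, 2))  # (batch, tgt, src)
    attn = F.softmax(scores, dim=-1)
    context = torch.bmm(attn, encoder_emb)                          # (batch, tgt, dim)
    return decoder_state + context  # residual combination, one module per decoder layer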
The pre-dominant approach to language modeling to date is based on recurrent neural networks. In this paper we
present a convolutional approach to language modeling. We introduce a novel gating mechanism that eases gradient
propagation and which performs better than the LSTM-style gating of Oord et al. (2016b) despite being simpler. We
achieve a new state of the art on WikiText-103 as well as a new best single-GPU result on the Google Billion Word
benchmark. In settings where latency is important, our model achieves an order of magnitude speed-up compared to a
recurrent baseline since computation can be parallelized over time. To our knowledge, this is the first time a
non-recurrent approach outperforms strong recurrent models on these tasks.
@inproceedings{dauphin:gatedlm:2017,
author = {Yann N. Dauphin and Angela Fan and Michael Auli and David Grangier},
title = {Language Modeling with Gated Convolutional Networks},
year = {2017},
booktitle = {International Conference on Machine Learning {(ICML)}}
}
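A minimal sketch of the gating mechanism, h(X) = (X*W + b) ⊗ sigmoid(X*V + c), implemented here with a single causal convolution producing twice the channels and split by torch.nn.functional.glu; the hyperparameters are made up and the block is only an illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedConvBlock(nn.Module):
    # One causal convolutional layer with gated linear units: a single conv
    # produces 2x channels, which F.glu splits into a linear part and a gate.
    def __init__(self, channels, kernel_size):
        super().__init__()
        self.pad = kernel_size - 1  # left-pad so the model stays causal
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size)

    def forward(self, x):            # x: (batch, channels, time)
        h = self.conv(F.pad(x, (self.pad, 0)))
        return F.glu(h, dim=1) + x   # residual connection around the gated unit

# toy usage over embedded tokens
block = GatedConvBlock(channels=128, kernel_size=4)
tokens = torch.randn(2, 128, 50)     # (batch, channels, time)
out = block(tokens)                  # same shape as the input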
We propose an approximate strategy to efficiently train neural network based language models over very large
vocabularies. Our approach, called adaptive softmax, circumvents the linear dependency on the vocabulary size by
exploiting the unbalanced word distribution to form clusters that explicitly minimize the expectation of
computational complexity. Our approach further reduces the computational cost by exploiting the specificities of
modern architectures and matrix-matrix vector operations, making it particularly suited for graphical processing
units. Our experiments carried out on standard benchmarks, such as EuroParl and One Billion Word, show that our
approach brings a large gain in efficiency over standard approximations while achieving an accuracy close to that
of the full softmax.
@inproceedings{grave:softmax:2017,
author = {Edouard Grave and Armand Joulin and Moustapha Cisse and David Grangier and Herve Jegou},
title = {Efficient Softmax Approximation for GPUs},
year = {2017},
booktitle = {International Conference on Machine Learning {(ICML)}}
}
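PyTorch now ships an implementation of this idea as torch.nn.AdaptiveLogSoftmaxWithLoss; the usage sketch below uses made-up dimensions and cutoffs.
import torch
import torch.nn as nn

# Hypothetical sizes: 512-dimensional hidden states, 100k-word vocabulary.
# The cutoffs split the vocabulary into a small head of frequent words and
# progressively larger, lower-capacity tail clusters.
adaptive = nn.AdaptiveLogSoftmaxWithLoss(
    in_features=512, n_classes=100_000, cutoffs=[2_000, 10_000, 50_000])

hidden = torch.randn(32, 512)               # one hidden state per target word
targets = torch.randint(0, 100_000, (32,))  # gold word indices
output = adaptive(hidden, targets)
loss = output.loss                           # negative log-likelihood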
The prevalent approach to neural machine translation relies on bi-directional LSTMs to encode the source sentence.
In this paper we present a faster and conceptually simpler architecture based on a succession of convolutional
layers. This allows the entire source sentence to be encoded simultaneously, in contrast to recurrent networks for which
computation is constrained by temporal dependencies. We achieve a new state-of-the-art on WMT'16 English-Romanian
translation and outperform several recently published results on the WMT'15 English-German task. We also achieve
almost the same accuracy as a very deep LSTM setup on WMT'14 English-French translation. Our convolutional encoder
speeds up CPU decoding by more than two times at the same or higher accuracy as a strong bi-directional LSTM
baseline.
@inproceedings{gehring:convnmt:2017,
author = {Jonas Gehring and Michael Auli and David Grangier and Yann N. Dauphin},
title = {A Convolutional Encoder Model for Neural Machine Translation},
year = {2017},
booktitle = {Conference of the Association for Computational Linguistics ({ACL})}
}
Existing machine translation decoding algorithms generate translations in a strictly monotonic fashion and never
revisit previous decisions. As a result, earlier mistakes cannot be corrected at a later stage. In this paper, we
present a translation scheme that starts from an initial guess and then makes iterative improvements that may
revisit previous decisions. We parameterize our model as a convolutional neural network that predicts discrete
substitutions to an existing translation based on an attention mechanism over both the source sentence as well as
the current translation output. By making less than one modification per sentence, we improve the output of a
phrase-based translation system by up to 0.4 BLEU on WMT15 German-English translation.
@article{novak:refinement:2016,
author = {Roman Novak and Michael Auli and David Grangier},
title = {Iterative Refinement for Machine Translation},
year = {2016},
journal = {arXiv preprint}
}
Classical translation models constrain the space of possible outputs by selecting a subset of translation rules
based on the input sentence. Recent work on improving the efficiency of neural translation models adopted a
similar strategy by restricting the output vocabulary to a subset of likely candidates given the source. In this
paper we experiment with context and embedding-based selection methods and extend previous work by examining speed
and accuracy trade-offs in more detail. We show that decoding time on CPUs can be reduced by up to 90% and
training time by 25% on the WMT15 English-German and WMT16 English-Romanian tasks at the same or only negligible
change in accuracy. This brings the time to decode with a state of the art neural translation system to just over
140 msec per sentence on a single CPU core for English-German.
@article{lhostis:selectionmt:2016,
author = {Gurvan L'Hostis and David Grangier and Michael Auli},
title = {Vocabulary Selection Strategies for Neural Machine Translation},
year = {2016},
journal = {arXiv preprint}
}
This paper introduces a neural model for concept-to-text generation that scales to large, rich domains. We
experiment with a new dataset of biographies from Wikipedia that is an order of magnitude larger than existing
resources with over 700k samples. The dataset is also vastly more diverse with a 400k vocabulary, compared to a
few hundred words for Weathergov or Robocup. Our model builds upon recent work on conditional neural language
models for text generation. To deal with the large vocabulary, we extend these models to mix a fixed vocabulary
with copy actions that transfer sample-specific words from the input database to the generated output sentence.
Our neural model significantly outperforms a classical Kneser-Ney language model adapted to this task by nearly
15 BLEU.
@inproceedings{lebret:emnlp:2016,
author = {Remi Lebret and David Grangier and Michael Auli},
title = {Neural Text Generation from Structured Data with Application to the Biography Domain},
year = {2016},
booktitle = {Conference on Empirical Methods in Natural Language Processing (EMNLP)}
}
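As a generic illustration of mixing a fixed vocabulary with copy actions (a pointer-style sketch of my own, not the paper's exact parameterization), one can scatter a copy distribution over input positions back onto word ids and interpolate it with the generation distribution.
import torch
import torch.nn.functional as F

def mix_generate_and_copy(vocab_logits, copy_scores, src_token_ids, gate_logit):
    # vocab_logits: (batch, vocab_size) scores over the fixed vocabulary.
    # copy_scores: (batch, src_len) scores over positions of the input table/fields.
    # src_token_ids: (batch, src_len) word ids of the input tokens in the output vocabulary.
    # gate_logit: (batch, 1) controls how much mass goes to copying vs generating.
    p_gen = torch.sigmoid(gate_logit)
    p_vocab = F.softmax(vocab_logits, dim=-1)
    p_copy = F.softmax(copy_scores, dim=-1)
    # Scatter copy probabilities back onto word ids and mix the two distributions.
    mixed = p_gen * p_vocab
    mixed = mixed.scatter_add(1, src_token_ids, (1 - p_gen) * p_copy)
    return mixed  # (batch, vocab_size), a proper distribution over output words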
Training neural network language models over large vocabularies is still computationally very costly compared to
count-based models such as Kneser-Ney. At the same time, neural language models are gaining popularity for many
applications such as speech recognition and machine translation whose success depends on scalability. We present a
systematic comparison of strategies to represent and train large vocabularies, including softmax, hierarchical
softmax, target sampling, noise contrastive estimation and self normalization. We further extend self
normalization to be a proper estimator of likelihood and introduce an efficient variant of softmax. We evaluate
each method on three popular benchmarks, examining performance on rare words, the speed/accuracy trade-off and
complementarity to Kneser-Ney.
@inproceedings{chen:acl:2016,
author = {Wenlin Chen and David Grangier and Michael Auli},
title = {Strategies for Training Large Vocabulary Neural Language Models},
year = {2016},
booktitle = {Conference of the Association for Computational Linguistics (ACL)}
}
In text classification, dictionaries can be used to define human-comprehensible features. We propose an
improvement to dictionary features called smoothed dictionary features. These features recognize document contexts
instead of n-grams. We describe a principled methodology to solicit dictionary features from a teacher, and
present results showing that models built using these human-comprehensible features are competitive with models
trained with Bag of Words features.
@inproceedings{jandot:2016:whi,
author = {Camille Jandot and Patrice Simard and Max Chickering and David Grangier and Jina Suh},
title = {Interactive Semantic Featuring for Text Classification},
year = {2016},
booktitle = { ICML Workshop on Human Interpretability in Machine Learning (WHI)}
}
Conditional belief networks introduce stochastic binary variables in neural networks. Contrary to a classical
neural network, a belief network can predict more than the expected value of the output Y given the input X. It
can predict a distribution of outputs Y which is useful when an input can admit multiple outputs whose average is
not necessarily a valid answer. Such networks are particularly relevant to inverse problems such as image
prediction for denoising, or text to speech. However, traditional sigmoid belief networks are hard to train and
are not suited to continuous problems. This work introduces a new family of networks called linearizing belief
nets, or LBNs. An LBN decomposes into a deep linear network where each linear unit can be turned on or off by
non-deterministic binary latent units. It is a universal approximator of real-valued conditional distributions and
can be trained using gradient descent. Moreover, the linear pathways efficiently propagate continuous information
and they act as multiplicative skip-connections that help optimization by removing gradient diffusion. This yields
a model which trains efficiently and improves the state-of-the-art on image denoising and facial expression
generation with the Toronto faces dataset.
@inproceedings{dauphin:2016:iclr,
author = {Yann N. Dauphin and David Grangier},
title = {Predicting distributions with Linearizing Belief Networks},
year = {2016},
booktitle = {International Conference on Learning Representations (ICLR)}
}
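A toy sketch of one such layer under my own simplifications: a deterministic linear pathway multiplied elementwise by stochastic binary gates, with a straight-through estimator standing in for the paper's actual training procedure.
import torch
import torch.nn as nn

class LBNLayer(nn.Module):
    # One linearizing-belief-net-style layer (sketch): a linear pathway whose
    # units are switched on or off by stochastic binary gating variables.
    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.linear = nn.Linear(dim_in, dim_out)   # continuous pathway
        self.gater = nn.Linear(dim_in, dim_out)    # predicts gate probabilities

    def forward(self, x, sample=True):
        h = self.linear(x)
        p = torch.sigmoid(self.gater(x))
        if sample:
            z = torch.bernoulli(p)
            # straight-through style estimator so gradients reach the gater
            z = z + p - p.detach()
        else:
            z = p  # use expected gates at evaluation time
        return h * z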
Quick interaction between a human teacher and a learning machine presents numerous benefits and challenges when
working with web-scale data. The human teacher guides the machine towards accomplishing the task of interest. The
learning machine leverages big data to find examples that maximize the training value of its interaction with the
teacher. When the teacher is restricted to labeling examples selected by the machine, this problem is an instance
of active learning. When the teacher can provide additional information to the machine (e.g., suggestions on what
examples or predictive features should be used) as the learning task progresses, then the problem becomes one of
interactive learning. To accommodate the two-way communication channel needed for efficient interactive learning,
the teacher and the machine need an environment that supports an interaction language. The machine can access,
process, and summarize more examples than the teacher can see in a lifetime. Based on the machine's output, the
teacher can revise the definition of the task or make it more precise. Both the teacher and the machine
continuously learn and benefit from the interaction. We have built a platform to (1) produce valuable and
deployable models and (2) support research on both the machine learning and user interface challenges of the
interactive learning problem. The platform relies on a dedicated, low-latency, distributed, in-memory architecture
that allows us to construct web-scale learning machines with quick interaction speed. The purpose of this paper is
to describe this architecture and demonstrate how it supports our research efforts. Preliminary results are
presented as illustrations of the architecture but are not the primary focus of the paper.
@article{simard:2014:ice,
author = {Patrice Y. Simard and
David Maxwell Chickering and
Aparna Lakshmiratan and
Denis Xavier Charles and
L{\'{e}}on Bottou and
Carlos Garcia Jurado Suarez and
David Grangier and
Saleema Amershi and
Johan Verwey and
Jina Suh},
title = {{ICE:} Enabling Non-Experts to Build Models Interactively for Large-Scale
Lopsided Problems},
journal = {CoRR},
volume = {abs/1409.4814},
year = {2014},
url = {http://arxiv.org/abs/1409.4814},
}
With the rapid development of database technologies, multiple data sources may be available for a given learning
task (e.g. collaborative filtering). However, the data sources may contain different types of features. For example,
users' profiles can be used to build recommendation systems. In addition, a model can also use users' historical
behaviors and social networks to infer users' interests on related products. We argue that it is desirable to
collectively use any available multiple heterogeneous data sources in order to build effective learning models. We
call this framework heterogeneous learning. In our proposed setting, data sources can include (i) nonoverlapping
features, (ii) nonoverlapping instances, and (iii) multiple networks (i.e. graphs) that connect instances. In this
paper, we propose a general optimization framework for heterogeneous learning, and devise a corresponding learning
model from gradient boosting. The idea is to minimize the empirical loss with two constraints: (1) there should be
consensus among the predictions of overlapping instances (if any) from different data sources; (2) connected
instances in graph datasets may have similar predictions. The objective function is solved by stochastic gradient
boosting trees. Furthermore, a weighting strategy is designed to emphasize informative data sources, and
deemphasize the noisy ones. We formally prove that the proposed strategy leads to a tighter error bound. This
approach consistently outperforms a standard concatenation of data sources on movie rating prediction, number
recognition, and terrorist attack detection tasks. Furthermore, the approach is evaluated on AT&T's distributed
database with over 500,000 instances, 91 different data sources, and over 45,000,000 joined features. We observe
that the proposed model can improve out-of-sample error rate substantially.
@article{shi:2013:gbc,
title={GBC: gradient boosting consensus model for heterogeneous data},
author={Shi, Xiaoxiao and Paiement, Jean-Francois and Grangier, David and Yu, Philip S},
journal={Statistical Analysis and Data Mining},
year={2013},
publisher={Wiley Online Library}
}
Multiple data sources containing different types of features may be available for a given task. For instance,
users’ profiles can be used to build recommendation systems. In addition, a model can also use users’ historical
behaviors and social networks to infer users’ interests on related products. We argue that it is desirable to
collectively use any available multiple heterogeneous data sources in order to build effective learning models. We
call this framework heterogeneous learning. In our proposed setting, data sources can include (i) non-overlapping
features, (ii) non-overlapping instances, and (iii) multiple networks (i.e. graphs) that connect instances. In
this paper, we propose a general optimization framework for heterogeneous learning, and devise a corresponding
learning model from gradient boosting. The idea is to minimize the empirical loss with two constraints: (1) There
should be consensus among the predictions of overlapping instances (if any) from different data sources; (2)
Connected instances in graph datasets may have similar predictions. The objective function is solved by stochastic
gradient boosting trees. Furthermore, a weighting strategy is designed to emphasize informative data sources, and
deemphasize the noisy ones. We formally prove that the proposed strategy leads to a tighter error bound. This
approach consistently outperforms a standard concatenation of data sources on movie rating prediction, number
recognition and terrorist attack detection tasks. We observe that the proposed model can improve out-of-sample
error rate by as much as 80%.
@inproceedings{shi:2012:heterogeneous_gbdt_sdm,
author = "X. Shi and JF. Paiement and D. Grangier and P. Yu",
title = "Learning from Heterogeneous Sources via Gradient Boosting Consensus",
booktitle = "SIAM International Conference on Data Mining (SDM)",
year = "2012",
}
We present a new learning strategy for classification problems in which train and/or test data suffer from missing
features. In previous work, instances are represented as vectors from some feature space and one is forced to
impute missing values or to consider an instance-specific subspace. In contrast, our method considers instances as
sets of (feature,value) pairs which naturally handle the missing value case. Building onto this framework, we
propose a classification strategy for sets. Our proposal maps (feature,value) pairs into an embedding space and
then non-linearly combines the set of embedded vectors. The embedding and the combination parameters are learned
jointly on the final classification objective. This simple strategy allows great flexibility in encoding prior
knowledge about the features in the embedding step and yields advantageous results compared to alternative
solutions over several datasets.
@inproceedings{grangier:2010:missing_nips,
author = "D. Grangier and I. Melvin",
title = "Feature Set Embedding for Incomplete Data",
booktitle = "Advances in Neural Information Processing Systems (NIPS)",
year = "2010",
}
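A minimal sketch of the feature set embedding idea, assuming each observed (feature, value) pair has been mapped to an integer id; the combination step here is a simple masked mean followed by a nonlinearity, which is only one possible choice rather than the paper's exact model.
import torch
import torch.nn as nn

class FeatureSetEmbedding(nn.Module):
    # Represent an instance as a set of (feature, value) pairs, embed each pair,
    # combine the embeddings non-linearly, then classify. Missing features are
    # simply absent from the set, so no imputation is required.
    def __init__(self, num_pairs, dim, num_classes):
        super().__init__()
        self.embed = nn.Embedding(num_pairs, dim)   # one row per (feature, value) pair
        self.hidden = nn.Linear(dim, dim)
        self.classify = nn.Linear(dim, num_classes)

    def forward(self, pair_ids, mask):
        # pair_ids: (batch, max_pairs) ids of the observed (feature, value) pairs
        # mask: (batch, max_pairs) float tensor, 1 for observed pairs, 0 for padding
        e = self.embed(pair_ids) * mask.unsqueeze(-1)
        pooled = e.sum(dim=1) / mask.sum(dim=1, keepdim=True).clamp(min=1)
        return self.classify(torch.tanh(self.hidden(pooled)))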
Multi-class classification becomes challenging at test time when the number of classes is very large and testing
against every possible class can become computationally infeasible. This problem can be alleviated by imposing (or
learning) a structure over the set of classes. We propose an algorithm for learning a tree-structure of
classifiers which, by optimizing the overall tree loss, provides superior accuracy to existing tree labeling
methods. We also propose a method that learns to embed labels in a low dimensional space that is faster than
non-embedding approaches and has superior accuracy to existing embedding approaches. Finally we combine the two
ideas resulting in the label embedding tree that outperforms alternative methods including One-vs-Rest while being
orders of magnitude faster.
@inproceedings{weston:2010:label_trees_nips,
author = "J. Weston and S. Bengio and D. Grangier",
title = "Label Embedding Trees for Large Multi-Class Tasks",
booktitle = "Advances in Neural Information Processing Systems (NIPS)",
year = "2010",
}
We study the standard retrieval task of ranking a fixed set of items given a previously unseen query and pose it
as the half-transductive ranking problem. Transductive representations (where the vector representation of each
example is learned) allow the generation of highly nonlinear embeddings that capture the characteristics of object
relationships without relying on a specific choice of features, and require only relatively simple optimization.
Unfortunately, they have no direct out-of-sample extension. Inductive approaches on the other hand allow for the
representation of unknown queries. We describe algorithms for this setting which have the advantages of both
transductive and inductive approaches, and can be applied in unsupervised (either reconstruction-based or
graph-based) and supervised ranking setups. We show empirically that our methods give strong performance on all
three tasks.
@inproceedings{bai:2010:halftrans_aistats,
author = "B. Bai and J. Weston and D. Grangier and R. Collobert and C. Cortes and M. Mohri",
title = "Half Transductive Ranking",
booktitle = "Artificial Intelligence and Statistics (AISTATS)",
year = "2010",
}
We study the standard retrieval task of ranking a fixed set of documents given a previously unseen query and pose
it as the half-transductive ranking problem. The task is partly transductive as the document set is fixed.
Existing transductive approaches are natural non-linear methods for this set, but have no direct out-of-sample
extension. Functional approaches, on the other hand, can be applied to the unseen queries, but fail to exploit the
availability of the document set to its full extent. This work introduces a half-transductive approach that benefits
from the advantages of both transductive and functional approaches, and shows its empirical advantage in supervised
ranking setups.
@inproceedings{bai:2009:halftrans_nips,
author = "B. Bai and J. Weston and D. Grangier and R. Collobert and C. Cortes and M. Mohri",
title = "Ranking with Half Transductive Models",
booktitle = "NIPS Workshop on Advances in Ranking",
year = "2009",
}
We present a class of nonlinear (polynomial) models that are discriminatively trained to directly map from the
word content in a query-document or document-document pair to a ranking score. Dealing with polynomial models on
word features is computationally challenging. We propose a low-rank (but diagonal preserving) representation of
our polynomial models to induce feasible memory and computation requirements. We provide an empirical study on
retrieval tasks based on Wikipedia documents, where we obtain state-of-the-art performance while providing
realistically scalable methods.
@inproceedings{bai:2009:psi_nips,
author = "B. Bai and J. Weston and D. Grangier and R. Collobert and K. Sadamasa and Y. Qi and C. Cortes and M. Mohri",
title = "Polynomial Semantic Indexing",
booktitle = "Advances in Neural Information Processing Systems (NIPS)",
year = "2009",
}
In this article we present Supervised Semantic Indexing (SSI) which defines a class of nonlinear (quadratic)
models that are discriminatively trained to directly map from the word content in a query-document or
document-document pair to a ranking score. Like Latent Semantic Indexing (LSI), our models take account of
correlations between words (synonymy, polysemy). However, unlike LSI our models are trained from a supervised
signal directly on the ranking task of interest, which we argue is the reason for our superior results. As the
query and target texts are modeled separately, our approach is easily generalized to different retrieval tasks,
such as cross-language retrieval or online advertising placement. Dealing with models on all pairs of word
features is computationally challenging. We propose several improvements to our basic model for addressing this
issue, including low rank (but diagonal preserving) representations, correlated feature hashing (CFH) and
sparsification. We provide an empirical study of all these methods on retrieval tasks based on Wikipedia documents
as well as an Internet advertisement task. We obtain state-of-the-art performance while providing realistically
scalable methods.
@article{bai:2009:ssi_jir,
author = "B. Bai and J. Weston and D. Grangier and R. Collobert and Y. Qi and K. Sadamasa and O. Chapelle and K. Weinberger",
title = "Learning to Rank with (a Lot of) Word Features",
journal = "Information Retrieval -- Special Issue on Learning to Rank",
publisher = "Springer",
year = "2009",
}
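The low-rank, diagonal-preserving model can be scored without materializing the full word-by-word matrix, since q^T (U^T V + I) d = (U q) . (V d) + q . d; a small numpy sketch with made-up vocabulary size and rank follows.
import numpy as np

vocab, rank = 30_000, 100
rng = np.random.default_rng(0)
U = rng.normal(scale=0.01, size=(rank, vocab))
V = rng.normal(scale=0.01, size=(rank, vocab))

def ssi_score(q, d):
    # q, d: bag-of-words vectors of dimension `vocab`.
    # f(q, d) = q^T (U^T V + I) d = (U q) . (V d) + q . d
    return float((U @ q) @ (V @ d) + q @ d)

q = np.zeros(vocab); q[[10, 57, 900]] = 1.0          # toy query
d = np.zeros(vocab); d[[57, 900, 1500, 4000]] = 1.0  # toy document
print(ssi_score(q, d))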
In this article, we propose Supervised Semantic Indexing (SSI), an algorithm that is trained on (query, document)
pairs of text documents to predict the quality of their match. Like Latent Semantic Indexing (LSI), our models
take account of correlations between words (synonymy, polysemy). However, unlike LSI our models are trained with a
supervised signal directly on the ranking task of interest, which we argue is the reason for our superior results.
As the query and target texts are modeled separately, our approach is easily generalized to different retrieval
tasks, such as online advertising placement. Dealing with models on all pairs of word features is computationally
challenging. We propose several improvements to our basic model for addressing this issue, including low rank (but
diagonal preserving) representations, and correlated feature hashing (CFH). We provide an empirical study of all
these methods on retrieval tasks based on Wikipedia documents as well as an Internet advertisement task. We obtain
state-of-the-art performance while providing realistically scalable methods.
@inproceedings{bai:2009:ssi_cikm,
author = "B. Bai and J. Weston and D. Grangier and R. Collobert and Y. Qi and K. Sadamasa and O. Chapelle and K. Weinberger",
title = "Supervised Semantic Indexing",
booktitle = "ACM Conference on Information and Knowledge Management (CIKM)",
year = "2009",
}
We present a class of models that are discriminatively trained to directly map from the word content in a
query-document or document-document pair to a ranking score. Like Latent Semantic Indexing (LSI), our models take
account of correlations between words (synonymy, polysemy). However, unlike LSI our models are trained with a
supervised signal directly on the task of interest, which we argue is the reason for our superior results. We
provide an empirical study on Wikipedia documents, using the links to define document-document or query-document
pairs, where we obtain state-of-the-art performance using our method.
@inproceedings{bai:2009:ssi_ecir,
author = "B. Bai and J. Weston and R. Collobert and D. Grangier",
title = "Supervised Semantic Indexing",
booktitle = "European Conference on Information Retrieval (ECIR)",
year = "2009",
}
This chapter introduces a discriminative approach for the detection of keywords in speech utterances.
Specifically, this work proposes a learning algorithm, which aims at maximizing the area under the receiver
operating curve, given a set of training spotting problems. This algorithm relies on a large margin formulation of
the spotting task, and adopts an efficient online learning strategy. This approach contrasts with standard
spotting strategies based on Hidden Markov Models (HMMs), for which the training procedure does not maximize a
loss directly related to the spotting performance. Different experiments performed over TIMIT and WSJ data show
the advantage of our approach over HMM-based alternatives.
@incollection{grangier:2009:kws_book,
author = "D. Grangier and J. Keshet and S. Bengio",
title = "Discriminative Keyword Spotting",
booktitle = "Automatic Speech and Speaker Recognition: Large Margin and Kernel Methods",
editor = "J. Keshet and S. Bengio",
publisher = "Wiley",
year = "2009",
}
In this thesis, we explore the use of machine learning techniques for information retrieval. More specifically, we
focus on ad-hoc retrieval, which is concerned with searching large corpora to identify the documents relevant to
user queries. This identification is performed through a ranking task. Given a user query, an ad-hoc retrieval
system ranks the corpus documents, so that the documents relevant to the query ideally appear above the others. In
a machine learning framework, we are interested in proposing learning algorithms that can benefit from limited
training data in order to identify a ranker likely to achieve high retrieval performance over unseen documents and
queries. This problem presents novel challenges compared to traditional learning tasks, such as regression or
classification. First, our task is a ranking problem, which means that the loss for a given query cannot be
measured as a sum of an individual loss suffered for each corpus document. Second, most retrieval queries present
a highly unbalanced setup, with a set of relevant documents accounting only for a very small fraction of the
corpus. Third, ad-hoc retrieval corresponds to a kind of ``double'' generalization problem, since the learned
model should not only generalize to new documents but also to new queries. Finally, our task also presents
challenging efficiency constraints, since ad-hoc retrieval is typically applied to large corpora. The main
objective of this thesis is to investigate the discriminative learning of ad-hoc retrieval models. For that
purpose, we propose different models based on kernel machines or neural networks adapted to different retrieval
contexts. The proposed approaches rely on different online learning algorithms that allow efficient learning over
large corpora.
@phdthesis{grangier:2008:phd_thesis,
author = "D. Grangier",
title = "Machine Learning for Information Retrieval",
number = "4088",
school = "Ecole Polytechnique Federale de Lausanne",
year = "2008",
}
This paper proposes a new approach for keyword spotting, which is not based on HMMs. Unlike previous approaches,
the proposed method employs a discriminative learning procedure, in which the learning phase aims at maximizing
the area under the ROC curve, as this quantity is the most common measure to evaluate keyword spotters. The
keyword spotter we devise is based on mapping the input acoustic representation of the speech utterance along with
the target keyword into a vector space. Building on techniques used for large margin and kernel methods for
predicting whole sequences, our keyword spotter distills to a classifier in this vector-space, which separates
speech utterances in which the keyword is uttered from speech utterances in which the keyword is not uttered. We
describe a simple iterative algorithm for training a keyword spotter and discuss its formal properties.
Experiments with the TIMIT corpus show that our method outperforms the conventional HMM-based approach. Further
experiments using the TIMIT trained model, but tested on the WSJ dataset, show that without further training our
method outperforms the conventional HMM-based approach.
@article{grangier:2008:kws_journal,
author = "J. Keshet and D. Grangier and S. Bengio",
title = "Discriminative Keyword Spotting",
journal = "Speech Communication",
year = "2008",
}
This paper introduces a discriminative model for the retrieval of images from text queries. Our approach
formalizes the retrieval task as a ranking problem, and introduces a learning procedure optimizing a criterion
related to the ranking performance. The proposed model hence addresses the retrieval problem directly and does not
rely on an intermediate image annotation task, which contrasts with previous research. Moreover, our learning
procedure builds upon recent work on the online learning of kernel-based classifiers. This yields an efficient,
scalable algorithm, which can benefit from recent kernels developed for image comparison. The experiments
performed over stock photography data show the advantage of our discriminative ranking approach over
state-of-the-art alternatives (e.g. our model yields 26.3% average precision over the Corel dataset, which should
be compared to 22.0%, for the best alternative model evaluated). Further analysis of the results shows that our
model is especially advantageous over difficult queries such as queries with few relevant pictures or
multiple-word queries.
@article{grangier:2008:tpami,
author = "D. Grangier and S. Bengio",
title = "A Discriminative Kernel-based Model to Rank Images from Text Queries",
journal = "IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)",
year = "2008",
}
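A sketch of one Passive-Aggressive step for a triplet ranking constraint, in the spirit of the online learning procedure above; the joint feature map phi(query, image) is assumed to be computed elsewhere, and this is an illustration rather than the exact update used in the paper.
import numpy as np

def pa_ranking_update(w, phi_pos, phi_neg, C=1.0, margin=1.0):
    # One Passive-Aggressive step for the constraint
    #   w . phi(q, relevant) >= w . phi(q, non-relevant) + margin.
    # phi_pos / phi_neg are joint feature vectors for a (query, image) pair.
    delta = phi_pos - phi_neg
    loss = max(0.0, margin - w @ delta)
    if loss > 0:
        tau = min(C, loss / (delta @ delta + 1e-12))  # aggressive but bounded step
        w = w + tau * delta
    return w

# toy usage with random joint features
rng = np.random.default_rng(1)
w = np.zeros(64)
for _ in range(100):
    w = pa_ranking_update(w, rng.normal(size=64), rng.normal(size=64))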
This paper proposes a discriminative approach to template-based keyword detection. We introduce a method to learn
the distance used to compare acoustic frames, a crucial element for template matching approaches. The proposed
algorithm estimates the distance from data, with the objective to produce a detector maximizing the Area Under the
receiver operating Curve (AUC), i.e. the standard evaluation measure for the keyword detection problem. The
experiments performed over a large corpus, SpeechDatII, suggest that our model is effective compared to an HMM
system, e.g. the proposed approach reaches an averaged AUC of 93.8%, compared to 87.9% for the HMM.
@inproceedings{grangier:2007:rr_07-15,
author = "D. Grangier and S. Bengio",
title = "Learning the Inter-frame Distance for Discriminative Template-based Keyword Detection",
booktitle = "International Conference on Speech Processing (INTERSPEECH)",
year = "2007",
}
This paper proposes a new approach for keyword spotting, which is not based on HMMs. The proposed method employs a
new discriminative learning procedure, in which the learning phase aims at maximizing the area under the ROC
curve, as this quantity is the most common measure to evaluate keyword spotters. The keyword spotter we devise is
based on non-linearly mapping the input acoustic representation of the speech utterance along with the target
keyword into an abstract vector space. Building on techniques used for large margin methods for predicting whole
sequences, our keyword spotter distills to a classifier in the abstract vector-space which separates speech
utterances in which the keyword was uttered from speech utterances in which the keyword was not uttered. We
describe a simple iterative algorithm for learning the keyword spotter and discuss its formal properties.
Experiments with the TIMIT corpus show that our method outperforms the conventional HMM-based approach.
@inproceedings{keshet:2007:nolisp,
author = "J. Keshet and D. Grangier and S. Bengio",
title = "Discriminative Keyword Spotting",
booktitle = "International Workshop on Non-LInear Speech Processing (NOLISP)",
year = "2007",
}
This work presents a neural network for the retrieval of images from text queries. The proposed network is
composed of two main modules: the first one extracts a global picture representation from local block descriptors
while the second one aims at solving the retrieval problem from the extracted representation. Both modules are
trained jointly to minimize a loss related to the retrieval performance. This approach is shown to be advantageous
when compared to previous models relying on unsupervised feature extraction: average precision over Corel queries
reaches 26.2% for our model, which should be compared to 21.6% for PAMIR, the best alternative.
@inproceedings{grangier:2006:icann,
author = "D. Grangier and S. Bengio",
title = "A Neural Network to Retrieve Images from Text Queries",
booktitle = "International Conference on Artificial Neural Networks (ICANN)}",
year = "2006",
}
This work presents a discriminative model for the retrieval of pictures from text queries. The core idea of this
approach is to minimize a loss directly related to the retrieval performance of the model. For that purpose, we
rely on a ranking loss which has recently been successfully applied to text retrieval problems. The experiments
performed over the Corel dataset show that our approach compares favorably with generative models that constitute
the state-of-the-art (e.g. our model reaches 21.6% mean average precision with Blob and SIFT features, compared to
16.7% for PLSA, the best alternative).
@inproceedings{grangier:2006:amr,
author = "D. Grangier and F. Monay and S. Bengio",
title = "Learning to Retrieve Images from Text Queries with a Discriminative Model",
booktitle = "International Workshop on Adaptive Multimedia Retrieval (AMR)",
year = "2006",
}
This work proposes a new approach to the retrieval of images from text queries. Contrasting with previous work,
this method relies on a discriminative model: the parameters are selected in order to minimize a loss related to
the ranking performance of the model, i.e. its ability to rank the relevant pictures above the non-relevant ones
when given a text query. In order to minimize this loss, we introduce an adaptation of the recently proposed
Passive-Aggressive algorithm. The generalization performance of this approach is then compared with alternative
models over the Corel dataset. These experiments show that our method outperforms the current state-of-the-art
approaches, e.g. the average precision over Corel test data is 21.6% for our model versus 16.7% for the best
alternative, Probabilistic Latent Semantic Analysis.
@inproceedings{grangier:2006:ecml,
author = "D. Grangier and F. Monay and S. Bengio",
title = "A Discriminative Approach for the Retrieval of Images from Text Queries",
booktitle = "European Conference on Machine Learning (ECML)",
year = "2006",
}
In this report, we propose a discriminative decoder for phoneme recognition, i.e. the identification of the
uttered phoneme sequence from a speech recording. This task is solved as a three-step process: a phoneme classifier
first classifies each acoustic frame, then temporal consistency features (TCF) are extracted from the phoneme
classifier outputs, and finally a sequence decoder identifies the phoneme sequence according to the TCF.
@techreport{grangier:2005:idiap-05-67,
author = "D. Grangier and S. Bengio",
title = "A Discriminative Decoder for the Recognition of Phoneme Sequences",
number = "67",
institution = "IDIAP",
year = "2005",
}
Information Retrieval (IR) aims at solving a ranking problem: given a query q and a corpus C, the documents of C
should be ranked such that the documents relevant to q appear above the others. This task is generally performed
by ranking the documents d in C according to their similarity with respect to q, sim(q, d). The identification of
an effective function (a, b) -> sim(a, b) could be performed using a large set of queries with their corresponding
relevance assessments. However, such data are especially expensive to label, thus, as an alternative, we propose
to rely on hyperlink data which convey analogous semantic relationships. We then empirically show that a measure
sim inferred from hyperlinked documents can actually outperform the state-of-the-art Okapi approach, when applied
over a non-hyperlinked retrieval corpus.
@inproceedings{grangier:2005:nips_workshop,
author = "D. Grangier and S. Bengio",
title = "Exploiting Hyperlinks to Learn a Retrieval Model",
booktitle = "NIPS Workshop on Learning to Rank",
year = "2005",
}
Assessing semantic similarity between text documents is a crucial aspect in Information Retrieval systems. In this
work, we propose to use hyperlink information to derive a similarity measure that can then be applied to compare
any text documents, with or without hyperlinks. As linked documents are generally semantically closer than
unlinked documents, we use a training corpus with hyperlinks to infer a function (a, b) -> sim(a, b) that assigns a
higher value to linked documents than to unlinked ones. Two sets of experiments on different corpora show that
this function compares favorably with OKAPI matching on document retrieval tasks.
@inproceedings{grangier:2005:cikm,
author = "D. Grangier and S. Bengio",
title = "Inferring Document Similarity from Hyperlinks",
booktitle = "ACM Conference on Information and Knowledge Management (CIKM)",
year = "2005",
}
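A hedged numpy sketch of the idea: learn a bilinear similarity sim(a, b) = a^T W b with a margin criterion that scores a linked pair above an unlinked one, using a plain subgradient step (not necessarily the training procedure used in the paper).
import numpy as np

def similarity(a, b, W):
    # Bilinear similarity over bag-of-words vectors: sim(a, b) = a^T W b.
    return a @ W @ b

def margin_update(W, doc, linked, unlinked, lr=0.1, margin=1.0):
    # Push sim(doc, linked) above sim(doc, unlinked) by a margin,
    # taking a subgradient step on the hinge loss when it is violated.
    if margin - similarity(doc, linked, W) + similarity(doc, unlinked, W) > 0:
        W = W + lr * (np.outer(doc, linked) - np.outer(doc, unlinked))
    return W

# toy usage: start from identity (cosine-like matching) and refine from one triplet
rng = np.random.default_rng(0)
W = np.identity(200)
doc, linked, unlinked = rng.random((3, 200))
W = margin_update(W, doc, linked, unlinked)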
This paper presents experiments that evaluate the effect of different video segmentation methods on text-based
video retrieval. Segmentations relying on modalities like speech, video and text or their combination are compared
with a baseline sliding window segmentation. The results suggest that even with the sliding window segmentation,
acceptable performance can be obtained on a broadcast news retrieval task. Moreover, in the case where manually
segmented data are available for training, the approach combining the different modalities can lead to IR results
close to those obtained with a manual segmentation.
@inproceedings{grangier:2005:icme,
author = "D. Grangier and A. Vinciarelli",
title = "Effect of Segmentation Method on Video Retrieval Performance",
booktitle = "IEEE International Conference on Multimedia and Expo (ICME)",
year = "2005",
}
This paper presents clustering experiments performed over noisy texts (i.e. texts that have been extracted through
an automatic process like character or speech recognition). The effect of recognition errors is investigated by
comparing clustering results performed over both clean (manually typed data) and noisy (automatic speech
transcriptions) versions of the same speech recording corpus.
@techreport{grangier:2004:idiap-04-82,
author = "D. Grangier and A. Vinciarelli",
title = "Effect of Recognition Errors on Text Clustering",
number = "82",
institution = "IDIAP",
year = "2004",
}
Spoken Document Retrieval (SDR) consists in retrieving segments of a speech database that are relevant to a query.
The state-of-the-art approach to the SDR problem consists in transcribing the speech data into digital text before
applying common Information Retrieval (IR) techniques. The transcription, produced by an Automatic Speech
Recognition system, contains recognition errors. These errors can be referred to as noise. This thesis
investigates the effect of this noise on the retrieval process. We compare the results obtained with clean and
noisy data at different steps of the retrieval process. To perform such a task, standard IR measures (precision,
recall, break-even point, etc.) are used. It is shown that even with very different error rates (10% vs 30%), the
performances obtained over noisy text are only slightly lower than those over clean text (9% degradation of
average precision for our complete IR system, 45.2% vs 41.2%).
@techreport{com-03-08,
author = "D. Grangier and A. Vinciarelli and H. Bourlard",
title = "Information Retrieval on Noisy Text",
number = "8",
Keywords = "Information Retrieval, Speech, Spoken Documents Retrieval, Noisy Text",
institution = "IDIAP",
year = "2003",
}
Code, Data & Demos
Tutorials and Workshops
Colleagues and Co-Authors