A Data-driven Approach to Natural Language Processing for Contemporary and Historical French. (Une approche basée sur les données pour le traitement automatique du langage naturel en français contemporain et historique).

Pedro Javier Ortiz Suárez

PhD thesis, 2022

Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets.

Nasanbayar Ulzii-Orshikh

Pedro Javier Ortiz Suárez

Iroro Orife

Kelechi Ogueji

Andre Niyongabo Rubungo

Toan Q. Nguyen

Mathias Müller

André Müller

Shamsuddeen Hassan Muhammad

Nanda Muhammad

Ayanda Mnyakeni

Jamshidbek Mirzakhalov

Tapiwanashe Matangira

Bonaventure F. P. Dossou

Trans. Assoc. Comput. Linguistics, 2022

Perplexed by Quality: A Perplexity-based Method for Adult and Harmful Content Detection in Multilingual Heterogeneous Web Data.

CoRR, 2022

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model.

CoRR, 2022

Automatic Extraction of Materials and Properties from Superconductors Scientific Literature.

Luca Foppiano

Pedro Baptista de Castro

CoRR, 2022

Documenting Geographically and Contextually Diverse Data Sources: The BigScience Catalogue of Language Data and Resources.

Angelina McMillan-Major

Pedro Javier Ortiz Suárez

Zeerak Talat

Daniel van Strien

Yacine Jernite

CoRR, 2022

Le projet FREEM : ressources, outils et enjeux pour l'étude du français d'Ancien Régime (The FREEM project: Resources, tools and challenges for the study of Ancien Régime French).

Simon Gabay

Pedro Javier Ortiz Suárez

Proceedings of the Actes de la 29e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale, 2022

The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset.

Albert Villanova del Moral

Teven Le Scao

Leandro von Werra

Chenghao Mou

Eduardo González Ponferrada

Angelina McMillan-Major

David Ifeoluwa Adelani

Alexandra Sasha Luccioni

Yacine Jernite

Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

BERTrade: Using Contextual Embeddings to Parse Old French.

Loïc Grobol

Mathilde Regnault

Pedro Javier Ortiz Suárez

Benoît Sagot

Laurent Romary

Benoît Crabbé

Proceedings of the Thirteenth Language Resources and Evaluation Conference, 2022

From FreEM to D'AlemBERT: a Large Corpus and a Language Model for Early Modern French.

Proceedings of the Thirteenth Language Resources and Evaluation Conference, 2022

Towards a Cleaner Document-Oriented Multilingual Crawled Corpus.

Julien Abadji

Pedro Javier Ortiz Suárez

Laurent Romary

Benoît Sagot

Proceedings of the Thirteenth Language Resources and Evaluation Conference, 2022

A Data-driven Approach to Named Entity Recognition for Early Modern French.

Pedro Ortiz Suarez

Simon Gabay

Proceedings of the 29th International Conference on Computational Linguistics, 2022

2020

Les modèles de langue contextuels CamemBERT pour le français : impact de la taille et de l'hétérogénéité des données d'entrainement (CamemBERT Contextual Language Models for French: Impact of Training Data Size and Heterogeneity).

Louis Martin

Benjamin Muller

Pedro Javier Ortiz Suárez

Yoann Dupont

Laurent Romary

Éric Villemonte de la Clergerie

Benoît Sagot

Djamé Seddah

Proceedings of the Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 2020

Establishing a New State-of-the-Art for French Named Entity Recognition.