Pedro Ortiz Suarez

Orcid: 0000-0003-0343-8852

According to our database, Pedro Ortiz Suarez authored at least 20 papers between 2020 and 2023.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2023
Tokenizer Choice For LLM Training: Negligible or Crucial?
CoRR, 2023

Semi-automatic staging area for high-quality structured data extraction from scientific literature.
CoRR, 2023

2022
A Data-driven Approach to Natural Language Processing for Contemporary and Historical French. (Une approche basée sur les données pour le traitement automatique du langage naturel en français contemporain et historique).
PhD thesis, 2022

Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets.
Trans. Assoc. Comput. Linguistics, 2022

Perplexed by Quality: A Perplexity-based Method for Adult and Harmful Content Detection in Multilingual Heterogeneous Web Data.
CoRR, 2022

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model.
CoRR, 2022

Automatic Extraction of Materials and Properties from Superconductors Scientific Literature.
CoRR, 2022

Documenting Geographically and Contextually Diverse Data Sources: The BigScience Catalogue of Language Data and Resources.
CoRR, 2022

Le projet FREEM : ressources, outils et enjeux pour l'étude du français d'Ancien Régime (The FREEM project: Resources, tools and challenges for the study of Ancien Régime French).
Proceedings of the Actes de la 29e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale, 2022


BERTrade: Using Contextual Embeddings to Parse Old French.
Proceedings of the Thirteenth Language Resources and Evaluation Conference, 2022

From FreEM to D'AlemBERT: a Large Corpus and a Language Model for Early Modern French.
Proceedings of the Thirteenth Language Resources and Evaluation Conference, 2022

Towards a Cleaner Document-Oriented Multilingual Crawled Corpus.
Proceedings of the Thirteenth Language Resources and Evaluation Conference, 2022

A Data-driven Approach to Named Entity Recognition for Early Modern French.
Proceedings of the 29th International Conference on Computational Linguistics, 2022

2020
Les modèles de langue contextuels Camembert pour le français : impact de la taille et de l'hétérogénéité des données d'entrainement (CamemBERT Contextual Language Models for French: Impact of Training Data Size and Heterogeneity).
Proceedings of the Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 2020

Establishing a New State-of-the-Art for French Named Entity Recognition.
Proceedings of The 12th Language Resources and Evaluation Conference, 2020

SinNer@Clef-Hipe2020 : Sinful adaptation of SotA models for Named Entity Recognition in French and German.
Proceedings of the Working Notes of CLEF 2020, 2020

A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages.
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020

Building a User-Generated Content North-African Arabizi Treebank: Tackling Hell.
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020

CamemBERT: a Tasty French Language Model.
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020

