Peter Rupnik

Daniela Širinić

CoRR, February, 2026

Mići Princ - A Little Boy Teaching Speech Technologies the Chakavian Dialect.

Tea Perinčić

CoRR, February, 2026

The Growing Gains and Pains of Iterative Web Corpora Crawling: Insights from South Slavic CLASSLA-web 2.0 Corpora.

Vít Suchomel

CoRR, January, 2026

Regional Variation in the Performance of ASR Models on Croatian and Serbian.

Tanja Samardžić

Proceedings of the 13th Workshop on NLP for Similar Languages, Varieties and Dialects, 2026

2025

State of the Art in Text Classification for South Slavic Languages: Fine-Tuning or Prompting?

CoRR, November, 2025

ParlaSpeech 3.0: Richly Annotated Spoken Parliamentary Corpora of Croatian, Czech, Polish, and Serbian.

Ivan Porupski

CoRR, November, 2025

ParlaMint II: advancing comparable parliamentary corpora across Europe.

Lang. Resour. Evaluation, September, 2025

Identifying Primary Stress Across Related Languages and Dialects with Transformer-based Speech Encoder Models.

Ivan Porupski

Proceedings of the 26th Annual Conference of the International Speech Communication Association, 2025

2024

Language Models on a Diet: Cost-Efficient Development of Encoders for Closely-Related Languages via Additional Pretraining.

CoRR, 2024

JSI and WüNLP at the DIALECT-COPA Shared Task: In-Context Learning From Just a Few Dialectal Examples Gets You Quite Far.

Proceedings of the Eleventh Workshop on NLP for Similar Languages, Varieties, and Dialects, 2024

DIALECT-COPA: Extending the Standard Translations of the COPA Causal Commonsense Reasoning Dataset to South Slavic Dialects.

Proceedings of the Eleventh Workshop on NLP for Similar Languages, Varieties, and Dialects, 2024

The ParlaSpeech Collection of Automatically Generated Speech and Text Datasets from Parliamentary Proceedings.

Danijel Koržinek

Proceedings of the Speech and Computer - 26th International Conference, 2024

Do Language Models Care about Text Quality? Evaluating Web-Crawled Corpora across 11 Languages.

Proceedings of the 2024 Joint International Conference on Computational Linguistics, 2024

The ParlaSent Multilingual Training Dataset for Sentiment Identification in Parliamentary Proceedings.

Michal Mochtak

Proceedings of the 2024 Joint International Conference on Computational Linguistics, 2024

2023

BENCHić-lang: A Benchmark for Discriminating between Bosnian, Croatian, Montenegrin and Serbian.

Taja Kuzman

Proceedings of the Tenth Workshop on NLP for Similar Languages, Varieties and Dialects, 2023

Get to Know Your Parallel Data: Performing English Variety and Genre Classification over MaCoCu Corpora.

Taja Kuzman

Proceedings of the Tenth Workshop on NLP for Similar Languages, Varieties and Dialects, 2023

MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages.

Aarón Galiano Jiménez

Jaume Zaragoza-Bernabeu

Proceedings of the 24th Annual Conference of the European Association for Machine Translation, 2023

2022

The ParlaSent-BCS dataset of sentiment-annotated parliamentary debates from Bosnia-Herzegovina, Croatia, and Serbia.

Michal Mochtak

CoRR, 2022

The GINCO Training Dataset for Web Genre Identification of Documents Out in the Wild.

Taja Kuzman