Nikola Ljubesic

Orcid: 0000-0001-7169-9152

According to our database, Nikola Ljubesic authored at least 106 papers between 2008 and 2024.

Bibliography

2024
CLASSLA-web: Comparable Web Corpora of South Slavic Languages Enriched with Linguistic and Genre Annotation.
CoRR, 2024

Do Language Models Care About Text Quality? Evaluating Web-Crawled Corpora Across 11 Languages.
CoRR, 2024


2023
Correction: Content-based comparison of communities in social networks: Ex-Yugoslavian reactions to the Russian invasion of Ukraine.
Appl. Netw. Sci., December, 2023

Content-based comparison of communities in social networks: Ex-Yugoslavian reactions to the Russian invasion of Ukraine.
Appl. Netw. Sci., December, 2023

Quantifying the impact of context on the quality of manual hate speech annotation.
Nat. Lang. Eng., November, 2023

The ParlaMint corpora of parliamentary proceedings.
Lang. Resour. Evaluation, March, 2023

Who are the haters? A corpus-based demographic analysis of authors of hate speech.
Frontiers Artif. Intell., February, 2023

Universal NER: A Gold-Standard Multilingual Named Entity Recognition Benchmark.
CoRR, 2023

The ParlaSent multilingual training dataset for sentiment identification in parliamentary proceedings.
CoRR, 2023

CLASSLA-Stanza: The Next Step for Linguistic Processing of South Slavic Languages.
CoRR, 2023

ChatGPT: Beginning of an End of Manual Linguistic Data Annotation? Use Case of Automatic Genre Identification.
CoRR, 2023

BENCHić-lang: A Benchmark for Discriminating between Bosnian, Croatian, Montenegrin and Serbian.
Proceedings of the Tenth Workshop on NLP for Similar Languages, Varieties and Dialects, 2023

Get to Know Your Parallel Data: Performing English Variety and Genre Classification over MaCoCu Corpora.
Proceedings of the Tenth Workshop on NLP for Similar Languages, Varieties and Dialects, 2023

Findings of the VarDial Evaluation Campaign 2023.
Proceedings of the Tenth Workshop on NLP for Similar Languages, Varieties and Dialects, 2023


MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages.
Proceedings of the 24th Annual Conference of the European Association for Machine Translation, 2023

2022
The ParlaSent-BCS dataset of sentiment-annotated parliamentary debates from Bosnia-Herzegovina, Croatia, and Serbia.
CoRR, 2022

Geographic Adaptation of Pretrained Language Models.
CoRR, 2022

The GINCO Training Dataset for Web Genre Identification of Documents Out in the Wild.
Proceedings of the Thirteenth Language Resources and Evaluation Conference, 2022

MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages.
Proceedings of the 23rd Annual Conference of the European Association for Machine Translation, 2022

2021
The KAS corpus of Slovenian academic writing.
Lang. Resour. Evaluation, 2021

Retweet communities reveal the main sources of hate speech.
CoRR, 2021

Community evolution in retweet networks.
CoRR, 2021

BERTić - The Transformer Language Model for Bosnian, Croatian, Montenegrin and Serbian.
CoRR, 2021

Evolution of topics and hate speech in retweet network communities.
Appl. Netw. Sci., 2021

Exploring Stylometric and Emotion-Based Features for Multilingual Cross-Domain Hate Speech Detection.
Proceedings of the Eleventh Workshop on Computational Approaches to Subjectivity, 2021

Social Media Variety Geolocation with geoBERT.
Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects, 2021

Findings of the VarDial Evaluation Campaign 2021.
Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects, 2021

Cultural Topic Modelling over Novel Wikipedia Corpora for South-Slavic Languages.
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), 2021

Sesame Street to Mount Sinai: BERT-constrained character-level Moses models for multilingual lexical normalization.
Proceedings of the Seventh Workshop on Noisy User-generated Text, 2021

2020
The Janes project: language resources and tools for Slovene user generated content.
Lang. Resour. Evaluation, 2020


HeLju@VarDial 2020: Social Media Variety Geolocation with BERT Models.
Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects, 2020

A Report on the VarDial Evaluation Campaign 2020.
Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects, 2020

SemEval-2020 Task 3: Graded Word Similarity in Context.
Proceedings of the Fourteenth Workshop on Semantic Evaluation, 2020

Gigafida 2.0: The Reference Corpus of Written Standard Slovene.
Proceedings of The 12th Language Resources and Evaluation Conference, 2020

CoSimLex: A Resource for Evaluating Graded Word Similarity in Context.
Proceedings of The 12th Language Resources and Evaluation Conference, 2020

2019
Extracting Data from Comparable Corpora.
Proceedings of the Using Comparable Corpora for Under-Resourced Areas of Machine Translation, 2019


How to tag non-standard language: Normalisation versus domain adaptation for Slovene historical and user-generated texts.
Nat. Lang. Eng., 2019

CoSimLex: A Resource for Evaluating Graded Word Similarity in Context.
CoRR, 2019

KAS-term: Extracting Slovene Terms from Doctoral Theses via Supervised Machine Learning.
Proceedings of the Text, Speech, and Dialogue - 22nd International Conference, 2019

The FRENK Datasets of Socially Unacceptable Discourse in Slovene and English.
Proceedings of the Text, Speech, and Dialogue - 22nd International Conference, 2019

What does Neural Bring? Analysing Improvements in Morphosyntactic Annotation and Lemmatisation of Slovenian, Croatian and Serbian.
Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing, 2019

2018
Language Identification and Morphosyntactic Tagging: The Second VarDial Evaluation Campaign.
Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects, 2018

Comparing CRF and LSTM performance on the task of morphosyntactic tagging of non-standard varieties of South Slavic languages.
Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects, 2018

Predicting Concreteness and Imageability of Words Within and Across Languages via Word Embeddings.
Proceedings of The Third Workshop on Representation Learning for NLP, 2018

Bleaching Text: Abstract Features for Cross-lingual Gender Prediction.
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 2018

Datasets of Slovene and Croatian Moderated News Comments.
Proceedings of the 2nd Workshop on Abusive Language Online, 2018

2017
Crawl and crowd to bring machine translation to under-resourced languages.
Lang. Resour. Evaluation, 2017

Findings of the VarDial Evaluation Campaign 2017.
Proceedings of the Fourth Workshop on NLP for Similar Languages, 2017

Language-independent Gender Prediction on Twitter.
Proceedings of the Second Workshop on NLP and Computational Social Science, 2017

Universal Dependencies for Serbian in Comparison with Croatian and Other Slavic Languages.
Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing, 2017

Adapting a State-of-the-Art Tagger for South Slavic Languages to Non-Standard Text.
Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing, 2017

Legal Framework, Dataset and Annotation Schema for Socially Unacceptable Online Discourse Practices in Slovene.
Proceedings of the First Workshop on Abusive Language Online, 2017

2016
Enlarging Scarce In-domain English-Croatian Corpus for SMT of MOOCs Using Serbian.
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects, 2016

Discriminating between Similar Languages and Arabic Dialect Identification: A Report on the Third DSL Shared Task.
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects, 2016

Detecting Semantic Shifts in Slovene Twitterese.
Proceedings of the 10th Workshop on Recent Advances in Slavonic Natural Languages Processing, 2016

Gold-Standard Datasets for Annotation of Slovene Computer-Mediated Communication.
Proceedings of the 10th Workshop on Recent Advances in Slavonic Natural Languages Processing, 2016

Croatian Error-Annotated Corpus of Non-Professional Written Language.
Proceedings of the Tenth International Conference on Language Resources and Evaluation LREC 2016, 2016

New Inflectional Lexicons and Training Corpora for Improved Morphosyntactic Annotation of Croatian and Serbian.
Proceedings of the Tenth International Conference on Language Resources and Evaluation LREC 2016, 2016

Producing Monolingual and Parallel Web Corpora at the Same Time - SpiderLing and Bitextor's Love Affair.
Proceedings of the Tenth International Conference on Language Resources and Evaluation LREC 2016, 2016

Corpus-Based Diacritic Restoration for South Slavic Languages.
Proceedings of the Tenth International Conference on Language Resources and Evaluation LREC 2016, 2016

Corpus vs. Lexicon Supervision in Morphosyntactic Tagging: the Case of Slovene.
Proceedings of the Tenth International Conference on Language Resources and Evaluation LREC 2016, 2016

Automatic normalisation of the Swiss German ArchiMob corpus using character-level machine translation.
Proceedings of the 13th Conference on Natural Language Processing, 2016

Normalising Slovene data: historical texts vs. user-generated content.
Proceedings of the 13th Conference on Natural Language Processing, 2016

Abu-MaTran: automatic building of machine translation.
Proceedings of the 19th Annual Conference of the European Association for Machine Translation: Projects/Products, 2016

Dealing with Data Sparseness in SMT with Factored Models and Morphological Expansion: a Case Study on Croatian.
Proceedings of the 19th Annual Conference of the European Association for Machine Translation, 2016

Collaborative Development of a Rule-Based Machine Translator between Croatian and Serbian.
Proceedings of the 19th Annual Conference of the European Association for Machine Translation, 2016

TweetGeo - A Tool for Collecting, Processing and Analysing Geo-encoded Linguistic Data.
Proceedings of the COLING 2016, 2016

Closing a Gap in the Language Resources Landscape: Groundwork and Best Practices from Projects on Computer-mediated Communication in four European Countries.
Proceedings of the Selected papers from the CLARIN Annual Conference 2016, 2016

A Global Analysis of Emoji Usage.
Proceedings of the 10th Web as Corpus Workshop, 2016

Private or Corporate? Predicting User Types on Twitter.
Proceedings of the 2nd Workshop on Noisy User-generated Text, 2016

2015
Discriminating Between Closely Related Languages on Twitter.
Informatica (Slovenia), 2015

*MWELex - MWE Lexica of Croatian, Slovene and Serbian Extracted from Parsed Corpora.
Informatica (Slovenia), 2015

The slWaC Corpus of the Slovene Web.
Informatica (Slovenia), 2015

Abu-MaTran at WMT 2015 Translation Task: Morphological Segmentation and Web Crawling.
Proceedings of the Tenth Workshop on Statistical Machine Translation, 2015

Predicting the Level of Text Standardness in User-generated Content.
Proceedings of the Recent Advances in Natural Language Processing, 2015

Predicting Inflectional Paradigms and Lemmata of Unknown Words for Semi-automatic Expansion of Morphological Lexicons.
Proceedings of the Recent Advances in Natural Language Processing, 2015

Abu-MaTran: Automatic building of Machine Translation.
Proceedings of the 18th Annual Conference of the European Association for Machine Translation, 2015

Regional Linguistic Data Initiative (ReLDI).
Proceedings of the 5th Workshop on Balto-Slavic Natural Language Processing, 2015

Universal Dependencies for Croatian (that work for Serbian, too).
Proceedings of the 5th Workshop on Balto-Slavic Natural Language Processing, 2015

2014
A Report on the DSL Shared Task 2014.
Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, 2014

Quality Estimation for Synthetic Parallel Data Generation.
Proceedings of the Ninth International Conference on Language Resources and Evaluation, 2014

caWaC - A web corpus of Catalan and its application to language modeling and machine translation.
Proceedings of the Ninth International Conference on Language Resources and Evaluation, 2014

TweetCaT: a tool for building Twitter corpora of smaller languages.
Proceedings of the Ninth International Conference on Language Resources and Evaluation, 2014

Comparing two acquisition systems for automatically building an English-Croatian parallel corpus from multilingual websites.
Proceedings of the Ninth International Conference on Language Resources and Evaluation, 2014

The SETimes.HR Linguistically Annotated Corpus of Croatian.
Proceedings of the Ninth International Conference on Language Resources and Evaluation, 2014

Standardizing Tweets with Character-Level Machine Translation.
Proceedings of the Computational Linguistics and Intelligent Text Processing, 2014

{bs, hr, sr}WaC - Web Corpora of Bosnian, Croatian and Serbian.
Proceedings of the 9th Web as Corpus Workshop, 2014

2013
Vector Disambiguation for Translation Extraction from Comparable Corpora.
Informatica (Slovenia), 2013

Cross-lingual WSD for Translation Extraction from Comparable Corpora.
Proceedings of the Sixth Workshop on Building and Using Comparable Corpora, 2013

Identifying false friends between closely related languages.
Proceedings of the 4th Biennial International Workshop on Balto-Slavic Natural Language Processing, 2013

Lemmatization and Morphosyntactic Tagging of Croatian and Serbian.
Proceedings of the 4th Biennial International Workshop on Balto-Slavic Natural Language Processing, 2013

2012
Addressing polysemy in bilingual lexicon extraction from comparable corpora.
Proceedings of the Eighth International Conference on Language Resources and Evaluation, 2012

Efficient Discrimination Between Closely Related Languages.
Proceedings of the COLING 2012, 2012

2011
Bootstrapping Bilingual Lexicons from Comparable Corpora for Closely Related Languages.
Proceedings of the Text, Speech and Dialogue - 14th International Conference, 2011

hrWaC and slWac: Compiling Web Corpora for Croatian and Slovene.
Proceedings of the Text, Speech and Dialogue - 14th International Conference, 2011

Bilingual lexicon extraction from comparable corpora for closely related languages.
Proceedings of the Recent Advances in Natural Language Processing, 2011

Building and Using Comparable Corpora for Domain-Specific Bilingual Lexicon Extraction.
Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web, 2011

2010
Statistical Machine Translation of Croatian Weather Forecasts: How Much Data Do We Need?
J. Comput. Inf. Technol., 2010

Building a Gold Standard for Event Detection in Croatian.
Proceedings of the International Conference on Language Resources and Evaluation, 2010

Towards Sentiment Analysis of Financial Texts in Croatian.
Proceedings of the International Conference on Language Resources and Evaluation, 2010

2008
Generating a Morphological Lexicon of Organization Entity Names.
Proceedings of the International Conference on Language Resources and Evaluation, 2008

Comparing measures of semantic similarity.
Proceedings of the ITI 2008 30th International Conference on Information Technology Interfaces, 2008

