Embeddings-based techniques for Media Monitoring Applications

Publications:

Caporusso et al. (2024a). Analysing Bias in Slovenian News Media: A Computational Comparison Based on Readers’ Political Orientation. COBISS.SI-ID 222178307
Caporusso et al. (2024b). A Computational Analysis of the Dehumanisation of Migrants from Syria and Ukraine in Slovene News Media. COBISS.SI-ID 197420291
Chatterjee et al. (2024). The “Right” Discourse on Migration: Analysing Migration-Related Tweets in Right and Far-Right Political Movements. COBISS.SI-ID 222190851
Đoković & Robnik Šikonja (2024). Sarcasm detection in a less-resourced language. COBISS.SI-ID 216268291
Ghinassi et al. (2024a). Recent Trends in Linear Text Segmentation: A Survey. COBISS.SI-ID 220088323
Ghinassi et al. (2024b). When Cohesion Lies in the Embedding Space: Embedding-Based Reference-Free Metrics for Topic Segmentation. COBISS.SI-ID 220053507
Hosseini et al. (2025). Efficient Solutions For An Intriguing Failure of LLMs: Long Context Window Does Not Mean LLMs Can Analyze Long Sequences Flawlessly. COBISS.SI-ID 229056003
Ivačič et al. (2024). Comparing News Framing of Migration Crises using Zero-Shot Classification. COBISS.SI-ID 199763459
Ivačič et al. (2025). Extreme Multi-Label Text Classification for Less-Represented Languages and Low-Resource Environments: Advances and Lessons Learned.
Karan et al. (2025). A Dataset for Expert Reviewer Recommendation with Large Language Models as Zero-shot Rankers. COBISS.SI-ID 229052931
Klemen et al. (2024a). Neural spell-checker: beyond words with synthetic data generation. COBISS.SI-ID 213519107 (available on arXiv).
Klemen et al. (2024b). SI-NLI: A Slovene Natural Language Inference Dataset and Its Evaluation. COBISS.SI-ID 197916931
Kmecl & Robnik Šikonja (2024). Logično sklepanje v naravnem jeziku za slovenščino. COBISS.SI-ID 206551299
Koloski et al. (2026). FuDoBa: Fusing Document and Knowledge Graph Based Representations with Bayesian Optimisation.
Koloski et al. (2025). Measuring catastrophic forgetting in cross-lingual classification: transfer paradigms and tuning strategies. COBISS.SI-ID 227115523
Koloski et al. (2024a). AutoML-guided fusion of entity and LLM-based representations for document classification. COBISS.SI-ID 224164867
Koloski et al. (2024b). AHAM: adapt, help, ask, model harvesting LLMs for literature mining. COBISS.SI-ID 193861891
Kuzman & Ljubešić (2025). LLM Teacher-Student Framework for Text Classification With No Manually Annotated Data: A Case Study in IPTC News Topic Classification. COBISS.SI-ID 229074179
Kuzman Pungeršek et al. (2025). State of the Art in Text Classification for South Slavic Languages: Fine-Tuning or Prompting?
Kuzman Pungeršek et al. (2026). The Growing Gains and Pains of Iterative Web Corpora Crawling: Insights from South Slavic CLASSLA-web 2.0 Corpora.
Ljubešić & Kuzman (2024). CLASSLA-web: Comparable Web Corpora of South Slavic Languages Enriched with Linguistic and Genre Annotation. COBISS.SI-ID 196933379
Ljubešić et al. (2025). ParlaSpeech 3.0: Richly Annotated Spoken Parliamentary Corpora of Croatian, Czech, Polish, and Serbian.
Martinc et al. (2024). Sistem za zaznavanje sprememb v rabi besed in njegova uporaba za sociolingvistično analizo. COBISS.SI-ID 214922755
Martinc et al. (2025). Viewpoint detection on LGBT+ reporting using contextual embeddings and qualitative thematic analysis: the use case on the word deep. COBISS.SI-ID 229359363
Mochtak et al. (2024). The ParlaSent Multilingual Training Dataset for Sentiment Identification in Parliamentary Proceedings. COBISS.SI-ID 197916931
Piskorski et al. (2024). Overview of the CLEF-2024 CheckThat! Lab Task 3 on persuasion techniques. COBISS.SI-ID 208589315
Ulčar et al. (2026). Mono- and cross-lingual evaluation of representation language models on less-resourced languages. COBISS.SI-ID 241622275
Vreš et al. (2024). Generative model for less-resourced language with 1 billion parameters. COBISS.SI-ID 212016131
Žagar et al. (2024). SENTA: Sentence Simplification System for Slovene. COBISS.SI-ID 197916675

Resources:

Brglez et al. (2024). Slovenian Emotion Dimension and Emotion Association Lexicon SloEmoLex 1.0. COBISS.SI-ID 201681923
Ivačič et al. (2024). News sentiment analysis datasets for Serbian, Bosnian, Macedonian, Albanian and Estonian SADEmma 1.0. COBISS.SI-ID 216372227
Krsnik et al. (2024a). Corpus extraction tool LIST 1.3. COBISS.SI-ID 218014211
Krsnik et al. (2024b). Dependency tree extraction tool STARK 3.0. COBISS.SI-ID 206072835
Kuzman & Ljubešić (2024). Multilingual training dataset of news annotated with topics from the IPTC NewsCodes Media Topic schema. COBISS.SI-ID 219483395
Kuzman. Code for developing and evaluating a classifier that assigns news to IPTC NewsCodes Media Topic topics: IPTC Media Topic Classification. https://github.com/TajaKuzman/IPTC-Media-Topic-Classification
Vreš et al. (2024). Slovene instruction-following dataset for large language models GaMS-Instruct-GEN 1.0. COBISS.SI-ID 218023427
Žagar et al. (2024). Knowledge-Enhanced Winograd Schema Challenge KE-WSC 1.0. COBISS.SI-ID 219460867

Models:

Ivačič. XLM-RoBERTa-base NER model for Slavic languages.
Kuzman & Ljubešić (2024). Multilingual IPTC Media Topic Classifier: a multilingual classifier of news into topics from the IPTC NewsCodes Media Topic schema.

Abstracts:

Gruevska-Madžoska et al. (2024). The Macedonian language as an integral part of the multilingual encyclopedic dictionary BabelNet (BabelNet) and the BabelFy Tool (BabelFy): on the method of representation and recognition of word meanings (current state and perspectives). COBISS.SI-ID 218806531
Koloski & Pollak (2024). Enabling topic-modeling for specific domains via domain-adaptation of LLMs. COBISS.SI-ID 220002819

Workshops:

We organized SLaLaM 2023, the first Slovenian workshop on Large Language Models techniques and applications. The proceedings are available here:
Jaya Caporusso, Nada Lavrač (eds.) (2023). Proceedings of SLaLaM 2023, 1st Slovenian Workshop on Large Language Models: Techniques and Applications. Bernardin, Slovenia.

We organized SLaLoM 2026, the second Slovenian workshop on Large Language Models techniques and applications. The proceedings are available here:
Jaya Caporusso, Nada Lavrač (eds.) (2026). Proceedings of SLaLoM 2026, 2nd Slovenian Workshop on Large Language Models: Techniques and Applications. Kranjska Gora, Slovenia.

Project duration: 1 October 2023 to 30 September 2026

Funding: This work was supported by the Slovenian Research and Innovation Agency through the research project Embeddings-based techniques for Media Monitoring Applications (L2-50070, co-funded by the Kliping d.o.o. agency).