Embeddings-based techniques for Media Monitoring Applications

In machine learning, the analysis of big data remains a major challenge. The term big data refers to data characterised by large volume, velocity, veracity, and variety. The proposed project tackles the challenges of language variety and velocity (dynamics) of media content, which we address using advanced text representation methods (embeddings) and deep learning. The growing volume of media content spans a spectrum from traditional high-quality news to less reliable social media content. Media monitoring and analysis need to be performed in real time: grouping articles by their content, adding several categories of meta-information, summarising several news sources, performing analyses, and reporting. Clipping agencies, such as the Slovenian agency Kliping d.o.o., which will co-finance this industrial project, therefore face a challenging problem, particularly because many analytical tasks have to be performed manually, especially in less-resourced languages where many tools are non-existent or do not return results of sufficient quality. Kliping monitors over 70,000 traditional articles and over 1 million social media posts per day, resulting in more than 1,500 daily reports for their respective target users. It covers the Slovenian as well as the Western Balkans media space and thus handles text in six languages (Slovenian, Croatian, Bosnian, Serbian, Macedonian and Albanian) and two alphabets (Latin and Cyrillic). Recent machine learning techniques for advanced Natural Language Processing, which are based on text embeddings and large pretrained language models, enable the development of advanced text analysis tools, for example for categorising texts by topic or sentiment and for summarising text from multiple sources.
However, even the best of these tools have to be adapted and improved to cope with specific user needs, the complexity of news category hierarchies, the metadata structures used in the news industry, and the coverage of multiple languages. To this end, this project aims to develop advanced multilingual tools for analysing news and social media content, helping to automate text analysis processes while increasing society’s ability to understand the rapid flow of information surrounding us.

 

Project duration: from 1 October 2023 to 30 September 2026

Funding: This work was supported by the Slovenian Research and Innovation Agency through the research project Embeddings-based techniques for Media Monitoring Applications (L2-50070, co-funded by the Kliping d.o.o. agency).

Keywords:

machine learning, text mining, natural language processing, deep neural networks, document representation, language models, embeddings, media monitoring

Project manager:

Nada Lavrač

Participating institutions:

Institut “Jožef Stefan”

Univerza v Ljubljani, Fakulteta za računalništvo in informatiko

Kliping d.o.o.