Embeddings-based techniques for Media Monitoring Applications


The proposed research project will make an important contribution to science by advancing the state of the art in natural language processing (NLP) research and by developing advanced NLP technologies, particularly for Slovene and other South Slavic languages.

In the research and development of new NLP technologies, the project will focus on news categorisation, sentiment analysis, and news summarisation. Since the project is conducted with industrial target users, the methods will be robust, fast, and able to work with human users in the loop, which is itself an advance over typical NLP methods. In addition, the developed tools and results will benefit researchers from the social sciences, including political scientists and discourse analysts, as the tools will enable large-scale analysis of social media posts and news articles from a range of sources.

The project will positively affect society and the economy. For society, the developed open-source solutions will be available for reuse by researchers and students. For the economy, the project will have numerous impacts. It will directly benefit the co-financer, Kliping d.o.o., by enabling a radical digital transformation of the processes used in the company's daily work. Currently, the majority of meta-data attribution is done manually, so introducing a semi-automated process would have a large impact in terms of speed, efficiency, and cost, while also allowing new services to be introduced (e.g., sentiment analysis of media posts is currently not offered as a service, as it would be too time-consuming). Moreover, Kliping currently has only monolingual processes; as the company operates across several countries, with data in several languages, the introduction of cross-lingual analysis will open up substantial new market opportunities.

In terms of other industrial potential, the project will produce a set of solutions reusable by other companies (the code will be made publicly available so that other companies can train their own models, although the models trained on Kliping data will remain confidential). This will allow media companies, international standardisation bodies such as the IPTC, and others to benefit from these tools. In addition, European SMEs such as TEXTA (a partner in the completed EMBEDDIA project coordinated by JSI) will be able to benefit from the developed solutions, thus contributing more widely to the competitiveness of European industry.

For society, the benefits are severalfold. Developing solutions for less-resourced languages helps increase the representation of non-dominant languages in the digital space. Tools for media monitoring are important for understanding society and can allow researchers and the general public to reveal, for example, changes in sentiment and reporting over time (the main focus of the WP on longitudinal analysis). Societal benefits will also be ensured by making the developed open-source solutions available for reuse by the interested public, including students and researchers.

Detailed description of the work programme

WP1: Keyword extraction and topic categorisation.

WP1 will develop unsupervised and supervised multilingual methods for keyword extraction, topic modelling, and document categorisation, including multi-label hierarchical classification. Categorisation will be based on standardised and custom category inventories: the standardised inventory comprises the IPTC media topics, more than 1,200 terms (see https://www.iptc.org/standards/media-topics/) organised in three hierarchical levels, while the custom inventory comprises approximately 20,000 specific topics across several languages, defined by Kliping for clients' needs. The inventories are dynamic and can be extended.

Task 1.1: Target and keyword detection. The keyword extraction approach will rely on adapting our supervised keyword extraction method TNT-KID (Martinc et al., 2020b) by training the system on novel datasets and incorporating named entity recognition information (Ljubešić, 2020). It will also be adapted to identify client-specific targets (i.e. entities relevant to the clients). The proposed system will handle different types of documents and will be adaptable to different clients' needs and languages.
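
As a concrete illustration, the sketch below shows a minimal embedding-based keyword-ranking baseline of the kind such systems can fall back on; it is not TNT-KID itself, and the model name and candidate extraction are illustrative assumptions.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    from sentence_transformers import SentenceTransformer

    def extract_keywords(doc: str, top_k: int = 5) -> list[str]:
        # Candidate phrases: unigrams and bigrams taken from the document itself.
        vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words="english")
        candidates = vectorizer.fit([doc]).get_feature_names_out()

        model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
        doc_emb = model.encode([doc])
        cand_embs = model.encode(list(candidates))

        # Rank candidates by similarity to the document as a whole.
        scores = cosine_similarity(doc_emb, cand_embs)[0]
        ranked = sorted(zip(candidates, scores), key=lambda pair: -pair[1])
        return [phrase for phrase, _ in ranked[:top_k]]

    print(extract_keywords("The government announced new climate policies today."))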

Task 1.2: Assignment of documents to IPTC categories. The IPTC tagset is a standardised hierarchical tag structure. In the first step, a semi-automatic process will be developed to match the existing custom topic categories to the IPTC standardised tagset; we will start from our method developed for Finnish (Pranjic et al., 2021). Next, we will work on two new challenges: (a) active learning (Bouguelia et al., 2018), allowing automatic improvement from expert feedback and the introduction of new categories (the IPTC tagset changes dynamically), and (b) multilinguality, using multilingual BERT representations and the integration of background knowledge (based on our method in Koloski et al., 2022b), allowing IPTC categorisation across languages.
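
The sketch below illustrates one plausible form of the semi-automatic matching step: embedding custom topic labels and IPTC media topic labels with a multilingual encoder and proposing the nearest IPTC category for expert review. The labels and model name are illustrative placeholders, not the actual inventories.

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

    custom_topics = ["electric cars", "bank mergers", "school reform"]    # hypothetical
    iptc_topics = ["automotive industry", "banking", "education policy"]  # hypothetical subset

    custom_emb = model.encode(custom_topics, convert_to_tensor=True)
    iptc_emb = model.encode(iptc_topics, convert_to_tensor=True)

    # Suggest the closest IPTC topic for each custom topic; an expert confirms.
    hits = util.semantic_search(custom_emb, iptc_emb, top_k=1)
    for topic, hit in zip(custom_topics, hits):
        print(topic, "->", iptc_topics[hit[0]["corpus_id"]], round(hit[0]["score"], 2))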

Task 1.3: Assignment of documents to customised categories. In this task, we will focus on the problem of categorising documents into a tagset comprising a client-defined set of about 20,000 tailored topics. The approach will rely on combining keyword extraction (from T1.1) with unsupervised learning to find documents similar to those belonging to a specific topic. To this end, we will leverage a variety of embedding representations (e.g., representations within the Sentence Transformers framework (Reimers and Gurevych, 2019)). For supervised methods, we will adapt methods for extreme multilabel settings (building upon the graph-based metrics in Dahiya et al. (2021)) to allow assigning documents into such an extremely large set of categories (for IPTC tags as well as tailored topics).
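
As a much-simplified stand-in for the cited extreme multilabel methods, the sketch below propagates topics from the most similar already-labelled documents to a new document; the documents, topic sets, and model name are illustrative assumptions.

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

    # Hypothetical labelled documents and their topic sets.
    labelled_docs = ["Central bank raises interest rates.",
                     "New stadium opens for the football season."]
    labels = [{"monetary policy"}, {"sport", "infrastructure"}]

    doc_embs = model.encode(labelled_docs, normalize_embeddings=True)

    def suggest_topics(new_doc: str, k: int = 1) -> set[str]:
        emb = model.encode([new_doc], normalize_embeddings=True)
        sims = doc_embs @ emb.T  # cosine similarity via normalised dot product
        nearest = np.argsort(-sims[:, 0])[:k]
        return set().union(*(labels[i] for i in nearest))

    print(suggest_topics("Interest rate decision expected on Thursday."))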

WP2: Sentiment analysis.

WP2 will develop methods for sentiment analysis, capable of automatically assessing the sentiment of a given document (e.g., news article or social media post) from the perspective of a particular client and/or regarding a particular target (e.g., company, product, individual) (Task 2.1). We will then advance these methods to be capable of open-domain target analysis (assessing sentiment towards any given target, not necessarily those known or present in the data when the models were created; Task 2.2) and of tracking sentiment changes over time (Task 2.3). To improve robustness and obtain reliability scores, relevant approaches will be adapted with our Bayesian averaging approach for large pretrained language models (LPLMs) (Miok et al., 2022).
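
As a simplified illustration of the general mechanism behind such Bayesian averaging (a minimal sketch, assuming a HuggingFace-style classifier; not the exact method of Miok et al.), Monte Carlo dropout keeps dropout active at inference time and averages several stochastic forward passes, yielding both a mean prediction and a spread usable as a reliability score.

    import torch

    def mc_dropout_predict(model, inputs, n_samples: int = 20):
        """Average stochastic forward passes of a HuggingFace-style classifier."""
        model.train()  # keep dropout layers active at inference time
        with torch.no_grad():
            probs = torch.stack([
                torch.softmax(model(**inputs).logits, dim=-1)
                for _ in range(n_samples)
            ])
        model.eval()
        return probs.mean(dim=0), probs.std(dim=0)  # prediction and reliability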

Task 2.1: Document-level target-based sentiment analysis. We will build on our previous work in document-level sentiment analysis, treating it as a text classification problem (Pelicon et al., 2020). We will extend this to document collections labelled with sentiment towards a given target or client-specific concern (training a classifier for each client/target, using documents labelled for that client/target only). Besides public sentiment datasets, we will use Kliping’s existing target-based sentiment annotations.
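
A minimal sketch of the per-target setup follows: fine-tuning one pretrained encoder per client/target on documents labelled for that target. The model name, label scheme, and training data are placeholders, not the project's actual configuration.

    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)
    from datasets import Dataset

    tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
    model = AutoModelForSequenceClassification.from_pretrained(
        "xlm-roberta-base", num_labels=3)  # negative / neutral / positive

    # Hypothetical target-based sentiment labels for one client/target.
    data = Dataset.from_dict({
        "text": ["The company beat expectations.", "Regulators fined the company."],
        "label": [2, 0],
    }).map(lambda x: tokenizer(x["text"], truncation=True,
                               padding="max_length", max_length=64), batched=True)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="target_sentiment", num_train_epochs=3),
        train_dataset=data,
    )
    trainer.train()  # one such model would be trained per client/target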

Task 2.2: Target-conditioned sentiment analysis. Although we expect the models developed in T2.1 to be accurate and robust, that approach has several disadvantages: it requires a separate model for each target, requires annotated data for each target, and can only be applied to targets known in advance. As the number of targets increases, this may entail prohibitively high computational resources, data annotation costs, and time requirements. A more ambitious approach is to develop generalised models that can be conditioned with target-specific information, thus giving different target-relative outputs as required while needing only one model (see e.g. Xue and Li, 2018). Initial work will use the same known set of targets as in T2.1 but will extract target-specific information using explanation techniques such as our TransSHAP (Kokalj et al., 2021). We will then investigate an open-domain approach that adapts LPLMs with target representations, allowing extension to any target even without explicitly annotated target-specific training data.
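
One plausible conditioning scheme, sketched below, encodes the target and the document as a sentence pair so that a single model can produce target-relative outputs; the exact input format used in the project may differ, and the target name is a hypothetical example.

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

    def encode_target_and_doc(target: str, doc: str):
        # Sentence-pair encoding: the tokenizer inserts separator tokens itself,
        # so the classifier sees the target alongside the document text.
        return tokenizer(target, doc, truncation=True, return_tensors="pt")

    batch = encode_target_and_doc("Acme d.o.o.", "Acme's new plant drew protests ...")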

Task 2.3: Sentiment-based social responsibility index tracking over time. We will apply relevant models from T2.1 and T2.2 to track changes in target-based sentiment over time. We will test this on an analysis task of special interest to Kliping: tracking changes in sentiment towards environmental, social and governance (ESG) issues (e.g., air pollution, carbon footprint, employee well-being), which has seen significant shifts in recent years. ESG categories exist in the IPTC hierarchy and can therefore be defined either via the standard IPTC terminology or by using T1.2’s models for those categories. We will train specific versions of T2.2’s target-based models for these categories, and apply them over time-specific dataset segments to produce a time-dependent ESG sentiment “index” for any given company.
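
The index construction itself can be as simple as aggregating per-document sentiment scores over time windows, as in the illustrative sketch below; the scores, dates, and score range are fabricated placeholders for the shape of the data.

    import pandas as pd

    # Fabricated per-document ESG sentiment scores for one company.
    df = pd.DataFrame({
        "date": pd.to_datetime(["2023-01-10", "2023-02-03", "2023-04-21"]),
        "esg_sentiment": [0.1, -0.4, 0.6],  # hypothetical scores in [-1, 1]
    })

    # One index value per quarter for the given company.
    index = df.groupby(df["date"].dt.to_period("Q"))["esg_sentiment"].mean()
    print(index)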

WP3: Longitudinal media monitoring.

In WP3, we will develop methods for longitudinal monitoring, combining methods for diachronic semantic analysis with the sentiment analysis from WP2, adapting them to a multilingual setting, and enabling analysis from the perspective of a specific target. For example, we will examine how reporting on science changes over time, both in content and in sentiment.

Task 3.1: Monolingual longitudinal media monitoring. We will build upon our initial method for diachronic analysis of news discourse using time-dependent contextual embeddings and clustering (Montariol et al., 2021). This method can identify differentiating semantic associations for a given word across time, as well as the most changing words for a specific topic; for example, it can show how news on science changes over a period of several years. The main novelty will be twofold. First, instead of clustering word usages into a predefined number of clusters as in the current method, the number of clusters will be determined automatically, by merging clusters based on distance and by testing clustering algorithms that assign the number of clusters automatically (see the sketch below). Second, as named entities (NEs) are very important in media discourse, we will investigate how NE linking methods (e.g., mapping Albert Einstein and Einstein to the same entity) and treating NE embeddings separately can improve the current approach, which does not distinguish between NE and other embeddings. Next, we will combine the semantic change information with sentiment information from WP2 and allow for joint exploration of semantics and sentiment across time. Each semantic cluster can contain various sentiment labels, and the analysis will show the sentiment connotation of each usage across time.
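
As one candidate for the automatic choice of cluster number, the sketch below uses Affinity Propagation, which selects the number of clusters itself; the embeddings are synthetic stand-ins for contextual word-usage vectors, and the algorithm choice is an assumption, not the project's final decision.

    from sklearn.cluster import AffinityPropagation
    from sklearn.datasets import make_blobs

    # Synthetic stand-in for contextual usage embeddings of one word.
    usage_embeddings, _ = make_blobs(n_samples=60, n_features=16,
                                     centers=4, random_state=0)

    # Affinity Propagation chooses the number of clusters itself.
    clustering = AffinityPropagation(random_state=0).fit(usage_embeddings)
    print(len(clustering.cluster_centers_indices_), "usage clusters found")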

Task 3.2: Multilingual longitudinal media analysis. Given Kliping's multilingual corpus, we will extend the existing method (Montariol et al., 2021) so that it allows for multilingual longitudinal news analysis. First, we will test a direct approach with multilingual LPLMs that contain a joint representation of languages and could in principle allow semantic clustering of word usages for each time period across languages. For this approach, we will adapt cluster interpretation methods based on keywords with cross-lingual mapping of keywords; for joint semantic and sentiment-based analysis across time, we will apply the multilingual sentiment models developed in WP2. The second (backup) approach will rely on machine translation and tracing changes in translated documents.
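
A minimal sketch of the direct approach, assuming an off-the-shelf multilingual sentence encoder (the model name and sentences are illustrative): usages from different languages share one embedding space and can therefore be clustered jointly per time period.

    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans

    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

    usages = [
        "Znanost je objavila nova odkritja.",  # Slovene: science announced new discoveries
        "Science announced new discoveries.",
        "The match ended in a draw.",
    ]
    embeddings = model.encode(usages)

    # Usages of the same sense cluster together regardless of language.
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
    print(labels)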

WP4: Document summarisation.

In WP4, we will develop methods for summary generation, capable not only of summarising single documents, but also of producing summaries for multiple documents, including news articles and social media posts, by combining extractive and abstractive summarisation methods. We will develop tools for a typical Kliping daily use case, in which several (e.g., 5–10) documents covering the same content, important for different target clients, need to be summarised. Client-specific requirements will be taken into account during generation, allowing different summaries of the same set of documents so that clients with specific interests receive targeted summaries.

Task 4.1: Identification of stories. The set of articles for each client will be clustered into groups of articles covering the same stories. For this, we will use methods for event detection, article clustering, and topic modelling, such as BERTopic (Grootendorst, 2022). As entities and keywords are very important for this task, we will aim to integrate this information from WP1 into the methods used.

Task 4.2: General summarisation. For each cluster of articles obtained in T4.1, we will generate a summary using extractive summarisation, i.e. by training models to select the most relevant sentences from a set of clustered articles. We will also treat summarisation as a text generation task and train variants of T5, BART, and GPT-3 models as abstractive summarisers. Besides publicly available summarisation data, we will use Kliping's existing manually written summaries for training these models. The developed approaches will be cross-lingual (using training data in multiple languages) and multilingual (generating summaries in languages with sufficient LPLM and data support).

Task 4.3: Target-based summarisation. While T4.2 develops general summarisation tools, T4.3 will adapt them to specific targets (i.e. clients), a task largely unexplored in existing NLP research. The best approaches from T4.2 will be upgraded to be more client-specific by taking into account particular client requirements of different forms. Several client-specific summarisation methods will be developed: a) simple extraction of sentences (or chunks of text) containing a target entity (see the sketch below); b) training/fine-tuning of state-of-the-art neural extractive summarisation models on the pre-existing client-specific summaries; c) training/fine-tuning of state-of-the-art neural abstractive summarisation models on the pre-existing client-specific summaries; and d) a combination of the above methods to obtain concise and accurate summaries adapted for each specific client.
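
Method (a) can be sketched in a few lines: keep only the sentences mentioning the target entity or a known alias. In practice, aliases would come from the NE linking work in WP1/T3.1; the text and alias list below are illustrative.

    import re

    def target_sentences(text: str, aliases: list[str]) -> list[str]:
        # Naive sentence split; a production system would use a proper tokenizer.
        sentences = re.split(r"(?<=[.!?])\s+", text)
        pattern = re.compile("|".join(map(re.escape, aliases)), re.IGNORECASE)
        return [s for s in sentences if pattern.search(s)]

    doc = "Acme opened a new plant. Unrelated news followed. Acme's shares rose."
    print(target_sentences(doc, ["Acme", "Acme d.o.o."]))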


Project duration: from 1 October 2023 to 30 September 2026

Funding: This work was supported by the Slovenian Research and Innovation Agency research project Embeddings-based techniques for Media Monitoring Applications (L2-50070), co-funded by Kliping d.o.o.
