KDIR 2024 Abstracts


Full Papers
Paper Nr: 26
Title:

Learning to Rank for Query Auto-Complete with Language Modelling in Enterprise Search

Authors:

Colin Daly and Lucy Hederman

Abstract: Query Auto-Completion (QAC) is of particular importance to the field of Enterprise Search, where query suggestions can steer searchers to use the appropriate organisational jargon/terminology and avoid submitting queries that produce no results. The order in which QAC candidates are presented to users (for a given prefix) can be influenced by signals such as how often the prefix appears in the corpus, the most popular completions, the most frequently submitted queries, a document's anchor text and other fields, or which queries are currently trending in the organisation. We measure the individual contribution of each of these heuristic signals and supplement them with a feature based on a Large Language Model (LLM) to detect jargon/terminology. We use Learning To Rank (LTR) to combine the weighted features to create a QAC ranking model for a live Enterprise Search service. In an online A/B test over a 12-week period processing 100,000 queries, our results show that adding our LLM-based jargon/terminology detection feature to the heuristic LTR model increases the Mean Reciprocal Rank score by 3.8%.
Download
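The ranking step this abstract describes combines weighted heuristic signals into a single score per completion. As a rough sketch (the feature names, weights, and candidates below are invented for illustration; in the paper the weights are learned with LTR from real signals), a linear scoring model might look like:

```python
def rank_candidates(candidates, weights):
    """Rank query auto-complete candidates by a weighted sum of
    heuristic feature scores (a stand-in for a learned LTR model)."""
    def score(cand):
        return sum(weights[f] * cand["features"].get(f, 0.0) for f in weights)
    return sorted(candidates, key=score, reverse=True)

# Hypothetical candidates for the prefix "ann"; each carries
# pre-computed heuristic feature values in [0, 1].
candidates = [
    {"completion": "annual report",   "features": {"corpus_freq": 0.9, "query_freq": 0.4, "trending": 0.1}},
    {"completion": "annual leave",    "features": {"corpus_freq": 0.5, "query_freq": 0.9, "trending": 0.6}},
    {"completion": "annotation tool", "features": {"corpus_freq": 0.2, "query_freq": 0.1, "trending": 0.0}},
]
# Weights that an LTR method would normally learn from click data.
weights = {"corpus_freq": 0.3, "query_freq": 0.5, "trending": 0.2}

ranked = rank_candidates(candidates, weights)
```

In the paper, the LLM-based jargon/terminology feature would simply appear as one more entry in each candidate's feature map.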

Paper Nr: 28
Title:

Machine Learning Unravels Sex-Specific Biomarkers for Atopic Dermatitis

Authors:

Ana Duarte and Orlando Belo

Abstract: The prevalence of atopic dermatitis is significantly higher in women than in men. Understanding the differences in the manifestation of the disease between males and females can contribute to more tailored and effective treatments. Our goal in this paper was to discover sex-specific biomarkers that can be used to differentiate between lesional and non-lesional skin in atopic dermatitis patients. Using transcriptomic datasets, we first identified the genes with the highest expression difference. Subsequently, several feature selection methods and machine learning models were employed to select the most relevant genes and identify potential candidates for sex-specific biomarkers. Based on backward feature elimination, we obtained a male-specific signature with 11 genes and a female-specific signature with 10 genes. Both candidate signatures were properly evaluated by an ensemble classifier using an independent test set. The obtained AUC and accuracy values for the male signature were 0.839 and 0.7222, respectively, and 0.65 and 0.6667 for the female signature. Finally, we tested the male signature on female data and the female signature on male data. As expected, the analysed metrics decreased considerably in these scenarios. These results suggest that we have identified two promising sex-specific gene signatures, and support the hypothesis that sex affects the ability to distinguish lesions in patients with eczema.
Download
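The backward feature elimination used in this abstract can be sketched with a generic greedy loop. The scoring function and gene names below are toy stand-ins for the authors' ensemble classifier and transcriptomic features, not their actual code:

```python
def backward_elimination(features, score_fn, min_features=1):
    """Greedy backward feature elimination: repeatedly drop the feature
    whose removal hurts the score least, while the score does not drop."""
    current = list(features)
    best = score_fn(current)
    while len(current) > min_features:
        trials = [(score_fn([f for f in current if f != g]), g) for g in current]
        trial_score, worst = max(trials)
        if trial_score < best:
            break  # every removal degrades the score; stop
        best, current = trial_score, [f for f in current if f != worst]
    return current, best

# Toy stand-in for "AUC of the ensemble on a gene subset": two genes
# carry all the signal, and each extra gene adds a small penalty.
def toy_score(subset):
    signal = {"GENE_A", "GENE_B"}
    hits = len(signal & set(subset))
    return hits - 0.01 * (len(subset) - hits)

selected, score = backward_elimination(["GENE_A", "GENE_B", "GENE_C", "GENE_D"], toy_score)
```

In practice `score_fn` would refit and cross-validate a classifier on each candidate subset, which is what makes backward elimination expensive.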

Paper Nr: 48
Title:

MLN-Subdue: Substructure Discovery In Homogeneous Multilayer Networks

Authors:

Anish Rai, Anamitra Roy, Abhishek Santra and Sharma Chakravarthy

Abstract: Substructure discovery is a well-researched problem for graphs (both simple and attributed) for knowledge discovery. Recently, multilayer networks (or MLNs) have been shown to be better suited for modeling complex datasets that have multiple entity and relationship types. However, the MLN representation brings new challenges in finding substructures due to the presence of layers, and substructure discovery methods for MLNs are currently not available. This paper proposes a substructure discovery algorithm for homogeneous MLNs (HoMLNs) using the decoupling approach. In HoMLNs, each layer has the same set of nodes (or a common subset) but different intralayer connectivity. This algorithm has been implemented using the Map/Reduce framework to handle arbitrarily large layers and to improve the response time through distributed and parallel processing. In the decoupled approach, each layer is processed independently (without using any information from other layers) and in parallel, and the substructures generated from each layer are combined after each iteration to generate substructures that span layers. The focus is on the correctness of the algorithm and resource utilization based on the number of layers. The proposed algorithm is validated through extensive experimental analysis on large real-world and synthetic graphs with diverse graph characteristics.
Download

Paper Nr: 78
Title:

A Knowledge Map Mining-Based Personalized Learning Path Recommendation Solution for English Learning

Authors:

Duong T. Nguyen and Thu T. Nguyen

Abstract: Recommendation systems (RS) have been widely utilized across various fields, particularly in education, where smart e-learning systems recommend personalized learning paths (PLP) based on the characteristics of learners and learning resources. Despite efforts to provide highly personalized recommendations, challenges such as data sparsity and cold-start issues persist. Recently, knowledge graph (KG)-based RS development has garnered significant interest. KGs can leverage the properties of users and items within a unified graph structure, utilizing semantic relationships among entities to address these challenges and offer more relevant recommendations than traditional methods. In this paper, we propose a KG-based PLP recommendation solution to support English learning by generating a sequence of lessons designed to guide learners effectively from their current English level to their target level. We built a domain KG architecture specifically for studying English certification exams, incorporating key concept classes and their relationships. We then researched and applied graph data mining algorithms (GAs) to create an effective PLP recommendation solution. Using consistent experimental conditions and a selected set of weights, along with our collected dataset, we evaluated our solution based on criteria such as accuracy, efficiency, stability, and execution time.
Download

Paper Nr: 88
Title:

Positive-Unlabeled Learning Using Pairwise Similarity and Parametric Minimum Cuts

Authors:

Torpong Nitayanont and Dorit S. Hochbaum

Abstract: Positive-unlabeled (PU) learning is a binary classification problem where the labeled set contains only positive class samples. Most PU learning methods involve using a prior π on the true fraction of positive samples. We propose here a method based on Hochbaum’s Normalized Cut (HNC), a network flow-based method, that partitions samples, both labeled and unlabeled, into two sets to achieve high intra-similarity and low inter-similarity, with a tradeoff parameter to balance these two goals. HNC is solved, for all tradeoff values, as a parametric minimum cut problem on an associated graph producing multiple optimal partitions, which are nested for increasing tradeoff values. Our PU learning method, called 2-HNC, runs in two stages. Stage 1 identifies optimal data partitions for all tradeoff values, using only positive labeled samples. Stage 2 first ranks unlabeled samples by their likelihood of being negative, according to the sequential order of partitions from stage 1, and then uses the likely-negative samples along with the positive samples to run HNC. Among all generated partitions in both stages, the partition whose positive fraction is closest to the prior π is selected. An experimental study demonstrates that 2-HNC is highly competitive compared to state-of-the-art methods.
Download
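The final selection step in this abstract — picking the partition whose positive fraction is closest to the prior π — is simple enough to sketch directly. The nested partitions and prior below are invented for illustration:

```python
def select_partition(partitions, n_samples, prior):
    """Among nested candidate partitions (each given as the set of
    samples on the 'positive' side), pick the one whose positive
    fraction is closest to the class prior π."""
    return min(partitions, key=lambda p: abs(len(p) / n_samples - prior))

# Hypothetical nested partitions produced by the parametric cut over
# 10 samples (indices 0..9), one per breakpoint of the tradeoff value.
parts = [set(range(2)), set(range(4)), set(range(7)), set(range(9))]
chosen = select_partition(parts, n_samples=10, prior=0.35)
```

Here the candidate fractions are 0.2, 0.4, 0.7, and 0.9, so the partition with fraction 0.4 is nearest to the prior of 0.35.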

Paper Nr: 108
Title:

Efficient Visualization of Association Rule Mining Using the Trie of Rules

Authors:

Mikhail Kudriavtsev, Andrew McCarren, Hyowon Lee and Marija Bezbradica

Abstract: Association Rule Mining (ARM) is a popular technique in data mining and machine learning for uncovering meaningful relationships within large datasets. However, the extensive number of generated rules presents significant challenges for interpretation and visualization. Effective visualization must not only be clear and informative but also efficient and easy to learn. Existing visualization methods often fall short in these areas. In response, we propose a novel visualization technique called the "Trie of Rules". This method adapts the Frequent Pattern Tree (FP-tree) structure to visualize association rules efficiently, capturing extensive information while maintaining clarity. Our approach reveals hidden insights such as clusters and substitute items, and introduces a unique feature for calculating confidence in rules with compound consequents directly from the graph structure. We conducted a comprehensive evaluation using a survey where we measured cognitive load to calculate the efficiency and learnability of our methodology. The results indicate that our method significantly enhances the interpretability and usability of ARM visualizations.
Download
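One way to read the abstract's confidence computation for compound consequents is via support counts stored along trie paths. The sketch below (invented transactions, not the authors' implementation) builds an FP-tree-like trie and derives conf(A → C) from path supports. Note an important caveat: support lookups on this simple trie are exact only when the queried itemset forms a contiguous sorted prefix of the containing transactions, a limitation the authors' full method presumably does not share:

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.support = 0

def build_trie(transactions):
    """Insert each transaction (in a canonical sorted item order) into
    a trie, accumulating support counts along the path, FP-tree style."""
    root = TrieNode()
    for t in transactions:
        node = root
        for item in sorted(t):
            node = node.children.setdefault(item, TrieNode())
            node.support += 1
    return root

def support(root, itemset):
    """Support read off the trie path for a (sorted-prefix) itemset."""
    node = root
    for item in sorted(itemset):
        if item not in node.children:
            return 0
        node = node.children[item]
    return node.support

def confidence(root, antecedent, consequent):
    """conf(A -> C) = support(A ∪ C) / support(A), read off the trie."""
    return support(root, antecedent | consequent) / support(root, antecedent)

tx = [{"bread", "butter", "milk"}, {"bread", "butter"}, {"bread", "milk"}, {"bread"}]
root = build_trie(tx)
conf = confidence(root, {"bread"}, {"butter", "milk"})  # compound consequent
```

With these four transactions, conf(bread → butter ∪ milk) comes out to 1/4, matching the value computed directly from the data.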

Paper Nr: 114
Title:

Predicting Post Myocardial Infarction Complication: A Study Using Dual-Modality and Imbalanced Flow Cytometry Data

Authors:

Nada ALdausari, Frans Coenen, Anh Nguyen and Eduard Shantsila

Abstract: Previous research indicated that white blood cell counts and phenotypes can predict complications after Myocardial Infarction (MI). However, progress is hindered by the need to consider complex interactions among different cell types and their characteristics and manual adjustments of flow cytometry data. This study aims to improve MI complication prediction by applying deep learning techniques to white blood cell test data obtained via flow cytometry. Using data from a cohort study of 246 patients with acute MI, we focused on Major Adverse Cardiovascular Events as the primary outcome. Flow cytometry data, available in tabular and image formats, underwent data normalisation and class imbalance adjustments. We built two classification models: a neural network for tabular data and a convolutional neural network for image data. Combining outputs from these models using a voting mechanism enhanced the detection of post-MI complications, improving the average F1 score to 51 compared to individual models. These findings demonstrate the potential of integrating diverse data handling and analytical methods to advance medical diagnostics and patient care.
Download
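The abstract does not specify the exact voting mechanism used to combine the tabular and image models. Weighted soft voting over the two models' class probabilities, sketched below with made-up numbers, is one common choice:

```python
def soft_vote(prob_tabular, prob_image, weight=0.5):
    """Combine class probabilities from a tabular model and an image
    model by weighted averaging, then pick the arg-max class."""
    combined = [weight * p + (1 - weight) * q
                for p, q in zip(prob_tabular, prob_image)]
    return max(range(len(combined)), key=combined.__getitem__), combined

# Hypothetical outputs for one patient: [P(no complication), P(MACE)].
cls, probs = soft_vote([0.60, 0.40], [0.30, 0.70], weight=0.4)
```

A hard (majority) vote is the other common option, but with only two models it needs a tie-breaking rule, which is one reason probability averaging is often preferred for dual-model ensembles.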

Paper Nr: 116
Title:

GenCrawl: A Generative Multimedia Focused Crawler for Web Pages Classification

Authors:

Domenico Benfenati, Antonio M. Rinaldi, Cristiano Russo and Cristian Tommasino

Abstract: The unprecedented expansion of the internet necessitates the development of increasingly efficient techniques for systematic data categorization and organization. However, contemporary state-of-the-art techniques often struggle with the complex nature of heterogeneous multimedia content within web pages. These challenges, which are becoming more pressing with the rapid growth of the internet, highlight the urgent need for advancements in information retrieval methods to improve classification accuracy and relevance in the context of varied and dynamic web content. In this work, we propose GenCrawl, a generative multimedia-focused crawler designed to enhance web document classification by integrating textual and visual content analysis. Our approach combines the most relevant topics extracted from textual and visual content, using innovative generative techniques to create a visual topic. The reported findings demonstrate significant improvements and a paradigm shift in classification efficiency and accuracy over traditional methods. GenCrawl represents a substantial advancement in web page classification, offering a promising solution for systematically organizing web content. Its practical benefits are immense, paving the way for more efficient and accurate information retrieval in the era of the expanding internet.
Download

Paper Nr: 117
Title:

Antibiotic Resistance Gene Identification from Metagenomic Data Using Ensemble of Finetuned Large Language Models

Authors:

Syama K. and J. A. Jothi

Abstract: Antibiotic resistance is a potential challenge to global health. It limits the effect of antibiotics on humans. Antibiotic resistance genes (ARGs) are primarily associated with acquired resistance, where bacteria gain resistance through horizontal gene transfer or mutation. Hence, the identification of ARGs is essential for the treatment of infections and understanding the resistance mechanism. Though there are several methods for ARG identification, the majority of them are based on sequence alignment and hence fail to provide accurate results when the ARGs diverge from those in the reference ARG databases. Additionally, a significant fraction of proteins remain unaccounted for in public repositories. This work introduces a multi-task ensemble model, called ARG-LLM, that combines multiple large language models (LLMs) for ARG identification and antibiotic category prediction. We finetuned three pre-trained protein language models, ProtBert, ProtAlbert, and Evolutionary Scale Modelling (ESM), with the ARG prediction data. The predictions of the finetuned models are combined using a majority vote ensembling approach to identify the ARG sequences. Then, another ProtBert model is fine-tuned for the antibiotic category prediction task. Experiments are conducted to establish the superiority of the proposed ARG-LLM using the PLM-ARGDB dataset. Results demonstrate that ARG-LLM outperforms other state-of-the-art methods with the best Recall of 96.2%, F1-score of 94.4%, and MCC of 90%.
Download
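The majority-vote ensembling step can be sketched generically. The per-model predictions below are invented placeholders for the finetuned ProtBert, ProtAlbert, and ESM outputs:

```python
from collections import Counter

def majority_vote(predictions):
    """Majority vote over per-model binary predictions (1 = ARG,
    0 = non-ARG). With three models, at least two must agree."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical per-sequence outputs from the three finetuned models,
# one row per model, one column per protein sequence.
per_model = [[1, 0, 1],
             [1, 0, 0],
             [0, 1, 1]]
ensemble = [majority_vote([m[i] for m in per_model]) for i in range(3)]
```

With an odd number of models and binary labels, ties cannot occur, which is one reason three-model ensembles are a convenient choice here.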

Paper Nr: 128
Title:

Approaches for Extending Recommendation Models for Food Choices in Meals

Authors:

Nguyen H. Nhung, Dao K. Nguyen, Tiet G. Hong and Thi H. Vu

Abstract: In this paper, we propose food recommender systems based on users' historical food choices. Their advantage lies in providing personalized food suggestions for each user considering each meal. These systems are developed using two popular recommendation principles: neighbor-based and latent factor-based. In the neighbor-based model, the system aggregates the food choices of neighboring users to recommend food choices for the active user during the considered meal. In contrast, the latent factor-based model constructs and optimizes an objective function to learn positive representations of users, foods, and meals. In this new space, predicting users' food choices during meals becomes straightforward. Experimental results have demonstrated the effectiveness of the proposed models in specific cases. However, in a global statistical comparison, the latent factor-based model has proven to be more effective than the neighbor-based model.
Download

Paper Nr: 147
Title:

Personalization of Dataset Retrieval Results Using a Data Valuation Method

Authors:

Malick Ebiele, Malika Bendechache, Eamonn Clinton and Rob Brennan

Abstract: In this paper, we propose a data valuation method that is used for Dataset Retrieval (DR) results re-ranking. Dataset retrieval is a specialization of Information Retrieval (IR) where instead of retrieving relevant documents, the information retrieval system returns a list of relevant datasets. To the best of our knowledge, data valuation has not yet been applied to dataset retrieval. By leveraging metadata and users’ preferences, we estimate the personal value of each dataset to facilitate dataset ranking and filtering. With two real users (stakeholders) and four simulated users (users’ preferences generated using a uniform weight distribution), we studied the user satisfaction rate. We define users’ satisfaction rate as the probability that users find the datasets they seek in the top k = {5,10} of the retrieval results. Previous studies of fairness in rankings (position bias) have shown that the probability or the exposure rate of a document drops exponentially from the top 1 to the top 10, from 100% to about 20%. Therefore, we calculated the Jaccard score@5 and Jaccard score@10 between our approach and other re-ranking options. It was found that there is a 42.24% and a 56.52% chance on average that users will find the dataset they are seeking in the top 5 and top 10, respectively. The lowest chance is 0% for the top 5 and 33.33% for the top 10; while the highest chance is 100% in both cases. The dataset used in our experiments is a real-world dataset and the result of a query sent to a national mapping agency's data catalog. In the future, we are planning to extend the experiments performed in this paper to publicly available data catalogs.
Download
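The Jaccard score@k used in this evaluation compares the top-k results of two rankings. A minimal sketch with invented dataset IDs:

```python
def jaccard_at_k(ranking_a, ranking_b, k):
    """Jaccard score@k: overlap between the top-k results of two
    rankings, |A ∩ B| / |A ∪ B| on the truncated lists."""
    a, b = set(ranking_a[:k]), set(ranking_b[:k])
    return len(a & b) / len(a | b)

# Hypothetical dataset IDs returned by a baseline retrieval order
# and by a data-valuation re-ranking of the same results.
baseline  = ["d1", "d2", "d3", "d4", "d5", "d6"]
re_ranked = ["d2", "d1", "d7", "d3", "d8", "d4"]
j5 = jaccard_at_k(baseline, re_ranked, 5)
```

Because it compares sets, the score is insensitive to ordering within the top k; it measures which items survive the re-ranking, not where they land.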

Paper Nr: 161
Title:

A Systematic Literature Review on LLM-Based Information Retrieval: The Issue of Contents Classification

Authors:

Diogo Cosme, António Galvão and Fernando Brito E Abreu

Abstract: This paper conducts a systematic literature review on applying Large Language Models (LLMs) in information retrieval, specifically focusing on content classification. The review explores how LLMs, particularly those based on transformer architectures, have addressed long-standing challenges in text classification by leveraging their advanced context understanding and generative capabilities. Despite the rapid advancements, the review identifies gaps in current research, such as the need for improved transparency, reduced computational costs, and the handling of model hallucinations. The paper concludes with recommendations for future research directions to optimize the use of LLMs in content classification, ensuring their effective deployment across various domains.
Download

Paper Nr: 169
Title:

Enhancing Answer Attribution for Faithful Text Generation with Large Language Models

Authors:

Juraj Vladika, Luca Mülln and Florian Matthes

Abstract: The increasing popularity of Large Language Models (LLMs) in recent years has changed the way users interact with and pose questions to AI-based conversational systems. An essential aspect for increasing the trustworthiness of generated LLM answers is the ability to trace the individual claims from responses back to relevant sources that support them, the process known as answer attribution. While recent work has started exploring the task of answer attribution in LLMs, some challenges still remain. In this work, we first perform a case study analyzing the effectiveness of existing answer attribution methods, with a focus on subtasks of answer segmentation and evidence retrieval. Based on the observed shortcomings, we propose new methods for producing more independent and contextualized claims for better retrieval and attribution. The new methods are evaluated and shown to improve the performance of answer attribution components. We end with a discussion and outline of future directions for the task.
Download

Paper Nr: 174
Title:

MERGE App: A Prototype Software for Multi-User Emotion-Aware Music Management

Authors:

Pedro L. Louro, Guilherme Branco, Hugo Redinho, Ricardo Correia, Ricardo Malheiro, Renato Panda and Rui P. Paiva

Abstract: We present a prototype software for multi-user music library management using the perceived emotional content of songs. The tool offers music playback features, song filtering by metadata, and automatic emotion prediction based on arousal and valence, with the possibility of personalizing the predictions by allowing each user to edit these values based on their own emotion assessment. This is an important feature for handling both classification errors and subjectivity issues, which are inherent aspects of emotion perception. A path-based playlist generation function is also implemented. A multi-modal audio-lyrics regression methodology is proposed for emotion prediction, with accompanying validation experiments on the MERGE dataset. The results obtained are promising, showing higher overall performance on train-validate-test splits (73.20% F1-score with the best dataset/split combination).
Download

Paper Nr: 180
Title:

Contrato360 2.0: A Document and Database-Driven Question-Answer System Using Large Language Models and Agents

Authors:

Antony Seabra, Claudio Cavalcante, João Nepomuceno, Lucas Lago, Nicolaas Ruberg and Sergio Lifschitz

Abstract: We present a question-and-answer (Q&A) application designed to support the contract management process by leveraging combined information from contract documents (PDFs) and data retrieved from contract management systems (database). This data is processed by a large language model (LLM) to provide precise and relevant answers. The accuracy of these responses is further enhanced through the use of Retrieval-Augmented Generation (RAG), text-to-SQL techniques, and agents that dynamically orchestrate the workflow. These techniques eliminate the need to retrain the language model. Additionally, we employed Prompt Engineering to fine-tune the focus of responses. Our findings demonstrate that this multi-agent orchestration and combination of techniques significantly improve the relevance and accuracy of the answers, offering a promising direction for future information systems.
Download

Paper Nr: 190
Title:

Reviewing Machine Learning Techniques in Credit Card Fraud Detection

Authors:

Ibtissam Medarhri, Mohamed Hosni, Mohamed Ettalhaoui, Zakaria Belhaj and Rabie Zine

Abstract: The growing use of credit cards for transactions has increased the risk of fraud, as fraudsters frequently attempt to exploit these transactions. Consequently, credit card companies need decision support systems that can automatically detect and manage fraudulent activities without human intervention, given the vast volume of daily transactions. Machine learning techniques have emerged as a powerful solution to address these challenges. This paper provides a comprehensive overview of the knowledge domain related to the application of machine learning techniques in combating credit card fraud. To achieve this, a review of published work in academic journals from 2018 to 2023 was conducted, encompassing 131 papers. The review classifies the studies based on eight key aspects: publication trends and venues, machine learning approaches and techniques, datasets, evaluation frameworks, balancing techniques, hyperparameter optimization, and tools used. The main findings reveal that the selected studies were published across various journal venues, employing both single and ensemble machine learning approaches. Decision trees were identified as the most frequently used technique. The studies utilized multiple datasets to build models for detecting credit card fraud and explored various preprocessing steps, including feature engineering (such as feature extraction, construction, and selection) and data balancing techniques. Python and its associated libraries were the most commonly used tools for implementing these models.
Download

Short Papers
Paper Nr: 17
Title:

An Index Bucketing Framework to Support Data Manipulation and Extraction of Nested Data Structures

Authors:

Jeffrey Myers II and Yaser Mowafi

Abstract: Handling nested data collections in large-scale distributed data structures poses considerable challenges in query processing, often resulting in substantial costs and error susceptibility. These challenges are exacerbated in scenarios involving skewed, nested data with irregular inner data collections. Processing such data demands costly operations, leading to extensive data duplication and imposing challenges in ensuring balanced distribution across partitions, consequently impeding performance and scalability. This work introduces an index bucketing framework that amalgamates upfront computations with data manipulation techniques, specifically focusing on flattening procedures. The framework draws on principles from the bucket spreading strategy, a parallel hash join method that aims to mitigate adverse implications of data duplication and information loss, while effectively addressing both skewed and irregularly nested structures. The efficacy of the proposed framework is assessed through evaluations conducted on prominent question-answering datasets such as QuAC and NewsQA, comparing its performance against the Pandas Python API and recursive, iterative flattening implementations.
Download

Paper Nr: 27
Title:

Comparative Analysis of Topic Modelling Approaches on Student Feedback

Authors:

Faiz Hayat, Safwan Shatnawi and Ella Haig

Abstract: Topic modelling, a type of clustering for textual data, is a popular method to extract themes from text. Methods such as Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA) and Non-negative Matrix Factorization (NMF) have been successfully used across a wide range of applications. Large Language Models, such as BERT, have led to significant improvements in machine learning tasks for textual data in general, as well as topic modelling, in particular. In this paper, we compare the performance of a BERT-based topic modelling approach with LDA, LSA and NMF on textual feedback from students about their mental health and remote learning experience during the COVID-19 pandemic. While all methods lead to coherent and distinct topics, the BERT-based approach and NMF are able to identify more fine-grained topics. Moreover, while NMF resulted in more detailed topics about the students’ mental health-related experiences, the BERT-based approach produced more detailed topics about the students’ experiences with remote learning.
Download

Paper Nr: 39
Title:

Efficient Neural Network Training via Subset Pretraining

Authors:

Jan Spörer, Bernhard Bermeitinger, Tomas Hrycej, Niklas Limacher and Siegfried Handschuh

Abstract: In training neural networks, it is common practice to use partial gradients computed over batches, mostly very small subsets of the training set. This approach is motivated by the argument that such a partial gradient is close to the true one, with precision growing only with the square root of the batch size. A theoretical justification can be given with the help of stochastic approximation theory. However, the conditions for the validity of this theory are not satisfied in the usual learning rate schedules. Batch processing is also difficult to combine with efficient second-order optimization methods. This proposal is based on another hypothesis: the loss minimum of the training set can be expected to be well-approximated by the minima of its subsets. Such subset minima can be computed in a fraction of the time necessary for optimizing over the whole training set. This hypothesis has been tested with the help of the MNIST, CIFAR-10, and CIFAR-100 image classification benchmarks, optionally extended by training data augmentation. The experiments have confirmed that results equivalent to conventional training can be reached. In summary, even small subsets are representative if the overdetermination ratio for the given model parameter set sufficiently exceeds unity. The computing expense can be reduced to a tenth or less.
Download
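The overdetermination ratio invoked in the conclusion is, under one common definition, the number of fitted constraints (samples times output dimensions) divided by the number of free model parameters. The figures below are illustrative, not taken from the paper:

```python
def overdetermination_ratio(n_samples, n_outputs, n_params):
    """Ratio of independent fitting constraints (samples x output
    dimensions) to free model parameters. The abstract argues that
    subsets remain representative while this sufficiently exceeds 1."""
    return (n_samples * n_outputs) / n_params

# Hypothetical: a 10-class classifier with 20,000 parameters trained
# on a 6,000-sample subset of MNIST.
r = overdetermination_ratio(6_000, 10, 20_000)
```

A ratio of 3 here means the subset still imposes three times as many constraints as the model has parameters, so the subset minimum can plausibly approximate the full-set minimum.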

Paper Nr: 42
Title:

Multi-Label Classification for Fashion Data: Zero-Shot Classifiers via Few-Shot Learning on Large Language Models

Authors:

Dongming Jiang, Abhishek Shah, Stanley Yeung, Jessica Zhu, Karan Singh and George Goldenberg

Abstract: Multi-Label classification is essential in the fashion industry due to the complexity of fashion items, which often have multiple attributes such as style, material, and occasion. Traditional machine-learning approaches face challenges like data imbalance, high dimensionality, and the constant emergence of new styles and labels. To address these issues, we propose a novel approach that leverages Large Language Models (LLMs) by integrating few-shot and zero-shot learning. Our methodology utilizes LLMs to perform few-shot learning on a small, labeled dataset, generating precise descriptions of new fashion classes. These descriptions guide the zero-shot learning process, allowing for the classification of new items and categories with minimal labeled data. We demonstrate this approach using OpenAI’s GPT-4, a state-of-the-art LLM. Experiments on a dataset from CaaStle Inc., containing 2,480 unique styles with multiple labels, show significant improvements in classification performance. Few-shot learning enhances the quality of zero-shot classifiers, leading to superior results. GPT-4’s multi-modal capabilities further improve the system’s effectiveness. Our approach provides a scalable, flexible, and accurate solution for fashion classification, adapting to dynamic trends with minimal data requirements, thereby improving operational efficiency and customer experience. Additionally, this method is highly generalizable and can be applied beyond the fashion industry.
Download

Paper Nr: 43
Title:

Optimizing High-Dimensional Text Embeddings in Emotion Identification: A Sliding Window Approach

Authors:

Hande Aka Uymaz and Senem Kumova Metin

Abstract: Natural language processing (NLP) is an interdisciplinary field that enables machines to understand and generate human language. One of the crucial steps in several NLP tasks, such as emotion and sentiment analysis, text similarity, summarization, and classification, is transforming textual data sources into numerical form, a process called vectorization. This process can be grouped into traditional, semantic, and contextual vectorization methods. Despite their advantages, these high-dimensional vectors pose memory and computational challenges. To address these issues, we employed a sliding window technique to partition high-dimensional vectors, aiming not only to enhance computational efficiency but also to detect emotional information within specific vector dimensions. Our experiments utilized emotion lexicon words and emotionally labeled sentences in both English and Turkish. By systematically analyzing the vectors, we identified consistent patterns with emotional clues. Our findings suggest that focusing on specific sub-vectors rather than entire high-dimensional BERT vectors can capture emotional information effectively, without performance loss. With this approach, we observed an increase in pairwise cosine similarity scores within emotion categories when using only sub-vectors. The results highlight the potential of the use of sub-vector techniques, offering insights into the nuanced integration of emotions in language and the applicability of these methods across different languages.
Download
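The sliding-window partitioning of high-dimensional vectors can be sketched as follows. The toy 8-dimensional vectors stand in for 768-dimensional BERT embeddings, and the window width and step are invented, not the paper's settings:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def window_similarities(vec_a, vec_b, width, step):
    """Slide a fixed-width window over two embeddings in parallel and
    return the cosine similarity of each pair of sub-vectors."""
    sims = []
    for start in range(0, len(vec_a) - width + 1, step):
        sims.append(cosine(vec_a[start:start + width],
                           vec_b[start:start + width]))
    return sims

# Toy 8-dimensional "embeddings" of two words from the same emotion
# category; real inputs would be 768-d BERT vectors.
a = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0]
b = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0]
sims = window_similarities(a, b, width=4, step=4)
```

Comparing per-window similarities across many word pairs is what lets one ask whether particular dimension ranges carry more emotional signal than the full vector.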

Paper Nr: 52
Title:

Text-Based Feature-Free Automatic Algorithm Selection

Authors:

Amanda Salinas-Pinto, Bryan Alvarado-Ulloa, Dorit Hochbaum, Matías Francia-Carramiñana, Ricardo Ñanculef and Roberto Asín-Achá

Abstract: Automatic Algorithm Selection involves predicting which solver, among a portfolio, will perform best for a given problem instance. Traditionally, the design of algorithm selectors has relied on domain-specific features crafted by experts. However, an alternative approach involves designing selectors that do not depend on domain-specific features, but receive a raw representation of the problem’s instances and automatically learn the characteristics of that particular problem using Deep Learning techniques. Previously, such raw representation was a fixed-sized image, generated from the input text file specifying the instance, which was fed to a Convolutional Neural Network. Here we show that a better approach is to use text-based Deep Learning models that are fed directly with the input text files specifying the instances. Our approach improves on the image-based feature-free models by a significant margin and furthermore matches traditional Machine Learning models based on basic domain-specific features, known to be among the most informative features.

Paper Nr: 53
Title:

Flow Is Best, Fast and Scalable: The Incremental Parametric Cut for Maximum Density and Other Ratio Subgraph Problems

Authors:

Dorit S. Hochbaum

Abstract: The maximum density subgraph, or densest subgraph, problem has numerous applications in analyzing graph and community structures in social networks, DNA networks and financial networks. The densest subgraph problem has been the subject of study since the early 80s and polynomial time flow-based algorithms are known, yet research in the last couple of decades has been focused on developing heuristic methods for solving the problem claiming that flow computations are computationally prohibitive. We introduce here a new polynomial time algorithm, the incremental parametric cut algorithm (IPC), that solves the maximum density subgraph problem and many other max or min ratio problems in the complexity of a single minimum cut. A characterization of all these efficiently solvable ratio problems is given here as problems with monotone integer programming formulations. IPC is much more efficient than the parametric cut algorithm since instead of generating all breakpoints it explores only a tiny fraction of those breakpoints. Compared to the heuristic methods, IPC not only guarantees optimality but also runs orders of magnitude faster, as shown in an accompanying experimental study.
Download

Paper Nr: 63
Title:

Integrated Evaluation of Semantic Representation Learning, BERT, and Generative AI for Disease Name Estimation Based on Chief Complaints

Authors:

Ikuo Keshi, Ryota Daimon, Yutaka Takaoka and Atsushi Hayashi

Abstract: This study compared semantic representation learning + machine learning, BERT, and GPT-4 for estimating disease names from chief complaints, and evaluated their accuracy. Semantic representation learning + machine learning showed high accuracy for chief complaints of at least 10 characters on the middle categories of International Classification of Diseases, 10th Revision (ICD-10) codes, slightly surpassing BERT. For GPT-4, the Retrieval Augmented Generation (RAG) method achieved the best performance, with a Top-5 accuracy of 84.5% when all chief complaints, including the evaluation data, were used. Additionally, the latest GPT-4o model further improved the Top-5 accuracy to 90.0%. These results suggest the potential of these methods as diagnostic support tools. Future work aims to enhance disease name estimation through more extensive evaluations by experienced physicians.

Paper Nr: 85
Title:

Knowledge Discovery in Optical Music Recognition: Enhancing Information Retrieval with Instance Segmentation

Authors:

Elona Shatri and György Fazekas

Abstract: Optical Music Recognition (OMR) automates the transcription of musical notation from images into machine-readable formats like MusicXML, MEI, or MIDI, significantly reducing the costs and time of manual transcription. This study explores knowledge discovery in OMR by applying instance segmentation using Mask R-CNN to enhance the detection and delineation of musical symbols in sheet music. Unlike Optical Character Recognition (OCR), OMR must handle the intricate semantics of Common Western Music Notation (CWMN), where symbol meanings depend on shape, position, and context. Our approach leverages instance segmentation to manage the density and overlap of musical symbols, facilitating more precise information retrieval from music scores. Evaluations on the DoReMi and MUSCIMA++ datasets demonstrate substantial improvements, with our method achieving a mean Average Precision (mAP) of up to 59.70% in dense symbol environments, comparable to the results of object detection. Furthermore, using traditional computer vision techniques, we add a parallel step for staff detection to infer the pitch of the recognised symbols. This study emphasises the role of pixel-wise segmentation in advancing accurate music symbol recognition, contributing to knowledge discovery in OMR. Our findings indicate that instance segmentation provides more precise representations of musical symbols, particularly in densely populated scores, advancing OMR technology. We make our implementation, pre-processing scripts, trained models, and evaluation results publicly available to support further research and development.

Paper Nr: 91
Title:

Prompt Distillation for Emotion Analysis

Authors:

Andrew L. Mackey, Susan Gauch and Israel Cuevas

Abstract: Emotion Analysis (EA) is a field of study closely aligned with sentiment analysis whereby a discrete set of emotions is extracted from a given document. Existing methods of EA have traditionally explored both lexicon and machine learning techniques for this task. Recent advancements in large language models have achieved success in a wide range of tasks, including language, images, speech, and videos. In this work, we construct a model that applies knowledge distillation techniques to extract information from a large language model, which instructs a lightweight student model to improve its performance on the EA task. Specifically, the teacher model, which is much larger in terms of parameters and training inputs, performs an analysis of the document and shares this information with the student model to predict the target emotions for a given document. Experimental results demonstrate the efficacy of our proposed prompt-based knowledge distillation approach for EA.
Download

Paper Nr: 100
Title:

Enhancing LLMs with Knowledge Graphs for Academic Literature Retrieval

Authors:

Catarina Pires, Pedro G. Correia, Pedro Silva and Liliana Ferreira

Abstract: While Large Language Models have demonstrated significant advancements in Natural Language Generation, they frequently produce erroneous or nonsensical texts. This phenomenon, known as hallucination, raises concerns about the reliability of Large Language Models, particularly when users seek accurate information, such as in academic literature retrieval. This paper addresses the challenge of hallucination in Large Language Models by integrating them with Knowledge Graphs using prompt engineering. We introduce GPTscholar, an initial study designed to enhance Large Language Model responses in the field of computer science academic literature retrieval. The authors manually evaluated the quality of responses and the frequency of hallucinations on 40 prompts across 4 different use cases. We conclude that the approach is promising, as the system outperforms the results we obtained with gpt-3.5-turbo without Knowledge Graphs.
Download

Paper Nr: 101
Title:

An Explainable Classifier Using Diffusion Dynamics for Misinformation Detection on Twitter

Authors:

Arghya Kundu and Uyen T. Nguyen

Abstract: Misinformation, often spread via social media, can cause panic and social unrest, making its detection crucial. Automated detection models have emerged, using methods such as text mining, social media user properties, and propagation pattern analysis. However, most of these models do not effectively use the diffusion pattern of the information and are essentially black boxes, and thus often uninterpretable. This paper proposes an ensemble-based classifier with high accuracy for misinformation detection that uses the diffusion pattern of a post on Twitter. Additionally, the particular design of the classifier enables intrinsic explainability. Furthermore, in addition to using different temporal and spatial properties of diffusion cascades, this paper introduces features motivated by the science behind the spread of infectious diseases in epidemiology, especially recent studies conducted for the analysis of the COVID-19 pandemic. Finally, this paper presents a comparison of the classifier with baseline models and a quantitative evaluation of its explainability.
Download

Paper Nr: 110
Title:

Comparative Analysis of Real-Time Time Series Representation Across RNNs, Deep Learning Frameworks, and Early Stopping

Authors:

Ming-Chang Lee, Jia-Chun Lin and Sokratis Katsikas

Abstract: Real-Time time series representation is becoming increasingly crucial in data mining applications, enabling timely clustering and classification of time series without requiring parameter configuration and tuning in advance. Currently, the implementation of real-time time series representation relies on a fixed setting, consisting of a single type of recurrent neural network (RNN) within a specific deep learning framework, along with the adoption of early stopping. It remains unclear how leveraging different types of RNNs available in various deep learning frameworks, combined with the use of early stopping, influences the quality of representation and the efficiency of representation time. Arbitrarily selecting an RNN variant from a deep learning framework and activating the early stopping function for implementing a real-time time series representation approach may negatively impact the performance of the representation. Therefore, in this paper, we aim to investigate the impact of these factors on real-time time series representation. We implemented a state-of-the-art real-time time series representation approach using multiple well-established RNN variants supported by three widely used deep learning frameworks, with and without the adoption of early stopping. We analyzed the performance of each implementation using real-world open-source time series data. The findings from our evaluation provide valuable guidance on selecting the most appropriate RNN variant, deciding whether to adopt early stopping, and choosing a deep learning framework for real-time time series representation.
Download
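The early-stopping rule whose impact the paper evaluates is typically patience-based: training stops once the validation loss has not improved for a fixed number of epochs. A minimal sketch of that rule, with illustrative loss values (the function name and data are this example's assumptions, not the paper's code):

```python
# Patience-based early stopping: stop once validation loss has not
# improved for `patience` consecutive epochs.

def early_stop_epoch(val_losses, patience=3):
    """Return the 1-based epoch at which training would stop, or None."""
    best, wait = float("inf"), 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0   # improvement: reset the patience counter
        else:
            wait += 1              # no improvement this epoch
            if wait >= patience:
                return epoch
    return None                    # training ran to completion

# Illustrative per-epoch validation losses
losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.63, 0.5]
print(early_stop_epoch(losses, patience=3))  # 6
```

Note that stopping at epoch 6 forfeits the later improvement at epoch 7, which is exactly the kind of representation-quality trade-off the paper investigates.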

Paper Nr: 111
Title:

Multilayer Networks: For Modeling and Analysis of Big Data

Authors:

Abhishek Santra, Hafsa Billah and Sharma Chakravarthy

Abstract: In this position paper, we make a case for the appropriateness, utility, and effectiveness of graph models for big data analysis, focusing on Multilayer Networks (or MLNs), a specific type of graph. MLNs have been shown to be more appropriate for modeling complex data than their traditional counterparts. MLNs have also been shown to be useful for diverse data types, such as videos, and for information integration. Further, MLNs have been shown to be flexible for computing analysis objectives from diverse application domains using extant and new algorithms. There is research on automating the modeling of MLNs using the widely used EER (Enhanced/Extended Entity Relationship) or Unified Modeling Language (UML) approaches. We start by discussing different graph models and their benefits and limitations. We demonstrate how MLNs can be effectively used to model applications with complex data. We also summarize the work on the use of EER models to generate MLNs in a principled manner. We elaborate on the analysis alternatives provided by MLNs and their ability to match analysis needs. We show the use of MLNs for i) traditional data analysis, ii) video content analysis, and iii) complex data analysis, and iv) propose the use of MLNs for information integration or fusion. We show examples drawn from the literature of their modeling and analysis usage. We conclude that graphs, specifically MLNs, provide a rich alternative for modeling and analyzing big data. Of course, this certainly does not preclude newer data models that are likely to come along.
Download

Paper Nr: 119
Title:

Route Recommendation Based on POIs and Public Transportation

Authors:

Ágata Palma, Pedro Morais and Ana Alves

Abstract: With the rapid advancement of technology in today’s interconnected world, Ambient Intelligence (AmI) emerges as a powerful tool that revolutionizes how we interact with our environments. This article delves into the integration of AmI principles, Python programming, and Geographic Information Systems (GIS) to develop intelligent route recommendation systems for urban exploration. The motivation behind this study lies in the potential of AmI to address challenges in urban navigation, personalized recommendations, and sustainable transportation solutions. The objectives include optimizing travel routes, promoting sustainable transportation options, and enhancing user experiences. This research will contribute to advancing AmI technologies and their practical applications in improving urban living standards and mobility solutions.
Download

Paper Nr: 122
Title:

Hyperparameter Optimization for Search Relevance in E-Commerce

Authors:

Manuel Dalcastagné and Giuseppe Di Fabbrizio

Abstract: The tuning of retrieval and ranking strategies in search engines is traditionally done manually by search experts in a time-consuming and often irreproducible process. A typical use case is field boosting in keyword-based search, where the ranking weights of different document fields are changed in a trial-and-error process to obtain what seems to be the best possible results on a set of manually picked user queries. Hyperparameter optimization (HPO) can automatically tune search engines’ hyperparameters like field boosts and solve these problems. To the best of our knowledge, there has been little work in the research community regarding the application of HPO to search relevance in e-commerce. This work demonstrates the effectiveness of HPO techniques for optimizing the relevance of e-commerce search engines using a real-world dataset and evaluation setup, providing guidelines on key aspects to consider for the application of HPO to search relevance. Differential evolution (DE) optimization achieves up to 13% improvement in terms of NDCG@10 over baseline search configurations on a publicly available dataset.
Download
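The differential evolution search that the abstract applies to field boosts can be sketched as follows. The surrogate objective below stands in for NDCG@10 and is an assumption of this example, not the paper's evaluation setup; the optimum location and bounds are likewise invented:

```python
# Sketch of DE/rand/1/bin tuning two field boosts (e.g. title vs. description
# weight) against a surrogate relevance objective.
import random

def surrogate_ndcg(boosts):
    # Toy stand-in for NDCG@10, peaking at title_boost=2.0, desc_boost=0.5
    t, d = boosts
    return 1.0 - 0.1 * (t - 2.0) ** 2 - 0.2 * (d - 0.5) ** 2

def differential_evolution(obj, bounds, pop_size=10, gens=50, f=0.8, cr=0.9, seed=0):
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    for _ in range(gens):
        for i in range(pop_size):
            # Mutation: combine three other population members
            a, b, c = rng.sample([p for j, p in enumerate(pop) if j != i], 3)
            # Binomial crossover with clamping to the boost bounds
            trial = [
                min(max(a[d] + f * (b[d] - c[d]), bounds[d][0]), bounds[d][1])
                if rng.random() < cr else pop[i][d]
                for d in range(dim)
            ]
            if obj(trial) > obj(pop[i]):  # greedy selection
                pop[i] = trial
    return max(pop, key=obj)

best = differential_evolution(surrogate_ndcg, bounds=[(0.0, 5.0), (0.0, 5.0)])
print([round(x, 2) for x in best])  # converges near [2.0, 0.5]
```

In a real deployment the objective would issue the candidate boosts to the search engine and compute NDCG@10 over judged queries, which is what makes each evaluation expensive and DE's sample efficiency relevant.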

Paper Nr: 132
Title:

A Framework for Self-Service Business Intelligence

Authors:

Rosa Matias and Maria B. Piedade

Abstract: Building an effective Business Intelligence solution involves several key steps. Recently, low-code software tools have allowed casual users - those with domain-specific knowledge of a case study - to develop custom solutions independently of IT teams. This is the era of Self-Service Business Intelligence. However, some drawbacks have been identified due to casual users' lack of Business Intelligence expertise. In response, a framework is proposed, introducing the role of casual power users and specifying the Business Intelligence knowledge they should possess. Additionally, the framework aims to integrate Business Intelligence methodologies more cohesively with data visualization and data storytelling development cycles. As a proof of concept, the framework was applied to develop a solution for monitoring class attendance at a higher education institution. In this case study, a casual power user is able to identify, early in the semester, which classes require adjustments to improve resource management and pedagogical outcomes. The contextualization provided by the framework enabled that user to successfully uncover critical insights.
Download

Paper Nr: 133
Title:

Knowledge Graphs Can Play Together: Addressing Knowledge Graph Alignment from Ontologies in the Biomedical Domain

Authors:

Hanna Abi Akl, Dominique Mariko, Yann-Alan Pilatte, Stéphane Durfort, Nesrine Yahiaoui and Anubhav Gupta

Abstract: We introduce DomainKnowledge, a system that leverages a pipeline for triple extraction from natural text and domain-specific ontologies leading to knowledge graph construction. We also address the challenge of aligning text-extracted and ontology-based knowledge graphs using the biomedical domain as use case. Finally, we derive graph metrics to evaluate the effectiveness of our system compared to a human baseline.
Download

Paper Nr: 134
Title:

Decoding AI’s Evolution Using Big Data: A Methodological Approach

Authors:

Sophie Gvasalia, Mauro Pelucchi, Simone Perego and Rita Porcelli

Abstract: This study presents a novel approach to measuring the impact of Artificial Intelligence on occupations through an analysis of the Atlante del Lavoro dataset and web job postings. By focusing on data preparation and model selection, we provide real-time insights into how AI is reshaping job roles and required skills. Our methodological framework enables a detailed examination of specific labour market segments, emphasizing the dynamic nature of occupational demands. Through a rigorous mixed-method approach, the study highlights the AI impact on sectors such as ICT, telecommunications, and mechatronics, revealing distinct skill clusters and their significance. This innovative analysis not only delineates the convergence of digital, soft, and hard skills but also offers a multidimensional view of future workforce competencies. The findings serve as a valuable resource for educators, policymakers, and industry stakeholders, guiding workforce development in line with emerging AI-driven demands.
Download

Paper Nr: 137
Title:

MAEVE: An Agnostic Dataset Generator Framework for Predicting Customer Behavior in Digital Marketing

Authors:

William S. Filho, Seyed J. Haddadi and Julio D. Reis

Abstract: Data analysis plays a crucial role in assessing the effectiveness of business strategies. In Digital Marketing, analytical tools predominantly rely on traffic data and trend analysis, focusing on user behaviors and interactions. This study introduces a dataset generation framework to assist marketing professionals in conducting micro-level analyses of individual user responses to digital marketing strategies. The implemented proof of concept demonstrates that the framework can be integrated with enterprise software monitoring applications to ingest logs and, through appropriate configuration, generate comprehensive and valuable datasets. This research centers on the application of the framework for predicting customer behavior. The evaluation examines the extent to which the generated datasets are suitable for training various machine learning (ML) algorithms. The framework has shown promise in producing machine learning-ready datasets that accurately represent complex real-world scenarios.
Download

Paper Nr: 143
Title:

Beyond Twitter: Exploring Alternative API Sources for Social Media Analytics

Authors:

Alina Campan and Noah Holtke

Abstract: Social media is a valuable source of data for applications in a multitude of fields: agriculture, banking, business intelligence, communication, disaster management, education, government, health, hospitality and tourism, journalism, management, marketing, etc. There are two main ways to collect social media data: web scraping (requires more complex custom programs, faces legal and ethical concerns) and API-scraping using services provided by the social media platform itself (clear protocols, clean data, follows platform established rules). However, API-based access to social media platforms has significantly changed in the last few years, with the mainstream platforms placing more restrictions and pricing researchers out. At the same time, new, federated social media platforms have emerged, many of which have a growing user base and could be valuable data sources for research. In this paper, we describe an experimental framework to API-scrape data from the federated Mastodon platform (specifically its flagship node, Mastodon.social), and the results of volume, sentiment, emotion, and topic analysis on two datasets we collected – as a proof of concept for the usefulness of sourcing data from the Mastodon platform.
Download

Paper Nr: 159
Title:

An Improved Meta-Knowledge Prompt Engineering Approach for Generating Research Questions in Scientific Literature

Authors:

Meng Wang, Zhixiong Zhang, Hanyu Li and Guangyin Zhang

Abstract: Research questions are crucial for the development of science and are an important driving force of scientific evolution and progress. This study analyses the key meta-knowledge required for generating research questions in scientific literature, including research objectives and research methods. To extract meta-knowledge, we obtained feature words of meta-knowledge from knowledge-enriched regions and embedded them into DeBERTa (Decoding-enhanced BERT with disentangled attention) for training. Compared to existing models, our proposed approach demonstrates superior performance across all metrics for identifying meta-knowledge, achieving F1 score improvements of +9% over BERT (88% vs. 97%), +3% over BERT-CNN (94% vs. 97%), and +2% over DeBERTa (95% vs. 97%). We then construct prompts that integrate meta-knowledge to fine-tune LLMs. Compared to the baseline model, the LLMs fine-tuned using meta-knowledge prompt engineering achieve an average F1 score of 88.6% on the research question generation task, an improvement of 8.4%. Overall, our approach can be applied to research question generation in different domains. Additionally, by updating or replacing the meta-knowledge, the model can also serve as a theoretical foundation and model basis for generating different types of sentences.
Download

Paper Nr: 163
Title:

An End-to-End Generative System for Smart Travel Assistant

Authors:

Miraç Tuğcu, Begüm Ç. Erdinç, Tolga Çekiç, Seher C. Akay, Derya Uysal, Onur Deniz and Erkut Erdem

Abstract: Planning a trip with a customer assistant is a multi-stage process that involves collecting information and using search and reservation services. In this paper, we present an end-to-end voice-enabled virtual assistant specifically designed for travel planning in Turkish. The system comprises fine-tuned state-of-the-art Speech-to-Text (STT) and Text-to-Speech (TTS) models adapted to the tourism domain for Turkish, as well as improvements to the chatbot experience that allow it to handle the complex, multifaceted conversations required to plan a trip thoroughly. We detail the architecture of our voice-based chatbot, focusing on integrating the STT and TTS engines with a Natural Language Understanding (NLU) module tailored for travel-domain queries. Furthermore, we present a comparative evaluation of speech modules, considering factors such as parameter size and accuracy. Our findings demonstrate the feasibility of voice-based interfaces for streamlining travel planning and booking in Turkish, a language that lacks high-quality corpora of speech and text pairs.
Download

Paper Nr: 179
Title:

Intrinsic Evaluation of RAG Systems for Deep-Logic Questions

Authors:

Junyi Hu, You Zhou and Jie Wang

Abstract: We introduce the Overall Performance Index (OPI), an intrinsic metric to evaluate retrieval-augmented generation (RAG) mechanisms for applications involving deep-logic queries. OPI is computed as the harmonic mean of two key metrics: the Logical-Relation Correctness Ratio and the average of BERT embedding similarity scores between ground-truth and generated answers. We apply OPI to assess the performance of LangChain, a popular RAG tool, using a logical relations classifier fine-tuned from GPT-4o on the RAG-Dataset-12000 from Hugging Face. Our findings show a strong correlation between BERT embedding similarity scores and extrinsic evaluation scores. Among the commonly used retrievers, the cosine similarity retriever using BERT-based embeddings outperforms others, while the Euclidean distance-based retriever exhibits the weakest performance. Furthermore, we demonstrate that combining multiple retrievers, either algorithmically or by merging retrieved sentences, yields superior performance compared to using any single retriever alone.
Download
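The OPI computation described in the abstract is a simple harmonic mean of the two component metrics; a minimal sketch (the function name `opi` and the example inputs are illustrative, not taken from the paper's code):

```python
# OPI as the harmonic mean of the Logical-Relation Correctness Ratio
# and the average BERT embedding similarity score.

def opi(lrcr: float, mean_similarity: float) -> float:
    """Harmonic mean of the two component metrics (both in [0, 1])."""
    if lrcr == 0 or mean_similarity == 0:
        return 0.0  # harmonic mean is 0 if either component is 0
    return 2 * lrcr * mean_similarity / (lrcr + mean_similarity)

# Example: correctness ratio 0.8, mean embedding similarity 0.6
print(round(opi(0.8, 0.6), 4))  # 0.6857
```

As with F1, the harmonic mean penalizes imbalance: a retriever cannot score well on OPI by excelling at logical correctness while producing answers with poor embedding similarity, or vice versa.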

Paper Nr: 186
Title:

Prediction of Response to Intra-Articular Injections of Hyaluronic Acid for Knee Osteoarthritis

Authors:

Eva K. Lee, Fan Yuan, Barton J. Mann and Marlene DeMaio

Abstract: Osteoarthritis (OA) is a degenerative joint disease, with the knee being the most frequently affected joint. Knee OA is a leading cause of arthritis disability, with 50% of knee OA patients eventually receiving surgical procedures. Specifically, 99% of these knee replacements are performed to address pain and functional limitations. However, it has been reported that about one-third of these surgeries are unnecessary. Intra-articular injections of hyaluronic acid (HA) can serve as a non-invasive, cost-effective alternative to surgery for knee osteoarthritis. Although research studies have clearly demonstrated that HA improves knee function, the efficacy of this treatment remains controversial. Many clinicians have observed that its effects depend on several patient characteristics, such as age, weight, gender, and severity of the OA, and on technical issues such as injection site and placement. In this study, a multi-stage multi-group machine learning model is utilized to uncover discriminatory features that can predict the response status of knee OA patients to different types of HA treatment. The algorithm can identify subgroups of knee OA patients who respond well (or those who don’t) to HA therapy. Specifically, a baseline model including factors such as the patient’s weight and smoking status and frequency gives physicians a first step in treatment recommendation, steering those patients most suitable for HA injection toward this treatment. The model achieves over 85% blind prediction accuracy. The data and model derived from this study allow physicians to administer HA products more selectively and effectively, which will increase the percentage of patients who experience successful HA therapy. Information about predicted responses could also easily be shared with patients to incorporate their values and preferences into treatment selection.
In addition, the decision support tools would allow providers to quickly determine whether a patient is exhibiting at least an expected treatment response and if not, to potentially take corrective action. The model is generalizable and can also be used to predict patient responses to other treatments and conditions.
Download

Paper Nr: 188
Title:

Comparative Analysis of Single and Ensemble Support Vector Regression Methods for Software Development Effort Estimation

Authors:

Mohamed Hosni

Abstract: Providing an accurate estimation of the effort required to develop a software project is crucial for its success. These estimates are essential for managers to allocate resources effectively and deliver the software product on time and with the desired quality. Over the past five decades, various effort estimation techniques have been developed, including machine learning (ML) techniques. ML methods have been applied in software development effort estimation (SDEE) for the past three decades and have demonstrated promising levels of accuracy. Numerous ML methods have been explored, including the Support Vector Regression (SVR) technique, which has shown competitive performance compared to other ML techniques. However, despite the plethora of proposed methods, no single technique has consistently outperformed the others in all situations. Prior research suggests that generating estimations by combining multiple techniques in ensembles, rather than relying solely on a single technique, can be more effective. Consequently, this research paper proposes estimating SDEE using both individual ML techniques and ensemble methods based on SVR. Specifically, four variations of the SVR technique are employed, utilizing four different kernels: polynomial, linear, radial basis function, and sigmoid. Additionally, a homogeneous ensemble is constructed by combining these four variants using two types of combiners. An empirical analysis is conducted on six well-known datasets, evaluating performance using eight unbiased criteria and the Scott-Knott statistical test. The results suggest that both single and ensemble SVR techniques exhibit similar predictive capabilities. Furthermore, the SVR variant with the polynomial kernel is deemed the most suitable for SDEE. Regarding the combiner rule, the non-linear combiner yields superior accuracy for the SVR ensemble.
Download
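The combiner rules mentioned at the end of the abstract can be illustrated with a short sketch: a linear combiner (mean) and a non-linear combiner (median) applied to the predictions of the four SVR kernel variants. The per-kernel effort values below are toy numbers, not the paper's results:

```python
# Combining per-kernel SVR predictions for one project with a linear (mean)
# and a non-linear (median) combiner rule.
from statistics import mean, median

def combine(preds, rule="mean"):
    """Combine per-model effort predictions for one project."""
    return mean(preds) if rule == "mean" else median(preds)

# Predicted effort (e.g. person-hours) from the four kernels:
# polynomial, linear, RBF, sigmoid
kernel_preds = [420.0, 455.0, 430.0, 510.0]
print(combine(kernel_preds, "mean"))    # 453.75
print(combine(kernel_preds, "median"))  # 442.5
```

The median is less sensitive to a single outlying kernel (here the sigmoid's 510.0), which is one plausible reason a non-linear combiner can yield better accuracy for the ensemble.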

Paper Nr: 189
Title:

Software Testing Effort Estimation Based on Machine Learning Techniques: Single and Ensemble Methods

Authors:

Mohamed Hosni, Ibtissam Medarhri and Juan M. Carrillo de Gea

Abstract: Delivering an accurate estimation of the effort required for software system development is crucial for the success of any software project. However, the software development lifecycle (SDLC) involves multiple activities, such as software design, software build, and software testing, among others. Software testing (ST) holds significant importance in the SDLC as it directly impacts software quality. Typically, the effort required for the testing phase is estimated as a percentage of the overall predicted SDLC effort, usually between 10% and 60%. However, this approach poses risks, as it hinders proper resource allocation by managers. Despite the importance of this issue, there is limited research available on estimating ST effort. This paper aims to address this concern by proposing four machine learning (ML) techniques and a heterogeneous ensemble to predict the effort required for ST activities. The ML techniques employed include K-nearest neighbors (KNN), Support Vector Regression, Multilayer Perceptron Neural Networks, and decision trees. The dataset used in this study was obtained from a well-known repository. Various unbiased performance indicators were utilized to evaluate the predictive capabilities of the proposed techniques. The overall results indicate that the KNN technique outperforms the other single ML techniques, and the proposed ensemble showed superior accuracy compared to the remaining ML techniques.
Download
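As a rough illustration of the best-performing single technique named above, a minimal KNN effort estimator predicts a new project's testing effort as the mean effort of its k nearest neighbours in feature space. The feature vectors and effort values here are invented for the example, not drawn from the study's dataset:

```python
# Minimal KNN regression sketch for testing-effort estimation.

def knn_effort(train, query, k=2):
    """train: list of (feature_vector, effort); query: feature vector."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    # Take the k projects closest to the query in Euclidean distance
    nearest = sorted(train, key=lambda t: dist(t[0], query))[:k]
    # Predict the mean effort of those neighbours
    return sum(effort for _, effort in nearest) / k

# Toy historical projects: (features e.g. (size, complexity), effort)
projects = [((10.0, 3.0), 120.0), ((12.0, 4.0), 150.0), ((30.0, 9.0), 400.0)]
print(knn_effort(projects, (11.0, 3.5)))  # 135.0
```

In practice features would be normalized first, since otherwise the attribute with the largest numeric range dominates the distance.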

Paper Nr: 191
Title:

Insights into the Potential of Fuzzy Systems for Medical AI Interpretability

Authors:

Hafsaa Ouifak and Ali Idri

Abstract: Machine Learning (ML) solutions have demonstrated significant improvements across various domains. However, the complete integration of ML solutions into critical fields such as medicine faces one main challenge: interpretability. This study conducts a systematic mapping to investigate primary research focused on the application of fuzzy logic (FL) in enhancing the interpretability of ML black-box models in medical contexts. The mapping covers the period from 1994 to January 2024, resulting in 67 relevant publications from multiple digital libraries. The findings indicate that 60% of the selected studies proposed new FL-based interpretability techniques, while 40% evaluated existing techniques. Breast cancer emerged as the most frequently studied disease using FL interpretability methods. Additionally, TSK neuro-fuzzy systems were identified as the systems most employed for enhancing interpretability. Future research should aim to address existing limitations, including the challenge of maintaining interpretability in ensemble methods.

Paper Nr: 20
Title:

Utilizing Data Analysis for Optimized Determination of the Current Operational State of Heating Systems

Authors:

Ahmed Qarqour, Sahil-Jai Arora, Gernot Heisenberg, Markus Rabe and Tobias Kleinert

Abstract: In response to the pressing global challenge of climate change, the emphasis on sustainable energy technologies has escalated, spotlighting the critical role of heat pump systems as eco-friendly alternatives for heating and cooling. These systems stand at the forefront of efforts to reduce greenhouse gas emissions and improve energy efficiency. The advent of Internet of Things (IoT) technology has unlocked the potential for comprehensive data collection on the operational intricacies of heat pump systems in real-world settings, offering precious insights into their performance and guiding technological advancements. This paper introduces an analytical approach to optimize air-to-water heat pump systems using time series data from Bosch Home Comfort Group's systems. Utilizing Fayyad's data-driven analysis model and the Random Forest algorithm, the study tackles the complexities of system behavior. The approach, whose interpretability is crucial for practical application, achieves a fault detection accuracy of 97.6%. The method encounters difficulties in accurately predicting compressor control faults due to limited data quality and a lack of comprehensive system information. The findings highlight IoT's potential to enhance system efficiency and availability, but also point to the limitations of relying solely on data-driven models for fault prediction in field systems.
Download

Paper Nr: 21
Title:

A Core Technology Discovery Method Based on Hypernetwork

Authors:

Chen Wenjie

Abstract: Identifying and analyzing the core technologies in a specific technical field provides a comprehensive understanding of the research status and development trends in that field, offering reference points and suggestions for the research and development of key and disruptive technologies. This article represents the multiple co-occurrence relationships between entities using a hypernetwork structure, and uses hypernetwork embedding to automatically generate technology node vectors that integrate structural and attribute features. Technology clusters are obtained through fuzzy clustering, and measurement indicators based on the hypernetwork structure, such as local centrality, semi-local centrality, and global centrality, are constructed to identify the core technology nodes in each technology cluster. Taking the field of carbon capture, utilization, and storage technology as an example, the effectiveness and scientific validity of the proposed method were verified. The results show that chemical absorption, membrane separation, solid adsorption, and low-temperature separation are the core technologies in this field, which helps China allocate resources reasonably, increase research and development efforts on core technologies, and gain competitive advantages.

Paper Nr: 24
Title:

Modelling of an Untrustworthiness of Fraudulent Websites Using Machine Learning Algorithms

Authors:

Kristína Machová and Martin Kaňuch

Abstract: This paper focuses on learning models that can detect fraudulent websites accurately enough to help users avoid becoming victims of fraud. Both classical machine learning methods and neural network learning were used for modelling. To generate the detection models, attributes were extracted from the content and structure of fraudulent websites, together with attributes derived from the way they are used. The best model was deployed in an application in the form of a Google Chrome browser extension. The application may be beneficial in the future for new users and older people, who are more prone to believe scammers. By focusing on key factors such as URL syntax, hostname legitimacy, and other special attributes, the app can help prevent financial loss and protect individuals and businesses from online fraud.
Download

Paper Nr: 31
Title:

Reducing the Transformer Architecture to a Minimum

Authors:

Bernhard Bermeitinger, Tomas Hrycej, Massimo Pavone, Julianus Kath and Siegfried Handschuh

Abstract: Transformers are a widespread and successful model architecture, particularly in Natural Language Processing (NLP) and Computer Vision (CV). The essential innovation of this architecture is the Attention Mechanism, which solves the problem of extracting relevant context information from long sequences in NLP and realistic scenes in CV. A classical neural network component, a Multi-Layer Perceptron (MLP), complements the attention mechanism. Its necessity is frequently justified by its capability of modeling nonlinear relationships. However, the attention mechanism itself is nonlinear through its internal use of similarity measures. A possible hypothesis is that this nonlinearity is sufficient for modeling typical application problems. As the MLPs usually contain the most trainable parameters of the whole model, their omission would substantially reduce the parameter set size. Further components can also be reorganized to reduce the number of parameters. Under some conditions, query and key matrices can be collapsed into a single matrix of the same size. The same holds for the value and projection matrices, which can also be omitted without eliminating the substance of the attention mechanism. Initially, the similarity measure was defined asymmetrically, with peculiar properties such as that a token is possibly dissimilar to itself. A possible symmetric definition requires only half of the parameters. All these parameter savings make sense only if the representational performance of the architecture is not significantly reduced. A comprehensive empirical proof for all important domains would be a huge task. We have laid the groundwork by testing widespread CV benchmarks: MNIST, CIFAR-10, and, with restrictions, ImageNet. The tests have shown that simplified transformer architectures (a) without MLP, (b) with collapsed matrices, and (c) with a symmetric similarity measure exhibit performance similar to the original architecture, saving up to 90% of parameters without hurting the classification performance.
Download

Paper Nr: 57
Title:

Federated Learning for XSS Detection: A Privacy-Preserving Approach

Authors:

Mahran Jazi and Irad Ben-Gal

Abstract: Collaboration between edge devices has increased the scale of machine learning (ML), which can be attributed to increased access to large volumes of data. Nevertheless, traditional ML models face significant hurdles in securing sensitive information due to rising concerns about data privacy. As a result, federated learning (FL) has emerged as an alternative that enables devices to learn from each other without exposing users' data. This paper suggests that FL can be used as a validation mechanism for finding and blocking malicious attacks such as cross-site scripting (XSS). Our contribution lies in demonstrating the practical effectiveness of this approach on a real-world dataset, the details of which are expounded upon herein. Moreover, we conduct a comparative performance analysis, pitting our FL approach against traditional centralized parametric ML methods, such as logistic regression (LR), deep neural networks (DNNs), support vector machines (SVMs), and k-nearest neighbors (KNN), thus shedding light on its potential advantages. The dataset employed in our experiments mirrors real-world conditions, facilitating a meaningful assessment of the viability of our approach. Our empirical evaluations reveal that the FL approach not only achieves performance on par with that of centralized ML models but also provides a crucial advantage in terms of preserving the privacy of sensitive data.
Download

Paper Nr: 82
Title:

RUDEUS: A Machine Learning Classification System to Study DNA-Binding Proteins

Authors:

David Medina-Ortiz, Gabriel Cabas-Mora, Iván Moya, Nicole Soto-García and Roberto Uribe-Paredes

Abstract: DNA-binding proteins play crucial roles in biological processes such as replication, transcription, packaging, and chromatin remodeling. Their study has gained importance across scientific fields, with computational biology complementing traditional methods. While machine learning has advanced bioinformatics, generalizable pipelines for identifying DNA-binding proteins and their specific interactions remain scarce. We present RUDEUS, a Python library with hierarchical classification models to identify DNA-binding proteins and distinguish between single- and double-stranded DNA interactions. RUDEUS integrates protein language models, supervised learning, and Bayesian optimization, achieving 95% precision in DNA-binding identification and 89% accuracy in distinguishing interaction types. The library also includes tools for annotating unknown sequences and validating DNA-protein interactions through molecular docking. RUDEUS delivers competitive performance and is easily integrated into protein engineering workflows. It is available under the MIT License, with the source code and models available on the GitHub repository https://github.com/ProteinEngineering-PESB2/RUDEUS.
Download

Paper Nr: 89
Title:

Evaluating the Suitability of Long Document Embeddings for Classification Tasks: A Comparative Analysis

Authors:

Bardia Rafieian and Pere-Pau Vázquez

Abstract: Long documents pose a significant challenge for natural language processing (NLP), which requires high-quality embeddings. Despite numerous approaches encompassing both deep learning and machine learning methodologies, the task remains hard. In our study, we address long document classification by leveraging recent advancements in machine learning and deep learning. We conduct a comprehensive evaluation of several state-of-the-art models, including Doc2vec, Longformer, LLaMA-3, and SciBERT, focusing on their effectiveness in handling long to very long documents (in number of tokens). Furthermore, we trained a Doc2vec model on a massive dataset, achieving state-of-the-art quality and surpassing other methods such as Longformer and SciBERT, which are very costly to train. Notably, while LLaMA-3 outperforms our model in certain aspects, Doc2vec remains highly competitive, particularly in speed, as it is the fastest among the evaluated methods. Through experimentation, we thoroughly evaluate the performance of our custom-trained Doc2vec model in classifying documents with an extensive number of tokens, demonstrating its efficacy, especially on very long documents. However, our analysis also uncovers inconsistencies in the performance of all models when faced with documents containing larger text volumes.
Download

Paper Nr: 99
Title:

Comparing Human and Machine Generated Text for Sentiment

Authors:

WingYin Ha and Diarmuid P. O’Donoghue

Abstract: This paper compares human and machine generated texts, focusing on a comparison of their sentiment. We use two corpora: the first is the HC3 question-and-answer texts. We present a second corpus of human-written text materials sourced from psychology experiments, for which we used a language model to generate stories analogous to the presented information. Two sentiment analysis tools generated sentiment results, showing frequent statistically significant differences between the sentiment scores on the individual sub-collections within these corpora. Generally speaking, machine generated text tended to have a slightly more positive sentiment than the human-authored equivalent. However, we also found low levels of agreement between the Vader and TextBlob sentiment analysis systems used. Any proposed use of LLM generated content in place of retrieved information needs to carefully consider the subtle differences between the two and the implications these differences may have on downstream tasks.
Download

Paper Nr: 102
Title:

A Network Learning Method for Functional Disability Prediction from Health Data

Authors:

Riccardo Dondi and Mehdi Hosseinzadeh

Abstract: This contribution proposes a novel network analysis model with the goal of classifying individuals as either ‘disabled’ or ‘not-disabled’, using a dataset from the Health and Retirement Study (HRS). Our approach is based on selecting features that span health indicators and socioeconomic factors due to their pivotal roles in identifying disability. Considering the selected features, our approach computes similarities between individuals and uses this similarity to predict disability. We present a preliminary experimental evaluation of our method on the HRS dataset, where it achieves an average accuracy of 62.48%.
Download

Paper Nr: 103
Title:

Enhancing Dyeing Processes with Machine Learning: Strategies for Reducing Textile Non-Conformities

Authors:

Mariana Carvalho, Ana Borges, Alexandra Gavina, Lídia Duarte, Joana Leite, Maria J. Polidoro, Sandra Aleixo and Sónia Dias

Abstract: The textile industry, a vital sector in global production, relies heavily on dyeing processes to meet stringent quality and consistency standards. This study addresses the challenge of identifying and mitigating non-conformities in dyeing patterns, such as stains, fading, and coloration issues, through advanced data analysis and machine learning techniques. We applied Random Forest and Gradient Boosted Trees algorithms to a dataset provided by a Portuguese textile company, identifying key factors influencing dyeing non-conformities. Our models highlight critical features impacting non-conformities, offering predictive capabilities that allow for preemptive adjustments to the dyeing process. The results demonstrate significant potential for reducing non-conformities, improving efficiency, and enhancing overall product quality.
Download

Paper Nr: 146
Title:

Optimizing Federated Learning for Intrusion Detection in IoT Networks

Authors:

Abderahmane Hamdouchi and Ali Idri

Abstract: The Internet of Things (IoT) involves billions of interconnected devices, making IoT networks vulnerable to cyber threats. To enhance security, deep learning (DL) techniques are increasingly used in intrusion detection systems (IDS). However, centralized DL-based IDSs raise privacy concerns, prompting interest in Federated Learning (FL). This research evaluates FL configurations using dense neural networks (DNN) and convolutional neural networks (CNN) with two optimizers, stochastic gradient descent (SGD) and Adam, across 20% and 60% feature thresholds. Two cost-sensitive learning techniques were applied: undersampling with binary cross-entropy and class weighting with weighted binary cross-entropy. Using the NF-ToN-IoT-v2 dataset, 16 FL configurations were analyzed. Results indicate that SGD, combined with CNN and the undersampling technique applied to the top 7 features, outperformed the other configurations.
Download

Paper Nr: 162
Title:

A Smart Hybrid Enhanced Recommendation and Personalization Algorithm Using Machine Learning

Authors:

Aswin K. Nalluri and Yan Zhang

Abstract: In today’s era of streaming services, the effectiveness and precision of recommendation systems are pivotal in enhancing user satisfaction. Traditional recommendation systems often grapple with challenges such as data sparsity in user-item interactions, the need for parallel processing, and increased computational demands due to matrix densification, all of which hinder the overall efficiency and scalability of recommendation systems. To address these issues, we proposed the Smart Hybrid Enhanced Recommendation and Personalization Algorithm (SHERPA), a cutting-edge machine learning approach designed to revolutionize movie recommendations. SHERPA combines Term Frequency-Inverse Document Frequency (TF-IDF) for content-based filtering and Alternating Least Squares (ALS) with weighted regularization for collaborative filtering, offering a sophisticated method for delivering personalized suggestions. We evaluated the proposed SHERPA algorithm using a dataset of over 50 million ratings from 480,000 Netflix users, covering 17,000 movie titles. The performance of SHERPA was meticulously compared to traditional hybrid models, demonstrating a 70% improvement in prediction accuracy based on Root Mean Square Error (RMSE) metrics during the training, testing, and validation phases. These findings underscore SHERPA’s ability to discern and cater to users’ nuanced preferences, marking a significant advancement in personalized recommendation systems.
Download

Paper Nr: 175
Title:

Comparative Performance Analysis of Active Learning Strategies for the Entity Recognition Task

Authors:

Philipp Kohl, Yoka Krämer, Claudia Fohry and Bodo Kraft

Abstract: Supervised learning requires large amounts of annotated data, which makes the annotation process time-consuming and expensive. Active Learning (AL) offers a promising solution by reducing the amount of labeled data needed while maintaining model performance. This work focuses on the application of supervised learning and AL to (named) entity recognition, a subdiscipline of Natural Language Processing (NLP). Despite the potential of AL in this area, there is still a limited understanding of the performance of different approaches. We address this gap by conducting a comparative performance analysis with diverse, carefully selected corpora and AL strategies. Thereby, we establish a standardized evaluation setting to ensure reproducibility and consistency across experiments. Our analysis reveals scenarios where AL provides performance improvements and others where its benefits are limited. In particular, we find that strategies that incorporate historical information from the learning process and maximize entity information yield the most significant improvements. Our findings can guide researchers and practitioners in optimizing their annotation efforts.
Download