
Ebook: Artificial Intelligence and Human-Computer Interaction

The importance of artificial intelligence (AI) to all our lives is now undeniable, and with interactions between humans, computers, and AI continuing to increase, this area has become the focus of growing interest.
This book presents the proceedings of ArtInHCI2024, the 2nd International Conference on Artificial Intelligence and Human-Computer Interaction, held as a hybrid event from 25 to 27 October 2024 in Kunming, China. The ArtInHCI conference series was conceived with the aim of promoting academic exchange within and across disciplines, addressing theoretical and practical challenges and advancing current understanding and application; a process which it is hoped will also serve to spread amity, establish connections and enable future collaboration. ArtInHCI2024 provided a platform for the discussion of a number of hot topics, including deep learning, artificial neural networks, computer vision and pattern recognition, and the conference focused on research challenges as well as those of application. A total of 191 submissions were received for the conference, and after initial screening, 142 were submitted to a rigorous, double-blind peer review procedure based on relevance, writing skills, scientific quality and soundness, and contribution or practical implications. Following a final decision-making process, 93 of the papers were selected for presentation and publication here, an acceptance rate of 48.7%.
Covering a wide range of topics in the sphere of AI and human/computer interaction, the book will be of interest to all those working in the field.
This volume in the series, Frontiers in Artificial Intelligence and Applications (FAIA), presents the proceedings of the 2nd International Conference on Artificial Intelligence and Human-Computer Interaction (ArtInHCI2024). The conference was successfully held in Kunming, China from 25 to 27 October 2024.
ArtInHCI2024 organized discussions on a number of hot topics, including deep learning, artificial neural networks, computer vision and pattern recognition, and focused on the challenges of research as well as of application. The conference consisted of an onsite session and an online session, showcasing various items such as keynote speeches, oral reports, poster presentations, and Q&A. We were fortunate to have with us experts and scholars from around the globe to share their latest findings and insights, including Professor Ji Zhang, University of Southern Queensland (UniSQ), Australia; Professor Huiyu Zhou, University of Leicester, UK; Professor Liang Liao, Zhongyuan University of Technology, China; Assoc. Prof. Aslina Baharum, Sunway University, Malaysia; Prof. Xiaohui Zou, Peking University, Director of Interdisciplinary Knowledge Modeling Research Group, Special & Researcher, Hengqin Searle Technology Co. Ltd., China; Prof. Ljiljana Trajkovic, Simon Fraser University, Burnaby, British Columbia, Canada; Assoc. Prof. Teh Sin Yin, Universiti Sains Malaysia; Assoc. Prof. Le Nguyen Quoc Khanh, Taipei Medical University (TMU), Taiwan, China; Asst. Prof. Teoh Wei Lin, Heriot-Watt University Malaysia, Malaysia; Asst. Prof. Chong Zhi Lin, Universiti Tunku Abdul Rahman, Malaysia; Asst. Prof. Luís Silva, NOVA School of Science and Technology, Portugal; and Assoc. Prof. Yuping Song, Shanghai Normal University, China.
The ArtInHCI conference was conceived with the aim of promoting academic exchange within and across disciplines, addressing theoretical and practical challenges and advancing current understanding and application; a process which we hope will also spread amity, establish connections and enable future collaboration.
The ArtInHCI organizing committee extend their sincerest gratitude to all those who have supported the conference in their various ways: the authors who have chosen this platform to publish their works and communicate with peers, the participants who took an interest and attended the conference, the chairs and committee members whose professional expertise and judgment have been indispensable, the keynote speakers who so generously shared their vision and passion, and the reviewers who upheld the faith in scholarship and contributed their experience and honest opinions. It has been a pleasure and honor to work alongside them, and we look forward to further cooperation with them at future ArtInHCI conferences.
The Editors
Human-in-the-loop reinforcement learning (HIRL) enhances sampling efficiency in deep reinforcement learning by incorporating human expertise and experience into the training process. However, HIRL methods still depend heavily on expert guidance, which is a key factor limiting their further development and large-scale application. In this paper, an uncertainty-based dynamic weighted experience replay approach (UDWER) is proposed to address this problem. Our approach enables the algorithm to detect decision uncertainty, triggering human intervention only when uncertainty exceeds a threshold. This reduces the need for continuous human supervision. Additionally, we design a dynamic experience replay mechanism that prioritizes machine self-exploration and human-guided samples with different weights based on decision uncertainty. We also provide a theoretical derivation and related discussion. Experiments in the Lunar Lander environment demonstrate improved sampling efficiency and reduced reliance on human guidance.
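The threshold-triggered intervention described in this abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: normalized Shannon entropy stands in for decision uncertainty, the threshold of 0.8 is arbitrary, and the weighting rule in the hypothetical `replay_weights` helper is only one plausible way to weight human-guided and self-explored samples differently.

```python
import numpy as np

def decision_uncertainty(action_probs):
    """Normalized Shannon entropy of an action distribution, in [0, 1].

    Assumption: entropy is used as the uncertainty measure; the paper
    may define decision uncertainty differently.
    """
    p = np.asarray(action_probs, dtype=float)
    p = p / p.sum()
    nonzero = p[p > 0]                      # skip zeros to avoid log(0)
    entropy = -np.sum(nonzero * np.log(nonzero))
    return float(entropy / np.log(len(p)))  # divide by the maximum entropy

def needs_human(action_probs, threshold=0.8):
    """Request human intervention only when uncertainty exceeds the threshold."""
    return decision_uncertainty(action_probs) > threshold

def replay_weights(uncertainties, is_human):
    """Hypothetical sampling probabilities for a replay buffer.

    Human-guided samples are weighted by u, self-explored samples by 1 - u,
    where u is the uncertainty recorded when the sample was collected.
    """
    u = np.asarray(uncertainties, dtype=float)
    h = np.asarray(is_human, dtype=bool)
    w = np.where(h, u, 1.0 - u) + 1e-8      # small constant keeps weights positive
    return w / w.sum()
```

Under these assumptions, a near-uniform action distribution such as `[0.25, 0.25, 0.25, 0.25]` triggers a request for guidance, while a confident one such as `[0.97, 0.01, 0.01, 0.01]` does not.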
Model Predictive Control (MPC) has become an important control method for autonomous vehicles and complex robotic systems. However, MPC requires solving an optimization problem to obtain optimal control inputs, which can be computationally expensive for nonlinear and high-dimensional systems. This paper proposes using Particle Filters (PF) to solve the MPC optimization problem in order to enhance efficiency and accuracy. Our approach applies PF to solve quadratic programming problems and integrates it into the MPC framework. We investigate two specific applications: lane-keeping control for autonomous vehicles and control of a robotic arm mounted on a differential drive mobile platform. Experimental results show that using PF can effectively optimize MPC problems, significantly reduce computation time, and improve control accuracy, particularly in handling complex nonlinear systems in the mentioned applications. This paper demonstrates the potential of PF as an optimizer in MPC and suggests further testing of this approach in more complex control problems to verify its broad applicability and reliability.
Recognizing the cognitive workload (CW) of operators is crucial to avoid human factor failures. Recently, multimodal CW recognition has attracted increasing attention since it leverages complementary information from different physiological modalities to enhance CW recognition performance. However, in real-world scenarios, not all modalities are always available. The performance of multimodal CW recognition may degrade when any modality is missing, especially the electroencephalogram (EEG). Although existing methods are capable of addressing the issue by explicitly recovering the missing modalities based on the available ones, they struggle to generate high-quality missing modality features due to their neglect of inter-modality relationships. In this paper, we propose a novel multimodal learning framework for CW recognition to address the incomplete modality problem. To recover the missing modality, a mutual information-assisted recovery strategy, which can maximize the mutual information between the missing and the available modalities, is used to train a feature generation module. Furthermore, to efficiently utilize complementary multimodal information, we employ a feature fusion strategy based on a channel attention mechanism to help the model focus on the key information. As a result, the proposed framework can achieve good CW recognition performance in missing modality scenarios. Our method attains an average accuracy of 75.04% on a public dataset, which is the highest among all compared methods and demonstrates the effectiveness of our framework.
The real estate price is an essential index for measuring the real estate industry, urban economy, and investment policy. This paper introduces a non-parametric regression (NR)-deep learning framework, which uses the non-parametric model to predict the trend of the real estate price series and then uses the deep learning model to capture the residual information, thereby correcting errors and improving the accuracy of real estate price prediction. The empirical results show that error correction can improve the prediction accuracy by an order of magnitude: all six evaluation criteria improve by far more than a factor of 10. In addition, under the error correction framework, the NR-gated recurrent unit (GRU) model has certain advantages in processing nonlinear complex error sequences. Compared with the SVR and LSTM models under the same framework, the evaluation criteria improve by about 5.20% and 0.09% on average, respectively, and the DM statistics are all positive.
In the era of smart education, online courses as the avant-garde force in the educational field are leading the way in innovating teaching methods. Although online learning platforms provide students with convenient channels for learning, issues such as course quality, personalized service, and learning motivation still exist. This study, based on China University MOOC, proposes a personalized online course recommendation method based on emotion recognition, aimed at deeply understanding students’ emotional states to enhance the accuracy and personalization of course recommendations. Initially, this paper collected a dataset of course user comments from China University MOOC and built an emotional dictionary in the education domain to analyze users’ emotional states. Combining emotion analysis, user characteristics, and course features, the SAFM and SDFM models were proposed, incorporating a negative sampling method to generate personalized course recommendations. The experiments prove that this method effectively enhances students’ learning motivation and participation, offering new insights for the development of online education platforms.
Multi-Label Text Classification (MLTC) is a crucial task in natural language processing (NLP), enabling the assignment of multiple labels to a single text sample, which aligns with the diverse and multifaceted nature of discussions typically found in Bulletin Board Systems. This study presents an investigation into text classification methodologies, leveraging a dataset comprising 388,693 entries, with 234,237 entries manually annotated for model training. The dataset encompasses diverse text data from prominent social platforms, including GitHub, H5-based forums, WeChat, QQ group chats, and more. Four distinct methods for text classification are compared: BERT and BiLSTM models with Binary Cross-Entropy (BCE) loss, BERT for feature extraction followed by BiLSTM and BCE, BERT and BiLSTM models with Focal Loss (FL), and BERT for feature extraction followed by BiLSTM and FL. The experiments indicate that models utilizing pre-trained BERT for feature extraction outperform those without pre-training. Focal Loss emerges as a superior alternative to Binary Cross-Entropy, demonstrating efficacy in handling class imbalance and noisy data, thereby improving overall model accuracy and robustness. These findings underscore the importance of thoughtful model architecture and loss function selection. Future research directions include exploring ensemble methodologies, alternative pre-training techniques for BERT, and enhancing model interpretability. Keeping pace with NLP advancements and integrating cutting-edge techniques into future investigations holds promise for further advancements in model efficacy and practical utility.
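The contrast between the two loss functions compared in this abstract can be made concrete with a small NumPy sketch of their standard per-label binary definitions. The settings `gamma=2.0` and `alpha=0.25` are common defaults assumed here for illustration; the abstract does not state which values the authors used.

```python
import numpy as np

def bce_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over all labels (multi-label setting)."""
    p = np.clip(np.asarray(probs, dtype=float), eps, 1 - eps)
    y = np.asarray(targets, dtype=float)
    return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))

def focal_loss(probs, targets, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean binary focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t).

    gamma and alpha are assumed defaults, not values from the paper.
    """
    p = np.clip(np.asarray(probs, dtype=float), eps, 1 - eps)
    y = np.asarray(targets, dtype=float)
    p_t = np.where(y == 1, p, 1 - p)              # probability given to the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)  # class-balancing weight
    return float(np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t)))
```

The factor `(1 - p_t) ** gamma` shrinks the contribution of labels the model already predicts confidently, so training concentrates on hard or rare labels. With `gamma=0` and `alpha=0.5` the expression reduces to half the BCE value.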
The proliferation of fake images on the internet has become increasingly alarming. Advanced techniques including generative adversarial networks can generate visually realistic images that can mislead people and create false information. This poses a threat and can cause serious impacts. Many methods based on deep learning have been proposed to detect fake images. These methods have demonstrated the ability to achieve highly accurate results in detecting fake images. However, due to their “black-box” nature, there is a lack of explainability of the decision-making process in these models. In this paper, we integrate the convolutional block attention module into ResNet-18 to improve the explainability of the deep learning model for fake image detection. The results showed that our method achieved a higher performance compared to the baseline method.
Plant invasion presents a substantial challenge to ecosystems on a global scale, thus requiring the development of effective categorization techniques for precise recognition and control. This article offers an in-depth investigation of plant invasion categorization using the Random Forest (RF) approach. The fundamental principles of RF and its relevance to invasion ecology are deliberated. By means of a methodical examination, we assess the effectiveness of RF in discerning invasive plant species from non-invasive ones based on a variety of ecological characteristics. Our results emphasize the efficiency of RF in classification assignments, underscoring its potential as a valuable instrument for the management of invasive species and conservation endeavors.
The offshore Renminbi (CNH) exchange rate against the United States dollar (USD) better reflects the immediate changes in market supply and demand and investor sentiment due to its exemption from foreign exchange control in mainland China. This paper explores how to improve the forecasting accuracy of the offshore RMB/USD exchange rate by constructing an investor sentiment index based on online forums and news comments. This paper first collects and analyzes many financial news headlines on the English for Treasury website, and applies the BERT model from natural language processing to identify and quantify the sentiment tendencies in the news headlines, in order to construct a daily investor sentiment index. Subsequently, this sentiment index is combined with traditional financial market and macroeconomic indicators, and a variety of advanced machine learning and deep learning methods, including Random Forest, Support Vector Machines, Long Short-Term Memory Networks (LSTM), and Gated Recurrent Units (GRUs), are applied to forecast the offshore RMB exchange rate. It is found that the introduction of the sentiment index significantly improves the accuracy of the prediction models. Especially in the LSTM and GRU models, the inclusion of the sentiment index makes the models perform better in capturing the nonlinear features of exchange rate fluctuations.
Image recognition using deep learning, especially the deep convolutional neural network (DNN), has been of great benefit due to the increasing use of computer vision in daily life. In this paper, we introduce a novel framework for an image recognition system based on quaternion fractional-order radial orthogonal moments and deep learning. The proposed image recognition system is derived by combining quaternion fractional-order polar harmonic-Fourier moments (QFr-PHFMs) and a Micro-Convolution Neural Network (Micro-CNN). The proposed methodology can use a small number of network layers and achieve high recognition accuracy, especially when processing images subject to heavy noise and smoothing filters. Theoretical analyses and experimental results showed that the proposed methodology offers enhanced image recognition compared with different moment-based feature-extraction algorithms and existing CNN methods.
The proliferation of generative artificial intelligence (AI) technologies has introduced challenges related to the use of automatically generated texts within foreign language learning settings. This study is aimed at developing clear-cut principles of incorporating large language models (LLM) into the English teachers’ workflow. Our approach employed examining research papers and conducting a small-scale study designed to obtain and evaluate texts produced by state-of-the-art LLMs in response to the General English course writing assignments. The experiment revealed peculiarities and limitations of the chosen LLMs in generating texts for study purposes. The educational potential of this technology as well as suggested conditions for its effective integration into instructional practices were also presented. We concluded that the transformation of writing assignments is necessary, with a focus on fostering critical thinking skills. Furthermore, fostering students’ information and communication technology (ICT) skills while engaging with an LLM chatbot is considered of paramount importance.
Recently, in the field of natural language processing, there has been an increasing emphasis on enhancing the language expressiveness of generative pre-training models. To address this challenge, this paper proposes an approach that involves specified identities or roles and evaluation metrics. By introducing specified identities or roles, the model can adopt various communicative roles tailored to specific contexts and needs, thereby better adapting to different scenarios. In terms of model evaluation, we used ten sample models and provided each model with 3000 questions. Other models and humans rated the answers given by the model on a scale of 0 to 10, and the average score was then obtained. This average score is provided as feedback to the model, encouraging it to reflect and provide more accurate answers. Finally, the paper explores the potential application prospects of this approach in human-computer dialogues, personalized Q&A systems, and other domains, demonstrating its value in enhancing natural language processing technology.
Early detection of suicidal ideations is one of the key suicide prevention strategies. However, there are challenges that obstruct the detection of suicidal ideations. Mainly, the stigma surrounding mental health, and suicide in particular, obstructs traditional risk screening methods, such as questionnaires and interviews. These methods rely on at-risk individuals to explicitly communicate their suicidal thoughts. At the same time, people with suicidal ideations are increasingly turning to online forums such as Reddit to share their experiences and seek emotional support. Consequently, these platforms have emerged as a large source of textual data for detecting suicidal ideations using machine learning and natural language processing methods. This paper aims to explore the effectiveness of transformer-based models for detecting suicidal ideations on Reddit forums. In this study, the transformer models were fine-tuned and compared against machine learning and deep learning baselines. Our experimental results show that the fine-tuned base-BERT transformer model demonstrates superior performance in detecting suicidal ideations compared to baseline machine learning and deep learning models, achieving an F1-score of 99%.
Gold futures, as an essential hedge asset, have received much attention. In this paper, the VMD-reconstruction-integration framework combining rolling windows is proposed to predict gold futures prices. The Fine to Coarse (FTC) method is used to reconstruct the decomposed IMFs, and the non-parametric regression (NR) model and extreme learning machine (ELM) model are used to predict the reconstructed long-term trend term and short-term disturbance term respectively, which effectively avoided the information leakage problem in the decomposition process and improved the prediction accuracy of gold futures price. The empirical results show that after avoiding the decomposition leakage problem, the model under the decomposition framework still has a certain improvement effect. In addition, the R-VMD-NR/ELM model has the best prediction effect, and compared with the R-VMD-NR/SVR model, the six evaluation criteria improved by 0.8883, 9.7188, 0.0021, 1.0492, 0.008, and 0.02, respectively.
The study proposed a model for monitoring temperature anomalies in industrial machinery based on image and thermography data, focusing on motors and electrical panels. It aims to enhance the operational stability of the machine through multimodal data fusion analysis using Convolutional Neural Networks (CNN). The model includes four parts: data collection, data preprocessing, anomaly detection model training, and deployment.
This work is a preliminary investigation of canal semantic segmentation as the visual intelligence for an SSSB (Self-Sailing Sweeper Boat), allowing it to visualize the restricted region in which it must sail while sweeping the canal. Its contribution and novelty lie in addressing night-scenario canal segmentation under low-light conditions, in which the boundaries between the canal region and the bank cannot be precisely distinguished for ground-truth labeling prior to training and validation, which in turn decreases segmentation efficiency. To this end, this work investigates contrast enhancement with both Histogram Equalization (HE) and Adaptive Histogram Equalization (AHE), together named AH2E2, on night-scenario canal images to increase boundary visibility and so support precise labeling of the ground-truth region for canal segmentation. Three U-Net-based approaches, namely the Primordial U-Net, ResU-Net, and AttresU-net, are then combined with HE and AHE, giving six combinations for evaluation. To further inspect the trade-off between training cost and segmentation efficiency in terms of the number of epochs and the number of ground-truth labels used, the experiments vary the number of ground-truth labels from 150 to 750 in steps of 150 and fix the number of epochs at 100. Experimental results show that HE contrast enhancement with the AttresU-net learning approach achieves better segmentation efficiency than the other five combinations. However, HE+AttresU-net also incurs a higher training cost than the other five combinations. Further discussion is given in the experiments.
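As an illustration of the contrast enhancement step mentioned in this abstract, the following is a minimal NumPy sketch of textbook global histogram equalization for an 8-bit grayscale image. It is not the authors' implementation, the function name `equalize_histogram` is hypothetical, and the adaptive variant (AHE), which applies the same idea per local tile, is not shown.

```python
import numpy as np

def equalize_histogram(image):
    """Global histogram equalization for an 8-bit grayscale image.

    Maps each intensity through the normalized cumulative histogram so
    that the used intensities are spread over the full 0..255 range.
    """
    img = np.asarray(image, dtype=np.uint8)
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]   # CDF value at the darkest intensity present
    total = img.size
    if total == cdf_min:        # constant image: nothing to stretch
        return img.copy()
    lut = np.round((cdf - cdf_min) / (total - cdf_min) * 255)
    lut = lut.clip(0, 255).astype(np.uint8)
    return lut[img]             # apply the lookup table pixel by pixel
```

For a dark frame whose intensities all lie between 10 and 40, the darkest pixels map to 0 and the brightest to 255, which is the kind of boundary stretching that makes a faint canal edge easier to label.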
In recent years, speech recognition technology and image recognition technology have gradually become the main means of human-computer interaction, and research on speech recognition against noisy backgrounds has also gradually emerged. Although the recognition accuracy of isolated words has reached 99% in the testing environment, from a practical perspective, the accuracy of speech recognition is significantly reduced under the influence of background noise. In order to further improve the accuracy of language recognition, this article designs and implements a multimodal language recognition system based on a Markov model, which runs on Windows 10 and is written in the C++ language. The audio features selected are MFCC features and FBank features, while the image features selected are the geometric and shape features of the lip fitting curve. It has been verified that the multimodal language recognition system performs better than pure speech recognition in noisy environments.
Compared with the monocular-based object detection approach, the binocular-based one can exploit much richer cues such as depth information. However, existing binocular-based methods typically require calculating explicit disparity maps of scenes as a depth cue, which brings extra computational cost and may cause errors in intermediate depth inference. In order to overcome these shortcomings, we propose an end-to-end neural network for binocular-based object detection. The network has an asymmetric two-stream architecture. One stream, called the Implicit Depth Mining Network (IDMN), takes charge of depth cue extraction from stereo images. The other stream, called the Multi-Modal Detection Network (MMDN), exploits the appearance cue from a monocular image and then fuses the appearance cue and depth cue for object detection. Such a model exploits depth information but does not need to explicitly calculate the disparity map or depth map, so it can work efficiently in practice. Experimental results indicate that our method achieves a good trade-off between effectiveness and efficiency.
Cardiovascular disease (CVD) remains a leading cause of global mortality, requiring accurate and early diagnosis. This study proposes a ConvNeXt-based multi-module feature fusion approach that enhances feature extraction, interpretability, and spatial-temporal representation in ECG classification. Using the MIT-BIH dataset, ECG signals are denoised with Discrete Wavelet Transform (DWT) and balanced using SMOTE, then encoded into GASF, GADF, and MTF images. These images are fused to improve generalization within the 2D module. The first module uses a ConvNeXt architecture with CBAM for feature extraction from the fused 2D representations, while the second combines CNN, SENet, and BiLSTM to analyze 1D ECG signals. This fusion maximizes the benefits of both data types, leading to robust cardiac abnormality detection. The approach achieves training and validation accuracies of 99.97% and 99.90%, respectively, demonstrating its potential for practical cardiovascular diagnostics.
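The encoding of a 1D signal into GASF and GADF images mentioned in this abstract follows a standard construction, sketched below in NumPy. The function name `gramian_angular_fields` is hypothetical and the min-max rescaling to [-1, 1] is the usual convention, assumed here rather than taken from the paper. The MTF encoding and the image fusion step are not shown.

```python
import numpy as np

def gramian_angular_fields(signal):
    """Return (GASF, GADF) matrices for a 1D signal.

    Steps: rescale to [-1, 1], map each value to an angle phi = arccos(x),
    then GASF[i, j] = cos(phi_i + phi_j) and GADF[i, j] = sin(phi_i - phi_j).
    """
    x = np.asarray(signal, dtype=float)
    lo, hi = x.min(), x.max()
    if hi == lo:
        raise ValueError("signal must not be constant")
    x_scaled = 2.0 * (x - lo) / (hi - lo) - 1.0
    x_scaled = np.clip(x_scaled, -1.0, 1.0)   # guard against rounding outside [-1, 1]
    phi = np.arccos(x_scaled)
    gasf = np.cos(phi[:, None] + phi[None, :])
    gadf = np.sin(phi[:, None] - phi[None, :])
    return gasf, gadf
```

A segment of n samples yields two n-by-n images: the GASF is symmetric and the GADF is antisymmetric, so the pair preserves temporal ordering along the diagonal direction while exposing pairwise relations that a 2D network can learn from.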
To address the problem of day-ahead NOx emission prediction for thermal power units, a sequence-to-point model, SPCLPM, which combines Conv1D and LSTM, is proposed. The model is fed with 12 features selected according to the NOx generation mechanism. A one-dimensional convolutional network is used to automatically extract the dependencies between the selected features while maintaining the chronological order. LSTM is used to extract the temporal characteristics, and the prediction results are output through the fully connected layer. The model is trained on the monitoring data of four thermal power units. The experimental results show that the prediction performance of SPCLPM is significantly better than that of the LSTM model without one-dimensional convolution and the traditional random forest model, and that it can more accurately track and predict the trend of NOx emissions over the next 24 hours.
The potential of Chinese high-spatial-resolution satellite imagery and deep learning models for coastline extraction was explored in this work. The performances of three deep learning models, U-Net, ResUnet, and SegNet, were compared using 2-m resolution pansharpened products of GaoFen-1 (GF1) and Ziyuan-3 (ZY3) imagery. The prediction results of ResUnet were significantly more accurate than those of U-Net and SegNet. The trained ResUnet model was then used to predict and extract the coastlines of Haikou City, Hainan Province. The 2-m resolution coastline products of Haikou City in 2016, 2018, and 2019 were obtained. The results showed no significant changes in the coastline of Haikou City from 2016 to 2019.
In recent years, various digital currencies have emerged, among which Bitcoin has been widely accepted as an alternative to sovereign currencies for commodity trading. However, the dramatic volatility of Bitcoin prices can pose a risk to global financial markets. In this paper, we first construct a more comprehensive forecasting index system from seven aspects, and then construct a VMD-GRU model. This model uses variational mode decomposition (VMD) to decompose the time series into intrinsic mode functions (IMFs) and uses a gated recurrent unit (GRU) to forecast the different IMFs. This paper also compares the forecast results with classical machine learning models and deep learning models, and the results show that the forecast accuracy of the VMD-GRU model is more than 16% better than that of the other models.
The BCI system based on Steady-State Visual Evoked Potential (SSVEP) is widely used because of its high system stability, information transfer rate (ITR), and accuracy. Accuracy is an important indicator for evaluating BCI system performance. There has been a lot of research on the effect of SSVEP paradigm parameters, such as frequency, shape, and color, on recognition accuracy. In this paper, we focus on the influence of the layout of different frequencies on the performance of BCI systems. We carried out experimental verification in an SSVEP-based BCI system. The experimental results show that adjusting the frequency layout improves the recognition accuracy of the system from 93.61% to 96.11%.