Ebook: Fuzzy Systems and Data Mining VIII
Fuzzy logic is vital to applications in the electrical, industrial, chemical and engineering realms, as well as in areas such as management and the environment. Data mining is indispensable in dealing with big data, massive data, and scalable, parallel and distributed algorithms.
This book presents papers from FSDM 2022, the 8th International Conference on Fuzzy Systems and Data Mining. The conference, originally scheduled to take place in Xiamen, China, was held fully online from 4 to 7 November 2022 due to ongoing restrictions connected with the COVID-19 pandemic. This year, FSDM received 196 submissions, of which 47 papers were ultimately selected for presentation and publication after a thorough review process that took into account novelty and the breadth and depth of the research themes falling within the scope of FSDM, resulting in an acceptance rate of 23.97%. Topics covered include fuzzy theory, algorithms and systems, fuzzy applications, data mining, and the interdisciplinary field of fuzzy logic and data mining.
Offering an overview of current research and developments in fuzzy logic and data mining, the book will be of interest to all those working in the field of data science.
This book presents papers from FSDM 2022, the 8th International Conference on Fuzzy Systems and Data Mining. The conference, originally scheduled to take place in Xiamen, China, was held fully online during 4–7 November 2022, due to ongoing restrictions connected with the COVID-19 pandemic.
FSDM has had a short but very intense history leading up to this 8th edition of the conference series, whose proceedings are published in the prestigious Frontiers in Artificial Intelligence and Applications (FAIA) book series by IOS Press.
FSDM 2015 [1] was held in Shanghai (China), FSDM 2016 [2] took place in Macau SAR (China), FSDM 2017 [3] went to Hualien, FSDM 2018 [4] travelled to Bangkok (Thailand), FSDM 2019 [5] was held in Kitakyushu City (Japan), FSDM 2020 [6] was held online although initially scheduled for Xiamen (China), and finally FSDM 2021 [7] was moved to a virtual conference instead of the initially planned venue in Seoul (South Korea).
All papers were carefully reviewed by the technical program committee (TPC) and reviewers, bearing in mind the quality, novelty, breadth and depth of the research themes falling within the FSDM scope. I am very glad to report that FSDM received 196 valid submissions this year. After an intense discussion stage, the committee of experts decided to accept 47 papers, which represents an acceptance rate of 23.97%. The profile of the contributing authors is remarkable, and the number of full professors among them is very high.
Furthermore, I am very grateful to everyone, especially the program committee members and reviewers, who devoted their time to assessing the papers.
My particular thanks and regards also go to the FAIA series editors for supporting this conference.
22nd September 2022
Antonio J. Tallón-Ballesteros
University of Huelva
Huelva, Spain
References
[1] G. Chen, F. Liu, & M. Shojafar (Eds.). (2016). Fuzzy Systems and Data Mining: Proceedings of FSDM 2015 (Vol. 281). IOS Press.
[2] S.L. Sun, A.J. Tallón-Ballesteros, & D.S. Pamučar (Eds.). (2016). Fuzzy Systems and Data Mining II: Proceedings of FSDM 2016 (Vol. 293). IOS Press.
[3] A.J. Tallón-Ballesteros, & K. Li (Eds.). (2017). Fuzzy Systems and Data Mining III: Proceedings of FSDM 2017 (Vol. 299). IOS Press.
[4] A.J. Tallón-Ballesteros, & K. Li (Eds.). (2018). Fuzzy Systems and Data Mining IV: Proceedings of FSDM 2018 (Vol. 309). IOS Press.
[5] A.J. Tallón-Ballesteros (Ed.). (2019). Fuzzy Systems and Data Mining V: Proceedings of FSDM 2019 (Vol. 320). IOS Press.
[6] A.J. Tallón-Ballesteros (Ed.). (2020). Fuzzy Systems and Data Mining VI: Proceedings of FSDM 2020 (Vol. 331). IOS Press.
[7] A.J. Tallón-Ballesteros (Ed.). (2021). Fuzzy Systems and Data Mining VII: Proceedings of FSDM 2021 (Vol. 340). IOS Press.
Visualization is claimed to be one of the essential “V’s” of Big Data, since it allows data to be presented in a human-friendly way and is therefore a stepping-stone for the Big Data mining process. Visual analytics, in turn, ensures knowledge discovery from the data through cognitive graphics and filtering capabilities. But to be efficient, visualization and analytics tools have to consider the other Big Data “V’s” by handling large data volumes, keeping up with data growth and changing velocity, and adapting to the variety of data representation formats. We propose using ontology engineering methods to create a visual analytics platform controlled by an ontological knowledge base that describes supported data types, input formats, data filters, visual objects, and visualization algorithms, as well as the available communication protocols and the computing nodes the platform modules can run on. This makes it possible to introduce new functions and distributed computation scenarios to the platform on the fly simply by extending the underlying domain ontologies, without changing the source code of the platform’s core. The analytics flow inside the platform is described by task ontologies, enabling a semantic data mining process. As a result, seamless integration with different data sources is achieved, including plain files, databases, and even third-party software and hardware solvers. We demonstrate the viability of the proposed approach by solving several data mining and fuzzy classification problems, including the assessment of citizens’ regional identity according to the mental maps they draw and the reconstruction of the ontogenesis of the extinct synapsid ‘Titanophoneus potens’ Efremov, 1938.
In financial planning problems, the determination of the best investment is one of the interesting optimization models. In the proposed work, an investment problem (IP) is introduced in a vague environment. The vagueness in the return parameter is characterized by a normalized heptagonal fuzzy number (HFN). One of the suitable interval approximations, namely an inexact rough interval of a normalized HFN, is utilized. The inexact rough interval investment problem is then considered, and a dynamic programming (DP) approach is developed and applied to optimize the fuzzy investment problem. The concept of the “rough interval number” is adopted in the mathematical modelling framework of the proposed problem to represent the rough data as an inexact rough interval of piecewise quadratic fuzzy numbers. The DP approach is then applied to compute a rough interval solution. Finally, a numerical example illustrates the utility of the approach for the decision-maker in a real-world problem: the total optimal return on a $10 million investment, expressed as an inexact rough interval, is $[[1.69, 2.08]: [1.75, 1.91]] million.
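To make the dynamic programming idea concrete, the following minimal sketch (not the authors' formulation) allocates a small budget across projects by DP while carrying interval-valued returns; the project return tables, the budget, and the midpoint ranking rule are illustrative assumptions.

    # Illustrative DP over interval-valued returns (assumed data, not from the paper).
    def add(a, b):                     # interval addition
        return (a[0] + b[0], a[1] + b[1])

    def mid(a):                        # rank intervals by their midpoint
        return (a[0] + a[1]) / 2.0

    # returns[p][x] = (low, high) return when x capital units go to project p
    returns = [
        {0: (0.0, 0.0), 1: (0.3, 0.5), 2: (0.7, 0.9)},
        {0: (0.0, 0.0), 1: (0.4, 0.6), 2: (0.6, 1.0)},
    ]
    BUDGET = 2

    # best[b] = best interval return achievable with budget b over the projects seen so far
    best = {b: (0.0, 0.0) for b in range(BUDGET + 1)}
    for table in returns:              # each DP stage considers one project
        best = {b: max((add(best[b - x], r) for x, r in table.items() if x <= b), key=mid)
                for b in range(BUDGET + 1)}

    print("optimal return interval:", best[BUDGET])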
Spatial sub-frequent co-location patterns reveal the rich spatial relationships of spatial features and instances, and are widely used in real applications such as environmental protection, urban computing, and public transportation. Existing sub-frequent pattern mining methods cannot distinguish patterns whose row instance spatial distributions are significantly different. Additionally, patterns whose row instances are tightly located in a local area can further reveal the particularity of that area, such as special environments and functions. Therefore, this paper proposes mining Local Tight Spatial Sub-frequent Co-location Patterns (LTSCPs). First, a relevancy index is presented to measure the local tightness between sub-frequent pattern row instances by analyzing mutual participation instances between row instances. The concept of LTSCPs is then proposed, followed by an algorithm for mining these LTSCPs. Finally, a large number of experiments are carried out on synthetic and real datasets. The results show that the algorithm for mining LTSCPs is efficient and that LTSCPs are practical.
To address the low detection accuracy of the FCOS network for objects with non-salient features, a new object detection method based on an attention mechanism is proposed to improve the performance of the FCOS network by effectively guiding it to focus on detailed features. According to verification experiments on the KITTI dataset, the AP values of the improved attention-based method for car and person detection are 1.1% and 4.9% higher, respectively, than those of the standard FCOS network, and the average category accuracy is improved by 3%. The experimental results thus show the effectiveness of the proposed method.
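The abstract does not specify which attention design is used; purely as an illustration, the sketch below shows a squeeze-and-excitation-style channel attention block of the kind commonly inserted into detection backbones such as FCOS (the module name, reduction ratio and tensor sizes are assumptions).

    # Illustrative channel attention block (an assumption, not the paper's module).
    import torch
    import torch.nn as nn

    class ChannelAttention(nn.Module):
        def __init__(self, channels, reduction=16):
            super().__init__()
            self.fc = nn.Sequential(
                nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
                nn.Linear(channels // reduction, channels), nn.Sigmoid(),
            )

        def forward(self, x):                      # x: (batch, channels, H, W)
            weights = self.fc(x.mean(dim=(2, 3)))  # squeeze: global average pool
            return x * weights[:, :, None, None]   # excite: reweight feature channels

    features = torch.randn(2, 256, 32, 32)         # a backbone feature map
    print(ChannelAttention(256)(features).shape)   # torch.Size([2, 256, 32, 32])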
The main purpose of credit risk assessment is to help financial institutions identify applicants with good credit and eliminate applicants with bad credit, minimizing the risk of capital loss and maximizing returns. Recent years have witnessed the excellent performance of machine learning in credit risk prediction. This paper extends previous research by applying two boosting algorithms, namely AdaBoost and XGBoost, to perform credit scoring on real data from Lending Club. Compared with two statistical methods and three individual classifiers, the results show that (i) AdaBoost and XGBoost obtain higher forecasting accuracy for credit risk, providing stronger discrimination ability; (ii) AdaBoost has a greater ability to discriminate the minority class (defaulters), which can reduce capital losses for institutions; and (iii) XGBoost is able to capture more potential benefits for institutions because it is more accurate in identifying the majority class, i.e., non-defaulters.
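As a hedged illustration of the kind of comparison described above (synthetic data and default hyperparameters, not the paper's Lending Club setup), the sketch below fits AdaBoost, XGBoost and a logistic regression baseline on an imbalanced credit-style dataset and compares their AUC scores.

    # Illustrative boosting-vs-baseline comparison (assumed data and settings).
    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_auc_score
    from xgboost import XGBClassifier   # assumes the xgboost package is installed

    # synthetic imbalanced data standing in for credit records (80% good, 20% bad)
    X, y = make_classification(n_samples=2000, n_features=20, weights=[0.8, 0.2], random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                        stratify=y, random_state=0)
    models = {
        "logit": LogisticRegression(max_iter=1000),
        "adaboost": AdaBoostClassifier(n_estimators=200, random_state=0),
        "xgboost": XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1),
    }
    for name, model in models.items():
        model.fit(X_train, y_train)
        auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
        print(name, "AUC:", round(auc, 3))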
With the development of fifth-generation mobile communication technology, a huge volume of mobile data has been generated, enabling a wide range of location-based services. As a result, user location prediction has attracted attention from researchers. However, existing methods have low accuracy due to the sparsity of user check-ins. In order to address this issue, we propose a method for user location prediction based on similar living patterns. We first obtain a vector representation of each user's living habits in order to cluster users with similar living patterns. Then, embedded vectors of the POI category and POI location are learned. Finally, we construct an activity prediction model and a location prediction model for each user cluster using a Gated Recurrent Unit (GRU). Experimental results on real user check-ins show that the proposed method outperforms the baseline methods in most cases.
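As an illustrative sketch only (the paper's exact architecture and feature set are not given in the abstract), the following GRU predicts the next POI category from a sequence of embedded check-ins for one user cluster; the embedding sizes and category count are assumptions.

    # Illustrative GRU-based next-activity predictor (assumed dimensions).
    import torch
    import torch.nn as nn

    class NextActivityGRU(nn.Module):
        def __init__(self, n_categories, emb_dim=32, hidden=64):
            super().__init__()
            self.embed = nn.Embedding(n_categories, emb_dim)   # POI-category embedding
            self.gru = nn.GRU(emb_dim, hidden, batch_first=True)
            self.head = nn.Linear(hidden, n_categories)        # next-category logits

        def forward(self, category_seq):                        # (batch, seq_len) integer ids
            h, _ = self.gru(self.embed(category_seq))
            return self.head(h[:, -1])                          # use the last hidden state

    model = NextActivityGRU(n_categories=50)
    logits = model(torch.randint(0, 50, (8, 12)))               # 8 users, 12 check-ins each
    print(logits.shape)                                         # torch.Size([8, 50])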
COVID-19 detection is an interesting field of study in the medical world, and the commonly used method is classification. In determining the best detection model, several classification architectures, such as SVM, KNN, and CNN, were utilized. The CNN architecture is highly configurable, allowing combinations of varying numbers of hidden layers and different activation functions and optimizers. Therefore, this study uses a deep CNN architecture that combines the Leaky ReLU activation function with three different optimizers: Adagrad, Adadelta, and Adamax. The results show that the combination of the Leaky ReLU activation function and the Adamax optimizer produces good and stable accuracy on the CRX and CT datasets.
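A minimal Keras sketch of the kind of configuration described (layer sizes, input shape and learning rate are assumptions, not the paper's network): a small CNN with Leaky ReLU activations compiled with the Adamax optimizer.

    # Illustrative CNN with Leaky ReLU + Adamax (assumed layer sizes and input shape).
    import tensorflow as tf
    from tensorflow.keras import layers, models

    model = models.Sequential([
        layers.Input(shape=(128, 128, 1)),            # grayscale chest image (assumed size)
        layers.Conv2D(32, 3), layers.LeakyReLU(0.1), layers.MaxPooling2D(),
        layers.Conv2D(64, 3), layers.LeakyReLU(0.1), layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(64), layers.LeakyReLU(0.1),
        layers.Dense(1, activation="sigmoid"),        # COVID vs non-COVID
    ])
    model.compile(optimizer=tf.keras.optimizers.Adamax(learning_rate=1e-3),
                  loss="binary_crossentropy", metrics=["accuracy"])
    model.summary()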
Community structure is one of the most important structural features of complex networks. However, most existing community division metrics only consider the relationships between nodes and do not consider the overall closeness inside and outside communities from a topological perspective. Persistent homology (PH) is a mathematical tool in computational topology which can capture high-dimensional topological features and is widely used in the analysis of complex networks. In this paper, we define a community partitioning metric based on persistent homology (CPH) and propose a CPH-based community division algorithm, which provides a new way to assess community partitioning performance. In the validation experiments, the Louvain algorithm is used to evaluate the community partitioning performance on social networks, and the results show that the CPH-based metric can measure the performance of community partitioning from a topological perspective, indicating that persistent homology can serve as a new way to describe community structure.
Tax is the main source of income for the State. However, managing tax collection effectively and limiting tax risks is a challenge for state tax authorities. This study applies machine learning to assess and predict firms with tax risks using a logistic regression algorithm. The data set includes 872 observations of firms in the Vietnamese market. The machine learning approach is used to classify the firms into two categories, with or without tax risk, based on six main factors: (i) revenue and other income; (ii) expenses; (iii) liquidity; (iv) assets; (v) liabilities; and (vi) equity. The results show that the machine learning method is effective and accurate in identifying and predicting risks in tax declarations. The authors recommend that tax agencies apply machine learning methods and go further with big data and artificial intelligence approaches to identify and classify enterprises.
Conformable fractional calculus is a promising area of research for information processing tasks such as natural language and material modelling, due to its ease of implementation. In this paper, we propose a fractional gradient descent method for the backpropagation training of neural networks. In particular, conformable fractional calculus is employed to evaluate the fractional differential gradient function instead of the classical differential gradient function. The results obtained on a large dataset with this approach yield an optimized algorithm that is faster and simpler to implement than the conventional one.
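For a rough idea of how a conformable fractional gradient step differs from the classical one, here is a minimal sketch (an illustrative assumption, not the paper's algorithm) using the conformable derivative T_alpha f(w) = w^(1-alpha) * f'(w) on a simple quadratic objective.

    # Illustrative conformable fractional gradient descent (assumed objective and settings).
    import numpy as np

    target = np.array([2.0, 0.5, 1.5])

    def grad(w):                       # classical gradient of f(w) = ||w - target||^2 / 2
        return w - target

    w = np.ones(3)                     # positive initialization, since the step uses |w|**(1-alpha)
    alpha, lr = 0.9, 0.1

    for step in range(200):
        frac_grad = np.abs(w) ** (1 - alpha) * grad(w)   # conformable fractional gradient
        w = w - lr * frac_grad

    print(np.round(w, 3))              # approaches the minimizer [2.0, 0.5, 1.5]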
Recently, shape-constrained classification has gained popularity in the machine learning literature as a way to exploit extra model information besides raw data features. In this paper, we present a new Lattice Linear Discriminant Analysis (Lattice-LDA) classifier, which can incorporate shape constraints on data inputs, such as monotonicity and convexity/concavity. Lattice-LDA constructs a nonparametric nonlinear discriminant hyperplane for classification using an additive combination of 1-D lattice functions (piecewise linear functions). Moreover, the new classifier supports complex shape constraints, including combinations of shapes or S-shapes. We optimize the model parameters using the Adaptive Moment Estimation (Adam) algorithm with embedded stepwise projections which guarantee feasibility of the shape constraints. Through simulations and real-world examples, we demonstrate that the new classifier can accurately recover the nonlinear marginal effect functions and improve classification accuracy when additional shape information is present.
Continuous, timely repair and replacement of infrastructure, equipment and utilities play an important role in maintaining the smooth running of a city or local community. To help individuals and businesses go about their daily activities with ease, it is therefore vital to develop a proper method for automatically identifying and assigning capable workers to tasks. This paper defines the community management service task allocation problem (CMS-TAP) and develops an end-to-end “recommendation + allocation” network, i.e. a task recommender and allocation optimization network (denoted TROpt-NET), for handling this problem. TROpt-NET consists of two layers: the TR layer, which predicts worker ability (“recommendation”), and the TA layer, which allocates tasks (“allocation”). Different from operations research approaches, where workers are assigned to jobs based on their pre-labelled skills and fixed locations, the TR layer is a task recommender system designed to learn implicit worker abilities for different tasks using Neural Collaborative Filtering (NCF) by mining a historical dataset of worker task completion, whereas the TA layer uses a differentiable optimization approach for allocation, whose differentiable property allows backpropagation to the prediction layer. In this study, we first formulate the CMS-TAP problem as a recommendation + optimization problem and then propose an end-to-end network architecture that tackles the problem in a real-world setting. TROpt-NET curbs uncertainty and assumptions in optimization by learning to approximate worker ability across different tasks more accurately. Additionally, the network can learn implicit worker abilities, enabling optimal utilization of workers across a wide range of tasks, which is often ignored in task allocation problems. We find that normalizing worker ability across all tasks improves the implicit learning capability of the network, and that good approximations do not always lead to optimal allocations, but learning allocations by backpropagating through recommendations improves the allocation objective. Offline experiments on a real-world large-scale dataset demonstrate the effectiveness of our proposed TROpt-NET.
Cloning attacks are harmful to RFID systems, so estimating the number of cloned tags is helpful for evaluating potential security risks in RFID systems. This paper studies the problem of estimating the number of cloned tags and presents a cardinality estimation scheme, CECT, for scenarios in which unknown tags and the capture effect exist. The CECT scheme requires an RFID reader to first predict the responses of the known tags using a virtual frame executed according to the ALOHA protocol. The reader then collects responses from active tags over a channel with the capture effect. Simulation results show that, under a given number of unknown tags and capture effect parameter, CECT can meet the required estimation accuracy and reliability, and that under the same parameters the accuracy is improved by more than 20%.
In order to provide decision support for student education management, this paper proposes a new student behavior analysis method based on the K-means algorithm and campus smart-card consumption data. An optimized Apriori algorithm is used to analyze the relationship between consumption behavior and academic performance. The method effectively provides management decision support for student managers, improves management efficiency, and supports a more intelligent management process.
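As a simple sketch of the clustering step (with synthetic consumption features and an assumed number of clusters, not the paper's data), K-means groups students by their smart-card consumption profiles:

    # Illustrative K-means clustering of consumption features (assumed features and k).
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    # rows: students; columns: [mean daily spend, meals per day, breakfast ratio] (synthetic)
    features = rng.normal(loc=[15.0, 2.5, 0.6], scale=[5.0, 0.5, 0.2], size=(500, 3))

    X = StandardScaler().fit_transform(features)
    labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
    print(np.bincount(labels))       # cluster sizes; each cluster is a consumption profile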
This paper considers the problem of designing fuzzy dynamic output controllers in parallel distributed form for Takagi-Sugeno fuzzy systems. After a brief section outlining the separation principle used, the typical matrix variable structures are identified in order to define the solution based on parameterized matrix inequalities. By creating a common framework for the closed-loop system, the applied separability defines two-part factorisations with a linearized set of linear matrix inequalities. The main results are illustrated in detail by an example to characterize the potential application of the method to systems with Takagi-Sugeno models.
Reducing the number of alerts and anomalies has been the focus of several studies, but automated anomaly detection using log files is still an ongoing challenge. One of the pertinent challenges in detecting anomalies from log files is dealing with unlabelled data: in existing approaches there is a lack of anomalous examples, and log anomalies can follow many different patterns. One solution is to label the data manually, but this can be a tedious task, as the data can be very large and log files are not easily understandable. In this paper, we present an automated anomaly detection model that combines supervised and unsupervised machine learning with domain knowledge. Our method reduces the number of alerts by accurately predicting anomalous log events based on domain expertise, which is used to create automated rules that generate a labelled dataset from unlabelled log records, which are unstructured and present in many different formats. This labelled dataset is then used to train a classification model that predicts anomalous log events. Our results show that we can accurately predict anomalous and non-anomalous events with an average accuracy of 98%. Our approach offers a practical solution for systems where logs are collected without any labelling, making it difficult to create an accurate model to identify anomalous log records. The methodology presented is fast and efficient and can provide real-time anomaly detection for time-critical environments.
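The sketch below illustrates the general idea under stated assumptions (the rules, log lines and classifier are placeholders, not the paper's): keyword rules derived from domain knowledge label raw log lines, and the auto-labelled set then trains a text classifier.

    # Illustrative rule-based labelling followed by classifier training (assumed rules/data).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    logs = [
        "INFO connection established to node-3",
        "ERROR disk quota exceeded on /var/data",
        "WARN retrying request after timeout",
        "FATAL kernel panic detected, rebooting",
    ]

    def rule_label(line):                              # placeholder expert rules
        keywords = ("ERROR", "FATAL", "panic", "exceeded")
        return int(any(k in line for k in keywords))   # 1 = anomalous

    labels = [rule_label(line) for line in logs]       # auto-generated labels

    clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
    clf.fit(logs, labels)                              # train on the auto-labelled data
    print(clf.predict(["ERROR unable to open file"]))  # classify a new, unseen log line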
In this paper, we study the preliminary test method in a linear regression model. The preliminary test Liu-type estimator is introduced for the case where it is suspected that the regression parameter may be constrained to a subspace. We also compare the preliminary test Liu-type estimator with the preliminary test estimator, the preliminary test ridge estimator and the preliminary test Liu estimator in the mean squared error sense.
Monitoring influenza activity can facilitate the development of prevention strategies and the effective allocation of public health resources. Traditional influenza surveillance methods usually have a time lag of 1 to 2 weeks. This study concerns the problem of nowcasting influenza-like illness (ILI) by comprehensively incorporating historical ILI records, Internet search data, and tourist flow information. A set of predictive models is adapted for ILI prediction, including autoregressive integrated moving average (ARIMA), autoregression with Google search data (ARGO), extreme gradient boosting (XGBoost), and linear regression (LR). To further improve prediction accuracy, a stacking-based ensemble approach is developed to integrate the prediction results from the different models. These methods are validated using ILI-related data from Taiwan Province of China at both the global and city levels. The results show that the stacking-based ensemble approach achieves the best nowcasting performance, with the smallest prediction errors at the global level (MAPE = 5.6%; RMSE = 0.16%; MAE = 0.08%). The developed approach is easily tractable and computationally efficient, and can be viewed as a feasible alternative for nowcasting ILI in areas where influenza activity has no constant seasonal trend.
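To illustrate the stacking step in isolation (synthetic features and stand-in base learners, not the paper's configuration), scikit-learn's StackingRegressor combines several base forecasters through a linear meta-learner:

    # Illustrative stacking ensemble for a nowcasting-style regression (assumed data/models).
    import numpy as np
    from sklearn.ensemble import StackingRegressor, GradientBoostingRegressor
    from sklearn.linear_model import LinearRegression, Ridge

    rng = np.random.default_rng(0)
    # features: e.g. lagged ILI rates plus a search-volume proxy (synthetic here)
    X = rng.normal(size=(300, 5))
    y = X @ np.array([0.5, 0.3, 0.1, 0.2, 0.4]) + rng.normal(scale=0.1, size=300)

    stack = StackingRegressor(
        estimators=[
            ("lr", LinearRegression()),
            ("gbm", GradientBoostingRegressor(random_state=0)),  # stand-in for XGBoost
            ("ridge", Ridge(alpha=1.0)),                         # stand-in for an ARGO-style model
        ],
        final_estimator=LinearRegression(),                      # meta-learner
        cv=5,
    )
    stack.fit(X[:250], y[:250])
    print(np.round(stack.predict(X[250:])[:5], 3))               # nowcasts for held-out periods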
Based on the definitions and properties of the fuzzy metric for random sets, we consider the limit theory of weighted sums of random sets in the sense of the fuzzy metric. The random sets are assumed to be independent and compactly uniformly integrable, and the weights are general constants. The convergence is in the sense of the fuzzy metric induced by the Hausdorff metric.
The persuasive techniques used in propaganda campaigns affect the Internet environment and our society, and detecting them has attracted broad attention in the natural language processing field. In this paper, we propose a novel emotion-enhanced, multi-level representation learning approach for multi-modal persuasive technique detection. To account for the emotional factors used in persuasive techniques, we embed the text and images using different networks and use a fully connected emotion-enhanced layer to fuse the multi-modal embeddings, where the type and strength of emotions are incorporated into the text embedding. To better model the multi-modal features of persuasive techniques, the fused features are fed into a split-and-share module where multi-level representations are employed to obtain better detection performance. Furthermore, we integrate the focal loss to alleviate the problem of data imbalance in persuasive technique detection. Experimental results on a publicly used dataset show that the proposed model is effective for multi-modal persuasive technique detection, and indicate the capability of our MPDES in extracting the deeper information contained in the two modalities.
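For reference, a minimal sketch of the focal loss commonly used against class imbalance (the alpha and gamma values are common defaults, not necessarily the paper's settings):

    # Illustrative binary focal loss: FL(p_t) = -alpha * (1 - p_t)**gamma * log(p_t).
    import torch
    import torch.nn.functional as F

    def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
        ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
        p_t = torch.exp(-ce)                       # probability assigned to the true class
        return (alpha * (1 - p_t) ** gamma * ce).mean()

    logits = torch.tensor([2.0, -1.0, 0.5])
    targets = torch.tensor([1.0, 0.0, 1.0])
    print(focal_loss(logits, targets))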
Recently, many research works have adopted machine learning to provide accurate predictions about the COVID-19 pandemic. In this paper, we design and develop a web system which adopts machine learning methodologies to provide data analysis and data visualization. In the analytics experiments within the system, we find that the SVM method outperforms the LR method in every use case. We propose a web-based, user-friendly and intuitive COVID-19 information hub, which can improve data accessibility for the public and allow more accurate decision-making to help fight the pandemic.
In this study, we propose a method to visualize the factors that contribute to the buzz phenomenon triggered by Twitter posts. The analysis included tweets, images, and replies. Replies are after-the-fact responses posted in reaction to a tweet and therefore cannot be used to predict buzz phenomena. In this study, the tweet body, images, and reply text were used as feature vectors, and an affective analysis model was constructed. Visualizing the relationship between the sensibility features output by this model and the number of retweets and likes (the echo index), which represent the scale of the buzz, is useful for analyzing the factors behind a post's popularity. We found that the subjective sensibility information with the most likes also tended to have a higher degree of similarity among the sensibility vectors.
The main problems of the traditional perceptron learning algorithm (PLA) are that it requires too many iterations, making it difficult to generate a model quickly, and that even more iterations are needed when the boundary between the two classes is close. In this paper, we improve the PLA by introducing the current weight into the update formulation, which can significantly accelerate the iteration. Experiments on different public datasets show that the proposed method greatly improves the speed of the traditional PLA.
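The exact update rule is not given in the abstract; the following sketch is an illustrative assumption of how a current-weight term (beta * w) could be folded into the classical perceptron update on separable synthetic data.

    # Illustrative perceptron with an extra current-weight term (assumed update and data).
    import numpy as np

    def train_perceptron(X, y, lr=1.0, beta=0.1, epochs=100):
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            errors = 0
            for xi, yi in zip(X, y):                 # yi in {-1, +1}
                if yi * (xi @ w) <= 0:               # misclassified sample
                    w = w + lr * yi * xi + beta * w  # classical step plus current-weight term
                    errors += 1
            if errors == 0:
                break
        return w

    rng = np.random.default_rng(0)
    labels = rng.choice([-1, 1], size=200)
    points = rng.normal(size=(200, 2)) + 2.5 * labels[:, None]   # two separable clusters
    X = np.hstack([points, np.ones((200, 1))])                   # append a bias column
    w = train_perceptron(X, labels)
    print("training accuracy:", np.mean(np.sign(X @ w) == labels))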
Climate change is becoming an important factor in policy making and economic development. Analyzing droughts and precipitation changes over regions, seasons and years provides insight into climate change patterns in many respects. With satellite imaging and the growing collections in the Google Earth Engine (GEE) [1] library, more information is available for discovering such patterns. Our research analyses precipitation in the United States using Google Earth Engine's ERA5 Monthly Aggregates [2] image collection. From this collection, different regions of the United States were selected in order to analyze both spatial and temporal patterns. Image sequence prediction and decision tree methods are used for spatial precipitation change patterns, and optical flow analysis is also explored for pattern tracking.
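As a small, hedged example of accessing that collection through the Earth Engine Python API (the region, date range and reduction scale are illustrative assumptions, not the study's setup):

    # Illustrative ERA5 monthly precipitation extraction for one region (assumed parameters).
    import ee

    ee.Initialize()                                        # requires an authenticated Earth Engine account
    era5 = (ee.ImageCollection("ECMWF/ERA5/MONTHLY")
            .select("total_precipitation")
            .filterDate("2000-01-01", "2020-12-31"))

    region = ee.Geometry.Rectangle([-125, 32, -114, 42])   # roughly the US West Coast

    def with_regional_mean(img):
        mean = img.reduceRegion(ee.Reducer.mean(), region, scale=27830).get("total_precipitation")
        return img.set("precip_m", mean)                   # monthly mean precipitation in metres

    series = era5.map(with_regional_mean)
    print(series.aggregate_array("precip_m").getInfo()[:5])  # first few monthly regional means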