Ebook: Fuzzy Systems and Data Mining III
Data science is proving to be one of the major trends of the second decade of the 21st century. Even though the term was coined by Peter Naur in the mid 1960s as ‘datalogy’, or the science of data, it is in the context of data analytics, and especially of big data, that data science has emerged as the new paradigm. Fuzzy and Crisp strategies are two of the most widespread approaches within the computational intelligence umbrella.
This book presents 65 papers from the 3rd International Conference on Fuzzy Systems and Data Mining (FSDM 2017), held in Hualien, Taiwan, in November 2017. All papers were carefully reviewed by program committee members, who took into consideration the breadth and depth of the research topics that fall within the scope of FSDM.
Offering a state-of-the-art overview of fuzzy systems and data mining, the publication will be of interest to all those whose work involves data science.
Data science is proving to be one of the major trends of the second decade of the 21st century. Even though the term was coined by Peter Naur in the mid 1960s as ‘datalogy’, or the science of data, it is in the context of Data Analytics and especially of Big Data that data science has emerged as a new paradigm. Fuzzy and Crisp strategies are two of the most widespread approaches within the Computational Intelligence umbrella.
The Fuzzy Systems and Data Mining (FSDM) conference series has become a consolidated event offering contemporary research conducted by leading experts in various aspects of Artificial Intelligence. FSDM is an annual international conference covering four main groups of topics:
• Data mining
• Fuzzy theory, algorithms and systems
• Fuzzy applications
• Interdisciplinary fields of fuzzy logic and data mining.
This thematic conference started two years ago. FSDM 2015 took place in Shanghai (China) in December. The proceedings were published in the remarkable book series Frontiers in Artificial Intelligence and Applications (FAIA) by IOS Press. A special issue was published in the Journal of Intelligent & Fuzzy Systems (IOS Press) as a post-conference follow-up. FSDM 2016 was held in Macau (China) in December and the proceedings were published as Vol. 293 in the FAIA series. In addition, one special issue was published in the Journal of Intelligent & Fuzzy Systems and another is in preparation for Filomat.
Following the huge success of the previous editions, the third conference in the FSDM series is being held in Hualien (Taiwan), where experts, researchers, academics and industry practitioners will present the latest developments in the field of Fuzzy Sets and Data Mining. Hualien City is the capital of Hualien County and is located on the east coast of Taiwan, on the Pacific Ocean. Its population exceeds 100,000. There are three universities in Hualien, of which National Dong Hwa University is generally considered the most outstanding.
This book contains the papers accepted and presented at the 3rd International Conference on Fuzzy Systems and Data Mining (FSDM 2017), held on 24–27 November 2017 in Hualien, Taiwan, and hosted by National Dong Hwa University. All papers were carefully reviewed by programme committee members, who took into consideration the breadth and depth of the research topics that fall within the scope of FSDM. Additionally, FSDM 2017 confirmed its status as a reference conference and attracted two remarkable keynote speakers: Prof. Hari Mohan Srivastava from Canada and Prof. Shun-Feng Su from Taiwan. The publication of a special issue in Filomat is scheduled.
I am very happy to announce that this year FSDM received more submissions than either the 2015 or 2016 editions. I would like to thank all the keynote speakers and authors for their effort in preparing a contribution for this leading international conference. Moreover, I am very grateful to all those who devoted time to evaluating the papers, especially the programme committee members and reviewers. It is a great honour to continue with the publication of these proceedings in the prestigious series Frontiers in Artificial Intelligence and Applications (FAIA) by IOS Press. Our particular thanks also go to J. Breuker, N. Guarino, J.N. Kok, R. López de Mántaras, J. Liu, R. Mizoguchi, M. Musen, S.K. Pal and N. Zhong, the FAIA series editors, for once again supporting this conference.
Finally, I hope you will enjoy your stay in Hualien City and at National Dong Hwa University, and have a magnificent experience in both places. The climate in Hualien is very mild and, according to historical weather statistics for November, the temperature is likely to be between 19 and 25 degrees Celsius, with around 10 rainy days in the month on average.
October 17th 2017
Antonio J. Tallón-Ballesteros
University of Seville (Spain)
Rough set theory is conceived as an effective mathematical tool for dealing with uncertain information. However, rough sets are not well suited to dynamic data, and several methods have been proposed to address this problem. The evolution model of granular decision is an extension of the classical rough set model to time series data. Its purpose is to study the rules of decision making over time series. The evolution model of granular decision shifts the focus from a static decision information system to dynamic time series, and studies the regularities of evolution in a decision information system that changes over time. It is a new method for studying decision information systems. The evolution model of granular decision has unique advantages in application, but it lacks a clear concept for the classification of attributes. This paper introduces the concept of a decision evolution set based on the evolution model of granular decision, and studies its effects in application.
This project builds a statistical model to predict who will win the 2017 NBA Most Valuable Player (MVP) Award for the regular season. The team collected three sets of raw data from public sports sources: (1) player statistics, (2) team winning percentage, and (3) historical MVP winners. Before building the prototype model, the player statistics were standardized to the Z scale in each statistical category in order to remove mean and standard deviation effects. This Z transformation eliminates bias or domination from any particular category, and the Z scale also expresses each player's performance relative to the other top NBA players in the same season. An "MVP Index" was derived by combining each player's Z statistics with equal weights as a "Uniform" model. To evaluate model accuracy, the team derived an "Accuracy Index" for predicting the top five MVP candidates recognized annually. The first, "Uniform", model predicts the top five with 47% accuracy. The team then derived a "Weighted" model by adding weight factors calculated from the dispersion/separation between the top two MVP winners and the players outside the top five; the weight factors reflect which statistical categories contribute most critically to the MVP Award selection process. The "Weighted" model improved the Accuracy Index from 47% to 52%. To further optimize the prediction accuracy, the authors added a "Team Winning" factor, since most historical MVP winners came from teams with the best or better regular-season records. The team winning factor was assessed in a "Power" model, with power = 0 (equivalent to the Weighted model), 1, 2, 3, 4, 5, 6, up to power = infinity (MVP from the best team); the MVP Index is multiplied by the team winning percentage raised to the chosen power. The power transformation improves model accuracy but may also raise over-fitting concerns as the power level increases. Based on the Power model, the team improved the Accuracy Index to 70% at power = 3; there is little benefit, but more over-fitting risk, in increasing the power level beyond 3. The authors also used data mining discriminant analysis to rank players into clusters; the discriminant model accuracy of 55% is not better than that of the Power model. The team will use the power = 3 model to predict the 2017 MVP on the first day of each month, starting from 1 December 2016 until the season ends in April 2017, with the final model prediction available around mid-April 2017. The same modelling technique can be applied to predict MVP winners for Olympic events as well as for other professional sports such as football, basketball, and soccer.
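The scoring pipeline described above (Z-standardization per category, a weighted MVP Index, and multiplication by the team winning percentage raised to a power) can be sketched in a few lines. This is a minimal illustration only; the column names, weights and numbers below are hypothetical and are not the authors' data.

```python
import numpy as np
import pandas as pd

def mvp_index(stats: pd.DataFrame, win_pct: pd.Series,
              weights: np.ndarray, power: int = 3) -> pd.Series:
    """Z-standardize each statistic, combine the z-scores with weights,
    then scale by team win% raised to a power (the "Power" model idea)."""
    # Z-transform every statistics column to remove mean/std effects.
    z = (stats - stats.mean()) / stats.std(ddof=0)
    # Weighted combination of the standardized statistics.
    index = z.mul(weights, axis=1).sum(axis=1)
    # Power model: multiply by the team winning percentage to the chosen power.
    return index * win_pct.pow(power)

# Hypothetical example with made-up numbers, for illustration only.
stats = pd.DataFrame({"pts": [28.1, 25.3], "ast": [11.2, 6.3], "reb": [10.4, 8.1]},
                     index=["player_A", "player_B"])
win_pct = pd.Series([0.67, 0.81], index=stats.index)
weights = np.array([0.4, 0.3, 0.3])          # assumed category weights
print(mvp_index(stats, win_pct, weights, power=3).sort_values(ascending=False))
```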
The aim of this project is to apply a series of pattern-detection data mining algorithms to accurately identify cheating by one or more students during classroom test exams. JMP software was used to analyze correlation among exam scores for 75 students (seated at 25 different tables due to space constraints) who took a multiple-choice assessment exam. During the exam, three students were seated per table, each given an exam with the same questions but arranged in a different order to prevent cheating; the three students at each table therefore received versions "A", "B" and "C" of the exam, respectively. Nonetheless, the possibility of cheating existed, since students could still synchronize the question sequence before and during the exam. To detect whether a pattern could be identified in the answer keys between students that was not attributable to chance alone (and could therefore be attributed to cheating), multivariate statistical tools were used to determine whether there was any association pattern among the students from the same exam table. Hierarchical clustering and a dendrogram tree were used to identify the grouping affinity behaviour related to exam cheating patterns: the clustering analysis groups students with similar answering patterns among the 75 students who took the exam, and a cheating pattern can be identified among the first few groupings if the students concerned were seated at the same test table during the exam. The authors also used JMP Graph Builder and a graphical heat map to identify and recognize patterns in exam scores among students through visual analysis. To further improve the prediction confidence of these combined tools, the authors also selected the top 20% of questions considered the most difficult (as identified by the instructor), in order to increase the detection signal-to-noise ratio: the probability of picking the same wrong answers on difficult questions by chance alone is much lower than that of picking the right answers on easy questions, so matching wrong answers on difficult questions provides very strong evidence of cheating. Principal component analysis was also used to identify pairs of students who cheated, with separation of pairs based on the top two principal components. From the analysis presented using this unique set of data mining tools, three tables are summarized in this paper, and all support evidence of cheating by the same student pairs. The predictive-model approach using data mining tools proved very powerful for analysing complex exam cheating patterns. This case study has been included in the standard curriculum of a graduate school course to discourage students from cheating on exams. The approach herein can also be used to study patterns in students' multiple-choice answers across subject matter, and to help instructors design their future curricula based on pattern recognition tools derived from these data mining algorithms.
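The clustering step described above can be illustrated with a small sketch: hierarchical clustering of students' answer vectors under a Hamming-style distance, so that pairs who merge at unusually small heights stand out. The data and threshold below are made up for illustration; the paper's actual analysis was carried out in JMP.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from scipy.spatial.distance import pdist

# Hypothetical answer matrix: one row per student, one column per question,
# with answers encoded as integers (0-3 for choices A-D).
rng = np.random.default_rng(0)
answers = rng.integers(0, 4, size=(75, 40))
answers[1] = answers[0]          # simulate a pair with identical answer keys

# Distance = fraction of questions on which two students disagree (Hamming).
dist = pdist(answers, metric="hamming")
tree = linkage(dist, method="average")

# Students that merge at unusually small heights are candidate cheating pairs.
labels = fcluster(tree, t=0.2, criterion="distance")
print(labels[:5])                # students 0 and 1 end up in the same cluster
# dendrogram(tree) can then be plotted to inspect the grouping affinity.
```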
Multiple alignments of strings have been extensively studied as an effective tool for analyzing string-type data such as DNA. In this paper, we generalize the notion of multiple alignments of strings and introduce
With the rapid development of the mobile Internet, research on the network behaviour of mobile terminal users has become an active area in the field of Internet research. Requirements for user experience are ever higher, and there is also huge demand for personalized pages to meet the needs of mobile terminals. Due to the huge number of users and the numerous interaction scenarios, the layout of the mobile terminal needs to be modified accordingly. These modifications, such as personalization changes required during system development, increase the workload and development cost, which affects the scalability and flexibility of the application being developed. This paper proposes a model of visual data mining based on the mobile terminal interface in Android open source. In order to better suit the needs of users, the model first uses Android open source technology to analyze the features and logic of the interactive page layout. Improvements to page interaction are made in terms of process, algorithm, timing and a pre-sorted priority mechanism. Through the establishment of tokenization and tree construction, a general layout parser is proposed in this paper, and a visual layout operation is realized, so that the configuration can be altered flexibly for different scenarios. The improvements realized by this study reduce user response time and workload, and improve the code reuse rate. The experimental results show that this model can simulate the concurrent operation of mobile terminal users, so that the average response time is reduced by around 45% and the success rate is increased by about 30%. At the end of the project, it was found from the analysis of the project's measurement data that the workload in the various stages of the software development process was significantly reduced, by 57% on average.
In this study, we present preliminary results on harnessing the fuzziness of yet another type of fuzzy rule-base. The rule-bases are based on the pragmatic rule-design (PRD), which has been proposed by the authors. The PRD is novel in that a pragmatic rule is neither an "IF-THEN" rule nor an artificial neural network, and does not represent a stimulus-response relation; a pragmatic rule is, in itself, a vector of relative characteristics of effective responses. In the original PRD, the fuzziness in discretizing a system state is excessive. Restricting such fuzziness may improve the performance of the rule-base, so a modification of the original PRD is proposed. Several PRD variants based on that modification are developed and evaluated through their application to elevator operation problems.
As the semantic web develops, the amount of available web data grows rapidly, and how to query useful information from these Web data quickly, efficiently and accurately remains a hot topic. To address this issue, we present a conjunctive query method for Linked Data based on semi-joins, in order to reduce the network overhead of transmitting the intermediate results produced by each data source. For data source selection, we exploit the query results of the basic graph pattern to precompute the cardinalities of the intermediate results, and present a new data source selection algorithm based on the "Vocabulary of Interlinked Datasets (voiD)" plus "SPARQL ASK" to improve the accuracy of data source selection. To reduce the query response time, we present a parallel semi-join algorithm. Extensive experiments show that our solution is more efficient and more effective than existing techniques.
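The core semi-join idea, shipping only the join-variable bindings to a remote source instead of the full intermediate result, can be sketched in plain Python over toy in-memory "sources". This illustrates the general technique only, not the authors' SPARQL implementation; the variable name `s` and the example endpoints are hypothetical.

```python
def semi_join(local_bindings, remote_query):
    """Semi-join idea: instead of shipping full intermediate results, send only
    the distinct join-key values to the remote source and fetch matching rows."""
    join_keys = {row["s"] for row in local_bindings}   # project the join variable
    remote_rows = remote_query(join_keys)              # remote source filters by keys
    keep = {row["s"] for row in remote_rows}
    return [row for row in local_bindings if row["s"] in keep], remote_rows

# Hypothetical in-memory "sources" standing in for SPARQL endpoints.
local = [{"s": "ex:alice", "name": "Alice"}, {"s": "ex:bob", "name": "Bob"}]
def remote_query(keys):
    data = [{"s": "ex:alice", "age": 30}, {"s": "ex:carol", "age": 41}]
    return [row for row in data if row["s"] in keys]

print(semi_join(local, remote_query))
```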
The Traveling Salesman Problem (TSP) has been a subject of study in operations research for more than 30 years. The TSP is considered NP-complete; consequently, many heuristic and metaheuristic algorithms have been developed to cope with the intractable nature of the problem. Although the problem is well studied, the lack of integrated software that harnesses the computational power of modern computers and provides an easy comparison between heuristic algorithms is noticeable. TSP solver is state-of-the-art software that provides a common framework to compare the performance of different algorithms over the TSPLIB library. Academics can focus on developing new methodologies without being concerned about the availability and correctness of algorithms reported in the literature. Practitioners may also benefit from the transparency provided by our software solution and build their own customized packages to grow their businesses. The proposed software can be a foundation for future implementations in which users design their algorithms and results are uploaded automatically to a public server together with the source code.
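As an example of the kind of constructive heuristic such a framework would compare, the classic nearest-neighbour heuristic can be sketched as follows; the toy instance is hypothetical and is not part of the described software.

```python
import math

def nearest_neighbour_tour(points):
    """Classic constructive TSP heuristic: always visit the closest unvisited city."""
    unvisited = list(range(1, len(points)))
    tour = [0]
    while unvisited:
        last = points[tour[-1]]
        nxt = min(unvisited, key=lambda i: math.dist(last, points[i]))
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

def tour_length(points, tour):
    """Total length of the closed tour (returning to the start city)."""
    return sum(math.dist(points[tour[i]], points[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

cities = [(0, 0), (1, 5), (5, 2), (6, 6), (8, 3)]   # toy instance, not TSPLIB
tour = nearest_neighbour_tour(cities)
print(tour, round(tour_length(cities, tour), 2))
```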
This paper uses the eGARCH-Copula model to examine the tail dependence and Value at Risk (VaR) of the log returns of US and Asian exchange indices as portfolio pairs in three periods: before, during and after the financial crisis. The results indicate that the eGARCH-Copula model works well for measuring the tail dependence and VaR between the US and Asian stock markets, and that after the financial crisis the dependence structure changed, including the tail dependence and the VaR.
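The abstract does not give formulas, but the tail dependence it measures can be illustrated with a simple empirical estimator applied to two toy return series after rank-transforming them to pseudo-observations, as copula methods do; the eGARCH filtering and copula fitting themselves are omitted here, so this is a sketch of the concept rather than the paper's model.

```python
import numpy as np

def lower_tail_dependence(x, y, q=0.05):
    """Empirical estimate of lower tail dependence:
    P(Y below its q-quantile | X below its q-quantile)."""
    # Rank-transform to pseudo-observations on (0, 1), as copula methods do.
    u = (np.argsort(np.argsort(x)) + 1) / (len(x) + 1)
    v = (np.argsort(np.argsort(y)) + 1) / (len(y) + 1)
    joint = np.mean((u <= q) & (v <= q))
    return joint / q

# Toy correlated series standing in for US/Asian index log returns.
rng = np.random.default_rng(1)
common = rng.standard_normal(2000)
x = common + 0.5 * rng.standard_normal(2000)
y = common + 0.5 * rng.standard_normal(2000)
print(round(lower_tail_dependence(x, y), 3))
```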
Rapid urban development leads to a series of urban diseases. Developments in remote sensing and the geographic information sciences generate a great variety of data and create great potential for urban disease diagnosis. In this paper, a framework for urban pulsation analysis is established to diagnose a specific urban disease. Urban pulsation analysis implements three tasks: optimizing an examinational index system for an urban disease, extracting spatio-temporal data sequences that correspond to the index system, and spatio-temporal data mining with urban constraints. This paper describes the essential technologies of these tasks together with perspectives on future trends.
Digital marketing has become more popular than ever. In order to understand the on-line behaviour of users and tackle the considerable web data they generate, the analysis of web browsing behaviour has drawn a lot of attention and challenges researchers. Web usage mining (WUM), exploiting data mining, text mining, machine learning and statistics, is a useful method for extracting hidden knowledge in order to better understand the various interests of web users. In this investigation, consumer browsing log data were collected via Google Analytics, and the unstructured browsing log was preprocessed into semi-structured text. For each session, all URLs were compacted into a sentence representing a consumer's browsing path, which was further processed by a hierarchical neural network and then renormalized by a Hopfield neural network to obtain stable weights between URLs. The two final visualization graphs highlight two types of browsing pattern denoting the intent of two groups of consumers: the graph with more links reflects a user as a visitor looking for interesting things, while the plainer graph reveals a user accessing the website via keywords at a search engine, as a consumer most likely to take action. The findings make a significant contribution to marketing strategy and to understanding the browsing behaviour and intent of on-line users.
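The Hopfield renormalization step mentioned above can be illustrated with a minimal sketch: Hebbian weights between URLs learned from bipolar session vectors, with repeated updates recalling the nearest stored browsing pattern. The encoding and the tiny data set are assumptions for illustration; the paper's hierarchical network and renormalization details are not reproduced.

```python
import numpy as np

# Each row: which of 6 URLs a session visited (1) or not (-1), bipolar encoding.
sessions = np.array([
    [ 1,  1,  1, -1, -1, -1],   # "browsing" type path
    [ 1,  1, -1, -1, -1, -1],
    [-1, -1, -1,  1,  1,  1],   # "search-and-act" type path
    [-1, -1, -1,  1,  1, -1],
])

# Hebbian weights between URLs; the diagonal is zeroed as usual for Hopfield nets.
W = sessions.T @ sessions / len(sessions)
np.fill_diagonal(W, 0.0)

def recall(pattern, steps=10):
    """Iterate the Hopfield update until the URL activation pattern stabilizes."""
    s = pattern.copy()
    for _ in range(steps):
        s = np.where(W @ s >= 0, 1, -1)
    return s

# A noisy partial session converges to the nearest stored browsing pattern.
print(recall(np.array([1, -1, 1, -1, -1, -1])))
```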
The Internet is becoming more and more important for nearly everybody, as it is one of the most forward-looking media for free, bidirectional knowledge and information exchange, compared with traditional media such as newspapers and television. In this paper, we analyze the browsing trends and interest levels of different population groups based on Web browsing behaviours reflected by Net View Data.
Net View Data is a log of Japanese users' web page access records. Each record contains the panel user's ID, access time, dwell time, accessed URL, etc. It is collected and provided by Nielsen Online (http://www.netratings.co.jp/) for research and investigation.
Most real-world problems are complex because of uncertainty in the form of ambiguity. To overcome ambiguity and imprecision, Zadeh [20] introduced fuzzy set theory in 1965. The underlying power of fuzzy sets is that linguistic variables, rather than quantitative variables, can be used to represent imprecise concepts with continuous transitions. Such continuous imprecise concepts need to be modelled by continuous fuzzy numbers instead of intervals and real numbers. Many researchers have applied triangular and trapezoidal fuzzy numbers, which are piecewise linear and continuous, to imprecise concepts because of their convenience and ease of use. However, imprecise variables in fields such as decision making, data analysis, fuzzy control, image processing and fuzzy clustering need not always be linear, so parabolic fuzzy numbers can be used to model nonlinear fuzzy concepts and obtain better accuracy in real-life problems. Since parabolic fuzzy numbers have not yet been completely studied, whereas triangular fuzzy numbers have been studied thoroughly, triangular approximations of parabolic fuzzy numbers may be used as a supplementary tool alongside parabolic fuzzy numbers. Approximation is a kind of de-fuzzification that replaces nonlinear functions with linear ones, reducing the fuzziness of ill-defined information and producing better accuracy. The study of approximations of parabolic fuzzy numbers is therefore necessary, and in this paper the problem of triangular approximation of parabolic fuzzy numbers is addressed using a distance function in terms of α-cuts. Some properties of parabolic fuzzy numbers that are useful in multi-criteria decision making (MCDM) are also discussed. Finally, pertinent illustrations and applications of the approximation of parabolic fuzzy numbers are given.
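For concreteness, a commonly used α-cut distance of this kind (an assumption about the general form; the paper's exact metric may differ) between a parabolic fuzzy number A and a candidate triangular approximation T is

d(A,T)^{2} = \int_{0}^{1}\Big[\big(A_{L}(\alpha)-T_{L}(\alpha)\big)^{2}+\big(A_{U}(\alpha)-T_{U}(\alpha)\big)^{2}\Big]\,d\alpha ,

where [A_L(α), A_U(α)] and [T_L(α), T_U(α)] denote the α-cuts of A and T; the triangular approximation is then obtained by minimizing d(A,T) over the parameters of T.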
Dead reckoning plays a significant role in pedestrian navigation algorithms, under the assumption of accurate heading estimation. This paper proposes a real-time heading compensation method based on a region-partition particle filter (RPPF) for pedestrian navigation systems. The RPPF algorithm computes heading corrections in real time using a particle filter: it establishes a functional relationship between stochastic pedestrian movement and a regular hexagonal heading constraint, then realizes heading compensation under the hexagonal constraint and enhances the heading accuracy. To validate the proposed algorithm, we conducted a walking experiment along closed curves; compared with the traditional strap-down attitude algorithm, the RPPF effectively reduces the heading error and truly reflects the pedestrian's trajectory curve. The positioning error of the RPPF is less than 2% of the travelled distance, which meets the positioning requirements for pedestrians.
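A heavily simplified sketch of the heading-compensation idea follows, assuming the hexagonal constraint means that dominant walking directions fall near multiples of 60 degrees; the actual RPPF's region partition, motion model and weighting are more elaborate, so this illustrates only the particle-filter mechanics.

```python
import numpy as np

rng = np.random.default_rng(2)
HEX = np.deg2rad(np.arange(0, 360, 60))          # assumed hexagonal dominant directions

def rppf_step(particles, gyro_heading, sigma=np.deg2rad(3)):
    """One particle-filter update: add process noise, weight each heading particle
    by agreement with the gyro measurement and by closeness to the nearest
    hexagonal direction, then resample."""
    particles = particles + rng.normal(0.0, sigma, size=particles.shape)
    hex_err = np.min(np.abs((particles[:, None] - HEX + np.pi) % (2 * np.pi) - np.pi), axis=1)
    gyro_err = np.abs((particles - gyro_heading + np.pi) % (2 * np.pi) - np.pi)
    w = np.exp(-0.5 * ((hex_err / sigma) ** 2 + (gyro_err / sigma) ** 2))
    w = w / w.sum()
    idx = rng.choice(len(particles), size=len(particles), p=w)
    return particles[idx]

# A gyro that consistently reads 62 degrees is softly pulled toward the
# 60-degree hexagonal direction, compensating a small heading bias.
particles = rng.normal(np.deg2rad(62), np.deg2rad(5), size=500)
for _ in range(20):
    particles = rppf_step(particles, gyro_heading=np.deg2rad(62))
print(np.rad2deg(particles.mean()))
```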
In this paper, we propose an approach for fault diagnosis of railway trains based on a combination of fractal theory and the k-means clustering technique. First, the fractal dimensions of the waveforms are calculated to analyze their singular characteristics. Second, k-means clustering is used to cluster the singular characteristics in order to determine the running state and perform fault diagnosis. The approach can effectively monitor the running state and safety performance of a railway train, and provides technical support for its safe operation and maintenance. The method has the advantages of strong real-time performance and high accuracy for fault classification, which gives it a certain reference value for the analysis of uncertain and irregular waveforms.
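The two stages described above, a fractal-dimension feature per waveform followed by k-means clustering, can be sketched as follows; the box-counting estimate and the toy waveforms are illustrative assumptions, not the paper's signals or exact dimension estimator.

```python
import numpy as np
from sklearn.cluster import KMeans

def boxcount_dimension(signal, scales=(2, 4, 8, 16, 32)):
    """Rough box-counting estimate of the fractal dimension of a 1-D waveform."""
    x = (signal - signal.min()) / (signal.max() - signal.min() + 1e-12)
    counts = []
    for s in scales:
        edges = np.linspace(0, 1, s + 1)
        t_bins = np.minimum((np.linspace(0, 1, len(x), endpoint=False) * s).astype(int), s - 1)
        v_bins = np.minimum(np.digitize(x, edges) - 1, s - 1)
        counts.append(len(set(zip(t_bins, v_bins))))   # occupied time/value boxes
    slope, _ = np.polyfit(np.log(scales), np.log(counts), 1)
    return slope

# Hypothetical waveforms standing in for train monitoring signals.
rng = np.random.default_rng(3)
t = np.linspace(0, 1, 2000)
normal = np.sin(2 * np.pi * 5 * t) + 0.05 * rng.standard_normal(t.size)
faulty = np.sin(2 * np.pi * 5 * t) + 0.6 * rng.standard_normal(t.size)

features = np.array([[boxcount_dimension(w)] for w in (normal, normal, faulty, faulty)])
labels = KMeans(n_clusters=2, n_init=10).fit_predict(features)
print(features.ravel().round(2), labels)   # the noisier (fault-like) waveforms cluster apart
```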
This paper studies the emergent properties of a complex adaptive system of group behaviour driven by individuals, using an agent-based simulation method with urbanization as an example. Urbanization is considered as a stochastic process on a graph. The payoff of an arbitrary city can be regarded as the payoff arising from its interaction with the other cities, and it depends on the city's local topological configuration. At the same time, urbanization is a complex adaptive system: when the system is under attack it exhibits strong criticality, and the critical probability of attack depends on the payoffs of the agents in the system.
In data assimilation for numerical weather prediction, observation errors are typically neglected or assumed to have zero correlations, resulting in a loss of information. To address the spurious correlations that arise in data assimilation with limited ensemble sizes, a new method is proposed that calculates observation distances and constructs equivalent observation position weights. Coupled with standard fuzzy control algorithms, a fuzzy control ensemble transform Kalman filter (FETKF) is presented, and the corresponding procedures are given. Within the framework of the classical Lorenz-96 chaotic model, we compare the performance of the ensemble transform Kalman filter (ETKF), the local ensemble transform Kalman filter (LETKF), and the proposed FETKF with varying physical parameters. Comparisons are drawn between the LETKF localization coefficients and the FETKF spatial distance vectors used to calculate the corresponding weight function of the chaotic model. The results show that the new method can eliminate spurious correlations, prevent long-range observation effects on the state update variables, and reduce analysis errors. The error-handling methods based on fuzzy control performed well under both perfect and imperfect model scenarios in the Lorenz-96 model.
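The Lorenz-96 test bed used in the comparison is standard and easy to reproduce; a minimal sketch of the model and an ensemble forecast step (without the ETKF/LETKF/FETKF analysis updates themselves) is given below, with assumed parameter values F = 8, 40 variables and a 20-member ensemble.

```python
import numpy as np

def lorenz96_rhs(x, forcing=8.0):
    """Right-hand side of the Lorenz-96 model:
    dx_i/dt = (x_{i+1} - x_{i-2}) * x_{i-1} - x_i + F, with cyclic indices."""
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + forcing

def rk4_step(x, dt=0.05, forcing=8.0):
    """One fourth-order Runge-Kutta step of the chaotic test model."""
    k1 = lorenz96_rhs(x, forcing)
    k2 = lorenz96_rhs(x + 0.5 * dt * k1, forcing)
    k3 = lorenz96_rhs(x + 0.5 * dt * k2, forcing)
    k4 = lorenz96_rhs(x + dt * k3, forcing)
    return x + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

# A small ensemble of perturbed states, as used by ETKF-style filters.
rng = np.random.default_rng(4)
truth = 8.0 * np.ones(40) + 0.01 * rng.standard_normal(40)
ensemble = truth + 0.1 * rng.standard_normal((20, 40))
for _ in range(100):
    truth = rk4_step(truth)
    ensemble = np.array([rk4_step(member) for member in ensemble])
print(ensemble.mean(axis=0)[:5])   # ensemble mean forecast of the first few variables
```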
Aiming at the shortcomings of infrared image recognition in intelligent substations, an improved target extraction scheme is proposed. First, Retinex theory is used to remove the influence of lighting and improve image brightness; then an improved histogram equalization method based on piecewise grayscale transformation is used to improve the image contrast. The device region is extracted by a region segmentation algorithm that combines connected-component extraction with region growing. The experimental results show that the scheme overcomes the drawbacks of existing methods and achieves a good segmentation effect compared with traditional methods. It is a practical target extraction method for substation equipment.
Rough set theory was born to analyze uncertain data in information systems. However, many important problems in rough sets are NP-hard, and most of them require greedy algorithms to solve. Matroids, a sophisticated mathematical structure, provide a good platform for greedy algorithms; hence, it is natural to integrate rough sets with matroids. In this paper, we establish a spanning matroidal structure, investigate some of its characteristics in different ways, and obtain some axiomatic characterizations of matroids through rough sets. First, a family of sets is defined by the upper approximation operator and proved to satisfy the spanning set axiom, so a matroid is induced by a rough set in this way; we call it the spanning matroid. Second, some characteristics of the spanning matroid, such as its closed sets, spanning sets and bases, are investigated using rough set and matrix approaches, respectively. Third, we investigate the axiomatic characterization of the upper approximation operator based on an equivalence relation from the viewpoint of matroids. Finally, based on rough sets, we obtain some axiomatic characterizations of the spanning matroid in different ways.
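The upper approximation operator from which the spanning sets are defined is the classical one; a minimal sketch of the lower and upper approximations on a toy information system is shown below (the matroid construction itself is not reproduced), with a made-up attribute and target set.

```python
from collections import defaultdict

def approximations(universe, attribute, target):
    """Lower/upper approximations of a target set under the equivalence
    relation 'same value of the given attribute' (classical rough sets)."""
    classes = defaultdict(set)
    for obj in universe:
        classes[attribute(obj)].add(obj)
    lower = {o for c in classes.values() if c <= target for o in c}   # classes inside target
    upper = {o for c in classes.values() if c & target for o in c}    # classes meeting target
    return lower, upper

# Toy information system: objects described by one attribute.
universe = {"a", "b", "c", "d", "e"}
colour = {"a": "red", "b": "red", "c": "blue", "d": "blue", "e": "green"}.get
target = {"a", "b", "c"}
print(approximations(universe, colour, target))
# lower = {'a', 'b'} (classes wholly inside the target),
# upper = {'a', 'b', 'c', 'd'} (classes intersecting the target).
```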
In this paper, a fuzzy predator-prey system is proposed by adopting fuzzy parameters in a predator-prey system. The steady states and linear stability of the predator-prey system are determined and analyzed. We show that the trivial steady state is unstable, while the semi-trivial steady state is locally asymptotically stable for all values of α under certain conditions.
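For concreteness, a standard crisp predator-prey system of the kind being fuzzified (an assumption about the general form; the paper specifies the exact model and the role of α in its fuzzy parameters) is

\frac{dx}{dt} = x(1 - x) - a\,x\,y, \qquad \frac{dy}{dt} = y\,(b\,x - d),

whose trivial steady state (0, 0), semi-trivial steady state (1, 0) and coexistence state (d/b, (1 - d/b)/a) follow from setting both right-hand sides to zero. Linearization gives eigenvalues 1 and -d at (0, 0), so the trivial state is unstable, and eigenvalues -1 and b - d at (1, 0), so the semi-trivial state is locally asymptotically stable when b < d.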
Effective feature extraction is very important for motor imagery (MI) electroencephalography (EEG) signal pattern recognition in brain-machine interface (BMI) or brain-computer interface (BCI) applications. The common spatial patterns (CSP) method is a frequently used machine learning algorithm for discriminative feature extraction in MI-related BMIs. Although CSP is a well-known and effective method in BMI applications, intrinsic variations in the EEG signal properties can affect its performance. To address this issue, instead of using the eigenvalues for spatial filter selection as in the traditional CSP approach, we present a novel criterion that considers not only the differences between classes but also the dissimilarities within the same class, in a data-driven manner, for robust spatial filter construction. We compared our method against CSP using a public dataset, namely dataset IVa of BCI Competition III. The results of the experiment indicate that our approach gives higher classification accuracy by reducing the within-class variations.
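For reference, the traditional eigenvalue-based CSP baseline that the proposed criterion is contrasted with can be sketched as follows; the trial dimensions and data are hypothetical, and the paper's novel within-class selection criterion is not reproduced here.

```python
import numpy as np
from scipy.linalg import eigh

def csp_filters(trials_a, trials_b, n_pairs=3):
    """Classical CSP: spatial filters from the generalized eigendecomposition of
    the class covariance matrices; the largest/smallest eigenvalues give filters
    whose output variance best discriminates the two classes."""
    def mean_cov(trials):
        return np.mean([np.cov(t) for t in trials], axis=0)   # t: channels x samples
    Ca, Cb = mean_cov(trials_a), mean_cov(trials_b)
    vals, vecs = eigh(Ca, Ca + Cb)                             # generalized eigenproblem
    order = np.argsort(vals)
    pick = np.r_[order[:n_pairs], order[-n_pairs:]]            # eigenvalue-based selection
    return vecs[:, pick].T

# Hypothetical EEG trials: (trials, channels, samples) for two motor-imagery classes.
rng = np.random.default_rng(5)
class_a = rng.standard_normal((30, 8, 200)) * np.r_[2.0, np.ones(7)][None, :, None]
class_b = rng.standard_normal((30, 8, 200)) * np.r_[np.ones(7), 2.0][None, :, None]
W = csp_filters(class_a, class_b)
features = np.log(np.var(W @ class_a[0], axis=1))   # log-variance features of one trial
print(W.shape, features.round(2))
```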
Due to the great variability of asthma symptomatology, medical teams find practical difficulties in determining the severity of asthma, which is very commonly encountered in daily medical practice. The objective of this work is to design a system that helps medical teams determine the severity of asthma; its use could reduce the time, effort and cost of categorizing asthma patients. Asthma severity diagnosis is currently done by an expert, a doctor, and the motivation is to relieve some of the burden on the medical team by providing a tool that determines the severity of asthma. One of the partial goals of the work is to model the asthma problem as a fuzzy problem, because many of the symptoms can be interpreted in a fuzzy way for the diagnosis. We model the problem using the RFuzzy framework, a Prolog-based tool for representing and reasoning with fuzzy information. The fact that several research efforts are under way to determine the level of asthma severity motivated us to use a fuzzy tool to try to automate it. The interest of our approach does not lie in the medical knowledge, which we obtained from medical collaborators; the value of our work is that we have found a way of representing, in a simple manner, the knowledge of any asthma expert for automatically classifying the severity of an asthma patient just by collecting some simple numerical data on the patient's symptoms. Any medical professional with different criteria for asthma classification can easily modify our system according to his or her knowledge and obtain the corresponding results. The system was developed with the participation of experienced asthma physicians and follows the Global Initiative for Asthma (GINA) guideline.
The Analytic Hierarchy Process (AHP) is a structured multiple criteria decision making (MCDM) tool for dealing with complex decisions, or decisions involving several decision criteria, alternatives and decision makers (DMs), combining subjective and objective evaluations to arrive at a decision. Since group decision making is a common scenario in business and collective wisdom can lead to better decisions, AHP has been extended to Group Decision Making (GDM). However, current AHP-GDM algorithms have limitations, such as the use of imprecise values in preference elicitation, the subjective assignment of weights to each DM, and inferior maximization of the decision makers' preferences. Furthermore, existing AHP-GDM methodologies have addressed the lack of precise values by introducing interval judgments, which limits the DMs in expressing their preferences. This paper presents a Non-Linear Programming (NLP) model that maximizes the preferences of the DMs while maintaining an acceptable level of inconsistency in a GDM setting. It also provides a way to determine the weights assigned to each DM based on subjective and objective criteria.
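For background, the classical single-DM AHP step that the GDM extension builds on, priorities as the principal eigenvector of a pairwise comparison matrix together with a consistency check, can be sketched as follows; the comparison matrix is hypothetical and the paper's NLP group model is not reproduced.

```python
import numpy as np

def ahp_priorities(pairwise):
    """Classical AHP: priorities are the principal eigenvector of the pairwise
    comparison matrix; the consistency ratio checks the DM's judgements."""
    A = np.asarray(pairwise, dtype=float)
    n = A.shape[0]
    vals, vecs = np.linalg.eig(A)
    k = np.argmax(vals.real)
    w = np.abs(vecs[:, k].real)
    w /= w.sum()
    ci = (vals.real[k] - n) / (n - 1)                    # consistency index
    ri = {3: 0.58, 4: 0.90, 5: 1.12}.get(n, 1.0)         # random index (Saaty's table)
    return w, ci / ri                                    # priorities, consistency ratio

# Hypothetical 3-criteria comparison matrix from one decision maker.
A = [[1, 3, 5],
     [1/3, 1, 2],
     [1/5, 1/2, 1]]
w, cr = ahp_priorities(A)
print(w.round(3), round(cr, 3))      # CR below ~0.1 is usually deemed acceptable
```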