Fuzzy Hoeffding Decision Tree for Data Stream Classification

Data stream mining has recently grown in popularity, thanks to an increasing number of applications which need continuous and fast analysis of streaming data. Such data are generally produced in application domains that require immediate reactions with strict temporal constraints. These characteristics make the use of classical machine learning algorithms for mining knowledge from fast data streams problematic and call for appropriate techniques. In this paper, based on the well-known Hoeffding Decision Tree (HDT) for streaming data classification, we introduce FHDT, a fuzzy HDT that extends HDT with fuzziness, thus making HDT more robust to noisy and vague data. We tested FHDT on three synthetic datasets, usually adopted for analyzing concept drifts in data stream classification, and two real-world datasets, already exploited in some recent research works on fuzzy systems for streaming data. We show that FHDT outperforms HDT, especially in the presence of concept drift. Furthermore, FHDT is characterized by a high level of interpretability, thanks to the linguistic rules that can be extracted from it.


INTRODUCTION
In recent years, the analysis of data streams has been getting more and more attention because of the huge number of different sources of streaming data we come into contact with every day. These can be simple sensors, but also wearables, smartphones, as well as all the other smart devices of the Internet of Things environment [1]. Other examples of data stream sources include tools for the continuous measurement and monitoring of network traffic [2], web surfing logs [3], networked manufacturing machines in Industry 4.0 [4,5], virtual learning environments [6], online social networks [7], etc.
In the context of data streams, data are usually collected at very high rates, imposing strict temporal constraints and important challenges on the algorithms used for machine learning and data mining. These challenges can be briefly summarized as follows:
• the practical availability of limited resources compared to the theoretically infinite quantity of data in a stream. Indeed, machine learning algorithms should consider an instance only once and get immediate information for training the models, limiting as much as possible the amount of stored data;
• the changes over time of the distribution of the data, a phenomenon also known as concept drift, which should be properly faced and addressed;
• the necessity for the models, which are continuously updated as data arrive, to make accurate predictions at any time [8].
* Corresponding author. Email: rpecori@unisannio.it

As a consequence, following the aforementioned three challenges, three corresponding metrics, mutually dependent, are usually considered in the design of models for data streams: accuracy of the predictions, quantity of memory needed to store training data, and time needed for learning the model itself [9]. The concurrent optimization of these three metrics, which should take place in the initial learning phase, is a very difficult and demanding task, precisely because of the already mentioned mutual dependence: improving accuracy generally requires more memory and time, and vice versa.
In this paper, we describe a new approach for the incremental learning of multi-way Fuzzy Decision Trees (FDTs), which was preliminarily introduced in Ref. [10]. In particular, we take advantage of the main characteristics of the traditional Hoeffding Decision Tree (HDT) [11], a decision tree purposely proposed for managing data stream classification, and extend it with fuzziness by adopting some of the concepts of the FDT proposed by Segatori et al. [12]. In particular, we choose the best input attribute to be used for the splitting at each decision node by means of the fuzzy information gain (FIG) proposed in Ref. [12]. Unlike Ref. [12], for the sake of simplicity, we adopt uniform fuzzy partitions for each input attribute.
We have tested the proposed incremental learning procedure for building FDTs on five datasets for data stream classification. Specifically, we adopted three synthetic datasets on which researchers usually carry out experiments for the analysis of concept drifts in streaming data [13][14][15][16]. Moreover, we selected two datasets which are related to real-world applications, namely the optical recognition of handwritten digits and the detection of the presence or absence of people in a room. The latter is a typical Smart Building application [17,18], suitable for deciding to switch off power-consuming devices whenever the room is empty.
An extensive experimental analysis has been carried out, considering different strategies for simulating the data streaming and for analyzing the generated models in terms of their classification performance, model complexity, and interpretability. We have compared the proposed FDT, denoted as Fuzzy HDT (FHDT) in the following, with the traditional HDT. Moreover, as regards the room occupancy dataset, we have also compared the results achieved by both FHDT and HDT with the ones presented in the paper where the dataset was introduced [18] and with the ones achieved by a recent fuzzy clustering-based algorithm for data stream classification [19].
The experiments show that FHDT achieves very good results in terms of classification performance and complexity, especially for the datasets characterized by the presence of concept drifts, such as the synthetic datasets and the occupancy dataset. Moreover, the exploitation of uniform fuzzy partitions allows us to extract, from FHDT, simple, accurate, and explainable fuzzy rules.
Compared to Ref. [10], we have added the following further contributions:
• we have tested the novel FHDT considering three synthetic datasets and an additional real-world application, namely room occupancy detection;
• we have compared FHDT and HDT experimenting with different data splitting scenarios for simulating the data streaming;
• as regards the two real-world datasets, we have evaluated both FHDT and HDT considering two modalities, namely the classical test-the-train strategy and the use of pre-defined test sets.
The rest of the paper has the following structure. Section 2 summarizes some recent work on data stream mining, with particular emphasis on decision trees and FDTs. In Section 3, we report some background on fuzzy partitioning and the structure of an FDT.
In Section 4, we delve into the proposed incremental procedure for learning FHDTs, describing it in detail. In Section 5, we present the results of the experimental campaign carried out on the considered datasets. Section 6 draws some conclusions and discusses future work.

RELATED WORK
HDT [11], also known as "Very Fast Decision Tree", is a reference model for classifying streaming data. HDT encompasses a learning strategy to create a decision tree in an incremental way, storing in memory only the necessary statistics. In particular, the growth of the tree entails the splitting of a leaf whenever it contains a minimum number of instances and a condition based on Hoeffding's theorem is met. The theorem defines a bound which decreases with the increase of the number of instances. The condition is verified when the difference between the two highest values of the information gains computed for the attributes available at the decision node is higher than the bound. In Ref. [11], the authors state that this bound guarantees, with a given level of confidence, that the attribute chosen for the splitting is the same as the one that would have been selected after analyzing an infinite sequence of instances. The theoretical justification based on Hoeffding's theorem has been criticized in some papers published in the last years [20][21][22]. The authors of these papers contest the applicability of Hoeffding's theorem to the information gain. Ref. [21] claims that "the bound obtained in Ref. [11] cannot be justified using Hoeffding's theorem. However, it still can be considered as a heuristic method, which gives satisfactory practical results." Indeed, HDT has been implemented in several software packages and is commonly used in streaming data classification.
In recent years, some solutions to improve HDT learning have been proposed. For example, the authors of Ref. [23] have proposed an approach to address the concept drift issue. Moreover, the contribution in Ref. [24] has introduced a method to deal with noisy instances as well as to manage the dimension of the tree. In Ref. [25], Kourtellis et al. have presented Vertical HDT, a parallel and distributed version of the incremental HDT learning procedure, which has been successfully tested on big streaming data.
Fuzzy logic and fuzzy set theory have recently been used in decision tree models, considering both batch and streaming scenarios. As a matter of fact, when fuzzy sets are employed to discretize the domains of the attributes used in these classifiers, vague and noisy data can be handled in a very effective way [26][27][28][29]. An introduction to theoretical issues and methods on FDTs can be found in Ref. [30]. The work of Umanol et al. [31] represents one of the first contributions on FDTs: the authors discuss the application of a fuzzy version of the well-known ID3 algorithm to diagnosis systems. The contribution of Segatori et al. [12] presents a recent technique to train binary and multi-way FDTs in batch mode for big data classification. In particular, the authors take advantage of a fuzzy partitioning of the input attributes, performed through a novel distributed discretizer, and employ the FIG to choose the splitting attributes in the decision nodes of the tree. More recently, Barsacchi et al. [32] have introduced a method for building ensembles of binary FDTs based on an AdaBoost-like boosting method.
As regards streaming data classification, the work discussed in Ref. [26] introduces one of the first approaches for the incremental learning of an FDT. The proposed learning algorithm is based on the main scheme discussed in Ref. [11], but it is improved by introducing a new technique called Threaded Binary Search Tree, which allows a more efficient control of the candidate splitting points. In order to improve the handling of noisy data, a soft discretization is adopted along with a fuzzy version of the information gain. The Flexible FDT (FlexDT) [27] is another example of an FDT proposed in the literature to handle the classification of streaming data. FlexDT is grounded on the same incremental learning procedure as the traditional HDT, but it exploits binary fuzzy partitions for each attribute at the internal decision nodes. The authors have employed sigmoidal fuzzy sets and defined proper heuristics in order to tune their parameters during the learning stage. The Hoeffding bound and the information gain, in their traditional definitions, are used, respectively, to decide whether or not to grow the tree and to choose the best possible attribute for the splitting at each decision node.
An extension of FlexDT, Multi FlexDT, has been introduced in Ref. [28]. Multi FlexDT employs multi-way rather than binary splits at each internal decision node. When a new decision node is created, a binary fuzzy partition is adopted as in FlexDT; subsequently, the incremental learning algorithm creates a new binary partition for the attribute upon the arrival of a new training instance whose membership degree to each fuzzy set is lower than a certain threshold. In this way, the previous partition is extended at each decision node by taking the two new sigmoidal fuzzy sets into account, thus considering the two new splittings. A further extension of Multi FlexDT, called MHFlexDT, has been recently discussed in Ref. [29]. Here, the authors have proposed to use the McDiarmid bound rather than the Hoeffding bound for checking the splitting conditions at the decision nodes. In the experiments, MHFlexDT has proved to control the depth of the tree better than Multi FlexDT.
Two common characteristics of FlexDT, Multi FlexDT, and MHFlexDT are: i) the incremental generation of noise-robust models and ii) the efficient management of missing values. Nevertheless, some drawbacks regard the impossibility of i) defining a-priori linguistic labels for each input attribute and ii) ensuring a good level of integrity for the fuzzy partitions. Therefore, the interpretability of the aforementioned FDTs may not always be guaranteed.
In the specialized literature, different fuzzy models can also be found for mining useful knowledge from data streams in real-time. These models belong to the "Evolving Fuzzy Systems" [33] research branch, which includes fuzzy rule-based systems, fuzzy pattern trees (FPTs), semi-supervised approaches based on fuzzy clustering, and, obviously, FDTs. A recent survey on Evolving Fuzzy Systems has been published by Škrjanc et al. [34] in 2019.
In 2004, Angelov and Filev published one of the pioneering works in the framework of Evolving Fuzzy Systems [35], in which they discussed an incremental approach for learning Takagi-Sugeno fuzzy rule-based systems. Over the years, evolving fuzzy rule-based systems have been the subject of several extensions and applications to different fields [36,37].
In 2013, Shaker et al. [14] discussed an evolving version of the so-called FPTs for binary classification. An FPT is a tree-like structure different from classical decision trees. Indeed, its inner nodes represent fuzzy operators, whereas leaf nodes are associated with fuzzy predicates on the input attributes. The input of a pattern tree is entered at the leaf nodes, and its membership degree to each leaf is calculated and propagated to the parent nodes until reaching the root, where a real value in [0,1] is generated. Using a threshold function, the input is classified into one of the two classes.
Casalino et al. recently introduced the Dynamic Incremental Semi-Supervised FCM (DISSFCM) classification algorithm [19,38]. DISSFCM is based on an extended version of the well-known fuzzy C-means algorithm and has been designed for handling streaming data, even in a semi-supervised fashion. The incremental learning algorithm is capable of dynamically adapting the number of clusters to the data stream.

THEORETICAL BACKGROUND
In this section, we introduce the theoretical background our work is grounded on. It mainly concerns the theory of fuzzy partitions and the structure of an FDT, as well as that of a traditional HDT.

Fuzzy Partitions
Let X = {X_1, ..., X_F} and Y be the set of numerical input attributes and the output of a fuzzy classification model, respectively. Let U_f, with f = 1, ..., F, be the universe of the f-th attribute X_f. Let Z_f = {A_{f,1}, ..., A_{f,T_f}} be a fuzzy partition of U_f made up of T_f fuzzy sets. In this work, we adopt strong and uniform triangular fuzzy partitions [39]. Figure 1 shows an example of these partitions made up of three triangular fuzzy sets A_{f,j}, whose membership functions are represented by the tuples (a_{f,j}, b_{f,j}, c_{f,j}), where a_{f,j} and c_{f,j} correspond to the left and right extremes of the support of A_{f,j}, and b_{f,j} to its core.
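As an illustration, a strong uniform triangular partition of a normalized universe [0, 1] can be sketched in a few lines (a minimal Python sketch, not the paper's implementation; the function names are illustrative). For T_max fuzzy sets, the cores are evenly spaced and the membership degrees of any point sum to 1:

```python
def triangular_membership(x, a, b, c):
    """Membership degree of x in the triangular fuzzy set (a, b, c)."""
    if x < a or x > c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)   # ascending side, up to the core b
    return (c - x) / (c - b)       # descending side, down to the right extreme c

def uniform_partition(t_max):
    """(a, b, c) tuples of a strong uniform triangular partition of [0, 1]."""
    step = 1.0 / (t_max - 1)
    # the j-th fuzzy set has its core at j * step; the first and last sets
    # have half of their support outside [0, 1], so the partition is strong
    return [((j - 1) * step, j * step, (j + 1) * step) for j in range(t_max)]
```

For instance, with T_max = 3 the cores lie at 0, 0.5, and 1, and a value such as 0.25 belongs to the first two fuzzy sets with degree 0.5 each.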

Fuzzy Decision Trees
A decision tree can be defined as a directed acyclic graph, wherein internal nodes perform a test on an attribute, branches represent the outcomes of the test, and terminal nodes, i.e., leaves, contain instances belonging to one or more class labels. The first node to be created is named the "root" of the tree. Generally, in each leaf node, each class Γ_k ∈ Γ is associated with a weight w_k determining the strength of class Γ_k in the leaf node itself.
Let TR = {(x_1, y_1), ..., (x_N, y_N)} be the training set, where, for each instance (x_i, y_i), with i = 1, ..., N, y_i ∈ Γ and x_{i,f} ∈ U_f, with f = 1, ..., F. Starting from TR, the structure of a decision tree, considering nodes, branches, and leaves, is generated in a recursive way.
First of all, one of the attributes is chosen at the root, taking into consideration the overall TR, and a set of branches and corresponding nodes is returned. A branch corresponds to a condition on the values of the attribute. For each node, a new attribute is selected from the set of attributes, considering only the instances of TR satisfying the test associated with the branch. When no attribute can be selected, the node is considered as a leaf. The attribute selection is based on a metric, which is usually either the information gain or the Gini index [40].
In this paper, we adopt a multi-way FDT exploiting the FIG [12] as attribute selection metric. First of all, each attribute is partitioned. For the sake of simplicity, we set the same value T_f = T_max for each attribute. In multi-way FDTs, each test node has exactly as many branches as the number of fuzzy sets in the partition of the attribute selected in the node. Thus, in our case, each test node has T_max branches. Figure 2 shows an example of a multi-way FDT, where we represent only two attributes: X_2 at the root and X_1 at the second level. All attributes are partitioned into three fuzzy sets, thus generating exactly three output branches. In the example, for the sake of simplicity, each leaf is associated with just one class.
In the learning phase, the root contains the whole TR and each FDT node contains the subset of instances in TR which have a membership degree different from 0 to the node. The FDT construction starts from the root and follows a top-down approach; unless a termination condition is satisfied, a node gives rise to T_max child nodes according to the fuzzy partition of the attribute chosen for the splitting. The chosen attribute is the one that yields the highest FIG, which is defined below.
More formally, given a parent node P, let C_j indicate the generic j-th child node, j = 1, ..., T_max. The subset of TR in C_j contains only the instances belonging to the support of the fuzzy set A_{f,j}. Let S_P be the set of instances in the parent node, and S_{C_j} ⊆ S_P be the set of instances in C_j, i.e., the instances in the support of A_{f,j}. An instance x_i in S_{C_j} belongs to node C_j with a membership degree

μ_{C_j}(x_i) = T(μ_P(x_i), μ_{A_{f,j}}(x_i)),    (1)

where μ_P(x_i) is the membership degree of instance x_i to the parent node P (for the root node, μ_P(x_i) = 1) and the operator T is a T-norm. Thus, each child node C_j can be modeled as a fuzzy set, characterized by the following fuzzy cardinality

|C_j| = ∑_{i=1}^{N_j} μ_{C_j}(x_i),    (2)

where N_j is the number of instances in set S_{C_j}.
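Formulas (1) and (2) can be sketched in a couple of Python functions (an illustrative sketch under the paper's definitions; the product is one possible choice of T-norm, and the names are not from the paper's code):

```python
def child_membership(mu_parent, mu_fuzzy_set):
    """Formula (1): T-norm of the parent membership and the membership to
    the fuzzy set A_{f,j} on the splitting attribute (product T-norm here;
    the minimum is another common choice)."""
    return mu_parent * mu_fuzzy_set

def fuzzy_cardinality(memberships):
    """Formula (2): the fuzzy cardinality of a node is the sum of the
    membership degrees of its instances."""
    return sum(memberships)
```

For the root node the parent membership is 1, so an instance's membership to a first-level child reduces to its membership to the corresponding fuzzy set.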
The FIG used for selecting the splitting attribute is computed, for a generic attribute X_f with partition Z_f, as

FIG(Z_f, I_P) = FEnt(P) − WFEnt(Z_f, I_P),    (3)

where I_P is the support of the fuzzy set which characterizes parent node P. The fuzzy entropy FEnt(P) is defined as

FEnt(P) = − ∑_{k=1}^{K} (|P_k| / |P|) · log_2(|P_k| / |P|),    (4)

where K is the number of classes and the fuzzy cardinalities |P| and |P_k| are computed on the set of instances in I_P and on the set of instances in I_P with class label Γ_k, respectively. The weighted fuzzy entropy WFEnt(Z_f, I_P) of partition Z_f is defined as

WFEnt(Z_f, I_P) = ∑_{j=1}^{T_f} (|C_j| / |P|) · FEnt(C_j),    (5)

where FEnt(C_j) is the fuzzy entropy of child node C_j.

Once the learning phase has terminated and the tree has been built, a given unlabeled instance x can be assigned to a class Γ_k ∈ Γ by following the activation paths of nodes from the root to one or more leaves. In traditional decision trees, each node represents a crisp set and each leaf is labeled with a unique class label. This determines that x activates only one single path and is assigned to a unique class. In FDTs, each node represents a fuzzy set; therefore, x may simultaneously activate multiple paths in the tree. These paths may reach more than one leaf with different strengths of activation, called matching degrees. Considering a current node N_c, the matching degree md_{N_c}(x) of x with N_c coincides with the membership degree of x to N_c, computed as in formula (1). The association degree h_{L,k} of x with class Γ_k at leaf node L can be computed according to the following equation:

h_{L,k}(x) = md_L(x) · w_{L,k},    (6)

where md_L(x) is the matching degree of x with L and w_{L,k} is the class weight associated with class Γ_k at leaf node L. In the literature, different definitions have been proposed for the weight w_{L,k}; in this paper, we consider as weight the fuzzy cardinality as in formula (2).
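A minimal sketch of the fuzzy entropy and FIG described above, with probabilities replaced by ratios of fuzzy cardinalities (illustrative Python, assuming the per-class fuzzy cardinalities of the parent and of each candidate child are already available; names are not from the paper):

```python
import math

def fuzzy_entropy(class_cards):
    """Fuzzy entropy of a node, given its per-class fuzzy cardinalities."""
    total = sum(class_cards)
    ent = 0.0
    for c in class_cards:
        if c > 0:
            p = c / total          # ratio of fuzzy cardinalities
            ent -= p * math.log2(p)
    return ent

def fig(parent_class_cards, children_class_cards):
    """Fuzzy information gain: parent fuzzy entropy minus the weighted
    fuzzy entropy of the children of a candidate split."""
    parent_total = sum(parent_class_cards)
    wfent = 0.0
    for child in children_class_cards:
        child_total = sum(child)
        if child_total > 0:
            wfent += (child_total / parent_total) * fuzzy_entropy(child)
    return fuzzy_entropy(parent_class_cards) - wfent
```

A perfectly separating binary split of a balanced two-class parent yields a FIG of 1 bit, as in the crisp case.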
Finally, the output class of the unlabeled instance x can be computed, according to the relevant literature [12], by using two main approaches:
• maximum matching: the output class is the one corresponding to the maximum association degree computed for the instance;
• weighted vote: the output class is the one corresponding to the maximum total strength of vote, computed by summing, for each class, the association degrees over all the reached leaves.
In this work, we have employed the weighted vote approach.
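The weighted vote can be sketched as follows (an illustrative Python sketch; the data layout chosen for the reached leaves is an assumption, not the paper's implementation):

```python
def weighted_vote(reached_leaves):
    """Predict a class by weighted vote.

    reached_leaves: list of (matching_degree, class_weights) pairs, one per
    leaf reached by the instance, where class_weights maps each class label
    to its weight at that leaf (here, its per-class fuzzy cardinality).
    """
    votes = {}
    for md, class_weights in reached_leaves:
        for label, w in class_weights.items():
            # association degree of the instance with this class at this leaf
            votes[label] = votes.get(label, 0.0) + md * w
    return max(votes, key=votes.get)
```

With the maximum matching approach, one would instead return the class of the single leaf/class pair with the highest association degree.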

The Incremental Learning Procedure of HDT
In this subsection, we briefly summarize the iterative learning procedure of the traditional HDT. Whenever a new labeled instance of the stream is processed, the HDT learning procedure updates the current structure of the decision tree by means of two main steps: i) the refresh of the class statistics in the nodes and leaves and ii) the possible growth of the tree. This procedure depends on the following parameters:
• Grace Period (GP): a threshold on the number of instances contained in one leaf before considering it for a split;
• Tie Threshold (TT): a threshold to forcibly break ties if the difference between the information gains of the two attributes with the highest information gains is too small. This avoids waiting too long before a splitting event;
• Split Confidence (δ): a parameter involved in the direct computation of the Hoeffding bound;
• Minimum Fraction (MF): the minimum percentage of instances in a branch after a possible split of a leaf.
Regarding the first step, the procedure starts from the root of the tree, determines the leaf reached by the current instance, and updates its per-class statistics as well as the cardinality (total number of crisp instances) of the leaf itself. After that, some checks are performed in order to decide whether or not to split the current leaf L, thus growing the tree (second step).
First of all, the splitting is not permitted whenever the number of instances, i.e., the cardinality N_L of L, is lower than GP.
If this check is passed, attributes are ranked according to their information gain and the Hoeffding bound HB_L at the current leaf L is computed as:

HB_L = √(R² · ln(1/δ) / (2 · N_L)),    (7)

where R = log_2 D_L and D_L is the number of classes contained in L.
Afterward, the difference between the information gains of the first two attributes in the ranking is computed. If this difference is higher than HB_L, or HB_L is lower than TT, then the splitting of L is permitted and is performed by employing the attribute with the highest information gain as the splitting attribute. A final check is performed before the splitting is actually carried out: it regards the percentage of instances in each post-split branch, which should be greater than MF for the splitting process to be completed. The creators of HDT claimed in Ref. [11] that the splitting procedure employs the Hoeffding bound to ensure, with high probability, that the attribute chosen for the splitting by using the subset of instances corresponding to the cardinality at the current leaf is the same as the one that would be chosen using an infinite number of instances. As we have already highlighted in Section 2, some subsequent papers have contested the applicability of Hoeffding's theorem to the estimation of the information gains, thus making the Hoeffding bound used in the HDT learning approach a heuristic rather than a theoretically grounded threshold. The use of this heuristic over the last years has, however, proved to ensure satisfactory practical results. In the next section, we adapt the heuristic proposed in Ref. [11] to the fuzzy domain, to take into account that, with the fuzzy partitions used in this paper, one instance can belong to different nodes at each tree level (except the root) with different membership degrees.
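The split test described above can be sketched as follows (illustrative Python; `delta` is the split confidence, `n` the leaf cardinality N_L, and the function names are not from the HDT implementation):

```python
import math

def hoeffding_bound(n_classes, delta, n):
    """Hoeffding bound at a leaf: R = log2 of the number of classes,
    delta = split confidence, n = number of instances at the leaf."""
    r = math.log2(n_classes)
    return math.sqrt(r * r * math.log(1.0 / delta) / (2.0 * n))

def allow_split(best_gain, second_gain, n_classes, delta, n, tie_threshold):
    """Split when the gap between the two best information gains exceeds
    the bound, or when the bound has fallen below the tie threshold TT."""
    hb = hoeffding_bound(n_classes, delta, n)
    return (best_gain - second_gain) > hb or hb < tie_threshold
```

Note how the bound shrinks as n grows: with more evidence, ever smaller gaps between the two best attributes suffice to trigger a split.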

THE PROPOSED STREAM FDT
The incremental learning procedure of FHDT is based on the one employed in the standard HDT, apart from the following two aspects:
• the procedure to update the statistics at a particular leaf;
• the use of a similar but different metric for selecting the best splitting attribute, i.e., the FIG.
Concerning the first aspect, the learning procedure starts from the root and follows the paths activated by the current training instance until leaves are reached. Unlike the traditional HDT, where an instance can reach only one leaf, in the proposed FHDT a training instance can reach more than one leaf. This happens because an instance can trigger more than one branch at a splitting node: indeed, an instance can belong to different fuzzy sets of an attribute partition with different membership degrees. In the scenario considered in this paper, i.e., strong uniform partitions, at most two output branches can be activated simultaneously at each decision node. Once an instance, traveling throughout the tree, reaches a leaf node, its membership degree to the node is calculated according to Eq. (1). Furthermore, the fuzzy cardinality of the leaf node and the per-class fuzzy cardinalities of the leaf node are updated exploiting Eq. (2) with, respectively, all the instances belonging to the leaf node and the subsets of these instances labeled with each specific class.
Regarding the second aspect, the learning procedure verifies whether the conditions to allow the growth of the FHDT, by splitting one of the current leaves, are satisfied. Since each instance belonging to a leaf L with a membership degree higher than 0 may also belong to other children (only one, with the uniform partitions used in this paper) of the same parent P, we modified the Hoeffding bound in Eq. (7) to take this aspect into consideration. Indeed, instead of computing the cardinality N_L as the number of instances belonging to the leaf, we compute it as the sum of the membership degrees of the instances in P to the fuzzy set associated with the branch leading to L. We denote this cardinality as the local fuzzy cardinality LFC_L of L. Let A_{f,j} be the fuzzy set associated with the branch which links P to L. Formally, LFC_L is defined as:

LFC_L = ∑_{i=1}^{N_L} T(μ_P(x_i), μ_{A_{f,j}}(x_i)),    (8)

where N_L is the number of instances in L. To take into consideration that an instance may belong to different children of the same parent with different membership degrees, we modified the Hoeffding bound by considering the local fuzzy cardinality rather than the crisp cardinality. Thus, we introduce the following variant of the Hoeffding bound, denoted as Fuzzy Hoeffding Bound (FHB_L) in the following:

FHB_L = √(R² · ln(1/δ) / (2 · LFC_L)).    (9)

If LFC_L is greater than GP, then the algorithm sorts the attributes X_f according to their FIG, computed as in Eq. (3). Then, it computes the difference between the FIGs of the first two attributes in the ranking: if the difference is higher than FHB_L, or FHB_L is lower than TT, then L is split, using the attribute with the highest FIG as the splitting attribute. The splitting is actually performed only if LFC_{L_j} in each post-split leaf L_j generated from L is greater than MF.
We would like to highlight that, in our proposed learning procedure for FHDT, we adopted LFC_L rather than N_L in the conditions that start the splitting procedure. This decision is intuitively supported by the fact that the uncertainty associated with the arrival of a new instance is spread among the different paths of the FHDT, rather than along a single path as in the case of HDT. It should be noticed that, since the maximum value of the membership degree of an instance in L is equal to 1, N_L is an upper bound for LFC_L. Therefore, FHB_L is an upper bound for HB_L and the splitting condition used in our FHDT, namely that the difference between the two highest FIGs must exceed FHB_L, is more restrictive than the one checked in the traditional HDT. This leads to collecting a higher number of instances before a split and, therefore, to a higher confidence in the decision of splitting the leaf.
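A sketch of the local fuzzy cardinality LFC_L and the fuzzy Hoeffding bound FHB_L defined above, using the product as T-norm (illustrative Python, not the paper's code):

```python
import math

def local_fuzzy_cardinality(parent_memberships, branch_memberships):
    """Sum, over the instances at the parent, of the T-norm (product here)
    of the parent membership and the membership to the fuzzy set on the
    branch leading to the leaf."""
    return sum(mp * mb for mp, mb in zip(parent_memberships, branch_memberships))

def fuzzy_hoeffding_bound(n_classes, delta, lfc):
    """Hoeffding bound with the crisp count replaced by the local fuzzy
    cardinality; since LFC_L <= N_L, this bound is never smaller than the
    crisp one, making the split test more restrictive."""
    r = math.log2(n_classes)
    return math.sqrt(r * r * math.log(1.0 / delta) / (2.0 * lfc))
```

Because the local fuzzy cardinality grows more slowly than the crisp count, more instances must accumulate at a leaf before the split condition can fire.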
Algorithm 1 summarizes the learning procedure in pseudo-code. In the code, X_ϕ represents the null attribute, which corresponds to not splitting the node. For each new training instance, the algorithm checks the conditions for splitting each reached leaf and, if the conditions are satisfied, replaces the leaf with an internal node and adds a number of new leaves corresponding to the number of fuzzy sets in the partition of the attribute chosen for the splitting. The splitting check at a leaf L can be summarized as follows:

Compute LFC_L, the sum of the fuzzy cardinalities over all classes at leaf L
if LFC_L > GP then
    Let X_a be the attribute with the highest FIG_L
    Let X_b be the attribute with the second-highest FIG_L
    Compute the fuzzy Hoeffding bound FHB_L
    if X_a ≠ X_ϕ and (FIG_L(X_a) − FIG_L(X_b) > FHB_L or FHB_L < TT)
       and, for each branch j of the possible split performed by using X_a, LFC_{L_j} > MF then
        Replace L with an internal node splitting on X_a
        For each branch j, add a new leaf L_j and let X_{L_j} = X_L − {X_a}
SIMULATIONS AND RESULTS
In this section, we first describe the synthetic and real-world datasets and the simulation scenarios in which they have been employed. Then, we present and discuss the results achieved by the proposed FHDT and compare them with the ones obtained by HDT. Specifically, we first describe the analysis carried out on the synthetic datasets and then the one carried out on the real-world datasets.

Datasets
We evaluate the effectiveness of the proposed FHDT by considering two categories of datasets, namely synthetic and real-world datasets, simulating their evolution over time.
As regards synthetic datasets, we adopted the SEA concepts and Rotating Hyperplane datasets, which have been extensively used in several contributions on data stream classification, especially for the analysis of the concept drift [13][14][15][16].
The SEA concepts dataset (SEA dataset, in short) features data characterized by abrupt concept drifts. It considers three input attributes whose values are randomly generated in the range [0, 10), and only the first two attributes are relevant. An instance belongs to the positive class if the sum of the first two attribute values is lower than a threshold θ; otherwise, it belongs to the negative class. Different values of θ are usually adopted in order to generate different concepts. We adopted a version of the dataset, available on a web repository, composed of 50000 instances with four different concept drifts.
The Rotating Hyperplane dataset features data characterized by a gradual concept drift. It is generated from a d-dimensional hyperplane described by the equation ∑_{i=1}^{d} a_i x_i = a_0, where the a_i are weights in [0,1]. An instance in the d-dimensional space belongs to the positive class if ∑_{i=1}^{d} a_i x_i ⩾ a_0; otherwise, it belongs to the negative class. We adopted two versions of the dataset, available on the same web repository as the SEA dataset, composed of 10000 instances each. Specifically, we downloaded the hyperplane5.arff and hyperplane9.arff files, which are characterized by two different levels of drift. We labeled the two adopted versions of the dataset as Hyperplane5 and Hyperplane9, respectively. Details on the parameters and the procedures for generating the three synthetic datasets, namely SEA, Hyperplane5, and Hyperplane9, can be found in Ref. [13].
As regards the real-world datasets, they have been extracted from the UCI machine learning repository and concern the optical recognition of handwritten digits and the detection of the occupancy of a room. Table 1 summarizes the characteristics of the Handwritten digits dataset.
The second real-world dataset concerns the occupancy detection in a room, namely a binary classification task about the presence or absence of people. This dataset regards a real application in the framework of Smart Buildings [17]. Indeed, the information on the occupancy of a room can be used in order to switch off lights, ventilation, and all the energy-based equipment to save energy and enact possible safety actions. This dataset was introduced in Ref. [18] and recently adopted in the experiments carried out in Ref. [19]. In this dataset, each data sample encompasses five attributes: Temperature (Celsius), Relative Humidity (%), Light (Lux), CO2 (ppm), and Humidity Ratio (kg of water vapor per kg of air), which have been measured every minute in a room in the period from Feb. 3, 2015 to Feb. 10, 2015. Each sample belongs to one out of two possible classes, namely "non-occupied room" and "occupied room". Therefore, the dataset is structured as a stream of 20560 chronologically sorted samples, divided into a training set and two test sets, which differ in the room conditions. In the first test set, similarly to the training set, the measurements have been taken mostly with the door closed during the occupied status, while in the second test set the measurements have been taken mostly with the door open during the occupied status. The dataset is available at https://archive.ics.uci.edu/ml/datasets/Occupancy+Detection+data. Table 2 reports some basic statistics on the data and on the class distribution of the Occupancy dataset.
In the experiments, we have normalized, for all the datasets, the values of each attribute to vary in the interval [0, 1]; thus, the universe U_f of each attribute X_f ranges from zero to one.
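A minimal sketch of this min-max normalization step, assuming the per-attribute minimum and maximum are known in advance (in a true streaming setting they would have to be estimated incrementally):

```java
// Min-max normalization mapping attribute values to [0, 1].
class MinMaxNormalizer {
    static double[] normalize(double[] values, double min, double max) {
        double[] out = new double[values.length];
        double range = max - min;
        for (int i = 0; i < values.length; i++) {
            // Guard against constant attributes (max == min).
            out[i] = (range == 0.0) ? 0.0 : (values[i] - min) / range;
        }
        return out;
    }
}
```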

Simulation Settings
We executed the simulations on a computer featuring an x86-64 architecture, four cores, an Intel(R) Core(TM) i7-2600 CPU @ 3.40 GHz, and 16 GB of RAM. Both the learning and the classification procedures of the proposed FHDT have been implemented in the Java language and integrated into the WEKA Toolkit, by properly modifying the already embedded implementation of the HDT, as discussed in Ref. [11].
For all the datasets, the data stream has been simulated by dividing the training sets into a number Q of chunks and then by analyzing the chunks as they arrived in sequence.
For the real-world datasets, similar to the work discussed in Refs. [19,38], we performed two different experiments for both datasets: • Test-the-train: we considered only the training set, opportunely divided into Q chunks. We simulated the data stream by evaluating the performance, on the data chunk at the current time interval T_q, with q = 2, ..., Q, of the current classification model, trained considering the stream of instances included in the data chunks arrived until the time interval T_{q−1}.
During the first time interval T_1, the classification model is just built using the first chunk of the dataset.
• Test-the-test: the generalization capability of each model incrementally built at the current time interval T_q has been evaluated considering the available test sets. We recall that, as discussed in the previous section, for the Occupancy dataset we have two test sets.
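The test-the-train protocol above can be sketched as a simple evaluation loop. The Classifier interface below is a hypothetical abstraction, not the WEKA API; at T_1 the model is only built on the first chunk (and, as in the experiments, scored on that same chunk), while each later chunk is first used for testing and then for incremental training:

```java
import java.util.List;

// Hypothetical classifier abstraction for chunk-based stream evaluation.
interface Classifier {
    void trainIncrementally(double[][] chunk);
    double evaluate(double[][] chunk);  // e.g., accuracy or F-measure
}

class TestTheTrain {
    // Returns one score per chunk, following the test-the-train protocol.
    static double[] run(Classifier model, List<double[][]> chunks) {
        double[] scores = new double[chunks.size()];
        // T_1: the model is just built using the first chunk, then scored
        // on the same chunk.
        model.trainIncrementally(chunks.get(0));
        scores[0] = model.evaluate(chunks.get(0));
        for (int q = 1; q < chunks.size(); q++) {
            scores[q] = model.evaluate(chunks.get(q));  // test chunk q first...
            model.trainIncrementally(chunks.get(q));    // ...then train on it
        }
        return scores;
    }
}
```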
For the real-world datasets, we adopted the same experimental scenarios as those adopted in Refs. [19,38], in which the authors considered Q equal to 5, 10, 15, and 20 for the Handwritten digits dataset and Q equal to 7, 12, and 17 for the Occupancy dataset, respectively. We labeled these organizations of the datasets into different numbers of chunks as S1, S2, S3, and S4 for the former, and as S1, S2, and S3 for the latter. For the synthetic datasets, similar to the works discussed in Refs. [13,14,15], we carried out only test-the-train experiments. Similar to the work in Ref. [13], we adopted experimental scenarios setting Q to 50 and 100 for the SEA dataset, and to 20 and 40 for the Hyperplane5 and Hyperplane9 datasets. Also for the synthetic datasets, we labeled this organization of the datasets into two different numbers of chunks as S1 and S2.
The optimal performance of a classifier for streaming data depends on a fixed set of critical parameters, whose values are conditioned by the particular dataset considered [41]. Thus, we have accomplished a preliminary simulation campaign in order to select the best possible values for the main parameters of both HDT and FHDT, considering the five selected datasets. Table 3 reports these values for both algorithms: they have been obtained by searching for an optimal trade-off between increasing the accuracy and reducing the tree depth.
For a fair comparison, we considered the same values of the aforementioned parameters for both the proposed FHDT and the traditional HDT. As regards FHDT, we adopted for each attribute a uniform fuzzy partition consisting of 3 triangular fuzzy sets. Actually, we experimented with different configurations, considering also 5, 7, and 9 fuzzy sets for partitioning each attribute. The best results, in terms of trade-off between classification performance and complexity (number of nodes and leaves) of the model, have been achieved with 3 fuzzy sets. Thus, for the sake of brevity and readability of graphs and tables, we do not show the other results.
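A uniform strong fuzzy partition with triangular fuzzy sets can be sketched as follows: the cores are evenly spaced over [0, 1] and, at any point of the universe, the memberships sum to 1. With 3 sets the cores fall at 0, 0.5, and 1 (the "Low", "Medium", and "High" labels used later for the Occupancy dataset):

```java
// Uniform strong fuzzy partition of [0, 1] with triangular fuzzy sets.
class TriangularPartition {
    private final int numSets;

    TriangularPartition(int numSets) { this.numSets = numSets; }

    // Membership of x (in [0, 1]) in the s-th fuzzy set, s = 0..numSets-1.
    // Each set is a triangle centered on its core with support one step wide
    // on each side, so adjacent memberships always sum to 1.
    double membership(int s, double x) {
        double step = 1.0 / (numSets - 1);
        double core = s * step;
        double dist = Math.abs(x - core);
        return Math.max(0.0, 1.0 - dist / step);
    }
}
```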

Experimental Results on Synthetic Datasets
In the synthetic datasets, due to the simulated concept drift, the class distribution in the different chunks changes dynamically. For this reason, we adopted the weighted F-measure for evaluating the performance of the classifiers on these datasets. Figures 3-8 show the trends of the weighted F-measure obtained by FHDT and HDT on, respectively, SEA, Hyperplane5, and Hyperplane9, along the sequence of chunks, considering both the S1 and S2 chunking strategies. As expected, the concept drift is very accentuated for the SEA dataset. However, it is worth noticing that, both for S1 and S2, FHDT can control this phenomenon better than HDT. Furthermore, in most of the cases, the values of the weighted F-measure achieved by FHDT are higher than the ones achieved by HDT. As regards the two versions of the Hyperplane dataset, the figures show that FHDT outperforms HDT at each chunk of the data stream. It is worth noticing that for Hyperplane9 the concept drift is more evident than for Hyperplane5. Indeed, FHDT controls the concept variations better in the first part of the sequence of chunks than in the last part, in which the F-measure drops below 0.8.
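For reference, the weighted F-measure averages the per-class F1 scores with weights proportional to each class's support, which makes it robust to the changing class distributions across chunks. A minimal sketch from per-class confusion counts:

```java
// Weighted F-measure from per-class true positives, false positives, and
// false negatives (one entry per class).
class WeightedFMeasure {
    static double compute(int[] tp, int[] fp, int[] fn) {
        double total = 0.0, weighted = 0.0;
        for (int c = 0; c < tp.length; c++) {
            int support = tp[c] + fn[c];  // actual instances of class c
            double prec = (tp[c] + fp[c] == 0) ? 0.0 : (double) tp[c] / (tp[c] + fp[c]);
            double rec = (support == 0) ? 0.0 : (double) tp[c] / support;
            double f1 = (prec + rec == 0.0) ? 0.0 : 2.0 * prec * rec / (prec + rec);
            total += support;
            weighted += support * f1;  // weight each class by its support
        }
        return (total == 0.0) ? 0.0 : weighted / total;
    }
}
```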
As regards the complexity of the generated models, Figures 9-14 show the trends of the total number of nodes of each tree obtained by FHDT and HDT on SEA, Hyperplane5, and Hyperplane9, respectively, along the sequence of chunks, considering both the S1 and S2 chunking strategies. For each experiment on each dataset, the size of the trees generated by HDT increases linearly as new chunks are fed into the incremental learning algorithm, whereas the size of the trees generated by FHDT grows very slowly. Specifically, as regards the SEA dataset, the structure of the trees generated by the FHDT learning algorithm remains stable along the entire flow of data chunks. This is mainly due to the fact that a simple fuzzy tree with 7 nodes and 5 leaves is able to model the instances of the SEA dataset represented in a three-dimensional space. As regards the two Hyperplane datasets, we observe that the FHDTs maintain a stable structure more or less after analyzing half of the chunks (see Tables 5 and 6, in which this behavior is more evident than in the figures).
Tables 4-6 show an excerpt of the results, in terms of classification performance and of complexity of the models, for the SEA, Hyperplane5, and Hyperplane9 datasets, respectively. In particular, we present the results for the chunks which are characterized by the most evident concept drift. The tables confirm that the FHDTs outperform, in most of the cases, the HDTs in terms of weighted F-measure, thus pointing out a good capability of managing the concept drift. Moreover, the FHDTs are characterized by a much lower complexity level, in terms of their depth, number of nodes, and number of leaves, than the HDTs. This characteristic, along with the use of linguistic terms in the partitions, ensures a much higher level of interpretability of the FHDTs than the HDTs.

Handwritten digits dataset
Figures 15-18 clearly show that, in most cases, FHDT outperforms HDT in terms of accuracy. Specifically, FHDT outperforms HDT after analyzing one or two training chunks. It is worth noticing that, for this dataset, the two models do not suffer from the concept drift problem. Indeed, their accuracy increases while the number of training instances grows. This is also due to the fact that the classes are uniformly distributed in the chunks.
As regards the complexity of the models, Figures 19-22 show the trends of the number of nodes of each tree obtained by FHDT and HDT on S1, S2, S3, and S4, respectively, along the sequence of chunks. We can observe that the number of nodes in FHDT increases faster than in HDT. However, the total number of nodes seems to increase linearly with the number of chunks considered during the learning stage: the slope of FHDT is a bit greater than that of HDT. Table 7 shows the accuracy, the depth expressed in terms of the number of levels, the number of nodes, and the number of leaves of the trees after the last training chunk.
Even though the models generated by the HDT learning procedure are more compact than the ones generated by the FHDT learning procedure, we have to consider that the rules that may be extracted from an FHDT are characterized by a higher level of interpretability than the ones extracted from an HDT. Indeed, the FHDT rules are expressed in easily understandable linguistic terms, while the HDT rules are based on numeric intervals. We will discuss some examples of rules in Section 5.4.2.
Figures 23-26 show the trends of the accuracy obtained by FHDT and HDT on S1, S2, S3, and S4, respectively, along the sequence of chunks, considering the test set. We point out that the trends of the complexity of the models are the same as the ones shown in Figures 19-22. The results confirm that, also on the test set, FHDT outperforms HDT in terms of achieved accuracy for all the different splitting strategies.

Occupancy dataset
As already mentioned, this dataset reflects a real application of a Smart Building, in which an intelligent system can be involved in the automatic management of the electricity consumption of appliances and devices. We can envision such a system as mainly composed of a sensing module, an occupancy estimation module, and an automation module. The sensing module continuously gets measures from sensors and feeds the estimation module, which provides an output indicating whether a specific room is occupied or not. The automation module is in charge of switching off devices and appliances if the room is empty. In Table 8, we show the distribution of the classes into the instances composing each chunk for S1, S2, and S3, respectively. In the table, the Ch ID header stands for the ID of the data chunk.
One can notice that the class distribution changes in the chunks.
Changes in class distribution are more appreciable when we consider a finer granularity. We observe that in S1, S2, and S3, the splitting interval consists of 24, 12, and 8 hours, respectively. Moreover, February 7th and 8th, 2015 were weekend days, thus the room was not occupied by people, as also occurs during night hours. Thus, we expect that the incremental classifiers will experience a concept drift issue, especially when the number of chunks increases. Since we are dealing with a binary and imbalanced dataset, we evaluate the performance of the classifiers in terms of the F-measure of the relevant class, namely the non-occupied one. Indeed, in a smart building framework, it is more interesting to identify empty rooms in order to automatically switch off all power-consuming devices and appliances. However, we also extracted the values of precision and recall for each class. The trends of such measures are very similar to those of the F-measure for the relevant class; thus, for the sake of brevity, we do not show them.
As regards the test-the-train experiments, in Figures 27-29 we show the trends of the F-measure obtained by FHDT and HDT on, respectively, S1, S2, and S3, along the sequence of chunks. It is worth noticing that for the first chunk we show the F-measure calculated considering the chunk as both training and test set. For the remaining chunks, we show the F-measure calculated considering each chunk q as test set, after an incremental learning process which includes the training samples up to chunk q − 1.
We observe that, in most of the time intervals, the F-measure achieved by FHDT is higher than the one achieved by HDT. The better classification performance of FHDT is much more evident in the last chunks and especially in S3, when the concept drift is, as expected, much more visible. It is worth highlighting that, after the last training chunk, when the entire training set has been shown to the learning algorithm, the generated trees, both FHDT and HDT, have the same structure for S1, S2, and S3. In conclusion, for the Occupancy dataset, we can state that the models incrementally generated by the FHDT learning procedure are always characterized by a lower complexity level than the ones generated by the HDT learning procedure.
In order to highlight the high interpretability level associated with the models generated by the FHDT learning, in Figure 33 we show the FHDT obtained after feeding the incremental learning process with the entire training set. We labeled the fuzzy sets, adopted in the partitioning of the input attributes, as "Low," "Medium," and "High." As the consequent of each rule, we considered the class with the highest weight in the corresponding leaf.
As we can see, the model itself is very readable, far from being a black box like some classical machine learning models, such as neural networks and support vector machines. Moreover, unlike the trees generated by the HDT learning, it is possible to extract linguistic rules, which can help the user to understand how a specific output has been obtained. In the following, we show a couple of these linguistic rules. Finally, for the test-the-train experiments, an excerpt of the results, in terms of classification performance and of complexity of the models, is presented in Table 9. Specifically, we show the results for the chunks which are characterized by the most evident concept drift. As regards the test-the-test experiments, Figures 34-39 show the trends of the F-measure of the relevant class, for both FHDT and HDT, calculated on the Test 1 and Test 2 datasets, respectively. We recall that the trends of the complexity of the models are the same as the ones shown in Figures 30-32. Indeed, for the test-the-test experiments, the models are incrementally trained on the same data chunks as those adopted for the test-the-train experiments. As regards the classification performance, for the test-the-test experiments, the results confirm that FHDT always outperforms HDT, even with the lowest complexity levels of the models. Notice that for both FHDT and HDT, the values of the F-measure remain more or less constant along the time intervals. Finally, in Table 10 we compare FHDT and HDT with the recent DISSFCM algorithm [19], proposed for classifying data streams, and with other classical batch classifiers. Actually, in this case, similar to the analysis performed in Ref. [19], we calculated the average values of the accuracy of each trend for S1, S2, and S3, and compared them with the average values of accuracy extracted from the paper in which DISSFCM has been experimented on the Occupancy dataset.
In Ref. [19], the authors compared, in terms of average accuracy, DISSFCM with other classical batch classifiers, namely Random Forest (RF), Gradient Boosting Machines (GBM), Linear Discriminant Analysis (LDA), and Classification and Regression Trees (CART). We recall that for these batch classifiers the entire training set has been provided as input for the learning algorithm; thus, for them, we do not have the experiments on S1, S2, and S3. The results confirm that FHDT always outperforms HDT and DISSFCM, for S1, S2, and S3, on both Test 1 and Test 2. As regards the comparison with batch classifiers, LDA always achieves the highest average accuracy. However, the average accuracies obtained by FHDT are only about 1% lower than those obtained by LDA, which is, unlike FHDT, a black-box classification model.

Execution Time Analysis of the Incremental Learning Algorithms
Table 11 shows, for each considered dataset, the time (in seconds) needed for analyzing the entire stream of data by the incremental learning algorithms for building both FHDTs and HDTs. As regards the real-world datasets, we considered only the training set as the entire stream of data. In the table, the first three columns contain the name of the dataset, the number of instances, and the number of attributes, respectively, which characterize each data stream, whereas the last two columns contain the execution times of the incremental learning algorithms for FHDT and HDT, respectively.
We observe that the HDT and FHDT learning algorithms require less than 0.2 seconds and up to 8.35 seconds, respectively, for building the decision tree. Thus, the execution time for learning FHDTs is one order of magnitude higher than the execution time for learning HDTs. This is mainly due to the fact that, as discussed in Section 3.2, each training instance may activate multiple paths in an FHDT simultaneously, and several membership degrees must be computed for updating the statistics of the decision nodes and for checking the splitting conditions.

CONCLUSION AND FUTURE WORK
In this paper, we have presented FHDT, the fuzzy version of the traditional Hoeffding Decision Tree (HDT) for data stream classification. More precisely, we have defined for each input attribute a strong uniform fuzzy partition and have taken advantage of the fuzzy information gain (FIG) as a metric to choose the best input attribute for splitting decision nodes. Furthermore, we have modified the Hoeffding bound to take into consideration that one instance can belong to different leaves with different membership degrees.
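For reference, the classical Hoeffding bound used by HDT states that, with probability 1 − δ, the true mean of a random variable with range R does not differ from its empirical mean over n observations by more than ε = sqrt(R² ln(1/δ) / (2n)); a split is performed when the gap between the two best attributes exceeds ε. The sketch below implements only this classical form; as described above, FHDT adapts the bound so that the observation count reflects fuzzy membership degrees rather than a crisp count:

```java
// Classical Hoeffding bound: epsilon = sqrt(R^2 * ln(1/delta) / (2n)),
// where R is the range of the split metric, delta the allowed error
// probability, and n the number of observations seen at the leaf.
class HoeffdingBound {
    static double epsilon(double range, double delta, double n) {
        return Math.sqrt(range * range * Math.log(1.0 / delta) / (2.0 * n));
    }
}
```

Because ε shrinks as n grows, a leaf that has accumulated enough (possibly fractional, in the fuzzy case) observations can commit to the best splitting attribute with confidence 1 − δ.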
We have experimented the proposed FHDT learning procedure on three synthetic datasets, usually adopted for analyzing concept drift in data stream classification, and on two real-world datasets, already exploited in some recent research works on fuzzy systems for streaming data. The data streams have been simulated by splitting the datasets into chunks. We have compared the results achieved by FHDT with the ones achieved by HDT in terms of accuracy, complexity, and computational time. Moreover, as regards the real-world dataset related to the detection of the presence or absence of people in a room, we have also shown some comparisons with a recent streaming data classification algorithm, based on fuzzy clustering, and with some state-of-the-art machine learning algorithms.
The results have demonstrated that the FHDTs generated by the learning procedure proposed in this paper outperform the HDTs in terms of classification performance and complexity, although the time needed to learn them is longer than the one needed to learn the HDTs. We have shown some examples of linguistic fuzzy rules that can be extracted from an FHDT: these rules turn out to be more interpretable, and also more suitable for handling data vagueness, than the ones that can be extracted from an HDT.
Future work may regard the design, the implementation, and the experimentation of an incremental learning algorithm for FDTs in a distributed computing framework for Big Data. Moreover, we envision adopting the FHDT learning algorithm in an IoT architecture for anomaly detection and predictive maintenance in Industry 4.0.

Figure 1
Figure 1 Example of a strong fuzzy partition with three triangular fuzzy sets.

Figure 2
Figure 2 Example of multi-way FDT with only two levels and one single class per leaf.

Algorithm 1
Algorithm 1 Algorithm to learn an FHDT.
1: Input: S: the stream of training instances; X: the set of numeric attributes; FIG(·): the fuzzy information gain; δ: one minus the probability of choosing the correct attribute at any given node.
2: Output: one FHDT
3: Let FHDT be a fuzzy decision tree with a single leaf L (root)
4: Let X_1 = X ∪ {X_∅}
5: for each training instance in S do
6:     for each leaf L do
7:         Update statistics of leaf L needed to compute the conditions for deciding the splitting
8: ...

Figure 3
Figure 3 Test-the-train experiment: weighted F-measure trends for S1 on the SEA dataset.

Figure 4 Table 3
Figure 4 Test-the-train experiment: weighted F-measure trends for S2 on the SEA dataset.

Figure 5
Figure 5 Test-the-train experiment: weighted F-measure trends for S1 on the Hyperplane5 dataset.

Figure 6
Figure 6 Test-the-train experiment: weighted F-measure trends for S2 on the Hyperplane5 dataset.

Figure 7
Figure 7 Test-the-train experiment: weighted F-measure trends for S1 on the Hyperplane9 dataset.

Figures 15-18
Figures 15-18 show the trends of the accuracy obtained by FHDT and HDT on S1, S2, S3, and S4, respectively, along the sequence of chunks of the Handwritten Digits dataset (shortened to Handwritten dataset in the following). It is worth noticing that for the first chunk we show the accuracy calculated considering it as both training and test set.

Figure 8
Figure 8 Test-the-train experiment: weighted F-measure trends for S2 on the Hyperplane9 dataset.

Figure 9
Figure 9 Model complexity trends for S1 on the SEA dataset.

Figure 10
Figure 10 Model complexity trends for S2 on the SEA dataset.

Figure 11
Figure 11 Model complexity trends for S1 on the Hyperplane5 dataset.

Figure 12
Figure 12 Model complexity trends for S2 on the Hyperplane5 dataset.

Figure 13
Figure 13 Model complexity trends for S1 on the Hyperplane9 dataset.

Figure 14
Figure 14 Model complexity trends for S2 on the Hyperplane9 dataset.

Figure 15
Figure 15 Test-the-train experiment: Accuracy trends for S1 on the Handwritten digits dataset.

Figure 16
Figure 16 Test-the-train experiment: Accuracy trends for S2 on the Handwritten digits dataset.

Figure 17
Figure 17 Test-the-train experiment: Accuracy trends for S3 on the Handwritten digits dataset.

Figures 30-32
Figures 30-32 show the trends of the number of nodes for the two compared incremental decision trees. Unlike the Handwritten dataset, we highlight that the size of the FHDT grows much more slowly than that of the HDT. Moreover, after using all the chunks for the training stage, the number of nodes composing the HDT is more than 3 times higher than that of the FHDT (53 nodes vs 16 nodes, respectively). After the last training chunk, the final numbers of leaves of the trees are equal to 11 and 27 for FHDT and HDT, respectively.

Figure 18
Figure 18 Test-the-train experiment: Accuracy trends for S4 on the Handwritten digits dataset.

Figure 19
Figure 19 Model complexity trends for S1 on the Handwritten digits dataset.

Figure 20
Figure 20 Model complexity trends for S2 on the Handwritten digits dataset.

Figure 21
Figure 21 Model complexity trends for S3 on the Handwritten digits dataset.

Figure 22
Figure 22 Model complexity trends for S4 on the Handwritten digits dataset.

Figure 23
Figure 23 Test-the-test experiment: Accuracy trends for S1 on the Handwritten digits dataset.

Figure 24
Figure 24 Test-the-test experiment: Accuracy trends for S2 on the Handwritten digits dataset.

Figure 25
Figure 25 Test-the-test experiment: Accuracy trends for S3 on the Handwritten digits dataset.

Figure 26
Figure 26 Test-the-test experiment: Accuracy trends for S4 on the Handwritten digits dataset.

R1: IF Light is Low AND HumidityRatio is Low THEN Room is not occupied
R2: IF Light is Medium AND CO2 is High THEN Room is occupied
Moreover, we recall that defining fuzzy partitions for each attribute, and consequently generating fuzzy rules, allows adequately handling noisy and vague data that may be extracted from the sensors.
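As an illustration of how the two rules above could be fired on a normalized sample, the sketch below uses the minimum t-norm for the fuzzy AND and the Low/Medium/High triangular partition of [0, 1] described earlier; the method names and the inference scheme are illustrative assumptions, not the paper's exact procedure:

```java
// Hypothetical firing of the two extracted linguistic rules on normalized
// attribute values in [0, 1], using min as the fuzzy AND.
class FuzzyRules {
    // Triangular memberships with cores at 0 ("Low"), 0.5 ("Medium"), 1 ("High").
    static double low(double x)    { return Math.max(0.0, 1.0 - x / 0.5); }
    static double medium(double x) { return Math.max(0.0, 1.0 - Math.abs(x - 0.5) / 0.5); }
    static double high(double x)   { return Math.max(0.0, 1.0 - (1.0 - x) / 0.5); }

    // R1: IF Light is Low AND HumidityRatio is Low THEN Room is not occupied
    static double fireR1(double light, double humidityRatio) {
        return Math.min(low(light), low(humidityRatio));
    }

    // R2: IF Light is Medium AND CO2 is High THEN Room is occupied
    static double fireR2(double light, double co2) {
        return Math.min(medium(light), high(co2));
    }
}
```

A sample can activate both rules with nonzero degrees; the class prediction would then follow the rule with the strongest activation.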

Figure 27
Figure 27 Test-the-train experiment: F-measure for S1 on the Occupancy dataset.

Figure 28
Figure 28 Test-the-train experiment: F-measure trends for S2 on the Occupancy dataset.

Figure 29
Figure 29 Test-the-train experiment: F-measure trends for S3 on the Occupancy dataset.

Figure 30
Figure 30 Models complexity trends for S1 on the Occupancy dataset.

Figure 31
Figure 31 Models complexity trends for S2 on the Occupancy dataset.

Figure 32
Figure 32 Models complexity trends for S3 on the Occupancy dataset.

Figure 33
Figure 33 Final FHDT obtained for the Occupancy dataset.

Figure 34
Figure 34 Test-the-test experiment: F-measure trends for S1 on the Occupancy Test1.

Figure 35
Figure 35 Test-the-test experiment: F-measure trends for S2 on the Occupancy Test1.

Figure 36
Figure 36 Test-the-test experiment: F-measure trends for S3 on the Occupancy Test1.

Figure 37
Figure 37 Test-the-test experiment: F-measure trends for S1 on the Occupancy Test2.

Figure 38
Figure 38 Test-the-test experiment: F-measure trends for S2 on the Occupancy Test2.

Table 1
Characteristics of the Handwritten digits dataset.

Table 2
Characteristics of the Occupancy dataset.

Table 4
Results for some relevant chunks of the SEA dataset.

Table 5
Results for some relevant chunks of the Hyperplane5 dataset.

Table 6
Results for some relevant chunks of the Hyperplane9 dataset.

Table 7
Results for the models obtained in the last training chunk on the Handwritten digits dataset.

Table 8
Class distribution for the Occupancy dataset in S1, S2, and S3.

Table 9
Results for some relevant chunks of the Occupancy dataset.
Figure 39
Figure 39 Test-the-test experiment: F-measure trends for S3 on the Occupancy Test2.

Table 10
Average accuracy comparison for the Test-the-test experiments.

Table 11
Execution times (in s) of the incremental learning algorithms.