Decision Tree Simulator

Decision trees are a popular and versatile supervised machine learning method used for both classification and regression tasks. They are powerful models that aim to mimic the human decision-making process.

A decision tree consists of a tree-like structure where:

  • internal nodes, also called decision nodes, represent a decision based on an attribute
  • branches represent the outcome of the decision
  • leaf nodes represent the predicted class label (a minimal data-structure sketch follows this list)
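
As a minimal sketch, such a structure could be represented by a small node class; the Python class and field names below are illustrative, not part of the simulator:

    from dataclasses import dataclass, field

    @dataclass
    class Node:
        """A node of the decision tree: either a decision node or a leaf."""
        attribute: str | None = None                   # decision node: attribute tested here
        children: dict = field(default_factory=dict)   # branch value -> child Node
        label: str | None = None                       # leaf node: predicted class label

        def is_leaf(self) -> bool:
            return self.label is not None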

Decision trees are constructed using a top-down, recursive partitioning approach called "Top-Down Induction of Decision Trees (TDIDT)". At each node, the algorithm selects the best attribute for splitting the dataset into subsets according to a criterion such as "Information Gain" or "Gini impurity" (this application uses "Information Gain"). The process continues until all instances within a subset belong to the same class or there are no attributes left to split on. In the former case, a leaf node is created with the class label shared by all instances in the subset; in the latter case, the leaf node is assigned the most common class label of the subset.

Input:

  • Dataset \( S \)
  • Attributes \( A \)

Output:

  • Tree \( T \)

Algorithm:

  1. The dataset \( S \) represents the root node of the current tree \( T \)
  2. Check if all instances in \( S \) belong to the same class:
    • If yes, create a leaf node with the corresponding class label and return the node as \( T \)
  3. Check if the attribute set \( A \) is empty:
    • If yes, create a leaf node with the most frequent class label and return the node as \( T \)
  4. Calculate the information gain for each attribute in \( A \) and select \( A_{\text{best}} \) as the attribute with the highest information gain
  5. Create a decision node containing \( A_{\text{best}} \)
  6. Split \( S \) into subsets based on the different categories/values of \( A_{\text{best}} \)
  7. For each subset \( S_i \), make a recursive call with:
    • Input: Subset \( S_i \), attributes \( A \setminus \{A_{\text{best}}\} \)
    • Output: Subtree \( T_i \)
  8. Append each subtree \( T_i \) to \( T \)'s children and return \( T \) (a code sketch of these steps follows this list)
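
The recursion above can be sketched in Python; this is an illustrative implementation, not the simulator's own code. It reuses the hypothetical Node class from the earlier sketch and assumes the dataset is given as a list of attribute-value dictionaries with a parallel list of class labels:

    import math
    from collections import Counter

    def entropy(labels):
        """E(S): Shannon entropy of a list of class labels."""
        total = len(labels)
        return -sum((count / total) * math.log2(count / total)
                    for count in Counter(labels).values())

    def information_gain(rows, labels, attribute):
        """IG(S, A) = E(S) - E(S|A) for one candidate attribute."""
        total = len(labels)
        conditional = 0.0
        for value in {row[attribute] for row in rows}:
            subset = [lab for row, lab in zip(rows, labels)
                      if row[attribute] == value]
            conditional += (len(subset) / total) * entropy(subset)
        return entropy(labels) - conditional

    def build_tree(rows, labels, attributes):
        """TDIDT: rows/labels form the current subset S, attributes is A."""
        # Step 2: all instances belong to the same class -> leaf with that label.
        if len(set(labels)) == 1:
            return Node(label=labels[0])
        # Step 3: no attributes left -> leaf with the most frequent class label.
        if not attributes:
            return Node(label=Counter(labels).most_common(1)[0][0])
        # Step 4: select A_best, the attribute with the highest information gain.
        best = max(attributes, key=lambda a: information_gain(rows, labels, a))
        # Step 5: create a decision node containing A_best.
        node = Node(attribute=best)
        # Steps 6-8: split S on each value of A_best, recurse without A_best,
        # and append each resulting subtree to the node's children.
        remaining = [a for a in attributes if a != best]
        for value in {row[best] for row in rows}:
            subset_rows = [row for row in rows if row[best] == value]
            subset_labels = [lab for row, lab in zip(rows, labels)
                             if row[best] == value]
            node.children[value] = build_tree(subset_rows, subset_labels, remaining)
        return node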

To calculate the Information Gain, the following formula is used:

\( IG(S, A) = E(S) - E(S|A) \)

Where:

  • \(S\) is the dataset
  • \(A\) is the attribute used for the split (equivalently, the set of categories/values of that attribute)
  • \(E(S)\) is the entropy of the dataset \(S\), i.e. the standard Shannon entropy \(E(S) = -\sum_{i} p_i \log_2 p_i\), where \(p_i\) is the proportion of instances in \(S\) belonging to class \(i\)
  • \(E(S|A)\) is the conditional entropy of the dataset given the attribute \(A\), i.e. the entropy of the subsets \(S_v\) obtained by splitting \(S\) on the values \(v\) of \(A\), weighted by their relative sizes: \(E(S|A) = \sum_{v} \frac{|S_v|}{|S|} E(S_v)\)
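
As an illustration, here is a tiny made-up example (the data and the attribute name "windy" are purely hypothetical) that evaluates the formula using the entropy and information_gain helpers from the sketch above:

    # Made-up toy subset: 8 instances, candidate attribute "windy" (weak/strong),
    # class labels "yes"/"no". Splitting on "windy" gives weak -> 3 yes / 1 no
    # and strong -> 1 yes / 3 no.
    rows = [{"windy": "weak"}] * 4 + [{"windy": "strong"}] * 4
    labels = ["yes", "yes", "yes", "no", "yes", "no", "no", "no"]

    print(entropy(labels))                          # E(S)     = 1.0  (4 yes, 4 no)
    print(information_gain(rows, labels, "windy"))  # IG(S, A) ≈ 0.189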