Decision Tree Simulator

Decision trees are a popular and versatile supervised machine learning method used for both classification and regression tasks. They are powerful models that aim to mimic the human decision-making process.

A decision tree consists of a tree-like structure where:

  • internal nodes, also called decision nodes, represent a decision based on an attribute
  • branches represent the outcome of the decision
  • leaf nodes represent the predicted class label (a minimal data-structure sketch follows this list)
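
As a minimal sketch, such a structure could be represented by a small node class; the Python class and field names below are illustrative, not part of the simulator:

    from dataclasses import dataclass, field

    @dataclass
    class Node:
        """A node of the decision tree: either a decision node or a leaf."""
        attribute: str | None = None                   # decision node: attribute tested here
        children: dict = field(default_factory=dict)   # branch value -> child Node
        label: str | None = None                       # leaf node: predicted class label

        def is_leaf(self) -> bool:
            return self.label is not None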

Decision trees are constructed using a top-down, recursive partitioning approach called "Top-Down Induction of Decision Trees (TDIDT)". At each node, the algorithm selects the best attribute for splitting the dataset into subsets according to a criterion such as "Information Gain" or "Gini impurity" (this application uses "Information Gain"). The process continues until all instances within a subset belong to the same class or there are no attributes left to split on. In the former case, a leaf node is created with the class label shared by all instances in the subset; in the latter case, the leaf node is assigned the most common class label of the subset.

Input:

  • Dataset \( S \)
  • Attributes \( A \)

Output:

  • Tree \( T \)

Algorithm:

  1. The dataset \( S \) represents the root node of the current tree \( T \)
  2. Check if all instances in \( S \) belong to the same class:
    • If yes, create a leaf node with the corresponding class label and return the node as \( T \)
  3. Check if the attribute set \( A \) is empty:
    • If yes, create a leaf node with the most frequent class label and return the node as \( T \)
  4. Calculate the information gain for each attribute in \( A \) and select \( A_{\text{best}} \) as the attribute with the highest information gain
  5. Create a decision node containing \( A_{\text{best}} \)
  6. Split \( S \) into subsets based on the different categories/values of \( A_{\text{best}} \)
  7. For each subset \( S_i \), make a recursive call with:
    • Input: Subset \( S_i \), attributes \( A \setminus \{A_{\text{best}}\} \)
    • Output: Subtree \( T_i \)
  8. Append each subtree \( T_i \) to \( T \)'s children and return \( T \) (a code sketch of these steps follows this list)
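
The recursion above can be sketched in Python; this is an illustrative implementation, not the simulator's own code. It reuses the hypothetical Node class from the earlier sketch and assumes the dataset is given as a list of attribute-value dictionaries with a parallel list of class labels:

    import math
    from collections import Counter

    def entropy(labels):
        """E(S): Shannon entropy of a list of class labels."""
        total = len(labels)
        return -sum((count / total) * math.log2(count / total)
                    for count in Counter(labels).values())

    def information_gain(rows, labels, attribute):
        """IG(S, A) = E(S) - E(S|A) for one candidate attribute."""
        total = len(labels)
        conditional = 0.0
        for value in {row[attribute] for row in rows}:
            subset = [lab for row, lab in zip(rows, labels)
                      if row[attribute] == value]
            conditional += (len(subset) / total) * entropy(subset)
        return entropy(labels) - conditional

    def build_tree(rows, labels, attributes):
        """TDIDT: rows/labels form the current subset S, attributes is A."""
        # Step 2: all instances belong to the same class -> leaf with that label.
        if len(set(labels)) == 1:
            return Node(label=labels[0])
        # Step 3: no attributes left -> leaf with the most frequent class label.
        if not attributes:
            return Node(label=Counter(labels).most_common(1)[0][0])
        # Step 4: select A_best, the attribute with the highest information gain.
        best = max(attributes, key=lambda a: information_gain(rows, labels, a))
        # Step 5: create a decision node containing A_best.
        node = Node(attribute=best)
        # Steps 6-8: split S on each value of A_best, recurse without A_best,
        # and append each resulting subtree to the node's children.
        remaining = [a for a in attributes if a != best]
        for value in {row[best] for row in rows}:
            subset_rows = [row for row in rows if row[best] == value]
            subset_labels = [lab for row, lab in zip(rows, labels)
                             if row[best] == value]
            node.children[value] = build_tree(subset_rows, subset_labels, remaining)
        return node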

To calculate the Information Gain, the following formula is used:

\( IG(S, A) = E(S) - E(S|A) \)

Where:

  • \(S\) is the dataset
  • \(A\) is the attribute used for the split (equivalently, the set of categories/values of that attribute)
  • \(E(S)\) is the entropy of the dataset \(S\), i.e. the standard Shannon entropy \(E(S) = -\sum_{i} p_i \log_2 p_i\), where \(p_i\) is the proportion of instances in \(S\) belonging to class \(i\)
  • \(E(S|A)\) is the conditional entropy of the dataset given the attribute \(A\), i.e. the entropy of the subsets \(S_v\) obtained by splitting \(S\) on the values \(v\) of \(A\), weighted by their relative sizes: \(E(S|A) = \sum_{v} \frac{|S_v|}{|S|} E(S_v)\)
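
As an illustration, here is a tiny made-up example (the data and the attribute name "windy" are purely hypothetical) that evaluates the formula using the entropy and information_gain helpers from the sketch above:

    # Made-up toy subset: 8 instances, candidate attribute "windy" (weak/strong),
    # class labels "yes"/"no". Splitting on "windy" gives weak -> 3 yes / 1 no
    # and strong -> 1 yes / 3 no.
    rows = [{"windy": "weak"}] * 4 + [{"windy": "strong"}] * 4
    labels = ["yes", "yes", "yes", "no", "yes", "no", "no", "no"]

    print(entropy(labels))                          # E(S)     = 1.0  (4 yes, 4 no)
    print(information_gain(rows, labels, "windy"))  # IG(S, A) ≈ 0.189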