Treatment learning

Treatment learning is a data mining process that evaluates an ordered, classified data set to produce a representative data model. The model should describe some key property of the data set.

The output of a treatment learning session is a treatment, a conjunction of attribute-value pairs. The size of the treatment is the number of pairs that compose the treatment.
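As a minimal sketch, a treatment can be represented as a mapping from attributes to required values. The attribute names and the `matches` helper below are hypothetical and chosen only for illustration; they do not come from any particular treatment learner.

```python
# A treatment modelled as a dict: every attribute-value pair must hold
# at once (a conjunction). The attribute names are made up.
treatment = {"outlook": "overcast", "windy": "no"}


def matches(row, treatment):
    """Return True if the row satisfies every attribute-value pair."""
    return all(row.get(attr) == value for attr, value in treatment.items())


# The size of the treatment is the number of attribute-value pairs.
size = len(treatment)

row = {"outlook": "overcast", "windy": "no", "humidity": "high"}
print(matches(row, treatment), size)
```

A row may carry attributes that the treatment does not mention; only the pairs named in the treatment are checked.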

From [1]: Three concepts can be used to define treatment learning: lift, minimum best support, and treatment effect size.

The lift of a decision is the change that the decision produces in a set of examples once it has been imposed on that set.

Lift is measured as the ratio of the weighted sum of the outcomes selected by a particular treatment to the weighted sum of all outcomes:


 * $$\mathrm{lift} = \frac{\mathrm{weighted\ sum(outcomes\ with\ treatment)}}{\mathrm{weighted\ sum(all\ outcomes)}}.$$

To compute a weighted sum, every outcome is first assigned a weight. Outcomes that we would prefer receive more weight, and outcomes that we do not want receive less.

For example, say in a particular data set there are three different types of outcomes. We might score the outcomes as follows:


 * best outcome = 3
 * average outcome = 2
 * worst outcome = 1

Also, say our data set includes 6 outcomes, two of each ranking. We then can come up with a weighted sum of all outcomes:


 * $$1\left(\frac{2}{6}\right) + 2 \left(\frac{2}{6}\right) + 3 \left(\frac{2}{6}\right) = 2$$

Now, say that 3 of the 6 outcomes (2 best outcomes and 1 average outcome) share a common attribute value. This attribute value will be our treatment, and the weighted sum of the outcomes with the treatment is:


 * $$1 \left(\frac{0}{3}\right) + 2 \left(\frac{1}{3}\right) + 3 \left(\frac{2}{3}\right) = \frac{8}{3} \approx 2.67$$

Then our computed lift would be:


 * $$\frac{2.67}{2} \approx 1.33$$
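The arithmetic of this example can be checked with a short sketch. The function names `weighted_sum` and `lift` and the outcome labels are illustrative assumptions; the weights are the 3/2/1 scores given above.

```python
from collections import Counter

# Scores from the example: preferred outcomes get more weight.
WEIGHTS = {"best": 3, "average": 2, "worst": 1}


def weighted_sum(outcomes, weights=WEIGHTS):
    """Sum of weight * (fraction of outcomes carrying that label)."""
    if not outcomes:
        raise ValueError("weighted_sum needs at least one outcome")
    counts = Counter(outcomes)
    total = len(outcomes)
    return sum(weights[label] * n / total for label, n in counts.items())


def lift(treated, everything, weights=WEIGHTS):
    """Weighted sum under the treatment over the weighted sum of all outcomes."""
    return weighted_sum(treated, weights) / weighted_sum(everything, weights)


# Six outcomes, two of each ranking.
all_outcomes = ["best", "best", "average", "average", "worst", "worst"]
# The treatment selects 2 best outcomes and 1 average outcome.
treated_outcomes = ["best", "best", "average"]

print(round(weighted_sum(all_outcomes), 2))            # 2.0
print(round(weighted_sum(treated_outcomes), 2))        # 2.67
print(round(lift(treated_outcomes, all_outcomes), 2))  # 1.33
```

A lift above 1 means the treated subset scores better than the baseline; a lift below 1 means the treatment makes matters worse.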

As a further illustration, consider a data set that records weather conditions and how often golf was played during the weekend. A treatment learner may apply a weight to each record according to how much golf was played, with a higher weight when more golf is played.

The treatment learner may notice that the baseline score increases under certain weather conditions (for example, it might notice that the individual always plays lots of golf on overcast weekends). A decision that increases the baseline score when it is applied is said to have "higher" or "increased" lift.

While one can usually find better lift values by adding more attributes to a data treatment, eventually the treatment becomes too specific to the example data pool given.

All treatment learners keep a minimum best support value in order to constrain the number of attributes used in a data treatment. This guards against overfitting, that is, against drawing conclusions that are too specific to the given data pool.

The best support value is obtained by taking the number of best outcomes that remain in the data pool after the data treatment and dividing it by the total number of best outcomes in the data pool without the treatment:


 * $$\mathrm{best\ support} = \frac{\mathrm{number\ of\ best\ outcomes\ after\ treatment}}{\mathrm{number\ of\ best\ outcomes\ before\ treatment}}.$$

Using our example from before, we had 2 best outcomes before the data treatment. After the data treatment we still had 2 of the best outcomes:


 * $$\mathrm{best\ support} = \frac{2}{2} = 1.$$

This is the maximum possible value for best support. A treatment learner keeps a fixed value for its minimum best support. If the best support of a data treatment falls below this value, the treatment is thrown out. This prevents the data treatment from becoming too specific and overfitting the model.
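The best support test can be sketched as follows. The function names and the threshold of 0.5 are assumptions made for illustration; the source does not state which value a given learner uses.

```python
def best_support(treated_outcomes, all_outcomes, best_label="best"):
    """Best outcomes kept by the treatment over best outcomes overall."""
    best_before = all_outcomes.count(best_label)
    if best_before == 0:
        raise ValueError("the data pool contains no best outcomes")
    return treated_outcomes.count(best_label) / best_before


def passes_min_best_support(treated_outcomes, all_outcomes, threshold):
    """A treatment is kept only if its best support reaches the threshold."""
    return best_support(treated_outcomes, all_outcomes) >= threshold


# Same example as before: the treatment keeps both best outcomes.
all_outcomes = ["best", "best", "average", "average", "worst", "worst"]
treated_outcomes = ["best", "best", "average"]

MIN_BEST_SUPPORT = 0.5  # hypothetical threshold

print(best_support(treated_outcomes, all_outcomes))  # 1.0
print(passes_min_best_support(treated_outcomes, all_outcomes, MIN_BEST_SUPPORT))
```

A narrower treatment that kept only one of the two best outcomes would score 0.5, and one that kept none would score 0 and be discarded under any positive threshold.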

A side effect of the minimum best support test is a small treatment effect size. Treatment effect size is the number of attributes that compose a particular treatment. Because of the minimum best support test, the number of attributes used in a data treatment usually remains small: adding attributes to a data treatment will often cause it to fall below the minimum best support threshold.
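The interaction between lift, minimum best support, and treatment size can be seen in a brute-force sketch. This is not the search strategy of any particular treatment learner; the rows, attribute names, and thresholds are all made up for illustration, and the enumeration simply tries every conjunction up to a chosen size.

```python
from itertools import combinations

WEIGHTS = {"best": 3, "average": 2, "worst": 1}

# Hypothetical data pool: (attributes, outcome) for each row.
ROWS = [
    ({"outlook": "overcast", "windy": "no"}, "best"),
    ({"outlook": "overcast", "windy": "yes"}, "best"),
    ({"outlook": "overcast", "windy": "no"}, "average"),
    ({"outlook": "sunny", "windy": "no"}, "average"),
    ({"outlook": "sunny", "windy": "yes"}, "worst"),
    ({"outlook": "rainy", "windy": "yes"}, "worst"),
]


def weighted_sum(outcomes):
    """Mean weight, equal to the sum of weight * fraction per outcome type."""
    return sum(WEIGHTS[o] for o in outcomes) / len(outcomes)


def select(rows, treatment):
    """Outcomes of the rows that satisfy every pair in the treatment."""
    return [outcome for attrs, outcome in rows
            if all(attrs.get(a) == v for a, v in treatment)]


def candidate_treatments(rows, max_size):
    """Yield every conjunction of up to max_size attribute-value pairs."""
    pairs = sorted({(a, v) for attrs, _ in rows for a, v in attrs.items()})
    for size in range(1, max_size + 1):
        for combo in combinations(pairs, size):
            # Allow at most one value per attribute in a conjunction.
            if len({a for a, _ in combo}) == size:
                yield combo


def evaluate(rows, max_size=2, min_best_support=0.75):
    """Return (lift, best support, treatment) tuples, highest lift first."""
    all_outcomes = [o for _, o in rows]
    baseline = weighted_sum(all_outcomes)
    best_before = all_outcomes.count("best")
    results = []
    for treatment in candidate_treatments(rows, max_size):
        selected = select(rows, treatment)
        if not selected:
            continue  # the treatment matches no rows
        support = selected.count("best") / best_before
        if support < min_best_support:
            continue  # too specific: it loses too many best outcomes
        results.append((weighted_sum(selected) / baseline, support, treatment))
    return sorted(results, reverse=True)


for lift, support, treatment in evaluate(ROWS):
    print(round(lift, 2), support, treatment)
```

With the hypothetical threshold of 0.75, only the single-pair treatment `outlook = overcast` survives. Lowering the threshold to 0.5 admits the two-pair treatment `outlook = overcast, windy = yes`, which has a higher lift but keeps only one of the two best outcomes.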

External sources

 * [1] T. Menzies, Y. Hu. Data Mining for Very Busy People. IEEE Computer, October 2003, pp. 18-25.