GitHub - TimoSavi/ceif: Categorized extended isolation forest tool

Categorized extended isolation forest tool

This is a simple command-line program for anomaly detection based on the extended isolation forest by Hariri et al. It is meant for environments where there are a lot of different simple datasets to be analyzed, but writing a separate program for each dataset is too tedious. It has practical extensions like:

Some input data fields can be used as category fields. Effectively this creates a separate forest for each category, making the data of each category independent.
One input data field can be used as a label field. The label field is a unique label for each input row (e.g. a timestamp or row id), making it easy to identify outlier data.
Forests can be saved to a file to be used later in analysis.
Sampling is done using reservoir sampling. This allows very large training data to be used.
Existing forests can be enhanced with new data. New data is added using reservoir sampling.
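The combination of category fields and reservoir sampling can be illustrated with a minimal sketch. It is not taken from the ceif source; the names Reservoir and sample_by_category are hypothetical, and the sampling uses the classic Algorithm R, which may differ from what ceif does internally. Each category gets its own fixed-size reservoir, and because the reservoir remembers how many rows it has seen, more rows can be fed in later without holding the whole training set in memory.

```python
import random
from collections import defaultdict


class Reservoir:
    """Fixed-size uniform random sample of a stream (Algorithm R)."""

    def __init__(self, capacity, rng):
        self.capacity = capacity
        self.items = []
        self.seen = 0  # total rows offered so far, kept so sampling can continue later
        self.rng = rng

    def add(self, row):
        self.seen += 1
        if len(self.items) < self.capacity:
            # Reservoir not full yet: keep every row.
            self.items.append(row)
        else:
            # Replace a random slot with probability capacity / seen.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = row


def sample_by_category(rows, category_index, capacity, seed=0):
    """Keep one reservoir per category value (hypothetical helper)."""
    rng = random.Random(seed)
    reservoirs = defaultdict(lambda: Reservoir(capacity, rng))
    for row in rows:
        reservoirs[row[category_index]].add(row)
    return reservoirs


if __name__ == "__main__":
    rows = [("web", i) for i in range(1000)] + [("db", i) for i in range(5)]
    reservoirs = sample_by_category(rows, category_index=0, capacity=10)
    for category, reservoir in reservoirs.items():
        print(category, reservoir.seen, len(reservoir.items))
```

Each reservoir would then be the training sample for that category's forest. Calling add again on a stored reservoir is one way new data could be merged into an existing sample.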

See the docs directory for more documentation.

Algorithm change

Selection of intercept point p

The original algorithm has problems with certain types of datasets because of how the random intercept point p is selected. The intercept p is selected from a rectangular area, and if the data is uniformly distributed over that area, then all sub-spaces divided by the random slopes contain sample points. As a result, sub-spaces extending to infinity appear to contain inliers, and the anomaly score is approximately 0.5 for the whole space.

To tackle this, p is selected here with the following steps:

Select a random sample point.
Draw a random adjustment vector a from the standard normal distribution (mean 0, standard deviation 1). The length |a| is proportional to:

Tree height (larger at the tree root)
Dimension value range (a larger range makes the adjustment larger)

Calculate the intercept p by adding a to the randomly selected sample point.

This has the following effects:

There will always be some values of p outside the sample area.
Most values of p tend to accumulate where the data already is at the beginning of tree building.

The selection area for p is effectively an enlarged version of the sample point area rather than a rectangular area, which can cause the scoring problems described above.
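The selection steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the ceif implementation: the function name select_intercept, the depth and max_depth parameters, and in particular the linear height_factor are hypothetical, since the text only says that the adjustment is larger near the tree root and for dimensions with a larger value range.

```python
import random


def select_intercept(samples, depth, max_depth, rng):
    """Return an intercept p: a random sample point plus a scaled normal adjustment.

    samples   -- list of points, each a list of floats
    depth     -- depth of the node being built (0 at the root)
    max_depth -- maximum tree depth (must be > 0)
    """
    point = rng.choice(samples)

    # Assumed scaling: 1.0 at the root, shrinking linearly to 0.0 at max_depth.
    height_factor = (max_depth - depth) / max_depth

    p = []
    for d in range(len(point)):
        values = [s[d] for s in samples]
        value_range = max(values) - min(values)
        # Standard normal draw, scaled by the dimension's range and the height factor.
        adjustment = rng.gauss(0.0, 1.0) * value_range * height_factor
        p.append(point[d] + adjustment)
    return p


if __name__ == "__main__":
    rng = random.Random(42)
    samples = [[0.0, 0.0], [1.0, 10.0], [2.0, 20.0]]
    print(select_intercept(samples, depth=0, max_depth=8, rng=rng))
    print(select_intercept(samples, depth=7, max_depth=8, rng=rng))
```

Because p is centred on an actual sample point, most intercepts land where data exists, while the normal adjustment lets some of them fall outside the sampled area.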

Nearest training point distance in leaf nodes

In leaf nodes, the relative distance between the analysed point and the nearest training data point in the node is calculated. The absolute distance is scaled to a relative distance using the average sample distance in the tree. If the relative distance is larger than average, the score is increased; if it is smaller, the score is reduced.
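As a rough illustration of this idea, the sketch below computes a relative distance and applies it to a score. The exact formula used by ceif is not given here, so the function names, the Euclidean metric, the linear adjustment, and the weight parameter are all assumptions made for the example.

```python
import math


def relative_distance(point, leaf_samples, avg_sample_distance):
    """Distance to the nearest training point in the leaf, divided by the tree's average sample distance."""
    nearest = min(math.dist(point, s) for s in leaf_samples)
    return nearest / avg_sample_distance


def adjust_score(base_score, rel_dist, weight=0.1):
    """Hypothetical adjustment: raise the score when rel_dist > 1, lower it when rel_dist < 1."""
    return base_score + weight * (rel_dist - 1.0)


if __name__ == "__main__":
    leaf = [[3.0, 4.0], [6.0, 8.0]]
    far = relative_distance([0.0, 0.0], leaf, avg_sample_distance=2.5)
    near = relative_distance([3.0, 4.5], leaf, avg_sample_distance=2.5)
    print(adjust_score(0.5, far), adjust_score(0.5, near))
```

A relative distance of exactly 1.0 leaves the score unchanged in this sketch, which matches the description that only deviations from the average move the score.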

The scale of dimension attribute values can also be adjusted; see the tweaking document for details.
