Java SOMToolbox

This page is being kept for sentimental reasons and because people still find it when searching for SOM software packages. If you want to use an up-to-date version of the software, please visit the Java SOMToolbox page (http://www.ifs.tuwien.ac.at/dm/somtoolbox/index.html) of the Data Mining group (http://www.ifs.tuwien.ac.at/dm/) at the Institute of Software Technology and Interactive Systems, Vienna University of Technology.

Installation

  • Download somtoolbox-0.4.6-md.tar.gz (http://test.pragsem.org/wp-content/uploads/2010/07/somtoolbox-0.4.6-md.tar.gz) (UPDATED!!), unpack it and open somtoolbox.sh in an editor.
    This release fixes a bug that prevented the SOMViewer application from starting. A batch script for Windows users is also included.
  • Change the BASE_DIR variable to the directory that somtoolbox.sh has been extracted to.
  • Create a link to the somtoolbox.sh script from some directory in your PATH (e.g. /usr/local/bin) so that you can run it from any directory you like.
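
The three steps above can be sketched as a small shell function. This is only an illustration under stated assumptions: the function name install_somtoolbox is hypothetical, the layout inside the archive is not assumed (the script is searched for after unpacking), and BASE_DIR is assumed to be set on a line of somtoolbox.sh that starts with BASE_DIR=. Check the script by hand if your copy looks different.

```shell
#!/bin/sh
# Hypothetical helper: unpack the archive, set BASE_DIR, link the launcher.
install_somtoolbox() {
    archive=$1        # path to somtoolbox-0.4.6-md.tar.gz
    install_dir=$2    # directory the archive is unpacked into
    bin_dir=$3        # a directory on your PATH, e.g. /usr/local/bin

    mkdir -p "$install_dir" "$bin_dir" || return 1
    tar -xzf "$archive" -C "$install_dir" || return 1

    # Locate the launcher instead of guessing the archive layout.
    script=$(find "$install_dir" -name somtoolbox.sh -type f | head -n 1)
    [ -n "$script" ] || { echo "somtoolbox.sh not found" >&2; return 1; }
    script_dir=$(cd "$(dirname "$script")" && pwd)

    # Assumption: BASE_DIR is assigned on a line starting with "BASE_DIR=".
    sed "s|^BASE_DIR=.*|BASE_DIR=\"$script_dir\"|" "$script" > "$script.tmp" || return 1
    mv "$script.tmp" "$script" || return 1
    chmod +x "$script" || return 1

    # Make the launcher reachable from any directory.
    ln -sf "$script_dir/somtoolbox.sh" "$bin_dir/somtoolbox.sh"
}
```

A call could look like install_somtoolbox somtoolbox-0.4.6-md.tar.gz "$HOME/opt" /usr/local/bin; writing to /usr/local/bin usually requires root privileges.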

Usage

Overview

The applications of the toolbox are run via the somtoolbox.sh script. If it is called without parameters, it prints a list of the applications that can be run.

$ somtoolbox.sh

Runnable classes: SOMViewer
AttendeeMapper
SOMLibVectorNormalization
GrowingSOM
GHSOM
TrajectoryOutputter
HTMLOutputter
LabelSOM
SOMLibMapOutputter
SOMLibZeroVectorRemover

The two most important applications are GrowingSOM, which trains a Self-Organizing Map (including fixed-size SOMs), and SOMViewer, which is used to explore a trained SOM.

SOM Training

For the example below the Iris data set is used. The necessary files are:

  • iris.vec (http://test.pragsem.org/wp-content/uploads/2010/07/iris.vec) the actual data vectors.
  • iris.tv (http://test.pragsem.org/wp-content/uploads/2010/07/iris.tv) the template vector.
  • iris.clsinf (http://test.pragsem.org/wp-content/uploads/2010/07/iris.clsinf) the class information file.

It is usually quite helpful to create a directory structure that looks something like this:

$ ls -R
.:
output  properties  vectors

./output:

./properties:

./vectors:

where output will contain all files that are created by the SOM training application, properties the property files and vectors the data files.
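
The layout above can be created in one step from the directory you want to work in. The directory names are the ones used throughout this example; nothing else is assumed.

```shell
# Create the three working directories used in this example.
# -p makes the command safe to repeat if some of them already exist.
mkdir -p output properties vectors
```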

First, download the Iris data set and save the files to the vectors directory.

Then create a property file (iris_test1.prop (http://test.pragsem.org/wp-content/uploads/2010/07/iris_test1.prop)) containing the training parameters in the properties directory.

outputDirectory=output
namePrefix=iris_test1
vectorFileName=vectors/iris.vec
templateFileName=vectors/iris.tv
isNormalized=false
randomSeed=7

xSize=35
ySize=25
learnrate=0.7
#metricName=
numIterations=10000

With this property file, all files that are generated are put into directory output (which has to exist) and all files are prefixed with the string iris_test1. The data vectors are read from the given files. The property isNormalized is used for the initialization of the weight vectors of the units. If it is true, the randomly initialized weight vectors are normalized to length 1. This is only useful when the data vectors are also normalized to length 1 (e.g. when using vectors describing text documents). For the Iris data set used here this property is set to false, because the values of the data vectors are not normalized.

The Self-Organizing Map that will be trained for 10,000 iterations consists of 35×25 units and the initial learning rate is 0.7. The learning rate decreases automatically during training. The property metricName is not explicitly set in this example, hence the L2 metric (Euclidean) is used as default for distance calculation during training.
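
Because the output directory has to exist before training, it can be handy to read it from the property file and create it if necessary. The sketch below is not part of the toolbox: get_prop and ensure_output_dir are hypothetical helper names, and the parsing assumes simple key=value lines without spaces around the equals sign, as in the file shown above. Relative paths are resolved against the current directory, which matches how the example is run.

```shell
# Print the value of one key from a simple key=value property file.
# Lines starting with "#" do not match, so commented-out keys yield nothing.
get_prop() {
    # $1 = property file, $2 = key
    sed -n "s/^$2=//p" "$1" | tail -n 1
}

# Create the directory named by outputDirectory if it is missing.
ensure_output_dir() {
    dir=$(get_prop "$1" outputDirectory)
    [ -n "$dir" ] || { echo "outputDirectory not set in $1" >&2; return 1; }
    mkdir -p "$dir"
}
```

For the example on this page, ensure_output_dir properties/iris_test1.prop would create the output directory if it is not there yet.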

In the next step, map training is started by calling:

$ somtoolbox.sh GrowingSOM -h -l LabelSOM -n 4 properties/iris_test1.prop

...
[lots of debug output generated here]
...

INFO: finished GrowingSOM

The switch -h turns on HTML output. Option -l defines the labeling algorithm to use (currently, only LabelSOM is implemented) and -n the number of labels that will be calculated. In this example, the data vectors are of dimensionality 4, so it doesn’t make sense to use more.
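
Since training produces a lot of debug output, it may be convenient to send it to a log file and only check for the final message. The following is a sketch under assumptions: train_som and the SOMTOOLBOX variable are hypothetical names, the launcher is assumed to return a non-zero exit status on failure, and both stdout and stderr are captured because it is not known which stream the toolbox logs to.

```shell
# Launcher to call; defaults to somtoolbox.sh found on the PATH.
SOMTOOLBOX=${SOMTOOLBOX:-somtoolbox.sh}

train_som() {
    prop_file=$1   # e.g. properties/iris_test1.prop
    log_file=$2    # where the debug output is written

    # Same switches as in the call above; all output goes to the log file.
    "$SOMTOOLBOX" GrowingSOM -h -l LabelSOM -n 4 "$prop_file" \
        > "$log_file" 2>&1 || return 1

    # The run shown above ends with "INFO: finished GrowingSOM".
    grep -q "finished GrowingSOM" "$log_file"
}
```

A call such as train_som properties/iris_test1.prop output/train.log succeeds only if the final message appears in the log.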

The directory output should now contain the following files:

$ ls output/
iris_test1.dwm.gz  iris_test1.map.gz   iris_test1.wgt.gz   wz_tooltip.js
iris_test1.html    iris_test1.unit.gz  somtoolbox.css

Here is a short explanation of the different files:

  • .dwm files contain the assignment of each input datum to its best-matching units, sorted with increasing distance (i.e. best-matching unit first, then second best-matching, third …)
  • .map files contain some metadata.
  • .wgt files contain the weight vectors of the map units.
  • .unit files contain the description of the single units, i.e. quantization error, mean quantization error, which input data are mapped onto a unit, unit labels, etc.
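
To get a first impression of what these files contain, the beginning of each one can be printed without unpacking it to disk. This sketch assumes that the .gz files are ordinary gzip-compressed text, which the file extension suggests but which is not stated above; peek_gz is a hypothetical helper name.

```shell
# Print the first lines of a gzip-compressed file.
peek_gz() {
    # $1 = compressed file, $2 = number of lines to show (default 20)
    gzip -dc "$1" | head -n "${2:-20}"
}
```

For example, peek_gz output/iris_test1.unit.gz 10 shows the first ten lines of the unit description file.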

Viewing a SOM

A trained SOM can be viewed using the SOMViewer application:

$ somtoolbox.sh SOMViewer -u output/iris_test1.unit.gz \
    -w output/iris_test1.wgt.gz -m output/iris_test1.map.gz \
    -c vectors/iris.clsinf
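
Because the three map files share the same prefix, the viewer call can be wrapped in a small function that builds the file names and checks that they exist first. This is only a sketch: view_som and the SOMTOOLBOX variable are hypothetical names, and the switches are the ones used in the call above.

```shell
# Launcher to call; defaults to somtoolbox.sh found on the PATH.
SOMTOOLBOX=${SOMTOOLBOX:-somtoolbox.sh}

view_som() {
    prefix=$1   # e.g. output/iris_test1
    clsinf=$2   # e.g. vectors/iris.clsinf

    # Stop early with a clear message if one of the inputs is missing.
    for f in "$prefix.unit.gz" "$prefix.wgt.gz" "$prefix.map.gz" "$clsinf"; do
        [ -f "$f" ] || { echo "missing file: $f" >&2; return 1; }
    done

    "$SOMTOOLBOX" SOMViewer -u "$prefix.unit.gz" -w "$prefix.wgt.gz" \
        -m "$prefix.map.gz" -c "$clsinf"
}
```

With the files from this example, view_som output/iris_test1 vectors/iris.clsinf starts the viewer.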

Here is a screenshot of the SOMViewer:

[Screenshot: SOM Viewer]

Contact

If you are interested in commercial search and text analytics solutions and consulting, send me an e-mail (m.dittenbach@max-recall.com) or call the max-recall (http://www.max-recall.com) phone number: +43 720 978603. I am also happy to answer e-mails regarding my research topics.

Location: Vienna, Austria