
3. Usage

3.1 Concepts

nilsimsa code

A compact representation of the trigram statistics of a file. Currently, it is expressed as a 64-character string of hex digits, with one bit for each of 256 numbers indicating whether it is above or below average. A more versatile base-64 form is planned; it will represent the same bit vector in 44 characters, and can also encode the actual counts.

nilsimsa

The degree of similarity between two nilsimsa codes, on a scale of [-128, 128]: 128 means that the codes are identical, and -128 means that they are completely different.

cluster

A set of similar nilsimsa codes which are different from the other nilsimsa codes. A cluster may mean that the files are part of a spam run; it may also mean that three people at an ISP are on a mailing list.
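
To make the first two concepts concrete, here is a minimal Python sketch. It is not taken from the nilsimsa source: the function names code_to_bits and score are made up for illustration, and the scoring rule (128 minus the number of differing bits) is an assumption chosen because it reproduces the endpoints given above, 128 for identical codes and -128 for codes that differ in all 256 bits. The last line only shows that standard base-64 encoding of 32 bytes happens to take 44 characters; the planned base-64 form may be laid out differently.

```python
import base64


def code_to_bits(code: str) -> int:
    """Parse a 64-character hex nilsimsa code into a 256-bit integer."""
    if len(code) != 64:
        raise ValueError("expected 64 hex digits")
    return int(code, 16)


def score(a: str, b: str) -> int:
    """Assumed scoring rule: 128 minus the number of differing bits."""
    differing = bin(code_to_bits(a) ^ code_to_bits(b)).count("1")
    return 128 - differing


same = "0f" * 32       # made-up code, 64 hex digits
opposite = "f0" * 32   # every bit flipped relative to `same`

print(score(same, same))      # 128
print(score(same, opposite))  # -128
print(len(base64.b64encode(bytes.fromhex(same))))  # 44
```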

3.2 General Usage

Commands

Nilsimsa takes a list of files or codes and some switches (options or commands). If no files are specified, it reads standard input as the one file.

Nilsimsa has three commands: -c, which compares each file to its argument; -C, which scans the files for clusters and, if it finds one, adds it to the argument; and -H, which compares each file to the clusters listed in the argument. If no command is specified, nilsimsa computes the code of each file and outputs it to standard output.
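
The idea behind the -C command can be sketched in Python as follows. This is only an illustration under stated assumptions, not the algorithm nilsimsa actually uses: it compares every pair of codes (roughly what -x asks for), assumes the score is 128 minus the number of differing bits, treats two codes as linked when their score is at least the threshold, and reports each linked group that reaches the minimum size. The names score and find_clusters are hypothetical; the defaults 24 and 3 are the ones given for -t and -m under Options.

```python
from itertools import combinations


def score(a: str, b: str) -> int:
    """Assumed scoring rule: 128 minus the number of differing bits."""
    return 128 - bin(int(a, 16) ^ int(b, 16)).count("1")


def find_clusters(codes, threshold=24, min_size=3):
    """Group codes linked by a score >= threshold (union-find over all pairs)."""
    parent = list(range(len(codes)))

    def find(i):
        # Follow parent links to the group representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Exhaustive comparison: every pair of codes is scored once.
    for i, j in combinations(range(len(codes)), 2):
        if score(codes[i], codes[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i, code in enumerate(codes):
        groups.setdefault(find(i), []).append(code)
    # Keep only groups large enough to count as a cluster.
    return [group for group in groups.values() if len(group) >= min_size]


# Three nearly identical made-up codes and one unrelated code.
codes = ["00" * 32, "00" * 31 + "01", "00" * 31 + "03", "ff" * 32]
clusters = find_clusters(codes)
print(len(clusters), len(clusters[0]))  # 1 3
```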

Options

Nilsimsa has ten options:

-t, --threshold

In the -c command, sets the threshold for being similar to the argument of -c. Nilsimsa outputs 1 or 0 instead of the nilsimsa. In the -C command, sets the minimum nilsimsa for a cluster to be considered a cluster. The default is 24.

-s, --skip-headers

When reading a file, ignore everything up to and including the first blank line. In a mail or news message, these are the headers.

--mbox

Parse each file as a mailbox file, computing the nilsimsa code for each message in it. Each message is named with the filename with #n appended, where n is a number.

-a, --aggregate

Add up all the histograms of the files and output a single nilsimsa code. This does not work with the -C command.

-r, --recursive

If any of the files are directories, compute the nilsimsa codes of all files contained in them.

-m, --min-cluster-size

Sets the minimum cluster size for the -C command. The default is 3.

-x, --exhaustive

With this option, the -C command compares all pairs of codes. Without it, the -C command compares only n log n codes.

-v, --verbose

In the -C command, lists all the codes after sorting to find clusters; in the -H command, lists which cluster matches.

--cat

Outputs the characters that were counted in nilsimsa codes to standard output. This is for debugging.

--debug

This outputs whatever I need for debugging something.
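
As an illustration of what -s removes, the following Python sketch drops everything up to and including the first blank line of a message. It is not the nilsimsa implementation; the function name skip_headers is hypothetical, and returning an empty string when no blank line is found is an assumption, since the behaviour in that case is not stated above.

```python
def skip_headers(text: str) -> str:
    """Drop everything up to and including the first blank line."""
    lines = text.splitlines(keepends=True)
    for index, line in enumerate(lines):
        if line.strip("\r\n") == "":
            # Everything after the first blank line is the body.
            return "".join(lines[index + 1:])
    return ""  # assumption: no blank line means there is no body


message = "From: someone@example.com\nSubject: test\n\nFirst line\nSecond line\n"
print(skip_headers(message), end="")
```

Only the two body lines are printed; the From and Subject headers and the blank line that ends them are discarded.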

3.3 News servers

You can use nilsimsa -C /var/spool/news/nilsimsa.codes -sr /var/spool/news/message.id (the directory will vary among news servers) to identify excessive multiposts (EMP). If your news server has a way of running a command on incoming posts, run nilsimsa -H /var/spool/news/nilsimsa.codes on them to filter out EMP. For the -C scan, allow one byte of RAM in the nilsimsa process for every eight bytes of news spool; if the spool contains binaries or other large posts, it will take less. The scan may take a few hours.
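
The memory rule of thumb above is simple arithmetic, and the Python sketch below applies it. The helper names spool_bytes and estimate_ram are made up for illustration, and the result is only the upper estimate of one byte of RAM per eight bytes of spool; as noted, a spool with binaries or other large posts will need less.

```python
import os


def spool_bytes(root: str) -> int:
    """Total size in bytes of the regular files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def estimate_ram(spool_size: int) -> int:
    """One byte of RAM per eight bytes of spool, rounded up."""
    return -(-spool_size // 8)


# An 8 GB spool would call for about 1 GB of RAM by this rule.
print(estimate_ram(8_000_000_000))  # 1000000000
```

To apply it to a real spool, pass the result of spool_bytes on your own spool directory to estimate_ram.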

