A compact representation of the trigram statistics of a file. Currently, it is expressed as a 64-character string of hex digits, with one bit for each of 256 numbers indicating whether that number is above or below average. A more versatile base-64 form is planned; it will represent the same bit vector in 44 characters and will also be able to encode the actual counts.
The amount of similarity between two nilsimsa codes, on a scale of [-128, 128], with 128 meaning that the codes are identical and -128 meaning that they are completely different.
A set of nilsimsa codes that are similar to one another and different from the other nilsimsa codes. A cluster may mean that the files are part of a spam run; it may also mean that three people at an ISP are on a mailing list.
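The definitions above can be sketched in Python. This is a minimal illustration, not nilsimsa's own code. It assumes that the similarity is the number of agreeing bits minus 128, which matches the stated scale of [-128, 128]. It uses standard padded base-64 only to show where a 44-character figure can come from; the planned format may differ. The function names are hypothetical.

```python
import base64


def code_to_bytes(code: str) -> bytes:
    """Decode a 64-hex-digit nilsimsa code into its 32 bytes (256 bits)."""
    if len(code) != 64:
        raise ValueError("a nilsimsa code is 64 hex digits")
    return bytes.fromhex(code)


def similarity(code_a: str, code_b: str) -> int:
    """Assumed scoring: agreeing bits minus 128, giving a range of [-128, 128]."""
    a = int.from_bytes(code_to_bytes(code_a), "big")
    b = int.from_bytes(code_to_bytes(code_b), "big")
    differing = bin(a ^ b).count("1")  # bits set in the XOR are the bits that differ
    return 128 - differing


def base64_length(code: str) -> int:
    """Length of the same 256 bits in standard padded base-64 (illustrative only)."""
    return len(base64.b64encode(code_to_bytes(code)))
```

Under these assumptions, two identical codes score 128, two codes that differ in every bit score -128, and each differing bit lowers the score by one.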
Nilsimsa takes a list of files or codes and some switches (options or commands). If no files are specified, it reads standard input as the one file.
Nilsimsa has three commands: -c, which compares each file to its argument; -C, which scans the files for clusters and, if it finds one, adds it to the argument; and -H, which compares each file to the clusters listed in the argument. If no command is specified, nilsimsa computes the code of each file and outputs it to standard output.
Nilsimsa has ten options:
In the -c command, sets the threshold for being similar to the argument of -c. Nilsimsa outputs 1 or 0 instead of the nilsimsa. In the -C command, sets the minimum nilsimsa for a cluster to be considered a cluster. The default is 24.
When reading a file, ignore everything up to and including the first blank line. In a mail or news message, these are the headers.
Parse each file as a mailbox file, computing the nilsimsa code for each message in it. Each message is named by the filename with #n appended, where n is a number.
Add up all the histograms of the files and output a single nilsimsa code. This does not work with the -C command.
If any of the files are directories, compute the nilsimsa codes of all files contained in them.
Sets the minimum cluster size for the -C command. The default is 3.
The -C command compares all pairs of codes. Without this option, it makes only on the order of n log n comparisons.
In the -C command, lists all the codes after sorting to find clusters; in the -H command, lists which cluster matches.
Outputs the characters that were counted in nilsimsa codes to standard output. This is for debugging.
This outputs whatever I need for debugging something.
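To show how the threshold (default 24), the minimum cluster size (default 3), and the all-pairs comparison could fit together, here is a rough Python sketch. It is not nilsimsa's actual clustering algorithm: it simply links every pair of codes whose similarity is at least the threshold and keeps the linked groups that are large enough. The scoring rule (agreeing bits minus 128), the use of "at least" for the threshold, and the function names are all assumptions.

```python
from itertools import combinations


def similarity(code_a: str, code_b: str) -> int:
    """Assumed scoring: agreeing bits minus 128 for two 64-hex-digit codes."""
    differing = bin(int(code_a, 16) ^ int(code_b, 16)).count("1")
    return 128 - differing


def find_clusters(codes, threshold=24, min_size=3):
    """Group codes linked by similarity >= threshold; keep groups of min_size or more."""
    parent = list(range(len(codes)))  # union-find forest over code indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Naive all-pairs comparison: n * (n - 1) / 2 similarity computations.
    for i, j in combinations(range(len(codes)), 2):
        if similarity(codes[i], codes[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i, code in enumerate(codes):
        groups.setdefault(find(i), []).append(code)
    return [group for group in groups.values() if len(group) >= min_size]
```

The quadratic loop is what makes comparing all pairs expensive on a large set of codes, which is why a mode that makes fewer comparisons is useful as the default.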
You can use nilsimsa -C /var/spool/news/nilsimsa.codes -sr /var/spool/news/message.id (the directory will vary among news servers) to identify excessive multiposts and, if your news server has a way of running a command on incoming posts, run nilsimsa -H /var/spool/news/nilsimsa.codes on them to filter out EMP. Allow one byte of RAM in the nilsimsa process for every eight bytes of news spool, unless you have binaries or other large posts, in which case it will take less. This may take a few hours.
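The memory rule of thumb above is simple arithmetic; a small Python helper makes it concrete. The function name is hypothetical, and the result is only the upper estimate given in the text (spools with binaries or other large posts should need less).

```python
def estimated_ram_bytes(spool_bytes: int) -> int:
    """One byte of RAM per eight bytes of news spool, rounded up."""
    if spool_bytes < 0:
        raise ValueError("spool size cannot be negative")
    return -(-spool_bytes // 8)  # ceiling division without floating point
```

For example, an 8 GiB spool would call for roughly 1 GiB of RAM in the nilsimsa process.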