################################################################################

       Welcome to the Principal Component Analysis (PCA) detection module
________________________________________________________________________________
--------------------------------------------------------------------------------
Directory structure:
   -[pca]
      |-[alglib]
      |-[init-data]
      |-[pca-test-receiver]
      |-[src]
          |-[pca-basic-data-reader]
          |-[sketch_version]
          |-[timebin_division]
      |-<pca basic source files>

Description:
   - [alglib] folder contains source files of the free version of the ALGLIB
   library (GPL 2+ terms, http://www.alglib.net/).

   - [init-data] contains a set of initialization data for the PCA detection
   module. It speeds up the start of the PCA module by filling the whole data
   matrix before the first detection. Creation and usage of this data are
   described later in the PCA module description.

   - [pca-test-receiver] is a temporary module for receiving and logging
   anomaly detection messages. Received messages are logged into a simple
   ".txt" file in the "pca-test-receiver" folder.

   - [src]
      - [pca-basic-data-reader] is an auxiliary module for testing and debugging
      the PCA module. It reads source data files and sends them over a TRAP
      interface in the PCA basic input format. This simulates real traffic in
      a fraction of real time.

      - [sketch_version] folder contains source files for a PCA detection module
      based on a set of sketch subspaces and multiple detection. However,
      development of this module has been suspended. This module is NOT usable
      for anomaly detection yet.

      - [timebin_division] contains source and testing files for timebin
      division, based on timestamps of incoming messages. There are two versions
      of timebin division and a testing script for comparison of both. See
      README_CZ (Czech language only) in "src/timebin_division/".

   - <pca basic source files> are source files of the PCA basic module. A
   description of this module follows.
________________________________________________________________________________

           Module for anomaly detection using PCA - Basic version.
................................................................................
This module detects network anomalies in traffic volume and entropy
timeseries. Incoming data are stored in one data matrix, whose rows
correspond to values from one timebin and whose columns correspond to the
timeseries of one feature (volume unit or entropy). Columns are arranged as
follows: the first n columns (n is the count of used links) hold the timeseries
of flow count for links 1-n. The next n columns (i.e. columns (n+1)-2n) hold
the timeseries of packet count for links 1-n, etc. The order of features is
flow count, packet count, byte count, source IP entropy, destination IP
entropy, source port entropy and destination port entropy.
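
The column layout described above can be sketched as a small index helper. This
is only an illustration: the enum names and the function column_index are
hypothetical and are not taken from the module sources, and both the link index
and the returned column are zero-based here.

```c
#include <assert.h>
#include <stddef.h>

/* Feature order as listed in this README; the enum names are hypothetical. */
enum feature {
    FEAT_FLOWS = 0,
    FEAT_PACKETS,
    FEAT_BYTES,
    FEAT_ENT_SRCIP,
    FEAT_ENT_DSTIP,
    FEAT_ENT_SRCPORT,
    FEAT_ENT_DSTPORT,
    FEAT_COUNT
};

/* Zero-based data matrix column for a feature and a zero-based link index.
 * Each feature occupies one block of link_count consecutive columns. */
size_t column_index(enum feature f, size_t link, size_t link_count)
{
    assert(f < FEAT_COUNT);
    assert(link < link_count);
    return (size_t)f * link_count + link;
}
```

With 9 links, column_index(FEAT_PACKETS, 0, 9) gives 9, i.e. the first column of
the second block of n columns.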

The whole data matrix must be filled before the first detection. To speed
this step up, the matrix can be initialized from static data. However, there
has to be one initial data file for every timebin. Once the data matrix is
complete, anomalies are detected in the last timebin only. If one or more
anomalies are detected, one message (UniRec) is sent for each anomaly.

The module also provides an option for creating these files. In this case the
module only receives and saves data. Data are saved immediately after each
timebin is received. The module exits after saving all data (for every
timebin, i.e. for every data matrix row).

Before the PCA method is applied to the data matrix, the data can optionally be
preprocessed, which should reduce the intensity of huge anomalies. These
anomalies are reported as well.
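
A minimal sketch of what such a preprocessing step could look like for one
column of the data matrix. This README only states that a multiple of the
standard deviation is used as the threshold (see PREPROCESS_DATA_DEV_MULTIPLIER
below). The function name dampen_outliers and the choice to replace an outlier
with the column mean are assumptions made for illustration, not the module's
actual implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Replaces values lying more than `multiplier` standard deviations from the
 * column mean with the mean itself and returns how many values were changed.
 * Replacing by the mean is an assumption made for this sketch. */
size_t dampen_outliers(double *column, size_t rows, double multiplier)
{
    assert(column != NULL && rows > 0 && multiplier > 0.0);

    double mean = 0.0;
    for (size_t i = 0; i < rows; i++)
        mean += column[i];
    mean /= (double)rows;

    double variance = 0.0;
    for (size_t i = 0; i < rows; i++)
        variance += (column[i] - mean) * (column[i] - mean);
    variance /= (double)rows;

    /* Compare squared distances so that no square root is needed. */
    double limit = multiplier * multiplier * variance;
    size_t changed = 0;
    for (size_t i = 0; i < rows; i++) {
        double diff = column[i] - mean;
        if (diff * diff > limit) {
            column[i] = mean;
            changed++;
        }
    }
    return changed;
}
```

The returned count corresponds to the anomalies that would be reported from the
preprocessing phase.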
................................................................................
Interfaces:
  Inputs (1):
     1. UniRec TIMESLOT,LINK_BIT_FIELD,<VOLUME>,<BASIC_ENTROPY>
        - data values for one timebin and link (of given aggregation).
        - <VOLUME> contains volume metrics (fields FLOWS, PACKETS, BYTES).
        - <BASIC_ENTROPY> contains entropy values of source/destination IP
          addresses and source/destination ports (fields ENTROPY_SRCIP,
          ENTROPY_DSTIP, ENTROPY_SRCPORT, ENTROPY_DSTPORT).
  Outputs (1):
     1. UniRec <PCA_DETECTION>
        - information about the time, link and feature on which an anomaly
          occurred.
        - <PCA_DETECTION> is the format of the anomaly detection message (fields
          TIMESLOT, LINK_BIT_FIELD, PCA_ANOMALY_FIELD, where LINK_BIT_FIELD
          holds the identification of the link and PCA_ANOMALY_FIELD holds the
          identification of the feature (dimension) on which an anomaly
          occurred).
................................................................................
Parameters:
  -s N      Initial starting timebin number. For more accurate detection
            results, start at the timebin in the initial data that
            corresponds to the timebin in which the module is started. N is
            the number of a five-minute timebin in a day (i.e. 1 - 288).
  -t N      Timeslot increment (timebin size in seconds).
  -w N      Count of timebins for detection (i.e. data matrix row count).
            N has to be greater than [num_of_links]*[num_of_features].
  -f <PATH> Path to/for the initial data files folder. PATH = "" means
            no initialization from file.
  -P        Forces preprocessing of data (if it is turned off in the settings
            file).
  -N        Turns off initialization (same as -f "").
  -S        Save initial data matrix files.

  Note: Every parameter (except the common TRAP "-i") is optional. If a
        parameter is not set, the default value from the PCA basic header file
        is used.
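
As an illustration of the "-s" parameter, the starting timebin can be derived
from the time of day and the timeslot increment. The helper starting_timebin is
hypothetical. It counts complete timebins elapsed since midnight, which matches
the value 120 used for 10:00 am in the example of use. Note that the parameter
description gives the range as 1 - 288, so the exact numbering convention
should be checked against the sources.

```c
#include <assert.h>

/* Number of complete timebins elapsed since midnight for the given time of
 * day. timeslot_increment is the timebin size in seconds. */
unsigned starting_timebin(unsigned hour, unsigned minute,
                          unsigned timeslot_increment)
{
    assert(hour < 24 && minute < 60);
    assert(timeslot_increment > 0);

    unsigned seconds = hour * 3600u + minute * 60u;
    return seconds / timeslot_increment;
}
```

With the default increment of 300 seconds there are 12 timebins per hour, so
starting_timebin(10, 0, 300) yields 120.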
--------------------------------------------------------------------------------
              Description of PCA basic header file - PCA_basic.h
................................................................................
Header file PCA_basic.h contains global settings for the PCA basic module. Some
of them can be changed by command-line parameters. It also defines the
structure for module settings. A description of this structure and of the
defined macros follows.

Type pca_basic_settings_t contains struct PCA_basic_settings, which contains
these variables:
   in_unirec_specifier - string specifying the input UniRec template.
   out_unirec_specifier - string specifying the output UniRec template.
   links - structure (defined in UniRec) for keeping links settings (link count,
      link mask, mapping of links to link mask - link indexes).
   f_count - count of used features (dimensions).
   working_window - count of timebins (row count of data matrix), for which data
      are stored in data matrix.
   data_matrix_width - count of data matrix columns, derived from the link
      count and the feature count.
   init_starting_timebin - initial starting timebin number (see part
      "Parameters", parameter "-s" for more description).
   path_to_init_files - path to folder with initialization files
   timeslot_increment - timebin size in seconds
   unsigned int flags - module flags (initialization, storing initialization
      files, preprocessing)
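
The relation between the sizing fields above can be sketched as follows. The
struct sizing is a simplified, hypothetical stand-in for the relevant part of
the settings structure (the real one keeps the link settings in a UniRec
structure), and sizing_init is not a function of the module. The sketch derives
the matrix width as link count times feature count and checks the constraint
stated for the "-w" parameter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified, hypothetical stand-in for the sizing-related settings. */
struct sizing {
    unsigned link_count;
    unsigned f_count;
    unsigned working_window;     /* data matrix rows (timebins) */
    unsigned data_matrix_width;  /* data matrix columns */
};

/* Fills the structure and reports whether the working window is large enough,
 * i.e. greater than [num_of_links]*[num_of_features]. */
bool sizing_init(struct sizing *s, unsigned link_count, unsigned f_count,
                 unsigned working_window)
{
    assert(s != NULL);

    s->link_count = link_count;
    s->f_count = f_count;
    s->working_window = working_window;
    s->data_matrix_width = link_count * f_count;
    return working_window > s->data_matrix_width;
}
```

For 9 links and 7 features the width is 63 columns, so the default window of
288*3 rows satisfies the constraint.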

Enumeration type pca_flags holds the bit positions of the individual flags in
the "flags" field of the settings structure.

IN_COMMON_WAIT - common timeout for "trap_get_data" function
   +default: TRAP_WAIT

OUT_COMMON_WAIT - common timeout for "trap_send_data" function
   +default: TRAP_HALFWAIT

MODULE_MAX_LINK_COUNT/MODULE_MAX_FEATURE_COUNT - maximum number of links/features
   (dimensions) that can be used.
   +default: 64 (given by size of uint64_t data type)
 -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -
Following "S_DEFAULT_..." macros are default values for module settings
structure.
S_DEFAULT_IN_UNIREC_SPECIFIER - input UniRec template specification.
   +default: "TIMESLOT,LINK_BIT_FIELD,<VOLUME>,<BASIC_ENTROPY>"

S_DEFAULT_OUT_UNIREC_SPECIFIER - output UniRec template specification.
   +default: "<PCA_DETECTION>"

S_DEFAULT_FEATURE_COUNT - count of used features.
   +default: 7

S_DEFAULT_WORKING_TIMEBIN_WINDOW_SIZE - count of timebins to store (data matrix
   rows).
   +default: 288*3 (3 days)

S_DEFAULT_INIT_DETECTION_FLAG - turns initialization on/off (0 - no
   initialization, !=0 - initialize data matrix).
   +default: 0 (since we cannot be sure that initialization data were created)

S_DEFAULT_INIT_STARTING_TIMEBIN - initial starting timebin number (see
   part "Parameters", parameter "-s" for more description).
   +default: 0

S_DEFAULT_PATH_TO_INIT_FILES - path to initialization files folder.
   +default: "" ("" value indicates initialization from traffic)

S_DEFAULT_TIMESLOT_INCREMENT - size of one timebin in seconds.
   +default: 300

S_DEFAULT_PREPROCESSING_FLAG - turns preprocessing on/off (0 - no preprocessing,
   !=0 - preprocess data).
   +default: 1

S_DEFAULT_INIT_STORE_FLAG - turns storing of initialization data on/off (0 - do
   not store, !=0 - store initialization files).
   +default: 0

S_DEFAULT_LINK_MASK - link mask string (holds all link settings - link count,
   link mask, link indexes).
   +default: "1"
 -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -
MINIMAL_NORMAL_TIMESLOT_INCREMENT - sets the minimal size of a timebin (in
   seconds); a smaller size could cause problems because of the computation
   time of the detection core.
   +default: 30

INIT_DATA_MATRIX_FILE_PREFIX, INIT_DATA_MATRIX_FILE_SUFFIX - prefix and suffix
   of initialization data file names.
   +default: "" (same for both)

INIT_DATA_PRECISION - real number precision of initialization data
   +default: 8

DATA_WAS_CHANGED - "magic constant" indicating that data were changed during
   the preprocessing phase.
   +default: 100

PREPROCESS_DATA_DEV_MULTIPLIER - multiplier of standard deviation for threshold
   in preprocessing phase.
   +default: 5.0

NSS_FIXED <N> - defines fixed size of normal subspace to N first principal
   components.
   +NOT DEFINED by default
NSS_BY_PERCENTAGE <P> - defines normal subspace size as the count of first
   principal components that hold P*100 percent of total variance.
   +default: 0.90
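
A minimal sketch of how a normal subspace size could be chosen by percentage of
variance. It assumes that the variances of the principal components are
available sorted in descending order (as a PCA routine typically returns them)
and picks the smallest count whose cumulative share reaches P. The function
name nss_size_by_percentage is hypothetical, and whether the module compares
with ">=" or ">" is not stated in this README.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the smallest number of leading principal components whose summed
 * variance reaches the fraction p of the total variance.
 * `variances` must be sorted in descending order. */
size_t nss_size_by_percentage(const double *variances, size_t count, double p)
{
    assert(variances != NULL && count > 0);
    assert(p > 0.0 && p <= 1.0);

    double total = 0.0;
    for (size_t i = 0; i < count; i++)
        total += variances[i];

    double cumulative = 0.0;
    for (size_t k = 0; k < count; k++) {
        cumulative += variances[k];
        if (cumulative >= p * total)
            return k + 1;
    }
    return count;  /* fallback in case of rounding: use all components */
}
```

For variances {7.0, 2.5, 0.3, 0.2} and P = 0.90, the first two components hold
95 percent of the total variance, so the normal subspace size is 2.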

DEFAULT_DETECTION_THRESHOLD - multiplier of standard deviation for threshold in
   detection test.
   +default: 5.0

The following settings are temporary and serve for generating source data for
the change point detector:
//#define CPD_SRC_DATA
#ifdef CPD_SRC_DATA
   ...
#else
   ...
#endif
--------------------------------------------------------------------------------
                                 Example of use
................................................................................
Expecting input data records (input UniRec format) on localhost, on tcp port
445566:

./pca -i "tt;localhost,445566;556677,1" -f ./init-data/ -m 1ff -P -s 120

This will start the pca detection module with:
 (-f) initialization from static data turned ON; initialization data are in the
      "./init-data/" folder
 (-m) link mask 0000 ... 0001 1111 1111, which means 9 links whose indexes are
      the same as their (positions - 1)
 (-P) preprocessing turned ON
 (-s) the module starting at 10:00 am - timebins are 300 seconds long (default),
      so there are 12 timebins in one hour and 120 timebins in 10 hours.
      Initialization data start at 0:00 am.
 Output data will be sent on tcp port 556677
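
The link mask from this example can be decoded as sketched below. This is an
illustration only: parse_link_mask is a hypothetical helper, not the UniRec
link handling used by the module. It assumes the mask is given as a hexadecimal
string and records the zero-based bit position of every set bit as the link
index, which agrees with the "1ff" case above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Parses a hexadecimal link mask string, stores the zero-based bit position
 * of every set bit into `indexes` and returns the number of links. */
unsigned parse_link_mask(const char *hex, unsigned indexes[64])
{
    assert(hex != NULL && indexes != NULL);

    uint64_t mask = strtoull(hex, NULL, 16);
    unsigned count = 0;
    for (unsigned bit = 0; bit < 64; bit++) {
        if (mask & (UINT64_C(1) << bit))
            indexes[count++] = bit;
    }
    return count;
}
```

For the mask "1ff" this yields 9 links with indexes 0 to 8.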
................................................................................
Expecting input data records (input UniRec format) on localhost, on tcp port
445566 again:

./pca -i "tt;localhost,445566;556677,1" -S -f ./init-data/ -m 1ff

This will start the pca detection module in "store mode" (-S), telling it to
store initialization data into the folder "./init-data/". The mask is the same
as in the previous example. There will be no detection; the module exits after
the whole data matrix (of the given row count) is read. The setting of the
output interface does not matter here, since no data are sent by the pca module
in this mode (but some output interface has to be specified).
________________________________________________________________________________
  End of PCA README file. I hope it was helpful.

  You can contact me at Pavel.Krobot@cesnet.cz if needed.
################################################################################