ALPINE

ALPINE (A Locality-Sensitive Packet Inspection Engine) generates locality-sensitive hash fingerprints from feature sets that TaPCAP extracts from packet headers.

Feature Extraction

Header Features
Feature	Description
L4 Protocol	The transport layer protocol (e.g. TCP, UDP), detected using PyShark.
Source IP Address	IP address from which the packet was sent.
Source Port	The port from which the packet was sent.
Destination IP Address	IP address which the packet is destined for.
Destination Port	The port which the packet is destined for.
Total Packet Length	The length of the packet, in bytes.
IP Flags	Bit 0: reserved, Bit 1: Do Not Fragment flag, Bit 2: More Fragments flag.
Differentiated Services (DS) Field	Marks traffic belonging to certain protocols so that it gets priority through the network.
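
As an illustration of how the header fields in the table could become a feature set, here is a minimal sketch. The "name=value" token format, the field names, the function names and the sample values are all assumptions made for this example; they are not taken from the ALPINE source. The flag decoding follows the bit layout given in the table.

```python
def header_feature_set(header):
    """Turn one packet's header fields into a set of "name=value" tokens."""
    # Skip fields that are absent (for example, ports on an ICMP packet).
    return {f"{name}={value}" for name, value in header.items() if value is not None}


def decode_ip_flags(flags):
    """Split the 3-bit IPv4 flags value into named bits (bit 0 is the most significant)."""
    return {
        "reserved": bool(flags & 0b100),        # bit 0
        "dont_fragment": bool(flags & 0b010),   # bit 1
        "more_fragments": bool(flags & 0b001),  # bit 2
    }


# Hypothetical header values for a single POP3 packet.
packet = {
    "l4_protocol": "TCP",
    "src_ip": "10.0.0.5",
    "src_port": 51514,
    "dst_ip": "10.0.0.9",
    "dst_port": 110,
    "length": 74,
    "ip_flags": 0b010,
    "ds_field": 0,
}

features = header_feature_set(packet)
flags = decode_ip_flags(packet["ip_flags"])
print(sorted(features))
print(flags)
```

Treating each field as a string token keeps the feature set compatible with set-based similarity measures such as Jaccard similarity.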

MinHash LSH Forest

For each feature set, ALPINE uses the MinHash algorithm to generate hash values which can be compared among packets (see datasketch for further details).

Suppose you have a very large collection of sets. Given a query, which is also a set, you want to find the sets in your collection whose Jaccard similarity with the query is above a certain threshold, and you want to do this for many queries. To do this efficiently, you can create a MinHash for every set; when a query arrives, you estimate the Jaccard similarity between the query MinHash and every MinHash in your collection, and return the sets that satisfy your threshold.
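
The linear-scan approach described above can be sketched in plain Python. This is not datasketch's implementation: it is a simplified illustration that uses salted BLAKE2b hashes in place of random permutations, and the signature length, function names and sample feature tokens are assumptions made for this example.

```python
import hashlib

NUM_PERM = 128  # signature length: one salted hash function per slot


def _hash(salt, token):
    """64-bit hash of a token under the hash function identified by `salt`."""
    digest = hashlib.blake2b(
        token.encode("utf-8"), digest_size=8, salt=salt.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(features):
    """For each hash function, keep the smallest hash over the (non-empty) set."""
    return tuple(min(_hash(i, t) for t in features) for i in range(NUM_PERM))


def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching slots estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def linear_scan(query_sig, collection, threshold):
    """O(n) lookup: compare the query against every stored signature."""
    return [key for key, sig in collection.items()
            if estimate_jaccard(query_sig, sig) >= threshold]


# Hypothetical feature sets; pop3_a and pop3_b differ only in packet length.
pop3_a = {"l4_protocol=TCP", "src_ip=10.0.0.5", "src_port=51514", "dst_ip=10.0.0.9",
          "dst_port=110", "length=74", "ip_flags=2", "ds_field=0"}
pop3_b = (pop3_a - {"length=74"}) | {"length=80"}
dns = {"l4_protocol=UDP", "src_ip=10.0.0.7", "src_port=40000", "dst_ip=10.0.0.1",
       "dst_port=53", "length=61", "ip_flags=0", "ds_field=8"}

collection = {"pop3_b": minhash_signature(pop3_b), "dns": minhash_signature(dns)}
query_sig = minhash_signature(pop3_a)
matches = linear_scan(query_sig, collection, threshold=0.5)
print(matches)
```

The true Jaccard similarity between pop3_a and pop3_b is 7/9, so the estimate should land near 0.78, while the disjoint dns set shares no slots with the query.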

This approach is still an O(n) algorithm, meaning the query cost increases linearly with the number of sets. A popular alternative is to use a Locality-Sensitive Hashing (LSH) index. LSH can be used with MinHash to achieve sub-linear query cost, a substantial improvement for large collections. The details of the algorithm can be found in Chapter 3 of Mining of Massive Datasets.
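
To show where the sub-linear query cost comes from, here is a minimal sketch of the banding technique for MinHash signatures in plain Python. It is not datasketch's MinHashLSH: the class name, the band and row counts, the salted-hash signature function and the sample feature tokens are all assumptions made for this example.

```python
import hashlib
from collections import defaultdict

NUM_PERM = 128
BANDS = 32                 # hypothetical choice; more bands means more candidates
ROWS = NUM_PERM // BANDS   # signature slots per band


def minhash_signature(features):
    """Simplified MinHash: smallest salted 64-bit hash per slot (non-empty sets only)."""
    def h(salt, token):
        d = hashlib.blake2b(token.encode("utf-8"), digest_size=8,
                            salt=salt.to_bytes(8, "little")).digest()
        return int.from_bytes(d, "little")
    return tuple(min(h(i, t) for t in features) for i in range(NUM_PERM))


class BandedLSH:
    """Signatures that agree on every row of at least one band share a bucket."""

    def __init__(self):
        self.tables = [defaultdict(set) for _ in range(BANDS)]

    def _bands(self, sig):
        for b in range(BANDS):
            yield b, sig[b * ROWS:(b + 1) * ROWS]

    def insert(self, key, sig):
        for b, band in self._bands(sig):
            self.tables[b][band].add(key)

    def query(self, sig):
        # Only BANDS bucket lookups are needed, however many sets are indexed.
        candidates = set()
        for b, band in self._bands(sig):
            candidates |= self.tables[b].get(band, set())
        return candidates


# Hypothetical feature sets; pop3_a and pop3_b differ only in packet length.
pop3_a = {"l4_protocol=TCP", "src_ip=10.0.0.5", "src_port=51514", "dst_ip=10.0.0.9",
          "dst_port=110", "length=74", "ip_flags=2", "ds_field=0"}
pop3_b = (pop3_a - {"length=74"}) | {"length=80"}
dns = {"l4_protocol=UDP", "src_ip=10.0.0.7", "src_port=40000", "dst_ip=10.0.0.1",
       "dst_port=53", "length=61", "ip_flags=0", "ds_field=8"}

lsh = BandedLSH()
lsh.insert("pop3_b", minhash_signature(pop3_b))
lsh.insert("dns", minhash_signature(dns))
candidates = lsh.query(minhash_signature(pop3_a))
print(candidates)
```

The query touches a fixed number of buckets instead of every stored signature. The candidates it returns are approximate, so an index like this trades a small chance of missed or spurious matches for speed.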

To support top-k queries, ALPINE uses MinHashLSHForest. Bawa et al. proposed LSH Forest as a general LSH data structure that makes top-k queries possible for many different types of LSH indexes, including MinHash LSH. The MinHash LSH Forest takes a MinHash data sketch of the query set and returns the top-k matching sets that have approximately the highest Jaccard similarities with the query set.

Storage and Optimization

For optimization and reuse, the MinHash LSH Forest object is serializable and can be saved with Python's pickle. By default, ALPINE stores its MinHash LSH Forest in the cache/alpine directory under the Forager source directory. alpine.bin contains the serialized binary object: it is written when Forager runs in training mode and reloaded when Forager runs in testing mode. labels.txt contains the labels corresponding to the indexes assigned by the MinHash LSH Forest for query lookup. For accurate results, classification in testing mode must use the same labels file that was generated when the MinHash LSH Forest was created in training mode.
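
The save and reload cycle described above could look roughly like the following sketch. The file names alpine.bin and labels.txt come from this document; the function names and the one-label-per-line file format are assumptions. A plain dict stands in for the forest so that the sketch runs without third-party packages, and any picklable object is handled the same way.

```python
import pickle
import tempfile
from pathlib import Path


def save_model(index, labels, cache_dir):
    """Write the pickled index and its labels (one per line) to cache_dir."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    with open(cache_dir / "alpine.bin", "wb") as fh:
        pickle.dump(index, fh)
    (cache_dir / "labels.txt").write_text("\n".join(labels) + "\n", encoding="utf-8")


def load_model(cache_dir):
    """Reload the index and labels; fail loudly if no trained model exists."""
    cache_dir = Path(cache_dir)
    model_path = cache_dir / "alpine.bin"
    if not model_path.exists():
        raise FileNotFoundError(f"no trained model found at {model_path}")
    with open(model_path, "rb") as fh:
        index = pickle.load(fh)
    labels = (cache_dir / "labels.txt").read_text(encoding="utf-8").splitlines()
    return index, labels


# Stand-in for a trained forest: key i in the index maps to labels[i].
index = {0: "sketch-for-set-0", 1: "sketch-for-set-1"}
labels = ["POP3", "DNS"]

with tempfile.TemporaryDirectory() as tmp:
    cache_dir = Path(tmp) / "cache" / "alpine"
    save_model(index, labels, cache_dir)
    restored_index, restored_labels = load_model(cache_dir)

print(restored_index == index, restored_labels)
```

Saving the index and the labels together keeps them in step. Because unpickling can execute arbitrary code, only load an alpine.bin file from a source you trust.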

Installation

ALPINE is included as a supported module inside Forager.

Usage

ALPINE must be pre-trained with data and labels before being used for classification. PCAP/PCAPNG data may be provided to TaPCAP and given a label during the training steps. You may add as many labels and input files as you would like to the training model.

Training Mode:

Forager: A Network Training Classification Toolkit.
        Please choose a task:

   tabularize packet data (TaPCAP)
   generate regular expression signatures (RExACtor)
=> configure and train models (ALPINE, PALM, MAPLE, DATE)
   classify packets (ALPINE, PALM, MAPLE, DATE)
   clear current cache

In the main Forager menu, select “configure and train models” from the options. Note that if you choose to proceed, the stored model and labels in the cache directory will be overwritten. To save these models, copy them elsewhere before proceeding to train new ones. Next, select ALPINE as a training model. Note here that you may choose to train multiple models sequentially to save manual entry and effort.

Forager: A Network Training Classification Toolkit.
Please choose one or more models to train (press SPACE to mark, ENTER to continue):

=> (x) ALPINE
   ( ) PALM
   ( ) MAPLE
   ( ) DATE

Following selection, you will be asked to provide input files and a label for each file. Labels may be re-used for multiple files. Note that labels must be exact in order to match (i.e. case-sensitive, spelled identically). You will be prompted for more files until you reply ‘n’.

Forager: A Network Training Classification Toolkit
Entering training mode...
WARNING: editing a model's configuration will override its current cache and settings. Continue (y/n)? y
CSV file input path? /Users/mkapoor1/Desktop/pop.csv
Label? POP3
Add another file (y/n)?

Once input files are provided, training will commence and the MinHash LSH Forest will be serialized and saved to cache/alpine/alpine.bin. The labels will be saved to cache/alpine/labels.txt.

Testing Mode:

Forager: A Network Training Classification Toolkit.
Please choose a task:

   tabularize packet data (TaPCAP)
   generate regular expression signatures (RExACtor)
   configure and train models (ALPINE, PALM, MAPLE, DATE)
=> classify packets (ALPINE, PALM, MAPLE, DATE)
   clear current cache

In the main Forager menu, select “classify packets” from the options. You will then be prompted to select the model(s) for testing. The saved weights and data structures from the previous training session will be loaded from the cache. If no model is found, an error message will be shown.

Forager: A Network Training Classification Toolkit.
Please choose one or more models to test (press SPACE to mark, ENTER to continue):

=> (x) ALPINE
   ( ) PALM
   ( ) MAPLE
   ( ) DATE

Next, you will need to provide an input file path for the test data.

Forager: A Network Training Classification Toolkit
Entering testing mode...
CSV file input path? /Users/mkapoor1/Desktop/pop.csv
testing ALPINE

After classification is complete, results will be written to a file under “cache/results/<input file name>_results.txt”.

Publication:

Kapoor, M., Krishnan, S., Moyer, T. Deep Packet Inspection at Scale: Search Optimization Through Locality-Sensitive Hashing. In Proceedings of the IEEE 21st International Symposium on Network Computing and Applications (NCA), 14-16 December 2022.