CESNET-TLS22

The recent success and proliferation of machine learning and deep learning have provided powerful tools that are also used for encrypted traffic analysis, classification, and threat detection. These methods, neural networks in particular, are often complex and require a large corpus of training data. Moreover, because most network traffic is now encrypted, traditional deep packet inspection (DPI) solutions are becoming obsolete, and there is an urgent need for modern classification methods capable of analyzing encrypted traffic. These methods have to forgo the packet’s opaque payload and focus on flow statistics and sequences of packet metadata, such as packet sizes, directions, and inter-arrival times. The classification task can be further extended with “rejecting” unknown traffic, i.e., traffic not seen during the training phase. This makes the problem more challenging, and neural networks offer superior performance in tackling it.

Two factors combine here: (1) the difficulty of classifying encrypted traffic while also detecting unknown traffic, and (2) neural networks’ inherent need for large datasets. Together, they make the requirement for a rich, large, and up-to-date dataset even stronger.

Therefore, we present a large dataset of TLS traffic that spans two weeks of real traffic and contains 140 million network flows with 191 fine-grained service labels. The dataset is intended as a benchmark for identifying services in encrypted traffic while detecting unknown services.
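To make the idea of “rejecting” unknown services concrete, here is a minimal sketch of one common baseline: thresholding the highest softmax probability of a classifier’s output. It is an illustration only and not necessarily the approach used with this dataset. The function names, the example labels, and the threshold of 0.9 are assumptions chosen for the example.

```python
import math
from typing import Optional, Sequence


def softmax(logits: Sequence[float]) -> list:
    """Convert raw classifier scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def classify_or_reject(
    logits: Sequence[float],
    labels: Sequence[str],
    threshold: float = 0.9,  # hypothetical value; would be tuned on validation data
) -> Optional[str]:
    """Return the most likely known service, or None to reject the flow as unknown."""
    probabilities = softmax(logits)
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    if probabilities[best] < threshold:
        return None  # low confidence: treat the flow as an unknown service
    return labels[best]
```

With three known services, a confident output such as `[10.0, 0.0, 0.0]` yields the first label, while a flat output such as `[1.0, 1.0, 1.0]` is rejected as unknown.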

Overview

  • Built from traffic of the CESNET2 network observed during the first two weeks of October 2021
  • Contains 140 million flow records
  • Has 191 service labels (e.g., Windows Update, Google Search, Instagram, Dropbox)
  • Contains two types of data: packet metadata sequences and flow statistics

Download

  • A sample of 1000 flows in the CSV format is available for download here.
  • The whole dataset is stored in a PyTables database of 37 GB, which can be downloaded here. ViTables is a good tool for browsing the database. We plan to publish Python scripts that make working with the database easier (for example, a PyTorch DataLoader for neural network training); however, the scripts still need some polishing. For now, we recommend using the official PyTables documentation.
  • The list of services and their domains, which were used for the ground-truth labeling of the dataset, is available for download here.
  • The instructions on replicating the dataset collection process are here.
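As a quick first look at the CSV sample, the sketch below counts flows per service label using only the Python standard library. The column layout of the sample is not described here, so the file path and the label column name are parameters; `service_label` in the usage note is a hypothetical name that should be replaced with the actual header found in the file.

```python
import csv
from collections import Counter


def list_columns(csv_path: str) -> list:
    """Return the header names of a CSV file, to find the label column."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        return list(reader.fieldnames or [])


def count_flows_per_label(csv_path: str, label_column: str) -> Counter:
    """Count how many flow records carry each value of `label_column`.

    `label_column` is an assumption supplied by the caller; check the
    header of the downloaded sample with list_columns() first.
    """
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        if not reader.fieldnames or label_column not in reader.fieldnames:
            raise KeyError(f"column {label_column!r} not found in {csv_path}")
        # Counter consumes the rows while the file is still open.
        return Counter(row[label_column] for row in reader)
```

For example, `count_flows_per_label("sample.csv", "service_label").most_common(5)` would list the five most frequent services in the sample, assuming the file is saved as `sample.csv` and the label column carries that name.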