
Here at Sentryo, Network Security Monitoring (NSM) is our daily life. Network traffic is our main source of information to provide our users with an accurate depiction of their infrastructure and to alert them to potential intrusions.
In a complex and dynamically changing threat environment, massive amounts of data have to be processed almost manually by security analysts. Evasion techniques, mutations of known attacks and never-before-seen threats like zero-days highlight the limits of traditional tools and rule-based detection.

Detecting and responding to cyberattacks in time with standard tools is a challenging and risky task.

We think Data Science and Machine Learning techniques can help fill this gap.
Machine Learning is a set of tools that, as Wikipedia aptly puts it, “can learn from and make predictions on data – such algorithms overcome following strictly static program instructions by making data-driven predictions or decisions” [1].
Machine Learning is now used with a lot of success in domains like spam filtering or image recognition, and we are convinced that NSM can benefit from it. However, tech companies tend to use Machine Learning as a buzzword, to rebrand old technology, or as a way to hide what their product really does.

That is not our philosophy: we want to demystify Machine Learning. The general lack of understanding around this subject makes it difficult for security experts to trust these algorithms, especially in a domain as complex as cybersecurity. Since we are supporters of machine learning-backed security monitoring, we want it to succeed, and we believe that starts with understanding what it really does.
This first part, “General Concepts”, will introduce the techniques and the usual steps to follow in a Data Science project. These concepts will be essential to understand our approach when we apply it to network data.
In the second part, “Building the actual classifier”, we will focus on preparing traffic data and using a supervised learning approach to build a YouTube / Spotify packet classifier – i.e. determining, from neutral metadata only, whether a packet comes from a transaction involving a YouTube video or a Spotify song.

 

The Data Science methodology

As with any other type of project, it is important to follow guidelines when carrying out a data science project. The steps to follow may vary according to the type of business you are interested in, the size of the team and so on, but in general, the following can apply:

  1. Start with a question
  2. Collect and prepare data
  3. Analyze, explore and enhance data
  4. Build a model
  5. Analyze the results and draw insights
  6. Iterate over steps 2 to 5, if needed.

The apple vs banana example

From now on, we will discuss the different concepts with one simple example where someone wants to automatically classify fruits as being apples or bananas.

[Figure: an apple and a banana]

We will assume that we live in a world where only two kinds of fruit exist: what is not an apple is necessarily a banana and vice versa.

Step 1 – Start with a question

This is certainly the most crucial step of the project. Even though it may seem obvious, this formulation step will guide a team in evaluating which techniques and technologies may need to be used. The take-home message is to never start collecting data before the need (or the question) has been defined. This is essential in order to avoid frustration or endless iterative processes.

In our case, the question is obvious: “is a given fruit an apple or a banana?” The underlying task is to examine a fruit and classify it as an apple or a banana.

Step 2 – Collect and prepare data

Each business has its own way of collecting data: it may be probes on a website to gain knowledge about customers, connected sensors on an assembly line to supervise a process… Collecting is not enough; preprocessing may be necessary for the data to be understandable by algorithms (for example, converting string attributes to numerical values).

Here, the items in our data will be fruits (“apple” or “banana”), and for the sake of simplicity they will be described by only two features: height and diameter.
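To make this concrete, here is a minimal sketch of such a dataset in Python with pandas. The measurements below are made up for illustration; a real dataset would contain many more fruits.

```python
import pandas as pd

# Toy measurements in centimeters (hypothetical values, for illustration only).
data = pd.DataFrame({
    "height":   [6.9, 7.0, 7.5, 8.1, 17.2, 18.0, 19.5, 20.1],
    "diameter": [7.0, 7.2, 7.8, 8.3,  3.4,  3.6,  3.5,  3.4],
    "fruit":    ["apple"] * 4 + ["banana"] * 4,
})

# Preprocessing: most algorithms expect numbers, so the string labels
# are converted to integers (apple -> 0, banana -> 1).
data["label"] = data["fruit"].map({"apple": 0, "banana": 1})
print(data)
```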

 

Step 3 – Analyze, explore and enhance data

In this phase, the idea is to get familiar with the data to analyze. One way to do this is to visualize the data in different ways to better understand the problem at hand. A number of methods exist (statistical indicators like the mean or standard deviation, histograms, charts, etc.).
One way to explore the data is to study how features are correlated (a small sketch follows the list below). For example, fruits can be of different sizes:

  • for apples, the bigger they are, the larger the diameter (high correlation);
  • for bananas, this statement does not hold, since the diameter is roughly constant whether the fruit is big or small (low correlation).
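Reusing the toy DataFrame from step 2, this correlation check could be sketched as follows:

```python
# Pearson correlation between height and diameter, computed per class.
for fruit, group in data.groupby("fruit"):
    corr = group["height"].corr(group["diameter"])
    print(f"{fruit}: height/diameter correlation = {corr:.2f}")
# On the toy data, apples come out close to 1 (high correlation),
# while bananas stay near 0 (low correlation).
```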

Finally, with this information in mind, one could enhance the dataset through feature engineering.

We could create new features from the existing ones: the ratio between height and diameter may be a “strong” variable, in the sense that apples and bananas have very different values or distributions for this feature. Injecting domain knowledge into newly created features is usually crucial, as such features tend to be very informative.
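On the toy dataset, for instance, this ratio feature could be added with a single line:

```python
# Feature engineering: a derived height/diameter ratio.
# Apples are roughly as tall as they are wide (ratio close to 1),
# while bananas are much taller than they are wide (ratio well above 1).
data["ratio"] = data["height"] / data["diameter"]
print(data.groupby("fruit")["ratio"].mean())
```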

Ultimately, we could gather information from external sources. In our case, it could be any information not based on the shape, like the amount of sugar in the fruit or its origin.

 

Step 4 – Build a model

The preliminary steps gave us a deep understanding of the data and at least a rough feeling for how the different types of fruits can be classified. This is when machine learning algorithms enter the game.

 

What is learning?

To better understand what learning does, let’s consider our dataset representing two subgroups – apples in blue and bananas in red. These subgroups are called classes. Every item of the dataset can be placed as a point, with coordinates (height, diameter), in a 2D plane as follows:

[Figure: the dataset plotted as points in the (height, diameter) plane]

Our task is to separate the populations of points (fruits), represented in different colors, by a boundary. A naïve approach would be to draw a line in this plane. But the question is: what is the best line to classify apples and bananas? Machine learning algorithms will try to find the best set of rules to isolate the two sets of data.
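A quick matplotlib sketch of this plane, using the toy dataset from step 2 and a hand-picked line as the naïve boundary:

```python
import matplotlib.pyplot as plt

# Plot each class as points in the (height, diameter) plane.
for fruit, color in [("apple", "blue"), ("banana", "red")]:
    subset = data[data["fruit"] == fruit]
    plt.scatter(subset["height"], subset["diameter"], c=color, label=fruit)

# A naive, hand-drawn candidate boundary: diameter = 5.5 cm.
plt.axhline(y=5.5, linestyle="--", color="gray", label="naive boundary")
plt.xlabel("height (cm)")
plt.ylabel("diameter (cm)")
plt.legend()
plt.show()
```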

On the importance of splitting the dataset

It is important to always split the data sample into two parts: the training and test sets. The first one will be used to build the model and the second one to evaluate its performance.
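With scikit-learn, this split could be sketched as follows (the 70/30 proportion is a common choice, not a hard rule):

```python
from sklearn.model_selection import train_test_split

X = data[["height", "diameter"]]   # the features
y = data["label"]                  # the class to predict

# Hold out 30% of the items for evaluation; random_state makes the
# split reproducible. A real dataset would be far larger than our toy one.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
```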

The performance of the algorithm should be evaluated on a data set independent from the one used for training, in order to avoid overfitting. Overfitting means that the model is strongly biased towards representing the training data, and will fail to generalize to unseen data. An example is shown in the figure below.

[Figure: training and test error as a function of model complexity]
The two curves represent the ability of the model to properly classify items (denoted as “Error” on the figure) from the training (blue) and test (green) datasets: the lower the error, the better it performs. As the model complexity (or tuning) increases, the error on the training set keeps decreasing, which means that the classification becomes more and more accurate. For the test dataset, however, we observe that the error decreases before rising again. This illustrates that a complex model will tend to overfit the training data and not generalize well when it encounters unseen data.
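One way to reproduce this behavior is to grow a single decision tree deeper and deeper and watch the two errors diverge (a sketch; our eight-fruit toy set is too small and too cleanly separated to show the effect, so assume a larger, noisier dataset):

```python
from sklearn.tree import DecisionTreeClassifier

# Tree depth plays the role of the "model complexity" knob on the figure.
for depth in [1, 2, 4, 8, 16]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    train_error = 1 - tree.score(X_train, y_train)
    test_error = 1 - tree.score(X_test, y_test)
    print(f"max_depth={depth}: train error={train_error:.2f}, "
          f"test error={test_error:.2f}")
```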

Which algorithm to use?

Among the multitude of Machine Learning algorithms, we will explain Random Forests. This algorithm is commonly used for classification and has good out-of-the-box performance. Moreover, its results are easily interpretable, which allows a good understanding of how decisions are made.

But first, how does it work?
A random forest is a collection of decision trees, i.e. “a model which maps observations about an item (represented in the branches) to conclusions about the item’s target value (represented in the leaves)” [2].
Each tree has been individually designed (or trained) to find the best splits along the features, separating the two sets of data points through successive segmentations.

Instead of evaluating data on a single tree, a Random Forest aggregates the results of several trees (the forest) which were built independently. This makes the classifier more robust to noisy data and also makes optimal use of all the features that describe the data.
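A minimal Random Forest sketch with scikit-learn, continuing with the toy training set from earlier (the parameter values are arbitrary, not tuned):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Train a forest of 100 trees on the training set.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# Classify a new, unseen fruit: tall and thin, so most trees
# should vote "banana" (label 1).
sample = pd.DataFrame({"height": [18.5], "diameter": [3.4]})
print(forest.predict(sample))  # -> [1]
```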
Here is one example of a decision tree in our apple/banana example:

[Figure: an example decision tree for the apple/banana problem]

Training Random Forests

As we can understand from the previous figure, the principle of a decision tree is a succession of if/else statements. This way, the algorithm will subdivide the aforementioned 2D plane into smaller parts. Ideally, each subpart will only contain points of the same color. This process is done on a random selection of data points and features.
The algorithm can be tuned through various parameters to optimize its performance, the most important ones being (sketched after the list below):

  • the size of the forest (the number of trees),
  • the maximum depth of each tree.
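In scikit-learn, these two knobs map directly to constructor parameters (the values below are arbitrary starting points, not recommendations):

```python
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=200,  # the size of the forest
    max_depth=4,       # the maximum depth of each tree
    random_state=0,
)
forest.fit(X_train, y_train)
```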

Step 5 – Analyze the results and draw insights

The last step for this example is to give our model a figure of merit to evaluate its performance: how well did it classify fruits? When making a decision, the most critical part is to avoid making errors.
There are different ways to quantify how often a classification function may be wrong. Two common metrics are the “false positive” and the “false negative” counts.
One common usage of a binary classifier (2 classes) is to determine whether items in a dataset belong to a class or not. For example, a binary classifier could classify bananas as being ripe or not ripe. In this case, the “ripe status” (or ripe class) is considered a positive response (as in the medical sense, indicating existence or presence of such trait [3]) while a “non-ripe status” (the non-ripe class) is considered negative.
In our apple/banana example, the classes exist on their own, and assigning the “positive” or “negative” label to either side is completely arbitrary. However, choosing one side as positive is necessary to use certain metrics.
As an illustration, let’s consider our decision tree to distinguish apples from bananas. In this case, apple is arbitrarily given the positive label and banana the negative one.
In the following height-diameter plane, we represent the 3 different splits the decision tree uses to classify apples and bananas. The decision tree defines zones with assigned classes (green for apples and yellow for bananas, intuitive isn’t it?).

[Figure: the decision tree’s splits in the height-diameter plane]

In the green apple zone defined by “diameter > 8”, we can see that there are 3 bananas: these are “false positives”. The algorithm wrongly assigns the “apple” class to these 3 banana items. Similarly, if there were an apple in a banana zone, it would be considered a false negative. Still following?
One common evaluation metric is the accuracy score, which represents how many times a classifier has correctly labelled an item, relative to the total number of items in the dataset.
In our case, this metric is easy to compute. Out of the 20 items (10 apples and 10 bananas), 4 of the predictions our decision tree made were incorrect (4 bananas were classified as apples). The accuracy score is therefore 16/20, or 80%.
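With scikit-learn, these counts and the accuracy score can be computed directly from a model’s predictions; a sketch using the forest and the held-out test set from the previous steps:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_pred = forest.predict(X_test)

# The confusion matrix counts, for each true class, how the items were
# predicted; the off-diagonal entries are the misclassifications
# (false positives and false negatives).
print(confusion_matrix(y_test, y_pred, labels=[0, 1]))
print(f"accuracy = {accuracy_score(y_test, y_pred):.0%}")  # e.g. 16/20 -> 80%
```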

Step 6 – Iterate

There are 3 major areas for improvement when it comes to classifier accuracy:

  • more data: adding fuel to our engine;
  • feature engineering: the selection of more powerful features and the design of new ones with the help of domain expertise;
  • algorithm tuning: as we saw earlier, the random forest has characteristic parameters which can be set by the user (number of trees, depth of the trees…).

But we will not go further here, since each point would deserve an article of its own.

Conclusion

This first part introduced some basic principles to follow when carrying out a data science project. This scientific approach can be adopted to answer different types of data-driven problems. It is the path we will follow to build our network traffic classifier, which will be introduced in Part 2. We will show you a fun example of how encrypted traffic can be classified based only on metadata (timing and size). We hope you enjoyed reading this foretaste; please feel free to contact us [4] if you have any questions or feedback.

References:

[1] https://en.wikipedia.org/wiki/Machine_learning

[2] https://en.wikipedia.org/wiki/Decision_tree_learning

[3] http://medical-dictionary.thefreedictionary.com/positive

[4] contact@sentryo.net