
As mentioned in part 1, “Building a naïve YouTube vs Spotify classifier: introduction to machine learning”, here at Sentryo, network traffic is our main source of information for giving our users an accurate picture of their infrastructure and for alerting them to potential intrusions.
Machine learning allows us to build powerful algorithms to perform classification tasks. In the context of Network Security Monitoring, such techniques can be used, for example, to classify some activity as malicious.


In this second part, we follow the steps of a data science project outlined in part 1 to build a YouTube / Spotify packet classifier, i.e. to determine, from neutral metadata only, whether a packet comes from a transaction involving a YouTube video or a Spotify track.
The reader should keep in mind that the work presented in this article is for explanation purposes only. The goal here is not to present the most advanced techniques in data science or to bring deep knowledge from a business or cybersecurity point of view, but simply to introduce basic principles. Even so, these basic principles let us automatically distinguish YouTube from Spotify more than 86% of the time.

 

Step 1 – Start with a question

Our question here is simple: “Does this network activity correspond to YouTube or Spotify browsing?”

 

Step 2 – Collect and prepare data

Data Collection

The first step of our analysis is to gather web browsing data representing the communication between a laptop and these websites. One standard way to capture network traffic is to use Wireshark [1], a free and open-source packet analyzer. As shown in the picture below, this tool provides insights about the packets transferred between hosts, in this case between a local laptop and either youtube.com or spotify.com.

[Figure: Wireshark screenshot showing insights about packets]

Data Preparation

After conversion into a more versatile format, i.e. CSV, the data can be analyzed with standard data science tools.

Each item in the dataset is an Ethernet frame or, more precisely in our case, an IP packet (let us say a network message). Each item corresponds to one line in the CSV file.

In the case of network data, as seen in the Wireshark screenshot above, items (packets) are described by the following features:

  • frame time of arrival,
  • frame length,
  • IP address of the sender,
  • IP address of the receiver,
  • network protocol,
  • etc.
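As an illustration, here is a minimal sketch of how such a CSV export could be loaded into Python. The column names (`time`, `length`, `src`, `dst`, `protocol`) and the sample rows are assumptions made for this example; the actual names depend on how the capture was exported.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Packet:
    time: float     # frame time of arrival, in seconds
    length: int     # frame length, in bytes
    src: str        # IP address of the sender
    dst: str        # IP address of the receiver
    protocol: str   # network protocol


# Hypothetical export: the header and the rows are invented for illustration.
SAMPLE_CSV = """time,length,src,dst,protocol
0.000000,74,192.168.1.10,203.0.113.5,TCP
0.004200,1514,203.0.113.5,192.168.1.10,TLSv1.2
0.004900,1514,203.0.113.5,192.168.1.10,TLSv1.2
"""


def load_packets(handle):
    """Turn each CSV line into one Packet item."""
    reader = csv.DictReader(handle)
    return [
        Packet(float(row["time"]), int(row["length"]),
               row["src"], row["dst"], row["protocol"])
        for row in reader
    ]


packets = load_packets(io.StringIO(SAMPLE_CSV))
print(len(packets), "packets, first one:", packets[0])
```

With a real capture, the `io.StringIO` object would simply be replaced by an open file.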

To follow this article, you will not need any deep network knowledge. However, you need to know what an IP address is, i.e., a numerical label assigned to each device participating in a computer network [2].

Since an IP address is enough to tell a YouTube server from a Spotify one, we will not use this property to build the classifier.
The network protocol in use when browsing the web is HTTPS (or HTTP), as seen in your browser’s address bar. The particularity of HTTPS, with “S” for secure, is encryption. Data in transit, such as the name of the song or the song itself, is encrypted and cannot be read by an observer.
Since we are in the position of this observer, we will build our classifier only on timing and size-related metrics and not on content-based features. This choice is partly inspired by JS Atkinson’s thesis work, where the researcher “determined that the comparison of various permutations of timing and frame size information is sufficient to distinguish specific user activities” [3].

 

Step 3 – Analyze, explore and enhance data

In this phase, the idea is to get familiar with the data. For each item (packet) in the dataset, we calculate the number of packets received by the laptop in the 10 milliseconds preceding this packet’s time of arrival.

This way, an additional variable describes each item of the dataset. Such a variable is called a feature.
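A minimal sketch of how this feature could be computed is given below. It assumes that the arrival times are sorted, that they only cover packets received by the laptop, and that the packet itself is not counted in its own window; these conventions are our assumptions for the example.

```python
from bisect import bisect_left

WINDOW = 0.010  # 10 milliseconds, expressed in seconds


def packets_in_previous_window(arrival_times, window=WINDOW):
    """For each packet, count the earlier packets that arrived at most
    `window` seconds before it. `arrival_times` must be sorted."""
    counts = []
    for i, t in enumerate(arrival_times):
        # Index of the first earlier packet that is still inside the window.
        first = bisect_left(arrival_times, t - window, 0, i)
        counts.append(i - first)
    return counts


# Toy arrival times in seconds (invented for illustration).
times = [0.000, 0.002, 0.009, 0.011, 0.030]
print(packets_in_previous_window(times))  # [0, 1, 2, 2, 0]
```

The last packet arrives after a long silence, so its window is empty and its feature value is 0.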

Intuitively, we would expect to receive more packets per 10-millisecond window when browsing YouTube, since video throughput is usually higher than audio throughput.
On the histogram shown below:

  • the x axis (what we are counting) represents the number of packets received in intervals of 10 milliseconds
  • the y axis represents the number of occurrences of this observation

[Figure: histogram of the number of packets received per 10-millisecond interval]
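The counts behind such a histogram can be tallied in a few lines. In the sketch below, the per-window packet counts are toy values invented for illustration, not measurements from our capture.

```python
from collections import Counter
from statistics import mean

# Toy per-window packet counts (invented values, not real measurements).
window_counts = {
    "YouTube": [4, 6, 6, 8],
    "Spotify": [2, 3, 3, 4],
}


def histogram(values):
    """Map each observed count (x axis) to its number of occurrences (y axis)."""
    return dict(sorted(Counter(values).items()))


for label, values in window_counts.items():
    print(label, histogram(values), "mean:", mean(values))
```

Comparing the two means is the simplest way to summarize how the distributions differ.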

We can first notice that YouTube usually sends more packets than Spotify (on average, YouTube sends 30 packets while Spotify sends 23 during the same time period).
As a second example, and in order to get a feel for the download/upload rate, let us visualize the number of frames received from the remote hosts. Using the same 10 ms window approach, we build a feature for each item representing the number of packets received from YouTube/Spotify in the current time window.

We notice that packets from YouTube tend to arrive in bursts with some interruptions in between, while Spotify seems to provide a more stable flow.

[Figure: packets received over time, with bursts and interruptions for YouTube and a more stable flow for Spotify]

This is the kind of difference between datasets that a machine learning algorithm will exploit to perform a classification task.

 

Our feature selection

We describe the streaming process using the following 10 features:

On timing (in seconds)

  • time delta from previous packet

On the size of packets (in bytes)

  • size of the current packet
  • quantity of data received in the last 10ms
  • quantity of data sent in the last 10ms
  • average size of received packets in the last 10ms
  • average size of sent packets in the last 10ms
  • standard deviation of the size of received packets in the last 10ms
  • standard deviation of the size of sent packets in the last 10ms

On the quantity of packets

  • number of packets received in the last 10ms
  • number of packets sent in the last 10ms
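The 10 features listed above could be computed along the following lines. This is only a sketch under our own assumptions: packets are given as time-sorted `(time, size, direction)` tuples, the window covers the 10 ms before the current packet without including it, and the standard deviation is the population one. The feature names are hypothetical.

```python
from collections import deque
from statistics import mean, pstdev

WINDOW = 0.010  # seconds


def summarize(sizes):
    """Return (total bytes, mean size, std of size, packet count)."""
    if not sizes:
        return 0, 0.0, 0.0, 0
    return sum(sizes), mean(sizes), pstdev(sizes), len(sizes)


def extract_features(packets, window=WINDOW):
    """packets: time-sorted (time_s, size_bytes, direction) tuples, where
    direction is 'in' (received) or 'out' (sent). Returns one dict per packet."""
    recent = deque()       # packets seen during the last `window` seconds
    previous_time = None
    rows = []
    for time_s, size, direction in packets:
        # Forget packets that fell out of the window.
        while recent and recent[0][0] < time_s - window:
            recent.popleft()
        rx = summarize([s for _, s, d in recent if d == "in"])
        tx = summarize([s for _, s, d in recent if d == "out"])
        rows.append({
            "time_delta": 0.0 if previous_time is None else time_s - previous_time,
            "size": size,
            "bytes_received": rx[0], "bytes_sent": tx[0],
            "mean_size_received": rx[1], "mean_size_sent": tx[1],
            "std_size_received": rx[2], "std_size_sent": tx[2],
            "packets_received": rx[3], "packets_sent": tx[3],
        })
        recent.append((time_s, size, direction))
        previous_time = time_s
    return rows


# Toy traffic (invented values).
toy_packets = [(0.000, 100, "out"), (0.004, 1500, "in"), (0.006, 500, "in"),
               (0.008, 60, "out"), (0.030, 1500, "in")]
features = extract_features(toy_packets)
print(features[3])
```

For the fourth toy packet, two packets (1500 and 500 bytes) were received and one (100 bytes) was sent in the preceding 10 ms, which gives 2000 bytes received with a mean size of 1000 bytes.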

As stated in the introduction of this article, the selected features are far from optimal. We retained them for the sake of simplicity, and also to point out that the algorithm can achieve decent performance even with a naïve selection of features. Smarter feature engineering would require more work but would likely yield much better results.

 

Step 4 – Build a model

Now that we have gained knowledge about what the traffic data looks like for different browsing activities, and have enhanced our dataset, let’s apply a machine learning algorithm to learn these differences and ultimately classify the activity as YouTube or Spotify.

Training Random Forests

After training, we now have a forest populated with 200 decision trees with the maximum depth of a tree set to 15. These parameters were chosen to optimize the performance of the algorithm.
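For readers who want to try this at home, here is a minimal sketch of such a training step. It assumes the scikit-learn library, which is our choice for this example, and it uses synthetic two-feature data generated on the fly rather than our real capture. Only the two parameters (200 trees, maximum depth of 15) come from the text above.

```python
import random

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

random.seed(0)


def fake_packet(label):
    """Synthetic row: (packets received in the window, packet size in bytes).
    The distributions are invented for illustration."""
    if label == "YouTube":
        return [random.gauss(30, 6), random.gauss(1200, 300)]
    return [random.gauss(23, 6), random.gauss(900, 300)]


labels = ["YouTube", "Spotify"] * 300
X = [fake_packet(label) for label in labels]

# Keep some labeled data aside to evaluate the model later.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=0)

forest = RandomForestClassifier(n_estimators=200, max_depth=15, random_state=0)
forest.fit(X_train, y_train)

print("trees in the forest:", len(forest.estimators_))
print("first predictions:", forest.predict(X_test[:3]))
```

The held-out `X_test` and `y_test` are what make it possible to measure how well the forest generalizes.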
A representation of a single (and simplified) tree from the forest is shown in the figure below:

[Figure: simplified representation of a single decision tree from the forest]

Now, if a packet is analyzed by a tree and ends up in a red final leaf, it will be classified as “YouTube”.
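To make this concrete, the sketch below walks a packet down a few hand-written trees and lets the forest decide by majority vote, which is the classic random forest formulation. The trees, feature names and thresholds are invented for illustration; a real forest learns them from the training data.

```python
from collections import Counter

# A tree is either a leaf (a class label) or a tuple:
# (feature name, threshold, subtree if value <= threshold, subtree otherwise).
# All thresholds below are invented, not learned from data.
TREE_A = ("packets_received", 25, "Spotify", "YouTube")
TREE_B = ("bytes_received", 20000, ("size", 1000, "Spotify", "YouTube"), "YouTube")
TREE_C = ("time_delta", 0.001, "YouTube", "Spotify")


def classify_with_tree(tree, packet):
    """Follow the branches until a final leaf (a label) is reached."""
    while not isinstance(tree, str):
        feature, threshold, low, high = tree
        tree = low if packet[feature] <= threshold else high
    return tree


def classify_with_forest(trees, packet):
    """Each tree votes for a class; the most frequent class wins."""
    votes = Counter(classify_with_tree(tree, packet) for tree in trees)
    return votes.most_common(1)[0][0]


packet = {"packets_received": 31, "bytes_received": 15000,
          "size": 600, "time_delta": 0.0004}
print(classify_with_forest([TREE_A, TREE_B, TREE_C], packet))  # YouTube
```

Here two trees out of three end up in a “YouTube” leaf, so the forest classifies the packet as YouTube.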

 

Step 5 – Analyze the results and draw insights

In the case of the YouTube vs. Spotify classifier, the accuracy score is 87%. For a first try, this score is decent, but for a more critical business case (like intrusion detection), we would expect a much more reliable classifier.

As stated in Part 1, the classifier could definitely be improved by taking the following actions:

  • adding more data,
  • adding new information through feature engineering,
  • tuning the parameters of the model.

Let’s keep in mind that these metrics can only be computed if we know the correct class beforehand, just as with the training data. If this information is not available, the classifier can still determine a class for any new item, but we cannot evaluate its accuracy.
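The accuracy score is simply the share of items for which the predicted class matches the known class. The sketch below shows the computation on toy labels invented for illustration.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of items whose predicted class equals the known class."""
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("need two non-empty label lists of the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Toy labels (invented): the known classes and what a classifier predicted.
truth = ["YouTube", "YouTube", "Spotify", "Spotify",
         "YouTube", "Spotify", "Spotify", "YouTube"]
predicted = ["YouTube", "Spotify", "Spotify", "Spotify",
             "YouTube", "Spotify", "YouTube", "YouTube"]

print(accuracy(truth, predicted))  # 0.75
```

Without the `truth` list, the function has nothing to compare the predictions against, which is exactly the situation described above.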

Conclusion

We hope that this article has convinced you that data science and machine learning are not black magic and that these approaches can readily be applied to analyze network traffic. In particular, we showed the strength of metadata-based features built on timing or size, an approach that works even when the connection is encrypted in transit.
This two-part article is our first take on data science and machine learning applied to network data and cybersecurity. More articles will come, so stay tuned. We hope you enjoyed reading this first iteration. Please feel free to contact us [4] if you have any questions or feedback.

References:

[1] https://www.wireshark.org/

[2] https://en.wikipedia.org/wiki/IP_address

[3] Your WiFi is leaking: Inferring user behaviour, encryption irrelevant

[4] contact@sentryo.net