What is AI Malware filtering?
Our traditional malware filter utilizes multiple source lists that are updated on an hourly basis. These lists are meticulously compiled from a variety of sources and threat intelligence feeds, which are then curated by us to remove false positives.
While this has served us and our customers well, we’re always looking for new tools to improve the value of our products. We decided to see if we could do better than the reactive approach of curated lists by detecting malware domains *before* they ever appear in our malware blocklist.
Enter predictive malware detection and filtering.
Using machine learning algorithms, we built a model for detecting and blocking domains that have a high probability of serving malware. Considering the limited data that DNS queries alone provide, this was quite tricky, but here are the results we’ve been able to achieve.
Results and Benefits For Our Customers
TL;DR: Block malware before it appears on a malware blocklist.
The following demonstrates the effectiveness of our malware detection algorithm. We took new, unseen domains from our existing malware blocklist, as well as 1,000 random benign domains from Alexa’s top 10,000 most popular websites, and had the model estimate the probability of each domain serving malware.
|Malware Probability||False Negative Rate||False Positive Rate|
A malware probability of greater than 50% implies the machine learning model predicts that a specific domain has a better than random chance of serving malware. If you were to block domains based on this probability, there would be roughly a 20% false positive and false negative rate. Considering that these are domains our traditional filter would have missed entirely (until the next hourly update), these results are not bad. Setting the AI filter to block domains with a probability of 80% or higher yields an accuracy of over 90%.
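To make the trade-off between thresholds concrete, here is a minimal sketch of how false negative and false positive rates can be computed for a given blocking threshold. The function name and the scores are made up for illustration; they are not output from our model.

```python
def error_rates(malware_scores, benign_scores, threshold):
    """Return (false_negative_rate, false_positive_rate) when blocking at `threshold`."""
    # A malware domain scored below the threshold slips through: false negative.
    missed = sum(1 for p in malware_scores if p < threshold)
    # A benign domain scored at or above the threshold gets blocked: false positive.
    wrongly_blocked = sum(1 for p in benign_scores if p >= threshold)
    return missed / len(malware_scores), wrongly_blocked / len(benign_scores)


# Made-up scores for illustration only; not real model output.
malware_scores = [0.95, 0.85, 0.70, 0.55, 0.30]
benign_scores = [0.05, 0.10, 0.40, 0.60, 0.90]

for threshold in (0.5, 0.8):
    fnr, fpr = error_rates(malware_scores, benign_scores, threshold)
    print(f"threshold={threshold:.0%}  FNR={fnr:.0%}  FPR={fpr:.0%}")
```

Raising the threshold blocks fewer benign domains but lets more malware through, which is why the right setting depends on your risk tolerance.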
To enable this feature, head over to Profiles, click the “Edit” button for the desired Profile, click Profile Options, and toggle it on.
While this is by no means the end of the story for this feature, we are pleased to offer it to you now so that you can make your internet experience even safer and more secure.
For anyone who enjoys geeking out over the technical aspects of this project, the rest of this article will deep dive into the juicy details.
Data Gathering, Analysis, Feature Engineering & Data Preparation
Since our goal was to make accurate predictions about the probability of a domain serving malware, we needed to gather as much data as possible about the domains we know to be serving malware. We started with our conventional malware blocklist, which provides us with a baseline dataset of roughly 250,000 known malware domains.
To get as much signal from our data as possible, we identified a set of features we could extract from the details of a DNS request.
|Feature Name||Feature Type||Feature Description|
|SLD||Lexical||Second Level Domain|
|TLD||Lexical||Top Level Domain|
|Count_Subdomain_Len||Lexical||Count of subdomain levels|
|Longest_Word_Ratio||Lexical||Longest word in domain ratio|
|Numeric_len||Lexical||Numeric count in domain|
|Domain_Entropy||Lexical||Domain entropy (calculates the Shannon entropy of the string)|
|Unigrams||Lexical||Unigrams of the domain in|
|Bigrams||Lexical||Bigrams of the domain in|
|Trigrams||Lexical||Trigrams of the domain in|
|Domain_age||Third_Party||When the domain was registered|
|City Code||Third_Party||City Code|
|AS||Third_Party||Autonomous System name|
|ISP||Third_Party||Internet Service Provider|
|Organization||Third_Party||Name of the Organization|
- Feature selection was heavily inspired by the 2021 paper “Classifying Malicious Domains using DNS Traffic Analysis”.
We expanded the base dataset of just domain names into a database of each domain along with each feature. The “Third Party” data was looked up via GeoIP and WHOIS databases.
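As a rough illustration of the lexical features, the sketch below derives a few of them from a domain string. It is not our production code: the function names are hypothetical, the n-grams are assumed to be character-level, entropy is assumed to be computed over the whole domain string, and the TLD/SLD split is deliberately naive.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Shannon entropy of the characters in `text`, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def char_ngrams(text, n):
    """All overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def lexical_features(domain):
    """Extract a few lexical features from a domain name.

    Assumption: the TLD is the last label and the SLD the one before it.
    This naive split is wrong for multi-label suffixes such as "co.uk".
    """
    name = domain.lower().rstrip(".")
    labels = name.split(".")
    tld = labels[-1]
    sld = labels[-2] if len(labels) >= 2 else ""
    return {
        "tld": tld,
        "sld": sld,
        "subdomain_levels": max(len(labels) - 2, 0),
        "numeric_count": sum(ch.isdigit() for ch in name),
        "entropy": shannon_entropy(name),
        "bigrams": char_ngrams(sld, 2),
    }


features = lexical_features("login.secure-update123.example.com")
print(features)
```

A real pipeline would likely use a public suffix list to split the TLD and SLD correctly; the naive split here only keeps the example short.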
Since machine learning models can be considered highly specialized statistical models, we needed to convert the data into numerical formats that allow for mathematical operations. This is achieved using type conversion and encoding techniques like label encoding and one-hot encoding.
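For readers unfamiliar with these two encodings, here is a small, library-free sketch of what they do to a categorical column such as the TLD. The helper names are illustrative; in practice a library such as scikit-learn or pandas would typically handle this.

```python
def label_encode(values):
    """Map each distinct category to an integer, in sorted order."""
    mapping = {category: index for index, category in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


def one_hot_encode(values):
    """Represent each value as a 0/1 vector with one slot per category."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values], categories


tlds = ["com", "xyz", "com", "top"]
codes, mapping = label_encode(tlds)
vectors, categories = one_hot_encode(tlds)
print(codes)    # [0, 2, 0, 1]
print(vectors)  # [[1, 0, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]]
```

Label encoding is compact but implies an ordering between categories, while one-hot encoding avoids that at the cost of one column per category.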
Model Training & Evaluation
Now that we have both a known malware dataset and a benign dataset (top 100,000 Alexa websites), we are working with pre-labeled data. This means we can use a supervised learning algorithm to train our model. There are lots of algorithms to choose from, and after some review, we chose the XGBoost library, which is an implementation of Extreme Gradient Boosting.
When training a model, there are settings that determine the way the model learns. These settings are called hyperparameters and must be tuned to the dataset. To determine these values, we performed hyperparameter tuning using grid search and random search on some of the crucial parameters of the XGBoost model, then re-trained with the best-scoring hyperparameter values.
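The difference between the two search strategies can be sketched without any ML library. In the example below, `validation_score` is a hypothetical stand-in for "train a model with these hyperparameters and score it on held-out data", and the grid values are made up; only the parameter names `max_depth` and `learning_rate` correspond to real XGBoost settings.

```python
import itertools
import random


def validation_score(params):
    """Hypothetical stand-in for training a model and returning its validation score.

    A real run would fit the classifier on the training split and score it on a
    held-out split. This toy function simply peaks at max_depth=6, learning_rate=0.1.
    """
    return 1.0 - 0.01 * abs(params["max_depth"] - 6) - abs(params["learning_rate"] - 0.1)


def grid_search(grid, score):
    """Try every combination in the grid and return the best one."""
    names = list(grid)
    candidates = (dict(zip(names, combo)) for combo in itertools.product(*grid.values()))
    return max(candidates, key=score)


def random_search(grid, score, trials, seed=0):
    """Sample `trials` random combinations and return the best one found."""
    rng = random.Random(seed)
    candidates = [
        {name: rng.choice(options) for name, options in grid.items()}
        for _ in range(trials)
    ]
    return max(candidates, key=score)


grid = {"max_depth": [3, 6, 9], "learning_rate": [0.01, 0.1, 0.3]}
best_grid = grid_search(grid, validation_score)
best_random = random_search(grid, validation_score, trials=5)
print(best_grid)
print(best_random)
```

Grid search is exhaustive and therefore expensive as the grid grows; random search covers large spaces cheaply but may miss the best combination.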
To evaluate the effectiveness of our model after training, we ran it on a held-out portion of both the malware and benign datasets that the model had not seen during training, and reviewed the resulting confusion matrix.
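A confusion matrix for a binary classifier simply counts the four possible outcomes. The sketch below shows the idea with made-up labels; the function names and data are illustrative, not taken from our evaluation.

```python
def confusion_matrix(actual, predicted):
    """Count outcomes for a binary classifier where True means 'malware'."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for is_malware, flagged in zip(actual, predicted):
        if is_malware and flagged:
            counts["tp"] += 1  # malware correctly flagged
        elif not is_malware and flagged:
            counts["fp"] += 1  # benign domain wrongly flagged
        elif not is_malware and not flagged:
            counts["tn"] += 1  # benign domain correctly allowed
        else:
            counts["fn"] += 1  # malware missed
    return counts


def accuracy(counts):
    """Share of all predictions that were correct."""
    return (counts["tp"] + counts["tn"]) / sum(counts.values())


# Made-up labels for illustration only.
actual = [True, True, True, False, False, False, False, True]
predicted = [True, True, False, False, True, False, False, True]
matrix = confusion_matrix(actual, predicted)
print(matrix, accuracy(matrix))
```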
As we fine-tune the model to improve accuracy and reduce false positives, we need to have model explainability to understand what happens in the model from input to output.
For this, we use feature importance and SHAP Explainer. These methods help us understand the score for all of the input features and increase the interpretability of our models.
- Please note: the images in this post do not represent the state of the production model, as they were taken during initial development.
After tuning the features, we took a dataset of unseen malware domains that have appeared in our conventional malware blocklist. On this 1,049-domain dataset, we achieved 93.9% accuracy.
This accuracy comes at the cost of a fairly high false positive rate of over 20%. While this could be easily mitigated by implementing an allow-list of the most popular domains, the real show-stopping result was the prediction latency of ~100ms. We work very hard to keep DNS latency as low as possible for our users, and a 100ms addition to the existing latency would likely discourage even the most security-minded customers from using this feature.
To improve the prediction latency of the model, we explored various hardware and software optimization techniques. GPU computing could give us faster predictions, but inference has to run in line with each incoming DNS request, and running the model on a remote GPU would increase cost and add network latency between our DNS servers and the remote GPU instances.
Our only remaining option was to find software optimizations that could speed up the CPU-based computing available on our DNS servers.
After multiple failed experiments with Intel’s oneDAL, we settled on Microsoft's Hummingbird library. It is an excellent library that transforms machine-learning models into tensor computations. Using Hummingbird, we converted the existing model without modifying any inference code.
With the updated model, the prediction latency was reduced from 100ms to 10ms. That was a big win, the only downside being a slightly increased false positive rate.
This increased false positive rate was easily mitigated by both an allow list for popular domains and a modification to the model that changes the output to a probability value from 0% to 100%, effectively allowing users to set the aggressiveness of the filter to fit their risk tolerance and user experience.
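The combination of an allow list and a user-chosen threshold can be sketched as a single decision function. The names `should_block`, `allow_list`, and the example domains are hypothetical and only illustrate the logic described above.

```python
def should_block(domain, malware_probability, threshold, allow_list):
    """Decide whether to block a DNS query for `domain`.

    `malware_probability` is the model output in [0.0, 1.0]; `threshold` is the
    user's chosen aggressiveness. Both names are illustrative.
    """
    if domain.lower().rstrip(".") in allow_list:
        return False  # allow-listed domains are never blocked by the model score
    return malware_probability >= threshold


allow_list = {"example.com"}  # hypothetical allow list of popular domains
print(should_block("example.com", 0.95, 0.8, allow_list))
print(should_block("suspicious.example", 0.85, 0.8, allow_list))
print(should_block("suspicious.example", 0.85, 0.9, allow_list))
```

Checking the allow list first means a popular domain is never blocked by a high score, and a stricter threshold turns a borderline block into an allow.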
A New Component to a Comprehensive Security Strategy
Our new filter is a powerful tool for blocking malware. By leveraging the capability of machine learning and automation, our system can block threats in real-time, minimizing the chance of your network being compromised.
As with all of our new features, we would love to get your feedback so that we can improve our offerings, and continue to build more tools to improve your internet experience. You can reach out to us here:
Contact Us: https://controld.com/contact/
Feedback Portal: https://feedback.controld.com/
While this was our first foray into adding machine learning tools to our products, it will most certainly not be our last. We are already experimenting with new approaches to identifying network threats, as well as dynamic website categorization for more robust filters and services, and much more!