Author: Sumit Adsul
Cybersecurity is the practice of defending computers, servers, mobile devices, electronic systems, and associated data from malicious attacks. Yet even with security measures in place, breaches still occur through human error, such as navigating to malicious URLs.
Malicious URLs and websites are serious threats to cybersecurity. Attackers can deliver such URLs via WhatsApp, email, text messages, and other channels; when users click them, the link can deliver a virus, malware, or a program that steals user data and can severely damage the device. Detecting and mitigating the spread of such malicious URLs is therefore a top priority for any organization.
Malicious URL detection is essentially a URL classifier that labels a user-supplied URL as safe or malicious, so users can check whether a URL is legitimate before visiting it and avoid exposure to software trying to steal their data.
The steps that need to be followed for building the malicious URL prediction model are:
Step 1: Data Collection
We collected a malicious URL dataset from Kaggle containing around 800,000 (8 lakh) rows, with malicious and benign (good) URLs in roughly equal proportion.
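As a minimal sketch of the collection step, the CSV could be loaded and lightly cleaned as below. The column names `url` and `type` are assumptions for illustration, not confirmed details of the Kaggle dataset.

```python
import pandas as pd

def load_url_dataset(path):
    # Hypothetical loader: column names 'url' and 'type' are assumed.
    df = pd.read_csv(path)
    # Drop rows with a missing URL and remove duplicate URLs.
    df = df.dropna(subset=["url"]).drop_duplicates(subset=["url"])
    return df
```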
Step 2: Feature Extraction
This is the process of deriving features from URLs: to train a machine learning model, we need to generate features from each URL and then pass those features to the model.
The image below shows the structure of a URL; the task here is to identify the characteristics that separate malicious URLs from benign (safe) ones.
A URL is made up of different components, as shown in the figure above; these components are the basis of the features that need to be extracted from the URL.
Below is the list of features we have extracted from the URL and passed onto the machine learning model, which then predicts if the URL is malicious or benign.
1. Use of an IP address in URL:
Attackers might use an IP address in the URL instead of the website's actual name to disguise the destination and steal the user's personal information, e.g., "http://18.104.22.168/fake.html".
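A simple sketch of this check: treat the URL's host as suspicious if it is a raw IPv4 address (or a hex-encoded host, which some attackers use for the same purpose).

```python
import re
from urllib.parse import urlparse

def uses_ip_address(url):
    # Extract the host and strip any port number.
    host = urlparse(url).netloc.split(":")[0]
    # Match a dotted-quad IPv4 address or a hex-encoded host.
    ipv4 = re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}", host)
    hex_ip = re.fullmatch(r"0x[0-9a-fA-F]+", host)
    return bool(ipv4 or hex_ip)
```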
2. Long URLs to hide suspicious parts:
Phishers can use long URLs to hide the suspicious part in the address bar, so that a user who sees such a URL suspects nothing, clicks it, and falls into the trap.
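This feature can be sketched as a simple length bucket. The thresholds (54 and 75 characters) follow a common phishing-feature heuristic and are assumptions, not values taken from this article.

```python
def url_length_feature(url):
    # Bucket the URL length: 0 = legitimate-looking, 1 = suspicious,
    # 2 = likely phishing. Thresholds are an assumed heuristic.
    n = len(url)
    if n < 54:
        return 0
    if n <= 75:
        return 1
    return 2
```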
3. URL Shortening services like TinyURL:
Phishers may shorten a URL so that users cannot see its actual contents or domain name, and thus fall into the trap.
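One way to sketch this check is to compare the hostname against a list of known shortening services. The list below is a small illustrative sample; a real model would use a much longer one.

```python
from urllib.parse import urlparse

# Illustrative subset of URL-shortening services.
SHORTENERS = {"tinyurl.com", "bit.ly", "goo.gl", "t.co", "is.gd", "ow.ly"}

def is_shortened(url):
    host = urlparse(url).netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host in SHORTENERS
```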
4. URL Redirecting
This feature helps identify URL forwarding. Consider the example below.
http://www.legitimate.com//http://www.phishing.com – such URLs embed forwarding using '//'; when a user clicks the link, they can be redirected to a malicious website.
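A minimal sketch of this check: any '//' appearing after the protocol separator suggests an embedded redirection.

```python
def has_redirect(url):
    # "http://" and "https://" place their '//' before index 8,
    # so any '//' found from index 8 onward is suspicious.
    return url.find("//", 8) != -1
```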
5. Prefix or suffix separated by '-' in domain:
Phishers can add the '-' symbol to the domain so that users think they are visiting a legitimate website but end up on phishing content.
6. HTTPS in URL:
The presence of HTTPS is important when checking a URL, and this feature verifies it.
7. 'https' in domain part:
Phishers may embed the token 'https' inside the domain part of the URL to make the link look secure and trap users.
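This check can be sketched by looking for the token 'https' inside the hostname itself (the example domain below is hypothetical).

```python
from urllib.parse import urlparse

def https_in_domain(url):
    # Flag 'https' appearing inside the domain, e.g.
    # "http://https-secure-login.example.com" (hypothetical example).
    return "https" in urlparse(url).netloc.lower()
```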
8. Google index:
Phishing websites generally exist only for a short period of time, and such web pages may not be listed in Google's index, so the machine learning model needs to check whether the URL is Google-indexed.
The count of '@' and other special symbols in the URL is also used as a feature; the '@' symbol is notable because browsers ignore everything before it when resolving the address.
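These counts can be sketched as a small dictionary of per-symbol tallies; the exact symbol set below is an illustrative choice.

```python
def symbol_counts(url):
    # Count a few symbols commonly abused in phishing URLs.
    return {sym: url.count(sym) for sym in ["@", "-", "?", "%", "=", "."]}
```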
9. Domain registration length:
This feature measures how long a domain has been registered relative to the present time. Most genuine, legitimate domains exist for a long period, whereas phishing domains are created frequently, and their registration length is often less than one year.
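Assuming the domain's WHOIS creation and expiration dates have already been fetched (e.g., via a WHOIS lookup library, which is omitted here), the feature reduces to a date comparison:

```python
from datetime import datetime

def short_registration(creation, expiration):
    # Flag registrations shorter than one year, per the heuristic above.
    return (expiration - creation).days < 365
```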
So, these are some of the features we have extracted from the URL (there are more), and based on such features we classify URLs as safe or malicious.
Now after generating these new features, the model needs to be trained.
Step 3: Split the dataset into training and testing sets. In this case, we used 70% for training and the remaining 30% for testing.
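The 70/30 split can be sketched with scikit-learn; `X` is the extracted feature matrix and `y` the benign/malicious labels (names assumed).

```python
from sklearn.model_selection import train_test_split

def split_data(X, y):
    # 30% held out for testing, stratified so both classes keep
    # their proportions in each split.
    return train_test_split(X, y, test_size=0.30, random_state=42, stratify=y)
```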
Step 4: Train the model with classification algorithms.
As it is a classification problem, various classification algorithms have been used for building the models, e.g., logistic regression, decision tree classifier, random forest, XGBoost classifier, etc.
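A sketch of this comparison step is below, fitting several of the named classifiers and reporting test accuracy. XGBoost is omitted here only to keep the example within scikit-learn; the function and variable names are illustrative, not the article's actual code.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

def compare_models(X_train, y_train, X_test, y_test):
    # Fit each candidate classifier and score it on the held-out set.
    models = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "decision_tree": DecisionTreeClassifier(random_state=42),
        "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
    }
    return {name: m.fit(X_train, y_train).score(X_test, y_test)
            for name, m in models.items()}
```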
1. Model summary:
The model performs well on both training and testing data, with a decent accuracy of around 96%.
2. Confusion Matrix:
It displays the number of true and false predictions per class: the values on the diagonal are the correct predictions for each class, and the off-diagonal values are misclassifications. As the image below shows, the model's predictions are nearly accurate, with a few misclassifications.
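The relationship between the matrix and the accuracy figure can be sketched directly: the diagonal sum over the total count is the overall accuracy.

```python
from sklearn.metrics import confusion_matrix

def diagonal_accuracy(y_true, y_pred):
    # Diagonal entries are correct predictions; off-diagonal entries
    # are the misclassifications discussed above.
    cm = confusion_matrix(y_true, y_pred)
    return cm.trace() / cm.sum()
```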
We have successfully implemented the URL classifier with 96% accuracy. However, as the confusion matrix shows, there are still false results and misclassifications, and with such misleading results the system cannot be used as-is. To overcome this problem, we need to generate a few more features from the URL, identify which features are most important for predicting the result, and then we may be able to optimize the model further.