Author: Sumit Adsul
Cyber security is the practice of defending computers, servers, mobile devices, electronic systems, and the associated data from malicious attacks. But even after security measures are enforced, security is often compromised through human error, such as navigating to malicious URLs.
Malicious URLs/websites are a serious threat in cybersecurity. Attackers can deliver such URLs via WhatsApp, email, text message, etc., and when users click on them, they can propagate a virus, malware, or a program that steals the user's data and can completely damage the user's device. That is why detecting and mitigating the spread of such malicious URLs is a top priority for any organization.
Malicious URL detection is essentially a URL classifier that classifies a user-supplied URL as Safe or Malicious, so that users can check whether a URL is legitimate before visiting it and avoid exposure to malicious software trying to steal their data.
The steps to follow for building the malicious URL prediction model are:
Step 1: Data Collection
We collected a malicious URL dataset from Kaggle containing around 800,000 (8 lakh) rows. The dataset contains malicious and benign (safe) URLs in roughly equal proportion.
Step 2: Feature Extraction
Feature extraction is the process of deriving features from URLs. To develop a machine learning model, we need to generate features from each URL and then pass these features to the model.
The image below shows the structure of a URL; the task here is to find the characteristics/features that separate malicious URLs from benign (safe) ones. A URL is made up of different components, as shown in the figure above, and these components are precisely the features that need to be extracted from the URL.
Below is the list of features we extracted from each URL and passed to the machine learning model, which then predicts whether the URL is malicious or benign.
1. Use of an IP address in URL:
Attackers might include an IP address in the URL instead of the website's actual domain name to steal the user's personal information and data, e.g., “http://22.214.171.124/fake.html”.
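As a minimal sketch, this feature can be computed by checking whether the host part of the URL is a raw IPv4 address (the function name is illustrative, not from the article):

```python
import re
from urllib.parse import urlparse

def has_ip_address(url: str) -> int:
    """Return 1 if the URL's host is a raw IPv4 address, else 0."""
    host = urlparse(url).netloc.split(":")[0]  # drop any port suffix
    return 1 if re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}", host) else 0
```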
2. Long URLs to hide suspicious part:
Phishers can use long URLs to hide the suspicious part in the address bar, so that a user who sees such a URL suspects nothing, proceeds to click on it, and ends up falling into the trap.
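A simple way to encode this is a length threshold; the 54-character cutoff below is a common heuristic in phishing-detection literature, not a value stated in the article:

```python
def is_long_url(url: str, threshold: int = 54) -> int:
    """Return 1 if the URL is suspiciously long, else 0.
    The 54-character threshold is an illustrative heuristic."""
    return 1 if len(url) >= threshold else 0
```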
3. URL Shortening services like TinyURL:
Phishers may shorten the URL so that users cannot see the actual contents of the URL or the domain name, and fall into the trap.
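One way to sketch this check is to compare the host against a list of known shortening services; the list below is illustrative and deliberately incomplete:

```python
from urllib.parse import urlparse

# Illustrative (incomplete) set of known URL-shortening domains.
SHORTENERS = {"tinyurl.com", "bit.ly", "goo.gl", "t.co", "ow.ly", "is.gd"}

def uses_shortener(url: str) -> int:
    """Return 1 if the URL's host is a known shortening service, else 0."""
    host = urlparse(url).netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return 1 if host in SHORTENERS else 0
```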
4. URL Redirecting:
This feature helps identify URL forwarding. See the example below.
http://www.legitimate.com//http://www.phishing.com – such URLs contain forwarding via ‘//’. When a user clicks on such a URL, they can be redirected to a malicious website.
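A minimal sketch of this check looks for an extra ‘//’ occurring after the scheme separator:

```python
def has_redirection(url: str) -> int:
    """Return 1 if '//' appears after the scheme's '://',
    which suggests URL forwarding, else 0."""
    start = url.find("://") + 3  # skip past e.g. 'http://'
    return 1 if url.find("//", start) != -1 else 0
```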
5. Prefix or suffix ‘-’ in domain:
Phishers can add the ‘-’ symbol to a URL so that users think they are visiting a legitimate website but end up visiting suspicious content.
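A minimal check for this feature might look like the following (the helper name is illustrative):

```python
from urllib.parse import urlparse

def has_hyphen_in_domain(url: str) -> int:
    """Return 1 if the domain part of the URL contains '-', else 0."""
    return 1 if "-" in urlparse(url).netloc else 0
```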
6. Https in URL:
The presence of https is important when checking a URL, and this feature checks for it.
7. Https in domain part:
Phishers may add the “https” token to the domain part of an http URL, as in the example above, to make it look secure and trap users.
8. Google index:
Suspicious websites generally exist for a short period of time, and such web pages may not be listed in the Google index, so the machine learning model needs to check whether the URL is Google-indexed.
The count of ‘@’ and other special symbols in the URL is also used as a feature.
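The symbol counts can be sketched as a small dictionary of per-character counts; the symbol set below is illustrative:

```python
def symbol_counts(url: str) -> dict:
    """Count occurrences of special symbols often abused in phishing URLs.
    The chosen symbol set is an illustrative assumption."""
    return {sym: url.count(sym) for sym in ["@", "?", "-", "=", ".", "#", "%", "+", "$"]}
```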
9. Domain registration length:
This feature measures how long a domain has been registered. Genuine, legitimate domains usually exist for a long time, while suspicious domains are created frequently, and most of the time their registration length is less than one year.
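In practice the registration dates come from a WHOIS lookup (e.g. via the third-party python-whois package); the sketch below takes the dates as inputs so the check itself stays self-contained:

```python
from datetime import date

def short_registration(creation: date, expiration: date, min_days: int = 365) -> int:
    """Return 1 if the domain's registration span is shorter than ~1 year
    (an illustrative threshold), else 0. Dates would come from WHOIS."""
    return 1 if (expiration - creation).days < min_days else 0
```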
So, these are some of the features we extracted from the URL (there are more), and based on such features we classify the URLs as Safe or Malicious.
Now after generating these new features, the model needs to be trained.
Step 3: Split the dataset into training and testing sets. In this case, we used 30% for testing and the remaining 70% for training.
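The split can be sketched with scikit-learn's train_test_split; the synthetic feature matrix below merely stands in for the extracted URL features:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the extracted URL features and labels.
rng = np.random.default_rng(0)
X = rng.random((1000, 9))          # 9 features per URL (illustrative)
y = rng.integers(0, 2, size=1000)  # 0 = benign, 1 = malicious

# 70% training / 30% testing, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y)
```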
Step 4: Train the model with classification algorithms.
As this is a classification problem, various classification algorithms were used to build the models, e.g., Logistic Regression, Decision Tree classifier, Random Forest, XGBoost classifier, etc.
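A sketch of the training loop with scikit-learn is shown below; the data is synthetic, and XGBoost is omitted because it requires the separate xgboost package:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic features/labels standing in for the URL dataset.
rng = np.random.default_rng(0)
X = rng.random((1000, 9))
y = (X[:, 0] > 0.5).astype(int)  # learnable toy labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTree": DecisionTreeClassifier(random_state=42),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, round(model.score(X_test, y_test), 3))
```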
1. Model summary:
The model performs well on both training and testing data, with a decent accuracy of around 96%.
2. Confusion Matrix:
It displays the number of correct and incorrect predictions per class: the values on the diagonal are the accurate predictions for each class, and the rest are misclassified values. As we can see from the image below, the model's predictions are nearly accurate, but there are a few misclassifications.
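A confusion matrix can be computed with scikit-learn as sketched below; the labels here are toy values, whereas in the article they come from the URL classifier:

```python
from sklearn.metrics import confusion_matrix

# Toy true/predicted labels (0 = benign, 1 = malicious).
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 1]

cm = confusion_matrix(y_true, y_pred)
print(cm)  # rows = actual class, columns = predicted class
```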
We have successfully implemented the URL classifier with 96% accuracy, but as the confusion matrix shows, there are still false results/misclassifications, and with such misleading results we cannot use this system as-is. So, to overcome this problem, we need to generate/extract a few more features from the URL, identify the features that are most important in predicting the result, and then we might be able to optimize it further.