Vulnerability Prediction

Author: Sumit Adsul


Cyber security is the practice of defending computers, servers, mobile devices, electronic systems and their associated data from malicious attacks. It is also known as IT security or electronic information security. It is essential for safeguarding the applications that organizations develop and for protecting their operational activities against threats. Yet even with security measures in place, security is still compromised in many incidents, whether through human error or through a failure to implement countermeasures and effective fixes for open vulnerabilities, which results in data and security breaches.

Vulnerability prediction addresses this with a machine learning model that predicts which of the OWASP (Open Web Application Security Project) Top 10 categories a user's vulnerability belongs to, and then provides the recommendations and solutions needed to fix the kinds of vulnerabilities the user encounters.

How does it work?

1. Users provide data about the threats/vulnerabilities identified in their web application during application auditing.

2. For each type of vulnerability the user has mentioned, an example scenario and a corresponding recommendation are provided so that the issue can be closed. The recommendations and suggestions are drawn from the OWASP online community.

3. The ML model identifies the threat the user has mentioned and provides the corresponding fix recommendation.
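The flow above can be sketched in a few lines of Python. This is only an illustration: `predict_category` is a hypothetical keyword rule standing in for the trained model, and the recommendation texts are placeholders we wrote for this example, not the actual data collected from OWASP.

```python
# Hypothetical lookup table. The category names are real OWASP Top 10 entries,
# but the recommendation texts are illustrative placeholders.
RECOMMENDATIONS = {
    "Injection": "Use parameterized queries and validate all user input.",
    "Broken Access Control": "Deny by default and enforce access checks on the server.",
}


def predict_category(description):
    """Stand-in for the trained ML model (a keyword rule, for illustration only)."""
    text = description.lower()
    if "sql" in text or "query" in text:
        return "Injection"
    return "Broken Access Control"


def recommend(description):
    """Return the predicted category and the matching fix recommendation."""
    category = predict_category(description)
    return category, RECOMMENDATIONS.get(category, "No recommendation available.")


category, fix = recommend("Login form allows SQL statements in the username field")
print(category, "->", fix)
```

In the real system the keyword rule would be replaced by the classifier described in the implementation steps.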

List of Top-10 OWASP Categories:

These are the OWASP categories that the model predicts. Based on the predicted category, the necessary recommendations are provided to help users tackle those vulnerabilities.

Implementation Details:

Step 1: Data collection:

The data was collected by web scraping the OWASP website and contains around 100,000 (1 lakh) records. The screenshot below shows the columns of the dataset.

Description is the column that describes the threats users are facing.

Step 2: Text pre-processing:

This is essentially a text classification problem, so the data must be pre-processed first, i.e. the text is cleaned and unwanted content is removed.

Text is a form of unstructured data that contains many unnecessary elements, e.g. special characters, links, stop words and punctuation. So before passing this data to the model, the text needs to be cleaned. Below are some of the methods we used for cleaning the text.

a) Stop-words Removal:

Stop-words such as “I”, “in”, “the”, “are”, “you” and “which” should be removed from the text, as they don’t add meaning for text classification. This is done using NLTK’s stopwords corpus.

b) Text-cleaning:

A simple text-cleaning process removes punctuation, URLs, links, hashtags, extra spaces etc. It also strips leading and trailing white space.

c) Lemmatization:

It converts each word into its base form. For e.g. “having” becomes “have”.

#This is a code snippet which removes all the special characters from the text and also removes the stop words, which don't add meaning to the text.

#After applying the text pre-processing, we have a new column named ‘clean_text’ which contains the cleaned data. This clean_text column is then used by the ML model to predict the OWASP categories.
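A minimal sketch of this kind of cleaning, using only Python's standard library, could look like the following. The function name `clean_text` and the small `STOP_WORDS` set are assumptions made for this example (our pipeline uses NLTK's much larger stop-word list), and lemmatization is left out to keep the sketch short.

```python
import re

# Small hard-coded stop-word set, for illustration only.
# NLTK's English stop-word list is much larger.
STOP_WORDS = {"i", "in", "the", "are", "you", "which", "a", "an", "is", "to", "of", "and"}


def clean_text(text):
    """Lower-case the text, strip URLs, hashtags and punctuation, drop stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs and links
    text = re.sub(r"#\w+", " ", text)                    # remove hashtags
    text = re.sub(r"[^a-z0-9\s]", " ", text)             # remove punctuation and special characters
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)  # split/join also collapses extra white space


raw = "  The login page is vulnerable to SQL-injection!! See https://example.com #security "
print(clean_text(raw))  # login page vulnerable sql injection see
```

Applied to every row of the Description column, a function like this would produce the values of the clean_text column.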

Step 3: Vectorization:

ML models work with numerical data, so we need to extract numeric vectors from the text. The process of converting text into numerical data is known as vectorization.

These are the methods we implemented:

i. Count-Vectors- It counts how many times each word appears in each document. For more detail, refer to the example below.

ii. Tf-Idf-

It is a statistical measure of how important a word is to a document within a collection of documents. It is obtained by multiplying the following two metrics:

  1. The term frequency: how many times the word occurs in the document.

  2. The inverse document frequency of the word across the whole set of documents.
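Both methods can be illustrated with a small pure-Python sketch. The three documents are made up for this example, and the formula used is the textbook one, tf × log(N / df); libraries such as scikit-learn (`CountVectorizer`, `TfidfVectorizer`) use smoothed and normalized variants, so their numbers will differ.

```python
import math
from collections import Counter

# Toy corpus, invented for this example.
docs = [
    "sql injection in login form",
    "xss in comment form",
    "sql injection in search",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for tokens in tokenized for w in tokens})


def count_vector(tokens):
    """Count vector: how often each vocabulary word occurs in one document."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]


# Inverse document frequency: log(number of documents / documents containing the word).
n_docs = len(tokenized)
doc_freq = {w: sum(1 for tokens in tokenized if w in tokens) for w in vocab}
idf = {w: math.log(n_docs / doc_freq[w]) for w in vocab}


def tfidf_vector(tokens):
    """Tf-Idf vector: term frequency multiplied by inverse document frequency."""
    counts = Counter(tokens)
    return [counts[w] * idf[w] for w in vocab]


count_vectors = [count_vector(t) for t in tokenized]
tfidf_vectors = [tfidf_vector(t) for t in tokenized]

print(vocab)
print(count_vectors[0])  # [0, 1, 1, 1, 1, 0, 1, 0]
print(tfidf_vectors[0])
```

Note that “in” occurs in every document, so its inverse document frequency is log(3/3) = 0 and it contributes nothing to the Tf-Idf vectors, while a rarer word such as “login” gets a higher weight.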

Step 4: Train the model and evaluate:

To train the model, split the dataset into training and testing sets; in our case we used 80% for training and 20% for testing.
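An 80/20 split can be sketched as below. `split_dataset` is a hypothetical helper written for this example and the records are placeholders; in practice a library routine such as scikit-learn's `train_test_split` does the same job.

```python
import random


def split_dataset(rows, test_ratio=0.2, seed=42):
    """Shuffle the rows with a fixed seed and split them into (train, test)."""
    rows = list(rows)                  # copy, so the caller's list is not modified
    random.Random(seed).shuffle(rows)  # fixed seed keeps the split reproducible
    n_test = int(round(len(rows) * test_ratio))
    return rows[n_test:], rows[:n_test]


# Placeholder records of the form (description, category).
records = [("description %d" % i, "category %d" % (i % 3)) for i in range(10)]
train, test = split_dataset(records)
print(len(train), len(test))  # 8 2
```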

The model is trained on the vectorized training dataset using various classification algorithms, such as Logistic Regression, Decision Tree, Random Forest, Naïve Bayes, SVC, XGBoost and LSTM.

Accuracy of the model is 82%.
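To make the train-and-evaluate step concrete, here is a minimal sketch of one of the listed algorithms, a multinomial Naïve Bayes classifier with add-one smoothing, written in plain Python. The tiny dataset and its two labels are toy values invented for this example, so the accuracy it prints says nothing about the real model.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(samples):
    """samples: list of (tokens, label). Returns the counts needed for prediction."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in samples:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab


def predict(model, tokens):
    """Return the label with the highest log-probability for the given tokens."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)  # log prior of the class
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in tokens:
            if w in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, samples):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(1 for tokens, label in samples if predict(model, tokens) == label)
    return correct / len(samples)


# Toy data, invented for this example.
train_set = [
    ("sql injection login form".split(), "Injection"),
    ("sql query injection search".split(), "Injection"),
    ("script tag comment xss".split(), "XSS"),
    ("xss script alert input".split(), "XSS"),
]
test_set = [
    ("injection sql parameter".split(), "Injection"),
    ("xss script payload".split(), "XSS"),
]

model = train_naive_bayes(train_set)
print(accuracy(model, test_set))  # 1.0 on this toy data
```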


We have successfully implemented the machine learning model for vulnerability prediction, but its accuracy is concerning and needs to be improved. During model evaluation, we found that the dataset doesn't contain many records for some OWASP categories, and that’s one of the reasons why the accuracy is low. To overcome this problem, we need to gather more data.
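One way to spot this kind of imbalance is to count the records per category and to look at how well each category is predicted on its own. The sketch below assumes two hypothetical helpers, `class_distribution` and `per_class_recall`, and uses made-up labels and predictions rather than our real results.

```python
from collections import Counter


def class_distribution(labels):
    """Map each label to (number of records, share of the dataset)."""
    counts = Counter(labels)
    total = len(labels)
    return {label: (n, n / total) for label, n in counts.most_common()}


def per_class_recall(y_true, y_pred):
    """For each true label, the fraction of its records predicted correctly."""
    total = Counter(y_true)
    correct = Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            correct[t] += 1
    return {label: correct[label] / total[label] for label in total}


# Made-up labels and predictions, for illustration only.
y_true = ["Injection"] * 6 + ["XSS"] * 3 + ["SSRF"] * 1
y_pred = ["Injection"] * 6 + ["XSS", "XSS", "Injection"] + ["Injection"]

print(class_distribution(y_true))
print(per_class_recall(y_true, y_pred))
```

In this made-up example the overall accuracy is 80%, yet the rarest category is never predicted correctly, which is the pattern that more data for the under-represented categories is meant to fix.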
