Author: Sumit Adsul
Introduction
Cyber security is the practice of defending computers, servers, mobile devices, electronic systems, and associated data from malicious attacks. It is also known as IT security or electronic information security. It is essential for safeguarding the applications that organizations develop and for protecting their operations against threats. Even with security measures in place, security is still compromised through human error or a failure to apply countermeasures and effective fixes for open vulnerabilities, which results in data and security breaches.
Our vulnerability detection solution is a machine learning model that predicts which of the OWASP (Open Web Application Security Project) Top 10 categories a vulnerability reported by the user belongs to, and then provides recommendations and solutions for fixing it.
How does it work?
1. Users provide data about the threats/vulnerabilities identified in their web application during application auditing.
2. For each type of vulnerability the user has mentioned, an example scenario and a corresponding recommendation are provided so that the issue can be closed. The recommendations and suggestions are taken from the OWASP online community.
3. The ML model identifies the threat the user has mentioned and provides the corresponding issue-fix recommendations.
List of OWASP Top 10 Categories:
These are the OWASP categories that we predict. Based on the predicted category, the necessary recommendations are provided to users to tackle those vulnerabilities.

Implementation Details:
Step 1: Data collection:
The data was collected by web scraping the OWASP website and contains around 100,000 (1 lakh) records. The screenshot below displays the columns of the dataset.
The description column explains the threats users are facing.

Step 2: Text pre-processing:
This is a text classification problem, so we first need to pre-process the data, i.e., clean the text and remove unwanted content from it.
Text is a form of unstructured data that contains many unnecessary elements, e.g., special characters, links, stop words, and punctuation. So, before passing this data to the model, the text needs to be cleaned. Below are the methods we used for cleaning the text.
a) Stop-word removal:
Stop words such as “I, in, the, are, you, which” should be removed from the text, as they don't add meaning for text classification. This is done using NLTK’s stop-word list.
b) Text cleaning:
A simple text-cleaning process involves removing punctuation, URLs, links, hashtags, extra spaces, etc. It also removes leading and trailing white space.
c) Lemmatization:
It converts words into their base form. For example, “having” becomes “have”.

This code snippet removes all the special characters from the text. It also removes the stop words, which don't add meaning to the text.
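As a rough illustration (not necessarily the code used in this project), the three cleaning steps above could be sketched as follows. To keep the sketch runnable without any downloads, the small `STOP_WORDS` set and the `LEMMAS` lookup table are hypothetical stand-ins for NLTK's stop-word list and a real lemmatizer, and the function name `clean_text` is an assumption.

```python
import re

# Illustrative stand-in for NLTK's (much larger) English stop-word list.
STOP_WORDS = {"i", "in", "the", "are", "you", "which", "a", "an", "is", "to", "of", "and"}

# Hypothetical lookup table standing in for a real lemmatizer.
LEMMAS = {"having": "have", "allows": "allow", "attackers": "attacker"}


def clean_text(text: str) -> str:
    """Lower-case the text, strip noise, drop stop words, and lemmatize."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs and links
    text = re.sub(r"#\w+", " ", text)                    # remove hashtags
    text = re.sub(r"[^a-z\s]", " ", text)                # remove punctuation, digits, special characters
    tokens = [t for t in text.split() if t not in STOP_WORDS]  # split() also collapses extra spaces
    tokens = [LEMMAS.get(t, t) for t in tokens]          # map known words to their base form
    return " ".join(tokens)


raw = "The login form is having SQL injection!! See https://example.com #urgent"
print(clean_text(raw))  # login form have sql injection see
```

Applied to every value in the description column, a function like this would produce the cleaned text that the later steps work on.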

After applying the text processing, we have a new column named ‘clean_text’, which contains the cleaned data. This clean_text column is then used by the ML model to predict the OWASP categories.

Step 3: Vectorization:
ML models work with numerical data, so we need to convert the text into numerical vectors. This process is known as vectorization.
We have implemented the following methods:
i. Count vectors: counts how many times each word appears in each document. For more detail, refer to the example below.
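A minimal sketch of count vectors on two toy documents (the documents are made up for illustration). It uses only the standard library; in practice a library class such as scikit-learn's `CountVectorizer` would typically do the same job.

```python
from collections import Counter

# Two toy, already-cleaned documents (hypothetical examples).
docs = ["sql injection in login form", "login page login bypass"]

# Vocabulary: every distinct word across all documents, in sorted order.
vocab = sorted({word for doc in docs for word in doc.split()})


def count_vector(doc, vocab):
    """Return one count per vocabulary word for a single document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]


matrix = [count_vector(doc, vocab) for doc in docs]

print(vocab)      # ['bypass', 'form', 'in', 'injection', 'login', 'page', 'sql']
print(matrix[0])  # [0, 1, 1, 1, 1, 0, 1]
print(matrix[1])  # [1, 0, 0, 0, 2, 1, 0]
```

Each document becomes one row, each vocabulary word becomes one column, and the value is how often that word occurs in that document ("login" occurs twice in the second document).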

ii. TF-IDF:
It is a statistical measure of how important a word is to a document. It is obtained by multiplying the following two metrics:
The term frequency: how many times the word occurs in the document.
The inverse document frequency: how rare the word is across all documents.
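The product of the two metrics can be sketched as below. This is a simplified version that assumes the raw count as term frequency and log(N / df) as inverse document frequency, where N is the number of documents and df is the number of documents containing the word. Libraries such as scikit-learn's `TfidfVectorizer` use smoothed and normalized variants, so their numbers will differ. The toy documents are made up for illustration.

```python
import math
from collections import Counter

# Three toy, already-tokenized documents (hypothetical examples).
docs = [
    "sql injection login form".split(),
    "weak password login page".split(),
    "sql injection search box".split(),
]


def tf_idf(docs):
    """Return, per document, a dict mapping word -> tf * idf."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)  # term frequency as a raw count
        weights.append({
            word: count * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return weights


weights = tf_idf(docs)
for word, weight in sorted(weights[0].items()):
    print(f"{word}: {weight:.3f}")
```

In the first document, "form" appears in only one of the three documents and so gets a higher weight than "login" or "sql", which each appear in two.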

Step 4: Train the model and evaluate:
To train the model, we split the dataset into training and testing sets; in our case, we used 80% for training and 20% for testing.
The model is trained on the vectorized training dataset using various classification algorithms: Logistic Regression, Decision Tree, Random Forest, Naïve Bayes, SVC, XGBoost, and LSTM.
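To make the split-train-evaluate loop concrete, here is a small self-contained sketch that uses a hand-written multinomial Naïve Bayes classifier on word counts. It is an illustration only, not the project's training code: the ten toy records, their two labels, and all function and class names are hypothetical, and a real pipeline would normally use a library implementation of the algorithms listed above.

```python
import math
import random
from collections import Counter, defaultdict

# Toy (description, category) records; hypothetical, for illustration only.
rows = [
    ("sql injection login form", "Injection"),
    ("sql injection search parameter", "Injection"),
    ("blind sql injection cookie", "Injection"),
    ("sql injection order field", "Injection"),
    ("stacked sql injection api", "Injection"),
    ("user view other accounts", "Broken Access Control"),
    ("admin page open without login", "Broken Access Control"),
    ("missing role check endpoint", "Broken Access Control"),
    ("direct object reference exposes invoices", "Broken Access Control"),
    ("privilege escalation hidden url", "Broken Access Control"),
]


def train_test_split(rows, test_ratio=0.2, seed=42):
    """Shuffle a copy of the rows and split it into (train, test)."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_ratio)
    return rows[n_test:], rows[:n_test]


class NaiveBayes:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, rows):
        self.total = len(rows)
        self.class_counts = Counter(label for _, label in rows)
        self.word_counts = defaultdict(Counter)
        for text, label in rows:
            self.word_counts[label].update(text.split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            score = math.log(n / self.total)  # log prior of the class
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in text.split():
                # Smoothed log likelihood of the word given the class.
                score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(model, rows):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(model.predict(text) == label for text, label in rows) / len(rows)


train_rows, test_rows = train_test_split(rows)  # 80% train, 20% test
model = NaiveBayes().fit(train_rows)
acc = accuracy(model, test_rows)
print(f"train={len(train_rows)} test={len(test_rows)} accuracy={acc:.2f}")
```

The key point is that accuracy is measured only on the 20% of records the model never saw during training.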

Accuracy of the model is 82%.

Conclusion
We have successfully implemented a machine learning model for vulnerability prediction, but its accuracy is a concern and needs to be improved. During model evaluation, we found that the dataset doesn't contain many records for some OWASP categories, which is one of the reasons the accuracy is low. To overcome this problem, we need to gather more data.