top of page

NLP Using Snowpark In Python

Author: Kajal Trehan


Image source: https://lh5.googleusercontent.com/proxy/X6Gzmxbwy4CH8FrldpJULlvQ1azsDbZWPYuBxqwjnCiA4WHNR8qSQN-wu_FOrVznfxnw2bEypDvmOn-wUAd09HXwUdfKQGnat6HMNyHXgJfLs5VOutrYl2K9NlXwQcKy2rQcZxwF8qqqJDZS8dmL-2HMatqjvTK6aJHaJoWsLJKpByWDeCwO7FCr7c_a3wuqC3N1J7TgShL7fzIRukGVUbqotRqtyhedtHfD4vUibqrJmYt7FCTKyktJfXjZRnDbd7aNFGu_CnlsAV70iGdu85E=w1200-h630-p-k-no-nu

INTRODUCTION

NLP is the machine's ability to understand and interpret human language the way it is written or spoken. The NLP plays a vital role in making computers/machines as intelligent as human beings in understanding language. Its main objective is to fill the gap between how people communicate (natural language) and what the computer understands (machine language).


APPLICATIONS OF NLP:
  • Sentimental Analysis

  • Machine Translation

  • Chatbots

  • Virtual Assistants

  • Information Extraction

  • Speech Recognition

  • Fake news detection

  • Advertise Matching

  • Grammatical error detection

  • Text Summarize


LIBRARIES TO IMPORT FOR IMPLEMENTING NLP

NLTK library:


NLTK is a standard python library that is most widely used for NLP and computational linguistics and provides a set of diverse algorithms for NLP.


For windows, try out the below command in the command prompt:


pip install nltk


For mac/Linux, try out the below command in the terminal:


sudo pip install -U nltk or sudo pip3 install -U nltk


SPACY LIBRARY:


SPacy is an open-supply Python library that gives us the ability to conduct advanced natural language processing analysis and build fashions that can underpin file analysis, chatbot abilities, and all other kinds of text evaluation.


import spacy


nlp = spacy.load("en_core_web_sm")


CLEANING TEXTS USING NLP

One of the maximum not unusual obligations in Natural Language Processing (NLP) is to smooth text information. On the way to maximizing your consequences, it is crucial to distill your textual content to the maximum important root phrases within the corpus. The following are general steps in textual content preprocessing:


Tokenization: Tokenization is the technique of breaking down huge chunks of textual content into smaller words or sentences, but they're now not comprehensible by a system until they're separated. We have to be extra careful while breaking down the terms so that the outcome should not break or affect the logic of the content. Maximum software program applications deal with area instances (U.S. broke into the USA and now not U and S), but it’s usually critical to ensure it’s carried out efficiently. It’s an essential and principal step in methods like count Vectorizer and also in Deep -learning based architectures like Transformers.


(i) Sentence Tokenization:- Sentence tokenization is a process of splitting multiple paragraphs into sentences.


from nltk.tokenize import sent_tokenize


text_to_sentence = sent_tokenize(text)


print(text_to_sentence)


(ii) Word Tokenization:- Word tokenization is the splitting of text data into individual words and is implemented using word_tokenize().


from nltk.tokenize import word_tokenize


tokenized_word = word_tokenize(text)


print(tokenized_word)


Cleaning: Data cleaning is a vital technique of putting off textual content and characters that are not critical to the analysis. It consists of correcting or disposing of corrupt, duplicate, incorrect, or needless information from our dataset along with text including URLs, noncritical items together with hyphens or unique characters, web scraping, HTML, and CSS information discarded.


Removing Stop Words: Stop words are words that are generally filtered out before processing a natural language as they don't contribute to the meaning of a sentence. To remove them from a sentence, we can first divide the text into words and then remove the word if it exists in the list of stopwords provided by NLTK. Words such as “a” and “the” are examples. These phrases also seem, very often, to become dominant in our analysis and obscure the significant phrases.


Spelling: During the analysis, it is important to correct spelling mistakes. There might be more or fewer errors depending on the medium of communication. Social media posts or more informal communications like email can most likely contain more errors, whereas official corporate or educational documents have fewer errors. It is a very critical step to correct spelling errors depending on the desired outcome.


Stemming and Lemmatization: Stemming with python nltk package deal is a technique of reducing inflection in words to their root forms consisting of mapping a collection of phrases to the identical stem, even supposing the stem itself is not a valid word in the language. An instance of stemming could be to reduce “runs” to “run” as the base word dropping the “s,” in which “ran” might now not be within the same stem. On the other hand, Lemmatization in nltk is the algorithmic procedure of finding the lemma of a phrase relying on its meaning and context and, in this situation, could classify “ran” inside the identical lemma.


2 types of Stemming are there:


(i) Porter Stemming: The Porter Stemming algorithm is used to cast off the suffixes from an English word and gain its stem, which turns into very useful within the area of Informational Retrieval (IR).

from nltk.stem import PorterStemmer


stemmer = PorterStemmer()


(ii) Snowball Stemming:- The approach used in this stemming is unique and is called “English Stemmer” or “Porter2 Stemmer.” and it's far really quicker and extra logical than the authentic Porter Stemmer. SnowballStemmer() is a module in NLTK that implements the Snowball stemming technique.


from nltk.stem.snowball import SnowballStemmer


snow_stem = SnowballStemmer(language='english')


For Lemmatization:


from nltk.stem.wordnet import WordNetLemmatizer


lemmatizer = WordNetLemmatizer()


Brief overview of how NLP plays a vital role in log analysis for errors.

INTRODUCTION


Logs play a vital role in the development and maintenance of software systems.

After capturing logs, we can analyze what is happening in each and every layer of a system and investigate the problems. It is still a huge challenge to analyze large amounts of distributed log data.

● After gathering and parsing logs from different sources, log analysis tools analyze large amounts of data to find the main cause of an issue concerning any application or system error.

●Logs are getting updated in snowflakes during data ingestion, use case implementation, transformations, and other processes, and in our case, we are analyzing the logs for the error outcomes.

● Logs can be seen as textual data, so with the help of NLP techniques, we can gather the same logs in an organized manner, making it possible to search for specific error keywords describing the error by removing the stopwords.


We have made use of NLP with snowpark as we are generating CSV files having a detailed list of errors we have analyzed, and after that, analyzing the distinct errors and calculating the count of each error rather than going on a browser to search the solution for distinct errors, we have provided the links with their top 5 respective solutions, and sometimes system generates the full error with the error id and full explanation of error so our concern is to remove the stopwords and display the actual error keywords along with the frequently occurring error keywords as this will help the user to prioritize the errors.


SNOWPARK is used for processing the data

The Snowpark library affords an intuitive API for querying and processing data in a data pipeline. Using this library, you may build applications that process facts in Snowflake without moving data to the system where your application code runs.


Installation steps for Snowpark:


Install the Snowpark Python package into the virtual environment by using:


pip install snowflake-snowpark-python


Now, create a session to connect to snowflake account


from snowflake.snowpark import Session


connection_parameters={'account':'be01772.sa-east-1.aws','user':'KAJAL','password':'kajal123@','role':'ACCOUNTADMIN','warehouse':'compute_wh','database':'raw_layer','schema':'public'}

test_session = Session.builder.configs(connection_parameters).create()


WORKING WITH DATAFRAMES IN SNOWPARK PYTHON


DataFrames are constructed by specifying the source of the data for the dataset. For example, we can create a DataFrame to hold data from a table, from local data, an external CSV file, or the execution of a SQL statement. Afterward, we can specify how the dataset in the DataFrame should be transformed. For example, we can specify which columns should be selected, how the results should be sorted and grouped, how the rows should be filtered, etc. A Statement is executed to retrieve the data into the DataFrame. In order to retrieve the data into the DataFrame, we must call a method that performs an action (for example, the collect() method).

CREATED A STORED PROCEDURE IN SNOWPARK(PYTHON)


For python stored procedures, we require the latest Snowpark Python library, which installs additional third-party dependencies. Before looking to create Python stored procedures, make sure that the Anaconda package feature is enabled so that the specified third-party dependencies can be loaded. When writing the method or function for the stored procedure, we have to specify the Snowpark Session object as the first argument of our method or function. When we call your stored procedure, Snowflake automatically creates a Session object and passes it to our stored procedure.

To copy a specified number of rows from one table to another table. The method takes the following arguments:

  • A Snowpark Session object

  • The name of the table to copy the rows from

  • The name of the table to save the rows to

  • The number of rows to copy


CREATE OR REPLACE procedure ERROR_P(from_table STRING, to_table STRING)

RETURNS STRING

LANGUAGE PYTHON

RUNTIME_VERSION = '3.8'

PACKAGES = ('snowflake-snowpark-python')

HANDLER = 'run'

AS

$$

from snowflake.snowpark.functions import col

def run(session, from_table, to_table):

session.sql("delete from RAW_LAYER.PUBLIC.LOG_ERRORS")

session.table(from_table).filter(col('OUTCOME')=='FAILED').write.save_as_table(to_table)

return "SUCCESS"

$$;


Here we are inserting only the failed outcome into our log_errors(to_table) table as we need to analyze only the error from the log_table_1(from_table).


We can specify Anaconda packages to install when you create Python stored procedures, and for that, we have to follow a few steps:

  • Log into Snowsight, the Snowflake web interface.



  • Click the dropdown menu next to your login name, then click Switch Role » ORGADMIN to change to the organization administrator role.



  • Click Admin » Billing » Terms & Billing.

  • Scroll to the Anaconda section and click the Enable button. The Anaconda Packages (Preview Feature) dialog opens.

  • Click the link to review the Snowflake Third Party Terms.

  • If you agree to the terms, click the Acknowledge & Continue button.



FETCHING ERROR OUTCOME FROM SNOWFLAKE

We have a log_errors table in snowflake having only error outcomes in that table and fetching them using:-

output_err_list=test_session.sql("SELECT * FROM RAW_LAYER.PUBLIC.LOG_ERRORS")

error_count_list = test_session.sql("select LOG_MESSAGE, COUNT(*) as ERROR_COUNT from

RAW_LAYER.PUBLIC.LOG_ERRORS group by LOG_MESSAGE")


CONVERTING INTO PANDAS DATAFRAME:

Now convert both error_list and error_count into a pandas data frame using the .to_pandas function.


PANDAS:


Pandas is an open-source library that is made specifically for working with relational or labeled data both easily and intuitively. to_pandas() function is used to convert the frame right into a pandas DataFrame. If the frame being converted has one or extra key columns, those columns become the index in the panda’s DataFrame.


output_err_list=output_err_list.to_pandas()

error_count_list=error_count_list.to_pandas()

Using google search library to search for solution links:

After when our data frames are ready, we need a solution for all the errors we encounter. So we use google search library for this.

GOOGLESEARCH:


Googlesearch is a Python library for searching Google effortlessly. GoogleSearch uses requests and BeautifulSoup4 to scrape Google. We're defining a google_search feature that is looking for the top 5 relevant links to the solution to the errors we are getting.


def google_search(df):

file = ['Error', 'Count', 'Solution']

lst = []

df2 = pandas.DataFrame(lst, columns=file)

for index, row in df.iterrows():

err_msg = (row['LOG_MESSAGE'])

err_cnt = (row['ERROR_COUNT'])

text_search = "how to fix "+ err_msg

result_url = [i for i in googlesearch.search(text_search, stop=5)]

lst.append([err_msg, err_cnt, result_url])

df2 = pandas.DataFrame(lst, columns=file)

df2.drop_duplicates(subset ="Error", keep = LAST, inplace = True)

df2=df2.explode('Solution')

return df2


Here in this google_search function:


Drop duplicates:


DataFrame.drop_duplicates(subset=None,keep='first',inplace=False, ignore_index=False)


Return DataFrame with duplicate rows removed.


DataFrame with duplicates removed or None if inplace=True.


Explode:


DataFrame.explode(column, ignore_index=False)


Transform each element of a list-like to a row, replicating index values. This function we used for trying to get out the relevant link for our errors.


soution_list = google_search(error_count_list)


We also used the datetime module to give the time and date.


HOW NLP HELPED IN LOG ANALYSIS FOR ERRORS

def test_sp(session):

df2=session.sql("CALL RAW_LAYER.PUBLIC.NL_PROCESSING()").collect()

df2=df2[0]

df2=df2[0]

doc = sent_tokenize(df2)

err_freq = get_top_n_trigram(doc)

return err_freq


Using this function, we are collecting the processing data from snowflake database.


Using collect() function and sent_tokenize() is used to tokenize the error messages. In snowflake, we created one nl_processing procedure in the raw layer database. From where we are collecting the data.


CREATE OR REPLACE procedure NL_PROCESSING()

RETURNS STRING

LANGUAGE PYTHON

RUNTIME_VERSION = '3.8'

PACKAGES = ('snowflake-snowpark-python','nltk', 'pandas', 'spacy')

HANDLER = 'run'

AS

$$

from snowflake.snowpark.functions import col

from snowflake.snowpark import Row

import pandas

import nltk

import spacy

from nltk.tokenize import sent_tokenize

from nltk.tokenize import word_tokenize

def run(session):

df=session.sql("SELECT ENVIRONMENT, LOG_MESSAGE, FUNCTION_NAME FROM RAW_LAYER.PUBLIC.LOG_ERRORS")

df=df.to_pandas()

df2=''

for index, row in df.iterrows():

df2=df2+('Error occurred in {} environment. The error message is: {}. Error occured while executing {} function. '.format(row['ENVIRONMENT'], row['LOG_MESSAGE'], row['FUNCTION_NAME']))

return df2

$$;


The above image is the snowflake procedure that we created. After that, we also tokenize the sentence that we get from the nl_processing procedure.


For stopword implementation, we created the trigram function. We have created a trigram function to return the frequency of distinct errors we are getting..

  • First, we are updating stopwords from our log_message

  • By using these commands, we are using stop words in our projects.

  • stopwords = set(STOPWORDS)

  • These commands help us to remove the stop word in our project. They can safely be ignored without sacrificing the meaning of the sentence.

  • CountVectorizer, which is a great tool provided by the sci-kit-learn library in Python. It is used to transform a given text into a vector on the basis of the frequency (count) of each word that occurs in the entire text.

  • Lastly, we count the frequency of words that we get.


def get_top_n_trigram(corpus, n=None):

stopwords = set(STOPWORDS)

stopwords.update(["ORA", "ERROR", "error", "message", "MESSAGE", "Message", "Error", "occurred", "ora", "occured", "executing"])

vec= CountVectorizer(ngram_range=(3, 3), stop_words=stopwords).fit(corpus)

bag_of_words = vec.transform(corpus)

sum_words = bag_of_words.sum(axis=0)

words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]

words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)

return words_freq[:n]


This function top_trigram we used to stopword and count the frequency and all. After getting frequency, the last step is to proccess_error_count


process_err_count(dataf)


def process_err_count(dataf):

file = ['Error Keywords', 'Frequency of occurence']

lst = []

df = pandas.DataFrame(lst, columns=file)

for data in dataf:

lst.append(data)

df = pandas.DataFrame(lst, columns=file)#print("Error : {0}, Count : {1}".format(data[0], data[1]))

subject = 'Frequently occurring error keywords : '

timestr = time.strftime("%d-%m-%Y-%H:%M:%S")

subject = subject + timestr

send_email(email_list, subject, df, password)

return dataf


In this function, we give the parameter as (what we get from test_sp function). This function will help us to create a file of error keywords and frequency of occurrence. As a list here, we use the pandas dataframe function.


Sending all the gathered information through SMTP protocol via Email in python

send_email(email_list, subject_1, output_err_list, password)

send_email(email_list, subject_2, soution_list, password)


For this, we've got used the smtplib module, which defines an SMTP patron session item that may be used to send mail to any internet machine with an SMTP or ESMTP listener daemon. We will observe these steps to perform this procedure:

  • Create an SMTP item for connection to the server.

  • Log in to your account.

  • Outline your message headers and login credentials.

  • Create a MIMEMultipart message object and fasten the relevant headers to it, i.e., From, To, and Subject.

  • Connect the message to the MIMEMultipart object.

  • Eventually, ship the message.


OUTPUT WE ARE RECEIVING VIA EMAIL:

Detailed Error List:



Error Count and Solution for errors


As we are already defining these error_count functions in python and sending that in email as below:



And Frequency Occurring specific keywords


As earlier, we have created a function to count the total frequency of words in a sentence (counting the number of times the same word appears), and here we are making use of SQL query (using stored procedure and defining a function in snowflake using snowpark) to fetch the logs from log_errors table because more datatypes are being supported and sending the result after log analysis in the mail as below:



CONCLUSION:

So, the main purpose of using NLP in our case for log analysis is to keep the vocabulary small and to remove the noise(unwanted stuff), which helps to improve the accuracy of errors as directly looking into error id with a large text of error will need more time to analyze. So NlP helped us in many ways, like displaying only the distinct errors, the number of times that particular error has occurred, along with links to the top five solutions, so we don't need to explicitly go and search in browsers for the solution and also NLP helped us a lot in recognizing the error keywords by removing stopwords with their frequency occurrence which helped us in prioritizing the errors so we are sharing a mail with a detailed list of errors, distinct error count with their respective solutions and after removing the stopwords from full log_message we are sending the error keywords with their frequency count.


References:

207 views0 comments

Recent Posts

See All
bottom of page