The client's data lived in MongoDB, and we needed to migrate it to Snowflake without using ETL tools. We knew that the PyMongo library could connect a Jupyter Notebook to MongoDB, and that Snowpark could connect the same notebook to Snowflake. We therefore decided to use Python as the staging layer: first read the data from MongoDB through PyMongo into a pandas DataFrame, then write the contents of that DataFrame to Snowflake via Snowpark.
While implementing this use case, the team faced several challenges. Some were easy to solve; others had no direct solution, so we had to devise workarounds.
Challenge 1: The first problem we faced was writing the data to Snowflake. Several columns of the DataFrame held JSON objects (nested documents), while the others held strings, and the write to Snowflake failed because a pandas DataFrame with these two different data types mixed together could not be loaded as-is.
Solution: To tackle this problem, we converted the JSON columns into string columns using Python's built-in json library. After this, we were able to write our data to Snowflake.
import json

# Cast columns to the types declared in convert_dict, then serialize
# each JSON column to a plain string with json.dumps
breadcrumbs_mongo = breadcrumbs_mongo.astype(convert_dict)

json_columns = ['asset', 'device', 'deviceEvent', 'events', 'landmarks',
                'location', 'stats', 'weather', 'tripAndStay']
for col in json_columns:
    breadcrumbs_mongo[col] = breadcrumbs_mongo[col].apply(json.dumps)
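A toy DataFrame makes the conversion concrete. The column names and values below are invented for illustration; the point is that `json.dumps` turns a column of dicts into a column of strings, giving the whole frame a single consistent type for the load.

```python
import json

import pandas as pd

# Toy frame mirroring the problem: 'device' holds dicts, 'id' holds strings
df = pd.DataFrame({
    "id": ["a1", "a2"],
    "device": [{"type": "gps", "battery": 87}, {"type": "gps", "battery": 54}],
})

# Serialize the dict column to JSON strings
df["device"] = df["device"].apply(json.dumps)

print(df["device"].iloc[0])                  # {"type": "gps", "battery": 87}
print(df["device"].map(type).eq(str).all())  # True -- every value is now a string
```

On the Snowflake side, such a string column can later be parsed back into a semi-structured VARIANT with `PARSE_JSON` if needed.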
Challenge 2: To run the above code on the client machine, we created a stored procedure. The problem was that the PyMongo library was not available in Snowflake's Anaconda channel, so we uploaded PyMongo manually to a Snowflake stage. We then discovered that PyMongo's wheel file contains compiled C extensions, which Snowflake cannot execute from a stage. How do you check whether a wheel file contains compiled C code?
The tags in the wheel's filename, visible in the screenshots above, reveal this: a wheel carrying a CPython tag such as 'cp38' together with a platform tag such as 'manylinux' contains compiled C code, whereas a pure-Python wheel ends in 'none-any'.
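This filename check can be automated. The helper below is a simplified sketch (the function name is ours, and it only inspects the standard `{python}-{abi}-{platform}` tag triple at the end of a wheel filename):

```python
def is_pure_python_wheel(wheel_filename: str) -> bool:
    """Return True if the wheel is pure Python (installable anywhere).

    Wheel filenames follow {dist}-{version}(-{build})?-{python}-{abi}-{platform}.whl;
    a pure-Python wheel ends in 'none-any', while compiled wheels carry a
    CPython ABI tag (e.g. 'cp38') and a platform tag (e.g. 'manylinux...').
    """
    name = wheel_filename[:-4] if wheel_filename.endswith(".whl") else wheel_filename
    python_tag, abi_tag, platform_tag = name.split("-")[-3:]
    return abi_tag == "none" and platform_tag == "any"


print(is_pure_python_wheel("pymongo-4.3.3-cp38-cp38-manylinux2014_x86_64.whl"))  # False
print(is_pure_python_wheel("requests-2.28.2-py3-none-any.whl"))                  # True
```

A compiled wheel like PyMongo's is the kind that cannot simply be dropped into a Snowflake stage and imported.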
Solution: Since there was no workaround for the wheel-file problem, we took a different approach to the MongoDB-to-Snowflake connection on the client system: running it from the client's EC2 instance. We obtained the .pem key file for the EC2 instance and accessed it via PuTTY.
Through PuTTY, we connected to the EC2 instance and created a Python 3.8 virtual environment to run Snowpark. We saved the code as a .py file, uploaded it to the virtual environment on the EC2 instance, and ran it there. With this method, we successfully implemented the MongoDB-to-Snowflake pipeline on the client system as well.
In conclusion, data can be migrated from MongoDB to Snowflake without ETL tools by combining Python libraries such as PyMongo and Snowpark. Along the way, the team encountered challenges such as mixed data types in the DataFrame and PyMongo's compiled wheel, which Snowflake could not execute. The team overcame these with practical workarounds: converting the JSON columns to strings and running the connection from the client's EC2 instance. This successful implementation shows the power of Python libraries and the value of finding creative solutions to technical roadblocks.