Author(s): Pragati Srivastava & Yashodhan Talekar
Xplenty: SecurETL Platform
Big data processing, transformation, and integration can be complicated for businesses. Factors like scale, complex file formats, connectivity, and API access add difficulty, and deployment often requires coding and an entire IT team. A SecurETL or data integration platform can help you tackle these challenges head-on. The key is well-designed data flows that can process both structured and unstructured data, on a platform that can integrate with a wide variety of sources. One of the best of these platforms is Xplenty.
ETL software takes data from one system, transforms it, and then imports it into the receiving system, regardless of the source or destination system. The right ETL platform can read and process data from many sources, including storage file sources. Businesses that do their own data integration have to code data pipelines manually, which is time-consuming and costly for IT teams or data managers. Xplenty connects to an ever-expanding number of integrations without additional coding or manual data flow creation. This helps you quickly and easily integrate your business data for use by analytics suites or business intelligence tools. Xplenty’s SecurETL platform is also cost-effective and saves crucial business hours.
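To make the extract-transform-load idea concrete, here is a minimal sketch in plain Python. The CSV sample, field names, and the list standing in for a destination table are invented for illustration; a real platform like Xplenty handles these stages through connectors rather than hand-written code.

```python
import csv
import io

# Toy "source system": CSV data as it might arrive from a file store.
SOURCE_CSV = """id,name,score
1,Asha,81
2,Ravi,67
"""

def extract(raw_csv: str) -> list[dict]:
    """Extract: read raw rows out of the source system."""
    return list(csv.DictReader(io.StringIO(raw_csv)))

def transform(rows: list[dict]) -> list[dict]:
    """Transform: cast types and derive a pass/fail flag."""
    return [
        {"id": int(r["id"]), "name": r["name"], "passed": int(r["score"]) >= 70}
        for r in rows
    ]

def load(rows: list[dict], destination: list) -> None:
    """Load: append transformed rows into the destination (a list here)."""
    destination.extend(rows)

warehouse: list[dict] = []
load(transform(extract(SOURCE_CSV)), warehouse)
```

The point of an ETL platform is that each of these three stages becomes a configurable component instead of custom code you must write and maintain.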
Our overview of Xplenty helps you understand the platform before you make your final vendor decision, ensuring you choose one that works for you and your business. Whatever ETL solution you choose needs to integrate with all your data sources; otherwise, you risk paying for software or a platform that won't do what you need.
Xplenty is an ETL platform that requires no coding or deployment. It has a point-and-click interface that enables simple data integration, processing, and preparation. It also connects with a large variety of data sources and has all the capabilities you need to perform data analytics.
Rich Connectivity Library: Xplenty connects with a variety of data sources, including structured query language (SQL) data stores, NoSQL databases, and cloud storage services.
Out-of-the-Box Data Integrations: Without coding anything, Xplenty offers over a hundred out-of-the-box data integrations. These help you integrate data from many sources, including CRMs like Salesforce, cloud-based organizational platforms like Asana, databases, and more into one single destination.
Third-Party Integrations: Xplenty integrates with a variety of tools, especially for logging, visualization, and data analytics.
Scheduling: Run your data processes when you want with the ability to schedule.
The platform also lets you schedule jobs, track job progress and status, sample data outputs, and execute jobs through either the UI or the API. You can integrate data from more than 100 data stores and SaaS applications; some of the most popular are Amazon Aurora, Google BigQuery, Oracle, Secure File Transfer Protocol (SFTP), Salesforce, and Snowflake. Applications, databases, files, data warehouses: Xplenty works with them all and is always increasing the number of available integrations in line with digital developments.
No installation is needed for Xplenty; just go to Xplenty's website and create your account. We created a trial account for this walkthrough.
Snowflake Data Warehouse is a fully managed, cloud data warehouse available to customers as Software-as-a-Service (SaaS) or Database-as-a-Service (DaaS). "Fully managed" means users needn't be concerned with back-end work like server installation and maintenance. Its unique architecture is a hybrid of traditional shared-disk and shared-nothing database architectures, which allows complete relational database support for both structured and semi-structured data (CSV, JSON, ORC, Avro, Parquet, XML), and it now also supports unstructured data such as images and videos.
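To illustrate what "relational support for semi-structured data" buys you, here is a small sketch in plain Python. The JSON record is invented; the flattening step mimics, very roughly, how Snowflake lets you query paths inside a semi-structured column as if they were ordinary relational columns.

```python
import json

# A semi-structured record, as it might land in a Snowflake VARIANT column.
raw = '{"id": 7, "tags": ["new", "vip"], "address": {"city": "Pune"}}'

record = json.loads(raw)

# Flatten the nested structure into relational-style columns, roughly
# what Snowflake's path syntax (e.g. v:address.city) exposes in SQL.
flat = {
    "id": record["id"],
    "city": record["address"]["city"],
    "tag_count": len(record["tags"]),
}
```

The appeal is that the nested document does not have to be reshaped into fixed tables before you can query it.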
Getting Familiar with UI
On the left-hand side is a dark grey sidebar. The top two icons are for building connections and packages. The next two are monitoring icons, for monitoring your jobs and clusters. The fifth icon is the scheduler: once you build your data pipeline, you can attach it to a schedule so jobs spin up and run on their own. The last icon is Settings, where you can set up service hooks to receive notifications when a job completes, fails, or triggers any other event you want to track.
So let’s start at the top.
Step 1: Creating Connectors
So these are Xplenty’s native connectors.
All connectors are bidirectional except those in the Services category (Salesforce being the one Services connector that is bidirectional).
You can both load data into and extract data from them. For this demonstration, we will use Snowflake as our destination.
So let’s start by creating a new connection to Snowflake. Select Snowflake from the Analytical Databases menu.
Name it SnowflakeConn.
Enter your Snowflake account name.
Enter your Snowflake username and password.
Enter the Database, Warehouse, and Region.
Click on Create Connection.
Next, we also need a connection to our Students Data which is stored in an Amazon S3 bucket.
So we will find Amazon S3 and create a new connection.
Enter the name of the connection.
Select your Region.
Enter Access Key ID and Secret access key.
Select Create connection.
Now we move to the package editor, where you build your data pipeline.
Step 2: Creating a Package
1. Click on New package
2. Enter a name for your package.
3. Xplenty supports two different types of flows.
Dataflow: A Dataflow is your basic data pipeline: you extract data from one or more sources, perform any necessary transformations, and push the results to one or more destinations.
Workflow: A Workflow lets you define a sequence of tasks, where each task is either a Dataflow or a SQL statement. Once you have defined the sequence, you define the condition upon which each subsequent task executes.
4. Select Template as Blank
5. Click Create Package.
6. You will be directed to a blank canvas.
7. Next, we will add components. Components are grouped into three categories.
There are Source components, Transformation components, and Destination components.
8. Drag the File Storage component onto the canvas. Select it, then choose the Amazon S3 connection we created earlier. This File Storage is for the student info data.
9. Enter the source bucket and the source path in your S3 bucket where the data is stored. We keep all other options at their defaults.
10. Click Next. The component will connect to the bucket and load the schema along with some sample data.
11. Create another File Storage component, using the same Amazon S3 connection, for the student performance data.
12. Click the + sign on the File Storage 1 component and select the Join component from the Transformations section.
13. Drag the + sign from File Storage 2 to the Join component.
14. Click on the Join component.
The join inputs will be auto-filled, since we connected the two File Storage components on the canvas.
In the Join properties, select Inner Join and choose id as the join key.
15. Add a new Select component from Transformations and specify field expressions.
16. So now we are ready to push this into Snowflake. Select Snowflake connector from Destinations.
1. Choose the Snowflake Connector which we created earlier.
2. In the Destination properties, select the target schema and set the operation type to Insert only.
You can also write SQL statements to run before the job executes in the Pre-action SQL section, and statements to run after the job in the Post-action SQL section.
3. Click Save.
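The pipeline assembled above can be sketched in plain Python to show what the components actually do. The sample rows are invented for illustration; the inner join on id, the field selection, and the Insert-only destination mirror the canvas steps, not Xplenty's internals.

```python
# Two toy inputs standing in for the S3 student datasets.
student_info = [
    {"id": 1, "name": "Asha"},
    {"id": 2, "name": "Ravi"},
    {"id": 3, "name": "Meera"},
]
student_performance = [
    {"id": 1, "score": 81},
    {"id": 3, "score": 92},
]

# Join component (Inner Join on id): keep only ids present in both inputs.
perf_by_id = {r["id"]: r for r in student_performance}
joined = [
    {**info, **perf_by_id[info["id"]]}
    for info in student_info
    if info["id"] in perf_by_id
]

# Select component: choose and rename the output fields.
selected = [
    {"student_id": r["id"], "student": r["name"], "score": r["score"]}
    for r in joined
]

# Destination with Insert-only semantics: rows are appended, never updated.
target_table: list[dict] = []
target_table.extend(selected)
```

Note how the student with id 2 drops out: an inner join keeps only rows with a match on both sides.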
Step 3: Creating a Cluster
Now we create a cluster to run our Dataflow. Select the fourth icon on the left-hand sidebar.
Name your cluster.
Select when the cluster should auto-suspend.
Click Create Cluster.
Now we move to the last step, Scheduling our Dataflow.
Step 4: Scheduling the Package
Click the fifth icon on the left sidebar.
Click on New Schedule.
Enter the repeat interval in days and the starting time. Remember that times are in UTC. You can also use cron expressions.
Turn ON the Status slider.
In the Cluster section, select 'Any similar cluster' from the Reuse drop-down menu.
Select your package.
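If you use cron expressions for the schedule, it helps to know the five standard fields. The small sketch below just labels them; the expression "0 2 * * *" (02:00 UTC daily) is an invented example, and the `describe` helper is illustrative, not part of any scheduler's API.

```python
# The five fields of a standard cron expression, left to right.
FIELDS = ["minute", "hour", "day_of_month", "month", "day_of_week"]

def describe(expr: str) -> dict:
    """Map each cron field name to its value in the expression."""
    parts = expr.split()
    if len(parts) != len(FIELDS):
        raise ValueError("expected 5 space-separated fields")
    return dict(zip(FIELDS, parts))

# "0 2 * * *" = minute 0, hour 2, every day -> 02:00 UTC daily.
schedule = describe("0 2 * * *")
```

Since the schedule runs in UTC, remember to convert from your local time zone when choosing the hour field.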
Step 5: Monitoring your Job
You can monitor your job by clicking the Jobs icon (the third icon on the left-hand sidebar).
100% in the green highlighted box indicates that the job executed successfully.
Xplenty can be used with many source databases directly, as it offers a wide range of source connectors. We can also schedule a job for a particular time or interval, and it will run at the scheduled time. And we can apply any transformation before loading the data into Snowflake simply by dragging and dropping the components we need.
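As a closing note, the Workflow flow type described in Step 2 (a sequence of tasks where each result gates the next) can be sketched in plain Python. The task functions and the `run_workflow` helper are invented for illustration; they show the control-flow idea, not Xplenty's API.

```python
def extract_students() -> bool:
    # Stand-in for a Dataflow task; returns True on success.
    return True

def refresh_reports() -> bool:
    # Stand-in for a SQL-statement task.
    return True

def run_workflow(tasks) -> list:
    """Run tasks in order; skip the rest after the first failure."""
    executed = []
    for task in tasks:
        executed.append(task.__name__)
        if not task():
            break  # condition: the next task runs only on success
    return executed

ran = run_workflow([extract_students, refresh_reports])
```

In a real Workflow you would pick the gating condition (on success, on failure, and so on) per task rather than hard-coding "on success" as this sketch does.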