Author: Suryanarayana Raju Sri Kakarlapudi
There are many ways to load data from a source, whether that source is a site like Kaggle, a government website, or another third-party website. The Linux cron job method described here is the one best suited to our use case: fetching data from a government website that already provides a download URL, and making that data available in an AWS S3 bucket once every month. This blog covers the basics of the tools and commands needed to do that job.
Starting up Linux
There are multiple ways to get access to a Linux operating system. In this blog, we use the Windows Subsystem for Linux (WSL).
The Windows Subsystem for Linux allows developers to run a Linux environment, including most command-line tools, utilities, and applications, directly on Windows without the overhead of a traditional virtual machine.
Once the Windows Subsystem for Linux is ready, set up a Linux user and log in using those credentials.
Knowing some basic Linux commands will help in finishing the job smoothly. Some of the basic commands used in this job are as follows:
pwd - Prints the absolute path of the current directory, starting from the root.
ls - Lists the files in the current directory.
cd - Changes the current directory.
mkdir - Creates a folder or directory.
rm - Removes a file; with the -r option, removes a directory and its contents.
cp - Copies files.
mv - Moves or renames files.
sudo - Runs a command with administrative or root privileges.
df - Shows the available disk space.
apt-get - Installs required packages on the system.
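A short session in a throwaway directory shows how these commands fit together. This is only an illustrative sketch: the directory and file names are made up, and sudo and apt-get are left out because they need root privileges.

```shell
# Work inside a temporary directory so nothing important is touched.
workdir="$(mktemp -d)"
cd "$workdir"
pwd                          # print the absolute path of the current directory

mkdir downloads              # create a directory
echo "sample" > notes.txt    # create a small file to work with
cp notes.txt downloads/      # copy the file into the directory
mv notes.txt renamed.txt     # move (here: rename) the original file
ls downloads                 # list the directory contents

df -h .                      # show free disk space for the current filesystem
rm renamed.txt               # remove a file
rm -r downloads              # remove a directory and everything in it
```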
What is a cron?
Cron is a tool for scheduling tasks. A task is defined with a set of rules that tell cron to execute it at a specific time or at a recurring interval.
These rules are written in a fixed syntax that determines when the command is executed.
What are cron jobs in Linux?
Any task that you schedule via cron is called a cron job. Cron jobs are useful for automating routine hourly, daily, monthly, or yearly tasks.
How to Add cron Jobs in Linux
To use cron jobs, you must first check the status of the cron service. If you don't have cron installed, you can easily install it with your package manager.
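One way to check this is sketched below. It assumes the daemon process is named cron (as on Debian and Ubuntu) or crond (as on RHEL-like systems); the function name cron_status is made up for this example. The install and start commands are shown as comments because they need root privileges.

```shell
# Report whether a cron daemon process is currently running.
cron_status() {
    if pgrep -x cron >/dev/null 2>&1 || pgrep -x crond >/dev/null 2>&1; then
        echo "running"
    else
        echo "not running"
    fi
}

cron_status

# On Debian/Ubuntu (including WSL), cron can typically be installed and started with:
#   sudo apt-get update
#   sudo apt-get install -y cron
#   sudo service cron start
#   sudo service cron status
```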
Cron job syntax
Crontab uses the following flags to add and list cron jobs.
crontab -e: Edit crontab entries to add, remove, or edit cron jobs.
crontab -l: List all cron jobs for the current user.
crontab -u username -l: List another user's cron jobs.
crontab -u username -e: Edit another user's cron jobs.
This is what an entry looks like when we list cron jobs:
* * * * * sh /path/to/script.sh
In the above example, the five fields * * * * * represent the minute, hour, day of the month, month, and day of the week, respectively. An asterisk means every value, so this entry runs every minute.
Minute (0-59) - The minute at which the command is executed.
Hour (0-23) - The hour at which the command is executed.
Day of the month (1-31) - The day of the month on which the task is executed.
Month (1-12) - The month in which the task is executed.
Day of the week (0-6) - The day of the week on which the command runs. Here, 0 is Sunday.
/path/to/script.sh specifies the path to the script.
Here is how a cron job looks.
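As an illustration, the sketch below takes a few sample entries and splits each one into its five schedule fields and the command. The helper name describe_cron_line is made up for this example, and the schedules are sample values rather than the ones used in the original job.

```shell
# Split one crontab entry into its five schedule fields and the command.
describe_cron_line() {
    # read splits on whitespace; the last variable receives the rest of the line.
    read -r minute hour day_of_month month day_of_week command <<EOF
$1
EOF
    printf 'minute=%s hour=%s day_of_month=%s month=%s day_of_week=%s command=%s\n' \
        "$minute" "$hour" "$day_of_month" "$month" "$day_of_week" "$command"
}

describe_cron_line "* * * * * sh /path/to/script.sh"     # every minute
describe_cron_line "0 6 1 * * sh /path/to/script.sh"     # 06:00 on day 1 of every month
describe_cron_line "30 2 * * 0 sh /path/to/script.sh"    # 02:30 every Sunday
```

The second entry is the kind of schedule a once-a-month download would use.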
The AWS Command Line Interface (AWS CLI) is a command-line tool used to control and manage AWS services. AWS commands are not available by default in Linux. To use them, we can install the AWS CLI package with apt-get, the package installation command on Debian-based distributions such as Ubuntu. Once the package is installed, all the AWS CLI commands are available on your Linux system.
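A minimal sketch of that check and install follows. The function name aws_cli_status is made up for this example, and the package name awscli is an assumption that holds on many Debian and Ubuntu releases but may differ on yours. The install commands are shown as comments because they need root privileges.

```shell
# Report whether the aws executable can be found on PATH.
aws_cli_status() {
    if command -v aws >/dev/null 2>&1; then
        echo "installed"
    else
        echo "missing"
    fi
}

aws_cli_status

# If it is missing on a Debian/Ubuntu system, one way to install it is:
#   sudo apt-get update
#   sudo apt-get install -y awscli   # package name assumed; check your distribution
# After installation:
#   aws --version     # confirm the install
#   aws configure     # store the access key, secret key, and default region
```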
Here are some of the AWS CLI S3 commands.
Create a bucket - aws s3 mb
List buckets and objects - aws s3 ls
Delete buckets - aws s3 rb
Delete objects - aws s3 rm
Move objects - aws s3 mv
Copy objects - aws s3 cp
Sync objects - aws s3 sync
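The commands above could be used roughly as follows. The bucket name, file names, and prefixes are hypothetical, and the run helper is an illustration device: with DRY_RUN=1 it only prints each command, so the sketch can be tried without AWS credentials.

```shell
BUCKET="s3://example-bucket-name"    # hypothetical bucket name
DRY_RUN=1                            # 1 = print the commands, 0 = actually run them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

run aws s3 mb "$BUCKET"                                           # create the bucket
run aws s3 ls                                                     # list all buckets
run aws s3 cp data.csv "$BUCKET/raw/data.csv"                     # upload one file
run aws s3 ls "$BUCKET/raw/"                                      # list objects under a prefix
run aws s3 sync ./data "$BUCKET/raw/"                             # upload new or changed files
run aws s3 mv "$BUCKET/raw/data.csv" "$BUCKET/archive/data.csv"   # move an object
run aws s3 rm "$BUCKET/archive/data.csv"                          # delete an object
run aws s3 rb "$BUCKET"                                           # delete the bucket (must be empty)
```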
Web get command (wget)
With the wget command, we can download files from the internet in Linux just by giving the HTTP link where the files are located. By assigning this wget command to a cron job, we can schedule the download for whenever we want.
In Linux, we can create a directory, open the crontab, and schedule a cron job that uses wget to download the file at a specific time and place it in that directory. A second cron job can then use the AWS CLI to upload the downloaded file from the local directory to the AWS S3 bucket right after the download.
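A small download helper might look like the sketch below. The URL and directory are placeholders, not the real government link, and the fetch function and the WGET override variable are made up for this example (the override only exists so the command can be swapped out when trying the function without network access).

```shell
DATA_URL="https://example.gov/data/export.csv"   # hypothetical download link
DOWNLOAD_DIR="$HOME/gov_data"                    # hypothetical target directory

# Download a URL into a directory, creating the directory if needed.
#   -q  suppresses progress output (useful under cron)
#   -P  sets the directory the file is saved into
fetch() {
    url="$1"
    dest_dir="$2"
    mkdir -p "$dest_dir"
    "${WGET:-wget}" -q -P "$dest_dir" "$url"
}

# Example call (needs network access, so it is left commented out):
#   fetch "$DATA_URL" "$DOWNLOAD_DIR"
```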
The following cron job does the task mentioned above.
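As a hedged sketch of what such a setup could look like, the snippet below generates two crontab entries: one downloads the file at 06:00 on the first day of every month, and the other uploads it to S3 thirty minutes later. The URL, local path, bucket name, and schedule are all assumptions for illustration, as are the function names cron_entries and install_cron_entries.

```shell
DATA_URL="https://example.gov/data/export.csv"   # hypothetical download link
DOWNLOAD_DIR="/home/user/gov_data"               # hypothetical local directory
BUCKET_PATH="s3://example-bucket-name/raw/"      # hypothetical S3 destination

# Print the two crontab entries.
# wget -O writes to a fixed file name, so each month's download overwrites the last.
cron_entries() {
    echo "0 6 1 * * wget -q -O $DOWNLOAD_DIR/export.csv $DATA_URL"
    echo "30 6 1 * * aws s3 cp $DOWNLOAD_DIR/export.csv $BUCKET_PATH"
}

cron_entries

# Append the entries to the current user's crontab.
# Defined but not called here, because it changes the real crontab.
install_cron_entries() {
    { crontab -l 2>/dev/null; cron_entries; } | crontab -
}
```

Cron runs commands with a minimal environment, so using absolute paths for wget and aws is the safer choice if the jobs do not run as expected. An alternative design is a single entry that chains the two commands with &&, so the upload only starts once the download has succeeded.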
Using Linux alongside cron jobs can make loading data from a third-party website much easier than a manual data extraction process. The extraction can be automated by starting a Linux instance and scheduling cron jobs with the syntax described above. If your use case can run on Linux, this process can be leveraged.