Authors: Sai Charan, Mahima Khandewal, Ruchir Bhardwaj, Sudhendu Pandey
This blog continues the previous blog, titled Modern Data Stack: Data Catalog Introduction.
In this blog, we will explore some of the well-known data cataloging tools available in the market and understand them a little better. Since no list is ever complete, we might be missing some good tools that you have worked with!
Atlan
Website: https://atlan.com/
Atlan is a Modern Data Workspace catalog tool with the vision to enable data democratization within organizations while maintaining the highest standards of governance and security. It is built for data teams, reducing chaos and easing collaboration.
Some Popular Integrations:
Warehouse: Snowflake, Amazon Redshift, Google BigQuery
Data Lakes: Azure Data Lake Storage, AWS S3, AWS Glue, Databricks, Delta Lake
Database:
RDBMS: MySQL, Postgres
BI: Tableau, Power BI, Sisense, Looker
Transformations: dbt
Workflow: Airflow
Collaboration: Salesforce, Slack, Jira
Reference: https://atlan.com/platform/integrations/
Deployment:
On-Premise/Hybrid cloud support - No
Cloud support - Yes
AWS CloudFormation
Terraform on AWS Cloud
Pros:
Ease of learning the tool
Global design & information architecture standards followed in designing the tool
Ease of collaboration for data teams
Ease of Data Governance
Cons:
The tool is quite young: most of the basic features are in place, but Data Catalog automation needs improvement to reduce manual effort
So far, only the AWS cloud platform is supported for the deployment of the Atlan environment
A resource is needed to monitor & maintain the Atlan provisioned environment
Ataccama One
Website: https://www.ataccama.com/
Ataccama reinvents the way data is managed to create value on an enterprise scale. Unifying Data Governance, Data Quality, and Master Data Management into a single, AI-powered fabric across hybrid and Cloud environments, Ataccama gives your business and data teams the ability to innovate with unprecedented speed while maintaining trust, security, and governance of your data.
Some Popular Integrations:
Warehouse: Snowflake, Amazon Redshift, Google BigQuery, Azure Synapse, Amazon Athena
Data Lakes: Azure Data Lake Storage, Amazon S3, AWS Glue, Databricks, Hive, Cloudera, Amazon EMR, Google Dataproc
Database:
RDBMS: Oracle DB, SAP S/4HANA, Teradata, Microsoft SQL Server, Postgres, IBM Db2, MySQL
Streaming Data: Apache Kafka, Apollo MQ, RabbitMQ, Amazon SQS
BI: Tableau, Power BI, Dataiku, Looker, Qlik, ThoughtSpot, GoodData
Reference: https://www.ataccama.com/deployment/one-architecture-and-Integration
Deployment:
On-Premise/Hybrid cloud support - Yes
Amazon
Google
Azure
Linux
Cloud support - Yes
Amazon
Azure
Pros:
A power-packed tool for Modern Metadata Management
Offers native integration support for most of the data sources
Ease of Data Governance - quality, profiling
Offers various deployment options - Platform as a Service (PaaS), on-premise & hybrid
Strong Knowledge base & useful resources available on the tool
Cons:
The tool may look complex to beginners at first
Alation
Website: https://www.alation.com/
Alation pioneered the data catalog market and is now leading its evolution into a platform for a broad range of data intelligence solutions including data search & discovery, data governance, data stewardship, analytics, and digital transformation.
Some Popular Integrations:
Warehouse: Snowflake, Amazon Redshift, Azure Synapse, Databricks, AWS Redshift Spectrum, SAS (data files), Google BigQuery.
Data Lakes: Azure Data Lake Storage, Databricks, Cloudera, Azure Blob Storage, HDP.
Database:
RDBMS: Oracle on EC2, SAP Sybase/HANA, Teradata, SQL Server on EC2/ AWS RDS/ Azure VM, Postgres, MySQL on EC2/AWS EC2.
NoSQL: MongoDB, DynamoDB, Azure Cosmos, Cassandra.
Messaging: Kafka.
BI: Tableau, Power BI, Looker, Qlik, MicroStrategy, SSRS.
Reference: https://www.alation.com/product/connectors/
Deployment:
On-Premise/Hybrid cloud support - Yes
Cloud support - Yes
AWS
Pros:
Powerful Behavioral Analysis Engine.
Inbuilt collaboration capabilities.
Cons:
Data lineage needs enhancement in terms of the relation between the physical data point and the business element.
Dataedo
Website: https://dataedo.com/
Dataedo is an on-premises data catalog with powerful database documentation and metadata management features. It allows you to catalog, document, and understand your data with a data dictionary, business glossary, and ERDs. It reads your schema and lets you easily describe each data element with descriptions, business-friendly aliases, and custom fields. It features a data community module, which allows you to crowdsource knowledge about data from everyone in your organization.
Some Popular Integrations:
Warehouse: Snowflake, Amazon Redshift, Google BigQuery, Amazon Athena.
Data Lakes: Hive, Delta Lake.
Database:
RDBMS: Oracle, Teradata, Microsoft SQL Server, PostgreSQL, MySQL, IBM Db2, MariaDB, Amazon RDS.
NoSQL: MongoDB, Azure Cosmos, Cassandra.
BI: Power BI, Azure Synapse Analytics, Analysis Services Tabular.
File scanners: Apache Avro, Apache ORC, Apache Parquet, CSV, Excel, JSON, XML
Reference: https://dataedo.com/sources
Deployment:
On-Premise/Hybrid cloud support - Yes
Windows
Linux
Cloud support - No
Pros:
Powerful documentation tool.
Easy to implement and intuitive to use.
Cons:
It is an on-premise solution, with no cloud support as of now.
No customizable documentation format.
Datakin
Website: https://datakin.com/
Datakin is an end-to-end lineage solution for data engineers and data scientists. It observes the movement of data through pipelines, tracing relationships between datasets and making it easier to find, fix, and prevent issues.
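To make the idea of tracing relationships between datasets concrete, here is a minimal Python sketch of a lineage graph and a downstream lookup. It is not Datakin's implementation or API; the dataset names and the `downstream` helper are hypothetical, chosen only to show how lineage answers "what is affected if this dataset breaks?".

```python
from collections import deque

# Hypothetical lineage: each dataset maps to the datasets derived from it.
LINEAGE = {
    "raw.events": ["staging.events"],
    "staging.events": ["analytics.sessions", "analytics.daily_users"],
    "analytics.sessions": ["reports.engagement"],
    "analytics.daily_users": ["reports.engagement"],
}


def downstream(lineage, start):
    """Return every dataset reachable from `start`, in breadth-first order."""
    seen = set()
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:  # a dataset may have several parents
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


# Datasets that could be impacted by a problem in staging.events.
affected = downstream(LINEAGE, "staging.events")
print(affected)
```

A lineage tool collects these edges automatically by observing pipeline runs, rather than relying on a hand-written dictionary as this sketch does.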
Some Popular Integrations:
Warehouse: Snowflake, Amazon Redshift, Google BigQuery, Apache Spark
Databases:
RDBMS: PostgreSQL
Transformations: dbt
Orchestrations: Google Cloud Composer, Astronomer
References: https://datakin.com/features/
Deployment:
On-Premise/Hybrid cloud support - No
Cloud support - Yes
Pros:
Automatic lineage tool
Integration with popular data tools
Cons:
Only a data-lineage solution.
A fairly new tool in the market.
Data.world
Website: https://data.world/
data.world is the enterprise data catalog for the modern data stack. Its cloud-native SaaS platform leverages the power of the knowledge graph to make data discovery, governance, and analysis easy, turning data workers into knowledge superheroes.
Some Popular Integrations:
Warehouse: Snowflake, Amazon Redshift, Google BigQuery, Apache Hive, Apache Spark
Data Lakes: AWS S3, AWS Glue, Databricks, Dremio
Database:
RDBMS: MySQL, PostgreSQL, Oracle, Microsoft SQL Server
Cloud Storage: Dropbox, Google Drive, Box.com
Enterprise: IBM Db2
BI and Visualizations: Tableau, Power BI, Chart Builder, Excel, Google Data Studio, Heartcount, Infoveave, KNIME, Keshif, Knowi, MicroStrategy, Plotly, RStudio, SAP Analytics Cloud, SPSS, Google Analytics, Mixpanel
Collector: dw DBT, Vertica, Tableau Server, SAP SQL Anywhere, Reltio, Presto, Monte Carlo, Marquez, Manta, Google Looker, Hive metastore, DOMO, Datakin
Collaboration: Slack
App Builder: Algorithmia
Marketing, Sales, and Ads: Facebook Ads, Google Ads, HubSpot, Salesforce, Marketo
SuperConnector: IFTTT, JDBC/Java, Knots, KNIME, Stitch, Singer
Programming Languages: Python, Jupyter, Golang
Reference: https://data.world/integrations, https://docs.data.world/en/59261-57792-Integrations.html
Deployment:
On-Premise/Hybrid cloud support - No
Cloud support - Yes
Amazon AWS
Pros:
Ingest data from almost any source
Join disparate datasets
Integrate with other platforms
Create subsets of data and share them across teams
Cons:
Lacks a way to organize queries/datasets by category
Does not allow data edits for admins
Requires knowledge of SQL; a visual SQL builder would help with accessibility
Amundsen
Website: https://www.amundsen.io/
Amundsen is a data discovery and metadata engine for improving the productivity of data analysts, data scientists, and engineers when interacting with data. It does that today by indexing data resources (tables, dashboards, streams, etc.) and powering a page-rank style search based on usage patterns (e.g., highly queried tables show up earlier than less queried tables). Think of it as a Google search for data.
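The usage-based ranking described above can be illustrated with a small Python sketch. This is not Amundsen's actual search code; the `TableEntry` type, the `query_count` metric, and the sample catalog are assumptions made up for illustration. It only shows the principle that, among matching tables, the more heavily queried ones appear first.

```python
from dataclasses import dataclass


@dataclass
class TableEntry:
    name: str
    description: str
    query_count: int  # hypothetical usage metric, e.g. queries in the last 30 days


def search(entries, term):
    """Return entries matching `term`, most frequently queried first."""
    term = term.lower()
    matches = [
        e for e in entries
        if term in e.name.lower() or term in e.description.lower()
    ]
    # Rank by usage: highly queried tables show up before rarely used ones.
    return sorted(matches, key=lambda e: e.query_count, reverse=True)


# Hypothetical catalog contents.
catalog = [
    TableEntry("staging.orders_raw", "Raw orders landing table", 45),
    TableEntry("marketing.campaigns", "Campaign metadata", 800),
    TableEntry("sales.orders", "One row per customer order", 1200),
    TableEntry("finance.invoices", "Invoices issued per order", 300),
]

results = [e.name for e in search(catalog, "orders")]
print(results)
```

A real deployment would derive the usage signal from query logs and combine it with text relevance in a search index; sorting on a single counter is the simplest possible stand-in.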
Some Popular Integrations:
Warehouse: Snowflake, Amazon Redshift, Google BigQuery, Apache Hive
Database:
RDBMS: MySQL, Microsoft SQL Server, OCI, PostgreSQL, Apache Cassandra
Enterprise: IBM Db2
Data Lakes: Amazon S3, AWS Glue, Delta Lake
Analytics: Elastic, Apache Spark
Deployment:
On-Premise/Hybrid cloud support - No
Cloud support - Yes (deploy Docker first, then install the Amundsen instance)
Amazon AWS
Google Cloud Platform
Azure
Pros:
Open source
Automated and curated metadata
A dashboard for searching through data is hosted on the local machine
Cons:
The learning curve is quite steep.
Maintaining the tool takes time.