Modern Data Stack: Data Cataloging Popular Tools

Updated: Apr 20

Authors: Sai Charan, Mahima Khandewal, Ruchir Bhardwaj, Sudhendu Pandey


This blog continues from the previous post, titled Modern Data Stack: Data Catalog Introduction.


In this blog, we explore some of the well-known data cataloging tools available in the market and look at each one a little more closely. No list like this is ever complete, so we may be missing some good tools that you have worked with!


Atlan

Website: https://atlan.com/


Atlan is a modern data workspace and catalog tool whose vision is to enable data democratization within organizations while maintaining the highest standards of governance and security. It reduces chaos and eases collaboration among data teams.


Some Popular Integrations:

  • Warehouse: Snowflake, Amazon Redshift, Google BigQuery

  • Data Lakes: Azure Data Lake Storage, AWS S3, AWS Glue, Databricks, Delta Lake

  • Database:

      • RDBMS: MySQL, Postgres

  • BI: Tableau, Power BI, Sisense, Looker

  • Transformations: dbt

  • Workflow: Airflow

  • Collaboration: Salesforce, Slack, Jira

Reference: https://atlan.com/platform/integrations/


Deployment:

  • On Premise/Hybrid cloud support - No

  • Cloud support - Yes

      • AWS CloudFormation

      • Terraform on AWS Cloud

Pros:

  • Ease of learning the tool

  • Follows global design and information architecture standards

  • Ease of collaboration for data teams

  • Ease of Data Governance


Cons:

  • The tool is fairly young: most basic features are in place, but catalog automation needs improvement to reduce manual effort

  • So far, only the AWS cloud platform is supported for deploying an Atlan environment

  • A dedicated person is needed to monitor and maintain the provisioned Atlan environment


Ataccama One

Website: https://www.ataccama.com/


Ataccama reinvents the way data is managed to create value on an enterprise scale. Unifying Data Governance, Data Quality, and Master Data Management into a single, AI-powered fabric across hybrid and Cloud environments, Ataccama gives your business and data teams the ability to innovate with unprecedented speed while maintaining trust, security, and governance of your data.


Some Popular Integrations:

  • Warehouse: Snowflake, Amazon Redshift, Google BigQuery, Azure Synapse, Amazon Athena

  • Data Lakes: Azure Data Lake Storage, Amazon S3, AWS Glue, Databricks, Hive, Cloudera, Amazon EMR, Google Dataproc

  • Database:

      • RDBMS: Oracle DB, SAP S/4HANA, Teradata, Microsoft SQL Server, Postgres, IBM DB2, MySQL

  • Streaming Data: Apache Kafka, Apollo MQ, RabbitMQ, Amazon SQS

  • BI: Tableau, Power BI, Dataiku, Looker, Qlik, ThoughtSpot, GoodData

Reference: https://www.ataccama.com/deployment/one-architecture-and-Integration


Deployment:

  • On Premise/Hybrid cloud support - Yes

      • Amazon

      • Google

      • Azure

      • Linux

  • Cloud support - Yes

      • Amazon

      • Azure

Pros:

  • A powerful tool for modern metadata management

  • Offers native integration support for most data sources

  • Ease of Data Governance, including data quality and profiling

  • Offers several deployment options: Platform as a Service (PaaS), on-premises, and hybrid

  • Strong knowledge base and useful resources available for the tool


Cons:

  • The tool may look complex to beginners at first


Alation

Website: https://www.alation.com/


Alation pioneered the data catalog market and is now leading its evolution into a platform for a broad range of data intelligence solutions including data search & discovery, data governance, data stewardship, analytics, and digital transformation.


Some Popular Integrations:

  • Warehouse: Snowflake, Amazon Redshift, Azure Synapse, Databricks, AWS Redshift Spectrum, SAS (data files), Google BigQuery.

  • Data Lakes: Azure Data Lake Storage, Databricks, Cloudera, Azure Blob Storage, HDP.

  • Database:

      • RDBMS: Oracle on EC2, SAP Sybase/HANA, Teradata, SQL Server on EC2/AWS RDS/Azure VM, Postgres, MySQL on EC2.

      • NoSQL: MongoDB, DynamoDB, Azure Cosmos DB, Cassandra.

  • Messaging: Kafka.

  • BI: Tableau, Power BI, Looker, Qlik, MicroStrategy, SSRS.

Reference: https://www.alation.com/product/connectors/


Deployment:

  • On Premise/Hybrid cloud support - Yes

  • Cloud support - Yes

      • AWS

Pros:

  • Powerful Behavioral Analysis Engine.

  • Inbuilt collaboration capabilities.

Cons:

  • Data lineage needs improvement in how it relates physical data points to business elements.


Dataedo

Website: https://dataedo.com/

Dataedo is an on-premises data catalog with powerful database documentation and metadata management features. It allows you to catalog, document, and understand your data with a data dictionary, business glossary, and ERDs. It reads your schema and lets you easily describe each data element with descriptions, business-friendly aliases, and custom fields. It features a data community module, which allows you to crowdsource knowledge about data from everyone in your organization.
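
To make the idea of a data dictionary concrete, here is a minimal sketch in Python. It is not Dataedo's implementation or API: the `ColumnEntry` class, the `read_schema` helper, and the sample `cust` table are all hypothetical, and SQLite stands in for whatever database is being documented. It reads the schema automatically, then lets a person attach a description, a business-friendly alias, and a custom field to one column.

```python
import sqlite3
from dataclasses import dataclass, field


@dataclass
class ColumnEntry:
    """One data-dictionary row: schema facts plus human-written documentation."""
    table: str
    name: str
    data_type: str
    description: str = ""
    alias: str = ""
    custom_fields: dict = field(default_factory=dict)


def read_schema(conn):
    """Collect every column of every table in a SQLite database (hypothetical helper)."""
    entries = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
        for _cid, name, data_type, *_rest in conn.execute(
            f"PRAGMA table_info('{table}')"
        ):
            entries.append(ColumnEntry(table, name, data_type))
    return entries


# Sample database with deliberately cryptic names (assumed for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cust (cust_id INTEGER PRIMARY KEY, seg_cd TEXT)")

dictionary = read_schema(conn)

# The schema is read automatically; the documentation is added by people.
seg = next(entry for entry in dictionary if entry.name == "seg_cd")
seg.alias = "Customer segment"
seg.description = "Marketing segment code assigned at signup"
seg.custom_fields["owner"] = "marketing-analytics"

for entry in dictionary:
    print(entry.table, entry.name, entry.data_type, entry.alias or "(no alias)")
```

The split between machine-read fields (`table`, `name`, `data_type`) and human-written fields (`description`, `alias`, `custom_fields`) is the point of the sketch: the catalog can refresh the former without losing the latter.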


Some Popular Integrations:

  • Warehouse: Snowflake, Amazon Redshift, Google BigQuery, Amazon Athena.

  • Data Lakes: Hive, Delta Lake.

  • Database:

      • RDBMS: Oracle, Teradata, Microsoft SQL Server, PostgreSQL, MySQL, IBM DB2, MariaDB, Amazon RDS.

      • NoSQL: MongoDB, Azure Cosmos DB, Cassandra.

  • BI: Power BI, Azure Synapse Analytics, Analysis Services Tabular.

  • File scanners: Apache Avro, Apache ORC, Apache Parquet, CSV, Excel, JSON, XML

Reference: https://dataedo.com/sources


Deployment:

  • On Premise/Hybrid cloud support - Yes

      • Windows

      • Linux

  • Cloud support - No

Pros:

  • Powerful documentation tool.

  • Easy to implement and intuitive to use.

Cons:

  • On-premises only; no cloud support as of now.

  • No customizable documentation format.


Datakin

Website: https://datakin.com/


Datakin is an end-to-end lineage solution for data engineers and data scientists. It observes the movement of data through pipelines, tracing relationships between datasets and making it easier to find, fix, and prevent issues.
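
The idea of tracing relationships between datasets can be sketched in a few lines of Python. This is only an illustration of lineage as a graph, not Datakin's implementation: the `JOB_RUNS` records, dataset names, and helper functions below are all hypothetical. Each job run lists the datasets it read and wrote, and walking those records backwards finds everything a given dataset depends on.

```python
from collections import defaultdict, deque

# Hypothetical job-run records: each job reads some datasets and writes others.
JOB_RUNS = [
    {"job": "load_orders", "inputs": ["raw.orders"], "outputs": ["staging.orders"]},
    {"job": "load_customers", "inputs": ["raw.customers"], "outputs": ["staging.customers"]},
    {
        "job": "build_revenue",
        "inputs": ["staging.orders", "staging.customers"],
        "outputs": ["mart.revenue"],
    },
]


def build_upstream_index(job_runs):
    """Map each output dataset to the set of datasets it was directly built from."""
    upstream = defaultdict(set)
    for run in job_runs:
        for output in run["outputs"]:
            upstream[output].update(run["inputs"])
    return upstream


def trace_upstream(dataset, upstream):
    """Breadth-first walk that returns every direct and indirect source of a dataset."""
    seen = set()
    queue = deque([dataset])
    while queue:
        current = queue.popleft()
        for parent in upstream.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


UPSTREAM = build_upstream_index(JOB_RUNS)
print(sorted(trace_upstream("mart.revenue", UPSTREAM)))
# ['raw.customers', 'raw.orders', 'staging.customers', 'staging.orders']
```

When `mart.revenue` looks wrong, a walk like this narrows the search to the four datasets that feed it, which is the "find and fix" use case described above.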


Some Popular Integrations:

  • Warehouse: Snowflake, Amazon Redshift, Google BigQuery, Apache Spark

  • Databases:

      • RDBMS: PostgreSQL

  • Transformations: dbt

  • Orchestrations: Google Cloud Composer, Astronomer

References: https://datakin.com/features/


Deployment:

  • On Premise/Hybrid cloud support - No

  • Cloud support - Yes

Pros:

  • Automatic lineage tool

  • Integration with popular data tools

Cons:

  • Only a data-lineage solution.

  • Fairly new tool in the market.


Data.world

Website: https://data.world/


data.world is the enterprise data catalog for the modern data stack. Its cloud-native SaaS platform leverages the power of the knowledge graph to make data discovery, governance, and analysis easy, turning data workers into knowledge superheroes.


Some Popular Integrations:

  • Warehouse: Snowflake, Amazon Redshift, Google BigQuery, Apache Hive, Apache Spark

  • Data Lakes: AWS S3, AWS Glue, Databricks, Dremio

  • Database:

      • RDBMS: MySQL, PostgreSQL, Oracle, Microsoft SQL Server

      • Cloud Storage: Dropbox, Google Drive, Box.com

      • Enterprise: IBM Db2

  • BI and Visualizations: Tableau, Power BI, Chart Builder, Excel, Google Data Studio, Heartcount, Infoveave, KNIME, Keshif, Knowi, MicroStrategy, Plotly, RStudio, SAP Analytics Cloud, SPSS, Google Analytics, Mixpanel

  • Collector: dw DBT, Vertica, Tableau Server, SAP SQL Anywhere, Reltio, Presto, Monte Carlo, Marquez, Manta, Google Looker, Hive Metastore, Domo, Datakin

  • Collaboration: Slack

  • App Builder: Algorithmia

  • Marketing, Sales and Ads: Facebook Ads, Google Ads, HubSpot, Salesforce, Marketo

  • SuperConnector: IFTTT, JDBC/Java, Knots, KNIME, Stitch, Singer

  • Programming Languages: Python, Jupyter, Golang


Reference: https://data.world/integrations, https://docs.data.world/en/59261-57792-Integrations.html


Deployment:

  • On Premise/Hybrid cloud support - No

  • Cloud support - Yes

      • Amazon AWS

Pros:

  • Ingest data from almost any source

  • Join disparate datasets

  • Integrate with other platforms

  • Create subsets of data and share across teams

Cons:

  • Lacks a way to organize queries/datasets by category

  • Lacks data editing for admins

  • Requires knowledge of SQL; a visual SQL builder would help with accessibility


Amundsen

Website: https://www.amundsen.io/


Amundsen is a data discovery and metadata engine for improving the productivity of data analysts, data scientists, and engineers when interacting with data. It does that today by indexing data resources (tables, dashboards, streams, etc.) and powering a page-rank style search based on usage patterns (e.g., highly queried tables show up earlier than less queried tables). Think of it as Google search for data.
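
The usage-based ordering described above can be illustrated with a small Python sketch. It is a deliberately simplified stand-in, not Amundsen's actual ranking or API: it sorts matches by a raw query count rather than computing a page-rank style score, and the `TableRecord` class, the `search` function, and the sample catalog are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TableRecord:
    name: str
    description: str
    query_count: int  # hypothetical usage signal, e.g. queries in the last 30 days


# Sample catalog entries (assumed for illustration).
CATALOG = [
    TableRecord("sales.orders_archive", "Orders older than five years", 12),
    TableRecord("hr.employees", "Employee master data", 300),
    TableRecord("sales.orders", "One row per customer order", 950),
]


def search(term, catalog):
    """Return tables whose name or description contains the term, most queried first."""
    needle = term.lower()
    hits = [
        table
        for table in catalog
        if needle in table.name.lower() or needle in table.description.lower()
    ]
    return sorted(hits, key=lambda table: table.query_count, reverse=True)


for table in search("orders", CATALOG):
    print(table.name, table.query_count)
# sales.orders 950
# sales.orders_archive 12
```

Both tables match the search term equally well as text; usage is what pushes the actively queried `sales.orders` above the rarely touched archive.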


Some Popular Integrations:

  • Warehouse: Snowflake, Amazon Redshift, Google BigQuery, Apache Hive

  • Database:

      • RDBMS: MySQL, Microsoft SQL Server, OCI, PostgreSQL, Apache Cassandra

      • Enterprise: IBM Db2

  • Data Lakes: Amazon S3, AWS Glue, Delta Lake

  • Analytics: Elastic, Apache Spark

Deployment:

  • On Premise/Hybrid cloud support - No

  • Cloud support - Yes (deploy Docker first, then install the Amundsen instance)

      • Amazon AWS

      • Google Cloud Platform

      • Azure

Pros:

  • Open source

  • Automated and curated metadata

  • The dashboard for searching through data is hosted on a local machine

Cons:

  • The learning curve is quite steep.

  • Maintaining the tool takes time.
