@gkanishk
Last active October 22, 2022 18:04
Data Catalog

What is Data?

Data can be facts, statistics, or terms collected together for reference or analysis. A huge amount of data is generated every day, and it is hard to manage because much of it is unstructured. If we search for something broad, say "mobile", we get thousands of results related to mobiles; if we specify what kind of mobile we are looking for, we get a much smaller set of results that match what we want.

What is Metadata?

Metadata is all the information about the data itself, such as its size, its owner, its readers (users), its access permissions, and other related details.
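
As a rough illustration, the metadata fields listed above could be held in a small record like the one below. This is only a sketch: the class name `DatasetMetadata`, the field names, the `can_read` helper, and the sample values are all assumptions made for the example, not part of any particular catalog tool.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DatasetMetadata:
    """Hypothetical metadata record for one dataset."""
    name: str
    size_bytes: int                      # size of the data
    owner: str                           # owner of the data
    readers: List[str] = field(default_factory=list)            # users who read it
    permissions: Dict[str, str] = field(default_factory=dict)   # user -> "read" / "write"


def can_read(meta: DatasetMetadata, user: str) -> bool:
    """The owner can always read; others need a read or write permission."""
    return user == meta.owner or meta.permissions.get(user) in ("read", "write")


# Made-up sample values, purely for illustration.
orders = DatasetMetadata(
    name="orders",
    size_bytes=52_428_800,
    owner="alice",
    readers=["bob", "carol"],
    permissions={"bob": "read", "carol": "read"},
)

print(can_read(orders, "bob"), can_read(orders, "dave"))
```

Note that the record describes the dataset without containing any of its rows, which is the difference between metadata and data.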

What is Data Catalog?

A data catalog is a data management tool that helps users easily browse to the data they want. It combines data management with a search engine, which matters because a huge amount of data is generated every day. As an example, suppose a user wants to buy a laptop and searches for "laptop" on an e-commerce site. They get thousands of results, most of which are of no use to them. To get to the exact models (data) they want, they have to apply filters (facets) such as:

  • Screen size
  • Processor
  • Disk space, RAM, and other configurations

With these filters applied, the search results shrink from thousands to hundreds.

This is what a data catalog does.
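
The laptop example can be sketched as a small faceted filter. The laptop records, the facet names (`screen_in`, `processor`, `ram_gb`), and the helper functions below are invented for illustration; a real catalog would typically delegate this to a search engine rather than a list comprehension.

```python
from collections import Counter

# Made-up sample records; each key is a facet the user can filter on.
laptops = [
    {"model": "A1", "screen_in": 13, "processor": "i5", "ram_gb": 8},
    {"model": "A2", "screen_in": 15, "processor": "i7", "ram_gb": 16},
    {"model": "B1", "screen_in": 15, "processor": "i5", "ram_gb": 16},
    {"model": "B2", "screen_in": 15, "processor": "i7", "ram_gb": 8},
]


def facet_filter(items, **facets):
    """Keep only the items whose value matches every requested facet."""
    return [
        item for item in items
        if all(item.get(name) == value for name, value in facets.items())
    ]


def facet_counts(items, facet):
    """Count how many items fall under each value of one facet."""
    return Counter(item[facet] for item in items)


matches = facet_filter(laptops, screen_in=15, processor="i7")
print([m["model"] for m in matches])
print(facet_counts(laptops, "processor"))
```

Each added facet can only shrink the result set, which is why a few filters are enough to go from thousands of results to a handful.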

Each data catalog consists of its own elements, such as:

  • A dashboard, which shows all the data.
  • A search engine with facet filters, used to browse the data.
  • Profile page - Each dataset has its own profile page, which shows all the information related to it: schemas, description, statistics, and analysis. This helps users understand the schema and write queries against the data.
    • Another use of the profile page is to show recommendations, that is, data similar to the data we are browsing. For example, a user can be treated as the primary key, with data tables such as address or payments attached to it; if we browse one of them, the profile page recommends the other.
    • This is similar to an e-commerce site: if you purchase a mobile, you start getting suggestions for its accessories.
    • The overall flow: user searches for data -> 1000 results (unstructured data) -> applies some filter -> 10 related results (sorted data).
  • Glossary - The inventory of all the terms and descriptions related to the data, such as the tags used to describe or categorise it.
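
One simple way a profile page could come up with recommendations like the address/payments example is to look for tables that share a key column. The sketch below assumes that key columns end in `_id`; the table names, column names, and the `related_tables` helper are hypothetical, and real catalogs may use richer signals such as lineage or query logs.

```python
# Made-up schemas: table name -> set of column names.
schemas = {
    "users": {"user_id", "name", "email"},
    "address": {"user_id", "city", "zip"},
    "payments": {"user_id", "amount", "paid_at"},
    "products": {"product_id", "title"},
}


def related_tables(table, schemas, key_suffix="_id"):
    """Recommend other tables that share at least one key column with `table`."""
    keys = {col for col in schemas[table] if col.endswith(key_suffix)}
    return sorted(
        other for other, cols in schemas.items()
        if other != table and keys & cols
    )


print(related_tables("address", schemas))
print(related_tables("products", schemas))
```

Browsing `address` surfaces `payments` and `users` because all three carry `user_id`, while `products` shares no key with the others and gets no recommendations.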

Why do we need a Data Catalog?

As we have seen, data is largely unstructured and keeps changing from day to day. Millions of data items are created and destroyed every day, and crawling (tracking) each one by hand could take countless days. So we need data catalogs, which crawl all the data and give us the data we are looking for. When it comes to browsing data, we face several challenges:

Challenges?

  • Scale (new data all the time):
    • Millions of data items are created every day, mostly coming from the users' end. A logistics company, for example, handles a huge number of data transactions for any number of users. That is one challenge we face when it comes to data.
  • Variety (types of data):
    • Data comes in many varieties. It can have different file types (PDF, text, audio, video, and many more), there can be different datasets for the same data, and the type depends on the specification: if we are using a database, is it SQL or NoSQL, and is it MySQL, PostgreSQL, or some other DB?
    • For every piece of data we need to know its type, its specifications, and its statistics.
  • Churn (changes/deletions/updates):
    • Data needs to be updated or deleted daily as requirements change. Billions of data items are changed or deleted, and those changes need to be crawled efficiently to save time.
    • Stale or deleted data should be crawled less frequently, again to save time.

These are a few of the reasons why we need data catalogs for managing data.
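
The churn point can be sketched as an adaptive crawl schedule: data that changed since the last crawl is revisited sooner, stale data is revisited less often, and deleted data is skipped. The halving/doubling rule, the 1-hour and 168-hour bounds, and the dataset records below are assumptions chosen for the example, not a description of how any specific catalog schedules its crawler.

```python
def next_crawl_interval(current_hours, changed, min_hours=1.0, max_hours=168.0):
    """Halve the interval for data that changed, double it for stale data."""
    if changed:
        return max(min_hours, current_hours / 2)
    return min(max_hours, current_hours * 2)


def plan_crawls(datasets):
    """Return dataset name -> next crawl interval in hours, skipping deleted data."""
    plan = {}
    for ds in datasets:
        if ds["deleted"]:
            continue  # deleted data is not worth crawling again
        plan[ds["name"]] = next_crawl_interval(ds["interval_hours"], ds["changed"])
    return plan


# Made-up crawl state for three datasets.
datasets = [
    {"name": "orders", "interval_hours": 24, "changed": True, "deleted": False},
    {"name": "archive_2019", "interval_hours": 24, "changed": False, "deleted": False},
    {"name": "tmp_export", "interval_hours": 24, "changed": False, "deleted": True},
]

print(plan_crawls(datasets))
```

The bounds keep a busy dataset from being crawled continuously and make sure a quiet one is still checked at least once a week.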

Data Governance

Data governance is the practice of managing and monitoring the availability of, and access to, data in order to keep it secure. It can include:

  • Providing access to data
  • Describing any data
  • Making data secure
  • Ensuring data quality
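
As one small example of the "ensuring data quality" item, a governance check might measure how many rows have all their required fields filled in. The `completeness` function, the field names, and the sample rows are hypothetical, and completeness is only one of many possible quality measures.

```python
def completeness(rows, required):
    """Fraction of rows in which every required field is present and non-empty."""
    if not rows:
        return 0.0
    complete = sum(
        1 for row in rows
        if all(row.get(name) not in (None, "") for name in required)
    )
    return complete / len(rows)


# Made-up sample rows: two of the four are missing an email.
rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": ""},
    {"id": 3, "email": None},
    {"id": 4, "email": "d@example.com"},
]

print(completeness(rows, ["id", "email"]))
```

A catalog could show a score like this on the dataset's profile page so that users can judge quality before querying the data.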