Background:
As described by techtarget.com: “A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed for analytics applications. While a traditional data warehouse stores data in hierarchical dimensions and tables, a data lake uses a flat architecture to store data, primarily in files or object storage. That gives users more flexibility on data management, storage and usage…Data lakes commonly store sets of big data that can include a combination of structured, unstructured and semistructured data. Such environments aren't a good fit for the relational databases that most data warehouses are built on. Relational systems require a rigid schema for data, which typically limits them to storing structured transaction data. Data lakes support various schemas and don't require any to be defined upfront. That enables them to handle different types of data in separate formats.”
(<https://www.techtarget.com/searchdatamanagement/definition/data-lake>)
The proposed solution is to partner with IOG and other Catalyst data holders and set up ingestion pipelines for all relevant publicly available data, both structured and unstructured. The technical components of this system will live in a highly available cloud-based solution (AWS) utilising industry-standard data lake infrastructure and tooling. Users of this data can run Amazon EMR (for Apache Spark), Redshift, Athena, AWS Glue, and Amazon QuickSight against diverse datasets to extract the data they need, generate insights, and surface useful patterns. This will also open the door to machine learning applications on top of the Open Catalyst Data Lake.
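To make the ingestion side concrete, here is a minimal sketch in Python. It is illustrative only: the prefix layout (`raw/source=…/date=…`), the source names, and the local temporary directory standing in for an S3 bucket are all assumptions, not part of the actual design. The point it demonstrates is the flat, schema-on-read layout a data lake uses: raw objects land in their native format under key-style prefixes, with no schema defined upfront.

```python
import json
import tempfile
from pathlib import Path

def ingest(lake_root: Path, source: str, date: str, name: str, payload: bytes) -> Path:
    """Land a raw object under a flat, key-style prefix (mimics an S3 key layout).

    Hypothetical layout: raw/source=<source>/date=<date>/<name>
    """
    key = lake_root / "raw" / f"source={source}" / f"date={date}" / name
    key.parent.mkdir(parents=True, exist_ok=True)
    key.write_bytes(payload)
    return key

# A local temp directory stands in for the S3 bucket in this sketch.
lake = Path(tempfile.mkdtemp())

# Structured data: a CSV export (hypothetical proposal funding figures).
csv_payload = "proposal_id,requested_ada\n42,15000\n43,9000\n".encode()
ingest(lake, "catalyst", "2023-01-15", "funding.csv", csv_payload)

# Semi-structured data: JSON records kept in their native format.
json_payload = json.dumps({"proposal_id": 42, "comments": ["great idea"]}).encode()
ingest(lake, "ideascale", "2023-01-15", "comments.json", json_payload)

# Schema-on-read: nothing was defined upfront; consumers parse at query time.
for obj in sorted(lake.rglob("*")):
    if obj.is_file():
        print(obj.relative_to(lake))
```

In a real deployment the same layout lets Athena or Spark on EMR treat `source=` and `date=` prefixes as partitions and query each format with its own reader at query time.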
<https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/>
An open data lake will be a critical resource for many applications, businesses, and projects that need to be able to understand and act on the stories the data will tell us about the community, governance, budgeting, proposals, development work, and other key areas of Catalyst. The unlocked potential of business intelligence and machine learning on top of this data will create endless new products and services to further grow and expand the Cardano and Catalyst ecosystem.
Outcomes and activities of this project will be announced via the Bridge Builders website, social media channels, and Town Hall updates. Furthermore, we plan to launch a marketing campaign to raise awareness of the Data Lake Initiative.
The challenges of Data Lakes
The main challenge with a data lake architecture is that raw data is stored with no oversight of the contents. For a data lake to make data usable, it needs defined mechanisms to catalog and secure data. Without these elements, data cannot be found or trusted, resulting in a “data swamp.” Meeting the needs of wider audiences requires data lakes to have governance, semantic consistency, and access controls.
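The cataloging step can be sketched as follows. This is a simplified illustration, not the proposed implementation: in practice a service such as an AWS Glue crawler would do this work, and the entry fields (`location`, `format`, `columns`) are hypothetical names chosen for the example. What it shows is the essential idea: scanning the lake and recording, per object, where it lives, what format it is in, and a best-effort schema, so data remains findable rather than sinking into a swamp.

```python
import json
import tempfile
from pathlib import Path

def infer_schema(path: Path) -> list:
    """Best-effort column inference for CSV/JSON objects (sketch only)."""
    if path.suffix == ".csv":
        header = path.read_text().splitlines()[0]
        return header.split(",")
    if path.suffix == ".json":
        return sorted(json.loads(path.read_text()).keys())
    return []  # unknown formats are catalogued without column metadata

def build_catalog(lake_root: Path) -> list:
    """Scan the lake and emit one catalog entry per stored object."""
    entries = []
    for obj in sorted(lake_root.rglob("*")):
        if obj.is_file():
            entries.append({
                "location": str(obj.relative_to(lake_root)),
                "format": obj.suffix.lstrip("."),
                "size_bytes": obj.stat().st_size,
                "columns": infer_schema(obj),
            })
    return entries

# Demo against a tiny on-disk lake with one hypothetical CSV object.
root = Path(tempfile.mkdtemp())
(root / "raw" / "source=catalyst").mkdir(parents=True)
(root / "raw" / "source=catalyst" / "funding.csv").write_text(
    "proposal_id,requested_ada\n42,15000\n"
)
catalog = build_catalog(root)
```

A catalog like this is what turns a pile of raw objects into something queryable: downstream tools consult the entries to locate data and choose the right parser, instead of guessing at each object's contents.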
This proposal intends not only to build the technical infrastructure of the data lake but also to create governance structures for it, ensure semantic consistency through a well-defined ontology, and manage access controls that provide fair and open access to the data while maintaining data security and monitored resource usage.