Background:
As described by techtarget.com: “A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed for analytics applications. While a traditional data warehouse stores data in hierarchical dimensions and tables, a data lake uses a flat architecture to store data, primarily in files or object storage. That gives users more flexibility on data management, storage and usage…Data lakes commonly store sets of big data that can include a combination of structured, unstructured and semistructured data. Such environments aren't a good fit for the relational databases that most data warehouses are built on. Relational systems require a rigid schema for data, which typically limits them to storing structured transaction data. Data lakes support various schemas and don't require any to be defined upfront. That enables them to handle different types of data in separate formats.”
(<https://www.techtarget.com/searchdatamanagement/definition/data-lake>)
The proposed solution is to partner with IOG and other Catalyst data holders and set up ingestion pipelines for all relevant publicly available data, both structured and unstructured. The technical components of this system will live in a highly available cloud-based solution (AWS) utilising industry-standard data lake infrastructure and tooling. Users of this data can run Amazon EMR (for Apache Spark), Redshift, Athena, AWS Glue, and Amazon QuickSight against diverse datasets to extract the data they need, generate insights, and surface useful patterns. This will also open the door to machine learning applications on top of the Open Catalyst Data Lake.
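To make the ingestion side concrete, here is a minimal sketch in Python. It is illustrative only: the prefix layout (`raw/source=…/date=…`), the source names, and the local temporary directory standing in for an S3 bucket are all assumptions, not part of the actual design. The point it demonstrates is the flat, schema-on-read layout a data lake uses: raw objects land in their native format under key-style prefixes, with no schema defined upfront.

```python
import json
import tempfile
from pathlib import Path

def ingest(lake_root: Path, source: str, date: str, name: str, payload: bytes) -> Path:
    """Land a raw object under a flat, key-style prefix (mimics an S3 key layout).

    Hypothetical layout: raw/source=<source>/date=<date>/<name>
    """
    key = lake_root / "raw" / f"source={source}" / f"date={date}" / name
    key.parent.mkdir(parents=True, exist_ok=True)
    key.write_bytes(payload)
    return key

# A local temp directory stands in for the S3 bucket in this sketch.
lake = Path(tempfile.mkdtemp())

# Structured data: a CSV export (hypothetical proposal funding figures).
csv_payload = "proposal_id,requested_ada\n42,15000\n43,9000\n".encode()
ingest(lake, "catalyst", "2023-01-15", "funding.csv", csv_payload)

# Semi-structured data: JSON records kept in their native format.
json_payload = json.dumps({"proposal_id": 42, "comments": ["great idea"]}).encode()
ingest(lake, "ideascale", "2023-01-15", "comments.json", json_payload)

# Schema-on-read: nothing was defined upfront; consumers parse at query time.
for obj in sorted(lake.rglob("*")):
    if obj.is_file():
        print(obj.relative_to(lake))
```

In a real deployment the same layout lets Athena or Spark on EMR treat `source=` and `date=` prefixes as partitions and query each format with its own reader at query time.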
<https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/>
An open data lake will be a critical resource for many applications, businesses, and projects that need to be able to understand and act on the stories the data will tell us about the community, governance, budgeting, proposals, development work, and other key areas of Catalyst. The unlocked potential of business intelligence and machine learning on top of this data will create endless new products and services to further grow and expand the Cardano and Catalyst ecosystem.
Outcomes and activities of this project will be announced via the Bridge Builders website, social media channels, and Town Hall updates. Furthermore, we plan to launch a marketing campaign to raise awareness of the Data Lake Initiative.
The challenges of Data Lakes
The main challenge with a data lake architecture is that raw data is stored with no oversight of the contents. For a data lake to make data usable, it needs defined mechanisms to catalog and secure data. Without these elements, data cannot be found or trusted, resulting in a “data swamp.” Meeting the needs of wider audiences requires data lakes to have governance, semantic consistency, and access controls.
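The cataloging step can be sketched as follows. This is a simplified illustration, not the proposed implementation: in practice a service such as an AWS Glue crawler would do this work, and the entry fields (`location`, `format`, `columns`) are hypothetical names chosen for the example. What it shows is the essential idea: scanning the lake and recording, per object, where it lives, what format it is in, and a best-effort schema, so data remains findable rather than sinking into a swamp.

```python
import json
import tempfile
from pathlib import Path

def infer_schema(path: Path) -> list:
    """Best-effort column inference for CSV/JSON objects (sketch only)."""
    if path.suffix == ".csv":
        header = path.read_text().splitlines()[0]
        return header.split(",")
    if path.suffix == ".json":
        return sorted(json.loads(path.read_text()).keys())
    return []  # unknown formats are catalogued without column metadata

def build_catalog(lake_root: Path) -> list:
    """Scan the lake and emit one catalog entry per stored object."""
    entries = []
    for obj in sorted(lake_root.rglob("*")):
        if obj.is_file():
            entries.append({
                "location": str(obj.relative_to(lake_root)),
                "format": obj.suffix.lstrip("."),
                "size_bytes": obj.stat().st_size,
                "columns": infer_schema(obj),
            })
    return entries

# Demo against a tiny on-disk lake with one hypothetical CSV object.
root = Path(tempfile.mkdtemp())
(root / "raw" / "source=catalyst").mkdir(parents=True)
(root / "raw" / "source=catalyst" / "funding.csv").write_text(
    "proposal_id,requested_ada\n42,15000\n"
)
catalog = build_catalog(root)
```

A catalog like this is what turns a pile of raw objects into something queryable: downstream tools consult the entries to locate data and choose the right parser, instead of guessing at each object's contents.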
This proposal intends not only to build the technical infrastructure of the data lake but also to create governance structures for it, ensure semantic consistency through a well-defined ontology, and manage access controls that provide fair and open access to the data while maintaining data security and monitored resource usage.