Cardano on BigQuery: open, fast and easy querying of on-chain data
Current Project Status: Unfunded
Amount Received: ₳0
Amount Requested: ₳157,500
Percentage Received: 0.00%
Solution

We have already built a PoC where we export the Cardano on-chain data to BigQuery, using db-sync (see link 2). We will turn this PoC into a reliable, production-ready solution and take it to the next level.

Problem

Free and open access to Cardano’s blockchain data: anybody with basic SQL knowledge can get information about specific pools, addresses and transactions quickly and free of charge.


Team

2 members


Please describe your proposed solution

Currently there is no easy way to access and query the information on the Cardano blockchain.

The existing solutions are:

  • using an online explorer, which offers an easy but very limited way to access on-chain information (e.g. only rewards for an address, transactions, etc.)
  • or spending the time to run a local Cardano node plus a tool that makes the data queryable (e.g. db-sync: https://github.com/IntersectMBO/cardano-db-sync). This is indeed a flexible and powerful way of accessing the data, although it is high maintenance and has a high entry barrier (one has to run these systems on a very powerful machine).

We want to solve this problem by offering a way that is easy (you only need to set up a Google project), free (the first terabyte of queried data per month is free: https://cloud.google.com/bigquery/pricing) and versatile (using SQL you can run custom queries, extracting insights from the data).
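To illustrate the pricing model behind the "free" claim, here is a minimal sketch. The 1 TiB free tier comes from the pricing page linked above; the $6.25/TiB on-demand price is an assumption and may change, so check the pricing page for current figures.

```python
# Sketch of BigQuery's on-demand pricing model: the first tebibyte of
# data scanned each month is free; beyond that you pay per TiB scanned.
# The $6.25/TiB price is an assumption -- see the pricing page above.
FREE_TIB = 1.0
PRICE_PER_TIB_USD = 6.25

def estimate_monthly_cost(bytes_scanned: int) -> float:
    tib = bytes_scanned / 2**40
    return max(0.0, tib - FREE_TIB) * PRICE_PER_TIB_USD

# A casual user scanning 200 GiB in a month stays within the free tier:
# estimate_monthly_cost(200 * 2**30) == 0.0
```

With well-partitioned tables and selective queries, a typical user's monthly scan volume stays well under the free tier.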

The current proof of concept already exports data from a db-sync pipeline (Cardano node + db-sync) to a dataset in Google BigQuery, exposing the on-chain data. However, it is limited in capability: updates are slow, there are discrepancies in the data, validation occurs only after an epoch ends, and not all on-chain data is included.

We would like to bring this project to the next level: faster updates, guaranteed data accuracy, complete on-chain data and insightful analytics on top of it.

Please define the positive impact your project will have on the wider Cardano community

Earlier stages of this project originated in 2022 within IOG’s Data Analytics team, and we know of a number of projects in the Cardano community that already build on Cardano_on_BigQuery.

To secure the long-term stability of the project, we ask for funding to transition to a community-owned project that is completely open source.

The more users are involved, the greater the potential for future development. The more shoulders carry the burden of maintaining Cardano_on_BigQuery, the lower the costs.

We believe that this project enables small and mid-sized projects on the Cardano blockchain to implement their data layers: scalability at minimal, usage-related costs.

What is your capability to deliver your project with high levels of trust and accountability? How do you intend to validate if your approach is feasible?

Both team members have worked at IOHK (later known as IOG, the company that built Cardano) for a combined 8 years on the Cardano node and on commercial blockchain solutions. A proof of concept has already been built.

What are the key milestones you need to achieve in order to complete your project successfully?

Milestone 1: Redesign/optimise data export (2 months)

The current version of the data export from db-sync to BigQuery runs in a slow, sequential manner. The output of this milestone involves parallelising the data export, redesigning certain parts of the process and improving the schema of the data.

The acceptance criterion for this milestone would be to double the speed at which the data is updated (from a data refresh every 30 minutes in the PoC version to every 15 minutes).
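As an illustration of the parallelisation idea, the sketch below exports several tables concurrently instead of one after another. `export_table` and the table list are hypothetical stand-ins for the real db-sync-to-BigQuery steps, not the actual pipeline code.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for one table's export step; in the real pipeline
# this would read rows from the db-sync Postgres instance and load them
# into the corresponding BigQuery table.
def export_table(name: str) -> str:
    return f"{name}: exported"

# Illustrative table names only, not the actual schema.
TABLES = ["block", "tx", "tx_out", "stake_registration", "reward"]

def export_all(tables, max_workers=4):
    # Run the per-table exports concurrently rather than sequentially.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(export_table, tables))
```

Since the per-table exports are I/O-bound (database reads, network loads), running them concurrently is what makes the halved refresh interval plausible.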

Milestone 2: Set up monitoring (1 month)

The whole system has many moving parts and various things that can fail (the db-sync process, the Cardano node process, CPU/memory utilization, the Postgres database).

This milestone involves setting up monitoring and alerting for all the various subsystems.

The acceptance criteria for this milestone would be:

  • expose a series of graphs, tracking the various subsystems' health
  • perform a "fire-drill" alert and demonstrate the alerting mechanism
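A minimal sketch of the alerting idea, with hypothetical subsystem checks: each check returns True when healthy, and the fire-drill is simulated by forcing one check to fail.

```python
# Each monitored subsystem exposes a health check returning True when
# healthy; any failing check would trigger an alert.
def failing_subsystems(checks):
    return [name for name, check in checks.items() if not check()]

# Hypothetical checks; real ones would probe processes, CPU/memory and
# the Postgres database.
checks = {
    "cardano-node": lambda: True,
    "db-sync": lambda: True,
    "postgres": lambda: False,  # simulated "fire-drill" failure
}

# failing_subsystems(checks) reports ["postgres"], which fires the alert.
```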

Milestone 3: Continuous data validation (2 months)

In the PoC version the data is validated at the epoch boundary. This means that we can only be sure about the validity of the data after the epoch changes. This milestone would involve changing and redesigning the process so that we can guarantee that the data is constantly valid and accurate.

The way this would be done is by changing the way the updates occur, to be atomic and deterministic.

Acceptance criteria for this milestone would be the code in the GitHub repo, together with a demonstration of an update that intentionally fails and leaves no side effects.
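The atomic-update idea can be sketched as follows: the refreshed rows are fully materialised in a staging area first, and the live table is only swapped in one step once the load has succeeded, so a failing update leaves no trace. Table names and data here are illustrative only.

```python
# Stage the refreshed rows first, then swap them in with a single
# assignment; if the load raises, the live table is never touched.
def atomic_refresh(tables: dict, live_name: str, load_rows):
    staging = dict(load_rows())   # may raise; nothing has changed yet
    tables[live_name] = staging   # the swap is the single commit point

tables = {"tx": {1: "old"}}
atomic_refresh(tables, "tx", lambda: {1: "new", 2: "newer"})

def failing_load():
    raise RuntimeError("validation failed mid-load")

try:
    atomic_refresh(tables, "tx", failing_load)
except RuntimeError:
    pass  # the failed update had no side effects

# tables["tx"] still holds the last successful refresh: {1: "new", 2: "newer"}
```

In BigQuery terms the same effect could be achieved by loading into a staging table and making the replacement of the live table a single atomic statement, but the exact mechanism is a design decision for this milestone.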

Final Milestone: Documentation and analytics (1 month)

The final milestone output would be detailed documentation on how to use the data: how to connect to it, how to write optimised and cost-effective queries, and a full data dictionary, as well as certain analytics built on top of the data.

The deliverables would be documentation in GitHub Pages, example queries and code snippets on how to use the data, as well as Data Studio dashboards and BigQuery views offering interesting insights into the data.
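As a taste of the kind of example query the documentation would contain, the snippet below builds a rewards-per-epoch query for one stake address. The dataset, table and column names are illustrative assumptions, not the actual published schema; with the official google-cloud-bigquery client, one would then run the resulting SQL string.

```python
# Build an illustrative SQL query summing rewards per epoch for a stake
# address. Dataset/table/column names are hypothetical placeholders.
def rewards_per_epoch_sql(stake_address: str,
                          dataset: str = "my-project.cardano") -> str:
    return (
        f"SELECT epoch_no, SUM(amount) AS rewards\n"
        f"FROM `{dataset}.reward`\n"
        f"WHERE stake_address = '{stake_address}'\n"
        f"GROUP BY epoch_no\n"
        f"ORDER BY epoch_no"
    )
```

Queries like this are exactly what "anybody with basic SQL knowledge" would run against the dataset, directly from the BigQuery console or any connected tool.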

Who is in the project team and what are their roles?

Alexander Diemand (cardanobigquery): architecture & design, project management, communication, documentation

Thomas Kaliakos (thomaska): data engineering, data quality responsibility, documentation

Please provide a cost breakdown of the proposed work and resources

Cost breakdown assumptions:

  • Cost for a data engineer man-month: $4,500
  • Cost for cloud infrastructure: $1,500/month

M1:

4 man-months → 4 × $4,500 = $18,000

2 months of cloud infra → 2 × $1,500 = $3,000

M1 cost: $21,000

M2:

2 man-months → 2 × $4,500 = $9,000

1 month of cloud infra → 1 × $1,500 = $1,500

M2 cost: $10,500

M3:

4 man-months → 4 × $4,500 = $18,000

2 months of cloud infra → 2 × $1,500 = $3,000

M3 cost: $21,000

Final milestone:

2 man-months → 2 × $4,500 = $9,000

1 month of cloud infra → 1 × $1,500 = $1,500

Final Milestone cost: $10,500

Total cost:

$63,000 → at 1 ADA (₳) = $0.40 → ₳157,500
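As a sanity check, the breakdown above can be recomputed in a few lines, using the $4,500 man-month and $1,500/month infrastructure figures the per-milestone calculations rely on:

```python
MANMONTH_USD, INFRA_USD = 4500, 1500

# (man-months, months of cloud infra) per milestone, from the breakdown above
milestones = {"M1": (4, 2), "M2": (2, 1), "M3": (4, 2), "Final": (2, 1)}

costs = {m: mm * MANMONTH_USD + months * INFRA_USD
         for m, (mm, months) in milestones.items()}
total_usd = sum(costs.values())  # 63000
total_ada = total_usd * 2.5      # 1 ADA = $0.40, so USD / 0.4 = USD * 2.5
```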

How does the cost of the project represent value for money for the Cardano ecosystem?

We are aiming to empower everyone, from an individual Cardano user with basic SQL skills to companies with whole teams of data analysts, to gain information and insights by interrogating the on-chain data.

We aspire to share the public blockchain data in its most practical form, such that each participant can independently work with it. We believe that SQL is the most accessible way of querying data, and everybody will find a way in their own setup to connect to and query the BigQuery dataset, which is always available, timely updated and accurate.

Running the complete db-sync pipeline costs several hundred dollars per month in infrastructure and maintenance. On the other hand, BigQuery offers a free monthly quota of 1 TB of queried data and usually incurs no costs if used sparingly.
