Fast reindexable data format

Current Project Status: Complete
Amount Received: $50,000
Amount Requested: $50,000
Percentage Received: 100.00%
Solution

We introduce a new thread-safe, memory-efficient persistent storage format with indexing capabilities. Because it is a single-file solution, there is no extra processing overhead for download and synchronization.

Problem

Currently, no efficient tooling exists for storing blocks persistently with both fast indexing and fast distribution. For example, an empty (no-op) indexer built on Oura still takes ~300s per epoch.


dcSpark

3 members

Fast reindexable data format

Please describe your proposed solution.

There are two main problems we are trying to solve:

  1. Currently, the main bottleneck for indexing data is querying block data from the cardano-node. Since cardano-node was not optimized for fast historical block fetching, an indexer that scans the chain and does nothing with the data (a no-op) still takes ~300s per epoch, which means it takes over an entire day for the indexer to complete. This is really bad for developer agility: if you need to write a custom indexer, you have to spend a day re-running it every time you change your code. This proposal will allow us to save the block data once in a fast-to-query format, optimizing both for use cases that need to re-index the same data multiple times (such as protocols similar to TheGraph) and for developer agility.
  2. Currently, there is no way to save block data in a way that supports, in parallel, appending new blocks to storage and reading blocks from a specified index. We have researched whether there are existing solutions to this problem (see below), but it turned out that no suitable solution exists at the moment.

For example, StreamingFast has technology for storing blocks. However, they try to solve both fork resolution and data storage at the same time: they keep many files, each storing 100 blocks, which does not allow for quick distribution and efficient indexing at the same time. In the solution we propose, fork resolution can be handled through a multiverse, and only confirmed blocks go into the actual storage.

Moreover, the solution we propose can be used by other projects in the Cardano ecosystem. The storage can serve as a source for Oura or Carp, so instead of spending 4-5 days synchronizing with Cardano you just download one large file. If a person doesn't trust third-party backups, they can synchronize their own node and create their own backup, reusing it afterwards.

Let's dive into a bit more detail on a potential implementation:

  1. Memory management

We can utilize mapping of the file into memory, so we sync changes to disk easily and have no overhead for data access. Moreover, this is cache efficient, since we get memory reads of consecutive data. On Linux this is the mmap syscall. As we append more data we allocate new chunks, and several records can live in the same chunk. If the storage is closed and reopened later, we create an mmap over the existing records (the data structure does not depend on how the memory was allocated).
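For illustration, here is a minimal sketch of the mmap-based append path in Rust, assuming the memmap2 crate; the names BlockStore and CHUNK and the chunk size are hypothetical choices, not part of the final design:

```rust
// Sketch of an mmap-backed append-only file (assumes the `memmap2` crate).
use std::fs::{File, OpenOptions};
use memmap2::MmapMut;

const CHUNK: u64 = 64 * 1024 * 1024; // grow the file in 64 MiB chunks

struct BlockStore {
    file: File,
    map: MmapMut,
    len: usize, // bytes of record data actually written
}

impl BlockStore {
    fn open(path: &str) -> std::io::Result<Self> {
        let file = OpenOptions::new().read(true).write(true).create(true).open(path)?;
        if file.metadata()?.len() == 0 {
            file.set_len(CHUNK)?; // allocate the first chunk
        }
        let map = unsafe { MmapMut::map_mut(&file)? };
        // For an existing file, `len` would be recovered by scanning the records
        // (see first-level indexing below); 0 is only correct for a fresh store.
        Ok(Self { file, map, len: 0 })
    }

    /// Append one serialized record; returns the offset it was written at.
    fn append(&mut self, bytes: &[u8]) -> std::io::Result<u64> {
        let offset = self.len as u64;
        // Grow the file (and remap) chunk by chunk until the record fits.
        while self.len + bytes.len() > self.map.len() {
            let new_size = self.map.len() as u64 + CHUNK;
            self.file.set_len(new_size)?;
            self.map = unsafe { MmapMut::map_mut(&self.file)? };
        }
        self.map[self.len..self.len + bytes.len()].copy_from_slice(bytes);
        self.len += bytes.len();
        Ok(offset)
    }

    /// Sync dirty pages to disk (msync).
    fn sync(&self) -> std::io::Result<()> {
        self.map.flush()
    }
}
```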

  2. Memory structure

Every record needs service bytes (e.g. 8 bytes storing the offset to the next record) plus the serialized block bytes. The architecture is serialization-agnostic, so any format can be used (e.g. CBOR).
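As an illustration, a record could be encoded as 8 little-endian service bytes holding the offset of the next record, followed by the serialized block; the exact header contents are an implementation detail:

```rust
// Sketch of one record's layout: 8 service bytes (offset of the next record)
// followed by the serialized block (e.g. CBOR).
const HEADER_LEN: u64 = 8;

fn encode_record(current_offset: u64, payload: &[u8]) -> Vec<u8> {
    let next_offset = current_offset + HEADER_LEN + payload.len() as u64;
    let mut rec = Vec::with_capacity(HEADER_LEN as usize + payload.len());
    rec.extend_from_slice(&next_offset.to_le_bytes()); // service bytes
    rec.extend_from_slice(payload);                    // serialized block
    rec
}

fn decode_next_offset(store: &[u8], offset: u64) -> u64 {
    let start = offset as usize;
    u64::from_le_bytes(store[start..start + 8].try_into().unwrap())
}
```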

  3. First-level indexing

Indexing by block number can be built on top of the offset structure: since we know the offsets of the records, we create an in-memory index holding an array of them. When we reopen the storage we simply walk through the offsets, verify data validity and reconstruct the in-memory index. As a bonus, this approach gives us efficient iterators automatically.
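Continuing the record-layout sketch above (it reuses decode_next_offset), the first-level index can be rebuilt by walking the chain of next-offset headers, after which block-number lookups become simple array accesses:

```rust
// Sketch: rebuild the in-memory first-level index by following the
// next-offset headers until the end of the written data is reached.
// (Validity checks are omitted here.)
fn rebuild_index(store: &[u8], data_len: u64) -> Vec<u64> {
    let mut offsets = Vec::new();
    let mut offset = 0u64;
    while offset < data_len {
        offsets.push(offset);
        offset = decode_next_offset(store, offset); // 8 service bytes per record
    }
    offsets
}

// Block number -> serialized block bytes becomes a simple array lookup.
fn block_bytes<'a>(store: &'a [u8], offsets: &[u64], n: usize) -> &'a [u8] {
    let start = offsets[n] as usize;
    let end = decode_next_offset(store, offsets[n]) as usize;
    &store[start + 8..end] // skip the 8 service bytes
}
```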
  4. Thread-safety

Since we never modify existing records, the only things that need careful treatment are the end of the file (the last mmap) and the mmap structure itself (in case a rebalance is needed). Thanks to the immutability of existing records, access can be made lock-free.
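One possible way to keep reads lock-free (an assumption for illustration, not a committed design) is to publish the committed end of data through an atomic, so writers only ever touch bytes beyond what readers are allowed to see:

```rust
// Sketch: lock-free reads by publishing the committed data length atomically.
// Existing records are immutable, so readers only need a consistent `len`.
use std::sync::atomic::{AtomicU64, Ordering};

struct SharedTail {
    committed_len: AtomicU64, // bytes readers are allowed to see
}

impl SharedTail {
    // Writer: after the record bytes are fully written, publish the new end.
    fn publish(&self, new_len: u64) {
        self.committed_len.store(new_len, Ordering::Release);
    }

    // Reader: anything below this offset is immutable and safe to read.
    fn snapshot(&self) -> u64 {
        self.committed_len.load(Ordering::Acquire)
    }
}
```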

  5. Advanced indexing

Having the first-level index, we can build any further index (in-memory or persistent) on top of it using various approaches, e.g. a block hash -> block mapping.
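As a sketch of such a second-level index, a block hash -> block number map can be built over the offset array from the first-level indexing sketch; parse_block_hash is a placeholder for whatever deserialization the chosen format (e.g. CBOR) provides:

```rust
use std::collections::HashMap;

// Sketch: hash index built on top of the first-level offset index
// (reuses `block_bytes` from the indexing sketch above).
fn build_hash_index(store: &[u8], offsets: &[u64]) -> HashMap<[u8; 32], usize> {
    let mut by_hash = HashMap::with_capacity(offsets.len());
    for n in 0..offsets.len() {
        let hash = parse_block_hash(block_bytes(store, offsets, n));
        by_hash.insert(hash, n);
    }
    by_hash
}

// Placeholder: extracting the hash depends on the block serialization format.
fn parse_block_hash(_block: &[u8]) -> [u8; 32] {
    unimplemented!("depends on the serialization format used for blocks")
}
```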

Please describe how your proposed solution will address the Challenge that you have submitted it in.

This tool will allow much faster access to block data, which will make indexers in Cardano more responsive, unlock use cases that require quick re-indexing, and improve developer agility. This solution can then be used to unlock the next generation of Cardano products & tools that depend on these properties.

What are the main risks that could prevent you from delivering the project successfully and please explain how you will mitigate each risk?

No significant risks beyond standard engineering risks (delays, going over budget, etc.)

We are confident this project will meet the indexing needs of Milkomeda and that it's generic enough to be useful for many different use cases, but there may be other approaches that other projects need, such as HTTP/3-based block fetching.

Please provide a detailed plan, including timeline and key milestones for delivering your proposal.

Q3: implement the solution, open-source it and integrate it as part of Milkomeda. We expect the project to take ~2.5 months.

Please provide a detailed budget breakdown.

All funds will go towards engineering effort to implement the solution

Please provide details of the people who will work on the project.

1 Rust engineer at Milkomeda

1 Project lead (shared resource between this and other Milkomeda projects)

If you are funded, will you return to Catalyst in a later round for further funding? Please explain why / why not.

No plans for this specific project. Depending on interest, we could also develop alternatives to cover different use cases, such as HTTP/3-based block syncing.

Please describe what you will measure to track your project's progress, and how will you measure these?

Development progress on the project itself, followed by its integration into Milkomeda

What does success for this project look like?

Project is successfully open sourced, integrated into Milkomeda and usable by projects such as Carp that need fast re-indexing of data

Please provide information on whether this proposal is a continuation of a previously funded project in Catalyst or an entirely new one.

Entirely new proposal, but inspired by our indexing work for Carp and Milkomeda
