Please describe your proposed solution.
There are two main problems we are trying to solve:
- Currently, the main bottleneck for indexing data is querying block data from the cardano-node. Since cardano-node was not optimized for fast historical block fetching, an indexer that scans the chain and does nothing with the data (a no-op) still takes ~300s per epoch, which means a full sync takes over an entire day. This is really bad for developer agility: if you need to write a custom indexer, you have to spend a day re-running it every time you change your code. This proposal will allow us to save the block data once in a fast-to-query format, optimizing both for use cases that need to re-index the same data multiple times (such as protocols similar to TheGraph) and for developer agility.
- Currently, there is no way to save block data in a way that supports, in parallel, appending new blocks to storage and reading blocks from a specified index. We have researched whether there are existing solutions to this problem (see below), but it turned out that there is no suitable solution at the moment.
For example, StreamingFast has technology for storing blocks. However, they try to solve the fork resolution and data storage problems at the same time: they keep a large number of files, each storing 100 blocks, which does not allow for quick distribution and efficient indexing at the same time. In the solution we propose, fork resolution is handled through a multiverse, and only confirmed blocks go into the actual storage.
Moreover, the solution we propose can be used for other projects in the Cardano ecosystem. The storage can be used as a source for Oura or Carp, so instead of spending 4-5 days synchronizing with Cardano you just download one large file. If a person doesn't trust third-party backups, they can synchronize their own node, create their own backup, and reuse it afterwards.
Let's dive into a little more detail on a potential implementation:
- Memory management
We can memory-map the storage file, so changes are synced to disk easily and there is no overhead for data access. This is also cache-efficient, since reads touch consecutive memory. On Linux this is done with the mmap syscall. As we append more data we allocate new chunks, and several records can live in the same chunk. If the storage is closed and later reopened, we simply create an mmap over the existing records (the data structure does not depend on how the memory was allocated). A minimal sketch of this layout is shown after this list.
- Memory structure
For every record we need a few service bytes (e.g. 8 bytes storing the offset to the next record) plus the serialized block bytes. The architecture is serialization-agnostic, so any format can be used (e.g. CBOR).
- First-level indexing
- Indexing by block number can be built on top of the offset structure. Since we know the offsets of the records, we can create an in-memory index that is simply an array of those offsets. When we reopen the storage, we walk through the offsets, verify data validity and reconstruct the in-memory index. As a bonus, this approach gives us efficient iterators for free (see the indexing sketch after this list).
- Thread-safety
Since we never modify existing records, the only things that need to be treated carefully are the end of the file (the last mmap) and the mmap structure itself (in case rebalancing is needed). Thanks to the immutability of existing records, access can be made lock-free.
- Advanced indexing
- With the first-level index in place we can build any further index (in-memory or persistent) on top of it, using various approaches, such as a block hash -> block mapping (an example sketch follows this list).
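To make the memory-management and record-layout ideas above concrete, here is a minimal, non-authoritative sketch in Rust using the memmap2 crate. The Chunk type and the 8-byte little-endian length prefix (from which the offset of the next record follows) are illustrative assumptions, not a finalized design:

```rust
// Illustrative sketch only; `Chunk` and the length-prefix layout are assumptions.
use std::fs::OpenOptions;
use std::io;
use memmap2::MmapMut;

const HEADER: usize = 8; // service bytes per record (here: record length, little-endian)

/// One memory-mapped chunk of the append-only block store.
struct Chunk {
    mmap: MmapMut,
    write_pos: usize, // where the next record will be appended
}

impl Chunk {
    /// Create a chunk file of `capacity` bytes and map it into memory.
    fn create(path: &str, capacity: u64) -> io::Result<Self> {
        let file = OpenOptions::new()
            .read(true)
            .write(true)
            .create(true)
            .open(path)?;
        file.set_len(capacity)?;
        let mmap = unsafe { MmapMut::map_mut(&file)? };
        Ok(Chunk { mmap, write_pos: 0 })
    }

    /// Append one serialized block (e.g. CBOR bytes); returns its offset in the chunk,
    /// or None if the chunk is full and a new chunk must be allocated.
    fn append(&mut self, block: &[u8]) -> Option<usize> {
        let start = self.write_pos;
        let end = start + HEADER + block.len();
        if end > self.mmap.len() {
            return None; // caller allocates the next chunk
        }
        // service bytes: record length, from which the next record's offset follows
        self.mmap[start..start + HEADER]
            .copy_from_slice(&(block.len() as u64).to_le_bytes());
        self.mmap[start + HEADER..end].copy_from_slice(block);
        self.write_pos = end;
        Some(start)
    }

    /// Sync appended records to disk.
    fn flush(&self) -> io::Result<()> {
        self.mmap.flush()
    }
}
```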
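A possible shape for the first-level (block number -> offset) index described above, reusing the HEADER constant from the previous sketch. The BlockIndex name and the exact validity checks are assumptions:

```rust
/// In-memory first-level index: `offsets[i]` is the position of block `i` in the chunk.
struct BlockIndex {
    offsets: Vec<usize>,
}

impl BlockIndex {
    /// Rebuild the in-memory index by walking the length-prefixed records
    /// in the first `used` bytes of the mapped data.
    fn rebuild(data: &[u8], used: usize) -> BlockIndex {
        let mut offsets = Vec::new();
        let mut pos = 0;
        while pos + HEADER <= used {
            let len =
                u64::from_le_bytes(data[pos..pos + HEADER].try_into().unwrap()) as usize;
            offsets.push(pos);
            pos += HEADER + len; // jump to the next record
        }
        BlockIndex { offsets }
    }

    /// O(1) lookup of the serialized block with the given number.
    fn get<'a>(&self, data: &'a [u8], block_number: usize) -> Option<&'a [u8]> {
        let start = *self.offsets.get(block_number)?;
        let len =
            u64::from_le_bytes(data[start..start + HEADER].try_into().unwrap()) as usize;
        Some(&data[start + HEADER..start + HEADER + len])
    }

    /// Efficient iteration comes for free from the offsets array.
    fn iter<'a>(&'a self, data: &'a [u8]) -> impl Iterator<Item = &'a [u8]> + 'a {
        (0..self.offsets.len()).filter_map(move |i| self.get(data, i))
    }
}
```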
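Finally, a sketch of one possible advanced index built on top of the first-level one: a block hash -> block number map. The block_hash helper below is a toy placeholder; a real implementation would hash the block header (e.g. blake2b-256 on Cardano):

```rust
use std::collections::HashMap;

/// Build a block hash -> block number map by scanning the first-level index once.
fn build_hash_index(index: &BlockIndex, data: &[u8]) -> HashMap<[u8; 32], usize> {
    let mut by_hash = HashMap::new();
    for (number, raw_block) in index.iter(data).enumerate() {
        by_hash.insert(block_hash(raw_block), number);
    }
    by_hash
}

/// Toy stand-in, NOT a real hash; shown only to keep the sketch self-contained.
fn block_hash(raw_block: &[u8]) -> [u8; 32] {
    let mut h = [0u8; 32];
    for (i, b) in raw_block.iter().enumerate() {
        h[i % 32] ^= b;
    }
    h
}
```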
Please describe how your proposed solution will address the Challenge that you have submitted it in.
This tool will allow much faster access to block data, which will make indexers in Cardano more responsive, unlock use cases that require quick re-indexing, and improve developer agility. The solution can then be used to unlock the next generation of Cardano products & tools that depend on these properties.
What are the main risks that could prevent you from delivering the project successfully and please explain how you will mitigate each risk?
No significant risks beyond standard engineering risks (delays, going over budget, etc.).
We are confident this project will meet the indexing needs of Milkomeda and that it is generic enough to be useful for many different use cases, but there may be other approaches that other projects need, such as HTTP/3-based block fetching.