Please describe your proposed solution.
Outline
- Abstract
- Background
- Approach
- Tooling
- Audience
Abstract
I want to build an AI analysis pipeline called RAGDoC (Retrieval Augmented Generation for Documents on Cardano) that runs completely on Cardano infrastructure (e.g. NuNet and Iagon). This pipeline can cluster and summarize Catalyst proposals (or any set of documents), making it easier to find proposals that align with your interests. In the process of developing this tool, I will create or further develop the open source tools needed to run this and other pipelines on Cardano infrastructure.
Background
I was a member of a group of Minswap volunteers that provided input to Minswap on the 50 proposals that were voted for in Fund10. At 1600 proposals, there were far too many documents to read through, so I used a combination of RAG models, dimension reduction, and clustering to group proposals together and then had AI models summarize each group. This helped us browse the proposals more easily and find the ones relevant to our community. However, I used OpenAI to accomplish this and never released any of the source code.
Since Fund10, NuNet and Iagon have become much more mature, with functional alphas for compute and storage respectively. Further, Iagon plans to have an alpha version of compute in early 2024. With these tools, it is possible to recreate the workflow I developed for Minswap using entirely open source tools and fully decentralized infrastructure! However, tooling is needed to make it easier for developers to use them.
Approach
Broadly, this approach is retrieval augmented generation (RAG) with an intermediate dimension reduction and clustering step. The general steps are: Catalyst proposal aggregation, text embedding with a large language model to obtain vector embeddings, dimension reduction of the vector embeddings, clustering, and finally summarization of the contents of each cluster using a large language model. Below is an example of the result of this workflow from Fund10, showing that the model clustered similar proposals together and summarized them appropriately.
> Group 26 (relevance: 100.00%):
> The common themes across the proposals include the use of the Aiken programming
> language, the need for audits and bug bounties, the goal of increasing DeFi usage on
> Cardano, and the desire to strengthen liquidity in the ecosystem. Other common goals
> include showcasing the efficiency and interoperability of Aiken, empowering Cardano
> developers with open-source tools, upgrading contracts for efficiency and functionality,
> and enabling decentralized renting. Feasibility is a key consideration, with proposals
> emphasizing technical assessments, prototype development and testing, security audits,
> user feedback and validation, and community engagement and adoption. The proposals also
> highlight specific challenges such as the lack of open-source Stableswap and options for
> launching tokens on Cardano, as well as the need for better user experiences during high
> chain load. Customizability and adaptability are important factors in addressing these
> challenges.
>
> Proposals:
> Title: Minswap Aiken Stableswap Audit + Bug Bounty
> https://cardano.ideascale.com/a/dtd/101498-163 (332000 ada requested of 9,080,400 ada available)
> Title: SundaeSwap Aiken Smart Contracts
> https://cardano.ideascale.com/a/dtd/102976-163 (276000 ada requested of 9,080,400 ada available)
> Title: Lenfi V2 Aiken Audit + Bug Bounty
> https://cardano.ideascale.com/a/dtd/103087-163 (265000 ada requested of 9,080,400 ada available)
> Title: Revolutionizing Cardano Rewards Contracts: Aiken Language Upgrade for Efficiency
> and Functionality
> https://cardano.ideascale.com/a/dtd/103870-163 (85000 ada requested of 9,080,400 ada available)
> Title: FluidShare: Decentralized Uncollateralized Renting [Release + Audit + Open
> Source]
> https://cardano.ideascale.com/a/dtd/104787-163 (200000 ada requested of 9,080,400 ada available)
> Title: Minswap Aiken V2 Audit
> https://cardano.ideascale.com/a/dtd/105516-163 (467000 ada requested of 9,080,400 ada available)
> Title: Minswap Liquidity Bootstrapping for DAOs
> https://cardano.ideascale.com/a/dtd/103138-163 (206000 ada requested of 3,158,400 ada available)
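The five steps can be sketched end to end with toy stand-ins. This is only an illustration of how the stages connect, not the workflow's actual code: the hashed bag-of-words embedding, the random projection, and the word-frequency "summary" are deliberately simple substitutes for an embedding model, UMAP/PaCMAP, and an LLM, and every function name and the sample proposals are hypothetical.

```python
import math
import random
import re
import zlib
from collections import Counter

EMBED_DIM = 64    # toy embedding size (assumption)
REDUCED_DIM = 4   # size after dimension reduction (assumption)


def tokens(text):
    return re.findall(r"[a-z]+", text.lower())


def embed(text):
    """Stand-in for an embedding model: hashed bag of words, unit length."""
    vec = [0.0] * EMBED_DIM
    for word in tokens(text):
        vec[zlib.crc32(word.encode()) % EMBED_DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def reduce_dim(vectors, seed=0):
    """Stand-in for UMAP/PaCMAP: a fixed Gaussian random projection."""
    rng = random.Random(seed)
    proj = [[rng.gauss(0, 1) for _ in range(EMBED_DIM)] for _ in range(REDUCED_DIM)]
    return [[sum(p * v for p, v in zip(row, vec)) for row in proj] for vec in vectors]


def kmeans(points, k, iters=20, seed=0):
    """Plain k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)

    def assign():
        return [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]

    for _ in range(iters):
        labels = assign()
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assign()


def summarise(texts, top_n=3):
    """Stand-in for an LLM summary: the most frequent longer words."""
    counts = Counter(w for t in texts for w in tokens(t) if len(w) > 4)
    return ", ".join(w for w, _ in counts.most_common(top_n))


def run_pipeline(proposals, k):
    vectors = [embed(p) for p in proposals]   # text embedding
    reduced = reduce_dim(vectors)             # dimension reduction
    labels = kmeans(reduced, k)               # clustering
    groups = {}
    for text, label in zip(proposals, labels):
        groups.setdefault(label, []).append(text)
    # summarization: one summary per cluster
    return {label: (summarise(texts), texts) for label, texts in groups.items()}


proposals = [  # made-up example inputs
    "Aiken smart contract audit and bug bounty",
    "Aiken stableswap contract audit",
    "Community education workshops for new developers",
    "Education workshops and community events",
]
result = run_pipeline(proposals, k=2)
for label, (summary, texts) in sorted(result.items()):
    print(f"Group {label}: {summary} ({len(texts)} proposals)")
```

With inputs this small and a random projection, the grouping is not guaranteed to be meaningful; the point is only that each stage consumes the previous stage's output, so any stage can be swapped for a stronger method.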
The original version of this workflow used OpenAI for the embedding and summarization steps, but these can be replaced by open source models that can also perform better than the OpenAI models used originally. For text embedding, I will use Instructor-XL, released by the HKU NLP group with collaborators from the University of Washington, Meta AI, and the Allen Institute for AI. For summarization, I will use Llama 2 from Meta. A stretch goal for this project will be to generalize the code to use any model for embedding or summarization.
Vector storage will use FAISS (an MIT-licensed project from Facebook). Dimension reduction will support several methods, including UMAP and PaCMAP. Clustering will support several algorithms, including HDBSCAN and standard k-means.
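One way the stretch goal could look in code is a pair of small interfaces that any embedding or summarization model can satisfy. The sketch below is an assumption about the design, not existing RAGDoC code: the `Embedder` and `Summarizer` protocols and both toy classes are hypothetical, and the toy classes merely stand in for wrappers around models such as Instructor-XL or Llama 2.

```python
from typing import List, Protocol, Sequence


class Embedder(Protocol):
    def embed(self, texts: Sequence[str]) -> List[List[float]]: ...


class Summarizer(Protocol):
    def summarize(self, texts: Sequence[str]) -> str: ...


class LengthEmbedder:
    """Toy embedder: (character count, word count) per text."""

    def embed(self, texts):
        return [[float(len(t)), float(len(t.split()))] for t in texts]


class FirstSentenceSummarizer:
    """Toy summarizer: the first sentence of each text."""

    def summarize(self, texts):
        return " ".join(t.split(".")[0].strip() + "." for t in texts)


def describe_group(texts, embedder: Embedder, summarizer: Summarizer):
    """Pipeline code depends only on the two interfaces, not on a model."""
    return {
        "vectors": embedder.embed(texts),
        "summary": summarizer.summarize(texts),
    }


out = describe_group(
    ["Audit the contract. Then ship.", "Fund a bounty."],
    LengthEmbedder(),
    FirstSentenceSummarizer(),
)
print(out["summary"])
```

Because `describe_group` only calls `embed` and `summarize`, swapping in a different model would mean writing one new wrapper class rather than changing the pipeline.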
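For intuition about the vector storage step, the sketch below does exact nearest-neighbour search by cosine similarity in plain Python. This is the kind of lookup a vector index such as FAISS performs far more efficiently; the code does not use the FAISS API, and the `TinyIndex` class, the item IDs, and the three-dimensional vectors are all made up for illustration.

```python
import math


class TinyIndex:
    """Hypothetical in-memory index with exact (brute-force) search."""

    def __init__(self):
        self.ids = []
        self.vectors = []

    def add(self, item_id, vector):
        norm = math.sqrt(sum(x * x for x in vector)) or 1.0
        self.ids.append(item_id)
        self.vectors.append([x / norm for x in vector])  # store unit vectors

    def search(self, query, k=3):
        """Return up to k (id, cosine similarity) pairs, best match first."""
        norm = math.sqrt(sum(x * x for x in query)) or 1.0
        q = [x / norm for x in query]
        scored = [
            (sum(a * b for a, b in zip(q, vec)), item_id)
            for item_id, vec in zip(self.ids, self.vectors)
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [(item_id, score) for score, item_id in scored[:k]]


index = TinyIndex()
index.add("aiken-audit", [0.9, 0.1, 0.0])
index.add("dex-liquidity", [0.7, 0.3, 0.1])
index.add("education", [0.0, 0.2, 0.9])

hits = index.search([1.0, 0.0, 0.0], k=2)
print([item_id for item_id, _ in hits])
```

Brute-force search scales linearly with the number of stored vectors, which is acceptable for a few thousand proposals but is the main reason to use a dedicated index for larger document sets.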
Tooling
All tools will be developed in Python, the primary language used for AI development. The tooling component of this proposal is as valuable as the end product itself: it will create new open source tools, or build upon the ones I have already released, to enable AI developers to make use of decentralized infrastructure on Cardano.
nunet-py
NuNet is a decentralized computing project on Cardano that allows individuals to rent out the processing power of their computers. nunet-py is a project I developed while actively testing NuNet during its alpha testing phase, and it allows programmatic execution of jobs on NuNet. It is capable of fully configuring and executing a job on NuNet, but it suffers from some basic usability issues and has no documentation. This tool will be further developed and will be the job submission tool for running the data aggregation, text embedding, clustering, and other steps of RAGDoC.
iagon-py
Iagon is a decentralized, privacy-focused storage solution that runs on Cardano. It allows individuals to rent out disk space on their computers. iagon-py is a project I developed during Iagon's alpha test phase, but it has very rudimentary functionality and no documentation. This tool will be used for storing intermediate data, such as text embeddings, clusters, and summarization information.
cardano-flows
To provide additional utility to developers, it would be helpful to make the RAGDoC workflow modular, so that data aggregation, embedding, dimension reduction, clustering, and summarization are all separate steps. If each task is a separate step, the tools can be reused for other applications. While there are tools for creating workflows in Python, most are tied directly to a specific workflow manager. cardano-flows will be a new tool for creating and running workflows on Cardano infrastructure. For this proposal, it will use NuNet for compute and Iagon for storage, but the individual components will be abstracted so that new projects can be added easily as they come online. For example, when Iagon's compute infrastructure comes online, it should be straightforward to add it to cardano-flows as a compute backend.
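A minimal sketch of how such pluggable backends might look is shown below. cardano-flows does not exist yet, so every name here (`ComputeBackend`, `StorageBackend`, `Flow`, and the local stand-ins) is hypothetical; the local compute and in-memory storage classes only mark where NuNet and Iagon backends would plug in.

```python
from abc import ABC, abstractmethod


class ComputeBackend(ABC):
    @abstractmethod
    def run(self, func, data):
        """Execute one step and return its result."""


class StorageBackend(ABC):
    @abstractmethod
    def put(self, key, value): ...

    @abstractmethod
    def get(self, key): ...


class LocalCompute(ComputeBackend):
    """Runs the step in-process; a NuNet backend would submit a job instead."""

    def run(self, func, data):
        return func(data)


class MemoryStorage(StorageBackend):
    """Keeps results in a dict; an Iagon backend would upload them instead."""

    def __init__(self):
        self._items = {}

    def put(self, key, value):
        self._items[key] = value

    def get(self, key):
        return self._items[key]


class Flow:
    """Runs named steps in order and stores every intermediate result."""

    def __init__(self, compute, storage):
        self.compute = compute
        self.storage = storage
        self.steps = []

    def step(self, name, func):
        self.steps.append((name, func))
        return self  # allow chaining

    def run(self, data):
        for name, func in self.steps:
            data = self.compute.run(func, data)
            self.storage.put(name, data)
        return data


storage = MemoryStorage()
flow = (
    Flow(LocalCompute(), storage)
    .step("aggregate", lambda docs: [d.strip() for d in docs])
    .step("embed", lambda docs: [[float(len(d))] for d in docs])
    .step("cluster", lambda vecs: [0 if v[0] < 10 else 1 for v in vecs])
)
labels = flow.run(["  short ", " a much longer proposal "])
print(labels, storage.get("embed"))
```

In this design, adding a new compute provider would mean implementing one `run` method, and the steps themselves would not need to change.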
RAGDoC Dashboard
The final piece of RAGDoC is a Dashboard for browsing Catalyst data, tuning parameters, and submitting workflows. The Dashboard will be created with Solara, a Python wrapper around React. It will allow users to submit the pipeline to NuNet and load results from Iagon into an interface where they can browse the results and link back to the original documents in IdeaScale. Part of this work will include open sourcing some custom components for Solara, such as the wallet connector that allows people to sign transactions and CIP-8 messages (already live and in use on the SteelSwap DEX aggregator).
Audience
I see two general categories of audience for this project:
- Individuals and communities voting on Catalyst. This tool can make it faster to find proposals relevant to a community and help ensure that important proposals do not fall through the cracks. It can also potentially help weed out low-quality proposals.
- Developers interested in deploying on Cardano infrastructure. The road to RAGDoC comes with the knock-on benefits of better documentation and usability for the underlying tools, which are general utilities not specifically tailored to RAGDoC.