over budget
AI RAG Analysis of CIP 1694
Current Project Status: Unfunded
Amount Received: ₳0
Amount Requested: ₳51,000
Percentage Received: 0.00%
Solution

Provide an educational exposition of the development of an open-source LLM (Large Language Model) application based on RAG (Retrieval-Augmented Generation) for questioning and analysing CIP-1694.

Problem

https://youtu.be/f15DkGTTOio

Community members need a way to ask questions about, and educate themselves about, CIP-1694 governance. We can address this using AI-powered contextual retrieval.


QA-DAO

2 members


Please describe your proposed solution

-------------------------------------

Overview

CIP 1694 (a Cardano Improvement Proposal) was an important step in Cardano’s Voltaire Roadmap. Various sources were used to draft, edit, review, consult on and reach consensus on the CIP. This diversity of sources is an important feature of our decentralised community, but it does come with challenges. Despite a great deal of community communication, many of these sources remain opaque and hard to access.

When vital information is scattered across multiple sources in this way, it can lead to confusion, and make it challenging for people to educate themselves on the topic, or analyse the data to extract relevant insights, or verify exactly what was said at any given point in the process. This impacts the transparency and accountability of Cardano’s governance materials and processes.

Simply applying generalist Large Language Models, such as those offered by OpenAI, can generate misleading responses: because the relevant external data sources are not incorporated, the LLM fills its knowledge gaps with hallucinations. The effectiveness of untutored LLMs is therefore limited, and any deeper understanding based on context-specific information is not possible without a way to embed specific external data sources.

Reasons for our approach: Retrieval-Augmented Generation (RAG)

-------------------------------------

This proposal will address these limitations by developing a Language Model pipeline using a Retrieval-Augmented Generation (RAG) approach, which can be tailored to specific datasets. The RAG approach pairs a data retrieval component with a language model, which enhances the model's ability to generate context-relevant responses.

The aim is to ease the process of question-and-answering, and enable more accurate answers. The RAG process provides contextual information retrieval and synthesis (data source comparisons) to ensure accuracy and comprehension.
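To make the retrieve-then-generate idea concrete, here is a minimal sketch in plain Python. It is not the project's implementation: a real pipeline would use learned embeddings and a vector store, whereas this sketch stands in simple word-count vectors and cosine similarity. The passages and the function names (`retrieve`, `augment`) are illustrative assumptions.

```python
import math
import re
from collections import Counter

# Illustrative placeholder passages (paraphrases, not the project's dataset).
PASSAGES = [
    "CIP-1694 proposes delegated representatives called DReps who vote on governance actions.",
    "Stake pool operators produce blocks and also vote on some governance actions.",
    "A constitutional committee checks whether each governance action is constitutional.",
]

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, passages, k=2):
    """Return the k passages most similar to the question."""
    query = Counter(tokenize(question))
    ranked = sorted(passages, key=lambda p: cosine(query, Counter(tokenize(p))), reverse=True)
    return ranked[:k]

def augment(question, passages, k=2):
    """Prepend the retrieved passages to the question, ready to send to an LLM."""
    context = "\n".join(f"- {p}" for p in retrieve(question, passages, k))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

if __name__ == "__main__":
    print(augment("What do DReps vote on?", PASSAGES))
```

The language model only sees the passages that `retrieve` selects, which is what lets a RAG pipeline ground its answers in a specific dataset.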

A further aim of the proposal is as a demonstration/proof-of-concept. By developing a RAG retrieval process with a very specific data source (the CIP itself and related data), we aim to demonstrate how it can support community members to self-educate and keep informed about a specific topic - in this case, Voltaire governance.

Additionally, the RAG approach allows for greater control over prompt engineering constraints and unit testing, so that quality assurance and ethical safeguards can be applied.

(See What is retrieval-augmented generation? for further information)

Who will we engage?

-------------------------------------

QADAO will apply its extensive experience in community engagement and outreach to publicise our Open Source workflow and invite the community to reuse our methods, code and documentation. We also intend to engage the community with expositions at the close of each milestone, where we will demonstrate how we have worked, and invite the community to engage with and learn from our process.

How will we demonstrate or prove our impact?

-------------------------------------

Overview of the proposed RAG model architecture

-------------------------------------

We will demonstrate our impact by providing educational step-by-step documentation and exposition of our RAG workflow.

Our methodology will focus on the most accessible models as a proof of concept. Each step will be written in Python and documented in Colab notebooks, which will be committed to the project’s public GitHub repository under an Open Source Apache 2.0 licence.

LangChain libraries will be used to build the model architecture. Large Language Models (e.g. Open Source models hosted on Hugging Face) will be assessed for semantic use in combination with local datasets.

Data preprocessing and preparation

-------------------------------------

The data will be sourced, prepared and processed prior to embedding in a vector data store.
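One common preparation step before embedding is to split each source document into overlapping chunks and to keep a record of where each chunk came from. The sketch below shows that step only, under our own assumptions: the proposal does not specify a chunking strategy, and the `Chunk` record, the `chunk_words` helper and the size and overlap defaults are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    source: str  # where the text came from, kept for provenance
    index: int   # position of the chunk within its source
    text: str    # the chunk content that would later be embedded

def chunk_words(text, source, size=50, overlap=10):
    """Split text into chunks of `size` words, each sharing `overlap` words with the previous one."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()  # also normalises runs of whitespace
    step = size - overlap
    chunks = []
    for index, start in enumerate(range(0, len(words), step)):
        window = words[start:start + size]
        chunks.append(Chunk(source=source, index=index, text=" ".join(window)))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

if __name__ == "__main__":
    sample = " ".join(str(n) for n in range(12))
    for chunk in chunk_words(sample, source="example.txt", size=5, overlap=2):
        print(chunk)
```

Keeping the `source` field on every chunk means that an answer can later be traced back to the document it was drawn from.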

Model training and fine-tuning

-------------------------------------

The embeddings in the vector data store will provide the basis for model training and fine-tuning. Sample queries and expected responses, sourced from the community, will be tested against the model.
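Testing sample queries against expected responses can be organised as a small harness. The sketch below is one possible shape, not the project's test suite: each case lists keywords that an acceptable answer must contain, and `stub_answer` is a hypothetical stand-in for the real question-answering pipeline.

```python
def evaluate(answer_fn, cases):
    """Run each test question through answer_fn and report the pass rate and the failures."""
    failures = []
    for case in cases:
        answer = answer_fn(case["question"]).lower()
        missing = [kw for kw in case["must_contain"] if kw.lower() not in answer]
        if missing:
            failures.append({"question": case["question"], "missing": missing})
    pass_rate = 1 - len(failures) / len(cases) if cases else 0.0
    return pass_rate, failures

def stub_answer(question):
    """Hypothetical stand-in for the real pipeline: returns a canned answer or a refusal."""
    canned = {"drep": "DReps are delegated representatives."}
    for key, answer in canned.items():
        if key in question.lower():
            return answer
    return "I do not know."

if __name__ == "__main__":
    cases = [
        {"question": "What is a DRep?", "must_contain": ["delegated"]},
        {"question": "What is the deposit?", "must_contain": ["deposit"]},
    ]
    rate, failures = evaluate(stub_answer, cases)
    print(f"pass rate: {rate:.0%}")
    for failure in failures:
        print("failed:", failure)
```

Keyword checks are a coarse measure. They are used here only because they are easy to read and to source from community-submitted questions.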

Evaluation metrics and techniques

-------------------------------------

Conversation or query chains will be built and tested against the model. This will take the form of a Q&A interaction between a constrained local source and a general LLM, and a series of prescriptive prompt instructions.
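A query chain of this kind usually comes down to assembling one prompt from three parts: the prescriptive instructions, the retrieved local sources, and the conversation so far. The sketch below illustrates that assembly only. The wording of `RULES` and the `build_chain_prompt` helper are our own assumptions, and no LLM is called.

```python
# Hypothetical prescriptive instructions that constrain the general LLM to the local sources.
RULES = [
    "Answer using only the numbered sources below.",
    "Cite the source number for each claim.",
    'If the sources do not contain the answer, reply "Not found in the provided sources."',
]

def build_chain_prompt(question, sources, history=()):
    """Combine the rules, numbered sources and earlier turns into a single prompt string."""
    rules = "\n".join(f"* {rule}" for rule in RULES)
    numbered = "\n".join(f"[{i}] {text}" for i, text in enumerate(sources, start=1))
    turns = "\n".join(f"Q: {q}\nA: {a}" for q, a in history) or "(none)"
    return (
        f"Instructions:\n{rules}\n\n"
        f"Sources:\n{numbered}\n\n"
        f"Conversation so far:\n{turns}\n\n"
        f"Question: {question}\nAnswer:"
    )

if __name__ == "__main__":
    history = [("What is CIP-1694?", "A governance proposal.")]
    print(build_chain_prompt("Who votes?", ["DReps vote.", "SPOs vote."], history))
```

Because the prompt is an ordinary string, each rule can be checked in a unit test before any model is involved.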

Documentation and Knowledge Transfer

-------------------------------------

This entire process - the workflow, the code, the data processing, model training and evaluation - will be fully documented along the way and published at the close of the project.

Please define the positive impact your project will have on the wider Cardano community

The project will bring value to the Cardano ecosystem by delivering a straightforward, open-source, AI process that can be reused, queried and analysed by community members.

This will support people to self-educate on CIP-1694 (the topic of the dataset) despite barriers such as relevant data being widely dispersed, hard to track down, and lengthy; and it will thereby contribute to improved transparency and governance inclusion.

The project will also offer a demonstration of how a RAG retrieval process can work in this type of context, and will enable others in the ecosystem to use and adapt the process to enable AI querying of any set of data - both to analyse the data itself, and to use it as source material for community education about a topic.

We will measure this impact by:

    1. measuring levels of engagement with our "community exposition" After Town Hall sessions and videos at the close of Milestones 1 and 2 (these will educate people on data preparation and exploration processes; RAG model development, refinement and optimization; model deployment and integration).
    2. measuring social media engagement with the posts that draw attention to our engagement events and associated Open Source repository resources.
    3. using Open Source quantitative measures, such as the number of commits and views on our learning materials.

We will share the outputs via our open-source documentation, and with a closing video which walks through what we did and how it works.

What is your capability to deliver your project with high levels of trust and accountability? How do you intend to validate if your approach is feasible?

The team members are skilled and experienced members of the Catalyst community, and both have experience of working in transparent and open-source ways via GitHub, GitBook, and Dework, providing a trackable, accountable and trustworthy audit trail. See, for example:

  • Community Governance Oversight - <https://quality-assurance-dao.gitbook.io/community-governance-oversight/>
  • CIP 1694 Resources, such as <https://quality-assurance-dao.gitbook.io/ekphrasis/2023/february-2023/voltaire-cip-1694-summary>

They also have well-established skills in community engagement and education; and a thorough grasp of the RAG retrieval process and its educative and ethical implications.

Our proposal includes not only thorough documentation of our process, but also community education and sharing. This will offer a high level of trust and accountability, since the community can verify our work by learning about it and trying it in practice.

What are the key milestones you need to achieve in order to complete your project successfully?

Milestone 1:

-------------------------------------

Data preparation and exploration; RAG model development

We will locate, preprocess and prepare data sources relevant to CIP-1694; and we will deliver a baseline RAG implementation and data.

Milestone outputs

  1. Cleaned and preprocessed dataset
  2. Dataset exploration and analysis report
  3. Initial RAG model
  4. Preliminary results (data sets)
  5. Community exposition - delivery of an After Town Hall session to outline what we have done so far and educate the community on our process

Acceptance criteria

  1. Our dataset will be relevant to CIP-1694 and will be processed to enable it to be used with our RAG model
  2. Our analysis report will explain in a readable way what our dataset contains and how it has been processed, and will give clear information on where the data was sourced
  3. Concise exposition of our RAG model.
  4. Clear presentation of our preliminary results (data sets).
  5. Our After Town Hall presentation slides will offer a clear and accessible explanation of our process so far

Evidence of milestone completion

  1. Our dataset
  2. Our analysis report
  3. Exposition of our RAG model in a public document.
  4. Documentation of our preliminary results (data sets).
  5. After Town Hall presentation slides, and video of the session on QA-DAO’s YouTube

Milestone 2:

-------------------------------------

Model refinement and optimization; model deployment and integration.

We will deliver refinements to the baseline model; and deploy the RAG model in a public, Open Source environment

Milestone outputs:

  1. Model refinements demonstration and documentation
  2. Refined data sets
  3. Deployed RAG model (as Colab notebook on our project repo)
  4. Published datasets (as JSON or CSV on our project repo)
  5. Community exposition - a second After Town Hall presentation explaining and educating the community on our process.

Acceptance criteria

  1. The model will be refined based on testing
  2. The datasets will be refined based on testing
  3. The RAG model will be a working version as a Colab notebook
  4. The datasets will be publicly available on our repo, and will integrate both the CIP text itself and community discussion about it
  5. The After Town Hall session will present a working version and will enable people to suggest test queries

Evidence of milestone completion

  1. A demonstration video of how the model works, and documentation to show how we have refined it
  2. Our datasets
  3. Colab notebook on the project repo
  4. JSON or CSV file on our repo containing the full datasets
  5. After Town Hall slides, and video of the session on YouTube

Final Milestone:

-------------------------------------

Testing and evaluation; documentation and knowledge transfer

We will provide comprehensive testing and evaluation of the deployed system; and deliver the final comprehensive documentation and knowledge transfer.

Milestone outputs

  1. Test reports and performance benchmarks
  2. Completed technical documentation, user guides, and training materials
  3. Community exposition - closing report and video.

Acceptance criteria

  1. Test reports adequately reflect embeddings
  2. Our technical documentation and user guides will be complete and publicly available
  3. Our closing report and video will be accepted by IOG

Evidence of milestone completion

  1. Documentation of test reports with associated data
  2. Documentation available on a GitBook or repository.
  3. Closing report and video

Who is in the project team and what are their roles?

Stephen Whitenstall is the co-founder of Quality-Assurance DAO, <https://qadao.io/>, and has provided project management consultancy for many Catalyst projects since Fund 4, including Catalyst Circle, Audit Circle, Community Governance Oversight, Training & Automation (with Treasury Guild), Governance Guild and Swarm. A Circle V2 representative for funded proposers. Also engaged in cross-chain collaboration with SingularityNET, managing an Archive project. He has 30 years' experience in development, test management, project management, social enterprises in Investment Banking, Telecoms and Local Government. A philosophy honors graduate with an interest in Blockchain governance.

Vanessa Cardui

Community engagement professional with 20+ years' experience of working with communities to record and document their information, archive it, and make it discoverable. Part of QA-DAO, where she leads on documentation; founding member of The Facilitators’ Collective; founding member of the SingularityNET Archives; part of the SingularityNET DeepFunding Focus Group.

Please provide a cost breakdown of the proposed work and resources

-------------------------------------

Data preparation and exploration; RAG model development

Milestone 1 outputs

  1. Cleaned and preprocessed dataset - 1500 ADA
  2. Dataset exploration and analysis report - 3000 ADA
  3. Initial RAG model - 3000 ADA
  4. Preliminary results (data sets) - 1500 ADA
  5. Community exposition - 6000 ADA

Subtotal - 15,000 ADA

-------------------------------------

Model refinement and optimization; model deployment and integration

Milestone 2 outputs:

  1. Model refinements demonstration and documentation - 6000 ADA
  2. Refined data sets - 1500 ADA
  3. Deployed RAG model - 6000 ADA
  4. Published datasets - 1500 ADA
  5. Community exposition - 6000 ADA

Subtotal - 21,000 ADA

-------------------------------------

Testing and evaluation; documentation and knowledge transfer

Final Milestone outputs

  1. Test reports and performance benchmarks - 5000 ADA
  2. Completed technical documentation, user guides, and training materials - 5000 ADA
  3. Community exposition - closing report and video - 5000 ADA

Subtotal - 15,000 ADA

-------------------------------------

Overall Total - 51,000 ADA
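As a quick arithmetic check, the line items above can be summed to confirm the subtotals and the overall total. The amounts are taken directly from the breakdown. The `BUDGET` structure is only an illustrative way of laying them out.

```python
# Line items in ADA, copied from the cost breakdown for each milestone.
BUDGET = {
    "Milestone 1": [1500, 3000, 3000, 1500, 6000],
    "Milestone 2": [6000, 1500, 6000, 1500, 6000],
    "Final Milestone": [5000, 5000, 5000],
}

# Sum the line items per milestone, then sum the milestones.
subtotals = {name: sum(items) for name, items in BUDGET.items()}
total = sum(subtotals.values())

if __name__ == "__main__":
    for name, subtotal in subtotals.items():
        print(f"{name}: {subtotal:,} ADA")
    print(f"Overall Total: {total:,} ADA")
```

The sums reproduce the stated subtotals of 15,000, 21,000 and 15,000 ADA, and the overall total of 51,000 ADA.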

No dependencies.

How does the cost of the project represent value for money for the Cardano ecosystem?

The pay rates given are self-employed rates that take into account the employment overheads of the resources contracted. The rates are based on the low end of US and European averages. The amounts are calculated for each milestone based on the hours to complete.

A freelance project manager can charge from $50/hr. In addition, management of this project requires knowledge of open source software tools and an awareness of blockchain technology and LLMs. [Source - Project Management Fees | Hourly & Consulting Rates | Salaries – OCM Solution]

In addition, all the resources working on this project are taking on the currency risk of being paid in ADA. This means that a fall in the ADA price will result in being paid less or delivering less in each milestone. Any rise in the ADA price will represent a reward for investing in the Cardano ecosystem.

Consequently, given these factors, we believe this proposal offers excellent value for money in a volatile cryptocurrency environment.
