Please describe your proposed solution
-------------------------------------
Overview
CIP 1694 (a Cardano Improvement Proposal) was an important step in Cardano’s Voltaire roadmap. The CIP was drafted, edited, reviewed, consulted on and agreed across a wide variety of sources. This diversity of sources is an important feature of our decentralised community, but it comes with challenges: despite a great deal of community communication, many of these sources remain opaque and hard to access.
When vital information is scattered across multiple sources in this way, it causes confusion and makes it hard for people to educate themselves on the topic, analyse the data for relevant insights, or verify exactly what was said at any given point in the process. This undermines the transparency and accountability of Cardano’s governance materials and processes.
Simply applying a generalist Large Language Model, such as OpenAI’s GPT models, will generate misleading responses: external data sources are not carefully incorporated, so the LLM fills its knowledge gaps with hallucinations. The effectiveness of unaugmented LLMs is therefore limited, and deeper understanding based on context-specific information is not possible without a way to embed specific external data sources.
Reasons for our approach: Retrieval-Augmented Generation (RAG)
-------------------------------------
This proposal will address these limitations by developing a language-model application using a Retrieval-Augmented Generation (RAG) approach, which can be tailored to specific datasets. RAG pairs a data retrieval component with a language model, enhancing its ability to generate context-relevant responses.
The aim is to simplify question answering and enable more accurate answers. The RAG process provides contextual information retrieval and synthesis (comparison across data sources) to ensure accuracy and comprehension.
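The retrieve-then-generate loop described above can be sketched in plain Python. This is a minimal illustration only: the sample passages, the word-overlap scoring (a stand-in for real vector-embedding similarity) and the prompt wording are all hypothetical, not the project's actual implementation.

```python
# Minimal sketch of the RAG retrieve-then-generate loop.
# Word-overlap scoring stands in for real embedding similarity;
# in practice an embedding model and vector store would be used.

# Hypothetical passages drawn from a governance document corpus.
PASSAGES = [
    "CIP-1694 describes on-chain governance mechanisms for Cardano.",
    "The Voltaire era introduces voting and treasury management.",
    "Hydra is a layer-2 scalability solution for Cardano.",
]

def score(query: str, passage: str) -> int:
    """Count shared lowercase words between query and passage."""
    return len(set(query.lower().split()) & set(passage.lower().split()))

def retrieve(query: str, k: int = 1) -> list[str]:
    """Return the k passages most similar to the query."""
    ranked = sorted(PASSAGES, key=lambda p: score(query, p), reverse=True)
    return ranked[:k]

def build_prompt(query: str) -> str:
    """Augment the user's question with retrieved context."""
    context = "\n".join(retrieve(query))
    return (
        "Answer using ONLY the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}\n"
    )

print(build_prompt("What governance mechanisms does CIP-1694 describe?"))
```

The augmented prompt, rather than the bare question, is what gets passed to the general LLM, which is what keeps its answers grounded in the local source.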
A further aim of the proposal is as a demonstration/proof-of-concept. By developing a RAG retrieval process with a very specific data source (the CIP itself and related data), we aim to demonstrate how it can support community members to self-educate and keep informed about a specific topic - in this case, Voltaire governance.
Additionally, the RAG approach allows greater control over prompt-engineering constraints and unit testing, so that quality assurance and ethical safeguards can be applied.
(See “What is retrieval-augmented generation?” for further information)
Who will we engage?
-------------------------------------
QADAO will apply its extensive experience in community engagement and outreach to publicise our Open Source workflow and invite the community to reuse our methods, code and documentation. We also intend to engage the community with expositions at the close of each milestone, where we will demonstrate how we have worked, and invite the community to engage with and learn from our process.
How will we demonstrate or prove our impact?
-------------------------------------
We will demonstrate our impact by providing educational step-by-step documentation and exposition of our RAG workflow.
Overview of the proposed RAG model architecture
-------------------------------------
Our methodology will focus on the most accessible models as a proof of concept. Each step will be hosted, processed and documented in Colab (Python) notebooks, which will be committed to the project’s public GitHub repository under an Open Source Apache 2.0 licence.
LangChain libraries will be used to build the model architecture. Large Language Models (e.g. Open Source models hosted on HuggingFace) will be assessed for semantic search and generation in combination with local datasets.
Data preprocessing and preparation
-------------------------------------
The data will be sourced, cleaned and prepared before being embedded and stored in a vector data store.
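As a sketch of this preparation step, the text below is split into overlapping chunks, a common preprocessing pattern before embedding. The chunk size, overlap and the toy character-frequency "embedding" are illustrative assumptions, not the project's actual parameters or embedding model.

```python
# Sketch of preparing a document for a vector data store:
# split into overlapping chunks, then embed each chunk.
# The character-frequency "embedding" is a toy stand-in for a
# real sentence-embedding model; sizes are illustrative only.
from collections import Counter

def chunk_text(text: str, size: int = 100, overlap: int = 20) -> list[str]:
    """Split text into chunks of `size` characters, overlapping by `overlap`."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def embed(chunk: str) -> Counter:
    """Toy embedding: lowercase character frequencies."""
    return Counter(chunk.lower())

document = "CIP-1694 proposes a governance framework for Cardano. " * 10
vector_store = [(chunk, embed(chunk)) for chunk in chunk_text(document)]

print(len(vector_store), "chunks embedded")
```

Overlap between neighbouring chunks helps preserve context that would otherwise be cut at a chunk boundary, which matters for retrieval quality later.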
Model training and fine-tuning
-------------------------------------
The embeddings in the vector data store will provide the basis for model training and fine-tuning. Sample queries and expected responses, sourced from the community, will be tested against the model.
Evaluation metrics and techniques
-------------------------------------
Conversation and query chains will be built and tested against the model. These will take the form of Q&A interactions combining a constrained local source, a general LLM, and a series of prescriptive prompt instructions.
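The prescriptive prompt instructions that constrain the general LLM to the local source can be expressed as a template. The instruction wording, the refusal sentence and the context markers below are hypothetical examples, not the project's final prompts.

```python
# Sketch of a prescriptive prompt template that constrains a general
# LLM to a locally retrieved context. The instruction wording and
# refusal sentence are hypothetical, not the project's final prompts.

TEMPLATE = (
    "You are answering questions about CIP-1694.\n"
    "Use ONLY the context between the markers.\n"
    "If the answer is not in the context, reply: "
    "'I cannot answer from the provided sources.'\n"
    "--- CONTEXT ---\n{context}\n--- END CONTEXT ---\n"
    "Question: {question}\nAnswer:"
)

def build_constrained_prompt(context: str, question: str) -> str:
    """Fill the template; a downstream LLM call would consume this string."""
    return TEMPLATE.format(context=context, question=question)

print(build_constrained_prompt(
    context="CIP-1694 defines three governance bodies.",
    question="How many governance bodies does CIP-1694 define?",
))
```

Because the template is a plain string-building function, it can be unit-tested in isolation, which is one way the quality-assurance and ethical safeguards mentioned earlier can be enforced.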
Documentation and Knowledge Transfer
-------------------------------------
This entire process - the workflow, the code, the data processing, the model training and the evaluation - will be fully documented as it progresses and published at the close of the project.