Utilizing a DAO for LLM Training Data Management, Part 3: Transitioning from IPFS to the Knowledge Base

In the first installment of this series, we explored the concept of using a decentralized autonomous organization (DAO) to oversee the lifecycle of an AI model, with a specific emphasis on the acquisition of training data. We described the overall architecture, set up a large language model (LLM) knowledge base with Amazon Bedrock, and populated it with Ethereum Improvement Proposals (EIPs). In the second part, we created and deployed a bare-bones smart contract on the Ethereum Sepolia testnet, using Remix and MetaMask, to govern which training data can be uploaded to the knowledge base and by whom.

In this article, we will configure Amazon API Gateway and deploy AWS Lambda functions that transfer data from the InterPlanetary File System (IPFS) to Amazon Simple Storage Service (Amazon S3) and start a knowledge base ingestion job.

Overview of the Solution

In the previous part, we developed a smart contract that holds IPFS file identifiers along with Ethereum addresses authorized to upload the contents of these files for model training. This post will concentrate on the journey this data takes from a verified data provider (the owner of one of those Ethereum addresses) to the LLM knowledge base. The data flow is illustrated in the diagram below.

To implement this data flow, we will create the following components:

  • A Lambda function named s32kb that updates the knowledge base whenever new content is added to the S3 bucket.
  • An S3 trigger to activate the s32kb Lambda function.
  • A Lambda function called ipfs2s3 that handles the uploading of content from IPFS to the S3 bucket.
  • An API Gateway to invoke the ipfs2s3 function.

Prerequisites

Before proceeding, please ensure you have completed the necessary steps outlined in Parts 1 and 2 of this series.

Setting Up the s32kb Lambda Function

This section will guide you through the setup of the s32kb function.

Creating the s32kb IAM Role

Before creating the Lambda function, you need to create an AWS Identity and Access Management (IAM) role that the function will use at run time. Follow these steps:

  1. Open an AWS CloudShell terminal and upload the following files:
    • s32kb_trust_policy.json – The trust policy for the s32kb function role.
    • s32kb_inline_policy_template.json – The inline policy template for the role.
    • s32kb.py – The Python code for the Lambda function that automatically refreshes the knowledge base when new files are uploaded.
    • s32kb.py.zip – The zipped file containing the Lambda code.
  2. Create a JSON document for the inline policy (this policy is necessary to grant the Lambda function permissions to log to Amazon CloudWatch):
    ACCOUNT=$(aws sts get-caller-identity --query "Account" --output text) && 
    cat s32kb_inline_policy_template.json | sed -e s+ACCOUNT+$ACCOUNT+g > s32kb_inline_policy.json
  3. Create the IAM role:
    aws iam create-role \
    --role-name s32kb \
    --assume-role-policy-document file://s32kb_trust_policy.json
  4. Attach the AmazonBedrockFullAccess managed policy:
    aws iam attach-role-policy --role-name s32kb --policy-arn arn:aws:iam::aws:policy/AmazonBedrockFullAccess
  5. Attach the previously generated inline policy:
    aws iam put-role-policy \
    --role-name s32kb \
    --policy-name s32kb_inline_policy \
    --policy-document file://s32kb_inline_policy.json

Creating the s32kb Lambda Function

Now, let’s create the s32kb function:

  1. Open the s32kb.py file in your preferred editor (we use vi in this example) and examine its contents. The file creates a client for the Amazon Bedrock agent API and uses it to start a knowledge base ingestion job; a minimal sketch of this logic appears after these steps. The function reads two environment variables:
    • The KB_ID variable, which holds the knowledge base ID.
    • The KB_DATA_SOURCE_ID variable, which contains the data source ID (the S3 bucket).
  2. To look up those values:
    1. In the Amazon Bedrock console, navigate to Knowledge bases in the navigation pane.
    2. Select the crypto-ai-kb knowledge base.
    3. Note the knowledge base ID under the Knowledge base overview section.
    4. Under Data source, choose the EIPs data source.
    5. Record the data source ID from the Data source overview.
  3. Export these values in CloudShell:
    export KB_ID=<Knowledge base ID>
    export KB_DATA_SOURCE_ID=<Data source ID>
  4. Create the Lambda function:
    ACCOUNT=$(aws sts get-caller-identity --query "Account" --output text) &&
    aws lambda create-function \
    --function-name s32kb \
    --timeout 300 \
    --runtime python3.12 \
    --architectures x86_64 \
    --zip-file fileb://s32kb.py.zip \
    --handler s32kb.handler \
    --role arn:aws:iam::$ACCOUNT:role/s32kb \
    --environment "Variables={KB_ID=$KB_ID,KB_DATA_SOURCE_ID=$KB_DATA_SOURCE_ID}"
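
For reference, a minimal sketch of what s32kb.py does is shown below. This is an illustration rather than the exact file contents: it assumes the s32kb.handler entry point configured above and returns only the HTTP status code that you will later look for in the logs.

import os
import boto3

# Read the knowledge base and data source IDs once per execution environment
KB_ID = os.environ["KB_ID"]
KB_DATA_SOURCE_ID = os.environ["KB_DATA_SOURCE_ID"]

bedrock_agent = boto3.client("bedrock-agent")

def handler(event, context):
    # Start an ingestion job so newly uploaded S3 objects are indexed into the knowledge base
    response = bedrock_agent.start_ingestion_job(
        knowledgeBaseId=KB_ID,
        dataSourceId=KB_DATA_SOURCE_ID,
    )
    print(response)
    return {"HTTPStatusCode": response["ResponseMetadata"]["HTTPStatusCode"]}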

Creating an S3 Trigger for the s32kb Lambda Function

To automatically run the s32kb function whenever a new file is uploaded to the S3 bucket, follow these steps:

  1. In the Lambda console, select Functions from the navigation pane.
  2. Choose the s32kb function.
  3. Click Add trigger.
  4. For Trigger configuration, select S3 as the source.
  5. Choose the bucket named crypto-ai-kb-<your_account_id>.
  6. Select All object create events and All object delete events for Event types.
  7. Tick the acknowledgement checkbox and click Add.
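
If you prefer to configure the trigger programmatically instead of through the console, the following boto3 sketch shows an equivalent setup. It assumes the s32kb function and the crypto-ai-kb-<your_account_id> bucket created earlier in this series.

import boto3

account = boto3.client("sts").get_caller_identity()["Account"]
bucket = f"crypto-ai-kb-{account}"
function_arn = boto3.client("lambda").get_function(FunctionName="s32kb")["Configuration"]["FunctionArn"]

# Allow the S3 bucket to invoke the s32kb function
boto3.client("lambda").add_permission(
    FunctionName="s32kb",
    StatementId="s3-invoke-s32kb",
    Action="lambda:InvokeFunction",
    Principal="s3.amazonaws.com",
    SourceArn=f"arn:aws:s3:::{bucket}",
    SourceAccount=account,
)

# Send object create and delete events from the bucket to the function
boto3.client("s3").put_bucket_notification_configuration(
    Bucket=bucket,
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": function_arn,
                "Events": ["s3:ObjectCreated:*", "s3:ObjectRemoved:*"],
            }
        ]
    },
)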

Testing the s32kb Lambda Function

Next, we will add a new file to the bucket and verify that the Lambda function is triggered. Building on the danksharding discussion from Part 1, we will add the network upgrade specification for the “Cancun” upgrade to the knowledge base.

Open a CloudShell terminal and execute the following commands:

ACCOUNT=$(aws sts get-caller-identity --query "Account" --output text) && 
wget https://github.com/ethereum/execution-specs/blob/master/network-upgrades/mainnet-upgrades/cancun.md && 
aws s3 cp ./cancun.md s3://crypto-ai-kb-$ACCOUNT/

To check if the Lambda function executed successfully:

  • Access the CloudWatch console and navigate to the /aws/lambda/s32kb log group.
  • Look for a log stream with a Last event time corresponding to the current time and select it.
  • Review the logs to ensure that the Lambda function returned a 202 HTTPStatusCode.
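
You can also pull the most recent log events programmatically. The following is a minimal sketch that reads the newest stream from the log group above:

import boto3

logs = boto3.client("logs")
log_group = "/aws/lambda/s32kb"

# Find the most recently written log stream for the function
streams = logs.describe_log_streams(
    logGroupName=log_group,
    orderBy="LastEventTime",
    descending=True,
    limit=1,
)

# Print its events and look for the 202 HTTPStatusCode in the function's output
events = logs.get_log_events(
    logGroupName=log_group,
    logStreamName=streams["logStreams"][0]["logStreamName"],
)
for event in events["events"]:
    print(event["message"], end="")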

Also, verify the status of the Amazon Bedrock job:

  • In the Amazon Bedrock console, go to the crypto-ai-kb knowledge base.
  • Under Data source, confirm that the Last sync time value reflects the current time.
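
The same check can be done with the Amazon Bedrock agent API. A minimal sketch, reusing the KB_ID and KB_DATA_SOURCE_ID values exported earlier:

import os
import boto3

bedrock_agent = boto3.client("bedrock-agent")

# List recent ingestion jobs for the EIPs data source and print their status
jobs = bedrock_agent.list_ingestion_jobs(
    knowledgeBaseId=os.environ["KB_ID"],
    dataSourceId=os.environ["KB_DATA_SOURCE_ID"],
)
for job in jobs["ingestionJobSummaries"]:
    print(job["ingestionJobId"], job["status"], job["updatedAt"])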

If you want to delve deeper, consider querying the knowledge base for information specifically mentioned in the network upgrade specification.
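
For example, the following minimal sketch uses the retrieve API of Amazon Bedrock knowledge bases to pull passages relevant to a Cancun-specific question (the question text is only an illustration):

import os
import boto3

bedrock_runtime = boto3.client("bedrock-agent-runtime")

# Retrieve knowledge base chunks relevant to the Cancun network upgrade specification
results = bedrock_runtime.retrieve(
    knowledgeBaseId=os.environ["KB_ID"],
    retrievalQuery={"text": "Which EIPs are included in the Cancun network upgrade?"},
)
for result in results["retrievalResults"]:
    print(round(result["score"], 3), result["content"]["text"][:200])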

Setting Up the ipfs2s3 Lambda Function

This section will guide you through the steps necessary to establish the ipfs2s3 function.
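
The purpose of ipfs2s3 is to fetch a file from IPFS and copy it into the knowledge base bucket. As a preview, here is a minimal sketch of such a handler; the IPFS_GATEWAY and KB_BUCKET environment variables and the query-string parameters are assumptions for illustration, and the code deployed in this walkthrough may differ.

import os
import urllib.request
import boto3

# Both values are illustrative; point them at your own gateway and bucket
IPFS_GATEWAY = os.environ.get("IPFS_GATEWAY", "https://ipfs.io/ipfs/")
KB_BUCKET = os.environ["KB_BUCKET"]

s3 = boto3.client("s3")

def handler(event, context):
    # Expect the IPFS content identifier (CID) and an optional file name from API Gateway
    params = event["queryStringParameters"]
    cid = params["cid"]
    key = params.get("filename", cid)

    # Download the file from an IPFS HTTP gateway and copy it into the knowledge base bucket
    with urllib.request.urlopen(IPFS_GATEWAY + cid) as response:
        s3.put_object(Bucket=KB_BUCKET, Key=key, Body=response.read())

    return {"statusCode": 200, "body": f"Uploaded {key} to s3://{KB_BUCKET}"}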

Creating the ipfs2s3 IAM Role

To create the ipfs2s3 IAM role, follow these instructions:
