Tailoring Coding Companions for Organizations | Artificial Intelligence

Generative AI models designed for coding companions primarily rely on publicly accessible source code and natural language data. While the extensive training data allows these models to generate code for frequently utilized functions, they remain oblivious to the code within private repositories and the specific coding styles adhered to by developers. As a result, the suggestions generated may need significant revisions before being suitable for integration into an internal codebase.

To bridge this gap and reduce the need for further manual adjustments, we have developed a customization feature for Amazon CodeWhisperer, which incorporates code knowledge from private repositories atop a language model trained on public code. In this article, we present two methods for customizing coding companions: retrieval-augmented generation and fine-tuning.

The objective of the CodeWhisperer customization feature is to empower organizations to tailor the model using their internal repositories and libraries, generating code recommendations that are specific to the organization. This not only saves time but also aligns with organizational standards and reduces the risk of bugs or security vulnerabilities. This capability is especially beneficial for enterprise software development, addressing challenges such as:

  • Insufficient documentation or information for internal libraries and APIs, which compels developers to spend time analyzing previous code to replicate usage.
  • Inconsistencies in applying enterprise-specific coding practices, styles, and patterns.
  • Unintentional use of deprecated code and APIs by developers.

By using internal code repositories that have already passed code review as additional training data, the language model can surface internal APIs and code snippets that address the aforementioned issues. Because the reference code has already been reviewed and meets a high standard, the likelihood of introducing bugs or security vulnerabilities is greatly diminished. Additionally, by carefully selecting the source files used for customization, organizations can further reduce reliance on deprecated code.

Design Challenges

Customizing code suggestions based on an organization’s private repositories presents several intriguing design challenges. The deployment of large language models (LLMs) for code suggestions incurs fixed costs for availability and variable costs based on the number of tokens generated. Thus, having separate customizations for each customer, while hosting them individually, can lead to prohibitive expenses. Conversely, simultaneous customizations on the same system necessitate a multi-tenant infrastructure to safeguard proprietary code for each customer. Furthermore, the customization feature should provide options to select the appropriate training subset from the internal repository using various metrics (e.g., files with a history of fewer bugs or recently committed code). By leveraging code based on these metrics, customization can be trained with higher-quality code, enhancing the quality of code suggestions. Ultimately, even as code repositories evolve, the costs associated with customization should remain minimal to help enterprises achieve cost savings through increased developer productivity.

A foundational approach to building customization might involve pretraining the model on a unified training corpus that combines the existing public dataset with enterprise-specific code. Although this method is effective, it requires (redundant) individual pretraining on the public dataset for each enterprise, as well as additional deployment costs for hosting customized models that only serve requests from that specific customer. By decoupling the training of public and private code and deploying the customization on a multi-tenant system, we can avoid these redundant expenses.

How to Customize

In general, there are two types of customization techniques available: retrieval-augmented generation (RAG) and fine-tuning (FT).

  1. Retrieval-Augmented Generation (RAG): RAG identifies code snippets within a repository that are similar to a given code fragment (e.g., the code preceding the cursor in the IDE) and augments the prompt used to query the LLM with these matching snippets. This steers the model toward producing more pertinent code (a minimal sketch follows this list). Techniques explored in the literature include Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, REALM, kNN-LM, and RETRO.
  2. Fine-Tuning (FT): FT involves taking a pre-trained LLM and further training it on a specific, smaller codebase (relative to the pretraining dataset) to better adapt it for the desired repository. Fine-tuning modifies the LLM’s weights based on this training, aligning it more closely with the organization’s unique requirements.
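
To make the retrieval-augmented flow concrete, here is a minimal sketch in Python. The helper names `retrieve_snippets` and `generate_completion` are hypothetical stand-ins for the retriever and the LLM inference call; they are not part of any CodeWhisperer API.

```python
def build_augmented_prompt(code_before_cursor, retrieved_snippets, max_snippets=3):
    """Prepend retrieved repository snippets to the in-IDE context.

    `retrieved_snippets` is assumed to be a ranked list of (score, snippet)
    pairs produced by a retriever such as BM25 or an embedding index.
    """
    context_blocks = [
        f"# Relevant snippet from your repository:\n{snippet}"
        for _score, snippet in retrieved_snippets[:max_snippets]
    ]
    # The model sees repository context first, then the code being edited.
    return "\n\n".join(context_blocks + [code_before_cursor])


# Hypothetical usage (retrieve_snippets and generate_completion are stand-ins):
# snippets = retrieve_snippets(index, code_before_cursor)
# prompt = build_augmented_prompt(code_before_cursor, snippets)
# completion = generate_completion(prompt)
```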

Both RAG and fine-tuning are effective methods for improving the performance of LLM-based customization. RAG can quickly adjust to private libraries or APIs with reduced training complexity and cost. However, the process of searching and augmenting retrieved code snippets can increase runtime latency. On the other hand, fine-tuning does not require any context augmentation since the model is already trained on private libraries and APIs; nevertheless, it incurs higher training costs and complexities when managing multiple custom models across various enterprise clients.
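
For fine-tuning, the sketch below uses the Hugging Face `transformers` and `datasets` libraries on a corpus of reviewed internal source files. The base model, file paths, and hyperparameters are illustrative assumptions, not the configuration CodeWhisperer actually uses.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Illustrative choices: any causal code LLM and a directory of reviewed
# internal source files would follow the same pattern.
base_model = "bigcode/starcoderbase-1b"
tokenizer = AutoTokenizer.from_pretrained(base_model)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model)

# Load already-reviewed internal source files as plain text.
dataset = load_dataset("text", data_files={"train": "internal_repo/**/*.py"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# Causal language modeling: the collator copies input IDs into labels.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="custom-code-model",      # illustrative output path
    per_device_train_batch_size=2,
    num_train_epochs=1,
    learning_rate=5e-5,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
).train()
```

In practice, parameter-efficient approaches such as LoRA adapters are commonly used to keep per-customer training and hosting costs down, which fits the multi-tenant constraints described earlier.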

Retrieval-Augmented Generation

The RAG process involves several steps:

  • Indexing: Given a private repository, an index is created by dividing the source code files into manageable chunks. This chunking transforms code snippets into digestible pieces that provide useful information to the model and can be easily retrieved in context. The size of a chunk and its extraction method are design decisions that affect the final output. For instance, chunks may be split by lines of code or by syntactic blocks (see the sketch following this list).
  • Contextual Search: Given a few lines of code above the cursor, the indexed snippets are searched for relevant matches using various retrieval algorithms, as illustrated in the sketch after this list. Options include:
    • Bag of Words (BM25): A retrieval function that ranks a set of code snippets based on query term frequencies and snippet lengths.
    • Semantic Retrieval: This technique converts queries and indexed snippets into high-dimensional vectors, ranking snippets based on semantic similarity. Often, k-nearest neighbors (KNN) or approximate nearest neighbor (ANN) searches are utilized to find snippets with similar semantics.
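
The sketch below illustrates both steps with illustrative parameters: line-based chunking with overlap, followed by BM25 ranking via the third-party `rank_bm25` package. The chunk size, the whitespace tokenizer, and the package choice are assumptions for demonstration only.

```python
from rank_bm25 import BM25Okapi  # third-party package, used here for illustration


def chunk_source(source: str, chunk_lines: int = 30, overlap: int = 5):
    """Split a source file into overlapping line-based chunks.

    Line-based chunking is one of the strategies mentioned above; splitting
    on syntactic blocks (e.g., function boundaries) is an alternative.
    """
    lines = source.splitlines()
    step = max(chunk_lines - overlap, 1)
    chunks = []
    for start in range(0, len(lines), step):
        chunk = "\n".join(lines[start:start + chunk_lines])
        if chunk.strip():
            chunks.append(chunk)
    return chunks


def build_bm25_index(files: dict[str, str]):
    """Index all chunks from a repository with BM25 over whitespace tokens."""
    chunks = [c for source in files.values() for c in chunk_source(source)]
    return chunks, BM25Okapi([c.split() for c in chunks])


def contextual_search(chunks, bm25, code_before_cursor: str, top_k: int = 3):
    """Rank indexed chunks against the code above the cursor."""
    scores = bm25.get_scores(code_before_cursor.split())
    ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)
    return ranked[:top_k]
```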

BM25 focuses on lexical matching. Thus, swapping “add” for “delete” in an otherwise identical snippet may not significantly alter its BM25 score, because most of the terms still overlap even though the semantics are opposite.
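
By contrast, semantic retrieval compares embeddings rather than tokens. Below is a minimal sketch using the `sentence-transformers` package; the model name and snippets are illustrative assumptions, and how well any given embedding model separates such near-lexical pairs depends on the model.

```python
from sentence_transformers import SentenceTransformer  # third-party, illustrative choice

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed general-purpose embedding model

snippets = [
    "def add_item(cart, item): cart.items.append(item)",
    "def delete_item(cart, item): cart.items.remove(item)",
]
query = "add the selected item to the shopping cart"

# Embed snippets and query, then rank by cosine similarity (a simple nearest-neighbor search).
snippet_vecs = model.encode(snippets, normalize_embeddings=True)
query_vec = model.encode([query], normalize_embeddings=True)[0]
similarities = snippet_vecs @ query_vec

for score, snippet in sorted(zip(similarities, snippets), key=lambda p: p[0], reverse=True):
    print(f"{score:.3f}  {snippet}")
```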
