Large language models (LLMs) have shown impressive abilities across a variety of linguistic tasks. However, the effectiveness of these models largely depends on the quality and relevance of the data they are trained on. In this article, we will guide you through the process of preparing your own dataset for LLM training. Whether you're looking to fine-tune a pre-trained model for a specific task or to continue pre-training for niche applications, a well-curated dataset is essential for achieving optimal performance.
Data Preprocessing
Text data can originate from various sources and come in multiple formats, including PDF, HTML, JSON, and Microsoft Office documents like Word, Excel, and PowerPoint. It’s uncommon to have access to text data that is immediately ready to be processed and utilized for LLM training. Therefore, the initial step in preparing data for LLM training involves extracting and compiling data from these diverse sources and formats. During this phase, you will read data from multiple origins, utilizing tools like optical character recognition (OCR) for scanned PDFs, HTML parsers for web documents, and specialized libraries for proprietary formats such as Microsoft Office files. Non-text elements, such as HTML tags and non-UTF-8 characters, are typically removed or normalized.
The next step is to filter out low-quality or undesirable documents. Common filtering strategies include the following (a minimal code sketch of a few of these heuristics appears after the list):
- Metadata filtering based on document names or URLs.
- Content-based filtering to exclude toxic or harmful content and personally identifiable information (PII).
- Regex filters to identify specific character patterns in the text.
- Eliminating documents with excessive repetitive sentences or n-grams.
- Language filters, for example keeping only documents identified as English.
- Additional quality filters, such as the total word count, average word length, and the ratio of alphabetic to non-alphabetic characters.
- Model-based quality filtering using lightweight text classifiers to identify low-quality documents. For instance, the FineWeb-Edu classifier assesses the educational value of web pages.
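As an illustration, the following is a minimal sketch of a few of the heuristic filters above (minimum word count, average word length, alphabetic character ratio, and repeated n-grams). The function names and thresholds are placeholders chosen for this example, not values from a specific production pipeline, and you would tune them to your own corpus.
from collections import Counter

def ngram_repetition_ratio(words, n=3):
    # Fraction of n-grams that repeat an earlier n-gram in the document
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c - 1 for c in counts.values())
    return repeated / len(ngrams)

def passes_quality_filters(text,
                           min_words=50,            # placeholder threshold
                           max_avg_word_len=12,     # placeholder threshold
                           min_alpha_ratio=0.7,     # placeholder threshold
                           max_ngram_repetition=0.3):  # placeholder threshold
    words = text.split()
    if len(words) < min_words:
        return False
    avg_word_len = sum(len(w) for w in words) / len(words)
    if avg_word_len > max_avg_word_len:
        return False
    alpha_chars = sum(ch.isalpha() for ch in text)
    if alpha_chars / max(len(text), 1) < min_alpha_ratio:
        return False
    if ngram_repetition_ratio(words) > max_ngram_repetition:
        return False
    return True

documents = ["Example document text that would be checked here.", "spam spam spam spam spam spam"]
filtered = [doc for doc in documents if passes_quality_filters(doc)]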
Extracting text from various file formats can be complex, but numerous high-level libraries can simplify this process. We will illustrate some examples of extracting text and discuss how to scale this to large collections of documents later.
HTML Preprocessing
When dealing with HTML documents, it’s important to remove non-text data like markup tags, inline CSS, and JavaScript. Additionally, structured objects such as lists, tables, and code blocks should be converted into markdown format. The trafilatura library offers a command-line interface (CLI) and Python SDK for this purpose. The following code snippet illustrates how to extract and preprocess HTML data from a blog post about fine-tuning Meta Llama 3.1 models using torchtune on Amazon SageMaker.
from trafilatura import fetch_url, extract, html2txt

url = "https://aws.amazon.com/blogs/machine-learning/fine-tune-meta-llama-3-1-models-using-torchtune-on-amazon-sagemaker/"

# Download the raw HTML of the page
downloaded = fetch_url(url)
print("RAW HTML\n", downloaded[:250])

# Extract all text content, including navigation and related links
all_text = html2txt(downloaded)
print("\nALL TEXT\n", all_text[:250])

# Extract only the main body content of the page
main_text = extract(downloaded)
print("\nMAIN TEXT\n", main_text[:250])
The trafilatura library provides various functions for handling HTML. In the preceding example, fetch_url retrieves the raw HTML, html2txt extracts the text content (including navigation and related links), and extract focuses on the main body content, which contains the blog post itself. The output will resemble the following:
RAW HTML
<!doctype html> <html lang="en-US" class="no-js aws-lng-en_US" xmlns="http://www.w3.org/1999/xhtml" data-aws-assets="https://a0.awsstatic.com" data-js-version="1.0.681" data-css-version="1.0.538" data-static-assets="https://a0.awsstatic.com" prefix="
ALL TEXT
Skip to Main Content Click here to return to Amazon Web Services homepage About AWS Contact Us Support English My Account Sign In Create an AWS Account Products Solutions Pricing Documentation Learn Partner Network AWS Marketplace Customer Enablement
MAIN TEXT
AWS Machine Learning Blog Fine-tune Meta Llama 3.1 models using torchtune on Amazon SageMaker This post is co-written with Meta’s PyTorch team. In today’s rapidly evolving AI landscape, businesses are constantly seeking ways to use advanced large language...
PDF Processing
PDF is a prevalent format for storing and sharing documents within organizations. Extracting clean text from PDFs can be challenging due to their complex layouts, which may include columns, images, tables, and graphics. PDFs often lack the structural information found in HTML, making parsing more difficult. If possible, it’s advisable to avoid PDF parsing in favor of alternative formats like HTML or markdown. When no alternative exists, libraries such as pdfplumber, pypdf, and pdfminer can assist in extracting text and data from PDFs. Here’s an example of using pdfplumber to parse the first page of the 2023 Amazon annual report in PDF format.
import pdfplumber

pdf_file = "Amazon-com-Inc-2023-Annual-Report.pdf"

with pdfplumber.open(pdf_file) as pdf:
    # Extract text from the page at index 1
    page = pdf.pages[1]
    print(page.extract_text(x_tolerance=1)[:300])
pdfplumber provides bounding box information, which can help eliminate unnecessary text like page headers and footers. However, it only works with PDFs that contain actual text, such as digitally created PDFs. For scanned documents that require OCR, you might consider using services like Amazon Textract.
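For example, the following is a minimal sketch (not from the original post) of using pdfplumber's crop function to exclude likely header and footer bands before extracting text. The 50-point margins are arbitrary placeholder values that you would tune to your documents.
import pdfplumber

pdf_file = "Amazon-com-Inc-2023-Annual-Report.pdf"
header_margin = 50  # points to trim from the top of the page (placeholder value)
footer_margin = 50  # points to trim from the bottom of the page (placeholder value)

with pdfplumber.open(pdf_file) as pdf:
    page = pdf.pages[1]
    # Crop to a bounding box that excludes the header and footer regions
    body = page.crop((0, header_margin, page.width, page.height - footer_margin))
    print(body.extract_text(x_tolerance=1)[:300])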
Office Document Processing
Documents created with Microsoft Office or compatible productivity software are also common. These can include DOCX, PPTX, and XLSX files, and libraries are available to work with them. The following code snippet demonstrates how to use the python-docx library to extract text from a Word document.
from docx import Document

doc_file = "SampleDoc.docx"
doc = Document(doc_file)

# Collect the text of every paragraph and join with newlines
full_text = []
for paragraph in doc.paragraphs:
    full_text.append(paragraph.text)
document_text = '\n'.join(full_text)
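For PPTX files, the python-pptx library plays a similar role. The following is a minimal sketch, assuming a hypothetical file named SamplePresentation.pptx, that collects the text from every text-bearing shape on every slide.
from pptx import Presentation

pptx_file = "SamplePresentation.pptx"  # hypothetical file name for this example
prs = Presentation(pptx_file)

# Collect text from every shape that has a text frame, slide by slide
slide_texts = []
for slide in prs.slides:
    for shape in slide.shapes:
        if shape.has_text_frame:
            slide_texts.append(shape.text_frame.text)
presentation_text = '\n'.join(slide_texts)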
Deduplication
After the preprocessing stage, it is vital to further refine the data by removing duplicates (deduplication) and filtering out substandard content. Deduplication is crucial for preparing high-quality pretraining datasets, as duplicated training examples are a common issue in many natural language processing (NLP) datasets. This issue can lead to bias and compromise the quality and effectiveness of the models.
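As a simple illustration, the following sketch performs exact-match deduplication only (large-scale pipelines often add fuzzy techniques such as MinHash), dropping any document whose normalized text hashes to a value already seen.
import hashlib

def normalize(text):
    # Lowercase and collapse whitespace so trivial variants hash identically
    return ' '.join(text.lower().split())

def deduplicate(documents):
    seen_hashes = set()
    unique_docs = []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode('utf-8')).hexdigest()
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            unique_docs.append(doc)
    return unique_docs

documents = ["Hello world.", "hello   world.", "A different document."]
print(deduplicate(documents))  # the second document is dropped as a duplicate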
In conclusion, creating a robust dataset for LLM training involves multiple steps, including data extraction, preprocessing, filtering, and deduplication.