Karakuri Unveils Comprehensive Know-How on Leveraging AWS Trainium Amid GPU Price Surge
Karakuri Co., Ltd. (Chuo-ku, Tokyo; CEO: Shimomoto Oda) has published practical know-how on implementing AWS Trainium, a deep-learning accelerator chip developed by AWS. The material is aimed at engineers involved in Large Language Model (LLM) development.
Background and Purpose of Knowledge Release
The surge in generative AI has driven up the cost of high-performance GPUs, leaving many companies short of compute when scaling up LLM training and fine-tuning. AWS Trainium offers strong cost performance compared with conventional GPUs, but using it effectively requires specialized knowledge, including familiarity with its dedicated software development kit (the AWS Neuron SDK) and the ability to migrate computation graphs. There has been clear demand for practical guides on this topic written in Japanese.
In response, Karakuri has structured its practical knowledge of AWS Trainium and made it publicly accessible. This initiative aims to empower more engineers to utilize AWS Trainium effectively, thus contributing to the diversification of technical options in AI development within Japan.
Technical Depth and Features of the Released Know-How
The public knowledge consists of several components aimed at engineers with foundational understanding of shell operations, PyTorch, and transformer architectures. Here’s a breakdown of the included elements:
1. Introduction to AWS Trainium: Getting started with AWS Trainium on a `trn1.2xlarge` instance, checking NeuronCore utilization with `neuron-top`, and explaining the chip's distinctive deferred-execution model (Lazy Mode).
2. Building Compute Clusters: Detailed guidance on constructing large-scale training infrastructure for Trn1 instances using AWS ParallelCluster and CloudFormation, presented entirely from the command line (CLI).
3. Implementing Distributed Training for LLMs: Instructions for setting up a training environment with NeuronX Distributed Training (NxDT), covering checkpoint conversion, Ahead-Of-Time (AOT) compilation, and running distributed training jobs.
4. Cutting-Edge Model Porting Techniques: An in-depth look at adapting the architecture of Llama 3-based models to the custom parallel layers provided by NxDT.
5. The Theory of Distributed Training: Principles behind the key distributed training strategies, Data Parallel (DP), Tensor Parallel (TP), and Pipeline Parallel (PP), along with how each applies to AWS Trainium environments.
Porting Llama 3-based models is cutting-edge know-how, vital for running new models on new accelerators, and it significantly broadens the horizons of LLM development.
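The "custom parallel layers" behind such ports typically follow the column-parallel pattern: the weight matrix is split column-wise across devices, each device computes its slice of the output, and an all-gather concatenates the slices. The following is a framework-free sketch of that pattern under those assumptions; NxDT supplies its own implementations, and none of these function names come from it.

```python
# Sketch of a Tensor Parallel column-parallel linear layer: split the weight
# matrix column-wise, compute partial outputs per "device", then all-gather.

def matmul(x, w):
    """x: input vector; w: in_dim x out_dim matrix -> output vector."""
    out_dim = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(out_dim)]

def split_columns(w, parts):
    """Split a weight matrix column-wise into `parts` equal shards."""
    step = len(w[0]) // parts
    return [[row[p * step:(p + 1) * step] for row in w] for p in range(parts)]

def column_parallel_forward(x, w, parts=2):
    shards = split_columns(w, parts)
    partials = [matmul(x, ws) for ws in shards]  # one matmul per "device"
    return [v for p in partials for v in p]      # all-gather: concatenate

x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
assert column_parallel_forward(x, w) == matmul(x, w)  # parallel == serial
```

The assertion checks the key invariant: sharding the columns changes where the work runs, not the result.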
For more detailed insights, refer to the "AWS Trainium 50 Exercises" on the KARAKURI Techblog.
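The Lazy Mode mentioned in item 1 refers to a deferred-execution style: operations are recorded into a graph rather than run immediately, and the whole graph is compiled and executed at an explicit synchronization point (analogous to `mark_step` in PyTorch/XLA). This toy Python sketch shows only the deferral idea; the class and function names are hypothetical and have nothing to do with the actual Neuron SDK API.

```python
# Toy sketch of lazy (deferred) execution: ops build a graph, and nothing
# is computed until an explicit sync point materializes the result.

class LazyTensor:
    def __init__(self, value=None, op=None, args=()):
        self.value, self.op, self.args = value, op, args

    def __add__(self, other):
        return LazyTensor(op="add", args=(self, other))  # record, don't compute

    def __mul__(self, other):
        return LazyTensor(op="mul", args=(self, other))

def mark_step(t):
    """Materialize the recorded graph (stand-in for compile + device run)."""
    if t.op is None:
        return t.value
    a, b = (mark_step(arg) for arg in t.args)
    return a + b if t.op == "add" else a * b

x = LazyTensor(2.0)
y = LazyTensor(3.0)
z = x * y + x            # no arithmetic has happened yet; z is a graph node
print(z.value)           # None: evaluation is deferred
print(mark_step(z))      # 8.0: the graph runs at the sync point
```

The practical consequence, which the released material explains, is that timing individual operations is misleading under Lazy Mode: costs only appear at the synchronization point.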
Future Outlook
Karakuri intends to leverage the insights and feedback gained through this knowledge release to address technical challenges in LLM development while fostering innovation. Continuous updates will include adaptations for upcoming versions of AWS Trainium, like Trn2, ensuring the community remains equipped with the latest accelerator utilization know-how.
Company Overview
Karakuri operates under the vision of “Friendly Technology” and aims to implement AI in customer support through large-scale language models (LLMs). Since 2018, Karakuri has explored transformer models like BERT and has been researching LLMs, including GPT, since 2022. The company offers an AI series tailored for customer support, chosen by leading firms, including Takashimaya, SBI Securities, Seven-Eleven Japan, and Hoshino Resorts.
Key Achievements
- 2018: Winner at ICC Summit Startup Catapult
- 2020: Accepted into Google for Startups Accelerator 2020
- 2022: Selected for Google for Startups Growth Academy Tech 2022
- 2023: Chosen for the AWS LLM Development Support Program
- 2024: Recognized by the AI Practicalization Promotion Program
- 2024: Invited to Meta's exclusive generative AI developers conference
- 2024: Accepted into the Ministry of Economy, Trade and Industry's "GENIAC" program
Address: 5F Camel Tsukiji II, 2-7-3 Tsukiji, Chuo-ku, Tokyo 104-0045
Founded: October 3, 2016
CEO: Shimomoto Oda
Business Focus: Development, provision, and operation of customer support AI series