Visual Bank Launches Qlean Dataset for Safe Japanese AI Training in Multimodal Contexts

Introduction to Qlean Dataset

Visual Bank Inc., located in Minato-ku, Tokyo, has recently announced an innovative initiative in AI data solutions through its subsidiary, amanaimages Inc. The newly launched Qlean Dataset is specially designed to offer safety-aligned training datasets targeting multimodal artificial intelligence systems. As AI technology advances, the need for specialized datasets that provide both high quality and contextual relevance has become evident, especially in the Japanese market.

The Importance of Contextual Safety

The evolution of generative AI towards multimodal systems—encompassing text, images, videos, and audio—has raised significant challenges in ensuring safety and compliance. In particular, the pressing issues of harmful outputs and compliance risks require urgent attention during the development of foundation models like large language models (LLMs) and vision-language models (VLMs). Traditional methods relying on post-deployment filtering often fail to balance creativity and safety. Thus, embedding safety protocols from the design and training phases—known as "Safety-by-Design"—is crucial.

Challenges in the Japanese AI Landscape

In Japan, there are unique challenges that necessitate this new approach:
1. Cultural Misalignment: Existing global datasets often do not adequately reflect Japan's unique cultural nuances, local legal frameworks, or specific taboos around topics such as copyright and personality rights.
2. Complex Multimodal Interactions: Risks associated with combining various types of content (e.g., images and prompts) lead to difficulties in detecting inappropriate combinations, which traditional benchmarks may overlook.

These factors highlight the need for tailored data solutions that address the specific concerns of the Japanese market and comply with local regulations.

What Qlean Dataset Offers

The Qlean Dataset aims to support the safety alignment in AI training across various models through:

- Design and Collection: Offering specialized training data that complies with Japan-specific ethical and regulatory standards.
- Multimodal Risk Data: Creating datasets that address potential risks arising from various combinations of media, including images, videos, audio, and text.
- Assessment and Annotation: Implementing rigorous labeling frameworks to evaluate intellectual property risks and ensure demographic fairness, crucial in minimizing biases across AI outputs.

Addressing Core Challenges

To ensure the safety of AI systems, several core challenges in data preparation are tackled:

- Local Ethical Standards: The Qlean Dataset aligns with Japan’s ethical and legal standards, which are often more nuanced than international frameworks.
- Data Collection Constraints: By adhering to strict legal compliance and carefully managed workflows, the dataset mitigates potential legal risks while ensuring data integrity.
- Composite Risk Identification: Leveraging advanced expertise to assess and label risks associated with complex media interactions, ensuring comprehensive risk coverage.
- Annotator Well-being: Putting a focus on the mental health of annotators dealing with sensitive content while maintaining consistent data annotation standards.

Detailed Data Solutions by Modality

1. Text (LLMs): Adapting safety benchmarks to reflect local cultures and norms, along with response strategies for directing potentially harmful prompts.
2. Image Generation Models: Conducting detailed IP risk evaluations to minimize risk associated with style imitations or copyright infringement while developing NSFW classification aligned with Japanese legal standards.
3. VLMs: Establishing frameworks for cross-modal risk understanding and ensuring diversity in demographic representation to avoid biases based on race, gender, or age.

Conclusion: Supporting Safe AI Development

The Qlean Dataset is an essential resource for fostering both safety and reliability in AI systems, especially as they transition into broader societal applications. By focusing on advanced methodologies in data preparation and adhering to cultural specifics, Qlean Dataset empowers developers to build responsible and compliant AI technologies. With continuous collaborations with public institutions and major manufacturers, the dataset ensures that AI models do not just generate creative outputs but do so within a safe and ethically grounded framework.

About Qlean Dataset

The Qlean Dataset is part of amanaimages Inc. (a Visual Bank Group company) and is positioned as a commercially viable AI training data solution. It supports a variety of data formats for safe, compliant use in both commercial and research applications, continually building on its resources in collaboration with leading media and data rights holders. Visit the Qlean Dataset website for more information about its offerings.