Qlean Dataset Unveils New Japanese Dialogue Dataset for AI Research and Development

Introduction to the Qlean Dataset

Visual Bank Co., Ltd., headquartered in Minato-ku, Tokyo, is proud to announce a significant addition to its AI training data offerings. Under its Qlean Dataset division, the company is launching the "Japanese 2-Speaker LR-Separated Private Dialogue Speech with Transcripts." This dataset is particularly valuable for developers and researchers working on AI speech technology, providing approximately 500 hours of rich audio content for model training and evaluation.

Dataset Overview

The newly launched dataset contains around 500 hours of dialogue audio featuring 87 pairs of Japanese speakers. Notably, these recordings are LR-separated, meaning each speaker's voice is captured on independent channels. This setup allows researchers to utilize the audio for various applications, including speaker diarization, automatic speech recognition (ASR), and large language model (LLM) fine-tuning. The conversations within this dataset are centered around natural discourse reflecting personal hobbies, skills, and values, closely simulating spontaneous speech rather than scripted dialogues.
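To make the LR-separated layout concrete, the following is a minimal sketch of splitting a stereo signal into one track per speaker. It assumes the MP3 audio has already been decoded to interleaved 16-bit PCM by some external tool, and that the left channel holds one speaker and the right channel the other. The function name `split_channels` and the sample values are illustrative and not part of the dataset or its documentation.

```python
from array import array


def split_channels(interleaved):
    """Split interleaved stereo samples [L0, R0, L1, R1, ...] into (left, right)."""
    if len(interleaved) % 2 != 0:
        raise ValueError("stereo data must contain an even number of samples")
    left = interleaved[0::2]   # every even-indexed sample: left channel
    right = interleaved[1::2]  # every odd-indexed sample: right channel
    return left, right


# Tiny synthetic signal standing in for decoded PCM (hypothetical values).
frames = array("h", [100, -5, 200, -6, 300, -7])
speaker_a, speaker_b = split_channels(frames)
print(list(speaker_a), list(speaker_b))
```

Because each channel carries a single speaker, the two resulting tracks can be processed independently or recombined later.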

Features of the Dataset

Diverse Speakers: The dataset covers Japanese speakers of varied genders and ages.

Web Conferencing Format: The dialogues were recorded in a web conferencing format, which mirrors real-life online discussions and can help when training ASR systems for real-world conditions.

Audio Specifications: The audio files are provided in MP3 format, with a sampling rate of 48 kHz and a bit rate of 192 kbps, ensuring high-quality audio suitable for analysis.

Transcription Included: Each recording comes with a transcript, making it easier for AI developers to align the audio data with textual analysis and training tasks.
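For planning storage, the stated audio specifications allow a back-of-the-envelope size estimate. The sketch below assumes constant-bit-rate encoding at 192 kbps and takes the "approximately 500 hours" figure at face value. The actual delivered size may differ (for example with variable bit rate, metadata, or transcript files), and the helper name `mp3_size_bytes` is hypothetical.

```python
def mp3_size_bytes(hours, bitrate_kbps):
    """Estimate the size of constant-bit-rate audio in bytes."""
    seconds = hours * 3600
    bits = seconds * bitrate_kbps * 1000  # kbps means 1000 bits per second
    return bits // 8


size = mp3_size_bytes(500, 192)
size_gb = size / 1e9  # decimal gigabytes
print(f"{size_gb:.1f} GB")  # prints "43.2 GB"
```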

Applications of the Dataset

The dataset supports several AI applications:

1. Speaker Diarization: Because each speaker is recorded on a separate channel, the LR-separated audio provides ground truth for evaluating speaker diarization models. Models built with toolkits such as pyannote.audio and NeMo can be evaluated and fine-tuned against these references.

2. ASR Model Fine-Tuning: With transcripts accompanying each recording, the dataset is ideal for adapting existing ASR models such as Whisper or ESPnet to conversational domains, enhancing their ability to process spontaneous speech characteristics.

3. Speech Separation Research: By utilizing the LR-separated format, researchers can test and benchmark their speech separation models, generating synthetic mixed audio for performance evaluations.

4. Pre-Training Speech LLMs: Audio paired with corresponding transcripts provides training material for developing speech language models that handle speech and text simultaneously.

5. Customization Options: For users seeking specific datasets, Visual Bank offers custom recording solutions, which can accommodate unique speaker profiles or conversation topics, expanding the dataset's applicability to niche domains such as healthcare or finance.
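As a rough illustration of items 1 and 3 above, the sketch below mixes two per-speaker channels into a single synthetic mono signal and derives simple reference speech segments from each clean channel. It is an assumption-laden toy: the frame-energy threshold, the function names `mix_to_mono` and `active_segments`, and the sample values are all hypothetical, and real recordings would likely need a proper voice activity detector. The dataset's own tooling or annotation format is not described in the announcement.

```python
def mix_to_mono(left, right):
    """Average two equal-length channels into one mono signal."""
    if len(left) != len(right):
        raise ValueError("channels must have the same length")
    return [(l + r) / 2 for l, r in zip(left, right)]


def active_segments(samples, frame_len, threshold):
    """Return (start_frame, end_frame) pairs, end exclusive, for runs of
    frames whose mean absolute amplitude exceeds the threshold."""
    segments = []
    start = None
    n_frames = len(samples) // frame_len
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        level = sum(abs(s) for s in frame) / frame_len
        if level > threshold:
            if start is None:
                start = i  # a new active run begins
        elif start is not None:
            segments.append((start, i))  # the run ended at the previous frame
            start = None
    if start is not None:
        segments.append((start, n_frames))  # run continues to the end
    return segments


# Synthetic channels: speaker A talks first, speaker B second (made-up values).
left = [0.5, -0.5, 0.4, -0.4, 0.0, 0.0, 0.0, 0.0]
right = [0.0, 0.0, 0.0, 0.0, 0.6, -0.6, 0.3, -0.3]

mono = mix_to_mono(left, right)  # input for a separation or diarization model
ref_a = active_segments(left, frame_len=2, threshold=0.1)   # reference for speaker A
ref_b = active_segments(right, frame_len=2, threshold=0.1)  # reference for speaker B
print(ref_a, ref_b)
```

A model under test would receive only `mono`, and its output could then be scored against the per-channel references or the original clean channels.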

Availability and Access

The dataset is available for both commercial and research use. Sample data is available for interested parties who wish to examine the dataset before purchase. Visual Bank typically delivers existing datasets within two business days and also offers custom data collection upon request.

About Visual Bank Co., Ltd.

Visual Bank is dedicated to advancing AI development through next-generation data infrastructure. By fostering an environment where AI developers can source high-quality data without legal concerns, the company aims to enable innovation across various industries. The launch of this dataset under the Qlean Dataset division reflects Visual Bank's commitment to meeting the growing demand for AI training data in speech applications.

In summary, the launch of the "Japanese 2-Speaker LR-Separated Private Dialogue Speech with Transcripts" marks a significant step forward in the availability of high-quality audio datasets for AI developers, particularly in the field of speech recognition and dialogue systems.

For additional information regarding Qlean Dataset and Visual Bank's offerings, please visit Visual Bank's website.