Qlean Dataset Launches a New Japanese Sports Dialogue Audio Corpus for AI Development

Introduction

Visual Bank Inc., based in Minato-ku, Tokyo, has officially launched a fresh dataset titled the "Japanese Two-Speaker Sports Dialogue Audio Corpus with Transcripts." This new offering, part of the Qlean Dataset, is tailored for the development of speech and language-oriented AI technologies, including Automatic Speech Recognition (ASR) and Natural Language Processing (NLP).

Dataset Overview

The dataset introduces a diverse collection of Japanese audio recordings featuring two speakers engaged in natural and unscripted conversations centered around sports themes. It comes equipped with precise transcripts detailing the content of the discussions. The conversations encompass a variety of sports-related topics such as reviews of past games, exchanges of tactics and strategies, and personal impressions of sporting events.

All recordings are conducted without any pre-defined scripts, ensuring that natural conversational structures—such as speaker turn-taking and overlapping dialogue—are captured. This characteristic makes the dataset particularly valuable for research and development in voice recognition technologies, dialogue processing, and spoken language understanding.

Key Features of the Dataset

The "Japanese Two-Speaker Sports Dialogue Audio Corpus" is designed with several key aspects in mind:

- Audio and Text Data: The dataset contains audio data in formats such as WAV and MP3, along with text in TXT, JSON, and CSV formats.
- Subject Attributes: It features both men and women from Japan, aged between 20 and 50, reflecting a representative demographic.
- Recording Duration: The total recording time amounts to approximately 200 hours, with each audio segment lasting between 5 to 60 minutes. Most recordings are sampled at a rate of 44.1 kHz.
- Conversational Contexts: The dataset is structured around various scenarios, including discussions of sports experiences and analysis, with dialogues that proceed freely without scripted guidelines.

Potential Applications

The new dataset offers substantial potential for both research and industrial applications:

Research Use Cases

1. Conversational ASR Model Evaluation: The dataset allows researchers to evaluate the accuracy and error rates of Japanese ASR systems under conditions that replicate real-world conversational dynamics, including speaker turn-taking and overlapping speech.
2. Dialogue Understanding Research: This corpus supports the exploration of dialogue comprehension and discourse structure, allowing for insights into intent estimation and dialogue segmentation during sports-related conversations.

Industry Applications

1. Voice-Based Conversational AI Development: Companies can utilize this dataset to enhance voice interaction systems aimed at delivering sports information or facilitating user interactions, thereby refining recognition and response models.
2. Call Center Technology Validation: The naturally flowing dialogues within the dataset can be instrumental for testing technologies like speech separation and speaker identification in dialogue analysis.

About Qlean Dataset

Qlean Dataset is part of Visual Bank’s initiative to provide accessible, legally compliant AI training data solutions. It specializes in a wide array of data formats—be it images, audio, or text—supporting both academic research and commercial applications. Through partnerships with various data providers, including Chiba Lotte Marines and Toyo Keizai, Qlean Dataset consistently expands its offerings under the "AI Data Recipe" moniker.

Furthermore, the Qlean Dataset significantly alleviates the challenges associated with data collection and preparation in AI development, establishing a risk-free and legally sound environment for researchers and developers alike.

Conclusion

The launch of the "Japanese Two-Speaker Sports Dialogue Audio Corpus" marks a significant addition to the Qlean Dataset portfolio, showcasing the company’s commitment to enhancing AI development through high-quality, contextually rich datasets. For more information, visit the Qlean Dataset website.

In this era where sports analytics and AI technology intersect, the availability of such robust datasets empowers developers and researchers to innovate and excel in their respective fields.