Creating Speech Collection Datasets for Improved Representation

Introducing our Proudly South African Speech Collection Datasets Featuring Four Official Languages

Advances in AI and natural language processing have changed how we communicate, and building African speech collection datasets has become paramount. The world has thousands of languages and billions of speakers, but not all languages have the same resources or opportunities for development. This creates a problem for automatic speech recognition (ASR) technologies, which need diverse datasets to recognise accents and dialects accurately and to serve all speakers inclusively.

Bridging the gap: Our South African speech collection project

The Project

Our team began the South African Speech Collection Dataset project in 2022, collecting speech data in four of the 11 official languages spoken in South Africa. These datasets were designed to improve ASR technologies for under-served language communities and to promote inclusivity and accessibility in online and digital spaces. The collection process was meticulously planned, and the data was collected, annotated, and curated with natural language processing best practices in mind.

Our Participants

We collected over 200 hours of speech data across four South African languages: English, Afrikaans, isiZulu, and seSotho. We recruited 212 participants aged 18 to 69, all with first-language proficiency in one or more of the target languages. To simulate real-world scenarios, we focused on call centre domains such as debt collection, insurance, retail, and travel.

Simulating Unscripted, Real-world Scenarios

Participants were asked to role-play as either a customer or an agent in a customer service scenario. They were encouraged to draft a scenario for each call but to keep the conversation as unscripted and natural as possible. There was no requirement to pair up with a particular partner; participants were encouraged to find a match based on who was online at the same time, or to post their availability and have others reach out to them.

Challenges with Representation and (Audio) Quality

Despite our best efforts, some challenges could not be avoided. We aimed for a balanced gender split between men and women but faced an oversubscription of women in the first round of recruiting for almost all languages. We were able to recruit more male call recorders by incentivising referrals; we also posted on social media to raise awareness and encourage participation. Call recorders were paid for all calls collected, which contributed to their commitment to participate and complete their calls.

Our call recorders also faced practical challenges: limited technical know-how when signing up to online platforms, load shedding that affected their network coverage, limited Wi-Fi access, and noisy recording environments. We did not wish to exclude participants who were affected by these very real, everyday challenges. We therefore took a hands-on approach to troubleshooting technical problems and proactively matched up call recorders who were online so that they could record their first call. Each call recorder's first call was checked for quality before they could record additional calls.

Dataset Recordings

The dataset collections comprise unscripted, natural conversations simulating real-world calls in four common domains: debt collection, insurance, retail, and travel. Recordings and transcripts include routine security verifications such as ID, email and phone number validation.

Each dataset collection includes:

• Speaker demographic metadata
• Segmented WAV files that match the transcript .csv files
• Single-channel recordings for agent and client
• Audio files available as segment files or full channel files
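
As an illustration of how the segmented WAV files and transcript .csv files listed above might be used together, here is a minimal Python sketch. The column names `segment_file` and `transcript`, and the function name `load_segments`, are assumptions made for this example only; the actual layout of the delivered files may differ and should be checked against the dataset itself.

```python
import csv
import wave
from pathlib import Path


def load_segments(csv_path, audio_dir, file_col="segment_file", text_col="transcript"):
    """Pair each transcript row with its WAV segment and measure its duration.

    The column names are hypothetical placeholders; adjust them to match
    the header of the delivered .csv files.
    """
    audio_dir = Path(audio_dir)
    segments = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            wav_path = audio_dir / row[file_col]
            seconds = None  # stays None when the audio segment is missing
            if wav_path.is_file():
                with wave.open(str(wav_path), "rb") as wav:
                    seconds = wav.getnframes() / wav.getframerate()
            segments.append(
                {"file": row[file_col], "text": row[text_col], "seconds": seconds}
            )
    return segments
```

Rows whose audio file cannot be found are kept with a duration of `None`, so a mismatch between transcripts and segments is visible rather than silently dropped.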

 

ENGLISH

Hours available: 50 hours
Age range of speakers: 18 – 69
Download size: 32GB
Number of speakers: 63
Audio format: WAV

AFRIKAANS

Hours available: 50 hours
Age range of speakers: 18 – 69
Download size: 32GB
Number of speakers: 46
Audio format: WAV

ISIZULU

Hours available: 50 hours
Age range of speakers: 18 – 49
Download size: 38GB
Number of speakers: 54
Audio format: WAV

SESOTHO

Hours available: 50 hours
Age range of speakers: 18 – 49
Download size: 38GB
Number of speakers: 49
Audio format: WAV
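
As a quick cross-check, the per-language figures listed above can be added up. The short Python sketch below does only that arithmetic; the dictionary layout and the names `collections` and `totals` are illustrative choices, not part of the dataset.

```python
# Figures copied from the per-language listings above.
collections = {
    "English": {"hours": 50, "speakers": 63, "download_gb": 32},
    "Afrikaans": {"hours": 50, "speakers": 46, "download_gb": 32},
    "isiZulu": {"hours": 50, "speakers": 54, "download_gb": 38},
    "seSotho": {"hours": 50, "speakers": 49, "download_gb": 38},
}


def totals(data):
    """Sum each numeric field across all language collections."""
    keys = ("hours", "speakers", "download_gb")
    return {key: sum(entry[key] for entry in data.values()) for key in keys}


summary = totals(collections)
print(summary)  # {'hours': 200, 'speakers': 212, 'download_gb': 140}
```

The speaker total of 212 matches the number of participants reported earlier, and the four collections together come to 200 hours and 140GB of downloads.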

At Way With Words, we have gathered valuable speech collection datasets to train ASR models to recognise a broader range of languages and speech patterns. We are also able to create custom training data tailored to specific requirements related to conventions, languages, or domains.

Our proudly South African speech collection dataset is a significant step towards promoting inclusivity and diversity in the digital space. Our hope, moving forward, is to improve ASR technologies for under-served language communities and support responsible AI development.
