The researchers speculated that the systems had a common flaw: “insufficient audio data from Black speakers when training the models.” A startup called Speechmatics has developed a technique that appears to reduce this data gap. Speechmatics attributed much of its performance to a technique called self-supervised learning.
Training school
The advantage of self-supervised models is that they don’t require all their training data to be labeled by humans. As a result, they can learn from a much larger pool of information. This helped Speechmatics expand its training data from around 30,000 hours of audio to around 1.1 million hours. Will Williams, the company’s VP of machine learning, told TNW that the approach improved the software’s performance across a variety of speech patterns.
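To make the idea concrete, here is a minimal sketch of self-supervised learning in the style described above. Everything in it is illustrative, not Speechmatics’ actual system: the “audio” is a noisy sine wave standing in for raw speech, and the model is a tiny linear predictor. The key point is that the training target, a masked sample, comes from the unlabeled signal itself, so no human annotation is needed.

```python
# Self-supervised learning sketch (illustrative, assumed setup):
# predict a masked sample from its neighbours, so the "label"
# is taken from the raw data itself -- no human labeling required.
import math
import random

random.seed(0)

# Unlabeled "audio": a noisy sine wave (stand-in for raw speech samples).
signal = [math.sin(0.1 * t) + random.gauss(0, 0.05) for t in range(2000)]

# Tiny linear model: predict x[t] from (x[t-2], x[t-1], x[t+1], x[t+2]).
weights = [0.0] * 4
lr = 0.01

def context(t):
    """Surrounding samples used to predict the masked one."""
    return [signal[t - 2], signal[t - 1], signal[t + 1], signal[t + 2]]

def predict(ctx):
    return sum(w * x for w, x in zip(weights, ctx))

def mse(ts):
    """Mean squared error of the masked-sample predictions."""
    return sum((predict(context(t)) - signal[t]) ** 2 for t in ts) / len(ts)

train_ts = list(range(2, 1500))
test_ts = list(range(1500, 1998))

before = mse(test_ts)
for _ in range(5):                      # a few passes of plain SGD
    random.shuffle(train_ts)
    for t in train_ts:
        ctx = context(t)
        err = predict(ctx) - signal[t]  # self-supervised target: the masked sample
        for i in range(4):
            weights[i] -= lr * err * ctx[i]
after = mse(test_ts)

print(f"masked-prediction error: {before:.4f} -> {after:.4f}")
```

Because the objective is derived from the data alone, the same recipe scales to any amount of unlabeled audio — which is how a pool of 30,000 labeled hours can grow to over a million hours of usable training material.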
Learning like a child
One of the technique’s benefits was narrowing Speechmatics’ accuracy gap across speaker ages. In tests based on the open-source Common Voice dataset, the software transcribed children’s voices with 92% accuracy; Google’s system, by comparison, scored 83.4%. Williams said enhancing the recognition of kids’ voices was never a specific objective. That doesn’t mean that self-supervised learning alone can eliminate AI biases. Allison Koenecke, the lead author of the Stanford study, noted that other issues also need to be addressed.