Moments Lab AI Research Leading Advancements in Multimodal Speaker Diarization

Investing in Machine Learning Techniques for Audio Data Processing

To top off a year of record-breaking R&D advancements, Moments Lab (formerly Newsbridge) had the pleasure of welcoming Yannis, its new AI Researcher, whose doctoral groundwork lies in cutting-edge signal processing, machine learning, computer vision and speech analysis methods.

More specifically, Yannis specializes in Speaker Diarization, a technology that Moments Lab sees as a significant advancement for Multimodal AI (audio analysis being one more modality in the multimodal family).

Currently, state-of-the-art diarization relies solely on audio processing. To go beyond it, Moments Lab is now combining diarization with other types of processing, such as video analysis and external metadata, to achieve ‘next-gen diarization’. By applying diarization to advanced speaker and transcription detection capabilities, the company plans to be a step ahead of existing state-of-the-art solutions.

Moments Lab believes this technology will be a crucial component applied to the platform in the years to come.

“By leading the way in multimodal diarization research, we plan on improving the accuracy of our underlying technology. While simultaneously leveraging and cross-analyzing multiple modes, such as face, object and context detection, we are improving diarization, consequently putting us at the ‘research forefront’ of this state-of-the-art technology. The high-level objective is making the Moments Lab platform home to the most powerful Multimodal AI Indexing technology in the world.”

– Frédéric Petitpont, Moments Lab CTO

Basic Speaker Diarization: 3 Steps to Determine the ‘Who Spoke When’

So let’s back up a bit: what exactly is Speaker Diarization, and what does the process look like? At a high level, Speaker Diarization can be summed up as ‘who spoke and when’.

A visual representation of Speaker Diarization: the ‘Who Spoke and When’ technology

Let’s break basic Speaker Diarization down into 3 steps:

1) Homogeneous Segments: In Speaker Diarization, an audiovisual file is first broken down into bite-size homogeneous segments to better analyze various components. For reference, a segment is considered homogeneous when it contains the voice of only one person. This means that each extracted segment must correspond to a particular individual.

2) Segment Clustering: Next, the segments are clustered, meaning that all segments associated with a particular speaker are grouped together and annotated accordingly. In other words, the segments extracted from a single audiovisual file are attached to the corresponding speaker, with automated tags (e.g. speaker ID, speech time, pauses).

3) Annotation + Identified Speech Turns: We are then presented with a fully annotated audio file, with identified speech turns of the different corresponding speakers.

A solution that creates a fully annotated audio file, complete with identified speech turns.
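The three steps above can be sketched in a few lines of Python. This is only an illustrative toy, not the Moments Lab pipeline: it assumes the audio has already been cut into homogeneous segments, and it uses made-up two-dimensional vectors as stand-ins for real voice embeddings. The names `Segment`, `cluster_segments`, `annotate` and the distance threshold are all hypothetical.

```python
from dataclasses import dataclass
from math import dist


@dataclass
class Segment:
    start: float       # seconds
    end: float         # seconds
    embedding: tuple   # toy stand-in for a voice embedding (assumption)


def cluster_segments(segments, threshold):
    """Step 2: average-linkage agglomerative clustering of segments."""
    clusters = [[i] for i in range(len(segments))]

    def avg_distance(a, b):
        pairs = [(i, j) for i in a for j in b]
        total = sum(dist(segments[i].embedding, segments[j].embedding)
                    for i, j in pairs)
        return total / len(pairs)

    while len(clusters) > 1:
        # Find the two closest clusters.
        d, a, b = min(
            (avg_distance(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if d > threshold:
            break  # remaining clusters are treated as different speakers
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def annotate(segments, clusters):
    """Step 3: turn clusters into (start, end, speaker) speech turns."""
    labels = {}
    for speaker_idx, members in enumerate(sorted(clusters, key=min)):
        for i in members:
            labels[i] = f"SPEAKER_{speaker_idx}"
    return [(s.start, s.end, labels[i]) for i, s in enumerate(segments)]


# Step 1 is assumed done: four homogeneous segments with toy embeddings.
segments = [
    Segment(0.0, 4.2, (0.90, 0.10)),
    Segment(4.2, 7.5, (0.10, 0.80)),
    Segment(7.5, 12.0, (0.85, 0.15)),
    Segment(12.0, 15.0, (0.15, 0.90)),
]
timeline = annotate(segments, cluster_segments(segments, threshold=0.5))
for start, end, speaker in timeline:
    print(f"{start:5.1f}s - {end:5.1f}s  {speaker}")
```

With these toy values, the first and third segments end up labeled SPEAKER_0 and the second and fourth SPEAKER_1. In a real system the embeddings would come from a trained speaker model and the threshold would need tuning.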

Moments Lab Speaker Diarization Research: Who Spoke What and When

Now that we know how basic speaker diarization works, relying on the single mode of audio, let’s look at what Moments Lab is doing to improve this cutting-edge technology.

When applied to the Moments Lab platform, there’s an important component added to the mix… ‘who spoke what, and when’.

So what does that mean?

Moments Lab already uses built-in speech-to-text technology as part of its AI-powered media valorization platform. Speaker diarization can therefore eventually be improved by taking into account what individuals are actually saying, using NLP techniques applied to the auto-generated transcript. This helps match speaker turns more precisely, resulting in higher quality speech-to-text output. As this is internal research at Moments Lab, the implementation is set to debut in the next couple of years.

Speaker Diarization applied to speech-to-text technology helps improve accuracy of automated transcription by segment clustering, clarifying speech of overlapping voices.

Another important mode that will improve Speaker Diarization at Moments Lab is the ‘who’. Since the platform also analyzes video, the diarization pipeline will be able to take into account any publicly known or previously tagged speakers identified via facial recognition. If a speaker is recognized, the platform takes Wikidata information into account and can also detect whether he or she is a native speaker, further improving diarization.
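One simple way to picture how a face recognition result could feed back into diarization is to give each anonymous speaker label the name of the face that is most often on screen while that speaker is talking. The sketch below is a hypothetical illustration under that assumption, not a description of the Moments Lab implementation; `name_speakers`, the tuple layouts and the `min_overlap` value are invented for the example.

```python
from collections import defaultdict


def name_speakers(turns, face_tracks, min_overlap=1.0):
    """Map anonymous speaker labels to the most co-occurring face name.

    turns:       (start, end, speaker_label) tuples from diarization
    face_tracks: (start, end, person_name) tuples from face recognition
    min_overlap: minimum shared seconds required to accept a name (assumption)
    """
    overlap_by_pair = defaultdict(float)
    for t_start, t_end, speaker in turns:
        for f_start, f_end, name in face_tracks:
            shared = min(t_end, f_end) - max(t_start, f_start)
            if shared > 0:
                overlap_by_pair[(speaker, name)] += shared

    names = {}
    for speaker in {s for _, _, s in turns}:
        candidates = {n: v for (s, n), v in overlap_by_pair.items()
                      if s == speaker}
        if candidates:
            best = max(candidates, key=candidates.get)
            if candidates[best] >= min_overlap:
                names[speaker] = best
    return names


turns = [
    (0.0, 10.0, "SPEAKER_0"),
    (10.0, 18.0, "SPEAKER_1"),   # heard while nobody known is on screen
    (18.0, 25.0, "SPEAKER_0"),
]
face_tracks = [
    (0.0, 9.0, "Person A"),
    (18.0, 25.0, "Person A"),
]
names = name_speakers(turns, face_tracks)
print(names)  # {'SPEAKER_0': 'Person A'}
```

A face on screen is not always the person speaking (think of reaction shots), so a production system would need stronger evidence than co-occurrence alone.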

Future Implications: How Will Professionals Use Speaker Diarization?

Once applied to speech-to-text technology, Speaker Diarization has the potential to revolutionize a number of industries that work heavily in the automatic speech-recognition space. By adopting this ‘who spoke when’ technology, the possibilities are seemingly endless.

For example, Speaker Diarization helps:

1) Calculate speaking times

This is especially relevant for high-profile political debates in which strict speaking turn guidelines are given as part of Public Speaking Ethics. Post-debate analysts can then measure and report each participant’s speaking time, an important component of conversational analysis. It is also a great tool to analyze speaking-time parity between the sexes, ensuring men and women are (quite literally!) equally heard.
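Once speech turns are available, computing speaking times is a simple aggregation. The sketch below assumes diarization output in the form of (start, end, speaker) tuples in seconds; the function name and the sample debate data are made up for illustration.

```python
from collections import defaultdict


def speaking_times(turns):
    """Sum the duration of every speech turn per speaker (in seconds)."""
    totals = defaultdict(float)
    for start, end, speaker in turns:
        totals[speaker] += end - start
    return dict(totals)


# Hypothetical diarized debate excerpt.
turns = [
    (0.0, 30.0, "Candidate A"),
    (30.0, 50.0, "Candidate B"),
    (50.0, 65.0, "Candidate A"),
    (65.0, 80.0, "Candidate B"),
]
totals = speaking_times(turns)
overall = sum(totals.values())
shares = {speaker: t / overall for speaker, t in totals.items()}

for speaker, seconds in totals.items():
    print(f"{speaker}: {seconds:.0f}s ({shares[speaker]:.1%})")
```

Here Candidate A speaks for 45 seconds and Candidate B for 35 seconds. The same aggregation works for any grouping, such as speaking time by gender, provided each speaker label carries that attribute.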

2) Automatically obtain subtitles associated with the person speaking

Currently, post-production teams have no easy way to automatically align subtitles with speaking turns in a video. Most of this work is done manually and is often prone to errors (due to overlapping voices) and long turnaround times. By implementing Speaker Diarization applied to speech-to-text detection, production teams can automate and improve this drawn-out process, auto-generating subtitles based on diarized speaking-turn segments. Improved quality in subtitles is also a major win in terms of Accessibility, further assisting those who are deaf or blind (if vocal assistance is activated).
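As a rough illustration of how subtitles could be tied to speakers, the sketch below assigns each transcribed word to the speech turn it overlaps most, then groups consecutive words from the same speaker into one subtitle cue. It assumes the speech-to-text engine returns word-level timestamps; the function names and data layouts are hypothetical.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_words(words, turns):
    """Attach to each (text, start, end) word the best-overlapping speaker.

    Note: a word overlapping no turn at all falls back to the first turn.
    """
    labeled = []
    for text, w_start, w_end in words:
        best = max(turns, key=lambda t: overlap(w_start, w_end, t[0], t[1]))
        labeled.append((text, w_start, w_end, best[2]))
    return labeled


def build_cues(labeled):
    """Merge consecutive words from the same speaker into subtitle cues."""
    cues = []
    for text, start, end, speaker in labeled:
        if cues and cues[-1]["speaker"] == speaker:
            cues[-1]["text"] += " " + text
            cues[-1]["end"] = end
        else:
            cues.append({"speaker": speaker, "start": start,
                         "end": end, "text": text})
    return cues


turns = [(0.0, 2.0, "SPEAKER_0"), (2.0, 4.0, "SPEAKER_1")]
words = [
    ("Good", 0.1, 0.5), ("evening", 0.5, 1.1),
    ("Thank", 2.1, 2.4), ("you", 2.4, 2.7),
]
cues = build_cues(label_words(words, turns))
for cue in cues:
    print(f'[{cue["start"]:.1f}-{cue["end"]:.1f}] {cue["speaker"]}: {cue["text"]}')
```

This yields two cues, "Good evening" for SPEAKER_0 and "Thank you" for SPEAKER_1. Real subtitles would also need line-length and reading-speed limits, which are left out here.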

3) Better understand overlapping speeches

This applies in many situations, especially for journalists who are analyzing hundreds of media assets on a given subject (e.g. sports conferences, interviews, congressional speeches, debates) in which multiple speakers talk over one another. To publish their story as quickly as possible, journalists need accurate transcription for quotes. Today they usually end up transcribing by ear, and when they do have a speech-to-text tool, the output quality is often questionable at best.

4) Identify Audio Patterns

This could be a game-changer for corporate marketers when working with and searching through their base of existing media assets. For example, after media is indexed based on detected ads, music, jingles and applause, a user can search via audio (e.g. I can’t remember the name of the ad, but I need the video with that ‘Rise & Shine’ song).

5) Match Voice Fingerprints

This could be especially useful for documentalists or archivists who are tasked with creating a digital archive library and analyzing individual video clips which may have ‘off-screen’ entities which are only heard and not seen. In this case, applied Speaker Diarization can assist with matching voice fingerprints to known individuals, simplifying the archive analysis process. Matching voice fingerprints is also effective for journalists combating deepfakes, allowing them to verify a speaker’s identity.
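A common way to compare voice fingerprints is to represent each voice as a numeric vector and measure cosine similarity against a set of known voices. The sketch below uses tiny made-up vectors in place of real voice embeddings; `match_voice`, the reference names and the 0.8 threshold are assumptions chosen for illustration.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_voice(embedding, known_voices, threshold=0.8):
    """Return the best-matching known speaker, or None if nobody is close."""
    best_name, best_score = None, threshold
    for name, reference in known_voices.items():
        score = cosine_similarity(embedding, reference)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


# Toy reference fingerprints (real ones would have hundreds of dimensions).
known_voices = {
    "Anchor": (1.0, 0.0, 0.0),
    "Reporter": (0.0, 1.0, 0.0),
}

print(match_voice((0.9, 0.1, 0.0), known_voices))   # Anchor
print(match_voice((0.5, 0.5, 0.7), known_voices))   # None
```

Returning None for voices below the threshold matters for archive work: an off-screen voice that matches nobody should stay unlabeled rather than be attributed to the nearest known person.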

Final Thoughts: Continued Research for Commercialized Application

As applied to Moments Lab’s Multimodal AI technology, Speaker Diarization is an important component in improving algorithm robustness. Depending on the quality of media assets, certain low resolution videos and photos will be better analyzed and thus indexed, due to improved speech-to-text functionality.

More generally, as a technology that relies on both supervised and unsupervised Machine Learning Techniques along with Deep Learning and Agglomerative Clustering models, Multimodal Speaker Diarization is not yet a mainstream solution due to its complex nature. This is why continued research remains a top priority for those interested in leading the way with commercialized application.