Ask HN: What Speaker Diarization tools should I look into?

Hi,

I am making a tool that needs to analyze a conversation (non-English) between two people. The conversation is provided to me as audio. I am currently using OpenAI Whisper to transcribe it and then feed the transcription to the GPT-4o model through the API for analysis.
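
For context, the current pipeline is roughly this (a minimal sketch, assuming the openai-whisper package and the OpenAI Python SDK; the file name, model size, and prompt are placeholders):

    import whisper
    from openai import OpenAI

    # Transcribe the conversation with Whisper; it auto-detects the language.
    model = whisper.load_model("medium")
    result = model.transcribe("conversation.mp3")
    transcript = result["text"]

    # Send the raw, untagged transcript to GPT-4o for analysis.
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Analyze this two-person conversation."},
            {"role": "user", "content": transcript},
        ],
    )
    print(response.choices[0].message.content)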

So far, it's doing a fair job. Sometimes, though, when reading the transcription I find it hard to tell which speaker is saying what, and I have to listen to the audio to figure it out. I am wondering whether GPT-4o sometimes finds it just as hard to follow the conversation from the transcription alone. I think adding a speaker diarization step might make the transcription easier to understand and analyze.

I am looking for Speaker Diarization tools that I can use. I have tried using pyannote speaker-diarization-3.1, but I find it does not work very well. What are some other options that I can look at?
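
For reference, this is roughly how I have been running pyannote (a sketch; assumes pyannote.audio 3.x, and the Hugging Face token and file name are placeholders):

    from pyannote.audio import Pipeline

    # Load the pretrained 3.1 pipeline (requires accepting the model terms
    # on Hugging Face and passing an access token).
    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",
        use_auth_token="HF_TOKEN",  # placeholder
    )

    diarization = pipeline("conversation.wav")

    # One line per speech turn: start, end, speaker label.
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")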

9 points | by justforfunhere 9 hours ago

3 comments

  • hbredin 32 minutes ago
    Hey, I am the creator of the pyannote open-source toolkit.

    I just created a company around it that serves much better diarization models through an API.

    You can test it by creating an account on https://dashboard.pyannote.ai. You'll get 150h of diarization for free.

    There is also a playground where you can simply upload a file and visualize the diarization results.

  • nemima 7 hours ago
    Hi, I'm an engineer at Speechmatics. Our speech-to-text software handles speaker diarization very reliably, and we're a go-to choice for non-English languages. https://www.speechmatics.com/

    How long is the audio file? If it's under 2 hours, you can upload the file and transcribe it with diarization for free using our web portal: https://portal.speechmatics.com/jobs/create/batch
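
    If you'd rather script it than click through the portal, submitting a batch job with speaker diarization via our Python SDK looks roughly like this (a sketch; the file path and language code are placeholders, and the exact options are in our docs):

        from speechmatics.models import ConnectionSettings
        from speechmatics.batch_client import BatchClient

        settings = ConnectionSettings(
            url="https://asr.api.speechmatics.com/v2",
            auth_token="YOUR_API_KEY",  # placeholder
        )

        # "diarization": "speaker" adds speaker labels to the transcript.
        conf = {
            "type": "transcription",
            "transcription_config": {
                "language": "hi",  # placeholder: set your conversation's language
                "diarization": "speaker",
            },
        }

        with BatchClient(settings) as client:
            job_id = client.submit_job(audio="conversation.wav", transcription_config=conf)
            transcript = client.wait_for_completion(job_id, transcription_format="txt")
            print(transcript)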

    Hope it helps with your use case! If it does, and you encounter any issues, drop us an email at devrel@speechmatics.com :)

    EDIT: typo

    • justforfunhere 6 hours ago
      Hi, yes, it is well under two hours. The longest audio that I have had to handle as of now is around 10 minutes.

      I will give your portal a try soon. Thanks

  • hildekominskia 5 hours ago
    Skip pyannote 3.1; two battle-tested upgrades:

    1. NVIDIA NeMo’s `diar_msdd_telephonic` (8 kHz) or `diar_msdd_mic` (16 kHz): one-line Python install, GPU optional, beats pyannote on cross-talk.

    2. AssemblyAI’s async `/v2/transcript` endpoint: gives you `words[].speaker` plus Whisper-level accuracy for 40+ languages. Free tier: 3 h / month. (Sketch of this one below.)
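
    Rough sketch of option 2 using the AssemblyAI Python SDK (which wraps that endpoint); the API key and file path are placeholders:

        import assemblyai as aai

        aai.settings.api_key = "YOUR_API_KEY"  # placeholder

        # speaker_labels=True turns on diarization.
        config = aai.TranscriptionConfig(speaker_labels=True)
        transcript = aai.Transcriber().transcribe("conversation.mp3", config=config)

        # Each utterance comes back with a speaker label attached.
        for utt in transcript.utterances:
            print(f"Speaker {utt.speaker}: {utt.text}")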

    Glue either one onto your existing Whisper pipeline and feed GPT-4o speaker-tagged text instead of a flat transcript. The jump in clarity is night and day.
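
    The glue step itself is just overlap matching: for each Whisper segment, pick the diarization turn that covers most of it. A sketch, assuming Whisper's usual segment dicts and diarization turns normalized to (start, end, speaker) tuples:

        # segments: Whisper output (result["segments"]), each with "start", "end", "text"
        # turns: diarization output as (start, end, speaker) tuples

        def tag_segments(segments, turns):
            tagged = []
            for seg in segments:
                best_speaker, best_overlap = "UNKNOWN", 0.0
                for start, end, speaker in turns:
                    overlap = min(seg["end"], end) - max(seg["start"], start)
                    if overlap > best_overlap:
                        best_speaker, best_overlap = speaker, overlap
                tagged.append(f"{best_speaker}: {seg['text'].strip()}")
            return "\n".join(tagged)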

    I use the same combo to auto-caption interviews, then drop the synced footage into Veo 3 (https://veo-3.app) for instant talking-head explainers—works even for non-English audio.