Ask HN: What Speaker Diarization tools should I look into?

Hi,

I am making a tool that needs to analyze a conversation (non-English) between two people. The conversation is provided to me as audio. I am currently using OpenAI Whisper to transcribe it and then feed the transcription to the GPT-4o model through the API for analysis.
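
For context, the current pipeline is roughly this (a minimal sketch, assuming the openai-whisper package and the OpenAI Python SDK; the file name, model size, and prompt are placeholders):

    import whisper
    from openai import OpenAI

    # Transcribe the conversation with Whisper; it auto-detects the language.
    model = whisper.load_model("medium")
    result = model.transcribe("conversation.mp3")
    transcript = result["text"]

    # Send the raw, untagged transcript to GPT-4o for analysis.
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Analyze this two-person conversation."},
            {"role": "user", "content": transcript},
        ],
    )
    print(response.choices[0].message.content)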

So far, it's doing a fair job. Sometimes, though, when reading the transcription I find it hard to tell which speaker is saying what, and I have to listen to the audio to figure it out. I am wondering whether GPT-4o sometimes finds it just as hard to follow the conversation from the transcription alone. I think adding a speaker diarization step might make the transcription easier to understand and analyze.

I am looking for Speaker Diarization tools that I can use. I have tried using pyannote speaker-diarization-3.1, but I find it does not work very well. What are some other options that I can look at?
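
For reference, this is roughly how I have been running pyannote (a sketch; assumes pyannote.audio 3.x, and the Hugging Face token and file name are placeholders):

    from pyannote.audio import Pipeline

    # Load the pretrained 3.1 pipeline (requires accepting the model terms
    # on Hugging Face and passing an access token).
    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",
        use_auth_token="HF_TOKEN",  # placeholder
    )

    diarization = pipeline("conversation.wav")

    # One line per speech turn: start, end, speaker label.
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")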

9 points | by justforfunhere 9 hours ago

3 comments

  • hbredin 32 minutes ago
    Hey, I am the creator of the pyannote open-source toolkit.

    I just created a company around it that serves much better diarization models through an API.

    You can test it by creating an account on https://dashboard.pyannote.ai. You'll get 150h of diarization for free.

    There is also a playground where you can simply upload a file and visualize the diarization results.

  • nemima 7 hours ago
    Hi, I'm an engineer at Speechmatics. Our speech-to-text software handles speaker diarization very reliably, and we're a go-to choice for non-English languages. https://www.speechmatics.com/

    How long is the audio file? If it's under 2 hours, you can upload the file and transcribe it with diarization for free using our web portal: https://portal.speechmatics.com/jobs/create/batch
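
    If you'd rather script it than click through the portal, submitting a batch job with speaker diarization via our Python SDK looks roughly like this (a sketch; the file path and language code are placeholders, and the exact options are in our docs):

        from speechmatics.models import ConnectionSettings
        from speechmatics.batch_client import BatchClient

        settings = ConnectionSettings(
            url="https://asr.api.speechmatics.com/v2",
            auth_token="YOUR_API_KEY",  # placeholder
        )

        # "diarization": "speaker" adds speaker labels to the transcript.
        conf = {
            "type": "transcription",
            "transcription_config": {
                "language": "hi",  # placeholder: set your conversation's language
                "diarization": "speaker",
            },
        }

        with BatchClient(settings) as client:
            job_id = client.submit_job(audio="conversation.wav", transcription_config=conf)
            transcript = client.wait_for_completion(job_id, transcription_format="txt")
            print(transcript)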

    Hope it helps with your use case! If it does, and you encounter any issues, drop us an email at devrel@speechmatics.com :)

    EDIT: typo

    • justforfunhere 6 hours ago
      Hi, yes, it is well under two hours. The longest audio that I have had to handle as of now is around 10 minutes.

      I will give your portal a try soon. Thanks

  • hildekominskia 5 hours ago
    Skip pyannote 3.1; two battle-tested upgrades:

    1. NVIDIA NeMo’s `diar_msdd_telephonic` (8 kHz) or `diar_msdd_mic` (16 kHz): one-line Python install, GPU optional, beats pyannote on cross-talk.

    2. AssemblyAI’s async `/v2/transcript` endpoint: gives you `words[].speaker` plus Whisper-level accuracy for 40+ languages. Free tier: 3 h / month. (Sketch of this one below.)
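
    Rough sketch of option 2 using the AssemblyAI Python SDK (which wraps that endpoint); the API key and file path are placeholders:

        import assemblyai as aai

        aai.settings.api_key = "YOUR_API_KEY"  # placeholder

        # speaker_labels=True turns on diarization.
        config = aai.TranscriptionConfig(speaker_labels=True)
        transcript = aai.Transcriber().transcribe("conversation.mp3", config=config)

        # Each utterance comes back with a speaker label attached.
        for utt in transcript.utterances:
            print(f"Speaker {utt.speaker}: {utt.text}")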

    Glue either one onto your existing Whisper pipeline and feed GPT-4o speaker-tagged text instead of a flat transcript. The jump in clarity is night and day.
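
    The glue step itself is just overlap matching: for each Whisper segment, pick the diarization turn that covers most of it. A sketch, assuming Whisper's usual segment dicts and diarization turns normalized to (start, end, speaker) tuples:

        # segments: Whisper output (result["segments"]), each with "start", "end", "text"
        # turns: diarization output as (start, end, speaker) tuples

        def tag_segments(segments, turns):
            tagged = []
            for seg in segments:
                best_speaker, best_overlap = "UNKNOWN", 0.0
                for start, end, speaker in turns:
                    overlap = min(seg["end"], end) - max(seg["start"], start)
                    if overlap > best_overlap:
                        best_speaker, best_overlap = speaker, overlap
                tagged.append(f"{best_speaker}: {seg['text'].strip()}")
            return "\n".join(tagged)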

    I use the same combo to auto-caption interviews, then drop the synced footage into Veo 3 (https://veo-3.app) for instant talking-head explainers—works even for non-English audio.