Chatbots are everywhere and voice AI has taken off, but we believe video avatars will be the most common form factor for conversational AI. Most people would rather watch something than read it. The problem is that generating video in real-time is hard, and overcoming the uncanny valley is even harder.
We haven’t crossed the uncanny valley yet. Nobody has. But we’re getting close, and our photorealistic avatars are currently best-in-class (judge for yourself: https://lemonslice.com/try/taylor). Plus, we're the only avatar model that can do animals and heavily stylized cartoons. Try it: https://lemonslice.com/try/alien. Warning! Talking to this little guy may improve your mood.
Today we're releasing our new model* - Lemon Slice 2, a 20B-parameter diffusion transformer that generates infinite-length video at 20fps on a single GPU - and opening up our API.
How did we get a video diffusion model to run in real-time? There was no single trick, just a lot of them stacked together. The first big change was making our model causal. Standard video diffusion models are bidirectional (they look at frames both before and after the current one), which means you can't stream.
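For intuition, here's a minimal sketch of the difference as an attention mask (illustrative only, not our actual code; the per-frame token layout is an assumption):

    import torch

    def frame_attention_mask(num_frames: int, tokens_per_frame: int, causal: bool) -> torch.Tensor:
        # True = token may attend. Bidirectional: every frame sees every other
        # frame, so nothing can be shown until the whole clip is denoised.
        # Causal: frame i only sees frames <= i, so frames stream out as they finish.
        n = num_frames * tokens_per_frame
        if not causal:
            return torch.ones(n, n, dtype=torch.bool)
        frame = torch.arange(n) // tokens_per_frame  # frame index of each token
        return frame[None, :] <= frame[:, None]      # row may attend to cols in earlier-or-same frames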
From there it was about fitting everything on one GPU. We switched from full to sliding window attention, which killed our memory bottleneck. We distilled from 40 denoising steps down to just a few - quality degraded less than we feared, especially after using GAN-based distillation (though tuning that adversarial loss to avoid mode collapse was its own adventure).
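In mask terms the sliding window looks roughly like this (again a sketch; the real window size and layout differ):

    import torch

    def sliding_window_mask(num_frames: int, window: int) -> torch.Tensor:
        # Causal attention restricted to the most recent `window` frames, so the
        # keys/values a new frame needs stay O(window) instead of growing with
        # the length of the video -- that's what killed the memory bottleneck.
        i = torch.arange(num_frames)[:, None]
        j = torch.arange(num_frames)[None, :]
        return (j <= i) & (j > i - window)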
And the rest was inference work: modifying RoPE from complex to real (this one was cool!), precision tuning, fusing kernels, a special rolling KV cache, lots of other caching, and more. We kept shaving off milliseconds wherever we could and eventually got to real-time.
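To give a flavor of the RoPE change: the usual complex-number form and the pure-real form compute the identical rotation, but avoiding complex tensors makes low-precision and fused kernels much easier to apply. A sketch (our production version differs):

    import torch

    def rope_complex(x: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
        # view consecutive feature pairs as complex numbers, rotate by e^{i*theta}
        xc = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
        return torch.view_as_real(xc * torch.polar(torch.ones_like(theta), theta)).flatten(-2)

    def rope_real(x: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
        # the same rotation in real arithmetic:
        # (x1, x2) -> (x1*cos - x2*sin, x1*sin + x2*cos)
        cos, sin = theta.cos(), theta.sin()
        x1, x2 = x[..., 0::2], x[..., 1::2]
        return torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1).flatten(-2)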
We set up a guest playground for HN so you can create and talk to characters without logging in: https://lemonslice.com/hn. For those who want to build with our API (we have a new LiveKit integration that we’re pumped about!), grab a coupon code in the HN playground for your first Pro month free ($100 value). See the docs: https://lemonslice.com/docs. Pricing is usage-based at $0.12-0.20/min for video generation.
Looking forward to your feedback!
EDIT: Tell us what characters you want to see in the comments and we can make them for you to talk to (e.g. Max Headroom)
*We did a Show HN last year for our V1 model: https://news.ycombinator.com/item?id=43785044. It was technically impressive but so bad compared to what we have today.
Sure, that kind of thing is great fun. But photorealistic avatars are gonna be abused to hell and back and everyone knows it. I would rather talk to a robot that looks like a robot, i.e. C-3PO. I would even chat with a scary skeleton terminator. I do not want to talk with a convincingly-human-appearing terminator. Constantly checking whether any given human appearing on a screen is real or not is a huge energy drain on my primate brain. I already find it tedious with textual data; doing it on real-time video imagery consumes considerably more energy.
Very impressive tech, well done on your engineering achievement and all, but this is a Bad Thing.
OP, I think this is the coolest thing ever. Keep going.
Naysayers have some points, but nearly every major disruptive technology has had downsides that have been abused. (Cars can be used for armed robbery. Steak knives can be used to murder people. Computers can be used for hacking.)
The upsides of tech typically far outweigh the downsides. If a tech is all downsides, then the government just bans it. If computers were so bad, only government labs and facilities would have them.
I get the value in calling out potential dangers, but if we do this we'll wind up repeating the 70 years in which we didn't build nuclear reactors because we were too afraid. As it turns out, the dangers were actually negligible. We spent too much time imagining what would go wrong, and the world is now worse for it.
The benefits of this are far more immense.
While the world needs people who look at the bad in things, we need far more people who dream of the good. Listen to the critiques and let them inform your safety measures, but don't listen to anyone who says the tech is 100% bad and should be stopped. That's anti-nuclear rhetoric, and it's just not true.
Keep going!
"Your hackers were so preoccupied with whether or not they could, they didn't stop to think if they should."
I played around with your avatars, and one thing they lack is patience: the avatar rushes the user, so maybe that's something to fine-tune? Great work overall!
Currently the conversation still feels too STT-LLM-TTS, which I think a lot of voice agents suffer from (it seems like only Sesame and NVIDIA have nailed natural conversation flow so far). Still, it's crazy good work to train your own diffusion models. I remember looking at the latest diffusion literature and being blown away by how far it has advanced in the last year or so since the U-Net architecture days.
EDIT: I see that the primary focus is on video generation not audio.
But, to your point, there are many benefits of two-way S2S voice beyond just speed.
Using our LiveKit integration you can use LemonSlice with any voice provider you like. The current S2S providers LiveKit offers include OpenAI, Gemini, and Grok and I'm sure they'll add Personaplex soon.
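Wiring it up follows LiveKit Agents' usual pattern; here's a rough sketch (the avatar-plugin name and arguments below are illustrative, see https://lemonslice.com/docs for the real integration):

    from livekit import agents
    from livekit.agents import Agent, AgentSession
    from livekit.plugins import openai  # or any other voice/S2S provider LiveKit supports

    async def entrypoint(ctx: agents.JobContext):
        session = AgentSession(
            llm=openai.realtime.RealtimeModel(),  # swap in Gemini, Grok, etc.
        )
        # LiveKit avatar integrations typically run as a companion session that
        # consumes the agent's audio and publishes video into the room, e.g.:
        #   avatar = lemonslice.AvatarSession(avatar_id="taylor")  # illustrative name
        #   await avatar.start(session, room=ctx.room)
        await session.start(room=ctx.room, agent=Agent(instructions="You are a friendly avatar."))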
Not only is perfect the enemy of good enough, you're only looking for PMF signal at this point. If you chase quality right now, you'll miss validation and growth.
The early "Will Smith eating spaghetti" companies didn't need perfect visuals. They needed excited early adopter customers. Now look where we're at.
In the fullness of time, all of these are just engineering problems and they'll all be sorted out. Focus on your customer.
The text processing is running Qwen / Alibaba?
Video Agents: unlimited agents, up to 3 concurrent calls. Creative Studio: 1-min-long videos, up to 3 concurrent generations.
Does that mean I can have a total of 1 minute of video calls? Or video calls can only be 1 minute long? Or does it mean I can have unlimited calls, 3 calls at a time all month long?
Can I have different avatars or only the same avatar x 3?
Can I record the avatar and make videos and post on social media?
At one point subtitles written in pseudo Chinese characters were shown; I can send a screenshot if this is useful.
The latency was slightly distracting, and as others have commented the NVIDIA Personaplex demos [2] are very impressive in this regard.
In general, a very positive experience, thank you.
[0] https://en.wikipedia.org/wiki/Phonological_history_of_Spanis...

[1] https://en.wikipedia.org/wiki/Phonological_history_of_Spanis...

[2] https://research.nvidia.com/labs/adlr/personaplex/
Thanks for the feedback! That's helpful!
btw, she gives helpful instructions like "/imagine" whatever, but the commands only seem to work about 50% of the time. meaning, try the same command or variants a few times and it works for about half of them. she never did shift out of the aussie accent though.
she came up with a remarkably fanciful explanation of why, as a brazilian, she sounded aussie, and why imagining a native accent didn't work like she said it would...
i was shocked when "/imagine face left, turn to the side" actually worked: the agent went into side profile, and it was precisely as natural as the original front-facing avatar
all in all, by far the best agent experience i've played with!
Having a real-time video conversation with an AI is a trippy feeling. Talk about a "feel the AGI" moment: it really does feel like the computer has come alive.
I did /imagine cheeseburger and /imagine a fire extinguisher and both were correctly generated, but the agent has no context: when I ask what it's holding, in both cases it rambles about not holding anything and references lemons and lemon trees.
I expected it to retain the context as the chat continues. If I ask it what it imagined it just tells me I can use /imagine.
(Update of https://news.ycombinator.com/item?id=43785494)
It's a normal mp4 video that's looping initially (the "welcome message") and then as soon as you send the bot a message, we connect you to a GPU and the call becomes interactive. Connecting to the GPU takes about 10s.
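In pseudocode, the handoff is roughly this (illustrative names, not our actual client code):

    async def start_call(ui, backend):
        ui.play_loop("welcome.mp4")              # pre-rendered loop; no GPU in use yet
        first_msg = await ui.wait_for_message()  # nothing is provisioned until the user engages
        stream = await backend.connect_gpu()     # the ~10s connection step
        await ui.swap_to_live(stream, first_msg) # replace the loop with the live generated feed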
Anyway, big thumbs up for the LemonSlice team, I'm excited to see it progress. I can definitely see products starting to come alive with tools like this.
> Today we're releasing our new model* - Lemon Slice 2, a 20B-parameter diffusion transformer that generates infinite-length video at 20fps on a single GPU - and opening up our API.
But after digging around for a while, searching for a Hugging Face link, I'm guessing this was just an unfortunate turn of phrase, and you are not, in fact, releasing an open-weights model that people can run themselves?
Oh well, this looks very cool regardless and congratulations on the release.
The answer is no, because you would eventually end up releasing a subpar model, not your SOTA model.
Also, people don't have the infrastructure to run this at scale (100-500 concurrent users); at best they can run it for 1-2 concurrent users.
Still, it could be a good way for people to test it and then use your infra.
Ah, but you do have an online demo, so you might think that's enough. WRONG.
My mind is blown! It feels like the first time I used my microphone to chat with an AI.
Have your early versions made any sort of profit?
Absolutely amazing stuff to me. A teenager I very briefly showed it to was unimpressed: "it's a talking head, isn't that really easy to do?" ...
You can also control background motions (like ocean waves, a waterfall, or a car driving).
We are actively training a model that has better text control over hand motions.
You probably didn't intend to do that
Feels like those sci-fi shows where you can talk to Hari Seldon even though he lived like 100 years ago.
My prediction: this will become really, really big.
I have so many websites that would do well with this!
I was wondering why the quality is so poor.
I am double checking now to make 100% sure we return the original audio (and not the encoded/decoded audio).
We are working on high-res.
Take my money!!!!!!
For the fully hosted version, we are currently partnered with ElevenLabs.
I think people will just copy it, and we just need to continue moving as fast as we can. I do think that a bit of a revolution is happening right now in real-time video diffusion models. So many great papers have been published in that area in the last 6 months. My guess is that many DiT models will be real-time within a year.
Thanks. It seemed like this was really something new, but HN tends to be circumspect, so I wanted to check. It's an interesting space and I try to stay current, but everything is moving so fast. Still, I was pretty sure I hadn't seen anyone do that. It's a huge achievement to do it first and make it work for real like this! So well done!
It’s bad enough some companies are doing AI-only interviews. I could see this used to train employees, interview people, replace people at call centers… it’s the next step towards an absolute nightmare. Automated phone trees are bad enough.
There will likely be little human interaction in those and many other situations, and hallucinations will definitely disqualify some people from some jobs.
I’m not anti AI, I’m anti destructive innovation in AI leading to personal health and societal issues, just like modern social media has. I’m not saying this tool is that, I’m saying it’s a foundation for that.
People can choose to not work on things that lead to eventual negative outcomes, and that’s a personal choice for everyone. Of course hindsight is 20/20 but some things can certainly be foreseen.
Apologies for the seemingly negative rant, but the positivity echo chamber in this thread is crazy and I wanted to provide an alternative view.
Lord. I can see this quickly extending even further into HR, e.g. performance reviews: the employee must 'speak' to an HR avatar about their performance in the last quarter, and the AI will then summarize the discussion for the manager and give them coaching tips.
It sounds valuable and efficient but the slippery slope is all but certain.
I appreciate your concern for the quality of the site - the fact that the community here cares so much about protecting it is the main reason why it continues to survive. Still, it's against HN's rules to post like you did here. Could you please review https://news.ycombinator.com/newsguidelines.html? Note this part:
"Please don't post insinuations about astroturfing, shilling, bots, brigading, foreign agents and the like. It degrades discussion and is usually mistaken. If you're worried about abuse, email hn@ycombinator.com and we'll look at the data."
Even I am surprised by how many openly positive comments we are getting. It's not been our experience in the past.