Meta’s Movie Gen produces realistic video and matching sound from text prompts
It’s called Movie Gen, and as the name suggests, it turns text prompts into videos that look and sound fairly convincing, though thankfully without voices just yet. And wisely, Meta is not giving this one a public release.
No one is quite sure yet what generative video models are actually useful for, but that hasn’t stopped companies like Runway, OpenAI, and Meta from spending millions to improve them.
Movie Gen is actually a collection (or “cast”) of foundation models, the largest of which is the text-to-video component. Meta claims it outperforms Runway’s Gen3, LumaLabs’ latest model, and Kling1.5, though as always this kind of claim shows more that they are all playing the same game than that Movie Gen wins outright. The technical particulars can be found in the paper Meta released describing all the components.
Audio is generated to match the contents of the video: engine noises that correspond with a car’s movements, the rush of a waterfall in the background, a crack of thunder when the scene calls for it. It will even add music if that seems relevant.
It was trained on “a combination of licensed and publicly available datasets” that Meta declined to detail, calling them “proprietary/commercially sensitive.” We can only guess that means a lot of Instagram and Facebook videos, plus partner content and plenty of other material that is inadequately protected from scrapers, also known as “publicly available.”
In this case, Meta is not just chasing the “state of the art” crown for a month or two; it is aiming for a practical, all-in-one system that can produce a solid final result from a simple natural-language prompt. Something like “imagine me as a baker making a shiny hippo cake in the middle of a thunderstorm.”
For instance, one sticking point with these video generators has been how difficult they are to edit. If you ask for a video of someone walking across the street, then realize you want them walking right to left instead of left to right, there’s a good chance the whole shot will look different when you repeat the prompt with that change. Meta is adding a simple text-based editing method: just say “change the background to a busy intersection” or “change her clothes to a red dress,” and it will attempt to make that change, and only that change, as the sketch below illustrates.
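To make the distinction concrete, here is a minimal sketch in Python of the difference between re-rolling a prompt and issuing a targeted edit. Everything here is illustrative: Movie Gen has no public API, and both function names are invented.

```python
import random

def generate(prompt: str) -> dict:
    """Stand-in for text-to-video: every call resamples the entire scene."""
    return {"prompt": prompt, "scene": random.random()}

def edit(video: dict, instruction: str) -> dict:
    """Stand-in for instruction-based editing: keep the scene, apply one change."""
    return {**video, "edit": instruction}

a = generate("a woman walking across the street, left to right")
b = generate("a woman walking across the street, right to left")
print(a["scene"] == b["scene"])   # almost surely False: a whole new shot

c = edit(a, "make her walk right to left")
print(c["scene"] == a["scene"])   # True: same shot, only the requested change
```

The design point is that re-prompting resamples everything, while instruction-based editing conditions on the existing video, so identity, framing, and lighting carry over.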
Camera movements are also generally understood: directions like “tracking shot” and “pan left” are taken into account when the video is generated. This is nothing like real camera control, but it’s still a big improvement.
The model’s limitations are a little strange. It generates video 768 pixels wide, a dimension most people know from the famous-but-outdated 1024×768, but which is also three times 256, letting it play well with other HD formats. The Movie Gen system upscales this to 1080p, which is the source of the claim that it generates that resolution. Not really true, but we’ll give them a pass because upscaling is surprisingly effective.
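The arithmetic checks out if you assume a 16:9 frame, which is my assumption; only the 768-pixel width is stated.

```python
# Back-of-the-envelope on the resolution claim. Only the 768-pixel width is
# given; the 16:9 aspect ratio below is an assumption.
width = 768
print(width % 256 == 0, width // 256)  # True 3: an even multiple of 256

height = width * 9 // 16   # 432 at 16:9
scale = 1920 / width       # 2.5x per axis to reach 1920x1080
print(height, height * scale)  # 432 1080.0: a clean upscale
```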
It also makes videos up to 16 seconds long, which is weird… at 16 frames per second, a frame rate no one has ever asked for or wanted. You can, however, do 10 seconds of video at 24 FPS. Lead with that one!
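One possible explanation for those odd pairings, and it is only my inference from the numbers, is a fixed frame budget: both settings land near the same total frame count.

```python
# Both duration/frame-rate pairs land near the same total frame count,
# consistent with (though not proof of) a fixed generation budget.
print(16 * 16)  # 256 frames: 16 seconds at 16 fps
print(10 * 24)  # 240 frames: 10 seconds at 24 fps
```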
There are likely two reasons it doesn’t do voice… well. First, it’s super hard. Generating speech is easy now, but matching it to lip movements, and those lips to face movements, is a much more complicated proposition. I don’t blame them for leaving it until later, since it would be a minute-one failure case. Someone could say “generate a clown delivering the Gettysburg Address while riding a tiny bike in circles,” and that is exactly the kind of nightmare fuel primed to go viral.
The second reason is likely political: putting out what amounts to a deepfake generator a month before a major election is not a great look. Restricting its capabilities a bit, so that malicious actors would have to do some real work to misuse it, is a practical preventive step. One certainly could combine this generative model with a speech generator and an open lip-syncing model, but it can’t simply generate a candidate making wild claims.
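As a sketch of how much assembly that “real work” involves, here is the hypothetical three-tool pipeline in Python stubs; none of these functions correspond to any real product or API.

```python
# Hypothetical three-tool pipeline a bad actor would have to assemble by
# hand, since Movie Gen won't do it in one step. All functions are stubs
# and correspond to no real product or API.

def generate_video(prompt: str) -> str:
    return f"video({prompt})"           # a text-to-video model

def generate_speech(script: str) -> str:
    return f"audio({script})"           # a separate text-to-speech model

def lip_sync(video: str, audio: str) -> str:
    return f"synced({video}, {audio})"  # a separate open lip-syncing model

clip = generate_video("a candidate at a podium")
speech = generate_speech("a wild claim")
print(lip_sync(clip, speech))  # three distinct tools, stitched together manually
```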
When TechCrunch asked Meta about Movie Gen, a representative said that it is purely an AI research concept for now, and that safety remains a top priority, as it has been with the company’s other generative AI technologies.
Movie Gen won’t be getting a public release the way, say, the Llama large language models have. Its techniques can be replicated to some extent from the research paper, but the code itself won’t be published. The one exception is the “underlying evaluation prompt dataset,” the record of prompts used to generate the test videos.