Lip-Synching Text in realtime using Flash

by Eric Medine aka MKultra

This is a quick walkthru of my process, from concept to completion, of how to build a Flash utility to lip-synch animation to input text. I assume a certain amount of familiarity with Flash and object-oriented programming, as well as an understanding of basic concepts of visual animation.

I used a Dell Inspiron 8100 with 2 gigs of RAM running WinBlows XP, some kinda ATI video card that has disappointed me in the past, and a 2.4 GHz processor. Running Vista? Haven't tried it yet.

Project: Create a video utility that will allow me to add text in realtime to video mixing performances. It has to display text as well as have video or stills of someone speaking it.

Platform: Windows XP, Flash for the programming, Resolume for the live presentation.

I used this application during a live set I did at City Hall - Rambla Cataluña in Barcelona, Spain. I got to work with Andreas Castillo, a DJ and promoter, for a few Tuesdays in October of 2008.

About this tutorial
Use this tutorial however you want, feel free to add to it, be sure to forward it to whomever and give it away for free. If you charge some lazy bastard for this tutorial, you suck. If you use it for a performance and you make money-- you rock! If you have some questions, advice, or praise, contact me at:
info(at)ericmedine(dot)com

Other resources:
Here are some links to related projects, some sample files, and a live set in Barcelona, Spain using this module.
Source Flash files
Thread on VJ Central about similar projects
Live VJ set using this application at City Hall in Barcelona

STEPS:
1) Figure out the scope of my project-- what functions do I need? What video or images do I need?
2) Build my functions
3) Build my animation
4) Troubleshoot functionality, look and feel
5) Prep for live performance

STEP ONE: What do I need?

A. Video or stills of someone speaking.
This is more complicated than you would think-- I can't record every possible, conceivable word in the English language to videotape (and what if I want to do other languages? This was originally for a series of performances in Barcelona, which means I might need Catalan as well as Spanish and English), and even if I could, there is a tremendous technical difficulty in being able to access such a huge library of video loops in realtime.

Rather than use video loops, I decided to use stills of phonemes-- or, strictly speaking, visemes: the facial/lip expressions that are the result of pronouncing those sounds. I found this list of phonemes online-- there are better lists out there, but really I only need a guide.

This way I can match a visual with ANY sound that the text editor displays... with some exceptions. More on that later. Using this method my library has become much more manageable.

B. Text Input.
What exactly is "Text Input?" Does this need to be readable by humans-- or can I show phonetic spelling of words? Do I want it to display live and on-screen, or will it be hidden from the audience?

No hiding-- I want to use this to display all the text that I type. If the DJ wants to give a shout-out, or I want to lip-synch with the lyrics in a track, or if the sponsors need their names up in lights, correct spelling is key. This means no phonetic spelling-- text must be readable by humans.

Ideally I want to be able to type, and every line (or word?) will appear synched with the stills (more on that later) AS I TYPE IT.

STEP TWO: How do I build it?

Now that we know what we want to get, we can figure out how to program it. There are three things that need to happen here--

set up a way to input text
create a loop to read thru the text that has been input
assign an animation frame that will fire AS THE TEXT is being read.

Raw coding is intimidating, but I've included most of the code you will need in the download file. There are two different Flash files-- one of them (lip_synch_trigger.fla) is for testing and debugging, the other (lip_synch_framesynch.fla) is what you can use in your performance. I'm really going to skip over the basics and focus on nuance of syllabic structures.

A) Build an input text box.
Open up Flash, create an input text box (in this example it is called "textbox_speakeasy"), make an "enter text" button that will fire the function that will start the loops and animation, give the button an instance name (in this case it is "enterButt"), blah, blah, blah.

Basically, we want to make a way to test our "text to animation" function-- so make a text box that we can type into, and a button to activate our animation function. This way we can most easily check how it works with different spellings, spacing, letter combos, etc.

The action "enterButt.onRelease" is what activates our animation by calling a "setInterval" loop. Once we have a loop made, we can use it to trigger any one of our stills.

//// interval data
var intervalId:Number;
var count:Number = 0;
var maxCount:Number = 0;
var duration:Number = 150; // milliseconds between letters

var currentSyllable;
var currentLetter;
var mouthPosition;
var theText:String = "";

enterButt.onRelease = function(){
    // grab whatever is in the text box right now
    theText = textbox_speakeasy.text;
    // trace ("The length of the text is: " + theText.length);

    // start from the first letter, and kill any loop that is still running
    clearInterval(intervalId);
    count = 0;
    maxCount = theText.length;

    // "this" is the button, so "this._parent" is the timeline that holds startSpeaking
    intervalId = setInterval(this._parent, "startSpeaking", duration);
}

B) Create your loop.
Once the "setInterval" loop is called, it keeps looping until it gets to "maxCount" and stops. "maxCount" is the number of ALL THE CHARACTERS CURRENTLY IN THE TEXT BOX-- "theText.length". For instance, if we type in "Hello World" and click "enterButt", the variable "maxCount" becomes eleven (11). What's that you say? The words "Hello" and "World" only have ten letters, not eleven? Our "maxCount" counts the space as a character as well. This will become important later.

Although we aren't really reading the letters, we are checking to see what letter we are currently at. This means that the variable "count" can be used to refer to a specific letter. Keep in mind that "count" starts at zero-- if "count" = 2 then our loop is at the third letter in our text box, which is the first "l" in "Hello World".

The function "startSpeaking" is called EVERY TIME "setInterval" loops. This is what tells our animation to go to the correct frame. Why didn't I use "onEnterFrame" instead of "setInterval"? Well, I wanted to be able to change the speed (the "duration" variable, in milliseconds) to make the animation "speak" slower or faster for testing purposes, without touching the frame rate of the movie. This may have an adverse impact once I put this software into my video mixer though. More on that later.

function startSpeaking():Void {

    //// this checks to see what letter we're on
    currentLetter = textbox_speakeasy.text.substr(count, 1);
    _root.displayCurrentLetter.text = currentLetter;

    /// this tells our mouth to go to the frame labeled with the current letter
    mouthPlayer.gotoAndPlay(currentLetter);

    // are we at the end of our text?
    if (count >= maxCount) {
        clearInterval(intervalId);
        mouthPlayer.gotoAndStop("stop");
        count = 0;
        // bail out here so "count" stays at zero for the next run
        return;
    }
    //// if not at the end of the text, move on to the next letter
    count++;
}

C) Assign an animation frame.

When the "startSpeaking" function is called it does three things:

1. It assigns the variable "currentLetter" the value of whatever letter the loop is on. It knows what that value is because "count" is the position of the current character, counting from zero. If you've typed "hello world" and the "count" is at six (6), then "currentLetter" will be "w". The function "substr" gives us a piece of the text in the textbox, starting at the "count" position (in this case "6") and running for one (1) character.

Refer to the "help" in Flash if you are still not sure of what substr does.

2. It tells the animation (the "mouthPlayer" movie clip) to go to the frame named.... you guessed it! The same thing as "currentLetter". I'm having it target a frame label-- that way I can keep track of what picture goes with what sound just by glancing at my flash timeline.

3. It checks to see if the loop has gone thru the entire text-- if so, we're done!
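
Here is a rough sketch of those three steps with the Flash parts taken out, so the counting is easy to check. It is written in TypeScript rather than ActionScript so it can run outside Flash, and "speakFrames" is a made-up helper name, not something in the download files. It just collects the frame labels the mouth would visit, in order.

```typescript
// Pure-logic sketch of the letter loop; no Flash APIs are used.
// speakFrames() is a hypothetical helper, not part of the tutorial's files.
function speakFrames(text: string): string[] {
  const frames: string[] = [];
  // count is the zero-based position of the current character
  for (let count = 0; count < text.length; count++) {
    // same idea as substr(count, 1): one character starting at count
    const currentLetter = text.charAt(count);
    frames.push(currentLetter);
  }
  // after the last letter the mouth goes back to the "stop" (rest) frame
  frames.push("stop");
  return frames;
}

const helloFrames = speakFrames("hello world");
console.log(helloFrames.length); // 12: eleven characters plus the final "stop"
console.log(helloFrames[6]);     // "w"
```

Note that the space at position 5 is passed along as a frame label like any other character, which is the problem the exceptions later on deal with.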

STEP THREE: Build our animation.

Switch placeholder phonemes with your own stills.
I've used photos of a model to do facial expressions. You can draw your own, use a robot, or sign language (fun project) if you want. The key thing is to have the different expressions as recognizable as possible.

B, TH, F, O-- these are the most recognizable (sound them out if you need to.) Think about the "Flintstones" cartoon-- every time Fred said a word that began with "f", his ENTIRE LOWER JAW would almost be sucked up into his teeth. This is the kind of exaggeration that we want.

Prepare your movie clip to be targeted by your startSpeaking loop.
Import all your images into your flash movie, put them into a movie clip "mouthPlayer", and label the frames with letters of the alphabet. Notice that we are on frame label "f" and the frame in the animation is of a mouth saying "f". Do this for every mouth position.

Oh, wait-- there are fewer facial expressions than letters, you say? "d" and "e" have a similar look? "x" has no identifiable expression at all? Welcome to the world of lip-synching-- it barely follows the rules of grammar, and it matters more that it "looks right" than that it sticks to "this letter = this expression" rules.
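
One way to deal with letters that share a look is a small lookup that maps several letters onto one frame label. The sketch below is TypeScript (not ActionScript) and only illustrates the idea. The name "frameLabelFor" and most of the groupings are assumptions to be tuned by eye; only the "d"/"e" pairing comes from the paragraph above.

```typescript
// Hypothetical letter groups: each entry is [letters that share a look, frame label].
// Only the "de" pairing comes from the text; the rest are guesses for illustration.
const mouthGroups: Array<[string, string]> = [
  ["bmp", "b"],  // lips pressed together (assumption)
  ["fv", "f"],   // lower lip under the top teeth (assumption)
  ["de", "d"],   // similar look, per the text
  ["x", "stop"], // no identifiable expression, so rest the mouth (assumption)
];

function frameLabelFor(letter: string): string {
  const lower = letter.toLowerCase();
  if (lower.length !== 1) {
    return "stop"; // anything that is not a single character: rest
  }
  for (const group of mouthGroups) {
    if (group[0].indexOf(lower) !== -1) {
      return group[1];
    }
  }
  return lower; // no shared group: the letter keeps its own frame
}

console.log(frameLabelFor("M")); // "b"
console.log(frameLabelFor("e")); // "d"
console.log(frameLabelFor("a")); // "a"
```

With a table like this, the movie clip needs one labeled frame per mouth shape instead of one per letter.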

This part is where trial and error comes in.

STEP FOUR: Trial and Error

Now test it! Notice something weird? You'll notice that the mouth will jump to letters that you wouldn't normally pronounce. In fact, the mouth is moving more than you would ordinarily expect for certain sounds -- "th" particularly stands out. Also, when our little text animation gets to a space, there is no letter so it holds the last one that was "read". This is not normal-- humans tend to have a "rest" position in between words. So-- time to build some exceptions.

function startSpeaking():Void {
    //// test for letter combos and syllables
    currentSyllable = textbox_speakeasy.text.substr(count, 3);
    currentLetter = textbox_speakeasy.text.substr(count, 1);
    // trace("I am the current letter " + currentLetter);
    trace("Syllable is " + currentSyllable);

    if (currentLetter == " ") {
        // a space means a pause between words, so go to the rest frame
        mouthPlayer.gotoAndPlay("stop");
    } else {
        mouthPlayer.gotoAndPlay(currentLetter);
    }

    // are we at the end of our text?
    if (count >= maxCount) {
        clearInterval(intervalId);
        mouthPlayer.gotoAndStop("stop");
        count = 0;
        return; // keep "count" at zero for the next run
    }
    count++;
}

Make an exception for spaces in the text.
Now, although we have made a loop that will send a command to our animation for EVERY LETTER in our textbox, we don't really want that.

For instance, people don't hold their mouths in the position of the last syllable they spoke-- they close their mouths, or get ready to say the next word. For this reason, I've made a rest frame (labeled "stop") where the mouth is closed. How do I program this? Use an "if" statement-- for instance, "if" the currentLetter is equal to a space then tell the animation to go to the "stop" frame.

After we've told the animation to go to the correct frame, check to see if we have looped to the end of our text ("maxCount"), and delete our "setInterval", which will keep our mouth from looping.

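function startSpeaking():Void {
    //// test for letter combos and syllables
    currentSyllable = textbox_speakeasy.text.substr(count, 3);
    currentLetter = textbox_speakeasy.text.substr(count, 1);

    if (currentLetter == " ") {
        // space between words: rest position
        mouthPlayer.gotoAndPlay("stop");
    } else if (currentSyllable == "ext") {
        // special case: show the "xt" frame, then skip the two extra letters
        mouthPlayer.gotoAndPlay("xt");
        count = count+2;
    } else {
        // no special case: the frame label matches the letter
        mouthPlayer.gotoAndPlay(currentLetter);
    }
    // trace("Why am I not parsed?" + textbox_speakeasy.text);
    if (count >= maxCount) {
        clearInterval(intervalId);
        mouthPlayer.gotoAndStop("stop");
        count = 0;
        return; // keep "count" at zero for the next run
    }
    count++;
}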

Make an exception for everything that looks weird.
Fixing pauses in between words is all well and good, but it does not solve the problems of things like "PH", for instance. We don't want to display a face making a "p" sound and an "h" sound (there really isn't one anyway). We want the expression of "F" as in "phone".

Here is a more advanced programming configuration. Notice how I have added a variable-- "currentSyllable". This does something I mentioned before-- finding the letters around the current letter. Why would we want this? Well, I want to check if our face needs to be saying "phone" or a word where we want a hard "p" (as in punk).

So-- every time our loop hits a letter, check to see if we have a syllable (it checks the current letter plus the two letters after it) that deserves special treatment. If so, give it special treatment ("f" instead of "P" then "H"). If not, simply go to the specified letter.

You can, and will, end up adding bunches of different syllable types, special configurations, etc. This is where trial and error comes in.

Here's a for-instance. Let's say you have typed in "Phreaky Bootie". When our syllable checker sees "P", it looks to see if the next letter is "H". If this is the case, it tells our animation to go to the "f" frame, not the "p" frame. Great, right? Not really.

After our loop checks the "p", sees the next letter is "h", and it tells our animation to go to the "f" frame (actually it goes to a frame designated as "ph" but whatever), it is going to go to the next letter-- "h" and tell our animation to go to the "h" frame. We don't want that-- if there is a syllable, we want to skip the rest of the letters in the syllable-- so I've added a line that bumps "count" ahead ("count = count+2" in the "ext" example above) to skip the letters the syllable has already covered.

All this skipping stuff is weird, but basically we want the animation for "Phreaky" to go to the frames "freeky" not "P-F-H-r-e-e-k-y".
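
To see the skipping in one place, here is a small TypeScript sketch (again, not ActionScript, and not the code from the download files). The "comboFrames" list and the "framesForText" name are assumptions for illustration. It walks the text, checks for a letter combo first, and jumps past every letter the combo covers.

```typescript
// Hypothetical list of letter combos and the frame label each one should show.
// Longer combos are listed first so they win over shorter ones.
const comboFrames: Array<[string, string]> = [
  ["ext", "xt"],
  ["ph", "ph"],
  ["th", "th"],
];

function framesForText(text: string): string[] {
  const frames: string[] = [];
  const lower = text.toLowerCase();
  let count = 0;
  while (count < lower.length) {
    let matched = false;
    for (const combo of comboFrames) {
      const size = combo[0].length;
      if (lower.slice(count, count + size) === combo[0]) {
        frames.push(combo[1]);
        count += size; // skip every letter the combo covered
        matched = true;
        break;
      }
    }
    if (matched) {
      continue;
    }
    const letter = lower.charAt(count);
    frames.push(letter === " " ? "stop" : letter); // spaces rest the mouth
    count += 1;
  }
  frames.push("stop"); // end of text: back to rest
  return frames;
}

console.log(framesForText("Phreaky Bootie").join(" "));
// ph r e a k y stop b o o t i e stop
```

Adding a new exception then means adding one line to the list rather than another "else if" branch.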

STEP FIVE: Integrate it with your video software

Now that we have this thing programmed, let's do some video mixing with it! Since the "look and feel" is key, we want the text AND the animation to loop at the same time as the beat in a music track... or whatever. That means that we will get rid of the button and have our animation fire according to our frame speed-- if our .swf is looping at 24 fps then we want our mouth to lip synch once a second.

The only video software I've really found that handles Flash in any meaningful way is Resolume. I hate all kinds of things about it, but mainly it has to do with how it handles frame rates. When we embed our .swf into Resolume we will notice all kinds of synching issues-- multiple lines of text trigger the animation at weird times, and if we speed up the frame rate the animation starts looping uncontrollably. (Let's ignore the fact that I can't type quickly enough to get the lyrics to line up with the text.) This seems to have to do with the way Resolume handles Flash frames. Even though I've made a 1 frame .swf, Rez wants to give it a timeline, which conflicts with the loop speed I set using "setInterval".

I also need to change the way the text input is handled-- we will no longer be typing into a text box and hitting a button, we will be letting Resolume pass text data INTO our .swf. I've handled this by creating a temporary text box ("rezInputBox"). For more info on how to pass variables to and from Resolume, consult their website.

I want to be able to synch the speed of the lip-synching with the speed of whatever audio track I'm using, or whatever other video loops are activated, which means that basically I want to have the frame rate I set IN RESOLUME, using the slider, to be the speed of the lip synching.
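
Since one letter is spoken per tick, the timing is simple arithmetic, and it can help to work it out before fiddling with the slider. This is a TypeScript sketch with made-up helper names ("speakingTimeMs", "fpsToFit"), assuming exactly one letter per interval or per frame and ignoring any skipped syllables.

```typescript
// How long a phrase takes at a fixed delay per letter (e.g. a 150 ms interval).
function speakingTimeMs(text: string, msPerLetter: number): number {
  return text.length * msPerLetter;
}

// Frame rate needed for a phrase to finish in a given number of seconds,
// assuming one letter is spoken per frame.
function fpsToFit(text: string, seconds: number): number {
  return text.length / seconds;
}

console.log(speakingTimeMs("Hello World", 150)); // 1650
console.log(fpsToFit("Hello World", 2));         // 5.5
```

So "Hello World" at the 150 millisecond interval used earlier takes a bit over a second and a half, and squeezing it into two seconds of music would need a frame rate of about 5.5 FPS.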

The solution to this is to change the "setInterval" loop function to that of "onEnterFrame". Tougher than it sounds-- I've already mentioned that Rez doesn't like one-frame movie clips (I'm still unsure of how the "use Flash framerate" is different than "use Resolume framerate") so although "onEnterFrame" really ought to give me a lip-synch framerate that matches the Resolume framerate (remember, onEnterFrame fires at the frames-per-second that you set in the FLA-- not necessarily how many frames are in your movie) it still behaves unpredictably.

//// loop data
var count:Number = 0;
var maxCount:Number = 0;

var currentLetter;
var mouthPosition;

var rezInputText = rezInputBox.text;

/// let resolume declare the text
textbox_speakeasy.text = text;

var theText:String = textbox_speakeasy.text;

// called once per frame: speak one letter, then move on to the next
function speechCheckLoop(phraseLength){
    startSpeaking();
    count++;
    if (count >= phraseLength) {
        //trace("ENOUGH TALKING");
        // end of the phrase: start over and clear the text field
        count = 0;
        textbox_speakeasy.text = "";
    }
}

speechCheckLoop(theText.length);

So, how to fix? It seems like Rez wants to give my .swf a timeline no matter what I do, so I decided to build a loop that integrates with the way Rez handles Flash.

Rather than build an internal loop that calls "startSpeaking" on every letter, then stops at the end of the text, I've built my .swf to cycle thru its timeline (remember, its speed is set by Resolume) and call the function at every frame.

You'll notice ALL my "setInterval" code is gone and in its place is a "speechCheckLoop", a simple and similar function that increments thru the text string, calls the "startSpeaking" function, and once it gets to the end (it knows what the end is because we have passed theText.length to it) resets back to zero and clears the text field. (You'll notice I stripped out a lot of the other code to save space)

This code is on frame one, and is only good for one frame, one letter-- how do I do this for every letter? Well, this is the hack-y part. I just put a call to the "speechCheckLoop" code that is on frame 1 on every subsequent frame. Resolume wants to crawl thru a timeline no matter what? Fine, I'll just duplicate the same "speech check" call (NOT the entire lip-synch function, that's only on frame 1) on every frame on the timeline. If I set the Resolume framerate to be 3 FPS, my lip-synch animation will move at 3 FPS.

What happens when I get to the end of the text string? Well, I put a "gotoAndStop" action in my function to fire if the text length has been exceeded. This will put our mouth back in the "rest" position. Keep in mind that my timeline has 100 frames, so it can't read anything beyond 100 characters (get it? each frame is a letter). Really, though, this widget is about making some quick shout-outs or taglines, not for reading "Paradise Lost".
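
The "each frame is a letter" idea can be sketched outside of Flash like this. It is a simplified TypeScript model, not the actual timeline code: "playTimelineOnce" is a hypothetical name, and it assumes one pass thru a 100 frame timeline with no syllable skipping.

```typescript
// Simplified model of one pass through the timeline: one letter per frame.
// TIMELINE_FRAMES mirrors the 100-frame timeline described in the text.
const TIMELINE_FRAMES = 100;

function playTimelineOnce(text: string): string[] {
  const shown: string[] = [];
  for (let frame = 0; frame < TIMELINE_FRAMES; frame++) {
    if (frame < text.length) {
      const letter = text.charAt(frame);
      shown.push(letter === " " ? "stop" : letter);
    } else {
      shown.push("stop"); // text is finished: hold the rest position
    }
  }
  return shown;
}

const shoutOut = playTimelineOnce("hi bcn");
// count the rest frames left over after the last letter
const trailingRest = TIMELINE_FRAMES - "hi bcn".length;
console.log(shoutOut.length); // 100
console.log(trailingRest);    // 94
```

A short shout-out leaves most of the timeline sitting in the rest position, and anything past character 100 never gets a frame at all.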

You can see in the screengrab how ugly this looks, but it works much better-- I can speed up or slow down my flash movie to make it "look" and "read" much more convincingly. I still have to "trim" my timeline in Resolume, since my lip-synch only increments at every frame. Otherwise I get a long pause after every text reading. It still wants to loop in funky ways... but I need a drink before I move forward. Bourbon and soda!

At the time of this writing the project is still a very imperfect workaround. I don't fully understand the limitations of Resolume's Flash engine, and it is tough to program-- I have to export my .swf, then open Rez to test, and close and restart if something isn't working quite right. I haven't got my hands on the new version of Resolume, or had a chance to experiment with other software like MotionDive, but hopefully this will be a good jumping-off project for more text experimentation. If anyone uses this for some projects, let me know!

MkUltra (out).