Parsing XHTML with E4X in AS3

July 2nd, 2008 by Steven Sacks

Gaia has an SEO feature that parses copy from the XHTML page and loads it into Flash. In AS3, it uses E4X. There are a few important things to know when parsing XHTML with E4X.

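First, you need to turn off ignoreWhitespace and prettyPrinting. Both are static properties of the XML class, so they apply to every XML object, and ignoreWhitespace has to be set before the XHTML is parsed. The reason for doing this will be explained below.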

XML.ignoreWhitespace = false;
XML.prettyPrinting = false;
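
Because these flags are global, changing them can affect other XML code in the same SWF. Here's a minimal sketch of one way to limit that, assuming you only need the changed settings while you process the XHTML. XML.settings() and XML.setSettings() are part of the AS3 XML class; the variable name savedSettings is just an illustration.

```actionscript
// Remember the current global XML settings so they can be restored later.
var savedSettings:Object = XML.settings();

XML.ignoreWhitespace = false;
XML.prettyPrinting = false;

// ... parse and serialize the XHTML here ...

// Put the global flags back the way they were.
XML.setSettings(savedSettings);
```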

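Next, you need to set the default xml namespace to the XHTML namespace. This is one of the trickier parts for those unfamiliar with namespaces. Every element in an XHTML page belongs to that namespace, so if you don't do this, unqualified names like div and p won't match anything and you won't be able to parse the XHTML.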

default xml namespace = new Namespace("http://www.w3.org/1999/xhtml");
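
To see what the namespace actually changes, here's a small standalone sketch. It assumes it runs in a scope where the default xml namespace has not been set, and the inline page string and the variable names (xhtml, page) are made up for the example. It also shows the other way to reach namespaced elements: qualifying each name with a Namespace object and the :: operator.

```actionscript
var xhtml:Namespace = new Namespace("http://www.w3.org/1999/xhtml");

// A tiny stand-in for a loaded XHTML page (hypothetical content).
var page:XML = new XML(
    '<html xmlns="http://www.w3.org/1999/xhtml">' +
    '<body><div id="copy"><p>Hello</p></div></body>' +
    '</html>');

// Unqualified name: looks for a div in no namespace, so it matches nothing here.
trace(page..div.length());

// Qualified name: looks for a div in the XHTML namespace, so it finds the div.
trace(page..xhtml::div.length());
```

Setting the default xml namespace saves you from writing the xhtml:: prefix on every name, which is why the rest of this post uses it.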

Then, you convert the loaded HTML string into an XML object. In this example, event.target.data is from the Event.COMPLETE of a URLLoader.

var html:XML = XML(event.target.data);
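
Put together, the setup so far might look something like the sketch below. The URL "index.html" and the handler name onPageLoaded are placeholders, not anything Gaia requires. I'm also assuming the default xml namespace directive belongs inside the function that does the parsing, since that is the scope it applies to.

```actionscript
import flash.events.Event;
import flash.net.URLLoader;
import flash.net.URLRequest;

// Set the global flags before any XML is parsed or serialized.
XML.ignoreWhitespace = false;
XML.prettyPrinting = false;

var loader:URLLoader = new URLLoader();
loader.addEventListener(Event.COMPLETE, onPageLoaded);
// "index.html" is a placeholder; point this at your own XHTML page.
loader.load(new URLRequest("index.html"));

function onPageLoaded(event:Event):void
{
    // Assumption: the directive is scoped to this function, so unqualified
    // names used below resolve to elements in the XHTML namespace.
    default xml namespace = new Namespace("http://www.w3.org/1999/xhtml");

    // URLLoader delivers the page as a String by default; XML() parses it.
    var html:XML = XML(URLLoader(event.target).data);

    // Trace the qualified name of the root element as a quick sanity check.
    trace(html.name());
}
```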

Now comes the fun part. Since valid XHTML is technically XML, E4X can traverse it the same way it traverses any other XML. In Gaia, I'm searching the XHTML page for a <div> with an id of "copy" to extract all the <p> tags out of it. Using E4X's descendant syntax, it's easy!

var copyTags:XMLList = html..div.(hasOwnProperty("@id") && @id == "copy")..p;

The above line of code searches all the div tags of the html file (html..div) for one that has an attribute called id (hasOwnProperty("@id")) and whose id attribute has a value of copy (@id == "copy"). Once it finds it, it returns an XMLList of all the <p> tags inside that div by using the descendant syntax (..p).
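
The hasOwnProperty() check is there because a filter that reads @id directly can throw an error when it reaches a <div> that has no id attribute. A slightly shorter variation, sketched here on the assumption that html and the default xml namespace are set up as above, is to use the attribute() method instead. The name copyTagsAlt is only there to avoid clashing with the variable above.

```actionscript
// attribute("id") yields an empty XMLList for a <div> with no id,
// so the comparison is simply false instead of raising an error.
var copyTagsAlt:XMLList = html..div.(attribute("id") == "copy")..p;
```

Either form should select the same <p> tags; use whichever reads better to you.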

Now here's the tricky part. E4X was not specifically meant to parse XHTML, particularly node values with other tags inside them, such as paragraph tags. Inside <p> tags, XHTML often has other tags like <font>, <strong> and <em>. E4X sees these tags as child nodes, not part of the node value of the <p> tag. In order to get around this, you have to iterate through the children of the <p> tag and concatenate them manually.

For instance, if your <p> tag looked like this:

<p>This is <strong>bold</strong> text.</p>

The children(), listed in order, would be three nodes: a text node, the <strong> element (with its own text inside it), and another text node:

["This is ", "<strong>bold</strong>", " text."];
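
If you want to check this for yourself, here's a quick sketch that lists each child along with its node kind. It builds the <p> tag from a string without any namespace to keep things simple, and the variable names are just for illustration.

```actionscript
// Keep the spaces around the <strong> tag when the string is parsed.
XML.ignoreWhitespace = false;
XML.prettyPrinting = false;

var sample:XML = new XML("<p>This is <strong>bold</strong> text.</p>");

for each (var node:XML in sample.children())
{
    // nodeKind() reports "text" for text nodes and "element" for tags.
    // The brackets make leading and trailing spaces visible in the output.
    trace(node.nodeKind() + ": [" + node.toXMLString() + "]");
}
```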

Here's the code to parse and concatenate the node value of the entire <p> tag.

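// get the first p node
var copyTag:XML = copyTags[0];
var str:String = "";
// fetch the child list once instead of on every pass through the loop
var children:XMLList = copyTag.children();
var len:int = children.length();
for (var i:int = 0; i < len; i++)
{
        // serialize each child (text node or element) and append it
        str += children[i].toXMLString();
}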

Unfortunately, when E4X serializes an element on its own, it writes that element's namespace declaration into the opening tag. That applies to any tag inside the <p> tag, including the <strong> tag, so your output looks like this:

"This is <strong xmlns="http://www.w3.org/1999/xhtml">bold</strong> text."

This shouldn't have any bad effect if you assign it as htmlText to a TextField in Flash. If you want to clean up the namespace declaration, though, removeNamespace() won't help; it just doesn't work here. Instead, you need some fancy RegEx (graciously provided by Mike Keesey).

str = str.replace(/\s+xmlns(:[^=]+)?="[^"]*"/g, "");

This strips out the xmlns="http://www.w3.org/1999/xhtml" declaration from any and all tags inside the <p> tag.
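
If you need this in more than one place, the loop and the RegEx can be rolled into a small helper. This is only a sketch: the function name innerXHTML is my own invention, not part of Gaia or E4X, and it assumes ignoreWhitespace and prettyPrinting are already set to false.

```actionscript
// Returns the contents of a node as a single string, inner tags included.
function innerXHTML(node:XML):String
{
    var out:String = "";
    for each (var child:XML in node.children())
    {
        // Text nodes come back as text, elements as full tags.
        out += child.toXMLString();
    }
    // Strip any xmlns declarations that were added during serialization.
    return out.replace(/\s+xmlns(:[^=]+)?="[^"]*"/g, "");
}
```

You could then call it on any node in the list, for example innerXHTML(copyTags[0]).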

Remember how we set prettyPrinting = false up above? The reason for this is that E4X automatically puts line breaks between the tags inside the <p> tag when it formats its output, so you need to turn prettyPrinting off to get rid of them. If you didn't, the node value output would look like this:

This is
bold
text.

We also set ignoreWhitespace = false. If you leave ignoreWhitespace at its default of true, any spaces around tags inside the <p> tags will be removed and your node value output would look like this:

This isboldtext.

However, if you do everything correctly, you end up with this as your node value:

This is bold text.

Of course, you could get around all of this by using the legacy XMLDocument and XMLNode classes (in the flash.xml package) provided by Adobe for more AS1-style XML parsing. But that just wouldn't be as fun, would it?

Posted in Actionscript, E4X, Flash, Tips/Tricks


About Steven Sacks

I am a professional Flash developer with over 13 years of programming experience. I have consulted for high-profile agencies and companies in San Francisco, Los Angeles, Atlanta and New York, and developed numerous award-winning websites and rich internet applications for clients including Adobe, Fox Sports, FX Networks, Anheuser-Busch, GE, DirecTV, ESPN, The Weather Channel, Home Depot, and Coca-Cola.

I am the author of the open-source Gaia Framework for Adobe Flash, which dramatically reduces development time and makes developing Flash sites much easier.