Parsing XHTML with E4X in AS3

July 2nd, 2008 by Steven Sacks

Gaia has an SEO feature that parses copy from the XHTML page and loads it into Flash. In AS3, it uses E4X. There are a few important things to know when parsing XHTML with E4X.

First, you need to turn off ignoreWhitespace and prettyPrinting. The reason for doing this will be explained below.

XML.ignoreWhitespace = false;
XML.prettyPrinting = false;

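Next, you need to set the default xml namespace to the XHTML namespace. This is one of the trickier parts for those unfamiliar with namespaces. Every element in an XHTML document belongs to that namespace, so if you don't do this, unqualified names like div and p won't match anything and you won't be able to parse the XHTML.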

default xml namespace = new Namespace("http://www.w3.org/1999/xhtml");

Then, you wrap the loaded html into an XML object. In this example, event.target.data is from the Event.COMPLETE of a URLLoader.

var html:XML = XML(event.target.data);
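Putting the steps so far together, a minimal loading sketch might look like the following. The URL "index.html" and the function names loadCopy and onLoadComplete are placeholders for illustration, not Gaia's actual code.

```actionscript
import flash.events.Event;
import flash.net.URLLoader;
import flash.net.URLRequest;

// Hypothetical helper: loads an XHTML page so it can be parsed with E4X.
function loadCopy(url:String):void
{
    var loader:URLLoader = new URLLoader();
    loader.addEventListener(Event.COMPLETE, onLoadComplete);
    loader.load(new URLRequest(url));
}

function onLoadComplete(event:Event):void
{
    // Set these before the XML object is created and serialized.
    XML.ignoreWhitespace = false;
    XML.prettyPrinting = false;

    // Make unqualified names such as div and p refer to XHTML elements.
    default xml namespace = new Namespace("http://www.w3.org/1999/xhtml");

    // event.target is the URLLoader; its data property holds the page source.
    var html:XML = XML(event.target.data);
    trace(html.localName());
}

// "index.html" is an assumed path to a well-formed XHTML page.
loadCopy("index.html");
```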

Now comes the fun part. Since valid XHTML is technically XML, E4X is able to parse through it exactly the same way. In Gaia, I'm searching the XHTML page for a <div> with an id of "copy" to extract all the <p> tags out of it. Using E4X's descendant syntax, it's easy!

var copyTags:XMLList = html..div.(hasOwnProperty("@id") && @id == "copy")..p;

The above line of code searches all the div tags of the html file (html..div) for one that has an attribute called id (hasOwnProperty("@id")) and whose id attribute has a value of copy (@id == "copy"). Once it finds it, it returns an XMLList of all the <p> tags inside that div by using the descendant syntax (..p).
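If you would rather not change the default namespace, the same kind of query can be written with qualified names instead. This is an alternative sketch, not what Gaia does. The variable name xhtml and the inline sample document are made up for illustration, and attribute("id") stands in for the hasOwnProperty check on the assumption that it simply returns an empty list for a div with no id.

```actionscript
// Hold the XHTML namespace in a variable instead of making it the default.
var xhtml:Namespace = new Namespace("http://www.w3.org/1999/xhtml");

// Made-up sample document, used only to keep the example self-contained.
var html:XML =
    <html xmlns="http://www.w3.org/1999/xhtml">
        <body>
            <div id="copy"><p>First</p><p>Second</p></div>
            <div><p>Other</p></div>
        </body>
    </html>;

// xhtml::div and xhtml::p are namespace-qualified element names.
var copyTags:XMLList = html..xhtml::div.(attribute("id") == "copy")..xhtml::p;

// Number of <p> tags found inside the "copy" div.
trace(copyTags.length());
```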

Now here's the tricky part. E4X was not specifically meant to parse XHTML, particularly node values with other tags inside them, such as paragraph tags. Inside <p> tags, XHTML often has other tags like <font>, <strong> and <em>. E4X sees these tags as child nodes, not part of the node value of the <p> tag. In order to get around this, you have to iterate through the children of the <p> tag and concatenate them manually.

For instance, if your <p> tag looked like this:

<p>This is <strong>bold</strong> text.</p>

The children(), represented as an Array, would be:

["This is ", "<strong>bold</strong>", " text."];

Here's the code to parse and concatenate the node value of the entire <p> tag.

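// get the first p node
var copyTag:XML = copyTags[0];
// fetch the child list once instead of on every pass through the loop
var children:XMLList = copyTag.children();
var str:String = "";
var len:int = children.length();
for (var i:int = 0; i < len; i++)
{
        // concatenate each child (text nodes and elements alike)
        str += children[i].toXMLString();
}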

Unfortunately, E4X is going to add a namespace declaration to any tags inside the <p> tag, including the <strong> tag, resulting in your output looking like this:

"This is <strong xmlns="http://www.w3.org/1999/xhtml">bold</strong> text."

This shouldn't have any bad effect if you assign it as htmlText to a TextField in Flash. If you want to clean up the namespace, though, you can't use removeNamespace(); it just doesn't work. Instead, you need a regular expression (graciously provided by Mike Keesey).

str = str.replace(/\s+xmlns(:[^=]+)?="[^"]*"/g, "");

This strips out the xmlns="http://www.w3.org/1999/xhtml" declaration (and any prefixed xmlns:... declarations) from any and all tags inside the <p> tag.
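The concatenation loop and the regular expression can be wrapped into one small function. The sketch below is an illustration rather than Gaia's actual code: the name innerXHTML is hypothetical, and it assumes XML.prettyPrinting and XML.ignoreWhitespace were already set to false before the page was parsed.

```actionscript
// Hypothetical helper: returns the markup inside a node as a String,
// with the xmlns declarations removed.
function innerXHTML(node:XML):String
{
    var str:String = "";

    // Text nodes and child elements are serialized in document order.
    for each (var child:XML in node.children())
    {
        str += child.toXMLString();
    }

    // Same regular expression as above: drops xmlns="..." and xmlns:prefix="...".
    return str.replace(/\s+xmlns(:[^=]+)?="[^"]*"/g, "");
}

// Usage, assuming copyTags is the XMLList of <p> tags found earlier.
for each (var p:XML in copyTags)
{
    trace(innerXHTML(p));
}
```

The returned string can then be assigned to the htmlText property of a TextField.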

Remember how we set prettyPrinting = false up above? The reason for this is that, with prettyPrinting on, E4X automatically puts line breaks between the tags inside the <p> tag, so you need to turn it off to get rid of them. If you left it on, the node value output would look like this:

This is
bold
text.

We also set ignoreWhitespace = false. If you don't have ignoreWhitespace set to false, then any spaces around tags inside the <p> tags will be removed and your node value output would look like this:

This isboldtext.

However, if you do everything correctly, you end up with this as your node value:

This is bold text.

Of course, you could get around all of this by using the XMLNode class provided by Adobe for more AS1-style XML parsing. But that just wouldn't be as fun, would it?

Posted in Actionscript, E4X, Flash, Tips/Tricks

10 Responses

  1. Szabesz

    Thanx for sharing!

  2. Julien

    HAHA! So nice! Thanks for this. :)

  3. Joe Jackson

    For some reason the prettyPrinting is still creating the:

    This is
    bold
    text

    scenario… But I use "custom" nodes rather than for bold. Could this be the case? So instead of bold I use bold

  4. danthepizzaman

    Thanks for this, exactly what i was looking for!

  5. Bookmarks about Xhtml

    [...] – bookmarked by 2 members originally found by Dulcinea on 2008-11-30 Parsing XHTML with E4X in AS3 http://www.stevensacks.net/2008/07/02/parsing-xhtml-with-e4x-in-as3/ – bookmarked by 1 members [...]

  6. Goran

    Thanks for this, was looking for something like this. I tried to implement it, but still can't get it right. Is there any chance you can leave a simple example for download so I can test it and learn where I am getting it wrong?

  7. Glidias

    With ignoreWhitespace set to "true", is there any way to skip iterating through null XML nodes, or perhaps this is simply unavoidable? Iteration time is doubled and I wish this could be avoided.

    Also, in your Gaia Flash Framework, you forgot to strip the namespace off the root node as well for innerHTML.

  8. oxid

    hello,

    i was really happy finding this site but the solution presented here seems to not work on my case.

    see, I'd like to parse the html on this link: http://www.1club.fm/NowPlayMy1clubfm/thebeat4.html

    but in every case the script gives back a parser error.

    here is my code:

    default xml namespace = new Namespace("http://www.w3.org/1999/xhtml");
    XML.ignoreWhitespace = false;
    XML.prettyPrinting = false;
    XML.ignoreComments = true;
    XML.ignoreProcessingInstructions = true;
    var requester:URLRequest = new URLRequest("http://www.1club.fm/NowPlayMy1clubfm/thebeat4.html");
    var loader:URLLoader = new URLLoader(requester);
    loader.addEventListener(Event.COMPLETE, loaded);
    function loaded(e:Event):void{
        var xmlobj:XML = new XML(e.target.data);
        trace(xmlobj.toString());
    }

    can you please help me?

    thank you!

  9. skribbs

    oxid: I got this error as well. It happens when your HTML isn't properly coded (make sure all the tags are closed properly).

  10. Turghon

    Thanks! We always need this!

About Steven Sacks

I am a professional Flash developer with over 13 years of programming experience. I have consulted for high-profile agencies and companies in San Francisco, Los Angeles, Atlanta and New York, and developed numerous award-winning websites and rich internet applications for clients including Adobe, Fox Sports, FX Networks, Anheuser-Busch, GE, DirecTV, ESPN, The Weather Channel, Home Depot, and Coca-Cola.

I am the author of the open-source Gaia Framework for Adobe Flash, which dramatically reduces development time and makes developing Flash sites much easier.