Parsing XHTML with E4X in AS3
Gaia has an SEO feature that parses copy from the XHTML page and loads it into Flash. In AS3, it uses E4X. There are a few important things to know when parsing XHTML using E4X.
First, you need to turn off ignoreWhitespace and prettyPrinting. The reason for doing this will be explained below.
XML.ignoreWhitespace = false;
XML.prettyPrinting = false;
Next, you need to set the default xml namespace to the xhtml namespace. This is one of the trickier parts for those unfamiliar with namespaces. If you don't do this, you won't be able to parse the XHTML.
default xml namespace = new Namespace("http://www.w3.org/1999/xhtml");
Then, you wrap the loaded html into an XML object. In this example, event.target.data is from the Event.COMPLETE of a URLLoader.
var html:XML = XML(event.target.data);
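For context, here is a minimal sketch of how the steps so far could fit together around a URLLoader. The class name XhtmlLoader and the URL "index.html" are placeholders I made up for illustration; they are not taken from Gaia.

```actionscript
package {
    import flash.display.Sprite;
    import flash.events.Event;
    import flash.net.URLLoader;
    import flash.net.URLRequest;

    // Hypothetical document class, for illustration only.
    public class XhtmlLoader extends Sprite {
        public function XhtmlLoader() {
            // These must be set before the markup is parsed and serialized.
            XML.ignoreWhitespace = false;
            XML.prettyPrinting = false;

            var loader:URLLoader = new URLLoader();
            loader.addEventListener(Event.COMPLETE, onComplete);
            // "index.html" is a placeholder URL (assumption).
            loader.load(new URLRequest("index.html"));
        }

        private function onComplete(event:Event):void {
            // Unqualified element names below resolve in the XHTML namespace.
            default xml namespace = new Namespace("http://www.w3.org/1999/xhtml");

            // Wrap the loaded text in an XML object.
            var html:XML = XML(event.target.data);

            // Report how many <p> elements the page contains.
            trace(html..p.length());
        }
    }
}
```

Setting the two static flags in the constructor means they are already in effect by the time the COMPLETE handler parses the page.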
Now comes the fun part. Since valid XHTML is technically XML, E4X can parse it the same way as any other XML. In Gaia, I'm searching the XHTML page for a <div> with an id of "copy" to extract all the <p> tags out of it. Using E4X's descendant syntax, it's easy!
var copyTags:XMLList = html..div.(hasOwnProperty("@id") && @id == "copy")..p;
The above line of code searches all the div tags of the html file (html..div) for one that has an attribute called id (hasOwnProperty("@id")) and whose id attribute has a value of copy (@id == "copy"). Once it finds it, it returns an XMLList of all the <p> tags inside that div by using the descendant syntax (..p).
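The hasOwnProperty("@id") guard matters because, in AS3, a filter that refers to @id on an element without that attribute throws a ReferenceError. As a rough sketch, here are two other ways to write the same search: the attribute() method, which returns an empty XMLList when the attribute is missing, and a plain loop. The sample markup is made up for illustration, and the code is written as frame-script style statements.

```actionscript
XML.ignoreWhitespace = false;
XML.prettyPrinting = false;
default xml namespace = new Namespace("http://www.w3.org/1999/xhtml");

// Made-up sample markup (assumption), with one div that has no id.
var html:XML = new XML(
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>' +
    '<div><p>No id on this div</p></div>' +
    '<div id="copy"><p>First</p><p>Second</p></div>' +
    '</body></html>');

// The filter from the article.
var viaFilter:XMLList = html..div.(hasOwnProperty("@id") && @id == "copy")..p;

// attribute("id") is safe on elements without an id, so no guard is needed.
var viaAttribute:XMLList = html..div.(attribute("id") == "copy")..p;

// The same search spelled out as a loop.
var viaLoop:XMLList = new XMLList();
for each (var div:XML in html..div) {
    // Outside a filter, div.@id is just an empty XMLList when missing.
    if (div.@id == "copy") {
        viaLoop += div..p;
    }
}

trace(viaFilter.length(), viaAttribute.length(), viaLoop.length());
```

All three forms should select the same <p> elements; which one to use is mostly a matter of taste.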
Now here's the tricky part. E4X was not specifically meant to parse XHTML, particularly node values with other tags inside them, such as paragraph tags. Inside <p> tags, XHTML often has other tags like <font>, <strong> and <em>. E4X sees these tags as child nodes, not part of the node value of the <p> tag. In order to get around this, you have to iterate through the children of the <p> tag and concatenate them manually.
For instance, if your <p> tag looked like this:
<p>This is <strong>bold</strong> text.</p>
The children(), represented as an Array, would be:
["This is ", "<strong>bold</strong>", " text."];
Here's the code to parse and concatenate the node value of the entire <p> tag.
// get the first p node
var copyTag:XML = copyTags[0];
var str:String = "";
var len:int = copyTag.children().length();
for (var i:int = 0; i < len; i++)
{
    // concatenate each child
    str += copyTag.children()[i].toXMLString();
}
Unfortunately, E4X is going to inject the namespace into any tags inside the <p> tag, including the <strong> tag, resulting in your output looking like this:
"This is <strong xmlns="http://www.w3.org/1999/xhtml">bold</strong> text."
This shouldn't cause any problems if you assign it as htmlText to a TextField in Flash. If you want to clean up the namespace declarations, though, removeNamespace() won't help: it does not remove a namespace that an element's own qualified name still refers to, which is the case for every XHTML tag here. Instead, you need some fancy RegEx (graciously provided by Mike Keesey).
str = str.replace(/\s+xmlns(:[^=]+)?="[^"]*"/g, "");
This strips the xmlns="http://www.w3.org/1999/xhtml" declaration (and any prefixed xmlns:... declaration) from any and all tags inside the <p> tag.
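The loop and the RegEx can be folded into one small helper. This is only a sketch: the function name innerMarkup is my own and is not part of Gaia or of the XML class, and it assumes ignoreWhitespace and prettyPrinting are already set to false.

```actionscript
// Hypothetical helper: returns the contents of a node as a markup string,
// with the xmlns declarations stripped from any nested tags.
function innerMarkup(node:XML):String {
    var str:String = "";
    // Each child is either a text node or an element such as <strong>.
    for each (var child:XML in node.children()) {
        str += child.toXMLString();
    }
    // Remove default and prefixed namespace declarations.
    return str.replace(/\s+xmlns(:[^=]+)?="[^"]*"/g, "");
}

// Usage with a made-up paragraph in the XHTML namespace.
XML.ignoreWhitespace = false;
XML.prettyPrinting = false;
var sample:XML = new XML(
    '<p xmlns="http://www.w3.org/1999/xhtml">This is <strong>bold</strong> text.</p>');
trace(innerMarkup(sample));
```

Calling it on each item of the XMLList, for example with for each (var p:XML in copyTags), gives one string per paragraph that can be assigned to htmlText.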
Remember how we set prettyPrinting = false up above? The reason for this is that E4X automatically puts carriage returns between the tags inside the <p> tag, so you need to turn prettyPrinting off to get rid of them. If you didn't, the node value output would look like this:
This is
bold
text.
We also set ignoreWhitespace = false. If you don't, any spaces around tags inside the <p> tags will be removed and your node value output would look like this:
This isboldtext.
However, if you do everything correctly, you end up with this as your node value:
This is bold text.
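One thing to keep in mind is that ignoreWhitespace and prettyPrinting are static settings on the XML class, so changing them affects every other piece of E4X code in the movie. If that is a concern, a possible approach is to save the current settings with XML.settings() and put them back with XML.setSettings() when you are done. The helper name withXhtmlSettings is hypothetical, and this is a sketch rather than what Gaia does.

```actionscript
// Hypothetical helper: runs a function with the XHTML-friendly settings,
// then restores whatever settings were in place before.
function withXhtmlSettings(work:Function):* {
    var saved:Object = XML.settings();
    XML.ignoreWhitespace = false;
    XML.prettyPrinting = false;
    try {
        return work();
    } finally {
        // Runs even if work() throws, so the settings never leak.
        XML.setSettings(saved);
    }
}

// Usage: both parsing and toXMLString() happen inside the callback,
// because ignoreWhitespace applies when parsing and prettyPrinting
// applies when serializing.
var markup:String = withXhtmlSettings(function():String {
    var p:XML = new XML('<p>This is <strong>bold</strong> text.</p>');
    var str:String = "";
    for each (var child:XML in p.children()) {
        str += child.toXMLString();
    }
    return str;
});
trace(markup);
```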
Of course, you could get around all of this by using the XMLNode class provided by Adobe for more AS1-style XML parsing. But that just wouldn't be as fun, would it?

October 15th, 2008 at 2:50 pm
Thanx for sharing!
October 19th, 2008 at 10:24 am
HAHA! So nice! Thanks for this.
November 21st, 2008 at 10:17 am
For some reason the prettyPrinting is still creating the:
This is
bold
text
scenario… But I use "custom" nodes rather than for bold. Could this be the case? So instead of bold I use bold
December 8th, 2008 at 8:24 am
Thanks for this, exactly what i was looking for!