Parsing XHTML with E4X in AS3
Gaia has an SEO feature that parses copy from the XHTML page and loads it into Flash. In AS3, it uses E4X. There are a few important things to know when parsing XHTML with E4X.
First, you need to turn off ignoreWhitespace and prettyPrinting. The reason for doing this will be explained below.
XML.ignoreWhitespace = false;
XML.prettyPrinting = false;
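Next, you need to set the default xml namespace to the XHTML namespace. This is one of the trickier parts for those unfamiliar with namespaces. Every element in an XHTML document belongs to that namespace, so if you skip this step, unqualified lookups such as html..div won't match anything and you won't be able to parse the XHTML.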
default xml namespace = new Namespace("http://www.w3.org/1999/xhtml");
Then, you wrap the loaded html into an XML object. In this example, event.target.data is from the Event.COMPLETE of a URLLoader.
var html:XML = XML(event.target.data);
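For readers who want the steps so far in one place, here is a minimal sketch written as a timeline frame script. The URL copy.html and the function names parseXhtml and onLoaded are placeholders of mine, not part of Gaia. The XML constructor throws a TypeError when the markup is not well-formed, so the sketch catches it; a page that is ordinary HTML rather than valid XHTML will likely end up in that branch.

```actionscript
import flash.events.Event;
import flash.net.URLLoader;
import flash.net.URLRequest;

// Set these before any XML is parsed or serialized.
XML.ignoreWhitespace = false;
XML.prettyPrinting = false;

// Hypothetical helper: returns null instead of throwing on malformed markup.
function parseXhtml(source:String):XML
{
    try
    {
        return new XML(source);
    }
    catch (error:TypeError)
    {
        trace("Not well-formed XML: " + error.message);
    }
    return null;
}

function onLoaded(event:Event):void
{
    // The directive is scoped to the function it appears in,
    // so it is declared where the E4X lookups happen.
    default xml namespace = new Namespace("http://www.w3.org/1999/xhtml");

    var html:XML = parseXhtml(String(URLLoader(event.target).data));
    if (html == null)
    {
        return;
    }
    trace("Paragraphs found: " + html..p.length());
}

var loader:URLLoader = new URLLoader();
loader.addEventListener(Event.COMPLETE, onLoaded);
loader.load(new URLRequest("copy.html")); // placeholder URL
```

Parsing is kept in its own function so the error handling stays separate from the namespace-dependent lookups.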
Now comes the fun part. Since valid XHTML is technically XML, E4X can parse it in exactly the same way. In Gaia, I'm searching the XHTML page for a <div> with an id of "copy" to extract all the <p> tags out of it. Using E4X's descendant syntax, it's easy!
var copyTags:XMLList = html..div.(hasOwnProperty("@id") && @id == "copy")..p;
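The above line of code searches all the div tags of the html file (html..div) for one that has an attribute called id (hasOwnProperty("@id")) and whose id attribute has a value of copy (@id == "copy"). The hasOwnProperty check matters because filtering on @id throws a ReferenceError as soon as the filter reaches a div that has no id attribute. Once it finds the div, it returns an XMLList of all the <p> tags inside that div by using the descendant syntax (..p).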
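To make the one-liner easier to follow, here is the same lookup split into its three steps. This is only an illustrative sketch: the inline sample markup and the variable names allDivs and copyDiv are my own, chosen so the snippet can be pasted into a frame script without loading anything.

```actionscript
default xml namespace = new Namespace("http://www.w3.org/1999/xhtml");

// Sample document (made up for this example).
var html:XML =
    <html xmlns="http://www.w3.org/1999/xhtml">
        <body>
            <div id="nav"><p>Not the copy.</p></div>
            <div id="copy"><p>First.</p><p>Second.</p></div>
            <div><p>This div has no id.</p></div>
        </body>
    </html>;

// Step 1: every <div> in the document, at any depth.
var allDivs:XMLList = html..div;

// Step 2: keep only the divs that have an id attribute equal to "copy".
var copyDiv:XMLList = allDivs.(hasOwnProperty("@id") && @id == "copy");

// Step 3: every <p> below the matching div.
var copyTags:XMLList = copyDiv..p;

trace("divs: " + allDivs.length());
trace("matching divs: " + copyDiv.length());
trace("paragraphs in copy: " + copyTags.length());
```

Each intermediate result is an XMLList, which is why the steps can be chained into a single expression.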
Now here's the tricky part. E4X was not specifically meant to parse XHTML, particularly node values with other tags inside them, such as paragraph tags. Inside <p> tags, XHTML often has other tags like <font>, <strong> and <em>. E4X sees these tags as child nodes, not part of the node value of the <p> tag. In order to get around this, you have to iterate through the children of the <p> tag and concatenate them manually.
For instance, if your <p> tag looked like this:
<p>This is <strong>bold</strong> text.</p>
The children(), represented as an Array of their XML strings, would be:
["This is ", "<strong>bold</strong>", " text."];
Here's the code to parse and concatenate the node value of the entire <p> tag.
// get the first p node
var copyTag:XML = copyTags[0];
var str:String = "";
var len:int = copyTag.children().length();
for (var i:int = 0; i < len; i++)
{
    // concatenate each child
    str += copyTag.children()[i].toXMLString();
}
Unfortunately, E4X is going to add a namespace declaration (xmlns) to any tags inside the <p> tag, including the <strong> tag, resulting in your output looking like this:
"This is <strong xmlns="http://www.w3.org/1999/xhtml">bold</strong> text."
This shouldn't have any bad effect if you assign it as htmlText to a TextField in Flash, but if you want to clean up the namespace, removeNamespace() won't help: it does not remove a namespace that the element's own name still uses. Instead, you need a regular expression (graciously provided by Mike Keesey).
str = str.replace(/\s+xmlns(:[^=]+)?="[^"]*"/g, "");
This strips out the xmlns="http://www.w3.org/1999/xhtml" from any and all tags inside the <p> tag.
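The loop and the cleanup step can be folded into one small helper. The function name innerXhtml is hypothetical (it is not how Gaia names it), and the sketch assumes XML.prettyPrinting and XML.ignoreWhitespace were both set to false before the page was parsed.

```actionscript
// Hypothetical helper: returns the markup inside a node as a single string.
function innerXhtml(node:XML):String
{
    var parts:Array = [];

    // Serialize every child, whether it is a text node or an element.
    for each (var child:XML in node.children())
    {
        parts.push(child.toXMLString());
    }

    // Strip the xmlns declarations that serialization adds to each element.
    return parts.join("").replace(/\s+xmlns(:[^=]+)?="[^"]*"/g, "");
}
```

With the copyTags list from earlier, usage would look like trace(innerXhtml(copyTags[0])); and the result can be assigned to a TextField's htmlText.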
Remember how we set prettyPrinting = false up above? The reason for this is that E4X automatically puts carriage returns between the tags inside the <p> tag, so you need to turn prettyPrinting off to get rid of them. If you didn't, the node value output would look like this:
This is
bold
text.
And, we also set ignoreWhitespace=false. If you don't have ignoreWhitespace set to false, then any spaces around tags inside the p tags will be removed and your node value output would look like this:
This isboldtext.
However, if you do everything correctly, you end up with this as your node value:
This is bold text.
Of course, you could get around all of this by using the XMLNode class provided by Adobe for more AS1-style XML parsing. But that just wouldn't be as fun, would it?
Posted in Actionscript, E4X, Flash, Tips/Tricks
October 15th, 2008 at 2:50 pm
Thanx for sharing!
October 19th, 2008 at 10:24 am
HAHA! So nice! Thanks for this.
November 21st, 2008 at 10:17 am
For some reason the prettyPrinting is still creating the:
This is
bold
text
scenario… But I use "custom" nodes rather than for bold. Could this be the case? So instead of bold I use bold
December 8th, 2008 at 8:24 am
Thanks for this, exactly what i was looking for!
December 30th, 2008 at 4:45 pm
[...] – bookmarked by 2 members originally found by Dulcinea on 2008-11-30 Parsing XHTML with E4X in AS3 http://www.stevensacks.net/2008/07/02/parsing-xhtml-with-e4x-in-as3/ – bookmarked by 1 members [...]
October 15th, 2009 at 12:48 am
Thanks for this, was looking for something like this. I tried to implement it, but still can't get it right. Is there any chance you can leave a simple example for download so I can test it and learn where I am getting it wrong?
January 5th, 2010 at 11:37 pm
With ignoreWhitespace set to "true", is there any way to skip iterating through null XML nodes, or perhaps this is simply unavoidable? Iteration time is doubled and I wish this could be avoided.
Also, in your Gaia Flash Framework, you forgot to strip the namespace off the root node as well for innerHTML.
February 2nd, 2010 at 3:02 am
hello,
i was really happy finding this site but the solution presented here seems to not work on my case.
see, I'd like to parse the html on this link: http://www.1club.fm/NowPlayMy1clubfm/thebeat4.html
but in every case the script gives back a parser error.
here is my code:
default xml namespace = new Namespace("http://www.w3.org/1999/xhtml");
XML.ignoreWhitespace = false;
XML.prettyPrinting = false;
XML.ignoreComments = true;
XML.ignoreProcessingInstructions = true;
var requester:URLRequest = new URLRequest("http://www.1club.fm/NowPlayMy1clubfm/thebeat4.html");
var loader:URLLoader = new URLLoader(requester);
loader.addEventListener(Event.COMPLETE, loaded);
function loaded(e:Event):void{
var xmlobj:XML = new XML(e.target.data);
trace(xmlobj.toString());
}
can you please help me?
thank you!