HTML Email Parser continued
Recently I posted about my quest to find a suitable email parser that could tidy up HTML emails and strip out specific tags and attributes. I'm very happy to say that the mission has been accomplished, and a mission it was. The original post is here.
There's a heap of Java HTML parsers out there. I skimmed about five, but spent most of my time looking at HTMLParser and JTidy. HTMLParser has great parsing and filtering features and is being actively developed, which is a bonus. The only downside for what I wanted was that it wouldn't do any kind of automatic tidy of the original HTML document. Its parsing API is probably more complete and robust than JTidy's, but JTidy has the one magic feature of being able to transform the input HTML into valid XHTML. If the HTML is too far out of whack it will throw an error, but it works pretty much 99% of the time.
Anyhow, enough rambling. I've created a CFC that takes a UTF-8 string, filters it, and passes back the filtered string. I'm stripping out a lot of stuff that you may want to keep in the output HTML, but it's a simple matter of modifying the tag and attribute lists to keep or remove what you want.
Note: This CFC relies on you having Tidy.jar in your CF server's classpath, but it is possible to load the JTidy library on the fly using something like this.
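To make the tag and attribute list idea concrete, here is a rough sketch of that kind of DOM filtering in plain Java. It is not the CFC's own code: the class name `DomFilterSketch`, the shortened lists, and the use of the JDK's XML parser in place of JTidy are all assumptions for illustration, and it expects a small, already well-formed XHTML fragment without a DOCTYPE (roughly what JTidy hands back).

```java
import java.io.ByteArrayInputStream;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

// Hypothetical helper, not part of JTidy or the CFC.
class DomFilterSketch {
    // Shortened example lists; edit these to keep or remove what you want.
    static final Set<String> TAGS = new HashSet<>(Arrays.asList("script", "style", "iframe", "img"));
    static final Set<String> ATTRS = new HashSet<>(Arrays.asList("style", "class", "id", "onclick"));

    // Recursively drop unwanted elements and attributes below the given node.
    static void clean(Node node) {
        // Walk backwards so removing a child does not shift the ones still to visit.
        for (int i = node.getChildNodes().getLength() - 1; i >= 0; i--) {
            Node child = node.getChildNodes().item(i);
            if (child.getNodeType() != Node.ELEMENT_NODE) {
                continue;
            }
            if (TAGS.contains(child.getNodeName().toLowerCase())) {
                node.removeChild(child); // removes the element and everything inside it
                continue;
            }
            NamedNodeMap attrs = child.getAttributes();
            for (int j = attrs.getLength() - 1; j >= 0; j--) {
                String name = attrs.item(j).getNodeName();
                if (ATTRS.contains(name.toLowerCase())) {
                    ((Element) child).removeAttribute(name);
                }
            }
            clean(child);
        }
    }

    // Parse a well-formed XHTML fragment, filter it, and serialize it back to a string.
    static String filter(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        clean(doc);
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }
}
```

Calling `DomFilterSketch.filter(...)` on a fragment containing a `script` element and a few `class` or `onclick` attributes should return the same markup with those pieces gone, while anything not on the lists (a `title` attribute, say) is left alone.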
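For what it's worth, one common way to load a jar at runtime on the Java side is a `URLClassLoader`. The sketch below is only an assumption about how that might look, not the technique from the link above; the class name `JarLoaderSketch` and the jar path in the usage note are made up for illustration.

```java
import java.io.File;
import java.net.URL;
import java.net.URLClassLoader;

// Hypothetical helper, not part of JTidy or ColdFusion.
class JarLoaderSketch {
    // Load a class by name, searching the given jar after the normal classpath.
    // The loader is left open on purpose so the class can keep loading its dependencies.
    static Class<?> loadFrom(File jar, String className) throws Exception {
        URL[] urls = { jar.toURI().toURL() };
        URLClassLoader loader = new URLClassLoader(urls, JarLoaderSketch.class.getClassLoader());
        // The parent loader is asked first, so classes already on the classpath win.
        return Class.forName(className, true, loader);
    }
}
```

With a real jar, something like `JarLoaderSketch.loadFrom(new File("/path/to/Tidy.jar"), "org.w3c.tidy.Tidy")` followed by `getDeclaredConstructor().newInstance()` would give you a Tidy instance without touching the server classpath. The path here is a placeholder.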
My cfc source is listed below for your reference.
<!---
Name : JTidyFilter.cfc
Author : JasonSheedy
Created : 12 May 2006
Responsibilities : I filter unwanted tags and attributes out of a html string.
Last Updated :
History :
--->
<cfcomponent name="JTidyFilter" output="false" hint="I use jTidy to filter certain tags and attributes out of a html string.">
    <!--- setup the instance attributes --->
    <cfscript>
        variables.my = StructNew();
        // jtidy attribute and tag lists
        my.attArray = ListToArray("style,bgcolor,background,width,height,class,id,onclick,ondblclick,onkeydown,onkeypress,onkeyup,onload,onmousedown,onmousemove,onmouseout,onmouseover,onmouseup,onunload");
        my.tagArray = ListToArray("link,meta,script,noscript,style,frame,frameset,iframe,basehref,base,form,input,textarea,select,option,applet,img,object,embed,marquee,map,area");
    </cfscript>
    <cffunction name="Init" access="Public" returnType="JTidyFilter" output="false" hint="I am the constructor.">
        <cfreturn this />
    </cffunction>
    <cffunction name="ContentFilter" access="public" output="true" returntype="String">
        <cfargument name="ContentString" type="string" required="true" />
        <cfscript>
            var returnString = "";
            // encode the input as UTF-8 bytes and wrap them in streams for JTidy
            var byteArray = CreateObject("java","java.lang.String").init(arguments.ContentString).getBytes("UTF8");
            var bais = createobject("java","java.io.ByteArrayInputStream").init(byteArray);
            var baos = createobject("java","java.io.ByteArrayOutputStream").init();
            try {
                doTidy(bais,baos);
                // decode the tidied bytes back into a string
                returnString = baos.toString("UTF8");
                baos.close();
                bais.close();
                return returnString;
            } catch(any ex) {
                dump(ex);
            }
        </cfscript>
    </cffunction>
    <cffunction name="doTidy" access="private" output="true" returntype="void">
        <cfargument name="is" type="any" required="true" />
        <cfargument name="os" type="any" required="true" />
        <cfscript>
            var doc = "";
            var nl = "";
            var Configuration = createobject("java","org.w3c.tidy.Configuration");
            var jtidy = createobject("java","org.w3c.tidy.Tidy");
            // configure jtidy to emit cleaned-up UTF-8 XHTML
            jtidy.setMakeClean(true);
            jtidy.setCharEncoding(Configuration.utf8);
            jtidy.setDropFontTags(true);
            jtidy.setXHTML(true);
            jtidy.setRawOut(true);
            jtidy.setSmartIndent(true);
            jtidy.setWord2000(true);
            jtidy.setDropEmptyParas(true);
            jtidy.setShowWarnings(false);
            jtidy.setFixComments(true);
            try {
                // parse to a DOM only; the null output stream defers printing until after filtering
                doc = jtidy.parseDOM(arguments.is,javacast("null",""));
                // remove the doctype
                removeDocType(doc);
                // remove the specified tags
                removeTags(doc);
                // remove the specified attributes
                nl = doc.getChildNodes();
                removeAttributes(nl);
                // write the filtered document to the output stream
                jtidy.pprint(doc,os);
            } catch(any ex) {
                //dump(ex);
            }
        </cfscript>
    </cffunction>
    <cffunction name="removeDocType" access="private" output="true" returntype="void">
        <cfargument name="doc" type="any" required="true" />
        <cfscript>
            var dt = "";
tr

Comments
Do you have some source examples of using this cfc?
Not sure what you mean, Dan. You pass a string in, you get a string out, you do the hokey pokey and you shake it all about.. :) Sorry, couldn't resist.