Package org.w3c.tidy
Class Clean
java.lang.Object
org.w3c.tidy.Clean
Clean up misuse of presentation markup. Filters from other formats such as Microsoft Word often make excessive use of presentation markup such as font tags, B, I, and the align attribute. By applying a set of production rules, it is straightforward to transform this markup to use CSS.
Some rules replace some of the children of an element by style properties on the element, e.g. ... Such rules are applied to the element's content and then to the element itself until no more rules apply. Having applied all the rules to an element, it will have a style attribute with one or more properties.
Other rules strip the element they apply to, replacing it by style properties on the contents, e.g. ... These rules are applied to an element before processing its content and replace the current element by the first element in the exposed content.
After applying both sets of rules, you can replace the style attribute by a class value and a style rule in the document head. To support this, an association of styles and class names is built. A naive approach is to rely on string matching to test when two property lists are the same. A better approach would be to sort the properties before matching.
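The remark about sorting properties before matching can be illustrated with a minimal sketch. The class `StyleKeySketch` and its `normalize` method are hypothetical names introduced here and are not part of org.w3c.tidy; the sketch assumes a style value is a semicolon-separated list of `name: value` declarations and that generated class names are simply numbered.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of org.w3c.tidy.
class StyleKeySketch {
    // Splits "a: b; c: d" into trimmed declarations, sorts them, and rejoins
    // them so that equivalent property lists produce the same key.
    static String normalize(String style) {
        List<String> props = new ArrayList<>();
        for (String decl : style.split(";")) {
            String trimmed = decl.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            int colon = trimmed.indexOf(':');
            if (colon < 0) {
                props.add(trimmed); // keep malformed declarations untouched
                continue;
            }
            String name = trimmed.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = trimmed.substring(colon + 1).trim();
            props.add(name + ": " + value);
        }
        Collections.sort(props);
        return String.join("; ", props);
    }

    public static void main(String[] args) {
        // Association of normalized styles with generated class names.
        Map<String, String> classes = new HashMap<>();
        String[] styles = {"font-weight: bold; color: red", "color: red;font-weight:bold"};
        for (String s : styles) {
            String key = normalize(s);
            classes.putIfAbsent(key, "c" + (classes.size() + 1));
            System.out.println(s + " -> " + classes.get(key));
        }
    }
}
```

With plain string matching the two style values in `main` would be given separate class names; after normalization they share one.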
- Version:
- $Revision: 1125 $ ($Author: aditsu $)
- Author:
- Dave Raggett dsr@w3.org, Andy Quick ac.quick@sympatico.ca (translation to Java), Fabrizio Giustina
-
Constructor Summary
Constructors
Constructor | Description
Clean(TagTable tagTable) | Instantiates a new Clean.
Method Summary
Modifier and Type | Method | Description
void | bQ2Div(Node node) | Replace implicit blockquote by div with an indent, taking care to reduce nested blockquotes to a single div with the indent set to match the nesting depth.
void | cleanTree(Lexer lexer, Node doc) | Clean an HTML tree.
void | cleanWord2000(Lexer lexer, Node node) | This is a major clean up to strip out all the extra stuff you get when you save as web page from Word 2000.
void | dropSections(Lexer lexer, Node node) | Drop if/endif sections inserted by Word 2000.
void | emFromI(Node node) | Replace i by em and b by strong.
boolean | isWord2000(Node root) | Check if the current document is a converted Word document.
void | list2BQ(Node node) | Some people use dir or ul without an li to indent the content.
void | nestedEmphasis(Node node) | Simplifies ...
Node | pruneSection(Lexer lexer, Node node) | node is <![if ...]>; prune up to <![endif]>.
void | purgeWord2000Attributes(Node node) | Remove Word 2000 attributes from node.
Node | stripSpan(Lexer lexer, Node span) | Word 2000 uses span excessively, so we strip span out.
-
Constructor Details
-
Clean
Instantiates a new Clean.
- Parameters:
- tagTable - tag table instance
-
-
Method Details
-
cleanTree
Clean an HTML tree.
- Parameters:
- lexer - Lexer
- doc - root node
-
nestedEmphasis
Simplifies ... ... etc.
- Parameters:
- node - root Node
-
emFromI
Replace i by em and b by strong.
- Parameters:
- node - root Node
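The replacement described for emFromI can be sketched as a depth-first walk that renames elements. The `El` class below is a hypothetical stand-in for a parse-tree node and does not reflect the real org.w3c.tidy.Node API; the sketch only illustrates the renaming rule.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a parse tree; JTidy's own Node class differs.
class EmFromISketch {
    static class El {
        String name;
        final List<El> children;

        El(String name, El... kids) {
            this.name = name;
            this.children = new ArrayList<>(Arrays.asList(kids));
        }

        @Override
        public String toString() {
            StringBuilder sb = new StringBuilder("<" + name + ">");
            for (El c : children) {
                sb.append(c);
            }
            return sb.append("</").append(name).append(">").toString();
        }
    }

    // Walks the tree depth-first, renaming i to em and b to strong.
    static void emFromI(El node) {
        if (node.name.equals("i")) {
            node.name = "em";
        } else if (node.name.equals("b")) {
            node.name = "strong";
        }
        for (El child : node.children) {
            emFromI(child);
        }
    }

    public static void main(String[] args) {
        El root = new El("p", new El("b", new El("i")), new El("span"));
        emFromI(root);
        System.out.println(root); // <p><strong><em></em></strong><span></span></p>
    }
}
```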
-
list2BQ
Some people use dir or ul without an li to indent the content. The pattern to look for is a list with a single implicit li. This is recursively replaced by an implicit blockquote.
- Parameters:
- node - root Node
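The pattern above (a list whose only child is an implicit li) can be sketched as follows. The `El` class and its `implicit` flag are hypothetical stand-ins for the parse tree, and the sketch assumes only dir and ul count as lists, as in the description; the real method may cover other list elements.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a parse tree; JTidy's own Node class differs.
class List2BQSketch {
    static class El {
        String name;
        boolean implicit; // true when the parser inferred the element
        List<El> children;

        El(String name, boolean implicit, El... kids) {
            this.name = name;
            this.implicit = implicit;
            this.children = new ArrayList<>(Arrays.asList(kids));
        }

        @Override
        public String toString() {
            StringBuilder sb = new StringBuilder("<" + name + ">");
            for (El c : children) {
                sb.append(c);
            }
            return sb.append("</").append(name).append(">").toString();
        }
    }

    static void list2BQ(El node) {
        // Handle nested lists first so inner matches are rewritten before outer ones.
        for (El child : node.children) {
            list2BQ(child);
        }
        boolean isList = node.name.equals("ul") || node.name.equals("dir");
        if (isList && node.children.size() == 1) {
            El only = node.children.get(0);
            if (only.name.equals("li") && only.implicit) {
                node.name = "blockquote";
                node.implicit = true;
                node.children = only.children; // drop the implicit li wrapper
            }
        }
    }

    public static void main(String[] args) {
        El inner = new El("ul", false, new El("li", true, new El("p", false)));
        El root = new El("dir", false, new El("li", true, inner));
        list2BQ(root);
        System.out.println(root); // <blockquote><blockquote><p></p></blockquote></blockquote>
    }
}
```

A list with an explicit li, or with more than one item, is left unchanged.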
-
bQ2Div
Replace implicit blockquote by div with an indent, taking care to reduce nested blockquotes to a single div with the indent set to match the nesting depth.
- Parameters:
- node - root Node
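The collapsing of nested blockquotes can be sketched as follows. Several points are assumptions for illustration only: the `El` class is a hypothetical stand-in for the parse tree, every blockquote is treated as implicit, and the indent is taken to be 2em per nesting level expressed as a margin-left property.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a parse tree; JTidy's own Node class differs.
class BQ2DivSketch {
    static class El {
        String name;
        String style; // null when no style attribute is set
        List<El> children;

        El(String name, El... kids) {
            this.name = name;
            this.children = new ArrayList<>(Arrays.asList(kids));
        }

        @Override
        public String toString() {
            String attr = style == null ? "" : " style=\"" + style + "\"";
            StringBuilder sb = new StringBuilder("<" + name + attr + ">");
            for (El c : children) {
                sb.append(c);
            }
            return sb.append("</").append(name).append(">").toString();
        }
    }

    // Collapses a chain of blockquotes, each wrapping only another blockquote,
    // into one div whose indent reflects the nesting depth.
    static void bq2Div(El node) {
        if (node.name.equals("blockquote")) {
            int depth = 1;
            while (node.children.size() == 1
                    && node.children.get(0).name.equals("blockquote")) {
                node.children = node.children.get(0).children;
                depth++;
            }
            node.name = "div";
            node.style = "margin-left: " + (2 * depth) + "em"; // assumed 2em per level
        }
        for (El child : node.children) {
            bq2Div(child);
        }
    }

    public static void main(String[] args) {
        El root = new El("body",
                new El("blockquote", new El("blockquote", new El("blockquote", new El("p")))));
        bq2Div(root);
        System.out.println(root); // <body><div style="margin-left: 6em"><p></p></div></body>
    }
}
```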
-
pruneSection
node is <![if ...]>; prune up to <![endif]>.
- Parameters:
- lexer - Lexer
- node - Node
- Returns:
- cleaned up Node
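Pruning from an opening marker up to its closing marker can be sketched over a flat token list. The token representation, the class name `PruneSectionSketch`, the placeholder conditions `a` and `b`, and the handling of nested sections by depth counting are all assumptions made for illustration; the real method operates on parse-tree nodes.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration on a flat token list, not the JTidy implementation.
class PruneSectionSketch {
    // Given the index of an "<![if ...]>" marker, returns the index of the first
    // token after its matching "<![endif]>", or tokens.size() if it is unmatched.
    static int pruneSection(List<String> tokens, int start) {
        int depth = 0;
        for (int i = start; i < tokens.size(); i++) {
            String t = tokens.get(i);
            if (t.startsWith("<![if")) {
                depth++; // entering a (possibly nested) section
            } else if (t.startsWith("<![endif")) {
                depth--;
                if (depth == 0) {
                    return i + 1;
                }
            }
        }
        return tokens.size();
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "<p>", "<![if a]>", "<span>", "<![if b]>", "<![endif]>",
                "<![endif]>", "text", "</p>");
        int next = pruneSection(tokens, 1);
        System.out.println(tokens.get(next)); // text
    }
}
```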
-
dropSections
Drop if/endif sections inserted by Word 2000.
- Parameters:
- lexer - Lexer
- node - Node root node
-
purgeWord2000Attributes
Remove Word 2000 attributes from node.
- Parameters:
- node - node to clean up
-
stripSpan
Word 2000 uses span excessively, so we strip span out.
- Parameters:
- lexer - Lexer
- span - Node span
- Returns:
- cleaned node
-
cleanWord2000
A major clean-up that strips out the extra markup produced when a document is saved as a web page from Word 2000. It doesn't yet know what to do with VML tags; these will appear as errors unless you declare them as new tags, such as o:p, which needs to be declared as inline.
- Parameters:
- lexer - Lexer
- node - node to clean up
-
isWord2000
Check if the current document is a converted Word document.
- Parameters:
- root - root Node
- Returns:
- true if the document has been generated by Microsoft Word.
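The description does not say how a Word document is recognized. One plausible heuristic, given here purely as an assumption, is to look for the Office namespace declaration on the html element or a generator meta tag that mentions Microsoft. The class `IsWord2000Sketch`, the method `looksLikeWord`, and the simplified inputs are hypothetical.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;

// Hypothetical heuristic; the checks used by the real isWord2000 may differ.
class IsWord2000Sketch {
    // htmlAttributes: attribute names found on the html element.
    // metaTags: meta name -> content pairs found in the head.
    static boolean looksLikeWord(Set<String> htmlAttributes, Map<String, String> metaTags) {
        if (htmlAttributes.contains("xmlns:o")) {
            return true; // assumed marker: Office namespace declaration
        }
        String generator = metaTags.get("generator");
        return generator != null && generator.contains("Microsoft");
    }

    public static void main(String[] args) {
        Map<String, String> meta = Collections.singletonMap("generator", "Microsoft Word 9");
        System.out.println(looksLikeWord(Collections.<String>emptySet(), meta)); // true
    }
}
```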
-