Package org.w3c.tidy
Class Lexer
java.lang.Object
org.w3c.tidy.Lexer
Lexer for html parser.
Given a file stream fp, it returns a sequence of tokens.
- GetToken(fp) gets the next token.
- UngetToken(fp) provides one level of undo.
The tags include an attribute list:
- a linked list of attribute/value nodes
- each node has 2 null-terminated strings
- entities are replaced in attribute values
White space is compacted if not in preformatted mode. If not in preformatted mode, then leading white space is discarded and subsequent white space sequences are compacted to single space chars.
If XmlTags is no, then tag names are folded to upper case and attribute names to lower case.
Not yet done:
- Doctype subset and marked sections
- Version:
- $Revision: 1100 $ ($Author: aditsu $)
- Author:
- Dave Raggett dsr@w3.org , Andy Quick ac.quick@sympatico.ca (translation to Java), Fabrizio Giustina
-
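The white space rule in the class description (leading white space dropped, later runs compacted to one space) can be sketched as below. This is only an illustration under stated assumptions: WhitespaceCompactor and compact are hypothetical names that do not exist in JTidy, and the wasWhite flag merely mirrors the role the waswhite field appears to play.

```java
/** Hypothetical helper for illustration; not part of JTidy. */
final class WhitespaceCompactor {

    /** Drops leading white space and compacts each later run to a single space. */
    static String compact(String text) {
        StringBuilder out = new StringBuilder();
        // Start as if white space was just seen, so leading white space is discarded.
        boolean wasWhite = true;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean white = c == ' ' || c == '\t' || c == '\n' || c == '\r' || c == '\f';
            if (white) {
                if (!wasWhite) {
                    out.append(' '); // first white space char of a run becomes one space
                    wasWhite = true;
                }
            } else {
                out.append(c);
                wasWhite = false;
            }
        }
        return out.toString();
    }
}
```

Note that a trailing run is compacted to one space rather than removed; in preformatted mode a lexer would skip this step entirely.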
Field Summary
Modifier and Type | Field | Description
protected short | badAccess | for accessibility errors.
protected short | badChars | for bad char encodings.
protected boolean | badDoctype | set if html or PUBLIC is missing.
protected short | badForm | for mismatched/mispositioned form tags.
protected short | badLayout | for bad style errors.
protected int | columns | at start of current token.
protected Configuration | configuration | configuration.
protected int | doctype | version as given by doctype (if any).
protected short | errors | count of errors.
protected PrintWriter | errout | error output stream.
protected boolean | excludeBlocks | Netscape compatibility.
protected boolean | exiled | true if moved out of table.
static final short | IGNORE_MARKUP | state: ignore markup.
static final short | IGNORE_WHITESPACE | state: ignore whitespace.
protected StreamIn | in | file stream.
protected Node | inode | Inline stack for compatibility with Mosaic.
protected int | insert | for inferring inline tags.
protected boolean | insertspace | when space is moved after end tag.
protected Stack | istack | stack.
protected int | istackbase | start of frame.
protected boolean | isvoyager | true if xmlns attribute on html element.
protected byte[] | lexbuf | Lexer character buffer. Parse tree nodes span onto this buffer, which contains the concatenated text contents of all of the elements.
protected int | lexlength | allocated.
protected int | lexsize | used.
protected int | lines | lines seen.
static final short | MIXED_CONTENT | state: mixed content.
static final short | PREFORMATTED | state: preformatted.
protected boolean | pushed | true after token has been pushed back.
protected Report | report | report.
protected Node | root | Root node is saved here.
protected boolean | seenEndBody | already seen end body tag?
protected boolean | seenEndHtml | already seen end html tag?
protected short | state | state of lexer's finite state machine.
protected Style | styles | used for cleaning up presentation markup.
protected Node | token | current node.
protected int | txtend | end of current node.
protected int | txtstart | start of current node.
protected short | versions | bit vector of HTML versions.
protected short | warnings | count of warnings in this document.
protected boolean | waswhite | used to collapse contiguous white space.
-
Constructor Summary
Constructor | Description
Lexer(StreamIn in, Configuration configuration, Report report) | Instantiates a new Lexer.
-
Method Summary
Modifier and Type | Method | Description
void | addByte(int c) | Adds a byte to lexer buffer.
void | addCharToLexer(int c) | Store char c as UTF-8 encoded byte stream.
boolean | addGenerator(Node root) | Add meta element for Tidy.
void | addStringLiteral(String str) | Calls addCharToLexer for each char in the string.
void | addStringToLexer(String str) | Adds a string to lexer buffer.
short | apparentVersion() | Return the html version used in document.
boolean | canPrune(Node element) | Can the given element be removed?
void | changeChar(byte c) | Substitute the last char in buffer.
boolean | checkDocTypeKeyWords(Node doctype) | Check system keywords (keywords should be uppercase).
AttVal | cloneAttributes(AttVal attrs) | Clones an attribute value and adds any associated ASP or PHP node to the node list.
Node | cloneNode(Node node) | Clones a node and adds it to the node list.
void | deferDup() | Defer duplicates when entering a table or other element where the inlines shouldn't be duplicated.
boolean | endOfInput() | Has end of input stream been reached?
short | findGivenVersion(Node doctype) | Examine DOCTYPE to identify version.
boolean | fixDocType(Node root) | Fixup doctype if missing.
void | fixHTMLNameSpace(Node root, String profile) | Fix xhtml namespace.
void | fixId(Node node) | Duplicate name attribute as an id and check if id and name match.
boolean | fixXmlDecl(Node root) | Ensure XML document starts with <?XML version="1.0"?>.
Node | getCDATA(Node container) | Create a text node for the contents of a CDATA element like style or script which ends with </foo> for some foo.
Node | getToken(short mode) | Gets a token.
short | htmlVersion() | Choose what version to use for new doctype.
String | htmlVersionName() | Choose what version to use for new doctype.
Node | inferredTag(String name) | Generates and inserts a new node.
int | inlineDup(Node node) | This has the effect of inserting "missing" inline elements around the contents of blocklevel elements such as P, TD, TH, DIV, PRE etc.
Node | insertedToken() |
static boolean | isCSS1Selector(String buf) | In CSS1, selectors can contain only the characters A-Z, 0-9, and Unicode characters 161-255, plus dash (-); they cannot start with a dash or a digit; they can also contain escaped characters and any Unicode character as a numeric code (see next item).
boolean | isPushed(Node node) | Is the node in the stack?
static boolean | isValidAttrName(String attr) | Check if attr is a valid name.
Node | newLineNode() | Adds a new line node.
Node | newNode() | Creates a new node and adds it to the node list.
Node | newNode(short type, byte[] textarray, int start, int end) | Creates a new node and adds it to the node list.
Node | newNode(short type, byte[] textarray, int start, int end, String element) | Creates a new node and adds it to the node list.
Node | parseAsp() | Parser for ASP within start tags. Some people use ASP to customize attributes. Tidy isn't really well suited to dealing with ASP. This is a workaround for attributes, but it won't deal with the case where the ASP is used to tailor the attribute value.
String | parseAttribute(boolean[] isempty, Node[] asp, Node[] php) | Consumes the '>' terminating start tags.
AttVal | parseAttrs(boolean[] isempty) | Parse tag attributes.
void | parseEntity(short mode) | Parse an html entity.
Node | parsePhp() | PHP is like ASP but is based upon XML processing instructions, e.g. <?php ... ?>.
int | parseServerInstruction() | Invoked when < is seen in place of attribute value, but terminates on whitespace if not ASP, PHP or Tango. This routine recognizes ' and " quoted strings.
char | parseTagName() | Parses a tag name.
String | parseValue(String name, boolean foldCase, boolean[] isempty, int[] pdelim) | Parse an attribute value.
void | popInline(Node node) | Pop a copy of an inline node from the stack.
protected boolean | preContent(Node node) | Is content acceptable for pre elements?
void | pushInline(Node node) | Push a copy of an inline node onto the stack, but don't push if implicit or OBJECT or APPLET (implicit tags are ones generated from the istack). One issue arises with pushing inlines when the tag is already pushed.
boolean | setXHTMLDocType(Node root) | Adds a new xhtml doctype to the document.
void | ungetToken() |
protected void | updateNodeTextArrays(byte[] oldtextarray, byte[] newtextarray) | Update oldtextarray in the current nodes.
-
Field Details
-
IGNORE_WHITESPACE
public static final short IGNORE_WHITESPACE
state: ignore whitespace.
-
MIXED_CONTENT
public static final short MIXED_CONTENT
state: mixed content.
-
PREFORMATTED
public static final short PREFORMATTED
state: preformatted.
-
IGNORE_MARKUP
public static final short IGNORE_MARKUP
state: ignore markup.
-
in
protected StreamIn in
file stream.
-
errout
protected PrintWriter errout
error output stream.
-
badAccess
protected short badAccess
for accessibility errors.
-
badLayout
protected short badLayout
for bad style errors.
-
badChars
protected short badChars
for bad char encodings.
-
badForm
protected short badForm
for mismatched/mispositioned form tags.
-
warnings
protected short warnings
count of warnings in this document.
-
errors
protected short errors
count of errors.
-
lines
protected int lines
lines seen.
-
columns
protected int columns
at start of current token.
-
waswhite
protected boolean waswhite
used to collapse contiguous white space.
-
pushed
protected boolean pushed
true after token has been pushed back.
-
insertspace
protected boolean insertspace
when space is moved after end tag.
-
excludeBlocks
protected boolean excludeBlocks
Netscape compatibility.
-
exiled
protected boolean exiled
true if moved out of table.
-
isvoyager
protected boolean isvoyager
true if xmlns attribute on html element.
-
versions
protected short versions
bit vector of HTML versions.
-
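A bit vector like versions is typically narrowed with a bitwise AND each time a construct is seen that only some HTML versions allow. The sketch below illustrates that idea only; the class, the constant names and their values are hypothetical, and whether the Lexer narrows the field in exactly this way is an assumption.

```java
/** Hypothetical illustration of a version bit vector; constants are NOT JTidy's. */
final class VersionBits {
    static final short HTML32 = 1;
    static final short HTML40_STRICT = 2;
    static final short HTML40_LOOSE = 4;
    static final short ALL = (short) (HTML32 | HTML40_STRICT | HTML40_LOOSE);

    /** Keeps only the candidate versions that also allow the construct just seen. */
    static short constrain(short candidates, short allowedByConstruct) {
        return (short) (candidates & allowedByConstruct);
    }

    /** True if the given version is still a candidate. */
    static boolean allows(short candidates, short version) {
        return (candidates & version) != 0;
    }
}
```

Starting from ALL and constraining with the mask of each element or attribute encountered leaves the set of versions the document could still conform to.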
doctype
protected int doctype
version as given by doctype (if any).
-
badDoctype
protected boolean badDoctype
set if html or PUBLIC is missing.
-
txtstart
protected int txtstart
start of current node.
-
txtend
protected int txtend
end of current node.
-
state
protected short state
state of lexer's finite state machine.
-
token
protected Node token
current node.
-
lexbuf
protected byte[] lexbuf
Lexer character buffer. Parse tree nodes span onto this buffer, which contains the concatenated text contents of all of the elements. lexsize must be reset for each file. Byte buffer of UTF-8 chars.
-
lexlength
protected int lexlength
allocated.
-
lexsize
protected int lexsize
used.
-
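The relationship between lexbuf, lexlength, lexsize and the txtstart/txtend offsets can be pictured with a small growable byte buffer whose text spans are identified by start and end positions. This is a minimal sketch, not JTidy's code: SpanBuffer and its methods are hypothetical names, and the doubling growth policy is an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical stand-in for the lexer buffer; not part of JTidy. */
final class SpanBuffer {
    private byte[] buf = new byte[8]; // plays the role of lexbuf (buf.length ~ lexlength)
    private int size = 0;             // plays the role of lexsize

    /** Appends the low 8 bits of c, growing the array when it is full. */
    void addByte(int c) {
        if (size == buf.length) {
            buf = Arrays.copyOf(buf, buf.length * 2);
        }
        buf[size++] = (byte) c;
    }

    /** Number of bytes used so far. */
    int size() {
        return size;
    }

    /** Decodes the span [start, end) as UTF-8, the way a node's text could be read back. */
    String text(int start, int end) {
        return new String(buf, start, end - start, StandardCharsets.UTF_8);
    }
}
```

Because spans are stored as offsets, they stay valid when the array is reallocated. Code that keeps a reference to the array itself would need to be pointed at the new array after growth, which is presumably the job of updateNodeTextArrays.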
inode
protected Node inode
Inline stack for compatibility with Mosaic. For deferring text node.
-
insert
protected int insert
for inferring inline tags.
-
istack
protected Stack istack
stack.
-
istackbase
protected int istackbase
start of frame.
-
styles
protected Style styles
used for cleaning up presentation markup.
-
configuration
protected Configuration configuration
configuration.
-
seenEndBody
protected boolean seenEndBody
already seen end body tag?
-
seenEndHtml
protected boolean seenEndHtml
already seen end html tag?
-
report
protected Report report
report.
-
root
protected Node root
Root node is saved here.
-
-
Constructor Details
-
Lexer
public Lexer(StreamIn in, Configuration configuration, Report report)
Instantiates a new Lexer.
- Parameters:
in - StreamIn
configuration - configuration instance
report - report instance, for reporting errors
-
-
Method Details
-
newNode
Creates a new node and adds it to the node list.
- Returns: Node
-
newNode
Creates a new node and adds it to the node list.
- Parameters:
type - node type: Node.ROOT_NODE | Node.DOCTYPE_TAG | Node.COMMENT_TAG | Node.PROC_INS_TAG | Node.TEXT_NODE | Node.START_TAG | Node.END_TAG | Node.START_END_TAG | Node.CDATA_TAG | Node.SECTION_TAG | Node.ASP_TAG | Node.JSTE_TAG | Node.PHP_TAG | Node.XML_DECL
textarray - array of bytes contained in the Node
start - start position
end - end position
- Returns: Node
-
newNode
Creates a new node and adds it to the node list.
- Parameters:
type - node type: Node.ROOT_NODE | Node.DOCTYPE_TAG | Node.COMMENT_TAG | Node.PROC_INS_TAG | Node.TEXT_NODE | Node.START_TAG | Node.END_TAG | Node.START_END_TAG | Node.CDATA_TAG | Node.SECTION_TAG | Node.ASP_TAG | Node.JSTE_TAG | Node.PHP_TAG | Node.XML_DECL
textarray - array of bytes contained in the Node
start - start position
end - end position
element - tag name
- Returns: Node
-
cloneNode
Clones a node and adds it to the node list.
- Parameters:
node - Node
- Returns: cloned Node
-
cloneAttributes
Clones an attribute value and adds any associated ASP or PHP node to the node list.
- Parameters:
attrs - original AttVal
- Returns: cloned AttVal
-
updateNodeTextArrays
protected void updateNodeTextArrays(byte[] oldtextarray, byte[] newtextarray)
Update oldtextarray in the current nodes.
- Parameters:
oldtextarray - previous text array
newtextarray - new text array
-
newLineNode
Adds a new line node. Used for creating preformatted text from Word2000.
- Returns: new line node
-
endOfInput
public boolean endOfInput()
Has end of input stream been reached?
- Returns: true if end of input stream has been reached
-
addByte
public void addByte(int c)
Adds a byte to lexer buffer.
- Parameters:
c - byte to add
-
changeChar
public void changeChar(byte c)
Substitute the last char in buffer.
- Parameters:
c - new char
-
addCharToLexer
public void addCharToLexer(int c)
Store char c as UTF-8 encoded byte stream.
- Parameters:
c - char to store
-
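Storing a char "as UTF-8 encoded byte stream" means emitting one to four bytes depending on the code point. The following is a minimal sketch of that encoding step, assuming a valid Unicode code point as input; Utf8Sketch and addChar are hypothetical names, and a ByteArrayOutputStream stands in for the lexer buffer.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical illustration of UTF-8 encoding; not JTidy's implementation. */
final class Utf8Sketch {

    /** Appends code point c to out as 1 to 4 UTF-8 bytes (no validation of surrogates or range). */
    static void addChar(ByteArrayOutputStream out, int c) {
        if (c < 0x80) {                       // 1 byte: 0xxxxxxx
            out.write(c);
        } else if (c < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
            out.write(0xC0 | (c >> 6));
            out.write(0x80 | (c & 0x3F));
        } else if (c < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out.write(0xE0 | (c >> 12));
            out.write(0x80 | ((c >> 6) & 0x3F));
            out.write(0x80 | (c & 0x3F));
        } else {                              // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            out.write(0xF0 | (c >> 18));
            out.write(0x80 | ((c >> 12) & 0x3F));
            out.write(0x80 | ((c >> 6) & 0x3F));
            out.write(0x80 | (c & 0x3F));
        }
    }
}
```

In the real class each byte would presumably go through addByte, so the buffer growth logic lives in one place.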
addStringToLexer
public void addStringToLexer(String str)
Adds a string to lexer buffer.
- Parameters:
str - String to add
-
parseEntity
public void parseEntity(short mode)
Parse an html entity.
- Parameters:
mode - mode
-
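Entity parsing comes down to mapping text such as &amp;amp; or &amp;#x41; to a code point. The sketch below shows one way to do that for a complete entity string; it is an illustration only. EntitySketch, decode and the five-entry name table are hypothetical, and the real parseEntity works on the input stream and takes a mode argument that is not modelled here.

```java
import java.util.Map;

/** Hypothetical entity decoder for illustration; not part of JTidy. */
final class EntitySketch {
    // A tiny illustrative table; a real lexer would use the full HTML entity list.
    private static final Map<String, Integer> NAMED = Map.of(
        "amp", (int) '&', "lt", (int) '<', "gt", (int) '>', "quot", (int) '"', "nbsp", 160);

    /** Returns the code point for text such as "&amp;" or "&#x41;", or -1 if unrecognized. */
    static int decode(String entity) {
        if (entity.length() < 2 || entity.charAt(0) != '&' || !entity.endsWith(";")) {
            return -1;
        }
        String body = entity.substring(1, entity.length() - 1);
        if (body.startsWith("#x") || body.startsWith("#X")) {
            return parseNumber(body.substring(2), 16);
        }
        if (body.startsWith("#")) {
            return parseNumber(body.substring(1), 10);
        }
        return NAMED.getOrDefault(body, -1);
    }

    /** Parses digits in the given radix; returns -1 for empty input, bad digits, or values above U+10FFFF. */
    private static int parseNumber(String digits, int radix) {
        if (digits.isEmpty()) {
            return -1;
        }
        int value = 0;
        for (int i = 0; i < digits.length(); i++) {
            int d = Character.digit(digits.charAt(i), radix);
            if (d < 0 || value > 0x10FFFF) {
                return -1;
            }
            value = value * radix + d;
        }
        return value <= 0x10FFFF ? value : -1;
    }
}
```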
parseTagName
public char parseTagName()
Parses a tag name.
- Returns: first char after the tag name
-
addStringLiteral
public void addStringLiteral(String str)
Calls addCharToLexer for each char in the string.
- Parameters:
str - input String
-
htmlVersion
public short htmlVersion()
Choose what version to use for new doctype.
- Returns: html version constant
-
htmlVersionName
Choose what version to use for new doctype.
- Returns: html version name
-
addGenerator
public boolean addGenerator(Node root)
Add meta element for Tidy. If the meta tag is already present, update release date.
- Parameters:
root - root node
- Returns: true if the tag has been added
-
checkDocTypeKeyWords
public boolean checkDocTypeKeyWords(Node doctype)
Check system keywords (keywords should be uppercase).
- Parameters:
doctype - doctype node
- Returns: true if doctype keywords are all uppercase
-
findGivenVersion
public short findGivenVersion(Node doctype)
Examine DOCTYPE to identify version.
- Parameters:
doctype - doctype node
- Returns: version code
-
fixHTMLNameSpace
public void fixHTMLNameSpace(Node root, String profile)
Fix xhtml namespace.
- Parameters:
root - root Node
profile - current profile
-
setXHTMLDocType
public boolean setXHTMLDocType(Node root)
Adds a new xhtml doctype to the document.
- Parameters:
root - root node
- Returns: true if a doctype has been added
-
apparentVersion
public short apparentVersion()
Return the html version used in document.
- Returns: version code
-
fixDocType
public boolean fixDocType(Node root)
Fixup doctype if missing.
- Parameters:
root - root node
- Returns: false if current version has not been identified
-
fixXmlDecl
public boolean fixXmlDecl(Node root)
Ensure XML document starts with <?XML version="1.0"?>. Add encoding attribute if not using ASCII or UTF-8 output.
- Parameters:
root - root node
- Returns: always true
-
inferredTag
Generates and inserts a new node.
- Parameters:
name - tag name
- Returns: generated node
-
getCDATA
Create a text node for the contents of a CDATA element like style or script which ends with </foo> for some foo.
- Parameters:
container - container node
- Returns: cdata node
-
ungetToken
public void ungetToken()
-
getToken
Gets a token.
- Parameters:
mode - one of the following:
  MixedContent -- for elements which don't accept PCDATA
  Preformatted -- white space preserved as is
  IgnoreMarkup -- for CDATA elements such as script, style
- Returns: next Node
-
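The pairing of getToken and ungetToken, together with the pushed and token fields, amounts to a reader with one level of undo. The sketch below shows that pattern over a plain list of strings. It is an illustration under assumptions: PushbackTokens is a hypothetical class, tokens are strings rather than Node objects, and the mode argument is left out.

```java
import java.util.Iterator;
import java.util.List;

/** Hypothetical one-level pushback reader; not part of JTidy. */
final class PushbackTokens {
    private final Iterator<String> source;
    private String current;  // most recently returned token (cf. the token field)
    private boolean pushed;  // true after the token has been pushed back (cf. the pushed field)

    PushbackTokens(List<String> tokens) {
        this.source = tokens.iterator();
    }

    /** Returns the pushed-back token if there is one, else the next token, or null at end of input. */
    String getToken() {
        if (pushed) {
            pushed = false;
            return current;
        }
        current = source.hasNext() ? source.next() : null;
        return current;
    }

    /** One level of undo: the next getToken() call returns the same token again. */
    void ungetToken() {
        pushed = true;
    }
}
```

Calling ungetToken twice in a row has the same effect as calling it once, which is what "one level" of undo implies: only the most recent token can be re-read.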
parseAsp
Parser for ASP within start tags. Some people use ASP to customize attributes. Tidy isn't really well suited to dealing with ASP. This is a workaround for attributes, but it won't deal with the case where the ASP is used to tailor the attribute value. Here is an example of a workaround for using ASP in attribute values:
href='<%=rsSchool.Fields("ID").Value%>'
where the ASP that generates the attribute value is masked from Tidy by the quote marks.
- Returns: parsed Node
-
parsePhp
PHP is like ASP but is based upon XML processing instructions, e.g. <?php ... ?>.
- Returns: parsed Node
-
parseAttribute
Consumes the '>' terminating start tags.
- Parameters:
isempty - flag is passed as array so it can be modified
asp - asp Node, passed as array so it can be modified
php - php Node, passed as array so it can be modified
- Returns: parsed attribute
-
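Java passes primitives by value, so a flag that the callee must set is passed here as a one-element array. The sketch below shows that out-parameter idiom in isolation. OutParamSketch and readName are hypothetical names, and the parsing they do is deliberately simplistic; it is not how parseAttribute works.

```java
/** Hypothetical illustration of the one-element-array out-parameter idiom. */
final class OutParamSketch {

    /**
     * Returns the leading attribute name in tagRest and reports through isEmpty[0]
     * whether the remaining tag text ends with "/>".
     */
    static String readName(String tagRest, boolean[] isEmpty) {
        String trimmed = tagRest.trim();
        isEmpty[0] = trimmed.endsWith("/>"); // the caller sees this write after the call returns
        int end = 0;
        while (end < trimmed.length()
                && (Character.isLetterOrDigit(trimmed.charAt(end)) || trimmed.charAt(end) == '-')) {
            end++;
        }
        return trimmed.substring(0, end);
    }
}
```

A caller allocates the array once, for example `boolean[] isEmpty = {false};`, passes it in, and reads `isEmpty[0]` afterwards.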
parseServerInstruction
public int parseServerInstruction()
Invoked when < is seen in place of attribute value, but terminates on whitespace if not ASP, PHP or Tango. This routine recognizes ' and " quoted strings.
- Returns: delimiter
-
parseValue
Parse an attribute value.
- Parameters:
name - attribute name
foldCase - fold case?
isempty - is attribute empty? Passed as an array reference to allow modification
pdelim - delimiter, passed as an array reference to allow modification
- Returns: parsed value
-
isValidAttrName
public static boolean isValidAttrName(String attr)
Check if attr is a valid name.
- Parameters:
attr - String to check, must be non-null
- Returns: true if attr is a valid name.
-
isCSS1Selector
public static boolean isCSS1Selector(String buf)
In CSS1, selectors can contain only the characters A-Z, 0-9, and Unicode characters 161-255, plus dash (-); they cannot start with a dash or a digit; they can also contain escaped characters and any Unicode character as a numeric code (see next item). The backslash followed by at most four hexadecimal digits (0..9A..F) stands for the Unicode character with that number. Any character except a hexadecimal digit can be escaped to remove its special meaning, by putting a backslash in front.
- Parameters:
buf - css selector name
- Returns: true if the given string is a valid css1 selector name
-
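The rules quoted above can be turned into a small checker. The following is one possible reading of them, not JTidy's implementation: Css1NameSketch is a hypothetical class, "A-Z" is taken to include lower-case letters, hexadecimal digits are accepted in either case, and rejecting the empty string and a trailing backslash are choices made for this sketch.

```java
/** Hypothetical checker for the CSS1 name rules quoted above; not JTidy's code. */
final class Css1NameSketch {

    static boolean isValid(String name) {
        if (name == null || name.isEmpty()) {
            return false; // assumption: an empty name is rejected
        }
        int i = 0;
        boolean first = true;
        while (i < name.length()) {
            char c = name.charAt(i);
            if (c == '\\') {
                i++;
                if (i >= name.length()) {
                    return false; // dangling backslash
                }
                if (isHex(name.charAt(i))) {
                    // numeric escape: at most four hexadecimal digits
                    int digits = 0;
                    while (i < name.length() && digits < 4 && isHex(name.charAt(i))) {
                        i++;
                        digits++;
                    }
                } else {
                    i++; // any non-hex character may be escaped
                }
            } else if (isAsciiLetter(c) || (c >= 161 && c <= 255)) {
                i++;
            } else if (c == '-' || (c >= '0' && c <= '9')) {
                if (first) {
                    return false; // cannot start with a dash or a digit
                }
                i++;
            } else {
                return false;
            }
            first = false;
        }
        return true;
    }

    private static boolean isAsciiLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    private static boolean isHex(char c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
    }
}
```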
parseAttrs
Parse tag attributes.
- Parameters:
isempty - is tag empty?
- Returns: parsed attribute/value list
-
pushInline
public void pushInline(Node node)
Push a copy of an inline node onto the stack, but don't push if implicit or OBJECT or APPLET (implicit tags are ones generated from the istack). One issue arises with pushing inlines when the tag is already pushed. For instance:
<p><em> text <p><em> more text
shouldn't be mapped to
<p><em> text </em></p><p><em><em> more text </em></em>
- Parameters:
node - Node to be pushed
-
popInline
public void popInline(Node node)
Pop a copy of an inline node from the stack.
- Parameters:
node - Node to be popped
-
isPushed
public boolean isPushed(Node node)
Is the node in the stack?
- Parameters:
node - Node
- Returns: true if the node is found in the stack
-
inlineDup
public int inlineDup(Node node)
This has the effect of inserting "missing" inline elements around the contents of blocklevel elements such as P, TD, TH, DIV, PRE etc. This procedure is called at the start of ParseBlock, when the inline stack is not empty, as will be the case in:
<i><h1>italic heading</h1></i>
which is then treated as equivalent to
<h1><i>italic heading</i></h1>
This is implemented by setting the lexer into a mode where it gets tokens from the inline stack rather than from the input stream.
- Parameters:
node - original node
- Returns: stack size
-
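The effect described for pushInline and inlineDup can be modelled with a stack of open inline tag names that is replayed inside each new block. This is a simplified sketch under assumptions: InlineStackSketch and openBlock are hypothetical, tags are plain strings, and the real lexer replays tokens lazily through getToken rather than building a list.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

/** Hypothetical model of the inline stack; not part of JTidy. */
final class InlineStackSketch {
    private final Deque<String> inlineStack = new ArrayDeque<>(); // most recently opened tag first

    /** Pushes an inline tag unless it is already on the stack, which avoids <em><em> nesting. */
    void pushInline(String tag) {
        if (!inlineStack.contains(tag)) {
            inlineStack.push(tag);
        }
    }

    /** Removes the tag from the stack if present. */
    void popInline(String tag) {
        inlineStack.remove(tag);
    }

    /** Emits a block whose content is wrapped in copies of every open inline tag. */
    List<String> openBlock(String block, String text) {
        List<String> out = new ArrayList<>();
        out.add("<" + block + ">");
        // Reopen inlines oldest first so the nesting order matches the original markup.
        for (Iterator<String> it = inlineStack.descendingIterator(); it.hasNext();) {
            out.add("<" + it.next() + ">");
        }
        out.add(text);
        // Close them in reverse order, most recently opened first.
        for (String tag : inlineStack) {
            out.add("</" + tag + ">");
        }
        out.add("</" + block + ">");
        return out;
    }
}
```

With "i" on the stack, `openBlock("h1", "italic heading")` yields the token sequence for `<h1><i>italic heading</i></h1>`, matching the example above.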
insertedToken
- Returns:
-
canPrune
public boolean canPrune(Node element)
Can the given element be removed?
- Parameters:
element - node
- Returns: true if the element can be removed
-
fixId
public void fixId(Node node)
Duplicate name attribute as an id and check if id and name match.
- Parameters:
node - Node to check for name/id attributes
-
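The name/id rule can be sketched on a plain attribute map. This is an illustration only: FixIdSketch is hypothetical, attributes are modelled as a Map rather than AttVal nodes, and the boolean result replaces whatever reporting the real void method performs when id and name disagree.

```java
import java.util.Map;

/** Hypothetical illustration of duplicating name as id; not part of JTidy. */
final class FixIdSketch {

    /**
     * Copies name into id when id is missing.
     * Returns false only when both are present and differ.
     */
    static boolean fixId(Map<String, String> attrs) {
        String name = attrs.get("name");
        if (name == null) {
            return true; // nothing to duplicate
        }
        String id = attrs.get("id");
        if (id == null) {
            attrs.put("id", name);
            return true;
        }
        return id.equals(name);
    }
}
```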
deferDup
public void deferDup()
Defer duplicates when entering a table or other element where the inlines shouldn't be duplicated.
-
preContent
protected boolean preContent(Node node)
Is content acceptable for pre elements?
- Parameters:
node - content
- Returns: true if node is acceptable in pre elements
-