Package org.w3c.tidy

Class Lexer

java.lang.Object
org.w3c.tidy.Lexer

public class Lexer extends Object
Lexer for html parser.

Given a file stream fp it returns a sequence of tokens.

GetToken(fp) gets the next token.
UngetToken(fp) provides one level of undo.

The tags include an attribute list:
- linked list of attribute/value nodes
- each node has 2 null-terminated strings
- entities are replaced in attribute values

White space is compacted if not in preformatted mode. If not in preformatted mode, then leading white space is discarded and subsequent white space sequences are compacted to single space chars.

If XmlTags is no, then tag names are folded to upper case and attribute names to lower case.

Not yet done:
- Doctype subset and marked sections
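The white space rule described above can be illustrated with a small standalone sketch. The class and method names here are hypothetical and are not part of org.w3c.tidy; the real lexer works on its byte buffer rather than on strings, so treat this only as an illustration of the rule itself.

```java
final class WhitespaceCompactionSketch {
    /** Hypothetical helper for illustration; not part of org.w3c.tidy. */
    static String compact(String text, boolean preformatted) {
        if (preformatted) {
            return text; // preformatted mode: white space is preserved as is
        }
        StringBuilder out = new StringBuilder(text.length());
        boolean wasWhite = true; // starting "in white space" discards leading white space
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean white = c == ' ' || c == '\t' || c == '\n' || c == '\r' || c == '\f';
            if (!white) {
                out.append(c);
            } else if (!wasWhite) {
                out.append(' '); // first white space char of a run becomes one space
            }
            wasWhite = white;
        }
        return out.toString();
    }
}
```

The `wasWhite` flag plays a role similar to what the `waswhite` field description suggests: it remembers whether the previous character was white space so that a run collapses to one space.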

Version:
$Revision: 1100 $ ($Author: aditsu $)
Author:
Dave Raggett dsr@w3.org , Andy Quick ac.quick@sympatico.ca (translation to Java), Fabrizio Giustina
  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    protected short
    badAccess
    for accessibility errors.
    protected short
    badChars
    for bad char encodings.
    protected boolean
    badDoctype
    set if html or PUBLIC is missing.
    protected short
    badForm
    for mismatched/mispositioned form tags.
    protected short
    badLayout
    for bad style errors.
    protected int
    columns
    at start of current token.
    protected Configuration
    configuration
    configuration.
    protected int
    doctype
    version as given by doctype (if any).
    protected short
    errors
    count of errors.
    protected PrintWriter
    errout
    error output stream.
    protected boolean
    excludeBlocks
    Netscape compatibility.
    protected boolean
    exiled
    true if moved out of table.
    static final short
    IGNORE_MARKUP
    state: ignore markup.
    static final short
    IGNORE_WHITESPACE
    state: ignore whitespace.
    protected StreamIn
    in
    file stream.
    protected Node
    inode
    Inline stack for compatibility with Mosaic.
    protected int
    insert
    for inferring inline tags.
    protected boolean
    insertspace
    when space is moved after end tag.
    protected Stack
    istack
    stack.
    protected int
    istackbase
    start of frame.
    protected boolean
    isvoyager
    true if xmlns attribute on html element.
    protected byte[]
    lexbuf
    Lexer character buffer. Parse tree nodes span onto this buffer, which contains the concatenated text contents of all of the elements.
    protected int
    lexlength
    allocated.
    protected int
    lexsize
    used.
    protected int
    lines
    lines seen.
    static final short
    MIXED_CONTENT
    state: mixed content.
    static final short
    PREFORMATTED
    state: preformatted.
    protected boolean
    pushed
    true after token has been pushed back.
    protected Report
    report
    report.
    protected Node
    root
    Root node is saved here.
    protected boolean
    seenEndBody
    already seen end body tag?
    protected boolean
    seenEndHtml
    already seen end html tag?
    protected short
    state
    state of lexer's finite state machine.
    protected Style
    styles
    used for cleaning up presentation markup.
    protected Node
    token
    current node.
    protected int
    txtend
    end of current node.
    protected int
    txtstart
    start of current node.
    protected short
    versions
    bit vector of HTML versions.
    protected short
    warnings
    count of warnings in this document.
    protected boolean
    waswhite
    used to collapse contiguous white space.
  • Constructor Summary

    Constructors
    Constructor
    Description
    Lexer(StreamIn in, Configuration configuration, Report report)
    Instantiates a new Lexer.
  • Method Summary

    Modifier and Type
    Method
    Description
    void
    addByte(int c)
    Adds a byte to lexer buffer.
    void
    addCharToLexer(int c)
    Store char c as UTF-8 encoded byte stream.
    boolean
    addGenerator(Node root)
    Add meta element for Tidy.
    void
    addStringLiteral(String str)
    Calls addCharToLexer for each char in the string.
    void
    addStringToLexer(String str)
    Adds a string to lexer buffer.
    short
    apparentVersion()
    Return the html version used in document.
    boolean
    canPrune(Node element)
    Can the given element be removed?
    void
    changeChar(byte c)
    Substitute the last char in buffer.
    boolean
    checkDocTypeKeyWords(Node doctype)
    Check system keywords (keywords should be uppercase).
    AttVal
    cloneAttributes(AttVal attrs)
    Clones an attribute value and adds any ASP or PHP node to the node list.
    Node
    cloneNode(Node node)
    Clones a node and adds it to the node list.
    void
    deferDup()
    Defer duplicates when entering a table or other element where the inlines shouldn't be duplicated.
    boolean
    endOfInput()
    Has end of input stream been reached?
    short
    findGivenVersion(Node doctype)
    Examine DOCTYPE to identify version.
    boolean
    fixDocType(Node root)
    Fixup doctype if missing.
    void
    fixHTMLNameSpace(Node root, String profile)
    Fix xhtml namespace.
    void
    fixId(Node node)
    Duplicate name attribute as an id and check if id and name match.
    boolean
    fixXmlDecl(Node root)
    Ensure XML document starts with <?xml version="1.0"?>.
    Node
    getCDATA(Node container)
    Create a text node for the contents of a CDATA element like style or script which ends with </foo> for some foo.
    Node
    getToken(short mode)
    Gets a token.
    short
    htmlVersion()
    Choose what version to use for new doctype.
    String
    htmlVersionName()
    Choose what version to use for new doctype.
    Node
    inferredTag(String name)
    Generates and inserts a new node.
    int
    inlineDup(Node node)
    This has the effect of inserting "missing" inline elements around the contents of block-level elements such as P, TD, TH, DIV, PRE etc.
    Node
    insertedToken()
     
    static boolean
    isCSS1Selector(String buf)
    In CSS1, selectors can contain only the characters A-Z, 0-9, and Unicode characters 161-255, plus dash (-); they cannot start with a dash or a digit; they can also contain escaped characters and any Unicode character as a numeric code (see next item).
    boolean
    isPushed(Node node)
    Is the node in the stack?
    static boolean
    isValidAttrName(String attr)
    Check if attr is a valid name.
    Node
    newLineNode()
    Adds a new line node.
    Node
    newNode()
    Creates a new node and adds it to the node list.
    Node
    newNode(short type, byte[] textarray, int start, int end)
    Creates a new node and adds it to the node list.
    Node
    newNode(short type, byte[] textarray, int start, int end, String element)
    Creates a new node and adds it to the node list.
    Node
    parseAsp()
    Parser for ASP within start tags. Some people use ASP to customize attributes. Tidy isn't really well suited to dealing with ASP. This is a workaround for attributes, but won't deal with the case where the ASP is used to tailor the attribute value.
    String
    parseAttribute(boolean[] isempty, Node[] asp, Node[] php)
    Consumes the '>' terminating start tags.
    AttVal
    parseAttrs(boolean[] isempty)
    Parse tag attributes.
    void
    parseEntity(short mode)
    Parse an html entity.
    Node
    parsePhp()
    PHP is like ASP but is based upon XML processing instructions, e.g. <?php ... ?>.
    int
    parseServerInstruction()
    Invoked when < is seen in place of an attribute value, but terminates on whitespace if not ASP, PHP or Tango. This routine recognizes ' and " quoted strings.
    char
    parseTagName()
    Parses a tag name.
    String
    parseValue(String name, boolean foldCase, boolean[] isempty, int[] pdelim)
    Parse an attribute value.
    void
    popInline(Node node)
    Pop a copy of an inline node from the stack.
    protected boolean
    preContent(Node node)
    Is content acceptable for pre elements?
    void
    pushInline(Node node)
    Push a copy of an inline node onto the stack, but don't push if implicit or OBJECT or APPLET (implicit tags are ones generated from the istack). One issue arises with pushing inlines when the tag is already pushed.
    boolean
    setXHTMLDocType(Node root)
    Adds a new xhtml doctype to the document.
    void
    ungetToken()
     
    protected void
    updateNodeTextArrays(byte[] oldtextarray, byte[] newtextarray)
    Update oldtextarray in the current nodes.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Field Details

    • IGNORE_WHITESPACE

      public static final short IGNORE_WHITESPACE
      state: ignore whitespace.
    • MIXED_CONTENT

      public static final short MIXED_CONTENT
      state: mixed content.
    • PREFORMATTED

      public static final short PREFORMATTED
      state: preformatted.
    • IGNORE_MARKUP

      public static final short IGNORE_MARKUP
      state: ignore markup.
    • in

      protected StreamIn in
      file stream.
    • errout

      protected PrintWriter errout
      error output stream.
    • badAccess

      protected short badAccess
      for accessibility errors.
    • badLayout

      protected short badLayout
      for bad style errors.
    • badChars

      protected short badChars
      for bad char encodings.
    • badForm

      protected short badForm
      for mismatched/mispositioned form tags.
    • warnings

      protected short warnings
      count of warnings in this document.
    • errors

      protected short errors
      count of errors.
    • lines

      protected int lines
      lines seen.
    • columns

      protected int columns
      at start of current token.
    • waswhite

      protected boolean waswhite
      used to collapse contiguous white space.
    • pushed

      protected boolean pushed
      true after token has been pushed back.
    • insertspace

      protected boolean insertspace
      when space is moved after end tag.
    • excludeBlocks

      protected boolean excludeBlocks
      Netscape compatibility.
    • exiled

      protected boolean exiled
      true if moved out of table.
    • isvoyager

      protected boolean isvoyager
      true if xmlns attribute on html element.
    • versions

      protected short versions
      bit vector of HTML versions.
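A bit vector of versions is typically narrowed with a bitwise AND each time the document uses a construct that only some versions allow. The sketch below shows that idea only; the flag names and values, and the `constrain` method, are hypothetical and are not taken from org.w3c.tidy.

```java
final class VersionBitsSketch {
    // Hypothetical flag values chosen for this sketch only.
    static final short HTML20 = 1;
    static final short HTML32 = 2;
    static final short HTML40_STRICT = 4;
    static final short HTML40_LOOSE = 8;
    static final short ALL = (short) (HTML20 | HTML32 | HTML40_STRICT | HTML40_LOOSE);

    short versions = ALL; // every version is possible until markup rules some out

    /** Keeps only the versions that also allow the construct just seen. */
    void constrain(short allowed) {
        versions &= allowed;
    }
}
```

After constraining with `HTML32 | HTML40_LOOSE` and then with `HTML40_STRICT | HTML40_LOOSE`, only the `HTML40_LOOSE` bit remains set.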
    • doctype

      protected int doctype
      version as given by doctype (if any).
    • badDoctype

      protected boolean badDoctype
      set if html or PUBLIC is missing.
    • txtstart

      protected int txtstart
      start of current node.
    • txtend

      protected int txtend
      end of current node.
    • state

      protected short state
      state of lexer's finite state machine.
    • token

      protected Node token
      current node.
    • lexbuf

      protected byte[] lexbuf
      Lexer character buffer. Parse tree nodes span onto this buffer, which contains the concatenated text contents of all of the elements. lexsize must be reset for each file. Byte buffer of UTF-8 chars.
    • lexlength

      protected int lexlength
      allocated.
    • lexsize

      protected int lexsize
      used.
    • inode

      protected Node inode
      Inline stack for compatibility with Mosaic. For deferring text node.
    • insert

      protected int insert
      for inferring inline tags.
    • istack

      protected Stack istack
      stack.
    • istackbase

      protected int istackbase
      start of frame.
    • styles

      protected Style styles
      used for cleaning up presentation markup.
    • configuration

      protected Configuration configuration
      configuration.
    • seenEndBody

      protected boolean seenEndBody
      already seen end body tag?
    • seenEndHtml

      protected boolean seenEndHtml
      already seen end html tag?
    • report

      protected Report report
      report.
    • root

      protected Node root
      Root node is saved here.
  • Constructor Details

    • Lexer

      public Lexer(StreamIn in, Configuration configuration, Report report)
      Instantiates a new Lexer.
      Parameters:
      in - StreamIn
      configuration - configuration instance
      report - report instance, for reporting errors
  • Method Details

    • newNode

      public Node newNode()
      Creates a new node and adds it to the node list.
      Returns:
      Node
    • newNode

      public Node newNode(short type, byte[] textarray, int start, int end)
      Creates a new node and adds it to the node list.
      Parameters:
      type - node type: Node.ROOT_NODE | Node.DOCTYPE_TAG | Node.COMMENT_TAG | Node.PROC_INS_TAG | Node.TEXT_NODE | Node.START_TAG | Node.END_TAG | Node.START_END_TAG | Node.CDATA_TAG | Node.SECTION_TAG | Node.ASP_TAG | Node.JSTE_TAG | Node.PHP_TAG | Node.XML_DECL
      textarray - array of bytes contained in the Node
      start - start position
      end - end position
      Returns:
      Node
    • newNode

      public Node newNode(short type, byte[] textarray, int start, int end, String element)
      Creates a new node and adds it to the node list.
      Parameters:
      type - node type: Node.ROOT_NODE | Node.DOCTYPE_TAG | Node.COMMENT_TAG | Node.PROC_INS_TAG | Node.TEXT_NODE | Node.START_TAG | Node.END_TAG | Node.START_END_TAG | Node.CDATA_TAG | Node.SECTION_TAG | Node.ASP_TAG | Node.JSTE_TAG | Node.PHP_TAG | Node.XML_DECL
      textarray - array of bytes contained in the Node
      start - start position
      end - end position
      element - tag name
      Returns:
      Node
    • cloneNode

      public Node cloneNode(Node node)
      Clones a node and adds it to the node list.
      Parameters:
      node - Node
      Returns:
      cloned Node
    • cloneAttributes

      public AttVal cloneAttributes(AttVal attrs)
      Clones an attribute value and adds any ASP or PHP node to the node list.
      Parameters:
      attrs - original AttVal
      Returns:
      cloned AttVal
    • updateNodeTextArrays

      protected void updateNodeTextArrays(byte[] oldtextarray, byte[] newtextarray)
      Update oldtextarray in the current nodes.
      Parameters:
      oldtextarray - previous text array
      newtextarray - new text array
    • newLineNode

      public Node newLineNode()
      Adds a new line node. Used for creating preformatted text from Word2000.
      Returns:
      new line node
    • endOfInput

      public boolean endOfInput()
      Has end of input stream been reached?
      Returns:
      true if the end of the input stream has been reached
    • addByte

      public void addByte(int c)
      Adds a byte to lexer buffer.
      Parameters:
      c - byte to add
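The field descriptions for lexbuf, lexlength ("allocated") and lexsize ("used") suggest a growable byte buffer. A minimal sketch of that pattern follows; the class name is hypothetical and the growth policy (start at 8 bytes, then double) is an assumption made for the example, not something this page documents.

```java
import java.util.Arrays;

final class LexBufferSketch {
    byte[] lexbuf = new byte[0]; // backing array
    int lexlength = 0;           // bytes allocated
    int lexsize = 0;             // bytes used

    /** Appends one byte, growing the backing array when it is full. */
    void addByte(int c) {
        if (lexsize == lexlength) {
            // Assumed growth policy for this sketch: start at 8 bytes, then double.
            lexlength = (lexlength == 0) ? 8 : lexlength * 2;
            lexbuf = Arrays.copyOf(lexbuf, lexlength);
        }
        lexbuf[lexsize++] = (byte) c;
    }
}
```

Growing replaces the array object, so anything holding a reference to the old array would need to be pointed at the new one. That is presumably the purpose of updateNodeTextArrays(oldtextarray, newtextarray), although this page does not say so explicitly.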
    • changeChar

      public void changeChar(byte c)
      Substitute the last char in buffer.
      Parameters:
      c - new char
    • addCharToLexer

      public void addCharToLexer(int c)
      Store char c as UTF-8 encoded byte stream.
      Parameters:
      c - char to store
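For readers unfamiliar with the encoding, storing a char as UTF-8 means emitting one to four bytes depending on the code point. The sketch below is a standalone illustration with a hypothetical class name; it returns the bytes instead of appending them to the lexer buffer, and it does not validate surrogates or out-of-range values.

```java
final class Utf8Sketch {
    /** Encodes one code point as UTF-8. No validation is performed. */
    static byte[] encode(int c) {
        if (c < 0x80) {
            return new byte[] {(byte) c}; // 1 byte: 0xxxxxxx
        }
        if (c < 0x800) {
            return new byte[] {           // 2 bytes: 110xxxxx 10xxxxxx
                (byte) (0xC0 | (c >> 6)),
                (byte) (0x80 | (c & 0x3F))};
        }
        if (c < 0x10000) {
            return new byte[] {           // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
                (byte) (0xE0 | (c >> 12)),
                (byte) (0x80 | ((c >> 6) & 0x3F)),
                (byte) (0x80 | (c & 0x3F))};
        }
        return new byte[] {               // 4 bytes: 11110xxx followed by three 10xxxxxx
            (byte) (0xF0 | (c >> 18)),
            (byte) (0x80 | ((c >> 12) & 0x3F)),
            (byte) (0x80 | ((c >> 6) & 0x3F)),
            (byte) (0x80 | (c & 0x3F))};
    }
}
```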
    • addStringToLexer

      public void addStringToLexer(String str)
      Adds a string to lexer buffer.
      Parameters:
      str - String to add
    • parseEntity

      public void parseEntity(short mode)
      Parse an html entity.
      Parameters:
      mode - mode
    • parseTagName

      public char parseTagName()
      Parses a tag name.
      Returns:
      first char after the tag name
    • addStringLiteral

      public void addStringLiteral(String str)
      Calls addCharToLexer for each char in the string.
      Parameters:
      str - input String
    • htmlVersion

      public short htmlVersion()
      Choose what version to use for new doctype.
      Returns:
      html version constant
    • htmlVersionName

      public String htmlVersionName()
      Choose what version to use for new doctype.
      Returns:
      html version name
    • addGenerator

      public boolean addGenerator(Node root)
      Add meta element for Tidy. If the meta tag is already present, update release date.
      Parameters:
      root - root node
      Returns:
      true if the tag has been added
    • checkDocTypeKeyWords

      public boolean checkDocTypeKeyWords(Node doctype)
      Check system keywords (keywords should be uppercase).
      Parameters:
      doctype - doctype node
      Returns:
      true if doctype keywords are all uppercase
    • findGivenVersion

      public short findGivenVersion(Node doctype)
      Examine DOCTYPE to identify version.
      Parameters:
      doctype - doctype node
      Returns:
      version code
    • fixHTMLNameSpace

      public void fixHTMLNameSpace(Node root, String profile)
      Fix xhtml namespace.
      Parameters:
      root - root Node
      profile - current profile
    • setXHTMLDocType

      public boolean setXHTMLDocType(Node root)
      Adds a new xhtml doctype to the document.
      Parameters:
      root - root node
      Returns:
      true if a doctype has been added
    • apparentVersion

      public short apparentVersion()
      Return the html version used in document.
      Returns:
      version code
    • fixDocType

      public boolean fixDocType(Node root)
      Fixup doctype if missing.
      Parameters:
      root - root node
      Returns:
      false if current version has not been identified
    • fixXmlDecl

      public boolean fixXmlDecl(Node root)
      Ensure XML document starts with <?xml version="1.0"?>. Add encoding attribute if not using ASCII or UTF-8 output.
      Parameters:
      root - root node
      Returns:
      always true
    • inferredTag

      public Node inferredTag(String name)
      Generates and inserts a new node.
      Parameters:
      name - tag name
      Returns:
      generated node
    • getCDATA

      public Node getCDATA(Node container)
      Create a text node for the contents of a CDATA element like style or script which ends with </foo> for some foo.
      Parameters:
      container - container node
      Returns:
      cdata node
    • ungetToken

      public void ungetToken()
    • getToken

      public Node getToken(short mode)
      Gets a token.
      Parameters:
      mode - one of the following:
      • MixedContent -- for elements which don't accept PCDATA
      • Preformatted -- white space preserved as is
      • IgnoreMarkup -- for CDATA elements such as script, style
      Returns:
      next Node
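The class description says that ungetToken provides one level of undo, and the pushed field is described as "true after token has been pushed back". A minimal sketch of that one-level pushback pattern is shown here. It uses strings in place of Node, takes no mode argument, and ignores the inline stack, so it is a simplified assumption about the mechanism and not the library's implementation.

```java
import java.util.Iterator;
import java.util.List;

final class PushbackSketch {
    private final Iterator<String> source;
    private String token;   // most recently returned token
    private boolean pushed; // true after the token has been pushed back

    PushbackSketch(List<String> tokens) {
        this.source = tokens.iterator();
    }

    /** Returns the pushed-back token if there is one, otherwise the next token or null. */
    String getToken() {
        if (pushed) {
            pushed = false;
            return token;
        }
        token = source.hasNext() ? source.next() : null;
        return token;
    }

    /** One level of undo: the next getToken() returns the same token again. */
    void ungetToken() {
        pushed = true;
    }
}
```

Because only a flag is stored, calling ungetToken twice in a row still undoes just one token.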
    • parseAsp

      public Node parseAsp()
      Parser for ASP within start tags. Some people use ASP to customize attributes. Tidy isn't really well suited to dealing with ASP. This is a workaround for attributes, but won't deal with the case where the ASP is used to tailor the attribute value. Here is an example of a workaround for using ASP in attribute values: href='<%=rsSchool.Fields("ID").Value%>' where the ASP that generates the attribute value is masked from Tidy by the quote marks.
      Returns:
      parsed Node
    • parsePhp

      public Node parsePhp()
      PHP is like ASP but is based upon XML processing instructions, e.g. <?php ... ?>.
      Returns:
      parsed Node
    • parseAttribute

      public String parseAttribute(boolean[] isempty, Node[] asp, Node[] php)
      consumes the '>' terminating start tags.
      Parameters:
      isempty - flag is passed as array so it can be modified
      asp - asp Node, passed as array so it can be modified
      php - php Node, passed as array so it can be modified
      Returns:
      parsed attribute
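Several parameters here are documented as "passed as array so it can be modified". Java passes primitives and references by value, so a one-element array is a common way for a method to hand back a second result. The sketch below shows the idiom with a hypothetical, much simpler method than parseAttribute.

```java
final class OutParamSketch {
    /**
     * Hypothetical example: returns the leading tag name and reports through
     * isempty[0] whether the name is immediately followed by "/>".
     */
    static String readTagName(String input, boolean[] isempty) {
        int i = 0;
        while (i < input.length() && Character.isLetterOrDigit(input.charAt(i))) {
            i++;
        }
        isempty[0] = input.startsWith("/>", i); // second result, written through the array
        return input.substring(0, i);
    }
}
```

A caller allocates `new boolean[1]`, passes it in, and reads element 0 after the call returns.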
    • parseServerInstruction

      public int parseServerInstruction()
      Invoked when < is seen in place of an attribute value, but terminates on whitespace if not ASP, PHP or Tango. This routine recognizes ' and " quoted strings.
      Returns:
      delimiter
    • parseValue

      public String parseValue(String name, boolean foldCase, boolean[] isempty, int[] pdelim)
      Parse an attribute value.
      Parameters:
      name - attribute name
      foldCase - fold case?
      isempty - is attribute empty? Passed as an array reference to allow modification
      pdelim - delimiter, passed as an array reference to allow modification
      Returns:
      parsed value
    • isValidAttrName

      public static boolean isValidAttrName(String attr)
      Check if attr is a valid name.
      Parameters:
      attr - String to check, must be non-null
      Returns:
      true if attr is a valid name.
    • isCSS1Selector

      public static boolean isCSS1Selector(String buf)
      In CSS1, selectors can contain only the characters A-Z, 0-9, and Unicode characters 161-255, plus dash (-); they cannot start with a dash or a digit; they can also contain escaped characters and any Unicode character as a numeric code (see next item). The backslash followed by at most four hexadecimal digits (0..9A..F) stands for the Unicode character with that number. Any character except a hexadecimal digit can be escaped to remove its special meaning, by putting a backslash in front.
      Parameters:
      buf - css selector name
      Returns:
      true if the given string is a valid css1 selector name
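The rules quoted above can be turned into a small checker. The following is a sketch written from that prose alone, with a hypothetical class name; it is not the body of isCSS1Selector and may accept or reject inputs differently from the library. Accepting lower-case a-z alongside A-Z is an assumption.

```java
final class Css1SelectorSketch {
    private static boolean isHexDigit(char c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
    }

    private static boolean isNameChar(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') // lower case: an assumption
            || (c >= '0' && c <= '9') || c == '-' || (c >= 161 && c <= 255);
    }

    static boolean isCss1Name(String s) {
        if (s.isEmpty()) {
            return false;
        }
        char first = s.charAt(0);
        if (first == '-' || (first >= '0' && first <= '9')) {
            return false; // cannot start with a dash or a digit
        }
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i++);
            if (c == '\\') {
                if (i == s.length()) {
                    return false; // dangling backslash
                }
                if (isHexDigit(s.charAt(i))) {
                    // numeric escape: at most four hexadecimal digits
                    int digits = 0;
                    while (i < s.length() && digits < 4 && isHexDigit(s.charAt(i))) {
                        i++;
                        digits++;
                    }
                } else {
                    i++; // any non-hex character can be escaped
                }
            } else if (!isNameChar(c)) {
                return false;
            }
        }
        return true;
    }
}
```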
    • parseAttrs

      public AttVal parseAttrs(boolean[] isempty)
      Parse tag attributes.
      Parameters:
      isempty - is tag empty?
      Returns:
      parsed attribute/value list
    • pushInline

      public void pushInline(Node node)
      Push a copy of an inline node onto the stack, but don't push if implicit or OBJECT or APPLET (implicit tags are ones generated from the istack). One issue arises with pushing inlines when the tag is already pushed. For instance, <p><em> text <p><em> more text shouldn't be mapped to <p><em> text </em></p><p><em><em> more text </em></em>.
      Parameters:
      node - Node to be pushed
    • popInline

      public void popInline(Node node)
      Pop a copy of an inline node from the stack.
      Parameters:
      node - Node to be popped
    • isPushed

      public boolean isPushed(Node node)
      Is the node in the stack?
      Parameters:
      node - Node
      Returns:
      true if the node is found in the stack
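The three inline-stack methods (pushInline, popInline, isPushed) can be pictured with a simplified model that stores tag names instead of Node copies. Everything in this sketch is an assumption made for illustration: the class name, the `implicit` parameter, and in particular the choice to skip a tag that is already on the stack, which is one plausible way to avoid the doubled <em> described under pushInline.

```java
import java.util.ArrayList;
import java.util.List;

final class InlineStackSketch {
    private final List<String> istack = new ArrayList<>();

    boolean isPushed(String tag) {
        return istack.contains(tag);
    }

    void pushInline(String tag, boolean implicit) {
        if (implicit || tag.equals("object") || tag.equals("applet")) {
            return; // never pushed, per the pushInline description
        }
        if (isPushed(tag)) {
            return; // assumed simplification: avoids <em><em> duplication
        }
        istack.add(tag);
    }

    /** Removes the topmost entry for the given tag, if any. */
    void popInline(String tag) {
        int i = istack.lastIndexOf(tag);
        if (i >= 0) {
            istack.remove(i);
        }
    }

    int size() {
        return istack.size();
    }
}
```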
    • inlineDup

      public int inlineDup(Node node)
      This has the effect of inserting "missing" inline elements around the contents of block-level elements such as P, TD, TH, DIV, PRE etc. This procedure is called at the start of ParseBlock when the inline stack is not empty, as will be the case in <i><h1>italic heading</h1></i>, which is then treated as equivalent to <h1><i>italic heading</i></h1>. This is implemented by setting the lexer into a mode where it gets tokens from the inline stack rather than from the input stream.
      Parameters:
      node - original node
      Returns:
      stack size
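The effect described for inlineDup, where open inline elements are replayed inside a newly opened block, can be sketched as follows. This is a simplified model using tag-name strings and a hypothetical method; the real lexer switches into a mode that serves tokens from the inline stack, which this sketch only imitates by building the resulting token list directly.

```java
import java.util.ArrayList;
import java.util.List;

final class InlineDupSketch {
    /**
     * Returns the start tags seen by the parser when a block opens while
     * inline elements are still open: the block first, then each open inline.
     */
    static List<String> startBlock(String block, List<String> inlineStack) {
        List<String> tokens = new ArrayList<>();
        tokens.add("<" + block + ">");
        for (String inline : inlineStack) {
            tokens.add("<" + inline + ">"); // replayed from the stack, not read from input
        }
        return tokens;
    }
}
```

With an inline stack holding only `i`, opening `h1` yields `<h1>` followed by `<i>`, matching the <h1><i>italic heading</i></h1> form given in the description.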
    • insertedToken

      public Node insertedToken()
      Returns:
    • canPrune

      public boolean canPrune(Node element)
      Can the given element be removed?
      Parameters:
      element - node
      Returns:
      true if the element can be removed
    • fixId

      public void fixId(Node node)
      Duplicate name attribute as an id and check if id and name match.
      Parameters:
      node - Node to check for name/id attributes
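The behaviour described for fixId can be modelled on a plain attribute map. This sketch is hypothetical: it uses a Map in place of a Node's attribute list and returns a boolean where the library method returns void, so it illustrates only the "copy name to id, then compare" idea.

```java
import java.util.Map;

final class FixIdSketch {
    /**
     * Copies name to id when id is absent. Returns false only when both
     * attributes are present and differ.
     */
    static boolean fixId(Map<String, String> attrs) {
        String name = attrs.get("name");
        String id = attrs.get("id");
        if (name == null) {
            return true; // nothing to duplicate
        }
        if (id == null) {
            attrs.put("id", name); // duplicate name as id
            return true;
        }
        return id.equals(name); // both present: they should match
    }
}
```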
    • deferDup

      public void deferDup()
      Defer duplicates when entering a table or other element where the inlines shouldn't be duplicated.
    • preContent

      protected boolean preContent(Node node)
      Is content acceptable for pre elements?
      Parameters:
      node - content
      Returns:
      true if node is acceptable in pre elements