Package org.w3c.tidy
Class EncodingUtils
java.lang.Object
org.w3c.tidy.EncodingUtils
- Version:
- $Revision: 622 $ ($Author: fgiust $)
- Author:
- Fabrizio Giustina
-
Field Summary
Fields
Modifier and Type | Field | Description
static final int | FSM_ASCII | States for ISO 2022. A document in an ISO-2022-based encoding uses ESC sequences called "designators" to switch character sets.
static final int | FSM_ESC | State ESC.
static final int | FSM_ESCD | State ESCD.
static final int | FSM_ESCDP | State ESCDP.
static final int | FSM_ESCP | State ESCP.
static final int | FSM_NONASCII | State NONASCII.
static final int | HIGH_UTF16_SURROGATE | UTF-16 high surrogate.
static final int | LOW_UTF16_SURROGATE | UTF-16 low surrogate.
static final int | MAX_UTF16_FROM_UCS4 | Max UTF-16 value.
static final int | MAX_UTF8_FROM_UCS4 | Max UTF-8 valid char value.
static final int | UNICODE_BOM | The default (big-endian) Unicode BOM.
static final int | UNICODE_BOM_BE | The big-endian (default) Unicode BOM.
static final int | UNICODE_BOM_LE | The little-endian Unicode BOM.
static final int | UNICODE_BOM_UTF8 | The UTF-8 Unicode BOM.
static final int | UTF16_HIGH_SURROGATE_BEGIN | UTF-16 surrogate pair areas: high surrogates begin.
static final int | UTF16_HIGH_SURROGATE_END | UTF-16 surrogate pair areas: high surrogates end.
static final int | UTF16_LOW_SURROGATE_BEGIN | UTF-16 surrogate pair areas: low surrogates begin.
static final int | UTF16_LOW_SURROGATE_END | UTF-16 surrogate pair areas: low surrogates end.
static final int | UTF16_SURROGATES_BEGIN | UTF-16 surrogates begin.
-
Method Summary
Modifier and Type | Method | Description
protected static int | decodeMacRoman(int c) | Function to convert from MacRoman to Unicode.
protected static int | decodeWin1252(int c) | Function for conversion from Windows-1252 to Unicode.
-
Field Details
-
UNICODE_BOM_BE
public static final int UNICODE_BOM_BE
The big-endian (default) Unicode BOM.
-
UNICODE_BOM
public static final int UNICODE_BOM
The default (big-endian) Unicode BOM.
-
UNICODE_BOM_LE
public static final int UNICODE_BOM_LE
The little-endian Unicode BOM.
-
UNICODE_BOM_UTF8
public static final int UNICODE_BOM_UTF8
The UTF-8 Unicode BOM.
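As a rough illustration of how BOM constants like these can be used, the sketch below checks the first bytes of a stream against the standard Unicode byte order mark patterns. The class `BomSketch`, its constant names, and the constant values are assumptions for illustration; this page does not list the actual values of the `EncodingUtils` fields.

```java
final class BomSketch {
    // Hypothetical constants: the standard BOM byte patterns packed into ints.
    // They are NOT read from EncodingUtils; the real field values are not shown on this page.
    static final int BOM_BE = 0xFEFF;     // bytes FE FF
    static final int BOM_LE = 0xFFFE;     // bytes FF FE
    static final int BOM_UTF8 = 0xEFBBBF; // bytes EF BB BF

    /** Returns an encoding name if the input starts with a known BOM, else null. */
    static String detect(byte[] b) {
        if (b.length >= 3) {
            int three = ((b[0] & 0xFF) << 16) | ((b[1] & 0xFF) << 8) | (b[2] & 0xFF);
            if (three == BOM_UTF8) {
                return "UTF-8";
            }
        }
        if (b.length >= 2) {
            int two = ((b[0] & 0xFF) << 8) | (b[1] & 0xFF);
            if (two == BOM_BE) {
                return "UTF-16BE";
            }
            if (two == BOM_LE) {
                return "UTF-16LE";
            }
        }
        return null; // no BOM recognized
    }

    public static void main(String[] args) {
        byte[] sample = {(byte) 0xFF, (byte) 0xFE, 0x41, 0x00};
        System.out.println(detect(sample));
    }
}
```

The two leading bytes are read in big-endian order, so a little-endian stream shows up as the byte-swapped value 0xFFFE.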
-
FSM_ASCII
public static final int FSM_ASCII
States for ISO 2022. A document in an ISO-2022-based encoding uses ESC sequences called "designators" to switch character sets. The designators defined and used in ISO-2022-JP are:
- "ESC" + "(" + ? for ISO646 variants
- "ESC" + "$" + ? and "ESC" + "$" + "(" + ? for multibyte character sets
State ASCII.
-
FSM_ESC
public static final int FSM_ESC
State ESC.
-
FSM_ESCD
public static final int FSM_ESCD
State ESCD.
-
FSM_ESCDP
public static final int FSM_ESCDP
State ESCDP.
-
FSM_ESCP
public static final int FSM_ESCP
State ESCP.
-
FSM_NONASCII
public static final int FSM_NONASCII
State NONASCII.
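The six states can be read as a small finite-state machine that follows the designator sequences described under FSM_ASCII. The sketch below is one plausible set of transitions inferred from those designators, not the documented behavior of this class; the class `Iso2022StateSketch`, the `State` enum, and both methods are hypothetical names.

```java
final class Iso2022StateSketch {
    // Hypothetical enum mirroring the FSM_* constants; the int values of the
    // real constants are not shown on this page.
    enum State { ASCII, ESC, ESCD, ESCDP, ESCP, NONASCII }

    private static final int ESC_BYTE = 0x1B;

    /** Returns the next state after reading byte c in state s (assumed transitions). */
    static State next(State s, int c) {
        switch (s) {
            case ASCII:
                return c == ESC_BYTE ? State.ESC : State.ASCII;
            case ESC:
                if (c == '$') {
                    return State.ESCD;   // start of "ESC $ ..." (multibyte set)
                }
                if (c == '(') {
                    return State.ESCP;   // start of "ESC ( ?" (ISO646 variant)
                }
                return State.ASCII;      // not a designator handled here
            case ESCD:
                // "ESC $ ( ?" needs one more byte; "ESC $ ?" is already complete.
                return c == '(' ? State.ESCDP : State.NONASCII;
            case ESCDP:
                return State.NONASCII;   // final byte of "ESC $ ( ?"
            case ESCP:
                return State.ASCII;      // final byte of "ESC ( ?"
            case NONASCII:
                return c == ESC_BYTE ? State.ESC : State.NONASCII;
            default:
                throw new AssertionError(s);
        }
    }

    /** Feeds every char of the input through the machine, starting in ASCII. */
    static State run(String input) {
        State s = State.ASCII;
        for (int i = 0; i < input.length(); i++) {
            s = next(s, input.charAt(i));
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(run("\u001B$B")); // multibyte designator
        System.out.println(run("\u001B(B")); // ISO646 designator
    }
}
```

Under these assumptions, a decoder would treat bytes read in the NONASCII state as part of a multibyte character and bytes read in the ASCII state as single-byte characters.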
-
MAX_UTF8_FROM_UCS4
public static final int MAX_UTF8_FROM_UCS4
Max UTF-8 valid char value.
-
MAX_UTF16_FROM_UCS4
public static final int MAX_UTF16_FROM_UCS4
Max UTF-16 value.
-
LOW_UTF16_SURROGATE
public static final int LOW_UTF16_SURROGATE
UTF-16 low surrogate.
-
UTF16_SURROGATES_BEGIN
public static final int UTF16_SURROGATES_BEGIN
UTF-16 surrogates begin.
-
UTF16_LOW_SURROGATE_BEGIN
public static final int UTF16_LOW_SURROGATE_BEGIN
UTF-16 surrogate pair areas: low surrogates begin.
-
UTF16_LOW_SURROGATE_END
public static final int UTF16_LOW_SURROGATE_END
UTF-16 surrogate pair areas: low surrogates end.
-
UTF16_HIGH_SURROGATE_BEGIN
public static final int UTF16_HIGH_SURROGATE_BEGIN
UTF-16 surrogate pair areas: high surrogates begin.
-
UTF16_HIGH_SURROGATE_END
public static final int UTF16_HIGH_SURROGATE_END
UTF-16 surrogate pair areas: high surrogates end.
-
HIGH_UTF16_SURROGATE
public static final int HIGH_UTF16_SURROGATE
UTF-16 high surrogate.
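For orientation, the sketch below shows how surrogate range boundaries of this kind are typically used to split a supplementary code point into a UTF-16 pair and to join it again. The class `SurrogateSketch` and its constants are hypothetical; the values are the ranges defined by the Unicode Standard and are assumed, not confirmed, to match the fields above.

```java
final class SurrogateSketch {
    // Hypothetical constants with the standard Unicode values; the actual
    // EncodingUtils field values are not listed on this page.
    static final int SUPPLEMENTARY_BEGIN = 0x10000; // first code point that needs a pair
    static final int MAX_CODE_POINT = 0x10FFFF;
    static final int HIGH_BEGIN = 0xD800;
    static final int HIGH_END = 0xDBFF;
    static final int LOW_BEGIN = 0xDC00;
    static final int LOW_END = 0xDFFF;

    /** Splits a supplementary code point into {high, low} surrogates. */
    static int[] split(int codePoint) {
        if (codePoint < SUPPLEMENTARY_BEGIN || codePoint > MAX_CODE_POINT) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - SUPPLEMENTARY_BEGIN; // 20-bit value
        int high = HIGH_BEGIN + (offset >> 10);       // top 10 bits
        int low = LOW_BEGIN + (offset & 0x3FF);       // bottom 10 bits
        return new int[] {high, low};
    }

    /** Combines a high and a low surrogate back into one code point. */
    static int combine(int high, int low) {
        if (high < HIGH_BEGIN || high > HIGH_END || low < LOW_BEGIN || low > LOW_END) {
            throw new IllegalArgumentException("not a valid surrogate pair");
        }
        return ((high - HIGH_BEGIN) << 10) + (low - LOW_BEGIN) + SUPPLEMENTARY_BEGIN;
    }

    public static void main(String[] args) {
        int[] pair = split(0x1F600);
        System.out.printf("%04X %04X%n", pair[0], pair[1]); // prints "D83D DE00"
    }
}
```

The same arithmetic is available in the JDK as `Character.highSurrogate`, `Character.lowSurrogate`, and `Character.toCodePoint`, which can serve as a cross-check.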
-
-
Method Details
-
decodeWin1252
protected static int decodeWin1252(int c)
Function for conversion from Windows-1252 to Unicode.
Parameters:
c - char to decode
Returns:
decoded char
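A conversion of this kind is usually a table lookup: Windows-1252 agrees with ISO-8859-1 (and therefore with the first 256 Unicode code points) everywhere except the range 0x80 to 0x9F. The sketch below illustrates that idea only. The class `Win1252Sketch`, its table, and the choice of 0x0000 for the five undefined positions are assumptions, not the implementation behind `decodeWin1252`.

```java
final class Win1252Sketch {
    // Unicode code points for Windows-1252 bytes 0x80..0x9F.
    // 0x0000 marks the positions Windows-1252 leaves undefined
    // (0x81, 0x8D, 0x8F, 0x90, 0x9D); using 0 there is an assumption.
    private static final int[] HIGH = {
        0x20AC, 0x0000, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x0000, 0x017D, 0x0000,
        0x0000, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x0000, 0x017E, 0x0178
    };

    /** Maps a Windows-1252 byte value (0..255) to a Unicode code point. */
    static int decode(int c) {
        if (c >= 0x80 && c <= 0x9F) {
            return HIGH[c - 0x80];
        }
        return c; // identical to ISO-8859-1 outside 0x80..0x9F
    }

    public static void main(String[] args) {
        System.out.printf("U+%04X%n", decode(0x80)); // prints "U+20AC" (euro sign)
    }
}
```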
-
decodeMacRoman
protected static int decodeMacRoman(int c)
Function to convert from MacRoman to Unicode.
Parameters:
c - char to decode
Returns:
decoded char
-