Implementations must act as if they used the following state
machine to tokenise HTML. The state machine must start in the
data state. Most states consume a single
character, which may have various side-effects, and either switches
the state machine to a new state to reconsume the same
character, or switches it to a new state (to consume the next
character), or repeats the same state (to consume the next
character). Some states have more complicated behavior and
can consume several characters before switching to another
state.
The exact behavior of certain states depends on a content model flag that is set after certain
tokens are emitted. The flag has several states: PCDATA, RCDATA, CDATA,
and PLAINTEXT. Initially it must be in the
PCDATA state. In the RCDATA and CDATA states, a further escape flag is used to control the behavior of
the tokeniser. It is either true or false, and initially must be
set to the false state. The
insertion mode and the stack of open elements also affect tokenisation.
The output of the tokenisation step is a series of zero or more
of the following tokens: DOCTYPE, start tag, end tag, comment,
character, end-of-file. DOCTYPE tokens have a name, a public
identifier, a system identifier, and a force-quirks flag. When a DOCTYPE token is
created, its name, public identifier, and system identifier must be
marked as missing (which is a distinct state from the empty
string), and the force-quirks flag must be set to
off (its other state is on).
Start and end tag tokens have a tag name, a
self-closing flag, and a list
of attributes, each of which has a name and a value. When a start or end tag token is created, its
self-closing flag must be unset (its other state is that it be set), and
its attributes list must be empty. Comment and character
tokens have data.
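The token types described above could be modelled roughly as follows. This is a minimal sketch, not part of the specification: the class and field names are assumptions chosen for illustration, and `None` is used here to stand for the "missing" state so that it stays distinct from the empty string.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Attribute:
    name: str
    value: str = ""


@dataclass
class DoctypeToken:
    # None models "missing", which is distinct from the empty string.
    name: Optional[str] = None
    public_identifier: Optional[str] = None
    system_identifier: Optional[str] = None
    force_quirks: bool = False  # False models "off", True models "on"


@dataclass
class TagToken:
    kind: str  # "start" or "end"
    tag_name: str = ""
    self_closing: bool = False  # unset when the token is created
    attributes: List[Attribute] = field(default_factory=list)  # empty at creation


@dataclass
class CommentToken:
    data: str = ""


@dataclass
class CharacterToken:
    data: str


@dataclass
class EndOfFileToken:
    pass
```

Using `default_factory` gives every tag token its own empty attribute list, which matches the requirement that the list be empty whenever a token is created.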
When a token is emitted, it must immediately be handled by the
tree construction stage. The tree
construction stage can affect the state of the content model flag, and can insert additional
characters into the stream. (For example, the script element can result in scripts
executing and using the dynamic markup
insertion APIs to insert characters into the stream being
tokenised.)
When a start tag token is emitted with
its self-closing flag
set, if the flag is not acknowledged when it
is processed by the tree construction stage, that is a
parse error.
When an end tag token is emitted,
the content model flag must be switched to
the PCDATA state.
When an end tag token is emitted with attributes, that is a
parse error.
When an end tag token is emitted with
its self-closing flag
set, that is a parse error.
Otherwise: treat it as per the "anything else" entry
below.
U+002D HYPHEN-MINUS (-)
If the content model flag is set to
either the RCDATA state or the CDATA state, and the escape flag is false, and there are at least three
characters before this one in the input stream, and the last four
characters in the input stream, including this one, are U+003C
LESS-THAN SIGN, U+0021 EXCLAMATION MARK, U+002D HYPHEN-MINUS, and
U+002D HYPHEN-MINUS ("<!--"), then set the escape flag to true.
In any case, emit the input character as a character token. Stay
in the data state.
Otherwise: treat it as per the "anything else" entry
below.
U+003E GREATER-THAN SIGN (>)
If the content model flag is set to
either the RCDATA state or the CDATA state, and the escape flag is true, and the last three characters in
the input stream including this one are U+002D HYPHEN-MINUS, U+002D
HYPHEN-MINUS, U+003E GREATER-THAN SIGN ("-->"), set the escape flag to false.
In any case, emit the input character as a character token. Stay
in the data state.
EOF
Emit an end-of-file token.
Anything else
Emit the input character as a character token. Stay in the
data state.
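The way the escape flag is toggled by the U+002D and U+003E entries can be sketched as below. This is an illustrative helper, not the tokeniser itself: the function name is an assumption, and it assumes the content model flag stays in the RCDATA or CDATA state for the whole input. It only reports the value of the escape flag after each character.

```python
from typing import List


def track_escape_flag(text: str, escape: bool = False) -> List[bool]:
    """Return the escape flag value after each character of `text`.

    Assumes the content model flag is RCDATA or CDATA throughout.
    """
    states = []
    for i, ch in enumerate(text):
        if ch == "-" and not escape and i >= 3 and text.endswith("<!--", 0, i + 1):
            # At least three characters precede this one and the last four
            # characters, including this one, are "<!--".
            escape = True
        elif ch == ">" and escape and text.endswith("-->", 0, i + 1):
            # The last three characters, including this one, are "-->".
            escape = False
        states.append(escape)
    return states
```

For the input `a<!--b-->c`, the flag becomes true at the second hyphen of `<!--` and returns to false at the `>` of `-->`.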
Character reference data state
(This cannot happen if the content model
flag is set to the CDATA state.)
Consume the next input character. If
it is a U+002F SOLIDUS (/) character, switch to the close tag open state. Otherwise, emit a U+003C
LESS-THAN SIGN character token and reconsume the current input
character in the data state.
U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL
LETTER Z
Create a new start tag token, set its tag name to the lowercase
version of the input character (add 0x0020 to the character's code
point), then switch to the tag name state.
(Don't emit the token yet; further details will be filled in
before it is emitted.)
U+0061 LATIN SMALL LETTER A through to U+007A LATIN SMALL
LETTER Z
Create a new start tag token, set its tag name to the input
character, then switch to the tag name
state. (Don't emit the token yet; further details will be
filled in before it is emitted.)
U+003E GREATER-THAN SIGN (>)
Parse error. Emit a U+003C LESS-THAN
SIGN character token and a U+003E GREATER-THAN SIGN character
token. Switch to the data state.
Parse error. Emit a U+003C LESS-THAN
SIGN character token and reconsume the current input character in
the data state.
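The "lowercase version" used for tag names is a plain code point shift that only applies to U+0041 through U+005A. A minimal sketch, with a helper name that is an assumption made for illustration:

```python
def ascii_lowercase(ch: str) -> str:
    """Lowercase a single character by adding 0x0020, for A-Z only."""
    if "A" <= ch <= "Z":
        return chr(ord(ch) + 0x0020)
    return ch  # every other character is left untouched
```

This differs from a locale-aware or full Unicode lowercase operation: characters outside A-Z, including non-ASCII capital letters, are not changed.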
Close tag open state
If the content model flag is set to the
RCDATA or CDATA states but no start tag token has ever been emitted
by this instance of the tokeniser ( fragment
case ), or, if the content model flag
is set to the RCDATA or CDATA states and the next few characters do
not match the tag name of the last start tag token emitted (case
insensitively), or if they do but they are not immediately followed
by one of the following characters:
U+0009 CHARACTER TABULATION
U+000A LINE FEED (LF)
U+000B LINE TABULATION
U+000C FORM FEED (FF)
U+0020 SPACE
U+003E GREATER-THAN SIGN (>)
U+002F SOLIDUS (/)
EOF
...then emit a U+003C LESS-THAN SIGN character token, a U+002F
SOLIDUS character token, and switch to the data state to process the next input character.
Otherwise, if the content model flag is
set to the PCDATA state, or if the next few characters do
match that tag name, consume the next input
character:
U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL
LETTER Z
Create a new end tag token, set its tag name to the lowercase
version of the input character (add 0x0020 to the character's code
point), then switch to the tag name state.
(Don't emit the token yet; further details will be filled in
before it is emitted.)
U+0061 LATIN SMALL LETTER A through to U+007A LATIN SMALL
LETTER Z
Create a new end tag token, set its tag name to the input
character, then switch to the tag name
state. (Don't emit the token yet; further details will be
filled in before it is emitted.)
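The lookahead test that the close tag open state applies in the RCDATA and CDATA states can be sketched as a predicate. This is only an illustration: the function and parameter names are assumptions, `rest` stands for the input immediately after `</`, and an empty string after the tag name stands for EOF.

```python
from typing import Optional

# Characters that may follow the tag name, as listed above (EOF handled separately).
TERMINATORS = {"\t", "\n", "\x0b", "\x0c", " ", ">", "/"}


def end_tag_matches(rest: str, last_start_tag: Optional[str]) -> bool:
    """Return True if `rest` begins an end tag for the last start tag emitted."""
    if last_start_tag is None:
        # No start tag token has ever been emitted (fragment case).
        return False
    n = len(last_start_tag)
    # Case-insensitive comparison; tag names are assumed to be ASCII here.
    if rest[:n].lower() != last_start_tag.lower():
        return False
    following = rest[n:n + 1]  # "" models EOF
    return following == "" or following in TERMINATORS
```

When the predicate is false, the tokeniser emits the `<` and `/` as character tokens and returns to the data state; when it is true, it goes on to consume the next input character.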
Emit the current tag token. Switch to the data state.
U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL
LETTER Z
Append the lowercase version of the current input character
(add 0x0020 to the character's code point) to the current tag
token's tag name. Stay in the tag name
state.
EOF
Parse error. Emit the current tag token.
Reconsume the EOF character in the data
state.
U+002F SOLIDUS (/)
Switch to the self-closing start tag state.
Anything else
Append the current input character to the current tag token's
tag name. Stay in the tag name state.
Emit the current tag token. Switch to the data state.
U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL
LETTER Z
Start a new attribute in the current tag token. Set that
attribute's name to the lowercase version of the current input
character (add 0x0020 to the character's code point), and its value
to the empty string. Switch to the attribute
name state.
Parse error. Emit the current tag token.
Reconsume the EOF character in the data
state.
Anything else
Start a new attribute in the current tag token. Set that
attribute's name to the current input character, and its value to
the empty string. Switch to the attribute
name state.
Emit the current tag token. Switch to the data state.
U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL
LETTER Z
Append the lowercase version of the current input character
(add 0x0020 to the character's code point) to the current
attribute's name. Stay in the attribute name
state.
U+002F SOLIDUS (/)
Switch to the self-closing start tag state.
U+0022 QUOTATION MARK (")
U+0027 APOSTROPHE (')
Parse error. Treat it as per the "anything else" entry
below.
EOF
Parse error. Emit the current tag token.
Reconsume the EOF character in the data
state.
Anything else
Append the current input character to the current attribute's
name. Stay in the attribute name state.
When the user agent leaves the attribute name state (and before
emitting the tag token, if appropriate), the complete attribute's
name must be compared to the other attributes on the same token; if
there is already an attribute on the token with the exact same
name, then this is a parse error and the new
attribute must be dropped, along with the value that gets
associated with it (if any).
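The duplicate-attribute rule above can be illustrated with a small batch helper. This is a simplification and an assumption about structure: a real tokeniser performs the check at the moment it leaves the attribute name state, whereas this sketch receives all (name, value) pairs at once. The function name is hypothetical.

```python
from typing import Dict, List, Tuple


def drop_duplicate_attributes(
    pairs: List[Tuple[str, str]],
) -> Tuple[Dict[str, str], int]:
    """Keep the first attribute for each exact name and count parse errors."""
    kept: Dict[str, str] = {}
    parse_errors = 0
    for name, value in pairs:
        if name in kept:
            # A later attribute with the exact same name is dropped,
            # together with its value; this is a parse error.
            parse_errors += 1
        else:
            kept[name] = value
    return kept, parse_errors
```

For example, `drop_duplicate_attributes([("id", "a"), ("class", "x"), ("id", "b")])` keeps `id="a"` and `class="x"` and reports one parse error.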
Emit the current tag token. Switch to the data state.
U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL
LETTER Z
Start a new attribute in the current tag token. Set that
attribute's name to the lowercase version of the current input
character (add 0x0020 to the character's code point), and its value
to the empty string. Switch to the attribute
name state.
U+002F SOLIDUS (/)
Switch to the self-closing start tag state.
EOF
Parse error. Emit the current tag token.
Reconsume the EOF character in the data
state.
Anything else
Start a new attribute in the current tag token. Set that
attribute's name to the current input character, and its value to
the empty string. Switch to the attribute
name state.
(This can only happen if the content
model flag is set to the PCDATA state.)
Consume every character up to the first U+003E GREATER-THAN SIGN
character (>) or the end of the file (EOF), whichever comes
first. Emit a comment token whose data is the concatenation of all
the characters starting from and including the character that
caused the state machine to switch into the bogus comment state, up
to and including the last consumed character before the U+003E
character, if any, or up to the end of the file otherwise. (If the
comment was started by the end of the file (EOF), the token is
empty.)
If the end of the file was reached, reconsume the EOF
character.
Markup declaration open state
(This can only happen if the content
model flag is set to the PCDATA state.)
If the next two characters are both U+002D HYPHEN-MINUS (-)
characters, consume those two characters, create a comment token
whose data is the empty string, and switch to the comment start state.
Otherwise, if the next seven characters are a
case-insensitive match for the word "DOCTYPE", then
consume those characters and switch to the DOCTYPE state.
Otherwise, if the insertion mode is "in foreign content" and the
current node is not an element in the HTML namespace
and the next seven characters are a
case-sensitive match for the string "[CDATA[" (the five uppercase
letters "CDATA" with a U+005B LEFT SQUARE BRACKET character before
and after), then consume those characters and switch to the
CDATA block state
(which is unrelated to the content model flag's CDATA state).
Otherwise, this is a parse error. Switch to the bogus
comment state. The next character that is consumed, if any, is
the first character that will be in the comment.
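The decision made in the markup declaration open state can be sketched as a lookahead function. The function name, the two boolean parameters, and the returned state labels are assumptions made for illustration; `rest` is the input immediately after `<!`.

```python
from typing import Tuple


def markup_declaration_open(
    rest: str,
    in_foreign_content: bool = False,
    current_node_is_html: bool = True,
) -> Tuple[str, int]:
    """Return (next state, number of characters consumed)."""
    if rest.startswith("--"):
        return "comment start", 2
    head = rest[:7]
    # Case-insensitive match, restricted to ASCII input.
    if head.isascii() and head.upper() == "DOCTYPE":
        return "DOCTYPE", 7
    # "[CDATA[" is matched case-sensitively, and only in foreign content
    # when the current node is not in the HTML namespace.
    if in_foreign_content and not current_node_is_html and head == "[CDATA[":
        return "CDATA block", 7
    # Parse error: nothing is consumed, so the next character consumed
    # becomes the first character of the bogus comment.
    return "bogus comment", 0
```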
Parse error. Create a new DOCTYPE token.
Set its force-quirks flag to on.
Emit the token. Switch to the data state.
EOF
Parse error. Create a new DOCTYPE token.
Set its force-quirks flag to on.
Emit the token. Reconsume the EOF character in the data state.
Anything else
Create a new DOCTYPE token. Set the token's name to the current input character. Switch to the
DOCTYPE name state.
Emit the current DOCTYPE token. Switch to the data state.
EOF
Parse error. Set the DOCTYPE token's
force-quirks flag to on.
Emit that DOCTYPE token. Reconsume the EOF character in the
data state.
Anything else
Append the current input character to the current DOCTYPE
token's name. Stay in the DOCTYPE name
state.
Emit the current DOCTYPE token. Switch to the data state.
EOF
Parse error. Set the DOCTYPE token's
force-quirks flag to on.
Emit that DOCTYPE token. Reconsume the EOF character in the
data state.
Anything else
If the next six characters are a case-insensitive
match for the word "PUBLIC", then consume those characters and
switch to the before DOCTYPE public identifier
state.
Otherwise, if the next six characters are a
case-insensitive match for the word "SYSTEM", then
consume those characters and switch to the before DOCTYPE system identifier state.
Otherwise, this is a parse error.
Set the DOCTYPE token's force-quirks flag to on. Switch
to the bogus DOCTYPE state.
Parse error. Set the DOCTYPE token's
force-quirks flag to on.
Emit that DOCTYPE token. Switch to the data
state.
EOF
Parse error. Set the DOCTYPE token's
force-quirks flag to on.
Emit that DOCTYPE token. Reconsume the EOF character in the
data state.
Parse error. Set the DOCTYPE token's force-quirks flag to on.
Emit that DOCTYPE token. Switch to the
data state.
EOF
Parse error. Set the DOCTYPE token's
force-quirks flag to on.
Emit that DOCTYPE token. Reconsume the EOF character in the
data state.
Parse error. Set the DOCTYPE token's force-quirks flag to on.
Emit that DOCTYPE token. Switch to the
data state.
EOF
Parse error. Set the DOCTYPE token's
force-quirks flag to on.
Emit that DOCTYPE token. Reconsume the EOF character in the
data state.
Emit the current DOCTYPE token. Switch to the data state.
EOF
Parse error. Set the DOCTYPE token's
force-quirks flag to on.
Emit that DOCTYPE token. Reconsume the EOF character in the
data state.
Parse error. Set the DOCTYPE token's
force-quirks flag to on.
Emit that DOCTYPE token. Switch to the data
state.
EOF
Parse error. Set the DOCTYPE token's
force-quirks flag to on.
Emit that DOCTYPE token. Reconsume the EOF character in the
data state.
Parse error. Set the DOCTYPE token's force-quirks flag to on.
Emit that DOCTYPE token. Switch to the
data state.
EOF
Parse error. Set the DOCTYPE token's
force-quirks flag to on.
Emit that DOCTYPE token. Reconsume the EOF character in the
data state.
Parse error. Set the DOCTYPE token's force-quirks flag to on.
Emit that DOCTYPE token. Switch to the
data state.
EOF
Parse error. Set the DOCTYPE token's
force-quirks flag to on.
Emit that DOCTYPE token. Reconsume the EOF character in the
data state.
Emit the current DOCTYPE token. Switch to the data state.
EOF
Parse error. Set the DOCTYPE token's
force-quirks flag to on.
Emit that DOCTYPE token. Reconsume the EOF character in the
data state.
Consume every character up to the next
occurrence of the three character sequence U+005D RIGHT SQUARE
BRACKET U+005D RIGHT SQUARE BRACKET U+003E GREATER-THAN SIGN
("]]>"), or the end of the file (EOF), whichever
comes first. Emit a series of text tokens consisting of all the
characters consumed except the matching three character sequence at
the end (if one was found before the end of the file).
If the end of the file was reached,
reconsume the EOF character.
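The CDATA block rule above amounts to a search for the first "]]>". A minimal sketch, assuming the input is available as a string and `pos` is the index of the first character after "[CDATA["; the function name and return shape are assumptions:

```python
from typing import Tuple


def consume_cdata_block(stream: str, pos: int) -> Tuple[str, int, bool]:
    """Return (text to emit, new position, whether EOF was reached)."""
    end = stream.find("]]>", pos)
    if end == -1:
        # No terminator: everything up to EOF is emitted, and the caller
        # then reconsumes the EOF character.
        return stream[pos:], len(stream), True
    # The matching three-character sequence is consumed but not emitted.
    return stream[pos:end], end + 3, False
```

A lone "]]" that is not followed by ">" is ordinary text, so `consume_cdata_block("a]]b]]>rest", 0)` yields the text `a]]b` and resumes at `rest`.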
8.2.4.1. Tokenising character references
This section defines how to consume
a character reference. This definition is used when
parsing character references in text and in attributes.
The behavior depends on the identity of the next
character (the one immediately after the U+0026 AMPERSAND
character):
U+0009 CHARACTER TABULATION
U+000A LINE FEED (LF)
U+000B LINE TABULATION
U+000C FORM FEED (FF)
U+0020 SPACE
U+003C LESS-THAN SIGN
U+0026 AMPERSAND
EOF
The additional allowed character, if there is one
Not a character reference. No characters are consumed,
and nothing is returned. (This is not an error, either.)
U+0023 NUMBER SIGN (#)
Consume the U+0023 NUMBER SIGN.
The behavior further depends on the character after
the U+0023 NUMBER SIGN:
U+0078 LATIN SMALL LETTER X
U+0058 LATIN CAPITAL LETTER X
Consume the X.
Follow the steps below, but using the range of characters U+0030
DIGIT ZERO through to U+0039 DIGIT NINE, U+0061 LATIN SMALL LETTER
A through to U+0066 LATIN SMALL LETTER F, and U+0041 LATIN CAPITAL
LETTER A, through to U+0046 LATIN CAPITAL LETTER F (in other words,
0-9, A-F, a-f).
When it comes to interpreting the number, interpret it as a
hexadecimal number.
Anything else
Follow the steps below, but using the range of characters U+0030
DIGIT ZERO through to U+0039 DIGIT NINE (i.e. just 0-9).
When it comes to interpreting the number, interpret it as a
decimal number.
Consume as many characters as match the range of characters
given above.
If no characters match the range, then don't consume any
characters (and unconsume the U+0023 NUMBER SIGN character and, if
appropriate, the X character). This is a parse
error; nothing is returned.
Otherwise, if the next character is a U+003B SEMICOLON, consume
that too. If it isn't, there is a parse error.
If one or more characters match the range, then take them all
and interpret the string of characters as a number (either
hexadecimal or decimal as appropriate).
If that number is one of the numbers in the first column of the
following table, then this is a parse error.
Find the row with that number in the first column, and return a
character token for the Unicode character given in the second
column of that row.
Number  Unicode character
0x0D    U+000A  LINE FEED (LF)
0x80    U+20AC  EURO SIGN ('€')
0x81    U+FFFD  REPLACEMENT CHARACTER
0x82    U+201A  SINGLE LOW-9 QUOTATION MARK ('‚')
0x83    U+0192  LATIN SMALL LETTER F WITH HOOK ('ƒ')
0x84    U+201E  DOUBLE LOW-9 QUOTATION MARK ('„')
0x85    U+2026  HORIZONTAL ELLIPSIS ('…')
0x86    U+2020  DAGGER ('†')
0x87    U+2021  DOUBLE DAGGER ('‡')
0x88    U+02C6  MODIFIER LETTER CIRCUMFLEX ACCENT ('ˆ')
0x89    U+2030  PER MILLE SIGN ('‰')
0x8A    U+0160  LATIN CAPITAL LETTER S WITH CARON ('Š')
0x8B    U+2039  SINGLE LEFT-POINTING ANGLE QUOTATION MARK ('‹')
0x8C    U+0152  LATIN CAPITAL LIGATURE OE ('Œ')
0x8D    U+FFFD  REPLACEMENT CHARACTER
0x8E    U+017D  LATIN CAPITAL LETTER Z WITH CARON ('Ž')
0x8F    U+FFFD  REPLACEMENT CHARACTER
0x90    U+FFFD  REPLACEMENT CHARACTER
0x91    U+2018  LEFT SINGLE QUOTATION MARK ('‘')
0x92    U+2019  RIGHT SINGLE QUOTATION MARK ('’')
0x93    U+201C  LEFT DOUBLE QUOTATION MARK ('“')
0x94    U+201D  RIGHT DOUBLE QUOTATION MARK ('”')
0x95    U+2022  BULLET ('•')
0x96    U+2013  EN DASH ('–')
0x97    U+2014  EM DASH ('—')
0x98    U+02DC  SMALL TILDE ('˜')
0x99    U+2122  TRADE MARK SIGN ('™')
0x9A    U+0161  LATIN SMALL LETTER S WITH CARON ('š')
0x9B    U+203A  SINGLE RIGHT-POINTING ANGLE QUOTATION MARK ('›')
0x9C    U+0153  LATIN SMALL LIGATURE OE ('œ')
0x9D    U+FFFD  REPLACEMENT CHARACTER
0x9E    U+017E  LATIN SMALL LETTER Z WITH CARON ('ž')
0x9F    U+0178  LATIN CAPITAL LETTER Y WITH DIAERESIS ('Ÿ')
Otherwise, if the number is in the range 0x0000 to 0x0008, 0x000E to 0x001F, 0x007F to
0x009F, 0xD800 to 0xDFFF, 0xFDD0 to 0xFDDF, or is one of
0xFFFE, 0xFFFF, 0x1FFFE, 0x1FFFF, 0x2FFFE, 0x2FFFF, 0x3FFFE,
0x3FFFF, 0x4FFFE, 0x4FFFF, 0x5FFFE, 0x5FFFF, 0x6FFFE, 0x6FFFF,
0x7FFFE, 0x7FFFF, 0x8FFFE, 0x8FFFF, 0x9FFFE, 0x9FFFF, 0xAFFFE,
0xAFFFF, 0xBFFFE, 0xBFFFF, 0xCFFFE, 0xCFFFF, 0xDFFFE, 0xDFFFF,
0xEFFFE, 0xEFFFF, 0xFFFFE, 0xFFFFF, 0x10FFFE, or 0x10FFFF, or is
higher than 0x10FFFF, then this is a parse
error; return a character token for the U+FFFD REPLACEMENT
CHARACTER character instead.
Otherwise, return a character token for the Unicode character
whose code point is that number.
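The numeric branch described above can be sketched as follows. This is an illustration under stated assumptions, not a normative implementation: the function names and the `(character, new position, parse error count)` return shape are invented for the example, `pos` is the index of the "#" that follows the ampersand, and the ranges are the ones listed in this section.

```python
from typing import Optional, Tuple

# Replacement table from the text above: number -> code point.
REPLACEMENTS = {
    0x0D: 0x000A, 0x80: 0x20AC, 0x81: 0xFFFD, 0x82: 0x201A, 0x83: 0x0192,
    0x84: 0x201E, 0x85: 0x2026, 0x86: 0x2020, 0x87: 0x2021, 0x88: 0x02C6,
    0x89: 0x2030, 0x8A: 0x0160, 0x8B: 0x2039, 0x8C: 0x0152, 0x8D: 0xFFFD,
    0x8E: 0x017D, 0x8F: 0xFFFD, 0x90: 0xFFFD, 0x91: 0x2018, 0x92: 0x2019,
    0x93: 0x201C, 0x94: 0x201D, 0x95: 0x2022, 0x96: 0x2013, 0x97: 0x2014,
    0x98: 0x02DC, 0x99: 0x2122, 0x9A: 0x0161, 0x9B: 0x203A, 0x9C: 0x0153,
    0x9D: 0xFFFD, 0x9E: 0x017E, 0x9F: 0x0178,
}

# Ranges that map to U+FFFD, as listed in this section.
DISALLOWED_RANGES = [
    (0x0000, 0x0008), (0x000E, 0x001F), (0x007F, 0x009F),
    (0xD800, 0xDFFF), (0xFDD0, 0xFDDF),
]


def is_disallowed(n: int) -> bool:
    if n > 0x10FFFF:
        return True
    if (n & 0xFFFF) in (0xFFFE, 0xFFFF):  # 0xFFFE, 0xFFFF, 0x1FFFE, ...
        return True
    return any(lo <= n <= hi for lo, hi in DISALLOWED_RANGES)


def consume_numeric_reference(text: str, pos: int) -> Tuple[Optional[str], int, int]:
    """text[pos] must be "#". Returns (character or None, new pos, parse errors)."""
    start = pos
    pos += 1  # consume the U+0023 NUMBER SIGN
    if text[pos:pos + 1] in ("x", "X"):
        pos += 1  # consume the X
        digits, base = "0123456789abcdefABCDEF", 16
    else:
        digits, base = "0123456789", 10
    first_digit = pos
    while pos < len(text) and text[pos] in digits:
        pos += 1
    if pos == first_digit:
        # No digits: unconsume "#" (and the X); parse error, nothing returned.
        return None, start, 1
    errors = 0
    number = int(text[first_digit:pos], base)
    if text[pos:pos + 1] == ";":
        pos += 1
    else:
        errors += 1  # missing semicolon
    if number in REPLACEMENTS:
        return chr(REPLACEMENTS[number]), pos, errors + 1
    if is_disallowed(number):
        return "\ufffd", pos, errors + 1
    return chr(number), pos, errors
```

Because the table is consulted first, 0x80 to 0x9F never reach the range check; within 0x007F to 0x009F only 0x7F falls through to the U+FFFD replacement.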
Anything else
Consume the maximum number of characters possible, with the
consumed characters case-sensitively matching one of the
identifiers in the first column of the named character references table.
If no match can be made, then this is a parse
error. No characters are consumed, and nothing is
returned.
If the last character matched is not a U+003B SEMICOLON
(;), there is a parse
error.
If the character reference is being consumed as part of an
attribute, and the last character matched is not a U+003B
SEMICOLON (;), and the next character is in
the range U+0030 DIGIT ZERO to U+0039 DIGIT NINE, U+0041 LATIN
CAPITAL LETTER A to U+005A LATIN CAPITAL LETTER Z, or U+0061 LATIN
SMALL LETTER A to U+007A LATIN SMALL LETTER Z, then, for historical
reasons, all the characters that were matched after the U+0026
AMPERSAND (&) must be unconsumed, and nothing is returned.
Otherwise, return a character token for the character
corresponding to the character reference name (as given by the second
column of the named character
references table).
If the markup contains "I'm &notit; I tell
you", the character reference is parsed as "not", as in,
"I'm ¬it; I tell you". But if the markup was
"I'm &notin; I tell you", the character
reference would be parsed as "notin;", resulting in
"I'm ∉ I tell you".
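The longest-match rule and the attribute exception for named references can be sketched as below. The `NAMED` dictionary is a tiny illustrative subset standing in for the named character references table, and the function name and return shape are assumptions made for the example; `pos` is the index of the first character after the ampersand.

```python
from typing import Optional, Tuple

# Tiny illustrative subset of the named character references table.
NAMED = {
    "amp;": "&", "amp": "&",
    "not;": "\u00ac", "not": "\u00ac",
    "notin;": "\u2209",
}


def consume_named_reference(
    text: str, pos: int, in_attribute: bool = False
) -> Tuple[Optional[str], int, int]:
    """Return (character or None, new position, parse error count)."""
    best = None
    for name in NAMED:
        # Case-sensitive match; keep the longest identifier that matches.
        if text.startswith(name, pos) and (best is None or len(name) > len(best)):
            best = name
    if best is None:
        return None, pos, 1  # parse error, nothing consumed
    errors = 0
    end = pos + len(best)
    if not best.endswith(";"):
        errors += 1  # missing semicolon
        nxt = text[end:end + 1]
        if in_attribute and nxt.isascii() and nxt.isalnum():
            # Historical rule: unconsume everything after the ampersand.
            return None, pos, errors
    return NAMED[best], end, errors
```

With this subset, the text `notit; I tell you` matches "not" and stops before "it;", while `notin; I tell you` matches the longer "notin;". In an attribute, the first case returns nothing because the next character is a letter.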