Bug 4106 - [FO] regex syntax: position of backreference
: [FO] regex syntax: position of backreference
Status: CLOSED FIXED
Product: XPath / XQuery / XSLT
Functions and Operators 1.0
: Proposed Recommendation
: PC Windows XP
: P2 normal
: ---
Assigned To: Michael Kay
: Mailing list for public feedback on specs from XSL and XML Query WGs
:
:
:
:
:
  Show dependency treegraph
 
Reported: 2006-12-21 23:35 UTC by Michael Kay
Modified: 2008-08-06 18:37 UTC (History)
2 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Michael Kay 2006-12-21 23:35:45 UTC
In the syntax of regular expressions, a backreference is currently allowed as a
charClassEsc. As such it can appear either outside square brackets, for example 

(abc)\1

or within square brackets

(abc)[\1]

However, it doesn't make sense to allow a backreference within square brackets,
because constructs allowed within square brackets always match a single
character (perhaps one of a set of possible characters, but never a sequence of
more than one character), while a back-reference will in general match a
sequence of characters.

I think that backreferences should appear in the syntax at the level of an
atom:

[9] atom ::= Char | charClass | ( '(' regExp ')' ) | backReference

I have not been able to find a similar restriction documented for REs in Perl,
Java, or .NET. However, none of these languages attempt to define the syntax of
regular expressions using a BNF grammar, or to give a rigorous exposition of
the semantics. Experiments with Java suggest that "(abc)[\1]" is accepted as a
valid regular expression, but its semantics appear to be undefined: I am unable
to identify any string that it matches.
Comment 1 Abel Braaksma 2006-12-22 10:50:11 UTC
Perhaps it's good to consider what Perl does here:

Perl handles backreferences inside a character class as octal escapes:
perl /[\7]/ matches U+07
perl /[\1]/ matches U+01
perl /[\11]/ matches the tab U+09

outside a character class, an octal escape must be of at least two digits to be
treated as an octal escape. Otherwise, if it does not match a parenthesis, it
will raise a runtime error. (which I still believe is the better approach and
should've been adopted by XSLT, but it's too late to change the specs).

Of course, in XSLT there is no such thing as an octal escape, it uses entity
references.
Comment 2 Abel Braaksma 2006-12-22 10:59:19 UTC
Considering the construct 

(low)[\1]

and taken the current specs, one might be tempted to treat it as matching the
string "low" followed by any of the characters in the that string: [low]
(expanded backreference). Taken the string:

helloworld

then the regular expression would:
not match: "hel"
match:     "lowo"
not match: "rld"

This could be a potentially very powerful construct, but I for one have not
encountered it in any regular expression flavor (yet). For regular expression
engines, it may involve a lot of re-engineering to allow for expandable
character classes.

A better approach would be, I believe, to disallow the construct in its
entirety.
Comment 3 Michael Kay 2007-02-13 16:42:35 UTC
The WG has approved this change in principle, subject to detailed wording.

Here is proposed text for the change to F+O.

Within section 7.6.1, the bullet point starting "Back-References are allowed"
should change to read as follows:

# Back-references are allowed outside a character class expression. A
back-reference is an additional kind of atom. The construct \n where n is a
single digit is always recognized as a back-reference; if this is followed by
further digits, these digits are taken to be part of the back-reference if and
only if the back-reference is preceded by sufficiently many capturing
subexpressions. A back-reference matches the string that was matched by the nth
capturing subexpression within the regular expression, that is, the
parenthesized subexpression whose opening left parenthesis is the nth unescaped
left parenthesis within the regular expression. The closing right parenthesis
of this subexpression must occur before the back-reference. For example, the
regular expression ('|").*\1 matches a sequence of characters delimited either
by an apostrophe at the start and end, or by a quotation mark at the start and
end.

If no string is matched by the nth capturing subexpression, the back-reference
is interpreted as matching a zero-length string.

Back-references change the following production:

[9] atom ::= Char | charClass | ( '(' regExp ')' )

to

[9] atom ::= Char | charClass | ( '(' regExp ')' ) | backReference

[9a] backReference ::= "\" [1-9][0-9]*

Note: within a character class expression, "\" followed by a digit is an error.
Some other regular expression languages interpret this as an octal character
reference.
Comment 4 Jim Melton 2007-02-26 00:35:55 UTC
The fix for this bug does not appear in the Recommendation of 23 January 2007. 
It will be considered for a future publication (either an Errata document or
some possible future version of the specification).
Comment 5 Michael Kay 2007-02-27 17:08:06 UTC
The change proposed in comment #3 was accepted, with one minor change, changing
the Note at the end to:

Note: within a character class expression, "\" followed by a digit is invalid.
Some other regular expression languages interpret this as an octal character
reference.

[changes "an error" to "invalid" because that's the terminology used in the
description of errors such as FORX0002]

This will be published in due course as an Erratum.
Comment 6 Jim Melton 2007-04-29 22:53:18 UTC
http://www.w3.org/Bugs/Public/show_bug.cgi?id=4106#c5 indicates the final
resolution of this bug, so I am marking it FIXED.  Because the author of the
bug and the author of the solution are the same person, I am also marking the
bug CLOSED. 


  翻译: