Bug 5183 - [FO] Effect of type promotion in fn:distinct-values
: [FO] Effect of type promotion in fn:distinct-values
Status: CLOSED FIXED
Product: XPath / XQuery / XSLT
Functions and Operators 1.0
: Recommendation
: PC Windows XP
: P5 normal
: ---
Assigned To: Michael Kay
: Mailing list for public feedback on specs from XSL and XML Query WGs
: http://www.w3.org/TR/xpath-functions/...
:
:
:
:
  Show dependency treegraph
 
Reported: 2007-10-12 17:25 UTC by Henry Zongaro
Modified: 2009-10-16 20:47 UTC (History)
0 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Henry Zongaro 2007-10-12 17:25:17 UTC
Consider the following expression.  The values of type xs:float and xs:double
both compare equal to the value of type xs:decimal, due to the loss of
precision in type promotion, but the float and double values are not equal to
one another.  What is the correct result?  Or is it implementation-dependent
whether the result is 1 or 2?

count(distinct-values(
         (xs:float('1.0'),
          xs:decimal('1.0000000000100000000001',
          xs:double( '1.00000000001')))


See also the related XSLT 2.0 bug 2916.[1]

[1] http://www.w3.org/Bugs/Public/show_bug.cgi?id=2916
Comment 1 Michael Kay 2007-10-12 18:09:55 UTC
Valid point. The analogy with min() and max() would suggest converting all the
values to the least common supertype. That's the solution I would recommend if
we can satisfy ourselves that it's implementable with acceptable performance.
Comment 2 Michael Kay 2007-10-16 14:41:50 UTC
This is a tricky problem. The rule I suggested in comment #1, of converting to
the least common supertype, suffers the disadvantage that when you process a
sequence in order, you cannot always tell whether a value is a "new value"
(distinct from all previous values) immediately.

I believe Saxon converts all numeric values to double unconditionally before
deciding whether the value is unique. That's not very satisfactory either. 

The only rule I can really think of is (a) take the sequence in some
implementation-defined order, (b) for each value, if the value is equal to some
previous value (that has been deemed distinct), discard it, otherwise include
it in the result of the function. The effect of this rule is that not only is
the order of the result implementation-defined (as now), but in pathological
cases different implementations may find a different number of distinct values.

I think that's effectively teh following, excluding pathologicals such as NaN:

declare function distinct-values($arg) {
  let $s := unordered($arg)
  return
    if (exists($s))
    then ($s[1], distinct-values(remove($s,1)[. ne $s[1]]))
    else ()
}


Comment 3 John Snelson 2007-10-23 15:31:41 UTC
Both XQilla and Berkeley DB XML follow the algorithm Michael describes in
comment #2.
Comment 4 Michael Kay 2007-10-24 08:13:11 UTC
I would like to propose a solution along the following lines (detailed wording
to follow later).

1. The function partitions the items in the atomized input into a number of
groups such that:

  a. Within a group, every pair of items in the group are mutually equal (that
is, A eq B, or A and B are both NaN)

  b. Given two distinct groups, there is at least one pair of values chosen one
from each group such that the two values are unequal (A ne B unless one is NaN)

  c. Note that this does not guarantee that there is no pair of values that are
equal to each other but assigned to different groups, because of the
transitivity issue

  d. Note also that in the general case there may be more than one possible
partitioning that meets these rules.

2. The function then selects one item from each group, chosen arbitrarily,
except [discuss?] that the item that is chosen from one group must not be equal
to the item that is chosen from any other group.

I think this can be implemented by an algorithm that processes the items in
input order, that makes an immediate decision for each item whether to include
it in the result or not, and that retains in memory (a) the items that have
been returned in the output, and (b) for each value that has been returned in
the output, at most one value of each primitive data type that has not been
returned itself, but is equal to a value that has been returned.

For xsl:for-each-group, given that we guarantee the order of groups and the
order of items within a group, we could be a bit more prescriptive: we could
prescribe an algorithm that processes the items in order and allocates each one
to an existing group if it is equal to every item in that group, and that
starts a new group otherwise. This algorithm (I believe!) meets the rules for
distinct values given above, and gives a more predictable result.
Comment 5 Henry Zongaro 2007-10-30 14:47:14 UTC
Michael, in your proposed solution in comment 4, does the implementation of the
function have to choose the partition into groups in the first step in such a
way that it is actually able to select exactly one item from each group in the
second step?  Going back to my initial example,

fn:distinct-values(
     (xs:float('1.0'),
      xs:decimal('1.0000000000100000000001',
      xs:double( '1.00000000001')))

For convenience, I'll refer to the float value as Fl, the decimal value as De
and the double value as Do.  So we have these possible partitions into groups:
1. {Fl}, {De}, {Do}
2. {Fl,De}, {Do}
3. {Fl}, {De, Do}

If the implementation doesn't have to be able to select precisely one item from
each group, it could select (Fl,Do) or (De) in the first case, which means it's
still implementation-dependent whether there is one item or two items in the
result.

Since that doesn't solve the problem, I'll assume you meant that it must be
able to select precisely one item from each group.  In that way, with either
the second or third partition, the number of items in the result will be two.
Comment 6 Michael Kay 2007-10-30 15:08:43 UTC
>So we have these possible partitions into groups:
1. {Fl}, {De}, {Do}
2. {Fl,De}, {Do}
3. {Fl}, {De, Do}

I don't think solution (1) satisfies rule 2: De and Do don't contain a pair of
values that are unequal. So I think (2) and (3) are the only possible
partitionings. I tried to devise the rules on the basis that it would always be
possible to select one value from each group in the second step; I can't easily
prove that I have succeeded.

Mike
Comment 7 Henry Zongaro 2007-10-30 19:24:45 UTC
My apologies for that blunder!
Comment 8 Michael Kay 2007-11-28 18:21:31 UTC
After sitting on this for a while, I propose to resolve this by adding the
paragraph:

If the input sequence contains values of different numeric types that differ
from each other by small amounts, then the eq operator is not transitive,
because of rounding effects occurring during type promotion. In the situation
where the input contains three values A, B, and C such that A eq B, B eq C, but
A ne C, then the number of items in the result of the function (as well as the
choice of which items are returned) is implementation dependent, subject only
to the constraints that (a) no two items in the result sequence compare equal
to each other, and (b) every input item that does not appear in the result
sequence compares equal to some item that does appear in the result sequence. 

For example, this arises when computing 

      distinct-values(
         (xs:float('1.0'),
          xs:decimal('1.0000000000100000000001',
          xs:double( '1.00000000001'))

because the values of type xs:float and xs:double both compare equal to the
value of type xs:decimal but not equal to each other.

For xsl:for-each-group, we need to add similar wording. I will raise a separate
bug report on this. 
Comment 9 Michael Kay 2008-02-05 17:56:57 UTC
Discussed on 5 Feb 2008, agreed to accept the proposal in comment #8
Comment 10 Michael Kay 2009-02-14 19:06:55 UTC
The resolution in comment #8 has now (after an inordinate delay) been copied
into draft Erratum E44. 
Comment 11 Michael Kay 2009-10-16 20:47:47 UTC
Marking this as closed, since the erratum has been issued and applied to the
master text for both the 1.02e and 1.1 documents.


  翻译: