EA interface - requirements for "hashable, value+order-preserving ndarray" #33276

Open
@jorisvandenbossche

Across a set of issues/PRs (eg #32586, #32673, #33064), there has lately been quite some discussion regarding _values_for_factorize / _values_for_argsort, the need for roundtrippability, the need for order-preserving values, the need for _ndarray_values, ...

The different questions and topics that came up (as far as I kept track, probably not complete, but already too long ;)):

  • In EA: revisit interface #32586, the question was raised what the difference is between _values_for_argsort and _values_for_factorize, and whether we need both.

    Some difference that came up:

    • The main difference might be that the values in _values_for_factorize need to be hashable, while the ones in _values_for_argsort don't need to be (although this difference is not properly documented). Looking back at ENH: Sorting of ExtensionArrays #19957 from @TomAugspurger, it was mentioned that "sortable" is an easier requirement than what other algos like factorize might need.
      Related to this difference is that _values_for_factorize needs to return a dtype supported by the hashtables (int64, uint64, float64, object), while _values_for_argsort can return any sortable dtype (so also int8, int32, etc).
    • The return type is different: _values_for_factorize also returns a na_value sentinel, which means you can encode missing values in a different way than as a "missing value" (eg nan in float dtype). _values_for_argsort, by contrast, simply returns one array (I would need to look into the details of how missing values are handled here; it seems they are filtered out in nargsort, so it might not matter how they are encoded in the returned array).

    Is this sufficiently different to warrant two methods? Probably, with a bit of work, they could be combined into a single method. However, their original purpose was only to help implement EA.factorize() and EA.argsort(). So for that purpose only, it might not necessarily be worth trying to combine them. And see the last bullet point for a more general "hashable, orderable array".
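
To make the two differences above concrete, here is a minimal pure-Python sketch. It is not the pandas implementation: the function names, the choice of -1 as na_value, and sorting missing values last (rather than filtering them out as nargsort appears to do) are all assumptions made for illustration.

```python
# Toy stand-ins for the two helpers; not the pandas implementation.

def values_for_factorize(data):
    """Return (values, na_value): hashable values plus the sentinel marking missing."""
    na_value = -1  # assumption: -1 never occurs as a real value in this toy
    return [na_value if x is None else x for x in data], na_value


def values_for_argsort(data):
    """Return one sequence of orderable values; missing sorts last in this toy."""
    return [float("inf") if x is None else x for x in data]


def factorize(values, na_value):
    """Hashtable-based factorize: every value is used as a dict key, so it must be hashable."""
    table = {}  # value -> code, in order of first appearance
    codes = []
    for v in values:
        if v == na_value:
            codes.append(-1)
        else:
            codes.append(table.setdefault(v, len(table)))
    return codes, list(table)


def argsort(values):
    """Comparison-based argsort: values only need to support '<'."""
    return sorted(range(len(values)), key=values.__getitem__)


data = [30, None, 10, 30]
codes, uniques = factorize(*values_for_factorize(data))  # [0, -1, 1, 0], [30, 10]
order = argsort(values_for_argsort(data))                # [2, 0, 3, 1]
```

The factorize path needs a sentinel it can recognise and values it can hash, while the argsort path only ever compares values, which is why "sortable" is the weaker requirement.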

  • In addition, we actually also have _values_for_rank for Categorical, which we probably should try to get rid of as well -> BUG: Categorical.values_for_(factorize|argsort) dont preserve order #33245

  • I have argued that in general, we should also look at a "masked" version of eg _values_for_factorize: having the option to return a (values, mask) tuple in addition to (values, na_sentinel) in case this is easier/cheaper to provide (which is especially the case for the masked arrays; this will need support for masks in the factorize algos though -> eg ENH/PERF: use mask in factorize for nullable dtypes #33064).
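
As a rough illustration of why the (values, mask) form can be cheaper for masked arrays, the sketch below compares the two routes in pure Python. The function names are hypothetical and the dict-based factorize is only a stand-in for the real hashtable code.

```python
def factorize_with_sentinel(values, na_sentinel):
    """Factorize where missing entries are marked by a special value."""
    table, codes = {}, []
    for v in values:
        codes.append(-1 if v == na_sentinel else table.setdefault(v, len(table)))
    return codes, list(table)


def factorize_with_mask(values, mask):
    """Factorize where missing entries are marked by a separate boolean mask."""
    table, codes = {}, []
    for v, is_missing in zip(values, mask):
        codes.append(-1 if is_missing else table.setdefault(v, len(table)))
    return codes, list(table)


def fill_with_sentinel(values, mask, sentinel):
    """Extra pass (and copy) that the sentinel route needs for a masked array."""
    return [sentinel if m else v for v, m in zip(values, mask)]


# A masked array already stores (values, mask); the value under a masked slot is arbitrary.
values = [7, 0, 7, 5]
mask = [False, True, False, False]

masked_codes, masked_uniques = factorize_with_mask(values, mask)

# The sentinel route first has to find a value that is guaranteed to be unused.
sentinel = min(values) - 1
filled = fill_with_sentinel(values, mask, sentinel)
sentinel_codes, sentinel_uniques = factorize_with_sentinel(filled, sentinel)
```

Both routes produce the same codes and uniques here; the mask route simply skips the search for a free sentinel and the filled copy.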

  • We also had a vaguely defined _ndarray_values (API / internals: exact semantics of _ndarray_values #23565), that was recently removed (CLN: remove _ndarray_values #32768). It was eg used in indexing code (index engine, joining indexes), where it was replaced with _values_for_argsort (CLN: use _values_for_argsort for join_non_unique, join_monotonic #32467, REF: implement _get_engine_target #32611).

  • What else can they be used for internally? (_values_for_factorize / _values_for_argsort)
    As mentioned above, _values_for_argsort has recently started to be used for ExtensionIndex joining and engine values. Further, _values_for_factorize is used in the general merging code.

    However, the initial purpose of _values_for_factorize / _values_for_argsort was not to be used internally in pandas, but only as a helper for EA.factorize() and EA.argsort(). So following our current EA interface spec, we should not use them internally (which means we should fix the few cases where we started using them).
    The spec about factorize is clear that there are two ways to override its behaviour: implement _values_for_factorize/_from_factorized, or implement factorize itself:

    # Implementer note: There are two ways to override the behavior of
    # pandas.factorize
    # 1. _values_for_factorize and _from_factorize.
    # Specify the values passed to pandas' internal factorization
    # routines, and how to convert from those values back to the
    # original ExtensionArray.
    # 2. ExtensionArray.factorize.
    # Complete control over factorization.

    So external EAs are not guaranteed to have an efficient implementation of _values_for_factorize/_values_for_argsort (they may still have the default astype(object) implementation).
    Fletcher is an example of an external EA library that implements factorize and not _values_for_factorize.

    So ideally, for anything factorize/argsort-related, we should actually always call the EA.factorize() or EA.argsort() methods.
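
The two override routes from the implementer note can be sketched with toy classes. ToyArray, DirectArray and group_codes are hypothetical names; the point is only that a caller using the public factorize() works with either route, while a caller reaching for _values_for_factorize would silently get the generic default on an array that took route 2.

```python
class ToyArray:
    """Route 1: factorize() is built on top of the _values_for_factorize helper."""

    def __init__(self, data):
        self.data = list(data)  # None marks a missing entry

    def _values_for_factorize(self):
        # Default helper, standing in for the generic astype(object) fallback.
        return list(self.data), None

    def factorize(self):
        values, na_value = self._values_for_factorize()
        table, codes = {}, []
        for v in values:
            codes.append(-1 if v is na_value else table.setdefault(v, len(table)))
        return codes, list(table)


class DirectArray(ToyArray):
    """Route 2: override factorize() itself and leave the helper at its default."""

    def factorize(self):
        uniques = list(dict.fromkeys(x for x in self.data if x is not None))
        position = {u: i for i, u in enumerate(uniques)}
        return [position.get(x, -1) for x in self.data], uniques


def group_codes(arr):
    # An internal caller that sticks to the public method supports both routes.
    codes, _ = arr.factorize()
    return codes


data = ["b", None, "a", "b"]
helper_codes = group_codes(ToyArray(data))
direct_codes = group_codes(DirectArray(data))
```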

  • In API: should values_for_factorize and _from_factorized round-trip missing values? #32673, @jbrockmendel questioned whether the _values_for_factorize and _from_factorized combo should faithfully roundtrip. Currently, they do, but not necessarily when missing values are included.
    However, when only considering them as "internal" to the EA.factorize() implementation, this question doesn't actually matter. But it does matter when we want to use those values more generally.
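
One way such a roundtrip can break is sketched below with a toy nullable integer array. The class, its -1 sentinel and the shape of its methods are assumptions for illustration and do not mirror any particular pandas array.

```python
class ToyNullableInts:
    NA_VALUE = -1  # assumption: sentinel chosen by this toy array

    def __init__(self, data):
        self.data = list(data)  # None marks a missing entry

    def _values_for_factorize(self):
        values = [self.NA_VALUE if x is None else x for x in self.data]
        return values, self.NA_VALUE

    @classmethod
    def _from_factorized(cls, values):
        # Inside factorize() this only ever receives the uniques, from which
        # missing values were already dropped, so it never decodes the sentinel.
        return cls(values)


# Without missing values the pair roundtrips.
full = ToyNullableInts([4, 2])
full_values, _ = full._values_for_factorize()
full_restored = ToyNullableInts._from_factorized(full_values)

# With a missing value, the sentinel comes back as an ordinary value.
holey = ToyNullableInts([4, None])
holey_values, na_value = holey._values_for_factorize()
holey_restored = ToyNullableInts._from_factorized(holey_values)
```

Here full_restored.data equals the original data, while holey_restored.data contains the sentinel where the original had a missing entry. That is harmless inside factorize(), but it matters as soon as the values are used for anything else.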

  • I mentioned above that ideally we should use factorize() or argsort() directly as much as possible and avoid _values_for_factorize/argsort (since this is the official EA interface). However, there are still cases where such direct usage is not sufficient, and where we actually need some "values".

    For example, in the merging/joining code, you can't "just" factorize() the left and right array, because the resulting integer codes of left and right cannot necessarily be matched against each other (what those integers mean depends on which uniques are present in each array).

    I think it is clear we have some use cases for "ndarray values", so we should think about which use cases need them and what requirements we have for those values.
    @jbrockmendel started to list some requirements here: EA: revisit interface #32586 (comment)
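
The merging problem described above can be reproduced with a toy dict-based factorize. This is a sketch, not the pandas merge code; factorize_shared is a hypothetical helper showing why the merge code wants comparable "values" from both sides rather than two independent factorize() results.

```python
def factorize(values):
    """Toy factorize: codes are positions into this array's own uniques."""
    table = {}
    codes = [table.setdefault(v, len(table)) for v in values]
    return codes, list(table)


def factorize_shared(left, right):
    """Factorize both sides against one shared table so the codes are comparable."""
    table = {}
    left_codes = [table.setdefault(v, len(table)) for v in left]
    right_codes = [table.setdefault(v, len(table)) for v in right]
    return left_codes, right_codes, list(table)


left = ["b", "a", "c"]
right = ["c", "b"]

# Independent factorization: code 0 means "b" on the left but "c" on the right.
left_codes, left_uniques = factorize(left)
right_codes, right_uniques = factorize(right)

# Shared factorization: equal codes now mean equal values on both sides.
shared_left, shared_right, shared_uniques = factorize_shared(left, right)
```

Building the shared table requires access to the underlying hashable values of both arrays, which is exactly what a plain EA.factorize() call does not expose.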


Having read again through all the recent issues and having written up the above, my current take-away points are:

  • To start, we should maybe put the questions around _values_for_factorize / _values_for_argsort aside for a moment. In principle they are internal to EA.factorize() / EA.argsort(), and so we could also remove those methods (if we wanted that) if we "just" require EA authors to implement factorize/argsort directly instead of through the _values_for_... helpers.
    And if we figure out the general "ndarray values" requirements (see below), we can still come back to this to see if we can actually replace both _values_for_factorize / _values_for_argsort with this single "ndarray values" interface.

  • I now think that replacing _ndarray_values with _values_for_argsort in order to remove _ndarray_values didn't actually solve much. We replaced one vaguely specified property (_ndarray_values) with another (_values_for_argsort used for purposes other than just sorting, as there are currently no guarantees / requirements specified for _values_for_argsort outside of sorting).

  • Instead, I would focus on figuring out what the requirements are for the "hashable / value preserving ndarray values":

    1. What are the exact use cases we need this for?
    2. What are the exact semantics needed for those use cases? (hashable, orderable, deterministic across arrays, ..)
    3. Do those use cases need roundtripping of those values?
    4. How would we implement those values for the internal EAs?

    It might be that this ends up being something close to what _values_for_argsort or _values_for_factorize now are. But I also think that my comment above about the possibility to include a mask in this interface is important for the nullable dtypes.

  • An alternative that we didn't really mention yet is adding more to the EA interface instead of requiring this "ndarray values" concept. For example, if we want external EAs to have control over joining, we could have an EA.__join__(other_EA, how) -> Tuple[ndarray[int], ndarray[int]] that returns indices into the left and right EAs that determine how to join them.
    For joining that might be a relatively straightforward interface; for the indexing engine it looks more complex though (so let's first define the use cases).
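
To give a feel for what such a join hook could look like, here is a pure-Python sketch. __join__ does not exist in pandas; the class name, the plain-list return type (instead of integer ndarrays) and the restriction to how="inner" are all assumptions of this sketch.

```python
from typing import List, Tuple


class ToyJoinArray:
    """Toy array exposing a hypothetical __join__ hook."""

    def __init__(self, data):
        self.data = list(data)

    def __join__(self, other, how="inner") -> Tuple[List[int], List[int]]:
        if how != "inner":
            raise NotImplementedError("this sketch only covers how='inner'")
        # Map each value on the right to all positions where it occurs.
        positions = {}
        for j, v in enumerate(other.data):
            positions.setdefault(v, []).append(j)
        # Emit one (left, right) index pair per matching combination.
        left_idx, right_idx = [], []
        for i, v in enumerate(self.data):
            for j in positions.get(v, []):
                left_idx.append(i)
                right_idx.append(j)
        return left_idx, right_idx


left = ToyJoinArray(["a", "b", "b"])
right = ToyJoinArray(["b", "c", "a", "b"])
left_idx, right_idx = left.__join__(right, how="inner")
```

With such a hook, pandas would only need the two indexers to take rows from each side and would never have to see the array's underlying values.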

Labels: API Design; ExtensionArray (Extending pandas with custom dtypes or arrays); Needs Discussion (Requires discussion from core team before further action)