NEWS.0

    CHANGES IN bit64 VERSION 4.5.2

BUG FIXES

    o "[.integer64"(x,i) can now cope with i longer than x


     CHANGES IN bit64 VERSION 4.5.1


USER VISIBLE CHANGES

    o generics 'is.integer64', 'as.integer64', 'as.bitstring'
      are no longer registered as S4 methods of 'is' and 'as'


    CHANGES IN bit64 VERSION 4.5.0

NEW FEATURES

    o new method as.list.integer64
    o setting options(integer64_semantics="new")
      gives the better semantics suggested by Ofek Shilon.
      Downstream package authors: please test and adjust to the new semantics,
      we plan to make that the default and deprecate integer64_semantics="new".


USER VISIBLE CHANGES

    o min.integer64 and max.integer64 emit better warnings
      when extreme values are returned (suggested by Pepijn de Vries)


BUG FIXES

    o seq.integer64 now properly handles sequences of length 1
      (found by Christopher Swingley)


    CHANGES IN bit64 VERSION 4.0.5

BUG FIXES

    o PKG_LIBS=-lm added to Makevars
      (fixes https://bugzilla.redhat.com/show_bug.cgi?id=1763127
      thanks to Elliott Sales de Andrade)


    CHANGES IN bit64 VERSION 4.0.4

BUG FIXES

    o runif64() no longer long long overflows
      for the maximum integer64 range
    o UBSAN false alarms removed with
      __attribute__((no_sanitize("signed-integer-overflow")))
    o added temporary flags to Makefile
      for UBSAN checks


    CHANGES IN bit64 VERSION 4.0.3

BUG FIXES

    o added Makefile with temporary -flto
      and removed LTO error regarding runif_integer64


    CHANGES IN bit64 VERSION 4.0.2

BUG FIXES

    o now DESCRIPTION URL points to github


    CHANGES IN bit64 VERSION 4.0.1

BUG FIXES

    o removed pragma because no longer needed with recent compilers
    o removed a clang warning


    CHANGES IN bit64 VERSION 4.0.0


NEW FEATURES

    o new method all.equal.integer64
      (contributed by Leonardo Silvestri)


USER VISIBLE CHANGES

    o license has been extendend from GPL-2 to GPL-2 | GPL-3
    o still.identical is now exported from package bit


BUG FIXES

    o removed unused SEXP ret_ from r_ram_integer64_sortnut and
      r_ram_integer64_ordernut (LTO problems reported by Brian Ripley)
    o min, max and range now give correct results for multiple arguments
      (reported by Colin Umanski)
    o r_ram_integer64_ordertab_asc and r_ram_integer64_sortordertab_asc
      now properly PROTECT their shortened return vector before R_Busy(0)
      (Thanks to Tomas Kalibera)
    o operations on zero length integer64 now return
      zero length integer64 instead of throwing an error
      (reported by Xianying Tan)
    o match.integer64 (and %in%) now coerce the second argument to integer64
      instead of throwing an error (reported by Xianying Tan)
    o zero-length integer64() no longer prints as `character(0)`
      (reported by Xianying Tan)


    CHANGES IN bit64 VERSION 0.9-8


NEW FEATURES

    o New function runif64 which can sample from finite
      and infinite populations (wish of Dan Reznik)

    o New methods as.integer64.bitstring
      and print.bitstring (wish of Dan Reznik)


USER VISIBLE CHANGES

    o [.integer64 now returns NA where the subscripts require this
      (contributed by Leonardo Silvestri)
    o binary operators now handle attributes more like R
      (new binattr() code contributed by Leonardo Silvestri)
    o as.bitstring.integer64 now returns its string vector
      with class 'bitstring'
    o round.integer64 with a negative digits argument now rounds
      like round(integer) would do (wish of Ian Lyttle)
    o range.integer64 now has an argument finite=FALSE for compatibility
      with range.default (wish of Sergio Oller)


BUG FIXES

    o calculating hashbits in hashfun, hashmap, hashmaptab and hashmapuni
      now gives 0 instead of stopping (bug reported by Jakob Schelbert)


    CHANGES IN bit64 VERSION 0.9-7

BUG FIXES

    o All .Call routines are now registered


    CHANGES IN bit64 VERSION 0.9-6

NEW FEATURES

    o New method str.integer64 shows the integer64
      and no longer the underlying double
      (wish of Matt Dowle)
    o New integer64 methods is.nan, is.finite, is.infinite
      (wish of Florent Angly)


USER VISIBLE CHANGES

    o as.integer64.double and as.double.integer64
      now have an argument keep.names=FALSE
      (wish of Dirk Edelbüttel and Leonardo Silvestri)


BUG FIXES

    o We now protect our SEXP return-vector before calling R_Busy
      (reported by Thomas Kalibera)
    o median.integer64 now gets a ... argument if the generic has it
      (wish of Kurt Hornik)
    o we migrated all files to UTF-8


    CHANGES IN bit64 VERSION 0.9-5

USER VISIBLE CHANGES

    o The following functions are converted to S3 generics and mask package:base
      :, is.double, match, %in%, rank, order
    o NA_integer64_ is now available and exported


BUG FIXES

    o ramsort.integer64 no longer complains about misssing return
      value when stable || optimize == "time" (reported by Dan Southern)
    o removed a harmless warning on request of CRAN maintainers
      gcc had complained about using %lli format which is not
      supported under the windows MCPP compiler, under which
      %lli and thus as.character.integer64 will fail.
    o now uses R's RNG instead of the system RNG


    CHANGES IN bit64 VERSION 0.9-4

BUG FIXES

    o The packages now uses clone(x) instead of x[]
    o log(x) tests no longer fail under valgrind
      (Thanks to Heisenberg it only failed under Valgrind)
    o UBSAN should no longer complain about left shift


    CHANGES IN bit64 VERSION 0.9-3

USER VISIBLE CHANGES

    o The following functions are converted to S3 generics and mask package:base
      :, is.double, match, %in%, rank, order
    o table.integer64 now automatically converts non-integer64 arguments to integer64
      rather than stopping on error (but gives a warning for each column)
    o table.integer64 called with return="table" returns empty cells now with
      0 rather than NA
    o %in%.integer64 no longer has arguments 'nunique' and 'method' in order
      to match the generic with only two arguments 'x', 'table' and ...


BUG FIXES

    o c(x,x,x) failed with integer64 type because R no longer copies the
      arguments in list(...) as from R-3.0.2 . Presumably now the ugly
      workaround in table.integer64 is no longer needed but that has NOT
      been fixed yet
    o round.integer64 no longer removes the "integer64" class attribute
      (reported by Dan Southern)


    CHANGES IN bit64 VERSION 0.9-2

BUG FIXES

    o match.integer64 (and %in%.integer64) now call correctly with
      method="hashpos" and method="hashrev"
    o removed platform specific timing code that was not needed
      and prevented compiling under MacOS


    CHANGES IN bit64 VERSION 0.9-1


NEW FEATURES

    o new methods for 'match', '%in%', 'duplicated', 'unique', 'table'
      , 'sort', 'order', 'rank', 'quantile', 'median' and 'summary'
    o new generics and methods for data management:
        'unipos' (positions of the unique values)
      , 'tiepos' (positions of ties)
      , 'keypos' (positions of values in a sorted unique table)
      and derived methods 'as.factor' and 'as.ordered'
    o new generic caching scheme, see ?cache and ?hashcache
    o new low level functions for fast sorting, ordering and hashing,
    see ?sortnut and ?hashmap


USER VISIBLE CHANGES

    o the package is back on CRAN. Method 'as.vector.integer64' has been removed
      at request of the CRAN maintainer. The starting point for this request was:
      'matrix(integer64())' does not work. The result of removing
      'as.vector.integer64' is a deterioration: 'array(integer64())' does not work
      anymore. You can restore 'as.vector.integer64' if you prefer.
    o package 'bit64' now shares generics for low-level sorting with package 'ff'
      and depends on package 'bit' for those generics


    ANNOUNCEMENT for bit64 VERSION 0.9.0

Dear R community,

The new version of package 'bit64' - which extends R with fast 64-bit integers - now has fast (single-threaded) implementations of the most important univariate algorithmic operations (those based on hashing and sorting). Package 'bit64' now has methods for 'match', '%in%', 'duplicated', 'unique', 'table', 'sort', 'order', 'rank', 'quantile', 'median' and 'summary'. Regarding data management it has novel generics 'unipos' (positions of the unique values), 'tiepos' (positions of ties), 'keypos' (positions of values in a sorted unique table) and derived methods 'as.factor' and 'as.ordered'. This 64-bit functionality is implemented carefully to be not slower than the respective 32-bit operations in Base R and also to avoid excessive execution times observed with 'order', 'rank' and 'table' (speedup factors 20/16/200 respective). This increases the dataset size with wich we can work truly interactive. The speed is achieved by simple heuristic optimizers: the mentioned high-level functions choose the best from multiple low-level algorithms and further take advantage of a novel optional caching method. In an example R session using a couple of these operations the 64-bit integers performed 22x faster than base 32-bit integers, hash-caching improved this to 24x amortized, sortorder-caching was most efficient with 38x (caching both, hashing and sorting is not worth it with 32x at duplicated RAM consumption).

Since the package covers the most important functions for (univariate) data exploration and data management, I think it is now appropriate to claim that R has sound 64-bit integer support, for example for working with keys or counts imported from large databases. For details concerning approach, implementation and roadmap please check the remainder of the announcement below and the package help files.

Kind regards


Jens Oehlschlägel
Munich, 22.10.2012

---- details ----

I have used package 'bit64' as a testbed to explore a couple of approaches for implementing R's univariate algorithmic functionality efficiently. I have focused on single-threaded efficiency for two reasons: 1) Amdahl's law dictates that the more we parallelize, the more we depend on serial efficiency. 2) When working with truly big data it is not only absolute speed but also energy consumption that we care about. 

Under the hood package 'bit64' has multiple implementations of the same functionality, and high-level functions contain (yet simple heuristic) optimizers that choose among the available low-level functions. For example 'match' can choose between eight functions based on hashing or sorting/ordering. 

Function 'match' (and '%in%') has been accelerated by complementing lookup of 'x' in hashed 'table' by reverse lookup of 'table' in hashed 'x'. If 'x' is small and 'table' is big, reverse lookup avoids the cost of building a huge hashmap. As suggested in Simon Urbanek's package 'fastmatch', if 'match' is called multiple times with the same 'table', performance can be improved by re-using the hashmap implicitely built by 'match'. Beyond that, I have realized a couple of improvements:

1) Building the hashmap has now been singled out in a separate function 'hashmap' that explicitely returns an environment of class c("cache_integer64", "cache", "environment") containing the hashmap and some auxilliary data. 

2) Instead of implicitely caching the hashmap as a side-effect when calling 'fastmatch', there are explicit functions for caching, for example 'hashcache' for attaching a cache with a hashmap, and 'remcache' for removing any cached data. 

3) If the 'hashcache' function after hashing discovers that the number of unique values is much smaller than the total number of values, it will hash again using a much smaller hashmap: this typically saves a lot of RAM and accelerates usage of the hashmap because it reduces random access.

4) The cache layer has a mechanism for detecting outdated caches. This is even more important in the case of a cached hashmap, since R's typical hashmap only contains index pointers to the data, not the data itself (unlike in standard hashtables). As a result, an outdated cache might lead to a crash, if the data has changed since creation of the cached hashmap. The detection mechanism comes for free, since R does Copy-on-write and each change of a vector leads to memory reallocation: on each cache access we check for a modified vector address and remove the cache with a warning in case of a detected change. However, this method is of-course not failsafe in case of multiple changes. Therefore, until cache checking and removal is done in Base R, users using caching should carefully remove caches before modifying data. Users must also carefully remove caches before using functions that do in-place modifications such as 'ramsort', 'ramorder' and 'ramsortorder'. Users should also note that R's COPY-ON-MODIFY mechanism does more copying than one would expect: just reading from variable length arguments with the recommended 'list(...)' construct always copies the arguments and invalidates caches. For a workaround see the implementation of 'table.integer64'.

5) Beyond 'match', the package leverages speed gains of hashing or cached hashing for a couple of other high-level functions: '%in%', 'duplicated', 'unique', 'unipos' and 'table'. However, it turned out that only 'match', '%in%' and 'duplicated' benefit from a cached hashmap. For 'unique', 'unipos' and 'table' the cost of traversing an existing hashmap is as high as creating the hashmap from scratch. That leads to the undesireable effect that we need two implementations for each of these methods: one that simultaneously builds and uses the hashmap, and another that uses an existing hashmap (using the simultaneous method while a hashmap has been cached would duplicate RAM consumption).

6) Beyond leveraging hashing, all these high-level functions also have two low-level implementations that take advantage of (cached) ordering and (cached) sortordering instead (see order below). 

6) Additional functions are implemented that benefit only from (cached) ordering and (cached) sortordering: 'sort', 'order', 'tiepos', 'keypos', 'rank', 'quantile' and dependants thereof ('median','summary','as.factor','as.ordered','table'). 

Method 'sort' is a cache-aware wrapper around 'ramsort', which depending on context chooses from multiple sorting algorithms (or from the cache): 'shellsort' (R's traditional inplace sorting algorithm), 'quicksort' (faster inplace), 'mergesort' (fast and stable), 'radixsort' (stable with linear scalability, for large datasets). The quicksort algorithm implemented here is in this context faster than the famous one of Bentley and McIllroy. It uses median of three random pivots and is like introsort protected against O(n^2) runtime (if a recursion limit is reached, it for now falls back to shellsort instead of heapsort).

Function 'order.integer64' with option 'optimize = "memory"' calls 'ramorder' which chooses from a similar set of low-level algorithms. 'ramorder' - like in package 'ff' - is faster than ordering in Base R, but like 'order' in Base R still does the job by sorting index pointers to the data which creates heavy random access to the data. The novel 'ramsortorder' method realizes ordering close to the speed of sorting, by sorting index and data simultaneously and thereby avoiding heavy random access. Therefore the option 'optimize = "time"' is the default in 'order.integer64' and calls 'ramsortorder'.

Function 'rank.integer64' implements only 'ties.method = "average"' and 'na.last="keep"' (the only sensible default, see e.g. 'cor'). Function 'prank.integer64'  projects the values [min..max] via ranks [1..n] to [0..1]. 'qtile.integer64' is the inverse function of 'prank.integer64' and projects [0..1] to [min..max]. 'quantile.integer64' with 'type=0' and 'median.integer64' are convenience wrappers to 'qtile'. 'qtile' behaves very similar to 'quantile.default' with 'type=1' in that it only returns existing values, it is mostly symetric but it is using 'round' rather than 'floor'. Note that this implies that 'median.integer64' does not interpolate for even number of values (interpolation would create values that could not be represented as 64-bit integers). 

Function 'table.integer64' leverages hashing or sorting for counting frequencies of all unique values. This is by factor 3 slower than 'tabulate', but when called with 'return="list"' is by order of magnitude faster than 'table' (because 'table' wastes a lot of performance in large scale raw data manipulation before calling tabulate and in attaching the unique values as 'names' which loads heavy on the global string cache). When dealing with combinations of input vectors, 'table.integer64' can handle up to 2^63 hypothetical combinations and can return the existing combinations in a sparse format, whereas standard 'table' theoretically bails out at 2^31 (practically earlier due to RAM limitations) and insists on returning a full blown dense array. 

I compared the speed gains of hashing+usage versus sortordering+usage over a couple of univariate algorithmic operations: hashing and sortordering are competitive, with hashing rather winning for smaller and sortordering rather winning for larger vectors (due to better cache-obliviousness of sorting). The third option - ordering - is much slower, though competitive with Base R, and 50% RAM saving makes this an interesting option, especially when working with datasets close to the RAM limits. Though operations based on hashing can be faster than those on sortordering it is worth to note that if sortorder is cached, in most cases going with the sortorder-operation is faster than building the hashmap and using it. Thus sortordering seems a better RAM investement than hashing. It has the following advantages:
- sortordering supports more functionality than hashing
- sortordering gives better modularity (different from hashing, we can well separate *creating* and *using* the sortordering, because sorting permanently improves cache-locality)
- without computational costs of keeping the original order ('keep.order=TRUE' in 'unique' and 'table'), sortorder gives sorted results while hashing gives random result order. If there are many unique values, fixing random order by sorting afterwards kills any performance benefit of hashing, compare for example the sequence {y <- unique(x); ind <- sort.list(y)} in 'factor'.
- sorting better generalizes to very large data on disk compared to hashing
- it is easier to lockfree parallelize sorting compared to hashing
- creating the ordering quickly via sortordering and then caching only ordering (without the sorted data) is an interesting option to save RAM without too much speed loss
- with ordering instead of sortordering there is an option to work with large borderline-sized datasets in-RAM 

These advantages of sorting over hashing are good news for my novel energy-efficient greeNsort® algorithms.

The long term roadmap for packages 'bit64' and 'ff' is
- demonstrate power of greeNsort® by accelerating integer64 sorting by yet another factor 2
- parallelization of important functions in bit64
- unifying the sort capabilities in ff with those in bit64 (logical, factor, integer, integer64, double)
- generalizing the fast data management to all numeric data types (integer, integer64, double)
- removing the 2^31-1 address limit in ff (rather using integer64 than double)
- providing ff with proper disk sorting (reducing n*log(n) passes to 2 passes over the memory-mapped disk)

© 2010-2012 Jens Oehlschlägel

    CHANGES IN bit64 VERSION 0.8-3


FIXES

    o removed chonst char warning (thanks to Murray Stokely)
    o reduced R dependency down to version 2.12.1 (wish of Murray Stokely)

    ANNOUNCEMENT for bit64 VERSION 0.8.0

Dear R-Core team,
Dear Rcpp team and other package teams,
Dear R users,

The new package 'bit64' is available on CRAN for beta-testing and code-reviewing.

Package 'bit64' provides fast serializable S3 atomic 64bit (signed) integers that can be used in vectors, matrices, arrays and data.frames. Methods are available for coercion from and to logicals, integers, doubles, characters as well as many elementwise and summary functions. 

Package 'bit64' has the following advantages over package 'int64' (which was sponsored by Google):
- true atomic vectors usable with length, dim, names etc.
- only S3, not S4 class system used to dispatch methods
- less RAM consumption by factor 7 (under 64 bit OS)
- faster operations by factor 4 to 2000 (under 64 bit OS)
- no slow-down of R's garbage collection (as caused by the pure existence of 'int64' objects)
- pure GPL, no copyrights from transnational commercial company

While the advantage of the atomic S3 design over the complicated S4 object design is obvious, it is less obvious that an external package is the best way to enrich R with 64bit integers. An external package will not give us literals such as 1LL or directly allow us to address larger vectors than possible with base R. But it allows us to properly address larger vectors in other packages such as 'ff' or 'bigmemory' and it allows us to properly work with large surrogate keys from external databases. An external package realizing one data type also makes a perfect test bed to play with innovative performance enhancements. Performance tuned sorting and hashing are planned for the next release, which will give us fast versions of sort, order, merge, duplicated, unique, and table - for 64bit integers.

For those who still hope that R's 'integer' will be 64bit some day, here is my key learning: migrating R's 'integer' from 32 to 64 bit would be RAM expensive. It would most likely require to also migrate R's 'double' from 64 to 128 bit - in order to again have a data type to which we can lossless coerce. The assumption that 'integer' is a proper subset of 'double' is scattered over R's semantics. We all expect that binary and n-ary functions such as '+' and 'c' do return 'double' and do not destroy information. With solely extending 64bit integers but not 128bit doubles, we have semantic changes potentially disappointing such expectations: integer64+double returns integer64 and does kill decimals. I did my best to make operations involving integer64 consistent and numerically stable - please consult the documentation at ?bit64 for details.

Since this package is 'at risk' to create a lot of dependencies from other packages, I'd appreciate serious  beta-testing and also code-review from the R-Core team. Please check the 'Limitations' sections at the help page and the numerics involving "long double" in C. If the conclusion is that this should be better done in Base R - I happly donate the code and drop this package. If we have to go with an external package for 64bit integers, it would be great if this work could convince the Rcpp team including Romain about the advantages of this approach. Shouldn't we join forces here?

Best regards

Jens Oehlschlägel
Munich, 11.2.2012