-
Notifications
You must be signed in to change notification settings - Fork 8
/
Copy pathNEWS.0
393 lines (230 loc) · 22.6 KB
/
NEWS.0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
CHANGES IN bit64 VERSION 4.5.2
BUG FIXES
o "[.integer64"(x,i) can now cope with i longer than x
CHANGES IN bit64 VERSION 4.5.1
USER VISIBLE CHANGES
o generics 'is.integer64', 'as.integer64', 'as.bitstring'
are no longer registered as S4 methods of 'is' and 'as'
CHANGES IN bit64 VERSION 4.5.0
NEW FEATURES
o new method as.list.integer64
o setting options(integer64_semantics="new")
gives the better semantics suggested by Ofek Shilon.
Downstream package authors: please test and adjust to the new semantics,
we plan to make that the default and deprecate integer64_semantics="new".
USER VISIBLE CHANGES
o min.integer64 and max.integer64 emit better warnings
when extreme values are returned (suggested by Pepijn de Vries)
BUG FIXES
o seq.integer64 now properly handles sequences of length 1
(found by Christopher Swingley)
CHANGES IN bit64 VERSION 4.0.5
BUG FIXES
o PKG_LIBS=-lm added to Makevars
(fixes https://bugzilla.redhat.com/show_bug.cgi?id=1763127
thanks to Elliott Sales de Andrade)
CHANGES IN bit64 VERSION 4.0.4
BUG FIXES
o runif64() no longer long long overflows
for the maximum integer64 range
o UBSAN false alarms removed with
__attribute__((no_sanitize("signed-integer-overflow")))
o added temporary flags to Makefile
for UBSAN checks
CHANGES IN bit64 VERSION 4.0.3
BUG FIXES
o added Makefile with temporary -flto
and removed LTO error regarding runif_integer64
CHANGES IN bit64 VERSION 4.0.2
BUG FIXES
o now DESCRIPTION URL points to github
CHANGES IN bit64 VERSION 4.0.1
BUG FIXES
o removed pragma because no longer needed with recent compilers
o removed a clang warning
CHANGES IN bit64 VERSION 4.0.0
NEW FEATURES
o new method all.equal.integer64
(contributed by Leonardo Silvestri)
USER VISIBLE CHANGES
o license has been extendend from GPL-2 to GPL-2 | GPL-3
o still.identical is now exported from package bit
BUG FIXES
o removed unused SEXP ret_ from r_ram_integer64_sortnut and
r_ram_integer64_ordernut (LTO problems reported by Brian Ripley)
o min, max and range now give correct results for multiple arguments
(reported by Colin Umanski)
o r_ram_integer64_ordertab_asc and r_ram_integer64_sortordertab_asc
now properly PROTECT their shortened return vector before R_Busy(0)
(Thanks to Tomas Kalibera)
o operations on zero length integer64 now return
zero length integer64 instead of throwing an error
(reported by Xianying Tan)
o match.integer64 (and %in%) now coerce the second argument to integer64
instead of throwing an error (reported by Xianying Tan)
o zero-length integer64() no longer prints as `character(0)`
(reported by Xianying Tan)
CHANGES IN bit64 VERSION 0.9-8
NEW FEATURES
o New function runif64 which can sample from finite
and infinite populations (wish of Dan Reznik)
o New methods as.integer64.bitstring
and print.bitstring (wish of Dan Reznik)
USER VISIBLE CHANGES
o [.integer64 now returns NA where the subscripts require this
(contributed by Leonardo Silvestri)
o binary operators now handle attributes more like R
(new binattr() code contributed by Leonardo Silvestri)
o as.bitstring.integer64 now returns its string vector
with class 'bitstring'
o round.integer64 with a negative digits argument now rounds
like round(integer) would do (wish of Ian Lyttle)
o range.integer64 now has an argument finite=FALSE for compatibility
with range.default (wish of Sergio Oller)
BUG FIXES
o calculating hashbits in hashfun, hashmap, hashmaptab and hashmapuni
now gives 0 instead of stopping (bug reported by Jakob Schelbert)
CHANGES IN bit64 VERSION 0.9-7
BUG FIXES
o All .Call routines are now registered
CHANGES IN bit64 VERSION 0.9-6
NEW FEATURES
o New method str.integer64 shows the integer64
and no longer the underlying double
(wish of Matt Dowle)
o New integer64 methods is.nan, is.finite, is.infinite
(wish of Florent Angly)
USER VISIBLE CHANGES
o as.integer64.double and as.double.integer64
now have an argument keep.names=FALSE
(wish of Dirk Edelbüttel and Leonardo Silvestri)
BUG FIXES
o We now protect our SEXP return-vector before calling R_Busy
(reported by Thomas Kalibera)
o median.integer64 now gets a ... argument if the generic has it
(wish of Kurt Hornik)
o we migrated all files to UTF-8
CHANGES IN bit64 VERSION 0.9-5
USER VISIBLE CHANGES
o The following functions are converted to S3 generics and mask package:base
:, is.double, match, %in%, rank, order
o NA_integer64_ is now available and exported
BUG FIXES
o ramsort.integer64 no longer complains about misssing return
value when stable || optimize == "time" (reported by Dan Southern)
o removed a harmless warning on request of CRAN maintainers
gcc had complained about using %lli format which is not
supported under the windows MCPP compiler, under which
%lli and thus as.character.integer64 will fail.
o now uses R's RNG instead of the system RNG
CHANGES IN bit64 VERSION 0.9-4
BUG FIXES
o The packages now uses clone(x) instead of x[]
o log(x) tests no longer fail under valgrind
(Thanks to Heisenberg it only failed under Valgrind)
o UBSAN should no longer complain about left shift
CHANGES IN bit64 VERSION 0.9-3
USER VISIBLE CHANGES
o The following functions are converted to S3 generics and mask package:base
:, is.double, match, %in%, rank, order
o table.integer64 now automatically converts non-integer64 arguments to integer64
rather than stopping on error (but gives a warning for each column)
o table.integer64 called with return="table" returns empty cells now with
0 rather than NA
o %in%.integer64 no longer has arguments 'nunique' and 'method' in order
to match the generic with only two arguments 'x', 'table' and ...
BUG FIXES
o c(x,x,x) failed with integer64 type because R no longer copies the
arguments in list(...) as from R-3.0.2 . Presumably now the ugly
workaround in table.integer64 is no longer needed but that has NOT
been fixed yet
o round.integer64 no longer removes the "integer64" class attribute
(reported by Dan Southern)
CHANGES IN bit64 VERSION 0.9-2
BUG FIXES
o match.integer64 (and %in%.integer64) now call correctly with
method="hashpos" and method="hashrev"
o removed platform specific timing code that was not needed
and prevented compiling under MacOS
CHANGES IN bit64 VERSION 0.9-1
NEW FEATURES
o new methods for 'match', '%in%', 'duplicated', 'unique', 'table'
, 'sort', 'order', 'rank', 'quantile', 'median' and 'summary'
o new generics and methods for data management:
'unipos' (positions of the unique values)
, 'tiepos' (positions of ties)
, 'keypos' (positions of values in a sorted unique table)
and derived methods 'as.factor' and 'as.ordered'
o new generic caching scheme, see ?cache and ?hashcache
o new low level functions for fast sorting, ordering and hashing,
see ?sortnut and ?hashmap
USER VISIBLE CHANGES
o the package is back on CRAN. Method 'as.vector.integer64' has been removed
at request of the CRAN maintainer. The starting point for this request was:
'matrix(integer64())' does not work. The result of removing
'as.vector.integer64' is a deterioration: 'array(integer64())' does not work
anymore. You can restore 'as.vector.integer64' if you prefer.
o package 'bit64' now shares generics for low-level sorting with package 'ff'
and depends on package 'bit' for those generics
ANNOUNCEMENT for bit64 VERSION 0.9.0
Dear R community,
The new version of package 'bit64' - which extends R with fast 64-bit integers - now has fast (single-threaded) implementations of the most important univariate algorithmic operations (those based on hashing and sorting). Package 'bit64' now has methods for 'match', '%in%', 'duplicated', 'unique', 'table', 'sort', 'order', 'rank', 'quantile', 'median' and 'summary'. Regarding data management it has novel generics 'unipos' (positions of the unique values), 'tiepos' (positions of ties), 'keypos' (positions of values in a sorted unique table) and derived methods 'as.factor' and 'as.ordered'. This 64-bit functionality is implemented carefully to be not slower than the respective 32-bit operations in Base R and also to avoid excessive execution times observed with 'order', 'rank' and 'table' (speedup factors 20/16/200 respective). This increases the dataset size with wich we can work truly interactive. The speed is achieved by simple heuristic optimizers: the mentioned high-level functions choose the best from multiple low-level algorithms and further take advantage of a novel optional caching method. In an example R session using a couple of these operations the 64-bit integers performed 22x faster than base 32-bit integers, hash-caching improved this to 24x amortized, sortorder-caching was most efficient with 38x (caching both, hashing and sorting is not worth it with 32x at duplicated RAM consumption).
Since the package covers the most important functions for (univariate) data exploration and data management, I think it is now appropriate to claim that R has sound 64-bit integer support, for example for working with keys or counts imported from large databases. For details concerning approach, implementation and roadmap please check the remainder of the announcement below and the package help files.
Kind regards
Jens Oehlschlägel
Munich, 22.10.2012
---- details ----
I have used package 'bit64' as a testbed to explore a couple of approaches for implementing R's univariate algorithmic functionality efficiently. I have focused on single-threaded efficiency for two reasons: 1) Amdahl's law dictates that the more we parallelize, the more we depend on serial efficiency. 2) When working with truly big data it is not only absolute speed but also energy consumption that we care about.
Under the hood package 'bit64' has multiple implementations of the same functionality, and high-level functions contain (yet simple heuristic) optimizers that choose among the available low-level functions. For example 'match' can choose between eight functions based on hashing or sorting/ordering.
Function 'match' (and '%in%') has been accelerated by complementing lookup of 'x' in hashed 'table' by reverse lookup of 'table' in hashed 'x'. If 'x' is small and 'table' is big, reverse lookup avoids the cost of building a huge hashmap. As suggested in Simon Urbanek's package 'fastmatch', if 'match' is called multiple times with the same 'table', performance can be improved by re-using the hashmap implicitely built by 'match'. Beyond that, I have realized a couple of improvements:
1) Building the hashmap has now been singled out in a separate function 'hashmap' that explicitely returns an environment of class c("cache_integer64", "cache", "environment") containing the hashmap and some auxilliary data.
2) Instead of implicitely caching the hashmap as a side-effect when calling 'fastmatch', there are explicit functions for caching, for example 'hashcache' for attaching a cache with a hashmap, and 'remcache' for removing any cached data.
3) If the 'hashcache' function after hashing discovers that the number of unique values is much smaller than the total number of values, it will hash again using a much smaller hashmap: this typically saves a lot of RAM and accelerates usage of the hashmap because it reduces random access.
4) The cache layer has a mechanism for detecting outdated caches. This is even more important in the case of a cached hashmap, since R's typical hashmap only contains index pointers to the data, not the data itself (unlike in standard hashtables). As a result, an outdated cache might lead to a crash, if the data has changed since creation of the cached hashmap. The detection mechanism comes for free, since R does Copy-on-write and each change of a vector leads to memory reallocation: on each cache access we check for a modified vector address and remove the cache with a warning in case of a detected change. However, this method is of-course not failsafe in case of multiple changes. Therefore, until cache checking and removal is done in Base R, users using caching should carefully remove caches before modifying data. Users must also carefully remove caches before using functions that do in-place modifications such as 'ramsort', 'ramorder' and 'ramsortorder'. Users should also note that R's COPY-ON-MODIFY mechanism does more copying than one would expect: just reading from variable length arguments with the recommended 'list(...)' construct always copies the arguments and invalidates caches. For a workaround see the implementation of 'table.integer64'.
5) Beyond 'match', the package leverages speed gains of hashing or cached hashing for a couple of other high-level functions: '%in%', 'duplicated', 'unique', 'unipos' and 'table'. However, it turned out that only 'match', '%in%' and 'duplicated' benefit from a cached hashmap. For 'unique', 'unipos' and 'table' the cost of traversing an existing hashmap is as high as creating the hashmap from scratch. That leads to the undesireable effect that we need two implementations for each of these methods: one that simultaneously builds and uses the hashmap, and another that uses an existing hashmap (using the simultaneous method while a hashmap has been cached would duplicate RAM consumption).
6) Beyond leveraging hashing, all these high-level functions also have two low-level implementations that take advantage of (cached) ordering and (cached) sortordering instead (see order below).
6) Additional functions are implemented that benefit only from (cached) ordering and (cached) sortordering: 'sort', 'order', 'tiepos', 'keypos', 'rank', 'quantile' and dependants thereof ('median','summary','as.factor','as.ordered','table').
Method 'sort' is a cache-aware wrapper around 'ramsort', which depending on context chooses from multiple sorting algorithms (or from the cache): 'shellsort' (R's traditional inplace sorting algorithm), 'quicksort' (faster inplace), 'mergesort' (fast and stable), 'radixsort' (stable with linear scalability, for large datasets). The quicksort algorithm implemented here is in this context faster than the famous one of Bentley and McIllroy. It uses median of three random pivots and is like introsort protected against O(n^2) runtime (if a recursion limit is reached, it for now falls back to shellsort instead of heapsort).
Function 'order.integer64' with option 'optimize = "memory"' calls 'ramorder' which chooses from a similar set of low-level algorithms. 'ramorder' - like in package 'ff' - is faster than ordering in Base R, but like 'order' in Base R still does the job by sorting index pointers to the data which creates heavy random access to the data. The novel 'ramsortorder' method realizes ordering close to the speed of sorting, by sorting index and data simultaneously and thereby avoiding heavy random access. Therefore the option 'optimize = "time"' is the default in 'order.integer64' and calls 'ramsortorder'.
Function 'rank.integer64' implements only 'ties.method = "average"' and 'na.last="keep"' (the only sensible default, see e.g. 'cor'). Function 'prank.integer64' projects the values [min..max] via ranks [1..n] to [0..1]. 'qtile.integer64' is the inverse function of 'prank.integer64' and projects [0..1] to [min..max]. 'quantile.integer64' with 'type=0' and 'median.integer64' are convenience wrappers to 'qtile'. 'qtile' behaves very similar to 'quantile.default' with 'type=1' in that it only returns existing values, it is mostly symetric but it is using 'round' rather than 'floor'. Note that this implies that 'median.integer64' does not interpolate for even number of values (interpolation would create values that could not be represented as 64-bit integers).
Function 'table.integer64' leverages hashing or sorting for counting frequencies of all unique values. This is by factor 3 slower than 'tabulate', but when called with 'return="list"' is by order of magnitude faster than 'table' (because 'table' wastes a lot of performance in large scale raw data manipulation before calling tabulate and in attaching the unique values as 'names' which loads heavy on the global string cache). When dealing with combinations of input vectors, 'table.integer64' can handle up to 2^63 hypothetical combinations and can return the existing combinations in a sparse format, whereas standard 'table' theoretically bails out at 2^31 (practically earlier due to RAM limitations) and insists on returning a full blown dense array.
I compared the speed gains of hashing+usage versus sortordering+usage over a couple of univariate algorithmic operations: hashing and sortordering are competitive, with hashing rather winning for smaller and sortordering rather winning for larger vectors (due to better cache-obliviousness of sorting). The third option - ordering - is much slower, though competitive with Base R, and 50% RAM saving makes this an interesting option, especially when working with datasets close to the RAM limits. Though operations based on hashing can be faster than those on sortordering it is worth to note that if sortorder is cached, in most cases going with the sortorder-operation is faster than building the hashmap and using it. Thus sortordering seems a better RAM investement than hashing. It has the following advantages:
- sortordering supports more functionality than hashing
- sortordering gives better modularity (different from hashing, we can well separate *creating* and *using* the sortordering, because sorting permanently improves cache-locality)
- without computational costs of keeping the original order ('keep.order=TRUE' in 'unique' and 'table'), sortorder gives sorted results while hashing gives random result order. If there are many unique values, fixing random order by sorting afterwards kills any performance benefit of hashing, compare for example the sequence {y <- unique(x); ind <- sort.list(y)} in 'factor'.
- sorting better generalizes to very large data on disk compared to hashing
- it is easier to lockfree parallelize sorting compared to hashing
- creating the ordering quickly via sortordering and then caching only ordering (without the sorted data) is an interesting option to save RAM without too much speed loss
- with ordering instead of sortordering there is an option to work with large borderline-sized datasets in-RAM
These advantages of sorting over hashing are good news for my novel energy-efficient greeNsort® algorithms.
The long term roadmap for packages 'bit64' and 'ff' is
- demonstrate power of greeNsort® by accelerating integer64 sorting by yet another factor 2
- parallelization of important functions in bit64
- unifying the sort capabilities in ff with those in bit64 (logical, factor, integer, integer64, double)
- generalizing the fast data management to all numeric data types (integer, integer64, double)
- removing the 2^31-1 address limit in ff (rather using integer64 than double)
- providing ff with proper disk sorting (reducing n*log(n) passes to 2 passes over the memory-mapped disk)
© 2010-2012 Jens Oehlschlägel
CHANGES IN bit64 VERSION 0.8-3
FIXES
o removed chonst char warning (thanks to Murray Stokely)
o reduced R dependency down to version 2.12.1 (wish of Murray Stokely)
ANNOUNCEMENT for bit64 VERSION 0.8.0
Dear R-Core team,
Dear Rcpp team and other package teams,
Dear R users,
The new package 'bit64' is available on CRAN for beta-testing and code-reviewing.
Package 'bit64' provides fast serializable S3 atomic 64bit (signed) integers that can be used in vectors, matrices, arrays and data.frames. Methods are available for coercion from and to logicals, integers, doubles, characters as well as many elementwise and summary functions.
Package 'bit64' has the following advantages over package 'int64' (which was sponsored by Google):
- true atomic vectors usable with length, dim, names etc.
- only S3, not S4 class system used to dispatch methods
- less RAM consumption by factor 7 (under 64 bit OS)
- faster operations by factor 4 to 2000 (under 64 bit OS)
- no slow-down of R's garbage collection (as caused by the pure existence of 'int64' objects)
- pure GPL, no copyrights from transnational commercial company
While the advantage of the atomic S3 design over the complicated S4 object design is obvious, it is less obvious that an external package is the best way to enrich R with 64bit integers. An external package will not give us literals such as 1LL or directly allow us to address larger vectors than possible with base R. But it allows us to properly address larger vectors in other packages such as 'ff' or 'bigmemory' and it allows us to properly work with large surrogate keys from external databases. An external package realizing one data type also makes a perfect test bed to play with innovative performance enhancements. Performance tuned sorting and hashing are planned for the next release, which will give us fast versions of sort, order, merge, duplicated, unique, and table - for 64bit integers.
For those who still hope that R's 'integer' will be 64bit some day, here is my key learning: migrating R's 'integer' from 32 to 64 bit would be RAM expensive. It would most likely require to also migrate R's 'double' from 64 to 128 bit - in order to again have a data type to which we can lossless coerce. The assumption that 'integer' is a proper subset of 'double' is scattered over R's semantics. We all expect that binary and n-ary functions such as '+' and 'c' do return 'double' and do not destroy information. With solely extending 64bit integers but not 128bit doubles, we have semantic changes potentially disappointing such expectations: integer64+double returns integer64 and does kill decimals. I did my best to make operations involving integer64 consistent and numerically stable - please consult the documentation at ?bit64 for details.
Since this package is 'at risk' to create a lot of dependencies from other packages, I'd appreciate serious beta-testing and also code-review from the R-Core team. Please check the 'Limitations' sections at the help page and the numerics involving "long double" in C. If the conclusion is that this should be better done in Base R - I happly donate the code and drop this package. If we have to go with an external package for 64bit integers, it would be great if this work could convince the Rcpp team including Romain about the advantages of this approach. Shouldn't we join forces here?
Best regards
Jens Oehlschlägel
Munich, 11.2.2012