Number parsing fix and optimization (#27)
Remove the `GOLANG_NUMBER_PARSING` constant along with the imprecise parsing path, and fix up the actual number parsing in Go.

Previously, by default, anything that looked like a number was accepted, so many errors went uncaught.
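
To make "looked like a number" concrete, here is a minimal sketch of a strict check against the JSON number grammar from RFC 8259. It uses only the standard library. `isJSONNumber` is a hypothetical helper written for illustration; it is not the parser code from this commit.

```go
package main

import "fmt"

// isJSONNumber reports whether s matches the RFC 8259 number grammar:
// optional '-', then "0" or a non-zero digit followed by digits, then an
// optional fraction and an optional exponent.
// Hypothetical helper for illustration only.
func isJSONNumber(s string) bool {
	i := 0
	// digits consumes a run of ASCII digits and returns how many it saw.
	digits := func() int {
		start := i
		for i < len(s) && s[i] >= '0' && s[i] <= '9' {
			i++
		}
		return i - start
	}
	if i < len(s) && s[i] == '-' {
		i++
	}
	// Integer part: a single 0, or at least one digit not starting with 0.
	if i < len(s) && s[i] == '0' {
		i++
	} else if digits() == 0 {
		return false
	}
	// Optional fraction: '.' must be followed by at least one digit.
	if i < len(s) && s[i] == '.' {
		i++
		if digits() == 0 {
			return false
		}
	}
	// Optional exponent: 'e' or 'E', optional sign, at least one digit.
	if i < len(s) && (s[i] == 'e' || s[i] == 'E') {
		i++
		if i < len(s) && (s[i] == '+' || s[i] == '-') {
			i++
		}
		if digits() == 0 {
			return false
		}
	}
	// Anything left over (such as the "1" in "01") makes the input invalid.
	return i == len(s)
}

func main() {
	for _, s := range []string{"0", "-12.5e+3", "01", "1.", "+1", "-", "1e"} {
		fmt.Printf("%q valid=%v\n", s, isJSONNumber(s))
	}
}
```

Inputs such as `01`, `1.`, `+1` and `1e` all resemble numbers but are rejected by the grammar, which is the kind of error a lenient parser can miss.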

Uints are now actually used for numbers that are above the maximum int64, within the uint64 range, and contain no floating-point markers.
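
The rule above can be sketched with the standard library's `strconv` package. This is only an illustration of the classification, assuming "float point markers" means `.`, `e` or `E`; `classify` is a hypothetical name, and `strconv` is more lenient than JSON in some respects (it accepts a leading `+`, for example).

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// classify reports which Go type a numeric literal would map to under the
// rule described above. Hypothetical helper for illustration only.
func classify(s string) string {
	// A '.', 'e' or 'E' marks the value as floating point.
	if strings.ContainsAny(s, ".eE") {
		if _, err := strconv.ParseFloat(s, 64); err != nil {
			return "invalid"
		}
		return "float"
	}
	// Prefer int64 whenever the value fits.
	if _, err := strconv.ParseInt(s, 10, 64); err == nil {
		return "int"
	}
	// Above the maximum int64 but within uint64: use uint64.
	if _, err := strconv.ParseUint(s, 10, 64); err == nil {
		return "uint"
	}
	// Not representable as int64 or uint64. How the parser handles this
	// case is not stated in the commit message, so the sketch only flags it.
	return "neither"
}

func main() {
	for _, s := range []string{
		"9223372036854775807",  // maximum int64
		"9223372036854775808",  // maximum int64 + 1
		"18446744073709551615", // maximum uint64
		"18446744073709551616", // maximum uint64 + 1
		"1.0",
	} {
		fmt.Println(s, "->", classify(s))
	}
}
```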

Even with all the additional checks we are still faster:

```
λ benchcmp before.txt after.txt
benchmark                               old ns/op     new ns/op     delta
BenchmarkParseNumber/Pos/63bit-32       91.9          75.9          -17.41%
BenchmarkParseNumber/Neg/63bit-32       106           77.2          -27.17%
BenchmarkParseNumberFloat-32            190           72.5          -61.84%
BenchmarkParseNumberFloatExp-32         212           98.6          -53.49%
BenchmarkParseNumberBig-32              401           175           -56.36%
BenchmarkParseNumberRandomBits-32       420           230           -45.24%
BenchmarkParseNumberRandomFloats-32     305           172           -43.61%
```
... and full benchmarks:
```
benchmark                                             old ns/op      new ns/op      delta
BenchmarkApache_builds-32                             137091         139556         +1.80%
BenchmarkCanada-32                                    30705862       19000003       -38.12%
BenchmarkCitm_catalog-32                              1921474        2093471        +8.95%
BenchmarkGithub_events-32                             77611          77873          +0.34%
BenchmarkGsoc_2018-32                                 1220291        1215097        -0.43%
BenchmarkInstruments-32                               366747         374568         +2.13%
BenchmarkMarine_ik-32                                 27410259       18343775       -33.08%
BenchmarkMesh-32                                      8200018        5896043        -28.10%
BenchmarkMesh_pretty-32                               9793413        6947830        -29.06%
BenchmarkNumbers-32                                   1967319        1213924        -38.30%
BenchmarkRandom-32                                    1072071        1042956        -2.72%
BenchmarkTwitter-32                                   645530         645529         -0.00%
BenchmarkTwitterescaped-32                            1014456        1022548        +0.80%
```
klauspost authored Jan 18, 2021
1 parent fcc30eb commit d846a83
Showing 15 changed files with 688 additions and 587 deletions.
204 changes: 146 additions & 58 deletions README.md
@@ -20,21 +20,162 @@ Additionally `simdjson-go` has the following features:

- No 4 GB object limit
- Support for [ndjson](http://ndjson.org/) (newline delimited json)
- Proper memory management
- Pure Go (no need for cgo)

## Usage

Run the following command in order to install `simdjson-go`

```
$ go get github.com/minio/simdjson-go
```

In order to parse a JSON byte stream, you either call [`simdjson.Parse()`](https://pkg.go.dev/github.com/minio/simdjson-go?tab=doc#Parse)
or [`simdjson.ParseND()`](https://pkg.go.dev/github.com/minio/simdjson-go?tab=doc#ParseND) for newline delimited JSON files.
Both of these functions return a [`ParsedJson`](https://pkg.go.dev/github.com/minio/simdjson-go?tab=doc#ParsedJson)
struct that can be used to navigate the JSON object by calling [`Iter()`](https://pkg.go.dev/github.com/minio/simdjson-go?tab=doc#ParsedJson.Iter).

Using the type [`Iter`](https://pkg.go.dev/github.com/minio/simdjson-go?tab=doc#Iter) you can call
[`Advance()`](https://pkg.go.dev/github.com/minio/simdjson-go?tab=doc#Iter.Advance) to iterate over the tape, like so:

```Go
for {
	typ := iter.Advance()

	switch typ {
	case simdjson.TypeRoot:
		if typ, tmp, err = iter.Root(tmp); err != nil {
			return
		}

		if typ == simdjson.TypeObject {
			if obj, err = tmp.Object(obj); err != nil {
				return
			}

			e := obj.FindKey(key, &elem)
			if e != nil && elem.Type == simdjson.TypeString {
				v, _ := elem.Iter.StringBytes()
				fmt.Println(string(v))
			}
		}

	default:
		return
	}
}
```

Each call to `Advance()` returns the type of the next element queued on the tape.

Each type has helpers on `Iter` for reading its value:

| Type | Action on Iter |
|------------|----------------------------|
| TypeNone | Nothing follows. Iter done |
| TypeNull | Null value |
| TypeString | `String()`/`StringBytes()` |
| TypeInt | `Int()`/`Float()` |
| TypeUint | `Uint()`/`Float()` |
| TypeFloat | `Float()` |
| TypeBool | `Bool()` |
| TypeObject | `Object()` |
| TypeArray | `Array()` |
| TypeRoot | `Root()` |

The complex types return helpers that help you parse each of the underlying structures.

It is up to you to keep track of the nesting level you are operating at.


## Parsing NDJSON stream

Newline delimited JSON is sent as packets, with each line being a root element.

Here is an example that counts the number of `"Make": "HOND"` entries in NDJSON similar to this:

```
{"Age":20, "Make": "HOND"}
{"Age":22, "Make": "TLSA"}
```

```Go
// Requires imports: bytes, fmt, io, log and github.com/minio/simdjson-go.
func findHondas(r io.Reader) {
	// Temp values, reused between iterations.
	var tmpO simdjson.Object
	var tmpE simdjson.Element
	var tmpI simdjson.Iter
	var nFound int

	// Communication
	reuse := make(chan *simdjson.ParsedJson, 10)
	res := make(chan simdjson.Stream, 10)

	simdjson.ParseNDStream(r, res, reuse)
	// Read results in blocks...
	for got := range res {
		if got.Error != nil {
			if got.Error == io.EOF {
				break
			}
			log.Fatal(got.Error)
		}

		all := got.Value.Iter()
		// NDJSON is a sequence of root elements, one per line.
		for all.Advance() == simdjson.TypeRoot {
			// Read inside root.
			t, i, err := all.Root(&tmpI)
			if err != nil {
				log.Println("got err", err)
				continue
			}
			if t != simdjson.TypeObject {
				log.Println("got type", t.String())
				continue
			}

			// Prepare object.
			obj, err := i.Object(&tmpO)
			if err != nil {
				log.Println("got err", err)
				continue
			}

			// Find Make key. FindKey returns nil if the key is absent.
			elem := obj.FindKey("Make", &tmpE)
			if elem == nil || elem.Type != simdjson.TypeString {
				log.Println("Make key missing or not a string")
				continue
			}

			// Get value as bytes.
			asB, err := elem.Iter.StringBytes()
			if err != nil {
				log.Println("got err", err)
				continue
			}
			if bytes.Equal(asB, []byte("HOND")) {
				nFound++
			}
		}
		// Hand the parsed block back so its memory can be reused.
		reuse <- got.Value
	}
	fmt.Println("Found", nFound, "Hondas")
}
```

More examples can be found in the examples subdirectory and further documentation can be found at [godoc](https://pkg.go.dev/github.com/minio/simdjson-go?tab=doc).


## Performance vs simdjson

Based on the same set of JSON test files, the graph below shows a comparison between `simdjson` and `simdjson-go`.

![simdjson-vs-go-comparison](chart/simdjson-vs-simdjson-go.png)

These numbers were measured on a MacBook Pro equipped with a 3.1 GHz Intel Core i7.
Also, to make it a fair comparison, the constant `GOLANG_NUMBER_PARSING` was set to `false` (default is `true`)
in order to use the same number parsing function (which is faster at the expense of some precision; see more below).

In addition the constant `ALWAYS_COPY_STRINGS` was set to `false` (default is `true`) for non-streaming use case
scenarios where the full JSON message is kept in memory (similar to the `simdjson` behaviour).

## Performance vs `encoding/json` and `json-iterator/go`

@@ -84,7 +225,7 @@ BenchmarkUpdate_center-8 101.41 860.52 8.49x

## AVX512 Acceleration

Stage 1 has been optimized using AVX512 instructions. Under full CPU load (8 threads) the AVX512 code is about 1 GB/sec (15%) faster as compared to the AVX2 code.

```
benchmark                                 AVX2 MB/s    AVX512 MB/s     speedup
BenchmarkFindStructuralBitsParallelLoop     7225.24        8302.96       1.15x
```

@@ -93,52 +234,6 @@

These benchmarks were generated on a c5.2xlarge EC2 instance with a Xeon Platinum 8124M CPU at 3.0 GHz.

## Usage

Run the following command in order to install `simdjson-go`

```
$ go get github.com/minio/simdjson-go
```

In order to parse a JSON byte stream, you either call [`simdjson.Parse()`](https://pkg.go.dev/github.com/minio/simdjson-go?tab=doc#Parse)
or [`simdjson.ParseND()`](https://pkg.go.dev/github.com/minio/simdjson-go?tab=doc#ParseND) for newline delimited JSON files.
Both of these functions return a [`ParsedJson`](https://pkg.go.dev/github.com/minio/simdjson-go?tab=doc#ParsedJson)
struct that can be used to navigate the JSON object by calling [`Iter()`](https://pkg.go.dev/github.com/minio/simdjson-go?tab=doc#ParsedJson.Iter).

Using the type [`Iter`](https://pkg.go.dev/github.com/minio/simdjson-go?tab=doc#Iter) you can call
[`Advance()`](https://pkg.go.dev/github.com/minio/simdjson-go?tab=doc#Iter.Advance) to iterate over the tape, like so:

```
for {
typ := iter.Advance()
switch typ {
case simdjson.TypeRoot:
if typ, tmp, err = iter.Root(tmp); err != nil {
return
}
if typ == simdjson.TypeObject {
if obj, err = tmp.Object(obj); err != nil {
return
}
e := obj.FindKey(key, &elem)
if e != nil && elem.Type == simdjson.TypeString {
v, _ := elem.Iter.StringBytes()
fmt.Println(string(v))
}
}
default:
return
}
}
```

More examples can be found in the examples subdirectory and further documentation can be found at [godoc](https://pkg.go.dev/github.com/minio/simdjson-go?tab=doc).

## Requirements

`simdjson-go` has the following requirements:
@@ -207,13 +302,6 @@ For string values without special characters the tape's payload points directly

For more information, see `TestStage2BuildTape` in `stage2_build_tape_test.go`.

## Minor number imprecisions

The number parser has minor imprecisions compared to Golang's standard number parsing.
There is a constant `GOLANG_NUMBER_PARSING` (on by default) that uses Go's
parsing functionality at the expense of giving up some performance.
Note that the performance metrics mentioned above have been measured by setting `GOLANG_NUMBER_PARSING` to `false`.

## Non streaming use cases

The best performance is obtained by keeping the JSON message fully mapped in memory and setting the