Well Known Binary vs Well Known Text? #170

cholmes · 2023-05-01T22:35:29Z

cholmes
May 1, 2023
Maintainer

I'm splitting this discussion out of #169, as it's coming up there but would be good to tackle on its own.

I don't believe we seriously considered Well Known Text (WKT) as an alternative to Well Known Binary (WKB). In my mind, since Parquet is a binary format, using WKB makes sense, as a raw Parquet file can't be read by a human. But it's a good point that there are lots of Parquet readers, and if the data is a text string then WKT would show up for users as something sensible, while WKB wouldn't.

So I'm curious about the actual pros and cons. WKT has a clear 'pro' in the human readability - doing it by default instead of relying on tools to properly render the WKB. I'm curious about potential cons:

- Will WKT be 'bigger' on disk? Parquet does really great compression by default, so maybe WKT compressed by Parquet wouldn't be worse? Could potentially be better?
- Would WKT be slower? I'd guess it might be? More data to crunch through? But the question would be how much slower, and whether that is meaningful relative to how long things take in general.

Others please add more here.

jwass · 2023-05-03T14:25:52Z

jwass
May 3, 2023
Collaborator

I think this and #169 are mostly my fault. They're somewhat related in that the goal is trying to make spatial data in parquet more accessible to folks that might not be geospatial data experts and often don't have spatial software on their systems (GeoParquet readers / WKB parsers etc). At least in my experience, folks seeing POINT(-71.05796 42.36047) or LINESTRING(....) and being able to understand what's going on is a great gateway to start learning more and then working with spatial data. Whereas a blob of undecipherable WKB is probably just enough friction to move on and not even ask what it is. It's also annoying to work with if you don't have a WKB parser at hand. I'm also coming from a world where GeoParquet isn't natively supported (at least not yet) and the engines don't know to treat a WKB column as anything other than plain binary...
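To make the readability gap concrete, here is a minimal sketch that encodes the same point both ways using only the Python standard library. The helper names `point_to_wkt` and `point_to_wkb` are made up for illustration and handle 2D points only; a real reader would use a proper WKB/WKT library.

```python
import struct


def point_to_wkt(x: float, y: float) -> str:
    # WKT separates the two ordinates with a space, not a comma.
    return f"POINT ({x!r} {y!r})"


def point_to_wkb(x: float, y: float) -> bytes:
    # Little-endian WKB for a 2D point: a byte-order flag (1),
    # a uint32 geometry type (1 = Point), then two float64 ordinates.
    return struct.pack("<BIdd", 1, 1, x, y)


wkt = point_to_wkt(-71.05796, 42.36047)
wkb = point_to_wkb(-71.05796, 42.36047)

print(wkt)        # readable in any Parquet viewer
print(wkb.hex())  # what a generic viewer shows for the binary column
print(len(wkt), "characters vs", len(wkb), "bytes")
```

The text form is obvious at a glance, while the hex dump needs a parser before it means anything, which is the friction described above.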

This is definitely a balance of usability vs. performance though. I can try to find time in the next few days to compare a column of gzip compressed WKT vs. compressed WKB from some representative OSM sample and characterize the storage difference - assuming nobody has done this yet? I imagine the parsing/serialization/deserialization performance is far more significant and a bigger driver here though.

1 reply

jedsundwall May 4, 2023

I'm an enthusiastic supporter of this goal: "trying to make spatial data in parquet more accessible to folks that might not be geospatial data experts and often don't have spatial software on their systems." It's impossible to quantify, but I have a hunch that making data visible and tangible to people is a powerful way to educate people and welcome them to our community.

FWIW, I have plans for Radiant to work on creating interfaces that will make Parquet a bit more tangible to people via browser interfaces.

cholmes · 2023-05-03T16:20:17Z

cholmes
May 3, 2023
Maintainer Author

> This is definitely a balance of usability vs. performance though. I can try to find time in the next few days to compare a column of gzip compressed WKT vs. compressed WKB from some representative OSM sample and characterize the storage difference - assuming nobody has done this yet? I imagine the parsing/serialization/deserialization performance is far more significant and a bigger driver here though.

Would you be doing that within Parquet? That's the bit I'm most interested in - whether Parquet's native compression actually makes it so WKT vs WKB isn't that much different. And yeah, it would be interesting to see the parsing/deserialization - I wonder if @tschaub or @rouault might be able to help. I do think it'd be a nice feature for gpq or ogr to be able to read in WKT & WKB columns from parquet, look for defaults and take user input for non-defaults. So maybe part of adding that could be to look a bit into the performance.

0 replies

brendan-ward · 2023-05-04T03:04:03Z

brendan-ward
May 4, 2023

First off, we need to take as a given that beyond a nontrivial number of coordinates, WKT is basically unreadable (vs WKB, which is always unreadable). So while our goals may be to support making geospatial data more accessible and readable, we will quickly be confounded by this.

For example, for display in Shapely, we limit WKT to 62 characters; other data table displays may impose their own limits. But this is just intended to give a quick view of the first few coordinates, not to show the entire WKT.

Second, WKT may be lossy. For instance, Shapely uses a default of 6 decimal places when writing WKT. This would require extra steps in order to write full precision, and depending on the engine available to the user may not be easily available - leading to data loss. Lossy is not at all ideal for serializing data for analysis instead of display.
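As a rough illustration of the precision point, the sketch below round-trips a single float through text using only the standard library. The coordinate value is invented, and plain `format` calls stand in for whatever a WKT writer does internally, so treat it as an analogy rather than Shapely's actual code path.

```python
def roundtrip(value: float, fmt: str) -> float:
    """Write a float as text with the given format spec and parse it back."""
    return float(format(value, fmt))


# Hypothetical coordinate carrying more than six decimal places.
x = 1892734.7609921234

six_dp = roundtrip(x, ".6f")   # six decimal places, like the default mentioned above
full = roundtrip(x, ".17g")    # 17 significant digits

print("6 decimals round-trips exactly:", six_dp == x)
print("17 significant digits round-trips exactly:", full == x)
print("error introduced by 6 decimals:", abs(six_dp - x))
```

Seventeen significant digits are enough to recover any 64-bit float exactly, which is why full-precision WKT has to be that verbose.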

I worked up a simple example to probe at this using GeoPandas, Shapely 2.0, and pyarrow. For my tests, I used level 2 code hydrologic regions for the US derived originally from here. This is a small but nontrivial dataset with 20 records.

The first record WKT is 2,568,157 characters long. Here's the first 1,000:

'MULTIPOLYGON (((1892734.760992 645962.412497, 1892748.754026 645993.888655, 1892760.36212 645995.216753, 1892800.350383 645991.894954, 1892828.830611 646012.959201, 1892833.344958 646014.814546, 1892843.740631 646013.147427, 1892851.439234 646022.293334, 1892872.278846 646031.15024, 1892882.940588 646048.453011, 1892885.538009 646060.794302, 1892881.356728 646077.945815, 1892856.258525 646116.704366, 1892854.862068 646112.081883, 1892858.146927 646098.875267, 1892869.626053 646079.705495, 1892875.459701 646058.845346, 1892865.503787 646038.555064, 1892861.42074 646037.332019, 1892853.040215 646038.5433, 1892848.148827 646035.492224, 1892843.022669 646028.723227, 1892827.18278 646024.542016, 1892815.030552 646018.294643, 1892812.155738 646014.823082, 1892795.026849 646006.049822, 1892771.766651 646004.420562, 1892754.502651 646003.455295, 1892748.754026 645993.888655, 1892742.755867 645990.531607, 1892724.886859 645969.493841, 1892722.044919 645960.741709, 1892729.761629 645934.675524, '

For the tests below, I am only outputting the geometry column in order to focus on how this varies between WKB and WKT.

Under the hood in GeoPandas, geometries are stored natively as Python-wrapped GEOS geometry objects, and we are using GEOS for conversion to / from WKB or WKT (via Shapely). While other engines may use a different geometry representation under the hood, this should provide a reasonable benchmark of encoding / decoding times for WKB and WKT.

For serializing WKT, I used the default of 6 decimal places of precision (lossy!).

For serializing to Parquet, I'm using snappy compression.

Serialization / deserialization times are just based on writing a pyarrow table with WKB or WKT content.

Following times are based on MacOS 12.6.5 / M1:

| Format | Size | Convert to (s) | Convert from (s) | Serialize to Parquet (s) | Deserialize from Parquet (s) |
|---|---|---|---|---|---|
| WKB | 54.6 MB | 0.089 | 0.046 | 0.044 | 0.03 |
| WKT | 71.6 MB | 0.44 | 0.45 | 0.38 | 0.12 |

Conclusions

WKT is bigger and slower, potentially lossy, and not human readable beyond a handful of coordinates. All pain and no meaningful gain.
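For a sense of scale, the figures reported above can be turned into WKT-to-WKB ratios. This is only arithmetic on the numbers in the table, with dictionary keys chosen here for convenience.

```python
# Figures copied from the benchmark table (size in MB, times in seconds).
wkb = {"size_mb": 54.6, "convert_to": 0.089, "convert_from": 0.046,
       "write_parquet": 0.044, "read_parquet": 0.03}
wkt = {"size_mb": 71.6, "convert_to": 0.44, "convert_from": 0.45,
       "write_parquet": 0.38, "read_parquet": 0.12}

# Ratio of WKT to WKB for each measurement.
ratios = {key: wkt[key] / wkb[key] for key in wkb}

for key, ratio in ratios.items():
    print(f"{key}: WKT is {ratio:.1f}x WKB")
```

On this dataset that works out to roughly 1.3x the file size and between about 4x and 10x the time, depending on the step.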

Code for the above

from timeit import timeit

import geopandas as gp
# Note: _create_metadata and _encode_metadata are private GeoPandas helpers
# and may change between releases.
from geopandas.io.arrow import _create_metadata, _encode_metadata
import pandas as pd
import shapely
from pyarrow import parquet, Table

# Placeholder: load the WBD HUC2 dataset into a GeoDataFrame here.
df = ...

geo_metadata = _create_metadata(df)


def set_metadata(table, geo_metadata):
    # Attach the GeoParquet "geo" metadata to the table's schema metadata.
    metadata = table.schema.metadata
    metadata.update({b"geo": _encode_metadata(geo_metadata)})
    return table.replace_schema_metadata(metadata)


# Time converting geometries to WKB and to WKT (average of 10 runs).
timeit(
    "shapely.to_wkb(df.geometry.values, flavor='iso')", globals=globals(), number=10
) / 10
timeit("shapely.to_wkt(df.geometry.values)", globals=globals(), number=10) / 10

wkb = shapely.to_wkb(df.geometry.values, flavor="iso")
wkt = shapely.to_wkt(df.geometry.values)

# Time converting WKB and WKT back to geometries.
timeit("shapely.from_wkb(wkb)", globals=globals(), number=10) / 10
timeit("shapely.from_wkt(wkt)", globals=globals(), number=10) / 10

# Build single-column tables holding only the encoded geometry.
wkb_table = set_metadata(
    Table.from_pandas(pd.DataFrame({"geometry": wkb})), geo_metadata
)
wkt_table = set_metadata(
    Table.from_pandas(pd.DataFrame({"geometry": wkt})), geo_metadata
)

# Time writing each table to Parquet with snappy compression.
timeit(
    "parquet.write_table(wkb_table, '/tmp/wkb.pq', compression='snappy')",
    globals=globals(),
    number=10,
) / 10

timeit(
    "parquet.write_table(wkt_table, '/tmp/wkt.pq', compression='snappy')",
    globals=globals(),
    number=10,
) / 10

# Time reading each file back.
timeit("parquet.read_table('/tmp/wkb.pq')", globals=globals(), number=10) / 10
timeit("parquet.read_table('/tmp/wkt.pq')", globals=globals(), number=10) / 10

5 replies

jorisvandenbossche May 4, 2023
Maintainer

And a small addition on the aspect of "lossy" conversion in case of WKT: while you can think to not care about that many decimals, such lossy conversion can (although probably rarely) in practice result in small changes in your coordinates causing invalid geometries.
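A contrived sketch of how that can happen: the ring below is a tiny triangle whose vertices are all distinct, but rounding to six decimals snaps every vertex onto the same point. The coordinates and helper names are invented for illustration, and real cases are usually subtler (for example, nearly touching edges that cross after rounding).

```python
def ring_area(ring):
    """Unsigned area of a closed ring via the shoelace formula."""
    total = sum(x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in zip(ring, ring[1:]))
    return abs(total) / 2


def round_ring(ring, ndigits=6):
    """Round every vertex, as a fixed-precision text writer effectively would."""
    return [(round(x, ndigits), round(y, ndigits)) for x, y in ring]


# Hypothetical sliver triangle with sides below the 1e-6 rounding step.
tiny = [(0.0, 0.0), (4e-7, 0.0), (0.0, 4e-7), (0.0, 0.0)]
rounded = round_ring(tiny)

print("area before rounding:", ring_area(tiny))
print("area after rounding:", ring_area(rounded))
print("distinct vertices after rounding:", len(set(rounded)))
```

After rounding, the ring has a single distinct vertex and zero area, so it no longer describes a usable polygon.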

jwass May 4, 2023
Collaborator

Sounds good. Thanks for the quick script, @brendan-ward. Safe to say I'll drop this discussion :)

jwass May 4, 2023
Collaborator

I'd quibble a bit with "beyond a nontrivial number of coordinates, WKT is basically unreadable" in that you can still tell what kind of geometry type it is and the first few coordinates can give you an idea of where it is and people unfamiliar with these things will start to understand it a bit more. But I agree the performance tradeoffs are likely a non-starter. Thanks for the profiling - I certainly know it's bigger and slower, but the question is by how much.

cholmes May 4, 2023
Maintainer Author

Thanks @brendan-ward! Great to have these numbers in the public discussion - love that you included the code. I definitely concur it's not worth the tradeoffs, but good to have the hard data.

jedsundwall May 4, 2023

Yeah, thank you for doing the math @brendan-ward.

jorisvandenbossche · 2023-05-04T14:02:53Z

jorisvandenbossche
May 4, 2023
Maintainer

Apart from a discussion about what should be the default, we could easily allow people to use WKT if they want by providing an "encoding": "WKT" option. That isn't hard to add (and I think most readers that can handle WKB will also be able to handle WKT).

(Of course, if it's a non-default option, you can only use it with readers that actually can check the metadata, or allow the user to pass metadata)
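A rough sketch of what that could look like on the reader side, assuming the file-level "geo" metadata is stored as JSON under the `b"geo"` key. The `"WKT"` value is the option proposed in this thread, not something the spec defines, and `pick_encoding` plus the metadata values shown are hypothetical.

```python
import json

# Illustrative "geo" metadata with the proposed (non-standard) WKT encoding.
geo_metadata = {
    "primary_column": "geometry",
    "columns": {
        "geometry": {"encoding": "WKT", "geometry_types": ["MultiPolygon"]},
    },
}

# Parquet key/value metadata is bytes, so the JSON is stored encoded.
file_metadata = {b"geo": json.dumps(geo_metadata).encode("utf-8")}


def pick_encoding(file_metadata: dict, column: str) -> str:
    """Return the declared geometry encoding for a column, defaulting to WKB."""
    geo = json.loads(file_metadata[b"geo"])
    encoding = geo["columns"][column].get("encoding", "WKB")
    if encoding not in ("WKB", "WKT"):
        raise ValueError(f"unsupported geometry encoding: {encoding}")
    return encoding


print(pick_encoding(file_metadata, "geometry"))
```

A reader that ignores the metadata would still see a plain string column, which is the caveat noted above.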

6 replies

jorisvandenbossche May 5, 2023
Maintainer

On the other hand, if people are writing parquet files with a column of WKT geometries nonetheless, we could make those automatically "geoparquet-compatible" by allowing this encoding option. Yes, they won't be readable assuming defaults, but they could then be readable by a reader where you can pass some metadata (or by one that would infer the encoding based on string vs binary).

For example, in geopandas.read_parquet, we could add the option that the user can pass (a subset of) the metadata, and then you could read such a file by specifying the encoding is WKT. That would make our reader understand such (already existing) files.

(I don't have a pressing use case myself, so I am not necessarily going to argue for it much more, but I would personally have no problem with allowing such files)

tschaub May 5, 2023
Collaborator

As I mentioned on #169, it would also be an option to have the (default) encoding depend on the column type. For example, we could say that the default encoding for a geometry column with a string annotation is WKT and without this annotation the default is WKB. I understand that people don't like the metadata-less approach for other reasons, but it does seem somewhat sensible to use what is already in Parquet to convey as much information as possible. And here the idea would be to use the logical type of a column to infer the geometry encoding.
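A minimal sketch of that inference rule, written against plain strings rather than a real Parquet library. It assumes geometry columns use the BYTE_ARRAY physical type and that a STRING logical annotation marks text; `default_encoding` is a hypothetical helper, not part of any reader.

```python
from typing import Optional


def default_encoding(physical_type: str, logical_type: Optional[str]) -> str:
    """Infer a default geometry encoding from a column's Parquet types."""
    if physical_type != "BYTE_ARRAY":
        raise ValueError(f"not a candidate geometry column: {physical_type}")
    # A string annotation suggests text (WKT); a bare byte array suggests WKB.
    return "WKT" if logical_type == "STRING" else "WKB"


print(default_encoding("BYTE_ARRAY", "STRING"))  # annotated as string
print(default_encoding("BYTE_ARRAY", None))      # plain binary
```

The appeal is that the information is already in the file's schema, so no extra metadata is needed to tell the two cases apart.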

cholmes May 5, 2023
Maintainer Author

Cool. Yeah, I like the idea of automatic 'geoparquet-compatible' - specifying somewhere that WKT can be an option in the metadata-less approach. I think the defaults idea will be the main topic on the next call, it seems like there's good consensus around doing something, and the key will be deciding how to package / name it. But +1 on having WKT in the mix, and using as much info in Parquet to read things in.

jorisvandenbossche May 5, 2023
Maintainer

If we want the WKT encoding to be inferable in potential metadata-less compatible files, then I think we should at least start by adding it as an option to the official metadata (the idea being that any file without metadata always represents some default or inferred instance of fully specified metadata).

cholmes May 5, 2023
Maintainer Author

Ah, good point. Now I'm leaning towards 'yes' :)

rouault · 2023-05-04T14:41:49Z

rouault
May 4, 2023

The OGR Parquet driver already supports a GEOMETRY_ENCODING=WKT layer creation option, as an extension.
In OSGeo/gdal#7690, you'll find the result of an experiment to compare WKB vs WKT. Findings are similar to @brendan-ward's experiments:

- the size of WKT GeoParquet can be significantly larger than WKB when outputting the 17 or 18 significant digits required to preserve full 64-bit precision, or comparable when the precision is reduced
- building geometry objects from WKT carries, at best, a 2x performance penalty for feature-by-feature iteration, and, at best, a 4x penalty for bulk loading, where it has to be converted to WKB (or some equivalent in-memory binary representation)
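The size effect of precision can be sketched without any geospatial library by counting raw, uncompressed bytes for the coordinate payload alone. The coordinates are synthetic, and the numbers ignore geometry headers and Parquet compression, so this only shows the direction of the effect, not the results of the experiment above.

```python
import struct

# Synthetic coordinate pairs with full float64 detail (not real data).
coords = [(1892734.760992 + i / 7, 645962.412497 + i / 11) for i in range(1, 101)]


def wkb_coord_bytes(coords):
    """Raw size of the ordinates as WKB stores them: two float64 per vertex."""
    return len(b"".join(struct.pack("<dd", x, y) for x, y in coords))


def wkt_coord_chars(coords, fmt):
    """Raw size of the ordinates as WKT text using the given format spec."""
    return len(", ".join(f"{format(x, fmt)} {format(y, fmt)}" for x, y in coords))


sizes = {
    "wkb": wkb_coord_bytes(coords),
    "wkt_6dp": wkt_coord_chars(coords, ".6f"),      # reduced precision
    "wkt_17sig": wkt_coord_chars(coords, ".17g"),   # full float64 precision
}

for name, size in sizes.items():
    print(f"{name}: {size} bytes")
```

Compression narrows the gap in practice, which fits the finding that reduced-precision WKT can end up comparable in size.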

2 replies

jwass May 4, 2023
Collaborator

Thanks for running that profiling and making those improvements.

cholmes May 4, 2023
Maintainer Author

Awesome, thanks @rouault!
