Well Known Binary vs Well Known Text? #170
Replies: 5 comments 14 replies
-
I think this and #169 are mostly my fault. They're somewhat related in that the goal is to make spatial data in Parquet more accessible to folks who might not be geospatial data experts and often don't have spatial software on their systems (GeoParquet readers, WKB parsers, etc.). At least in my experience, folks seeing
This is definitely a balance of usability vs. performance though. I can try to find time in the next few days to compare a column of gzip-compressed WKT vs. compressed WKB from some representative OSM sample and characterize the storage difference, assuming nobody has done this yet? I imagine the parsing/serialization/deserialization performance is far more significant and a bigger driver here though.
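As a rough, standard-library-only stand-in for that comparison (not the Parquet / OSM test itself), the sketch below encodes one synthetic polygon ring as WKT and as ISO WKB and gzips both. The `ring`, `to_wkt`, and `to_wkb` helpers and the coordinates are made up for illustration; real numbers will depend on the data and on Parquet's own encoding and compression.

```python
import gzip
import math
import struct


def ring(n):
    """Synthetic closed ring of n + 1 projected coordinates (stand-in for real data)."""
    pts = [
        (1_892_000 + 500 * math.cos(2 * math.pi * i / n),
         645_000 + 500 * math.sin(2 * math.pi * i / n))
        for i in range(n)
    ]
    return pts + [pts[0]]  # repeat the first point to close the ring


def to_wkt(pts, precision=6):
    """Fixed-precision WKT for a polygon with a single ring."""
    coords = ", ".join(f"{x:.{precision}f} {y:.{precision}f}" for x, y in pts)
    return f"POLYGON (({coords}))"


def to_wkb(pts):
    """Little-endian ISO WKB for a polygon with a single ring."""
    # byte order flag (1 = little endian), type 3 = Polygon, ring count, point count
    header = struct.pack("<BIII", 1, 3, 1, len(pts))
    return header + b"".join(struct.pack("<dd", x, y) for x, y in pts)


pts = ring(10_000)
wkt = to_wkt(pts).encode("ascii")
wkb = to_wkb(pts)

# Print raw and gzip-compressed sizes in bytes for each encoding.
for name, raw in (("WKT", wkt), ("WKB", wkb)):
    print(name, len(raw), len(gzip.compress(raw)))
```

The raw WKT is larger simply because each coordinate takes more characters than the 8 bytes of a double; how much of that gap compression closes is exactly what a test on real data would need to show.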
-
Would you be doing that within Parquet? That's the bit I'm most interested in: whether Parquet's native compression actually makes it so WKT vs WKB isn't that much different. And yeah, it would be interesting to see the parsing/deserialization - I wonder if @tschaub or @rouault might be able to help. I do think it'd be a nice feature for gpq or ogr to be able to read in WKT & WKB columns from Parquet, look for defaults, and take user input for non-defaults. So maybe part of adding that could be to look a bit into the performance.
-
First off, we need to take as a given that beyond a nontrivial number of coordinates, WKT is basically unreadable (vs WKB, which is always unreadable). So while our goal may be to make geospatial data more accessible and readable, we will quickly be confounded by this. For example, for display in Shapely, we limit WKT to 62 characters; other data table displays may impose their own limits. But this is just intended to give a quick view of the first few coordinates, not to show the entire WKT.
Second, WKT may be lossy. For instance, Shapely uses a default of 6 decimal places when writing WKT. Writing full precision requires extra steps and, depending on the engine available to the user, may not be easy to do, leading to data loss. A lossy encoding is not at all ideal for serializing data for analysis rather than display.
I worked up a simple example to probe at this using GeoPandas, Shapely 2.0, and pyarrow. For my tests, I used level 2 code hydrologic regions for the US derived originally from here. This is a small but nontrivial dataset with 20 records. The first record WKT is 2,568,157 characters long.
Here's the first 1,000:
'MULTIPOLYGON (((1892734.760992 645962.412497, 1892748.754026 645993.888655, 1892760.36212 645995.216753, 1892800.350383 645991.894954, 1892828.830611 646012.959201, 1892833.344958 646014.814546, 1892843.740631 646013.147427, 1892851.439234 646022.293334, 1892872.278846 646031.15024, 1892882.940588 646048.453011, 1892885.538009 646060.794302, 1892881.356728 646077.945815, 1892856.258525 646116.704366, 1892854.862068 646112.081883, 1892858.146927 646098.875267, 1892869.626053 646079.705495, 1892875.459701 646058.845346, 1892865.503787 646038.555064, 1892861.42074 646037.332019, 1892853.040215 646038.5433, 1892848.148827 646035.492224, 1892843.022669 646028.723227, 1892827.18278 646024.542016, 1892815.030552 646018.294643, 1892812.155738 646014.823082, 1892795.026849 646006.049822, 1892771.766651 646004.420562, 1892754.502651 646003.455295, 1892748.754026 645993.888655, 1892742.755867 645990.531607, 1892724.886859 645969.493841, 1892722.044919 645960.741709, 1892729.761629 645934.675524, '
For the tests below, I am only outputting the geometry column in order to focus on how this varies between WKB and WKT.
Under the hood in GeoPandas, geometries are stored natively as Python-wrapped GEOS geometry objects, and we are using GEOS for conversion to / from WKB or WKT (via Shapely). While other engines may use a different geometry representation under the hood, this should provide a reasonable benchmark of encoding / decoding times for WKB and WKT.
For serializing WKT, I used the default of 6 decimal places of precision (lossy!).
For serializing to Parquet, I'm using snappy compression. Serialization / deserialization times are based only on writing a pyarrow table with WKB or WKT content.
The following times are based on macOS 12.6.5 / M1:
Conclusions
WKT is bigger and slower, potentially lossy, and not human readable beyond a small number of coordinates. All pain and no meaningful gain.
Code for the above:

from timeit import timeit

import geopandas as gp
import pandas as pd
import shapely
from geopandas.io.arrow import _create_metadata, _encode_metadata
from pyarrow import Table, parquet

df = ...  # read the WBD HUC2 dataset here (source path not given)

# GeoParquet "geo" metadata for the frame (private GeoPandas helpers)
geo_metadata = _create_metadata(df)


def set_metadata(table, geo_metadata):
    # Attach the encoded "geo" metadata to the table's schema metadata
    metadata = table.schema.metadata
    metadata.update({b"geo": _encode_metadata(geo_metadata)})
    return table.replace_schema_metadata(metadata)


# Encoding time: geometries -> WKB / WKT (average of 10 runs)
timeit(
    "shapely.to_wkb(df.geometry.values, flavor='iso')", globals=globals(), number=10
) / 10
timeit("shapely.to_wkt(df.geometry.values)", globals=globals(), number=10) / 10

wkb = shapely.to_wkb(df.geometry.values, flavor="iso")
wkt = shapely.to_wkt(df.geometry.values)

# Decoding time: WKB / WKT -> geometries
timeit("shapely.from_wkb(wkb)", globals=globals(), number=10) / 10
timeit("shapely.from_wkt(wkt)", globals=globals(), number=10) / 10

wkb_table = set_metadata(
    Table.from_pandas(pd.DataFrame({"geometry": wkb})), geo_metadata
)
wkt_table = set_metadata(
    Table.from_pandas(pd.DataFrame({"geometry": wkt})), geo_metadata
)

# Parquet write time with snappy compression
timeit(
    "parquet.write_table(wkb_table, '/tmp/wkb.pq', compression='snappy')",
    globals=globals(),
    number=10,
) / 10
timeit(
    "parquet.write_table(wkt_table, '/tmp/wkt.pq', compression='snappy')",
    globals=globals(),
    number=10,
) / 10

# Parquet read time
timeit("parquet.read_table('/tmp/wkb.pq')", globals=globals(), number=10) / 10
timeit("parquet.read_table('/tmp/wkt.pq')", globals=globals(), number=10) / 10
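To make the lossiness point concrete without needing Shapely installed, here is a minimal standard-library sketch. It assumes fixed 6-decimal text formatting as a stand-in for WKT output (Shapely's `to_wkt` takes a `rounding_precision` argument that defaults to 6 and can be raised) and an 8-byte double as a stand-in for a WKB coordinate. The longitude value and both helper names are made up for illustration.

```python
import struct


def wkt_roundtrip(value, precision=6):
    """Write a coordinate as fixed-precision text and parse it back."""
    return float(f"{value:.{precision}f}")


def wkb_roundtrip(value):
    """Write a coordinate as an 8-byte IEEE 754 double and read it back."""
    return struct.unpack("<d", struct.pack("<d", value))[0]


lon = -122.41941550123456  # hypothetical longitude in degrees

print(wkt_roundtrip(lon) == lon)      # text at 6 decimals drops digits
print(wkb_roundtrip(lon) == lon)      # the binary double survives unchanged
print(abs(wkt_roundtrip(lon) - lon))  # size of the rounding error
```

For projected coordinates in meters, 6 decimals is far below any realistic measurement precision, but for analysis the issue is that the round trip no longer reproduces the input exactly.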
-
Apart from a discussion about what should be the default, we could easily allow people to use WKT if they want by providing an (Of course, if it's a non-default option, you can only use it with readers that actually can check the metadata, or that allow the user to pass metadata)
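A minimal sketch of what "a reader that checks the metadata" could look like, assuming the per-column `encoding` field of the `geo` metadata is where the choice is recorded. Treating `"WKT"` as an allowed value, falling back to WKB when the field is absent, and the `pick_decoder`, `decode_wkb`, and `decode_wkt` names are all assumptions for illustration, not part of any existing reader.

```python
import json


# Hypothetical decoders standing in for a real WKB / WKT parser.
def decode_wkb(value: bytes) -> str:
    return f"<geometry from {len(value)} WKB bytes>"


def decode_wkt(value: str) -> str:
    return f"<geometry from WKT {value[:20]!r}>"


DECODERS = {"WKB": decode_wkb, "WKT": decode_wkt}


def pick_decoder(geo_metadata: bytes, column=None, override=None):
    """Choose a decoder from the JSON 'geo' metadata, or from a user override."""
    meta = json.loads(geo_metadata)
    column = column or meta["primary_column"]
    # Assumption: fall back to WKB when no encoding is recorded.
    encoding = override or meta["columns"][column].get("encoding", "WKB")
    try:
        return DECODERS[encoding.upper()]
    except KeyError:
        raise ValueError(f"unsupported geometry encoding: {encoding!r}") from None


# Example metadata, trimmed to the fields used here.
geo = json.dumps(
    {"primary_column": "geometry", "columns": {"geometry": {"encoding": "WKT"}}}
).encode()

decoder = pick_decoder(geo)
print(decoder("POINT (1 2)"))
```

The `override` argument covers the other case mentioned above, where the reader cannot see the metadata and the user has to say which encoding the column uses.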
-
The OGR Parquet driver already supported a GEOMETRY_ENCODING=WKT layer creation option, as an extension.
-
I'm splintering this discussion out of #169 as it's coming up there, but would be good to tackle on its own.
I don't believe we did seriously consider Well Known Text (WKT) as an alternative to Well Known Binary (WKB). In my mind, since Parquet is a binary format, using WKB makes sense, as a raw Parquet file can't be read by a human. But it's a good point that there are lots of Parquet readers, and if the data is a text string then WKT would show up for users as something sensible, while WKB wouldn't.
So I'm curious about the actual pros and cons. WKT has a clear 'pro' in the human readability - doing it by default instead of relying on tools to properly render the WKB. I'm curious about potential cons:
Others please add more here.
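To illustrate the readability point, here is a small standard-library sketch of what a generic Parquet reader with no spatial support would display for the same point stored each way. The `point_wkb` and `point_wkt` helpers are made up for this example; the byte layout is the little-endian WKB point encoding (byte order flag, geometry type, then x and y as doubles).

```python
import struct


def point_wkb(x, y):
    # byte order flag (1 = little endian), geometry type 1 = Point, x and y as doubles
    return struct.pack("<BIdd", 1, 1, x, y)


def point_wkt(x, y):
    return f"POINT ({x:g} {y:g})"


print(point_wkb(1.0, 2.0).hex())  # 0101000000000000000000f03f0000000000000040
print(point_wkt(1.0, 2.0))        # POINT (1 2)
```

For a single point the text form is clearly friendlier; whether that holds for geometries with thousands of vertices is a separate question.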