Database tables do not use indexes #43

cameronmurdoch · 2022-04-25T14:29:49Z

Database tables hosts and hosts_source do not define any indexes. Even with small(ish) tables, this can lead to extra IO and 'slow' queries.

Background info:
https://www.postgresql.org/docs/14/datatype-json.html#JSON-INDEXING
https://www.postgresql.org/docs/11/datatype-json.html#JSON-INDEXING

In these examples hosts_source has 12763 rows and is approx 9.5MB.

A typical query:

zabbix-auto-config/zabbix_auto_config/processing.py

Line 258 in 13d42cf

    
           db_cursor.execute(f"SELECT data FROM {self.db_source_table} WHERE data->>'hostname' = %s", [hostname])

This gives, for example, the following query plan:

# explain(analyse,buffers) select data from hosts_source WHERE data->>'hostname' = 'battersea.uio.no';
                                                 QUERY PLAN                                                  
-------------------------------------------------------------------------------------------------------------
 Seq Scan on hosts_source  (cost=0.00..1364.44 rows=64 width=696) (actual time=0.829..11.086 rows=1 loops=1)
   Filter: ((data ->> 'hostname'::text) = 'battersea.uio.no'::text)
   Rows Removed by Filter: 12762
   Buffers: shared hit=1217
 Planning Time: 0.054 ms
 Execution Time: 11.120 ms
(6 rows)

PostgreSQL has no choice but to perform a sequential scan on the whole table. This is pretty fast, around 11ms, as the table is only 9.5MB. However, the query can run approx 250 times faster if we create a GIN index and rewrite the query to use the jsonb containment operator:

# create index hosts_source_gin_idx on hosts_source using GIN (data jsonb_path_ops);
CREATE INDEX
# explain(analyse,buffers) select data from hosts_source WHERE data @> '{"hostname": "battersea.uio.no"}';
                                                         QUERY PLAN                                                          
-----------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on hosts_source  (cost=8.01..10.02 rows=1 width=696) (actual time=0.020..0.021 rows=1 loops=1)
   Recheck Cond: (data @> '{"hostname": "battersea.uio.no"}'::jsonb)
   Heap Blocks: exact=1
   Buffers: shared hit=5
   ->  Bitmap Index Scan on hosts_source_gin_idx  (cost=0.00..8.01 rows=1 width=0) (actual time=0.013..0.013 rows=1 loops=1)
         Index Cond: (data @> '{"hostname": "battersea.uio.no"}'::jsonb)
         Buffers: shared hit=4
 Planning:
   Buffers: shared hit=16
 Planning Time: 0.190 ms
 Execution Time: 0.042 ms
(11 rows)

We can see that postgreSQL could use the index and required significantly less IO (buffers) than before.

Note the this will also speed up other queries (including updates and deletes) that search for hostname. For example:

zabbix-auto-config/zabbix_auto_config/processing.py

Line 182 in 13d42cf

    
           db_cursor.execute(f"SELECT data FROM {self.db_source_table} WHERE data->>'hostname' = %s AND data->'sources' ? %s", [host.hostname, source])

# explain(analyse,buffers) select data from hosts_source WHERE data @> '{"hostname": "battersea.uio.no"}' and data->'sources' ? 'nivlheim';
                                                         QUERY PLAN                                                          
-----------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on hosts_source  (cost=8.01..10.03 rows=1 width=696) (actual time=0.020..0.021 rows=1 loops=1)
   Recheck Cond: (data @> '{"hostname": "battersea.uio.no"}'::jsonb)
   Filter: ((data -> 'sources'::text) ? 'nivlheim'::text)
   Heap Blocks: exact=1
   Buffers: shared hit=5
   ->  Bitmap Index Scan on hosts_source_gin_idx  (cost=0.00..8.01 rows=1 width=0) (actual time=0.011..0.011 rows=1 loops=1)
         Index Cond: (data @> '{"hostname": "battersea.uio.no"}'::jsonb)
         Buffers: shared hit=4
 Planning:
   Buffers: shared hit=1
 Planning Time: 0.113 ms
 Execution Time: 0.037 ms
(12 rows)

Here, PostgreSQL could use the index to find all the rows matching the hostname first and then apply the filter on sources.

However this query will not use the index:

zabbix-auto-config/zabbix_auto_config/processing.py

Line 171 in 13d42cf

    
           db_cursor.execute(f"SELECT DISTINCT data->>'hostname' FROM {self.db_source_table} WHERE data->'sources' ? %s", [source])

It is possible to create an index that can be used here, eg

# create index hosts_source_gin_sources_idx on hosts_source using GIN ((data -> 'sources'));

But any performance improvement is somewhat dependent on how many rows can be discarded by the filter and how big the table is. The postgresql documentation says that if querying for particular items within the "sources" key is common, defining an index like this may be worthwhile. Some quick experiments led to 0x to 3-4x speed improvements depending on the search key used.

A final note is that on for example postgresql 14, jsonpath queries will use the jsonb_path_ops index created above.

The text was updated successfully, but these errors were encountered:

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Database tables do not use indexes #43

Database tables do not use indexes #43

cameronmurdoch commented Apr 25, 2022

Database tables do not use indexes #43

Database tables do not use indexes #43

Comments

cameronmurdoch commented Apr 25, 2022