
Added db.select_from_TABLE methods #1828

Open · wants to merge 35 commits into master
Conversation

@dsblank (Member) commented Dec 21, 2024

This PR adds methods designed to be implemented in a low-level DB system, like SQL. The human-facing code is all Python; the query expressions are written as strings and get parsed into SQL. This allows coders to write in the same syntax that is supported by the DataDict interface (minus the object-creation variation).

For example, you could select all of the male people with:

db.select_from_person(where="person.gender == Person.MALE")

(Person is defined in the environment in which the expression is evaluated, passed via the env parameter.)

By default, the methods return a DataDict per row, but you can optionally select a single attribute ("person.handle") or a list of attributes (["person.handle", "person.gramps_id"]) using the what parameter.

All arguments are optional.

Further Examples:

db.select_from_person(where="person.handle == 'A6E74B3D65D23F'")
db.select_from_person("person.handle", where="person.handle == 'A6E74B3D65D23F'")
db.select_from_person(
    what=["person.handle", "person.gramps_id"],
    where="person.handle == 'A6E74B3D65D23F'",
    order_by=[("person.gramps_id", "DESC")],
    env={"Person": Person},
)
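
As a further usage sketch (assuming an open db instance; a single-attribute select yields the bare value per row, as in the HasTag example later in this conversation):

from gramps.gen.lib import Person

# Collect the handles of all male people. Each row is the bare
# handle string, because a single attribute was requested.
male_handles = list(
    db.select_from_person(
        "person.handle",
        where="person.gender == Person.MALE",
        env={"Person": Person},
    )
)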

@dsblank changed the title from "Added db.dbi.select using ast" to "Added db.dbapi.select using ast" Dec 21, 2024
@dsblank requested a review from Nick-Hall December 22, 2024
@dsblank marked this pull request as ready for review December 26, 2024
@dsblank requested a review from stevenyoungs December 26, 2024
@dsblank (Member Author) commented Dec 26, 2024

I was not going to add the what parameter, but it takes a significant amount of time to return and json.loads() the entire object, rather than having SQL select just parts of it. Here we see that getting just the handle is 4 times faster than getting the whole JSON data.

In [8]: %%timeit
   ...: for data in db.select_from_person():
   ...:     pass
   ...: 
39.1 ms ± 438 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [9]: %%timeit
   ...: for data in db.select_from_person("person.handle"):
   ...:     pass
   ...: 
10.4 ms ± 305 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

@dsblank changed the title from "Added db.dbapi.select using ast" to "Added db.select_from_TABLE methods" Dec 26, 2024
@@ -2739,3 +2739,73 @@ def set_serializer(self, serializer_name):
self.serializer = BlobSerializer
elif serializer_name == "json":
self.serializer = JSONSerializer

def select_from_table(
    self, table_name, what=None, where=None, order_by=None, env=None
):
@stevenyoungs (Contributor) commented on the diff:
I know that currently you only pass a single table_name, but if you were to make this low-level method take an array of table_names, then I think everything exists to do cross-table selects.
With that, it would be possible to add a generic filter UI that allows the user to enter a query defined in terms of tables, what, where, and order_by strings. I readily acknowledge that this might be a more "advanced user" feature, but it would be really powerful.
It would also allow code to execute more complex queries directly within the DB ❤️
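
To make the idea concrete, something like this imagined syntax (purely hypothetical; neither select_from_tables nor the joined where clause exist in this PR):

# Imagined cross-table query: families whose father carries a given
# tag handle. All of this is sketch syntax, not an existing API.
rows = db.select_from_tables(
    ["person", "family"],
    what=["family.handle"],
    where=(
        "family.father_handle == person.handle "
        "and 'T0001' in person.tag_list"
    ),
)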

@dsblank (Member Author) replied:

@stevenyoungs, I like this idea! First, let's make sure that we can get this type of method added, and then think about expanding it.

(I'd like to see an idea of how you would represent a JOIN, but I don't want you to spend too much time if this PR gets rejected.)

@dsblank (Member Author) commented Dec 27, 2024

@Nick-Hall and other reviewers, I've been thinking for a very long time about how to add a select-style method to Gramps while keeping a Python interface. This PR represents the best that I can come up with.

Note that the SELECT fields, WHERE clause, and ORDER BY fields are all represented as strings of Python expressions. (I was hoping that they wouldn't be strings. I tried lambda functions, but couldn't reliably get the AST from the body of the lambda. I didn't want to use functions for all of the fields and clauses, because that would be a lot of overhead for the user. I also tried expressions like ("$.gramps_id", "==", "I005"), but that wasn't very Pythonic. So, this is my best solution given the constraints.)

The strings are parsed by Python into an abstract syntax tree (AST) that is then used to generate the SQL syntax. I wrote the Evaluator with different DB engines in mind, in case they use different syntax for JSON extraction, etc. The code is fairly minimal and low in complexity, to make it easy to maintain and extend.
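
For the curious, here is a minimal sketch of the general string-to-SQL approach (illustrative only, not the PR's actual Evaluator; json_extract assumes SQLite's JSON syntax, and the Person class is a stand-in for gramps.gen.lib.Person):

import ast

class Person:  # stand-in for gramps.gen.lib.Person
    MALE = 1

SQL_OPS = {ast.Eq: "=", ast.NotEq: "!=", ast.Lt: "<",
           ast.LtE: "<=", ast.Gt: ">", ast.GtE: ">="}

def expr_to_sql(node, env):
    # Translate a tiny subset of Python expressions into SQL.
    if isinstance(node, ast.Expression):
        return expr_to_sql(node.body, env)
    if isinstance(node, ast.Compare):
        left = expr_to_sql(node.left, env)
        op = SQL_OPS[type(node.ops[0])]
        right = expr_to_sql(node.comparators[0], env)
        return f"{left} {op} {right}"
    if isinstance(node, ast.Attribute) and isinstance(node.value, ast.Name):
        if node.value.id in env:
            # Person.MALE -> its literal value from the environment
            return repr(getattr(env[node.value.id], node.attr))
        # person.gender -> SQLite-style JSON extraction
        return f"json_extract(json_data, '$.{node.attr}')"
    if isinstance(node, ast.Constant):
        return repr(node.value)
    raise ValueError(f"unsupported expression: {ast.dump(node)}")

tree = ast.parse("person.gender == Person.MALE", mode="eval")
print(expr_to_sql(tree, {"Person": Person}))
# -> json_extract(json_data, '$.gender') = 1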

Let me know if you have concerns or ideas for improvement.

@dsblank (Member Author) commented Dec 27, 2024

@DavidMStraub this could serve as a replacement in gramps-web for both gramps-ql and object-ql, since it is converted into SQL. (It doesn't yet allow everything that the others do, though.)

@dsblank (Member Author) commented Dec 28, 2024

One thing that I realize this doesn't respect is filters and proxies. But I think that can be fixed. Some options:

  1. Fall back to using standard Gramps access if there is a proxy or filter
  2. After selection, check whether the object is accessible through the filter/proxy

Other ideas?

@dsblank (Member Author) commented Dec 28, 2024

@Nick-Hall, actually, I'm realizing that we have a bigger issue: if you have a proxy/filter in place, then you might not be able to access all of the items in the JSON data.

That means that:

person_data.family_list != person_object.family_list

if a family does not appear in the filter/proxy.

It could be that if we have a filter or proxy, we must force the DataDict to generate the object through methods like db.get_person_from_handle().
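
Roughly, that idea (a sketch; is_proxy() is a hypothetical check, used again in an example further down):

# Sketch: under a proxy/filter, resolve each row through the standard
# API so hidden references are respected; otherwise keep the fast path.
for data in db.select_from_person():
    person = db.get_person_from_handle(data.handle) if db.is_proxy() else data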

@DavidMStraub (Member) commented:

If I may... I think it's really great that so much refactoring and improvement is happening, but I find it a bit strange that so many things are merged so quickly without (sorry, at least my impression) considering all the implications (I was triggered by the example with proxies and filters), while at the same time my simple PR, which does nothing but enable static type checking, has been open for half a year. Static type checking would make the refactoring less dangerous.

@dsblank (Member Author) commented Dec 28, 2024

@DavidMStraub, nothing has been merged yet that has any effect on the implications I have raised above. The implications apply to the things being considered for merging. It would be great if we had more developers (like yourself) who would be able to comment on such implications. So, no, things aren't being merged "too quickly" and without thinking about consequences.

Working on what is next gives us insight into complex issues, so there's no need to get triggered by such a realization.

Regarding type checking: yes, I would have merged that PR many months ago, because I am very familiar with the benefits of typing and realize there are no downsides.

But also, the implication above is the realization that a "type" (e.g., Person) doesn't capture the essence of the issue. A Person object is an API over possibly altered and hidden properties, and that fact is hidden in the API. So a Person created directly from the data isn't the same kind of Person we get from the db methods. We might be able to create different types to catch such errors, but we don't even have the concepts for such types yet.

In any event, we need to refactor this PR, and the filter refactor PR. And probably adjust the DataDict class to make sure we don't access items that we shouldn't.

@stevenyoungs (Contributor) commented:

One of your optimisations is to keep the data in a DataDict unless the true object is required, but at least there you have the full set of data in memory. With this db.select_from_TABLE PR, you only have partial object data in memory. Therefore you have insufficient data to determine, in all cases, whether a record should be returned / "sanitized" by the proxy db.
From a quick scan, the set of ProxyDbBase classes are used in report / export scenarios where performance is (perhaps) less critical? Perhaps these proxy db classes could force db.select_from_TABLE to read all of the data required to correctly run the filter?
This would hopefully retain the performance gains in scenarios where no proxy is in use.

@Nick-Hall (Member) commented:

@dsblank This PR reminds me of the db.collection.find method in MongoDB. It may be worth a quick look if you are unfamiliar with it. You may get some ideas.

I like how you have made the query pythonic. This is better than previous SQL-like designs and the JSON queries of MongoDB.

@DavidMStraub We seem to have been discussing this on and off for about 7 or 8 years now, so I don't think that the progress is too fast. There have also been a couple of prototypes. The static type checking PR makes changes to 51 files. I tend to leave this type of change until fairly close to release in order to avoid potential conflicts when merging up fixes from the maintenance branch. Also the smaller changes tend to be easier to fit in when I have time available. Your PR is on my schedule though.

@stevenyoungs Yes. Proxies are mainly used in the report and export code. I don't mind if these are not optimised to use the new code, but we must make sure that they don't run significantly slower than at present. Some people already have to wait a long time for certain reports to run.

I don't regard this PR as essential for the next release, but it may be worth continuing to investigate our options.

@dsblank (Member Author) commented Dec 30, 2024

> because otherwise I suspect we'll be seeing a lot of AttributeErrors or alternatively have to sprinkle the code with assert hasattr

BTW, DataDict is a drop-in replacement in terms of attributes: DataDict objects have the same attribute fields as the Primary Objects.

(The problem is that DataDict objects can only be used as an interface to raw data. They can't be used otherwise.)
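
To illustrate the distinction (a sketch; gramps_id and handle are attribute fields, while get_primary_name() is a method that only the real object provides):

row = next(iter(db.select_from_person()))       # a DataDict row
gid = row.gramps_id             # attribute access works, as on Person

person = db.get_person_from_handle(row.handle)  # the true Person object
name = person.get_primary_name()                # methods exist only here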

@dsblank (Member Author) commented Dec 31, 2024

Here is an example of where we would need to be careful about falling back to a regular loop through the data in the case of a proxy:

If #1794 is merged, it has an optimization for looking for rule.map sets (in a particular scenario). Consider the HasTag rule. Currently, it requires looping over all objects, O(N), even if only 5 out of a million people are tagged.

But we could add code to the prepare() method of the rule like:

    rows = db.select_from_person(
        what="person.handle", 
        where=f"'{self.tag_handle}' in person.tag_list"
    )
    self.map = set([handle for handle in rows])

But if db is a proxy, that would fall back to another loop over all objects, making it O(2N). To prevent that, the code above raises an exception when db is a proxy.

A fix is to add:

if not db.is_proxy():
    rows = db.select_from_person(
        what="person.handle", 
        where=f"'{self.tag_handle}' in person.tag_list",
    )
    self.map = set([handle for handle in rows])

then, when db is a proxy, the rule.map is not created and the standard rule.apply_to_one() does the regular check.

Here are the time comparisons (seconds) without and with the select/map:

| Scenario | Prepare time | Apply time | Total time |
| --- | --- | --- | --- |
| Gramps 5.2 | 0.00 | 1.47 | 1.47 |
| Gramps 6.0 + select + optimizer | 0.21 | 0.02 | 0.23 |

Finally, if you wanted to force a loop in a proxy, you could still do this:

    rows = db.select_from_person(
        what="person.handle",
        where=f"'{self.tag_handle}' in person.tag_list",
        allow_use_in_proxy=True,
    )

@dsblank (Member Author) commented Dec 31, 2024

Bah! @stevenyoungs pointed out that the proxies properly process raw data. All my worrying above, and some of my comments about DataDict and raw data, are wrong. DataDict should be perfectly fine to use throughout Gramps to save a bit of time. (Well, except in proxies, where it probably takes a bit longer... not sure.)

@stevenyoungs (Contributor) commented:

> Bah! @stevenyoungs pointed out that the proxies properly process raw data. All my worrying above, and some of my comments about DataDict and raw data, are wrong. DataDict should be perfectly fine to use throughout Gramps to save a bit of time. (Well, except in proxies, where it probably takes a bit longer... not sure.)

The get_raw_* functions in a proxy look like they create the true object in order to do the filtering. So I agree: not efficient, but it should give the correct result.
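
That is, something along these lines (a simplified sketch of the pattern, not the actual ProxyDbBase code; the include test is hypothetical):

def get_raw_person_data(self, handle):
    # Simplified: the proxy materializes the full object just to
    # decide whether the raw record may be returned.
    data = self.db.get_raw_person_data(handle)
    person = Person.create(data)        # build the true object
    if self.include_person(person):     # hypothetical filtering test
        return data
    return None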

@dsblank (Member Author) commented Jan 3, 2025

> The get_raw_* functions in a proxy look like they create the true object in order to do the filtering. So I agree: not efficient, but it should give the correct result.

#1839 will allow efficient get_raw_* functions in proxies.
