
Added db.select_from_TABLE methods #1828

Open · wants to merge 35 commits into master
Conversation

@dsblank (Member) commented Dec 21, 2024

This PR adds methods designed to be implemented in a low-level DB system, like SQL. The human-facing code is all Python; the query expressions are written as strings and get parsed into SQL. This allows coders to write in the same syntax that is supported by the DataDict interface (minus the object-creation variation).

For example, you could select all of the male people with:

db.select_from_person(where="person.gender == Person.MALE")

(Person is defined in the environment in which the expression is evaluated, passed via the env parameter.)

By default, the methods return a DataDict per row, but you can optionally select a single attribute ("person.handle") or a list of attributes (["person.handle", "person.gramps_id"]) using the what parameter.

All arguments are optional.

Further Examples:

db.select_from_person(where="person.handle == 'A6E74B3D65D23F'")
db.select_from_person("person.handle", where="person.handle == 'A6E74B3D65D23F'")
db.select_from_person(
    what=["person.handle", "person.gramps_id"],
    where="person.handle == 'A6E74B3D65D23F'",
    order_by=[("person.gramps_id", "DESC")],
    env={"Person": Person},
)
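
As a further usage sketch (assuming an open db instance; a single-attribute select yields the bare value per row, as in the HasTag example later in this conversation):

from gramps.gen.lib import Person

# Collect the handles of all male people. Each row is the bare
# handle string, because a single attribute was requested.
male_handles = list(
    db.select_from_person(
        "person.handle",
        where="person.gender == Person.MALE",
        env={"Person": Person},
    )
)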

@dsblank changed the title from "Added db.dbi.select using ast" to "Added db.dbapi.select using ast" Dec 21, 2024
@dsblank requested a review from Nick-Hall December 22, 2024
@dsblank marked this pull request as ready for review December 26, 2024
@dsblank requested a review from stevenyoungs December 26, 2024
@dsblank (Member Author) commented Dec 26, 2024

I was not going to add the what parameter, but it takes a significant amount of time to return and json.loads() the entire object, rather than having SQL select just parts of it. Here we see that getting just the handle is 4 times faster than getting the whole JSON data.

In [8]: %%timeit
   ...: for data in db.select_from_person():
   ...:     pass
   ...: 
39.1 ms ± 438 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [9]: %%timeit
   ...: for data in db.select_from_person("person.handle"):
   ...:     pass
   ...: 
10.4 ms ± 305 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

@dsblank changed the title from "Added db.dbapi.select using ast" to "Added db.select_from_TABLE methods" Dec 26, 2024
@@ -2739,3 +2739,73 @@ def set_serializer(self, serializer_name):
self.serializer = BlobSerializer
elif serializer_name == "json":
self.serializer = JSONSerializer

def select_from_table(
    self, table_name, what=None, where=None, order_by=None, env=None
):
@stevenyoungs (Contributor) commented on the diff:
I know that currently you only pass a single table_name, but if you were to make this low-level method take an array of table_names, then I think everything exists to do cross-table selects.
With that, it would be possible to add a generic filter UI that allows the user to enter a query defined in terms of tables, what, where, and order_by strings. I readily acknowledge that this might be a more "advanced user" feature, but it would be really powerful.
It would also allow code to execute more complex queries directly within the DB ❤️
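
To make the idea concrete, something like this imagined syntax (purely hypothetical; neither select_from_tables nor the joined where clause exist in this PR):

# Imagined cross-table query: families whose father carries a given
# tag handle. All of this is sketch syntax, not an existing API.
rows = db.select_from_tables(
    ["person", "family"],
    what=["family.handle"],
    where=(
        "family.father_handle == person.handle "
        "and 'T0001' in person.tag_list"
    ),
)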

@dsblank (Member Author) replied:

@stevenyoungs, I like this idea! First, let's make sure that we can get this type of method added, and then think about expanding it.

(I'd like to see an idea of how you would represent a JOIN, but I don't want you to spend too much time if this PR gets rejected.)

@dsblank (Member Author) commented Dec 27, 2024

@Nick-Hall and other reviewers, I've been thinking for a very long time about how to add a select-style method to Gramps while keeping a Python interface. This PR represents the best that I can come up with.

Note that the SELECT fields, WHERE clause, and ORDER BY fields are all represented as strings of Python expressions. (I was hoping that they wouldn't be strings. I tried lambda functions, but couldn't reliably get the AST from the body of the lambda. I didn't want to use functions for all of the fields and clauses, because that would be a lot of overhead for the user. I also tried expressions like ("$.gramps_id", "==", "I005"), but that wasn't very Pythonic. So, this is my best solution given the constraints.)

The strings are parsed by Python into an abstract syntax tree (AST) that is then used to generate the SQL syntax. I wrote the Evaluator with different DB engines in mind, in case they use different syntax for JSON extraction, etc. The code is fairly minimal and low in complexity, to make it easy to maintain and extend.
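
For the curious, here is a minimal sketch of the general string-to-SQL approach (illustrative only, not the PR's actual Evaluator; json_extract assumes SQLite's JSON syntax, and the Person class is a stand-in for gramps.gen.lib.Person):

import ast

class Person:  # stand-in for gramps.gen.lib.Person
    MALE = 1

SQL_OPS = {ast.Eq: "=", ast.NotEq: "!=", ast.Lt: "<",
           ast.LtE: "<=", ast.Gt: ">", ast.GtE: ">="}

def expr_to_sql(node, env):
    # Translate a tiny subset of Python expressions into SQL.
    if isinstance(node, ast.Expression):
        return expr_to_sql(node.body, env)
    if isinstance(node, ast.Compare):
        left = expr_to_sql(node.left, env)
        op = SQL_OPS[type(node.ops[0])]
        right = expr_to_sql(node.comparators[0], env)
        return f"{left} {op} {right}"
    if isinstance(node, ast.Attribute) and isinstance(node.value, ast.Name):
        if node.value.id in env:
            # Person.MALE -> its literal value from the environment
            return repr(getattr(env[node.value.id], node.attr))
        # person.gender -> SQLite-style JSON extraction
        return f"json_extract(json_data, '$.{node.attr}')"
    if isinstance(node, ast.Constant):
        return repr(node.value)
    raise ValueError(f"unsupported expression: {ast.dump(node)}")

tree = ast.parse("person.gender == Person.MALE", mode="eval")
print(expr_to_sql(tree, {"Person": Person}))
# -> json_extract(json_data, '$.gender') = 1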

Let me know if you have concerns or ideas for improvement.

@dsblank (Member Author) commented Dec 27, 2024

@DavidMStraub this could serve as a replacement in gramps-web for both gramps-ql and object-ql, since it is converted into SQL. (It doesn't yet allow everything that the others do, though.)

@dsblank (Member Author) commented Dec 28, 2024

One thing that I realize this doesn't respect is filters and proxies. But I think that can be fixed. Some options:

  1. Fall back to using standard Gramps access if there is a proxy or filter
  2. After selection, check whether the object is accessible through the filter/proxy

Other ideas?

@dsblank (Member Author) commented Dec 28, 2024

@Nick-Hall, actually, I'm realizing that we have a bigger issue: if you have a proxy/filter in place, then you might not be able to access all of the items in the JSON data.

That means that:

person_data.family_list != person_object.family_list

if a family does not appear in the filter/proxy.

It could be that if we have a filter or proxy, we must force the DataDict to generate the object through methods like db.get_person_from_handle().
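
Roughly, that idea (a sketch; is_proxy() is a hypothetical check, used again in an example further down):

# Sketch: under a proxy/filter, resolve each row through the standard
# API so hidden references are respected; otherwise keep the fast path.
for data in db.select_from_person():
    person = db.get_person_from_handle(data.handle) if db.is_proxy() else data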

@DavidMStraub (Member) commented:

If I may... I think it's really great that so much refactoring and improvement is happening, but I find it a bit strange that so many things are merged so quickly without (sorry, at least my impression) considering all the implications (I was triggered by the example with proxies and filters), while at the same time my simple PR, which does nothing but enable static type checking, has been open for half a year. Static type checking would make the refactoring less dangerous.

@dsblank (Member Author) commented Dec 28, 2024

@DavidMStraub, nothing has been merged yet that has any effect on the implications I have raised above. The implications apply to the things being considered for merging. It would be great if we had more developers (like yourself) who would be able to comment on such implications. So, no, things aren't being merged "too quickly" and without thinking about consequences.

Working on what is next gives us insight into complex issues, so there's no need to get triggered by such a realization.

Regarding type checking: yes, I would have merged that PR many months ago, because I am very familiar with the benefits of typing and realize there are no downsides.

But also, the implication above is the realization that a "type" (e.g., Person) doesn't capture the essence of the issue. A Person object is an API over possibly altered and hidden properties, and that fact is hidden in the API. So a Person created directly from the data isn't the same kind of Person we get from the db methods. We might be able to create different types to catch such errors, but we don't even have the concepts for such types yet.

In any event, we need to refactor this PR, and the filter refactor PR. And probably adjust the DataDict class to make sure we don't access items that we shouldn't.

@stevenyoungs (Contributor) commented:

One of your optimisations is to keep the data in a DataDict unless the true object is required, but at least there you have the full set of data in memory. With this db.select_from_TABLE PR, you only have partial object data in memory. Therefore you have insufficient data to determine, in all cases, whether a record should be returned / "sanitized" by the proxy db.
From a quick scan, the set of ProxyDbBase classes are used in report / export scenarios where performance is (perhaps) less critical? Perhaps these proxy db classes could force db.select_from_TABLE to read all of the data required to correctly run the filter?
This would hopefully retain the performance gains in scenarios where no proxy is in use.

@Nick-Hall (Member) commented:

@dsblank This PR reminds me of the db.collection.find method in MongoDB. It may be worth a quick look if you are unfamiliar with it. You may get some ideas.

I like how you have made the query pythonic. This is better than previous SQL-like designs and the JSON queries of MongoDB.

@DavidMStraub We seem to have been discussing this on and off for about 7 or 8 years now, so I don't think that the progress is too fast. There have also been a couple of prototypes. The static type checking PR makes changes to 51 files. I tend to leave this type of change until fairly close to release in order to avoid potential conflicts when merging up fixes from the maintenance branch. Also the smaller changes tend to be easier to fit in when I have time available. Your PR is on my schedule though.

@stevenyoungs Yes. Proxies are mainly used in the report and export code. I don't mind if these are not optimised to use the new code, but we must make sure that they don't run significantly slower than at present. Some people already have to wait a long time for certain reports to run.

I don't regard this PR as essential for the next release, but it may be worth continuing to investigate our options.

@dsblank (Member Author) commented Dec 30, 2024

> because otherwise I suspect we'll be seeing a lot of AttributeErrors or alternatively have to sprinkle the code with assert hasattr

BTW, DataDict is a drop-in replacement in terms of attributes: DataDict objects have the same attribute fields as the Primary Objects.

(The problem is that DataDict objects can only be used as an interface to raw data. They can't be used otherwise.)
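
To illustrate the distinction (a sketch; gramps_id and handle are attribute fields, while get_primary_name() is a method that only the real object provides):

row = next(iter(db.select_from_person()))       # a DataDict row
gid = row.gramps_id             # attribute access works, as on Person

person = db.get_person_from_handle(row.handle)  # the true Person object
name = person.get_primary_name()                # methods exist only here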

@dsblank (Member Author) commented Dec 31, 2024

Here is an example of where we would need to be careful about falling back to a regular loop through the data in the case of a proxy:

If #1794 is merged, it has an optimization for looking for rule.map sets (in a particular scenario). Consider the HasTag rule. Currently, it requires looping over all objects, O(N), even if only 5 out of a million people are tagged.

But we could add code to the prepare() method of the rule like:

    rows = db.select_from_person(
        what="person.handle", 
        where=f"'{self.tag_handle}' in person.tag_list"
    )
    self.map = set([handle for handle in rows])

But if db is a proxy, that would fall back to another loop over all objects, making it O(2N). To prevent that, the code above raises an exception when db is a proxy.

A fix is to add:

if not db.is_proxy():
    rows = db.select_from_person(
        what="person.handle", 
        where=f"'{self.tag_handle}' in person.tag_list",
    )
    self.map = set([handle for handle in rows])

then, when db is a proxy, the rule.map is not created and the standard rule.apply_to_one() does the regular check.

Here are the time comparisons (seconds) without and with the select/map:

| Scenario | Prepare time | Apply time | Total time |
| --- | --- | --- | --- |
| Gramps 5.2 | 0.00 | 1.47 | 1.47 |
| Gramps 6.0 + select + optimizer | 0.21 | 0.02 | 0.23 |

Finally, if you wanted to force a loop in a proxy, you could still do this:

    rows = db.select_from_person(
        what="person.handle",
        where=f"'{self.tag_handle}' in person.tag_list",
        allow_use_in_proxy=True,
    )

@dsblank (Member Author) commented Dec 31, 2024

Bah! @stevenyoungs pointed out that the proxies properly process raw data. All my worrying above, and some of my comments about DataDict and raw data, are wrong. DataDict should be perfectly fine to use throughout Gramps to save a bit of time. (Well, except in proxies, where it probably takes a bit longer... not sure.)

@stevenyoungs (Contributor) commented:

> Bah! @stevenyoungs pointed out that the proxies properly process raw data. All my worrying above, and some of my comments about DataDict and raw data, are wrong. DataDict should be perfectly fine to use throughout Gramps to save a bit of time. (Well, except in proxies, where it probably takes a bit longer... not sure.)

The get_raw_* functions in a proxy look like they create the true object in order to do the filtering. So I agree: not efficient, but it should give the correct result.
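
That is, something along these lines (a simplified sketch of the pattern, not the actual ProxyDbBase code; the include test is hypothetical):

def get_raw_person_data(self, handle):
    # Simplified: the proxy materializes the full object just to
    # decide whether the raw record may be returned.
    data = self.db.get_raw_person_data(handle)
    person = Person.create(data)        # build the true object
    if self.include_person(person):     # hypothetical filtering test
        return data
    return None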

@dsblank (Member Author) commented Jan 3, 2025

> The get_raw_* functions in a proxy look like they create the true object in order to do the filtering. So I agree: not efficient, but it should give the correct result.

#1839 will allow efficient get_raw_* functions in proxies.
