Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ICU (Unicode) support? #145

Open
mindplay-dk opened this issue Dec 26, 2024 · 3 comments
Open

ICU (Unicode) support? #145

mindplay-dk opened this issue Dec 26, 2024 · 3 comments

Comments

@mindplay-dk
Copy link

Is there any particular reason you ship a binary without Unicode (ICU) support?

You ship with the full-text search extension, which makes sense - but how is full-text search any use without Unicode support? Literally all the text on the internet is in Unicode format.

I can understand shipping SQLite without Unicode support for things like internal databases in apps, or the system registry in your computer or phone - but (I would assume) a primary reason people are interested in SQLite in Deno, is to build microservices, or smaller/simpler web apps, and I can't imagine what those services or apps would even be doing that doesn't require Unicode support?

Maybe this is a cultural divide of some sort? But I swear, I have researched this topic everywhere, and the only explanation I can find is, the ICU extension is "big" and makes some things "slow" - and I can understand this position from SQLite itself, being designed and optimized for use as an embedded database.

But shipping this as a Deno package is very different from the SQLite default use-case, I think? Are we more concerned about being "small" and "fast", or being actually useful?

I understand it's possible to build the extension from source and load it myself, but I'm just a humble web developer, and Unicode support seems very much like something basically everyone would need?

I can't understand how people are building anything with SQLite without Unicode?

Or are they all going out on their own to solve this problem first? am I just dumb or lazy? 😅

@mindplay-dk
Copy link
Author

To the best of my understanding, SQLite should at least support storing Unicode UTF-8 strings?

So I attempted to run a simple query SELECT upper('abc æøå') just to see what happens, and:

Screenshot 2024-12-26 153559

I also tried running PRAGMA encoding='UTF-8' first, same result.

I then swapped to npm:libsql, which runs the query, but of course return ABC æøå, since there's no ICU support there either.

I found this article, which sums up 5 ways to handle Unicode, all of them slow or impractical in various ways. So is custom collations and bad performance what people put up with in JS? Or does everyone build their own binaries from source?

@michaelmanus
Copy link

michaelmanus commented Dec 27, 2024

I was able to reproduce your issue with a straight sql string - db.prepare("select * from table where somerow = '🦄'").run();

However, if it is parameterized, there is no problem. db.prepare("select * from table where somerow = ?").run(["🦄"]);

So this means it is practically not a big deal.

sqlite's CLI does allow unicode literals, though, so I suppose it should be supported.

@mindplay-dk
Copy link
Author

Can confirm, it works with parameter binding. 👍

sqlite's CLI does allow unicode literals, though, so I suppose it should be supported.

I think most clients allow it? I guess queries containing unicode literals might be rare - I just did it for testing. 🙂

I guess I've kind of mixed two different issues into one though.

Really the main thing that baffles me is why no client ships with something as basic and essential as Unicode support?

I know ICU is "big" and makes things "slower". I also understand there are some issues with OS kernel differences, leading to practical subtleties around DB file portability. Yet, the internet runs on Unicode - everything right down to something as trivial as JSON requires Unicode.

So I just spent another half day on this, and finally came up with a tolerable solution - this relies on a lot of fragile, user-supported tooling though.

Do you think it would make sense to bundle some loadable extensions like nalgeon/text with the package? (I read that the official icu extension is "big", but this one is just ~64KB.)

Alternatively, would it make sense to integrate sqlpkg somehow?

Perhaps providing a function like e.g. getExtension("nalgeon/text") which would download and install sqlpkg locally in the project, run sqlpkg init and sqlpkg install nalgeon/text, and then return a Promise<string> with the resolved path, e.g. .sqlpkg/nalgeon/text/text.so.

Alternatively some maybe some way to literally import from e.g. @sqlpkg/nalgeon/text to get an installed package and path? Obviously that's way outside the scope of this project, I'm just airing the idea in case you find this approach preferable to including a loadable extension.

I'm just wondering how we can save the next person from a day or two of adventuring through a long list of half solutions and bad ideas before discovering how to do something as basic as just searching and sorting by Unicode strings. 😅

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants