
feat: Add "latest" and "related" search. #2055

Draft
wants to merge 1 commit into main
Conversation


@jimsynz jimsynz commented Jan 21, 2025

Here's my first stab at adding "latest" and "related" search, as per #2013.

I'm keen for early feedback, thus this draft PR.

I have a few questions:

  1. Is there a standard function in the JS (I couldn't find one) to generate a link to hexdocs for a specific package name, version and ref? I feel iffy about just having string interpolation in the code.
  2. It looks like it might be difficult (but not impossible) to correctly generate the metadata (and thus the excerpts) from the response payload. Each hit contains a list of highlight snippets, however it treats underscores as token separators, so searching for to_ast returns results for to ast.
  3. It seems to return the same hits over and over again. Each returned hit has a different document ID, so I think it's actually an index problem, not a search problem. I can dedup them by ref, but I'd rather the search index returned better results.

I haven't yet changed any of the markup as I wanted to get this step right before I go adding things like package names/versions to results, etc.

(Screenshot: typesense search)

@josevalim (Member)

> Is there a standard function in the JS (I couldn't find one) to generate a link to hexdocs for a specific package name, version and ref? I feel iffy about just having string interpolation in the code.

There isn't, because this is the first time we're doing something like this. However, since it comes from hexdocs, I believe it is fair to assume a naming scheme.
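
For illustration, such a helper could be as small as the sketch below. The function name and signature are hypothetical, not an existing API; it assumes the usual hexdocs.pm layout of https://hexdocs.pm/<package>/<version>/<ref>.

```js
// Hypothetical helper (name and signature are assumptions, not part of the current JS).
// Assumes the conventional hexdocs.pm layout, where `ref` is the page plus anchor,
// e.g. "String.html#to_integer/1".
function hexdocsUrl (packageName, version, ref) {
  return `https://hexdocs.pm/${packageName}/${version}/${ref}`
}
```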

> Each hit contains a list of highlight snippets, however it treats underscores as token separators, so searching for to_ast returns results for to ast.

@ruslandoga, is this something we can change in the indexer?

> It seems to return the same hits over and over again. Each returned hit has a different document ID, so I think it's actually an index problem, not a search problem. I can dedup them by ref, but I'd rather the search index returned better results.

The reason it returns the same results over and over is that we break the documentation of a single function into multiple entries, one per h2, to be more precise. The benefit is that we can provide more precise links too. For now, I don't think we dedup them for regular results, right? So I would not dedup them here, but we can dedup them later if we want to (we could even render it like Google results, where we show the main entry and, below it, link to specific sections in the result).
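
If we do decide to dedup later, keeping only the top hit per ref could look roughly like the sketch below; the hit/document shape here is an assumption based on the Typesense response format, not the actual index schema.

```js
// Sketch only: keep the first (highest-ranked) hit for each ref.
// Assumes each Typesense hit exposes the indexed document under `hit.document`
// and that the document carries a `ref` field.
function dedupByRef (hits) {
  const seen = new Set()
  return hits.filter(hit => {
    if (seen.has(hit.document.ref)) return false
    seen.add(hit.document.ref)
    return true
  })
}
```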


This is fantastic progress. :)

@garazdawi (Contributor)

FYI, in the Algolia setup for https://erlang.org/doc/search we disable _ as punctuation, just as proposed here, for the same reasons.

@ruslandoga commented Jan 22, 2025

👋

I couldn't find any options to make highlighting ignore the token separators. I'll try asking the Typesense people whether it's supported or planned.

For now, I think we might need to roll our own highlighter, possibly similar to the one used in autocomplete:

```js
/**
 * Returns an HTML string highlighting the individual tokens from the query string.
 */
function highlightMatches (text, query) {
  // Sort terms by length, so that the longest are highlighted first.
  const terms = tokenize(query).sort((term1, term2) => term2.length - term1.length)
  return highlightTerms(text, terms)
}

function highlightTerms (text, terms) {
  if (terms.length === 0) return text

  const [firstTerm, ...otherTerms] = terms
  const match = text.match(new RegExp(`(.*)(${escapeRegexModifiers(firstTerm)})(.*)`, 'i'))

  if (match) {
    const [, before, matching, after] = match
    // Note: this has exponential complexity, but we expect just a few terms, so that's fine.
    return highlightTerms(before, terms) + '<em>' + escapeHtmlEntities(matching) + '</em>' + highlightTerms(after, terms)
  } else {
    return highlightTerms(text, otherTerms)
  }
}
```
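
Used on a remote hit it would look something like the line below; the `doc` field name is an assumption, the real payload may differ.

```js
// Sketch: highlight the query inside the hit's plain-text excerpt ourselves,
// instead of relying on Typesense's highlight snippets.
const excerpt = highlightMatches(hit.document.doc, query)
```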


Removing _ from the current token_separators somewhat breaks non-prefix search. For example, searching for upload stops returning allow_upload. Typesense does have a way around it with the infix query option, but I couldn't make it work well. We can try it again :)
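
Roughly, the infix variant would mean adding the infix search parameter to the query, as in the sketch below. It also requires infix: true on the field in the collection schema, and the query_by fields here are just placeholders, not the actual index fields.

```js
// Sketch: 'fallback' tries normal prefix matching first and only falls back to
// infix matching when there are no results. Requires `infix: true` on the field
// in the Typesense collection schema.
const params = {
  q: 'upload',
  query_by: 'title,doc', // placeholder field names
  infix: 'fallback'
}
```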

```js
const searchNodes = getSearchNodes()

if (['related', 'latest'].includes(queryType) && searchNodes.length > 0) {
  results = await remoteSearch(value, queryType, searchNodes)
```
@ruslandoga Jan 22, 2025

Just a couple nitpicks :)

Can we have a race condition here, where a previous request returns after the current one and overwrites the items with stale results? I think it's possible with multiple HTTP/1.1 connections, but I'm not sure about multiple streams on the same HTTP/2 connection; are they guaranteed to be ordered? Or maybe the JS runtime resolves it in some way?

Also, do we need to debounce on remote search, or check for response.ok and results.length > 0?

For some reason I decided to do these things in ruslandoga#1, but I don't remember whether I actually had these problems or was just playing it safe...
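
One simple way to guard against the stale-response case would be a request counter (an AbortController would work too); a sketch only, with renderResults standing in for however the items actually get updated:

```js
// Sketch: ignore responses that arrive after a newer query has been issued.
let lastRequestId = 0

async function runRemoteSearch (value, queryType, searchNodes) {
  const requestId = ++lastRequestId
  const results = await remoteSearch(value, queryType, searchNodes)
  // A newer request started while this one was in flight; drop the stale results.
  if (requestId !== lastRequestId) return
  renderResults(results) // hypothetical render step
}
```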

@jimsynz (Author)

Thanks, I'm sure you're right. As you can probably tell, it's been almost a decade since I wrote any JavaScript, so I'm still getting the hang of the new idioms.

```js
  filterNodes = searchNodes.slice(0, 1)
}

const filters = filterNodes.map(node => `package:=${node.name}-${node.version}`).join(' || ')
```
@ruslandoga Jan 22, 2025

Another nitpick: unless we start using custom URLs like the one mentioned in hexpm/hexdocs#49 (comment), an array filter like this might help keep the URL shorter (no package:= repetition):

```js
if (nodes && nodes.length > 0) {
  const packages = nodes.map((node) => `${node.name}-${node.version}`);
  params.filter_by = `package:=[${packages.join(",")}]`;
}
```

Adapted from Typesense search PoC: https://gist.github.com/ruslandoga/d544addc4a17e1f4853d7e9ae97818a4

@jimsynz (Author)

Thanks, will do.
