
feat: Add "latest" and "related" search. #2055

Draft
wants to merge 1 commit into main
Conversation


@jimsynz jimsynz commented Jan 21, 2025

Here's my first stab at adding "latest" and "related" search, as per #2013.

I'm keen for early feedback, thus this draft PR.

I have a few questions:

  1. Is there a standard function in the JS (I couldn't find one) to generate a link to hexdocs for a specific package name, version and ref? I feel iffy about just having string interpolation in the code.
  2. It looks like it might be difficult (but not impossible) to correctly generate the metadata (and thus the excerpts) from the response payload. Each hit contains a list of highlight snippets, however it treats underscores as token separators, so searching for to_ast returns results for to ast.
  3. It seems to return the same hits over and over again. Each returned hit has a different document ID, so I think it's actually an index problem, not a search problem. I can dedup them by ref, but I'd rather the search index returned better results.

I haven't yet changed any of the markup as I wanted to get this step right before I go adding things like package names/versions to results, etc.

(Screenshot: typesense search)

@josevalim (Member)

> Is there a standard function in the JS (I couldn't find one) to generate a link to hexdocs for a specific package name, version and ref? I feel iffy about just having string interpolation in the code.

There isn't, because this is the first time we're doing something like this. However, since it comes from hexdocs, I believe it is fair to assume a naming scheme.
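
For illustration, such a helper could be as small as the sketch below. The function name and signature are hypothetical, not an existing API; it assumes the usual hexdocs.pm layout of https://hexdocs.pm/<package>/<version>/<ref>.

```js
// Hypothetical helper (name and signature are assumptions, not part of the current JS).
// Assumes the conventional hexdocs.pm layout, where `ref` is the page plus anchor,
// e.g. "String.html#to_integer/1".
function hexdocsUrl (packageName, version, ref) {
  return `https://hexdocs.pm/${packageName}/${version}/${ref}`
}
```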

> Each hit contains a list of highlight snippets, however it treats underscores as token separators, so searching for to_ast returns results for to ast.

@ruslandoga, is this something we can change in the indexer?

> It seems to return the same hits over and over again. Each returned hit has a different document ID, so I think it's actually an index problem, not a search problem. I can dedup them by ref, but I'd rather the search index returned better results.

The reason it returns the same results over and over is that we break the documentation of a single function into multiple entries, one per h2, to be more precise. The benefit is that we can provide more precise links too. For now, I don't think we dedup them for regular results, right? So I would not dedup them here, but we can dedup them later if we want to (we could even render it like Google results, where we show the main entry and, below it, link to specific sections in the result).
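
If we do decide to dedup later, keeping only the top hit per ref could look roughly like the sketch below; the hit/document shape here is an assumption based on the Typesense response format, not the actual index schema.

```js
// Sketch only: keep the first (highest-ranked) hit for each ref.
// Assumes each Typesense hit exposes the indexed document under `hit.document`
// and that the document carries a `ref` field.
function dedupByRef (hits) {
  const seen = new Set()
  return hits.filter(hit => {
    if (seen.has(hit.document.ref)) return false
    seen.add(hit.document.ref)
    return true
  })
}
```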


This is fantastic progress. :)

@garazdawi (Contributor)

FYI, in the Algolia setup for https://erlang.org/doc/search we disable _ as punctuation, just as proposed here, for the same reasons.

@ruslandoga commented Jan 22, 2025

👋

I couldn't find any options to make highlighting ignore the token separators. I'll try asking the Typesense people whether it's supported or planned.

For now, I think we might need to roll our own highlighter, possibly similar to the one used in autocomplete:

```js
/**
 * Returns an HTML string highlighting the individual tokens from the query string.
 */
function highlightMatches (text, query) {
  // Sort terms by length, so that the longest are highlighted first.
  const terms = tokenize(query).sort((term1, term2) => term2.length - term1.length)
  return highlightTerms(text, terms)
}

function highlightTerms (text, terms) {
  if (terms.length === 0) return text

  const [firstTerm, ...otherTerms] = terms
  const match = text.match(new RegExp(`(.*)(${escapeRegexModifiers(firstTerm)})(.*)`, 'i'))

  if (match) {
    const [, before, matching, after] = match
    // Note: this has exponential complexity, but we expect just a few terms, so that's fine.
    return highlightTerms(before, terms) + '<em>' + escapeHtmlEntities(matching) + '</em>' + highlightTerms(after, terms)
  } else {
    return highlightTerms(text, otherTerms)
  }
}
```
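
Used on a remote hit it would look something like the line below; the `doc` field name is an assumption, the real payload may differ.

```js
// Sketch: highlight the query inside the hit's plain-text excerpt ourselves,
// instead of relying on Typesense's highlight snippets.
const excerpt = highlightMatches(hit.document.doc, query)
```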


Removing _ from the current token_separators somewhat breaks non-prefix search. For example, searching for upload stops returning allow_upload. Typesense does have a way around it with the infix query option, but I couldn't make it work well. We can try it again :)
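
Roughly, the infix variant would mean adding the infix search parameter to the query, as in the sketch below. It also requires infix: true on the field in the collection schema, and the query_by fields here are just placeholders, not the actual index fields.

```js
// Sketch: 'fallback' tries normal prefix matching first and only falls back to
// infix matching when there are no results. Requires `infix: true` on the field
// in the Typesense collection schema.
const params = {
  q: 'upload',
  query_by: 'title,doc', // placeholder field names
  infix: 'fallback'
}
```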

```js
const searchNodes = getSearchNodes()

if (['related', 'latest'].includes(queryType) && searchNodes.length > 0) {
  results = await remoteSearch(value, queryType, searchNodes)
```
@ruslandoga Jan 22, 2025

Just a couple nitpicks :)

Can we have a race condition here, where a previous request returns after the current one and overwrites the items with stale results? I think it's possible with multiple HTTP/1.1 connections, but I'm not sure about multiple streams on the same HTTP/2 connection; are they guaranteed to be ordered? Or maybe the JS runtime resolves it in some way?

Also, do we need to debounce on remote search, or check for response.ok and results.length > 0?

For some reason I decided to do these things in ruslandoga#1, but I don't remember whether I actually had these problems or was just playing it safe...
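
One simple way to guard against the stale-response case would be a request counter (an AbortController would work too); a sketch only, with renderResults standing in for however the items actually get updated:

```js
// Sketch: ignore responses that arrive after a newer query has been issued.
let lastRequestId = 0

async function runRemoteSearch (value, queryType, searchNodes) {
  const requestId = ++lastRequestId
  const results = await remoteSearch(value, queryType, searchNodes)
  // A newer request started while this one was in flight; drop the stale results.
  if (requestId !== lastRequestId) return
  renderResults(results) // hypothetical render step
}
```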

@jimsynz (Author)

Thanks, I'm sure you're right. As you can probably tell, it's been almost a decade since I wrote any JavaScript, so I'm still getting the hang of the new idioms.

```js
  filterNodes = searchNodes.slice(0, 1)
}

const filters = filterNodes.map(node => `package:=${node.name}-${node.version}`).join(' || ')
```
@ruslandoga Jan 22, 2025

Another nitpick: unless we start using custom URLs like the one mentioned in hexpm/hexdocs#49 (comment), an array filter like this might help keep the URL shorter (no package:= repetition):

```js
if (nodes && nodes.length > 0) {
  const packages = nodes.map((node) => `${node.name}-${node.version}`);
  params.filter_by = `package:=[${packages.join(",")}]`;
}
```

Adapted from Typesense search PoC: https://gist.github.com/ruslandoga/d544addc4a17e1f4853d7e9ae97818a4

@jimsynz (Author)

Thanks, will do.
