feat: add "returning" search option to select only specified fields from a document #770
@fasenderos is attempting to deploy a commit to the OramaSearch Team on Vercel. A member of the Team first needs to authorize it.
Hi @fasenderos, thanks for your PR! This solution creates a lot of temporary objects, which slows down the application. Going back to the original question (#769), the application wants to reduce network traffic by specifying which properties to return. Could this be achieved with a special serialization method? I'm thinking of how Fastify uses an output schema to speed up serialization. Could you use a library like fast-json-stringify to address this?
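To make the idea of filtering at serialization time concrete, here is a minimal, dependency-free sketch. It does not use fast-json-stringify; it swaps in the standard `JSON.stringify` array replacer, which acts as an allow-list of property names. The `Doc` shape and the `serializeOnly` helper are hypothetical names used only for illustration.

```typescript
// Hypothetical document shape, for illustration only.
interface Doc {
  id: string;
  title: string;
  body: string;
  meta: { author: string; tags: string[] };
}

const doc: Doc = {
  id: "1",
  title: "Hello",
  body: "A long body that the client does not need",
  meta: { author: "Ada", tags: ["a", "b"] },
};

// An array replacer is an allow-list of property names.
// It applies at every nesting level, so nested keys must be listed too.
function serializeOnly(value: unknown, keys: string[]): string {
  return JSON.stringify(value, keys);
}

// "body" and "meta.tags" are dropped without building an intermediate object.
const payload = serializeOnly(doc, ["id", "title", "meta", "author"]);
console.log(payload);
// prints {"id":"1","title":"Hello","meta":{"author":"Ada"}}
```

Because the allow-list is flat, this approach cannot distinguish a top-level key from a nested key with the same name, which is one reason a schema-based serializer may be a better fit.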
Hi @allevo, thanks for the reply.
Yes, you can also achieve the same result with Before starting the implementation, I wanted to run some tests to check performance (time and memory usage). Based on the results, If you think I should add more use cases or modify the tests, let me know. I can still proceed with using Below are the results, and at the bottom, the script used.
1000 Runs on 10/100/1000 documents and serializing 1 property
1000 Runs on 10/100/1000 documents and serializing 3 properties
1000 Runs on 10/100/1000 documents and serializing 6 properties
1000 Runs on 10/100/1000 documents and serializing 8 properties where one property is an entire object and another one is a single nested property of an object
Here is the script used for testing:
import fastJson from "fast-json-stringify"
import fastParse from "fast-json-parse"
import { pickDocumentProperties } from "./src/utils"
const serialize = {
// Return 1 prop for each document
'props-1': {
fastJson: fastJson({ type: 'object', properties: { string1: { type: 'string' }}}),
pick: ['string1']
},
// Return 3 props for each document
'props-3': {
fastJson: fastJson({
type: 'object',
properties: {
string1: { type: 'string' },
number1: { type: 'number' },
bool1: { type: 'boolean' }
}
}),
pick: ['string1', 'number1', 'bool1']
},
// Return 6 props for each document
'props-6': {
fastJson: fastJson({
type: 'object',
properties: {
string1: { type: 'string' },
string2: { type: 'string' },
number1: { type: 'number' },
number2: { type: 'number' },
bool1: { type: 'boolean' },
bool2: { type: 'boolean' }
}
}),
pick: ['string1', 'string2', 'number1', 'number2','bool1', 'bool2']
},
// Return 8 props for each document where 1 is an entire object and 1 is a single nested prop of another object
'props-8': {
fastJson: fastJson({
type: 'object',
properties: {
string1: { type: 'string' },
string2: { type: 'string' },
number1: { type: 'number' },
number2: { type: 'number' },
bool1: { type: 'boolean' },
bool2: { type: 'boolean' },
// entire object
object1: {
type: 'object',
properties: {
string1: { type: 'string' },
string2: { type: 'string' },
number1: { type: 'number' },
number2: { type: 'number' },
bool1: { type: 'boolean' },
bool2: { type: 'boolean' },
nested: {
type: 'object',
properties: {
string1: { type: 'string' },
number1: { type: 'number' },
bool1: { type: 'boolean' },
}
}
}
},
// single nested fields object2.nested.string1
object2: {
type: 'object',
properties: {
nested: {
type: 'object',
properties: {
string1: { type: 'string' }
}
}
}
}
}
}),
pick: ['string1', 'string2', 'number1', 'number2','bool1', 'bool2', 'object1', 'object2.nested.string1']
}
}
function getNDocuments(n: number) {
const response: any = []
for (let index = 0; index < n; index++) {
response.push({
string1: 'foo bar',
string2: 'foo bar',
number1: 99.99,
number2: 99.99,
bool1: false,
bool2: true,
object1: {
string1: 'foo bar',
string2: 'foo bar',
number1: 99.99,
number2: 99.99,
bool1: false,
bool2: true,
nested: {
string1: 'foo bar',
number1: 99.99,
bool1: false,
}
},
object2: {
string1: 'foo bar',
string2: 'foo bar',
number1: 99.99,
number2: 99.99,
bool1: false,
bool2: true,
nested: {
string1: 'foo bar',
number1: 99.99,
bool1: false,
}
},
})
}
return response
}
function profiling(fn: (docs, props) => void, label: string, docs: any[], props: number) {
const memoryBefore = process.memoryUsage().heapUsed;
const start = performance.now();
fn(docs, props);
const end = performance.now();
const memoryAfter = process.memoryUsage().heapUsed;
const time = end - start
const memory = memoryAfter - memoryBefore
return { label, time, memory, count: docs.length, props }
}
const groupBy = (array, key) => {
return array.reduce((result, currentValue) => {
const groupKey = currentValue[key];
if (!result[groupKey]) result[groupKey] = [];
result[groupKey].push(currentValue);
return result;
}, {});
};
const percentile = (arr, p) => {
const index = Math.ceil(arr.length * (p / 100)) - 1;
return arr[index];
}
const mean = (arr, prop) => {
return arr.reduce((sum, item) => sum + item[prop], 0) / arr.length;
}
const roundTo = (num, decimals = 2) => {
const factor = Math.pow(10, decimals);
return Math.round(num * factor) / factor;
}
function printResults(results){
// { 10: [], 100: []}
const groupedByRuns = groupBy(results, 'count');
for (const docs in groupedByRuns) {
const groupedByLabel = groupBy(groupedByRuns[docs], 'label')
const summary: any = {
runs: 0,
time: [],
memory: []
}
// { fast-json-stringify: [], 'pick-document-properties': [] }
for (const label in groupedByLabel) {
const timeOrdered = [...groupedByLabel[label]]
timeOrdered.sort((a, b) => a.time - b.time);
const memoryOrdered = [...groupedByLabel[label]]
memoryOrdered.sort((a, b) => a.memory - b.memory);
const bestTime = timeOrdered[0];
const worstTime = timeOrdered[timeOrdered.length - 1]
const bestMemory = memoryOrdered[0];
const worstMemory = memoryOrdered[memoryOrdered.length - 1]
const avgTime = mean(timeOrdered, 'time')
const avgMemory = mean(memoryOrdered, 'memory')
const timePercentile25 = percentile(timeOrdered, 25)
const timePercentile50 = percentile(timeOrdered, 50)
const timePercentile75 = percentile(timeOrdered, 75)
const timePercentile95 = percentile(timeOrdered, 95)
const memoryPercentile25 = percentile(memoryOrdered, 25)
const memoryPercentile50 = percentile(memoryOrdered, 50)
const memoryPercentile75 = percentile(memoryOrdered, 75)
const memoryPercentile95 = percentile(memoryOrdered, 95)
summary.time.push(timePercentile50)
summary.memory.push(memoryPercentile50)
summary.runs = timeOrdered.length
console.log(label)
console.table([
{
"Stats": 'Time ms',
"25%": roundTo(timePercentile25.time, 4),
"50%": roundTo(timePercentile50.time, 4),
"75%": roundTo(timePercentile75.time, 4),
"95%": roundTo(timePercentile95.time, 4),
"Average (Mean)": roundTo(avgTime, 4),
"Best (Min)": roundTo(bestTime.time, 4),
"Worst (Max)": roundTo(worstTime.time, 4),
},
{
"Stats": 'Memory byte',
"25%": roundTo(memoryPercentile25.memory, 4),
"50%": roundTo(memoryPercentile50.memory, 4),
"75%": roundTo(memoryPercentile75.memory, 4),
"95%": roundTo(memoryPercentile95.memory, 4),
"Average (Mean)": roundTo(avgMemory, 4),
"Best (Min)": roundTo(bestMemory.memory, 4),
"Worst (Max)": roundTo(worstMemory.memory, 4),
}]
);
}
summary.time.sort((a, b) => a.time - b.time)
summary.memory.sort((a, b) => a.memory - b.memory)
const fastest = summary.time[0]
console.log(`\n\nRESULTS SUMMARY FOR ${summary.runs} Runs - Serializing ${fastest.props} properties in ${fastest.count} docs`)
console.log(`\nTime elapsed in ms (50% Percentile)`)
summary.time.forEach((item, index) => {
console.log(`${index + 1}° ${item.label} in ${roundTo(item.time, 4)}${index === 0 ? ' (fastest)' : ''}`);
})
console.log(`\nMemory used in byte (50% Percentile)`)
summary.memory.forEach((item, index) => {
console.log(`${index + 1}° ${item.label} in ${roundTo(item.memory, 4)}${index === 0 ? ' (least consuming)' : ''}`);
})
}
}
function useFastJson(docs, props) {
const serializer = serialize[`props-${props}`].fastJson
return docs.map((doc) => serializer(doc))
}
function useFastJsonAndNormalParse(docs, props) {
const serializer = serialize[`props-${props}`].fastJson
return docs.map((doc) => JSON.parse(serializer(doc)))
}
function useFastJsonAndFastParse(docs, props) {
const serializer = serialize[`props-${props}`].fastJson
return docs.map((doc) => fastParse(serializer(doc)).value)
}
function usePickDocumentProperties(docs, props){
const properties = serialize[`props-${props}`].pick
return docs.map((doc) => pickDocumentProperties(doc, properties))
}
const runs = (docs, props) => {
const results: any = []
for (let i = 0; i < 1000; i++) {
results.push(profiling(useFastJson, 'fast-json-stringify', docs, props))
results.push(profiling(useFastJsonAndNormalParse, 'fast-json-stringify-and-normal-parse', docs, props))
results.push(profiling(useFastJsonAndFastParse, 'fast-json-stringify-and-fast-parse', docs, props))
results.push(profiling(usePickDocumentProperties, 'pick-document-properties', docs, props))
}
printResults(results)
}
const init = (props: 1 | 3 | 6 | 8) => {
const docs = getNDocuments(10000)
const docs_10 = docs.slice(0, 10)
const docs_100 = docs.slice(0, 100)
const docs_1000 = docs.slice(0, 1000)
const docs_10000 = docs.slice(0, 10000)
// Execute 1000 runs on 10 docs
runs(docs_10, props)
// Execute 1000 runs on 100 docs
runs(docs_100, props)
// Execute 1000 runs on 1.000 docs
runs(docs_1000, props)
// Execute 1000 runs on 10.000 docs
runs(docs_10000, props)
}
init(1) // Serialize 1 prop for each document
init(3) // Serialize 3 props for each document
init(6) // Serialize 6 props for each document
init(8) // Serialize 8 props for each document where 1 is an entire object and 1 is a single nested prop of another object
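The script's summary ranks each approach by its 50th percentile rather than its mean. The sketch below works through the same nearest-rank index formula on a small set of made-up timings (illustrative values, not benchmark results) to show how a single outlier shifts the mean but not the median. The clamp for `p = 0` is an addition that the script does not have.

```typescript
// Nearest-rank percentile over an already-sorted array, using the same
// index formula as the script: ceil(n * p / 100) - 1.
function percentile<T>(sorted: T[], p: number): T {
  const index = Math.ceil(sorted.length * (p / 100)) - 1;
  // Clamp so that p = 0 does not produce index -1 (assumed addition).
  return sorted[Math.max(0, index)];
}

// Made-up timings in ms with one slow outlier (3.1).
const timesMs = [0.8, 0.5, 1.2, 0.9, 3.1, 0.7, 1.0, 0.6].sort((a, b) => a - b);
// sorted: [0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.2, 3.1]

const p25 = percentile(timesMs, 25); // index ceil(2) - 1 = 1   -> 0.6
const p50 = percentile(timesMs, 50); // index ceil(4) - 1 = 3   -> 0.8
const p95 = percentile(timesMs, 95); // index ceil(7.6) - 1 = 7 -> 3.1

// The mean is pulled up by the outlier (about 1.1), the median is not.
const mean = timesMs.reduce((sum, t) => sum + t, 0) / timesMs.length;
console.log({ p25, p50, p95, mean });
```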
One other thing to consider:
OK, is there anything I need to do on this PR (besides the tests)?
Implements #769.

This PR introduces the ability to pass an array of document fields to be returned via the `returning` option.

Initially, I considered using the existing `getDocumentProperties()` function. However, this function does not preserve the original structure of objects. Moreover, when dealing with nested objects, it only returns the deepest fields. This behavior forces users to specify all properties if they want to return the entire object, which can be cumbersome.

Given that `getDocumentProperties()` is widely used throughout the codebase, I decided to create a new function called `pickDocumentProperties()` that preserves the original structure of objects, allowing users to specify top-level keys for nested objects, which simplifies the process of selecting which fields to return.

It's important to note that the `includeVectors` option is skipped if the `returning` option is also provided. It seemed more logical to me to prioritize user selection, although this behavior can be subject to further discussion.

If everything looks good, I'll proceed with adding the corresponding tests.
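
The structure-preserving behavior described above can be sketched roughly as follows. This is not the PR's actual `pickDocumentProperties()` implementation. `pickPaths`, `getPath`, and the sample document are hypothetical, and the sketch assumes that paths are dot-separated and that missing paths are silently skipped.

```typescript
type AnyDoc = Record<string, unknown>;

// Walk a dot-path through the document; report whether every segment exists.
function getPath(doc: unknown, keys: string[]): { found: boolean; value?: unknown } {
  let current: unknown = doc;
  for (const key of keys) {
    if (typeof current !== "object" || current === null || !(key in current)) {
      return { found: false };
    }
    current = (current as AnyDoc)[key];
  }
  return { found: true, value: current };
}

// Hypothetical helper: copy only the requested paths, keeping the nesting.
function pickPaths(doc: AnyDoc, paths: string[]): AnyDoc {
  const result: AnyDoc = {};
  for (const path of paths) {
    const keys = path.split(".");
    const hit = getPath(doc, keys);
    if (!hit.found) continue; // assumed behavior: ignore unknown paths

    // Recreate the intermediate containers, then set the leaf.
    let target = result;
    for (const key of keys.slice(0, -1)) {
      const next = target[key];
      if (typeof next !== "object" || next === null) target[key] = {};
      target = target[key] as AnyDoc;
    }
    target[keys[keys.length - 1]] = hit.value; // values are copied by reference
  }
  return result;
}

const sample: AnyDoc = {
  title: "foo",
  price: 9.99,
  meta: { rating: 5, nested: { tag: "x", other: 1 } },
  extra: { nested: { tag: "y", other: 2 } },
};

// "meta" selects the whole object; "extra.nested.tag" selects a single leaf.
const picked = pickPaths(sample, ["title", "meta", "extra.nested.tag", "missing.key"]);
console.log(JSON.stringify(picked));
// prints {"title":"foo","meta":{"rating":5,"nested":{"tag":"x","other":1}},"extra":{"nested":{"tag":"y"}}}
```

A top-level key such as `meta` returns the entire nested object without listing every leaf, which is the convenience the description attributes to the new function.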