How to add papers to an index manually #784

Snikch63200 · 2025-01-03T15:48:33Z

Hi,

PaperQA's documentation provides a code snippet to create a reusable document index:

import os

from paperqa import Settings
from paperqa.agents.main import agent_query
from paperqa.agents.models import QueryRequest
from paperqa.agents.search import get_directory_index


async def amain(folder_of_papers: str | os.PathLike) -> None:
    settings = Settings(paper_directory=folder_of_papers)

    # 1. Build the index. Note an index name is autogenerated when unspecified
    built_index = await get_directory_index(settings=settings)
    print(settings.get_index_name())  # Display the autogenerated index name
    print(await built_index.index_files)  # Display the index contents

    # 2. Use the settings as many times as you want with ask
    answer_response_1 = await agent_query(
        query=QueryRequest(
            query="What is the best way to make a vaccine?", settings=settings
        )
    )
    answer_response_2 = await agent_query(
        query=QueryRequest(
            query="What manufacturing challenges are unique to bispecific antibodies?",
            settings=settings,
        )
    )

This builds an index by adding papers automatically, but it does not seem possible to set "citation" and "docname" for each paper, as can be done when adding papers to a Docs() object. See the following:

import shutil

import compress_pickle
from tqdm import tqdm

# docs, settings, hybrid_model and file_list are assumed to be defined earlier (not shown)
added_doc = 0
failed_doc = 0

for doc in tqdm(file_list):
    src = "./Papers/ToAdd/" + str(doc)
    try:
        docs.add(
            src,
            citation="File " + str(doc),
            docname=str(doc),
            settings=settings,
            embedding_model=hybrid_model,
        )
        shutil.move(src, "./Papers/Added/" + str(doc))
        added_doc += 1
    except Exception as e:
        # sometimes this happens if PDFs aren't downloaded or readable
        print("Could not read", doc, e)
        shutil.move(src, "./Papers/Failed/" + str(doc))
        failed_doc += 1

# Persist the populated Docs object so it can be reloaded later
with open("./Docs/my_docs3.pkl", "wb") as f:
    compress_pickle.dump(docs, f, compression="lzma")

So I wonder whether there is a way to add papers manually and iteratively to an empty SearchIndex() object, as can be done with a Docs() object.

Also, the documentation provides a snippet to fill a Docs() object with papers from ZoteroDB:

from paperqa import Docs
from paperqa.contrib import ZoteroDB

docs = Docs()
zotero = ZoteroDB(library_type="user")  # "group" if group library

for item in zotero.iterate(limit=20):
    if item.num_pages > 30:
        continue  # skip long papers
    docs.add(item.pdf, docname=item.key)

Is there a way to create a reusable index from ZoteroDB?

Best regards.

@dosu.


dosubot · 2025-01-03T15:50:16Z

To manually add papers to an index in PaperQA, you can use the add_document method of the SearchIndex class. This allows you to add documents with metadata such as 'citation' and 'docname'. Here's a general approach:

1. Create an index_doc dictionary: include metadata such as 'title', 'year', 'file_location', and the 'body' of the document.
2. Prepare the document: if you have a document object, prepare it for storage.
3. Use the add_document method: add the document to the index by passing the index_doc dictionary and the document object.

Example:

await search_index.add_document(
    {
        "title": "Document Title",
        "year": "2023",
        "file_location": "/path/to/document",
        "body": "The full text of the document goes here.",
    },
    document=your_document_object,  # Optional
)

This method handles the addition of the document to the index, including writing the document to storage if needed ^[1].
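To connect this back to the original question about per-paper "citation" and "docname", here is a minimal sketch of an iterative add. It assumes that a one-paper Docs object can be passed as the document argument, that Docs.aadd accepts citation and docname as docs.add does in the question, and that the index exposes save_index; none of this is confirmed in the thread. build_index_doc and add_paper are hypothetical helper names, and the "Unknown year" fallback is illustrative.

```python
from __future__ import annotations

from typing import Any


def build_index_doc(
    file_location: str, body: str, title: str | None = None, year: str | None = None
) -> dict[str, Any]:
    """Assemble the dictionary passed to add_document (same keys as the example above)."""
    return {
        "title": title or file_location,  # fall back to the path when no title is known
        "year": year or "Unknown year",  # illustrative fallback value
        "file_location": file_location,
        "body": body,
    }


async def add_paper(search_index, path: str, citation: str, docname: str, settings) -> None:
    """Add one paper, with its own citation and docname, to an existing index."""
    # Imported lazily so build_index_doc can be used without paperqa installed.
    from paperqa import Docs

    tmp_docs = Docs()  # assumption: one Docs object per indexed paper
    await tmp_docs.aadd(path, citation=citation, docname=docname, settings=settings)
    index_doc = build_index_doc(path, body="".join(t.text for t in tmp_docs.texts))
    await search_index.add_document(index_doc, document=tmp_docs)
    await search_index.save_index()  # assumption: persists the index for later reuse
```

Under these assumptions, the object returned by get_directory_index in the first snippet could be passed as search_index, with add_paper awaited once per file in a loop like the one in the question.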

For creating a reusable index from ZoteroDB, you can iterate over items in the Zotero library using the ZoteroDB class and add them to a Docs object. This process involves fetching metadata and PDFs from Zotero and adding them to the Docs object, which can then be used to build an index. Here's a snippet for adding documents from Zotero:

from paperqa import Docs
from paperqa.contrib import ZoteroDB

docs = Docs()
zotero = ZoteroDB(library_type="user")  # "group" if group library

for item in zotero.iterate(limit=20):
    if item.num_pages > 30:
        continue  # skip long papers
    docs.add(item.pdf, docname=item.key)

This method handles metadata by using the Zotero library to fetch and manage document details ^[2].
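One possible route to a reusable index from Zotero, sketched under assumptions: copy the selected PDFs into a folder and then build the index with get_directory_index, as in the first snippet of this thread. stage_pdfs and build_index_from_zotero are hypothetical helper names, the 30-page cutoff mirrors the snippet above, and naming each copy after item.key is an illustrative choice. Note that this route does not set a per-paper citation or docname.

```python
from __future__ import annotations

import shutil
from pathlib import Path
from typing import Any, Iterable


def stage_pdfs(items: Iterable[Any], paper_dir: str | Path, max_pages: int = 30) -> list[Path]:
    """Copy each item's PDF into paper_dir, skipping papers longer than max_pages."""
    paper_dir = Path(paper_dir)
    paper_dir.mkdir(parents=True, exist_ok=True)
    staged = []
    for item in items:
        if item.num_pages > max_pages:
            continue  # skip long papers, as in the snippet above
        target = paper_dir / f"{item.key}.pdf"  # file named after the Zotero key
        shutil.copy2(item.pdf, target)
        staged.append(target)
    return staged


async def build_index_from_zotero(paper_dir: str | Path, limit: int = 20):
    """Stage Zotero PDFs in paper_dir, then index that folder."""
    # Imported lazily so stage_pdfs can be used without paperqa installed.
    from paperqa import Settings
    from paperqa.agents.search import get_directory_index
    from paperqa.contrib import ZoteroDB

    zotero = ZoteroDB(library_type="user")  # "group" if group library
    stage_pdfs(zotero.iterate(limit=limit), paper_dir)
    settings = Settings(paper_directory=paper_dir)
    await get_directory_index(settings=settings)
    return settings  # reuse these settings for later queries
```

The returned settings could then be passed to agent_query through QueryRequest, as in the first snippet, so the index is built once and queried many times.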


dosubot bot added the "question" label (Further information is requested) on Jan 3, 2025
