How to add papers to an index manually #784

Open
Snikch63200 opened this issue Jan 3, 2025 · 1 comment
Labels
question Further information is requested

Comments

@Snikch63200

Hi,

The PaperQA documentation provides a code snippet for creating a reusable document index:

import os

from paperqa import Settings
from paperqa.agents.main import agent_query
from paperqa.agents.models import QueryRequest
from paperqa.agents.search import get_directory_index


async def amain(folder_of_papers: str | os.PathLike) -> None:
    settings = Settings(paper_directory=folder_of_papers)

    # 1. Build the index. Note an index name is autogenerated when unspecified
    built_index = await get_directory_index(settings=settings)
    print(settings.get_index_name())  # Display the autogenerated index name
    print(await built_index.index_files)  # Display the index contents

    # 2. Use the settings as many times as you want with ask
    answer_response_1 = await agent_query(
        query=QueryRequest(
            query="What is the best way to make a vaccine?", settings=settings
        )
    )
    answer_response_2 = await agent_query(
        query=QueryRequest(
            query="What manufacturing challenges are unique to bispecific antibodies?",
            settings=settings,
        )
    )

This way we can create an index and have papers added to it automatically, but it does not seem possible to set a "citation" and "docname" for each paper, as can be done when adding papers to a Docs() object. See the following:

import shutil

import compress_pickle
from tqdm import tqdm

# docs, settings, hybrid_model, file_list and the two counters are defined earlier
for doc in tqdm(file_list):
    src = "./Papers/ToAdd/" + str(doc)
    try:
        docs.add(
            src,
            citation="File " + doc,
            docname=doc,
            settings=settings,
            embedding_model=hybrid_model,
        )
        shutil.move(src, "./Papers/Added/" + str(doc))
        added_doc += 1
    except Exception as e:
        # sometimes this happens if PDFs aren't downloaded or readable
        print("Could not read", doc, e)
        shutil.move(src, "./Papers/Failed/" + str(doc))
        failed_doc += 1

# Persist the populated Docs object so it can be reloaded later
with open("./Docs/my_docs3.pkl", "wb") as f:
    compress_pickle.dump(docs, f, compression="lzma")

So I wonder whether there is a way to add papers manually, one at a time, to an empty SearchIndex() object, as can be done with a Docs() object.

Also, the documentation provides a snippet to fill a Docs() object with papers from ZoteroDB:

from paperqa import Docs
from paperqa.contrib import ZoteroDB

docs = Docs()
zotero = ZoteroDB(library_type="user")  # "group" if group library

for item in zotero.iterate(limit=20):
    if item.num_pages > 30:
        continue  # skip long papers
    docs.add(item.pdf, docname=item.key)

Is there a way to create a reusable index from ZoteroDB?

Best regards.

@dosu.


dosubot bot commented Jan 3, 2025

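To manually add papers to a SearchIndex object in PaperQA, you can use the add_document method of the SearchIndex class. The index entry itself is a dictionary of fields such as 'title', 'year', 'file_location', and 'body'; metadata such as 'citation' and 'docname' would be carried by the document object you pass alongside it. Here's a general approach: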

  1. Create an index_doc dictionary: Include metadata like 'title', 'year', 'file_location', and the 'body' (full text) of the document.

  2. Prepare the document: If you want a document object stored with the entry (for example a Docs object built with docs.add and your own citation and docname), create it first.

  3. Use the add_document method: Add the document to the index by passing the index_doc dictionary and, optionally, the document object.

Example:

await search_index.add_document(
    {
        "title": "Document Title",
        "year": "2023",
        "file_location": "/path/to/document",
        "body": "The full text of the document goes here.",
    },
    document=your_document_object,  # Optional
)

This method handles the addition of the document to the index, including writing the document to storage if needed [1].
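To add several papers iteratively, the call above can be wrapped in a loop. The following is only a minimal sketch, not PaperQA's own implementation: PaperEntry, make_index_doc and add_papers are hypothetical names introduced here, and the sketch assumes the index object offers an async add_document(index_doc, document=...) as shown above plus an async save_index() to persist changes. Check both against the SearchIndex class in your installed version.

```python
import asyncio
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Iterable, List, Optional


@dataclass
class PaperEntry:
    """Hypothetical container for one paper; not part of PaperQA."""

    path: Path
    body: str
    title: Optional[str] = None
    year: str = ""
    document: Any = None  # e.g. a Docs object carrying your citation/docname


def make_index_doc(entry: PaperEntry) -> dict:
    """Build the dictionary passed as the first argument of add_document."""
    return {
        "title": entry.title or entry.path.stem,  # fall back to the file name
        "year": entry.year,
        "file_location": str(entry.path),
        "body": entry.body,
    }


async def add_papers(search_index: Any, entries: Iterable[PaperEntry]) -> List[str]:
    """Add each entry to the index, then save once at the end."""
    added: List[str] = []
    for entry in entries:
        await search_index.add_document(
            make_index_doc(entry), document=entry.document
        )
        added.append(str(entry.path))
    # Assumption: the index exposes an async save_index() that writes to disk.
    await search_index.save_index()
    return added


# Usage (my_index is whatever SearchIndex instance you created):
# asyncio.run(add_papers(my_index, [PaperEntry(Path("paper.pdf"), "full text")]))
```

Because add_papers only relies on those two methods, it can be tried against a small stand-in object before being pointed at a real index.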

For creating a reusable index from ZoteroDB, you can iterate over items in the Zotero library using the ZoteroDB class and add them to a Docs object. This process involves fetching metadata and PDFs from Zotero and adding them to the Docs object, which can then be used to build an index. Here's a snippet for adding documents from Zotero:

from paperqa import Docs
from paperqa.contrib import ZoteroDB

docs = Docs()
zotero = ZoteroDB(library_type="user")  # "group" if group library

for item in zotero.iterate(limit=20):
    if item.num_pages > 30:
        continue  # skip long papers
    docs.add(item.pdf, docname=item.key)

This method handles metadata by using the Zotero library to fetch and manage document details [2].
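To get from Zotero items to a reusable index, one possible route is to build a small Docs object per item and register it with add_document. This is a hedged sketch rather than a documented PaperQA recipe: index_zotero_items and the make_docs parameter are hypothetical, and it assumes that Docs has an async aadd counterpart to add, that its texts attribute holds chunks with a .text string, and that the index offers an async save_index(). Only item.pdf, item.key and item.num_pages are taken from the snippet above.

```python
import asyncio
from typing import Any, Callable, Iterable, List


async def index_zotero_items(
    search_index: Any,
    items: Iterable[Any],
    make_docs: Callable[[], Any],
    max_pages: int = 30,
) -> List[str]:
    """Add each Zotero item to the index and return the keys that were added."""
    added: List[str] = []
    for item in items:
        if item.num_pages > max_pages:
            continue  # skip long papers, as in the snippet above
        docs = make_docs()  # with PaperQA this would be the Docs class
        # Assumption: aadd is the async counterpart of docs.add
        await docs.aadd(item.pdf, docname=item.key)
        # Assumption: docs.texts holds the parsed chunks, each with a .text string
        body = "".join(chunk.text for chunk in docs.texts)
        await search_index.add_document(
            {
                "title": item.key,  # placeholder; use real metadata if available
                "year": "",
                "file_location": str(item.pdf),
                "body": body,
            },
            document=docs,
        )
        added.append(item.key)
    # Assumption: save_index() persists the index so it can be reused later
    await search_index.save_index()
    return added


# Usage (names from the snippet above; my_index is your SearchIndex instance):
# asyncio.run(index_zotero_items(my_index, zotero.iterate(limit=20), Docs))
```

Passing the Docs factory in as an argument keeps the sketch independent of any particular PaperQA version and makes it easy to exercise with stand-in objects.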

dosubot bot added the question label Jan 3, 2025