Releases: Future-House/paper-qa
v5.9.2
Note to self: run unit tests (and not just `mypy`) in downstream repos before cutting the release.
What's Changed
- Fixing not using `set_llm_session_ids` from `fh-llm-client` by @jamesbraza in #792
Full Changelog: v5.9.1...v5.9.2
v5.9.1
What's Changed
- Pinning min version of `fh-llm-client` by @jamesbraza in #790
- Fixed `Record` import from Qdrant not being in `TYPE_CHECKING` block by @jamesbraza in #791
Full Changelog: v5.9.0...v5.9.1
v5.9.0
What's Changed
- feat: Qdrant support by @Anush008 in #730
- Made it possible to get answers from litqa evaluations by @whitead in #760
- Added answer and ideal to gradeable environments by @whitead in #762
- Pinned lower bound on `aiohttp` for `aiohttp.ClientConnectionResetError` by @jamesbraza in #763
- Added llmclient dependency by @maykcaldas in #757
- Propagating citation count in flaky `test_pdf_reader_match_doc_details` by @jamesbraza in #766
- Renovate `lockFileMaintenance` respecting `schedule`, 2-week stability period, no `openai` pinning by @jamesbraza in #767
- chore(deps): lock file maintenance by @renovate in #765
- Added test split's source DOIs and question IDs by @jamesbraza in #771
- Consolidated LDP imports into `ldp_shims` module by @jamesbraza in #772
- Missing `None` failovers in LDP shims by @jamesbraza in #774
- Dropping `refurb` in favor of its `ruff` port by @jamesbraza in #773
- Converting `add_texts_and_embeddings` to async by @ThomasRochefortB in #778
- Add clinical trials search tool by @mskarlin in #777
- chore(deps): lock file maintenance by @renovate in #769
- Replace `raise_for_status` return `None` with Mock() in clinical trials test by @mskarlin in #785
- Docs around embeddings and agentic usage by @jamesbraza in #780
- Moved to `MultipleChoiceQuestion`/`MultipleChoiceEvaluation` from `aviary` by @jamesbraza in #768
- Feat/qdrant docs reconstruct by @ThomasRochefortB in #776
- Removed dead test cassettes by @jamesbraza in #788
New Contributors
- @Anush008 made their first contribution in #730
- @ThomasRochefortB made their first contribution in #778
Full Changelog: v5.8.0...v5.9.0
v5.6.1
Full Changelog: v5.6.0...v5.6.1
v5.8.0
What's Changed
- Update all non-major dependencies by @renovate in #745
- Created `dev` extra for convenience by @jamesbraza in #750
- Update all non-major dependencies by @renovate in #754
- Populated `LICENSE` by @jamesbraza in #756
- Add partitioning func capabilities to allow doc-types-based embedding ranking by @mskarlin in #752
- Exposed seeding of LitQA2 read and shuffling by @jamesbraza in #758
Full Changelog: v5.7.0...v5.8.0
v5.7.0
What's Changed
- Moved `README` to use `session` over `answer` by @jamesbraza in #741
- Moved `Docs.aadd` to support `str | os.PathLike` by @jamesbraza in #742
- Cleared up 'Adding Documents Manually' docs by @jamesbraza in #740
- Support env states with custom status functions by @mskarlin in #743
- Update astral-sh/setup-uv action to v4 by @renovate in #746
- Moved JSON summary prompt to mention score is an integer by @jamesbraza in #748
Full Changelog: v5.6.0...v5.7.0
v5.6.0
Highlights
This release is mainly a bunch of bug fixes:
- Pulling in breaks in upstream dependencies (e.g. Pydantic 2.10, aviary 0.10.1)
- Makes `GradablePaperQAEnvironment`'s evaluations robust to an empty answer or multiple answers

Due to the introduction of `Complete.NO_ANSWER_PHRASE` in #726 it was requested we consider this a minor version bump, as it will impact system performance.
What's Changed
- Fixed settings `session` into `EnvironmentState`, and suppressing PyMuPDF derived `DeprecationWarning` by @jamesbraza in #713
- Adding assertion `gather_evidence` doesn't populate `session.answer` by @jamesbraza in #716
- Lock file maintenance by @renovate in #715
- Fixes `gather_with_concurrency` typing by @maykcaldas in #714
- Latest tooling dependencies by @jamesbraza in #719
- Lock file maintenance by @renovate in #718
- Fixed `EVAL_PROMPT_TEMPLATE` to handle empty string or multiple match answers by @jamesbraza in #724
- Address missing `GenerateAnswer` in trajectories, no answers after `Complete` tools, and better history by @mskarlin in #726
- Pulling in latest `aviary` for `concurrency` rename by @jamesbraza in #728
- Pulling in latest `aviary` for dependencies fix, and retrying flaky `test_propagate_options` more by @jamesbraza in #729
- Pulling in latest `ldp` for `Callback.before_rollout` by @jamesbraza in #734
- Documenting why we don't handle evaluation failures in `GradablePaperQAEnvironment.step` by @jamesbraza in #738
- Created `LitQAEvaluation.calculate_accuracy_precision` utility by @jamesbraza in #733
- Refreshed test cassettes, fixed flaky test `test_search`, and fixed test type ignores by @jamesbraza in #739
- Unpins pydantic >2.10.2 requirement, removes TYPE_CHECKING by @nadolskit in #725
- Lock file maintenance by @renovate in #737
- Alternative maybe is text by @loesinghaus in #717
New Contributors
- @maykcaldas made their first contribution in #714
- @loesinghaus made their first contribution in #717
Full Changelog: v5.5.0...v5.6.0
v5.5.1
Full Changelog: v5.5.0...v5.5.1
v5.5.0
Highlights
In all of v5 before this release, we defined the presence of 1+ answer generations not containing the substring `"cannot answer"` as the agent loop's end. However, this (suboptimally) leads to the agent loop terminating early on partial answers like "Based on the sources provided, it appears no one has done x." We realized this, and have resolved this issue by:
- No longer coupling our done condition with the substring `"cannot answer"` being not present in 1+ generated answers
- No longer implicitly depending on clients mentioning this `"cannot answer"` sentinel in the input `qa` prompt
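
A minimal sketch of why the substring rule ends the loop too early. The helper names `is_done_by_substring` and `is_done_by_explicit_signal` are hypothetical and are not paper-qa's API; the only details taken from these notes are the `"cannot answer"` sentinel and the existence of a `complete` tool.

```python
CANNOT_ANSWER = "cannot answer"  # sentinel substring used by the old done condition


def is_done_by_substring(answers: list[str]) -> bool:
    """Old-style rule (illustrative): done once 1+ answers lack the sentinel."""
    return any(CANNOT_ANSWER not in answer for answer in answers)


def is_done_by_explicit_signal(tool_calls: list[str]) -> bool:
    """New-style rule (illustrative): done only when the agent calls `complete`."""
    return "complete" in tool_calls


partial = "Based on the sources provided, it appears no one has done x."

# The partial answer lacks the sentinel, so the old rule stops the loop.
print(is_done_by_substring([partial]))  # True

# With an explicit signal, the loop keeps going until `complete` is called.
print(is_done_by_explicit_signal(["gather_evidence", "gen_answer"]))  # False
```

Decoupling the done condition from answer text also means the `qa` prompt no longer has to instruct the model to emit a particular phrase.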
We also fixed several (bad) bugs:
- We support parallel tool calling (2+ `ToolCall`s in one `action: ToolRequestMessage`). However, our tools (notably `gather_evidence`) are not actually concurrent-safe. Our tool schemas instructed not to call certain tools in parallel, nonetheless we observed agents specifying `gather_evidence` to be called in parallel. So now we force our tools to be non-concurrently executed to work around this race condition
- When using `LitQAEvaluation` and the same `GradablePaperQAEnvironment` 2+ times, we repeatedly added the "unsure" option to the target multiple choice question, degrading performance over time
- When using `PaperQAEnvironment` 2+ times, each `reset` was not properly wiping the `Docs` object
- The reward distribution of `LitQAEvaluation` was mixing up "unsure" reward of `0.1` with the "incorrect" reward of `-1.0`, not properly incentivizing learning
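
To illustrate the parallel tool calling race, here is a toy sketch using only `asyncio`. The `State` class, the toy `gather_evidence` coroutine, and the `ordered` flag on `run_tools` are assumptions made for illustration; they stand in for the real environment state and tool execution, whose internals are not shown in these notes.

```python
import asyncio


class State:
    """Stand-in for shared environment state (hypothetical)."""

    def __init__(self) -> None:
        self.evidence_count = 0


async def gather_evidence(state: State) -> None:
    """Toy tool: read-modify-write with an await in between, so not concurrent-safe."""
    seen = state.evidence_count
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    state.evidence_count = seen + 1


async def run_tools(state: State, n_calls: int, ordered: bool) -> None:
    calls = [gather_evidence(state) for _ in range(n_calls)]
    if ordered:
        for call in calls:  # execute one at a time, in request order
            await call
    else:
        await asyncio.gather(*calls)  # concurrent execution: updates can be lost


async def main() -> tuple[int, int]:
    racy, safe = State(), State()
    await run_tools(racy, n_calls=3, ordered=False)
    await run_tools(safe, n_calls=3, ordered=True)
    return racy.evidence_count, safe.evidence_count


racy_count, safe_count = asyncio.run(main())
print(racy_count, safe_count)  # the concurrent run loses updates; the ordered run counts all 3
```

Forcing ordered execution gives up some throughput, but it sidesteps the race without making every tool concurrent-safe.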
There are a bunch of other minor features, cleanups, and bugfixes here too, see the full list below.
What's Changed
- Deprecation cycle for `AgentSettings.should_pre_search` by @jamesbraza in #679
- Moved agent prompts to `prompts.py` by @jamesbraza in #681
- Refactor to remove `skip_system` from `LLMModel.run_prompt` by @jamesbraza in #680
- Resolving `evidence_detailed_citations` and `Answer` deprecations by @jamesbraza in #682
- Fixed agent prompt names and contents after #681 mess up by @jamesbraza in #683
- Removed `tool_names` validation for `gen_answer` being present by @jamesbraza in #685
- Fixing `test_evaluation` logic bugs by @jamesbraza in #686
- Removed `GenerateAnswer.FAILED_TO_ANSWER` as it's unnecessary by @jamesbraza in #691
- Allowing serialized `Settings` in `get_settings` by @jamesbraza in #688
- Fixed LDP runner's `TRUNCATED` not calling `gen_answer`, and documented `AgentStatus` by @jamesbraza in #690
- Removed `gen_answer`'s dead argument `question` by @jamesbraza in #689
- Making sure we copy distractors by @sidnarayanan in #694
- Created `complete` tool to allow unsure answers by @jamesbraza in #684
- Added missing `test_from_question` cassette by @jamesbraza in #696
- Moved `fake` agent to LLM propose `complete` tool by @jamesbraza in #695
- Default to ordered tool calls, w env variable control by @mskarlin in #697
- Lock file maintenance by @renovate in #699
- Refactored `TestGradablePaperQAEnvironment` for DRY code by @jamesbraza in #702
- Fixing `PaperQAEnvironment.reset` respecting `mmr_lambda` and `text_hashes` by @jamesbraza in #703
- Removed `"cannot answer"` literals and added `reset` tool by @jamesbraza in #698
- Update all non-major dependencies by @renovate in #705
- Fixing `LitQAEvaluation` bugs: incorrect reward indices, not using LLM's native knowledge by @jamesbraza in #708
- Adding filters to paper-qa Docs by @whitead in #707
- Fixed mutably defaulted `NumpyVectorStore.texts` by @jamesbraza in #711
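
For context on the last item, a mutable default is a general Python pitfall where one object ends up shared across instances. The sketch below uses hypothetical `SharedStore` and `IsolatedStore` classes; how `NumpyVectorStore.texts` was actually declared is not shown in these notes.

```python
from dataclasses import dataclass, field


class SharedStore:
    """Pitfall: the default list is created once, when the function is defined."""

    def __init__(self, texts: list[str] = []) -> None:  # noqa: B006
        self.texts = texts


@dataclass
class IsolatedStore:
    """Fix: default_factory builds a fresh list for every instance."""

    texts: list[str] = field(default_factory=list)


a, b = SharedStore(), SharedStore()
a.texts.append("chunk")
print(b.texts)  # ['chunk'], state leaked from a to b

c, d = IsolatedStore(), IsolatedStore()
c.texts.append("chunk")
print(d.texts)  # []
```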
Full Changelog: v5.4.0...v5.5.0
Hotfix to include `ordered=True` in tool exec calls
Prevents parallel tool calls from clobbering the env. state.