Releases: Future-House/paper-qa

v5.9.2

06 Jan 22:16
13a38c3

Note to self: run unit tests (and not just mypy) in downstream repos before cutting the release.

What's Changed

  • Fixing not using set_llm_session_ids from fh-llm-client by @jamesbraza in #792

Full Changelog: v5.9.1...v5.9.2

v5.9.1

06 Jan 19:50
2d1f7ca

What's Changed

Full Changelog: v5.9.0...v5.9.1

v5.9.0

06 Jan 19:01
f4c299a

What's Changed

New Contributors

Full Changelog: v5.8.0...v5.9.0

v5.6.1

11 Dec 01:19

Full Changelog: v5.6.0...v5.6.1

v5.8.0

10 Dec 16:43
58dbfc0

What's Changed

Full Changelog: v5.7.0...v5.8.0

v5.7.0

04 Dec 19:31
c36903a

What's Changed

Full Changelog: v5.6.0...v5.7.0

v5.6.0

02 Dec 21:53
0130233

Highlights

This release is mainly a bunch of bug fixes:

  • Adapts to breaking changes in upstream dependencies (e.g. Pydantic 2.10, aviary 0.10.1)
  • Makes GradablePaperQAEnvironment's evaluations robust to an empty answer or multiple answers

Due to the introduction of Complete.NO_ANSWER_PHRASE in #726, it was requested that we consider this a minor version bump, as it will impact system performance.

What's Changed

  • Fixed settings session into EnvironmentState, and suppressing PyMuPDF derived DeprecationWarning by @jamesbraza in #713
  • Adding assertion gather_evidence doesn't populate session.answer by @jamesbraza in #716
  • Lock file maintenance by @renovate in #715
  • Fixes gather_with_concurrency typing by @maykcaldas in #714
  • Latest tooling dependencies by @jamesbraza in #719
  • Lock file maintenance by @renovate in #718
  • Fixed EVAL_PROMPT_TEMPLATE to handle empty string or multiple match answers by @jamesbraza in #724
  • Address missing GenerateAnswer in trajectories, no answers after Complete tools, and better history by @mskarlin in #726
  • Pulling in latest aviary for concurrency rename by @jamesbraza in #728
  • Pulling in latest aviary for dependencies fix, and retrying flaky test_propagate_options more by @jamesbraza in #729
  • Pulling in latest ldp for Callback.before_rollout by @jamesbraza in #734
  • Documenting why we don't handle evaluation failures in GradablePaperQAEnvironment.step by @jamesbraza in #738
  • Created LitQAEvaluation.calculate_accuracy_precision utility by @jamesbraza in #733
  • Refreshed test cassettes, fixed flaky test test_search, and fixed test type ignores by @jamesbraza in #739
  • Unpins pydantic >2.10.2 requirement, removes TYPE_CHECKING by @nadolskit in #725
  • Lock file maintenance by @renovate in #737
  • Alternative maybe is text by @loesinghaus in #717
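
As a rough illustration of what an accuracy/precision utility such as LitQAEvaluation.calculate_accuracy_precision (#733) might compute, here is a minimal sketch. The `Grade` enum, the function name `accuracy_precision`, and the definitions used (accuracy over all questions, precision over questions not answered "unsure") are assumptions for illustration, not the library's actual implementation.

```python
from collections.abc import Sequence
from enum import Enum


class Grade(Enum):
    """Hypothetical stand-in for a per-question evaluation outcome."""

    CORRECT = "correct"
    INCORRECT = "incorrect"
    UNSURE = "unsure"


def accuracy_precision(grades: Sequence[Grade]) -> tuple[float, float]:
    """Return (accuracy, precision) under the assumed definitions.

    accuracy = correct / all questions
    precision = correct / questions answered with something other than "unsure"
    """
    if not grades:
        raise ValueError("Need at least one grade.")
    num_correct = sum(g is Grade.CORRECT for g in grades)
    num_answered = sum(g is not Grade.UNSURE for g in grades)
    accuracy = num_correct / len(grades)
    # Avoid dividing by zero when every question was answered "unsure"
    precision = num_correct / num_answered if num_answered else 0.0
    return accuracy, precision


grades = [Grade.CORRECT, Grade.CORRECT, Grade.INCORRECT, Grade.UNSURE]
print(accuracy_precision(grades))  # (0.5, 0.6666666666666666)
```

Separating the two numbers shows whether a low accuracy comes from wrong answers or from abstentions.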

New Contributors

Full Changelog: v5.5.0...v5.6.0

v5.5.1

03 Dec 01:36

Full Changelog: v5.5.0...v5.5.1

v5.5.0

20 Nov 00:23
0b3ef89

Highlights

In all of v5 before this release, we defined the agent loop's end as the presence of 1+ generated answers not containing the substring "cannot answer". However, this (suboptimally) led the agent loop to terminate early on partial answers like "Based on the sources provided, it appears no one has done x." We have resolved this issue by:

  • No longer coupling our done condition to the absence of the substring "cannot answer" in 1+ generated answers
  • No longer implicitly depending on clients mentioning this "cannot answer" sentinel in the input qa prompt
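
To make the early-termination problem concrete, here is a minimal sketch. The names `legacy_is_done` and `LoopState` are hypothetical and are not paper-qa's actual code; the explicit `done` flag is only one assumed way to decouple completion from answer text.

```python
from dataclasses import dataclass, field

CANNOT_ANSWER_PHRASE = "cannot answer"  # sentinel substring named in the notes above


def legacy_is_done(answers: list[str]) -> bool:
    """Old-style check: done once 1+ answers lack the sentinel substring."""
    return any(CANNOT_ANSWER_PHRASE not in answer for answer in answers)


@dataclass
class LoopState:
    """Hypothetical agent-loop state whose done flag ignores answer text."""

    answers: list[str] = field(default_factory=list)
    done: bool = False

    def complete(self) -> None:
        """Mark the loop finished; only an explicit call ends it."""
        self.done = True


partial = "Based on the sources provided, it appears no one has done x."

# The partial answer lacks the sentinel, so the old check ends the loop early
print(legacy_is_done(["I cannot answer this question.", partial]))  # True

# With an explicit flag, the same partial answer does not end the loop
state = LoopState(answers=[partial])
print(state.done)  # False
state.complete()
print(state.done)  # True
```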

We also fixed several (bad) bugs:

  • We support parallel tool calling (2+ ToolCalls in one action: ToolRequestMessage). However, our tools (notably gather_evidence) are not actually concurrent-safe. Our tool schemas instructed agents not to call certain tools in parallel, but we nonetheless observed agents requesting parallel gather_evidence calls. So now we force our tools to execute non-concurrently to work around this race condition
  • When using LitQAEvaluation and the same GradablePaperQAEnvironment 2+ times, we repeatedly added the "unsure" option to the target multiple choice question, degrading performance over time
  • When using PaperQAEnvironment 2+ times, each reset was not properly wiping the Docs object
  • The reward distribution of LitQAEvaluation was mixing up the "unsure" reward of 0.1 with the "incorrect" reward of -1.0, so learning was not properly incentivized
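
The repeated "unsure" option bug above is an instance of non-idempotent setup on a reused object. A minimal sketch of one way to avoid it follows; the helper `with_unsure_option` and the option wording are hypothetical, not the fix actually shipped.

```python
UNSURE_OPTION = "Insufficient information to answer"  # hypothetical wording


def with_unsure_option(options: list[str], unsure: str = UNSURE_OPTION) -> list[str]:
    """Return a new options list that contains the unsure option exactly once."""
    if unsure in options:
        return list(options)
    # Build a new list so the caller's stored question is never mutated
    return [*options, unsure]


options = ["Option A", "Option B"]
first = with_unsure_option(options)
second = with_unsure_option(first)  # simulates reusing the same environment

print(len(first), len(second))  # 3 3
print(len(options))  # 2
```

Because the helper returns a copy and checks for the option first, calling it on every reset leaves the question unchanged after the first call.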

There are a bunch of other minor features, cleanups, and bugfixes here too; see the full list below.

What's Changed

Full Changelog: v5.4.0...v5.5.0

Hotfix to include `ordered=True` in tool exec calls

18 Nov 16:33
v5.3.4
f59b3ab

Prevents parallel tool calls from clobbering the env. state.
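
A toy sketch of why concurrent tool calls can clobber shared state, and how ordered execution avoids it. `EnvState`, this toy `gather_evidence`, and `run_tool_calls` are hypothetical stand-ins written for illustration; only the `ordered` flag name is taken from the release title, and its real signature is not shown here.

```python
import asyncio


class EnvState:
    """Hypothetical shared environment state mutated by tools."""

    def __init__(self) -> None:
        self.evidence_count = 0


async def gather_evidence(state: EnvState) -> None:
    """Toy tool with a read-modify-write that is not concurrent-safe."""
    current = state.evidence_count
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    state.evidence_count = current + 1


async def run_tool_calls(state: EnvState, n_calls: int, ordered: bool) -> int:
    """Run n_calls tool calls, one at a time if ordered, else concurrently."""
    if ordered:
        for _ in range(n_calls):
            await gather_evidence(state)
    else:
        await asyncio.gather(*(gather_evidence(state) for _ in range(n_calls)))
    return state.evidence_count


# Concurrent calls all read the same stale value, so updates are lost
print(asyncio.run(run_tool_calls(EnvState(), 3, ordered=False)))  # 1

# Ordered calls each see the previous call's write
print(asyncio.run(run_tool_calls(EnvState(), 3, ordered=True)))  # 3
```

Ordered execution trades throughput for correctness, which is a reasonable workaround when the tools themselves are not concurrent-safe.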