
Commit

feat: added support for audio timestamp understanding to Google Vertex (#4061)

Co-authored-by: Lars Grammel <[email protected]>
timconnorz and lgrammel authored Dec 17, 2024
1 parent c53ebee commit db31e74
Showing 6 changed files with 81 additions and 0 deletions.
5 changes: 5 additions & 0 deletions .changeset/poor-apples-punch.md
@@ -0,0 +1,5 @@
---
'@ai-sdk/google': patch
---

feat: adding audioTimestamp support to GoogleGenerativeAISettings
7 changes: 7 additions & 0 deletions content/providers/01-ai-sdk-providers/11-google-vertex.mdx
@@ -294,6 +294,13 @@ The following optional settings are available for Google Vertex models:

Optional. When enabled, the model will [use Google search to ground the response](https://cloud.google.com/vertex-ai/generative-ai/docs/grounding/overview).

- **audioTimestamp** _boolean_

Optional. Enables timestamp understanding for audio-only files. Defaults to false.

This is useful for generating transcripts with accurate timestamps.
Consult [Google's documentation](https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/audio-understanding) for usage details.

You can use Google Vertex language models to generate text with the `generateText` function:

```ts highlight="1,4"
30 changes: 30 additions & 0 deletions examples/ai-core/src/e2e/google-vertex.test.ts
@@ -392,6 +392,36 @@ describe.each(Object.values(RUNTIME_VARIANTS))(
      expect(result.text.toLowerCase()).toContain('cat');
      expect(result.usage?.totalTokens).toBeGreaterThan(0);
    });

    it(
      'should generate text from audio input',
      { timeout: LONG_TEST_MILLIS },
      async () => {
        const model = vertex(modelId);
        const result = await generateText({
          model,
          messages: [
            {
              role: 'user',
              content: [
                {
                  type: 'text',
                  text: 'Output a transcript of spoken words. Break up transcript lines when there are pauses. Include timestamps in the format of HH:MM:SS.SSS.',
                },
                {
                  type: 'file',
                  data: Buffer.from(fs.readFileSync('./data/galileo.mp3')),
                  mimeType: 'audio/mpeg',
                },
              ],
            },
          ],
        });
        expect(result.text).toBeTruthy();
        expect(result.text.toLowerCase()).toContain('galileo');
        expect(result.usage?.totalTokens).toBeGreaterThan(0);
      },
    );
  });

  describe.each(MODEL_VARIANTS.embedding)('Embedding Model: %s', modelId => {
30 changes: 30 additions & 0 deletions examples/ai-core/src/generate-text/google-vertex-audio.ts
@@ -0,0 +1,30 @@
import { vertex } from '@ai-sdk/google-vertex';
import { generateText } from 'ai';
import 'dotenv/config';
import fs from 'node:fs';

async function main() {
  const result = await generateText({
    model: vertex('gemini-1.5-flash', { audioTimestamp: true }),
    messages: [
      {
        role: 'user',
        content: [
          {
            type: 'text',
            text: 'Output a transcript of spoken words. Break up transcript lines when there are pauses. Include timestamps in the format of HH:MM:SS.SSS.',
          },
          {
            type: 'file',
            data: Buffer.from(fs.readFileSync('./data/galileo.mp3')),
            mimeType: 'audio/mpeg',
          },
        ],
      },
    ],
  });

  console.log(result.text);
}

main().catch(console.error);
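The prompt in this example asks the model for timestamps in `HH:MM:SS.SSS` form. As a rough sketch of how a caller might turn such timestamps into milliseconds, the helpers below parse that format. The names `parseTimestampMs` and `extractTimestampsMs` are hypothetical and not part of the commit, the sample transcript is made up rather than real model output, and the model is not guaranteed to follow the requested format exactly.

```ts
// Hypothetical helper (not part of the commit): converts "HH:MM:SS.SSS"
// into milliseconds, or returns null when the input does not match.
function parseTimestampMs(timestamp: string): number | null {
  const match = /^(\d{2}):(\d{2}):(\d{2})\.(\d{3})$/.exec(timestamp.trim());
  if (match === null) {
    return null;
  }
  const hours = Number(match[1]);
  const minutes = Number(match[2]);
  const seconds = Number(match[3]);
  const millis = Number(match[4]);
  return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis;
}

// Hypothetical helper: collects every HH:MM:SS.SSS timestamp found in a
// transcript string, in order of appearance, as milliseconds.
function extractTimestampsMs(transcript: string): number[] {
  const found: string[] = transcript.match(/\b\d{2}:\d{2}:\d{2}\.\d{3}\b/g) || [];
  const results: number[] = [];
  for (const raw of found) {
    const ms = parseTimestampMs(raw);
    if (ms !== null) {
      results.push(ms);
    }
  }
  return results;
}

// Made-up sample text, assuming one timestamp at the start of each line.
const sampleTranscript = '00:00:01.250 hello\n00:01:02.005 world';
console.log(extractTimestampsMs(sampleTranscript)); // logs the values 1250 and 62005
```

Returning `null` for unparseable input keeps the caller in control of how to handle lines where the model deviated from the requested format.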
3 changes: 3 additions & 0 deletions packages/google/src/google-generative-ai-language-model.ts
@@ -109,6 +109,9 @@ export class GoogleGenerativeAILanguageModel implements LanguageModelV1 {
          this.supportsStructuredOutputs
            ? convertJSONSchemaToOpenAPISchema(responseFormat.schema)
            : undefined,
        ...(this.settings.audioTimestamp && {
          audioTimestamp: this.settings.audioTimestamp,
        }),
      };

    const { contents, systemInstruction } =
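The hunk above adds `audioTimestamp` through a conditional object spread, so the key is only present when the setting is truthy. A minimal standalone sketch of that pattern follows. `ExampleSettings`, `buildGenerationConfig`, and the `temperature` placeholder are hypothetical stand-ins and not the SDK's actual types or fields.

```ts
// Hypothetical, simplified settings type; only the flag from this commit is modeled.
interface ExampleSettings {
  audioTimestamp?: boolean;
}

// Builds a config object, adding audioTimestamp only when it is truthy.
// Spreading `false` or `undefined` into an object literal adds no keys.
function buildGenerationConfig(
  settings: ExampleSettings,
): Record<string, unknown> {
  return {
    temperature: 0, // placeholder standing in for the other config entries
    ...(settings.audioTimestamp && {
      audioTimestamp: settings.audioTimestamp,
    }),
  };
}

const withFlag = buildGenerationConfig({ audioTimestamp: true });
const withoutFlag = buildGenerationConfig({});
const explicitFalse = buildGenerationConfig({ audioTimestamp: false });

console.log('audioTimestamp' in withFlag); // true
console.log('audioTimestamp' in withoutFlag); // false
console.log('audioTimestamp' in explicitFalse); // false
```

One consequence of this pattern is that an explicit `audioTimestamp: false` is never sent in the request body, so the server-side default applies.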
6 changes: 6 additions & 0 deletions packages/google/src/google-generative-ai-settings.ts
@@ -55,6 +55,12 @@ Optional. A list of unique safety settings for blocking unsafe content.
      | 'BLOCK_ONLY_HIGH'
      | 'BLOCK_NONE';
  }>;
  /**
   * Optional. Enables timestamp understanding for audio-only files.
   *
   * https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/audio-understanding
   */
  audioTimestamp?: boolean;

  /**
Optional. When enabled, the model will use Google search to ground the response.