Am I Correct That Each Text Chunk (Per Document) Is Processed Sequentially in Graph Creation? #1615

darien-schettler · 2025-01-11T22:19:24Z

Hi there, I'm doing a deep dive into the code base, and I noticed that while the documents (each row of the DF) are processed in parallel, the texts within each document are processed sequentially. This could obviously be by design: the graph creation and document processing are already highly parallel, or perhaps we want to avoid race conditions during node/relationship creation (i.e., we WANT to go one by one so the graph is built the same way a human would build it, reading front to back).

Anyway, I just wanted to explicitly call it out and ask. Thanks in advance!

Currently, the loop below processes each text chunk sequentially, awaiting one call before starting the next.

all_records: dict[int, str] = {}
source_doc_map: dict[int, str] = {}

for doc_index, text in enumerate(texts):
    try:
        result = await self._process_document(text, prompt_variables)
        source_doc_map[doc_index] = text
        all_records[doc_index] = result
    except Exception as e:
        # handle error (the full method below logs it and calls self._on_error)
        ...

I have included the full code for the __call__ method of the GraphExtractor class below.

async def __call__(
        self, texts: list[str], prompt_variables: dict[str, Any] | None = None
    ) -> GraphExtractionResult:
        """Call method definition."""
        if prompt_variables is None:
            prompt_variables = {}
        all_records: dict[int, str] = {}
        source_doc_map: dict[int, str] = {}

        # Wire defaults into the prompt variables
        prompt_variables = {
            **prompt_variables,
            self._tuple_delimiter_key: prompt_variables.get(self._tuple_delimiter_key)
            or DEFAULT_TUPLE_DELIMITER,
            self._record_delimiter_key: prompt_variables.get(self._record_delimiter_key)
            or DEFAULT_RECORD_DELIMITER,
            self._completion_delimiter_key: prompt_variables.get(
                self._completion_delimiter_key
            )
            or DEFAULT_COMPLETION_DELIMITER,
            self._entity_types_key: ",".join(
                prompt_variables[self._entity_types_key] or DEFAULT_ENTITY_TYPES
            ),
        }

        for doc_index, text in enumerate(texts):
            try:
                # Invoke the entity extraction
                result = await self._process_document(text, prompt_variables)
                source_doc_map[doc_index] = text
                all_records[doc_index] = result
            except Exception as e:
                log.exception("error extracting graph")
                self._on_error(
                    e,
                    traceback.format_exc(),
                    {
                        "doc_index": doc_index,
                        "text": text,
                    },
                )

        output = await self._process_results(
            all_records,
            prompt_variables.get(self._tuple_delimiter_key, DEFAULT_TUPLE_DELIMITER),
            prompt_variables.get(self._record_delimiter_key, DEFAULT_RECORD_DELIMITER),
        )

        return GraphExtractionResult(
            output=output,
            source_docs=source_doc_map,
        )
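
For comparison, here is a minimal sketch of what a concurrent variant of that loop might look like. It is not the library's code: `process_document` is a hypothetical stand-in for `self._process_document`, `extract_concurrently` and `max_concurrency` are names I made up for illustration, and it assumes the per-text calls are independent and safe to run at the same time. Results are keyed by `doc_index`, so the final mapping does not depend on which call finishes first.

```python
import asyncio


async def process_document(text: str) -> str:
    """Hypothetical stand-in for GraphExtractor._process_document."""
    await asyncio.sleep(0.01)  # simulate an LLM round trip
    return text.upper()


async def extract_concurrently(
    texts: list[str], max_concurrency: int = 4
) -> tuple[dict[int, str], dict[int, str]]:
    """Process all texts concurrently, keyed by their original index."""
    semaphore = asyncio.Semaphore(max_concurrency)
    all_records: dict[int, str] = {}
    source_doc_map: dict[int, str] = {}

    async def worker(doc_index: int, text: str) -> None:
        async with semaphore:  # cap the number of in-flight calls
            try:
                result = await process_document(text)
            except Exception as e:
                # Mirror the original loop: report the failure and skip this text.
                print(f"error extracting graph for text {doc_index}: {e!r}")
                return
        source_doc_map[doc_index] = text
        all_records[doc_index] = result

    await asyncio.gather(*(worker(i, t) for i, t in enumerate(texts)))
    # Sort by index so downstream code sees the same key order as the sequential loop.
    return dict(sorted(all_records.items())), dict(sorted(source_doc_map.items()))


if __name__ == "__main__":
    records, sources = asyncio.run(extract_concurrently(["alpha", "beta", "gamma"]))
    print(records)
```

Whether something like this is actually safe depends on what `_process_document` and `_process_results` do internally, which is really what I am asking about.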

P.S. Awesome library! Keep up the good work.

Replies: 0 comments
