JsonCssExtractionStrategy Fails to Handle Lists of Elements #433

Open
hypy13 opened this issue Jan 8, 2025 · 0 comments

After working for several hours, I discovered a significant problem with the JsonCssExtractionStrategy when trying to extract a list of elements.

Example:

In the example provided in the repository:
https://github.com/unclecode/crawl4ai/blob/main/docs/examples/v0_4_24_walkthrough.py

There are two <article class="post"> elements, but JsonCssExtractionStrategy only retrieves the first <article> tag.
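A minimal reproduction sketch of the symptom, using BeautifulSoup directly rather than crawl4ai. The HTML below is a hypothetical stand-in for the walkthrough's markup, not a copy of it:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the walkthrough example: two posts.
html = """
<div class="posts">
  <article class="post"><h2>First post</h2></article>
  <article class="post"><h2>Second post</h2></article>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# select_one() returns only the first matching element (or None).
first = soup.select_one("article.post")
first_only = [first] if first else []

# select() returns every matching element in document order.
all_posts = soup.select("article.post")

titles_first_only = [a.h2.get_text(strip=True) for a in first_only]
titles_all = [a.h2.get_text(strip=True) for a in all_posts]

print(titles_first_only)  # ['First post']
print(titles_all)         # ['First post', 'Second post']
```

The second `<article>` is present in the parsed tree; it is simply never selected when only `select_one()` is used.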

Root Cause:

The issue lies in the implementation of the _get_elements() method in JsonCssExtractionStrategy, which is designed to fetch only the first element matching the selector:

def _get_elements(self, element, selector: str):
    selected = element.select_one(selector)  # Only gets the first match
    return [selected] if selected else [] 

Because select_one() returns at most one element, this method can never return more than the first match, no matter how many elements the selector matches. As a result, I couldn't even retrieve a list of tags inside a <div>.
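For reference, here is a minimal sketch of what a list-aware version could look like. The standalone function name `get_elements` is hypothetical, and the sketch assumes the element exposes a BeautifulSoup-style `select()` method. It is an illustration of the idea, not the library's actual implementation:

```python
from typing import Any, List


def get_elements(element: Any, selector: str) -> List[Any]:
    """Return every descendant of `element` that matches `selector`.

    Assumption: `element` provides a BeautifulSoup-style `select()` method
    that returns all matches. The result is an empty list when nothing
    matches, so callers can always iterate over it.
    """
    return list(element.select(selector))
```

Callers that only want a single element could still take the first item of the returned list, so existing single-element fields would not necessarily have to change.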

My Thoughts:

This limitation has been a frustrating blocker and cost me several hours. I'm open to contributing a rewrite of this method so that it handles lists of elements properly, but doing so would require a change in the current strategy. Let me know if this aligns with your goals, and I can propose an updated implementation.
