After working for several hours, I discovered a significant problem with `JsonCssExtractionStrategy` when trying to extract a list of elements.

Example:
In the example provided in the repository:
https://github.com/unclecode/crawl4ai/blob/main/docs/examples/v0_4_24_walkthrough.py
There are two `<article class="post">` elements, but `JsonCssExtractionStrategy` only retrieves the first `<article>` tag.

Root Cause:
The issue lies in the implementation of the `_get_elements()` method in `JsonCssExtractionStrategy`, which is designed to fetch only the first element matching the selector:

```python
def _get_elements(self, element, selector: str):
    selected = element.select_one(selector)  # Only gets the first match
    return [selected] if selected else []
```

This approach completely overlooks the possibility of handling multiple elements. As a result, I couldn't even retrieve a list of tags inside a `<div>`.

My Thoughts:
This limitation has been a frustrating blocker, and it wasted several hours of my time. I'm open to contributing and rewriting this method to handle lists of elements properly, but doing so would require a change in the current strategy. Let me know if this aligns with your goals, and I can propose an updated implementation.
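For illustration, here is a minimal sketch of how a list-returning variant of `_get_elements()` might look. It assumes the `element` passed in is a BeautifulSoup node, which the `select_one()` call suggests but which has not been checked against the rest of the library. The standalone function name `get_elements` and the sample HTML are hypothetical and are not taken from the crawl4ai codebase.

```python
from bs4 import BeautifulSoup


def get_elements(element, selector: str) -> list:
    """Return every descendant of `element` that matches the CSS selector.

    Hypothetical standalone variant of _get_elements(), not the library's code.
    """
    # select() returns all matches (an empty list if there are none),
    # whereas select_one() returns only the first match or None.
    return list(element.select(selector))


# Made-up sample markup with two matching <article> elements.
html = """
<div>
  <article class="post"><h2>First</h2></article>
  <article class="post"><h2>Second</h2></article>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

first_only = soup.select_one("article.post")    # behaviour described in the issue
all_posts = get_elements(soup, "article.post")  # proposed behaviour
titles = [post.h2.get_text() for post in all_posts]
print(titles)
```

With this change both `<article class="post">` elements are returned, and a selector with no matches still yields an empty list, so callers that already iterate over the result would not need a separate `None` check. Whether other parts of the strategy rely on the single-element behaviour is an open question that a real patch would need to verify.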