Add video models + functions #814
base: main
Conversation
Codecov Report
Additional details and impacted files:

```
@@           Coverage Diff            @@
##             main     #814      +/- ##
==========================================
- Coverage   87.61%   86.63%    -0.98%
==========================================
  Files         128      129        +1
  Lines       11385    11556      +171
  Branches     1540     1561       +21
==========================================
+ Hits         9975    10012       +37
- Misses       1023     1156      +133
- Partials      387      388        +1
```
Amazing PR!
It would be great to use concise and minimalistic naming and API, because we are going to have many file types across multiple domains.
- Naming
Keywords like Meta will make the classes hard for users to remember and use; users have their own meta 🙂
How about this renaming:
VideoFile -> BaseVideo (I assume people won't use this often)
VideoMeta -> Video (the most used class)
VideoClip -> Clip (also, shouldn't it be based on Video, with meta?)
VideoFrame -> FrameBase
VideoFrameMeta -> Frame
start_time -> start
end_time -> end
frames_count -> count
Image -> BaseImage
ImageMeta -> Image
FileTypes can also be extended: image (read meta), base_image (do not read meta), video (read meta), base_video (do not read meta), video_clip, base_video_clip, ...
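A rough sketch of how the proposed renaming could fit together as a class hierarchy. This is purely illustrative, not the PR's actual code; the constructor fields (path, width, height, duration, start, end) are hypothetical stand-ins for whatever the real models carry:

```python
# Hypothetical sketch of the suggested naming: a thin "base" file class
# plus a meta-enriched subclass that users would actually touch.

class BaseVideo:
    """File reference only; no meta is read (hypothetical)."""

    def __init__(self, path: str):
        self.path = path


class Video(BaseVideo):
    """Meta-enriched video, the most-used class (hypothetical fields)."""

    def __init__(self, path: str, width: int, height: int, duration: float):
        super().__init__(path)
        self.width = width
        self.height = height
        self.duration = duration


class Clip(Video):
    """A clip carries its parent's meta plus start/end (hypothetical)."""

    def __init__(self, path, width, height, duration, start: float, end: float):
        super().__init__(path, width, height, duration)
        self.start = start
        self.end = end
```

Under this layout, Clip inherits the meta from Video, which addresses the "shouldn't it be based on Video, with meta?" question above.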
- Do we need dummy classes?
I assume that people prefer working with meta information when dealing with images and videos. A follow-up question: do we really need BaseImage and BaseVideo without any logic? Why don't we clean up the API and keep only the meta-enriched versions? Users can still work with videos as File if meta is not needed.
- Do we need singular methods?
save_video_clips() and save_video_clip(): how much extra code does a user need to get rid of the singular form? If it's just one method, let's avoid the singular version. The same question for video_frames() and video_frames_np().
I assume we can add the methods and classes later if there is a need. But I'd not start with such a rich API for now, and would try my best to keep it minimalistic.
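To illustrate the "plural-only" idea: a single list-taking method covers the one-item case too, so the singular variant adds no expressiveness. A sketch with hypothetical names (save_video_clips is from the discussion above; the `save` callable is a stand-in for the real writer):

```python
# Plural-only API sketch: save_video_clips() takes any iterable of clips,
# so saving one clip is just a one-element list.

def save_video_clips(clips, save=print):
    """Save every clip; `save` stands in for the real persistence logic."""
    saved = []
    for clip in clips:
        save(clip)
        saved.append(clip)
    return saved

# Singular case, no extra method needed:
# save_video_clips([clip])
```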
WDYT?
src/datachain/lib/file.py
Outdated
```python
width: int
height: int
format: str
```
How about EXIF and XMP? :)
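For reference, EXIF is readable with Pillow's `getexif()`; XMP has `getxmp()` in newer Pillow versions (requires defusedxml). A minimal round-trip sketch, independent of this PR's models, using tag 274 (Orientation):

```python
from io import BytesIO

from PIL import Image

# Build a tiny image in memory and attach an Orientation EXIF tag (274).
img = Image.new("RGB", (8, 8), "red")
exif = img.getexif()
exif[274] = 1  # 1 = normal orientation

buf = BytesIO()
img.save(buf, format="JPEG", exif=exif.tobytes())

# Read the tag back from the saved JPEG bytes.
buf.seek(0)
orientation = Image.open(buf).getexif().get(274)
print(orientation)  # 1
```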
```python
yield img

def video_frames(
```
can a lot of these helpers become part of the Video* classes?
Good question 👍 I was thinking about this and tried to implement it that way, but in the end I checked the other types and files in the lib module (images, hf) and made it the same way.
I was also thinking about moving all the models to the datachain.model module, but it turns out that needs more work and may not be backward compatible with the File model. It is a subject for a separate PR.
yeah, we need all of these to become methods of the Video class. Should it be a followup or in this PR?
I'd appreciate more insights on the issues with this approach.
Done ✅
src/datachain/lib/video.py
Outdated
```python
props = iio.improps(file.stream(), plugin="pyav")
frames_count, width, height, _ = props.shape

meta = iio.immeta(file.stream(), plugin="pyav")
```
I don't like this part; it looks like we are reading the video file twice here. Need to check for another way to get video meta information.
yep, also are we reading the whole file to get meta?
For now I have rewritten this code to use the ffmpeg-python package, which works on the file underneath. This is the fastest and most robust way I know to get video file metadata, since it uses ffmpeg. I am open to discussion here.
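For context, ffmpeg-python exposes metadata via `ffmpeg.probe(path)`, a thin wrapper over ffprobe's JSON output. A sketch of extracting basic meta from a probe result; the dict shape follows ffprobe's JSON, and the helper name `video_meta_from_probe` is hypothetical (the actual probe call needs an ffmpeg binary installed, so the example parses a trimmed sample dict instead):

```python
# The real call would be:
#   import ffmpeg
#   info = ffmpeg.probe("video.mp4")

def video_meta_from_probe(info: dict) -> dict:
    """Extract basic meta from an ffprobe-style result (sketch)."""
    # ffprobe lists all streams; pick the first video stream.
    video_stream = next(
        s for s in info["streams"] if s["codec_type"] == "video"
    )
    return {
        "width": int(video_stream["width"]),
        "height": int(video_stream["height"]),
        "duration": float(info["format"]["duration"]),
    }

sample = {  # trimmed ffprobe-like output
    "streams": [{"codec_type": "video", "width": 1280, "height": 720}],
    "format": {"duration": "12.5"},
}
print(video_meta_from_probe(sample))
# {'width': 1280, 'height': 720, 'duration': 12.5}
```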
👍
For now we have naming with
Done. Only
We don't have
That's good suggestion, only we use
Good question. I've added:

```python
def video_meta(file: "VideoFile") -> Video:
    """
    Returns video file meta information.

    Args:
        file (VideoFile): VideoFile object.

    Returns:
        Video: Video file meta information.
    """
```
Sounds reasonable to me 👍 Will update the code (not done yet).
Done.
Those are great comments! Love the discussion ❤️
src/datachain/lib/file.py
Outdated
```python
"""`DataModel` for reading video files."""

class VideoClip(VideoFile):
```
so, how are all these models connected with the helpers? how do I instantiate them? do I have to write my own UDFs to do that (just instantiate these classes)?
Please, check video example notebook here: iterative/datachain-examples#28
```python
def save(self, destination: str):
    """Writes its content to destination"""
    self.read().save(destination)

class Image(DataModel):
```
why do we need this separate model?
Same as for video info (the Video model). I can remove it from this PR 🤔
```python
timestamp: float = Field(default=0)

class Video(DataModel):
```
Should it be a subclass of VideoFile?
Good question. This is video meta information only. Do you think it will be used with the VideoFile model only? What would be the best name for this model then? 🤔
I've updated the code.
Also updated the video example notebook here: iterative/datachain-examples#28, please check. It uses most of the new features from this PR.
```python
catalog = get_catalog()

parent, name = posixpath.split(path)
client = catalog.get_client(parent)
```
[C] The top of this method is the same as upload; could have a get_client_from_path helper.
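The suggested helper could look like the sketch below. `get_client_from_path` is the reviewer's hypothetical name, and the catalog API (`get_client`) is assumed from the snippet above:

```python
import posixpath

def get_client_from_path(catalog, path: str):
    """Split a path and resolve its client (hypothetical helper,
    factoring out the lines shared by both methods)."""
    parent, name = posixpath.split(path)
    client = catalog.get_client(parent)
    return client, parent, name
```

Both call sites would then become a single line, e.g. `client, parent, name = get_client_from_path(catalog, path)`.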
```python
client = catalog.get_client(parent)

file_info = client.fs.info(path)
return client.info_to_file(file_info, name)
```
If you want to be able to call open on this File, you need to _set_stream with a catalog.
```python
return video_info(self)

def get_frame_np(self, frame: int) -> "ndarray":
```
thinking out loud here, but should a frame be an ImageFile, and ImageFile have a to_ndarray method?
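A `to_ndarray` could be a thin decode step. A sketch assuming the frame is available as raw RGB bytes with known width/height (the function name and the raw-bytes representation are hypothetical, not this PR's actual API):

```python
import numpy as np

def frame_to_ndarray(data: bytes, width: int, height: int) -> np.ndarray:
    """Decode raw RGB bytes into an (H, W, 3) uint8 array (sketch)."""
    return np.frombuffer(data, dtype=np.uint8).reshape(height, width, 3)

# A 2x2 all-red frame as raw RGB bytes:
raw = bytes([255, 0, 0]) * 4
arr = frame_to_ndarray(raw, width=2, height=2)
print(arr.shape)  # (2, 2, 3)
```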
I'll take a look at the notebook, but my thought would be that you would want to be able to call something like video.split_to_frame with an optional start/end frame and an optional destination path, and DataChain would split the video into frames and upload them all to a bucket as images.
The frames use-case could end up looking something like:

```python
(
    DataChain.from_storage("gs://datachain-demo/some-desc/videos")
    .limit(20)
    .gen(frame=file.split_to_frame, params="file", output={"frame": ImageFile})
    .setup(yolo=lambda: YOLO("yolo11n.pt"))
    .map(boxes=process_bboxes)
    .show()
)
```
This can also be done with the save_frames method below, or we can add a new method upload_frames to upload images to the storage instead of saving them.
See #797
TODO:
- Video models added
- Meta models added
- Couple of usage examples
- Listing
- Add meta
- Split video into virtual frames
- Split video into frames and upload to storage