Add video models + functions #814

Open

dreadatour wants to merge 57 commits into main from video-models.

Commits (57):
75877d1  Add video models + functions (dreadatour, Jan 13, 2025)
031b9df  Code review update (dreadatour, Jan 14, 2025)
548bbd5  [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot], Jan 14, 2025)
b55149a  Code review update (dreadatour, Jan 14, 2025)
2cd6d62  Code review update (dreadatour, Jan 15, 2025)
5892ab9  Small fixes due to work on usage examples (dreadatour, Jan 15, 2025)
f3dc66a  Examples fixes (dreadatour, Jan 20, 2025)
65529f3  docs(merge): add examples with Func object (#811) (shcheklein, Jan 13, 2025)
b044082  fix(tqdm): import tqdm to support jupyter (#812) (shcheklein, Jan 13, 2025)
2a77047  [pre-commit.ci] pre-commit autoupdate (#815) (pre-commit-ci[bot], Jan 13, 2025)
89ee2f0  progress: remove unused logging/tqdm lock (#817) (skshetry, Jan 14, 2025)
5f522ad  build(deps): bump ultralytics from 8.3.58 to 8.3.61 (#816) (dependabot[bot], Jan 14, 2025)
e2f5a3a  Review help/usage for cli commands (#802) (amritghimire, Jan 15, 2025)
67beb9f  file: raise error (#820) (skshetry, Jan 15, 2025)
60c5848  README - mistral fix (#821) (dmpetrov, Jan 16, 2025)
d3b1619  file: support exporting files as a symlink (#819) (skshetry, Jan 16, 2025)
e31210c  prefetching: remove prefetched item after use in udf (#818) (skshetry, Jan 16, 2025)
bcd95b1  ReferenceFileSystem: use fs.open instead of fs._open (#823) (skshetry, Jan 16, 2025)
08edd27  Second iteration of cli command help (#826) (amritghimire, Jan 18, 2025)
dbefa5f  Fix list of tuples. Closes #827 (#828) (dmpetrov, Jan 19, 2025)
258454e  Added full outer join (#822) (ilongin, Jan 20, 2025)
328c1a7  memoize usearch.sqlite_path() (#833) (skshetry, Jan 20, 2025)
a1a47b2  Added `isnone()` function (#801) (ilongin, Jan 20, 2025)
5b2f45b  tests: reduce pytorch functional tests' runtime (#834) (skshetry, Jan 20, 2025)
14caa08  improve runtime of diff unit tests (#831) (mattseddon, Jan 20, 2025)
746fd73  move functional tests out of unit test suite (#832) (mattseddon, Jan 20, 2025)
0fe47dd  import Int into test_datachain_merge (fix tests broken on bad merge) … (mattseddon, Jan 20, 2025)
1598c4c  [pre-commit.ci] pre-commit autoupdate (#836) (pre-commit-ci[bot], Jan 20, 2025)
0c3f3b4  build(deps): bump ultralytics from 8.3.61 to 8.3.64 (#839) (dependabot[bot], Jan 21, 2025)
bf824af  build(deps): bump mkdocs-material from 9.5.22 to 9.5.50 (#838) (dependabot[bot], Jan 21, 2025)
428d865  Revert "build(deps): bump mkdocs-material from 9.5.22 to 9.5.50 (#838… (yathomasi, Jan 21, 2025)
b7549b1  Add CSV parsing options (#813) (skirdey, Jan 21, 2025)
8639246  e2e tests: limit name_len_slow to 3, split e2e tests from other tests… (skshetry, Jan 21, 2025)
3376449  ci: switch trigger from `pull_request_target` to `pull_request` (#843) (skshetry, Jan 21, 2025)
5b2e437  rename DataChainCache to Cache (#847) (skshetry, Jan 21, 2025)
213b1d8  feat: add apollo integration, drop reo.dev (#835) (yathomasi, Jan 22, 2025)
43389f7  append e2e tests coverage instead of overwriting (#851) (mattseddon, Jan 22, 2025)
5a20c4e  drop unstructured examples (#854) (mattseddon, Jan 24, 2025)
b72c440  add upload classmethod to File (#850) (mattseddon, Jan 24, 2025)
55cd044  drop .edatachain support (#853) (skshetry, Jan 24, 2025)
69a4385  pull _is_file checks to get_listing (#846) (skshetry, Jan 24, 2025)
7859e16  use posixpath in upload methods (#855) (mattseddon, Jan 24, 2025)
3f47d12  Handle permission error properly when checking for file (#856) (amritghimire, Jan 27, 2025)
17118d1  catch (HfHub)HTTPError in hf-dataset-llm-eval example (#848) (mattseddon, Jan 27, 2025)
cc05da9  Code review updates (dreadatour, Jan 27, 2025)
8d9f6c2  Merge branch 'main' into video-models (dreadatour, Jan 27, 2025)
23514f7  Update video requirements (dreadatour, Jan 28, 2025)
8a8dd64  Code review updates (dreadatour, Jan 28, 2025)
1a04dd0  Merge branch 'main' into video-models (dreadatour, Jan 28, 2025)
0c95c3d  Merge branch 'main' into video-models (dreadatour, Jan 29, 2025)
e55405d  Code review updates + tests (dreadatour, Jan 29, 2025)
8e2a673  Set up ffmpeg in tests (dreadatour, Jan 29, 2025)
9c910ec  Set up ffmpeg in tests (dreadatour, Jan 29, 2025)
a2b8c9a  Set up ffmpeg in tests (dreadatour, Jan 29, 2025)
63448d9  Update 'ensure_cached' test (dreadatour, Jan 29, 2025)
abe39f5  Revert 'ensure_cached' test (dreadatour, Jan 29, 2025)
3b7b829  [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot], Jan 29, 2025)
3 changes: 3 additions & 0 deletions .github/workflows/tests-studio.yml
@@ -75,6 +75,9 @@ jobs:
path: './backend/datachain'
fetch-depth: 0

- name: Set up FFmpeg
uses: AnimMouse/setup-ffmpeg@v1

- name: Set up Python ${{ matrix.pyv }}
uses: actions/setup-python@v5
with:
3 changes: 3 additions & 0 deletions .github/workflows/tests.yml
@@ -78,6 +78,9 @@ jobs:
fetch-depth: 0
ref: ${{ github.event.pull_request.head.sha || github.ref }}

- name: Set up FFmpeg
uses: AnimMouse/setup-ffmpeg@v1

- name: Set up Python ${{ matrix.pyv }}
uses: actions/setup-python@v5
with:
10 changes: 9 additions & 1 deletion pyproject.toml
@@ -77,8 +77,16 @@ hf = [
"numba>=0.60.0",
"datasets[audio,vision]>=2.21.0"
]
video = [
# Use 'av<14' because of incompatibility with imageio
# See https://github.com/PyAV-Org/PyAV/discussions/1700
"av<14",
"ffmpeg-python",
"imageio[ffmpeg]",
"opencv-python"
]
tests = [
"datachain[torch,remote,vector,hf]",
"datachain[torch,remote,vector,hf,video]",
"pytest>=8,<9",
"pytest-sugar>=0.9.6",
"pytest-cov>=4.1.0",
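The new video extra groups the dependencies needed by the video helpers. As a rough illustration, the guard below mirrors the import-error pattern already used in src/datachain/lib/hf.py; the module layout, imports, and message are assumptions, not code from this PR:

    # Hypothetical import guard for the video helpers, mirroring the existing
    # hf.py pattern; the actual video module in this PR may be structured
    # differently.
    try:
        import cv2  # provided by the "video" extra via opencv-python
        import imageio.v3 as iio  # provided via imageio[ffmpeg]
    except ImportError as exc:
        raise ImportError(
            "Missing dependencies for video processing.\n"
            "To install run:\n\n"
            "  pip install 'datachain[video]'\n"
        ) from exc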
219 changes: 215 additions & 4 deletions src/datachain/lib/file.py
@@ -17,7 +17,7 @@
from urllib.request import url2pathname

from fsspec.callbacks import DEFAULT_CALLBACK, Callback
from PIL import Image
from PIL import Image as PilImage
from pydantic import Field, field_validator

from datachain.client.fileslice import FileSlice
@@ -27,6 +27,7 @@
from datachain.utils import TIME_ZERO

if TYPE_CHECKING:
from numpy import ndarray
from typing_extensions import Self

from datachain.catalog import Catalog
@@ -40,7 +40,7 @@
# how to create file path when exporting
ExportPlacement = Literal["filename", "etag", "fullpath", "checksum"]

FileType = Literal["binary", "text", "image"]
FileType = Literal["binary", "text", "image", "video"]


class VFileError(DataChainError):
@@ -193,7 +194,7 @@
@classmethod
def upload(
cls, data: bytes, path: str, catalog: Optional["Catalog"] = None
) -> "File":
) -> "Self":
if catalog is None:
from datachain.catalog.loader import get_catalog

@@ -203,6 +204,8 @@

client = catalog.get_client(parent)
file = client.upload(data, name)
if not isinstance(file, cls):
file = cls(**file.model_dump())
file._set_stream(catalog)
return file

@@ -486,13 +489,219 @@
def read(self):
"""Returns `PIL.Image.Image` object."""
fobj = super().read()
return Image.open(BytesIO(fobj))
return PilImage.open(BytesIO(fobj))

def save(self, destination: str):
"""Writes it's content to destination"""
self.read().save(destination)


class Image(DataModel):
Review comment (Member): why do we need this separate model?

Reply (dreadatour, Contributor Author): Same as for video info (the Video model). I can remove it from this PR 🤔

Reply (Member): it's just a bit weird that we have ImageFile and Image (which contains only some basic metadata) 🤔

Reply (dreadatour, Contributor Author): It was VideoMeta (and ImageMeta) before, but Dmitry asked to rename these models here. I agree that having a Video (Image) model with just meta looks odd. I think you're right, and we should inherit this model from VideoFile (ImageFile) to extend files with meta; then it will make sense. If not, what do you think about VideoInfo (and ImageInfo)?

"""`DataModel` for image file meta information."""

width: int = Field(default=-1)
height: int = Field(default=-1)
format: str = Field(default="")


class VideoFile(File):
"""`DataModel` for reading video files."""

def get_info(self) -> "Video":
"""Returns video file information."""
from .video import video_info

return video_info(self)

def get_frame_np(self, frame: int) -> "ndarray":
Review comment (Member): thinking out loud here, but should a frame be an ImageFile, and should ImageFile have a to_ndarray method?

Reply (Member): I'll take a look at the notebook, but my thought would be that you would want to be able to call something like video.split_to_frame with an optional start/end frame plus an optional destination path, and DataChain would split the video into frames and upload them all to a bucket as images.

Reply (mattseddon, Jan 28, 2025): The frames use-case could end up looking something like:

(
    DataChain.from_storage("gs://datachain-demo/some-desc/videos")
    .limit(20)
    .gen(frame=file.split_to_frame, params="file", output={"frame": ImageFile})
    .setup(yolo=lambda: YOLO("yolo11n.pt"))
    .map(boxes=process_bboxes)
    .show()
)

Reply (dreadatour, Contributor Author): This can also be done with the save_frames method below, or we can add a new upload_frames method to upload images to storage instead of saving them.

Reply (mattseddon, Jan 28, 2025): save_frames without upload breaks the promise of dataset reproducibility - thinking out loud again.

"""
Reads video frame from a file.

Args:
frame (int): Frame number to read.

Returns:
ndarray: Video frame.
"""
from .video import video_frame_np

return video_frame_np(self, frame)

def get_frame(self, frame: int, format: str = "jpg") -> bytes:
"""
Reads video frame from a file and returns as image bytes.

Args:
frame (int): Frame number to read.
format (str): Image format (default: 'jpg').

Returns:
bytes: Video frame image as bytes.
"""
from .video import video_frame

return video_frame(self, frame, format)

def save_frame(
self,
frame: int,
output_file: str,
format: Optional[str] = None,
) -> "VideoFrame":
"""
Saves video frame as an image file.

Args:
frame (int): Frame number to read.
output_file (str): Output file path.
Review comment (Member): what format is the default? does it support different formats?

Reply (dreadatour, Contributor Author): The output format is taken from the output file extension. See here.

Reply (Member): so, the extension determines it? I wonder if we need to clarify this, or whether it will be kinda expected by end users 🤔

Reply (dreadatour, Contributor Author): Updated:

    def save_frame(
        self,
        frame: int,
        output_file: str,
        format: Optional[str] = None,
    ) -> "VideoFrame":
        """
        Saves video frame as an image file.

        Args:
            frame (int): Frame number to read.
            output_file (str): Output file path.
            format (str): Image format (default: use output file extension).

        Returns:
            VideoFrame: Video frame model.
        """

format (str): Image format (default: use output file extension).

Returns:
VideoFrame: Video frame model.
"""
from .video import save_video_frame

return save_video_frame(self, frame, output_file, format=format)

def get_frames_np(
self,
start_frame: int = 0,
end_frame: Optional[int] = None,
step: int = 1,
) -> "Iterator[ndarray]":
"""
Reads video frames from a file.

Args:
start_frame (int): Frame number to start reading from (default: 0).
end_frame (int): Frame number to stop reading at (default: None).
step (int): Step size for reading frames (default: 1).

Returns:
Iterator[ndarray]: Iterator of video frames.
"""
from .video import video_frames_np

yield from video_frames_np(self, start_frame, end_frame, step)

def get_frames(
self,
start_frame: int = 0,
end_frame: Optional[int] = None,
step: int = 1,
format: str = "jpg",
) -> "Iterator[bytes]":
"""
Reads video frames from a file and returns as bytes.

Args:
start_frame (int): Frame number to start reading from (default: 0).
end_frame (int): Frame number to stop reading at (default: None).
step (int): Step size for reading frames (default: 1).
format (str): Image format (default: 'jpg').

Returns:
Iterator[bytes]: Iterator of video frames.
"""
from .video import video_frames

yield from video_frames(self, start_frame, end_frame, step, format)

def save_frames(
self,
output_dir: str,
start_frame: int = 0,
end_frame: Optional[int] = None,
step: int = 1,
format: str = "jpg",
) -> "Iterator[VideoFrame]":
"""
Saves video frames as image files.

Args:
output_dir (str): Output directory path.
start_frame (int): Frame number to start reading from (default: 0).
end_frame (int): Frame number to stop reading at (default: None).
step (int): Step size for reading frames (default: 1).
format (str): Image format (default: 'jpg').

Returns:
Iterator[VideoFrame]: Iterator of video frame models.
"""
from .video import save_video_frames

yield from save_video_frames(
self, output_dir, start_frame, end_frame, step, format
)

def save_fragment(
self,
start_time: float,
end_time: float,
output_file: str,
) -> "VideoFragment":
"""
Saves video interval as a new video file.

Args:
start_time (float): Start time in seconds.
end_time (float): End time in seconds.
output_file (str): Output file path.

Returns:
VideoFragment: Video fragment model.
"""
from .video import save_video_fragment

return save_video_fragment(self, start_time, end_time, output_file)

def save_fragments(
self,
intervals: list[tuple[float, float]],
output_dir: str,
) -> "Iterator[VideoFragment]":
"""
Saves video intervals as new video files.

Args:
intervals (list[tuple[float, float]]): List of start and end times
in seconds.
output_dir (str): Output directory path.

Returns:
Iterator[VideoFragment]: Iterator of video fragment models.
"""
from .video import save_video_fragments

yield from save_video_fragments(self, intervals, output_dir)


class VideoFragment(VideoFile):
"""`DataModel` for reading video fragments."""

start: float = Field(default=-1.0)
end: float = Field(default=-1.0)


class VideoFrame(ImageFile):
"""`DataModel` for reading video frames."""

frame: int = Field(default=-1)
timestamp: float = Field(default=-1.0)


class Video(DataModel):
"""`DataModel` for video file meta information."""

width: int = Field(default=-1)
height: int = Field(default=-1)
fps: float = Field(default=-1.0)
duration: float = Field(default=-1.0)
frames: int = Field(default=-1)
format: str = Field(default="")
codec: str = Field(default="")


class ArrowRow(DataModel):
"""`DataModel` for reading row from Arrow-supported file."""

@@ -528,5 +737,7 @@
file = TextFile
elif type_ == "image":
file = ImageFile # type: ignore[assignment]
elif type_ == "video":
file = VideoFile

Check warning (Codecov / codecov/patch) on line 741 in src/datachain/lib/file.py: Added line #L741 was not covered by tests.

return file
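
To give a sense of how the new type="video" option and the VideoFile helpers fit together, here is a minimal usage sketch. It is not taken from this PR: the bucket path, frame step, and output directory are placeholders, and the chain calls follow the examples earlier in this thread.

    # Hedged usage sketch; paths and parameters are illustrative only.
    from collections.abc import Iterator

    from datachain import DataChain
    from datachain.lib.file import Video, VideoFile, VideoFrame


    def video_meta(file: VideoFile) -> Video:
        # Probe width/height/fps/duration/codec for each video file.
        return file.get_info()


    def split_frames(file: VideoFile) -> Iterator[VideoFrame]:
        # Save every 30th frame as a JPEG under a local "frames/" directory.
        yield from file.save_frames("frames/", step=30, format="jpg")


    # Collect per-file metadata.
    (
        DataChain.from_storage("gs://datachain-demo/videos", type="video")
        .map(meta=video_meta)
        .show()
    )

    # Expand each video into one row per saved frame.
    (
        DataChain.from_storage("gs://datachain-demo/videos", type="video")
        .gen(frame=split_frames)
        .show()
    )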
2 changes: 1 addition & 1 deletion src/datachain/lib/hf.py
@@ -20,7 +20,7 @@

except ImportError as exc:
raise ImportError(
"Missing dependencies for huggingface datasets:\n"
"Missing dependencies for huggingface datasets.\n"
"To install run:\n\n"
" pip install 'datachain[hf]'\n"
) from exc
Empty file removed: src/datachain/lib/vfile.py