Big Interleaved Dataset is a LAION project that aims to create an open-source multimodal dataset similar to DeepMind's M3W (MultiModal MassiveWeb dataset).
- Configure and install Poetry. We use this tool to manage our Python dependencies.
- Set up your base Python 3.8.15 environment with a tool such as Miniconda.
- Clone the project and run the following commands:
cd big-interleaved-dataset/
# Set up the Python environment and install all associated dev packages
poetry install --with dev
# Activate the virtual environment
poetry shell
# Initialize pre-commit to set up formatting via Black, etc.
pre-commit install
- The virtual environment is now configured. Here are some helpful commands for development:
# Run a certain script with the virtual environment
poetry run python tests/adhoc_test.py
# To add a new package
poetry add numpy==1.24.1
poetry update
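Since the setup above targets Python 3.8.15, it can help to confirm the interpreter version before running `poetry install`. The following is a minimal sketch, not part of the project; the function name `check_python_version` is hypothetical, and it assumes `python3` is on your `PATH`.

```shell
# Hypothetical helper (not part of this repository): compare the active
# python3 version against an expected version string such as 3.8.15.
check_python_version() {
    expected="$1"
    # Ask the interpreter for its own version, e.g. "3.8.15".
    actual="$(python3 -c 'import platform; print(platform.python_version())')" || return 2
    if [ "$actual" = "$expected" ]; then
        echo "OK: Python $actual"
    else
        echo "Expected Python $expected but found $actual" >&2
        return 1
    fi
}
```

Calling `check_python_version 3.8.15` returns a non-zero status when a different interpreter is active, so it can be chained in front of the install step.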
For more information on using Poetry, check out their documentation.
requirements.txt is auto-generated as a backup way to configure the environment with virtualenv.
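As a rough sketch of that fallback path, the helper below creates a virtual environment and installs from requirements.txt without Poetry. It is an assumption, not the project's documented procedure: the function name `setup_from_requirements` and the `.venv` directory are hypothetical, and it uses the standard-library `python3 -m venv` as a stand-in for the `virtualenv` tool.

```shell
# Hypothetical helper (not part of this repository): build an environment
# from requirements.txt when Poetry is not available.
setup_from_requirements() {
    req_file="${1:-requirements.txt}"   # path to the requirements file
    venv_dir="${2:-.venv}"              # assumed environment directory name
    # Stop early if the requirements file is missing.
    if [ ! -f "$req_file" ]; then
        echo "missing $req_file" >&2
        return 1
    fi
    # Create the environment (stdlib venv used here in place of virtualenv).
    python3 -m venv "$venv_dir" || return 1
    # Activate it in the current shell and install the pinned packages.
    . "$venv_dir/bin/activate"
    pip install -r "$req_file"
}
```

Run `setup_from_requirements` from the repository root. The function returns early with an error if requirements.txt is not found, so nothing is created in the wrong directory.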
We discuss ongoing project progress in #big-interleaved-dataset on the LAION Discord.
Our weekly meeting is usually held on Tuesday or Thursday at 8 PM CET. The meeting information will be provided in the channel.
Go to Design.