Implement large file storage and remove large files from this repo #173

Open
varjmes opened this issue Jul 2, 2015 · 18 comments

Comments

@varjmes
Contributor

varjmes commented Jul 2, 2015

Hallo!

One barrier to entry for new contributors is the size of this repository. For maximum awesomeitude:tm:, we should reduce this. The solution is two-fold.

1: Implement git-lfs.

Git-lfs (Large File Storage) is a way to track large files and keep them out of the repository proper, by replacing them with small pointers to the full version of the file on a server. This is an early access feature on GitHub that we have access to on this repository. You must install git-lfs and add a .gitattributes file to start tracking the file extensions of large files (e.g. .psd), so you need to work out which files should be tracked. However, as of the time of writing and with the knowledge I have, this appears to work only for files added after we start tracking. It does not work on the files already in the repo. Which brings us to step 2.
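
For reference, a minimal sketch of what tracking looks like on disk. Running `git lfs track "*.psd"` appends a line like the ones below to `.gitattributes`; the sketch writes those lines by hand so it runs without git-lfs installed. The `.psd` and `.zip` patterns are only examples, and the scratch directory is an assumed stand-in for the repository root.

```shell
set -eu

# Scratch directory standing in for the repository root (assumption).
repo=$(mktemp -d)
cd "$repo"

# Each pattern routes matching files through the LFS filter
# and marks them as non-text so git does not try to diff or merge them.
for pattern in '*.psd' '*.zip'; do
  printf '%s filter=lfs diff=lfs merge=lfs -text\n' "$pattern" >> .gitattributes
done

cat .gitattributes
```

The file only affects files staged after it exists, which matches the limitation described above.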

2: Remove large files from the repository.

This is something that @janl has a better idea of how to do.
We could do the following: create a new repository and copy everything over from this hood.ie repository except the large files. We can then start tracking files for git-lfs (see point one), rename this repository to hoodie-old, and call the new one hood.ie. We must try our best to preserve the commit history from the old repository wherever possible. Good knowledge of git will come in handy. This also needs to be done in a quiet period, as it will require at least 10-20 minutes of downtime on the main website.

@lewiscowper
Contributor

So would we not be able to update .gitattributes, then git rm {{largefile}}, and then git add {{largefile}}?

@varjmes
Contributor Author

varjmes commented Jul 2, 2015

Unfortunately not: although git rm will remove the item from the repository, it will still be in the repository's history and thus doesn't really disappear, continuing to contribute to the bloat.
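
A small sketch that shows this in a throwaway repository, assuming only that `git` is installed. The file name `big.bin`, its size, and the demo identity are made up for the illustration.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Throwaway repository (assumption: nothing here touches the real repo).
repo=$(mktemp -d)
cd "$repo"
git init -q .

# Commit a 256 KiB file of random bytes, then delete it in a second commit.
head -c 262144 /dev/urandom > big.bin
git add big.bin
git commit -q -m "Add big.bin"
git rm -q big.bin
git commit -q -m "Remove big.bin"

# The file is gone from the working tree, but its blob is still
# reachable from history, so every clone still downloads it.
still_in_history=$(git rev-list --objects --all | grep -c 'big.bin' || true)
echo "objects named big.bin in history: $still_in_history"
```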

@lewiscowper
Contributor

Ah, that's frustrating.

I found a SO link talking about stuff where the git repo on a per checkout basis was small, but the .git directory/local history was massive: http://stackoverflow.com/questions/5613345/how-to-shrink-the-git-folder

Might have something in there like git gc that might be of help. :)

There's also http://stevelorek.com/how-to-shrink-a-git-repository.html which appears to get to the point that people's local copies of the repo become invalidated, but I'd guess we may run into the same thing with the name swap.

EDIT: I'm not a git expert, nor volunteering myself to do it and push it. But I'm willing to research it. :)

@varjmes
Contributor Author

varjmes commented Jul 2, 2015

@lewiscowper Looks like it might be worth doing a prune. I don't think that our assets change too much on the main site, but it's worth a go. That second article looks really interesting, thanks!

I too do not feel that I am the best person to do this, but am certainly happy to research/talk about our options/hold someone's hand when they do this :)

@NickColley
Contributor

How big is the repo at the moment?

I think it's somewhat reasonable from a maintainer's point of view to keep it simple and just have one repo.

From my experience multiple repos is a pain in the ass.

@lewiscowper
Contributor

@NickColley it's not splitting up the repo into two, it's transferring the contents of this repo to a new one with Large File Storage set up from the outset.

We're not technically removing them from the project (if I'm understanding GitHub's LFS correctly), we're transferring the large files that drag down the weight of the repo to an alternate remote where they can be downloaded separately.

From a contributor's point of view the repo will be identical in content, but much easier to get and store locally.
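
To make the "pointer" idea concrete, here is a rough sketch of the small text file that ends up in the repository in place of a large file. It builds the pointer by hand with `sha256sum` (GNU coreutils assumed) rather than with git-lfs itself; `asset.bin` and `asset.pointer` are hypothetical names, and the six-byte file stands in for a real large asset.

```shell
set -eu
dir=$(mktemp -d)
cd "$dir"

# Stand-in for a large binary asset (assumption: real assets are far bigger).
printf 'hello\n' > asset.bin

# An LFS pointer records the SHA-256 of the content and its size in bytes.
oid=$(sha256sum asset.bin | cut -d ' ' -f 1)
size=$(( $(wc -c < asset.bin) ))

printf 'version https://git-lfs.github.com/spec/v1\noid sha256:%s\nsize %s\n' \
  "$oid" "$size" > asset.pointer

cat asset.pointer
```

Git stores only those three lines; the LFS client fetches the real content from the LFS server on checkout.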

@mkoppanen

Hey,

happened to see a tweet referencing this issue. Maybe the following would be of use: https://rtyley.github.io/bfg-repo-cleaner/

@KrofDrakula

@mkoppanen As was said in the issue, the point is not to purge old files, but to have large blobs excluded from the repo so that they can still be checked out (preserving history) while being hosted elsewhere for a smaller download. That way you have the entire history, but large blobs are downloaded from external storage (git-lfs).

@KrofDrakula

OK, having had a Twitter conversation on the topic, it seems that the following approach should be valid for the task at hand:

  1. freeze the original repo to prevent adding history after migration;
  2. create a new empty repository and commit the git-lfs rules as the first, root commit;
  3. export the existing branches as a series of patches from the current repository;
  4. replay the commits on the new repository, preserving history but applying the git-lfs pointers in place of the current in-place blobs for git-lfs matched files;
  5. push to a new repository;
  6. have everyone abandon the old repository and check out the new one.
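
A rough sketch of the mechanics of steps 2 to 4, using two throwaway repositories and assuming only that `git` is installed. It shows the export and replay with `git format-patch` and `git am`; it does not show the LFS conversion itself, and whether replayed binaries end up as pointers depends on git-lfs being installed and on how the replay stages files, which this sketch does not verify. The repository names, file names, and `*.psd` rule are placeholders. Note that `git format-patch` linearises history, so merge commits are not preserved.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)

# Stand-in for the existing repository, with two commits.
git init -q "$work/old"
cd "$work/old"
echo one > a.txt
git add a.txt
git commit -q -m "first"
echo two > b.txt
git add b.txt
git commit -q -m "second"

# Step 3: export the whole branch, root commit included, as patches.
git format-patch -q --root -o "$work/patches" HEAD

# Step 2: new repository whose root commit holds the git-lfs rules.
git init -q "$work/new"
cd "$work/new"
printf '*.psd filter=lfs diff=lfs merge=lfs -text\n' > .gitattributes
git add .gitattributes
git commit -q -m "Track large files with git-lfs"

# Step 4: replay the exported commits on top of that root commit.
git am -q "$work/patches"/*.patch

replayed=$(git rev-list --count HEAD)
echo "commits in new repository: $replayed"
```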

I'm volunteering to try tackling this issue, if just to try to get things going, and I'll update this issue as things move along.

FYI, I'll just be using a local git-lfs server to test storage and document the steps needed to achieve this if the Hoodie team wants to do the migration afterwards themselves.

@rtyley

rtyley commented Dec 16, 2015

Since version v1.12.5, the BFG has supported converting Git repos to git-lfs format:

$ java -jar ~/bfg-1.12.5.jar --convert-to-git-lfs '*.wav' --no-blob-protection

https://github.com/rtyley/bfg-repo-cleaner/releases/tag/v1.12.5

@varjmes
Contributor Author

varjmes commented Dec 16, 2015

Thank you @KrofDrakula! ✨

@rtyley

rtyley commented Dec 16, 2015

I did some quick tests using the BFG - if you transfer all *.zip, *.mov, *.png, & *.jpg out of your repo, your Git packfile size will go from 186M to 16M, which is a pretty good saving, but bear in mind that Git LFS does come with some caveats that may mean it's not your best choice right now:

Disclaimer: I'm a former but not current employee of GitHub, and am not closely involved with the LFS project

$ git clone --mirror git@github.com:hoodiehq/hood.ie.git
$ cd hood.ie.git
$ du -h --summarize objects
186M    objects
$ bfg --convert-to-git-lfs '*.mov' --no-blob-protection
$ bfg --convert-to-git-lfs '*.jpg' --no-blob-protection
$ bfg --convert-to-git-lfs '*.png' --no-blob-protection
$ bfg --convert-to-git-lfs '*.zip' --no-blob-protection
$ git reflog expire --expire=now --all && git gc --prune=now --aggressive
$ du -h --summarize objects lfs
16M     objects
204M    lfs
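
The `git reflog expire` and `git gc` step in the transcript above is what actually frees the space: rewritten-away objects stay on disk until the reflog lets go of them and they are pruned. A minimal sketch of that effect in a throwaway repository, assuming only `git`; deleting a side branch stands in for the history rewrite, and `big.bin` and the `scratch` branch are made-up names.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q .
echo keep > keep.txt
git add keep.txt
git commit -q -m "Keep"
base=$(git symbolic-ref --short HEAD)

# Put a large blob on a side branch, then delete the branch
# (a stand-in for history that has been rewritten away).
git checkout -q -b scratch
head -c 262144 /dev/urandom > big.bin
git add big.bin
git commit -q -m "Add big.bin"
blob=$(git rev-parse HEAD:big.bin)
git checkout -q "$base"
git branch -q -D scratch

# No branch points at the blob any more, but the HEAD reflog still does.
if git cat-file -e "$blob" 2>/dev/null; then before=present; else before=gone; fi

git reflog expire --expire=now --all
git gc -q --prune=now --aggressive

if git cat-file -e "$blob" 2>/dev/null; then after=present; else after=gone; fi
echo "before gc: $before, after gc: $after"
```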

@varjmes
Contributor Author

varjmes commented Dec 16, 2015

Git LFS on GitHub doesn't support public forks right now - so only people with push rights to your hoodiehq/hood.ie repo will be able to raise pull requests against it once you switch on LFS.

This concerns me, @janl ?

@gr2m
Member

gr2m commented Dec 16, 2015

Please coordinate with @lewiscowper, @verpixelt and team, who are currently working on the website, I think? If we lose git history, this could cause headaches, just want to make sure :)

@lewiscowper
Contributor

@gr2m We won't be editing anything more than HTML inside this repo, so LFS shouldn't (I hope) affect anything we need to do, at least as far as I understand it.

@KrofDrakula

Seems like the migration process is covered by @rtyley. It really just boils down to what your decision re: the git-lfs hosting is. You could consider hosting your own git-lfs server on S3 which could turn out to be cheap enough to facilitate your use case as an open source project (depending on bandwidth, obviously), but that's out of scope of what is being discussed here.
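
If a self-hosted server were chosen, the repository would need to tell LFS clients where it lives. One way is a committed `.lfsconfig` file with an `lfs.url` entry, sketched below. The URL is purely hypothetical, and a plain S3 bucket would still need something in front of it that speaks the git-lfs API.

```shell
set -eu

# Scratch directory standing in for the repository root (assumption).
repo=$(mktemp -d)
cd "$repo"

# Hypothetical endpoint: replace with the real LFS server URL.
git config -f .lfsconfig lfs.url "https://lfs.example.com/hoodiehq/hood.ie"

cat .lfsconfig
```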

@rtyley

rtyley commented Dec 17, 2015

In the issue definition you've got two parts:

  1. Implement git-lfs
  2. Remove large files from the repository

...depending on what infrastructure you've got available, and what dev pipeline you want to have, it might be reasonable to consider just doing the second one: removing the large files. In the case of your repo, the main contributors to size are .pngs & .jpgs - they account for 116M of your 186M packfile. You can get some quick wins from deleting .zips & .movs, which account for 54M, or .gifs (11M), but in order to make an order-of-magnitude difference to download size, you'll need to do the pngs & jpgs too: *.{zip,mov,gif,png,jpg} gets you down to a packfile that's under 5M.
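
A per-extension breakdown like this can be approximated with plain git. The sketch below sets up a tiny throwaway repository (the file names and sizes are invented) and sums blob sizes across all history by extension. It reports uncompressed blob sizes, not packfile sizes, so the numbers will not match the packfile figures quoted here; paths containing spaces or dotted directory names are not handled.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Throwaway repository with a few files of known size.
repo=$(mktemp -d)
cd "$repo"
git init -q .
head -c 100 /dev/zero > a.png
head -c 50 /dev/zero > b.png
printf '0123456789' > c.txt
git add a.png b.png c.txt
git commit -q -m "Add sample assets"

# List every object in history with its type, size and path,
# then total the blob sizes per file extension, largest first.
report=$(
  git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    awk '$1 == "blob" {
           n = split($3, parts, ".")
           ext = (n > 1) ? parts[n] : "(none)"
           total[ext] += $2
         }
         END { for (e in total) print total[e], e }' |
    sort -rn
)
printf '%s\n' "$report"
```

Run against a real clone (dropping the setup lines), the top entries suggest which patterns are worth moving out of the repository first.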

Without git-lfs, you would need an alternative place for the assets to live, and that might be hard. At the Guardian we have the luxury of an in-house (open-source) image management service, which hashes and permanently stores every jpeg at various resolutions in an S3 bucket behind a CDN. We've made use of that on our membership-frontend (a fairly chunky site, with lots of hi-res imagery) to ensure that very few images are committed to our source control and as a result the packfile is just 40M. You probably don't want to run your own instance of that service, but something like http://cloudinary.com/ would do a similar job.

@varjmes
Contributor Author

varjmes commented Jun 20, 2016

Does this help us at all? https://github.com/blog/2163-import-repositories-with-large-files

if we deleted and reuploaded and put files into LFS


7 participants