From: Jakub Narebski <jnareb@gmail.com>
To: Alexander Gavrilov <angavrilov@gmail.com>
Cc: git@vger.kernel.org, Asger Ottar Alstrup <asger@ottaralstrup.dk>
Subject: Re: Narrow clone implementation difficulty estimate
Date: Thu, 14 May 2009 03:39:49 -0700 (PDT)
Message-ID: <m38wl0klxt.fsf@localhost.localdomain>
In-Reply-To: <200905141404.30695.angavrilov@gmail.com>
Alexander Gavrilov <angavrilov@gmail.com> writes:
> We are considering using Git to manage a large set of mostly binary
> files (large images, pdf files, open-office documents, etc). The
> amount of data is such that it is infeasible to force every user
> to download all of it, so it is necessary to implement a partial
> retrieval scheme.
>
> In particular, we need to decide whether it is better to invest
> effort into implementing Narrow Clone, or partitioning and
> reorganizing the data set into submodules (the latter may prove
> to be almost impossible for this data set). We will most likely
> develop a new, very simplified GUI for non-technical users,
> so the details of both possible approaches will be hidden
> under the hood.

First, there was quite complete work on narrow / sparse / subtree /
partial *checkout*, although as far as I know it was never accepted
into git. IIRC the general idea of extending or (ab)using the
assume-unchanged mechanism was accepted, but the problem was in the
user interface details (I think the porcelain part was quite well
received, except for hesitation over whether to use/extend the
existing flag or create a new one for the purpose of narrow
checkout). You can search the archive for that; these

http://article.gmane.org/gmane.comp.version-control.git/89900
http://article.gmane.org/gmane.comp.version-control.git/90016
http://article.gmane.org/gmane.comp.version-control.git/77046
http://article.gmane.org/gmane.comp.version-control.git/50256
...

should give you some idea what to search for. This is of course
only part of the solution.
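
To give a feel for the existing assume-unchanged flag that the narrow
checkout work wanted to extend, here is a minimal sketch. It is not the
narrow checkout series itself (whose interface is not reproduced here);
it only sets the flag by hand with `git update-index`. The repository
layout and file names are made up for the example.

```shell
#!/bin/sh
set -e

# Throwaway repository with one small text file and one stand-in "large" file.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example User"
git config user.email "example@example.invalid"
mkdir docs media
echo "text" > docs/readme.txt
echo "pretend this is a huge image" > media/big.img
git add .
git commit -q -m "initial"

# Ask git to assume the work tree copy of media/big.img is unchanged.
git update-index --assume-unchanged media/big.img

# A local modification to that file is then not reported by status.
echo "local change" >> media/big.img
status=$(git status --porcelain)

# 'git ls-files -v' marks assume-unchanged entries with a lower-case tag.
flags=$(git ls-files -v media/big.img)

echo "status: [$status]"
echo "flags:  $flags"
```

The flag only affects how the index treats the work tree file; all
objects are still present in the repository, which is why this covers
the checkout side only and not the clone side.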

Second, there was an idea to use the new "replace" mechanism for this
(currently in 'pu' only, I think, merged as the 'cc/replace' branch).
This mechanism was created for better bisecting in the presence of
non-bisectable commits, and is meant to be a transferable extension of
the 'graft' mechanism. The "replace" mechanism also allows replacing
blob objects (file contents), so you can have two repositories: a
baseline repository with stub files in place of the large binary
files, and an extended repository with the replacements in place and
the replacement blobs, holding the 'proper' (and large) contents of
those binary files, in its object database. But that is just an idea,
without implementation.
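
A rough sketch of the stub/replacement idea, assuming the `git replace`
command as it later shipped in git (at the time of writing the
mechanism was only in 'pu', so the interface may have differed). The
blob contents are made up, and nothing here handles transferring the
replacement refs between repositories.

```shell
#!/bin/sh
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q .

# Store a small stub blob and the "real" large content as objects.
stub=$(echo "stub: fetch big.img separately" | git hash-object -w --stdin)
real=$(echo "pretend this is the real, large image data" | git hash-object -w --stdin)

# Record that lookups of the stub blob should yield the real blob instead.
git replace "$stub" "$real"

# Ordinary object access follows the replacement ...
seen=$(git cat-file blob "$stub")

# ... unless replacements are explicitly turned off.
raw=$(git --no-replace-objects cat-file blob "$stub")

echo "with replace:    $seen"
echo "without replace: $raw"
```

In the two-repository picture above, the baseline repository would
hold only the stub, while the extended one would additionally carry
the replacement ref and the large blob.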

Third, there was work (a year ago, perhaps?) by Dana How on better
support for large objects. Some of those patches got accepted, some
did not. You can set the maximum size of an object in a pack, IIRC,
and you can use gitattributes to mark (binary) files that should not
be deltified. If all of your repositories are on a networked
filesystem, you can create a separate optimized pack containing only
those large binary files, mark it as "kept" (using a *.keep file, see
documentation) to avoid repacking those large binary files, and
distribute this pack either using a symlink or using alternates
(keeping only one copy of this pack, and accessing it via the
networked filesystem when it is required).
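
The pieces above could be combined roughly as follows. This is a
sketch under assumptions: `pack.packSizeLimit` caps the size of each
pack file and may not be exactly the setting recalled above, the 512m
value is arbitrary, and the "shared" and "work" repositories stand in
for a location on the networked filesystem and a user's repository.

```shell
#!/bin/sh
set -e

top=$(mktemp -d)
git init -q "$top/shared"   # would live on the networked filesystem
git init -q "$top/work"     # a user's own repository

cd "$top/shared"
git config user.name "Example User"
git config user.email "example@example.invalid"
echo "pretend large binary" > big.pdf

# Mark PDFs as files git should not try to delta-compress.
echo "*.pdf -delta" > .gitattributes
git add .
git commit -q -m "large files"

# Cap the size of generated packs (arbitrary value), then pack everything.
git config pack.packSizeLimit 512m
git repack -a -d -q

# Mark every resulting pack as "kept" so later repacks leave it alone.
for p in .git/objects/pack/*.pack; do
    : > "${p%.pack}.keep"
done

# Let the user's repository borrow objects from the shared one.
cd "$top/work"
echo "$top/shared/.git/objects" > .git/objects/info/alternates

blob=$(git -C "$top/shared" rev-parse HEAD:big.pdf)
borrowed=$(git cat-file blob "$blob")
attr=$(git -C "$top/shared" check-attr delta -- big.pdf)

echo "borrowed: $borrowed"
echo "attr:     $attr"
```

The "work" repository never copies the large blob; it reads it through
the alternates file, so only one copy of the kept pack exists.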

Fourth, a long time ago a patch was sent that supposedly added
support for 'lazy' clone, where you download blob objects from the
remote repository only as required. But it was sent as a single
large, fairly intrusive patch. I don't think it got a good review,
never mind being accepted into git.
Some further reading:
* "large(25G) repository in git"
http://article.gmane.org/gmane.comp.version-control.git/114351
* "Re: Appropriateness of git for digital video production versioning"
http://article.gmane.org/gmane.comp.version-control.git/107696
* http://git.or.cz/gitwiki/GitTogether08 had a presentation
about media files in git, and a thread on the git mailing list about
that issue resulted from it (which I didn't bookmark).
HTH
--
Jakub Narebski
Poland
ShadeHawk on #git