git.vger.kernel.org archive mirror
From: Asger Ottar Alstrup <asger@area9.dk>
To: Avery Pennarun <apenwarr@gmail.com>
Cc: git@vger.kernel.org, Alexander Gavrilov <angavrilov@gmail.com>
Subject: Re: git subtree as a solution to partial cloning?
Date: Mon, 25 May 2009 20:28:18 +0200
Message-ID: <8873ae500905251128h1921895dp6ef227e0e0bbec49@mail.gmail.com>
In-Reply-To: <32541b130905251054k44bdb218sde8837e87d8e8e69@mail.gmail.com>

On Mon, May 25, 2009 at 7:54 PM, Avery Pennarun <apenwarr@gmail.com> wrote:
> On Mon, May 25, 2009 at 1:35 PM, Asger Ottar Alstrup <asger@area9.dk> wrote:
>> So a poor man's system could work like this:
>>
>> - A reduced repository is defined by a list of paths in a file, I
>> guess with a format similar to .gitignore
>
> Are you sure you want to define the list with exclusions instead of
> inclusions?  I don't really know your use case.

Since the .gitignore format supports !, I believe that should not make
much of a difference.

> Anyway, if you're using git filter-branch, it'll be up to you to fix
> the index to contain the list of files you want. (See man
> git-filter-branch)

Yes, sure, and that is why I asked whether there is some tool in git
that can give a list of concrete files surviving a .gitignore list of
patterns.
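
To make the question concrete, here is a minimal sketch of what such a
tool would do, written in plain Python rather than with any git command.
The function name surviving_files and the sample paths are hypothetical,
and the matching is a deliberate simplification of the real .gitignore
rules: patterns are matched with fnmatch against the whole path and the
basename, a trailing "/" is treated as a path prefix, a leading "!"
negates, and the last matching pattern wins.

```python
import fnmatch
import posixpath


def surviving_files(paths, patterns):
    """Return the paths selected by a .gitignore-style pattern list.

    Simplified on purpose: this is not git's own matcher.
    """
    selected = []
    for path in paths:
        keep = False
        for raw in patterns:
            pattern = raw.strip()
            # Skip blank lines and comments, as .gitignore does.
            if not pattern or pattern.startswith("#"):
                continue
            negate = pattern.startswith("!")
            if negate:
                pattern = pattern[1:]
            if pattern.endswith("/"):
                # Directory pattern: match everything below that prefix.
                matched = fnmatch.fnmatchcase(path, pattern + "*")
            else:
                matched = (fnmatch.fnmatchcase(path, pattern)
                           or fnmatch.fnmatchcase(posixpath.basename(path), pattern))
            if matched:
                # Later patterns override earlier ones.
                keep = not negate
        if keep:
            selected.append(path)
    return selected


if __name__ == "__main__":
    all_paths = ["docs/manual.pdf", "docs/draft.odt",
                 "images/logo.png", "src/main.c"]
    reduced = ["docs/", "!docs/draft.odt", "*.png"]
    for name in surviving_files(all_paths, reduced):
        print(name)
```

The resulting list is the kind of input a filter-branch index filter
would need in order to keep only the files of the reduced repository.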

>> - To extract: A copy of the original repository is made. This copy is
>> reduced using git filter-branch. Is there some way of turning a
>> .gitignore syntax file into a concrete list of files? Also, can this
>> entire step be done in one step without the copy? Having to copy the
>> entire project first seems excessive. Will filter-branch preserve
>> and/or prune pack files intelligently?
>
> You probably need to read about the differences between git trees,
> blobs, and commits.  You're not actually "copying" anything; you're
> just creating some new directory structures that contain the
> *existing* blobs.  And of course the existing blobs are in your
> existing packs.

Thanks. OK, I see now that filter-branch will not destroy the original
repository. That is not at all obvious from reading the man page, when
the very first sentence says that it will rewrite history. But the
main point of this exercise is to reduce the size of the reduced
repository so that it can be transferred efficiently. So after
filter-branch, I guess I would run clone to make the new, smaller
repository, and then the question becomes: Will clone reuse
and prune packs intelligently?

> Well, you're getting pretty far out there:
>
> - git is known to work badly with large files, and you have a bunch of
> large files;

As far as I know, git has most of the hooks needed to tune this. There
are still some weak areas where big files are read into memory
multiple times, but I have seen that people are already working on
this.

> - git is intended to manage entire repositories at a time, and you
> want a partial checkout;

The beauty of the subtree-inspired approach is of course that the
users of the reduced repositories WILL in fact be working on an entire
repository. The files are luckily fairly independent in THEIR
workflow. Also, if the mirror-sync proposal gets implemented, one
important part of the distribution piece is also solved: In effect,
these systems combined would give us a kind of narrow-clone.

> - git is intended to download the entire history at once, and you (I
> think) only want part of it.

I do need the entire history for the reduced files.

> By the time you're this far out, maybe what you want isn't git at all.
>  svn would work fine with this arrangement, and people who want
> partial checkouts would rarely benefit from git's distributedness
> anyway, I expect.

In my use case, some people will need to work on the full repository,
and they obviously will have the network and the machines to handle
this. I am currently thinking these people would use something like
glusterfs until mirror-sync is able to solve the problem for us.

However, there is a large group of users that do not need this, but
they DO need the entire history of the files they are interested in.
Subversion does not provide this, and it is simply too slow to handle
the kind of files we need to work with. We have also run tests on the
kind of files we have, and the delta compression that git uses is very
effective at compressing the PDF and OpenOffice documents we use. The
big files we have are primarily image files, and
obviously they do not compress very well. Fortunately, they do not
change much either.
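
The effect can be illustrated with a small sketch. It does not use git's
delta algorithm; as a stand-in, it compresses a new version with zlib
using the previous version as a preset dictionary, which likewise lets
the new data refer back to the old. The helper names and the synthetic
test data are assumptions made for the illustration: random words play
the role of a document, random bytes the role of an already-compressed
image.

```python
import random
import zlib


def standalone_size(data):
    """Size of data compressed on its own."""
    return len(zlib.compress(data, 9))


def delta_like_size(base, data):
    """Size of data compressed with base as a preset dictionary.

    A rough stand-in for delta compression against an earlier version.
    Only the last 32 KiB of base can be referenced by deflate.
    """
    compressor = zlib.compressobj(level=9, zdict=base)
    return len(compressor.compress(data) + compressor.flush())


def make_document(rng, word_count=3000):
    """Synthetic 'document': random lowercase words separated by spaces."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    words = ["".join(rng.choice(letters) for _ in range(rng.randint(3, 9)))
             for _ in range(word_count)]
    return " ".join(words).encode("ascii")


def make_image(rng, size=20000):
    """Synthetic 'image': random bytes, like already-compressed pixel data."""
    return bytes(rng.getrandbits(8) for _ in range(size))


if __name__ == "__main__":
    rng = random.Random(0)
    doc_v1 = make_document(rng)
    # A second version with one small edit in the middle.
    doc_v2 = doc_v1[:10000] + b" an inserted sentence " + doc_v1[10000:]
    image = make_image(rng)

    print("document v2 alone:      ", standalone_size(doc_v2))
    print("document v2 against v1: ", delta_like_size(doc_v1, doc_v2))
    print("image raw / compressed: ", len(image), standalone_size(image))
```

Under these assumptions, the edited document stored against its
predecessor takes a small fraction of its standalone compressed size,
while the image-like data barely shrinks at all.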

While git might not currently be designed to support this use case, it
still seems like the best system to base this on. Yes, it will need
some work before we can use it for our needs, but it seems it is still
less work than what is needed to get other systems to support our
needs.

I appreciate your comments. They are very helpful.

Regards,
Asger
