* 'git clone' doesn't use alternates automatically?
@ 2009-01-30 22:12 James Pickens
  2009-01-31  7:12 ` Jeff King
  0 siblings, 1 reply; 13+ messages in thread
From: James Pickens @ 2009-01-30 22:12 UTC (permalink / raw)
  To: Git ML

Hi,

I have a central, shared Git repository on an NFS drive at path
$central.  I have added "$central/objects" to
$central/objects/info/alternates.  I see that when I clone this
repository with Git 1.6.1, the alternates file is automatically copied
to the clone, but so are all the pack files and loose objects.  If I
then cd to the clone and run 'git gc', it removes the redundant local
objects.

I thought I tested this setup a few months back, and 'git clone'
automatically used the alternates file to avoid copying the redundant
objects into the clone.  Has this behavior changed, or is my memory
bad?

James


* Re: 'git clone' doesn't use alternates automatically?
  2009-01-30 22:12 'git clone' doesn't use alternates automatically? James Pickens
@ 2009-01-31  7:12 ` Jeff King
  2009-01-31 20:08   ` James Pickens
  0 siblings, 1 reply; 13+ messages in thread
From: Jeff King @ 2009-01-31  7:12 UTC (permalink / raw)
  To: James Pickens; +Cc: Git ML

On Fri, Jan 30, 2009 at 03:12:42PM -0700, James Pickens wrote:

> I have a central, shared Git repository on an NFS drive at path
> $central.  I have added "$central/objects" to
> $central/objects/info/alternates.  I see that when I clone this
> repository with Git 1.6.1, the alternates file is automatically copied
> to the clone, but so are all the pack files and loose objects.  If I
> then cd to the clone and run 'git gc', it removes the redundant local
> objects.

Yes, we don't set up alternates to an origin by default. If it's a local
clone, we do hardlink by default:

  $ ls -i git/.git/objects/pack
  7639155 pack-0651ae7e35ffde1921db158a3292e1c81153be1a.idx
  7638782 pack-0651ae7e35ffde1921db158a3292e1c81153be1a.pack
  $ git clone git foo
  ...
  $ ls -i foo/.git/objects/pack
  7639155 pack-0651ae7e35ffde1921db158a3292e1c81153be1a.idx
  7638782 pack-0651ae7e35ffde1921db158a3292e1c81153be1a.pack

but presumably in your example the second clone is _not_ on the NFS
mount, and therefore can't hardlink.

So you can try "git clone -s" to specify that you definitely want
alternates.
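
A minimal sketch of what "git clone -s" sets up, run in a throwaway
directory. The repository names "origin" and "shared" and the demo
identity are made up for illustration; nothing here comes from the
original report.

```shell
#!/bin/sh
# Sketch: a shared clone records the source's object directory in
# objects/info/alternates instead of copying objects.
set -e
orig_dir=$PWD
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q origin
(cd origin && echo content >file && git add file && git commit -q -m one)

# -s (--shared) asks for alternates explicitly.
git clone -q -s origin shared

# The alternates file names the source's object directory.
alternates=$(cat shared/.git/objects/info/alternates)
echo "alternates: $alternates"

# Count object files that live in the clone itself (info/ excluded).
local_objects=$(find shared/.git/objects -type f ! -path '*/info/*' | wc -l | tr -d ' ')
echo "objects stored in the clone: $local_objects"

cd "$orig_dir"
rm -rf "$work"
```

The clone stays small because it reads objects through the alternate,
which is also why it breaks if the source later loses them.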

> I thought I tested this setup a few months back, and 'git clone'
> automatically used the alternates file to avoid copying the redundant
> objects into the clone.  Has this behavior changed, or is my memory
> bad?

I don't recall clone ever being that clever, but I could be wrong (it is
not an area of the code that I am too familiar with).

Can you try a test with a few different versions to see if it ever
behaved as you expected (and if it does, bisect to find the breakage)?

-Peff


* Re: 'git clone' doesn't use alternates automatically?
  2009-01-31  7:12 ` Jeff King
@ 2009-01-31 20:08   ` James Pickens
  2009-01-31 21:08     ` Jakub Narebski
                       ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: James Pickens @ 2009-01-31 20:08 UTC (permalink / raw)
  To: Git ML; +Cc: Jeff King

On Sat, Jan 31, 2009 at 12:12 AM, Jeff King <peff@peff.net> wrote:
> but presumably in your example the second clone is _not_ on the NFS
> mount, and therefore can't hardlink.

That's correct.

> So you can try "git clone -s" to specify that you definitely want
> alternates.

Well, the clone gets the alternates either way.  It just doesn't
use them to avoid copying the data unless I give -s.  More
importantly, if 'git clone' worked the way I thought, then when I
clone a remote repository for which I have a local mirror, I
could avoid typing '--reference <path to local mirror>' by adding
<path to local mirror>/objects to the alternates file in the
remote repository.

> I don't recall clone ever being that clever, but I could be wrong (it is
> not an area of the code that I am too familiar with).
>
> Can you try a test with a few different versions to see if it ever
> behaved as you expected (and if it does, bisect to find the breakage)?

Damn.  I was hoping the response would be "it's a regression, and
here's a patch to fix it".  I went ahead and tested a few old
versions and they all behave the same way.

So, is there any reason 'git clone' shouldn't automatically use
the alternates that it copied into the new repository?  I might
look into writing a patch if nobody objects.

James


* Re: 'git clone' doesn't use alternates automatically?
  2009-01-31 20:08   ` James Pickens
@ 2009-01-31 21:08     ` Jakub Narebski
  2009-01-31 21:43       ` James Pickens
  2009-01-31 21:55     ` Jeff King
  2009-02-01  0:55     ` Junio C Hamano
  2 siblings, 1 reply; 13+ messages in thread
From: Jakub Narebski @ 2009-01-31 21:08 UTC (permalink / raw)
  To: James Pickens; +Cc: Git ML, Jeff King

James Pickens <jepicken@gmail.com> writes:

> So, is there any reason 'git clone' shouldn't automatically use
> the alternates that it copied into the new repository?  I might
> look into writing a patch if nobody objects.

Alternates are fragile with respect to garbage collecting in the
repository you borrow objects from.

-- 
Jakub Narebski
Poland
ShadeHawk on #git


* Re: 'git clone' doesn't use alternates automatically?
  2009-01-31 21:08     ` Jakub Narebski
@ 2009-01-31 21:43       ` James Pickens
  0 siblings, 0 replies; 13+ messages in thread
From: James Pickens @ 2009-01-31 21:43 UTC (permalink / raw)
  To: Git ML; +Cc: Jeff King, Jakub Narebski

On Sat, Jan 31, 2009 at 2:08 PM, Jakub Narebski <jnareb@gmail.com> wrote:
> James Pickens <jepicken@gmail.com> writes:
>
>> So, is there any reason 'git clone' shouldn't automatically use
>> the alternates that it copied into the new repository?  I might
>> look into writing a patch if nobody objects.
>
> Alternates are fragile with respect to garbage collecting in the
> repository you borrow objects from.

I think that's irrelevant in this case.  The scenario is that I
clone repo A, which is borrowing objects from repo B.  So repo A
was already assuming that it's safe to borrow from B.

The current behavior is that the clone of A also borrows from B
automatically.  What I am asking is whether 'git clone' should
take advantage of that to avoid copying redundant objects from A
into the clone.  They will get deleted the first time I run 'git
gc' in the clone anyways.

James


* Re: 'git clone' doesn't use alternates automatically?
  2009-01-31 20:08   ` James Pickens
  2009-01-31 21:08     ` Jakub Narebski
@ 2009-01-31 21:55     ` Jeff King
  2009-02-01  1:19       ` Junio C Hamano
  2009-02-01  0:55     ` Junio C Hamano
  2 siblings, 1 reply; 13+ messages in thread
From: Jeff King @ 2009-01-31 21:55 UTC (permalink / raw)
  To: James Pickens; +Cc: Git ML

On Sat, Jan 31, 2009 at 01:08:16PM -0700, James Pickens wrote:

> Well, the clone gets the alternates either way.  It just doesn't
> use them to avoid copying the data unless I give -s.  More

The other key change is that you don't depend on the origin in your
alternates when you don't use "-s".

> So, is there any reason 'git clone' shouldn't automatically use
> the alternates that it copied into the new repository?  I might
> look into writing a patch if nobody objects.

I think the reason "-s" isn't the default is that alternates are fragile
(as Jakub mentioned), and we don't want to set them up without the user
asking to do so.

So from what you've posted (but I haven't double checked or looked at
the code), it sounds like the current behavior is:

  - with "-s", add the origin as an alternate, and use alternates while
    cloning

  - with "--reference", add some other repo as an alternate, and use
    alternates while cloning

  - without either, copy alternates from origin, but _don't_ use
    alternates while cloning

The last one seems a little silly. Why bother setting up the alternates
if you're not going to use them? I guess because we might not be able to
get the objects at all, otherwise, and we need to know where to copy
them from. But either:

  - that is an implementation-specific detail of clone, and those
    alternates should go away after we clone

      or

  - we should fully respect those alternates

The only downside to the latter is that somebody who has cloned a
repository with alternates now has an alternates-based repository and
might not know it (i.e., they might not have been the one who set up
alternates in the origin).

-Peff


* Re: 'git clone' doesn't use alternates automatically?
  2009-01-31 20:08   ` James Pickens
  2009-01-31 21:08     ` Jakub Narebski
  2009-01-31 21:55     ` Jeff King
@ 2009-02-01  0:55     ` Junio C Hamano
  2009-02-01  1:32       ` James Pickens
  2 siblings, 1 reply; 13+ messages in thread
From: Junio C Hamano @ 2009-02-01  0:55 UTC (permalink / raw)
  To: James Pickens; +Cc: Git ML, Jeff King

James Pickens <jepicken@gmail.com> writes:

> So, is there any reason 'git clone' shouldn't automatically use
> the alternates that it copied into the new repository?

When you say "git clone" without -s, you are saying "I do not want to use
the repository I am cloning from as my alternate, because I do not know if
it will stay stable.  I do not trust it."

This would be a very sensible way to clone, if you were cloning my
repository whose 'pu' and its constituent topic branches are subject to
rewinding at any time.  After I rebase some of the branches and rebuild
'pu', and prune the unnecessary objects from my repository, the objects
you may have been borrowing from me will be gone from my repository.  Of
course, I can remove my repository altogether any time, and when that
happens, your repository will have many missing objects.

That is why "-s" is not the default.
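
The failure mode described above can be reproduced in a throwaway
directory. This is a minimal sketch, assuming a Git new enough to have
"git -C"; the repository names "lender" and "borrower", the branch
"topic" and the demo identity are made up for illustration.

```shell
#!/bin/sh
# Sketch: a shared clone loses objects once the source drops a branch
# and prunes.
set -e
orig_dir=$PWD
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q lender
(
  cd lender
  echo one >file && git add file && git commit -q -m one
  git checkout -q -b topic
  echo two >>file && git commit -q -a -m two
)

# The borrower stores no objects of its own; it reads them from lender.
git clone -q -s lender borrower
topic=$(git -C borrower rev-parse origin/topic)

# The lender drops the topic branch and prunes what became unreachable.
(
  cd lender
  git checkout -q -
  git branch -q -D topic
  git reflog expire --expire=now --all
  git gc -q --prune=now
)

# The borrower still has a ref to the commit, but can it read the object?
if git -C borrower cat-file -e "$topic" 2>/dev/null; then
  state=intact
else
  state=broken
fi
echo "borrower is $state"

cd "$orig_dir"
rm -rf "$work"
```

Nothing in the lender knows the borrower exists, so its prune has no
reason to keep the objects around.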

Only when you positively know that the other repository will not drop
branches or rewind them, perhaps because you control that repository
yourself, it is safe to use it as your alternate, and you use commands
like "git clone -s" and/or "git clone --reference" to do so.

        Side note. People on k.org are encouraged to use Linus's
        repository as an alternate to save space on the k.org machine,
        because it is known that Linus's repository will never rewind its
        branches.

Now, if you are cloning from a local filesystem, by default we will copy
the objects/info/alternates from the source repository to the new one.  It
may be debatable if this is a sensible thing to do.  On one hand, because
you are saying you don't trust if the objects in the source repository
will stay stable by not giving "-s", it might be sensible not to trust its
choice of alternates either.  But in such a case, you can always use file://
URL when cloning to get a full freestanding copy.

I suspect you are trying to improve the other extreme end: trusting all
the other repositories involved in the cloning process a lot more than the
code currently does.

I do not think it is a bad thing to do per se.

I haven't looked at the codepaths involved recently, but if I recall
correctly, optimizing of cloning from a repository that uses alternates
itself was never a part of the initial design considerations.  I suspect
there may be ample room for you to optimize things.


* Re: 'git clone' doesn't use alternates automatically?
  2009-01-31 21:55     ` Jeff King
@ 2009-02-01  1:19       ` Junio C Hamano
  2009-02-02 13:07         ` Jeff King
  0 siblings, 1 reply; 13+ messages in thread
From: Junio C Hamano @ 2009-02-01  1:19 UTC (permalink / raw)
  To: Jeff King; +Cc: James Pickens, Git ML

Jeff King <peff@peff.net> writes:

>   - without either, copy alternates from origin, but _don't_ use
>     alternates while cloning

Are you talking about a local clone optimization that does hardlink from
the source repository?

I am fairly certain that copying alternates from the source repository was
not an intended behaviour but was a consequence of lazy coding of how we
copy (or link) everything from it.  The original was literally the simple
matter of:

    find objects ! -type d -print | cpio $cpio_quiet_flag -pumd$l "$GIT_DIR/"

whose intention was to copy objects/?? and objects/pack/. and it wasn't
even part of the design consideration to worry about what would happen to
the alternates the source repository might have in objects/info/.
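
Why the alternates file comes along for the ride can be seen from the
"find" half of that pipeline alone. A minimal sketch with a fake object
directory; the file names under objects/ are placeholders, and cpio is
left out so the example only needs find.

```shell
#!/bin/sh
# Sketch: "find objects ! -type d" selects every non-directory, so
# info/alternates is listed next to loose objects and packs.
set -e
work=$(mktemp -d)
mkdir -p "$work/objects/ab" "$work/objects/pack" "$work/objects/info"
: >"$work/objects/ab/cdef0123"                  # stand-in for a loose object
: >"$work/objects/pack/pack-demo.pack"          # stand-in for a pack
echo /some/other/repo/.git/objects >"$work/objects/info/alternates"

listing=$(cd "$work" && find objects ! -type d -print | sort)
echo "$listing"

rm -rf "$work"
```

The expression filters on file type only, so anything stored under
objects/info/ is copied or linked along with the object files.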


* Re: 'git clone' doesn't use alternates automatically?
  2009-02-01  0:55     ` Junio C Hamano
@ 2009-02-01  1:32       ` James Pickens
  2009-02-01  1:38         ` Junio C Hamano
  0 siblings, 1 reply; 13+ messages in thread
From: James Pickens @ 2009-02-01  1:32 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Git ML, Jeff King

On Sat, Jan 31, 2009, Junio C Hamano <gitster@pobox.com> wrote:
> When you say "git clone" without -s, you are saying "I do not want to use
> the repository I am cloning from as my alternate, because I do not know if
> it will stay stable.  I do not trust it."

Yes, I'm aware of the caveats of -s.  I was talking about what
happens when I *don't* use -s.

> Now, if you are cloning from a local filesystem, by default we will copy
> the objects/info/alternates from the source repository to the new one.  It

Crap, I didn't realize the alternates were only copied when you
clone from the local filesystem.  I wanted to use this when
cloning over ssh from site A to site B, to automatically add a
mirror at site B as an alternate.  Sounds like I have no choice
but to use --reference for that.

> I suspect you are trying to improve the other extreme end: trusting all
> the other repositories involved in the cloning process a lot more than the
> code currently does.

What I was suggesting did not involve trusting anything any more
than the current code does.  It just meant taking immediate
advantage of the trust that was already there.

Thanks for your input,
James


* Re: 'git clone' doesn't use alternates automatically?
  2009-02-01  1:32       ` James Pickens
@ 2009-02-01  1:38         ` Junio C Hamano
  0 siblings, 0 replies; 13+ messages in thread
From: Junio C Hamano @ 2009-02-01  1:38 UTC (permalink / raw)
  To: James Pickens; +Cc: Git ML, Jeff King

James Pickens <jepicken@gmail.com> writes:

> ...  Sounds like I have no choice
> but to use --reference for that.

As --reference was invented exactly for that use case, I think using it to
instruct where to borrow your object from is a very sensible thing to do.
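
A minimal sketch of that use case in a throwaway directory. A file://
URL stands in for the ssh transport between sites, and the names
"upstream", "mirror.git" and "clone" plus the demo identity are made up
for illustration.

```shell
#!/bin/sh
# Sketch: clone over a real transport while borrowing from a local mirror.
set -e
orig_dir=$PWD
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q upstream
(cd upstream && echo content >file && git add file && git commit -q -m one)

# The "mirror at site B": a local copy of the upstream repository.
git clone -q --mirror upstream mirror.git

# file:// goes through the normal fetch machinery (standing in for ssh);
# --reference tells clone which local repository to borrow objects from.
git clone -q --reference "$work/mirror.git" "file://$work/upstream" clone

alternates=$(cat clone/.git/objects/info/alternates)
echo "clone borrows from: $alternates"

cd "$orig_dir"
rm -rf "$work"
```

The same caveat as for "-s" applies: the clone depends on the mirror
keeping the objects it borrows.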


* Re: 'git clone' doesn't use alternates automatically?
  2009-02-01  1:19       ` Junio C Hamano
@ 2009-02-02 13:07         ` Jeff King
  2009-02-03  4:30           ` Junio C Hamano
  0 siblings, 1 reply; 13+ messages in thread
From: Jeff King @ 2009-02-02 13:07 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: James Pickens, Git ML

On Sat, Jan 31, 2009 at 05:19:31PM -0800, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> >   - without either, copy alternates from origin, but _don't_ use
> >     alternates while cloning
> 
> Are you talking about a local clone optimization that does hardlink from
> the source repository?

Sorry, I was wrong about what was happening. From reading James' posts
and not doing any experimenting or looking, I had the impression that
doing this:

  # plain repo
  mkdir repo1 &&
    (cd repo1 && git init &&
     echo content >file && git add . && git commit -m one)

  # repo with alternates, but extra content
  git clone -s repo1 repo2 &&
    (cd repo2 &&
     echo content >>file && git commit -a -m two)

  # clone of repo w/ alternates
  git clone repo2 repo3

would cause the final clone to set up the alternate to repo1, but still
pull in the objects. But that isn't the case, of course. Either:

  1. It is a local hardlink clone, in which case we just pull in the
     objects from repo2.

  2. It isn't, in which case we don't copy over the alternates.
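
The two cases can be compared side by side. A minimal sketch, assuming
a Git new enough to have "git -C"; "repo3-local" and "repo3-url" are
names made up here for the two variants of the final clone, and what
the local clone's alternates file contains may differ between Git
versions, so the script only prints it.

```shell
#!/bin/sh
# Sketch: clone a repository that uses alternates, once by path and
# once through a file:// URL, then inspect each clone's alternates.
set -e
orig_dir=$PWD
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q repo1
(cd repo1 && echo content >file && git add file && git commit -q -m one)
git clone -q -s repo1 repo2
(cd repo2 && echo content >>file && git commit -q -a -m two)

git clone -q repo2 repo3-local                # local short-circuit
git clone -q "file://$work/repo2" repo3-url   # regular transport

for r in repo3-local repo3-url; do
  if [ -s "$r/.git/objects/info/alternates" ]; then
    echo "$r: borrows from $(cat "$r/.git/objects/info/alternates")"
  else
    echo "$r: no alternates"
  fi
done

# Record the state of the file:// clone for later inspection.
if [ -s repo3-url/.git/objects/info/alternates ]; then
  url_alternates=yes
else
  url_alternates=no
fi
if git -C repo3-url fsck >/dev/null 2>&1; then url_fsck=ok; else url_fsck=bad; fi

cd "$orig_dir"
rm -rf "$work"
```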

> I am fairly certain that copying alternates from the source repository was
> not an intended behaviour but was a consequence of lazy coding of how we
> copy (or link) everything from it.  The original was literally the simple
> matter of:
> 
>     find objects ! -type d -print | cpio $cpio_quiet_flag -pumd$l "$GIT_DIR/"
> 
> whose intention was to copy objects/?? and objects/pack/. and it wasn't
> even part of the design consideration to worry about what would happen to
> the alternates the source repository might have in objects/info/.

Right, I think that is what is going on. And what I was suggesting in my
other email is that it is actively harmful to have this behavior,
because now repo3 depends on repo1, without the user having explicitly
asked for such a relationship (and they might not even be aware of
repo1).

I was tempted to suggest avoiding copying the alternates from repo2
to repo3. But you can't do that: repo2 is _missing_ objects that repo3
won't have. Without the alternates file pointing to repo1, repo3 is
corrupt. So simply avoiding copying the alternates file doesn't work;
one would have to actually pull the missing objects in from the
alternate before doing so.

But actually, I think there is even more breakage in hardlinking the
alternates file: alternates files can be relative paths. So if repo2
points to "../../../repo1/.git/objects" (which it doesn't in the example
above, as "clone -s" uses absolute paths -- but it is easy enough to
construct a broken case), then repo3 will gain that alternate pointer,
but may be in a totally different directory where that relative path is
broken. And then repo3 is corrupt. So the alternates must be copied and
any relative paths munged for it to work reliably.
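
The relative-path breakage needs no Git at all to illustrate, since a
relative alternate is resolved against the object directory that
contains the alternates file. A minimal sketch with bare directory
trees; the layout under a/ and b/ and the helper check_alternates are
hypothetical.

```shell
#!/bin/sh
# Sketch: the same relative alternate resolves in repo2 but dangles
# once copied verbatim into a repo3 that lives elsewhere.
set -e
work=$(mktemp -d)
mkdir -p "$work/a/repo1/.git/objects" \
         "$work/a/repo2/.git/objects/info" \
         "$work/b/deep/repo3/.git/objects/info"

echo ../../../repo1/.git/objects >"$work/a/repo2/.git/objects/info/alternates"
# A verbatim copy, as a link or copy of the whole objects/ tree would produce.
cp "$work/a/repo2/.git/objects/info/alternates" \
   "$work/b/deep/repo3/.git/objects/info/alternates"

# Hypothetical helper: report whether each alternate of an object
# directory points at an existing directory.
check_alternates() {
  objdir=$1
  while IFS= read -r alt; do
    case $alt in
      /*) target=$alt ;;
      *)  target=$objdir/$alt ;;
    esac
    if [ -d "$target" ]; then echo ok; else echo dangling; fi
  done <"$objdir/info/alternates"
}

repo2_state=$(check_alternates "$work/a/repo2/.git/objects")
repo3_state=$(check_alternates "$work/b/deep/repo3/.git/objects")
echo "repo2: $repo2_state, repo3: $repo3_state"

rm -rf "$work"
```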

The hardlink code operates by default because it was thought to be a
safe optimization that couldn't bite people. But it interacts badly with
the concept of alternates. So I think a sane fix would be to disable
hardlinking if the parent repo is using alternates at all. Then a
vanilla "git clone repo2 repo3" will do the safe but more costly
behavior of actually copying the objects. If the user wants to accept
the risks of alternates, then he can give "-s" explicitly, and git will
track the alternates recursively through repo2 to repo1 at runtime.
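
Until something like that exists in clone itself, the check can be
approximated from the outside. This is only a sketch of the idea, not
how git behaves: the helpers uses_alternates and clone_flags are
hypothetical, and mapping "actually copying the objects" onto the
existing --no-local option is an assumption of this example.

```shell
#!/bin/sh
# Sketch: pick clone flags depending on whether the source git
# directory has a non-empty objects/info/alternates.
set -e

# Hypothetical helper: $1 is a git directory (e.g. repo2/.git).
uses_alternates() {
  [ -s "$1/objects/info/alternates" ]
}

# Hypothetical helper: print the flag a wrapper would pass to "git clone".
clone_flags() {
  if uses_alternates "$1"; then
    echo --no-local    # go through the regular transport, no hardlinks
  else
    echo --local       # keep the hardlink short-circuit
  fi
}

# Demonstrate on two fake git directories.
work=$(mktemp -d)
mkdir -p "$work/plain/.git/objects/info" "$work/borrowing/.git/objects/info"
echo /elsewhere/.git/objects >"$work/borrowing/.git/objects/info/alternates"

plain_flags=$(clone_flags "$work/plain/.git")
borrowing_flags=$(clone_flags "$work/borrowing/.git")
echo "plain: $plain_flags, borrowing: $borrowing_flags"

rm -rf "$work"
```

A wrapper would then run something like
git clone $(clone_flags "$src/.git") "$src" "$dst".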

-Peff


* Re: 'git clone' doesn't use alternates automatically?
  2009-02-02 13:07         ` Jeff King
@ 2009-02-03  4:30           ` Junio C Hamano
  2009-02-03  6:06             ` Jeff King
  0 siblings, 1 reply; 13+ messages in thread
From: Junio C Hamano @ 2009-02-03  4:30 UTC (permalink / raw)
  To: Jeff King; +Cc: James Pickens, Git ML

Jeff King <peff@peff.net> writes:

> The hardlink code operates by default because it was thought to be a
> safe optimization that couldn't bite people. But it interacts badly with
> the concept of alternates.

Yes, you are right.

To be fair, I think it was proposed/implemented by somebody who almost
never uses alternates himself, and certainly never relative alternates.
The intention of hardlinking was to save tons of disk space while still
being independent from the original repository.

Back when e95ab1e ([PATCH] Short-circuit git-clone-pack while cloning
locally (take 2)., 2005-07-06) was done, the packfile implementation was
still only a week old, and hardlinking made a lot of sense from space
saving's point of view.  These days, if you make a local hardlinked clone,
work a little there and then repack it, most of the space saving will be
gone; there isn't much point in the hardlink optimization anymore from
that angle, even though it still is a good compromise between the clone
speed and safety, especially when no alternates are involved.

I think a possible fix would be not to copy the alternates file literally, but
to install an alternates file that borrows directly from the same repositories
the clone-source repository borrows from, taking relative paths into
account.  Another would be to look at the alternates and hardlink the
objects and packs while cloning; if the repositories involved reside
across filesystem boundaries, we need to fall back to copying.
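
The first option can be sketched outside of Git as a small shell
function. This is an illustration of the idea only, not code from
clone: the helper absolutize_alternates and the directory layout are
hypothetical, and entries that do not resolve are simply skipped.

```shell
#!/bin/sh
# Sketch: write the destination's alternates file with every entry of
# the source's file turned into an absolute path.
set -e

# Hypothetical helper: $1 = source object directory, $2 = destination
# object directory.
absolutize_alternates() {
  src=$1 dst=$2
  mkdir -p "$dst/info"
  while IFS= read -r alt; do
    case $alt in
      ''|'#'*) continue ;;   # blank lines and comments
      /*) abs=$alt ;;        # already absolute
      *)  abs=$(cd "$src/$alt" 2>/dev/null && pwd) || continue ;;
    esac
    printf '%s\n' "$abs"
  done <"$src/info/alternates" >"$dst/info/alternates"
}

work=$(mktemp -d)
mkdir -p "$work/a/repo1/.git/objects" \
         "$work/a/repo2/.git/objects/info" \
         "$work/b/deep/repo3/.git/objects"
echo ../../../repo1/.git/objects >"$work/a/repo2/.git/objects/info/alternates"

absolutize_alternates "$work/a/repo2/.git/objects" "$work/b/deep/repo3/.git/objects"

result=$(cat "$work/b/deep/repo3/.git/objects/info/alternates")
echo "repo3 alternates: $result"
if [ -d "$result" ]; then resolves=yes; else resolves=no; fi

rm -rf "$work"
```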


* Re: 'git clone' doesn't use alternates automatically?
  2009-02-03  4:30           ` Junio C Hamano
@ 2009-02-03  6:06             ` Jeff King
  0 siblings, 0 replies; 13+ messages in thread
From: Jeff King @ 2009-02-03  6:06 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: James Pickens, Git ML

On Mon, Feb 02, 2009 at 08:30:36PM -0800, Junio C Hamano wrote:

> To be fair, I think it was proposed/implemented by somebody who almost
> never uses alternates himself, and certainly never relative alternates.
> The intention of hardlinking was to save tons of disk space while still
> being independent from the original repository.
> 
> Back when e95ab1e ([PATCH] Short-circuit git-clone-pack while cloning
> locally (take 2)., 2005-07-06) was done, the packfile implementation was

Well, in your defense, relative alternates didn't come about until two
months later, in ccfd3e9. So you can blame the author of that patch for
screwing up your existing work. :)

> still only a week old, and hardlinking made a lot of sense from space
> saving's point of view.  These days, if you make a local hardlinked clone,
> work a little there and then repack it, most of the space saving will be
> gone; there isn't much point in the hardlink optimization anymore from
> that angle, even though it still is a good compromise between the clone
> speed and safety, especially when no alternates are involved.

True. But I still think the hardlinks are nice for one-off repositories.
Every once in a while I want to start a new topic or experiment while my
repository is a mess; it's nice to just "git clone git foo", hack around
in the work directory, and blow it away. And the hardlinks make that
first step a _lot_ faster.

But I also don't mind having to add a command-line option to get the
speed. And for my use case, there really isn't a benefit to hardlinks
over alternates.

> I think a possible fix would be not to copy the alternates file literally, but
> to install an alternates file that borrows directly from the same repositories
> the clone-source repository borrows from, taking relative paths into
> account.  Another would be to look at the alternates and hardlink the
> objects and packs while cloning; if the repositories involved reside
> across filesystem boundaries, we need to fall back to copying.

Yes, either of those would work. But I wonder if it is really worth the
complexity. When I suggested just ditching hardlinks if the remote uses
alternates, my thought was that most people won't really care. Either
they use alternates, in which case they should be providing "-s" and
not doing hardlinks, or they don't, in which case things will happen as
usual.

But reading your response, I wonder if it is worth keeping the hardlink
optimization around at all; getting rid of it would simplify the code
and the explanation of why "git clone foo" is different from "git clone
file://$PWD/foo".  If people want a fast, dependent clone, they can use
"-s". I guess hardlinks are also useful for a fast "git clone foo bar;
rm -rf foo". But I'm not sure how common that is.

-Peff


end of thread, other threads:[~2009-02-03  6:07 UTC | newest]

Thread overview: 13+ messages
2009-01-30 22:12 'git clone' doesn't use alternates automatically? James Pickens
2009-01-31  7:12 ` Jeff King
2009-01-31 20:08   ` James Pickens
2009-01-31 21:08     ` Jakub Narebski
2009-01-31 21:43       ` James Pickens
2009-01-31 21:55     ` Jeff King
2009-02-01  1:19       ` Junio C Hamano
2009-02-02 13:07         ` Jeff King
2009-02-03  4:30           ` Junio C Hamano
2009-02-03  6:06             ` Jeff King
2009-02-01  0:55     ` Junio C Hamano
2009-02-01  1:32       ` James Pickens
2009-02-01  1:38         ` Junio C Hamano
