git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* How to resume broke clone ?
@ 2013-11-28  3:13 zhifeng hu
  2013-11-28  7:39 ` Trần Ngọc Quân
  0 siblings, 1 reply; 36+ messages in thread
From: zhifeng hu @ 2013-11-28  3:13 UTC (permalink / raw)
  To: git

Hello all:
Today i want to clone the Linux Kernel git repository.
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

I am in china. our bandwidth is very limitation. Less than 50Kb/s.

The clone progress is very slow, and broken times and time.
I am very unhappy. 
Because i could not easily to clone kernel.

I had do some research about resume clone , but no good plan how to resolve this problem .


Would it be possible add resume transfer clone repository after the transfer broken?

such as bittorrent  download. or what ever.

zhifeng hu 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: How to resume broke clone ?
  2013-11-28  3:13 How to resume broke clone ? zhifeng hu
@ 2013-11-28  7:39 ` Trần Ngọc Quân
  2013-11-28  7:41   ` zhifeng hu
  0 siblings, 1 reply; 36+ messages in thread
From: Trần Ngọc Quân @ 2013-11-28  7:39 UTC (permalink / raw)
  To: zhifeng hu; +Cc: git

On 28/11/2013 10:13, zhifeng hu wrote:
> Hello all:
> Today i want to clone the Linux Kernel git repository.
> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
>
> I am in china. our bandwidth is very limitation. Less than 50Kb/s.
This repo is really too big.
You may consider using --depth option if you don't want full history, or
clone from somewhere have better bandwidth
$ git clone --depth=1
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
you may chose other mirror (github.com) for example
see git-clone(1)

-- 
Trần Ngọc Quân.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: How to resume broke clone ?
  2013-11-28  7:39 ` Trần Ngọc Quân
@ 2013-11-28  7:41   ` zhifeng hu
  2013-11-28  8:14     ` Duy Nguyen
  0 siblings, 1 reply; 36+ messages in thread
From: zhifeng hu @ 2013-11-28  7:41 UTC (permalink / raw)
  To: Trần Ngọc Quân; +Cc: git

Thanks for reply, But I am developer, I want to clone full repository, I need to view code since very early.

zhifeng hu 



On Nov 28, 2013, at 3:39 PM, Trần Ngọc Quân <vnwildman@gmail.com> wrote:

> On 28/11/2013 10:13, zhifeng hu wrote:
>> Hello all:
>> Today i want to clone the Linux Kernel git repository.
>> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
>> 
>> I am in china. our bandwidth is very limitation. Less than 50Kb/s.
> This repo is really too big.
> You may consider using --depth option if you don't want full history, or
> clone from somewhere have better bandwidth
> $ git clone --depth=1
> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> you may chose other mirror (github.com) for example
> see git-clone(1)
> 
> -- 
> Trần Ngọc Quân.
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: How to resume broke clone ?
  2013-11-28  7:41   ` zhifeng hu
@ 2013-11-28  8:14     ` Duy Nguyen
  2013-11-28  8:35       ` Karsten Blees
  2013-11-28  9:20       ` How to resume broke clone ? Tay Ray Chuan
  0 siblings, 2 replies; 36+ messages in thread
From: Duy Nguyen @ 2013-11-28  8:14 UTC (permalink / raw)
  To: zhifeng hu; +Cc: Trần Ngọc Quân, Git Mailing List

On Thu, Nov 28, 2013 at 2:41 PM, zhifeng hu <zf@ancientrocklab.com> wrote:
> Thanks for reply, But I am developer, I want to clone full repository, I need to view code since very early.

if it works with --depth =1, you can incrementally run "fetch
--depth=N" with N larger and larger.

But it may be easier to ask kernel.org admin, or any dev with a public
web server, to provide you a git bundle you can download via http.
Then you can fetch on top.
-- 
Duy

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: How to resume broke clone ?
@ 2013-11-28  8:32 Max Kirillov
  2013-11-28  9:12 ` Jeff King
  0 siblings, 1 reply; 36+ messages in thread
From: Max Kirillov @ 2013-11-28  8:32 UTC (permalink / raw)
  To: zhifeng hu; +Cc: git

> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> 
> I am in china. our bandwidth is very limitation. Less than 50Kb/s.

You could manually download big packed bundled from some http remote.
For example http://repo.or.cz/r/linux.git

* create a new repository, add the remote there.

* download files with wget or whatever:
 http://repo.or.cz/r/linux.git/objects/info/packs
also files mentioned in the file. Currently they are:
 http://repo.or.cz/r/linux.git/objects/pack/pack-3807b40fc5fd7556990ecbfe28a54af68964a5ce.idx
 http://repo.or.cz/r/linux.git/objects/pack/pack-3807b40fc5fd7556990ecbfe28a54af68964a5ce.pack

and put them to the corresponding places.

* then run fetch of pull. I believe it should run fast then. Though I
have not test it.

Br,
-- 
Max

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: How to resume broke clone ?
  2013-11-28  8:14     ` Duy Nguyen
@ 2013-11-28  8:35       ` Karsten Blees
  2013-11-28  8:50         ` Duy Nguyen
  2013-11-28  9:20       ` How to resume broke clone ? Tay Ray Chuan
  1 sibling, 1 reply; 36+ messages in thread
From: Karsten Blees @ 2013-11-28  8:35 UTC (permalink / raw)
  To: Duy Nguyen, zhifeng hu; +Cc: Trần Ngọc Quân, Git Mailing List

Am 28.11.2013 09:14, schrieb Duy Nguyen:
> On Thu, Nov 28, 2013 at 2:41 PM, zhifeng hu <zf@ancientrocklab.com> wrote:
>> Thanks for reply, But I am developer, I want to clone full repository, I need to view code since very early.
> 
> if it works with --depth =1, you can incrementally run "fetch
> --depth=N" with N larger and larger.
> 
> But it may be easier to ask kernel.org admin, or any dev with a public
> web server, to provide you a git bundle you can download via http.
> Then you can fetch on top.
> 

Or simply download the individual files (via ftp/http) and clone locally:

> wget -r ftp://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
> git clone git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> cd linux
> git remote set-url origin git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: How to resume broke clone ?
  2013-11-28  8:35       ` Karsten Blees
@ 2013-11-28  8:50         ` Duy Nguyen
  2013-11-28  8:55           ` zhifeng hu
  0 siblings, 1 reply; 36+ messages in thread
From: Duy Nguyen @ 2013-11-28  8:50 UTC (permalink / raw)
  To: Karsten Blees
  Cc: zhifeng hu, Trần Ngọc Quân, Git Mailing List

On Thu, Nov 28, 2013 at 3:35 PM, Karsten Blees <karsten.blees@gmail.com> wrote:
> Or simply download the individual files (via ftp/http) and clone locally:
>
>> wget -r ftp://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
>> git clone git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
>> cd linux
>> git remote set-url origin git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Yeah I didn't realize it is published over dumb http too. You may need
to be careful with this though because it's not atomic and you may get
refs that point nowhere because you're already done with "pack"
directory when you come to fetcing "refs" and did not see new packs...
If dumb commit walker supports resume (I don't know) then it'll be
safer to do

git clone http://git.kernel.org/....

If it does not support resume, I don't think it's hard to do.
-- 
Duy

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: How to resume broke clone ?
  2013-11-28  8:50         ` Duy Nguyen
@ 2013-11-28  8:55           ` zhifeng hu
  2013-11-28  9:09             ` Duy Nguyen
  0 siblings, 1 reply; 36+ messages in thread
From: zhifeng hu @ 2013-11-28  8:55 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Karsten Blees, Trần Ngọc Quân, Git Mailing List

The repository growing fast, things get harder . Now the size reach several GB, it may possible be TB, YB.
When then, How do we handle this?
If the transfer broken, and it can not be resume transfer, waste time and waste bandwidth.

Git should be better support resume transfer.
It now seems not doing better it’s job.
Share code, manage code, transfer code, what would it be a VCS we imagine it ?
 
zhifeng hu 



On Nov 28, 2013, at 4:50 PM, Duy Nguyen <pclouds@gmail.com> wrote:

> On Thu, Nov 28, 2013 at 3:35 PM, Karsten Blees <karsten.blees@gmail.com> wrote:
>> Or simply download the individual files (via ftp/http) and clone locally:
>> 
>>> wget -r ftp://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
>>> git clone git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
>>> cd linux
>>> git remote set-url origin git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> 
> Yeah I didn't realize it is published over dumb http too. You may need
> to be careful with this though because it's not atomic and you may get
> refs that point nowhere because you're already done with "pack"
> directory when you come to fetcing "refs" and did not see new packs...
> If dumb commit walker supports resume (I don't know) then it'll be
> safer to do
> 
> git clone http://git.kernel.org/....
> 
> If it does not support resume, I don't think it's hard to do.
> -- 
> Duy

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: How to resume broke clone ?
  2013-11-28  8:55           ` zhifeng hu
@ 2013-11-28  9:09             ` Duy Nguyen
  2013-11-28  9:29               ` Jeff King
  0 siblings, 1 reply; 36+ messages in thread
From: Duy Nguyen @ 2013-11-28  9:09 UTC (permalink / raw)
  To: zhifeng hu
  Cc: Karsten Blees, Trần Ngọc Quân, Git Mailing List

On Thu, Nov 28, 2013 at 3:55 PM, zhifeng hu <zf@ancientrocklab.com> wrote:
> The repository growing fast, things get harder . Now the size reach several GB, it may possible be TB, YB.
> When then, How do we handle this?
> If the transfer broken, and it can not be resume transfer, waste time and waste bandwidth.
>
> Git should be better support resume transfer.
> It now seems not doing better it’s job.
> Share code, manage code, transfer code, what would it be a VCS we imagine it ?

You're welcome to step up and do it. On top of my head  there are a few options:

 - better integration with git bundles, provide a way to seamlessly
create/fetch/resume the bundles with "git clone" and "git fetch"
 - shallow/narrow clone. the idea is get a small part of the repo, one
depth, a few paths, then get more and more over many iterations so if
we fail one iteration we don't lose everything
 - stablize pack order so we can resume downloading a pack
 - remote alternates, the repo will ask for more and more objects as
you need them (so goodbye to distributed model)
-- 
Duy

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: How to resume broke clone ?
  2013-11-28  8:32 Max Kirillov
@ 2013-11-28  9:12 ` Jeff King
  0 siblings, 0 replies; 36+ messages in thread
From: Jeff King @ 2013-11-28  9:12 UTC (permalink / raw)
  To: Max Kirillov; +Cc: zhifeng hu, git

On Thu, Nov 28, 2013 at 01:32:36AM -0700, Max Kirillov wrote:

> > git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> > 
> > I am in china. our bandwidth is very limitation. Less than 50Kb/s.
> 
> You could manually download big packed bundled from some http remote.
> For example http://repo.or.cz/r/linux.git
> 
> * create a new repository, add the remote there.
> 
> * download files with wget or whatever:
>  http://repo.or.cz/r/linux.git/objects/info/packs
> also files mentioned in the file. Currently they are:
>  http://repo.or.cz/r/linux.git/objects/pack/pack-3807b40fc5fd7556990ecbfe28a54af68964a5ce.idx
>  http://repo.or.cz/r/linux.git/objects/pack/pack-3807b40fc5fd7556990ecbfe28a54af68964a5ce.pack
> 
> and put them to the corresponding places.
> 
> * then run fetch of pull. I believe it should run fast then. Though I
> have not test it.

You would also need to set up local refs so that git knows you have
those objects. The simplest way to do it is to just fetch by dumb-http,
which can resume the pack transfer. I think that clone is also very
eager to clean up the partial transfer if the initial fetch fails. So
you would want to init manually:

  git init linux
  cd linux
  git remote add origin http://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
  GIT_SMART_HTTP=0 git fetch -vv

and then you can follow that up with regular smart fetches, which should
be much smaller.

It would be even simpler if you could fetch the whole thing as a bundle,
rather than over dumb-http. But that requires the server side (or some
third party who has fast access to the repo) cooperating and making a
bundle available.

-Peff

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: How to resume broke clone ?
  2013-11-28  8:14     ` Duy Nguyen
  2013-11-28  8:35       ` Karsten Blees
@ 2013-11-28  9:20       ` Tay Ray Chuan
  2013-11-28  9:29         ` zhifeng hu
  1 sibling, 1 reply; 36+ messages in thread
From: Tay Ray Chuan @ 2013-11-28  9:20 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: zhifeng hu, Trần Ngọc Quân, Git Mailing List

On Thu, Nov 28, 2013 at 4:14 PM, Duy Nguyen <pclouds@gmail.com> wrote:
> On Thu, Nov 28, 2013 at 2:41 PM, zhifeng hu <zf@ancientrocklab.com> wrote:
>> Thanks for reply, But I am developer, I want to clone full repository, I need to view code since very early.
>
> if it works with --depth =1, you can incrementally run "fetch
> --depth=N" with N larger and larger.

I second Duy Nguyen's and Trần Ngọc Quân's suggestion to 1) initially
create a "shallow" clone then 2) incrementally deepen your clone.

Zhifeng, in the course of your research into resumable cloning, you
might have learnt that while it's a really valuable feature, it's also
a pretty hard problem at the same time. So it's not because git
doesn't want to have this feature.

-- 
Cheers,
Ray Chuan

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: How to resume broke clone ?
  2013-11-28  9:09             ` Duy Nguyen
@ 2013-11-28  9:29               ` Jeff King
  2013-11-28 10:17                 ` Duy Nguyen
  2013-11-28 19:15                 ` Shawn Pearce
  0 siblings, 2 replies; 36+ messages in thread
From: Jeff King @ 2013-11-28  9:29 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: zhifeng hu, Karsten Blees, Trần Ngọc Quân,
	Git Mailing List

On Thu, Nov 28, 2013 at 04:09:18PM +0700, Duy Nguyen wrote:

> > Git should be better support resume transfer.
> > It now seems not doing better it’s job.
> > Share code, manage code, transfer code, what would it be a VCS we imagine it ?
> 
> You're welcome to step up and do it. On top of my head  there are a few options:
> 
>  - better integration with git bundles, provide a way to seamlessly
> create/fetch/resume the bundles with "git clone" and "git fetch"

I posted patches for this last year. One of the things that I got hung
up on was that I spooled the bundle to disk, and then cloned from it.
Which meant that you needed twice the disk space for a moment. I wanted
to teach index-pack to "--fix-thin" a pack that was already on disk, so
that we could spool to disk, and then finalize it without making another
copy.

One of the downsides of this approach is that it requires the repo
provider (or somebody else) to provide the bundle. I think that is
something that a big site like GitHub would do (and probably push the
bundles out to a CDN, too, to make getting them faster). But it's not a
universal solution.

>  - stablize pack order so we can resume downloading a pack

I think stabilizing in all cases (e.g., including ones where the content
has changed) is hard, but I wonder if it would be enough to handle the
easy cases, where nothing has changed. If the server does not use
multiple threads for delta computation, it should generate the same pack
from the same on-disk deterministically. We just need a way for the
client to indicate that it has the same partial pack.

I'm thinking that the server would report some opaque hash representing
the current pack. The client would record that, along with the number of
pack bytes it received. If the transfer is interrupted, the client comes
back with the hash/bytes pair. The server starts to generate the pack,
checks whether the hash matches, and if so, says "here is the same pack,
resuming at byte X".

What would need to go into such a hash? It would need to represent the
exact bytes that will go into the pack, but without actually generating
those bytes. Perhaps a sha1 over the sequence of <object sha1, type,
base (if applicable), length> for each object would be enough. We should
know that after calling compute_write_order. If the client has a match,
we should be able to skip ahead to the correct byte.

>  - remote alternates, the repo will ask for more and more objects as
> you need them (so goodbye to distributed model)

This is also something I've been playing with, but just for very large
objects (so to support something like git-media, but below the object
graph layer). I don't think it would apply here, as the kernel has a lot
of small objects, and getting them in the tight delta'd pack format
increases efficiency a lot.

-Peff

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: How to resume broke clone ?
  2013-11-28  9:20       ` How to resume broke clone ? Tay Ray Chuan
@ 2013-11-28  9:29         ` zhifeng hu
  2013-11-28 19:35           ` Shawn Pearce
  2013-11-28 21:54           ` Jakub Narebski
  0 siblings, 2 replies; 36+ messages in thread
From: zhifeng hu @ 2013-11-28  9:29 UTC (permalink / raw)
  To: Tay Ray Chuan
  Cc: Duy Nguyen, Trần Ngọc Quân, Git Mailing List

Once using git clone —depth or git fetch —depth,
While you want to move backward.
you may face problem

 git fetch --depth=105
error: Could not read 483bbf41ca5beb7e38b3b01f21149c56a1154b7a
error: Could not read aacb82de3ff8ae7b0a9e4cfec16c1807b6c315ef
error: Could not read 5a1758710d06ce9ddef754a8ee79408277032d8b
error: Could not read a7d5629fe0580bd3e154206388371f5b8fc832db
error: Could not read 073291c476b4edb4d10bbada1e64b471ba153b6b


zhifeng hu 



On Nov 28, 2013, at 5:20 PM, Tay Ray Chuan <rctay89@gmail.com> wrote:

> On Thu, Nov 28, 2013 at 4:14 PM, Duy Nguyen <pclouds@gmail.com> wrote:
>> On Thu, Nov 28, 2013 at 2:41 PM, zhifeng hu <zf@ancientrocklab.com> wrote:
>>> Thanks for reply, But I am developer, I want to clone full repository, I need to view code since very early.
>> 
>> if it works with --depth =1, you can incrementally run "fetch
>> --depth=N" with N larger and larger.
> 
> I second Duy Nguyen's and Trần Ngọc Quân's suggestion to 1) initially
> create a "shallow" clone then 2) incrementally deepen your clone.
> 
> Zhifeng, in the course of your research into resumable cloning, you
> might have learnt that while it's a really valuable feature, it's also
> a pretty hard problem at the same time. So it's not because git
> doesn't want to have this feature.
> 
> -- 
> Cheers,
> Ray Chuan
> --
> To unsubscribe from this list: send the line "unsubscribe git" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: How to resume broke clone ?
  2013-11-28  9:29               ` Jeff King
@ 2013-11-28 10:17                 ` Duy Nguyen
  2013-11-28 19:15                 ` Shawn Pearce
  1 sibling, 0 replies; 36+ messages in thread
From: Duy Nguyen @ 2013-11-28 10:17 UTC (permalink / raw)
  To: Jeff King
  Cc: zhifeng hu, Karsten Blees, Trần Ngọc Quân,
	Git Mailing List

On Thu, Nov 28, 2013 at 4:29 PM, Jeff King <peff@peff.net> wrote:
>>  - stablize pack order so we can resume downloading a pack
>
> I think stabilizing in all cases (e.g., including ones where the content
> has changed) is hard, but I wonder if it would be enough to handle the
> easy cases, where nothing has changed. If the server does not use
> multiple threads for delta computation, it should generate the same pack
> from the same on-disk deterministically. We just need a way for the
> client to indicate that it has the same partial pack.
>
> I'm thinking that the server would report some opaque hash representing
> the current pack. The client would record that, along with the number of
> pack bytes it received. If the transfer is interrupted, the client comes
> back with the hash/bytes pair. The server starts to generate the pack,
> checks whether the hash matches, and if so, says "here is the same pack,
> resuming at byte X".
>
> What would need to go into such a hash? It would need to represent the
> exact bytes that will go into the pack, but without actually generating
> those bytes. Perhaps a sha1 over the sequence of <object sha1, type,
> base (if applicable), length> for each object would be enough. We should
> know that after calling compute_write_order. If the client has a match,
> we should be able to skip ahead to the correct byte.

Exactly. The hash would include the list of sha-1 and object source,
the git version (so changes in code or default values are covered),
the list of config keys/values that may impact pack generation
algorithm (like window size..), .git/shallow, refs/replace,
.git/graft, all or most of command line options. If we audit the code
carefully I think we can cover all input that influences pack
generation. From then on it's just a matter of protocol extension. It
also opens an opportunity for optional server side caching, just save
the pack and associate it with the hash. Next time the client asks to
resume, the server has everything ready.
-- 
Duy

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: How to resume broke clone ?
  2013-11-28  9:29               ` Jeff King
  2013-11-28 10:17                 ` Duy Nguyen
@ 2013-11-28 19:15                 ` Shawn Pearce
  2013-12-04 20:08                   ` Jeff King
  1 sibling, 1 reply; 36+ messages in thread
From: Shawn Pearce @ 2013-11-28 19:15 UTC (permalink / raw)
  To: Jeff King
  Cc: Duy Nguyen, zhifeng hu, Karsten Blees,
	Trần Ngọc Quân, Git Mailing List

On Thu, Nov 28, 2013 at 1:29 AM, Jeff King <peff@peff.net> wrote:
> On Thu, Nov 28, 2013 at 04:09:18PM +0700, Duy Nguyen wrote:
>
>> > Git should be better support resume transfer.
>> > It now seems not doing better it’s job.
>> > Share code, manage code, transfer code, what would it be a VCS we imagine it ?
>>
>> You're welcome to step up and do it. On top of my head  there are a few options:
>>
>>  - better integration with git bundles, provide a way to seamlessly
>> create/fetch/resume the bundles with "git clone" and "git fetch"

We have been thinking about formalizing the /clone.bundle hack used by
repo on Android. If the server has the bundle, add a capability in the
refs advertisement saying its available, and the clone client can
first fetch $URL/clone.bundle.

For most Git repositories the bundle can be constructed by saving the
bundle reference header into a file, e.g.
$GIT_DIR/objects/pack/pack-$NAME.bh at the same time the pack is
created. The bundle can be served by combining the .bh and .pack
streams onto the network. It is very little additional disk overhead
for the origin server, but allows resumable clone, provided the server
has not done a GC.

> I posted patches for this last year. One of the things that I got hung
> up on was that I spooled the bundle to disk, and then cloned from it.
> Which meant that you needed twice the disk space for a moment.

I don't think this is a huge concern. In many cases the checked out
copy of the repository approaches a sizable fraction of the .pack
itself. If you don't have 2x .pack disk available at clone time you
may be in trouble anyway as you try to work with the repository post
clone.

> I wanted
> to teach index-pack to "--fix-thin" a pack that was already on disk, so
> that we could spool to disk, and then finalize it without making another
> copy.

Don't you need to separate the bundle header from the pack data before
you do this? If the bundle is only used at clone time there is no
--fix-thin step.

> One of the downsides of this approach is that it requires the repo
> provider (or somebody else) to provide the bundle. I think that is
> something that a big site like GitHub would do (and probably push the
> bundles out to a CDN, too, to make getting them faster). But it's not a
> universal solution.

See above, I think you can reasonably do the /clone.bundle
automatically on any HTTP server. Big sites might choose to have
/clone.bundle do a redirect into a caching CDN that fills itself by
going to the application servers to obtain the current data. This is
what we do for Android.

>>  - stablize pack order so we can resume downloading a pack
>
> I think stabilizing in all cases (e.g., including ones where the content
> has changed) is hard, but I wonder if it would be enough to handle the
> easy cases, where nothing has changed. If the server does not use
> multiple threads for delta computation, it should generate the same pack
> from the same on-disk deterministically. We just need a way for the
> client to indicate that it has the same partial pack.
>
> I'm thinking that the server would report some opaque hash representing
> the current pack. The client would record that, along with the number of
> pack bytes it received. If the transfer is interrupted, the client comes
> back with the hash/bytes pair. The server starts to generate the pack,
> checks whether the hash matches, and if so, says "here is the same pack,
> resuming at byte X".

An important part of this is the want set must be identical to the
prior request. It is entirely possible the branch tips have advanced
since the prior packing attempt started.

> What would need to go into such a hash? It would need to represent the
> exact bytes that will go into the pack, but without actually generating
> those bytes. Perhaps a sha1 over the sequence of <object sha1, type,
> base (if applicable), length> for each object would be enough. We should
> know that after calling compute_write_order. If the client has a match,
> we should be able to skip ahead to the correct byte.

I don't think Length is sufficient.

The repository could have recompressed an object with the same length
but different libz encoding. I wonder if loose object recompression is
reliable enough about libz encoding to resume in the middle of an
object? Is it just based on libz version?

You may need to do include information about the source of the object,
e.g. the trailing 20 byte hash in the source pack file.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: How to resume broke clone ?
  2013-11-28  9:29         ` zhifeng hu
@ 2013-11-28 19:35           ` Shawn Pearce
  2013-11-28 21:54           ` Jakub Narebski
  1 sibling, 0 replies; 36+ messages in thread
From: Shawn Pearce @ 2013-11-28 19:35 UTC (permalink / raw)
  To: zhifeng hu
  Cc: Tay Ray Chuan, Duy Nguyen, Trần Ngọc Quân,
	Git Mailing List

On Thu, Nov 28, 2013 at 1:29 AM, zhifeng hu <zf@ancientrocklab.com> wrote:
> Once using git clone —depth or git fetch —depth,
> While you want to move backward.
> you may face problem
>
>  git fetch --depth=105
> error: Could not read 483bbf41ca5beb7e38b3b01f21149c56a1154b7a
> error: Could not read aacb82de3ff8ae7b0a9e4cfec16c1807b6c315ef
> error: Could not read 5a1758710d06ce9ddef754a8ee79408277032d8b
> error: Could not read a7d5629fe0580bd3e154206388371f5b8fc832db
> error: Could not read 073291c476b4edb4d10bbada1e64b471ba153b6b

We now have a resumable bundle available through our kernel.org
mirror. The bundle is 658M.

  mkdir linux
  cd linux
  git init

  wget https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux/clone.bundle

  sha1sum clone.bundle
  96831de0b81713333e5ebba94edb31e37e70e1df  clone.bundle

  git fetch -u ./clone.bundle refs/*:refs/*
  git reset --hard

You can also use our mirror as an upstream, as we have servers in Asia
that lag no more than 5 or 6 minutes behind kernel.org:

  git remote add origin
https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux/

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: How to resume broke clone ?
  2013-11-28  9:29         ` zhifeng hu
  2013-11-28 19:35           ` Shawn Pearce
@ 2013-11-28 21:54           ` Jakub Narebski
  1 sibling, 0 replies; 36+ messages in thread
From: Jakub Narebski @ 2013-11-28 21:54 UTC (permalink / raw)
  To: git

zhifeng hu <zf <at> ancientrocklab.com> writes:

> 
> Once using git clone —depth or git fetch —depth,
> While you want to move backward.
> you may face problem
> 
>  git fetch --depth=105
> error: Could not read 483bbf41ca5beb7e38b3b01f21149c56a1154b7a
> error: Could not read aacb82de3ff8ae7b0a9e4cfec16c1807b6c315ef
> error: Could not read 5a1758710d06ce9ddef754a8ee79408277032d8b
> error: Could not read a7d5629fe0580bd3e154206388371f5b8fc832db
> error: Could not read 073291c476b4edb4d10bbada1e64b471ba153b6b

BTW. there was (is?) a bundler service at http://bundler.caurea.org/
but I don't know if it can create Linux-size bundle.

-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: How to resume broke clone ?
  2013-11-28 19:15                 ` Shawn Pearce
@ 2013-12-04 20:08                   ` Jeff King
  2013-12-05  6:50                     ` Shawn Pearce
  0 siblings, 1 reply; 36+ messages in thread
From: Jeff King @ 2013-12-04 20:08 UTC (permalink / raw)
  To: Shawn Pearce
  Cc: Duy Nguyen, zhifeng hu, Karsten Blees,
	Trần Ngọc Quân, Git Mailing List

On Thu, Nov 28, 2013 at 11:15:27AM -0800, Shawn Pearce wrote:

> >>  - better integration with git bundles, provide a way to seamlessly
> >> create/fetch/resume the bundles with "git clone" and "git fetch"
> 
> We have been thinking about formalizing the /clone.bundle hack used by
> repo on Android. If the server has the bundle, add a capability in the
> refs advertisement saying its available, and the clone client can
> first fetch $URL/clone.bundle.

Yes, that was going to be my next step after getting the bundle fetch
support in. If we are going to do this, though, I'd really love for it
to not be "hey, fetch .../clone.bundle from me", but a full-fledged
"here are full URLs of my mirrors".

Then you can redirect a non-http cloner to http to grab the bundle. Or
redirect them to a CDN. Or even somebody else's server entirely (e.g.,
"go fetch from Linus first, my piddly server cannot feed you the whole
kernel"). Some of the redirects you can do by issuing an http redirect
to "/clone.bundle", but the cross-protocol ones are tricky.

If we advertise it as a blob in a specialized ref (e.g., "refs/mirrors")
it does not add much overhead over a simple capability. There are a few
extra round trips to actually fetch the blob (client sends a want and no
haves, then server sends the pack), but I think that's negligible when
we are talking about redirecting a full clone. In either case, we have
to hang up the original connection, fetch the mirror, and then come
back.

> For most Git repositories the bundle can be constructed by saving the
> bundle reference header into a file, e.g.
> $GIT_DIR/objects/pack/pack-$NAME.bh at the same time the pack is
> created. The bundle can be served by combining the .bh and .pack
> streams onto the network. It is very little additional disk overhead
> for the origin server,

That's clever. It does not work out of the box if you are using
alternates, but I think it could be adapted in certain situations. E.g.,
if you layer the pack so that one "base" repo always has its full pack
at the start, which is something we're already doing at GitHub.

> but allows resumable clone, provided the server has not done a GC.

As an aside, the current transfer-resuming code in http.c is
questionable.  It does not use etags or any sort of invalidation
mechanism, but just assumes hitting the same URL will give the same
bytes. That _usually_ works for dumb fetching of objects and packfiles,
though it is possible for a pack to change representation without
changing name.

My bundle patches inherited the same flaw, but it is much worse there,
because your URL may very well just be "clone.bundle" that gets updated
periodically.

> > I posted patches for this last year. One of the things that I got hung
> > up on was that I spooled the bundle to disk, and then cloned from it.
> > Which meant that you needed twice the disk space for a moment.
> 
> I don't think this is a huge concern. In many cases the checked out
> copy of the repository approaches a sizable fraction of the .pack
> itself. If you don't have 2x .pack disk available at clone time you
> may be in trouble anyway as you try to work with the repository post
> clone.

Yeah, in retrospect I was being stupid to let that hold it up. I'll
revisit the patches (I've rebased them forward over the past year, so it
shouldn't be too bad).

> > I wanted
> > to teach index-pack to "--fix-thin" a pack that was already on disk, so
> > that we could spool to disk, and then finalize it without making another
> > copy.
> 
> Don't you need to separate the bundle header from the pack data before
> you do this?

Yes, though it isn't hard. We have to fetch part of the bundle header
into memory during discover_refs(), since that is when we realize we are
getting a bundle and not just the refs. From there you can spool the
bundle header to disk, and then the packfile separately.

My original implementation did that, though I don't remember if that one
got posted to the list (after realizing that I couldn't just
"--fix-thin" directly, I simplified it to just spool the whole thing to
a single file).

> If the bundle is only used at clone time there is no
> --fix-thin step.

Yes, for the particular use case of a clone-mirror, you wouldn't need to
--fix-thin. But I think "git fetch https://example.com/foo.bundle"
should work in the general case (and it does with my patches).

> See above, I think you can reasonably do the /clone.bundle
> automatically on any HTTP server.

Yeah, the ".bh" trick you mentioned is low enough impact to the server
that we could just unconditionally make it part of the repack.

> > What would need to go into such a hash? It would need to represent the
> > exact bytes that will go into the pack, but without actually generating
> > those bytes. Perhaps a sha1 over the sequence of <object sha1, type,
> > base (if applicable), length> for each object would be enough. We should
> > know that after calling compute_write_order. If the client has a match,
> > we should be able to skip ahead to the correct byte.
> 
> I don't think Length is sufficient.
> 
> The repository could have recompressed an object with the same length
> but different libz encoding. I wonder if loose object recompression is
> reliable enough about libz encoding to resume in the middle of an
> object? Is it just based on libz version?
> 
> You may need to do include information about the source of the object,
> e.g. the trailing 20 byte hash in the source pack file.

Yeah, I think you're right that it's too flaky without recording the
source. At any rate, I think I prefer the bundle approach you mentioned
above. It solves the same problem, and is a lot more flexible (e.g., for
offloading to other servers).

-Peff

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: How to resume broke clone ?
  2013-12-04 20:08                   ` Jeff King
@ 2013-12-05  6:50                     ` Shawn Pearce
  2013-12-05 13:21                       ` Michael Haggerty
  2013-12-05 16:04                       ` Jeff King
  0 siblings, 2 replies; 36+ messages in thread
From: Shawn Pearce @ 2013-12-05  6:50 UTC (permalink / raw)
  To: Jeff King
  Cc: Duy Nguyen, zhifeng hu, Karsten Blees,
	Trần Ngọc Quân, Git Mailing List

On Wed, Dec 4, 2013 at 12:08 PM, Jeff King <peff@peff.net> wrote:
> On Thu, Nov 28, 2013 at 11:15:27AM -0800, Shawn Pearce wrote:
>
>> >>  - better integration with git bundles, provide a way to seamlessly
>> >> create/fetch/resume the bundles with "git clone" and "git fetch"
>>
>> We have been thinking about formalizing the /clone.bundle hack used by
>> repo on Android. If the server has the bundle, add a capability in the
>> refs advertisement saying its available, and the clone client can
>> first fetch $URL/clone.bundle.
>
> Yes, that was going to be my next step after getting the bundle fetch
> support in.

Yay!

> If we are going to do this, though, I'd really love for it
> to not be "hey, fetch .../clone.bundle from me", but a full-fledged
> "here are full URLs of my mirrors".

Ack. I agree completely.

> Then you can redirect a non-http cloner to http to grab the bundle. Or
> redirect them to a CDN. Or even somebody else's server entirely (e.g.,
> "go fetch from Linus first, my piddly server cannot feed you the whole
> kernel"). Some of the redirects you can do by issuing an http redirect
> to "/clone.bundle", but the cross-protocol ones are tricky.

Ack. My thoughts exactly. Especially the part of "my piddly server
shouldn't have to serve you a clone of Linus' tree when there are many
public hosts mirroring his code available to anyone". It is simply not
fair to clone Linus' tree off some guy's home ADSL connection, his
uplink probably sucks. But it is reasonable to fetch his incremental
delta after cloning from some other well known and well connected
source.

> If we advertise it as a blob in a specialized ref (e.g., "refs/mirrors")
> it does not add much overhead over a simple capability. There are a few
> extra round trips to actually fetch the blob (client sends a want and no
> haves, then server sends the pack), but I think that's negligible when
> we are talking about redirecting a full clone. In either case, we have
> to hang up the original connection, fetch the mirror, and then come
> back.

I wasn't thinking about using a "well known blob" for this.

Jonathan, Dave, Colby and I were kicking this idea around on Monday
during lunch. If the initial ref advertisement included a "mirrors"
capability the client could respond with "want mirrors" instead of the
usual want/have negotiation. The server could then return the mirror
URLs as pkt-lines, one per pkt. Its one extra RTT, but this is trivial
compared to the cost to really clone the repository.

These pkt-lines need to be a bit more than just URL. Or we need a new
URL like "bundle:http://...." to denote a resumable bundle over HTTP
vs. a normal HTTP URL that might not be a bundle file, and is just a
better connected server.


The mirror URLs could be stored in $GIT_DIR/config as a simple
multi-value variable. Unfortunately that isn't easily remotely
editable. But I am not sure I care?

GitHub doesn't let you edit $GIT_DIR/config, but it doesn't need to.
For most repositories hosted at GitHub, GitHub is probably the best
connected server for that repository. For repositories that are
incredibly high traffic GitHub might out of its own interest want to
configure mirror URLs on some sort of CDN to distribute the network
traffic closer to the edges. Repository owners just shouldn't have to
worry about these sorts of details. It should be managed by the
hosting service.

In my case for android.googlesource.com we want bundles on the CDN
near the network edges, and our repository owners don't care to know
the details of that. They just want our server software to make it all
happen, and our servers already manage $GIT_DIR/config for them. It
also mostly manages /clone.bundle on the CDN. And /clone.bundle is an
ugly, limited hack.

For the average home user sharing their working repository over git://
from their home ADSL or cable connection, editing .git/config is
easier than a blob in refs/mirrors. They already know how to edit
.git/config to manage remotes. Heck, remote.origin.url might already
be a good mirror address to advertise, especially if the client isn't
on the same /24 as the server and the remote.origin.url is something
like "git.kernel.org". :-)

>> For most Git repositories the bundle can be constructed by saving the
>> bundle reference header into a file, e.g.
>> $GIT_DIR/objects/pack/pack-$NAME.bh at the same time the pack is
>> created. The bundle can be served by combining the .bh and .pack
>> streams onto the network. It is very little additional disk overhead
>> for the origin server,
>
> That's clever. It does not work out of the box if you are using
> alternates, but I think it could be adapted in certain situations. E.g.,
> if you layer the pack so that one "base" repo always has its full pack
> at the start, which is something we're already doing at GitHub.

Yes, well, I was assuming the pack was a fully connected repack.
Alternates always creates a partial pack. But if you have an
alternate, that alternate maybe should be given as a mirror URL? And
allow the client to recurse the alternate mirror URL list too?

By listing the alternate as a mirror a client could maybe discover the
resumable clone bundle in the alternate, grab that first to bootstrap,
reducing the amount it has to obtain in a non-resumable way. Or... the
descendant repository could offer its own bundle with the "must have"
assertions from the alternate at the time it repacked. So the .bh file
would have a number of ^ lines and the bundle was built with a "--not
..." list.

>> but allows resumable clone, provided the server has not done a GC.
>
> As an aside, the current transfer-resuming code in http.c is
> questionable.  It does not use etags or any sort of invalidation
> mechanism, but just assumes hitting the same URL will give the same
> bytes.

Yea, our lunch conversation eventually reached this part too. repo's
/clone.bundle hack is equally stupid and assumes a resume will get the
correct data, with no validation. If you resume with the wrong data
while inside of the pack stream the pack will be invalid; the SHA-1
trailer won't match. But you won't know until you have downloaded the
entire useless file. Resuming a 700M download after the first 10M only
to find out the first 10M is mismatched sucks.

What really got us worried was the bundle header has no checksums, and
a resume in the bundle header from the wrong version could be
interesting.

> That _usually_ works for dumb fetching of objects and packfiles,
> though it is possible for a pack to change representation without
> changing name.

Yes. And this is why the packfile name algorithm is horribly flawed. I
keep saying we should change it to name the pack using the last 20
bytes of the file but ... nobody has written the patch for that?  :-)

> My bundle patches inherited the same flaw, but it is much worse there,
> because your URL may very well just be "clone.bundle" that gets updated
> periodically.

Yup, you followed the same thing we did in repo, which is horribly wrong.

We should try to use ETag if available to safely resume, and we should
try to encourage people to use stronger names when pointing to URLs
that are resumable, like a bundle on a CDN. If the URL is offered by
the server in pkt-lines after the advertisement its easy for the
server to return the current CDN URL, and easy for the server to
implement enforcement of the URLs being unique. Especially if you
manage the CDN automatically; e.g. Android uses tools to build the CDN
files and push them out. Its easy for us to ensure these have unique
URLs on every push. A bundling server bundling once a day or once a
week could simply date stamp each run.

>> > I posted patches for this last year. One of the things that I got hung
>> > up on was that I spooled the bundle to disk, and then cloned from it.
>> > Which meant that you needed twice the disk space for a moment.
>>
>> I don't think this is a huge concern. In many cases the checked out
>> copy of the repository approaches a sizable fraction of the .pack
>> itself. If you don't have 2x .pack disk available at clone time you
>> may be in trouble anyway as you try to work with the repository post
>> clone.
>
> Yeah, in retrospect I was being stupid to let that hold it up. I'll
> revisit the patches (I've rebased them forward over the past year, so it
> shouldn't be too bad).

I keep prodding Jonathan to work on this too, because I'd really like
to get this out of repo and just have it be something git knows how to
do. And bigger mirrors like git.kernel.org could do a quick
grep/sort/uniq -c through their access logs and periodically bundle up
a few repositories that are cloned often. E.g. we all know
git.kernel.org should just bundle Linus' repository.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: How to resume broke clone ?
  2013-12-05  6:50                     ` Shawn Pearce
@ 2013-12-05 13:21                       ` Michael Haggerty
  2013-12-05 15:11                         ` Shawn Pearce
  2013-12-05 16:12                         ` Jeff King
  2013-12-05 16:04                       ` Jeff King
  1 sibling, 2 replies; 36+ messages in thread
From: Michael Haggerty @ 2013-12-05 13:21 UTC (permalink / raw)
  To: Shawn Pearce
  Cc: Jeff King, Duy Nguyen, zhifeng hu, Karsten Blees,
	Trần Ngọc Quân, Git Mailing List

This discussion has mostly been about letting small Git servers delegate
the work of an initial clone to a beefier server.  I haven't seen any
explicit mention of the inverse:

Suppose a company has a central Git server that is meant to be the
"single source of truth", but has worldwide offices and wants to locate
bootstrap mirrors in each office.  The end users would not even want to
know that there are multiple servers.  Hosters like GitHub might also
encourage their big customers to set up bootstrap mirror(s) in-house to
make cloning faster for their users while reducing internet traffic and
the burden on their own infrastructure.  The goal would be to make the
system transparent to users and easily reconfigurable as circumstances
change.

One alternative would be to ask users to clone from their local mirror.
 The local mirror would give them whatever it has, then do the
equivalent of a permanent redirect to tell the client "from now on, use
the central server" to get the rest of the initial clone and for future
fetches.  But this would require users to know which mirror is "local".

A better alternative would be to ask users to clone from the central
server.  In this case, the central server would want to tell the clients
to grab what they can from their local bootstrap mirror and then come
back to the central server for any remainders.  The trick is that which
bootstrap mirror is "local" would vary from client to client.

I suppose that this could be implemented using what you have discussed
by having the central server direct the client to a URL that resolves
differently for different clients, CDN-like.  Alternatively, the central
Git server could itself look where a request is coming from and use some
intelligence to redirect the client to the closest bootstrap mirror from
its own list.  Or the server could pass the client a list of known
mirrors, and the client could try to determine which one is closest (and
reachable!).

I'm not sure that this idea is interesting, but I just wanted to throw
it out there as a related use case that seems a bit different than what
you have been discussing.

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: How to resume broke clone ?
  2013-12-05 13:21                       ` Michael Haggerty
@ 2013-12-05 15:11                         ` Shawn Pearce
  2013-12-05 16:12                         ` Jeff King
  1 sibling, 0 replies; 36+ messages in thread
From: Shawn Pearce @ 2013-12-05 15:11 UTC (permalink / raw)
  To: Michael Haggerty
  Cc: Jeff King, Duy Nguyen, zhifeng hu, Karsten Blees,
	Trần Ngọc Quân, Git Mailing List

On Thu, Dec 5, 2013 at 5:21 AM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
> This discussion has mostly been about letting small Git servers delegate
> the work of an initial clone to a beefier server.  I haven't seen any
> explicit mention of the inverse:
>
> Suppose a company has a central Git server that is meant to be the
> "single source of truth", but has worldwide offices and wants to locate
> bootstrap mirrors in each office.  The end users would not even want to
> know that there are multiple servers.  Hosters like GitHub might also
> encourage their big customers to set up bootstrap mirror(s) in-house to
> make cloning faster for their users while reducing internet traffic and
> the burden on their own infrastructure.  The goal would be to make the
> system transparent to users and easily reconfigurable as circumstances
> change.

I think there is a different way to do that.

Build a caching Git proxy server. And teach Git clients to use it.


One idea we had at $DAY_JOB a couple of years ago was to build a
daemon that sat in the background and continuously fetched content
from repository upstreams. We made it efficient by modifying the Git
protocol to use a hanging network socket, and the upstream server
would broadcast push pack files down these hanging streams as pushes
were received.

The original intent was for an Android developer to be able to have
his working tree forest of 500 repositories subscribe to our internal
server's broadcast stream. We figured if the server knows exactly
which refs every client has, because they all have the same ones, and
their streams are all still open and active, then the server can make
exactly one incremental thin pack and send the same copy to every
client. Its "just" a socket write problem. Instead of packing the same
stuff 100x for 100x clients its packed once and sent 100x.

Then we realized remote offices could also install this software on a
local server, and use this as a fan-out distributor within the LAN. We
were originally thinking about some remote offices on small Internet
connections, where delivery of 10 MiB x 20 was a lot but delivery of
10 MiB once and local fan-out on the Ethernet was easy.

The JGit patches for this work are still pending[1].


If clients had a local Git-aware cache server in their office and
~/.gitconfig had the address of it, your problem becomes simple.

Clients clone from the public URL e.g. GitHub, but the local cache
server first gives the client a URL to clone from itself. After that
is complete then the client can fetch from the upstream. The cache
server can be self-maintaining, watching its requests to see what is
accessed often-ish, and keep those repositories current-ish locally by
running git fetch itself in the background.

Its easy to do this with bundles on "CDN" like HTTP. Just use the
office's caching HTTP proxy server. Assuming its cache is big enough
for those large Git bundle payloads, and the viral cat videos. But you
are at the mercy of the upstream bundler rebuilding the bundles. And
refetching them in whole. Neither of which is great.

A simple self-contained server that doesn't accept pushes, but knows
how to clone repositories, fetch them periodically, and run `git gc`,
works well. And the mirror URL extension we have been discussing in
this thread would work fine here. The cache server can return URLs
that point to itself. Or flat out proxy the Git transaction with the
origin server.


[1] https://git.eclipse.org/r/#/q/owner:wetherbeei%2540google.com+status:open,n,z

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: How to resume broke clone ?
  2013-12-05  6:50                     ` Shawn Pearce
  2013-12-05 13:21                       ` Michael Haggerty
@ 2013-12-05 16:04                       ` Jeff King
  2013-12-05 18:01                         ` Junio C Hamano
  2013-12-05 20:28                         ` [PATCH] pack-objects: name pack files after trailer hash Jeff King
  1 sibling, 2 replies; 36+ messages in thread
From: Jeff King @ 2013-12-05 16:04 UTC (permalink / raw)
  To: Shawn Pearce
  Cc: Duy Nguyen, zhifeng hu, Karsten Blees,
	Trần Ngọc Quân, Git Mailing List

On Wed, Dec 04, 2013 at 10:50:27PM -0800, Shawn Pearce wrote:

> I wasn't thinking about using a "well known blob" for this.
> 
> Jonathan, Dave, Colby and I were kicking this idea around on Monday
> during lunch. If the initial ref advertisement included a "mirrors"
> capability the client could respond with "want mirrors" instead of the
> usual want/have negotiation. The server could then return the mirror
> URLs as pkt-lines, one per pkt. Its one extra RTT, but this is trivial
> compared to the cost to really clone the repository.

I don't think this is any more or less efficient than the blob scheme.
In both cases, the client sends a single "want" line and no "have"
lines, and then the server responds with the output (either pkt-lines,
or a single-blob pack).

What I like about the blob approach is:

  1. It requires zero extra code on the server. This makes
     implementation simple, but also means you can deploy it
     on existing servers (or even on non-pkt-line servers like
     dumb http).

  2. It's very debuggable from the client side. You can fetch the blob,
     look at it, and decide which mirror you want outside of git if you
     want to (true, you can teach the git client to dump the pkt-line
     URLs, too, but that's extra code). You could even do this with an
     existing git client that has not yet learned about the mirror
     redirect.

  3. It removes any size or structure limits that the protocol imposes
     (I was planning to use git-config format for the blob itself). The
     URLs themselves aren't big, but we may want to annotate them with
     metadata.

     You mentioned "this is a bundle" versus "this is a regular http
     server" below. You might also want to provide network location
     information (e.g., "this is a good mirror if you are in Asia"),
     though for the most part I'd expect that to happen magically via
     CDN.

     When we discussed this before, the concept came up of offering not
     just a clone bundle, but "slices" of history (as thin-pack
     bundles), so that a fetch could grab a sequence of resumable
     slices, starting with what they have, and then topping off with a
     true fetch. You would want to provide the start and end points of
     each slice.

  4. You can manage it remotely via the git protocol (more discussion
     below).

  5. A clone done with "--mirror" will actually propagate the mirror
     file automatically.

What are the advantages of the pkt-line approach? The biggest one I can
think of is that it does not pollute the refs namespace. While (5) is
convenient in some cases, it would make it more of a pain if you are
trying to keep a clone mirror up to date, but do _not_ want to pass
along upstream's mirror file.

You may want to have a server implementation that offers a dynamic
mirror, rather than a true object we have in the ODB. That is possible
with a mirror blob, but is slightly harder (you have to fake the object
rather than just dumping a line).

> These pkt-lines need to be a bit more than just URL. Or we need a new
> URL like "bundle:http://...." to denote a resumable bundle over HTTP
> vs. a normal HTTP URL that might not be a bundle file, and is just a
> better connected server.

Right, I think that's the most critical one (though you could also just
use the convention of ".bundle" in the URL). I think we may want to
leave room for more metadata, though.

> The mirror URLs could be stored in $GIT_DIR/config as a simple
> multi-value variable. Unfortunately that isn't easily remotely
> editable. But I am not sure I care?

For big sites that manage the bundles on behalf of the user, I don't
think it is an issue. For somebody running their own small site, I think
it is a useful way of moving the data to the server.

> For the average home user sharing their working repository over git://
> from their home ADSL or cable connection, editing .git/config is
> easier than a blob in refs/mirrors. They already know how to edit
> .git/config to manage remotes.

Yes, but it's editing .git/config on the server, not on the client,
which may be slightly harder for some people. I do think we'd want
some tool support on the client side. git-config recently learned to
read from a blob. The next step is:

  git config --blob=refs/mirrors --edit

or

  git config --blob=refs/mirrors mirror.ko.url git://git.kernel.org/...
  git config --blob=refs/mirrors mirror.ko.bundle true

We can't add tool support for editing .git/config on the server side,
because the method for doing so isn't standard.

> Heck, remote.origin.url might already
> be a good mirror address to advertise, especially if the client isn't
> on the same /24 as the server and the remote.origin.url is something
> like "git.kernel.org". :-)

You could have a "git-advertise-upstream" that generates a mirror blob
from your remotes config and pushes it to your publishing point. That
may be overkill, but I don't think it's possible with a
.git/config-based solution.

> > That's clever. It does not work out of the box if you are using
> > alternates, but I think it could be adapted in certain situations. E.g.,
> > if you layer the pack so that one "base" repo always has its full pack
> > at the start, which is something we're already doing at GitHub.
> 
> Yes, well, I was assuming the pack was a fully connected repack.
> Using alternates always creates a partial pack. But if you have an
> alternate, that alternate maybe should be given as a mirror URL? And
> allow the client to recurse the alternate mirror URL list too?

The problem for us is not that we have a partial pack, but that the
alternates pack has a lot of other junk in it. A linux.git clone is
650MB or so. The packfile for all of the linux.git forks together on
GitHub is several gigabytes.

> What really got us worried was that the bundle header has no
> checksums, and a resume in the bundle header from the wrong version
> could be interesting.

The bundle header is small enough that you should just throw it away if
you didn't get the whole thing (IIRC, that is what my patches do; they
do not do _anything_ until we receive the whole ref advertisement, at
which point we decide if it is smart, dumb, or a bundle).
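
For reference, the header is just a short run of text lines in front
of the raw pack bytes, with no checksum of its own; roughly:

  # v2 git bundle
  -<sha1> <title of a prerequisite commit>
  <sha1> refs/heads/master
  <empty line, then the packfile>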

> Yes. And this is why the packfile name algorithm is horribly flawed. I
> keep saying we should change it to name the pack using the last 20
> bytes of the file but ... nobody has written the patch for that?  :-)

Totally agree. I think we could also get rid of the horrible hacks in
repack where we pack to a tempfile, then have to do another tempfile
dance (which is not atomic!) to move the same-named packfile out of the
way. If the name were based on the content, we could just throw away our
new pack if one of the same name is already there (just like we do for
loose objects).
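
With trailer-based names, the install step of repack could shrink to
something like this sketch (GNU tools for brevity; the real fix would
of course live in the C code):

  # the pack's name is just its trailer, i.e. its final 20 bytes
  name=$(tail -c 20 tmp.pack | xxd -p)
  if test -f "objects/pack/pack-$name.pack"
  then
          rm -f tmp.pack tmp.idx  # byte-identical pack is already there
  else
          mv tmp.pack "objects/pack/pack-$name.pack"
          mv tmp.idx "objects/pack/pack-$name.idx"
  fi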

I haven't looked at making such a patch, but I think it shouldn't be too
complicated. My big worry would be weird fallouts from some hidden part
of the code that we don't realize is depending on the current naming
scheme. :)

-Peff

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: How to resume broke clone ?
  2013-12-05 13:21                       ` Michael Haggerty
  2013-12-05 15:11                         ` Shawn Pearce
@ 2013-12-05 16:12                         ` Jeff King
  1 sibling, 0 replies; 36+ messages in thread
From: Jeff King @ 2013-12-05 16:12 UTC (permalink / raw)
  To: Michael Haggerty
  Cc: Shawn Pearce, Duy Nguyen, zhifeng hu, Karsten Blees,
	Trần Ngọc Quân, Git Mailing List

On Thu, Dec 05, 2013 at 02:21:09PM +0100, Michael Haggerty wrote:

> A better alternative would be to ask users to clone from the central
> server.  In this case, the central server would want to tell the clients
> to grab what they can from their local bootstrap mirror and then come
> back to the central server for any remainders.  The trick is that which
> bootstrap mirror is "local" would vary from client to client.
>
> I suppose that this could be implemented using what you have discussed
> by having the central server direct the client to a URL that resolves
> differently for different clients, CDN-like.  Alternatively, the central
> Git server could itself look where a request is coming from and use some
> intelligence to redirect the client to the closest bootstrap mirror from
> its own list.  Or the server could pass the client a list of known
> mirrors, and the client could try to determine which one is closest (and
> reachable!).

Exactly. I think this will mostly happen via CDN, but I had also
envisioned that the server could add metadata to a list of possible
mirrors, like:

       [mirror "ko-us"]
       url = http://git.us.kernel.org/...
       zone = us

       [mirror "ko-cn"]
       url = http://git.cn.kernel.org/...
       zone = cn

If the "zone" keys follow a micro-format convention, then the client
knows that it prefers "cn" over "us" (either on the command line, or a
local config option in ~/.gitconfig).
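
The client side of that could be a one-time setting, e.g. (the key
name here is hypothetical):

  $ git config --global mirror.preferZone cn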

The biggest problem with all of this is that the server has to know
about the mirrors. If you want to set up an in-house mirror for
something hosted on GitHub, but it's only available to people in your
company, then GitHub would not want to advertise it. You need some way
to tell your clients about the mirror (and that is the inverse-mirror
"fetch from the mirror, which tells you it is just a bootstrap and to
now switch to the real repo" scheme that I think you were describing
earlier).
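
You can at least approximate that bootstrap-and-switch scheme by hand
today (the URLs are invented for illustration):

  $ git clone http://mirror.corp.example.com/shared/project.git
  $ cd project
  $ git remote set-url origin https://github.com/upstream/project.git
  $ git fetch origin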

-Peff

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: How to resume broke clone ?
  2013-12-05 16:04                       ` Jeff King
@ 2013-12-05 18:01                         ` Junio C Hamano
  2013-12-05 19:08                           ` Jeff King
  2013-12-05 20:28                         ` [PATCH] pack-objects: name pack files after trailer hash Jeff King
  1 sibling, 1 reply; 36+ messages in thread
From: Junio C Hamano @ 2013-12-05 18:01 UTC (permalink / raw)
  To: Jeff King
  Cc: Shawn Pearce, Duy Nguyen, zhifeng hu, Karsten Blees,
	Trần Ngọc Quân, Git Mailing List

Jeff King <peff@peff.net> writes:

> Right, I think that's the most critical one (though you could also just
> use the convention of ".bundle" in the URL). I think we may want to
> leave room for more metadata, though.

Good. I like this line of thinking.

>> Heck, remote.origin.url might already
>> be a good mirror address to advertise, especially if the client isn't
>> on the same /24 as the server and the remote.origin.url is something
>> like "git.kernel.org". :-)
>
> You could have a "git-advertise-upstream" that generates a mirror blob
> from your remotes config and pushes it to your publishing point. That
> may be overkill, but I don't think it's possible with a
> .git/config-based solution.

I do not think I follow.  The upload-pack service could be taught to
pay attention to the uploadpack.advertiseUpstream config at runtime,
advertise 'mirror' capability, and then respond with the list of
remote.*.url it uses when asked (if we go with the pkt-line based
approach).  Alternatively, it could also be taught to pay attention
to the same config at runtime, create a blob to advertise the list
of remote.*.url it uses and store it in refs/mirror (or do this
purely in-core without actually writing to the refs/ namespace), and
emit an entry for refs/mirror using that blob object name in the
ls-remote part of the response (if we go with the magic blob based
approach).
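
In config terms, the runtime switch would be no more than (the key
name is of course hypothetical at this point):

  $ git config uploadpack.advertiseUpstream true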

>> Yes. And this is why the packfile name algorithm is horribly flawed. I
>> keep saying we should change it to name the pack using the last 20
>> bytes of the file but ... nobody has written the patch for that?  :-)
>
> Totally agree. I think we could also get rid of the horrible hacks in
> repack where we pack to a tempfile, then have to do another tempfile
> dance (which is not atomic!) to move the same-named packfile out of the
> way. If the name were based on the content, we could just throw away our
> new pack if one of the same name is already there (just like we do for
> loose objects).

Yay.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: How to resume broke clone ?
  2013-12-05 18:01                         ` Junio C Hamano
@ 2013-12-05 19:08                           ` Jeff King
  0 siblings, 0 replies; 36+ messages in thread
From: Jeff King @ 2013-12-05 19:08 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Shawn Pearce, Duy Nguyen, zhifeng hu, Karsten Blees,
	Trần Ngọc Quân, Git Mailing List

On Thu, Dec 05, 2013 at 10:01:28AM -0800, Junio C Hamano wrote:

> > You could have a "git-advertise-upstream" that generates a mirror blob
> > from your remotes config and pushes it to your publishing point. That
> > may be overkill, but I don't think it's possible with a
> > .git/config-based solution.
> 
> I do not think I follow.  The upload-pack service could be taught to
> pay attention to the uploadpack.advertiseUpstream config at runtime,
> advertise 'mirror' capability, and then respond with the list of
> remote.*.url it uses when asked (if we go with the pkt-line based
> approach).

I was assuming a triangular workflow, where your publishing point (that
other people will fetch from) does not know anything about the upstream.
Like:

  $ git clone git://git.kernel.org/pub/scm/git/git.git
  $ hack hack hack; commit commit commit
  $ git remote add me myserver:/var/git/git.git
  $ git push me
  $ git advertise-upstream origin me

If your publishing point is already fetching from another upstream, then
yeah, I'd agree that dynamically generating it from the config is fine.

> Alternatively, it could also be taught to pay attention
> to the same config at runtime, create a blob to advertise the list
> of remote.*.url it uses and store it in refs/mirror (or do this
> purely in-core without actually writing to the refs/ namespace), and
> emit an entry for refs/mirror using that blob object name in the
> ls-remote part of the response (if we go with the magic blob based
> approach).

Yes. The pkt-line versus refs distinction is purely a protocol issue.
You can do anything you want on the backend with either of them,
including faking the ref (you can also accept fake pushes to
refs/mirror, if you really want people to be able to upload that
way).

But it is worth considering what implementation difficulties we would
run across in either case. Producing a fake refs/mirror blob that
responds like a normal ref is more work than just dumping the lines. If
we're always just going to generate it dynamically anyway, then we can
save ourselves some effort.

-Peff

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH] pack-objects: name pack files after trailer hash
  2013-12-05 16:04                       ` Jeff King
  2013-12-05 18:01                         ` Junio C Hamano
@ 2013-12-05 20:28                         ` Jeff King
  2013-12-05 21:56                           ` Shawn Pearce
                                             ` (2 more replies)
  1 sibling, 3 replies; 36+ messages in thread
From: Jeff King @ 2013-12-05 20:28 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: Git Mailing List

On Thu, Dec 05, 2013 at 11:04:18AM -0500, Jeff King wrote:

> > Yes. And this is why the packfile name algorithm is horribly flawed. I
> > keep saying we should change it to name the pack using the last 20
> > bytes of the file but ... nobody has written the patch for that?  :-)
> 
> Totally agree. I think we could also get rid of the horrible hacks in
> repack where we pack to a tempfile, then have to do another tempfile
> dance (which is not atomic!) to move the same-named packfile out of the
> way. If the name were based on the content, we could just throw away our
> new pack if one of the same name is already there (just like we do for
> loose objects).
> 
> I haven't looked at making such a patch, but I think it shouldn't be too
> complicated. My big worry would be weird fallouts from some hidden part
> of the code that we don't realize is depending on the current naming
> scheme. :)

So here's the first part, that actually changes the name. It passes the
test suite, so it must be good, right? And just look at that diffstat.

This actually applies on top of e74435a (sha1write: make buffer
const-correct, 2013-10-24), which is on another topic. Since the sha1
parameter to write_idx_file is no longer used as both an in- and out-
parameter (yuck), we can make it const.

The second half would be to simplify git-repack. The current behavior is
to replace the old packfile with a tricky rename dance, which is still
correct but overly complicated. We should be able to just drop the new
packfile, since we know the bytes are identical (or rename the new one
over the old, though I think keeping the old is probably kinder to the
disk cache, especially if another process already has it mmap'd).

-- >8 --
Subject: pack-objects: name pack files after trailer hash

Our current scheme for naming packfiles is to calculate the
sha1 hash of the sorted list of objects contained in the
packfile. This gives us a unique name, so we are reasonably
sure that two packs with the same name will contain the same
objects.

It does not, however, tell us that two such packs have the
exact same bytes. This makes things awkward if we repack the
same set of objects. Due to run-to-run variations, the bytes
may not be identical (e.g., changed zlib or git versions,
different source object reuse due to new packs in the
repository, or even different deltas due to races during a
multi-threaded delta search).

In theory, this could be helpful to a program that cares
that the packfile contains a certain set of objects, but
does not care about the particular representation. In
practice, no part of git makes use of that, and in many
cases it is potentially harmful. For example, if a dumb http
client fetches the .idx file, it must be sure to get the
exact .pack that matches it. Similarly, a partial transfer
of a .pack file cannot be safely resumed, as the actual
bytes may have changed.  This could also affect a local
client which opens the .idx and .pack files, closes the
.pack file (due to memory or file descriptor limits), and
then re-opens a changed packfile.

In all of these cases, git can detect the problem, as we
have the sha1 of the bytes themselves in the pack trailer
(which we verify on transfer), and the .idx file references
the trailer from the matching packfile. But it would be
simpler and more efficient to actually get the correct
bytes, rather than noticing the problem and having to
restart the operation.

This patch simply uses the pack trailer sha1 as the pack
name. It should be similarly unique, but covers the exact
representation of the objects. Other parts of git should not
care, as the pack name is returned by pack-objects and is
essentially opaque.

One test needs to be updated, because it actually corrupts a
pack and expects that re-packing the corrupted bytes will
use the same name. It won't anymore, but we can easily just
use the name that pack-objects hands back.

Signed-off-by: Jeff King <peff@peff.net>
---
 pack-write.c          | 8 +-------
 pack.h                | 2 +-
 t/t5302-pack-index.sh | 4 ++--
 3 files changed, 4 insertions(+), 10 deletions(-)

diff --git a/pack-write.c b/pack-write.c
index ca9e63b..ddc174e 100644
--- a/pack-write.c
+++ b/pack-write.c
@@ -44,14 +44,13 @@ static int need_large_offset(off_t offset, const struct pack_idx_option *opts)
  */
 const char *write_idx_file(const char *index_name, struct pack_idx_entry **objects,
 			   int nr_objects, const struct pack_idx_option *opts,
-			   unsigned char *sha1)
+			   const unsigned char *sha1)
 {
 	struct sha1file *f;
 	struct pack_idx_entry **sorted_by_sha, **list, **last;
 	off_t last_obj_offset = 0;
 	uint32_t array[256];
 	int i, fd;
-	git_SHA_CTX ctx;
 	uint32_t index_version;
 
 	if (nr_objects) {
@@ -114,9 +113,6 @@ const char *write_idx_file(const char *index_name, struct pack_idx_entry **objec
 	}
 	sha1write(f, array, 256 * 4);
 
-	/* compute the SHA1 hash of sorted object names. */
-	git_SHA1_Init(&ctx);
-
 	/*
 	 * Write the actual SHA1 entries..
 	 */
@@ -128,7 +124,6 @@ const char *write_idx_file(const char *index_name, struct pack_idx_entry **objec
 			sha1write(f, &offset, 4);
 		}
 		sha1write(f, obj->sha1, 20);
-		git_SHA1_Update(&ctx, obj->sha1, 20);
 		if ((opts->flags & WRITE_IDX_STRICT) &&
 		    (i && !hashcmp(list[-2]->sha1, obj->sha1)))
 			die("The same object %s appears twice in the pack",
@@ -178,7 +173,6 @@ const char *write_idx_file(const char *index_name, struct pack_idx_entry **objec
 	sha1write(f, sha1, 20);
 	sha1close(f, NULL, ((opts->flags & WRITE_IDX_VERIFY)
 			    ? CSUM_CLOSE : CSUM_FSYNC));
-	git_SHA1_Final(sha1, &ctx);
 	return index_name;
 }
 
diff --git a/pack.h b/pack.h
index aa6ee7d..12d9516 100644
--- a/pack.h
+++ b/pack.h
@@ -76,7 +76,7 @@ struct pack_idx_entry {
 struct progress;
 typedef int (*verify_fn)(const unsigned char*, enum object_type, unsigned long, void*, int*);
 
-extern const char *write_idx_file(const char *index_name, struct pack_idx_entry **objects, int nr_objects, const struct pack_idx_option *, unsigned char *sha1);
+extern const char *write_idx_file(const char *index_name, struct pack_idx_entry **objects, int nr_objects, const struct pack_idx_option *, const unsigned char *sha1);
 extern int check_pack_crc(struct packed_git *p, struct pack_window **w_curs, off_t offset, off_t len, unsigned int nr);
 extern int verify_pack_index(struct packed_git *);
 extern int verify_pack(struct packed_git *, verify_fn fn, struct progress *, uint32_t);
diff --git a/t/t5302-pack-index.sh b/t/t5302-pack-index.sh
index fe82025..4bbb718 100755
--- a/t/t5302-pack-index.sh
+++ b/t/t5302-pack-index.sh
@@ -174,11 +174,11 @@ test_expect_success \
 test_expect_success \
     '[index v1] 5) pack-objects happily reuses corrupted data' \
     'pack4=$(git pack-objects test-4 <obj-list) &&
-     test -f "test-4-${pack1}.pack"'
+     test -f "test-4-${pack4}.pack"'
 
 test_expect_success \
     '[index v1] 6) newly created pack is BAD !' \
-    'test_must_fail git verify-pack -v "test-4-${pack1}.pack"'
+    'test_must_fail git verify-pack -v "test-4-${pack4}.pack"'
 
 test_expect_success \
     '[index v2] 1) stream pack to repository' \
-- 
1.8.5.524.g6743da6

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH] pack-objects: name pack files after trailer hash
  2013-12-05 20:28                         ` [PATCH] pack-objects: name pack files after trailer hash Jeff King
@ 2013-12-05 21:56                           ` Shawn Pearce
  2013-12-05 22:59                           ` Junio C Hamano
  2013-12-16  7:41                           ` Michael Haggerty
  2 siblings, 0 replies; 36+ messages in thread
From: Shawn Pearce @ 2013-12-05 21:56 UTC (permalink / raw)
  To: Jeff King; +Cc: Git Mailing List

On Thu, Dec 5, 2013 at 12:28 PM, Jeff King <peff@peff.net> wrote:
> Subject: pack-objects: name pack files after trailer hash
>
> Our current scheme for naming packfiles is to calculate the
> sha1 hash of the sorted list of objects contained in the
> packfile. This gives us a unique name, so we are reasonably
> sure that two packs with the same name will contain the same
> objects.

Yay-by: Shawn Pearce <spearce@spearce.org>

> ---
>  pack-write.c          | 8 +-------
>  pack.h                | 2 +-
>  t/t5302-pack-index.sh | 4 ++--
>  3 files changed, 4 insertions(+), 10 deletions(-)

Obviously this is correct given the diffstat. :-)

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] pack-objects: name pack files after trailer hash
  2013-12-05 20:28                         ` [PATCH] pack-objects: name pack files after trailer hash Jeff King
  2013-12-05 21:56                           ` Shawn Pearce
@ 2013-12-05 22:59                           ` Junio C Hamano
  2013-12-06 22:18                             ` Jeff King
  2013-12-16  7:41                           ` Michael Haggerty
  2 siblings, 1 reply; 36+ messages in thread
From: Junio C Hamano @ 2013-12-05 22:59 UTC (permalink / raw)
  To: Jeff King; +Cc: Shawn Pearce, Git Mailing List

Jeff King <peff@peff.net> writes:

> The second half would be to simplify git-repack. The current behavior is
> to replace the old packfile with a tricky rename dance, which is still
> correct but overly complicated. We should be able to just drop the new
> packfile, since we know the bytes are identical (or rename the new one
> over the old, though I think keeping the old is probably kinder to the
> disk cache, especially if another process already has it mmap'd).

Concurred.

> One test needs to be updated, because it actually corrupts a
> pack and expects that re-packing the corrupted bytes will
> use the same name. It won't anymore, but we can easily just
> use the name that pack-objects hands back.

Re-reading the tests in that script, I am not sure if keeping these
tests is even a sane thing to do, by the way.  It "expects" that
certain breakages are propagated, and anybody who breaks that
expectation by improving pack-objects etc. to catch such breakages
will be yelled at when the test that used to pass starts failing.

Seeing that the way the test scripts are line-wrapped follows the
ancient convention, I suspect that this may be because it predates
our more recent best practice to document known breakages with
test_expect_failure.

> diff --git a/t/t5302-pack-index.sh b/t/t5302-pack-index.sh
> index fe82025..4bbb718 100755
> --- a/t/t5302-pack-index.sh
> +++ b/t/t5302-pack-index.sh
> @@ -174,11 +174,11 @@ test_expect_success \
>  test_expect_success \
>      '[index v1] 5) pack-objects happily reuses corrupted data' \
>      'pack4=$(git pack-objects test-4 <obj-list) &&
> -     test -f "test-4-${pack1}.pack"'
> +     test -f "test-4-${pack4}.pack"'
>  
>  test_expect_success \
>      '[index v1] 6) newly created pack is BAD !' \
> -    'test_must_fail git verify-pack -v "test-4-${pack1}.pack"'
> +    'test_must_fail git verify-pack -v "test-4-${pack4}.pack"'

A good thing is that the above hunks are the right thing to do, even
if we are to modernise these tests so that they document a known
breakage with expect-failure.

Thanks.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] pack-objects: name pack files after trailer hash
  2013-12-05 22:59                           ` Junio C Hamano
@ 2013-12-06 22:18                             ` Jeff King
  0 siblings, 0 replies; 36+ messages in thread
From: Jeff King @ 2013-12-06 22:18 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Shawn Pearce, Git Mailing List

On Thu, Dec 05, 2013 at 02:59:45PM -0800, Junio C Hamano wrote:

> > One test needs to be updated, because it actually corrupts a
> > pack and expects that re-packing the corrupted bytes will
> > use the same name. It won't anymore, but we can easily just
> > use the name that pack-objects hands back.
> 
> Re-reading the tests in that script, I am not sure if keeping these
> tests is even a sane thing to do, by the way.  It "expects" that
> certain breakages are propagated, and anybody who breaks that
> expectation by improving pack-objects etc. to catch such breakages
> will be yelled at by breaking the test that used to pass.

I had a similar thought, but I figured I would leave it for the person
who _does_ make that change. The yelling will be a good signal that
they've got it right, and they can clean up the test (either by dropping
it, or modifying it to check the right thing) at that point.

> Seeing that the way the test scripts are line-wrapped follows the
> ancient convention, I suspect that this may be because it predates
> our more recent best practice to document known breakages with
> test_expect_failure.

I read it more as "make sure that the v1 index breaks, so when we are
testing v2 we know it is not an accident that we notice the breakage".

But I also see your reason, and I think it would be fine to use
test_expect_failure.
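
Something along these lines, perhaps (an untested sketch; the body
asserts the behavior we would eventually like to see):

  test_expect_failure 'pack-objects refuses to reuse corrupted data' '
          test_must_fail git pack-objects test-5 <obj-list
  '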

-Peff

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] pack-objects: name pack files after trailer hash
  2013-12-05 20:28                         ` [PATCH] pack-objects: name pack files after trailer hash Jeff King
  2013-12-05 21:56                           ` Shawn Pearce
  2013-12-05 22:59                           ` Junio C Hamano
@ 2013-12-16  7:41                           ` Michael Haggerty
  2013-12-16 19:04                             ` Jeff King
  2 siblings, 1 reply; 36+ messages in thread
From: Michael Haggerty @ 2013-12-16  7:41 UTC (permalink / raw)
  To: Jeff King; +Cc: Shawn Pearce, Git Mailing List

On 12/05/2013 09:28 PM, Jeff King wrote:
> [...]
> This patch simply uses the pack trailer sha1 as the pack
> name. It should be similarly unique, but covers the exact
> representation of the objects. Other parts of git should not
> care, as the pack name is returned by pack-objects and is
> essentially opaque.
> [...]

Peff,

The old naming scheme is documented in
Documentation/git-pack-objects.txt, under "OPTIONS" -> "base-name":

> base-name::
> 	Write into a pair of files (.pack and .idx), using
> 	<base-name> to determine the name of the created file.
> 	When this option is used, the two files are written in
> 	<base-name>-<SHA-1>.{pack,idx} files.  <SHA-1> is a hash
> 	of the sorted object names to make the resulting filename
> 	based on the pack content, and written to the standard
> 	output of the command.

The documentation should either be updated or the description of the
naming scheme should be removed altogether.

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] pack-objects: name pack files after trailer hash
  2013-12-16  7:41                           ` Michael Haggerty
@ 2013-12-16 19:04                             ` Jeff King
  2013-12-16 19:19                               ` Jonathan Nieder
  2013-12-16 19:33                               ` Junio C Hamano
  0 siblings, 2 replies; 36+ messages in thread
From: Jeff King @ 2013-12-16 19:04 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: Junio C Hamano, Shawn Pearce, Git Mailing List

On Mon, Dec 16, 2013 at 08:41:38AM +0100, Michael Haggerty wrote:

> The old naming scheme is documented in
> Documentation/git-pack-objects.txt, under "OPTIONS" -> "base-name":
> 
> > base-name::
> > 	Write into a pair of files (.pack and .idx), using
> > 	<base-name> to determine the name of the created file.
> > 	When this option is used, the two files are written in
> > 	<base-name>-<SHA-1>.{pack,idx} files.  <SHA-1> is a hash
> > 	of the sorted object names to make the resulting filename
> > 	based on the pack content, and written to the standard
> > 	output of the command.
> 
> The documentation should either be updated or the description of the
> naming scheme should be removed altogether.

Thanks. I looked in Documentation/technical for anything to update, but
didn't imagine we would be advertising the format in the user-facing
documentation. :)

The original patch is in next, so here's one on top. I just updated the
description. I was tempted to explicitly say something like "this is
opaque and meaningless to you, don't rely on it", but I don't know that
there is any need.

-- >8 --
Subject: docs: update pack-objects "base-name" description

As of 1190a1a, the SHA-1 used to determine the filename is
now calculated differently. Update the documentation to
reflect this.

Noticed-by: Michael Haggerty <mhagger@alum.mit.edu>
Signed-off-by: Jeff King <peff@peff.net>
---
On top of jk/name-pack-after-byte-representations, naturally.

 Documentation/git-pack-objects.txt | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
index d94edcd..c69affc 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -51,8 +51,7 @@ base-name::
 	<base-name> to determine the name of the created file.
 	When this option is used, the two files are written in
 	<base-name>-<SHA-1>.{pack,idx} files.  <SHA-1> is a hash
-	of the sorted object names to make the resulting filename
-	based on the pack content, and written to the standard
+	of the bytes of the packfile, and is written to the standard
 	output of the command.
 
 --stdout::
-- 
1.8.5.524.g6743da6

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH] pack-objects: name pack files after trailer hash
  2013-12-16 19:04                             ` Jeff King
@ 2013-12-16 19:19                               ` Jonathan Nieder
  2013-12-16 19:28                                 ` Jeff King
  2013-12-16 19:33                               ` Junio C Hamano
  1 sibling, 1 reply; 36+ messages in thread
From: Jonathan Nieder @ 2013-12-16 19:19 UTC (permalink / raw)
  To: Jeff King
  Cc: Michael Haggerty, Junio C Hamano, Shawn Pearce, Git Mailing List

Jeff King wrote:

> The original patch is in next, so here's one on top. I just updated the
> description.

Thanks.

>              I was tempted to explicitly say something like "this is
> opaque and meaningless to you, don't rely on it", but I don't know that
> there is any need.
[...]
> On top of jk/name-pack-after-byte-representations, naturally.

I think there is --- if someone starts caring about the SHA-1 used,
they won't be able to act on old packfiles that were created before
this change.  How about something like the following instead?

-- >8 --
From: Jeff King <peff@peff.net>
Subject: pack-objects doc: treat output filename as opaque

After 1190a1a (pack-objects: name pack files after trailer hash,
2013-12-05), the SHA-1 used to determine the filename is calculated
differently.  Update the documentation to not guarantee anything more
than that the SHA-1 depends on the pack content somehow.

Hopefully this will discourage readers from depending on the old or
the new calculation.

Reported-by: Michael Haggerty <mhagger@alum.mit.edu>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
---
 Documentation/git-pack-objects.txt | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
index d94edcd..cdab9ed 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -51,8 +51,7 @@ base-name::
 	<base-name> to determine the name of the created file.
 	When this option is used, the two files are written in
 	<base-name>-<SHA-1>.{pack,idx} files.  <SHA-1> is a hash
-	of the sorted object names to make the resulting filename
-	based on the pack content, and written to the standard
+	based on the pack content and is written to the standard
 	output of the command.
 
 --stdout::
-- 
1.8.5.1

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH] pack-objects: name pack files after trailer hash
  2013-12-16 19:19                               ` Jonathan Nieder
@ 2013-12-16 19:28                                 ` Jeff King
  2013-12-16 19:37                                   ` Junio C Hamano
  0 siblings, 1 reply; 36+ messages in thread
From: Jeff King @ 2013-12-16 19:28 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Michael Haggerty, Junio C Hamano, Shawn Pearce, Git Mailing List

On Mon, Dec 16, 2013 at 11:19:33AM -0800, Jonathan Nieder wrote:

> >              I was tempted to explicitly say something like "this is
> > opaque and meaningless to you, don't rely on it", but I don't know that
> > there is any need.
> [...]
> > On top of jk/name-pack-after-byte-representations, naturally.
> 
> I think there is --- if someone starts caring about the SHA-1 used,
> they won't be able to act on old packfiles that were created before
> this change.  How about something like the following instead?

Right, my point was that I do not think anybody has ever cared, and I do
not see them starting now. But that is just my intuition.

> diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
> index d94edcd..cdab9ed 100644
> --- a/Documentation/git-pack-objects.txt
> +++ b/Documentation/git-pack-objects.txt
> @@ -51,8 +51,7 @@ base-name::
>  	<base-name> to determine the name of the created file.
>  	When this option is used, the two files are written in
>  	<base-name>-<SHA-1>.{pack,idx} files.  <SHA-1> is a hash
> -	of the sorted object names to make the resulting filename
> -	based on the pack content, and written to the standard
> +	based on the pack content and is written to the standard

I'm fine with that. I was worried it would get clunky, but the way you
have worded it is good.

-Peff

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] pack-objects: name pack files after trailer hash
  2013-12-16 19:04                             ` Jeff King
  2013-12-16 19:19                               ` Jonathan Nieder
@ 2013-12-16 19:33                               ` Junio C Hamano
  2013-12-16 19:35                                 ` Jeff King
  1 sibling, 1 reply; 36+ messages in thread
From: Junio C Hamano @ 2013-12-16 19:33 UTC (permalink / raw)
  To: Jeff King; +Cc: Michael Haggerty, Shawn Pearce, Git Mailing List

Jeff King <peff@peff.net> writes:

> I was tempted to explicitly say something like "this is
> opaque and meaningless to you, don't rely on it", but I don't know that
> there is any need.

Thanks.

When we did the original naming, it was envisioned that we may use
the name for fsck to make sure that the pack contains what its name
claims, but it never materialized.  The most prominent
and useful characteristic of the new naming scheme is that two
packfiles with the same name must be identical, and we may want to
start using it some time later, once everybody has repacked their
packs with the updated pack-objects.

But until that time comes, some packs in existing repositories will
hash to their names while others will not, so spelling out how the new
names are derived without saying older pack-objects used to name
their output differently may add more confusion than it is worth.

>  	<base-name> to determine the name of the created file.
>  	When this option is used, the two files are written in
>  	<base-name>-<SHA-1>.{pack,idx} files.  <SHA-1> is a hash
> +	of the bytes of the packfile, and is written to the standard

"hash of the bytes of the packfile" tempts one to do

    $ sha1sum .git/objects/pack/pack-*.pack

but that is not what we expect. I wonder if there are better ways to
phrase it (or alternatively perhaps we want to make that expectation
hold by updating our code to hash the whole file)?
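
Concretely, the name is the hash of everything _except_ the final 20
bytes, which in turn store that same hash (GNU head shown here):

  $ p=.git/objects/pack/pack-$name.pack
  $ head -c -20 "$p" | sha1sum   # prints $name
  $ tail -c 20 "$p" | xxd -p     # same value, read from the trailer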

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] pack-objects: name pack files after trailer hash
  2013-12-16 19:33                               ` Junio C Hamano
@ 2013-12-16 19:35                                 ` Jeff King
  0 siblings, 0 replies; 36+ messages in thread
From: Jeff King @ 2013-12-16 19:35 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Michael Haggerty, Shawn Pearce, Git Mailing List

On Mon, Dec 16, 2013 at 11:33:11AM -0800, Junio C Hamano wrote:

> >  	<base-name> to determine the name of the created file.
> >  	When this option is used, the two files are written in
> >  	<base-name>-<SHA-1>.{pack,idx} files.  <SHA-1> is a hash
> > +	of the bytes of the packfile, and is written to the standard
> 
> "hash of the bytes of the packfile" tempts one to do
> 
>     $ sha1sum .git/objects/pack/pack-*.pack
> 
> but that is not what we expect. I wonder if there are better ways to
> phrase it (or alternatively perhaps we want to make that expectation
> hold by updating our code to hash)?

Yeah, I wondered about that, but didn't think it was worth the verbosity
to explain the true derivation. I think Jonathan's suggestion takes
care of it, though.

-Peff

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] pack-objects: name pack files after trailer hash
  2013-12-16 19:28                                 ` Jeff King
@ 2013-12-16 19:37                                   ` Junio C Hamano
  0 siblings, 0 replies; 36+ messages in thread
From: Junio C Hamano @ 2013-12-16 19:37 UTC (permalink / raw)
  To: Jeff King
  Cc: Jonathan Nieder, Michael Haggerty, Shawn Pearce, Git Mailing List

Jeff King <peff@peff.net> writes:

> On Mon, Dec 16, 2013 at 11:19:33AM -0800, Jonathan Nieder wrote:
>
>> >              I was tempted to explicitly say something like "this is
>> > opaque and meaningless to you, don't rely on it", but I don't know that
>> > there is any need.
>> [...]
>> > On top of jk/name-pack-after-byte-representations, naturally.
>> 
>> I think there is --- if someone starts caring about the SHA-1 used,
>> they won't be able to act on old packfiles that were created before
>> this change.  How about something like the following instead?
>
> Right, my point was that I do not think anybody has ever cared, and I do
> not see them starting now. But that is just my intuition.
>
>> diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
>> index d94edcd..cdab9ed 100644
>> --- a/Documentation/git-pack-objects.txt
>> +++ b/Documentation/git-pack-objects.txt
>> @@ -51,8 +51,7 @@ base-name::
>>  	<base-name> to determine the name of the created file.
>>  	When this option is used, the two files are written in
>>  	<base-name>-<SHA-1>.{pack,idx} files.  <SHA-1> is a hash
>> -	of the sorted object names to make the resulting filename
>> -	based on the pack content, and written to the standard
>> +	based on the pack content and is written to the standard
>
> I'm fine with that. I was worried it would get clunky, but the way you
> have worded it is good.

Our mails crossed; I think the above is good.

Thanks.

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2013-12-16 19:37 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-11-28  3:13 How to resume broke clone ? zhifeng hu
2013-11-28  7:39 ` Trần Ngọc Quân
2013-11-28  7:41   ` zhifeng hu
2013-11-28  8:14     ` Duy Nguyen
2013-11-28  8:35       ` Karsten Blees
2013-11-28  8:50         ` Duy Nguyen
2013-11-28  8:55           ` zhifeng hu
2013-11-28  9:09             ` Duy Nguyen
2013-11-28  9:29               ` Jeff King
2013-11-28 10:17                 ` Duy Nguyen
2013-11-28 19:15                 ` Shawn Pearce
2013-12-04 20:08                   ` Jeff King
2013-12-05  6:50                     ` Shawn Pearce
2013-12-05 13:21                       ` Michael Haggerty
2013-12-05 15:11                         ` Shawn Pearce
2013-12-05 16:12                         ` Jeff King
2013-12-05 16:04                       ` Jeff King
2013-12-05 18:01                         ` Junio C Hamano
2013-12-05 19:08                           ` Jeff King
2013-12-05 20:28                         ` [PATCH] pack-objects: name pack files after trailer hash Jeff King
2013-12-05 21:56                           ` Shawn Pearce
2013-12-05 22:59                           ` Junio C Hamano
2013-12-06 22:18                             ` Jeff King
2013-12-16  7:41                           ` Michael Haggerty
2013-12-16 19:04                             ` Jeff King
2013-12-16 19:19                               ` Jonathan Nieder
2013-12-16 19:28                                 ` Jeff King
2013-12-16 19:37                                   ` Junio C Hamano
2013-12-16 19:33                               ` Junio C Hamano
2013-12-16 19:35                                 ` Jeff King
2013-11-28  9:20       ` How to resume broke clone ? Tay Ray Chuan
2013-11-28  9:29         ` zhifeng hu
2013-11-28 19:35           ` Shawn Pearce
2013-11-28 21:54           ` Jakub Narebski
  -- strict thread matches above, loose matches on Subject: below --
2013-11-28  8:32 Max Kirillov
2013-11-28  9:12 ` Jeff King

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).