From: Nicolas Pitre <nico@cam.org>
To: Dana How <danahow@gmail.com>
Cc: Junio C Hamano <junkio@cox.net>, Git Mailing List <git@vger.kernel.org>
Subject: Re: [PATCH] git-{repack,pack-objects} accept --{pack,blob}-limit to control pack size
Date: Wed, 04 Apr 2007 23:17:45 -0400 (EDT)
Message-ID: <alpine.LFD.0.98.0704042203160.28181@xanadu.home>
In-Reply-To: <56b7f5510704041555q4e735961ra9ee8008be0d33db@mail.gmail.com>
On Wed, 4 Apr 2007, Dana How wrote:
> On 4/4/07, Nicolas Pitre <nico@cam.org> wrote:
> > On Wed, 4 Apr 2007, Dana How wrote:
> > > The motivations are to better support portable media,
> > > older filesystems, and larger repositories without
> > > awkward enormous packfiles.
> >
> > I wouldn't qualify "enormous" pack files as "awkward".
> >
> > It will always be more efficient to have only one pack to deal with
> > (when possible of course).
> Yes. "(when possible of course)" refers to the remaining motivations
> I didn't explicitly mention: the 32b offset limit in .idx files,
> and keeping the mmap code working on a 32b system.
> I realize there are better solutions in the pipeline,
> but I'd like to address this now (for my own use) and hopefully
> also create something useful for 4GB-limited filesystems,
> USB sticks, etc.
I think this is a valid feature to have, no problem there.
> > > When --pack-limit[=N] is specified and --stdout is not,
> > > all bytes in the resulting packfile(s) appear at offsets
> > > less than N (which defaults to 1<<31). The default
> > > guarantees mmap(2) on 32b systems never sees signed off_t's.
> > > The object stream may be broken into multiple packfiles
> > > as a result, each properly and conventionally built.
> > >
> >
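[The splitting rule quoted above amounts to something like the following sketch. The names (default_offset_limit, pack_would_overflow) are illustrative only, not taken from the actual patch.]

```c
#include <stdint.h>

/* Default limit: 1<<31, so every byte offset in the pack still fits
 * in a positive 32-bit off_t when the file is mmap(2)'d. */
static const uint64_t default_offset_limit = (uint64_t)1 << 31;

/* Nonzero if appending `size` bytes at `offset` would place some byte
 * at or beyond the limit, i.e. the writer should close the current
 * pack and start a new one before writing this object. */
static int pack_would_overflow(uint64_t offset, uint64_t size,
                               uint64_t limit)
{
    return offset + size > limit;
}
```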
> > This sounds fine. *However* how do you ensure that the second pack (or
> > subsequent packs) is self contained with regards to delta base objects
> > when it is _not_ meant to be a thin pack?
> Good question. Search for "int usable_delta" in the patch.
> With --pack-limit (offset_limit in C), you can use a delta if the base
> is in the same pack and already written out. The first condition
> addresses your concern, and the second handles the case
> where the base object gets pushed to the next pack.
> These restrictions should be loosened for --thin-pack
> but I didn't do that yet.
OK.
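[A minimal sketch of the usable_delta rule described above: with an offset limit active, a delta is only usable when its base object landed in the same split pack and has already been written out. The struct and function names here are hypothetical, not copied from the patch.]

```c
struct pack_entry_sketch {
    unsigned pack_id;   /* which of the split packs the object went into */
    int written;        /* has the object been emitted yet? */
};

static int usable_delta(const struct pack_entry_sketch *obj,
                        const struct pack_entry_sketch *base,
                        int offset_limit_active)
{
    if (!offset_limit_active)
        return 1;  /* single output pack: any in-pack base will do */
    /* base->written rules out a base that got deferred to the next
     * pack; matching pack_id keeps each pack self-contained. */
    return base->written && base->pack_id == obj->pack_id;
}
```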
> Also, --pack-limit turns on --no-reuse-delta.
> This is not necessary, but not doing it would have meant
> hacking up even more conditions which I didn't want to do
> on a newbie submission.
Thing is, delta reuse is one of the most important features for good
performance, so it has to work.
But let's take a moment to talk about your "newbie submission". This
patch is _way_ too large and covers too many things at once. You'll
have to split it into many smaller logical units, pretty please. For
example, if you have to borrow code from fast-import, then make a patch
that performs that code movement _only_. Then you can have another
patch that moves code around within builtin-pack-objects.c without
changing any functionality, just to prepare the way for the next patch,
which would concentrate on the new feature only. Then make sure the
addition of the new capability is separate from the patch that lets
other parts of Git use it. Etc.
It would be much easier for us to comment on smaller patches, and
especially easier for you to rework one of the smaller patches than the
whole big one if need be.
I looked at your patch and there are things I like and other things I
don't like at all. Because it is a single large patch, I can only NAK
the whole of it.
> > > When --stdout is also specified, all objects in the
> >
> > Please scrap that. There is simply no point making --pack-limit and
> > --stdout work together. If the amount of data to send over the GIT
> > protocol exceeds 4G (or whatever) it is the receiving end's business to
> > split it up _if_ it wants/has to. The alternative is just too ugly.
> I have a similar but much weaker reaction, but Linus specifically asked for
> this combination to work.
Linus is a total incompetent who knows nothing about programming or
good taste. So never ever listen to what he says. He is wrong, always.
And in this case he's more wrong than usual.
;-)
> So I made it work as well as possible
> given no seeking.
The fact is that there is no point in imposing split packs on the
receiving end if it can accept a single pack just fine. Pack layout is,
and must remain, a local policy, not something the remote end decided
was good for the peer. This is already true for the threshold value
that decides whether fetched packs are kept as-is or unpacked into
loose objects; the same goes for a pack size limit.
And as you discovered yourself, it is quite messy to implement in
pack-objects when the output is a stream, because you don't know in
advance how many objects will be sent, so you have to invent special
markers to indicate the end of a pack. This in turn prevents
index-pack from knowing how many objects to expect and from providing
any progress report. The appropriate place to split packs in a fetch
context is really within index-pack, not pack-objects. And it is
probably much easier to do in index-pack anyway.
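[The streaming problem comes down to the pack header committing to the object count up front: when a split is decided mid-stream, the count for the later packs is not yet known when their headers would have to be sent. The layout below follows the on-disk pack format of the time; pack_header_sketch and fill_pack_header are illustrative names, not git's own.]

```c
#include <arpa/inet.h>  /* htonl/ntohl */
#include <stdint.h>
#include <string.h>

struct pack_header_sketch {
    char     signature[4];  /* "PACK" */
    uint32_t version;       /* network byte order; version 2 back then */
    uint32_t nr_objects;    /* network byte order, fixed at write time */
};

/* The object count must be known before the first object is written. */
static void fill_pack_header(struct pack_header_sketch *hdr,
                             uint32_t nr_objects)
{
    memcpy(hdr->signature, "PACK", 4);
    hdr->version = htonl(2);
    hdr->nr_objects = htonl(nr_objects);
}
```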
> > > When --blob-limit=N is specified, blobs whose uncompressed
> > > size is greater than or equal to N are omitted from the pack(s).
> > > If --pack-limit is specified, --blob-limit is not, and
> > > --stdout is not, then --blob-limit defaults to 1/4
> > > of the --pack-limit.
> > Is this really useful?
> >
> > If you have a pack size limit and a blob cannot make it even in a
> > pack of its own then you're screwed anyway. It is much better to
> > simply fail the operation than leaving some blobs behind. IOW I
> > don't see the usefulness of this feature.
> I agree if --stdout is specified. This is why --pack-limit && --stdout
> DON'T turn on --blob-limit if not specified.
Let's forget about --stdout. I hope I convinced you that it doesn't
make sense.
> However, if I'm building packs inside a non-(web-)published
> repository, I find this useful. First of all, if there's some blob bigger
> than the --pack-limit I must drop it anyway -- it's not clear to me that
> the mmap window code works on 32b systems
> with >2GB-sized objects in packs. An "all-or-nothing" limitation
> wouldn't be helpful to me.
But do you realize that if you drop even a single object you are
screwed? Fetching from a repo that is missing objects produces a
corrupt fetch. You cannot just send a bunch of objects and leave a
couple behind in the hope that the peer will never need them. For one
thing, the peer would never be able to run a successful git-fsck.
If you care about your local usage only then this feature is bogus as
well. The blob _has_ to exist as a loose object before ever being
packed. If you have blobs larger than 2GB or 4GB and your filesystem is
unable to cope with that then you're screwed already. You won't be able
to check them out, etc.
> But blobs even close to the packfile limit don't seem all that useful
> to pack either (this of course is a weaker argument).
In the extreme case where you have a blob that is near the pack size
limit, you'll end up with one pack per blob. No problem there. If
you're lucky, you might have 10 big blobs whose sizes are near the
pack size limit, but because they delta well against each other you
might be able to stuff them all into a single pack.
And if a blob is larger than the pack limit then either you should
increase your pack size limit, or if the pack limit is already near the
filesystem capacity for a single file then the blob cannot exist on that
filesystem in the first place.
So you still have to convince me this is a useful feature.
> Packing plays two roles: archive storage (long life) and
> transmission (possibly short life).
Packing is also a _huge_ performance boost. Don't underestimate that.
Nicolas
Thread overview: 12+ messages
2007-04-04 20:40 [PATCH] git-{repack,pack-objects} accept --{pack,blob}-limit to control pack size Dana How
2007-04-04 22:04 ` Nicolas Pitre
2007-04-04 22:55 ` Dana How
2007-04-05 3:17 ` Nicolas Pitre [this message]
2007-04-05 7:15 ` Shawn O. Pearce
2007-04-05 15:52 ` Nicolas Pitre
2007-04-05 6:54 ` Shawn O. Pearce
2007-04-05 15:34 ` Linus Torvalds
2007-04-05 15:53 ` Shawn O. Pearce
2007-04-05 16:21 ` Linus Torvalds
2007-04-05 17:14 ` Shawn O. Pearce
2007-04-05 21:17 ` Florian Weimer