From: "Dana How" <danahow@gmail.com>
To: "Nicolas Pitre" <nico@cam.org>
Cc: "Junio C Hamano" <junkio@cox.net>,
"Git Mailing List" <git@vger.kernel.org>,
danahow@gmail.com
Subject: Re: [PATCH] git-{repack,pack-objects} accept --{pack,blob}-limit to control pack size
Date: Wed, 4 Apr 2007 15:55:18 -0700
Message-ID: <56b7f5510704041555q4e735961ra9ee8008be0d33db@mail.gmail.com>
In-Reply-To: <alpine.LFD.0.98.0704041750030.28181@xanadu.home>
On 4/4/07, Nicolas Pitre <nico@cam.org> wrote:
> On Wed, 4 Apr 2007, Dana How wrote:
> > The motivations are to better support portable media,
> > older filesystems, and larger repositories without
> > awkward enormous packfiles.
>
> I wouldn't qualify "enormous" pack files as "awkward".
>
> It will always be more efficient to have only one pack to deal with
> (when possible of course).
Yes. "(when possible of course)" refers to the remaining motivations
I didn't explicitly mention: the 32-bit offset limit in .idx files,
and keeping the mmap code working on a 32-bit system.
I realize there are better solutions in the pipeline,
but I'd like to address this now (for my own use) and hopefully
also create something useful for 4GB-limited filesystems,
USB sticks, etc.
> > When --pack-limit[=N] is specified and --stdout is not,
> > all bytes in the resulting packfile(s) appear at offsets
> > less than N (which defaults to 1<<31). The default
> > guarantees mmap(2) on 32b systems never sees signed off_t's.
> > The object stream may be broken into multiple packfiles
> > as a result, each properly and conventionally built.
> >
>
> This sounds fine. *However* how do you ensure that the second pack (or
> subsequent packs) is self contained with regards to delta base objects
> when it is _not_ meant to be a thin pack?
Good question. Search for "int usable_delta" in the patch.
With --pack-limit (offset_limit in C), you can use a delta if the base
is in the same pack and already written out. The first condition
addresses your concern, and the second handles the case
where the base object gets pushed to the next pack.
These restrictions should be loosened for --thin-pack
but I didn't do that yet.
Also, --pack-limit turns on --no-reuse-delta.
This is not necessary, but not doing it would have meant
hacking up even more conditions which I didn't want to do
on a newbie submission.
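
The rule described above can be sketched roughly as follows. This is an
illustrative Python model, not the C code from the patch: the Entry class,
its written_in field, and this usable_delta signature are hypothetical
names chosen for the sketch. It only models the --pack-limit case and
ignores the looser rules that thin packs would allow.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    """Hypothetical stand-in for one object queued for packing."""
    name: str
    # Index of the pack this object was written to, or None if it has
    # not been written out yet.
    written_in: Optional[int] = None


def usable_delta(base: Entry, current_pack: int) -> bool:
    """Return True if `base` may serve as a delta base in `current_pack`.

    Both conditions from the message collapse into one comparison:
    the base must already be written (written_in is not None) and it
    must live in the pack currently being written.
    """
    return base.written_in == current_pack


# A base written to pack 0 is usable only while pack 0 is being written.
base = Entry("base", written_in=0)
print(usable_delta(base, 0))
print(usable_delta(base, 1))
```

An unwritten base (written_in left as None) is rejected as well, which
covers the case where the base gets pushed to the next pack.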
> > When --stdout is also specified, all objects in the
> > resulting packfile(s) _start_ at offsets less than N.
> > All the packfiles appear concatenated on stdout,
> > and each has its object count set to 0. The behavior
> > without --stdout cannot be duplicated here since
> > lseek(2) is not generally possible on stdout.
>
> Please scrap that. There is simply no point making --pack-limit and
> --stdout work together. If the amount of data to send over the GIT
> protocol exceeds 4G (or whatever) it is the receiving end's business to
> split it up _if_ it wants/has to. The alternative is just too ugly.
I have a similar but much weaker reaction, but Linus specifically asked for
this combination to work. So I made it work as well as possible
given no seeking.
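
To make the difference between the two placement rules concrete, here is a
small Python sketch (not the patch's C code). It assumes objects are
described only by their on-disk sizes and ignores the pack header and
trailer; split_into_packs and its parameters are hypothetical names. In
file mode every byte must sit below the limit; in --stdout mode only the
start of each object must.

```python
DEFAULT_PACK_LIMIT = 1 << 31  # default N given in the patch description


def split_into_packs(sizes, limit=DEFAULT_PACK_LIMIT, to_stdout=False):
    """Group object sizes into packs under the chosen placement rule.

    Simplification: header and trailer bytes are not counted.
    """
    packs = [[]]
    offset = 0
    for size in sizes:
        if to_stdout:
            # Only the object's first byte must be at an offset < limit.
            fits = offset < limit
        else:
            if size > limit:
                # This sketch treats an object that cannot fit in any
                # pack as an error.
                raise ValueError("object larger than the pack limit")
            # Last byte is at offset + size - 1, which must be < limit.
            fits = offset + size <= limit
        if not fits:
            packs.append([])
            offset = 0
        packs[-1].append(size)
        offset += size
    return packs


print(split_into_packs([6, 6, 6], limit=10))
print(split_into_packs([6, 6, 6], limit=10, to_stdout=True))
```

With a limit of 10 and three 6-byte objects, file mode yields three packs,
while --stdout mode lets the second object start at offset 6 and run past
the limit, so only two packs result.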
> > When --blob-limit=N is specified, blobs whose uncompressed
> > size is greater than or equal to N are omitted from the pack(s).
> > If --pack-limit is specified, --blob-limit is not, and
> > --stdout is not, then --blob-limit defaults to 1/4
> > of the --pack-limit.
>
> Is this really useful?
>
> If you have a pack size limit and a blob cannot make it even in a pack
> of its own then you're screwed anyway. It is much better to simply fail
> the operation than leaving some blobs behind. IOW I don't see the
> usefulness of this feature.
I agree if --stdout is specified. This is why --pack-limit combined with
--stdout does NOT turn on --blob-limit unless it is given explicitly.
However, if I'm building packs inside a non-(web-)published
repository, I find this useful. First of all, if there's some blob bigger
than the --pack-limit I must drop it anyway -- it's not clear to me that
the mmap window code works on 32-bit systems
with >2GB-sized objects in packs. An "all-or-nothing" limitation
wouldn't be helpful to me.
But blobs even close to the packfile limit don't seem all that useful
to pack either (this of course is a weaker argument).
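
The --blob-limit defaulting described in this thread can be summarized in a
few lines. Again this is a Python sketch rather than the patch itself, and
effective_blob_limit and keep_blob are hypothetical helper names; None
stands for "no limit".

```python
def effective_blob_limit(pack_limit=None, blob_limit=None, to_stdout=False):
    """Return the blob limit in effect, or None if blobs are unrestricted."""
    if blob_limit is not None:
        return blob_limit          # an explicit --blob-limit always wins
    if pack_limit is not None and not to_stdout:
        return pack_limit // 4     # default: 1/4 of --pack-limit
    return None                    # --stdout, or no --pack-limit at all


def keep_blob(uncompressed_size, blob_limit):
    """Blobs whose uncompressed size is >= the limit are omitted."""
    return blob_limit is None or uncompressed_size < blob_limit


limit = effective_blob_limit(pack_limit=1 << 31)
print(limit)
print(keep_blob(500 * 1024 * 1024, limit))
```

With the default pack limit of 1<<31, the implied blob limit is 1<<29
(512MB), so a blob of exactly 500MB uncompressed would still be packed
under this rule.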
In the sample (p4) checkout I'm testing on [i.e. no history],
I have 56K+ objects consuming ~55GB uncompressed;
there are 9 blobs over 500MB each uncompressed.
I'm guessing packing them is not a performance advantage,
and I certainly wouldn't want frequently-used objects to be
stuck between them. [ I guess my repo stats are going to
be a bit strange ;-) ]
Packing plays two roles: archive storage (long life) and
transmission (possibly short life).
These seem to pull the packing code in different directions.
Thanks,
--
Dana L. How danahow@gmail.com +1 650 804 5991 cell