From: "Dana How" <danahow@gmail.com>
To: "Shawn O. Pearce" <spearce@spearce.org>
Cc: "Nicolas Pitre" <nico@cam.org>, "Junio C Hamano" <junkio@cox.net>,
git@vger.kernel.org, danahow@gmail.com
Subject: Re: [PATCH 1/3] Lazily open pack index files on demand
Date: Sun, 27 May 2007 18:35:47 -0700
Message-ID: <56b7f5510705271835m5a375324p3a908fe766fdf902@mail.gmail.com>
In-Reply-To: <20070527213525.GC28023@spearce.org>
On 5/27/07, Shawn O. Pearce <spearce@spearce.org> wrote:
> Nicolas Pitre <nico@cam.org> wrote:
> > On Sat, 26 May 2007, Dana How wrote:
> > > On 5/26/07, Shawn O. Pearce <spearce@spearce.org> wrote:
> > > > In pack v4 we're likely to move the SHA-1 table from the .idx file
> > > > into the front of the .pack file. This makes the .idx file hold
> > > > only the offsets and the CRC checksums of each object. If we start
> > > > making a super index, we have to duplicate the SHA-1 table twice
> > > > (once in the .pack, again in the super index).
> > > Hmm, hopefully the SHA-1 table can go at the _end_
> > > since with split packs that's the only time we know the number
> > > of objects in the pack... ;-)
> > Hmmm good point to consider.
> The problem with putting the SHA-1 table at the end of the pack is
> it ruins the streaming for both unpack-objects and index-pack if
> we were to ever use pack v4 as a transport format. Or just try
> to run a pack v4 packfile through unpack-objects, just locally,
> say to extract megablobs. ;-)
Perhaps I have stumbled on another issue related to
including SHA-1s in packfiles. This is completely independent
of the handling of megablobs (my current focus), but the presence
of large blobs does make this issue more apparent.
Some history of what I've been doing with git:
First I simply had to import the repo,
which led to split packs (this was before index v2).
Then maintaining the repo led to the unfinished maxblobsize stuff.
Distributing the repo included users pulling (usually) from the central repo,
which would be trivial since it was also an alternate.
Local repacking would avoid heavy load on it.
Now I've started looking into how to push back into the
central repo from a user's repo (not everything will be central;
some pulling between users will occur too,
otherwise I wouldn't be as interested).
It looks like the entire sequence is:
A. git add file [compute SHA-1 & compress file into objects/xx]
B. git commit [write some small objects locally]
C. git push {using PROTO_LOCAL}:
1. read & uncompress objects
2. recompress objects into a pack and send through a pipe
3. read pack on other end of pipe and uncompress each object
4. compute SHA-1 for each object and compress file into objects/xx
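As a concrete illustration of step A, here is a minimal sketch of the
hash-then-deflate order git uses for loose objects (the header layout matches
git's documented format; the helper name and return value are my own, not
git's actual code):

```python
import hashlib
import zlib

def write_loose_object(data: bytes, obj_type: str = "blob"):
    """Sketch of step A: git hashes the *uncompressed* header+payload,
    then deflates it into objects/xx/... . Because the SHA-1 covers the
    uncompressed bytes, any later verification (steps C3/C4) must
    inflate the stream again before it can recompute the hash."""
    payload = obj_type.encode() + b" " + str(len(data)).encode() + b"\x00" + data
    sha1 = hashlib.sha1(payload).hexdigest()
    path = "objects/%s/%s" % (sha1[:2], sha1[2:])  # fan-out directory layout
    return sha1, path, zlib.compress(payload)

sha1, path, compressed = write_loose_object(b"hello\n")
# "hello\n" hashes to ce013625030ba8dba906f756967f9e9ca394464a as a blob
```

The key point for the argument below is the ordering: the hash is over the
inflated bytes, so the compressed form alone is never enough to verify an
object.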
So, after creating an object in the local working tree,
to get it into the central repo, we must:
compress -> uncompress -> compress -> uncompress -> compress.
In responsiveness this won't compare very well to Perforce,
which has only one compress step.
The sequence above can differ somewhat in current git.
The user might have repacked their repo before pushing,
but that just moves C1 and C2 earlier in time;
it doesn't remove the need for them. Besides, the blobs in
a push are more likely to be recent and hence still unpacked.
Also, C3 and C4 might not happen if more than 100 blobs get pushed.
But this seems very unusual; only 0.3% of commits in the history
had 100+ new files/file contents. And if the 100-object threshold
is lowered, the central repo fills up with packfiles and their index files,
reducing performance for everybody using the central repo as an alternate.
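(For reference, that 100-object cutoff is git's unpack limit; assuming a git
recent enough to have these config options, it can be tuned on the receiving
repo:)

```shell
# receive-pack explodes small pushes into loose objects (incurring C3/C4)
# and keeps larger pushes as a single packfile; 100 objects is the default.
git config receive.unpackLimit 100
# transfer.unpackLimit sets the same knob for both fetch and receive.
```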
Thus there really is 5X more compression activity going on
compared to Perforce. How can this be reduced?
One way is to restore the ability to write the "new" loose object format.
Then C1, C2, and C4 disappear. C3 must remain because we need
to uncompress the object to compute its SHA-1; we don't need
to recompress since we were already given the compressed form.
And that final sentence is why I sent this email: if the packfile
contained the SHA-1s, either at the beginning or before each object,
then they wouldn't need to be recomputed at the receiving end
and the extra decompression could be skipped as well. This would
make the total zlib effort the same as Perforce.
The property that a loose object is never overwritten would still hold.
Is that sufficient security? Or does the SHA-1 always need to be
recomputed on the receiving end? Could that be skipped just for
specific connections and/or protocols (presumably "trusted" ones)?
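The check in question is cheap to sketch (hypothetical helper, not actual git
code; it assumes the sender shipped both the claimed SHA-1 and the deflated
bytes):

```python
import hashlib
import zlib

def accept_object(compressed: bytes, claimed_sha1: str) -> bytes:
    """One inflate to recompute and verify the SHA-1; if that check is
    trusted away (per the question above), even this inflate could go.
    Either way the sender's compressed bytes are stored verbatim, so
    no recompression is ever needed on the receiving end."""
    payload = zlib.decompress(compressed)
    if hashlib.sha1(payload).hexdigest() != claimed_sha1:
        raise ValueError("SHA-1 mismatch; refusing object")
    return compressed  # write these exact bytes under objects/xx/

blob = zlib.compress(b"blob 6\x00hello\n")
stored = accept_object(blob, "ce013625030ba8dba906f756967f9e9ca394464a")
```

Note the asymmetry: the verify path costs one inflate, while today's path
(C3 plus C4) costs an inflate, a hash, and a fresh deflate.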
Note that none of this depends on how megablobs are handled,
but large blobs certainly make the 5X zlib activity more apparent.
Shawn: I think you mentioned something related to this a few days ago.
Also, you didn't like split packs putting SHA-1s at the end because
that messed up streaming for transport, but packs are not split for transport.
Thanks,
--
Dana L. How danahow@gmail.com +1 650 804 5991 cell