From: A Large Angry SCM <gitzilla@gmail.com>
To: Dana How <danahow@gmail.com>
Cc: "Shawn O. Pearce" <spearce@spearce.org>,
Nicolas Pitre <nico@cam.org>, Junio C Hamano <junkio@cox.net>,
git@vger.kernel.org
Subject: Re: [PATCH 1/3] Lazily open pack index files on demand
Date: Sun, 27 May 2007 22:30:17 -0400
Message-ID: <465A3EB9.7090403@gmail.com>
In-Reply-To: <56b7f5510705271835m5a375324p3a908fe766fdf902@mail.gmail.com>
Dana How wrote:
[...]
>
> Some history of what I've been doing with git:
> First I simply had to import the repo,
> which led to split packs (this was before index v2).
> Then maintaining the repo led to the unfinished maxblobsize stuff.
> Distributing the repo included users pulling (usually) from the
> central repo, which would be trivial since it was also an alternate.
> Local repacking would avoid heavy load on it.
>
> Now I've started looking into how to push back into the
> central repo from a user's repo (not everything will be central;
> some pulling between users will occur
> otherwise I wouldn't be as interested).
>
> It looks like the entire sequence is:
> A. git add file [compute SHA-1 & compress file into objects/xx]
> B. git commit [write some small objects locally]
> C. git push {using PROTO_LOCAL}:
>    1. read & uncompress objects
>    2. recompress objects into a pack and send through a pipe
>    3. read pack on other end of pipe and uncompress each object
>    4. compute SHA-1 for each object and compress file into objects/xx
>
> So, after creating an object in the local working tree,
> to get it into the central repo, we must:
> compress -> uncompress -> compress -> uncompress -> compress.
> In responsiveness this won't compare very well to Perforce,
> which has only one compress step.
>
> The sequence above could be somewhat different currently in git.
> The user might have repacked their repo before pushing,
> but this just moves C1 and C2 back earlier in time,
> it doesn't remove the need for them. Besides, the blobs in
> a push are more likely to be recent and hence unpacked.
>
> Also, C3 and C4 might not happen if more than 100 blobs get pushed.
> But this seems very unusual; only 0.3% of commits in the history
> had 100+ new files/file contents. If the 100 level is reduced,
> then the central repo fills up with packfiles and their index files,
> reducing performance for everybody (using the central repo as an alternate).
>
> Thus there really is 5X more compression activity going on
> compared to Perforce. How can this be reduced?
>
> One way is to restore the ability to write the "new" loose object format.
> Then C1, C2, and C4 disappear. C3 must remain because we need
> to uncompress the object to compute its SHA-1; we don't need
> to recompress since we were already given the compressed form.
>
> And that final sentence is why I sent this email: if the packfile
> contained the SHA-1s, either at the beginning or before each object,
> then they wouldn't need to be recomputed at the receiving end
> and the extra decompression could be skipped as well. This would
> make the total zlib effort the same as Perforce.
>
> The fact that a loose object is never overwritten would still be retained.
> Is that sufficient security? Or does the SHA-1 always need to be
> recomputed on the receiving end? Could that be skipped just for
> specific connections and/or protocols (presumably "trusted" ones)?
[...]
So how do you want to decide when to trust the sender and when to
validate that the objects received have the SHA-1s claimed? A _central_
repository, being authoritative, would need to _always_ validate _all_
objects it receives. And since, with a central repository setup, the
central repository is where the CPU resources are the most in demand,
validating the object IDs when received at the developers' repositories
should not be a problem. And just to be fair, how does Perforce
guarantee that the retrieved version of a file matches what was checked in?
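
To make concrete what "validate" costs the receiver, here is a minimal
sketch in Python (not git's actual C code). It assumes the legacy loose
object layout, i.e. a zlib stream of "<type> <size>\0<content>", and the
function names git_object_id and validate_received are made up for
illustration. The point is that the receiver cannot check a claimed ID
without inflating the data and hashing it again.

```python
import hashlib
import zlib


def git_object_id(obj_type: str, payload: bytes) -> str:
    """SHA-1 over the git object header ("<type> <size>\\0") plus content."""
    header = f"{obj_type} {len(payload)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + payload).hexdigest()


def validate_received(claimed_id: str, compressed: bytes) -> bool:
    """Return True only if the compressed object hashes to claimed_id.

    Hypothetical helper: inflates a legacy-format loose object, checks
    the size recorded in its header, then recomputes the SHA-1.
    """
    raw = zlib.decompress(compressed)           # the uncompress in step C3
    header, _, payload = raw.partition(b"\0")
    obj_type, _, size = header.decode("ascii").partition(" ")
    if int(size) != len(payload):               # header disagrees with data
        return False
    return git_object_id(obj_type, payload) == claimed_id  # hash in step C4
```

A receiver that trusted a sender-supplied ID would skip both calls, which
is exactly the saving being proposed, and exactly the check a central
repository would be giving up.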