From: "Dana How" <danahow@gmail.com>
To: "Geert Bosch" <bosch@adacore.com>
Cc: "Nicolas Pitre" <nico@cam.org>,
"Andi Kleen" <andi@firstfloor.org>,
"Ken Pratt" <ken@kenpratt.net>,
"Shawn O. Pearce" <spearce@spearce.org>,
git@vger.kernel.org, danahow@gmail.com
Subject: Re: pack operation is thrashing my server
Date: Wed, 13 Aug 2008 10:13:00 -0700
Message-ID: <56b7f5510808131013t4edfd31ar195177c82a91f93e@mail.gmail.com>
In-Reply-To: <3E057C8D-FF72-47A2-BBA8-27A22AD67167@adacore.com>
Hi Geert,
I wrote the blob-size-threshold patch last year to which
Jakub Narebski referred.
I think there will eventually be a way to better handle large
objects in Git. Some possible elements:
* Loose objects have a format which can be streamed
directly into or out of packs, avoiding a round-trip through zlib --
a big deal for big objects. This was the effect of the "new"
loose object format to which Shawn referred; it was apparently
removed because it was ugly and/or difficult to maintain, a
decision I never understood since I wasn't personally affected.
* Loose objects actually _are_ singleton packs, but saved
in .git/objects/xx. Workable, and it takes advantage of the
existing pack-to-pack streaming, but it would probably never
happen because of the extra pack header prepended to each object.
* Large loose objects are never deltified and/or never packed.
The latter was the focus of my patch.
* Large loose objects are placed in their own packs in .git/packs .
Doesn't work for me since I have too many large objects,
thus slowing down _all_ pack operations.
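For concreteness, the size-threshold policy (the focus of my patch) amounts to a partition like the following sketch -- the names and the 1 MB cutoff are illustrative only, not git's actual code:

```python
# Illustrative sketch of a blob-size-threshold packing policy:
# blobs at or above the cutoff stay loose (never deltified or
# packed); everything else is a pack candidate.
THRESHOLD = 1 << 20  # 1 MB; an arbitrary cutoff for this sketch

def partition_objects(objects):
    """Split (name, size) pairs into pack candidates and loose keepers."""
    to_pack, keep_loose = [], []
    for name, size in objects:
        (keep_loose if size >= THRESHOLD else to_pack).append(name)
    return to_pack, keep_loose
```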
All this is complicated by the dual nature of packfiles --
they are used as a "wire format" for serial transmission,
as well as a database format for random access.
The "magic" entropy detection idea is cute, but probably not
needed -- using the blob size should be sufficient. For a _smallish_
blob, attempting to (re)compress it even when it turns out to be
incompressible costs little, while any computation on sufficiently
large blobs should be avoided regardless of their content.
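If one did want such detection, compressing a small sample of the blob is probably enough -- a hypothetical sketch, not anything in git:

```python
import zlib

def looks_incompressible(sample: bytes, ratio: float = 0.95) -> bool:
    """Crude 'entropy detection': deflate a sample of the blob and see
    whether it shrank. If zlib barely helps, assume the blob is
    already compressed (video, archives) and skip recompression."""
    if not sample:
        return False
    return len(zlib.compress(sample, 1)) >= len(sample) * ratio
```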
Hopefully I can return to this problem after New Year's. And
perhaps with the expanding Git userbase, more people will have
"large blob" problems ;-) and there will be more interest in
better addressing this usage pattern.
At the moment, I am thinking about how to better structure
git's handling of very large repositories for a team entirely
connected by a high-speed LAN. A scheme where each user has
a repository with deep history but shallow blobs seems ideal,
though that is also very different from how git works today.
Have fun,
Dana How
On Wed, Aug 13, 2008 at 9:01 AM, Geert Bosch <bosch@adacore.com> wrote:
> On Aug 13, 2008, at 10:35, Nicolas Pitre wrote:
>>
>> On Tue, 12 Aug 2008, Geert Bosch wrote:
>>
>>> I've always felt that keeping largish objects (say anything >1MB)
>>> loose makes perfect sense. These objects are accessed infrequently,
>>> often binary or otherwise poor candidates for the delta algorithm.
>>
>> Or, as I suggested in the past, they can be grouped into a separate
>> pack, or even occupy a pack of their own.
>
> This is fine, as long as we're not trying to create deltas
> of the large objects, or do other things that require keeping
> the inflated data in memory.
>
>> As soon as you have more than
>> one revision of such largish objects then you lose again by keeping them
>> loose.
>
> Yes, you lose potentially in terms of disk space, but you avoid the
> large memory footprint during pack generation. For very large blobs,
> it is best to degenerate to having each revision of each file on
> its own (whether we call it a single-file pack, loose object or whatever).
> That way, the large file can stay immutable on disk, and will only
> need to be accessed during checkout. GIT will then scale with good
> performance until we run out of disk space.
>
> The alternative is that people need to keep large binary data out
> of their SCMs and handle it on the side. Consider a large web site
> where I have all scripts, HTML content, as well as a few movies
> to manage. The movies basically should be copied and stored, only
> to be accessed when a checkout (or push) is requested.
>
> If we mix the very large movies with the 100,000 objects representing
> the webpages, the resulting pack will become unwieldy and slow even
> to just copy around during repacks.
>
>> You'll have memory usage issues whenever such objects are accessed,
>> loose or not.
>
> Why? The only time we'd need to access their contents is for checkout
> or when pushing across the network. These should all be streaming
> operations with small memory footprint.
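
Right -- such a copy needs only a fixed-size buffer. A rough Python sketch of the idea (illustrative only, nothing git-specific):

```python
def stream_copy(src, dst, chunk_size=64 * 1024):
    """Copy file-like src to dst in fixed-size chunks, so memory use
    is bounded by chunk_size no matter how large the object is."""
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            return total  # bytes copied
        dst.write(chunk)
        total += len(chunk)
```
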
>
>> However, once those big objects are packed once, they can
>> be repacked (or streamed over the net) without really "accessing" them.
>> Packed object data is simply copied into a new pack in that case which
>> is less of an issue on memory usage, irrespective of the original pack
>> size.
>
> Agreed, but still, at least very large objects should be left alone. If I have a 600MB
> file in my repository, it should just not get in the way. If it gets
> copied around during each repack, that just wastes I/O time for no
> good reason. Even worse, it causes incremental backups or filesystem
> checkpoints to become way more expensive. Just leaving large files
> alone as immutable objects on disk avoids all these issues.
>
> -Geert
--
Dana L. How danahow@gmail.com +1 650 804 5991 cell