* [PATCH 09/13] drop objects larger than --blob-limit if specified
@ 2007-04-05 22:36 Dana How
       [not found] ` <alpine.LFD.0.98.0704052103410.28181@xanadu.home>
  2007-04-06  1:04 ` Nicolas Pitre
  0 siblings, 2 replies; 11+ messages in thread
From: Dana How @ 2007-04-05 22:36 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, danahow

[-- Attachment #1: Type: text/plain, Size: 145 bytes --]

---
 builtin-pack-objects.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

-- 
Dana L. How  danahow@gmail.com  +1 650 804 5991 cell

[-- Attachment #2: 0009-drop-objects-larger-than-blob-limit-if-specified.patch.txt --]
[-- Type: text/plain, Size: 826 bytes --]

From 62532de2a8c19ca8d8bfb2e4bc02bff39bcc7ca8 Mon Sep 17 00:00:00 2001
From: Dana How <how@deathvalley.cswitch.com>
Date: Thu, 5 Apr 2007 14:06:05 -0700
Subject: [PATCH 09/13] drop objects larger than --blob-limit if specified

---
 builtin-pack-objects.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c
index 37b0150..ccc2d15 100644
--- a/builtin-pack-objects.c
+++ b/builtin-pack-objects.c
@@ -1116,6 +1116,7 @@ static void check_object(struct object_entry *entry)
 	}
 
 	entry->type = sha1_object_info(entry->sha1, &entry->size);
+	nr_skipped += entry->no_write = blob_limit && (unsigned long)entry->size >= blob_limit;
 	if (entry->type < 0)
 		die("unable to get type of object %s",
 		    sha1_to_hex(entry->sha1));
-- 
1.5.1.rc2.18.g9c88-dirty



* Re: [PATCH 09/13] drop objects larger than --blob-limit if specified
  2007-04-05 22:36 [PATCH 09/13] drop objects larger than --blob-limit if specified Dana How
       [not found] ` <alpine.LFD.0.98.0704052103410.28181@xanadu.home>
@ 2007-04-06  1:04 ` Nicolas Pitre
  2007-04-06  2:19   ` Dana How
  1 sibling, 1 reply; 11+ messages in thread
From: Nicolas Pitre @ 2007-04-06  1:04 UTC (permalink / raw)
  To: Dana How; +Cc: Junio C Hamano, git


I still consider this feature to make no sense.


Nicolas


* Re: [PATCH 09/13] drop objects larger than --blob-limit if specified
  2007-04-06  1:04 ` Nicolas Pitre
@ 2007-04-06  2:19   ` Dana How
  2007-04-06  3:20     ` Nicolas Pitre
  0 siblings, 1 reply; 11+ messages in thread
From: Dana How @ 2007-04-06  2:19 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Junio C Hamano, git, danahow

On 4/5/07, Nicolas Pitre <nico@cam.org> wrote:
> I still consider this feature to make no sense.

Well, suppose I'm packing my 55GB of data into 2GB
packfiles.  There seemed to be some agreement that
limiting packfile size was useful.  700MB is another example.

Now,  suppose there is an object whose packing would
result in a packfile larger than the limit.  What should we do?

(1) Refuse to run.  This solution means I can't pack my repository.
(2) Pack the object anyway and let the packfile size exceed
     my specification.  Ignoring a clear preference from the user
     doesn't seem good.
(3) Pack the object by itself in its own pack. This is better than the
     previous since I haven't wrapped up any small object in a pack
     whose size I don't want to deal with.  The resulting pack is too big,
     but the original object was also too big so at least I haven't made
     the problem worse.  But why bother wrapping the object so?
     I just made the list of packs to look through longer for every access,
     instead of leaving the big object in the objects/xx directories which
     are already used to handle exceptions (usually meaning more recent).
     In my 55GB example, I have 9 jumbo objects, and this solution
     would more than double the number of packs to step through.
     Having them randomly placed in 256 subdirectories seems better.
(4) Just leave the jumbo object by itself, unpacked.

What do you think?

Thanks,
-- 
Dana L. How  danahow@gmail.com  +1 650 804 5991 cell


* Re: [PATCH 09/13] drop objects larger than --blob-limit if specified
  2007-04-06  2:19   ` Dana How
@ 2007-04-06  3:20     ` Nicolas Pitre
  2007-04-06 15:49       ` Linus Torvalds
  0 siblings, 1 reply; 11+ messages in thread
From: Nicolas Pitre @ 2007-04-06  3:20 UTC (permalink / raw)
  To: Dana How; +Cc: Junio C Hamano, git

On Thu, 5 Apr 2007, Dana How wrote:

> On 4/5/07, Nicolas Pitre <nico@cam.org> wrote:
> > I still consider this feature to make no sense.
> 
> Well, suppose I'm packing my 55GB of data into 2GB
> packfiles.  There seemed to be some agreement that
> limiting packfile size was useful.  700MB is another example.
> 
> Now,  suppose there is an object whose packing would
> result in a packfile larger than the limit.  What should we do?

You error out.

> (1) Refuse to run.  This solution means I can't pack my repository.

Exactly.  If you want packs not to be larger than 10MB and you have a 
100MB blob then you are screwed.  Just lift your pack size limit in such a 
case.

> (2) Pack the object anyway and let the packfile size exceed
>     my specification.  Ignoring a clear preference from the user
>     doesn't seem good.

It is not indeed.

> (3) Pack the object by itself in its own pack. This is better than the
>     previous since I haven't wrapped up any small object in a pack
>     whose size I don't want to deal with.  The resulting pack is too big,
>     but the original object was also too big so at least I haven't made
>     the problem worse.  But why bother wrapping the object so?
>     I just made the list of packs to look through longer for every access,
>     instead of leaving the big object in the objects/xx directories which
>     are already used to handle exceptions (usually meaning more recent).
>     In my 55GB example, I have 9 jumbo objects, and this solution
>     would more than double the number of packs to step through.
>     Having them randomly placed in 256 subdirectories seems better.

You forget about the case where those jumbo blobs could delta well 
against each other.  That means that one pack could possibly contain 
those 9 objects because 8 of them are tiny deltas against the first big 
one.

> (4) Just leave the jumbo object by itself, unpacked.

Hmmmmm.

> What do you think?

Let's say I wouldn't mind much if it was implemented differently.  The 
objects array is probably the biggest cost in terms of memory usage for 
pack-objects.  When you have 4 million objects like in the kde repo, that 
means each new field you add will cost between 4 and 16 MB of memory.  I 
think this is too big a cost for filtering out a couple of big objects once 
in a while.

Instead, I think you should apply the filtering in add_object_entry() 
directly and simply skip adding the unwanted object to the list 
altogether.
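
For illustration, a rough sketch of that early-filtering approach (this
assumes git's internal headers; blob_limit and nr_skipped are the names
used by the patch, and the add_object_entry() signature here is
simplified, not the real builtin-pack-objects.c prototype):

/*
 * Sketch: drop oversized objects before they ever get an entry in the
 * objects[] array, instead of tagging them with a per-entry no_write
 * flag.  Returns 1 if the object would be added, 0 if it was skipped.
 * The size test mirrors the one in the patch above.
 */
static int add_object_entry(const unsigned char *sha1)
{
        unsigned long size;
        int type = sha1_object_info(sha1, &size);

        if (type < 0)
                die("unable to get type of object %s", sha1_to_hex(sha1));
        if (blob_limit && size >= blob_limit) {
                nr_skipped++;
                return 0;       /* no object_entry is ever allocated for it */
        }

        /* ... normal path: grow the objects[] array, fill in the entry ... */
        return 1;
}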


Nicolas


* Re: [PATCH 09/13] drop objects larger than --blob-limit if specified
  2007-04-06  3:20     ` Nicolas Pitre
@ 2007-04-06 15:49       ` Linus Torvalds
       [not found]         ` <56b7f5510704061109n2878a221p391b7c3edba89c63@mail.gmail.com>
  2007-04-06 18:09         ` Dana How
  0 siblings, 2 replies; 11+ messages in thread
From: Linus Torvalds @ 2007-04-06 15:49 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Dana How, Junio C Hamano, git



On Thu, 5 Apr 2007, Nicolas Pitre wrote:
> 
> > (2) Pack the object anyway and let the packfile size exceed
> >     my specification.  Ignoring a clear preference from the user
> >     doesn't seem good.
> 
> It is not indeed.

Well, I think there is an easy solution.

Just go back and say that when the user limits the size, it limits the 
offset at which objects can *start*.

Not only is that the only thing that the index file itself cares about, it 
also means that

 - you don't ever need to undo anything at all (because you know the 
   starting offset) when you begin packing a new object.

   This should simplify your patch a lot.

 - the object size issue just goes away. Sure, the pack-file limit looks a 
   bit strange (it's no longer a hard limit on the *size* of the 
   pack-file, just on the object offsets), but let's face it, we really 
   really don't care.

And in practice, by setting the pack-file limit to 2GB (or even 1GB), you 
never ever have to worry about the 32-bit filesystem limits any more, 
unless your repository is fundamentally so screwed that you simply 
*cannot* represent it well on something like FATFS (ie any object that is 
2GB in size will probably have a blob that is even bigger, and FATFS 
already has problems with it).

So in practice, just limiting the index offsets is what you really want 
anyway.
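
For illustration, a rough sketch of that rule inside a pack-writing loop
(a simplified stand-in, not the real pack-objects code: pack_limit,
nr_objects, offset and pack_header_size are illustrative names, and
write_one() stands for whatever writes a single object and returns the
new offset):

/*
 * An object may only *start* below pack_limit, so the pack can overshoot
 * the limit by at most one object's size and nothing ever has to be
 * rolled back.
 */
unsigned int i;
unsigned long offset = pack_header_size;

for (i = 0; i < nr_objects; i++) {
        if (pack_limit && offset >= pack_limit)
                break;          /* remaining objects go into the next pack */
        offset = write_one(f, &objects[i], offset);
}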

		Linus


* Re: [PATCH 09/13] drop objects larger than --blob-limit if specified
  2007-04-06 15:49       ` Linus Torvalds
       [not found]         ` <56b7f5510704061109n2878a221p391b7c3edba89c63@mail.gmail.com>
@ 2007-04-06 18:09         ` Dana How
  2007-04-06 19:21           ` Nicolas Pitre
                             ` (2 more replies)
  1 sibling, 3 replies; 11+ messages in thread
From: Dana How @ 2007-04-06 18:09 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Nicolas Pitre, Junio C Hamano, git, danahow

On 4/6/07, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Thu, 5 Apr 2007, Nicolas Pitre wrote:
> > > (2) Pack the object anyway and let the packfile size exceed
> > >     my specification.  Ignoring a clear preference from the user
> > >     doesn't seem good.
> > It is not indeed.
>
> Well, I think there is an easy solution.
>
> Just go back and say that when the user limits the size, it limits the
> offset at which objects can *start*.
Yes,  this is what my original, unsplit patch did in order to support
the --pack-limit && --stdout combination, since lseek doesn't
always work on stdout.

> Not only is that the only thing that the index file itself cares about, it
> also means that
>
>  - you don't ever need to undo anything at all (because you know the
>    starting offset) when you begin packing a new object.
Yes, that was true.

>    This should simplify your patch a lot.
It removes the calls to, and the additions of, sha1mark() and sha1undo()
in csum-file.[ch].  The changes to builtin-pack-objects.c are almost
the same -- an "if" moves from after the write_object() to before.

>  - the object size issue just goes away. Sure, the pack-file limit looks a
>    bit strange (it's no longer a hard limit on the *size* of the
>    pack-file, just on the object offsets), but let's face it, we really
>    really don't care.
>
> And in practice, by setting the pack-file limit to 2GB (or even 1GB), you
> never ever have to worry about the 32-bit filesystem limits any more,
> unless your repository is fundamentally so screwed that you simply
> *cannot* represent it well on something like FATFS (ie any object that is
> 2GB in size will probably have a blob that is even bigger, and FATFS
> already has problems with it).
>
> So in practice, just limiting the index offsets is what you really want
> anyway.
I agree with your arguments, but packs have different uses
and to me there are several differing (non-)reasons for --pack-limit.
* To split packs into smaller chunks when the .pack format is used as a
  communications protocol. But the discussion has converged to
  "we're not going to split packs at the transmitter". I agree with this
  conclusion, since it would make the total transmission larger (loss
  of deltas).
  Since no .idx file is sent there is no requirement for --pack-limit here.
* To avoid offsets larger than 31 bits in .idx files.  Your proposal,
  and what I was doing for --pack-limit && --stdout, is sufficient to
address this.
* Avoiding (e.g.) 2GB+ files when none already exist in the repository --
  either the filesystem doesn't support anything beyond the limit,
  or we don't want to use a >31b off_t with mmap.  (Perhaps
  the latter case is completely avoided by some glibc 64b trickery,
  but is that always true?)  Only the write rollback approach can address this.
* Creating .pack files that fit on e.g. CDROM etc.
The 2nd and 3rd cases are what I'm thinking about,
which is why the first version of my patch did both.

Anyway, now that I understand Nicolas's issues with the patchset,
and realize I agree with his concerns, I'm going to let this percolate
a little bit until I've got the least number of added behaviors and
command line options.  At the moment I'm leaning towards killing my
--blob-limit idea, keeping --pack-limit based on rollback but with the
additional twist that the first object written to a packfile is never
"rolled back" [meaning (a) everything is packed to make Nicolas happy,
(b) any illegally-sized pack has only one object and could be removed
to make me happy, and (c) my patchset has one less bug], and
adding --index-limit which is what normal people might use on a large
repository [and could be made the default later].  So I think what you
discuss is the most common use for limiting.

Thanks,
-- 
Dana L. How  danahow@gmail.com  +1 650 804 5991 cell


* Re: [PATCH 09/13] drop objects larger than --blob-limit if specified
  2007-04-06 18:09         ` Dana How
@ 2007-04-06 19:21           ` Nicolas Pitre
  2007-04-06 19:24           ` Linus Torvalds
  2007-04-06 20:12           ` Nicolas Pitre
  2 siblings, 0 replies; 11+ messages in thread
From: Nicolas Pitre @ 2007-04-06 19:21 UTC (permalink / raw)
  To: Dana How; +Cc: Linus Torvalds, Junio C Hamano, git

On Fri, 6 Apr 2007, Dana How wrote:

> * To avoid offsets larger than 31 bits in .idx files.  Your proposal,
>  and what I was doing for --pack-limit && --stdout, is sufficient to
> address this.

BTW I have code for an index version 2 that allows for 64-bit offsets.  
It also adds a CRC32 for each object data as stored in the pack to make 
data reuse safer when repacking.  I only have to fix a couple loose ends 
before sending patches out.

The new index only uses 64-bit offsets when it has to, meaning that 
objects at the front of a pack (which usually are the most recent ones) won't 
pay the overhead of a larger offset table at run time.
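
For illustration, one way the "64-bit only when needed" lookup can work
(this is a guess at the mechanism for illustration, not the actual index
v2 layout: a 4-byte entry per object whose high bit, when set, indexes a
separate 8-byte offset table):

#include <stdint.h>

/*
 * Sketch: small offsets are stored inline in a 32-bit table; if the high
 * bit of an entry is set, the low 31 bits index a separate 64-bit table.
 * Packs that stay under 2GB never need the large table at all.
 */
static uint64_t nth_object_offset(const uint32_t *off32,
                                  const uint64_t *off64, uint32_t nth)
{
        uint32_t o = off32[nth];        /* assume host byte order here */

        if (!(o & 0x80000000u))
                return o;               /* fits in 31 bits, stored inline */
        return off64[o & 0x7fffffffu];  /* large offset, one extra lookup */
}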


Nicolas


* Re: [PATCH 09/13] drop objects larger than --blob-limit if specified
  2007-04-06 18:09         ` Dana How
  2007-04-06 19:21           ` Nicolas Pitre
@ 2007-04-06 19:24           ` Linus Torvalds
  2007-04-06 22:33             ` Junio C Hamano
  2007-04-06 20:12           ` Nicolas Pitre
  2 siblings, 1 reply; 11+ messages in thread
From: Linus Torvalds @ 2007-04-06 19:24 UTC (permalink / raw)
  To: Dana How; +Cc: Nicolas Pitre, Junio C Hamano, git



On Fri, 6 Apr 2007, Dana How wrote:
>
> I agree with your arguments, but packs have different uses
> and to me there are several differing (non-)reasons for --pack-limit.
> * To split packs into smaller chunks when the .pack format is used as a
>  communications protocol. But the discussion has converged to
>  "we're not going to split packs at the transmitter". I agree with this
>  conclusion,  since it would make the total transmission larger (loss
>  of deltas).

I don't agree with it, since I don't think the delta advantage is 
noticeable once packs are "big enough". 

Splitting on the sending side means that it needs to split just in one 
place, rather than have the same logic in two different places.

I don't think it really "converged" on anything; I think the discussion 
just didn't continue along that line.

>  Since no .idx file is sent there is no requirement for --pack-limit here.

I really don't think the idx file is the only - or even primary - reason. 
We need *some* size limiter for a lot of reasons. If the idx file was the 
only reason, we should just switch to a new index format and forget about 
it.

But since there are *other* reasons that cannot go away, that doesn't 
obviate the need for size limits.  So while I think the idx one is the 
*first* reason to do this, I don't think it is really the most important 
one, exactly because the idx file issue is something we *could* solve in other ways.

> * To avoid offsets larger than 31 bits in .idx files.  Your proposal,
>  and what I was doing for --pack-limit && --stdout, is sufficient to
>  address this.

Right. 

> * Avoiding (e.g.) 2GB+ files when none already exist in the repository --
>  either the filesystem doesn't support anything beyond the limit,
>  or we don't want to use a >31b off_t with mmap.  (Perhaps
>  the latter case is completely avoided by some glibc 64b trickery,
>  but is that always true?)  Only the write rollback approach can address this.

I disagree violently. 

IN THEORY only write rollback can address that. But "theory" is not 
practice, and anybody who thinks that theory is even *relevant* here is 
missing the big picture.

If you use FATFS as the place to store your objects, it's really easy to 
just say: you can't have objects bigger than 2GB. That's the real life 
solution. The "theory" that you could have pack-files larger than 32 bits 
is just about as relevant as quantum mechanics is to designing the tensile 
strength of a sky-scraper. Sure, *in*theory* quantum mechanics matters, 
but in practice, you'd be crazy if you tried to convince anybody that it 
should be done using QM rather than traditional means.

Quite frankly, if you have objects that compress to >2GB, you're going to 
be in so much pain even on *other* filesystems, that I seriously doubt 
you'd want to track them using git anyway. And even if you want to, just 
tell people: don't use FATFS, because this repository is insane.

> * Creating .pack files that fit on e.g. CDROM etc.
> The 2nd and 3rd cases are what I'm thinking about,
> which is why the first version of my patch did both.

Again, *in*practice*, for any sane situation, if you want to fit things on 
a CD-ROM, just give a limit of 600MB, and I can pretty much guarantee that 
you'll see a slop of just a percent or two for any realistic setup. And if 
it goes up to 660MB, you'll still fit on any CD.

So please ignore theory, when theory isn't relevant. Designing for 
something that practically cannot happen sounds pretty wrong.

(Of course, since you have a 55GB archive, you probably have an insane 
thing in the first place. My condolences if so ;)

		Linus


* Re: [PATCH 09/13] drop objects larger than --blob-limit if specified
  2007-04-06 18:09         ` Dana How
  2007-04-06 19:21           ` Nicolas Pitre
  2007-04-06 19:24           ` Linus Torvalds
@ 2007-04-06 20:12           ` Nicolas Pitre
  2 siblings, 0 replies; 11+ messages in thread
From: Nicolas Pitre @ 2007-04-06 20:12 UTC (permalink / raw)
  To: Dana How; +Cc: Linus Torvalds, Junio C Hamano, git

On Fri, 6 Apr 2007, Dana How wrote:

> * Avoiding (e.g.) 2GB+ files when none already exist in the repository --
>  either the filesystem doesn't support anything beyond the limit,
>  or we don't want to use a >31b off_t with mmap.  (Perhaps
>  the latter case is completely avoided by some glibc 64b trickery,
>  but is that always true?)  Only the write rollback approach can address this.

If the filesystem doesn't support anything beyond the limit, you cannot 
perform any rollback either.  Because if you roll back that's because 
you crossed the limit already.  But you're not supposed to even be able 
to cross that limit in the first place if the filesystem is limited.


Nicolas


* Re: [PATCH 09/13] drop objects larger than --blob-limit if specified
  2007-04-06 19:24           ` Linus Torvalds
@ 2007-04-06 22:33             ` Junio C Hamano
  2007-04-08  1:53               ` David Lang
  0 siblings, 1 reply; 11+ messages in thread
From: Junio C Hamano @ 2007-04-06 22:33 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Dana How, Nicolas Pitre, git

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Fri, 6 Apr 2007, Dana How wrote:
>> ...
>> * Avoiding (e.g.) 2GB+ files when none already exist in the repository --
>>  either the filesystem doesn't support anything beyond the limit,
>>  or we don't want to use a >31b off_t with mmap.  (Perhaps
>>  the latter case is completely avoided by some glibc 64b trickery,
>>  but is that always true?)  Only the write rollback approach can address this.
>
> I disagree violently. 
>
> IN THEORY only write rollback can address that. But "theory" is not 
> practice, and anybody who thinks that theory is even *relevant* here is 
> missing the big picture.

Although I agree the "starting offset" is a very good
compromise, if we talk about THEORY, you should be able to
notice you are going to go beyond the limit inside write_one().
At that point, we know everything we need to know about all
objects that might be involved in writing that one object: if
they have been written out, what their offsets are, and how big
their in-pack representation will be.

And I agree with Nico that rollback after a failed write beyond
the boundary may not work correctly, so if we really want to do
this safely and sanely while satisfying Dana's desire to have a
hard limit, I think something like this is needed:

 - use "starting offset" to decide when to split;

 - introduce a "hard limit", perhaps optionally, to make sure
   that the result of writing out a packfile does not overstep
   that limit (i.e. the last object written below the "starting
   offset limit" might make the pack go over 700MB).

which means you would specify 600 as the starting offset limit and
680 (or something like that) as the hard tail offset limit.
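
For illustration, a rough sketch of those two checks at the point where
one object is about to be written (soft_limit, hard_limit, entry_size,
start_new_pack() and pack_header_size are illustrative names, not real
options or pack-objects code; die() is git's usual error helper and
write_one() is the function mentioned above, but the call pattern shown
is simplified):

/*
 * Soft limit: an object may not *start* at or past it, so split there.
 * Hard limit: the finished pack may never cross it at all.
 */
if (soft_limit && offset >= soft_limit) {
        start_new_pack();               /* this object opens the next pack */
        offset = pack_header_size;
}
if (hard_limit && offset + entry_size > hard_limit)
        die("object of %lu bytes would push the pack past the hard limit",
            entry_size);
offset = write_one(f, entry, offset);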

> Again, *in*practice*, for any sane situation, if you want to fit things on 
> a CD-ROM, just give a limit of 600MB, and I can pretty much guarantee that 
> you'll see a slop of just a percent or two for any realistic setup. And if 
> it goes up to 660MB, you'll still fit on any CD.

if you really care that the result fits on a CD.


* Re: [PATCH 09/13] drop objects larger than --blob-limit if  specified
  2007-04-06 22:33             ` Junio C Hamano
@ 2007-04-08  1:53               ` David Lang
  0 siblings, 0 replies; 11+ messages in thread
From: David Lang @ 2007-04-08  1:53 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Linus Torvalds, Dana How, Nicolas Pitre, git

On Fri, 6 Apr 2007, Junio C Hamano wrote:

> Linus Torvalds <torvalds@linux-foundation.org> writes:
>
>> On Fri, 6 Apr 2007, Dana How wrote:
>
> And I agree with Nico that rollback after a failed write beyond
> the boundary may not work correctly, so if we really want to do
> this safely and sanely while satisfying Dana's desire to have a
> hard limit, I think something like this is needed:
>
> - use "starting offset" to decide when to split;
>
> - introduce a "hard limit", perhaps optionally, to make sure
>   that the result of writing out a packfile does not overstep
>   that limit (i.e. the last object written below the "starting
>   offset limit" might make the pack go over 700MB).
>
> which means you would specify 600 as starting offset limit and
> 680 (or something like that) as the hard tail offset limit
>
>> Again, *in*practice*, for any sane situation, if you want to fit things on
>> a CD-ROM, just give a limit of 600MB, and I can pretty much guarantee that
>> you'll see a slop of just a percent or two for any realistic setup. And if
>> it goes up to 660MB, you'll still fit on any CD.
>
> if you really care the result fits on a CD.

there are going to be cases where you have a fixed size, but would really like 
to stream things rather than write a temp file and then send that (one example 
of wanting to stream things, but without the size cap, is the network pull, where 
we start sending things before we've finished figuring out the details of what 
we are going to send).  With an appropriately sized buffer you could stream to a 
CD burner, for example, and while this may not make a huge difference in the time 
taken to make the CD, it will save you a significant amount of disk I/O and 
leave the buffer cache populated with information that is more useful to the user.

David Lang


