* merge unpack-objects to index-pack?
From: Nguyen Thai Ngoc Duy @ 2012-05-11 12:02 UTC
  To: Git Mailing List

Hi,

I'm looking at adding large file support to unpack-objects. A simple
way is to just stream large blobs out as loose objects, but I'd
rather keep those blobs in a pack, because pack-objects is happier
that way. I'm looking at unpack-objects and thinking maybe it's best
to just merge it back into index-pack.

In normal mode (all small objects), index-pack receives the pack
stream and the objects are unpacked in phase two, when index-pack
resolves objects. The only downside I can see is that the new
unpack-objects would now write a temporary pack to disk, which does
not sound too bad to me: unpack-objects is only called on small
packs, so the extra space is small, and for single-huge-blob packs
it's good to keep them on disk anyway. When the pack has large
blobs, we could just keep the full pack.

After this, the only pack receiver on the client side would be
index-pack; fetch-pack would not have to choose between
unpack-objects and index-pack, just pass --unpack-limit <n> to
index-pack.
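
Roughly, fetch-pack's side could then shrink to something like the
sketch below. Note that --unpack-limit is only the option proposed
above, not an existing index-pack flag, and real code would go
through git's run-command API rather than raw fork/exec:

    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /*
     * Spawn "git index-pack --stdin --unpack-limit=<n>" on the pack
     * stream already connected to our stdin; return its exit status.
     */
    static int receive_pack_stream(int unpack_limit)
    {
            char limit[32];
            pid_t pid;
            int status;

            snprintf(limit, sizeof(limit), "--unpack-limit=%d", unpack_limit);
            pid = fork();
            if (pid < 0)
                    return -1;
            if (!pid) {
                    execlp("git", "git", "index-pack", "--stdin", limit,
                           (char *)NULL);
                    perror("exec index-pack");
                    _exit(127);
            }
            if (waitpid(pid, &status, 0) < 0)
                    return -1;
            return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
    }

    int main(void)
    {
            /* 100 stands in for {fetch,receive}.unpackLimit from config */
            return receive_pack_stream(100);
    }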

What do you think?
-- 
Duy


* Re: merge unpack-objects to index-pack?
From: Junio C Hamano @ 2012-05-11 17:51 UTC
  To: Nguyen Thai Ngoc Duy; +Cc: Git Mailing List

Nguyen Thai Ngoc Duy <pclouds@gmail.com> writes:

> I'm looking at adding large file support to unpack-objects. A simple
> way is to just stream large blobs out as loose objects, but I'd
> rather keep those blobs in a pack, because pack-objects is happier
> that way. I'm looking at unpack-objects and thinking maybe it's best
> to just merge it back into index-pack.
>
> In normal mode (all small objects), index-pack receives the pack
> stream and the objects are unpacked in phase two, when index-pack
> resolves objects. The only downside I can see is that the new
> unpack-objects would now write a temporary pack to disk, which does
> not sound too bad to me: unpack-objects is only called on small
> packs, so the extra space is small, and for single-huge-blob packs
> it's good to keep them on disk anyway. When the pack has large
> blobs, we could just keep the full pack.
>
> After this, the only pack receiver on the client side would be
> index-pack; fetch-pack would not have to choose between
> unpack-objects and index-pack, just pass --unpack-limit <n> to
> index-pack.
>
> What do you think?

I think it is beneficial to step back a bit.  What is the _real_ reason
why we call unpack-objects instead of index-pack when we receive only a
handful of objects?

I think we did this to avoid littering the receiving repository with
too many packs from individual transfers.  As long as that goal is
met, a solution that replaces the current "if we are going to get
fewer than N objects, explode them to loose objects" does not
actually have to explode them to loose objects.  It could explode
normal objects into loose ones while appending large ones to an
existing pack (and it would have to fix up the pack .idx after doing
so), for example.  Or it could even choose to _always_ append into an
existing pack designated for appending new objects.  Or it could punt
on the "appending" part, declare that the large-object problem is a
rare event, and create/leave a new pack in the repository that stores
a large object (this however would not satisfy the "do not litter the
receiving repository with too many packs" goal if the "large object
problem" is not rare enough).
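
Purely to illustrate the options above, the decision could be shaped
like this; the names and the threshold are invented for the sketch
and exist nowhere in git:

    #include <stdio.h>

    /* The three ways the receiver could store what arrives. */
    enum pack_storage {
            EXPLODE_LOOSE,   /* current behaviour for small transfers */
            APPEND_TO_PACK,  /* append to a designated pack, fix up its .idx */
            KEEP_WHOLE_PACK  /* punt on appending, keep the incoming pack */
    };

    static enum pack_storage choose_storage(unsigned long nr_objects,
                                            int has_large_blobs,
                                            unsigned long unpack_limit)
    {
            if (nr_objects >= unpack_limit)
                    return KEEP_WHOLE_PACK; /* big transfer, keep the pack */
            if (has_large_blobs)
                    return APPEND_TO_PACK;  /* small, but carries a huge blob */
            return EXPLODE_LOOSE;           /* small and ordinary */
    }

    int main(void)
    {
            printf("%d\n", choose_storage(5, 1, 100)); /* -> APPEND_TO_PACK */
            return 0;
    }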

And the first step to make that happen would be to let a single
receiver program, instead of receive-pack/fetch-pack, make the
decision.  That receiver program _might_ benefit from knowing how
many objects it is going to receive when making the decision before
seeing a single byte of the packstream, but there is other, more
meaningful data you can learn only by looking at what is in the
pack.

So I like the general direction you are heading.

Probably the first step in the right structure for such a series
would be to introduce a new helper program that
builtin/receive-pack.c::unpack() and builtin/fetch-pack.c::get_pack()
call, remove the header-peeking these calling processes currently do,
and make that new helper responsible for switching between
unpack-objects and index-pack (the new helper may peek at the header
instead).  The first implementation of the new helper may decide
exactly the way these two functions currently choose between the
two.
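
As a rough sketch of that dispatch, assuming the helper peeks the
12-byte pack header ("PACK" signature, version, object count, all in
network byte order) from its input; re-feeding the peeked bytes to
the child process is glossed over here:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <arpa/inet.h> /* ntohl */

    /*
     * Read the 12-byte pack header and return the object count it
     * advertises; returns 0 and fills *nr on success, -1 otherwise.
     */
    static int peek_object_count(FILE *in, uint32_t *nr)
    {
            unsigned char hdr[12];
            uint32_t version, count;

            if (fread(hdr, 1, sizeof(hdr), in) != sizeof(hdr))
                    return -1;
            if (memcmp(hdr, "PACK", 4))
                    return -1;      /* not a pack stream */
            memcpy(&version, hdr + 4, 4);
            memcpy(&count, hdr + 8, 4);
            if (ntohl(version) != 2 && ntohl(version) != 3)
                    return -1;
            *nr = ntohl(count);
            return 0;
    }

    int main(void)
    {
            uint32_t nr;
            unsigned long unpack_limit = 100; /* stand-in for config */

            if (peek_object_count(stdin, &nr))
                    return 1;
            /* A real helper would hand the stream (header included)
             * to the child; here we only report the choice. */
            if (nr < unpack_limit)
                    fprintf(stderr, "would run: git unpack-objects (%u objects)\n",
                            (unsigned)nr);
            else
                    fprintf(stderr, "would run: git index-pack --stdin (%u objects)\n",
                            (unsigned)nr);
            return 0;
    }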

Once that is done, how objects in the incoming packstream are stored
locally becomes an implementation detail from the point of view of
fetch-pack and receive-pack, and nobody should notice when the new
helper is updated to call only the updated index-pack that knows to
stream large (or all) objects into a pack.

