* Balanced packing strategy
2005-11-12 13:40 ` Craig Schlenter
@ 2005-11-12 13:59 ` Petr Baudis
2005-11-12 15:14 ` Craig Schlenter
2005-11-13 20:06 ` Josef Weidendorfer
0 siblings, 2 replies; 8+ messages in thread
From: Petr Baudis @ 2005-11-12 13:59 UTC (permalink / raw)
To: Craig Schlenter; +Cc: git
Dear diary, on Sat, Nov 12, 2005 at 02:40:50PM CET, I got a letter
where Craig Schlenter <craig@codefountain.com> said that...
> On 12 Nov 2005, at 3:04 PM, Marcel Holtmann wrote:
>
> >every time Linus re-creates the pack for his linux-2.6 tree, I end up
> >with another pack. I use HTTP as transport and thus the new pack will
> >be downloaded (which is almost 100 MB), but that is fine.
> >[snip]
>
> The 100MB situation is not cool for those of us on a tight bandwidth
> budget or slow links. Can anyone tell me if the native git protocol is
> any better at this stuff please?
Yes, the native GIT protocol transfers only the objects you need.
But the 100MB situation is still bad. FWIW, this is the proposal I sent
about a month ago in a packs-related discussion on the kernel.org
mailing list (ok, I updated it a little):
The repacking should be done in such a way as to minimize the overhead
for dumb transport users. Ideal for this is a structure like (at the
end of October):
year2003.pack
year2004.pack
halfyear2004-2.pack
halfyear2005-1.pack
month4.pack
month5.pack
month6.pack
month7.pack
month8.pack
month9.pack
week37.pack
week38.pack
week39.pack
week40.pack
week41.pack
week42.pack
week43.pack
<individual objects for weeks 43, 44>
This has the property that the second half of a given pack is also
covered by packs of one-step-finer granularity. This is a relatively
high overhead (it can be reduced by only keeping the last third or
whatever), but it is designed to reduce the cost of fetching packs over
dumb transport. E.g. if it is almost the end of July and you last
fetched at the start of June, you will not have to get the whole
halfyear2005-1 pack, but will be able to catch up by just fetching the
month6 pack and then a few week-packs.
For the autopacker (which should ideally be run by a cronjob), this
means packing a new week each week and getting rid of a week's worth of
objects, packing a new month each month and getting rid of a month's
worth of objects, etc.
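[Editor's note: the cronjob's bucketing decision could be sketched as below. This is only an illustration of the proposed layout; the bucket boundaries and the `pack_bucket` helper are made up, not part of the proposal.]

```python
from datetime import date

def pack_bucket(commit_date, today):
    """Assign a commit date to a pack bucket in the spirit of the
    proposed layout: individual objects for the last ~2 weeks,
    week-packs for the recent past, then month-, half-year- and
    year-packs.  Boundaries here are illustrative, not prescriptive."""
    age = (today - commit_date).days
    if age < 14:
        return "loose"                      # keep as individual objects
    if age < 90:
        return "week%d.pack" % commit_date.isocalendar()[1]
    if age < 365:
        return "month%d.pack" % commit_date.month
    if age < 2 * 365:
        half = 1 if commit_date.month <= 6 else 2
        return "halfyear%d-%d.pack" % (commit_date.year, half)
    return "year%d.pack" % commit_date.year

today = date(2005, 10, 31)                  # "end of October" as above
print(pack_bucket(date(2005, 10, 20), today))   # loose
print(pack_bucket(date(2005, 9, 14), today))    # week37.pack
print(pack_bucket(date(2003, 3, 1), today))     # year2003.pack
```

Repacking then means moving objects whose bucket changed (e.g. a week that just aged past the month boundary) into the coarser pack.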
--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
VI has two modes: the one in which it beeps and the one in which
it doesn't.
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Balanced packing strategy
2005-11-12 13:59 ` Balanced packing strategy Petr Baudis
@ 2005-11-12 15:14 ` Craig Schlenter
2005-11-13 2:34 ` Junio C Hamano
2005-11-13 20:06 ` Josef Weidendorfer
1 sibling, 1 reply; 8+ messages in thread
From: Craig Schlenter @ 2005-11-12 15:14 UTC (permalink / raw)
To: Petr Baudis; +Cc: git
On 12 Nov 2005, at 3:59 PM, Petr Baudis wrote:
> Dear diary, on Sat, Nov 12, 2005 at 02:40:50PM CET, I got a letter
> where Craig Schlenter <craig@codefountain.com> said that...
>> The 100MB situation is not cool for those of us on a tight bandwidth
>> budget or slow links. Can anyone tell me if the native git protocol is
>> any better at this stuff please?
>
> Yes, the native GIT protocol transfers only the objects you need.
Ah, magic, thanks!
> But the 100MB situation is still bad. FWIW, this is the proposal I sent
> about a month ago in a packs-related discussion on the kernel.org
> mailing list (ok, I updated it a little):
It would be nice if there were some meaningful automatic packing that
didn't hurt users of non-git-aware protocols.
Does the pack index file contain enough information to enable a client
to send http byte range requests to grab individual objects from a pack?
It does seem to store object offsets but maybe I'm missing something ...
Thank you,
--Craig
* Re: Balanced packing strategy
2005-11-12 15:14 ` Craig Schlenter
@ 2005-11-13 2:34 ` Junio C Hamano
2005-11-13 11:00 ` Petr Baudis
0 siblings, 1 reply; 8+ messages in thread
From: Junio C Hamano @ 2005-11-13 2:34 UTC (permalink / raw)
To: Craig Schlenter; +Cc: git
Craig Schlenter <craig@codefountain.com> writes:
> Does the pack index file contain enough information to enable a client
> to send http byte range requests to grab individual objects from a pack?
> It does seem to store object offsets...
Yes, it is certainly doable; there is enough information. I am
not sure if it is worth the complexity, though.
Many objects are stored deltified, so your byte range request
would return a delta and the name of its base object. Only after
reading what was returned and finding the base object name could
you fetch that object, which can itself be another delta against
its own base. Untangling a delta chain would thus become a
serialized sequence of requests.
* Re: Balanced packing strategy
2005-11-13 2:34 ` Junio C Hamano
@ 2005-11-13 11:00 ` Petr Baudis
0 siblings, 0 replies; 8+ messages in thread
From: Petr Baudis @ 2005-11-13 11:00 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Craig Schlenter, git
Dear diary, on Sun, Nov 13, 2005 at 03:34:02AM CET, I got a letter
where Junio C Hamano <junkio@cox.net> said that...
> Craig Schlenter <craig@codefountain.com> writes:
>
> > Does the pack index file contain enough information to enable a client
> > to send http byte range requests to grab individual objects from a pack?
> > It does seem to store object offsets...
>
> Yes, it is certainly doable; there is enough information. I am
> not sure if it is worth the complexity, though.
I think we need either the balanced packing or this.
> Many objects are stored deltified, so your byte range request
> would return a delta and the name of its base object. Only after
> reading what was returned and finding the base object name could
> you fetch that object, which can itself be another delta against
> its own base. Untangling a delta chain would thus become a
> serialized sequence of requests.
Sort the objects topologically, then get everything from the old heads
on. Obviously, this will not work so well when we get multiple heads in
a single pack, but either don't do that (would it actually be so bad if
we created one pack per head?), or:
(i) objects are topologically sorted
(ii) objects introduced by a commit/tree come right after the commit or
tree in the pack file
(iii) the index file contains a parents list for each commit
This way, you can possibly run through the gaps, or if a gap is big
enough, restart the request. You will still miss objects introduced by
commits on different branches, but in the case of trees you can again
slurp the trees at once, and pick up the individual objects otherwise;
while doing this second pass, you can apply the gap strategy again.
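[Editor's note: the "run through the gaps" idea amounts to coalescing nearby byte ranges into a single request. A minimal sketch, with a hypothetical `coalesce_ranges` helper and an arbitrary gap threshold:]

```python
def coalesce_ranges(ranges, max_gap):
    """Merge wanted byte ranges (start, end) whose separation is small
    enough that reading through the gap beats the latency of issuing
    a fresh HTTP request."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1][1] = max(merged[-1][1], end)   # run through the gap
        else:
            merged.append([start, end])               # restart the request
    return [tuple(r) for r in merged]

# Two close-together objects and one far away:
print(coalesce_ranges([(0, 10), (12, 20), (500, 600)], max_gap=16))
# -> [(0, 20), (500, 600)]
```

The threshold would in practice be tuned against per-request latency of the dumb transport.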
--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
VI has two modes: the one in which it beeps and the one in which
it doesn't.
* Re: Balanced packing strategy
2005-11-12 13:59 ` Balanced packing strategy Petr Baudis
2005-11-12 15:14 ` Craig Schlenter
@ 2005-11-13 20:06 ` Josef Weidendorfer
2005-11-13 23:13 ` Junio C Hamano
1 sibling, 1 reply; 8+ messages in thread
From: Josef Weidendorfer @ 2005-11-13 20:06 UTC (permalink / raw)
To: git; +Cc: Petr Baudis, Craig Schlenter
On Saturday 12 November 2005 14:59, Petr Baudis wrote:
> The repacking should be done in such a way as to minimize the overhead
> for dumb transport users. Ideal for this is a structure like (at the
> end of October):
>
> year2003.pack
> year2004.pack
> ...
> week42.pack
> week43.pack
> <individual objects for weeks 43, 44>
I am not sure this is really beneficial: packs are required to be
self-contained, so you get a lot of objects stored undeltified which
could be deltified under a better scheme (as e.g. in the git native
protocol).
AFAICS, the git native protocol (which is nothing more than a pack
itself for each transfer) has this problem too: if you update every
day via git native, the sum of transferred bytes in a month will be a
multiple of one git transfer for all of the month's changes.
To keep the pack self-containment property, but work better with dumb
transfers, we could introduce incremental packs:
Instead of fully repacking, create a new pack by only appending new
objects at the end of the existing pack. Thus, most objects can be
appended in deltified form, making the incremental addition quite
small. The outcome would be a totally new pack.
Unfortunately, I do not know the pack format in detail, and hope that
this is possible at all.
For dumb protocols to take advantage of this, the information that the
first part of a pack is actually the same as another pack has to be
stored somewhere visible.
If a client detects that it already has the first part of a pack
locally, it would be enough to fetch only the second part.
This is more or less the same as Pasky's solution, but using
incremental packs instead. I think that such incremental packing will
not even take much more space than fully repacking.
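[Editor's note: the client-side check could look like the sketch below. The "prefix advertisement" side-channel is entirely hypothetical — no such file exists in git — and the real pack format's 12-byte header records the object count, so a purely appended pack would differ in its header too; the sketch glosses over that, in the spirit of the caveat above.]

```python
import hashlib

def tail_to_fetch(local_pack_bytes, remote_prefix_info, remote_size):
    """remote_prefix_info: (prefix_len, prefix_sha1) advertised by the
    server, claiming the new pack's first prefix_len bytes hash to
    prefix_sha1.  If our old pack matches that prefix, only the tail
    needs to be fetched (e.g. via an HTTP Range request)."""
    prefix_len, prefix_sha1 = remote_prefix_info
    if (len(local_pack_bytes) >= prefix_len and
            hashlib.sha1(local_pack_bytes[:prefix_len]).hexdigest()
            == prefix_sha1):
        return (prefix_len, remote_size)   # fetch only the appended tail
    return (0, remote_size)                # no shared prefix: fetch all

old_pack = b"P" * 100                      # stand-in for the old pack
adv = (100, hashlib.sha1(old_pack).hexdigest())
print(tail_to_fetch(old_pack, adv, 150))   # (100, 150): 50-byte tail
print(tail_to_fetch(b"X" * 100, adv, 150)) # (0, 150): full fetch
```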
Josef
* Re: Balanced packing strategy
2005-11-13 20:06 ` Josef Weidendorfer
@ 2005-11-13 23:13 ` Junio C Hamano
0 siblings, 0 replies; 8+ messages in thread
From: Junio C Hamano @ 2005-11-13 23:13 UTC (permalink / raw)
To: Petr Baudis; +Cc: git, Josef Weidendorfer
Petr Baudis <pasky@suse.cz> writes:
> This has the property that the second half of a given pack is also
> covered by packs of one-step-finer granularity. This is a relatively
> high overhead (it can be reduced by only keeping the last third or
> whatever), but it is designed to reduce the cost of fetching packs
> over dumb transport.
I have a feeling that you would be better off if, instead, the
repacking on the server side prepared multiple packs, each of which
has all the objects necessary to bring people who were last up to date
at various points in the past up to the most recent repack boundary,
arranged so that you would need only one pack fetch plus individual
objects near the tip.
This obviously needs smarter client-side support.
Suppose we are somewhere after releasing v1.8 and inching
towards v1.9:
In your proposal, the object ranges each pack contains would
look like this:
v1.0..v1.5 --------
v1.5..v1.6 -----
v1.6..v1.7 -------
v1.7..v1.8 -----
individual objects ....
That is, there are slight overlaps, but you would have to fetch
multiple packs if you are really behind.
Instead, you could do this:
v1.0..v1.8 ----------------------
v1.5..v1.8 --------------
v1.6..v1.8 ----------
v1.7..v1.8 ----
individual objects ....
Everybody starts from the tip, fetching individual objects, and
when the last repack boundary (the time we released 1.8) is
reached, the dumb protocol downloader now faces a choice. The
indices are fairly small, so you fetch all of them and see how
many objects you are lacking from each pack. If you were last
up to date a very long time ago, say at v1.2, you would obviously
need to fetch the longest pack. If you were up to date recently,
say after v1.6 was released, you need to fetch only a smaller
pack.
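[Editor's note: the downloader's choice can be sketched as picking the smallest pack whose index covers everything missing. `pick_pack`, the pack names, and the sizes below are all illustrative, not real data.]

```python
def pick_pack(missing, packs):
    """packs: {name: (size_bytes, set_of_object_ids)}.  Return the
    smallest pack whose index covers all missing objects, or None."""
    candidates = [(size, name) for name, (size, objs) in packs.items()
                  if missing <= objs]          # pack covers everything
    return min(candidates)[1] if candidates else None

packs = {
    "v1.0..v1.8.pack": (90_000_000, {"a", "b", "c", "d"}),
    "v1.6..v1.8.pack": (20_000_000, {"c", "d"}),
    "v1.7..v1.8.pack": (8_000_000, {"d"}),
}
# Up to date after v1.6: only the objects from v1.6 on are missing,
# so the mid-size pack suffices.
print(pick_pack({"c", "d"}, packs))   # v1.6..v1.8.pack
# Behind since v1.2: only the longest pack covers everything.
print(pick_pack({"a", "c", "d"}, packs))   # v1.0..v1.8.pack
```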
Given the self-containment requirement, any path that is touched
even once in a period needs at least one full copy in each pack
(all other revisions can be deltified), and I suspect that in
practice the oldest pack (the v1.0..v1.5 pack in your scheme)
would not save much space by not having the v1.5..v1.8 history.
We could tweak things further to do something like this:
v1.0..v1.8 ------------------
v1.5..v1.8 ----------
v1.6..v1.8 ------
v1.7..v1.8 ----
individual objects ....
to also account for the fact that the recent packs cover a shorter
time range and not many paths are touched in them.
* Re: Balanced packing strategy
@ 2005-11-14 5:03 Craig Schlenter
2005-11-14 10:24 ` Josef Weidendorfer
0 siblings, 1 reply; 8+ messages in thread
From: Craig Schlenter @ 2005-11-14 5:03 UTC (permalink / raw)
To: Josef Weidendorfer, git; +Cc: Petr Baudis
Hi
> From: Josef Weidendorfer [mailto:Josef.Weidendorfer@gmx.de]
> [snip]
> AFAICS, the git native protocol (which is nothing more than a pack itself
> for each transfer) even has this problem, too: If you are updating every
> day via git native, the sum of transfered bytes in a month will be a
> multiple of one git transfer for all the month's changes.
Interesting ... is this because in a bigger pack the compression is
better, as there is probably more stuff to delta against?
--Craig
* Re: Balanced packing strategy
2005-11-14 5:03 Balanced packing strategy Craig Schlenter
@ 2005-11-14 10:24 ` Josef Weidendorfer
0 siblings, 0 replies; 8+ messages in thread
From: Josef Weidendorfer @ 2005-11-14 10:24 UTC (permalink / raw)
To: git; +Cc: Craig Schlenter, Petr Baudis
On Monday 14 November 2005 06:03, Craig Schlenter wrote:
> Hi
>
> > From: Josef Weidendorfer [mailto:Josef.Weidendorfer@gmx.de]
> > [snip]
> > AFAICS, the git native protocol (which is nothing more than a pack
> > itself for each transfer) has this problem too: if you update every
> > day via git native, the sum of transferred bytes in a month will be a
> > multiple of one git transfer for all of the month's changes.
>
> Interesting ... is this because in a bigger pack the compression will
> be better as there is probably more stuff to "delta" against?
No, it is because of the self-containment of git packs: deltas (storing
differences instead of full file content) are only allowed against
other objects in the same pack. The self-containment is a safety
measure: you do not want dependencies on the outside, because then
changing or removing another file could destroy the contents of a pack.
So if the changes of a day are 100 one-liners to different files, the
day's pack has to contain the full content of those 100 files. With
incremental packing, you would probably add only the 100 one-liners
(i.e. deltas) at the end of the pack, because the full file contents
are already in the pack if a change to them was recorded in the pack
previously. And this probability is higher if the pack contains a
larger history of the project.
Back to the example: by doing a git transfer every day, you will always
transfer the full contents of the files which were changed that day,
because the git protocol transfers self-contained packs.
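[Editor's note: a back-of-the-envelope model of this effect, with entirely made-up numbers (100 files touched per day, 30 KB average full content, 200-byte deltas), assuming the same files are touched each day:]

```python
def daily_total(files_per_day, full_size, days):
    """Thirty self-contained daily fetches: every touched file is
    carried in full every day."""
    return days * files_per_day * full_size

def monthly_total(files_per_day, full_size, delta_size, days):
    """One self-contained monthly pack: each file is carried in full
    once, and as a small delta for every later revision."""
    revisions = days * files_per_day
    return (files_per_day * full_size
            + (revisions - files_per_day) * delta_size)

print(daily_total(100, 30_000, 30))          # 90000000 bytes (~90 MB)
print(monthly_total(100, 30_000, 200, 30))   # 3580000 bytes (~3.6 MB)
```

Under these (made-up) assumptions the daily fetches transfer about 25 times more than one monthly fetch — the "multiple" mentioned above.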
Josef
2005-11-14 5:03 Balanced packing strategy Craig Schlenter
2005-11-14 10:24 ` Josef Weidendorfer
-- strict thread matches above, loose matches on Subject: below --
2005-11-12 13:04 Remove unneeded packs Marcel Holtmann
2005-11-12 13:40 ` Craig Schlenter
2005-11-12 13:59 ` Balanced packing strategy Petr Baudis
2005-11-12 15:14 ` Craig Schlenter
2005-11-13 2:34 ` Junio C Hamano
2005-11-13 11:00 ` Petr Baudis
2005-11-13 20:06 ` Josef Weidendorfer
2005-11-13 23:13 ` Junio C Hamano