* Repacking many disconnected blobs
  From: Keith Packard @ 2006-06-14 7:17 UTC
  To: Git Mailing List; +Cc: keithp

parsecvs scans every ,v file and creates a blob for every revision of
every file right up front. Once these are created, it discards the
actual file contents and deals solely with the hash values.

The problem is that while this is going on, the repository consists
solely of disconnected objects, and I can't make git-repack put those
into pack objects. This leaves the directories bloated, and operations
within the tree quite sluggish. I'm importing a project with 30000 files
and 30000 revisions (the CVS repository is about 700MB), and after
scanning the files, and constructing (in memory) a complete revision
history, the actual construction of the commits is happening at about 2
per second, and about 70% of that time is in the kernel, presumably
playing around in the repository.

I'm assuming that if I could get these disconnected blobs all neatly
tucked into a pack object, things might go a bit faster.

-- 
keith.packard@intel.com
* Re: Repacking many disconnected blobs
  From: Shawn Pearce @ 2006-06-14 7:29 UTC
  To: Keith Packard; +Cc: Git Mailing List

Keith Packard <keithp@keithp.com> wrote:
> parsecvs scans every ,v file and creates a blob for every revision of
> every file right up front. Once these are created, it discards the
> actual file contents and deals solely with the hash values.
>
> The problem is that while this is going on, the repository consists
> solely of disconnected objects, and I can't make git-repack put those
> into pack objects. This leaves the directories bloated, and operations
> within the tree quite sluggish. I'm importing a project with 30000 files
> and 30000 revisions (the CVS repository is about 700MB), and after
> scanning the files, and constructing (in memory) a complete revision
> history, the actual construction of the commits is happening at about 2
> per second, and about 70% of that time is in the kernel, presumably
> playing around in the repository.
>
> I'm assuming that if I could get these disconnected blobs all neatly
> tucked into a pack object, things might go a bit faster.

What about running git-update-index using .git/objects as the current
working directory and adding all files in ??/* into the index, then
git-write-tree that index and git-commit-tree the tree.

When you are done you have a bunch of orphan trees and a commit but
these shouldn't be very big and I'd guess would prune out with a repack
if you don't hold a ref to the orphan commit.

-- 
Shawn.
* Re: Repacking many disconnected blobs
  From: Johannes Schindelin <Johannes.Schindelin@gmx.de> @ 2006-06-14 9:07 UTC
  To: Shawn Pearce; +Cc: Keith Packard, Git Mailing List

Hi,

On Wed, 14 Jun 2006, Shawn Pearce wrote:

> Keith Packard <keithp@keithp.com> wrote:
> > parsecvs scans every ,v file and creates a blob for every revision of
> > every file right up front. Once these are created, it discards the
> > actual file contents and deals solely with the hash values.
> >
> > The problem is that while this is going on, the repository consists
> > solely of disconnected objects, and I can't make git-repack put those
> > into pack objects. This leaves the directories bloated, and operations
> > within the tree quite sluggish. I'm importing a project with 30000 files
> > and 30000 revisions (the CVS repository is about 700MB), and after
> > scanning the files, and constructing (in memory) a complete revision
> > history, the actual construction of the commits is happening at about 2
> > per second, and about 70% of that time is in the kernel, presumably
> > playing around in the repository.
> >
> > I'm assuming that if I could get these disconnected blobs all neatly
> > tucked into a pack object, things might go a bit faster.
>
> What about running git-update-index using .git/objects as the current
> working directory and adding all files in ??/* into the index, then
> git-write-tree that index and git-commit-tree the tree.
>
> When you are done you have a bunch of orphan trees and a commit but
> these shouldn't be very big and I'd guess would prune out with a repack
> if you don't hold a ref to the orphan commit.

Alternatively, you could construct fake trees like this:

    README/1.1.1.1
    README/1.2
    README/1.3
    ...

i.e. every file becomes a directory -- containing all the versions of
that file -- in the (virtual) tree, which you can point to by a
temporary ref.

Ciao,
Dscho
* Re: Repacking many disconnected blobs
  From: Junio C Hamano @ 2006-06-14 12:33 UTC
  To: Johannes Schindelin; +Cc: git

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> Alternatively, you could construct fake trees like this:
>
>     README/1.1.1.1
>     README/1.2
>     README/1.3
>     ...
>
> i.e. every file becomes a directory -- containing all the versions of
> that file -- in the (virtual) tree, which you can point to by a
> temporary ref.

That would not play well with the packing heuristics, I suspect. If you
reverse it to use rev/file-id, then the same files from different revs
would sort closer, though.
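[A minimal shell sketch of the fake-tree trick, assuming the importer can
emit a "sha1 revision filename" list of the blobs it has written; it uses
the rev/file layout Junio suggests, and the list format, temporary index,
and ref name are illustrative rather than taken from the thread. Dashed
command names are as used in 2006; modern git spells them "git
update-index" etc.]

    # blob.list lines look like: <sha1> <revision> <filename>   (assumed format)
    export GIT_INDEX_FILE=.git/tmp-import-index

    while read sha1 rev file
    do
        # Register the already-existing blob under the fake path "<rev>/<file>";
        # --cacheinfo needs only the mode, sha1 and path, not a file on disk.
        git-update-index --add --cacheinfo 100644 "$sha1" "$rev/$file"
    done < blob.list

    tree=$(git-write-tree)
    commit=$(echo 'temporary tree to make import blobs reachable' |
        git-commit-tree "$tree")
    git-update-ref refs/heads/import-tmp "$commit"

    # The blobs are now reachable, so an ordinary repack packs them and
    # drops the redundant loose objects.
    git repack -d

    # Afterwards the temporary ref and index can simply be deleted.
    rm -f .git/tmp-import-index .git/refs/heads/import-tmp
    unset GIT_INDEX_FILE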
* Re: Repacking many disconnected blobs
  From: Sergey Vlasov @ 2006-06-14 9:37 UTC
  To: Keith Packard; +Cc: git

On Wed, 14 Jun 2006 00:17:58 -0700 Keith Packard wrote:

> parsecvs scans every ,v file and creates a blob for every revision of
> every file right up front. Once these are created, it discards the
> actual file contents and deals solely with the hash values.
>
> The problem is that while this is going on, the repository consists
> solely of disconnected objects, and I can't make git-repack put those
> into pack objects. This leaves the directories bloated, and operations
> within the tree quite sluggish. I'm importing a project with 30000 files
> and 30000 revisions (the CVS repository is about 700MB), and after
> scanning the files, and constructing (in memory) a complete revision
> history, the actual construction of the commits is happening at about 2
> per second, and about 70% of that time is in the kernel, presumably
> playing around in the repository.
>
> I'm assuming that if I could get these disconnected blobs all neatly
> tucked into a pack object, things might go a bit faster.

git-repack.sh basically does:

    git-rev-list --objects --all | git-pack-objects .tmp-pack

When you have only disconnected blobs, obviously the first part does not
work - git-rev-list cannot find these blobs. However, you can do that
part manually - e.g., when you add a blob, do:

    fprintf(list_file, "%s %s\n", sha1, path);

(path should be a relative path in the repo without ",v" or "Attic" - it
is used for delta packing optimization, so getting it wrong will not
cause any corruption, but the pack may become significantly larger).
You may output some duplicate sha1 values, but git-pack-objects should
handle duplicates correctly.

Then just invoke "git-pack-objects --non-empty .tmp-pack <list_file"; it
will output the resulting pack sha1 to stdout. Then you need to move the
pack into place and call git-prune-packed (which does not use object
lists, so it should work even with unreachable objects).

You may even want to repack more than once during the import; probably
the simplest way to do it is to truncate list_file after each repack and
use "git-pack-objects --incremental".
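[A shell sketch of the recipe Sergey outlines, run from the top of the
work tree; list_file and the .tmp-pack base name are just the
placeholders used above.]

    # list_file contains one "<sha1> <path>" line per blob written so far.
    name=$(git-pack-objects --non-empty .tmp-pack < list_file)

    # git-pack-objects printed the pack's sha1; install the pack and its index.
    mv ".tmp-pack-$name.pack" ".git/objects/pack/pack-$name.pack"
    mv ".tmp-pack-$name.idx"  ".git/objects/pack/pack-$name.idx"

    # Remove the loose objects that are now available from the pack.
    git-prune-packed

    # When repacking repeatedly during the import, start the next batch fresh.
    : > list_file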
* Re: Repacking many disconnected blobs
  From: Linus Torvalds @ 2006-06-14 15:53 UTC
  To: Keith Packard; +Cc: Git Mailing List

On Wed, 14 Jun 2006, Keith Packard wrote:
>
> parsecvs scans every ,v file and creates a blob for every revision of
> every file right up front. Once these are created, it discards the
> actual file contents and deals solely with the hash values.
>
> The problem is that while this is going on, the repository consists
> solely of disconnected objects, and I can't make git-repack put those
> into pack objects.

Ok. That's actually _easily_ rectifiable, because it turns out that your
behaviour is something that re-packing is actually really good at
handling.

The thing is, "git repack" (the wrapper function) is all about finding
all the heads of a repository, and then telling the _real_ packing logic
which objects to pack.

In other words, it literally boils down to basically

    git-rev-list --all --objects $rev_list |
        git-pack-objects --non-empty $pack_objects .tmp-pack

where "$rev_list" and "$pack_objects" are just extra flags to the two
phases that you don't really care about.

But the important point to recognize is that the pack generation itself
doesn't care about reachability or anything else AT ALL. The pack is
just a jumble of objects, nothing more.

Which is exactly what you want.

> I'm assuming that if I could get these disconnected blobs all neatly
> tucked into a pack object, things might go a bit faster.

Absolutely. And it's even easy.

What you should do is to just generate a list of objects every once in a
while, and pass that list off to "git-pack-objects", which will create a
pack-file for you. Then you just move the generated pack-file (and index
file) into the .git/objects/pack directory, and then you can run the
normal "git-prune-packed", and you're done.

There's just two small subtle points to look out for:

 - You can list the objects in "most important first" order, if you can.
   That will improve locality later (the packing will try to generate
   the pack so that the order you gave the objects in will be a rough
   order of the result - the first objects will be together at the
   beginning, the last objects will be at the end).

   This is not a huge deal. If you don't have a good order, give them in
   any order, and then after you're done (and you do have branches and
   tag-heads), the final repack (with a regular "git repack") will fix
   it all up. You'll still get all of the size/access advantage of
   packfiles without this, it just won't have the additional "nice IO
   patterns within the packfile" behaviour (which mainly matters for the
   cold-cache case, so you may well not care).

 - Append the filename the object is associated with to the object name
   on the list, if at all possible. This is what git-pack-objects will
   use as part of the heuristic for finding the deltas, so this is
   actually a big deal. If you forget (or mess up) the filename, packing
   will still _work_ - it's just a heuristic, after all, and there are a
   few others too - but the pack-file will have inferior delta chains.

   (The name doesn't have to be the "real name", it really only needs to
   be something unique per *,v file, but the real name is probably best.)

   The corollary to this is that it's better to generate the pack-file
   from a list of every version of a few files than it is to generate it
   from a few versions of every file. Ie, if you process things one file
   at a time, and create every object for that file, that is actually
   good for packing, since there will be the optimal delta opportunity.

In other words, you should just feed git-pack-objects a list of objects
in the form "<sha1><space><filename>\n", and git-pack-objects will do
the rest.

Just as a stupid example, if you were to want to pack just the _tree_
that is the current version of a git archive, you'd do

    git-rev-list --objects HEAD^{tree} |
        git-pack-objects --non-empty .tmp-pack

which you can try on the current git tree just to see: the first line
will generate a list of all objects reachable from the current _tree_
(no history at all), and the second line will create two files under the
name of ".tmp-pack-<sha1-of-object-list>.{pack|idx}".

The reason I suggest doing this for the current tree of the git archive
is simply that you can look at the git-rev-list output with "less", and
see for yourself what it actually does (and there are just a few hundred
objects there: a few tree objects, and the blob objects for every file
in the current HEAD).

So the git pack-format is actually _optimal_ for your particular case,
exactly because the pack-files don't actually care about any high-level
semantics: all they contain is a list of objects.

So in phase 1, when you generate all the objects, the simplest thing to
do is to literally just remember the last five thousand objects or so as
you generate them, and when that array of objects fills up, you just
start the "git-pack-objects" thing, and feed it the list of objects,
move the pack-file into .git/objects/pack/pack-... and do a
"git prune-packed". Then you just continue.

So this should all fit the parsecvs approach very well indeed.

		Linus
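[One way to try the demonstration above and look at what it produces.
git-verify-pack is not mentioned in the thread, but it is the standard
tool for inspecting a pack; .tmp-pack is just the placeholder base name
used above.]

    # Pack only the objects reachable from the current tree -- no history.
    git-rev-list --objects HEAD^{tree} |
        git-pack-objects --non-empty .tmp-pack

    # Two files appear, named after the sha1 printed on stdout:
    #   .tmp-pack-<sha1>.pack and .tmp-pack-<sha1>.idx
    ls .tmp-pack-*

    # List every packed object with its type, size, and delta depth.
    git-verify-pack -v .tmp-pack-*.idx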
* Re: Repacking many disconnected blobs
  From: Keith Packard @ 2006-06-14 17:55 UTC
  To: Linus Torvalds; +Cc: keithp, Git Mailing List

On Wed, 2006-06-14 at 08:53 -0700, Linus Torvalds wrote:

> - You can list the objects in "most important first" order, if you can.
>   That will improve locality later (the packing will try to generate
>   the pack so that the order you gave the objects in will be a rough
>   order of the result - the first objects will be together at the
>   beginning, the last objects will be at the end).

I take every ,v file and construct blobs for every revision. If I
understand this correctly, I should be shuffling the revisions so I send
the latest revision of every file first, then the next-latest revision.
It would be somewhat easier to just send the whole list of revisions for
the first file and then move to the next file, but if shuffling is what
I want, I'll do that.

> The corollary to this is that it's better to generate the pack-file
> from a list of every version of a few files than it is to generate it
> from a few versions of every file. Ie, if you process things one file
> at a time, and create every object for that file, that is actually
> good for packing, since there will be the optimal delta opportunity.

I assumed that was the case. Fortunately, I process each file
separately, so this matches my needs exactly. I should be able to report
on this shortly.

-- 
keith.packard@intel.com
* Re: Repacking many disconnected blobs
  From: Linus Torvalds @ 2006-06-14 18:18 UTC
  To: Keith Packard; +Cc: Git Mailing List

On Wed, 14 Jun 2006, Keith Packard wrote:
> On Wed, 2006-06-14 at 08:53 -0700, Linus Torvalds wrote:
>
> > - You can list the objects in "most important first" order, if you can.
> >   That will improve locality later (the packing will try to generate
> >   the pack so that the order you gave the objects in will be a rough
> >   order of the result - the first objects will be together at the
> >   beginning, the last objects will be at the end).
>
> I take every ,v file and construct blobs for every revision. If I
> understand this correctly, I should be shuffling the revisions so I send
> the latest revision of every file first, then the next-latest revision.
> It would be somewhat easier to just send the whole list of revisions for
> the first file and then move to the next file, but if shuffling is what
> I want, I'll do that.

You don't _need_ to shuffle. As mentioned, it will only affect the
location of the data in the pack-file, which in turn will mostly matter
as an IO pattern thing, not anything really fundamental. If the
pack-file ends up caching well, the IO patterns obviously will never
matter.

Eventually, after the whole import has finished, and you do the final
repack, that one will do things in "recency order" (or "global
reachability order" if you prefer), which means that all the objects in
the final pack will be sorted by how "close" they are to the
top-of-tree. And that will happen regardless of what the intermediate
ordering has been.

So if shuffling is inconvenient, just don't do it. On the other hand, if
you know that you generated the blobs "oldest to newest", just print
them in the reverse order when you end up repacking, and you're all done
(if you just save the info into some array before you repack, just walk
the array backwards).

		Linus
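[If the object list was written oldest-to-newest, reversing it before
feeding it to git-pack-objects gives the newest-first order described
above; tac (GNU coreutils) is just one way to reverse a file, and the
pack still gets installed with the same mv/git-prune-packed steps as in
the earlier sketch.]

    tac list_file | git-pack-objects --non-empty .tmp-pack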
* Re: Repacking many disconnected blobs
  From: Linus Torvalds @ 2006-06-14 18:52 UTC
  To: Keith Packard; +Cc: Git Mailing List

On Wed, 14 Jun 2006, Linus Torvalds wrote:
>
> You don't _need_ to shuffle. As mentioned, it will only affect the
> location of the data in the pack-file, which in turn will mostly matter
> as an IO pattern thing, not anything really fundamental. If the
> pack-file ends up caching well, the IO patterns obviously will never
> matter.

Actually, thinking about it more, the way you do things, shuffling
probably won't even help. Why? Because you'll obviously have multiple
files, and even if each file were to be sorted "correctly", the access
patterns from any global standpoint won't really matter, because you'd
probably bounce back and forth in the pack-file anyway.

So if anything, I would say:

 - just dump them into the packfile in whatever order is most
   convenient

 - if you know that later phases will go through the objects and
   actually use them (as opposed to just building trees out of their
   SHA1 values) in some particular order, _that_ might be the ordering
   to use.

 - in many ways, getting good delta chains is _much_ more important,
   since "git repack -a -d" will re-use good deltas from a previous
   pack, but will _not_ care about any ordering in the old pack. As
   well as obviously improving the size of the temporary pack-files
   anyway.

I'll pontificate more if I can think of any other cases that might
matter.

		Linus
* Re: Repacking many disconnected blobs
  From: Keith Packard @ 2006-06-14 18:59 UTC
  To: Linus Torvalds; +Cc: keithp, Git Mailing List

On Wed, 2006-06-14 at 11:18 -0700, Linus Torvalds wrote:

> You don't _need_ to shuffle. As mentioned, it will only affect the
> location of the data in the pack-file, which in turn will mostly matter
> as an IO pattern thing, not anything really fundamental. If the
> pack-file ends up caching well, the IO patterns obviously will never
> matter.

Ok, sounds like shuffling isn't necessary; the only benefit packing
gains me is to reduce the size of each directory in the object store;
the process I follow is to construct blobs for every revision, then just
use the sha1 values to construct an index for each commit. I never
actually look at the blobs myself, so IO access patterns aren't
relevant.

Repacking after the import is completed should undo whatever horror show
I've created in any case.

-- 
keith.packard@intel.com
* Re: Repacking many disconnected blobs
  From: Linus Torvalds @ 2006-06-14 19:18 UTC
  To: Keith Packard; +Cc: Git Mailing List

On Wed, 14 Jun 2006, Keith Packard wrote:
>
> Ok, sounds like shuffling isn't necessary; the only benefit packing
> gains me is to reduce the size of each directory in the object store;

There's actually a secondary benefit to packing that turned out to be
much bigger from a performance standpoint: the size benefit coupled with
the fact that it's all in one file ends up meaning that accessing packed
objects is _much_ faster than accessing individual files.

The Linux system call overhead is one of the lowest ones out there, but
it's still much bigger than just a function call, and doing a full
pathname walk and open/close is bigger yet. In contrast, if you access
lots of objects and they are all in a pack, you only end up doing one
mmap and a page fault for each 4kB entry, and that's it.

So packing has a large performance benefit outside of the actual disk
use one, and to some degree that performance benefit is then further
magnified by good locality (ie you get more effective objects per page
fault), but in your case that locality issue is secondary.

I assume that you never actually end up looking at the _contents_ of the
objects any more ever afterwards, because in a very real sense you're
really interested in the SHA1 names, right? All the latter phases of
parsecvs will just use the SHA1 names directly, and never actually even
open the data (packed or not).

So in that sense, you only care about the disk size and a much improved
directory walk from fewer files (until the repository has actually been
fully created, at which point a repack will do the right thing).

		Linus
* Re: Repacking many disconnected blobs
  From: Nicolas Pitre @ 2006-06-14 19:25 UTC
  To: Keith Packard; +Cc: Linus Torvalds, Git Mailing List

On Wed, 14 Jun 2006, Keith Packard wrote:

> On Wed, 2006-06-14 at 11:18 -0700, Linus Torvalds wrote:
>
> > You don't _need_ to shuffle. As mentioned, it will only affect the
> > location of the data in the pack-file, which in turn will mostly matter
> > as an IO pattern thing, not anything really fundamental. If the
> > pack-file ends up caching well, the IO patterns obviously will never
> > matter.
>
> Ok, sounds like shuffling isn't necessary; the only benefit packing
> gains me is to reduce the size of each directory in the object store;
> the process I follow is to construct blobs for every revision, then just
> use the sha1 values to construct an index for each commit. I never
> actually look at the blobs myself, so IO access patterns aren't
> relevant.
>
> Repacking after the import is completed should undo whatever horror show
> I've created in any case.

The only advantage of feeding object names from latest to oldest has to
do with the delta direction. In doing so the deltas are backward, such
that objects with deeper delta chains are further back in history, and
this is what you want in the final pack for faster access to the latest
revision.

Of course the final repack will do that automatically, but only if you
use -a -f with git-repack. But when -f is not provided, then already
deltified objects from other packs are copied as is, without any delta
computation, making the repack process lots faster. In that case it
might be preferable that the reuse of already deltified data is made of
backward deltas, which is the reason you might consider feeding objects
in the preferred order up front.

Nicolas
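[The two final-repack behaviours contrasted above, written out as
commands; the flags are the ones named in the thread, shown here side by
side as a sketch.]

    # Reuse the deltas already present in the import packs: fast, but the
    # delta direction chosen while feeding objects during the import is kept.
    git repack -a -d

    # Recompute all deltas from scratch: slower, but the import-time delta
    # direction and object ordering no longer matter.
    git repack -a -d -f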
* Re: Repacking many disconnected blobs
  From: Keith Packard @ 2006-06-14 21:05 UTC
  To: Nicolas Pitre; +Cc: keithp, Linus Torvalds, Git Mailing List

On Wed, 2006-06-14 at 15:25 -0400, Nicolas Pitre wrote:

> The only advantage of feeding object names from latest to oldest has to
> do with the delta direction. In doing so the deltas are backward, such
> that objects with deeper delta chains are further back in history, and
> this is what you want in the final pack for faster access to the latest
> revision.

Ok, so I'm feeding them from latest to oldest along each branch, which
optimizes only the 'master' branch, leaving other branches much further
down in the data file. That should mean repacking will help a lot for
repositories with many active branches.

> In that case it might be preferable that the reuse of already deltified
> data is made of backward deltas, which is the reason you might consider
> feeding objects in the preferred order up front.

Hmm. As I'm deltafying along branches, the delta data should actually be
fairly good; the only 'bad' result will be the sub-optimal object
ordering in the pack files. I'll experiment with some larger trees to
see how much additional savings the various repack options yield.

-- 
keith.packard@intel.com
* Re: Repacking many disconnected blobs
  From: Linus Torvalds @ 2006-06-14 21:17 UTC
  To: Keith Packard; +Cc: Nicolas Pitre, Git Mailing List

On Wed, 14 Jun 2006, Keith Packard wrote:
>
> > In that case it might be preferable that the reuse of already deltified
> > data is made of backward deltas, which is the reason you might consider
> > feeding objects in the preferred order up front.
>
> Hmm. As I'm deltafying along branches, the delta data should actually be
> fairly good; the only 'bad' result will be the sub-optimal object
> ordering in the pack files. I'll experiment with some larger trees to
> see how much additional savings the various repack options yield.

The fact that git repacking sorts by filesize after it sorts by filename
should make this a non-issue: we always try to delta against the larger
version (where "larger" is not only almost invariably also "newer", but
the delta is simpler, since deleting data doesn't take up any space in
the delta, while adding data needs to say what the data added was, of
course).

		Linus
* Re: Repacking many disconnected blobs
  From: Nicolas Pitre @ 2006-06-14 21:20 UTC
  To: Keith Packard; +Cc: Linus Torvalds, Git Mailing List

On Wed, 14 Jun 2006, Keith Packard wrote:

> Hmm. As I'm deltafying along branches, the delta data should actually be
> fairly good; the only 'bad' result will be the sub-optimal object
> ordering in the pack files. I'll experiment with some larger trees to
> see how much additional savings the various repack options yield.

Note that the object list order is unlikely to affect pack size. It is
really about optimizing the pack layout for subsequent access to it.

Nicolas
Thread overview: 15 messages

  2006-06-14  7:17 Repacking many disconnected blobs -- Keith Packard
  2006-06-14  7:29 ` Shawn Pearce
  2006-06-14  9:07   ` Johannes Schindelin
  2006-06-14 12:33     ` Junio C Hamano
  2006-06-14  9:37 ` Sergey Vlasov
  2006-06-14 15:53 ` Linus Torvalds
  2006-06-14 17:55   ` Keith Packard
  2006-06-14 18:18     ` Linus Torvalds
  2006-06-14 18:52       ` Linus Torvalds
  2006-06-14 18:59       ` Keith Packard
  2006-06-14 19:18         ` Linus Torvalds
  2006-06-14 19:25         ` Nicolas Pitre
  2006-06-14 21:05           ` Keith Packard
  2006-06-14 21:17             ` Linus Torvalds
  2006-06-14 21:20             ` Nicolas Pitre