All of lore.kernel.org
 help / color / mirror / Atom feed
From: Sergey Vlasov <vsu@altlinux.ru>
To: Keith Packard <keithp@keithp.com>
Cc: git@vger.kernel.org
Subject: Re: Repacking many disconnected blobs
Date: Wed, 14 Jun 2006 13:37:41 +0400	[thread overview]
Message-ID: <20060614133741.73cd80cb.vsu@altlinux.ru> (raw)
In-Reply-To: <1150269478.20536.150.camel@neko.keithp.com>

[-- Attachment #1: Type: text/plain, Size: 2149 bytes --]

On Wed, 14 Jun 2006 00:17:58 -0700 Keith Packard wrote:

> parsecvs scans every ,v file and creates a blob for every revision of
> every file right up front. Once these are created, it discards the
> actual file contents and deals solely with the hash values.
> 
> The problem is that while this is going on, the repository consists
> solely of disconnected objects, and I can't make git-repack put those
> into pack objects. This leaves the directories bloated, and operations
> within the tree quite sluggish. I'm importing a project with 30000 files
> and 30000 revisions (the CVS repository is about 700MB), and after
> scanning the files, and constructing (in memory) a complete revision
> history, the actual construction of the commits is happening at about 2
> per second, and about 70% of that time is in the kernel, presumably
> playing around in the repository.
> 
> I'm assuming that if I could get these disconnected blobs all neatly
> tucked into a pack object, things might go a bit faster.

git-repack.sh basically does:

  git-rev-list --objects --all | git-pack-objects .tmp-pack

When you have only disconnected blobs, obviously the first part does
not work - git-rev-list cannot find these blobs.  However, you can do
that part manually - e.g., when you add a blob, do:

  fprintf(list_file, "%s %s\n", sha1, path);

(path should be a relative path in the repo without ",v" or "Attic" -
it is used for delta packing optimization, so getting it wrong will
not cause any corruption, but the pack may become significantly
larger).  You may output some duplicate sha1 values, but
git-pack-objects should handle duplicates correctly.

Then just invoke "git-pack-objects --non-empty .tmp_pack <list_file";
it will output the resulting pack sha1 to stdout.  Then you need to
move the pack into place and call git-prune-packed (which does not
use object lists, so it should work even with unreachable objects).

You may even want to repack more than once during the import;
probably the simplest way to do it is to truncate list_file after
each repack and use "git-pack-objects --incremental".

[-- Attachment #2: Type: application/pgp-signature, Size: 190 bytes --]

  parent reply	other threads:[~2006-06-14  9:37 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2006-06-14  7:17 Repacking many disconnected blobs Keith Packard
2006-06-14  7:29 ` Shawn Pearce
2006-06-14  9:07   ` Johannes Schindelin
2006-06-14 12:33     ` Junio C Hamano
2006-06-14  9:37 ` Sergey Vlasov [this message]
2006-06-14 15:53 ` Linus Torvalds
2006-06-14 17:55   ` Keith Packard
2006-06-14 18:18     ` Linus Torvalds
2006-06-14 18:52       ` Linus Torvalds
2006-06-14 18:59       ` Keith Packard
2006-06-14 19:18         ` Linus Torvalds
2006-06-14 19:25         ` Nicolas Pitre
2006-06-14 21:05           ` Keith Packard
2006-06-14 21:17             ` Linus Torvalds
2006-06-14 21:20             ` Nicolas Pitre

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20060614133741.73cd80cb.vsu@altlinux.ru \
    --to=vsu@altlinux.ru \
    --cc=git@vger.kernel.org \
    --cc=keithp@keithp.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.