git.vger.kernel.org archive mirror
From: "Chris Lee" <chris133@gmail.com>
To: "Linus Torvalds" <torvalds@osdl.org>
Subject: Re: git-svnimport failed and now git-repack hates me
Date: Wed, 3 Jan 2007 18:16:51 -0800
Message-ID: <204011cb0701031816hda8af9bw4d4a469c2b111339@mail.gmail.com>
In-Reply-To: <Pine.LNX.4.64.0701031737300.4989@woody.osdl.org>

On 1/3/07, Linus Torvalds <torvalds@osdl.org> wrote:
> > So I'm using git 1.4.1, and I have been experimenting with importing
> > the KDE sources from Subversion using git-svnimport.
>
> As one single _huge_ import? All the sub-projects together? I have to say,
> that sounds pretty horrid.

Unfortunately, that's how the KDE repo is organized. (I tried arguing
against this when they were going to do the original import, but I
lost the argument.) And git-svnimport doesn't appear to have any sort
of method for splitting a gigantic svn repo into several smaller git
repos.

> > First issue I ran into: On a machine with 4GB of RAM, when I tried to
> > do a full import, git-svnimport died after 309906 revisions, saying
> > that it couldn't fork.
> >
> > Checking `top` and `ps` revealed that there were no git-svnimport
> > processes doing anything, but all of my 4G of RAM was still marked as
> > used by the kernel. I had to do sysctl -w vm.drop_caches=3 to get it
> > to free all the RAM that the svn import had used up.
>
> I think that was just all cached, and all ok. The reason you didn't see
> any git-svnimport was that it had died off already, and all your memory
> was just caches. You could just have left it alone, and the kernel would
> have started re-using the memory for other things even without any
> "drop_caches".
>
> But what you did there didn't make anything worse, it just likely had
> no real impact.

I got the tip about drop_caches from davej. Normally, when a process
taking up a huge amount of memory exits, it shows a bunch of free
memory in `top` and friends. I was a little bit surprised when that
didn't happen this time.
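
The difference between "free" and "cached" memory discussed above can be illustrated with a small sketch. It parses text in the format of /proc/meminfo and adds up the fields that the kernel can usually hand back to other processes. The sample figures are invented for illustration (they are not measurements from this machine), and the function names are hypothetical.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of values in kB."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def roughly_available_kb(info):
    """Rough estimate only: free memory plus buffers and page cache.

    Not every cached page can be dropped at once (dirty pages must be
    written back first), so treat this as an upper bound.
    """
    return info.get("MemFree", 0) + info.get("Buffers", 0) + info.get("Cached", 0)


# Invented sample resembling a 4GB machine right after a large import.
SAMPLE = """\
MemTotal:        4000000 kB
MemFree:          100000 kB
Buffers:           50000 kB
Cached:          3500000 kB
"""

info = parse_meminfo(SAMPLE)
print("free:", info["MemFree"], "kB")
print("roughly available:", roughly_available_kb(info), "kB")
```

On a real system the same functions could be fed the contents of /proc/meminfo; a small MemFree next to a large Cached value matches the "it was all just caches" explanation.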

> However, it does sound like git-svnimport probably acts like git-cvsimport
> used to, and just keeps too much in memory - so it's never going to act
> really nicely..
>
> It also looks like git-svnimport never repacks the repo, which is
> absolutely horrible for performance on all levels. The CVS importer
> repacks every one thousand commits or something like that.

Yeah. I haven't bothered hacking git-svnimport yet - but it looks like
having it automatically repack every thousand revisions or so would
probably be a pretty big win.

> > Now, after that, I tried doing `git-repack -a` because I wanted to see
> > how small the packed archive would be (before trying to continue
> > importing the rest of the revisions. There are at least another 100k
> > revisions that I should be able to import, eventually.)
>
> I suspect you'd have been better off just re-starting, and using something
> like
>
>         while :
>         do
>                 git svnimport -l 1000 <...>
>                 .. figure out some way to decide if it's all done ..
>                 git repack -d
>         done
>
> which would make svnimport act a bit more sanely, and repack
> incrementally. That should make both the import much faster, _and_ avoid
> any insane big repack at the end (well, you'd still want to do a "git
> repack -a -d" at the end to turn the many smaller packs into a bigger one,
> but it would be nicer).
>
> However, I don't know what the proper magic is for svnimport to do that
> sane "do it in chunks and tell when you're all done". Or even better - to
> just make it repack properly and not keep everything in memory.

You can pass limits to svnimport to give it a revision to start at and
another one to end at, so that wouldn't be too bad - I was thinking
about working around it like that (so that I don't have to go poking
around in the Perl code behind the svn importer).
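
A minimal sketch of that workaround, assuming the importer accepts an inclusive start and end revision per run. It only builds the list of commands and does not execute anything. The `-l` flag appears in the quoted loop above; `-s` as the start-revision flag, the `svn_url` argument and the function names are assumptions made for illustration, and whether adjacent chunks should share a boundary revision would need checking against the importer itself.

```python
def revision_chunks(first, last, size):
    """Split the inclusive range [first, last] into chunks of at most `size`."""
    if size <= 0:
        raise ValueError("size must be positive")
    chunks = []
    start = first
    while start <= last:
        end = min(start + size - 1, last)
        chunks.append((start, end))
        start = end + 1
    return chunks


def plan_commands(first, last, size, svn_url):
    """Build (but do not run) an import-then-repack command plan.

    Assumption: `-s` sets the start revision and `-l` the last revision.
    """
    plan = []
    for start, end in revision_chunks(first, last, size):
        plan.append(["git-svnimport", "-s", str(start), "-l", str(end), svn_url])
        plan.append(["git-repack", "-d"])  # incremental repack per chunk
    plan.append(["git-repack", "-a", "-d"])  # final pass: merge the small packs
    return plan


if __name__ == "__main__":
    # Hypothetical URL and revision range, for illustration only.
    for cmd in plan_commands(1, 2500, 1000, "svn://example.invalid/repo"):
        print(" ".join(cmd))
```

Printing the plan first makes it easy to check the chunk boundaries before committing to a multi-hour import.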

By default, if I had, say, one pack with the first 1000 revisions, and
I imported another 1000, running 'git-repack' on its own would leave
the first pack alone and create a new pack with just the second 1000
revisions, right?

> > The repack finished after about nine hours, but when I try to do a
> > git-verify-pack on it, it dies with this error message:
> >
> > error: Packfile
> > .git/objects/pack/pack-540263fe66ab9398cc796f000d52531a5c6f3df3.pack
> > SHA1 mismatch with itself
>
> That sounds suspiciously like the bug we had in our POWER sha1
> implementation that would generate the wrong SHA1 for any pack-file that
> was over 512MB in size, due to an overflow in 32 bits (SHA1 does some
> counting in _bits_, so 512MB is 4G _bits_).
>
> Now, I assume you're not on POWER (and we fixed that bug anyway - and I
> think long before 1.4.1 too), but I could easily imagine the same bug in
> some other SHA1 implementation (or perhaps _another_ overflow at the 1GB
> or 2GB mark..). I assume that the pack-file you had was something horrid..
>
> I hope this is with a 64-bit kernel and a 64-bit user space? That should
> limit _some_ of the issues. But I would still not be surprised if your
> SHA1 libraries had some 32-bit ("unsigned int") or 31-bit ("int") limits
> in them somewhere - very few people do SHA1's over huge areas, and even
> when you do SHA1 on something like a DVD image (which is easily over any
> 4GB limit), that tends to be done as many smaller calls to the SHA1
> library routines.

This is on a dual-CPU dual-core Opteron, running the AMD64 variant of
Ubuntu's Edgy release (64-bit kernel, 64-bit native userland). The
pack-file was around 2.3GB.
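
The arithmetic behind the 512MB limit mentioned in the quoted text can be checked with a short sketch. It models a length counter that keeps only 32 bits, and shows that feeding data to SHA1 in many small calls gives the same digest as one large call. This is an illustration using Python's hashlib, not the SHA1 code git actually uses, and the helper names are hypothetical. For scale, a pack of around 2.3GB is past both the 512MB bit-count wrap and the 2GB limit of a signed 32-bit byte count, though that alone does not show which limit (if any) was hit here.

```python
import hashlib

MASK32 = 0xFFFFFFFF  # keep only the low 32 bits, like an unsigned 32-bit int


def bit_count_32(num_bytes):
    """Message length in bits as a 32-bit counter would store it."""
    return (num_bytes * 8) & MASK32


def sha1_in_chunks(data, chunk_size):
    """Hash `data` with many small update() calls instead of one big one."""
    h = hashlib.sha1()
    for offset in range(0, len(data), chunk_size):
        h.update(data[offset:offset + chunk_size])
    return h.hexdigest()


if __name__ == "__main__":
    mib = 1024 * 1024
    # 512 MiB is exactly 2**32 bits, so a 32-bit bit counter wraps to zero.
    print(hex(bit_count_32(512 * mib - 1)))
    print(hex(bit_count_32(512 * mib)))

    payload = b"abc" * 1000
    assert sha1_in_chunks(payload, 64) == hashlib.sha1(payload).hexdigest()
```

Hashing in bounded chunks keeps each individual call well below any 32-bit size limit in the library, which is the usual way large files such as DVD images get hashed.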
