Memory issue with fast-import, why track branches?

* Memory issue with fast-import, why track branches?
@ 2008-12-21  5:54 Felipe Contreras
  2008-12-21  8:10 ` John Chapman
  2008-12-21 22:17 ` Shawn O. Pearce
  0 siblings, 2 replies; 5+ messages in thread
From: Felipe Contreras @ 2008-12-21  5:54 UTC
  To: git list

Hi,

I tracked down an issue I have when importing a big repository. For
some reason memory usage keeps increasing until there is no more
memory.

Here is what valgrind shows:
==21034== 471,080,280 bytes in 114,517 blocks are still reachable in loss record 8 of 8
==21034==    at 0x4004BA2: calloc (vg_replace_malloc.c:397)
==21034==    by 0x806A340: xcalloc (wrapper.c:75)
==21034==    by 0x8063BC1: use_pack (sha1_file.c:808)
==21034==    by 0x8063DA9: unpack_object_header (sha1_file.c:1443)
==21034==    by 0x8064F4F: unpack_entry (sha1_file.c:1736)
==21034==    by 0x8065393: cache_or_unpack_entry (sha1_file.c:1606)
==21034==    by 0x8065464: read_packed_sha1 (sha1_file.c:2000)
==21034==    by 0x80655E5: read_object (sha1_file.c:2090)
==21034==    by 0x8065677: read_sha1_file (sha1_file.c:2106)
==21034==    by 0x8056AE9: parse_object (object.c:190)
==21034==    by 0x805E90A: write_ref_sha1 (refs.c:1214)
==21034==    by 0x804CC4F: update_branch (fast-import.c:1558)

After looking at the code my guess is that I have a humongous number
of branches.

Actually they are not really branches, but refs. For each git commit
there's an original mtn ref that I store in 'refs/mtn/sha1', but since
I'm using 'commit refs/mtn/sha1' to store it, a branch is created for
every commit.

I guess there are many ways to fix the issue, but for starters I
wonder why fast-import is keeping track of all the branches. In my
case I would like fast-import to work exactly the same whether I
specify branches or not (I'll update them later).

Cheers.

-- 
Felipe Contreras

* Re: Memory issue with fast-import, why track branches?
  2008-12-21  5:54 Memory issue with fast-import, why track branches? Felipe Contreras
@ 2008-12-21  8:10 ` John Chapman
  2008-12-21 11:23   ` Felipe Contreras
  2008-12-21 22:17 ` Shawn O. Pearce
  1 sibling, 1 reply; 5+ messages in thread
From: John Chapman @ 2008-12-21  8:10 UTC
  To: Felipe Contreras; +Cc: git list

My first response was along the lines of "Why the heck are you storing
sha1's like that!?", until I realised that you're not storing actual git
sha1's, but mtn's hashes, which does make sense.

I'm doing something very similar with my perforce scripts, however I am
doing a bit more magic instead of making so many branches.

Instead of making branches, I make a tag for each and every
changeset.  Every time I make a new git commit, if I need to do it from
a tag, I first read the tag and determine the sha1 I should use, and use
that instead.

Alternatively, you could choose to manage your mapping yourself, and
write them to a .git/mtg-git-map file.
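
A minimal sketch of that kind of self-managed mapping, in Python. It
assumes the frontend gives each monotone revision a fast-import mark and
that the import is run with `git fast-import --export-marks=<file>`,
which writes one `:<mark> <sha1>` line per mark. The function names and
the frontend-side revision-to-mark table are illustrative assumptions,
not something either script in this thread is known to do.

```python
def parse_marks(text):
    """Parse the contents of a fast-import marks file into {mark: sha1}."""
    marks = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each line looks like ":<mark> <sha1>".
        mark, sha1 = line.split()
        marks[int(mark.lstrip(":"))] = sha1
    return marks


def build_rev_map(rev_to_mark, marks):
    """Join the frontend's {mtn_rev: mark} table with {mark: git_sha1}.

    Revisions whose mark is missing from the marks file are skipped.
    """
    return {
        rev: marks[mark]
        for rev, mark in rev_to_mark.items()
        if mark in marks
    }
```

The resulting dictionary can be written to whatever map file the
frontend prefers; no ref is created per revision.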

On Sun, 2008-12-21 at 07:54 +0200, Felipe Contreras wrote:
> Hi,
> 
> I tracked down an issue I have when importing a big repository. For
> some reason memory usage keeps increasing until there is no more
> memory.
> 
> Here is what valgrind shows:
> ==21034== 471,080,280 bytes in 114,517 blocks are still reachable in loss record 8 of 8
> ==21034==    at 0x4004BA2: calloc (vg_replace_malloc.c:397)
> ==21034==    by 0x806A340: xcalloc (wrapper.c:75)
> ==21034==    by 0x8063BC1: use_pack (sha1_file.c:808)
> ==21034==    by 0x8063DA9: unpack_object_header (sha1_file.c:1443)
> ==21034==    by 0x8064F4F: unpack_entry (sha1_file.c:1736)
> ==21034==    by 0x8065393: cache_or_unpack_entry (sha1_file.c:1606)
> ==21034==    by 0x8065464: read_packed_sha1 (sha1_file.c:2000)
> ==21034==    by 0x80655E5: read_object (sha1_file.c:2090)
> ==21034==    by 0x8065677: read_sha1_file (sha1_file.c:2106)
> ==21034==    by 0x8056AE9: parse_object (object.c:190)
> ==21034==    by 0x805E90A: write_ref_sha1 (refs.c:1214)
> ==21034==    by 0x804CC4F: update_branch (fast-import.c:1558)
> 
> After looking at the code my guess is that I have a humongous amount
> of branches.
> 
> Actually they are not really branches, but refs. For each git commit
> there's an original mtn ref that I store in 'refs/mtn/sha1', but since
> I'm using 'commit refs/mtn/sha1' to store it, a branch is created for
> every commit.
> 
> I guess there are many ways to fix the issue, but for starters I
> wonder why is fast-import keeping track of all the branches? In my
> case I would like fast-import to work exactly the same if I specify
> branches or not (I'll update them later).
> 
> Cheers.
> 

* Re: Memory issue with fast-import, why track branches?
  2008-12-21  8:10 ` John Chapman
@ 2008-12-21 11:23   ` Felipe Contreras
  0 siblings, 0 replies; 5+ messages in thread
From: Felipe Contreras @ 2008-12-21 11:23 UTC
  To: John Chapman; +Cc: git list

On Sun, Dec 21, 2008 at 10:10 AM, John Chapman <thestar@fussycoder.id.au> wrote:
> My first response was along the lines of "Why the heck are you storing
> sha1's like that!?", until I realised that you're not storing actual git
> sha1's, but mtn's hashes, which does make sense.

Yes :)

> I'm doing something very similar with my perforce scripts, however I am
> doing a bit more magic instead of making so many branches.
>
> Instead of making branches, I make a tag instead, for each and every
> changeset.  Every time I make a new git commit, if I need to do it from
> a tag, I first read the tag and determine the sha1 I should use, and use
> that instead.

Well, simple tags and branches are exactly the same thing: refs. tags
are in 'refs/tags' and branches in 'refs/heads'; 'refs/mtn' are not
really branches.

> Alternatively, you could choose to manage your mapping yourself, and
> write them to a .git/mtg-git-map file.

The advantage of my approach is that the git tools handle all the mtn
sha1's almost as well as git sha1's; I just need to prepend 'mtn/'.

Also, git name-rev finds the mtn revision of a git commit. It's all so
convenient.

The only problem is that fast-import seems to be doing something wrong
with those "branches".

-- 
Felipe Contreras

* Re: Memory issue with fast-import, why track branches?
  2008-12-21  5:54 Memory issue with fast-import, why track branches? Felipe Contreras
  2008-12-21  8:10 ` John Chapman
@ 2008-12-21 22:17 ` Shawn O. Pearce
  2008-12-22  2:36   ` Felipe Contreras
  1 sibling, 1 reply; 5+ messages in thread
From: Shawn O. Pearce @ 2008-12-21 22:17 UTC
  To: Felipe Contreras; +Cc: git list

Felipe Contreras <felipe.contreras@gmail.com> wrote:
> I tracked down an issue I have when importing a big repository. For
> some reason memory usage keeps increasing until there is no more
> memory.
> 
> After looking at the code my guess is that I have a humongous amount
> of branches.
> 
> Actually they are not really branches, but refs. For each git commit
> there's an original mtn ref that I store in 'refs/mtn/sha1', but since
> I'm using 'commit refs/mtn/sha1' to store it, a branch is created for
> every commit.
> 
> I guess there are many ways to fix the issue, but for starters I
> wonder why is fast-import keeping track of all the branches? In my
> case I would like fast-import to work exactly the same if I specify
> branches or not (I'll update them later).

Because fast-import has to buffer them until the pack file is done.
The objects aren't available to the repository until after a
checkpoint is sent or until the stream ends.  Either way until
then fast-import has to buffer the refs so they don't get exposed
to other git processes reading that same repository, because they
would point to objects that the process cannot find.

I guess it could release the branch memory after it dumps the
branches in a checkpoint, but its memory allocators work under an
assumption that strings (like branch and file names) will be reused
heavily by the frontend and thus they are pooled inside of a string
pool.  The branch objects are also pooled inside of a common alloc
pool, to amortize the cost of malloc's block headers out over the
data used.

IOW, fast-import was designed for ~5k branches, not ~1 million
unique branches.

-- 
Shawn.
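
The pooling described above can be pictured with a toy model. This is a
sketch in Python, not fast-import's C code: `StringPool` and
`pool_size_after` are hypothetical names, and the only property it is
meant to show is that a pool which never frees entries stays tiny when
names repeat and grows by one entry for every unique name.

```python
class StringPool:
    """Toy interning pool: each distinct string is stored once, never freed."""

    def __init__(self):
        self._pool = {}

    def intern(self, name):
        # Return the pooled copy, adding it the first time it is seen.
        return self._pool.setdefault(name, name)

    def __len__(self):
        return len(self._pool)


def pool_size_after(names):
    """Intern every name and report how many entries the pool holds."""
    pool = StringPool()
    for name in names:
        pool.intern(name)
    return len(pool)
```

With one branch name repeated a thousand times the pool holds a single
entry; with a thousand unique 'refs/mtn/<sha1>' names it holds a
thousand, which is the access pattern the import in this thread
produces.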

* Re: Memory issue with fast-import, why track branches?
  2008-12-21 22:17 ` Shawn O. Pearce
@ 2008-12-22  2:36   ` Felipe Contreras
  0 siblings, 0 replies; 5+ messages in thread
From: Felipe Contreras @ 2008-12-22  2:36 UTC
  To: Shawn O. Pearce; +Cc: git list

On Mon, Dec 22, 2008 at 12:17 AM, Shawn O. Pearce <spearce@spearce.org> wrote:
> Felipe Contreras <felipe.contreras@gmail.com> wrote:
>> I tracked down an issue I have when importing a big repository. For
>> some reason memory usage keeps increasing until there is no more
>> memory.
>>
>> After looking at the code my guess is that I have a humongous amount
>> of branches.
>>
>> Actually they are not really branches, but refs. For each git commit
>> there's an original mtn ref that I store in 'refs/mtn/sha1', but since
>> I'm using 'commit refs/mtn/sha1' to store it, a branch is created for
>> every commit.
>>
>> I guess there are many ways to fix the issue, but for starters I
>> wonder why is fast-import keeping track of all the branches? In my
>> case I would like fast-import to work exactly the same if I specify
>> branches or not (I'll update them later).
>
> Because fast-import has to buffer them until the pack file is done.
> The objects aren't available to the repository until after a
> checkpoint is sent or until the stream ends.  Either way until
> then fast-import has to buffer the refs so they don't get exposed
> to other git processes reading that same repository, because they
> would point to objects that the process cannot find.
>
> I guess it could release the branch memory after it dumps the
> branches in a checkpoint, but its memory allocators work under an
> assumption that strings (like branch and file names) will be reused
> heavily by the frontend and thus they are pooled inside of a string
> pool.  The branch objects are also pooled inside of a common alloc
> pool, to amortize the cost of malloc's block headers out over the
> data used.
>
> IOW, fast-import was designed for ~5k branches, not ~1 million
> unique branches.

My point is: why is it not designed for 0 branches? In many places in
the code there's the assumption that the tree = branch, but that's not
always the case. You can specify a 'from sha1' and then the branch
becomes irrelevant.

In fact in monotone some commits are not part of any branch, and many
are part of multiple branches. Those cases can't be handled by
fast-import right now. Not to mention random refs like 'refs/mtn/foo'
which would come in handy for my script.

Now my question is: would it be possible to get rid of the notion of
branches in fast-import and go for refs instead?

On the other hand if branch memory is freed after a checkpoint then
there's no limit to how many 'branches' can be handled.

-- 
Felipe Contreras
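
One way to approximate "update them later" from the frontend side,
sketched in Python under several assumptions: every commit is written
to a single ref (the name `refs/heads/import` is made up), parents are
named with marks via `from :<mark>`, and the per-revision refs are
created after the import with `git update-ref --stdin`, which needs a
Git newer than the one discussed in this thread. The helper names are
hypothetical, and this does not change how fast-import itself tracks
branches.

```python
def commit_command(ref, mark, parent_mark, committer, when, message):
    """Return one fast-import 'commit' command as bytes.

    committer is "Name <email>", when is "<seconds> <tz>", and
    parent_mark is None for a root commit.
    """
    msg = message.encode("utf-8")
    header = [
        b"commit " + ref.encode("utf-8"),
        b"mark :%d" % mark,
        b"committer " + committer.encode("utf-8") + b" " + when.encode("ascii"),
        b"data %d" % len(msg),
    ]
    out = b"\n".join(header) + b"\n" + msg + b"\n"
    if parent_mark is not None:
        # Name the parent explicitly instead of relying on the ref's tip.
        out += b"from :%d\n" % parent_mark
    return out + b"\n"


def update_ref_commands(rev_to_sha1, prefix="refs/mtn/"):
    """Build input for 'git update-ref --stdin': one 'create' line per ref."""
    return "".join(
        "create %s%s %s\n" % (prefix, rev, sha1)
        for rev, sha1 in sorted(rev_to_sha1.items())
    )
```

Only one ref name reaches fast-import this way, so the number of refs
it has to buffer no longer scales with the number of commits; the
'refs/mtn/...' names are produced in a separate step once the objects
exist.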
