git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Lars Schneider <larsxschneider@gmail.com>
To: Jeff King <peff@peff.net>
Cc: Junio C Hamano <gitster@pobox.com>, Eric Wong <e@80x24.org>,
	Git Users <git@vger.kernel.org>, Luke Diamand <luke@diamand.org>,
	spearce@spearce.org
Subject: Re: [BUG] t9801 and t9803 broken on next
Date: Thu, 19 May 2016 10:19:51 +0200	[thread overview]
Message-ID: <DD9F360B-78F0-4C83-B642-606688772E39@gmail.com> (raw)
In-Reply-To: <20160517121330.GA7346@sigill.intra.peff.net>


On 17 May 2016, at 14:13, Jeff King <peff@peff.net> wrote:

> On Tue, May 17, 2016 at 10:07:16AM +0200, Lars Schneider wrote:
> 
>> I think that is pretty much the problem. Here is what is happening:
>> 
>> 1.  git-p4 imports all changelists for the "main" branch
>> 
>> 2.  git-p4 starts to import a second branch and creates a fast-import
>>    "progress checkpoint". This triggers:
>> 
>>    --> fast-import.c: cycle_packfile
>>    --> fast-import.c: end_packfile
>>    --> fast-import.c: loosen_small_pack
>> 
>>    At this point we have no packfile anymore.
>> 
>> 3.  git-p4 sends the command "commit refs/remotes/p4/depot/branch2"
>>    to fast-import in order to create a new branch. This triggers:
>> 
>>    --> fast-import.c: parse_new_commit
>>    --> fast-import.c: load_branch
>>    --> fast-import.c: load_tree
>> 
>>    "load_tree" calls "find_object" and the result has a "pack_id" of 0.
>>    I think this indicates that the object is in the packfile. Shouldn't
>>    the "pack_id" be "MAX_PACK_ID" instead?
>> 
>>        myoe = find_object(sha1);
>>        if (myoe && myoe->pack_id != MAX_PACK_ID) {
> 
> Thanks for the analysis. I think this is definitely the problem.  After
> fast-import finalizes a packfile (either at the end of the program or
> due to a checkpoint), it never discards its internal mapping of objects
> to pack locations. It _has_ to keep such a mapping before the pack is
> finalized, because git's regular object database doesn't know about it.
> But afterwards, it should be able to rely on the object database.
> 
> Retaining the mapping at the end of the program is obviously OK because
> we won't make any more lookups in it.
> 
> Retaining it at a checkpoint when we keep the packfile is OK, because we
> can (usually) still access that packfile (the exception would be if
> somebody runs "git repack" while fast-import is running).
> 
> But obviously a checkpoint where we throw away the pack needs to clear
> that mapping.
> 
> The patch below probably makes your case work, but there are a lot of
> open questions:
Confirmed. The offending tests pass with your patch.


>  1. Should we always discard the mapping, even when not loosening
>     objects? I can't think of a real downside to always using git's
>     object lookups.
> 
>  2. Can we free memory associated with object_entry structs at this
>     point? They won't be accessible via the hash, but might other bits
>     of the code have kept pointers to them?
> 
>     I suspect it may screw up the statistics that fast-import prints at
>     the end, but that's a minor point.
> 
>  3. I notice that a few other structs (branch and tag) hold onto the
>     pack_id, which will now point to a pack we can't access. Does this
>     matter? I don't think so, because checkpoint() seems to dump the
>     branches and tags.
>  4. In general, should we be loosening objects at checkpoints at all?
> 
>     I guess it is probably more efficient than creating a bunch of
>     little packs. And it should probably not happen much at all outside
>     of tests (i.e., callers should generally checkpoint after an
>     appreciable amount of work is done).
I am not too familiar with the code and therefore I can't give a good
answer to your questions. However, I noticed that Shawn (CC'ed) wrote a 
lot of fast-import code. Maybe he can help?

>From my point of view little packs are no problem. I run fast-import on
a dedicated migration machine. After fast-import completion I run repack [1] 
before I upload the repo to its final location.


Thanks,
Lars

[1] https://gcc.gnu.org/ml/gcc/2007-12/msg00165.html


> 
> diff --git a/fast-import.c b/fast-import.c
> index 0e8bc6a..9bfbfb0 100644
> --- a/fast-import.c
> +++ b/fast-import.c
> @@ -597,6 +597,22 @@ static struct object_entry *insert_object(unsigned char *sha1)
> 	return e;
> }
> 
> +static void clear_object_table(void)
> +{
> +	unsigned int h;
> +	for (h = 0; h < ARRAY_SIZE(object_table); h++) {
> +		/*
> +		 * We can't individually free objects here
> +		 * because they are allocated from a pool.
> +		 */
> +		object_table[h] = NULL;
> +	}
> +	/*
> +	 * XXX maybe free object_entry_pool here,
> +	 * or might something still be referencing them?
> +	 */
> +}
> +
> static unsigned int hc_str(const char *s, size_t len)
> {
> 	unsigned int r = 0;
> @@ -1035,6 +1051,9 @@ discard_pack:
> 	pack_data = NULL;
> 	running = 0;
> 
> +	/* The objects are now available via git's regular lookups. */
> +	clear_object_table();
> +
> 	/* We can't carry a delta across packfiles. */
> 	strbuf_release(&last_blob.data);
> 	last_blob.offset = 0;

  reply	other threads:[~2016-05-19  8:20 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-05-13 10:02 [BUG] t9801 and t9803 broken on next Lars Schneider
2016-05-13 10:36 ` Eric Wong
2016-05-13 16:37   ` Junio C Hamano
2016-05-14  8:04     ` Lars Schneider
2016-05-14 18:15       ` Junio C Hamano
2016-05-17  8:07         ` Lars Schneider
2016-05-17  9:13           ` Eric Wong
2016-05-17 12:13           ` Jeff King
2016-05-19  8:19             ` Lars Schneider [this message]
2016-05-19 17:03               ` Junio C Hamano
2016-05-19 17:32                 ` Lars Schneider
2016-05-25 22:49             ` Eric Wong
2016-05-25 22:54               ` [RFC] fast-import: invalidate pack_id references after loosening Eric Wong
2016-05-25 23:09                 ` Jeff King
2016-05-26  8:02                   ` Eric Wong
2016-05-26 14:16                     ` Jeff King
2016-05-28  0:20                       ` Eric Wong
2016-05-25 22:56               ` [BUG] t9801 and t9803 broken on next Jeff King
2016-05-17  8:35         ` Luke Diamand

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=DD9F360B-78F0-4C83-B642-606688772E39@gmail.com \
    --to=larsxschneider@gmail.com \
    --cc=e@80x24.org \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=luke@diamand.org \
    --cc=peff@peff.net \
    --cc=spearce@spearce.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).