From: Lars Schneider <larsxschneider@gmail.com>
To: Jeff King <peff@peff.net>
Cc: Junio C Hamano <gitster@pobox.com>, Eric Wong <e@80x24.org>,
Git Users <git@vger.kernel.org>, Luke Diamand <luke@diamand.org>,
spearce@spearce.org
Subject: Re: [BUG] t9801 and t9803 broken on next
Date: Thu, 19 May 2016 10:19:51 +0200 [thread overview]
Message-ID: <DD9F360B-78F0-4C83-B642-606688772E39@gmail.com> (raw)
In-Reply-To: <20160517121330.GA7346@sigill.intra.peff.net>
On 17 May 2016, at 14:13, Jeff King <peff@peff.net> wrote:
> On Tue, May 17, 2016 at 10:07:16AM +0200, Lars Schneider wrote:
>
>> I think that is pretty much the problem. Here is what is happening:
>>
>> 1. git-p4 imports all changelists for the "main" branch
>>
>> 2. git-p4 starts to import a second branch and creates a fast-import
>> "progress checkpoint". This triggers:
>>
>> --> fast-import.c: cycle_packfile
>> --> fast-import.c: end_packfile
>> --> fast-import.c: loosen_small_pack
>>
>> At this point we have no packfile anymore.
>>
>> 3. git-p4 sends the command "commit refs/remotes/p4/depot/branch2"
>> to fast-import in order to create a new branch. This triggers:
>>
>> --> fast-import.c: parse_new_commit
>> --> fast-import.c: load_branch
>> --> fast-import.c: load_tree
>>
>> "load_tree" calls "find_object" and the result has a "pack_id" of 0.
>> I think this indicates that the object is in the packfile. Shouldn't
>> the "pack_id" be "MAX_PACK_ID" instead?
>>
>> myoe = find_object(sha1);
>> if (myoe && myoe->pack_id != MAX_PACK_ID) {
>
> Thanks for the analysis. I think this is definitely the problem. After
> fast-import finalizes a packfile (either at the end of the program or
> due to a checkpoint), it never discards its internal mapping of objects
> to pack locations. It _has_ to keep such a mapping before the pack is
> finalized, because git's regular object database doesn't know about it.
> But afterwards, it should be able to rely on the object database.
>
> Retaining the mapping at the end of the program is obviously OK because
> we won't make any more lookups in it.
>
> Retaining it at a checkpoint when we keep the packfile is OK, because we
> can (usually) still access that packfile (the exception would be if
> somebody runs "git repack" while fast-import is running).
>
> But obviously a checkpoint where we throw away the pack needs to clear
> that mapping.
>
> The patch below probably makes your case work, but there are a lot of
> open questions:
Confirmed. The offending tests pass with your patch.
> 1. Should we always discard the mapping, even when not loosening
> objects? I can't think of a real downside to always using git's
> object lookups.
>
> 2. Can we free memory associated with object_entry structs at this
> point? They won't be accessible via the hash, but might other bits
> of the code have kept pointers to them?
>
> I suspect it may screw up the statistics that fast-import prints at
> the end, but that's a minor point.
>
> 3. I notice that a few other structs (branch and tag) hold onto the
> pack_id, which will now point to a pack we can't access. Does this
> matter? I don't think so, because checkpoint() seems to dump the
> branches and tags.
> 4. In general, should we be loosening objects at checkpoints at all?
>
> I guess it is probably more efficient than creating a bunch of
> little packs. And it should probably not happen much at all outside
> of tests (i.e., callers should generally checkpoint after an
> appreciable amount of work is done).
I am not too familiar with the code and therefore I can't give a good
answer to your questions. However, I noticed that Shawn (CC'ed) wrote a
lot of fast-import code. Maybe he can help?
>From my point of view little packs are no problem. I run fast-import on
a dedicated migration machine. After fast-import completion I run repack [1]
before I upload the repo to its final location.
Thanks,
Lars
[1] https://gcc.gnu.org/ml/gcc/2007-12/msg00165.html
>
> diff --git a/fast-import.c b/fast-import.c
> index 0e8bc6a..9bfbfb0 100644
> --- a/fast-import.c
> +++ b/fast-import.c
> @@ -597,6 +597,22 @@ static struct object_entry *insert_object(unsigned char *sha1)
> return e;
> }
>
> +static void clear_object_table(void)
> +{
> + unsigned int h;
> + for (h = 0; h < ARRAY_SIZE(object_table); h++) {
> + /*
> + * We can't individually free objects here
> + * because they are allocated from a pool.
> + */
> + object_table[h] = NULL;
> + }
> + /*
> + * XXX maybe free object_entry_pool here,
> + * or might something still be referencing them?
> + */
> +}
> +
> static unsigned int hc_str(const char *s, size_t len)
> {
> unsigned int r = 0;
> @@ -1035,6 +1051,9 @@ discard_pack:
> pack_data = NULL;
> running = 0;
>
> + /* The objects are now available via git's regular lookups. */
> + clear_object_table();
> +
> /* We can't carry a delta across packfiles. */
> strbuf_release(&last_blob.data);
> last_blob.offset = 0;
next prev parent reply other threads:[~2016-05-19 8:20 UTC|newest]
Thread overview: 19+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-05-13 10:02 [BUG] t9801 and t9803 broken on next Lars Schneider
2016-05-13 10:36 ` Eric Wong
2016-05-13 16:37 ` Junio C Hamano
2016-05-14 8:04 ` Lars Schneider
2016-05-14 18:15 ` Junio C Hamano
2016-05-17 8:07 ` Lars Schneider
2016-05-17 9:13 ` Eric Wong
2016-05-17 12:13 ` Jeff King
2016-05-19 8:19 ` Lars Schneider [this message]
2016-05-19 17:03 ` Junio C Hamano
2016-05-19 17:32 ` Lars Schneider
2016-05-25 22:49 ` Eric Wong
2016-05-25 22:54 ` [RFC] fast-import: invalidate pack_id references after loosening Eric Wong
2016-05-25 23:09 ` Jeff King
2016-05-26 8:02 ` Eric Wong
2016-05-26 14:16 ` Jeff King
2016-05-28 0:20 ` Eric Wong
2016-05-25 22:56 ` [BUG] t9801 and t9803 broken on next Jeff King
2016-05-17 8:35 ` Luke Diamand
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=DD9F360B-78F0-4C83-B642-606688772E39@gmail.com \
--to=larsxschneider@gmail.com \
--cc=e@80x24.org \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=luke@diamand.org \
--cc=peff@peff.net \
--cc=spearce@spearce.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).