git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jeff King <peff@peff.net>
To: Lars Schneider <larsxschneider@gmail.com>
Cc: Junio C Hamano <gitster@pobox.com>, Eric Wong <e@80x24.org>,
	Git Users <git@vger.kernel.org>, Luke Diamand <luke@diamand.org>
Subject: Re: [BUG] t9801 and t9803 broken on next
Date: Tue, 17 May 2016 08:13:31 -0400	[thread overview]
Message-ID: <20160517121330.GA7346@sigill.intra.peff.net> (raw)
In-Reply-To: <5E7631C9-DD59-4358-B907-D7C7AEA1739C@gmail.com>

On Tue, May 17, 2016 at 10:07:16AM +0200, Lars Schneider wrote:

> I think that is pretty much the problem. Here is what is happening:
> 
> 1.  git-p4 imports all changelists for the "main" branch
> 
> 2.  git-p4 starts to import a second branch and creates a fast-import
>     "progress checkpoint". This triggers:
> 
>     --> fast-import.c: cycle_packfile
>     --> fast-import.c: end_packfile
>     --> fast-import.c: loosen_small_pack
> 
>     At this point we have no packfile anymore.
> 
> 3.  git-p4 sends the command "commit refs/remotes/p4/depot/branch2"
>     to fast-import in order to create a new branch. This triggers:
> 
>     --> fast-import.c: parse_new_commit
>     --> fast-import.c: load_branch
>     --> fast-import.c: load_tree
> 
>     "load_tree" calls "find_object" and the result has a "pack_id" of 0.
>     I think this indicates that the object is in the packfile. Shouldn't
>     the "pack_id" be "MAX_PACK_ID" instead?
> 
>         myoe = find_object(sha1);
>         if (myoe && myoe->pack_id != MAX_PACK_ID) {

Thanks for the analysis. I think this is definitely the problem.  After
fast-import finalizes a packfile (either at the end of the program or
due to a checkpoint), it never discards its internal mapping of objects
to pack locations. It _has_ to keep such a mapping before the pack is
finalized, because git's regular object database doesn't know about it.
But afterwards, it should be able to rely on the object database.

Retaining the mapping at the end of the program is obviously OK because
we won't make any more lookups in it.

Retaining it at a checkpoint when we keep the packfile is OK, because we
can (usually) still access that packfile (the exception would be if
somebody runs "git repack" while fast-import is running).

But obviously a checkpoint where we throw away the pack needs to clear
that mapping.

The patch below probably makes your case work, but there are a lot of
open questions:

  1. Should we always discard the mapping, even when not loosening
     objects? I can't think of a real downside to always using git's
     object lookups.

  2. Can we free memory associated with object_entry structs at this
     point? They won't be accessible via the hash, but might other bits
     of the code have kept pointers to them?

     I suspect it may screw up the statistics that fast-import prints at
     the end, but that's a minor point.

  3. I notice that a few other structs (branch and tag) hold onto the
     pack_id, which will now point to a pack we can't access. Does this
     matter? I don't think so, because checkpoint() seems to dump the
     branches and tags.

  4. In general, should we be loosening objects at checkpoints at all?

     I guess it is probably more efficient than creating a bunch of
     little packs. And it should probably not happen much at all outside
     of tests (i.e., callers should generally checkpoint after an
     appreciable amount of work is done).

diff --git a/fast-import.c b/fast-import.c
index 0e8bc6a..9bfbfb0 100644
--- a/fast-import.c
+++ b/fast-import.c
@@ -597,6 +597,22 @@ static struct object_entry *insert_object(unsigned char *sha1)
 	return e;
 }
 
+static void clear_object_table(void)
+{
+	unsigned int h;
+	for (h = 0; h < ARRAY_SIZE(object_table); h++) {
+		/*
+		 * We can't individually free objects here
+		 * because they are allocated from a pool.
+		 */
+		object_table[h] = NULL;
+	}
+	/*
+	 * XXX maybe free object_entry_pool here,
+	 * or might something still be referencing them?
+	 */
+}
+
 static unsigned int hc_str(const char *s, size_t len)
 {
 	unsigned int r = 0;
@@ -1035,6 +1051,9 @@ discard_pack:
 	pack_data = NULL;
 	running = 0;
 
+	/* The objects are now available via git's regular lookups. */
+	clear_object_table();
+
 	/* We can't carry a delta across packfiles. */
 	strbuf_release(&last_blob.data);
 	last_blob.offset = 0;

  parent reply	other threads:[~2016-05-17 12:13 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-05-13 10:02 [BUG] t9801 and t9803 broken on next Lars Schneider
2016-05-13 10:36 ` Eric Wong
2016-05-13 16:37   ` Junio C Hamano
2016-05-14  8:04     ` Lars Schneider
2016-05-14 18:15       ` Junio C Hamano
2016-05-17  8:07         ` Lars Schneider
2016-05-17  9:13           ` Eric Wong
2016-05-17 12:13           ` Jeff King [this message]
2016-05-19  8:19             ` Lars Schneider
2016-05-19 17:03               ` Junio C Hamano
2016-05-19 17:32                 ` Lars Schneider
2016-05-25 22:49             ` Eric Wong
2016-05-25 22:54               ` [RFC] fast-import: invalidate pack_id references after loosening Eric Wong
2016-05-25 23:09                 ` Jeff King
2016-05-26  8:02                   ` Eric Wong
2016-05-26 14:16                     ` Jeff King
2016-05-28  0:20                       ` Eric Wong
2016-05-25 22:56               ` [BUG] t9801 and t9803 broken on next Jeff King
2016-05-17  8:35         ` Luke Diamand

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20160517121330.GA7346@sigill.intra.peff.net \
    --to=peff@peff.net \
    --cc=e@80x24.org \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=larsxschneider@gmail.com \
    --cc=luke@diamand.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).