From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail-qk1-f171.google.com (mail-qk1-f171.google.com [209.85.222.171]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id E5B8931A71 for ; Fri, 5 Jan 2024 17:07:46 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=ttaylorr.com Authentication-Results: smtp.subspace.kernel.org; spf=none smtp.mailfrom=ttaylorr.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=ttaylorr-com.20230601.gappssmtp.com header.i=@ttaylorr-com.20230601.gappssmtp.com header.b="ZDx7Ls1h" Received: by mail-qk1-f171.google.com with SMTP id af79cd13be357-7816e463b8aso115948585a.3 for ; Fri, 05 Jan 2024 09:07:46 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ttaylorr-com.20230601.gappssmtp.com; s=20230601; t=1704474466; x=1705079266; darn=vger.kernel.org; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:from:to:cc:subject:date:message-id:reply-to; bh=d+5jRhbIO5xif+0buWVu9SzFkrEKrrRuOXVjELoS3Tg=; b=ZDx7Ls1hz81wrHr3hxMC2Kec/Yzgy9e0RyEjho8AMclnDfrEhSJlWTqVmpbMskFcEA CLES6bQNSIj65xtwNZwcLNKZOYQVscMXXJvneKQWTWv52fKPmUK73YieKtjJrA2iD+Bf BYp0kU7ORSt2iWEMTQdRy/jrWSnDUO6MpbFz9xoDy0f3A+zyfQebKu4UMaWdKkZUd4Tg 6I+T3kLhcmIjOTwx48knezvfNgvlo/2Mg0hDWLqsB0GxRQMF+Ze+kvrxq6MeoM3ljCaP Z3SZumilvoX/xrZTNrH4A+69Mh9OL2ZGQifx/c82ZdxbKdVlEHLEuvg20bota5ddDBIi BLiA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1704474466; x=1705079266; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=d+5jRhbIO5xif+0buWVu9SzFkrEKrrRuOXVjELoS3Tg=; b=OQEskF5Oeep9+ax+PPxJYqRaSbdM1IacOcmgeKWYc8kN5Q/i3oeP8en/rJUbK5UGYy qpgC4gRbtJwrvuUQ+ayX0Ncii2S2kZmVF+M0fF4HlLuIW0XxjZd8AB+jVqUQaxL0ipmV p+j4NJ7/nD3Jny449c1Sisw4BwunqsFnruSEM6d3r0UsHQPJnLvK+nLIUc6NcloyBnM6 m1AQJ90I4Gx+zrZgUJuEqq3xMGRwvPGCAmWUJmPfa0bjffoQAeXFcjckG+0ePwkxD6DF Qd/V82OFqKhIprG1W3YEg9pOf0F74e39IT+ZnG46JWeZNNicylUVratJpLPe/e78583p 8bRw== X-Gm-Message-State: AOJu0YyiBwmetPVRAJ6ughe6bV3eqwmTACBLOqcwnQUxI8aCv7YR6bsA MdrilJupUflozwhcSSvYksGgRQme7bH+Pw== X-Google-Smtp-Source: AGHT+IGAvyDrookMJbk1LYGKuqpfdxVW9klcMdnDxk+pmmkCGayKFbRZYC+5mTLMw52b+MU2U5odPw== X-Received: by 2002:a05:620a:4552:b0:781:ad92:1708 with SMTP id u18-20020a05620a455200b00781ad921708mr2316883qkp.97.1704474465650; Fri, 05 Jan 2024 09:07:45 -0800 (PST) Received: from localhost (104-178-186-189.lightspeed.milwwi.sbcglobal.net. [104.178.186.189]) by smtp.gmail.com with ESMTPSA id b23-20020a05620a04f700b007815b84dbb3sm716855qkh.49.2024.01.05.09.07.44 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 05 Jan 2024 09:07:45 -0800 (PST) Date: Fri, 5 Jan 2024 12:07:44 -0500 From: Taylor Blau To: Jeff King Cc: git@vger.kernel.org, Junio C Hamano , Scott Leggett Subject: Re: [PATCH] commit-graph: retain commit slab when closing NULL commit_graph Message-ID: References: <20240105054142.GA2035092@coredump.intra.peff.net> Precedence: bulk X-Mailing-List: git@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: <20240105054142.GA2035092@coredump.intra.peff.net> On Fri, Jan 05, 2024 at 12:41:42AM -0500, Jeff King wrote: > This fixes a regression introduced in ac6d45d11f (commit-graph: move > slab-clearing to close_commit_graph(), 2023-10-03), in which running: > > git -c fetch.writeCommitGraph=true fetch --recurse-submodules > > multiple times in a freshly cloned repository causes a segfault. What > happens in the second (and subsequent) runs is this: > > 1. We make a "struct commit" for any ref tips which we're storing > (even if we already have them, they still go into FETCH_HEAD). > > Because the first run will have created a commit graph, we'll find > those commits in the graph. > > The commit struct is therefore created with a NULL "maybe_tree" > entry, because we can load its oid from the graph later. But to do > that we need to remember that we got the commit from the graph, > which is recorded in a global commit_graph_data_slab object. > > 2. Because we're using --recurse-submodules, we'll try to fetch each > of the possible submodules. That implies creating a separate > "struct repository" in-process for each submodule, which will > require a later call to repo_clear(). > > The call to repo_clear() calls raw_object_store_clear(), which in > turn calls close_object_store(), which in turn calls > close_commit_graph(). And the latter frees the commit graph data > slab. > > 3. Later, when trying to write out a new commit graph, we'll ask for > their tree oid via get_commit_tree_oid(), which will see that the > object is parsed but with a NULL maybe_tree field. We'd then > usually pull it from the graph file, but because the slab was > cleared, we don't realize that we can do so! We end up returning > NULL and segfaulting. > > (It seems questionable that we'd write a graph entry for such a > commit anyway, since we know we already have one. I didn't > double-check, but that may simply be another side effect of having > cleared the slab). Yeah, I agree that we should not be re-writing a commit-graph entry that already exists in an earlier (non-removed) layer of the commit-graph. I haven't thought through all of the details here, but that sounds like a potentially dangerous thing to be doing. > So that fixes the regression in the least-risky way possible. Thanks for the detailed explanation. I think that the fix presented here is a reasonable approach, and is worthwhile to pick up. > I do think there's some fragility here that we might want to follow up > on. We have a global commit_graph_data_slab that contains graph > positions, and our global commit structs depend on the that slab > remaining valid. But close_commit_graph() is just about closing _one_ > object store's graph. So it's dangerous to call that function and clear > the slab without also throwing away any "struct commit" we might have > parsed that depends on it. I definitely agree that there seems like a high risk of another potentially-dangerous bug happening as a result of this area. One thing that sticks out to me is that we have a scope mismatch between the commit structs, the raw_object_store, and the commit slabs. Slabs are tied to the lifetime of the raw_object_store, but the commit objects are tied to the global lifetime of the process. I wonder if it would make sense to have a slab per object-store, and require that you pass both the commit *and* the object-store you're looking it up in as arguments to any slab-related functions. I dunno. I have not put a ton of thought into that ^^ approach either, so let me know if it seems obviously bogus to you or if this is worth looking into further. Thanks, Taylor