From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from cloud.peff.net (cloud.peff.net [104.130.231.41]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 1064C13A244 for ; Thu, 22 Aug 2024 12:41:10 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=104.130.231.41 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1724330473; cv=none; b=hYOYNvWZxZHkJcxdQFjVzwtHSVvznsXHag2vNHjXehGDvR9U5ETXUm3m4o1etvk5sVaDmQbQ8KUd27sJMc1Liq3G5e9VVZdDELOysV7iX9DQ9ESrQd3ohLf77SyocYIGDfPq/5J7JmlWTwyHPfRPJSpH/yHrRtJiuEGXLY6whxA= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1724330473; c=relaxed/simple; bh=QAPLlSLEsBMdD4g1+/ykGFxGajVrWaXqtBds/ldD/ww=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=Gmt4ScOkLOsMqn3a57/1ppEc2VqhBsk1GD4kaOZpkSDz56WFd1gR17EHlaOKj85YBZ/Myudr80Bk4zFnPsdswkJ8jNi5t+w53LA8Lo1ybI4q7mchW9HI3SX8Xe8zdl0Pacz7qaDldo7fGjB7GctBBMq6aUQnmWqDfrsL2mr9c0g= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=peff.net; spf=pass smtp.mailfrom=peff.net; arc=none smtp.client-ip=104.130.231.41 Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=peff.net Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=peff.net Received: (qmail 3231 invoked by uid 109); 22 Aug 2024 12:41:04 -0000 Received: from Unknown (HELO peff.net) (10.0.1.2) by cloud.peff.net (qpsmtpd/0.94) with ESMTP; Thu, 22 Aug 2024 12:41:04 +0000 Authentication-Results: cloud.peff.net; auth=none Received: (qmail 28328 invoked by uid 111); 22 Aug 2024 12:41:01 -0000 Received: from coredump.intra.peff.net (HELO coredump.intra.peff.net) (10.0.0.2) by peff.net (qpsmtpd/0.94) with (TLS_AES_256_GCM_SHA384 encrypted) ESMTPS; Thu, 22 Aug 2024 08:41:01 -0400 Authentication-Results: peff.net; auth=none Date: Thu, 22 Aug 2024 08:41:00 -0400 From: Jeff King To: Patrick Steinhardt Cc: git@vger.kernel.org Subject: Re: [PATCH 00/10] reftable: fix reload with active iterators Message-ID: <20240822124100.GA1070988@coredump.intra.peff.net> References: Precedence: bulk X-Mailing-List: git@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: On Mon, Aug 19, 2024 at 05:39:38PM +0200, Patrick Steinhardt wrote: > This patch series fixes that issue by starting to refcount the readers. > Each iterator will bump the refcount, thus avoiding the problem. While > at it, I also found a second issue where we segfault when reloading a > table fails while reusing one of the table readers. In this scenario, we > would end up releasing the reader of the stack itself, even though it > would still be used by it. I gave a fairly cursory look over this, as I'm not all that familiar with the reftable code. But everything looked pretty sensible to me. I wondered how we might test this. It looks like you checked the --stress output (which I confirmed seems fixed for me), along with a synthetic test directly calling various reftable functions. Both seem good to me. I had hoped to be able to have a non-racy external test, where running actual Git commands showed the segfault in a deterministic way. But I couldn't come up with one. One trick we've used for "pausing" a reading process is to run "cat-file --batch" from a fifo. You can ask it to do some ref lookups, then while it waits for more input, update the reftable, and then ask it to do more. But I don't think that's sufficient here, as the race happens while we are actually iterating. You'd really need some for_each_ref() callback that blocks in some externally controllable way. Possibly you could do something clever with partial-clone lazy fetches (where you stall the fetch and then do ref updates in the middle), but that is getting pretty convoluted. So I think the tests you included seem like a good place to stop. I did have a little trouble applying this for testing. I wanted to do it on 'next', which has the maintenance changes to cause the race. But merging it there ended up with a lot of conflicts with other reftable topics (especially the tests). I was able to resolve them all, but you might be able to make Junio's life easier by coordinating the topics a bit. -Peff