From: Boris Burkov <boris@bur.io>
To: linux-btrfs@vger.kernel.org, kernel-team@fb.com
Subject: [PATCH 5/5] btrfs: cap shrink_delalloc iterations to 128M
Date: Tue, 24 Mar 2026 17:41:53 -0700
Message-ID: <62b90bf5df8d281f65c36bfc8d62df5455ae82f4.1774398665.git.boris@bur.io>
X-Mailer: git-send-email 2.53.0

Even with more accurate delayed_refs reservations, preemptive reclaim
is not perfect and we might generate tickets, especially in cases with
a very large flood of outstanding writeback.

Ultimately, if we do get into a situation with tickets pending and
async reclaim blocking the system, we want to make as much progress as
quickly as possible to unblock tasks. We want space reclaim to be
effective and to have a good chance at making progress, but not to
block arbitrarily, as that leads to untenable syscall latencies, long
commits, and even hung task warnings.

I traced such cases of heavy-writeback async reclaim hung tasks and
observed that we were blocking for long periods of time in
shrink_delalloc(). This was particularly bad when doing writeback of
incompressible data with the compress-force mount option, e.g.:

  dd if=/dev/urandom of=urandom.seed bs=1G count=1
  dd if=urandom.seed of=urandom.big bs=1G count=300

shrink_delalloc() computes to_reclaim as delalloc_bytes >> 3. With
hundreds of GiB of delalloc (again, imagine a large dirty_ratio and
lots of RAM), this is still 10-20+ GiB. Particularly in the wait
phases, this can be quite slow, and it generates even more delayed
refs, as mentioned in the previous patch, so it doesn't even help that
much with the immediate space shortfall. We do satisfy some tickets,
but we ultimately keep the system in essentially the same state, with
long stalling reclaim calls into shrink_delalloc().

It would be much better to start some good chunk of I/O and also to
work through the new delayed refs, keeping things moving through the
system while releasing the conservative, over-estimated metadata
reservations.

To achieve this, tighten up the delalloc work to be in units of the
maximum extent size. If we issue 128MiB of delalloc at a time, we
don't leave much (any?) extent merging on the table, but we never
block on pathological 10GiB+ chunks of delalloc. If we do detect that
we satisfied a ticket, break out of shrink_delalloc() and run some of
the new delayed refs as well before going again.
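For a concrete sense of the numbers, a minimal userspace sketch
(illustrative only, not kernel code: SZ_128M and the ">> 3" heuristic
match the kernel values, and the 300 GiB input mirrors the dd example
above):

  /*
   * Compare how much delalloc one shrink_delalloc() pass targets
   * before and after this patch, for 300 GiB of dirty data.
   */
  #include <stdio.h>
  #include <stdint.h>

  #define SZ_128M (128ull * 1024 * 1024)

  int main(void)
  {
          uint64_t delalloc_bytes = 300ull << 30;    /* 300 GiB dirty */
          uint64_t to_reclaim = delalloc_bytes >> 3; /* old: one 37.5 GiB gulp */
          uint64_t iter_reclaim = to_reclaim < SZ_128M ? to_reclaim : SZ_128M;

          printf("old: one pass flushes %llu MiB\n",
                 (unsigned long long)(to_reclaim >> 20));   /* 38400 MiB */
          printf("new: each pass flushes %llu MiB\n",
                 (unsigned long long)(iter_reclaim >> 20)); /* 128 MiB */
          return 0;
  }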
This way we strike a nice balance: we make delalloc progress, but not
at the cost of every other sort of reservation, as they all feed into
each other. Concretely, this means iterating over to_reclaim in 128MiB
steps until it is drained or we satisfy a ticket, rather than trying
three times to flush the whole thing.

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/space-info.c | 31 +++++++++++++++++++++----------
 1 file changed, 21 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index e017bb182c8c..42f7d63e2464 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -729,7 +729,7 @@ static void shrink_delalloc(struct btrfs_space_info *space_info,
 	u64 ordered_bytes;
 	u64 items;
 	long time_left;
-	int loops;
+	u64 orig_tickets_id;
 
 	delalloc_bytes = percpu_counter_sum_positive(&fs_info->delalloc_bytes);
 	ordered_bytes = percpu_counter_sum_positive(&fs_info->ordered_bytes);
@@ -737,9 +737,7 @@ static void shrink_delalloc(struct btrfs_space_info *space_info,
 		return;
 
 	/* Calc the number of the pages we need flush for space reservation */
-	if (to_reclaim == U64_MAX) {
-		items = U64_MAX;
-	} else {
+	if (to_reclaim != U64_MAX) {
 		/*
 		 * to_reclaim is set to however much metadata we need to
 		 * reclaim, but reclaiming that much data doesn't really track
@@ -753,7 +751,6 @@ static void shrink_delalloc(struct btrfs_space_info *space_info,
 		 * aggressive.
 		 */
 		to_reclaim = max(to_reclaim, delalloc_bytes >> 3);
-		items = calc_reclaim_items_nr(fs_info, to_reclaim) * 2;
 	}
 
 	trans = current->journal_info;
@@ -766,12 +763,17 @@ static void shrink_delalloc(struct btrfs_space_info *space_info,
 	if (ordered_bytes > delalloc_bytes && !for_preempt)
 		wait_ordered = true;
 
-	loops = 0;
-	while ((delalloc_bytes || ordered_bytes) && loops < 3) {
-		u64 temp = min(delalloc_bytes, to_reclaim) >> PAGE_SHIFT;
-		long nr_pages = min_t(u64, temp, LONG_MAX);
+	spin_lock(&space_info->lock);
+	orig_tickets_id = space_info->tickets_id;
+	spin_unlock(&space_info->lock);
+
+	while ((delalloc_bytes || ordered_bytes) && to_reclaim) {
+		u64 iter_reclaim = min_t(u64, to_reclaim, SZ_128M);
+		long nr_pages = min_t(u64, delalloc_bytes, iter_reclaim) >> PAGE_SHIFT;
 		int async_pages;
 
+		items = calc_reclaim_items_nr(fs_info, iter_reclaim) * 2;
+
 		btrfs_start_delalloc_roots(fs_info, nr_pages, true);
 
 		/*
@@ -813,7 +815,7 @@ static void shrink_delalloc(struct btrfs_space_info *space_info,
 			   atomic_read(&fs_info->async_delalloc_pages) <= async_pages);
 
 skip_async:
-		loops++;
+		to_reclaim -= iter_reclaim;
 		if (wait_ordered && !trans) {
 			btrfs_wait_ordered_roots(fs_info, items, NULL);
 		} else {
@@ -836,6 +838,15 @@ static void shrink_delalloc(struct btrfs_space_info *space_info,
 			spin_unlock(&space_info->lock);
 			break;
 		}
+		/*
+		 * If a ticket was satisfied since we started, break out
+		 * so the async reclaim state machine can process delayed
+		 * refs before we flush more delalloc.
+		 */
+		if (space_info->tickets_id != orig_tickets_id) {
+			spin_unlock(&space_info->lock);
+			break;
+		}
 		spin_unlock(&space_info->lock);
 
 		delalloc_bytes = percpu_counter_sum_positive(
-- 
2.53.0
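For readers following along, a minimal userspace sketch of the
early-exit pattern the patch adds: snapshot tickets_id before looping,
flush in bounded 128MiB chunks, and stop as soon as any ticket is
satisfied. The names mirror btrfs_space_info, but this is a simplified
illustration under those assumptions, not the kernel API;
flush_delalloc() is a hypothetical stand-in for the real flushing work.

  #include <pthread.h>
  #include <stdint.h>

  #define SZ_128M (128ull * 1024 * 1024)

  struct space_info {
          pthread_mutex_t lock;
          uint64_t tickets_id; /* bumped each time a ticket is satisfied */
  };

  static void flush_delalloc(uint64_t bytes)
  {
          (void)bytes; /* stand-in for starting/waiting on writeback */
  }

  static void shrink_until_ticket(struct space_info *si, uint64_t to_reclaim)
  {
          pthread_mutex_lock(&si->lock);
          uint64_t orig_tickets_id = si->tickets_id;
          pthread_mutex_unlock(&si->lock);

          while (to_reclaim) {
                  uint64_t iter = to_reclaim < SZ_128M ? to_reclaim : SZ_128M;
                  int satisfied;

                  flush_delalloc(iter); /* bounded unit of work per pass */
                  to_reclaim -= iter;

                  pthread_mutex_lock(&si->lock);
                  satisfied = (si->tickets_id != orig_tickets_id);
                  pthread_mutex_unlock(&si->lock);
                  if (satisfied)
                          break; /* let delayed refs run before flushing more */
          }
  }

  int main(void)
  {
          struct space_info si = { PTHREAD_MUTEX_INITIALIZER, 0 };

          shrink_until_ticket(&si, 300ull << 30); /* 300 GiB, as above */
          return 0;
  }

The point of the snapshot-and-compare is that the flusher never needs
to know who satisfied the ticket; any change to the counter is enough
reason to hand control back to the async reclaim state machine.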