Date: Wed, 14 Jan 2026 11:59:47 -0800
From: "Darrick J. Wong"
To: Christoph Hellwig
Cc: Jens Axboe, Carlos Maiolino, Damien Le Moal, Hans Holmberg,
	Keith Busch, linux-block@vger.kernel.org, linux-xfs@vger.kernel.org,
	Carlos Maiolino
Subject: Re: [PATCH 3/3] xfs: rework zone GC buffer management
Message-ID: <20260114195947.GI15551@frogsfrogsfrogs>
References: <20260114130651.3439765-1-hch@lst.de>
 <20260114130651.3439765-4-hch@lst.de>
In-Reply-To: <20260114130651.3439765-4-hch@lst.de>

On Wed, Jan 14, 2026 at 02:06:43PM +0100, Christoph Hellwig wrote:
> The double buffering where just one scratch area is used at a time does
> not efficiently use the available memory. It was originally implemented
> when GC I/O could happen out of order, but that was removed before
> upstream submission to avoid fragmentation. Now that all GC I/Os are
> processed in order, just use a number of buffers as a simple ring buffer.
> 
> For a synthetic benchmark that fills 256MiB HDD zones and punches out
> holes to free half the space this leads to a decrease of GC time by
> a little more than 25%.
> 
> Thanks to Hans Holmberg for testing and
> benchmarking.
>
> Signed-off-by: Christoph Hellwig
> Reviewed-by: Hans Holmberg
> Reviewed-by: Carlos Maiolino
> Reviewed-by: Damien Le Moal

Huh.  I guess I never touched this patch at all???

Ok so it looks like we're switching out garbage collection via two
independent (but not really independent) scratchpads for what amounts
to a ring buffer consisting of a bunch of discontiguous folios.

Now when gc wants to move some data, we compute the folios the chunk
needs and bump scratch_head by the chunk length.  Then we attach those
folios to a read bio and submit it (there's a toy model of the folio
carving at the end of this mail).  When that finishes, we call
bio_reset with REQ_OP_WRITE and submit that.

When the write finishes we do the "remap if the mapping hasn't changed"
dance to update the metadata, and advance the tail past however much
buffer space we've stopped using.

If I got that right, then
Reviewed-by: "Darrick J. Wong"

--D

> ---
>  fs/xfs/xfs_zone_gc.c | 106 ++++++++++++++++++++++++-------------------
>  1 file changed, 59 insertions(+), 47 deletions(-)
> 
> diff --git a/fs/xfs/xfs_zone_gc.c b/fs/xfs/xfs_zone_gc.c
> index c9a3df6a5289..ba4f8e011e36 100644
> --- a/fs/xfs/xfs_zone_gc.c
> +++ b/fs/xfs/xfs_zone_gc.c
> @@ -50,23 +50,11 @@
>   */
> 
>  /*
> - * Size of each GC scratch pad. This is also the upper bound for each
> - * GC I/O, which helps to keep latency down.
> + * Size of each GC scratch allocation, and the number of buffers.
>   */
> -#define XFS_GC_CHUNK_SIZE	SZ_1M
> -
> -/*
> - * Scratchpad data to read GCed data into.
> - *
> - * The offset member tracks where the next allocation starts, and freed tracks
> - * the amount of space that is not used anymore.
> - */
> -#define XFS_ZONE_GC_NR_SCRATCH	2
> -struct xfs_zone_scratch {
> -	struct folio		*folio;
> -	unsigned int		offset;
> -	unsigned int		freed;
> -};
> +#define XFS_GC_BUF_SIZE		SZ_1M
> +#define XFS_GC_NR_BUFS		2
> +static_assert(XFS_GC_NR_BUFS < BIO_MAX_VECS);
> 
>  /*
>   * Chunk that is read and written for each GC operation.
> @@ -141,10 +129,14 @@ struct xfs_zone_gc_data {
>  	struct bio_set		bio_set;
> 
>  	/*
> -	 * Scratchpad used, and index to indicated which one is used.
> +	 * Scratchpad to buffer GC data, organized as a ring buffer over
> +	 * discontiguous folios. scratch_head is where the buffer is filled,
> +	 * and scratch_tail tracks the buffer space freed.
>  	 */
> -	struct xfs_zone_scratch	scratch[XFS_ZONE_GC_NR_SCRATCH];
> -	unsigned int		scratch_idx;
> +	struct folio		*scratch_folios[XFS_GC_NR_BUFS];
> +	unsigned int		scratch_size;
> +	unsigned int		scratch_head;
> +	unsigned int		scratch_tail;
> 
>  	/*
>  	 * List of bios currently being read, written and reset.
> @@ -210,20 +202,16 @@ xfs_zone_gc_data_alloc(
>  	if (!data->iter.recs)
>  		goto out_free_data;
> 
> -	/*
> -	 * We actually only need a single bio_vec. It would be nice to have
> -	 * a flag that only allocates the inline bvecs and not the separate
> -	 * bvec pool.
> - */ > if (bioset_init(&data->bio_set, 16, offsetof(struct xfs_gc_bio, bio), > BIOSET_NEED_BVECS)) > goto out_free_recs; > - for (i = 0; i < XFS_ZONE_GC_NR_SCRATCH; i++) { > - data->scratch[i].folio = > - folio_alloc(GFP_KERNEL, get_order(XFS_GC_CHUNK_SIZE)); > - if (!data->scratch[i].folio) > + for (i = 0; i < XFS_GC_NR_BUFS; i++) { > + data->scratch_folios[i] = > + folio_alloc(GFP_KERNEL, get_order(XFS_GC_BUF_SIZE)); > + if (!data->scratch_folios[i]) > goto out_free_scratch; > } > + data->scratch_size = XFS_GC_BUF_SIZE * XFS_GC_NR_BUFS; > INIT_LIST_HEAD(&data->reading); > INIT_LIST_HEAD(&data->writing); > INIT_LIST_HEAD(&data->resetting); > @@ -232,7 +220,7 @@ xfs_zone_gc_data_alloc( > > out_free_scratch: > while (--i >= 0) > - folio_put(data->scratch[i].folio); > + folio_put(data->scratch_folios[i]); > bioset_exit(&data->bio_set); > out_free_recs: > kfree(data->iter.recs); > @@ -247,8 +235,8 @@ xfs_zone_gc_data_free( > { > int i; > > - for (i = 0; i < XFS_ZONE_GC_NR_SCRATCH; i++) > - folio_put(data->scratch[i].folio); > + for (i = 0; i < XFS_GC_NR_BUFS; i++) > + folio_put(data->scratch_folios[i]); > bioset_exit(&data->bio_set); > kfree(data->iter.recs); > kfree(data); > @@ -590,7 +578,12 @@ static unsigned int > xfs_zone_gc_scratch_available( > struct xfs_zone_gc_data *data) > { > - return XFS_GC_CHUNK_SIZE - data->scratch[data->scratch_idx].offset; > + if (!data->scratch_tail) > + return data->scratch_size - data->scratch_head; > + > + if (!data->scratch_head) > + return data->scratch_tail; > + return (data->scratch_size - data->scratch_head) + data->scratch_tail; > } > > static bool > @@ -664,6 +657,28 @@ xfs_zone_gc_alloc_blocks( > return oz; > } > > +static void > +xfs_zone_gc_add_data( > + struct xfs_gc_bio *chunk) > +{ > + struct xfs_zone_gc_data *data = chunk->data; > + unsigned int len = chunk->len; > + unsigned int off = data->scratch_head; > + > + do { > + unsigned int this_off = off % XFS_GC_BUF_SIZE; > + unsigned int this_len = min(len, XFS_GC_BUF_SIZE - this_off); > + > + bio_add_folio_nofail(&chunk->bio, > + data->scratch_folios[off / XFS_GC_BUF_SIZE], > + this_len, this_off); > + len -= this_len; > + off += this_len; > + if (off == data->scratch_size) > + off = 0; > + } while (len); > +} > + > static bool > xfs_zone_gc_start_chunk( > struct xfs_zone_gc_data *data) > @@ -677,6 +692,7 @@ xfs_zone_gc_start_chunk( > struct xfs_inode *ip; > struct bio *bio; > xfs_daddr_t daddr; > + unsigned int len; > bool is_seq; > > if (xfs_is_shutdown(mp)) > @@ -691,17 +707,19 @@ xfs_zone_gc_start_chunk( > return false; > } > > - bio = bio_alloc_bioset(bdev, 1, REQ_OP_READ, GFP_NOFS, &data->bio_set); > + len = XFS_FSB_TO_B(mp, irec.rm_blockcount); > + bio = bio_alloc_bioset(bdev, > + min(howmany(len, XFS_GC_BUF_SIZE) + 1, XFS_GC_NR_BUFS), > + REQ_OP_READ, GFP_NOFS, &data->bio_set); > > chunk = container_of(bio, struct xfs_gc_bio, bio); > chunk->ip = ip; > chunk->offset = XFS_FSB_TO_B(mp, irec.rm_offset); > - chunk->len = XFS_FSB_TO_B(mp, irec.rm_blockcount); > + chunk->len = len; > chunk->old_startblock = > xfs_rgbno_to_rtb(iter->victim_rtg, irec.rm_startblock); > chunk->new_daddr = daddr; > chunk->is_seq = is_seq; > - chunk->scratch = &data->scratch[data->scratch_idx]; > chunk->data = data; > chunk->oz = oz; > chunk->victim_rtg = iter->victim_rtg; > @@ -710,13 +728,9 @@ xfs_zone_gc_start_chunk( > > bio->bi_iter.bi_sector = xfs_rtb_to_daddr(mp, chunk->old_startblock); > bio->bi_end_io = xfs_zone_gc_end_io; > - bio_add_folio_nofail(bio, chunk->scratch->folio, chunk->len, > - 
> -			chunk->scratch->offset);
> -	chunk->scratch->offset += chunk->len;
> -	if (chunk->scratch->offset == XFS_GC_CHUNK_SIZE) {
> -		data->scratch_idx =
> -			(data->scratch_idx + 1) % XFS_ZONE_GC_NR_SCRATCH;
> -	}
> +	xfs_zone_gc_add_data(chunk);
> +	data->scratch_head = (data->scratch_head + len) % data->scratch_size;
> +
>  	WRITE_ONCE(chunk->state, XFS_GC_BIO_NEW);
>  	list_add_tail(&chunk->entry, &data->reading);
>  	xfs_zone_gc_iter_advance(iter, irec.rm_blockcount);
> @@ -834,6 +848,7 @@ xfs_zone_gc_finish_chunk(
>  	struct xfs_gc_bio	*chunk)
>  {
>  	uint			iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
> +	struct xfs_zone_gc_data	*data = chunk->data;
>  	struct xfs_inode	*ip = chunk->ip;
>  	struct xfs_mount	*mp = ip->i_mount;
>  	int			error;
> @@ -845,11 +860,8 @@
>  		return;
>  	}
> 
> -	chunk->scratch->freed += chunk->len;
> -	if (chunk->scratch->freed == chunk->scratch->offset) {
> -		chunk->scratch->offset = 0;
> -		chunk->scratch->freed = 0;
> -	}
> +	data->scratch_tail =
> +		(data->scratch_tail + chunk->len) % data->scratch_size;
> 
>  	/*
>  	 * Cycle through the iolock and wait for direct I/O and layouts to
> -- 
> 2.47.3
> 
> 
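P.S. For anyone else staring at the head/tail arithmetic, this is the
toy model I scribbled while reviewing.  It's a standalone userspace
sketch, not the kernel code: the buffer sizes are shrunk from the
kernel's 1MiB so the output is readable, min() is open-coded, and it
only models how xfs_zone_gc_add_data() carves a chunk into per-folio
bio segments and how scratch_head advances; the
xfs_zone_gc_scratch_available() check that gates new chunks is left
out.

	/* Toy model of the GC scratch ring buffer folio carving. */
	#include <stdio.h>

	#define GC_BUF_SIZE	16	/* stand-in for XFS_GC_BUF_SIZE (1MiB) */
	#define GC_NR_BUFS	2	/* stand-in for XFS_GC_NR_BUFS */
	#define SCRATCH_SIZE	(GC_BUF_SIZE * GC_NR_BUFS)

	/*
	 * Mirror of the xfs_zone_gc_add_data() loop: split a chunk of
	 * @len bytes starting at @head into one segment per scratch
	 * folio it touches, wrapping back to folio 0 at the end of the
	 * scratch area.  Returns the new head, which matches the
	 * caller's (scratch_head + len) % scratch_size.
	 */
	static unsigned int add_data(unsigned int head, unsigned int len)
	{
		unsigned int off = head;

		do {
			unsigned int this_off = off % GC_BUF_SIZE;
			unsigned int this_len = len < GC_BUF_SIZE - this_off ?
						len : GC_BUF_SIZE - this_off;

			/* in the kernel this is a bio_add_folio_nofail() */
			printf("  folio %u: offset %2u, len %2u\n",
			       off / GC_BUF_SIZE, this_off, this_len);
			len -= this_len;
			off += this_len;
			if (off == SCRATCH_SIZE)
				off = 0;
		} while (len);

		return off;
	}

	int main(void)
	{
		unsigned int head = 0;

		printf("24-byte chunk at head %u:\n", head);
		head = add_data(head, 24);	/* spans folio 0 and folio 1 */

		printf("12-byte chunk at head %u:\n", head);
		head = add_data(head, 12);	/* wraps from folio 1 back to 0 */

		printf("final head: %u\n", head);
		return 0;
	}

The second chunk straddling the wrap point is also why
xfs_zone_gc_start_chunk() now sizes the read bio for
min(howmany(len, XFS_GC_BUF_SIZE) + 1, XFS_GC_NR_BUFS) vecs rather
than just one.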