From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from bombadil.infradead.org (bombadil.infradead.org [198.137.202.133]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 42E3222A1FA for ; Fri, 23 May 2025 13:37:34 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.137.202.133 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1748007456; cv=none; b=EjCsbmME9HnGaCTRRiH5MMnj7Ei4000/mM/tZ2TkIJzrLFyDOUGuuYCpGCHYZqAOdgblDdOeIryVtdbgiRPsqtzvYfXooFJqfbB410wEzWO3MQydy49rpJQk5sM8Y/dKPsX6/3LRGXD0RSTvKIspv8NO4+9EIZLRlEpj32LEmH4= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1748007456; c=relaxed/simple; bh=cwrTmxEF1sLoQAgJe+tsNzSYnTxLgmeMkbxo81pfIhc=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=oKEPvuh91nRjohzvBiR0CZYtQvcD7jGZIJ9HZRFlEQfgdLG5uZYNLFDXOfq1ByrYfcvesmEIpj9gB1RWWW1gsO/0yBWP8iXumWcqjRa6b0pvSIlv1s8RWZRKv3PyFFyHnw1j6htSL3E6SBjOYpCVdoNgBLm4oEdT9NMz8A+uJns= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=infradead.org; spf=none smtp.mailfrom=bombadil.srs.infradead.org; dkim=pass (2048-bit key) header.d=infradead.org header.i=@infradead.org header.b=qjbUx8XY; arc=none smtp.client-ip=198.137.202.133 Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=infradead.org Authentication-Results: smtp.subspace.kernel.org; spf=none smtp.mailfrom=bombadil.srs.infradead.org Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=infradead.org header.i=@infradead.org header.b="qjbUx8XY" DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=infradead.org; s=bombadil.20210309; h=In-Reply-To:Content-Type:MIME-Version :References:Message-ID:Subject:Cc:To:From:Date:Sender:Reply-To: Content-Transfer-Encoding:Content-ID:Content-Description; bh=6doQPpYhjYjpofhzi8sE2USoJ4axG1fgmnATV43tg0I=; b=qjbUx8XYc7NgPims4diFcCqcfb ONNvdqx+s0DHxR7MNnuKg8/e4EF1huUJ0yl5py2OBflVbcZ75daRSNO7X3lLF7C+HZxxTYkFxczBa x9dmlMPlKWxuUEPCZelqs/h/PGBIrh5gtExekdzFOIVDGo6I51dPLlcCucplxO5dycaQb5/tyf9PH CTYXnSE0qsS5emSTdS2fAsAasw4vzc02XyoH52XYSdlgSiLY82WSLD+6VWs7HV7JnV/VhuBp9NLrm UBq32ijIq7om8FJLbVEuxE08EMVcjfXTWiuRqcyh0jYWRCMSPEyLzEa6aajRWBvyNnqpeYyRE191d 0Fb1wung==; Received: from hch by bombadil.infradead.org with local (Exim 4.98.2 #2 (Red Hat Linux)) id 1uISab-00000003yzp-375K; Fri, 23 May 2025 13:37:33 +0000 Date: Fri, 23 May 2025 06:37:33 -0700 From: Christoph Hellwig To: Keith Busch Cc: Christoph Hellwig , Keith Busch , linux-block@vger.kernel.org, linux-nvme@lists.infradead.org Subject: Re: [PATCH 2/5] block: add support for copy offload Message-ID: References: <20250521223107.709131-1-kbusch@meta.com> <20250521223107.709131-3-kbusch@meta.com> Precedence: bulk X-Mailing-List: linux-block@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: X-SRS-Rewrite: SMTP reverse-path rewritten from by bombadil.infradead.org. See http://www.infradead.org/rpr.html On Fri, May 23, 2025 at 07:26:45AM -0600, Keith Busch wrote: > > Urrgg. Please don't overload the bio_vec. We've been working hard to > > generalize it and share the data structures with more users in the > > block layer. > > Darn, this part of the proposal is really the core concept of this patch > set that everything builds around. It's what allows submitting > arbitrarily large sized copy requests and letting the block layer > efficiently split a bio to the queue limits later. Well, you can still do that without overloading the bio_bvec by just making bi_io_vec in the bio itself a union. > > > If having a bio for each source range is too much overhead > > for your user case (but I'd like to numbers for that), we'll need to > > find a way to do that without overloading the actual bio_vec structure. > > Getting good numbers might be a problem in the near term. The current > generation of devices I have access to that can do copy offload don't > have asic support for it, so it is instrumented entirely in firmware. > The performance is currently underwhelming, but I expect next generation > to be much better. I meant numbers for the all in one bio vs multiple bios approach. For hardware I think the main benefit is to not use host dram bandwith. Anyway, below is a patch to wire it up to the XFS garbage collection daemin. It survices the xfstests test cases for GC when run on a conventional device, but otherwise I've not done much testing with it. It shows two things, though: - right now there is block layer merging, and we always see single range bios. That is really annoying, and fixing the fs code to submit multiple ranges in one go would be really annoying, as extent-based completions hang off the bio completions. So I'd really like to block layer merges similar to what the old multi-bio code or the discard code do. - copy also needs to be handled by the zoned write plugs - bio_add_copy_src not updating bi_size is unexpected and annoying :) diff --git a/fs/xfs/xfs_zone_gc.c b/fs/xfs/xfs_zone_gc.c index 8c541ca71872..e7dfdbbcf126 100644 --- a/fs/xfs/xfs_zone_gc.c +++ b/fs/xfs/xfs_zone_gc.c @@ -158,6 +158,8 @@ struct xfs_zone_gc_data { * Iterator for the victim zone. */ struct xfs_zone_gc_iter iter; + + bool can_copy; }; /* @@ -212,12 +214,19 @@ xfs_zone_gc_data_alloc( if (bioset_init(&data->bio_set, 16, offsetof(struct xfs_gc_bio, bio), BIOSET_NEED_BVECS)) goto out_free_recs; - for (i = 0; i < XFS_ZONE_GC_NR_SCRATCH; i++) { - data->scratch[i].folio = - folio_alloc(GFP_KERNEL, get_order(XFS_GC_CHUNK_SIZE)); - if (!data->scratch[i].folio) - goto out_free_scratch; + + if (bdev_copy_sectors(mp->m_rtdev_targp->bt_bdev)) { + xfs_info(mp, "using copy offload"); + data->can_copy = true; + } else { + for (i = 0; i < XFS_ZONE_GC_NR_SCRATCH; i++) { + data->scratch[i].folio = folio_alloc(GFP_KERNEL, + get_order(XFS_GC_CHUNK_SIZE)); + if (!data->scratch[i].folio) + goto out_free_scratch; + } } + INIT_LIST_HEAD(&data->reading); INIT_LIST_HEAD(&data->writing); INIT_LIST_HEAD(&data->resetting); @@ -241,8 +250,10 @@ xfs_zone_gc_data_free( { int i; - for (i = 0; i < XFS_ZONE_GC_NR_SCRATCH; i++) - folio_put(data->scratch[i].folio); + if (!data->can_copy) { + for (i = 0; i < XFS_ZONE_GC_NR_SCRATCH; i++) + folio_put(data->scratch[i].folio); + } bioset_exit(&data->bio_set); kfree(data->iter.recs); kfree(data); @@ -589,6 +600,8 @@ static unsigned int xfs_zone_gc_scratch_available( struct xfs_zone_gc_data *data) { + if (data->can_copy) + return UINT_MAX; return XFS_GC_CHUNK_SIZE - data->scratch[data->scratch_idx].offset; } @@ -690,7 +703,10 @@ xfs_zone_gc_start_chunk( return false; } - bio = bio_alloc_bioset(bdev, 1, REQ_OP_READ, GFP_NOFS, &data->bio_set); + bio = bio_alloc_bioset(bdev, 1, + data->can_copy ? REQ_OP_COPY : REQ_OP_READ, + GFP_NOFS, &data->bio_set); + bio->bi_end_io = xfs_zone_gc_end_io; chunk = container_of(bio, struct xfs_gc_bio, bio); chunk->ip = ip; @@ -700,21 +716,38 @@ xfs_zone_gc_start_chunk( xfs_rgbno_to_rtb(iter->victim_rtg, irec.rm_startblock); chunk->new_daddr = daddr; chunk->is_seq = is_seq; - chunk->scratch = &data->scratch[data->scratch_idx]; chunk->data = data; chunk->oz = oz; - bio->bi_iter.bi_sector = xfs_rtb_to_daddr(mp, chunk->old_startblock); - bio->bi_end_io = xfs_zone_gc_end_io; - bio_add_folio_nofail(bio, chunk->scratch->folio, chunk->len, - chunk->scratch->offset); - chunk->scratch->offset += chunk->len; - if (chunk->scratch->offset == XFS_GC_CHUNK_SIZE) { - data->scratch_idx = - (data->scratch_idx + 1) % XFS_ZONE_GC_NR_SCRATCH; + if (data->can_copy) { + struct bio_vec src = { + .bv_sector = + xfs_rtb_to_daddr(mp, chunk->old_startblock), + .bv_sectors = BTOBB(chunk->len), + }; + + bio_add_copy_src(bio, &src); + bio->bi_iter.bi_sector = daddr; + bio->bi_iter.bi_size = chunk->len; + + WRITE_ONCE(chunk->state, XFS_GC_BIO_NEW); + list_add_tail(&chunk->entry, &data->writing); + } else { + chunk->scratch = &data->scratch[data->scratch_idx]; + + bio->bi_iter.bi_sector = + xfs_rtb_to_daddr(mp, chunk->old_startblock); + bio_add_folio_nofail(bio, chunk->scratch->folio, chunk->len, + chunk->scratch->offset); + chunk->scratch->offset += chunk->len; + if (chunk->scratch->offset == XFS_GC_CHUNK_SIZE) { + data->scratch_idx = + (data->scratch_idx + 1) % + XFS_ZONE_GC_NR_SCRATCH; + } + WRITE_ONCE(chunk->state, XFS_GC_BIO_NEW); + list_add_tail(&chunk->entry, &data->reading); } - WRITE_ONCE(chunk->state, XFS_GC_BIO_NEW); - list_add_tail(&chunk->entry, &data->reading); xfs_zone_gc_iter_advance(iter, irec.rm_blockcount); submit_bio(bio); @@ -839,10 +872,12 @@ xfs_zone_gc_finish_chunk( return; } - chunk->scratch->freed += chunk->len; - if (chunk->scratch->freed == chunk->scratch->offset) { - chunk->scratch->offset = 0; - chunk->scratch->freed = 0; + if (!chunk->data->can_copy) { + chunk->scratch->freed += chunk->len; + if (chunk->scratch->freed == chunk->scratch->offset) { + chunk->scratch->offset = 0; + chunk->scratch->freed = 0; + } } /*