From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail-dy1-f178.google.com (mail-dy1-f178.google.com [74.125.82.178]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 3688731985C for ; Mon, 29 Jun 2026 20:01:13 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=74.125.82.178 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1782763283; cv=none; b=bc7Qmx7mV5oP3X6zVMGRjlUgotVZwNzM6vrIF/CLHoUnyDdqww8QgBz0AcS9MPPijLF6sHIXB5xQFRlmxnQKSaPCJQ6Ycjxlx8Bx6Fm8Y4m/pzKOXEPkmCzztBzUTab3CH6/ceYIaNr5rK8u9Tr8OQNOq7sSMOWDz5W9Isc8XXk= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1782763283; c=relaxed/simple; bh=F5UsF8blNj+MAIKtase0TimbmrK/C9lRkNPtmBnACt0=; h=Mime-Version:Content-Type:Date:Message-Id:Cc:Subject:From:To: References:In-Reply-To; b=b0u38RRvRuu9SQ3Woep8fy61jue4hmMi8nURUgNV30zfo8mGzttJzODdXX8G57nhRGhnpAUXCWIvVSO2NbS/tm7Gr4mXfupPu8WgvFs8G4uS765oGl6akiUA6nzLR/ToQGNxMArsGg+WIbYLTlh6e6CpHfAqPEVqPQldhVcjWmk= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=UHhzikC1; arc=none smtp.client-ip=74.125.82.178 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="UHhzikC1" Received: by mail-dy1-f178.google.com with SMTP id 5a478bee46e88-30cac93dfd7so1810109eec.1 for ; Mon, 29 Jun 2026 13:01:13 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20251104; t=1782763273; x=1783368073; darn=vger.kernel.org; h=in-reply-to:references:to:from:subject:cc:message-id:date :content-transfer-encoding:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=4XD8phgf43akGVTDQARcCjwmlWkMrNHOFtFyR4sgC2k=; b=UHhzikC1IEesxI/5jEreLzibxWId1gxm3x4vhxxyDNasAXm9nKBpsbSUw05a/pmh86 ESk52TH/i1vofyCI6cxBOTIzwyXxyLuAr5jxjBPycKGSnhYkffWMcZCPQLn8ZgTSgCk8 +394QPak9TepkRR9LpP+XdKOCqmByENRQh7wWgphgsud1I4Kh4D5CSSGyVzVUQ7qKIDG TFLUHjfE0HXaZdMH5A7vHkL50jzx5zgWv6ebM7qD7qJF9TL6MZ2XVyyq3FlUyoT6ZV+8 iXJptXfehVrPDojVjyvc5pZaDp+VC4ig/E8xIXaaZXXGGI7qpgTqC+EfEqnbMShkDsAc YYTQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1782763273; x=1783368073; h=in-reply-to:references:to:from:subject:cc:message-id:date :content-transfer-encoding:mime-version:x-gm-gg:x-gm-message-state :from:to:cc:subject:date:message-id:reply-to; bh=4XD8phgf43akGVTDQARcCjwmlWkMrNHOFtFyR4sgC2k=; b=NPfRF9Zo97ddsJ+lgauWXDaxRaVCiaPAjncYVaq++iDfq/t38nPZZ8uDd0evx7MKi/ PLhnoSt9MlMihX4JC+YbOvK53t0K0PYhYqzBZ+Oxs0Z5ij+vLLtY56yrLcJGlAfvY/yL gi7kHvqA2iKqiLANQUEJ8Ae2JjHIOPzejFHypt9RGvfmAnZCNZjIEum2dhOrQSW/wtUI rPuW0nyoIBlczbg0ekq6ETSGy9bIawnb+OgVrYtyNiqE+Iip7Xb/7dyqwJw1nxgt9mGJ oDZYZc4cebHkc94Xif+U5qgn6OvgkVKcbqsSHLY46jBQjwCyacz4hiAWJM4fFdewQREO bi+Q== X-Forwarded-Encrypted: i=1; AHgh+Ro4EYcIx1bhvcWDx0lNl1QbeQyWgLtOs7eu4o1adjRm7N2ufCuYO+ccFfeCDU9HuXhNrTWkIkMs3ZI01A==@vger.kernel.org X-Gm-Message-State: AOJu0Yz01mQb2FCQunhkYfmO2Wg2MUly05C/T+PD39xX1oMmWoxhJiVw gB7+z+AH3dCjO1XoVyfIYeqYVgusDusSmiMiUUFs0aBmy8Yl+88u6aKi X-Gm-Gg: AfdE7cnSH3Brh36GSgrPRFXaYM0z0Z8QKsHNqRb7BZro+GbgJ52uHvKBsi2wS6STeCt EPfUEMNHCVTiZDbhO2CvMEsFgv0d2N1um6ekBlANk0h4kUKLDDaGRwWoKunnOqHSJtKbau12GLb 900aHjaRDNBz7MOp1aioycd1EYqp2hKEpDK60jSmb2C2cGNU0WO/tqe5N/vQKT7FarOAYBHSps/ h94rnQ4vPEkFeUsYP74hPreyTfgGQ9H6aVautFsUcJuYvG3N4nvm0CZJg+hJ2aF86G6+yVstlnB LFbZFw4Yr8TWR6WL4jXx0/RE0IkHs+tseQ2U4kTFm3zfSsBRqn1qeb1F+FLqEfAB43qB5R1jT4X 2W3K7RgUVutS2hkUB3oGXnAYtWewYNUxxdNklmMyg9TESb4tJNcZ6RLO/wc9Oirxyo410Fs6M9c 7NLkg1eSqwD6HEV14FL3kRR8oWDPNGfjPLQ8X+xmf2BjFQ3OPfou3chQCjHrC/ut+BXKEX9DlNU TDJJv3og0+19ORRmEov X-Received: by 2002:a05:7300:8809:b0:30c:9bf9:d9c8 with SMTP id 5a478bee46e88-30ee1816d67mr415487eec.13.1782763272477; Mon, 29 Jun 2026 13:01:12 -0700 (PDT) Received: from localhost ([2601:644:8000:7a86::e34]) by smtp.gmail.com with ESMTPSA id 5a478bee46e88-30ee2f5e4dcsm629262eec.5.2026.06.29.13.01.10 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Mon, 29 Jun 2026 13:01:11 -0700 (PDT) Precedence: bulk X-Mailing-List: linux-btrfs@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=UTF-8 Date: Mon, 29 Jun 2026 13:01:09 -0700 Message-Id: Cc: "Chris Mason" , "David Sterba" , "open list" Subject: Re: [PATCH] btrfs: raid56: use async_tx API for parity operations From: "Rosen Penev" To: "Qu Wenruo" , "Rosen Penev" , X-Mailer: aerc 0.21.0-0-g5549850facc2 References: <20260629023912.2102700-1-rosenp@gmail.com> In-Reply-To: On Sun Jun 28, 2026 at 10:11 PM PDT, Qu Wenruo wrote: > > > =E5=9C=A8 2026/6/29 12:09, Rosen Penev =E5=86=99=E9=81=93: >> Replace the kmap-local + synchronous library call + kunmap-local >> pattern in raid56 compute and recovery paths with the async_tx API >> (async_xor_offs, async_gen_syndrome, async_raid6_datap_recov, >> async_raid6_2data_recov, async_memcpy). The async_tx API provides >> DMA offloading when a suitable engine is available and falls back to >> the same software implementations (xor_gen, raid6_gen_syndrome, etc.) >> otherwise. >> >> Step-loop operations inside each vertical stripe are chained as DMA >> dependency chains with ASYNC_TX_FENCE. The chains are then extended >> across all sectors in a stripe so that the DMA engine can pipeline >> the entire stripe's parity computation before the CPU waits for >> completion. For RMW writes, all sectors' parity generation is >> issued as a single chain; for recovery reads, the recovery chains >> for every sector are submitted together, then the CPU verifies >> checksums and marks uptodate bits after a single wait. >> >> Each DMA chain is fully waited on before the caller touches any >> buffer pages, so there is never an in-flight DMA operation during >> cleanup. For the typical stripe size of 4 KiB--64 KiB, per-operation >> overhead from DMA submission may outweigh the offload benefit, making >> the software fallback (which runs synchronously, returning NULL) the >> common fast path. This trade-off is inherent in the async_tx API's >> unified interface, but on platforms with a DMA engine (e.g. Marvell >> mv_xor) and larger sectors the batching provides measurable offload. > > Do you have any benchmark? Not at this time. > > You do not need to do a full comprehensive workload, just some fixed > basic workloads, then measure the runtime of the converted functions, > and major entrances like rmw_rbio()/recover_rbio(). > > And do you have ever run the fstests on a system with async offload? No. The system in question runs Debian 12, so I assume it would be trivial to do so. > >> >> Also select ASYNC_RAID6_RECOV and ASYNC_MEMCPY in Kconfig so that >> the async_tx function implementations are linked; previously the >> code used the synchronous RAID6_PQ / XOR_BLOCKS libraries directly, >> but async_tx functions require their own Kconfig symbols. >> >> Converted functions: >> - generate_pq_vertical_step() / generate_pq_vertical() >> - recover_vertical_step() / recover_vertical() / recover_sectors() >> - verify_one_parity_step() / verify_one_parity_sector() >> - finish_parity_scrub() >> - recover_scrub_rbio() >> - memcpy_from_bio_to_stripe() >> - raid56_parity_cache_data_folios() >> >> Removed the now-unused btrfs_raid_bio::finish_pointers field and >> the no-longer-needed #include and >> #include . >> >> Assisted-by: opencode:big-pickle >> Signed-off-by: Rosen Penev >> --- >> fs/btrfs/Kconfig | 4 +- >> fs/btrfs/raid56.c | 654 ++++++++++++++++++++++++++++++++-------------- >> fs/btrfs/raid56.h | 3 - >> 3 files changed, 454 insertions(+), 207 deletions(-) >> >> diff --git a/fs/btrfs/Kconfig b/fs/btrfs/Kconfig >> index 43dd8ef763b8..8223b40d8ecb 100644 >> --- a/fs/btrfs/Kconfig >> +++ b/fs/btrfs/Kconfig >> @@ -15,7 +15,9 @@ config BTRFS_FS >> select ZSTD_DECOMPRESS >> select FS_IOMAP >> select RAID6_PQ >> - select XOR_BLOCKS >> + select ASYNC_RAID6_RECOV >> + select ASYNC_MEMCPY >> + select ASYNC_XOR >> select XXHASH >> depends on PAGE_SIZE_LESS_THAN_256KB >> >> diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c >> index 00a01b97cc0c..6e98e46df752 100644 >> --- a/fs/btrfs/raid56.c >> +++ b/fs/btrfs/raid56.c >> @@ -8,11 +8,10 @@ >> #include >> #include >> #include >> -#include >> #include >> #include >> -#include >> #include >> +#include >> #include "messages.h" >> #include "ctree.h" >> #include "disk-io.h" >> @@ -154,7 +153,6 @@ static void free_raid_bio_pointers(struct btrfs_raid= _bio *rbio) >> kfree(rbio->stripe_pages); >> kfree(rbio->bio_paddrs); >> kfree(rbio->stripe_paddrs); >> - kfree(rbio->finish_pointers); >> } >> >> static void free_raid_bio(struct btrfs_raid_bio *rbio) >> @@ -231,18 +229,33 @@ int btrfs_alloc_stripe_hash_table(struct btrfs_fs_= info *info) >> static void memcpy_from_bio_to_stripe(struct btrfs_raid_bio *rbio, uns= igned int sector_nr) >> { >> const u32 step =3D min(rbio->bioc->fs_info->sectorsize, PAGE_SIZE); >> + struct dma_async_tx_descriptor *tx =3D NULL; >> >> ASSERT(sector_nr < rbio->nr_sectors); >> for (int i =3D 0; i < rbio->sector_nsteps; i++) { >> unsigned int index =3D sector_nr * rbio->sector_nsteps + i; >> phys_addr_t dst =3D rbio->stripe_paddrs[index]; >> phys_addr_t src =3D rbio->bio_paddrs[index]; >> + struct async_submit_ctl submit; >> >> ASSERT(dst !=3D INVALID_PADDR); >> ASSERT(src !=3D INVALID_PADDR); >> >> - memcpy_page(phys_to_page(dst), offset_in_page(dst), >> - phys_to_page(src), offset_in_page(src), step); >> + init_async_submit(&submit, ASYNC_TX_FENCE, tx, NULL, NULL, NULL); >> + tx =3D async_memcpy(phys_to_page(dst), phys_to_page(src), >> + offset_in_page(dst), offset_in_page(src), >> + step, &submit); > > This doesn't look sane at all. > > If async_memcpy() returned an @tx, the next iteration will just overwrite= . > > I believe most LLM review can detect this problem, you may want to > choose a better model. Probably. >> + } >> + /* >> + * All steps are chained via ASYNC_TX_FENCE. issue_pending and >> + * dma_wait_for_async_tx on the last descriptor walk the entire >> + * chain, so earlier descriptors are submitted and waited on too. >> + * If every step completed synchronously, tx is NULL and no wait >> + * is needed. >> + */ >> + if (tx) { >> + async_tx_issue_pending(tx); >> + dma_wait_for_async_tx(tx); >> } >> } >> >> @@ -1079,12 +1092,11 @@ static struct btrfs_raid_bio *alloc_rbio(struct = btrfs_fs_info *fs_info, >> rbio->stripe_paddrs =3D kzalloc_objs(phys_addr_t, >> num_sectors * sector_nsteps, >> GFP_NOFS); >> - rbio->finish_pointers =3D kcalloc(real_stripes, sizeof(void *), GFP_NO= FS); >> rbio->error_bitmap =3D bitmap_zalloc(num_sectors, GFP_NOFS); >> rbio->stripe_uptodate_bitmap =3D bitmap_zalloc(num_sectors, GFP_NOFS)= ; >> >> if (!rbio->stripe_pages || !rbio->bio_paddrs || !rbio->stripe_paddrs = || >> - !rbio->finish_pointers || !rbio->error_bitmap || !rbio->stripe_upt= odate_bitmap) { >> + !rbio->error_bitmap || !rbio->stripe_uptodate_bitmap) { >> free_raid_bio_pointers(rbio); >> kfree(rbio); >> return ERR_PTR(-ENOMEM); >> @@ -1385,49 +1397,98 @@ static inline void *kmap_local_paddr(phys_addr_t= paddr) >> return kmap_local_page(phys_to_page(paddr)) + offset_in_page(paddr); >> } >> >> -static void generate_pq_vertical_step(struct btrfs_raid_bio *rbio, unsi= gned int sector_nr, >> - unsigned int step_nr) >> +/* >> + * The page/offset array model used by the async_tx helpers below repla= ces the >> + * old kmap-local pointer array. Each entry in pages[] + offsets[] rep= resents >> + * what was previously a single kmap_local_paddr() return value. >> + * >> + * For example, the old pattern: >> + * pointers[i] =3D kmap_local_paddr(paddr); >> + * raid6_gen_syndrome(..., pointers); >> + * kunmap_local(pointers[i]); >> + * >> + * becomes: >> + * pages[i] =3D phys_to_page(paddr); >> + * offsets[i] =3D offset_in_page(paddr); >> + * async_gen_syndrome(pages, offsets, ...); >> + * >> + * The async_tx API handles any necessary kmap internally. >> + */ > > What the comment is for? There is no code. Probably stale from various iterations. > > If you want to explain the change, commit message is your choice. > >> + >> +/* >> + * Maximum number of sectors per stripe. >> + * BTRFS_STRIPE_LEN (64 KiB) divided by the minimum sector size (PAGE_S= IZE, >> + * typically 4 KiB) gives at most 16 entries. If stripe length ever ch= anges, >> + * update this to BTRFS_STRIPE_LEN >> fs_info->sectorsize_bits. >> + */ >> +#define RECOVER_MAX_STRIPE_SECTORS 16 > > Leave a new line between macro and function. > > Move the macro either at the beginning of the file, or just before where > it is utilized. > > And do not use immediate number, use (BTRFS_STRIPE_LEN / > BTRFS_MIN_BLOCKSIZE) instead. Sure. > >> +static struct dma_async_tx_descriptor * >> +generate_pq_vertical_step(struct btrfs_raid_bio *rbio, unsigned int sec= tor_nr, >> + unsigned int step_nr, >> + struct page **pages, unsigned int *offsets, >> + struct dma_async_tx_descriptor *depend_tx) >> { >> - void **pointers =3D rbio->finish_pointers; >> const u32 step =3D min(rbio->bioc->fs_info->sectorsize, PAGE_SIZE); >> - int stripe; >> const bool has_qstripe =3D rbio->bioc->map_type & BTRFS_BLOCK_GROUP_R= AID6; >> + struct async_submit_ctl submit; >> + struct dma_async_tx_descriptor *tx; >> + int stripe; >> >> - /* First collect one sector from each data stripe */ >> - for (stripe =3D 0; stripe < rbio->nr_data; stripe++) >> - pointers[stripe] =3D kmap_local_paddr( >> - sector_paddr_in_rbio(rbio, stripe, sector_nr, step_nr, 0)); >> + for (stripe =3D 0; stripe < rbio->nr_data; stripe++) { >> + phys_addr_t paddr =3D sector_paddr_in_rbio(rbio, stripe, sector_nr, s= tep_nr, 0); >> >> - /* Then add the parity stripe */ >> - pointers[stripe++] =3D kmap_local_paddr(rbio_pstripe_paddr(rbio, secto= r_nr, step_nr)); >> + pages[stripe] =3D phys_to_page(paddr); >> + offsets[stripe] =3D offset_in_page(paddr); >> + } >> + >> + { >> + phys_addr_t paddr =3D rbio_pstripe_paddr(rbio, sector_nr, step_nr); >> + >> + pages[stripe] =3D phys_to_page(paddr); >> + offsets[stripe] =3D offset_in_page(paddr); >> + } > > A code block for what? Just to keep paddr as a local variable? > > Just big NO NO, declare @paddr at the beginning of the function and > remove the brackets. Will do. > >> + stripe++; >> >> if (has_qstripe) { >> - /* >> - * RAID6, add the qstripe and call the library function >> - * to fill in our p/q >> - */ >> - pointers[stripe++] =3D kmap_local_paddr( >> - rbio_qstripe_paddr(rbio, sector_nr, step_nr)); >> + phys_addr_t paddr =3D rbio_qstripe_paddr(rbio, sector_nr, step_nr); >> + >> + pages[stripe] =3D phys_to_page(paddr); >> + offsets[stripe] =3D offset_in_page(paddr); >> >> assert_rbio(rbio); >> - raid6_gen_syndrome(rbio->real_stripes, step, pointers); >> + init_async_submit(&submit, ASYNC_TX_FENCE, depend_tx, NULL, NULL, NUL= L); >> + tx =3D async_gen_syndrome(pages, offsets, rbio->real_stripes, step, &= submit); >> } else { >> - /* raid5 */ >> - memcpy(pointers[rbio->nr_data], pointers[0], step); >> - xor_gen(pointers[rbio->nr_data], pointers + 1, rbio->nr_data - 1, >> - step); >> + init_async_submit(&submit, ASYNC_TX_FENCE | ASYNC_TX_XOR_ZERO_DST, >> + depend_tx, NULL, NULL, NULL); >> + tx =3D async_xor_offs(pages[rbio->nr_data], offsets[rbio->nr_data], >> + pages, offsets, rbio->nr_data, step, &submit); >> } >> - for (stripe =3D stripe - 1; stripe >=3D 0; stripe--) >> - kunmap_local(pointers[stripe]); >> + return tx; > > I do not like the returning of an tx. > > Can't you just wait for the @tx to finish and keep the old void return ty= pe? > > >> } >> >> -/* Generate PQ for one vertical stripe. */ >> -static void generate_pq_vertical(struct btrfs_raid_bio *rbio, int secto= rnr) >> +/* >> + * Generate PQ for one vertical stripe, chaining into @depend_tx. >> + * The caller must issue_pending + wait for the returned descriptor >> + * (which may be NULL if the operation completed synchronously), >> + * then call generate_pq_vertical_finish() to mark parity uptodate. >> + */ >> +static struct dma_async_tx_descriptor * >> +generate_pq_vertical(struct btrfs_raid_bio *rbio, int sectornr, >> + struct page **pages, unsigned int *offsets, >> + struct dma_async_tx_descriptor *depend_tx) >> { >> - const bool has_qstripe =3D (rbio->bioc->map_type & BTRFS_BLOCK_GROUP_R= AID6); >> + struct dma_async_tx_descriptor *tx =3D depend_tx; >> >> for (int i =3D 0; i < rbio->sector_nsteps; i++) >> - generate_pq_vertical_step(rbio, sectornr, i); >> + tx =3D generate_pq_vertical_step(rbio, sectornr, i, pages, offsets, t= x); > > Again, isn't @tx overwritten for each step? > > Also, sashiko is reporting several critical level problems, please refer > to the review: > > https://sashiko.dev/#/patchset/20260629023912.2102700-1-rosenp%40gmail.co= m will check out. Although based on Cristoph's comment, this should probably be abandoned. I originally wanted to implement: https://lore.kernel.org/all/20210106123738.GS6430@twin.jikos.cz/ The LLM told me it would result in worse performance as the async-tx API requires batching. So I moved on to XOR. My current setup for my RAID is md-raid5 > btrfs instead of the latter directly as I assume the former performs better based on /proc/interrupts output.