From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id C2B3B241690; Tue, 13 Jan 2026 23:21:14 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1768346474; cv=none; b=lBPJEZzvEBAe74IzQFXyP+ls+593xtVzJZYUNqTyyTV0z5lw2XSeHOH2kZChIcTQjDLuVib7Kd9UN6oTW7Y9nYJNA70lY+pHmvQdt647Wq1vHyV6aBil9hB7XlcUTLynVjTf7N2Vp5NokdCdxCZP8cRas8vWH+kz/UwaE6xOBdU= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1768346474; c=relaxed/simple; bh=UsNDol6n9tcvpC/ws30F5OOraDnyBN5fhodZEr2rS6w=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=KxDMXnSApKzsJf/aSsup6BaGmRuIAaECJpJqpaYqTgQwD1Ue27a9CnF3umh+mQmVNnTsXW2tECCJxCWjMvWphCmfonkHZoU7yYbUL9YC3hk2yxiByOU2qSsxfSoWpuNLWMQXy3ezInuzYQTMRlZ1QK1HAP19xDcIv+jZ0y/clAw= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=PioHGiMV; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="PioHGiMV" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 56C8BC116C6; Tue, 13 Jan 2026 23:21:14 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1768346474; bh=UsNDol6n9tcvpC/ws30F5OOraDnyBN5fhodZEr2rS6w=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=PioHGiMVf1SW4n3Uz5896+9PHlkJAQhRpvTN7831D46CifFGng2/J7ckCbuSfsETF Mw+zoB+3lQE/0ES5cj//pNAGRCUx/TVFf2kmyTzcapVjACyLNX+yo0RREwmlC9a1Eh vvN8kQ5lYwGq8T4BdNg/geuJUbgl1ygXD9zmNBo+D7Np9azXBFiiYpbOlchquOcoJX giR043uAn2FX3qTsEawWXnbjvcwoT+BwxE3QuFFhzWSHOyX4GDSowSxTnw5FCARA7E O7/EIDLGNVNXUjg+TaXntSdNUfcsvd0gXMYBY+9zWpY94m54qBkV7yMy2gOTcU87HX KogRwAm+YwNOQ== Date: Tue, 13 Jan 2026 15:21:13 -0800 From: "Darrick J. Wong" To: Christoph Hellwig Cc: cem@kernel.org, linux-fsdevel@vger.kernel.org, linux-xfs@vger.kernel.org Subject: Re: [PATCH 11/11] xfs: add media verification ioctl Message-ID: <20260113232113.GD15551@frogsfrogsfrogs> References: <176826412644.3493441.536177954776056129.stgit@frogsfrogsfrogs> <176826412941.3493441.8359506127711497025.stgit@frogsfrogsfrogs> <20260113155701.GA3489@lst.de> Precedence: bulk X-Mailing-List: linux-fsdevel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20260113155701.GA3489@lst.de> On Tue, Jan 13, 2026 at 04:57:01PM +0100, Christoph Hellwig wrote: > On Mon, Jan 12, 2026 at 04:35:25PM -0800, Darrick J. Wong wrote: > > From: Darrick J. Wong > > > > Add a new privileged ioctl so that xfs_scrub can ask the kernel to > > verify the media of the devices backing an xfs filesystem, and have any > > resulting media errors reported to fsnotify and xfs_healer. > > Hmm, the description is a bit sparse? > > > +/* Verify the media of the underlying devices */ > > +struct xfs_verify_media { > > + __u32 dev; /* I: XFS_VERIFY_*DEV */ > > This should probably use the enum xfs_device values? Yes, that's a good point. > > +#define XFS_VERIFY_TO_EOD (~0ULL) /* end of disk */ > > Is there much of a point in this flag? scrub/healer really should > know the device size, shouldn't they? Yes, scrub and healer both know the size they want to verify. I put that in for the sake of xfs_io so that it wouldn't have to figure out the device size, but as the ioctl always decreases @end_daddr to the actual EOD, I think it'd be ok if xfs_io blindly wrote in ~0ULL. > > diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c > > index 1edc4ddd10cdb2..5ef4109cc062d2 100644 > > --- a/fs/xfs/xfs_notify_failure.c > > +++ b/fs/xfs/xfs_notify_failure.c > > There's basically no overlap with the existing code in this file, > why not add a new one? Good idea! I didn't even realize that it shares nothing now. > > + const unsigned int iosize = BIO_MAX_VECS << PAGE_SHIFT; > > + unsigned int bufsize = iosize; > > That's a pretty gigantic buffer size. In general a low number of > MB should max out most current devices, and for a background scrub > you generally do not want to actually max out the device.. 256 * 4k (= 1MB) is too large a buffer? I guess that /is/ 16M on a 64k-page system. Maybe I need to reset my expectations. :) > The in the background is also a good point here - we probably want > a way to tune the size as it might put too much of a load onto the > system pretty easily, and we need a way to dial it back. I guess I could add a u32 max_io_size to constrain iosize; and a u16 rest_us that would enforce a sleep in between each submit_bio. For xfs_scrub@.service I rely on systemd to set the io and cpu priority and the kernel not to schedule xfs_scrub for resource usage control. > > + folio = folio_alloc(GFP_KERNEL, get_order(bufsize)); > > + if (!folio) > > That first folio_alloc will cause nasty stack traces when it fails. > > > + folio = folio_alloc(GFP_KERNEL, 0); > > .. and then we fall back to just a single page. This is what I ended > up writing for an about to submitted series elsewhere: > > static struct folio *folio_alloc_greedy(gfp_t gfp, size_t *size) > { > struct folio *folio; > > while (*size > PAGE_SIZE) { > folio = folio_alloc(gfp | __GFP_NORETRY, get_order(*size)); > if (folio) > return folio; > *size = rounddown_pow_of_two(*size - 1); > } > > return folio_alloc(gfp, get_order(*size)); > } > although that is a bit more complicated as we never want to round > up the actual size. I made a simpler version that does for (order = get_order(); order > 0; order--) folio = folio_alloc(...); > > + for (i = 0; i < nr_vecs; i++) { > > + unsigned int vec_sects = > > + min(nr_sects, bufsize >> SECTOR_SHIFT); > > + > > + bio_add_folio_nofail(bio, folio, > > + vec_sects << SECTOR_SHIFT, 0); > > + > > + bio_daddr += vec_sects; > > + bio_bbcount -= vec_sects; > > + bio_submitted += vec_sects; > > + } > > A single folio is always just a single vetor in the bio. No need > for any of the looping here. If we have to fall back to a single base page, shouldn't we still try to create a larger bio? A subtle assumption here is that it's ok to have all the bvecs pointing to the same memory, and that the device won't screw up if someone asks it to DMA to the same page simultaneously. Obviously that all goes away with REQ_OP_VERIFY. > > + /* Don't let too many IOs accumulate */ > > + if (bio_submitted > SZ_256M >> SECTOR_SHIFT) { > > + blk_finish_plug(&plug); > > + error = submit_bio_wait(bio); > > Also the building up and chaining here seems harmful. If you're > on SSDs you want to fire things off ASAP if you have large I/O. > On a HDD we'll take care of it below, but the bios will usually > actually be split, not merged anyway as they are beyond the > supported I/O size of the HBAs. Hrm, maybe I should query the block device for max_sectors_kb then? Though curiously the nvme devices all report 128K for that, and the spinning rust reports 1M. Anyway I've changed the loop body to allocate a bio, add up to iosize worth of meory to it, submit it, and wait. This avoids building up huge chains of bios, and now I ask the device what's the biggest IO that it supports: iosize = min(queue_max_bytes(bdev_get_queue(btp->bt_bdev)), SZ_1M); if (me->me_maxio && iosize > me->me_maxio) iosize = me->me_maxio; if (iosize < SECTOR_SIZE) iosize = SECTOR_SIZE; ... while (bio_bbcount > 0) { struct bio *bio; unsigned int nr_sects = min_t(sector_t, bio_bbcount, iosize >> SECTOR_SHIFT); const unsigned int nr_vecs = howmany(nr_sects << SECTOR_SHIFT, bufsize); unsigned int bio_submitted = 0; unsigned int i; bio = bio_alloc(btp->bt_bdev, nr_vecs, REQ_OP_READ, GFP_KERNEL); if (!bio) { error = -ENOMEM; break; } bio->bi_iter.bi_sector = bio_daddr; for (i = 0; i < nr_vecs; i++) { unsigned int vec_sects = min(nr_sects, bufsize >> SECTOR_SHIFT); bio_add_folio_nofail(bio, folio, vec_sects << SECTOR_SHIFT, 0); bio_daddr += vec_sects; bio_bbcount -= vec_sects; bio_submitted += vec_sects; } error = submit_bio_wait(bio); bio_put(bio); bio = NULL; if (error) { xfs_verify_media_error(mp, me, btp, fdev, new_start_daddr, bio_submitted, error); error = 0; break; } cond_resched(); new_start_daddr += bio_submitted; } The only downside of this simple loop is that we can't pipeline read requests, but maybe we don't want to flood the device with IO requests anyway. However, in the case where memory is fragmented and we can only get (say) a single base page, it'll still try to load up the bio with as many vecs as it can to try to keep the io size large, because issuing 256x 4k IOs is a lot slower than issuing 1x 1M IO with the same page added 256 times. I wonder if nr_vecs ought to be capped by queue_max_segments? --D