public inbox for linux-fsdevel@vger.kernel.org
 help / color / mirror / Atom feed
From: "Darrick J. Wong" <djwong@kernel.org>
To: Christoph Hellwig <hch@lst.de>
Cc: cem@kernel.org, linux-fsdevel@vger.kernel.org, linux-xfs@vger.kernel.org
Subject: Re: [PATCH 11/11] xfs: add media verification ioctl
Date: Tue, 13 Jan 2026 15:21:13 -0800	[thread overview]
Message-ID: <20260113232113.GD15551@frogsfrogsfrogs> (raw)
In-Reply-To: <20260113155701.GA3489@lst.de>

On Tue, Jan 13, 2026 at 04:57:01PM +0100, Christoph Hellwig wrote:
> On Mon, Jan 12, 2026 at 04:35:25PM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Add a new privileged ioctl so that xfs_scrub can ask the kernel to
> > verify the media of the devices backing an xfs filesystem, and have any
> > resulting media errors reported to fsnotify and xfs_healer.
> 
> Hmm, the description is a bit sparse?
> 
> > +/* Verify the media of the underlying devices */
> > +struct xfs_verify_media {
> > +	__u32	dev;		/* I: XFS_VERIFY_*DEV */
> 
> This should probably use the enum xfs_device values?

Yes, that's a good point.

> > +#define XFS_VERIFY_TO_EOD	(~0ULL)	/* end of disk */
> 
> Is there much of a point in this flag?  scrub/healer really should
> know the device size, shouldn't they?

Yes, scrub and healer both know the size they want to verify.  I put
that in for the sake of xfs_io so that it wouldn't have to figure out
the device size, but as the ioctl always decreases @end_daddr to the
actual EOD, I think it'd be ok if xfs_io blindly wrote in ~0ULL.

> > diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
> > index 1edc4ddd10cdb2..5ef4109cc062d2 100644
> > --- a/fs/xfs/xfs_notify_failure.c
> > +++ b/fs/xfs/xfs_notify_failure.c
> 
> There's basically no overlap with the existing code in this file,
> why not add a new one?

Good idea!  I didn't even realize that it shares nothing now.

> > +	const unsigned int	iosize = BIO_MAX_VECS << PAGE_SHIFT;
> > +	unsigned int		bufsize = iosize;
> 
> That's a pretty gigantic buffer size.  In general a low number of
> MB should max out most current devices, and for a background scrub
> you generally do not want to actually max out the device..

256 * 4k (= 1MB) is too large a buffer?

I guess that /is/ 16M on a 64k-page system.

Maybe I need to reset my expectations. :)

> The in the background is also a good point here - we probably want
> a way to tune the size as it might put too much of a load onto the
> system pretty easily, and we need a way to dial it back.

I guess I could add a u32 max_io_size to constrain iosize; and a u16
rest_us that would enforce a sleep in between each submit_bio.

For xfs_scrub@.service I rely on systemd to set the io and cpu priority
and the kernel not to schedule xfs_scrub for resource usage control.

> > +	folio = folio_alloc(GFP_KERNEL, get_order(bufsize));
> > +	if (!folio)
> 
> That first folio_alloc will cause nasty stack traces when it fails.
> 
> > +		folio = folio_alloc(GFP_KERNEL, 0);
> 
> .. and then we fall back to just a single page.  This is what I ended
> up writing for an about to submitted series elsewhere:
> 
> static struct folio *folio_alloc_greedy(gfp_t gfp, size_t *size)
> {
>         struct folio *folio;
>                 
>         while (*size > PAGE_SIZE) {
>                 folio = folio_alloc(gfp | __GFP_NORETRY, get_order(*size));
>                 if (folio)
>                         return folio;
>                 *size = rounddown_pow_of_two(*size - 1);
>         }
> 
>         return folio_alloc(gfp, get_order(*size));
> }               

<nod>

> although that is a bit more complicated as we never want to round
> up the actual size.

I made a simpler version that does

for (order = get_order(); order > 0; order--)
	folio = folio_alloc(...);


> > +		for (i = 0; i < nr_vecs; i++) {
> > +			unsigned int	vec_sects =
> > +				min(nr_sects, bufsize >> SECTOR_SHIFT);
> > +
> > +			bio_add_folio_nofail(bio, folio,
> > +					vec_sects << SECTOR_SHIFT, 0);
> > +
> > +			bio_daddr += vec_sects;
> > +			bio_bbcount -= vec_sects;
> > +			bio_submitted += vec_sects;
> > +		}
> 
> A single folio is always just a single vetor in the bio.  No need
> for any of the looping here.

If we have to fall back to a single base page, shouldn't we still try to
create a larger bio?  A subtle assumption here is that it's ok to have
all the bvecs pointing to the same memory, and that the device won't
screw up if someone asks it to DMA to the same page simultaneously.

Obviously that all goes away with REQ_OP_VERIFY.

> > +		/* Don't let too many IOs accumulate */
> > +		if (bio_submitted > SZ_256M >> SECTOR_SHIFT) {
> > +			blk_finish_plug(&plug);
> > +			error = submit_bio_wait(bio);
> 
> Also the building up and chaining here seems harmful.  If you're
> on SSDs you want to fire things off ASAP if you have large I/O.
> On a HDD we'll take care of it below, but the bios will usually
> actually be split, not merged anyway as they are beyond the
> supported I/O size of the HBAs.

Hrm, maybe I should query the block device for max_sectors_kb then?

Though curiously the nvme devices all report 128K for that, and the
spinning rust reports 1M.

Anyway I've changed the loop body to allocate a bio, add up to iosize
worth of meory to it, submit it, and wait.  This avoids building up huge
chains of bios, and now I ask the device what's the biggest IO that it
supports:

iosize = min(queue_max_bytes(bdev_get_queue(btp->bt_bdev)), SZ_1M);
if (me->me_maxio && iosize > me->me_maxio)
	iosize = me->me_maxio;
if (iosize < SECTOR_SIZE)
	iosize = SECTOR_SIZE;

...

while (bio_bbcount > 0) {
	struct bio		*bio;
	unsigned int		nr_sects =
		min_t(sector_t, bio_bbcount, iosize >> SECTOR_SHIFT);
	const unsigned int	nr_vecs =
		howmany(nr_sects << SECTOR_SHIFT, bufsize);
	unsigned int		bio_submitted = 0;
	unsigned int		i;

	bio = bio_alloc(btp->bt_bdev, nr_vecs, REQ_OP_READ, GFP_KERNEL);
	if (!bio) {
		error = -ENOMEM;
		break;
	}
	bio->bi_iter.bi_sector = bio_daddr;

	for (i = 0; i < nr_vecs; i++) {
		unsigned int	vec_sects =
			min(nr_sects, bufsize >> SECTOR_SHIFT);

		bio_add_folio_nofail(bio, folio,
				vec_sects << SECTOR_SHIFT, 0);

		bio_daddr += vec_sects;
		bio_bbcount -= vec_sects;
		bio_submitted += vec_sects;
	}

	error = submit_bio_wait(bio);
	bio_put(bio);
	bio = NULL;
	if (error) {
		xfs_verify_media_error(mp, me, btp, fdev,
				new_start_daddr, bio_submitted, error);
		error = 0;
		break;
	}

	cond_resched();
	new_start_daddr += bio_submitted;
}

The only downside of this simple loop is that we can't pipeline read
requests, but maybe we don't want to flood the device with IO requests
anyway.

However, in the case where memory is fragmented and we can only get
(say) a single base page, it'll still try to load up the bio with as
many vecs as it can to try to keep the io size large, because issuing
256x 4k IOs is a lot slower than issuing 1x 1M IO with the same page
added 256 times.

I wonder if nr_vecs ought to be capped by queue_max_segments?

--D

  reply	other threads:[~2026-01-13 23:21 UTC|newest]

Thread overview: 36+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-01-13  0:32 [PATCHSET v5] xfs: autonomous self healing of filesystems Darrick J. Wong
2026-01-13  0:32 ` [PATCH 01/11] docs: discuss autonomous self healing in the xfs online repair design doc Darrick J. Wong
2026-01-13 16:00   ` Christoph Hellwig
2026-01-13  0:33 ` [PATCH 02/11] xfs: start creating infrastructure for health monitoring Darrick J. Wong
2026-01-13 16:03   ` Christoph Hellwig
2026-01-13  0:33 ` [PATCH 03/11] xfs: create event queuing, formatting, and discovery infrastructure Darrick J. Wong
2026-01-13 16:05   ` Christoph Hellwig
2026-01-13  0:33 ` [PATCH 04/11] xfs: convey filesystem unmount events to the health monitor Darrick J. Wong
2026-01-13 16:11   ` Christoph Hellwig
2026-01-13 18:48     ` Darrick J. Wong
2026-01-13  0:33 ` [PATCH 05/11] xfs: convey metadata health " Darrick J. Wong
2026-01-13 16:11   ` Christoph Hellwig
2026-01-13  0:34 ` [PATCH 06/11] xfs: convey filesystem shutdown " Darrick J. Wong
2026-01-13 16:14   ` Christoph Hellwig
2026-01-13 19:01     ` Darrick J. Wong
2026-01-13  0:34 ` [PATCH 07/11] xfs: convey externally discovered fsdax media errors " Darrick J. Wong
2026-01-13 16:15   ` Christoph Hellwig
2026-01-13  0:34 ` [PATCH 08/11] xfs: convey file I/O " Darrick J. Wong
2026-01-13 16:15   ` Christoph Hellwig
2026-01-13  0:34 ` [PATCH 09/11] xfs: allow reconfiguration of the health monitoring device Darrick J. Wong
2026-01-13 16:17   ` Christoph Hellwig
2026-01-13 18:28     ` Darrick J. Wong
2026-01-13  0:35 ` [PATCH 10/11] xfs: check if an open file is on the health monitored fs Darrick J. Wong
2026-01-13 16:17   ` Christoph Hellwig
2026-01-13  0:35 ` [PATCH 11/11] xfs: add media verification ioctl Darrick J. Wong
2026-01-13 15:57   ` Christoph Hellwig
2026-01-13 23:21     ` Darrick J. Wong [this message]
2026-01-14  5:40       ` Darrick J. Wong
2026-01-14  6:02       ` Christoph Hellwig
2026-01-14  6:07         ` Darrick J. Wong
2026-01-14  6:15           ` Christoph Hellwig
2026-01-14  6:19             ` Darrick J. Wong
  -- strict thread matches above, loose matches on Subject: below --
2026-01-16  5:42 [PATCHSET v6] xfs: autonomous self healing of filesystems Darrick J. Wong
2026-01-16  5:44 ` [PATCH 11/11] xfs: add media verification ioctl Darrick J. Wong
2026-01-19 15:56   ` Christoph Hellwig
2026-01-19 17:35     ` Darrick J. Wong
2026-01-21  6:34 [PATCHSET v7 1/3] xfs: autonomous self healing of filesystems Darrick J. Wong
2026-01-21  6:37 ` [PATCH 11/11] xfs: add media verification ioctl Darrick J. Wong

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260113232113.GD15551@frogsfrogsfrogs \
    --to=djwong@kernel.org \
    --cc=cem@kernel.org \
    --cc=hch@lst.de \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox