From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id C0C0A329C69; Mon, 19 Jan 2026 17:35:44 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1768844144; cv=none; b=CUW5jhNw3KNuIel6Z5zMlJqk+fN2VTCWFiU3+wp2VvwhpwCxek7iJDrqxfx3fNmyFwuvR2cXV/PZ7jJloGTrer1kgwvWF5+A8542aJSiI/ncRN8867As8/oXxMgDmSDy0VfhBAiBgqleOJdXAQQfkyHloVIoU4rg0v61GPOtbBU= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1768844144; c=relaxed/simple; bh=kFzynpRCnZlst/3w4DM3gAbVgknggmBfSxxX45P6hyk=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=I04Qx1CKqnOarlYu4weglBFscUw7heQYr1YQien6CHgPDvzNqcDMgDiNxP09De/N4Qf8ehp+ilWsYweuug8tVHSoRcqFfWiQ0ZVeweNYuJEwH33Gr4fDzraMTxNawDDID0Y8ZmdfZr7pFN+lEwBGdpRRW2la61bQQlpHUb6AaLw= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=Vpfo5gjG; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="Vpfo5gjG" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 118AAC116C6; Mon, 19 Jan 2026 17:35:44 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1768844144; bh=kFzynpRCnZlst/3w4DM3gAbVgknggmBfSxxX45P6hyk=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=Vpfo5gjGsJb2tHajhvEt4uwdisXhw46NJqGqMX2ZRgEGIRcmgaS63W1sDmCK/YHlk 7LpuSzd8grFFfNSAh0E3sx5CjZRLEERsd4HXK8jFmYirKsqzamQpWmHk1ZwuUMvgka WzDDRML2+tFTQqZ1T9UgmFlMqHzp+IegWvgiqOiWr5wBTMADSnsr70MDEmYAulF2Df sU0XiYiD4qan1c1unIHqk6s/ZcwLumgVxwu+n4IfABaLeSEYB52HIaDEgu/vqSyXHN S3Az+rsX/yNHI7ckWG1pCZJbhXESJd/N6Z9a0gbrptdAOqQOwH3y+1IgKUi/SoZLxA HQ3xKqkMmuRLg== Date: Mon, 19 Jan 2026 09:35:43 -0800 From: "Darrick J. Wong" To: Christoph Hellwig Cc: cem@kernel.org, linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org Subject: Re: [PATCH 11/11] xfs: add media verification ioctl Message-ID: <20260119173543.GZ15551@frogsfrogsfrogs> References: <176852588473.2137143.1604994842772101197.stgit@frogsfrogsfrogs> <176852588776.2137143.7103003682733018282.stgit@frogsfrogsfrogs> <20260119155639.GA10822@lst.de> Precedence: bulk X-Mailing-List: linux-xfs@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20260119155639.GA10822@lst.de> On Mon, Jan 19, 2026 at 04:56:39PM +0100, Christoph Hellwig wrote: > On Thu, Jan 15, 2026 at 09:44:53PM -0800, Darrick J. Wong wrote: > > From: Darrick J. Wong > > > > Add a new privileged ioctl so that xfs_scrub can ask the kernel to > > verify the media of the devices backing an xfs filesystem, and have any > > resulting media errors reported to fsnotify and xfs_healer. > > I really wish this would explain the approach (reading data into a > kernel buffer, and the choices of the buffer size and I/O pattern) > and their rationale a bit better here. How about: "Add a new privileged ioctl so that xfs_scrub can ask the kernel to verify the media of the devices backing an xfs filesystem, and have any resulting media errors reported to fsnotify and xfs_healer. "To accomplish this, the kernel allocates a folio between the base page size and 1MB, and issues read IOs to a gradually incrementing range of one of the storage devices underlying an xfs filesystem. If any error occurs, that raw error is reported to the calling process. If the error happens to be one of the ones that the kernel considers indicative of data loss, then it will also be reported to xfs_healthmon and fsnotify. "Driving the verification from the kernel enables xfs (and by extension xfs_scrub) to have precise control over the size and error handling of IOs that are issued to the underlying block device, and to emit notifications about problems to other relevant kernel subsystems immediately. "Note that the caller is also allowed to reduce the size of the IO and to ask for a relaxation period after each IO." > > + > > +struct xfs_group_data_lost { > > + xfs_agblock_t startblock; > > + xfs_extlen_t blockcount; > > +}; > > + > > +/* Report lost file data from rmap records */ > > +STATIC int > > +xfs_verify_report_data_lost( > > + struct xfs_btree_cur *cur, > > + const struct xfs_rmap_irec *rec, > > + void *data) > > +{ > > + struct xfs_mount *mp = cur->bc_mp; > > + struct xfs_inode *ip; > > + struct xfs_group_data_lost *lost = data; > > + xfs_fileoff_t fileoff = rec->rm_offset; > > + xfs_extlen_t blocks = rec->rm_blockcount; > > + const bool is_attr = > > + (rec->rm_flags & XFS_RMAP_ATTR_FORK); > > + const xfs_agblock_t lost_end = > > + lost->startblock + lost->blockcount; > > + const xfs_agblock_t rmap_end = > > + rec->rm_startblock + rec->rm_blockcount; > > + int error = 0; > > + > > + if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner)) > > + return 0; > > + > > + error = xfs_iget(mp, cur->bc_tp, rec->rm_owner, 0, 0, &ip); > > + if (error) > > + return 0; > > + > > + if (rec->rm_flags & XFS_RMAP_BMBT_BLOCK) { > > + xfs_bmap_mark_sick(ip, is_attr ? XFS_ATTR_FORK : XFS_DATA_FORK); > > + goto out_rele; > > + } > > + > > + if (is_attr) { > > + xfs_inode_mark_sick(ip, XFS_SICK_INO_XATTR); > > + goto out_rele; > > + } > > + > > + if (lost->startblock > rec->rm_startblock) { > > + fileoff += lost->startblock - rec->rm_startblock; > > + blocks -= lost->startblock - rec->rm_startblock; > > + } > > + if (rmap_end > lost_end) > > + blocks -= rmap_end - lost_end; > > + > > + fserror_report_data_lost(VFS_I(ip), XFS_FSB_TO_B(mp, fileoff), > > + XFS_FSB_TO_B(mp, blocks), GFP_NOFS); > > + > > +out_rele: > > + xfs_irele(ip); > > + return 0; > > +} > > + > > +/* Walk reverse mappings to look for all file data loss */ > > +STATIC int > > +xfs_verify_report_losses( > > + struct xfs_mount *mp, > > + enum xfs_group_type type, > > + xfs_daddr_t daddr, > > + u64 bblen) > > +{ > > + struct xfs_group *xg = NULL; > > + struct xfs_trans *tp; > > + xfs_fsblock_t start_bno, end_bno; > > + uint32_t start_gno, end_gno; > > + int error; > > + > > + if (type == XG_TYPE_RTG) { > > + start_bno = xfs_daddr_to_rtb(mp, daddr); > > + end_bno = xfs_daddr_to_rtb(mp, daddr + bblen - 1); > > + } else { > > + start_bno = XFS_DADDR_TO_FSB(mp, daddr); > > + end_bno = XFS_DADDR_TO_FSB(mp, daddr + bblen - 1); > > + } > > + > > + tp = xfs_trans_alloc_empty(mp); > > + start_gno = xfs_fsb_to_gno(mp, start_bno, type); > > + end_gno = xfs_fsb_to_gno(mp, end_bno, type); > > + while ((xg = xfs_group_next_range(mp, xg, start_gno, end_gno, type))) { > > + struct xfs_buf *agf_bp = NULL; > > + struct xfs_rtgroup *rtg = NULL; > > + struct xfs_btree_cur *cur; > > + struct xfs_rmap_irec ri_low = { }; > > + struct xfs_rmap_irec ri_high; > > + struct xfs_group_data_lost lost; > > + > > + if (type == XG_TYPE_AG) { > > + struct xfs_perag *pag = to_perag(xg); > > + > > + error = xfs_alloc_read_agf(pag, tp, 0, &agf_bp); > > + if (error) { > > + xfs_perag_put(pag); > > + break; > > + } > > + > > + cur = xfs_rmapbt_init_cursor(mp, tp, agf_bp, pag); > > + } else { > > + rtg = to_rtg(xg); > > + xfs_rtgroup_lock(rtg, XFS_RTGLOCK_RMAP); > > + cur = xfs_rtrmapbt_init_cursor(tp, rtg); > > + } > > + > > + /* > > + * Set the rmap range from ri_low to ri_high, which represents > > + * a [start, end] where we looking for the files or metadata. > > + */ > > + memset(&ri_high, 0xFF, sizeof(ri_high)); > > + if (xg->xg_gno == start_gno) > > + ri_low.rm_startblock = > > + xfs_fsb_to_gbno(mp, start_bno, type); > > + if (xg->xg_gno == end_gno) > > + ri_high.rm_startblock = > > + xfs_fsb_to_gbno(mp, end_bno, type); > > + > > + lost.startblock = ri_low.rm_startblock; > > + lost.blockcount = min(xg->xg_block_count, > > + ri_high.rm_startblock + 1) - > > + ri_low.rm_startblock; > > + > > + error = xfs_rmap_query_range(cur, &ri_low, &ri_high, > > + xfs_verify_report_data_lost, &lost); > > + xfs_btree_del_cursor(cur, error); > > + if (agf_bp) > > + xfs_trans_brelse(tp, agf_bp); > > + if (rtg) > > + xfs_rtgroup_unlock(rtg, XFS_RTGLOCK_RMAP); > > + if (error) { > > + xfs_group_put(xg); > > + break; > > + } > > + } > > + > > + xfs_trans_cancel(tp); > > + return 0; > > +} > > + > > +/* > > + * Compute the desired verify IO size. > > + * > > + * To minimize command overhead, we'd like to create bios that are 1MB, though > > + * we allow the user to ask for a smaller size. > > + */ > > +STATIC unsigned int > > +xfs_verify_iosize( > > + const struct xfs_verify_media *me, > > + struct xfs_buftarg *btp, > > + uint64_t bbcount) > > +{ > > + unsigned int iosize = > > + min_not_zero(SZ_1M, me->me_max_io_size); > > + > > + BUILD_BUG_ON(BBSHIFT != SECTOR_SHIFT); > > + ASSERT(BBTOB(bbcount) >= bdev_logical_block_size(btp->bt_bdev)); > > + > > + return clamp(iosize, bdev_logical_block_size(btp->bt_bdev), > > + BBTOB(bbcount)); > > +} > > > +/* Allocate as much memory as we can get for verification buffer. */ > > +STATIC struct folio * > > Can we please retired STATIC already? Ok. > > +STATIC void > > +xfs_verify_media_error( > > + struct xfs_mount *mp, > > + struct xfs_verify_media *me, > > + struct xfs_buftarg *btp, > > + xfs_daddr_t daddr, > > + unsigned int bio_bbcount, > > + blk_status_t bio_status) > > +{ > > + trace_xfs_verify_media_error(mp, me, btp->bt_bdev->bd_dev, daddr, > > + bio_bbcount, bio_status); > > + > > + /* > > + * Pass any I/O error up to the caller if we didn't successfully verify "Pass any error, I/O or otherwise, up to the caller..." > > + * any bytes at all. > > + */ > > + if (me->me_start_daddr == daddr) > > + me->me_ioerror = -blk_status_to_errno(bio_status); > > + > > + /* > > + * PI validation failures, medium errors, or general IO errors are > > + * treated as indicators of data loss. Everything else are (hopefully) > > + * transient errors and are not reported. > > + */ > > But still left in me->me_ioerror. Is that intentional? Yeah. All errors are reported to the ioctl caller, but only the ones that sound like data loss get passed to healthmon/fsnotify. > > + switch (me->me_dev) { > > + case XFS_DEV_DATA: > > + xfs_verify_report_losses(mp, XG_TYPE_AG, daddr, bio_bbcount); > > + break; > > + case XFS_DEV_RT: > > + xfs_verify_report_losses(mp, XG_TYPE_RTG, daddr, bio_bbcount); > > + break; > > + } > > At some point we really need dev_to_group_type and vice versa helpers. Heh, yes. Do you want that /now/ or when we add a second user? It also occurs to me that I could probably speed up the verify code a tiny bit more by hoisting the bio variable scope outside the loop and using bio_reuse to reset bi_iter and bi_sector, rather than freeing it and allocating a new one. --D