From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id D31593570A0 for ; Tue, 28 Oct 2025 23:56:29 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1761695789; cv=none; b=jvCqAL5PrvT0jOZPSKMwc0N03QSPzr71V/QTYncDjsGBesekXJWMAu24HJSmqYU9ps76HehQc0rzVeMGpBpFNXk0dMcc8GPrXffJ0u11KzuygtMr4ippU9RBYpdg/sfTryMtN6ptD47IeN3K9ipJ3t/xL1Yx756siY6si8tZifk= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1761695789; c=relaxed/simple; bh=FsyasErvTji3bnd3dFdHf4hvsq8BITaavdB/BJq0BO0=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=SNsjP+vbu5B5vtZAQetjn+a05dIoe4VJjoyUUpa8mpXktN9nZ0y7pE872qrT8OcUBqHMzicVloo164bJiL5i61EM7miqAPiKSmsdYNQL0oF3tw3kobCQYDVlJ973UnHM0pM3Sa1hoIjB8fmcJdBJ8+JnX45V6k56p0rB5m/fSSw= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=L/qwt0Xr; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="L/qwt0Xr" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 26B63C4CEFD; Tue, 28 Oct 2025 23:56:29 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1761695789; bh=FsyasErvTji3bnd3dFdHf4hvsq8BITaavdB/BJq0BO0=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=L/qwt0XrwkP0pfMKgyKQ3KUJzjUvV30H3sWvNfsfBtBMcKkKh1OAQSblw5oDDmLYs /ZK1ZYPEci9/DKslGyv+/vDRXRJ7Muwi6qh6FMR0wGw/P4nglIZxRnfEhZL3LvG97H gOWCxDeZXXoRg8oZN50eGpd1PbCLwrdZbaCdHSu7ZfwYIa5iszRZ8Q6eCAuPwAVqsN XTOJFckz7Y2Zv1eieXxr7S8Np/ck9qXBN0LRIJN5lzYwTJxUfHhZKjmRUpPPJ271iQ eqaQ/2woCVo8poy5DW1T+5UPivRFQBPrGpDLLqpWHRQiKsME6R1uvOESSUZ5vEnjdJ PVRy3EHQGyUSg== Date: Tue, 28 Oct 2025 19:56:28 -0400 From: Mike Snitzer To: Chuck Lever Cc: Jeff Layton , Christoph Hellwig , NeilBrown , Olga Kornievskaia , Dai Ngo , Tom Talpey , linux-nfs@vger.kernel.org, Chuck Lever , Christoph Hellwig Subject: Re: [PATCH v7 14/14] NFSD: Initialize separate ki_flags Message-ID: References: <5017c8dc-9c14-4a92-a259-6e4cdc67d250@kernel.org> <5a2e4884bb6b31b0443bdac6174c77f7273e92b1.camel@kernel.org> <1f7b30d2-f806-400f-81d3-80b6c924c410@kernel.org> Precedence: bulk X-Mailing-List: linux-nfs@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: On Tue, Oct 28, 2025 at 02:48:33PM -0400, Chuck Lever wrote: > On 10/28/25 12:04 PM, Mike Snitzer wrote: > > I'm also concerned about initiating > > any page cache invalidation off the back of the DIO WRITE in the > > middle (and its potential to fallback to buffered if that invalidation > > fails... any associated buffered IO fallback or ends best have O_DSYNC > > set it ensure everything ondisk). > > My questions: All good questions, and I appreciate the time you spent analyzing these aspects. I'm all for NFSD's IO modes being able to work with all possible stable_how. Your analysis helps expand the scope of these IO modes to really be pushed to their limits (specifically thinking of NFSD_IO_DIRECT being paired with UNSTABLE and it _not_ using IOCB_DSYNC, interesting combo but IMO quite a risky default given we're only at the debugfs interface stage for NFSD_IO_DIRECT). > A. Which filesystem code can return -ENOTBLK? > > - iomap-based DIO (fs/iomap/direct-io.c:715): XFS, ext4, f2fs, gfs2, > zonefs, erofs > - Legacy DIO path (fs/direct-io.c:984): ext2, some ext4 code paths > - btrfs (has its own DIO implementation that uses -ENOTBLK) > > Code auditing shows that vfs_iocb_iter_write() never returns -ENOTBLK > because all filesystems that could generate this error internally > convert it to 0 or another error code before returning from their > write_iter implementation. > > Therefore NFSD itself does not need to know about or handle page cache > invalidation failure. OK, good to hear; but have you reviewed if they are all using D_SYNC as part of their internal handling of their buffered fallback? NFSD setting DSYNC just acknowledges it is required to ensure O_DIRECT WRITEs are ondisk upon return to client -- something I don't want to leave to the client to ensure via NFS_UNSTABLE's subsequent COMMIT. Setting O_DSYNC gives us a baseline where we don't leave some other subsystem to ensure data gets to disk as quickly as possible (O_DSYNC offers NFS_DATA_SYNC, but O_DSYNC|SYNC's NFS_FILE_SYNC offers a win by avoiding COMMIT, even more enticing). So I need NFSD_IO_DIRECT to be able to require IOCB_DSYNC and possibly IOCB_DSYNC|IOCB_SYNC. And I'm most comfortable with the misaligned DIO WRITE support using O_DSYNC for all 3 segments; so that WRITE's data is ondisk before returning from the WRITE (NFS_DATA_SYNC), and bonus if we can avoid extra COMMIT (NFS_FILE_SYNC). > B. Even if NFSD had to recover from a failed page cache invalidation, > does a fallback write need to set IOCB_DSYNC (not considering the NFS > protocol requirements) ? > > When invalidation fails, some pages remain in the cache (dirty or with > private data). The buffered fallback writes to those same pages, > updating them with the correct data. There's no torn state or corruption > hazard. > > The fallback restarts the entire write from scratch, as buffered. Any > partial progress (like a buffered first segment) gets rewritten with > the same data. No orphaned blocks or inconsistent state. > > If another process does a DIO read of this region, it will call > kiocb_write_and_wait() first (see mm/filemap.c:2912), which flushes any > dirty pages before reading. So they'll see correct data even if it's > still dirty in cache (Jeff's concern). Yes, but the point is there is extra work that involves the page cache by other layers. And O_DIRECT is intended to work expecting the IO has been flushed to disk. O_DIRECT is aiming to reduce page cache usage, to keep memory use low. So ensuring the page cache invalidated to disk by the subsequent write as a side-efffect of O_DSYNC helps (more on this below [0]). > C. When setting up a three-segment write, is IOCB_DSYNC needed on the > unaligned ends (not considering the NFS protocol requirements) ? > > Before writing the DIO middle segment, the VFS must first invalidate the > page cache, so the first (buffered) segment is flushed to durability > anyway. The use of IOCB_DSYNC for the first segment is superfluous. No, the first segment reflects the misaligned head of the WRITE. The DIO-aligned middle segment's DIO will just invalidate pages aligned on the middle segment's alignment (which is page aligned). So the DIO from the middle segment won't have any side-effect on the first segment's page. First page needs to be written out with IOCB_DSYNC. > After writing the DIO middle segment, the unaligned end remains only in > the page cache if IOCB_DSYNC is not used. But a subsequent DIO read will > flush that to durability (via kiocb_write_and_wait) before satisfying > the read. So, IOCB_DSYNC is not needed there to meet putative file data > visibility mandates. The same applies to both the misaligned first (head) and third (tail) segments, they both need O_DSYNC to ensure their contents are ondisk. [0]: Again, my goal for NFSD_IO_DIRECT is to operate in terms of O_DIRECT|O_DSYNC ondisk.. and not lean on the page cache more than needed. This makes the VM subsystem more scalable as a side-effect of it having more clean pages that are quickly dropped and/or reused if something needs memory. > >> As I understand it, IOCB_DSYNC has nothing to do with whether the three > >> segments are directed to the correct file offsets. Serially initiating > >> the writes with the same iocb should be sufficient to ensure > >> correctness. > > > > IOCB_DSYNC just ensures when they complete they are on-disk. And any > > failure of, or short WRITE, is trapped and immediate cause for early > > return. Whereby ensuring file offset integrity. > > AFAICT, generic_perform_write() updates ki_pos synchronously whenever > it returns successfully, regardless of the IOCB_DSYNC or IOCB_DIRECT > flag settings. If the vfs_iocb_iter_write() call fails or returns fewer > bytes than expected, the loop terminates immediately and writes to > subsequent segments are not initiated. I don't see a data integrity > issue here. For ki_pos we're saying the same thing: yes ki_pos is updated (and it'll reflect a short write if one happens, etc). IOCB_DSYNC doesn't influence if ki_pos is advanced. But if IOCB_DSYNC isn't set then ki_pos is advanced beyond what was written _to disk_. That obviously doesn't matter if NFS_UNSTABLE, thanks to its COMMIT, but it does increase latency (though it is worth benchmarking with various workloads!). Anyway, I've been working to ensure NFSD_IO_DIRECT acts as if at least NFS_DATA_SYNC, but preferably NFS_FILE_SYNC, set. I should have more clearly stated that. Bypassing all caches as much possible allows for enabling scalable cache coherence, e.g. intelligent applications that don't rely on file locking, think multiple writers writing to their own extent of a large file. Having NFSD_IO_DIRECT be handled as UNSTABLE isn't adequate for my needs (immediate ondisk requirements, lowest memory and CPU usage as possible). Would you be OK with adding 2 new debugfs knobs?: /sys/kernel/debug/nfsd/io_cache_read_stable_how /sys/kernel/debug/nfsd/io_cache_write_stable_how - Each defaults to using the stable_how that the client requested. - Each will serve as a floor, and only override default if they are greater/stricter than the client requested stable_how. (so if 'io_cache_write_stable_how' set to NFS_FILE_SYNC it'd override client specified UNSTABLE). - Each can override the stable_how used so each IO mode behaves accordingly (e.g. I can set NFSD_IO_DIRECT and NFS_FILE_SYNC to get the bahviour I'd like). > Thus I'm still unconvinced that IOCB_DSYNC is required if the client has > requested an UNSTABLE write. I'm open to reviewing actual evidence of a > failure mode where IOCB_DSYNC might prevent data corruption, but so far > I don't see how it can happen. Yeah, I see your point now. If io_cache_{read,write}_stable_how debugfs controls are acceptable to you it'll give us maximum flexibility for controlling how NFSD behaves in each NFSD_IO mode. Thanks. Mike