Date: Mon, 22 Apr 2024 16:03:56 +0100
From: Matthew Wilcox
To: John Garry
Cc: axboe@kernel.dk, brauner@kernel.org, djwong@kernel.org, viro@zeniv.linux.org.uk, jack@suse.cz, akpm@linux-foundation.org, dchinner@redhat.com, tytso@mit.edu, hch@lst.de, martin.petersen@oracle.com, nilay@linux.ibm.com, ritesh.list@gmail.com, mcgrof@kernel.org, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, ojaswin@linux.ibm.com, p.raghav@samsung.com, jbongio@google.com, okiselev@amazon.com
Subject: Re: [PATCH RFC 5/7] fs: iomap: buffered atomic write support
References: <20240422143923.3927601-1-john.g.garry@oracle.com> <20240422143923.3927601-6-john.g.garry@oracle.com>
In-Reply-To: <20240422143923.3927601-6-john.g.garry@oracle.com>
On Mon, Apr 22, 2024 at 02:39:21PM +0000, John Garry wrote:
> Add special handling of PG_atomic flag to iomap buffered write path.
>
> To flag an iomap iter for an atomic write, set IOMAP_ATOMIC.
>
> For a folio associated with a write which has IOMAP_ATOMIC set, set
> PG_atomic.
>
> Otherwise, when IOMAP_ATOMIC is unset, clear PG_atomic.
>
> This means that for an "atomic" folio which has not been written back, it
> loses its "atomicity". So if userspace issues a write with RWF_ATOMIC set
> and another write with RWF_ATOMIC unset and which fully or partially
> overwrites that same region as the first write, that folio is not written
> back atomically. For such a scenario to occur, it would be considered a
> userspace usage error.
>
> To ensure that a buffered atomic write is written back atomically when
> the write syscall returns, RWF_SYNC or similar needs to be used (in
> conjunction with RWF_ATOMIC).
>
> As a safety check, when getting a folio for an atomic write in
> iomap_get_folio(), ensure that the length matches the inode mapping folio
> order-limit.
>
> Only a single BIO should ever be submitted for an atomic write. So modify
> iomap_add_to_ioend() to ensure that we don't try to write back an atomic
> folio as part of a larger mixed-atomicity BIO.
>
> In iomap_alloc_ioend(), handle an atomic write by setting REQ_ATOMIC for
> the allocated BIO.
>
> When a folio is written back, again clear PG_atomic, as it is no longer
> required.

I assume it will not be needlessly written back a second time...

I'm not taking a position on the mechanism yet; I need to think about it
some more. But there's a hole here I also don't have a solution to, so
we can all start thinking about it.

In iomap_write_iter(), we call copy_folio_from_iter_atomic().
Through no fault of the application, if the range crosses a page
boundary, we might partially copy the bytes from the first page, then
take a page fault on the second page, hence doing a short write into
the folio. And there's nothing preventing writeback from writing back
a partially copied folio.

Now, if it's not dirty, then it can't be written back. So if we're
doing an atomic write, we could clear the dirty bit after calling
iomap_write_begin() (given the usage scenarios we've discussed, it
should always be clear ...)

We need to prevent the "fall back to a short copy" logic in
iomap_write_iter() as well. But then we also need to make sure we don't
get stuck in a loop, so maybe go three times around, and if it's still
not readable as a chunk, return -EFAULT?
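The proposal above (keep the folio clean during the copy, refuse the short-copy fallback, retry a bounded number of times, then fail with -EFAULT) can be modeled in user space. This is a minimal sketch, not kernel code: struct model_folio, struct model_source, model_copy() and model_atomic_write() are hypothetical stand-ins for a folio, the source iov_iter, copy_folio_from_iter_atomic() and the atomic branch of iomap_write_iter(). It assumes, per the parenthetical above, that the folio holds no other dirty data when the write begins, and it uses the three-attempt limit floated in the email purely for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* "maybe go three times around": an illustrative limit, not a settled value */
#define MODEL_ATOMIC_COPY_ATTEMPTS 3

/* Hypothetical stand-in for a page-cache folio. */
struct model_folio {
	unsigned char data[8192];
	bool dirty;		/* only dirty folios are eligible for writeback */
};

/* Hypothetical stand-in for the user buffer behind the iov_iter. */
struct model_source {
	const unsigned char *buf;
	size_t fault_at;	/* bytes copied before the simulated page fault */
	int faults_remaining;	/* how many more copy attempts will fault */
};

/*
 * Hypothetical stand-in for copy_folio_from_iter_atomic(): returns the
 * number of bytes copied, which is short of len while faults remain.
 */
static size_t model_copy(struct model_folio *f, size_t off,
			 struct model_source *src, size_t len)
{
	size_t n = len;

	assert(off + len <= sizeof(f->data));
	if (src->faults_remaining > 0 && src->fault_at < len) {
		n = src->fault_at;	/* fault on a later source page */
		src->faults_remaining--;
	}
	memcpy(f->data + off, src->buf, n);
	return n;
}

/*
 * Hypothetical atomic branch of the buffered write loop: never commit a
 * short copy, retry a bounded number of times, then give up with -EFAULT.
 */
static int model_atomic_write(struct model_folio *f, size_t off,
			      struct model_source *src, size_t len)
{
	for (int attempt = 0; attempt < MODEL_ATOMIC_COPY_ATTEMPTS; attempt++) {
		/*
		 * Keep the folio clean while copying so writeback cannot
		 * pick up a partially copied folio. Assumes the folio held
		 * no other dirty data when the write began.
		 */
		f->dirty = false;
		if (model_copy(f, off, src, len) == len) {
			f->dirty = true;	/* whole chunk present */
			return 0;
		}
		/* Short copy: no fallback to committing the partial bytes. */
	}
	return -EFAULT;
}
```

With two simulated faults the third attempt copies the whole chunk and only then marks the folio dirty; with three faults the function returns -EFAULT and the folio is never eligible for writeback. In the kernel, the retry would also fault the source pages back in between attempts (for example with fault_in_iov_iter_readable()), which this model only mimics by decrementing faults_remaining.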