Date: Wed, 9 Jul 2025 00:57:35 -0700 (PDT)
From: Hugh Dickins <hughd@google.com>
To: Brian Foster
cc: linux-mm@kvack.org, Hugh Dickins, Baolin Wang, Matthew Wilcox,
    Usama Arif
Subject: Re: [PATCH] tmpfs: zero post-eof folio range on file extension
In-Reply-To: <20250625184930.269727-1-bfoster@redhat.com>
Message-ID: <297e44e9-1b58-d7c4-192c-9408204ab1e3@google.com>
References: <20250625184930.269727-1-bfoster@redhat.com>
Thanks for suggestions, let me come back to comment on your original.

On Wed, 25 Jun 2025, Brian Foster wrote:

Although Matthew's alternative commit was insufficient and wrong
(because use of truncation risks deleting folios already promised by
fallocate), I found his brief commit message very helpful in
understanding your patch - it makes clear what POSIX promises, is a
better justification than just passing an xfstest you wrote, and
explains why you're right to target places where i_size changes:

    POSIX requires that "If the file size is increased, the extended
    area shall appear as if it were zero-filled". It is possible to
    use mmap to write past EOF and that data will become visible
    instead of zeroes.

Please add that paragraph here, right at the head of your commit
message.

> Most traditional filesystems zero the post-eof portion of the eof
> folio at writeback time, or when the file size is extended by
> truncate or extending writes. This ensures that the previously
> post-eof range of the folio is zeroed before it is exposed to the
> file.
>
> tmpfs doesn't implement the writeback path the way a traditional
> filesystem does, so zeroing behavior won't be exactly the same.
> However, it can still perform explicit zeroing from the various
> operations that extend a file and expose a post-eof portion of the
> eof folio.
> The current lack of zeroing is observed via failure of
> fstests test generic/363 on tmpfs. This test injects post-eof mapped
> writes in certain situations to detect gaps in zeroing behavior.
>
> Add a new eof zeroing helper for file extending operations. Look up
> the current eof folio, and if one exists, zero the range about to be
> exposed. This allows generic/363 to pass on tmpfs.
>
> Signed-off-by: Brian Foster
> ---
>
> Hi all,
>
> This survives the aforementioned reproducer, an fstests regression
> run, and ~100m fsx operations without issues. Let me know if there
> are any other recommended tests for tmpfs and I'm happy to run them.
> Otherwise, a couple notes as I'm not terribly familiar with tmpfs...
>
> First, I used _get_partial_folio() because we really only want to
> zero an eof folio if one has been previously allocated. My
> understanding is that lookup path will avoid unnecessary folio
> allocation in such cases, but let me know if that's wrong.
>
> Also, it seems that the falloc path leaves newly preallocated folios
> !uptodate until they are used. This had me wondering if perhaps
> shmem_zero_eof() could just skip out if the eof folio happens to be
> !uptodate. Hm?

Yes, you were right to think that it's better to skip the !uptodates,
and Baolin made a good suggestion there (though I'll unravel it a bit).

> Thoughts, reviews, flames appreciated.

Sorry, much as you'd appreciate a flame, I cannot oblige: I think
you've done a very good job here (but it's not ready yet), and you've
done it in such a way that huge=always passes generic/363 with no
trouble.

I did keep on wanting to change this and that of your patch below; but
then later came around to seeing why your choices were better than
what I was going to suggest. I had difficulty getting deep enough into
it, but think I'm there now.
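As an aside, the POSIX rule quoted above is easy to exercise from
userspace for the plain write()/ftruncate() path - the path tmpfs
already handles correctly - whereas the mapped-write leak that
generic/363 catches is kernel-dependent, so it is not asserted here.
This is only a sketch, not part of the patch; the helper name and the
/tmp location are arbitrary choices:

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Sketch: check that the range exposed by an ftruncate() extension
 * reads back as zeroes, per POSIX. Returns 0 on success, -1 on any
 * failure or nonzero byte.
 */
static int check_zero_fill(void)
{
	char path[] = "/tmp/eofzero-XXXXXX";
	char buf[4093];
	int ret = -1;
	int fd = mkstemp(path);

	if (fd < 0)
		return -1;
	unlink(path);

	/* write a little data, then extend well past it */
	if (write(fd, "abc", 3) != 3)
		goto out;
	if (ftruncate(fd, 4096) != 0)
		goto out;
	/* the extended area "shall appear as if it were zero-filled" */
	if (pread(fd, buf, sizeof(buf), 3) != (ssize_t)sizeof(buf))
		goto out;
	ret = 0;
	for (size_t i = 0; i < sizeof(buf); i++)
		if (buf[i] != 0)
			ret = -1;
out:
	close(fd);
	return ret;
}
```

On a POSIX-compliant filesystem check_zero_fill() returns 0; the point
of the patch is to make the equivalent hold when the stale bytes were
planted through a mapping rather than write(2).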
I have also identified one missed aspect, which rather changes around
what you should do - I'd have preferred to get into that at the end,
but since it affects what shmem_zero_eof() should look like, I'd
better talk about it here first.

The problem is with huge pages (or large folios) in shmem_writeout():
what goes in as a large folio may there have to be split into small
pages; or it may be swapped out as one large folio, but fragmentation
at swapin time demands that it be split into small pages when swapped
in. So, if there has been swapout since the large folio was modified
beyond EOF, the folio that shmem_zero_eof() brings in does not
guarantee what length needs to be zeroed.

We could set that aside as a deficiency to be fixed later on: that
would not be unreasonable, but I'm guessing that won't satisfy you.

We could zero the maximum (the remainder of PMD size, I believe) in
shmem_zero_eof(): looping over small folios within the range, skipping
!uptodate ones (but we do force them uptodate when swapping out, in
order to keep the space reservation). TBH I had ignored that as a bad
option, but it doesn't seem so bad to me now: ugly, but maybe not bad.

The solution I've had in mind (and pursue in comments below) is to do
the EOF zeroing in shmem_writeout() before it splits; and then avoid
swapin in shmem_zero_eof() when i_size is raised.

That solution was partly inspired by the shmem symlink uninit-value bug
https://lore.kernel.org/linux-mm/670793eb.050a0220.8109b.0003.GAE@google.com/
which I haven't rushed to fix, but which ought to be fixed along with
this one (by "along with" I don't mean that both have to be fixed in
one single patch, but it makes sense to consider them together). I was
inclined not to zero the whole page in shmem_symlink(), but to zero
before swapout.

It worries me that an EOF page might be swapped out and in a thousand
times, but i_size set only once: I'm going for a solution which
memsets a thousand times rather than once?
But if that actually shows up as an issue in any workload, then we can
add a shmem inode flag to say whether the EOF folio has been exposed
via mmap (or symlink), so may need zeroing.

What's your preference? My comments below assume the latter solution,
but that may be wrong.

> Brian
>
>  mm/shmem.c | 36 +++++++++++++++++++++++++++++++++++-
>  1 file changed, 35 insertions(+), 1 deletion(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 3a5a65b1f41a..4bb96c24fb9e 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1077,6 +1077,29 @@ static struct folio *shmem_get_partial_folio(struct inode *inode, pgoff_t index)
>  	return folio;
>  }
>
> +/*
> + * Zero any post-EOF range of the EOF folio about to be exposed by size
> + * extension.
> + */
> +static void shmem_zero_eof(struct inode *inode, loff_t pos)

Please change that "pos" to, perhaps, "newsize": I was initially quite
confused by that argument to shmem_zero_eof(), until I grasped that
oldsize (really, the important one here) is calculated inside.

> +{
> +	struct folio *folio;
> +	loff_t i_size = i_size_read(inode);
> +	size_t from, len;
> +
> +	folio = shmem_get_partial_folio(inode, i_size >> PAGE_SHIFT);

I deceived myself into thinking shmem_get_folio(,,,,SGP_READ) was
better, but no, that doesn't handle a large folio spanning EOF in the
way needed here.

It frustrates and depresses me that there's no appropriate SGP_ for
this. It used to be the case that everything must go through
shmem_getpage(), but then large folio truncation came to need that odd
shmem_get_partial_folio() bypass. Lacking a suitable SGP_, you did
well to choose it for further use, but I now think you should go your
own way, duplicating its filemap_get_entry() and !xa_is_value block
(with a folio_test_uptodate check added in - while the folio is
locked, I suppose, though I'm not certain that's necessary), without
doing the shmem_get_folio(,,,,SGP_READ) to swapin.
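Whichever lookup is used, the skip-!uptodate idea (and the "zero the
maximum" fallback mentioned earlier) can be modelled in userspace; a
sketch with invented names (page_model, zero_to_pmd_end), not kernel
code: walk the small pages of the PMD-sized region from EOF to its
end, zeroing only pages that are uptodate, since a !uptodate page
holds no stale data to expose.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096UL
#define MODEL_PMD_PAGES 512	/* 2MiB PMD / 4KiB pages on x86-64 */

/* Invented stand-in for one small folio in the EOF region. */
struct page_model {
	bool uptodate;
	unsigned char data[MODEL_PAGE_SIZE];
};

/*
 * Zero from byte offset eof (within the PMD-sized region) to the end
 * of the region, page by page, skipping pages never made uptodate:
 * they contain no stale data, so there is nothing to zero.
 */
static void zero_to_pmd_end(struct page_model pages[MODEL_PMD_PAGES],
			    size_t eof)
{
	for (size_t i = eof / MODEL_PAGE_SIZE; i < MODEL_PMD_PAGES; i++) {
		size_t from;

		if (!pages[i].uptodate)
			continue;
		/* partial zero in the EOF page, full zero after it */
		from = (i == eof / MODEL_PAGE_SIZE) ?
			eof % MODEL_PAGE_SIZE : 0;
		memset(pages[i].data + from, 0, MODEL_PAGE_SIZE - from);
	}
}
```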
> +	if (!folio)
> +		return;
> +
> +	/* zero to the end of the folio or start of extending operation */
> +	from = offset_in_folio(folio, i_size);

If from == 0, doesn't that mean we're looking at the *next* folio, and
do not need to zero it at all? A case that would have been ruled out
by a simple alignment check before, in the old days of all-small
pages; but nowadays we cannot tell without looking at the folio
itself.

> +	len = min_t(loff_t, folio_size(folio) - from, pos - i_size);
> +	folio_zero_range(folio, from, len);
> +
> +	folio_unlock(folio);
> +	folio_put(folio);
> +}
> +
>  /*
>   * Remove range of pages and swap entries from page cache, and free them.
>   * If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate.
> @@ -1302,6 +1325,8 @@ static int shmem_setattr(struct mnt_idmap *idmap,
>  		return -EPERM;
>
>  	if (newsize != oldsize) {
> +		if (newsize > oldsize)
> +			shmem_zero_eof(inode, newsize);

Nit, and feel free to disagree, but I'd slightly prefer you to do that
a few lines later, after the error return, before the i_size_write()
and update_mtime.

>  		error = shmem_reacct_size(SHMEM_I(inode)->flags,
>  					  oldsize, newsize);
>  		if (error)
> @@ -3464,6 +3489,8 @@ static ssize_t shmem_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  	ret = file_update_time(file);
>  	if (ret)
>  		goto unlock;
> +	if (iocb->ki_pos > i_size_read(inode))
> +		shmem_zero_eof(inode, iocb->ki_pos);

Okay. I was inclined to ask you to move it to where shmem_write_end()
does its i_size_write() and zeroing to make uptodate: that might be a
more natural location. But I came to realize that there are several
reasons why your choice is easier, and any attempt to rearrange is
likely to be a waste of your time (not to mention folio deadlock).
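As a cross-check on the from/len arithmetic quoted above (including
the from == 0 case), here is a userspace model - eof_zero_len and
folio_base are invented stand-ins for the real folio fields, not
kernel interfaces:

```c
#include <stddef.h>

/*
 * Model of shmem_zero_eof()'s range arithmetic. folio_base is the
 * file offset at which the EOF folio starts, folio_size its byte
 * length. Writes the zeroing start offset within the folio to *from
 * and returns the number of bytes to zero; 0 means nothing to do.
 */
static size_t eof_zero_len(unsigned long long i_size,
			   unsigned long long newsize,
			   unsigned long long folio_base,
			   size_t folio_size, size_t *from)
{
	size_t to_end;
	unsigned long long grow;

	*from = (size_t)(i_size - folio_base);	/* offset_in_folio() */
	if (*from == 0)
		return 0;	/* folio starts at EOF: nothing stale */
	to_end = folio_size - *from;		/* to end of folio */
	grow = newsize - i_size;		/* to start of extension */
	return grow < to_end ? (size_t)grow : to_end;	/* min_t() */
}
```

With a 4KiB folio at offset 4096 and i_size 4100, extending to 4200
zeroes 100 bytes starting at offset 4; extending far past the folio
zeroes the remaining 4092; and a folio-aligned i_size gives from == 0
and length 0, matching the next-folio observation.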
>  	ret = generic_perform_write(iocb, from);
>  unlock:
>  	inode_unlock(inode);
> @@ -3791,8 +3818,15 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
>  		cond_resched();
>  	}
>
> -	if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size)
> +	/*
> +	 * The post-eof portion of the eof folio isn't guaranteed to be zeroed
> +	 * by fallocate, so zero through the end of the fallocated range
> +	 * instead of the start.
> +	 */

That comment continues to mystify me (the end instead of the start??):
but there's an easy answer, just delete it, the code is easier to
understand without it!

> +	if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size) {
> +		shmem_zero_eof(inode, offset + len);
>  		i_size_write(inode, offset + len);
> +	}
>  undone:
>  	spin_lock(&inode->i_lock);
>  	inode->i_private = NULL;
> --
> 2.49.0

I don't have a proposal for how the code in shmem_writeout() should
look, and maybe you won't agree that's where the change should be made.

Thanks,
Hugh