Re: Question on slow fallocate

public inbox for linux-xfs@vger.kernel.org

From: Eric Sandeen <esandeen@redhat.com>
To: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>,
	Dave Chinner <david@fromorbit.com>
Cc: Masahiko Sawada <sawada.mshk@gmail.com>, linux-xfs@vger.kernel.org
Subject: Re: Question on slow fallocate
Date: Fri, 23 Jun 2023 15:04:06 -0500
Message-ID: <cdc8001d-56fb-eb0d-c01b-28810997ce17@redhat.com>
In-Reply-To: <871qi24cwf.fsf@doe.com>

On 6/23/23 6:49 AM, Ritesh Harjani (IBM) wrote:
> Sorry, but I still haven't understood the real problem here for which
> XFS does filemap_write_and_wait_range(). Is it a stale data exposure
> problem?

(Hopefully I get this right while trying to be helpful here; it's been
a while.)

Not really. IIRC the original problem was that the file size could get 
updated (transactionally) before the delayed allocation and IO happened 
at writeback time, leaving a hole before EOF where buffered writes had 
failed to land before a crash. This is what people originally called the 
"NULL files problem" because reading the hole post-crash returned zeros. 
It wasn't stale data, it was no data.
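
To make the ordering concrete, here's a toy userspace model of it. This
is only a sketch and nothing like the real XFS code: struct toy_file and
all the toy_* helpers are names I made up, "disk" is just an array, and
a crash is modelled by looking only at what reached that array. The
flush_first flag stands in for "write the data out before logging the
size".

```c
#include <stddef.h>
#include <string.h>

#define TOY_CAP 64

/* Hypothetical model: what is on stable storage vs. what is only cached. */
struct toy_file {
	char   disk[TOY_CAP];	/* data that reached stable storage */
	size_t disk_size;	/* logged (on-disk) EOF */
	char   cache[TOY_CAP];	/* page cache contents */
	size_t dirty_end;	/* end of cached data not yet written back */
};

void toy_init(struct toy_file *f)
{
	memset(f, 0, sizeof(*f));
}

/* Buffered write: the data lands in the cache only. Returns 0 or -1. */
int toy_buffered_write(struct toy_file *f, size_t off,
		       const char *buf, size_t len)
{
	if (off > TOY_CAP || len > TOY_CAP - off)
		return -1;
	memcpy(f->cache + off, buf, len);
	if (off + len > f->dirty_end)
		f->dirty_end = off + len;
	return 0;
}

/* Writeback: copy the cached data out to "disk". */
void toy_writeback(struct toy_file *f)
{
	memcpy(f->disk, f->cache, f->dirty_end);
	f->dirty_end = 0;
}

/* Log a new EOF, optionally flushing pending data first. */
void toy_log_size(struct toy_file *f, size_t newsize, int flush_first)
{
	if (flush_first)
		toy_writeback(f);
	f->disk_size = newsize;
}

/* After a crash only disk[] and disk_size survive; count NULs below EOF. */
size_t toy_nulls_after_crash(const struct toy_file *f)
{
	size_t i, n = 0;

	for (i = 0; i < f->disk_size; i++)
		if (f->disk[i] == '\0')
			n++;
	return n;
}
```

Write 16 bytes through toy_buffered_write(), log a size of 16 without
flushing, and toy_nulls_after_crash() reports all 16 bytes as NULs; with
flush_first set it reports none.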

Some commits that dealt with this explain it fairly well I think:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c32676eea19ce29cb74dba0f97b085e83f6b8915

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ba87ea699ebd9dd577bf055ebc4a98200e337542

> Now, in this code here in fs/xfs/xfs_iops.c we refer to the problem as
> "expose ourselves to the null files problem".
> What is the "expose ourselves to the null files problem here"
> for which we do filemap_write_and_wait_range()?
> 
> 
> 	/*
> 	 * We are going to log the inode size change in this transaction so
> 	 * any previous writes that are beyond the on disk EOF and the new
> 	 * EOF that have not been written out need to be written here.  If we

i.e. force the writeback of any pending buffered IO into the hole
created up to the new EOF.

> 	 * do not write the data out, we expose ourselves to the null files
> 	 * problem. Note that this includes any block zeroing we did above;
> 	 * otherwise those blocks may not be zeroed after a crash.

and I suppose this relates a little to stale data; IIRC this refers to
zeroing partial blocks past the old EOF.

> 	 */
> 	if (did_zeroing ||
> 	    (newsize > ip->i_disk_size && oldsize != ip->i_disk_size)) {
> 		error = filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
> 						ip->i_disk_size, newsize - 1);
> 		if (error)
> 			return error;
> 	}
> 
> 
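
Pulled out of the kernel, the condition quoted above boils down to
something like the sketch below. setsize_flush_range() and struct
flush_range are hypothetical names of mine, not anything in XFS; the
three sizes correspond to oldsize, newsize and ip->i_disk_size in the
quoted code (IIRC oldsize is the in-memory size before the change).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper type: whether to flush, and which byte range. */
struct flush_range {
	bool    needed;
	int64_t start;	/* first byte to write back (inclusive) */
	int64_t end;	/* last byte to write back (inclusive) */
};

/*
 * Same test as the quoted if(): flush when blocks were zeroed, or when
 * the new size is past the on-disk EOF and the old in-memory size was
 * not already equal to the on-disk size, i.e. there may be data that
 * has not been written out yet.
 */
struct flush_range setsize_flush_range(bool did_zeroing, int64_t oldsize,
				       int64_t newsize, int64_t disk_size)
{
	struct flush_range r = { false, 0, 0 };

	if (did_zeroing ||
	    (newsize > disk_size && oldsize != disk_size)) {
		r.needed = true;
		r.start = disk_size;
		r.end = newsize - 1;
	}
	return r;
}
```

So an 8k buffered write that hasn't hit the disk (oldsize 8192,
disk_size 0) followed by an extension to 16384 gives a flush of bytes
0..16383, while the same extension with oldsize == disk_size and no
zeroing skips the flush entirely.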
> Talking about ext4, it handles truncates to a file using orphan
> handling, yes. In case if the truncate operation spans multiple txns and
> if the crash happens say in the middle of a txn, then the subsequent crash
> recovery will truncate the blocks spanning i_disksize.
> 
> But we aren't discussing shrinking here right. We are doing pwrite
> followed by fallocate to grow the file size. With pwrite we use delalloc
> so the blocks only get allocated during writeback time and with
> fallocate we will allocate unwritten extents, so there should be no
> stale data expose problem in this case right?

Yeah, it's not a stale data problem. I think that the extended EOF
created by fallocate is being treated exactly the same as if we had
extended it with ftruncate(). Indeed, replacing the posix_fallocate with
ftruncate to the same size in the test program results in a similarly
slow run. It is slightly faster, probably because unwritten conversion
doesn't have to happen in that case.
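
For anyone who wants to try the comparison, the pattern is roughly the
sketch below. It is not the original test program: grow_file() and
enum extend_mode are names I made up, and I'm assuming a plain
alternation of one pwrite() and one extension of the same chunk size.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical switch between the two ways of moving EOF forward. */
enum extend_mode { EXTEND_FALLOCATE, EXTEND_FTRUNCATE };

/*
 * Alternate a buffered pwrite() of 'chunk' bytes with an extension of
 * the file by another 'chunk' bytes. Returns the final file size, or
 * -1 on error.
 */
off_t grow_file(int fd, enum extend_mode mode, int iterations, size_t chunk)
{
	char *buf = malloc(chunk);
	off_t off = 0;
	int i;

	if (!buf)
		return -1;
	memset(buf, 'x', chunk);

	for (i = 0; i < iterations; i++) {
		/* Buffered write at the current offset. */
		if (pwrite(fd, buf, chunk, off) != (ssize_t)chunk)
			goto fail;
		off += chunk;

		/* Extend EOF past the data that was just written. */
		if (mode == EXTEND_FALLOCATE) {
			/* posix_fallocate() returns an errno value, not -1 */
			if (posix_fallocate(fd, off, chunk) != 0)
				goto fail;
		} else {
			if (ftruncate(fd, off + chunk) != 0)
				goto fail;
		}
		off += chunk;
	}
	free(buf);
	return lseek(fd, 0, SEEK_END);
fail:
	free(buf);
	return -1;
}
```

Open a file on the filesystem under test, call grow_file() once per
mode with a few thousand iterations and an 8k chunk (my guess at a
representative size), and time the two runs.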

> Hence my question was to mainly understand what does "expose ourselves to
> the null files problem" means in XFS?

Hopefully the above explains it; that said, I'm not sure this is 
anything more than academically interesting. As Dave mentioned, 
fallocating tiny space and then writing into it is not at all the 
recommended or efficient use of fallocate.

The one thing I'm not remembering exactly here is why we have the 
heuristic that a truncate up requires flushing all pending data behind it.

I *think* it's because most users knew enough to expect buffered writes 
could be lost on a crash, but they expected to see valid data up to the 
on-disk EOF post-crash. Without this heuristic, they'd get some valid 
data that made it out followed by a hole ("NULLS") up to the new EOF, 
and they Did Not Like It.

-Eric
