* Re: General Filesystem Question - Interesting Unexplainable Observation
[not found] <CAJBvgGfv9zsE4PEnuuVqKhiKfpbrxk=kXG4pp5AAMOXyVc5-bQ@mail.gmail.com>
@ 2022-10-31 11:22 ` Jan Kara
2022-11-02 3:07 ` Matt Bobrowski
0 siblings, 1 reply; 3+ messages in thread
From: Jan Kara @ 2022-10-31 11:22 UTC
To: Matt Bobrowski; +Cc: Jan Kara, linux-ext4
Hi Matthew!
[added ext4 mailing list to CC, maybe others have more ideas]
On Fri 28-10-22 23:23:14, Matt Bobrowski wrote:
> Just had a general question in regards to some recent filesystem (ext4)
> behaviour I've recently observed, which kind of made my eyebrows raise a
> little and I wanted to understand why this was happening.
>
> We have an application (single threaded process) that basically performs
> the following sequence of filesystem operations using buffered I/O:
>
> ---
> fd = open("dir/tmp/filename.new", O_WRONLY | O_CREAT | O_TRUNC, 0400);
> ...
> write(fd, buf, sizeof(buf));
> ...
> rename("dir/tmp/filename.new", "dir/new/filename");
> ---
>
> At times, I see the "dir/new/filename" file size reporting 0 bytes, despite
> sizeof(buf) written to "dir/tmp/filename.new" always guaranteed to be > 0
> and the result of the write reported as being successful. This is the part
> I cannot come up with a valid explanation for (yet).
So by "file size reporting 0 bytes" do you mean that
stat("dir/new/filename") from a concurrent process returns file size 0
sometimes? Or do you refer to a situation after an unclean filesystem
shutdown?
> Understandably,
> there's no fsync being currently performed post calling write, which I
> think needs to be corrected, but I also can't see how not using fsync post
> write would result in the file size for "dir/new/filename" being reported
> as 0 bytes? One of the things that crossed my mind was that the rename
> operation was possibly being committed prior to the dirty pages from the
> pagecache being flushed, but regardless I don't see how a rename would
> result in the data blocks associated to the write not ever being committed
> for the same underlying inode?
>
> What are your thoughts? Any plausible explanation why I might be seeing
> this odd behaviour?
Ext4 uses delayed allocation. That means that write(2) just stores data in
the page cache but no blocks are allocated yet. So indeed rename(2) can be
fully committed in the journal before any of the data gets to persistent
storage. That being said ext4 has a workaround for buggy applications (can
be disabled with "noauto_da_alloc" mount option) that starts data writeback
before rename is done so at least in data=ordered mode you should not see 0
length files after a crash with the above scheme.
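
For anyone wanting to check whether that workaround has been switched off on
a given mount, here is a minimal C sketch. It assumes the options field of
the relevant /proc/mounts line has already been extracted into a string, and
that a non-default "noauto_da_alloc" shows up in that field. The helper names
has_mount_option() and rename_writeback_workaround_enabled() are made up for
illustration; they are not part of any ext4 or libc interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Return true if the comma-separated option list `opts` contains a token
 * that is exactly `name` (so "auto_da_alloc" does not match
 * "noauto_da_alloc" and vice versa).
 */
bool has_mount_option(const char *opts, const char *name)
{
    assert(opts != NULL && name != NULL);

    size_t name_len = strlen(name);
    if (name_len == 0)
        return false;

    const char *p = opts;
    while (*p != '\0') {
        const char *comma = strchr(p, ',');
        size_t tok_len = comma ? (size_t)(comma - p) : strlen(p);

        if (tok_len == name_len && strncmp(p, name, name_len) == 0)
            return true;
        if (comma == NULL)
            break;
        p = comma + 1;          /* step past the comma to the next token */
    }
    return false;
}

/*
 * Hypothetical helper: the writeback-before-rename workaround is treated
 * as enabled unless the mount options explicitly carry "noauto_da_alloc".
 */
bool rename_writeback_workaround_enabled(const char *opts)
{
    return !has_mount_option(opts, "noauto_da_alloc");
}
```

Even when this reports the workaround as enabled, it only covers the
data=ordered case described above, so it is not a substitute for an explicit
fsync(2) in the application.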
WRT concurrent process seeing 0 file length, I would not have a great
explanation for that because once data is written to the inode,
inode->i_size is set to the final inode size which is what stat(2) reports.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: General Filesystem Question - Interesting Unexplainable Observation
From: Matt Bobrowski @ 2022-11-02 3:07 UTC
To: Jan Kara; +Cc: linux-ext4
Hey Jan,
Thanks for getting back to me.
On Mon, Oct 31, 2022 at 12:22:37PM +0100, Jan Kara wrote:
> Hi Matthew!
>
> [added ext4 mailing list to CC, maybe others have more ideas]
>
> On Fri 28-10-22 23:23:14, Matt Bobrowski wrote:
> > Just had a general question in regards to some recent filesystem (ext4)
> > behaviour I've recently observed, which kind of made my eyebrows raise a
> > little and I wanted to understand why this was happening.
> >
> > We have an application (single threaded process) that basically performs
> > the following sequence of filesystem operations using buffered I/O:
> >
> > ---
> > fd = open("dir/tmp/filename.new", O_WRONLY | O_CREAT | O_TRUNC, 0400);
> > ...
> > write(fd, buf, sizeof(buf));
> > ...
> > rename("dir/tmp/filename.new", "dir/new/filename");
> > ---
> >
> > At times, I see the "dir/new/filename" file size reporting 0 bytes, despite
> > sizeof(buf) written to "dir/tmp/filename.new" always guaranteed to be > 0
> > and the result of the write reported as being successful. This is the part
> > I cannot come up with a valid explanation for (yet).
>
> So by "file size reporting 0 bytes" do you mean that
> stat("dir/new/filename") from a concurrent process returns file size 0
> sometimes?
Not quite. I mean that stat("dir/new/filename"), run manually from a
shell, reports 0 bytes long after the write(2) operation occurred. IOW,
I'm seeing 0 byte files lying around when they well and truly should
have had bytes written out to them (before a write(2) is issued we
check to make sure that the supplied buffer actually has something in
it).
> Or do you refer to a situation after an unclean filesystem
> shutdown?
It could very well be from an unclean shutdown, but it's really hard
to say whether this is the culprit or not.
> > Understandably,
> > there's no fsync being currently performed post calling write, which I
> > think needs to be corrected, but I also can't see how not using fsync post
> > write would result in the file size for "dir/new/filename" being reported
> > as 0 bytes? One of the things that crossed my mind was that the rename
> > operation was possibly being committed prior to the dirty pages from the
> > pagecache being flushed, but regardless I don't see how a rename would
> > result in the data blocks associated to the write not ever being committed
> > for the same underlying inode?
> >
> > What are your thoughts? Any plausible explanation why I might be seeing
> > this odd behaviour?
>
> Ext4 uses delayed allocation. That means that write(2) just stores data in
> the page cache but no blocks are allocated yet. So indeed rename(2) can be
> fully committed in the journal before any of the data gets to persistent
> storage. That being said ext4 has a workaround for buggy applications (can
> be disabled with "noauto_da_alloc" mount option) that starts data writeback
> before rename is done so at least in data=ordered mode you should not see 0
> length files after a crash with the above scheme.
Right, we are using buffered I/O after all... However, even if the
rename(2) operation took place and was fully committed to the journal
before the dirty pages associated to the prior write(2) had been
written back, I wouldn't expect the data to be missing? IOW, the
write(2) and rename(2) operations are taking effect on the same
backing inode, no?
/M
* Re: General Filesystem Question - Interesting Unexplainable Observation
From: Jan Kara @ 2022-11-02 14:22 UTC
To: Matt Bobrowski; +Cc: Jan Kara, linux-ext4
Hello Matt!
On Wed 02-11-22 03:07:55, Matt Bobrowski wrote:
> On Mon, Oct 31, 2022 at 12:22:37PM +0100, Jan Kara wrote:
> > Hi Matthew!
> >
> > [added ext4 mailing list to CC, maybe others have more ideas]
> >
> > On Fri 28-10-22 23:23:14, Matt Bobrowski wrote:
> > > Just had a general question in regards to some recent filesystem (ext4)
> > > behaviour I've recently observed, which kind of made my eyebrows raise a
> > > little and I wanted to understand why this was happening.
> > >
> > > We have an application (single threaded process) that basically performs
> > > the following sequence of filesystem operations using buffered I/O:
> > >
> > > ---
> > > fd = open("dir/tmp/filename.new", O_WRONLY | O_CREAT | O_TRUNC, 0400);
> > > ...
> > > write(fd, buf, sizeof(buf));
> > > ...
> > > rename("dir/tmp/filename.new", "dir/new/filename");
> > > ---
> > >
> > > At times, I see the "dir/new/filename" file size reporting 0 bytes, despite
> > > sizeof(buf) written to "dir/tmp/filename.new" always guaranteed to be > 0
> > > and the result of the write reported as being successful. This is the part
> > > I cannot come up with a valid explanation for (yet).
> >
> > So by "file size reporting 0 bytes" do you mean that
> > stat("dir/new/filename") from a concurrent process returns file size 0
> > sometimes?
>
> Not quite. I mean that stat("dir/new/filename"), run manually from a
> shell, reports 0 bytes long after the write(2) operation occurred. IOW,
> I'm seeing 0 byte files lying around when they well and truly should
> have had bytes written out to them (before a write(2) is issued we
> check to make sure that the supplied buffer actually has something in
> it).
I see. So the inode got written to the disk with size 0.
> > Or do you refer to a situation after an unclean filesystem
> > shutdown?
>
> It could very well be from an unclean shutdown, but it's really hard
> to say whether this is the culprit or not.
I see, ok.
> > > Understandably,
> > > there's no fsync being currently performed post calling write, which I
> > > think needs to be corrected, but I also can't see how not using fsync post
> > > write would result in the file size for "dir/new/filename" being reported
> > > as 0 bytes? One of the things that crossed my mind was that the rename
> > > operation was possibly being committed prior to the dirty pages from the
> > > pagecache being flushed, but regardless I don't see how a rename would
> > > result in the data blocks associated to the write not ever being committed
> > > for the same underlying inode?
> > >
> > > What are your thoughts? Any plausible explanation why I might be seeing
> > > this odd behaviour?
> >
> > Ext4 uses delayed allocation. That means that write(2) just stores data in
> > the page cache but no blocks are allocated yet. So indeed rename(2) can be
> > fully committed in the journal before any of the data gets to persistent
> > storage. That being said ext4 has a workaround for buggy applications (can
> > be disabled with "noauto_da_alloc" mount option) that starts data writeback
> > before rename is done so at least in data=ordered mode you should not see 0
> > length files after a crash with the above scheme.
>
> Right, we are using buffered I/O after all... However, even if the
> rename(2) operation took place and was fully committed to the journal
> before the dirty pages associated to the prior write(2) had been
> written back, I wouldn't expect the data to be missing? IOW, the
> write(2) and rename(2) operations are taking effect on the same
> backing inode, no?
No. Inode size changes as well as block allocation changes get added to
the journal only once writeback happens. So until writeback starts,
rename(2) and write(2) can be arbitrarily reordered (or you can even see
only part of the write being completed).
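
The usual way to rule out that reordering is the fsync(2) Matt already
suspected was missing: force the data and the inode size out before the
rename is issued. What follows is a minimal sketch of that pattern in C, not
code from the application discussed in this thread. The function names
write_all() and replace_file_durably() are hypothetical, and the open flags
and mode simply mirror the snippet quoted above.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Write all `len` bytes, retrying on short writes and EINTR. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/*
 * Hypothetical helper: write `buf` to `tmp_path`, fsync it, and only then
 * rename it over `final_path`. Returns 0 on success, -1 with errno set on
 * failure (the temporary file is removed on a failed write/fsync/close).
 */
int replace_file_durably(const char *tmp_path, const char *final_path,
                         const void *buf, size_t len)
{
    assert(tmp_path != NULL && final_path != NULL && buf != NULL);

    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0400);
    if (fd < 0)
        return -1;

    /* fsync() makes the data and i_size durable before the rename. */
    if (write_all(fd, buf, len) < 0 || fsync(fd) < 0) {
        int saved = errno;
        close(fd);
        unlink(tmp_path);
        errno = saved;
        return -1;
    }
    if (close(fd) < 0) {
        int saved = errno;
        unlink(tmp_path);
        errno = saved;
        return -1;
    }

    /* The new name can now never point at an inode with missing data. */
    return rename(tmp_path, final_path);
}
```

A call such as replace_file_durably("dir/tmp/filename.new",
"dir/new/filename", buf, sizeof(buf)) would replace the three-step sequence
quoted above. Note that this sketch does not fsync the parent directory, so
after a crash the rename itself may still be lost; what it prevents is the
new name appearing with a 0 length file behind it.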
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR