* Some O_DIRECT vs. block_truncate_page problems
From: Eric Sandeen @ 2002-09-23 18:09 UTC
  To: linux-fsdevel

I'm trying to track down a bug I'm seeing after I run xfs's
defragmenter, where data is present past EOF in the last block.

xfs_fsr does this (essentially):

o allocate some contiguous space with an xfs-specific ioctl
o copy the fragmented file into the new file with O_DIRECT
o truncate() the new file back to the correct size (since it's currently
  a multiple of the block size)
o sync the new file
o do "xfs_inval_cached_pages", which is essentially:
    - filemap_fdatasync(ip->i_mapping);
    - fsync_inode_data_buffers(ip);
    - filemap_fdatawait(ip->i_mapping);
    - truncate_inode_pages(ip->i_mapping, first);
o swap the extents between the old & the new files
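
For concreteness, a rough user-space sketch of the copy/trim steps (this
is illustrative only, not the actual xfs_fsr source; the XFS allocation
ioctl and extent swap are elided, and the 4k block size is an assumption):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define BLK 4096

/* copy src into dst in whole blocks with O_DIRECT, then trim dst */
static int copy_and_trim(const char *src, const char *dst, off_t realsize)
{
	int in = open(src, O_RDONLY | O_DIRECT);
	int out = open(dst, O_WRONLY | O_DIRECT);
	char *buf;

	if (in < 0 || out < 0)
		return -1;

	/* O_DIRECT wants block-aligned buffers and block-multiple sizes */
	if (posix_memalign((void **)&buf, BLK, BLK))
		return -1;

	/* the copy writes full blocks, so dst ends up block-aligned ... */
	while (read(in, buf, BLK) > 0)
		write(out, buf, BLK);

	ftruncate(out, realsize);	/* ... and is trimmed back here */
	fsync(out);			/* sync before invalidate/swap */

	free(buf);
	close(in);
	close(out);
	return 0;
}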

When this is all done, I see data past EOF in the resulting file.

I believe this is because the buffer dirtied by the truncate() call is
never written to disk, and gets thrown away when we do
truncate_inode_pages().

block_truncate_page() does a __mark_buffer_dirty(bh) at the end, but it
does not file the buffer on the inode's dirty data queue, so
fsync_inode_data_buffers() does not see it.  The inode only has the page
on its clean pages list, so filemap_fdatasync() does not see it either.

Since generic_file_direct_IO expects the filemap_fdatasync,
fsync_inode_data_buffers, filemap_fdatawait sequence to flush data to
disk, I think the behavior after a truncate() will be broken for any
filesystem.

This patch seems to fix things up for me; is there a reason that the
buffer was not inserted originally?

--- linux/fs/buffer.c_1.109	Mon Sep 23 13:10:56 2002
+++ linux/fs/buffer.c	Mon Sep 23 13:04:44 2002
@@ -2028,7 +2028,12 @@
 	flush_dcache_page(page);
 	kunmap(page);
 
-	__mark_buffer_dirty(bh);
+	if (!atomic_set_buffer_dirty(bh)) {
+		__mark_dirty(bh);
+		buffer_insert_inode_data_queue(bh, inode);
+		balance_dirty();
+	}
+
 	err = 0;
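
For readers unfamiliar with the 2.4 buffer-cache interfaces, here is the
replacement hunk again with comments (an annotated restatement of the
patch above, not a further change):

	if (!atomic_set_buffer_dirty(bh)) {	/* test-and-set the dirty bit */
		__mark_dirty(bh);		/* refile bh onto the dirty LRU */
		/* file it on the inode's dirty data list, so that
		 * fsync_inode_data_buffers() will actually find it */
		buffer_insert_inode_data_queue(bh, inode);
		balance_dirty();	/* throttle if too much is dirty */
	}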


On a related note, while testing some O_DIRECT cases on ext2, I found
another, possibly related, bug.

If you O_DIRECT write 2 blocks into a new file, truncate() the file to
1.5 blocks, then do an O_DIRECT read of the last block, you get
incorrect data.  Rather than 1/2 data, 1/2 zeroes, I get all zeroes.
This test passes on xfs.
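
A sketch of that reproduction as a standalone test (the file name, 4k
block size, and fill pattern are made-up illustration):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLK 4096

int main(void)
{
	char *buf;
	ssize_t n;
	int fd;

	if (posix_memalign((void **)&buf, BLK, 2 * BLK))
		return 1;
	memset(buf, 0xaa, 2 * BLK);

	/* O_DIRECT write of two full blocks */
	fd = open("testfile", O_CREAT | O_TRUNC | O_WRONLY | O_DIRECT, 0644);
	write(fd, buf, 2 * BLK);
	close(fd);

	truncate("testfile", BLK + BLK / 2);	/* trim to 1.5 blocks */

	/* O_DIRECT read of the last block: expect half 0xaa, half zeroes */
	fd = open("testfile", O_RDONLY | O_DIRECT);
	n = pread(fd, buf, BLK, BLK);
	printf("read %zd bytes, first byte 0x%02x\n", n, (unsigned char)buf[0]);
	close(fd);
	return 0;
}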

Comments?

-Eric


-- 
Eric Sandeen      XFS for Linux     http://oss.sgi.com/projects/xfs
sandeen@sgi.com   SGI, Inc.         651-683-3102



* Re: Some O_DIRECT vs. block_truncate_page problems
From: Andrew Morton @ 2002-09-24  5:39 UTC
  To: Eric Sandeen; +Cc: linux-fsdevel

Eric Sandeen wrote:
> 
> I'm trying to track down a bug I'm seeing after I run xfs's
> defragmenter, where data is present past EOF in the last block.
> 
> xfs_fsr does this (essentially):
> 
> o allocate some contiguous space with an xfs-specific ioctl
> o copy the fragmented file into the new file with O_DIRECT
> o truncate() the new file back to the correct size (since it's currently
>   a multiple of the block size)
> o sync the new file
> o do "xfs_inval_cached_pages", which is essentially:
>     - filemap_fdatasync(ip->i_mapping);
>     - fsync_inode_data_buffers(ip);
>     - filemap_fdatawait(ip->i_mapping);
>     - truncate_inode_pages(ip->i_mapping, first);

What is "first"?  It should be i_size.  If it's zero then
yes, you lost the dirty pagecache and dirty buffer.

> ...
> 
> --- linux/fs/buffer.c_1.109     Mon Sep 23 13:10:56 2002
> +++ linux/fs/buffer.c   Mon Sep 23 13:04:44 2002
> @@ -2028,7 +2028,12 @@
>         flush_dcache_page(page);
>         kunmap(page);
> 
> -       __mark_buffer_dirty(bh);
> +       if (!atomic_set_buffer_dirty(bh)) {
> +               __mark_dirty(bh);
> +               buffer_insert_inode_data_queue(bh, inode);
> +               balance_dirty();
> +       }
> +
>         err = 0;

Makes sense, I guess.  I'd be inclined to leave the balance_dirty()
out of there, just because it adds a tiny bit more risk.

> On a related note, while testing some O_DIRECT cases on ext2, I found
> another, possibly related, bug.
> 
> If you O_DIRECT write 2 blocks into a new file, truncate() the file to
> 1.5 blocks, then do an O_DIRECT read of the last block, you get
> incorrect data.  Rather than 1/2 data, 1/2 zeroes, I get all zeroes.

Does this still happen with your patch?  I'd expect that with
your patch, you would see all data, not all zeroes.

What blocksize and pagesize are you using, btw?


* Re: Some O_DIRECT vs. block_truncate_page problems
From: Eric Sandeen @ 2002-09-24 12:48 UTC
  To: Andrew Morton; +Cc: linux-fsdevel

On Mon, 23 Sep 2002, Andrew Morton wrote:

> Eric Sandeen wrote:
> > o do "xfs_inval_cached_pages", which is essentially:
> >     - filemap_fdatasync(ip->i_mapping);
> >     - fsync_inode_data_buffers(ip);
> >     - filemap_fdatawait(ip->i_mapping);
> >     - truncate_inode_pages(ip->i_mapping, first);
> 
> What is "first"?  It should be i_size.  If it's zero then
> yes, you lost the dirty pagecache and dirty buffer.

It's zero, but if we just synced the inode as above, there should be no dirty
data left, right?

The idea is that since xfs is about to swap the extents out from under
the inode, there had better be no old buffers floating around destined for
the old extents.
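
That is, the helper boils down to something like this (a sketch against
the 2.4 interfaces listed earlier, not the literal XFS source):

/* sync everything out, then drop all cached pages for the inode */
static void inval_cached_pages(struct inode *ip)
{
	filemap_fdatasync(ip->i_mapping);	/* start writeback of dirty pages */
	fsync_inode_data_buffers(ip);		/* write out the inode's data buffers */
	filemap_fdatawait(ip->i_mapping);	/* wait for the I/O to complete */
	truncate_inode_pages(ip->i_mapping, 0);	/* first == 0: drop every page */
}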

<patch snipped>

> Makes sense I guess.  I'd be inclined to leave the balance_dirty()
> out of there, just because it does add a tiny little more risk.

I wondered about that.

> > If you O_DIRECT write 2 blocks into a new file, truncate() the file to
> > 1.5 blocks, then do an O_DIRECT read of the last block, you get
> > incorrect data.  Rather than 1/2 data, 1/2 zeroes, I get all zeroes.
> 
> Does this still happen with your patch?  I'd expect that with
> your patch, you would see all data, not all zeroes.

Yes, it does still happen.  I expected my patch to fix this, but it seems 
to have no effect.

> What blocksize and pagesize are you using, btw?

4k pages, and I have tested with 1k and 4k block sizes.

For both xfs and ext2, the O_DIRECT read() returns 0 (since we hit EOF?),
but in the xfs case we see the data in the proper place.  It behaves this
way on Irix as well.  Apparently on FreeBSD the same test also returns
the data as expected, and the read() returns the number of bytes read.
I haven't tested on Solaris yet.

Is this sort of behavior for O_DIRECT defined anywhere, I wonder...

-Eric

