public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed
* inode data not getting included in commits?
@ 2008-12-19  0:22 Sage Weil
  2008-12-19  1:26 ` Yan Zheng
  0 siblings, 1 reply; 8+ messages in thread
From: Sage Weil @ 2008-12-19  0:22 UTC (permalink / raw)
  To: linux-btrfs

Hi Chris-

I noticed some data and metadata getting out of sync on disk, despite 
wrapping my writes with btrfs transactions.  After digging into it a bit, 
it appears to be a larger problem with inode size/data getting written 
during a regular commit.

I have a test program append a few bytes at a time to a few different 
files, in a loop.  I let it run until I see a btrfs transaction commit 
(via a printk at the bottom of btrfs_commit_transaction).  Then 'reboot -f 
-n'.  After remounting, all files exist but are 0 bytes, and debug-tree 
shows a bunch of empty files.  I would expect to see either the sizes when 
the commit happend (a few hundred KB in my case), or no files at all; 
there was actually no point in time when any of the files were 0 bytes.

Similarly, if I do the same but wait for a few commits to happen, after 
remount the file sizes reflect the size from around the next-to-last 
commit, not the last commit.

This is probably more information than you need, but my original test was 
a bit more complicated, with weirder results.  Append to each file, then 
write it's size to an xattr on another file.  Wrap both operations in a 
transaction.  Start it up, run 'sync', then reboot -f -n.  When I remount 
the size and xattr are out of sync by exactly one iteration: the xattr 
reflects the size that resulted from _two_ writes back, not the 
immediately preceeding write.  If anything I would expect to see a larger 
actual size than xattr value (for example if the start/end transaction 
ioctls weren't working)...

sage



#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

int main(int argc, char **argv)
{
	while (1) {
		int r, fd, pos, i = rand() % 10;
		char a[20];
		
		sprintf(a, "%d.log", i);
		fd = open(a, O_CREAT|O_APPEND|O_WRONLY, 0600);
		r = write(fd, "foobarfoo\n", 10);
		pos = lseek(fd, 0, SEEK_CUR);
		printf("write %s = %d, size = %d\n", a, r, pos);
		close(fd);
	}
}


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: inode data not getting included in commits?
  2008-12-19  0:22 inode data not getting included in commits? Sage Weil
@ 2008-12-19  1:26 ` Yan Zheng
  2008-12-19  5:21   ` Sage Weil
  0 siblings, 1 reply; 8+ messages in thread
From: Yan Zheng @ 2008-12-19  1:26 UTC (permalink / raw)
  To: Sage Weil; +Cc: linux-btrfs

2008/12/19 Sage Weil <sage@newdream.net>:
> Hi Chris-
>
> I noticed some data and metadata getting out of sync on disk, despite
> wrapping my writes with btrfs transactions.  After digging into it a bit,
> it appears to be a larger problem with inode size/data getting written
> during a regular commit.
>
> I have a test program append a few bytes at a time to a few different
> files, in a loop.  I let it run until I see a btrfs transaction commit
> (via a printk at the bottom of btrfs_commit_transaction).  Then 'reboot -f
> -n'.  After remounting, all files exist but are 0 bytes, and debug-tree
> shows a bunch of empty files.  I would expect to see either the sizes when
> the commit happend (a few hundred KB in my case), or no files at all;
> there was actually no point in time when any of the files were 0 bytes.
>
> Similarly, if I do the same but wait for a few commits to happen, after
> remount the file sizes reflect the size from around the next-to-last
> commit, not the last commit.
>
> This is probably more information than you need, but my original test was
> a bit more complicated, with weirder results.  Append to each file, then
> write it's size to an xattr on another file.  Wrap both operations in a
> transaction.  Start it up, run 'sync', then reboot -f -n.  When I remount
> the size and xattr are out of sync by exactly one iteration: the xattr
> reflects the size that resulted from _two_ writes back, not the
> immediately preceeding write.  If anything I would expect to see a larger
> actual size than xattr value (for example if the start/end transaction
> ioctls weren't working)...
>
> sage
>
>
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <stdlib.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
>
> int main(int argc, char **argv)
> {
>        while (1) {
>                int r, fd, pos, i = rand() % 10;
>                char a[20];
>
>                sprintf(a, "%d.log", i);
>                fd = open(a, O_CREAT|O_APPEND|O_WRONLY, 0600);
>                r = write(fd, "foobarfoo\n", 10);
>                pos = lseek(fd, 0, SEEK_CUR);
>                printf("write %s = %d, size = %d\n", a, r, pos);
>                close(fd);
>        }
> }
>

This is the desired behaviour of data=ordered. Btrfs transaction commit
don't flush data, and metadata wont get updated until data IO complete.

http://article.gmane.org/gmane.comp.file-systems.btrfs/869/match=new+data+ordered+code

Regards
Yan Zheng

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: inode data not getting included in commits?
  2008-12-19  1:26 ` Yan Zheng
@ 2008-12-19  5:21   ` Sage Weil
  2008-12-19 14:12     ` Chris Mason
  0 siblings, 1 reply; 8+ messages in thread
From: Sage Weil @ 2008-12-19  5:21 UTC (permalink / raw)
  To: Yan Zheng; +Cc: linux-btrfs

On Fri, 19 Dec 2008, Yan Zheng wrote:
> > I noticed some data and metadata getting out of sync on disk, despite
> > wrapping my writes with btrfs transactions.  After digging into it a bit,
> > it appears to be a larger problem with inode size/data getting written
> > during a regular commit.
> > [...]
> 
> This is the desired behaviour of data=ordered. Btrfs transaction commit
> don't flush data, and metadata wont get updated until data IO complete.
> 
> http://article.gmane.org/gmane.comp.file-systems.btrfs/869/match=new+data+ordered+code

Ah, right, so it is.

I think what I'm looking for then is a mount mode to get the old behavior, 
such that each commit flushes previously written data.  Probably a call to 
btrfs_wait_ordered_extents() in btrfs_commit_transaction(), or something 
along those lines...

sage

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: inode data not getting included in commits?
  2008-12-19  5:21   ` Sage Weil
@ 2008-12-19 14:12     ` Chris Mason
  2008-12-19 18:48       ` Sage Weil
  0 siblings, 1 reply; 8+ messages in thread
From: Chris Mason @ 2008-12-19 14:12 UTC (permalink / raw)
  To: Sage Weil; +Cc: Yan Zheng, linux-btrfs

On Thu, 2008-12-18 at 21:21 -0800, Sage Weil wrote:
> On Fri, 19 Dec 2008, Yan Zheng wrote:
> > > I noticed some data and metadata getting out of sync on disk, despite
> > > wrapping my writes with btrfs transactions.  After digging into it a bit,
> > > it appears to be a larger problem with inode size/data getting written
> > > during a regular commit.
> > > [...]
> > 
> > This is the desired behaviour of data=ordered. Btrfs transaction commit
> > don't flush data, and metadata wont get updated until data IO complete.
> > 
> > http://article.gmane.org/gmane.comp.file-systems.btrfs/869/match=new+data+ordered+code
> 
> Ah, right, so it is.
> 
> I think what I'm looking for then is a mount mode to get the old behavior, 
> such that each commit flushes previously written data.  Probably a call to 
> btrfs_wait_ordered_extents() in btrfs_commit_transaction(), or something 
> along those lines...

Could you describe the end goal a bit?  I'm happy to make modes where
it'll do what you need.

-chris



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: inode data not getting included in commits?
  2008-12-19 14:12     ` Chris Mason
@ 2008-12-19 18:48       ` Sage Weil
  2008-12-19 19:07         ` Chris Mason
  0 siblings, 1 reply; 8+ messages in thread
From: Sage Weil @ 2008-12-19 18:48 UTC (permalink / raw)
  To: Chris Mason; +Cc: Yan Zheng, linux-btrfs

On Fri, 19 Dec 2008, Chris Mason wrote:
> On Thu, 2008-12-18 at 21:21 -0800, Sage Weil wrote:
> > On Fri, 19 Dec 2008, Yan Zheng wrote:
> > > > I noticed some data and metadata getting out of sync on disk, despite
> > > > wrapping my writes with btrfs transactions.  After digging into it a bit,
> > > > it appears to be a larger problem with inode size/data getting written
> > > > during a regular commit.
> > > > [...]
> > > 
> > > This is the desired behaviour of data=ordered. Btrfs transaction commit
> > > don't flush data, and metadata wont get updated until data IO complete.
> > > 
> > > http://article.gmane.org/gmane.comp.file-systems.btrfs/869/match=new+data+ordered+code
> > 
> > Ah, right, so it is.
> > 
> > I think what I'm looking for then is a mount mode to get the old behavior, 
> > such that each commit flushes previously written data.  Probably a call to 
> > btrfs_wait_ordered_extents() in btrfs_commit_transaction(), or something 
> > along those lines...
> 
> Could you describe the end goal a bit?  I'm happy to make modes where
> it'll do what you need.

The end goal is for data to flush and commit with the transaction that was 
running when the write() occured.

So, after a sequence like
 write A
 setxattr B
 <crash>
you should always see A if you see B.

And after a sequence like
 ioctl(fd, BTRFS_IOC_TRANS_START)
 write A
 setxattr B
 close(fd)
 <crash>
you should see either both A and B or neither A nor B.

fsync() isn't really appropriate since it forces a commit (or a tree log 
entry?), and it would still be better to roll lots of operations up 
together.  Either a mount mode that includes dirty data in each 
transaction commit (and probably disables the tree log?), or a per-file 
fsync-like operation that commits an individual file's dirty data to the 
running transaction would do the trick.

sage

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: inode data not getting included in commits?
  2008-12-19 18:48       ` Sage Weil
@ 2008-12-19 19:07         ` Chris Mason
  2008-12-19 20:08           ` Sage Weil
  0 siblings, 1 reply; 8+ messages in thread
From: Chris Mason @ 2008-12-19 19:07 UTC (permalink / raw)
  To: Sage Weil; +Cc: Yan Zheng, linux-btrfs

On Fri, 2008-12-19 at 10:48 -0800, Sage Weil wrote:
> On Fri, 19 Dec 2008, Chris Mason wrote:
> > On Thu, 2008-12-18 at 21:21 -0800, Sage Weil wrote:
> > > On Fri, 19 Dec 2008, Yan Zheng wrote:
> > > > > I noticed some data and metadata getting out of sync on disk, despite
> > > > > wrapping my writes with btrfs transactions.  After digging into it a bit,
> > > > > it appears to be a larger problem with inode size/data getting written
> > > > > during a regular commit.
> > > > > [...]
> > > > 
> > > > This is the desired behaviour of data=ordered. Btrfs transaction commit
> > > > don't flush data, and metadata wont get updated until data IO complete.
> > > > 
> > > > http://article.gmane.org/gmane.comp.file-systems.btrfs/869/match=new+data+ordered+code
> > > 
> > > Ah, right, so it is.
> > > 
> > > I think what I'm looking for then is a mount mode to get the old behavior, 
> > > such that each commit flushes previously written data.  Probably a call to 
> > > btrfs_wait_ordered_extents() in btrfs_commit_transaction(), or something 
> > > along those lines...
> > 
> > Could you describe the end goal a bit?  I'm happy to make modes where
> > it'll do what you need.
> 
> The end goal is for data to flush and commit with the transaction that was 
> running when the write() occured.
> 
> So, after a sequence like
>  write A
>  setxattr B
>  <crash>
> you should always see A if you see B.
> 
> And after a sequence like
>  ioctl(fd, BTRFS_IOC_TRANS_START)
>  write A
>  setxattr B
>  close(fd)
>  <crash>
> you should see either both A and B or neither A nor B.
> 
> fsync() isn't really appropriate since it forces a commit (or a tree log 
> entry?), and it would still be better to roll lots of operations up 
> together.  Either a mount mode that includes dirty data in each 
> transaction commit (and probably disables the tree log?), or a per-file 
> fsync-like operation that commits an individual file's dirty data to the 
> running transaction would do the trick.

A third option is a different type of xattr operation that doesn't go to
disk until the metadata updates done at IO end time.

>From a performance point of view, it'll be much faster than slowing down
commit with data writes.

Can that work for you?

-chris



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: inode data not getting included in commits?
  2008-12-19 19:07         ` Chris Mason
@ 2008-12-19 20:08           ` Sage Weil
  2008-12-20  0:11             ` Chris Mason
  0 siblings, 1 reply; 8+ messages in thread
From: Sage Weil @ 2008-12-19 20:08 UTC (permalink / raw)
  To: Chris Mason; +Cc: Yan Zheng, linux-btrfs

On Fri, 19 Dec 2008, Chris Mason wrote:
> On Fri, 2008-12-19 at 10:48 -0800, Sage Weil wrote:
> > On Fri, 19 Dec 2008, Chris Mason wrote:
> > > On Thu, 2008-12-18 at 21:21 -0800, Sage Weil wrote:
> > > > On Fri, 19 Dec 2008, Yan Zheng wrote:
> > > > > > I noticed some data and metadata getting out of sync on disk, despite
> > > > > > wrapping my writes with btrfs transactions.  After digging into it a bit,
> > > > > > it appears to be a larger problem with inode size/data getting written
> > > > > > during a regular commit.
> > > > > > [...]
> > > > > 
> > > > > This is the desired behaviour of data=ordered. Btrfs transaction commit
> > > > > don't flush data, and metadata wont get updated until data IO complete.
> > > > > 
> > > > > http://article.gmane.org/gmane.comp.file-systems.btrfs/869/match=new+data+ordered+code
> > > > 
> > > > Ah, right, so it is.
> > > > 
> > > > I think what I'm looking for then is a mount mode to get the old behavior, 
> > > > such that each commit flushes previously written data.  Probably a call to 
> > > > btrfs_wait_ordered_extents() in btrfs_commit_transaction(), or something 
> > > > along those lines...
> > > 
> > > Could you describe the end goal a bit?  I'm happy to make modes where
> > > it'll do what you need.
> > 
> > The end goal is for data to flush and commit with the transaction that was 
> > running when the write() occured.
> > 
> > So, after a sequence like
> >  write A
> >  setxattr B
> >  <crash>
> > you should always see A if you see B.
> > 
> > And after a sequence like
> >  ioctl(fd, BTRFS_IOC_TRANS_START)
> >  write A
> >  setxattr B
> >  close(fd)
> >  <crash>
> > you should see either both A and B or neither A nor B.
> > 
> > fsync() isn't really appropriate since it forces a commit (or a tree log 
> > entry?), and it would still be better to roll lots of operations up 
> > together.  Either a mount mode that includes dirty data in each 
> > transaction commit (and probably disables the tree log?), or a per-file 
> > fsync-like operation that commits an individual file's dirty data to the 
> > running transaction would do the trick.
> 
> A third option is a different type of xattr operation that doesn't go to
> disk until the metadata updates done at IO end time.
> 
> >From a performance point of view, it'll be much faster than slowing down
> commit with data writes.
> 
> Can that work for you?

I suspect not, since multiple files are involved.  It's usually something 
like

 write A
 setxattr A
 write B
 setxattr C

and all need to be committed atomically.  The model really is a bundle of 
arbitrary operations that commit atomically.

Slower commit times aren't as much of a concern because this is on the 
storage backend, behind client caches and so forth.  I think it's 
a reasonable price to pay for the stronger consistency.  

Hopefully it's not throwing too big a wrench into the data=ordered 
machinery?  It sort of looks like this is already what you get when taking 
a snapshot (I see the call to wait_ordered_extnets in commit_transaction 
when snaps_pending).

sage

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: inode data not getting included in commits?
  2008-12-19 20:08           ` Sage Weil
@ 2008-12-20  0:11             ` Chris Mason
  0 siblings, 0 replies; 8+ messages in thread
From: Chris Mason @ 2008-12-20  0:11 UTC (permalink / raw)
  To: Sage Weil; +Cc: Yan Zheng, linux-btrfs

On Fri, 2008-12-19 at 12:08 -0800, Sage Weil wrote:

[ trigger data=ordered flush at commit time ]

>  
> > A third option is a different type of xattr operation that doesn't go to
> > disk until the metadata updates done at IO end time.
> > 
> > >From a performance point of view, it'll be much faster than slowing down
> > commit with data writes.
> > 
> > Can that work for you?
> 
> I suspect not, since multiple files are involved.  It's usually something 
> like
> 
>  write A
>  setxattr A
>  write B
>  setxattr C
> 
> and all need to be committed atomically.  The model really is a bundle of 
> arbitrary operations that commit atomically.
> 
> Slower commit times aren't as much of a concern because this is on the 
> storage backend, behind client caches and so forth.  I think it's 
> a reasonable price to pay for the stronger consistency.  
> 
> Hopefully it's not throwing too big a wrench into the data=ordered 
> machinery?  It sort of looks like this is already what you get when taking 
> a snapshot (I see the call to wait_ordered_extnets in commit_transaction 
> when snaps_pending).

If we make it optional, its fine by me to added a data=ordered flush at
commit time.

-chris



^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2008-12-20  0:11 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-12-19  0:22 inode data not getting included in commits? Sage Weil
2008-12-19  1:26 ` Yan Zheng
2008-12-19  5:21   ` Sage Weil
2008-12-19 14:12     ` Chris Mason
2008-12-19 18:48       ` Sage Weil
2008-12-19 19:07         ` Chris Mason
2008-12-19 20:08           ` Sage Weil
2008-12-20  0:11             ` Chris Mason

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox