* inode data not getting included in commits?
@ 2008-12-19 0:22 Sage Weil
2008-12-19 1:26 ` Yan Zheng
0 siblings, 1 reply; 8+ messages in thread
From: Sage Weil @ 2008-12-19 0:22 UTC (permalink / raw)
To: linux-btrfs
Hi Chris-
I noticed some data and metadata getting out of sync on disk, despite
wrapping my writes with btrfs transactions. After digging into it a bit,
it appears to be a larger problem with inode size/data getting written
during a regular commit.
I have a test program append a few bytes at a time to a few different
files, in a loop. I let it run until I see a btrfs transaction commit
(via a printk at the bottom of btrfs_commit_transaction). Then 'reboot -f
-n'. After remounting, all files exist but are 0 bytes, and debug-tree
shows a bunch of empty files. I would expect to see either the sizes when
the commit happend (a few hundred KB in my case), or no files at all;
there was actually no point in time when any of the files were 0 bytes.
Similarly, if I do the same but wait for a few commits to happen, after
remount the file sizes reflect the size from around the next-to-last
commit, not the last commit.
This is probably more information than you need, but my original test was
a bit more complicated, with weirder results. Append to each file, then
write it's size to an xattr on another file. Wrap both operations in a
transaction. Start it up, run 'sync', then reboot -f -n. When I remount
the size and xattr are out of sync by exactly one iteration: the xattr
reflects the size that resulted from _two_ writes back, not the
immediately preceeding write. If anything I would expect to see a larger
actual size than xattr value (for example if the start/end transaction
ioctls weren't working)...
sage
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
int main(int argc, char **argv)
{
while (1) {
int r, fd, pos, i = rand() % 10;
char a[20];
sprintf(a, "%d.log", i);
fd = open(a, O_CREAT|O_APPEND|O_WRONLY, 0600);
r = write(fd, "foobarfoo\n", 10);
pos = lseek(fd, 0, SEEK_CUR);
printf("write %s = %d, size = %d\n", a, r, pos);
close(fd);
}
}
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: inode data not getting included in commits?
2008-12-19 0:22 inode data not getting included in commits? Sage Weil
@ 2008-12-19 1:26 ` Yan Zheng
2008-12-19 5:21 ` Sage Weil
0 siblings, 1 reply; 8+ messages in thread
From: Yan Zheng @ 2008-12-19 1:26 UTC (permalink / raw)
To: Sage Weil; +Cc: linux-btrfs
2008/12/19 Sage Weil <sage@newdream.net>:
> Hi Chris-
>
> I noticed some data and metadata getting out of sync on disk, despite
> wrapping my writes with btrfs transactions. After digging into it a bit,
> it appears to be a larger problem with inode size/data getting written
> during a regular commit.
>
> I have a test program append a few bytes at a time to a few different
> files, in a loop. I let it run until I see a btrfs transaction commit
> (via a printk at the bottom of btrfs_commit_transaction). Then 'reboot -f
> -n'. After remounting, all files exist but are 0 bytes, and debug-tree
> shows a bunch of empty files. I would expect to see either the sizes when
> the commit happend (a few hundred KB in my case), or no files at all;
> there was actually no point in time when any of the files were 0 bytes.
>
> Similarly, if I do the same but wait for a few commits to happen, after
> remount the file sizes reflect the size from around the next-to-last
> commit, not the last commit.
>
> This is probably more information than you need, but my original test was
> a bit more complicated, with weirder results. Append to each file, then
> write it's size to an xattr on another file. Wrap both operations in a
> transaction. Start it up, run 'sync', then reboot -f -n. When I remount
> the size and xattr are out of sync by exactly one iteration: the xattr
> reflects the size that resulted from _two_ writes back, not the
> immediately preceeding write. If anything I would expect to see a larger
> actual size than xattr value (for example if the start/end transaction
> ioctls weren't working)...
>
> sage
>
>
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <stdlib.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
>
> int main(int argc, char **argv)
> {
> while (1) {
> int r, fd, pos, i = rand() % 10;
> char a[20];
>
> sprintf(a, "%d.log", i);
> fd = open(a, O_CREAT|O_APPEND|O_WRONLY, 0600);
> r = write(fd, "foobarfoo\n", 10);
> pos = lseek(fd, 0, SEEK_CUR);
> printf("write %s = %d, size = %d\n", a, r, pos);
> close(fd);
> }
> }
>
This is the desired behaviour of data=ordered. Btrfs transaction commit
don't flush data, and metadata wont get updated until data IO complete.
http://article.gmane.org/gmane.comp.file-systems.btrfs/869/match=new+data+ordered+code
Regards
Yan Zheng
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: inode data not getting included in commits?
2008-12-19 1:26 ` Yan Zheng
@ 2008-12-19 5:21 ` Sage Weil
2008-12-19 14:12 ` Chris Mason
0 siblings, 1 reply; 8+ messages in thread
From: Sage Weil @ 2008-12-19 5:21 UTC (permalink / raw)
To: Yan Zheng; +Cc: linux-btrfs
On Fri, 19 Dec 2008, Yan Zheng wrote:
> > I noticed some data and metadata getting out of sync on disk, despite
> > wrapping my writes with btrfs transactions. After digging into it a bit,
> > it appears to be a larger problem with inode size/data getting written
> > during a regular commit.
> > [...]
>
> This is the desired behaviour of data=ordered. Btrfs transaction commit
> don't flush data, and metadata wont get updated until data IO complete.
>
> http://article.gmane.org/gmane.comp.file-systems.btrfs/869/match=new+data+ordered+code
Ah, right, so it is.
I think what I'm looking for then is a mount mode to get the old behavior,
such that each commit flushes previously written data. Probably a call to
btrfs_wait_ordered_extents() in btrfs_commit_transaction(), or something
along those lines...
sage
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: inode data not getting included in commits?
2008-12-19 5:21 ` Sage Weil
@ 2008-12-19 14:12 ` Chris Mason
2008-12-19 18:48 ` Sage Weil
0 siblings, 1 reply; 8+ messages in thread
From: Chris Mason @ 2008-12-19 14:12 UTC (permalink / raw)
To: Sage Weil; +Cc: Yan Zheng, linux-btrfs
On Thu, 2008-12-18 at 21:21 -0800, Sage Weil wrote:
> On Fri, 19 Dec 2008, Yan Zheng wrote:
> > > I noticed some data and metadata getting out of sync on disk, despite
> > > wrapping my writes with btrfs transactions. After digging into it a bit,
> > > it appears to be a larger problem with inode size/data getting written
> > > during a regular commit.
> > > [...]
> >
> > This is the desired behaviour of data=ordered. Btrfs transaction commit
> > don't flush data, and metadata wont get updated until data IO complete.
> >
> > http://article.gmane.org/gmane.comp.file-systems.btrfs/869/match=new+data+ordered+code
>
> Ah, right, so it is.
>
> I think what I'm looking for then is a mount mode to get the old behavior,
> such that each commit flushes previously written data. Probably a call to
> btrfs_wait_ordered_extents() in btrfs_commit_transaction(), or something
> along those lines...
Could you describe the end goal a bit? I'm happy to make modes where
it'll do what you need.
-chris
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: inode data not getting included in commits?
2008-12-19 14:12 ` Chris Mason
@ 2008-12-19 18:48 ` Sage Weil
2008-12-19 19:07 ` Chris Mason
0 siblings, 1 reply; 8+ messages in thread
From: Sage Weil @ 2008-12-19 18:48 UTC (permalink / raw)
To: Chris Mason; +Cc: Yan Zheng, linux-btrfs
On Fri, 19 Dec 2008, Chris Mason wrote:
> On Thu, 2008-12-18 at 21:21 -0800, Sage Weil wrote:
> > On Fri, 19 Dec 2008, Yan Zheng wrote:
> > > > I noticed some data and metadata getting out of sync on disk, despite
> > > > wrapping my writes with btrfs transactions. After digging into it a bit,
> > > > it appears to be a larger problem with inode size/data getting written
> > > > during a regular commit.
> > > > [...]
> > >
> > > This is the desired behaviour of data=ordered. Btrfs transaction commit
> > > don't flush data, and metadata wont get updated until data IO complete.
> > >
> > > http://article.gmane.org/gmane.comp.file-systems.btrfs/869/match=new+data+ordered+code
> >
> > Ah, right, so it is.
> >
> > I think what I'm looking for then is a mount mode to get the old behavior,
> > such that each commit flushes previously written data. Probably a call to
> > btrfs_wait_ordered_extents() in btrfs_commit_transaction(), or something
> > along those lines...
>
> Could you describe the end goal a bit? I'm happy to make modes where
> it'll do what you need.
The end goal is for data to flush and commit with the transaction that was
running when the write() occured.
So, after a sequence like
write A
setxattr B
<crash>
you should always see A if you see B.
And after a sequence like
ioctl(fd, BTRFS_IOC_TRANS_START)
write A
setxattr B
close(fd)
<crash>
you should see either both A and B or neither A nor B.
fsync() isn't really appropriate since it forces a commit (or a tree log
entry?), and it would still be better to roll lots of operations up
together. Either a mount mode that includes dirty data in each
transaction commit (and probably disables the tree log?), or a per-file
fsync-like operation that commits an individual file's dirty data to the
running transaction would do the trick.
sage
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: inode data not getting included in commits?
2008-12-19 18:48 ` Sage Weil
@ 2008-12-19 19:07 ` Chris Mason
2008-12-19 20:08 ` Sage Weil
0 siblings, 1 reply; 8+ messages in thread
From: Chris Mason @ 2008-12-19 19:07 UTC (permalink / raw)
To: Sage Weil; +Cc: Yan Zheng, linux-btrfs
On Fri, 2008-12-19 at 10:48 -0800, Sage Weil wrote:
> On Fri, 19 Dec 2008, Chris Mason wrote:
> > On Thu, 2008-12-18 at 21:21 -0800, Sage Weil wrote:
> > > On Fri, 19 Dec 2008, Yan Zheng wrote:
> > > > > I noticed some data and metadata getting out of sync on disk, despite
> > > > > wrapping my writes with btrfs transactions. After digging into it a bit,
> > > > > it appears to be a larger problem with inode size/data getting written
> > > > > during a regular commit.
> > > > > [...]
> > > >
> > > > This is the desired behaviour of data=ordered. Btrfs transaction commit
> > > > don't flush data, and metadata wont get updated until data IO complete.
> > > >
> > > > http://article.gmane.org/gmane.comp.file-systems.btrfs/869/match=new+data+ordered+code
> > >
> > > Ah, right, so it is.
> > >
> > > I think what I'm looking for then is a mount mode to get the old behavior,
> > > such that each commit flushes previously written data. Probably a call to
> > > btrfs_wait_ordered_extents() in btrfs_commit_transaction(), or something
> > > along those lines...
> >
> > Could you describe the end goal a bit? I'm happy to make modes where
> > it'll do what you need.
>
> The end goal is for data to flush and commit with the transaction that was
> running when the write() occured.
>
> So, after a sequence like
> write A
> setxattr B
> <crash>
> you should always see A if you see B.
>
> And after a sequence like
> ioctl(fd, BTRFS_IOC_TRANS_START)
> write A
> setxattr B
> close(fd)
> <crash>
> you should see either both A and B or neither A nor B.
>
> fsync() isn't really appropriate since it forces a commit (or a tree log
> entry?), and it would still be better to roll lots of operations up
> together. Either a mount mode that includes dirty data in each
> transaction commit (and probably disables the tree log?), or a per-file
> fsync-like operation that commits an individual file's dirty data to the
> running transaction would do the trick.
A third option is a different type of xattr operation that doesn't go to
disk until the metadata updates done at IO end time.
>From a performance point of view, it'll be much faster than slowing down
commit with data writes.
Can that work for you?
-chris
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: inode data not getting included in commits?
2008-12-19 19:07 ` Chris Mason
@ 2008-12-19 20:08 ` Sage Weil
2008-12-20 0:11 ` Chris Mason
0 siblings, 1 reply; 8+ messages in thread
From: Sage Weil @ 2008-12-19 20:08 UTC (permalink / raw)
To: Chris Mason; +Cc: Yan Zheng, linux-btrfs
On Fri, 19 Dec 2008, Chris Mason wrote:
> On Fri, 2008-12-19 at 10:48 -0800, Sage Weil wrote:
> > On Fri, 19 Dec 2008, Chris Mason wrote:
> > > On Thu, 2008-12-18 at 21:21 -0800, Sage Weil wrote:
> > > > On Fri, 19 Dec 2008, Yan Zheng wrote:
> > > > > > I noticed some data and metadata getting out of sync on disk, despite
> > > > > > wrapping my writes with btrfs transactions. After digging into it a bit,
> > > > > > it appears to be a larger problem with inode size/data getting written
> > > > > > during a regular commit.
> > > > > > [...]
> > > > >
> > > > > This is the desired behaviour of data=ordered. Btrfs transaction commit
> > > > > don't flush data, and metadata wont get updated until data IO complete.
> > > > >
> > > > > http://article.gmane.org/gmane.comp.file-systems.btrfs/869/match=new+data+ordered+code
> > > >
> > > > Ah, right, so it is.
> > > >
> > > > I think what I'm looking for then is a mount mode to get the old behavior,
> > > > such that each commit flushes previously written data. Probably a call to
> > > > btrfs_wait_ordered_extents() in btrfs_commit_transaction(), or something
> > > > along those lines...
> > >
> > > Could you describe the end goal a bit? I'm happy to make modes where
> > > it'll do what you need.
> >
> > The end goal is for data to flush and commit with the transaction that was
> > running when the write() occured.
> >
> > So, after a sequence like
> > write A
> > setxattr B
> > <crash>
> > you should always see A if you see B.
> >
> > And after a sequence like
> > ioctl(fd, BTRFS_IOC_TRANS_START)
> > write A
> > setxattr B
> > close(fd)
> > <crash>
> > you should see either both A and B or neither A nor B.
> >
> > fsync() isn't really appropriate since it forces a commit (or a tree log
> > entry?), and it would still be better to roll lots of operations up
> > together. Either a mount mode that includes dirty data in each
> > transaction commit (and probably disables the tree log?), or a per-file
> > fsync-like operation that commits an individual file's dirty data to the
> > running transaction would do the trick.
>
> A third option is a different type of xattr operation that doesn't go to
> disk until the metadata updates done at IO end time.
>
> >From a performance point of view, it'll be much faster than slowing down
> commit with data writes.
>
> Can that work for you?
I suspect not, since multiple files are involved. It's usually something
like
write A
setxattr A
write B
setxattr C
and all need to be committed atomically. The model really is a bundle of
arbitrary operations that commit atomically.
Slower commit times aren't as much of a concern because this is on the
storage backend, behind client caches and so forth. I think it's
a reasonable price to pay for the stronger consistency.
Hopefully it's not throwing too big a wrench into the data=ordered
machinery? It sort of looks like this is already what you get when taking
a snapshot (I see the call to wait_ordered_extnets in commit_transaction
when snaps_pending).
sage
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: inode data not getting included in commits?
2008-12-19 20:08 ` Sage Weil
@ 2008-12-20 0:11 ` Chris Mason
0 siblings, 0 replies; 8+ messages in thread
From: Chris Mason @ 2008-12-20 0:11 UTC (permalink / raw)
To: Sage Weil; +Cc: Yan Zheng, linux-btrfs
On Fri, 2008-12-19 at 12:08 -0800, Sage Weil wrote:
[ trigger data=ordered flush at commit time ]
>
> > A third option is a different type of xattr operation that doesn't go to
> > disk until the metadata updates done at IO end time.
> >
> > >From a performance point of view, it'll be much faster than slowing down
> > commit with data writes.
> >
> > Can that work for you?
>
> I suspect not, since multiple files are involved. It's usually something
> like
>
> write A
> setxattr A
> write B
> setxattr C
>
> and all need to be committed atomically. The model really is a bundle of
> arbitrary operations that commit atomically.
>
> Slower commit times aren't as much of a concern because this is on the
> storage backend, behind client caches and so forth. I think it's
> a reasonable price to pay for the stronger consistency.
>
> Hopefully it's not throwing too big a wrench into the data=ordered
> machinery? It sort of looks like this is already what you get when taking
> a snapshot (I see the call to wait_ordered_extnets in commit_transaction
> when snaps_pending).
If we make it optional, its fine by me to added a data=ordered flush at
commit time.
-chris
^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2008-12-20 0:11 UTC | newest]
Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-12-19 0:22 inode data not getting included in commits? Sage Weil
2008-12-19 1:26 ` Yan Zheng
2008-12-19 5:21 ` Sage Weil
2008-12-19 14:12 ` Chris Mason
2008-12-19 18:48 ` Sage Weil
2008-12-19 19:07 ` Chris Mason
2008-12-19 20:08 ` Sage Weil
2008-12-20 0:11 ` Chris Mason
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox