* garbage block(s) after powercycle/reboot + sparse writes
@ 2013-06-04 19:24 Sage Weil
2013-06-04 20:00 ` Ben Myers
2013-06-05 0:22 ` Eric Sandeen
0 siblings, 2 replies; 9+ messages in thread
From: Sage Weil @ 2013-06-04 19:24 UTC (permalink / raw)
To: xfs
I'm observing an interesting data corruption pattern:
- write a bunch of files
- power cycle the box
- remount
- immediately (within 1-2 seconds) create a file and
- write to a lower offset, say offset 430423 len 527614
- write to a higher offset, say offset 1360810 len 269613
(there is other random io going to other files too)
- about 5 seconds later, read the whole file and verify content
And what I see:
- the first region is correct, and intact
- the bytes that follow, up until the block boundary, are 0
- the next few blocks are *not* zero! (I've observed runs of 1 and of 6 4k blocks)
- then lots of zeros, up until the second region, which appears intact.
I'm pretty reliably hitting this, and have reproduced it twice now and
found the above consistent pattern (but different filenames, different
offsets). What I haven't yet confirmed is whether the file was written at
all prior to the powercycle, since that tends to blow away the last
bit of the ceph logs, too. I'm adding some additional checks to see
whether the file is in fact new when the first extent is written.
The other possibly interesting thing is the offsets. The garbage regions
I saw were
0xea000 - 0xf0000
0xff000 - 0x100000
Does this failure pattern look familiar to anyone? I'm pretty sure it is
new in 3.9, which we switched over to right around the time when this
started happening. I'm confirming that as well, but just wanted to see if
this is ringing any bells...
Thanks!
sage
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* Re: garbage block(s) after powercycle/reboot + sparse writes
2013-06-04 19:24 garbage block(s) after powercycle/reboot + sparse writes Sage Weil
@ 2013-06-04 20:00 ` Ben Myers
2013-06-05 0:22 ` Eric Sandeen
1 sibling, 0 replies; 9+ messages in thread
From: Ben Myers @ 2013-06-04 20:00 UTC (permalink / raw)
To: Sage Weil; +Cc: xfs
Hi Sage,
On Tue, Jun 04, 2013 at 12:24:00PM -0700, Sage Weil wrote:
> I'm observing an interesting data corruption pattern:
>
> - write a bunch of files
> - power cycle the box
> - remount
> - immediately (within 1-2 seconds) create a file and
> - write to a lower offset, say offset 430423 len 527614
> - write to a higher offset, say offset 1360810 len 269613
> (there is other random io going to other files too)
>
> - about 5 seconds later, read the whole file and verify content
>
> And what I see:
>
> - the first region is correct, and intact
> - the bytes that follow, up until the block boundary, are 0
> - the next few blocks are *not* zero! (i've observed 1 and 6 4k blocks)
> - then lots of zeros, up until the second region, which appears intact.
>
> I'm pretty reliably hitting this, and have reproduced it twice now and
> found the above consistent pattern (but different filenames, different
> offsets). What I haven't yet confirmed is whether the file was written at
> all prior to the powercycle, since that tends to blow away the last
> bit of the ceph logs, too. I'm adding some additional checks to see
> whether the file is in fact new when the first extent is written.
>
> The other possibly interesting thing is the offsets. The garbage regions
> I saw were
>
> 0xea000 - 0xf0000
> 0xff000 - 0x100000
>
> Does this failure pattern look familiar to anyone? I'm pretty sure it is
> new in 3.9, which we switched over to right around the time when this
> started happening. I'm confirming that as well, but just wanted to see if
> this is ringing any bells...
Consider
commit 49b137cbbcc836ef231866c137d24f42c42bb483
Author: Dave Chinner <dchinner@redhat.com>
Date: Mon May 20 09:51:08 2013 +1000
xfs: fix sub-page blocksize data integrity writes
I think that this is the only recent candidate we have. Maybe you are seeing
stale data from disk: the allocation was completed, but the machine crashed
before the pages were written. IIRC we have a zeroing problem in that area.
Regards,
Ben
* Re: garbage block(s) after powercycle/reboot + sparse writes
2013-06-04 19:24 garbage block(s) after powercycle/reboot + sparse writes Sage Weil
2013-06-04 20:00 ` Ben Myers
@ 2013-06-05 0:22 ` Eric Sandeen
2013-06-12 17:02 ` Sage Weil
1 sibling, 1 reply; 9+ messages in thread
From: Eric Sandeen @ 2013-06-05 0:22 UTC (permalink / raw)
To: Sage Weil; +Cc: xfs
On 6/4/13 2:24 PM, Sage Weil wrote:
> I'm observing an interesting data corruption pattern:
>
> - write a bunch of files
> - power cycle the box
I guess this part is important? But I'm wondering why...
> - remount
> - immediately (within 1-2 seconds) create a file and
a new file, right?
> - write to a lower offset, say offset 430423 len 527614
> - write to a higher offset, say offset 1360810 len 269613
> (there is other random io going to other files too)
>
> - about 5 seconds later, read the whole file and verify content
>
> And what I see:
>
> - the first region is correct, and intact
the lower offset you wrote?
> - the bytes that follow, up until the block boundary, are 0
that's good ;)
> - the next few blocks are *not* zero! (i've observed 1 and 6 4k blocks)
that's bad!
> - then lots of zeros, up until the second region, which appears intact.
the lot-of-zeros are probably holes?
What does xfs_bmap -vvp <filename> say about the file in question?
> I'm pretty reliably hitting this, and have reproduced it twice now and
> found the above consistent pattern (but different filenames, different
> offsets). What I haven't yet confirmed is whether the file was written at
> all prior to the powercycle, since that tends to blow away the last
> bit of the ceph logs, too. I'm adding some additional checks to see
> whether the file is in fact new when the first extent is written.
>
> The other possibly interesting thing is the offsets. The garbage regions
> I saw were
>
> 0xea000 - 0xf0000
234-240 4k blocks
> 0xff000 - 0x100000
255-256 4k blocks *shrug*
Is this what you saw w/ the write offsets & sizes you specified above?
I'm wondering if this could possibly have to do w/ speculative preallocation
on the file somehow exposing these blocks? But that's just handwaving.
-Eric
>
> Does this failure pattern look familiar to anyone? I'm pretty sure it is
> new in 3.9, which we switched over to right around the time when this
> started happening. I'm confirming that as well, but just wanted to see if
> this is ringing any bells...
>
> Thanks!
> sage
>
* Re: garbage block(s) after powercycle/reboot + sparse writes
2013-06-05 0:22 ` Eric Sandeen
@ 2013-06-12 17:02 ` Sage Weil
2013-06-19 1:46 ` Dave Chinner
0 siblings, 1 reply; 9+ messages in thread
From: Sage Weil @ 2013-06-12 17:02 UTC (permalink / raw)
To: Eric Sandeen, david; +Cc: xfs
Hi guys,
I reproduced this on two more boxes and have more data. The full set of
notes/logs is at
http://newdream.net/~sage/bug-4976/notes.txt
I stashed a copy of the ceph log and the file itself for each case too:
http://newdream.net/~sage/bug-4976/
The new information:
- the file was created and allocated prior to the powercycle.
- the writes were then replayed by the ceph journaling after restart
- garbage data appears at offsets we never wrote to
If this pattern isn't suggesting any likely theories, I could rig things
up to try to capture the ceph log output leading up to the crash so I can
positively confirm what the sequence was before the power cycle.
Any ideas?
Thanks!
sage
On Tue, 4 Jun 2013, Eric Sandeen wrote:
> On 6/4/13 2:24 PM, Sage Weil wrote:
> > I'm observing an interesting data corruption pattern:
> >
> > - write a bunch of files
> > - power cycle the box
>
> I guess this part is important? But I'm wondering why...
>
> > - remount
> - immediately (within 1-2 seconds) create a file and
>
> a new file, right?
It was created and written to (w/ the same pattern) before the crash. We
then repeat the sequence afterward, when replaying the ceph journal.
> > - write to a lower offset, say offset 430423 len 527614
> > - write to a higher offset, say offset 1360810 len 269613
> > (there is other random io going to other files too)
> >
> > - about 5 seconds later, read the whole file and verify content
> >
> > And what I see:
> >
> > - the first region is correct, and intact
>
> the lower offset you wrote?
Right
> > - the bytes that follow, up until the block boundary, are 0
>
> that's good ;)
>
> > - the next few blocks are *not* zero! (i've observed 1 and 6 4k blocks)
>
> that's bad!
>
> > - then lots of zeros, up until the second region, which appears intact.
>
> the lot-of-zeros are probably holes?
Right
> What does xfs_bmap -vvp <filename> say about the file in question?
The notes.txt file linked above has the bmap output.
Thanks!
sage
> > I'm pretty reliably hitting this, and have reproduced it twice now and
> > found the above consistent pattern (but different filenames, different
> > offsets). What I haven't yet confirmed is whether the file was written at
> > all prior to the powercycle, since that tends to blow away the last
> > bit of the ceph logs, too. I'm adding some additional checks to see
> > whether the file is in fact new when the first extent is written.
> >
> > The other possibly interesting thing is the offsets. The garbage regions
> > I saw were
> >
> > 0xea000 - 0xf0000
>
> 234-240 4k blocks
>
> > 0xff000 - 0x100000
>
> 255-256 4k blocks *shrug*
>
> Is this what you saw w/ the write offsets & sizes you specified above?
>
> I'm wondering if this could possibly have to do w/ speculative preallocation
> on the file somehow exposing these blocks? But that's just handwaving.
>
> -Eric
>
> >
> > Does this failure pattern look familiar to anyone? I'm pretty sure it is
> > new in 3.9, which we switched over to right around the time when this
> > started happening. I'm confirming that as well, but just wanted to see if
> > this is ringing any bells...
> >
> > Thanks!
> > sage
> >
* Re: garbage block(s) after powercycle/reboot + sparse writes
2013-06-12 17:02 ` Sage Weil
@ 2013-06-19 1:46 ` Dave Chinner
2013-06-19 3:12 ` Sage Weil
0 siblings, 1 reply; 9+ messages in thread
From: Dave Chinner @ 2013-06-19 1:46 UTC (permalink / raw)
To: Sage Weil; +Cc: Eric Sandeen, xfs
On Wed, Jun 12, 2013 at 10:02:52AM -0700, Sage Weil wrote:
> Hi guys,
>
> I reproduced this on two more boxes and have more data. The full set of
> notes/logs is at
>
> http://newdream.net/~sage/bug-4976/notes.txt
case c:
/var/lib/ceph/osd/ceph-0/current/3.0_head/DIR_9/plana8021941-457__head_834C52B9__3:
0: [0..7]: 244206232..244206239
1: [8..1351]: hole
2: [1352..2431]: 244252824..244253903
3: [2432..3255]: hole
4: [3256..4855]: 244254728..244256327
bad data starts at offset 1179648:
Which lies within the middle of an allocated extent (offset 2304bb).
IIUC, then there was a write at offset 700466 for 445465 bytes,
(i.e start @ 1368bb, end @ 2239bb), but given the block count of
the file didn't change, this must have been an overwrite of existing
data. It's well within EOF, too, so it's not clear what has happened
here - the bad data was not written by the ceph journal replay, and
the extent was already allocated on disk...
case d:
0: [0..7]: 732632192..732632199
1: [8..783]: hole
2: [784..2943]: 733513808..733515967
3: [2944..3511]: hole
4: [3512..4703]: 733516536..733517727
INFO:teuthology.task.rados.rados.0.out:incorrect buffer at pos 1179648 (same offset as previous case!!!)
So, similar layout, we have a write at offset 404703 for 773518
bytes (start @790bb, end @2302). Ok, that makes a little more sense
to have corruption starting @ 2304. The pages in the page cache that
cover this write would cover offsets 784bb to 2303bb, having zeroed
the head and tail sections.
The file size and number of blocks didn't change again and it's
well within the file size, so this is also an overwrite without
allocation. IOWs the overwrite ended at 2303bb, with corruption
starting at 2304bb.
Let's play a "what-if" game. There are 3 writes to the file:
-@ 0 for 57 bytes (1FSB)
-@ 404703 for 773518 bytes
-@ 1801584 for 603119 bytes
These translate to the following regions in BB:
- [0..1]
- [790..2302]
- [3518..4697]
If we round these to filesystem blocks, we have:
- [0..7]
- [784..2303]
- [3512..4703]
And in filesystem blocks:
Should have Actually have
----------- ------------
- [0..1] - [0..1]
- [98..287] - [98..367]
- [439..587] - [439..587]
So there's an extra 80 filesystem blocks in the middle extent. Does
that tally with the block count of 3360bb = 420fsb? 1 + 190 + 149 =
340fsb. It doesn't, so there's another 60fsb beyond EOF.
And that's pretty telling. There's non-zero'd speculative
preallocation there. The *big* question is this: why was the file
size extended (i.e. user data IO completed) and updated, and the
file size update logged to the journal *before* the zeroed regions
of the file had been written to zero?
And why only on 3.8/3.9?
> Any ideas?
I'd love to see the code that is writing the files in the first
place - it's not playing silly buggers with sync_file_range(), is
it?
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: garbage block(s) after powercycle/reboot + sparse writes
2013-06-19 1:46 ` Dave Chinner
@ 2013-06-19 3:12 ` Sage Weil
2013-06-19 4:05 ` Dave Chinner
0 siblings, 1 reply; 9+ messages in thread
From: Sage Weil @ 2013-06-19 3:12 UTC (permalink / raw)
To: Dave Chinner; +Cc: Eric Sandeen, xfs
On Wed, 19 Jun 2013, Dave Chinner wrote:
> On Wed, Jun 12, 2013 at 10:02:52AM -0700, Sage Weil wrote:
> > Hi guys,
> >
> > I reproduced this on two more boxes and have more data. The full set of
> > notes/logs is at
> >
> > http://newdream.net/~sage/bug-4976/notes.txt
>
> case c:
>
> /var/lib/ceph/osd/ceph-0/current/3.0_head/DIR_9/plana8021941-457__head_834C52B9__3:
> 0: [0..7]: 244206232..244206239
> 1: [8..1351]: hole
> 2: [1352..2431]: 244252824..244253903
> 3: [2432..3255]: hole
> 4: [3256..4855]: 244254728..244256327
>
> bad data starts at offset 1179648:
>
> Which lies within the middle of an allocated extent (offset 2304bb).
>
> IIUC, then there was a write at offset 700466 for 445465 bytes,
> (i.e start @ 1368bb, end @ 2239bb), but given the block count of
> the file didn't change, this must have been an overwrite of existing
> data. It's well within EOF, too, so it's not clear what has happened
> here - the bad data was not written by the ceph journal replay, and
> the extent was already allocated on disk...
>
> case d:
>
> 0: [0..7]: 732632192..732632199
> 1: [8..783]: hole
> 2: [784..2943]: 733513808..733515967
> 3: [2944..3511]: hole
> 4: [3512..4703]: 733516536..733517727
>
>
> INFO:teuthology.task.rados.rados.0.out:incorrect buffer at pos 1179648 (((same offset as previous case!!!))
>
> So, similar layout, we have a write at offset 404703 for 773518
> bytes (start @790bb, end @2302). Ok, that makes a little more sense
> to have corruption starting @ 2304. The pages in the page cache that
> cover this write would cover offsets 784bb to 2303bb, having zeroed
> the head and tail sections.
>
> The file size and number of blocks didn't change again and it's
> well within the file size, so this is also an overwrite without
> allocat. IOWs the overwrite ended at 2303bb, with corruption
> starting at 2304bb.
>
> Let's play a "what-if" game. there are 3 writes to the file:
>
> -@ 0 for 57 bytes (1FSB)
> -@ 404703 for 773518 bytes (
> -@ 1801584 for 603119 bytes
>
> These translate to the following regions in BB:
>
> - [0..1]
> - [790..2302]
> - [3518..4697]
>
> If we round these to filesystem blocks, we have:
>
> - [0..7]
> - [784..2303]
> - [3512..4703]
>
> And in filesystem blocks:
>
> Should have Actually have
> ----------- ------------
> - [0..1] - [0..1]
> - [98..287] - [98..367]
> - [439..587] - [439..587]
>
> So there's an extra 80 filesystem blocks in the middle extent. Does
> that tally with the block count of 3360bb = 420fsb? 1 + 190 + 149 =
> 340fsb. it doesn't, so there's another 60fsb beyond EOF.
>
> And that's pretty telling. There's non-zero'd speculative
> preallocation there. The *big* question is this: why was the file
> size extented (i.e. user data IO completed) and updated and the
> file size update logged to the journal *before* the zeroed regions
> of the file had been written to zero?
>
> And why only on 3.8/3.9?
I haven't gone back and tested on older kernels. It's possible this isn't
a new problem and previous failures were lost in the noise. :/
> > Any ideas?
>
> I'd love to see the code that is writing the files in the first
> place - it's not playing silly buggers with sync_file_range(), is
> it?
Actually, it is!
...
::sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE);
...
int fa_r = posix_fadvise(fd, off, len, POSIX_FADV_DONTNEED);
...and a call to close(2).
The intent was just to push things out to disk so that a syncfs()
that comes later will take less time. I wouldn't have expected this to
change the commit ordering semantics, though; only to initiate writeback
sooner.
(Coincidentally, all of this code was just recently rewritten to do a
simple fsync. I haven't retested the powercycling since then, though.)
sage
* Re: garbage block(s) after powercycle/reboot + sparse writes
2013-06-19 3:12 ` Sage Weil
@ 2013-06-19 4:05 ` Dave Chinner
2013-06-19 4:15 ` Sage Weil
0 siblings, 1 reply; 9+ messages in thread
From: Dave Chinner @ 2013-06-19 4:05 UTC (permalink / raw)
To: Sage Weil; +Cc: Eric Sandeen, xfs
On Tue, Jun 18, 2013 at 08:12:01PM -0700, Sage Weil wrote:
> On Wed, 19 Jun 2013, Dave Chinner wrote:
> > On Wed, Jun 12, 2013 at 10:02:52AM -0700, Sage Weil wrote:
> > > Hi guys,
> > >
> > > I reproduced this on two more boxes and have more data. The full set of
> > > notes/logs is at
> > >
> > > http://newdream.net/~sage/bug-4976/notes.txt
> >
> > case c:
> >
> > /var/lib/ceph/osd/ceph-0/current/3.0_head/DIR_9/plana8021941-457__head_834C52B9__3:
> > 0: [0..7]: 244206232..244206239
> > 1: [8..1351]: hole
> > 2: [1352..2431]: 244252824..244253903
> > 3: [2432..3255]: hole
> > 4: [3256..4855]: 244254728..244256327
> >
> > bad data starts at offset 1179648:
> >
> > Which lies within the middle of an allocated extent (offset 2304bb).
> >
> > IIUC, then there was a write at offset 700466 for 445465 bytes,
> > (i.e start @ 1368bb, end @ 2239bb), but given the block count of
> > the file didn't change, this must have been an overwrite of existing
> > data. It's well within EOF, too, so it's not clear what has happened
> > here - the bad data was not written by the ceph journal replay, and
> > the extent was already allocated on disk...
> >
> > case d:
> >
> > 0: [0..7]: 732632192..732632199
> > 1: [8..783]: hole
> > 2: [784..2943]: 733513808..733515967
> > 3: [2944..3511]: hole
> > 4: [3512..4703]: 733516536..733517727
> >
> >
> > INFO:teuthology.task.rados.rados.0.out:incorrect buffer at pos 1179648 (((same offset as previous case!!!))
> >
> > So, similar layout, we have a write at offset 404703 for 773518
> > bytes (start @790bb, end @2302). Ok, that makes a little more sense
> > to have corruption starting @ 2304. The pages in the page cache that
> > cover this write would cover offsets 784bb to 2303bb, having zeroed
> > the head and tail sections.
> >
> > The file size and number of blocks didn't change again and it's
> > well within the file size, so this is also an overwrite without
> > allocat. IOWs the overwrite ended at 2303bb, with corruption
> > starting at 2304bb.
> >
> > Let's play a "what-if" game. there are 3 writes to the file:
> >
> > -@ 0 for 57 bytes (1FSB)
> > -@ 404703 for 773518 bytes (
> > -@ 1801584 for 603119 bytes
> >
> > These translate to the following regions in BB:
> >
> > - [0..1]
> > - [790..2302]
> > - [3518..4697]
> >
> > If we round these to filesystem blocks, we have:
> >
> > - [0..7]
> > - [784..2303]
> > - [3512..4703]
> >
> > And in filesystem blocks:
> >
> > Should have Actually have
> > ----------- ------------
> > - [0..1] - [0..1]
> > - [98..287] - [98..367]
> > - [439..587] - [439..587]
> >
> > So there's an extra 80 filesystem blocks in the middle extent. Does
> > that tally with the block count of 3360bb = 420fsb? 1 + 190 + 149 =
> > 340fsb. it doesn't, so there's another 60fsb beyond EOF.
> >
> > And that's pretty telling. There's non-zero'd speculative
> > preallocation there. The *big* question is this: why was the file
> > size extented (i.e. user data IO completed) and updated and the
> > file size update logged to the journal *before* the zeroed regions
> > of the file had been written to zero?
> >
> > And why only on 3.8/3.9?
>
> I haven't gone back and tested on older kernels. It's possible this isn't
> a new problem and previous failures were lost in the noise. :/
>
> > > Any ideas?
> >
> > I'd love to see the code that is writing the files in the first
> > place - it's not playing silly buggers with sync_file_range(), is
> > it?
>
> Actually, it is!
>
> ...
> ::sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE);
> ...
> int fa_r = posix_fadvise(fd, off, len, POSIX_FADV_DONTNEED);
>
> ...and a call to close(2).
<sigh>
> The intent was just to push things out of the disk so that a syncfs()
> that comes later will take less time. I wouldn't have expected this to
> change the commit ordering semantics, though; only to initiate writeback
> sooner.
It initiates writeback on certain ranges of the file before others,
and so can expose issues related to ordering of writeback.
It's made worse by the fact that, by definition, both
sync_file_range() and posix_fadvise(POSIX_FADV_DONTNEED) provide no
data integrity guarantees. And, further, neither are vectored
through the filesystem like fsync is and so the filesystem can't
provide any integrity guarantees, either.
Hence if you use range based posix_fadvise(DONTNEED) and/or
sync_file_range(), you can get real strange things happening as data
writeback ordering is fully under the control of the user, not the
filesystem, and the user, in general, has no clue about what the
filesystem is doing under the covers.
> (Coincidentally, all of this code was just recently rewritten to do a
> simple fsync. I haven't retested the powercycling since then, though.)
fsync should work just fine, as it does increasing offset writeback
and that means the zeroed blocks will get written before the data
blocks and so uninitialised data won't get exposed....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: garbage block(s) after powercycle/reboot + sparse writes
2013-06-19 4:05 ` Dave Chinner
@ 2013-06-19 4:15 ` Sage Weil
2013-06-19 5:18 ` Dave Chinner
0 siblings, 1 reply; 9+ messages in thread
From: Sage Weil @ 2013-06-19 4:15 UTC (permalink / raw)
To: Dave Chinner; +Cc: Eric Sandeen, xfs
On Wed, 19 Jun 2013, Dave Chinner wrote:
> On Tue, Jun 18, 2013 at 08:12:01PM -0700, Sage Weil wrote:
> > On Wed, 19 Jun 2013, Dave Chinner wrote:
> > > On Wed, Jun 12, 2013 at 10:02:52AM -0700, Sage Weil wrote:
> > > > Hi guys,
> > > >
> > > > I reproduced this on two more boxes and have more data. The full set of
> > > > notes/logs is at
> > > >
> > > > http://newdream.net/~sage/bug-4976/notes.txt
> > >
> > > case c:
> > >
> > > /var/lib/ceph/osd/ceph-0/current/3.0_head/DIR_9/plana8021941-457__head_834C52B9__3:
> > > 0: [0..7]: 244206232..244206239
> > > 1: [8..1351]: hole
> > > 2: [1352..2431]: 244252824..244253903
> > > 3: [2432..3255]: hole
> > > 4: [3256..4855]: 244254728..244256327
> > >
> > > bad data starts at offset 1179648:
> > >
> > > Which lies within the middle of an allocated extent (offset 2304bb).
> > >
> > > IIUC, then there was a write at offset 700466 for 445465 bytes,
> > > (i.e start @ 1368bb, end @ 2239bb), but given the block count of
> > > the file didn't change, this must have been an overwrite of existing
> > > data. It's well within EOF, too, so it's not clear what has happened
> > > here - the bad data was not written by the ceph journal replay, and
> > > the extent was already allocated on disk...
> > >
> > > case d:
> > >
> > > 0: [0..7]: 732632192..732632199
> > > 1: [8..783]: hole
> > > 2: [784..2943]: 733513808..733515967
> > > 3: [2944..3511]: hole
> > > 4: [3512..4703]: 733516536..733517727
> > >
> > >
> > > INFO:teuthology.task.rados.rados.0.out:incorrect buffer at pos 1179648 (((same offset as previous case!!!))
> > >
> > > So, similar layout, we have a write at offset 404703 for 773518
> > > bytes (start @790bb, end @2302). Ok, that makes a little more sense
> > > to have corruption starting @ 2304. The pages in the page cache that
> > > cover this write would cover offsets 784bb to 2303bb, having zeroed
> > > the head and tail sections.
> > >
> > > The file size and number of blocks didn't change again and it's
> > > well within the file size, so this is also an overwrite without
> > > allocat. IOWs the overwrite ended at 2303bb, with corruption
> > > starting at 2304bb.
> > >
> > > Let's play a "what-if" game. there are 3 writes to the file:
> > >
> > > -@ 0 for 57 bytes (1FSB)
> > > -@ 404703 for 773518 bytes (
> > > -@ 1801584 for 603119 bytes
> > >
> > > These translate to the following regions in BB:
> > >
> > > - [0..1]
> > > - [790..2302]
> > > - [3518..4697]
> > >
> > > If we round these to filesystem blocks, we have:
> > >
> > > - [0..7]
> > > - [784..2303]
> > > - [3512..4703]
> > >
> > > And in filesystem blocks:
> > >
> > > Should have Actually have
> > > ----------- ------------
> > > - [0..1] - [0..1]
> > > - [98..287] - [98..367]
> > > - [439..587] - [439..587]
> > >
> > > So there's an extra 80 filesystem blocks in the middle extent. Does
> > > that tally with the block count of 3360bb = 420fsb? 1 + 190 + 149 =
> > > 340fsb. it doesn't, so there's another 60fsb beyond EOF.
> > >
> > > And that's pretty telling. There's non-zero'd speculative
> > > preallocation there. The *big* question is this: why was the file
> > > size extented (i.e. user data IO completed) and updated and the
> > > file size update logged to the journal *before* the zeroed regions
> > > of the file had been written to zero?
> > >
> > > And why only on 3.8/3.9?
> >
> > I haven't gone back and tested on older kernels. It's possible this isn't
> > a new problem and previous failures were lost in the noise. :/
> >
> > > > Any ideas?
> > >
> > > I'd love to see the code that is writing the files in the first
> > > place - it's not playing silly buggers with sync_file_range(), is
> > > it?
> >
> > Actually, it is!
> >
> > ...
> > ::sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE);
> > ...
> > int fa_r = posix_fadvise(fd, off, len, POSIX_FADV_DONTNEED);
> >
> > ...and a call to close(2).
>
> <sigh>
>
> > The intent was just to push things out of the disk so that a syncfs()
> > that comes later will take less time. I wouldn't have expected this to
> > change the commit ordering semantics, though; only to initiate writeback
> > sooner.
>
> It initiates writeback on certain ranges of the file before others,
> and so therefore can expose issues related to ordering of writeback.
> It's made worse by the fact that, by definition, both
> sync_file_range() and posix_fadvise(POSIX_FADV_DONTNEED) provide no
> data integrity guarantees. And, further, neither are vectored
> through the filesystem like fsync is and so the filesystem can't
> provide any integrity guarantees, either.
Sigh... well that explains that. I realize that these calls don't provide
any guarantees themselves (hence the syncfs(2) call later), but I did
not expect them to break zeroing. (That sounds problematic from a
security perspective?)
Is sync_file_range(2) similarly problematic with ext4?
> Hence if you use range based posix_fadvise(DONTNEED) and/or
> sync_file_range(), you can get real strange things happening as data
> writeback ordering is fully under the control of the user, not the
> filesystem, and the user, in general, has no clue about what the
> filesystem is doing under the covers.
>
> > (Coincidentally, all of this code was just recently rewritten to do a
> > simple fsync. I haven't retested the powercycling since then, though.)
>
> fsync should work just fine, as it does increasing offset writeback
> and that means the zeroed blocks will get written before the data
> blocks and so uninitialised data won't get exposed....
For the stable bugfix fdatasync(2) is preferred; that will do the same
increasing offset writeback I assume?
Thanks, Dave!
sage
* Re: garbage block(s) after powercycle/reboot + sparse writes
2013-06-19 4:15 ` Sage Weil
@ 2013-06-19 5:18 ` Dave Chinner
0 siblings, 0 replies; 9+ messages in thread
From: Dave Chinner @ 2013-06-19 5:18 UTC (permalink / raw)
To: Sage Weil; +Cc: Eric Sandeen, xfs
On Tue, Jun 18, 2013 at 09:15:31PM -0700, Sage Weil wrote:
> On Wed, 19 Jun 2013, Dave Chinner wrote:
> > On Tue, Jun 18, 2013 at 08:12:01PM -0700, Sage Weil wrote:
> > > > I'd love to see the code that is writing the files in the first
> > > > place - it's not playing silly buggers with sync_file_range(), is
> > > > it?
> > >
> > > Actually, it is!
> > >
> > > ...
> > > ::sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE);
> > > ...
> > > int fa_r = posix_fadvise(fd, off, len, POSIX_FADV_DONTNEED);
> > >
> > > ...and a call to close(2).
> >
> > <sigh>
> >
> > > The intent was just to push things out of the disk so that a syncfs()
> > > that comes later will take less time. I wouldn't have expected this to
> > > change the commit ordering semantics, though; only to initiate writeback
> > > sooner.
> >
> > It initiates writeback on certain ranges of the file before others,
> > and so therefore can expose issues related to ordering of writeback.
> > It's made worse by the fact that, by definition, both
> > sync_file_range() and posix_fadvise(POSIX_FADV_DONTNEED) provide no
> > data integrity guarantees. And, further, neither are vectored
> > through the filesystem like fsync is and so the filesystem can't
> > provide any integrity guarantees, either.
>
> Sigh... well that explains that. I realize that these calls don't provide
> any guarantees themselves (hence the syncfs(2) call later), but I did
> not expect them to break zeroing.
Neither did I.
> (That sounds problematic from a
> security perspective?)
Yes, it is. I think I can fix the problem fairly easily, though, as
we have these things called unwritten extents and a function that
can convert an arbitrary byte range of a file to unwritten extents...
FWIW, what I've been seeing over the past few months is that
distributed storage software which uses
posix_fadvise(POSIX_FADV_DONTNEED) has been getting much more
testing, as lots of strange side effects have been
noticed. What it tells me is that these interfaces are not very
well tested even though they have been around for years and are
considered "mature".
e.g. even the fact that POSIX_FADV_DONTNEED could not stop cache
growth under typical object storage server loads was not discovered
until recently - it was fixed in 3.9-rc1. This is probably due to
the fact that these interfaces have, historically, not had wide use
in applications that make heavy use of them and/or care about data
integrity.
Given that problems with data integrity are much harder to trigger,
much harder to test and are much less tested, I'm not surprised by
the fact that these "mature" interfaces are being found to be
extremely unreliable from an application POV...
> Is sync_file_range(2) similarly problematic with ext4?
In data=writeback mode, most definitely. For data=ordered, I have no
idea - the writeback paths in ext4 are ... convoluted, and I hurt my
brain every time I look at them. I wouldn't be surprised if there
are problems, but they'll be different problems because ext4 doesn't
do speculative prealloc...
> > Hence if you use range based posix_fadvise(DONTNEED) and/or
> > sync_file_range(), you can get real strange things happening as data
> > writeback ordering is fully under the control of the user, not the
> > filesystem, and the user, in general, has no clue about what the
> > filesystem is doing under the covers.
> >
> > > (Coincidentally, all of this code was just recently rewritten to do a
> > > simple fsync. I haven't retested the powercycling since then, though.)
> >
> > fsync should work just fine, as it does increasing offset writeback
> > and that means the zeroed blocks will get written before the data
> > blocks and so uninitialised data won't get exposed....
>
> For the stable bugfix fdatasync(2) is preferred; that will do the same
> increasing offset writeback I assume?
Yup.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com