linux-fsdevel.vger.kernel.org archive mirror
* Wrong DIF guard tag on ext2 write
@ 2010-05-31 11:28 Christof Schmitt
  2010-05-31 11:34 ` Christof Schmitt
                   ` (2 more replies)
  0 siblings, 3 replies; 79+ messages in thread
From: Christof Schmitt @ 2010-05-31 11:28 UTC (permalink / raw)
  To: linux-scsi, linux-kernel, linux-fsdevel

When running tests with an ext2 filesystem on a device using DIF/DIX
integrity data, I sometimes see invalid guard tags on write requests.
To track down the problem, I patched the function sd_prep_fn in sd.c
to verify the IP checksums in the guard tags against the actual data.
Sometimes there is a mismatch, and the write request fails when the
HBA checks the guard tag.

Since the guard tags are created in Linux, it seems that the data
attached to the write request changes between the generation in
bio_integrity_generate and the call to sd_prep_fn.
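
A minimal sketch of such a per-sector check (hedged: the helper and
its wiring are illustrative, not the actual debug patch; the guard is
the 16-bit IP checksum that DIX uses):

/*
 * Sketch only: recompute the IP checksum of each 512-byte sector and
 * compare it with the guard tag in the attached integrity metadata
 * (tuple layout as in struct sd_dif_tuple, drivers/scsi/sd_dif.c).
 */
#include <linux/types.h>
#include <linux/kernel.h>
#include <net/checksum.h>	/* ip_compute_csum() */

struct sd_dif_tuple {
	__be16 guard_tag;	/* IP checksum of the sector data */
	__be16 app_tag;		/* application tag */
	__be32 ref_tag;		/* target LBA */
};

static void verify_guard_tags(const void *data,
			      const struct sd_dif_tuple *sdt,
			      unsigned int sectors, sector_t lba)
{
	unsigned int i;

	for (i = 0; i < sectors; i++, sdt++, data += 512) {
		__u16 csum = ip_compute_csum(data, 512);

		if (sdt->guard_tag != cpu_to_be16(csum))
			printk(KERN_ERR "sd tag mismatch, lba 0x%llx, "
			       "dif tag %x, computed tag %x\n",
			       (unsigned long long)(lba + i),
			       be16_to_cpu(sdt->guard_tag), csum);
	}
}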

Using ext3 or ext4 instead of ext2 does not show the problem.

There is a bugzilla open at Red Hat with the same symptom, but there
is no data or activity so far:
https://bugzilla.redhat.com/show_bug.cgi?id=574266

What would be the best way to track down this problem?

--
Christof Schmitt


* Re: Wrong DIF guard tag on ext2 write
  2010-05-31 11:28 Wrong DIF guard tag on ext2 write Christof Schmitt
@ 2010-05-31 11:34 ` Christof Schmitt
  2010-05-31 14:20 ` Martin K. Petersen
  2010-06-03 16:09 ` [LFS/VM TOPIC] Stable pages while IO (was Wrong DIF guard tag on ext2 write) Boaz Harrosh
  2 siblings, 0 replies; 79+ messages in thread
From: Christof Schmitt @ 2010-05-31 11:34 UTC (permalink / raw)
  To: linux-scsi, linux-kernel, linux-fsdevel

On Mon, May 31, 2010 at 01:28:17PM +0200, Christof Schmitt wrote:
> When running tests with an ext2 filesystem on a device using DIF/DIX
> integrity data, I sometimes see invalid guard tags on write requests.
> To track down the problem, I patched the function sd_prep_fn in sd.c
> to verify the IP checksums in the guard tags against the actual data.
> Sometimes there is a mismatch, and the write request fails when the
> HBA checks the guard tag.
> 
> Since the guard tags are created in Linux, it seems that the data
> attached to the write request changes between the generation in
> bio_integrity_generate and the call to sd_prep_fn.
> 
> Using ext3 or ext4 instead of ext2 does not show the problem.
> 
> There is a bugzilla open at Red Hat with the same symptom, but there
> is no data or activity so far:
> https://bugzilla.redhat.com/show_bug.cgi?id=574266
> 
> What would be the best way to track down this problem?

One more thing: the test is running with a 2.6.34 kernel; the problem
in the bugzilla is reported for 2.6.33.

Christof Schmitt


* Re: Wrong DIF guard tag on ext2 write
  2010-05-31 11:28 Wrong DIF guard tag on ext2 write Christof Schmitt
  2010-05-31 11:34 ` Christof Schmitt
@ 2010-05-31 14:20 ` Martin K. Petersen
  2010-05-31 14:46   ` Christof Schmitt
                     ` (2 more replies)
  2010-06-03 16:09 ` [LFS/VM TOPIC] Stable pages while IO (was Wrong DIF guard tag on ext2 write) Boaz Harrosh
  2 siblings, 3 replies; 79+ messages in thread
From: Martin K. Petersen @ 2010-05-31 14:20 UTC (permalink / raw)
  To: Christof Schmitt; +Cc: linux-scsi, linux-kernel, linux-fsdevel

>>>>> "Christof" == Christof Schmitt <christof.schmitt@de.ibm.com> writes:

Christof> Since the guard tags are created in Linux, it seems that the
Christof> data attached to the write request changes between the
Christof> generation in bio_integrity_generate and the call to
Christof> sd_prep_fn.

Yep, known bug.  Page writeback locking is messed up for buffer_head
users.  The extNfs folks volunteered to look into this a while back but
I don't think they have found the time yet.


Christof> Using ext3 or ext4 instead of ext2 does not show the problem.

Last I looked there were still code paths in ext3 and ext4 that
permitted pages to be changed during flight.  I guess you've just been
lucky.

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: Wrong DIF guard tag on ext2 write
  2010-05-31 14:20 ` Martin K. Petersen
@ 2010-05-31 14:46   ` Christof Schmitt
  2010-06-01 13:16     ` Martin K. Petersen
  2010-05-31 14:49   ` Nick Piggin
  2010-05-31 15:01   ` James Bottomley
  2 siblings, 1 reply; 79+ messages in thread
From: Christof Schmitt @ 2010-05-31 14:46 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: linux-scsi, linux-kernel, linux-fsdevel

On Mon, May 31, 2010 at 10:20:44AM -0400, Martin K. Petersen wrote:
> >>>>> "Christof" == Christof Schmitt <christof.schmitt@de.ibm.com> writes:
> 
> Christof> Since the guard tags are created in Linux, it seems that the
> Christof> data attached to the write request changes between the
> Christof> generation in bio_integrity_generate and the call to
> Christof> sd_prep_fn.
> 
> Yep, known bug.  Page writeback locking is messed up for buffer_head
> users.  The extNfs folks volunteered to look into this a while back but
> I don't think they have found the time yet.

Thanks for the info. Does this mean that this bug can appear with all
filesystems?

> 
> 
> Christof> Using ext3 or ext4 instead of ext2 does not show the problem.
> 
> Last I looked there were still code paths in ext3 and ext4 that
> permitted pages to be changed during flight.  I guess you've just been
> lucky.

ext3 looks good so far. I also see the problem with ext4, so I spoke
too early on that one. I will start a longer test run with ext3 to see
if and when the problem appears with ext3 in my setup.

--
Christof Schmitt


* Re: Wrong DIF guard tag on ext2 write
  2010-05-31 14:20 ` Martin K. Petersen
  2010-05-31 14:46   ` Christof Schmitt
@ 2010-05-31 14:49   ` Nick Piggin
  2010-06-01 13:17     ` Martin K. Petersen
  2010-05-31 15:01   ` James Bottomley
  2 siblings, 1 reply; 79+ messages in thread
From: Nick Piggin @ 2010-05-31 14:49 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Christof Schmitt, linux-scsi, linux-kernel, linux-fsdevel

On Mon, May 31, 2010 at 10:20:44AM -0400, Martin K. Petersen wrote:
> >>>>> "Christof" == Christof Schmitt <christof.schmitt@de.ibm.com> writes:
> 
> Christof> Since the guard tags are created in Linux, it seems that the
> Christof> data attached to the write request changes between the
> Christof> generation in bio_integrity_generate and the call to
> Christof> sd_prep_fn.
> 
> Yep, known bug.  Page writeback locking is messed up for buffer_head
> users.  The extNfs folks volunteered to look into this a while back but
> I don't think they have found the time yet.

What do you mean by messed up? Allowing modifications to the page while
it is under writeback? This is deliberate, of course, and not limited
to buffer_head users either.


* Re: Wrong DIF guard tag on ext2 write
  2010-05-31 14:20 ` Martin K. Petersen
  2010-05-31 14:46   ` Christof Schmitt
  2010-05-31 14:49   ` Nick Piggin
@ 2010-05-31 15:01   ` James Bottomley
  2010-05-31 15:30     ` Boaz Harrosh
  2010-06-01  2:40     ` FUJITA Tomonori
  2 siblings, 2 replies; 79+ messages in thread
From: James Bottomley @ 2010-05-31 15:01 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Christof Schmitt, linux-scsi, linux-kernel, linux-fsdevel

On Mon, 2010-05-31 at 10:20 -0400, Martin K. Petersen wrote:
> >>>>> "Christof" == Christof Schmitt <christof.schmitt@de.ibm.com> writes:
> 
> Christof> Since the guard tags are created in Linux, it seems that the
> Christof> data attached to the write request changes between the
> Christof> generation in bio_integrity_generate and the call to
> Christof> sd_prep_fn.
> 
> Yep, known bug.  Page writeback locking is messed up for buffer_head
> users.  The extNfs folks volunteered to look into this a while back but
> I don't think they have found the time yet.
> 
> 
> Christof> Using ext3 or ext4 instead of ext2 does not show the problem.
> 
> Last I looked there were still code paths in ext3 and ext4 that
> permitted pages to be changed during flight.  I guess you've just been
> lucky.

Pages have always been modifiable in flight.  The OS guarantees they'll
be rewritten, so a driver can drop them if it detects the problem.
This is identical to the iscsi checksum issue (iscsi adds a checksum
because it doesn't trust TCP/IP, and if the checksum is generated in
software, there's time between generation and page transmission for the
alteration to occur).  The solution in the iscsi case was not to
complain if the page is still marked dirty.

James


* Re: Wrong DIF guard tag on ext2 write
  2010-05-31 15:01   ` James Bottomley
@ 2010-05-31 15:30     ` Boaz Harrosh
  2010-05-31 15:49       ` Nick Piggin
  2010-06-01 10:30       ` Christof Schmitt
  2010-06-01  2:40     ` FUJITA Tomonori
  1 sibling, 2 replies; 79+ messages in thread
From: Boaz Harrosh @ 2010-05-31 15:30 UTC (permalink / raw)
  To: James Bottomley
  Cc: Martin K. Petersen, Christof Schmitt, linux-scsi, linux-kernel,
	linux-fsdevel, Chris Mason

On 05/31/2010 06:01 PM, James Bottomley wrote:
> On Mon, 2010-05-31 at 10:20 -0400, Martin K. Petersen wrote:
>>>>>>> "Christof" == Christof Schmitt <christof.schmitt@de.ibm.com> writes:
>>
>> Christof> Since the guard tags are created in Linux, it seems that the
>> Christof> data attached to the write request changes between the
>> Christof> generation in bio_integrity_generate and the call to
>> Christof> sd_prep_fn.
>>
>> Yep, known bug.  Page writeback locking is messed up for buffer_head
>> users.  The extNfs folks volunteered to look into this a while back but
>> I don't think they have found the time yet.
>>
>>
>> Christof> Using ext3 or ext4 instead of ext2 does not show the problem.
>>
>> Last I looked there were still code paths in ext3 and ext4 that
>> permitted pages to be changed during flight.  I guess you've just been
>> lucky.
> 
> Pages have always been modifiable in flight.  The OS guarantees they'll
> be rewritten, so the drivers can drop them if it detects the problem.
> This is identical to the iscsi checksum issue (iscsi adds a checksum
> because it doesn't trust TCP/IP and if the checksum is generated in
> software, there's time between generation and page transmission for the
> alteration to occur).  The solution in the iscsi case was not to
> complain if the page is still marked dirty.
> 

And also why RAID1 and RAID4/5/6 need the data bounced. I wish VFS
would prevent data writing given a device queue flag that requests
it. So all these devices and modes could just flag the VFS/filesystems
that: "please don't allow concurrent writes, otherwise I need to copy data"

From what Chris Mason has said before, all the mechanics are there, and it's
what btrfs is doing. Though I don't know how myself?

> James
> 

Boaz


* Re: Wrong DIF guard tag on ext2 write
  2010-05-31 15:30     ` Boaz Harrosh
@ 2010-05-31 15:49       ` Nick Piggin
  2010-05-31 16:25         ` Boaz Harrosh
  2010-06-01 13:22         ` Martin K. Petersen
  2010-06-01 10:30       ` Christof Schmitt
  1 sibling, 2 replies; 79+ messages in thread
From: Nick Piggin @ 2010-05-31 15:49 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: James Bottomley, Martin K. Petersen, Christof Schmitt, linux-scsi,
	linux-kernel, linux-fsdevel, Chris Mason

On Mon, May 31, 2010 at 06:30:05PM +0300, Boaz Harrosh wrote:
> On 05/31/2010 06:01 PM, James Bottomley wrote:
> > On Mon, 2010-05-31 at 10:20 -0400, Martin K. Petersen wrote:
> >> [...]
> > 
> > Pages have always been modifiable in flight.  The OS guarantees they'll
> > be rewritten, so the drivers can drop them if it detects the problem.
> > This is identical to the iscsi checksum issue (iscsi adds a checksum
> > because it doesn't trust TCP/IP and if the checksum is generated in
> > software, there's time between generation and page transmission for the
> > alteration to occur).  The solution in the iscsi case was not to
> > complain if the page is still marked dirty.
> > 
> 
> And also why RAID1 and RAID4/5/6 need the data bounced. I wish VFS
> would prevent data writing given a device queue flag that requests
> it. So all these devices and modes could just flag the VFS/filesystems
> that: "please don't allow concurrent writes, otherwise I need to copy data"
> 
> From what Chris Mason has said before, all the mechanics are there, and it's
> what btrfs is doing. Though I don't know how myself?

The filesystems can do it themselves, they should have everything
required.

Easiest way would be to not unlock page during the writeback, unmap
mmaps before taking the checksum, and using vm_flags to prevent
get_user_pages.

More complex and maybe more performant would be to avoid holding page
lock but wait_on_page_writeback in page-modification (write, fault)
paths. More complex again could opportunistically replace the page
with a duplicate one and allow modification ops to continue from there.

That's all possible by overriding existing callbacks though. I don't
think I would like to put branches and flag dependent locking all
over existing functions.
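
A minimal sketch of the wait_on_page_writeback variant (illustrative
function name, not a real patch):

#include <linux/pagemap.h>

static void fs_begin_page_modify(struct page *page)
{
	lock_page(page);
	wait_on_page_writeback(page);	/* block until PG_writeback clears */
	/*
	 * The caller can now modify the page, set_page_dirty() it and
	 * unlock_page(); writeback started afterwards sees stable data.
	 */
}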


* Re: Wrong DIF guard tag on ext2 write
  2010-05-31 15:49       ` Nick Piggin
@ 2010-05-31 16:25         ` Boaz Harrosh
  2010-06-01 13:22         ` Martin K. Petersen
  1 sibling, 0 replies; 79+ messages in thread
From: Boaz Harrosh @ 2010-05-31 16:25 UTC (permalink / raw)
  To: Nick Piggin
  Cc: James Bottomley, Martin K. Petersen, Christof Schmitt, linux-scsi,
	linux-kernel, linux-fsdevel, Chris Mason

On 05/31/2010 06:49 PM, Nick Piggin wrote:
> On Mon, May 31, 2010 at 06:30:05PM +0300, Boaz Harrosh wrote:
>> And also why RAID1 and RAID4/5/6 need the data bounced. I wish VFS
>> would prevent data writing given a device queue flag that requests
>> it. So all these devices and modes could just flag the VFS/filesystems
>> that: "please don't allow concurrent writes, otherwise I need to copy data"
>>
>> From what Chris Mason has said before, all the mechanics are there, and it's
>> what btrfs is doing. Though I don't know how myself?
> 
> The filesystems can do it themselves, they should have everything
> required.
> 
> Easiest way would be to not unlock page during the writeback, unmap
> mmaps before taking the checksum, and using vm_flags to prevent
> get_user_pages.
> 
> More complex and maybe more performant would be to avoid holding page
> lock but wait_on_page_writeback in page-modification (write, fault)
> paths. More complex again could opportunistically replace the page
> with a duplicate one and allow modification ops to continue from there.
> 
> That's all possible by overriding existing callbacks though. I don't
> think I would like to put branches and flag dependent locking all
> over existing functions.
> 

Thanks. I'll need to get to this soon enough when doing raid5/6.
At the exofs level at least. This is most valuable information.
I'll keep it in mind. (And I'll also need to do it at NFS which
will be a fight)

I agree that doing it cleanly at the VFS level is much harder. But
then duplicating code all over is also hard. As it stands, RAID copies
data, and iscsi checksums are turned off by distros. And so will DIF
be. I guess we need to leave room for the HW vendors out there ;-)

Boaz


* Re: Wrong DIF guard tag on ext2 write
  2010-05-31 15:01   ` James Bottomley
  2010-05-31 15:30     ` Boaz Harrosh
@ 2010-06-01  2:40     ` FUJITA Tomonori
  1 sibling, 0 replies; 79+ messages in thread
From: FUJITA Tomonori @ 2010-06-01  2:40 UTC (permalink / raw)
  To: James.Bottomley
  Cc: martin.petersen, christof.schmitt, linux-scsi, linux-kernel,
	linux-fsdevel

On Mon, 31 May 2010 10:01:42 -0500
James Bottomley <James.Bottomley@suse.de> wrote:

> On Mon, 2010-05-31 at 10:20 -0400, Martin K. Petersen wrote:
> > [...]
> 
> Pages have always been modifiable in flight.  The OS guarantees they'll
> be rewritten, so the drivers can drop them if it detects the problem.
> This is identical to the iscsi checksum issue (iscsi adds a checksum
> because it doesn't trust TCP/IP and if the checksum is generated in
> software, there's time between generation and page transmission for the
> alteration to occur).  The solution in the iscsi case was not to
> complain if the page is still marked dirty.

In the iSCSI case, what we do is decrease the chance that pages in
flight are modified (use sendmsg instead of sendpage) and pray.

If the initiator sends the wrong checksum, the target complains (and
starts the recovery process). The target can't know if the page is
marked dirty or not.
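
A hedged sketch of the difference (kernel_sendmsg() and
kernel_sendpage() are the real socket helpers; the surrounding
function is made up, and assumes a lowmem page for brevity):

#include <linux/net.h>
#include <linux/socket.h>
#include <linux/uio.h>
#include <linux/mm.h>

static int xmit_block(struct socket *sock, struct page *page,
		      unsigned int offset, unsigned int len, int copy)
{
	if (copy) {
		/* sendmsg copies the bytes now; later changes are harmless */
		struct msghdr msg = { .msg_flags = MSG_DONTWAIT };
		struct kvec iov = {
			.iov_base = page_address(page) + offset,
			.iov_len  = len,
		};
		return kernel_sendmsg(sock, &msg, &iov, 1, len);
	}
	/* sendpage is zero-copy; the page stays exposed until transmission */
	return kernel_sendpage(sock, page, offset, len, MSG_DONTWAIT);
}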


* Re: Wrong DIF guard tag on ext2 write
  2010-05-31 15:30     ` Boaz Harrosh
  2010-05-31 15:49       ` Nick Piggin
@ 2010-06-01 10:30       ` Christof Schmitt
  2010-06-01 10:49         ` Boaz Harrosh
                           ` (2 more replies)
  1 sibling, 3 replies; 79+ messages in thread
From: Christof Schmitt @ 2010-06-01 10:30 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: James Bottomley, Martin K. Petersen, linux-scsi, linux-kernel,
	linux-fsdevel, Chris Mason

On Mon, May 31, 2010 at 06:30:05PM +0300, Boaz Harrosh wrote:
> On 05/31/2010 06:01 PM, James Bottomley wrote:
> > On Mon, 2010-05-31 at 10:20 -0400, Martin K. Petersen wrote:
> >> [...]
> > 
> > Pages have always been modifiable in flight.  The OS guarantees they'll
> > be rewritten, so the drivers can drop them if it detects the problem.
> > This is identical to the iscsi checksum issue (iscsi adds a checksum
> > because it doesn't trust TCP/IP and if the checksum is generated in
> > software, there's time between generation and page transmission for the
> > alteration to occur).  The solution in the iscsi case was not to
> > complain if the page is still marked dirty.
> > 
> 
> And also why RAID1 and RAID4/5/6 need the data bounced. I wish VFS
> would prevent data writing given a device queue flag that requests
> it. So all these devices and modes could just flag the VFS/filesystems
> that: "please don't allow concurrent writes, otherwise I need to copy data"
> 
> From what Chris Mason has said before, all the mechanics are there, and it's
> what btrfs is doing. Though I don't know how myself?

I also tested with btrfs and invalid guard tags in writes have been
encountered as well (again in 2.6.34). The only difference is that no
error was reported to userspace, although this might be a
configuration issue.

What is the best strategy to continue with the invalid guard tags on
write requests? Should this be fixed in the filesystems?

Another idea would be to pass invalid guard tags on write requests
down to the hardware, expect an "invalid guard tag" error and report
it to the block layer where a new checksum is generated and the
request is issued again. Basically implement a retry through the whole
I/O stack. But this also sounds complicated.

--
Christof Schmitt


* Re: Wrong DIF guard tag on ext2 write
  2010-06-01 10:30       ` Christof Schmitt
@ 2010-06-01 10:49         ` Boaz Harrosh
  2010-06-01 13:03         ` Chris Mason
  2010-06-01 13:27         ` James Bottomley
  2 siblings, 0 replies; 79+ messages in thread
From: Boaz Harrosh @ 2010-06-01 10:49 UTC (permalink / raw)
  To: Christof Schmitt
  Cc: James Bottomley, Martin K. Petersen, linux-scsi, linux-kernel,
	linux-fsdevel, Chris Mason

On 06/01/2010 01:30 PM, Christof Schmitt wrote:
> On Mon, May 31, 2010 at 06:30:05PM +0300, Boaz Harrosh wrote:
>> [...]
>>
>> And also why RAID1 and RAID4/5/6 need the data bounced. I wish VFS
>> would prevent data writing given a device queue flag that requests
>> it. So all these devices and modes could just flag the VFS/filesystems
>> that: "please don't allow concurrent writes, otherwise I need to copy data"
>>
>> From what Chris Mason has said before, all the mechanics are there, and it's
>> what btrfs is doing. Though I don't know how myself?
> 
> I also tested with btrfs and invalid guard tags in writes have been
> encountered as well (again in 2.6.34). The only difference is that no
> error was reported to userspace, although this might be a
> configuration issue.
> 

I think in btrfs you need a raid1/5 multi-device configuration for this
to work. If you use a single device then it is just like ext4.

BTW: you could use DM or MD, and it will guard your DIF by copying
the data before I/O.

> What is the best strategy to continue with the invalid guard tags on
> write requests? Should this be fixed in the filesystems?
> 
> Another idea would be to pass invalid guard tags on write requests
> down to the hardware, expect an "invalid guard tag" error and report
> it to the block layer where a new checksum is generated and the
> request is issued again. Basically implement a retry through the whole
> I/O stack. But this also sounds complicated.
> 

I suggest we talk about this issue at the upcoming LSF, because it
does not only affect DIF but any checksum subsystem. And it could
enhance Linux raid performance.

> --
> Christof Schmitt

Boaz


* Re: Wrong DIF guard tag on ext2 write
  2010-06-01 10:30       ` Christof Schmitt
  2010-06-01 10:49         ` Boaz Harrosh
@ 2010-06-01 13:03         ` Chris Mason
  2010-06-01 13:50           ` Christof Schmitt
       [not found]           ` <20100601135059.GA21008@schmichrtp.mainz.de.ibm.com>
  2010-06-01 13:27         ` James Bottomley
  2 siblings, 2 replies; 79+ messages in thread
From: Chris Mason @ 2010-06-01 13:03 UTC (permalink / raw)
  To: Christof Schmitt
  Cc: Boaz Harrosh, James Bottomley, Martin K. Petersen, linux-scsi,
	linux-kernel, linux-fsdevel

On Tue, Jun 01, 2010 at 12:30:42PM +0200, Christof Schmitt wrote:
> On Mon, May 31, 2010 at 06:30:05PM +0300, Boaz Harrosh wrote:
> > [...]
> > 
> > And also why RAID1 and RAID4/5/6 need the data bounced. I wish VFS
> > would prevent data writing given a device queue flag that requests
> > it. So all these devices and modes could just flag the VFS/filesystems
> > that: "please don't allow concurrent writes, otherwise I need to copy data"
> > 
> > From what Chris Mason has said before, all the mechanics are there, and it's
> > what btrfs is doing. Though I don't know how myself?
> 
> I also tested with btrfs and invalid guard tags in writes have been
> encountered as well (again in 2.6.34). The only difference is that no
> error was reported to userspace, although this might be a
> configuration issue.

This would be a btrfs bug.  We have strict checks in place that are
supposed to prevent buffers changing while in flight.  What was the
workload that triggered this problem?

> 
> What is the best strategy to continue with the invalid guard tags on
> write requests? Should this be fixed in the filesystems?
> 

Long term, I think the filesystems shouldn't be changing pages in
flight.  Bouncing just hurts way too much.

-chris


* Re: Wrong DIF guard tag on ext2 write
  2010-05-31 14:46   ` Christof Schmitt
@ 2010-06-01 13:16     ` Martin K. Petersen
  2010-06-02 13:37       ` Christof Schmitt
  0 siblings, 1 reply; 79+ messages in thread
From: Martin K. Petersen @ 2010-06-01 13:16 UTC (permalink / raw)
  To: Christof Schmitt
  Cc: Martin K. Petersen, linux-scsi, linux-kernel, linux-fsdevel

>>>>> "Christof" == Christof Schmitt <christof.schmitt@de.ibm.com> writes:

>> Yep, known bug.  Page writeback locking is messed up for buffer_head
>> users.  The extNfs folks volunteered to look into this a while back
>> but I don't think they have found the time yet.

Christof> Thanks for the info. This means that this bug appears with all
Christof> filesystems?

XFS and btrfs should be fine.

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: Wrong DIF guard tag on ext2 write
  2010-05-31 14:49   ` Nick Piggin
@ 2010-06-01 13:17     ` Martin K. Petersen
  0 siblings, 0 replies; 79+ messages in thread
From: Martin K. Petersen @ 2010-06-01 13:17 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Martin K. Petersen, Christof Schmitt, linux-scsi, linux-kernel,
	linux-fsdevel

>>>>> "Nick" == Nick Piggin <npiggin@suse.de> writes:

>> Yep, known bug.  Page writeback locking is messed up for buffer_head
>> users.  The extNfs folks volunteered to look into this a while back
>> but I don't think they have found the time yet.

Nick> What do you mean by messed up? Allowing modifications to the page
Nick> while it is under writeback? This is deliberate of course and not
Nick> limited to buffer_head users either.

Messed up in the sense that (at least last I looked) very few
buffer_head users were aware of the existence of the page writeback bit.

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: Wrong DIF guard tag on ext2 write
  2010-05-31 15:49       ` Nick Piggin
  2010-05-31 16:25         ` Boaz Harrosh
@ 2010-06-01 13:22         ` Martin K. Petersen
  1 sibling, 0 replies; 79+ messages in thread
From: Martin K. Petersen @ 2010-06-01 13:22 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Boaz Harrosh, James Bottomley, Martin K. Petersen,
	Christof Schmitt, linux-scsi, linux-kernel, linux-fsdevel,
	Chris Mason

>>>>> "Nick" == Nick Piggin <npiggin@suse.de> writes:

Nick> More complex and maybe more performant would be to avoid holding
Nick> page lock but wait_on_page_writeback in page-modification (write,
Nick> fault) paths. 

That's what I was doing last I looked at this.  I seem to recall that my
head exploded once I added buffer_heads to the mix.  And then the extfs
folks promised to take a look so I didn't mess more with it.

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: Wrong DIF guard tag on ext2 write
  2010-06-01 10:30       ` Christof Schmitt
  2010-06-01 10:49         ` Boaz Harrosh
  2010-06-01 13:03         ` Chris Mason
@ 2010-06-01 13:27         ` James Bottomley
  2010-06-01 13:33           ` Chris Mason
  2010-06-03 11:20           ` Vladislav Bolkhovitin
  2 siblings, 2 replies; 79+ messages in thread
From: James Bottomley @ 2010-06-01 13:27 UTC (permalink / raw)
  To: Christof Schmitt
  Cc: Boaz Harrosh, Martin K. Petersen, linux-scsi, linux-kernel,
	linux-fsdevel, Chris Mason

On Tue, 2010-06-01 at 12:30 +0200, Christof Schmitt wrote:
> What is the best strategy to continue with the invalid guard tags on
> write requests? Should this be fixed in the filesystems?

For write requests, as long as the page dirty bit is still set, it's
safe to drop the request, since it's already going to be repeated.  What
we probably want is an error code we can return so that the layer that
sees both the request and the page flags can make the call.

> Another idea would be to pass invalid guard tags on write requests
> down to the hardware, expect an "invalid guard tag" error and report
> it to the block layer where a new checksum is generated and the
> request is issued again. Basically implement a retry through the whole
> I/O stack. But this also sounds complicated.

No, no ... as long as the guard tag is wrong because the fs changed the
page, the write request for the updated page will already be queued or
in-flight, so there's no need to retry.  We still have to pass checksum
failures on in case the data changed because of some HW (or SW) cockup.
The check for this is page dirty.  If we get a checksum error back and
the page is still clean, we know nothing in the OS changed it, therefore
it's a real bit flip error.
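
As a sketch, the policy could look roughly like this (function name
and plumbing are illustrative):

#include <linux/mm.h>
#include <linux/errno.h>

static int classify_guard_error(struct page *page)
{
	if (PageDirty(page))
		return 0;	/* modified in flight; writeback will rewrite it */
	/* clean page: nothing in the OS changed it, report real corruption */
	return -EIO;
}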

James


* Re: Wrong DIF guard tag on ext2 write
  2010-06-01 13:27         ` James Bottomley
@ 2010-06-01 13:33           ` Chris Mason
  2010-06-01 13:40             ` James Bottomley
  2010-06-03 11:20           ` Vladislav Bolkhovitin
  1 sibling, 1 reply; 79+ messages in thread
From: Chris Mason @ 2010-06-01 13:33 UTC (permalink / raw)
  To: James Bottomley
  Cc: Christof Schmitt, Boaz Harrosh, Martin K. Petersen, linux-scsi,
	linux-kernel, linux-fsdevel

On Tue, Jun 01, 2010 at 01:27:56PM +0000, James Bottomley wrote:
> On Tue, 2010-06-01 at 12:30 +0200, Christof Schmitt wrote:
> > What is the best strategy to continue with the invalid guard tags on
> > write requests? Should this be fixed in the filesystems?
> 
> For write requests, as long as the page dirty bit is still set, it's
> safe to drop the request, since it's already going to be repeated.  What
> we probably want is an error code we can return that the layer that sees
> both the request and the page flags can make the call.

I'm afraid this isn't entirely true.  The FS tends to do this:

change the page
<---------> truck sized race right here where the page is clean
mark the page dirty
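
Spelled out as code, a hedged sketch of such a buffered-write path
(the helper name is made up):

#include <linux/highmem.h>
#include <linux/mm.h>
#include <linux/string.h>

static void copy_then_dirty(struct page *page, const char *buf,
			    unsigned int offset, unsigned int len)
{
	char *kaddr = kmap_atomic(page, KM_USER0);

	memcpy(kaddr + offset, buf, len);	/* contents change here... */
	kunmap_atomic(kaddr, KM_USER0);
	/* <-- the race: PageDirty() can still be clear at this point */
	set_page_dirty(page);			/* ...dirty bit set only now */
}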

-chris


* Re: Wrong DIF guard tag on ext2 write
  2010-06-01 13:33           ` Chris Mason
@ 2010-06-01 13:40             ` James Bottomley
  2010-06-01 13:49               ` Chris Mason
  2010-06-01 13:50               ` Martin K. Petersen
  0 siblings, 2 replies; 79+ messages in thread
From: James Bottomley @ 2010-06-01 13:40 UTC (permalink / raw)
  To: Chris Mason
  Cc: Christof Schmitt, Boaz Harrosh, Martin K. Petersen, linux-scsi,
	linux-kernel, linux-fsdevel

On Tue, 2010-06-01 at 09:33 -0400, Chris Mason wrote:
> On Tue, Jun 01, 2010 at 01:27:56PM +0000, James Bottomley wrote:
> > On Tue, 2010-06-01 at 12:30 +0200, Christof Schmitt wrote:
> > > What is the best strategy to continue with the invalid guard tags on
> > > write requests? Should this be fixed in the filesystems?
> > 
> > For write requests, as long as the page dirty bit is still set, it's
> > safe to drop the request, since it's already going to be repeated.  What
> > we probably want is an error code we can return that the layer that sees
> > both the request and the page flags can make the call.
> 
> I'm afraid this isn't entirely true.  The FS tends to do this:
> 
> change the page
> <---------> truck sized race right here where the page is clean
> mark the page dirty

Would it be too much work in the fs to mark the page dirty before you
begin altering it (and again after you finish, just in case some cleaner
noticed and initiated a write)?  Or some other flag that indicates a page
under modification?  All that the process controlling the writeout (which
is pretty high up in the stack) needs to know is whether we triggered the
check error by altering the page while it was in flight.

I agree that a block based retry would close all the holes ... it just
doesn't look elegant to me that the fs will already be repeating the I/O
if it changed the page and so will block.

James




* Re: Wrong DIF guard tag on ext2 write
  2010-06-01 13:40             ` James Bottomley
@ 2010-06-01 13:49               ` Chris Mason
  2010-06-01 16:29                 ` Matthew Wilcox
  2010-06-01 13:50               ` Martin K. Petersen
  1 sibling, 1 reply; 79+ messages in thread
From: Chris Mason @ 2010-06-01 13:49 UTC (permalink / raw)
  To: James Bottomley
  Cc: Christof Schmitt, Boaz Harrosh, Martin K. Petersen, linux-scsi,
	linux-kernel, linux-fsdevel

On Tue, Jun 01, 2010 at 08:40:37AM -0500, James Bottomley wrote:
> On Tue, 2010-06-01 at 09:33 -0400, Chris Mason wrote:
> > On Tue, Jun 01, 2010 at 01:27:56PM +0000, James Bottomley wrote:
> > > On Tue, 2010-06-01 at 12:30 +0200, Christof Schmitt wrote:
> > > > What is the best strategy to continue with the invalid guard tags on
> > > > write requests? Should this be fixed in the filesystems?
> > > 
> > > For write requests, as long as the page dirty bit is still set, it's
> > > safe to drop the request, since it's already going to be repeated.  What
> > > we probably want is an error code we can return that the layer that sees
> > > both the request and the page flags can make the call.
> > 
> > I'm afraid this isn't entirely true.  The FS tends to do this:
> > 
> > change the page
> > <---------> truck sized race right here where the page is clean
> > mark the page dirty
> 
> Would it be too much work in the fs to mark the page dirty before you
> begin altering it (and again after you finish, just in case some cleaner
> noticed and initiated a write)?  Or some other flag that indicates page
> under modification?  All the process controlling the writeout (which is
> pretty high up in the stack) needs to know is if we triggered the check
> error by altering the page while it was in flight.

I expect that once we went down that path we would end up waiting for
the IO to finish before changing the page.  Maybe there is a less
complex way, but I sure didn't see it.

> 
> I agree that a block based retry would close all the holes ... it just
> doesn't look elegant to me that the fs will already be repeating the I/O
> if it changed the page and so will block.

We might not ever repeat the IO.  We might change the page, write it,
change it again, truncate the file and toss the page completely.

-chris



* Re: Wrong DIF guard tag on ext2 write
  2010-06-01 13:40             ` James Bottomley
  2010-06-01 13:49               ` Chris Mason
@ 2010-06-01 13:50               ` Martin K. Petersen
  2010-06-01 14:28                 ` Nick Piggin
  2010-06-01 14:32                 ` James Bottomley
  1 sibling, 2 replies; 79+ messages in thread
From: Martin K. Petersen @ 2010-06-01 13:50 UTC (permalink / raw)
  To: James Bottomley
  Cc: Chris Mason, Christof Schmitt, Boaz Harrosh, Martin K. Petersen,
	linux-scsi, linux-kernel, linux-fsdevel

>>>>> "James" == James Bottomley <James.Bottomley@suse.de> writes:

James> Would it be too much work in the fs to mark the page dirty before
James> you begin altering it (and again after you finish, just in case
James> some cleaner noticed and initiated a write)?  Or some other flag
James> that indicates page under modification?  All the process
James> controlling the writeout (which is pretty high up in the stack)
James> needs to know is if we triggered the check error by altering the
James> page while it was in flight.

James> I agree that a block based retry would close all the holes ... it
James> just doesn't look elegant to me that the fs will already be
James> repeating the I/O if it changed the page and so will block.

I experimented with this approach a while back.  However, I quickly got
into a situation where frequently updated blocks never made it to disk
because the page was constantly being updated.  And all writes failed
with a guard tag error.

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: Wrong DIF guard tag on ext2 write
  2010-06-01 13:03         ` Chris Mason
@ 2010-06-01 13:50           ` Christof Schmitt
       [not found]           ` <20100601135059.GA21008@schmichrtp.mainz.de.ibm.com>
  1 sibling, 0 replies; 79+ messages in thread
From: Christof Schmitt @ 2010-06-01 13:50 UTC (permalink / raw)
  To: Chris Mason, Boaz Harrosh, James Bottomley, Martin K. Petersen,
	linux-scsi

On Tue, Jun 01, 2010 at 09:03:25AM -0400, Chris Mason wrote:
> On Tue, Jun 01, 2010 at 12:30:42PM +0200, Christof Schmitt wrote:
> > [...]
> > 
> > I also tested with btrfs and invalid guard tags in writes have been
> > encountered as well (again in 2.6.34). The only difference is that no
> > error was reported to userspace, although this might be a
> > configuration issue.
> 
> This would be a btrfs bug.  We have strict checks in place that are
> supposed to prevent buffers changing while in flight.  What was the
> workload that triggered this problem?

I am running an internal test tool that creates files with a known
pattern until the disk is full, reads the data to verify if the
pattern is still intact, removes the files and starts over.

# uname -r
2.6.34

# mkfs.btrfs /dev/sda

WARNING! - Btrfs Btrfs v0.19 IS EXPERIMENTAL
WARNING! - see http://btrfs.wiki.kernel.org before using

fs created label (null) on /dev/sda
        nodesize 4096 leafsize 4096 sectorsize 4096 size 8.00GB
Btrfs Btrfs v0.19
    
# mount /dev/sda /mnt/esslv0/

The kernel log shows:

Jun  1 11:11:53 R78_4E19_ESAME26_40_1 kernel: device fsid d260dc729d784bf0-9d097091e01ec932 devid 1 transid 7 /dev/sda
Jun  1 11:23:26 R78_4E19_ESAME26_40_1 kernel: no space left, need 4096, 0 delalloc bytes, 7701659648 bytes_used, 0 bytes_reserved, 0 bytes_pinned, 0 bytes_readonly, 0 may use 7701659648 total 

I guess this is to be expected, since the test tool keeps writing
until there is no space left.

Jun  1 11:25:10 R78_4E19_ESAME26_40_1 kernel: sd tag mismatch, lba 0x12480, dif tag c8c0, computed tag dbe0
Jun  1 11:25:10 R78_4E19_ESAME26_40_1 kernel: sd tag mismatch, lba 0x12480, dif tag 51d, computed tag 5861
Jun  1 11:25:10 R78_4E19_ESAME26_40_1 kernel: sd tag mismatch, lba 0x12480, dif tag 3693, computed tag c52e
Jun  1 11:25:10 R78_4E19_ESAME26_40_1 kernel: sd tag mismatch, lba 0x12480, dif tag ff8a, computed tag 4ac3
Jun  1 11:25:10 R78_4E19_ESAME26_40_1 kernel: sd tag mismatch, lba 0x12480, dif tag b337, computed tag 3840
Jun  1 11:25:10 R78_4E19_ESAME26_40_1 kernel: sd tag mismatch, lba 0x12480, dif tag fbed, computed tag 746a
Jun  1 11:25:10 R78_4E19_ESAME26_40_1 kernel: sd tag mismatch, lba 0x12480, dif tag 2bfc, computed tag 263e
Jun  1 11:25:10 R78_4E19_ESAME26_40_1 kernel: sd tag mismatch, lba 0x12480, dif tag c0a4, computed tag 3942

This is the output from my debug patch, where I see the mismatch
between the data and the provided guard tags in the block integrity
data.

Jun  1 11:25:10 R78_4E19_ESAME26_40_1 kernel: sd 0:0:13:1077952528: [sda] Host Data Integrity Failure
Jun  1 11:25:10 R78_4E19_ESAME26_40_1 kernel: sd 0:0:13:1077952528: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jun  1 11:25:10 R78_4E19_ESAME26_40_1 kernel: sd 0:0:13:1077952528: [sda] Sense Key : Illegal Request [current] [descriptor]
Jun  1 11:25:10 R78_4E19_ESAME26_40_1 kernel: sd 0:0:13:1077952528: [sda] Add. Sense: Logical block guard check failed
Jun  1 11:25:10 R78_4E19_ESAME26_40_1 kernel: sd 0:0:13:1077952528: [sda] CDB: Write(10): 2a 20 00 01 24 80 00 00 08 00
Jun  1 11:25:10 R78_4E19_ESAME26_40_1 kernel: end_request: I/O error, dev sda, sector 74880

As expected, the controller detects the checksum mismatch in the same
request, and it is reported back as an error for the request. But the
running application is not affected.

I can start this again to get more data, but for now, this is what I
see.

--
Christof Schmitt


* Re: Wrong DIF guard tag on ext2 write
       [not found]           ` <20100601135059.GA21008@schmichrtp.mainz.de.ibm.com>
@ 2010-06-01 13:58             ` Chris Mason
  2010-06-08  7:18               ` Christof Schmitt
  2010-06-01 14:26             ` Nick Piggin
  1 sibling, 1 reply; 79+ messages in thread
From: Chris Mason @ 2010-06-01 13:58 UTC (permalink / raw)
  To: Christof Schmitt
  Cc: Boaz Harrosh, James Bottomley, Martin K. Petersen, linux-scsi,
	linux-kernel, linux-fsdevel

On Tue, Jun 01, 2010 at 03:50:59PM +0200, Christof Schmitt wrote:
> On Tue, Jun 01, 2010 at 09:03:25AM -0400, Chris Mason wrote:
> > On Tue, Jun 01, 2010 at 12:30:42PM +0200, Christof Schmitt wrote:
> > > [...]
> > > 
> > > I also tested with btrfs and invalid guard tags in writes have been
> > > encountered as well (again in 2.6.34). The only difference is that no
> > > error was reported to userspace, although this might be a
> > > configuration issue.
> > 
> > This would be a btrfs bug.  We have strict checks in place that are
> > supposed to prevent buffers changing while in flight.  What was the
> > workload that triggered this problem?
> 
> I am running an internal test tool that creates files with a known
> pattern until the disk is full, reads the data to verify if the
> pattern is still intact, removes the files and starts over.

Ok, is the lba in the output the sector offset?  We can map that to a
btrfs block and figure out what it was.

Btrfs never complains about the IO error?  We really should explode.

-chris


* Re: Wrong DIF guard tag on ext2 write
       [not found]           ` <20100601135059.GA21008@schmichrtp.mainz.de.ibm.com>
  2010-06-01 13:58             ` Chris Mason
@ 2010-06-01 14:26             ` Nick Piggin
  1 sibling, 0 replies; 79+ messages in thread
From: Nick Piggin @ 2010-06-01 14:26 UTC (permalink / raw)
  To: Christof Schmitt
  Cc: Chris Mason, Boaz Harrosh, James Bottomley, Martin K. Petersen,
	linux-scsi, linux-kernel, linux-fsdevel

On Tue, Jun 01, 2010 at 03:50:59PM +0200, Christof Schmitt wrote:
> On Tue, Jun 01, 2010 at 09:03:25AM -0400, Chris Mason wrote:
> > On Tue, Jun 01, 2010 at 12:30:42PM +0200, Christof Schmitt wrote:
> > > [...]
> > > 
> > > I also tested with btrfs and invalid guard tags in writes have been
> > > encountered as well (again in 2.6.34). The only difference is that no
> > > error was reported to userspace, although this might be a
> > > configuration issue.
> > 
> > This would be a btrfs bug.  We have strict checks in place that are
> > supposed to prevent buffers changing while in flight.  What was the
> > workload that triggered this problem?
> 
> I am running an internal test tool that creates files with a known
> pattern until the disk is full, reads the data to verify if the
> pattern is still intact, removes the files and starts over.

How are the checks done? The lock_page and wait_on_page_writeback from
prepare_pages?

Looks OK but AFAIKS you would need to unmap_mapping_range() the range
after taking the locks. You are also screwed if the page is in the
process of being modified by a get_user_pages() caller. You'd have to
add VM_IO to vm_flags to prevent get_user_pages.
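
A rough sketch of the unmap step (illustrative helper; error handling
omitted):

#include <linux/mm.h>
#include <linux/pagemap.h>

/* Caller holds the page lock and has already waited for writeback. */
static void shoot_down_user_mappings(struct address_space *mapping,
				     struct page *page)
{
	unmap_mapping_range(mapping,
			    (loff_t)page->index << PAGE_CACHE_SHIFT,
			    PAGE_CACHE_SIZE, 1);
	/*
	 * mmap writers now fault and block on the page lock instead of
	 * modifying the page behind the in-flight request.
	 */
}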

Direct-IO is always going to have the same problems, so it's not like
block-based solutions can just pretend it will easily go away.



* Re: Wrong DIF guard tag on ext2 write
  2010-06-01 13:50               ` Martin K. Petersen
@ 2010-06-01 14:28                 ` Nick Piggin
  2010-06-01 14:32                 ` James Bottomley
  1 sibling, 0 replies; 79+ messages in thread
From: Nick Piggin @ 2010-06-01 14:28 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: James Bottomley, Chris Mason, Christof Schmitt, Boaz Harrosh,
	linux-scsi, linux-kernel, linux-fsdevel

On Tue, Jun 01, 2010 at 09:50:29AM -0400, Martin K. Petersen wrote:
> >>>>> "James" == James Bottomley <James.Bottomley@suse.de> writes:
> 
> James> Would it be too much work in the fs to mark the page dirty before
> James> you begin altering it (and again after you finish, just in case
> James> some cleaner noticed and initiated a write)?  Or some other flag
> James> that indicates page under modification?  All the process
> James> controlling the writeout (which is pretty high up in the stack)
> James> needs to know is if we triggered the check error by altering the
> James> page while it was in flight.
> 
> James> I agree that a block based retry would close all the holes ... it
> James> just doesn't look elegant to me that the fs will already be
> James> repeating the I/O if it changed the page and so will block.
> 
> I experimented with this approach a while back.  However, I quickly got
> into a situation where frequently updated blocks never made it to disk
> because the page was constantly being updated.  And all writes failed
> with a guard tag error.

What if you bounce in the case of a first guard error?



* Re: Wrong DIF guard tag on ext2 write
  2010-06-01 13:50               ` Martin K. Petersen
  2010-06-01 14:28                 ` Nick Piggin
@ 2010-06-01 14:32                 ` James Bottomley
  2010-06-01 14:54                   ` Martin K. Petersen
  1 sibling, 1 reply; 79+ messages in thread
From: James Bottomley @ 2010-06-01 14:32 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Chris Mason, Christof Schmitt, Boaz Harrosh, linux-scsi,
	linux-kernel, linux-fsdevel

On Tue, 2010-06-01 at 09:50 -0400, Martin K. Petersen wrote:
> >>>>> "James" == James Bottomley <James.Bottomley@suse.de> writes:
> 
> James> Would it be too much work in the fs to mark the page dirty before
> James> you begin altering it (and again after you finish, just in case
> James> some cleaner noticed and initiated a write)?  Or some other flag
> James> that indicates page under modification?  All the process
> James> controlling the writeout (which is pretty high up in the stack)
> James> needs to know is if we triggered the check error by altering the
> James> page while it was in flight.
> 
> James> I agree that a block based retry would close all the holes ... it
> James> just doesn't look elegant to me that the fs will already be
> James> repeating the I/O if it changed the page and so will block.
> 
> I experimented with this approach a while back.  However, I quickly got
> into a situation where frequently updated blocks never made it to disk
> because the page was constantly being updated.  And all writes failed
> with a guard tag error.

But that's unfixable with a retry based system as well if the page is
changing so fast that the guard is always wrong by the time we get to
the array.  The only way to fix this is either to copy or freeze the
page.

James




^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Wrong DIF guard tag on ext2 write
  2010-06-01 14:32                 ` James Bottomley
@ 2010-06-01 14:54                   ` Martin K. Petersen
  0 siblings, 0 replies; 79+ messages in thread
From: Martin K. Petersen @ 2010-06-01 14:54 UTC (permalink / raw)
  To: James Bottomley
  Cc: Martin K. Petersen, Chris Mason, Christof Schmitt, Boaz Harrosh,
	linux-scsi, linux-kernel, linux-fsdevel

>>>>> "James" == James Bottomley <James.Bottomley@suse.de> writes:

>> I experimented with this approach a while back.  However, I quickly
>> got into a situation where frequently updated blocks never made it to
>> disk because the page was constantly being updated.  And all writes
>> failed with a guard tag error.

James> But that's unfixable with a retry based system as well if the
James> page is changing so fast that the guard is always wrong by the
James> time we get to the array.  The only way to fix this is either to
James> copy or freeze the page.

Exactly,  and that's why I'm in favor of the filesystems implementing
whatever method they see fit for ensuring that pages don't change in
flight.  Whether that be locking, unmapping, or copying the page.

If there's a performance hit we can have a flag that indicates whether
this block device requires pages to be stable or not.  I believe extN
really depends on modifying pages for performance reasons.  However,
both XFS and btrfs seem to do just fine without it.
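
A minimal sketch of what such a flag could look like; the flag name and
bit number are invented for illustration, and only blk_get_integrity()
and the queue flag helpers are existing interfaces:

#include <linux/blkdev.h>

#define QUEUE_FLAG_STABLE_WRITES        17      /* hypothetical bit */

static inline int blk_queue_stable_writes(struct request_queue *q)
{
        return test_bit(QUEUE_FLAG_STABLE_WRITES, &q->queue_flags);
}

/* an integrity-capable driver would set it when registering the disk */
static void example_register(struct gendisk *disk)
{
        if (blk_get_integrity(disk))
                queue_flag_set_unlocked(QUEUE_FLAG_STABLE_WRITES,
                                        disk->queue);
}

Writeback could then consult the flag and only pay the wait/copy cost
on devices that actually need stable pages.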

Over time we'll have checksums coming down from userspace with various
I/O submission interfaces, internally generated checksums for filesystem
metadata, etc.  I really don't think a one-size-fits-all retry heuristic
is going to cut it.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Wrong DIF guard tag on ext2 write
  2010-06-01 13:49               ` Chris Mason
@ 2010-06-01 16:29                 ` Matthew Wilcox
  2010-06-01 16:47                   ` Chris Mason
  0 siblings, 1 reply; 79+ messages in thread
From: Matthew Wilcox @ 2010-06-01 16:29 UTC (permalink / raw)
  To: Chris Mason, James Bottomley, Christof Schmitt, Boaz Harrosh,
	Martin K. Petersen

On Tue, Jun 01, 2010 at 09:49:51AM -0400, Chris Mason wrote:
> > I agree that a block based retry would close all the holes ... it just
> > doesn't look elegant to me that the fs will already be repeating the I/O
> > if it changed the page and so will block.
> 
> We might not ever repeat the IO.  We might change the page, write it,
> change it again, truncate the file and toss the page completely.

Why does it matter that it was never written in that case?

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Wrong DIF guard tag on ext2 write
  2010-06-01 16:29                 ` Matthew Wilcox
@ 2010-06-01 16:47                   ` Chris Mason
  2010-06-01 16:54                     ` James Bottomley
  0 siblings, 1 reply; 79+ messages in thread
From: Chris Mason @ 2010-06-01 16:47 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: James Bottomley, Christof Schmitt, Boaz Harrosh,
	Martin K. Petersen, linux-scsi, linux-kernel, linux-fsdevel

On Tue, Jun 01, 2010 at 10:29:30AM -0600, Matthew Wilcox wrote:
> On Tue, Jun 01, 2010 at 09:49:51AM -0400, Chris Mason wrote:
> > > I agree that a block based retry would close all the holes ... it just
> > > doesn't look elegant to me that the fs will already be repeating the I/O
> > > if it changed the page and so will block.
> > 
> > We might not ever repeat the IO.  We might change the page, write it,
> > change it again, truncate the file and toss the page completely.
> 
> Why does it matter that it was never written in that case?

It matters if the storage layer is going to wait around for the block to
be written again with a correct crc.

Unless there is trim + DIF, but that's a lot of plates to spin just for
a basic implementation.

-chris


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Wrong DIF guard tag on ext2 write
  2010-06-01 16:47                   ` Chris Mason
@ 2010-06-01 16:54                     ` James Bottomley
  2010-06-01 18:09                       ` Chris Mason
  0 siblings, 1 reply; 79+ messages in thread
From: James Bottomley @ 2010-06-01 16:54 UTC (permalink / raw)
  To: Chris Mason
  Cc: Matthew Wilcox, Christof Schmitt, Boaz Harrosh,
	Martin K. Petersen, linux-scsi, linux-kernel, linux-fsdevel

On Tue, 2010-06-01 at 12:47 -0400, Chris Mason wrote:
> On Tue, Jun 01, 2010 at 10:29:30AM -0600, Matthew Wilcox wrote:
> > On Tue, Jun 01, 2010 at 09:49:51AM -0400, Chris Mason wrote:
> > > > I agree that a block based retry would close all the holes ... it just
> > > > doesn't look elegant to me that the fs will already be repeating the I/O
> > > > if it changed the page and so will block.
> > > 
> > > We might not ever repeat the IO.  We might change the page, write it,
> > > change it again, truncate the file and toss the page completely.
> > 
> > Why does it matter that it was never written in that case?
> 
> It matters if the storage layer is going to wait around for the block to
> be written again with a correct crc.

Actually, I wasn't advocating that.  I think block should return a guard
mismatch error.  I think somewhere in filesystem writeout is the place
to decide whether the error was self induced or systematic.  For self
induced errors (as long as we can detect them) I think we can just
forget about it ... if the changed page is important, the I/O request
gets repeated (modulo the problem of too great a frequency of changes
leading to us never successfully writing it) or it gets dropped because
the file was truncated or the data deleted for some other reason.
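
A loose sketch of that decision point, assuming the check sits in a
write-completion handler (the hook and names are mine; PageDirty() does
the real work):

#include <linux/bio.h>
#include <linux/pagemap.h>

/*
 * On a guard-tag failure, treat the error as self-induced iff the page
 * has been redirtied, since a redirtied page means another writeout of
 * the current contents is already owed.
 */
static void example_end_write(struct bio *bio, int err)
{
        struct page *page = bio_iovec_idx(bio, 0)->bv_page;

        if (err && PageDirty(page))
                err = 0;                /* self-induced: rewrite queued */
        if (err)
                SetPageError(page);     /* genuinely bad: surface it */
        end_page_writeback(page);
        bio_put(bio);
}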

James



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Wrong DIF guard tag on ext2 write
  2010-06-01 16:54                     ` James Bottomley
@ 2010-06-01 18:09                       ` Chris Mason
  2010-06-01 18:46                         ` Nick Piggin
                                           ` (2 more replies)
  0 siblings, 3 replies; 79+ messages in thread
From: Chris Mason @ 2010-06-01 18:09 UTC (permalink / raw)
  To: James Bottomley
  Cc: Matthew Wilcox, Christof Schmitt, Boaz Harrosh,
	Martin K. Petersen, linux-scsi, linux-kernel, linux-fsdevel

On Tue, Jun 01, 2010 at 04:54:53PM +0000, James Bottomley wrote:
> On Tue, 2010-06-01 at 12:47 -0400, Chris Mason wrote:
> > On Tue, Jun 01, 2010 at 10:29:30AM -0600, Matthew Wilcox wrote:
> > > On Tue, Jun 01, 2010 at 09:49:51AM -0400, Chris Mason wrote:
> > > > > I agree that a block based retry would close all the holes ... it just
> > > > > doesn't look elegant to me that the fs will already be repeating the I/O
> > > > > if it changed the page and so will block.
> > > > 
> > > > We might not ever repeat the IO.  We might change the page, write it,
> > > > change it again, truncate the file and toss the page completely.
> > > 
> > > Why does it matter that it was never written in that case?
> > 
> > It matters if the storage layer is going to wait around for the block to
> > be written again with a correct crc.
> 
> Actually, I wasn't advocating that.  I think block should return a guard
> mismatch error.  I think somewhere in filesystem writeout is the place
> to decide whether the error was self induced or systematic.

In that case the io error goes to the async page writeback bio-endio
handlers.  We don't have a reference on the inode and no ability to
reliably restart the IO, but we can set a bit on the address space
indicating that somewhere, sometime in the past we had an IO error.
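
Roughly the existing AS_EIO bookkeeping, as a sketch; mapping_set_error()
is a real helper, the surrounding handler is assumed:

#include <linux/pagemap.h>

/*
 * With no inode reference in the endio path, record the failure on the
 * address_space so a later fsync()/close() can report it.
 * mapping_set_error() sets AS_EIO (or AS_ENOSPC) for exactly this.
 */
static void example_note_write_error(struct page *page, int err)
{
        if (err) {
                SetPageError(page);
                mapping_set_error(page->mapping, err);
        }
        end_page_writeback(page);
}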

> For self
> induced errors (as long as we can detect them) I think we can just
> forget about it ... if the changed page is important, the I/O request
> gets repeated (modulo the problem of too great a frequency of changes
> leading to us never successfully writing it) or it gets dropped because
> the file was truncated or the data deleted for some other reason.

Sorry, how can we tell the errors that are self induced from the evil
bit flipping cable induced errors?

-chris


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Wrong DIF guard tag on ext2 write
  2010-06-01 18:09                       ` Chris Mason
@ 2010-06-01 18:46                         ` Nick Piggin
       [not found]                         ` <20100601184649.GE9453@laptop>
  2010-06-01 21:07                         ` James Bottomley
  2 siblings, 0 replies; 79+ messages in thread
From: Nick Piggin @ 2010-06-01 18:46 UTC (permalink / raw)
  To: Chris Mason, James Bottomley, Matthew Wilcox, Christof Schmitt,
	Boaz Harrosh

On Tue, Jun 01, 2010 at 02:09:05PM -0400, Chris Mason wrote:
> On Tue, Jun 01, 2010 at 04:54:53PM +0000, James Bottomley wrote:
> > On Tue, 2010-06-01 at 12:47 -0400, Chris Mason wrote:
> > > On Tue, Jun 01, 2010 at 10:29:30AM -0600, Matthew Wilcox wrote:
> > > > On Tue, Jun 01, 2010 at 09:49:51AM -0400, Chris Mason wrote:
> > > > > > I agree that a block based retry would close all the holes ... it just
> > > > > > doesn't look elegant to me that the fs will already be repeating the I/O
> > > > > > if it changed the page and so will block.
> > > > > 
> > > > > We might not ever repeat the IO.  We might change the page, write it,
> > > > > change it again, truncate the file and toss the page completely.
> > > > 
> > > > Why does it matter that it was never written in that case?
> > > 
> > > It matters if the storage layer is going to wait around for the block to
> > > be written again with a correct crc.
> > 
> > Actually, I wasn't advocating that.  I think block should return a guard
> > mismatch error.  I think somewhere in filesystem writeout is the place
> > to decide whether the error was self induced or systematic.
> 
> In that case the io error goes to the async page writeback bio-endio
> handlers.  We don't have a reference on the inode and no ability to
> reliably restart the IO, but we can set a bit on the address space
> indicating that somewhere, sometime in the past we had an IO error.
> 
> > For self
> > induced errors (as long as we can detect them) I think we can just
> > forget about it ... if the changed page is important, the I/O request
> > gets repeated (modulo the problem of too great a frequency of changes
> > leading to us never successfully writing it) or it gets dropped because
> > the file was truncated or the data deleted for some other reason.
> 
> Sorry, how can we tell the errors that are self induced from the evil
> bit flipping cable induced errors?

Block layer should retry it with bounce pages. That would be a lot nicer
than forcing all upper layers to avoid the problem.
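
A rough sketch of that retry path; the names are made up, the error
unwinding is elided, and bio_integrity_generate() is currently static in
fs/bio-integrity.c, so this assumes it were exported:

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/highmem.h>
#include <linux/string.h>

/*
 * Copy the payload into private bounce pages the fs cannot touch,
 * regenerate the integrity metadata over the stable copy, and resubmit.
 */
static int retry_with_bounce(struct bio *bio)
{
        struct bio *clone = bio_clone(bio, GFP_NOIO);
        struct bio_vec *bv;
        int i;

        if (!clone)
                return -ENOMEM;

        bio_for_each_segment(bv, clone, i) {
                struct page *bounce = alloc_page(GFP_NOIO);
                void *src;

                if (!bounce)
                        return -ENOMEM; /* sketch only: leaks the clone */
                src = kmap_atomic(bv->bv_page, KM_USER0);
                memcpy(page_address(bounce) + bv->bv_offset,
                       src + bv->bv_offset, bv->bv_len);
                kunmap_atomic(src, KM_USER0);
                bv->bv_page = bounce;   /* stable copy from here on */
        }
        bio_integrity_generate(clone);  /* re-checksum the frozen data */
        generic_make_request(clone);
        return 0;
}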


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Wrong DIF guard tag on ext2 write
       [not found]                         ` <20100601184649.GE9453@laptop>
@ 2010-06-01 19:35                           ` Chris Mason
  2010-06-02  3:20                             ` Nick Piggin
  0 siblings, 1 reply; 79+ messages in thread
From: Chris Mason @ 2010-06-01 19:35 UTC (permalink / raw)
  To: Nick Piggin
  Cc: James Bottomley, Matthew Wilcox, Christof Schmitt, Boaz Harrosh,
	Martin K. Petersen, linux-scsi, linux-kernel, linux-fsdevel

On Wed, Jun 02, 2010 at 04:46:49AM +1000, Nick Piggin wrote:
> On Tue, Jun 01, 2010 at 02:09:05PM -0400, Chris Mason wrote:
> > On Tue, Jun 01, 2010 at 04:54:53PM +0000, James Bottomley wrote:
> > 
> > > For self
> > > induced errors (as long as we can detect them) I think we can just
> > > forget about it ... if the changed page is important, the I/O request
> > > gets repeated (modulo the problem of too great a frequency of changes
> > > leading to us never successfully writing it) or it gets dropped because
> > > the file was truncated or the data deleted for some other reason.
> > 
> > Sorry, how can we tell the errors that are self induced from the evil
> > bit flipping cable induced errors?
> 
> Block layer should retry it with bounce pages. That would be a lot nicer
> than forcing all upper layers to avoid the problem.
> 

So the idea is that we have sent down a buffer and it changed in flight.
The block layer is going to say: oh look, the crcs don't match, I'll
bounce it, recrc it and send again.  But, there are at least 3 reasons the crc
will change:

1) filesystem changed it
2) corruption on the wire or in the raid controller
3) the page was corrupted while the IO layer was doing the IO.

1 and 2 are easy, we bounce, retry and everyone continues on with
their lives.  With #3, we'll recrc and send the IO down again thinking
the data is correct when really we're writing garbage.

How can we tell these three cases apart?

-chris


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Wrong DIF guard tag on ext2 write
  2010-06-01 18:09                       ` Chris Mason
  2010-06-01 18:46                         ` Nick Piggin
       [not found]                         ` <20100601184649.GE9453@laptop>
@ 2010-06-01 21:07                         ` James Bottomley
  2010-06-01 22:49                           ` Chris Mason
  2 siblings, 1 reply; 79+ messages in thread
From: James Bottomley @ 2010-06-01 21:07 UTC (permalink / raw)
  To: Chris Mason
  Cc: Matthew Wilcox, Christof Schmitt, Boaz Harrosh,
	Martin K. Petersen, linux-scsi, linux-kernel, linux-fsdevel

On Tue, 2010-06-01 at 14:09 -0400, Chris Mason wrote:
> On Tue, Jun 01, 2010 at 04:54:53PM +0000, James Bottomley wrote:
> > On Tue, 2010-06-01 at 12:47 -0400, Chris Mason wrote:
> > > On Tue, Jun 01, 2010 at 10:29:30AM -0600, Matthew Wilcox wrote:
> > > > On Tue, Jun 01, 2010 at 09:49:51AM -0400, Chris Mason wrote:
> > > > > > I agree that a block based retry would close all the holes ... it just
> > > > > > doesn't look elegant to me that the fs will already be repeating the I/O
> > > > > > if it changed the page and so will block.
> > > > > 
> > > > > We might not ever repeat the IO.  We might change the page, write it,
> > > > > change it again, truncate the file and toss the page completely.
> > > > 
> > > > Why does it matter that it was never written in that case?
> > > 
> > > It matters if the storage layer is going to wait around for the block to
> > > be written again with a correct crc.
> > 
> > Actually, I wasn't advocating that.  I think block should return a guard
> > mismatch error.  I think somewhere in filesystem writeout is the place
> > to decide whether the error was self induced or systematic.
> 
> In that case the io error goes to the async page writeback bio-endio
> handlers.  We don't have a reference on the inode and no ability to
> reliably restart the IO, but we can set a bit on the address space
> indicating that somewhere, sometime in the past we had an IO error.
> 
> > For self
> > induced errors (as long as we can detect them) I think we can just
> > forget about it ... if the changed page is important, the I/O request
> > gets repeated (modulo the problem of too great a frequency of changes
> > leading to us never successfully writing it) or it gets dropped because
> > the file was truncated or the data deleted for some other reason.
> 
> Sorry, how can we tell the errors that are self induced from the evil
> bit flipping cable induced errors?

We have all the information ... the fs will eventually mark the page
dirty when it finishes the alterations; we just have to find a way to
express that.

If you're thinking of the double fault scenario where the page
spontaneously corrupts *and* the filesystem alters it, then the only way
of detecting that is to freeze the page as it undergoes I/O ... which
involves quite a bit of filesystem surgery, doesn't it?

James

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Wrong DIF guard tag on ext2 write
  2010-06-01 21:07                         ` James Bottomley
@ 2010-06-01 22:49                           ` Chris Mason
  0 siblings, 0 replies; 79+ messages in thread
From: Chris Mason @ 2010-06-01 22:49 UTC (permalink / raw)
  To: James Bottomley
  Cc: Matthew Wilcox, Christof Schmitt, Boaz Harrosh,
	Martin K. Petersen, linux-scsi, linux-kernel, linux-fsdevel

On Tue, Jun 01, 2010 at 04:07:43PM -0500, James Bottomley wrote:
> On Tue, 2010-06-01 at 14:09 -0400, Chris Mason wrote:
> > On Tue, Jun 01, 2010 at 04:54:53PM +0000, James Bottomley wrote:
> > > On Tue, 2010-06-01 at 12:47 -0400, Chris Mason wrote:
> > > > On Tue, Jun 01, 2010 at 10:29:30AM -0600, Matthew Wilcox wrote:
> > > > > On Tue, Jun 01, 2010 at 09:49:51AM -0400, Chris Mason wrote:
> > > > > > > I agree that a block based retry would close all the holes ... it just
> > > > > > > doesn't look elegant to me that the fs will already be repeating the I/O
> > > > > > > if it changed the page and so will block.
> > > > > > 
> > > > > > We might not ever repeat the IO.  We might change the page, write it,
> > > > > > change it again, truncate the file and toss the page completely.
> > > > > 
> > > > > Why does it matter that it was never written in that case?
> > > > 
> > > > It matters if the storage layer is going to wait around for the block to
> > > > be written again with a correct crc.
> > > 
> > > Actually, I wasn't advocating that.  I think block should return a guard
> > > mismatch error.  I think somewhere in filesystem writeout is the place
> > > to decide whether the error was self induced or systematic.
> > 
> > In that case the io error goes to the async page writeback bio-endio
> > handlers.  We don't have a reference on the inode and no ability to
> > reliably restart the IO, but we can set a bit on the address space
> > indicating that somewhere, sometime in the past we had an IO error.
> > 
> > > For self
> > > induced errors (as long as we can detect them) I think we can just
> > > forget about it ... if the changed page is important, the I/O request
> > > gets repeated (modulo the problem of too great a frequency of changes
> > > leading to us never successfully writing it) or it gets dropped because
> > > the file was truncated or the data deleted for some other reason.
> > 
> > Sorry, how can we tell the errors that are self induced from the evil
> > bit flipping cable induced errors?
> 
> We have all the information ... the fs will eventually mark the page
> dirty when it finishes the alterations; we just have to find a way to
> express that.

Eventually?

> 
> If you're thinking of the double fault scenario where the page
> spontaneously corrupts *and* the filesystem alters it, then the only way
> of detecting that is to freeze the page as it undergoes I/O ... which
> involves quite a bit of filesystem surgery, doesn't it?

No, I'm thinking of the case where the page corrupts and the FS doesn't
alter it.  I still don't know how to determine if the FS changed the
page and will write it again, if the FS changed the page and won't write
it again, or if the page corrupted all on its own.  The page dirty bit
isn't sufficient for this.

-chris

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Wrong DIF guard tag on ext2 write
  2010-06-01 19:35                           ` Chris Mason
@ 2010-06-02  3:20                             ` Nick Piggin
  2010-06-02 13:17                               ` Martin K. Petersen
  0 siblings, 1 reply; 79+ messages in thread
From: Nick Piggin @ 2010-06-02  3:20 UTC (permalink / raw)
  To: Chris Mason, James Bottomley, Matthew Wilcox, Christof Schmitt,
	Boaz Harrosh

On Tue, Jun 01, 2010 at 03:35:28PM -0400, Chris Mason wrote:
> On Wed, Jun 02, 2010 at 04:46:49AM +1000, Nick Piggin wrote:
> > On Tue, Jun 01, 2010 at 02:09:05PM -0400, Chris Mason wrote:
> > > On Tue, Jun 01, 2010 at 04:54:53PM +0000, James Bottomley wrote:
> > > 
> > > > For self
> > > > induced errors (as long as we can detect them) I think we can just
> > > > forget about it ... if the changed page is important, the I/O request
> > > > gets repeated (modulo the problem of too great a frequency of changes
> > > > leading to us never successfully writing it) or it gets dropped because
> > > > the file was truncated or the data deleted for some other reason.
> > > 
> > > Sorry, how can we tell the errors that are self induced from the evil
> > > bit flipping cable induced errors?
> > 
> > Block layer should retry it with bounce pages. That would be a lot nicer
> > than forcing all upper layers to avoid the problem.
> > 
> 
> So the idea is that we have sent down a buffer and it changed in flight.
> The block layer is going to say: oh look, the crcs don't match, I'll
> bounce it, recrc it and send again.  But, there are at least 3 reasons the crc
> will change:
> 
> 1) filesystem changed it
> 2) corruption on the wire or in the raid controller
> 3) the page was corrupted while the IO layer was doing the IO.
> 
> 1 and 2 are easy, we bounce, retry and everyone continues on with
> their lives.  With #3, we'll recrc and send the IO down again thinking
> the data is correct when really we're writing garbage.
> 
> How can we tell these three cases apart?

Do we really need to handle #3? It could have happened before the
checksum was calculated.


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Wrong DIF guard tag on ext2 write
  2010-06-02  3:20                             ` Nick Piggin
@ 2010-06-02 13:17                               ` Martin K. Petersen
  2010-06-02 13:41                                 ` Nick Piggin
  0 siblings, 1 reply; 79+ messages in thread
From: Martin K. Petersen @ 2010-06-02 13:17 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Chris Mason, James Bottomley, Matthew Wilcox, Christof Schmitt,
	Boaz Harrosh, Martin K. Petersen, linux-scsi, linux-kernel,
	linux-fsdevel

>>>>> "Nick" == Nick Piggin <npiggin@suse.de> writes:

>> 1) filesystem changed it
>> 2) corruption on the wire or in the raid controller
>> 3) the page was corrupted while the IO layer was doing the IO.
>> 
>> 1 and 2 are easy, we bounce, retry and everyone continues on with
>> their lives.  With #3, we'll recrc and send the IO down again
>> thinking the data is correct when really we're writing garbage.
>> 
>> How can we tell these three cases apart?

Nick> Do we really need to handle #3? It could have happened before the
Nick> checksum was calculated.

Reason #3 is one of the main reasons for having the checksum in the
first place.  The whole premise of the data integrity extensions is that
the checksum is calculated in close temporal proximity to the data
creation.  I.e. eventually in userland.

Filesystems will inevitably have to be integrity-aware for that to work.
And it will be their job to keep the data pages stable during DMA.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Wrong DIF guard tag on ext2 write
  2010-06-01 13:16     ` Martin K. Petersen
@ 2010-06-02 13:37       ` Christof Schmitt
  2010-06-02 23:20         ` Dave Chinner
  0 siblings, 1 reply; 79+ messages in thread
From: Christof Schmitt @ 2010-06-02 13:37 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: linux-scsi, linux-kernel, linux-fsdevel

On Tue, Jun 01, 2010 at 09:16:35AM -0400, Martin K. Petersen wrote:
> >>>>> "Christof" == Christof Schmitt <christof.schmitt@de.ibm.com> writes:
> 
> >> Yep, known bug.  Page writeback locking is messed up for buffer_head
> >> users.  The extNfs folks volunteered to look into this a while back
> >> but I don't think they have found the time yet.
> 
> Christof> Thanks for the info. This means that this bug appears with all
> Christof> filesystems?
> 
> XFS and btrfs should be fine.

XFS looks good in my test, thanks for the hint. I am going to use XFS
for anything related to DIF for now. It would be nice to have a
solution that works for all filesystems, but it looks like this will
take some time and work.

Christof

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Wrong DIF guard tag on ext2 write
  2010-06-02 13:17                               ` Martin K. Petersen
@ 2010-06-02 13:41                                 ` Nick Piggin
  2010-06-03 15:46                                   ` Chris Mason
  2010-06-04  1:30                                   ` Martin K. Petersen
  0 siblings, 2 replies; 79+ messages in thread
From: Nick Piggin @ 2010-06-02 13:41 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Chris Mason, James Bottomley, Matthew Wilcox, Christof Schmitt,
	Boaz Harrosh, linux-scsi, linux-kernel, linux-fsdevel

On Wed, Jun 02, 2010 at 09:17:56AM -0400, Martin K. Petersen wrote:
> >>>>> "Nick" == Nick Piggin <npiggin@suse.de> writes:
> 
> >> 1) filesystem changed it
> >> 2) corruption on the wire or in the raid controller
> >> 3) the page was corrupted while the IO layer was doing the IO.
> >> 
> >> 1 and 2 are easy, we bounce, retry and everyone continues on with
> >> their lives.  With #3, we'll recrc and send the IO down again
> >> thinking the data is correct when really we're writing garbage.
> >> 
> >> How can we tell these three cases apart?
> 
> Nick> Do we really need to handle #3? It could have happened before the
> Nick> checksum was calculated.
> 
> Reason #3 is one of the main reasons for having the checksum in the
> first place.  The whole premise of the data integrity extensions is that
> the checksum is calculated in close temporal proximity to the data
> creation.  I.e. eventually in userland.
> 
> Filesystems will inevitably have to be integrity-aware for that to work.
> And it will be their job to keep the data pages stable during DMA.

Let's just think hard about what windows can actually be closed versus
how much effort goes into closing them. I also prefer not to accept
half-solutions in the kernel because they don't want to implement real
solutions in hardware (it's pretty hard to checksum and protect all
kernel data structures by hand).

For "normal" writes into pagecache, the data can get corrupted anywhere
along the way: after it is generated in userspace, during the copy, while
it is dirty in cache, and while it is being written out.

Closing the "while it is dirty, while it is being written back" window
still leaves a pretty big window. Also, how do you handle mmap writes?
Write protect and checksum the destination page after every store? Or
leave some window between when the pagecache is dirtied and when it is
written back? So I don't know whether it's worth putting a lot of effort
into this case.

If you had an interface for userspace to insert checksums into direct IO
requests or pagecache ranges, then not only could you close the entire
gap between userspace data generation and writeback, you could also
handle mmap writes and anything else just fine: userspace knows about
the concurrency details, so it can add the right checksum (and
potentially fsync etc.) when it's ready.
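
Purely as a strawman, such an interface might look like the following;
everything here is invented to make the idea concrete, and no such
syscall exists:

#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

/* hypothetical: payload plus caller-computed per-sector guard tags */
struct dix_iovec {
        void     *data;         /* sector-aligned buffer */
        size_t    len;          /* multiple of the sector size */
        uint16_t *guards;       /* one guard tag per 512-byte sector */
};

/* hypothetical syscall: write with caller-supplied integrity metadata */
ssize_t pwritev_integrity(int fd, const struct dix_iovec *iov,
                          int iovcnt, off_t offset);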

And the bounce-retry method would be sufficient to handle IO
transmission errors for normal IOs without being intrusive.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Wrong DIF guard tag on ext2 write
  2010-06-02 13:37       ` Christof Schmitt
@ 2010-06-02 23:20         ` Dave Chinner
  2010-06-04  1:34           ` Martin K. Petersen
  0 siblings, 1 reply; 79+ messages in thread
From: Dave Chinner @ 2010-06-02 23:20 UTC (permalink / raw)
  To: Christof Schmitt
  Cc: Martin K. Petersen, linux-scsi, linux-kernel, linux-fsdevel

On Wed, Jun 02, 2010 at 03:37:48PM +0200, Christof Schmitt wrote:
> On Tue, Jun 01, 2010 at 09:16:35AM -0400, Martin K. Petersen wrote:
> > >>>>> "Christof" == Christof Schmitt <christof.schmitt@de.ibm.com> writes:
> > 
> > >> Yep, known bug.  Page writeback locking is messed up for buffer_head
> > >> users.  The extNfs folks volunteered to look into this a while back
> > >> but I don't think they have found the time yet.
> > 
> > Christof> Thanks for the info. This means that this bug appears with all
> > Christof> filesystems?
> > 
> > XFS and btrfs should be fine.
> 
> XFS looks good in my test, thanks for the hint. I am going to use XFS
> for anything related to DIF for now. It would be nice to have a
> solution that works for all filesystems, but it looks like this will
> take some time and work.

If you are running DIF hardware, then XFS is only OK for direct IO.
XFS will still get torn writes if you are overwriting buffered data
(either by write() or mmap()) because there are no interlocks to
prevent cached pages under writeback from being modified while DMA
is being performed.....
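
The obvious shape of an interlock for the mmap side, as a sketch;
->page_mkwrite() is the real hook, but whether filesystems can afford
to block here is exactly the open question:

#include <linux/mm.h>
#include <linux/pagemap.h>

/*
 * Make mmap stores wait for writeback, so a page under DMA cannot be
 * redirtied (and torn) mid-flight.
 */
static int example_page_mkwrite(struct vm_area_struct *vma,
                                struct vm_fault *vmf)
{
        struct page *page = vmf->page;

        lock_page(page);
        wait_on_page_writeback(page);   /* stall the store during DMA */

        /* return with the page locked; the fault path unlocks it */
        return VM_FAULT_LOCKED;
}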

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Wrong DIF guard tag on ext2 write
  2010-06-01 13:27         ` James Bottomley
  2010-06-01 13:33           ` Chris Mason
@ 2010-06-03 11:20           ` Vladislav Bolkhovitin
  2010-06-03 12:07             ` Boaz Harrosh
  2010-07-23 17:59             ` Gennadiy Nerubayev
  1 sibling, 2 replies; 79+ messages in thread
From: Vladislav Bolkhovitin @ 2010-06-03 11:20 UTC (permalink / raw)
  To: James Bottomley
  Cc: Christof Schmitt, Boaz Harrosh, Martin K. Petersen, linux-scsi,
	linux-kernel, linux-fsdevel, Chris Mason, Gennadiy Nerubayev

James Bottomley, on 06/01/2010 05:27 PM wrote:
> On Tue, 2010-06-01 at 12:30 +0200, Christof Schmitt wrote:
>> What is the best strategy to continue with the invalid guard tags on
>> write requests? Should this be fixed in the filesystems?
> 
> For write requests, as long as the page dirty bit is still set, it's
> safe to drop the request, since it's already going to be repeated.  What
> we probably want is an error code we can return so that the layer that sees
> both the request and the page flags can make the call.
> 
>> Another idea would be to pass invalid guard tags on write requests
>> down to the hardware, expect an "invalid guard tag" error and report
>> it to the block layer where a new checksum is generated and the
>> request is issued again. Basically implement a retry through the whole
>> I/O stack. But this also sounds complicated.
> 
> No, no ... as long as the guard tag is wrong because the fs changed the
> page, the write request for the updated page will already be queued or
> in-flight, so there's no need to retry.

There's one interesting problem here, at least theoretically, with SCSI
or similar transports which allow a command queue depth >1 and are
allowed to internally reorder queued requests. I don't know the FS/block
layers sufficiently well to tell whether sending several requests for the
same page is really possible or not, but we can see a real-life problem,
which can be well explained if it is possible.

The problem could arise if the second (rewrite) request (SCSI command) for
the same page were queued to the corresponding device before the original
request finished. Since the device is allowed to freely reorder requests,
there's a probability that the original write request would hit the
permanent storage *AFTER* the retry request, hence the data changes it is
carrying would be lost, hence welcome data corruption.

For single parallel SCSI or SAS devices such a race may look practically
impossible, but for sophisticated clusters where many nodes pretend to
be a single SCSI device in a load-balancing configuration, it becomes
very real.

We can see this real-life problem in an active-active DRBD setup. In this
configuration 2 nodes act as a single SCST-powered SCSI device and they
both run DRBD to keep their backing storage in sync. The initiator uses
them as a single multipath device in an active-active round-robin
load-balancing configuration, i.e. it sends requests to both nodes in
parallel, then DRBD takes care of replicating the requests to the other
node.

The problem is that sometimes DRBD complains about concurrent local
writes, like:

kernel: drbd0: scsi_tgt0[12503] Concurrent local write detected! 
[DISCARD L] new: 144072784s +8192; pending: 144072784s +8192

This message means that DRBD detected that both nodes received 
overlapping writes on the same block(s) and DRBD can't figure out which 
one to store. This is possible only if the initiator sent the second 
write request before the first one completed.

The topic of this discussion could well explain the cause of that. But,
unfortunately, the people who reported it didn't note which OS they were
running on the initiator, so I can't say for sure it was Linux.

Vlad


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Wrong DIF guard tag on ext2 write
  2010-06-03 11:20           ` Vladislav Bolkhovitin
@ 2010-06-03 12:07             ` Boaz Harrosh
  2010-06-03 12:41               ` Vladislav Bolkhovitin
  2010-07-23 17:59             ` Gennadiy Nerubayev
  1 sibling, 1 reply; 79+ messages in thread
From: Boaz Harrosh @ 2010-06-03 12:07 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: James Bottomley, Christof Schmitt, Martin K. Petersen, linux-scsi,
	linux-kernel, linux-fsdevel, Chris Mason, Gennadiy Nerubayev

On 06/03/2010 02:20 PM, Vladislav Bolkhovitin wrote:
> 
> There's one interesting problem here, at least theoretically, with SCSI
> or similar transports which allow a command queue depth >1 and are
> allowed to internally reorder queued requests. I don't know the FS/block
> layers sufficiently well to tell whether sending several requests for the
> same page is really possible or not, but we can see a real-life problem,
> which can be well explained if it is possible.
> 
> The problem could arise if the second (rewrite) request (SCSI command) for
> the same page were queued to the corresponding device before the original
> request finished. Since the device is allowed to freely reorder requests,
> there's a probability that the original write request would hit the
> permanent storage *AFTER* the retry request, hence the data changes it is
> carrying would be lost, hence welcome data corruption.
> 

I might be totally wrong here, but I think NCQ can reorder sectors, not
writes. That is, if the sector is cached in device memory and a later
write comes in to modify the same sector, then the original should be
replaced; two values of the same sector should not be kept in the device
cache at the same time.

Failing to do so is a SCSI device problem.

Please note that page-to-sector is not necessarily constant. And the same page
might get written at a different sector next time. But FSs will have to
barrier in this case.
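
For reference, a tiny sketch of what "barrier" means at the block
interface of this era; the bio construction and the caller are assumed:

#include <linux/bio.h>
#include <linux/fs.h>

/*
 * A WRITE_BARRIER bio drains previously queued writes and prevents
 * reordering across it, so a page rewritten at a new sector cannot
 * overtake the write to its old location.
 */
static void submit_ordered_write(struct bio *bio)
{
        submit_bio(WRITE_BARRIER, bio);
}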

> For single parallel SCSI or SAS devices such a race may look practically
> impossible, but for sophisticated clusters where many nodes pretend to
> be a single SCSI device in a load-balancing configuration, it becomes
> very real.
> 
> We can see this real-life problem in an active-active DRBD setup. In this
> configuration 2 nodes act as a single SCST-powered SCSI device and they
> both run DRBD to keep their backing storage in sync. The initiator uses
> them as a single multipath device in an active-active round-robin
> load-balancing configuration, i.e. it sends requests to both nodes in
> parallel, then DRBD takes care of replicating the requests to the other
> node.
> 
> The problem is that sometimes DRBD complains about concurrent local
> writes, like:
> 
> kernel: drbd0: scsi_tgt0[12503] Concurrent local write detected! 
> [DISCARD L] new: 144072784s +8192; pending: 144072784s +8192
> 
> This message means that DRBD detected that both nodes received 
> overlapping writes on the same block(s) and DRBD can't figure out which 
> one to store. This is possible only if the initiator sent the second 
> write request before the first one completed.
> 

It is totally possible in today's code.

DRBD should store the original command_sn of the write and discard
the sector with the lower SN. It should appear as a single device
to the initiator.

> The topic of this discussion could well explain the cause of that. But,
> unfortunately, the people who reported it didn't note which OS they were
> running on the initiator, so I can't say for sure it was Linux.
> 
> Vlad
> 

Boaz

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Wrong DIF guard tag on ext2 write
  2010-06-03 12:07             ` Boaz Harrosh
@ 2010-06-03 12:41               ` Vladislav Bolkhovitin
  2010-06-03 12:46                 ` Vladislav Bolkhovitin
  2010-06-03 13:06                 ` Boaz Harrosh
  0 siblings, 2 replies; 79+ messages in thread
From: Vladislav Bolkhovitin @ 2010-06-03 12:41 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: James Bottomley, Christof Schmitt, Martin K. Petersen, linux-scsi,
	linux-kernel, linux-fsdevel, Chris Mason, Gennadiy Nerubayev

Boaz Harrosh, on 06/03/2010 04:07 PM wrote:
> On 06/03/2010 02:20 PM, Vladislav Bolkhovitin wrote:
>> There's one interesting problem here, at least theoretically, with SCSI
>> or similar transports which allow a command queue depth >1 and are
>> allowed to internally reorder queued requests. I don't know the FS/block
>> layers sufficiently well to tell whether sending several requests for the
>> same page is really possible or not, but we can see a real-life problem,
>> which can be well explained if it is possible.
>>
>> The problem could arise if the second (rewrite) request (SCSI command) for
>> the same page were queued to the corresponding device before the original
>> request finished. Since the device is allowed to freely reorder requests,
>> there's a probability that the original write request would hit the
>> permanent storage *AFTER* the retry request, hence the data changes it is
>> carrying would be lost, hence welcome data corruption.
>>
> 
> I might be totally wrong here, but I think NCQ can reorder sectors, not
> writes. That is, if the sector is cached in device memory and a later
> write comes in to modify the same sector, then the original should be
> replaced; two values of the same sector should not be kept in the device
> cache at the same time.
> 
> Failing to do so is a SCSI device problem.

SCSI devices supporting the Full task management model (almost all) and
having the QUEUE ALGORITHM MODIFIER bits in the Control mode page set to
1 are allowed to freely reorder any commands with the SIMPLE task
attribute. If an application wants to maintain the order of some commands
for such devices, it must issue them with the ORDERED task attribute and
over a _single_ MPIO path to the device.

Linux neither uses the ORDERED attribute, nor honors or enforces the
QUEUE ALGORITHM MODIFIER bits in any way, nor takes care to send commands
with order dependencies (overlapping writes in our case) over a single
MPIO path.
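
For reference, a hedged sketch of decoding those bits from a Control
mode page (0Ah) buffer fetched with MODE SENSE; the offset is per SPC,
everything around it is assumed:

/*
 * Byte 3, bits 7:4 of the Control mode page hold the QUEUE ALGORITHM
 * MODIFIER: 0h = restricted reordering, 1h = unrestricted reordering
 * of SIMPLE tasks.
 */
static int queue_algorithm_modifier(const unsigned char *ctrl_page)
{
        return (ctrl_page[3] >> 4) & 0xf;
}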

> Please note that page-to-sector is not necessarily constant. And the same page
> might get written at a different sector next time. But FSs will have to
> barrier in this case.
> 
>> For single parallel SCSI or SAS devices such a race may look practically
>> impossible, but for sophisticated clusters where many nodes pretend to
>> be a single SCSI device in a load-balancing configuration, it becomes
>> very real.
>>
>> We can see this real-life problem in an active-active DRBD setup. In this
>> configuration 2 nodes act as a single SCST-powered SCSI device and they
>> both run DRBD to keep their backing storage in sync. The initiator uses
>> them as a single multipath device in an active-active round-robin
>> load-balancing configuration, i.e. it sends requests to both nodes in
>> parallel, then DRBD takes care of replicating the requests to the other
>> node.
>>
>> The problem is that sometimes DRBD complains about concurrent local
>> writes, like:
>>
>> kernel: drbd0: scsi_tgt0[12503] Concurrent local write detected! 
>> [DISCARD L] new: 144072784s +8192; pending: 144072784s +8192
>>
>> This message means that DRBD detected that both nodes received 
>> overlapping writes on the same block(s) and DRBD can't figure out which 
>> one to store. This is possible only if the initiator sent the second 
>> write request before the first one completed.
> 
> It is totally possible in today's code.
> 
> DRBD should store the original command_sn of the write and discard
> the sector with the lower SN. It should appear as a single device
> to the initiator.

How can it find the SN? The commands were sent over _different_ MPIO 
paths to the device, so at the moment of the sending all the order 
information was lost.

Until SCSI generally allows preserving ordering information between
MPIO paths in such configurations, the only way to maintain command
order would be queue draining. Hence, for safety, all initiators working
with such devices must do it.

But it looks like Linux doesn't do it, so is it unsafe with MPIO clusters?

Vlad

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Wrong DIF guard tag on ext2 write
  2010-06-03 12:41               ` Vladislav Bolkhovitin
@ 2010-06-03 12:46                 ` Vladislav Bolkhovitin
  2010-06-09 15:58                   ` Vladislav Bolkhovitin
  2010-06-03 13:06                 ` Boaz Harrosh
  1 sibling, 1 reply; 79+ messages in thread
From: Vladislav Bolkhovitin @ 2010-06-03 12:46 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: James Bottomley, Christof Schmitt, Martin K. Petersen, linux-scsi,
	linux-kernel, linux-fsdevel, Chris Mason, Gennadiy Nerubayev



Vladislav Bolkhovitin, on 06/03/2010 04:41 PM wrote:
> Boaz Harrosh, on 06/03/2010 04:07 PM wrote:
>> On 06/03/2010 02:20 PM, Vladislav Bolkhovitin wrote:
>>> There's one interesting problem here, at least theoretically, with SCSI
>>> or similar transports which allow a command queue depth >1 and are
>>> allowed to internally reorder queued requests. I don't know the FS/block
>>> layers sufficiently well to tell whether sending several requests for the
>>> same page is really possible or not, but we can see a real-life problem,
>>> which can be well explained if it is possible.
>>>
>>> The problem could arise if the second (rewrite) request (SCSI command) for
>>> the same page were queued to the corresponding device before the original
>>> request finished. Since the device is allowed to freely reorder requests,
>>> there's a probability that the original write request would hit the
>>> permanent storage *AFTER* the retry request, hence the data changes it is
>>> carrying would be lost, hence welcome data corruption.
>>>
>> I might be totally wrong here, but I think NCQ can reorder sectors, not
>> writes. That is, if the sector is cached in device memory and a later
>> write comes in to modify the same sector, then the original should be
>> replaced; two values of the same sector should not be kept in the device
>> cache at the same time.
>>
>> Failing to do so is a SCSI device problem.
> 
> SCSI devices supporting the Full task management model (almost all) and
> having the QUEUE ALGORITHM MODIFIER bits in the Control mode page set to
> 1 are allowed to freely reorder any commands with the SIMPLE task
> attribute. If an application wants to maintain the order of some commands
> for such devices, it must issue them with the ORDERED task attribute and
> over a _single_ MPIO path to the device.
> 
> Linux neither uses the ORDERED attribute, nor honors or enforces the
> QUEUE ALGORITHM MODIFIER bits in any way, nor takes care to send commands
> with order dependencies (overlapping writes in our case) over a single
> MPIO path.
> 
>> Please note that page-to-sector is not necessarily constant. And the same page
>> might get written at a different sector next time. But FSs will have to
>> barrier in this case.
>>
>>> For single parallel SCSI or SAS devices such a race may look practically
>>> impossible, but for sophisticated clusters where many nodes pretend to
>>> be a single SCSI device in a load-balancing configuration, it becomes
>>> very real.
>>>
>>> We can see this real-life problem in an active-active DRBD setup. In this
>>> configuration 2 nodes act as a single SCST-powered SCSI device and they
>>> both run DRBD to keep their backing storage in sync. The initiator uses
>>> them as a single multipath device in an active-active round-robin
>>> load-balancing configuration, i.e. it sends requests to both nodes in
>>> parallel, then DRBD takes care of replicating the requests to the other
>>> node.
>>>
>>> The problem is that sometimes DRBD complains about concurrent local
>>> writes, like:
>>>
>>> kernel: drbd0: scsi_tgt0[12503] Concurrent local write detected! 
>>> [DISCARD L] new: 144072784s +8192; pending: 144072784s +8192
>>>
>>> This message means that DRBD detected that both nodes received 
>>> overlapping writes on the same block(s) and DRBD can't figure out which 
>>> one to store. This is possible only if the initiator sent the second 
>>> write request before the first one completed.
>> It is totally possible in today's code.
>>
>> DRBD should store the original command_sn of the write and discard
>> the sector with the lower SN. It should appear as a single device
>> to the initiator.
> 
> How can it find the SN? The commands were sent over _different_ MPIO 
> paths to the device, so at the moment of the sending all the order 
> information was lost.
> 
> Until SCSI generally allows preserving ordering information between
> MPIO paths in such configurations, the only way to maintain command
> order would be queue draining. Hence, for safety, all initiators working
> with such devices must do it.
> 
> But it looks like Linux doesn't do it, so is it unsafe with MPIO clusters?

I meant load-balancing MPIO clusters.

Vlad

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Wrong DIF guard tag on ext2 write
  2010-06-03 12:41               ` Vladislav Bolkhovitin
  2010-06-03 12:46                 ` Vladislav Bolkhovitin
@ 2010-06-03 13:06                 ` Boaz Harrosh
  2010-06-03 13:23                   ` Vladislav Bolkhovitin
  1 sibling, 1 reply; 79+ messages in thread
From: Boaz Harrosh @ 2010-06-03 13:06 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: James Bottomley, Christof Schmitt, Martin K. Petersen, linux-scsi,
	linux-kernel, linux-fsdevel, Chris Mason, Gennadiy Nerubayev

On 06/03/2010 03:41 PM, Vladislav Bolkhovitin wrote:
> Boaz Harrosh, on 06/03/2010 04:07 PM wrote:
>> On 06/03/2010 02:20 PM, Vladislav Bolkhovitin wrote:
>>> There's one interesting problem here, at least theoretically, with SCSI
>>> or similar transports which allow a command queue depth >1 and are
>>> allowed to internally reorder queued requests. I don't know the FS/block
>>> layers sufficiently well to tell whether sending several requests for the
>>> same page is really possible or not, but we can see a real-life problem,
>>> which can be well explained if it is possible.
>>>
>>> The problem could arise if the second (rewrite) request (SCSI command) for
>>> the same page were queued to the corresponding device before the original
>>> request finished. Since the device is allowed to freely reorder requests,
>>> there's a probability that the original write request would hit the
>>> permanent storage *AFTER* the retry request, hence the data changes it is
>>> carrying would be lost, hence welcome data corruption.
>>>
>>
>> I might be totally wrong here, but I think NCQ can reorder sectors, not
>> writes. That is, if the sector is cached in device memory and a later
>> write comes in to modify the same sector, then the original should be
>> replaced; two values of the same sector should not be kept in the device
>> cache at the same time.
>>
>> Failing to do so is a SCSI device problem.
> 
> SCSI devices supporting the Full task management model (almost all) and
> having the QUEUE ALGORITHM MODIFIER bits in the Control mode page set to
> 1 are allowed to freely reorder any commands with the SIMPLE task
> attribute. If an application wants to maintain the order of some commands
> for such devices, it must issue them with the ORDERED task attribute and
> over a _single_ MPIO path to the device.
> 
> Linux neither uses the ORDERED attribute, nor honors or enforces the
> QUEUE ALGORITHM MODIFIER bits in any way, nor takes care to send commands
> with order dependencies (overlapping writes in our case) over a single
> MPIO path.
> 

OK, I take your word for it. But that sounds stupid to me. I would think
that sectors can be ordered, not commands per se. What happens with reads
then? Do they get ordered? I mean, for a read in between the two writes,
which value is read? It gets so complicated that only a sector model
makes sense to me.

>> Please note that page-to-sector is not necessarily constant. And the same page
>> might get written at a different sector next time. But FSs will have to
>> barrier in this case.
>>
>>> For single parallel SCSI or SAS devices such a race may look practically
>>> impossible, but for sophisticated clusters where many nodes pretend to
>>> be a single SCSI device in a load-balancing configuration, it becomes
>>> very real.
>>>
>>> We can see this real-life problem in an active-active DRBD setup. In this
>>> configuration 2 nodes act as a single SCST-powered SCSI device and they
>>> both run DRBD to keep their backing storage in sync. The initiator uses
>>> them as a single multipath device in an active-active round-robin
>>> load-balancing configuration, i.e. it sends requests to both nodes in
>>> parallel, then DRBD takes care of replicating the requests to the other
>>> node.
>>>
>>> The problem is that sometimes DRBD complains about concurrent local
>>> writes, like:
>>>
>>> kernel: drbd0: scsi_tgt0[12503] Concurrent local write detected! 
>>> [DISCARD L] new: 144072784s +8192; pending: 144072784s +8192
>>>
>>> This message means that DRBD detected that both nodes received 
>>> overlapping writes on the same block(s) and DRBD can't figure out which 
>>> one to store. This is possible only if the initiator sent the second 
>>> write request before the first one completed.
>>
>> It is totally possible in today's code.
>>
>> DRBD should store the original command_sn of the write and discard
>> the sector with the lower SN. It should appear as a single device
>> to the initiator.
> 
> How can it find the SN? The commands were sent over _different_ MPIO 
> paths to the device, so at the moment of the sending all the order 
> information was lost.
> 

I'm not hard on the specifics here. But I think the initiator has set
the same SN on the two paths, or has incremented them between paths.
You said:

> The initiator uses them as a single multipath device in an active-active
> round-robin load-balancing configuration, i.e. it sends requests to both
> nodes in parallel.

So what was the SN sent to each side? Is there a relationship between them,
or do they each advance independently?

If there is a relationship, then the targets on the two sides should store
the SN for later comparison. (Life is hard.)

> Until SCSI generally allows preserving ordering information between
> MPIO paths in such configurations, the only way to maintain command
> order would be queue draining. Hence, for safety, all initiators working
> with such devices must do it.
> 
> But it looks like Linux doesn't do it, so is it unsafe with MPIO clusters?
> 
> Vlad
> 

Thanks
Boaz

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Wrong DIF guard tag on ext2 write
  2010-06-03 13:06                 ` Boaz Harrosh
@ 2010-06-03 13:23                   ` Vladislav Bolkhovitin
  0 siblings, 0 replies; 79+ messages in thread
From: Vladislav Bolkhovitin @ 2010-06-03 13:23 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: James Bottomley, Christof Schmitt, Martin K. Petersen, linux-scsi,
	linux-kernel, linux-fsdevel, Chris Mason, Gennadiy Nerubayev

Boaz Harrosh, on 06/03/2010 05:06 PM wrote:
> On 06/03/2010 03:41 PM, Vladislav Bolkhovitin wrote:
>> Boaz Harrosh, on 06/03/2010 04:07 PM wrote:
>>> On 06/03/2010 02:20 PM, Vladislav Bolkhovitin wrote:
>>>> There's one interesting problem here, at least theoretically, with SCSI
>>>> or similar transports which allow a command queue depth >1 and are
>>>> allowed to internally reorder queued requests. I don't know the FS/block
>>>> layers sufficiently well to tell whether sending several requests for the
>>>> same page is really possible or not, but we can see a real-life problem,
>>>> which can be well explained if it is possible.
>>>>
>>>> The problem could arise if the second (rewrite) request (SCSI command) for
>>>> the same page were queued to the corresponding device before the original
>>>> request finished. Since the device is allowed to freely reorder requests,
>>>> there's a probability that the original write request would hit the
>>>> permanent storage *AFTER* the retry request, hence the data changes it is
>>>> carrying would be lost, hence welcome data corruption.
>>>>
>>> I might be totally wrong here, but I think NCQ can reorder sectors, not
>>> writes. That is, if the sector is cached in device memory and a later
>>> write comes in to modify the same sector, then the original should be
>>> replaced; two values of the same sector should not be kept in the device
>>> cache at the same time.
>>>
>>> Failing to do so is a SCSI device problem.
>> SCSI devices supporting the Full task management model (almost all) and
>> having the QUEUE ALGORITHM MODIFIER bits in the Control mode page set to
>> 1 are allowed to freely reorder any commands with the SIMPLE task
>> attribute. If an application wants to maintain the order of some commands
>> for such devices, it must issue them with the ORDERED task attribute and
>> over a _single_ MPIO path to the device.
>>
>> Linux neither uses the ORDERED attribute, nor honors or enforces the
>> QUEUE ALGORITHM MODIFIER bits in any way, nor takes care to send commands
>> with order dependencies (overlapping writes in our case) over a single
>> MPIO path.
>>
> 
> OK, I take your word for it. But that sounds stupid to me. I would think
> that sectors can be ordered, not commands per se. What happens with reads
> then? Do they get ordered? I mean, for a read in between the two writes,
> which value is read? It gets so complicated that only a sector model
> makes sense to me.

Look wider. For a single HDD your way of thinking makes sense. But what
about big clusters consisting of many nodes with many clients? In them,
maintaining internal command order is generally bad and often way too
expensive for performance.

It's the same as with modern CPUs, where for performance reasons
programmers must also live with the possibility of command reordering
and use barriers when necessary.

>>> Please note that page-to-sector is not necessarily constant. And the same page
>>> might get written at a different sector next time. But FSs will have to
>>> barrier in this case.
>>>
>>>> For single parallel SCSI or SAS devices such a race may look practically
>>>> impossible, but for sophisticated clusters where many nodes pretend to
>>>> be a single SCSI device in a load-balancing configuration, it becomes
>>>> very real.
>>>>
>>>> We can see this real-life problem in an active-active DRBD setup. In this
>>>> configuration 2 nodes act as a single SCST-powered SCSI device and they
>>>> both run DRBD to keep their backing storage in sync. The initiator uses
>>>> them as a single multipath device in an active-active round-robin
>>>> load-balancing configuration, i.e. it sends requests to both nodes in
>>>> parallel, then DRBD takes care of replicating the requests to the other
>>>> node.
>>>>
>>>> The problem is that sometimes DRBD complains about concurrent local
>>>> writes, like:
>>>>
>>>> kernel: drbd0: scsi_tgt0[12503] Concurrent local write detected! 
>>>> [DISCARD L] new: 144072784s +8192; pending: 144072784s +8192
>>>>
>>>> This message means that DRBD detected that both nodes received 
>>>> overlapping writes on the same block(s) and DRBD can't figure out which 
>>>> one to store. This is possible only if the initiator sent the second 
>>>> write request before the first one completed.
>>> It is totally possible in today's code.
>>>
>>> DRBD should store the original command_sn of the write and discard
>>> the sector with the lower SN. It should appear as a single device
>>> to the initiator.
>> How can it find the SN? The commands were sent over _different_ MPIO 
>> paths to the device, so at the moment of the sending all the order 
>> information was lost.
>>
> 
> I'm not hard on the specifics here. But I think the initiator has set
> the same SN on the two paths, or has incremented them between paths.
> You said:
> 
>> The initiator uses them as a single multipath device in an active-active
>> round-robin load-balancing configuration, i.e. it sends requests to both
>> nodes in parallel.
> 
> So what was the SN sent to each side? Is there a relationship between them,
> or do they each advance independently?
> 
> If there is a relationship, then the targets on the two sides should store
> the SN for later comparison. (Life is hard.)

No SCSI transport, iSCSI included, carries any SN information related to 
other paths (I_T nexuses) in its internal packets. It's simply outside of 
SAM. If you need ordering information between paths, you must use 
"extensions" like iSCSI MC/S, but they are bad for many other reasons. 
I summarized this in http://scst.sourceforge.net/mc_s.html.

Vlad

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Wrong DIF guard tag on ext2 write
  2010-06-02 13:41                                 ` Nick Piggin
@ 2010-06-03 15:46                                   ` Chris Mason
  2010-06-03 16:27                                     ` Nick Piggin
                                                       ` (3 more replies)
  2010-06-04  1:30                                   ` Martin K. Petersen
  1 sibling, 4 replies; 79+ messages in thread
From: Chris Mason @ 2010-06-03 15:46 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Martin K. Petersen, James Bottomley, Matthew Wilcox,
	Christof Schmitt, Boaz Harrosh, linux-scsi, linux-kernel,
	linux-fsdevel

On Wed, Jun 02, 2010 at 11:41:21PM +1000, Nick Piggin wrote:
> On Wed, Jun 02, 2010 at 09:17:56AM -0400, Martin K. Petersen wrote:
> > >>>>> "Nick" == Nick Piggin <npiggin@suse.de> writes:
> > 
> > >> 1) filesystem changed it
> > >> 2) corruption on the wire or in the raid controller
> > >> 3) the page was corrupted while the IO layer was doing the IO.
> > >> 
> > >> 1 and 2 are easy, we bounce, retry and everyone continues on with
> > >> their lives.  With #3, we'll recrc and send the IO down again
> > >> thinking the data is correct when really we're writing garbage.
> > >> 
> > >> How can we tell these three cases apart?
> > 
> > Nick> Do we really need to handle #3? It could have happened before the
> > Nick> checksum was calculated.
> > 
> > Reason #3 is one of the main reasons for having the checksum in the
> > first place.  The whole premise of the data integrity extensions is that
> > the checksum is calculated in close temporal proximity to the data
> > creation.  I.e. eventually in userland.
> > 
> > Filesystems will inevitably have to be integrity-aware for that to work.
> > And it will be their job to keep the data pages stable during DMA.
> 
> Let's just think hard about what windows can actually be closed versus
> how much effort goes in to closing them. I also prefer not to accept
> half-solutions in the kernel because they don't want to implement real
> solutions in hardware (it's pretty hard to checksum and protect all
> kernel data structures by hand).
> 
> For "normal" writes into pagecache, the data can get corrupted anywhere
> from after it is generated in userspace, during the copy, while it is
> dirty in cache, and while it is being written out.

This is why the DIF/DIX spec has the idea of a crc generated in userland
when the data is generated.  At any rate the basic idea is to crc early
but not often...recalculating the crc after we hand our precious memory
to the evil device driver does weaken the end-to-end integrity checks.

What I don't want to do is weaken the basic DIF/DIX structure by letting
the lower layers re-crc stuff as they find faults.  It would be fine if we
had some definitive way to say "the FS raced, just re-crc", but we really
don't.
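
As a concrete illustration of "crc early": the T10 DIF guard tag is a
CRC16 with polynomial 0x8BB7 over each 512-byte sector (DIX alternatively
allows the cheaper IP checksum).  A minimal userspace sketch, written for
this mail rather than taken from any library:

#include <stdint.h>
#include <stddef.h>

/* CRC16 T10-DIF sketch: polynomial 0x8BB7, init 0, no reflection. */
static uint16_t crc_t10dif(const uint8_t *buf, size_t len)
{
	uint16_t crc = 0;
	size_t i;
	int bit;

	for (i = 0; i < len; i++) {
		crc ^= (uint16_t)buf[i] << 8;
		for (bit = 0; bit < 8; bit++)
			crc = (crc & 0x8000) ? (crc << 1) ^ 0x8bb7
					     : crc << 1;
	}
	return crc;	/* this value becomes the sector's guard tag */
}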

> 
> Closing the while-it-is-dirty, while-it-is-being-written-back window
> still leaves a pretty big window. Also, how do you handle mmap writes?
> Write protect and checksum the destination page after every store? Or
> leave some window between when the pagecache is dirtied and when it is
> written back? So I don't know whether it's worth putting a lot of effort
> into this case.

So, changing gears to how we protect filesystem page cache pages instead
of the generic idea of dif/dix: btrfs crcs just before writing, which does
leave a pretty big window for the page to get corrupted.  The storage
layer shouldn't care or know about that though; we hand it a crc and it
makes sure data matching that crc goes to the media.
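
A minimal sketch of that shape, using the kernel's generic crc32c helper
rather than btrfs's actual csum code (the function below is made up for
illustration):

#include <linux/crc32c.h>
#include <linux/highmem.h>

/* Sketch only: checksum a page cache page just before bio submission.
 * The csum travels with the fs metadata, not with the bio itself. */
static u32 csum_page_before_submit(struct page *page)
{
	void *kaddr = kmap_atomic(page, KM_USER0);
	u32 csum = crc32c(~(u32)0, kaddr, PAGE_SIZE);

	kunmap_atomic(kaddr, KM_USER0);
	return csum;
}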

> 
> If you had an interface for userspace to insert checksums to direct IO
> requests or pagecache ranges, then not only can you close the entire gap
> between userspace data generation, and writeback. But you also can
> handle mmap writes and anything else just fine: userspace knows about
> the concurrency details, so it can add the right checksum (and
> potentially fsync etc) when it's ready.

Yeah, I do agree here.

-chris

^ permalink raw reply	[flat|nested] 79+ messages in thread

* [LFS/VM TOPIC] Stable pages while IO (was Wrong DIF guard tag on ext2 write)
  2010-05-31 11:28 Wrong DIF guard tag on ext2 write Christof Schmitt
  2010-05-31 11:34 ` Christof Schmitt
  2010-05-31 14:20 ` Martin K. Petersen
@ 2010-06-03 16:09 ` Boaz Harrosh
  2010-06-03 16:30   ` [Lsf10-pc] " J. Bruce Fields
                     ` (2 more replies)
  2 siblings, 3 replies; 79+ messages in thread
From: Boaz Harrosh @ 2010-06-03 16:09 UTC (permalink / raw)
  To: Christof Schmitt, linux-scsi, linux-kernel, linux-fsdevel,
	lsf10-pc, Nick 

[Topic]
How to not let pages change while in IO

[Abstract]
As seen in a long thread on the fsdevel and scsi mailing lists, lots of
people have headaches and sleepless nights because individual pages
can change while in IO and/or DMA. Though each one has slightly different
needs, the mechanics look to be the same.

People that care:
- Mirror and RAID people that need on disk consistency.
- Network storage that wants data checksum.
- DIF/DIX people
- ...

I for one know nothing of the subject but am a RAID person and would
like a solution that does not force me to copy the complete data load.

Please let's get all the VM, VFS, and driver people in one room and see
if we can find a Linux solution to this problem

Boaz

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Wrong DIF guard tag on ext2 write
  2010-06-03 15:46                                   ` Chris Mason
@ 2010-06-03 16:27                                     ` Nick Piggin
       [not found]                                     ` <20100603162718.GR6822@laptop>
                                                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 79+ messages in thread
From: Nick Piggin @ 2010-06-03 16:27 UTC (permalink / raw)
  To: Chris Mason, Martin K. Petersen, James Bottomley, Matthew Wilcox,
	Christof 

On Thu, Jun 03, 2010 at 11:46:34AM -0400, Chris Mason wrote:
> On Wed, Jun 02, 2010 at 11:41:21PM +1000, Nick Piggin wrote:
> > On Wed, Jun 02, 2010 at 09:17:56AM -0400, Martin K. Petersen wrote:
> > > >>>>> "Nick" == Nick Piggin <npiggin@suse.de> writes:
> > > 
> > > >> 1) filesystem changed it
> > > >> 2) corruption on the wire or in the raid controller
> > > >> 3) the page was corrupted while the IO layer was doing the IO.
> > > >> 
> > > >> 1 and 2 are easy, we bounce, retry and everyone continues on with
> > > >> their lives.  With #3, we'll recrc and send the IO down again
> > > >> thinking the data is correct when really we're writing garbage.
> > > >> 
> > > >> How can we tell these three cases apart?
> > > 
> > > Nick> Do we really need to handle #3? It could have happened before the
> > > Nick> checksum was calculated.
> > > 
> > > Reason #3 is one of the main reasons for having the checksum in the
> > > first place.  The whole premise of the data integrity extensions is that
> > > the checksum is calculated in close temporal proximity to the data
> > > creation.  I.e. eventually in userland.
> > > 
> > > Filesystems will inevitably have to be integrity-aware for that to work.
> > > And it will be their job to keep the data pages stable during DMA.
> > 
> > Let's just think hard about what windows can actually be closed versus
> > how much effort goes in to closing them. I also prefer not to accept
> > half-solutions in the kernel because they don't want to implement real
> > solutions in hardware (it's pretty hard to checksum and protect all
> > kernel data structures by hand).
> > 
> > For "normal" writes into pagecache, the data can get corrupted anywhere
> > from after it is generated in userspace, during the copy, while it is
> > dirty in cache, and while it is being written out.
> 
> This is why the DIF/DIX spec has the idea of a crc generated in userland
> when the data is generated.  At any rate the basic idea is to crc early
> but not often...recalculating the crc after we hand our precious memory
> to the evil device driver does weaken the end-to-end integrity checks.
> 
> What I don't want to do is weaken the basic DIF/DIX structure by letting
> the lower layers re-crc stuff as they find faults.  It would be fine if we
> had some definitive way to say "the FS raced, just re-crc", but we really
> don't.

That's true but a naive kernel crc cannot do better than the
user/kernel boundary (and has very big problems even doing that well
with mmap, get_user_pages, concurrent dirtying). So we are already
resigned there to a best effort approach.

Since we fundamentally can't have end-to-end protection then, it's
much harder to argue for significant complexity just to close the
hole a little.

So if we do the block layer retries in response to concurrent writes, it
opens the window there a little bit, but remember only a small
proportion of writes will require retries, and for that proportion, the
window is only opening a small amount.

As far as I know, we're not checksumming at the usercopy point, but the
writeback point, so we have a vastly bigger window there already.


> > Closing the while-it-is-dirty, while-it-is-being-written-back window
> > still leaves a pretty big window. Also, how do you handle mmap writes?
> > Write protect and checksum the destination page after every store? Or
> > leave some window between when the pagecache is dirtied and when it is
> > written back? So I don't know whether it's worth putting a lot of effort
> > into this case.
> 
> So, changing gears to how we protect filesystem page cache pages instead
> of the generic idea of dif/dix: btrfs crcs just before writing, which does
> leave a pretty big window for the page to get corrupted.  The storage
> layer shouldn't care or know about that though; we hand it a crc and it
> makes sure data matching that crc goes to the media.

I'm not totally against freezing redirtying events during writeback,
but as long as there is a huge window of the page sitting dirty in
pagecache, it's not worth much, if any, complexity.

Also I don't think we can deal with memory errors and scribbles just by
crcing dirty data. The calculations generating the data could get
corrupted. Data can be corrupted on its way back from the device to
userspace. Dirty memory writeback is usually only a small part of the
entire data transformation process.


> > If you had an interface for userspace to insert checksums to direct IO
> > requests or pagecache ranges, then not only can you close the entire gap
> > between userspace data generation, and writeback. But you also can
> > handle mmap writes and anything else just fine: userspace knows about
> > the concurrency details, so it can add the right checksum (and
> > potentially fsync etc) when it's ready.
> 
> Yeah, I do agree here.

After protecting writeback from IO bus and wire errors, I think this
would be the most productive thing to work on. Obviously this feature is
being pushed by databases and such that really want to pass checksums
all the way from userspace. Block retrying is _not_ needed or wanted
here of course.

After that is done, it might be worth coming back to see if
regular pagecache can be protected any better. I just think that's
the highest hanging fruit at the moment.


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Lsf10-pc] [LFS/VM TOPIC] Stable pages while IO (was Wrong DIF guard tag on ext2 write)
  2010-06-03 16:09 ` [LFS/VM TOPIC] Stable pages while IO (was Wrong DIF guard tag on ext2 write) Boaz Harrosh
@ 2010-06-03 16:30   ` J. Bruce Fields
  2010-06-03 17:41   ` Vladislav Bolkhovitin
  2010-06-04 16:23   ` Jan Kara
  2 siblings, 0 replies; 79+ messages in thread
From: J. Bruce Fields @ 2010-06-03 16:30 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Christof Schmitt, linux-scsi, linux-kernel, linux-fsdevel,
	lsf10-pc, Nick Piggin, Al Viro, Chris Mason, James Bottomley,
	Martin K. Petersen, Ric Wheeler, Matthew Wilcox,
	Vladislav Bolkhovitin, Christoph Hellwig

On Thu, Jun 03, 2010 at 07:09:52PM +0300, Boaz Harrosh wrote:
> [Topic]
> How to not let pages change while in IO
> 
> [Abstract]
> As seen in a long thread on the fsdevel and scsi mailing lists, lots of
> people have headaches and sleepless nights because individual pages
> can change while in IO and/or DMA. Though each one has slightly different
> needs, the mechanics look to be the same.
> 
> People that care:
> - Mirror and RAID people that need on disk consistency.
> - Network storage that wants data checksum.

Yes, nfs/rpcsec_gss code is an example of code that tries to sign pages
containing data that could change underneath us.  I don't know if we've
seen it in practice.

--b.

> - DIF/DIX people
> - ...
> 
> I for one know nothing of the subject but am a RAID person and would
> like a solution that does not force me to copy the complete data load.
> 
> Please let's get all the VM, VFS, and driver people in one room and see
> if we can find a Linux solution to this problem
> 
> Boaz

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [LFS/VM TOPIC] Stable pages while IO (was Wrong DIF guard tag on ext2 write)
  2010-06-03 16:09 ` [LFS/VM TOPIC] Stable pages while IO (was Wrong DIF guard tag on ext2 write) Boaz Harrosh
  2010-06-03 16:30   ` [Lsf10-pc] " J. Bruce Fields
@ 2010-06-03 17:41   ` Vladislav Bolkhovitin
  2010-06-04 16:23   ` Jan Kara
  2 siblings, 0 replies; 79+ messages in thread
From: Vladislav Bolkhovitin @ 2010-06-03 17:41 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Christof Schmitt, linux-scsi, linux-kernel, linux-fsdevel,
	lsf10-pc, Nick Piggin, Al Viro, Chris Mason, James Bottomley,
	Martin K. Petersen, Ric Wheeler, Matthew Wilcox,
	Christoph Hellwig

Boaz Harrosh, on 06/03/2010 08:09 PM wrote:
> [Topic]
> How to not let pages change while in IO
> 
> [Abstract]
> As seen in a long thread on the fsdevel and scsi mailing lists, lots of
> people have headaches and sleepless nights because individual pages
> can change while in IO and/or DMA. Though each one has slightly different
> needs, the mechanics look to be the same.
> 
> People that care:
> - Mirror and RAID people that need on disk consistency.
> - Network storage that wants data checksum.
> - DIF/DIX people

  - Load balancing MPIO clusters, where out-of-order execution of 
overlapping write requests for changed pages can introduce data 
corruption, which makes using Linux with load balancing MPIO clusters 
unsafe.

> - ...
> 
> I for one know nothing of the subject but am a RAID person and would
> like a solution that does not force me to copy the complete data load.
> 
> Please let's get all the VM, VFS, and driver people in one room and see
> if we can find a Linux solution to this problem
> 
> Boaz


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Wrong DIF guard tag on ext2 write
  2010-06-02 13:41                                 ` Nick Piggin
  2010-06-03 15:46                                   ` Chris Mason
@ 2010-06-04  1:30                                   ` Martin K. Petersen
  1 sibling, 0 replies; 79+ messages in thread
From: Martin K. Petersen @ 2010-06-04  1:30 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Martin K. Petersen, Chris Mason, James Bottomley, Matthew Wilcox,
	Christof Schmitt, Boaz Harrosh, linux-scsi, linux-kernel,
	linux-fsdevel

>>>>> "Nick" == Nick Piggin <npiggin@suse.de> writes:

Nick,

>> Filesystems will inevitably have to be integrity-aware for that to
>> work.  And it will be their job to keep the data pages stable during
>> DMA.

Nick> Closing the while-it-is-dirty, while-it-is-being-written-back
Nick> window still leaves a pretty big window. Also, how do you handle
Nick> mmap writes?  Write protect and checksum the destination page
Nick> after every store? Or leave some window between when the pagecache
Nick> is dirtied and when it is written back? So I don't know whether
Nick> it's worth putting a lot of effort into this case.

I'm mostly interested in the cases where the filesystem acts as a
conduit for integrity metadata from user space. 

I agree the corruption windows inside the kernel are only of moderate
interest.  No filesystems have added support for explicitly protecting a
bio because the block layer's function to do so automatically is just a
few function calls away.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Wrong DIF guard tag on ext2 write
  2010-06-02 23:20         ` Dave Chinner
@ 2010-06-04  1:34           ` Martin K. Petersen
  2010-06-04  2:32             ` Dave Chinner
  0 siblings, 1 reply; 79+ messages in thread
From: Martin K. Petersen @ 2010-06-04  1:34 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christof Schmitt, Martin K. Petersen, linux-scsi, linux-kernel,
	linux-fsdevel

>>>>> "Dave" == Dave Chinner <david@fromorbit.com> writes:

Dave> If you are running DIF hardware, then XFS is only OK for direct
Dave> IO.  XFS will still get torn writes if you are overwriting
Dave> buffered data (either by write() or mmap()) because there are no
Dave> interlocks to prevent cached pages under writeback from being
Dave> modified while DMA is being performed.....

Didn't you use to wait_on_page_writeback() in page_mkwrite()?

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Wrong DIF guard tag on ext2 write
       [not found]                                     ` <20100603162718.GR6822@laptop>
@ 2010-06-04  1:46                                       ` Martin K. Petersen
  2010-06-04  3:09                                         ` Nick Piggin
  0 siblings, 1 reply; 79+ messages in thread
From: Martin K. Petersen @ 2010-06-04  1:46 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Chris Mason, Martin K. Petersen, James Bottomley, Matthew Wilcox,
	Christof Schmitt, Boaz Harrosh, linux-scsi, linux-kernel,
	linux-fsdevel

>>>>> "Nick" == Nick Piggin <npiggin@suse.de> writes:

Nick> Also I don't think we can deal with memory errors and scribbles
Nick> just by crcing dirty data. The calculations generating the data
Nick> could get corrupted.

Yep, the goal is to make the window as small as possible.


Nick> Data can be corrupted on its way back from the device to
Nick> userspace.

We also get a CRC back from the storage.  So the (integrity-aware)
application is also able to check on read.


Nick> Obviously this feature is being pushed by databases and such that
Nick> really want to pass checksums all the way from userspace. Block
Nick> retrying is _not_ needed or wanted here of course.

Nope.  The integrity error is bubbled all the way up to the database and
we can decide to retry, recreate or error out depending on what we find
when we do validation checks on the data buffer and the integrity
metadata.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Wrong DIF guard tag on ext2 write
  2010-06-03 15:46                                   ` Chris Mason
  2010-06-03 16:27                                     ` Nick Piggin
       [not found]                                     ` <20100603162718.GR6822@laptop>
@ 2010-06-04  2:02                                     ` Dave Chinner
       [not found]                                     ` <20100604020243.GE19651@dastard>
  3 siblings, 0 replies; 79+ messages in thread
From: Dave Chinner @ 2010-06-04  2:02 UTC (permalink / raw)
  To: Chris Mason, Nick Piggin, Martin K. Petersen, James Bottomley,
	Matthew Wilcox <mat

On Thu, Jun 03, 2010 at 11:46:34AM -0400, Chris Mason wrote:
> On Wed, Jun 02, 2010 at 11:41:21PM +1000, Nick Piggin wrote:
> > > Closing the while-it-is-dirty, while-it-is-being-written-back window
> > still leaves a pretty big window. Also, how do you handle mmap writes?
> > Write protect and checksum the destination page after every store? Or
> > leave some window between when the pagecache is dirtied and when it is
> > written back? So I don't know whether it's worth putting a lot of effort
> > into this case.
> 
> So, changing gears to how we protect filesystem page cache pages instead
> of the generic idea of dif/dix: btrfs crcs just before writing, which does
> leave a pretty big window for the page to get corrupted.  The storage
> layer shouldn't care or know about that though; we hand it a crc and it
> makes sure data matching that crc goes to the media.

I think the only way to get accurate CRCs is to stop modifications
from occurring while the page is under writeback. i.e. when a page
transitions from dirty to writeback we need to unmap any writable
mappings on the page, and then any new modifications (either by the
write() path or through ->fault) need to block waiting for
page writeback to complete before they can proceed...

If we can lock out modification during writeback, we can calculate
CRCs safely at any point in time the page is not mapped. e.g. we
could do the CRC calculation at copy-in time and store it on new
pages. During writeback, if the page has not been mapped then the
stored CRC can be used. If it has been mapped (say writeable
mappings clear the stored CRC during ->fault) then we can
recalculate the CRC once we've transitioned the page to being under
writeback...
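
Roughly, as a sketch (every helper below is hypothetical, just to pin
down the state machine being described):

/* copy-in: data is stable under the page lock, so CRC it now */
crc = crc32c(~(u32)0, page_address(page), PAGE_SIZE);
page_store_crc(page, crc);		/* hypothetical side store */

/* ->fault installing a writable mapping: the stored CRC may go stale */
page_clear_crc(page);			/* hypothetical */

/* writeback, after the dirty->writeback transition has unmapped the
 * page and locked out new writers: */
if (page_has_crc(page))			/* hypothetical */
	crc = page_get_crc(page);	/* never mapped writable: reuse */
else
	crc = crc32c(~(u32)0, page_address(page), PAGE_SIZE);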

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Wrong DIF guard tag on ext2 write
  2010-06-04  1:34           ` Martin K. Petersen
@ 2010-06-04  2:32             ` Dave Chinner
  2010-06-07 16:20               ` Martin K. Petersen
  0 siblings, 1 reply; 79+ messages in thread
From: Dave Chinner @ 2010-06-04  2:32 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Christof Schmitt, linux-scsi, linux-kernel, linux-fsdevel

On Thu, Jun 03, 2010 at 09:34:18PM -0400, Martin K. Petersen wrote:
> >>>>> "Dave" == Dave Chinner <david@fromorbit.com> writes:
> 
> Dave> If you are running DIF hardware, then XFS is only OK for direct
> Dave> IO.  XFS will still get torn writes if you are overwriting
> Dave> buffered data (either by write() or mmap()) because there are no
> Dave> interlocks to prevent cached pages under writeback from being
> Dave> modified while DMA is being performed.....
> 
> Didn't you use to wait_on_page_writeback() in page_mkwrite()?

The generic implementation of ->page_mkwrite  (block_page_mkwrite())
which XFS uses has never had a wait_on_page_writeback() call in it.
There's no call in the generic write paths, either, hence my comment
that only direct IO on XFS will work.

It should be noted that metadata updates in XFS are already
protected against torn writes - buffers are held locked during IO,
and can only be modified while holding the lock. Hence the only
problem that needs solving for XFS to make full use of DIF/DIX is
getting the page cache paths to not modify pages under IO...


Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Wrong DIF guard tag on ext2 write
  2010-06-04  1:46                                       ` Martin K. Petersen
@ 2010-06-04  3:09                                         ` Nick Piggin
  0 siblings, 0 replies; 79+ messages in thread
From: Nick Piggin @ 2010-06-04  3:09 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Chris Mason, James Bottomley, Matthew Wilcox, Christof Schmitt,
	Boaz Harrosh, linux-scsi, linux-kernel, linux-fsdevel

On Thu, Jun 03, 2010 at 09:46:02PM -0400, Martin K. Petersen wrote:
> >>>>> "Nick" == Nick Piggin <npiggin@suse.de> writes:
> 
> Nick> Also I don't think we can deal with memory errors and scribbles
> Nick> just by crcing dirty data. The calculations generating the data
> Nick> could get corrupted.
> 
> Yep, the goal is to make the window as small as possible.
> 
> 
> Nick> Data can be corrupted on its way back from the device to
> Nick> userspace.
> 
> We also get a CRC back from the storage.  So the (integrity-aware)
> application is also able to check on read.

Well that's nice :)


> Nick> Obviously this feature is being pushed by databases and such that
> Nick> really want to pass checksums all the way from userspace. Block
> Nick> retrying is _not_ needed or wanted here of course.
> 
> Nope.  The integrity error is bubbled all the way up to the database and
> we can decide to retry, recreate or error out depending on what we find
> when we do validation checks on the data buffer and the integrity
> metadata.

By block retrying, I just meant the bounce / re-checksum approach.


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Wrong DIF guard tag on ext2 write
       [not found]                                     ` <20100604020243.GE19651@dastard>
@ 2010-06-04 15:32                                       ` Jan Kara
  0 siblings, 0 replies; 79+ messages in thread
From: Jan Kara @ 2010-06-04 15:32 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Chris Mason, Nick Piggin, Martin K. Petersen, James Bottomley,
	Matthew Wilcox, Christof Schmitt, Boaz Harrosh, linux-scsi,
	linux-kernel, linux-fsdevel

On Fri 04-06-10 12:02:43, Dave Chinner wrote:
> On Thu, Jun 03, 2010 at 11:46:34AM -0400, Chris Mason wrote:
> > On Wed, Jun 02, 2010 at 11:41:21PM +1000, Nick Piggin wrote:
> > > Closing the while-it-is-dirty, while-it-is-being-written-back window
> > > still leaves a pretty big window. Also, how do you handle mmap writes?
> > > Write protect and checksum the destination page after every store? Or
> > > leave some window between when the pagecache is dirtied and when it is
> > > written back? So I don't know whether it's worth putting a lot of effort
> > > into this case.
> > 
> > So, changing gears to how we protect filesystem page cache pages instead
> > of the generic idea of dif/dix: btrfs crcs just before writing, which does
> > leave a pretty big window for the page to get corrupted.  The storage
> > layer shouldn't care or know about that though; we hand it a crc and it
> > makes sure data matching that crc goes to the media.
> 
> I think the only way to get accurate CRCs is to stop modifications
> from occurring while the page is under writeback. i.e. when a page
> transitions from dirty to writeback we need to unmap any writable
> mappings on the page, and then any new modifications (either by the
> write() path or through ->fault) need to block waiting for
> page writeback to complete before they can proceed...
  Actually, we already write-protect the page in clear_page_dirty_for_io
so the first part already happens. Any filesystem can do
wait_on_page_writeback() in its ->page_mkwrite function, so even the second
part shouldn't be hard. I'm just a bit worried about the performance
implications / hidden deadlocks...
  Also we'd have to wait_on_page_writeback() in the ->write_begin function
to protect against ordinary writes, but that's the easy part...
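
In sketch form, for a hypothetical filesystem on the current ->page_mkwrite
signature (illustrative only, not an in-tree implementation):

static int myfs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	struct page *page = vmf->page;

	lock_page(page);
	/* the page was write-protected by clear_page_dirty_for_io();
	 * hold the new store until the in-flight IO has finished */
	wait_on_page_writeback(page);
	/* ... the usual block_page_mkwrite()-style work follows ... */
	unlock_page(page);
	return 0;
}

The ->write_begin side is the same idea: a wait_on_page_writeback(page)
right after the page is grabbed and locked.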

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [LFS/VM TOPIC] Stable pages while IO (was Wrong DIF guard tag on ext2 write)
  2010-06-03 16:09 ` [LFS/VM TOPIC] Stable pages while IO (was Wrong DIF guard tag on ext2 write) Boaz Harrosh
  2010-06-03 16:30   ` [Lsf10-pc] " J. Bruce Fields
  2010-06-03 17:41   ` Vladislav Bolkhovitin
@ 2010-06-04 16:23   ` Jan Kara
  2010-06-04 16:30     ` [Lsf10-pc] " J. Bruce Fields
  2010-06-06  9:35     ` Boaz Harrosh
  2 siblings, 2 replies; 79+ messages in thread
From: Jan Kara @ 2010-06-04 16:23 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Christof Schmitt, linux-scsi, linux-kernel, linux-fsdevel,
	lsf10-pc, Nick Piggin, Al Viro, Chris Mason, James Bottomley,
	Martin K. Petersen, Ric Wheeler, Matthew Wilcox,
	Vladislav Bolkhovitin, Christoph Hellwig

On Thu 03-06-10 19:09:52, Boaz Harrosh wrote:
> [Topic]
> How to not let pages change while in IO
> 
> [Abstract]
> As seen in a long thread on the fsdevel and scsi mailing lists, lots of
> people have headaches and sleepless nights because individual pages
> can change while in IO and/or DMA. Though each one has slightly different
> needs, the mechanics look to be the same.
  Hmm, I don't think it's really about "how to not let pages change" - that
is doable by using wait_on_page_writeback() in ->page_mkwrite and
->write_begin. I think the discussion is more about whether we should do it
or whether we should rechecksum and resubmit IO in case of checksum failure
as Nick proposed...

								Honza
> People that care:
> - Mirror and RAID people that need on disk consistency.
> - Network storage that wants data checksum.
> - DIF/DIX people
> - ...
> 
> I for one know nothing of the subject but am a RAID person and would
> like a solution that does not force me to copy the complete data load.
> 
> Please let's get all the VM, VFS, and driver people in one room and see
> if we can find a Linux solution to this problem
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Lsf10-pc] [LFS/VM TOPIC] Stable pages while IO (was Wrong DIF guard tag on ext2 write)
  2010-06-04 16:23   ` Jan Kara
@ 2010-06-04 16:30     ` J. Bruce Fields
  2010-06-04 17:11       ` Jan Kara
  2010-06-06  9:35     ` Boaz Harrosh
  1 sibling, 1 reply; 79+ messages in thread
From: J. Bruce Fields @ 2010-06-04 16:30 UTC (permalink / raw)
  To: Jan Kara
  Cc: Boaz Harrosh, Vladislav Bolkhovitin, Martin K. Petersen,
	linux-scsi, linux-kernel, James Bottomley, Christof Schmitt,
	lsf10-pc, Al Viro, linux-fsdevel, Matthew Wilcox, Ric Wheeler,
	Christoph Hellwig

On Fri, Jun 04, 2010 at 06:23:32PM +0200, Jan Kara wrote:
> On Thu 03-06-10 19:09:52, Boaz Harrosh wrote:
> > [Topic]
> > How to not let pages change while in IO
> > 
> > [Abstract]
> > As seen in a long thread on the fsdevel and scsi mailing lists, lots of
> > people have headaches and sleepless nights because individual pages
> > can change while in IO and/or DMA. Though each one has slightly different
> > needs, the mechanics look to be the same.
>   Hmm, I don't think it's really about "how to not let pages change" - that
> is doable by using wait_on_page_writeback() in ->page_mkwrite and
> ->write_begin.

Will the same work for the NFS server checksumming page data in read
replies?

--b.

> I think the discussion is more about whether we should do it
> or whether we should rechecksum and resubmit IO in case of checksum failure
> as Nick proposed...
> 
> 								Honza
> > People that care:
> > - Mirror and RAID people that need on disk consistency.
> > - Network storage that wants data checksum.
> > - DIF/DIX people
> > - ...
> > 
> > I for one know nothing of the subject but am a RAID person and would
> > like a solution that does not force me to copy the complete data load.
> > 
> > Please let's get all the VM, VFS, and driver people in one room and see
> > if we can find a Linux solution to this problem
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Lsf10-pc] [LFS/VM TOPIC] Stable pages while IO (was Wrong DIF guard tag on ext2 write)
  2010-06-04 16:30     ` [Lsf10-pc] " J. Bruce Fields
@ 2010-06-04 17:11       ` Jan Kara
  0 siblings, 0 replies; 79+ messages in thread
From: Jan Kara @ 2010-06-04 17:11 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Jan Kara, Boaz Harrosh, Vladislav Bolkhovitin, Martin K. Petersen,
	linux-scsi, linux-kernel, James Bottomley, Christof Schmitt,
	lsf10-pc, Al Viro, linux-fsdevel, Matthew Wilcox, Ric Wheeler,
	Christoph Hellwig

On Fri 04-06-10 12:30:09, J. Bruce Fields wrote:
> On Fri, Jun 04, 2010 at 06:23:32PM +0200, Jan Kara wrote:
> > On Thu 03-06-10 19:09:52, Boaz Harrosh wrote:
> > > [Topic]
> > > How to not let pages change while in IO
> > > 
> > > [Abstract]
> > > As seen in a long thread on the fsdevel and scsi mailing lists, lots of
> > > people have headaches and sleepless nights because individual pages
> > > can change while in IO and/or DMA. Though each one has slightly different
> > > needs, the mechanics look to be the same.
> >   Hmm, I don't think it's really about "how to not let pages change" - that
> > is doable by using wait_on_page_writeback() in ->page_mkwrite and
> > ->write_begin.
> 
> Will the same work for the NFS server checksumming page data in read
> replies?
  Currently it won't. The above relies on the fact that when we clear the
dirty page flag in clear_page_dirty_for_io we also write-protect the page.
Moreover, I suppose you don't set PageWriteback when serving content from
a page...
  But an elegant way to solve this would IMHO be to do:
	lock_page(page);
	if (page_mkclean(page))
		set_page_dirty(page);
  and now you can be sure that until you unlock the page it cannot be
modified, because both ->page_mkwrite and ->write_begin need the page lock
before they can go and modify a page...
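
Spelled out for a hypothetical server read path (compute_checksum() and
send_reply() below are made-up names, not existing API):

lock_page(page);
/* revoke writable mappings; if there were any, the data may have
 * changed, so it must be written back again later */
if (page_mkclean(page))
	set_page_dirty(page);
/* ->page_mkwrite and ->write_begin both need the page lock, so the
 * contents are stable from here until unlock_page() */
csum = compute_checksum(page);	/* hypothetical */
send_reply(page, csum);		/* hypothetical */
unlock_page(page);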

									Honza	
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [LFS/VM TOPIC] Stable pages while IO (was Wrong DIF guard tag on ext2 write)
  2010-06-04 16:23   ` Jan Kara
  2010-06-04 16:30     ` [Lsf10-pc] " J. Bruce Fields
@ 2010-06-06  9:35     ` Boaz Harrosh
  2010-06-06 23:37       ` Jan Kara
  1 sibling, 1 reply; 79+ messages in thread
From: Boaz Harrosh @ 2010-06-06  9:35 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christof Schmitt, linux-scsi, linux-kernel, linux-fsdevel,
	lsf10-pc, Nick Piggin, Al Viro, Chris Mason, James Bottomley,
	Martin K. Petersen, Ric Wheeler, Matthew Wilcox,
	Vladislav Bolkhovitin, Christoph Hellwig

On 06/04/2010 07:23 PM, Jan Kara wrote:
> On Thu 03-06-10 19:09:52, Boaz Harrosh wrote:
>> [Topic]
>> How to not let pages change while in IO
>>
>> [Abstract]
>> As seen in a long thread on the fsdevel and scsi mailing lists, lots of
>> people have headaches and sleepless nights because individual pages
>> can change while in IO and/or DMA. Though each one has slightly different
>> needs, the mechanics look to be the same.

>   Hmm, I don't think it's really about "how to not let pages change" - that
> is doable by using wait_on_page_writeback() in ->page_mkwrite and
> ->write_begin. I think the discussion is more about whether we should do it
> or whether we should rechecksum and resubmit IO in case of checksum failure
> as Nick proposed...
> 
> 								Honza

I have hijacked the DIF threads, but no, my proposal is for a general toolset
that could be used for all of the above, as well as DIF if needed.

Surely even with DIF, keep-constant vs. retransmit is a matter of machine+link
speed multiplied by the faulting workload. So there might be situations where
an admin wants to choose.

With other, non-checksum fixtures like RAID5/MIRROR this is not always an
option, and it becomes keep-constant vs. copy (that is, a copy of the complete
workload). So for these setups the option is clear. No?

I'm glad that you think it is easy/doable to implement, and I'll surely test
your recipe above. Do you think it would be acceptable as a generic per-sb
tunable? So for instance ext3 over RAID5 could turn this on and eliminate the
data copy?

Let's talk about this at LSF
Boaz

>> People that care:
>> - Mirror and RAID people that need on disk consistency.
>> - Network storage that wants data checksum.
>> - DIF/DIX people
>> - ...
>>
>> I for one know nothing of the subject but am a RAID person and would
>> like a solution that does not force me to copy the complete data load.
>>
>> Please lets get all the VM VFS and drivers people in one room and see
>> if we can have a Linux solution to this problem


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [LFS/VM TOPIC] Stable pages while IO (was Wrong DIF guard tag on ext2 write)
  2010-06-06  9:35     ` Boaz Harrosh
@ 2010-06-06 23:37       ` Jan Kara
  2010-06-07  8:30         ` Boaz Harrosh
  0 siblings, 1 reply; 79+ messages in thread
From: Jan Kara @ 2010-06-06 23:37 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Jan Kara, Christof Schmitt, linux-scsi, linux-kernel,
	linux-fsdevel, lsf10-pc, Nick Piggin, Al Viro, Chris Mason,
	James Bottomley, Martin K. Petersen, Ric Wheeler, Matthew Wilcox,
	Vladislav Bolkhovitin, Christoph Hellwig

On Sun 06-06-10 12:35:03, Boaz Harrosh wrote:
> On 06/04/2010 07:23 PM, Jan Kara wrote:
> > On Thu 03-06-10 19:09:52, Boaz Harrosh wrote:
> >> [Topic]
> >> How to not let pages change while in IO
> >>
> >> [Abstract]
> >> As seen in a long thread on the fsdevel and scsi mailing lists, lots of
> >> people have headaches and sleepless nights because individual pages
> >> can change while in IO and/or DMA. Though each one has slightly different
> >> needs, the mechanics look to be the same.
> 
> >   Hmm, I don't think it's really about "how to not let pages change" - that
> > is doable by using wait_on_page_writeback() in ->page_mkwrite and
> > ->write_begin. I think the discussion is more about whether we should do it
> > or whether we should rechecksum and resubmit IO in case of checksum failure
> > as Nick proposed...
> > 
> > 								Honza
> 
> I have hijacked the DIF threads, but no, my proposal is for a general
> toolset that could be used for all of the above, as well as DIF if needed.
> 
> Surely even with DIF, keep-constant vs. retransmit is a matter of
> machine+link speed multiplied by the faulting workload. So there might be
> situations where an admin wants to choose.
> 
> With other, non-checksum fixtures like RAID5/MIRROR this is not always
> an option, and it becomes keep-constant vs. copy (that is, a copy of the
> complete workload). So for these setups the option is clear. No?
  Is it? You can have enough CPU / memory bandwidth to do the copying
while still not being comfortable with a thread blocking until the IO is
finished when it tries to do a rewrite...

> I'm glad that you think it is easy/doable to implement, and I'll surely
> test your recipe above. Do you think it would be acceptable as a generic
> per-sb tunable? So for instance ext3 over RAID5 could turn this on
> and eliminate the data copy?
  Yes, that would be useful. At least so that one can get real performance
numbers...

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [LFS/VM TOPIC] Stable pages while IO (was Wrong DIF guard tag on ext2 write)
  2010-06-06 23:37       ` Jan Kara
@ 2010-06-07  8:30         ` Boaz Harrosh
  0 siblings, 0 replies; 79+ messages in thread
From: Boaz Harrosh @ 2010-06-07  8:30 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christof Schmitt, linux-scsi, linux-kernel, linux-fsdevel,
	lsf10-pc, Nick Piggin, Al Viro, Chris Mason, James Bottomley,
	Martin K. Petersen, Ric Wheeler, Matthew Wilcox,
	Vladislav Bolkhovitin, Christoph Hellwig

On 06/07/2010 02:37 AM, Jan Kara wrote:
>> With other, non-checksum fixtures like RAID5/MIRROR this is not always
>> > an option, and it becomes keep-constant vs. copy (that is, a copy of
>> > the complete workload). So for these setups the option is clear. No?
>
>   Is it? You can have enough CPU / memory bandwidth to do the copying
> while still not being comfortable with a thread blocking until the IO is
> finished when it tries to do a rewrite...
> 
>> I'm glad that you think it is easy/doable to implement, and I'll surely
>> test your recipe above. Do you think it would be acceptable as a generic
>> per-sb tunable?  So for instance ext3 over RAID5 could turn this on
>> and eliminate the data copy?
>
>   Yes, that would be useful. At least so that one can get real performance
> numbers...
> 
> 									Honza

Thanks Jan.
You have helped me tremendously. I think I can now begin to understand what I
need to do.

With the workloads I care about (HPC), every cycle and byte of memory counts,
and having the app wait for a rewrite is a good thing; which reminds me that I
would want to trace that case so applications can be fixed and tuned.

I do understand that for a desktop it might be just the opposite, so testing
is important. Perhaps I'll need help instrumenting all this.

Thanks
Boaz

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Wrong DIF guard tag on ext2 write
  2010-06-04  2:32             ` Dave Chinner
@ 2010-06-07 16:20               ` Martin K. Petersen
  2010-06-07 17:22                 ` Boaz Harrosh
  0 siblings, 1 reply; 79+ messages in thread
From: Martin K. Petersen @ 2010-06-07 16:20 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Martin K. Petersen, Christof Schmitt, linux-scsi, linux-kernel,
	linux-fsdevel

>>>>> "Dave" == Dave Chinner <david@fromorbit.com> writes:

>> Didn't you use to wait_on_page_writeback() in page_mkwrite()?

Dave> The generic implementation of ->page_mkwrite
Dave> (block_page_mkwrite()) which XFS uses has never had a
Dave> wait_on_page_writeback() call in it.  There's no call in the
Dave> generic write paths, either, hence my comment that only direct IO
Dave> on XFS will work.

I guess that wait_on_page_writeback() was something I added when I used
XFS for DIF testing.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Wrong DIF guard tag on ext2 write
  2010-06-07 16:20               ` Martin K. Petersen
@ 2010-06-07 17:22                 ` Boaz Harrosh
  2010-06-07 17:40                   ` Martin K. Petersen
  0 siblings, 1 reply; 79+ messages in thread
From: Boaz Harrosh @ 2010-06-07 17:22 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Dave Chinner, Christof Schmitt, linux-scsi, linux-kernel,
	linux-fsdevel

On 06/07/2010 07:20 PM, Martin K. Petersen wrote:
>>>>>> "Dave" == Dave Chinner <david@fromorbit.com> writes:
> 
>>> Didn't you use to wait_on_page_writeback() in page_mkwrite()?
> 
> Dave> The generic implementation of ->page_mkwrite
> Dave> (block_page_mkwrite()) which XFS uses has never had a
> Dave> wait_on_page_writeback() call in it.  There's no call in the
> Dave> generic write paths, either, hence my comment that only direct IO
> Dave> on XFS will work.
> 
> I guess that wait_on_page_writeback() was something I added when I used
> XFS for DIF testing.
> 

Do you remember some performance numbers that show degradation / sameness?

What type of workloads?

Thanks
Boaz

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Wrong DIF guard tag on ext2 write
  2010-06-07 17:22                 ` Boaz Harrosh
@ 2010-06-07 17:40                   ` Martin K. Petersen
  2010-06-08  7:15                     ` Christof Schmitt
  0 siblings, 1 reply; 79+ messages in thread
From: Martin K. Petersen @ 2010-06-07 17:40 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Martin K. Petersen, Dave Chinner, Christof Schmitt, linux-scsi,
	linux-kernel, linux-fsdevel

>>>>> "Boaz" == Boaz Harrosh <bharrosh@panasas.com> writes:

Boaz> Do you remember some performance numbers that show degradation /
Boaz> sameness?

Boaz> What type of workloads?

I haven't been using XFS much for over a year.  I'm using an internal
async I/O tool and btrfs for most of my DIX/DIF testing these days.

But my original changes were along the lines of what Jan mentioned
earlier (hooking into page_mkwrite and waiting for writeback; I could
have sworn that I only did it for ext[23] and that XFS waited out of the
box, but git proves me wrong).  Anyway, I'll try to get some benchmarking
happening later this week.

This won't fix things completely, though.  ext2fs, for instance,
frequently changes metadata buffers in flight so it trips the guard
check in no time.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Wrong DIF guard tag on ext2 write
  2010-06-07 17:40                   ` Martin K. Petersen
@ 2010-06-08  7:15                     ` Christof Schmitt
  2010-06-08  8:47                       ` Dave Chinner
  0 siblings, 1 reply; 79+ messages in thread
From: Christof Schmitt @ 2010-06-08  7:15 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Boaz Harrosh, Dave Chinner, linux-scsi, linux-kernel,
	linux-fsdevel

On Mon, Jun 07, 2010 at 01:40:21PM -0400, Martin K. Petersen wrote:
> >>>>> "Boaz" == Boaz Harrosh <bharrosh@panasas.com> writes:
> 
> Boaz> Do you remember some performance numbers that show degradation /
> Boaz> sameness?
> 
> Boaz> What type of workloads?
> 
> I haven't been using XFS much for over a year.  I'm using an internal
> async I/O tool and btrfs for most of my DIX/DIF testing these days.
> 
> But my original changes were along the lines of what Jan mentioned
> earlier (hooking into page_mkwrite and waiting for writeback; I could
> have sworn that I only did it for ext[23] and that XFS waited out of the
> box, but git proves me wrong).  Anyway, I'll try to get some benchmarking
> happening later this week.

Is there a patch with this change available somewhere? It might be
useful to patch a kernel with this XFS change for reliable DIF/DIX
testing.

> This won't fix things completely, though.  ext2fs, for instance,
> frequently changes metadata buffers in flight so it trips the guard
> check in no time.
> 
> -- 
> Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Wrong DIF guard tag on ext2 write
  2010-06-01 13:58             ` Chris Mason
@ 2010-06-08  7:18               ` Christof Schmitt
  0 siblings, 0 replies; 79+ messages in thread
From: Christof Schmitt @ 2010-06-08  7:18 UTC (permalink / raw)
  To: Chris Mason, Boaz Harrosh, James Bottomley, Martin K. Petersen,
	linux-scsi

On Tue, Jun 01, 2010 at 09:58:18AM -0400, Chris Mason wrote:
> On Tue, Jun 01, 2010 at 03:50:59PM +0200, Christof Schmitt wrote:
> > On Tue, Jun 01, 2010 at 09:03:25AM -0400, Chris Mason wrote:
> > > On Tue, Jun 01, 2010 at 12:30:42PM +0200, Christof Schmitt wrote:
> > > > On Mon, May 31, 2010 at 06:30:05PM +0300, Boaz Harrosh wrote:
> > > > > On 05/31/2010 06:01 PM, James Bottomley wrote:
> > > > > > On Mon, 2010-05-31 at 10:20 -0400, Martin K. Petersen wrote:
> > > > > >>>>>>> "Christof" == Christof Schmitt <christof.schmitt@de.ibm.com> writes:
> > > > > >>
> > > > > >> Christof> Since the guard tags are created in Linux, it seems that the
> > > > > >> Christof> data attached to the write request changes between the
> > > > > >> Christof> generation in bio_integrity_generate and the call to
> > > > > >> Christof> sd_prep_fn.
> > > > > >>
> > > > > >> Yep, known bug.  Page writeback locking is messed up for buffer_head
> > > > > >> users.  The extNfs folks volunteered to look into this a while back but
> > > > > >> I don't think they have found the time yet.
> > > > > >>
> > > > > >>
> > > > > >> Christof> Using ext3 or ext4 instead of ext2 does not show the problem.
> > > > > >>
> > > > > >> Last I looked there were still code paths in ext3 and ext4 that
> > > > > >> permitted pages to be changed during flight.  I guess you've just been
> > > > > >> lucky.
> > > > > > 
> > > > > > Pages have always been modifiable in flight.  The OS guarantees they'll
> > > > > > be rewritten, so the drivers can drop them if they detect the problem.
> > > > > > This is identical to the iscsi checksum issue (iscsi adds a checksum
> > > > > > because it doesn't trust TCP/IP and if the checksum is generated in
> > > > > > software, there's time between generation and page transmission for the
> > > > > > alteration to occur).  The solution in the iscsi case was not to
> > > > > > complain if the page is still marked dirty.
> > > > > > 
> > > > > 
> > > > > And also why RAID1 and RAID4/5/6 need the data bounced. I wish VFS
> > > > > would prevent data writing given a device queue flag that requests
> > > > > it. So all these devices and modes could just flag the VFS/filesystems
> > > > > that: "please don't allow concurrent writes, otherwise I need to copy data"
> > > > > 
> > > > > From what Chris Mason has said before, all the mechanics are there, and it's
> > > > > what btrfs is doing. Though I don't know how myself?
> > > > 
> > > > I also tested with btrfs and invalid guard tags in writes have been
> > > > encountered as well (again in 2.6.34). The only difference is that no
> > > > error was reported to userspace, although this might be a
> > > > configuration issue.
> > > 
> > > This would be a btrfs bug.  We have strict checks in place that are
> > > supposed to prevent buffers changing while in flight.  What was the
> > > workload that triggered this problem?
> > 
> > I am running an internal test tool that creates files with a known
> > pattern until the disk is full, reads the data to verify if the
> > pattern is still intact, removes the files and starts over.
> 
> Ok, is the lba in the output the sector offset?  We can map that to a
> btrfs block and figure out what it was.
> 
> Btrfs never complains about the IO error?  We really should explode.

Right now, I don't have a system available to continue with this. When
I have a chance to run it again, I can try the latest rc kernel to see
if there is any difference.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Wrong DIF guard tag on ext2 write
  2010-06-08  7:15                     ` Christof Schmitt
@ 2010-06-08  8:47                       ` Dave Chinner
  2010-06-08  8:52                         ` Nick Piggin
  0 siblings, 1 reply; 79+ messages in thread
From: Dave Chinner @ 2010-06-08  8:47 UTC (permalink / raw)
  To: Christof Schmitt
  Cc: Martin K. Petersen, Boaz Harrosh, linux-scsi, linux-kernel,
	linux-fsdevel

On Tue, Jun 08, 2010 at 09:15:35AM +0200, Christof Schmitt wrote:
> On Mon, Jun 07, 2010 at 01:40:21PM -0400, Martin K. Petersen wrote:
> > >>>>> "Boaz" == Boaz Harrosh <bharrosh@panasas.com> writes:
> > 
> > Boaz> Do you remember some performance numbers that show degradation /
> > Boaz> sameness?
> > 
> > Boaz> What type of workloads?
> > 
> > I haven't been using XFS much for over a year.  I'm using an internal
> > async I/O tool and btrfs for most of my DIX/DIF testing these days.
> > 
> > But my original changes were along the lines of what Jan mentioned
> > earlier (hooking into page_mkwrite and waiting for writeback; I could
> > have sworn that I only did it for ext[23] and that XFS waited out of the
> > box, but git proves me wrong).  Anyway, I'll try to get some benchmarking
> > happening later this week.
> 
> Is there a patch with this change available somewhere? It might be
> useful to patch a kernel with this XFS change for reliable DIF/DIX
> testing.

Adding a wait_on_page_writeback() call to page_mkwrite() won't help
by itself - you need to unmap the pages after transitioning them
from dirty to writeback but before the hardware starts processing
the IO if you want to lock out mmap writes this way....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Wrong DIF guard tag on ext2 write
  2010-06-08  8:47                       ` Dave Chinner
@ 2010-06-08  8:52                         ` Nick Piggin
  0 siblings, 0 replies; 79+ messages in thread
From: Nick Piggin @ 2010-06-08  8:52 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christof Schmitt, Martin K. Petersen, Boaz Harrosh, linux-scsi,
	linux-kernel, linux-fsdevel

On Tue, Jun 08, 2010 at 06:47:18PM +1000, Dave Chinner wrote:
> On Tue, Jun 08, 2010 at 09:15:35AM +0200, Christof Schmitt wrote:
> > > But my original changes were along the lines of what Jan mentioned
> > > earlier (hooking into page_mkwrite and waiting for writeback.  I could
> > > have sworn that I only did it for ext[23] and that XFS waited out of the
> > > box but git proves me wrong).  Anyway, I'll try to get some benchmarking
> > > happening later this week.
> > 
> > Is there a patch with this change available somewhere? It might be
> > useful to patch a kernel with this XFS change for reliable DIF/DIX
> > testing.
> 
> Adding a wait_on_page_writeback() call to page_mkwrite() won't help
> by itself - you need to unmap the pages after transitioning them
> from dirty to writeback but before the hardware starts processing
> the IO if you want to lock out mmap writes this way....

Actually as Jan pointed out, clear_page_dirty_for_io already does
this. So it seems pretty doable. get_user_pages remains a problem
(which I believe NFS just ignores) but it could be avoided by
just disallowing it.
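
For reference, the ordering in question, as a simplified writepage excerpt
(no particular filesystem; bio setup omitted):

lock_page(page);
if (clear_page_dirty_for_io(page)) {	/* write-protects via page_mkclean() */
	set_page_writeback(page);
	unlock_page(page);
	/* from here, any store faults into ->page_mkwrite, which can
	 * wait_on_page_writeback() until end_page_writeback() runs */
	submit_bio(WRITE, bio);
} else {
	unlock_page(page);
}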

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Wrong DIF guard tag on ext2 write
  2010-06-03 12:46                 ` Vladislav Bolkhovitin
@ 2010-06-09 15:58                   ` Vladislav Bolkhovitin
  0 siblings, 0 replies; 79+ messages in thread
From: Vladislav Bolkhovitin @ 2010-06-09 15:58 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: James Bottomley, Christof Schmitt, Martin K. Petersen, linux-scsi,
	linux-kernel, linux-fsdevel, Chris Mason

Vladislav Bolkhovitin, on 06/03/2010 04:46 PM wrote:
> 
> Vladislav Bolkhovitin, on 06/03/2010 04:41 PM wrote:
>> Boaz Harrosh, on 06/03/2010 04:07 PM wrote:
>>> On 06/03/2010 02:20 PM, Vladislav Bolkhovitin wrote:
>>>> There's one interesting problem here, at least theoretically, with SCSI 
>>>> or similar transports which allow a command queue depth >1 and are 
>>>> allowed to internally reorder queued requests. I don't know the FS/block 
>>>> layers sufficiently well to tell whether sending several requests for 
>>>> the same page is really possible or not, but we can see a real-life 
>>>> problem which can be well explained if it is.
>>>>
>>>> The problem would arise if the second (rewrite) request (SCSI command) 
>>>> for the same page were queued to the corresponding device before the 
>>>> original request finished. Since the device is allowed to freely reorder 
>>>> requests, there's a probability that the original write request would 
>>>> hit the permanent storage *AFTER* the retry request, hence the data 
>>>> changes it carries would be lost; hence, welcome data corruption.
>>>>
>>> I might be totally wrong here, but I think NCQ can reorder sectors, not
>>> writes. That is, if the sector is cached in device memory and a later
>>> write comes to modify the same sector, then the original should be
>>> replaced, not two values of the same sector kept in the device cache at
>>> the same time.
>>>
>>> Failing to do so is a scsi device problem.
>> SCSI devices supporting the Full task management model (almost all) and 
>> having the QUEUE ALGORITHM MODIFIER bits in the Control mode page set to 1 
>> are allowed to freely reorder any commands with the SIMPLE task attribute. 
>> If an application wants to maintain the order of some commands for such 
>> devices, it must issue them with the ORDERED task attribute and over a 
>> _single_ MPIO path to the device.
>>
>> Linux neither uses the ORDERED attribute, nor honors or enforces the 
>> QUEUE ALGORITHM MODIFIER bits in any way, nor takes care to send commands 
>> with order dependencies (overlapping writes in our case) over a single 
>> MPIO path.
>>
>>> Please note that the page-to-sector mapping is not necessarily constant.
>>> The same page might get written to a different sector next time. But FSs
>>> will have to barrier in this case.
>>>
>>>> For single parallel SCSI or SAS devices such race may look practically 
>>>> impossible, but for sophisticated clusters when many nodes pretending to 
>>>> be a single SCSI device in a load balancing configuration, it becomes 
>>>> very real.
>>>>
>>>> The real life problem we can see in an active-active DRBD-setup. In this 
>>>> configuration 2 nodes act as a single SCST-powered SCSI device and they 
>>>> both run DRBD to keep their backstorage in-sync. The initiator uses them 
>>>> as a single multipath device in an active-active round-robin 
>>>> load-balancing configuration, i.e. sends requests to both nodes in 
>>>> parallel, then DRBD takes care to replicate the requests to the other node.
>>>>
>>>> The problem is that sometimes DRBD complains about concurrent local 
>>>> writes, like:
>>>>
>>>> kernel: drbd0: scsi_tgt0[12503] Concurrent local write detected! 
>>>> [DISCARD L] new: 144072784s +8192; pending: 144072784s +8192
>>>>
>>>> This message means that DRBD detected that both nodes received 
>>>> overlapping writes on the same block(s) and DRBD can't figure out which 
>>>> one to store. This is possible only if the initiator sent the second 
>>>> write request before the first one completed.
>>> It is totally possible in today's code.
>>>
>>> DRBD should store the original command_sn of the write and discard
>>> the sector with the lower SN. It should appear as a single device
>>> to the initiator.
>> How can it find the SN? The commands were sent over _different_ MPIO 
>> paths to the device, so at the moment of the sending all the order 
>> information was lost.
>>
>> Until SCSI generally allowed to preserve ordering information between 
>> MPIO paths in such configurations the only way to maintain commands 
>> order would be queue draining. Hence, for safety all initiators working 
>> with such devices must do it.
>>
>> But looks like Linux doesn't do it, so unsafe with MPIO clusters?
> 
> I meant load balancing MPIO clusters.

Actually, if we consider the processing of exception conditions like TASK 
SET FULL status or Unit Attentions, queuing several write commands for 
the same page(s) is unsafe not only for load balancing MPIO clusters, but 
also for all other MPIO clusters as well as for single-path SCSI-transport 
devices, like regular HDDs or other SAS/FC/iSCSI/... storage.

This is because in case of an exception condition the first write command 
could be finished prematurely to deliver the exception status to the 
initiator, while the commands queued after it would be neither aborted 
nor suspended. So, on retry the first command can end up queued _after_ 
the second write command, hence the two would be executed in reverse 
order, with the resulting data corruption.

To prevent such things, the SCSI standard provides the ACA and UA 
interlock facilities, but Linux doesn't use them; a hypothetical sketch 
of requesting ACA at the CDB level follows.
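
(Hedged sketch only: per SAM/SPC, the NACA bit is bit 2 of the CONTROL
byte, the last byte of the CDB. The helper below is invented for
illustration; nothing in the Linux SCSI stack sets it today:)

/* Hypothetical helper: ask the device to enter ACA on a CHECK
 * CONDITION by setting NACA (bit 2) in the CDB CONTROL byte. */
static void scsi_cdb_request_aca(unsigned char *cdb, int cdb_len)
{
	cdb[cdb_len - 1] |= 0x04;	/* NACA = 1 */
}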

Thus, to be safe, Linux should do one of the following (a sketch of 
option 2 follows the list):

1. Either not write to pages that are under IO, and hence never queue 
retries,

2. Or queue retries only after the original write has finished.
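
(A minimal sketch of option 2, using the page-cache primitives of this
era; the resubmit_page_write() helper and its call site are invented for
illustration, this is not how any existing filesystem handles it:)

#include <linux/mm.h>
#include <linux/pagemap.h>

/* Hypothetical retry path: before queuing a new write for a page, wait
 * for any write already in flight for it, so the device never sees two
 * overlapping writes for the same blocks. */
static void resubmit_page_write(struct page *page)
{
	lock_page(page);
	wait_on_page_writeback(page);	/* serialize on the original write */
	set_page_dirty(page);		/* let writeback send it out again */
	unlock_page(page);
}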

Vlad

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Wrong DIF guard tag on ext2 write
  2010-06-03 11:20           ` Vladislav Bolkhovitin
  2010-06-03 12:07             ` Boaz Harrosh
@ 2010-07-23 17:59             ` Gennadiy Nerubayev
  2010-07-23 19:16               ` Vladislav Bolkhovitin
  1 sibling, 1 reply; 79+ messages in thread
From: Gennadiy Nerubayev @ 2010-07-23 17:59 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: James Bottomley, Christof Schmitt, Boaz Harrosh,
	Martin K. Petersen, linux-scsi, linux-kernel, linux-fsdevel,
	Chris Mason

On Thu, Jun 3, 2010 at 7:20 AM, Vladislav Bolkhovitin <vst@vlnb.net> wrote:
>
> James Bottomley, on 06/01/2010 05:27 PM wrote:
>>
>> On Tue, 2010-06-01 at 12:30 +0200, Christof Schmitt wrote:
>>>
>>> What is the best strategy to continue with the invalid guard tags on
>>> write requests? Should this be fixed in the filesystems?
>>
>> For write requests, as long as the page dirty bit is still set, it's
>> safe to drop the request, since it's already going to be repeated.  What
>> we probably want is an error code we can return that the layer that sees
>> both the request and the page flags can make the call.
>>
>>> Another idea would be to pass invalid guard tags on write requests
>>> down to the hardware, expect an "invalid guard tag" error and report
>>> it to the block layer where a new checksum is generated and the
>>> request is issued again. Basically implement a retry through the whole
>>> I/O stack. But this also sounds complicated.
>>
>> No, no ... as long as the guard tag is wrong because the fs changed the
>> page, the write request for the updated page will already be queued or
>> in-flight, so there's no need to retry.
>
> There's one interesting problem here, at least theoretically, with SCSI or similar transports which allow to have commands queue depth >1 and allowed to internally reorder queued requests. I don't know the FS/block layers sufficiently well to tell if sending several requests for the same page really possible or not, but we can see a real life problem, which can be well explained if it's possible.
>
> The problem could be if the second (rewrite) request (SCSI command) for the same page queued to the corresponding device before the original request finished. Since the device allowed to freely reorder requests, there's a probability that the original write request would hit the permanent storage *AFTER* the retry request, hence the data changes it's carrying would be lost, hence welcome data corruption.
>
> For single parallel SCSI or SAS devices such race may look practically impossible, but for sophisticated clusters when many nodes pretending to be a single SCSI device in a load balancing configuration, it becomes very real.
>
> The real life problem we can see in an active-active DRBD-setup. In this configuration 2 nodes act as a single SCST-powered SCSI device and they both run DRBD to keep their backstorage in-sync. The initiator uses them as a single multipath device in an active-active round-robin load-balancing configuration, i.e. sends requests to both nodes in parallel, then DRBD takes care to replicate the requests to the other node.
>
> The problem is that sometimes DRBD complains about concurrent local writes, like:
>
> kernel: drbd0: scsi_tgt0[12503] Concurrent local write detected! [DISCARD L] new: 144072784s +8192; pending: 144072784s +8192
>
> This message means that DRBD detected that both nodes received overlapping writes on the same block(s) and DRBD can't figure out which one to store. This is possible only if the initiator sent the second write request before the first one completed.
>
> The topic of the discussion could well explain the cause of that. But, unfortunately, people who reported it forgot to note which OS they run on the initiator, i.e. I can't say for sure it's Linux.

Sorry for the late chime in, but here's some more information of
potential interest, as I've previously inquired about this on the drbd
mailing list:

1. It only happens when using blockio mode in IET or SCST. Fileio,
nv_cache, and write_through do not generate the warnings.
2. It happens on active/passive drbd clusters (on the active node
obviously), NOT active/active. In fact, I've found that doing round
robin on active/active is a Bad Idea (tm) even with a clustered
filesystem, until at least the target software is able to synchronize
the command state of either node.
3. Linux and ESX initiators can generate the warning, but I've so far
only been able to reliably reproduce it using a Windows initiator and
sqlio or iometer benchmarks. I'll be trying again using iometer when I
have the time.
4. It only happens using a random write io workload (any block size),
with initiator threads >1, OR initiator queue depth >1. The higher
either of those is, the more spammy the warnings become.
5. The transport does not matter (reproduced with iSCSI and SRP)
6. If DRBD is disconnected (primary/unknown), the warnings are not
generated. As soon as it's reconnected (primary/secondary), the
warnings will reappear.

(sorry for the duplicate, forgot to plaintext)

-Gennadiy

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Wrong DIF guard tag on ext2 write
  2010-07-23 17:59             ` Gennadiy Nerubayev
@ 2010-07-23 19:16               ` Vladislav Bolkhovitin
  2010-07-23 20:51                 ` Gennadiy Nerubayev
  2010-07-24  1:03                 ` Dave Chinner
  0 siblings, 2 replies; 79+ messages in thread
From: Vladislav Bolkhovitin @ 2010-07-23 19:16 UTC (permalink / raw)
  To: Gennadiy Nerubayev
  Cc: James Bottomley, Christof Schmitt, Boaz Harrosh,
	Martin K. Petersen, linux-scsi, linux-kernel, linux-fsdevel,
	Chris Mason

Gennadiy Nerubayev, on 07/23/2010 09:59 PM wrote:
> On Thu, Jun 3, 2010 at 7:20 AM, Vladislav Bolkhovitin<vst@vlnb.net>  wrote:
>>
>> James Bottomley, on 06/01/2010 05:27 PM wrote:
>>>
>>> On Tue, 2010-06-01 at 12:30 +0200, Christof Schmitt wrote:
>>>>
>>>> What is the best strategy to continue with the invalid guard tags on
>>>> write requests? Should this be fixed in the filesystems?
>>>
>>> For write requests, as long as the page dirty bit is still set, it's
>>> safe to drop the request, since it's already going to be repeated.  What
>>> we probably want is an error code we can return that the layer that sees
>>> both the request and the page flags can make the call.
>>>
>>>> Another idea would be to pass invalid guard tags on write requests
>>>> down to the hardware, expect an "invalid guard tag" error and report
>>>> it to the block layer where a new checksum is generated and the
>>>> request is issued again. Basically implement a retry through the whole
>>>> I/O stack. But this also sounds complicated.
>>>
>>> No, no ... as long as the guard tag is wrong because the fs changed the
>>> page, the write request for the updated page will already be queued or
>>> in-flight, so there's no need to retry.
>>
>> There's one interesting problem here, at least theoretically, with SCSI or similar transports which allow to have commands queue depth>1 and allowed to internally reorder queued requests. I don't know the FS/block layers sufficiently well to tell if sending several requests for the same page really possible or not, but we can see a real life problem, which can be well explained if it's possible.
>>
>> The problem could be if the second (rewrite) request (SCSI command) for the same page queued to the corresponding device before the original request finished. Since the device allowed to freely reorder requests, there's a probability that the original write request would hit the permanent storage *AFTER* the retry request, hence the data changes it's carrying would be lost, hence welcome data corruption.
>>
>> For single parallel SCSI or SAS devices such race may look practically impossible, but for sophisticated clusters when many nodes pretending to be a single SCSI device in a load balancing configuration, it becomes very real.
>>
>> The real life problem we can see in an active-active DRBD-setup. In this configuration 2 nodes act as a single SCST-powered SCSI device and they both run DRBD to keep their backstorage in-sync. The initiator uses them as a single multipath device in an active-active round-robin load-balancing configuration, i.e. sends requests to both nodes in parallel, then DRBD takes care to replicate the requests to the other node.
>>
>> The problem is that sometimes DRBD complains about concurrent local writes, like:
>>
>> kernel: drbd0: scsi_tgt0[12503] Concurrent local write detected! [DISCARD L] new: 144072784s +8192; pending: 144072784s +8192
>>
>> This message means that DRBD detected that both nodes received overlapping writes on the same block(s) and DRBD can't figure out which one to store. This is possible only if the initiator sent the second write request before the first one completed.
>>
>> The topic of the discussion could well explain the cause of that. But, unfortunately, people who reported it forgot to note which OS they run on the initiator, i.e. I can't say for sure it's Linux.
>
> Sorry for the late chime in, but here's some more information of
> potential interest as I've previously inquired about this to the drbd
> mailing list:
>
> 1. It only happens when using blockio mode in IET or SCST. Fileio,
> nv_cache, and write_through do not generate the warnings.

Some explanations for those who are not familiar with the terminology:

  - "Fileio" means Linux IO stack on the target receives IO via 
vfs_readv()/vfs_writev()

  - "NV_CACHE" means all the cache synchronization requests 
(SYNCHRONIZE_CACHE, FUA) from the initiator are ignored

  - "WRITE_THROUGH" means write through, i.e. the corresponding backend 
file for the device open with O_SYNC flag.
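
(For illustration, a minimal sketch of what such a fileio write path
looks like in kernel code of this era; the helper name is invented, and
the set_fs() dance is how target drivers then passed kernel-space iovecs
to vfs_writev():)

#include <linux/fs.h>
#include <linux/uio.h>
#include <asm/uaccess.h>

/* Hypothetical fileio backend write; for WRITE_THROUGH the backing file
 * would have been opened with filp_open(name, O_RDWR | O_SYNC, 0). */
static ssize_t fileio_write(struct file *filp, struct iovec *iov,
			    unsigned long nr_segs, loff_t *pos)
{
	mm_segment_t old_fs = get_fs();
	ssize_t ret;

	set_fs(get_ds());	/* allow a kernel-space iovec */
	ret = vfs_writev(filp, (const struct iovec __user *)iov,
			 nr_segs, pos);
	set_fs(old_fs);
	return ret;
}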

> 2. It happens on active/passive drbd clusters (on the active node
> obviously), NOT active/active. In fact, I've found that doing round
> robin on active/active is a Bad Idea (tm) even with a clustered
> filesystem, until at least the target software is able to synchronize
> the command state of either node.
> 3. Linux and ESX initiators can generate the warning, but I've so far
> only been able to reliably reproduce it using a Windows initiator and
> sqlio or iometer benchmarks. I'll be trying again using iometer when I
> have the time.
> 4. It only happens using a random write io workload (any block size),
> with initiator threads>1, OR initiator queue depth>1. The higher
> either of those is, the more spammy the warnings become.
> 5. The transport does not matter (reproduced with iSCSI and SRP)
> 6. If DRBD is disconnected (primary/unknown), the warnings are not
> generated. As soon as it's reconnected (primary/secondary), the
> warnings will reappear.

It would be great if you could prove or disprove our suspicion that Linux 
can produce several simultaneous write requests for the same blocks. To 
be sure we need:

1. The initiator to be Linux. Windows and ESX are not needed for this 
particular case.

2. If you are able to reproduce it, a full description of which 
application was used on the initiator to generate the load, and in 
which mode.

The target and DRBD configuration don't matter; you can use any.

Thanks,
Vlad

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Wrong DIF guard tag on ext2 write
  2010-07-23 19:16               ` Vladislav Bolkhovitin
@ 2010-07-23 20:51                 ` Gennadiy Nerubayev
  2010-07-26 12:22                   ` Vladislav Bolkhovitin
  2010-07-24  1:03                 ` Dave Chinner
  1 sibling, 1 reply; 79+ messages in thread
From: Gennadiy Nerubayev @ 2010-07-23 20:51 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: James Bottomley, Christof Schmitt, Boaz Harrosh,
	Martin K. Petersen, linux-scsi, linux-kernel, linux-fsdevel,
	Chris Mason

On Fri, Jul 23, 2010 at 3:16 PM, Vladislav Bolkhovitin <vst@vlnb.net> wrote:
> Gennadiy Nerubayev, on 07/23/2010 09:59 PM wrote:
>>
>> On Thu, Jun 3, 2010 at 7:20 AM, Vladislav Bolkhovitin<vst@vlnb.net>
>>  wrote:
>>>
>>> James Bottomley, on 06/01/2010 05:27 PM wrote:
>>>>
>>>> On Tue, 2010-06-01 at 12:30 +0200, Christof Schmitt wrote:
>>>>>
>>>>> What is the best strategy to continue with the invalid guard tags on
>>>>> write requests? Should this be fixed in the filesystems?
>>>>
>>>> For write requests, as long as the page dirty bit is still set, it's
>>>> safe to drop the request, since it's already going to be repeated.  What
>>>> we probably want is an error code we can return that the layer that sees
>>>> both the request and the page flags can make the call.
>>>>
>>>>> Another idea would be to pass invalid guard tags on write requests
>>>>> down to the hardware, expect an "invalid guard tag" error and report
>>>>> it to the block layer where a new checksum is generated and the
>>>>> request is issued again. Basically implement a retry through the whole
>>>>> I/O stack. But this also sounds complicated.
>>>>
>>>> No, no ... as long as the guard tag is wrong because the fs changed the
>>>> page, the write request for the updated page will already be queued or
>>>> in-flight, so there's no need to retry.
>>>
>>> There's one interesting problem here, at least theoretically, with SCSI
>>> or similar transports which allow to have commands queue depth>1 and allowed
>>> to internally reorder queued requests. I don't know the FS/block layers
>>> sufficiently well to tell if sending several requests for the same page
>>> really possible or not, but we can see a real life problem, which can be
>>> well explained if it's possible.
>>>
>>> The problem could be if the second (rewrite) request (SCSI command) for
>>> the same page queued to the corresponding device before the original request
>>> finished. Since the device allowed to freely reorder requests, there's a
>>> probability that the original write request would hit the permanent storage
>>> *AFTER* the retry request, hence the data changes it's carrying would be
>>> lost, hence welcome data corruption.
>>>
>>> For single parallel SCSI or SAS devices such race may look practically
>>> impossible, but for sophisticated clusters when many nodes pretending to be
>>> a single SCSI device in a load balancing configuration, it becomes very
>>> real.
>>>
>>> The real life problem we can see in an active-active DRBD-setup. In this
>>> configuration 2 nodes act as a single SCST-powered SCSI device and they both
>>> run DRBD to keep their backstorage in-sync. The initiator uses them as a
>>> single multipath device in an active-active round-robin load-balancing
>>> configuration, i.e. sends requests to both nodes in parallel, then DRBD
>>> takes care to replicate the requests to the other node.
>>>
>>> The problem is that sometimes DRBD complains about concurrent local
>>> writes, like:
>>>
>>> kernel: drbd0: scsi_tgt0[12503] Concurrent local write detected! [DISCARD
>>> L] new: 144072784s +8192; pending: 144072784s +8192
>>>
>>> This message means that DRBD detected that both nodes received
>>> overlapping writes on the same block(s) and DRBD can't figure out which one
>>> to store. This is possible only if the initiator sent the second write
>>> request before the first one completed.
>>>
>>> The topic of the discussion could well explain the cause of that. But,
>>> unfortunately, people who reported it forgot to note which OS they run on
>>> the initiator, i.e. I can't say for sure it's Linux.
>>
>> Sorry for the late chime in, but here's some more information of
>> potential interest as I've previously inquired about this to the drbd
>> mailing list:
>>
>> 1. It only happens when using blockio mode in IET or SCST. Fileio,
>> nv_cache, and write_through do not generate the warnings.
>
> Some explanations for those who not familiar with the terminology:
>
>  - "Fileio" means Linux IO stack on the target receives IO via
> vfs_readv()/vfs_writev()
>
>  - "NV_CACHE" means all the cache synchronization requests
> (SYNCHRONIZE_CACHE, FUA) from the initiator are ignored
>
>  - "WRITE_THROUGH" means write through, i.e. the corresponding backend file
> for the device open with O_SYNC flag.
>
>> 2. It happens on active/passive drbd clusters (on the active node
>> obviously), NOT active/active. In fact, I've found that doing round
>> robin on active/active is a Bad Idea (tm) even with a clustered
>> filesystem, until at least the target software is able to synchronize
>> the command state of either node.
>> 3. Linux and ESX initiators can generate the warning, but I've so far
>> only been able to reliably reproduce it using a Windows initiator and
>> sqlio or iometer benchmarks. I'll be trying again using iometer when I
>> have the time.
>> 4. It only happens using a random write io workload (any block size),
>> with initiator threads>1, OR initiator queue depth>1. The higher
>> either of those is, the more spammy the warnings become.
>> 5. The transport does not matter (reproduced with iSCSI and SRP)
>> 6. If DRBD is disconnected (primary/unknown), the warnings are not
>> generated. As soon as it's reconnected (primary/secondary), the
>> warnings will reappear.
>
> It would be great if you prove or disprove our suspicions that Linux can
> produce several write requests for the same blocks simultaneously. To be
> sure we need:
>
> 1. The initiator is Linux. Windows and ESX are not needed for this
> particular case.
>
> 2. If you are able to reproduce it, we will need full description of which
> application used on the initiator to generate the load and in which mode.
>
> Target and DRBD configuration doesn't matter, you can use any.

I just tried, and this particular DRBD warning is not reproducible
with IO (iometer) coming from a Linux initiator (2.6.30.10). The same
iometer parameters were used as on Windows, and both the base device
and a filesystem (ext3) were tested, both negative. I'll try a
few more tests, but it seems that this is a nonissue with a Linux
initiator.

Hope that helps,

-Gennadiy

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Wrong DIF guard tag on ext2 write
  2010-07-23 19:16               ` Vladislav Bolkhovitin
  2010-07-23 20:51                 ` Gennadiy Nerubayev
@ 2010-07-24  1:03                 ` Dave Chinner
  1 sibling, 0 replies; 79+ messages in thread
From: Dave Chinner @ 2010-07-24  1:03 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: Gennadiy Nerubayev, James Bottomley, Christof Schmitt,
	Boaz Harrosh, Martin K. Petersen, linux-scsi, linux-kernel,
	linux-fsdevel, Chris Mason

On Fri, Jul 23, 2010 at 11:16:33PM +0400, Vladislav Bolkhovitin wrote:
> It would be great if you prove or disprove our suspicions that Linux
> can produce several write requests for the same blocks
> simultaneously. To be sure we need:

Just use direct IO. Case in point is the concurrent sub-block
AIO-DIO data corruption we're chasing on XFS and ext4 at the moment
where we have two concurrent unaligned write IOs to the same
filesystem block:

http://oss.sgi.com/archives/xfs/2010-07/msg00278.html
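
(A minimal sketch of that class of race, assuming libaio: two unaligned
512-byte AIO-DIO writes into the same 4 KiB filesystem block are in
flight at once. File name, sizes and the lack of error handling are all
illustrative:)

#define _GNU_SOURCE		/* O_DIRECT */
#include <libaio.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
	io_context_t ctx = 0;
	struct iocb cb[2], *cbs[2] = { &cb[0], &cb[1] };
	struct io_event ev[2];
	void *buf[2];
	int fd = open("testfile", O_RDWR | O_CREAT | O_DIRECT, 0600);
	int i;

	io_setup(8, &ctx);
	for (i = 0; i < 2; i++) {
		posix_memalign(&buf[i], 512, 512);
		memset(buf[i], 'A' + i, 512);
		/* two sub-block writes into the same 4 KiB fs block */
		io_prep_pwrite(&cb[i], fd, buf[i], 512, (long long)i * 512);
	}
	io_submit(ctx, 2, cbs);		/* both writes in flight concurrently */
	io_getevents(ctx, 2, 2, ev, NULL);
	return 0;
}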

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Wrong DIF guard tag on ext2 write
  2010-07-23 20:51                 ` Gennadiy Nerubayev
@ 2010-07-26 12:22                   ` Vladislav Bolkhovitin
  2010-07-26 17:00                     ` Gennadiy Nerubayev
  0 siblings, 1 reply; 79+ messages in thread
From: Vladislav Bolkhovitin @ 2010-07-26 12:22 UTC (permalink / raw)
  To: Gennadiy Nerubayev
  Cc: James Bottomley, Christof Schmitt, Boaz Harrosh,
	Martin K. Petersen, linux-scsi, linux-kernel, linux-fsdevel,
	Chris Mason

Gennadiy Nerubayev, on 07/24/2010 12:51 AM wrote:
>>>> The real life problem we can see in an active-active DRBD-setup. In this
>>>> configuration 2 nodes act as a single SCST-powered SCSI device and they both
>>>> run DRBD to keep their backstorage in-sync. The initiator uses them as a
>>>> single multipath device in an active-active round-robin load-balancing
>>>> configuration, i.e. sends requests to both nodes in parallel, then DRBD
>>>> takes care to replicate the requests to the other node.
>>>>
>>>> The problem is that sometimes DRBD complains about concurrent local
>>>> writes, like:
>>>>
>>>> kernel: drbd0: scsi_tgt0[12503] Concurrent local write detected! [DISCARD
>>>> L] new: 144072784s +8192; pending: 144072784s +8192
>>>>
>>>> This message means that DRBD detected that both nodes received
>>>> overlapping writes on the same block(s) and DRBD can't figure out which one
>>>> to store. This is possible only if the initiator sent the second write
>>>> request before the first one completed.
>>>>
>>>> The topic of the discussion could well explain the cause of that. But,
>>>> unfortunately, people who reported it forgot to note which OS they run on
>>>> the initiator, i.e. I can't say for sure it's Linux.
>>>
>>> Sorry for the late chime in, but here's some more information of
>>> potential interest as I've previously inquired about this to the drbd
>>> mailing list:
>>>
>>> 1. It only happens when using blockio mode in IET or SCST. Fileio,
>>> nv_cache, and write_through do not generate the warnings.
>>
>> Some explanations for those who not familiar with the terminology:
>>
>>   - "Fileio" means Linux IO stack on the target receives IO via
>> vfs_readv()/vfs_writev()
>>
>>   - "NV_CACHE" means all the cache synchronization requests
>> (SYNCHRONIZE_CACHE, FUA) from the initiator are ignored
>>
>>   - "WRITE_THROUGH" means write through, i.e. the corresponding backend file
>> for the device open with O_SYNC flag.
>>
>>> 2. It happens on active/passive drbd clusters (on the active node
>>> obviously), NOT active/active. In fact, I've found that doing round
>>> robin on active/active is a Bad Idea (tm) even with a clustered
>>> filesystem, until at least the target software is able to synchronize
>>> the command state of either node.
>>> 3. Linux and ESX initiators can generate the warning, but I've so far
>>> only been able to reliably reproduce it using a Windows initiator and
>>> sqlio or iometer benchmarks. I'll be trying again using iometer when I
>>> have the time.
>>> 4. It only happens using a random write io workload (any block size),
>>> with initiator threads>1, OR initiator queue depth>1. The higher
>>> either of those is, the more spammy the warnings become.
>>> 5. The transport does not matter (reproduced with iSCSI and SRP)
>>> 6. If DRBD is disconnected (primary/unknown), the warnings are not
>>> generated. As soon as it's reconnected (primary/secondary), the
>>> warnings will reappear.
>>
>> It would be great if you prove or disprove our suspicions that Linux can
>> produce several write requests for the same blocks simultaneously. To be
>> sure we need:
>>
>> 1. The initiator is Linux. Windows and ESX are not needed for this
>> particular case.
>>
>> 2. If you are able to reproduce it, we will need full description of which
>> application used on the initiator to generate the load and in which mode.
>>
>> Target and DRBD configuration doesn't matter, you can use any.
>
> I just tried, and this particular DRBD warning is not reproducible
> with io (iometer) coming from a Linux initiator (2.6.30.10) The same
> iometer parameters were used as on windows, and both the base device
> as well as filesystem (ext3) were tested, both negative. I'll try a
> few more tests, but it seems that this is a nonissue with a Linux
> initiator.

OK, but to be completely sure, can you also check with load generators 
other than IOmeter, please? IOmeter on Linux is a lot less effective 
than on Windows, because it uses sync IO, while we need a big multi-IO 
load to trigger the problem we are discussing, if it exists. Plus, to 
catch it we need an FS on the initiator side, not raw devices. So, 
something like fio over files on an FS, or dbench, should be more 
appropriate (a sample fio job is sketched below). Please don't use 
direct IO, to avoid the bug Dave Chinner pointed out to us.
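
(For concreteness, a fio job along those lines: buffered random writes
over files on a filesystem, with multiple jobs and a deep queue. The
particular values are only plausible guesses, not parameters anyone in
this thread actually ran:)

; randwrite.fio -- buffered (non-direct) random writes over FS files
[global]
directory=/mnt/testfs
size=1g
direct=0		; buffered IO, avoiding the AIO-DIO bug above
ioengine=libaio
iodepth=32

[randwrite]
rw=randwrite
bs=4k
numjobs=8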

Also, you mentioned above that Linux can generate the warning. Can you 
recall on which configuration (including the kernel version, the load 
application and its configuration) you have seen it?

Thanks,
Vlad

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Wrong DIF guard tag on ext2 write
  2010-07-26 12:22                   ` Vladislav Bolkhovitin
@ 2010-07-26 17:00                     ` Gennadiy Nerubayev
  2010-07-26 19:26                       ` Vladislav Bolkhovitin
  0 siblings, 1 reply; 79+ messages in thread
From: Gennadiy Nerubayev @ 2010-07-26 17:00 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: James Bottomley, Christof Schmitt, Boaz Harrosh,
	Martin K. Petersen, linux-scsi, linux-kernel, linux-fsdevel,
	Chris Mason

On Mon, Jul 26, 2010 at 8:22 AM, Vladislav Bolkhovitin <vst@vlnb.net> wrote:
> Gennadiy Nerubayev, on 07/24/2010 12:51 AM wrote:
>>>>>
>>>>> The real life problem we can see in an active-active DRBD-setup. In
>>>>> this
>>>>> configuration 2 nodes act as a single SCST-powered SCSI device and they
>>>>> both
>>>>> run DRBD to keep their backstorage in-sync. The initiator uses them as
>>>>> a
>>>>> single multipath device in an active-active round-robin load-balancing
>>>>> configuration, i.e. sends requests to both nodes in parallel, then DRBD
>>>>> takes care to replicate the requests to the other node.
>>>>>
>>>>> The problem is that sometimes DRBD complains about concurrent local
>>>>> writes, like:
>>>>>
>>>>> kernel: drbd0: scsi_tgt0[12503] Concurrent local write detected!
>>>>> [DISCARD
>>>>> L] new: 144072784s +8192; pending: 144072784s +8192
>>>>>
>>>>> This message means that DRBD detected that both nodes received
>>>>> overlapping writes on the same block(s) and DRBD can't figure out which
>>>>> one
>>>>> to store. This is possible only if the initiator sent the second write
>>>>> request before the first one completed.
>>>>>
>>>>> The topic of the discussion could well explain the cause of that. But,
>>>>> unfortunately, people who reported it forgot to note which OS they run
>>>>> on
>>>>> the initiator, i.e. I can't say for sure it's Linux.
>>>>
>>>> Sorry for the late chime in, but here's some more information of
>>>> potential interest as I've previously inquired about this to the drbd
>>>> mailing list:
>>>>
>>>> 1. It only happens when using blockio mode in IET or SCST. Fileio,
>>>> nv_cache, and write_through do not generate the warnings.
>>>
>>> Some explanations for those who not familiar with the terminology:
>>>
>>>  - "Fileio" means Linux IO stack on the target receives IO via
>>> vfs_readv()/vfs_writev()
>>>
>>>  - "NV_CACHE" means all the cache synchronization requests
>>> (SYNCHRONIZE_CACHE, FUA) from the initiator are ignored
>>>
>>>  - "WRITE_THROUGH" means write through, i.e. the corresponding backend
>>> file
>>> for the device open with O_SYNC flag.
>>>
>>>> 2. It happens on active/passive drbd clusters (on the active node
>>>> obviously), NOT active/active. In fact, I've found that doing round
>>>> robin on active/active is a Bad Idea (tm) even with a clustered
>>>> filesystem, until at least the target software is able to synchronize
>>>> the command state of either node.
>>>> 3. Linux and ESX initiators can generate the warning, but I've so far
>>>> only been able to reliably reproduce it using a Windows initiator and
>>>> sqlio or iometer benchmarks. I'll be trying again using iometer when I
>>>> have the time.
>>>> 4. It only happens using a random write io workload (any block size),
>>>> with initiator threads>1, OR initiator queue depth>1. The higher
>>>> either of those is, the more spammy the warnings become.
>>>> 5. The transport does not matter (reproduced with iSCSI and SRP)
>>>> 6. If DRBD is disconnected (primary/unknown), the warnings are not
>>>> generated. As soon as it's reconnected (primary/secondary), the
>>>> warnings will reappear.
>>>
>>> It would be great if you prove or disprove our suspicions that Linux can
>>> produce several write requests for the same blocks simultaneously. To be
>>> sure we need:
>>>
>>> 1. The initiator is Linux. Windows and ESX are not needed for this
>>> particular case.
>>>
>>> 2. If you are able to reproduce it, we will need full description of
>>> which
>>> application used on the initiator to generate the load and in which mode.
>>>
>>> Target and DRBD configuration doesn't matter, you can use any.
>>
>> I just tried, and this particular DRBD warning is not reproducible
>> with io (iometer) coming from a Linux initiator (2.6.30.10) The same
>> iometer parameters were used as on windows, and both the base device
>> as well as filesystem (ext3) were tested, both negative. I'll try a
>> few more tests, but it seems that this is a nonissue with a Linux
>> initiator.
>
> OK, but to be completely sure, can you check also with other load
> generators, than IOmeter, please? IOmeter on Linux is a lot less effective
> than on Windows, because it uses sync IO, while we need big multi-IO load to
> trigger the problem we are discussing, if it exists. Plus, to catch it we
> need an FS on the initiator side, not using raw devices. So, something like
> fio over files on FS or diskbench should be more appropriate. Please don't
> use direct IO to avoid the bug Dave Chinner pointed us out.

I tried both fio and dbench, with the same results. With fio in
particular, I think I used pretty much every possible combination of
engines, directio, and sync settings, with 8 threads, a queue depth of
32, and a random write workload.

> Also, you mentioned above about that Linux can generate the warning. Can you
> recall on which configuration, including the kernel version, the load
> application and its configuration, you have seen it?

Sorry, after double checking, it's only ESX and Windows that generate
them. The majority of the ESX virtuals in question are Windows, though
I can see some indications of ESX servers that have Linux-only
virtuals generating one here and there. It's somewhat difficult to
tell historically, and I probably would not be able to determine what
those virtuals were running at the time.

-Gennadiy

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Wrong DIF guard tag on ext2 write
  2010-07-26 17:00                     ` Gennadiy Nerubayev
@ 2010-07-26 19:26                       ` Vladislav Bolkhovitin
  0 siblings, 0 replies; 79+ messages in thread
From: Vladislav Bolkhovitin @ 2010-07-26 19:26 UTC (permalink / raw)
  To: Gennadiy Nerubayev
  Cc: James Bottomley, Christof Schmitt, Boaz Harrosh,
	Martin K. Petersen, linux-scsi, linux-kernel, linux-fsdevel,
	Chris Mason

Gennadiy Nerubayev, on 07/26/2010 09:00 PM wrote:
> On Mon, Jul 26, 2010 at 8:22 AM, Vladislav Bolkhovitin<vst@vlnb.net>  wrote:
>> Gennadiy Nerubayev, on 07/24/2010 12:51 AM wrote:
>>>>>>
>>>>>> The real life problem we can see in an active-active DRBD-setup. In
>>>>>> this
>>>>>> configuration 2 nodes act as a single SCST-powered SCSI device and they
>>>>>> both
>>>>>> run DRBD to keep their backstorage in-sync. The initiator uses them as
>>>>>> a
>>>>>> single multipath device in an active-active round-robin load-balancing
>>>>>> configuration, i.e. sends requests to both nodes in parallel, then DRBD
>>>>>> takes care to replicate the requests to the other node.
>>>>>>
>>>>>> The problem is that sometimes DRBD complains about concurrent local
>>>>>> writes, like:
>>>>>>
>>>>>> kernel: drbd0: scsi_tgt0[12503] Concurrent local write detected!
>>>>>> [DISCARD
>>>>>> L] new: 144072784s +8192; pending: 144072784s +8192
>>>>>>
>>>>>> This message means that DRBD detected that both nodes received
>>>>>> overlapping writes on the same block(s) and DRBD can't figure out which
>>>>>> one
>>>>>> to store. This is possible only if the initiator sent the second write
>>>>>> request before the first one completed.
>>>>>>
>>>>>> The topic of the discussion could well explain the cause of that. But,
>>>>>> unfortunately, people who reported it forgot to note which OS they run
>>>>>> on
>>>>>> the initiator, i.e. I can't say for sure it's Linux.
>>>>>
>>>>> Sorry for the late chime in, but here's some more information of
>>>>> potential interest as I've previously inquired about this to the drbd
>>>>> mailing list:
>>>>>
>>>>> 1. It only happens when using blockio mode in IET or SCST. Fileio,
>>>>> nv_cache, and write_through do not generate the warnings.
>>>>
>>>> Some explanations for those who not familiar with the terminology:
>>>>
>>>>   - "Fileio" means Linux IO stack on the target receives IO via
>>>> vfs_readv()/vfs_writev()
>>>>
>>>>   - "NV_CACHE" means all the cache synchronization requests
>>>> (SYNCHRONIZE_CACHE, FUA) from the initiator are ignored
>>>>
>>>>   - "WRITE_THROUGH" means write through, i.e. the corresponding backend
>>>> file
>>>> for the device open with O_SYNC flag.
>>>>
>>>>> 2. It happens on active/passive drbd clusters (on the active node
>>>>> obviously), NOT active/active. In fact, I've found that doing round
>>>>> robin on active/active is a Bad Idea (tm) even with a clustered
>>>>> filesystem, until at least the target software is able to synchronize
>>>>> the command state of either node.
>>>>> 3. Linux and ESX initiators can generate the warning, but I've so far
>>>>> only been able to reliably reproduce it using a Windows initiator and
>>>>> sqlio or iometer benchmarks. I'll be trying again using iometer when I
>>>>> have the time.
>>>>> 4. It only happens using a random write io workload (any block size),
>>>>> with initiator threads>1, OR initiator queue depth>1. The higher
>>>>> either of those is, the more spammy the warnings become.
>>>>> 5. The transport does not matter (reproduced with iSCSI and SRP)
>>>>> 6. If DRBD is disconnected (primary/unknown), the warnings are not
>>>>> generated. As soon as it's reconnected (primary/secondary), the
>>>>> warnings will reappear.
>>>>
>>>> It would be great if you prove or disprove our suspicions that Linux can
>>>> produce several write requests for the same blocks simultaneously. To be
>>>> sure we need:
>>>>
>>>> 1. The initiator is Linux. Windows and ESX are not needed for this
>>>> particular case.
>>>>
>>>> 2. If you are able to reproduce it, we will need full description of
>>>> which
>>>> application used on the initiator to generate the load and in which mode.
>>>>
>>>> Target and DRBD configuration doesn't matter, you can use any.
>>>
>>> I just tried, and this particular DRBD warning is not reproducible
>>> with io (iometer) coming from a Linux initiator (2.6.30.10) The same
>>> iometer parameters were used as on windows, and both the base device
>>> as well as filesystem (ext3) were tested, both negative. I'll try a
>>> few more tests, but it seems that this is a nonissue with a Linux
>>> initiator.
>>
>> OK, but to be completely sure, can you check also with other load
>> generators, than IOmeter, please? IOmeter on Linux is a lot less effective
>> than on Windows, because it uses sync IO, while we need big multi-IO load to
>> trigger the problem we are discussing, if it exists. Plus, to catch it we
>> need an FS on the initiator side, not using raw devices. So, something like
>> fio over files on FS or diskbench should be more appropriate. Please don't
>> use direct IO to avoid the bug Dave Chinner pointed us out.
>
> I tried both fio and dbench, with the same results. With fio in
> particular, I think I used pretty much every possible combination of
> engines, directio, and sync settings with 8 threads, 32 queue depth
> and random write workload.
>
>> Also, you mentioned above about that Linux can generate the warning. Can you
>> recall on which configuration, including the kernel version, the load
>> application and its configuration, you have seen it?
>
> Sorry, after double checking, it's only ESX and Windows that generate
> them. The majority of the ESX virtuals in question are Windows, though
> I can see some indications of ESX servers that have Linux-only
> virtuals generating one here and there. It's somewhat difficult to
> tell historically, and I probably would not be able to determine what
> those virtuals were running at the time.

OK, I see. A negative result is also a result. Now we know that Linux 
(in contrast to VMware and Windows) works well in this area.

Thank you!
Vlad

^ permalink raw reply	[flat|nested] 79+ messages in thread

end of thread, other threads:[~2010-07-26 19:26 UTC | newest]

Thread overview: 79+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-05-31 11:28 Wrong DIF guard tag on ext2 write Christof Schmitt
2010-05-31 11:34 ` Christof Schmitt
2010-05-31 14:20 ` Martin K. Petersen
2010-05-31 14:46   ` Christof Schmitt
2010-06-01 13:16     ` Martin K. Petersen
2010-06-02 13:37       ` Christof Schmitt
2010-06-02 23:20         ` Dave Chinner
2010-06-04  1:34           ` Martin K. Petersen
2010-06-04  2:32             ` Dave Chinner
2010-06-07 16:20               ` Martin K. Petersen
2010-06-07 17:22                 ` Boaz Harrosh
2010-06-07 17:40                   ` Martin K. Petersen
2010-06-08  7:15                     ` Christof Schmitt
2010-06-08  8:47                       ` Dave Chinner
2010-06-08  8:52                         ` Nick Piggin
2010-05-31 14:49   ` Nick Piggin
2010-06-01 13:17     ` Martin K. Petersen
2010-05-31 15:01   ` James Bottomley
2010-05-31 15:30     ` Boaz Harrosh
2010-05-31 15:49       ` Nick Piggin
2010-05-31 16:25         ` Boaz Harrosh
2010-06-01 13:22         ` Martin K. Petersen
2010-06-01 10:30       ` Christof Schmitt
2010-06-01 10:49         ` Boaz Harrosh
2010-06-01 13:03         ` Chris Mason
2010-06-01 13:50           ` Christof Schmitt
     [not found]           ` <20100601135059.GA21008@schmichrtp.mainz.de.ibm.com>
2010-06-01 13:58             ` Chris Mason
2010-06-08  7:18               ` Christof Schmitt
2010-06-01 14:26             ` Nick Piggin
2010-06-01 13:27         ` James Bottomley
2010-06-01 13:33           ` Chris Mason
2010-06-01 13:40             ` James Bottomley
2010-06-01 13:49               ` Chris Mason
2010-06-01 16:29                 ` Matthew Wilcox
2010-06-01 16:47                   ` Chris Mason
2010-06-01 16:54                     ` James Bottomley
2010-06-01 18:09                       ` Chris Mason
2010-06-01 18:46                         ` Nick Piggin
     [not found]                         ` <20100601184649.GE9453@laptop>
2010-06-01 19:35                           ` Chris Mason
2010-06-02  3:20                             ` Nick Piggin
2010-06-02 13:17                               ` Martin K. Petersen
2010-06-02 13:41                                 ` Nick Piggin
2010-06-03 15:46                                   ` Chris Mason
2010-06-03 16:27                                     ` Nick Piggin
     [not found]                                     ` <20100603162718.GR6822@laptop>
2010-06-04  1:46                                       ` Martin K. Petersen
2010-06-04  3:09                                         ` Nick Piggin
2010-06-04  2:02                                     ` Dave Chinner
     [not found]                                     ` <20100604020243.GE19651@dastard>
2010-06-04 15:32                                       ` Jan Kara
2010-06-04  1:30                                   ` Martin K. Petersen
2010-06-01 21:07                         ` James Bottomley
2010-06-01 22:49                           ` Chris Mason
2010-06-01 13:50               ` Martin K. Petersen
2010-06-01 14:28                 ` Nick Piggin
2010-06-01 14:32                 ` James Bottomley
2010-06-01 14:54                   ` Martin K. Petersen
2010-06-03 11:20           ` Vladislav Bolkhovitin
2010-06-03 12:07             ` Boaz Harrosh
2010-06-03 12:41               ` Vladislav Bolkhovitin
2010-06-03 12:46                 ` Vladislav Bolkhovitin
2010-06-09 15:58                   ` Vladislav Bolkhovitin
2010-06-03 13:06                 ` Boaz Harrosh
2010-06-03 13:23                   ` Vladislav Bolkhovitin
2010-07-23 17:59             ` Gennadiy Nerubayev
2010-07-23 19:16               ` Vladislav Bolkhovitin
2010-07-23 20:51                 ` Gennadiy Nerubayev
2010-07-26 12:22                   ` Vladislav Bolkhovitin
2010-07-26 17:00                     ` Gennadiy Nerubayev
2010-07-26 19:26                       ` Vladislav Bolkhovitin
2010-07-24  1:03                 ` Dave Chinner
2010-06-01  2:40     ` FUJITA Tomonori
2010-06-03 16:09 ` [LFS/VM TOPIC] Stable pages while IO (was Wrong DIF guard tag on ext2 write) Boaz Harrosh
2010-06-03 16:30   ` [Lsf10-pc] " J. Bruce Fields
2010-06-03 17:41   ` Vladislav Bolkhovitin
2010-06-04 16:23   ` Jan Kara
2010-06-04 16:30     ` [Lsf10-pc] " J. Bruce Fields
2010-06-04 17:11       ` Jan Kara
2010-06-06  9:35     ` Boaz Harrosh
2010-06-06 23:37       ` Jan Kara
2010-06-07  8:30         ` Boaz Harrosh

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).