From: Vladislav Bolkhovitin <vst@vlnb.net>
To: Boaz Harrosh <bharrosh@panasas.com>
Cc: James Bottomley <James.Bottomley@suse.de>,
Christof Schmitt <christof.schmitt@de.ibm.com>,
"Martin K. Petersen" <martin.petersen@oracle.com>,
linux-scsi@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-fsdevel@vger.kernel.org,
Chris Mason <chris.mason@oracle.com>
Subject: Re: Wrong DIF guard tag on ext2 write
Date: Wed, 09 Jun 2010 19:58:00 +0400
Message-ID: <4C0FBA08.2090502@vlnb.net>
In-Reply-To: <4C07A442.1030502@vlnb.net>

Vladislav Bolkhovitin, on 06/03/2010 04:46 PM wrote:
>
> Vladislav Bolkhovitin, on 06/03/2010 04:41 PM wrote:
>> Boaz Harrosh, on 06/03/2010 04:07 PM wrote:
>>> On 06/03/2010 02:20 PM, Vladislav Bolkhovitin wrote:
>>>> There's one interesting problem here, at least theoretically, with SCSI
>>>> or similar transports which allow to have commands queue depth >1 and
>>>> allowed to internally reorder queued requests. I don't know the FS/block
>>>> layers sufficiently well to tell if sending several requests for the
>>>> same page really possible or not, but we can see a real life problem,
>>>> which can be well explained if it's possible.
>>>>
>>>> The problem could be if the second (rewrite) request (SCSI command) for
>>>> the same page queued to the corresponding device before the original
>>>> request finished. Since the device allowed to freely reorder requests,
>>>> there's a probability that the original write request would hit the
>>>> permanent storage *AFTER* the retry request, hence the data changes it's
>>>> carrying would be lost, hence welcome data corruption.
>>>>
>>> I might be totally wrong here but I think NCQ can reorder sectors but
>>> not writes. That is if the sector is cached in device memory and a later
>>> write comes to modify the same sector then the original should be
>>> replaced not two values of the same sector be kept in device cache at the
>>> same time.
>>>
>>> Failing to do so is a scsi device problem.
>> SCSI devices supporting the Full task management model (almost all) and
>> having the QUEUE ALGORITHM MODIFIER bits in the Control mode page set to 1
>> are allowed to freely reorder any commands with the SIMPLE task attribute.
>> If an application wants to maintain the order of some commands for such
>> devices, it must issue them with the ORDERED task attribute and over a
>> _single_ MPIO path to the device.
>>
>> Linux neither uses ORDERED attribute, nor honors or enforces anyhow
>> QUEUE ALGORITHM MODIFIER bits, nor takes care to send commands with
>> order dependencies (overlapping writes in our case) over a single MPIO path.
>>
>>> Please note that page-to-sector is not necessary constant. And the same page
>>> might get written at a different sector, next time. But FSs will have to
>>> barrier in this case.
>>>
>>>> For single parallel SCSI or SAS devices such race may look practically
>>>> impossible, but for sophisticated clusters when many nodes pretending to
>>>> be a single SCSI device in a load balancing configuration, it becomes
>>>> very real.
>>>>
>>>> The real life problem we can see in an active-active DRBD-setup. In this
>>>> configuration 2 nodes act as a single SCST-powered SCSI device and they
>>>> both run DRBD to keep their backstorage in-sync. The initiator uses them
>>>> as a single multipath device in an active-active round-robin
>>>> load-balancing configuration, i.e. sends requests to both nodes in
>>>> parallel, then DRBD takes care to replicate the requests to the other node.
>>>>
>>>> The problem is that sometimes DRBD complains about concurrent local
>>>> writes, like:
>>>>
>>>> kernel: drbd0: scsi_tgt0[12503] Concurrent local write detected!
>>>> [DISCARD L] new: 144072784s +8192; pending: 144072784s +8192
>>>>
>>>> This message means that DRBD detected that both nodes received
>>>> overlapping writes on the same block(s) and DRBD can't figure out which
>>>> one to store. This is possible only if the initiator sent the second
>>>> write request before the first one completed.
>>> It is totally possible in today's code.
>>>
>>> DRBD should store the original command_sn of the write and discard
>>> the sector with the lower SN. It should appear as a single device
>>> to the initiator.
>> How can it find the SN? The commands were sent over _different_ MPIO
>> paths to the device, so at the moment of the sending all the order
>> information was lost.
>>
>> Until SCSI in general allows ordering information to be preserved between
>> MPIO paths in such configurations, the only way to maintain command
>> order is queue draining. Hence, for safety, all initiators working
>> with such devices must do it.
>>
>> But it looks like Linux doesn't do it, so is it unsafe with MPIO clusters?
>
> I meant load balancing MPIO clusters.

Actually, if we consider the processing of exception conditions such as
TASK SET FULL status or Unit Attentions, queuing several write commands for
the same page(s) is also unsafe for all other MPIO clusters, as well as
for single-path SCSI-transport devices such as regular HDDs or other
SAS/FC/iSCSI/... storage.

This is because, in case of an exception condition, the first write command
can be finished prematurely to deliver the exception condition status
to the initiator, while the commands queued after it are neither
aborted nor suspended. So, when the initiator retries the first command,
it can be queued _after_ the second write command. The two writes are then
executed in reverse order, with the resulting data corruption.
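
To make the sequence above concrete, here is a toy model of it in Python.
It is only a sketch, not kernel or SCST code: the names (run, W1, W2, v1,
v2) are made up for illustration, and it assumes that each command carries
its own copy of the data as of submission time and that the device
otherwise executes its queue in FIFO order.

```python
from collections import deque


def run(first_write_gets_exception):
    """Return the final contents of one block after two overlapping writes.

    W1 is the original write, W2 the rewrite of the same block. Both are
    queued before W1 completes (queue depth > 1).
    """
    block = None
    queue = deque([("W1", "v1"), ("W2", "v2")])
    already_rejected = False

    while queue:
        name, data = queue.popleft()
        if name == "W1" and first_write_gets_exception and not already_rejected:
            # The device finishes W1 early with an exception status
            # (e.g. TASK SET FULL). W2 stays queued: it is neither
            # aborted nor suspended.
            already_rejected = True
            # The initiator retries W1, which now sits behind W2.
            queue.append((name, data))
            continue
        block = data  # the write reaches the storage

    return block


print("no exception:  ", run(False))
print("with exception:", run(True))
```

Without the exception the block ends up holding the data of W2, as
intended. With it, the retried W1 lands last and the block holds the
stale data.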

To prevent such things, the SCSI standard provides the ACA and UA interlock
facilities, but Linux doesn't use them.

Thus, to be safe, Linux should:

1. Either not write to pages under IO, so that no retries are queued,
2. Or queue retries only after the original write has finished.

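Option 2 could look roughly like the sketch below: a write that overlaps
an in-flight write is held back until the earlier one completes. This is
an illustration only, not a proposal for the actual block layer code.
OverlapGate, submit(), complete() and the send callback are hypothetical
names, and ranges are given as (start sector, length in sectors).

```python
class OverlapGate:
    """Hold back a write while an overlapping write is still in flight."""

    def __init__(self):
        self.in_flight = []  # (start, length) of writes sent to the device
        self.deferred = []   # (start, length, payload) of held-back writes

    @staticmethod
    def _overlaps(a_start, a_len, b_start, b_len):
        return a_start < b_start + b_len and b_start < a_start + a_len

    def submit(self, start, length, payload, send):
        """Send the write now, or defer it. Returns True if it was sent."""
        blocked = any(
            self._overlaps(start, length, s, l) for s, l in self.in_flight
        ) or any(
            # Also stay behind earlier deferred writes to keep their order.
            self._overlaps(start, length, s, l) for s, l, _ in self.deferred
        )
        if blocked:
            self.deferred.append((start, length, payload))
            return False
        self.in_flight.append((start, length))
        send(start, length, payload)
        return True

    def complete(self, start, length, send):
        """Called when a write has finished; re-tries the deferred writes."""
        self.in_flight.remove((start, length))
        waiting, self.deferred = self.deferred, []
        for s, l, p in waiting:
            self.submit(s, l, p, send)


# Usage: two writes to the same range, as in the DRBD log quoted above
# (8192 bytes taken as 16 sectors of 512 bytes, which is an assumption).
sent = []
gate = OverlapGate()
send = lambda start, length, payload: sent.append(payload)

gate.submit(144072784, 16, "v1", send)   # goes out immediately
gate.submit(144072784, 16, "v2", send)   # held back, overlaps the first
gate.complete(144072784, 16, send)       # first write done, second goes out
```

With this the device never has two overlapping writes queued at the same
time, so neither reordering nor an exception condition on the first one
can reverse them. The cost is that overlapping writes are serialized.
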
Vlad