T10 WCE interpretation in Linux & device level access

linux-scsi.vger.kernel.org archive mirror

* T10 WCE interpretation in Linux & device level access
@ 2013-04-23 19:41 Ric Wheeler
  2013-04-23 20:07 ` James Bottomley
                   ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Ric Wheeler @ 2013-04-23 19:41 UTC (permalink / raw)
  To: linux-scsi@vger.kernel.org, Martin K. Petersen, James Bottomley,
	Jeff Moyer, Tejun Heo
  Cc: Mike Snitzer

For many years, we have used WCE as an indication that a device has a volatile 
write cache (not just a write cache) and used this as a trigger to send down 
SYNCHRONIZE_CACHE commands as needed.

Some arrays with non-volatile cache seem to have WCE set and simply ignore the 
command.

Some arrays with non-volatile cache seem to not set WCE.

Other arrays with non-volatile cache - our problem arrays - set WCE and do 
something horrible and slow when sent the SYNCHRONIZE_CACHE commands.

Note that for file systems, you can override this behavior by mounting with 
barriers disabled (mount -o nobarrier .....). There is currently no way to 
disable this for anything using the device directly, not through the file system.

Some applications run against block devices - not through a file system - and 
do not want to slow to a crawl when they have an array in my problem set.

Giving them a hook to ignore WCE seems to be a hack, but one that would resolve 
issues for users who don't want to wait months (years?) for us to convince the 
array vendors.

Is this a hook worth doing?

Have we hashed this out in the T10 committee?

Regards,

Ric

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: T10 WCE interpretation in Linux & device level access
  2013-04-23 19:41 T10 WCE interpretation in Linux & device level access Ric Wheeler
@ 2013-04-23 20:07 ` James Bottomley
  2013-04-23 22:39   ` Jeremy Linton
                     ` (2 more replies)
  2013-04-23 20:28 ` Douglas Gilbert
  2013-04-24 15:40 ` Douglas Gilbert
  2 siblings, 3 replies; 29+ messages in thread
From: James Bottomley @ 2013-04-23 20:07 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: linux-scsi@vger.kernel.org, Martin K. Petersen, Jeff Moyer,
	Tejun Heo, Mike Snitzer

On Tue, 2013-04-23 at 15:41 -0400, Ric Wheeler wrote:
> For many years, we have used WCE as an indication that a device has a volatile 
> write cache (not just a write cache) and used this as a trigger to send down 
> SYNCHRONIZE_CACHE commands as needed.
> 
> Some arrays with non-volatile cache seem to have WCE set and simply ignore the 
> command.

I bet they don't; they probably obey the spec.  There's a SYNC_NV bit
which if unset (which it is in our implementation) means only sync your
non-NV cache.  For a device with all NV, that equates to nop.

> Some arrays with non-volatile cache seem to not set WCE.
> 
> Others arrays with non-volatile cache - our problem arrays - set WCE and do 
> something horrible and slow when sent the SYNCHRONIZE_CACHE commands.

These arrays sound to be out of spec, so we should probably just
blacklist them.

> Note that for file systems, you can override this behavior by mounting with our 
> barriers disabled (mount -o nobarrier .....). There is currently no way do 
> disable this for anything using the device directly, not through the file system.
> 
> Some applications run against block devices - not through a file system - and 
> want not to slow to a crawl when they have an array in my problem set.
> 
> Giving them a hook to ignore WCE seems to be a hack, but one that would resolve 
> issues with users who won't want to wait months (years?) for us to convince the 
> array vendors.
> 
> Is this a hook worth doing?

We already have the ability to set the cache type in sysfs.  It tries to
do a mode select back to the array, but the USB guys want it for the
reverse problem (write back cache behind bridge declared as write
through).
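
A minimal sketch of how the sysfs cache type strings map onto the WCE and RCD
bits. It assumes the four strings used by the sd driver for the cache_type
attribute; the "temporary " prefix (change only the kernel's view, without a
mode select to the device) is understood to exist only in later kernels and
is an assumption here. parse_cache_type() is a hypothetical helper, not a
kernel interface.

```python
# Assumed to match the string table in the sd driver; verify against your kernel.
CACHE_TYPES = (
    "write through",               # WCE=0, RCD=0
    "none",                        # WCE=0, RCD=1
    "write back",                  # WCE=1, RCD=0
    "write back, no read (daft)",  # WCE=1, RCD=1
)

TEMPORARY_PREFIX = "temporary "    # assumption: accepted by later kernels only


def parse_cache_type(text):
    """Return (temporary, wce, rcd) for a cache_type string.

    Raises ValueError if the string is not one of the known cache types.
    """
    value = text.strip()
    temporary = value.startswith(TEMPORARY_PREFIX)
    if temporary:
        value = value[len(TEMPORARY_PREFIX):]
    index = CACHE_TYPES.index(value)
    rcd = bool(index & 0x01)   # bit 0 of the table index: read cache disabled
    wce = bool(index & 0x02)   # bit 1 of the table index: write cache enabled
    return temporary, wce, rcd
```

With WCE reported as 0 (either "write through" or "none"), the kernel has no
reason to send SYNCHRONIZE_CACHE to that device.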

> Have we hashed this out in the T10 committee?

SBC-3 contains everything one could wish for about handling devices with
volatile and NV cache, I thought.

James



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: T10 WCE interpretation in Linux & device level access
  2013-04-23 19:41 T10 WCE interpretation in Linux & device level access Ric Wheeler
  2013-04-23 20:07 ` James Bottomley
@ 2013-04-23 20:28 ` Douglas Gilbert
  2013-04-24 15:40 ` Douglas Gilbert
  2 siblings, 0 replies; 29+ messages in thread
From: Douglas Gilbert @ 2013-04-23 20:28 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: linux-scsi@vger.kernel.org, Martin K. Petersen, James Bottomley,
	Jeff Moyer, Tejun Heo, Mike Snitzer

On 13-04-23 03:41 PM, Ric Wheeler wrote:
>
> For many years, we have used WCE as an indication that a device has a volatile
> write cache (not just a write cache) and used this as a trigger to send down
> SYNCHRONIZE_CACHE commands as needed.
>
> Some arrays with non-volatile cache seem to have WCE set and simply ignore the
> command.
>
> Some arrays with non-volatile cache seem to not set WCE.
>
> Others arrays with non-volatile cache - our problem arrays - set WCE and do
> something horrible and slow when sent the SYNCHRONIZE_CACHE commands.

Do the problematic arrays correctly set the NV_SUP** or V_SUP
bits in the Extended Inquiry VPD pages? Does Linux take any
notice of those bits? [Guess the answer.]

If the problematic arrays correctly set the NV_SUP and Linux
sends SYNCHRONIZE_CACHE commands then Linux should be named
and shamed.
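
A rough sketch of checking those bits in an already-fetched Extended INQUIRY
Data VPD page. It assumes the layout with the page code in byte 1 (0x86) and
NV_SUP / V_SUP in byte 6, bits 1 and 0; parse_extended_inquiry() is a
hypothetical helper, and fetching the page from the device is not shown.

```python
EXTENDED_INQUIRY_PAGE = 0x86
NV_SUP = 0x02   # byte 6, bit 1 (assumed layout): non-volatile cache supported
V_SUP = 0x01    # byte 6, bit 0 (assumed layout): volatile cache supported


def parse_extended_inquiry(page):
    """Extract NV_SUP and V_SUP from a raw VPD page 0x86 buffer."""
    if len(page) < 7:
        raise ValueError("VPD page too short")
    if page[1] != EXTENDED_INQUIRY_PAGE:
        raise ValueError("not an Extended INQUIRY Data page")
    return {
        "nv_sup": bool(page[6] & NV_SUP),
        "v_sup": bool(page[6] & V_SUP),
    }
```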

> Note that for file systems, you can override this behavior by mounting with our
> barriers disabled (mount -o nobarrier .....). There is currently no way do
> disable this for anything using the device directly, not through the file system.
>
> Some applications run against block devices - not through a file system - and
> want not to slow to a crawl when they have an array in my problem set.
>
> Giving them a hook to ignore WCE seems to be a hack, but one that would resolve
> issues with users who won't want to wait months (years?) for us to convince the
> array vendors.

> Is this a hook worth doing?
>
> Have we hashed this out in the T10 committee?

There is also a NV_DIS bit in the caching mode page that if
supported would allow the NV cache to be disabled.
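
As an illustration only, the relevant bits could be pulled out of a raw
Caching mode page like this. The offsets are assumptions taken from the SBC-3
page layout as tools such as sdparm present it (WCE at byte 2 bit 2, RCD at
byte 2 bit 0, NV_DIS at byte 12 bit 0); parse_caching_mode_page() is a
hypothetical helper and the MODE SENSE that obtains the page is not shown.

```python
CACHING_MODE_PAGE = 0x08


def parse_caching_mode_page(page):
    """Return the WCE, RCD and NV_DIS bits from a raw Caching mode page.

    `page` starts at the page header (byte 0 holds the page code in bits 5..0).
    """
    if len(page) < 13:
        raise ValueError("Caching mode page too short")
    if page[0] & 0x3F != CACHING_MODE_PAGE:
        raise ValueError("not a Caching mode page")
    return {
        "wce": bool(page[2] & 0x04),      # write cache enabled
        "rcd": bool(page[2] & 0x01),      # read cache disabled
        "nv_dis": bool(page[12] & 0x01),  # non-volatile cache disabled
    }
```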

Looks like T10 supplies enough knobs, it is up to the vendor
to set them appropriately and Linux to take notice.

Doug Gilbert


** What JB called SYNC_NV ...

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: T10 WCE interpretation in Linux & device level access
  2013-04-23 20:07 ` James Bottomley
@ 2013-04-23 22:39   ` Jeremy Linton
  2013-04-24  5:44     ` Elliott, Robert (Server Storage)
  2013-04-24 11:17   ` Paolo Bonzini
  2013-04-24 11:30   ` Hannes Reinecke
  2 siblings, 1 reply; 29+ messages in thread
From: Jeremy Linton @ 2013-04-23 22:39 UTC (permalink / raw)
  To: James Bottomley
  Cc: Ric Wheeler, linux-scsi@vger.kernel.org, Martin K. Petersen,
	Jeff Moyer, Tejun Heo, Mike Snitzer, dgilbert@interlog.com

On 4/23/2013 3:07 PM, James Bottomley wrote:

> 
> I bet they don't; they probably obey the spec.  There's a SYNC_NV bit
> which if unset (which it is in our implementation) means only sync your
> non-NV cache.  For a device with all NV, that equates to nop.

	Yes, linux leaves the SYNC_NV bit unset in scsi_setup_flush_cmnd().

The draft specs, and a couple of others I have lying about, say the device
shall sync cache to medium for both volatile and non-volatile cache data if
SYNC_NV is _unset_.

With it set, the table could be more confusing!

For volatile cache blocks with SYNC_NV set "If a non-volatile cache is present,
then the device server shall synchronize to non-volatile cache or to the medium.
If a non-volatile cache is not present, then the device server shall synchronize
to the medium".

And for Non-volatile cache with it set "No Requirement"

Which to me says, don't expect any particular behavior if you set this bit and
have NV: it could flush to medium, flush to NV cache, or do nothing at all. But
it seems pretty clear that with it unset it's probably going to get synchronized
to the medium.
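
The table as summarized above can be written down as a small lookup. This is
only a restatement of the quoted text, not of the standard itself, and
sync_requirement() is a hypothetical name.

```python
def sync_requirement(sync_nv, nv_cache_present, block_in_nv_cache):
    """Where a cached block must end up after SYNCHRONIZE CACHE.

    sync_nv:            value of the SYNC_NV bit in the command
    nv_cache_present:   the device has a non-volatile cache
    block_in_nv_cache:  the block currently sits in non-volatile cache
                        (otherwise it sits in volatile cache)
    """
    if not sync_nv:
        # SYNC_NV unset: both volatile and non-volatile data go to the medium.
        return "medium"
    if block_in_nv_cache:
        # SYNC_NV set, block already in non-volatile cache.
        return "no requirement"
    if nv_cache_present:
        # SYNC_NV set, volatile block, NV cache available.
        return "non-volatile cache or medium"
    # SYNC_NV set, volatile block, no NV cache to fall back on.
    return "medium"
```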

If T10 were to do something, maybe they could stop putting bits in the docs that
aren't guaranteed to do anything (fill in rant).

As for linux, it seems the state of the spec really doesn't leave any good
options other than providing the user the ability to disable the flush_cmnd()
if the NV_SUP bit is set. Or maybe a white list (ick!)...

^ permalink raw reply	[flat|nested] 29+ messages in thread

* RE: T10 WCE interpretation in Linux & device level access
  2013-04-23 22:39   ` Jeremy Linton
@ 2013-04-24  5:44     ` Elliott, Robert (Server Storage)
  2013-04-24 11:00       ` Ric Wheeler
  2013-04-27 16:09       ` James Bottomley
  0 siblings, 2 replies; 29+ messages in thread
From: Elliott, Robert (Server Storage) @ 2013-04-24  5:44 UTC (permalink / raw)
  To: Jeremy Linton, James Bottomley
  Cc: Ric Wheeler, linux-scsi@vger.kernel.org, Martin K. Petersen,
	Jeff Moyer, Tejun Heo, Mike Snitzer, dgilbert@interlog.com

If the writeback cache is enabled (per the WCE bit in the Caching mode page),
prudent software uses the FUA bit in WRITE commands when writing metadata
and/or sends the SYNCHRONIZE CACHE command at important checkpoints to 
ensure the data is not going to be lost due to a power loss.  Some 
database software is particularly prolific at sending these commands. 

Around 2003, many RAID controllers with non-volatile writeback caches honored
the SYNCHRONIZE CACHE command, flushing the entire cache to the drives.  This
started causing timeouts as non-volatile write cache sizes grew.  Recently,
it's even causing trouble on individual disk drives with growing volatile 
write caches.

The intent of software using these commands and bits was unclear - it could be:
a) ensure data is in non-volatile cache (and will eventually be flushed) 
   or on the medium; or
b) ensure data is on the medium (so the drives are ready for removal). 

As a short-term fix, many RAID controllers assumed intent (a) and started
interpreting the SYNCHRONIZE CACHE command as a NOP and ignoring the FUA bit.  

Surprise removal of a drive from a RAID controller is risky even if software 
has run SYNCHRONIZE CACHE, since the RAID controller might be doing other
activity in the background. So, there are other reasons to justify assuming
that the user just won't do that.

Afraid of breaking software with intent (b) (which was more likely in the 
days of floppy disks, Bernoulli Boxes, and other removable block devices), 
T10 chose to clarify that the original meaning was (b) and added new 
FUA_NV and SYNC_NV bits to let software express intent (a).  The hope
was that devices would implement the bits and software would start using
them at appropriate times.

Unfortunately, the short-term fix worked well enough that it still prevails
today, and most standalone removable media block devices have disappeared.
There is not much software actually sending the FUA_NV and SYNC_NV bits 
and few devices honoring the bits per the standard.

As an SBC-3 letter ballot comment, I recently submitted T10 proposal 
13-050 (see http://www.t10.org/doc13.htm) to obsolete the SYNC_NV and 
FUA_NV bits and change the meaning of the commands without those bits
to intent (a), reflecting what the industry has actually done.

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: T10 WCE interpretation in Linux & device level access
  2013-04-24  5:44     ` Elliott, Robert (Server Storage)
@ 2013-04-24 11:00       ` Ric Wheeler
  2013-04-27 16:09       ` James Bottomley
  1 sibling, 0 replies; 29+ messages in thread
From: Ric Wheeler @ 2013-04-24 11:00 UTC (permalink / raw)
  To: Elliott, Robert (Server Storage)
  Cc: Jeremy Linton, James Bottomley, linux-scsi@vger.kernel.org,
	Martin K. Petersen, Jeff Moyer, Tejun Heo, Mike Snitzer,
	dgilbert@interlog.com, Black, David, Knight, Frederick

Hi Rob,

Comments inline below.

On 04/24/2013 01:44 AM, Elliott, Robert (Server Storage) wrote:
> If the writeback cache is enabled (per the WCE bit in the Caching mode page),
> prudent software uses the FUA bit in WRITE commands when writing metadata
> and/or sends the SYNCHRONIZE CACHE command at important checkpoints to
> ensure the data is not going to be lost due to a power loss.  Some
> database software is particularly prolific at sending these commands.
>
> Around 2003, many RAID controllers with non-volatile writeback caches honored
> the SYNCHRONIZE CACHE command, flushing the entire cache to the drives.  This
> started causing timeouts as non-volatile write cache sizes grew.  Recently,
> it's even causing trouble on individual disk drives with growing volatile
> write caches.
>
> The intent of software using these commands and bits was unclear - it could be:
> a) ensure data is in non-volatile cache (and will eventually be flushed)
>     or on the medium; or
> b) ensure data is on the medium (so the drives are ready for removal).



Linux issues SYNCHRONIZE_CACHE commands when we need to make sure that the data 
is crash safe (after a transaction commit from a file system journal, an 
explicit fsync call or a write system call with O_SYNC set).
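
From the application side, the two triggers mentioned here (an explicit fsync,
or writes on a descriptor opened with O_SYNC) look roughly like the sketch
below. durable_write() is a hypothetical helper; whether a given call ends up
as a SYNCHRONIZE_CACHE on the wire depends on the kernel's view of the
device's cache, as discussed in this thread.

```python
import os
import tempfile


def durable_write(path, data, use_o_sync=False):
    """Write `data` to `path` and request that it be crash safe.

    With use_o_sync, each write() returns only after the data is stable;
    otherwise a single fsync() is issued after the last write.
    """
    flags = os.O_WRONLY | os.O_CREAT
    if use_o_sync:
        # O_SYNC is not defined on every platform; fall back to 0 there.
        flags |= getattr(os, "O_SYNC", 0)
    fd = os.open(path, flags, 0o600)
    try:
        view = memoryview(data)
        written = 0
        while written < len(data):          # handle short writes
            written += os.write(fd, view[written:])
        if not use_o_sync:
            os.fsync(fd)                    # explicit flush request
    finally:
        os.close(fd)
    return written
```

The same calls apply to a block device node opened directly, which is the
case that "mount -o nobarrier" cannot cover.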

If the cache is nonvolatile (i.e., the target will have it after a power outage 
or reboot), we are fine - pretty much your (a) clause above.

Not sure we have thought through (or can control) how an array would handle 
pulling a drive from behind a RAID controller that has not flushed its state.
>
> As a short-term fix, many RAID controllers assumed intent (a) and started
> interpreting the SYNCHRONIZE CACHE command as a NOP and ignoring the FUA bit.

We have seen problems with some RAID controllers that leave the write cache 
enabled on back end drives - their cache is battery backed, but the cache on 
those backend drives is exposed to certain data loss on a power outage.

It would be nice if they always disabled the write cache on the backend drives 
*or* advertised WCE and propagated the SYNCHRONIZE_CACHE commands to each drive 
when we send them down.
>
> Surprise removal of a drive from a RAID controller is risky even if software
> has run SYNCHRONIZE CACHE, since the RAID controller might be doing other
> activity in the background. So, there are other reasons to justify assuming
> that the user just won't do that.
>
> Afraid of breaking software with intent (b) (which was more likely in the
> days of floppy disks, Bournelli Boxes, and other removable block devices),
> T10 chose to clarify that the original meaning was (b) and added new
> FUA_NV and SYNC_NV bits to let software express intent (a).  The hope
> was that devices would implement the bits and software would start using
> them at appropriate times.
>
> Unfortunately, the short-term fix worked well enough that it still prevails
> today, and most standalone removable media block devices have disappeared.
> There is not much software actually sending the FUA_NV and SYNC_NV bits
> and few devices honoring the bits per the standard.
>
> As an SBC-3 letter ballot comment, I recently submitted T10 proposal
> 13-050 (see http://www.t10.org/doc13.htm) to obsolete the SYNC_NV and
> FUA_NV bits and change the meaning of the commands without those bits
> to intent (a), reflecting what the industry has actually done.

This is definitely something that we should review and take into account going 
forward.

It does sound like we have a lot of confusion around WCE meaning in the storage 
industry today though, which leads me to think that we will need to allow raw 
block accessing applications to manually override our flush settings (reluctantly!).

Regards,

Ric

> -----Original Message-----
> From: linux-scsi-owner@vger.kernel.org [mailto:linux-scsi-owner@vger.kernel.org] On Behalf Of Jeremy Linton
> Sent: Tuesday, April 23, 2013 5:40 PM
> To: James Bottomley
> Cc: Ric Wheeler; linux-scsi@vger.kernel.org; Martin K. Petersen; Jeff Moyer; Tejun Heo; Mike Snitzer; dgilbert@interlog.com
> Subject: Re: T10 WCE interpretation in Linux & device level access
>
> On 4/23/2013 3:07 PM, James Bottomley wrote:
>
>> I bet they don't; they probably obey the spec.  There's a SYNC_NV bit
>> which if unset (which it is in our implementation) means only sync your
>> non-NV cache.  For a device with all NV, that equates to nop.
> 	Yes, linux leaves the SYNC_NV bit unset in scsi_setup_flush_cmnd().
>
> The draft specs, and a couple others I have laying about says: says the device
> shall sync cache to medium for both volatile and non volatile cache data if
> SYNC_NV is _unset_.
>
> With it set, the table could be more confusing!
>
> For volatile cache blocks with SYNC_NV set "If a non-volatile cache is present,
> then the device server shall synchronize to non-volatile cache or to the medium.
> If a non-volatile cache is not present, then the device server shall synchronize
> to the medium".
>
> And for Non-volatile cache with it set "No Requirement"
>
>
> Which to me says, don't expect any particular behavior if you set this bit and
> have NV it could flush to medium, flush to NV cache, or do nothing at all. But
> it seems pretty clear that with it unset its probably going to get synchronized
> to the medium.
>
>
> If T10 were to do something, maybe they could stop putting bits in the docs that
> aren't guaranteed to do anything (fill in rant).
>
> As for linux, seems the state of the spec really doesn't leave any good options
> other than provide the user the ability to disable the flush_cmnd() if  the
> NV_SUP bit is set. Or maybe a white list (ick!)...
>


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: T10 WCE interpretation in Linux & device level access
  2013-04-23 20:07 ` James Bottomley
  2013-04-23 22:39   ` Jeremy Linton
@ 2013-04-24 11:17   ` Paolo Bonzini
  2013-04-24 12:07     ` Hannes Reinecke
  2013-04-24 11:30   ` Hannes Reinecke
  2 siblings, 1 reply; 29+ messages in thread
From: Paolo Bonzini @ 2013-04-24 11:17 UTC (permalink / raw)
  To: James Bottomley
  Cc: Ric Wheeler, linux-scsi@vger.kernel.org, Martin K. Petersen,
	Jeff Moyer, Tejun Heo, Mike Snitzer

Il 23/04/2013 22:07, James Bottomley ha scritto:
> On Tue, 2013-04-23 at 15:41 -0400, Ric Wheeler wrote:
>> For many years, we have used WCE as an indication that a device has a volatile 
>> write cache (not just a write cache) and used this as a trigger to send down 
>> SYNCHRONIZE_CACHE commands as needed.
>>
>> Some arrays with non-volatile cache seem to have WCE set and simply ignore the 
>> command.
> 
> I bet they don't; they probably obey the spec.  There's a SYNC_NV bit
> which if unset (which it is in our implementation) means only sync your
> non-NV cache.  For a device with all NV, that equates to nop.

Isn't it the other way round?

SYNC_NV = 0 means "sync all your caches to the medium", and it's what we do.

SYNC_NV = 1 means "sync volatile to non-volatile", and it's what Ric wants.

So we should set SYNC_NV=1 if NV_SUP is set, perhaps only if the medium
is non-removable just to err on the safe side.
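
To make the proposal concrete, here is a sketch of building a
SYNCHRONIZE CACHE (10) CDB with that policy. It assumes the SBC-3 draft
layout of the time (opcode 0x35, SYNC_NV in byte 1 bit 2, IMMED in byte 1
bit 1); want_sync_nv() and build_sync_cache10() are hypothetical names, and
this is not what the kernel does today.

```python
import struct

SYNCHRONIZE_CACHE_10 = 0x35
IMMED = 0x02     # byte 1, bit 1: return status before the sync completes
SYNC_NV = 0x04   # byte 1, bit 2 (SBC-3 drafts): syncing to NV cache suffices


def want_sync_nv(nv_sup, removable):
    """Policy from the text: set SYNC_NV only for non-removable NV_SUP devices."""
    return bool(nv_sup) and not removable


def build_sync_cache10(lba=0, nblocks=0, immed=False, sync_nv=False):
    """Return the 10-byte CDB; nblocks=0 means 'from lba to end of device'."""
    if not 0 <= lba <= 0xFFFFFFFF:
        raise ValueError("LBA does not fit in 32 bits")
    if not 0 <= nblocks <= 0xFFFF:
        raise ValueError("block count does not fit in 16 bits")
    flags = (IMMED if immed else 0) | (SYNC_NV if sync_nv else 0)
    # opcode, flags, LBA (4 bytes), group number, block count (2 bytes), control
    return struct.pack(">BBIBHB", SYNCHRONIZE_CACHE_10, flags, lba, 0, nblocks, 0)
```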

Ric, can you check what your arrays set in VPD page 0x86 byte 6 bit 1?

Paolo

>> Some arrays with non-volatile cache seem to not set WCE.
>>
>> Others arrays with non-volatile cache - our problem arrays - set WCE and do 
>> something horrible and slow when sent the SYNCHRONIZE_CACHE commands.
> 
> These arrays sound to be out of spec, so we should probably just
> blacklist them.
> 
>> Note that for file systems, you can override this behavior by mounting with our 
>> barriers disabled (mount -o nobarrier .....). There is currently no way do 
>> disable this for anything using the device directly, not through the file system.
>>
>> Some applications run against block devices - not through a file system - and 
>> want not to slow to a crawl when they have an array in my problem set.
>>
>> Giving them a hook to ignore WCE seems to be a hack, but one that would resolve 
>> issues with users who won't want to wait months (years?) for us to convince the 
>> array vendors.
>>
>> Is this a hook worth doing?
> 
> We already have the ability to set the cache type in sysfs.  It tries to
> do a mode select back to the array, but the USB guys want it for the
> reverse problem (write back cache behind bridge declared as write
> through).
> 
>> Have we hashed this out in the T10 committee?
> 
> SBC-3 contains everything one could wish for about handling devices with
> volatile and NV cache, I thought.
> 
> James
> 


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: T10 WCE interpretation in Linux & device level access
  2013-04-23 20:07 ` James Bottomley
  2013-04-23 22:39   ` Jeremy Linton
  2013-04-24 11:17   ` Paolo Bonzini
@ 2013-04-24 11:30   ` Hannes Reinecke
  2 siblings, 0 replies; 29+ messages in thread
From: Hannes Reinecke @ 2013-04-24 11:30 UTC (permalink / raw)
  To: James Bottomley
  Cc: Ric Wheeler, linux-scsi@vger.kernel.org, Martin K. Petersen,
	Jeff Moyer, Tejun Heo, Mike Snitzer

On 04/23/2013 10:07 PM, James Bottomley wrote:
> On Tue, 2013-04-23 at 15:41 -0400, Ric Wheeler wrote:
>> For many years, we have used WCE as an indication that a device has a volatile 
>> write cache (not just a write cache) and used this as a trigger to send down 
>> SYNCHRONIZE_CACHE commands as needed.
>>
>> Some arrays with non-volatile cache seem to have WCE set and simply ignore the 
>> command.
> 
> I bet they don't; they probably obey the spec.  There's a SYNC_NV bit
> which if unset (which it is in our implementation) means only sync your
> non-NV cache.  For a device with all NV, that equates to nop.
> 
>> Some arrays with non-volatile cache seem to not set WCE.
>>
>> Others arrays with non-volatile cache - our problem arrays - set WCE and do 
>> something horrible and slow when sent the SYNCHRONIZE_CACHE commands.
> 
> These arrays sound to be out of spec, so we should probably just
> blacklist them.
> 
Don't think so.
There is no time limit for the SYNCHRONIZE_CACHE command, so the
array might take all day to write out the cache.

In fact, I've recently had a rather heated discussion with a certain
storage vendor which reserved its right to take up to several
seconds when flushing the cache.
Might be that we're in fact talking about the same one ... are we on
a naming-and-shaming policy here ?
If so I could tell you some really _interesting_ details about their
behaviour ...

Also note that SYNCHRONIZE_CACHE was always problematic; and as
we're not even setting the LBA range, we even get cross-talk
when issuing it to partitioned devices.
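
For illustration, restricting the flush to one partition would mean passing
its LBA range. The sketch below assumes the 10-byte command's 16-bit block
count field, so a larger partition has to be covered in several pieces;
sync_ranges() is a hypothetical helper and nothing like it is claimed to
exist in the kernel.

```python
MAX_BLOCKS_PER_COMMAND = 0xFFFF   # 16-bit count field in the 10-byte CDB


def sync_ranges(start_lba, nblocks):
    """Yield (lba, count) pairs covering [start_lba, start_lba + nblocks)."""
    if start_lba < 0 or nblocks < 0:
        raise ValueError("range must be non-negative")
    lba = start_lba
    remaining = nblocks
    while remaining > 0:
        count = min(remaining, MAX_BLOCKS_PER_COMMAND)
        yield lba, count
        lba += count
        remaining -= count
```

Each pair would go into the LBA and block count fields of one command; a
count of 0 is never produced, since 0 means "to the end of the device".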

>> Note that for file systems, you can override this behavior by mounting with our 
>> barriers disabled (mount -o nobarrier .....). There is currently no way do 
>> disable this for anything using the device directly, not through the file system.
>>
>> Some applications run against block devices - not through a file system - and 
>> want not to slow to a crawl when they have an array in my problem set.
>>
>> Giving them a hook to ignore WCE seems to be a hack, but one that would resolve 
>> issues with users who won't want to wait months (years?) for us to convince the 
>> array vendors.
>>
>> Is this a hook worth doing?
> 
> We already have the ability to set the cache type in sysfs.  It tries to
> do a mode select back to the array, but the USB guys want it for the
> reverse problem (write back cache behind bridge declared as write
> through).
> 
You can always set the 'IMMED' bit for these arrays :-)

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: T10 WCE interpretation in Linux & device level access
  2013-04-24 11:17   ` Paolo Bonzini
@ 2013-04-24 12:07     ` Hannes Reinecke
  2013-04-24 12:08       ` Paolo Bonzini
  0 siblings, 1 reply; 29+ messages in thread
From: Hannes Reinecke @ 2013-04-24 12:07 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: James Bottomley, Ric Wheeler, linux-scsi@vger.kernel.org,
	Martin K. Petersen, Jeff Moyer, Tejun Heo, Mike Snitzer

On 04/24/2013 01:17 PM, Paolo Bonzini wrote:
> Il 23/04/2013 22:07, James Bottomley ha scritto:
>> On Tue, 2013-04-23 at 15:41 -0400, Ric Wheeler wrote:
>>> For many years, we have used WCE as an indication that a device has a volatile 
>>> write cache (not just a write cache) and used this as a trigger to send down 
>>> SYNCHRONIZE_CACHE commands as needed.
>>>
>>> Some arrays with non-volatile cache seem to have WCE set and simply ignore the 
>>> command.
>>
>> I bet they don't; they probably obey the spec.  There's a SYNC_NV bit
>> which if unset (which it is in our implementation) means only sync your
>> non-NV cache.  For a device with all NV, that equates to nop.
> 
> Isn't it the other way round?
> 
> SYNC_NV = 0 means "sync all your caches to the medium", and it's what we do.
> 
> SYNC_NV = 1 means "sync volatile to non-volatile", and it's what Ric wants.
> 
> So we should set SYNC_NV=1 if NV_SUP is set, perhaps only if the medium
> is non-removable just to err on the safe side.
> 

Or use 'WRITE_AND_VERIFY' here; that's guaranteed to hit the disk.
Plus it even has a guarantee about data consistency on the disk,
which the normal WRITE command does not have.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: T10 WCE interpretation in Linux & device level access
  2013-04-24 12:07     ` Hannes Reinecke
@ 2013-04-24 12:08       ` Paolo Bonzini
  2013-04-24 12:12         ` Hannes Reinecke
  2013-04-24 12:27         ` Ric Wheeler
  0 siblings, 2 replies; 29+ messages in thread
From: Paolo Bonzini @ 2013-04-24 12:08 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: James Bottomley, Ric Wheeler, linux-scsi@vger.kernel.org,
	Martin K. Petersen, Jeff Moyer, Tejun Heo, Mike Snitzer

Il 24/04/2013 14:07, Hannes Reinecke ha scritto:
> On 04/24/2013 01:17 PM, Paolo Bonzini wrote:
>> Il 23/04/2013 22:07, James Bottomley ha scritto:
>>> On Tue, 2013-04-23 at 15:41 -0400, Ric Wheeler wrote:
>>>> For many years, we have used WCE as an indication that a device has a volatile 
>>>> write cache (not just a write cache) and used this as a trigger to send down 
>>>> SYNCHRONIZE_CACHE commands as needed.
>>>>
>>>> Some arrays with non-volatile cache seem to have WCE set and simply ignore the 
>>>> command.
>>>
>>> I bet they don't; they probably obey the spec.  There's a SYNC_NV bit
>>> which if unset (which it is in our implementation) means only sync your
>>> non-NV cache.  For a device with all NV, that equates to nop.
>>
>> Isn't it the other way round?
>>
>> SYNC_NV = 0 means "sync all your caches to the medium", and it's what we do.
>>
>> SYNC_NV = 1 means "sync volatile to non-volatile", and it's what Ric wants.
>>
>> So we should set SYNC_NV=1 if NV_SUP is set, perhaps only if the medium
>> is non-removable just to err on the safe side.
> 
> Or use 'WRITE_AND_VERIFY' here; that's guaranteed to hit the disk.
> Plus it even has a guarantee about data consistency on the disk,
> which the normal WRITE command has not.

The point is to _avoid_ hitting the disk. :)

Paolo


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: T10 WCE interpretation in Linux & device level access
  2013-04-24 12:08       ` Paolo Bonzini
@ 2013-04-24 12:12         ` Hannes Reinecke
  2013-04-24 12:23           ` Paolo Bonzini
  2013-04-24 12:27           ` Mike Snitzer
  2013-04-24 12:27         ` Ric Wheeler
  1 sibling, 2 replies; 29+ messages in thread
From: Hannes Reinecke @ 2013-04-24 12:12 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: James Bottomley, Ric Wheeler, linux-scsi@vger.kernel.org,
	Martin K. Petersen, Jeff Moyer, Tejun Heo, Mike Snitzer

On 04/24/2013 02:08 PM, Paolo Bonzini wrote:
> Il 24/04/2013 14:07, Hannes Reinecke ha scritto:
>> On 04/24/2013 01:17 PM, Paolo Bonzini wrote:
>>> Il 23/04/2013 22:07, James Bottomley ha scritto:
>>>> On Tue, 2013-04-23 at 15:41 -0400, Ric Wheeler wrote:
>>>>> For many years, we have used WCE as an indication that a device has a volatile 
>>>>> write cache (not just a write cache) and used this as a trigger to send down 
>>>>> SYNCHRONIZE_CACHE commands as needed.
>>>>>
>>>>> Some arrays with non-volatile cache seem to have WCE set and simply ignore the 
>>>>> command.
>>>>
>>>> I bet they don't; they probably obey the spec.  There's a SYNC_NV bit
>>>> which if unset (which it is in our implementation) means only sync your
>>>> non-NV cache.  For a device with all NV, that equates to nop.
>>>
>>> Isn't it the other way round?
>>>
>>> SYNC_NV = 0 means "sync all your caches to the medium", and it's what we do.
>>>
>>> SYNC_NV = 1 means "sync volatile to non-volatile", and it's what Ric wants.
>>>
>>> So we should set SYNC_NV=1 if NV_SUP is set, perhaps only if the medium
>>> is non-removable just to err on the safe side.
>>
>> Or use 'WRITE_AND_VERIFY' here; that's guaranteed to hit the disk.
>> Plus it even has a guarantee about data consistency on the disk,
>> which the normal WRITE command has not.
> 
> The point is to _avoid_ hitting the disk. :)
> 
Ah. Really?

Why do we discuss SYNCHRONIZE CACHE then?
I was under the impression that we're talking various 'barriers'
(or rather 'flush' nowadays) implementations.
Which require that some data needs to be written to disk before
continuing.

Or did I somehow misread the thread?

Confused,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: T10 WCE interpretation in Linux & device level access
  2013-04-24 12:12         ` Hannes Reinecke
@ 2013-04-24 12:23           ` Paolo Bonzini
  2013-04-24 12:27           ` Mike Snitzer
  1 sibling, 0 replies; 29+ messages in thread
From: Paolo Bonzini @ 2013-04-24 12:23 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: James Bottomley, Ric Wheeler, linux-scsi@vger.kernel.org,
	Martin K. Petersen, Jeff Moyer, Tejun Heo, Mike Snitzer

On 24/04/2013 14:12, Hannes Reinecke wrote:
> On 04/24/2013 02:08 PM, Paolo Bonzini wrote:
>> On 24/04/2013 14:07, Hannes Reinecke wrote:
>>> On 04/24/2013 01:17 PM, Paolo Bonzini wrote:
>>>> On 23/04/2013 22:07, James Bottomley wrote:
>>>>> On Tue, 2013-04-23 at 15:41 -0400, Ric Wheeler wrote:
>>>>>> For many years, we have used WCE as an indication that a device has a volatile 
>>>>>> write cache (not just a write cache) and used this as a trigger to send down 
>>>>>> SYNCHRONIZE_CACHE commands as needed.
>>>>>>
>>>>>> Some arrays with non-volatile cache seem to have WCE set and simply ignore the 
>>>>>> command.
>>>>>
>>>>> I bet they don't; they probably obey the spec.  There's a SYNC_NV bit
>>>>> which if unset (which it is in our implementation) means only sync your
>>>>> non-NV cache.  For a device with all NV, that equates to nop.
>>>>
>>>> Isn't it the other way round?
>>>>
>>>> SYNC_NV = 0 means "sync all your caches to the medium", and it's what we do.
>>>>
>>>> SYNC_NV = 1 means "sync volatile to non-volatile", and it's what Ric wants.
>>>>
>>>> So we should set SYNC_NV=1 if NV_SUP is set, perhaps only if the medium
>>>> is non-removable just to err on the safe side.
>>>
>>> Or use 'WRITE_AND_VERIFY' here; that's guaranteed to hit the disk.
>>> Plus it even has a guarantee about data consistency on the disk,
>>> which the normal WRITE command has not.
>>
>> The point is to _avoid_ hitting the disk. :)
>>
> Ah. Really?
> 
> Why do we discuss SYNCHRONIZE CACHE then?

Because we do want the data to hit the non-volatile cache, just in case
the disk has both a volatile and a non-volatile cache.

Paolo
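
A minimal user-space sketch of the distinction being discussed, assuming the SBC layout of the SYNCHRONIZE CACHE (10) CDB: SYNC_NV is bit 2 of byte 1, and IMMED is bit 1. The helper name build_sync_cache10 and the macro names are illustrative only and are not kernel symbols.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SYNCHRONIZE_CACHE_10  0x35  /* opcode of SYNCHRONIZE CACHE (10) */
#define SYNC_CACHE_IMMED      0x02  /* byte 1, bit 1: return before the sync completes */
#define SYNC_CACHE_SYNC_NV    0x04  /* byte 1, bit 2: volatile -> non-volatile is enough */

/*
 * Fill a 10-byte CDB that covers the whole device: LBA 0 with a block
 * count of 0 means "from this LBA to the end of the medium".
 *
 * sync_nv == false: ask the target to sync all caches to the medium.
 * sync_nv == true:  ask only that volatile data reach non-volatile cache.
 */
static void build_sync_cache10(uint8_t cdb[10], bool sync_nv)
{
	assert(cdb != NULL);

	memset(cdb, 0, 10);
	cdb[0] = SYNCHRONIZE_CACHE_10;
	if (sync_nv)
		cdb[1] |= SYNC_CACHE_SYNC_NV;
}
```

With sync_nv set, byte 1 of the CDB is 0x04; without it the byte stays 0. Whether a given array honours the bit is a separate question.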

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: T10 WCE interpretation in Linux & device level access
  2013-04-24 12:12         ` Hannes Reinecke
  2013-04-24 12:23           ` Paolo Bonzini
@ 2013-04-24 12:27           ` Mike Snitzer
  1 sibling, 0 replies; 29+ messages in thread
From: Mike Snitzer @ 2013-04-24 12:27 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Paolo Bonzini, James Bottomley, Ric Wheeler,
	linux-scsi@vger.kernel.org, Martin K. Petersen, Jeff Moyer,
	Tejun Heo

On Wed, Apr 24 2013 at  8:12am -0400,
Hannes Reinecke <hare@suse.de> wrote:

> On 04/24/2013 02:08 PM, Paolo Bonzini wrote:
> > On 24/04/2013 14:07, Hannes Reinecke wrote:
> >> On 04/24/2013 01:17 PM, Paolo Bonzini wrote:
> >>> On 23/04/2013 22:07, James Bottomley wrote:
> >>>> On Tue, 2013-04-23 at 15:41 -0400, Ric Wheeler wrote:
> >>>>> For many years, we have used WCE as an indication that a device has a volatile 
> >>>>> write cache (not just a write cache) and used this as a trigger to send down 
> >>>>> SYNCHRONIZE_CACHE commands as needed.
> >>>>>
> >>>>> Some arrays with non-volatile cache seem to have WCE set and simply ignore the 
> >>>>> command.
> >>>>
> >>>> I bet they don't; they probably obey the spec.  There's a SYNC_NV bit
> >>>> which if unset (which it is in our implementation) means only sync your
> >>>> non-NV cache.  For a device with all NV, that equates to nop.
> >>>
> >>> Isn't it the other way round?
> >>>
> >>> SYNC_NV = 0 means "sync all your caches to the medium", and it's what we do.
> >>>
> >>> SYNC_NV = 1 means "sync volatile to non-volatile", and it's what Ric wants.
> >>>
> >>> So we should set SYNC_NV=1 if NV_SUP is set, perhaps only if the medium
> >>> is non-removable just to err on the safe side.
> >>
> >> Or use 'WRITE_AND_VERIFY' here; that's guaranteed to hit the disk.
> >> Plus it even has a guarantee about data consistency on the disk,
> >> which the normal WRITE command has not.
> > 
> > The point is to _avoid_ hitting the disk. :)
> > 
> Ah. Really?
> 
> Why do we discuss SYNCHRONIZE CACHE then?
> I was under the impression that we're talking various 'barriers'
> (or rather 'flush' nowadays) implementations.
> Which require that some data needs to be written to disk before
> continuing.
> 
> Or did I somehow misread the thread?

This thread was motivated by the fact that the storage is reporting
WCE=1 and OracleDB (with ASM) is issuing SYNCHRONIZE CACHE (via
REQ_FLUSH) which the array in question handles _very_ slowly (even
though it is battery backed).

So the question Ric had is: should we expose a new knob that allows
admins to impose WCE=0 behavior (avoiding the SYNCHRONIZE CACHE).

I'm concerned such a knob will be abused for the benefit of speed and
all data integrity caution will get thrown to the wind (much like the
nobarrier FS mount option).

Mike

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: T10 WCE interpretation in Linux & device level access
  2013-04-24 12:08       ` Paolo Bonzini
  2013-04-24 12:12         ` Hannes Reinecke
@ 2013-04-24 12:27         ` Ric Wheeler
  2013-04-24 12:57           ` Paolo Bonzini
  1 sibling, 1 reply; 29+ messages in thread
From: Ric Wheeler @ 2013-04-24 12:27 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Hannes Reinecke, James Bottomley, linux-scsi@vger.kernel.org,
	Martin K. Petersen, Jeff Moyer, Tejun Heo, Mike Snitzer,
	Black, David, Elliott, Robert (Server Storage), Knight, Frederick

On 04/24/2013 08:08 AM, Paolo Bonzini wrote:
> On 24/04/2013 14:07, Hannes Reinecke wrote:
>> On 04/24/2013 01:17 PM, Paolo Bonzini wrote:
>>> On 23/04/2013 22:07, James Bottomley wrote:
>>>> On Tue, 2013-04-23 at 15:41 -0400, Ric Wheeler wrote:
>>>>> For many years, we have used WCE as an indication that a device has a volatile
>>>>> write cache (not just a write cache) and used this as a trigger to send down
>>>>> SYNCHRONIZE_CACHE commands as needed.
>>>>>
>>>>> Some arrays with non-volatile cache seem to have WCE set and simply ignore the
>>>>> command.
>>>> I bet they don't; they probably obey the spec.  There's a SYNC_NV bit
>>>> which if unset (which it is in our implementation) means only sync your
>>>> non-NV cache.  For a device with all NV, that equates to nop.
>>> Isn't it the other way round?
>>>
>>> SYNC_NV = 0 means "sync all your caches to the medium", and it's what we do.
>>>
>>> SYNC_NV = 1 means "sync volatile to non-volatile", and it's what Ric wants.
>>>
>>> So we should set SYNC_NV=1 if NV_SUP is set, perhaps only if the medium
>>> is non-removable just to err on the safe side.
>> Or use 'WRITE_AND_VERIFY' here; that's guaranteed to hit the disk.
>> Plus it even has a guarantee about data consistency on the disk,
>> which the normal WRITE command has not.
> The point is to _avoid_ hitting the disk. :)
>
> Paolo
>

The point is to have a crash-proof version of the data acknowledged by the 
target device while letting data sit in volatile state as long as possible. To 
be even clearer, we would love to do this for a sub-range of the device but 
currently use a "big hammer" to flush the entire cache (possibly for multiple 
file systems on one target storage device).

Once we use the SYNCHRONIZE_CACHE (or CACHE_FLUSH_EXT) command, we want the data 
on that target device to be there if someone loses power.

If the device can promise this, we don't care (and don't know) how it manages 
that promise. It can leave the data on battery backed DRAM, can archive it to 
flash or any other scheme that works.

Just as importantly, we don't want to "destage" data to the back end drives if 
that is not required since it is really, really slow.

The confusion here is that various storage devices have used the standard bits 
in arbitrary ways which makes it very hard to have one clear set of rules.

Even harder to explain to end users when to use a work around (like mount -o 
nobarrier) or the proposed "ignore flushes" block level call :)

Regards,

Ric

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: T10 WCE interpretation in Linux & device level access
  2013-04-24 12:27         ` Ric Wheeler
@ 2013-04-24 12:57           ` Paolo Bonzini
  2013-04-24 14:35             ` Jeremy Linton
  0 siblings, 1 reply; 29+ messages in thread
From: Paolo Bonzini @ 2013-04-24 12:57 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Hannes Reinecke, James Bottomley, linux-scsi@vger.kernel.org,
	Martin K. Petersen, Jeff Moyer, Tejun Heo, Mike Snitzer,
	Black, David, Elliott, Robert (Server Storage), Knight, Frederick

On 24/04/2013 14:27, Ric Wheeler wrote:
>> The point is to _avoid_ hitting the disk. :)
> 
> The point is to have a crash-proof version of the data acknowledged by
> the target device while letting data sit in volatile state as long as
> possible. To be even clearer, we would love to do this for a sub-range
> of the device but currently use a "big hammer" to flush the entire cache
> (possibly for multiple file systems on one target storage device).
> 
> Once we use the SYNCHRONIZE_CACHE (or CACHE_FLUSH_EXT) command, we want
> the data on that target device to be there if someone loses power.
> 
> If the device can promise this, we don't care (and don't know) how it
> manages that promise. It can leave the data on battery backed DRAM, can
> archive it to flash or any other scheme that works.

That's exactly the point of SYNC_NV=1.

> Just as importantly, we don't want to "destage" data to the back end
> drives if that is not required since it is really, really slow.
> 
> The confusion here is that various storage devices have used the
> standard bits in arbitrary ways which makes it very hard to have one
> clear set of rules.

Also that we have ignored the problem for a long time, and it has worked
surprisingly well. :)

> Even harder to explain to end users when to use a work around (like
> mount -o nobarrier) or the proposed "ignore flushes" block level call :)

Hoping Thunderbird doesn't mangle the patch too badly, code might be
worth a thousand words... see after the sig, compile-tested only.  I have
no access to these controllers, neither the good ones nor the bad ones. :)

Paolo

--------------------- 8< -------------------------
From: Paolo Bonzini <pbonzini@redhat.com>
Subject: [PATCH] scsi: only make REQ_FLUSH flush to non-volatile cache

The point of REQ_FLUSH is to have a crash-proof version of the data
acknowledged by the target device.  We want the data on that target
device to be there if someone loses power, but we don't care (and don't
want to know) how it manages that promise.  It can leave the data on
battery backed DRAM, can archive it to flash or any other scheme that
works.

This is exactly what SYNC_NV=1 does.  Instead, SYNC_NV=0 should flush
the data to the medium, which is not desirable when we have a non-volatile
cache (except perhaps if the medium is removable).

NOT-Tested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 7992635..97ecfd9 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -800,9 +800,17 @@ static int sd_setup_write_same_cmnd(struct scsi_device *sdp, struct request *rq)
 
 static int scsi_setup_flush_cmnd(struct scsi_device *sdp, struct request *rq)
 {
+	struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
+
 	rq->timeout = SD_FLUSH_TIMEOUT;
 	rq->retries = SD_MAX_RETRIES;
 	rq->cmd[0] = SYNCHRONIZE_CACHE;
+
+	/* No need to synchronize to medium if we have a non-volatile cache,
+	 * but be safe if the medium could just go away.
+	 */
+	if (sdkp->nv_sup && !sdp->removable)
+		rq->cmd[1] |= 4; /* SYNC_NV */
 	rq->cmd_len = 10;
 
 	return scsi_setup_blk_pc_cmnd(sdp, rq);
@@ -2511,6 +2519,26 @@ static void sd_read_app_tag_own(struct scsi_disk *sdkp, unsigned char *buffer)
 }
 
 /**
+ * sd_read_extended_inquiry - Query disk device for non-volatile cache.
+ * @sdkp: disk to query
+ */
+static void sd_read_extended_inquiry(struct scsi_disk *sdkp)
+{
+	const int vpd_len = 64;
+	unsigned char *buffer = kmalloc(vpd_len, GFP_KERNEL);
+
+	if (!buffer ||
+	    /* Extended INQUIRY Data VPD */
+	    scsi_get_vpd_page(sdkp->device, 0x86, buffer, vpd_len))
+		goto out;
+
+	sdkp->nv_sup = (buffer[6] & 0x02) != 0;
+
+ out:
+	kfree(buffer);
+}
+
+/**
  * sd_read_block_limits - Query disk device for preferred I/O sizes.
  * @disk: disk to query
  */
@@ -2684,6 +2712,7 @@ static int sd_revalidate_disk(struct gendisk *disk)
 		sd_read_capacity(sdkp, buffer);
 
 		if (sd_try_extended_inquiry(sdp)) {
+			sd_read_extended_inquiry(sdkp);
 			sd_read_block_provisioning(sdkp);
 			sd_read_block_limits(sdkp);
 			sd_read_block_characteristics(sdkp);
diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index 74a1e4c..6334dfe 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -84,6 +84,7 @@ struct scsi_disk {
 	unsigned	lbpws10 : 1;
 	unsigned	lbpvpd : 1;
 	unsigned	ws16 : 1;
+	unsigned	nv_sup : 1;
 };
 #define to_scsi_disk(obj) container_of(obj,struct scsi_disk,dev)
 

> Regards,
> 
> Ric
> 
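
A stand-alone sketch of the NV_SUP check the patch performs, assuming the SPC layout of the Extended INQUIRY Data VPD page (page code 0x86 in byte 1, NV_SUP in bit 1 of byte 6). The function name vpd86_nv_sup and the macro names are made up for illustration. Unlike the patch, it also validates the page code and buffer length before looking at the bit.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define VPD_EXTENDED_INQUIRY  0x86  /* page code reported in byte 1 */
#define EXT_INQ_NV_SUP        0x02  /* byte 6, bit 1: non-volatile cache supported */

/*
 * Return true only if buf holds an Extended INQUIRY Data VPD page that
 * is long enough to contain byte 6 and has NV_SUP set.
 */
static bool vpd86_nv_sup(const uint8_t *buf, size_t len)
{
	if (buf == NULL || len < 7)
		return false;
	if (buf[1] != VPD_EXTENDED_INQUIRY)
		return false;
	return (buf[6] & EXT_INQ_NV_SUP) != 0;
}
```

A target that sets only V_SUP (bit 0 of the same byte) reports a volatile cache but no non-volatile one, so the helper returns false for it.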


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: T10 WCE interpretation in Linux & device level access
  2013-04-24 12:57           ` Paolo Bonzini
@ 2013-04-24 14:35             ` Jeremy Linton
  2013-04-24 18:20               ` Black, David
  0 siblings, 1 reply; 29+ messages in thread
From: Jeremy Linton @ 2013-04-24 14:35 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Ric Wheeler, Hannes Reinecke, James Bottomley,
	linux-scsi@vger.kernel.org, Martin K. Petersen, Jeff Moyer,
	Tejun Heo, Mike Snitzer, Black, David,
	Elliott, Robert (Server Storage), Knight, Frederick

On 4/24/2013 7:57 AM, Paolo Bonzini wrote:
>> If the device can promise this, we don't care (and don't know) how it 
>> manages that promise. It can leave the data on battery backed DRAM, can 
>> archive it to flash or any other scheme that works.
> 
> That's exactly the point of SYNC_NV=1.

	Well, it's the point, but the specification is written such that the vendors
can choose to implement it any way they wish, especially for split cache
systems where there is both volatile and non-volatile cache.

	Flushing the NV cache to medium (as is the current behavior) may not be a bad
idea anyway.

	That's because I know of a large vendor's array where the non-volatile cache
might be better described as the "sometimes" non-volatile cache. A failure to
flush the volatile portions results in the non-volatile portions being
considered invalid when power is restored. This fences the volume, and the
usual method for recovering the array is to call support and have them
invalidate the NV portions of the cache, thereby negating the whole reason for
having an NV cache. I'm sure they don't tell customers this fact when they sell
the array. When it happened in our lab I was in a state of shock for about a week.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: T10 WCE interpretation in Linux & device level access
  2013-04-23 19:41 T10 WCE interpretation in Linux & device level access Ric Wheeler
  2013-04-23 20:07 ` James Bottomley
  2013-04-23 20:28 ` Douglas Gilbert
@ 2013-04-24 15:40 ` Douglas Gilbert
  2 siblings, 0 replies; 29+ messages in thread
From: Douglas Gilbert @ 2013-04-24 15:40 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: linux-scsi@vger.kernel.org, Martin K. Petersen, James Bottomley,
	Jeff Moyer, Tejun Heo, Mike Snitzer

On 13-04-23 03:41 PM, Ric Wheeler wrote:
>
> For many years, we have used WCE as an indication that a device has a volatile
> write cache (not just a write cache) and used this as a trigger to send down
> SYNCHRONIZE_CACHE commands as needed.
>
> Some arrays with non-volatile cache seem to have WCE set and simply ignore the
> command.
>
> Some arrays with non-volatile cache seem to not set WCE.
>
> Others arrays with non-volatile cache - our problem arrays - set WCE and do
> something horrible and slow when sent the SYNCHRONIZE_CACHE commands.
>
> Note that for file systems, you can override this behavior by mounting with our
> barriers disabled (mount -o nobarrier .....). There is currently no way do
> disable this for anything using the device directly, not through the file system.
>
> Some applications run against block devices - not through a file system - and
> want not to slow to a crawl when they have an array in my problem set.
>
> Giving them a hook to ignore WCE seems to be a hack, but one that would resolve
> issues with users who won't want to wait months (years?) for us to convince the
> array vendors.
>
> Is this a hook worth doing?
>
> Have we hashed this out in the T10 committee?

Naturally I'm biased, but I tend to think the user space
is usually smarter than the kernel. That assumes skilled
users.

So if the user space issues a SYNCHRONIZE_CACHE with the
IMMED bit set and for the whole disk then the user should
have a way of forcing that command to be issued. The
assumption here is that the skilled user is about to power
down that array or pull some disks or SSDs *.

The more questionable cases are when a file system or the
block layer is issuing a barrier or some such that
translates to a SYNCHRONIZE_CACHE. That should be ignored
in some cases already discussed in this thread.

While working with SoCs I have noticed an interesting
technique. Sub-system sized sections of the memory mapped
IO space (e.g. a bank of GPIOs) can be write protected by
a simple ASCII sequence **. Attempts to change configuration
registers after write protect are ignored and an error
is noted (if anyone cares). The same ASCII sequence can be
used to un-write protect those sub-system configuration
registers. Typically on a SoC if the GPIOs are randomly
re-configured, it's game over.

Back to the SCSI world: a better solution might be if an
LLD could be informed of the reason a SCSI control command
is being issued (a sort of "come from" field). Failing that,
or in addition to it, a sysfs interface could be added to
filter out "dangerous" SCSI commands:
   echo "SC" > /sys/class/scsi_device/8:0:0:0/device/filter

   cat /sys/class/scsi_device/8:0:0:0/device/filter
FU SC

If, for whatever reason, we did ignore a SYNCHRONIZE_CACHE
command we could use vendor specific sense data (vendor=Linux)
to indicate that a command had been ignored. That could be
extended to all SCSI commands that are filtered out ***;
better that than EIO, EACCES etc.

Doug Gilbert

*   and if Linux doesn't permit this, then user might be
     advised to run another, more obedient, host OS with
     Linux running as a VM. A "pass-by" rather than a
     "pass-through" ...

**  only the configuration registers are write protected, so
     data can still be written to the GPIOs

*** like me, many pass-through users cannot see why SCSI
     commands injected to the SCSI subsystem (e.g. via
     sg or bsg) are filtered out silently by the block layer.
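
A rough sketch of how the proposed sysfs filter string might be matched against a command opcode. Everything here is hypothetical: no such "filter" attribute exists. The mapping of "SC" to SYNCHRONIZE CACHE (10) (0x35) and of "FU" to FORMAT UNIT (0x04) is a guess at what the two-letter tags stand for. The names filter_table and opcode_filtered are invented for the example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct filter_entry {
	const char *tag;   /* token as it would appear in the filter string */
	uint8_t opcode;    /* SCSI opcode the token is assumed to stand for */
};

/* Assumed meanings of the tags; the original mail does not define them. */
static const struct filter_entry filter_table[] = {
	{ "FU", 0x04 },  /* FORMAT UNIT */
	{ "SC", 0x35 },  /* SYNCHRONIZE CACHE (10) */
};

/* Return true if 'filter' (space-separated tags) lists 'opcode'. */
static bool opcode_filtered(const char *filter, uint8_t opcode)
{
	for (size_t i = 0; i < sizeof(filter_table) / sizeof(filter_table[0]); i++) {
		const char *tag = filter_table[i].tag;
		size_t n = strlen(tag);
		const char *p = filter;

		if (filter_table[i].opcode != opcode)
			continue;

		/* Accept the tag only as a whole token, not as a substring. */
		while ((p = strstr(p, tag)) != NULL) {
			bool starts = (p == filter) || p[-1] == ' ';
			char end = p[n];

			if (starts && (end == '\0' || end == ' ' || end == '\n'))
				return true;
			p++;
		}
	}
	return false;
}
```

With the filter contents "FU SC" from the example above, both opcodes would be reported as filtered. How a filtered command is then completed (for instance with Linux-specific sense data) is left open here.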

^ permalink raw reply	[flat|nested] 29+ messages in thread

* RE: T10 WCE interpretation in Linux & device level access
  2013-04-24 14:35             ` Jeremy Linton
@ 2013-04-24 18:20               ` Black, David
  2013-04-24 20:41                 ` Ric Wheeler
  0 siblings, 1 reply; 29+ messages in thread
From: Black, David @ 2013-04-24 18:20 UTC (permalink / raw)
  To: Jeremy Linton, Paolo Bonzini
  Cc: Ric Wheeler, Hannes Reinecke, James Bottomley,
	linux-scsi@vger.kernel.org, Martin K. Petersen, Jeff Moyer,
	Tejun Heo, Mike Snitzer, Elliott, Robert (Server Storage),
	Knight, Frederick, Black, David

Jeremy,

It looks like you, Paolo and Ric have hit the nail on the head here - this is
a nice summary, IMHO:

> On 4/24/2013 7:57 AM, Paolo Bonzini wrote:
>>> If the device can promise this, we don't care (and don't know) how it 
>>> manages that promise. It can leave the data on battery backed DRAM, can 
>> archive it to flash or any other scheme that works.
>> 
>> That's exactly the point of SYNC_NV=1.
>
>	Well its the point, but the specification is written such that the vendors can
> choose to implement it any way they wish, especially for split cache
> systems where there is both volatile and non volatile cache.

Independent of T10's best intentions at the time, the implementations aren't
doing what's needed or intended, and I'd guess that the SYNC_NV bit is not
being set to 1 by [other people's ;-) ] software that should be setting it
to 1 if it were paying attention to the standard.

This is further complicated by it being completely legitimate wrt the SCSI
standard to put non-volatile cache in a system and not have the SCSI interface
admit that the non-volatile cache exists (WCE=0, SYNCHRONIZE CACHE is a no-op
independent of the value of SYNC_NV).

I believe that Rob Elliott's 13-050 proposal to obsolete SYNC_NV and re-specify
SYNCHRONIZE CACHE to make all data non-volatile by whatever means the target
chooses is what T10 should do, and that matches Ric's summary:

>>> If the device can promise this, we don't care (and don't know) how it 
>>> manages that promise. It can leave the data on battery backed DRAM, can 
>> archive it to flash or any other scheme that works.

Beyond that, attempting to manage drive removal from storage systems via the
SCSI interface with standard commands is a waste of time and effort, IMHO.
In a serious storage array (and even some fairly simple RAID controllers), some
vendor-specific "magic" is needed to get the array (or controller) to prepare
so that the drive can be removed cleanly.  To oversimplify, it's not enough to
flush data to the drive; the array or controller is stateful, and hence has
to be told to "forget" the drive, where "forget" involves things that are
rather implementation-specific.

Thanks,
--David
----------------------------------------------------
David L. Black, Distinguished Engineer
EMC Corporation, 176 South St., Hopkinton, MA  01748
+1 (508) 293-7953             FAX: +1 (508) 293-7786
david.black@emc.com        Mobile: +1 (978) 394-7754
----------------------------------------------------

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: T10 WCE interpretation in Linux & device level access
  2013-04-24 18:20               ` Black, David
@ 2013-04-24 20:41                 ` Ric Wheeler
  2013-04-24 21:02                   ` James Bottomley
  0 siblings, 1 reply; 29+ messages in thread
From: Ric Wheeler @ 2013-04-24 20:41 UTC (permalink / raw)
  To: Black, David
  Cc: Jeremy Linton, Paolo Bonzini, Ric Wheeler, Hannes Reinecke,
	James Bottomley, linux-scsi@vger.kernel.org, Martin K. Petersen,
	Jeff Moyer, Tejun Heo, Mike Snitzer,
	Elliott, Robert (Server Storage), Knight, Frederick

On 04/24/2013 02:20 PM, Black, David wrote:
> Jeremy,
>
> It looks like, you, Paolo and Ric have hit the nail on the head here - this is
> a nice summary, IMHO:
>
>> On 4/24/2013 7:57 AM, Paolo Bonzini wrote:
>>>> If the device can promise this, we don't care (and don't know) how it
>>>> manages that promise. It can leave the data on battery backed DRAM, can
>>> archive it to flash or any other scheme that works.
>>>
>>> That's exactly the point of SYNC_NV=1.
>> 	Well its the point, but the specification is written such that the vendors can
>> choose to implement it any way they wish, especially for split cache
>> systems where there is both volatile and non volatile cache.
> Independent of T10's best intentions at the time, the implementations aren't
> doing what's needed or intended, and I'd guess that the SYNC_NV bit is not
> being set to 1 by [other people's ;-) ] software that should be setting it
> to 1 if it were paying attention to the standard.
>
> This is further complicated by it being completely legitimate wrt the SCSI
> standard to put non-volatile cache in a system and not have the SCSI interface
> admit that the non-volatile cache exists (WCE=0, SYNCHRONIZE CACHE is a no-op
> independent of the value of SYNC_NV).
>
> I believe that Rob Elliot's 13-050 proposal to obsolete SYNC_NV and re-specify
> SYNCHRONIZE CACHE to make all data non-volatile by whatever means the target
> chooses is what T10 should do, and that matches Ric's summary:
>
>>>> If the device can promise this, we don't care (and don't know) how it
>>>> manages that promise. It can leave the data on battery backed DRAM, can
>>> archive it to flash or any other scheme that works.
> Beyond that, attempting to manage drive removal from storage systems via the
> SCSI interface with standard commands is a waste of time and effort, IMHO.
> In a serious storage array (and even some fairly simple RAID controllers), some
> vendor-specific "magic" is needed to get the array (or controller) to prepare
> so that the drive can be removed cleanly.  To oversimplify, it's not enough to
> flush data to the drive; the array or controller is stateful, and hence has
> to be told to "forget" the drive, where "forget" involves things that are
> rather implementation-specific.
>
> Thanks,
> --David
>

So I think that leaves us with some arrays that might benefit from Paolo's 
proposed patch, but we will almost certainly still need to be able to "ignore 
flushes" for some DBs that access the block device directly, etc....

Ric


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: T10 WCE interpretation in Linux & device level access
  2013-04-24 20:41                 ` Ric Wheeler
@ 2013-04-24 21:02                   ` James Bottomley
  2013-04-24 21:54                     ` Paolo Bonzini
  0 siblings, 1 reply; 29+ messages in thread
From: James Bottomley @ 2013-04-24 21:02 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Black, David, Jeremy Linton, Paolo Bonzini, Ric Wheeler,
	Hannes Reinecke, linux-scsi@vger.kernel.org, Martin K. Petersen,
	Jeff Moyer, Tejun Heo, Mike Snitzer,
	Elliott, Robert (Server Storage), Knight, Frederick

On Wed, 2013-04-24 at 16:41 -0400, Ric Wheeler wrote:
> On 04/24/2013 02:20 PM, Black, David wrote:
> > Jeremy,
> >
> > It looks like, you, Paolo and Ric have hit the nail on the head here - this is
> > a nice summary, IMHO:
> >
> >> On 4/24/2013 7:57 AM, Paolo Bonzini wrote:
> >>>> If the device can promise this, we don't care (and don't know) how it
> >>>> manages that promise. It can leave the data on battery backed DRAM, can
> >>> archive it to flash or any other scheme that works.
> >>>
> >>> That's exactly the point of SYNC_NV=1.
> >> 	Well its the point, but the specification is written such that the vendors can
> >> choose to implement it any way they wish, especially for split cache
> >> systems where there is both volatile and non volatile cache.
> > Independent of T10's best intentions at the time, the implementations aren't
> > doing what's needed or intended, and I'd guess that the SYNC_NV bit is not
> > being set to 1 by [other people's ;-) ] software that should be setting it
> > to 1 if it were paying attention to the standard.
> >
> > This is further complicated by it being completely legitimate wrt the SCSI
> > standard to put non-volatile cache in a system and not have the SCSI interface
> > admit that the non-volatile cache exists (WCE=0, SYNCHRONIZE CACHE is a no-op
> > independent of the value of SYNC_NV).
> >
> > I believe that Rob Elliot's 13-050 proposal to obsolete SYNC_NV and re-specify
> > SYNCHRONIZE CACHE to make all data non-volatile by whatever means the target
> > chooses is what T10 should do, and that matches Ric's summary:
> >
> >>>> If the device can promise this, we don't care (and don't know) how it
> >>>> manages that promise. It can leave the data on battery backed DRAM, can
> >>> archive it to flash or any other scheme that works.
> > Beyond that, attempting to manage drive removal from storage systems via the
> > SCSI interface with standard commands is a waste of time and effort, IMHO.
> > In a serious storage array (and even some fairly simple RAID controllers), some
> > vendor-specific "magic" is needed to get the array (or controller) to prepare
> > so that the drive can be removed cleanly.  To oversimplify, it's not enough to
> > flush data to the drive; the array or controller is stateful, and hence has
> > to be told to "forget" the drive, where "forget" involves things that are
> > rather implementation-specific.
> >
> > Thanks,
> > --David
> >
> 
> So I think that leaves us with some arrays that might benefit from Paolo's 
> proposed patch, but almost certainly still will need to be able to "ignore 
> flushes" for some block device accessing DB's, etc....

That just leaves us with random standards behaviour.  Let's permit the
deterministic thing instead for the distros.  It kills two birds with
one stone because we can set WCE for the stupid UAS devices that clear
it wrongly as well.

For those who don't read code well, you add a temporary prefix to the
cache set in

echo xxx > /sys/class/scsi_disk/<disk>/cache_type

and it will set the flags for the lifetime of the current kernel, but
won't try to do a mode select to make them permanent.

James

---

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 7992635..af16e88 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -142,6 +142,7 @@ sd_store_cache_type(struct device *dev, struct device_attribute *attr,
 	char *buffer_data;
 	struct scsi_mode_data data;
 	struct scsi_sense_hdr sshdr;
+	static const char temp[] = "temporary ";
 	int len;
 
 	if (sdp->type != TYPE_DISK)
@@ -150,6 +151,13 @@ sd_store_cache_type(struct device *dev, struct device_attribute *attr,
 		 * it's not worth the risk */
 		return -EINVAL;
 
+	if (strncmp(buf, temp, sizeof(temp) - 1) == 0) {
+		buf += sizeof(temp) - 1;
+		sdkp->cache_override = 1;
+	} else {
+		sdkp->cache_override = 0;
+	}
+
 	for (i = 0; i < ARRAY_SIZE(sd_cache_types); i++) {
 		len = strlen(sd_cache_types[i]);
 		if (strncmp(sd_cache_types[i], buf, len) == 0 &&
@@ -162,6 +170,13 @@ sd_store_cache_type(struct device *dev, struct device_attribute *attr,
 		return -EINVAL;
 	rcd = ct & 0x01 ? 1 : 0;
 	wce = ct & 0x02 ? 1 : 0;
+
+	if (sdkp->cache_override) {
+		sdkp->WCE = wce;
+		sdkp->RCD = rcd;
+		return count;
+	}
+
 	if (scsi_mode_sense(sdp, 0x08, 8, buffer, sizeof(buffer), SD_TIMEOUT,
 			    SD_MAX_RETRIES, &data, NULL))
 		return -EINVAL;
@@ -2319,6 +2334,10 @@ sd_read_cache_type(struct scsi_disk *sdkp, unsigned char *buffer)
 	int old_rcd = sdkp->RCD;
 	int old_dpofua = sdkp->DPOFUA;
 
+
+	if (sdkp->cache_override)
+		return;
+		
 	first_len = 4;
 	if (sdp->skip_ms_page_8) {
 		if (sdp->type == TYPE_RBC)
@@ -2812,6 +2831,7 @@ static void sd_probe_async(void *data, async_cookie_t cookie)
 	sdkp->capacity = 0;
 	sdkp->media_present = 1;
 	sdkp->write_prot = 0;
+	sdkp->cache_override = 0;
 	sdkp->WCE = 0;
 	sdkp->RCD = 0;
 	sdkp->ATO = 0;
diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index 74a1e4c..2386aeb 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -73,6 +73,7 @@ struct scsi_disk {
 	u8		protection_type;/* Data Integrity Field */
 	u8		provisioning_mode;
 	unsigned	ATO : 1;	/* state of disk ATO bit */
+	unsigned	cache_override : 1; /* temp override of WCE,RCD */
 	unsigned	WCE : 1;	/* state of disk WCE bit */
 	unsigned	RCD : 1;	/* state of disk RCD bit, unused */
 	unsigned	DPOFUA : 1;	/* state of disk DPOFUA bit */
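
The prefix check in sd_store_cache_type() can be tried out in userspace.
What follows is only a sketch: strip_temporary() is a hypothetical helper
name, not something in sd.c. It assumes the prefix is held in a char array,
because sizeof() yields the string size only for an array; applied to a
const char * it yields the size of the pointer instead.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical userspace helper mirroring the prefix check in the patch.
 * Returns the policy string that follows the optional "temporary " prefix
 * and reports through *override whether the prefix was present.
 */
static const char *strip_temporary(const char *buf, int *override)
{
	/* An array, not a pointer, so sizeof() counts the characters plus NUL. */
	static const char temp[] = "temporary ";
	const size_t prefix_len = sizeof(temp) - 1;

	if (strncmp(buf, temp, prefix_len) == 0) {
		*override = 1;
		return buf + prefix_len;
	}
	*override = 0;
	return buf;
}
```

With this, "temporary write back" comes back as "write back" with the
override flag set, and a plain "write back" comes back unchanged with the
flag clear, which is the case that still goes on to the mode select.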




* Re: T10 WCE interpretation in Linux & device level access
  2013-04-24 21:02                   ` James Bottomley
@ 2013-04-24 21:54                     ` Paolo Bonzini
  2013-04-24 22:09                       ` James Bottomley
  0 siblings, 1 reply; 29+ messages in thread
From: Paolo Bonzini @ 2013-04-24 21:54 UTC (permalink / raw)
  To: James Bottomley
  Cc: Ric Wheeler, Black, David, Jeremy Linton, Ric Wheeler,
	Hannes Reinecke, linux-scsi@vger.kernel.org, Martin K. Petersen,
	Jeff Moyer, Tejun Heo, Mike Snitzer,
	Elliott, Robert (Server Storage), Knight, Frederick

Il 24/04/2013 23:02, James Bottomley ha scritto:
> That just leaves us with random standards behaviour.  Let's permit the
> deterministic thing instead for the distros.  It kills two birds with
> one stone because we can set WCE for the stupid UAS devices that clear
> it wrongly as well.
> 
> For those who don't read code well, you add a temporary prefix to the
> cache set in
> 
> echo xxx > /sys/class/scsi_disk/<disk>/cache_type
> 
> and it will set the flags for the lifetime of the current kernel, but
> won't try to do a mode select to make them permanent.

Having the knob is useful indeed.  I don't like the "temporary" name
though, because "temporary write-through" doesn't sound like it can eat
data on a power loss.  What about "force" or "assume"?

Also, this would be in addition to my patch (when tested), right?

Paolo


* Re: T10 WCE interpretation in Linux & device level access
  2013-04-24 21:54                     ` Paolo Bonzini
@ 2013-04-24 22:09                       ` James Bottomley
  2013-04-24 22:36                         ` Ric Wheeler
  2013-04-25  1:32                         ` Martin K. Petersen
  0 siblings, 2 replies; 29+ messages in thread
From: James Bottomley @ 2013-04-24 22:09 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Ric Wheeler, Black, David, Jeremy Linton, Ric Wheeler,
	Hannes Reinecke, linux-scsi@vger.kernel.org, Martin K. Petersen,
	Jeff Moyer, Tejun Heo, Mike Snitzer,
	Elliott, Robert (Server Storage), Knight, Frederick

On Wed, 2013-04-24 at 23:54 +0200, Paolo Bonzini wrote:
> Il 24/04/2013 23:02, James Bottomley ha scritto:
> > That just leaves us with random standards behaviour.  Let's permit the
> > deterministic thing instead for the distros.  It kills two birds with
> > one stone because we can set WCE for the stupid UAS devices that clear
> > it wrongly as well.
> > 
> > For those who don't read code well, you add a temporary prefix to the
> > cache set in
> > 
> > echo xxx > /sys/class/scsi_disk/<disk>/cache_type
> > 
> > and it will set the flags for the lifetime of the current kernel, but
> > won't try to do a mode select to make them permanent.
> 
> Having the knob is useful indeed.  I don't like the "temporary" name
> though, because "temporary write-through" doesn't sound like it can eat
> data on a power loss.  What about "force" or "assume"?

I'm fairly ambivalent, except not force.  The default behaviour is to do
the mode select, so force seems to imply that as well, except it won't.
I don't see a difference between assume and temporary.

> Also, this would be in addition to my patch (when tested), right?

Not really ... given T10's deprecation I don't think we want to touch
anything to do with SYNC_NV because it just adds to the uncertainty
about what will actually happen.  Giving the ability to control WCE (and
RCD) fixes all the problems raised so far.

James




* Re: T10 WCE interpretation in Linux & device level access
  2013-04-24 22:09                       ` James Bottomley
@ 2013-04-24 22:36                         ` Ric Wheeler
  2013-04-24 22:46                           ` James Bottomley
  2013-04-25  1:32                         ` Martin K. Petersen
  1 sibling, 1 reply; 29+ messages in thread
From: Ric Wheeler @ 2013-04-24 22:36 UTC (permalink / raw)
  To: James Bottomley
  Cc: Paolo Bonzini, Ric Wheeler, Black, David, Jeremy Linton,
	Hannes Reinecke, linux-scsi@vger.kernel.org, Martin K. Petersen,
	Jeff Moyer, Tejun Heo, Mike Snitzer,
	Elliott, Robert (Server Storage), Knight, Frederick

On 04/24/2013 06:09 PM, James Bottomley wrote:
> On Wed, 2013-04-24 at 23:54 +0200, Paolo Bonzini wrote:
>> Il 24/04/2013 23:02, James Bottomley ha scritto:
>>> That just leaves us with random standards behaviour.  Let's permit the
>>> deterministic thing instead for the distros.  It kills two birds with
>>> one stone because we can set WCE for the stupid UAS devices that clear
>>> it wrongly as well.
>>>
>>> For those who don't read code well, you add a temporary prefix to the
>>> cache set in
>>>
>>> echo xxx > /sys/class/scsi_disk/<disk>/cache_type
>>>
>>> and it will set the flags for the lifetime of the current kernel, but
>>> won't try to do a mode select to make them permanent.
>> Having the knob is useful indeed.  I don't like the "temporary" name
>> though, because "temporary write-through" doesn't sound like it can eat
>> data on a power loss.  What about "force" or "assume"?
> I'm fairly ambivalent, except not force.  The default behaviour is to do
> the mode select, so force seems to imply that as well, except it won't.
> I don't see a difference between assume and temporary.
>
>> Also, this would be in addition to my patch (when tested), right?
> Not really ... given T10's deprecation I don't think we want to touch
> anything to do with SYNC_NV because it just adds to the uncertainty
> about what will actually happen.  Giving the ability to control WCE (and
> RCD) fixes all the problems raised so far.
>
> James
>
>

Why are we turning off the RCD bit in this? Not sure it matters, but we only 
should care about WCE (and the dirty write cache data)?

Thanks!

Ric



* Re: T10 WCE interpretation in Linux & device level access
  2013-04-24 22:36                         ` Ric Wheeler
@ 2013-04-24 22:46                           ` James Bottomley
  2013-04-25 11:35                             ` Ric Wheeler
  0 siblings, 1 reply; 29+ messages in thread
From: James Bottomley @ 2013-04-24 22:46 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Paolo Bonzini, Ric Wheeler, Black, David, Jeremy Linton,
	Hannes Reinecke, linux-scsi@vger.kernel.org, Martin K. Petersen,
	Jeff Moyer, Tejun Heo, Mike Snitzer,
	Elliott, Robert (Server Storage), Knight, Frederick

On Wed, 2013-04-24 at 18:36 -0400, Ric Wheeler wrote:
> On 04/24/2013 06:09 PM, James Bottomley wrote:
> > On Wed, 2013-04-24 at 23:54 +0200, Paolo Bonzini wrote:
> >> Il 24/04/2013 23:02, James Bottomley ha scritto:
> >>> That just leaves us with random standards behaviour.  Let's permit the
> >>> deterministic thing instead for the distros.  It kills two birds with
> >>> one stone because we can set WCE for the stupid UAS devices that clear
> >>> it wrongly as well.
> >>>
> >>> For those who don't read code well, you add a temporary prefix to the
> >>> cache set in
> >>>
> >>> echo xxx > /sys/class/scsi_disk/<disk>/cache_type
> >>>
> >>> and it will set the flags for the lifetime of the current kernel, but
> >>> won't try to do a mode select to make them permanent.
> >> Having the knob is useful indeed.  I don't like the "temporary" name
> >> though, because "temporary write-through" doesn't sound like it can eat
> >> data on a power loss.  What about "force" or "assume"?
> > I'm fairly ambivalent, except not force.  The default behaviour is to do
> > the mode select, so force seems to imply that as well, except it won't.
> > I don't see a difference between assume and temporary.
> >
> >> Also, this would be in addition to my patch (when tested), right?
> > Not really ... given T10's deprecation I don't think we want to touch
> > anything to do with SYNC_NV because it just adds to the uncertainty
> > about what will actually happen.  Giving the ability to control WCE (and
> > RCD) fixes all the problems raised so far.
> >
> 
> Why are we turning off the RCD bit in this? Not sure it matters, but we only 
> should care about WCE (and the dirty write cache data)?

Well, it's in the code.  Cache policy is a combination of those two
bits.  The cache type takes a cache policy string, ergo it must update
both.  We don't do anything with it because having a write back cache
and no cache at all is transparent to us.

James




* Re: T10 WCE interpretation in Linux & device level access
  2013-04-24 22:09                       ` James Bottomley
  2013-04-24 22:36                         ` Ric Wheeler
@ 2013-04-25  1:32                         ` Martin K. Petersen
  2013-04-27  6:03                           ` Paolo Bonzini
  1 sibling, 1 reply; 29+ messages in thread
From: Martin K. Petersen @ 2013-04-25  1:32 UTC (permalink / raw)
  To: James Bottomley
  Cc: Paolo Bonzini, Ric Wheeler, Black, David, Jeremy Linton,
	Ric Wheeler, Hannes Reinecke, linux-scsi@vger.kernel.org,
	Jeff Moyer, Tejun Heo, Mike Snitzer,
	Elliott, Robert (Server Storage), Knight, Frederick

>>>>> "James" == James Bottomley <James.Bottomley@HansenPartnership.com> writes:

James> I'm fairly ambivalent, except not force.  The default behaviour
James> is to do the mode select, so force seems to imply that as well,
James> except it won't.  I don't see a difference between assume and
James> temporary.

I'm ok with your patch. And a strong believer in not altering the
SYNCHRONIZE CACHE behavior that's been rigorously tested in the field by
adding SYNC_NV to the mix.

James> Not really ... given T10's deprecation I don't think we want to
James> touch anything to do with SYNC_NV because it just adds to the
James> uncertainty about what will actually happen.  

Yep.

James> Giving the ability to control WCE (and RCD) fixes all the
James> problems raised so far.

If there are devices that would truly benefit from SYNC_NV we could add
a sync_nv parameter to scsi_disk's sysfs that could be used to set that
bit when issuing flush_cmnd.

But it would be something we would do manually on a per-device basis and
not something that is automatically keyed off of NV_SUP (SYNC_NV doesn't
require NV_SUP, btw., so that's not even a valid check).
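
For reference, a rough sketch of what setting that bit would look like when
the flush CDB is built. build_sync_cache_cdb() and its sync_nv argument are
hypothetical names, not existing sd.c code, and the bit positions are as I
read them in the SBC drafts that still define SYNC_NV (byte 1, bit 2, next
to IMMED in bit 1), so treat them as an assumption to check against the spec.

```c
#include <stdint.h>
#include <string.h>

#define SYNCHRONIZE_CACHE_10	0x35	/* SYNCHRONIZE CACHE(10) opcode */
#define SYNC_CACHE_IMMED	0x02	/* byte 1, bit 1 (not used here) */
#define SYNC_CACHE_SYNC_NV	0x04	/* byte 1, bit 2 (assumed position) */

/*
 * Hypothetical helper: fill a 10-byte CDB that covers the whole device
 * (LBA 0, NUMBER OF LOGICAL BLOCKS 0) and optionally sets SYNC_NV.
 */
static void build_sync_cache_cdb(uint8_t cdb[10], int sync_nv)
{
	memset(cdb, 0, 10);
	cdb[0] = SYNCHRONIZE_CACHE_10;
	if (sync_nv)
		cdb[1] |= SYNC_CACHE_SYNC_NV;
}
```

The sync_nv argument would come from the per-device sysfs setting, so the
default CDB stays exactly what is sent today.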

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: T10 WCE interpretation in Linux & device level access
  2013-04-24 22:46                           ` James Bottomley
@ 2013-04-25 11:35                             ` Ric Wheeler
  2013-04-25 14:12                               ` James Bottomley
  0 siblings, 1 reply; 29+ messages in thread
From: Ric Wheeler @ 2013-04-25 11:35 UTC (permalink / raw)
  To: James Bottomley
  Cc: Paolo Bonzini, Ric Wheeler, Black, David, Jeremy Linton,
	Hannes Reinecke, linux-scsi@vger.kernel.org, Martin K. Petersen,
	Jeff Moyer, Tejun Heo, Mike Snitzer,
	Elliott, Robert (Server Storage), Knight, Frederick,
	Martin K. Petersen

On 04/24/2013 06:46 PM, James Bottomley wrote:
> On Wed, 2013-04-24 at 18:36 -0400, Ric Wheeler wrote:
>> On 04/24/2013 06:09 PM, James Bottomley wrote:
>>> On Wed, 2013-04-24 at 23:54 +0200, Paolo Bonzini wrote:
>>>> Il 24/04/2013 23:02, James Bottomley ha scritto:
>>>>> That just leaves us with random standards behaviour.  Let's permit the
>>>>> deterministic thing instead for the distros.  It kills two birds with
>>>>> one stone because we can set WCE for the stupid UAS devices that clear
>>>>> it wrongly as well.
>>>>>
>>>>> For those who don't read code well, you add a temporary prefix to the
>>>>> cache set in
>>>>>
>>>>> echo xxx > /sys/class/scsi_disk/<disk>/cache_type
>>>>>
>>>>> and it will set the flags for the lifetime of the current kernel, but
>>>>> won't try to do a mode select to make them permanent.
>>>> Having the knob is useful indeed.  I don't like the "temporary" name
>>>> though, because "temporary write-through" doesn't sound like it can eat
>>>> data on a power loss.  What about "force" or "assume"?
>>> I'm fairly ambivalent, except not force.  The default behaviour is to do
>>> the mode select, so force seems to imply that as well, except it won't.
>>> I don't see a difference between assume and temporary.
>>>
>>>> Also, this would be in addition to my patch (when tested), right?
>>> Not really ... given T10's deprecation I don't think we want to touch
>>> anything to do with SYNC_NV because it just adds to the uncertainty
>>> about what will actually happen.  Giving the ability to control WCE (and
>>> RCD) fixes all the problems raised so far.
>>>
>> Why are we turning off the RCD bit in this? Not sure it matters, but we only
>> should care about WCE (and the dirty write cache data)?
> Well, it's in the code.  Cache policy is a combination of those two
> bits.  The cache type takes a cache policy string, ergo it must update
> both.  We don't do anything with it because having a write back cache
> and no cache at all is transparent to us.
>
> James
>
>

It was pointed out to me that RCD is "Read Cache Disable" so by setting it to 
zero, we are enabling the read cache (not that we ever look at this bit or send 
it down). The WCE bit is "write cache enable" so the polarity of the bits is 
inverted.

Should be fine regardless :)

Ric



* Re: T10 WCE interpretation in Linux & device level access
  2013-04-25 11:35                             ` Ric Wheeler
@ 2013-04-25 14:12                               ` James Bottomley
  0 siblings, 0 replies; 29+ messages in thread
From: James Bottomley @ 2013-04-25 14:12 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Paolo Bonzini, Ric Wheeler, Black, David, Jeremy Linton,
	Hannes Reinecke, linux-scsi@vger.kernel.org, Martin K. Petersen,
	Jeff Moyer, Tejun Heo, Mike Snitzer,
	Elliott, Robert (Server Storage), Knight, Frederick,
	Martin K. Petersen

On Thu, 2013-04-25 at 07:35 -0400, Ric Wheeler wrote:
> It was pointed out to me that RCD is "Read Cache Disable" so by
> setting it to 
> zero, we are enabling the read cache (not that we ever look at this
> bit or send 
> it down). The WCE bit is "write cache enable" so the polarity of the
> bits is 
> inverted.
> 
> Should be fine regardless :)

Just look at the code.  We already do the right thing.

The whole reason for combining both into cache policy is that half the
world got the individual bits wrong.  We translate from an
understandable cache policy to and from WCE/RCD speak.
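
For anyone who would rather not dig through sd.c, the translation can be
sketched in userspace roughly as below. The strings follow the order of
sd_cache_types[]; policy_from_bits() and bits_from_policy() are hypothetical
names used only for this illustration.

```c
#include <stddef.h>
#include <string.h>

/* Same order as sd_cache_types[] in sd.c: index = RCD + 2 * WCE. */
static const char *const cache_types[] = {
	"write through",		/* WCE=0 RCD=0 */
	"none",				/* WCE=0 RCD=1 */
	"write back",			/* WCE=1 RCD=0 */
	"write back, no read (daft)",	/* WCE=1 RCD=1 */
};

/* Bits to policy string. */
static const char *policy_from_bits(int wce, int rcd)
{
	return cache_types[(rcd ? 1 : 0) + 2 * (wce ? 1 : 0)];
}

/* Policy string to bits; returns 0 on success, -1 for an unknown policy. */
static int bits_from_policy(const char *policy, int *wce, int *rcd)
{
	size_t ct;

	for (ct = 0; ct < sizeof(cache_types) / sizeof(cache_types[0]); ct++) {
		if (strcmp(cache_types[ct], policy) == 0) {
			*rcd = (ct & 0x01) ? 1 : 0;
			*wce = (ct & 0x02) ? 1 : 0;
			return 0;
		}
	}
	return -1;
}
```

Note that "none" is the entry with RCD set and WCE clear, which is where the
inverted polarity of the two bits shows up.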

James




* Re: T10 WCE interpretation in Linux & device level access
  2013-04-25  1:32                         ` Martin K. Petersen
@ 2013-04-27  6:03                           ` Paolo Bonzini
  0 siblings, 0 replies; 29+ messages in thread
From: Paolo Bonzini @ 2013-04-27  6:03 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: James Bottomley, Ric Wheeler, Black, David, Jeremy Linton,
	Ric Wheeler, Hannes Reinecke, linux-scsi@vger.kernel.org,
	Jeff Moyer, Tejun Heo, Mike Snitzer,
	Elliott, Robert (Server Storage), Knight, Frederick

Il 25/04/2013 03:32, Martin K. Petersen ha scritto:
> I'm ok with your patch. And a strong believer in not altering the
> SYNCHRONIZE CACHE behavior that's been rigorously tested in the field by
> adding SYNC_NV to the mix.

SYNC_NV is absolutely necessary for targets that (a) have both volatile
and non-volatile cache, and (b) actually follow the standards behavior
for SYNC_NV=0.

I used NV_SUP as a guess that the SYNC_NV bit is supported; perhaps
V_SUP && NV_SUP is a better guess.

Paolo


* Re: T10 WCE interpretation in Linux & device level access
  2013-04-24  5:44     ` Elliott, Robert (Server Storage)
  2013-04-24 11:00       ` Ric Wheeler
@ 2013-04-27 16:09       ` James Bottomley
  1 sibling, 0 replies; 29+ messages in thread
From: James Bottomley @ 2013-04-27 16:09 UTC (permalink / raw)
  To: Elliott, Robert (Server Storage)
  Cc: Jeremy Linton, Ric Wheeler, linux-scsi@vger.kernel.org,
	Martin K. Petersen, Jeff Moyer, Tejun Heo, Mike Snitzer,
	dgilbert@interlog.com

On Wed, 2013-04-24 at 05:44 +0000, Elliott, Robert (Server Storage)
wrote:
> If the writeback cache is enabled (per the WCE bit in the Caching mode page),
> prudent software uses the FUA bit in WRITE commands when writing metadata
> and/or sends the SYNCHRONIZE CACHE command at important checkpoints to 
> ensure the data is not going to be lost due to a power loss.  Some 
> database software is particularly prolific at sending these commands. 
> 
> Around 2003, many RAID controllers with non-volatile writeback caches honored
> the SYNCHRONIZE CACHE command, flushing the entire cache to the drives.  This
> started causing timeouts as non-volatile write cache sizes grew.  Recently,
> it's even causing trouble on individual disk drives with growing volatile 
> write caches.
> 
> The intent of software using these commands and bits was unclear - it could be:
> a) ensure data is in non-volatile cache (and will eventually be flushed) 
>    or on the medium; or
> b) ensure data is on the medium (so the drives are ready for removal). 

Just from looking at the Linux code (and the code in other operating
systems like BSD or Solaris), you can see that for non-removable media
our intent is always a).

For removable media, you can argue the OS needs b), but I don't actually
know of any removable hard disks that actually have a NV cache (that's
exclusively the province of the array vendors), so it's a bit moot.

> As a short-term fix, many RAID controllers assumed intent (a) and started
> interpreting the SYNCHRONIZE CACHE command as a NOP and ignoring the FUA bit.  
> 
> Surprise removal of a drive from a RAID controller is risky even if software 
> has run SYNCHRONIZE CACHE, since the RAID controller might be doing other
> activity in the background. So, there are other reasons to justify assuming
> that the user just won't do that.

Right.  In fact surprise removal of array disks is something most admins
quickly learn never to do.  The only use case for deliberately damaging
your array like this is drive replacement, and that's where you remove a
potentially failing device and ask for a rebuild but since the array
keeps running, there are no cache issues involved.

> Afraid of breaking software with intent (b) (which was more likely in the 
> days of floppy disks, Bernoulli Boxes, and other removable block devices), 
> T10 chose to clarify that the original meaning was (b) and added new 
> FUA_NV and SYNC_NV bits to let software express intent (a).  The hope
> was that devices would implement the bits and software would start using
> them at appropriate times.

Just for future learning, does T10 see the mistake here?  Even if we
assume the b) case (which I think everyone can agree is the wrong one),
Operating Systems are slow to change, so arrays have to continue with
current behaviour.  Even in the b) case, the only way to update the
standard to codify existing behaviour and enable the b) case is to say
that current SYNCHRONIZE CACHE may now choose not to flush the NV cache
but here's a new bit to signal intent to flush NV cache as well (i.e.
the new flag should have forced flush of volatile + Non Volatile cache).

By doing the opposite, T10 effectively piled confusion onto the
situation because array vendors worried about flush latencies were
always going to ignore the flush and new entrants were going to get
confused about what the OS is doing, leading to what you say below:

> Unfortunately, the short-term fix worked well enough that it still prevails
> today, and most standalone removable media block devices have disappeared.
> There is not much software actually sending the FUA_NV and SYNC_NV bits 
> and few devices honoring the bits per the standard.

And the arrays that did actually honour the standard are now the ones
people are complaining about ...

> As an SBC-3 letter ballot comment, I recently submitted T10 proposal 
> 13-050 (see http://www.t10.org/doc13.htm) to obsolete the SYNC_NV and 
> FUA_NV bits and change the meaning of the commands without those bits
> to intent (a), reflecting what the industry has actually done.

I think that works.  If an admin is concerned about the b) case, they'll
ask the array management software to do the offline rather than the OS,
so I don't actually see any use case where we have to worry in the OS
about the NV cache.

James


end of thread, other threads:[~2013-04-27 16:09 UTC | newest]

Thread overview: 29+ messages
2013-04-23 19:41 T10 WCE interpretation in Linux & device level access Ric Wheeler
2013-04-23 20:07 ` James Bottomley
2013-04-23 22:39   ` Jeremy Linton
2013-04-24  5:44     ` Elliott, Robert (Server Storage)
2013-04-24 11:00       ` Ric Wheeler
2013-04-27 16:09       ` James Bottomley
2013-04-24 11:17   ` Paolo Bonzini
2013-04-24 12:07     ` Hannes Reinecke
2013-04-24 12:08       ` Paolo Bonzini
2013-04-24 12:12         ` Hannes Reinecke
2013-04-24 12:23           ` Paolo Bonzini
2013-04-24 12:27           ` Mike Snitzer
2013-04-24 12:27         ` Ric Wheeler
2013-04-24 12:57           ` Paolo Bonzini
2013-04-24 14:35             ` Jeremy Linton
2013-04-24 18:20               ` Black, David
2013-04-24 20:41                 ` Ric Wheeler
2013-04-24 21:02                   ` James Bottomley
2013-04-24 21:54                     ` Paolo Bonzini
2013-04-24 22:09                       ` James Bottomley
2013-04-24 22:36                         ` Ric Wheeler
2013-04-24 22:46                           ` James Bottomley
2013-04-25 11:35                             ` Ric Wheeler
2013-04-25 14:12                               ` James Bottomley
2013-04-25  1:32                         ` Martin K. Petersen
2013-04-27  6:03                           ` Paolo Bonzini
2013-04-24 11:30   ` Hannes Reinecke
2013-04-23 20:28 ` Douglas Gilbert
2013-04-24 15:40 ` Douglas Gilbert
