* [linux-lvm] Re: [PATCH] 64 bit scsi read/write
[not found] ` <Pine.LNX.4.33.0107032211120.30968-100000@toomuch.toronto.redhat.com>
@ 2001-07-05 6:34 ` Ragnar Kjørstad
2001-07-05 7:35 ` Ben LaHaise
0 siblings, 1 reply; 31+ messages in thread
From: Ragnar Kjørstad @ 2001-07-05 6:34 UTC (permalink / raw)
To: Ben LaHaise; +Cc: linux-fsdevel, linux-kernel, mike, kevin, linux-lvm
On Tue, Jul 03, 2001 at 10:19:36PM -0400, Ben LaHaise wrote:
> > > [ patch to make md and nbd work for >2TB devices ]
> > What about LVM?
>
> Errr, I'll refrain from talking about LVM.
What do you mean?
Is it not feasible to fix this in LVM as well, or do you just not know
what needs to be done to LVM?
--
Ragnar Kjorstad
Big Storage
* [linux-lvm] Re: [PATCH] 64 bit scsi read/write
2001-07-05 6:34 ` [linux-lvm] Re: [PATCH] 64 bit scsi read/write Ragnar Kjørstad
@ 2001-07-05 7:35 ` Ben LaHaise
2001-07-05 16:46 ` AJ Lewis
` (3 more replies)
0 siblings, 4 replies; 31+ messages in thread
From: Ben LaHaise @ 2001-07-05 7:35 UTC (permalink / raw)
To: Ragnar Kjørstad; +Cc: linux-fsdevel, linux-kernel, mike, kevin, linux-lvm
On Thu, 5 Jul 2001, Ragnar Kjørstad wrote:
> What do you mean?
> Is it not feasible to fix this in LVM as well, or do you just not know
> what needs to be done to LVM?
Fixing LVM is not on the radar of my priorities. The code is sorely in
need of a rewrite and violates several of the basic planning tenets that
any good code in the block layer should follow. Namely, it should have 1)
planned on supporting 64 bit offsets, 2) never used multiplication,
division or modulus on block numbers, and 3) never allocated memory
structures that are indexed by block numbers. LVM failed on all three of
these -- and this is just what I noticed in a quick 5 minute glance
through the code. Sorry, but LVM is obsolete by design. It will continue
to work on 32 bit block devices, but if you try to use it beyond that, it
will fail. That said, we'll have to make sure these failures are graceful
and occur prior to the user having a chance at losing any data.
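As a rough illustration of point 1 (not taken from the LVM source), the
sketch below emulates fixed-width arithmetic in Python to show why a 32 bit
sector number tops out at 2 TiB and how a byte offset computed in 32 bits
wraps around. The 512-byte sector size and the helper names are assumptions
made for the example.

```python
SECTOR_SIZE = 512          # bytes per sector (assumed for this example)
MASK32 = (1 << 32) - 1     # emulate a 32-bit unsigned integer
MASK64 = (1 << 64) - 1     # emulate a 64-bit unsigned integer


def byte_offset_32(sector):
    """Byte offset computed the way 32-bit code would: the result wraps."""
    return (sector * SECTOR_SIZE) & MASK32


def byte_offset_64(sector):
    """Same computation with a 64-bit result."""
    return (sector * SECTOR_SIZE) & MASK64


# Largest capacity addressable with a 32-bit sector number.
max_bytes_32bit_sectors = (1 << 32) * SECTOR_SIZE

# Sector 2**23 sits at the 4 GiB mark; the 32-bit product wraps there.
sector = 1 << 23
print(max_bytes_32bit_sectors // (1 << 40), "TiB addressable")
print(byte_offset_32(sector), byte_offset_64(sector))
```

The same wrap-around applies to any intermediate product of a block number
and a size, which is presumably part of why arithmetic on block numbers is
singled out in point 2.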
Now, thankfully there are alternatives like ELVM, which are working on
getting the details right from the lessons learned. Given that, I think
we'll be in good shape during the 2.5 cycle.
-ben
* Re: [linux-lvm] Re: [PATCH] 64 bit scsi read/write
2001-07-05 7:35 ` Ben LaHaise
@ 2001-07-05 16:46 ` AJ Lewis
2001-07-05 17:09 ` Eric M. Hopper
` (2 subsequent siblings)
3 siblings, 0 replies; 31+ messages in thread
From: AJ Lewis @ 2001-07-05 16:46 UTC (permalink / raw)
To: linux-lvm
On Thu, Jul 05, 2001 at 03:35:31AM -0400, Ben LaHaise wrote:
> On Thu, 5 Jul 2001, Ragnar Kjørstad wrote:
> > What do you mean?
> > Is it not feasible to fix this in LVM as well, or do you just not know
> > what needs to be done to LVM?
>
> Fixing LVM is not on the radar of my priorities. The code is sorely in
> need of a rewrite and violates several of the basic planning tenets that
> any good code in the block layer should follow. Namely, it should have 1)
> planned on supporting 64 bit offsets, 2) never used multiplication,
> division or modulus on block numbers, and 3) never allocated memory
> structures that are indexed by block numbers. LVM failed on all three of
> these -- and this is just what I noticed in a quick 5 minute glance
> through the code. Sorry, but LVM is obsolete by design. It will continue
> to work on 32 bit block devices, but if you try to use it beyond that, it
> will fail. That said, we'll have to make sure these failures are graceful
> and occur prior to the user having a chance at losing any data.
So are these tenets written down somewhere, or is it just understood that
this is how it needs to be?
Getting LVM 64-bit ready is certainly a priority for the LVM team, but this
is the first I've seen of concrete requirements for doing so. Of course, I
could just be blind and have missed everything too. ;)
Regards,
--
AJ Lewis
Sistina Software Inc. Voice: 612-638-0500
1313 5th St SE, Suite 111 Fax: 612-638-0500
Minneapolis, MN 55414 E-Mail: lewis@sistina.com
http://www.sistina.com
Current GPG fingerprint = 3B5F 6011 5216 76A5 2F6B 52A0 941E 1261 0029 2648
Get my key at: http://www.sistina.com/~lewis/gpgkey
(Unfortunately, the PKS-type keyservers do not work with multiple sub-keys)
-----Begin Obligatory Humorous Quote----------------------------------------
The three most dangerous things are a programmer with a soldering iron, a
manager who codes, and a user who gets ideas.
-----End Obligatory Humorous Quote------------------------------------------
-----Begin Obligatory Humorous Quote----------------------------------------
Linux: Because rebooting is for adding new hardware.
-----End Obligatory Humorous Quote------------------------------------------
* Re: [linux-lvm] Re: [PATCH] 64 bit scsi read/write
2001-07-05 7:35 ` Ben LaHaise
2001-07-05 16:46 ` AJ Lewis
@ 2001-07-05 17:09 ` Eric M. Hopper
2001-07-10 13:45 ` Heinz J. Mauelshagen
2001-07-13 18:20 ` Albert D. Cahalan
3 siblings, 0 replies; 31+ messages in thread
From: Eric M. Hopper @ 2001-07-05 17:09 UTC (permalink / raw)
To: linux-lvm
On Thu, Jul 05, 2001 at 03:35:31AM -0400, Ben LaHaise wrote:
> On Thu, 5 Jul 2001, Ragnar Kjørstad wrote:
>
> > What do you mean?
> > Is it not feasible to fix this in LVM as well, or do you just not know
> > what needs to be done to LVM?
>
> Fixing LVM is not on the radar of my priorities. The code is sorely in
> need of a rewrite and violates several of the basic planning tenets that
> any good code in the block layer should follow. Namely, it should have 1)
> planned on supporting 64 bit offsets, 2) never used multiplication,
> division or modulus on block numbers, and 3) never allocated memory
> structures that are indexed by block numbers.
It would seem to me that it would be nearly impossible to write
something that does the kind of block remapping that LVM does without
violating tenets 2 and 3.
To me, that's like saying that no code that deals with memory
can have a data structure indexed by address, and you can't do masking
operations on memory addresses. That would completely kill virtual
memory right there.
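For illustration only: one way the remapping and the tenets could coexist is
to restrict extent sizes to powers of two, so that splitting a sector number
into an extent index and an offset becomes a shift and a mask, and the lookup
table is indexed by extent rather than by block. The sketch below assumes
exactly that; the table layout, extent size and names are hypothetical and
are not LVM's.

```python
EXTENT_SHIFT = 13                      # 2**13 sectors per extent (assumed)
EXTENT_MASK = (1 << EXTENT_SHIFT) - 1

# Hypothetical mapping table: logical extent index -> (device name, first
# physical sector of that extent). Indexed by extent, not by block number.
extent_table = [
    ("sda", 0),
    ("sdb", 0),
    ("sda", 1 << EXTENT_SHIFT),
]


def remap(logical_sector):
    """Map a logical sector to (device, physical sector) via shift and mask."""
    extent = logical_sector >> EXTENT_SHIFT      # takes the place of division
    offset = logical_sector & EXTENT_MASK        # takes the place of modulus
    device, start = extent_table[extent]
    return device, start + offset


def remap_divmod(logical_sector):
    """The same mapping written with division and modulus, for comparison."""
    extent, offset = divmod(logical_sector, 1 << EXTENT_SHIFT)
    device, start = extent_table[extent]
    return device, start + offset


print(remap(8192 + 5))
```

Both functions give the same answer; the shift-and-mask form simply avoids a
64 bit divide, which is the same trick page tables use on memory addresses.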
Have fun (if at all possible),
--
"It does me no injury for my neighbor to say there are twenty gods or no God.
It neither picks my pocket nor breaks my leg." --- Thomas Jefferson
"Go to Heaven for the climate, Hell for the company." -- Mark Twain
-- Eric Hopper (hopper@omnifarious.org http://www.omnifarious.org/~hopper) --
* Re: [linux-lvm] Re: [PATCH] 64 bit scsi read/write
2001-07-05 7:35 ` Ben LaHaise
2001-07-05 16:46 ` AJ Lewis
2001-07-05 17:09 ` Eric M. Hopper
@ 2001-07-10 13:45 ` Heinz J. Mauelshagen
2001-07-13 18:20 ` Albert D. Cahalan
3 siblings, 0 replies; 31+ messages in thread
From: Heinz J. Mauelshagen @ 2001-07-10 13:45 UTC (permalink / raw)
To: linux-lvm
On Thu, Jul 05, 2001 at 03:35:31AM -0400, Ben LaHaise wrote:
> On Thu, 5 Jul 2001, Ragnar Kjørstad wrote:
>
> > What do you mean?
> > Is it not feasible to fix this in LVM as well, or do you just not know
> > what needs to be done to LVM?
>
> Fixing LVM is not on the radar of my priorities. The code is sorely in
> need of a rewrite and violates several of the basic planning tenets that
> any good code in the block layer should follow. Namely, it should have 1)
> planned on supporting 64 bit offsets,
What is the particular problem with 1?
Changes need to take place in the whole block device layer during 2.5
development in order to be 64 bit clean.
LVM will just be one member of the bunch :-)
> 2) never used multiplication,
> division or modulus on block numbers,
I assume that you mean this as advice ("never use") rather than as a
statement that you never used it, right?
FYI: this takes place in multiple block device layer functions.
For sure you can argue that this is supposed to vanish in 2.5, but
where's the document recommending this and defining how to do it?
> and 3) never allocated memory
> structures that are indexed by block numbers.
Why?
Remapping block devices do that!
Checks against index mismatches are in LVM already.
> Even LVM failed on all three of
> these -- and this is just what I noticed in a quick 5 minute glance
> through the code. Sorry, but LVM is obsolete by design.
Sorry, but you are arguing weakly and in a political manner.
Please speak up clearly and talk about the caveats you see and the change
requests you recommend as an expert, or say nothing!
> It will continue
> to work on 32 bit block devices, but if you try to use it beyond that, it
> will fail.
As I said above.
During the 2.5 development cycle it will be addressed.
> That said, we'll have to make sure these failures are graceful
> and occur prior to the user having a chance at losing any data.
Once 64 bit support is implemented in the block device layer
and the VFS, LVM as a block device layer entity will support it as well :-)
>
> Now, thankfully there are alternatives like ELVM, which are working on
> getting the details right from the lessons learned.
ELVM is not in production so far...
If you had read the EVMS kernel code, you would have realized that they do
modulo calculations as well, and I think they are right to do so!
> Given that, I think
> we'll be in good shape during the 2.5 cycle.
I agree (from a different standpoint though ;-)
Waiting for your serious advice...
>
> -ben
>
--
Regards,
Heinz -- The LVM Guy --
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Heinz Mauelshagen Sistina Software Inc.
Senior Consultant/Developer Am Sonnenhang 11
56242 Marienrachdorf
Germany
Mauelshagen@Sistina.com +49 2626 141200
FAX 924446
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
* [linux-lvm] Re: [PATCH] 64 bit scsi read/write
2001-07-05 7:35 ` Ben LaHaise
` (2 preceding siblings ...)
2001-07-10 13:45 ` Heinz J. Mauelshagen
@ 2001-07-13 18:20 ` Albert D. Cahalan
2001-07-13 20:41 ` Andreas Dilger
3 siblings, 1 reply; 31+ messages in thread
From: Albert D. Cahalan @ 2001-07-13 18:20 UTC (permalink / raw)
To: Ben LaHaise
Cc: Ragnar Kjørstad, linux-fsdevel, linux-kernel, mike, kevin,
linux-lvm
Ben LaHaise writes:
> On Thu, 5 Jul 2001, Ragnar Kjørstad wrote:
>> What do you mean?
>> Is it not feasible to fix this in LVM as well, or do you just not know
>> what needs to be done to LVM?
>
> Fixing LVM is not on the radar of my priorities. The code is sorely in
> need of a rewrite and violates several of the basic planning tenets that
> any good code in the block layer should follow. Namely, it should have 1)
> planned on supporting 64 bit offsets, 2) never used multiplication,
> division or modulus on block numbers, and 3) never allocated memory
> structures that are indexed by block numbers. LVM failed on all three of
> these -- and this is just what I noticed in a quick 5 minute glance
> through the code. Sorry, but LVM is obsolete by design. It will continue
> to work on 32 bit block devices, but if you try to use it beyond that, it
> will fail. That said, we'll have to make sure these failures are graceful
> and occur prior to the user having a chance at losing any data.
>
> Now, thankfully there are alternatives like ELVM, which are working on
> getting the details right from the lessons learned. Given that, I think
> we'll be in good shape during the 2.5 cycle.
How can any of this even work?
Say I have N disks, mirrored, or maybe with parity. I'm trying
to have a reliable system. I change a file. The write goes out
to my disks, and power is lost. Some number M, such that 0<M<N,
of the disks are written before the power loss. The rest of the
disks don't complete the write. Maybe worse, this is more than
one sector, and some disks have partial writes.
Doesn't RAID need a journal or the phase-tree algorithm?
How does one tell what data is old and what data is new?
* [linux-lvm] Re: [PATCH] 64 bit scsi read/write
2001-07-13 18:20 ` Albert D. Cahalan
@ 2001-07-13 20:41 ` Andreas Dilger
2001-07-13 21:14 ` Alan Cox
0 siblings, 1 reply; 31+ messages in thread
From: Andreas Dilger @ 2001-07-13 20:41 UTC (permalink / raw)
To: Albert D. Cahalan
Cc: Ben LaHaise, Ragnar Kjørstad, linux-fsdevel, linux-kernel, mike,
kevin, linux-lvm
Albert writes:
> How can any of this even work?
>
> Say I have N disks, mirrored, or maybe with parity. I'm trying
> to have a reliable system. I change a file. The write goes out
> to my disks, and power is lost. Some number M, such that 0<M<N,
> of the disks are written before the power loss. The rest of the
> disks don't complete the write. Maybe worse, this is more than
> one sector, and some disks have partial writes.
>
> Doesn't RAID need a journal or the phase-tree algorithm?
> How does one tell what data is old and what data is new?
Yes, RAID should have a journal or other ordering enforcement, but
it really isn't any worse in this regard than a single disk. Even
on a single disk you don't have any guarantees of data ordering, so
if you change the file and the power is lost, some of the sectors
will make it to disk and some will not => fsck, with possible data
corruption or loss.
That's why the journaled filesystems have multi-stage commit of I/O,
first to the journal and then to the disk, so no chance of corruption
of the metadata, and if you journal data also, then the data cannot
be corrupted (but some may be lost).
RAID 5 throws a wrench into this by not guaranteeing that all of the
blocks in a stripe are consistent (you don't know which blocks and/or
parity were written and which not). Ideally, you want a multi-stage
commit for RAID as well, so that you write the data first, and the
parity afterwards (so on reboot you trust the data first, and not the
parity). You have a problem if there is a bad disk and you crash.
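To make the stripe-consistency problem concrete, here is a small sketch using
plain XOR parity over two data blocks. It is an illustration only and says
nothing about how MD RAID actually orders its writes; the block contents and
names are made up.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together (RAID 5 style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# A consistent stripe: two data blocks plus their parity.
d0, d1 = b"\x11\x11", b"\x22\x22"
parity = xor_blocks([d0, d1])

# Crash after the new d0 reached disk but before the parity was updated.
new_d0 = b"\x55\x55"

# The stripe no longer checks out, and nothing on disk says which block
# is the stale one.
consistent = xor_blocks([new_d0, d1]) == parity

# If the disk holding d1 now fails as well, rebuilding d1 from new_d0 and
# the stale parity produces garbage instead of the original d1.
rebuilt_d1 = xor_blocks([new_d0, parity])

print(consistent, rebuilt_d1 == d1)
```

With all disks healthy, recomputing parity from the data repairs the stripe;
the unrecoverable case is the combination of a stale parity block and a
missing data disk.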
With a data-journaled fs you don't care what RAID does because the fs
journal knows which transactions were in progress. If an I/O was being
written into the journal and did not complete, it is discarded. If it
was written into the journal and did not finish the write into the fs,
it will re-write it on recovery. In both cases you don't care if the
RAID finished the write or not.
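The recovery rule described above can be sketched roughly as follows. This is
a toy model under simplifying assumptions (an in-memory journal, a single
"committed" flag per transaction, hypothetical names), not ext3's actual
on-disk format or replay code.

```python
def recover(journal, filesystem):
    """Replay committed journal transactions; discard incomplete ones.

    journal: list of dicts with keys "blocks" (block number -> data) and
    "committed" (True once the whole transaction reached the journal).
    filesystem: dict of block number -> data, updated in place.
    """
    for txn in journal:
        if not txn["committed"]:
            continue                      # never fully journaled: discard
        filesystem.update(txn["blocks"])  # rewrite; harmless if already done
    return filesystem


fs = {1: "old-a", 2: "old-b", 3: "old-c"}
journal = [
    {"blocks": {1: "new-a", 2: "new-b"}, "committed": True},
    {"blocks": {3: "new-c"}, "committed": False},
]

# Simulate a crash partway through copying the first transaction to the fs.
fs[1] = "new-a"

recover(journal, fs)
print(fs)
```

Replay is idempotent, so it does not matter how much of a committed
transaction had already reached the filesystem before the crash.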
Note that for LVM (the original topic), it does NOT do any RAID stuff
at all, it is just a virtually contiguous disk, made up of one or more
real disks (or stacked on top of RAID).
Cheers, Andreas
--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert
* [linux-lvm] Re: [PATCH] 64 bit scsi read/write
2001-07-13 20:41 ` Andreas Dilger
@ 2001-07-13 21:14 ` Alan Cox
2001-07-14 3:23 ` Andrew Morton
0 siblings, 1 reply; 31+ messages in thread
From: Alan Cox @ 2001-07-13 21:14 UTC (permalink / raw)
To: Andreas Dilger
Cc: Albert D. Cahalan, Ben LaHaise, Ragnar Kjørstad, linux-fsdevel,
linux-kernel, mike, kevin, linux-lvm
> RAID 5 throws a wrench into this by not guaranteeing that all of the
> blocks in a stripe are consistent (you don't know which blocks and/or
> parity were written and which not). Ideally, you want a multi-stage
> commit for RAID as well, so that you write the data first, and the
> parity afterwards (so on reboot you trust the data first, and not the
> parity). You have a problem if there is a bad disk and you crash.
Well to be honest so does most disk firmware. IDE especially. For one thing
the logical sector size the drive writes need not match the illusions
provided upstream, and the write flush commands are frequently not implemented
because they damage benchmarketing numbers from folks like Zdnet..
* [linux-lvm] Re: [PATCH] 64 bit scsi read/write
[not found] <20010714090703.B5737@weta.f00f.org>
@ 2001-07-13 22:04 ` Andreas Dilger
2001-07-14 0:49 ` Jonathan Lundell
0 siblings, 1 reply; 31+ messages in thread
From: Andreas Dilger @ 2001-07-13 22:04 UTC (permalink / raw)
To: Chris Wedgwood
Cc: Andreas Dilger, Albert D. Cahalan, Ben LaHaise, Ragnar Kjørstad,
linux-fsdevel, linux-kernel, mike, kevin, linux-lvm
Chris writes:
> On Fri, Jul 13, 2001 at 02:41:52PM -0600, Andreas Dilger wrote:
>
> Yes, RAID should have a journal or other ordering enforcement, but
> it really isn't any worse in this regard than a single disk. Even
> on a single disk you don't have any guarantees of data ordering,
> so if you change the file and the power is lost, some of the
> sectors will make it to disk and some will not => fsck, with
> possible data corruption or loss.
>
> How so? On a single disk you can either disable write-caching or for
> SCSI disks you can use barriers of sorts.
>
> At which time, you can either assume a sector is written or not.
Well, I _think_ your statement is only true if you are using rawio.
Otherwise, you have a minimum block size of 1kB (for filesystems at
least) so you can't write less than that, and you could potentially
write one sector and not another.
I'm not sure of the exact MD RAID implementation, but I suspect that
if you write a single sector*, it will be exactly the same situation.
However, it also has to write the parity to disk, so if you crash at
this point what you get back depends on the RAID implementation**.
As Alan said in another reply, with IDE disks, you have no guarantee
about write caching on the disk, even if you try to turn it off.
If you are doing synchronous I/O from your application, then I don't
think a RAID write will complete until all of the data+parity I/O
is complete, so you should again be as safe as with a single disk.
If you want safety, but async I/O, use ext3 with full data journaling
and a large journal. Andrew Morton has just done some testing with
this and the performance is very good, as long as your journal is big
enough to hold your largest write bursts, and you have < 50% duty
cycle for disk I/O (i.e. you have to have enough spare I/O bandwidth
to write everything to disk twice, but it will go to the journal in a
single contiguous (synchronous) write and can go to the filesystem
asynchronously at a later time when there is no other I/O). If you
put your journal on NVRAM, you will have blazing synchronous I/O.
Cheers, Andreas
*) You _may_ be limited to a larger minimum write, depending on the stripe
size, I haven't looked closely at the code. AFAIK, MD RAID does not
let you stripe a single sector across multiple disks (nor would you
want to), so all disk I/O would still be one or more single sector I/Os
to one or more disks. This means the sector I/O to each individual
disk is still atomic, so it is not any worse than writes to a single
disk (the parity is NOT atomic, but then you don't have parity at
all on a single disk...).
**) As I said in my previous posting, it depends on if/how MD RAID does
write ordering of I/O to the data sector and the parity sector. If
it holds back the parity write until the data I/O(s) are complete, and
trusts the data over parity on recovery, you should be OK unless you
have multiple failures (i.e. bad disk + crash). If it doesn't do this
ordering, or trusts parity over data, then you are F***ed (I doubt it
would have this problem).
--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert
* [linux-lvm] Re: [PATCH] 64 bit scsi read/write
2001-07-13 22:04 ` Andreas Dilger
@ 2001-07-14 0:49 ` Jonathan Lundell
0 siblings, 0 replies; 31+ messages in thread
From: Jonathan Lundell @ 2001-07-14 0:49 UTC (permalink / raw)
To: Andreas Dilger, Chris Wedgwood
Cc: Albert D. Cahalan, Ben LaHaise, Ragnar Kjørstad, linux-fsdevel,
linux-kernel, mike, kevin, linux-lvm
At 4:04 PM -0600 2001-07-13, Andreas Dilger wrote:
>**) As I said in my previous posting, it depends on if/how MD RAID does
> write ordering of I/O to the data sector and the parity sector. If
> it holds back the parity write until the data I/O(s) are complete, and
> trusts the data over parity on recovery, you should be OK unless you
> have multiple failures (i.e. bad disk + crash). If it doesn't do this
> ordering, or trusts parity over data, then you are F***ed (I doubt it
> would have this problem).
That wouldn't help, would it, if >1 data sectors were being written.
The fault mode of a sector simply not being written seems like a real
weak point of both RAID-1 and RAID-5. Not that RAID-5 parity ever
gets checked, I think, under normal circumstances, nor RAID-1 mirrors
get compared, but if they were checked and there was a parity or
mirror-compare error and no other indication of a fault (eg CRC),
there's no way to recover correct data.
--
/Jonathan Lundell.
* [linux-lvm] Re: [PATCH] 64 bit scsi read/write
2001-07-13 21:14 ` Alan Cox
@ 2001-07-14 3:23 ` Andrew Morton
2001-07-14 8:45 ` Alan Cox
0 siblings, 1 reply; 31+ messages in thread
From: Andrew Morton @ 2001-07-14 3:23 UTC (permalink / raw)
To: Alan Cox
Cc: Andreas Dilger, Albert D. Cahalan, Ben LaHaise, Ragnar Kjørstad,
linux-fsdevel, linux-kernel, mike, kevin, linux-lvm
Alan Cox wrote:
>
> > RAID 5 throws a wrench into this by not guaranteeing that all of the
> > blocks in a stripe are consistent (you don't know which blocks and/or
> > parity were written and which not). Ideally, you want a multi-stage
> > commit for RAID as well, so that you write the data first, and the
> > parity afterwards (so on reboot you trust the data first, and not the
> > parity). You have a problem if there is a bad disk and you crash.
>
> Well to be honest so does most disk firmware. IDE especially. For one thing
> the logical sector size the drive writes need not match the illusions
> provided upstream, and the write flush commands are frequently not implemented
> because they damage benchmarketing numbers from folks like Zdnet..
If, after a power outage, the IDE disk can keep going for long enough
to write its write cache out to the reserved vendor area (which will
only take 20-30 milliseconds) then the data may be considered *safe*
as soon as it hits writecache.
In which case it is perfectly legitimate and sensible for the drive
to ignore flush commands, and to ack data as soon as it hits cache.
Yes?
If I'm right then the only open question is: which disks do and
do not do the right thing when the lights go out.
-
* [linux-lvm] Re: [PATCH] 64 bit scsi read/write
2001-07-14 3:23 ` Andrew Morton
@ 2001-07-14 8:45 ` Alan Cox
2001-07-14 13:54 ` Steven Lembark
` (2 more replies)
0 siblings, 3 replies; 31+ messages in thread
From: Alan Cox @ 2001-07-14 8:45 UTC (permalink / raw)
To: Andrew Morton
Cc: Alan Cox, Andreas Dilger, Albert D. Cahalan, Ben LaHaise,
Ragnar Kjørstad, linux-fsdevel, linux-kernel, mike, kevin,
linux-lvm
> If, after a power outage, the IDE disk can keep going for long enough
> to write its write cache out to the reserved vendor area (which will
> only take 20-30 milliseconds) then the data may be considered *safe*
> as soon as it hits writecache.
Hohohoho.
> In which case it is perfectly legitimate and sensible for the drive
> to ignore flush commands, and to ack data as soon as it hits cache.
Since the flushing commands are 'optional' it can legitimately ignore them
> If I'm right then the only open question is: which disks do and
> do not do the right thing when the lights go out.
As far as I can tell none of them at least in the IDE world
Alan
* Re: [linux-lvm] Re: [PATCH] 64 bit scsi read/write
2001-07-14 8:45 ` Alan Cox
@ 2001-07-14 13:54 ` Steven Lembark
2001-07-14 17:33 ` Jonathan Lundell
[not found] ` <20010715025001.B6722@weta.f00f.org>
2 siblings, 0 replies; 31+ messages in thread
From: Steven Lembark @ 2001-07-14 13:54 UTC (permalink / raw)
To: linux-lvm
- Alan Cox <alan@lxorguk.ukuu.org.uk> on 07/14/01 09:45:44 +0100:
>> If, after a power outage, the IDE disk can keep going for long enough
>> to write its write cache out to the reserved vendor area (which will
>> only take 20-30 milliseconds) then the data may be considered *safe*
>> as soon as it hits writecache.
>
> Hohohoho.
Don't laugh, it works. A 10KRPM drive has enough inertia and
stored charge to write a full cyl in less time than the BFC's in
the switcher can discharge.
>> If I'm right then the only open question is: which disks do and
>> do not do the right thing when the lights go out.
>
> As far as I can tell none of them at least in the IDE world
Some SCSI drives will use inertia to write a single cyl if the power
goes out. I've never seen a spec for IDE's that allows for this.
One more reason to use SCSI.
* [linux-lvm] Re: [PATCH] 64 bit scsi read/write
[not found] ` <20010715025001.B6722@weta.f00f.org>
@ 2001-07-14 15:41 ` Jonathan Lundell
2001-07-14 20:11 ` Daniel Phillips
1 sibling, 0 replies; 31+ messages in thread
From: Jonathan Lundell @ 2001-07-14 15:41 UTC (permalink / raw)
To: Chris Wedgwood, Alan Cox
Cc: Andrew Morton, Andreas Dilger, Albert D. Cahalan, Ben LaHaise,
Ragnar Kjørstad, linux-fsdevel, linux-kernel, mike, kevin,
linux-lvm
At 2:50 AM +1200 2001-07-15, Chris Wedgwood wrote:
>On Sat, Jul 14, 2001 at 09:45:44AM +0100, Alan Cox wrote:
>
> As far as I can tell none of them at least in the IDE world
>
>SCSI disk must, or at least some... if not, how do people like NetApp
>get these cool HA certifications?
NetApp uses a large system-local NVRAM buffer, do they not?
--
/Jonathan Lundell.
* [linux-lvm] Re: [PATCH] 64 bit scsi read/write
2001-07-14 8:45 ` Alan Cox
2001-07-14 13:54 ` Steven Lembark
@ 2001-07-14 17:33 ` Jonathan Lundell
[not found] ` <20010715160247.I7624@weta.f00f.org>
[not found] ` <20010715025001.B6722@weta.f00f.org>
2 siblings, 1 reply; 31+ messages in thread
From: Jonathan Lundell @ 2001-07-14 17:33 UTC (permalink / raw)
To: Alan Cox, Andrew Morton
Cc: Andreas Dilger, Albert D. Cahalan, Ben LaHaise, Ragnar Kjørstad,
linux-fsdevel, linux-kernel, mike, kevin, linux-lvm
At 9:45 AM +0100 2001-07-14, Alan Cox wrote:
> > If, after a power outage, the IDE disk can keep going for long enough
>> to write its write cache out to the reserved vendor area (which will
>> only take 20-30 milliseconds) then the data may be considered *safe*
>> as soon as it hits writecache.
>
>Hohohoho.
>
>> In which case it is perfectly legitimate and sensible for the drive
>> to ignore flush commands, and to ack data as soon as it hits cache.
>
>Since the flushing commands are 'optional' it can legitimately ignore them
>
>> If I'm right then the only open question is: which disks do and
>> do not do the right thing when the lights go out.
>
>As far as I can tell none of them at least in the IDE world
It's not so great in the SCSI world either. Here's a bit from the
Ultrastar 73LZX functional spec (this is the current-technology
Ultra160 73GB family):
>5.0 Data integrity
>The drive retains recorded information under all non-write operations.
>No more than one sector will be lost by power down during write
>operation while write cache is disabled.
>If power down occurs before completion of data transfer from write
>cache to disk while write cache is enabled, the data remaining in
>write cache will be lost. To prevent this data loss at power off, the
>following action is recommended:
>* Confirm successful completion of SYNCHRONIZE CACHE (35h) command.
What's worse, though the spec is not explicit on this point, it
appears that the write cache is lost on a SCSI reset, which is
typically used by drivers for last-resort error recovery. And of
course a SCSI bus reset affects all the drives on the bus, not just
the offending one.
--
/Jonathan Lundell.
* [linux-lvm] Re: [PATCH] 64 bit scsi read/write
[not found] ` <20010715025001.B6722@weta.f00f.org>
2001-07-14 15:41 ` Jonathan Lundell
@ 2001-07-14 20:11 ` Daniel Phillips
2001-07-15 1:21 ` Andrew Morton
[not found] ` <20010715153607.A7624@weta.f00f.org>
1 sibling, 2 replies; 31+ messages in thread
From: Daniel Phillips @ 2001-07-14 20:11 UTC (permalink / raw)
To: Chris Wedgwood, Alan Cox
Cc: Andrew Morton, Andreas Dilger, Albert D. Cahalan, Ben LaHaise,
Ragnar Kjørstad, linux-fsdevel, linux-kernel, mike, kevin,
linux-lvm
On Saturday 14 July 2001 16:50, Chris Wedgwood wrote:
> On Sat, Jul 14, 2001 at 09:45:44AM +0100, Alan Cox wrote:
>
> As far as I can tell none of them at least in the IDE world
>
> SCSI disk must, or at least some... if not, how do people like NetApp
> get these cool HA certifications?
Atomic commit. The superblock, which references the updated version
of the filesystem, carries a sequence number and a checksum. It is
written to one of two alternating locations. On restart, both
locations are read and the highest numbered superblock with a correct
checksum is chosen as the new filesystem root.
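A minimal sketch of that selection rule, assuming a made-up text layout of
"sequence:root:crc32" for the superblock (the real on-disk format is not
described in this thread, and all names here are hypothetical):

```python
import zlib


def make_superblock(sequence, root):
    """Serialize a superblock as b'sequence:root:crc32'."""
    payload = f"{sequence}:{root}".encode()
    return payload + b":" + str(zlib.crc32(payload)).encode()


def parse_superblock(raw):
    """Return (sequence, root) if the checksum matches, else None."""
    try:
        payload, crc = raw.rsplit(b":", 1)
        if zlib.crc32(payload) != int(crc):
            return None
        sequence, root = payload.decode().split(":", 1)
        return int(sequence), root
    except (ValueError, UnicodeDecodeError):
        return None


def choose_root(slot_a, slot_b):
    """Pick the valid superblock with the highest sequence number."""
    parsed = (parse_superblock(slot_a), parse_superblock(slot_b))
    valid = [sb for sb in parsed if sb is not None]
    return max(valid, default=None)


old = make_superblock(7, "tree@1000")
new = make_superblock(8, "tree@2000")
torn = new[:-3] + b"xyz"      # the newer copy was only partly written

print(choose_root(old, new))
print(choose_root(old, torn))
```

Because the two slots alternate, a torn write can only damage the copy being
replaced; the other slot still holds the previous valid root.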
--
Daniel
* [linux-lvm] Re: [PATCH] 64 bit scsi read/write
2001-07-14 20:11 ` Daniel Phillips
@ 2001-07-15 1:21 ` Andrew Morton
2001-07-15 1:53 ` Daniel Phillips
[not found] ` <20010715153607.A7624@weta.f00f.org>
1 sibling, 1 reply; 31+ messages in thread
From: Andrew Morton @ 2001-07-15 1:21 UTC (permalink / raw)
To: Daniel Phillips; +Cc: linux-fsdevel, linux-kernel, linux-lvm
Daniel Phillips wrote:
>
> On Saturday 14 July 2001 16:50, Chris Wedgwood wrote:
> > On Sat, Jul 14, 2001 at 09:45:44AM +0100, Alan Cox wrote:
> >
> > As far as I can tell none of them at least in the IDE world
> >
> > SCSI disk must, or at least some... if not, how do people like NetApp
> > get these cool HA certifications?
>
> Atomic commit. The superblock, which references the updated version
> of the filesystem, carries a sequence number and a checksum. It is
> written to one of two alternating locations. On restart, both
> locations are read and the highest numbered superblock with a correct
> checksum is chosen as the new filesystem root.
But this assumes that it is the most-recently-written sector/block
which gets lost in a power failure.
The disk will be reordering writes - so when it fails it may have
written the commit block but *not* the data which that block is
committing.
You need a barrier or a full synchronous flush prior to writing
the commit block. A `don't-reorder-past-me' barrier is very much
preferable, of course.
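The reordering hazard can be shown with a toy model: treat everything still
in the drive's cache at power loss as a set of writes of which any subset may
have reached the platter. This is a deliberately simplified sketch with
hypothetical names, not a model of any particular drive.

```python
import itertools


def crash_states(cached_writes):
    """All sets of blocks that may be on disk if power fails now.

    cached_writes lists the block names still in the drive's cache; the
    drive may have persisted any subset of them, in any order.
    """
    states = set()
    for n in range(len(cached_writes) + 1):
        for subset in itertools.combinations(cached_writes, n):
            states.add(frozenset(subset))
    return states


# Without a barrier, the data and the commit block share the cache.
no_barrier = crash_states(["data", "commit"])
commit_without_data = frozenset({"commit"}) in no_barrier

# With a barrier or flush, "data" is durable before "commit" is queued,
# so only the commit block itself can be missing after a crash.
with_barrier = {frozenset({"data"}) | s for s in crash_states(["commit"])}
commit_without_data_barrier = frozenset({"commit"}) in with_barrier

print(commit_without_data, commit_without_data_barrier)
```

In this model the state "commit block on disk, data missing" is reachable
only without the barrier, which is the case a sequence-numbered commit block
cannot detect on its own.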
-
* [linux-lvm] Re: [PATCH] 64 bit scsi read/write
2001-07-15 1:21 ` Andrew Morton
@ 2001-07-15 1:53 ` Daniel Phillips
0 siblings, 0 replies; 31+ messages in thread
From: Daniel Phillips @ 2001-07-15 1:53 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-fsdevel, linux-kernel, linux-lvm
On Sunday 15 July 2001 03:21, Andrew Morton wrote:
> Daniel Phillips wrote:
> > On Saturday 14 July 2001 16:50, Chris Wedgwood wrote:
> > > On Sat, Jul 14, 2001 at 09:45:44AM +0100, Alan Cox wrote:
> > >
> > > As far as I can tell none of them at least in the IDE world
> > >
> > > SCSI disk must, or at least some... if not, how do people like
> > > NetApp get these cool HA certifications?
> >
> > Atomic commit. The superblock, which references the updated
> > version of the filesystem, carries a sequence number and a
> > checksum. It is written to one of two alternating locations. On
> > restart, both locations are read and the highest numbered
> > superblock with a correct checksum is chosen as the new filesystem
> > root.
>
> But this assumes that it is the most-recently-written sector/block
> which gets lost in a power failure.
>
> The disk will be reordering writes - so when it fails it may have
> written the commit block but *not* the data which that block is
> committing.
>
> You need a barrier or a full synchronous flush prior to writing
> the commit block. A `don't-reorder-past-me' barrier is very much
> preferable, of course.
Oh yes, absolutely, that's very much part of the puzzle. Any disk
that doesn't support a real write barrier or write cache flush is
fundamentally broken as far as failsafe operation goes. A disk that
claims to provide such support and doesn't is an even worse offender.
I find Alan's comment there worrisome. We need to know which disks
deliver on this and which don't.
--
Daniel
* [linux-lvm] Re: [PATCH] 64 bit scsi read/write
[not found] ` <20010715160247.I7624@weta.f00f.org>
@ 2001-07-15 5:46 ` Jonathan Lundell
0 siblings, 0 replies; 31+ messages in thread
From: Jonathan Lundell @ 2001-07-15 5:46 UTC (permalink / raw)
To: Chris Wedgwood
Cc: Alan Cox, Andrew Morton, Andreas Dilger, Albert D. Cahalan,
Ben LaHaise, Ragnar Kjørstad, linux-fsdevel, linux-kernel, mike,
kevin, linux-lvm
At 4:02 PM +1200 2001-07-15, Chris Wedgwood wrote:
>On Sat, Jul 14, 2001 at 10:33:44AM -0700, Jonathan Lundell wrote:
>
> What's worse, though the spec is not explicit on this point, it
> appears that the write cache is lost on a SCSI reset, which is
> typically used by drivers for last-resort error recovery. And of
> course a SCSI bus reset affects all the drives on the bus, not
> just the offending one.
>
>Doesn't SCSI have a notion of write barriers?
>
>Even if this is required, the above still works because for anything
>requiring a barrier, you wait for a positive SYNCHRONIZE CACHE
Sure, if you keep all your write buffers around until then, so you
can re-write if the sync fails. And if you don't crash in the
meantime.
--
/Jonathan Lundell.
* [linux-lvm] Re: [PATCH] 64 bit scsi read/write
[not found] ` <20010715153607.A7624@weta.f00f.org>
@ 2001-07-15 6:05 ` John Alvord
[not found] ` <20010715180752.B7993@weta.f00f.org>
2001-07-15 13:44 ` Daniel Phillips
1 sibling, 1 reply; 31+ messages in thread
From: John Alvord @ 2001-07-15 6:05 UTC (permalink / raw)
To: Chris Wedgwood
Cc: Daniel Phillips, Alan Cox, Andrew Morton, Andreas Dilger,
Albert D. Cahalan, Ben LaHaise, Ragnar Kjørstad, linux-fsdevel,
linux-kernel, mike, kevin, linux-lvm
On Sun, 15 Jul 2001, Chris Wedgwood wrote:
> On Sat, Jul 14, 2001 at 10:11:30PM +0200, Daniel Phillips wrote:
>
> Atomic commit. The superblock, which references the updated
> version of the filesystem, carries a sequence number and a
> checksum. It is written to one of two alternating locations. On
> restart, both locations are read and the highest numbered
> superblock with a correct checksum is chosen as the new filesystem
> root.
>
> Yes... and which ever part of the superblock contains the sequence
> number must be written atomically.
>
> The point is, you _NEED_ to be sure that data written before the
> superblock (or indeed anywhere further up the tree, you can make
> changes in theory which don't require super-block updates) are written
> firmly to the platters before any thing which refers to it is updated.
>
> Alan was saying with IDE you cannot reliably do this, I assume you can
> with SCSI was my point.
In the IBM solution to this (1977-78, VM/CMS) the critical data was
written at the beginning and the end of the block. If the two data items
didn't match then the block was rejected.
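
As an illustration only (this is not the actual VM/CMS record format, which is not given here), the head-and-tail idea can be sketched with a hypothetical 4-byte generation stamp stored at both ends of a 512-byte block:

```python
STAMP_LEN = 4  # hypothetical stamp width, chosen for the example


def seal_block(payload, generation, block_size=512):
    """Place the same generation stamp at the start and end of the block."""
    stamp = generation.to_bytes(STAMP_LEN, "little")
    room = block_size - 2 * STAMP_LEN
    if len(payload) > room:
        raise ValueError("payload too large for block")
    return stamp + payload.ljust(room, b"\0") + stamp


def block_is_intact(block):
    """Accept the block only if the leading and trailing stamps agree."""
    if len(block) < 2 * STAMP_LEN:
        return False
    return block[:STAMP_LEN] == block[-STAMP_LEN:]
```

The check relies on the block being written front to back, so that a torn write leaves an old stamp at the tail.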
john alvord
>
>
>
> --cw
* [linux-lvm] Re: [PATCH] 64 bit scsi read/write
[not found] ` <20010715180752.B7993@weta.f00f.org>
@ 2001-07-15 13:16 ` Ken Hirsch
2001-07-15 22:14 ` Daniel Phillips
2001-07-17 0:31 ` Juan Quintela
1 sibling, 1 reply; 31+ messages in thread
From: Ken Hirsch @ 2001-07-15 13:16 UTC (permalink / raw)
To: Chris Wedgwood, John Alvord
Cc: Daniel Phillips, Alan Cox, Andrew Morton, Andreas Dilger,
Albert D. Cahalan, Ben LaHaise, Ragnar Kjørstad, linux-fsdevel,
linux-kernel, mike, kevin, linux-lvm
Chris Wedgwood <cw@f00f.org> wrote:
> On Sat, Jul 14, 2001 at 11:05:36PM -0700, John Alvord wrote:
>
> In the IBM solution to this (1977-78, VM/CMS) the critical data was
> written at the beginning and the end of the block. If the two data
> items didn't match then the block was rejected.
>
> Neat.
>
>
> Simple and effective. Presumably you can also checksum the block, and
> check that.
The first technique is not sufficient with modern disk controllers, which
may reorder sector writes within a block. A checksum, especially a robust
CRC32, is sufficient, but rather expensive.
Mohan has a clever technique that is computationally trivial and only uses
one bit per sector: http://www.almaden.ibm.com/u/mohan/ICDE95.pdf
Unfortunately, it's also patented:
http://www.delphion.com/details?pn=US05418940__
Perhaps IBM will clarify their position with respect to free software and
patents in the upcoming conference.
* [linux-lvm] Re: [PATCH] 64 bit scsi read/write
[not found] ` <20010715153607.A7624@weta.f00f.org>
2001-07-15 6:05 ` John Alvord
@ 2001-07-15 13:44 ` Daniel Phillips
[not found] ` <20010716023911.A10576@weta.f00f.org>
1 sibling, 1 reply; 31+ messages in thread
From: Daniel Phillips @ 2001-07-15 13:44 UTC (permalink / raw)
To: Chris Wedgwood
Cc: Alan Cox, Andrew Morton, Andreas Dilger, Albert D. Cahalan,
Ben LaHaise, Ragnar Kjørstad, linux-fsdevel, linux-kernel, mike,
kevin, linux-lvm
On Sunday 15 July 2001 05:36, Chris Wedgwood wrote:
> On Sat, Jul 14, 2001 at 10:11:30PM +0200, Daniel Phillips wrote:
>
> Atomic commit. The superblock, which references the updated
> version of the filesystem, carries a sequence number and a
> checksum. It is written to one of two alternating locations. On
> restart, both locations are read and the highest numbered
> superblock with a correct checksum is chosen as the new
> filesystem root.
>
> Yes... and which ever part of the superblock contains the sequence
> number must be written atomically.
The only requirement here is that the checksum be correct. And sure,
that's not a hard guarantee because, on average, you will get a good
checksum for bad data once every 4 billion power events that mess up
the final superblock transfer. Let me see, if that happens once a year,
your data should still be good when the warranty on the sun expires.
:-)
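
The "4 billion" figure is consistent with a 32-bit checksum, assuming a corrupted block is equally likely to produce any checksum value. The arithmetic, with a hypothetical helper name:

```python
def false_accept_odds(checksum_bits):
    """Number of equally likely checksum values a random block could hash to."""
    return 2 ** checksum_bits


print(false_accept_odds(32))  # 4294967296, i.e. roughly 4.3 billion
```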
> The point is, you _NEED_ to be sure that data written before the
> superblock (or indeed anywhere further up the tree, you can make
> changes in theory which don't require super-block updates) are
> written firmly to the platters before any thing which refers to it is
> updated.
Since the updated tree is created non-destructively with respect to
the original tree, the only priority relationship that matters is the
requirement that all blocks of the updated tree be securely committed
before the new superblock is written.
> Alan was saying with IDE you cannot reliably do this, I assume you
> can with SCSI was my point.
Surely it can't be that *all* IDE disks can fail in that way? And it
seems the jury is still out on SCSI, I'm interested to see where that
discussion goes.
--
Daniel
* [linux-lvm] Re: [PATCH] 64 bit scsi read/write
[not found] ` <20010716023911.A10576@weta.f00f.org>
@ 2001-07-15 15:06 ` Jonathan Lundell
[not found] ` <20010716032220.B10635@weta.f00f.org>
2001-07-15 17:47 ` Justin T. Gibbs
2001-07-15 15:32 ` Alan Cox
1 sibling, 2 replies; 31+ messages in thread
From: Jonathan Lundell @ 2001-07-15 15:06 UTC (permalink / raw)
To: Chris Wedgwood, Daniel Phillips
Cc: Alan Cox, Andrew Morton, Andreas Dilger, Albert D. Cahalan,
Ben LaHaise, Ragnar Kjørstad, linux-fsdevel, linux-kernel, mike,
kevin, linux-lvm
At 2:39 AM +1200 2001-07-16, Chris Wedgwood wrote:
>On Sun, Jul 15, 2001 at 03:44:14PM +0200, Daniel Phillips wrote:
>
> The only requirement here is that the checksum be correct. And
> sure, that's not a hard guarantee because, on average, you will
> get a good checksum for bad data once every 4 billion power events
> that mess up the final superblock transfer. Let me see, if that
> happens once a year, your data should still be good when the
> warranty on the sun expires. :-)
>
>the sun will probably last a tad longer than that even continuing to
>burn hydrogen, if you allow for helium burning, you will probably get
>errors to sneak by
>
> Surely it can't be that *all* IDE disks can fail in that way? And
> it seems the jury is still out on SCSI, I'm interested to see
> where that discussion goes.
>
>Alan said *ALL* disks appear to lie, and I'm not going to argue with
>him :)
>
>I only have SCSI disks to test with, but they are hot-plug, so I guess
>I can write a whole bunch of blocks with different numbers on them,
>all over the disk, if I can figure out how to place SCSI barriers and
>then pull the drive and see what gives?
Consider the possibility (probability, I think) that SCSI drives blow
away their (unwritten) write cache buffers on a SCSI bus reset, and
that a SCSI bus reset is a routine, albeit last-resort, error
recovery technique. (It's also necessary; by the time a driver gets
to a bus reset, all else has failed. It's also, in my experience, not
especially rare.)
The fix for that particular problem--disabling write caching--is
simple enough, though it presumably has a performance consequence. A
second benefit of disabling write caching is that the drive can't
reorder writes (though of course the system still might).
At first glance, by the way, the only write barrier I see in the SCSI
command set is the synchronize-cache command, which completes only
after all the drive's dirty buffers are written out. Of course,
without write caching, it's not an issue.
--
/Jonathan Lundell.
* [linux-lvm] Re: [PATCH] 64 bit scsi read/write
[not found] ` <20010716023911.A10576@weta.f00f.org>
2001-07-15 15:06 ` Jonathan Lundell
@ 2001-07-15 15:32 ` Alan Cox
1 sibling, 0 replies; 31+ messages in thread
From: Alan Cox @ 2001-07-15 15:32 UTC (permalink / raw)
To: Chris Wedgwood
Cc: Daniel Phillips, Alan Cox, Andrew Morton, Andreas Dilger,
Albert D. Cahalan, Ben LaHaise, Ragnar Kjørstad, linux-fsdevel,
linux-kernel, mike, kevin, linux-lvm
> I only have SCSI disks to test with, but they are hot-plug, so I guess
> I can write a whole bunch of blocks with different numbers on them,
> all over the disk, if I can figure out how to place SCSI barriers and
> then pull the drive and see what gives?
Another way is to time

    write block
    write barrier
    write same block
    write barrier
    repeat

If the write barrier is working you should be able to measure the drive rpm 8)
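
A sketch of that timing test, with `os.fsync` standing in for the write barrier (an assumption: it is the closest thing userspace can ask for, and on a regular file the filesystem may add metadata writes of its own, so the result is only a rough indication). If every barrier really waits for the platter, rewriting the same block costs about one rotation per iteration, so the write rate times 60 approximates the rpm. The function names are made up for the example.

```python
import os
import time


def rpm_from_rate(writes_per_second):
    """If each barrier costs one full rotation, rate * 60 approximates rpm."""
    return writes_per_second * 60.0


def time_rewrites(path, iterations=200, block=b"\0" * 512):
    """Rewrite the same block repeatedly, flushing after each write.

    Returns the achieved rate in writes per second.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        start = time.perf_counter()
        for _ in range(iterations):
            os.pwrite(fd, block, 0)  # write (same) block
            os.fsync(fd)             # stand-in for the write barrier
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return iterations / elapsed
```

A rate far above what the drive's rotation speed allows (for example thousands of writes per second on a 7200 rpm disk) would suggest the flush is being acknowledged from cache.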
* [linux-lvm] Re: [PATCH] 64 bit scsi read/write
[not found] ` <20010716032220.B10635@weta.f00f.org>
@ 2001-07-15 17:44 ` Jonathan Lundell
0 siblings, 0 replies; 31+ messages in thread
From: Jonathan Lundell @ 2001-07-15 17:44 UTC (permalink / raw)
To: Chris Wedgwood
Cc: Daniel Phillips, Alan Cox, Andrew Morton, Andreas Dilger,
Albert D. Cahalan, Ben LaHaise, Ragnar Kjørstad, linux-fsdevel,
linux-kernel, mike, kevin, linux-lvm
At 3:22 AM +1200 2001-07-16, Chris Wedgwood wrote:
>On Sun, Jul 15, 2001 at 08:06:39AM -0700, Jonathan Lundell wrote:
>
> At first glance, by the way, the only write barrier I see in the
> SCSI command set is the synchronize-cache command, which completes
> only after all the drive's dirty buffers are written out. Of
> course, without write caching, it's not an issue.
>
>Is the spec you have distributable? I believe some of the early drafts
>were, but the final spec isn't.
>
>I'd really like to check it out myself, I always assumed SCSI had the
>smarts for write-barriers and force-unit-access but I guess I was
>wrong.
>
>Anyhow, I'd like to see the spec for myself if it is something I can
>get hold of.
I was referring to IBM's spec, as implemented in their recent SCSI
and FC drives. You can find a copy at
http://www.storage.ibm.com/techsup/hddtech/prodspec/ddyf_spi.pdf
WRITE EXTENDED has a bit (FUA) that will let you force that
particular write to go to disk immediately, independent of write
caching, but there's no suggestion that it otherwise acts as a write
barrier for cached writes.
WRITE VERIFY implies a SYNCHRONIZE CACHE, so it's a write barrier,
but an expensive (because synchronous) one.
--
/Jonathan Lundell.
* [linux-lvm] Re: [PATCH] 64 bit scsi read/write
2001-07-15 15:06 ` Jonathan Lundell
[not found] ` <20010716032220.B10635@weta.f00f.org>
@ 2001-07-15 17:47 ` Justin T. Gibbs
2001-07-15 23:14 ` Rod Van Meter
[not found] ` <20010716205633.G11938@weta.f00f.org>
1 sibling, 2 replies; 31+ messages in thread
From: Justin T. Gibbs @ 2001-07-15 17:47 UTC (permalink / raw)
To: Jonathan Lundell
Cc: Chris Wedgwood, Daniel Phillips, Alan Cox, Andrew Morton,
Andreas Dilger, Albert D. Cahalan, Ben LaHaise, Ragnar Kjørstad,
linux-fsdevel, linux-kernel, mike, kevin, linux-lvm
>Consider the possibility (probability, I think) that SCSI drives blow
>away their (unwritten) write cache buffers on a SCSI bus reset, and
>that a SCSI bus reset is a routine, albeit last-resort, error
>recovery technique. (It's also necessary; by the time a driver gets
>to a bus reset, all else has failed. It's also, in my experience, not
>especially rare.)
I have never seen this to be the case. The SCSI spec is quite clear
in stating that a bus reset only affects "I/O processes that have not
completed, SCSI device reservations, and SCSI device operating modes".
The soft reset section clarifies the meaning of "completed commands"
as:
e) An initiator shall consider an I/O process to be completed
when it negates ACK for a successfully received COMMAND
COMPLETE message.
f) A target shall consider an I/O process to be completed when
it detects the transition of ACK to false for the COMMAND
COMPLETE message with the ATN signal false.
As the soft reset section also specifies how to deal with initiators
that are not expecting soft reset semantics, I believe this applies to
either reset model.
If we look at the section on caching for direct access devices we see,
"[write-back cached] data may be lost if power to the device is lost or
a hardware failure occurs". There is no mention of a bus reset having
any effect on commands already acked as completed to the initiator.
>The fix for that particular problem--disabling write caching--is
>simple enough, though it presumably has a performance consequence. A
>second benefit of disabling write caching is that the drive can't
>reorder writes (though of course the system still might).
Simply disabling the write cache does not guarantee the order of writes.
For one, with tagged I/O and the use of the SIMPLE_Q tag qualifier,
commands may be completed in any order. If you want some semblance of
order, either disable the write cache or use the FUA bit in all writes,
and use the ORDERED tag qualifier. Even when using these options,
it is not clear that the drive cannot reorder writes "slightly" to
make track writes more efficient (e.g. two separate commands to write
sequential sectors on the same track may be written in reverse order).
>At first glance, by the way, the only write barrier I see in the SCSI
>command set is the synchronize-cache command, which completes only
>after all the drive's dirty buffers are written out. Of course,
>without write caching, it's not an issue.
The ordered tag qualifier gives you barrier semantics with the caveats
listed above.
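
A toy model of those barrier semantics, not driver code: commands tagged SIMPLE may complete in any order, while a command tagged ORDERED must complete after everything issued before it and before everything issued after it. The checker below and its names are hypothetical, and it says nothing about the "slight" on-platter reordering caveat.

```python
def respects_ordered_tags(issued, completed):
    """Check a completion order against SIMPLE/ORDERED tag rules.

    issued:    list of (command_id, tag) in issue order
    completed: list of command_ids in completion order
    """
    position = {cmd: i for i, cmd in enumerate(completed)}
    for i, (cmd, tag) in enumerate(issued):
        if tag != "ORDERED":
            continue
        # Nothing issued earlier may complete after the ORDERED command.
        if any(position[earlier] > position[cmd] for earlier, _ in issued[:i]):
            return False
        # Nothing issued later may complete before it.
        if any(position[later] < position[cmd] for later, _ in issued[i + 1:]):
            return False
    return True
```

For example, with two SIMPLE data writes followed by an ORDERED commit write, the data writes may swap with each other but neither may land after the commit.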
--
Justin
* [linux-lvm] Re: [PATCH] 64 bit scsi read/write
2001-07-15 13:16 ` Ken Hirsch
@ 2001-07-15 22:14 ` Daniel Phillips
0 siblings, 0 replies; 31+ messages in thread
From: Daniel Phillips @ 2001-07-15 22:14 UTC (permalink / raw)
To: Ken Hirsch, Chris Wedgwood, John Alvord
Cc: Alan Cox, Andrew Morton, Andreas Dilger, Albert D. Cahalan,
Ben LaHaise, Ragnar Kjørstad, linux-fsdevel, linux-kernel, mike,
kevin, linux-lvm
On Sunday 15 July 2001 15:16, Ken Hirsch wrote:
> Chris Wedgwood <cw@f00f.org> wrote:
> > On Sat, Jul 14, 2001 at 11:05:36PM -0700, John Alvord wrote:
> > >
> > > In the IBM solution to this (1977-78, VM/CMS) the critical data
> > > was written at the beginning and the end of the block. If the two
> > > data items didn't match then the block was rejected.
> >
> > Neat.
> >
> > Simple and effective. Presumably you can also checksum the block,
> > and check that.
>
> The first technique is not sufficient with modern disk controllers,
> which may reorder sector writes within a block. A checksum,
> especially a robust CRC32, is sufficient, but rather expensive.
As somebody else pointed out, not if you don't have to compute it on
every block, as with journalling or atomic commit.
> Mohan has a clever technique that is computationally trivial and only
> uses one bit per sector:
> http://www.almaden.ibm.com/u/mohan/ICDE95.pdf
>
> Unfortunately, it's also patented:
> http://www.delphion.com/details?pn=US05418940__
Fortunately, it's clunky and unappealing compared to the simple
checksum method, applied only to those blocks that define consistency
points. I don't think this is patented. I'd be disturbed if it was,
since it's obvious.
> Perhaps IBM will clarify their position with respect to free software
> and patents in the upcoming conference.
Wouldn't that be nice. Imagine, IBM comes out and says, we admit it,
patents are a net burden on everybody, even us - from now on, we use
them only against those who use them against us, and we'll put that
in writing. Right.
--
Daniel
* [linux-lvm] Re: [PATCH] 64 bit scsi read/write
2001-07-15 17:47 ` Justin T. Gibbs
@ 2001-07-15 23:14 ` Rod Van Meter
[not found] ` <20010716205633.G11938@weta.f00f.org>
1 sibling, 0 replies; 31+ messages in thread
From: Rod Van Meter @ 2001-07-15 23:14 UTC (permalink / raw)
To: Justin T. Gibbs; +Cc: linux-fsdevel, linux-kernel, linux-lvm
I don't have the SCSI spec in front of me (though, as noted, some
drafts are available online; try t10.org somewhere), but as I
understand it (having worked, briefly, for a major disk manufacturer):
You can commit an individual write with the FUA (force unit access)
bit. The command for this is not WRITE EXTENDED, but WRITE(10) or
WRITE(12). I don't think WRITE(6) has room for the bit, and WRITE(6)
is useless nowadays, anyway. WRITE EXTENDED lets you write over the
ECC bits -- it's a raw write to the platter. Dunno that anyone
implements it any more.
That does NOT get you ordering with respect to other commands. You
can use the complex tagging stuff to get that, but most disk drives
didn't implement it properly in the SCSI-2 days, and there are
significant differences in SCSI-3.
Otherwise, your choice, as noted, is SYNCHRONIZE CACHE before the root
block write, and after. AFAIK, all drives treat that the way it's
meant to be done; everything's on platter when you get a COMMAND
COMPLETE back from it, but they weren't necessarily done in order.
Even within a command, I don't believe there is a guarantee that the
blocks will go to platter in order. Say you write blocks 0-7; the
drive will start the transfer to buffer immediately, as the seek is
begun. When the seek completes, the write gate will enable writes
from buffer to platter, and a state machine takes care of that.
However, the seek and settle may complete when the head is over block
3, so the first write to platter would be block 4, then 5-7. This is
followed by almost an entire revolution's delay(*see note) to get back
to block 0, and 3 will be the last block written.
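
The 0-7 example can be written out as a small model. This is only a sketch of the behaviour as described in this message (write whatever passes under the head after settling, then wrap around); `platter_write_order` is a made-up name and real firmware may behave differently.

```python
def platter_write_order(first, count, settle_block):
    """Order in which a contiguous run of blocks reaches the platter.

    The head is assumed to settle while passing over settle_block, so
    that block is missed and gets written last, after a full wrap.
    """
    blocks = list(range(first, first + count))
    if settle_block not in blocks:
        # Head settled before the run started: blocks go out in order.
        return blocks
    split = blocks.index(settle_block) + 1
    return blocks[split:] + blocks[:split]


print(platter_write_order(0, 8, 3))  # [4, 5, 6, 7, 0, 1, 2, 3]
```

A power loss partway through would therefore leave the tail of the request on the platter and the head of it stale, which is why a stamp at both ends of a multi-sector block is not a reliable torn-write detector.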
I have had this exact conversation with disk drive folks (of which I
am not one), but I haven't seen the firmware and state machines
myself, so treat this as an educated guess. The folks I was talking
to may have been wrong, or more likely, misunderstood what I was
asking.
Some manufacturers can put either IDE or SCSI on a drive, and this
behavior is likely to be the same on both. It may not apply to all
members of a family, and probably doesn't apply across families from
the same manufacturer.
Most disk drives, as recently as two years ago, were a lot dumber than
you think, and I doubt the situation has improved much. For the most
part, disk manufacturers get paid for capacity, not smarts, but
there's an entire year-long argument there.
--Rod
* Note: In theory, that rotational delay doesn't have to be idle. I
believe any blocks between 7 and 0 that are also in cache will be
written as the head passes over them. Thus, the drive might
literally interleave writes from multiple commands. It's also
possible, in theory, to switch tracks for a short time and come back
to the first track before block 0 rolls around, but I don't believe
existing controllers are that sophisticated.
P.S. I gotta put in another plug here -- you have until Friday to
write this behavior up and submit it as a paper to USENIX FAST --
Conference on File and Storage Technology. See
http://www.usenix.org/events/fast/
* [linux-lvm] Re: [PATCH] 64 bit scsi read/write
[not found] ` <20010716205633.G11938@weta.f00f.org>
@ 2001-07-16 13:19 ` Daniel Phillips
2001-07-16 14:26 ` Heinz J. Mauelshagen
0 siblings, 1 reply; 31+ messages in thread
From: Daniel Phillips @ 2001-07-16 13:19 UTC (permalink / raw)
To: Chris Wedgwood, Justin T. Gibbs
Cc: Jonathan Lundell, Alan Cox, Andrew Morton, Andreas Dilger,
Albert D. Cahalan, Ben LaHaise, Ragnar Kjørstad, linux-fsdevel,
linux-kernel, mike, kevin, linux-lvm
On Monday 16 July 2001 10:56, Chris Wedgwood wrote:
> On Sun, Jul 15, 2001 at 11:47:10AM -0600, Justin T. Gibbs wrote:
>
> Simply disabling the write cache does not guarantee the order of
> writes. For one, with tagged I/O and the use of the SIMPLE_Q tag
> qualifier, commands may be completed in any order. If you want
> some semblance of order, either disable the write cache or use
> the FUA bit in all writes, and use the ORDERED tag qualifier. Even
> when using these options, it is not clear that the drive cannot
> reorder writes "slightly" to make track writes more efficient (e.g.
> two separate commands to write sequential sectors on the same track
> may be written in reverse order).
>
> ORDERED sounds like the trick... I assume this is some kind of
> write-barrier? If so, then I assume it has some kind of strict
> temporal ordering, even between command issues to the drive.
>
> If so, that would be ideal if we can have the fs communicate this all
> the way down to the device layer, making it work for soft-raid and
> LVM be a little harder perhaps.
There was general agreement amongst filesystem developers at San Jose
that we need some kind of internal interface at the filesystem level
for this, independent of the type of underlying block device - IDE,
SCSI or "other". That's as far as it got.
--
Daniel
* Re: [linux-lvm] Re: [PATCH] 64 bit scsi read/write
2001-07-16 13:19 ` Daniel Phillips
@ 2001-07-16 14:26 ` Heinz J. Mauelshagen
0 siblings, 0 replies; 31+ messages in thread
From: Heinz J. Mauelshagen @ 2001-07-16 14:26 UTC (permalink / raw)
To: linux-lvm
On Mon, Jul 16, 2001 at 03:19:39PM +0200, Daniel Phillips wrote:
> On Monday 16 July 2001 10:56, Chris Wedgwood wrote:
> > On Sun, Jul 15, 2001 at 11:47:10AM -0600, Justin T. Gibbs wrote:
> >
> > Simply disabling the write cache does not guarantee the order of
> > writes. For one, with tagged I/O and the use of the SIMPLE_Q tag
> > qualifier, commands may be completed in any order. If you want
> > some semblance of order, either disable the write cache or use
> > the FUA bit in all writes, and use the ORDERED tag qualifier. Even
> > when using these options, it is not clear that the drive cannot
> > reorder writes "slightly" to make track writes more efficient (e.g.
> > two separate commands to write sequential sectors on the same track
> > may be written in reverse order).
> >
> > ORDERED sounds like the trick... I assume this is some kind of
> > write-barrier? If so, then I assume it has some kind of strict
> > temporal ordering, even between command issues to the drive.
> >
> > If so, that would be ideal if we can have the fs communicate this all
> > the way down to the device layer, making it work for soft-raid and
> > LVM be a little harder perhaps.
>
> There was general agreement amongst filesystem developers at San Jose
> that we need some kind of internal interface at the filesystem level
> for this, independent of the type of underlying block device - IDE,
> SCSI or "other". That's as far as it got.
>
Daniel,
so there's no document defining the basics of that interface so far?
> --
> Daniel
--
Regards,
Heinz -- The LVM Guy --
*** Software bugs are stupid.
Nevertheless it needs not so stupid people to solve them ***
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Heinz Mauelshagen Sistina Software Inc.
Senior Consultant/Developer Am Sonnenhang 11
56242 Marienrachdorf
Germany
Mauelshagen@Sistina.com +49 2626 141200
FAX 924446
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
* [linux-lvm] Re: [PATCH] 64 bit scsi read/write
[not found] ` <20010715180752.B7993@weta.f00f.org>
2001-07-15 13:16 ` Ken Hirsch
@ 2001-07-17 0:31 ` Juan Quintela
1 sibling, 0 replies; 31+ messages in thread
From: Juan Quintela @ 2001-07-17 0:31 UTC (permalink / raw)
To: Chris Wedgwood
Cc: John Alvord, Daniel Phillips, Alan Cox, Andrew Morton,
Andreas Dilger, Albert D. Cahalan, Ben LaHaise, Ragnar Kjørstad,
linux-fsdevel, linux-kernel, mike, kevin, linux-lvm
>>>>> "chris" == Chris Wedgwood <cw@f00f.org> writes:
chris> On Sat, Jul 14, 2001 at 11:05:36PM -0700, John Alvord wrote:
chris> In the IBM solution to this (1977-78, VM/CMS) the critical data was
chris> written at the beginning and the end of the block. If the two data items
chris> didn't match then the block was rejected.
chris> Neat.
chris> Simple and effective. Presumably you can also checksum the block, and
chris> check that.
There is a rumor (I can't confirm it) that you need checksums, because
some disks are able to write the beginning and the end of the sector
correctly and put garbage in the middle when something goes wrong. I
have never been able to reproduce those errors, but ....
Later, Juan.
--
In theory, practice and theory are the same, but in practice they
are different -- Larry McVoy
Thread overview: 31+ messages
[not found] <20010703065312.J4841@vestdata.no>
[not found] ` <Pine.LNX.4.33.0107032211120.30968-100000@toomuch.toronto.redhat.com>
2001-07-05 6:34 ` [linux-lvm] Re: [PATCH] 64 bit scsi read/write Ragnar Kjørstad
2001-07-05 7:35 ` Ben LaHaise
2001-07-05 16:46 ` AJ Lewis
2001-07-05 17:09 ` Eric M. Hopper
2001-07-10 13:45 ` Heinz J. Mauelshagen
2001-07-13 18:20 ` Albert D. Cahalan
2001-07-13 20:41 ` Andreas Dilger
2001-07-13 21:14 ` Alan Cox
2001-07-14 3:23 ` Andrew Morton
2001-07-14 8:45 ` Alan Cox
2001-07-14 13:54 ` Steven Lembark
2001-07-14 17:33 ` Jonathan Lundell
[not found] ` <20010715160247.I7624@weta.f00f.org>
2001-07-15 5:46 ` Jonathan Lundell
[not found] ` <20010715025001.B6722@weta.f00f.org>
2001-07-14 15:41 ` Jonathan Lundell
2001-07-14 20:11 ` Daniel Phillips
2001-07-15 1:21 ` Andrew Morton
2001-07-15 1:53 ` Daniel Phillips
[not found] ` <20010715153607.A7624@weta.f00f.org>
2001-07-15 6:05 ` John Alvord
[not found] ` <20010715180752.B7993@weta.f00f.org>
2001-07-15 13:16 ` Ken Hirsch
2001-07-15 22:14 ` Daniel Phillips
2001-07-17 0:31 ` Juan Quintela
2001-07-15 13:44 ` Daniel Phillips
[not found] ` <20010716023911.A10576@weta.f00f.org>
2001-07-15 15:06 ` Jonathan Lundell
[not found] ` <20010716032220.B10635@weta.f00f.org>
2001-07-15 17:44 ` Jonathan Lundell
2001-07-15 17:47 ` Justin T. Gibbs
2001-07-15 23:14 ` Rod Van Meter
[not found] ` <20010716205633.G11938@weta.f00f.org>
2001-07-16 13:19 ` Daniel Phillips
2001-07-16 14:26 ` Heinz J. Mauelshagen
2001-07-15 15:32 ` Alan Cox
[not found] <20010714090703.B5737@weta.f00f.org>
2001-07-13 22:04 ` Andreas Dilger
2001-07-14 0:49 ` Jonathan Lundell