public inbox for linux-kernel@vger.kernel.org
* Disk (block) write strangeness
@ 2002-08-05 18:49 Jakob Oestergaard
  2002-08-05 20:17 ` Alan Cox
  2002-08-07 11:43 ` Itai Nahshon
  0 siblings, 2 replies; 6+ messages in thread
From: Jakob Oestergaard @ 2002-08-05 18:49 UTC
  To: linux-kernel


Hello all,

While investigating how various disks handle power-loss during writes, I
came across something *very* strange.

It seems that

*) Either the disk writes backwards  (no, I don't believe that)
*) Or the kernel is writing 256 B blocks (AFAIK it can't)
*) Or the disk has some internal magic that causes a power loss during
   a full block write to leave the first half of the block intact with
   old data, and update the second half of the block correctly with new
   data.  (And I don't believe that either.)

The scenario is:   I wrote a program that will write a 50 MB block with
O_SYNC to /dev/hdc.  The block is full of 32-bit integers, initialized
to 0.  For every full block write (the block is written with one single
write() call), the integers are incremented once.

So first I have 50 MB of 0's. Then 50 MB of 1's. etc.
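
For readers who want to reproduce the test, a minimal sketch of such a
writer might look like the following. This is not the original test
program, which is not included in this thread. The function name
write_generations, its parameters and the little-endian layout are
assumptions made for illustration. Pointing it at a real device such as
/dev/hdc destroys whatever is stored there, so try it on a scratch file
first.

```python
import os
import struct

def write_generations(path, n_ints, generations):
    """Overwrite the start of `path` with `n_ints` 32-bit integers, once per generation.

    Generation g fills the whole block with the value g, and each block is
    submitted with one single write() call on an O_SYNC file descriptor.
    """
    # O_SYNC asks the kernel not to return from write() until the data has
    # been sent to the device. getattr() keeps this sketch importable on
    # platforms that lack the flag (assumption: fall back to no flag).
    flags = os.O_WRONLY | os.O_CREAT | getattr(os, "O_SYNC", 0)
    fd = os.open(path, flags, 0o600)
    try:
        for value in range(generations):
            block = struct.pack("<I", value) * n_ints
            os.lseek(fd, 0, os.SEEK_SET)
            written = os.write(fd, block)
            if written != len(block):
                raise IOError("short write: %d of %d bytes" % (written, len(block)))
    finally:
        os.close(fd)
```

For the 50 MB block described above, n_ints would be
50 * 1024 * 1024 // 4 = 13107200.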

During this write cycle, I pull the power cable.   I get the machine
back online and I dump the 50 MB block.

What I found was a 50 MB block holding:
 11668992 times "0x00000002"
   231168 times "0x00000003"
  1174528 times "0x00000002"
    32512 times "0x00000003"

Please note that 32512 is *not* a multiple of 512.  And please note that
the 3's are written *after* the 2's, so actually there is a 512 byte
block on the disk which contains 2's in the first half, and 3's in the
second half!
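
A run-length summary like the one above can be computed from a raw dump
with a few lines of code. The sketch below is only a guess at how such
counts might be produced, not the tool actually used here. run_lengths is
a hypothetical helper, and it assumes little-endian 32-bit integers (as on
an i686 box).

```python
import struct
from itertools import groupby

def run_lengths(data):
    """Return [(value, count), ...] for runs of equal 32-bit integers in `data`."""
    # Ignore any trailing bytes that do not form a whole integer.
    usable = len(data) - len(data) % 4
    values = (v for (v,) in struct.iter_unpack("<I", data[:usable]))
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]
```

The counts it returns are numbers of integers, so the byte offset of a run
boundary is four times the running total.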

How on earth could that happen ?

Why does the kernel not write from beginning to end ?   Or why doesn't
the disk ?

And does the elevator cause the writes to be shuffled around like that -
I would have expected the kernel to write from beginning to end every
single time...

The kernel is 2.4.18 on some i686 box
The disk is a Quantum Fireball 1GB IDE (from way back then ;)
The IDE chipset is an I820 Camino 2

I can submit the test program or do further tests, if anyone is
interested.

Thank you,

-- 
................................................................
:   jakob@unthought.net   : And I see the elder races,         :
:.........................: putrid forms of man                :
:   Jakob Østergaard      : See him rise and claim the earth,  :
:        OZ9ABN           : his downfall is at hand.           :
:.........................:............{Konkhra}...............:


* Re: Disk (block) write strangeness
  2002-08-05 20:17 ` Alan Cox
@ 2002-08-05 19:07   ` Jakob Oestergaard
  2002-08-06 14:44     ` Kasper Dupont
  0 siblings, 1 reply; 6+ messages in thread
From: Jakob Oestergaard @ 2002-08-05 19:07 UTC
  To: Alan Cox; +Cc: linux-kernel

On Mon, Aug 05, 2002 at 09:17:12PM +0100, Alan Cox wrote:
> On Mon, 2002-08-05 at 19:49, Jakob Oestergaard wrote:
> > *) Either the disk writes backwards  (no, I don't believe that)
> > *) Or the kernel is writing 256 B blocks (AFAIK it can't)
> > *) Or the disk has some internal magic that causes a power loss during
> >    a full block write to leave the first half of the block intact with
> >    old data, and update the second half of the block correctly with new
> >    data.  (And I don't believe that either.)
> 
> You forgot to add
> 
> *) or the disk internal logic bears no resemblance to the antiquated API
> it fakes for the convenience of interface hardware and software

Fair enough - that seems like a reasonable explanation.

On a side note - what guarantees does one have ?  Any pointers to papers
or other material about this ?

For example, when updating a 3 to a 4 on the disk, could I end up with a
7 ?    (having 00000011 on the platter, starting a write of 00000100, but
after the new high bit has been written the power fails, and I now have
00000111).

The above example is simple - I doubt that it would happen - but how
much can and cannot happen ?   I bet the Phase Tree (Tux2) people must
have thought about this at some point...  I haven't had much luck with
Google on this one...
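
To make the question concrete, here is a toy model. It does not describe
how any real drive behaves. It assumes that when power fails, each bit
that differs between the old and the new value independently ends up
either old or new. torn_values is a hypothetical helper written only for
this illustration.

```python
def torn_values(old, new, width=8):
    """Values reachable if each differing bit independently stays old or becomes new."""
    # Bit masks for the positions where old and new disagree.
    differing = [1 << i for i in range(width) if (old ^ new) >> i & 1]
    results = set()
    # Enumerate every subset of the differing bits that "made it" to the platter.
    for mask in range(1 << len(differing)):
        value = old
        for j, bit in enumerate(differing):
            if mask >> j & 1:
                value ^= bit  # this bit was updated to its new state
        results.add(value)
    return sorted(results)
```

Under that assumption, torn_values(3, 4) contains every value from 0 to 7,
since all three low bits differ between 00000011 and 00000100. Whether a
real disk can expose such a state is exactly the open question.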

> 
> Linux also won't necessarily do write-outs in order.

But in this case, I wonder why ?

It's one huge sequential write, from the beginning of a device and 50 MB
onwards.  The write is submitted in one single write() every single
time.   Why start going semi-backwards and chopping things up ?

I'm *very* certain that Linux does this non-sequentially.  The disk might
be causing the half-block oddity, which really surprised me, but the disk
is certainly not caching 20 MB of data internally.

Is this an elevator deficiency in 2.4.18, or am I just moaning for no
reason at all ?    ;)

Thanks for the quick reply !

Cheers,



* Re: Disk (block) write strangeness
  2002-08-05 18:49 Disk (block) write strangeness Jakob Oestergaard
@ 2002-08-05 20:17 ` Alan Cox
  2002-08-05 19:07   ` Jakob Oestergaard
  2002-08-07 11:43 ` Itai Nahshon
  1 sibling, 1 reply; 6+ messages in thread
From: Alan Cox @ 2002-08-05 20:17 UTC
  To: Jakob Oestergaard; +Cc: linux-kernel

On Mon, 2002-08-05 at 19:49, Jakob Oestergaard wrote:
> *) Either the disk writes backwards  (no, I don't believe that)
> *) Or the kernel is writing 256 B blocks (AFAIK it can't)
> *) Or the disk has some internal magic that causes a power loss during
>    a full block write to leave the first half of the block intact with
>    old data, and update the second half of the block correctly with new
>    data.  (And I don't believe that either.)

You forgot to add

*) or the disk internal logic bears no resemblance to the antiquated API
it fakes for the convenience of interface hardware and software

Linux also won't necessarily do write-outs in order.



* Re: Disk (block) write strangeness
  2002-08-05 19:07   ` Jakob Oestergaard
@ 2002-08-06 14:44     ` Kasper Dupont
  2002-08-07  8:14       ` Helge Hafting
  0 siblings, 1 reply; 6+ messages in thread
From: Kasper Dupont @ 2002-08-06 14:44 UTC
  To: Jakob Oestergaard; +Cc: Alan Cox, linux-kernel

Jakob Oestergaard wrote:
> 
> I'm *very* certain that Linux does this non-sequentially.  The disk might
> be causing the half-block oddity, which really surprised me, but the disk
> is certainly not caching 20 MB of data internally.

Maybe you shouldn't think of the power failure as happening at one
single point in time, but rather over a short period of time. During
that period there may be moments when there is enough power to
actually write to the disk, and moments when there is not.

I think it should be possible for the firmware on a good disk to
prevent such artifacts. But I think you can find disks that just
keep trying to write even while power is failing.

-- 
Kasper Dupont -- who spends too much time on usenet.
For sending spam use mailto:razrep@daimi.au.dk
or mailto:mcxumhvenwblvtl@skrammel.yaboo.dk


* Re: Disk (block) write strangeness
  2002-08-06 14:44     ` Kasper Dupont
@ 2002-08-07  8:14       ` Helge Hafting
  0 siblings, 0 replies; 6+ messages in thread
From: Helge Hafting @ 2002-08-07  8:14 UTC
  To: linux-kernel

Kasper Dupont wrote:

> I think it should be possible for the firmware on a good disk to
> prevent such artifacts. But I think you can find disks that just
> keep trying to write even while power is failing.
> 
That could give you some (sub)blocks out of order, if the disk
firmware falsely believes that it is free to reorder anything
that reaches its internal cache.   Writing to the bitter end
will turn at least one block to garbage as the write current fails
in the middle.

Alan Cox wrote:
> *) or the disk internal logic bears no resemblance to the antiquated API
> it fakes for the convenience of interface hardware and software

One may then wonder if journalling is a safe thing to do,
with out-of-order writes exposed by a power failure...

Helge Hafting


* Re: Disk (block) write strangeness
  2002-08-05 18:49 Disk (block) write strangeness Jakob Oestergaard
  2002-08-05 20:17 ` Alan Cox
@ 2002-08-07 11:43 ` Itai Nahshon
  1 sibling, 0 replies; 6+ messages in thread
From: Itai Nahshon @ 2002-08-07 11:43 UTC
  To: Jakob Oestergaard, linux-kernel

On Monday 05 August 2002 21:49 pm, Jakob Oestergaard wrote:
> Hello all,
>
> While investigating how various disks handle power-loss during writes, I
> came across something *very* strange.
>
> It seems that
>
> *) Either the disk writes backwards  (no, I don't believe that)
> *) Or the kernel is writing 256 B blocks (AFAIK it can't)
> *) Or the disk has some internal magic that causes a power loss during
>    a full block write to leave the first half of the block intact with
>    old data, and update the second half of the block correctly with new
>    data.  (And I don't believe that either.)
>
> The scenario is:   I wrote a program that will write a 50 MB block with
> O_SYNC to /dev/hdc.  The block is full of 32-bit integers, initialized
> to 0.  For every full block write (the block is written with one single
> write() call), the integers are incremented once.
>
> So first I have 50 MB of 0's. Then 50 MB of 1's. etc.
>
> During this write cycle, I pull the power cable.   I get the machine
> back online and I dump the 50 MB block.
>
> What I found was a 50 MB block holding:
>  11668992 times "0x00000002"
>    231168 times "0x00000003"
>   1174528 times "0x00000002"
>     32512 times "0x00000003"
>
> Please note that 32512 is *not* a multiple of 512.  And please note that
> the 3's are written *after* the 2's, so actually there is a 512 byte
> block on the disk which contains 2's in the first half, and 3's in the
> second half!

Integers are 32 bits, so a 512-byte disk block contains 128 such integers...
Indeed, all the values above are divisible by 128, so you have:
11668992/128 = 91164 blocks of "0x00000002"
231168/128 = 1806 blocks of "0x00000003"
1174528/128 = 9176 blocks of "0x00000002"
32512/128 = 254 blocks of "0x00000003"

This neither proves nor disproves anything about your
main concern, that writes are non-atomic at the block level.
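
The arithmetic above is easy to check mechanically. The short sketch
below uses the four counts from the original report. The constant names
and the helper sectors_per_run are made up for this illustration.

```python
SECTOR_BYTES = 512
INT_BYTES = 4
INTS_PER_SECTOR = SECTOR_BYTES // INT_BYTES  # 128 integers per 512-byte block

# (value, number of consecutive integers) as reported in the dump
runs = [(2, 11668992), (3, 231168), (2, 1174528), (3, 32512)]

def sectors_per_run(runs):
    """Return (value, whole_sectors, leftover_integers) for each run."""
    out = []
    for value, count in runs:
        whole, leftover = divmod(count, INTS_PER_SECTOR)
        out.append((value, whole, leftover))
    return out

for value, whole, leftover in sectors_per_run(runs):
    print("0x%08x: %d blocks, %d integers left over" % (value, whole, leftover))

total_bytes = sum(count for _, count in runs) * INT_BYTES
print("total: %d bytes" % total_bytes)
```

Every run leaves zero integers over, so each boundary between 2's and 3's
falls on a 512-byte boundary, and the four runs add up to exactly
50 * 1024 * 1024 bytes.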

>
> How on earth could that happen ?
>
> Why does the kernel not write from beginning to end ?   Or why doesn't
> the disk ?
>
> And does the elevator cause the writes to be shuffled around like that -
> I would have expected the kernel to write from beginning to end every
> single time...
>

I would not expect writes to be in order.
A simple elevator algorithm could write fragments (cylinder sized?)
in reverse order. On-disk write scheduling could start writing at any
sector (to minimize rotational latency).
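
As an illustration of how an elevator can reverse a sequential batch,
here is a sketch of the textbook SCAN policy. It is not the 2.4.18
elevator code and makes no claim about what that kernel actually does.
scan_order and its arguments are hypothetical names, and requests are
reduced to bare start sectors.

```python
def scan_order(pending, head, direction):
    """Order in which a classic SCAN elevator would service `pending` start sectors.

    `head` is the current head position; `direction` is +1 when sweeping
    toward higher sectors and -1 when sweeping toward lower sectors.
    """
    if direction > 0:
        # Service everything ahead of the head going up, then come back down.
        first = sorted(s for s in pending if s >= head)
        second = sorted((s for s in pending if s < head), reverse=True)
    else:
        # Service everything below the head going down, then go back up.
        first = sorted((s for s in pending if s <= head), reverse=True)
        second = sorted(s for s in pending if s > head)
    return first + second
```

With the head above the whole batch and sweeping downward, for example
scan_order([0, 256, 512, 768], 1000, -1), the fragments of a sequential
write are serviced from the last to the first.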

Knowing the disk geometry and parameters could help with understanding
your results.


-- Itai

