ordered I/O with multipath

linux-fsdevel.vger.kernel.org archive mirror

* ordered I/O with multipath
@ 2009-04-08  4:59 谢纲
  2009-04-08 14:30 ` Jamie Lokier
  0 siblings, 1 reply; 6+ messages in thread
From: 谢纲 @ 2009-04-08  4:59 UTC (permalink / raw)
  To: linux-fsdevel

Hi,

Some journalling filesystems use barrier I/O to ensure the ordering of
committed data. But if the filesystem sits on top of a volume manager
that supports RAID and multipath, the barrier I/O might not be handled
correctly. How do journalling filesystems deal with this?

-- 
Xie Gang


* Re: ordered I/O with multipath
  2009-04-08  4:59 ordered I/O with multipath 谢纲
@ 2009-04-08 14:30 ` Jamie Lokier
  2009-04-08 14:53   ` 谢纲
  2009-04-09 18:32   ` Bryan Henderson
  0 siblings, 2 replies; 6+ messages in thread
From: Jamie Lokier @ 2009-04-08 14:30 UTC (permalink / raw)
  To: 谢纲; +Cc: linux-fsdevel

谢纲 wrote:
> Some journalling filesystems use barrier I/O to ensure the ordering of
> committed data. But if the filesystem sits on top of a volume manager
> that supports RAID and multipath, the barrier I/O might not be handled
> correctly. How do journalling filesystems deal with this?

For software RAID and multipath, I think it isn't handled at all.

Even if you disable write-caching in the underlying storage, ordered
requests may not retain their order, so the common database advice to
disable write-cache and use SCSI or SATA-NCQ may not work either.

If the RAID code is changed to handle barriers, that would still have
possible "scattershot" corruption on RAID-5, because writing a single
sector on the logical device affects more than one visible sector if
it is interrupted.  In other words, the "radius of corruption" is
bigger than one sector for RAID-5, and it's not contiguous either.

In principle, journalling filesystems need to know the "radius of
corruption" to provide robust journalling.  If individual sector
writes are atomic, this isn't an issue.  Some people think sector
writes are atomic on modern hard drives (but I wouldn't count on it).
But it is definitely not atomic when writing to a RAID or multipath if
the write affects more than one device.

-- Jamie
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: ordered I/O with multipath
  2009-04-08 14:30 ` Jamie Lokier
@ 2009-04-08 14:53   ` 谢纲
  2009-04-09 18:32   ` Bryan Henderson
  1 sibling, 0 replies; 6+ messages in thread
From: 谢纲 @ 2009-04-08 14:53 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-fsdevel

On Wed, Apr 8, 2009 at 10:30 PM, Jamie Lokier <jamie@shareable.org> wrote:
> 谢纲 wrote:
>> Some journalling filesystems use barrier I/O to ensure the ordering of
>> committed data. But if the filesystem sits on top of a volume manager
>> that supports RAID and multipath, the barrier I/O might not be handled
>> correctly. How do journalling filesystems deal with this?
>
> For software RAID and multipath, I think it isn't handled at all.
>
> Even if you disable write-caching in the underlying storage, ordered
> requests may not retain their order, so the common database advice to
> disable write-cache and use SCSI or SATA-NCQ may not work either.
>
> If the RAID code is changed to handle barriers, that would still have
> possible "scattershot" corruption on RAID-5, because writing a single
> sector on the logical device affects more than one visible sector if
> it is interrupted.  In other words, the "radius of corruption" is
> bigger than one sector for RAID-5, and it's not contiguous either.
If there is a volume manager that controls the RAID and understands
multipath, I think barriers can be handled, because it has all the
information about where each I/O goes. But handling all of this is
very complicated.
It's said that the Veritas volume manager can handle this; I don't
know whether that's true. But given the Linux block layer, it's
really hard to implement this.
>
> In principle, journalling filesystems need to know the "radius of
> corruption" to provide robust journalling.  If individual sector
> writes are atomic, this isn't an issue.  Some people think sector
> writes are atomic on modern hard drives (but I wouldn't count on it).
> But it is definitely not atomic when writing to a RAID or multipath if
> the write affects more than one device.
>
> -- Jamie
>



-- 
Xie Gang


* Re: ordered I/O with multipath
  2009-04-08 14:30 ` Jamie Lokier
  2009-04-08 14:53   ` 谢纲
@ 2009-04-09 18:32   ` Bryan Henderson
  2009-04-09 20:00     ` Jamie Lokier
  1 sibling, 1 reply; 6+ messages in thread
From: Bryan Henderson @ 2009-04-09 18:32 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-fsdevel, 谢纲

> If the RAID code is changed to handle barriers, that would still have
> possible "scattershot" corruption on RAID-5, because writing a single
> sector on the logical device affects more than one visible sector if
> it is interrupted.  In other words, the "radius of corruption" is
> bigger than one sector for RAID-5, and it's not contiguous either.

I've seen several RAID-5 systems, and they all went to great lengths to 
ensure that interrupting a write to Sector A can't destroy Sector B.  It 
isn't easy; it involves journalling.  But I've always taken it as an 
absolute requirement.

I assume you're talking about a case where Sectors 1-5 are covered by a
single parity sector and the RAID system restarts after it has written
Sector 1 but before it has written the new parity.  Now if you lose
Sector 2, you'll recover incorrect contents for it.
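
To make that scenario concrete, here is a minimal Python sketch.  It
assumes plain XOR parity over five data sectors (shrunk to 4 bytes each);
the helper names xor_parity and reconstruct are made up for illustration
and do not describe how any particular RAID implementation is written.

```python
from functools import reduce


def xor_parity(blocks):
    """Return the bytewise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct(stripe, parity, lost_index):
    """Rebuild one lost data block from the surviving blocks and the parity."""
    survivors = [b for i, b in enumerate(stripe) if i != lost_index]
    return xor_parity(survivors + [parity])


# Five data sectors covered by a single parity sector.
stripe = [bytes([n]) * 4 for n in range(1, 6)]
parity = xor_parity(stripe)
original_sector2 = stripe[1]

# While data and parity agree, a lost Sector 2 is rebuilt correctly.
consistent_rebuild = reconstruct(stripe, parity, lost_index=1)

# New data for Sector 1 reaches the disk, then a restart happens
# before the matching parity update is written.
stripe[0] = b"\xaa" * 4
stale_parity = parity

# Sector 2 was never the target of a write, yet losing it now
# yields contents that differ from what was stored there.
torn_rebuild = reconstruct(stripe, stale_parity, lost_index=1)
```

In this sketch consistent_rebuild equals the original Sector 2 and
torn_rebuild does not, even though only Sector 1 was being written.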

Linux kernel RAID-5 isn't one of the ones I've looked at; I presume you're 
saying it does have this problem.

> In principle, journalling filesystems need to know the "radius of
> corruption" to provide robust journalling.  If individual sector
> writes are atomic, this isn't an issue.  Some people think sector
> writes are atomic on modern hard drives (but I wouldn't count on it).
> But it is definitely not atomic when writing to a RAID or multipath if
> the write affects more than one device.

It would make a lot more sense to make the RAID block device driver
present a block device that can't corrupt data upon something as simple as
a restart in the middle of a write to an unrelated sector than to make
filesystem drivers comprehend a block device that can.  Less work, more
integrity.

Some have noted recently that block devices are really too simple to do 
some of the fancy storage things we'd like to do these days anyway, so 
another approach would be to integrate the RAID-5 function in the 
filesystem driver instead of attempting to have a RAID block device layer.

For now, I'll just try to remember not to use Linux kernel RAID-5.

>If individual sector writes are atomic, this isn't an issue.

True, however: atomic is sufficient, but not necessary.  In the real 
world, disk drive writes aren't atomic, and it's OK.  A journalling 
filesystem can deal with a failed write wiping out the previous contents 
of the subject sector.  It just can't deal with a failed write polluting 
some unrelated previously hardened sector.
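
As a rough illustration of that distinction, here is a Python sketch of a
toy journal.  The record layout (length plus CRC32 header) and the names
encode_record and replay are assumptions for the example, not the format
of any real filesystem journal.

```python
import struct
import zlib

HEADER = struct.Struct("<II")  # payload length, CRC32 of payload


def encode_record(payload: bytes) -> bytes:
    """Prefix a payload with its length and checksum."""
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def replay(journal: bytes) -> list:
    """Return the payloads of all intact records, stopping at the first damaged one."""
    records, offset = [], 0
    while offset + HEADER.size <= len(journal):
        length, crc = HEADER.unpack_from(journal, offset)
        start = offset + HEADER.size
        payload = journal[start:start + length]
        if len(payload) != length or zlib.crc32(payload) != crc:
            break
        records.append(payload)
        offset = start + length
    return records


good = encode_record(b"commit 1") + encode_record(b"commit 2")

# A failed write damages only the record that was being written.
torn = good + encode_record(b"commit 3")[:-3] + b"\x00\x00\x00"
after_torn_write = replay(torn)

# A failed write elsewhere pollutes a record that was already hardened.
polluted = bytearray(good)
polluted[HEADER.size] ^= 0xFF
after_pollution = replay(bytes(polluted))
```

In the first case replay still returns both earlier commits and simply
drops the incomplete one.  In the second case the damage lands in a record
the journal had already trusted, and the earlier commits are lost.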

--
Bryan Henderson                     IBM Almaden Research Center
San Jose CA                         Storage Systems


* Re: ordered I/O with multipath
  2009-04-09 18:32   ` Bryan Henderson
@ 2009-04-09 20:00     ` Jamie Lokier
  2009-04-10 16:42       ` Bryan Henderson
  0 siblings, 1 reply; 6+ messages in thread
From: Jamie Lokier @ 2009-04-09 20:00 UTC (permalink / raw)
  To: Bryan Henderson; +Cc: linux-fsdevel, 谢纲

Bryan Henderson wrote:
> > If the RAID code is changed to handle barriers, that would still have
> > possible "scattershot" corruption on RAID-5, because writing a single
> > sector on the logical device affects more than one visible sector if
> > it is interrupted.  In other words, the "radius of corruption" is
> > bigger than one sector for RAID-5, and it's not contiguous either.
> 
> I've seen several RAID-5 systems, and they all went to great lengths to 
> ensure that interrupting a write to Sector A can't destroy Sector B.  It 
> isn't easy; it involves journalling.  But I've always taken it as an 
> absolute requirement.

How do you do a second layer of journalling (in addition to the
filesystem's) without a big performance penalty for the extra seeks?

> I assume you're talking about a case where Sectors 1-5 are covered by a
> single parity sector and the RAID system restarts after it has written
> Sector 1 but before it has written the new parity.  Now if you lose
> Sector 2, you'll recover incorrect contents for it.
> 
> Linux kernel RAID-5 isn't one of the ones I've looked at; I presume you're 
> saying it does have this problem.

No, I'm assuming it has this problem because every description of
RAID-5 I've seen does not mention journalling or anything equivalent.

> > In principle, journalling filesystems need to know the "radius of
> > corruption" to provide robust journalling.  If individual sector
> > writes are atomic, this isn't an issue.  Some people think sector
> > writes are atomic on modern hard drives (but I wouldn't count on it).
> > But it is definitely not atomic when writing to a RAID or multipath if
> > the write affects more than one device.
> 
> It would make a lot more sense to make the RAID block device driver 
> present a block device that can't corrupt data upon something as simple as 
> a restart in the middle of write to an unrelated sector than to make 
> filesystem drivers comprehend a block device that can.  Less work, more 
> integrity.

A lot less performance?

> Some have noted recently that block devices are really too simple to do 
> some of the fancy storage things we'd like to do these days anyway, so 
> another approach would be to integrate the RAID-5 function in the 
> filesystem driver instead of attempting to have a RAID block device layer.

Like ZFS and BTRFS I guess.

This is why RAID ought to work better in the filesystem.
Two layers of journalling or equivalent does not sound good.

> For now, I'll just try to remember not to use Linux kernel RAID-5.

I've no idea if you should avoid it.  I'm making assumptions.

Other parts of Linux are a bit flaky on the issue of data integrity on
crashes though, and I/O barriers are not passed down through Linux
software RAID-5, so I'd be mighty surprised if it provides atomic writes.

> >If individual sector writes are atomic, this isn't an issue.
> 
> True, however: atomic is sufficient, but not necessary.  In the real 
> world, disk drive writes aren't atomic, and it's OK.  A journalling 
> filesystem can deal with a failed write wiping out the previous contents 
> of the subject sector.  It just can't deal with a failed write polluting 
> some unrelated previously hardened sector.

That's right.  But a failed write might corrupt previously
hardened sectors in these cases:

    - Disks with 4k sectors pretending to be 512 byte sectors.

    - RAIDs without journalling (or other equivalent) and no
      battery backup.

    - SSDs and other flash storage if their internal algorithms are stupid.

I've just noticed that a system crash is not the only way this type of
corruption can happen.

Does this argue for an additional parameter among the block device
hints, in addition to strip sizes and sector size:

   Radius of Corruption on Failed Write?

For hard disks, this is the sector size.  But for RAIDs and maybe some
flash storage, it might be larger.
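
A minimal Python sketch of how a filesystem might use such a hint,
assuming the radius is expressed in logical sectors and the damage stays
within one aligned, contiguous unit.  That assumption fits a 4k-sector
disk presenting 512-byte sectors (radius 8); it does not fit RAID-5,
where the affected sectors are not contiguous.  The function name
sectors_at_risk is hypothetical.

```python
def sectors_at_risk(first_sector: int, sector_count: int, radius: int) -> range:
    """Logical sectors that a failed write to the given range could damage.

    radius is the hypothetical "radius of corruption" hint, in logical
    sectors; the write range is widened to radius-aligned boundaries.
    """
    start = (first_sector // radius) * radius
    end = -(-(first_sector + sector_count) // radius) * radius  # round up
    return range(start, end)


# Plain disk where the radius equals one sector: only the target is at risk.
plain = sectors_at_risk(10, 1, radius=1)

# 4k physical sectors exposed as 512-byte logical sectors: radius of 8.
emulated = sectors_at_risk(10, 1, radius=8)

# A two-sector write that straddles a 4k boundary puts 16 sectors at risk.
straddling = sectors_at_risk(7, 2, radius=8)
```

With a radius of 1 the result is just sector 10; with a radius of 8 it
is sectors 8 through 15, seven of which were not being written.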

-- Jamie


* Re: ordered I/O with multipath
  2009-04-09 20:00     ` Jamie Lokier
@ 2009-04-10 16:42       ` Bryan Henderson
  0 siblings, 0 replies; 6+ messages in thread
From: Bryan Henderson @ 2009-04-10 16:42 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-fsdevel, 谢纲

> Bryan Henderson wrote:
> > > If the RAID code is changed to handle barriers, that would still have
> > > possible "scattershot" corruption on RAID-5, because writing a single
> > > sector on the logical device affects more than one visible sector if
> > > it is interrupted.  In other words, the "radius of corruption" is
> > > bigger than one sector for RAID-5, and it's not contiguous either.
> > 
> > I've seen several RAID-5 systems, and they all went to great lengths to
> > ensure that interrupting a write to Sector A can't destroy Sector B.  It
> > isn't easy; it involves journalling.  But I've always taken it as an
> > absolute requirement.
> 
> How do you do a second layer of journalling (in addition to the
> filesystem's) without a big performance penalty for the extra seeks?

The systems I know all have a means of storing data persistent across the 
kinds of restarts in question without seeking.  It's probably the only way 
to get great performance with data integrity.

But some things about Linux block device RAID-5 are coming back to me.  In 
the early implementations, if the system restarted without explicitly 
shutting down the array (as in a power failure), all of the parity in the 
array would be rebuilt.  Later, a "write intent bitmap" was added so it 
could rebuild substantially less than all of the parity.  That bitmap is
the journal I was talking about, and I don't know what, if anything, it
does to avoid a big performance penalty.
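
The general idea of a write intent bitmap can be sketched in a few lines
of Python.  This is only an in-memory model under my own assumptions (the
class and method names are invented, and a real array would have to get
the bit onto stable storage before issuing the data and parity writes);
it is not a description of the Linux implementation.

```python
from collections import Counter


class WriteIntentBitmap:
    """Track which regions of an array may have inconsistent parity."""

    def __init__(self, sectors_per_region: int):
        self.sectors_per_region = sectors_per_region
        self.in_flight = Counter()  # region number -> writes not yet complete

    def begin_write(self, sector: int) -> None:
        # Mark the region before the data and parity writes are issued.
        self.in_flight[sector // self.sectors_per_region] += 1

    def end_write(self, sector: int) -> None:
        # Clear the mark once data and parity are both on disk.
        region = sector // self.sectors_per_region
        self.in_flight[region] -= 1
        if self.in_flight[region] == 0:
            del self.in_flight[region]

    def regions_to_resync(self) -> list:
        # After an unclean restart, only these regions need parity rebuilt.
        return sorted(self.in_flight)


bitmap = WriteIntentBitmap(sectors_per_region=1024)
bitmap.begin_write(5)
bitmap.begin_write(5000)
bitmap.end_write(5)          # this write completed before the crash
pending = bitmap.regions_to_resync()
```

Here only region 4 (the one holding sector 5000) is left to resync,
rather than the whole array.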

> But a failed write might corrupt previously
> hardened sectors in these cases:
> 
>     - Disks with 4k sectors pretending to be 512 byte sectors.

AFAIK there are no such disks today and there is a big controversy over 
whether it's acceptable for such disks currently being designed to allow 
such corruption.

>     - RAIDs without journalling (or other equivalent) and no
>       battery backup.

I still don't know if anybody is doing that.

>     - SSDs and other flash storage if their internal algorithms are stupid.

I don't know if that's commonly accepted either.

--
Bryan Henderson                     IBM Almaden Research Center
San Jose CA                         Storage Systems

