* ordered I/O with multipath
From: 谢纲 @ 2009-04-08 4:59 UTC
To: linux-fsdevel

Hi,

Some journalling filesystems use barrier I/O to ensure the ordering of committed data. But if the filesystem sits on top of a volume manager that supports RAID and multipath, the barrier I/O might not be handled correctly. How do journalling filesystems deal with this?

--
Xie Gang
* Re: ordered I/O with multipath
From: Jamie Lokier @ 2009-04-08 14:30 UTC
To: 谢纲; +Cc: linux-fsdevel

谢纲 wrote:
> Some journal filesystem use barrier i/o to ensure the order of the
> committing data. But if the filesystem is on the top of volume manager
> which support the raid and multipath. The barrier i/o might not be
> handled correctly. How does journal filesystem deal with this?

For software RAID and multipath, I think it isn't handled at all.

Even if you disable write-caching in the underlying storage, ordered requests may not retain their order, so the common database advice to disable write-cache and use SCSI or SATA-NCQ may not work either.

If the RAID code is changed to handle barriers, that would still have possible "scattershot" corruption on RAID-5, because writing a single sector on the logical device affects more than one visible sector if it is interrupted. In other words, the "radius of corruption" is bigger than one sector for RAID-5, and it's not contiguous either.

In principle, journalling filesystems need to know the "radius of corruption" to provide robust journalling. If individual sector writes are atomic, this isn't an issue. Some people think sector writes are atomic on modern hard drives (but I wouldn't count on it). But it is definitely not atomic when writing to a RAID or multipath if the write affects more than one device.

-- Jamie
* Re: ordered I/O with multipath
From: 谢纲 @ 2009-04-08 14:53 UTC
To: Jamie Lokier; +Cc: linux-fsdevel

On Wed, Apr 8, 2009 at 10:30 PM, Jamie Lokier <jamie@shareable.org> wrote:
> 谢纲 wrote:
>> Some journal filesystem use barrier i/o to ensure the order of the
>> committing data. But if the filesystem is on the top of volume manager
>> which support the raid and multipath. The barrier i/o might not be
>> handled correctly. How does journal filesystem deal with this?
>
> For software RAID and multipath, I think it isn't handled at all.
>
> Even if you disable write-caching in the underlying storage, ordered
> requests may not retain their order, so the common database advice to
> disable write-cache and use SCSI or SATA-NCQ may not work either.
>
> If the RAID code is changed to handle barriers, that would still have
> possible "scattershot" corruption on RAID-5, because writing a single
> sector on the logical device affects more than one visible sector if
> it is interrupted. In other words, the "radius of corruption" is
> bigger than one sector for RAID-5, and it's not contiguous either.

If there is a volume manager that controls the RAID and understands the multipath, I think the barriers can be handled, because it has all the information about where the I/O goes. But handling all of this is very complicated. It is said that the Veritas volume manager can handle it; I don't know whether that is true. Given the Linux block layer, though, it is really hard to implement.

> In principle, journalling filesystems need to know the "radius of
> corruption" to provide robust journalling. If individual sector
> writes are atomic, this isn't an issue. Some people think sector
> writes are atomic on modern hard drives (but I wouldn't count on it).
> But it is definitely not atomic when writing to a RAID or multipath if
> the write affects more than one device.
>
> -- Jamie

--
Xie Gang
* Re: ordered I/O with multipath
From: Bryan Henderson @ 2009-04-09 18:32 UTC
To: Jamie Lokier; +Cc: linux-fsdevel, 谢纲

> If the RAID code is changed to handle barriers, that would still have
> possible "scattershot" corruption on RAID-5, because writing a single
> sector on the logical device affects more than one visible sector if
> it is interrupted. In other words, the "radius of corruption" is
> bigger than one sector for RAID-5, and it's not contiguous either.

I've seen several RAID-5 systems, and they all went to great lengths to ensure that interrupting a write to Sector A can't destroy Sector B. It isn't easy; it involves journalling. But I've always taken it as an absolute requirement.

I assume you're talking about something like the case where Sectors 1-5 are covered by a single parity sector and the RAID system restarts between the time it has written Sector 1 and the time it has written the new parity. Now if you lose Sector 2, you'll recover incorrect contents for it.

Linux kernel RAID-5 isn't one of the ones I've looked at; I presume you're saying it does have this problem.

> In principle, journalling filesystems need to know the "radius of
> corruption" to provide robust journalling. If individual sector
> writes are atomic, this isn't an issue. Some people think sector
> writes are atomic on modern hard drives (but I wouldn't count on it).
> But it is definitely not atomic when writing to a RAID or multipath if
> the write affects more than one device.

It would make a lot more sense to make the RAID block device driver present a block device that can't corrupt data upon something as simple as a restart in the middle of a write to an unrelated sector than to make filesystem drivers comprehend a block device that can. Less work, more integrity.

Some have noted recently that block devices are really too simple to do some of the fancy storage things we'd like to do these days anyway, so another approach would be to integrate the RAID-5 function in the filesystem driver instead of attempting to have a RAID block device layer.

For now, I'll just try to remember not to use Linux kernel RAID-5.

> If individual sector writes are atomic, this isn't an issue.

True, however: atomic is sufficient, but not necessary. In the real world, disk drive writes aren't atomic, and it's OK. A journalling filesystem can deal with a failed write wiping out the previous contents of the subject sector. It just can't deal with a failed write polluting some unrelated previously hardened sector.

--
Bryan Henderson                     IBM Almaden Research Center
San Jose CA                         Storage Systems
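The interrupted-update scenario described in this message (five data sectors sharing one parity sector, a restart after the data write but before the parity write, then the loss of a different sector) can be worked through with plain XOR parity. This is only a toy sketch: the 4-byte "sectors" and the helper names xor_blocks and reconstruct are made up for illustration and do not correspond to any real RAID implementation.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte strings together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def reconstruct(stripe, parity, lost):
    """Rebuild the data block at index `lost` from the survivors plus parity."""
    survivors = [blk for i, blk in enumerate(stripe) if i != lost]
    return xor_blocks(survivors + [parity])

# Five data sectors covered by a single parity sector (toy 4-byte sectors).
stripe = [bytes([i] * 4) for i in range(1, 6)]
parity = xor_blocks(stripe)

# Consistent stripe: losing Sector 2 (index 1) is recoverable.
clean_recovery = reconstruct(stripe, parity, lost=1)

# Interrupted update: new data for Sector 1 reaches disk, new parity does not.
stripe[0] = b"\xff" * 4
torn_recovery = reconstruct(stripe, parity, lost=1)

print(clean_recovery == bytes([2] * 4))  # True
print(torn_recovery == bytes([2] * 4))   # False
```

Sector 2 was never written, yet its reconstructed contents are wrong once the parity is stale, which is the sense in which the damage reaches an unrelated sector.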
* Re: ordered I/O with multipath
From: Jamie Lokier @ 2009-04-09 20:00 UTC
To: Bryan Henderson; +Cc: linux-fsdevel, 谢纲

Bryan Henderson wrote:
> > If the RAID code is changed to handle barriers, that would still have
> > possible "scattershot" corruption on RAID-5, because writing a single
> > sector on the logical device affects more than one visible sector if
> > it is interrupted. In other words, the "radius of corruption" is
> > bigger than one sector for RAID-5, and it's not contiguous either.
>
> I've seen several RAID-5 systems, and they all went to great lengths to
> ensure that interrupting a write to Sector A can't destroy Sector B. It
> isn't easy; it involves journalling. But I've always taken it as an
> absolute requirement.

How do you do a second layer of journalling (in addition to the filesystem's) without a big performance penalty for the extra seeks?

> I assume you're talking about something like where Sectors 1-5 are covered
> by a single parity sector and the RAID system restarts between it has
> written Sector 1 and when it has written the new parity. Now if you lose
> Sector 2, you'll recover incorrect contents for it.
>
> Linux kernel RAID-5 isn't one of the ones I've looked at; I presume you're
> saying it does have this problem.

No, I'm assuming it has this problem because every description of RAID-5 I've seen does not mention journalling or anything equivalent.

> > In principle, journalling filesystems need to know the "radius of
> > corruption" to provide robust journalling. If individual sector
> > writes are atomic, this isn't an issue. Some people think sector
> > writes are atomic on modern hard drives (but I wouldn't count on it).
> > But it is definitely not atomic when writing to a RAID or multipath if
> > the write affects more than one device.
>
> It would make a lot more sense to make the RAID block device driver
> present a block device that can't corrupt data upon something as simple as
> a restart in the middle of write to an unrelated sector than to make
> filesystem drivers comprehend a block device that can. Less work, more
> integrity.

A lot less performance?

> Some have noted recently that block devices are really too simple to do
> some of the fancy storage things we'd like to do these days anyway, so
> another approach would be to integrate the RAID-5 function in the
> filesystem driver instead of attempting to have a RAID block device layer.

Like ZFS and BTRFS I guess. This is why RAID ought to work better in the filesystem. Two layers of journalling or equivalent does not sound good.

> For now, I'll just try to remember not to use Linux kernel RAID-5.

I've no idea if you should avoid it. I'm making assumptions. Other parts of Linux are a bit flaky on the issue of data integrity on crashes though, and I/O barriers are not passed down through Linux software RAID-5, so I'd be mighty surprised if it provides atomic writes.

> >If individual sector writes are atomic, this isn't an issue.
>
> True, however: atomic is sufficient, but not necessary. In the real
> world, disk drive writes aren't atomic, and it's OK. A journalling
> filesystem can deal with a failed write wiping out the previous contents
> of the subject sector. It just can't deal with a failed write polluting
> some unrelated previously hardened sector.

That's right. But a failed write might corrupt previously hardened sectors in these cases:

- Disks with 4k sectors pretending to be 512 byte sectors.
- RAIDs without journalling (or other equivalent) and no battery backup.
- SSDs and other flash storage if their internal algorithms are stupid.

I've just noticed that a system crash is not the only way this type of corruption can happen.

Does this argue for an additional parameter among the block device hints, in addition to strip sizes and sector size: Radius of Corruption on Failed Write? For hard disks, this is the sector size. But for RAIDs and maybe some flash storage, it might be larger.

-- Jamie
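To make the proposed "radius of corruption" hint more concrete, here is a rough sketch of what such a hint might report for two of the cases listed in this message. Everything in it is an assumption made for illustration: the function names, the parameters, and the simplified RAID-5 layout (parity rotation ignored, one parity sector covering the sectors at the same offset in each data chunk of a stripe). No such interface is described in the thread.

```python
def affected_512e(sector, logical=512, physical=4096):
    """Logical sectors sharing a physical sector with `sector` on a disk
    that emulates 512-byte sectors on top of larger physical sectors."""
    per_phys = physical // logical
    first = (sector // per_phys) * per_phys
    return set(range(first, first + per_phys))

def affected_raid5(sector, chunk_sectors, data_disks):
    """Logical sectors protected by the same parity sector as `sector`
    in a simplified RAID-5 layout (hypothetical geometry)."""
    stripe_sectors = chunk_sectors * data_disks
    stripe_base = (sector // stripe_sectors) * stripe_sectors
    offset = sector % chunk_sectors  # position inside the chunk
    return {stripe_base + d * chunk_sectors + offset
            for d in range(data_disks)}

print(sorted(affected_512e(10)))                               # [8, 9, 10, 11, 12, 13, 14, 15]
print(sorted(affected_raid5(10, chunk_sectors=8, data_disks=4)))  # [2, 10, 18, 26]
```

In the emulated-sector case the set is contiguous, while in the RAID-5 case it is spread across the stripe, which matches the earlier remark that the radius is "not contiguous either". For RAID-5 the other sectors are only at risk if a device is subsequently lost while the parity is stale.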
* Re: ordered I/O with multipath
From: Bryan Henderson @ 2009-04-10 16:42 UTC
To: Jamie Lokier; +Cc: linux-fsdevel, 谢纲

> Bryan Henderson wrote:
> > > If the RAID code is changed to handle barriers, that would still have
> > > possible "scattershot" corruption on RAID-5, because writing a single
> > > sector on the logical device affects more than one visible sector if
> > > it is interrupted. In other words, the "radius of corruption" is
> > > bigger than one sector for RAID-5, and it's not contiguous either.
> >
> > I've seen several RAID-5 systems, and they all went to great lengths to
> > ensure that interrupting a write to Sector A can't destroy Sector B. It
> > isn't easy; it involves journalling. But I've always taken it as an
> > absolute requirement.
>
> How do you do a second layer of journalling (in addition to the
> filesystem's) without a big performance penalty for the extra seeks?

The systems I know all have a means of storing data persistent across the kinds of restarts in question without seeking. It's probably the only way to get great performance with data integrity.

But some things about Linux block device RAID-5 are coming back to me. In the early implementations, if the system restarted without explicitly shutting down the array (as in a power failure), all of the parity in the array would be rebuilt. Later, a "write intent bitmap" was added so it could rebuild substantially less than all of the parity. That bitmap is the journal I was talking about, and I don't know what if anything it does to avoid a big performance penalty.

> But an failed write might corrupt previously
> hardened sectors in these cases:
>
> - Disks with 4k sectors pretending to be 512 byte sectors.

AFAIK there are no such disks today and there is a big controversy over whether it's acceptable for such disks currently being designed to allow such corruption.

> - RAIDs without journalling (or other equivalent) and no
> battery backup.

I still don't know if anybody is doing that.

> - SSDs and other flash storage if their internal algorithms are stupid.

I don't know if that's commonly accepted either.

--
Bryan Henderson                     IBM Almaden Research Center
San Jose CA                         Storage Systems
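The idea behind a write intent bitmap, as recalled in this message, can be sketched as follows. This is an in-memory toy, not the Linux md implementation: the class name, the region size, and the per-region counter are all assumptions, and a real bitmap would have to be persisted to stable storage before the data write is issued.

```python
class WriteIntentBitmap:
    """Toy write-intent tracker: one entry per fixed-size region of the array."""

    def __init__(self, total_sectors, region_sectors):
        self.region_sectors = region_sectors
        nregions = -(-total_sectors // region_sectors)  # ceiling division
        self.pending = [0] * nregions  # in-flight writes per region

    def begin_write(self, sector):
        # Would be flushed to stable storage BEFORE data or parity is written.
        self.pending[sector // self.region_sectors] += 1

    def end_write(self, sector):
        # Called once both data and parity for this write are on disk.
        self.pending[sector // self.region_sectors] -= 1

    def regions_to_resync(self):
        # After an unclean restart, only these regions need a parity rebuild.
        return [i for i, n in enumerate(self.pending) if n > 0]

bitmap = WriteIntentBitmap(total_sectors=1000, region_sectors=100)
bitmap.begin_write(5)
bitmap.begin_write(250)
bitmap.end_write(5)            # this write completed before the crash
print(bitmap.regions_to_resync())  # [2]
```

Note that this records only where parity may be stale, not the data being written, so it narrows the rebuild after a restart rather than replaying the interrupted write.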