* mtdblock caching and syncing
From: Doug Graham @ 2009-04-09 14:15 UTC
To: linux-mtd

Hello,

I'm running a 2.6.24 kernel, but I've looked over the latest kernel
sources, and I don't see a fix for this (although I could have missed
something).

The problem is that a sync() or fsync() on an mtdblock device does not
actually get the data all the way to the flash device. The mtdblock
layer maintains its own cache of a single erase-unit (256KB in my case).
If I open /dev/mtdblock0 for writing, write some stuff to it, then call
fsync() but do not close the device, up to one erase-unit's worth of
data may still be buffered in memory. This data is only flushed when
the device is actually closed (by mtdblock_release). I think that
this violates the intended semantics of sync and fsync. I shouldn't be
required to do a close() to force the data to the device.

Another scenario is to open an mtdblock device, write to it, then call
close() without calling fsync(). In this case, the data may not yet be
flushed to the mtdblock layer by the time close() is called, so nothing
gets written to flash during the close(). That's all reasonable so far,
but if I then type "sync" from the shell (or wait for 30 seconds for
pdflush to do its thing), all buffered data *should* be flushed to the
actual device. As it stands now, the sync will flush data as far as
the mtdblock layer, but that data may then get buffered in the mtdblock
layer forever. It only gets flushed to the device on a close(), and if
nobody calls close(), it stays there forever.

I think this is a fairly serious bug in a flash-based system, where there
are frequently times that you want to make sure that data has actually
made it all the way to the device. I think that a sync() or fsync()
really ought to somehow propagate all the way down to the mtdblock layer
so that mtdblock can flush its buffer.

Thoughts? Suggestions? Patches?

Thanks,
Doug.
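
The caching behaviour described in this message can be modelled in user space. What follows is a minimal sketch, not the kernel's mtdblock code: every name in it (`struct eb_cache`, `cache_write`, `cache_flush`, the 16-byte `ERASE_SIZE`) is a hypothetical stand-in, and the "flash" is just an array. It only illustrates why a partial write to one erase block can stay in memory until something explicitly flushes the cache or a write to a different block evicts it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model only: sizes and names are illustrative assumptions. */
#define ERASE_SIZE 16   /* stands in for a 256KB erase unit */
#define NUM_BLOCKS 4

struct eb_cache {
    unsigned char flash[NUM_BLOCKS * ERASE_SIZE]; /* simulated flash array */
    unsigned char buf[ERASE_SIZE];                /* the single cached erase block */
    int cached_block;                             /* -1 while nothing is cached */
    int dirty;                                    /* buf differs from flash */
};

static void cache_init(struct eb_cache *c)
{
    memset(c->flash, 0xff, sizeof(c->flash)); /* erased flash reads as 0xff */
    memset(c->buf, 0xff, sizeof(c->buf));
    c->cached_block = -1;
    c->dirty = 0;
}

/* Write the cached erase block back to the simulated flash, if it is dirty. */
static void cache_flush(struct eb_cache *c)
{
    if (c->cached_block >= 0 && c->dirty) {
        memcpy(c->flash + (size_t)c->cached_block * ERASE_SIZE,
               c->buf, ERASE_SIZE);
        c->dirty = 0;
    }
}

/* Write len bytes at pos; the write must stay inside one erase block. */
static void cache_write(struct eb_cache *c, size_t pos,
                        const void *data, size_t len)
{
    int block = (int)(pos / ERASE_SIZE);
    size_t off = pos % ERASE_SIZE;

    assert(block < NUM_BLOCKS && off + len <= ERASE_SIZE);

    if (block != c->cached_block) {
        cache_flush(c); /* touching another block evicts the old one */
        memcpy(c->buf, c->flash + (size_t)block * ERASE_SIZE, ERASE_SIZE);
        c->cached_block = block;
    }
    memcpy(c->buf + off, data, len);
    c->dirty = 1; /* stays set until cache_flush() runs */
}
```

In this model, a caller that writes a few bytes and never calls `cache_flush()` leaves the simulated flash unchanged indefinitely, which is the situation the message complains about when nothing closes the device.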

* Re: mtdblock caching and syncing
From: Josh Boyer @ 2009-04-09 14:51 UTC
To: Doug Graham; +Cc: linux-mtd

On Thu, Apr 09, 2009 at 10:15:56AM -0400, Doug Graham wrote:
>Hello,
>
>I'm running a 2.6.24 kernel, but I've looked over the latest kernel
>sources, and I don't see a fix for this (although I could have missed
>something).
>
>The problem is that a sync() or fsync() on an mtdblock device does not
>actually get the data all the way to the flash device. The mtdblock
>layer maintains its own cache of a single erase-unit (256KB in my case).
>If I open /dev/mtdblock0 for writing, write some stuff to it, then call
>fsync() but do not close the device, up to one erase-unit's worth of
>data may still be buffered in memory. This data is only flushed when
>the device is actually closed (by mtdblock_release). I think that
>this violates the intended semantics of sync and fsync. I shouldn't be
>required to do a close() to force the data to the device.

The device in question isn't the flash. It's the mtdblock device. So
fsync semantics are preserved. This is the same as writing to a file
on a hard drive, calling fsync, and having it sit in the hard drive's
cache.

>Another scenario is to open an mtdblock device, write to it, then call
>close() without calling fsync(). In this case, the data may not yet be
>flushed to the mtdblock layer by the time close() is called, so nothing
>gets written to flash during the close(). That's all reasonable so far,
>but if I then type "sync" from the shell (or wait for 30 seconds for
>pdflush to do its thing), all buffered data *should* be flushed to the
>actual device. As it stands now, the sync will flush data as far as
>the mtdblock layer, but that data may then get buffered in the mtdblock
>layer forever. It only gets flushed to the device on a close(), and if
>nobody calls close(), it stays there forever.
>
>I think this is a fairly serious bug in a flash-based system, where there
>are frequently times that you want to make sure that data has actually
>made it all the way to the device. I think that a sync() or fsync()
>really ought to somehow propagate all the way down to the mtdblock layer
>so that mtdblock can flush its buffer.

Why are you using mtdblock in a serious flash-based system? The fact
that it buffers an entire eraseblock means you risk huge data loss in
the event of an unclean shutdown anyway (power loss). No amount of
sync or fsync will fix that.

>Thoughts? Suggestions? Patches?

Word-weaseling aside, if you have patches that fix the behavior you don't
like, they would certainly be looked at. Setting pdflush to 5 seconds
instead of 30 would help a bit, or using the ioctl on the mtdblock device
that already exists to flush would help too. However you might want to
really look at a system design that relies on mtdblock for data integrity.

josh
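
The reply above mentions an existing flush ioctl but does not name it. The sketch below assumes it is `BLKFLSBUF`, the generic Linux "flush buffers" block ioctl; that identification is an assumption, not something stated in the thread. The helper names `flush_block_device_fd` and `flush_block_device` are hypothetical. The call normally requires root privileges (CAP_SYS_ADMIN).

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>   /* BLKFLSBUF */

/*
 * Flush an already-open block device without closing it.
 * Assumption: BLKFLSBUF is the ioctl the message refers to.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int flush_block_device_fd(int fd)
{
    /* First push dirty page-cache data down to the block driver. */
    if (fsync(fd) != 0)
        return -1;
    /* Then ask the driver to write out whatever it buffers itself. */
    return ioctl(fd, BLKFLSBUF, 0);
}

/*
 * Same thing for a separate helper process that does not own the
 * long-lived descriptor (for example, a server keeps the device open).
 */
static int flush_block_device(const char *path)
{
    int fd, ret;

    assert(path != NULL);
    fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;
    ret = flush_block_device_fd(fd);
    close(fd);
    return ret;
}
```

A caller would use something like `flush_block_device("/dev/mtdblock0")` after the equivalent of dd, and check the return value. The `blockdev --flushbufs` command from util-linux issues the same ioctl from the shell.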

* Re: mtdblock caching and syncing
From: Doug Graham @ 2009-04-09 16:02 UTC
To: Josh Boyer; +Cc: linux-mtd

On Thu, Apr 09, 2009 at 10:51:00AM -0400, Josh Boyer wrote:
> On Thu, Apr 09, 2009 at 10:15:56AM -0400, Doug Graham wrote:
> >
> >The problem is that a sync() or fsync() on an mtdblock device does not
> >actually get the data all the way to the flash device. The mtdblock
> >layer maintains its own cache of a single erase-unit (256KB in my case).
> >If I open /dev/mtdblock0 for writing, write some stuff to it, then call
> >fsync() but do not close the device, up to one erase-unit's worth of
> >data may still be buffered in memory. This data is only flushed when
> >the device is actually closed (by mtdblock_release). I think that
> >this violates the intended semantics of sync and fsync. I shouldn't be
> >required to do a close() to force the data to the device.
>
> The device in question isn't the flash. It's the mtdblock device. So
> fsync semantics are preserved. This is the same as writing to a file
> on a hard drive, calling fsync, and having it sit in the hard drive's
> cache.

That's a good point, and one I've wondered about before. I don't know
much about how hard drives manage their cache, but I would assume that
they don't leave dirty data in their cache for an unbounded period
of time. I'd guess that data is written to the actual disk within a
few 10s of milliseconds after being sent to the device.
In the case of mtdblock, dirty data can stay in the cache forever.

> >I think this is a fairly serious bug in a flash-based system, where there
> >are frequently times that you want to make sure that data has actually
> >made it all the way to the device. I think that a sync() or fsync()
> >really ought to somehow propagate all the way down to the mtdblock layer
> >so that mtdblock can flush its buffer.
>
> Why are you using mtdblock in a serious flash-based system? The fact
> that it buffers an entire eraseblock means you risk huge data loss in
> the event of an unclean shutdown anyway (power loss). No amount of
> sync or fsync will fix that.

We don't use mtdblock during normal operations; we use squashfs and jffs2
(maybe ubifs sometime soon). But one job that we do use mtdblock for is
burning loads. We could, and perhaps should, be using the char device
instead to burn loads, except that those require specialized tools to do
erases before writes. To avoid the need for such specialized tools, we
just use the equivalent of dd on the mtdblock device followed by a sync.
But that doesn't work given the behaviour I'm complaining about.

It's actually a little more complicated than that. We have a system
comprised of multiple cards. When upgrading the system from the master
card, we're using NBD to upgrade (some) loads on remote cards. The NBD
server running on the remote cards never closes the mtdblock device that
it is managing, so the mtdblock_release() method never gets called.
The NBD server cannot use the MTD character device because it knows
nothing about the characteristics of flash, including the need to erase
before writing. Even if it did know about erasing, we'd want it to do
exactly the same kind of caching the mtdblock already does, so mtdblock
does seem like a good match in this case. We can certainly modify the
NBD server to close and reopen the device when it needs to be sure that
data has actually been written to flash, but that seems a bit on the
kludgy side, and doesn't help any other applications using mtdblock
(like the dd scheme I mention above).

> >Thoughts? Suggestions? Patches?
>
> Word-weaseling aside, if you have patches that fix the behavior you don't
> like, they would certainly be looked at. Setting pdflush to 5 seconds
> instead of 30 would help a bit, or using the ioctl on the mtdblock device
> that already exists to flush would help too. However you might want to
> really look at a system design that relies on mtdblock for data integrity.

What's the point of mtdblock then? All systems care about data integrity
to some degree (some more than others, obviously), so if mtdblock makes
no effort to preserve that integrity, where do you see it ever being
used legitimately?

Thanks very much for your comments.

--Doug.

* Re: mtdblock caching and syncing
From: Josh Boyer @ 2009-04-09 17:16 UTC
To: Doug Graham; +Cc: linux-mtd

On Thu, Apr 09, 2009 at 12:02:47PM -0400, Doug Graham wrote:
>On Thu, Apr 09, 2009 at 10:51:00AM -0400, Josh Boyer wrote:
>> On Thu, Apr 09, 2009 at 10:15:56AM -0400, Doug Graham wrote:
>> >
>> >The problem is that a sync() or fsync() on an mtdblock device does not
>> >actually get the data all the way to the flash device. The mtdblock
>> >layer maintains its own cache of a single erase-unit (256KB in my case).
>> >If I open /dev/mtdblock0 for writing, write some stuff to it, then call
>> >fsync() but do not close the device, up to one erase-unit's worth of
>> >data may still be buffered in memory. This data is only flushed when
>> >the device is actually closed (by mtdblock_release). I think that
>> >this violates the intended semantics of sync and fsync. I shouldn't be
>> >required to do a close() to force the data to the device.
>>
>> The device in question isn't the flash. It's the mtdblock device. So
>> fsync semantics are preserved. This is the same as writing to a file
>> on a hard drive, calling fsync, and having it sit in the hard drive's
>> cache.
>
>That's a good point, and one I've wondered about before. I don't know
>much about how hard drives manage their cache, but I would assume that
>they don't leave dirty data in their cache for an unbounded period
>of time. I'd guess that data is written to the actual disk within a
>few 10s of milliseconds after being sent to the device.

Right.

>In the case of mtdblock, dirty data can stay in the cache forever.

True. And I agree that is entirely sub-optimal.

>> >I think this is a fairly serious bug in a flash-based system, where there
>> >are frequently times that you want to make sure that data has actually
>> >made it all the way to the device. I think that a sync() or fsync()
>> >really ought to somehow propagate all the way down to the mtdblock layer
>> >so that mtdblock can flush its buffer.
>>
>> Why are you using mtdblock in a serious flash-based system? The fact
>> that it buffers an entire eraseblock means you risk huge data loss in
>> the event of an unclean shutdown anyway (power loss). No amount of
>> sync or fsync will fix that.
>
>We don't use mtdblock during normal operations; we use squashfs and jffs2
>(maybe ubifs sometime soon). But one job that we do use mtdblock for is
>burning loads. We could, and perhaps should, be using the char device
>instead to burn loads, except that those require specialized tools to do
>erases before writes. To avoid the need for such specialized tools, we
>just use the equivalent of dd on the mtdblock device followed by a sync.
>But that doesn't work given the behaviour I'm complaining about.
>
>It's actually a little more complicated than that. We have a system
>comprised of multiple cards. When upgrading the system from the master
>card, we're using NBD to upgrade (some) loads on remote cards. The NBD
>server running on the remote cards never closes the mtdblock device that
>it is managing, so the mtdblock_release() method never gets called.
>The NBD server cannot use the MTD character device because it knows
>nothing about the characteristics of flash, including the need to erase
>before writing. Even if it did know about erasing, we'd want it to do
>exactly the same kind of caching the mtdblock already does, so mtdblock
>does seem like a good match in this case. We can certainly modify the
>NBD server to close and reopen the device when it needs to be sure that
>data has actually been written to flash, but that seems a bit on the
>kludgy side, and doesn't help any other applications using mtdblock
>(like the dd scheme I mention above).
>
>> >Thoughts? Suggestions? Patches?
>>
>> Word-weaseling aside, if you have patches that fix the behavior you don't
>> like, they would certainly be looked at. Setting pdflush to 5 seconds
>> instead of 30 would help a bit, or using the ioctl on the mtdblock device
>> that already exists to flush would help too. However you might want to
>> really look at a system design that relies on mtdblock for data integrity.
>
>What's the point of mtdblock then? All systems care about data integrity
>to some degree (some more than others, obviously), so if mtdblock makes
>no effort to preserve that integrity, where do you see it ever being
>used legitimately?

For cramfs, which is read-only. Or in cases sort of like what you
describe, where the conditions of writes are tightly controlled and
failure does not produce a bricked device that a customer is going to
be grumpy about.

josh

* Re: mtdblock caching and syncing
From: Jamie Lokier @ 2009-04-09 18:07 UTC
To: Doug Graham; +Cc: Josh Boyer, linux-mtd

Doug Graham wrote:
> > The device in question isn't the flash. It's the mtdblock device. So
> > fsync semantics are preserved. This is the same as writing to a file
> > on a hard drive, calling fsync, and having it sit in the hard drive's
> > cache.
>
> That's a good point, and one I've wondered about before. I don't know
> much about how hard drives manage their cache, but I would assume that
> they don't leave dirty data in their cache for an unbounded period
> of time. I'd guess that data is written to the actual disk within a
> few 10s of milliseconds after being sent to the device.

Ideally, fsync() should flush data from a hard drive's cache too.

-- Jamie

* Re: mtdblock caching and syncing
From: Doug Graham @ 2009-04-09 19:24 UTC
To: Jamie Lokier; +Cc: Josh Boyer, linux-mtd

On Thu, Apr 09, 2009 at 07:07:24PM +0100, Jamie Lokier wrote:
> Doug Graham wrote:
> > > The device in question isn't the flash. It's the mtdblock device. So
> > > fsync semantics are preserved. This is the same as writing to a file
> > > on a hard drive, calling fsync, and having it sit in the hard drive's
> > > cache.
> >
> > That's a good point, and one I've wondered about before. I don't know
> > much about how hard drives manage their cache, but I would assume that
> > they don't leave dirty data in their cache for an unbounded period
> > of time. I'd guess that data is written to the actual disk within a
> > few 10s of milliseconds after being sent to the device.
>
> Ideally, fsync() should flush data from a hard drive's cache too.

That seems ideal to me too, and if Linux had the framework to support
that, the solution to the mtdblock problem would be trivial. I'd guess
that the cleanest way to do this would be to add a sync() method to
block_device_operations.

--Doug.
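
The idea of a per-driver sync hook can be illustrated with a small user-space model. This is a sketch under loud assumptions: `struct blk_ops`, `struct blk_dev`, `blk_sync` and the toy driver are all hypothetical names invented here, and none of this is the kernel's actual `block_device_operations` or a proposed patch. It only shows the shape of the design: the generic sync path calls an optional driver callback, and drivers without a private cache leave it unset.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a block driver's operations table. */
struct blk_ops {
    int (*sync)(void *dev); /* optional hook; NULL if the driver caches nothing */
};

struct blk_dev {
    const struct blk_ops *ops;
    void *priv;             /* driver-private state */
};

/*
 * Stand-in for the generic sync path. Writing back the page cache is
 * omitted; afterwards the driver's own hook is called if it exists.
 */
static int blk_sync(struct blk_dev *bdev)
{
    assert(bdev != NULL);
    if (bdev->ops != NULL && bdev->ops->sync != NULL)
        return bdev->ops->sync(bdev->priv);
    return 0; /* nothing below the page cache to flush */
}

/* Toy driver with a private write-back cache, like the one discussed. */
struct toy_mtdblk {
    int dirty;   /* cached erase block has unwritten data */
    int flushes; /* number of simulated erase+program cycles */
};

static int toy_mtdblk_sync(void *dev)
{
    struct toy_mtdblk *m = dev;

    if (m->dirty) {
        m->dirty = 0; /* pretend the cached block was written to flash */
        m->flushes++;
    }
    return 0;
}

static const struct blk_ops toy_mtdblk_ops = { .sync = toy_mtdblk_sync };
```

With a hook like this, a sync on a clean device costs nothing extra, while a dirty cache is written out exactly once per sync rather than waiting for the last close.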

* Re: mtdblock caching and syncing
From: Miles Nordin @ 2009-04-23 18:21 UTC
To: linux-mtd

>>>>> "dg" == Doug Graham <dgraham@nortel.com> writes:

    dg> I would assume that [hard drives] don't leave dirty data in
    dg> their cache for an unbounded period of time.

ZFS always, ext3 mounted with barrier=1 (or by default in SLES only)
and only when not using LVM2, and HFS+ on Mac OS X when using their
bullshit fcntl F_FULLFSYNC API, all at least claim to issue SYNC CACHE
commands to the drives and wait for the drive to write to platter.
It's also possible for people who care to turn off the write cache in
their drives. For drives connected to RAID controllers that have a
battery-backed write cache, I've heard of database-oriented sysadmins
who actually do this as a best-practice not a freak experiment---in
general, at least for SCSI drives, DBAs are likely to be aware of and
control their drives' write cache settings. There is some FUD about
drives existing that ignore the SYNC CACHE command (but not SCSI
drives), but so far the people I've seen spreading this FUD refuse to
name the specific drives, so they are likely to be really old drives,
or even just made-up excuses from developers of buggy filesystems and
block layers.

Anyway, the point is, mtdblock is behind the curve here, yes, and for
hard drive-backed filesystems the situation will probably get even
better soon.

And finally, devices which use FLASH tend to get their cords yanked
more often so IMHO it matters.
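
For readers unfamiliar with the Mac OS X call mentioned above, here is a minimal sketch of how an application might request the strongest flush the platform offers. The helper name `full_sync` is hypothetical. `F_FULLFSYNC` is only defined on Apple platforms; elsewhere the sketch falls back to plain `fsync()`, and whether that reaches the drive's platter depends on the filesystem and mount options discussed in the message.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Flush fd as far toward stable storage as the platform allows.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int full_sync(int fd)
{
#ifdef F_FULLFSYNC
    /* Apple platforms: also ask the drive to flush its write cache. */
    if (fcntl(fd, F_FULLFSYNC) == 0)
        return 0;
    /* Some filesystems reject F_FULLFSYNC; fall through to fsync(). */
#endif
    return fsync(fd);
}
```

On Linux this compiles down to a plain `fsync()`, so it does not by itself address the mtdblock cache discussed in this thread.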

* Re: mtdblock caching and syncing
From: Jamie Lokier @ 2009-04-24 15:35 UTC
To: Miles Nordin; +Cc: linux-mtd

Miles Nordin wrote:
> >>>>> "dg" == Doug Graham <dgraham@nortel.com> writes:
>
>     dg> I would assume that [hard drives] don't leave dirty data in
>     dg> their cache for an unbounded period of time.
>
> ZFS always, ext3 mounted with barrier=1 (or by default in SLES only)
> and only when not using LVM2, and HFS+ on Mac OS X when using their
> bullshit fcntl F_FULLFSYNC API, all at least claim to issue SYNC CACHE
> commands to the drives and wait for the drive to write to platter.
> It's also possible for people who care to turn off the write cache in
> their drives. For drives connected to RAID controllers that have a
> battery-backed write cache, I've heard of database-oriented sysadmins
> who actually do this as a best-practice not a freak experiment---in
> general, at least for SCSI drives, DBAs are likely to be aware of and
> control their drives' write cache settings. There is some FUD about
> drives existing that ignore the SYNC CACHE command (but not SCSI
> drives), but so far the people I've seen spreading this FUD refuse to
> name the specific drives, so they are likely to be really old drives,
> or even just made-up excuses from developers of buggy filesystems and
> block layers.

Yes, that is all true.

> Anyway, the point is, mtdblock is behind the curve here, yes, and for
> hard drive-backed filesystems the situation will probably get even
> better soon.

There's a lot of discussion about fsync and barriers recently, so yes.
Same for SSDs.

> And finally, devices which use FLASH tend to get their cords yanked
> more often so IMHO it matters.

That is the single most important reason!

-- Jamie