* mtdblock caching and syncing
@ 2009-04-09 14:15 Doug Graham
2009-04-09 14:51 ` Josh Boyer
0 siblings, 1 reply; 8+ messages in thread
From: Doug Graham @ 2009-04-09 14:15 UTC (permalink / raw)
To: linux-mtd
Hello,
I'm running a 2.6.24 kernel, but I've looked over the latest kernel
sources, and I don't see a fix for this (although I could have missed
something).
The problem is that a sync() or fsync() on an mtdblock device does not
actually get the data all the way to the flash device. The mtdblock
layer maintains its own cache of a single erase-unit (256KB in my case).
If I open /dev/mtdblock0 for writing, write some stuff to it, then call
fsync() but do not close the device, up to one erase-unit's worth of
data may still be buffered in memory. This data is only flushed when
the device is actually closed (by mtdblock_release). I think that
this violates the intended semantics of sync and fsync. I shouldn't be
required to do a close() to force the data to the device.
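
A minimal sketch of the scenario above, not code from the original post. The helper name write_and_fsync() and the choice to hand back the still-open descriptor are assumptions made for illustration. It opens a device node, writes a buffer, calls fsync(), and deliberately does not close(). On an mtdblock device behaving as described here, a successful return would not by itself show that the last partial erase unit has reached flash.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write len bytes from buf to path, then fsync().
 * Returns the still-open file descriptor on success, -1 on failure.
 * The caller decides when (or whether) to close() it.
 */
int write_and_fsync(const char *path, const void *buf, size_t len)
{
    const char *p = buf;
    int fd = open(path, O_WRONLY);

    if (fd < 0)
        return -1;

    while (len > 0) {
        ssize_t n = write(fd, p, len);

        if (n < 0 && errno == EINTR)
            continue;               /* interrupted, try again */
        if (n <= 0) {               /* real error or no progress */
            close(fd);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }

    /* Pushes dirty pages down to the block device, no further. */
    if (fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    return fd;                      /* intentionally left open */
}
```

Calling write_and_fsync("/dev/mtdblock0", data, size) and then never closing the returned descriptor would reproduce the case described: fsync() reports success while mtdblock may still hold data in its cache.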
Another scenario is to open an mtdblock device, write to it, then call
close() without calling fsync(). In this case, the data may not yet be
flushed to the mtdblock layer by the time close() is called, so nothing
gets written to flash during the close(). That's all reasonable so far,
but if I then type "sync" from the shell (or wait for 30 seconds for
pdflush to do its thing), all buffered data *should* be flushed to the
actual device. As it stands now, the sync will flush data as far as
the mtdblock layer, but that data may then get buffered in the mtdblock
layer forever. It only gets flushed to the device on a close(), and if
nobody calls close(), it stays there forever.
I think this is a fairly serious bug in a flash-based system, where there
are frequently times that you want to make sure that data has actually
made it all the way to the device. I think that a sync() or fsync()
really ought to somehow propagate all the way down to the mtdblock layer
so that mtdblock can flush its buffer.
Thoughts? Suggestions? Patches?
Thanks,
Doug.
* Re: mtdblock caching and syncing
2009-04-09 14:15 mtdblock caching and syncing Doug Graham
@ 2009-04-09 14:51 ` Josh Boyer
2009-04-09 16:02 ` Doug Graham
0 siblings, 1 reply; 8+ messages in thread
From: Josh Boyer @ 2009-04-09 14:51 UTC (permalink / raw)
To: Doug Graham; +Cc: linux-mtd
On Thu, Apr 09, 2009 at 10:15:56AM -0400, Doug Graham wrote:
>Hello,
>
>I'm running a 2.6.24 kernel, but I've looked over the latest kernel
>sources, and I don't see a fix for this (although I could have missed
>something).
>
>The problem is that a sync() or fsync() on an mtdblock device does not
>actually get the data all the way to the flash device. The mtdblock
>layer maintains its own cache of a single erase-unit (256KB in my case).
>If I open /dev/mtdblock0 for writing, write some stuff to it, then call
>fsync() but do not close the device, up to one erase-unit's worth of
>data may still be buffered in memory. This data is only flushed when
>the device is actually closed (by mtdblock_release). I think that
>this violates the intended semantics of sync and fsync. I shouldn't be
>required to do a close() to force the data to the device.
The device in question isn't the flash. It's the mtdblock device. So
fsync semantics are preserved. This is the same as writing to a file
on a hard drive, calling fsync, and having it sit in the hard drive's
cache.
>Another scenario is to open an mtdblock device, write to it, then call
>close() without calling fsync(). In this case, the data may not yet be
>flushed to the mtdblock layer by the time close() is called, so nothing
>gets written to flash during the close(). That's all reasonable so far,
>but if I then type "sync" from the shell (or wait for 30 seconds for
>pdflush to do its thing), all buffered data *should* be flushed to the
>actual device. As it stands now, the sync will flush data as far as
>the mtdblock layer, but that data may then get buffered in the mtdblock
>layer forever. It only gets flushed to the device on a close(), and if
>nobody calls close(), it stays there forever.
>
>I think this is a fairly serious bug in a flash-based system, where there
>are frequently times that you want to make sure that data has actually
>made it all the way to the device. I think that a sync() or fsync()
>really ought to somehow propagate all the way down to the mtdblock layer
>so that mtdblock can flush its buffer.
Why are you using mtdblock in a serious flash-based system? The fact
that it buffers an entire eraseblock means you risk huge data loss in
the event of an unclean shutdown anyway (power loss). No amount of
sync or fsync will fix that.
>Thoughts? Suggestions? Patches?
Word-weaseling aside, if you have patches that fix the behavior you don't
like, they would certainly be looked at. Setting pdflush to 5 seconds
instead of 30 would help a bit, or using the ioctl on the mtdblock device
that already exists to flush would help too. However you might want to
really look at a system design that relies on mtdblock for data integrity.
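
The flush ioctl mentioned above is not named in this message. The sketch below assumes it is BLKFLSBUF from <linux/fs.h>, which, to my understanding, the MTD block layer of that era passes on to the mtdblock cache flush. The helper name flush_block_device() is hypothetical. It calls fsync() first so that dirty pages reach mtdblock before the driver is asked to write its cache out; BLKFLSBUF generally needs root privileges.

```c
#include <linux/fs.h>     /* BLKFLSBUF */
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Hypothetical helper: flush an open block device as far as the driver
 * allows. Returns 0 on success, -1 on failure (errno is set by the
 * failing call).
 */
int flush_block_device(int fd)
{
    /* Step 1: push dirty page-cache data down to the block driver. */
    if (fsync(fd) != 0)
        return -1;

    /*
     * Step 2: ask the driver to flush its own buffers. Assumption:
     * this is the existing ioctl referred to in the message above.
     */
    if (ioctl(fd, BLKFLSBUF, 0) != 0)
        return -1;

    return 0;
}
```

With something like this, an application that keeps /dev/mtdblock0 open could request a flush without closing the device, at the cost of being Linux-specific.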
josh
* Re: mtdblock caching and syncing
2009-04-09 14:51 ` Josh Boyer
@ 2009-04-09 16:02 ` Doug Graham
2009-04-09 17:16 ` Josh Boyer
` (2 more replies)
0 siblings, 3 replies; 8+ messages in thread
From: Doug Graham @ 2009-04-09 16:02 UTC (permalink / raw)
To: Josh Boyer; +Cc: linux-mtd
On Thu, Apr 09, 2009 at 10:51:00AM -0400, Josh Boyer wrote:
> On Thu, Apr 09, 2009 at 10:15:56AM -0400, Doug Graham wrote:
> >
> >The problem is that a sync() or fsync() on an mtdblock device does not
> >actually get the data all the way to the flash device. The mtdblock
> >layer maintains its own cache of a single erase-unit (256KB in my case).
> >If I open /dev/mtdblock0 for writing, write some stuff to it, then call
> >fsync() but do not close the device, up to one erase-unit's worth of
> >data may still be buffered in memory. This data is only flushed when
> >the device is actually closed (by mtdblock_release). I think that
> >this violates the intended semantics of sync and fsync. I shouldn't be
> >required to do a close() to force the data to the device.
>
> The device in question isn't the flash. It's the mtdblock device. So
> fsync semantics are preserved. This is the same as writing to a file
> on a hard drive, calling fsync, and having it sit in the hard drive's
> cache.
That's a good point, and one I've wondered about before. I don't know
much about how hard drives manage their cache, but I would assume that
they don't leave dirty data in their cache for an unbounded period
of time. I'd guess that data is written to the actual disk within a
few 10s of milliseconds after being sent to the device.
In the case of mtdblock, dirty data can stay in the cache forever.
> >I think this is a fairly serious bug in a flash-based system, where there
> >are frequently times that you want to make sure that data has actually
> >made it all the way to the device. I think that a sync() or fsync()
> >really ought to somehow propagate all the way down to the mtdblock layer
> >so that mtdblock can flush its buffer.
>
> Why are you using mtdblock in a serious flash-based system? The fact
> that it buffers an entire eraseblock means you risk huge data loss in
> the event of an unclean shutdown anyway (power loss). No amount of
> sync or fsync will fix that.
We don't use mtdblock during normal operations; we use squashfs and jffs2
(maybe ubifs sometime soon). But one job that we do use mtdblock for is
burning loads. We could, and perhaps should, be using the char device
instead to burn loads, except that those require specialized tools to do
erases before writes. To avoid the need for such specialized tools, we
just use the equivalent of dd on the mtdblock device followed by a sync.
But that doesn't work given the behaviour I'm complaining about.
It's actually a little more complicated than that. We have a system
comprised of multiple cards. When upgrading the system from the master
card, we're using NBD to upgrade (some) loads on remote cards. The NBD
server running on the remote cards never closes the mtdblock device that
it is managing, so the mtdblock_release() method never gets called.
The NBD server cannot use the MTD character device because it knows
nothing about the characteristics of flash, including the need to erase
before writing. Even if it did know about erasing, we'd want it to do
exactly the same kind of caching the mtdblock already does, so mtdblock
does seem like a good match in this case. We can certainly modify the
NBD server to close and reopen the device when it needs to be sure that
data has actually been written to flash, but that seems a bit on the
kludgy side, and doesn't help any other applications using mtdblock
(like the dd scheme I mention above).
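
For illustration only, the close-and-reopen workaround described above might look roughly like this. The helper name reopen_to_flush() and its signature are assumptions, not part of any NBD server. It relies entirely on the behaviour reported in this thread, namely that mtdblock writes its cache to flash when the device is released.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: force a flush by closing and reopening the device.
 * Returns a new descriptor for path on success, -1 on failure.
 * If fsync() fails, the original descriptor is left open.
 */
int reopen_to_flush(int fd, const char *path, int flags)
{
    /* Push dirty pages down to the block driver first. */
    if (fsync(fd) != 0)
        return -1;

    /* Closing is what triggers the driver's release path. */
    if (close(fd) != 0)
        return -1;

    /* Reopen so the caller can keep using the device. */
    return open(path, flags);
}
```

A server would replace its stored descriptor with the return value, for example fd = reopen_to_flush(fd, "/dev/mtdblock0", O_RDWR). The kludgy part is visible here: every other user of the descriptor has to cope with it changing.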
> >Thoughts? Suggestions? Patches?
>
> Word-weaseling aside, if you have patches that fix the behavior you don't
> like, they would certainly be looked at. Setting pdflush to 5 seconds
> instead of 30 would help a bit, or using the ioctl on the mtdblock device
> that already exists to flush would help too. However you might want to
> really look at a system design that relies on mtdblock for data integrity.
What's the point of mtdblock then? All systems care about data integrity
to some degree (some more than others, obviously), so if mtdblock makes
no effort to preserve that integrity, where do you see it ever being
used legitimately?
Thanks very much for your comments.
--Doug.
* Re: mtdblock caching and syncing
2009-04-09 16:02 ` Doug Graham
@ 2009-04-09 17:16 ` Josh Boyer
2009-04-09 18:07 ` Jamie Lokier
2009-04-23 18:21 ` Miles Nordin
2 siblings, 0 replies; 8+ messages in thread
From: Josh Boyer @ 2009-04-09 17:16 UTC (permalink / raw)
To: Doug Graham; +Cc: linux-mtd
On Thu, Apr 09, 2009 at 12:02:47PM -0400, Doug Graham wrote:
>On Thu, Apr 09, 2009 at 10:51:00AM -0400, Josh Boyer wrote:
>> On Thu, Apr 09, 2009 at 10:15:56AM -0400, Doug Graham wrote:
>> >
>> >The problem is that a sync() or fsync() on an mtdblock device does not
>> >actually get the data all the way to the flash device. The mtdblock
>> >layer maintains its own cache of a single erase-unit (256KB in my case).
>> >If I open /dev/mtdblock0 for writing, write some stuff to it, then call
>> >fsync() but do not close the device, up to one erase-unit's worth of
>> >data may still be buffered in memory. This data is only flushed when
>> >the device is actually closed (by mtdblock_release). I think that
>> >this violates the intended semantics of sync and fsync. I shouldn't be
>> >required to do a close() to force the data to the device.
>>
>> The device in question isn't the flash. It's the mtdblock device. So
>> fsync semantics are preserved. This is the same as writing to a file
>> on a hard drive, calling fsync, and having it sit in the hard drive's
>> cache.
>
>That's a good point, and one I've wondered about before. I don't know
>much about how hard drives manage their cache, but I would assume that
>they don't leave dirty data in their cache for an unbounded period
>of time. I'd guess that data is written to the actual disk within a
>few 10s of milliseconds after being sent to the device.
Right.
>In the case of mtdblock, dirty data can stay in the cache forever.
True. And I agree that is entirely sub-optimal.
>> >I think this is a fairly serious bug in a flash-based system, where there
>> >are frequently times that you want to make sure that data has actually
>> >made it all the way to the device. I think that a sync() or fsync()
>> >really ought to somehow propagate all the way down to the mtdblock layer
>> >so that mtdblock can flush its buffer.
>>
>> Why are you using mtdblock in a serious flash-based system? The fact
>> that it buffers an entire eraseblock means you risk huge data loss in
>> the event of an unclean shutdown anyway (power loss). No amount of
>> sync or fsync will fix that.
>
>We don't use mtdblock during normal operations; we use squashfs and jffs2
>(maybe ubifs sometime soon). But one job that we do use mtdblock for is
>burning loads. We could, and perhaps should, be using the char device
>instead to burn loads, except that those require specialized tools to do
>erases before writes. To avoid the need for such specialized tools, we
>just use the equivalent of dd on the mtdblock device followed by a sync.
>But that doesn't work given the behaviour I'm complaining about.
>
>It's actually a little more complicated than that. We have a system
>comprised of multiple cards. When upgrading the system from the master
>card, we're using NBD to upgrade (some) loads on remote cards. The NBD
>server running on the remote cards never closes the mtdblock device that
>it is managing, so the mtdblock_release() method never gets called.
>The NBD server cannot use the MTD character device because it knows
>nothing about the characteristics of flash, including the need to erase
>before writing. Even if it did know about erasing, we'd want it to do
>exactly the same kind of caching the mtdblock already does, so mtdblock
>does seem like a good match in this case. We can certainly modify the
>NBD server to close and reopen the device when it needs to be sure that
>data has actually been written to flash, but that seems a bit on the
>kludgy side, and doesn't help any other applications using mtdblock
>(like the dd scheme I mention above).
>
>> >Thoughts? Suggestions? Patches?
>>
>> Word-weaseling aside, if you have patches that fix the behavior you don't
>> like, they would certainly be looked at. Setting pdflush to 5 seconds
>> instead of 30 would help a bit, or using the ioctl on the mtdblock device
>> that already exists to flush would help too. However you might want to
>> really look at a system design that relies on mtdblock for data integrity.
>
>What's the point of mtdblock then? All systems care about data integrity
>to some degree (some more than others, obviously), so if mtdblock makes
>no effort to preserve that integrity, where do you see it ever being
>used legitimately?
For cramfs, which is read-only. Or in cases sort of like what you describe,
where the conditions of writes are tightly controlled and failure does not
produce a bricked device that a customer is going to be grumpy about.
josh
* Re: mtdblock caching and syncing
2009-04-09 16:02 ` Doug Graham
2009-04-09 17:16 ` Josh Boyer
@ 2009-04-09 18:07 ` Jamie Lokier
2009-04-09 19:24 ` Doug Graham
2009-04-23 18:21 ` Miles Nordin
2 siblings, 1 reply; 8+ messages in thread
From: Jamie Lokier @ 2009-04-09 18:07 UTC (permalink / raw)
To: Doug Graham; +Cc: Josh Boyer, linux-mtd
Doug Graham wrote:
> > The device in question isn't the flash. It's the mtdblock device. So
> > fsync semantics are preserved. This is the same as writing to a file
> > on a hard drive, calling fsync, and having it sit in the hard drive's
> > cache.
>
> That's a good point, and one I've wondered about before. I don't know
> much about how hard drives manage their cache, but I would assume that
> they don't leave dirty data in their cache for an unbounded period
> of time. I'd guess that data is written to the actual disk within a
> few 10s of milliseconds after being sent to the device.
Ideally, fsync() should flush data from a hard drive's cache too.
-- Jamie
* Re: mtdblock caching and syncing
2009-04-09 18:07 ` Jamie Lokier
@ 2009-04-09 19:24 ` Doug Graham
0 siblings, 0 replies; 8+ messages in thread
From: Doug Graham @ 2009-04-09 19:24 UTC (permalink / raw)
To: Jamie Lokier; +Cc: Josh Boyer, linux-mtd
On Thu, Apr 09, 2009 at 07:07:24PM +0100, Jamie Lokier wrote:
> Doug Graham wrote:
> > > The device in question isn't the flash. It's the mtdblock device. So
> > > fsync semantics are preserved. This is the same as writing to a file
> > > on a hard drive, calling fsync, and having it sit in the hard drive's
> > > cache.
> >
> > That's a good point, and one I've wondered about before. I don't know
> > much about how hard drives manage their cache, but I would assume that
> > they don't leave dirty data in their cache for an unbounded period
> > of time. I'd guess that data is written to the actual disk within a
> > few 10s of milliseconds after being sent to the device.
>
> Ideally, fsync() should flush data from a hard drive's cache too.
That seems ideal to me too, and if Linux had the framework to support
that, the solution to the mtdblock problem would be trivial. I'd guess
that the cleanest way to do this would be to add a sync() method to
block_device_operations.
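
To make the idea concrete, here is a user-space toy model, not kernel code. Every name in it (toy_dev, toy_block_ops, toy_fsync and so on) is hypothetical, and the sync hook is the proposed addition rather than an existing member of block_device_operations. It only shows the shape of the proposal: a generic fsync path that calls a per-driver sync method when one is provided, so that a driver-private cache gets written out.

```c
#include <stddef.h>
#include <string.h>

#define TOY_ERASE_UNIT 16           /* stand-in for a 256KB erase unit */

struct toy_dev;

/* Hypothetical per-driver operations table with the proposed hook. */
struct toy_block_ops {
    int (*sync)(struct toy_dev *dev);   /* may be NULL */
};

struct toy_dev {
    const struct toy_block_ops *ops;
    unsigned char cache[TOY_ERASE_UNIT];   /* driver-private cache */
    unsigned char flash[TOY_ERASE_UNIT];   /* the "real" medium */
    int cache_dirty;
};

/* Writes land in the cache only, as with the mtdblock cache. */
void toy_write(struct toy_dev *dev, size_t off, const void *buf, size_t len)
{
    if (off >= TOY_ERASE_UNIT)
        return;
    if (len > TOY_ERASE_UNIT - off)
        len = TOY_ERASE_UNIT - off;
    memcpy(dev->cache + off, buf, len);
    dev->cache_dirty = 1;
}

/* Driver's sync method: copy the dirty cache to the medium. */
int toy_cache_sync(struct toy_dev *dev)
{
    if (dev->cache_dirty) {
        memcpy(dev->flash, dev->cache, TOY_ERASE_UNIT);
        dev->cache_dirty = 0;
    }
    return 0;
}

/* Generic fsync path: propagate to the driver if it has a hook. */
int toy_fsync(struct toy_dev *dev)
{
    if (dev->ops && dev->ops->sync)
        return dev->ops->sync(dev);
    return 0;                       /* no hook: data stays in the cache */
}
```

In this model a device with no sync hook shows the behaviour complained about in this thread: toy_fsync() returns 0 while the data is still only in the cache.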
--Doug.
* Re: mtdblock caching and syncing
2009-04-09 16:02 ` Doug Graham
2009-04-09 17:16 ` Josh Boyer
2009-04-09 18:07 ` Jamie Lokier
@ 2009-04-23 18:21 ` Miles Nordin
2009-04-24 15:35 ` Jamie Lokier
2 siblings, 1 reply; 8+ messages in thread
From: Miles Nordin @ 2009-04-23 18:21 UTC (permalink / raw)
To: linux-mtd
>>>>> "dg" == Doug Graham <dgraham@nortel.com> writes:
dg> I would assume that [hard drives] don't leave dirty data in
dg> their cache for an unbounded period of time.
ZFS always, ext3 mounted with barrier=1 (or by default in SLES only)
and only when not using LVM2, and HFS+ on Mac OS X when using their
bullshit fcntl F_FULLFSYNC API, all at least claim to issue SYNC CACHE
commands to the drives and wait for the drive to write to platter.
It's also possible for people who care to turn off the write cache in
their drives. For drives connected to RAID controllers that have a
battery-backed write cache, I've heard of database-oriented sysadmins
who actually do this as a best-practice not a freak experiment---in
general, at least for SCSI drives, DBAs are likely to be aware of and
control their drives' write cache settings. There is some FUD about
drives existing that ignore the SYNC CACHE command (but not SCSI
drives), but so far the people I've seen spreading this FUD refuse to
name the specific drives, so they are likely to be really old drives,
or even just made-up excuses from developers of buggy filesystems and
block layers.
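
As a rough sketch of the Mac OS X case mentioned above: the fcntl command is spelled F_FULLFSYNC in <fcntl.h> on that platform. The helper name full_sync() and the fallback policy are assumptions for illustration. Where F_FULLFSYNC is not defined (Linux, for example), the code simply calls fsync(), and whether that reaches the platter then depends on the filesystem and barrier settings discussed here.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: ask for data to be pushed through the drive's
 * write cache where the platform offers a way to say so.
 * Returns 0 on success, -1 on failure.
 */
int full_sync(int fd)
{
#ifdef F_FULLFSYNC
    /* Platform-specific request to flush the drive's cache as well. */
    if (fcntl(fd, F_FULLFSYNC) == 0)
        return 0;
    /* Not supported on this file or filesystem: fall back to fsync(). */
#endif
    return fsync(fd);
}
```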
Anyway, the point is, mtdblock is behind the curve here, yes. And for
hard drive-backed filesystems the situation will probably get even
better soon. And finally, devices which use FLASH tend to get their
cords yanked more often so IMHO it matters.
* Re: mtdblock caching and syncing
2009-04-23 18:21 ` Miles Nordin
@ 2009-04-24 15:35 ` Jamie Lokier
0 siblings, 0 replies; 8+ messages in thread
From: Jamie Lokier @ 2009-04-24 15:35 UTC (permalink / raw)
To: Miles Nordin; +Cc: linux-mtd
Miles Nordin wrote:
> >>>>> "dg" == Doug Graham <dgraham@nortel.com> writes:
>
> dg> I would assume that [hard drives] don't leave dirty data in
> dg> their cache for an unbounded period of time.
>
> ZFS always, ext3 mounted with barrier=1 (or by default in SLES only)
> and only when not using LVM2, and HFS+ on Mac OS X when using their
> bullshit fcntl F_FULLFSYNC API, all at least claim to issue SYNC CACHE
> commands to the drives and wait for the drive to write to platter.
> It's also possible for people who care to turn off the write cache in
> their drives. For drives connected to RAID controllers that have a
> battery-backed write cache, I've heard of database-oriented sysadmins
> who actually do this as a best-practice not a freak experiment---in
> general, at least for SCSI drives, DBA's are likely to be aware of and
> control their drives' write cache settings. There is some FUD about
> drives existing that ignore the SYNC CACHE command (but not SCSI
> drives), but so far the people I've seen spreading this FUD refuse to
> name the specific drives, so they are likely to be really old drives,
> or even just made-up excuses from developers of buggy filesystems and
> block layers.
Yes, that is all true.
> Anyway, the point is, mtdblock is behind the curve here, yes.
> and for hard drive-backed filesystems the situation will probably
> get even better soon.
There's a lot of discussion about fsync and barriers recently, so yes.
Same for SSDs.
> And finally, devices which use FLASH tend to get their cords yanked
> more often so IMHO it matters.
That is the single most important reason!
-- Jamie