* o_sync in vfat driver
[not found] ` <op.s5kzao2jj68xd1@mail.piments.com>
@ 2006-02-26 22:50 ` col-pepper
2006-02-27 13:28 ` Lennart Sorensen
0 siblings, 1 reply; 42+ messages in thread
From: col-pepper @ 2006-02-26 22:50 UTC (permalink / raw)
To: linux-kernel
Hi,
OMG what do I have to do to post here? 10th attempt.
{part2}
Here is a non-exhaustive list of typical device types requiring FAT/VFAT
support:
fd ide-hd scsi-hd usb-hd cdrom usb-handheld (iPod, iRiver etc)
usb-flash (usbsticks, cameras, some music devices.)
IIRC the sync mount option for vfat is ignored for file systems >2G, this
effectively (and probably intentionally) excludes nearly all hd partitions
and iPod type devices.
sync does not have any meaning for CD or DVD media.
^ permalink raw reply [flat|nested] 42+ messages in thread
* o_sync in vfat driver
@ 2006-02-26 22:55 col-pepper
0 siblings, 0 replies; 42+ messages in thread
From: col-pepper @ 2006-02-26 22:55 UTC (permalink / raw)
To: linux-kernel
part 3 (we'll get past these filters in the end...)
These devices do present special problems since they are rw media that
can be abruptly removed at any time without even the chance for the OS to
interrupt on-going IO.
This is compounded by the fact that flash memory has to be zeroed and then
rewritten with the new data. If the device is physically removed before a
block is written the update will be lost. If it is removed _during_ write
the new and the old data will likely be lost.
If the block being written is the FAT, the principal record of the
structure of the whole disk will very likely be erased.
Since there is a heavy performance penalty involved (typically around an
_order of magnitude_ slower), it seems that the sole aim here is security
of data at any cost in the case of premature withdrawal.
^ permalink raw reply [flat|nested] 42+ messages in thread
* o_sync in vfat driver
@ 2006-02-26 23:08 col-pepper
2006-02-27 0:51 ` Andrew Morton
0 siblings, 1 reply; 42+ messages in thread
From: col-pepper @ 2006-02-26 23:08 UTC (permalink / raw)
To: linux-kernel
*** Is that aim being achieved by the current policy? ***
As I understand it the old (<=2.6.11) sync model kept the data in sync
without updating the FAT until later. This runs the risks of partial
corruption of one or more files on pullout.
The new model attempts to be more rigorous by updating the FAT every time
a block of data is written. Thus the "hammering" of the physical memory
hosting the FAT record.
In view of the nature of flash memory this may actually be drastically
increasing the chance that the whole FAT gets erased.
part IV (end of a saga)
If a pullout occurs during write, there is now a near 50% chance that
this takes out the entire FAT.
It would seem that the main advantage of this scheme is that it is so slow
that it encourages users to turn it off. Presumably in the process of
coming to that conclusion they will become aware of the need to run umount
or the sync command before removing the device.
= Danger of destroying hardware =
It seems that there are well documented cases of this abusive rewriting of
the FAT causing rapid and total premature failure of what Alan Cox refers
to as "ultra-crap devices".
There may be valid reasons of cost or miniaturisation that preclude the
additional hardware found in more complex devices.
Even if better quality devices may have some sort of paging mechanism
which makes them more resistant to this sort of abuse, it does not seem
good engineering practice to dismiss those that fail as "shite".
There is nothing in the spec of vfat that suggests the FAT will be written
10,000 times during the writing of one large file. Indeed it is hard to imagine
that any other implementation on any other OS or any previous linux kernel
behaves like that.
So should the hardware manufacturers have anticipated this particular
driver implementation or should the kernel be more aware of the existing
hardware that it purports to support?
= The way forward =
It would seem that the first step could be to revert to the 2.6.11
behaviour which was more appropriate and probably safer even from the data
point of view.
I lack the knowledge and experience to produce reliable kernel code so I
won't try. However, I have already seen a number of suggestions of how the
old model could be improved. This post could be the starting point for a
discussion of more robust techniques. In any case the coding is unlikely
to be very complex given the existing, tested code base that is in place
in 2.6.11.
Any new technique should probably aim to be applicable to larger devices
as well. The 2G limit is artificial and is a tacit recognition of the
precarity of the current code. USB hard disks are just as prone to
accidental cable pullout. Some periodic or per file sync should probably
be envisaged for the VFAT sync mount option.
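A per-file sync of this kind can already be approximated from user space.
The following is only a sketch, assuming a mount without the sync option;
the helper name write_then_sync and the chunk size are made up for
illustration. It writes the data in chunks and asks for one flush at the
end with fsync(), rather than one per write():

```python
import os


def write_then_sync(path, data, chunk_size=4096):
    """Write data in chunks, then flush once with fsync() before closing.

    Hypothetical helper: the name and the chunk_size default are
    illustrative only.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        offset = 0
        while offset < len(view):
            # os.write may write fewer bytes than requested
            offset += os.write(fd, view[offset:offset + chunk_size])
        # One explicit flush request for the whole file
        os.fsync(fd)
    finally:
        os.close(fd)
    return offset
```

Whether the device has really committed the data once fsync() returns
depends on the hardware, which this sketch does not address.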
PS if anyone can tell me why I had to post this ten times and chop it into
little bits it would be appreciated in not messing up the list in the
future.
I spent an hour reading the faq and I don't see anything taboo here.
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: o_sync in vfat driver
2006-02-26 23:08 o_sync in vfat driver col-pepper
@ 2006-02-27 0:51 ` Andrew Morton
2006-02-27 22:19 ` col-pepper
0 siblings, 1 reply; 42+ messages in thread
From: Andrew Morton @ 2006-02-27 0:51 UTC (permalink / raw)
To: col-pepper; +Cc: linux-kernel
col-pepper@piments.com wrote:
>
> There is nothing in the spec of vfat that suggests the FAT will be written
> 10,000 times during the writing of one large file. Indeed it is hard to imagine
> that any other implementation on any other OS or any previous linux kernel
> behaves like that.
We sync the file metadata once per write() syscall. If your app writes a
large file in lots of little bits, it'll do a lot of syncs. Other
implementations of fatfs will (must) do the same thing.
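To make the arithmetic concrete, here is a small sketch. The function name
and the file and buffer sizes are made-up examples, not figures from this
thread; it only assumes one metadata sync per write() call, as described
above:

```python
def metadata_syncs(file_size, write_size):
    """Number of write() calls, and so metadata syncs, under -o sync.

    Assumes every write() transfers exactly write_size bytes except
    possibly the last one. Both parameters are in bytes.
    """
    return (file_size + write_size - 1) // write_size


# A 40 MiB file written 4 KiB at a time versus 1 MiB at a time
small_chunks = metadata_syncs(40 * 1024 * 1024, 4 * 1024)
large_chunks = metadata_syncs(40 * 1024 * 1024, 1024 * 1024)
print(small_chunks, large_chunks)  # prints "10240 40"
```

With these example numbers the application's buffer size changes the
number of metadata syncs by a factor of 256.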
> It would seem that the first step could be to revert to the 2.6.11
> behaviour which was more appropriate and probably safer even from the data
> point of view.
fatfs used to be buggy - it didn't implement `-o sync'. Now it does, and
what we're seeing is the fallout from the late fixing of that bug.
You're right - people need to understand what they're doing, make their own
decision, then remove the `-o sync' option. There aren't any easy
solutions.
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: o_sync in vfat driver
2006-02-26 22:50 ` col-pepper
@ 2006-02-27 13:28 ` Lennart Sorensen
2006-02-27 13:50 ` Arjan van de Ven
2006-02-27 14:26 ` linux-os (Dick Johnson)
0 siblings, 2 replies; 42+ messages in thread
From: Lennart Sorensen @ 2006-02-27 13:28 UTC (permalink / raw)
To: col-pepper; +Cc: linux-kernel
On Sun, Feb 26, 2006 at 11:50:40PM +0100, col-pepper@piments.com wrote:
> Hi,
>
> OMG what do I have to do to post here? 10th attempt.
> {part2}
>
> Here is a non-exhaustive list of typical devices types requiring fat vfat
> support:
>
> fd ide-hd scsi-hd usb-hd cdrom usb-hd usb-handheld (iPod, iRiver etc)
> usb-flash (usbsticks, cameras, some music devices.)
>
> IIRC the sync mount option for vfat is ignored for file systems >2G, this
> effectively (and probably intentionally) excludes nearly all hd partitions
> and iPod type devices.
I think many people wish it was ignored on smaller devices too given
what it does to write performance. And if your device is flash based
and is one of the ones that doesn't have proper wear leveling the card
won't last long with sync enabled (even with wear leveling rewriting the
fat that often as sync seems to do can't be good for the lifespan of the
flash).
I suspect either vfat should ignore sync all the time, or it should at
least warn about its use so distributions don't think enabling it on all
removable media is a good idea in general. Or perhaps the vfat driver
could be made to wait for a file to be closed or at least have some
timeout before updating the fat table again. Not sure.
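The timeout idea can be sketched as a toy model. This is not kernel code
and not how the vfat driver works; the class, its names and the one second
interval are all hypothetical. It writes the FAT at most once per interval
and always on close:

```python
class DeferredFatWriter:
    """Toy model: coalesce FAT updates, flush on timeout or on close.

    Hypothetical illustration only. Times are integer milliseconds and
    are passed in, so the model needs no real clock.
    """

    def __init__(self, min_interval_ms):
        self.min_interval_ms = min_interval_ms
        self.last_flush = None   # time of the last physical FAT write
        self.dirty = False       # FAT changes not yet written out
        self.flushes = 0         # count of physical FAT writes

    def data_written(self, now_ms):
        """Record that a data block was written at time now_ms."""
        self.dirty = True
        if (self.last_flush is None
                or now_ms - self.last_flush >= self.min_interval_ms):
            self._flush(now_ms)

    def close(self, now_ms):
        """Closing the file writes out any pending FAT change."""
        if self.dirty:
            self._flush(now_ms)

    def _flush(self, now_ms):
        self.flushes += 1
        self.last_flush = now_ms
        self.dirty = False


writer = DeferredFatWriter(min_interval_ms=1000)
for i in range(1000):                 # 1000 data writes, 10 ms apart
    writer.data_written(now_ms=i * 10)
writer.close(now_ms=10000)
print(writer.flushes)                 # 11 FAT writes instead of 1000
```

The price is that FAT changes made since the last flush are lost if the
device is pulled before the next one.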
Len Sorensen
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: o_sync in vfat driver
2006-02-27 13:28 ` Lennart Sorensen
@ 2006-02-27 13:50 ` Arjan van de Ven
2006-02-27 14:06 ` Anton Altaparmakov
2006-02-27 14:26 ` linux-os (Dick Johnson)
1 sibling, 1 reply; 42+ messages in thread
From: Arjan van de Ven @ 2006-02-27 13:50 UTC (permalink / raw)
To: Lennart Sorensen; +Cc: col-pepper, linux-kernel
On Mon, 2006-02-27 at 08:28 -0500, Lennart Sorensen wrote:
> On Sun, Feb 26, 2006 at 11:50:40PM +0100, col-pepper@piments.com wrote:
> > Hi,
> >
> > OMG what do I have to do to post here? 10th attempt.
> > {part2}
> >
> > Here is a non-exhaustive list of typical devices types requiring fat vfat
> > support:
> >
> > fd ide-hd scsi-hd usb-hd cdrom usb-hd usb-handheld (iPod, iRiver etc)
> > usb-flash (usbsticks, cameras, some music devices.)
> >
> > IIRC the sync mount option for vfat is ignored for file systems >2G, this
> > effectively (and probably intentionally) excludes nearly all hd partitions
> > and iPod type devices.
>
> I think many people wish it was ignored on smaller devices too given
> what it does to write performance.
well. If you don't want it *DO NOT USE IT AT THE MOUNT COMMAND LINE* !!!
> And if your device is flash based
> and is one of the ones that doesn't have proper wear leveling the card
> won't last long with sync enabled (even with wear leveling rewriting the
> fat that often as sync seems to do can't be good for the lifespan of the
> flash).
patient> doctor doctor it hurts when I do this
doctor> Then don't do that
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: o_sync in vfat driver
2006-02-27 13:50 ` Arjan van de Ven
@ 2006-02-27 14:06 ` Anton Altaparmakov
2006-02-27 14:27 ` Arjan van de Ven
0 siblings, 1 reply; 42+ messages in thread
From: Anton Altaparmakov @ 2006-02-27 14:06 UTC (permalink / raw)
To: Arjan van de Ven; +Cc: Lennart Sorensen, col-pepper, linux-kernel
On Mon, 2006-02-27 at 14:50 +0100, Arjan van de Ven wrote:
> On Mon, 2006-02-27 at 08:28 -0500, Lennart Sorensen wrote:
> > On Sun, Feb 26, 2006 at 11:50:40PM +0100, col-pepper@piments.com wrote:
> > > Hi,
> > >
> > > OMG what do I have to do to post here? 10th attempt.
> > > {part2}
> > >
> > > Here is a non-exhaustive list of typical devices types requiring fat vfat
> > > support:
> > >
> > > fd ide-hd scsi-hd usb-hd cdrom usb-hd usb-handheld (iPod, iRiver etc)
> > > usb-flash (usbsticks, cameras, some music devices.)
> > >
> > > IIRC the sync mount option for vfat is ignored for file systems >2G, this
> > > effectively (and probably intentionally) excludes nearly all hd partitions
> > > and iPod type devices.
> >
> > I think many people wish it was ignored on smaller devices too given
> > what it does to write performance.
>
> well. If you don't want it *DO NOT USE IT AT THE MOUNT COMMAND LINE* !!!
That is easy to say when you are using the command line... Modern
distros (as you know I am sure) mount all hot-plug devices like usb
keys, usb hard disks, etc automatically at plug-in time and at least
some distros use "-o sync" for everything so you don't get (too much)
data loss when the user unplugs a device and so a umount to unplug the
device does not take ages...
Being someone who maintains a distribution based on one of the big
distributions I can tell you that figuring out how to change that
default behaviour is not always pretty. Usually involves hacking files
deep in the bowels of the hotplug framework on the system.
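Checking whether a system has mounted anything this way is at least easy.
A minimal sketch, assuming the usual /proc/mounts layout (device, mount
point, type, options, two numbers); the function name and the sample text
are made up:

```python
def sync_mounted_vfat(mounts_text):
    """Return mount points of vfat filesystems mounted with 'sync'.

    mounts_text is text in the /proc/mounts layout:
    device mountpoint fstype options dump pass
    """
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        mountpoint, fstype = fields[1], fields[2]
        options = fields[3].split(",")
        if fstype == "vfat" and "sync" in options:
            hits.append(mountpoint)
    return hits


# Made-up sample; on a real system read the text from /proc/mounts instead.
sample = """\
/dev/sda1 / ext3 rw,relatime 0 0
/dev/sdb1 /media/usbstick vfat rw,sync,nosuid,nodev 0 0
/dev/sdc1 /media/camera vfat rw,nosuid,nodev 0 0
"""
print(sync_mounted_vfat(sample))  # ['/media/usbstick']
```

This only reports the option. Changing the default still means finding
where the hotplug tooling sets it.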
Best regards,
Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: o_sync in vfat driver
2006-02-27 13:28 ` Lennart Sorensen
2006-02-27 13:50 ` Arjan van de Ven
@ 2006-02-27 14:26 ` linux-os (Dick Johnson)
2006-02-27 18:53 ` Jan Engelhardt
1 sibling, 1 reply; 42+ messages in thread
From: linux-os (Dick Johnson) @ 2006-02-27 14:26 UTC (permalink / raw)
To: Lennart Sorensen; +Cc: col-pepper, linux-kernel
On Mon, 27 Feb 2006, Lennart Sorensen wrote:
> On Sun, Feb 26, 2006 at 11:50:40PM +0100, col-pepper@piments.com wrote:
>> Hi,
>>
>> OMG what do I have to do to post here? 10th attempt.
>> {part2}
>>
>> Here is a non-exhaustive list of typical devices types requiring fat vfat
>> support:
>>
>> fd ide-hd scsi-hd usb-hd cdrom usb-hd usb-handheld (iPod, iRiver etc)
>> usb-flash (usbsticks, cameras, some music devices.)
>>
>> IIRC the sync mount option for vfat is ignored for file systems >2G, this
>> effectively (and probably intentionally) excludes nearly all hd partitions
>> and iPod type devices.
>
> I think many people wish it was ignored on smaller devices too given
> what it does to write performance. And if your device is flash based
> and is one of the ones that doesn't have proper wear leveling the card
> won't last long with sync enabled (even with wear leveling rewriting the
> fat that often as sync seems to do can't be good for the lifespan of the
> flash).
>
> I suspect either vfat should ignore sync all the time, or it should at
> least warn about its use so distributions don't think enabling it on all
> removeable media is a good idea in general. Or perhaps the vfat driver
> could be made to wait for a file to be closed or at least have some
> timeout before updating the fat table again. Not sure.
>
> Len Sorensen
I really don't think one needs to worry about this! The flash-file-
system designers know how to minimize wear and spread the wear
throughout the device. It's not up to the file-systems to be
concerned whatsoever! The filesystems need to concern themselves
with the proper implementation of their structural details, nothing
else. Any special device considerations do not belong in the
file-system code. If there are any special device considerations,
they need to be in the device driver, nowhere else.
BTW, even the drivers can't effectively compensate for any
potential wear because they don't know where the physical
write will occur. The physical sectors (pages) of many of
these devices are 64k. All of the access, both read and
write, is buffered in read/write static RAM. It's only
when the disk emulator of the FlashRAM decides that the
static RAM needs to be flushed to flash, that the write
actually occurs. Typically, a LRU 64k page is erased
and re-written. Then a table is updated to reference the
new correct block. This is all transparent, and it
needs to be, because erasing a 64k block takes nearly
a second! Without the buffering, write performance
would be unacceptable.
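The remapping described above can be sketched as a toy model. It is purely
illustrative: real controllers differ, all names are made up, and the
spare block is taken from a simple queue rather than a true LRU list. The
point is only the order of operations, new block first, table last:

```python
class ToyFlashDisk:
    """Toy remapping layer: write a spare block, then switch the table.

    Hypothetical illustration, not a description of any real device.
    """

    def __init__(self, logical_blocks, spare_blocks):
        total = logical_blocks + spare_blocks
        self.physical = [b""] * total                    # block contents
        self.table = list(range(logical_blocks))         # logical -> physical
        self.free = list(range(logical_blocks, total))   # erased spare blocks

    def read(self, logical):
        return self.physical[self.table[logical]]

    def write(self, logical, data, power_fails_before_remap=False):
        new_block = self.free.pop(0)       # take an already erased block
        self.physical[new_block] = data    # program it with the new data
        if power_fails_before_remap:
            return                         # table untouched, old data stays
        old_block = self.table[logical]
        self.table[logical] = new_block    # switch the mapping last
        self.free.append(old_block)        # old block may be erased and reused


disk = ToyFlashDisk(logical_blocks=4, spare_blocks=2)
disk.write(0, b"FAT v1")
disk.write(0, b"FAT v2", power_fails_before_remap=True)
print(disk.read(0))  # b'FAT v1'
```

In this model an interrupted write leaves the previous contents readable,
because the old block is never touched before the table is switched.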
Cheers,
Dick Johnson
Penguin : Linux version 2.6.15.4 on an i686 machine (5589.53 BogoMips).
Warning : 98.36% of all statistics are fiction, book release in April.
****************************************************************
The information transmitted in this message is confidential and may be privileged. Any review, retransmission, dissemination, or other use of this information by persons or entities other than the intended recipient is prohibited. If you are not the intended recipient, please notify Analogic Corporation immediately - by replying to this message or by sending an email to DeliveryErrors@analogic.com - and destroy all copies of this information, including any attachments, without reading or disclosing them.
Thank you.
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: o_sync in vfat driver
2006-02-27 14:06 ` Anton Altaparmakov
@ 2006-02-27 14:27 ` Arjan van de Ven
2006-02-27 14:41 ` Anton Altaparmakov
0 siblings, 1 reply; 42+ messages in thread
From: Arjan van de Ven @ 2006-02-27 14:27 UTC (permalink / raw)
To: Anton Altaparmakov; +Cc: Lennart Sorensen, col-pepper, linux-kernel
On Mon, 2006-02-27 at 14:06 +0000, Anton Altaparmakov wrote:
> On Mon, 2006-02-27 at 14:50 +0100, Arjan van de Ven wrote:
> > On Mon, 2006-02-27 at 08:28 -0500, Lennart Sorensen wrote:
> > > On Sun, Feb 26, 2006 at 11:50:40PM +0100, col-pepper@piments.com wrote:
> > > > Hi,
> > > >
> > > > OMG what do I have to do to post here? 10th attempt.
> > > > {part2}
> > > >
> > > > Here is a non-exhaustive list of typical devices types requiring fat vfat
> > > > support:
> > > >
> > > > fd ide-hd scsi-hd usb-hd cdrom usb-hd usb-handheld (iPod, iRiver etc)
> > > > usb-flash (usbsticks, cameras, some music devices.)
> > > >
> > > > IIRC the sync mount option for vfat is ignored for file systems >2G, this
> > > > effectively (and probably intentionally) excludes nearly all hd partitions
> > > > and iPod type devices.
> > >
> > > I think many people wish it was ignored on smaller devices too given
> > > what it does to write performance.
> >
> > well. If you don't want it *DO NOT USE IT AT THE MOUNT COMMAND LINE* !!!
>
> That is easy to say when you are using the command line... Modern
> distros (as you know I am sure) mount all hot-plug devices like usb
> keys, usb hard disks, etc automatically at plug-in time and at least
> some distros use "-o sync"
that is a bad misdesign of that distro or at least the tool the distro
uses for this (I don't know which it is so I can say that without
sounding partial :)
the tool that decides to use "sync", or at least the author thereof,
should be aware of what flash is, and that it has a limited lifespan etc
etc, and that you thus want maximum caching etc.
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: o_sync in vfat driver
2006-02-27 14:27 ` Arjan van de Ven
@ 2006-02-27 14:41 ` Anton Altaparmakov
2006-02-27 21:04 ` col-pepper
0 siblings, 1 reply; 42+ messages in thread
From: Anton Altaparmakov @ 2006-02-27 14:41 UTC (permalink / raw)
To: Arjan van de Ven; +Cc: Lennart Sorensen, col-pepper, linux-kernel
On Mon, 2006-02-27 at 15:27 +0100, Arjan van de Ven wrote:
> On Mon, 2006-02-27 at 14:06 +0000, Anton Altaparmakov wrote:
> > On Mon, 2006-02-27 at 14:50 +0100, Arjan van de Ven wrote:
> > > On Mon, 2006-02-27 at 08:28 -0500, Lennart Sorensen wrote:
> > > > On Sun, Feb 26, 2006 at 11:50:40PM +0100, col-pepper@piments.com wrote:
> > > > > Hi,
> > > > >
> > > > > OMG what do I have to do to post here? 10th attempt.
> > > > > {part2}
> > > > >
> > > > > Here is a non-exhaustive list of typical devices types requiring fat vfat
> > > > > support:
> > > > >
> > > > > fd ide-hd scsi-hd usb-hd cdrom usb-hd usb-handheld (iPod, iRiver etc)
> > > > > usb-flash (usbsticks, cameras, some music devices.)
> > > > >
> > > > > IIRC the sync mount option for vfat is ignored for file systems >2G, this
> > > > > effectively (and probably intentionally) excludes nearly all hd partitions
> > > > > and iPod type devices.
> > > >
> > > > I think many people wish it was ignored on smaller devices too given
> > > > what it does to write performance.
> > >
> > > well. If you don't want it *DO NOT USE IT AT THE MOUNT COMMAND LINE* !!!
> >
> > That is easy to say when you are using the command line... Modern
> > distros (as you know I am sure) mount all hot-plug devices like usb
> > keys, usb hard disks, etc automatically at plug-in time and at least
> > some distros use "-o sync"
>
> that is a bad misdesign of that distro or at least the tool the distro
> uses for this (I don't know which it is so I can say that without
> sounding partial :)
>
> the tool that decides to use "sync", or at least the author thereof,
> should be aware of what flash is, and that it has a limited lifespan etc
> etc, and that you thus want maximum caching etc.
I agree completely which is why we hack the system to remove the o_sync
on our distro derivative. (-:
But my point was that your solution of "don't do that then" is not much
use to your average user who sits in front of such distro in graphical
desktop as they are not technical enough to find and hack their hotplug
system to work properly...
Best regards,
Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: o_sync in vfat driver
2006-02-27 14:26 ` linux-os (Dick Johnson)
@ 2006-02-27 18:53 ` Jan Engelhardt
0 siblings, 0 replies; 42+ messages in thread
From: Jan Engelhardt @ 2006-02-27 18:53 UTC (permalink / raw)
To: linux-os (Dick Johnson); +Cc: Lennart Sorensen, col-pepper, linux-kernel
>
>I really don't think one needs to worry about this! The flash-file-
>system designers know how to minimize wear and spread the wear
>throughout the device. It's not up to the file-systems to be
>concerned whatsoever!
Yes, the filesystem designers, JFFS and such. But most people unfortunately
have to use something not-optimized-for-flash called VFAT to be able to
read it on Win32 too. I would like to use UDF instead, but Windows seems to
have a nogo with UDF on non-CDROMs.
Jan Engelhardt
--
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: o_sync in vfat driver
2006-02-27 14:41 ` Anton Altaparmakov
@ 2006-02-27 21:04 ` col-pepper
2006-02-27 21:17 ` Arjan van de Ven
` (3 more replies)
0 siblings, 4 replies; 42+ messages in thread
From: col-pepper @ 2006-02-27 21:04 UTC (permalink / raw)
To: Anton Altaparmakov, Arjan van de Ven; +Cc: Lennart Sorensen, linux-kernel
On Mon, 27 Feb 2006 15:41:44 +0100, Anton Altaparmakov <aia21@cam.ac.uk>
wrote:
> On Mon, 2006-02-27 at 15:27 +0100, Arjan van de Ven wrote:
>> On Mon, 2006-02-27 at 14:06 +0000, Anton Altaparmakov wrote:
>> > On Mon, 2006-02-27 at 14:50 +0100, Arjan van de Ven wrote:
>> > > On Mon, 2006-02-27 at 08:28 -0500, Lennart Sorensen wrote:
>> > > > On Sun, Feb 26, 2006 at 11:50:40PM +0100, col-pepper@piments.com
>> wrote:
>> > > > > Hi,
>> > > > >
>> > > > > OMG what do I have to do to post here? 10th attempt.
>> > > > > {part2}
>> > > > >
>> > > > > Here is a non-exhaustive list of typical devices types
>> requiring fat vfat
>> > > > > support:
>> > > > >
>> > > > > fd ide-hd scsi-hd usb-hd cdrom usb-hd usb-handheld (iPod,
>> iRiver etc)
>> > > > > usb-flash (usbsticks, cameras, some music devices.)
>> > > > >
>> > > > > IIRC the sync mount option for vfat is ignored for file systems
>> >2G, this
>> > > > > effectively (and probably intentionally) excludes nearly all hd
>> partitions
>> > > > > and iPod type devices.
>> > > >
>> > > > I think many people wish it was ignored on smaller devices too
>> given
>> > > > what it does to write performance.
>> > >
>> > > well. If you don't want it *DO NOT USE IT AT THE MOUNT COMMAND
>> LINE* !!!
>> >
>> > That is easy to say when you are using the command line... Modern
>> > distros (as you know I am sure) mount all hot-plug devices like usb
>> > keys, usb hard disks, etc automatically at plug-in time and at least
>> > some distros use "-o sync"
>>
>> that is a bad misdesign of that distro or at least the tool the distro
>> uses for this (I don't know which it is so I can say that without
>> sounding partial :)
>>
>> the tool that decides to use "sync", or at least the author thereof,
>> should be aware of what flash is, and that it has a limited lifespan etc
>> etc, and that you thus want maximum caching etc.
>
> I agree completely which is why we hack the system to remove the o_sync
> on our distro derivative. (-:
>
> But my point was that your solution of "don't do that then" is not much
> use to your average user who sits in front of such distro in graphical
> desktop as they are not technical enough to find and hack their hotplug
> system to work properly...
>
> Best regards,
>
> Anton
>> If you don't want it *DO NOT USE IT AT THE MOUNT COMMAND LINE* !!!
Yeah, clever.
That is not really a constructive response. I don't use , I do use command
line mount all the time. I never was in danger of damaging my drive with
this new "feature".
Telling a user who has just burnt out a brand new 1GB usb device he should
have RTFM and modified that HAL configuration to ensure it did not use
sync is not likely to win much confidence in the linux kernel.
The point of raising this is that the vast majority of linux users have no
awareness of this. If there is a danger of this sync implementation
damaging hardware it should be done differently.
More importantly this sync strategy is very likely _increasing_ the danger
of data loss that is the core reason for using sync in the first place.
To quote from my earlier post:
The new model attempts to be more rigorous by updating the FAT every time
a block of data is written. Thus the "hammering" of the physical memory
hosting the FAT record.
In view of the nature of flash memory this may actually be drastically
increasing the chance that the whole FAT gets erased.
If a pullout occurs during write, there is now a near 50% chance that
this takes out the entire FAT.
Now if that analysis is inaccurate I'd like to be corrected. But flash has to
be zeroed to be written. If every second write is zeroing the FAT this
would seem much more likely to destroy the whole fs than to provide better
protection from an untimely pull-out.
[Note: I am not subscribed to LKML, if you wish me to receive any follow
ups please BCC: col-pepper at piments point com . thx]
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: o_sync in vfat driver
2006-02-27 21:04 ` col-pepper
@ 2006-02-27 21:17 ` Arjan van de Ven
2006-02-27 23:21 ` col-pepper
2006-02-27 21:32 ` linux-os (Dick Johnson)
` (2 subsequent siblings)
3 siblings, 1 reply; 42+ messages in thread
From: Arjan van de Ven @ 2006-02-27 21:17 UTC (permalink / raw)
To: col-pepper; +Cc: Anton Altaparmakov, Lennart Sorensen, linux-kernel
> Telling a user who has just burnt out a brand new 1GB usb device he should
> have RTFM and modified that HAL configuration to ensure it did not use
> sync is not likely to win much confidence in the linux kernel.
or in HAL. really.
there was a very long discussion about kernel stability.
The problem is that once depending on the absence of a feature becomes
ABI ... there is a big problem.
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: o_sync in vfat driver
2006-02-27 21:04 ` col-pepper
2006-02-27 21:17 ` Arjan van de Ven
@ 2006-02-27 21:32 ` linux-os (Dick Johnson)
2006-02-27 23:21 ` col-pepper
2006-02-28 16:11 ` Helge Hafting
2006-02-28 22:37 ` Pavel Machek
3 siblings, 1 reply; 42+ messages in thread
From: linux-os (Dick Johnson) @ 2006-02-27 21:32 UTC (permalink / raw)
To: col-pepper
Cc: Anton Altaparmakov, Arjan van de Ven, Lennart Sorensen,
linux-kernel
On Mon, 27 Feb 2006 col-pepper@piments.com wrote:
> On Mon, 27 Feb 2006 15:41:44 +0100, Anton Altaparmakov <aia21@cam.ac.uk>
> wrote:
>
>> On Mon, 2006-02-27 at 15:27 +0100, Arjan van de Ven wrote:
>>> On Mon, 2006-02-27 at 14:06 +0000, Anton Altaparmakov wrote:
>>>> On Mon, 2006-02-27 at 14:50 +0100, Arjan van de Ven wrote:
>>>>> On Mon, 2006-02-27 at 08:28 -0500, Lennart Sorensen wrote:
>>>>>> On Sun, Feb 26, 2006 at 11:50:40PM +0100, col-pepper@piments.com
>>> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> OMG what do I have to do to post here? 10th attempt.
>>>>>>> {part2}
>>>>>>>
>>>>>>> Here is a non-exhaustive list of typical devices types
>>> requiring fat vfat
>>>>>>> support:
>>>>>>>
>>>>>>> fd ide-hd scsi-hd usb-hd cdrom usb-hd usb-handheld (iPod,
>>> iRiver etc)
>>>>>>> usb-flash (usbsticks, cameras, some music devices.)
>>>>>>>
>>>>>>> IIRC the sync mount option for vfat is ignored for file systems >2G, this
>>>>>>> effectively (and probably intentionally) excludes nearly all hd
>>> partitions
>>>>>>> and iPod type devices.
>>>>>>
>>>>>> I think many people wish it was ignored on smaller devices too
>>> given
>>>>>> what it does to write performance.
>>>>>
>>>>> well. If you don't want it *DO NOT USE IT AT THE MOUNT COMMAND
>>> LINE* !!!
>>>>
>>>> That is easy to say when you are using the command line... Modern
>>>> distros (as you know I am sure) mount all hot-plug devices like usb
>>>> keys, usb hard disks, etc automatically at plug-in time and at least
>>>> some distros use "-o sync"
>>>
>>> that is a bad misdesign of that distro or at least the tool the distro
>>> uses for this (I don't know which it is so I can say that without
>>> sounding partial :)
>>>
>>> the tool that decides to use "sync", or at least the author thereof,
>>> should be aware of what flash is, and that it has a limited lifespan etc
>>> etc, and that you thus want maximum caching etc.
>>
>> I agree completely which is why we hack the system to remove the o_sync
>> on our distro derivative. (-:
>>
>> But my point was that your solution of "don't do that then" is not much
>> use to your average user who sits in front of such distro in graphical
>> desktop as they are not technical enough to find and hack their hotplug
>> system to work properly...
>>
>> Best regards,
>>
>> Anton
>
>>> If you don't want it *DO NOT USE IT AT THE MOUNT COMMAND LINE* !!!
>
> Yeah, clever.
> That is not really a constructive response. I don't use , I do use command
> line mount all the time. I never was in danger of damaging my drive with
> this new "feature".
>
> Telling a user who has just burnt out a brand new 1GB usb device he should
> have RTFM and modified that HAL configuration to ensure it did not use
> sync is not likely to win much confidence in the linux kernel.
>
> The point of raising this is that the vast majority of linux users have no
> awareness of this. If there is a danger of this sync implementation
> damaging hardware it should be done differently.
>
> More importantly this sync strategy is very likely _increasing_ the danger
> of data loss that is the core reason for using sync in the first place.
>
> To quote from my earlier post:
>
> The new model attempts to be more rigorous by updating the FAT every time
> a block of data is written. Thus the "hammering" of the physical memory
> hosting the FAT record.
Nobody should care.
>
> In view of the nature of flash memory this may actually be drastically
> increasing the chance that the whole FAT gets erased.
>
Will not happen because that's not how they work.
> If a pullout occurs during write , there is now a near 50% chance that
> this takes out the entire FAT.
>
If a pullout or a power-failure occurs, you just have an incomplete
write, an old FAT entry just like ejecting a floppy during a write.
> Now if that analysis is inaccurate I'd like to be corrected. But flash has to
> be zeroed to be written. If every second write is zeroing the FAT this
> would seem much more likely to destroy the whole fs than to provide better
> protection from an untimely pull-out.
>
Flash does not get zeroed to be written! It gets erased, which sets all
the bits to '1', i.e., all bytes to 0xff. Further, the designers of
flash disks are not stupid as you assume. The direct access occurs
to static RAM (read/write stuff). After a few milliseconds of it
becoming dirty, and/or when a new page needs to be accessed, the
chip erases some page that was not used yet, or was used a long
time ago and is not on the active list. Then, it becomes busy,
writes the current sector to the newly erased sector, and (after
that write occurs) replaces the entry in the table that tells the
disk implementation the logical to physical translation of that page.
In the case where a page will be changed, the new page's data is read
from the device into static RAM before access. In any case, the chip
then becomes non-busy. The power can fail at any time and you just
have the previous data instead of the new data, just like a real
disk drive, except that the sectors are large (64 k).
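The erase-versus-write distinction can be shown with a few lines. This is
a sketch of flash cell behaviour as generally understood (erase sets every
bit, programming can only clear bits), not of any particular chip; the
function names are made up:

```python
ERASED = 0xFF  # an erased flash byte has every bit set


def erase(block):
    """Erase a whole block: every byte becomes 0xFF."""
    return bytes([ERASED] * len(block))


def program(block, offset, data):
    """Program bytes into a block; programming only clears bits (1 -> 0)."""
    out = bytearray(block)
    for i, value in enumerate(data):
        out[offset + i] &= value
    return bytes(out)


block = erase(bytes(8))
block = program(block, 0, b"\x12\x34")
print(block.hex())                         # 1234ffffffffffff
# Overwriting without an erase only clears more bits: 0x12 & 0x21 == 0x00
print(program(block, 0, b"\x21").hex())    # 0034ffffffffffff
```

This is why changed data goes to a freshly erased block instead of being
overwritten in place.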
You see, these are not just flash-RAM chips. They are disc drive
emulators that contain an ASIC for the bus interface and control
logic, some static RAM, and the flash RAM.
The IDE emulators, like CompactFlash, as tiny as they are, actually
have the same pin-outs as an IDE drive!!
>
> [Note: I am not subscribed to LKML, if you wish me to receive any follow
> ups please BCC: col-pepper at piments point com . thx]
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
Cheers,
Dick Johnson
Penguin : Linux version 2.6.15.4 on an i686 machine (5589.53 BogoMips).
Warning : 98.36% of all statistics are fiction, book release in April.
_
****************************************************************
The information transmitted in this message is confidential and may be privileged. Any review, retransmission, dissemination, or other use of this information by persons or entities other than the intended recipient is prohibited. If you are not the intended recipient, please notify Analogic Corporation immediately - by replying to this message or by sending an email to DeliveryErrors@analogic.com - and destroy all copies of this information, including any attachments, without reading or disclosing them.
Thank you.
* Re: o_sync in vfat driver
2006-02-27 0:51 ` Andrew Morton
@ 2006-02-27 22:19 ` col-pepper
2006-02-27 23:12 ` Andrew Morton
2006-02-28 0:52 ` Machida, Hiroyuki
0 siblings, 2 replies; 42+ messages in thread
From: col-pepper @ 2006-02-27 22:19 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel@vger.kernel.org
Thanks for the reply.
On Mon, 27 Feb 2006 01:51:14 +0100, Andrew Morton <akpm@osdl.org> wrote:
> col-pepper@piments.com wrote:
>>
>> There is nothing in the spec of vfat that suggests the FAT will be
>> written
>> 10.000 during the writing of one large file. Indeed it is hard to
>> imagine
>> that any other implementation on any other OS or any previous linux
>> kernel
>> behaves like that.
>
> We sync the file metadata once per write() syscall. If your app writes a
> large file in lots of little bits, it'll do a lot of syncs. Other
> implementations of fatfs will (must) do the same thing.
That would not seem to be the case, at least on MS systems. I had a friend
do some timings copying a large group of files to a 128M usb flash device.
There was an arbitrary mix of files including many small files and some
larger files, one in excess of 50MB.
suse10 default 4m10
win2k 2m30
suse w/o sync 30s
The suse test was drag and drop in konqueror, the other dnd in windows
explorer.
>
>> It would seem that the first step could be to revert to the 2.6.11
>> behaviour which was more appropriate and probably safer even from the
>> data
>> point of view.
>
> fatfs used to be buggy - it didn't implement `-o sync'. Now it does, and
> what we're seeing is the fallout from the late fixing of that bug.
>
I just tested on my 2.6.11 kernel which would predate the change and there
is a clear difference between mounting my usb device with and without sync
option.
ls -ail /tmpd/mail*
239151 -rw-r--r-- 1 root root 8169540 2006-02-27 19:04
/tmpd/mail-bak.2006-02-28.bz2
bash-3.1#time cp !$ /mnt/usb
time cp /tmpd/mail* /mnt/usb
real 0m0.227s
user 0m0.001s
sys 0m0.070s
It returns immediately with no disk activity. About 30s later there was
disk activity. Presumably some periodic flushing of IO buffers.
bash-3.1#umount /mnt/usb
bash-3.1#mount -o sync !$
bash-3.1#time cp /tmpd/mail* /mnt/usb
real 0m5.440s
user 0m0.000s
sys 0m0.143s
So the older model did seem to have some sync functionality, though
presumably not the aggressive one-for-one sync that is now being used.
Please correct me if my interpretation is flawed here:
flash has to be cleared before being written. If metadata is written with
every block output with write(), the risk of erasing the FAT is now many
times higher than with the old sync policy.
So the newer sync policy drastically _reduces_ the data security in the
case of untimely disconnection despite the speed penalty and possible
hardware damage it incurs.
A less rigorous sync policy may in fact be more appropriate than the
current model.
Thanks again.
[Note: I am not subscribed to LKML, if you wish me to receive any follow
ups please BCC: col-pepper at piments point com . thx]
* Re: o_sync in vfat driver
2006-02-27 22:19 ` col-pepper
@ 2006-02-27 23:12 ` Andrew Morton
2006-02-28 18:47 ` Chris Mason
2006-02-28 0:52 ` Machida, Hiroyuki
1 sibling, 1 reply; 42+ messages in thread
From: Andrew Morton @ 2006-02-27 23:12 UTC (permalink / raw)
To: col-pepper; +Cc: linux-kernel
col-pepper@piments.com wrote:
>
> That would not seem to be the case, at least on MS systems. I had a friend
> do some timings copying a large group of files to a 128M usb flash device.
> There was an arbitrary mix of files including many small files and some
> larger files, one in excess of 50MB.
>
> suse10 default 4m10
> win2k 2m30
> suse w/o sync 30s
>
> The suse test was drag and drop in konqueror , the other dnd in windows
> explorer.
We don't know that the same number of same-sized write()s were happening in
each case.
There's been some talk about implementing fsync()-on-file-close for this
problem, and some protopatches. But nothing final yet.
* Re: o_sync in vfat driver
2006-02-27 21:17 ` Arjan van de Ven
@ 2006-02-27 23:21 ` col-pepper
0 siblings, 0 replies; 42+ messages in thread
From: col-pepper @ 2006-02-27 23:21 UTC (permalink / raw)
To: Arjan van de Ven; +Cc: Anton Altaparmakov, Lennart Sorensen, linux-kernel
On Mon, 27 Feb 2006 22:17:21 +0100, Arjan van de Ven <arjan@infradead.org>
wrote:
>
>> Telling a user who has just burnt out a brand new 1GB usb device he
>> should
>> have RTFM and modified that HAL configuration to ensure it did not use
>> sync is not likely to win much confidence in the linux kernel.
>
> or in HAL. really.
It may unfairly reflect on HAL in the users' mind but hal still does
exactly what it is set up to do.
>
>
> there was a very long discussion about kernel stability.
> The problem is that once depending on the absence of a feature becomes
> ABI ... there is a big problem.
>
>
>
It was not totally absent. If it was absent no-one would configure
anything to use it anyway. It seems that the big problem was that its
functionality was fundamentally changed, but it was passed on like a minor
mod that no-one needed to worry about and the doc was not updated at the
time.
* Re: o_sync in vfat driver
2006-02-27 21:32 ` linux-os (Dick Johnson)
@ 2006-02-27 23:21 ` col-pepper
2006-02-28 13:10 ` linux-os (Dick Johnson)
2006-02-28 22:38 ` Pavel Machek
0 siblings, 2 replies; 42+ messages in thread
From: col-pepper @ 2006-02-27 23:21 UTC (permalink / raw)
To: linux-os (Dick Johnson); +Cc: linux-kernel@vger.kernel.org
On Mon, 27 Feb 2006 22:32:07 +0100, linux-os (Dick Johnson)
<linux-os@analogic.com> wrote:
> Flash does not get zeroed to be written! It gets erased, which sets all
> the bits to '1', i.e., all bytes to 0xff.
Thanks for the correction, but that does not change the discussion.
> Further, the designers of
> flash disks are not stupid as you assume. The direct access occurs
> to static RAM (read/write stuff).
I'm not assuming anything. Some hardware has been killed by this issue.
http://lkml.org/lkml/2005/5/13/144
It seems that it's you making the assumption that all of these devices are
manufactured the same way.
The constant dirtying of the buffer will still cause excessive use of the
flash block hosting the FAT. Clearly not all devices use a load spreading
mechanism and this can lead to premature failure.
* Re: o_sync in vfat driver
2006-02-27 22:19 ` col-pepper
2006-02-27 23:12 ` Andrew Morton
@ 2006-02-28 0:52 ` Machida, Hiroyuki
1 sibling, 0 replies; 42+ messages in thread
From: Machida, Hiroyuki @ 2006-02-28 0:52 UTC (permalink / raw)
To: col-pepper; +Cc: Andrew Morton, linux-kernel@vger.kernel.org
As Andrew suggested, Ogawa Hirofumi, the FAT maintainer, posted the
following patch. It may help you. Please check it.
Message-Id: <87wthrznsv.fsf@devron.myhome.or.jp>
Subject: Re: [EXPERIMENT] Add new "flush" option
This adds a new "flush" option for hotplug devices.
The current implementation of the "flush" option does the following:
- synchronizes data pages at ->release() (last close(2))
- if the user's work seems to be done (fs is not active), syncs all
metadata via pdflush()
This option would provide a kind of sane progress, and dirty buffers are
flushed more frequently (if the fs is not active). This option doesn't
provide any robustness (robustness is provided by other options), but
the option is probably appropriate for hotplug devices.
(Please don't assume that dirty buffers are synchronized at any point.
This implementation may easily change.)
col-pepper@piments.com wrote:
> Thanks for the reply.
>
>
> On Mon, 27 Feb 2006 01:51:14 +0100, Andrew Morton <akpm@osdl.org> wrote:
>
>> col-pepper@piments.com wrote:
>>
>>>
>>> There is nothing in the spec of vfat that suggests the FAT will be
>>> written
>>> 10.000 during the writing of one large file. Indeed it is hard to
>>> imagine
>>> that any other implementation on any other OS or any previous linux
>>> kernel
>>> behaves like that.
>>
>>
>> We sync the file metadata once per write() syscall. If your app writes a
>> large file in lots of little bits, it'll do a lot of syncs. Other
>> implementations of fatfs will (must) do the same thing.
>
>
> That would not seem to be the case, at least on MS systems. I had a
> friend do some timings copying a large group of files to a 128M usb
> flash device.
> There was an arbitrary mix of files including many small files and some
> larger files, one in excess of 50MB.
>
> suse10 default 4m10
> win2k 2m30
> suse w/o sync 30s
>
> The suse test was drag and drop in konqueror , the other dnd in windows
> explorer.
>
>>
>>> It would seem that the first step could be to revert to the 2.6.11
>>> behaviour which was more appropriate and probably safer even from
>>> the data
>>> point of view.
>>
>>
>> fatfs used to be buggy - it didn't implement `-o sync'. Now it does, and
>> what we're seeing is the fallout from the late fixing of that bug.
>>
>
> I just tested on my 2.6.11 kernel which would predate the change and
> there is a clear difference between mounting my usb device with and
> without sync option.
>
> ls -ail /tmpd/mail*
> 239151 -rw-r--r-- 1 root root 8169540 2006-02-27 19:04
> /tmpd/mail-bak.2006-02-28.bz2
> bash-3.1#time cp !$ /mnt/usb
> time cp /tmpd/mail* /mnt/usb
>
> real 0m0.227s
> user 0m0.001s
> sys 0m0.070s
>
> It returns immediately with no disk activity. About 30s later there was
> disk activity. Presumably some periodic flushing of IO buffers.
>
> bash-3.1#umount /mnt/usb
> bash-3.1#mount -o sync !$
>
> bash-3.1#time cp /tmpd/mail* /mnt/usb
>
> real 0m5.440s
> user 0m0.000s
> sys 0m0.143s
>
> So the older model did seem to have some sync functionality, though
> presumably not the aggressive one-for-one sync that is now being used.
>
> Please correct me if my interpretation is flawed here:
>
> flash has to be cleared before being written. If metadata is written
> with every block output with write(), the risk of erasing the FAT is
> now many times higher than with the old sync policy.
>
> So the newer sync policy drastically _reduces_ the data security in the
> case of untimely disconnection despite the speed penalty and possible
> hardware damage it incurs.
>
> A less rigorous sync policy may in fact be more appropriate than the
> current model.
>
> Thanks again.
>
>
> [Note: I am not subscribed to LKML, if you wish me to receive any follow
> ups please BCC: col-pepper at piments point com . thx]
--
Hiroyuki Machida machida@sm.sony.co.jp
SSW Dept. HENC, Sony Corp.
* Re: o_sync in vfat driver
2006-02-27 23:21 ` col-pepper
@ 2006-02-28 13:10 ` linux-os (Dick Johnson)
2006-02-28 13:52 ` Sergei Organov
` (2 more replies)
2006-02-28 22:38 ` Pavel Machek
1 sibling, 3 replies; 42+ messages in thread
From: linux-os (Dick Johnson) @ 2006-02-28 13:10 UTC (permalink / raw)
To: col-pepper; +Cc: linux-kernel
On Mon, 27 Feb 2006 col-pepper@piments.com wrote:
> On Mon, 27 Feb 2006 22:32:07 +0100, linux-os (Dick Johnson)
> <linux-os@analogic.com> wrote:
>
>> Flash does not get zeroed to be written! It gets erased, which sets all
>> the bits to '1', i.e., all bytes to 0xff.
>
> Thanks for the correction, but that does not change the discussion.
>
>> Further, the designers of
>> flash disks are not stupid as you assume. The direct access occurs
>> to static RAM (read/write stuff).
>
> I'm not assuming anything . Some hardware has been killed by this issue.
> http://lkml.org/lkml/2005/5/13/144
No. That hardware was not killed by that issue. The writer, or another
who had encountered the same issue, eventually repartitioned and
reformatted the device. The partition table had gotten corrupted by
some experiments and the writer assumed that the device was broken.
It wasn't.
Also, if you read other elements in this thread, you would have
learned about something that has become somewhat of a red herring.
It takes about a second to erase a 64k physical sector. This is
a required operation before it is written. Since the projected
life of these new devices is about 5 to 10 million such cycles,
(older NAND flash used in modems was only 100-200k) the writer
would have to be running that "brand new device" for at least
5 million seconds. Let's see:
60 seconds per minute
3600 seconds per hour
86400 seconds per day.
5,000,000 / 86400 = 57 days of continuous writes to the same
sector. The writer surely would have a strange file because
he states that even a single large file can destroy the drive
if it is mounted with the "sync" option.
Also, the failure mode of NAND flash is not that it becomes
"destroyed". The failure mode is a slow loss of data. The
devices no longer retain data for a zillion years, only a
few hundred, eventually, only a year or so. So when somebody
claims that the flash has gotten destroyed, they need to have
written it for a few months, then waited for a few years before
reporting the event.
Clearly the writer is wrong.
>
> It seems that it's you making the assumption that all of these devices are
> manufactured the same way.
>
> The constant dirtying of the buffer will still cause excessive use of the
> flash block hosting the FAT. Clearly not all devices use a load spreading
> mechanism and this can lead to premature failure.
>
Cheers,
Dick Johnson
Penguin : Linux version 2.6.15.4 on an i686 machine (5589.53 BogoMips).
Warning : 98.36% of all statistics are fiction, book release in April.
_
* Re: o_sync in vfat driver
2006-02-28 13:10 ` linux-os (Dick Johnson)
@ 2006-02-28 13:52 ` Sergei Organov
2006-02-28 15:18 ` Lennart Sorensen
2006-02-28 17:16 ` col-pepper
2 siblings, 0 replies; 42+ messages in thread
From: Sergei Organov @ 2006-02-28 13:52 UTC (permalink / raw)
To: linux-kernel
"linux-os \(Dick Johnson\)" <linux-os@analogic.com> writes:
> On Mon, 27 Feb 2006 col-pepper@piments.com wrote:
>
>> On Mon, 27 Feb 2006 22:32:07 +0100, linux-os (Dick Johnson)
>> <linux-os@analogic.com> wrote:
>>
[...]
> It takes about a second to erase a 64k physical sector. This is
> a required operation before it is written.
> Since the projected life of these new devices is about 5 to 10 million
> such cycles, (older NAND flash used in modems was only 100-200k) the
> writer would have to be running that "brand new device" for at least 5
> million seconds. Let's see:
What FLASH are you talking about? I work with NAND FLASH chips directly
in embedded projects, and for both Toshiba and Samsung NAND FLASH the
erase time of 128Kb (64K words) block is 2 milliseconds typical. Page
program time is 0.3 milliseconds typical, so, having 64 pages per block,
total erase-write block cycle is about 22ms.
Those chips indeed support about 100K program/erase cycles. Well, maybe
there are some new NAND FLASH chips that support more program/erase
cycles (just checked Samsung but found none), but I doubt they are 1000
times slower for block erase anyway.
-- Sergei.
* Re: o_sync in vfat driver
2006-02-28 13:10 ` linux-os (Dick Johnson)
2006-02-28 13:52 ` Sergei Organov
@ 2006-02-28 15:18 ` Lennart Sorensen
2006-02-28 16:16 ` linux-os (Dick Johnson)
2006-02-28 17:16 ` col-pepper
2 siblings, 1 reply; 42+ messages in thread
From: Lennart Sorensen @ 2006-02-28 15:18 UTC (permalink / raw)
To: linux-os (Dick Johnson); +Cc: col-pepper, linux-kernel
On Tue, Feb 28, 2006 at 08:10:44AM -0500, linux-os (Dick Johnson) wrote:
>
> On Mon, 27 Feb 2006 col-pepper@piments.com wrote:
>
> > On Mon, 27 Feb 2006 22:32:07 +0100, linux-os (Dick Johnson)
> > <linux-os@analogic.com> wrote:
> >
> >> Flash does not get zeroed to be written! It gets erased, which sets all
> >> the bits to '1', i.e., all bytes to 0xff.
> >
> > Thanks for the correction, but that does not change the discussion.
> >
> >> Further, the designers of
> >> flash disks are not stupid as you assume. The direct access occurs
> >> to static RAM (read/write stuff).
> >
> > I'm not assuming anything . Some hardware has been killed by this issue.
> > http://lkml.org/lkml/2005/5/13/144
>
> No. That hardware was not killed by that issue. The writer, or another
> who had encountered the same issue, eventually repartitioned and
> reformatted the device. The partition table had gotten corrupted by
> some experiments and the writer assumed that the device was broken.
> It wasn't.
>
> Also, if you read other elements in this thread, you would have
> learned about something that has become somewhat of a red herring.
>
> It takes about a second to erase a 64k physical sector. This is
> a required operation before it is written. Since the projected
> life of these new devices is about 5 to 10 million such cycles,
> (older NAND flash used in modems was only 100-200k) the writer
> would have to be running that "brand new device" for at least
> 5 million seconds. Let's see:
How come I can write to my compact flash at about 2M/s if you claim it
takes 1s to erase a 64k sector? Somehow I think your number is much too
high. Or it can do multiple erases at the same time.
Also the 5 to 10 million is a lot higher than the numbers the makers of
the compact flash cards I use claim.
> 60 seconds per minute
> 3600 seconds per hour
> 86400 seconds per day.
>
> 5,000,000 / 86400 = 57 days of continuous writes to the same
> sector. The writer surely would have a strange file because
> he states that even a single large file can destroy the drive
> if it is mounted with the "sync" option.
>
> Also, the failure mode of NAND flash is not that it becomes
> "destroyed". The failure mode is a slow loss of data. The
> devices no longer retain data for a zillion years, only a
> few hundred, eventually, only a year or so. So when somebody
> claims that the flash has gotten destroyed, they need to have
> written it for a few months, then waited for a few years before
> reporting the event.
Some flash devices can be "destroyed" by losing power in the middle of
a write, since this causes them to corrupt their table of blocks and
only the manufacturer has the ability to reset that. Fortunately this
doesn't seem like too common a design.
Len Sorensen
* Re: o_sync in vfat driver
2006-02-27 21:04 ` col-pepper
2006-02-27 21:17 ` Arjan van de Ven
2006-02-27 21:32 ` linux-os (Dick Johnson)
@ 2006-02-28 16:11 ` Helge Hafting
2006-02-28 22:37 ` Pavel Machek
3 siblings, 0 replies; 42+ messages in thread
From: Helge Hafting @ 2006-02-28 16:11 UTC (permalink / raw)
To: col-pepper
Cc: Anton Altaparmakov, Arjan van de Ven, Lennart Sorensen,
linux-kernel
col-pepper@piments.com wrote:
> On Mon, 27 Feb 2006 15:41:44 +0100, Anton Altaparmakov
> <aia21@cam.ac.uk> wrote:
>
>> On Mon, 2006-02-27 at 15:27 +0100, Arjan van de Ven wrote:
>>
>>> On Mon, 2006-02-27 at 14:06 +0000, Anton Altaparmakov wrote:
>>> > On Mon, 2006-02-27 at 14:50 +0100, Arjan van de Ven wrote:
>>> > > On Mon, 2006-02-27 at 08:28 -0500, Lennart Sorensen wrote:
>>> > > > On Sun, Feb 26, 2006 at 11:50:40PM +0100,
>>> col-pepper@piments.com wrote:
>>> > > > > Hi,
>>> > > > >
>>> > > > > OMG what do I have to do to post here? 10th attempt.
>>> > > > > {part2}
>>> > > > >
>>> > > > > Here is a non-exhaustive list of typical devices types
>>> requiring fat vfat
>>> > > > > support:
>>> > > > >
>>> > > > > fd ide-hd scsi-hd usb-hd cdrom usb-hd usb-handheld (iPod,
>>> iRiver etc)
>>> > > > > usb-flash (usbsticks, cameras, some music devices.)
>>> > > > >
>>> > > > > IIRC the sync mount option for vfat is ignored for file
>>> systems >2G, this
>>> > > > > effectively (and probably intentionally) excludes nearly all
>>> hd partitions
>>> > > > > and iPod type devices.
>>> > > >
>>> > > > I think many people wish it was ignored on smaller devices
>>> too given
>>> > > > what it does to write performance.
>>> > >
>>> > > well. If you don't want it *DO NOT USE IT AT THE MOUNT COMMAND
>>> LINE* !!!
>>> >
>>> > That is easy to say when you are using the command line... Modern
>>> > distros (as you know I am sure) mount all hot-plug devices like usb
>>> > keys, usb hard disks, etc automatically at plug-in time and at least
>>> > some distros use "-o sync"
>>>
>>> that is a bad misdesign of that distro or at least the tool the distro
>>> uses for this (I don't know which it is so I can say that without
>>> sounding partial :)
>>>
>>> the tool that decides to use "sync", or at least the author thereof,
>>> should be aware of what flash is, and that it has a limited lifespan
>>> etc
>>> etc, and that you thus want maximum caching etc.
>>
>>
>> I agree completely which is why we hack the system to remove the o_sync
>> on our distro derivative. (-:
>>
>> But my point was that your solution of "don't do that then" is not much
>> use to your average user who sits in front of such distro in graphical
>> desktop as they are not technical enough to find and hack their hotplug
>> system to work properly...
>>
>> Best regards,
>>
>> Anton
>
>
>>> If you don't want it *DO NOT USE IT AT THE MOUNT COMMAND LINE* !!!
>>
>
> Yeah, clever.
> That is not really a constructive response. I don't use , I do use
> command line mount all the time. I never was in danger of damaging my
> drive with this new "feature".
>
> Telling a user who has just burnt out a brand new 1GB usb device he
> should have RTFM and modified that HAL configuration to ensure it did
> not use sync is not likely to win much confidence in the linux kernel.
No problem in the kernel. The system is set up wrong. A simple user
may not be able to figure out his distro's hotplug setup to fix this,
but then this problem is the fault of _the distro_, not the kernel.
Complain to distributors instead.
There is no need for the kernel to treat o_sync VFAT in any special
way. The users, or more likely the distros, can skip that o_sync part.
Not all distros have such problems either. On Debian, I had to set up
/etc/fstab myself, where not specifying sync is easy enough.
>
> The point of raising this is that the vast majority of linux users
> have no awareness of this. If there is a danger of this sync
> implementation damaging hardware it should be done differently.
Which is why people are working on the "sync on close" alternative.
>
> More importantly this sync strategy is very likely _increasing_ the
> danger of data loss that is the core reason for using sync in the
> first place.
>
> To quote from my earlier post:
>
> The new model attempts to be more rigorous by updating the FAT every
> time a block of data is written. Thus the "hammering" of the physical
> memory hosting the FAT record.
>
> In view of the nature of flash memory this may actually be drastically
> increasing the chance that the whole FAT gets erased.
>
> If a pullout occurs during write , there is now a near 50% chance that
> this takes out the entire FAT.
No, only one FAT entry. And the users who pull out during writes
_really_ get what they deserve anyway. You don't need deep linux
knowledge for that. In the day of the floppy, people respected the
activity light regardless of OS.
Helge Hafting
* Re: o_sync in vfat driver
2006-02-28 15:18 ` Lennart Sorensen
@ 2006-02-28 16:16 ` linux-os (Dick Johnson)
2006-02-28 17:23 ` Sergei Organov
2006-02-28 18:09 ` Krzysztof Halasa
0 siblings, 2 replies; 42+ messages in thread
From: linux-os (Dick Johnson) @ 2006-02-28 16:16 UTC (permalink / raw)
To: Lennart Sorensen; +Cc: col-pepper, linux-kernel
On Tue, 28 Feb 2006, Lennart Sorensen wrote:
> On Tue, Feb 28, 2006 at 08:10:44AM -0500, linux-os (Dick Johnson) wrote:
>>
>> On Mon, 27 Feb 2006 col-pepper@piments.com wrote:
>>
>>> On Mon, 27 Feb 2006 22:32:07 +0100, linux-os (Dick Johnson)
>>> <linux-os@analogic.com> wrote:
>>>
>>>> Flash does not get zeroed to be written! It gets erased, which sets all
>>>> the bits to '1', i.e., all bytes to 0xff.
>>>
>>> Thanks for the correction, but that does not change the discussion.
>>>
>>>> Further, the designers of
>>>> flash disks are not stupid as you assume. The direct access occurs
>>>> to static RAM (read/write stuff).
>>>
>>> I'm not assuming anything . Some hardware has been killed by this issue.
>>> http://lkml.org/lkml/2005/5/13/144
>>
>> No. That hardware was not killed by that issue. The writer, or another
>> who had encountered the same issue, eventually repartitioned and
>> reformatted the device. The partition table had gotten corrupted by
>> some experiments and the writer assumed that the device was broken.
>> It wasn't.
>>
>> Also, if you read other elements in this thread, you would have
>> learned about something that has become somewhat of a red herring.
>>
>> It takes about a second to erase a 64k physical sector. This is
>> a required operation before it is written. Since the projected
>> life of these new devices is about 5 to 10 million such cycles,
>> (older NAND flash used in modems was only 100-200k) the writer
>> would have to be running that "brand new device" for at least
>> 5 million seconds. Let's see:
>
> How come I can write to my compact flash at about 2M/s if you claim it
> takes 1s to erase a 64k sector? Somehow I think your number is much too
> high. Or it can do multiple erases at the same time.
>
> Also the 5 to 10 million is a lot higher than the numbers the makers of
> the compact flash cards I use claim.
>
Here is an instrumented erase function from a driver that rewrites
the first sector of a BIOS ROM. Unlike the flash DISKS, the
BIOS ROM has no buffering in static RAM, so you can guesstimate
the actual time to erase:
//-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
//
// This erases a page and waits for the erasure to complete. It
// returns false if it failed.
//
static int erase(void *bios, int page)
{
    int era;
    flags_t flags;
    jiffie_t ticks, start;

    spin_lock_irqsave(&info->lock, flags);
    erase_page(bios, page);
    spin_unlock_irqrestore(&info->lock, flags);
    start = jiffies;
    ticks = jiffies + (ERA_TIME * HZ);
    era = 0x00;
    while(time_before(jiffies, ticks))
    {
        if((era = check_erase(bios, page)))
            break;
        if(signal_pending(current))
            break;
        set_current_state(TASK_INTERRUPTIBLE);
        schedule_timeout(1);
    }
    set_current_state(TASK_RUNNING);
    printk("They don't believe... %d\n", (int) (jiffies - start));
    return era;
}
[SNIPPED...]
On the system I rewrite a BIOS sector on, jiffies is 1024 ticks/second.
parport: PnPBIOS parport detected.
parport0: PC-style at 0x378, irq 7 [PCSPP,TRISTATE]
lp0: using parport0 (interrupt-driven).
lp0: console ready
device eth0 entered promiscuous mode
device eth0 left promiscuous mode
device eth0 entered promiscuous mode
device eth0 left promiscuous mode
Analogic-BiosDev : Initialization complete
They don't believe... 1004
Now, the wait for erase always sleeps for at least a timer-tick
(about a millisecond), so this may take longer than the physical
erase, but not much longer.
The erase function is:
#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
#
# This erases the NVRAM page (block). It doesn't wait for completion.
# Each block is 64k in length.
# M29W040B chip
#
.section .text
erase_page:
        pushl   %ebx
        movl    BUF(%esp), %ebx         # Address of the chip
        movl    DAT(%esp), %ecx         # The page
        andl    $0x07, %ecx             # Max pages
        shll    $0x10, %ecx             # Times 64k
        movb    $0xf0, (%ebx)           # Reset
        movb    $0xaa, 0x555(%ebx)
        movb    $0x55, 0x2aa(%ebx)
        movb    $0x80, 0x555(%ebx)
        movb    $0xaa, 0x555(%ebx)
        movb    $0x55, 0x2aa(%ebx)
        movb    $0x30, (%ecx,%ebx)
        popl    %ebx
        ret
.size erase_page,.-erase_page
.type erase_page,@function
.global erase_page
And the check-erase function is this:
#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
#
# This reads the whole M29W040B page, looking for all 0xffff words.
# It returns non-zero if it has been erased and zero otherwise.
#
check_erase:
        pushl   %edi
        movl    BUF(%esp), %edi         # Point to buffer
        movl    DAT(%esp), %eax         # 64k page
        andl    $0x07, %eax             # Max pages possible
        shll    $0x10, %eax             # Times 64k
        addl    %eax, %edi              # Offset to start
        cld
        movl    $0x8000, %ecx           # Number of words to check
        movl    $-1, %eax               # What to look for
        repz    scasw                   # Look for all 0xffff
        jz      1f                      # All erased
        incl    %eax                    # -1 becomes zero
1:      popl    %edi
        ret
.size check_erase,.-check_erase
.type check_erase,@function
.global check_erase
>> 60 seconds per minute
>> 3600 seconds per hour
>> 86400 seconds per day.
>>
>> 5,000,000 / 86400 = 57 days of continuous writes to the same
>> sector. The writer surely would have a strange file because
>> he states that even a single large file can destroy the drive
>> if it is mounted with the "sync" option.
>>
>> Also, the failure mode of NAND flash is not that it becomes
>> "destroyed". The failure mode is a slow loss of data. The
>> devices no longer retain data for a zillion years, only a
>> few hundred, eventually, only a year or so. So when somebody
>> claims that the flash has gotten destroyed, they need to have
>> written it for a few months, then waited for a few years before
>> reporting the event.
>
> Some flash devices can be "destroyed" by losing power in the middle of
> a write, since this causes them to corrupt their table of blocks and
> only the manufacturer has the ability to reset that. Fortunately this
> doesn't seem like too common a design.
>
# dd if=/dev/zero of=/dev/whatever bs=1M count=128
Fixes a 128 megabyte flash disk, plug in other values for other
sizes.
> Len Sorensen
Cheers,
Dick Johnson
Penguin : Linux version 2.6.15.4 on an i686 machine (5589.54 BogoMips).
Warning : 98.36% of all statistics are fiction, book release in April.
_
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: o_sync in vfat driver
2006-02-28 13:10 ` linux-os (Dick Johnson)
2006-02-28 13:52 ` Sergei Organov
2006-02-28 15:18 ` Lennart Sorensen
@ 2006-02-28 17:16 ` col-pepper
2 siblings, 0 replies; 42+ messages in thread
From: col-pepper @ 2006-02-28 17:16 UTC (permalink / raw)
To: linux-os (Dick Johnson); +Cc: linux-kernel@vger.kernel.org
On Tue, 28 Feb 2006 14:10:44 +0100, linux-os (Dick Johnson)
<linux-os@analogic.com> wrote:
> No. That hardware was not killed by that issue. The writer, or another
> who had encountered the same issue, eventually repartitioned and
> reformatted the device. The partition table had gotten corrupted by
> some experiments and the writer assumed that the device was broken.
> It wasn't.
I did not get the info you posted from that thread so maybe I missed
something you saw. Or indeed it was someone else.
Many thanks for your comments. If this is a false alert all the better.
> Also, the failure mode of NAND flash is not that it becomes
> "destroyed". The failure mode is a slow loss of data. The
> devices no longer retain data for a zillion years, only a
> few hundred, eventually, only a year or so.
There was a comment about the failure mode, no time scale was given. I see
no reason why the degradation would stop at a year though.
> Since the projected life of these new devices is about 5 to 10 million
> such cycles (older NAND flash used in modems was only 100-200k)
Maybe some of the cheap devices are not using the new flash memory, in
which case it would come down to between 24 and 48 hours of constant use.
That would be a realistic problem.
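For reference, the endurance arithmetic in this exchange can be reproduced with a few lines of C. This is only a sketch under the thread's own worst-case assumptions: one second per erase cycle and every erase landing on the same physical sector, with no wear levelling. The function names are made up for illustration.

```c
#include <stdio.h>

/* Seconds of back-to-back erases needed to reach the rated cycle count
 * on one sector. Assumes no wear levelling: the worst case in the thread. */
unsigned long long wear_out_seconds(unsigned long long rated_cycles,
                                    unsigned long long erase_time_ms)
{
    return rated_cycles * erase_time_ms / 1000ULL;
}

/* Print the figures discussed above, using the 1 s per erase estimate. */
void print_wear_estimates(void)
{
    printf("5M cycles:   %.1f days\n",
           wear_out_seconds(5000000ULL, 1000) / 86400.0);
    printf("100k cycles: %.1f hours\n",
           wear_out_seconds(100000ULL, 1000) / 3600.0);
    printf("200k cycles: %.1f hours\n",
           wear_out_seconds(200000ULL, 1000) / 3600.0);
}
```

With these inputs the estimate comes to about 57.9 days for 5 million cycles, and roughly 28 to 56 hours for 100k to 200k cycles, which is the same order as the 24 to 48 hours mentioned above.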
Alan Cox referred to some devices that could be damaged as "crap", so it
seems he is aware of some hardware differences.
In conclusion it seems from Andrew Morton's posts that the way this is
handled is under review so I am confident that a robust and stable
solution will result.
Thanks again for your thoughts on this.
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: o_sync in vfat driver
2006-02-28 16:16 ` linux-os (Dick Johnson)
@ 2006-02-28 17:23 ` Sergei Organov
2006-02-28 18:09 ` Krzysztof Halasa
1 sibling, 0 replies; 42+ messages in thread
From: Sergei Organov @ 2006-02-28 17:23 UTC (permalink / raw)
To: linux-kernel
"linux-os \(Dick Johnson\)" <linux-os@analogic.com> writes:
> On Tue, 28 Feb 2006, Lennart Sorensen wrote:
>
>> On Tue, Feb 28, 2006 at 08:10:44AM -0500, linux-os (Dick Johnson) wrote:
>>>
>>> On Mon, 27 Feb 2006 col-pepper@piments.com wrote:
>>>
>>>> On Mon, 27 Feb 2006 22:32:07 +0100, linux-os (Dick Johnson)
>>>> <linux-os@analogic.com> wrote:
>>>>
>>>>> Flash does not get zeroed to be written! It gets erased, which sets all
>>>>> the bits to '1', i.e., all bytes to 0xff.
>>>>
>>>> Thanks for the correction, but that does not change the discussion.
>>>>
>>>>> Further, the designers of
>>>>> flash disks are not stupid as you assume. The direct access occurs
>>>>> to static RAM (read/write stuff).
>>>>
>>>> I'm not assuming anything . Some hardware has been killed by this issue.
>>>> http://lkml.org/lkml/2005/5/13/144
>>>
>>> No. That hardware was not killed by that issue. The writer, or another
>>> who had encountered the same issue, eventually repartitioned and
>>> reformatted the device. The partition table had gotten corrupted by
>>> some experiments and the writer assumed that the device was broken.
>>> It wasn't.
>>>
>>> Also, if you read other elements in this thread, you would have
>>> learned about something that has become somewhat of a red herring.
>>>
>>> It takes about a second to erase a 64k physical sector. This is
>>> a required operation before it is written. Since the projected
>>> life of these new devices is about 5 to 10 million such cycles,
>>> (older NAND flash used in modems was only 100-200k) the writer
>>> would have to be running that "brand new device" for at least
>>> 5 million seconds. Let's see:
>>
>> How come I can write to my compact flash at about 2M/s if you claim it
>> takes 1s to erase a 64k sector? Somehow I think your number is much too
>> high. Or it can do multiple erases at the same time.
>>
>> Also the 5 to 10 million is a lot higher than the numbers the makers of
>> the compact flash cards I use claim.
>>
>
> Here is an instrumented erase function on a driver that rewrites
> the first sector of a BIOS ROM. Unlike the Flash DISKS, the
> BIOS ROM has no buffering in static RAM so you can guesstimate
> the actual time to erase............
BIOS ROM is never NAND FLASH; it's most probably NOR FLASH, and FLASH
DISKS are most probably NAND FLASH. NOR and NAND are very different
technologies. You are comparing apples and oranges; static RAM has nothing
to do with that.
-- Sergei.
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: o_sync in vfat driver
2006-02-28 16:16 ` linux-os (Dick Johnson)
2006-02-28 17:23 ` Sergei Organov
@ 2006-02-28 18:09 ` Krzysztof Halasa
1 sibling, 0 replies; 42+ messages in thread
From: Krzysztof Halasa @ 2006-02-28 18:09 UTC (permalink / raw)
To: linux-os (Dick Johnson); +Cc: Lennart Sorensen, col-pepper, linux-kernel
"linux-os \(Dick Johnson\)" <linux-os@analogic.com> writes:
> Here is an instrumented erase function on a driver that rewrites
> the first sector of a BIOS ROM. Unlike the Flash DISKS, the
> BIOS ROM has no buffering in static RAM so you can guesstimate
> the actual time to erase............
NOR flash is different, but the Samsung manual for the K9F5608U0A-YCB0 /
K9F5608U0A-YIB0 32M x 8 Bit NAND Flash Memory says:
FEATURES GENERAL DESCRIPTION
- Page Program : (512 + 16)Byte
- Block Erase : (16K + 512)Byte
- Program time : 200us(Typ.)
- Block Erase Time : 2ms(Typ.)
- Endurance : 100K Program/Erase Cycles
- Data Retention : 10 Years
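To put the datasheet numbers next to the "1 s per 64k sector" estimate, here is a rough sketch in C. It assumes the typical figures listed above (16K blocks, 512-byte pages, 200 us program, 2 ms erase) and ignores bus transfer time, the spare area, and any controller overhead. The constant and function names are illustrative, not taken from any driver.

```c
/* Typical figures from the datasheet excerpt above. */
enum {
    NAND_PAGE_BYTES  = 512,     /* data bytes per page (spare area ignored) */
    NAND_BLOCK_BYTES = 16384,   /* data bytes per erase block */
    NAND_PROGRAM_US  = 200,     /* typical page program time */
    NAND_ERASE_US    = 2000,    /* typical block erase time */
    NAND_ENDURANCE   = 100000   /* rated program/erase cycles */
};

/* Microseconds to erase one block and program all of its pages. */
unsigned long block_rewrite_us(void)
{
    return NAND_ERASE_US
         + (NAND_BLOCK_BYTES / NAND_PAGE_BYTES) * NAND_PROGRAM_US;
}

/* Sustained write rate in bytes per second implied by that figure. */
unsigned long long write_bytes_per_sec(void)
{
    return (unsigned long long)NAND_BLOCK_BYTES * 1000000ULL
         / block_rewrite_us();
}

/* Seconds of continuously rewriting one block to reach rated endurance. */
unsigned long long single_block_wear_seconds(void)
{
    return (unsigned long long)NAND_ENDURANCE * block_rewrite_us()
         / 1000000ULL;
}
```

Under these assumptions a full block rewrite takes 8.4 ms, which works out to a little under 2 MB/s, close to the roughly 2M/s reported earlier in the thread. Hammering a single block with no wear levelling would reach the rated 100K cycles after about 840 seconds.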
--
Krzysztof Halasa
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: o_sync in vfat driver
2006-02-27 23:12 ` Andrew Morton
@ 2006-02-28 18:47 ` Chris Mason
2006-02-28 19:10 ` Andrew Morton
` (2 more replies)
0 siblings, 3 replies; 42+ messages in thread
From: Chris Mason @ 2006-02-28 18:47 UTC (permalink / raw)
To: Andrew Morton; +Cc: col-pepper, linux-kernel
On Monday 27 February 2006 18:12, Andrew Morton wrote:
> We don't know that the same number of same-sized write()s were happening in
> each case.
>
> There's been some talk about implementing fsync()-on-file-close for this
> problem, and some protopatches. But nothing final yet.
Here's the patch I'm using in -suse right now. What I want to do is make a
much more generic -o flush, but it'll still need a few bits in individual
filesystem to kick off metadata writes quickly.
The basic goal behind the code is to trigger writes without waiting for both
data and metadata. If the user is watching the memory stick, when the
little light stops flashing all the data and metadata will be on disk.
It also generally throttles userland a little during file release. This
could be changed to throttle for each page dirtied, but most users I
asked liked the current setup better.
-chris
From: Chris Mason <mason@suse.com>
Subject: add -o flush for fat
Fat is commonly used on removable media, mounting with -o flush tells the
FS to write things to disk as quickly as possible. It is like -o sync, but
much faster (and not as safe).
diff -r a06cef570da0 fs/fat/file.c
--- a/fs/fat/file.c Sun Jan 15 11:59:32 2006 -0500
+++ b/fs/fat/file.c Sun Jan 15 13:00:35 2006 -0500
@@ -13,6 +13,7 @@
#include <linux/smp_lock.h>
#include <linux/buffer_head.h>
#include <linux/writeback.h>
+#include <linux/blkdev.h>
int fat_generic_ioctl(struct inode *inode, struct file *filp,
unsigned int cmd, unsigned long arg)
@@ -112,6 +113,19 @@ int fat_generic_ioctl(struct inode *inod
}
}
+static int
+fat_file_release(struct inode *inode, struct file *filp)
+{
+
+ if ((filp->f_mode & FMODE_WRITE) &&
+ MSDOS_SB(inode->i_sb)->options.flush) {
+ writeback_inode(inode);
+ writeback_bdev(inode->i_sb);
+ blk_congestion_wait(WRITE, HZ/10);
+ }
+ return 0;
+}
+
struct file_operations fat_file_operations = {
.llseek = generic_file_llseek,
.read = do_sync_read,
@@ -121,6 +135,7 @@ struct file_operations fat_file_operatio
.aio_read = generic_file_aio_read,
.aio_write = generic_file_aio_write,
.mmap = generic_file_mmap,
+ .release = fat_file_release,
.ioctl = fat_generic_ioctl,
.fsync = file_fsync,
.sendfile = generic_file_sendfile,
@@ -293,6 +308,10 @@ void fat_truncate(struct inode *inode)
lock_kernel();
fat_free(inode, nr_clusters);
unlock_kernel();
+ if (MSDOS_SB(inode->i_sb)->options.flush) {
+ writeback_inode(inode);
+ writeback_bdev(inode->i_sb);
+ }
}
struct inode_operations fat_file_inode_operations = {
diff -r a06cef570da0 fs/fat/inode.c
--- a/fs/fat/inode.c Sun Jan 15 11:59:32 2006 -0500
+++ b/fs/fat/inode.c Sun Jan 15 13:00:35 2006 -0500
@@ -24,6 +24,7 @@
#include <linux/vfs.h>
#include <linux/parser.h>
#include <linux/uio.h>
+#include <linux/writeback.h>
#include <asm/unaligned.h>
#ifndef CONFIG_FAT_DEFAULT_IOCHARSET
@@ -860,7 +861,7 @@ enum {
Opt_charset, Opt_shortname_lower, Opt_shortname_win95,
Opt_shortname_winnt, Opt_shortname_mixed, Opt_utf8_no, Opt_utf8_yes,
Opt_uni_xl_no, Opt_uni_xl_yes, Opt_nonumtail_no, Opt_nonumtail_yes,
- Opt_obsolate, Opt_err,
+ Opt_obsolate, Opt_flush, Opt_err,
};
static match_table_t fat_tokens = {
@@ -892,7 +893,8 @@ static match_table_t fat_tokens = {
{Opt_obsolate, "cvf_format=%20s"},
{Opt_obsolate, "cvf_options=%100s"},
{Opt_obsolate, "posix"},
- {Opt_err, NULL}
+ {Opt_flush, "flush"},
+ {Opt_err, NULL},
};
static match_table_t msdos_tokens = {
{Opt_nodots, "nodots"},
@@ -1033,6 +1035,9 @@ static int parse_options(char *options,
return 0;
opts->codepage = option;
break;
+ case Opt_flush:
+ opts->flush = 1;
+ break;
/* msdos specific */
case Opt_dots:
diff -r a06cef570da0 fs/fs-writeback.c
--- a/fs/fs-writeback.c Sun Jan 15 11:59:32 2006 -0500
+++ b/fs/fs-writeback.c Sun Jan 15 13:00:35 2006 -0500
@@ -390,6 +390,29 @@ sync_sb_inodes(struct super_block *sb, s
return; /* Leave any unwritten inodes on s_io */
}
+void
+writeback_bdev(struct super_block *sb)
+{
+ struct address_space *mapping = sb->s_bdev->bd_inode->i_mapping;
+ filemap_flush(mapping);
+ blk_run_address_space(mapping);
+}
+EXPORT_SYMBOL_GPL(writeback_bdev);
+
+void
+writeback_inode(struct inode *inode)
+{
+
+ struct address_space *mapping = inode->i_mapping;
+ struct writeback_control wbc = {
+ .sync_mode = WB_SYNC_NONE,
+ .nr_to_write = 0,
+ };
+ sync_inode(inode, &wbc);
+ filemap_fdatawrite(mapping);
+}
+EXPORT_SYMBOL_GPL(writeback_inode);
+
/*
* Start writeback of dirty pagecache data against all unlocked inodes.
*
diff -r a06cef570da0 fs/msdos/namei.c
--- a/fs/msdos/namei.c Sun Jan 15 11:59:32 2006 -0500
+++ b/fs/msdos/namei.c Sun Jan 15 13:00:35 2006 -0500
@@ -11,6 +11,7 @@
#include <linux/buffer_head.h>
#include <linux/msdos_fs.h>
#include <linux/smp_lock.h>
+#include <linux/writeback.h>
/* MS-DOS "device special files" */
static const unsigned char *reserved_names[] = {
@@ -293,7 +294,7 @@ static int msdos_create(struct inode *di
struct nameidata *nd)
{
struct super_block *sb = dir->i_sb;
- struct inode *inode;
+ struct inode *inode = NULL;
struct fat_slot_info sinfo;
struct timespec ts;
unsigned char msdos_name[MSDOS_NAME];
@@ -329,6 +330,11 @@ static int msdos_create(struct inode *di
d_instantiate(dentry, inode);
out:
unlock_kernel();
+ if (!err && MSDOS_SB(sb)->options.flush) {
+ writeback_inode(dir);
+ writeback_inode(inode);
+ writeback_bdev(sb);
+ }
return err;
}
@@ -361,6 +367,11 @@ static int msdos_rmdir(struct inode *dir
fat_detach(inode);
out:
unlock_kernel();
+ if (!err && MSDOS_SB(inode->i_sb)->options.flush) {
+ writeback_inode(dir);
+ writeback_inode(inode);
+ writeback_bdev(inode->i_sb);
+ }
return err;
}
@@ -414,6 +425,11 @@ static int msdos_mkdir(struct inode *dir
d_instantiate(dentry, inode);
unlock_kernel();
+ if (MSDOS_SB(sb)->options.flush) {
+ writeback_inode(dir);
+ writeback_inode(inode);
+ writeback_bdev(sb);
+ }
return 0;
out_free:
@@ -443,6 +459,11 @@ static int msdos_unlink(struct inode *di
fat_detach(inode);
out:
unlock_kernel();
+ if (!err && MSDOS_SB(inode->i_sb)->options.flush) {
+ writeback_inode(dir);
+ writeback_inode(inode);
+ writeback_bdev(inode->i_sb);
+ }
return err;
}
@@ -648,6 +669,11 @@ static int msdos_rename(struct inode *ol
new_dir, new_msdos_name, new_dentry, is_hid);
out:
unlock_kernel();
+ if (!err && MSDOS_SB(old_dir->i_sb)->options.flush) {
+ writeback_inode(old_dir);
+ writeback_inode(new_dir);
+ writeback_bdev(old_dir->i_sb);
+ }
return err;
}
diff -r a06cef570da0 include/linux/msdos_fs.h
--- a/include/linux/msdos_fs.h Sun Jan 15 11:59:32 2006 -0500
+++ b/include/linux/msdos_fs.h Sun Jan 15 13:00:35 2006 -0500
@@ -203,6 +203,7 @@ struct fat_mount_options {
unicode_xlate:1, /* create escape sequences for unhandled Unicode */
numtail:1, /* Does first alias have a numeric '~1' type tail? */
atari:1, /* Use Atari GEMDOS variation of MS-DOS fs */
+ flush:1, /* write things quickly */
nocase:1; /* Does this need case conversion? 0=need case conversion*/
};
diff -r a06cef570da0 include/linux/writeback.h
--- a/include/linux/writeback.h Sun Jan 15 11:59:32 2006 -0500
+++ b/include/linux/writeback.h Sun Jan 15 13:00:35 2006 -0500
@@ -68,6 +68,8 @@ int inode_wait(void *);
int inode_wait(void *);
void sync_inodes_sb(struct super_block *, int wait);
void sync_inodes(int wait);
+void writeback_bdev(struct super_block *);
+void writeback_inode(struct inode *);
/* writeback.h requires fs.h; it, too, is not included from here. */
static inline void wait_on_inode(struct inode *inode)
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: o_sync in vfat driver
2006-02-28 18:47 ` Chris Mason
@ 2006-02-28 19:10 ` Andrew Morton
2006-02-28 19:48 ` Chris Mason
[not found] ` <87u0aiw6pi.fsf@duaron.myhome.or.jp>
2006-03-29 2:13 ` Mathis Ahrens
2 siblings, 1 reply; 42+ messages in thread
From: Andrew Morton @ 2006-02-28 19:10 UTC (permalink / raw)
To: Chris Mason; +Cc: col-pepper, linux-kernel
Chris Mason <mason@suse.com> wrote:
>
> On Monday 27 February 2006 18:12, Andrew Morton wrote:
>
> > We don't know that the same number of same-sized write()s were happening in
> > each case.
> >
> > There's been some talk about implementing fsync()-on-file-close for this
> > problem, and some protopatches. But nothing final yet.
>
> Here's the patch I'm using in -suse right now. What I want to do is make a
> much more generic -o flush, but it'll still need a few bits in individual
> filesystem to kick off metadata writes quickly.
>
> The basic goal behind the code is to trigger writes without waiting for both
> data and metadata. If the user is watching the memory stick, when the
> little light stops flashing all the data and metadata will be on disk.
>
> It also generally throttles userland a little during file release. This
> could be changed to throttle for each page dirtied, but most users I
> asked liked the current setup better.
>
> ...
>
> +static int
> +fat_file_release(struct inode *inode, struct file *filp)
On a single line, please.
> + if (MSDOS_SB(inode->i_sb)->options.flush) {
Did you consider making `-o flush' a generic mount option rather than
msdos-only?
I guess there isn't a lot of demand for this for other filesystems, and
having an ignored option like this is a bit misleading...
> +void
> +writeback_inode(struct inode *inode)
> +{
> +
> + struct address_space *mapping = inode->i_mapping;
> + struct writeback_control wbc = {
> + .sync_mode = WB_SYNC_NONE,
> + .nr_to_write = 0,
> + };
> + sync_inode(inode, &wbc);
> + filemap_fdatawrite(mapping);
I think that filemap_fdatawrite() will be a no-op?
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: o_sync in vfat driver
2006-02-28 19:10 ` Andrew Morton
@ 2006-02-28 19:48 ` Chris Mason
0 siblings, 0 replies; 42+ messages in thread
From: Chris Mason @ 2006-02-28 19:48 UTC (permalink / raw)
To: Andrew Morton; +Cc: col-pepper, linux-kernel
On Tuesday 28 February 2006 14:10, Andrew Morton wrote:
> On a single line, please.
>
Ack.
> > + if (MSDOS_SB(inode->i_sb)->options.flush) {
>
> Did you consider making `-o flush' a generic mount option rather than
> msdos-only?
Yes, long term I think the generic option is better. I have three or so ideas
for a generic patch:
1) When the block device leaves congestion, it asks for more io
2) pdflush operation that tries to constantly keep a given block device
congested
3) my current patch aggregated to other filesystems that people want -o flush
on.
I've made a few stabs at #1, but didn't like the end result. #2 seems like
the best choice so far. If I got it working nicely I would add the generic
option, otherwise with option #3 it's probably best to keep it per FS.
The main goal for my current patch was to find out if this functionality will
actually make people happy (so far the beta testers like it). If the
complaints are low, it's worth the time to add something generic.
>
> I guess there isn't a lot of demand for this for other filesystems, and
> having an ignored option like this is a bit misleading...
>
> > +void
> > +writeback_inode(struct inode *inode)
> > +{
> > +
> > + struct address_space *mapping = inode->i_mapping;
> > + struct writeback_control wbc = {
> > + .sync_mode = WB_SYNC_NONE,
> > + .nr_to_write = 0,
> > + };
> > + sync_inode(inode, &wbc);
> > + filemap_fdatawrite(mapping);
>
> I think that filemap_fdatawrite() will be a no-op?
This part is nasty, I want to write all of the file data pages and write the
inode without waiting on it. The nr_to_write = 0 will make sure that
sync_inode only writes the inode, and WB_SYNC_NONE makes sure it does not
wait for that io to finish.
What I really want is WB_SYNC_NONE in mpage_writepages, but I don't want to
trigger this code:
if (wbc->sync_mode == WB_SYNC_NONE) {
index = mapping->writeback_index; /* Start from prev offset */
So, I use filemap_fdatawrite to make sure all of the data pages get written.
It's not perfect, but I was going for minimal changes outside of fat.
-chris
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: o_sync in vfat driver
2006-02-27 21:04 ` col-pepper
` (2 preceding siblings ...)
2006-02-28 16:11 ` Helge Hafting
@ 2006-02-28 22:37 ` Pavel Machek
3 siblings, 0 replies; 42+ messages in thread
From: Pavel Machek @ 2006-02-28 22:37 UTC (permalink / raw)
To: col-pepper
Cc: Anton Altaparmakov, Arjan van de Ven, Lennart Sorensen,
linux-kernel
Hi!
> > I agree completely which is why we hack the system to remove the o_sync
> > on our distro derivative. (-:
> >
> > But my point was that your solution of "don't do that then" is not much
> > use to your average user who sits in front of such distro in graphical
> > desktop as they are not technical enough to find and hack their hotplug
> > system to work properly...
> >
> > Best regards,
> >
> > Anton
>
> >> If you don't want it *DO NOT USE IT AT THE MOUNT COMMAND LINE* !!!
>
> Yeah, clever.
> That is not really a constructive response. I don't use , I do use command
> line mount all the time. I never was in danger of damaging my drive with
> this new "feature".
>
> Telling a user who has just burnt out a brand new 1GB usb device he should
> have RTFM and modified that HAL configuration to ensure it did not use
> sync is not likely to win much confidence in the linux kernel.
Return that 1GB usb device to manufacturer, it was broken.
> The point of raising this is that the vast majority of linux users have no
> awareness of this. If there is a danger of this sync implementation
> damaging hardware it should be done differently.
Fix the distribution, then.
Pavel
--
Web maintainer for suspend.sf.net (www.sf.net/projects/suspend) wanted...
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: o_sync in vfat driver
2006-02-27 23:21 ` col-pepper
2006-02-28 13:10 ` linux-os (Dick Johnson)
@ 2006-02-28 22:38 ` Pavel Machek
2006-03-01 4:28 ` Kyle Moffett
2006-03-02 8:23 ` col-pepper
1 sibling, 2 replies; 42+ messages in thread
From: Pavel Machek @ 2006-02-28 22:38 UTC (permalink / raw)
To: col-pepper; +Cc: linux-os (Dick Johnson), linux-kernel@vger.kernel.org
On Út 28-02-06 00:21:53, col-pepper@piments.com wrote:
> On Mon, 27 Feb 2006 22:32:07 +0100, linux-os (Dick Johnson)
> <linux-os@analogic.com> wrote:
>
> > Flash does not get zeroed to be written! It gets erased, which sets all
> > the bits to '1', i.e., all bytes to 0xff.
>
> Thanks for the correction, but that does not change the discussion.
>
> > Further, the designers of
> > flash disks are not stupid as you assume. The direct access occurs
> > to static RAM (read/write stuff).
>
> I'm not assuming anything . Some hardware has been killed by this issue.
> http://lkml.org/lkml/2005/5/13/144
I have seen flash disk dead in 5 minutes, even without o-sync. Those
devices are often crap. (I copied tar file to flash by cat foo.tar >
/dev/sda. That was apparently enough to kill that flash. Label "Yahoo"
should have warned me).
Pavel
--
Web maintainer for suspend.sf.net (www.sf.net/projects/suspend) wanted...
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: o_sync in vfat driver
2006-02-28 22:38 ` Pavel Machek
@ 2006-03-01 4:28 ` Kyle Moffett
2006-03-02 8:23 ` col-pepper
1 sibling, 0 replies; 42+ messages in thread
From: Kyle Moffett @ 2006-03-01 4:28 UTC (permalink / raw)
To: Pavel Machek; +Cc: col-pepper, LKML Kernel
On Feb 28, 2006, at 17:38:55, Pavel Machek wrote:
> I have seen flash disk dead in 5 minutes, even without o-sync.
> Those devices are often crap. (I copied tar file to flash by cat
> foo.tar > /dev/sda. That was apparently enough to kill that flash.
> Label "Yahoo" should have warned me).
Sometimes a flash device can have a temporary error condition that is
solved by rewriting the data. (I've seen it triggered by buggy USB
hubs that don't provide the rated power). It seems that a number of
flash drives have internal checks, and when those trigger they report
a bad sector (even if it isn't permanently bad). My 1GB flashdrive
failed in that way, and I was able to fix the error by erasing with
"dd if=/dev/full of=/dev/usbkey" and reformatting. After the error
occurred I started md5summing every file I put on the drive, but I've
been using it for a month now and not a single checksum has miscomputed.
Cheers,
Kyle Moffett
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: o_sync in vfat driver
[not found] ` <87u0aiw6pi.fsf@duaron.myhome.or.jp>
@ 2006-03-01 15:23 ` Chris Mason
[not found] ` <87mzg9wst0.fsf@duaron.myhome.or.jp>
0 siblings, 1 reply; 42+ messages in thread
From: Chris Mason @ 2006-03-01 15:23 UTC (permalink / raw)
To: OGAWA Hirofumi; +Cc: Andrew Morton, col-pepper, linux-kernel
On Wednesday 01 March 2006 10:00, OGAWA Hirofumi wrote:
> Chris Mason <mason@suse.com> writes:
> > @@ -329,6 +330,11 @@ static int msdos_create(struct inode *di
> > d_instantiate(dentry, inode);
> > out:
> > unlock_kernel();
> > + if (!err && MSDOS_SB(sb)->options.flush) {
> > + writeback_inode(dir);
> > + writeback_inode(inode);
> > + writeback_bdev(sb);
> > + }
> > return err;
> > }
>
> If buffers are already queued for I/O and you don't wait for anything,
> the buffers won't be (re-)submitted, so they will end up being flushed
> by the normal periodic wb_kupdate() after all.
Just to make sure we're using the same terms, do you mean the pages are marked
dirty and on the SB's dirty list, or do you mean the page has been through
writepage and is currently on its way to the disk?
>
> Do you have any plan to address it? Or am I just missing something?
If you mean the page is just dirty, it will get written by the
filemap_fdatawrite calls. If you mean the page is PG_writeback, it is
already on the way to the disk, so it passes the 'blinking light on the
memory stick' rule.
-chris
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: o_sync in vfat driver
2006-02-28 22:38 ` Pavel Machek
2006-03-01 4:28 ` Kyle Moffett
@ 2006-03-02 8:23 ` col-pepper
2006-03-02 8:32 ` Pavel Machek
1 sibling, 1 reply; 42+ messages in thread
From: col-pepper @ 2006-03-02 8:23 UTC (permalink / raw)
To: Pavel Machek; +Cc: linux-kernel@vger.kernel.org
On Tue, 28 Feb 2006 23:38:55 +0100, Pavel Machek <pavel@suse.cz> wrote:
> On Út 28-02-06 00:21:53, col-pepper@piments.com wrote:
>> On Mon, 27 Feb 2006 22:32:07 +0100, linux-os (Dick Johnson)
>> <linux-os@analogic.com> wrote:
>>
>> > Flash does not get zeroed to be written! It gets erased, which sets
>> all
>> > the bits to '1', i.e., all bytes to 0xff.
>>
>> Thanks for the correction, but that does not change the discussion.
>>
>> > Further, the designers of
>> > flash disks are not stupid as you assume. The direct access occurs
>> > to static RAM (read/write stuff).
>>
>> I'm not assuming anything . Some hardware has been killed by this issue.
>> http://lkml.org/lkml/2005/5/13/144
>
> I have seen flash disk dead in 5 minutes, even without o-sync. Those
> devices are often crap. (I copied tar file to flash by cat foo.tar >
> /dev/sda. That was apparently enough to kill that flash. Label "Yahoo"
> should have warned me).
> Pavel
If I'm not mistaken, writing to the device with cat will output that file
byte by byte. This would probably be even harder on the device than using
a formatted device with o_sync, since it would dirty a 64k block 64k times!
It seems some of the less elaborate devices don't support this type of use.
I suspect if you had tried using dd with a suitable bs you may still own a
crap Yahoo usb device.
Just because the linux kernel lets us use the abstract /dev devices freely
does not mean everything you can do with a /dev is a good idea for all h/w
that gets a device name.
I think that is the heart of the problem. Manufacturers are designing
these devices for the windows market. They are specifically designed and
supplied, preformatted with a fat fs, to be used in that way.
If linux distros, MacOS or anybody else wants to claim to support these
devices the default setup should probably handle the devices in a
_similar_ way to the native windows drivers.
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: o_sync in vfat driver
2006-03-02 8:23 ` col-pepper
@ 2006-03-02 8:32 ` Pavel Machek
0 siblings, 0 replies; 42+ messages in thread
From: Pavel Machek @ 2006-03-02 8:32 UTC (permalink / raw)
To: col-pepper; +Cc: linux-kernel@vger.kernel.org
On Čt 02-03-06 09:23:02, col-pepper@piments.com wrote:
> On Tue, 28 Feb 2006 23:38:55 +0100, Pavel Machek <pavel@suse.cz> wrote:
>
> >On Út 28-02-06 00:21:53, col-pepper@piments.com wrote:
> >>On Mon, 27 Feb 2006 22:32:07 +0100, linux-os (Dick Johnson)
> >><linux-os@analogic.com> wrote:
> >>
> >>> Flash does not get zeroed to be written! It gets erased, which sets
> >>all
> >>> the bits to '1', i.e., all bytes to 0xff.
> >>
> >>Thanks for the correction, but that does not change the discussion.
> >>
> >>> Further, the designers of
> >>> flash disks are not stupid as you assume. The direct access occurs
> >>> to static RAM (read/write stuff).
> >>
> >>I'm not assuming anything . Some hardware has been killed by this issue.
> >>http://lkml.org/lkml/2005/5/13/144
> >
> >I have seen flash disk dead in 5 minutes, even without o-sync. Those
> >devices are often crap. (I copied tar file to flash by cat foo.tar >
> >/dev/sda. That was apparently enough to kill that flash. Label "Yahoo"
> >should have warned me).
>
> If I'm not mistaken, writing to the device with cat will output that file
> byte by byte. This would probably be even harder on the device than using
> a formatted device with o_sync, since it would dirty a 64k block 64k
> times!
No.
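As a rough illustration of why the answer is no: a cat-style copy loop normally moves data in large chunks, not one byte per write(). The sketch below is not the actual cat source; the function name and the 64 KiB buffer size are assumptions chosen for illustration. Buffered writes to a block device node also go through the page cache, so the device does not see byte-sized requests either way.

```c
#include <unistd.h>
#include <errno.h>

#define COPY_BUF_SIZE (64 * 1024)   /* assumed chunk size, for illustration */

/*
 * Copy everything from in_fd to out_fd in COPY_BUF_SIZE chunks.
 * Returns the number of bytes copied, or -1 on error.
 */
long long copy_fd(int in_fd, int out_fd)
{
    static char buf[COPY_BUF_SIZE];
    long long total = 0;

    for (;;) {
        ssize_t n = read(in_fd, buf, sizeof(buf));
        if (n == 0)
            return total;               /* end of input */
        if (n < 0) {
            if (errno == EINTR)
                continue;               /* interrupted, retry the read */
            return -1;
        }
        /* write() may accept fewer bytes than asked; loop until done */
        for (ssize_t off = 0; off < n; ) {
            ssize_t w = write(out_fd, buf + off, (size_t)(n - off));
            if (w < 0) {
                if (errno == EINTR)
                    continue;           /* interrupted, retry the write */
                return -1;
            }
            off += w;
        }
        total += n;
    }
}
```

Calling copy_fd(STDIN_FILENO, STDOUT_FILENO) behaves like a bare cat: each write() hands the kernel up to 64 KiB at once.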
> It seems some of the less elaborate devices don't support this type of use.
>
> I suspect if you had tried using dd with a suitable bs you may still own a
> crap Yahoo usb device.
>
> Just because the linux kernel lets us use the abstract /dev devices freely
> does not mean everything you can do with a /dev is a good idea for all h/w
> that gets a device name.
>
> I think that is the heart of the problem. Manufacturers are designing
> these devices for the windows market. They are specifically designed and
> supplied, preformatted with a fat fs, to be used in that way.
There's a USB mass storage specification, and it says nothing about FAT
or the expected use of the device... if your device is a broken FAT thing
that will break if used any other way, do not advertise it as USB mass
storage.
Pavel
--
Web maintainer for suspend.sf.net (www.sf.net/projects/suspend) wanted...
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: o_sync in vfat driver
[not found] ` <87mzg9wst0.fsf@duaron.myhome.or.jp>
@ 2006-03-02 13:45 ` Chris Mason
2006-03-02 14:07 ` OGAWA Hirofumi
0 siblings, 1 reply; 42+ messages in thread
From: Chris Mason @ 2006-03-02 13:45 UTC (permalink / raw)
To: OGAWA Hirofumi; +Cc: Andrew Morton, col-pepper, linux-kernel
On Wednesday 01 March 2006 20:15, OGAWA Hirofumi wrote:
> > Just to make sure we're using the same terms, do you mean the pages are
> > marked dirty and on the SB's dirty list, or do you mean the page has been
> > through writepage and is currently on its way to the disk?
>
> The page is already on the device's request queue and already marked
> PG_writeback, but the device has not processed it yet.
>
> Then you call filemap_fdatawrite() again; it just re-dirties the page
> and queues it on sb->s_dirty, because the page's buffer_heads are still
> locked. So is the re-dirtied page re-submitted to the device by the
> periodic wb_kupdate()?
filemap_fdatawrite() won't redirty the page. It will wait on the pending
writeback.
-chris
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: o_sync in vfat driver
2006-03-02 13:45 ` Chris Mason
@ 2006-03-02 14:07 ` OGAWA Hirofumi
2006-03-02 17:01 ` Chris Mason
0 siblings, 1 reply; 42+ messages in thread
From: OGAWA Hirofumi @ 2006-03-02 14:07 UTC (permalink / raw)
To: Chris Mason; +Cc: Andrew Morton, col-pepper, linux-kernel
Chris Mason <mason@suse.com> writes:
> filemap_fdatawrite() won't redirty the page. It will wait on the pending
> writeback.
Umm... I'm looking at the following code.
+ if (MSDOS_SB(sb)->options.flush) {
+ writeback_inode(dir);
+ writeback_inode(inode);
+ writeback_bdev(sb);
+ }
+void
+writeback_bdev(struct super_block *sb)
+{
+ struct address_space *mapping = sb->s_bdev->bd_inode->i_mapping;
+ filemap_flush(mapping);
+ blk_run_address_space(mapping);
+}
+EXPORT_SYMBOL_GPL(writeback_bdev);
filemap_flush() is using WB_SYNC_NONE.
in mpage_writepages()
if (wbc->sync_mode != WB_SYNC_NONE)
wait_on_page_writeback(page);
if (PageWriteback(page) ||
!clear_page_dirty_for_io(page)) {
unlock_page(page);
continue;
}
Where does it wait?
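The behaviour being asked about can be modelled in a few lines of userspace C. This is a toy model only: the struct, enum, and function below are hypothetical stand-ins, not kernel code, and they mirror just the branch quoted above.

```c
#include <stdbool.h>

/* Stand-ins for WB_SYNC_NONE / WB_SYNC_ALL. */
enum sync_mode { SYNC_NONE, SYNC_ALL };

struct toy_page {
    bool dirty;       /* stand-in for the page dirty flag */
    bool writeback;   /* stand-in for PG_writeback */
};

/*
 * Returns true if the page would be submitted for I/O on this pass.
 * In SYNC_ALL mode we pretend the wait completes instantly, clearing
 * the writeback flag; in SYNC_NONE mode nothing waits.
 */
bool toy_writepage_pass(struct toy_page *page, enum sync_mode mode)
{
    if (mode != SYNC_NONE)
        page->writeback = false;    /* models wait_on_page_writeback() */

    if (page->writeback || !page->dirty)
        return false;               /* skipped, like the "continue" above */

    page->dirty = false;            /* models clear_page_dirty_for_io() */
    page->writeback = true;         /* page is now in flight */
    return true;
}
```

In this model a page that is dirty and already under writeback is skipped in SYNC_NONE mode and stays dirty, which is the case the question is about. In SYNC_ALL mode the same page is waited on and then submitted again.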
--
OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: o_sync in vfat driver
2006-03-02 14:07 ` OGAWA Hirofumi
@ 2006-03-02 17:01 ` Chris Mason
2006-03-02 18:14 ` OGAWA Hirofumi
0 siblings, 1 reply; 42+ messages in thread
From: Chris Mason @ 2006-03-02 17:01 UTC (permalink / raw)
To: OGAWA Hirofumi; +Cc: Andrew Morton, col-pepper, linux-kernel
On Thursday 02 March 2006 09:07, OGAWA Hirofumi wrote:
> Chris Mason <mason@suse.com> writes:
> > filemap_fdatawrite() won't redirty the page. It will wait on the pending
> > writeback.
>
> Umm... I'm looking at the following code.
>
> +void
> +writeback_bdev(struct super_block *sb)
> +{
> + struct address_space *mapping = sb->s_bdev->bd_inode->i_mapping;
> + filemap_flush(mapping);
> + blk_run_address_space(mapping);
> +}
> +EXPORT_SYMBOL_GPL(writeback_bdev);
>
> filemap_flush() is using WB_SYNC_NONE.
>
Ok, I thought you were asking about the code that called filemap_fdatawrite,
which does wait. filemap_flush is used on the underlying block device. In
the case of a page that is already under IO, the io is not cancelled but
allowed to continue.
This is the desired result. When you're doing a number of operations in
sequence, each operation will start io on the block device. If they used
filemap_fdatawrite instead of filemap_flush, they would end up being
synchronous.
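A rough timing model of that difference, as a sketch only: run_ops() and all the numbers are hypothetical, the device is assumed to handle one request at a time, and "wait_for_io" stands in for using a waiting call after each operation instead of a flush-style call that only starts IO.

```python
def run_ops(cpu_times, io_times, wait_for_io):
    """Model a sequence of operations that each dirty data and start IO.

    Returns (caller_done, device_idle): the time at which the caller has
    finished all operations, and the time at which the device has written
    everything out.
    """
    now = 0.0          # the caller's clock
    device_free = 0.0  # when the device finishes the IO queued so far
    for cpu, io in zip(cpu_times, io_times):
        now += cpu                     # the operation dirties pages in memory
        start = max(now, device_free)  # IO starts once the device is free
        device_free = start + io
        if wait_for_io:
            now = device_free          # synchronous: block until the IO is done
    return now, device_free


# Hypothetical workload: three operations, 1 time unit of CPU and 5 of IO each.
flush_like = run_ops([1, 1, 1], [5, 5, 5], wait_for_io=False)
sync_like = run_ops([1, 1, 1], [5, 5, 5], wait_for_io=True)
print("start IO only:", flush_like)
print("wait each time:", sync_like)
```

In this model the caller that only starts IO gets control back long before the device is idle, while the caller that waits after every operation pays for each write in turn.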
-chris
* Re: o_sync in vfat driver
2006-03-02 17:01 ` Chris Mason
@ 2006-03-02 18:14 ` OGAWA Hirofumi
0 siblings, 0 replies; 42+ messages in thread
From: OGAWA Hirofumi @ 2006-03-02 18:14 UTC (permalink / raw)
To: Chris Mason; +Cc: Andrew Morton, col-pepper, linux-kernel
Chris Mason <mason@suse.com> writes:
> Ok, I thought you were asking about the code that called filemap_fdatawrite,
> which does wait. filemap_flush is used on the underlying block device. In
> the case of a page that is already under IO, the io is not cancelled but
> allowed to continue.
>
> This is the desired result. When you're doing a number of operations in
> sequence, each operation will start io on the block device. If they used
> filemap_fdatawrite instead of filemap_flush, they would end up being
> synchronous.
Of course, I know. Let's return to the beginning of this thread: do you
have any plan to address it?
--
OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
* Re: o_sync in vfat driver
2006-02-28 18:47 ` Chris Mason
2006-02-28 19:10 ` Andrew Morton
[not found] ` <87u0aiw6pi.fsf@duaron.myhome.or.jp>
@ 2006-03-29 2:13 ` Mathis Ahrens
2006-03-30 17:35 ` col-pepper
2 siblings, 1 reply; 42+ messages in thread
From: Mathis Ahrens @ 2006-03-29 2:13 UTC (permalink / raw)
To: Chris Mason; +Cc: Andrew Morton, col-pepper, linux-kernel
Hi all,
Chris Mason wrote:
> On Monday 27 February 2006 18:12, Andrew Morton wrote:
>
>> We don't know that the same number of same-sized write()s were happening in
>> each case.
>>
>> There's been some talk about implementing fsync()-on-file-close for this
>> problem, and some protopatches. But nothing final yet.
>>
>
> Here's the patch I'm using in -suse right now. What I want to do is make a
> much more generic -o flush, but it'll still need a few bits in individual
> filesystem to kick off metadata writes quickly.
>
> The basic goal behind the code is to trigger writes without waiting for both
> data and metadata. If the user is watching the memory stick, when the
> little light stops flashing all the data and metadata will be on disk.
>
> It also generally throttles userland a little during file release. This
> could be changed to throttle for each page dirtied, but most users I
> asked liked the current setup better.
>
I like the idea and would like to see something like this in mainline.
Here are some non-scientific benchmarks done with 2.6.16, comparing
a default mount and a flush mount of a USB2 stick:
/////////////////////////////////////////////////////////////////////
Single File "Test": 43MB
$ time cp Test /media/usbdisk/test/ && time umount /media/usbdisk/
/////////////////////////////////////////////////////////////////////
VANILLA:
real 0m3.770s
user 0m0.004s
sys 0m0.308s
real 0m9.439s
user 0m0.000s
sys 0m0.040s
FLUSH:
real 0m6.000s
user 0m0.012s
sys 0m0.400s
real 0m3.668s
user 0m0.000s
sys 0m0.028s
REAL TIME RATIO (FLUSH/VANILLA):
9.6 / 13.1 = 0.73
/////////////////////////////////////////////////////////////////////
Directory Tree "flushtest": 44MB (8866 files, 1820 dirs)
$ time cp -R flushtest/ /media/usbdisk/ && time umount /media/usbdisk/
/////////////////////////////////////////////////////////////////////
VANILLA:
real 0m0.966s
user 0m0.024s
sys 0m0.860s
real 1m11.962s
user 0m0.004s
sys 0m0.160s
FLUSH:
real 1m41.645s
user 0m0.032s
sys 0m1.112s
real 0m4.660s
user 0m0.004s
sys 0m0.068s
REAL TIME RATIO (FLUSH/VANILLA):
106.3 / 77.9 = 1.36
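The ratios can be recomputed from the listed "real" times with a short script. This is a sketch: the helper names (seconds, total, ratio) and the RUNS table are illustrative, and only the timings themselves come from the message. Note that the listed vanilla timings for the directory test (0m0.966s + 1m11.962s) sum to about 72.9 s rather than the 77.9 s used above, which would give a ratio of about 1.46 instead of 1.36; it is not clear from the thread which figure contains the typo.

```python
import re


def seconds(stamp):
    """Convert a `time` stamp such as '1m11.962s' to seconds."""
    m = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if m is None:
        raise ValueError(f"unrecognised time stamp: {stamp!r}")
    return int(m.group(1)) * 60 + float(m.group(2))


# 'real' times copied from the message: (cp, umount) for each mount mode.
RUNS = {
    "single file": {
        "vanilla": ("0m3.770s", "0m9.439s"),
        "flush": ("0m6.000s", "0m3.668s"),
    },
    "directory tree": {
        "vanilla": ("0m0.966s", "1m11.962s"),
        "flush": ("1m41.645s", "0m4.660s"),
    },
}


def total(stamps):
    """Total wall-clock time of cp plus umount, in seconds."""
    return sum(seconds(s) for s in stamps)


def ratio(run):
    """FLUSH / VANILLA ratio of total wall-clock time."""
    return total(run["flush"]) / total(run["vanilla"])


for name, run in RUNS.items():
    print(f"{name}: {total(run['flush']):.1f} / "
          f"{total(run['vanilla']):.1f} = {ratio(run):.2f}")
```

Summing cp and umount is what makes the comparison fair: on the default mount most of the writing is deferred to umount, so the cp time alone says little.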
* Re: o_sync in vfat driver
2006-03-29 2:13 ` Mathis Ahrens
@ 2006-03-30 17:35 ` col-pepper
0 siblings, 0 replies; 42+ messages in thread
From: col-pepper @ 2006-03-30 17:35 UTC (permalink / raw)
To: Mathis Ahrens, Chris Mason; +Cc: Andrew Morton, linux-kernel
On Wed, 29 Mar 2006 04:13:03 +0200, Mathis Ahrens <Mathis.Ahrens@gmx.de>
wrote:
> Hi all,
>
> Chris Mason wrote:
>> On Monday 27 February 2006 18:12, Andrew Morton wrote:
>>
>>> We don't know that the same number of same-sized write()s were
>>> happening in
>>> each case.
>>>
>>> There's been some talk about implementing fsync()-on-file-close for
>>> this
>>> problem, and some protopatches. But nothing final yet.
>>>
>>
>> Here's the patch I'm using in -suse right now. What I want to do is
>> make a much more generic -o flush, but it'll still need a few bits in
>> individual filesystem to kick off metadata writes quickly.
>>
>> The basic goal behind the code is to trigger writes without waiting for
>> both
>> data and metadata. If the user is watching the memory stick, when the
>> little light stops flashing all the data and metadata will be on disk.
>>
>> It also generally throttles userland a little during file release.
>> This could be changed to throttle for each page dirtied, but most users
>> I asked liked the current setup better.
>>
>
> I like the idea and would like to see something like this in mainline.
>
> Here are some non-scientific benchmarks done with 2.6.16, comparing
> a default mount and a flush mount of a USB2 stick:
>
> /////////////////////////////////////////////////////////////////////
> Single File "Test": 43MB
> $ time cp Test /media/usbdisk/test/ && time umount /media/usbdisk/
> /////////////////////////////////////////////////////////////////////
>
> VANILLA:
>
> real 0m3.770s
> user 0m0.004s
> sys 0m0.308s
>
> real 0m9.439s
> user 0m0.000s
> sys 0m0.040s
>
> FLUSH:
>
> real 0m6.000s
> user 0m0.012s
> sys 0m0.400s
>
> real 0m3.668s
> user 0m0.000s
> sys 0m0.028s
>
> REAL TIME RATIO (FLUSH/VANILLA):
> 9.6 / 13.1 = 0.73
>
> /////////////////////////////////////////////////////////////////////
> Directory Tree "flushtest": 44MB (8866 files, 1820 dirs)
> $ time cp -R flushtest/ /media/usbdisk/ && time umount /media/usbdisk/
> /////////////////////////////////////////////////////////////////////
>
> VANILLA:
>
> real 0m0.966s
> user 0m0.024s
> sys 0m0.860s
>
> real 1m11.962s
> user 0m0.004s
> sys 0m0.160s
>
> FLUSH:
>
> real 1m41.645s
> user 0m0.032s
> sys 0m1.112s
>
> real 0m4.660s
> user 0m0.004s
> sys 0m0.068s
>
> REAL TIME RATIO (FLUSH/VANILLA):
> 106.3 / 77.9 = 1.36
>
>
That's interesting. Albeit non-scientific, I think it is quite informative.
There are two basic problems with the current code: speed is down by
around an order of magnitude compared to a non-synced write, and the
code is hammering the FAT. The two are obviously related.
Viewing the system globally rather than considering the details of the
techniques used, it would seem that any algorithm that does not
drastically reduce write times, at least on the one large file test, is
missing the mark and presumably repeating the problem in a slightly
different way.
Not knocking the efforts Chris has put in, it's great to see this is
getting some attention, but I think viewing overall performance times as
shown above gives a touchstone as to whether any particular proto is
effective.
The fact that flush can be almost 40% slower in some cases is worrying.
Thanks for the info.