* o_sync in vfat driver [not found] ` <op.s5kzao2jj68xd1@mail.piments.com> @ 2006-02-26 22:50 ` col-pepper 2006-02-27 13:28 ` Lennart Sorensen 0 siblings, 1 reply; 49+ messages in thread From: col-pepper @ 2006-02-26 22:50 UTC (permalink / raw) To: linux-kernel Hi, OMG what do I have to do to post here? 10th attempt. {part2} Here is a non-exhaustive list of typical device types requiring fat/vfat support: fd ide-hd scsi-hd usb-hd cdrom usb-handheld (iPod, iRiver etc) usb-flash (usbsticks, cameras, some music devices.) IIRC the sync mount option for vfat is ignored for file systems >2G; this effectively (and probably intentionally) excludes nearly all hd partitions and iPod type devices. sync does not have any meaning for CD/DVD media. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: o_sync in vfat driver 2006-02-26 22:50 ` o_sync in vfat driver col-pepper @ 2006-02-27 13:28 ` Lennart Sorensen 2006-02-27 13:50 ` Arjan van de Ven 2006-02-27 14:26 ` linux-os (Dick Johnson) 0 siblings, 2 replies; 49+ messages in thread From: Lennart Sorensen @ 2006-02-27 13:28 UTC (permalink / raw) To: col-pepper; +Cc: linux-kernel On Sun, Feb 26, 2006 at 11:50:40PM +0100, col-pepper@piments.com wrote: > Hi, > > OMG what do I have to do to post here? 10th attempt. > {part2} > > Here is a non-exhaustive list of typical devices types requiring fat vfat > support: > > fd ide-hd scsi-hd usb-hd cdrom usb-hd usb-handheld (iPod, iRiver etc) > usb-flash (usbsticks, cameras, some music devices.) > > IIRC the sync mount option for vfat is ignored for file systems >2G, this > effectively (and probably intentionally) excludes nearly all hd partitions > and iPod type devices. I think many people wish it was ignored on smaller devices too, given what it does to write performance. And if your device is flash based and is one of the ones that doesn't have proper wear leveling, the card won't last long with sync enabled (even with wear leveling, rewriting the FAT as often as sync seems to do can't be good for the lifespan of the flash). I suspect either vfat should ignore sync all the time, or it should at least warn about its use so distributions don't think enabling it on all removable media is a good idea in general. Or perhaps the vfat driver could be made to wait for a file to be closed, or at least have some timeout, before updating the FAT again. Not sure. Len Sorensen ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: o_sync in vfat driver 2006-02-27 13:28 ` Lennart Sorensen @ 2006-02-27 13:50 ` Arjan van de Ven 2006-02-27 14:06 ` Anton Altaparmakov 2006-02-27 14:26 ` linux-os (Dick Johnson) 1 sibling, 1 reply; 49+ messages in thread From: Arjan van de Ven @ 2006-02-27 13:50 UTC (permalink / raw) To: Lennart Sorensen; +Cc: col-pepper, linux-kernel On Mon, 2006-02-27 at 08:28 -0500, Lennart Sorensen wrote: > On Sun, Feb 26, 2006 at 11:50:40PM +0100, col-pepper@piments.com wrote: > > Hi, > > > > OMG what do I have to do to post here? 10th attempt. > > {part2} > > > > Here is a non-exhaustive list of typical devices types requiring fat vfat > > support: > > > > fd ide-hd scsi-hd usb-hd cdrom usb-hd usb-handheld (iPod, iRiver etc) > > usb-flash (usbsticks, cameras, some music devices.) > > > > IIRC the sync mount option for vfat is ignored for file systems >2G, this > > effectively (and probably intentionally) excludes nearly all hd partitions > > and iPod type devices. > > I think many people wish it was ignored on smaller devices too given > what it does to write performance. well. If you don't want it *DO NOT USE IT AT THE MOUNT COMMAND LINE* !!! > And if your device is flash based > and is one of the ones that doesn't have proper wear leveling the card > won't last long with sync enabled (even with wear leveling rewriting the > fat that often as sync seems to do can't be good for the lifespan of the > flash). patient> doctor doctor it hurts when I do this doctor> Then don't do that ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: o_sync in vfat driver 2006-02-27 13:50 ` Arjan van de Ven @ 2006-02-27 14:06 ` Anton Altaparmakov 2006-02-27 14:27 ` Arjan van de Ven 0 siblings, 1 reply; 49+ messages in thread From: Anton Altaparmakov @ 2006-02-27 14:06 UTC (permalink / raw) To: Arjan van de Ven; +Cc: Lennart Sorensen, col-pepper, linux-kernel On Mon, 2006-02-27 at 14:50 +0100, Arjan van de Ven wrote: > On Mon, 2006-02-27 at 08:28 -0500, Lennart Sorensen wrote: > > On Sun, Feb 26, 2006 at 11:50:40PM +0100, col-pepper@piments.com wrote: > > > Hi, > > > > > > OMG what do I have to do to post here? 10th attempt. > > > {part2} > > > > > > Here is a non-exhaustive list of typical devices types requiring fat vfat > > > support: > > > > > > fd ide-hd scsi-hd usb-hd cdrom usb-hd usb-handheld (iPod, iRiver etc) > > > usb-flash (usbsticks, cameras, some music devices.) > > > > > > IIRC the sync mount option for vfat is ignored for file systems >2G, this > > > effectively (and probably intentionally) excludes nearly all hd partitions > > > and iPod type devices. > > > > I think many people wish it was ignored on smaller devices too given > > what it does to write performance. > > well. If you don't want it *DO NOT USE IT AT THE MOUNT COMMAND LINE* !!! That is easy to say when you are using the command line... Modern distros (as you know I am sure) mount all hot-plug devices like usb keys, usb hard disks, etc automatically at plug-in time and at least some distros use "-o sync" for everything so you don't get (too much) data loss when the user unplugs a device and so a umount to unplug the device does not take ages... Being someone who maintains a distribution based on one of the big distributions I can tell you that figuring out how to change that default behaviour is not always pretty. Usually involves hacking files deep in the bowels of the hotplug framework on the system. 
Best regards, Anton -- Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/ ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: o_sync in vfat driver 2006-02-27 14:06 ` Anton Altaparmakov @ 2006-02-27 14:27 ` Arjan van de Ven 2006-02-27 14:41 ` Anton Altaparmakov 0 siblings, 1 reply; 49+ messages in thread From: Arjan van de Ven @ 2006-02-27 14:27 UTC (permalink / raw) To: Anton Altaparmakov; +Cc: Lennart Sorensen, col-pepper, linux-kernel On Mon, 2006-02-27 at 14:06 +0000, Anton Altaparmakov wrote: > On Mon, 2006-02-27 at 14:50 +0100, Arjan van de Ven wrote: > > On Mon, 2006-02-27 at 08:28 -0500, Lennart Sorensen wrote: > > > On Sun, Feb 26, 2006 at 11:50:40PM +0100, col-pepper@piments.com wrote: > > > > Hi, > > > > > > > > OMG what do I have to do to post here? 10th attempt. > > > > {part2} > > > > > > > > Here is a non-exhaustive list of typical devices types requiring fat vfat > > > > support: > > > > > > > > fd ide-hd scsi-hd usb-hd cdrom usb-hd usb-handheld (iPod, iRiver etc) > > > > usb-flash (usbsticks, cameras, some music devices.) > > > > > > > > IIRC the sync mount option for vfat is ignored for file systems >2G, this > > > > effectively (and probably intentionally) excludes nearly all hd partitions > > > > and iPod type devices. > > > > > > I think many people wish it was ignored on smaller devices too given > > > what it does to write performance. > > > > well. If you don't want it *DO NOT USE IT AT THE MOUNT COMMAND LINE* !!! > > That is easy to say when you are using the command line... Modern > distros (as you know I am sure) mount all hot-plug devices like usb > keys, usb hard disks, etc automatically at plug-in time and at least > some distros use "-o sync" that is a bad misdesign of that distro or at least the tool the distro uses for this (I don't know which it is so I can say that without sounding partial :) the tool that decides to use "sync", or at least the author thereof, should be aware of what flash is, and that it has a limited lifespan etc etc, and that you thus want maximum caching etc. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: o_sync in vfat driver 2006-02-27 14:27 ` Arjan van de Ven @ 2006-02-27 14:41 ` Anton Altaparmakov 2006-02-27 21:04 ` col-pepper 0 siblings, 1 reply; 49+ messages in thread From: Anton Altaparmakov @ 2006-02-27 14:41 UTC (permalink / raw) To: Arjan van de Ven; +Cc: Lennart Sorensen, col-pepper, linux-kernel On Mon, 2006-02-27 at 15:27 +0100, Arjan van de Ven wrote: > On Mon, 2006-02-27 at 14:06 +0000, Anton Altaparmakov wrote: > > On Mon, 2006-02-27 at 14:50 +0100, Arjan van de Ven wrote: > > > On Mon, 2006-02-27 at 08:28 -0500, Lennart Sorensen wrote: > > > > On Sun, Feb 26, 2006 at 11:50:40PM +0100, col-pepper@piments.com wrote: > > > > > Hi, > > > > > > > > > > OMG what do I have to do to post here? 10th attempt. > > > > > {part2} > > > > > > > > > > Here is a non-exhaustive list of typical devices types requiring fat vfat > > > > > support: > > > > > > > > > > fd ide-hd scsi-hd usb-hd cdrom usb-hd usb-handheld (iPod, iRiver etc) > > > > > usb-flash (usbsticks, cameras, some music devices.) > > > > > > > > > > IIRC the sync mount option for vfat is ignored for file systems >2G, this > > > > > effectively (and probably intentionally) excludes nearly all hd partitions > > > > > and iPod type devices. > > > > > > > > I think many people wish it was ignored on smaller devices too given > > > > what it does to write performance. > > > > > > well. If you don't want it *DO NOT USE IT AT THE MOUNT COMMAND LINE* !!! > > > > That is easy to say when you are using the command line... 
Modern > > distros (as you know I am sure) mount all hot-plug devices like usb > > keys, usb hard disks, etc automatically at plug-in time and at least > > some distros use "-o sync" > > that is a bad misdesign of that distro or at least the tool the distro > uses for this (I don't know which it is so I can say that without > sounding partial :) > > the tool that decides to use "sync", or at least the author thereof, > should be aware of what flash is, and that it has a limited lifespan etc > etc, and that you thus want maximum caching etc. I agree completely which is why we hack the system to remove the o_sync on our distro derivative. (-: But my point was that your solution of "don't do that then" is not much use to your average user who sits in front of such distro in graphical desktop as they are not technical enough to find and hack their hotplug system to work properly... Best regards, Anton -- Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/ ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: o_sync in vfat driver 2006-02-27 14:41 ` Anton Altaparmakov @ 2006-02-27 21:04 ` col-pepper 2006-02-27 21:17 ` Arjan van de Ven ` (3 more replies) 0 siblings, 4 replies; 49+ messages in thread From: col-pepper @ 2006-02-27 21:04 UTC (permalink / raw) To: Anton Altaparmakov, Arjan van de Ven; +Cc: Lennart Sorensen, linux-kernel On Mon, 27 Feb 2006 15:41:44 +0100, Anton Altaparmakov <aia21@cam.ac.uk> wrote: > On Mon, 2006-02-27 at 15:27 +0100, Arjan van de Ven wrote: >> On Mon, 2006-02-27 at 14:06 +0000, Anton Altaparmakov wrote: >> > On Mon, 2006-02-27 at 14:50 +0100, Arjan van de Ven wrote: >> > > On Mon, 2006-02-27 at 08:28 -0500, Lennart Sorensen wrote: >> > > > On Sun, Feb 26, 2006 at 11:50:40PM +0100, col-pepper@piments.com >> wrote: >> > > > > Hi, >> > > > > >> > > > > OMG what do I have to do to post here? 10th attempt. >> > > > > {part2} >> > > > > >> > > > > Here is a non-exhaustive list of typical devices types >> requiring fat vfat >> > > > > support: >> > > > > >> > > > > fd ide-hd scsi-hd usb-hd cdrom usb-hd usb-handheld (iPod, >> iRiver etc) >> > > > > usb-flash (usbsticks, cameras, some music devices.) >> > > > > >> > > > > IIRC the sync mount option for vfat is ignored for file systems >> >2G, this >> > > > > effectively (and probably intentionally) excludes nearly all hd >> partitions >> > > > > and iPod type devices. >> > > > >> > > > I think many people wish it was ignored on smaller devices too >> given >> > > > what it does to write performance. >> > > >> > > well. If you don't want it *DO NOT USE IT AT THE MOUNT COMMAND >> LINE* !!! >> > >> > That is easy to say when you are using the command line... 
Modern >> > distros (as you know I am sure) mount all hot-plug devices like usb >> > keys, usb hard disks, etc automatically at plug-in time and at least >> > some distros use "-o sync" >> >> that is a bad misdesign of that distro or at least the tool the distro >> uses for this (I don't know which it is so I can say that without >> sounding partial :) >> >> the tool that decides to use "sync", or at least the author thereof, >> should be aware of what flash is, and that it has a limited lifespan etc >> etc, and that you thus want maximum caching etc. > > I agree completely which is why we hack the system to remove the o_sync > on our distro derivative. (-: > > But my point was that your solution of "don't do that then" is not much > use to your average user who sits in front of such distro in graphical > desktop as they are not technical enough to find and hack their hotplug > system to work properly... > > Best regards, > > Anton >> If you don't want it *DO NOT USE IT AT THE MOUNT COMMAND LINE* !!! Yeah, clever. That is not really a constructive response. I don't use , I do use command line mount all the time. I never was in danger of damaging my drive with this new "feature". Telling a user who has just burnt out a brand new 1GB usb device that he should have RTFM and modified the HAL configuration to ensure it did not use sync is not likely to win much confidence in the linux kernel. The point of raising this is that the vast majority of linux users have no awareness of this. If there is a danger of this sync implementation damaging hardware it should be done differently. More importantly, this sync strategy is very likely _increasing_ the danger of data loss, which is the core reason for using sync in the first place. To quote from my earlier post: The new model attempts to be more rigorous by updating the FAT every time a block of data is written. Thus the "hammering" of the physical memory hosting the FAT record. 
In view of the nature of flash memory this may actually be drastically increasing the chance that the whole FAT gets erased. If a pullout occurs during a write, there is now a near 50% chance that this takes out the entire FAT. Now if that analysis is inaccurate I'd like to be corrected. But flash has to be zeroed to be written. If every second write is zeroing the FAT this would seem much more likely to destroy the whole fs than to provide better protection from an untimely pull-out. [Note: I am not subscribed to LKML, if you wish me to receive any follow ups please BCC: col-pepper at piments point com . thx] ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: o_sync in vfat driver 2006-02-27 21:04 ` col-pepper @ 2006-02-27 21:17 ` Arjan van de Ven 2006-02-27 23:21 ` col-pepper 2006-02-27 21:32 ` linux-os (Dick Johnson) ` (2 subsequent siblings) 3 siblings, 1 reply; 49+ messages in thread From: Arjan van de Ven @ 2006-02-27 21:17 UTC (permalink / raw) To: col-pepper; +Cc: Anton Altaparmakov, Lennart Sorensen, linux-kernel > Telling a user who has just burnt out a brand new 1GB usb device he should > have RTFM and modified that HAL configuration to insure it did not use > sync it not likely to win much confidence in the linux kernel. or in HAL. really. there was a very long discussion about kernel stability. The problem is that once depending on the absence of a feature becomes ABI ... there is a big problem. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: o_sync in vfat driver 2006-02-27 21:17 ` Arjan van de Ven @ 2006-02-27 23:21 ` col-pepper 0 siblings, 0 replies; 49+ messages in thread From: col-pepper @ 2006-02-27 23:21 UTC (permalink / raw) To: Arjan van de Ven; +Cc: Anton Altaparmakov, Lennart Sorensen, linux-kernel On Mon, 27 Feb 2006 22:17:21 +0100, Arjan van de Ven <arjan@infradead.org> wrote: > >> Telling a user who has just burnt out a brand new 1GB usb device he >> should >> have RTFM and modified that HAL configuration to insure it did not use >> sync it not likely to win much confidence in the linux kernel. > > or in HAL. really. It may unfairly reflect on HAL in the users' minds, but HAL still does exactly what it is set up to do. > > > there was a very long discussion abuot kernel stability. > The problem is that once depending on the absence of a feature becomes > ABI ... there is a big problem. > > > It was not totally absent. If it was absent no-one would configure anything to use it anyway. It seems that the big problem was that its functionality was fundamentally changed, but it was passed on as a minor mod that no-one needed to worry about, and the doc was not updated at the time. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: o_sync in vfat driver 2006-02-27 21:04 ` col-pepper 2006-02-27 21:17 ` Arjan van de Ven @ 2006-02-27 21:32 ` linux-os (Dick Johnson) 2006-02-27 23:21 ` col-pepper 2006-02-28 16:11 ` Helge Hafting 2006-02-28 22:37 ` Pavel Machek 3 siblings, 1 reply; 49+ messages in thread From: linux-os (Dick Johnson) @ 2006-02-27 21:32 UTC (permalink / raw) To: col-pepper Cc: Anton Altaparmakov, Arjan van de Ven, Lennart Sorensen, linux-kernel On Mon, 27 Feb 2006 col-pepper@piments.com wrote: > On Mon, 27 Feb 2006 15:41:44 +0100, Anton Altaparmakov <aia21@cam.ac.uk> > wrote: > >> On Mon, 2006-02-27 at 15:27 +0100, Arjan van de Ven wrote: >>> On Mon, 2006-02-27 at 14:06 +0000, Anton Altaparmakov wrote: >>>> On Mon, 2006-02-27 at 14:50 +0100, Arjan van de Ven wrote: >>>>> On Mon, 2006-02-27 at 08:28 -0500, Lennart Sorensen wrote: >>>>>> On Sun, Feb 26, 2006 at 11:50:40PM +0100, col-pepper@piments.com >>> wrote: >>>>>>> Hi, >>>>>>> >>>>>>> OMG what do I have to do to post here? 10th attempt. >>>>>>> {part2} >>>>>>> >>>>>>> Here is a non-exhaustive list of typical devices types >>> requiring fat vfat >>>>>>> support: >>>>>>> >>>>>>> fd ide-hd scsi-hd usb-hd cdrom usb-hd usb-handheld (iPod, >>> iRiver etc) >>>>>>> usb-flash (usbsticks, cameras, some music devices.) >>>>>>> >>>>>>> IIRC the sync mount option for vfat is ignored for file systems >>>> 2G, this >>>>>>> effectively (and probably intentionally) excludes nearly all hd >>> partitions >>>>>>> and iPod type devices. >>>>>> >>>>>> I think many people wish it was ignored on smaller devices too >>> given >>>>>> what it does to write performance. >>>>> >>>>> well. If you don't want it *DO NOT USE IT AT THE MOUNT COMMAND >>> LINE* !!! >>>> >>>> That is easy to say when you are using the command line... 
Modern >>>> distros (as you know I am sure) mount all hot-plug devices like usb >>>> keys, usb hard disks, etc automatically at plug-in time and at least >>>> some distros use "-o sync" >>> >>> that is a bad misdesign of that distro or at least the tool the distro >>> uses for this (I don't know which it is so I can say that without >>> sounding partial :) >>> >>> the tool that decides to use "sync", or at least the author thereof, >>> should be aware of what flash is, and that it has a limited lifespan etc >>> etc, and that you thus want maximum caching etc. >> >> I agree completely which is why we hack the system to remove the o_sync >> on our distro derivative. (-: >> >> But my point was that your solution of "don't do that then" is not much >> use to your average user who sits in front of such distro in graphical >> desktop as they are not technical enough to find and hack their hotplug >> system to work properly... >> >> Best regards, >> >> Anton > >>> If you don't want it *DO NOT USE IT AT THE MOUNT COMMAND LINE* !!! > > Yeah, cleaver. > That is not really a constructive responce. I dont use , I do use command > line mount all the time. I never was in danger of damaging my drive with > this new "feature". > > Telling a user who has just burnt out a brand new 1GB usb device he should > have RTFM and modified that HAL configuration to insure it did not use > sync it not likely to win much confidence in the linux kernel. > > The point of raising this is that the vast majority of linux users have no > awareness of this. If there is a danger of this sync implementation > damaging hardware it should be done differently. > > More importantly this sync strategy is very likely _increasing_ the danger > of data loss that is the core reason for using sync in the first place. > > To quote from my earlier post: > > The new model attempts to be more rigourous by updating the FAT every time > a block of data is written. 
Thus the "hammering" of the physical memory > hosting the FAT record. Nobody should care. > > In view of the nature of flash memory this may actually be drastically > increasing the chance that the whole FAT gets erased. > Will not happen because that's not how they work. > If a pullout occurs during write , there is now a near 50% chance that > this takes out the entire FAT. > If a pullout or a power-failure occurs, you just have an incomplete write, an old FAT entry just like ejecting a floppy during a write. > Now if that analysis is inaccurate I'd like be corrected. But flash has to > be zeroed to be written. If every second write is zeroing the FAT this > would seem much more likely to destroy the whole fs than to provide better > protection from a untimely pull-out. > Flash does not get zeroed to be written! It gets erased, which sets all the bits to '1', i.e., all bytes to 0xff. Further, the designers of flash disks are not stupid as you assume. The direct access occurs to static RAM (read/write stuff). After a few milliseconds of it becoming dirty, and/or when a new page needs to be accessed, the chip erases some page that was not used yet, or was used a long time ago and is not on the active list. Then, it becomes busy, writes the current sector to the newly erased sector, and (after that write occurs) replaces the entry in the table that tells the disk implementation the logical to physical translation of that page. In the case where a page will be changed, the new page's data is read from the device into static RAM before access. In any case, the chip then becomes non-busy. The power can fail at any time and you just have the previous data instead of the new data, just like a real disk drive, except that the sectors are large (64 k). You see, these are not just flash-RAM chips. They are disk drive emulators that contain an ASIC for the bus interface and control logic, some static RAM, and the flash RAM. 
The IDE emulators, like CompactFlash, as tiny as they are, actually have the same pin-outs as an IDE drive!! > > [Note: I am not subscribed to LKML, if you wish me to recieve any follow > ups please BCC: col-pepper at piments point com . thx] Cheers, Dick Johnson Penguin : Linux version 2.6.15.4 on an i686 machine (5589.53 BogoMips). Warning : 98.36% of all statistics are fiction, book release in April. **************************************************************** The information transmitted in this message is confidential and may be privileged. Any review, retransmission, dissemination, or other use of this information by persons or entities other than the intended recipient is prohibited. If you are not the intended recipient, please notify Analogic Corporation immediately - by replying to this message or by sending an email to DeliveryErrors@analogic.com - and destroy all copies of this information, including any attachments, without reading or disclosing them. Thank you. ^ permalink raw reply [flat|nested] 49+ messages in thread
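[Editorial sketch] The remapping behaviour described in the message above (erase an unused block to 0xff, write the new data there, and only then update the logical-to-physical table) can be illustrated with a toy model. This is only a sketch under stated assumptions: the class name, the four-block device and the four-byte block size are hypothetical, and real controllers vary; as other posters in this thread note, not every device spreads writes this way.

```python
ERASED = 0xFF  # an erased flash byte has all bits set to 1


class ToyFlashDisk:
    """Toy model of a remapping flash controller (hypothetical, not a real device)."""

    def __init__(self, physical_blocks=4, block_size=4):
        self.block_size = block_size
        # Every physical block starts out erased.
        self.flash = [[ERASED] * block_size for _ in range(physical_blocks)]
        self.erase_counts = [0] * physical_blocks
        self.table = {}  # logical block number -> physical block number

    def read(self, logical):
        """Return a copy of the data currently mapped to a logical block."""
        return list(self.flash[self.table[logical]])

    def write(self, logical, data):
        """Write a whole block without touching the old copy until the end."""
        if len(data) != self.block_size:
            raise ValueError("data must fill exactly one block")
        in_use = set(self.table.values())
        free = [p for p in range(len(self.flash)) if p not in in_use]
        if not free:
            raise RuntimeError("no spare physical block available")
        # Pick the least-worn spare block, erase it, then program it.
        target = min(free, key=lambda p: self.erase_counts[p])
        self.flash[target] = [ERASED] * self.block_size
        self.erase_counts[target] += 1
        self.flash[target] = list(data)
        # The table entry is switched last, so an interruption before this
        # line leaves the previous data mapped and intact.
        self.table[logical] = target
```

In this model, rewriting the same logical block repeatedly (as a sync-mounted FAT would) spreads the erases over all spare blocks instead of hammering one physical location. A device without such a table would put every erase on the same block.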
* Re: o_sync in vfat driver 2006-02-27 21:32 ` linux-os (Dick Johnson) @ 2006-02-27 23:21 ` col-pepper 2006-02-28 13:10 ` linux-os (Dick Johnson) 2006-02-28 22:38 ` Pavel Machek 0 siblings, 2 replies; 49+ messages in thread From: col-pepper @ 2006-02-27 23:21 UTC (permalink / raw) To: linux-os (Dick Johnson); +Cc: linux-kernel@vger.kernel.org On Mon, 27 Feb 2006 22:32:07 +0100, linux-os (Dick Johnson) <linux-os@analogic.com> wrote: > Flash does not get zeroed to be written! It gets erased, which sets all > the bits to '1', i.e., all bytes to 0xff. Thanks for the correction, but that does not change the discussion. > Further, the designers of > flash disks are not stupid as you assume. The direct access occurs > to static RAM (read/write stuff). I'm not assuming anything . Some hardware has been killed by this issue. http://lkml.org/lkml/2005/5/13/144 It seems that it's you making the assumption that all of these devices are manufactured the same way. The constant dirtying of the buffer will still cause excessive use of the flash block hosting the FAT. Clearly not all devices use a load spreading mechanism and this can lead to premature failure. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: o_sync in vfat driver 2006-02-27 23:21 ` col-pepper @ 2006-02-28 13:10 ` linux-os (Dick Johnson) 2006-02-28 13:52 ` Sergei Organov ` (2 more replies) 2006-02-28 22:38 ` Pavel Machek 1 sibling, 3 replies; 49+ messages in thread From: linux-os (Dick Johnson) @ 2006-02-28 13:10 UTC (permalink / raw) To: col-pepper; +Cc: linux-kernel On Mon, 27 Feb 2006 col-pepper@piments.com wrote: > On Mon, 27 Feb 2006 22:32:07 +0100, linux-os (Dick Johnson) > <linux-os@analogic.com> wrote: > >> Flash does not get zeroed to be written! It gets erased, which sets all >> the bits to '1', i.e., all bytes to 0xff. > > Thanks for the correction, but that does not change the discussion. > >> Further, the designers of >> flash disks are not stupid as you assume. The direct access occurs >> to static RAM (read/write stuff). > > I'm not assuming anything . Some hardware has been killed by this issue. > http://lkml.org/lkml/2005/5/13/144 No. That hardware was not killed by that issue. The writer, or another who had encountered the same issue, eventually repartitioned and reformatted the device. The partition table had gotten corrupted by some experiments and the writer assumed that the device was broken. It wasn't. Also, if you read other elements in this thread, you would have learned about something that has become somewhat of a red herring. It takes about a second to erase a 64k physical sector. This is a required operation before it is written. Since the projected life of these new devices is about 5 to 10 million such cycles, (older NAND flash used in modems was only 100-200k) the writer would have to be running that "brand new device" for at least 5 million seconds. Let's see: 60 seconds per minute 3600 seconds per hour 86400 seconds per day. 5,000,000 / 86400 = 57 days of continuous writes to the same sector. The writer surely would have a strange file because he states that even a single large file can destroy the drive if it is mounted with the "sync" option. 
Also, the failure mode of NAND flash is not that it becomes "destroyed". The failure mode is a slow loss of data. The devices no longer retain data for a zillion years, only a few hundred, eventually, only a year or so. So when somebody claims that the flash has gotten destroyed, they need to have written it for a few months, then waited for a few years before reporting the event. Clearly the writer is wrong. > > It seems that it's you making the assumption that all of these devices are > manufactured the same way. > > The constant dirtying of the buffer will still cause excessive use of the > flash block hosting the FAT. Clearly not all devices use a load spreading > mechanism and this can lead to premature failure. > Cheers, Dick Johnson Penguin : Linux version 2.6.15.4 on an i686 machine (5589.53 BogoMips). Warning : 98.36% of all statistics are fiction, book release in April. ^ permalink raw reply [flat|nested] 49+ messages in thread
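[Editorial sketch] The endurance estimate in the message above can be reproduced as a short calculation. The one-second erase time and the 5-million-cycle lifetime are the poster's figures, both disputed in the replies that follow; the function name is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400


def days_to_wear_out(cycles, seconds_per_cycle):
    """Days of back-to-back erase/write cycles on one block until `cycles` is reached."""
    return cycles * seconds_per_cycle / SECONDS_PER_DAY


# Figures quoted in the message: 5 million cycles at about 1 second each.
print(round(days_to_wear_out(5_000_000, 1.0), 1))  # 57.9
```

The result is very sensitive to both inputs, which is why the erase time and cycle count matter so much to the argument.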
* Re: o_sync in vfat driver 2006-02-28 13:10 ` linux-os (Dick Johnson) @ 2006-02-28 13:52 ` Sergei Organov 2006-02-28 15:18 ` Lennart Sorensen 2006-02-28 17:16 ` col-pepper 2 siblings, 0 replies; 49+ messages in thread From: Sergei Organov @ 2006-02-28 13:52 UTC (permalink / raw) To: linux-kernel "linux-os \(Dick Johnson\)" <linux-os@analogic.com> writes: > On Mon, 27 Feb 2006 col-pepper@piments.com wrote: > >> On Mon, 27 Feb 2006 22:32:07 +0100, linux-os (Dick Johnson) >> <linux-os@analogic.com> wrote: >> [...] > It takes about a second to erase a 64k physical sector. This is > a required operation before it is written. > Since the projected life of these new devices is about 5 to 10 million > such cycles, (older NAND flash used in modems was only 100-200k) the > writer would have to be running that "brand new device" for at least 5 > million seconds. Let's see: What FLASH are you talking about? I work with NAND FLASH chips directly in embedded projects, and for both Toshiba and Samsung NAND FLASH the erase time of 128Kb (64K words) block is 2 milliseconds typical. Page program time is 0.3 milliseconds typical, so, having 64 pages per block, total erase-write block cycle is about 22ms. Those chips indeed support about 100K program/erase cycles. Well, maybe there are some new NAND FLASH chips that support more program/erase cycles (just checked Samsung but found none), but I doubt they are 1000 times slower for block erase anyway. -- Sergei. ^ permalink raw reply [flat|nested] 49+ messages in thread
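[Editorial sketch] Plugging the datasheet-style figures from the message above into the same kind of estimate gives a very different picture. This is a worst-case sketch only: it assumes every cycle lands on the same physical block, with no wear leveling and no other overhead, and the function names are hypothetical.

```python
def block_cycle_ms(erase_ms, page_program_ms, pages_per_block):
    """Time for one full erase-and-rewrite cycle of a single block, in milliseconds."""
    return erase_ms + page_program_ms * pages_per_block


def minutes_to_wear_out(cycles, cycle_ms):
    """Minutes of back-to-back cycles on one block until `cycles` is reached."""
    return cycles * cycle_ms / 1000.0 / 60.0


# Figures quoted in the message: 2 ms erase, 0.3 ms per page, 64 pages per block.
cycle = block_cycle_ms(2.0, 0.3, 64)
print(round(cycle, 1))                                # 21.2
print(round(minutes_to_wear_out(100_000, cycle), 1))  # 35.3
```

Under these assumptions, 100K cycles on one block are used up in roughly half an hour of continuous rewriting rather than in months.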
* Re: o_sync in vfat driver 2006-02-28 13:10 ` linux-os (Dick Johnson) 2006-02-28 13:52 ` Sergei Organov @ 2006-02-28 15:18 ` Lennart Sorensen 2006-02-28 16:16 ` linux-os (Dick Johnson) 2006-02-28 17:16 ` col-pepper 2 siblings, 1 reply; 49+ messages in thread From: Lennart Sorensen @ 2006-02-28 15:18 UTC (permalink / raw) To: linux-os (Dick Johnson); +Cc: col-pepper, linux-kernel On Tue, Feb 28, 2006 at 08:10:44AM -0500, linux-os (Dick Johnson) wrote: > > On Mon, 27 Feb 2006 col-pepper@piments.com wrote: > > > On Mon, 27 Feb 2006 22:32:07 +0100, linux-os (Dick Johnson) > > <linux-os@analogic.com> wrote: > > > >> Flash does not get zeroed to be written! It gets erased, which sets all > >> the bits to '1', i.e., all bytes to 0xff. > > > > Thanks for the correction, but that does not change the discussion. > > > >> Further, the designers of > >> flash disks are not stupid as you assume. The direct access occurs > >> to static RAM (read/write stuff). > > > > I'm not assuming anything . Some hardware has been killed by this issue. > > http://lkml.org/lkml/2005/5/13/144 > > No. That hardware was not killed by that issue. The writer, or another > who had encountered the same issue, eventually repartitioned and > reformatted the device. The partition table had gotten corrupted by > some experiments and the writer assumed that the device was broken. > It wasn't. > > Also, if you read other elements in this thread, you would have > learned about something that has become somewhat of a red herring. > > It takes about a second to erase a 64k physical sector. This is > a required operation before it is written. Since the projected > life of these new devices is about 5 to 10 million such cycles, > (older NAND flash used in modems was only 100-200k) the writer > would have to be running that "brand new device" for at least > 5 million seconds. Let's see: How come I can write to my compact flash at about 2M/s if you claim it takes 1s to erase a 64k sector? 
Somehow I think your number is much too high. Or it can do multiple erases at the same time. Also the 5 to 10 million is a lot higher than the numbers the makers of the compact flash cards I use claim. > 60 seconds per minute > 3600 seconds per hour > 86400 seconds per day. > > 5,000,000 / 86400 = 57 days of continuous writes to the same > sector. The writer surely would have a strange file because > he states that even a single large file can destroy the drive > if it is mounted with the "sync" option. > > Also, the failure mode of NAND flash is not that it becomes > "destroyed". The failure mode is a slow loss of data. The > devices no longer retain data for a zillion years, only a > few hundred, eventually, only a year or so. So when somebody > claims that the flash has gotten destroyed, they need to have > written it for a few months, then waited for a few years before > reporting the event. Some flash devices can be "destroyed" by losing power in the middle of a write, since this causes them to corrupt their table of blocks and only the manufacturer has the ability to reset that. Fortunately this doesn't seem like too common a design. Len Sorensen ^ permalink raw reply [flat|nested] 49+ messages in thread
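Len's objection about write speed is a one-line calculation. The sketch below is only an illustration: the function name is hypothetical, and it assumes every block needs exactly one erase before it is written and that erases do not overlap.

```c
#include <assert.h>

/*
 * Upper bound on sustained write rate, in bytes per second, when each
 * block of block_bytes must first be erased and one erase takes erase_ms.
 * Programming time and bus transfer time are ignored, so the real rate
 * can only be lower.
 */
unsigned long max_rate_bytes_per_s(unsigned long block_bytes,
                                   unsigned long erase_ms)
{
    assert(erase_ms > 0);
    return (unsigned long)((unsigned long long)block_bytes * 1000ULL
                           / erase_ms);
}
```

For a 64k block and a 1 s erase, max_rate_bytes_per_s(65536, 1000) is 65536 bytes/s, roughly thirty times below the "about 2M/s" observed on the compact flash card. So either the erase is much faster than a second or several erases run in parallel, which are the two possibilities raised in the message.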
* Re: o_sync in vfat driver 2006-02-28 15:18 ` Lennart Sorensen @ 2006-02-28 16:16 ` linux-os (Dick Johnson) 2006-02-28 17:23 ` Sergei Organov 2006-02-28 18:09 ` Krzysztof Halasa 0 siblings, 2 replies; 49+ messages in thread From: linux-os (Dick Johnson) @ 2006-02-28 16:16 UTC (permalink / raw) To: Lennart Sorensen; +Cc: col-pepper, linux-kernel On Tue, 28 Feb 2006, Lennart Sorensen wrote: > On Tue, Feb 28, 2006 at 08:10:44AM -0500, linux-os (Dick Johnson) wrote: >> >> On Mon, 27 Feb 2006 col-pepper@piments.com wrote: >> >>> On Mon, 27 Feb 2006 22:32:07 +0100, linux-os (Dick Johnson) >>> <linux-os@analogic.com> wrote: >>> >>>> Flash does not get zeroed to be written! It gets erased, which sets all >>>> the bits to '1', i.e., all bytes to 0xff. >>> >>> Thanks for the correction, but that does not change the discussion. >>> >>>> Further, the designers of >>>> flash disks are not stupid as you assume. The direct access occurs >>>> to static RAM (read/write stuff). >>> >>> I'm not assuming anything . Some hardware has been killed by this issue. >>> http://lkml.org/lkml/2005/5/13/144 >> >> No. That hardware was not killed by that issue. The writer, or another >> who had encountered the same issue, eventually repartitioned and >> reformatted the device. The partition table had gotten corrupted by >> some experiments and the writer assumed that the device was broken. >> It wasn't. >> >> Also, if you read other elements in this thread, you would have >> learned about something that has become somewhat of a red herring. >> >> It takes about a second to erase a 64k physical sector. This is >> a required operation before it is written. Since the projected >> life of these new devices is about 5 to 10 million such cycles, >> (older NAND flash used in modems was only 100-200k) the writer >> would have to be running that "brand new device" for at least >> 5 million seconds. 
Let's see: > > How come I can write to my compact flash at about 2M/s if you claim it > takes 1s to erase a 64k sector? Somehow I think your number is much too > high. Or it can do multiple erases at the same time. > > Also the 5 to 10 million is a lot higher than the numbers the makers of > the compact flash cards I use claim. >

Here is an instrumented erase function on a driver that rewrites the first sector of a BIOS ROM. Unlike the Flash DISKS, the BIOS ROM has no buffering in static RAM so you can guesstimate the actual time to erase............

//-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
//
// This erases a page and waits for the erasure to complete. It
// returns false if it failed.
//
static int erase(void *bios, int page)
{
    int era;
    flags_t flags;
    jiffie_t ticks, start;
    spin_lock_irqsave(&info->lock, flags);
    erase_page(bios, page);
    spin_unlock_irqrestore(&info->lock, flags);
    start = jiffies;
    ticks = jiffies + (ERA_TIME * HZ);
    era = 0x00;
    while(time_before(jiffies, ticks))
    {
        if((era = check_erase(bios, page)))
            break;
        if(signal_pending(current))
            break;
        set_current_state(TASK_INTERRUPTIBLE);
        schedule_timeout(1);
    }
    set_current_state(TASK_RUNNING);
    printk("They don't believe... %d\n", (int) (jiffies - start));
    return era;
}

[SNIPPED...]

On the system I rewrite a BIOS sector on, jiffies is 1024 ticks/second.

parport: PnPBIOS parport detected.
parport0: PC-style at 0x378, irq 7 [PCSPP,TRISTATE]
lp0: using parport0 (interrupt-driven).
lp0: console ready
device eth0 entered promiscuous mode
device eth0 left promiscuous mode
device eth0 entered promiscuous mode
device eth0 left promiscuous mode
Analogic-BiosDev : Initialization complete
They don't believe... 1004

Now, the wait for erase always sleeps for at least a timer-tick (about a millisecond) so this may take longer than the physical erase, but not much longer.
The erase function is:

#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
#
# This erases the NVRAM page (block). It doesn't wait for completion.
# Each block is 64k in length.
# M29W040B chip
#
.section .text
erase_page:
        pushl   %ebx
        movl    BUF(%esp), %ebx         # Address of the chip
        movl    DAT(%esp), %ecx         # The page
        andl    $0x07, %ecx             # Max pages
        shll    $0x10, %ecx             # Times 64k
        movb    $0xf0, (%ebx)           # Reset
        movb    $0xaa, 0x555(%ebx)
        movb    $0x55, 0x2aa(%ebx)
        movb    $0x80, 0x555(%ebx)
        movb    $0xaa, 0x555(%ebx)
        movb    $0x55, 0x2aa(%ebx)
        movb    $0x30, (%ecx,%ebx)
        popl    %ebx
        ret
.size   erase_page,.-erase_page
.type   erase_page,@function
.global erase_page

And the check-erase function is this:

#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
#
# This reads the whole M29W040B page, looking for all 0xffff words.
# It returns non-zero if it has been erased and zero otherwise.
#
check_erase:
        pushl   %edi
        movl    BUF(%esp), %edi         # Point to buffer
        movl    DAT(%esp), %eax         # 64k page
        andl    $0x07, %eax             # Max pages possible
        shll    $0x10, %eax             # Times 64k
        addl    %eax, %edi              # Offset to start
        cld
        movl    $0x8000, %ecx           # Number of words to check
        movl    $-1, %eax               # What to look for
        repz    scasw                   # Look for all 0xffff
        jz      1f                      # All erased
        incl    %eax                    # -1 becomes zero
1:      popl    %edi
        ret
.size   check_erase,.-check_erase
.type   check_erase,@function
.global check_erase

>> 60 seconds per minute >> 3600 seconds per hour >> 86400 seconds per day. >> >> 5,000,000 / 86400 = 57 days of continuous writes to the same >> sector. The writer surely would have a strange file because >> he states that even a single large file can destroy the drive >> if it is mounted with the "sync" option. >> >> Also, the failure mode of NAND flash is not that it becomes >> "destroyed". The failure mode is a slow loss of data. The >> devices no longer retain data for a zillion years, only a >> few hundred, eventually, only a year or so. 
So when somebody >> claims that the flash has gotten destroyed, they need to have >> written it for a few months, then waited for a few years before >> reporting the event. > > Some flash devices can be "destroyed" by loosing power in the middle of > a write, since this causes them to corrupt their table of blocks and > only the manufacturer has the ability to reset that. Fortunately this > doesn't seem like too common a design. > # dd if=/dev/zero of=/dev/whatever bs=1M count=128 Fixes a 128 megabyte flash disk, plug in other values for other sizes. > Len Sorensen Cheers, Dick Johnson Penguin : Linux version 2.6.15.4 on an i686 machine (5589.54 BogoMips). Warning : 98.36% of all statistics are fiction, book release in April. ^ permalink raw reply [flat|nested] 49+ messages in thread
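The logged value is a tick count, so it needs the tick rate to become a time. A minimal sketch of that conversion follows; `ticks_to_ms` is a name invented here, and the 1024 ticks/second rate is the one stated in the message for that particular system.

```c
#include <assert.h>

/* Convert a tick count to whole milliseconds for a given tick rate (HZ). */
unsigned long ticks_to_ms(unsigned long ticks, unsigned long hz)
{
    assert(hz > 0);
    return (unsigned long)((unsigned long long)ticks * 1000ULL / hz);
}
```

ticks_to_ms(1004, 1024) is 980, so the logged erase took just under one second on the part named in the code comments (M29W040B). Because the loop polls once per tick, the figure is accurate to about a millisecond at best, as the message itself notes.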
* Re: o_sync in vfat driver 2006-02-28 16:16 ` linux-os (Dick Johnson) @ 2006-02-28 17:23 ` Sergei Organov 2006-02-28 18:09 ` Krzysztof Halasa 1 sibling, 0 replies; 49+ messages in thread From: Sergei Organov @ 2006-02-28 17:23 UTC (permalink / raw) To: linux-kernel "linux-os \(Dick Johnson\)" <linux-os@analogic.com> writes: > On Tue, 28 Feb 2006, Lennart Sorensen wrote: > >> On Tue, Feb 28, 2006 at 08:10:44AM -0500, linux-os (Dick Johnson) wrote: >>> >>> On Mon, 27 Feb 2006 col-pepper@piments.com wrote: >>> >>>> On Mon, 27 Feb 2006 22:32:07 +0100, linux-os (Dick Johnson) >>>> <linux-os@analogic.com> wrote: >>>> >>>>> Flash does not get zeroed to be written! It gets erased, which sets all >>>>> the bits to '1', i.e., all bytes to 0xff. >>>> >>>> Thanks for the correction, but that does not change the discussion. >>>> >>>>> Further, the designers of >>>>> flash disks are not stupid as you assume. The direct access occurs >>>>> to static RAM (read/write stuff). >>>> >>>> I'm not assuming anything . Some hardware has been killed by this issue. >>>> http://lkml.org/lkml/2005/5/13/144 >>> >>> No. That hardware was not killed by that issue. The writer, or another >>> who had encountered the same issue, eventually repartitioned and >>> reformatted the device. The partition table had gotten corrupted by >>> some experiments and the writer assumed that the device was broken. >>> It wasn't. >>> >>> Also, if you read other elements in this thread, you would have >>> learned about something that has become somewhat of a red herring. >>> >>> It takes about a second to erase a 64k physical sector. This is >>> a required operation before it is written. Since the projected >>> life of these new devices is about 5 to 10 million such cycles, >>> (older NAND flash used in modems was only 100-200k) the writer >>> would have to be running that "brand new device" for at least >>> 5 million seconds. 
Let's see: >> >> How come I can write to my compact flash at about 2M/s if you claim it >> takes 1s to erase a 64k sector? Somehow I think your number is much too >> high. Or it can do multiple erases at the same time. >> >> Also the 5 to 10 million is a lot higher than the numbers the makers of >> the compact flash cards I use claim. >> > > Here is an instrumented erase function on a driver that rewrites > the first sector of a BIOS ROM. Unlike the Flash DISKS, the > BIOS ROM has no buffering in static RAM so you can gustimate > the actual time to erase............ BIOS ROM is never NAND FLASH, it's most probably NOR FLASH, and FLASH DISKS are most probably NAND FLASH. NOR and NAND are very different technologies. You compare apples and oranges, -- static RAM has nothing to do with that. -- Sergei. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: o_sync in vfat driver 2006-02-28 16:16 ` linux-os (Dick Johnson) 2006-02-28 17:23 ` Sergei Organov @ 2006-02-28 18:09 ` Krzysztof Halasa 1 sibling, 0 replies; 49+ messages in thread From: Krzysztof Halasa @ 2006-02-28 18:09 UTC (permalink / raw) To: linux-os (Dick Johnson); +Cc: Lennart Sorensen, col-pepper, linux-kernel "linux-os \(Dick Johnson\)" <linux-os@analogic.com> writes: > Here is an instrumented erase function on a driver that rewrites > the first sector of a BIOS ROM. Unlike the Flash DISKS, the > BIOS ROM has no buffering in static RAM so you can gustimate > the actual time to erase............ NOR flash is different, but the Samsung manual for K9F5608U0A-YCB0, K9F5608U0A-YIB0 32M x 8 Bit NAND Flash Memory says:

FEATURES GENERAL DESCRIPTION
- Page Program : (512 + 16)Byte
- Block Erase : (16K + 512)Byte
- Program time : 200us(Typ.)
- Block Erase Time : 2ms(Typ.)
- Endurance : 100K Program/Erase Cycles
- Data Retention : 10 Years

-- Krzysztof Halasa ^ permalink raw reply [flat|nested] 49+ messages in thread
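The datasheet excerpt is enough for a back-of-envelope write rate. The sketch below is an illustration under stated assumptions: the function name is hypothetical, a 16K block is taken as 32 pages of 512 data bytes (the spare bytes are left out), and time spent moving data over the bus is ignored.

```c
#include <assert.h>

/*
 * Sustained write rate in bytes per second for a NAND part that erases a
 * whole block and then programs it page by page, using typical timings.
 */
unsigned long nand_rate_bytes_per_s(unsigned long page_bytes,
                                    unsigned int pages_per_block,
                                    unsigned long program_us,
                                    unsigned long erase_us)
{
    unsigned long long block_bytes =
        (unsigned long long)page_bytes * pages_per_block;
    unsigned long long cycle_us =
        erase_us + (unsigned long long)pages_per_block * program_us;

    assert(cycle_us > 0);
    return (unsigned long)(block_bytes * 1000000ULL / cycle_us);
}
```

nand_rate_bytes_per_s(512, 32, 200, 2000) comes to 1950476 bytes/s: one block cycle is 2000 + 32 * 200 = 8400 us for 16384 bytes. That is a ceiling for this one chip under the assumptions above, not a statement about any particular flash disk.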
* Re: o_sync in vfat driver 2006-02-28 13:10 ` linux-os (Dick Johnson) 2006-02-28 13:52 ` Sergei Organov 2006-02-28 15:18 ` Lennart Sorensen @ 2006-02-28 17:16 ` col-pepper 2 siblings, 0 replies; 49+ messages in thread From: col-pepper @ 2006-02-28 17:16 UTC (permalink / raw) To: linux-os (Dick Johnson); +Cc: linux-kernel@vger.kernel.org On Tue, 28 Feb 2006 14:10:44 +0100, linux-os (Dick Johnson) <linux-os@analogic.com> wrote: > No. That hardware was not killed by that issue. The writer, or another > who had encountered the same issue, eventually repartitioned and > reformatted the device. The partition table had gotten corrupted by > some experiments and the writer assumed that the device was broken. > It wasn't. I did not get the info you posted from that thread so maybe I missed something you saw. Or indeed it was someone else. Many thanks for your comments. If this is a false alarm, all the better. > Also, the failure mode of NAND flash is not that it becomes > "destroyed". The failure mode is a slow loss of data. The > devices no longer retain data for a zillion years, only a > few hundred, eventually, only a year or so. There was a comment about the failure mode, but no time scale was given. I see no reason why the degradation would stop at a year though. > Since the projected life of these new devices is about 5 to 10million > such cycles,(older NAND flash used in modems was only 100-200k) Maybe some of the cheap devices are not using the new flash memory, in which case it would come down to between 24 and 48 hours of constant use. This would be a realistic problem. Alan Cox referred to some devices that could be damaged as "crap", so it seems he is aware of some hardware differences. In conclusion, it seems from Andrew Morton's posts that the way this is handled is under review, so I am confident that a robust and stable solution will result. Thanks again for your thoughts on this. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: o_sync in vfat driver 2006-02-27 23:21 ` col-pepper 2006-02-28 13:10 ` linux-os (Dick Johnson) @ 2006-02-28 22:38 ` Pavel Machek 2006-02-28 23:10 ` why VM_SHM has been removed from mm.h? Kamran Karimi ` (2 more replies) 1 sibling, 3 replies; 49+ messages in thread From: Pavel Machek @ 2006-02-28 22:38 UTC (permalink / raw) To: col-pepper; +Cc: linux-os (Dick Johnson), linux-kernel@vger.kernel.org On Út 28-02-06 00:21:53, col-pepper@piments.com wrote: > On Mon, 27 Feb 2006 22:32:07 +0100, linux-os (Dick Johnson) > <linux-os@analogic.com> wrote: > > > Flash does not get zeroed to be written! It gets erased, which sets all > > the bits to '1', i.e., all bytes to 0xff. > > Thanks for the correction, but that does not change the discussion. > > > Further, the designers of > > flash disks are not stupid as you assume. The direct access occurs > > to static RAM (read/write stuff). > > I'm not assuming anything . Some hardware has been killed by this issue. > http://lkml.org/lkml/2005/5/13/144 I have seen flash disk dead in 5 minutes, even without o-sync. Those devices are often crap. (I copied tar file to flash by cat foo.tar > /dev/sda. That was apparently enough to kill that flash. Label "Yahoo" should have warned me). Pavel -- Web maintainer for suspend.sf.net (www.sf.net/projects/suspend) wanted... ^ permalink raw reply [flat|nested] 49+ messages in thread
* why VM_SHM has been removed from mm.h? 2006-02-28 22:38 ` Pavel Machek @ 2006-02-28 23:10 ` Kamran Karimi 2006-03-01 3:02 ` Phillip Susi 2006-03-01 7:56 ` Hugh Dickins 2006-03-01 4:28 ` o_sync in vfat driver Kyle Moffett 2006-03-02 8:23 ` col-pepper 2 siblings, 2 replies; 49+ messages in thread From: Kamran Karimi @ 2006-02-28 23:10 UTC (permalink / raw) To: linux-kernel Hello all, VM_SHM is used by DIPC to quickly recognise when we are dealing with a System V IPC segment. It has been "removed" from recent kernels (set to 0). Is there an easy way to find out if a segment is a Sys V shm? if not, I suggest we re-activate it. -Kamran ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: why VM_SHM has been removed from mm.h? 2006-02-28 23:10 ` why VM_SHM has been removed from mm.h? Kamran Karimi @ 2006-03-01 3:02 ` Phillip Susi 2006-03-01 7:56 ` Hugh Dickins 1 sibling, 0 replies; 49+ messages in thread From: Phillip Susi @ 2006-03-01 3:02 UTC (permalink / raw) To: Kamran Karimi; +Cc: linux-kernel Is there a reason that you posted this as a reply to the "o_sync in vfat" thread? Kamran Karimi wrote: > Hello all, > > VM_SHM is used by DIPC to quickly recognise when we are dealing with a > System V IPC segment. It has been "removed" from recent kernels (set to > 0). Is there an easy way to find out if a segment is a Sys V shm? if > not, I suggest we re-activate it. > > -Kamran > ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: why VM_SHM has been removed from mm.h? 2006-02-28 23:10 ` why VM_SHM has been removed from mm.h? Kamran Karimi 2006-03-01 3:02 ` Phillip Susi @ 2006-03-01 7:56 ` Hugh Dickins 2006-03-01 14:58 ` Kamran Karimi 1 sibling, 1 reply; 49+ messages in thread From: Hugh Dickins @ 2006-03-01 7:56 UTC (permalink / raw) To: Kamran Karimi; +Cc: linux-kernel On Tue, 28 Feb 2006, Kamran Karimi wrote: > > VM_SHM is used by DIPC to quickly recognise when we are dealing with a System > V IPC segment. It has been "removed" from recent kernels (set to 0). Curious: VM_SHM wasn't set on a System V IPC shm vma in any 2.4 or 2.6 kernel that I know of; but was set on the vmas of a random collection of drivers. Perhaps you've been using your own patch to set it on SysV IPC shm vmas, and clear it from drivers' vmas? (We'll remove VM_SHM entirely once I've trawled through those drivers.) > Is there an easy way to find out if a segment is a Sys V shm? Nothing easy and reliable springs immediately to mind - from a VM point of view, they're treated much the same as tmpfs files; but there probably is some hacky way if we think about it long enough. > if not, I suggest we re-activate it. It seems that either you've been doing the wrong thing up to now, and never noticed it; or that you've been using your own flag in your own patch, and can continue to do so. No need for vanilla kernel to reinstate VM_SHM. Are you sure you need to recognize them? Hugh ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: why VM_SHM has been removed from mm.h? 2006-03-01 7:56 ` Hugh Dickins @ 2006-03-01 14:58 ` Kamran Karimi 2006-03-01 16:24 ` Hugh Dickins 0 siblings, 1 reply; 49+ messages in thread From: Kamran Karimi @ 2006-03-01 14:58 UTC (permalink / raw) To: hugh; +Cc: linux-kernel Thank you Hugh for the reply. Last time I used VM_SHM was in 2.2.x kernels. I have a programme called DIPC which makes System V shared memory segments (and also messages and semaphores) work over a network. In the arch/xyz/mm/fault.c file, it checks the VM_SHM flag and then calls its logic. As a substitute I've been trying this ad-hoc code to see if a vma represents a Sys V shm:

file = vma->vm_file;
if(file && (file->f_dentry) && (file->f_dentry->d_inode) &&
   (id = file->f_dentry->d_inode->i_ino)) {
        shp = shm_lock(id);
        if(shp == NULL)
                return 0; // not a Sys V shm
}
else return 0; // not a Sys V shm

But the kernel hangs with an invalid-pointer error message. Any suggestions? -Kamran >On Tue, 28 Feb 2006, Kamran Karimi wrote: > > > > VM_SHM is used by DIPC to quickly recognise when we are dealing with a >System > > V IPC segment. It has been "removed" from recent kernels (set to 0). > >Curious: VM_SHM wasn't set on a System V IPC shm vma in any 2.4 or 2.6 >kernel that I know of; but was set on the vmas of a random collection >of drivers. Perhaps you've been using your own patch to set it on >SysV IPC shm vmas, and clear it from drivers' vmas? > >(We'll remove VM_SHM entirely once I've trawled through those drivers.) > > > Is there an easy way to find out if a segment is a Sys V shm? > >Nothing easy and reliable springs immediately to mind - from a VM point >of view, they're treated much the same as tmpfs files; but there >probably is some hacky way if we think about it long enough. > > > if not, I suggest we re-activate it. 
> >It seems that either you've been doing the wrong thing up to now, >and never noticed it; or that you've been using your own flag in >your own patch, and can continue to do so. No need for vanilla >kernel to reinstate VM_SHM. > >Are you sure you need to recognize them? > >Hugh ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: why VM_SHM has been removed from mm.h? 2006-03-01 14:58 ` Kamran Karimi @ 2006-03-01 16:24 ` Hugh Dickins 2006-03-01 16:55 ` Kamran Karimi 0 siblings, 1 reply; 49+ messages in thread From: Hugh Dickins @ 2006-03-01 16:24 UTC (permalink / raw) To: Kamran Karimi; +Cc: linux-kernel On Wed, 1 Mar 2006, Kamran Karimi wrote: > > Thank you Hugh for the reply. Last time I used VM_SHM was in 2.2.x kernels. > I have a programme called DIPC which makes System V shared memory segments > (and also messages and semaphores) work over a network. > > In the arch/xyz/mm/fault.c file, it checks the VM_SHM flag and then calls > its logic. As a substitute I've been trying this ad-hoc code to see if a vma > represents a Sys V shm: > > file = vma->vm_file; > if(file && (file->f_dentry) && (file->f_dentry->d_inode) && > (id = file->f_dentry->d_inode->i_ino)) { > shp = shm_lock(id); > if(shp == NULL) > return 0; // not a Sys V shm > } > else return 0; // not a Sys V shm > > But the kernel hangs with an invalid-pointer error message. Any suggestions? It's not obvious to me why the kernel would hang with an invalid pointer error message there: ipc_lock appears to have good safety against being passed a random id. Perhaps the invalid pointer message comes from other code you've not shown (for example, I hope you shm_unlock(shp) and return 1 when shm_lock succeeds), or perhaps I'm misreading. But what you're doing there looks entirely weird and meaningless to me: if shm_lock happens to succeed or fail on the inode number of some file on some filesystem, that tells you nothing about whether that file is SysV shm or not. Ah, I see ipc/shm.c saves id in i_ino: so if you're dealing with a SysV shm file, then indeed that ought to tell whether you're dealing with a SysV shm file - but that hasn't helped much! 
Since you're already patching base kernel source (you mention arch/xyz/mm/fault.c), why don't you just patch your own VM_SYSVSHM into include/linux/mm.h, and set it on the vma in ipc/shm.c? Hugh ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: why VM_SHM has been removed from mm.h? 2006-03-01 16:24 ` Hugh Dickins @ 2006-03-01 16:55 ` Kamran Karimi 2006-03-01 17:50 ` Hugh Dickins 0 siblings, 1 reply; 49+ messages in thread From: Kamran Karimi @ 2006-03-01 16:55 UTC (permalink / raw) To: hugh; +Cc: linux-kernel >It's not obvious to me why the kernel would hang with an invalid pointer >error message there: ipc_lock appears to have good safety against being >passed a random id. Perhaps the invalid pointer message comes from >other code you've not shown (for example, I hope you shm_unlock(shp) >and return 1 when shm_lock succeeds), or perhaps I'm misreading. I have put printk() statements all over the place. The hang (which is during boot time) occurs within the block of code that I sent you. There is a shm_unlock() statement after the code, but it is never reached. >Since you're already patching base kernel source (you mention >arch/xyz/mm/fault.c), why don't you just patch your own VM_SYSVSHM >into include/linux/mm.h, and set it on the vma in ipc/shm.c? Yes this looks like a good solution. I have changed VM_SHM in mm.h to be 0x0800000 and am looking for a good place to include it in the vma->vm_flags. shmat() looks like a good place. How can I find the vma of a SysV shm in that routine? -Kamran ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: why VM_SHM has been removed from mm.h? 2006-03-01 16:55 ` Kamran Karimi @ 2006-03-01 17:50 ` Hugh Dickins 0 siblings, 0 replies; 49+ messages in thread From: Hugh Dickins @ 2006-03-01 17:50 UTC (permalink / raw) To: Kamran Karimi; +Cc: linux-kernel On Wed, 1 Mar 2006, Kamran Karimi wrote: > > > Since you're already patching base kernel source (you mention > > arch/xyz/mm/fault.c), why don't you just patch your own VM_SYSVSHM > > into include/linux/mm.h, and set it on the vma in ipc/shm.c? > > Yes this looks like a good solution. I have changed VM_SHM in mm.h to be > 0x0800000 and am looking for a good place to include it in the vma->vm_flags. I already pointed out that several drivers are setting VM_SHM; and that we shall remove it in due course. Your DIPC patch should use its own flag. > shmat() looks like a good place. How can I find the vma of a SysV shm in that > routine? shm_mmap would be the right place: shmat's do_mmap will call it. Hugh ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: o_sync in vfat driver 2006-02-28 22:38 ` Pavel Machek 2006-02-28 23:10 ` why VM_SHM has been removed from mm.h? Kamran Karimi @ 2006-03-01 4:28 ` Kyle Moffett 2006-03-02 8:23 ` col-pepper 2 siblings, 0 replies; 49+ messages in thread From: Kyle Moffett @ 2006-03-01 4:28 UTC (permalink / raw) To: Pavel Machek; +Cc: col-pepper, LKML Kernel On Feb 28, 2006, at 17:38:55, Pavel Machek wrote: > I have seen flash disk dead in 5 minutes, even without o-sync. > Those devices are often crap. (I copied tar file to flash by cat > foo.tar > /dev/sda. That was apparently enough to kill that flash. > Label "Yahoo" should have warned me). Sometimes a flash device can have a temporary error condition that is solved by rewriting the data. (I've seen it triggered by buggy USB hubs that don't provide the rated power). It seems that a number of flash drives have internal checks, and when those trigger it reports a bad sector (even if it isn't permanently bad). My 1GB flashdrive failed in that way, and I was able to fix the error by erasing with "dd if=/dev/full of=/dev/usbkey" and reformatting. After the error occurred I started md5summing every file I put on the drive, but I've been using it for a month now and not a single checksum has miscomputed. Cheers, Kyle Moffett ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: o_sync in vfat driver 2006-02-28 22:38 ` Pavel Machek 2006-02-28 23:10 ` why VM_SHM has been removed from mm.h? Kamran Karimi 2006-03-01 4:28 ` o_sync in vfat driver Kyle Moffett @ 2006-03-02 8:23 ` col-pepper 2006-03-02 8:32 ` Pavel Machek 2 siblings, 1 reply; 49+ messages in thread From: col-pepper @ 2006-03-02 8:23 UTC (permalink / raw) To: Pavel Machek; +Cc: linux-kernel@vger.kernel.org On Tue, 28 Feb 2006 23:38:55 +0100, Pavel Machek <pavel@suse.cz> wrote: > On Út 28-02-06 00:21:53, col-pepper@piments.com wrote: >> On Mon, 27 Feb 2006 22:32:07 +0100, linux-os (Dick Johnson) >> <linux-os@analogic.com> wrote: >> >> > Flash does not get zeroed to be written! It gets erased, which sets >> all >> > the bits to '1', i.e., all bytes to 0xff. >> >> Thanks for the correction, but that does not change the discussion. >> >> > Further, the designers of >> > flash disks are not stupid as you assume. The direct access occurs >> > to static RAM (read/write stuff). >> >> I'm not assuming anything . Some hardware has been killed by this issue. >> http://lkml.org/lkml/2005/5/13/144 > > I have seen flash disk dead in 5 minutes, even without o-sync. Those > devices are often crap. (I copied tar file to flash by cat foo.tar > > /dev/sda. That was apparently enough to kill that flash. Label "Yahoo" > should have warned me). > Pavel If I'm not mistaken, writing to the device with cat will output that file byte by byte. This would probably be even harder on the device than using a formatted device with o_sync, since it would dirty a 64k block 64k times! It seems some of the less elaborate devices don't support this type of use. I suspect if you had tried using dd with a suitable bs you might still own a crap Yahoo USB device. Just because the Linux kernel lets us use the abstract /dev devices freely does not mean everything you can do with a /dev is a good idea for all h/w that gets a device name. I think that is the heart of the problem. 
Manufacturers are designing these devices for the windows market. They are specifically designed and supplied, preformatted with a fat fs, to be used in that way. If linux distros, MacOS or anybody else wants to claim to support these devices the default setup should probably handle the devices in a _similar_ way to the native windows drivers. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: o_sync in vfat driver 2006-03-02 8:23 ` col-pepper @ 2006-03-02 8:32 ` Pavel Machek 0 siblings, 0 replies; 49+ messages in thread From: Pavel Machek @ 2006-03-02 8:32 UTC (permalink / raw) To: col-pepper; +Cc: linux-kernel@vger.kernel.org On Čt 02-03-06 09:23:02, col-pepper@piments.com wrote: > On Tue, 28 Feb 2006 23:38:55 +0100, Pavel Machek <pavel@suse.cz> wrote: > > >On Út 28-02-06 00:21:53, col-pepper@piments.com wrote: > >>On Mon, 27 Feb 2006 22:32:07 +0100, linux-os (Dick Johnson) > >><linux-os@analogic.com> wrote: > >> > >>> Flash does not get zeroed to be written! It gets erased, which sets > >>all > >>> the bits to '1', i.e., all bytes to 0xff. > >> > >>Thanks for the correction, but that does not change the discussion. > >> > >>> Further, the designers of > >>> flash disks are not stupid as you assume. The direct access occurs > >>> to static RAM (read/write stuff). > >> > >>I'm not assuming anything . Some hardware has been killed by this issue. > >>http://lkml.org/lkml/2005/5/13/144 > > > >I have seen flash disk dead in 5 minutes, even without o-sync. Those > >devices are often crap. (I copied tar file to flash by cat foo.tar > > >/dev/sda. That was apparently enough to kill that flash. Label "Yahoo" > >should have warned me). > > If I'm not mistaken, writing to the device with cat will output that file > byte by byte. This would probably be even harder on the device than using > a formatted device with o_sync, since it would dirty a 64k block 64k > times! No. > It seems some of the less elaborate devices dont support this type of use. > > I suspect if you had tried using dd with a suitable bs you may still own a > crap Yahoo usb device. > > Just because the linux kernel lets us use the abstract /dev devices freely > does not mean everything you can do with a /dev is a good idea for all h/w > that gets a device name. > > I think that is the heart of the problem. Manufacturers are designing > these devices for the windows market. 
They are specifically designed and > supplied, preformatted with a fat fs, to be used in that way. There's USB mass storage specification, that says nothing about FAT, or expected use of the device... if your device is broken FAT thing that will break if used any other way, do not advertise it as USB mass storage. Pavel -- Web maintainer for suspend.sf.net (www.sf.net/projects/suspend) wanted... ^ permalink raw reply [flat|nested] 49+ messages in thread
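Pavel's "No." is about how a cat-like copy reaches the device: such tools typically move data through a buffer, one read() and one write() per buffer-full, rather than one byte at a time. The sketch below is not the source of any real cat; the function name and the buffer size used in the note are assumptions chosen for illustration.

```c
#define _POSIX_C_SOURCE 200809L  /* for the POSIX declarations in strict C modes */
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Copy everything from in_fd to out_fd through a buffer of buf_size bytes.
 * On success returns 0 and stores the number of write() calls issued in
 * *n_writes; returns -1 on error.
 */
int copy_counting_writes(int in_fd, int out_fd, size_t buf_size,
                         unsigned long *n_writes)
{
    char *buf;
    int rc = 0;

    assert(buf_size > 0 && n_writes != NULL);
    buf = malloc(buf_size);
    if (buf == NULL)
        return -1;

    *n_writes = 0;
    while (rc == 0) {
        ssize_t got = read(in_fd, buf, buf_size);
        if (got == 0)
            break;                      /* end of input */
        if (got < 0) {
            if (errno == EINTR)
                continue;               /* interrupted, try the read again */
            rc = -1;
            break;
        }
        /* write() may accept fewer bytes than asked; loop until all are out */
        ssize_t done = 0;
        while (done < got) {
            ssize_t put = write(out_fd, buf + done, (size_t)(got - done));
            if (put < 0) {
                if (errno == EINTR)
                    continue;
                rc = -1;
                break;
            }
            (*n_writes)++;
            done += put;
        }
    }
    free(buf);
    return rc;
}
```

Copying a 200000-byte regular file with a 64k buffer this way issues 4 write() calls, not 200000. On top of that, ordinary writes to a block device node normally pass through the kernel's cache before they reach the hardware, so the device does not see one request per byte either way.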
* Re: o_sync in vfat driver 2006-02-27 21:04 ` col-pepper 2006-02-27 21:17 ` Arjan van de Ven 2006-02-27 21:32 ` linux-os (Dick Johnson) @ 2006-02-28 16:11 ` Helge Hafting 2006-02-28 22:37 ` Pavel Machek 3 siblings, 0 replies; 49+ messages in thread From: Helge Hafting @ 2006-02-28 16:11 UTC (permalink / raw) To: col-pepper Cc: Anton Altaparmakov, Arjan van de Ven, Lennart Sorensen, linux-kernel col-pepper@piments.com wrote: > On Mon, 27 Feb 2006 15:41:44 +0100, Anton Altaparmakov > <aia21@cam.ac.uk> wrote: > >> On Mon, 2006-02-27 at 15:27 +0100, Arjan van de Ven wrote: >> >>> On Mon, 2006-02-27 at 14:06 +0000, Anton Altaparmakov wrote: >>> > On Mon, 2006-02-27 at 14:50 +0100, Arjan van de Ven wrote: >>> > > On Mon, 2006-02-27 at 08:28 -0500, Lennart Sorensen wrote: >>> > > > On Sun, Feb 26, 2006 at 11:50:40PM +0100, >>> col-pepper@piments.com wrote: >>> > > > > Hi, >>> > > > > >>> > > > > OMG what do I have to do to post here? 10th attempt. >>> > > > > {part2} >>> > > > > >>> > > > > Here is a non-exhaustive list of typical devices types >>> requiring fat vfat >>> > > > > support: >>> > > > > >>> > > > > fd ide-hd scsi-hd usb-hd cdrom usb-hd usb-handheld (iPod, >>> iRiver etc) >>> > > > > usb-flash (usbsticks, cameras, some music devices.) >>> > > > > >>> > > > > IIRC the sync mount option for vfat is ignored for file >>> systems >2G, this >>> > > > > effectively (and probably intentionally) excludes nearly all >>> hd partitions >>> > > > > and iPod type devices. >>> > > > >>> > > > I think many people wish it was ignored on smaller devices >>> too given >>> > > > what it does to write performance. >>> > > >>> > > well. If you don't want it *DO NOT USE IT AT THE MOUNT COMMAND >>> LINE* !!! >>> > >>> > That is easy to say when you are using the command line... 
Modern >>> > distros (as you know I am sure) mount all hot-plug devices like usb >>> > keys, usb hard disks, etc automatically at plug-in time and at least >>> > some distros use "-o sync" >>> >>> that is a bad misdesign of that distro or at least the tool the distro >>> uses for this (I don't know which it is so I can say that without >>> sounding partial :) >>> >>> the tool that decides to use "sync", or at least the author thereof, >>> should be aware of what flash is, and that it has a limited lifespan >>> etc >>> etc, and that you thus want maximum caching etc. >> >> >> I agree completely which is why we hack the system to remove the o_sync >> on our distro derivative. (-: >> >> But my point was that your solution of "don't do that then" is not much >> use to your average user who sits in front of such distro in graphical >> desktop as they are not technical enough to find and hack their hotplug >> system to work properly... >> >> Best regards, >> >> Anton > > >>> If you don't want it *DO NOT USE IT AT THE MOUNT COMMAND LINE* !!! >> > > Yeah, cleaver. > That is not really a constructive responce. I dont use , I do use > command line mount all the time. I never was in danger of damaging my > drive with this new "feature". > > Telling a user who has just burnt out a brand new 1GB usb device he > should have RTFM and modified that HAL configuration to insure it did > not use sync it not likely to win much confidence in the linux kernel. No problem in the kernel. The system is set up wrong. A simple user may not be able to figure out his distro's hotplug setup to fix this - but then this problem is the fault of _the distro_, not the kernel. Complain to distributors instead. There is no need for the kernel to treat o_sync VFAT in any special way. The users, or more likely the distros, can skip that o_sync part. Not all distros have such problems either. On debian, I had to set up /etc/fstab myself - where not specifying sync is easy enough. 
> > The point of raising this is that the vast majority of linux users > have no awareness of this. If there is a danger of this sync > implementation damaging hardware it should be done differently. Which is why people is working on the "sync on close" alternative. > > More importantly this sync strategy is very likely _increasing_ the > danger of data loss that is the core reason for using sync in the > first place. > > To quote from my earlier post: > > The new model attempts to be more rigourous by updating the FAT every > time > a block of data is written. Thus the "hammering" of the physical memory > hosting the FAT record. > > In view of the nature of flash memory this may actually be drastically > increasing the chance that the whole FAT gets erased. > > If a pullout occurs during write , there is now a near 50% chance that > this takes out the entire FAT. No, only one FAT entry. And the users who pull out during writes _really_ get what they deserve anyway. You don't need deep linux knowledge for that. In the day of the floppy, people respected the activity light regardless of OS. Helge Hafting ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: o_sync in vfat driver 2006-02-27 21:04 ` col-pepper ` (2 preceding siblings ...) 2006-02-28 16:11 ` Helge Hafting @ 2006-02-28 22:37 ` Pavel Machek 3 siblings, 0 replies; 49+ messages in thread From: Pavel Machek @ 2006-02-28 22:37 UTC (permalink / raw) To: col-pepper Cc: Anton Altaparmakov, Arjan van de Ven, Lennart Sorensen, linux-kernel Hi! > > I agree completely which is why we hack the system to remove the o_sync > > on our distro derivative. (-: > > > > But my point was that your solution of "don't do that then" is not much > > use to your average user who sits in front of such distro in graphical > > desktop as they are not technical enough to find and hack their hotplug > > system to work properly... > > > > Best regards, > > > > Anton > > >> If you don't want it *DO NOT USE IT AT THE MOUNT COMMAND LINE* !!! > > Yeah, cleaver. > That is not really a constructive responce. I dont use , I do use command > line mount all the time. I never was in danger of damaging my drive with > this new "feature". > > Telling a user who has just burnt out a brand new 1GB usb device he should > have RTFM and modified that HAL configuration to insure it did not use > sync it not likely to win much confidence in the linux kernel. Return that 1GB usb device to manufacturer, it was broken. > The point of raising this is that the vast majority of linux users have no > awareness of this. If there is a danger of this sync implementation > damaging hardware it should be done differently. Fix the distribution, then. Pavel -- Web maintainer for suspend.sf.net (www.sf.net/projects/suspend) wanted... ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: o_sync in vfat driver 2006-02-27 13:28 ` Lennart Sorensen 2006-02-27 13:50 ` Arjan van de Ven @ 2006-02-27 14:26 ` linux-os (Dick Johnson) 2006-02-27 18:53 ` Jan Engelhardt 1 sibling, 1 reply; 49+ messages in thread From: linux-os (Dick Johnson) @ 2006-02-27 14:26 UTC (permalink / raw) To: Lennart Sorensen; +Cc: col-pepper, linux-kernel On Mon, 27 Feb 2006, Lennart Sorensen wrote: > On Sun, Feb 26, 2006 at 11:50:40PM +0100, col-pepper@piments.com wrote: >> Hi, >> >> OMG what do I have to do to post here? 10th attempt. >> {part2} >> >> Here is a non-exhaustive list of typical devices types requiring fat vfat >> support: >> >> fd ide-hd scsi-hd usb-hd cdrom usb-hd usb-handheld (iPod, iRiver etc) >> usb-flash (usbsticks, cameras, some music devices.) >> >> IIRC the sync mount option for vfat is ignored for file systems >2G, this >> effectively (and probably intentionally) excludes nearly all hd partitions >> and iPod type devices. > > I think many people wish it was ignored on smaller devices too given > what it does to write performance. And if your device is flash based > and is one of the ones that doesn't have proper wear leveling the card > won't last long with sync enabled (even with wear leveling rewriting the > fat that often as sync seems to do can't be good for the lifespan of the > flash). > > I suspect either vfat should ignore sync all the time, or it should at > least warn about its use so distributions don't think enabling it on all > removeable media is a good idea in general. Or perhaps the vfat driver > could be made to wait for a file to be closed or at least have some > timeout before updating the fat table again. Not sure. > > Len Sorensen I really don't think one needs to worry about this! The flash-file- system designers know how to minimize wear and spread the wear throughout the device. It's not up to the file-systems to be concerned whatsoever! 
The filesystems need to concern themselves with the proper implementation of their structural details, nothing else. Any special device considerations do not belong in the file-system code. If there are any special device considerations, they need to be in the device driver, nowhere else. BTW, even the drivers can't effectively compensate for any potential wear because they don't know where the physical write will occur. The physical sectors (pages) of many of these devices are 64k. All of the access, both read and write, is buffered in read/write static RAM. It's only when the disk emulator of the FlashRAM decides that the static RAM needs to be flushed to flash, that the write actually occurs. Typically, an LRU 64k page is erased and re-written. Then a table is updated to reference the new correct block. This is all transparent, and it needs to be, because erasing a 64k block takes nearly a second! Without the buffering, write performance would be unacceptable. Cheers, Dick Johnson Penguin : Linux version 2.6.15.4 on an i686 machine (5589.53 BogoMips). Warning : 98.36% of all statistics are fiction, book release in April. ^ permalink raw reply [flat|nested] 49+ messages in thread
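The remapping described in the message above (erase a spare page, write the new data there, then update a lookup table) can be illustrated with a toy model. This is only a sketch under stated assumptions: `struct ftl_model`, its four physical blocks, and the "least-erased free block" selection rule are hypothetical stand-ins chosen for illustration, not the behaviour of any particular flash controller.

```c
#include <assert.h>
#include <stdbool.h>

#define FTL_NPHYS 4   /* hypothetical number of physical erase blocks */
#define FTL_NLOG  2   /* hypothetical number of logical blocks exposed */

struct ftl_model {
    int  map[FTL_NLOG];      /* logical -> physical, -1 if never written */
    int  erases[FTL_NPHYS];  /* erase count per physical block */
    bool in_use[FTL_NPHYS];  /* true while a block holds live data */
};

void ftl_init(struct ftl_model *f)
{
    for (int i = 0; i < FTL_NLOG; i++)
        f->map[i] = -1;
    for (int i = 0; i < FTL_NPHYS; i++) {
        f->erases[i] = 0;
        f->in_use[i] = false;
    }
}

/* Rewrite one logical block. Returns the physical block used, or -1. */
int ftl_write(struct ftl_model *f, int logical)
{
    int best = -1;

    if (logical < 0 || logical >= FTL_NLOG)
        return -1;

    /* Assumed policy: pick the free block with the fewest erases. */
    for (int i = 0; i < FTL_NPHYS; i++)
        if (!f->in_use[i] && (best < 0 || f->erases[i] < f->erases[best]))
            best = i;
    if (best < 0)
        return -1;

    f->erases[best]++;            /* erase before programming */
    f->in_use[best] = true;       /* new copy is now the live one */
    if (f->map[logical] >= 0)
        f->in_use[f->map[logical]] = false;  /* old copy becomes spare */
    f->map[logical] = best;       /* update the lookup table */
    return best;
}
```

In this model, rewriting the same logical block eight times spreads the erases evenly, two per physical block, instead of erasing one block eight times. Whether a given device does anything like this is exactly what the thread disputes.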
* Re: o_sync in vfat driver 2006-02-27 14:26 ` linux-os (Dick Johnson) @ 2006-02-27 18:53 ` Jan Engelhardt 0 siblings, 0 replies; 49+ messages in thread From: Jan Engelhardt @ 2006-02-27 18:53 UTC (permalink / raw) To: linux-os (Dick Johnson); +Cc: Lennart Sorensen, col-pepper, linux-kernel > >I really don't think one needs to worry about this! The flash-file- >system designers know how to minimize wear and spread the wear >throughout the device. It's not up to the file-systems to be >concerned whatsoever! Yes, the filesystem designers, JFFS and such. But most people unfortunately have to use something not-optimized-for-flash called VFAT to be able to read it on Win32 too. I would like to use UDF instead, but Windows seems to have a nogo with UDF on non-CDROMs. Jan Engelhardt -- ^ permalink raw reply [flat|nested] 49+ messages in thread
* o_sync in vfat driver @ 2006-02-26 22:55 col-pepper 0 siblings, 0 replies; 49+ messages in thread From: col-pepper @ 2006-02-26 22:55 UTC (permalink / raw) To: linux-kernel part 3 (we'll get past these filters in the end...) These devices do present special problems since they are a read-write medium that can be abruptly removed at any time without even the chance for the OS to interrupt ongoing IO. This is compounded by the fact that flash memory has to be erased and then rewritten with the new data. If the device is physically removed before a block is written, the update will be lost. If it is removed _during_ the write, both the new and the old data will likely be lost. If the block being written is the FAT, the principal record of the structure of the whole disk will very likely be erased. Since there is a heavy performance penalty involved (typically around an _order of magnitude_ slower), it seems that the sole aim here is security of data at any cost in the case of premature withdrawal. ^ permalink raw reply [flat|nested] 49+ messages in thread
* o_sync in vfat driver @ 2006-02-26 23:08 col-pepper 2006-02-27 0:51 ` Andrew Morton 0 siblings, 1 reply; 49+ messages in thread From: col-pepper @ 2006-02-26 23:08 UTC (permalink / raw) To: linux-kernel *** Is that aim being achieved by the current policy? *** As I understand it the old (<=2.6.11) sync model kept the data in sync without updating the FAT until later. This runs the risk of partial corruption of one or more files on pullout. The new model attempts to be more rigorous by updating the FAT every time a block of data is written. Thus the "hammering" of the physical memory hosting the FAT record. In view of the nature of flash memory this may actually be drastically increasing the chance that the whole FAT gets erased. part IV (end of a saga) If a pullout occurs during a write, there is now a near 50% chance that this takes out the entire FAT. It would seem that the main advantage of this scheme is that it is so slow that it encourages users to turn it off. Presumably in the process of coming to that conclusion they will become aware of the need to run umount or the sync command before removing the device. = Danger of destroying hardware = It seems that there are well documented cases of this abusive rewriting of the FAT causing rapid and total premature failure of what Alan Cox refers to as "ultra-crap devices". There may be valid reasons of cost or miniaturisation that preclude the additional hardware found in more complex devices. Even if better quality devices may have some sort of paging mechanism which makes them more resistant to this sort of abuse, it does not seem good engineering practice to dismiss those that fail as "shite". There is nothing in the spec of vfat that suggests the FAT will be written 10,000 times during the writing of one large file. Indeed it is hard to imagine that any other implementation on any other OS or any previous linux kernel behaves like that.
So should the hardware manufacturers have anticipated this particular driver implementation, or should the kernel be more aware of the existing hardware that it purports to support? = The way forward = It would seem that the first step could be to revert to the 2.6.11 behaviour which was more appropriate and probably safer even from the data point of view. I lack the knowledge and experience to produce reliable kernel code so I won't try. However, I have already seen a number of suggestions of how the old model could be improved. This post could be the starting point for a discussion of more robust techniques. In any case the coding is unlikely to be very complex given the existing, tested code base that is in place in 2.6.11. Any new technique should probably aim to be applicable to larger devices as well. The 2G limit is artificial and is a tacit recognition of the precariousness of the current code. USB hard disks are just as prone to accidental cable pullout. Some periodic or per-file sync should probably be envisaged for the VFAT sync mount option. PS if anyone can tell me why I had to post this ten times and chop it into little bits it would be appreciated, so that I do not mess up the list in the future. I spent an hour reading the faq and I don't see anything taboo here. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: o_sync in vfat driver 2006-02-26 23:08 col-pepper @ 2006-02-27 0:51 ` Andrew Morton 2006-02-27 22:19 ` col-pepper 0 siblings, 1 reply; 49+ messages in thread From: Andrew Morton @ 2006-02-27 0:51 UTC (permalink / raw) To: col-pepper; +Cc: linux-kernel col-pepper@piments.com wrote: > > There is nothing in the spec of vfat that suggests the FAT will be written > 10.000 during the writing of one large file. Indeed it is hard to imagine > that any other implementation on any other OS or any previous linux kernel > behaves like that. We sync the file metadata once per write() syscall. If your app writes a large file in lots of little bits, it'll do a lot of syncs. Other implementations of fatfs will (must) do the same thing. > It would seem that the first step could be to revert to the 2.6.11 > behaviour which was more appropriate and probably safer even from the data > point of view. fatfs used to be buggy - it didn't implement `-o sync'. Now it does, and what we're seeing is the fallout from the late fixing of that bug. You're right - people need to understand what they're doing, make their own decision, then remove the `-o sync' option. There aren't any easy solutions. ^ permalink raw reply [flat|nested] 49+ messages in thread
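The "once per write() syscall" point above is easy to put numbers on. The helper below is a back-of-envelope sketch, not kernel code: `syncs_for_copy` is a hypothetical name, and it simply assumes that a copy tool pushes the file through a fixed-size buffer and that each write() on a `-o sync` mount triggers one metadata sync.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of write() calls needed to copy file_size bytes through a
 * buffer of chunk bytes. Under the assumption stated above, this is
 * also the number of metadata syncs on a mount using -o sync.
 */
size_t syncs_for_copy(size_t file_size, size_t chunk)
{
    if (chunk == 0)
        return 0;                           /* degenerate input */
    return (file_size + chunk - 1) / chunk; /* round up */
}
```

For an illustrative 50 MiB file, a 4 KiB buffer gives 12800 write() calls while a 64 KiB buffer gives 800, so the application's write size alone changes the metadata traffic by a factor of sixteen.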
* Re: o_sync in vfat driver 2006-02-27 0:51 ` Andrew Morton @ 2006-02-27 22:19 ` col-pepper 2006-02-27 23:12 ` Andrew Morton 2006-02-28 0:52 ` Machida, Hiroyuki 0 siblings, 2 replies; 49+ messages in thread From: col-pepper @ 2006-02-27 22:19 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel@vger.kernel.org Thanks for the reply. On Mon, 27 Feb 2006 01:51:14 +0100, Andrew Morton <akpm@osdl.org> wrote: > col-pepper@piments.com wrote: >> >> There is nothing in the spec of vfat that suggests the FAT will be >> written >> 10.000 during the writing of one large file. Indeed it is hard to >> imagine >> that any other implementation on any other OS or any previous linux >> kernel >> behaves like that. > > We sync the file metadata once per write() syscall. If your app writes a > large file in lots of little bits, it'll do a lot of syncs. Other > implementations of fatfs will (must) do the same thing. That would not seem to be the case at least on MS systems. I had a freind do some timings copying a large group of files to a 128M usb flash device. There was an arbitary mix of files including many small files and some larger files, one in excess of 50MB. suse10 default 4m10 win2k 2m30 suse w/o sync 30s The suse test was drag and drop in konqueror , the other dnd in windows explorer. > >> It would seem that the first step could be to revert to the 2.6.11 >> behaviour which was more appropriate and probably safer even from the >> data >> point of view. > > fatfs used to be buggy - it didn't implement `-o sync'. Now it does, and > what we're seeing is the fallout from the late fixing of that bug. > I just tested on my 2.6.11 kernel which would predate the change and there is a clear difference between mounting my usb device with and without sync option. 
ls -ail /tmpd/mail*
239151 -rw-r--r-- 1 root root 8169540 2006-02-27 19:04 /tmpd/mail-bak.2006-02-28.bz2
bash-3.1#time cp !$ /mnt/usb
time cp /tmpd/mail* /mnt/usb

real 0m0.227s
user 0m0.001s
sys 0m0.070s

It returns immediately with no disk activity. About 30s later there was disk activity. Presumably some periodic flushing of IO buffers.

bash-3.1#umount /mnt/usb
bash-3.1#mount -o sync !$
bash-3.1#time cp /tmpd/mail* /mnt/usb

real 0m5.440s
user 0m0.000s
sys 0m0.143s

So the older model did seem to have some sync functionality, though presumably not the aggressive one-for-one sync that is now being used.

Please correct me if my interpretation is flawed here: flash has to be cleared before being written. If metadata is written with every block output with write(), the risk of erasing the FAT is now many times higher than with the old sync policy.

So the newer sync policy drastically _reduces_ the data security in the case of untimely disconnection despite the speed penalty and possible hardware damage it incurs.

A less rigorous sync policy may in fact be more appropriate than the current model.

Thanks again.

[Note: I am not subscribed to LKML, if you wish me to receive any follow ups please BCC: col-pepper at piments point com . thx]

^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: o_sync in vfat driver 2006-02-27 22:19 ` col-pepper @ 2006-02-27 23:12 ` Andrew Morton 2006-02-28 18:47 ` Chris Mason 2006-02-28 0:52 ` Machida, Hiroyuki 1 sibling, 1 reply; 49+ messages in thread From: Andrew Morton @ 2006-02-27 23:12 UTC (permalink / raw) To: col-pepper; +Cc: linux-kernel col-pepper@piments.com wrote: > > That would not seem to be the case at least on MS systems. I had a freind > do some timings copying a large group of files to a 128M usb flash device. > There was an arbitary mix of files including many small files and some > larger files, one in excess of 50MB. > > suse10 default 4m10 > win2k 2m30 > suse w/o sync 30s > > The suse test was drag and drop in konqueror , the other dnd in windows > explorer. We don't know that the same number of same-sized write()s were happening in each case. There's been some talk about implementing fsync()-on-file-close for this problem, and some protopatches. But nothing final yet. ^ permalink raw reply [flat|nested] 49+ messages in thread
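The fsync()-on-file-close idea mentioned above is a kernel-side proposal, but its effect can be approximated from userspace on a mount that does not use `-o sync`. The sketch below is only that approximation: `write_then_fsync` is a hypothetical helper name, and it relies on nothing beyond the standard POSIX open(), write(), fsync() and close() calls. It writes the data in chunks and asks for a single flush before closing, rather than one flush per write().

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write len bytes from buf to path in chunk-sized write() calls, then
 * flush once with fsync() before close().
 * Returns the number of bytes written, or -1 on error.
 */
long write_then_fsync(const char *path, const char *buf,
                      size_t len, size_t chunk)
{
    size_t off = 0;
    int fd;

    if (chunk == 0)
        return -1;
    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    while (off < len) {
        size_t want = (len - off < chunk) ? len - off : chunk;
        ssize_t n = write(fd, buf + off, want);

        if (n <= 0) {            /* error, or no progress */
            close(fd);
            return -1;
        }
        off += (size_t)n;        /* short writes just loop again */
    }

    if (fsync(fd) != 0) {        /* single flush of data and metadata */
        close(fd);
        return -1;
    }
    if (close(fd) != 0)
        return -1;
    return (long)off;
}
```

The trade-off is the one discussed in the thread: nothing is guaranteed to be on the device until the final fsync() returns, so a pullout mid-copy can lose the whole file, but the FAT is not rewritten for every chunk.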
* Re: o_sync in vfat driver 2006-02-27 23:12 ` Andrew Morton @ 2006-02-28 18:47 ` Chris Mason 2006-02-28 19:10 ` Andrew Morton ` (2 more replies) 0 siblings, 3 replies; 49+ messages in thread From: Chris Mason @ 2006-02-28 18:47 UTC (permalink / raw) To: Andrew Morton; +Cc: col-pepper, linux-kernel On Monday 27 February 2006 18:12, Andrew Morton wrote: > We don't know that the same number of same-sized write()s were happening in > each case. > > There's been some talk about implementing fsync()-on-file-close for this > problem, and some protopatches. But nothing final yet. Here's the patch I'm using in -suse right now. What I want to do is make a much more generic -o flush, but it'll still need a few bits in individual filesystem to kick off metadata writes quickly. The basic goal behind the code is to trigger writes without waiting for both data and metadata. If the user is watching the memory stick, when the little light stops flashing all the data and metadata will be on disk. It also generally throttles userland a little during file release. This could be changed to throttle for each page dirtied, but most users I asked liked the current setup better. -chris From: Chris Mason <mason@suse.com> Subject: add -o flush for fat Fat is commonly used on removable media, mounting with -o flush tells the FS to write things to disk as quickly as possible. It is like -o sync, but much faster (and not as safe). 
diff -r a06cef570da0 fs/fat/file.c
--- a/fs/fat/file.c	Sun Jan 15 11:59:32 2006 -0500
+++ b/fs/fat/file.c	Sun Jan 15 13:00:35 2006 -0500
@@ -13,6 +13,7 @@
 #include <linux/smp_lock.h>
 #include <linux/buffer_head.h>
 #include <linux/writeback.h>
+#include <linux/blkdev.h>
 
 int fat_generic_ioctl(struct inode *inode, struct file *filp,
 		      unsigned int cmd, unsigned long arg)
@@ -112,6 +113,19 @@ int fat_generic_ioctl(struct inode *inod
 	}
 }
 
+static int
+fat_file_release(struct inode *inode, struct file *filp)
+{
+
+	if ((filp->f_mode & FMODE_WRITE) &&
+	     MSDOS_SB(inode->i_sb)->options.flush) {
+		writeback_inode(inode);
+		writeback_bdev(inode->i_sb);
+		blk_congestion_wait(WRITE, HZ/10);
+	}
+	return 0;
+}
+
 struct file_operations fat_file_operations = {
 	.llseek		= generic_file_llseek,
 	.read		= do_sync_read,
@@ -121,6 +135,7 @@ struct file_operations fat_file_operatio
 	.aio_read	= generic_file_aio_read,
 	.aio_write	= generic_file_aio_write,
 	.mmap		= generic_file_mmap,
+	.release	= fat_file_release,
 	.ioctl		= fat_generic_ioctl,
 	.fsync		= file_fsync,
 	.sendfile	= generic_file_sendfile,
@@ -293,6 +308,10 @@ void fat_truncate(struct inode *inode)
 	lock_kernel();
 	fat_free(inode, nr_clusters);
 	unlock_kernel();
+	if (MSDOS_SB(inode->i_sb)->options.flush) {
+		writeback_inode(inode);
+		writeback_bdev(inode->i_sb);
+	}
 }
 
 struct inode_operations fat_file_inode_operations = {
diff -r a06cef570da0 fs/fat/inode.c
--- a/fs/fat/inode.c	Sun Jan 15 11:59:32 2006 -0500
+++ b/fs/fat/inode.c	Sun Jan 15 13:00:35 2006 -0500
@@ -24,6 +24,7 @@
 #include <linux/vfs.h>
 #include <linux/parser.h>
 #include <linux/uio.h>
+#include <linux/writeback.h>
 #include <asm/unaligned.h>
 
 #ifndef CONFIG_FAT_DEFAULT_IOCHARSET
@@ -860,7 +861,7 @@ enum {
 	Opt_charset, Opt_shortname_lower, Opt_shortname_win95,
 	Opt_shortname_winnt, Opt_shortname_mixed, Opt_utf8_no, Opt_utf8_yes,
 	Opt_uni_xl_no, Opt_uni_xl_yes, Opt_nonumtail_no, Opt_nonumtail_yes,
-	Opt_obsolate, Opt_err,
+	Opt_obsolate, Opt_flush, Opt_err,
 };
 
 static match_table_t fat_tokens = {
@@ -892,7 +893,8 @@ static match_table_t fat_tokens = {
 	{Opt_obsolate, "cvf_format=%20s"},
 	{Opt_obsolate, "cvf_options=%100s"},
 	{Opt_obsolate, "posix"},
-	{Opt_err, NULL}
+	{Opt_flush, "flush"},
+	{Opt_err, NULL},
 };
 static match_table_t msdos_tokens = {
 	{Opt_nodots, "nodots"},
@@ -1033,6 +1035,9 @@ static int parse_options(char *options,
 				return 0;
 			opts->codepage = option;
 			break;
+		case Opt_flush:
+			opts->flush = 1;
+			break;
 
 		/* msdos specific */
 		case Opt_dots:
diff -r a06cef570da0 fs/fs-writeback.c
--- a/fs/fs-writeback.c	Sun Jan 15 11:59:32 2006 -0500
+++ b/fs/fs-writeback.c	Sun Jan 15 13:00:35 2006 -0500
@@ -390,6 +390,29 @@ sync_sb_inodes(struct super_block *sb, s
 	return;		/* Leave any unwritten inodes on s_io */
 }
 
+void
+writeback_bdev(struct super_block *sb)
+{
+	struct address_space *mapping = sb->s_bdev->bd_inode->i_mapping;
+	filemap_flush(mapping);
+	blk_run_address_space(mapping);
+}
+EXPORT_SYMBOL_GPL(writeback_bdev);
+
+void
+writeback_inode(struct inode *inode)
+{
+
+	struct address_space *mapping = inode->i_mapping;
+	struct writeback_control wbc = {
+		.sync_mode = WB_SYNC_NONE,
+		.nr_to_write = 0,
+	};
+	sync_inode(inode, &wbc);
+	filemap_fdatawrite(mapping);
+}
+EXPORT_SYMBOL_GPL(writeback_inode);
+
 /*
  * Start writeback of dirty pagecache data against all unlocked inodes.
  *
diff -r a06cef570da0 fs/msdos/namei.c
--- a/fs/msdos/namei.c	Sun Jan 15 11:59:32 2006 -0500
+++ b/fs/msdos/namei.c	Sun Jan 15 13:00:35 2006 -0500
@@ -11,6 +11,7 @@
 #include <linux/buffer_head.h>
 #include <linux/msdos_fs.h>
 #include <linux/smp_lock.h>
+#include <linux/writeback.h>
 
 /* MS-DOS "device special files" */
 static const unsigned char *reserved_names[] = {
@@ -293,7 +294,7 @@ static int msdos_create(struct inode *di
 		struct nameidata *nd)
 {
 	struct super_block *sb = dir->i_sb;
-	struct inode *inode;
+	struct inode *inode = NULL;
 	struct fat_slot_info sinfo;
 	struct timespec ts;
 	unsigned char msdos_name[MSDOS_NAME];
@@ -329,6 +330,11 @@ static int msdos_create(struct inode *di
 	d_instantiate(dentry, inode);
 out:
 	unlock_kernel();
+	if (!err && MSDOS_SB(sb)->options.flush) {
+		writeback_inode(dir);
+		writeback_inode(inode);
+		writeback_bdev(sb);
+	}
 	return err;
 }
 
@@ -361,6 +367,11 @@ static int msdos_rmdir(struct inode *dir
 	fat_detach(inode);
 out:
 	unlock_kernel();
+	if (!err && MSDOS_SB(inode->i_sb)->options.flush) {
+		writeback_inode(dir);
+		writeback_inode(inode);
+		writeback_bdev(inode->i_sb);
+	}
 	return err;
 }
 
@@ -414,6 +425,11 @@ static int msdos_mkdir(struct inode *dir
 
 	d_instantiate(dentry, inode);
 	unlock_kernel();
+	if (MSDOS_SB(sb)->options.flush) {
+		writeback_inode(dir);
+		writeback_inode(inode);
+		writeback_bdev(sb);
+	}
 	return 0;
 
 out_free:
@@ -443,6 +459,11 @@ static int msdos_unlink(struct inode *di
 	fat_detach(inode);
 out:
 	unlock_kernel();
+	if (!err && MSDOS_SB(inode->i_sb)->options.flush) {
+		writeback_inode(dir);
+		writeback_inode(inode);
+		writeback_bdev(inode->i_sb);
+	}
 	return err;
 }
 
@@ -648,6 +669,11 @@ static int msdos_rename(struct inode *ol
 				   new_dir, new_msdos_name, new_dentry, is_hid);
 out:
 	unlock_kernel();
+	if (!err && MSDOS_SB(old_dir->i_sb)->options.flush) {
+		writeback_inode(old_dir);
+		writeback_inode(new_dir);
+		writeback_bdev(old_dir->i_sb);
+	}
 	return err;
 }
 
diff -r a06cef570da0 include/linux/msdos_fs.h
--- a/include/linux/msdos_fs.h	Sun Jan 15 11:59:32 2006 -0500
+++ b/include/linux/msdos_fs.h	Sun Jan 15 13:00:35 2006 -0500
@@ -203,6 +203,7 @@ struct fat_mount_options {
 		 unicode_xlate:1, /* create escape sequences for unhandled Unicode */
 		 numtail:1,       /* Does first alias have a numeric '~1' type tail? */
 		 atari:1,         /* Use Atari GEMDOS variation of MS-DOS fs */
+		 flush:1,         /* write things quickly */
 		 nocase:1;        /* Does this need case conversion? 0=need case conversion*/
 };
 
diff -r a06cef570da0 include/linux/writeback.h
--- a/include/linux/writeback.h	Sun Jan 15 11:59:32 2006 -0500
+++ b/include/linux/writeback.h	Sun Jan 15 13:00:35 2006 -0500
@@ -68,6 +68,8 @@ int inode_wait(void *);
 int inode_wait(void *);
 void sync_inodes_sb(struct super_block *, int wait);
 void sync_inodes(int wait);
+void writeback_bdev(struct super_block *);
+void writeback_inode(struct inode *);
 
 /* writeback.h requires fs.h; it, too, is not included from here. */
 static inline void wait_on_inode(struct inode *inode)

^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: o_sync in vfat driver 2006-02-28 18:47 ` Chris Mason @ 2006-02-28 19:10 ` Andrew Morton 2006-02-28 19:48 ` Chris Mason [not found] ` <87u0aiw6pi.fsf@duaron.myhome.or.jp> 2006-03-29 2:13 ` Mathis Ahrens 2 siblings, 1 reply; 49+ messages in thread From: Andrew Morton @ 2006-02-28 19:10 UTC (permalink / raw) To: Chris Mason; +Cc: col-pepper, linux-kernel Chris Mason <mason@suse.com> wrote: > > On Monday 27 February 2006 18:12, Andrew Morton wrote: > > > We don't know that the same number of same-sized write()s were happening in > > each case. > > > > There's been some talk about implementing fsync()-on-file-close for this > > problem, and some protopatches. But nothing final yet. > > Here's the patch I'm using in -suse right now. What I want to do is make a > much more generic -o flush, but it'll still need a few bits in individual > filesystem to kick off metadata writes quickly. > > The basic goal behind the code is to trigger writes without waiting for both > data and metadata. If the user is watching the memory stick, when the > little light stops flashing all the data and metadata will be on disk. > > It also generally throttles userland a little during file release. This > could be changed to throttle for each page dirtied, but most users I > asked liked the current setup better. > > ... > > +static int > +fat_file_release(struct inode *inode, struct file *filp) On a single line, please. > + if (MSDOS_SB(inode->i_sb)->options.flush) { Did you consider making `-o flush' a generic mount option rather than msdos-only? I guess there isn't a lot of demand for this for other filesystems, and having an ignored option like this is a bit misleading... 
> +void > +writeback_inode(struct inode *inode) > +{ > + > + struct address_space *mapping = inode->i_mapping; > + struct writeback_control wbc = { > + .sync_mode = WB_SYNC_NONE, > + .nr_to_write = 0, > + }; > + sync_inode(inode, &wbc); > + filemap_fdatawrite(mapping); I think that filemap_fdatawrite() will be a no-op? ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: o_sync in vfat driver 2006-02-28 19:10 ` Andrew Morton @ 2006-02-28 19:48 ` Chris Mason 0 siblings, 0 replies; 49+ messages in thread From: Chris Mason @ 2006-02-28 19:48 UTC (permalink / raw) To: Andrew Morton; +Cc: col-pepper, linux-kernel On Tuesday 28 February 2006 14:10, Andrew Morton wrote: > On a single line, please. > Ack. > > + if (MSDOS_SB(inode->i_sb)->options.flush) { > > Did you consider making `-o flush' a generic mount option rather than > msdos-only? Yes, long term I think the generic option is better. I have three or so ideas for a generic patch: 1) When the block device leaves congestion, it asks for more io 2) pdflush operation that tries to constantly keep a given block device congested 3) my current patch aggregated to other filesystems that people want -o flush on. I've made a few stabs at #1, but didn't like the end result. #2 seems like the best choice so far. If I got it working nicely I would add the generic option, otherwise with option #3 it's probably best to keep it per FS. The main goal for my current patch was to find out if this functionality will actually make people happy (so far the beta testers like it). If the complaints are low, it's worth the time to add something generic. > > I guess there isn't a lot of demand for this for other filesystems, and > having an ignored option like this is a bit misleading... > > > +void > > +writeback_inode(struct inode *inode) > > +{ > > + > > + struct address_space *mapping = inode->i_mapping; > > + struct writeback_control wbc = { > > + .sync_mode = WB_SYNC_NONE, > > + .nr_to_write = 0, > > + }; > > + sync_inode(inode, &wbc); > > + filemap_fdatawrite(mapping); > > I think that filemap_fdatawrite() will be a no-op? This part is nasty, I want to write all of the file data pages and write the inode without waiting on it. The nr_to_write = 0 will make sure that sync_inode only writes the inode, and WB_SYNC_NONE makes sure it does not wait for that io to finish. 
What I really want is WB_SYNC_NONE in mpage_writepages, but I don't want to trigger this code: if (wbc->sync_mode == WB_SYNC_NONE) { index = mapping->writeback_index; /* Start from prev offset */ So, I use filemap_fdatawrite to make sure all of the data pages get written. It's not perfect, but I was going for minimal changes outside of fat. -chris ^ permalink raw reply [flat|nested] 49+ messages in thread
[parent not found: <87u0aiw6pi.fsf@duaron.myhome.or.jp>]
* Re: o_sync in vfat driver [not found] ` <87u0aiw6pi.fsf@duaron.myhome.or.jp> @ 2006-03-01 15:23 ` Chris Mason [not found] ` <87mzg9wst0.fsf@duaron.myhome.or.jp> 0 siblings, 1 reply; 49+ messages in thread From: Chris Mason @ 2006-03-01 15:23 UTC (permalink / raw) To: OGAWA Hirofumi; +Cc: Andrew Morton, col-pepper, linux-kernel On Wednesday 01 March 2006 10:00, OGAWA Hirofumi wrote: > Chris Mason <mason@suse.com> writes: > > @@ -329,6 +330,11 @@ static int msdos_create(struct inode *di > > d_instantiate(dentry, inode); > > out: > > unlock_kernel(); > > + if (!err && MSDOS_SB(sb)->options.flush) { > > + writeback_inode(dir); > > + writeback_inode(inode); > > + writeback_bdev(sb); > > + } > > return err; > > } > > If buffers is already queued for I/O, and if you don't wait anything, > the buffers wouldn't be (re-)submited, then those buffers will be > flushing by normal periodically wb_kupdate() after all. Just to make sure we're using the same terms, do you mean the pages are marked dirty and on the SB's dirty list, or do you mean the page has been through writepage and is currently on its way to the disk? > > Do you have any plan to address it? Or I'm just missing something? If you mean the page is just dirty, it will get written by the filemap_fdatawrite calls. If you mean the page is PG_writeback, it is already on the way to the disk, so it passes the 'blinking light on the memory stick' rule. -chris ^ permalink raw reply [flat|nested] 49+ messages in thread
[parent not found: <87mzg9wst0.fsf@duaron.myhome.or.jp>]
* Re: o_sync in vfat driver [not found] ` <87mzg9wst0.fsf@duaron.myhome.or.jp> @ 2006-03-02 13:45 ` Chris Mason 2006-03-02 14:07 ` OGAWA Hirofumi 0 siblings, 1 reply; 49+ messages in thread From: Chris Mason @ 2006-03-02 13:45 UTC (permalink / raw) To: OGAWA Hirofumi; +Cc: Andrew Morton, col-pepper, linux-kernel On Wednesday 01 March 2006 20:15, OGAWA Hirofumi wrote: > > Just to make sure we're using the same terms, do you mean the pages are > > marked dirty and on the SB's dirty list, or do you mean the page has been > > through writepage and is currently on its way to the disk? > > The page is already on device's request queue, and the page is already > marked a PG_writeback. And that page is not processed by device yet. > > Then, you call next filemap_fdatawrite(), it just re-dirty the page > and queues to sb->s_dirty, because the page's buffer_heads is still > locked. So, the re-dirtyed page is re-submited to device by > periodically wb_kupdate()? filemap_fdatawrite() won't redirty the page. It will wait on the pending writeback. -chris ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: o_sync in vfat driver 2006-03-02 13:45 ` Chris Mason @ 2006-03-02 14:07 ` OGAWA Hirofumi 2006-03-02 17:01 ` Chris Mason 0 siblings, 1 reply; 49+ messages in thread From: OGAWA Hirofumi @ 2006-03-02 14:07 UTC (permalink / raw) To: Chris Mason; +Cc: Andrew Morton, col-pepper, linux-kernel Chris Mason <mason@suse.com> writes: > filemap_fdatawrite() won't redirty the page. It will wait on the pending > writeback. Umm... I'm looking the following code. + if (MSDOS_SB(sb)->options.flush) { + writeback_inode(dir); + writeback_inode(inode); + writeback_bdev(sb); + } +void +writeback_bdev(struct super_block *sb) +{ + struct address_space *mapping = sb->s_bdev->bd_inode->i_mapping; + filemap_flush(mapping); + blk_run_address_space(mapping); +} +EXPORT_SYMBOL_GPL(writeback_bdev); filemap_flush() is using WB_SYNC_NONE. in mpage_writepages() if (wbc->sync_mode != WB_SYNC_NONE) wait_on_page_writeback(page); if (PageWriteback(page) || !clear_page_dirty_for_io(page)) { unlock_page(page); continue; } Where does wait it? -- OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: o_sync in vfat driver 2006-03-02 14:07 ` OGAWA Hirofumi @ 2006-03-02 17:01 ` Chris Mason 2006-03-02 18:14 ` OGAWA Hirofumi 0 siblings, 1 reply; 49+ messages in thread From: Chris Mason @ 2006-03-02 17:01 UTC (permalink / raw) To: OGAWA Hirofumi; +Cc: Andrew Morton, col-pepper, linux-kernel On Thursday 02 March 2006 09:07, OGAWA Hirofumi wrote: > Chris Mason <mason@suse.com> writes: > > filemap_fdatawrite() won't redirty the page. It will wait on the pending > > writeback. > > Umm... I'm looking the following code. > > +void > +writeback_bdev(struct super_block *sb) > +{ > + struct address_space *mapping = sb->s_bdev->bd_inode->i_mapping; > + filemap_flush(mapping); > + blk_run_address_space(mapping); > +} > +EXPORT_SYMBOL_GPL(writeback_bdev); > > filemap_flush() is using WB_SYNC_NONE. > Ok, I thought you were asking about the code that called filemap_fdatawrite, which does wait. filemap_flush is used on the underlying block device. In the case of a page that is already under IO, the io is not cancelled but allowed to continue. This is the desired result. When you're doing a number of operations in sequence, each operation will start io on the block device. If they used filemap_fdatawrite instead of filemap_flush, they would end up being synchronous. -chris ^ permalink raw reply [flat|nested] 49+ messages in thread
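The distinction being argued here can be captured in a small toy model. It only mimics the two checks OGAWA quotes from mpage_writepages(); the enum, struct, and function names are invented for illustration and are not kernel types.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for WB_SYNC_NONE and WB_SYNC_ALL. */
enum sync_mode { SYNC_NONE, SYNC_ALL };

/* What the modelled writeback pass does with one page. */
enum action { SKIP, SUBMIT, WAIT_THEN_SKIP, WAIT_THEN_SUBMIT };

struct page_state {
    bool dirty;      /* page has data not yet submitted for I/O */
    bool writeback;  /* page is currently in flight to the device */
};

enum action writepages_action(enum sync_mode mode, struct page_state p)
{
    bool waited = false;

    assert(mode == SYNC_NONE || mode == SYNC_ALL);

    /* Models: if (wbc->sync_mode != WB_SYNC_NONE) wait_on_page_writeback(page); */
    if (mode != SYNC_NONE && p.writeback) {
        p.writeback = false;  /* the in-flight I/O has completed */
        waited = true;
    }

    /* Models: if (PageWriteback(page) || !clear_page_dirty_for_io(page)) continue; */
    if (p.writeback || !p.dirty)
        return waited ? WAIT_THEN_SKIP : SKIP;

    return waited ? WAIT_THEN_SUBMIT : SUBMIT;
}
```

In this model a page that is already under I/O is simply skipped in SYNC_NONE mode, which is the filemap_flush() case, while SYNC_ALL mode waits first, which is the filemap_fdatawrite() case.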
* Re: o_sync in vfat driver 2006-03-02 17:01 ` Chris Mason @ 2006-03-02 18:14 ` OGAWA Hirofumi 0 siblings, 0 replies; 49+ messages in thread From: OGAWA Hirofumi @ 2006-03-02 18:14 UTC (permalink / raw) To: Chris Mason; +Cc: Andrew Morton, col-pepper, linux-kernel Chris Mason <mason@suse.com> writes: > Ok, I thought you were asking about the code that called filemap_fdatawrite, > which does wait. filemap_flush is used on the underlying block device. In > the case of a page that is already under IO, the io is not cancelled but > allowed to continue. > > This is the desired result. When you're doing a number of operations in > sequence, each operation will start io on the block device. If they used > filemap_fdatawrite instead of filemap_flush, they would end up being > synchronous. Of course, I know. Let's return to beginning of this thread, do you have any plan to address it? -- OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: o_sync in vfat driver 2006-02-28 18:47 ` Chris Mason 2006-02-28 19:10 ` Andrew Morton [not found] ` <87u0aiw6pi.fsf@duaron.myhome.or.jp> @ 2006-03-29 2:13 ` Mathis Ahrens 2006-03-30 17:35 ` col-pepper 2 siblings, 1 reply; 49+ messages in thread From: Mathis Ahrens @ 2006-03-29 2:13 UTC (permalink / raw) To: Chris Mason; +Cc: Andrew Morton, col-pepper, linux-kernel Hi all, Chris Mason wrote: > On Monday 27 February 2006 18:12, Andrew Morton wrote: > >> We don't know that the same number of same-sized write()s were happening in >> each case. >> >> There's been some talk about implementing fsync()-on-file-close for this >> problem, and some protopatches. But nothing final yet. >> > > Here's the patch I'm using in -suse right now. What I want to do is make a > much more generic -o flush, but it'll still need a few bits in individual > filesystem to kick off metadata writes quickly. > > The basic goal behind the code is to trigger writes without waiting for both > data and metadata. If the user is watching the memory stick, when the > little light stops flashing all the data and metadata will be on disk. > > It also generally throttles userland a little during file release. This > could be changed to throttle for each page dirtied, but most users I > asked liked the current setup better. > I like the idea and would like to see something like this in mainline. 
Here is some non-scientific benchmark done with 2.6.16, comparing default mount and flush mount of a USB2 stick: ///////////////////////////////////////////////////////////////////// Single File "Test": 43MB $ time cp Test /media/usbdisk/test/ && time umount /media/usbdisk/ ///////////////////////////////////////////////////////////////////// VANILLA: real 0m3.770s user 0m0.004s sys 0m0.308s real 0m9.439s user 0m0.000s sys 0m0.040s FLUSH: real 0m6.000s user 0m0.012s sys 0m0.400s real 0m3.668s user 0m0.000s sys 0m0.028s REAL TIME RATIO (FLUSH/VANILLA): 9.6 / 13.1 = 0.73 ///////////////////////////////////////////////////////////////////// Directory Tree "flushtest": 44MB (8866 files, 1820 dirs) $ time cp -R flushtest/ /media/usbdisk/ && time umount /media/usbdisk/ ///////////////////////////////////////////////////////////////////// VANILLA: real 0m0.966s user 0m0.024s sys 0m0.860s real 1m11.962s user 0m0.004s sys 0m0.160s FLUSH: real 1m41.645s user 0m0.032s sys 0m1.112s real 0m4.660s user 0m0.004s sys 0m0.068s REAL TIME RATIO (FLUSH/VANILLA): 106.3 / 77.9 = 1.36 ^ permalink raw reply [flat|nested] 49+ messages in thread
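The ratios above can be rechecked with a few lines of arithmetic. The sketch below simply adds the "real" times for cp and umount and divides flush by vanilla; the function names are made up for the example. Recomputing from the listed times gives roughly 9.67 s / 13.21 s, about 0.73, for the single file, which matches the post. For the directory tree the listed vanilla times sum to about 72.9 s rather than the 77.9 s shown, which would put the ratio near 1.46 instead of 1.36; the post does not make clear whether a timing or the sum was mistyped.

```c
#include <assert.h>

/* Convert a time(1) "real" reading given as minutes and seconds to seconds. */
double to_seconds(int minutes, double seconds)
{
    return minutes * 60.0 + seconds;
}

/*
 * Ratio of total wall-clock time (copy plus umount), flush over vanilla.
 * A value below 1.0 means the flush mount finished sooner overall.
 */
double flush_ratio(double vanilla_cp, double vanilla_umount,
                   double flush_cp, double flush_umount)
{
    double vanilla = vanilla_cp + vanilla_umount;

    assert(vanilla > 0.0);
    return (flush_cp + flush_umount) / vanilla;
}
```

For example, the single-file case is flush_ratio(3.770, 9.439, 6.000, 3.668), and the directory-tree case is flush_ratio(0.966, to_seconds(1, 11.962), to_seconds(1, 41.645), 4.660).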
* Re: o_sync in vfat driver 2006-03-29 2:13 ` Mathis Ahrens @ 2006-03-30 17:35 ` col-pepper 0 siblings, 0 replies; 49+ messages in thread From: col-pepper @ 2006-03-30 17:35 UTC (permalink / raw) To: Mathis Ahrens, Chris Mason; +Cc: Andrew Morton, linux-kernel On Wed, 29 Mar 2006 04:13:03 +0200, Mathis Ahrens <Mathis.Ahrens@gmx.de> wrote: > Hi all, > > Chris Mason wrote: >> On Monday 27 February 2006 18:12, Andrew Morton wrote: >> >>> We don't know that the same number of same-sized write()s were >>> happening in >>> each case. >>> >>> There's been some talk about implementing fsync()-on-file-close for >>> this >>> problem, and some protopatches. But nothing final yet. >>> >> >> Here's the patch I'm using in -suse right now. What I want to do is >> make a much more generic -o flush, but it'll still need a few bits in >> individual filesystem to kick off metadata writes quickly. >> >> The basic goal behind the code is to trigger writes without waiting for >> both >> data and metadata. If the user is watching the memory stick, when the >> little light stops flashing all the data and metadata will be on disk. >> >> It also generally throttles userland a little during file release. >> This could be changed to throttle for each page dirtied, but most users >> I asked liked the current setup better. >> > > I like the idea and would like to see something like this in mainline. 
> > Here is some non-scientific benchmark done with 2.6.16, comparing > default mount and flush mount of a USB2 stick: > > ///////////////////////////////////////////////////////////////////// > Single File "Test": 43MB > $ time cp Test /media/usbdisk/test/ && time umount /media/usbdisk/ > ///////////////////////////////////////////////////////////////////// > > VANILLA: > > real 0m3.770s > user 0m0.004s > sys 0m0.308s > > real 0m9.439s > user 0m0.000s > sys 0m0.040s > > FLUSH: > > real 0m6.000s > user 0m0.012s > sys 0m0.400s > > real 0m3.668s > user 0m0.000s > sys 0m0.028s > > REAL TIME RATIO (FLUSH/VANILLA): > 9.6 / 13.1 = 0.73 > > ///////////////////////////////////////////////////////////////////// > Directory Tree "flushtest": 44MB (8866 files, 1820 dirs) > $ time cp -R flushtest/ /media/usbdisk/ && time umount /media/usbdisk/ > ///////////////////////////////////////////////////////////////////// > > VANILLA: > > real 0m0.966s > user 0m0.024s > sys 0m0.860s > > real 1m11.962s > user 0m0.004s > sys 0m0.160s > > FLUSH: > > real 1m41.645s > user 0m0.032s > sys 0m1.112s > > real 0m4.660s > user 0m0.004s > sys 0m0.068s > > REAL TIME RATIO (FLUSH/VANILLA): > 106.3 / 77.9 = 1.36 > > That's interesting; albeit non-scientific, I think it is quite informative. There are two basic problems with the current code: speed is down by around an order of magnitude compared to a non-synced write, and the code is hammering the FAT. The two are obviously related. Viewing the system globally rather than considering the details of the techniques used, it would seem that any algorithm that does not drastically reduce write times, at least on the one-large-file test, is missing the mark and presumably repeating the problem in a slightly different way. 
Not knocking the efforts Chris has put in; it's great to see this is getting some attention. But I think viewing overall performance times as shown above gives a touchstone as to whether any particular prototype patch is effective. The fact that flush can be almost 40% slower in some cases is worrying. Thanks for the info. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: o_sync in vfat driver 2006-02-27 22:19 ` col-pepper 2006-02-27 23:12 ` Andrew Morton @ 2006-02-28 0:52 ` Machida, Hiroyuki 1 sibling, 0 replies; 49+ messages in thread From: Machida, Hiroyuki @ 2006-02-28 0:52 UTC (permalink / raw) To: col-pepper; +Cc: Andrew Morton, linux-kernel@vger.kernel.org As Andrew suggested, OGAWA Hirofumi, the FAT maintainer, posted the following patch. It may help you. Please check it. Message-Id: <87wthrznsv.fsf@devron.myhome.or.jp> Subject: Re: [EXPERIMENT] Add new "flush" option This adds a new "flush" option for hotplug devices. The current implementation of the "flush" option does the following: - synchronizes data pages at ->release() (last close(2)) - if the user's work seems to be done (fs is not active), syncs all metadata via pdflush() This option would provide a kind of sane progress, and dirty buffers are flushed more frequently (if fs is not active). This option doesn't provide any robustness (robustness is provided by other options), but the option is probably appropriate for hotplug devices. (Please don't assume that dirty buffers are synchronized at any point. This implementation may easily change.) col-pepper@piments.com wrote: > Thanks for the reply. > > > On Mon, 27 Feb 2006 01:51:14 +0100, Andrew Morton <akpm@osdl.org> wrote: > >> col-pepper@piments.com wrote: >> >>> >>> There is nothing in the spec of vfat that suggests the FAT will be >>> written >>> 10,000 times during the writing of one large file. Indeed it is hard to >>> imagine >>> that any other implementation on any other OS or any previous linux >>> kernel >>> behaves like that. >> >> >> We sync the file metadata once per write() syscall. If your app writes a >> large file in lots of little bits, it'll do a lot of syncs. Other >> implementations of fatfs will (must) do the same thing. > > > That would not seem to be the case at least on MS systems. I had a > friend do some timings copying a large group of files to a 128M usb > flash device. 
> There was an arbitrary mix of files including many small files and some > larger files, one in excess of 50MB. > > suse10 default 4m10 > win2k 2m30 > suse w/o sync 30s > > The suse test was drag and drop in konqueror, the other dnd in windows > explorer. > >> >>> It would seem that the first step could be to revert to the 2.6.11 >>> behaviour which was more appropriate and probably safer even from >>> the data >>> point of view. >> >> >> fatfs used to be buggy - it didn't implement `-o sync'. Now it does, and >> what we're seeing is the fallout from the late fixing of that bug. >> > > I just tested on my 2.6.11 kernel which would predate the change and > there is a clear difference between mounting my usb device with and > without sync option. > > ls -ail /tmpd/mail* > 239151 -rw-r--r-- 1 root root 8169540 2006-02-27 19:04 > /tmpd/mail-bak.2006-02-28.bz2 > bash-3.1#time cp !$ /mnt/usb > time cp /tmpd/mail* /mnt/usb > > real 0m0.227s > user 0m0.001s > sys 0m0.070s > > It returns immediately with no disk activity. About 30s later there was > disk activity. Presumably some periodic flushing of IO buffers. > > bash-3.1#umount /mnt/usb > bash-3.1#mount -o sync !$ > > bash-3.1#time cp /tmpd/mail* /mnt/usb > > real 0m5.440s > user 0m0.000s > sys 0m0.143s > > So the older model did seem to have some sync functionality, though > presumably not the aggressive one-for-one sync that is now being used. > > Please correct me if my interpretation is flawed here: > > flash has to be cleared before being written. If metadata is written > with every block output with write(), the risk of erasing the FAT is > now many times higher than with the old sync policy. > > So the newer sync policy drastically _reduces_ the data security in the > case of untimely disconnection despite the speed penalty and possible > hardware damage it incurs. > > A less rigorous sync policy may in fact be more appropriate than the > current model. > > Thanks again. 
> > > [Note: I am not subscribed to LKML; if you wish me to receive any > follow-ups please BCC: col-pepper at piments point com. thx] > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ -- Hiroyuki Machida machida@sm.sony.co.jp SSW Dept. HENC, Sony Corp. ^ permalink raw reply [flat|nested] 49+ messages in thread
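A rough userspace analogue of the "sync at last close" idea in the patch description above is for the application to write normally and call fsync() once before close(), instead of paying for a sync on every write() as happens with -o sync (per Andrew's remark that metadata is synced once per write() syscall). The sketch below is only an illustration under that assumption; the helper name write_then_sync_once is hypothetical and nothing here is taken from the patch.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: write len bytes from buf to path in pieces of at
 * most chunk bytes, then fsync() once before closing.
 * Returns the number of write() calls made, or -1 on error.
 * An interrupted write (EINTR) is treated as an error to keep this short.
 */
long write_then_sync_once(const char *path, const char *buf,
                          size_t len, size_t chunk)
{
    long calls = 0;
    size_t off = 0;
    int fd, rc;

    assert(buf != NULL && chunk > 0);

    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    while (off < len) {
        size_t want = (len - off < chunk) ? len - off : chunk;
        ssize_t n = write(fd, buf + off, want);

        if (n < 0) {
            close(fd);
            return -1;
        }
        off += (size_t)n;
        calls++;
    }

    /* One sync for the whole file, rather than one per write() call. */
    rc = fsync(fd);
    if (close(fd) != 0 || rc != 0)
        return -1;
    return calls;
}
```

The return value makes the contrast concrete: a file of about 43 MB written in 4 KiB pieces takes on the order of ten thousand write() calls, and a per-write sync policy would touch the metadata that many times where this sketch syncs once.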
Thread overview: 49+ messages
[not found] <op.s5cj47sxj68xd1@mail.piments.com>
[not found] ` <op.s5jpqvwhui3qek@mail.piments.com>
[not found] ` <op.s5kxhyzgfx0war@mail.piments.com>
[not found] ` <op.s5kx7xhfj68xd1@mail.piments.com>
[not found] ` <op.s5kya3t0j68xd1@mail.piments.com>
[not found] ` <op.s5ky2dbcj68xd1@mail.piments.com>
[not found] ` <op.s5ky71nwj68xd1@mail.piments.com>
[not found] ` <op.s5kzao2jj68xd1@mail.piments.com>
2006-02-26 22:50 ` o_sync in vfat driver col-pepper
2006-02-27 13:28 ` Lennart Sorensen
2006-02-27 13:50 ` Arjan van de Ven
2006-02-27 14:06 ` Anton Altaparmakov
2006-02-27 14:27 ` Arjan van de Ven
2006-02-27 14:41 ` Anton Altaparmakov
2006-02-27 21:04 ` col-pepper
2006-02-27 21:17 ` Arjan van de Ven
2006-02-27 23:21 ` col-pepper
2006-02-27 21:32 ` linux-os (Dick Johnson)
2006-02-27 23:21 ` col-pepper
2006-02-28 13:10 ` linux-os (Dick Johnson)
2006-02-28 13:52 ` Sergei Organov
2006-02-28 15:18 ` Lennart Sorensen
2006-02-28 16:16 ` linux-os (Dick Johnson)
2006-02-28 17:23 ` Sergei Organov
2006-02-28 18:09 ` Krzysztof Halasa
2006-02-28 17:16 ` col-pepper
2006-02-28 22:38 ` Pavel Machek
2006-02-28 23:10 ` why VM_SHM has been removed from mm.h? Kamran Karimi
2006-03-01 3:02 ` Phillip Susi
2006-03-01 7:56 ` Hugh Dickins
2006-03-01 14:58 ` Kamran Karimi
2006-03-01 16:24 ` Hugh Dickins
2006-03-01 16:55 ` Kamran Karimi
2006-03-01 17:50 ` Hugh Dickins
2006-03-01 4:28 ` o_sync in vfat driver Kyle Moffett
2006-03-02 8:23 ` col-pepper
2006-03-02 8:32 ` Pavel Machek
2006-02-28 16:11 ` Helge Hafting
2006-02-28 22:37 ` Pavel Machek
2006-02-27 14:26 ` linux-os (Dick Johnson)
2006-02-27 18:53 ` Jan Engelhardt
2006-02-26 22:55 col-pepper
-- strict thread matches above, loose matches on Subject: below --
2006-02-26 23:08 col-pepper
2006-02-27 0:51 ` Andrew Morton
2006-02-27 22:19 ` col-pepper
2006-02-27 23:12 ` Andrew Morton
2006-02-28 18:47 ` Chris Mason
2006-02-28 19:10 ` Andrew Morton
2006-02-28 19:48 ` Chris Mason
[not found] ` <87u0aiw6pi.fsf@duaron.myhome.or.jp>
2006-03-01 15:23 ` Chris Mason
[not found] ` <87mzg9wst0.fsf@duaron.myhome.or.jp>
2006-03-02 13:45 ` Chris Mason
2006-03-02 14:07 ` OGAWA Hirofumi
2006-03-02 17:01 ` Chris Mason
2006-03-02 18:14 ` OGAWA Hirofumi
2006-03-29 2:13 ` Mathis Ahrens
2006-03-30 17:35 ` col-pepper
2006-02-28 0:52 ` Machida, Hiroyuki