* 2.6.0-test11 data loss
@ 2003-12-24 21:59 Keith Lea
2003-12-24 22:22 ` Gergely Tamas
` (2 more replies)
0 siblings, 3 replies; 10+ messages in thread
From: Keith Lea @ 2003-12-24 21:59 UTC (permalink / raw)
To: linux-kernel
Hello, I'm not subscribed to this list. This is not a help request, and
not really a bug report, I just thought someone should know about this.
I installed the 2.6.0-beta11-mm kernel last week, and the other day my
computer locked up (this is normal on my laptop with every kernel
version I've tried, this isn't the problem I'm posting about). When I
restarted, many, many files that had been open when it locked up were
filled with garbage, or the contents of totally unrelated files. For
example, my syslog contained some KDE header file code, and
/sbin/modprobe contained 82kb of data that seemed like random noise. I
think each file was the same size as it was originally, just with
different data, but I'm not sure.
The corruption happened on two separate partitions on a single IDE
laptop drive, and both were ReiserFS 3.6 partitions. I don't know if
this is a kernel bug or a Reiser bug or something else, but I thought
the kernel developers should know about this, and be on the lookout for
similar things (hopefully with more informative bug reports than mine).
I'm sorry I don't have more information, but if anyone wants to know
more about my system I'd be glad to help.
-Keith Lea
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: 2.6.0-test11 data loss
2003-12-24 21:59 2.6.0-test11 data loss Keith Lea
@ 2003-12-24 22:22 ` Gergely Tamas
2003-12-24 22:34 ` Con Kolivas
2003-12-25 16:46 ` Tomas Szepe
2003-12-25 1:21 ` Felipe Alfaro Solana
2003-12-25 6:11 ` Hans Reiser
2 siblings, 2 replies; 10+ messages in thread
From: Gergely Tamas @ 2003-12-24 22:22 UTC (permalink / raw)
To: Keith Lea; +Cc: linux-kernel

Hi,

I've been hit by the same problem but using 2.6.0. As you described,
garbage in files (e.g. /etc/modules.conf, ...).

2.6.0, Slackware 9.1

> The corruption happened on two separate partitions on a single IDE
> laptop drive, and both were ReiserFS 3.6 partitions. I don't know if
> this is a kernel bug or a Reiser bug or something else, but I thought

I don't think this is a reiserfs bug. This was my first thought and
after first hitting this bug, I've moved all my partitions from reiserfs
to jfs. But I've also had this problem with it... Now I'm back to
2.4.23, and everything works fine.

Gergely
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: 2.6.0-test11 data loss
2003-12-24 22:22 ` Gergely Tamas
@ 2003-12-24 22:34 ` Con Kolivas
2003-12-25 2:07 ` Eric D. Mudama
2003-12-25 16:46 ` Tomas Szepe
1 sibling, 1 reply; 10+ messages in thread
From: Con Kolivas @ 2003-12-24 22:34 UTC (permalink / raw)
To: Gergely Tamas, Keith Lea; +Cc: linux-kernel

Hello

On Thu, 25 Dec 2003 09:22, Gergely Tamas wrote:
> I've been hit by the same problem but using 2.6.0. As you described,
> garbage in files (e.g. /etc/modules.conf, ...).
>
> 2.6.0, Slackware 9.1
>
> > The corruption happened on two separate partitions on a single IDE
> > laptop drive, and both were ReiserFS 3.6 partitions. I don't know if
> > this is a kernel bug or a Reiser bug or something else, but I thought
>
> I don't think this is a reiserfs bug. This was my first thought and
> after first hitting this bug, I've moved all my partitions from reiserfs
> to jfs. But I've also had this problem with it... Now I'm back to
> 2.4.23, and everything works fine.

Because of the numerous reboots and hangs I've seen with experimental
patches, I've also seen this, but it's not ReiserFS's fault. The problem
is that most drives have write caching enabled and not all of them are
safe with this. If you disable it with hdparm (hdparm -W 0 /dev/hd*),
you'll find that files open during a hard reset or power outage are no
longer corrupted.

Con
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: 2.6.0-test11 data loss
2003-12-24 22:34 ` Con Kolivas
@ 2003-12-25 2:07 ` Eric D. Mudama
2003-12-25 5:17 ` Con Kolivas
2003-12-25 6:15 ` Hans Reiser
0 siblings, 2 replies; 10+ messages in thread
From: Eric D. Mudama @ 2003-12-25 2:07 UTC (permalink / raw)
To: linux-kernel

On Thu, Dec 25 at 9:34, Con Kolivas wrote:
>On Thu, 25 Dec 2003 09:22, Gergely Tamas wrote:
>> I don't think this is a reiserfs bug. This was my first thought and
>> after first hitting this bug, I've moved all my partitions from reiserfs
>> to jfs. But I've also had this problem with it... Now I'm back to
>> 2.4.23, and everything works fine.
>
>Because of the numerous reboots and hangs I've seen with experimental
>patches, I've also seen this, but it's not ReiserFS's fault. The problem
>is that most drives have write caching enabled and not all of them are
>safe with this. If you disable it with hdparm (hdparm -W 0 /dev/hd*),
>you'll find that files open during a hard reset or power outage are no
>longer corrupted.

Write cache off will not prevent a file from being corrupted; however,
it should limit the corruption to a single disk operation.

I don't see how the behavior you describe could be the drive's
fault...

The user stated that their system hard locked, then they went and
rebooted it, and following the reboot they had corruption... From
this, there are a few possibilities:

1. The drive had been given the commands to write the data prior to the
hang.

If this was the case, the drive would happily keep writing the data
it had been given and was caching in the background, even while you
continued to send (or stopped sending) data for a new command over the
interface. An IDE interface lockup or system lockup will not prevent
the drive from flushing the remainder of its write cache. (The only
possible exception might be faulty handling of a hard reset, but all
drives today will flush their cache when they see the reset, prior to
processing it.) Unless the user yanked power within a few hundred
milliseconds of the write command, I think it is unlikely that cached
data already in the drive wasn't flushed properly.

2. The drive was in the middle of a command writing important data
during the hang.

In this case, yes, the file you were writing would probably be
corrupt on the media, but nothing more. Drives detect power loss, and
immediately disable write-gate and park the actuator. If they don't
get the actuator parked before they run out of back-EMF from the
momentum of the platter(s), the head will stick to the media and
you'll probably need a chisel to get that drive to spin again.

3. The drive hadn't yet been issued the commands for the data that was
eventually corrupted.

I find this to be the most likely case. It is a situation where the
filesystem thinks objects were moved but those updates were not
correctly sent to the disk (due to the hang?), so it might think
they're in the old location or something. (I'm not a filesystem
wizard so if I'm way off-base, my apologies)

It seems to me that the problem occurred at a higher system level than
the disk, and disabling the write cache on the drive (besides being a
*HUGE* performance loser) will only make the window for failure
smaller, not eliminate it entirely.

Unless you are using *really* old hard drives, the write caching in
today's drives is really quite good and definitely should be usable.
Sure, it makes things less safe in power events, but system lockups
shouldn't affect the drive's ability to flush its cache. Note too
that Gergely reported that the problem went away on his 2.4.23 system.
I don't believe that to be a small data point.

--eric, posting from home

--
Eric D. Mudama
edmudama@mail.bounceswoosh.org
^ permalink raw reply [flat|nested] 10+ messages in thread
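The three cases above can be illustrated with a toy model of where a
written block sits at the moment of failure. This is only a sketch under
stated assumptions: the WritePath class, its three stages (OS memory,
drive cache, platter) and the lost_after_hang helper are hypothetical
names invented for illustration, and real drives and kernels are far
more complicated. It shows why, on a host hang with the drive still
powered, turning the drive's write cache off does not change what is
lost: the missing data never left OS memory.

```python
from dataclasses import dataclass, field


@dataclass
class WritePath:
    """Toy model (hypothetical): OS memory -> drive cache -> platter."""
    drive_cache_enabled: bool = True
    os_dirty: list = field(default_factory=list)     # not yet sent to the drive
    drive_cache: list = field(default_factory=list)  # accepted by the drive, not on media
    platter: list = field(default_factory=list)      # durable

    def write(self, block):
        # The application writes; the kernel only buffers the block for now.
        self.os_dirty.append(block)

    def issue_to_drive(self):
        # The kernel sends all buffered blocks to the drive.
        target = self.drive_cache if self.drive_cache_enabled else self.platter
        target.extend(self.os_dirty)
        self.os_dirty.clear()

    def host_hang(self):
        # Host locks up but the drive keeps power: the drive drains its own
        # cache, while anything still in OS memory is lost at reboot.
        self.platter.extend(self.drive_cache)
        self.drive_cache.clear()
        lost = list(self.os_dirty)
        self.os_dirty.clear()
        return lost

    def power_loss(self):
        # Power is cut: both OS memory and the drive cache are lost.
        lost = self.os_dirty + self.drive_cache
        self.os_dirty = []
        self.drive_cache = []
        return lost


def lost_after_hang(cache_enabled):
    """Return (lost blocks, durable blocks) after a simulated host hang."""
    path = WritePath(drive_cache_enabled=cache_enabled)
    path.write("A")
    path.issue_to_drive()   # case 1: already handed to the drive
    path.write("B")         # case 3: never issued before the hang
    return path.host_hang(), path.platter


if __name__ == "__main__":
    for enabled in (True, False):
        lost, durable = lost_after_hang(enabled)
        state = "on" if enabled else "off"
        print(f"write cache {state}: lost={lost} durable={durable}")
```

In this model the drive cache only matters in power_loss(), which matches
the remark that write caching makes things less safe in power events but
should not matter for a plain system lockup.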
* Re: 2.6.0-test11 data loss
2003-12-25 2:07 ` Eric D. Mudama
@ 2003-12-25 5:17 ` Con Kolivas
2003-12-25 6:15 ` Hans Reiser
1 sibling, 0 replies; 10+ messages in thread
From: Con Kolivas @ 2003-12-25 5:17 UTC (permalink / raw)
To: Eric D. Mudama, linux-kernel

On Thu, 25 Dec 2003 13:07, Eric D. Mudama wrote:
> On Thu, Dec 25 at 9:34, Con Kolivas wrote:
> >On Thu, 25 Dec 2003 09:22, Gergely Tamas wrote:
> >> I don't think this is a reiserfs bug. This was my first thought and
> >> after first hitting this bug, I've moved all my partitions from reiserfs
> >> to jfs. But I've also had this problem with it... Now I'm back to
> >> 2.4.23, and everything works fine.
> >
> >Because of the numerous reboots and hangs I've seen with experimental
> >patches, I've also seen this, but it's not ReiserFS's fault. The problem
> >is that most drives have write caching enabled and not all of them are
> >safe with this. If you disable it with hdparm (hdparm -W 0 /dev/hd*),
> >you'll find that files open during a hard reset or power outage are no
> >longer corrupted.
>
> Write cache off will not prevent a file from being corrupted; however,
> it should limit the corruption to a single disk operation.
>
> I don't see how the behavior you describe could be the drive's
> fault...
>
> The user stated that their system hard locked, then they went and
> rebooted it, and following the reboot they had corruption... From
> this, there are a few possibilities:
>
> 1. The drive had been given the commands to write the data prior to the
> hang.
>
> If this was the case, the drive would happily keep writing the data
> it had been given and was caching in the background, even while you
> continued to send (or stopped sending) data for a new command over the
> interface. An IDE interface lockup or system lockup will not prevent
> the drive from flushing the remainder of its write cache. (The only
> possible exception might be faulty handling of a hard reset, but all
> drives today will flush their cache when they see the reset, prior to
> processing it.) Unless the user yanked power within a few hundred
> milliseconds of the write command, I think it is unlikely that cached
> data already in the drive wasn't flushed properly.
>
> 2. The drive was in the middle of a command writing important data
> during the hang.
>
> In this case, yes, the file you were writing would probably be
> corrupt on the media, but nothing more. Drives detect power loss, and
> immediately disable write-gate and park the actuator. If they don't
> get the actuator parked before they run out of back-EMF from the
> momentum of the platter(s), the head will stick to the media and
> you'll probably need a chisel to get that drive to spin again.
>
> 3. The drive hadn't yet been issued the commands for the data that was
> eventually corrupted.
>
> I find this to be the most likely case. It is a situation where the
> filesystem thinks objects were moved but those updates were not
> correctly sent to the disk (due to the hang?), so it might think
> they're in the old location or something. (I'm not a filesystem
> wizard so if I'm way off-base, my apologies)
>
> It seems to me that the problem occurred at a higher system level than
> the disk, and disabling the write cache on the drive (besides being a
> *HUGE* performance loser) will only make the window for failure
> smaller, not eliminate it entirely.
>
> Unless you are using *really* old hard drives, the write caching in
> today's drives is really quite good and definitely should be usable.
> Sure, it makes things less safe in power events, but system lockups
> shouldn't affect the drive's ability to flush its cache. Note too
> that Gergely reported that the problem went away on his 2.4.23 system.
> I don't believe that to be a small data point.

I hardly said it was the correct solution; just what worked for me, as I
had exactly the same issue going 2.4->2.6. I can't even recall if write
caching was actually on in 2.4, and my write performance under video
capture has not shown any detriment. The filesystem gods should comment.

Merry Christmas.

Con
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: 2.6.0-test11 data loss
2003-12-25 2:07 ` Eric D. Mudama
2003-12-25 5:17 ` Con Kolivas
@ 2003-12-25 6:15 ` Hans Reiser
1 sibling, 0 replies; 10+ messages in thread
From: Hans Reiser @ 2003-12-25 6:15 UTC (permalink / raw)
To: Eric D. Mudama; +Cc: linux-kernel, Chris Mason

Eric D. Mudama wrote:
>
> It seems to me that the problem occurred at a higher system level than
> the disk, and disabling the write cache on the drive (besides being a
> *HUGE* performance loser) will only make the window for failure
> smaller, not eliminate it entirely.
>

You should only use write caching in kernels where write cache flushing
is supported. Chris, which ones are those, could you remind us?

--
Hans
^ permalink raw reply [flat|nested] 10+ messages in thread
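The point about cache flushing also applies from the application side. As
a minimal sketch, assuming a POSIX system, a program that rewrites an
important file can write a temporary copy, call fsync on it, and only then
rename it over the original. The function name durable_replace is made up
for this illustration. Whether the fsync actually reaches the platter
still depends on the kernel flushing the drive's write cache and on the
drive honoring that flush, which is exactly the kernel support being asked
about above, so this narrows the window rather than closing it.

```python
import os
import tempfile


def durable_replace(path, data):
    """Replace `path` with `data` (bytes), flushing before the rename.

    Hypothetical helper for illustration; assumes a POSIX filesystem.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # push Python's buffer to the kernel
            os.fsync(tmp.fileno())  # ask the kernel to push it to the device
        os.replace(tmp_path, path)  # rename over the old file in one step
    except BaseException:
        # Clean up the temporary file if the rename never happened.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise

    # Flush the directory entry too, so the rename itself is recorded.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    target = os.path.join(workdir, "modules.conf")
    durable_replace(target, b"alias eth0 e100\n")
    with open(target, "rb") as fh:
        print(fh.read())
```

With this pattern a reader should see either the complete old file or the
complete new one after a crash, provided the layers below behave as
assumed.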
* Re: 2.6.0-test11 data loss
2003-12-24 22:22 ` Gergely Tamas
2003-12-24 22:34 ` Con Kolivas
@ 2003-12-25 16:46 ` Tomas Szepe
2003-12-25 23:45 ` Hmamouche, Youssef
1 sibling, 1 reply; 10+ messages in thread
From: Tomas Szepe @ 2003-12-25 16:46 UTC (permalink / raw)
To: Gergely Tamas; +Cc: Keith Lea, linux-kernel

On Dec-24 2003, Wed, 23:22 +0100
Gergely Tamas <dice@mfa.kfki.hu> wrote:

> I've been hit by the same problem but using 2.6.0. As you described,
> garbage in files (e.g. /etc/modules.conf, ...).
>
> 2.6.0, Slackware 9.1

Count me in.

IBM ThinkPad T40p (PIIX IDE HDD access)
slackware-current
linux-2.6.0
reiserfs-3.6

I can reproduce the problem anytime simply by terminating an XDM session.

- complete freeze
- blank screen
- can't see an oops
- nothing in the logs
- kernel won't panic (tried w/ the morsecode panics patch)
- typically corrupted files (random garbage in the middle):
  /lib/modules/2.6.0/modules.dep
  /var/adm/messages

I'm hesitant in blaming hdd write cache since there's no power outage
involved (also I've never seen this before w/ 2.4).

--
Tomas Szepe <szepe@pinerecords.com>
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: 2.6.0-test11 data loss
2003-12-25 16:46 ` Tomas Szepe
@ 2003-12-25 23:45 ` Hmamouche, Youssef
0 siblings, 0 replies; 10+ messages in thread
From: Hmamouche, Youssef @ 2003-12-25 23:45 UTC (permalink / raw)
To: Tomas Szepe; +Cc: Gergely Tamas, Keith Lea, linux-kernel

I'm getting a system freeze but no file corruption. The freeze happens
randomly after all rc.d scripts run. The freeze seems to happen slightly
"later" with the 2.6.0-mm1 patch applied (I was able to log in and run
startx), whereas before it happened before or while logging in.

My boot parameters usually look like this:

BOOT_IMAGE=Linux-2.6.0 ro root=303 apm=on acpi=off

IBM Thinkpad T22
linux-2.6.0 | linux-2.6.0-mm1
slackware 9.1

bash-2.05b# lspci
00:00.0 Host bridge: Intel Corp. 440BX/ZX/DX - 82443BX/ZX/DX Host bridge (rev 03)
00:01.0 PCI bridge: Intel Corp. 440BX/ZX/DX - 82443BX/ZX/DX AGP bridge (rev 03)
00:02.0 CardBus bridge: Texas Instruments PCI1450 (rev 03)
00:02.1 CardBus bridge: Texas Instruments PCI1450 (rev 03)
00:03.0 Ethernet controller: Intel Corp. 82557/8/9 [Ethernet Pro 100] (rev 0c)
00:03.1 Serial controller: Lucent Microelectronics LT WinModem (rev 01)
00:05.0 Multimedia audio controller: Cirrus Logic CS 4614/22/24 [CrystalClear SoundFusion Audio Accelerator] (rev 01)
00:07.0 Bridge: Intel Corp. 82371AB/EB/MB PIIX4 ISA (rev 02)
00:07.1 IDE interface: Intel Corp. 82371AB/EB/MB PIIX4 IDE (rev 01)
00:07.2 USB Controller: Intel Corp. 82371AB/EB/MB PIIX4 USB (rev 01)
00:07.3 Bridge: Intel Corp. 82371AB/EB/MB PIIX4 ACPI (rev 03)
01:00.0 VGA compatible controller: S3 Inc. 86C270-294 Savage/IX-MV (rev 13)

I have now gone back and tested the kernel (mm1) with the following
parameters, and the system hasn't frozen yet. I'll report if anything
goes wrong.

BOOT_IMAGE=Linux-2.6.0 ro root=303 idebus=66 ide0=ata66 ide1=ata66 ide2=ata66 apm=on acpi=off

Thank you

On Thu, 25 Dec 2003, Tomas Szepe wrote:

> On Dec-24 2003, Wed, 23:22 +0100
> Gergely Tamas <dice@mfa.kfki.hu> wrote:
>
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: 2.6.0-test11 data loss
2003-12-24 21:59 2.6.0-test11 data loss Keith Lea
2003-12-24 22:22 ` Gergely Tamas
@ 2003-12-25 1:21 ` Felipe Alfaro Solana
2003-12-25 6:11 ` Hans Reiser
2 siblings, 0 replies; 10+ messages in thread
From: Felipe Alfaro Solana @ 2003-12-25 1:21 UTC (permalink / raw)
To: Keith Lea; +Cc: Linux Kernel Mailinglist

On Wed, 2003-12-24 at 22:59, Keith Lea wrote:
> Hello, I'm not subscribed to this list. This is not a help request, and
> not really a bug report, I just thought someone should know about this.
>
> I installed the 2.6.0-beta11-mm kernel last week, and the other day my
> computer locked up (this is normal on my laptop with every kernel
> version I've tried, this isn't the problem I'm posting about). When I
> restarted, many, many files that had been open when it locked up were
> filled with garbage, or the contents of totally unrelated files. For
> example, my syslog contained some KDE header file code, and
> /sbin/modprobe contained 82kb of data that seemed like random noise. I
> think each file was the same size as it was originally, just with
> different data, but I'm not sure.
>
> The corruption happened on two separate partitions on a single IDE
> laptop drive, and both were ReiserFS 3.6 partitions. I don't know if
> this is a kernel bug or a Reiser bug or something else, but I thought
> the kernel developers should know about this, and be on the lookout for
> similar things (hopefully with more informative bug reports than mine).
> I'm sorry I don't have more information, but if anyone wants to know
> more about my system I'd be glad to help.

I know this is not the answer you're looking for, but could you please
test again using 2.6.0 or 2.6.0-mm1? 2.6.0-test11 is a bit out of date
now that 2.6.0 has gone gold.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: 2.6.0-test11 data loss
2003-12-24 21:59 2.6.0-test11 data loss Keith Lea
2003-12-24 22:22 ` Gergely Tamas
2003-12-25 1:21 ` Felipe Alfaro Solana
@ 2003-12-25 6:11 ` Hans Reiser
2 siblings, 0 replies; 10+ messages in thread
From: Hans Reiser @ 2003-12-25 6:11 UTC (permalink / raw)
To: Keith Lea; +Cc: linux-kernel

Keith Lea wrote:

> Hello, I'm not subscribed to this list. This is not a help request,
> and not really a bug report, I just thought someone should know about
> this.
>
> I installed the 2.6.0-beta11-mm kernel last week, and the other day my
> computer locked up (this is normal on my laptop with every kernel
> version I've tried, this isn't the problem I'm posting about). When I
> restarted, many, many files that had been open when it locked up were
> filled with garbage, or the contents of totally unrelated files. For
> example, my syslog contained some KDE header file code, and
> /sbin/modprobe contained 82kb of data that seemed like random noise. I
> think each file was the same size as it was originally, just with
> different data, but I'm not sure.
>
> The corruption happened on two separate partitions on a single IDE
> laptop drive, and both were ReiserFS 3.6 partitions. I don't know if
> this is a kernel bug or a Reiser bug or something else, but I thought
> the kernel developers should know about this, and be on the lookout
> for similar things (hopefully with more informative bug reports than
> mine). I'm sorry I don't have more information, but if anyone wants to
> know more about my system I'd be glad to help.
>
> -Keith Lea
> -
> To unsubscribe from this list: send the line "unsubscribe
> linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
>

Please read about the difference between metadata journaling, data
journaling, and atomic filesystems, and all will become clear.

Also note the ordered writes option for version 3.6 of reiserfs, which
is probably what you want until atomic reiser4 is fully stable.

--
Hans
^ permalink raw reply [flat|nested] 10+ messages in thread
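The distinction drawn above can be pictured with a deliberately tiny
model. This is a sketch under loud assumptions, not a description of how
ReiserFS is implemented: the function crash_after_metadata, the mode names
"writeback" and "ordered", the block number and the file contents are all
invented for illustration. It shows how committing metadata before the
data blocks are written can leave a file of the right size pointing at
whatever an earlier, unrelated file left in those blocks, which resembles
the symptom reported at the start of the thread.

```python
def crash_after_metadata(mode):
    """Return what a reader sees in 'syslog' after a simulated crash.

    mode "writeback": only metadata is journaled, data may lag behind.
    mode "ordered":   data blocks are written before metadata is committed.
    Both names are assumptions chosen for this toy model.
    """
    # Block 7 still holds the contents of a file that was deleted earlier.
    disk = {7: "/* old KDE header */"}
    inode = {"syslog": []}          # committed metadata: file -> block list

    new_data = "Dec 24 kernel: ..."
    if mode == "ordered":
        disk[7] = new_data          # data reaches the disk first
        inode["syslog"] = [7]       # then the metadata is committed
        # Crash here: metadata and data agree.
    elif mode == "writeback":
        inode["syslog"] = [7]       # metadata is committed first
        # Crash here, before the data write was ever issued.
    else:
        raise ValueError(f"unknown mode: {mode}")

    return [disk[block] for block in inode["syslog"]]


if __name__ == "__main__":
    for mode in ("writeback", "ordered"):
        print(mode, "->", crash_after_metadata(mode))
```

In the "writeback" branch the file system structure is consistent after
the crash, yet the file's contents are stale data from another file; the
"ordered" branch avoids that particular outcome in this model by fixing
the order of the two writes.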