2.6.0-test11 data loss

public inbox for linux-kernel@vger.kernel.org

* 2.6.0-test11 data loss
@ 2003-12-24 21:59 Keith Lea
  2003-12-24 22:22 ` Gergely Tamas
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Keith Lea @ 2003-12-24 21:59 UTC
  To: linux-kernel

Hello, I'm not subscribed to this list. This is not a help request, and 
not really a bug report, I just thought someone should know about this.

I installed the 2.6.0-test11-mm kernel last week, and the other day my 
computer locked up (this is normal on my laptop with every kernel 
version I've tried, this isn't the problem I'm posting about). When I 
restarted, many, many files that had been open when it locked up were 
filled with garbage, or the contents of totally unrelated files. For 
example, my syslog contained some KDE header file code, and 
/sbin/modprobe contained 82kb of data that seemed like random noise. I 
think each file was the same size as it was originally, just with 
different data, but I'm not sure.

The corruption happened on two separate partitions on a single IDE 
laptop drive, and both were ReiserFS 3.6 partitions. I don't know if 
this is a kernel bug or a Reiser bug or something else, but I thought 
the kernel developers should know about this, and be on the lookout for 
similar things (hopefully with more informative bug reports than mine). 
I'm sorry I don't have more information, but if anyone wants to know 
more about my system I'd be glad to help.

-Keith Lea
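
Keith's observation that the damaged files kept their original size means a size check alone would not catch this kind of corruption. As a minimal sketch (not something anyone in the thread ran), content hashes recorded before a crash can be compared afterwards. The helper names `sha256_of`, `snapshot`, and `changed_files` are hypothetical, and Python is used purely for illustration.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(paths):
    """Map each path to (size in bytes, content hash)."""
    return {p: (os.path.getsize(p), sha256_of(p)) for p in paths}


def changed_files(before, after):
    """List (path, same_size) for every file whose content hash changed.

    same_size is True when the length is unchanged, which matches the
    symptom described in the report: same size, different data.
    """
    report = []
    for path, (old_size, old_hash) in before.items():
        new_size, new_hash = after[path]
        if new_hash != old_hash:
            report.append((path, new_size == old_size))
    return report


if __name__ == "__main__":
    # Simulate the symptom on a scratch file: overwrite the contents
    # with the same number of unrelated bytes.
    workdir = tempfile.mkdtemp()
    victim = os.path.join(workdir, "modules.conf")
    with open(victim, "wb") as f:
        f.write(b"alias eth0 e100\n")
    before = snapshot([victim])
    with open(victim, "wb") as f:
        f.write(b"\x00" * 16)
    print(changed_files(before, snapshot([victim])))
```

A manifest like this only detects the damage after the fact; it says nothing about whether the filesystem, the kernel, or the drive caused it.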


* Re: 2.6.0-test11 data loss
  2003-12-24 21:59 2.6.0-test11 data loss Keith Lea
@ 2003-12-24 22:22 ` Gergely Tamas
  2003-12-24 22:34   ` Con Kolivas
  2003-12-25 16:46   ` Tomas Szepe
  2003-12-25  1:21 ` Felipe Alfaro Solana
  2003-12-25  6:11 ` Hans Reiser
  2 siblings, 2 replies; 10+ messages in thread
From: Gergely Tamas @ 2003-12-24 22:22 UTC
  To: Keith Lea; +Cc: linux-kernel

Hi,

I've been hit by the same problem, but using 2.6.0. As you described,
garbage in files (e.g. /etc/modules.conf, ...).

2.6.0, Slackware 9.1

 > The corruption happened on two separate partitions on a single IDE 
 > laptop drive, and both were ReiserFS 3.6 partitions. I don't know if 
 > this is a kernel bug or a Reiser bug or something else, but I thought 

I don't think this is a reiserfs bug. This was my first thought and
after first hitting this bug, I've moved all my partitions from reiserfs
to jfs. But I've also had this problem with it... Now I'm back to
2.4.23, and everything works fine.

Gergely


* Re: 2.6.0-test11 data loss
  2003-12-24 22:22 ` Gergely Tamas
@ 2003-12-24 22:34   ` Con Kolivas
  2003-12-25  2:07     ` Eric D. Mudama
  2003-12-25 16:46   ` Tomas Szepe
  1 sibling, 1 reply; 10+ messages in thread
From: Con Kolivas @ 2003-12-24 22:34 UTC
  To: Gergely Tamas, Keith Lea; +Cc: linux-kernel

Hello

On Thu, 25 Dec 2003 09:22, Gergely Tamas wrote:
> I've been hit by the same problem but using 2.6.0 . As you described,
> garbage in files (eg. /etc/modules.conf, ...).
>
> 2.6.0, Slackware 9.1
>
>  > The corruption happened on two separate partitions on a single IDE
>  > laptop drive, and both were ReiserFS 3.6 partitions. I don't know if
>  > this is a kernel bug or a Reiser bug or something else, but I thought
>
> I don't think this is a reiserfs bug. This was my first thought and
> after first hitting this bug, I've moved all my partitions from reiserfs
> to jfs. But I've also had this problem with it... Now I'm back to
> 2.4.23, and everything works fine.

Because of the numerous reboots and hangs I've seen with experimental patches, 
I've also seen this, but it's not ReiserFS's fault. The problem is that most 
drives have write caching enabled and not all of them are safe with this. If 
you disable it with hdparm (hdparm -W 0 /dev/hd*), you'll find that files open 
during a hard reset or power outage are no longer corrupted.

Con



* Re: 2.6.0-test11 data loss
  2003-12-24 22:34   ` Con Kolivas
@ 2003-12-25  2:07     ` Eric D. Mudama
  2003-12-25  5:17       ` Con Kolivas
  2003-12-25  6:15       ` Hans Reiser
  0 siblings, 2 replies; 10+ messages in thread
From: Eric D. Mudama @ 2003-12-25  2:07 UTC
  To: linux-kernel

On Thu, Dec 25 at  9:34, Con Kolivas wrote:
>On Thu, 25 Dec 2003 09:22, Gergely Tamas wrote:
>> I don't think this is a reiserfs bug. This was my first thought and
>> after first hitting this bug, I've moved all my partitions from reiserfs
>> to jfs. But I've also had this problem with it... Now I'm back to
>> 2.4.23, and everything works fine.
>
>Because of the numerous reboots and hangs I've seen with experimental patches, 
>I've also seen this, but it's not ReiserFS's fault. The problem is that most 
>drives have write caching enabled and not all of them are safe with this. If 
>you disable it with hdparm (hdparm -W 0 /dev/hd*), you'll find that files open 
>during a hard reset or power outage are no longer corrupted.

Turning the write cache off will not prevent a file from being corrupted;
however, it should limit the corruption to a single disk operation.

I don't see how the behavior you describe could be the drive's
fault...

The user stated that their system hard locked, then they went and
rebooted it, and following the reboot they had corruption...  From
this, there are a few possibilities:

1. The drive had been given the commands to write the data prior to the hang.

If this was the case, the drive would happily keep writing the data
it had been given and was caching in the background, even while you
continued to send (or stopped sending) data for a new command over the
interface.  An IDE interface lockup or system lockup will not prevent
the drive from flushing the remainder of its write cache.  (Only
possible exception might be faulty handling of a hard reset, but all
drives today will flush their cache when they see the reset, prior to
processing it.) Unless the user yanked power within a few hundred
milliseconds of the write command, I think it is unlikely that cached
data already in the drive wasn't flushed properly.

2. The drive was in the middle of a command writing important data
during the hang.

In this case, yes, your file you were writing would probably be
corrupt on the media, but nothing more.  Drives detect power loss, and
immediately disable write-gate and park the actuator.  If they don't
get the actuator parked before they run out of back-EMF from the
momentum of the platter(s), the head will stick to the media and
you'll probably need a chisel to get that drive to spin again.

3. The drive hadn't yet been issued the commands for the data that was
eventually corrupted.

I find this to be the most likely case; it is a situation where the
filesystem thinks objects were moved but those updates were not
correctly sent to the disk (due to the hang?), so it might think
they're in the old location or something.  (I'm not a filesystem
wizard so if I'm way off-base, my apologies)

It seems to me that the problem occurred at a higher system level than
the disk, and disabling the write cache on the drive (besides being a
*HUGE* performance loser) will only make the window for failure
smaller, not eliminate it entirely.

Unless you are using *really* old hard drives, the write caching in
today's drives is really quite good and definitely should be usable.
Sure, it makes things less safe in power events, but system lockups
shouldn't affect the drive's ability to flush its cache.  Note too
that Gergely reported that the problem went away on his 2.4.23 system.
I don't believe that to be a small data point.

--eric, posting from home

-- 
Eric D. Mudama
edmudama@mail.bounceswoosh.org


* Re: 2.6.0-test11 data loss
  2003-12-25  2:07     ` Eric D. Mudama
@ 2003-12-25  5:17       ` Con Kolivas
  2003-12-25  6:15       ` Hans Reiser
  1 sibling, 0 replies; 10+ messages in thread
From: Con Kolivas @ 2003-12-25  5:17 UTC
  To: Eric D. Mudama, linux-kernel

On Thu, 25 Dec 2003 13:07, Eric D. Mudama wrote:
> On Thu, Dec 25 at  9:34, Con Kolivas wrote:
> >On Thu, 25 Dec 2003 09:22, Gergely Tamas wrote:
> >> I don't think this is a reiserfs bug. This was my first thought and
> >> after first hitting this bug, I've moved all my partitions from reiserfs
> >> to jfs. But I've also had this problem with it... Now I'm back to
> >> 2.4.23, and everything works fine.
> >
> >Because of the numerous reboots and hangs I've seen with experimental
> > patches, I've also seen this, but it's not ReiserFS's fault. The problem is
> > that most drives have write caching enabled and not all of them are safe
> > with this. If you disable it with hdparm (hdparm -W 0 /dev/hd*), you'll
> > find that files open during a hard reset or power outage are no longer
> > corrupted.
>
> Write cache off will not prevent a file from being corrupted, however,
> it should limit the corruption to a single disk operation.
>
> I don't see how the behavior you describe could be the drive's
> fault...
>
> The user stated that their system hard locked, then they went and
> rebooted it, and following the reboot they had corruption...  From
> this, there are a few possibilities:
>
> 1. The drive had been given the commands to write the data prior to the
> hang.
>
> If this was the case, the drive would happily keep writing the data
> it had been given and was caching in the background, even while you
> continued to send (or stopped sending) data for a new command over the
> interface.  An IDE interface lockup or system lockup will not prevent
> the drive from flushing the remainder of its write cache.  (Only
> possible exception might be faulty handling of a hard reset, but all
> drives today will flush their cache when they see the reset, prior to
> processing it.) Unless the user yanked power within a few hundred
> milliseconds of the write command, I think it is unlikely that cached
> data already in the drive wasn't flushed properly.
>
> 2. The drive was in the middle of a command writing important data
> during the hang.
>
> In this case, yes, your file you were writing would probably be
> corrupt on the media, but nothing more.  Drives detect power loss, and
> immediately disable write-gate and park the actuator.  If they don't
> get the actuator parked before they run out of back-EMF from the
> momentum of the platter(s), the head will stick to the media and
> you'll probably need a chisel to get that drive to spin again.
>
> 3. The drive hadn't yet been issued the commands for the data that was
> eventually corrupted.
>
> I find this to be the most likely case, and is a situation where the
> filesystem thinks objects were moved but those updates were not
> correctly sent to the disk (due to the hang?), so it might think
> they're in the old location or something.  (I'm not a filesystem
> wizard so if I'm way off-base, my apologies)
>
> It seems to me that the problem occurred at a higher system level than
> the disk, and disabling the write cache on the drive (besides being a
> *HUGE* performance loser) will only make the window for failure
> smaller, not eliminate it entirely.
>
> Unless you are using *really* old hard drives, the write caching in
> today's drives is really quite good and definitely should be usable.
> Sure, it makes things less safe in power events, but system lockups
> shouldn't affect the drive's ability to flush its cache.  Note too
> that Gergely reported that the problem went away on his 2.4.23 system.
> I don't believe that to be a small data point.

I hardly said it was the correct solution; just what worked for me, as I had 
exactly the same issue going 2.4->2.6. I can't even recall if write caching 
was actually on in 2.4, and my write performance under video capture has not 
shown any detriment. The filesystem gods should comment. Merry Christmas.

Con



* Re: 2.6.0-test11 data loss
  2003-12-25  2:07     ` Eric D. Mudama
  2003-12-25  5:17       ` Con Kolivas
@ 2003-12-25  6:15       ` Hans Reiser
  1 sibling, 0 replies; 10+ messages in thread
From: Hans Reiser @ 2003-12-25  6:15 UTC
  To: Eric D. Mudama; +Cc: linux-kernel, Chris Mason

Eric D. Mudama wrote:

>
> It seems to me that the problem occurred at a higher system level than
> the disk, and disabling the write cache on the drive (besides being a
> *HUGE* performance loser) will only make the window for failure
> smaller, not eliminate it entirely.
>
You should only use write caching in kernels where write cache flushing 
is supported. Chris, which ones are those, could you remind us?

-- 
Hans




* Re: 2.6.0-test11 data loss
  2003-12-24 22:22 ` Gergely Tamas
  2003-12-24 22:34   ` Con Kolivas
@ 2003-12-25 16:46   ` Tomas Szepe
  2003-12-25 23:45     ` Hmamouche, Youssef
  1 sibling, 1 reply; 10+ messages in thread
From: Tomas Szepe @ 2003-12-25 16:46 UTC
  To: Gergely Tamas; +Cc: Keith Lea, linux-kernel

On Dec-24 2003, Wed, 23:22 +0100
Gergely Tamas <dice@mfa.kfki.hu> wrote:

> I've been hit by the same problem but using 2.6.0 . As you described,
> garbage in files (eg. /etc/modules.conf, ...).
> 
> 2.6.0, Slackware 9.1

Count me in.

IBM ThinkPad T40p (PIIX IDE HDD access)
slackware-current
linux-2.6.0
reiserfs-3.6

I can reproduce the problem anytime simply by terminating an XDM session.

	- complete freeze
	- blank screen
	- can't see an oops
	- nothing in the logs
	- kernel won't panic (tried w/ the morsecode panics patch)
	- typically corrupted files (random garbage in the middle):
		/lib/modules/2.6.0/modules.dep
		/var/adm/messages

I'm hesitant to blame the HDD write cache since there's no power outage
involved (also, I've never seen this before w/ 2.4).

-- 
Tomas Szepe <szepe@pinerecords.com>


* Re: 2.6.0-test11 data loss
  2003-12-25 16:46   ` Tomas Szepe
@ 2003-12-25 23:45     ` Hmamouche, Youssef
  0 siblings, 0 replies; 10+ messages in thread
From: Hmamouche, Youssef @ 2003-12-25 23:45 UTC
  To: Tomas Szepe; +Cc: Gergely Tamas, Keith Lea, linux-kernel


I'm getting a system freeze but no file corruption. The freeze happens
randomly after all rc.d scripts run. With the 2.6.0-mm1 patch applied it
seems to happen slightly later (I was able to log in and run startx),
whereas before it happened before or while logging in.

My boot parameters usually look like this:

BOOT_IMAGE=Linux-2.6.0 ro root=303 apm=on acpi=off

IBM Thinkpad T22
linux-2.6.0 | linux-2.6.0-mm1
slackware 9.1

bash-2.05b# lspci 
00:00.0 Host bridge: Intel Corp. 440BX/ZX/DX - 82443BX/ZX/DX Host bridge (rev 03)
00:01.0 PCI bridge: Intel Corp. 440BX/ZX/DX - 82443BX/ZX/DX AGP bridge (rev 03)
00:02.0 CardBus bridge: Texas Instruments PCI1450 (rev 03)
00:02.1 CardBus bridge: Texas Instruments PCI1450 (rev 03)
00:03.0 Ethernet controller: Intel Corp. 82557/8/9 [Ethernet Pro 100] (rev 0c)
00:03.1 Serial controller: Lucent Microelectronics LT WinModem (rev 01)
00:05.0 Multimedia audio controller: Cirrus Logic CS 4614/22/24 [CrystalClear SoundFusion Audio Accelerator] (rev 01)
00:07.0 Bridge: Intel Corp. 82371AB/EB/MB PIIX4 ISA (rev 02)
00:07.1 IDE interface: Intel Corp. 82371AB/EB/MB PIIX4 IDE (rev 01)
00:07.2 USB Controller: Intel Corp. 82371AB/EB/MB PIIX4 USB (rev 01)
00:07.3 Bridge: Intel Corp. 82371AB/EB/MB PIIX4 ACPI (rev 03)
01:00.0 VGA compatible controller: S3 Inc. 86C270-294 Savage/IX-MV (rev 13)


Now that I went back and tested the kernel (mm1) with the following
parameters, the system hasn't frozen yet. I'll report if anything goes
wrong.

BOOT_IMAGE=Linux-2.6.0 ro root=303 idebus=66 ide0=ata66 ide1=ata66
ide2=ata66 apm=on acpi=off

Thank you

On Thu, 25 Dec 2003, Tomas Szepe wrote:

> On Dec-24 2003, Wed, 23:22 +0100
> Gergely Tamas <dice@mfa.kfki.hu> wrote:
>
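
For readers unfamiliar with the `root=303` value in the boot lines above: the kernel reads a bare number there as a hexadecimal device number. The sketch below decodes it, assuming the traditional encoding of an 8-bit major and an 8-bit minor and the conventional assignment of major 3 to the first IDE channel with 64 minors per disk. The function names `decode_root_dev` and `ide_name` are hypothetical.

```python
def decode_root_dev(value):
    """Split a hex root= value into (major, minor).

    Assumes the old 16-bit device number layout: high byte is the
    major number, low byte is the minor number.
    """
    dev = int(value, 16)
    return (dev >> 8) & 0xFF, dev & 0xFF


def ide_name(major, minor):
    """Name a device on the first IDE channel (major 3 only).

    Minors 0-63 belong to hda and 64-127 to hdb; a partition number
    of 0 refers to the whole disk.
    """
    if major != 3 or not 0 <= minor < 128:
        raise ValueError("only major 3 (first IDE channel) is handled")
    disk = "hda" if minor < 64 else "hdb"
    partition = minor % 64
    return f"/dev/{disk}{partition}" if partition else f"/dev/{disk}"


if __name__ == "__main__":
    major, minor = decode_root_dev("303")
    print(major, minor, ide_name(major, minor))
```

Under those assumptions `root=303` names the third partition of the primary master IDE disk.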



* Re: 2.6.0-test11 data loss
  2003-12-24 21:59 2.6.0-test11 data loss Keith Lea
  2003-12-24 22:22 ` Gergely Tamas
@ 2003-12-25  1:21 ` Felipe Alfaro Solana
  2003-12-25  6:11 ` Hans Reiser
  2 siblings, 0 replies; 10+ messages in thread
From: Felipe Alfaro Solana @ 2003-12-25  1:21 UTC
  To: Keith Lea; +Cc: Linux Kernel Mailinglist

On Wed, 2003-12-24 at 22:59, Keith Lea wrote:
> Hello, I'm not subscribed to this list. This is not a help request, and 
> not really a bug report, I just thought someone should know about this.
> 
> I installed the 2.6.0-beta11-mm kernel last week, and the other day my 
> computer locked up (this is normal on my laptop with every kernel 
> version I've tried, this isn't the problem I'm posting about). When I 
> restarted, many, many files that had been open when it locked up were 
> filled with garbage, or the contents of totally unrelated files. For 
> example, my syslog contained some KDE header file code, and 
> /sbin/modprobe contained 82kb of data that seemed like random noise. I 
> think each file was the same size as it was originally, just with 
> different data, but I'm not sure.
> 
> The corruption happened on two separate partitions on a single IDE 
> laptop drive, and both were ReiserFS 3.6 partitions. I don't know if 
> this is a kernel bug or a Reiser bug or something else, but I thought 
> the kernel developers should know about this, and be on the lookout for 
> similar things (hopefully with more informative bug reports than mine). 
> I'm sorry I don't have more information, but if anyone wants to know 
> more about my system I'd be glad to help.

I know this is not the answer you're looking for, but could you please
test again using 2.6.0 or 2.6.0-mm1? 2.6.0-test11 is a bit out of date
now that 2.6.0 has gone gold.



* Re: 2.6.0-test11 data loss
  2003-12-24 21:59 2.6.0-test11 data loss Keith Lea
  2003-12-24 22:22 ` Gergely Tamas
  2003-12-25  1:21 ` Felipe Alfaro Solana
@ 2003-12-25  6:11 ` Hans Reiser
  2 siblings, 0 replies; 10+ messages in thread
From: Hans Reiser @ 2003-12-25  6:11 UTC
  To: Keith Lea; +Cc: linux-kernel

Keith Lea wrote:

> Hello, I'm not subscribed to this list. This is not a help request, 
> and not really a bug report, I just thought someone should know about 
> this.
>
> I installed the 2.6.0-beta11-mm kernel last week, and the other day my 
> computer locked up (this is normal on my laptop with every kernel 
> version I've tried, this isn't the problem I'm posting about). When I 
> restarted, many, many files that had been open when it locked up were 
> filled with garbage, or the contents of totally unrelated files. For 
> example, my syslog contained some KDE header file code, and 
> /sbin/modprobe contained 82kb of data that seemed like random noise. I 
> think each file was the same size as it was originally, just with 
> different data, but I'm not sure.
>
> The corruption happened on two separate partitions on a single IDE 
> laptop drive, and both were ReiserFS 3.6 partitions. I don't know if 
> this is a kernel bug or a Reiser bug or something else, but I thought 
> the kernel developers should know about this, and be on the lookout 
> for similar things (hopefully with more informative bug reports than 
> mine). I'm sorry I don't have more information, but if anyone wants to 
> know more about my system I'd be glad to help.
>
> -Keith Lea
>
Please read about the difference between metadata journaling, data 
journaling, and atomic filesystems, and all will become clear. Also note 
the ordered writes option for version 3.6 of reiserfs, which is probably 
what you want until atomic reiser4 is fully stable.

-- 
Hans
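
Hans's pointer to metadata versus data journaling explains the symptom: with metadata-only journaling, a file's size and block pointers can be committed while the data blocks themselves were never written, so after a crash the file keeps its length but holds stale or unrelated contents. Applications that cannot tolerate this commonly write to a temporary file, fsync it, and rename it over the original. Below is a minimal sketch of that pattern, assuming a POSIX system. `atomic_write` is a hypothetical helper, not part of ReiserFS or any library, and nobody in the thread proposed it.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace the file at path with data so that readers see either
    the complete old contents or the complete new contents."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live on the same filesystem as the
    # target, otherwise the rename below is not atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            # Ask the kernel to push the data blocks to the device
            # before the rename makes them visible under the real name.
            os.fsync(tmp.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Sync the directory so the rename itself is durable.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    target = os.path.join(workdir, "modules.conf")
    atomic_write(target, b"alias eth0 e100\n")
    with open(target, "rb") as f:
        print(f.read())
```

This only helps if fsync actually reaches the platters. With drive write caching enabled, that depends on the kernel issuing cache flushes, which is the question Hans raises elsewhere in the thread.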



