From mboxrd@z Thu Jan  1 00:00:00 1970
From: John McMonagle
Subject: Re: kernel panic??
Date: Mon, 07 Mar 2005 20:02:03 -0600
Message-ID: <422D079B.7050507@advocap.org>
References: <4225E09C.5010401@advocap.org> <42271B3D.1050000@advocap.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
In-Reply-To: <42271B3D.1050000@advocap.org>
Sender: linux-raid-owner@vger.kernel.org
To: linux-raid@vger.kernel.org
List-Id: linux-raid.ids

Changed the subject; the slow sync is now a low priority :(

Getting rather concerned.

I finally failed the questionable sdb drive.

I managed to back up the partition with the bad spot. It was up about
24 hours when I did the backup. Copied to a USB external drive. Very
slow, but it kept running.

At this point I don't really think it's a problem with raid.

I did dd if=/dev/sdb2 of=/dev/null bs=64k with sdb2 failed, and it
caused a kernel panic!
This was with 2.6.11 with the ata patch, in runlevel 1.

Oddly, the call trace doesn't list anything to do with disk I/O??

Abbreviated trace:
  drain_array_locked
  cache_reap
  worker_thread
  default_wake_function
  default_wake_function
  worker_thread
  kthread
  kthread
  kernel_thread_helper
More if it helps.

All panics seem to be associated with accessing the bad spot on sdb.
It seems really strange that one can get a panic from a drive problem.

The motherboard is an Asus P2B with 512 MB ECC RAM. It's been 100%
reliable for about 5 years. Ran memtest86 and it looked OK.

The drive controller is a Promise TX4 4-port SATA controller.
Any issues with this controller?

The drives are Western Digital WD2000JDs. I'm a bit concerned about
the quality, as this is the 3rd failure; I had 2 bad drives in the
first week.

Even then, is it possible for a drive failure to cause a kernel panic???

I'd really appreciate feedback, even if it's just what hardware works
for you.

John

John McMonagle wrote:
> It kernel panicked again at around 30% with the 2.6.10 kernel :(
> I'm not on site yet but I'll check the kernel message.
> Remotely it's acting the same as the first time. Oddly, ping and nmap
> look OK but nothing responds.
> I'll be taking it offline until I can figure it out.
> Any suggestions?
>
> Is it possible to stop the resync to get some data off it? It's
> really unresponsive with it running.
>
> My guess is there is something wrong with a drive and with the raid
> resync code.
> I'll do some tests on the drives. The rest I'll need help with.
>
> John
>
>
> John McMonagle wrote:
>
>> Have a backup system that recently had a kernel panic and am having
>> problems with resync.
>>
>> It's a P2 460 MHz board with 512 MB RAM.
>> Promise SATA controller.
>> 3 200 GB SATA drives.
>> Debian sarge.
>> 2.6.10 kernel from kernel.org with SMART SATA patches.
>> mdadm v1.9.0
>>
>> /proc/mdstat
>>
>> Personalities : [raid1] [raid5]
>> md0 : active raid1 sdb1[0] sda1[2] sdc1[1]
>>       586240 blocks [3/3] [UUU]
>>
>> md1 : active raid5 sda2[0] sdc2[2] sdb2[1]
>>       389543936 blocks level 5, 128k chunk, algorithm 2 [3/3] [UUU]
>>       [=====>...............]  resync = 28.9% (56429568/194771968)
>>       finish=1374.9min speed=1675K/sec
>> unused devices: <none>
>>
>> Was running just fine for a few months.
>>
>> The system is very unresponsive. Sometimes it takes a minute to
>> respond to cat /proc/mdstat.
>> I know some of this is because it's running its normal programs, but
>> it was not that much better in single-user mode.
>> Much to my amazement it managed to do the overnight backups while
>> resyncing.
>> Interested why it takes so long and if there is anything one can do?
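
On the resync speed question just above: the md resync rate is governed
by the raid speed-limit sysctls, so it can be pushed harder or throttled
down while data is copied off. A minimal sketch, assuming the standard
/proc/sys/dev/raid paths; values are per-device KB/sec and the numbers
below are only examples:

   # current floor and ceiling for resync throughput
   cat /proc/sys/dev/raid/speed_limit_min
   cat /proc/sys/dev/raid/speed_limit_max

   # push the resync harder even while the box is busy
   # (at the cost of responsiveness)
   echo 50000 > /proc/sys/dev/raid/speed_limit_min

   # or cap it low so the machine stays usable while data is copied off
   echo 2000 > /proc/sys/dev/raid/speed_limit_max
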
>> As I recall it took about 120 minutes to build initially.
>>
>> Other related issue:
>>
>> Tried to resync on the sarge 2.6.8 kernel; it seemed to go a little
>> bit faster, but it ended up kernel panicking at about 30%. Did this
>> twice. Now I'm on 2.6.10 and getting close to that point. Will know
>> if it makes it past that soon.
>>
>> I did get a smart message this morning, but it's not clear to me that
>> there is anything wrong.
>>
>> Here is the email:
>>
>> This email was generated by the smartd daemon running on:
>>
>>    host name: fonbackup
>>    DNS domain: advocap.org
>>    NIS domain: (none)
>>
>> The following warning/error was logged by the smartd daemon:
>>
>> Device: /dev/sdb, Self-Test Log error count increased from 0 to 1
>>
>> For details see host's SYSLOG (default: /var/log/messages).
>>
>> You can also use the smartctl utility for further investigation.
>>
>> ..................................................
>> smartctl -a -d ata /dev/sdb
>> smartctl version 5.32 Copyright (C) 2002-4 Bruce Allen
>> Home page is http://smartmontools.sourceforge.net/
>>
>> === START OF INFORMATION SECTION ===
>> Device Model:     WDC WD2000JD-00FYB0
>> Serial Number:    WD-WMAEH2121689
>> Firmware Version: 02.05D02
>> Device is:        In smartctl database [for details use: -P show]
>> ATA Version is:   6
>> ATA Standard is:  Exact ATA specification draft version not indicated
>> Local Time is:    Wed Mar 2 06:51:26 2005 CST
>> SMART support is: Available - device has SMART capability.
>> SMART support is: Enabled
>>
>> === START OF READ SMART DATA SECTION ===
>> SMART overall-health self-assessment test result: PASSED
>> See vendor-specific Attribute list for marginal Attributes.
>>
>> General SMART Values:
>> Offline data collection status:  (0x85) Offline data collection activity
>>                                         was aborted by an interrupting
>>                                         command from host.
>>                                         Auto Offline Data Collection:
>>                                         Enabled.
>> Self-test execution status:      ( 73)  The previous self-test completed
>>                                         having a test element that failed
>>                                         and the test element that failed
>>                                         is not known.
>> Total time to complete Offline
>> data collection:                 (6933) seconds.
>> Offline data collection
>> capabilities:                    (0x79) SMART execute Offline immediate.
>>                                         No Auto Offline data collection
>>                                         support.
>>                                         Suspend Offline collection upon
>>                                         new command.
>>                                         Offline surface scan supported.
>>                                         Self-test supported.
>>                                         Conveyance Self-test supported.
>>                                         Selective Self-test supported.
>> SMART capabilities:            (0x0003) Saves SMART data before entering
>>                                         power-saving mode.
>>                                         Supports SMART auto save timer.
>> Error logging capability:        (0x01) Error logging supported.
>>                                         No General Purpose Logging support.
>> Short self-test routine
>> recommended polling time:        (   2) minutes.
>> Extended self-test routine
>> recommended polling time:        (  88) minutes.
>> Conveyance self-test routine
>> recommended polling time:        (   5) minutes.
>> SMART Attributes Data Structure revision number: 16
>> Vendor Specific SMART Attributes with Thresholds:
>> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>>   1 Raw_Read_Error_Rate     0x000b   001   001   051    Pre-fail  Always   FAILING_NOW 12058
>>   3 Spin_Up_Time            0x0007   124   122   021    Pre-fail  Always       -       4316
>>   4 Start_Stop_Count        0x0032   100   100   040    Old_age   Always       -       24
>>   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
>>   7 Seek_Error_Rate         0x000b   200   200   051    Pre-fail  Always       -       0
>>   9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1263
>>  10 Spin_Retry_Count        0x0013   100   253   051    Pre-fail  Always       -       0
>>  11 Calibration_Retry_Count 0x0013   100   253   051    Pre-fail  Always       -       0
>>  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       24
>> 194 Temperature_Celsius     0x0022   115   253   000    Old_age   Always       -       35
>> 196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
>> 197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
>> 198 Offline_Uncorrectable   0x0012   200   200   000    Old_age   Always       -       0
>> 199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       0
>> 200 Multi_Zone_Error_Rate   0x0009   200   155   051    Pre-fail  Offline      -       0
>>
>> SMART Error Log Version: 1
>> No Errors Logged
>>
>> SMART Self-test log structure revision number 1
>> Num  Test_Description    Status                     Remaining  LifeTime(hours)  LBA_of_first_error
>> # 1  Short offline       Completed: unknown failure    90%          170         1114112
>> # 2  Short offline       Completed without error       00%          126         -
>> # 3  Short offline       Completed without error       00%          103         -
>> # 4  Short offline       Completed without error       00%           79         -
>> # 5  Short offline       Completed without error       00%           56         -
>> # 6  Short offline       Completed without error       00%           32         -
>> # 7  Short offline       Completed without error       00%            9         -
>> # 8  Extended offline    Completed without error       00%         1081         -
>> # 9  Short offline       Completed without error       00%         1079         -
>> #10  Short offline       Completed without error       00%         1056         -
>> #11  Short offline       Completed without error       00%         1032         -
>> #12  Short offline       Completed without error       00%         1009         -
>> #13  Short offline       Completed without error       00%          985         -
>> #14  Short offline       Completed without error       00%          962         -
>> #15  Short offline       Completed without error       00%          939         -
>> #16  Extended offline    Completed without error       00%          919         -
>> #17  Short offline       Completed without error       00%          916         -
>> #18  Short offline       Completed without error       00%          893         -
>> #19  Short offline       Completed without error       00%          869         -
>> #20  Short offline       Completed without error       00%          846         -
>> #21  Short offline       Completed without error       00%          823         -
>>
>> SMART Selective self-test log data structure revision number 1
>>  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
>>     1        0        0  Not_testing
>>     2        0        0  Not_testing
>>     3        0        0  Not_testing
>>     4        0        0  Not_testing
>>     5        0        0  Not_testing
>> Selective self-test flags (0x0):
>>   After scanning selected spans, do NOT read-scan remainder of disk.
>> If Selective self-test is pending on power-up, resume after 0 minute delay.
>>
>> ......................................
>>
>> Any suggestions?
>>
>> John
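
On the smart log above: the short self-test that failed recorded LBA 1114112
as the first error, so re-running the drive's own tests is a cheap way to
confirm the bad spot from the drive's side. A rough sketch, using the same
-d ata flag as above (the extended test takes roughly 88 minutes per the
log, ideally with the drive otherwise quiet so the test isn't interrupted):

   # start the drive's extended (long) offline self-test
   smartctl -t long -d ata /dev/sdb

   # once it finishes, check the self-test and error logs
   smartctl -l selftest -d ata /dev/sdb
   smartctl -l error -d ata /dev/sdb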