From: Adko Branil <adkobranil@yahoo.com>
To: "linux-ide@vger.kernel.org" <linux-ide@vger.kernel.org>
Subject: HDD problem, software bug, bios bug, or hardware ?
Date: Fri, 24 Aug 2012 17:54:08 -0700 (PDT)
Message-ID: <1345856048.38987.YahooMailNeo@web124702.mail.ne1.yahoo.com>
My system hangs from time to time with a kernel panic, after anywhere from a few minutes to 8-9 hours of work. Before this began it had worked fine for about 6 years, with no software or hardware changes during that period.
I have some photos of the screen after the panic; the first two are with the old Linux kernel 2.6.16.27:
http://picpaste.com/pics/img00005-73m0unO0.1345852235.jpg
http://picpaste.com/pics/P170812_12.01-MeZrs3zv.1345817375.jpg
(They enlarge on click.)
Then I installed slackware-current with its default kernel "huge.s", and the crashes continued:
http://picpaste.com/pics/P210812_15.34-3NSTEV8f.1345816730.jpg
Then I switched off the swap:
http://picpaste.com/pics/P230812_15.06-hB12169n.1345812390.jpg
After that I managed to save one message with netconsole (swap is off):
[13330.042569] BUG: unable to handle kernel paging request at 000060ff80001f1c
[13330.043554] IP: [<ffffffff810b17e0>] no_action+0x10/0x10
[13330.043554] PGD 0
[13330.043554] Oops: 0002 [#1] SMP
[13330.043554] CPU 1
[13330.043554] Modules linked in: ipv6 lp netconsole snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_pcm_oss snd_mixer_oss fuse nouveau mxm_wmi wmi video ttm drm_kms_helper drm amd64_agp processor thermal_sys k8temp agpgart hwmon snd_via82xx snd_ac97_codec snd_mpu401_uart snd_rawmidi snd_seq_device snd_pcm snd_page_alloc snd_timer snd soundcore ac97_bus ppdev parport_pc i2c_algo_bit gameport evdev shpchp button i2c_viapro i2c_core loop skge parport [last unloaded: lp]
[13330.043554]
[13330.043554] Pid: 0, comm: swapper/1 Not tainted 3.2.27 #2 To Be Filled By O.E.M. To Be Filled By O.E.M./A8V Deluxe
[13330.043554] RIP: 0010:[<ffffffff810b17e0>] [<ffffffff810b17e0>] no_action+0x10/0x10
[13330.043554] RSP: 0018:ffff88007fd03f10 EFLAGS: 00010086
[13330.043554] RAX: 000060ff80001f1c RBX: ffff88007aef2c00 RCX: 00000000fffffffa
[13330.043554] RDX: 00000000000000d0 RSI: ffff88007ae93f80 RDI: ffff88007aef2c00
[13330.043554] RBP: ffff88007fd03f38 R08: ffff88007aef2c00 R09: ffff88007cc00000
[13330.043554] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88007aef2c8c
[13330.043554] R13: 0000000000000011 R14: 0000000000000000 R15: 0000000000000000
[13330.043554] FS: 00007f674b3e6740(0000) GS:ffff88007fd00000(0000) knlGS:00000000f7369700
[13330.043554] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[13330.043554] CR2: 000060ff80001f1c CR3: 000000006f115000 CR4: 00000000000006e0
[13330.043554] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[13330.043554] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[13330.043554] Process swapper/1 (pid: 0, threadinfo ffff88007bd18000, task ffff88007d0ec4c0)
[13330.043554] Stack:
[13330.043554] ffffffff810b1a10 ffff88007fd03f58 ffff88007aef2c00 0000000000000051
[13330.043554] 0000000000000011 ffff88007fd03f58 ffffffff810b4879 ffff88007fd03f58
[13330.043554] 0000000000000011 ffff88007fd03f78 ffffffff81003d12 ffff88007fd03f78
[13330.043554] Call Trace:
[13330.043554] <IRQ>
[13330.043554] [<ffffffff810b1a10>] ? handle_irq_event+0x40/0x70
[13330.043554] [<ffffffff810b4879>] handle_fasteoi_irq+0x59/0x100
[13330.043554] [<ffffffff81003d12>] handle_irq+0x22/0x40
[13330.043554] [<ffffffff81b3158a>] do_IRQ+0x5a/0xe0
[13330.043554] [<ffffffff81b2e82b>] common_interrupt+0x6b/0x6b
[13330.043554] <EOI>
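For readers following the trace, the "Oops: 0002" field is the x86 page-fault error code, and its low bits can be decoded mechanically. The Python sketch below is only an illustration under that assumption; the names PF_BITS and decode_oops_code are made up for this example and are not kernel identifiers.

```python
# Decode the hexadecimal code printed after "Oops:" on x86.
# Assumes the standard x86 page-fault error-code bit layout.
# PF_BITS and decode_oops_code are illustrative names only.
PF_BITS = [
    # (mask, meaning if set, meaning if clear)
    (0x01, "protection violation", "page not present"),
    (0x02, "write access", "read access"),
    (0x04, "user mode", "kernel mode"),
    (0x08, "reserved bit set in page table", None),
    (0x10, "instruction fetch", None),
]


def decode_oops_code(code_hex):
    """Return a list of human-readable meanings for an Oops code."""
    code = int(code_hex, 16)
    parts = []
    for mask, if_set, if_clear in PF_BITS:
        if code & mask:
            parts.append(if_set)
        elif if_clear is not None:
            parts.append(if_clear)
    return parts


print(decode_oops_code("0002"))
# ['page not present', 'write access', 'kernel mode']
```

Read this way, code 0002 describes a kernel-mode write to an address with no page mapped, which matches the "unable to handle kernel paging request" line. It does not by itself say whether the bad address came from a software bug or from faulty hardware.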
Here is a link to the dmesg from before that last crash: http://pastebin.com/Af7bb34x
And at the end I noticed scary messages in the syslog:
[31770.094556] REISERFS warning (device sda1): clm-6006 reiserfs_dirty_inode: writing inode 347717 on readonly FS
[31770.472848] REISERFS warning (device sda1): clm-6006 reiserfs_dirty_inode: writing inode 347740 on readonly FS
[31790.796117] REISERFS warning (device sda1): clm-6006 reiserfs_dirty_inode: writing inode 426162 on readonly FS
after which I ran reiserfsck immediately; no corruption was found.
I had never seen such messages before. I have syslogs for the 17 days before that, and they contain no messages like this.
I ran some tests with smartmontools earlier, when the machine was still on the old Linux (2.6.16.27). The result of "smartctl -s on -a /dev/sda" is:
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-3.5.2] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 7200.7 and 7200.7 Plus
Device Model: ST3200822AS
Serial Number: 4LJ221BB
Firmware Version: 3.01
User Capacity: 200,049,647,616 bytes [200 GB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 6
ATA Standard is: ATA/ATAPI-6 T13 1410D revision 2
Local Time is: Sat Aug 25 03:09:01 2012 MSK
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF ENABLE/DISABLE COMMANDS SECTION ===
SMART Enabled.
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 430) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
No General Purpose Logging support.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 111) minutes.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   050   046   006    Pre-fail  Always       -       179699255
  3 Spin_Up_Time            0x0003   097   096   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       123
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       6
  7 Seek_Error_Rate         0x000f   078   060   030    Pre-fail  Always       -       81170784
  9 Power_On_Hours          0x0032   039   039   000    Old_age   Always       -       53553
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       142
194 Temperature_Celsius     0x0022   037   054   000    Old_age   Always       -       37
195 Hardware_ECC_Recovered  0x001a   050   046   000    Old_age   Always       -       179699255
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   198   000    Old_age   Always       -       2
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 Data_Address_Mark_Errs  0x0032   100   253   000    Old_age   Always       -       0
SMART Error Log Version: 1
ATA Error Count: 2
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 2 occurred at disk power-on lifetime: 13784 hours (574 days + 8 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 51 00 7a 7d 1d e0 Error: ICRC, ABRT at LBA = 0x001d7d7a = 1932666
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 00 7b 7c 1d e0 00 22:14:23.595 READ DMA EXT
25 00 00 7b 7b 1d e0 00 22:14:23.593 READ DMA EXT
25 00 00 7b 7a 1d e0 00 22:14:23.576 READ DMA EXT
25 00 00 7b 79 1d e0 00 22:14:23.567 READ DMA EXT
25 00 00 7b 78 1d e0 00 22:14:23.566 READ DMA EXT
Error 1 occurred at disk power-on lifetime: 13784 hours (574 days + 8 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 51 00 fa 0e 01 e0 Error: ICRC, ABRT at LBA = 0x00010efa = 69370
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 00 fb 0d 01 e0 00 22:13:03.489 READ DMA EXT
25 00 00 fb 0c 01 e0 00 22:13:03.487 READ DMA EXT
25 00 00 fb 0b 01 e0 00 22:13:03.701 READ DMA EXT
25 00 00 fb 09 01 e0 00 22:13:03.682 READ DMA EXT
25 00 00 fb 07 01 e0 00 22:13:03.681 READ DMA EXT
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     53153         -
# 2  Short offline       Completed without error       00%     53152         -
# 3  Short offline       Completed without error       00%     53152         -
# 4  Short offline       Completed without error       00%     53152         -
# 5  Short offline       Completed without error       00%     53152         -
# 6  Short offline       Completed without error       00%     53148         -
# 7  Short offline       Completed without error       00%     53148         -
# 8  Short offline       Completed without error       00%     53148         -
# 9  Extended offline    Aborted by host               80%     53148         -
#10  Short offline       Completed without error       00%     53147         -
#11  Short offline       Completed without error       00%     53147         -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
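As a cross-check of the two SMART error log entries, the "ER ST SC SN CL CH DH" register rows can be decoded by hand. The Python sketch below assumes the usual ATA error-register bit names and the 28-bit LBA layout that smartctl prints for this log version; ERROR_BITS and decode_error_row are hypothetical names chosen for the example, and only a few of the error bits are listed.

```python
# Decode one "ER ST SC SN CL CH DH" row from the smartctl error log.
# Assumption: standard ATA error-register bits and 28-bit LBA layout.
# ERROR_BITS and decode_error_row are illustrative names only.
ERROR_BITS = {
    0x80: "ICRC",  # interface CRC error during an Ultra DMA transfer
    0x40: "UNC",   # uncorrectable data error
    0x10: "IDNF",  # requested sector ID not found
    0x04: "ABRT",  # command aborted
}


def decode_error_row(row):
    """Return (error flag names, LBA) for a row of seven hex bytes."""
    er, st, sc, sn, cl, ch, dh = (int(b, 16) for b in row.split())
    flags = [name for bit, name in ERROR_BITS.items() if er & bit]
    # 28-bit LBA: low nibble of DH, then CH, CL and SN
    lba = ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn
    return flags, lba


print(decode_error_row("84 51 00 7a 7d 1d e0"))  # (['ICRC', 'ABRT'], 1932666)
print(decode_error_row("84 51 00 fa 0e 01 e0"))  # (['ICRC', 'ABRT'], 69370)
```

Both rows reproduce the LBAs that smartctl reports. ICRC refers to a CRC error on the transfer between drive and controller, which fits the UDMA_CRC_Error_Count raw value of 2. Both errors are logged at 13784 power-on hours, long before the current 53553.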
Soon after that (you can see the time in the messages) I managed to capture one whole panic message (I hope it is complete):
[32874.215014] BUG: unable to handle kernel NULL pointer dereference at 0000000000000086
[32874.215192] IP: [<ffffffff819f9440>] start_show+0x30/0x30
[32874.215192] PGD 7afe0067 PUD 7497e067 PMD 0
[32874.215192] Oops: 0002 [#1] SMP
[32874.215192] CPU 1
[32874.215192] Modules linked in: netconsole ipt_REJECT xt_tcpudp iptable_raw iptable_mangle iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack iptable_filter ip_tables x_tables ipv6 snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_pcm_oss snd_mixer_oss fuse nouveau mxm_wmi wmi video ttm drm_kms_helper snd_via82xx snd_ac97_codec snd_mpu401_uart snd_rawmidi snd_seq_device snd_pcm snd_page_alloc drm snd_timer amd64_agp processor i2c_algo_bit snd shpchp k8temp agpgart thermal_sys i2c_viapro hwmon i2c_core skge soundcore ac97_bus gameport evdev ppdev button parport_pc parport loop [last unloaded: lp]
[32874.215192]
[32874.215192] Pid: 0, comm: swapper/1 Not tainted 3.2.27 #2 To Be Filled By O.E.M. To Be Filled By O.E.M./A8V Deluxe
[32874.215192] RIP: 0010:[<ffffffff819f9440>] [<ffffffff819f9440>] start_show+0x30/0x30
[32874.215192] RSP: 0018:ffff88007fd03eb0 EFLAGS: 00010006
[32874.215192] RAX: 0000000000000086 RBX: ffffffff820c2fc0 RCX: 0000000000000001
[32874.215192] RDX: 00001de61fe84bdb RSI: 0000000000000000 RDI: ffffffff820c2fc0
[32874.215192] RBP: ffff88007fd03ed8 R08: 0000000000000000 R09: 0000000000000001
[32874.215192] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000008069
[32874.215192] R13: 00000000484f99af R14: 0000000000ab2476 R15: 0000000000000000
[32874.215192] FS: 00007f61bddf4740(0000) GS:ffff88007fd00000(0000) knlGS:00000000f75fc6c0
[32874.215192] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[32874.215192] CR2: 0000000000000086 CR3: 00000000746e8000 CR4: 00000000000006e0
[32874.215192] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[32874.215192] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[32874.215192] Process swapper/1 (pid: 0, threadinfo ffff88007bd18000, task ffff88007d0ec4c0)
[32874.215192] Stack:
[32874.215192] ffffffff8107df04 ffff88007fd12680 0000000000000001 000000000000d300
[32874.215192] 0000000000000000 ffff88007fd03ef8 ffffffff8107ab80 ffff88007fd0d300
[32874.215192] 0000000000000001 ffff88007fd03f08 ffffffff8107abe9 ffff88007fd03f28
[32874.215192] Call Trace:
[32874.215192] <IRQ>
[32874.215192] [<ffffffff8107df04>] ? ktime_get+0x64/0xe0
[32874.215192] [<ffffffff8107ab80>] sched_clock_tick+0x40/0x90
[32874.215192] [<ffffffff8107abe9>] sched_clock_idle_wakeup_event+0x19/0x20
[32874.215192] [<ffffffff8108538e>] tick_nohz_stop_idle+0x3e/0x50
[32874.215192] [<ffffffff81085b77>] tick_check_idle+0xb7/0xd0
[32874.215192] [<ffffffff8105a749>] irq_enter+0x69/0x70
[32874.215192] [<ffffffff81b31653>] smp_apic_timer_interrupt+0x43/0x99
[32874.215192] [<ffffffff81b2f9cb>] apic_timer_interrupt+0x6b/0x70
[32874.215192] <EOI>
[32874.215192] [<ffffffff8107aa58>] ? sched_clock_cpu+0xa8/0x120
[32874.215192] [<ffffffff8100a89a>] ? default_idle+0x5a/0x180
[32874.215192] [<ffffffff810009b6>] cpu_idle+0xf6/0x110
[32874.215192] [<ffffffff81b146ea>] start_secondary+0x1cf/0x1d6
[32874.215192] Code: 66 66 66 90 48 8b 0f 48 c7 c2 0d 46 dc 81 48 89 f0 be 00 10 00 00 48 89 c7 31 c0 e8 5b 71 b9 ff 5d 48 98 c3 0f 1f 80 00 00 00 00 <55> 48 89 e5 66 66 66 66 90 8b 15 39 31 6f 00 ed 25 ff ff ff 00
[32874.215192] RIP [<ffffffff819f9440>] start_show+0x30/0x30
[32874.215192] RSP <ffff88007fd03eb0>
[32874.215192] CR2: 0000000000000086
[32874.215192] [drm] nouveau 0000:01:00.0: Setting dpms mode 0 on vga encoder (output 0)
[32874.215192] ---[ end trace 90aad159d8ed7c1e ]---
[32874.215192] Kernel panic - not syncing: Fatal exception in interrupt
[32874.215192] Pid: 0, comm: swapper/1 Tainted: G D 3.2.27 #2
[32874.215192] Call Trace:
[32874.215192] <IRQ> [<ffffffff81b1aeea>] panic+0x91/0x189
[32874.215192] [<ffffffff81005491>] oops_end+0x91/0xa0
[32874.215192] [<ffffffff81b1a85f>] no_context+0x1fa/0x225
[32874.215192] [<ffffffff81b1aa3b>] __bad_area_nosemaphore+0x1b1/0x1d0
[32874.215192] [<ffffffff81b1aa6d>] bad_area_nosemaphore+0x13/0x15
[32874.215192] [<ffffffff81028794>] do_page_fault+0x2b4/0x480
[32874.215192] [<ffffffff8104aa6c>] ? load_balance+0xac/0x780
[32874.215192] [<ffffffff81a1b1e0>] ? skb_release_head_state+0x60/0x100
[32874.215192] [<ffffffff81a1affe>] ? __kfree_skb+0x1e/0xa0
[32874.215192] [<ffffffff81a1b0b1>] ? consume_skb+0x31/0x70
[32874.215192] [<ffffffff81b2ea2f>] page_fault+0x1f/0x30
[32874.215192] [<ffffffff819f9440>] ? start_show+0x30/0x30
[32874.215192] [<ffffffff8107df04>] ? ktime_get+0x64/0xe0
[32874.215192] [<ffffffff8107ab80>] sched_clock_tick+0x40/0x90
[32874.215192] [<ffffffff8107abe9>] sched_clock_idle_wakeup_event+0x19/0x20
[32874.215192] [<ffffffff8108538e>] tick_nohz_stop_idle+0x3e/0x50
[32874.215192] [<ffffffff81085b77>] tick_check_idle+0xb7/0xd0
[32874.215192] [<ffffffff8105a749>] irq_enter+0x69/0x70
[32874.215192] [<ffffffff81b31653>] smp_apic_timer_interrupt+0x43/0x99
[32874.215192] [<ffffffff81b2f9cb>] apic_timer_interrupt+0x6b/0x70
[32874.215192] <EOI> [<ffffffff8107aa58>] ? sched_clock_cpu+0xa8/0x120
[32874.215192] [<ffffffff8100a89a>] ? default_idle+0x5a/0x180
[32874.215192] [<ffffffff810009b6>] cpu_idle+0xf6/0x110
[32874.215192] [<ffffffff81b146ea>] start_secondary+0x1cf/0x1d6
[32874.215192] panic occurred, switching back to text console.
Swap is off again.
After that I booted the machine with the newest kernel, 3.5.2, and if it happens again I will try the "nosmp" option. Any ideas about what the cause might be, or how to catch it, are welcome.
Is this the right place to ask, or should I send it to kernel@vger.kernel.org, or somewhere else?
Thanks in advance!
Adko.