* Re: Consistent kernel oops with 3.11.10 & 3.12.9 on Haswell CPUs...
[not found] ` <20140322123706.GE12444@ofan>
@ 2014-03-22 13:18 ` Thomas Gleixner
2014-03-23 13:25 ` Jan Kara
0 siblings, 1 reply; 5+ messages in thread
From: Thomas Gleixner @ 2014-03-22 13:18 UTC (permalink / raw)
To: dafreedm
Cc: Guennadi Liakhovetski, LKML, Ingo Molnar, H. Peter Anvin,
Theodore Ts'o, linux-ext4, Jens Axboe
[-- Attachment #1: Type: TEXT/PLAIN, Size: 7627 bytes --]
On Sat, 22 Mar 2014, dafreedm@gmail.com wrote:
> Ahh. Good call. I wasn't sophisticated enough with these things to
> ascertain the difference. I knew to avoid reporting oops/panics with
> kernels tainted with out-of-tree (non-free) modules, but I guess I
> grabbed the wrong lines from the dmesg (namely, subsequent oops after
> the initial one). Here's a more recent kernel oops (from this
> morning) --- it's the first oops after a fresh reboot:
Cc'ing ext4 and block folks.
>
> [33488.170415] general protection fault: 0000 [#1] SMP
> [33488.171351] Modules linked in: dm_crypt dm_mod parport_pc ppdev lp parport bnep rfcomm bluetooth rfkill cpufreq_stats cpufreq_userspace cpufreq_conservative cpufreq_powersave nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc netconsole configfs loop raid1 md_mod snd_hda_codec_realtek snd_hda_codec_hdmi hid_kensington x86_pkg_temp_thermal coretemp joydev kvm_intel hid_generic kvm crct10dif_pclmul snd_hda_intel crc32_pclmul crc32c_intel snd_hda_codec usbhid snd_hwdep hid ghash_clmulni_intel snd_pcm iTCO_wdt iTCO_vendor_support snd_page_alloc aesni_intel i915 snd_seq aes_x86_64 snd_seq_device snd_timer lrw evdev gf128mul glue_helper drm_kms_helper ablk_helper snd psmouse cryptd drm i2c_i801 pcspkr soundcore serio_raw lpc_ich mei_me mei mfd_core video button processor ext4 crc16 mbcache jbd2 sg sd_mod crc_t10dif crct10dif_common ahci libahci xhci_hcd ehci_pci ehci_hcd e1000e libata igb scsi_mod i2c_algo_bit i2c_core dca ptp pps_core fan usbcore usb_common thermal thermal_sys
> [33488.177102] CPU: 0 PID: 340 Comm: jbd2/sdd2-8 Not tainted 3.12-0.bpo.1-amd64 #1 Debian 3.12.9-1~bpo70+1
> [33488.178100] Hardware name: Supermicro X10SLQ/X10SLQ, BIOS 1.00 05/09/2013
> [33488.179095] task: ffff88081b179080 ti: ffff88081b3ee000 task.ti: ffff88081b3ee000
> [33488.180102] RIP: 0010:[<ffffffff811b6b22>] [<ffffffff811b6b22>] __getblk+0x52/0x2c0
> [33488.181117] RSP: 0018:ffff88081b3efb78 EFLAGS: 00010282
> [33488.182127] RAX: f7ff880037267140 RBX: 0000000000001000 RCX: 0000000000000000
> [33488.183142] RDX: 00000000000001ff RSI: 000000000148b2ac RDI: 0000000000000000
> [33488.184152] RBP: ffff88081b2a4800 R08: 0000000000000000 R09: ffff8807eccca628
> [33488.185168] R10: 00000000000032ad R11: 0000000000000001 R12: 0000000000000000
> [33488.186182] R13: ffff88081bcf67c0 R14: 00000000532d22fd R15: 000000000148b2ac
> [33488.187194] FS: 0000000000000000(0000) GS:ffff88083fa00000(0000) knlGS:0000000000000000
> [33488.188212] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [33488.189228] CR2: ffffffffff600400 CR3: 000000000180c000 CR4: 00000000001407f0
> [33488.190245] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [33488.191268] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [33488.192279] Stack:
> [33488.193273] 0000000000000000 ffffffffa01ec1ba ffff88081b179080 ffff88081b2a4800
> [33488.194276] 00000000000032ac 0000000000000000 ffff88081b2a4800 ffff88081b2a4800
> [33488.195265] ffff8807f36b4688 00000000532d22fd 00000000ffffffff ffffffffa01737d0
> [33488.196246] Call Trace:
> [33488.197222] [<ffffffffa01ec1ba>] ? ext4_bmap+0x5a/0x110 [ext4]
> [33488.198210] [<ffffffffa01737d0>] ? jbd2_journal_get_descriptor_buffer+0x30/0x80 [jbd2]
> [33488.199203] [<ffffffff811b5100>] ? __wait_on_buffer+0x30/0x30
> [33488.200196] [<ffffffffa016aedd>] ? journal_submit_commit_record.isra.12+0x9d/0x200 [jbd2]
> [33488.201193] [<ffffffff811b5100>] ? __wait_on_buffer+0x30/0x30
> [33488.202198] [<ffffffffa016c1f4>] ? jbd2_journal_commit_transaction+0x11b4/0x18c0 [jbd2]
> [33488.203211] [<ffffffffa01706cc>] ? kjournald2+0xac/0x240 [jbd2]
> [33488.204219] [<ffffffff81082d20>] ? add_wait_queue+0x60/0x60
> [33488.205226] [<ffffffffa0170620>] ? commit_timeout+0x10/0x10 [jbd2]
> [33488.206230] [<ffffffff81082333>] ? kthread+0xb3/0xc0
> [33488.207230] [<ffffffff81082280>] ? flush_kthread_worker+0xa0/0xa0
> [33488.208231] [<ffffffff814cb70c>] ? ret_from_fork+0x7c/0xb0
> [33488.209224] [<ffffffff81082280>] ? flush_kthread_worker+0xa0/0xa0
> [33488.210205] Code: 12 48 83 c4 28 4c 89 e0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 49 8b 85 98 00 00 00 ba ff 01 00 00 48 8b 80 58 03 00 00 48 85 c0 74 10 <0f> b7 80 ac 05 00 00 66 85 c0 0f 85 cd 01 00 00 85 da 0f 85 d2
> [33488.212277] RIP [<ffffffff811b6b22>] __getblk+0x52/0x2c0
> [33488.213276] RSP <ffff88081b3efb78>
> [33488.370823] ---[ end trace cf90c18d45ff9570 ]---
>
>
> Thoughts?
>
> Ingo, Peter, Thomas, any further ideas, please?
>
>
> > > Though at times the oops occur even when the system is largely idle,
> > > they seem to be exacerbated by md5sum'ing all files on a large
> > > partition as part of archive verification --- say 1 million files
> > > corresponding to 1 TByte of storage. If I perform this repeatedly,
> > > the machines seem to lock up about once a week. Strangely, other
> > > typical high-load/high-stress scenarios don't seem to provoke the oops
> > > nearly so much (see below).
> > >
> > > Naturally, such md5sum usage is putting heavy load on the processor,
> > > memory, and even power supply, and my initial inclination is generally
> > > that I must have some faulty components. Even after otherwise
> > > ambiguous diagnostics (described below), I'm highly skeptical that
> > > there's anything here inherent to the md5sum codebase, in particular.
> > > However, I have started to wonder whether this might be a kernel
> > > regression...
> > >
> > > For reference, here's my setup:
> > >
> > > Mainboard: Supermicro X10SLQ
> > > Processor: (Single-Socket) Intel Haswell i7-4770S (65W max TDP)
> > > Memory: 32GB Kingston DDR3 RAM (4x KVR16N11/8)
> > > PSU: SeaSonic SS-400FL2 400W PSU
> > > O/S: Debian v7.4 Wheezy (amd64)
> > > Filesystem: Ext4 (with default settings upon creation) over LUKS
> > > Kernel: Using both:
> > > Linux 3.11.10 ('3.11-0.bpo.2-amd64' via wheezy-backports)
> > > Linux 3.12.9 ('3.12-0.bpo.2-amd64' via wheezy-backports)
> > >
> > > To summarize where I am now: I've been very extensively testing all of
> > > the likely culprits among hardware components on both of my servers
> > > --- running memtest86 upon boot for 3+ days, memtester in userspace
> > > for 24 hours, repeated kernel compiles with various '-j' values, and
> > > the 'stress' and 'stressapptest' load generators (see [2] for full
> > > details) --- and I have never seen even a hiccup in server operation
> > > under such "artificial" environments --- however, it consistently
> > > occurs with heavy md5sum operation, and randomly at other times.
> > >
> > > At least from my past experiences (with scientific HPC clusters), such
> > > diagnostic results would normally seem to largely rule out most
> > > problems with the processor, memory, mainboard subsystems. The PSU is
> > > often a little harder to rule out, but the 400W Seasonic PSUs are
> > > rated at 2--3 times the wattage I should really need, even under peak
> > > load (given each server's single-socket CPU is 65W at max TDP, there
> > > are only a few HDs and one SSD, and no discrete graphics at all, of
> > > course).
> > >
> > > I'm further surprised to see the exact same kernel-crash behavior on
> > > two separate, but identical, servers, which leads me to wonder if
> > > there's possibly some regression between the hardware (given that it's
> > > relatively new Haswell microcode / silicon) and the (kernel?)
> > > software.
> > >
> > > Any thoughts on what might be occurring here? Or what I should focus
> > > on? Thanks in advance.
>
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: Consistent kernel oops with 3.11.10 & 3.12.9 on Haswell CPUs...
2014-03-22 13:18 ` Consistent kernel oops with 3.11.10 & 3.12.9 on Haswell CPUs Thomas Gleixner
@ 2014-03-23 13:25 ` Jan Kara
2014-03-23 14:26 ` dafreedm
0 siblings, 1 reply; 5+ messages in thread
From: Jan Kara @ 2014-03-23 13:25 UTC (permalink / raw)
To: Thomas Gleixner
Cc: dafreedm, Guennadi Liakhovetski, LKML, Ingo Molnar,
H. Peter Anvin, Theodore Ts'o, linux-ext4, Jens Axboe
On Sat 22-03-14 14:18:39, Thomas Gleixner wrote:
> On Sat, 22 Mar 2014, dafreedm@gmail.com wrote:
>
> > Ahh. Good call. I wasn't sophisticated enough with these things to
> > ascertain the difference. I knew to avoid reporting oops/panics with
> > kernels tainted with out-of-tree (non-free) modules, but I guess I
> > grabbed the wrong lines from the dmesg (namely, subsequent oops after
> > the initial one). Here's a more recent kernel oops (from this
> > morning) --- it's the first oops after a fresh reboot:
>
> Cc'ing ext4 and block folks.
Hum, so decodecode shows:
...
26: 48 85 c0 test %rax,%rax
29: 74 10 je 0x3b
2b:* 0f b7 80 ac 05 00 00 movzwl 0x5ac(%rax),%eax <-- trapping instruction
32: 66 85 c0 test %ax,%ax
...
And the register has:
RAX: f7ff880037267140 RBX: 0000000000001000 RCX: 0000000000000000
So that looks like a bit flip in the upper byte. So I'd check the hardware
first...
Honza
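The bit-flip reading above can be checked with a little arithmetic. The sketch below is only an illustration: it assumes the intended pointer was the same value with its top byte restored to 0xff, which matches the other kernel pointers in the dump (RBP, R13 and friends all start with ffff88). The names OBSERVED, EXPECTED, flipped_bits and is_canonical are made up for this example. It XORs the two values to list the differing bits and tests whether each address is canonical under 48-bit virtual addressing, where bits 63..47 must all be equal.

```python
# RAX as printed in the oops.
OBSERVED = 0xF7FF880037267140
# Assumption: the pointer the kernel meant to load, i.e. the same value
# with the top byte restored to 0xff like the other pointers in the dump.
EXPECTED = 0xFFFF880037267140


def flipped_bits(a, b):
    """Return the bit positions (0 = least significant) where a and b differ."""
    diff = a ^ b
    return [i for i in range(64) if (diff >> i) & 1]


def is_canonical(addr):
    """True if bits 63..47 are all zero or all one (48-bit virtual addresses)."""
    top = addr >> 47
    return top == 0 or top == 0x1FFFF


print(flipped_bits(OBSERVED, EXPECTED))   # [59]
print(is_canonical(OBSERVED))             # False
print(is_canonical(EXPECTED))             # True
```

Under that assumption the two values differ in exactly one bit (bit 59, inside the top byte), and the observed value is not a canonical address. On x86-64 a dereference of a non-canonical address raises a general protection fault rather than a page fault, which is consistent with the first line of the oops.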
> > [33488.170415] general protection fault: 0000 [#1] SMP
> > [33488.171351] Modules linked in: dm_crypt dm_mod parport_pc ppdev lp parport bnep rfcomm bluetooth rfkill cpufreq_stats cpufreq_userspace cpufreq_conservative cpufreq_powersave nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc netconsole configfs loop raid1 md_mod snd_hda_codec_realtek snd_hda_codec_hdmi hid_kensington x86_pkg_temp_thermal coretemp joydev kvm_intel hid_generic kvm crct10dif_pclmul snd_hda_intel crc32_pclmul crc32c_intel snd_hda_codec usbhid snd_hwdep hid ghash_clmulni_intel snd_pcm iTCO_wdt iTCO_vendor_support snd_page_alloc aesni_intel i915 snd_seq aes_x86_64 snd_seq_device snd_timer lrw evdev gf128mul glue_helper drm_kms_helper ablk_helper snd psmouse cryptd drm i2c_i801 pcspkr soundcore serio_raw lpc_ich mei_me mei mfd_core video button processor ext4 crc16 mbcache jbd2 sg sd_mod crc_t10dif crct10dif_common ahci libahci xhci_hcd ehci_pci ehci_hcd e1000e libata igb scsi_mod i2c_algo_bit i2c_core dca ptp pps_core fan usbcore usb_common thermal thermal_sys
> > [33488.177102] CPU: 0 PID: 340 Comm: jbd2/sdd2-8 Not tainted 3.12-0.bpo.1-amd64 #1 Debian 3.12.9-1~bpo70+1
> > [33488.178100] Hardware name: Supermicro X10SLQ/X10SLQ, BIOS 1.00 05/09/2013
> > [33488.179095] task: ffff88081b179080 ti: ffff88081b3ee000 task.ti: ffff88081b3ee000
> > [33488.180102] RIP: 0010:[<ffffffff811b6b22>] [<ffffffff811b6b22>] __getblk+0x52/0x2c0
> > [33488.181117] RSP: 0018:ffff88081b3efb78 EFLAGS: 00010282
> > [33488.182127] RAX: f7ff880037267140 RBX: 0000000000001000 RCX: 0000000000000000
> > [33488.183142] RDX: 00000000000001ff RSI: 000000000148b2ac RDI: 0000000000000000
> > [33488.184152] RBP: ffff88081b2a4800 R08: 0000000000000000 R09: ffff8807eccca628
> > [33488.185168] R10: 00000000000032ad R11: 0000000000000001 R12: 0000000000000000
> > [33488.186182] R13: ffff88081bcf67c0 R14: 00000000532d22fd R15: 000000000148b2ac
> > [33488.187194] FS: 0000000000000000(0000) GS:ffff88083fa00000(0000) knlGS:0000000000000000
> > [33488.188212] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [33488.189228] CR2: ffffffffff600400 CR3: 000000000180c000 CR4: 00000000001407f0
> > [33488.190245] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [33488.191268] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > [33488.192279] Stack:
> > [33488.193273] 0000000000000000 ffffffffa01ec1ba ffff88081b179080 ffff88081b2a4800
> > [33488.194276] 00000000000032ac 0000000000000000 ffff88081b2a4800 ffff88081b2a4800
> > [33488.195265] ffff8807f36b4688 00000000532d22fd 00000000ffffffff ffffffffa01737d0
> > [33488.196246] Call Trace:
> > [33488.197222] [<ffffffffa01ec1ba>] ? ext4_bmap+0x5a/0x110 [ext4]
> > [33488.198210] [<ffffffffa01737d0>] ? jbd2_journal_get_descriptor_buffer+0x30/0x80 [jbd2]
> > [33488.199203] [<ffffffff811b5100>] ? __wait_on_buffer+0x30/0x30
> > [33488.200196] [<ffffffffa016aedd>] ? journal_submit_commit_record.isra.12+0x9d/0x200 [jbd2]
> > [33488.201193] [<ffffffff811b5100>] ? __wait_on_buffer+0x30/0x30
> > [33488.202198] [<ffffffffa016c1f4>] ? jbd2_journal_commit_transaction+0x11b4/0x18c0 [jbd2]
> > [33488.203211] [<ffffffffa01706cc>] ? kjournald2+0xac/0x240 [jbd2]
> > [33488.204219] [<ffffffff81082d20>] ? add_wait_queue+0x60/0x60
> > [33488.205226] [<ffffffffa0170620>] ? commit_timeout+0x10/0x10 [jbd2]
> > [33488.206230] [<ffffffff81082333>] ? kthread+0xb3/0xc0
> > [33488.207230] [<ffffffff81082280>] ? flush_kthread_worker+0xa0/0xa0
> > [33488.208231] [<ffffffff814cb70c>] ? ret_from_fork+0x7c/0xb0
> > [33488.209224] [<ffffffff81082280>] ? flush_kthread_worker+0xa0/0xa0
> > [33488.210205] Code: 12 48 83 c4 28 4c 89 e0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 49 8b 85 98 00 00 00 ba ff 01 00 00 48 8b 80 58 03 00 00 48 85 c0 74 10 <0f> b7 80 ac 05 00 00 66 85 c0 0f 85 cd 01 00 00 85 da 0f 85 d2
> > [33488.212277] RIP [<ffffffff811b6b22>] __getblk+0x52/0x2c0
> > [33488.213276] RSP <ffff88081b3efb78>
> > [33488.370823] ---[ end trace cf90c18d45ff9570 ]---
> >
> >
> > Thoughts?
> >
> > Ingo, Peter, Thomas, any further ideas, please?
> >
> >
> > > > Though at times the oops occur even when the system is largely idle,
> > > > they seem to be exacerbated by md5sum'ing all files on a large
> > > > partition as part of archive verification --- say 1 million files
> > > > corresponding to 1 TByte of storage. If I perform this repeatedly,
> > > > the machines seem to lock up about once a week. Strangely, other
> > > > typical high-load/high-stress scenarios don't seem to provoke the oops
> > > > nearly so much (see below).
> > > >
> > > > Naturally, such md5sum usage is putting heavy load on the processor,
> > > > memory, and even power supply, and my initial inclination is generally
> > > > that I must have some faulty components. Even after otherwise
> > > > ambiguous diagnostics (described below), I'm highly skeptical that
> > > > there's anything here inherent to the md5sum codebase, in particular.
> > > > However, I have started to wonder whether this might be a kernel
> > > > regression...
> > > >
> > > > For reference, here's my setup:
> > > >
> > > > Mainboard: Supermicro X10SLQ
> > > > Processor: (Single-Socket) Intel Haswell i7-4770S (65W max TDP)
> > > > Memory: 32GB Kingston DDR3 RAM (4x KVR16N11/8)
> > > > PSU: SeaSonic SS-400FL2 400W PSU
> > > > O/S: Debian v7.4 Wheezy (amd64)
> > > > Filesystem: Ext4 (with default settings upon creation) over LUKS
> > > > Kernel: Using both:
> > > > Linux 3.11.10 ('3.11-0.bpo.2-amd64' via wheezy-backports)
> > > > Linux 3.12.9 ('3.12-0.bpo.2-amd64' via wheezy-backports)
> > > >
> > > > To summarize where I am now: I've been very extensively testing all of
> > > > the likely culprits among hardware components on both of my servers
> > > > --- running memtest86 upon boot for 3+ days, memtester in userspace
> > > > for 24 hours, repeated kernel compiles with various '-j' values, and
> > > > the 'stress' and 'stressapptest' load generators (see [2] for full
> > > > details) --- and I have never seen even a hiccup in server operation
> > > > under such "artificial" environments --- however, it consistently
> > > > occurs with heavy md5sum operation, and randomly at other times.
> > > >
> > > > At least from my past experiences (with scientific HPC clusters), such
> > > > diagnostic results would normally seem to largely rule out most
> > > > problems with the processor, memory, mainboard subsystems. The PSU is
> > > > often a little harder to rule out, but the 400W Seasonic PSUs are
> > > > rated at 2--3 times the wattage I should really need, even under peak
> > > > load (given each server's single-socket CPU is 65W at max TDP, there
> > > > are only a few HDs and one SSD, and no discrete graphics at all, of
> > > > course).
> > > >
> > > > I'm further surprised to see the exact same kernel-crash behavior on
> > > > two separate, but identical, servers, which leads me to wonder if
> > > > there's possibly some regression between the hardware (given that it's
> > > > relatively new Haswell microcode / silicon) and the (kernel?)
> > > > software.
> > > >
> > > > Any thoughts on what might be occurring here? Or what I should focus
> > > > on? Thanks in advance.
> >
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: Consistent kernel oops with 3.11.10 & 3.12.9 on Haswell CPUs...
2014-03-23 13:25 ` Jan Kara
@ 2014-03-23 14:26 ` dafreedm
2014-03-27 10:35 ` dafreedm
2014-03-27 15:26 ` Jan Kara
0 siblings, 2 replies; 5+ messages in thread
From: dafreedm @ 2014-03-23 14:26 UTC (permalink / raw)
To: Jan Kara
Cc: Thomas Gleixner, Guennadi Liakhovetski, LKML, Ingo Molnar,
H. Peter Anvin, Theodore Ts'o, linux-ext4, Jens Axboe,
dafreedm
Hi Jan,
On Sun, Mar 23, 2014, Jan Kara wrote:
> On Sat 22-03-14 14:18:39, Thomas Gleixner wrote:
> > On Sat, 22 Mar 2014, dafreedm@gmail.com wrote:
> >
> > > Ahh. Good call. I wasn't sophisticated enough with these things to
> > > ascertain the difference. I knew to avoid reporting oops/panics with
> > > kernels tainted with out-of-tree (non-free) modules, but I guess I
> > > grabbed the wrong lines from the dmesg (namely, subsequent oops after
> > > the initial one). Here's a more recent kernel oops (from this
> > > morning) --- it's the first oops after a fresh reboot:
> >
> > Cc'ing ext4 and block folks.
> Hum, so decodecode shows:
> ...
> 26: 48 85 c0 test %rax,%rax
> 29: 74 10 je 0x3b
> 2b:* 0f b7 80 ac 05 00 00 movzwl 0x5ac(%rax),%eax <-- trapping instruction
> 32: 66 85 c0 test %ax,%ax
> ...
>
> And the register has:
> RAX: f7ff880037267140 RBX: 0000000000001000 RCX: 0000000000000000
>
> So that looks like a bit flip in the upper byte.
Just for my own knowledge / growth --- how can you tell there's a
"bit flip" in the upper byte?
> So I'd check the hardware first...
Yes, I absolutely did check the HW first --- and repeatedly (over a
couple of weeks) --- before reaching out to LKML.
As described in my original email below, here's what I've done so far:
I've been very extensively testing all of the likely culprits among
hardware components on both of my servers --- running memtest86 upon
boot for 3+ days, memtester in userspace for 24 hours, repeated
kernel compiles with various '-j' values, and the 'stress' and
'stressapptest' load generators (see below for full details) --- and
I have never seen even a hiccup in server operation under such
"artificial" environments --- however, it consistently occurs with
heavy md5sum operation, and randomly at other times.
More specifically, here are the exact steps I took to try to implicate
the HW:
aptitude install memtest86+ # reboot and run for 3+ days

aptitude install memtester
memtester 30G

aptitude install linux-source
cp /usr/src/linux-source-3.2.tar.bz2 /root/
tar xvfj linux-source-3.2.tar.bz2
cd linux-source-3.2/
make defconfig
time make 1>LOG 2>ERR
make mrproper
make defconfig
time make -j16 1>LOG 2>ERR

aptitude install stress
stress --cpu 8 --io 4 --vm 2 --timeout 10s --dry-run
stress --cpu 8 --io 4 --vm 2 --hdd 3 --timeout 60s
stress --cpu 8 --io 8 --vm 8 --hdd 4 --timeout 5m

aptitude install stressapptest
stressapptest -m 8 -i 4 -C 4 -W -s 30
stressapptest -m 8 -i 4 -C 4 -W -f /root/sat-file-test --filesize 1gb -s 30
stressapptest -m 8 -i 4 -C 4 -W -f /root/sat-file-test --filesize 1024 --random-threads 4 -s 30
stressapptest -m 8 -i 4 -C 4 -W --cc_test -s 30
stressapptest -m 8 -i 4 -C 4 -W --local_numa -s 30
stressapptest -m 8 -i 4 -C 4 -W -n 127.0.0.1 --listen -s 30
stressapptest -m 12 -i 6 -C 8 -W -f /root/sat-file-test --filesize 1024 --random-threads 4 -n 127.0.0.1 --listen -s 300
As mentioned earlier --- I just could not make it oops doing the
above! (or get any errors in the standalone memtest86+ procedure).
What do you think? Should I just keep on stress-testing it somewhat
indefinitely? Also, please recall that I have two of the identical
machines, and I suffer the same problems with both of them (and they
both pass the above artificial stress-testing).
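Since the md5sum workload is the one thing that reliably provokes the problem, one option is to make that workload self-checking, so that silent corruption shows up as a digest mismatch even when nothing oopses. The following is a rough sketch under that assumption, not something that was run in this thread; md5_of, hash_tree and compare_passes are hypothetical helper names, and only the Python standard library is used.

```python
import hashlib
import os


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of one file, read in 1 MiB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def hash_tree(root):
    """Hash every regular file below root; return {path: digest}."""
    digests = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                digests[path] = md5_of(path)
    return digests


def compare_passes(root, passes=2):
    """Hash the tree `passes` times; return (pass_number, path) for every
    file whose digest differs from the first pass."""
    baseline = hash_tree(root)
    mismatches = []
    for n in range(1, passes):
        current = hash_tree(root)
        for path, digest in baseline.items():
            if current.get(path) != digest:
                mismatches.append((n, path))
    return mismatches
```

Calling compare_passes on the archive mount point with a handful of passes would report any file whose digest changes between passes while the files themselves are untouched. Later passes only exercise the disks again if the data set is larger than the page cache, which should hold for roughly 1 TByte of files on a 32GB machine.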
Thoughts or suggestions, please, for me to explore further...
Thanks again!
> > > [33488.170415] general protection fault: 0000 [#1] SMP
> > > [33488.171351] Modules linked in: dm_crypt dm_mod parport_pc ppdev lp parport bnep rfcomm bluetooth rfkill cpufreq_stats cpufreq_userspace cpufreq_conservative cpufreq_powersave nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc netconsole configfs loop raid1 md_mod snd_hda_codec_realtek snd_hda_codec_hdmi hid_kensington x86_pkg_temp_thermal coretemp joydev kvm_intel hid_generic kvm crct10dif_pclmul snd_hda_intel crc32_pclmul crc32c_intel snd_hda_codec usbhid snd_hwdep hid ghash_clmulni_intel snd_pcm iTCO_wdt iTCO_vendor_support snd_page_alloc aesni_intel i915 snd_seq aes_x86_64 snd_seq_device snd_timer lrw evdev gf128mul glue_helper drm_kms_helper ablk_helper snd psmouse cryptd drm i2c_i801 pcspkr soundcore serio_raw lpc_ich mei_me mei mfd_core video button processor ext4 crc16 mbcache jbd2 sg sd_mod crc_t10dif crct10dif_common ahci libahci xhci_hcd ehci_pci ehci_hcd e1000e libata igb scsi_mod i2c_algo_bit i2c_core dca ptp pps_core fan usbcore usb_common thermal thermal_sys
> > > [33488.177102] CPU: 0 PID: 340 Comm: jbd2/sdd2-8 Not tainted 3.12-0.bpo.1-amd64 #1 Debian 3.12.9-1~bpo70+1
> > > [33488.178100] Hardware name: Supermicro X10SLQ/X10SLQ, BIOS 1.00 05/09/2013
> > > [33488.179095] task: ffff88081b179080 ti: ffff88081b3ee000 task.ti: ffff88081b3ee000
> > > [33488.180102] RIP: 0010:[<ffffffff811b6b22>] [<ffffffff811b6b22>] __getblk+0x52/0x2c0
> > > [33488.181117] RSP: 0018:ffff88081b3efb78 EFLAGS: 00010282
> > > [33488.182127] RAX: f7ff880037267140 RBX: 0000000000001000 RCX: 0000000000000000
> > > [33488.183142] RDX: 00000000000001ff RSI: 000000000148b2ac RDI: 0000000000000000
> > > [33488.184152] RBP: ffff88081b2a4800 R08: 0000000000000000 R09: ffff8807eccca628
> > > [33488.185168] R10: 00000000000032ad R11: 0000000000000001 R12: 0000000000000000
> > > [33488.186182] R13: ffff88081bcf67c0 R14: 00000000532d22fd R15: 000000000148b2ac
> > > [33488.187194] FS: 0000000000000000(0000) GS:ffff88083fa00000(0000) knlGS:0000000000000000
> > > [33488.188212] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [33488.189228] CR2: ffffffffff600400 CR3: 000000000180c000 CR4: 00000000001407f0
> > > [33488.190245] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > [33488.191268] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > [33488.192279] Stack:
> > > [33488.193273] 0000000000000000 ffffffffa01ec1ba ffff88081b179080 ffff88081b2a4800
> > > [33488.194276] 00000000000032ac 0000000000000000 ffff88081b2a4800 ffff88081b2a4800
> > > [33488.195265] ffff8807f36b4688 00000000532d22fd 00000000ffffffff ffffffffa01737d0
> > > [33488.196246] Call Trace:
> > > [33488.197222] [<ffffffffa01ec1ba>] ? ext4_bmap+0x5a/0x110 [ext4]
> > > [33488.198210] [<ffffffffa01737d0>] ? jbd2_journal_get_descriptor_buffer+0x30/0x80 [jbd2]
> > > [33488.199203] [<ffffffff811b5100>] ? __wait_on_buffer+0x30/0x30
> > > [33488.200196] [<ffffffffa016aedd>] ? journal_submit_commit_record.isra.12+0x9d/0x200 [jbd2]
> > > [33488.201193] [<ffffffff811b5100>] ? __wait_on_buffer+0x30/0x30
> > > [33488.202198] [<ffffffffa016c1f4>] ? jbd2_journal_commit_transaction+0x11b4/0x18c0 [jbd2]
> > > [33488.203211] [<ffffffffa01706cc>] ? kjournald2+0xac/0x240 [jbd2]
> > > [33488.204219] [<ffffffff81082d20>] ? add_wait_queue+0x60/0x60
> > > [33488.205226] [<ffffffffa0170620>] ? commit_timeout+0x10/0x10 [jbd2]
> > > [33488.206230] [<ffffffff81082333>] ? kthread+0xb3/0xc0
> > > [33488.207230] [<ffffffff81082280>] ? flush_kthread_worker+0xa0/0xa0
> > > [33488.208231] [<ffffffff814cb70c>] ? ret_from_fork+0x7c/0xb0
> > > [33488.209224] [<ffffffff81082280>] ? flush_kthread_worker+0xa0/0xa0
> > > [33488.210205] Code: 12 48 83 c4 28 4c 89 e0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 49 8b 85 98 00 00 00 ba ff 01 00 00 48 8b 80 58 03 00 00 48 85 c0 74 10 <0f> b7 80 ac 05 00 00 66 85 c0 0f 85 cd 01 00 00 85 da 0f 85 d2
> > > [33488.212277] RIP [<ffffffff811b6b22>] __getblk+0x52/0x2c0
> > > [33488.213276] RSP <ffff88081b3efb78>
> > > [33488.370823] ---[ end trace cf90c18d45ff9570 ]---
> > >
> > >
> > > Thoughts?
> > >
> > > Ingo, Peter, Thomas, any further ideas, please?
> > >
> > >
> > > > > Though at times the oops occur even when the system is largely idle,
> > > > > they seem to be exacerbated by md5sum'ing all files on a large
> > > > > partition as part of archive verification --- say 1 million files
> > > > > corresponding to 1 TByte of storage. If I perform this repeatedly,
> > > > > the machines seem to lock up about once a week. Strangely, other
> > > > > typical high-load/high-stress scenarios don't seem to provoke the oops
> > > > > nearly so much (see below).
> > > > >
> > > > > Naturally, such md5sum usage is putting heavy load on the processor,
> > > > > memory, and even power supply, and my initial inclination is generally
> > > > > that I must have some faulty components. Even after otherwise
> > > > > ambiguous diagnostics (described below), I'm highly skeptical that
> > > > > there's anything here inherent to the md5sum codebase, in particular.
> > > > > However, I have started to wonder whether this might be a kernel
> > > > > regression...
> > > > >
> > > > > For reference, here's my setup:
> > > > >
> > > > > Mainboard: Supermicro X10SLQ
> > > > > Processor: (Single-Socket) Intel Haswell i7-4770S (65W max TDP)
> > > > > Memory: 32GB Kingston DDR3 RAM (4x KVR16N11/8)
> > > > > PSU: SeaSonic SS-400FL2 400W PSU
> > > > > O/S: Debian v7.4 Wheezy (amd64)
> > > > > Filesystem: Ext4 (with default settings upon creation) over LUKS
> > > > > Kernel: Using both:
> > > > > Linux 3.11.10 ('3.11-0.bpo.2-amd64' via wheezy-backports)
> > > > > Linux 3.12.9 ('3.12-0.bpo.2-amd64' via wheezy-backports)
> > > > >
> > > > > To summarize where I am now: I've been very extensively testing all of
> > > > > the likely culprits among hardware components on both of my servers
> > > > > --- running memtest86 upon boot for 3+ days, memtester in userspace
> > > > > for 24 hours, repeated kernel compiles with various '-j' values, and
> > > > > the 'stress' and 'stressapptest' load generators (see [2] for full
> > > > > details) --- and I have never seen even a hiccup in server operation
> > > > > under such "artificial" environments --- however, it consistently
> > > > > occurs with heavy md5sum operation, and randomly at other times.
> > > > >
> > > > > At least from my past experiences (with scientific HPC clusters), such
> > > > > diagnostic results would normally seem to largely rule out most
> > > > > problems with the processor, memory, mainboard subsystems. The PSU is
> > > > > often a little harder to rule out, but the 400W Seasonic PSUs are
> > > > > rated at 2--3 times the wattage I should really need, even under peak
> > > > > load (given each server's single-socket CPU is 65W at max TDP, there
> > > > > are only a few HDs and one SSD, and no discrete graphics at all, of
> > > > > course).
> > > > >
> > > > > I'm further surprised to see the exact same kernel-crash behavior on
> > > > > two separate, but identical, servers, which leads me to wonder if
> > > > > there's possibly some regression between the hardware (given that it's
> > > > > relatively new Haswell microcode / silicon) and the (kernel?)
> > > > > software.
> > > > >
> > > > > Any thoughts on what might be occurring here? Or what I should focus
> > > > > on? Thanks in advance.
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: Consistent kernel oops with 3.11.10 & 3.12.9 on Haswell CPUs...
2014-03-23 14:26 ` dafreedm
@ 2014-03-27 10:35 ` dafreedm
2014-03-27 15:26 ` Jan Kara
1 sibling, 0 replies; 5+ messages in thread
From: dafreedm @ 2014-03-27 10:35 UTC (permalink / raw)
To: Jan Kara
Cc: Thomas Gleixner, Guennadi Liakhovetski, LKML, Ingo Molnar,
H. Peter Anvin, Theodore Ts'o, linux-ext4@vger.kernel.org,
Jens Axboe, dafreedm
[-- Attachment #1: Type: text/plain, Size: 3491 bytes --]
Hi,
I've attached another oops (initial one from untainted kernel, and
then successive ones) on the same machine.
Please see the HW stress-testing I've already done below (without
seeing such an oops). Any further suggestions?
Also, how can I tell from the registers you decoded (below) that it's
a bit-flip? (That way I can look at this stuff more myself,
perhaps)...
Thanks.
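For reading such dumps by hand: the "Code:" line of an oops marks the first byte of the trapping instruction with angle brackets, and the offset reported by decodecode (2b in the output quoted in this thread) is simply the number of bytes printed before that marker. A minimal sketch, assuming the Code: line from the first oops in this thread is pasted in as a string; parse_code_line is a name invented for this illustration and is not part of any kernel tool.

```python
# "Code:" line from the general protection fault oops ([33488.210205]).
CODE_LINE = (
    "12 48 83 c4 28 4c 89 e0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 49 8b 85 "
    "98 00 00 00 ba ff 01 00 00 48 8b 80 58 03 00 00 48 85 c0 74 10 <0f> "
    "b7 80 ac 05 00 00 66 85 c0 0f 85 cd 01 00 00 85 da 0f 85 d2"
)


def parse_code_line(code):
    """Return (offset of the <xx>-marked byte, all bytes as a bytes object)."""
    tokens = code.split()
    marked = [i for i, t in enumerate(tokens)
              if t.startswith("<") and t.endswith(">")]
    if len(marked) != 1:
        raise ValueError("expected exactly one <xx> marker")
    raw = bytes(int(t.strip("<>"), 16) for t in tokens)
    return marked[0], raw


offset, raw = parse_code_line(CODE_LINE)
print(hex(offset))                    # 0x2b
print(raw[offset:offset + 7].hex())   # 0fb780ac050000

# The last four bytes of this instruction are a little-endian 32-bit
# displacement: the 0x5ac in "movzwl 0x5ac(%rax),%eax".
disp = int.from_bytes(raw[offset + 3:offset + 7], "little")
print(hex(disp))                      # 0x5ac
```

The seven bytes at that offset are the movzwl instruction shown in the decodecode output, so the faulting access is a load from RAX plus 0x5ac, which is why the value of RAX is the register to look at.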
On Sun, Mar 23, 2014, Daniel Freedman wrote:
> > Hum, so decodecode shows:
> > ...
> > 26: 48 85 c0 test %rax,%rax
> > 29: 74 10 je 0x3b
> > 2b:* 0f b7 80 ac 05 00 00 movzwl 0x5ac(%rax),%eax <-- trapping instruction
> > 32: 66 85 c0 test %ax,%ax
> > ...
> >
> > And the register has:
> > RAX: f7ff880037267140 RBX: 0000000000001000 RCX: 0000000000000000
> >
> > So that looks like a bit flip in the upper byte.
>
> Just for my own knowledge / growth --- how can you tell there's a
> "bit flip" in the upper byte?
>
> > So I'd check the hardware first...
>
>
> Yes, I absolutely did check the HW first --- and repeatedly (over a
> couple of weeks) --- before reaching out to LKML.
>
> As described in my original email below, here's what I've done so far:
>
> I've been very extensively testing all of the likely culprits among
> hardware components on both of my servers --- running memtest86 upon
> boot for 3+ days, memtester in userspace for 24 hours, repeated
> kernel compiles with various '-j' values, and the 'stress' and
> 'stressapptest' load generators (see below for full details) --- and
> I have never seen even a hiccup in server operation under such
> "artificial" environments --- however, it consistently occurs with
> heavy md5sum operation, and randomly at other times.
>
> More specifically, here are the exact steps I took to try to implicate
> the HW:
>
> aptitude install memtest86+ # reboot and run for 3+ days
>
> aptitude install memtester
> memtester 30G
>
> aptitude install linux-source
> cp /usr/src/linux-source-3.2.tar.bz2 /root/
> tar xvfj linux-source-3.2.tar.bz2
> cd linux-source-3.2/
> make defconfig
> time make 1>LOG 2>ERR
> make mrproper
> make defconfig
> time make -j16 1>LOG 2>ERR
>
> aptitude install stress
> stress --cpu 8 --io 4 --vm 2 --timeout 10s --dry-run
> stress --cpu 8 --io 4 --vm 2 --hdd 3 --timeout 60s
> stress --cpu 8 --io 8 --vm 8 --hdd 4 --timeout 5m
>
> aptitude install stressapptest
> stressapptest -m 8 -i 4 -C 4 -W -s 30
> stressapptest -m 8 -i 4 -C 4 -W -f /root/sat-file-test --filesize 1gb -s 30
> stressapptest -m 8 -i 4 -C 4 -W -f /root/sat-file-test --filesize 1024 --random-threads 4 -s 30
> stressapptest -m 8 -i 4 -C 4 -W --cc_test -s 30
> stressapptest -m 8 -i 4 -C 4 -W --local_numa -s 30
> stressapptest -m 8 -i 4 -C 4 -W -n 127.0.0.1 --listen -s 30
> stressapptest -m 12 -i 6 -C 8 -W -f /root/sat-file-test --filesize 1024 --random-threads 4 -n 127.0.0.1 --listen -s 300
>
>
> As mentioned earlier --- I just could not make it oops doing the
> above! (or get any errors in the standalone memtest86+ procedure).
>
> What do you think? Should I just keep on stress-testing it somewhat
> indefinitely? Also, please recall that I have two of the identical
> machines, and I suffer the same problems with both of them (and they
> both pass the above artificial stress-testing).
>
> Thoughts or suggestions, please, for me to explore further...
>
> Thanks again!
[-- Attachment #2: KernelOops --]
[-- Type: text/plain, Size: 24148 bytes --]
[210799.624492] invalid opcode: 0000 [#1] SMP
[210799.624516] Modules linked in: dm_crypt dm_mod parport_pc ppdev lp parport bnep rfcomm bluetooth rfkill cpufreq_stats cpufreq_userspace cpufreq_conservative cpufreq_powersave nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc netconsole configfs loop raid1 md_mod snd_hda_codec_realtek snd_hda_codec_hdmi joydev hid_generic hid_kensington usbhid hid x86_pkg_temp_thermal coretemp kvm_intel kvm snd_hda_intel crct10dif_pclmul snd_hda_codec crc32_pclmul crc32c_intel snd_hwdep snd_pcm ghash_clmulni_intel snd_page_alloc snd_seq iTCO_wdt snd_seq_device aesni_intel iTCO_vendor_support aes_x86_64 snd_timer evdev lrw gf128mul i915 glue_helper ablk_helper snd cryptd drm_kms_helper soundcore psmouse pcspkr drm lpc_ich mei_me mfd_core serio_raw mei i2c_i801 video button processor ext4 crc16 mbcache jbd2 sg sd_mod crc_t10dif crct10dif_common ahci libahci libata xhci_hcd scsi_mod ehci_pci ehci_hcd e1000e igb i2c_algo_bit i2c_core usbcore dca ptp usb_common pps_core fan thermal thermal_sys
[210799.624870] CPU: 2 PID: 22239 Comm: Timer Not tainted 3.12-0.bpo.1-amd64 #1 Debian 3.12.9-1~bpo70+1
[210799.624891] Hardware name: Supermicro X10SLQ/X10SLQ, BIOS 1.00 05/09/2013
[210799.624908] task: ffff88081a485800 ti: ffff88081ba24000 task.ti: ffff88081ba24000
[210799.624927] RIP: 0010:[<ffffffff810c1591>] [<ffffffff810c1591>] futex_requeue+0x721/0x7e0
[210799.624957] RSP: 0018:ffff88081ba25e00 EFLAGS: 00010297
[210799.624974] RAX: 0000000000000002 RBX: 0000000000000000 RCX: 00000000ffffffff
[210799.624991] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 00007fdb3c4173d0
[210799.625008] RBP: 00007fdb3c4173d0 R08: 00007fdb35147608 R09: 0000000000000000
[210799.625025] R10: 0000000000000001 R11: 0000000000000206 R12: 0000000000000001
[210799.625043] R13: 00007fdb35147608 R14: 0000000000000000 R15: 00007fdb3c4173d0
[210799.625060] FS: 00007fdb378ff700(0000) GS:ffff88083fa80000(0000) knlGS:0000000000000000
[210799.625079] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[210799.625093] CR2: 00007fdb1f68b000 CR3: 00000007e6c0a000 CR4: 00000000001407e0
[210799.625110] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[210799.625127] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[210799.625144] Stack:
[210799.625150] ffffffff810c1e34 ffff88081ba25ee8 ffff88081b883000 ffff8807ec1fac00
[210799.625173] ffff88081ba25fd8 ffff88081a485800 0000000100000000 ffff880800cde0a8
[210799.625199] ffff880800cde0f0 0000000000000001 ffffffff811c3e1c 0000000000000001
[210799.625223] Call Trace:
[210799.625232] [<ffffffff810c1e34>] ? do_futex+0x344/0xb00
[210799.625247] [<ffffffff811c3e1c>] ? fsnotify+0x1dc/0x2d0
[210799.625261] [<ffffffff810c273b>] ? SyS_futex+0x14b/0x1b0
[210799.625277] [<ffffffff811856dc>] ? vfs_write+0x17c/0x200
[210799.625293] [<ffffffff814cb7b9>] ? system_call_fastpath+0x16/0x1b
[210799.625308] Code: 83 44 24 20 01 e9 40 fc ff ff c7 44 24 08 f5 ff ff ff e9 2b fb ff ff c7 44 24 08 00 00 00 00 e9 1e fb ff ff 89 44 24 08 e9 70 fb <ff> ff be 96 04 00 00 48 c7 c7 00 af 6f 81 e8 3c ee f9 ff eb a2
[210799.625434] RIP [<ffffffff810c1591>] futex_requeue+0x721/0x7e0
[210799.625456] RSP <ffff88081ba25e00>
[210799.630421] ---[ end trace 5197659ccd2d2aa0 ]---
[210799.630429] invalid opcode: 0000 [#2] SMP
[210799.630445] Modules linked in: dm_crypt dm_mod parport_pc ppdev lp parport bnep rfcomm bluetooth rfkill cpufreq_stats cpufreq_userspace cpufreq_conservative cpufreq_powersave nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc netconsole configfs loop raid1 md_mod snd_hda_codec_realtek snd_hda_codec_hdmi joydev hid_generic hid_kensington usbhid hid x86_pkg_temp_thermal coretemp kvm_intel kvm snd_hda_intel crct10dif_pclmul snd_hda_codec crc32_pclmul crc32c_intel snd_hwdep snd_pcm ghash_clmulni_intel snd_page_alloc snd_seq iTCO_wdt snd_seq_device aesni_intel iTCO_vendor_support aes_x86_64 snd_timer evdev lrw gf128mul i915 glue_helper ablk_helper snd cryptd drm_kms_helper soundcore psmouse pcspkr drm lpc_ich mei_me mfd_core serio_raw mei i2c_i801 video button processor ext4 crc16 mbcache jbd2 sg sd_mod crc_t10dif crct10dif_common ahci libahci libata xhci_hcd scsi_mod ehci_pci ehci_hcd e1000e igb i2c_algo_bit i2c_core usbcore dca ptp usb_common pps_core fan thermal thermal_sys
[210799.630738] CPU: 2 PID: 22239 Comm: Timer Tainted: G D 3.12-0.bpo.1-amd64 #1 Debian 3.12.9-1~bpo70+1
[210799.630758] Hardware name: Supermicro X10SLQ/X10SLQ, BIOS 1.00 05/09/2013
[210799.630772] task: ffff88081a485800 ti: ffff88081ba24000 task.ti: ffff88081ba24000
[210799.630788] RIP: 0010:[<ffffffff810c1591>] [<ffffffff810c1591>] futex_requeue+0x721/0x7e0
[210799.630807] RSP: 0018:ffff88081ba25a70 EFLAGS: 00010297
[210799.630819] RAX: 0000000000000002 RBX: 0000000000000001 RCX: 00000000ffffffff
[210799.630833] RDX: 0000000000000001 RSI: 0000000000000001 RDI: 00007fdb378ff9d0
[210799.630848] RBP: 00007fdb378ff9d0 R08: 0000000000000000 R09: 0000000000000000
[210799.630863] R10: 0000000000000001 R11: 00000000ffffffff R12: 0000000000000001
[210799.630878] R13: 0000000000000000 R14: 0000000000000000 R15: 00007fdb378ff9d0
[210799.630894] FS: 0000000000000000(0000) GS:ffff88083fa80000(0000) knlGS:0000000000000000
[210799.630910] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[210799.630923] CR2: 00007fdb1f68b000 CR3: 00000007e6c0a000 CR4: 00000000001407e0
[210799.630938] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[210799.630952] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[210799.630967] Stack:
[210799.630972] ffffffff810c1e34 ffff88081c23a424 000000000000003d 0000000000005100
[210799.630993] 0000000000000002 613088081b7d6000 396363640000003d 000000000003376f
[210799.631013] ffff88081b8d4800 00000000000003e8 0000000000000035 ffffffff81a102c0
[210799.631034] Call Trace:
[210799.631041] [<ffffffff810c1e34>] ? do_futex+0x344/0xb00
[210799.631054] [<ffffffffa04377e5>] ? write_msg+0xd5/0x140 [netconsole]
[210799.631069] [<ffffffff810c273b>] ? SyS_futex+0x14b/0x1b0
[210799.631082] [<ffffffff810aa9c8>] ? console_unlock+0x258/0x3a0
[210799.631096] [<ffffffff810ee9fc>] ? __delayacct_add_tsk+0x16c/0x180
[210799.631110] [<ffffffff810c1a0b>] ? exit_robust_list+0x8b/0x170
[210799.631125] [<ffffffff8105d5cf>] ? mm_release+0xdf/0x120
[210799.631138] [<ffffffff81062679>] ? do_exit+0x159/0xa80
[210799.631151] [<ffffffff814bb14f>] ? printk+0x4f/0x54
[210799.631163] [<ffffffff814c4c48>] ? oops_end+0xa8/0xf0
[210799.631176] [<ffffffff81014f74>] ? do_invalid_op+0x84/0xa0
[210799.631189] [<ffffffff810c1591>] ? futex_requeue+0x721/0x7e0
[210799.632160] [<ffffffff814cce5e>] ? invalid_op+0x1e/0x30
[210799.633105] [<ffffffff810c1591>] ? futex_requeue+0x721/0x7e0
[210799.634027] [<ffffffff810c1e34>] ? do_futex+0x344/0xb00
[210799.634921] [<ffffffff811c3e1c>] ? fsnotify+0x1dc/0x2d0
[210799.635789] [<ffffffff810c273b>] ? SyS_futex+0x14b/0x1b0
[210799.636631] [<ffffffff811856dc>] ? vfs_write+0x17c/0x200
[210799.637447] [<ffffffff814cb7b9>] ? system_call_fastpath+0x16/0x1b
[210799.638254] Code: 83 44 24 20 01 e9 40 fc ff ff c7 44 24 08 f5 ff ff ff e9 2b fb ff ff c7 44 24 08 00 00 00 00 e9 1e fb ff ff 89 44 24 08 e9 70 fb <ff> ff be 96 04 00 00 48 c7 c7 00 af 6f 81 e8 3c ee f9 ff eb a2
[210799.640004] RIP [<ffffffff810c1591>] futex_requeue+0x721/0x7e0
[210799.640844] RSP <ffff88081ba25a70>
[210799.641677] ---[ end trace 5197659ccd2d2aa1 ]---
[210799.641678] Fixing recursive fault but reboot is needed!
[210799.641675] invalid opcode: 0000 [#3] SMP
[210799.644149] Modules linked in: dm_crypt dm_mod parport_pc ppdev lp parport bnep rfcomm bluetooth rfkill cpufreq_stats cpufreq_userspace cpufreq_conservative cpufreq_powersave nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc netconsole configfs loop raid1 md_mod snd_hda_codec_realtek snd_hda_codec_hdmi joydev hid_generic hid_kensington usbhid hid x86_pkg_temp_thermal coretemp kvm_intel kvm snd_hda_intel crct10dif_pclmul snd_hda_codec crc32_pclmul crc32c_intel snd_hwdep snd_pcm ghash_clmulni_intel snd_page_alloc snd_seq iTCO_wdt snd_seq_device aesni_intel iTCO_vendor_support aes_x86_64 snd_timer evdev lrw gf128mul i915 glue_helper ablk_helper snd cryptd drm_kms_helper soundcore psmouse pcspkr drm lpc_ich mei_me mfd_core serio_raw mei i2c_i801 video button processor ext4 crc16 mbcache jbd2 sg sd_mod crc_t10dif crct10dif_common ahci libahci libata xhci_hcd scsi_mod ehci_pci ehci_hcd e1000e igb i2c_algo_bit i2c_core usbcore dca ptp usb_common pps_core fan thermal thermal_sys
[210799.649776] CPU: 3 PID: 2555 Comm: rsyslogd Tainted: G D 3.12-0.bpo.1-amd64 #1 Debian 3.12.9-1~bpo70+1
[210799.650754] Hardware name: Supermicro X10SLQ/X10SLQ, BIOS 1.00 05/09/2013
[210799.651731] task: ffff88081a70d0c0 ti: ffff88081e926000 task.ti: ffff88081e926000
[210799.652710] RIP: 0010:[<ffffffff810c1591>] [<ffffffff810c1591>] futex_requeue+0x721/0x7e0
[210799.653699] RSP: 0018:ffff88081e927e00 EFLAGS: 00010297
[210799.654681] RAX: 0000000000000002 RBX: 0000000000000000 RCX: 00000000ffffffff
[210799.655667] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000001816630
[210799.656648] RBP: 0000000001816630 R08: 0000000001816d80 R09: 0000000000000000
[210799.657631] R10: 0000000000000001 R11: 0000000000000206 R12: 0000000000000001
[210799.658614] R13: 0000000001816d80 R14: 0000000000000000 R15: 0000000001816630
[210799.659599] FS: 00007f6122adb700(0000) GS:ffff88083fac0000(0000) knlGS:0000000000000000
[210799.660591] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[210799.661562] CR2: 00007fdb2c029000 CR3: 000000081e4ec000 CR4: 00000000001407e0
[210799.662516] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[210799.663466] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[210799.664410] Stack:
[210799.665349] ffffffff810c1e34 0000000000000246 0000000000000d37 0000000000000001
[210799.666286] ffffffff81a330c8 ffffffff81a324c8 0000000000000000 0000000000000001
[210799.667198] 00000004810ab868 0000000000000001 ffff88081e927fd8 ffffffff00000001
[210799.668091] Call Trace:
[210799.668951] [<ffffffff810c1e34>] ? do_futex+0x344/0xb00
[210799.669794] [<ffffffff810c273b>] ? SyS_futex+0x14b/0x1b0
[210799.670613] [<ffffffff81185868>] ? vfs_read+0x108/0x180
[210799.671420] [<ffffffff81185aaf>] ? SyS_read+0x6f/0xa0
[210799.672216] [<ffffffff814cb7b9>] ? system_call_fastpath+0x16/0x1b
[210799.673008] Code: 83 44 24 20 01 e9 40 fc ff ff c7 44 24 08 f5 ff ff ff e9 2b fb ff ff c7 44 24 08 00 00 00 00 e9 1e fb ff ff 89 44 24 08 e9 70 fb <ff> ff be 96 04 00 00 48 c7 c7 00 af 6f 81 e8 3c ee f9 ff eb a2
[210799.674727] RIP [<ffffffff810c1591>] futex_requeue+0x721/0x7e0
[210799.675556] RSP <ffff88081e927e00>
[210799.676389] ---[ end trace 5197659ccd2d2aa2 ]---
[210799.676383] invalid opcode: 0000 [#4] SMP
[210799.678069] Modules linked in: dm_crypt dm_mod parport_pc ppdev lp parport bnep rfcomm bluetooth rfkill cpufreq_stats cpufreq_userspace cpufreq_conservative cpufreq_powersave nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc netconsole configfs loop raid1 md_mod snd_hda_codec_realtek snd_hda_codec_hdmi joydev hid_generic hid_kensington usbhid hid x86_pkg_temp_thermal coretemp kvm_intel kvm snd_hda_intel crct10dif_pclmul snd_hda_codec crc32_pclmul crc32c_intel snd_hwdep snd_pcm ghash_clmulni_intel snd_page_alloc snd_seq iTCO_wdt snd_seq_device aesni_intel iTCO_vendor_support aes_x86_64 snd_timer evdev lrw gf128mul i915 glue_helper ablk_helper snd cryptd drm_kms_helper soundcore psmouse pcspkr drm lpc_ich mei_me mfd_core serio_raw mei i2c_i801 video button processor ext4 crc16 mbcache jbd2 sg sd_mod crc_t10dif crct10dif_common ahci libahci libata xhci_hcd scsi_mod ehci_pci ehci_hcd e1000e igb i2c_algo_bit i2c_core usbcore dca ptp usb_common pps_core fan thermal thermal_sys
[210799.683833] CPU: 1 PID: 2553 Comm: rs:main Q:Reg Tainted: G D 3.12-0.bpo.1-amd64 #1 Debian 3.12.9-1~bpo70+1
[210799.684832] Hardware name: Supermicro X10SLQ/X10SLQ, BIOS 1.00 05/09/2013
[210799.685830] task: ffff88081def1040 ti: ffff88081e4c8000 task.ti: ffff88081e4c8000
[210799.686835] RIP: 0010:[<ffffffff810c1591>] [<ffffffff810c1591>] futex_requeue+0x721/0x7e0
[210799.687844] RSP: 0018:ffff88081e4c9e00 EFLAGS: 00010297
[210799.688847] RAX: 0000000000000002 RBX: 0000000000000000 RCX: 00000000ffffffff
[210799.689856] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000001816630
[210799.690861] RBP: 0000000001816630 R08: 0000000000000000 R09: 0000000000000000
[210799.691865] R10: 0000000000000001 R11: 0000000000000206 R12: 0000000000000001
[210799.692867] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000001816630
[210799.693868] FS: 00007f6123add700(0000) GS:ffff88083fa40000(0000) knlGS:0000000000000000
[210799.694875] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[210799.695882] CR2: 00007fd5bae3b000 CR3: 000000081e4ec000 CR4: 00000000001407e0
[210799.696899] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[210799.697916] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[210799.698931] Stack:
[210799.699944] ffffffff810c1e34 ffff88081e4c9ee8 ffff88080126f000 ffff880804350000
[210799.700977] ffff88081e4c9fd8 ffff88081def1040 0000000100000000 ffff88081e7bd3a8
[210799.702012] ffff88081e7bd3f0 ffffffff810da755 ffffffff811c3e1c 0000000000000057
[210799.703050] Call Trace:
[210799.704081] [<ffffffff810c1e34>] ? do_futex+0x344/0xb00
[210799.705121] [<ffffffff810da755>] ? from_kgid_munged+0x5/0x10
[210799.706161] [<ffffffff811c3e1c>] ? fsnotify+0x1dc/0x2d0
[210799.707203] [<ffffffff810c273b>] ? SyS_futex+0x14b/0x1b0
[210799.708244] [<ffffffff811856dc>] ? vfs_write+0x17c/0x200
[210799.709260] [<ffffffff81185b4f>] ? SyS_write+0x6f/0xa0
[210799.710250] [<ffffffff814cb7b9>] ? system_call_fastpath+0x16/0x1b
[210799.711237] Code: 83 44 24 20 01 e9 40 fc ff ff c7 44 24 08 f5 ff ff ff e9 2b fb ff ff c7 44 24 08 00 00 00 00 e9 1e fb ff ff 89 44 24 08 e9 70 fb <ff> ff be 96 04 00 00 48 c7 c7 00 af 6f 81 e8 3c ee f9 ff eb a2
[210799.713354] RIP [<ffffffff810c1591>] futex_requeue+0x721/0x7e0
[210799.714376] RSP <ffff88081e4c9e00>
[210799.715401] ---[ end trace 5197659ccd2d2aa3 ]---
[210799.715393] invalid opcode: 0000 [#5] SMP
[210799.717470] Modules linked in: dm_crypt dm_mod parport_pc ppdev lp parport bnep rfcomm bluetooth rfkill cpufreq_stats cpufreq_userspace cpufreq_conservative cpufreq_powersave nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc netconsole configfs loop raid1 md_mod snd_hda_codec_realtek snd_hda_codec_hdmi joydev hid_generic hid_kensington usbhid hid x86_pkg_temp_thermal coretemp kvm_intel kvm snd_hda_intel crct10dif_pclmul snd_hda_codec crc32_pclmul crc32c_intel snd_hwdep snd_pcm ghash_clmulni_intel snd_page_alloc snd_seq iTCO_wdt snd_seq_device aesni_intel iTCO_vendor_support aes_x86_64 snd_timer evdev lrw gf128mul i915 glue_helper ablk_helper snd cryptd drm_kms_helper soundcore psmouse pcspkr drm lpc_ich mei_me mfd_core serio_raw mei i2c_i801 video button processor ext4 crc16 mbcache jbd2 sg sd_mod crc_t10dif crct10dif_common ahci libahci libata xhci_hcd scsi_mod ehci_pci ehci_hcd e1000e igb i2c_algo_bit i2c_core usbcore dca ptp usb_common pps_core fan thermal thermal_sys
[210799.717487] CPU: 3 PID: 2555 Comm: rsyslogd Tainted: G D 3.12-0.bpo.1-amd64 #1 Debian 3.12.9-1~bpo70+1
[210799.717487] Hardware name: Supermicro X10SLQ/X10SLQ, BIOS 1.00 05/09/2013
[210799.717487] task: ffff88081a70d0c0 ti: ffff88081e926000 task.ti: ffff88081e926000
[210799.717488] RIP: 0010:[<ffffffff810c1591>] [<ffffffff810c1591>] futex_requeue+0x721/0x7e0
[210799.717489] RSP: 0018:ffff88081e927a70 EFLAGS: 00010297
[210799.717490] RAX: 0000000000000002 RBX: 0000000000000001 RCX: 00000000ffffffff
[210799.717490] RDX: 0000000000000001 RSI: 0000000000000001 RDI: 00007f6122adb9d0
[210799.717490] RBP: 00007f6122adb9d0 R08: 0000000000000000 R09: 0000000000000000
[210799.717491] R10: 0000000000000001 R11: 00000000ffffffff R12: 0000000000000001
[210799.717491] R13: 0000000000000000 R14: 0000000000000000 R15: 00007f6122adb9d0
[210799.717491] FS: 0000000000000000(0000) GS:ffff88083fac0000(0000) knlGS:0000000000000000
[210799.717492] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[210799.717492] CR2: 00007fdb2c029000 CR3: 000000081e4ec000 CR4: 00000000001407e0
[210799.717493] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[210799.717493] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[210799.717493] Stack:
[210799.717493] ffffffff810c1e34 ffff8807f92e3400 0000000000000030 ffffffff81a102e8
[210799.717494] 00ff880800000000 61320000000003e8 3963636432643261 ffff353139373635
[210799.717495] 0000000000000087 0000000000000028 ffffffffa04377e5 ffff8807f92e3400
[210799.717496] Call Trace:
[210799.717497] [<ffffffff810c1e34>] ? do_futex+0x344/0xb00
[210799.717498] [<ffffffffa04377e5>] ? write_msg+0xd5/0x140 [netconsole]
[210799.717500] [<ffffffff8116eacd>] ? cache_alloc_refill+0x8d/0x2e0
[210799.717501] [<ffffffff8101c12f>] ? native_sched_clock+0xf/0x70
[210799.717503] [<ffffffff8101c195>] ? sched_clock+0x5/0x10
[210799.717504] [<ffffffff810c273b>] ? SyS_futex+0x14b/0x1b0
[210799.717505] [<ffffffff8116f97c>] ? kmem_cache_alloc+0x1bc/0x1f0
[210799.717506] [<ffffffff810ee9fc>] ? __delayacct_add_tsk+0x16c/0x180
[210799.717508] [<ffffffff810c1a0b>] ? exit_robust_list+0x8b/0x170
[210799.717509] [<ffffffff8105d5cf>] ? mm_release+0xdf/0x120
[210799.717510] [<ffffffff81062679>] ? do_exit+0x159/0xa80
[210799.717511] [<ffffffff814bb14f>] ? printk+0x4f/0x54
[210799.717512] [<ffffffff814c4c48>] ? oops_end+0xa8/0xf0
[210799.717514] [<ffffffff81014f74>] ? do_invalid_op+0x84/0xa0
[210799.717516] [<ffffffff810c1591>] ? futex_requeue+0x721/0x7e0
[210799.717517] [<ffffffff81095d9e>] ? select_task_rq_fair+0x69e/0x740
[210799.717519] [<ffffffff810980d4>] ? enqueue_task_fair+0xb44/0xb80
[210799.717520] [<ffffffff8101c195>] ? sched_clock+0x5/0x10
[210799.717522] [<ffffffff814cce5e>] ? invalid_op+0x1e/0x30
[210799.717523] [<ffffffff810c1591>] ? futex_requeue+0x721/0x7e0
[210799.717524] [<ffffffff810c1e34>] ? do_futex+0x344/0xb00
[210799.717525] [<ffffffff810c273b>] ? SyS_futex+0x14b/0x1b0
[210799.717526] [<ffffffff81185868>] ? vfs_read+0x108/0x180
[210799.717527] [<ffffffff81185aaf>] ? SyS_read+0x6f/0xa0
[210799.717528] [<ffffffff814cb7b9>] ? system_call_fastpath+0x16/0x1b
[210799.717529] Code: 83 44 24 20 01 e9 40 fc ff ff c7 44 24 08 f5 ff ff ff e9 2b fb ff ff c7 44 24 08 00 00 00 00 e9 1e fb ff ff 89 44 24 08 e9 70 fb <ff> ff be 96 04 00 00 48 c7 c7 00 af 6f 81 e8 3c ee f9 ff eb a2
[210799.717541] RIP [<ffffffff810c1591>] futex_requeue+0x721/0x7e0
[210799.717542] RSP <ffff88081e927a70>
[210799.717543] invalid opcode: 0000 [#6] SMP
[210799.717544] Modules linked in: dm_crypt<4>[210799.717545] ---[ end trace 5197659ccd2d2aa4 ]---
[210799.717545] dm_mod<1>[210799.717546] Fixing recursive fault but reboot is needed!
[210799.717546] parport_pc ppdev lp parport bnep rfcomm bluetooth rfkill cpufreq_stats cpufreq_userspace cpufreq_conservative cpufreq_powersave nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc netconsole configfs loop raid1 md_mod snd_hda_codec_realtek snd_hda_codec_hdmi joydev hid_generic hid_kensington usbhid hid x86_pkg_temp_thermal coretemp kvm_intel kvm snd_hda_intel crct10dif_pclmul snd_hda_codec crc32_pclmul crc32c_intel snd_hwdep snd_pcm ghash_clmulni_intel snd_page_alloc snd_seq iTCO_wdt snd_seq_device aesni_intel iTCO_vendor_support aes_x86_64 snd_timer evdev lrw gf128mul i915 glue_helper ablk_helper snd cryptd drm_kms_helper soundcore psmouse pcspkr drm lpc_ich mei_me mfd_core serio_raw mei i2c_i801 video button processor ext4 crc16 mbcache jbd2 sg sd_mod crc_t10dif crct10dif_common ahci libahci libata xhci_hcd scsi_mod ehci_pci ehci_hcd e1000e igb i2c_algo_bit i2c_core usbcore dca ptp usb_common pps_core fan thermal thermal_sys
[210799.717588] CPU: 1 PID: 2553 Comm: rs:main Q:Reg Tainted: G D 3.12-0.bpo.1-amd64 #1 Debian 3.12.9-1~bpo70+1
[210799.717588] Hardware name: Supermicro X10SLQ/X10SLQ, BIOS 1.00 05/09/2013
[210799.717589] task: ffff88081def1040 ti: ffff88081e4c8000 task.ti: ffff88081e4c8000
[210799.717589] RIP: 0010:[<ffffffff810c1591>] [<ffffffff810c1591>] futex_requeue+0x721/0x7e0
[210799.717592] RSP: 0018:ffff88081e4c9a70 EFLAGS: 00010297
[210799.717592] RAX: 0000000000000002 RBX: 0000000000000001 RCX: 00000000ffffffff
[210799.717593] RDX: 0000000000000001 RSI: 0000000000000001 RDI: 00007f6123add9d0
[210799.717593] RBP: 00007f6123add9d0 R08: 0000000000000000 R09: 0000000000000000
[210799.717594] R10: 0000000000000001 R11: 00000000ffffffff R12: 0000000000000001
[210799.717594] R13: 0000000000000000 R14: 0000000000000000 R15: 00007f6123add9d0
[210799.717595] FS: 0000000000000000(0000) GS:ffff88083fa40000(0000) knlGS:0000000000000000
[210799.717596] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[210799.717597] CR2: 00007fd5bae3b000 CR3: 000000081e4ec000 CR4: 00000000001407e0
[210799.717597] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[210799.717598] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[210799.717598] Stack:
[210799.717598] ffffffff810c1e34 ffff8807f92e3400 0000000000000030 ffffffff81a102e8
[210799.717600] 00ff880800000000 61330000000003e8 3963636432643261 ffff353139373635
[210799.717602] 0000000000000087 0000000000000028 ffffffffa04377e5 ffff8807f92e3400
[210799.717604] Call Trace:
[210799.717604] [<ffffffff810c1e34>] ? do_futex+0x344/0xb00
[210799.717606] [<ffffffffa04377e5>] ? write_msg+0xd5/0x140 [netconsole]
[210799.717608] [<ffffffff8101c12f>] ? native_sched_clock+0xf/0x70
[210799.717610] [<ffffffff8101c195>] ? sched_clock+0x5/0x10
[210799.717612] [<ffffffff810c273b>] ? SyS_futex+0x14b/0x1b0
[210799.717613] [<ffffffff8108756c>] ? down_trylock+0x2c/0x40
[210799.717615] [<ffffffff810ee9fc>] ? __delayacct_add_tsk+0x16c/0x180
[210799.717617] [<ffffffff810c1a0b>] ? exit_robust_list+0x8b/0x170
[210799.717619] [<ffffffff8105d5cf>] ? mm_release+0xdf/0x120
[210799.717621] [<ffffffff81062679>] ? do_exit+0x159/0xa80
[210799.717622] [<ffffffff814bb14f>] ? printk+0x4f/0x54
[210799.717624] [<ffffffff814c4c48>] ? oops_end+0xa8/0xf0
[210799.717625] [<ffffffff81014f74>] ? do_invalid_op+0x84/0xa0
[210799.717627] [<ffffffff810c1591>] ? futex_requeue+0x721/0x7e0
[210799.717629] [<ffffffff814cce5e>] ? invalid_op+0x1e/0x30
[210799.717630] [<ffffffff810c1591>] ? futex_requeue+0x721/0x7e0
[210799.717632] [<ffffffff810c1e34>] ? do_futex+0x344/0xb00
[210799.717633] [<ffffffff810da755>] ? from_kgid_munged+0x5/0x10
[210799.717635] [<ffffffff811c3e1c>] ? fsnotify+0x1dc/0x2d0
[210799.717637] [<ffffffff810c273b>] ? SyS_futex+0x14b/0x1b0
[210799.717638] [<ffffffff811856dc>] ? vfs_write+0x17c/0x200
[210799.717640] [<ffffffff81185b4f>] ? SyS_write+0x6f/0xa0
[210799.717641] [<ffffffff814cb7b9>] ? system_call_fastpath+0x16/0x1b
[210799.717643] Code: 83 44 24 20 01 e9 40 fc ff ff c7 44 24 08 f5 ff ff ff e9 2b fb ff ff c7 44 24 08 00 00 00 00 e9 1e fb ff ff 89 44 24 08 e9 70 fb <ff> ff be 96 04 00 00 48 c7 c7 00 af 6f 81 e8 3c ee f9 ff eb a2
[210799.717662] RIP [<ffffffff810c1591>] futex_requeue+0x721/0x7e0
[210799.717663] RSP <ffff88081e4c9a70>
[210799.717664] ---[ end trace 5197659ccd2d2aa5 ]---
[210799.717664] Fixing recursive fault but reboot is needed!
[212136.276450] workrave[22140]: segfault at 21 ip 0000000000000021 sp 00007fff17e75df8 error 14 in workrave[400000+15e000]
[212219.839684] workrave[24488]: segfault at 656d69746c7d ip 00007fadceb6e35d sp 00007fff5ac3c870 error 4 in libglib-2.0.so.0.3200.4[7fadceb01000+f5000]
[227769.748991] traps: workrave[25273] general protection ip:4f3a49 sp:7fffb9a6e8a0 error:0 in workrave[400000+15e000]
* Re: Consistent kernel oops with 3.11.10 & 3.12.9 on Haswell CPUs...
2014-03-23 14:26 ` dafreedm
2014-03-27 10:35 ` dafreedm
@ 2014-03-27 15:26 ` Jan Kara
1 sibling, 0 replies; 5+ messages in thread
From: Jan Kara @ 2014-03-27 15:26 UTC (permalink / raw)
To: dafreedm
Cc: Jan Kara, Thomas Gleixner, Guennadi Liakhovetski, LKML,
Ingo Molnar, H. Peter Anvin, Theodore Ts'o, linux-ext4,
Jens Axboe
Sorry for the late reply. I'm at a conference this week...
On Sun 23-03-14 10:26:09, dafreedm@gmail.com wrote:
> On Sun, Mar 23, 2014, Jan Kara wrote:
> > On Sat 22-03-14 14:18:39, Thomas Gleixner wrote:
> > > On Sat, 22 Mar 2014, dafreedm@gmail.com wrote:
> > >
> > > > Ahh. Good call. I wasn't sophisticated enough with these things to
> > > > ascertain the difference. I knew to avoid reporting oops/panics with
> > > > kernels tainted with out-of-tree (non-free) modules, but I guess I
> > > > grabbed the wrong lines from the dmesg (namely, subsequent oops after
> > > > the initial one). Here's a more recent kernel oops (from this
> > > > morning) --- it's the first oops after a fresh reboot:
> > >
> > > Cc'ing ext4 and block folks.
> > Hum, so decodecode shows:
> > ...
> > 26: 48 85 c0 test %rax,%rax
> > 29: 74 10 je 0x3b
> > 2b:* 0f b7 80 ac 05 00 00 movzwl 0x5ac(%rax),%eax <-- trapping instruction
> > 32: 66 85 c0 test %ax,%ax
> > ...
> >
> > And the register has:
> > RAX: f7ff880037267140 RBX: 0000000000001000 RCX: 0000000000000000
> >
> > So that looks like a bit flip in the upper byte.
>
> Just for my own knowledge / growth --- how can you tell there's a
> "bit flip" in the upper byte?
Kernel addresses start at ffff880000000000. Here RAX should hold a struct
block_device pointer, which is a kernel pointer. But the upper byte is 0xf7
instead of 0xff, so very likely a single bit (0x0800000000000000) got
flipped from 1 to 0.
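The single-bit claim can be checked by XOR-ing the RAX value from the oops with the pointer it was presumably meant to be. This is only a sketch: it assumes the intended value is the same address with the 0xffff prefix restored, and the names OBSERVED, EXPECTED and flipped_bits are made up for illustration.

```python
# RAX as printed in the oops.
OBSERVED = 0xf7ff880037267140
# Assumption: the intended pointer is the same value with 0xffff on top.
EXPECTED = 0xffff880037267140


def flipped_bits(observed, expected):
    """Return the positions (0 = least significant) of the bits that differ."""
    diff = observed ^ expected
    return [i for i in range(64) if (diff >> i) & 1]


diff_mask = OBSERVED ^ EXPECTED
print(f"mask = {diff_mask:#018x}")
print(f"bits = {flipped_bits(OBSERVED, EXPECTED)}")
```

If exactly one position comes back, the two values differ by a single bit, which is what one would expect from a flipped bit rather than from a pointer that was overwritten with unrelated data.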
> > So I'd check the hardware first...
>
> Yes, I absolutely did check the HW first --- and repeatedly (over a
> couple of weeks) --- before reaching out to LKML.
>
> As described in my original email below, here's what I've done so far:
>
> I've been very extensively testing all of the likely culprits among
> hardware components on both of my servers --- running memtest86 upon
> boot for 3+ days, memtester in userspace for 24 hours, repeated
> kernel compiles with various '-j' values, and the 'stress' and
> 'stressapptest' load generators (see below for full details) --- and
> I have never seen even a hiccup in server operation under such
> "artificial" environments --- however, it consistently occurs with
> heavy md5sum operation, and randomly at other times.
Heh, that's strange. So that makes the faulty hw theory less likely -
especially the fact that you see it on two different machines as you
mention below. OTOH the next oops you've posted is at a completely
different place. So that could point to some generic problem where we
corrupt memory.
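To see at a glance whether a set of oopses all fault in the same place, one could tally the function named on each "RIP: 0010:" line. The following is a rough sketch, not an existing tool: the regular expression only targets the oops format printed by the 3.12 kernel in this thread, the function name faulting_symbols is hypothetical, and the sample lines are copied from the oopses posted here.

```python
import re
from collections import Counter

# Matches e.g. "RIP: 0010:[<ffffffff811b6b22>] [<ffffffff811b6b22>] __getblk+0x52/0x2c0"
# and captures the symbol name in front of "+0x...".
RIP_LINE = re.compile(
    r"RIP: [0-9a-f]{4}:\[<[0-9a-f]+>\]\s+\[<[0-9a-f]+>\]\s+"
    r"([^\s+]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def faulting_symbols(log_text):
    """Count how often each function appears as the faulting RIP."""
    return Counter(RIP_LINE.findall(log_text))


# Sample lines taken from this thread; quote prefixes are tolerated.
SAMPLE = """\
> > > > [33488.180102] RIP: 0010:[<ffffffff811b6b22>] [<ffffffff811b6b22>] __getblk+0x52/0x2c0
[210799.624927] RIP: 0010:[<ffffffff810c1591>] [<ffffffff810c1591>] futex_requeue+0x721/0x7e0
[210799.630788] RIP: 0010:[<ffffffff810c1591>] [<ffffffff810c1591>] futex_requeue+0x721/0x7e0
"""

counts = faulting_symbols(SAMPLE)
for symbol, n in counts.most_common():
    print(f"{symbol}: {n}")
```

Oopses after the first one in a log are often follow-on damage, so it may be more telling to feed in only the "[#1]" oops from each boot.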
> More specifically, here are the exact steps I took to try to implicate
> the HW:
>
> aptitude install memtest86+ # reboot and run for 3+ days
>
> aptitude install memtester
> memtester 30G
>
> aptitude install linux-source
> cp /usr/src/linux-source-3.2.tar.bz2 /root/
> tar xvfj linux-source-3.2.tar.bz2
> cd linux-source-3.2/
> make defconfig
> time make 1>LOG 2>ERR
> make mrproper
> make defconfig
> time make -j16 1>LOG 2>ERR
>
> aptitude install stress
> stress --cpu 8 --io 4 --vm 2 --timeout 10s --dry-run
> stress --cpu 8 --io 4 --vm 2 --hdd 3 --timeout 60s
> stress --cpu 8 --io 8 --vm 8 --hdd 4 --timeout 5m
>
> aptitude install stressapptest
> stressapptest -m 8 -i 4 -C 4 -W -s 30
> stressapptest -m 8 -i 4 -C 4 -W -f /root/sat-file-test --filesize 1gb -s 30
> stressapptest -m 8 -i 4 -C 4 -W -f /root/sat-file-test --filesize 1024 --random-threads 4 -s 30
> stressapptest -m 8 -i 4 -C 4 -W --cc_test -s 30
> stressapptest -m 8 -i 4 -C 4 -W --local_numa -s 30
> stressapptest -m 8 -i 4 -C 4 -W -n 127.0.0.1 --listen -s 30
> stressapptest -m 12 -i 6 -C 8 -W -f /root/sat-file-test --filesize 1024 --random-threads 4 -n 127.0.0.1 --listen -s 300
>
>
> As mentioned earlier --- I just could not make it oops doing the
> above! (or get any errors in the standalone memtest86+ procedure).
>
> What do you think? Should I just keep on stress-testing it somewhat
> indefinitely? Also, please recall that I have two of the identical
> machines, and I suffer the same problems with both of them (and they
> both pass the above artificial stress-testing).
>
> > > > [33488.170415] general protection fault: 0000 [#1] SMP
> > > > [33488.171351] Modules linked in: dm_crypt dm_mod parport_pc ppdev lp parport bnep rfcomm bluetooth rfkill cpufreq_stats cpufreq_userspace cpufreq_conservative cpufreq_powersave nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc netconsole configfs loop raid1 md_mod snd_hda_codec_realtek snd_hda_codec_hdmi hid_kensington x86_pkg_temp_thermal coretemp joydev kvm_intel hid_generic kvm crct10dif_pclmul snd_hda_intel crc32_pclmul crc32c_intel snd_hda_codec usbhid snd_hwdep hid ghash_clmulni_intel snd_pcm iTCO_wdt iTCO_vendor_support snd_page_alloc aesni_intel i915 snd_seq aes_x86_64 snd_seq_device snd_timer lrw evdev gf128mul glue_helper drm_kms_helper ablk_helper snd psmouse cryptd drm i2c_i801 pcspkr soundcore serio_raw lpc_ich mei_me mei mfd_core video button processor ext4 crc16 mbcache jbd2 sg sd_mod crc_t10dif crct10dif_common ahci libahci xhci_hcd ehci_pci ehci_hcd e1000e libata igb scsi_mod i2c_algo_bit i2c_core dca ptp pps_core fan usbcore usb_common thermal thermal_sys
> > > > [33488.177102] CPU: 0 PID: 340 Comm: jbd2/sdd2-8 Not tainted 3.12-0.bpo.1-amd64 #1 Debian 3.12.9-1~bpo70+1
> > > > [33488.178100] Hardware name: Supermicro X10SLQ/X10SLQ, BIOS 1.00 05/09/2013
> > > > [33488.179095] task: ffff88081b179080 ti: ffff88081b3ee000 task.ti: ffff88081b3ee000
> > > > [33488.180102] RIP: 0010:[<ffffffff811b6b22>] [<ffffffff811b6b22>] __getblk+0x52/0x2c0
> > > > [33488.181117] RSP: 0018:ffff88081b3efb78 EFLAGS: 00010282
> > > > [33488.182127] RAX: f7ff880037267140 RBX: 0000000000001000 RCX: 0000000000000000
> > > > [33488.183142] RDX: 00000000000001ff RSI: 000000000148b2ac RDI: 0000000000000000
> > > > [33488.184152] RBP: ffff88081b2a4800 R08: 0000000000000000 R09: ffff8807eccca628
> > > > [33488.185168] R10: 00000000000032ad R11: 0000000000000001 R12: 0000000000000000
> > > > [33488.186182] R13: ffff88081bcf67c0 R14: 00000000532d22fd R15: 000000000148b2ac
> > > > [33488.187194] FS: 0000000000000000(0000) GS:ffff88083fa00000(0000) knlGS:0000000000000000
> > > > [33488.188212] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > [33488.189228] CR2: ffffffffff600400 CR3: 000000000180c000 CR4: 00000000001407f0
> > > > [33488.190245] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > > [33488.191268] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > > [33488.192279] Stack:
> > > > [33488.193273] 0000000000000000 ffffffffa01ec1ba ffff88081b179080 ffff88081b2a4800
> > > > [33488.194276] 00000000000032ac 0000000000000000 ffff88081b2a4800 ffff88081b2a4800
> > > > [33488.195265] ffff8807f36b4688 00000000532d22fd 00000000ffffffff ffffffffa01737d0
> > > > [33488.196246] Call Trace:
> > > > [33488.197222] [<ffffffffa01ec1ba>] ? ext4_bmap+0x5a/0x110 [ext4]
> > > > [33488.198210] [<ffffffffa01737d0>] ? jbd2_journal_get_descriptor_buffer+0x30/0x80 [jbd2]
> > > > [33488.199203] [<ffffffff811b5100>] ? __wait_on_buffer+0x30/0x30
> > > > [33488.200196] [<ffffffffa016aedd>] ? journal_submit_commit_record.isra.12+0x9d/0x200 [jbd2]
> > > > [33488.201193] [<ffffffff811b5100>] ? __wait_on_buffer+0x30/0x30
> > > > [33488.202198] [<ffffffffa016c1f4>] ? jbd2_journal_commit_transaction+0x11b4/0x18c0 [jbd2]
> > > > [33488.203211] [<ffffffffa01706cc>] ? kjournald2+0xac/0x240 [jbd2]
> > > > [33488.204219] [<ffffffff81082d20>] ? add_wait_queue+0x60/0x60
> > > > [33488.205226] [<ffffffffa0170620>] ? commit_timeout+0x10/0x10 [jbd2]
> > > > [33488.206230] [<ffffffff81082333>] ? kthread+0xb3/0xc0
> > > > [33488.207230] [<ffffffff81082280>] ? flush_kthread_worker+0xa0/0xa0
> > > > [33488.208231] [<ffffffff814cb70c>] ? ret_from_fork+0x7c/0xb0
> > > > [33488.209224] [<ffffffff81082280>] ? flush_kthread_worker+0xa0/0xa0
> > > > [33488.210205] Code: 12 48 83 c4 28 4c 89 e0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 49 8b 85 98 00 00 00 ba ff 01 00 00 48 8b 80 58 03 00 00 48 85 c0 74 10 <0f> b7 80 ac 05 00 00 66 85 c0 0f 85 cd 01 00 00 85 da 0f 85 d2
> > > > [33488.212277] RIP [<ffffffff811b6b22>] __getblk+0x52/0x2c0
> > > > [33488.213276] RSP <ffff88081b3efb78>
> > > > [33488.370823] ---[ end trace cf90c18d45ff9570 ]---
> > > >
> > > >
> > > > Thoughts?
> > > >
> > > > Ingo, Peter, Thomas, any further ideas, please?
> > > >
> > > >
> > > > > > Though at times the oops occur even when the system is largely idle,
> > > > > > they seem to be exacerbated by md5sum'ing all files on a large
> > > > > > partition as part of archive verification --- say 1 million files
> > > > > > corresponding to 1 TByte of storage. If I perform this repeatedly,
> > > > > > the machines seem to lock up about once a week. Strangely, other
> > > > > > typical high-load/high-stress scenarios don't seem to provoke the oops
> > > > > > nearly so much (see below).
> > > > > >
> > > > > > Naturally, such md5sum usage is putting heavy load on the processor,
> > > > > > memory, and even power supply, and my initial inclination is generally
> > > > > > that I must have some faulty components. Even after otherwise
> > > > > > ambiguous diagnostics (described below), I'm highly skeptical that
> > > > > > there's anything here inherent to the md5sum codebase, in particular.
> > > > > > However, I have started to wonder whether this might be a kernel
> > > > > > regression...
> > > > > >
> > > > > > For reference, here's my setup:
> > > > > >
> > > > > > Mainboard: Supermicro X10SLQ
> > > > > > Processor: (Single-Socket) Intel Haswell i7-4770S (65W max TDP)
> > > > > > Memory: 32GB Kingston DDR3 RAM (4x KVR16N11/8)
> > > > > > PSU: SeaSonic SS-400FL2 400W PSU
> > > > > > O/S: Debian v7.4 Wheezy (amd64)
> > > > > > Filesystem: Ext4 (with default settings upon creation) over LUKS
> > > > > > Kernel: Using both:
> > > > > > Linux 3.11.10 ('3.11-0.bpo.2-amd64' via wheezy-backports)
> > > > > > Linux 3.12.9 ('3.12-0.bpo.2-amd64' via wheezy-backports)
> > > > > >
> > > > > > To summarize where I am now: I've been very extensively testing all of
> > > > > > the likely culprits among hardware components on both of my servers
> > > > > > --- running memtest86 upon boot for 3+ days, memtester in userspace
> > > > > > for 24 hours, repeated kernel compiles with various '-j' values, and
> > > > > > the 'stress' and 'stressapptest' load generators (see [2] for full
> > > > > > details) --- and I have never seen even a hiccup in server operation
> > > > > > > under such "artificial" environments --- however, the oops consistently
> > > > > > > occurs with heavy md5sum operation, and randomly at other times.
> > > > > >
> > > > > > At least from my past experiences (with scientific HPC clusters), such
> > > > > > diagnostic results would normally seem to largely rule out most
> > > > > > > problems with the processor, memory, and mainboard subsystems. The PSU is
> > > > > > often a little harder to rule out, but the 400W Seasonic PSUs are
> > > > > > rated at 2--3 times the wattage I should really need, even under peak
> > > > > > load (given each server's single-socket CPU is 65W at max TDP, there
> > > > > > are only a few HDs and one SSD, and no discrete graphics at all, of
> > > > > > course).
> > > > > >
> > > > > > I'm further surprised to see the exact same kernel-crash behavior on
> > > > > > two separate, but identical, servers, which leads me to wonder if
> > > > > > there's possibly some regression between the hardware (given that it's
> > > > > > relatively new Haswell microcode / silicon) and the (kernel?)
> > > > > > software.
> > > > > >
> > > > > > Any thoughts on what might be occurring here? Or what I should focus
> > > > > > on? Thanks in advance.
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
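
Editorial aside on reading the "Code:" line of the oops quoted above. The byte wrapped in angle brackets marks where RIP pointed when the fault was raised. The sketch below only splits the line at that marker, so the bytes can be handed to a disassembler; it does not disassemble anything itself. The helper name split_code_line is hypothetical and is not part of any kernel tooling. The kernel tree ships scripts/decodecode for the real job.

```python
import re

# "Code:" line copied from the oops quoted above (timestamp prefix dropped).
OOPS_CODE = (
    "Code: 12 48 83 c4 28 4c 89 e0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 "
    "49 8b 85 98 00 00 00 ba ff 01 00 00 48 8b 80 58 03 00 00 48 85 c0 "
    "74 10 <0f> b7 80 ac 05 00 00 66 85 c0 0f 85 cd 01 00 00 85 da 0f 85 d2"
)

# Matches the single byte that the kernel wraps in <..> at the faulting RIP.
_MARKED = re.compile(r"<([0-9a-fA-F]{2})>")


def split_code_line(line):
    """Split an oops 'Code:' line at the <..> marker.

    Hypothetical helper for illustration only. Returns a tuple
    (bytes_before_rip, bytes_from_rip).
    """
    hex_part = line.split("Code:", 1)[1]
    before, after = [], []
    seen_marker = False
    for token in hex_part.split():
        marked = _MARKED.fullmatch(token)
        if marked:
            seen_marker = True
            after.append(int(marked.group(1), 16))
        elif seen_marker:
            after.append(int(token, 16))
        else:
            before.append(int(token, 16))
    if not seen_marker:
        raise ValueError("no <..> marker found in Code: line")
    return bytes(before), bytes(after)


if __name__ == "__main__":
    before, after = split_code_line(OOPS_CODE)
    print("bytes before RIP:", len(before))
    print("bytes from RIP:  ", after.hex())
```

The bytes from the marker onward start with 0f b7 80 ac 05 00 00, which reads as a 16-bit zero-extending load from offset 0x5ac off %rax. That would be consistent with a general protection fault on a bad pointer in %rax, but this reading should be confirmed with a real disassembler before anyone draws conclusions from it.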
end of thread, other threads:[~2014-03-27 15:26 UTC | newest]
Thread overview: 5+ messages
[not found] <20140318184909.GA26255@ofan>
[not found] ` <Pine.LNX.4.64.1403192239560.5202@axis700.grange>
[not found] ` <20140322123706.GE12444@ofan>
2014-03-22 13:18 ` Consistent kernel oops with 3.11.10 & 3.12.9 on Haswell CPUs Thomas Gleixner
2014-03-23 13:25 ` Jan Kara
2014-03-23 14:26 ` dafreedm
2014-03-27 10:35 ` dafreedm
2014-03-27 15:26 ` Jan Kara