First-time poster to LKML, though I've been a Linux user for the past
15+ years.  Thanks to you all for your collective efforts at creating
such a great (useful, stable, etc.) kernel...

Problem at hand: I'm getting consistent kernel oopses (and, at times,
hard crashes) on two identical servers.  They are much more common on
one of the servers than the other, but I see them on both.  Please
see the kernel log messages attached to this email [1].

Though the oopses sometimes occur even when the system is largely
idle, they seem to be exacerbated by md5sum'ing all files on a large
partition as part of archive verification: say, 1 million files
totaling 1 TByte of storage.  If I perform this repeatedly, the
machines lock up about once a week.  Strangely, other typical
high-load/high-stress scenarios don't seem to provoke the oopses
nearly as much (see below).
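
To make the workload concrete, here is a minimal Python sketch of this
kind of verification pass.  It is not the actual procedure in use (the
runs described above use md5sum itself); the function names
`md5_of_file` and `checksum_tree` and the 1 MiB read size are
assumptions made for illustration only.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of one file, read in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read until an empty bytes object signals end of file.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_tree(root):
    """Walk `root` and map every regular file's path to its MD5 digest."""
    results = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and special files; hash regular files only.
            if os.path.isfile(path) and not os.path.islink(path):
                results[path] = md5_of_file(path)
    return results
```

Calling `checksum_tree()` on the mount point of the archive partition
in a loop should approximate the sustained read-and-hash pattern,
though it will not exercise exactly the same code paths as md5sum.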

Naturally, such md5sum usage puts heavy load on the processor, memory,
and even the power supply, so my initial inclination was that I must
have some faulty components.  Even though the diagnostics described
below were inconclusive, I'm highly skeptical that anything here is
inherent to the md5sum codebase in particular.  However, I have
started to wonder whether this might be a kernel regression...

For reference, here's my setup:

  Mainboard:  Supermicro X10SLQ
  Processor:  (Single-Socket) Intel Haswell i7-4770S (65W max TDP)
  Memory:     32GB Kingston DDR3 RAM (4x KVR16N11/8)
  PSU:        SeaSonic SS-400FL2 400W PSU
  O/S:        Debian v7.4 Wheezy (amd64)
  Filesystem: Ext4 (with default settings upon creation) over LUKS
  Kernel:     Using both:
                Linux 3.11.10 ('3.11-0.bpo.2-amd64' via wheezy-backports)
                Linux 3.12.9 ('3.12-0.bpo.2-amd64' via wheezy-backports)

To summarize where I am now, I've extensively tested all of the
likely hardware culprits on both of my servers: memtest86 at boot for
3+ days, memtester in userspace for 24 hours, repeated kernel
compiles with various '-j' values, and the 'stress' and
'stressapptest' load generators (see [2] for full details).  I have
never seen even a hiccup in server operation under such "artificial"
loads.  The oopses, however, occur consistently under heavy md5sum
operation, and randomly at other times.

At least from my past experience (with scientific HPC clusters), such
diagnostic results would normally rule out most problems with the
processor, memory, and mainboard subsystems.  The PSU is
often a little harder to rule out, but the 400W Seasonic PSUs are
rated at 2--3 times the wattage I should really need, even under peak
load (given each server's single-socket CPU is 65W at max TDP, there
are only a few HDs and one SSD, and no discrete graphics at all, of
course).
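
As a rough sanity check on that headroom estimate, the arithmetic can
be sketched as below.  Only the 400W PSU rating and the 65W CPU TDP
come from the setup listed above; the drive count and every other
per-component wattage are assumed round numbers, not measurements, and
TDP is not a hard ceiling on instantaneous CPU draw.

```python
def psu_headroom(psu_watts, component_watts):
    """Return (rating / estimated draw, estimated draw in watts)."""
    total = sum(component_watts.values())
    return psu_watts / total, total


# Only cpu_tdp is taken from the post; the rest are assumed figures.
ASSUMED_LOADS_W = {
    "cpu_tdp": 65,           # i7-4770S max TDP, from the setup list
    "hard_drives": 3 * 10,   # assumption: 3 drives at ~10 W each
    "ssd": 5,                # assumption
    "board_ram_fans": 40,    # assumption
}

ratio, total = psu_headroom(400, ASSUMED_LOADS_W)
print(f"estimated peak draw: {total} W, headroom: {ratio:.1f}x")
```

With these assumed figures the ratio comes out a little under 3x,
which is consistent with the 2--3 times estimate; a wall-socket power
meter would give a real number.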

I'm further surprised to see the exact same kernel-crash behavior on
two separate, but identical, servers, which leads me to wonder if
there's possibly some bad interaction between the hardware (given
that it's relatively new Haswell microcode / silicon) and the
(kernel?) software.

Any thoughts on what might be occurring here?  Or what I should focus
on?  Thanks in advance.


[1] Attached 'KernelLogs' file.
[2] Attached 'SystemStressTesting' file.