* Bug Report: BUG: Bad rss-counter state mm:ffff88101705f800 idx:1 val:512 / application segfaults / thp
From: Martin Tippmann @ 2015-12-14 20:19 UTC
To: linux-mm
Hi,
I'm seeing random application crashes (SIGSEGV), and after a few minutes
this appears in the logfiles:
[133933.729199] /build/linux-lts-wily-4x6IId/linux-lts-wily-4.2.0/mm/pgtable-generic.c:33: bad pmd ffff880fd06d6200(000000018da009e2)
[133933.763015] BUG: Bad rss-counter state mm:ffff88101705f800 idx:1 val:512
[133933.763039] BUG: non-zero nr_ptes on freeing mm: 1
I'm quite certain that it's not a hardware error. The problem appears
regularly on random machines of a 100+ machine cluster of Dell
PowerEdge R720 servers with 2x Xeon E5 (NUMA) and 64GB ECC memory.
The workload is mostly Hadoop YARN with MapReduce and Spark. The JVM
(mostly from the DataNodes) crashes randomly under load with SIGSEGV.
The problem appears with kernels 4.3.0 and 4.2.7 from the Ubuntu Kernel
Mainline PPA[1] and with the current 4.2 Ubuntu Wily kernel. All of
these kernels already include a related patch[2].
However I'm still seeing the problem. The bug disappears when I
disable transparent hugepages and reboot the machines!
Before disabling transparent hugepages completely I ran this config:
echo always > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
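For reference, each of these sysfs files lists all permitted values and
marks the active one in square brackets (for example "always madvise
[never]"). A minimal sketch for reading the active values back, assuming
a POSIX shell; the helper name thp_active_mode is made up for this
example:

```shell
#!/bin/sh
# Print the active value of a THP sysfs knob. The kernel wraps the
# active value in square brackets, e.g. "always madvise [never]".
# thp_active_mode is a hypothetical helper name, not an existing tool.
thp_active_mode() {
    # $1: path to a sysfs file such as .../transparent_hugepage/enabled
    if [ -r "$1" ]; then
        sed -n 's/.*\[\([^]]*\)\].*/\1/p' "$1"
    else
        echo "unavailable"
    fi
}

THP_DIR=/sys/kernel/mm/transparent_hugepage
for knob in enabled defrag; do
    printf '%s: %s\n' "$knob" "$(thp_active_mode "$THP_DIR/$knob")"
done
```

Writing "never" to the enabled file as root should switch THP off at
runtime; booting with transparent_hugepage=never on the kernel command
line is the persistent variant.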
Unfortunately I can't provide any more data at the moment. Maybe I'm
able to compile a kernel with debug options turned on over the
holidays - if you have any hints where I can help to pin this down
please tell me. On IRC, CONFIG_DEBUG_VM was recommended.
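Whether a given kernel was built with CONFIG_DEBUG_VM can be checked
before rebuilding. A rough sketch, assuming the distribution installs
the build config as /boot/config-<release> (Ubuntu kernels normally do);
the helper name kconfig_enabled is invented for this example:

```shell
#!/bin/sh
# Succeed if option $1 is built in (=y) in kernel config file $2.
# kconfig_enabled is a hypothetical helper name.
kconfig_enabled() {
    grep -q "^$1=y\$" "$2" 2>/dev/null
}

# Assumption: the config file sits next to the kernel image in /boot.
CONFIG_FILE="/boot/config-$(uname -r)"
if kconfig_enabled CONFIG_DEBUG_VM "$CONFIG_FILE"; then
    echo "CONFIG_DEBUG_VM is enabled"
else
    echo "CONFIG_DEBUG_VM is not enabled (or $CONFIG_FILE is missing)"
fi
```

If the option is missing, it has to be switched on in the kernel .config
and the kernel rebuilt; it cannot be toggled at runtime.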
regards and thanks
Martin
1: http://kernel.ubuntu.com/~kernel-ppa/mainline/?C=M;O=D
2: https://kernel.googlesource.com/pub/scm/linux/kernel/git/stable/linux-stable.git/+/47aee4d8e314384807e98b67ade07f6da476aa75
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* Re: Bug Report: BUG: Bad rss-counter state mm:ffff88101705f800 idx:1 val:512 / application segfaults / thp
From: Kirill A. Shutemov @ 2015-12-15 0:29 UTC
To: Martin Tippmann; +Cc: linux-mm
On Mon, Dec 14, 2015 at 09:19:09PM +0100, Martin Tippmann wrote:
> Hi,
>
> I'm seeing random application crashes (SIGSEGV), and after a few minutes
> this appears in the logfiles:
>
> [133933.729199] /build/linux-lts-wily-4x6IId/linux-lts-wily-4.2.0/mm/pgtable-generic.c:33: bad pmd ffff880fd06d6200(000000018da009e2)
Could you put dump_stack() into pmd_clear_bad(), right after pmd_ERROR()?
It may give us a clue.
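With such a change in place, the stack trace would be expected to follow
the "bad pmd" line in the kernel log. A small sketch for cutting that
region out of a saved log, assuming a POSIX shell with awk; the helper
name bad_pmd_context and the 40-line default are arbitrary choices for
this example:

```shell
#!/bin/sh
# Print every "bad pmd" line from a saved kernel log together with the
# lines that follow it, where a dump_stack() trace would show up.
# bad_pmd_context is a hypothetical helper name.
bad_pmd_context() {
    # $1: log file, $2: number of following lines to keep (default 40)
    awk -v n="${2:-40}" '
        /bad pmd/ { left = n + 1 }      # restart the window on each hit
        left > 0  { print; left-- }
    ' "$1"
}
```

Typical use would be to save the log first, for example with
"dmesg > dmesg.txt", and then run "bad_pmd_context dmesg.txt".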
> [133933.763015] BUG: Bad rss-counter state mm:ffff88101705f800 idx:1 val:512
> [133933.763039] BUG: non-zero nr_ptes on freeing mm: 1
--
Kirill A. Shutemov