linux-kernel.vger.kernel.org archive mirror
From: Trevor Cordes <trevor@tecnopolis.ca>
To: linux-kernel@vger.kernel.org
Cc: Mel Gorman <mgorman@techsingularity.net>,
	Joonsoo Kim <iamjoonsoo.kim@lge.com>,
	Michal Hocko <mhocko@kernel.org>,
	Minchan Kim <minchan@kernel.org>, Rik van Riel <riel@surriel.com>,
	Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Subject: mm, vmscan: commit makes PAE kernel crash nightly (bisected)
Date: Wed, 11 Jan 2017 04:32:43 -0600
Message-ID: <20170111103243.GA27795@pog.tecnopolis.ca>

Hi!  I have bisected a nightly oom-killer flood and crash/hang on one of
the boxes I admin.  It doesn't crash on the Fedora 23/24 4.7.10 kernel but
does on any 4.8 Fedora kernel.  I did a vanilla bisect and the bug is
here:

commit b2e18757f2c9d1cdd746a882e9878852fdec9501
Author: Mel Gorman <mgorman@techsingularity.net>
Date:   Thu Jul 28 15:45:37 2016 -0700

    mm, vmscan: begin reclaiming pages on a per-node basis

I bisected between:
# bad: [69973b830859bc6529a7a0468ba0d80ee5117826] Linux 4.9
# good: [523d939ef98fd712632d93a5a2b588e477a7565e] Linux 4.7

I have not tried anything newer than the 4.8.13 Fedora kernel, but if
someone thinks this bug is already fixed in HEAD I could try that next.
It took 3 weeks to bisect because the crash only seems to happen in the
middle of the night, and on most nights but not all.

It does not occur on most of my other boxes, just this one.  The box is
somewhat unusual in that it's running 32-bit PAE on a 64-bit capable CPU,
and I have the memory tuned down to mem=6G on the kernel command line (I
think it actually has 16GB).  I tuned the RAM down because at around 8GB
the PAE kernel has massive IO speed issues.

It is a relatively new Intel(R) Xeon(R) CPU E3-1230 V2 @ 3.30GHz on an 
Intel S1200BTL board.  I will eventually change it to 64-bit Fedora which 
I'm sure will solve this bug, but since there's no easy upgrade path, 
that's on the backburner on this production box.

I'm sure this will be another "PAE sucks, don't use it" issue, but like I 
said, I'm currently stuck with it, and in theory the kernel shouldn't 
crash like this (I'm guessing/hoping).

I think I pinned the trigger down to big dir scans (like
"find /bigdir-foo") running at around 3am.  The source is either a remote
box doing indexing via smbd, or rsync or rdiff-backup also doing big dir
scans, or both.  But when I do "find /" manually I can't trigger the bug.
Very weird.

The commit notes make it sound like the author thought perhaps there could 
be a problem in some scenarios?  I guess I found the scenario.

The only discussion I found on the net regarding this commit is
https://lkml.org/lkml/2016/8/29/154
It may be somewhat relevant, but it's a bit over my head.

I'm available for testing, etc, and can usually rule out a bad kernel
within 24 hours by just waiting for 3am to roll around.  I also have
copious logs I can provide and screenshots of the crashes.

The box is extremely lightly loaded: RAM use is almost always under
1GB, and swap is 0-20k used most of the time with GBs free.  Everything
looks great until all of a sudden oom-killer starts running and goes
through 10-260 iterations before the system just dies.  I wrote a script
to watch for oom-killer and issue "reboot" immediately, but 80% of the
time the box will hang before the reboot actually manages to shut down.
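
For anyone wanting to try the same kind of watchdog, here is a minimal
sketch in Python.  It is not the actual script from this box, only an
illustration.  It assumes kernel log lines arrive on stdin (for example
piped from "dmesg --follow"), that the kernel's OOM report contains the
substring "invoked oom-killer", and that it runs as root.  The names
watch, is_oom_line, reboot and main are made up for this example.

```python
import subprocess
import sys

# Substring the kernel prints when the OOM killer is invoked,
# e.g. "smbd invoked oom-killer: gfp_mask=...".
OOM_MARKER = "invoked oom-killer"


def is_oom_line(line):
    """Return True if a kernel log line reports an oom-killer invocation."""
    return OOM_MARKER in line


def watch(lines, on_oom):
    """Scan an iterable of log lines; call on_oom(line) on the first OOM
    report and stop.  Returns True if an OOM report was seen."""
    for line in lines:
        if is_oom_line(line):
            on_oom(line)
            return True
    return False


def reboot(line):
    """Assumed reaction: log the triggering line, then run plain 'reboot'."""
    sys.stderr.write("oom-killer seen, rebooting: " + line)
    subprocess.call(["reboot"])


def main():
    # Assumed usage: dmesg --follow | python3 oom_watch.py
    watch(sys.stdin, reboot)
```

Calling main() at the bottom of the file turns this into the watchdog
described above.  Note that the reboot itself still needs memory and IO
to complete, so a sketch like this can hang mid-shutdown just as the real
script does.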

Any information/help I can provide, please just holler.  Thanks!
