* Re: [Bugme-new] [Bug 13302] New: "bad pmd" on fork() of process with hugepage shared memory segments attached
[not found] <bug-13302-10286@http.bugzilla.kernel.org/>
@ 2009-05-13 20:08 ` Andrew Morton
2009-05-14 10:53 ` Mel Gorman
0 siblings, 1 reply; 25+ messages in thread
From: Andrew Morton @ 2009-05-13 20:08 UTC (permalink / raw)
To: linux-mm; +Cc: bugzilla-daemon, bugme-daemon, Adam Litke, starlight
(switched to email. Please respond via emailed reply-to-all, not via the
bugzilla web interface).
(Please read this ^^^^ !)
On Wed, 13 May 2009 19:54:10 GMT
bugzilla-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=13302
>
> Summary: "bad pmd" on fork() of process with hugepage shared
> memory segments attached
> Product: Memory Management
> Version: 2.5
> Kernel Version: 2.6.29.1
> Platform: All
> OS/Version: Linux
> Tree: Mainline
> Status: NEW
> Severity: normal
> Priority: P1
> Component: Other
> AssignedTo: akpm@linux-foundation.org
> ReportedBy: starlight@binnacle.cx
> Regression: Yes
>
>
> Kernel reports "bad pmd" errors when process with hugepage
> shared memory segments attached executes fork() system call.
> Using vfork() avoids the issue.
>
> Bug also appears in RHEL5 2.6.18-128.1.6.el5 and causes
> leakage of huge pages.
>
> Bug does not appear in RHEL4 2.6.9-78.0.13.ELsmp.
>
> See bug 12134 for an example of the errors reported
> by 'dmesg'.
>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* Re: [Bugme-new] [Bug 13302] New: "bad pmd" on fork() of process with hugepage shared memory segments attached
2009-05-13 20:08 ` Andrew Morton
@ 2009-05-14 10:53 ` Mel Gorman
2009-05-14 10:59 ` Mel Gorman
2009-05-14 17:16 ` starlight
0 siblings, 2 replies; 25+ messages in thread
From: Mel Gorman @ 2009-05-14 10:53 UTC (permalink / raw)
To: starlight, Andrew Morton
Cc: linux-mm, bugzilla-daemon, bugme-daemon, Adam Litke,
Eric B Munson
On Wed, May 13, 2009 at 01:08:46PM -0700, Andrew Morton wrote:
>
> (switched to email. Please respond via emailed reply-to-all, not via the
> bugzilla web interface).
>
> (Please read this ^^^^ !)
>
> On Wed, 13 May 2009 19:54:10 GMT
> bugzilla-daemon@bugzilla.kernel.org wrote:
>
> > http://bugzilla.kernel.org/show_bug.cgi?id=13302
> >
> > Summary: "bad pmd" on fork() of process with hugepage shared
> > memory segments attached
> > Product: Memory Management
> > Version: 2.5
> > Kernel Version: 2.6.29.1
> > Platform: All
> > OS/Version: Linux
> > Tree: Mainline
> > Status: NEW
> > Severity: normal
> > Priority: P1
> > Component: Other
> > AssignedTo: akpm@linux-foundation.org
> > ReportedBy: starlight@binnacle.cx
> > Regression: Yes
> >
> >
> > Kernel reports "bad pmd" errors when process with hugepage
> > shared memory segments attached executes fork() system call.
> > Using vfork() avoids the issue.
> >
> > Bug also appears in RHEL5 2.6.18-128.1.6.el5 and causes
> > leakage of huge pages.
> >
> > Bug does not appear in RHEL4 2.6.9-78.0.13.ELsmp.
> >
> > See bug 12134 for an example of the errors reported
> > by 'dmesg'.
> >
This seems familiar; I believe it couldn't be reproduced the last time
and then the problem reporter went away. We need a reproduction case, so
I modified one of the libhugetlbfs tests to do what I think you described
above. However, it does not trigger the problem for me on x86 or x86-64
running 2.6.29.1.
starlight@binnacle.cx, can you try the reproduction steps on your system
please? If it reproduces, can you send me your .config please? If it
does not reproduce, can you look at the test program and tell me what
it's doing differently from your reproduction case?
1. wget http://heanet.dl.sourceforge.net/sourceforge/libhugetlbfs/libhugetlbfs-2.3.tar.gz
2. tar -zxf libhugetlbfs-2.3.tar.gz
3. cd libhugetlbfs-2.3
4. wget http://www.csn.ul.ie/~mel/shm-fork.c (program is below for reference)
5. mv shm-fork.c tests/
6. make
7. ./obj/hugeadm --create-global-mounts
8. ./obj/hugeadm --pool-pages-min 2M:20
(Adjust pagesize of 2M if necessary. If x86 and not 2M, tell me
and send me your .config)
9. ./tests/obj32/shm-fork 10 2
On my two systems, I saw something like
# ./tests/obj32/shm-fork 10 2
Starting testcase "./tests/obj32/shm-fork", pid 3527
Requesting 4194304 bytes for each test
Spawning children glibc_fork:..........glibc_fork
Spawning children glibc_vfork:..........glibc_vfork
Spawning children sys_fork:..........sys_fork
PASS
Test program I used is below and is a modified version of what's in
libhugetlbfs. It does not compile standalone. The steps it takes are:
1. Gets the hugepage size
2. Calls shmget() to create a suitably large shared memory segment
3. Creates a requested number of children
4. Each child attaches to the shared memory segment
5. Each child creates a grandchild
6. The child and grandchildren write the segment
7. The grandchild exits, the child waits for the grandchild
8. The child detaches and exits
9. The parent waits for the child to exit
It does this for glibc fork, glibc vfork and a direct call to the system
call fork().
Thanks
==== CUT HERE ====
/*
* libhugetlbfs - Easy use of Linux hugepages
* Copyright (C) 2005-2006 David Gibson & Adam Litke, IBM Corporation.
*
* This library is free software; you can redistribute it and/or
* modify it under the terms of the GNU Lesser General Public License
* as published by the Free Software Foundation; either version 2.1 of
* the License, or (at your option) any later version.
*
* This library is distributed in the hope that it will be useful, but
* WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
* Lesser General Public License for more details.
*
* You should have received a copy of the GNU Lesser General Public
* License along with this library; if not, write to the Free Software
* Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
*/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <syscall.h>
#include <sys/types.h>
#include <sys/shm.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <hugetlbfs.h>
#include "hugetests.h"
#define P "shm-fork"
#define DESC \
        "* Test shared memory behavior when multiple threads are attached *\n"\
        "* to a segment. A segment is created and then children are *\n"\
        "* spawned which attach, write, read (verify), and detach from the *\n"\
        "* shared memory segment. *"
extern int errno;
/* Global Configuration */
static int nr_hugepages;
static int numprocs;
static int shmid = -1;
#define MAX_PROCS 200
#define BUF_SZ 256
#define GLIBC_FORK 0
#define GLIBC_VFORK 1
#define SYS_FORK 2
static char *testnames[] = { "glibc_fork", "glibc_vfork", "sys_fork" };
#define CHILD_FAIL(thread, fmt, ...) \
        do { \
                verbose_printf("Thread %d (pid=%d) FAIL: " fmt, \
                               thread, getpid(), __VA_ARGS__); \
                exit(1); \
        } while (0)

void cleanup(void)
{
        remove_shmid(shmid);
}
static void do_child(int thread, unsigned long size, int testtype)
{
        volatile char *shmaddr;
        int j, k;
        int pid, status;

        verbose_printf(".");
        for (j=0; j<5; j++) {
                shmaddr = shmat(shmid, 0, SHM_RND);
                if (shmaddr == MAP_FAILED)
                        CHILD_FAIL(thread, "shmat() failed: %s",
                                   strerror(errno));

                /* Create even more children to double up the work */
                switch (testtype) {
                case GLIBC_FORK:
                        if ((pid = fork()) < 0)
                                FAIL("glibc_fork(): %s", strerror(errno));
                        break;
                case GLIBC_VFORK:
                        if ((pid = vfork()) < 0)
                                FAIL("glibc_vfork(): %s", strerror(errno));
                        break;
                case SYS_FORK:
                        if ((pid = syscall(__NR_fork)) < 0)
                                FAIL("sys_fork(): %s", strerror(errno));
                        break;
                default:
                        FAIL("Test type %d not implemented\n", testtype);
                }

                /* Child and parent access the shared area */
                for (k=0; k<size; k++)
                        shmaddr[k] = (char) (k);
                for (k=0; k<size; k++)
                        if (shmaddr[k] != (char)k)
                                CHILD_FAIL(thread, "Index %d mismatch", k);

                /* Child exits */
                if (pid == 0)
                        exit(0);

                /* Parent waits for child and detaches */
                waitpid(pid, &status, 0);
                if (shmdt((const void *)shmaddr) != 0)
                        CHILD_FAIL(thread, "shmdt() failed: %s",
                                   strerror(errno));
        }
        exit(0);
}
static void do_test(unsigned long size, int testtype)
{
        int wait_list[MAX_PROCS];
        int i;
        int pid, status;
        char *testname = testnames[testtype];

        if ((shmid = shmget(2, size, SHM_HUGETLB|IPC_CREAT|SHM_R|SHM_W)) < 0)
                FAIL("shmget(): %s", strerror(errno));

        verbose_printf("Spawning children %s:", testname);
        for (i=0; i<numprocs; i++) {
                switch (testtype) {
                case GLIBC_FORK:
                        if ((pid = fork()) < 0)
                                FAIL("glibc_fork(): %s", strerror(errno));
                        break;
                case GLIBC_VFORK:
                        if ((pid = vfork()) < 0)
                                FAIL("glibc_vfork(): %s", strerror(errno));
                        break;
                case SYS_FORK:
                        if ((pid = syscall(__NR_fork)) < 0)
                                FAIL("sys_fork(): %s", strerror(errno));
                        break;
                default:
                        FAIL("Test type %d not implemented\n", testtype);
                }

                if (pid == 0)
                        do_child(i, size, testtype);

                wait_list[i] = pid;
        }

        for (i=0; i<numprocs; i++) {
                waitpid(wait_list[i], &status, 0);
                if (WEXITSTATUS(status) != 0)
                        FAIL("Thread %d (pid=%d) failed", i, wait_list[i]);
                if (WIFSIGNALED(status))
                        FAIL("Thread %d (pid=%d) received unhandled signal",
                             i, wait_list[i]);
        }
        printf("%s\n", testname);
}
int main(int argc, char **argv)
{
        unsigned long size;
        long hpage_size;

        test_init(argc, argv);

        if (argc < 3)
                CONFIG("Usage: %s <# procs> <# pages>", argv[0]);

        numprocs = atoi(argv[1]);
        nr_hugepages = atoi(argv[2]);

        if (numprocs > MAX_PROCS)
                CONFIG("Cannot spawn more than %d processes", MAX_PROCS);

        check_hugetlb_shm_group();
        hpage_size = check_hugepagesize();
        size = hpage_size * nr_hugepages;

        verbose_printf("Requesting %lu bytes for each test\n", size);
        do_test(size, GLIBC_FORK);
        do_test(size, GLIBC_VFORK);
        do_test(size, SYS_FORK);
        PASS();
}
* Re: [Bugme-new] [Bug 13302] New: "bad pmd" on fork() of process with hugepage shared memory segments attached
2009-05-14 10:53 ` Mel Gorman
@ 2009-05-14 10:59 ` Mel Gorman
2009-05-14 17:20 ` starlight
2009-05-14 17:16 ` starlight
1 sibling, 1 reply; 25+ messages in thread
From: Mel Gorman @ 2009-05-14 10:59 UTC (permalink / raw)
To: starlight, Andrew Morton
Cc: linux-mm, bugzilla-daemon, bugme-daemon, Adam Litke,
Eric B Munson
On Thu, May 14, 2009 at 11:53:27AM +0100, Mel Gorman wrote:
> On Wed, May 13, 2009 at 01:08:46PM -0700, Andrew Morton wrote:
> >
> > (switched to email. Please respond via emailed reply-to-all, not via the
> > bugzilla web interface).
> >
> > (Please read this ^^^^ !)
> >
> > On Wed, 13 May 2009 19:54:10 GMT
> > bugzilla-daemon@bugzilla.kernel.org wrote:
> >
> > > http://bugzilla.kernel.org/show_bug.cgi?id=13302
> > >
> > > Summary: "bad pmd" on fork() of process with hugepage shared
> > > memory segments attached
> > > Product: Memory Management
> > > Version: 2.5
> > > Kernel Version: 2.6.29.1
> > > Platform: All
> > > OS/Version: Linux
> > > Tree: Mainline
> > > Status: NEW
> > > Severity: normal
> > > Priority: P1
> > > Component: Other
> > > AssignedTo: akpm@linux-foundation.org
> > > ReportedBy: starlight@binnacle.cx
> > > Regression: Yes
> > >
> > >
> > > Kernel reports "bad pmd" errors when process with hugepage
> > > shared memory segments attached executes fork() system call.
> > > Using vfork() avoids the issue.
> > >
> > > Bug also appears in RHEL5 2.6.18-128.1.6.el5 and causes
> > > leakage of huge pages.
> > >
> > > Bug does not appear in RHEL4 2.6.9-78.0.13.ELsmp.
> > >
> > > See bug 12134 for an example of the errors reported
> > > by 'dmesg'.
> > >
>
> This seems familiar; I believe it couldn't be reproduced the last time
> and then the problem reporter went away. We need a reproduction case, so
> I modified one of the libhugetlbfs tests to do what I think you described
> above. However, it does not trigger the problem for me on x86 or x86-64
> running 2.6.29.1.
>
> starlight@binnacle.cx, can you try the reproduction steps on your system
> please? If it reproduces, can you send me your .config please? If it
> does not reproduce, can you look at the test program and tell me what
> it's doing differently from your reproduction case?
>
Another question on top of this.
At any point, do you call madvise(MADV_WILLNEED), fadvise(FADV_WILLNEED)
or readahead() on the shared memory segment?
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [Bugme-new] [Bug 13302] New: "bad pmd" on fork() of process with hugepage shared memory segments attached
2009-05-14 10:53 ` Mel Gorman
2009-05-14 10:59 ` Mel Gorman
@ 2009-05-14 17:16 ` starlight
1 sibling, 0 replies; 25+ messages in thread
From: starlight @ 2009-05-14 17:16 UTC (permalink / raw)
To: Mel Gorman, Andrew Morton
Cc: linux-mm, bugzilla-daemon, bugme-daemon, Adam Litke,
Eric B Munson
Will try it out, but it has to wait till this weekend.
At 11:53 AM 5/14/2009 +0100, Mel Gorman wrote:
>starlight@binnacle.cx, can you try the reproduction steps on your system
>please? If it reproduces, can you send me your .config please? If it
>does not reproduce, can you look at the test program and tell me what
>it's doing differently from your reproduction case?
>
* Re: [Bugme-new] [Bug 13302] New: "bad pmd" on fork() of process with hugepage shared memory segments attached
2009-05-14 10:59 ` Mel Gorman
@ 2009-05-14 17:20 ` starlight
2009-05-14 17:49 ` Mel Gorman
0 siblings, 1 reply; 25+ messages in thread
From: starlight @ 2009-05-14 17:20 UTC (permalink / raw)
To: Mel Gorman, Andrew Morton
Cc: linux-mm, bugzilla-daemon, bugme-daemon, Adam Litke,
Eric B Munson
Definitely no.
The possibly unusual thing done is that a file is read into
something like 30% of the segment, and the remaining pages are
not touched.
At 11:59 AM 5/14/2009 +0100, Mel Gorman wrote:
>Another question on top of this.
>
>At any point, do you call madvise(MADV_WILLNEED),
>fadvise(FADV_WILLNEED) or readahead() on the shared memory segment?
* Re: [Bugme-new] [Bug 13302] New: "bad pmd" on fork() of process with hugepage shared memory segments attached
2009-05-14 17:20 ` starlight
@ 2009-05-14 17:49 ` Mel Gorman
2009-05-14 18:42 ` starlight
2009-05-14 19:10 ` starlight
0 siblings, 2 replies; 25+ messages in thread
From: Mel Gorman @ 2009-05-14 17:49 UTC (permalink / raw)
To: starlight
Cc: Andrew Morton, linux-mm, bugzilla-daemon, bugme-daemon,
Adam Litke, Eric B Munson
On Thu, May 14, 2009 at 01:20:09PM -0400, starlight@binnacle.cx wrote:
> At 11:59 AM 5/14/2009 +0100, Mel Gorman wrote:
> >Another question on top of this.
> >
> >At any point, do you call madvise(MADV_WILLNEED),
> >fadvise(FADV_WILLNEED) or readahead() on the shared memory segment?
>
> Definitely no.
>
> The possibly unusual thing done is that a file is read into
> something like 30% of the segment, and the remaining pages are
> not touched.
>
Ok, I just tried that there - parent writing 30% of the shared memory
before forking but still did not reproduce the problem :(
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [Bugme-new] [Bug 13302] New: "bad pmd" on fork() of process with hugepage shared memory segments attached
2009-05-14 17:49 ` Mel Gorman
@ 2009-05-14 18:42 ` starlight
2009-05-14 19:10 ` starlight
1 sibling, 0 replies; 25+ messages in thread
From: starlight @ 2009-05-14 18:42 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, linux-mm, bugzilla-daemon, bugme-daemon,
Adam Litke, Eric B Munson
[-- Attachment #1: Type: text/plain, Size: 1506 bytes --]
At 06:49 PM 5/14/2009 +0100, Mel Gorman wrote:
>Ok, I just tried that there - parent writing 30% of the shared memory
>before forking but still did not reproduce the problem :(
Maybe it makes a difference to have lots of RAM (16GB on this
server), and about 1.5 GB of hugepage shared memory allocated in
the forking process in about four segments. Often have all free
memory consumed by the file cache, but I don't believe this is
necessary to produce the problem as it will happen even right
after a reboot. [RHEL5 meminfo attached]
Other possible factors:
daemon is non-root but has explicit
CAP_IPC_LOCK, CAP_NET_RAW, CAP_SYS_NICE set via
'setcap cap_net_raw,cap_ipc_lock,cap_sys_nice+ep daemon'
ulimit -Hl and -Sl are set to <unlimited>
process group is set in /proc/sys/vm/hugetlb_shm_group
/proc/sys/vm/nr_hugepages is set to 2048
daemon has 200 threads at time of fork()
shared memory segments explicitly located [RHEL5 pmap -x attached]
between fork & exec these syscalls are issued
sched_getscheduler/sched_setscheduler
getpriority/setpriority
seteuid(getuid())
setegid(getgid())
with vfork() work-around, no syscalls are made before exec()
Don't think it's anything specific to the DL160 (Intel E5430)
we have, because the DL165 (Opteron 2354) also exhibits the problem.
Will run the test cases provided this weekend for certain and
will let you know if bug is reproduced.
Have to go silent on this till the weekend.
[-- Attachment #2: meminfo.txt --]
[-- Type: text/plain, Size: 777 bytes --]
MemTotal: 16443828 kB
MemFree: 105364 kB
Buffers: 8476 kB
Cached: 11626260 kB
SwapCached: 0 kB
Active: 121876 kB
Inactive: 11570788 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 16443828 kB
LowFree: 105364 kB
SwapTotal: 2031608 kB
SwapFree: 2031396 kB
Dirty: 417224 kB
Writeback: 0 kB
AnonPages: 62700 kB
Mapped: 10640 kB
Slab: 416872 kB
PageTables: 1904 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
CommitLimit: 8156368 kB
Committed_AS: 71692 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 266280 kB
VmallocChunk: 34359471371 kB
HugePages_Total: 2048
HugePages_Free: 889
HugePages_Rsvd: 0
Hugepagesize: 2048 kB
[-- Attachment #3: pmap.txt --]
[-- Type: text/plain, Size: 20953 bytes --]
Address Kbytes RSS Anon Locked Mode Mapping
0000000000400000 2976 - - - r-x-- daemon
00000000007e8000 8 - - - rw--- daemon
00000000007ea000 192 - - - rw--- [ anon ]
000000001546d000 9228 - - - rw--- [ anon ]
0000000040955000 4 - - - ----- [ anon ]
0000000040956000 128 - - - rw--- [ anon ]
0000000040976000 4 - - - ----- [ anon ]
0000000040977000 128 - - - rw--- [ anon ]
0000000040997000 4 - - - ----- [ anon ]
0000000040998000 128 - - - rw--- [ anon ]
00000000409b8000 4 - - - ----- [ anon ]
00000000409b9000 128 - - - rw--- [ anon ]
000000004193a000 4 - - - ----- [ anon ]
000000004193b000 128 - - - rw--- [ anon ]
000000004195b000 4 - - - ----- [ anon ]
000000004195c000 128 - - - rw--- [ anon ]
000000004197c000 4 - - - ----- [ anon ]
000000004197d000 128 - - - rw--- [ anon ]
000000004199d000 4 - - - ----- [ anon ]
000000004199e000 128 - - - rw--- [ anon ]
00000000419be000 4 - - - ----- [ anon ]
00000000419bf000 128 - - - rw--- [ anon ]
00000000419df000 4 - - - ----- [ anon ]
00000000419e0000 128 - - - rw--- [ anon ]
0000000041a00000 4 - - - ----- [ anon ]
0000000041a01000 128 - - - rw--- [ anon ]
0000000041a21000 4 - - - ----- [ anon ]
0000000041a22000 128 - - - rw--- [ anon ]
0000000041a42000 4 - - - ----- [ anon ]
0000000041a43000 128 - - - rw--- [ anon ]
0000000041a63000 4 - - - ----- [ anon ]
0000000041a64000 128 - - - rw--- [ anon ]
0000000041a84000 4 - - - ----- [ anon ]
0000000041a85000 128 - - - rw--- [ anon ]
0000000041aa5000 4 - - - ----- [ anon ]
0000000041aa6000 128 - - - rw--- [ anon ]
0000000041ac6000 4 - - - ----- [ anon ]
0000000041ac7000 128 - - - rw--- [ anon ]
0000000041ae7000 4 - - - ----- [ anon ]
0000000041ae8000 128 - - - rw--- [ anon ]
0000000041b08000 4 - - - ----- [ anon ]
0000000041b09000 128 - - - rw--- [ anon ]
0000000041b29000 4 - - - ----- [ anon ]
0000000041b2a000 128 - - - rw--- [ anon ]
0000000041b4a000 4 - - - ----- [ anon ]
0000000041b4b000 128 - - - rw--- [ anon ]
0000000041b6b000 4 - - - ----- [ anon ]
0000000041b6c000 128 - - - rw--- [ anon ]
0000000041b8c000 4 - - - ----- [ anon ]
0000000041b8d000 128 - - - rw--- [ anon ]
0000000041bad000 4 - - - ----- [ anon ]
0000000041bae000 128 - - - rw--- [ anon ]
0000000041bce000 4 - - - ----- [ anon ]
0000000041bcf000 128 - - - rw--- [ anon ]
0000000041bef000 4 - - - ----- [ anon ]
0000000041bf0000 128 - - - rw--- [ anon ]
0000000041c10000 4 - - - ----- [ anon ]
0000000041c11000 128 - - - rw--- [ anon ]
0000000041c31000 4 - - - ----- [ anon ]
0000000041c32000 128 - - - rw--- [ anon ]
0000000041c52000 4 - - - ----- [ anon ]
0000000041c53000 128 - - - rw--- [ anon ]
0000000041c73000 4 - - - ----- [ anon ]
0000000041c74000 128 - - - rw--- [ anon ]
0000000041c94000 4 - - - ----- [ anon ]
0000000041c95000 128 - - - rw--- [ anon ]
0000000041cb5000 4 - - - ----- [ anon ]
0000000041cb6000 128 - - - rw--- [ anon ]
0000000041cd6000 4 - - - ----- [ anon ]
0000000041cd7000 128 - - - rw--- [ anon ]
0000000041cf7000 4 - - - ----- [ anon ]
0000000041cf8000 128 - - - rw--- [ anon ]
0000000041d18000 4 - - - ----- [ anon ]
0000000041d19000 128 - - - rw--- [ anon ]
0000000041d39000 4 - - - ----- [ anon ]
0000000041d3a000 128 - - - rw--- [ anon ]
0000000041d5a000 4 - - - ----- [ anon ]
0000000041d5b000 128 - - - rw--- [ anon ]
0000000041d7b000 4 - - - ----- [ anon ]
0000000041d7c000 128 - - - rw--- [ anon ]
0000000041d9c000 4 - - - ----- [ anon ]
0000000041d9d000 128 - - - rw--- [ anon ]
0000000041dbd000 4 - - - ----- [ anon ]
0000000041dbe000 128 - - - rw--- [ anon ]
0000000041dde000 4 - - - ----- [ anon ]
0000000041ddf000 128 - - - rw--- [ anon ]
0000000041dff000 4 - - - ----- [ anon ]
0000000041e00000 128 - - - rw--- [ anon ]
0000000041e20000 4 - - - ----- [ anon ]
0000000041e21000 128 - - - rw--- [ anon ]
0000000041e41000 4 - - - ----- [ anon ]
0000000041e42000 128 - - - rw--- [ anon ]
0000000041e62000 4 - - - ----- [ anon ]
0000000041e63000 128 - - - rw--- [ anon ]
0000000041e83000 4 - - - ----- [ anon ]
0000000041e84000 128 - - - rw--- [ anon ]
0000000041ea4000 4 - - - ----- [ anon ]
0000000041ea5000 128 - - - rw--- [ anon ]
0000000041ec5000 4 - - - ----- [ anon ]
0000000041ec6000 128 - - - rw--- [ anon ]
0000000041ee6000 4 - - - ----- [ anon ]
0000000041ee7000 128 - - - rw--- [ anon ]
0000000041f07000 4 - - - ----- [ anon ]
0000000041f08000 128 - - - rw--- [ anon ]
0000000041f28000 4 - - - ----- [ anon ]
0000000041f29000 128 - - - rw--- [ anon ]
0000000041f49000 4 - - - ----- [ anon ]
0000000041f4a000 128 - - - rw--- [ anon ]
0000000041f6a000 4 - - - ----- [ anon ]
0000000041f6b000 128 - - - rw--- [ anon ]
0000000041f8b000 4 - - - ----- [ anon ]
0000000041f8c000 128 - - - rw--- [ anon ]
0000000041fac000 4 - - - ----- [ anon ]
0000000041fad000 128 - - - rw--- [ anon ]
0000000041fcd000 4 - - - ----- [ anon ]
0000000041fce000 128 - - - rw--- [ anon ]
0000000041fee000 4 - - - ----- [ anon ]
0000000041fef000 128 - - - rw--- [ anon ]
000000004200f000 4 - - - ----- [ anon ]
0000000042010000 128 - - - rw--- [ anon ]
0000000042030000 4 - - - ----- [ anon ]
0000000042031000 128 - - - rw--- [ anon ]
0000000042051000 4 - - - ----- [ anon ]
0000000042052000 128 - - - rw--- [ anon ]
0000000042072000 4 - - - ----- [ anon ]
0000000042073000 128 - - - rw--- [ anon ]
0000000042093000 4 - - - ----- [ anon ]
0000000042094000 128 - - - rw--- [ anon ]
00000000420b4000 4 - - - ----- [ anon ]
00000000420b5000 128 - - - rw--- [ anon ]
00000000420d5000 4 - - - ----- [ anon ]
00000000420d6000 128 - - - rw--- [ anon ]
00000000420f6000 4 - - - ----- [ anon ]
00000000420f7000 128 - - - rw--- [ anon ]
0000000042117000 4 - - - ----- [ anon ]
0000000042118000 128 - - - rw--- [ anon ]
0000000042138000 4 - - - ----- [ anon ]
0000000042139000 128 - - - rw--- [ anon ]
0000000042159000 4 - - - ----- [ anon ]
000000004215a000 128 - - - rw--- [ anon ]
000000004217a000 4 - - - ----- [ anon ]
000000004217b000 128 - - - rw--- [ anon ]
000000004219b000 4 - - - ----- [ anon ]
000000004219c000 128 - - - rw--- [ anon ]
00000000421bc000 4 - - - ----- [ anon ]
00000000421bd000 128 - - - rw--- [ anon ]
00000000421dd000 4 - - - ----- [ anon ]
00000000421de000 128 - - - rw--- [ anon ]
00000000421fe000 4 - - - ----- [ anon ]
00000000421ff000 128 - - - rw--- [ anon ]
000000004221f000 4 - - - ----- [ anon ]
0000000042220000 128 - - - rw--- [ anon ]
0000000042240000 4 - - - ----- [ anon ]
0000000042241000 128 - - - rw--- [ anon ]
0000000042261000 4 - - - ----- [ anon ]
0000000042262000 128 - - - rw--- [ anon ]
0000000042282000 4 - - - ----- [ anon ]
0000000042283000 128 - - - rw--- [ anon ]
00000000422a3000 4 - - - ----- [ anon ]
00000000422a4000 128 - - - rw--- [ anon ]
00000000422c4000 4 - - - ----- [ anon ]
00000000422c5000 128 - - - rw--- [ anon ]
00000000422e5000 4 - - - ----- [ anon ]
00000000422e6000 128 - - - rw--- [ anon ]
0000000042306000 4 - - - ----- [ anon ]
0000000042307000 128 - - - rw--- [ anon ]
0000000042327000 4 - - - ----- [ anon ]
0000000042328000 128 - - - rw--- [ anon ]
0000000042348000 4 - - - ----- [ anon ]
0000000042349000 128 - - - rw--- [ anon ]
0000000042369000 4 - - - ----- [ anon ]
000000004236a000 128 - - - rw--- [ anon ]
000000004238a000 4 - - - ----- [ anon ]
000000004238b000 128 - - - rw--- [ anon ]
00000000423ab000 4 - - - ----- [ anon ]
00000000423ac000 128 - - - rw--- [ anon ]
00000000423cc000 4 - - - ----- [ anon ]
00000000423cd000 128 - - - rw--- [ anon ]
00000000423ed000 4 - - - ----- [ anon ]
00000000423ee000 128 - - - rw--- [ anon ]
000000004240e000 4 - - - ----- [ anon ]
000000004240f000 128 - - - rw--- [ anon ]
000000004242f000 4 - - - ----- [ anon ]
0000000042430000 128 - - - rw--- [ anon ]
0000000042450000 4 - - - ----- [ anon ]
0000000042451000 128 - - - rw--- [ anon ]
0000000042471000 4 - - - ----- [ anon ]
0000000042472000 128 - - - rw--- [ anon ]
0000000042492000 4 - - - ----- [ anon ]
0000000042493000 128 - - - rw--- [ anon ]
00000000424b3000 4 - - - ----- [ anon ]
00000000424b4000 128 - - - rw--- [ anon ]
00000000424d4000 4 - - - ----- [ anon ]
00000000424d5000 128 - - - rw--- [ anon ]
00000000424f5000 4 - - - ----- [ anon ]
00000000424f6000 128 - - - rw--- [ anon ]
0000000042516000 4 - - - ----- [ anon ]
0000000042517000 128 - - - rw--- [ anon ]
0000000042537000 4 - - - ----- [ anon ]
0000000042538000 128 - - - rw--- [ anon ]
0000000042558000 4 - - - ----- [ anon ]
0000000042559000 128 - - - rw--- [ anon ]
0000000042579000 4 - - - ----- [ anon ]
000000004257a000 128 - - - rw--- [ anon ]
000000004259a000 4 - - - ----- [ anon ]
000000004259b000 128 - - - rw--- [ anon ]
00000000425bb000 4 - - - ----- [ anon ]
00000000425bc000 128 - - - rw--- [ anon ]
00000000425dc000 4 - - - ----- [ anon ]
00000000425dd000 128 - - - rw--- [ anon ]
00000000425fd000 4 - - - ----- [ anon ]
00000000425fe000 128 - - - rw--- [ anon ]
000000004261e000 4 - - - ----- [ anon ]
000000004261f000 128 - - - rw--- [ anon ]
000000004263f000 4 - - - ----- [ anon ]
0000000042640000 128 - - - rw--- [ anon ]
0000000042660000 4 - - - ----- [ anon ]
0000000042661000 128 - - - rw--- [ anon ]
0000000042681000 4 - - - ----- [ anon ]
0000000042682000 128 - - - rw--- [ anon ]
00000000426a2000 4 - - - ----- [ anon ]
00000000426a3000 128 - - - rw--- [ anon ]
00000000426c3000 4 - - - ----- [ anon ]
00000000426c4000 128 - - - rw--- [ anon ]
00000000426e4000 4 - - - ----- [ anon ]
00000000426e5000 128 - - - rw--- [ anon ]
0000000042705000 4 - - - ----- [ anon ]
0000000042706000 128 - - - rw--- [ anon ]
0000000042726000 4 - - - ----- [ anon ]
0000000042727000 128 - - - rw--- [ anon ]
0000000042747000 4 - - - ----- [ anon ]
0000000042748000 128 - - - rw--- [ anon ]
0000000042768000 4 - - - ----- [ anon ]
0000000042769000 128 - - - rw--- [ anon ]
0000000042789000 4 - - - ----- [ anon ]
000000004278a000 128 - - - rw--- [ anon ]
00000000427aa000 4 - - - ----- [ anon ]
00000000427ab000 128 - - - rw--- [ anon ]
00000000427cb000 4 - - - ----- [ anon ]
00000000427cc000 128 - - - rw--- [ anon ]
00000000427ec000 4 - - - ----- [ anon ]
00000000427ed000 128 - - - rw--- [ anon ]
000000004280d000 4 - - - ----- [ anon ]
000000004280e000 28 - - - rw--- [ anon ]
0000000042815000 4 - - - ----- [ anon ]
0000000042816000 28 - - - rw--- [ anon ]
000000004281d000 4 - - - ----- [ anon ]
000000004281e000 28 - - - rw--- [ anon ]
0000000042825000 4 - - - ----- [ anon ]
0000000042826000 28 - - - rw--- [ anon ]
000000004282d000 4 - - - ----- [ anon ]
000000004282e000 28 - - - rw--- [ anon ]
0000000042835000 4 - - - ----- [ anon ]
0000000042836000 28 - - - rw--- [ anon ]
000000004283d000 4 - - - ----- [ anon ]
000000004283e000 28 - - - rw--- [ anon ]
0000000042845000 4 - - - ----- [ anon ]
0000000042846000 28 - - - rw--- [ anon ]
000000004284d000 4 - - - ----- [ anon ]
000000004284e000 28 - - - rw--- [ anon ]
0000000300000000 524288 - - - rw-s- 9 (deleted)
0000000330000000 131072 - - - rw-s- 6 (deleted)
00000003c0000000 98304 - - - rw-s- 8 (deleted)
00000003d8000000 169984 - - - rw-s- 5 (deleted)
00000003f0000000 2048 - - - rw-s- 1 (deleted)
00000003f0400000 2048 - - - rw-s- 7 (deleted)
0000000400000000 1048576 - - - rw-s- 3 (deleted)
0000000580000000 262144 - - - rw-s- 4 (deleted)
0000000600000000 131072 - - - rw-s- 2 (deleted)
00000032d5400000 112 - - - r-x-- ld-2.5.so
00000032d561b000 4 - - - r---- ld-2.5.so
00000032d561c000 4 - - - rw--- ld-2.5.so
00000032d5800000 1328 - - - r-x-- libc-2.5.so
00000032d594c000 2048 - - - ----- libc-2.5.so
00000032d5b4c000 16 - - - r---- libc-2.5.so
00000032d5b50000 4 - - - rw--- libc-2.5.so
00000032d5b51000 20 - - - rw--- [ anon ]
00000032d6800000 520 - - - r-x-- libm-2.5.so
00000032d6882000 2044 - - - ----- libm-2.5.so
00000032d6a81000 4 - - - r---- libm-2.5.so
00000032d6a82000 4 - - - rw--- libm-2.5.so
00000032d6c00000 88 - - - r-x-- libpthread-2.5.so
00000032d6c16000 2044 - - - ----- libpthread-2.5.so
00000032d6e15000 4 - - - r---- libpthread-2.5.so
00000032d6e16000 4 - - - rw--- libpthread-2.5.so
00000032d6e17000 16 - - - rw--- [ anon ]
00000032d7000000 28 - - - r-x-- librt-2.5.so
00000032d7007000 2048 - - - ----- librt-2.5.so
00000032d7207000 4 - - - r---- librt-2.5.so
00000032d7208000 4 - - - rw--- librt-2.5.so
00002aaaac000000 132 - - - rw--- [ anon ]
00002aaaac021000 65404 - - - ----- [ anon ]
00002aaab0000000 132 - - - rw--- [ anon ]
00002aaab0021000 65404 - - - ----- [ anon ]
00002aaab4000000 132 - - - rw--- [ anon ]
00002aaab4021000 65404 - - - ----- [ anon ]
00002ad69510e000 4 - - - rw--- [ anon ]
00002ad695115000 4 - - - rw--- [ anon ]
00002ad695116000 944 - - - r-x-- libstdc++.so.6.0.10
00002ad695202000 1024 - - - ----- libstdc++.so.6.0.10
00002ad695302000 8 - - - r---- libstdc++.so.6.0.10
00002ad695304000 28 - - - rw--- libstdc++.so.6.0.10
00002ad69530b000 80 - - - rw--- [ anon ]
00002ad69531f000 88 - - - r-x-- libgcc_s.so.1
00002ad695335000 1020 - - - ----- libgcc_s.so.1
00002ad695434000 4 - - - rw--- libgcc_s.so.1
00002ad695435000 24776 - - - rw--- [ anon ]
00007fff1597d000 124 - - - rw--- [ stack ]
ffffffffff600000 8192 - - - ----- [ anon ]
---------------- ------ ------ ------ ------
total kB 2641188 - - -
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [Bugme-new] [Bug 13302] New: "bad pmd" on fork() of process with hugepage shared memory segments attached
2009-05-14 17:49 ` Mel Gorman
2009-05-14 18:42 ` starlight
@ 2009-05-14 19:10 ` starlight
1 sibling, 0 replies; 25+ messages in thread
From: starlight @ 2009-05-14 19:10 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, linux-mm, bugzilla-daemon, bugme-daemon,
Adam Litke, Eric B Munson
[-- Attachment #1: Type: text/plain, Size: 1506 bytes --]
At 06:49 PM 5/14/2009 +0100, Mel Gorman wrote:
>Ok, I just tried that there - parent writing 30% of the shared memory
>before forking but still did not reproduce the problem :(
Maybe it makes a difference to have lots of RAM (16GB on this
server), and about 1.5 GB of hugepage shared memory allocated in
the forking process in about four segments. Often have all free
memory consumed by the file cache, but I don't believe this is
necessary to produce the problem as it will happen even right
after a reboot. [RHEL5 meminfo attached]
Other possible factors:
daemon is non-root but has explicit
CAP_IPC_LOCK, CAP_NET_RAW, CAP_SYS_NICE set via
'setcap cap_net_raw,cap_ipc_lock,cap_sys_nice+ep daemon'
ulimit -Hl and -Sl are set to <unlimited>
process group is set in /proc/sys/vm/hugetlb_shm_group
/proc/sys/vm/nr_hugepages is set to 2048
daemon has 200 threads at time of fork()
shared memory segments explicitly located [RHEL5 pmap -x attached]
between fork & exec these syscalls are issued
sched_getscheduler/sched_setscheduler
getpriority/setpriority
seteuid(getuid())
setegid(getgid())
with vfork() work-around, no syscalls are made before exec()
Don't think it's anything specific to the DL160 (Intel E5430)
we have, because the DL165 (Opteron 2354) also exhibits the problem.
Will definitely run the provided test cases this weekend and
will let you know if the bug is reproduced.
Have to go silent on this till the weekend.
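Taken together, the factors listed above amount to roughly this setup (a sketch from the report's own values; DAEMON_GID is a placeholder, and these commands need root):

```shell
# Placeholder: group id of the daemon's process group
DAEMON_GID=500

# non-root daemon gets the explicit capabilities described above
setcap cap_net_raw,cap_ipc_lock,cap_sys_nice+ep daemon

# group allowed to allocate SysV hugepage shm segments
echo "$DAEMON_GID" > /proc/sys/vm/hugetlb_shm_group

# hugepage pool backing the segments
echo 2048 > /proc/sys/vm/nr_hugepages

# unlimited locked memory (limits.conf equivalent)
ulimit -Hl unlimited
ulimit -Sl unlimited
```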
[-- Attachment #2: meminfo.txt --]
[-- Type: text/plain, Size: 777 bytes --]
MemTotal: 16443828 kB
MemFree: 105364 kB
Buffers: 8476 kB
Cached: 11626260 kB
SwapCached: 0 kB
Active: 121876 kB
Inactive: 11570788 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 16443828 kB
LowFree: 105364 kB
SwapTotal: 2031608 kB
SwapFree: 2031396 kB
Dirty: 417224 kB
Writeback: 0 kB
AnonPages: 62700 kB
Mapped: 10640 kB
Slab: 416872 kB
PageTables: 1904 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
CommitLimit: 8156368 kB
Committed_AS: 71692 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 266280 kB
VmallocChunk: 34359471371 kB
HugePages_Total: 2048
HugePages_Free: 889
HugePages_Rsvd: 0
Hugepagesize: 2048 kB
[-- Attachment #3: pmap.txt --]
[-- Type: text/plain, Size: 20953 bytes --]
Address Kbytes RSS Anon Locked Mode Mapping
0000000000400000 2976 - - - r-x-- daemon
00000000007e8000 8 - - - rw--- daemon
00000000007ea000 192 - - - rw--- [ anon ]
000000001546d000 9228 - - - rw--- [ anon ]
0000000040955000 4 - - - ----- [ anon ]
0000000040956000 128 - - - rw--- [ anon ]
0000000040976000 4 - - - ----- [ anon ]
0000000040977000 128 - - - rw--- [ anon ]
0000000040997000 4 - - - ----- [ anon ]
0000000040998000 128 - - - rw--- [ anon ]
00000000409b8000 4 - - - ----- [ anon ]
00000000409b9000 128 - - - rw--- [ anon ]
000000004193a000 4 - - - ----- [ anon ]
000000004193b000 128 - - - rw--- [ anon ]
000000004195b000 4 - - - ----- [ anon ]
000000004195c000 128 - - - rw--- [ anon ]
000000004197c000 4 - - - ----- [ anon ]
000000004197d000 128 - - - rw--- [ anon ]
000000004199d000 4 - - - ----- [ anon ]
000000004199e000 128 - - - rw--- [ anon ]
00000000419be000 4 - - - ----- [ anon ]
00000000419bf000 128 - - - rw--- [ anon ]
00000000419df000 4 - - - ----- [ anon ]
00000000419e0000 128 - - - rw--- [ anon ]
0000000041a00000 4 - - - ----- [ anon ]
0000000041a01000 128 - - - rw--- [ anon ]
0000000041a21000 4 - - - ----- [ anon ]
0000000041a22000 128 - - - rw--- [ anon ]
0000000041a42000 4 - - - ----- [ anon ]
0000000041a43000 128 - - - rw--- [ anon ]
0000000041a63000 4 - - - ----- [ anon ]
0000000041a64000 128 - - - rw--- [ anon ]
0000000041a84000 4 - - - ----- [ anon ]
0000000041a85000 128 - - - rw--- [ anon ]
0000000041aa5000 4 - - - ----- [ anon ]
0000000041aa6000 128 - - - rw--- [ anon ]
0000000041ac6000 4 - - - ----- [ anon ]
0000000041ac7000 128 - - - rw--- [ anon ]
0000000041ae7000 4 - - - ----- [ anon ]
0000000041ae8000 128 - - - rw--- [ anon ]
0000000041b08000 4 - - - ----- [ anon ]
0000000041b09000 128 - - - rw--- [ anon ]
0000000041b29000 4 - - - ----- [ anon ]
0000000041b2a000 128 - - - rw--- [ anon ]
0000000041b4a000 4 - - - ----- [ anon ]
0000000041b4b000 128 - - - rw--- [ anon ]
0000000041b6b000 4 - - - ----- [ anon ]
0000000041b6c000 128 - - - rw--- [ anon ]
0000000041b8c000 4 - - - ----- [ anon ]
0000000041b8d000 128 - - - rw--- [ anon ]
0000000041bad000 4 - - - ----- [ anon ]
0000000041bae000 128 - - - rw--- [ anon ]
0000000041bce000 4 - - - ----- [ anon ]
0000000041bcf000 128 - - - rw--- [ anon ]
0000000041bef000 4 - - - ----- [ anon ]
0000000041bf0000 128 - - - rw--- [ anon ]
0000000041c10000 4 - - - ----- [ anon ]
0000000041c11000 128 - - - rw--- [ anon ]
0000000041c31000 4 - - - ----- [ anon ]
0000000041c32000 128 - - - rw--- [ anon ]
0000000041c52000 4 - - - ----- [ anon ]
0000000041c53000 128 - - - rw--- [ anon ]
0000000041c73000 4 - - - ----- [ anon ]
0000000041c74000 128 - - - rw--- [ anon ]
0000000041c94000 4 - - - ----- [ anon ]
0000000041c95000 128 - - - rw--- [ anon ]
0000000041cb5000 4 - - - ----- [ anon ]
0000000041cb6000 128 - - - rw--- [ anon ]
0000000041cd6000 4 - - - ----- [ anon ]
0000000041cd7000 128 - - - rw--- [ anon ]
0000000041cf7000 4 - - - ----- [ anon ]
0000000041cf8000 128 - - - rw--- [ anon ]
0000000041d18000 4 - - - ----- [ anon ]
0000000041d19000 128 - - - rw--- [ anon ]
0000000041d39000 4 - - - ----- [ anon ]
0000000041d3a000 128 - - - rw--- [ anon ]
0000000041d5a000 4 - - - ----- [ anon ]
0000000041d5b000 128 - - - rw--- [ anon ]
0000000041d7b000 4 - - - ----- [ anon ]
0000000041d7c000 128 - - - rw--- [ anon ]
0000000041d9c000 4 - - - ----- [ anon ]
0000000041d9d000 128 - - - rw--- [ anon ]
0000000041dbd000 4 - - - ----- [ anon ]
0000000041dbe000 128 - - - rw--- [ anon ]
0000000041dde000 4 - - - ----- [ anon ]
0000000041ddf000 128 - - - rw--- [ anon ]
0000000041dff000 4 - - - ----- [ anon ]
0000000041e00000 128 - - - rw--- [ anon ]
0000000041e20000 4 - - - ----- [ anon ]
0000000041e21000 128 - - - rw--- [ anon ]
0000000041e41000 4 - - - ----- [ anon ]
0000000041e42000 128 - - - rw--- [ anon ]
0000000041e62000 4 - - - ----- [ anon ]
0000000041e63000 128 - - - rw--- [ anon ]
0000000041e83000 4 - - - ----- [ anon ]
0000000041e84000 128 - - - rw--- [ anon ]
0000000041ea4000 4 - - - ----- [ anon ]
0000000041ea5000 128 - - - rw--- [ anon ]
0000000041ec5000 4 - - - ----- [ anon ]
0000000041ec6000 128 - - - rw--- [ anon ]
0000000041ee6000 4 - - - ----- [ anon ]
0000000041ee7000 128 - - - rw--- [ anon ]
0000000041f07000 4 - - - ----- [ anon ]
0000000041f08000 128 - - - rw--- [ anon ]
0000000041f28000 4 - - - ----- [ anon ]
0000000041f29000 128 - - - rw--- [ anon ]
0000000041f49000 4 - - - ----- [ anon ]
0000000041f4a000 128 - - - rw--- [ anon ]
0000000041f6a000 4 - - - ----- [ anon ]
0000000041f6b000 128 - - - rw--- [ anon ]
0000000041f8b000 4 - - - ----- [ anon ]
0000000041f8c000 128 - - - rw--- [ anon ]
0000000041fac000 4 - - - ----- [ anon ]
0000000041fad000 128 - - - rw--- [ anon ]
0000000041fcd000 4 - - - ----- [ anon ]
0000000041fce000 128 - - - rw--- [ anon ]
0000000041fee000 4 - - - ----- [ anon ]
0000000041fef000 128 - - - rw--- [ anon ]
000000004200f000 4 - - - ----- [ anon ]
0000000042010000 128 - - - rw--- [ anon ]
0000000042030000 4 - - - ----- [ anon ]
0000000042031000 128 - - - rw--- [ anon ]
0000000042051000 4 - - - ----- [ anon ]
0000000042052000 128 - - - rw--- [ anon ]
0000000042072000 4 - - - ----- [ anon ]
0000000042073000 128 - - - rw--- [ anon ]
0000000042093000 4 - - - ----- [ anon ]
0000000042094000 128 - - - rw--- [ anon ]
00000000420b4000 4 - - - ----- [ anon ]
00000000420b5000 128 - - - rw--- [ anon ]
00000000420d5000 4 - - - ----- [ anon ]
00000000420d6000 128 - - - rw--- [ anon ]
00000000420f6000 4 - - - ----- [ anon ]
00000000420f7000 128 - - - rw--- [ anon ]
0000000042117000 4 - - - ----- [ anon ]
0000000042118000 128 - - - rw--- [ anon ]
0000000042138000 4 - - - ----- [ anon ]
0000000042139000 128 - - - rw--- [ anon ]
0000000042159000 4 - - - ----- [ anon ]
000000004215a000 128 - - - rw--- [ anon ]
000000004217a000 4 - - - ----- [ anon ]
000000004217b000 128 - - - rw--- [ anon ]
000000004219b000 4 - - - ----- [ anon ]
000000004219c000 128 - - - rw--- [ anon ]
00000000421bc000 4 - - - ----- [ anon ]
00000000421bd000 128 - - - rw--- [ anon ]
00000000421dd000 4 - - - ----- [ anon ]
00000000421de000 128 - - - rw--- [ anon ]
00000000421fe000 4 - - - ----- [ anon ]
00000000421ff000 128 - - - rw--- [ anon ]
000000004221f000 4 - - - ----- [ anon ]
0000000042220000 128 - - - rw--- [ anon ]
0000000042240000 4 - - - ----- [ anon ]
0000000042241000 128 - - - rw--- [ anon ]
0000000042261000 4 - - - ----- [ anon ]
0000000042262000 128 - - - rw--- [ anon ]
0000000042282000 4 - - - ----- [ anon ]
0000000042283000 128 - - - rw--- [ anon ]
00000000422a3000 4 - - - ----- [ anon ]
00000000422a4000 128 - - - rw--- [ anon ]
00000000422c4000 4 - - - ----- [ anon ]
00000000422c5000 128 - - - rw--- [ anon ]
00000000422e5000 4 - - - ----- [ anon ]
00000000422e6000 128 - - - rw--- [ anon ]
0000000042306000 4 - - - ----- [ anon ]
0000000042307000 128 - - - rw--- [ anon ]
0000000042327000 4 - - - ----- [ anon ]
0000000042328000 128 - - - rw--- [ anon ]
0000000042348000 4 - - - ----- [ anon ]
0000000042349000 128 - - - rw--- [ anon ]
0000000042369000 4 - - - ----- [ anon ]
000000004236a000 128 - - - rw--- [ anon ]
000000004238a000 4 - - - ----- [ anon ]
000000004238b000 128 - - - rw--- [ anon ]
00000000423ab000 4 - - - ----- [ anon ]
00000000423ac000 128 - - - rw--- [ anon ]
00000000423cc000 4 - - - ----- [ anon ]
00000000423cd000 128 - - - rw--- [ anon ]
00000000423ed000 4 - - - ----- [ anon ]
00000000423ee000 128 - - - rw--- [ anon ]
000000004240e000 4 - - - ----- [ anon ]
000000004240f000 128 - - - rw--- [ anon ]
000000004242f000 4 - - - ----- [ anon ]
0000000042430000 128 - - - rw--- [ anon ]
0000000042450000 4 - - - ----- [ anon ]
0000000042451000 128 - - - rw--- [ anon ]
0000000042471000 4 - - - ----- [ anon ]
0000000042472000 128 - - - rw--- [ anon ]
0000000042492000 4 - - - ----- [ anon ]
0000000042493000 128 - - - rw--- [ anon ]
00000000424b3000 4 - - - ----- [ anon ]
00000000424b4000 128 - - - rw--- [ anon ]
00000000424d4000 4 - - - ----- [ anon ]
00000000424d5000 128 - - - rw--- [ anon ]
00000000424f5000 4 - - - ----- [ anon ]
00000000424f6000 128 - - - rw--- [ anon ]
0000000042516000 4 - - - ----- [ anon ]
0000000042517000 128 - - - rw--- [ anon ]
0000000042537000 4 - - - ----- [ anon ]
0000000042538000 128 - - - rw--- [ anon ]
0000000042558000 4 - - - ----- [ anon ]
0000000042559000 128 - - - rw--- [ anon ]
0000000042579000 4 - - - ----- [ anon ]
000000004257a000 128 - - - rw--- [ anon ]
000000004259a000 4 - - - ----- [ anon ]
000000004259b000 128 - - - rw--- [ anon ]
00000000425bb000 4 - - - ----- [ anon ]
00000000425bc000 128 - - - rw--- [ anon ]
00000000425dc000 4 - - - ----- [ anon ]
00000000425dd000 128 - - - rw--- [ anon ]
00000000425fd000 4 - - - ----- [ anon ]
00000000425fe000 128 - - - rw--- [ anon ]
000000004261e000 4 - - - ----- [ anon ]
000000004261f000 128 - - - rw--- [ anon ]
000000004263f000 4 - - - ----- [ anon ]
0000000042640000 128 - - - rw--- [ anon ]
0000000042660000 4 - - - ----- [ anon ]
0000000042661000 128 - - - rw--- [ anon ]
0000000042681000 4 - - - ----- [ anon ]
0000000042682000 128 - - - rw--- [ anon ]
00000000426a2000 4 - - - ----- [ anon ]
00000000426a3000 128 - - - rw--- [ anon ]
00000000426c3000 4 - - - ----- [ anon ]
00000000426c4000 128 - - - rw--- [ anon ]
00000000426e4000 4 - - - ----- [ anon ]
00000000426e5000 128 - - - rw--- [ anon ]
0000000042705000 4 - - - ----- [ anon ]
0000000042706000 128 - - - rw--- [ anon ]
0000000042726000 4 - - - ----- [ anon ]
0000000042727000 128 - - - rw--- [ anon ]
0000000042747000 4 - - - ----- [ anon ]
0000000042748000 128 - - - rw--- [ anon ]
0000000042768000 4 - - - ----- [ anon ]
0000000042769000 128 - - - rw--- [ anon ]
0000000042789000 4 - - - ----- [ anon ]
000000004278a000 128 - - - rw--- [ anon ]
00000000427aa000 4 - - - ----- [ anon ]
00000000427ab000 128 - - - rw--- [ anon ]
00000000427cb000 4 - - - ----- [ anon ]
00000000427cc000 128 - - - rw--- [ anon ]
00000000427ec000 4 - - - ----- [ anon ]
00000000427ed000 128 - - - rw--- [ anon ]
000000004280d000 4 - - - ----- [ anon ]
000000004280e000 28 - - - rw--- [ anon ]
0000000042815000 4 - - - ----- [ anon ]
0000000042816000 28 - - - rw--- [ anon ]
000000004281d000 4 - - - ----- [ anon ]
000000004281e000 28 - - - rw--- [ anon ]
0000000042825000 4 - - - ----- [ anon ]
0000000042826000 28 - - - rw--- [ anon ]
000000004282d000 4 - - - ----- [ anon ]
000000004282e000 28 - - - rw--- [ anon ]
0000000042835000 4 - - - ----- [ anon ]
0000000042836000 28 - - - rw--- [ anon ]
000000004283d000 4 - - - ----- [ anon ]
000000004283e000 28 - - - rw--- [ anon ]
0000000042845000 4 - - - ----- [ anon ]
0000000042846000 28 - - - rw--- [ anon ]
000000004284d000 4 - - - ----- [ anon ]
000000004284e000 28 - - - rw--- [ anon ]
0000000300000000 524288 - - - rw-s- 9 (deleted)
0000000330000000 131072 - - - rw-s- 6 (deleted)
00000003c0000000 98304 - - - rw-s- 8 (deleted)
00000003d8000000 169984 - - - rw-s- 5 (deleted)
00000003f0000000 2048 - - - rw-s- 1 (deleted)
00000003f0400000 2048 - - - rw-s- 7 (deleted)
0000000400000000 1048576 - - - rw-s- 3 (deleted)
0000000580000000 262144 - - - rw-s- 4 (deleted)
0000000600000000 131072 - - - rw-s- 2 (deleted)
00000032d5400000 112 - - - r-x-- ld-2.5.so
00000032d561b000 4 - - - r---- ld-2.5.so
00000032d561c000 4 - - - rw--- ld-2.5.so
00000032d5800000 1328 - - - r-x-- libc-2.5.so
00000032d594c000 2048 - - - ----- libc-2.5.so
00000032d5b4c000 16 - - - r---- libc-2.5.so
00000032d5b50000 4 - - - rw--- libc-2.5.so
00000032d5b51000 20 - - - rw--- [ anon ]
00000032d6800000 520 - - - r-x-- libm-2.5.so
00000032d6882000 2044 - - - ----- libm-2.5.so
00000032d6a81000 4 - - - r---- libm-2.5.so
00000032d6a82000 4 - - - rw--- libm-2.5.so
00000032d6c00000 88 - - - r-x-- libpthread-2.5.so
00000032d6c16000 2044 - - - ----- libpthread-2.5.so
00000032d6e15000 4 - - - r---- libpthread-2.5.so
00000032d6e16000 4 - - - rw--- libpthread-2.5.so
00000032d6e17000 16 - - - rw--- [ anon ]
00000032d7000000 28 - - - r-x-- librt-2.5.so
00000032d7007000 2048 - - - ----- librt-2.5.so
00000032d7207000 4 - - - r---- librt-2.5.so
00000032d7208000 4 - - - rw--- librt-2.5.so
00002aaaac000000 132 - - - rw--- [ anon ]
00002aaaac021000 65404 - - - ----- [ anon ]
00002aaab0000000 132 - - - rw--- [ anon ]
00002aaab0021000 65404 - - - ----- [ anon ]
00002aaab4000000 132 - - - rw--- [ anon ]
00002aaab4021000 65404 - - - ----- [ anon ]
00002ad69510e000 4 - - - rw--- [ anon ]
00002ad695115000 4 - - - rw--- [ anon ]
00002ad695116000 944 - - - r-x-- libstdc++.so.6.0.10
00002ad695202000 1024 - - - ----- libstdc++.so.6.0.10
00002ad695302000 8 - - - r---- libstdc++.so.6.0.10
00002ad695304000 28 - - - rw--- libstdc++.so.6.0.10
00002ad69530b000 80 - - - rw--- [ anon ]
00002ad69531f000 88 - - - r-x-- libgcc_s.so.1
00002ad695335000 1020 - - - ----- libgcc_s.so.1
00002ad695434000 4 - - - rw--- libgcc_s.so.1
00002ad695435000 24776 - - - rw--- [ anon ]
00007fff1597d000 124 - - - rw--- [ stack ]
ffffffffff600000 8192 - - - ----- [ anon ]
---------------- ------ ------ ------ ------
total kB 2641188 - - -
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [Bugme-new] [Bug 13302] New: "bad pmd" on fork() of process with hugepage shared memory segments attached
@ 2009-05-15 5:32 starlight
2009-05-15 14:55 ` Mel Gorman
0 siblings, 1 reply; 25+ messages in thread
From: starlight @ 2009-05-15 5:32 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, linux-mm, bugzilla-daemon, bugme-daemon,
Adam Litke, Eric B Munson
[-- Attachment #1: Type: text/plain, Size: 1105 bytes --]
Whacked away at this, attempting to build a testcase from a
combination of the original daemon strace in the bug report
and knowledge of what the daemon is doing.
What emerged is something that will destroy RHEL5
2.6.18-128.1.6.el5 100% every time. Completely fills the kernel
message log with "bad pmd" errors and wrecks hugepages.
Unfortunately it only occasionally breaks 2.6.29.1. Haven't
been able to produce "bad pmd" messages, but did get the
kernel to think it's out of large page memory when in
theory it was not. Saw a lot of really strange accounting
in the hugepage section of /proc/meminfo.
For what it's worth, the testcase code is attached.
Note that hugepages=2048 is assumed--the bug seems to require
use of more than 50% of large page memory.
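The assumed 2048-page pool can be reserved either at boot or at runtime; a sketch (a runtime reservation may fall short of the target if memory is already fragmented):

```shell
# at boot: append to the kernel command line
#   hugepages=2048
# or at runtime:
echo 2048 > /proc/sys/vm/nr_hugepages
grep -i hugepages /proc/meminfo    # confirm HugePages_Total reached 2048
```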
Definitely will be posted under the RHEL5 bug report, which is
the more pressing issue here than far-future kernel support.
In addition, the original segment attach bug
http://bugzilla.kernel.org/show_bug.cgi?id=12134 is still there
and can be reproduced every time with the 'create_seg_strace'
and 'access_seg_straceX' sequences.
[-- Attachment #2: do_tcbm.txt --]
[-- Type: text/plain, Size: 28 bytes --]
g++ -Wall -g -o tcbm tcbm.C
[-- Attachment #3: tcbm.C.txt --]
[-- Type: text/plain, Size: 3873 bytes --]
extern "C" {
#include <errno.h>
#include <memory.h>
#include <signal.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <sched.h>
#include <sys/wait.h>
#include <sys/shm.h>
#include <sys/resource.h>
#include <sys/mman.h>
}
extern "C"
void child_signal_handler(
const int
)
{
int errno_save;
pid_t dead_pid;
int dead_status;
errno_save = errno;
do {
dead_pid = waitpid(-1, &dead_status, WNOHANG);
if (dead_pid == -1) {
if (errno == ECHILD) break;
perror("waitpid");
exit(1);
}
} while (dead_pid != 0);
errno = errno_save;
return;
}
int rabbits(void)
{
int pid = fork();
if (pid != 0) {
return 0;
} else {
const int sched_policy = sched_getscheduler(0);
if (sched_policy == -1) {
perror("sched_getscheduler");
}
if (sched_policy != SCHED_OTHER) {
sched_param sched;
memset(&sched, 0, sizeof(sched));
sched.sched_priority = 0;
if (sched_setscheduler(0, SCHED_OTHER, &sched) != 0) {
perror("sched_setscheduler");
}
}
errno = 0; // -1 return value legitimate
const int nice = getpriority(PRIO_PROCESS, 0);
if (errno != 0) {
perror("getpriority");
}
if (nice < -10) {
if (setpriority(PRIO_PROCESS, 0, -10) != 0) { // somewhat elevated
perror("setpriority");
}
}
char* program;
program = (char*) "script";
char* pargs[2];
pargs[0] = program;
pargs[1] = NULL;
execvp(program, pargs);
perror("execvp");
exit(1);
}
}
int main(
int argc,
const char** argv,
const char** envp
)
{
#if 1
sched_param sched;
memset(&sched, 0, sizeof(sched));
sched.sched_priority = 26;
if (sched_setscheduler(0, SCHED_RR, &sched) != 0) {
perror("sched_setscheduler(SCHED_RR, 26)");
return 1;
}
#endif
#if 0
if (mlockall(MCL_CURRENT|MCL_FUTURE) != 0) {
perror("mlockall");
return 1;
}
#endif
struct sigaction sas_child;
memset(&sas_child, 0, sizeof(sas_child));
sas_child.sa_handler = child_signal_handler;
if (sigaction(SIGCHLD, &sas_child, NULL) != 0) {
perror("sigaction(SIGCHLD)");
return 1;
}
int seg1id = shmget(0x12345600,
(size_t) 0xC0000000,
IPC_CREAT|SHM_HUGETLB|0640
);
if (seg1id == -1) {
perror("shmget(3GB)");
return 1;
}
void* seg1adr = shmat(seg1id, (void*) 0x400000000, 0);
if (seg1adr == (void*) -1) {
perror("shmat(3GB)");
return 1;
}
#if 1
memset(seg1adr, 0xFF, (size_t) 0x60000000);
if (mlock(seg1adr, (size_t) 0xC0000000) != 0) {
perror("mlock(3GB)");
return 1;
}
#endif
int seg2id = shmget(0x12345601,
(size_t) 0x40000000,
IPC_CREAT|SHM_HUGETLB|0640
);
if (seg2id == -1) {
perror("shmget(1GB)");
return 1;
}
void* seg2adr = shmat(seg2id, (void*) 0x500000000, 0);
if (seg2adr == (void*) -1) {
perror("shmat(1GB)");
return 1;
}
#if 1
memset(seg2adr, 0xFF, (size_t) 0x40000000);
if (mlock(seg2adr, (size_t) 0x40000000) != 0) {
perror("mlock(1GB)");
return 1;
}
#endif
for (int i1 = 0; i1 < 50; i1++) {
void* mmtarg = mmap(NULL,
528384,
PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS,
-1,
0
);
if (mmtarg == (void*) -1) {
perror("mmap");
return 1;
}
}
for (int i1 = 0; i1 < 50; i1++) {
rabbits();
usleep(500);
}
while (true) {
pause();
}
return 0;
}
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [Bugme-new] [Bug 13302] New: "bad pmd" on fork() of process with hugepage shared memory segments attached
2009-05-15 5:32 starlight
@ 2009-05-15 14:55 ` Mel Gorman
2009-05-15 15:02 ` starlight
0 siblings, 1 reply; 25+ messages in thread
From: Mel Gorman @ 2009-05-15 14:55 UTC (permalink / raw)
To: starlight
Cc: Andrew Morton, linux-mm, bugzilla-daemon, bugme-daemon,
Adam Litke, Eric B Munson
[-- Attachment #1: Type: text/plain, Size: 2553 bytes --]
On Fri, May 15, 2009 at 01:32:38AM -0400, starlight@binnacle.cx wrote:
> Whacked away at this, attempting to build a testcase from a
> combination of the original daemon strace in the bug report
> and knowledge of what the daemon is doing.
>
> What emerged is something that will destroy RHEL5
> 2.6.18-128.1.6.el5 100% every time. Completely fills the kernel
> message log with "bad pmd" errors and wrecks hugepages.
>
Ok, I can confirm that more or less. I reproduced the problem on 2.6.18-92.el5
on x86-64 running RHEL 5.2. I didn't have access to a machine with enough
memory though so I dropped the requirements slightly. It still triggered
a failure though.
However, when I ran 2.6.18, 2.6.19 and 2.6.29.1 on the same machine, I could
not reproduce the problem, nor could I cause hugepages to leak so I'm leaning
towards believing this is a distribution bug at the moment.
On the plus side, thanks to your good work, there is hopefully
enough available for them to bisect this problem.
> Unfortunately it only occasionally breaks 2.6.29.1. Haven't
> been able to produce "bad pmd" messages, but did get the
> kernel to think it's out of large page memory when in
> theory it was not. Saw a lot of really strange accounting
> in the hugepage section of /proc/meminfo.
>
What sort of strange accounting? The accounting has changed since 2.6.18
so I want to be sure you're really seeing something weird. When I was
testing, I didn't see anything out of the ordinary but maybe I'm looking
in a different place.
> For what it's worth, the testcase code is attached.
>
I cleaned the test up a bit and wrote a wrapper script to run this
multiple times while checking for hugepage leaks. I have it running in a
loop while the machine runs sysbench as a stress test, to see whether I
can cause anything out of the ordinary to happen. Nothing so far though.
> Note that hugepages=2048 is assumed--the bug seems to require
> use of more than 50% of large page memory.
>
> Definately will be posted under the RHEL5 bug report, which is
> the more pressing issue here than far-future kernel support.
>
If you've filed a RedHat bug, this modified testcase and wrapper script
might help them. The program exits and cleans up after itself, and the
memory requirements are lower. The script sets the machine up in a way that
breaks for me where the breakage is bad pmd messages and hugepages
leaking.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
[-- Attachment #2: test-tcbm.sh --]
[-- Type: application/x-sh, Size: 603 bytes --]
[-- Attachment #3: tcbm.c --]
[-- Type: text/x-csrc, Size: 4385 bytes --]
#include <errno.h>
#include <memory.h>
#include <signal.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <sched.h>
#include <sys/wait.h>
#include <sys/shm.h>
#include <sys/resource.h>
#include <sys/mman.h>
#define LARGE_SHARED_SEGMENT_KEY 0x12345600
#define LARGE_SHARED_SEGMENT_SIZE ((size_t)0x40000000)
#define LARGE_SHARED_SEGMENT_ADDR ((void *)0x40000000)
#define SMALL_SHARED_SEGMENT_KEY 0x12345601
#define SMALL_SHARED_SEGMENT_SIZE ((size_t)0x20000000)
#define SMALL_SHARED_SEGMENT_ADDR ((void *)0x94000000)
#define NUM_SMALL_BUFFERS 50
char *helper_program = "echo";
char *helper_args[] = { "-n", ".", NULL };
void child_signal_handler(const int unused)
{
int errno_save;
pid_t dead_pid;
int dead_status;
errno_save = errno;
do {
dead_pid = waitpid(-1, &dead_status, WNOHANG);
if (dead_pid == -1) {
if (errno == ECHILD)
break;
perror("waitpid");
exit(EXIT_FAILURE);
}
} while (dead_pid != 0);
errno = errno_save;
return;
}
int rabbits(void)
{
int sched_policy;
int pid;
pid = fork();
if (pid != 0)
return 0;
sched_policy = sched_getscheduler(0);
if (sched_policy == -1)
perror("sched_getscheduler");
/* Set the child's policy to SCHED_OTHER */
if (sched_policy != SCHED_OTHER) {
struct sched_param sched;
memset(&sched, 0, sizeof(sched));
sched.sched_priority = 0;
if (sched_setscheduler(0, SCHED_OTHER, &sched) != 0)
perror("sched_setscheduler");
}
/* Set the priority of the process */
errno = 0;
const int nice = getpriority(PRIO_PROCESS, 0);
if (errno != 0)
perror("getpriority");
if (nice < -10)
if (setpriority(PRIO_PROCESS, 0, -10) != 0)
perror("setpriority");
/* Launch helper program */
execvp(helper_program, helper_args);
perror("execvp");
exit(EXIT_FAILURE);
}
int main(int argc, const char** argv, const char** envp)
{
struct sched_param sched;
struct sigaction sas_child;
int i;
/* Set the round robin scheduler */
memset(&sched, 0, sizeof(sched));
sched.sched_priority = 26;
if (sched_setscheduler(0, SCHED_RR, &sched) != 0) {
perror("sched_setscheduler(SCHED_RR, 26)");
return 1;
}
/* Set a signal handler for children exiting */
memset(&sas_child, 0, sizeof(sas_child));
sas_child.sa_handler = child_signal_handler;
if (sigaction(SIGCHLD, &sas_child, NULL) != 0) {
perror("sigaction(SIGCHLD)");
return 1;
}
/* Create a large shared memory segment */
int seg1id = shmget(LARGE_SHARED_SEGMENT_KEY,
LARGE_SHARED_SEGMENT_SIZE,
IPC_CREAT|SHM_HUGETLB|0640);
if (seg1id == -1) {
perror("shmget(LARGE_SEGMENT)");
return 1;
}
/* Attach at the 1GB address (the original test attached at 16GB) */
void* seg1adr = shmat(seg1id, LARGE_SHARED_SEGMENT_ADDR, 0);
if (seg1adr == (void*)-1) {
perror("shmat(LARGE_SEGMENT)");
return 1;
}
/* Initialise the start of the segment and mlock it */
memset(seg1adr, 0xFF, LARGE_SHARED_SEGMENT_SIZE/2);
if (mlock(seg1adr, LARGE_SHARED_SEGMENT_SIZE) != 0) {
perror("mlock(LARGE_SEGMENT)");
return 1;
}
/* Create a second smaller segment */
int seg2id = shmget(SMALL_SHARED_SEGMENT_KEY,
SMALL_SHARED_SEGMENT_SIZE,
IPC_CREAT|SHM_HUGETLB|0640);
if (seg2id == -1) {
perror("shmget(SMALL_SEGMENT)");
return 1;
}
/* Attach small segment */
void *seg2adr = shmat(seg2id, SMALL_SHARED_SEGMENT_ADDR, 0);
if (seg2adr == (void*) -1) {
perror("shmat(SMALL_SEGMENT)");
return 1;
}
/* Initialise all of small segment and mlock */
memset(seg2adr, 0xFF, (size_t) SMALL_SHARED_SEGMENT_SIZE);
if (mlock(seg2adr, (size_t) SMALL_SHARED_SEGMENT_SIZE) != 0) {
perror("mlock(SMALL_SEGMENT)");
return 1;
}
/* Create a number of approximately 516K buffers */
for (i = 0; i < NUM_SMALL_BUFFERS; i++) {
void* mmtarg = mmap(NULL, 528384,
PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS,
-1, 0);
if (mmtarg == (void*) -1) {
perror("mmap");
return 1;
}
}
/* Create one child per small buffer */
for (i = 0; i < NUM_SMALL_BUFFERS; i++) {
rabbits();
usleep(500);
}
/* Wait until children stop signalling */
printf("Waiting for children\n");
while (sleep(3) != 0);
/* Detach */
if (shmdt(seg1adr) == -1)
perror("shmdt(LARGE_SEGMENT)");
if (shmdt(seg2adr) == -1)
perror("shmdt(SMALL_SEGMENT)");
if (shmctl(seg1id, IPC_RMID, NULL) == -1)
perror("shmrm(LARGE_SEGMENT)");
if (shmctl(seg2id, IPC_RMID, NULL) == -1)
perror("shmrm(SMALL_SEGMENT)");
printf("Done\n");
return 0;
}
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [Bugme-new] [Bug 13302] New: "bad pmd" on fork() of process with hugepage shared memory segments attached
2009-05-15 14:55 ` Mel Gorman
@ 2009-05-15 15:02 ` starlight
0 siblings, 0 replies; 25+ messages in thread
From: starlight @ 2009-05-15 15:02 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, linux-mm, bugzilla-daemon, bugme-daemon,
Adam Litke, Eric B Munson
At 03:55 PM 5/15/2009 +0100, Mel Gorman wrote:
>On Fri, May 15, 2009 at 01:32:38AM -0400, starlight@binnacle.cx
>wrote:
>> Whacked away at this, attempting to build a testcase from a
>> combination of the original daemon strace in the bug report
>> and knowledge of what the daemon is doing.
>>
>> What emerged is something that will destroy RHEL5
>> 2.6.18-128.1.6.el5 100% every time. Completely fills the kernel
>> message log with "bad pmd" errors and wrecks hugepages.
>
>Ok, I can confirm that more or less. I reproduced the problem on
>2.6.18-92.el5 on x86-64 running RHEL 5.2. I didn't have access
>to a machine with enough memory though so I dropped the
>requirements slightly. It still triggered a failure though.
>
>However, when I ran 2.6.18, 2.6.19 and 2.6.29.1 on the same
>machine, I could not reproduce the problem, nor could I cause
>hugepages to leak so I'm leaning towards believing this is a
>distribution bug at the moment.
>
>On the plus side, thanks to your good work, there is hopefully
>enough available for them to bisect this problem.
Good to hear that the testcase works on other machines.
>> Unfortunately it only occasionally breaks 2.6.29.1. Haven't
>> been able to produce "bad pmd" messages, but did get the
>> kernel to think it's out of large page memory when in
>> theory it was not. Saw a lot of really strange accounting
>> in the hugepage section of /proc/meminfo.
>>
>What sort of strange accounting? The accounting has changed
>since 2.6.18 so I want to be sure you're really seeing something
>weird. When I was testing, I didn't see anything out of the
>ordinary but maybe I'm looking in a different place.
Saw things like both free and used set to zero, used set to 2048
when it should not have been (in association with the failure).
Often the counters would correct themselves after segments were
removed with 'ipcs'. Sometimes not--usually when it broke.
Also saw some truly insane usage counts like 32520 and less
egregious off-by-one-or-two inaccuracies.
>> For what it's worth, the testcase code is attached.
>>
>I cleaned the test up a bit and wrote a wrapper script to run
>this multiple times while checking for hugepage leaks. I have it
>running in a loop while the machine runs sysbench as a stress
>test, to see whether I can cause anything out of the ordinary to happen.
>Nothing so far though.
>
>> Note that hugepages=2048 is assumed--the bug seems to require
>> use of more than 50% of large page memory.
>>
>> Definitely will be posted under the RHEL5 bug report, which is
>> the more pressing issue here than far-future kernel support.
>>
>If you've filed a RedHat bug, this modified testcase and wrapper
>script might help them. The program exits and cleans up after
>itself, and the memory requirements are lower. The script sets the
>machine up in a way that breaks for me where the breakage is bad
>pmd messages and hugepages leaking.
Thank you for your efforts. Could you post to the RH bug along
with a back-reference to this? Might improve the chances
someone will pay attention to it. It's at
https://bugzilla.redhat.com/show_bug.cgi?id=497653
In a week or two I'll see if I can make time to turn the 100%
failure scenario into a testcase. This is just the run of a
segment loader followed by running a status checker three times.
In 2.6.29.1 I'm wondering if the "bad pmd" I saw was just a bit
of bad memory, so might as well focus on the thing that fails
with certainty. Possibly the "bad pmd" case requires a few hours
of live data runtime before it emerges--a tougher nut.
--
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [Bugme-new] [Bug 13302] New: "bad pmd" on fork() of process with hugepage shared memory segments attached
@ 2009-05-15 18:44 starlight
2009-05-18 16:36 ` Mel Gorman
0 siblings, 1 reply; 25+ messages in thread
From: starlight @ 2009-05-15 18:44 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, linux-mm, bugzilla-daemon, bugme-daemon,
Adam Litke, Eric B Munson
[-- Attachment #1: Type: text/plain, Size: 476 bytes --]
This was really bugging me, so I hacked out
the test case for the attach failure.
Hoses 2.6.29.1 100% every time. Run it like this:
tcbm_att
tcbm_att -
tcbm_att -
tcbm_att -
It will break on the last iteration with ENOMEM
and ENOMEM is all any shmget() or shmat() call
gets forever more.
After removing the segments this appears:
HugePages_Total: 2048
HugePages_Free: 2048
HugePages_Rsvd: 1280
HugePages_Surp: 0
Even though no segments show in 'ipcs'.
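For anyone reproducing this, the counters can be read directly from the
standard procfs interface (the grep pattern is just a convenience):

```shell
# Check the hugepage counters after removing all segments; on an
# affected kernel HugePages_Rsvd stays non-zero even though
# 'ipcs -m' lists no remaining segments.
grep -E '^HugePages_(Total|Free|Rsvd|Surp)' /proc/meminfo
```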
[-- Attachment #2: tcbm_att.C.txt --]
[-- Type: text/plain, Size: 2429 bytes --]
extern "C" {
#include <errno.h>
#include <memory.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/shm.h>
}
int main(
int argc,
const char** argv,
const char** envp
)
{
if (argc == 1) {
int seg1id = shmget(0x12345600,
(size_t) 0x40000000,
IPC_CREAT|SHM_HUGETLB|0640
);
if (seg1id == -1) {
perror("shmget(1GB)");
return 1;
}
void* seg1adr = shmat(seg1id, (void*) 0x400000000, 0);
if (seg1adr == (void*) -1) {
perror("shmat(1GB)");
return 1;
}
int seg2id = shmget(0x12345601,
(size_t) 0x10000000,
IPC_CREAT|SHM_HUGETLB|0640
);
if (seg2id == -1) {
perror("shmget(256MB)");
return 1;
}
void* seg2adr = shmat(seg2id, (void*) 0x580000000, 0);
if (seg2adr == (void*) -1) {
perror("shmat(256MB)");
return 1;
}
char* seg_p = (char*) seg1adr;
int i1 = 182;
while (i1 > 0) {
memset(seg_p, 0x55, 0x400000);
seg_p += 0x400000;
i1--;
}
seg_p = (char*) seg2adr;
i1 = 6;
while (i1 > 0) {
memset(seg_p, 0xAA, 0x400000);
seg_p += 0x400000;
i1--;
}
if (shmdt((void*) 0x400000000) != 0) {
perror("shmdt(1GB)");
return 1;
}
if (shmdt((void*) 0x580000000) != 0) {
perror("shmdt(256MB)");
return 1;
}
} else {
int seg1id = shmget(0x12345600, 0, 0);
if (seg1id == -1) {
perror("shmget(1GB)");
return 1;
}
void* seg1adr = shmat(seg1id, (void*) 0x400000000, SHM_RDONLY);
if (seg1adr == (void*) -1) {
perror("shmat(1GB)");
return 1;
}
int seg2id = shmget(0x12345601, 0, 0);
if (seg2id == -1) {
perror("shmget(256MB)");
return 1;
}
void* seg2adr = shmat(seg2id, (void*) 0x580000000, SHM_RDONLY);
if (seg2adr == (void*) -1) {
perror("shmat(256MB)");
return 1;
}
if (shmdt((void*) 0x400000000) != 0) {
perror("shmdt(1GB)");
return 1;
}
if (shmdt((void*) 0x580000000) != 0) {
perror("shmdt(256MB)");
return 1;
}
}
return 0;
}
[-- Attachment #3: do_tcbm_att.txt --]
[-- Type: application/octet-stream, Size: 36 bytes --]
g++ -Wall -g -o tcbm_att tcbm_att.C
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [Bugme-new] [Bug 13302] New: "bad pmd" on fork() of process with hugepage shared memory segments attached
@ 2009-05-15 18:53 starlight
2009-05-20 11:35 ` Mel Gorman
0 siblings, 1 reply; 25+ messages in thread
From: starlight @ 2009-05-15 18:53 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, linux-mm, bugzilla-daemon, bugme-daemon,
Adam Litke, Eric B Munson
Here's another possible clue:
I tried the first 'tcbm' testcase on a 2.6.27.7
kernel that was hanging around from a few months
ago and it breaks it 100% of the time.
Completely hoses huge memory. Enough "bad pmd"
errors to fill the kernel log.
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [Bugme-new] [Bug 13302] New: "bad pmd" on fork() of process with hugepage shared memory segments attached
2009-05-15 18:44 starlight
@ 2009-05-18 16:36 ` Mel Gorman
0 siblings, 0 replies; 25+ messages in thread
From: Mel Gorman @ 2009-05-18 16:36 UTC (permalink / raw)
To: starlight
Cc: Andrew Morton, linux-mm, bugzilla-daemon, bugme-daemon,
Adam Litke, Eric B Munson
On Fri, May 15, 2009 at 02:44:29PM -0400, starlight@binnacle.cx wrote:
> This was really bugging me, so I hacked out
> the test case for the attach failure.
>
> Hoses 2.6.29.1 100% every time. Run it like this:
>
> tcbm_att
> tcbm_att -
> tcbm_att -
> tcbm_att -
>
> It will break on the last iteration with ENOMEM
> and ENOMEM is all any shmget() or shmat() call
> gets forever more.
>
> After removing the segments this appears:
>
> HugePages_Total: 2048
> HugePages_Free: 2048
> HugePages_Rsvd: 1280
> HugePages_Surp: 0
>
Ok, the critical fact was that one process mapped the segment read-write
and populated it, while each subsequent process mapped it read-only. The
core VM sets VM_SHARED on a file-backed shared mapping only when it is
read-write, not when it is read-only. Hugetlbfs was checking VM_SHARED
where it really meant "was this mapping MAP_SHARED", for which the
correct flag is VM_MAYSHARE. A straightforward mistake, with the
consequence that reservations "leaked" and future mappings failed as a
result.
Can you try this patch out please? It is against 2.6.29.1 and mostly
applies to 2.6.27.7. The reject is trivially resolved by editing
mm/hugetlb.c and changing the VM_SHARED at the end of
hugetlb_reserve_pages() to VM_MAYSHARE.
Thing is, this patch fixes a reservation issue. The bad pmd messages do
show up for the original test on 2.6.27.7 for x86-64 (not x86) but it's a
separate issue and I have not determined what it is yet. Can you test this
patch to begin with please?
==== CUT HERE ====
Account for MAP_SHARED mappings using VM_MAYSHARE and not VM_SHARED in hugetlbfs
hugetlbfs reserves huge pages and accounts for them differently depending on
whether the mapping was mapped MAP_SHARED or MAP_PRIVATE. However, the check
it makes against the VMA in some places is VM_SHARED and not VM_MAYSHARE.
For file-backed mappings, such as hugetlbfs, VM_SHARED is set only if the
mapping is MAP_SHARED *and* it is read-write. If a shared memory mapping
was created read-write for populating of data and mapped read-only by other
processes, then hugetlbfs gets the accounting wrong and reservations leak.
This patch alters mm/hugetlb.c and replaces VM_SHARED with VM_MAYSHARE when
the intent of the code was to check whether the VMA was mapped MAP_SHARED
or MAP_PRIVATE.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
mm/hugetlb.c | 26 +++++++++++++-------------
1 file changed, 13 insertions(+), 13 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 28c655b..e83ad2c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -316,7 +316,7 @@ static void resv_map_release(struct kref *ref)
static struct resv_map *vma_resv_map(struct vm_area_struct *vma)
{
VM_BUG_ON(!is_vm_hugetlb_page(vma));
- if (!(vma->vm_flags & VM_SHARED))
+ if (!(vma->vm_flags & VM_MAYSHARE))
return (struct resv_map *)(get_vma_private_data(vma) &
~HPAGE_RESV_MASK);
return NULL;
@@ -325,7 +325,7 @@ static struct resv_map *vma_resv_map(struct vm_area_struct *vma)
static void set_vma_resv_map(struct vm_area_struct *vma, struct resv_map *map)
{
VM_BUG_ON(!is_vm_hugetlb_page(vma));
- VM_BUG_ON(vma->vm_flags & VM_SHARED);
+ VM_BUG_ON(vma->vm_flags & VM_MAYSHARE);
set_vma_private_data(vma, (get_vma_private_data(vma) &
HPAGE_RESV_MASK) | (unsigned long)map);
@@ -334,7 +334,7 @@ static void set_vma_resv_map(struct vm_area_struct *vma, struct resv_map *map)
static void set_vma_resv_flags(struct vm_area_struct *vma, unsigned long flags)
{
VM_BUG_ON(!is_vm_hugetlb_page(vma));
- VM_BUG_ON(vma->vm_flags & VM_SHARED);
+ VM_BUG_ON(vma->vm_flags & VM_MAYSHARE);
set_vma_private_data(vma, get_vma_private_data(vma) | flags);
}
@@ -353,7 +353,7 @@ static void decrement_hugepage_resv_vma(struct hstate *h,
if (vma->vm_flags & VM_NORESERVE)
return;
- if (vma->vm_flags & VM_SHARED) {
+ if (vma->vm_flags & VM_MAYSHARE) {
/* Shared mappings always use reserves */
h->resv_huge_pages--;
} else if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
@@ -369,14 +369,14 @@ static void decrement_hugepage_resv_vma(struct hstate *h,
void reset_vma_resv_huge_pages(struct vm_area_struct *vma)
{
VM_BUG_ON(!is_vm_hugetlb_page(vma));
- if (!(vma->vm_flags & VM_SHARED))
+ if (!(vma->vm_flags & VM_MAYSHARE))
vma->vm_private_data = (void *)0;
}
/* Returns true if the VMA has associated reserve pages */
static int vma_has_reserves(struct vm_area_struct *vma)
{
- if (vma->vm_flags & VM_SHARED)
+ if (vma->vm_flags & VM_MAYSHARE)
return 1;
if (is_vma_resv_set(vma, HPAGE_RESV_OWNER))
return 1;
@@ -924,7 +924,7 @@ static long vma_needs_reservation(struct hstate *h,
struct address_space *mapping = vma->vm_file->f_mapping;
struct inode *inode = mapping->host;
- if (vma->vm_flags & VM_SHARED) {
+ if (vma->vm_flags & VM_MAYSHARE) {
pgoff_t idx = vma_hugecache_offset(h, vma, addr);
return region_chg(&inode->i_mapping->private_list,
idx, idx + 1);
@@ -949,7 +949,7 @@ static void vma_commit_reservation(struct hstate *h,
struct address_space *mapping = vma->vm_file->f_mapping;
struct inode *inode = mapping->host;
- if (vma->vm_flags & VM_SHARED) {
+ if (vma->vm_flags & VM_MAYSHARE) {
pgoff_t idx = vma_hugecache_offset(h, vma, addr);
region_add(&inode->i_mapping->private_list, idx, idx + 1);
@@ -1893,7 +1893,7 @@ retry_avoidcopy:
* at the time of fork() could consume its reserves on COW instead
* of the full address range.
*/
- if (!(vma->vm_flags & VM_SHARED) &&
+ if (!(vma->vm_flags & VM_MAYSHARE) &&
is_vma_resv_set(vma, HPAGE_RESV_OWNER) &&
old_page != pagecache_page)
outside_reserve = 1;
@@ -2000,7 +2000,7 @@ retry:
clear_huge_page(page, address, huge_page_size(h));
__SetPageUptodate(page);
- if (vma->vm_flags & VM_SHARED) {
+ if (vma->vm_flags & VM_MAYSHARE) {
int err;
struct inode *inode = mapping->host;
@@ -2104,7 +2104,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
goto out_mutex;
}
- if (!(vma->vm_flags & VM_SHARED))
+ if (!(vma->vm_flags & VM_MAYSHARE))
pagecache_page = hugetlbfs_pagecache_page(h,
vma, address);
}
@@ -2289,7 +2289,7 @@ int hugetlb_reserve_pages(struct inode *inode,
* to reserve the full area even if read-only as mprotect() may be
* called to make the mapping read-write. Assume !vma is a shm mapping
*/
- if (!vma || vma->vm_flags & VM_SHARED)
+ if (!vma || vma->vm_flags & VM_MAYSHARE)
chg = region_chg(&inode->i_mapping->private_list, from, to);
else {
struct resv_map *resv_map = resv_map_alloc();
@@ -2330,7 +2330,7 @@ int hugetlb_reserve_pages(struct inode *inode,
* consumed reservations are stored in the map. Hence, nothing
* else has to be done for private mappings here
*/
- if (!vma || vma->vm_flags & VM_SHARED)
+ if (!vma || vma->vm_flags & VM_MAYSHARE)
region_add(&inode->i_mapping->private_list, from, to);
return 0;
}
^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [Bugme-new] [Bug 13302] New: "bad pmd" on fork() of process with hugepage shared memory segments attached
2009-05-15 18:53 [Bugme-new] [Bug 13302] New: "bad pmd" on fork() of process with hugepage shared memory segments attached starlight
@ 2009-05-20 11:35 ` Mel Gorman
2009-05-20 14:29 ` Mel Gorman
2009-05-20 14:53 ` Lee Schermerhorn
0 siblings, 2 replies; 25+ messages in thread
From: Mel Gorman @ 2009-05-20 11:35 UTC (permalink / raw)
To: starlight
Cc: Andrew Morton, linux-mm, bugzilla-daemon, bugme-daemon,
Adam Litke, Eric B Munson, riel, lee.schermerhorn
[-- Attachment #1: Type: text/plain, Size: 1565 bytes --]
On Fri, May 15, 2009 at 02:53:27PM -0400, starlight@binnacle.cx wrote:
> Here's another possible clue:
>
> I tried the first 'tcbm' testcase on a 2.6.27.7
> kernel that was hanging around from a few months
> ago and it breaks it 100% of the time.
>
> Completely hoses huge memory. Enough "bad pmd"
> errors to fill the kernel log.
>
So I investigated what's wrong with 2.6.27.7. The problem is a race between
exec() and the handling of mlock()ed VMAs but I can't see where. The normal
teardown of pages is applied to a shared memory segment as if VM_HUGETLB
was not set.
This was fixed between 2.6.27 and 2.6.28 but apparently by accident during the
introduction of CONFIG_UNEVICTABLE_LRU. This patchset made a number of changes
to how mlock()ed pages are handled but I didn't spot which was the relevant change
that fixed the problem and reverse bisecting didn't help. I've added two people
that were working on the unevictable LRU patches to see if they spot something.
For context, the two attached files are used to reproduce a problem
where bad pmd messages are scribbled all over the console on 2.6.27.7.
Do something like
echo 64 > /proc/sys/vm/nr_hugepages
mount -t hugetlbfs none /mnt
sh ./test-tcbm.sh
I did confirm that it didn't matter to 2.6.29.1 whether CONFIG_UNEVICTABLE_LRU is
set or not. It's possible the race is still there but I don't know where
it is.
Any ideas where the race might be?
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
[-- Attachment #2: tcbm.c --]
[-- Type: text/x-csrc, Size: 4618 bytes --]
#include <errno.h>
#include <fcntl.h>
#include <memory.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <signal.h>
#include <sched.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <sys/shm.h>
#include <sys/resource.h>
#include <sys/mman.h>
#define LARGE_SHARED_SEGMENT_KEY 0x12345600
#define LARGE_SHARED_SEGMENT_SIZE ((size_t)0x40000000)
#define LARGE_SHARED_SEGMENT_ADDR ((void *)0x40000000)
#define SMALL_SHARED_SEGMENT_KEY 0x12345601
#define SMALL_SHARED_SEGMENT_SIZE ((size_t)0x20000000)
#define SMALL_SHARED_SEGMENT_ADDR ((void *)0x94000000)
#define NUM_SMALL_BUFFERS 50
char *helper_program = "echo";
char *helper_args[] = { "echo", "-n", ".", NULL };	/* argv[0], then flags */
void child_signal_handler(const int unused)
{
int errno_save;
pid_t dead_pid;
int dead_status;
errno_save = errno;
do {
dead_pid = waitpid(-1, &dead_status, WNOHANG);
if (dead_pid == -1) {
if (errno == ECHILD)
break;
perror("waitpid");
exit(EXIT_FAILURE);
}
} while (dead_pid != 0);
errno = errno_save;
return;
}
int rabbits(void)
{
int sched_policy;
int pid;
pid = fork();
if (pid != 0)
return 0;
sched_policy = sched_getscheduler(0);
if (sched_policy == -1)
perror("sched_getscheduler");
/* Set the childs policy to SCHED_OTHER */
if (sched_policy != SCHED_OTHER) {
struct sched_param sched;
memset(&sched, 0, sizeof(sched));
sched.sched_priority = 0;
if (sched_setscheduler(0, SCHED_OTHER, &sched) != 0)
perror("sched_setscheduler");
}
/* Set the priority of the process */
errno = 0;
const int nice = getpriority(PRIO_PROCESS, 0);
if (errno != 0)
perror("getpriority");
if (nice < -10)
if (setpriority(PRIO_PROCESS, 0, -10) != 0)
perror("setpriority");
/* Launch helper program */
execvp(helper_program, helper_args);
perror("execvp");
exit(EXIT_FAILURE);
}
int main(int argc, const char** argv, const char** envp)
{
struct sched_param sched;
struct sigaction sas_child;
int i;
/* Set the round robin scheduler */
memset(&sched, 0, sizeof(sched));
sched.sched_priority = 26;
if (sched_setscheduler(0, SCHED_RR, &sched) != 0) {
perror("sched_setscheduler(SCHED_RR, 26)");
return 1;
}
/* Set a signal handler for children exiting */
memset(&sas_child, 0, sizeof(sas_child));
sas_child.sa_handler = child_signal_handler;
if (sigaction(SIGCHLD, &sas_child, NULL) != 0) {
perror("sigaction(SIGCHLD)");
return 1;
}
/* Create a large shared memory segment */
int seg1id = shmget(LARGE_SHARED_SEGMENT_KEY,
LARGE_SHARED_SEGMENT_SIZE,
IPC_CREAT|SHM_HUGETLB|0640);
if (seg1id == -1) {
perror("shmget(LARGE_SEGMENT)");
return 1;
}
/* Attach at the 1GB address */
void* seg1adr = shmat(seg1id, LARGE_SHARED_SEGMENT_ADDR, 0);
if (seg1adr == (void*)-1) {
perror("shmat(LARGE_SEGMENT)");
return 1;
}
/* Initialise the start of the segment and mlock it */
memset(seg1adr, 0xFF, LARGE_SHARED_SEGMENT_SIZE/2);
if (mlock(seg1adr, LARGE_SHARED_SEGMENT_SIZE) != 0) {
perror("mlock(LARGE_SEGMENT)");
return 1;
}
/* Create a second smaller segment */
int seg2id = shmget(SMALL_SHARED_SEGMENT_KEY,
SMALL_SHARED_SEGMENT_SIZE,
IPC_CREAT|SHM_HUGETLB|0640);
if (seg2id == -1) {
perror("shmget(SMALL_SEGMENT)");
return 1;
}
/* Attach small segment */
void *seg2adr = shmat(seg2id, SMALL_SHARED_SEGMENT_ADDR, 0);
if (seg2adr == (void*) -1) {
perror("shmat(SMALL_SEGMENT)");
return 1;
}
/* Initialise all of small segment and mlock */
memset(seg2adr, 0xFF, (size_t) SMALL_SHARED_SEGMENT_SIZE);
/*
if (mlock(seg2adr, (size_t) SMALL_SHARED_SEGMENT_SIZE) != 0) {
perror("mlock(SMALL_SEGMENT)");
return 1;
}
*/
/* Create a number of approximately 516K buffers */
for (i = 0; i < NUM_SMALL_BUFFERS; i++) {
void* mmtarg = mmap(NULL, 528384,
PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS,
-1, 0);
if (mmtarg == (void*) -1) {
perror("mmap");
return 1;
}
}
/* Dump maps */
{
char buf[4097];
int bytes;
int fd = open("/proc/self/maps", O_RDONLY);
while ((bytes = read(fd, buf, 4096)) > 0) {
buf[bytes] = '\0';	/* NUL-terminate; read() does not */
printf("%s", buf);
}
close(fd);
}
/* Create one child per small buffer */
for (i = 0; i < NUM_SMALL_BUFFERS; i++) {
rabbits();
usleep(500);
}
/* Wait until children shut up signalling */
printf("Waiting for children\n");
while (sleep(3) != 0);
/* Detach */
if (shmdt(seg1adr) == -1)
perror("shmdt(LARGE_SEGMENT)");
if (shmdt(seg2adr) == -1)
perror("shmdt(SMALL_SEGMENT)");
if (shmctl(seg1id, IPC_RMID, NULL) == -1)
perror("shmrm(LARGE_SEGMENT)");
if (shmctl(seg2id, IPC_RMID, NULL) == -1)
perror("shmrm(SMALL_SEGMENT)");
printf("Done\n");
return 0;
}
[-- Attachment #3: test-tcbm.sh --]
[-- Type: application/x-sh, Size: 603 bytes --]
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [Bugme-new] [Bug 13302] New: "bad pmd" on fork() of process with hugepage shared memory segments attached
2009-05-20 11:35 ` Mel Gorman
@ 2009-05-20 14:29 ` Mel Gorman
2009-05-20 14:53 ` Lee Schermerhorn
1 sibling, 0 replies; 25+ messages in thread
From: Mel Gorman @ 2009-05-20 14:29 UTC (permalink / raw)
To: starlight
Cc: Andrew Morton, linux-mm, bugzilla-daemon, bugme-daemon,
Adam Litke, Eric B Munson, riel, lee.schermerhorn, npiggin
On Wed, May 20, 2009 at 12:35:25PM +0100, Mel Gorman wrote:
> On Fri, May 15, 2009 at 02:53:27PM -0400, starlight@binnacle.cx wrote:
> > Here's another possible clue:
> >
> > I tried the first 'tcbm' testcase on a 2.6.27.7
> > kernel that was hanging around from a few months
> > ago and it breaks it 100% of the time.
> >
> > Completely hoses huge memory. Enough "bad pmd"
> > errors to fill the kernel log.
> >
>
> So I investigated what's wrong with 2.6.27.7. The problem is a race between
> exec() and the handling of mlock()ed VMAs but I can't see where. The normal
> teardown of pages is applied to a shared memory segment as if VM_HUGETLB
> was not set.
>
> This was fixed between 2.6.27 and 2.6.28 but apparently by accident during the
> introduction of CONFIG_UNEVICTABLE_LRU. This patchset made a number of changes
> to how mlock()ed pages are handled but I didn't spot which was the relevant change
> that fixed the problem and reverse bisecting didn't help. I've added two people
> that were working on the unevictable LRU patches to see if they spot something.
>
> For context, the two attached files are used to reproduce a problem
> where bad pmd messages are scribbled all over the console on 2.6.27.7.
> Do something like
>
> echo 64 > /proc/sys/vm/nr_hugepages
> mount -t hugetlbfs none /mnt
> sh ./test-tcbm.sh
>
> I did confirm that it didn't matter to 2.6.29.1 whether CONFIG_UNEVICTABLE_LRU is
> set or not. It's possible the race is still there but I don't know where
> it is.
>
> Any ideas where the race might be?
>
With all the grace of a drunken elephant in a china shop, I gave up on being
clever as it wasn't working and brute-force attacked this to make a list of the
commits needed for CONFIG_UNEVICTABLE_LRU on top of 2.6.27.7. This is the list:
# Prereq commits for UNEVICT patches to apply
b69408e88bd86b98feb7b9a38fd865e1ddb29827 vmscan: Use an indexed array for LRU variabl
62695a84eb8f2e718bf4dfb21700afaa7a08e0ea vmscan: move isolate_lru_page() to vmscan.c
f04e9ebbe4909f9a41efd55149bc353299f4e83b swap: use an array for the LRU pagevecs
68a22394c286a2daf06ee8d65d8835f738faefa5 vmscan: free swap space on swap-in/activation
b2e185384f534781fd22f5ce170b2ad26f97df70 define page_file_cache() function
4f98a2fee8acdb4ac84545df98cccecfd130f8db vmscan: split LRU lists into anon & file sets
556adecba110bf5f1db6c6b56416cfab5bcab698 vmscan: second chance replacement
7e9cd484204f9e5b316ed35b241abf088d76e0af vmscan: fix pagecache reclaim referenced
33c120ed2843090e2bd316de1588b8bf8b96cbde more aggressively use lumpy reclaim
# Part 1: Initial patches for UNEVICTABLE_LRU
8a7a8544a4f6554ec2d8048ac9f9672f442db5a2 pageflag helpers for configed-out flags
894bc310419ac95f4fa4142dc364401a7e607f65 Unevictable LRU Infrastructure
bbfd28eee9fbd73e780b19beb3dc562befbb94fa unevictable lru: add event counting with stat
7b854121eb3e5ba0241882ff939e2c485228c9c5 Unevictable LRU Page Statistics
ba9ddf49391645e6bb93219131a40446538a5e76 Ramfs and Ram Disk pages are unevictable
89e004ea55abe201b29e2d6e35124101f1288ef7 SHM_LOCKED pages are unevictable
# Part 2: Critical patch that makes the problem go away
b291f000393f5a0b679012b39d79fbc85c018233 mlock: mlocked pages are unevictable
# Part 3: Rest of UNEVICTABLE_LRU
fa07e787733416c42938a310a8e717295934e33c doc: unevictable LRU and mlocked pages doc
8edb08caf68184fb170f4f69c7445929e199eaea mlock: downgrade mmap sem while pop mlock
ba470de43188cdbff795b5da43a1474523c6c2fb mmap: handle mlocked pages during map, remap
5344b7e648980cc2ca613ec03a56a8222ff48820 vmstat: mlocked pages statistics
64d6519dda3905dfb94d3f93c07c5f263f41813f swap: cull unevictable pages in fault path
af936a1606246a10c145feac3770f6287f483f02 vmscan: unevictable LRU scan sysctl
985737cf2ea096ea946aed82c7484d40defc71a8 mlock: count attempts to free mlocked page
902d2e8ae0de29f483840ba1134af27343b9564d vmscan: kill unused lru functions
e0f79b8f1f3394bb344b7b83d6f121ac2af327de vmscan: don't accumulate scan pressure on un
c11d69d8c830e09a0e7b3935c952afb26c48bba8 mlock: revert mainline handling of mlock erro
9978ad583e100945b74e4f33e73317983ea32df9 mlock: make mlock error return Posixly Correct
I won't get the chance to start picking apart
b291f000393f5a0b679012b39d79fbc85c018233 to see what's so special in there
until Friday but maybe someone else will spot the magic before I do. Again,
it does not matter if UNEVICTABLE_LRU is set or not once that critical patch
is applied.
For what it's worth, this bug affects the SLES 11 kernel which is based on
2.6.27. I imagine they'd like to have this fixed but may not be so keen on
applying so many patches.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [Bugme-new] [Bug 13302] New: "bad pmd" on fork() of process with hugepage shared memory segments attached
2009-05-20 11:35 ` Mel Gorman
2009-05-20 14:29 ` Mel Gorman
@ 2009-05-20 14:53 ` Lee Schermerhorn
2009-05-20 15:05 ` Lee Schermerhorn
1 sibling, 1 reply; 25+ messages in thread
From: Lee Schermerhorn @ 2009-05-20 14:53 UTC (permalink / raw)
To: Mel Gorman
Cc: starlight, Andrew Morton, linux-mm, bugzilla-daemon, bugme-daemon,
Adam Litke, Eric B Munson, riel
On Wed, 2009-05-20 at 12:35 +0100, Mel Gorman wrote:
> On Fri, May 15, 2009 at 02:53:27PM -0400, starlight@binnacle.cx wrote:
> > Here's another possible clue:
> >
> > I tried the first 'tcbm' testcase on a 2.6.27.7
> > kernel that was hanging around from a few months
> > ago and it breaks it 100% of the time.
> >
> > Completely hoses huge memory. Enough "bad pmd"
> > errors to fill the kernel log.
> >
>
> So I investigated what's wrong with 2.6.27.7. The problem is a race between
> exec() and the handling of mlock()ed VMAs but I can't see where. The normal
> teardown of pages is applied to a shared memory segment as if VM_HUGETLB
> was not set.
>
> This was fixed between 2.6.27 and 2.6.28 but apparently by accident during the
> introduction of CONFIG_UNEVICTABLE_LRU. This patchset made a number of changes
> to how mlock()ed pages are handled but I didn't spot which was the relevant change
> that fixed the problem and reverse bisecting didn't help. I've added two people
> that were working on the unevictable LRU patches to see if they spot something.
Hi, Mel:
and still do. With the unevictable lru, mlock()/mmap('LOCKED) now move
the mlocked pages to the unevictable lru list and munlock, including at
exit, must rescue them from the unevictable list. Since hugepages are
not maintained on the lru and don't get reclaimed, we don't want to move
them to the unevictable list, However, we still want to populate the
page tables. So, we still call [_]mlock_vma_pages_range() for hugepage
vmas, but after making the pages present to preserve prior behavior, we
remove the VM_LOCKED flag from the vma.
The basic change to handling of hugepage handling with the unevictable
lru patches is that we no longer keep a huge page vma marked with
VM_LOCKED. So, at exit time, there is no record that this is an mlocked
vma.
A bit of context: before the unevictable lru, mlock() or
mmap(MAP_LOCKED) would just set the VM_LOCKED flag and
"make_pages_present()" for all but a few vma types. We've always
excluded those that get_user_pages() can't handle and still do. With
the unevictable lru, mlock()/mmap('LOCKED) now move the mlocked pages to
the unevictable lru list and munlock, including at exit, must rescue
them from the unevictable list. Since hugepages are not maintained on
the lru and don't get reclaimed, we don't want to move them to the
unevictable list. However, we still want to populate the page tables.
So, we still call [_]mlock_vma_pages_range() for hugepage vmas, but
after making the pages present to preserve prior behavior, we remove the
VM_LOCKED flag from the vma.
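In rough C-style pseudocode, the hugepage-specific behaviour described
above looks like this (a sketch of the logic only; make_pages_present()
and the call site are simplifications, not the actual mm/mlock.c code):

```
/* pseudocode: mlock_fixup()-style handling of a hugetlb VMA
 * after the unevictable-LRU series */
if (is_vm_hugetlb_page(vma)) {
	/* populate the page tables, preserving prior mlock behaviour */
	make_pages_present(vma->vm_start, vma->vm_end);
	/* but drop VM_LOCKED: huge pages are never on the LRU, so there
	 * is nothing to move to, or rescue from, the unevictable list
	 * at munlock/exit time */
	vma->vm_flags &= ~VM_LOCKED;
}
```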
This may have resulted in the apparent fix to the subject problem in
2.6.28...
>
> For context, the two attached files are used to reproduce a problem
> where bad pmd messages are scribbled all over the console on 2.6.27.7.
> Do something like
>
> echo 64 > /proc/sys/vm/nr_hugepages
> mount -t hugetlbfs none /mnt
> sh ./test-tcbm.sh
>
> I did confirm that it didn't matter to 2.6.29.1 whether CONFIG_UNEVICTABLE_LRU is
> set or not. It's possible the race is still there but I don't know where
> it is.
>
> Any ideas where the race might be?
No, sorry. Haven't had time to investigate this.
Lee
>
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [Bugme-new] [Bug 13302] New: "bad pmd" on fork() of process with hugepage shared memory segments attached
2009-05-20 14:53 ` Lee Schermerhorn
@ 2009-05-20 15:05 ` Lee Schermerhorn
2009-05-20 15:41 ` Mel Gorman
0 siblings, 1 reply; 25+ messages in thread
From: Lee Schermerhorn @ 2009-05-20 15:05 UTC (permalink / raw)
To: Mel Gorman
Cc: starlight, Andrew Morton, linux-mm, bugzilla-daemon, bugme-daemon,
Adam Litke, Eric B Munson, riel
On Wed, 2009-05-20 at 10:53 -0400, Lee Schermerhorn wrote:
> On Wed, 2009-05-20 at 12:35 +0100, Mel Gorman wrote:
> > On Fri, May 15, 2009 at 02:53:27PM -0400, starlight@binnacle.cx wrote:
> > > Here's another possible clue:
> > >
> > > I tried the first 'tcbm' testcase on a 2.6.27.7
> > > kernel that was hanging around from a few months
> > > ago and it breaks it 100% of the time.
> > >
> > > Completely hoses huge memory. Enough "bad pmd"
> > > errors to fill the kernel log.
> > >
> >
> > So I investigated what's wrong with 2.6.27.7. The problem is a race between
> > exec() and the handling of mlock()ed VMAs but I can't see where. The normal
> > teardown of pages is applied to a shared memory segment as if VM_HUGETLB
> > was not set.
> >
> > This was fixed between 2.6.27 and 2.6.28 but apparently by accident during the
> > introduction of CONFIG_UNEVICTABLE_LRU. This patchset made a number of changes
> > to how mlock()ed pages are handled but I didn't spot which was the relevant change
> > that fixed the problem and reverse bisecting didn't help. I've added two people
> > that were working on the unevictable LRU patches to see if they spot something.
>
> Hi, Mel:
> and still do. With the unevictable lru, mlock()/mmap('LOCKED) now move
> the mlocked pages to the unevictable lru list and munlock, including at
> exit, must rescue them from the unevictable list. Since hugepages are
> not maintained on the lru and don't get reclaimed, we don't want to move
> them to the unevictable list, However, we still want to populate the
> page tables. So, we still call [_]mlock_vma_pages_range() for hugepage
> vmas, but after making the pages present to preserve prior behavior, we
> remove the VM_LOCKED flag from the vma.
Wow! That got garbled; not sure how. The message was intended to start
here:
> The basic change to handling of hugepage handling with the unevictable
> lru patches is that we no longer keep a huge page vma marked with
> VM_LOCKED. So, at exit time, there is no record that this is an mlocked
> vma.
>
> A bit of context: before the unevictable lru, mlock() or
> mmap(MAP_LOCKED) would just set the VM_LOCKED flag and
> "make_pages_present()" for all but a few vma types. We've always
> excluded those that get_user_pages() can't handle and still do. With
> the unevictable lru, mlock()/mmap('LOCKED) now move the mlocked pages to
> the unevictable lru list and munlock, including at exit, must rescue
> them from the unevictable list. Since hugepages are not maintained on
> the lru and don't get reclaimed, we don't want to move them to the
> unevictable list. However, we still want to populate the page tables.
> So, we still call [_]mlock_vma_pages_range() for hugepage vmas, but
> after making the pages present to preserve prior behavior, we remove the
> VM_LOCKED flag from the vma.
>
> This may have resulted in the apparent fix to the subject problem in
> 2.6.28...
>
> >
> > For context, the two attached files are used to reproduce a problem
> > where bad pmd messages are scribbled all over the console on 2.6.27.7.
> > Do something like
> >
> > echo 64 > /proc/sys/vm/nr_hugepages
> > mount -t hugetlbfs none /mnt
> > sh ./test-tcbm.sh
> >
> > I did confirm that it didn't matter to 2.6.29.1 whether CONFIG_UNEVICTABLE_LRU is
> > set or not. It's possible the race is still there but I don't know where
> > it is.
> >
> > Any ideas where the race might be?
>
> No, sorry. Haven't had time to investigate this.
>
> Lee
> >
>
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [Bugme-new] [Bug 13302] New: "bad pmd" on fork() of process with hugepage shared memory segments attached
2009-05-20 15:05 ` Lee Schermerhorn
@ 2009-05-20 15:41 ` Mel Gorman
2009-05-21 0:41 ` KOSAKI Motohiro
0 siblings, 1 reply; 25+ messages in thread
From: Mel Gorman @ 2009-05-20 15:41 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: starlight, Andrew Morton, linux-mm, bugzilla-daemon, bugme-daemon,
Adam Litke, Eric B Munson, riel
On Wed, May 20, 2009 at 11:05:15AM -0400, Lee Schermerhorn wrote:
> On Wed, 2009-05-20 at 10:53 -0400, Lee Schermerhorn wrote:
> > On Wed, 2009-05-20 at 12:35 +0100, Mel Gorman wrote:
> > > On Fri, May 15, 2009 at 02:53:27PM -0400, starlight@binnacle.cx wrote:
> > > > Here's another possible clue:
> > > >
> > > > I tried the first 'tcbm' testcase on a 2.6.27.7
> > > > kernel that was hanging around from a few months
> > > > ago and it breaks it 100% of the time.
> > > >
> > > > Completely hoses huge memory. Enough "bad pmd"
> > > > errors to fill the kernel log.
> > > >
> > >
> > > So I investigated what's wrong with 2.6.27.7. The problem is a race between
> > > exec() and the handling of mlock()ed VMAs but I can't see where. The normal
> > > teardown of pages is applied to a shared memory segment as if VM_HUGETLB
> > > was not set.
> > >
> > > This was fixed between 2.6.27 and 2.6.28 but apparently by accident during the
> > > introduction of CONFIG_UNEVICTABLE_LRU. This patchset made a number of changes
> > > to how mlock()ed VMAs are handled but I didn't spot which was the relevant change
> > > that fixed the problem and reverse bisecting didn't help. I've added two people
> > > that were working on the unevictable LRU patches to see if they spot something.
> >
> > Hi, Mel:
> > and still do. With the unevictable lru, mlock()/mmap('LOCKED) now move
> > the mlocked pages to the unevictable lru list and munlock, including at
> > exit, must rescue them from the unevictable list. Since hugepages are
> > not maintained on the lru and don't get reclaimed, we don't want to move
> them to the unevictable list. However, we still want to populate the
> > page tables. So, we still call [_]mlock_vma_pages_range() for hugepage
> > vmas, but after making the pages present to preserve prior behavior, we
> > remove the VM_LOCKED flag from the vma.
>
> Wow! That got garbled, not sure how. The message was intended to start
> here:
>
> > The basic change to handling of hugepage handling with the unevictable
> > lru patches is that we no longer keep a huge page vma marked with
> > VM_LOCKED. So, at exit time, there is no record that this is a vmlocked
> > vma.
> >
Basic, and in this case apparently the critical factor. This patch on
2.6.27.7 makes the problem disappear as well, by never setting VM_LOCKED on
hugetlb-backed VMAs. Obviously, it's a hatchet job and almost certainly the
wrong fix, but it indicates that the handling of VM_LOCKED && VM_HUGETLB
is wrong somewhere. Now I have a better idea what to search for on
Friday. Thanks Lee.
--- mm/mlock.c 2009-05-20 16:36:08.000000000 +0100
+++ mm/mlock-new.c 2009-05-20 16:28:17.000000000 +0100
@@ -64,7 +64,8 @@
* It's okay if try_to_unmap_one unmaps a page just after we
* set VM_LOCKED, make_pages_present below will bring it back.
*/
- vma->vm_flags = newflags;
+ if (!(vma->vm_flags & VM_HUGETLB))
+ vma->vm_flags = newflags;
/*
* Keep track of amount of locked VM.
* Re: [Bugme-new] [Bug 13302] New: "bad pmd" on fork() of process with hugepage shared memory segments attached
2009-05-20 15:41 ` Mel Gorman
@ 2009-05-21 0:41 ` KOSAKI Motohiro
2009-05-22 16:41 ` Mel Gorman
0 siblings, 1 reply; 25+ messages in thread
From: KOSAKI Motohiro @ 2009-05-21 0:41 UTC (permalink / raw)
To: Mel Gorman
Cc: kosaki.motohiro, Lee Schermerhorn, starlight, Andrew Morton,
linux-mm, bugzilla-daemon, bugme-daemon, Adam Litke,
Eric B Munson, riel
Hi
> Basic, and in this case apparently the critical factor. This patch on
> 2.6.27.7 makes the problem disappear as well, by never setting VM_LOCKED on
> hugetlb-backed VMAs. Obviously, it's a hatchet job and almost certainly the
> wrong fix, but it indicates that the handling of VM_LOCKED && VM_HUGETLB
> is wrong somewhere. Now I have a better idea what to search for on
> Friday. Thanks Lee.
>
> --- mm/mlock.c 2009-05-20 16:36:08.000000000 +0100
> +++ mm/mlock-new.c 2009-05-20 16:28:17.000000000 +0100
> @@ -64,7 +64,8 @@
> * It's okay if try_to_unmap_one unmaps a page just after we
> * set VM_LOCKED, make_pages_present below will bring it back.
> */
> - vma->vm_flags = newflags;
> + if (!(vma->vm_flags & VM_HUGETLB))
The meaning of this condition isn't obvious to me. Could you please
consider adding a comment?
> + vma->vm_flags = newflags;
>
> /*
> * Keep track of amount of locked VM.
* Re: [Bugme-new] [Bug 13302] New: "bad pmd" on fork() of process with hugepage shared memory segments attached
2009-05-21 0:41 ` KOSAKI Motohiro
@ 2009-05-22 16:41 ` Mel Gorman
2009-05-24 13:44 ` KOSAKI Motohiro
0 siblings, 1 reply; 25+ messages in thread
From: Mel Gorman @ 2009-05-22 16:41 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Lee Schermerhorn, starlight, Andrew Morton, linux-mm,
bugzilla-daemon, bugme-daemon, Adam Litke, Eric B Munson, riel,
hugh.dickins, kenchen
On Thu, May 21, 2009 at 09:41:46AM +0900, KOSAKI Motohiro wrote:
> Hi
>
> > Basic, and in this case apparently the critical factor. This patch on
> > 2.6.27.7 makes the problem disappear as well, by never setting VM_LOCKED on
> > hugetlb-backed VMAs. Obviously, it's a hatchet job and almost certainly the
> > wrong fix, but it indicates that the handling of VM_LOCKED && VM_HUGETLB
> > is wrong somewhere. Now I have a better idea what to search for on
> > Friday. Thanks Lee.
> >
> > --- mm/mlock.c 2009-05-20 16:36:08.000000000 +0100
> > +++ mm/mlock-new.c 2009-05-20 16:28:17.000000000 +0100
> > @@ -64,7 +64,8 @@
> > * It's okay if try_to_unmap_one unmaps a page just after we
> > * set VM_LOCKED, make_pages_present below will bring it back.
> > */
> > - vma->vm_flags = newflags;
> > + if (!(vma->vm_flags & VM_HUGETLB))
>
> The meaning of this condition isn't obvious to me. Could you please
> consider adding a comment?
>
I should have used the helper, but anyway, the check was to see if the VMA was
backed by hugetlbfs or not. This wasn't the right fix. It was only intended
to show that it was something to do with the VM_LOCKED flag.
The real problem has something to do with pagetable-sharing of hugetlb-backed
segments. After fork(), VM_LOCKED gets cleared, so when huge_pmd_share()
is called, some of the pagetables are shared and others are not. I believe
this is resulting in pagetables being freed prematurely. I'm cc'ing the
author and ackers of the pagetable-sharing patch to see if they can shed more
light on whether this is the right patch or not. Kenneth, Hugh?
==== CUT HERE ====
x86: Ignore VM_LOCKED when determining if hugetlb-backed page tables can be shared or not
On x86 and x86-64, it is possible that page tables are shared between shared
mappings backed by hugetlbfs. As part of this, page_table_shareable() checks
a pair of vma->vm_flags and they must match if they are to be shared. All
VMA flags are taken into account, including VM_LOCKED.
The problem is that VM_LOCKED is cleared on fork(). When a process with a
shared memory segment forks() to exec() a helper, there will be shared VMAs
with different flags. The impact is that the shared segment is sometimes
considered shareable and other times not, depending on what process is
checking. A test process that forks and execs heavily can trigger a
number of "bad pmd" messages appearing in the kernel log and hugepages
being leaked.
I believe what happens is that the segment page tables are being shared but
the count is inaccurate depending on the ordering of events.
Strictly speaking, this affects mainline but the problem is masked by the
changes made for CONFIG_UNEVICTABLE_LRU, as the kernel now never has VM_LOCKED
set for hugetlbfs-backed mappings. This does affect the stable branch of
2.6.27 and distributions based on that kernel, such as SLES 11.
This patch addresses the problem by comparing all flags but VM_LOCKED when
deciding if pagetables should be shared or not for hugetlbfs-backed mappings.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
arch/x86/mm/hugetlbpage.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index 8f307d9..16e4bcc 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -26,12 +26,16 @@ static unsigned long page_table_shareable(struct vm_area_struct *svma,
unsigned long sbase = saddr & PUD_MASK;
unsigned long s_end = sbase + PUD_SIZE;
+ /* Allow segments to share if only one is locked */
+ unsigned long vm_flags = vma->vm_flags & ~VM_LOCKED;
+ unsigned long svm_flags = vma->vm_flags & ~VM_LOCKED;
+
/*
* match the virtual addresses, permission and the alignment of the
* page table page.
*/
if (pmd_index(addr) != pmd_index(saddr) ||
- vma->vm_flags != svma->vm_flags ||
+ vm_flags != svm_flags ||
sbase < svma->vm_start || svma->vm_end < s_end)
return 0;
* Re: [Bugme-new] [Bug 13302] New: "bad pmd" on fork() of process with hugepage shared memory segments attached
2009-05-22 16:41 ` Mel Gorman
@ 2009-05-24 13:44 ` KOSAKI Motohiro
2009-05-25 8:51 ` Mel Gorman
0 siblings, 1 reply; 25+ messages in thread
From: KOSAKI Motohiro @ 2009-05-24 13:44 UTC (permalink / raw)
To: Mel Gorman
Cc: kosaki.motohiro, Lee Schermerhorn, starlight, Andrew Morton,
linux-mm, bugzilla-daemon, bugme-daemon, Adam Litke,
Eric B Munson, riel, hugh.dickins, kenchen
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
> arch/x86/mm/hugetlbpage.c | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
> index 8f307d9..16e4bcc 100644
> --- a/arch/x86/mm/hugetlbpage.c
> +++ b/arch/x86/mm/hugetlbpage.c
> @@ -26,12 +26,16 @@ static unsigned long page_table_shareable(struct vm_area_struct *svma,
> unsigned long sbase = saddr & PUD_MASK;
> unsigned long s_end = sbase + PUD_SIZE;
>
> + /* Allow segments to share if only one is locked */
> + unsigned long vm_flags = vma->vm_flags & ~VM_LOCKED;
> + unsigned long svm_flags = vma->vm_flags & ~VM_LOCKED;
svma?
- kosaki
> +
> /*
> * match the virtual addresses, permission and the alignment of the
> * page table page.
> */
> if (pmd_index(addr) != pmd_index(saddr) ||
> - vma->vm_flags != svma->vm_flags ||
> + vm_flags != svm_flags ||
> sbase < svma->vm_start || svma->vm_end < s_end)
> return 0;
>
* Re: [Bugme-new] [Bug 13302] New: "bad pmd" on fork() of process with hugepage shared memory segments attached
2009-05-24 13:44 ` KOSAKI Motohiro
@ 2009-05-25 8:51 ` Mel Gorman
2009-05-25 10:10 ` Hugh Dickins
0 siblings, 1 reply; 25+ messages in thread
From: Mel Gorman @ 2009-05-25 8:51 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Lee Schermerhorn, starlight, Andrew Morton, linux-mm,
bugzilla-daemon, bugme-daemon, Adam Litke, Eric B Munson, riel,
hugh.dickins, kenchen
On Sun, May 24, 2009 at 10:44:29PM +0900, KOSAKI Motohiro wrote:
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > ---
> > arch/x86/mm/hugetlbpage.c | 6 +++++-
> > 1 file changed, 5 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
> > index 8f307d9..16e4bcc 100644
> > --- a/arch/x86/mm/hugetlbpage.c
> > +++ b/arch/x86/mm/hugetlbpage.c
> > @@ -26,12 +26,16 @@ static unsigned long page_table_shareable(struct vm_area_struct *svma,
> > unsigned long sbase = saddr & PUD_MASK;
> > unsigned long s_end = sbase + PUD_SIZE;
> >
> > + /* Allow segments to share if only one is locked */
> > + unsigned long vm_flags = vma->vm_flags & ~VM_LOCKED;
> > + unsigned long svm_flags = vma->vm_flags & ~VM_LOCKED;
> svma?
>
/me slaps self
svma indeed.
With the patch corrected and applied, I still cannot trigger the bad pmd
messages, so I'm convinced the bug is related to hugetlb pagetable
sharing and this is more or less the fix. Any opinions?
> - kosaki
>
> > +
> > /*
> > * match the virtual addresses, permission and the alignment of the
> > * page table page.
> > */
> > if (pmd_index(addr) != pmd_index(saddr) ||
> > - vma->vm_flags != svma->vm_flags ||
> > + vm_flags != svm_flags ||
> > sbase < svma->vm_start || svma->vm_end < s_end)
> > return 0;
> >
>
>
>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [Bugme-new] [Bug 13302] New: "bad pmd" on fork() of process with hugepage shared memory segments attached
2009-05-25 8:51 ` Mel Gorman
@ 2009-05-25 10:10 ` Hugh Dickins
2009-05-25 13:17 ` Mel Gorman
0 siblings, 1 reply; 25+ messages in thread
From: Hugh Dickins @ 2009-05-25 10:10 UTC (permalink / raw)
To: Mel Gorman
Cc: KOSAKI Motohiro, Lee Schermerhorn, starlight, Andrew Morton,
linux-mm, bugzilla-daemon, bugme-daemon, Adam Litke,
Eric B Munson, riel, hugh.dickins, kenchen
On Mon, 25 May 2009, Mel Gorman wrote:
> On Sun, May 24, 2009 at 10:44:29PM +0900, KOSAKI Motohiro wrote:
> > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > ---
> > > arch/x86/mm/hugetlbpage.c | 6 +++++-
> > > 1 file changed, 5 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
> > > index 8f307d9..16e4bcc 100644
> > > --- a/arch/x86/mm/hugetlbpage.c
> > > +++ b/arch/x86/mm/hugetlbpage.c
> > > @@ -26,12 +26,16 @@ static unsigned long page_table_shareable(struct vm_area_struct *svma,
> > > unsigned long sbase = saddr & PUD_MASK;
> > > unsigned long s_end = sbase + PUD_SIZE;
> > >
> > > + /* Allow segments to share if only one is locked */
> > > + unsigned long vm_flags = vma->vm_flags & ~VM_LOCKED;
> > > + unsigned long svm_flags = vma->vm_flags & ~VM_LOCKED;
> > svma?
> >
>
> /me slaps self
>
> svma indeed.
>
> With the patch corrected and applied, I still cannot trigger the bad pmd
> messages, so I'm convinced the bug is related to hugetlb pagetable
> sharing and this is more or less the fix. Any opinions?
Yes, you gave a very good analysis, and I agree with you, your patch
does seem to be needed for 2.6.27.N, and the right thing to do there
(though I prefer the way 2.6.28 mlocking skips huge areas completely).
One nit, doesn't really matter, but if I'm not too late: please change
- /* Allow segments to share if only one is locked */
+ /* Allow segments to share if only one is marked locked */
since locking is such a no-op on hugetlb areas.
Hugetlb pagetable sharing does scare me some nights: it's a very easily
forgotten corner of mm, worrying that we do something so different in
there; but IIRC this is actually the first bug related to it, much to
Ken's credit (and Dave McCracken's).
(I'm glad Kosaki-san noticed the svma before I acked your previous
version! And I've still got to go back to your VM_MAYSHARE patch:
seems right, but still wondering about the remaining VM_SHAREDs -
will report back later.)
Feel free to add an
Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
to your fixed version.
Hugh
>
> > - kosaki
> >
> > > +
> > > /*
> > > * match the virtual addresses, permission and the alignment of the
> > > * page table page.
> > > */
> > > if (pmd_index(addr) != pmd_index(saddr) ||
> > > - vma->vm_flags != svma->vm_flags ||
> > > + vm_flags != svm_flags ||
> > > sbase < svma->vm_start || svma->vm_end < s_end)
> > > return 0;
* Re: [Bugme-new] [Bug 13302] New: "bad pmd" on fork() of process with hugepage shared memory segments attached
2009-05-25 10:10 ` Hugh Dickins
@ 2009-05-25 13:17 ` Mel Gorman
0 siblings, 0 replies; 25+ messages in thread
From: Mel Gorman @ 2009-05-25 13:17 UTC (permalink / raw)
To: Hugh Dickins
Cc: KOSAKI Motohiro, Lee Schermerhorn, starlight, Andrew Morton,
linux-mm, bugzilla-daemon, bugme-daemon, Adam Litke,
Eric B Munson, riel, kenchen
On Mon, May 25, 2009 at 11:10:11AM +0100, Hugh Dickins wrote:
> On Mon, 25 May 2009, Mel Gorman wrote:
> > On Sun, May 24, 2009 at 10:44:29PM +0900, KOSAKI Motohiro wrote:
> > > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > > ---
> > > > arch/x86/mm/hugetlbpage.c | 6 +++++-
> > > > 1 file changed, 5 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
> > > > index 8f307d9..16e4bcc 100644
> > > > --- a/arch/x86/mm/hugetlbpage.c
> > > > +++ b/arch/x86/mm/hugetlbpage.c
> > > > @@ -26,12 +26,16 @@ static unsigned long page_table_shareable(struct vm_area_struct *svma,
> > > > unsigned long sbase = saddr & PUD_MASK;
> > > > unsigned long s_end = sbase + PUD_SIZE;
> > > >
> > > > + /* Allow segments to share if only one is locked */
> > > > + unsigned long vm_flags = vma->vm_flags & ~VM_LOCKED;
> > > > + unsigned long svm_flags = vma->vm_flags & ~VM_LOCKED;
> > > svma?
> > >
> >
> > /me slaps self
> >
> > svma indeed.
> >
> > With the patch corrected, I still cannot trigger the bad pmd messages as
> > applied so I'm convinced the bug is related to hugetlb pagetable
> > sharing and this is more or less the fix. Any opinions?
>
> Yes, you gave a very good analysis, and I agree with you, your patch
> does seem to be needed for 2.6.27.N, and the right thing to do there
> (though I prefer the way 2.6.28 mlocking skips huge areas completely).
>
I similarly prefer how 2.6.28 simply makes the pages present and then gets
rid of the flag. I was tempted to back-port something similar, but it felt
better to fix where hugetlb was going wrong. Even though it's essentially a
no-op on mainline, I'd like to apply the patch there as well in case there
is ever another change in mlock() with respect to hugetlbfs.
> One nit, doesn't really matter, but if I'm not too late: please change
> - /* Allow segments to share if only one is locked */
> + /* Allow segments to share if only one is marked locked */
> since locking is such a no-op on hugetlb areas.
>
It's not too late and that change makes sense.
> Hugetlb pagetable sharing does scare me some nights: it's a very easily
> forgotten corner of mm, worrying that we do something so different in
> there; but IIRC this is actually the first bug related to it, much to
> Ken's credit (and Dave McCracken's).
>
I had totally forgotten about it, which is why it took me so long to identify
it as the problem area. I don't remember there ever being a problem with
this area either.
> (I'm glad Kosaki-san noticed the svma before I acked your previous
> version! And I've still got to go back to your VM_MAYSHARE patch:
> seems right, but still wondering about the remaining VM_SHAREDs -
> will report back later.)
>
Thanks.
> Feel free to add an
> Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
> to your fixed version.
>
Thanks again Hugh.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
end of thread, other threads:[~2009-05-25 13:16 UTC | newest]
Thread overview: 25+ messages
2009-05-15 18:53 [Bugme-new] [Bug 13302] New: "bad pmd" on fork() of process with hugepage shared memory segments attached starlight
2009-05-20 11:35 ` Mel Gorman
2009-05-20 14:29 ` Mel Gorman
2009-05-20 14:53 ` Lee Schermerhorn
2009-05-20 15:05 ` Lee Schermerhorn
2009-05-20 15:41 ` Mel Gorman
2009-05-21 0:41 ` KOSAKI Motohiro
2009-05-22 16:41 ` Mel Gorman
2009-05-24 13:44 ` KOSAKI Motohiro
2009-05-25 8:51 ` Mel Gorman
2009-05-25 10:10 ` Hugh Dickins
2009-05-25 13:17 ` Mel Gorman
-- strict thread matches above, loose matches on Subject: below --
2009-05-15 18:44 starlight
2009-05-18 16:36 ` Mel Gorman
2009-05-15 5:32 starlight
2009-05-15 14:55 ` Mel Gorman
2009-05-15 15:02 ` starlight
[not found] <bug-13302-10286@http.bugzilla.kernel.org/>
2009-05-13 20:08 ` Andrew Morton
2009-05-14 10:53 ` Mel Gorman
2009-05-14 10:59 ` Mel Gorman
2009-05-14 17:20 ` starlight
2009-05-14 17:49 ` Mel Gorman
2009-05-14 18:42 ` starlight
2009-05-14 19:10 ` starlight
2009-05-14 17:16 ` starlight