* [BUG] new copy_hugetlb_page_range() causing crashes @ 2014-07-17 16:23 Guillaume Morin 0 siblings, 0 replies; 4+ messages in thread From: Guillaume Morin @ 2014-07-17 16:23 UTC (permalink / raw) To: Naoya Horiguchi, Hugh Dickins; +Cc: linux-kernel, stable Naoya, Hugh, I am seeing lots of crashes with the new copy_hugetlb_page_range() code added by 4a705fef986231a3e7a6b1a6d3c37025f021f49f for some set of programs. Specifically, I am running some test programs which use huge pages for malloc (through libhugetlbfs with HUGETLB_MORECORE=yes and HUGETLB_NO_PREFAULT= set) that also do fork() a lot. The crashes are very diverse: assertion failures in malloc() or the python GC code, some SIGSEGV as well. I started observing these crashes with 3.4.98 and 3.14.12 which just got a backport of the patch above (as 2bcdd4933ff4dc46445dcae93cb37c648283b782 in the stable branch). The 3.4 and 3.14 patches are identical to the upstream commit so that's not a patch backport issue. If I revert only 2bcdd4933ff4dc46445dcae93cb37c648283b782 in my 3.4 tree, the crashes disappear right away and everything is stable. -- Guillaume Morin <guillaume@morinfr.org> ^ permalink raw reply [flat|nested] 4+ messages in thread
[parent not found: <019768ac467043a4aaea3e455cb74db7@BPXC18GP.gisp.nec.co.jp>]
[parent not found: <FC3CA273EA98D94B96901B237F5F506BB61DC0@irvmail101.necam.prv>]
* Re: [BUG] new copy_hugetlb_page_range() causing crashes [not found] ` <FC3CA273EA98D94B96901B237F5F506BB61DC0@irvmail101.necam.prv> @ 2014-07-17 20:12 ` Guillaume Morin 2014-07-17 21:33 ` Naoya Horiguchi 0 siblings, 1 reply; 4+ messages in thread From: Guillaume Morin @ 2014-07-17 20:12 UTC (permalink / raw) To: Horiguchi, Naoya Cc: Naoya Horiguchi, Hugh Dickins, linux-kernel, stable, nao.horiguchi Nayoa, Thanks for your answer. On 17 Jul 14:35, Horiguchi, Naoya wrote: > I tried some simple operation (below) on 3.14.12, but not reproduced the crash, > so some non-trivial condition seemed to trigger this. > Could you elaborate about how you reproduced the crash? Well, It's just a bunch of fairly random unit tests and I was not able to make it happen on a small program. However, I modified your program a bit and managed to reproduce the problem: $ cat heap.c #include <stdio.h> #include <unistd.h> #include <stdlib.h> #include <string.h> int main() { int i; char *p = malloc(4096*512); for (i = 0; i < 512; i++) p[i*4096] = '1'; printf("p %p\n", p); for (i = 0 ; i < 10; i++) if (!fork()) { memset(p, '2', 4096*512); p = malloc(4096*512); printf("pid %d\n", getpid()); memset(p, '3', 4096*512); free(p); return 0; } pause(); } This is what happens on my 3.14.12 machine: $ export HUGETLB_MORECORE=yes ; export HUGETLB_NO_PREFAULT= ; hugectl --heap ./h p 0x800010 pid 7974 pid 7975 h: malloc.c:2369: sysmalloc: Assertion `(old_top == (((mbinptr) (((char *) &((av)->bins[((1) - 1) * 2])) - __builtin_offsetof (struct malloc_chunk, fd)))) && old_size == 0) || ((unsigned long) (old_size) >= (unsigned long)((((__builtin_offsetof (struct malloc_chunk, fd_nextsize))+((2 * (sizeof(size_t))) - 1)) & ~((2 * (sizeof(size_t))) - 1))) && ((old_top)->size & 0x1) && ((unsigned long)old_end & pagemask) == 0)' failed. Sometimes the process gets stuck instead asserting out. But I could not make it SIGSEGV Same result with the 3.4.98 kernel. It works fine when I remove your patch though Guillaume. -- Guillaume Morin <guillaume@morinfr.org> ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [BUG] new copy_hugetlb_page_range() causing crashes 2014-07-17 20:12 ` Guillaume Morin @ 2014-07-17 21:33 ` Naoya Horiguchi 2014-07-17 21:59 ` Guillaume Morin 0 siblings, 1 reply; 4+ messages in thread From: Naoya Horiguchi @ 2014-07-17 21:33 UTC (permalink / raw) To: Guillaume Morin Cc: Naoya Horiguchi, Hugh Dickins, linux-kernel, stable, Naoya Horiguchi On Thu, Jul 17, 2014 at 10:12:04PM +0200, Guillaume Morin wrote: > Nayoa, > > Thanks for your answer. > > On 17 Jul 14:35, Horiguchi, Naoya wrote: > > I tried some simple operation (below) on 3.14.12, but not reproduced the crash, > > so some non-trivial condition seemed to trigger this. > > Could you elaborate about how you reproduced the crash? > > Well, It's just a bunch of fairly random unit tests and I was not able > to make it happen on a small program. However, I modified your program > a bit and managed to reproduce the problem: Great. > $ cat heap.c > #include <stdio.h> > #include <unistd.h> > #include <stdlib.h> > #include <string.h> > > int main() { > int i; > char *p = malloc(4096*512); > for (i = 0; i < 512; i++) > p[i*4096] = '1'; > printf("p %p\n", p); > for (i = 0 ; i < 10; i++) > if (!fork()) { > memset(p, '2', 4096*512); > p = malloc(4096*512); > printf("pid %d\n", getpid()); > memset(p, '3', 4096*512); > free(p); > > return 0; > } > pause(); > } > > This is what happens on my 3.14.12 machine: > $ export HUGETLB_MORECORE=yes ; export HUGETLB_NO_PREFAULT= ; hugectl --heap ./h > p 0x800010 > pid 7974 > pid 7975 > h: malloc.c:2369: sysmalloc: Assertion `(old_top == (((mbinptr) (((char *) &((av)->bins[((1) - 1) * 2])) - __builtin_offsetof (struct malloc_chunk, fd)))) && old_size == 0) || ((unsigned long) (old_size) >= (unsigned long)((((__builtin_offsetof (struct malloc_chunk, fd_nextsize))+((2 * (sizeof(size_t))) - 1)) & ~((2 * (sizeof(size_t))) - 1))) && ((old_top)->size & 0x1) && ((unsigned long)old_end & pagemask) == 0)' failed. > > Sometimes the process gets stuck instead asserting out. But I could not > make it SIGSEGV > > Same result with the 3.4.98 kernel. In my environment (kernel-3.14.12, libhugetlbfs-utils-2.16-2.fc20.x86_64), the crash looks like this: [root@test_140717-1333 hugetlbfs_test]# $ export HUGETLB_MORECORE=yes ; export HUGETLB_NO_PREFAULT= ; hugectl --heap ./heap bash: $: command not found... p 0x2200010 pid 2809 *** Error in `./heap': break adjusted to free malloc space: 0x0000000002501000 *** ======= Backtrace: ========= /lib64/libc.so.6[0x3940e75cff] /lib64/libc.so.6[0x3940e7f121] /lib64/libc.so.6(__libc_malloc+0x5c)[0x3940e7ff6c] ./heap[0x400767] /lib64/libc.so.6(__libc_start_main+0xf5)[0x3940e21d65] ./heap[0x400619] ======= Memory map: ======== 00400000-00401000 r-xp 00000000 fd:01 272411 /root/hugetlbfs_test/heap 00600000-00601000 r--p 00000000 fd:01 272411 /root/hugetlbfs_test/heap 00601000-00602000 rw-p 00001000 fd:01 272411 /root/hugetlbfs_test/heap 02200000-02600000 rw-p 00000000 00:0c 23209 /anon_hugepage (deleted) 02600000-02800000 rw-p 00400000 00:0c 25663 /anon_hugepage (deleted) 3940a00000-3940a20000 r-xp 00000000 fd:01 175094 /usr/lib64/ld-2.18.so 3940c1f000-3940c20000 r--p 0001f000 fd:01 175094 /usr/lib64/ld-2.18.so 3940c20000-3940c21000 rw-p 00020000 fd:01 175094 /usr/lib64/ld-2.18.so 3940c21000-3940c22000 rw-p 00000000 00:00 0 3940e00000-3940fb4000 r-xp 00000000 fd:01 175095 /usr/lib64/libc-2.18.so 3940fb4000-39411b4000 ---p 001b4000 fd:01 175095 /usr/lib64/libc-2.18.so 39411b4000-39411b8000 r--p 001b4000 fd:01 175095 /usr/lib64/libc-2.18.so 39411b8000-39411ba000 rw-p 001b8000 fd:01 175095 /usr/lib64/libc-2.18.so 39411ba000-39411bf000 rw-p 00000000 00:00 0 3941200000-3941203000 r-xp 00000000 fd:01 175098 /usr/lib64/libdl-2.18.so 3941203000-3941402000 ---p 00003000 fd:01 175098 /usr/lib64/libdl-2.18.so 3941402000-3941403000 r--p 00002000 fd:01 175098 /usr/lib64/libdl-2.18.so 3941403000-3941404000 rw-p 00003000 fd:01 175098 /usr/lib64/libdl-2.18.so 7f3860277000-7f386028c000 r-xp 00000000 fd:01 184953 /usr/lib64/libgcc_s-4.8.3-20140624.so.1 7f386028c000-7f386048b000 ---p 00015000 fd:01 184953 /usr/lib64/libgcc_s-4.8.3-20140624.so.1 7f386048b000-7f386048c000 r--p 00014000 fd:01 184953 /usr/lib64/libgcc_s-4.8.3-20140624.so.1 7f386048c000-7f386048d000 rw-p 00015000 fd:01 184953 /usr/lib64/libgcc_s-4.8.3-20140624.so.1 7f38604a2000-7f38604a6000 rw-p 00000000 00:00 0 7f38604a6000-7f38604b6000 r-xp 00000000 fd:01 177014 /usr/lib64/libhugetlbfs.so 7f38604b6000-7f38606b5000 ---p 00010000 fd:01 177014 /usr/lib64/libhugetlbfs.so 7f38606b5000-7f38606b6000 r--p 0000f000 fd:01 177014 /usr/lib64/libhugetlbfs.so 7f38606b6000-7f38606b7000 rw-p 00010000 fd:01 177014 /usr/lib64/libhugetlbfs.so 7f38606b7000-7f38606c2000 rw-p 00000000 00:00 0 7f38606d5000-7f38606d6000 rw-p 00000000 00:00 0 7f38606d6000-7f38606d8000 rw-p 00000000 00:00 0 7fff07c44000-7fff07c65000 rw-p 00000000 00:00 0 [stack] 7fff07d52000-7fff07d54000 r-xp 00000000 00:00 0 [vdso] ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall] Is this crash the same as yours? And it seems that this also happens on v3.16-rc5. So it might be an upstream bug, not a stable-specific matter. It looks strange to me that the problem is gone by removing the commit 4a705fef98 (although I confirmed it is,) because the kernel's behavior shouldn't change unless (is_hugetlb_entry_migration(entry) || is_hugetlb_entry_hwpoisoned(entry)) is true. And I checked with systemtap that both these check returned false in the above test program. So I'm wondering why the commit makes difference for this test program. Anyway I'll dig more. Thanks, Naoya Horiguchi ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [BUG] new copy_hugetlb_page_range() causing crashes 2014-07-17 21:33 ` Naoya Horiguchi @ 2014-07-17 21:59 ` Guillaume Morin 0 siblings, 0 replies; 4+ messages in thread From: Guillaume Morin @ 2014-07-17 21:59 UTC (permalink / raw) To: Naoya Horiguchi Cc: Guillaume Morin, Hugh Dickins, linux-kernel, stable, Naoya Horiguchi On 17 Jul 17:33, Naoya Horiguchi wrote: > In my environment (kernel-3.14.12, libhugetlbfs-utils-2.16-2.fc20.x86_64), > the crash looks like this: > > [root@test_140717-1333 hugetlbfs_test]# $ export HUGETLB_MORECORE=yes ; export HUGETLB_NO_PREFAULT= ; hugectl --heap ./heap > bash: $: command not found... > p 0x2200010 > pid 2809 > *** Error in `./heap': break adjusted to free malloc space: 0x0000000002501000 *** > ======= Backtrace: ========= > /lib64/libc.so.6[0x3940e75cff] > /lib64/libc.so.6[0x3940e7f121] > /lib64/libc.so.6(__libc_malloc+0x5c)[0x3940e7ff6c] > ./heap[0x400767] > /lib64/libc.so.6(__libc_start_main+0xf5)[0x3940e21d65] > ./heap[0x400619] > ======= Memory map: ======== > 00400000-00401000 r-xp 00000000 fd:01 272411 /root/hugetlbfs_test/heap > 00600000-00601000 r--p 00000000 fd:01 272411 /root/hugetlbfs_test/heap > 00601000-00602000 rw-p 00001000 fd:01 272411 /root/hugetlbfs_test/heap > 02200000-02600000 rw-p 00000000 00:0c 23209 /anon_hugepage (deleted) > 02600000-02800000 rw-p 00400000 00:0c 25663 /anon_hugepage (deleted) > 3940a00000-3940a20000 r-xp 00000000 fd:01 175094 /usr/lib64/ld-2.18.so > 3940c1f000-3940c20000 r--p 0001f000 fd:01 175094 /usr/lib64/ld-2.18.so > 3940c20000-3940c21000 rw-p 00020000 fd:01 175094 /usr/lib64/ld-2.18.so > 3940c21000-3940c22000 rw-p 00000000 00:00 0 > 3940e00000-3940fb4000 r-xp 00000000 fd:01 175095 /usr/lib64/libc-2.18.so > 3940fb4000-39411b4000 ---p 001b4000 fd:01 175095 /usr/lib64/libc-2.18.so > 39411b4000-39411b8000 r--p 001b4000 fd:01 175095 /usr/lib64/libc-2.18.so > 39411b8000-39411ba000 rw-p 001b8000 fd:01 175095 /usr/lib64/libc-2.18.so > 39411ba000-39411bf000 rw-p 00000000 00:00 0 > 3941200000-3941203000 r-xp 00000000 fd:01 175098 /usr/lib64/libdl-2.18.so > 3941203000-3941402000 ---p 00003000 fd:01 175098 /usr/lib64/libdl-2.18.so > 3941402000-3941403000 r--p 00002000 fd:01 175098 /usr/lib64/libdl-2.18.so > 3941403000-3941404000 rw-p 00003000 fd:01 175098 /usr/lib64/libdl-2.18.so > 7f3860277000-7f386028c000 r-xp 00000000 fd:01 184953 /usr/lib64/libgcc_s-4.8.3-20140624.so.1 > 7f386028c000-7f386048b000 ---p 00015000 fd:01 184953 /usr/lib64/libgcc_s-4.8.3-20140624.so.1 > 7f386048b000-7f386048c000 r--p 00014000 fd:01 184953 /usr/lib64/libgcc_s-4.8.3-20140624.so.1 > 7f386048c000-7f386048d000 rw-p 00015000 fd:01 184953 /usr/lib64/libgcc_s-4.8.3-20140624.so.1 > 7f38604a2000-7f38604a6000 rw-p 00000000 00:00 0 > 7f38604a6000-7f38604b6000 r-xp 00000000 fd:01 177014 /usr/lib64/libhugetlbfs.so > 7f38604b6000-7f38606b5000 ---p 00010000 fd:01 177014 /usr/lib64/libhugetlbfs.so > 7f38606b5000-7f38606b6000 r--p 0000f000 fd:01 177014 /usr/lib64/libhugetlbfs.so > 7f38606b6000-7f38606b7000 rw-p 00010000 fd:01 177014 /usr/lib64/libhugetlbfs.so > 7f38606b7000-7f38606c2000 rw-p 00000000 00:00 0 > 7f38606d5000-7f38606d6000 rw-p 00000000 00:00 0 > 7f38606d6000-7f38606d8000 rw-p 00000000 00:00 0 > 7fff07c44000-7fff07c65000 rw-p 00000000 00:00 0 [stack] > 7fff07d52000-7fff07d54000 r-xp 00000000 00:00 0 [vdso] > ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall] > > Is this crash the same as yours? It's very similar in the sense that's an assert in malloc. My error is slightly different but I guess the exact error depends on a few runtime parameters > And it seems that this also happens on v3.16-rc5. > So it might be an upstream bug, not a stable-specific matter. That's my understanding as well. I just reported it for 3.4 and 3.14 since these were the kernels I could easily try my original test with. > It looks strange to me that the problem is gone by removing the commit > 4a705fef98 (although I confirmed it is,) because the kernel's behavior > shouldn't change unless (is_hugetlb_entry_migration(entry) || > is_hugetlb_entry_hwpoisoned(entry)) is true. And I checked with systemtap > that both these check returned false in the above test program. > So I'm wondering why the commit makes difference for this test program. I don't know why I am just thinking about that now. Isn't this the fact that your patch moved the huge_ptep_get() before huge_ptep_set_wrprotect() in the pte_present() cow case? Actually, I've just tried to re-add the huge_ptep_get call for that case and it's fixing the problem for me... Hmm, want a patch? -- Guillaume Morin <guillaume@morinfr.org> ^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2014-07-17 21:59 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-07-17 16:23 [BUG] new copy_hugetlb_page_range() causing crashes Guillaume Morin
[not found] <019768ac467043a4aaea3e455cb74db7@BPXC18GP.gisp.nec.co.jp>
[not found] ` <FC3CA273EA98D94B96901B237F5F506BB61DC0@irvmail101.necam.prv>
2014-07-17 20:12 ` Guillaume Morin
2014-07-17 21:33 ` Naoya Horiguchi
2014-07-17 21:59 ` Guillaume Morin
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox