From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-wm0-f72.google.com (mail-wm0-f72.google.com [74.125.82.72])
    by kanga.kvack.org (Postfix) with ESMTP id 6F2CC6B025E
    for ; Tue, 26 Jul 2016 04:53:49 -0400 (EDT)
Received: by mail-wm0-f72.google.com with SMTP id b65so3522286wmg.0
    for ; Tue, 26 Jul 2016 01:53:49 -0700 (PDT)
Received: from mail-lf0-x243.google.com (mail-lf0-x243.google.com. [2a00:1450:4010:c07::243])
    by mx.google.com with ESMTPS id s90si16671496lfg.192.2016.07.26.01.53.47
    for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128);
    Tue, 26 Jul 2016 01:53:47 -0700 (PDT)
Received: by mail-lf0-x243.google.com with SMTP id l89so18598lfi.2
    for ; Tue, 26 Jul 2016 01:53:47 -0700 (PDT)
Date: Tue, 26 Jul 2016 11:53:44 +0300
From: "Kirill A. Shutemov"
Subject: Re: [PATCH] mm: correctly handle errors during VMA merging
Message-ID: <20160726085344.GA7370@node.shutemov.name>
References: <1469514843-23778-1-git-send-email-vegard.nossum@oracle.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <1469514843-23778-1-git-send-email-vegard.nossum@oracle.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Vegard Nossum
Cc: linux-mm@kvack.org, Vlastimil Babka, Leon Yu, Konstantin Khlebnikov,
    Rik van Riel, Daniel Forrest

On Tue, Jul 26, 2016 at 08:34:03AM +0200, Vegard Nossum wrote:
> Using trinity + fault injection I've been running into this bug a lot:
>
> ==================================================================
> BUG: KASAN: out-of-bounds in mprotect_fixup+0x523/0x5a0 at addr ffff8800b9e7d740
> Read of size 8 by task trinity-c3/6338
> =============================================================================
> BUG vm_area_struct (Not tainted): kasan: bad access detected
> -----------------------------------------------------------------------------
>
> Disabling lock debugging due to kernel taint
> INFO: Allocated in copy_process.part.42+0x3ae7/0x52d0 age=13 cpu=0 pid=23703
>     ___slab_alloc+0x480/0x4b0
>     __slab_alloc.isra.53+0x56/0x80
>     kmem_cache_alloc+0x22d/0x270
>     copy_process.part.42+0x3ae7/0x52d0
>     _do_fork+0x16d/0x8e0
>     SyS_clone+0x14/0x20
>     do_syscall_64+0x19c/0x410
>     return_from_SYSCALL_64+0x0/0x6a
> INFO: Freed in vma_adjust+0xab7/0x1740 age=25 cpu=1 pid=6338
>     __slab_free+0x17a/0x250
>     kmem_cache_free+0x20f/0x220
>     remove_vma+0x12e/0x170
>     exit_mmap+0x265/0x3c0
>     mmput+0x77/0x170
>     do_exit+0x636/0x2b80
>     do_group_exit+0xe2/0x2d0
>     get_signal+0x4be/0x1000
>     do_signal+0x83/0x1f10
>     exit_to_usermode_loop+0xa2/0x120
>     syscall_return_slowpath+0x13f/0x170
>     ret_from_fork+0x2f/0x40
>
> CPU: 1 PID: 6338 Comm: trinity-c3 Tainted: G B 4.7.0-rc7+ #45
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
>     ffffea0002e79f00 ffff88011887fc60 ffffffff81aa58b1 ffff88011a816400
>     ffff8800b9e7d740 ffff88011887fc90 ffffffff8142c54d ffff88011a816400
>     ffffea0002e79f00 ffff8800b9e7d740 0000000000000000 ffff88011887fcb8
> Call Trace:
>     [] dump_stack+0x65/0x84
>     [] print_trailer+0x10d/0x1a0
>     [] object_err+0x2f/0x40
>     [] kasan_report_error+0x221/0x520
>     [] __asan_report_load8_noabort+0x3e/0x40
>     [] mprotect_fixup+0x523/0x5a0
>     [] SyS_mprotect+0x4c4/0xa10
>     [] do_syscall_64+0x19c/0x410
>     [] entry_SYSCALL64_slow_path+0x25/0x25
>
> followed shortly by assertion errors and/or other bugs due to memory
> corruption.
>
> What's happening is that we're doing an mprotect() on a range that spans
> three existing adjacent mappings. The first two are merged fine, but if
> we merge the last one and anon_vma_clone() runs out of memory, we return
> an error and mprotect_fixup() tries to use the (now stale) pointer.
> It goes like this:
>
>     SyS_mprotect()
>       - mprotect_fixup()
>         - vma_merge()
>           - vma_adjust()
>             // first merge
>             - kmem_cache_free(vma)
>             - goto again;
>             // second merge
>             - anon_vma_clone()
>               - kmem_cache_alloc()
>                 - return NULL
>               - kmem_cache_alloc()
>                 - return NULL
>               - return -ENOMEM
>             - return -ENOMEM
>           - return NULL
>         - vma->vm_start // use-after-free
>
> In other words, it is possible to run into a memory allocation error
> *after* part of the merging work has already been done. In this case,
> we probably shouldn't return an error back to userspace anyway (since
> it would not reflect the partial work that was done).
>
> I *think* the solution might be to simply ignore the errors from
> vma_adjust() and carry on with distinct VMAs for adjacent regions that
> might otherwise have been represented with a single VMA.

I don't like this. At the very least, vma_adjust() should be able to
handle merging more than three vmas together on the next call, once the
memory pressure is gone. I would like to keep virtual address space
fragmentation within reasonable limits.

I think this wouldn't be easy to validate...

> I have a reproducer that runs into the bug within a few seconds when
> fault injection is enabled -- with the patch I no longer see any
> problems.
>
> The patch and resulting code admittedly look odd and I'm *far* from
> an expert on mm internals, so feel free to propose counter-patches and

One idea is to pre-allocate the anon_vma if remove_next == 2, before
merging starts, and use it on the second iteration instead of allocating
it in anon_vma_clone(). If I read the code correctly, there shouldn't be
more than two iterations. Right?

> I can give the reproducer a spin.

Could you post your reproducer? I guess it requires kernel
instrumentation to make allocation failures more likely.

>
> There's also a question about what to do with __split_vma() and other
> callers of vma_adjust().
> This crash (without my patch) only appeared
> once and it looks kinda related, but I haven't really looked into it
> and could be something else entirely:
>
> ------------[ cut here ]------------
> kernel BUG at mm/mmap.c:591!
> invalid opcode: 0000 [#1] PREEMPT SMP KASAN
> CPU: 0 PID: 3354 Comm: trinity-c1 Not tainted 4.7.0-rc7+ #37
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
> task: ffff8800b89a8dc0 ti: ffff8800b8b20000 task.ti: ffff8800b8b20000
> RIP: 0010:[] [] vma_adjust+0xe9a/0x1390
> RSP: 0018:ffff8800b8b27c60 EFLAGS: 00010206
> RAX: 1ffff10017014364 RBX: ffff8800b89c1930 RCX: 1ffff10017174cc5
> RDX: dffffc0000000000 RSI: 00007f1774fe8000 RDI: ffff8800b80a1b28
> RBP: ffff8800b8b27d08 R08: ffff8800bad5b660 R09: ffff8801190eafa0
> R10: 00007f1775ff7000 R11: ffff8800b8ba65d0 R12: ffff8800b89c0f80
> R13: ffff8800b80a1b40 R14: ffff8800b80a1b20 R15: ffff8801190eafa0
> FS: 00007f1776bfa700(0000) GS:ffff88011ae00000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 000000000066a3ac CR3: 00000000b807f000 CR4: 00000000000006f0
> Stack:
>     0000000000000286 00000000024000c0 00000000ffffffff ffff8800b89c0f90
>     000fffffffffee02 ffff8800b89c0f88 ffff8800bad5b640 00007f1774df7000
>     ffff8800bb840000 00000000000100c0 ffff8800b809d500 00007f1774fe8000
> Call Trace:
>     [] __split_vma.isra.34+0x404/0x730
>     [] split_vma+0x7f/0xc0
>     [] mprotect_fixup+0x3e8/0x5a0
>     [] SyS_mprotect+0x397/0x790
>     [] ? mprotect_fixup+0x5a0/0x5a0
>     [] ? syscall_trace_enter_phase2+0x227/0x3e0
>     [] ? mprotect_fixup+0x5a0/0x5a0
>     [] do_syscall_64+0x19c/0x410
>     [] ? context_tracking_enter+0x18/0x20
>     [] entry_SYSCALL64_slow_path+0x25/0x25
> Code: 39 d0 48 0f 42 c2 49 8d 7d 18 48 89 fa 48 c1 ea 03 42 80 3c 32 00 0f 85 b8 01 00 00 49 39 45 18 0f 85 e9 fe ff ff e9 a6 fc ff ff <0f> 0b 48 8d 7b 08 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1
> RIP [] vma_adjust+0xe9a/0x1390
> RSP
> ---[ end trace 49ee508a1e48b42d ]---
>

-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org