Re: race condition in xen-gntdev (was: Re: gntdev/gntalloc and fork? - crash in gntdev)

All of lore.kernel.org
 help / color / mirror / Atom feed

From: "Marek Marczykowski-Górecki" <marmarek@invisiblethingslab.com>
To: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>,
	dgdegra@tycho.nsa.gov, David Vrabel <david.vrabel@citrix.com>,
	xen-devel <xen-devel@lists.xen.org>
Subject: Re: race condition in xen-gntdev (was: Re: gntdev/gntalloc and fork? - crash in gntdev)
Date: Mon, 22 Jun 2015 20:13:35 +0200	[thread overview]
Message-ID: <20150622181335.GJ11083@mail-itl> (raw)
In-Reply-To: <20150622174626.GH5408@l.oracle.com>


[-- Attachment #1.1: Type: text/plain, Size: 12579 bytes --]

On Mon, Jun 22, 2015 at 01:46:27PM -0400, Konrad Rzeszutek Wilk wrote:
> On Wed, Jun 17, 2015 at 09:42:11PM +0200, Marek Marczykowski-Górecki wrote:
> > On Thu, May 28, 2015 at 01:45:08AM +0200, Marek Marczykowski-Górecki wrote:
> > > On Thu, Apr 30, 2015 at 04:47:44PM +0200, Marek Marczykowski-Górecki wrote:
> > > > Hi,
> > > > 
> > > > What is the proper way to handle shared pages (either side - using
> > > > gntdev or gntalloc) regarding fork and possible exec later? The child
> > > > process do not need to access those pages in any way, but will map
> > > > different one(s), using newly opened FD to the gntdev/gntalloc device.
> > > > Should it unmap them and close FD to the device manually just after the
> > > > fork? Or the process using gntdev or gntalloc should prevent using fork
> > > > at all?
> > > > 
> > > > I'm asking because I get kernel oops[1] in context of such process. This
> > > > process uses both gntdev and gntalloc. The PID reported there is a
> > > > child, which maps additional pages (using newly opened FD to
> > > > /dev/xen/gnt*), but I'm not sure if the crash happens before, after or
> > > > at this second mapping (actually vchan connection), or maybe even at
> > > > cleanup of this second mapping. The parent process keeps its mappings
> > > > for the whole lifetime of its child. I don't have a 100% reliable way
> > > > to reproduce this problem, but it happens quite often when I run such
> > > > operations in a loop.
> > > 
> > > Any ideas?
> > 
> > I've done some further debugging, below is what I get.
> 
> Woot! Thank you!
> > 
> > > > The kernel is vanilla 3.19.3, running on Xen 4.4.2.
> > > > 
> > > > The kernel message:
> > > > [74376.073464] general protection fault: 0000 [#1] SMP 
> > > > [74376.073475] Modules linked in: fuse xt_conntrack ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip6table_filter ip6_tables intel_rapl iosf_mbi x86_pkg_temp_thermal coretemp crct10dif_pclmul crc32_pclmul crc32c_intel pcspkr xen_netfront ghash_clmulni_intel nfsd auth_rpcgss nfs_acl lockd grace xenfs xen_privcmd dummy_hcd udc_core xen_gntdev xen_gntalloc xen_blkback sunrpc u2mfn(O) xen_evtchn xen_blkfront
> > > > [74376.073522] CPU: 1 PID: 9377 Comm: qrexec-agent Tainted: G           O   3.19.3-4.pvops.qubes.x86_64 #1
> > > > [74376.073528] task: ffff880002442e40 ti: ffff88000032c000 task.ti: ffff88000032c000
> > > > [74376.073532] RIP: e030:[<ffffffffa00952c5>]  [<ffffffffa00952c5>] unmap_if_in_range+0x15/0xd0 [xen_gntdev]
> > 
> > 	static void unmap_if_in_range(struct grant_map *map,
> > 			  unsigned long start, unsigned long end)
> > 	{   
> > 	    unsigned long mstart, mend;
> > 	    int err;
> > 
> > 	    if (!map->vma)
> > 		return;
> > 
> > The above crash is at first access to "map"...
> > 
> > > > [74376.073543] RSP: e02b:ffff88000032fc08  EFLAGS: 00010292
> > > > [74376.073546] RAX: 0000000000000000 RBX: dead000000100100 RCX: 00007fd8616ea000
> > > > [74376.073550] RDX: 00007fd8616ea000 RSI: 00007fd8616e9000 RDI: dead000000100100
> > 
> > ... which is 0xdead000000100100 (LIST_POISON1).
> > 
> > 
> > > > [74376.073554] RBP: ffff88000032fc48 R08: 0000000000000000 R09: 0000000000000000
> > > > [74376.073557] R10: ffffea000021bb00 R11: 0000000000000000 R12: 00007fd8616e9000
> > > > [74376.073561] R13: 00007fd8616ea000 R14: ffff880012702e40 R15: ffff880012702e70
> > > > [74376.073569] FS:  00007fd8616ca700(0000) GS:ffff880013c80000(0000) knlGS:0000000000000000
> > > > [74376.073574] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > [74376.073577] CR2: 00007fd8616e9458 CR3: 00000000e7af5000 CR4: 0000000000042660
> > > > [74376.073582] Stack:
> > > > [74376.073584]  ffff8800188356c0 00000000000000d0 ffff88000032fc68 00000000c64ef797
> > > > [74376.073590]  0000000000000220 dead000000100100 00007fd8616e9000 00007fd8616ea000
> > > > [74376.073596]  ffff88000032fc88 ffffffffa00953c6 ffff88000032fcc8 ffff880012702e70
> > > > [74376.073603] Call Trace:
> > > > [74376.073610]  [<ffffffffa00953c6>] mn_invl_range_start+0x46/0x90 [xen_gntdev]
> > > > [74376.073620]  [<ffffffff811e88fb>]
> > > > __mmu_notifier_invalidate_range_start+0x5b/0x90
> > > > [74376.073627]  [<ffffffff811c2a59>] do_wp_page+0x769/0x820
> > > > [74376.074031]  [<ffffffff811c4f5c>] handle_mm_fault+0x7fc/0x10c0
> > > > [74376.074031]  [<ffffffff813864cd>] ? radix_tree_lookup+0xd/0x10
> > > > [74376.074031]  [<ffffffff81061e1c>] __do_page_fault+0x1dc/0x5a0
> > > > [74376.074031]  [<ffffffff817560a6>] ? mutex_lock+0x16/0x37
> > > > [74376.074031]  [<ffffffffa0008928>] ? evtchn_ioctl+0x118/0x3c0
> > > > [xen_evtchn]
> > > > [74376.074031]  [<ffffffff812209d8>] ? do_vfs_ioctl+0x2f8/0x4f0
> > > > [74376.074031]  [<ffffffff811cafdf>] ? do_munmap+0x29f/0x3b0
> > > > [74376.074031]  [<ffffffff81062211>] do_page_fault+0x31/0x70
> > > > [74376.074031]  [<ffffffff81759e28>] page_fault+0x28/0x30
> > 
> > 
> > 	static void mn_invl_range_start(struct mmu_notifier *mn,
> > 			struct mm_struct *mm,
> > 			unsigned long start, unsigned long end)
> > 	{
> > 	    struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
> > 	    struct grant_map *map;
> > 
> > 	    spin_lock(&priv->lock);
> > 	    list_for_each_entry(map, &priv->maps, next) {
> > 		unmap_if_in_range(map, start, end);
> > 	    }
> > 
> > mn_invl_range_start+0x46 is the first call to unmap_if_in_range, so
> > something is wrong with priv->maps list.
> > 
> > So I've searched for all the list_del calls on this list and found one not
> > guarded by spinlock:
> > 
> > 	static int gntdev_release(struct inode *inode, struct file *flip)
> > 	{   
> > 	    struct gntdev_priv *priv = flip->private_data;
> > 	    struct grant_map *map;
> > 
> > 	    pr_debug("priv %p\n", priv);
> > 
> > 	    while (!list_empty(&priv->maps)) {
> > 		map = list_entry(priv->maps.next, struct grant_map, next);
> > 		list_del(&map->next);
> > 		gntdev_put_map(NULL /* already removed */, map);
> > 	    }
> > 	    WARN_ON(!list_empty(&priv->freeable_maps));
> > 
> > 	    if (use_ptemod)
> > 		mmu_notifier_unregister(&priv->mn, priv->mm);
> > 	    kfree(priv);
> > 	    return 0;
> > 	}   
> > 
> > So I see this as:
> > P1(parent)                 P2(child)
> >                            1. gntdev_release called
> >                            2. list destroyed (above loop)
> > 
> > 3. page fault occurs, gntdev mmu notifier called
> > 4. priv->lock taken
> > 5. iterate over priv->maps
> 
> 
> Could we check the map->users?

I don't have an easy way to reproduce the problem. It happens randomly
once a day/two in some random domain. I've tried to fiddle with
gnttab+fork+page faults, but failed to reliably reproduce the problem.

> > 6. crashed since map is already destroyed
> 
> Oh, indeed. gntdev_release calls gntdev_put_map which frees the whole
> thing! Ouch.
> > 
> >                            7. mmu_notifier_unregister calls:
> >                            8.   mn_release, which tries to take priv->lock
> >                            9. this process hangs
> > 
> > So I'd guess the fix would be to move mmu_notifier_unregister before
> > releasing priv->maps list. Could someone more familiar with this code
> > confirm this?
> 
> Hey Marek,
> 
> Could we just add an lock around the 'gntdev_release' loop?

I don't fully understand what is going there, especially when
gntdev_release is called (in case of forked process, still some memory
mapped etc), and when mmu notifier is called. But just looking at this
crash IMO adding a lock should be sufficient. What will happen when mmu notifier
would not find the mapping? I guess nothing, because the mapping is just
removed for a reason, right?

> Daniel added this code to guard against a similar race:
> 
> commit 16a1d0225e22e4e273e6b60a21db95decde666c2
> Author: Daniel De Graaf <dgdegra@tycho.nsa.gov>
> Date:   Wed Jan 2 22:57:12 2013 +0000
> 
>     xen/gntdev: correctly unmap unlinked maps in mmu notifier
>     
>     If gntdev_ioctl_unmap_grant_ref is called on a range before unmapping
>     it, the entry is removed from priv->maps and the later call to
>     mn_invl_range_start won't find it to do the unmapping. Fix this by
>     creating another list of freeable maps that the mmu notifier can search
>     and use to unmap grants.
>     
>     Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
> 
> Lets see what he thinks.
> > 
> > > > [74376.074031] Code: e9 dd fd ff ff 31 c9 31 db e9 20 fe ff ff 0f 1f
> > > > 84 00 00 00 00 00 66 66 66 66 90 55 48 89 e5 41 55 41 54 53 48 89 fb
> > > > 48 83 ec 28 <48> 8b 47 10 48 85 c0 74 4e 4c 8b 00 49 39 d0 73 46 4c
> > > > 8b 48 08
> > > > [74376.074031] RIP  [<ffffffffa00952c5>] unmap_if_in_range+0x15/0xd0
> > > > [xen_gntdev]
> > > > [74376.074031]  RSP <ffff88000032fc08>
> > > > [74376.091682] ---[ end trace 2b21c5b714eb1071 ]---
> > 
> > And the other process:
> > 
> > > > [74404.069009] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 23s!
> > > > [qrexec-agent:9379]
> > > > [74404.069009] Modules linked in: fuse xt_conntrack ipt_MASQUERADE
> > > > nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4
> > > > nf_nat_ipv4 nf_nat nf_conntrack ip6table_filter ip6_tables
> > > > intel_rapl iosf_mbi x86_pkg_temp_thermal coretemp crct10dif_pclmul
> > > > crc32_pclmul crc32c_intel pcspkr xen_netfront ghash_clmulni_intel
> > > > nfsd auth_rpcgss nfs_acl lockd grace xenfs xen_privcmd dummy_hcd
> > > > udc_core xen_gntdev xen_gntalloc xen_blkback sunrpc u2mfn(O)
> > > > xen_evtchn xen_blkfront
> > > > [74404.069009] CPU: 2 PID: 9379 Comm: qrexec-agent Tainted: G      D
> > > > O   3.19.3-4.pvops.qubes.x86_64 #1
> > > > [74404.069009] task: ffff880010e24a00 ti: ffff880002470000 task.ti:
> > > > ffff880002470000
> > > > [74404.069009] RIP: e030:[<ffffffff81757b11>]  [<ffffffff81757b11>]
> > > > _raw_spin_lock+0x21/0x30
> > > > [74404.069009] RSP: e02b:ffff880002473e18  EFLAGS: 00000297
> > > > [74404.069009] RAX: 0000000000000040 RBX: ffff880002345c00 RCX:
> > > > 0000000000018cf8
> > > > [74404.069009] RDX: 0000000000000041 RSI: ffff880002345c00 RDI:
> > > > ffff880012702e60
> > > > [74404.069009] RBP: ffff880002473e18 R08: ffff880012702240 R09:
> > > > 00000001802a0019
> > > > [74404.069009] R10: ffffea000049c080 R11: ffffffffa00955bf R12:
> > > > ffff880012702e70
> > > > [74404.069009] R13: ffff880012702e40 R14: ffff8800132c6f20 R15:
> > > > ffff880012b163c0
> > > > [74404.069009] FS:  00007fd8616ca700(0000) GS:ffff880013d00000(0000)
> > > > knlGS:0000000000000000
> > > > [74404.069009] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > [74404.069009] CR2: 00007fd8610be098 CR3: 000000000b971000 CR4:
> > > > 0000000000042660
> > > > [74404.069009] Stack:
> > > > [74404.069009]  ffff880002473e48 ffffffffa0095452 ffff880002473e48
> > > > ffff880002345c00
> > > > [74404.069009]  ffff880012702e70 0000000000000000 ffff880002473e78
> > > > ffffffff811e8c2e
> > > > [74404.069009]  ffff880002473e78 ffff880012702e40 ffff880012702e40
> > > > ffff880012d123c8
> > > > [74404.069009] Call Trace:
> > > > [74404.069009]  [<ffffffffa0095452>] mn_release+0x22/0x130
> > > > [xen_gntdev]
> > > > [74404.069009]  [<ffffffff811e8c2e>]
> > > > mmu_notifier_unregister+0x4e/0xe0
> > > > [74404.069009]  [<ffffffffa00957c0>] gntdev_release+0x60/0xa0
> > > > [xen_gntdev]
> > > > [74404.069009]  [<ffffffff8120ec0f>] __fput+0xdf/0x1e0
> > > > [74404.069009]  [<ffffffff8120ed5e>] ____fput+0xe/0x10
> > > > [74404.069009]  [<ffffffff810b56df>] task_work_run+0xbf/0x100
> > > > [74404.069009]  [<ffffffff81014c47>] do_notify_resume+0x97/0xb0
> > > > [74404.069009]  [<ffffffff81758127>] int_signal+0x12/0x17
> > > > [74404.069009] Code: 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55
> > > > 48 89 e5 b8 00 01 00 00 f0 66 0f c1 07 0f b6 d4 38 c2 75 04 5d c3 f3
> > > > 90 0f b6 07 <38> d0 75 f7 5d c3 66 0f 1f 84 00 00 00 00 00 66 66 66
> > > > 66 90 55
> > 
> > -- 
> > Best Regards,
> > Marek Marczykowski-Górecki
> > Invisible Things Lab
> > A: Because it messes up the order in which people normally read text.
> > Q: Why is top-posting such a bad thing?
> 
> 

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?

[-- Attachment #1.2: Type: application/pgp-signature, Size: 473 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

next prev parent reply	other threads:[~2015-06-22 18:13 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-04-30 14:47 gntdev/gntalloc and fork Marek Marczykowski-Górecki
2015-05-27 23:45 ` gntdev/gntalloc and fork? - crash in gntdev Marek Marczykowski-Górecki
2015-06-17 19:42   ` race condition in xen-gntdev (was: Re: gntdev/gntalloc and fork? - crash in gntdev) Marek Marczykowski-Górecki
2015-06-22 17:46     ` Konrad Rzeszutek Wilk
2015-06-22 18:13       ` Marek Marczykowski-Górecki [this message]
2015-06-22 18:37         ` Konrad Rzeszutek Wilk
2015-06-22 19:14           ` race condition in xen-gntdev Daniel De Graaf
2015-06-26  1:28             ` Marek Marczykowski-Górecki
2015-06-29 14:39               ` Konrad Rzeszutek Wilk
2015-06-29 14:50                 ` Marek Marczykowski-Górecki
2015-07-22  3:21                   ` Marek Marczykowski-Górecki
2015-07-22 13:58                     ` Konrad Rzeszutek Wilk
2015-06-30 15:26               ` David Vrabel
2015-06-30 15:51                 ` Marek Marczykowski-Górecki

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20150622181335.GJ11083@mail-itl \
    --to=marmarek@invisiblethingslab.com \
    --cc=boris.ostrovsky@oracle.com \
    --cc=david.vrabel@citrix.com \
    --cc=dgdegra@tycho.nsa.gov \
    --cc=konrad.wilk@oracle.com \
    --cc=xen-devel@lists.xen.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.