Re: race condition in xen-gntdev

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Daniel De Graaf <dgdegra@tycho.nsa.gov>
To: "Konrad Rzeszutek Wilk" <konrad.wilk@oracle.com>,
	"Marek Marczykowski-Górecki" <marmarek@invisiblethingslab.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>,
	David Vrabel <david.vrabel@citrix.com>,
	xen-devel <xen-devel@lists.xen.org>
Subject: Re: race condition in xen-gntdev
Date: Mon, 22 Jun 2015 15:14:16 -0400	[thread overview]
Message-ID: <55885E88.2040805@tycho.nsa.gov> (raw)
In-Reply-To: <20150622183713.GD9631@l.oracle.com>

On 06/22/2015 02:37 PM, Konrad Rzeszutek Wilk wrote:
> On Mon, Jun 22, 2015 at 08:13:35PM +0200, Marek Marczykowski-Górecki wrote:
>> On Mon, Jun 22, 2015 at 01:46:27PM -0400, Konrad Rzeszutek Wilk wrote:
>>> On Wed, Jun 17, 2015 at 09:42:11PM +0200, Marek Marczykowski-Górecki wrote:
>>>> On Thu, May 28, 2015 at 01:45:08AM +0200, Marek Marczykowski-Górecki wrote:
>>>>> On Thu, Apr 30, 2015 at 04:47:44PM +0200, Marek Marczykowski-Górecki wrote:
>>>>>> Hi,
>>>>>>
>>>>>> What is the proper way to handle shared pages (either side - using
>>>>>> gntdev or gntalloc) regarding fork and possible exec later? The child
>>>>>> process do not need to access those pages in any way, but will map
>>>>>> different one(s), using newly opened FD to the gntdev/gntalloc device.
>>>>>> Should it unmap them and close FD to the device manually just after the
>>>>>> fork? Or the process using gntdev or gntalloc should prevent using fork
>>>>>> at all?
>>>>>>
>>>>>> I'm asking because I get kernel oops[1] in context of such process. This
>>>>>> process uses both gntdev and gntalloc. The PID reported there is a
>>>>>> child, which maps additional pages (using newly opened FD to
>>>>>> /dev/xen/gnt*), but I'm not sure if the crash happens before, after or
>>>>>> at this second mapping (actually vchan connection), or maybe even at
>>>>>> cleanup of this second mapping. The parent process keeps its mappings
>>>>>> for the whole lifetime of its child. I don't have a 100% reliable way
>>>>>> to reproduce this problem, but it happens quite often when I run such
>>>>>> operations in a loop.
>>>>>
>>>>> Any ideas?
>>>>
>>>> I've done some further debugging, below is what I get.
>>>
>>> Woot! Thank you!
>>>>
>>>>>> The kernel is vanilla 3.19.3, running on Xen 4.4.2.
>>>>>>
>>>>>> The kernel message:
>>>>>> [74376.073464] general protection fault: 0000 [#1] SMP
>>>>>> [74376.073475] Modules linked in: fuse xt_conntrack ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip6table_filter ip6_tables intel_rapl iosf_mbi x86_pkg_temp_thermal coretemp crct10dif_pclmul crc32_pclmul crc32c_intel pcspkr xen_netfront ghash_clmulni_intel nfsd auth_rpcgss nfs_acl lockd grace xenfs xen_privcmd dummy_hcd udc_core xen_gntdev xen_gntalloc xen_blkback sunrpc u2mfn(O) xen_evtchn xen_blkfront
>>>>>> [74376.073522] CPU: 1 PID: 9377 Comm: qrexec-agent Tainted: G           O   3.19.3-4.pvops.qubes.x86_64 #1
>>>>>> [74376.073528] task: ffff880002442e40 ti: ffff88000032c000 task.ti: ffff88000032c000
>>>>>> [74376.073532] RIP: e030:[<ffffffffa00952c5>]  [<ffffffffa00952c5>] unmap_if_in_range+0x15/0xd0 [xen_gntdev]
>>>>
>>>> 	static void unmap_if_in_range(struct grant_map *map,
>>>> 			  unsigned long start, unsigned long end)
>>>> 	{
>>>> 	    unsigned long mstart, mend;
>>>> 	    int err;
>>>>
>>>> 	    if (!map->vma)
>>>> 		return;
>>>>
>>>> The above crash is at first access to "map"...
>>>>
>>>>>> [74376.073543] RSP: e02b:ffff88000032fc08  EFLAGS: 00010292
>>>>>> [74376.073546] RAX: 0000000000000000 RBX: dead000000100100 RCX: 00007fd8616ea000
>>>>>> [74376.073550] RDX: 00007fd8616ea000 RSI: 00007fd8616e9000 RDI: dead000000100100
>>>>
>>>> ... which is 0xdead000000100100 (LIST_POISON1).
>>>>
>>>>
>>>>>> [74376.073554] RBP: ffff88000032fc48 R08: 0000000000000000 R09: 0000000000000000
>>>>>> [74376.073557] R10: ffffea000021bb00 R11: 0000000000000000 R12: 00007fd8616e9000
>>>>>> [74376.073561] R13: 00007fd8616ea000 R14: ffff880012702e40 R15: ffff880012702e70
>>>>>> [74376.073569] FS:  00007fd8616ca700(0000) GS:ffff880013c80000(0000) knlGS:0000000000000000
>>>>>> [74376.073574] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>> [74376.073577] CR2: 00007fd8616e9458 CR3: 00000000e7af5000 CR4: 0000000000042660
>>>>>> [74376.073582] Stack:
>>>>>> [74376.073584]  ffff8800188356c0 00000000000000d0 ffff88000032fc68 00000000c64ef797
>>>>>> [74376.073590]  0000000000000220 dead000000100100 00007fd8616e9000 00007fd8616ea000
>>>>>> [74376.073596]  ffff88000032fc88 ffffffffa00953c6 ffff88000032fcc8 ffff880012702e70
>>>>>> [74376.073603] Call Trace:
>>>>>> [74376.073610]  [<ffffffffa00953c6>] mn_invl_range_start+0x46/0x90 [xen_gntdev]
>>>>>> [74376.073620]  [<ffffffff811e88fb>]
>>>>>> __mmu_notifier_invalidate_range_start+0x5b/0x90
>>>>>> [74376.073627]  [<ffffffff811c2a59>] do_wp_page+0x769/0x820
>>>>>> [74376.074031]  [<ffffffff811c4f5c>] handle_mm_fault+0x7fc/0x10c0
>>>>>> [74376.074031]  [<ffffffff813864cd>] ? radix_tree_lookup+0xd/0x10
>>>>>> [74376.074031]  [<ffffffff81061e1c>] __do_page_fault+0x1dc/0x5a0
>>>>>> [74376.074031]  [<ffffffff817560a6>] ? mutex_lock+0x16/0x37
>>>>>> [74376.074031]  [<ffffffffa0008928>] ? evtchn_ioctl+0x118/0x3c0
>>>>>> [xen_evtchn]
>>>>>> [74376.074031]  [<ffffffff812209d8>] ? do_vfs_ioctl+0x2f8/0x4f0
>>>>>> [74376.074031]  [<ffffffff811cafdf>] ? do_munmap+0x29f/0x3b0
>>>>>> [74376.074031]  [<ffffffff81062211>] do_page_fault+0x31/0x70
>>>>>> [74376.074031]  [<ffffffff81759e28>] page_fault+0x28/0x30
>>>>
>>>>
>>>> 	static void mn_invl_range_start(struct mmu_notifier *mn,
>>>> 			struct mm_struct *mm,
>>>> 			unsigned long start, unsigned long end)
>>>> 	{
>>>> 	    struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
>>>> 	    struct grant_map *map;
>>>>
>>>> 	    spin_lock(&priv->lock);
>>>> 	    list_for_each_entry(map, &priv->maps, next) {
>>>> 		unmap_if_in_range(map, start, end);
>>>> 	    }
>>>>
>>>> mn_invl_range_start+0x46 is the first call to unmap_if_in_range, so
>>>> something is wrong with priv->maps list.
>>>>
>>>> So I've searched for all the list_del calls on this list and found one not
>>>> guarded by spinlock:
>>>>
>>>> 	static int gntdev_release(struct inode *inode, struct file *flip)
>>>> 	{
>>>> 	    struct gntdev_priv *priv = flip->private_data;
>>>> 	    struct grant_map *map;
>>>>
>>>> 	    pr_debug("priv %p\n", priv);
>>>>
>>>> 	    while (!list_empty(&priv->maps)) {
>>>> 		map = list_entry(priv->maps.next, struct grant_map, next);
>>>> 		list_del(&map->next);
>>>> 		gntdev_put_map(NULL /* already removed */, map);
>>>> 	    }
>>>> 	    WARN_ON(!list_empty(&priv->freeable_maps));
>>>>
>>>> 	    if (use_ptemod)
>>>> 		mmu_notifier_unregister(&priv->mn, priv->mm);
>>>> 	    kfree(priv);
>>>> 	    return 0;
>>>> 	}
>>>>
>>>> So I see this as:
>>>> P1(parent)                 P2(child)
>>>>                             1. gntdev_release called
>>>>                             2. list destroyed (above loop)
>>>>
>>>> 3. page fault occurs, gntdev mmu notifier called
>>>> 4. priv->lock taken
>>>> 5. iterate over priv->maps
>>>
>>>
>>> Could we check the map->users?
>>
>> I don't have an easy way to reproduce the problem. It happens randomly
>> once a day/two in some random domain. I've tried to fiddle with
>> gnttab+fork+page faults, but failed to reliably reproduce the problem.
>>
>>>> 6. crashed since map is already destroyed
>>>
>>> Oh, indeed. gntdev_release calls gntdev_put_map which frees the whole
>>> thing! Ouch.
>>>>
>>>>                             7. mmu_notifier_unregister calls:
>>>>                             8.   mn_release, which tries to take priv->lock
>>>>                             9. this process hangs
>>>>
>>>> So I'd guess the fix would be to move mmu_notifier_unregister before
>>>> releasing priv->maps list. Could someone more familiar with this code
>>>> confirm this?
>>>
>>> Hey Marek,
>>>
>>> Could we just add an lock around the 'gntdev_release' loop?
>>
>> I don't fully understand what is going there, especially when
>> gntdev_release is called (in case of forked process, still some memory
>> mapped etc), and when mmu notifier is called. But just looking at this
>> crash IMO adding a lock should be sufficient. What will happen when mmu notifier
>> would not find the mapping? I guess nothing, because the mapping is just
>> removed for a reason, right?
>
> That was my thought. But then why didn't we add this lock in the first place?
> Hopefully Daniel can remember ~2yr patch :-)

The reason that gntdev_release didn't have a lock is because there are not
supposed to be any references to the areas pointed to by priv->maps when it
is called.  However, since the MMU notifier has not yet been unregistered,
it is apparently possible to race here; the comment on mmu_notifier_unregister
seems to confirm this as a possibility (as do the backtraces).

I think adding the lock will be sufficient.

next prev parent reply	other threads:[~2015-06-22 19:14 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-04-30 14:47 gntdev/gntalloc and fork Marek Marczykowski-Górecki
2015-05-27 23:45 ` gntdev/gntalloc and fork? - crash in gntdev Marek Marczykowski-Górecki
2015-06-17 19:42   ` race condition in xen-gntdev (was: Re: gntdev/gntalloc and fork? - crash in gntdev) Marek Marczykowski-Górecki
2015-06-22 17:46     ` Konrad Rzeszutek Wilk
2015-06-22 18:13       ` Marek Marczykowski-Górecki
2015-06-22 18:37         ` Konrad Rzeszutek Wilk
2015-06-22 19:14           ` Daniel De Graaf [this message]
2015-06-26  1:28             ` race condition in xen-gntdev Marek Marczykowski-Górecki
2015-06-29 14:39               ` Konrad Rzeszutek Wilk
2015-06-29 14:50                 ` Marek Marczykowski-Górecki
2015-07-22  3:21                   ` Marek Marczykowski-Górecki
2015-07-22 13:58                     ` Konrad Rzeszutek Wilk
2015-06-30 15:26               ` David Vrabel
2015-06-30 15:51                 ` Marek Marczykowski-Górecki

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=55885E88.2040805@tycho.nsa.gov \
    --to=dgdegra@tycho.nsa.gov \
    --cc=boris.ostrovsky@oracle.com \
    --cc=david.vrabel@citrix.com \
    --cc=konrad.wilk@oracle.com \
    --cc=marmarek@invisiblethingslab.com \
    --cc=xen-devel@lists.xen.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.