From mboxrd@z Thu Jan  1 00:00:00 1970
From: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Subject: Re: race condition in xen-gntdev
Date: Mon, 22 Jun 2015 15:14:16 -0400
Message-ID: <55885E88.2040805@tycho.nsa.gov>
References: <20150430144744.GF919@mail-itl> <20150527234508.GA14838@mail-itl>
	<20150617194211.GB11083@mail-itl>
	<20150622174626.GH5408@l.oracle.com>
	<20150622181335.GJ11083@mail-itl>
	<20150622183713.GD9631@l.oracle.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="windows-1252"; Format="flowed"
Content-Transfer-Encoding: quoted-printable
Return-path: <xen-devel-bounces@lists.xen.org>
In-Reply-To: <20150622183713.GD9631@l.oracle.com>
List-Unsubscribe: <http://lists.xen.org/cgi-bin/mailman/options/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xen.org>
List-Help: <mailto:xen-devel-request@lists.xen.org?subject=help>
List-Subscribe: <http://lists.xen.org/cgi-bin/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=subscribe>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>, =?windows-1252?Q?Mare?= =?windows-1252?Q?k_Marczykowski-G=F3recki?= <marmarek@invisiblethingslab.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>, David Vrabel <david.vrabel@citrix.com>, xen-devel <xen-devel@lists.xen.org>
List-Id: xen-devel@lists.xenproject.org

On 06/22/2015 02:37 PM, Konrad Rzeszutek Wilk wrote:
> On Mon, Jun 22, 2015 at 08:13:35PM +0200, Marek Marczykowski-G=F3recki wr=
ote:
>> On Mon, Jun 22, 2015 at 01:46:27PM -0400, Konrad Rzeszutek Wilk wrote:
>>> On Wed, Jun 17, 2015 at 09:42:11PM +0200, Marek Marczykowski-G=F3recki =
wrote:
>>>> On Thu, May 28, 2015 at 01:45:08AM +0200, Marek Marczykowski-G=F3recki=
 wrote:
>>>>> On Thu, Apr 30, 2015 at 04:47:44PM +0200, Marek Marczykowski-G=F3reck=
i wrote:
>>>>>> Hi,
>>>>>>
>>>>>> What is the proper way to handle shared pages (either side - using
>>>>>> gntdev or gntalloc) regarding fork and possible exec later? The child
>>>>>> process do not need to access those pages in any way, but will map
>>>>>> different one(s), using newly opened FD to the gntdev/gntalloc devic=
e.
>>>>>> Should it unmap them and close FD to the device manually just after =
the
>>>>>> fork? Or the process using gntdev or gntalloc should prevent using f=
ork
>>>>>> at all?
>>>>>>
>>>>>> I'm asking because I get kernel oops[1] in context of such process. =
This
>>>>>> process uses both gntdev and gntalloc. The PID reported there is a
>>>>>> child, which maps additional pages (using newly opened FD to
>>>>>> /dev/xen/gnt*), but I'm not sure if the crash happens before, after =
or
>>>>>> at this second mapping (actually vchan connection), or maybe even at
>>>>>> cleanup of this second mapping. The parent process keeps its mappings
>>>>>> for the whole lifetime of its child. I don't have a 100% reliable way
>>>>>> to reproduce this problem, but it happens quite often when I run such
>>>>>> operations in a loop.
>>>>>
>>>>> Any ideas?
>>>>
>>>> I've done some further debugging, below is what I get.
>>>
>>> Woot! Thank you!
>>>>
>>>>>> The kernel is vanilla 3.19.3, running on Xen 4.4.2.
>>>>>>
>>>>>> The kernel message:
>>>>>> [74376.073464] general protection fault: 0000 [#1] SMP
>>>>>> [74376.073475] Modules linked in: fuse xt_conntrack ipt_MASQUERADE n=
f_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_i=
pv4 nf_nat nf_conntrack ip6table_filter ip6_tables intel_rapl iosf_mbi x86_=
pkg_temp_thermal coretemp crct10dif_pclmul crc32_pclmul crc32c_intel pcspkr=
 xen_netfront ghash_clmulni_intel nfsd auth_rpcgss nfs_acl lockd grace xenf=
s xen_privcmd dummy_hcd udc_core xen_gntdev xen_gntalloc xen_blkback sunrpc=
 u2mfn(O) xen_evtchn xen_blkfront
>>>>>> [74376.073522] CPU: 1 PID: 9377 Comm: qrexec-agent Tainted: G       =
    O   3.19.3-4.pvops.qubes.x86_64 #1
>>>>>> [74376.073528] task: ffff880002442e40 ti: ffff88000032c000 task.ti: =
ffff88000032c000
>>>>>> [74376.073532] RIP: e030:[<ffffffffa00952c5>]  [<ffffffffa00952c5>] =
unmap_if_in_range+0x15/0xd0 [xen_gntdev]
>>>>
>>>> 	static void unmap_if_in_range(struct grant_map *map,
>>>> 			  unsigned long start, unsigned long end)
>>>> 	{
>>>> 	    unsigned long mstart, mend;
>>>> 	    int err;
>>>>
>>>> 	    if (!map->vma)
>>>> 		return;
>>>>
>>>> The above crash is at first access to "map"...
>>>>
>>>>>> [74376.073543] RSP: e02b:ffff88000032fc08  EFLAGS: 00010292
>>>>>> [74376.073546] RAX: 0000000000000000 RBX: dead000000100100 RCX: 0000=
7fd8616ea000
>>>>>> [74376.073550] RDX: 00007fd8616ea000 RSI: 00007fd8616e9000 RDI: dead=
000000100100
>>>>
>>>> ... which is 0xdead000000100100 (LIST_POISON1).
>>>>
>>>>
>>>>>> [74376.073554] RBP: ffff88000032fc48 R08: 0000000000000000 R09: 0000=
000000000000
>>>>>> [74376.073557] R10: ffffea000021bb00 R11: 0000000000000000 R12: 0000=
7fd8616e9000
>>>>>> [74376.073561] R13: 00007fd8616ea000 R14: ffff880012702e40 R15: ffff=
880012702e70
>>>>>> [74376.073569] FS:  00007fd8616ca700(0000) GS:ffff880013c80000(0000)=
 knlGS:0000000000000000
>>>>>> [74376.073574] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>> [74376.073577] CR2: 00007fd8616e9458 CR3: 00000000e7af5000 CR4: 0000=
000000042660
>>>>>> [74376.073582] Stack:
>>>>>> [74376.073584]  ffff8800188356c0 00000000000000d0 ffff88000032fc68 0=
0000000c64ef797
>>>>>> [74376.073590]  0000000000000220 dead000000100100 00007fd8616e9000 0=
0007fd8616ea000
>>>>>> [74376.073596]  ffff88000032fc88 ffffffffa00953c6 ffff88000032fcc8 f=
fff880012702e70
>>>>>> [74376.073603] Call Trace:
>>>>>> [74376.073610]  [<ffffffffa00953c6>] mn_invl_range_start+0x46/0x90 [=
xen_gntdev]
>>>>>> [74376.073620]  [<ffffffff811e88fb>]
>>>>>> __mmu_notifier_invalidate_range_start+0x5b/0x90
>>>>>> [74376.073627]  [<ffffffff811c2a59>] do_wp_page+0x769/0x820
>>>>>> [74376.074031]  [<ffffffff811c4f5c>] handle_mm_fault+0x7fc/0x10c0
>>>>>> [74376.074031]  [<ffffffff813864cd>] ? radix_tree_lookup+0xd/0x10
>>>>>> [74376.074031]  [<ffffffff81061e1c>] __do_page_fault+0x1dc/0x5a0
>>>>>> [74376.074031]  [<ffffffff817560a6>] ? mutex_lock+0x16/0x37
>>>>>> [74376.074031]  [<ffffffffa0008928>] ? evtchn_ioctl+0x118/0x3c0
>>>>>> [xen_evtchn]
>>>>>> [74376.074031]  [<ffffffff812209d8>] ? do_vfs_ioctl+0x2f8/0x4f0
>>>>>> [74376.074031]  [<ffffffff811cafdf>] ? do_munmap+0x29f/0x3b0
>>>>>> [74376.074031]  [<ffffffff81062211>] do_page_fault+0x31/0x70
>>>>>> [74376.074031]  [<ffffffff81759e28>] page_fault+0x28/0x30
>>>>
>>>>
>>>> 	static void mn_invl_range_start(struct mmu_notifier *mn,
>>>> 			struct mm_struct *mm,
>>>> 			unsigned long start, unsigned long end)
>>>> 	{
>>>> 	    struct gntdev_priv *priv =3D container_of(mn, struct gntdev_priv,=
 mn);
>>>> 	    struct grant_map *map;
>>>>
>>>> 	    spin_lock(&priv->lock);
>>>> 	    list_for_each_entry(map, &priv->maps, next) {
>>>> 		unmap_if_in_range(map, start, end);
>>>> 	    }
>>>>
>>>> mn_invl_range_start+0x46 is the first call to unmap_if_in_range, so
>>>> something is wrong with priv->maps list.
>>>>
>>>> So I've searched for all the list_del calls on this list and found one=
 not
>>>> guarded by spinlock:
>>>>
>>>> 	static int gntdev_release(struct inode *inode, struct file *flip)
>>>> 	{
>>>> 	    struct gntdev_priv *priv =3D flip->private_data;
>>>> 	    struct grant_map *map;
>>>>
>>>> 	    pr_debug("priv %p\n", priv);
>>>>
>>>> 	    while (!list_empty(&priv->maps)) {
>>>> 		map =3D list_entry(priv->maps.next, struct grant_map, next);
>>>> 		list_del(&map->next);
>>>> 		gntdev_put_map(NULL /* already removed */, map);
>>>> 	    }
>>>> 	    WARN_ON(!list_empty(&priv->freeable_maps));
>>>>
>>>> 	    if (use_ptemod)
>>>> 		mmu_notifier_unregister(&priv->mn, priv->mm);
>>>> 	    kfree(priv);
>>>> 	    return 0;
>>>> 	}
>>>>
>>>> So I see this as:
>>>> P1(parent)                 P2(child)
>>>>                             1. gntdev_release called
>>>>                             2. list destroyed (above loop)
>>>>
>>>> 3. page fault occurs, gntdev mmu notifier called
>>>> 4. priv->lock taken
>>>> 5. iterate over priv->maps
>>>
>>>
>>> Could we check the map->users?
>>
>> I don't have an easy way to reproduce the problem. It happens randomly
>> once a day/two in some random domain. I've tried to fiddle with
>> gnttab+fork+page faults, but failed to reliably reproduce the problem.
>>
>>>> 6. crashed since map is already destroyed
>>>
>>> Oh, indeed. gntdev_release calls gntdev_put_map which frees the whole
>>> thing! Ouch.
>>>>
>>>>                             7. mmu_notifier_unregister calls:
>>>>                             8.   mn_release, which tries to take priv-=
>lock
>>>>                             9. this process hangs
>>>>
>>>> So I'd guess the fix would be to move mmu_notifier_unregister before
>>>> releasing priv->maps list. Could someone more familiar with this code
>>>> confirm this?
>>>
>>> Hey Marek,
>>>
>>> Could we just add an lock around the 'gntdev_release' loop?
>>
>> I don't fully understand what is going there, especially when
>> gntdev_release is called (in case of forked process, still some memory
>> mapped etc), and when mmu notifier is called. But just looking at this
>> crash IMO adding a lock should be sufficient. What will happen when mmu =
notifier
>> would not find the mapping? I guess nothing, because the mapping is just
>> removed for a reason, right?
>
> That was my thought. But then why didn't we add this lock in the first pl=
ace?
> Hopefully Daniel can remember ~2yr patch :-)

The reason that gntdev_release didn't have a lock is because there are not
supposed to be any references to the areas pointed to by priv->maps when it
is called.  However, since the MMU notifier has not yet been unregistered,
it is apparently possible to race here; the comment on mmu_notifier_unregis=
ter
seems to confirm this as a possibility (as do the backtraces).

I think adding the lock will be sufficient.