From mboxrd@z Thu Jan 1 00:00:00 1970 From: Daniel De Graaf Subject: Re: race condition in xen-gntdev Date: Mon, 22 Jun 2015 15:14:16 -0400 Message-ID: <55885E88.2040805@tycho.nsa.gov> References: <20150430144744.GF919@mail-itl> <20150527234508.GA14838@mail-itl> <20150617194211.GB11083@mail-itl> <20150622174626.GH5408@l.oracle.com> <20150622181335.GJ11083@mail-itl> <20150622183713.GD9631@l.oracle.com> Mime-Version: 1.0 Content-Type: text/plain; charset="windows-1252"; Format="flowed" Content-Transfer-Encoding: quoted-printable Return-path: In-Reply-To: <20150622183713.GD9631@l.oracle.com> List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: xen-devel-bounces@lists.xen.org Errors-To: xen-devel-bounces@lists.xen.org To: Konrad Rzeszutek Wilk , =?windows-1252?Q?Mare?= =?windows-1252?Q?k_Marczykowski-G=F3recki?= Cc: Boris Ostrovsky , David Vrabel , xen-devel List-Id: xen-devel@lists.xenproject.org On 06/22/2015 02:37 PM, Konrad Rzeszutek Wilk wrote: > On Mon, Jun 22, 2015 at 08:13:35PM +0200, Marek Marczykowski-G=F3recki wr= ote: >> On Mon, Jun 22, 2015 at 01:46:27PM -0400, Konrad Rzeszutek Wilk wrote: >>> On Wed, Jun 17, 2015 at 09:42:11PM +0200, Marek Marczykowski-G=F3recki = wrote: >>>> On Thu, May 28, 2015 at 01:45:08AM +0200, Marek Marczykowski-G=F3recki= wrote: >>>>> On Thu, Apr 30, 2015 at 04:47:44PM +0200, Marek Marczykowski-G=F3reck= i wrote: >>>>>> Hi, >>>>>> >>>>>> What is the proper way to handle shared pages (either side - using >>>>>> gntdev or gntalloc) regarding fork and possible exec later? The child >>>>>> process do not need to access those pages in any way, but will map >>>>>> different one(s), using newly opened FD to the gntdev/gntalloc devic= e. >>>>>> Should it unmap them and close FD to the device manually just after = the >>>>>> fork? Or the process using gntdev or gntalloc should prevent using f= ork >>>>>> at all? >>>>>> >>>>>> I'm asking because I get kernel oops[1] in context of such process. = This >>>>>> process uses both gntdev and gntalloc. The PID reported there is a >>>>>> child, which maps additional pages (using newly opened FD to >>>>>> /dev/xen/gnt*), but I'm not sure if the crash happens before, after = or >>>>>> at this second mapping (actually vchan connection), or maybe even at >>>>>> cleanup of this second mapping. The parent process keeps its mappings >>>>>> for the whole lifetime of its child. I don't have a 100% reliable way >>>>>> to reproduce this problem, but it happens quite often when I run such >>>>>> operations in a loop. >>>>> >>>>> Any ideas? >>>> >>>> I've done some further debugging, below is what I get. >>> >>> Woot! Thank you! >>>> >>>>>> The kernel is vanilla 3.19.3, running on Xen 4.4.2. >>>>>> >>>>>> The kernel message: >>>>>> [74376.073464] general protection fault: 0000 [#1] SMP >>>>>> [74376.073475] Modules linked in: fuse xt_conntrack ipt_MASQUERADE n= f_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_i= pv4 nf_nat nf_conntrack ip6table_filter ip6_tables intel_rapl iosf_mbi x86_= pkg_temp_thermal coretemp crct10dif_pclmul crc32_pclmul crc32c_intel pcspkr= xen_netfront ghash_clmulni_intel nfsd auth_rpcgss nfs_acl lockd grace xenf= s xen_privcmd dummy_hcd udc_core xen_gntdev xen_gntalloc xen_blkback sunrpc= u2mfn(O) xen_evtchn xen_blkfront >>>>>> [74376.073522] CPU: 1 PID: 9377 Comm: qrexec-agent Tainted: G = O 3.19.3-4.pvops.qubes.x86_64 #1 >>>>>> [74376.073528] task: ffff880002442e40 ti: ffff88000032c000 task.ti: = ffff88000032c000 >>>>>> [74376.073532] RIP: e030:[] [] = unmap_if_in_range+0x15/0xd0 [xen_gntdev] >>>> >>>> static void unmap_if_in_range(struct grant_map *map, >>>> unsigned long start, unsigned long end) >>>> { >>>> unsigned long mstart, mend; >>>> int err; >>>> >>>> if (!map->vma) >>>> return; >>>> >>>> The above crash is at first access to "map"... >>>> >>>>>> [74376.073543] RSP: e02b:ffff88000032fc08 EFLAGS: 00010292 >>>>>> [74376.073546] RAX: 0000000000000000 RBX: dead000000100100 RCX: 0000= 7fd8616ea000 >>>>>> [74376.073550] RDX: 00007fd8616ea000 RSI: 00007fd8616e9000 RDI: dead= 000000100100 >>>> >>>> ... which is 0xdead000000100100 (LIST_POISON1). >>>> >>>> >>>>>> [74376.073554] RBP: ffff88000032fc48 R08: 0000000000000000 R09: 0000= 000000000000 >>>>>> [74376.073557] R10: ffffea000021bb00 R11: 0000000000000000 R12: 0000= 7fd8616e9000 >>>>>> [74376.073561] R13: 00007fd8616ea000 R14: ffff880012702e40 R15: ffff= 880012702e70 >>>>>> [74376.073569] FS: 00007fd8616ca700(0000) GS:ffff880013c80000(0000)= knlGS:0000000000000000 >>>>>> [74376.073574] CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033 >>>>>> [74376.073577] CR2: 00007fd8616e9458 CR3: 00000000e7af5000 CR4: 0000= 000000042660 >>>>>> [74376.073582] Stack: >>>>>> [74376.073584] ffff8800188356c0 00000000000000d0 ffff88000032fc68 0= 0000000c64ef797 >>>>>> [74376.073590] 0000000000000220 dead000000100100 00007fd8616e9000 0= 0007fd8616ea000 >>>>>> [74376.073596] ffff88000032fc88 ffffffffa00953c6 ffff88000032fcc8 f= fff880012702e70 >>>>>> [74376.073603] Call Trace: >>>>>> [74376.073610] [] mn_invl_range_start+0x46/0x90 [= xen_gntdev] >>>>>> [74376.073620] [] >>>>>> __mmu_notifier_invalidate_range_start+0x5b/0x90 >>>>>> [74376.073627] [] do_wp_page+0x769/0x820 >>>>>> [74376.074031] [] handle_mm_fault+0x7fc/0x10c0 >>>>>> [74376.074031] [] ? radix_tree_lookup+0xd/0x10 >>>>>> [74376.074031] [] __do_page_fault+0x1dc/0x5a0 >>>>>> [74376.074031] [] ? mutex_lock+0x16/0x37 >>>>>> [74376.074031] [] ? evtchn_ioctl+0x118/0x3c0 >>>>>> [xen_evtchn] >>>>>> [74376.074031] [] ? do_vfs_ioctl+0x2f8/0x4f0 >>>>>> [74376.074031] [] ? do_munmap+0x29f/0x3b0 >>>>>> [74376.074031] [] do_page_fault+0x31/0x70 >>>>>> [74376.074031] [] page_fault+0x28/0x30 >>>> >>>> >>>> static void mn_invl_range_start(struct mmu_notifier *mn, >>>> struct mm_struct *mm, >>>> unsigned long start, unsigned long end) >>>> { >>>> struct gntdev_priv *priv =3D container_of(mn, struct gntdev_priv,= mn); >>>> struct grant_map *map; >>>> >>>> spin_lock(&priv->lock); >>>> list_for_each_entry(map, &priv->maps, next) { >>>> unmap_if_in_range(map, start, end); >>>> } >>>> >>>> mn_invl_range_start+0x46 is the first call to unmap_if_in_range, so >>>> something is wrong with priv->maps list. >>>> >>>> So I've searched for all the list_del calls on this list and found one= not >>>> guarded by spinlock: >>>> >>>> static int gntdev_release(struct inode *inode, struct file *flip) >>>> { >>>> struct gntdev_priv *priv =3D flip->private_data; >>>> struct grant_map *map; >>>> >>>> pr_debug("priv %p\n", priv); >>>> >>>> while (!list_empty(&priv->maps)) { >>>> map =3D list_entry(priv->maps.next, struct grant_map, next); >>>> list_del(&map->next); >>>> gntdev_put_map(NULL /* already removed */, map); >>>> } >>>> WARN_ON(!list_empty(&priv->freeable_maps)); >>>> >>>> if (use_ptemod) >>>> mmu_notifier_unregister(&priv->mn, priv->mm); >>>> kfree(priv); >>>> return 0; >>>> } >>>> >>>> So I see this as: >>>> P1(parent) P2(child) >>>> 1. gntdev_release called >>>> 2. list destroyed (above loop) >>>> >>>> 3. page fault occurs, gntdev mmu notifier called >>>> 4. priv->lock taken >>>> 5. iterate over priv->maps >>> >>> >>> Could we check the map->users? >> >> I don't have an easy way to reproduce the problem. It happens randomly >> once a day/two in some random domain. I've tried to fiddle with >> gnttab+fork+page faults, but failed to reliably reproduce the problem. >> >>>> 6. crashed since map is already destroyed >>> >>> Oh, indeed. gntdev_release calls gntdev_put_map which frees the whole >>> thing! Ouch. >>>> >>>> 7. mmu_notifier_unregister calls: >>>> 8. mn_release, which tries to take priv-= >lock >>>> 9. this process hangs >>>> >>>> So I'd guess the fix would be to move mmu_notifier_unregister before >>>> releasing priv->maps list. Could someone more familiar with this code >>>> confirm this? >>> >>> Hey Marek, >>> >>> Could we just add an lock around the 'gntdev_release' loop? >> >> I don't fully understand what is going there, especially when >> gntdev_release is called (in case of forked process, still some memory >> mapped etc), and when mmu notifier is called. But just looking at this >> crash IMO adding a lock should be sufficient. What will happen when mmu = notifier >> would not find the mapping? I guess nothing, because the mapping is just >> removed for a reason, right? > > That was my thought. But then why didn't we add this lock in the first pl= ace? > Hopefully Daniel can remember ~2yr patch :-) The reason that gntdev_release didn't have a lock is because there are not supposed to be any references to the areas pointed to by priv->maps when it is called. However, since the MMU notifier has not yet been unregistered, it is apparently possible to race here; the comment on mmu_notifier_unregis= ter seems to confirm this as a possibility (as do the backtraces). I think adding the lock will be sufficient.