Message-ID: <9500a8631dc402b690b4849fd482e436cb425ca8.camel@linux.intel.com>
Subject: Re: [RFC 4/7] drm/xe/vm: Add madvise autoreset interval notifier worker infrastructure
From: Thomas Hellström
To: "Yadav, Arvind", Matthew Brost
Cc: intel-xe@lists.freedesktop.org, himal.prasad.ghimiray@intel.com
Date: Mon, 09 Mar 2026 10:32:35 +0100
In-Reply-To: <70f3a306-15e4-4eb1-82da-74818f35b437@intel.com>
References: <20260219091312.796749-1-arvind.yadav@intel.com>
	 <20260219091312.796749-5-arvind.yadav@intel.com>
	 <70f3a306-15e4-4eb1-82da-74818f35b437@intel.com>
List-Id: Intel Xe graphics driver

On Mon, 2026-03-09 at 12:37 +0530, Yadav, Arvind wrote:
> 
> On 26-02-2026 05:04, Matthew Brost wrote:
> > On Thu, Feb 19, 2026 at 02:43:09PM +0530, Arvind Yadav wrote:
> > > MADVISE_AUTORESET needs to reset VMA attributes when userspace
> > > unmaps CPU-only ranges, but the MMU invalidate callback cannot
> > > take vm->lock due to lock ordering (mmap_lock is already held).
> > > 
> > > Add mmu_interval_notifier that queues work items for
> > > MMU_NOTIFY_UNMAP events.
> > > The worker runs under vm->lock and resets attributes for VMAs
> > > still marked XE_VMA_CPU_AUTORESET_ACTIVE (i.e., not yet
> > > GPU-touched).
> > > 
> > > Work items are allocated from a mempool to handle atomic context
> > > in the callback. The notifier is deactivated when GPU touches the
> > > VMA.
> > > 
> > > Cc: Matthew Brost
> > > Cc: Thomas Hellström
> > > Cc: Himal Prasad Ghimiray
> > > Signed-off-by: Arvind Yadav
> > > ---
> > >  drivers/gpu/drm/xe/xe_vm_madvise.c | 394 +++++++++++++++++++++++++++++
> > >  drivers/gpu/drm/xe/xe_vm_madvise.h |   8 +
> > >  drivers/gpu/drm/xe/xe_vm_types.h   |  41 +++
> > >  3 files changed, 443 insertions(+)
> > > 
> > > diff --git a/drivers/gpu/drm/xe/xe_vm_madvise.c b/drivers/gpu/drm/xe/xe_vm_madvise.c
> > > index 52147f5eaaa0..4c0ffb100bcc 100644
> > > --- a/drivers/gpu/drm/xe/xe_vm_madvise.c
> > > +++ b/drivers/gpu/drm/xe/xe_vm_madvise.c
> > > @@ -6,9 +6,12 @@
> > >  #include "xe_vm_madvise.h"
> > >  
> > >  #include
> > > +#include
> > > +#include
> > >  #include
> > >  
> > >  #include "xe_bo.h"
> > > +#include "xe_macros.h"
> > >  #include "xe_pat.h"
> > >  #include "xe_pt.h"
> > >  #include "xe_svm.h"
> > > @@ -500,3 +503,394 @@ int xe_vm_madvise_ioctl(struct drm_device *dev, void *data, struct drm_file *fil
> > >  	xe_vm_put(vm);
> > >  	return err;
> > >  }
> > > +
> > > +/**
> > > + * struct xe_madvise_work_item - Work item for unmap processing
> > > + * @work: work_struct
> > > + * @vm: VM reference
> > > + * @pool: Mempool for recycling
> > > + * @start: Start address
> > > + * @end: End address
> > > + */
> > > +struct xe_madvise_work_item {
> > > +	struct work_struct work;
> > > +	struct xe_vm *vm;
> > > +	mempool_t *pool;
> > Why mempool? Seems like we could just do kmalloc with correct gfp
> > flags.
> 
> 
> I tried kmalloc first, but ran into two issues:
> GFP_KERNEL - fails because MMU notifier callbacks must not block, and
> GFP_KERNEL can sleep waiting for memory reclaim.
> GFP_ATOMIC - triggers a circular lockdep warning: the MMU notifier
> holds mmu_notifier_invalidate_range_start, and GFP_ATOMIC internally
> tries to acquire fs_reclaim, which already depends on the MMU
> notifier lock.
> 
> Agreed, the mempool looks unnecessary here. I re-tested this with
> kmalloc(..., GFP_NOWAIT) and that avoids both blocking and the
> reclaim-related lockdep issue I saw with the earlier approach. I will
> switch to that and drop the pool in the next version.

Note that GFP_NOWAIT can only be used as a potential optimization in
case memory happens to be available. GFP_NOWAIT is very likely to fail
in a reclaim situation and should not be used unless there is a backup
path. We shouldn't really try to work around lockdep problems with GFP
flags.

/Thomas

> 
> 
> > 
> > > +	u64 start;
> > > +	u64 end;
> > > +};
> > > +
> > > +static void xe_vma_set_default_attributes(struct xe_vma *vma)
> > > +{
> > > +	vma->attr.preferred_loc.devmem_fd = DRM_XE_PREFERRED_LOC_DEFAULT_DEVICE;
> > > +	vma->attr.preferred_loc.migration_policy = DRM_XE_MIGRATE_ALL_PAGES;
> > > +	vma->attr.pat_index = vma->attr.default_pat_index;
> > > +	vma->attr.atomic_access = DRM_XE_ATOMIC_UNDEFINED;
> > > +}
> > > +
> > > +/**
> > > + * xe_vm_madvise_process_unmap - Process munmap for all VMAs in range
> > > + * @vm: VM
> > > + * @start: Start of unmap range
> > > + * @end: End of unmap range
> > > + *
> > > + * Processes all VMAs overlapping the unmap range. An unmap can
> > > + * span multiple VMAs, so we need to loop and process each
> > > + * segment.
> > > + *
> > > + * Return: 0 on success, negative error otherwise
> > > + */
> > > +static int xe_vm_madvise_process_unmap(struct xe_vm *vm, u64 start, u64 end)
> > > +{
> > > +	u64 addr = start;
> > > +	int err;
> > > +
> > > +	lockdep_assert_held_write(&vm->lock);
> > > +
> > > +	if (xe_vm_is_closed_or_banned(vm))
> > > +		return 0;
> > > +
> > > +	while (addr < end) {
> > > +		struct xe_vma *vma;
> > > +		u64 seg_start, seg_end;
> > > +		bool has_default_attr;
> > > +
> > > +		vma = xe_vm_find_overlapping_vma(vm, addr, end);
> > > +		if (!vma)
> > > +			break;
> > > +
> > > +		/* Skip GPU-touched VMAs - SVM handles them */
> > > +		if (!xe_vma_has_cpu_autoreset_active(vma)) {
> > > +			addr = xe_vma_end(vma);
> > > +			continue;
> > > +		}
> > > +
> > > +		has_default_attr = xe_vma_has_default_mem_attrs(vma);
> > > +		seg_start = max(addr, xe_vma_start(vma));
> > > +		seg_end = min(end, xe_vma_end(vma));
> > > +
> > > +		/* Expand for merging if VMA already has default attrs */
> > > +		if (has_default_attr &&
> > > +		    xe_vma_start(vma) >= start &&
> > > +		    xe_vma_end(vma) <= end) {
> > > +			seg_start = xe_vma_start(vma);
> > > +			seg_end = xe_vma_end(vma);
> > > +			xe_vm_find_cpu_addr_mirror_vma_range(vm, &seg_start, &seg_end);
> > > +		} else if (xe_vma_start(vma) == seg_start && xe_vma_end(vma) == seg_end) {
> > > +			xe_vma_set_default_attributes(vma);
> > > +			addr = seg_end;
> > > +			continue;
> > > +		}
> > > +
> > > +		if (xe_vma_start(vma) == seg_start &&
> > > +		    xe_vma_end(vma) == seg_end &&
> > > +		    has_default_attr) {
> > > +			addr = seg_end;
> > > +			continue;
> > > +		}
> > > +
> > > +		err = xe_vm_alloc_cpu_addr_mirror_vma(vm, seg_start, seg_end - seg_start);
> > > +		if (err) {
> > > +			if (err == -ENOENT) {
> > > +				addr = seg_end;
> > > +				continue;
> > > +			}
> > > +			return err;
> > > +		}
> > > +
> > > +		addr = seg_end;
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +/**
> > > + * xe_madvise_work_func - Worker to process unmap
> > > + * @w: work_struct
> > > + *
> > > + * Processes a single unmap by taking vm->lock and calling the helper.
> > > + * Each unmap has its own work item, so no interval loss.
> > > + */
> > > +static void xe_madvise_work_func(struct work_struct *w)
> > > +{
> > > +	struct xe_madvise_work_item *item = container_of(w, struct xe_madvise_work_item, work);
> > > +	struct xe_vm *vm = item->vm;
> > > +	int err;
> > > +
> > > +	down_write(&vm->lock);
> > > +	err = xe_vm_madvise_process_unmap(vm, item->start, item->end);
> > > +	if (err)
> > > +		drm_warn(&vm->xe->drm,
> > > +			 "madvise autoreset failed [%#llx-%#llx]: %d\n",
> > > +			 item->start, item->end, err);
> > > +	/*
> > > +	 * Best-effort: Log failure and continue.
> > > +	 * Core correctness from CPU_AUTORESET_ACTIVE flag.
> > > +	 */
> > > +	up_write(&vm->lock);
> > > +	xe_vm_put(vm);
> > > +	mempool_free(item, item->pool);
> > > +}
> > > +
> > > +/**
> > > + * xe_madvise_notifier_callback - MMU notifier callback for CPU munmap
> > > + * @mni: mmu_interval_notifier
> > > + * @range: mmu_notifier_range
> > > + * @cur_seq: current sequence number
> > > + *
> > > + * Queues work to reset VMA attributes. Cannot take vm->lock
> > > + * (circular locking), so uses workqueue. GFP_ATOMIC allocation
> > > + * may fail; drops event if so.
> > > + *
> > > + * Return: true (never blocks)
> > > + */
> > > +static bool xe_madvise_notifier_callback(struct mmu_interval_notifier *mni,
> > > +					 const struct mmu_notifier_range *range,
> > > +					 unsigned long cur_seq)
> > > +{
> > > +	struct xe_madvise_notifier *notifier =
> > > +		container_of(mni, struct xe_madvise_notifier, mmu_notifier);
> > > +	struct xe_vm *vm = notifier->vm;
> > > +	struct xe_madvise_work_item *item;
> > > +	struct workqueue_struct *wq;
> > > +	mempool_t *pool;
> > > +	u64 start, end;
> > > +
> > > +	if (range->event != MMU_NOTIFY_UNMAP)
> > > +		return true;
> > > +
> > > +	/*
> > > +	 * Best-effort: skip in non-blockable contexts to avoid building up work.
> > > +	 * Correctness does not rely on this notifier - CPU_AUTORESET_ACTIVE flag
> > > +	 * prevents GPU PTE zaps on CPU-only VMAs in the zap path.
> > > +	 */
> > > +	if (!mmu_notifier_range_blockable(range))
> > > +		return true;
> > > +
> > > +	/* Consume seq (interval-notifier convention) */
> > > +	mmu_interval_set_seq(mni, cur_seq);
> > > +
> > > +	/* Best-effort: core correctness from CPU_AUTORESET_ACTIVE check in zap path */
> > > +
> > > +	start = max_t(u64, range->start, notifier->vma_start);
> > > +	end = min_t(u64, range->end, notifier->vma_end);
> > > +
> > > +	if (start >= end)
> > > +		return true;
> > > +
> > > +	pool = READ_ONCE(vm->svm.madvise_work.pool);
> > > +	wq = READ_ONCE(vm->svm.madvise_work.wq);
> > > +	if (!pool || !wq || atomic_read(&vm->svm.madvise_work.closing))
> > Can you explain the use of READ_ONCE, xchg, and atomics? At first
> > glance it seems unnecessary or overly complicated. Let's start with
> > the problem this is trying to solve and see if we can find a
> > simpler approach.
> > 
> > My initial thought is a VM-wide rwsem, marked as reclaim-safe.
> > The notifiers would take it in read mode to check whether the VM
> > is tearing down, and the fini path would take it in write mode to
> > initiate teardown...
> 
> 
> Agreed. This got more complicated than it needs to be. I reworked it
> to use a VM-wide rw_semaphore for teardown serialization, so the
> atomic_t, READ_ONCE(), and xchg() go away.
> 
> > 
> > > +		return true;
> > > +
> > > +	/* GFP_ATOMIC to avoid fs_reclaim lockdep in notifier context */
> > > +	item = mempool_alloc(pool, GFP_ATOMIC);
> > Again, probably just use kmalloc. Also s/GFP_ATOMIC/GFP_NOWAIT. We
> > really shouldn't be using GFP_ATOMIC in Xe per the DRM docs unless
> > a failed memory allocation would take down the device. We likely
> > abuse GFP_ATOMIC in several places that we should clean up, but in
> > this case it's pretty clear GFP_NOWAIT is what we want, as failure
> > isn't fatal - just sub-optimal.
> 
> 
> Agreed. This should be GFP_NOWAIT, not GFP_ATOMIC. Allocation
> failure here is non-fatal, so GFP_NOWAIT is the right fit. I will
> switch to kmalloc(..., GFP_NOWAIT) and drop the mempool.
> 
> > 
> > > +	if (!item)
> > > +		return true;
> > > +
> > > +	memset(item, 0, sizeof(*item));
> > > +	INIT_WORK(&item->work, xe_madvise_work_func);
> > > +	item->vm = xe_vm_get(vm);
> > > +	item->pool = pool;
> > > +	item->start = start;
> > > +	item->end = end;
> > > +
> > > +	if (unlikely(atomic_read(&vm->svm.madvise_work.closing))) {
> > Same as above the atomic usage...
> 
> 
> Noted, will remove.
> 
> > 
> > > +		xe_vm_put(item->vm);
> > > +		mempool_free(item, pool);
> > > +		return true;
> > > +	}
> > > +
> > > +	queue_work(wq, &item->work);
> > > +
> > > +	return true;
> > > +}
> > > +
> > > +static const struct mmu_interval_notifier_ops xe_madvise_notifier_ops = {
> > > +	.invalidate = xe_madvise_notifier_callback,
> > > +};
> > > +
> > > +/**
> > > + * xe_vm_madvise_init - Initialize madvise notifier infrastructure
> > > + * @vm: VM
> > > + *
> > > + * Sets up workqueue and mempool for async munmap processing.
> > > + *
> > > + * Return: 0 on success, -ENOMEM on failure
> > > + */
> > > +int xe_vm_madvise_init(struct xe_vm *vm)
> > > +{
> > > +	struct workqueue_struct *wq;
> > > +	mempool_t *pool;
> > > +
> > > +	/* Always initialize list and mutex - fini may be called on partial init */
> > > +	INIT_LIST_HEAD(&vm->svm.madvise_notifiers.list);
> > > +	mutex_init(&vm->svm.madvise_notifiers.lock);
> > > +
> > > +	wq = READ_ONCE(vm->svm.madvise_work.wq);
> > > +	pool = READ_ONCE(vm->svm.madvise_work.pool);
> > > +
> > > +	/* Guard against double initialization and detect partial init */
> > > +	if (wq || pool) {
> > > +		XE_WARN_ON(!wq || !pool);
> > > +		return 0;
> > > +	}
> > > +
> > > +	WRITE_ONCE(vm->svm.madvise_work.wq, NULL);
> > > +	WRITE_ONCE(vm->svm.madvise_work.pool, NULL);
> > > +	atomic_set(&vm->svm.madvise_work.closing, 1);
> > > +
> > > +	/*
> > > +	 * WQ_UNBOUND: best-effort optimization, not critical path.
> > > +	 * No WQ_MEM_RECLAIM: worker allocates memory (VMA ops with GFP_KERNEL).
> > > +	 * Not on reclaim path - merely resets attributes after munmap.
> > > +	 */
> > > +	vm->svm.madvise_work.wq = alloc_workqueue("xe_madvise", WQ_UNBOUND, 0);
> > > +	if (!vm->svm.madvise_work.wq)
> > > +		return -ENOMEM;
> > > +
> > > +	/* Mempool for GFP_ATOMIC allocs in notifier callback */
> > > +	vm->svm.madvise_work.pool =
> > > +		mempool_create_kmalloc_pool(64,
> > > +					    sizeof(struct xe_madvise_work_item));
> > > +	if (!vm->svm.madvise_work.pool) {
> > > +		destroy_workqueue(vm->svm.madvise_work.wq);
> > > +		WRITE_ONCE(vm->svm.madvise_work.wq, NULL);
> > > +		return -ENOMEM;
> > > +	}
> > > +
> > > +	atomic_set(&vm->svm.madvise_work.closing, 0);
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +/**
> > > + * xe_vm_madvise_fini - Cleanup all madvise notifiers
> > > + * @vm: VM
> > > + *
> > > + * Tears down notifiers and drains workqueue. Safe if init partially failed.
> > > + * Order: closing flag -> remove notifiers (SRCU sync) -> drain wq -> destroy.
> > > + */
> > > +void xe_vm_madvise_fini(struct xe_vm *vm)
> > > +{
> > > +	struct xe_madvise_notifier *notifier, *next;
> > > +	struct workqueue_struct *wq;
> > > +	mempool_t *pool;
> > > +	LIST_HEAD(tmp);
> > > +
> > > +	atomic_set(&vm->svm.madvise_work.closing, 1);
> > > +
> > > +	/*
> > > +	 * Detach notifiers under lock, then remove outside lock (SRCU sync can be slow).
> > > +	 * Splice avoids holding mutex across mmu_interval_notifier_remove() SRCU sync.
> > > +	 * Removing notifiers first (before drain) prevents new invalidate callbacks.
> > > +	 */
> > > +	mutex_lock(&vm->svm.madvise_notifiers.lock);
> > > +	list_splice_init(&vm->svm.madvise_notifiers.list, &tmp);
> > > +	mutex_unlock(&vm->svm.madvise_notifiers.lock);
> > > +
> > > +	/* Now remove notifiers without holding lock - mmu_interval_notifier_remove() SRCU-syncs */
> > > +	list_for_each_entry_safe(notifier, next, &tmp, list) {
> > > +		list_del(&notifier->list);
> > > +		mmu_interval_notifier_remove(&notifier->mmu_notifier);
> > > +		xe_vm_put(notifier->vm);
> > > +		kfree(notifier);
> > > +	}
> > > +
> > > +	/* Drain and destroy workqueue */
> > > +	wq = xchg(&vm->svm.madvise_work.wq, NULL);
> > > +	if (wq) {
> > > +		drain_workqueue(wq);
> > Work items in wq call xe_madvise_work_func, which takes vm->lock in
> > write mode. If we try to drain here after the work item executing
> > xe_madvise_work_func has started or is queued, I think we could
> > deadlock. Lockdep should complain about this if you run a test that
> > triggers xe_madvise_work_func at least once - or at least it
> > should. If it doesn't, then workqueues likely have an issue in
> > their lockdep implementation, as drain_workqueue should touch its
> > lockdep map, which has tainted vm->lock (i.e., is outside of it).
> > 
> > So perhaps call this function without vm->lock and take it as
> > needed in this function, then drop it, drain the work queue,
> > etc...
> 
> 
> Good catch. Draining the workqueue while holding vm->lock can
> deadlock against a worker that takes vm->lock. I fixed that by
> dropping vm->lock before xe_vm_madvise_fini(). In the reworked
> teardown path, drain_workqueue() runs with neither vm->lock nor the
> teardown semaphore held.
> 
> 
> > 
> > > +		destroy_workqueue(wq);
> > > +	}
> > > +
> > > +	pool = xchg(&vm->svm.madvise_work.pool, NULL);
> > > +	if (pool)
> > > +		mempool_destroy(pool);
> > > +}
> > > +
> > > +/**
> > > + * xe_vm_madvise_register_notifier_range - Register MMU notifier for address range
> > > + * @vm: VM
> > > + * @start: Start address (page-aligned)
> > > + * @end: End address (page-aligned)
> > > + *
> > > + * Registers interval notifier for munmap tracking. Uses addresses (not VMA pointers)
> > > + * to avoid UAF after dropping vm->lock. Deduplicates by range.
> > > + *
> > > + * Return: 0 on success, negative error code on failure
> > > + */
> > > +int xe_vm_madvise_register_notifier_range(struct xe_vm *vm, u64 start, u64 end)
> > > +{
> > > +	struct xe_madvise_notifier *notifier, *existing;
> > > +	int err;
> > > +
> > I see this isn't called under the vm->lock write lock. Is there a
> > reason not to? I think taking it under the write lock would help
> > with the teardown sequence, since you wouldn't be able to get here
> > if xe_vm_is_closed_or_banned were stable - and we wouldn't enter
> > this function if that helper returned true.
> 
> 
> I can make the closed/banned check stable at the call site under
> vm->lock, but I don't think I can hold it across
> mmu_interval_notifier_insert() itself, since that may take mmap_lock
> internally. I'll restructure this so the state check happens under
> vm->lock, while the actual insert remains outside that lock.
> 
> > 
> > > +	if (!IS_ALIGNED(start, PAGE_SIZE) || !IS_ALIGNED(end, PAGE_SIZE))
> > > +		return -EINVAL;
> > > +
> > > +	if (WARN_ON_ONCE(end <= start))
> > > +		return -EINVAL;
> > > +
> > > +	if (atomic_read(&vm->svm.madvise_work.closing))
> > > +		return -ENOENT;
> > > +
> > > +	if (!READ_ONCE(vm->svm.madvise_work.wq) ||
> > > +	    !READ_ONCE(vm->svm.madvise_work.pool))
> > > +		return -ENOMEM;
> > > +
> > > +	/* Check mm early to avoid allocation if it's missing */
> > > +	if (!vm->svm.gpusvm.mm)
> > > +		return -EINVAL;
> > > +
> > > +	/* Dedupe: check if notifier exists for this range */
> > > +	mutex_lock(&vm->svm.madvise_notifiers.lock);
> > If we had the vm->lock in write mode we could likely just drop
> > svm.madvise_notifiers.lock for now, but once we move to
> > fine-grained locking in page faults [1] we'd in fact need a
> > dedicated lock. So let's keep this.
> > 
> > [1] https://patchwork.freedesktop.org/patch/707238/?series=162167&rev=2
> 
> 
> Agreed. We should keep a dedicated lock here.
> 
> I do not think vm->lock can cover mmu_interval_notifier_insert()
> itself, since that path may take mmap_lock internally and would risk
> inverting the existing mmap_lock -> vm->lock ordering.
> 
> So I will keep svm.madvise_notifiers.lock in place. That also lines
> up better with the planned fine-grained page-fault locking work.
> 
> > 
> > > +	list_for_each_entry(existing, &vm->svm.madvise_notifiers.list, list) {
> > > +		if (existing->vma_start == start && existing->vma_end == end) {
> > This is O(N), which typically isn't ideal. Better structure here?
> > mtree? Does an mtree have its own locking so
> > svm.madvise_notifiers.lock could just be dropped? I'd look into
> > this.
> 
> 
> Agreed. I switched this over to a maple tree, so the exact-range
> lookup is no longer O(N). That also lets me drop the list walk in
> the duplicate check.
> 
> > 
> > > +			mutex_unlock(&vm->svm.madvise_notifiers.lock);
> > > +			return 0;
> > > +		}
> > > +	}
> > > +	mutex_unlock(&vm->svm.madvise_notifiers.lock);
> > > +
> > > +	notifier = kzalloc(sizeof(*notifier), GFP_KERNEL);
> > > +	if (!notifier)
> > > +		return -ENOMEM;
> > > +
> > > +	notifier->vm = xe_vm_get(vm);
> > > +	notifier->vma_start = start;
> > > +	notifier->vma_end = end;
> > > +	INIT_LIST_HEAD(&notifier->list);
> > > +
> > > +	err = mmu_interval_notifier_insert(&notifier->mmu_notifier,
> > > +					   vm->svm.gpusvm.mm,
> > > +					   start,
> > > +					   end - start,
> > > +					   &xe_madvise_notifier_ops);
> > > +	if (err) {
> > > +		xe_vm_put(notifier->vm);
> > > +		kfree(notifier);
> > > +		return err;
> > > +	}
> > > +
> > > +	/* Re-check closing to avoid teardown race */
> > > +	if (unlikely(atomic_read(&vm->svm.madvise_work.closing))) {
> > > +		mmu_interval_notifier_remove(&notifier->mmu_notifier);
> > > +		xe_vm_put(notifier->vm);
> > > +		kfree(notifier);
> > > +		return -ENOENT;
> > > +	}
> > > +
> > > +	/* Add to list - check again for concurrent registration race */
> > > +	mutex_lock(&vm->svm.madvise_notifiers.lock);
> > If we had the vm->lock in write mode, we couldn't get concurrent
> > registrations.
> > 
> > I likely have more comments, but I have enough concerns with the
> > locking and structure in this patch that I'm going to pause
> > reviewing the series until most of my comments are addressed. It's
> > hard to focus on anything else until we get these issues worked
> > out.
> 
> 
> I think the main issue is exactly the locking story around notifier
> insert/remove. We cannot hold vm->lock across
> mmu_interval_notifier_insert() because that may take mmap_lock
> internally and invert the existing ordering.
> 
> I have reworked this to simplify the teardown/registration side:
> drop the atomic/READ_ONCE/xchg handling, use a single teardown
> rwsem, and replace the list-based dedupe with a maple tree.
> I will send a cleaned-up version with the locking documented more
> clearly. Sorry for the churn here.
> 
> 
> Thanks,
> Arvind
> 
> > 
> > Matt
> > 
> > > +	list_for_each_entry(existing, &vm->svm.madvise_notifiers.list, list) {
> > > +		if (existing->vma_start == start && existing->vma_end == end) {
> > > +			mutex_unlock(&vm->svm.madvise_notifiers.lock);
> > > +			mmu_interval_notifier_remove(&notifier->mmu_notifier);
> > > +			xe_vm_put(notifier->vm);
> > > +			kfree(notifier);
> > > +			return 0;
> > > +		}
> > > +	}
> > > +	list_add(&notifier->list, &vm->svm.madvise_notifiers.list);
> > > +	mutex_unlock(&vm->svm.madvise_notifiers.lock);
> > > +
> > > +	return 0;
> > > +}
> > > diff --git a/drivers/gpu/drm/xe/xe_vm_madvise.h b/drivers/gpu/drm/xe/xe_vm_madvise.h
> > > index b0e1fc445f23..ba9cd7912113 100644
> > > --- a/drivers/gpu/drm/xe/xe_vm_madvise.h
> > > +++ b/drivers/gpu/drm/xe/xe_vm_madvise.h
> > > @@ -6,10 +6,18 @@
> > >  #ifndef _XE_VM_MADVISE_H_
> > >  #define _XE_VM_MADVISE_H_
> > >  
> > > +#include
> > > +
> > >  struct drm_device;
> > >  struct drm_file;
> > > +struct xe_vm;
> > > +struct xe_vma;
> > >  
> > >  int xe_vm_madvise_ioctl(struct drm_device *dev, void *data,
> > >  			struct drm_file *file);
> > >  
> > > +int xe_vm_madvise_init(struct xe_vm *vm);
> > > +void xe_vm_madvise_fini(struct xe_vm *vm);
> > > +int xe_vm_madvise_register_notifier_range(struct xe_vm *vm, u64 start, u64 end);
> > > +
> > >  #endif
> > > diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
> > > index 29ff63503d4c..eb978995000c 100644
> > > --- a/drivers/gpu/drm/xe/xe_vm_types.h
> > > +++ b/drivers/gpu/drm/xe/xe_vm_types.h
> > > @@ -12,6 +12,7 @@
> > >  
> > >  #include
> > >  #include
> > > +#include
> > >  #include
> > >  #include
> > >  
> > > @@ -29,6 +30,26 @@ struct xe_user_fence;
> > >  struct xe_vm;
> > >  struct xe_vm_pgtable_update_op;
> > >  
> > > +/**
> > > + * struct xe_madvise_notifier - CPU madvise notifier for memory attribute reset
> > > + *
> > > + * Tracks CPU munmap operations on SVM CPU address mirror VMAs.
> > > + * When userspace unmaps CPU memory, this notifier processes attribute reset
> > > + * via work queue to avoid circular locking (can't take vm->lock in callback).
> > > + */
> > > +struct xe_madvise_notifier {
> > > +	/** @mmu_notifier: MMU interval notifier */
> > > +	struct mmu_interval_notifier mmu_notifier;
> > > +	/** @vm: VM this notifier belongs to (holds reference via xe_vm_get) */
> > > +	struct xe_vm *vm;
> > > +	/** @vma_start: Start address of VMA being tracked */
> > > +	u64 vma_start;
> > > +	/** @vma_end: End address of VMA being tracked */
> > > +	u64 vma_end;
> > > +	/** @list: Link in vm->svm.madvise_notifiers.list */
> > > +	struct list_head list;
> > > +};
> > > +
> > >  #if IS_ENABLED(CONFIG_DRM_XE_DEBUG)
> > >  #define TEST_VM_OPS_ERROR
> > >  #define FORCE_OP_ERROR	BIT(31)
> > > @@ -212,6 +233,26 @@ struct xe_vm {
> > >  		struct xe_pagemap *pagemaps[XE_MAX_TILES_PER_DEVICE];
> > >  		/** @svm.peer: Used for pagemap connectivity computations. */
> > >  		struct drm_pagemap_peer peer;
> > > +
> > > +		/**
> > > +		 * @svm.madvise_notifiers: Active CPU madvise notifiers
> > > +		 */
> > > +		struct {
> > > +			/** @svm.madvise_notifiers.list: List of active notifiers */
> > > +			struct list_head list;
> > > +			/** @svm.madvise_notifiers.lock: Protects notifiers list */
> > > +			struct mutex lock;
> > > +		} madvise_notifiers;
> > > +
> > > +		/** @svm.madvise_work: Workqueue for async munmap processing */
> > > +		struct {
> > > +			/** @svm.madvise_work.wq: Workqueue */
> > > +			struct workqueue_struct *wq;
> > > +			/** @svm.madvise_work.pool: Mempool for work items */
> > > +			mempool_t *pool;
> > > +			/** @svm.madvise_work.closing: Teardown flag */
> > > +			atomic_t closing;
> > > +		} madvise_work;
> > >  	} svm;
> > >  
> > >  	struct xe_device *xe;
> > > -- 
> > > 2.43.0