Date: Fri, 28 Feb 2025 18:38:09 +0000
Subject: Re: [PATCH 1/2] drm/xe_migrate: Switch from drm to dev managed actions
From: Matthew Auld
To: "Hellstrom, Thomas", "Roper, Matthew D", "Bhatia, Aradhya"
Cc: "intel-xe@lists.freedesktop.org", "Upadhyay, Tejas", "Ghimiray, Himal Prasad", "De Marchi, Lucas"
References: <20250228065224.320811-1-aradhya.bhatia@intel.com> <20250228065224.320811-2-aradhya.bhatia@intel.com> <1b66c78c-82d4-4002-9fa4-6f30e97e0268@intel.com> <3a74a434957fb52ebcc36a4d5b88e8428ac2165f.camel@intel.com> <1bb4be8a-8058-4066-b447-1af29cbb46c0@intel.com> <7238c8a392fc5fb4ce297cd85dacf8ef2b342669.camel@intel.com> <56e1e5b6d4d68d67cff09fca2f77cc125c29fc7f.camel@intel.com>
In-Reply-To: <56e1e5b6d4d68d67cff09fca2f77cc125c29fc7f.camel@intel.com>
List-Id: Intel Xe graphics driver

On 28/02/2025 14:47, Hellstrom, Thomas wrote:
> On Fri, 2025-02-28 at 13:57 +0100, Thomas Hellström wrote:
>> On Fri, 2025-02-28 at 12:28 +0000, Matthew Auld wrote:
>>> Hi,
>>>
>>> On 28/02/2025 11:11, Hellstrom, Thomas wrote:
>>>> Hi, Matthew
>>>>
>>>> On Fri, 2025-02-28 at 10:21 +0000, Matthew Auld wrote:
>>>>> On 28/02/2025 06:52, Aradhya Bhatia wrote:
>>>>>> Change the scope of the migrate subsystem to be dev managed
>>>>>> instead of drm managed.
>>>>>>
>>>>>> The parent pci struct &device, that the xe struct &drm_device
>>>>>> is a part of, gets removed when a hot unplug is triggered,
>>>>>> which causes the underlying iommu group to get destroyed as
>>>>>> well.
>>>>>
>>>>> Nice find. But if that's the case then the migrate BO here is
>>>>> just one of many where we will see this. Basically all system
>>>>> memory BOs can suffer from this AFAICT, including userspace
>>>>> owned. So I think we need to rather solve this for all, and I
>>>>> don't think we can really tie the lifetime of all BOs to devm,
>>>>> so we likely need a different approach.
>>>>>
>>>>> I think we might instead need to tear down all dma mappings
>>>>> when removing the device, leaving the BO intact. It looks like
>>>>> there is already a helper for this, so maybe something roughly
>>>>> like this:
>>>>>
>>>>> @@ -980,6 +980,8 @@ void xe_device_remove(struct xe_device *xe)
>>>>>
>>>>>           drm_dev_unplug(&xe->drm);
>>>>>
>>>>> +       ttm_device_clear_dma_mappings(&xe->ttm);
>>>>> +
>>>>>           xe_display_driver_remove(xe);
>>>>>
>>>>>           xe_heci_gsc_fini(xe);
>>>>>
>>>>> But I don't think userptr will be covered by that, just all
>>>>> BOs, so we likely need an extra step to also nuke all userptr
>>>>> dma mappings somewhere.
>>>>
>>>> I have been discussing this a bit with Aradhya, and the problem
>>>> is a bit complex.
>>>>
>>>> Really if the PCI device goes away, we need to move *all* bos
>>>> to system. That should clear out any dma mappings, since we map
>>>> dma when moving to TT. Also VRAM bos should be moved to system,
>>>> which will also handle things like SVM migration.
>>>
>>> My thinking is that from a userspace pov, the buffers are about
>>> to be lost anyway, so going through the trouble of moving stuff
>>> seems like a waste?
>>> Once the device is unplugged all ioctls are rejected for that
>>> instance and we should have stopped touching the hw, so all that
>>> is left is to close stuff and open up a new instance. VRAM is
>>> basically just some state tracking at this point (for this
>>> driver instance), and the actual memory footprint is small. If
>>> we move to system memory we might be using actual system pages
>>> for stuff that should really just be thrown away, and will be
>>> once the driver instance goes away AFAIK. And for stuff that has
>>> an actual populated ttm_tt, we just discard with something like
>>> ttm_device_clear_dma_mappings() during remove.
>>
>> That's true for most current use-cases, but for exported things,
>> like multidevice SVM, if you unbind a device you'd currently just
>> want to migrate to system, and it's not fatal to the running
>> application (subject to review ofc). And I'm not sure what the
>> expected dma-buf semantics are, although if we free pages that
>> the remote device has pinned and keeps accessing, that sounds
>> bad.
>
> Having thought a bit more about this: First, I fully agree that
> evicting all of VRAM was probably a bad idea. We need to be
> selective, and evict exported VRAM only. That means SVM (no action
> required until multi-svm lands) and p2p DMA-buf (unless there is a
> way to revoke a dma-buf), where we at least need to send a
> move_notify() and wait for idle. A possible way might be to hook
> up to xe_evict_flags() and respond with a NULL placement, purging
> the bo for anything but SVM.
>
> Pinned p2p VRAM dma-buf I think needs to block the remove()
> callback, but we don't support that ATM.
>
> System DMA-buf shouldn't be a problem, but we can't drop the pages
> AFAICT.
>
> Imported dma-buf should have the attachment unmapped.
>
> I think that _evict_all() could be made to handle all these
> situations with some minor tweaks, but that still leaves the
> pinned TT bos.

Right.
I guess in the worst case we can rely on our own pin tracking to
nuke anything pinned that also has a populated tt?

>
> /Thomas
>
>
>>
>>>
>>>>
>>>> But that doesn't really help with pinned bos. So IMO subsystems
>>>> that
>>>
>>> I think ttm_device_clear_dma_mappings() also handles pinned BOs,
>>> AFAICT.
>>
>> Yes, you're right. I still think it's reasonable, though, to
>>
>>>
>>> I believe the pinned BOs are tracked via bdev->unevictable, so
>>> in theory it will nuke the dma mapping there also and discard
>>> the pages.
>>>
>>>> allocate pinned bos need to be devm managed, and this is a step
>>>> in that direction. But we probably need to deal with some
>>>> fallout. For example when we take down the vram managers using
>>>> drmm_ they'd want to call evict_all() and the migrate subsystem
>>>> is already down.
>>>>
>>>> So I think a natural place to deal with such fallouts is the
>>>> remove() callback, while subsystems with pinned bos use devm_.
>>>>
>>>> But admittedly to me it's not really clear how to *best* handle
>>>> this situation. I suspect if we really stress-test device
>>>> unbinding on a running app we're going to hit a lot of
>>>> problems.
>>>
>>> Yes, there are lots of open issues here, but our testing is
>>> still lacking :)
>>>
>>>> As for the userptrs, I just posted a series that introduces a
>>>> per-userptr dedicated unmap function. We could probably put
>>>> them on a device list or something similar that calls that
>>>> function (or just general userptr invalidation) for all
>>>> userptrs.
>>>
>>> Sounds good.
>>>
>>>> /Thomas
>>>>
>>>>>> The migrate subsystem, which handles the lifetime of the
>>>>>> page-table tree (pt) BO, doesn't get a chance to put back the
>>>>>> pt BO during the hot unplug, as all the references to DRM
>>>>>> haven't been put back yet.
>>>>>> When all the references to DRM are indeed put back later, the
>>>>>> migrate subsystem tries to put back the pt BO. Since the
>>>>>> underlying iommu group has already been destroyed, a kernel
>>>>>> NULL ptr dereference takes place while attempting to put back
>>>>>> the pt BO.
>>>>>>
>>>>>> Signed-off-by: Aradhya Bhatia
>>>>>> ---
>>>>>>    drivers/gpu/drm/xe/xe_migrate.c | 6 +++---
>>>>>>    1 file changed, 3 insertions(+), 3 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/xe/xe_migrate.c
>>>>>> b/drivers/gpu/drm/xe/xe_migrate.c
>>>>>> index 278bc96cf593..4e23adfa208a 100644
>>>>>> --- a/drivers/gpu/drm/xe/xe_migrate.c
>>>>>> +++ b/drivers/gpu/drm/xe/xe_migrate.c
>>>>>> @@ -97,7 +97,7 @@ struct xe_exec_queue *xe_tile_migrate_exec_queue(struct xe_tile *tile)
>>>>>>     return tile->migrate->q;
>>>>>>    }
>>>>>>
>>>>>> -static void xe_migrate_fini(struct drm_device *dev, void *arg)
>>>>>> +static void xe_migrate_fini(void *arg)
>>>>>>    {
>>>>>>     struct xe_migrate *m = arg;
>>>>>>
>>>>>> @@ -401,7 +401,7 @@ struct xe_migrate *xe_migrate_init(struct xe_tile *tile)
>>>>>>     struct xe_vm *vm;
>>>>>>     int err;
>>>>>>
>>>>>> -   m = drmm_kzalloc(&xe->drm, sizeof(*m), GFP_KERNEL);
>>>>>> +   m = devm_kzalloc(xe->drm.dev, sizeof(*m), GFP_KERNEL);
>>>>>>     if (!m)
>>>>>>     return ERR_PTR(-ENOMEM);
>>>>>>
>>>>>> @@ -455,7 +455,7 @@ struct xe_migrate *xe_migrate_init(struct xe_tile *tile)
>>>>>>     might_lock(&m->job_mutex);
>>>>>>     fs_reclaim_release(GFP_KERNEL);
>>>>>>
>>>>>> -   err = drmm_add_action_or_reset(&xe->drm, xe_migrate_fini, m);
>>>>>> +   err = devm_add_action_or_reset(xe->drm.dev, xe_migrate_fini, m);
>>>>>>     if (err)
>>>>>>     return ERR_PTR(err);
>>>>>>
>>>>>
>>>>
>>>
>>
>