From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail-pl1-f180.google.com (mail-pl1-f180.google.com [209.85.214.180]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 7FEDA34D3BE for ; Wed, 20 May 2026 19:40:11 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.214.180 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779306013; cv=none; b=CFKQ0KZWXjAMgjGaBjgFRWFi1HM3wzCqKj3p5esu/YDlhDalRiVMcXcXnXzySrn1xvqvZjoqcflg2JFs7n4hwbGDhxm5tdU4AK+I6K9RRZPe317XqSWxWPuqyomlmy2RDyHy53PYCJGol+kNqHy1YToRi+8/k3GH8hL7N7yQDUo= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779306013; c=relaxed/simple; bh=T6+dg6Wz1c1tvS/wGPJZQCCsqwKCWSRMJIBbb0pbADU=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=QfdC2izbj1eP6BjMkSUmSDme29AfbthJzA6QEQk+e3oRFw3CYuU/GKe+AsMzi1wYvAKIsqbRAr7NHHF6wJYIWi0rmvvMEvTUXdHV5wBV2L6caS/scIVaa5EOHVgW5KmH82rudb6BEa22L0WiW7Y+usjx/yTPl0V2s2DxtqNVPAw= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=cx2gF+0Y; arc=none smtp.client-ip=209.85.214.180 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="cx2gF+0Y" Received: by mail-pl1-f180.google.com with SMTP id d9443c01a7336-2ba3b9bcf69so1025ad.0 for ; Wed, 20 May 2026 12:40:11 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20251104; t=1779306011; x=1779910811; darn=vger.kernel.org; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:from:to:cc:subject:date:message-id:reply-to; bh=+wje9r1zoBXXds3UzxCMoNq04XN47Olzurqr21FFgTU=; b=cx2gF+0YrKSbvs1N8iWDypXFD94cs1XbNOZVC8MgS6KDBY52l3wc1V9/SzFY1WONG0 FshG+I1pRHuyvB6R/g+YzXI9w/PIG1ycq+L7ytWorRPiBuNacBNWXWl7bOqQLTCy/6yB DWyWbw277M+0IJqqn/VVYStMDHl9sih2AROE+KaPFWhU28qly9dp+CsQo4kdlonR3oL6 KzEDrbIwUbuXs8sDFJN1CWf7Dec5TRwu9AwIBPndLJVtzSvk+keSi0WH7ppQOXoc5ZRL Oog/NWTT3pU8fsCeElu8nbKJ0GNgapnJ0OS46TWRDZYsadYtazb9Pcx38tSlXp+0sLap VEew== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1779306011; x=1779910811; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:x-gm-gg:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=+wje9r1zoBXXds3UzxCMoNq04XN47Olzurqr21FFgTU=; b=AXmKAmYQiSyUbZVARzFk/X05eQQtiMyMGXwNaqrlwebi+iffZvtCNfuLaHq3hLdOZX ABKqAqUvfpuHVZ3xdzGa77NECN/q1PgLMk8HrPhRjMADQRC4YvLNA/VPBNzerGM27yxv v0duwXRYc3Y84dpCQ4AnjZN3XlddCEBqz4q3tvYLZzodtg8hCmeGbQvXE/TTREK6Fa1u oy1bZbjEiHjj66FXzqgWlaW5p/XM6MDV97gE48shC41021UVcFzzoAi9ZT+hkQU8q2RH QJ3aCMC13iQ4TVklrX+1OyJx1WIOqhjhOhsdL8nB21t2tESf6mAB2Kc7UQkyfnOLZOcV SxCg== X-Forwarded-Encrypted: i=1; AFNElJ+6Xku70TfM52ObhUhBtZqERJrr1Ih1PVAjW+lHF3EHK1ZJa8U3+DTOh4wLHl5YhSiJeRFrAZQnU9UQSsk=@vger.kernel.org X-Gm-Message-State: AOJu0YzVIQOQSXYOwwhLDTsrQSTT6iVvU3WhSEAv0Lu/Z7lhlP5nWJEB nni7l4M8Zu+FvKyKXiy8iY8hTUgMiLPyu1S/JRndjl8yCMzavh1TdGlWRmLpxvoPVA== X-Gm-Gg: Acq92OEjQWarJ1Bj1IZHD5ZhnOMbQ4sckolr3OUX8v/2ZtW3pFz07TmYtYO4A6cAPAI XxQ2lMkQAN9cpDxH+u2ja3NwknCwQhLcupyidaOCMhvL7iVw8fz6VKjfokLjY6eMulQaPMBzfrR sZlLJVEKSC84b56fQ0H1SK1WqhhkMlKMCodc8e3xD6LLOvrelqwL5HSPCyD8nz8bPxd9RKt5MT1 UbEHC8VemW2Rl1NuQnWnt+HtPML0FHsZlCqf0p5rEn0h8tX4UD2pZiYLjbofxwPdDLW6JDzdiV3 lqqzEdz3/Rxhrh/4iTlQV5W8uitWi+8sOfbg49en8S1vO8HbCdl/cPWSSrNP/vvq3C5TTETQgT0 kf5ixE9k+XefSRoMUOaZQoJSRai/4B69XUcMMluEtz8pvVaRGYFydj0abBEy1ZhlcpK/cDxo1Vd ANTyZKM6IQvPHT44KaMuwzu2t8jxIUaKzfC1uYHSvPEFGMxmU3XPg4cBwRZ2X0kMCxYQEZoOQGG O7LzBBZ X-Received: by 2002:a17:902:ce89:b0:2b0:5193:1212 with SMTP id d9443c01a7336-2be9edfbb64mr642595ad.4.1779306010201; Wed, 20 May 2026 12:40:10 -0700 (PDT) Received: from google.com (153.46.83.34.bc.googleusercontent.com. [34.83.46.153]) by smtp.gmail.com with ESMTPSA id 98e67ed59e1d1-369e383b31bsm3906879a91.2.2026.05.20.12.40.09 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 20 May 2026 12:40:09 -0700 (PDT) Date: Wed, 20 May 2026 19:40:05 +0000 From: Samiullah Khawaja To: Pranjal Shrivastava Cc: David Woodhouse , Lu Baolu , Joerg Roedel , Will Deacon , Jason Gunthorpe , YiFei Zhu , Robin Murphy , Kevin Tian , Alex Williamson , Shuah Khan , iommu@lists.linux.dev, linux-kernel@vger.kernel.org, kvm@vger.kernel.org, Saeed Mahameed , Adithya Jayachandran , Parav Pandit , Leon Romanovsky , William Tu , Pratyush Yadav , Pasha Tatashin , David Matlack , Andrew Morton , Chris Li , Vipin Sharma Subject: Re: [PATCH v2 13/16] iommufd: Persist iommu hardware pagetables for live update Message-ID: References: <20260427175633.1978233-1-skhawaja@google.com> <20260427175633.1978233-14-skhawaja@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii; format=flowed Content-Disposition: inline In-Reply-To: On Wed, May 20, 2026 at 12:00:44AM +0000, Pranjal Shrivastava wrote: >On Mon, Apr 27, 2026 at 05:56:30PM +0000, Samiullah Khawaja wrote: >> From: YiFei Zhu >> >> Register iommufd with the LUO framework and implement the preserve and >> unpreserve ops to save marked HWPTs. >> >> To make sure mappings do not change during preserved state, add a >> liveupdate_immutable flag to IOAS. When an HWPT is preserved, its IOAS >> is marked immutable and any map/unmap attempts will fail with -EBUSY. >> This is synchronized using the domains_rwsem to prevent races with >> concurrent mapping operations. >> >> The preserve callback iterates over the marked HWPTs, verifies that the >> backing memory pages are preserved, and calls iommu_domain_preserve() to >> preserve the associated IOMMU domain. >> >> Signed-off-by: YiFei Zhu >> Signed-off-by: Samiullah Khawaja > >[...] > >> diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h >> index 111f4d42e210..3c88aa115d08 100644 >> --- a/drivers/iommu/iommufd/iommufd_private.h >> +++ b/drivers/iommu/iommufd/iommufd_private.h >> @@ -98,6 +98,9 @@ struct io_pagetable { >> /* IOVA that cannot be allocated, struct iopt_reserved */ >> struct rb_root_cached reserved_itree; >> u8 disable_large_pages; >> +#ifdef CONFIG_IOMMU_LIVEUPDATE >> + bool liveupdate_immutable; >> +#endif >> unsigned long iova_alignment; >> }; >> [snip] >> + >> +static int iommufd_preserve_hwpt(struct iommufd_hwpt_paging *hwpt, >> + struct iommufd_hwpt_ser *hwpt_ser, >> + struct liveupdate_session *session) >> +{ >> + struct iommu_domain_ser *domain_ser; >> + bool ioas_made_immutable = false; >> + int rc; >> + >> + if (!hwpt->ioas->iopt.liveupdate_immutable) { >> + /* >> + * Make IOAS immutable so the DMA mappings do not change while >> + * the HWPT is preserved. Since one IOAS can have multiple >> + * HWPTs, if an error occurs this call needs to make the IOAS >> + * mutable again if it was the one that made it immutable. >> + */ >> + ioas_made_immutable = true; >> + ioas_set_immutable(hwpt->ioas, true); >> + >> + rc = check_iopt_pages_preserved(session, hwpt); >> + if (rc) >> + goto err; >> + } > >Nit: >I'm thinking what happens for a shared IOAS situation? Say, 2 devices, >in the same container, behind 2 different IOMMUs, sharing the IOAS. Each >device will have it's own HWPT (N:1 mapping for HWPT v/s IOAS) > >So, what happens if Device 1 succeeds with preservation but device 2 >doesn't? For example: > >1. Preserve Device 1: > - It sees liveupdate_immutable is false. > - Sets ioas_made_immutable = true on its local stack. > - Flips IOAS to immutable. > - Preservation succeeds. > >2. Preserve Device 2: > - It sees liveupdate_immutable is already true (because Device 1 set it). > - Sets ioas_made_immutable = false on its local stack. > - The Failure: iommu_domain_preserve fails for Device 2. > - The Jump: Hits goto err; > >Now, inside the err: label for device 2, it checks >if (ioas_made_immutable), since it is FALSE for device 2, >it does nothing. > >I agree we return an error code to the caller which finally cleans it up >well, but I'm considering if we should make liveupdate_immutable >refcountable? Since the error handling in iommufd_preserve_hwpt() is >logically incomplete for shared IOAS as it only attempts to restore >mutability if the current HWPT set immutable = true; This is a valid scenario and as you already pointed that it is handled by the error handling in the caller. But I agree refcount makes it cleaner. I will update this. > >> + >> + hwpt_ser->token = hwpt->liveupdate_token; >> + hwpt_ser->reclaimed = false; >> + >> + rc = iommu_domain_preserve(hwpt->common.domain, &domain_ser); >> + if (rc < 0) >> + goto err; >> + >> + hwpt_ser->domain_data = virt_to_phys(domain_ser); >> + return 0; >> + >> +err: >> + if (ioas_made_immutable) >> + ioas_set_immutable(hwpt->ioas, false); >> + >> + return rc; >> +} > >[...] > >> + >> +static int iommufd_liveupdate_preserve(struct liveupdate_file_op_args *args) >> +{ >> + struct iommufd_ctx *ictx = iommufd_ctx_from_file(args->file); >> + struct iommufd_hwpt_paging *hwpt; >> + struct iommufd_ser *iommufd_ser; >> + struct iommufd_object *obj; >> + unsigned int nr_hwpts; >> + unsigned long index; >> + unsigned int i; >> + void *mem; >> + int rc; >> + >> + if (IS_ERR(ictx)) >> + return PTR_ERR(ictx); >> + >> + mutex_lock(&ictx->liveupdate_mutex); >> + >> + /* Count the number of HWPTs to preserve */ >> + nr_hwpts = 0; >> + xa_lock(&ictx->objects); >> + xa_for_each_marked(&ictx->objects, index, obj, IOMMUFD_OBJ_LIVEUPDATE_MARK) { >> + if (obj->type != IOMMUFD_OBJ_HWPT_PAGING) >> + continue; >> + >> + hwpt = to_hwpt_paging(container_of(obj, struct iommufd_hw_pagetable, obj)); >> + if (!hwpt->common.domain) { >> + rc = -EINVAL; >> + xa_unlock(&ictx->objects); >> + goto out_unlock; >> + } >> + nr_hwpts++; >> + } >> + xa_unlock(&ictx->objects); >> + >> + mem = kho_alloc_preserve(struct_size(iommufd_ser, >> + hwpt_array, nr_hwpts)); >> + if (!mem) { >> + rc = -ENOMEM; >> + goto out_unlock; >> + } >> + >> + iommufd_ser = mem; >> + iommufd_ser->nr_hwpts = nr_hwpts; > >Nit: Can there be a TOCTOU here? We first count nr_hwpts in the first >loop, but actually preserve them in the loop below. Is it possible for a >Guest to race with these loops and destroy a HWPT? > >That could cause a bug in the new kernel as it may try to restore >nr_hwpts which is one more than the preserved HWPTs. Agreed. The HWPT is locked in the preserve loop below, I will update the nr_hwpts to handle this after that loop to handle this. > >> + >> + /* Preserve HWPTs */ >> + i = 0; >> + xa_lock(&ictx->objects); >> + xa_for_each_marked(&ictx->objects, index, obj, IOMMUFD_OBJ_LIVEUPDATE_MARK) { >> + if (obj->type != IOMMUFD_OBJ_HWPT_PAGING) >> + continue; >> + >> + if (!iommufd_lock_obj(obj)) { >> + rc = -ENOENT; >> + xa_unlock(&ictx->objects); >> + goto out_unpreserve; >> + } >> + >> + /* >> + * HWPT is locked so it will not be destroyed. The xarray lock >> + * can be released here before preserving the HWPT. >> + */ >> + xa_unlock(&ictx->objects); >> + hwpt = to_hwpt_paging(container_of(obj, struct iommufd_hw_pagetable, obj)); >> + rc = iommufd_preserve_hwpt(hwpt, &iommufd_ser->hwpt_array[i++], args->session); >> + if (rc) { >> + iommufd_put_object(ictx, obj); >> + goto out_unpreserve; >> + } >> + >> + /* Mark as preserved */ >> + hwpt->liveupdate_preserved = true; >> + xa_lock(&ictx->objects); >> + } >> + xa_unlock(&ictx->objects); > >[...] > >> diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c >> index 9bdb2945afe1..3b0c0acb8856 100644 >> --- a/drivers/iommu/iommufd/pages.c >> +++ b/drivers/iommu/iommufd/pages.c >> @@ -55,6 +55,7 @@ >> #include >> #include >> #include >> +#include >> #include >> >> #include "double_span.h" >> @@ -1421,6 +1422,7 @@ struct iopt_pages *iopt_alloc_file_pages(struct file *file, >> >> { >> struct iopt_pages *pages; >> + int seals; >> >> pages = iopt_alloc_pages(start_byte, length, writable); >> if (IS_ERR(pages)) >> @@ -1428,6 +1430,11 @@ struct iopt_pages *iopt_alloc_file_pages(struct file *file, >> pages->file = get_file(file); >> pages->start = start - start_byte; >> pages->type = IOPT_ADDRESS_FILE; >> + >> + seals = memfd_get_seals(file); >> + if (seals > 0) >> + pages->seals = seals; >> + > >Can caching memfd seals create a TOCTOU issue? >IIUC, iopt_alloc_file_pages happens at map time, However, the userspace >is allowed to map a memfd and then apply the F_ADD_SEALS via fcntl() >later in its setup sequence? For example a sequence like: > >1. VMM creates a memfd. It has 0 seals. >2. VMM calls IOMMU_IOAS_MAP_FILE. IOMMUFD caches pages->seals = 0. >3. VMM finishes its setup and calls: > fcntl(fd, F_ADD_SEALS, F_SEAL_GROW | F_SEAL_SHRINK | F_SEAL_SEAL). > >4.VMM initiates Live Update. >5.check_iopt_pages_preserved looks at the cached pages->seals > (which is still 0), sees the seals are missing, & kills the LiveUpdate > with -EINVAL, even though the file is properly sealed.. This is true and it is intentionally this way to make sure that the seal is applied during mapping otherwise user can apply the seal after resizing the memfd and preserve IOMMU mappings that are pointing to unpreserved pages. Consider following: 1. VMM creates a memfd and seals is zero. 2. VMM maps memfd into ioas/hwpt. 3. VMM resizes the memfd. 4. VMM seals memfd 5. VMM preserves the memfd (it only preseves the current size). 6. VMM preserves iommufd and it succeeds as memfd is sealed. But the pages being referred by the iommu mappings are refcounted in current kernel, but not preserved. Check the comment in check_iopt_pages_preserved() also. I will add a comment here also. > >Thus, I guess we should dynamically check seals via memfd_get_seals() > >> return pages; >> } >> > >Thanks, >Praan Sami