Date: Fri, 24 Apr 2026 08:59:57 -0400
From: Peter Xu
To: Kiryl Shutsemau
Cc: "David Hildenbrand (Arm)", Andrew Morton, Lorenzo Stoakes, Mike Rapoport, Suren Baghdasaryan, Vlastimil Babka, "Liam R .
 Howlett", Zi Yan, Jonathan Corbet, Shuah Khan, Sean Christopherson, Paolo Bonzini, linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-kselftest@vger.kernel.org, kvm@vger.kernel.org
Subject: Re: [RFC, PATCH 00/12] userfaultfd: working set tracking for VM guest memory
References: <34f75083-29a3-4860-8a6e-94551d37ac6a@kernel.org> <17b0dc02-eee3-46d6-9afb-5f81a3a20216@kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Apr 24, 2026 at 12:37:35PM +0100, Kiryl Shutsemau wrote:
> On Thu, Apr 23, 2026 at 04:10:30PM -0400, Peter Xu wrote:
> > On Thu, Apr 23, 2026 at 09:25:30PM +0200, David Hildenbrand (Arm) wrote:
> > > >
> > > > The other thing is, as I mentioned in the other email, I still don't know
> > > > how the current RW protection would work for anonymous. I don't yet think
> > > > the user swapper can read the anon page with RW-protected pgtables. So far
> > > > my understanding is maybe you only care about shmem so it's fine, but it'll
> > > > always be great to confirm with you.
> >
> That's true. We use vhost and therefore shmem in our setup.

I see, thanks for confirming. Side note: I believe vhost works for anon
too since GUP works for anon, but it doesn't matter as long as we know anon
isn't a must.

>
> One idea I had about how to make atomic eviction for anon is extending
> process_vm_readv() and process_madvise():
>
> - Add a flag to process_vm_readv() to bypass the protnone check on
>   accessible (or only RWP?) VMAs.
>
> - Allow process_madvise(MADV_DONTNEED) when the caller already has
>   ptrace write access to the target.
>
> The standing objection to remote DONTNEED has been "destructive", but
> process_vm_writev() already lets a ptrace-capable caller overwrite
> arbitrary anon with attacker-chosen content.
> DONTNEED is strictly weaker (it zeroes, it does not inject), so the trust
> model is already established.
>
> > > I wonder if uffdio_move could be used for a swapper implementation instead?
>
> I considered it. UFFDIO_MOVE can in principle relocate the cold folio
> into a staging VMA inside the VMM, which then reads it and drops it.
> The downside is the VMM has to maintain a second address range and
> serialise eviction through it. A purpose-built primitive, something
> like UFFDIO_EVICT that zaps the PTE and returns the folio contents
> (optionally to an fd for io_uring), seems cleaner.

Right, the other thing is the unnecessary overhead of the extra pgtable
operations when moving to the staging VMA (e.g. tlb flush).

> >
> > If RW is justified to be useful first, maybe.
> >
> > I had a gut feeling Kirill's use case doesn't use anon at all, then if
> > nobody needs it we can still decide to not support anon.
> >
> > >
> > > If we ever have to read from a protnone page, maybe we could teach ptrace access
> > > to do it, or have something that can read from prot_none areas -- like
> > > uffdio_copy, which can write to prot-none areas.
> >
> > Something like swap_access() in my proposal can also partly achieve that.
> >
> > https://lore.kernel.org/all/aYuad2k75iD9bnBE@x1.local/
>
> A maccess()-style primitive that reads through PROT_NONE is a reasonable
> building block and overlaps with part of what UFFDIO_EVICT would need.
>
> > There, it was only about reading from swap so far, though. But that one
> > might be easier to extend to read PROT_NONE and put data directly into
> > a user-specified buffer (ps: in my local tree impl I named it maccess() to
> > pair with mincore(), but it doesn't really matter; it doesn't even need to
> > be a syscall..).
> >
> > To me, the interfacing is not a major issue. The major question I have is
> > why RW protection can help in swap system impl when we already have uffd-wp.
> >
> > So I want to make sure the use case can't be implemented by uffd-wp already.
> > Because that's really what we might do for QEMU.
>
> Race-free eviction can definitely be implemented with uffd-wp already.
> But not proper working set discovery.

Good. Then we can focus the discussion on hotness tracking with RWP and its
benefits, and compare it with a purely access-bit-focused tracking system
(as I mentioned in the other reply).

Thanks,

-- 
Peter Xu
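
To illustrate the point above that uffd-wp is enough for race-free eviction
but not for working set discovery, here is a toy Python model (not kernel
code). The `track` helper, the page numbers and the access trace are all
hypothetical. The only assumption taken from the thread is that a
write-protect tracker sees writes only, while an RWP-style tracker would see
both reads and writes.

```python
# Toy model only: pages are integers and accesses are (page, kind) tuples,
# where kind is "r" (read) or "w" (write). Nothing here calls userfaultfd.
# "wp"  mode: only writes trap, as with write-protect tracking.
# "rwp" mode: reads and writes both trap (the RWP behaviour under discussion).

def track(accesses, mode):
    """Return the set of pages a tracker in the given mode would see as hot."""
    if mode not in ("wp", "rwp"):
        raise ValueError(f"unknown mode: {mode!r}")
    hot = set()
    for page, kind in accesses:
        if kind not in ("r", "w"):
            raise ValueError(f"unknown access kind: {kind!r}")
        # A write always traps; a read traps only under RWP.
        if kind == "w" or mode == "rwp":
            hot.add(page)
    return hot


# Hypothetical trace: page 0 is read-hot, page 1 is written, page 2 is idle.
trace = [(0, "r"), (0, "r"), (1, "w"), (0, "r")]
all_pages = {0, 1, 2}

wp_cold = all_pages - track(trace, "wp")
rwp_cold = all_pages - track(trace, "rwp")

print("cold under wp :", sorted(wp_cold))
print("cold under rwp:", sorted(rwp_cold))
```

In this sketch the write-only tracker classifies the read-hot page 0 as cold,
so it would be picked as an eviction candidate, while the read-write tracker
reports only the idle page 2 as cold.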