Linux-mm Archive on lore.kernel.org
* Re: mm: opaque hardware page-table entry handles
From: Muhammad Usama Anjum @ 2026-06-24 22:39 UTC (permalink / raw)
  To: Zi Yan, Andrew Morton, Lorenzo Stoakes, David Hildenbrand,
	Liam R. Howlett, Mike Rapoport, Ryan Roberts, Anshuman Khandual,
	Catalin Marinas, Will Deacon, Samuel Holland
  Cc: usama.anjum, linux-mm, linux-arm-kernel, linux-kernel
In-Reply-To: <DJHEF8MWTZSC.3KMTACZH7KWP@nvidia.com>

On 24/06/2026 4:52 pm, Zi Yan wrote:
> On Wed Jun 24, 2026 at 10:09 AM EDT, Usama Anjum wrote:
>> Hi all,
>>
>> This is a direction-check with the wider community before spending time on the
>> development. This picks up the idea that was raised and broadly agreed in the
>> earlier thread (Ryan Roberts, Lorenzo Stoakes, David Hildenbrand) [1].
>>
>> The problem
>> -----------
>> Core MM code reaches page-table entries by raw pointer dereference (pte_t *,
>> pmd_t *, *pud, ...) in places, implicitly assuming a single, uniform
>> representation. Sprinkling getters wouldn't solve the problem entirely. The
>> problem is one level up: the *pointer type* itself is overloaded. At each level
>> there are really three distinct things:
>>
>>   1. a page-table entry value (pte_t, pmd_t, ...)
>>   2. a pointer to an entry value, e.g. a pXX_t on the stack
>>   3. a pointer to a live entry in the hardware page table
> 
> This sounds good to me, but can you clarify the situation below?
> 
> A live entry means the entry can be accessed by hardware when the code
> is manipulating it? 
I think "live" is the wrong word to choose here. It's a mistake on my end. (3) means
the pointer points into real page-table memory and that table is complete,
whether or not it's linked in yet. A withdrawn-but-not-yet-installed table is
still a hardware table and can be represented by hw_pXXp type.


> What type should we use if we are pre-populating
> PTEs in a PMD page before we establish the PMD page as a HW page table?
> In __split_huge_pmd_locked(), we do that. A PMD page is first withdrawn
> and filled with after-split PTEs, pmd_populate() and pte_offset_map()
> are used for this not-yet-HW page table. Later, pmd_populate() is used
> to make this page table visible to HW. Should we have two versions of
> pmd_populate() and pte_offset_map()? Since the first pmd_populate()
> would accept pmd_t*, but the second one would accept hw_pmdp, if we are
> pedantic. Of course, we can be flexible here to use pmd_populate()
> accepting hw_pmdp for both, since the PMD page table we are modifying
> is going to be visible to HW soon. But I think we should have clear
> definitions for where these types are used and document them well.
This is exactly the example that causes the confusion. Following the definition
above, the pmd is on the stack while the PTEs are being prepared, and the PTE
table is complete — so the pmd pointer should be pmd_t * and the PTE table
hw_ptep. I'd keep the two APIs distinct rather than overloading hw_pmdp for
both: that's what enforces the rule that no stack pointer reaches a
table-writing API, and what lets the *_stack path drop the synchronization.
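
To make the distinction concrete, here is a minimal userspace sketch of the
split flow under the definition above. It is an illustration only, not the
prototype: hw_ptep, hw_pte_next() and set_pte() follow the names used in this
thread, while pmd_populate_stack(), pmd_populate_hw(), split_demo() and
DEMO_PTRS_PER_PTE are hypothetical names made up for the example, and the
pte_t/pmd_t layouts are stand-ins.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for the arch entry types; real layouts are arch-specific. */
typedef struct { uint64_t val; } pte_t;
typedef struct { uint64_t val; } pmd_t;

/* Opaque handles to entries that live in real page-table memory. */
typedef struct { pte_t *ptr; } hw_ptep;
typedef struct { pmd_t *ptr; } hw_pmdp;

#define DEMO_PTRS_PER_PTE 8	/* tiny table, for the demo only */

/* Table-writing API: takes the handle, never a bare pte_t *. */
static inline void set_pte(hw_ptep p, pte_t pte)
{
	*(volatile uint64_t *)&p.ptr->val = pte.val;
}

static inline hw_ptep hw_pte_next(hw_ptep p)
{
	p.ptr++;
	return p;
}

/* Hypothetical stack flavour: plain store, no synchronization needed. */
static inline void pmd_populate_stack(pmd_t *pmd, pte_t *table)
{
	pmd->val = (uint64_t)(uintptr_t)table;
}

/* Hypothetical table flavour: single store into page-table memory. */
static inline void pmd_populate_hw(hw_pmdp pmdp, pmd_t pmd)
{
	*(volatile uint64_t *)&pmdp.ptr->val = pmd.val;
}

/* Mirrors the shape of the split: fill the PTE table, then install it. */
uint64_t split_demo(pmd_t *pmd_slot, pte_t *table)
{
	pmd_t pmd_copy = { 0 };			/* on-stack value */
	hw_ptep ptep = { .ptr = table };	/* withdrawn PTE table */
	unsigned int i;

	for (i = 0; i < DEMO_PTRS_PER_PTE; i++) {
		set_pte(ptep, (pte_t){ .val = i + 1 });
		ptep = hw_pte_next(ptep);
	}

	pmd_populate_stack(&pmd_copy, table);
	pmd_populate_hw((hw_pmdp){ .ptr = pmd_slot }, pmd_copy);
	return pmd_slot->val;
}
```

With the struct types, calling pmd_populate_hw(&pmd_copy, ...) or
set_pte(&some_pte, ...) is rejected by the compiler as an incompatible type,
which is the whole point of keeping the two APIs apart.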

(One thing I still need to chase: there are cases where we convert between pmd
and pte. I need to understand how often that happens — if it's common, a
hw_ptep could get converted into a pmd and bring the confusion back, and if we
have to account for that, definition (3) may need to change.)

> 
> You probably can ask LLMs to check these ambiguous/vague uses throughout
> the code base.
> 
>>
>> Today (2) and (3) share the same type - pte_t *, pmd_t *, and so on. Nothing
>> distinguishes a pointer into a live table from a pointer to a stack copy.
>>
>> A pointer to an on-stack entry value and a pointer to a live hardware entry have
>> the same type, so the compiler cannot distinguish them. Passing the stack
>> pointer to an arch helper that expects a hardware-entry pointer compiles fine,
>> but is wrong - a bug class the type system makes invisible. It also blocks
>> evolution: an arch helper may need to read beyond the addressed entry (e.g.
>> adjacent or contiguous entries), which only makes sense for a real page-table
>> pointer, not a stack copy.
>>
>> The idea
>> --------
>> Give (3) its own opaque type that cannot be dereferenced:
>>
>>     /* opaque handle to a HW page-table entry; not dereferenceable */
>>     typedef struct {
>> 	pte_t *ptr;
>>     } hw_ptep;
>>
>> With this:
>>
>>   - a stack value can no longer masquerade as a hardware table entry,
>>   - a hardware handle can no longer be raw-dereferenced,
>>   - cases that genuinely operate on a value can be refactored to pass the value
>>     and let the caller, which knows whether it holds a handle or a stack copy,
>>     read it once.
>>
>> The overload becomes a compile-time type error instead of a silent runtime bug,
>> and converting the tree forces every such site to be made explicit. This gives
>> us a framework where the architecture can completely virtualize the pgtable if
>> it likes; and the compiler can enforce that higher level code can't accidentally
>> work around it.
>>
>> It is opt-in by architectures and incremental. The generic definition is
>> just an alias, so arches that do not care build unchanged:
>>
>>     typedef pte_t *hw_ptep;
>>
>> An arch flips to the strong struct type when it is ready, and only then does
>> it get the stronger checking. This lets the conversion land gradually.
>>
>> Beyond fixing the latent bug class, this abstraction is an enabler for upcoming
>> features that need tighter control over how page tables are accessed and
>> manipulated.
>>
>> Getter flavours
>> ---------------
>> While converting, it is useful to have two accessor flavours at each level:
>>
>>   - pXXp_get(hw_ptep)        plain C dereference (compiler may optimize)
>>   - pXXp_get_once(hw_ptep)   single-copy-atomic, not torn, elided or
>>                              duplicated by the compiler
>>
>> Keeping them distinct simplifies the conversion and avoids re-introducing the
>> class of lockless-read bugs seen on 32-bit.
>>
>> Example conversion
>> ------------------
>> Most of the conversion is mechanical.
>>
>>   -static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
>>   -		pte_t *ptep, pte_t pte, unsigned int nr)
>>   +static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
>>   +		hw_ptep ptep, pte_t pte, unsigned int nr)
>>    {
>>    	page_table_check_ptes_set(mm, addr, ptep, pte, nr);
>>    	for (;;) {
>>    		set_pte(ptep, pte);
>>    		if (--nr == 0)
>>    			break;
>>   -		ptep++;
>>   +		ptep = hw_pte_next(ptep);
>>    		pte = pte_next_pfn(pte);
>>    	}
>>    }
>>
>> The bulk of the work is this kind of rote substitution. The genuine work is the
>> handful of sites that turn out to be operating on a stack copy rather than a
>> live entry - those are exactly the ones the new type forces us to surface and 
>> fix.
>>
>> Estimated churn:
>> ----------------
>> Halfway through prototyping, converting only the PTE and PMD levels:
>>   77 files changed, +1801 / -1425
>>   ~57 files reference the new types
>>
>> So the line count will grow once PUD/P4D/PGD and the remaining call sites are
>> converted; expect meaningfully more churn than the numbers above.
>>
>> Introduce the type as an alias, convert one helper family per patch, and flip
>> an arch to the strong type last - with non-opted arches building unchanged at
>> every step.
>>
>> Open questions
>> --------------
>>   - Is the type-safety + future-feature enablement worth the churn?
>>   - Naming: hw_ptep/hw_pmdp vs something else?
>>   - Should all five levels be converted before merging anything, or is a staged
>>     PTE-and-PMD then landing others acceptable?
>>   - Do we want the two getter flavours (pXXp_get / pXXp_get_once) at every
>>     level?
>>
>> [1] https://lore.kernel.org/all/a063f6c5-2785-4a9f-8079-25edb3e54cef@arm.com
>>
>> Thanks,
>> Usama
> 
> 
> 
> 

--
Thanks,
Usama




* Re: [RFC PATCH v2 2/7] mm, swap: support zswap and zeroswap as vswap backends
From: Nhat Pham @ 2026-06-24 22:41 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: akpm, chrisl, kasong, hannes, mhocko, roman.gushchin,
	shakeel.butt, david, muchun.song, shikemeng, baoquan.he, baohua,
	youngjun.park, chengming.zhou, ljs, liam, vbabka, rppt, surenb,
	qi.zheng, axelrasmussen, yuanchu, weixugc, riel, gourry,
	haowenchao22, kernel-team, linux-mm, linux-kernel, cgroups
In-Reply-To: <ajnQxMY0W3VGyAUE@google.com>

On Mon, Jun 22, 2026 at 5:18 PM Yosry Ahmed <yosry@kernel.org> wrote:
>
> [..]
> > @@ -1623,16 +1642,14 @@ int zswap_load(struct folio *folio)
> >       if (entry->objcg)
> >               count_objcg_events(entry->objcg, ZSWPIN, 1);
> >
> > -     /*
> > -      * We are reading into the swapcache, invalidate zswap entry.
> > -      * The swapcache is the authoritative owner of the page and
> > -      * its mappings, and the pressure that results from having two
> > -      * in-memory copies outweighs any benefits of caching the
> > -      * compression work.
> > -      */
>
> Forgot to ask, is dropping this comment intentional?

Ooops. Lemme restore it.

>
> >       folio_mark_dirty(folio);
> > -     xa_erase(tree, offset);
> > -     zswap_entry_free(entry);
> > +
> > +     if (swap_is_vswap(si)) {
> > +             folio_release_vswap_backing(folio);
> > +     } else {
> > +             xa_erase(swap_zswap_tree(swp), swp_offset(swp));
> > +             zswap_entry_free(entry);
> > +     }
> >
> >       folio_unlock(folio);
> >       return 0;
> > --
> > 2.53.0-Meta
> >



* Re: [PATCH v1] kasan: Fix false-positive wild-memory-access on x86 under 5-level paging
From: Ihor Solodrai @ 2026-06-24 22:45 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Dave Hansen, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, Andrey Ryabinin,
	Andrew Morton, bpf, kasan-dev, linux-mm, linux-kernel,
	Thomas Gleixner, Ingo Molnar, Dave Hansen, x86, H. Peter Anvin,
	Andrey Konovalov
In-Reply-To: <20260624025732.GDajtHnB1Jc21kQnl_@fat_crate.local>

On 6/23/26 7:57 PM, Borislav Petkov wrote:
> On Mon, Jun 22, 2026 at 05:29:12PM -0700, Ihor Solodrai wrote:
>> On 6/18/26 11:38 AM, Borislav Petkov wrote:
>>> On Thu, Jun 18, 2026 at 11:12:09AM -0700, Dave Hansen wrote:
>>>> On 6/18/26 10:09, Borislav Petkov wrote:
>>>>> On Wed, Jun 17, 2026 at 03:13:33PM -0700, Ihor Solodrai wrote:
>>>>>> So my question to maintainers is what approach seems best?
>>>>> The CPUID stuff is being rewritten currently and it should address your issue
>>>>> too. If not, then we need to rewrite it better.
>>>>>
>>>>> Can you reproduce with this set applied ontop:
>>>>>
>>>>> https://lore.kernel.org/r/20260528153923.403473-1-darwi@linutronix.de
>>>>
>>>> Thinking about this a bit more... If Ahmed's series does fix this, I
>>>> think it will be accidental. It still uses identify_cpu() and also does
>>>> a memset() of the new c->cpuid structure in addition to the old
>>>> c->x86_capability structure.
>>>>
>>>> I'm not knocking Ahmed's series by any means. It just probably won't fix
>>>> this issue.
>>>>
>>>> In a perfect world early_identify_cpu() and identify_cpu() would either
>>>> get consolidated into one thing. Or at least become two discrete things
>>>> that initialize two completely disjoint sets of data. That way,
>>>> identify_cpu() wouldn't memset() anything.
>>>>
>>>> Isn't that the _real_ fix? Instead of trying to hide the inconsistency
>>>> when good data is blown away, we stop blowing it away in the first place?
>>>
>>> early_ is only on the BSP and you want to scan all CPUs.
>>>
>>> AFAIR, the last time I was looking at how we scan the CPUID leafs, we do have
>>> cases where there's blips in time when cap bits get disappeared to be
>>> rescanned again. The toggling of MSR bits which control feature flags is one
>>> thing that comes to mind.
>>>
>>> But I'm with you on the consolidation approach. I think this is what we should
>>
>> Hello Dave, Boris. Thank you for the input.
>>
>> I tried a slight refactoring of the identify_cpu() machinery to
>> eliminate the memset(x86_capability) from the boot cpu, and it fixes
>> the splat that I reported.
>>
>> identify_cpu() is split into two parts:
>>   * identify_cpu_scan() - the top part, up to and including the
>>     generic_identify() call
>>   * identify_cpu() proper with the rest, no memset here
>>
>> Then (gated on x86_64) identify_boot_cpu() *skips* the
>> identify_cpu_scan() call, only executing the bottom part of current
>> identify_cpu(). We can do that because boot cpu already did the _scan
>> part in early_identify_cpu(), when interrupts are still disabled.
>>
>> One caveat: get_model_name() is moved from generic_identify() into
>> identify_cpu(), otherwise it wouldn't be called on boot cpu. Same for
>> c->loops_per_jiffy = loops_per_jiffy; - it's moved to the "bottom"
>> identify_cpu().
>>
>> The behavior for secondary cpus is unchanged: identify_secondary_cpu()
>> calls identify_cpu_scan() then immediately identify_cpu().
>>
>> Please take a look at the diff below and let me know if this is a good
>> way to proceed. I don't know exactly what I'm doing, so concerns and
>> corrections are welcome.
>>
>>
>> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
>> index a4268c47f2bc..4a655109cacf 100644
>> --- a/arch/x86/kernel/cpu/common.c
>> +++ b/arch/x86/kernel/cpu/common.c
>> @@ -1978,8 +1978,6 @@ static void generic_identify(struct cpuinfo_x86 *c)
>>  
>>  	get_cpu_address_sizes(c);
>>  
>> -	get_model_name(c); /* Default name */
>> -
>>  	/*
>>  	 * ESPFIX is a strange bug.  All real CPUs have it.  Paravirt
>>  	 * systems that run Linux at CPL > 0 may or may not have the
>> @@ -1999,13 +1997,12 @@ static void generic_identify(struct cpuinfo_x86 *c)
>>  }
>>  
>>  /*
>> - * This does the hard work of actually picking apart the CPU stuff...
>> + * Rebuild capability set from CPUID and re-apply the forced-cap overrides
>> + * through generic_identify(). This *should* run with interrupts off, otherwise
>> + * an interrupt handler might see the capability bits cleared.
> 
> And yet it doesn't run with IRQs off. So that comment doesn't need to be here.
> The point of the whole exercise is to not disable ugly interrupts in that path
> at all.
> 
>>   */
>> -static void identify_cpu(struct cpuinfo_x86 *c)
>> +static void identify_cpu_scan(struct cpuinfo_x86 *c)
>>  {
>> -	int i;
>> -
>> -	c->loops_per_jiffy = loops_per_jiffy;
>>  	c->x86_cache_size = 0;
>>  	c->x86_vendor = X86_VENDOR_UNKNOWN;
>>  	c->x86_model = c->x86_stepping = 0;	/* So far unknown... */
>> @@ -2028,6 +2025,19 @@ static void identify_cpu(struct cpuinfo_x86 *c)
>>  #endif
>>  
>>  	generic_identify(c);
>> +}
>> +
>> +/*
>> + * This is safe to run with interrupts on.
>> + * The boot CPU can run just this from arch_cpu_finalize_init(), because
>> + * the scan part already happened in early_identify_cpu().
>> + */
>> +static void identify_cpu(struct cpuinfo_x86 *c)
>> +{
>> +	int i;
>> +
>> +	c->loops_per_jiffy = loops_per_jiffy;
>> +	get_model_name(c); /* Default name */
>>  
>>  	cpu_parse_topology(c);
>>  
>> @@ -2154,6 +2164,17 @@ void enable_sep_cpu(void)
>>  
>>  static __init void identify_boot_cpu(void)
>>  {
>> +	/*
>> +	 * The boot CPU's capabilities were already scanned by early_identify_cpu().
>> +	 * Here identify_cpu() only finalizes them.
>> +	 *
>> +	 * 32-bit still needs the full scan here: it sets X86_BUG_ESPFIX (via
>> +	 * generic_identify()) and the no-CPUID cpuid_level default, which the early
>> +	 * path does not.
>> +	 */
>> +#ifndef CONFIG_X86_64
>> +	identify_cpu_scan(&boot_cpu_data);
>> +#endif
>>  	identify_cpu(&boot_cpu_data);
>>  	if (HAS_KERNEL_IBT && cpu_feature_enabled(X86_FEATURE_IBT))
>>  		pr_info("CET detected: Indirect Branch Tracking enabled\n");
>> @@ -2178,6 +2199,7 @@ void identify_secondary_cpu(unsigned int cpu)
>>  		*c = boot_cpu_data;
>>  	c->cpu_index = cpu;
>>  
>> +	identify_cpu_scan(c);
>>  	identify_cpu(c);
>>  #ifdef CONFIG_X86_32
>>  	enable_sep_cpu();
>> -- 
> 
> Ok, thanks for that, that's a good first try. I did poke at it a bit and
> here's what I think should happen:
> 
> 1. The part which initializes struct cpuinfo_x86 up to and including the memsets
>    should be one function called init_cpu_info() or so.
> 
> 2. Then, generic_identify() should be merged into identify_cpu() basically
>    turning the latter into a single function which does a generic CPU
>    identification both on the BSP and on the APs.
> 
> 3. #ifdef CONFIG_X86_32
>         enable_sep_cpu();
> #endif 
> 	that gunk should be stuck into a function called identify_cpu_32() and
> 	you'll have at the beginning of it
> 
> 	if (!IS_ENABLED(CONFIG_X86_32))
> 		return;
> 
>    and then you call that function both in identify_boot_cpu() and
>    identify_secondary_cpu(). This way you streamline the paths and drop all
>    that ugly ifdeffery.
> 
> 4. When you do 3. above, you should be able to move this gunk
> 
> 	#ifdef CONFIG_X86_32
>         set_cpu_bug(c, X86_BUG_ESPFIX);
> 	#endif
> 
> to it too. And now everything is nicely encapsulated and straight-forward.
> 
> 5. The above will allow you to have the init_cpu_info() only on 32-bit.
> 
> Oh and, looking at the ifdeffery on identify_cpu():
> 
> 	#ifdef CONFIG_X86_64
> 	        c->x86_clflush_size = 64;
> 	        c->x86_phys_bits = 36;
> 	        c->x86_virt_bits = 48;
> 	#else
> 
> you could probably add an identify_cpu_64() too. Or maybe the init parts should
> be called separately init_cpu_info_32() and init_cpu_info_64(). You probably
> need to see what it looks like when you write it.
> 
> Oh, and each 1., 2., ...step above is a single patch. This'll make the review
> very easy too.
> 
> How does that sound? Willing to give it a try? :-)

The instructions are quite detailed, thank you.

I'll try it and will send a separate series as soon as I can (likely next week).
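
For reference, the early-return shape from step 3 of the quoted plan could
look roughly like the standalone mock below. This is only a sketch of the
pattern: struct cpuinfo_mock and X86_32_ENABLED are hypothetical stand-ins
(the kernel would use struct cpuinfo_x86 and IS_ENABLED(CONFIG_X86_32)), and
the two flags merely mark where enable_sep_cpu() and the X86_BUG_ESPFIX
setting would go.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for IS_ENABLED(CONFIG_X86_32): a constant 0 or 1. */
#ifdef CONFIG_X86_32
#define X86_32_ENABLED 1
#else
#define X86_32_ENABLED 0
#endif

/* Mock of the few bits of per-CPU state this example touches. */
struct cpuinfo_mock {
	bool sep_enabled;	/* where enable_sep_cpu() would run */
	bool bug_espfix;	/* where X86_BUG_ESPFIX would be set */
};

/*
 * All 32-bit-only work lives here, so callers need no #ifdef.
 * On a 64-bit build the compiler can drop the body as dead code.
 */
void identify_cpu_32(struct cpuinfo_mock *c)
{
	if (!X86_32_ENABLED)
		return;

	c->sep_enabled = true;
	c->bug_espfix = true;
}
```

Both the boot-CPU path and the secondary-CPU path would then call
identify_cpu_32() unconditionally.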

> 
> If not, I could try to find some time and do it myself but I thought you might
> be willing to try it...
> 
> :-)
> 
> Thx.
> 




* Re: [PATCH v8 23/46] KVM: TDX: Make source page optional for KVM_TDX_INIT_MEM_REGION
From: Ackerley Tng @ 2026-06-24 23:00 UTC (permalink / raw)
  To: Sean Christopherson, Yan Zhao
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, forkloop, pratyush, suzuki.poulose, aneesh.kumar, liam,
	Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <ajxasFBzp_9KnQLq@google.com>

Sean Christopherson <seanjc@google.com> writes:

> On Tue, Jun 23, 2026, Yan Zhao wrote:
>> On Tue, Jun 23, 2026 at 01:16:14PM +0800, Yan Zhao wrote:
>> > On Mon, Jun 22, 2026 at 06:22:45PM -0700, Sean Christopherson wrote:
>> > > On Mon, Jun 22, 2026, Yan Zhao wrote:
>> > > > On Thu, Jun 18, 2026 at 05:32:00PM -0700, Ackerley Tng via B4 Relay wrote:
>> > > > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
>> > > > > index ffe9d0db58c59..56d10333c61a7 100644
>> > > > > --- a/arch/x86/kvm/vmx/tdx.c
>> > > > > +++ b/arch/x86/kvm/vmx/tdx.c
>> > > > > @@ -3198,8 +3198,12 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>> > > > >  	if (KVM_BUG_ON(kvm_tdx->page_add_src, kvm))
>> > > > >  		return -EIO;
>> > > > >
>> > > > > -	if (!src_page)
>> > > > > -		return -EOPNOTSUPP;
>> > > > > +	if (!src_page) {
>> > > > > +		if (!gmem_in_place_conversion)
>> > > > When userspace turns on gmem_in_place_conversion while creating guest_memfd
>> > > > without the MMAP flag, the absence of src_page should still be treated as an
>> > > > error.
>> > >
>> > > Why MMAP?
>> > Hmm, I was showing a scenario that in-place conversion couldn't occur.
>> > I didn't mean that with the MMAP flag, mmap() and user write must occur.
>> >
>> > > Shouldn't this be a general "if (!src_page && !up-to-date)"?  Just
>> > > because userspace _can_ mmap() the memory doesn't mean userspace _has_ mmap()'d
>> > > and written memory.  And when write() lands, MMAP wouldn't be necessary to
>> > > initialize the memory.
>> > Do you mean using up-to-date flag as below?
>
> Yes?  I didn't actually look at the implementation details.
>
>> > if (!src_page) {
>> > 	src_page = pfn_to_page(pfn);
>> > 	if (!folio_test_uptodate(page_folio(src_page)))
>> > 		return -EOPNOTSUPP;
>> > }

Yan is right that with the earlier patch "Zero page while getting pfn",
folio_test_uptodate() here will always return true.

Actually, this is an alternative fix for the issue Sashiko pointed out
on v7 where userspace can do a populate() (either TDX or SNP) without
first allocating the page, with src_address == NULL, and leak
uninitialized memory into the guest.

Advantage of using the uptodate check in populate: if the host never
allocates the page, populate doesn't incur zeroing before writing the
page anyway in populate().

Disadvantage: Both TDX and SNP will have to implement this uptodate
check. guest_memfd can't check centrally because for SNP, for a
PAGE_TYPE_ZERO, !src_page should be allowed with a !uptodate page since
firmware will zero and there's no leakage of uninitialized host memory?
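
A rough sketch of what the per-architecture check could look like, written as
a standalone function so the cases are easy to see. The enum, the function
name and its parameters are hypothetical and only model the decision; the
real check would sit in the TDX and SNP post-populate paths and test the
folio with folio_test_uptodate().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical classification of the populate request. */
enum populate_kind {
	POPULATE_TDX,
	POPULATE_SNP_NORMAL,
	POPULATE_SNP_ZERO,	/* SNP PAGE_TYPE_ZERO: firmware zeroes the page */
};

/*
 * Returns 0 if populate may proceed, -EOPNOTSUPP otherwise.
 * have_src: userspace supplied a source page
 * uptodate: the guest_memfd folio is marked up-to-date
 */
int populate_src_check(enum populate_kind kind, bool have_src, bool uptodate)
{
	if (have_src)
		return 0;		/* normal copy from the source page */

	if (kind == POPULATE_SNP_ZERO)
		return 0;		/* no host data can leak */

	/* In-place populate: only safe if the page was initialized. */
	return uptodate ? 0 : -EOPNOTSUPP;
}
```

The SNP zero-page case is the reason the check cannot simply live in
guest_memfd common code.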

>>
>> Another concern with this fix is that:
>> commit "KVM: guest_memfd: Zero page while getting pfn" [1] always marks the
>> folio uptodate before reaching post_populate().
>>
>> [1] https://lore.kernel.org/all/20260618-gmem-inplace-conversion-v8-21-9d2959357853@google.com/
>>
>> > One concern is that TDX now does not much care about the up-to-date flag since
>> > TDX doesn't rely on the flag to clear pages on conversions.
>> > I'm not sure if the flag can be reliably checked in this case. e.g.,
>> > now the whole folio is marked up-to-date even if only part of it is faulted by
>> > user access.
>> > Ensuring that the up-to-date flag works correctly with huge page support seems
>> > to have more effort than introducing a dedicated flag for TDX.
>> >
>> > > > Additionally, to properly enable in-place copying for the TDX initial memory
>> > > > region, userspace must not only specify source_addr to NULL, but also follow
>> > > > a specific sequence (where steps 1/2/3/7 are required only for in-place copy):
>> > > > 1. create guest_memfd with MMAP flag
>> > > > 2. mmap the guest_memfd.
>> > > > 3. convert the initial memory range to shared.
>> > > > 4. copy initial content to the source page.
>> > > > 5. convert the initial memory range to private
>> > > > 6. invoke ioctl KVM_TDX_INIT_MEM_REGION.
>> > > > 7. do not unmap the source backend.
>> > > >
>> > > > So, would it be reasonable to introduce a dedicated flag that allows userspace
>> > > > to explicitly opt into the in-place copy functionality? e.g.,
>> > >
>> > > Why?  It's userspace's responsibility to get the above right.  If userspace fails
>> > > to provide a src_page when it doesn't want in-place copy, that's a userspace bug.

Yan, is your concern that userspace forgot to update the code and
forgets to provide a src_page, and if we keep the "Zero page while
getting pfn" patch, ends up with the guest silently having a zero page?
I think that would be found quite early in userspace VMM testing...

>> > I mean if userspace specifies a NULL source_addr by mistake, it's better for
>> > kernel to detect this mistake, similar to how it validates whether source_addr
>> > is PAGE_ALIGNED.
>
> The alignment case is different.  If userspace provides an unaligned value, KVM
> *can't* do what userspace is asking because hardware and thus KVM only supports
> converting on page boundaries.
>
> For a NULL source, KVM can still do what userspace is asking.  Rejecting userspace's
> request would then be making assumptions about what userspace wants.
>

Also, +1 on this, what if userspace, knowing that pages are zeroed on
allocation, actually wants to rely on that to get a zero page in the guest?

>> > Since userspace already needs to perform additional steps to enable in-place
>> > copy, specifying a dedicated flag to indicate that the NULL source_addr is
>> > intentional seems like a reasonable burden.
>
> I don't see how it adds any value.  I wouldn't be at all surprised if most VMMs
> just end up with code that does:
>
> 	if (in-place) {
> 		src = NULL;
> 		flags |= KVM_TDX_IN_PLACE_COPY_INITIAL_MEMORY_REGION;
> 	}



* Re: [RFC PATCH] mm: Avoiding split large folios if swap has no space
From: Barry Song @ 2026-06-24 23:08 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: akpm, axelrasmussen, baolin.wang, dev.jain, kasong, lance.yang,
	liam, linux-kernel, linux-mm, ljs, npache, qi.zheng, ryan.roberts,
	shakeel.butt, weixugc, yuanchu, zhaonanzhe, ziy
In-Reply-To: <4aa8350e-712f-4380-b3bf-2ff06cf2a35d@kernel.org>

On Mon, Jun 22, 2026 at 4:58 PM David Hildenbrand (Arm)
<david@kernel.org> wrote:
>
> On 6/20/26 10:10, Barry Song (Xiaomi) wrote:
> > On Fri, Jun 19, 2026 at 10:04 PM David Hildenbrand (Arm) <david@kernel.org> wrote:
> > [...]
> >>>       /*
> >>>        * The page can not be swapped.
> >>>        *
> >>> @@ -1280,6 +1289,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >>>
> >>>                               if (!folio_test_large(folio))
> >>>                                       goto activate_locked_split;
> >>> +                             if (!__can_reclaim_anon_pages(memcg, sc))
> >>> +                                     goto activate_locked_split;
> >>
> >> Why are we even trying to allocate swap space if we cannot reclaim such pages?
> >> Makes me wonder whether we would want to have that check earlier, before the
> >> folio_alloc_swap().
> >>
> >> Any downsides?
> >
> > I don't think there are any obvious downsides there. One issue is that
> > the memcg may not be passed from reclaim_pages(), so memcg would
> > always be NULL. However, the folio could still belong to a memcg
> > whose swap quota has been exhausted. In that case, my
> > __can_reclaim_anon_pages() will fail when checking whether we can
> > swap out. But switching to folio_memcg() also seems awkward.
> >
> > So I feel Kairui’s suggestion [1] might be the best approach. In
> > folio_alloc_swap(), we return -EAGAIN to tell vmscan.c that
> > we can split the folio and retry the swap-out.
> > Only when there are sufficient swap slots and sufficient memcg swap
> > quota do we return -EAGAIN, allowing vmscan to perform a split.
> >
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 78b49b0658ad..62e2c506ccae 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -1755,6 +1755,9 @@ int folio_alloc_swap(struct folio *folio)
> >                       VM_WARN_ON_ONCE(1);
> >                       return -EINVAL;
> >               }
> > +
> > +             if (get_nr_swap_pages() < (1 << order))
> > +                     return -ENOMEM;
>
> I guess we could be clearer with the return value:
>
> -> !get_nr_swap_pages() -> -ENOSPC / -ENOMEM
>
> (no space at all)
>
> -> get_nr_swap_pages() < (1 << order) -> -E2BIG
>
> (there is some space, but not for the full thing)

Understood. Makes sense.

>
> But now I wonder whether we would also want to check "is there any free swap
> space", not just "is there any swap".

I don't quite understand you. get_nr_swap_pages() returns
nr_swap_pages, which increases or decreases as swap is allocated or
freed. I guess it just reflects how many swap slots are currently
available?

>
>
> Essentially, try returning -E2BIG if there is the chance to swap out after
> split, and  -ENOSPC / -ENOMEM if a split wouldn't help.
>
> >       }
> >
> >  again:
> > @@ -1769,11 +1772,13 @@ int folio_alloc_swap(struct folio *folio)
> >       }
> >
> >       /* Need to call this even if allocation failed, for MEMCG_SWAP_FAIL. */
> > -     if (unlikely(mem_cgroup_try_charge_swap(folio)))
> > +     if (unlikely(mem_cgroup_try_charge_swap(folio))) {
> >               swap_cache_del_folio(folio);
> > +             return -ENOMEM;
>
> Here we wouldn't have the information whether we could charge after a split.
>
> So that would require a rework to signal this more cleanly to the caller.

Yep. The tricky part is that mem_cgroup_try_charge_swap() cannot
return how much swap quota is available in the memcg. Do you prefer to
add an output argument to mem_cgroup_try_charge_swap() to expose
that, or should we introduce a separate wrapper such as …

long get_nr_swap_pages_from_folio_memcg(struct folio *folio)
{
         struct mem_cgroup *memcg;
         long nr;

         memcg = get_memcg_from_folio(folio);
         nr = mem_cgroup_get_nr_swap_pages(memcg);

         return nr;
}

Then in folio_alloc_swap(), if 0 < nr < folio_nr_pages(folio),
we ask for a split by returning -E2BIG?
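
To spell out the return-code convention being discussed, here is a small
standalone sketch. It only models the decision, assuming both the global
free-slot count and the memcg's remaining swap quota are known up front;
swap_alloc_verdict() and should_split() are hypothetical names, and the
"enough space but fragmented" case is not modeled.

```c
#include <assert.h>
#include <errno.h>

/*
 * free_slots:  free swap slots system-wide
 * memcg_quota: swap pages the folio's memcg may still charge
 * order:       folio order (folio covers 1 << order pages)
 */
int swap_alloc_verdict(long free_slots, long memcg_quota, unsigned int order)
{
	long nr_pages = 1L << order;
	long avail = free_slots < memcg_quota ? free_slots : memcg_quota;

	if (avail <= 0)
		return -ENOSPC;	/* nothing left: a split would not help */
	if (avail < nr_pages)
		return -E2BIG;	/* some room: a split may allow swap-out */
	return 0;
}

/* Reclaim side: only split a large folio when a split could help. */
int should_split(int ret, int is_large)
{
	return is_large && ret == -E2BIG;
}
```

With this shape, vmscan would split only on -E2BIG and go straight to
activate_locked_split for any other error.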

>
> > +     }
> >
> >       if (unlikely(!folio_test_swapcache(folio)))
> > -             return -ENOMEM;
> > +             return -EAGAIN;
> >
> >       return 0;
> >  }
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 299b5d9e8836..63e8578454ea 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1257,6 +1257,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >                */
> >               if (folio_test_anon(folio) && folio_test_swapbacked(folio) &&
> >                               !folio_test_swapcache(folio)) {
> > +                     int ret;
> > +
> >                       if (!(sc->gfp_mask & __GFP_IO))
> >                               goto keep_locked;
> >                       if (folio_maybe_dma_pinned(folio))
> > @@ -1275,10 +1277,10 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >                                   split_folio_to_list(folio, folio_list))
> >                                       goto activate_locked;
> >                       }
> > -                     if (folio_alloc_swap(folio)) {
> > +                     if ((ret = folio_alloc_swap(folio))) {
>
> I prefer doing the assignment outside the conditional.

Ok.

>
> >                               int __maybe_unused order = folio_order(folio);
> >
> > -                             if (!folio_test_large(folio))
> > +                             if (!folio_test_large(folio) || ret != -EAGAIN)
> >                                       goto activate_locked_split;
> >                               /* Fallback to swap normal pages */
> >                               if (split_folio_to_list(folio, folio_list))
> >
> > What’s your view on this, David?
>
> I guess returning from folio_alloc_swap() whether a split could allow for
> swapout (e.g., -E2BIG) would be reasonable.
>
> To catch all the cases where it makes a difference:
> * No free swap space (split won't work)
> * Some free swap space (split would work)
> * Sufficient free swap space, but fragmented (split would work)
> * No memcg space (split won't work)
> * Some memcg space (split would work)

Right. The memcg part is the tricky one, since we don’t have that
information available.

Best Regards
Barry



* Re: [RFC PATCH v2 2/7] mm, swap: support zswap and zeroswap as vswap backends
From: Nhat Pham @ 2026-06-24 23:08 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: akpm, chrisl, kasong, hannes, mhocko, roman.gushchin,
	shakeel.butt, david, muchun.song, shikemeng, baoquan.he, baohua,
	youngjun.park, chengming.zhou, ljs, liam, vbabka, rppt, surenb,
	qi.zheng, axelrasmussen, yuanchu, weixugc, riel, gourry,
	haowenchao22, kernel-team, linux-mm, linux-kernel, cgroups
In-Reply-To: <ajnNWRO7apBq2-kQ@google.com>

On Mon, Jun 22, 2026 at 5:16 PM Yosry Ahmed <yosry@kernel.org> wrote:
>
> On Fri, Jun 12, 2026 at 12:37:33PM -0700, Nhat Pham wrote:
> > Build the virtual swap layer on top of the swap-table infrastructure.
> > Virtual swap entries decouple PTE swap entries from physical backing,
> > allowing pages to be compressed by zswap (or detected as zero-filled)
> > without pre-allocating a physical swap slot.
> >
> > This patch only supports zswap and zero-page backends. If zswap_store
> > fails, the page stays dirty in the swap cache (AOP_WRITEPAGE_ACTIVATE)
> > - physical disk backing fallback comes in the next patch. Zswap
> > writeback of vswap-backed entries is also disabled - the shrinker
> > skips when no physical swap pages are available.
> >
> > Suggested-by: Kairui Song <kasong@tencent.com>
> > Signed-off-by: Nhat Pham <nphamcs@gmail.com>
> [..]
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 993406074d58..466f8a182716 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -38,6 +38,7 @@
> >  #include <linux/zsmalloc.h>
> >
> >  #include "swap.h"
> > +#include "vswap.h"
> >  #include "internal.h"
> >
> >  /*********************************
> > @@ -762,7 +763,7 @@ static void zswap_entry_cache_free(struct zswap_entry *entry)
> >   * Carries out the common pattern of freeing an entry's zsmalloc allocation,
> >   * freeing the entry itself, and decrementing the number of stored pages.
> >   */
> > -static void zswap_entry_free(struct zswap_entry *entry)
> > +void zswap_entry_free(struct zswap_entry *entry)
> >  {
> >       zswap_lru_del(&zswap_list_lru, entry);
> >       zs_free(entry->pool->zs_pool, entry->handle);
> > @@ -994,16 +995,21 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
> >       struct swap_info_struct *si;
> >       int ret = 0;
> >
> > +     /* try to allocate swap cache folio */
> >       si = get_swap_device(swpentry);
> >       if (!si)
> >               return -EEXIST;
> >
> > +     /*
> > +      * Vswap entries have no physical backing - writeback would fail
> > +      * and SIGBUS the caller. Bail before we waste a swap-cache folio
> > +      * allocation.
> > +      */
>
> Seems like this comment belongs in the previous patch, and the other
> comment movement is undoing what last patch did.

Yeah, this comment belongs in the first patch. I added it after the
fact but committed it to the second patch.

TBH, the first patch doesn't do much. It just declares a new special
struct swap_info_struct, with some helpers and checks, but it's not
hooked up to any allocation path. Logically it should be squashed into
this patch, but this patch is already 600 LoC, lol.

>
> >       if (si->flags & SWP_VSWAP) {
> >               put_swap_device(si);
> >               return -EINVAL;
> >       }
> >
> > -     /* try to allocate swap cache folio */
> >       mpol = get_task_policy(current);
> >       folio = swap_cache_alloc_folio(swpentry, GFP_KERNEL, BIT(0), NULL, mpol,
> >                                      NO_INTERLEAVE_INDEX);
> > @@ -1416,25 +1422,25 @@ static bool zswap_store_page(struct page *page,
> >       if (!zswap_compress(page, entry, pool))
> >               goto compress_failed;
> >
> > -     old = xa_store(swap_zswap_tree(page_swpentry),
> > -                    swp_offset(page_swpentry),
> > -                    entry, GFP_KERNEL);
> > -     if (xa_is_err(old)) {
> > -             int err = xa_err(old);
> > +     if (is_vswap_entry(page_swpentry)) {
> > +             vswap_zswap_store(page_swpentry, entry);
> > +     } else {
> > +             old = xa_store(swap_zswap_tree(page_swpentry),
> > +                            swp_offset(page_swpentry),
> > +                            entry, GFP_KERNEL);
> > +             if (xa_is_err(old)) {
> > +                     int err = xa_err(old);
> > +
> > +                     WARN_ONCE(err != -ENOMEM,
> > +                               "unexpected xarray error: %d\n", err);
> > +                     zswap_reject_alloc_fail++;
> > +                     goto store_failed;
> > +             }
> >
> > -             WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
> > -             zswap_reject_alloc_fail++;
> > -             goto store_failed;
> > +             if (old)
> > +                     zswap_entry_free(old);
> >       }
> >
> > -     /*
> > -      * We may have had an existing entry that became stale when
> > -      * the folio was redirtied and now the new version is being
> > -      * swapped out. Get rid of the old.
> > -      */
> > -     if (old)
> > -             zswap_entry_free(old);
> > -
> >       /*
> >        * The entry is successfully compressed and stored in the tree, there is
> >        * no further possibility of failure. Grab refs to the pool and objcg,
> > @@ -1487,6 +1493,7 @@ bool zswap_store(struct folio *folio)
> >       struct mem_cgroup *memcg = NULL;
> >       struct zswap_pool *pool;
> >       bool ret = false;
> > +     bool partial_store = false;
> >       long index;
> >
> >       VM_WARN_ON_ONCE(!folio_test_locked(folio));
> > @@ -1524,8 +1531,10 @@ bool zswap_store(struct folio *folio)
> >       for (index = 0; index < nr_pages; ++index) {
> >               struct page *page = folio_page(folio, index);
> >
> > -             if (!zswap_store_page(page, objcg, pool))
> > +             if (!zswap_store_page(page, objcg, pool)) {
> > +                     partial_store = index > 0;
> >                       goto put_pool;
> > +             }
> >       }
> >
> >       if (objcg)
> > @@ -1548,7 +1557,9 @@ bool zswap_store(struct folio *folio)
> >        * offsets corresponding to each page of the folio. Otherwise,
> >        * writeback could overwrite the new data in the swapfile.
> >        */
> > -     if (!ret) {
> > +     if (partial_store && is_vswap_entry(swp))
> > +             folio_release_vswap_backing(folio);
>
> Hmm the above should also only happen in the !ret case, but that's not
> obvious from the code here. I think all of this should go under if
> (!ret), but maybe reverse the polarity to avoid the indentation?

Yeah, that's just me avoiding indentation lol. But yes, it only happens
in the !ret case:

>
>         if (ret)
>                 return ret;
>
>         if (is_vswap_entry(swp)) {
>                 if (partial_store)
>                         folio_release_vswap_backing(folio);
>                 return ret;
>         }
>
>         ...
>
> Alternatively you can move the check_old code for xarray into a helper
> and do:
>
>         if (!ret) {
>                 if (is_vswap_entry(swp)) {
>                         if (partial_store)
>                                 folio_release_vswap_backing(folio);
>                 } else {
>                         zswap_free_old_xa_entries(swp, nr_pages)
>                 }
>         }

Yup! I can switch to this if you think it's cleaner.

>
> Also, I think you can probably drop partial_store and check the index
> directly here.

Ah yeah. That's true!
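
Roughly what that could look like, as a userspace toy model rather than the
real zswap_store(). Everything here is an assumption for illustration:
struct toy_store_result and toy_zswap_store() are made-up names, and the two
flags only stand in for folio_release_vswap_backing() and the xarray cleanup
of old entries.

```c
#include <stdbool.h>

struct toy_store_result {
	bool ret;		/* true if every page was stored */
	bool released_backing;	/* stands in for folio_release_vswap_backing() */
	bool freed_old_xa;	/* stands in for the old-entry xarray cleanup */
};

static struct toy_store_result toy_zswap_store(const bool *page_ok,
					       long nr_pages, bool is_vswap)
{
	struct toy_store_result res = { false, false, false };
	long index;

	/* Stop at the first page that fails to store. */
	for (index = 0; index < nr_pages; ++index) {
		if (!page_ok[index])
			break;
	}

	res.ret = (index == nr_pages);
	if (res.ret)
		return res;

	if (is_vswap) {
		/* index > 0: at least one page was stored before the failure. */
		if (index > 0)
			res.released_backing = true;
	} else {
		res.freed_old_xa = true;
	}
	return res;
}
```

The loop index carries the same information as partial_store, so the extra
bool is not needed in this model.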

>
> > +     else if (!ret && !is_vswap_entry(swp)) {
> >               unsigned type = swp_type(swp);
> >               pgoff_t offset = swp_offset(swp);
> >               struct zswap_entry *entry;
> > @@ -1588,8 +1599,7 @@ bool zswap_store(struct folio *folio)
> >  int zswap_load(struct folio *folio)
> >  {
> >       swp_entry_t swp = folio->swap;
> > -     pgoff_t offset = swp_offset(swp);
> > -     struct xarray *tree = swap_zswap_tree(swp);
> > +     struct swap_info_struct *si = __swap_entry_to_info(swp);
> >       struct zswap_entry *entry;
> >
> >       VM_WARN_ON_ONCE(!folio_test_locked(folio));
> > @@ -1599,16 +1609,25 @@ int zswap_load(struct folio *folio)
> >               return -ENOENT;
> >
> >       /*
> > -      * Large folios should not be swapped in while zswap is being used, as
> > -      * they are not properly handled. Zswap does not properly load large
> > -      * folios, and a large folio may only be partially in zswap.
> > +      * zswap_load() does not support large folios. For non-vswap
> > +      * entries this is unexpected on the swapin path: WARN and
> > +      * sigbus. For vswap entries __swap_cache_add_check() has already
> > +      * filtered out ZSWAP-backed THPs under the cluster lock, so the
> > +      * large folio here is zero- or phys-backed; return -ENOENT to
> > +      * fall through to the phys/zero IO path.
>
> Hmm should we start simple and avoid THP swapin for vswap initially?
>
> IIUC, it isn't really vswap specific. Even without vswap, it's possible
> that an entire folio is on-disk, not in zswap, in which case THP swap
> should be allowed.
>
> I assume it's not common for zswap to be enabled and an entire THP worth
> of pages are not in zswap, so maybe we can add this later?

I was thinking of removing it altogether haha. Are we even doing THP
swap in for non-sync IO devices?

if (!folio) {
    /* Swapin bypasses readahead for SWP_SYNCHRONOUS_IO devices */
    if (data_race(si->flags & SWP_SYNCHRONOUS_IO))
        folio = swapin_sync(entry, GFP_HIGHUSER_MOVABLE,
[...]
    else
        folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vmf);

So I guess it's primarily zram that does THP swap in here? On
non-SWP_SYNCHRONOUS_IO devices, it seems like we only do "THP swapin" if
we catch the page in swap cache (minor page fault). :) Will zram users
like vswap?

OTOH, zswap might be getting THP zswap-in support soon, so it's not
just the zram backend that cares about these kinds of checks? :)

Or maybe I can keep it, but separate it from this big patch to make it
easier to review :) Lemme play with it.

>
> >        */
> > -     if (WARN_ON_ONCE(folio_test_large(folio))) {
> > -             folio_unlock(folio);
> > -             return -EINVAL;
> > +     if (folio_test_large(folio)) {
> > +             if (WARN_ON_ONCE(!swap_is_vswap(si))) {
> > +                     folio_unlock(folio);
> > +                     return -EINVAL;
> > +             }
> > +             return -ENOENT;
> >       }
> >
> > -     entry = xa_load(tree, offset);
> > +     if (swap_is_vswap(si))
> > +             entry = vswap_zswap_load(swp);
> > +     else
> > +             entry = xa_load(swap_zswap_tree(swp), swp_offset(swp));
> >       if (!entry)
> >               return -ENOENT;
> >
> > @@ -1623,16 +1642,14 @@ int zswap_load(struct folio *folio)
> >       if (entry->objcg)
> >               count_objcg_events(entry->objcg, ZSWPIN, 1);
> >
> > -     /*
> > -      * We are reading into the swapcache, invalidate zswap entry.
> > -      * The swapcache is the authoritative owner of the page and
> > -      * its mappings, and the pressure that results from having two
> > -      * in-memory copies outweighs any benefits of caching the
> > -      * compression work.
> > -      */
> >       folio_mark_dirty(folio);
> > -     xa_erase(tree, offset);
> > -     zswap_entry_free(entry);
> > +
> > +     if (swap_is_vswap(si)) {
> > +             folio_release_vswap_backing(folio);
>
> Is there any advantage to calling folio_release_vswap_backing() over
> zswap_entry_free()? Seems like __vswap_release_backing() ends up just
> calling zswap_entry_free() -- and I don't see any vswap-specific state
> being cleaned up.
>
> I wonder if the zswap code should call zswap_entry_free() directly? Same
> goes for the call in zswap_store() above.

Mostly just to avoid repeating the vtable lookup-and-lock and whatnot. :)
The pattern is repeated a third time in swapoff once I allow phys
swap to be a vswap backend in the next patch, so I figured I should
probably add a helper.

>
> > +     } else {
> > +             xa_erase(swap_zswap_tree(swp), swp_offset(swp));
> > +             zswap_entry_free(entry);
> > +     }
> >
> >       folio_unlock(folio);
> >       return 0;
> > --
> > 2.53.0-Meta
> >


^ permalink raw reply

* Re: [PATCH v5 2/4] mm/zsmalloc: drop pool->lock from zs_free on 64-bit systems
From: Barry Song @ 2026-06-24 23:10 UTC (permalink / raw)
  To: Nhat Pham
  Cc: Wenchao Hao, Andrew Morton, linux-kernel, linux-mm, Minchan Kim,
	Sergey Senozhatsky, Joshua Hahn, Wenchao Hao
In-Reply-To: <CAKEwX=Og51dvN9PsH9dhVHTU=CQk5oRHwJTiu9mN9Bn8uLnx0Q@mail.gmail.com>

On Thu, Jun 25, 2026 at 6:39 AM Nhat Pham <nphamcs@gmail.com> wrote:
[...]
>
> get_and_lock_obj_class()? :)
>
> or obj_class_get_and_lock()? - there's swap_cluster_get_and_lock() as
> the precedent.

Nice, that sounds much better to me.

>
> Anyway naming is hard.

Best Regards
Barry


^ permalink raw reply

* Re: [PATCH v1] kasan: Fix false-positive wild-memory-access on x86 under 5-level paging
From: Borislav Petkov @ 2026-06-24 23:45 UTC (permalink / raw)
  To: Ihor Solodrai
  Cc: Dave Hansen, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, Andrey Ryabinin,
	Andrew Morton, bpf, kasan-dev, linux-mm, linux-kernel,
	Thomas Gleixner, Ingo Molnar, Dave Hansen, x86, H. Peter Anvin,
	Andrey Konovalov
In-Reply-To: <c49b1499-6aa0-4e17-8385-cd2bcd2220cc@linux.dev>

On Wed, Jun 24, 2026 at 03:45:15PM -0700, Ihor Solodrai wrote:
> The instructions are quite detailed, thank you.

I hope... the devil's in the detail as always so I might be talking a lot of
crap too, as usual. You'll know better when you start poking at the code.

> I'll try it and will send a separate series as soon as I can (likely next week).

No worries, take your time.

And ofc, if we decide we need a fix for stable, you could make the first
patch a one-liner which disables KASAN inline when 5-level page tables
have been detected. Or something to that effect.

And then the last patch will remove that fix.

That is, iff there's even a need to have something in stable. I don't know how
widespread kasan inline is...

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


^ permalink raw reply

* Re: [PATCH v8 24/46] KVM: guest_memfd: Make in-place conversion the default
From: Ackerley Tng @ 2026-06-25  0:05 UTC (permalink / raw)
  To: Yan Zhao
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, forkloop, pratyush, suzuki.poulose, aneesh.kumar, liam,
	Paolo Bonzini, Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <aji/2svhcc84rn5w@yzhao56-desk.sh.intel.com>

Yan Zhao <yan.y.zhao@intel.com> writes:

>
> [...snip...]
>
>>
>>  #ifdef kvm_arch_has_private_mem
>> -bool __ro_after_init gmem_in_place_conversion = false;
>> +bool __ro_after_init gmem_in_place_conversion = !IS_ENABLED(CONFIG_KVM_VM_MEMORY_ATTRIBUTES);
>> +module_param(gmem_in_place_conversion, bool, 0444);
>
> With gmem_in_place_conversion=true, userspace can create guest_memfd without the
> MMAP flag. In such cases, shared memory is allocated from different backends.
> This means this module parameter only enables per-gmem memory attribute and does
> not guarantee that gmem in-place conversion will actually occur.
>
> To avoid confusion, could we rename this module parameter to something more
> accurate, such as gmem_memory_attribute?
>

I asked Sean about this after getting some fixes off list. Sean said
gmem_in_place_conversion is named for a host admin to use, and something
like gmem_memory_attributes is too much implementation detail for the
admin.

Sean, would you reconsider since Yan also asked? If the admin compiled
the kernel knowing what CONFIG_KVM_VM_MEMORY_ATTRIBUTES means, then the
admin would also be able to use a param like gmem_memory_attributes?

There's the additional benefit that the similar naming aids in
understanding for both the admin and software engineers.

Either way, in the next revision, I'll also add this documentation for
this module_param:

  Setting the module parameter gmem_in_place_conversion to true enables
  the KVM_SET_MEMORY_ATTRIBUTES2 guest_memfd ioctl and disables the
  KVM_SET_MEMORY_ATTRIBUTES VM ioctl. If gmem_in_place_conversion is
  true, the private/shared attribute will be tracked per-guest_memfd
  instead of per-VM.

Let me know what y'all think of the wording!

>>
>> [...snip...]
>>


^ permalink raw reply

* Re: [PATCH] mm/memcontrol: remove unused for_each_mem_cgroup macro and cleanup
From: SeongJae Park @ 2026-06-25  0:10 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: SeongJae Park, linux-mm, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton, cgroups,
	linux-kernel, kernel-team
In-Reply-To: <20260624183700.1152742-1-joshua.hahnjy@gmail.com>

On Wed, 24 Jun 2026 11:36:59 -0700 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:

> Commit 7e1c0d6f58207 ("memcg: switch lruvec stats to rstat") removed the
> last caller of for_each_mem_cgroup back in 2021, and there have not been
> any new callers since. Remove the macro.
> 
> A comment in mem_cgroup_css_online has also been out of date since 2021,
> when 2bfd36374edd9 ("mm: vmscan: consolidate shrinker_maps handling
> code") open-coded the for_each_mem_cgroup iterator. Update the comment.
> 
> Finally, 99430ab8b804c ("mm: introduce BPF kfuncs to access memcg
> statistics and events") added a second declaration for memcg_events to
> include/linux/memcontrol.h, duplicating the one in mm/memcontrol-v1.h.
> Let's clean that up too.
> 
> No functional changes intended.

Nice cleanup, thank you!

> 
> Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>

Reviewed-by: SeongJae Park <sj@kernel.org>


Thanks,
SJ

[...]


^ permalink raw reply

* Re: [RFC PATCH v2 3/7] mm, swap: support physical swap as a vswap backend
From: Nhat Pham @ 2026-06-25  0:13 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: akpm, chrisl, kasong, hannes, mhocko, roman.gushchin,
	shakeel.butt, david, muchun.song, shikemeng, baoquan.he, baohua,
	youngjun.park, chengming.zhou, ljs, liam, vbabka, rppt, surenb,
	qi.zheng, axelrasmussen, yuanchu, weixugc, riel, gourry,
	haowenchao22, kernel-team, linux-mm, linux-kernel, cgroups
In-Reply-To: <CAO9r8zPXk2eRbVcEMQDTCH1j-w241h189=p04FenAfKAjkkQtA@mail.gmail.com>

On Tue, Jun 23, 2026 at 12:02 PM Yosry Ahmed <yosry@kernel.org> wrote:
>
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 466f8a182716..5daff7a25f67 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -993,6 +993,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
> >         struct folio *folio;
> >         struct mempolicy *mpol;
> >         struct swap_info_struct *si;
> > +       swp_entry_t phys = {};
> >         int ret = 0;
> >
> >         /* try to allocate swap cache folio */
> > @@ -1000,16 +1001,6 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
> >         if (!si)
> >                 return -EEXIST;
> >
> > -       /*
> > -        * Vswap entries have no physical backing - writeback would fail
> > -        * and SIGBUS the caller. Bail before we waste a swap-cache folio
> > -        * allocation.
> > -        */
> > -       if (si->flags & SWP_VSWAP) {
> > -               put_swap_device(si);
> > -               return -EINVAL;
> > -       }
> > -
> >         mpol = get_task_policy(current);
> >         folio = swap_cache_alloc_folio(swpentry, GFP_KERNEL, BIT(0), NULL, mpol,
> >                                        NO_INTERLEAVE_INDEX);
> > @@ -1028,40 +1019,78 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
> >         /*
> >          * folio is locked, and the swapcache is now secured against
> >          * concurrent swapping to and from the slot, and concurrent
> > -        * swapoff so we can safely dereference the zswap tree here.
> > -        * Verify that the swap entry hasn't been invalidated and recycled
> > -        * behind our backs, to avoid overwriting a new swap folio with
> > -        * old compressed data. Only when this is successful can the entry
> > -        * be dereferenced.
> > +        * swapoff so we can safely dereference the zswap tree (or vswap
> > +        * vtable) here. Verify that the swap entry hasn't been
> > +        * invalidated and recycled behind our backs, to avoid overwriting
> > +        * a new swap folio with old compressed data. Only when this is
> > +        * successful can the entry be dereferenced.
> >          */
> > -       tree = swap_zswap_tree(swpentry);
> > -       if (entry != xa_load(tree, offset)) {
> > -               ret = -ENOMEM;
> > -               goto out;
> > +       if (swap_is_vswap(si)) {
> > +               if (entry != vswap_zswap_load(swpentry)) {
> > +                       ret = -ENOMEM;
> > +                       goto out;
> > +               }
> > +               /*
> > +                * Allocate physical backing BEFORE decompress - if it fails,
> > +                * no wasted work. folio_realloc_swap sets vtable to PHYS,
> > +                * overwriting ZSWAP - the old entry pointer is only held
> > +                * by the caller now.
> > +                */
> > +               phys = folio_realloc_swap(folio);
> > +               if (!phys.val) {
> > +                       ret = -ENOMEM;
> > +                       goto out;
> > +               }
> > +       } else {
> > +               tree = swap_zswap_tree(swpentry);
> > +               if (entry != xa_load(tree, offset)) {
> > +                       ret = -ENOMEM;
> > +                       goto out;
> > +               }
>
> There's a lot of divergence in the code (in this patch and previous
> ones). Seems like a lot of it is to do xarray operations vs vswap
> operations. I wonder if we can abstract these into helpers, e.g.
> zswap_tree_store(), zswap_tree_load(), etc. Maybe the name is not the
> best, but you get the point :)

How about zswap_entry_load() and zswap_entry_store()? :)
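
Something like this, as a userspace toy sketch only. The two arrays stand in
for the per-device xarray and the vswap vtable, and every name here
(toy_zswap_entry_load(), toy_zswap_entry_store(), TOY_NR_SLOTS) is
hypothetical. Locking and xarray error handling are left out entirely.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_SLOTS 8

struct toy_zswap_entry {
	int id;
};

/* Toy stand-ins for the zswap xarray and the vswap vtable. */
static struct toy_zswap_entry *toy_xa_slots[TOY_NR_SLOTS];
static struct toy_zswap_entry *toy_vswap_slots[TOY_NR_SLOTS];

/* One lookup entry point: callers no longer branch on the backend. */
static struct toy_zswap_entry *toy_zswap_entry_load(bool is_vswap,
						    size_t offset)
{
	if (offset >= TOY_NR_SLOTS)
		return NULL;
	return is_vswap ? toy_vswap_slots[offset] : toy_xa_slots[offset];
}

/* Stores the entry and returns the previous one so the caller can free it. */
static struct toy_zswap_entry *toy_zswap_entry_store(bool is_vswap,
						     size_t offset,
						     struct toy_zswap_entry *entry)
{
	struct toy_zswap_entry **slot;
	struct toy_zswap_entry *old;

	if (offset >= TOY_NR_SLOTS)
		return NULL;
	slot = is_vswap ? &toy_vswap_slots[offset] : &toy_xa_slots[offset];
	old = *slot;
	*slot = entry;
	return old;
}
```

The backend check lives in exactly one place per operation, which is the
point of the helpers.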

>
> Here we can then do zswap_tree_load() for both code paths and only the
> folio_realloc_swap() needs to be different for vswap. We can do
> similar cleanups for the load/store paths as well.
>
> >         }
> >
> >         if (!zswap_decompress(entry, folio)) {
> >                 ret = -EIO;
> > +               /*
> > +                * For vswap: folio_realloc_swap already moved the entry
> > +                * out of the vtable. Restore it via vswap_zswap_store so
> > +                * the entry stays tracked (and the just-allocated PHYS
> > +                * slot is freed). For non-vswap: entry is still in the
> > +                * zswap tree.
> > +                */
> > +               if (swap_is_vswap(si) && phys.val)
> > +                       vswap_zswap_store(swpentry, entry);
>
> Should this go in the cleanup path instead (i.e. in the 'out' label?).

Ah, maybe if (ret == -EIO &&)...
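
A toy userspace model of that idea, with the restore moved under the 'out'
label. The types and toy_writeback() are made up for illustration, and the
backend switch only stands in for folio_realloc_swap() and
vswap_zswap_store(). It is a sketch of the control flow, not the patch.

```c
#include <errno.h>
#include <stdbool.h>

enum toy_backend { TOY_BACKEND_ZSWAP, TOY_BACKEND_PHYS };

struct toy_slot {
	enum toy_backend backend;
	bool is_vswap;
};

static int toy_writeback(struct toy_slot *slot, bool realloc_ok,
			 bool decompress_ok)
{
	bool have_phys = false;
	int ret = 0;

	if (slot->is_vswap) {
		/* Stands in for folio_realloc_swap(): ZSWAP becomes PHYS. */
		if (!realloc_ok) {
			ret = -ENOMEM;
			goto out;
		}
		slot->backend = TOY_BACKEND_PHYS;
		have_phys = true;
	}

	if (!decompress_ok) {
		ret = -EIO;
		goto out;
	}

	/* The actual writeback would be issued here. */
out:
	/* Stands in for vswap_zswap_store(): put the zswap entry back. */
	if (ret == -EIO && slot->is_vswap && have_phys)
		slot->backend = TOY_BACKEND_ZSWAP;
	return ret;
}
```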

>
> >                 goto out;
> >         }
> >
> > -       xa_erase(tree, offset);
> > +       if (!swap_is_vswap(si))
> > +               xa_erase(tree, offset);
>
> Maybe this can also be abstracted into a helper, but I wonder what the
> corresponding vswap operation would be. I think folio_realloc_swap()
> will have already "erased" the zswap entry from vswap. Maybe have a

Yup that's the right logic. We already change the backend to physical
swap slot here, so there's no real "erase".

> vswap helper that will only remove it if it's a zswap entry? We can
> probably do a lockless check first to make it cheap?
>
> It's probably silly to do this, and maybe there's a better way.
> Generally, I think the code would be easier to follow if we abstract
> away the xarray vs. vswap stuff into helpers (where it's reasonable).

I'm not entirely sure if it's worth it either, yeah. Unlike load and
store, erase seems a bit asymmetric in the sense that we only need to
do it for non-vswap cases.


^ permalink raw reply

* Re: [PATCH] tools/mm: add thp_swap_allocator_test binary to .gitignore
From: SeongJae Park @ 2026-06-25  0:16 UTC (permalink / raw)
  To: Zenghui Yu; +Cc: SeongJae Park, linux-mm, linux-kernel, akpm
In-Reply-To: <20260624150642.19749-1-zenghui.yu@linux.dev>

On Wed, 24 Jun 2026 23:06:42 +0800 Zenghui Yu <zenghui.yu@linux.dev> wrote:

> Tell git to ignore the generated binary for thp_swap_allocator_test.c.

Nice catch, thank you!

> 
> Signed-off-by: Zenghui Yu <zenghui.yu@linux.dev>

Reviewed-by: SeongJae Park <sj@kernel.org>


Thanks,
SJ

[...]


^ permalink raw reply

* Re: [PATCH v8 00/46] guest_memfd: In-place conversion support
From: Ackerley Tng @ 2026-06-25  0:19 UTC (permalink / raw)
  To: Garg, Shivank, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
	david, jmattson, jthoughton, michael.roth, oupton, pankaj.gupta,
	qperret, rick.p.edgecombe, rientjes, steven.price, tabba, willy,
	wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
	Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
	Baoquan He, Jason Gunthorpe, Vlastimil Babka
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <a6373206-60b6-454c-9aa9-9d52f9d84de3@amd.com>

"Garg, Shivank" <shivankg@amd.com> writes:

>
> [...snip...]
>
>
> Hi,
>
> Thanks for this series.
>
> [...snip...]
>
>
> Tested-by: Shivank Garg <shivankg@amd.com>

Thanks for testing!

>
> Best regards,
> Shivank


^ permalink raw reply

* Re: [PATCH v8 00/46] guest_memfd: In-place conversion support
From: Ackerley Tng @ 2026-06-25  0:19 UTC (permalink / raw)
  To: Xiaoyao Li, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
	david, jmattson, jthoughton, michael.roth, oupton, pankaj.gupta,
	qperret, rick.p.edgecombe, rientjes, shivankg, steven.price,
	tabba, willy, wyihan, yan.y.zhao, forkloop, pratyush,
	suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
	Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <9f81ea12-98c4-4ce6-a95e-233851dfe8dd@intel.com>

Xiaoyao Li <xiaoyao.li@intel.com> writes:

> On 6/19/2026 8:31 AM, Ackerley Tng via B4 Relay wrote:
>> TODOs
>>
>> + Retest with TDX selftests. v7 was tested with TDX [12], but the setup there was
>>    wrong. Conversions were successful (no errors), but the shared memory being
>>    tested is actually in a completely different host physical page.
>
> Glad to see you knew it already (I was going to report this to the
> original POC TDX patch)

Thanks for reviewing!


^ permalink raw reply

* Re: [RFC PATCH v2 3/7] mm, swap: support physical swap as a vswap backend
From: Yosry Ahmed @ 2026-06-25  0:19 UTC (permalink / raw)
  To: Nhat Pham
  Cc: akpm, chrisl, kasong, hannes, mhocko, roman.gushchin,
	shakeel.butt, david, muchun.song, shikemeng, baoquan.he, baohua,
	youngjun.park, chengming.zhou, ljs, liam, vbabka, rppt, surenb,
	qi.zheng, axelrasmussen, yuanchu, weixugc, riel, gourry,
	haowenchao22, kernel-team, linux-mm, linux-kernel, cgroups
In-Reply-To: <CAKEwX=PT_ABx51--Qv9AAZwkuH+_Wp_TeiUYVQBY=1=SCf1HJA@mail.gmail.com>

> > > +       if (swap_is_vswap(si)) {
> > > +               if (entry != vswap_zswap_load(swpentry)) {
> > > +                       ret = -ENOMEM;
> > > +                       goto out;
> > > +               }
> > > +               /*
> > > +                * Allocate physical backing BEFORE decompress - if it fails,
> > > +                * no wasted work. folio_realloc_swap sets vtable to PHYS,
> > > +                * overwriting ZSWAP - the old entry pointer is only held
> > > +                * by the caller now.
> > > +                */
> > > +               phys = folio_realloc_swap(folio);
> > > +               if (!phys.val) {
> > > +                       ret = -ENOMEM;
> > > +                       goto out;
> > > +               }
> > > +       } else {
> > > +               tree = swap_zswap_tree(swpentry);
> > > +               if (entry != xa_load(tree, offset)) {
> > > +                       ret = -ENOMEM;
> > > +                       goto out;
> > > +               }
> >
> > There's a lot of divergence in the code (in this patch and previous
> > ones). Seems like a lot of it is to do xarray operations vs vswap
> > operations. I wonder if we can abstract these into helpers, e.g.
> > zswap_tree_store(), zswap_tree_load(), etc. Maybe the name is not the
> > best, but you get the point :)
>
> How about zswap_entry_load() and zswap_entry_store()? :)

Even better!


> > > -       xa_erase(tree, offset);
> > > +       if (!swap_is_vswap(si))
> > > +               xa_erase(tree, offset);
> >
> > Maybe this can also be abstracted into a helper, but I wonder what the
> > corresponding vswap operation would be. I think folio_realloc_swap()
> > will have already "erased" the zswap entry from vswap. Maybe have a
>
> Yup that's the right logic. We already change the backend to physical
> swap slot here, so there's no real "erase".
>
> > vswap helper that will only remove it if it's a zswap entry? We can
> > probably do a lockless check first to make it cheap?
> >
> > It's probably silly to do this, and maybe there's a better way.
> > Generally, I think the code would be easier to follow if we abstract
> > away the xarray vs. vswap stuff into helpers (where it's reasonable).
>
> I'm not entirely sure if its worth it either, yeah. Unlike load and
> store, erase seems a bit asymmetric in the sense that we only need to
> do it for non-vswap cases.

Yeah :/

Maybe just add a comment why no erase is needed for the vswap case.


^ permalink raw reply

* Re: [PATCH 1/6] mm/page_owner: extract skip_buddy_pages() helper to unify buddy page skipping
From: Andrew Morton @ 2026-06-25  0:20 UTC (permalink / raw)
  To: Ye Liu
  Cc: Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Zi Yan, linux-mm, linux-kernel
In-Reply-To: <20260623065234.31866-2-ye.liu@linux.dev>

On Tue, 23 Jun 2026 14:52:26 +0800 Ye Liu <ye.liu@linux.dev> wrote:

> Three places in page_owner.c duplicate the same pattern: check if a
> page is PageBuddy, read its order via buddy_order_unsafe(), advance
> the pfn past the buddy block if the order is valid, and continue.
> 
> Consolidate them into a single inline helper skip_buddy_pages().
> The function returns true (skip) for any buddy page and advances
> @pfn past the block when the order is valid; returns false if the
> page is not a buddy page and should be processed normally.
> 
> The old init_pages_in_zone() variant used "order > 0" as an extra
> guard before advancing pfn, but the continue was unconditional and
> (1UL << 0) - 1 == 0, so the behaviour is identical.  The comment
> about zone->lock is preserved in the helper's kernel-doc.

All looks nice, thanks.
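
For reference, the behaviour described in the changelog can be modelled in
userspace roughly as below. This is only a sketch under assumptions:
struct toy_page, TOY_MAX_ORDER and toy_skip_buddy_pages() are made-up
stand-ins, not the code from the patch, and no zone->lock is modelled.

```c
#include <stdbool.h>

#define TOY_MAX_ORDER 10	/* hypothetical upper bound for a valid order */

struct toy_page {
	bool buddy;		/* stands in for PageBuddy() */
	unsigned int order;	/* stands in for buddy_order_unsafe() */
};

/*
 * Returns true if the page is a buddy page and should be skipped. When the
 * order is valid, *pfn is advanced to the last pfn of the buddy block so
 * that the caller's loop increment lands on the next block.
 */
static bool toy_skip_buddy_pages(const struct toy_page *page,
				 unsigned long *pfn)
{
	if (!page->buddy)
		return false;

	if (page->order <= TOY_MAX_ORDER)
		*pfn += (1UL << page->order) - 1;

	return true;
}
```

For order 0 the advance is (1UL << 0) - 1 == 0, which is why dropping the
old "order > 0" guard does not change behaviour.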

A [0/N] cover letter is nice to have.

AI review identified a few possible pre-existing issues, if you're
interested:
	https://sashiko.dev/#/patchset/20260623065234.31866-2-ye.liu@linux.dev



^ permalink raw reply

* Re: [PATCH] tools/writeback: parse help before importing drgn
From: SeongJae Park @ 2026-06-25  0:27 UTC (permalink / raw)
  To: Yousef Alhouseen
  Cc: SeongJae Park, willy, jack, shikemeng, linux-fsdevel, linux-mm,
	linux-kernel
In-Reply-To: <20260624123514.7822-1-alhouseenyousef@gmail.com>

On Wed, 24 Jun 2026 14:35:14 +0200 Yousef Alhouseen <alhouseenyousef@gmail.com> wrote:

> wb_monitor.py imports drgn before argparse can handle "-h". That makes
> help fail on systems where drgn is not installed, even though the script
> does not need drgn to print usage text.

But...  How do you execute the drgn script on systems that do not have drgn?  I
tried to mimic the situation and reproduce the issue you describe, but what I
get is shown below:

    $ sudo mv /usr/bin/drgn /usr/bin/drgn.bak
    $ drgn tools/writeback/wb_monitor.py
    -bash: /usr/bin/drgn: No such file or directory
    $ python tools/writeback/wb_monitor.py
    Traceback (most recent call last):
      File "/home/lkhack/linux/tools/writeback/wb_monitor.py", line 44, in <module>
        bdi_list                = prog['bdi_list']
                                  ^^^^
    NameError: name 'prog' is not defined


Thanks,
SJ

[...]


^ permalink raw reply

* Re: [PATCH v8 18/46] KVM: guest_memfd: Handle lru_add fbatch refcounts during conversion safety check
From: Sean Christopherson @ 2026-06-25  0:35 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, liam, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <CAEvNRgE8HZDOnexMJeim6TjmxGG1AUXFY2+HH1YyKB=aM6D-DQ@mail.gmail.com>

On Wed, Jun 24, 2026, Ackerley Tng wrote:
> Sean Christopherson <seanjc@google.com> writes:
> 
> > On Thu, Jun 18, 2026, Ackerley Tng wrote:
> >> When checking if a guest_memfd folio is safe for conversion, its refcount
> >> is examined. A folio may be present in a per-CPU lru_add fbatch, which
> >> temporarily increases its refcount.
> >
> > Under what circumstances does this happen,
> 
> It happened 100% of the time in selftests. Perhaps it's because in the
> selftests the pages are almost always freshly allocated and so the
> lru_add fbatch isn't full yet? (and that the host isn't super busy so
> lru_add fbatch doesn't get drained yet).

I chatted with Ackerley about this.  What I wanted to understand is why guest_memfd
pages were getting put onto per-CPU batches for lru_add(), given that guest_memfd
pages are unevictable.  The answer (assuming I read the code right) is that
lruvec_add_folio() updates stats and other per-lru metadata for the unevictable
lru, and does so under a per-lru lock.  I.e. we don't want to skip that stuff
entirely.

One thought I had, to avoid the IPIs that draining all per-CPU caches requires,
was to disallow putting guest_memfd pages in folio batches, e.g. by hacking
something into folio_may_be_lru_cached().  But due to taking a per-lru lock,
that would penalize the relatively hot path and definitely common operation of
faulting in guest memory.  On the other hand, memory conversion is already a
relatively slow operation and is relatively uncommon compared to page faults,
(and likely very uncommon for real world setups).  I.e. having to drain all
caches if conversion isn't safe penalizes a relatively slow, relatively uncommon
path.

If we're concerned about noisy neighbor problems, or outright abuse, I think a
simple (per process?) ratelimit would suffice.  But it's not clear to me that we
even need that, because there are already many flows in the kernel that allow
blasting IPIs without too much effort.


^ permalink raw reply

* Re: [PATCH v8 24/46] KVM: guest_memfd: Make in-place conversion the default
From: Sean Christopherson @ 2026-06-25  0:41 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: Yan Zhao, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
	david, jmattson, jthoughton, michael.roth, oupton, pankaj.gupta,
	qperret, rick.p.edgecombe, rientjes, shivankg, steven.price,
	tabba, willy, wyihan, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, liam, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <CAEvNRgHYTFnHbsLLgMTCSitmnp1_j9Pomikm9qmpGTh1w8YE5Q@mail.gmail.com>

On Wed, Jun 24, 2026, Ackerley Tng wrote:
> Yan Zhao <yan.y.zhao@intel.com> writes:
> > With gmem_in_place_conversion=true, userspace can create guest_memfd without the
> > MMAP flag. In such cases, shared memory is allocated from different backends.
> > This means this module parameter only enables per-gmem memory attribute and does
> > not guarantee that gmem in-place conversion will actually occur.

KVM module params are pretty much always about what KVM supports, not what is
guaranteed to happen.

  - enable_mmio_caching doesn't guarantee there will actually be MMIO SPTEs,
    because maybe the guest never accesses emulated MMIO.
  - enable_pmu doesn't guarantee VMs will get a PMU, because userspace may elect
    not to advertise one.
  - and so on and so forth...

Yes, there's a small mental jump to get from "KVM supports in-place conversion"
to "I need to set memory attributes on the guest_memfd instance, not the VM",
but I don't see that as a big hurdle, certainly not in the long term.  And once
the VMM code is written, I really do think most people are going to care about
whether or not KVM supports in-place conversion, not where PRIVATE is tracked.

> > To avoid confusion, could we rename this module parameter to something more
> > accurate, such as gmem_memory_attribute?
> 
> I asked Sean about this after getting some fixes off list. Sean said
> gmem_in_place_conversion is named for a host admin to use, and something
> like gmem_memory_attributes is too much implementation detail for the
> admin.
> 
> Sean, would you reconsider since Yan also asked? If the admin compiled
> the kernel knowing what CONFIG_KVM_VM_MEMORY_ATTRIBUTES means, then the
> admin would also be able to use a param like gmem_memory_attributes?

No, because it's not all memory attributes, it's very specifically the PRIVATE
attribute that will get moved to guest_memfd.  I don't want to pick a name that
will become stale and confusing when RWX attributes come along.  The RWX bits
will be per-VM, while PRIVATE will be per-guest_memfd.


^ permalink raw reply

* Re: [PATCH] fs/proc: fix KPF_KSM reported for all anonymous pages
From: Andrew Morton @ 2026-06-25  1:00 UTC (permalink / raw)
  To: Jinjiang Tu
  Cc: David Hildenbrand (Arm), ziy, luizcap, willy, linmiaohe,
	svetly.todorov, xu.xin16, chengming.zhou, linux-fsdevel, linux-mm,
	wangkefeng.wang, sunnanyong
In-Reply-To: <601fb5dd-18e1-4a6c-bc99-dc2a655240e2@huawei.com>

On Tue, 23 Jun 2026 09:37:57 +0800 Jinjiang Tu <tujinjiang@huawei.com> wrote:

> > This only affects /proc/kpageflags (well, and hwpoison inject on a weird testing
> > interface), so it's not really that relevant for real workloads (debugging and
> > testing).
> >
> > So not sure whether we should CC:stable. Likely not.
> 
> /proc/kpageflags is generally used only for analysis and is unlikely to be
> used in production environments. I found this issue while analyzing which
> stacks allocated the pfns that are KSM-merged. So I think it's unnecessary
> to CC:stable.

Well, it's a bug.  The fix is super-simple so I think it's reasonable
to feed it back to users of earlier kernels.



^ permalink raw reply

* Re: [RFC v2 PATCH] reserve_mem: add support for static memory
From: Shyam Saini @ 2026-06-25  1:09 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: linux-mm, linux-doc, linux-kernel, rppt, akpm, tgopinath,
	bboscaccy, kees, tony.luck, gpiccoli, bp, peterz, feng.tang,
	dapeng1.mi, elver, enelsonmoore, kuba, lirongqing, ebiggers
In-Reply-To: <3e206be0-3ef4-468f-b7e7-7bc03848b0d0@infradead.org>

Hi,


On 19 Jun 2026 11:35, Randy Dunlap wrote:
> Hi,
> 
> On 6/18/26 11:23 PM, Shyam Saini wrote:
> > reserve_mem relies on dynamic memory allocation, which limits use cases
> > where memory must be preserved across boots, e.g. ramoops memory
> > reservation on ACPI platforms.
> > 
> > So add support to pass a pre-determined static address and reserve
> > memory at a specified location. This enables use cases like ramoops
> > on ACPI platforms to reliably access the ramoops region with previous
> > boot logs.
> > 
> > Also skip the parsing of <align> when static address is passed.
> > 
> > Example syntax for static address
> >  reserve_mem=4M@0x1E0000000:oops
> > 
> > Signed-off-by: Shyam Saini <shyamsaini@linux.microsoft.com>
> > ---
> > v1: https://lore.kernel.org/lkml/0eaf3be2-5121-48b7-aeed-196405c0a480@infradead.org/
> > v2: Fix code logic and incorporate Randy's suggestion
> 
> OK, you fixed a few typos.
> There are some bigger things that you seem to have ignored.

Thanks for calling this out. You are right that I did not address all
comments in v2.
My goal for v2 was to quickly fix the core logic issue and keep
discussion focused on the reserve_mem static address direction in this
RFC cycle. I should have stated that clearly.
 
> > ---
> >  .../admin-guide/kernel-parameters.txt         | 15 ++++++
> >  mm/memblock.c                                 | 47 +++++++++++++------
> >  2 files changed, 47 insertions(+), 15 deletions(-)
> > 
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > index b5493a7f8f228..7e0baca564b97 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -6563,6 +6563,21 @@ Kernel parameters
> >  
> >  			reserve_mem=12M:4096:oops ramoops.mem_name=oops
> >  
> > +	reserve_mem=	[RAM]
> 
> [RAM] means "RAM disk support is enabled."
> Is that the case here?  Is "reserve_mem=" only for use in case
> RAM disk support is enabled?
> 
> ISTM that you need a new designator instead of RAM...
> or overload the use of RAM by adding more info near the top of
> Documentation/admin-guide/kernel-parameters.txt.

Will address these in future iterations.
> 
> > +			Format: nn[KMG]@<offset>:<label>
> > +			Reserve physical memory at predetermined location and label it with
> > +			a name that other subsystems can use to access it. This is typically
> > +			used for systems that do not wipe the RAM, and this command
> > +			line will try to reserve the same physical memory on
> > +			soft reboots. Note, it is guaranteed to be the same
> > +			location unless some other early allocation, e.g.: crashkernel=256M
> > +			(without static address) is reserved or overlaps this region.
> > +
> > +			The format is size@offset:label; for example, to request
> > +			4 megabytes for ramoops at 0x1E0000000:
> > +
> > +			reserve_mem=4M@0x1E0000000:oops ramoops.mem_name=oops
> > +
> >  	reservetop=	[X86-32,EARLY]
> >  			Format: nn[KMG]
> >  			Reserves a hole at the top of the kernel virtual
> > diff --git a/mm/memblock.c b/mm/memblock.c
> > index 6349c48154f4b..c76cefa0a8a83 100644
> > --- a/mm/memblock.c
> > +++ b/mm/memblock.c
> > @@ -2721,6 +2721,7 @@ static int __init reserve_mem(char *p)
> >  	char *name;
> >  	char *oldp;
> >  	int len;
> > +	bool addr_is_static = false;
> >  
> >  	if (!p)
> >  		goto err_param;
> > @@ -2736,19 +2737,27 @@ static int __init reserve_mem(char *p)
> >  	if (!size || p == oldp)
> >  		goto err_param;
> >  
> > -	if (*p != ':')
> > -		goto err_param;
> > +	/* parse the static memory address */
> > +	if (*p == '@') {
> > +		start = memparse(p+1, &p);
> > +		addr_is_static = true;
> > +	}
> >  
> > -	align = memparse(p+1, &p);
> >  	if (*p != ':')
> >  		goto err_param;
> >  
> > -	/*
> > -	 * memblock_phys_alloc() doesn't like a zero size align,
> > -	 * but it is OK for this command to have it.
> > -	 */
> > -	if (align < SMP_CACHE_BYTES)
> > -		align = SMP_CACHE_BYTES;
> > +	if (!addr_is_static) {
> > +		align = memparse(p+1, &p);
> > +		if (*p != ':')
> > +			goto err_param;
> > +
> > +		/*
> > +		 * memblock_phys_alloc() doesn't like a zero size align,
> > +		 * but it is OK for this command to have it.
> > +		 */
> > +		if (align < SMP_CACHE_BYTES)
> > +			align = SMP_CACHE_BYTES;
> > +	}
> >  
> >  	name = p + 1;
> >  	len = strlen(name);
> > @@ -2772,14 +2781,22 @@ static int __init reserve_mem(char *p)
> >  	}
> >  
> >  	/* Pick previous allocations up from KHO if available */
> > -	if (reserve_mem_kho_revive(name, size, align))
> > +	if (!addr_is_static && reserve_mem_kho_revive(name, size, align))
> >  		return 1;
> >  
> > -	/* TODO: Allocation must be outside of scratch region */
> > -	start = memblock_phys_alloc(size, align);
> > -	if (!start) {
> > -		pr_err("reserve_mem: memblock allocation failed\n");
> > -		return -ENOMEM;
> 
> 		return 1;
> 
> > +	if (addr_is_static) {
> > +		if (memblock_reserve(start, size)) {
> > +			pr_err("reserve_mem: memblock reservation failed\n");
> > +			return -ENOMEM;
> 
> 			return 1;
> 
> > +		}
> > +
> > +	} else {
> > +		/* TODO: Allocation must be outside of scratch region */
> > +		start = memblock_phys_alloc(size, align);
> > +		if (!start) {
> > +			pr_err("reserve_mem: memblock allocation failed\n");
> > +			return -ENOMEM;
> 
> 			return 1;
> 
> > +		}
> >  	}
> >  
> >  	reserved_mem_add(start, size, name);
> 
> 
> __setup() functions return 1 for "yes, I recognized this string/option
> and attempted to handle it" or 0 for "This string/option is meaningless."
> There is no "return -Eerror".
> If you need that, you could consider using early_param() [see
> <linux/init.h>].
> 
Same for this concern; I will address it in the next iteration.
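
To make the return convention and the two accepted syntaxes concrete, here is
a minimal userspace sketch. It is not the kernel code: struct reserve_req,
memparse_sketch() and parse_reserve_mem_sketch() are hypothetical names, and
memparse_sketch() only approximates the kernel's memparse() (K/M/G suffixes
only). The parser always returns 1, as a __setup() handler would once it has
recognized the option, and reports validity separately through *ok.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical result of parsing one reserve_mem= value. */
struct reserve_req {
	unsigned long long size;
	unsigned long long align;	/* dynamic form only */
	unsigned long long start;	/* static form only */
	bool addr_is_static;
	const char *label;
};

/* Userspace approximation of memparse(): number plus optional K/M/G. */
static unsigned long long memparse_sketch(const char *s, char **end)
{
	unsigned long long v = strtoull(s, end, 0);

	switch (**end) {
	case 'G': case 'g': v <<= 10; /* fall through */
	case 'M': case 'm': v <<= 10; /* fall through */
	case 'K': case 'k': v <<= 10; (*end)++; break;
	default: break;
	}
	return v;
}

/*
 * Accepts "size:align:label" (dynamic) or "size@addr:label" (static).
 * Always returns 1 ("option recognized"); *ok says whether it parsed.
 */
static int parse_reserve_mem_sketch(char *p, struct reserve_req *req, bool *ok)
{
	char *oldp = p;

	*ok = false;
	memset(req, 0, sizeof(*req));
	if (!p)
		return 1;

	req->size = memparse_sketch(p, &p);
	if (!req->size || p == oldp)
		return 1;

	if (*p == '@') {
		oldp = p + 1;
		req->start = memparse_sketch(p + 1, &p);
		if (p == oldp)
			return 1;	/* '@' without an address */
		req->addr_is_static = true;
	}
	if (*p != ':')
		return 1;

	if (!req->addr_is_static) {
		req->align = memparse_sketch(p + 1, &p);
		if (*p != ':')
			return 1;
	}

	req->label = p + 1;
	if (*req->label)
		*ok = true;
	return 1;
}
```

Keeping the "recognized" return value separate from the validity flag is one
way to avoid returning -ENOMEM from a __setup() handler; whether the patch
should do this or switch to early_param() is left open in the thread.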

Thanks,
Shyam


^ permalink raw reply

* Re: [PATCH v8 24/46] KVM: guest_memfd: Make in-place conversion the default
From: Yan Zhao @ 2026-06-25  1:21 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, forkloop, pratyush, suzuki.poulose, aneesh.kumar, liam,
	Paolo Bonzini, Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <CAEvNRgHYTFnHbsLLgMTCSitmnp1_j9Pomikm9qmpGTh1w8YE5Q@mail.gmail.com>

On Wed, Jun 24, 2026 at 05:05:44PM -0700, Ackerley Tng wrote:
> Yan Zhao <yan.y.zhao@intel.com> writes:
> 
> >
> > [...snip...]
> >
> >>
> >>  #ifdef kvm_arch_has_private_mem
> >> -bool __ro_after_init gmem_in_place_conversion = false;
> >> +bool __ro_after_init gmem_in_place_conversion = !IS_ENABLED(CONFIG_KVM_VM_MEMORY_ATTRIBUTES);
> >> +module_param(gmem_in_place_conversion, bool, 0444);
> >
> > With gmem_in_place_conversion=true, userspace can create guest_memfd without the
> > MMAP flag. In such cases, shared memory is allocated from different backends.
> > This means this module parameter only enables per-gmem memory attribute and does
> > not guarantee that gmem in-place conversion will actually occur.
> >
> > To avoid confusion, could we rename this module parameter to something more
> > accurate, such as gmem_memory_attribute?
> >
> 
> I asked Sean about this after getting some fixes off list. Sean said
> gmem_in_place_conversion is named for a host admin to use, and something
> like gmem_memory_attributes is too much implementation detail for the
> admin.
Thanks for this background.

Some more context on why I'm asking:

Currently, I'm testing TDX huge pages with the following two gmem components:
1. The gmem memory attribute in this gmem in-place conversion v8.
2. The gmem 2MB from buddy allocator (for development/testing only).

The gmem 2MB from buddy allocator allocates 2MB folios from buddy for private
memory, while shared memory is allocated from a different backend.
(To avoid fragmentation, only private mappings are split during private-to-shared
conversions. In this approach, the 2MB folios are always retained in the gmem
inode filemap cache without splitting.)

Since shared memory is not allocated from gmem, there are no in-place conversions.
The reason I'm using "gmem memory attribute" is that the per-VM attribute is
being deprecated, as suggested by Sean [1].

Besides my current usage, there may be other scenarios where gmem memory
attributes are preferred without allocating shared memory from gmem.
(e.g., PAGE.ADD from a temp extra shared source memory).

For such use cases, I'm concerned that admins may find it confusing if they
enable gmem_in_place_conversion but still observe extra memory consumption for
shared memory.

[1] https://lore.kernel.org/kvm/aWmEegVP_A613WIr@google.com/

> Sean, would you reconsider since Yan also asked? If the admin compiled
> the kernel knowing what CONFIG_KVM_VM_MEMORY_ATTRIBUTES means, then the
> admin would also be able to use a param like gmem_memory_attributes?
> 
> There's the additional benefit that the similar naming aids in
> understanding for both the admin and software engineers.
> 
> Either way, in the next revision, I'll also add this documentation for
> this module_param:
> 
>   Setting the module parameter gmem_in_place_conversion to true will
>   enable the KVM_SET_MEMORY_ATTRIBUTES2 guest_memfd ioctl and disable
>   the KVM_SET_MEMORY_ATTRIBUTES VM ioctl. If gmem_in_place_conversion is
>   true, the private/shared attribute will be tracked per-guest_memfd
>   instead of per-VM.
> 
> Let me know what y'all think of the wording!
> 
> >>
> >> [...snip...]
> >>


^ permalink raw reply

* Re: [RFC v2 PATCH] reserve_mem: add support for static memory
From: Shyam Saini @ 2026-06-25  1:22 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-mm, linux-doc, linux-kernel, akpm, tgopinath, bboscaccy,
	kees, tony.luck, gpiccoli, bp, rdunlap, peterz, feng.tang,
	dapeng1.mi, elver, enelsonmoore, kuba, lirongqing, ebiggers
In-Reply-To: <aje-nY6QbwZP9XLG@kernel.org>

Hi Mike,

On 21 Jun 2026 13:36, Mike Rapoport wrote:
> On Thu, Jun 18, 2026 at 11:23:31PM -0700, Shyam Saini wrote:
> > reserve_mem relies on dynamic memory allocation, which limits use cases
> > where memory must be preserved across boots, e.g. ramoops memory
> > reservation on ACPI platforms.
> >
> > So add support to pass a pre-determined static address and reserve
> > memory at a specified location. This enables use cases like ramoops
> > on ACPI platforms to reliably access the ramoops region with previous
> > boot logs.
> > 
> > Also skip the parsing of <align> when static address is passed.
> > 
> > Example syntax for static address
> >  reserve_mem=4M@0x1E0000000:oops
> 
> reserve_mem is best effort by design because such hacks as well as memmap=
> cannot guarantee this memory is actually free.
> 
> If you want to preserve ramoops reliably, use KHO with reserve_mem.
> The first kernel will allocate memory, this memory will be preserved by KHO
> and could be picked up by the second kernel.

OK. On ARM64 DTS systems, we can reserve ramoops memory in the device tree for
warm reboots.
For an equivalent ARM64 ACPI platform, what is the recommended way to reserve
and preserve that memory across boots?

> > Signed-off-by: Shyam Saini <shyamsaini@linux.microsoft.com>
> > ---
> > v1: https://lore.kernel.org/lkml/0eaf3be2-5121-48b7-aeed-196405c0a480@infradead.org/
> > v2: Fix code logic and incorporate Randy's suggestion
> > ---
> >  .../admin-guide/kernel-parameters.txt         | 15 ++++++
> >  mm/memblock.c                                 | 47 +++++++++++++------
> >  2 files changed, 47 insertions(+), 15 deletions(-)
> 
> -- 
> Sincerely yours,
> Mike.

Thanks,
Shyam


^ permalink raw reply

* [PATCH] mm/damon/paddr: remove folio_put from damon_pa_invalid_damos_folio
From: Yu Qin @ 2026-06-25  1:22 UTC (permalink / raw)
  To: sj; +Cc: akpm, damon, linux-mm, Yu Qin

This boolean function calls folio_put() implicitly. Remove the put and let
callers handle it explicitly, making the get/put pairing clearer.

Signed-off-by: Yu Qin <qin.yuA@h3c.com>
---
 mm/damon/paddr.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c
index 85cd64a55..f45c7939a 100644
--- a/mm/damon/paddr.c
+++ b/mm/damon/paddr.c
@@ -313,15 +313,12 @@ static bool damos_pa_filter_out(struct damos *scheme, struct folio *folio)
 	return scheme->ops_filters_default_reject;
 }
 
-static bool damon_pa_invalid_damos_folio(struct folio *folio, struct damos *s)
+static inline bool damon_pa_invalid_damos_folio(struct folio *folio,
+		struct damos *s)
 {
 	if (!folio)
 		return true;
-	if (folio == s->last_applied) {
-		folio_put(folio);
-		return true;
-	}
-	return false;
+	return folio == s->last_applied;
 }
 
 static unsigned long damon_pa_pageout(struct damon_region *r,
@@ -353,6 +350,8 @@ static unsigned long damon_pa_pageout(struct damon_region *r,
 	while (addr < damon_pa_phys_addr(r->ar.end, addr_unit)) {
 		folio = damon_get_folio(PHYS_PFN(addr));
 		if (damon_pa_invalid_damos_folio(folio, s)) {
+			if (folio)
+				folio_put(folio);
 			addr += PAGE_SIZE;
 			continue;
 		}
@@ -394,6 +393,8 @@ static inline unsigned long damon_pa_de_activate(
 	while (addr < damon_pa_phys_addr(r->ar.end, addr_unit)) {
 		folio = damon_get_folio(PHYS_PFN(addr));
 		if (damon_pa_invalid_damos_folio(folio, s)) {
+			if (folio)
+				folio_put(folio);
 			addr += PAGE_SIZE;
 			continue;
 		}
@@ -442,6 +443,8 @@ static unsigned long damon_pa_migrate(struct damon_region *r,
 	while (addr < damon_pa_phys_addr(r->ar.end, addr_unit)) {
 		folio = damon_get_folio(PHYS_PFN(addr));
 		if (damon_pa_invalid_damos_folio(folio, s)) {
+			if (folio)
+				folio_put(folio);
 			addr += PAGE_SIZE;
 			continue;
 		}
@@ -478,6 +481,8 @@ static unsigned long damon_pa_stat(struct damon_region *r,
 	while (addr < damon_pa_phys_addr(r->ar.end, addr_unit)) {
 		folio = damon_get_folio(PHYS_PFN(addr));
 		if (damon_pa_invalid_damos_folio(folio, s)) {
+			if (folio)
+				folio_put(folio);
 			addr += PAGE_SIZE;
 			continue;
 		}
-- 
2.43.0
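
The get/put pairing that the patch makes explicit can be modeled in userspace
with a plain counter. This is only a sketch under stated assumptions: struct
folio_sketch and all *_sketch() functions are hypothetical stand-ins for the
folio, damon_get_folio(), folio_put() and the DAMOS loop body, and the action
itself is elided.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a folio with a reference count. */
struct folio_sketch {
	int refcount;
};

/* Takes a reference when the folio exists, like a "get" helper. */
static struct folio_sketch *get_folio_sketch(struct folio_sketch *f)
{
	if (f)
		f->refcount++;
	return f;
}

static void put_folio_sketch(struct folio_sketch *f)
{
	f->refcount--;
}

/* Pure predicate: it no longer drops the reference itself. */
static bool invalid_folio_sketch(const struct folio_sketch *folio,
				 const struct folio_sketch *last_applied)
{
	if (!folio)
		return true;
	return folio == last_applied;
}

/* One loop iteration; returns true when the action would be applied. */
static bool visit_sketch(struct folio_sketch *candidate,
			 const struct folio_sketch *last_applied)
{
	struct folio_sketch *folio = get_folio_sketch(candidate);

	if (invalid_folio_sketch(folio, last_applied)) {
		if (folio)
			put_folio_sketch(folio);	/* caller pairs the get */
		return false;
	}
	/* ... the scheme's action on the folio would go here ... */
	put_folio_sketch(folio);
	return true;
}
```

In this model every successful get is matched by exactly one put in the same
function, so the count is unchanged after each visit whichever branch is taken.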



^ permalink raw reply related

* Re: [PATCH v2] mm: avoid KCSAN false positive in memdesc_nid()
From: Hui Zhu @ 2026-06-25  1:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	linux-mm, linux-kernel, Hui Zhu
In-Reply-To: <20260624140104.eacc15e291eec123bc7b3349@linux-foundation.org>

> 
> On Tue, 23 Jun 2026 16:44:32 +0800 Hui Zhu <hui.zhu@linux.dev> wrote:
> 
> > 
> > From: Hui Zhu <zhuhui@kylinos.cn>
> >  
> >  KCSAN reports a data race between page_to_nid()/folio_nid() reading
> >  page->flags and folio_trylock()/folio_lock() concurrently doing
> >  test_and_set_bit_lock(PG_locked, ...) on the same word, e.g.:
> >  
> >  BUG: KCSAN: data-race in __lruvec_stat_mod_folio / shmem_get_folio_gfp
> >  
> >  The node id occupies a fixed bit-range of page->flags that is set
> >  once at page init and never modified afterwards, so it can never
> >  overlap with the low PG_locked/PG_waiters bits touched by the folio
> >  lock path.
> >  
> >  Use ASSERT_EXCLUSIVE_BITS() in memdesc_nid() to scope the exemption
> >  to just the node-id bits, consistent with how memdesc_zonenum()
> >  already handles the same class of race for the zone-id bits.
> >  
> >  ...
> > 
> >  --- a/include/linux/mm.h
> >  +++ b/include/linux/mm.h
> >  @@ -2290,6 +2290,7 @@ int memdesc_nid(memdesc_flags_t mdf);
> >  #else
> >  static inline int memdesc_nid(memdesc_flags_t mdf)
> >  {
> >  + ASSERT_EXCLUSIVE_BITS(mdf.f, NODES_MASK << NODES_PGSHIFT);
> >  return (mdf.f >> NODES_PGSHIFT) & NODES_MASK;
> >  }
> >  #endif
> > 
> It seems weird to be doing this against a local variable within a
> random function, seemingly unrelated to the problematic functions which
> you've identified.
> 
> Seems that it fooled Sashiko:
>  https://sashiko.dev/#/patchset/20260623084432.701120-1-hui.zhu@linux.dev
> 
> I'm wondering what the heck is going on here?
>

Hi Andrew,

Good catch. ASSERT_EXCLUSIVE_BITS(mdf.f, ...) is checking a by-value
copy of the flags word inside memdesc_nid(), not the actual shared
page->flags/folio->flags being modified by folio_trylock(). Whatever
made it appear to suppress the KCSAN report is likely an artifact of
inlining/codegen (kcsan_atomic_next() happening to land on the real
load after inlining), not a principled fix - so Sashiko's pass is
not reassuring here.

I'll move the assertion to where the real dereference happens (at
the page_to_nid()/folio_nid() call sites) instead of inside the
by-value helper. This probably also applies to the existing
memdesc_zonenum() pattern - is that one actually verified to work,
or does it have the same issue?
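
The by-value problem can be shown outside the kernel. The sketch below
assumes (as I understand the KCSAN macros) that ASSERT_EXCLUSIVE_BITS() checks
accesses to the address of the variable it is given; flags_sketch_t and both
helpers are hypothetical stand-ins for memdesc_flags_t and its accessors.

```c
#include <stdbool.h>

/* Hypothetical stand-in for memdesc_flags_t: a struct wrapping one word. */
typedef struct {
	unsigned long f;
} flags_sketch_t;

/*
 * By-value helper: 'mdf' is a private copy, so its address is never the
 * address of the shared flags word that other CPUs modify.
 */
static bool byvalue_sees_shared_word(flags_sketch_t mdf,
				     const unsigned long *shared)
{
	return (const unsigned long *)&mdf.f == shared;
}

/* By-pointer variant: the callee can name the shared word itself. */
static bool pointer_sees_shared_word(const flags_sketch_t *mdf,
				     const unsigned long *shared)
{
	return &mdf->f == shared;
}
```

Only the by-pointer variant observes the shared address, which is why an
assertion placed at the point of the real dereference is the more principled
location.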

Best,
Hui


^ permalink raw reply

