Linux-mm Archive on lore.kernel.org
* Re: [PATCH RFC v3 4/4] mm: add PMD-level huge page support for remap_pfn_range()
From: Yin Tirui @ 2026-04-19 11:24 UTC (permalink / raw)
  To: David Hildenbrand (Arm), lorenzo.stoakes
  Cc: linux-kernel, linux-mm, x86, linux-arm-kernel, willy, jgross,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, luto,
	peterz, akpm, ziy, baolin.wang, Liam.Howlett, npache,
	ryan.roberts, dev.jain, baohua, lance.yang, vbabka, rppt, surenb,
	mhocko, anshuman.khandual, rmclure, kevin.brodsky, apopple, ajd,
	pasha.tatashin, bhe, thuth, coxu, dan.j.williams, yu-cheng.yu,
	yangyicong, baolu.lu, conor.dooley, Jonathan.Cameron, riel,
	wangkefeng.wang, chenjun102
In-Reply-To: <5d04929b-576f-4926-9f3b-be9a41a3e010@gmail.com>

Hi David,

Thanks a lot for the thorough review!

On 4/14/26 04:02, David Hildenbrand (Arm) wrote:
> On 2/28/26 08:09, Yin Tirui wrote:
>> Add PMD-level huge page support to remap_pfn_range(), automatically
>> creating huge mappings when prerequisites are satisfied (size, alignment,
>> architecture support, etc.) and falling back to normal page mappings
>> otherwise.
>>
>> Implement special huge PMD splitting by utilizing the pgtable deposit/
>> withdraw mechanism. When splitting is needed, the deposited pgtable is
>> withdrawn and populated with individual PTEs created from the original
>> huge mapping.
>>
>> Signed-off-by: Yin Tirui <yintirui@huawei.com>
>> ---
> 
> [...]
> 
>>  
>>  	if (!vma_is_anonymous(vma)) {
>>  		old_pmd = pmdp_huge_clear_flush(vma, haddr, pmd);
>> +
>> +		if (!vma_is_dax(vma) && vma_is_special_huge(vma)) {
> 
> These magical vma checks are really bad. This all needs a cleanup
> (Lorenzo is doing some, hoping it will look better on top of that).
> 
Agreed. I am following Lorenzo's recent cleanups closely.

>> +			pte_t entry;
>> +
>> +			if (!pmd_special(old_pmd)) {
> 
> If you are using pmd_special(), you are doing something wrong.
> 
> Hint: vm_normal_page_pmd() is usually what you want.

Spot on.

While looking into applying vm_normal_folio_pmd() here to avoid the
magical VMA checks, I realized that both __split_huge_pmd_locked() and
copy_huge_pmd() currently suffer from the same !vma_is_anonymous(vma)
top-level entanglement. I think these functions could benefit from a
structural refactoring similar to what Lorenzo is currently doing in
zap_huge_pmd().

My idea is to flatten both functions into a pmd_present()-driven
decision tree:
1. Branch strictly on pmd_present().
2. For present PMDs, rely exclusively on vm_normal_folio_pmd() to
determine the underlying memory type, rather than guessing from VMA flags.
3. If !folio (and not a huge zero page), it cleanly identifies special
mappings (like PFNMAPs) without relying on vma_is_special_huge(). We can
handle the split/copy directly and return early.
4. Otherwise, proceed with the normal Anon/File THP logic, or handle
non-present migration entries in the !pmd_present() branch.

I have drafted two preparation patches demonstrating this approach and
appended the diffs at the end of this email. Does this direction look
reasonable to you? If so, I will iron out the implementation details and
include these refactoring patches in my upcoming v4 series.
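
To make the proposed ordering of checks concrete, here is a minimal
userspace sketch of the decision tree above. It is not kernel code:
struct model_pmd, struct model_folio and classify_pmd() are hypothetical
names, and the folio pointer merely stands in for whatever
vm_normal_folio_pmd() would return for a present PMD.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these are kernel types. */
enum pmd_kind {
	PMD_KIND_NON_PRESENT,	/* e.g. migration or device-private entry */
	PMD_KIND_HUGE_ZERO,	/* present, no folio, huge zero page */
	PMD_KIND_SPECIAL,	/* present, no folio, e.g. PFNMAP */
	PMD_KIND_FILE_THP,	/* present, folio is not anonymous */
	PMD_KIND_ANON_THP,	/* present, folio is anonymous */
};

struct model_folio {
	bool anon;
};

struct model_pmd {
	bool present;
	bool huge_zero;
	/* Assumption: NULL models a NULL result from vm_normal_folio_pmd(). */
	const struct model_folio *folio;
};

/* Step 1: branch on presence. Steps 2-4: let the folio decide the type. */
static enum pmd_kind classify_pmd(const struct model_pmd *pmd)
{
	assert(pmd != NULL);

	if (!pmd->present)
		return PMD_KIND_NON_PRESENT;
	if (!pmd->folio)
		return pmd->huge_zero ? PMD_KIND_HUGE_ZERO : PMD_KIND_SPECIAL;
	return pmd->folio->anon ? PMD_KIND_ANON_THP : PMD_KIND_FILE_THP;
}
```

In this model, a present entry with no folio and huge_zero unset
classifies as PMD_KIND_SPECIAL, which corresponds to the early-return
case in step 3, without consulting any VMA flag.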

> 
>> +				zap_deposited_table(mm, pmd);
>> +				return;
>> +			}
>> +			pgtable = pgtable_trans_huge_withdraw(mm, pmd);
>> +			if (unlikely(!pgtable))
>> +				return;
>> +			pmd_populate(mm, &_pmd, pgtable);
>> +			pte = pte_offset_map(&_pmd, haddr);
>> +			entry = pfn_pte(pmd_pfn(old_pmd), pmd_pgprot(old_pmd));
>> +			set_ptes(mm, haddr, pte, entry, HPAGE_PMD_NR);
>> +			pte_unmap(pte);
>> +
>> +			smp_wmb(); /* make pte visible before pmd */
>> +			pmd_populate(mm, pmd, pgtable);
>> +			return;
>> +		}
>> +
>>  		/*
>>  		 * We are going to unmap this huge page. So
>>  		 * just go ahead and zap it
>>  		 */
>>  		if (arch_needs_pgtable_deposit())
>>  			zap_deposited_table(mm, pmd);
>> -		if (!vma_is_dax(vma) && vma_is_special_huge(vma))
>> -			return;
>> +
>>  		if (unlikely(pmd_is_migration_entry(old_pmd))) {
>>  			const softleaf_t old_entry = softleaf_from_pmd(old_pmd);
>>  
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 07778814b4a8..affccf38cbcf 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -2890,6 +2890,40 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
>>  	return err;
>>  }
>>  
>> +#ifdef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP
> 
> Why exactly do we need arch support for that in form of a Kconfig.
> 
> Usually, we guard pmd support by CONFIG_TRANSPARENT_HUGEPAGE.
> 
> And then, we must check at runtime if PMD leaves are actually supported.
> 
> Luiz is working on a cleanup series:
> 
> https://lore.kernel.org/r/cover.1775679721.git.luizcap@redhat.com
> 
> pgtable_has_pmd_leaves() is what you would want to check.

Makes sense. This Kconfig was inherited from Peter Xu's earlier
proposal, but depending on CONFIG_TRANSPARENT_HUGEPAGE and
pgtable_has_pmd_leaves() is indeed the correct standard. I will rebase
on Luiz's series.

> 
> 
>> +static int remap_try_huge_pmd(struct mm_struct *mm, pmd_t *pmd,
>> +			unsigned long addr, unsigned long end,
>> +			unsigned long pfn, pgprot_t prot)
> 
> Use two-tab indent. (currently 3? :) )
> 
> Also, we tend to call these things now "pmd leaves". Call it
>  "remap_try_pmd_leaf" or something even more expressive like
> 
> "remap_try_install_pmd_leaf()"
> 

Noted. Will fix the indentation and rename it.

>> +{
>> +	pgtable_t pgtable;
>> +	spinlock_t *ptl;
>> +
>> +	if ((end - addr) != PMD_SIZE)
> 
> 	if (end - addr != PMD_SIZE)
> 
> Should work

Noted.

> 
>> +		return 0;
>> +
>> +	if (!IS_ALIGNED(addr, PMD_SIZE))
>> +		return 0;
>> +
> 
> You could likely combine both things into a
> 
> 	if (!IS_ALIGNED(addr | end, PMD_SIZE))
> 
>> +	if (!IS_ALIGNED(pfn, HPAGE_PMD_NR))
> 
> Another sign that you piggy-back on THP support ;)

Indeed! :)

> 
>> +		return 0;
>> +
>> +	if (pmd_present(*pmd) && !pmd_free_pte_page(pmd, addr))
>> +		return 0;
> 
> Ripping out a page table?! That doesn't sound right :)
> 
> Why is that required? We shouldn't be doing that here. Gah.
> 
> Especially, without any pmd locks etc.

...oops. That is indeed a silly one. Thanks for catching it.
I will fix this to:

	if (!pmd_none(*pmd))
		return 0;
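
Putting the review comments so far together, the eligibility test could
look roughly like the userspace sketch below. All MODEL_* names and
model_can_install_pmd_leaf() are hypothetical, and 4 KiB base pages with
2 MiB PMD leaves are assumed purely for illustration. It also assumes
the caller already clamps the range to at most one PMD (as
pmd_addr_end() does), which is what makes the combined
IS_ALIGNED(addr | end, PMD_SIZE) check imply end - addr == PMD_SIZE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed geometry for illustration: 4 KiB pages, 2 MiB PMD leaves. */
#define MODEL_PAGE_SHIFT	12
#define MODEL_PMD_SHIFT		21
#define MODEL_PMD_SIZE		(UINT64_C(1) << MODEL_PMD_SHIFT)
#define MODEL_HPAGE_PMD_NR	(UINT64_C(1) << (MODEL_PMD_SHIFT - MODEL_PAGE_SHIFT))
#define MODEL_IS_ALIGNED(x, a)	(((x) & ((a) - 1)) == 0)

/* Returns true when a PMD leaf may be installed for [addr, end). */
static bool model_can_install_pmd_leaf(uint64_t addr, uint64_t end,
		uint64_t pfn, bool pmd_is_none)
{
	/* Precondition: the range is non-empty and spans at most one PMD. */
	assert(addr < end && end - addr <= MODEL_PMD_SIZE);

	/* One check covers both start alignment and full-PMD size. */
	if (!MODEL_IS_ALIGNED(addr | end, MODEL_PMD_SIZE))
		return false;
	/* The physical side must be aligned to a huge page as well. */
	if (!MODEL_IS_ALIGNED(pfn, MODEL_HPAGE_PMD_NR))
		return false;
	/* Never replace an existing entry; fall back to PTEs instead. */
	return pmd_is_none;
}
```

With these assumed sizes, the range [2 MiB, 4 MiB) backed by PFN 0x200
and an empty PMD qualifies, while [2 MiB, 3 MiB) does not.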

> 
>> +
>> +	pgtable = pte_alloc_one(mm);
>> +	if (unlikely(!pgtable))
>> +		return 0;
>> +
>> +	mm_inc_nr_ptes(mm);
>> +	ptl = pmd_lock(mm, pmd);
>> +	set_pmd_at(mm, addr, pmd, pmd_mkspecial(pmd_mkhuge(pfn_pmd(pfn, prot))));
>> +	pgtable_trans_huge_deposit(mm, pmd, pgtable);
>> +	spin_unlock(ptl);
>> +
>> +	return 1;
>> +}
>> +#endif
>> +
>>  static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,
>>  			unsigned long addr, unsigned long end,
>>  			unsigned long pfn, pgprot_t prot)
>> @@ -2905,6 +2939,12 @@ static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,
>>  	VM_BUG_ON(pmd_trans_huge(*pmd));
>>  	do {
>>  		next = pmd_addr_end(addr, end);
>> +#ifdef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP
>> +		if (remap_try_huge_pmd(mm, pmd, addr, next,
>> +				pfn + (addr >> PAGE_SHIFT), prot)) {
> 
> Please provide a stub instead so we don't end up with ifdef in this code.

Will do.
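
For reference, the stub pattern being asked for can be sketched in plain
C as follows. MODEL_HAS_PMD_LEAF and both function names are
hypothetical stand-ins, not the symbols v4 will use; the point is only
that the caller compiles unchanged whether or not the feature is built.

```c
#include <assert.h>
#include <stdbool.h>

/* MODEL_HAS_PMD_LEAF is a hypothetical stand-in for the real config symbol. */
#ifdef MODEL_HAS_PMD_LEAF
static int model_try_install_pmd_leaf(unsigned long addr, unsigned long end)
{
	/* A real implementation would attempt the huge mapping here. */
	return (end - addr) == (1UL << 21);
}
#else
/* Stub: never installs a leaf, so callers always take the PTE path. */
static inline int model_try_install_pmd_leaf(unsigned long addr,
		unsigned long end)
{
	(void)addr;
	(void)end;
	return 0;
}
#endif

/* The caller carries no #ifdef; returns true when the PTE fallback ran. */
static bool model_remap_one_pmd(unsigned long addr, unsigned long end)
{
	assert(addr < end);

	if (model_try_install_pmd_leaf(addr, end))
		return false;	/* mapped as a leaf, nothing more to do */
	return true;		/* fall back to mapping individual PTEs */
}
```

When MODEL_HAS_PMD_LEAF is not defined, the stub returns 0 and
model_remap_one_pmd() always reports the PTE fallback.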

> 

Appendix:

Based on the mm-stable branch.

1. copy_huge_pmd()

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 42c983821c03..3f8b3f15c6ba 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1912,35 +1912,11 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		  struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
 {
 	spinlock_t *dst_ptl, *src_ptl;
-	struct page *src_page;
 	struct folio *src_folio;
 	pmd_t pmd;
 	pgtable_t pgtable = NULL;
 	int ret = -ENOMEM;

-	pmd = pmdp_get_lockless(src_pmd);
-	if (unlikely(pmd_present(pmd) && pmd_special(pmd) &&
-		     !is_huge_zero_pmd(pmd))) {
-		dst_ptl = pmd_lock(dst_mm, dst_pmd);
-		src_ptl = pmd_lockptr(src_mm, src_pmd);
-		spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
-		/*
-		 * No need to recheck the pmd, it can't change with write
-		 * mmap lock held here.
-		 *
-		 * Meanwhile, making sure it's not a CoW VMA with writable
-		 * mapping, otherwise it means either the anon page wrongly
-		 * applied special bit, or we made the PRIVATE mapping be
-		 * able to wrongly write to the backend MMIO.
-		 */
-		VM_WARN_ON_ONCE(is_cow_mapping(src_vma->vm_flags) && pmd_write(pmd));
-		goto set_pmd;
-	}
-
-	/* Skip if can be re-fill on fault */
-	if (!vma_is_anonymous(dst_vma))
-		return 0;
-
 	pgtable = pte_alloc_one(dst_mm);
 	if (unlikely(!pgtable))
 		goto out;
@@ -1952,48 +1928,69 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	ret = -EAGAIN;
 	pmd = *src_pmd;

-	if (unlikely(thp_migration_supported() &&
-		     pmd_is_valid_softleaf(pmd))) {
+	if (likely(pmd_present(pmd))) {
+		src_folio = vm_normal_folio_pmd(src_vma, addr, pmd);
+		if (unlikely(!src_folio)) {
+			/*
+			 * When page table lock is held, the huge zero pmd should not be
+			 * under splitting since we don't split the page itself, only pmd to
+			 * a page table.
+			 */
+			if (is_huge_zero_pmd(pmd)) {
+				/*
+				 * mm_get_huge_zero_folio() will never allocate a new
+				 * folio here, since we already have a zero page to
+				 * copy. It just takes a reference.
+				 */
+				mm_get_huge_zero_folio(dst_mm);
+				goto out_zero_page;
+			}
+
+			/*
+			 * Making sure it's not a CoW VMA with writable
+			 * mapping, otherwise it means either the anon page wrongly
+			 * applied special bit, or we made the PRIVATE mapping be
+			 * able to wrongly write to the backend MMIO.
+			 */
+			VM_WARN_ON_ONCE(is_cow_mapping(src_vma->vm_flags) && pmd_write(pmd));
+			pte_free(dst_mm, pgtable);
+			goto set_pmd;
+		}
+
+		if (!folio_test_anon(src_folio)) {
+			pte_free(dst_mm, pgtable);
+			ret = 0;
+			goto out_unlock;
+		}
+
+		folio_get(src_folio);
+		if (unlikely(folio_try_dup_anon_rmap_pmd(src_folio, &src_folio->page, dst_vma, src_vma))) {
+			/* Page maybe pinned: split and retry the fault on PTEs. */
+			folio_put(src_folio);
+			pte_free(dst_mm, pgtable);
+			spin_unlock(src_ptl);
+			spin_unlock(dst_ptl);
+			__split_huge_pmd(src_vma, src_pmd, addr, false);
+			return -EAGAIN;
+		}
+		add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
+
+	} else if (unlikely(thp_migration_supported() && pmd_is_valid_softleaf(pmd))) {
+		if (unlikely(!vma_is_anonymous(dst_vma))) {
+			pte_free(dst_mm, pgtable);
+			ret = 0;
+			goto out_unlock;
+		}
 		copy_huge_non_present_pmd(dst_mm, src_mm, dst_pmd, src_pmd, addr,
 					  dst_vma, src_vma, pmd, pgtable);
 		ret = 0;
 		goto out_unlock;
-	}

-	if (unlikely(!pmd_trans_huge(pmd))) {
+	} else {
 		pte_free(dst_mm, pgtable);
 		goto out_unlock;
 	}
-	/*
-	 * When page table lock is held, the huge zero pmd should not be
-	 * under splitting since we don't split the page itself, only pmd to
-	 * a page table.
-	 */
-	if (is_huge_zero_pmd(pmd)) {
-		/*
-		 * mm_get_huge_zero_folio() will never allocate a new
-		 * folio here, since we already have a zero page to
-		 * copy. It just takes a reference.
-		 */
-		mm_get_huge_zero_folio(dst_mm);
-		goto out_zero_page;
-	}

-	src_page = pmd_page(pmd);
-	VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
-	src_folio = page_folio(src_page);
-
-	folio_get(src_folio);
-	if (unlikely(folio_try_dup_anon_rmap_pmd(src_folio, src_page, dst_vma, src_vma))) {
-		/* Page maybe pinned: split and retry the fault on PTEs. */
-		folio_put(src_folio);
-		pte_free(dst_mm, pgtable);
-		spin_unlock(src_ptl);
-		spin_unlock(dst_ptl);
-		__split_huge_pmd(src_vma, src_pmd, addr, false);
-		return -EAGAIN;
-	}
-	add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
 out_zero_page:
 	mm_inc_nr_ptes(dst_mm);
 	pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);


2. __split_huge_pmd_locked()

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3f8b3f15c6ba..c02c2843520f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3090,98 +3090,50 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,

 	count_vm_event(THP_SPLIT_PMD);

-	if (!vma_is_anonymous(vma)) {
-		old_pmd = pmdp_huge_clear_flush(vma, haddr, pmd);
-		/*
-		 * We are going to unmap this huge page. So
-		 * just go ahead and zap it
-		 */
-		if (arch_needs_pgtable_deposit())
-			zap_deposited_table(mm, pmd);
-		if (vma_is_special_huge(vma))
-			return;
-		if (unlikely(pmd_is_migration_entry(old_pmd))) {
-			const softleaf_t old_entry = softleaf_from_pmd(old_pmd);
+	if (pmd_present(*pmd)) {
+		folio = vm_normal_folio_pmd(vma, haddr, *pmd);

-			folio = softleaf_to_folio(old_entry);
-		} else if (is_huge_zero_pmd(old_pmd)) {
+		if (unlikely(!folio)) {
+			/* Huge Zero Page */
+			if (is_huge_zero_pmd(*pmd))
+				/*
+				 * FIXME: Do we want to invalidate secondary mmu by calling
+				 * mmu_notifier_arch_invalidate_secondary_tlbs() see comments below
+				 * inside __split_huge_pmd() ?
+				 *
+				 * We are going from a zero huge page write protected to zero
+				 * small page also write protected so it does not seems useful
+				 * to invalidate secondary mmu at this time.
+				 */
+				return __split_huge_zero_page_pmd(vma, haddr, pmd);
+
+			/* Huge PFNMAP */
+			old_pmd = pmdp_huge_clear_flush(vma, haddr, pmd);
+			if (arch_needs_pgtable_deposit())
+				zap_deposited_table(mm, pmd);
 			return;
-		} else {
+		}
+
+		/* File/Shmem THP */
+		if (unlikely(!folio_test_anon(folio))) {
+			old_pmd = pmdp_huge_clear_flush(vma, haddr, pmd);
+			if (arch_needs_pgtable_deposit())
+				zap_deposited_table(mm, pmd);
+			if (vma_is_special_huge(vma))
+				return;
+
 			page = pmd_page(old_pmd);
-			folio = page_folio(page);
 			if (!folio_test_dirty(folio) && pmd_dirty(old_pmd))
 				folio_mark_dirty(folio);
 			if (!folio_test_referenced(folio) && pmd_young(old_pmd))
 				folio_set_referenced(folio);
 			folio_remove_rmap_pmd(folio, page, vma);
 			folio_put(folio);
+			add_mm_counter(mm, mm_counter_file(folio), -HPAGE_PMD_NR);
+			return;
 		}
-		add_mm_counter(mm, mm_counter_file(folio), -HPAGE_PMD_NR);
-		return;
-	}
-
-	if (is_huge_zero_pmd(*pmd)) {
-		/*
-		 * FIXME: Do we want to invalidate secondary mmu by calling
-		 * mmu_notifier_arch_invalidate_secondary_tlbs() see comments below
-		 * inside __split_huge_pmd() ?
-		 *
-		 * We are going from a zero huge page write protected to zero
-		 * small page also write protected so it does not seems useful
-		 * to invalidate secondary mmu at this time.
-		 */
-		return __split_huge_zero_page_pmd(vma, haddr, pmd);
-	}
-
-	if (pmd_is_migration_entry(*pmd)) {
-		softleaf_t entry;
-
-		old_pmd = *pmd;
-		entry = softleaf_from_pmd(old_pmd);
-		page = softleaf_to_page(entry);
-		folio = page_folio(page);
-
-		soft_dirty = pmd_swp_soft_dirty(old_pmd);
-		uffd_wp = pmd_swp_uffd_wp(old_pmd);
-
-		write = softleaf_is_migration_write(entry);
-		if (PageAnon(page))
-			anon_exclusive = softleaf_is_migration_read_exclusive(entry);
-		young = softleaf_is_migration_young(entry);
-		dirty = softleaf_is_migration_dirty(entry);
-	} else if (pmd_is_device_private_entry(*pmd)) {
-		softleaf_t entry;
-
-		old_pmd = *pmd;
-		entry = softleaf_from_pmd(old_pmd);
-		page = softleaf_to_page(entry);
-		folio = page_folio(page);
-
-		soft_dirty = pmd_swp_soft_dirty(old_pmd);
-		uffd_wp = pmd_swp_uffd_wp(old_pmd);
-
-		write = softleaf_is_device_private_write(entry);
-		anon_exclusive = PageAnonExclusive(page);

-		/*
-		 * Device private THP should be treated the same as regular
-		 * folios w.r.t anon exclusive handling. See the comments for
-		 * folio handling and anon_exclusive below.
-		 */
-		if (freeze && anon_exclusive &&
-		    folio_try_share_anon_rmap_pmd(folio, page))
-			freeze = false;
-		if (!freeze) {
-			rmap_t rmap_flags = RMAP_NONE;
-
-			folio_ref_add(folio, HPAGE_PMD_NR - 1);
-			if (anon_exclusive)
-				rmap_flags |= RMAP_EXCLUSIVE;
-
-			folio_add_anon_rmap_ptes(folio, page, HPAGE_PMD_NR,
-						 vma, haddr, rmap_flags);
-		}
-	} else {
+		/* Anon THP */
 		/*
 		 * Up to this point the pmd is present and huge and userland has
 		 * the whole access to the hugepage during the split (which
@@ -3207,7 +3159,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		 */
 		old_pmd = pmdp_invalidate(vma, haddr, pmd);
 		page = pmd_page(old_pmd);
-		folio = page_folio(page);
 		if (pmd_dirty(old_pmd)) {
 			dirty = true;
 			folio_set_dirty(folio);
@@ -3218,8 +3169,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		uffd_wp = pmd_uffd_wp(old_pmd);

 		VM_WARN_ON_FOLIO(!folio_ref_count(folio), folio);
-		VM_WARN_ON_FOLIO(!folio_test_anon(folio), folio);
-
 		/*
 		 * Without "freeze", we'll simply split the PMD, propagating the
 		 * PageAnonExclusive() flag for each PTE by setting it for
@@ -3236,17 +3185,82 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		 * See folio_try_share_anon_rmap_pmd(): invalidate PMD first.
 		 */
 		anon_exclusive = PageAnonExclusive(page);
-		if (freeze && anon_exclusive &&
-		    folio_try_share_anon_rmap_pmd(folio, page))
+		if (freeze && anon_exclusive && folio_try_share_anon_rmap_pmd(folio, page))
 			freeze = false;
 		if (!freeze) {
 			rmap_t rmap_flags = RMAP_NONE;
-
 			folio_ref_add(folio, HPAGE_PMD_NR - 1);
 			if (anon_exclusive)
 				rmap_flags |= RMAP_EXCLUSIVE;
-			folio_add_anon_rmap_ptes(folio, page, HPAGE_PMD_NR,
-						 vma, haddr, rmap_flags);
+			folio_add_anon_rmap_ptes(folio, page, HPAGE_PMD_NR, vma, haddr, rmap_flags);
+		}
+	} else { /* pmd not present */
+		folio = pmd_to_softleaf_folio(*pmd);
+		if (unlikely(!folio))
+			return;
+
+		/* Migration of File/Shmem THP */
+		if (unlikely(!folio_test_anon(folio))) {
+			old_pmd = pmdp_huge_clear_flush(vma, haddr, pmd);
+			if (arch_needs_pgtable_deposit())
+				zap_deposited_table(mm, pmd);
+			if (vma_is_special_huge(vma))
+				return;
+			add_mm_counter(mm, mm_counter_file(folio), -HPAGE_PMD_NR);
+			return;
+		}
+
+		/* Migration of Anon THP or Device Private*/
+		if (pmd_is_migration_entry(*pmd)) {
+			softleaf_t entry;
+
+			old_pmd = *pmd;
+			entry = softleaf_from_pmd(old_pmd);
+			page = softleaf_to_page(entry);
+			folio = page_folio(page);
+
+			soft_dirty = pmd_swp_soft_dirty(old_pmd);
+			uffd_wp = pmd_swp_uffd_wp(old_pmd);
+
+			write = softleaf_is_migration_write(entry);
+			if (PageAnon(page))
+				anon_exclusive = softleaf_is_migration_read_exclusive(entry);
+			young = softleaf_is_migration_young(entry);
+			dirty = softleaf_is_migration_dirty(entry);
+		} else if (pmd_is_device_private_entry(*pmd)) {
+			softleaf_t entry;
+
+			old_pmd = *pmd;
+			entry = softleaf_from_pmd(old_pmd);
+			page = softleaf_to_page(entry);
+
+			soft_dirty = pmd_swp_soft_dirty(old_pmd);
+			uffd_wp = pmd_swp_uffd_wp(old_pmd);
+
+			write = softleaf_is_device_private_write(entry);
+			anon_exclusive = PageAnonExclusive(page);
+
+			/*
+			* Device private THP should be treated the same as regular
+			* folios w.r.t anon exclusive handling. See the comments for
+			* folio handling and anon_exclusive below.
+			*/
+			if (freeze && anon_exclusive &&
+				folio_try_share_anon_rmap_pmd(folio, page))
+				freeze = false;
+			if (!freeze) {
+				rmap_t rmap_flags = RMAP_NONE;
+
+				folio_ref_add(folio, HPAGE_PMD_NR - 1);
+				if (anon_exclusive)
+					rmap_flags |= RMAP_EXCLUSIVE;
+
+				folio_add_anon_rmap_ptes(folio, page, HPAGE_PMD_NR,
+							vma, haddr, rmap_flags);
+			}
+		} else {
+			VM_WARN_ONCE(1, "unknown situation.");
+			return;
 		}
 	}

-- 
2.43.0


-- 
Yin Tirui




* Re: [PATCH] docs: Add overview and SLUB allocator sections to slab documentation
From: David Hildenbrand (Arm) @ 2026-04-19  8:35 UTC (permalink / raw)
  To: Matthew Wilcox, Lorenzo Stoakes
  Cc: Nick Huang, Vlastimil Babka, Harry Yoo, Andrew Morton,
	Jonathan Corbet, Hao Li, Christoph Lameter, David Rientjes,
	Roman Gushchin, Liam R . Howlett, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Shuah Khan, linux-mm, linux-doc,
	linux-kernel
In-Reply-To: <aeOuCH8ydw_yzdXZ@casper.infradead.org>

On 4/18/26 18:15, Matthew Wilcox wrote:
> On Sat, Apr 18, 2026 at 10:07:22AM +0100, Lorenzo Stoakes wrote:
>> On Sat, Apr 18, 2026 at 12:06:19AM +0000, Nick Huang wrote:
>>> - Add "Overview" section explaining the slab allocator's role and purpose
>>> - Document the three main slab allocator implementations (SLAB, SLUB, SLOB)
>>
>> The fact you're insanely wrong about the current state of slab only makes this
>> worse.
> 
> This is actually a new low.  We've always had to contend with people
> putting up outdated or just wrong information on web pages, and there's
> little we can do about it.  Witness all the outdated information about
> THP that's based on code that's been deleted for over a decade.
> 
> But now we've got AI trained on all this wrong/ out of date information,
> and, er, "enthusiasts" who are trying to change the correct information
> in the kernel to match what the deluded AI "thinks" should be true.
> 
> Let that sink in.
> 

I think we should make it very clear that we don't want doc updates from someone
who is neither a renowned expert in that area nor wants to become an expert in that
area (and has already discussed working on the docs with maintainers/experts).

Otherwise we'll have this same discussion over and over again.

diff --git a/Documentation/mm/index.rst b/Documentation/mm/index.rst
index 7aa2a88869083..8c5721001c8bb 100644
--- a/Documentation/mm/index.rst
+++ b/Documentation/mm/index.rst
@@ -7,6 +7,11 @@ of Linux.  If you are looking for advice on simply allocating memory,
  see the :ref:`memory_allocation`.  For controlling and tuning guides,
  see the :doc:`admin guide <../admin-guide/mm/index>`.

+A lot of documentation in this guide is still incomplete. If you are not
+a renowned expert in the specific area, but you want to contribute bigger
+chunks of documentation, talk to the respective MM experts first. LLM
+generated slop from non-experts will be rejected without further comments.
+
  .. toctree::
     :maxdepth: 1



LLMs are just the tip of the iceberg. It will all be development-by-review with
inexperienced contributors. And we are only willing to put in the effort to
teach contributors if the contributors are actually worth our time: i.e.,
not LLM kiddies, but people who will actually stick around and help the
subsystem in the long run.


The whole doc update stuff is similar to people just grepping for TODOs in the
kernel and then using an LLM to produce code they have no idea about.

It's the evolution of typo fixes: review load without any benefit.

-- 
Cheers,

David




* Re: [PATCH] mm: prepare anon_vma before swapin rmap
From: David Hildenbrand (Arm) @ 2026-04-19  8:19 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: ZhengYuan Huang, akpm, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	willy, linux-mm, linux-kernel, baijiaju1990, r33s3n6, zzzccc427
In-Reply-To: <aeNP7mtx878gynXS@lucifer>

On 4/18/26 11:35, Lorenzo Stoakes wrote:
> On Fri, Apr 17, 2026 at 01:57:59PM +0200, David Hildenbrand (Arm) wrote:
>> Maybe there was a scenario where we could have lost vma->anon_vma during
>> a merge, resulting in a swapped page in an anon_vma.
> 
> Unless there's a bug (and correct me if I'm misinterpreting), VMA merge requires
> vma->anon_vma to either be equal for merged adjacent VMAs, or one or the other
> VMA to have NULL vma->anon_vma, in which case we set vma->anon_vma in the merged
> VMA.

I think you didn't understand what I was trying to say.

The reporter claimed that it happened on 6.18. Nobody knows on which 
patch version (stable tree?).

I was wondering whether your fix

commit 3b617fd3d317bf9dd7e2c233e56eafef05734c9d
Author: Lorenzo Stoakes <ljs@kernel.org>
Date:   Mon Jan 5 20:11:49 2026 +0000

     mm/vma: enforce VMA fork limit on unfaulted,faulted mremap merge too

that went into 6.19 might have resolved this problem.

Your fix stated "allow an unfaulted/faulted merge with a VMA that has 
been forked", so I was wondering whether that could have resulted in a 
situation with anon folios without vma->anon_vma (losing vma->anon_vma 
during the merge).

But I am not sure if 879bca0a2c4f could have triggered that. Are you 
aware of other fixes that went into 6.19 that could have fixed such a 
scenario?

-- 
Cheers,

David



* Re: [PATCH] mm/migrate_device: Cleanup up PMD Checks and warnings
From: David Hildenbrand (Arm) @ 2026-04-19  8:12 UTC (permalink / raw)
  To: Sunny Patel
  Cc: akpm, apopple, byungchul, gourry, joshua.hahnjy, linux-kernel,
	linux-mm, matthew.brost, rakie.kim, sj, ying.huang, ziy
In-Reply-To: <20260418171806.11615-1-nueralspacetech@gmail.com>

On 4/18/26 19:18, Sunny Patel wrote:
> On 4/17/26 01:52, SeongJae Park wrote:
>> On Thu, 16 Apr 2026 21:44:15 +0200 "David Hildenbrand (Arm)" <david@kernel.org> wrote:
>>
>> [...]
>>>
>>> is_huge_zero_pmd() checks pmd_present(), so we didn't have a bug before.
>>>
>>> We could also do:
>>>
>>> if (is_huge_zero_pmd(*pmdp)) {
>>> 	flush = true;
>>> } else if (!pmd_none(*pmdp)) {
>>> 	goto unlock_abort;
>>> }
>>
>> Then we could even further remove the braces and reduce one more line, nice!
> 
> is_huge_zero_pmd() does not check pmd_present() in its current implementation, so an additional pmd_present() check is required.

I don't know what you mean. Here is the code in the tree:
static inline bool is_huge_zero_pmd(pmd_t pmd)
{
	return pmd_present(pmd) && is_huge_zero_pfn(pmd_pfn(pmd));
}
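
To spell out why the two-branch form is safe, here is a small userspace
model. struct mpmd, MODEL_HUGE_ZERO_PFN, enum collect_action and
model_check_pmd() are all hypothetical; only the shape of
is_huge_zero_pmd() and of the if/else-if chain is taken from this
thread. Because the zero-page test already requires a present entry, a
non-present, non-none entry (such as a migration entry) can only reach
the abort branch, even if its PFN bits happen to match.

```c
#include <assert.h>
#include <stdbool.h>

/* Arbitrary value chosen for the model only. */
#define MODEL_HUGE_ZERO_PFN	0x1000UL

/* Hypothetical stand-in for a pmd_t. */
struct mpmd {
	bool none;
	bool present;
	unsigned long pfn;
};

enum collect_action {
	COLLECT_INSTALL,	/* empty slot, install directly */
	COLLECT_INSTALL_FLUSH,	/* huge zero page, install and flush */
	COLLECT_ABORT,		/* anything else, back off */
};

/* Same shape as the in-tree helper: presence is checked first. */
static bool model_is_huge_zero_pmd(struct mpmd pmd)
{
	return pmd.present && pmd.pfn == MODEL_HUGE_ZERO_PFN;
}

static enum collect_action model_check_pmd(struct mpmd pmd)
{
	/* An entry cannot be both empty and present. */
	assert(!(pmd.none && pmd.present));

	if (model_is_huge_zero_pmd(pmd))
		return COLLECT_INSTALL_FLUSH;
	else if (!pmd.none)
		return COLLECT_ABORT;
	return COLLECT_INSTALL;
}
```

In this model, no separate pmd_present() test is needed in the caller;
the helper already provides it.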

-- 
Cheers,

David



* [GIT PULL] Additional MM updates for 7.1-rc1
From: Andrew Morton @ 2026-04-19  5:38 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, linux-mm, mm-commits


Linus, please merge this second batch of MM updates for the current
merge window, thanks.

I'm seeing no conflicts against mainline at this time.  If some do pop
up, they will hopefully be addressed in the first-round merge at

	https://lore.kernel.org/20260413214952.62836ac9df0eb348ee4aeb2b@linux-foundation.org


The following changes since commit 3bac01168982ec3e3bf87efdc1807c7933590a85:

  mm: fix deferred split queue races during migration (2026-04-05 13:53:47 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm tags/mm-stable-2026-04-18-02-14

for you to fetch changes up to 0b5e8d7999076ac3c490fc18376a404e2626abff:

  MAINTAINERS: add page cache reviewer (2026-04-18 00:10:56 -0700)

----------------------------------------------------------------
mm.git review status for linus..mm-stable

Everything:

Total patches:       121
Reviews/patch:       2.11
Reviewed rate:       90%

Excluding DAMON:

Total patches:       113
Reviews/patch:       2.25
Reviewed rate:       96%

- The 33 patch series "Eliminate Dying Memory Cgroup" from Qi Zheng and
  Muchun Song addresses the longstanding "dying memcg problem": a
  situation wherein a no-longer-used memory control group will hang around
  for an extended period pointlessly consuming memory.  The [0/N]
  changelog has a good overview of this work.

- The 3 patch series "fix unexpected type conversions and potential
  overflows" from Qi Zheng fixes a couple of potential 32-bit/64-bit
  issues which were identified during review of the "Eliminate Dying
  Memory Cgroup" series.

- The 6 patch series "kho: history: track previous kernel version and
  kexec boot count" from Breno Leitao uses Kexec Handover (KHO) to pass
  the previous kernel's version string and the number of kexec reboots
  since the last cold boot to the next kernel, and prints it at boot time.

- The 4 patch series "liveupdate: prevent double preservation" from
  Pasha Tatashin teaches LUO to avoid managing the same file across
  different active sessions.

- The 10 patch series "liveupdate: Fix module unloading and unregister
  API" from Pasha Tatashin addresses an issue with how LUO handles module
  reference counting and unregistration during module unloading.

- The 2 patch series "zswap pool per-CPU acomp_ctx simplifications" from
  Kanchana Sridhar simplifies and cleans up the zswap crypto compression
  handling and improves the lifecycle management of zswap pool's per-CPU
  acomp_ctx resources.

- The 2 patch series "mm/damon/core: fix damon_call()/damos_walk() vs
  kdamond exit race" from SeongJae Park addresses unlikely but possible
  leaks and deadlocks in damon_call() and damos_walk().

- The 2 patch series "mm/damon/core: validate damos_quota_goal->nid"
  from SeongJae Park fixes a couple of root-only wild pointer
  dereferences.

- The 2 patch series "Docs/admin-guide/mm/damon: warn commit_inputs vs
  other params race" from SeongJae Park updates the DAMON documentation to
  warn operators about potential races which can occur if the
  commit_inputs parameter is altered at the wrong time.

- The 3 patch series "Minor hmm_test fixes and cleanups" from Alistair
  Popple implements two bugfixes and a cleanup for the HMM kernel selftests.

- The 6 patch series "Modify memfd_luo code" from Chenghao Duan provides
  cleanups, simplifications and speedups in the memfd_luo code.

- The 4 patch series "mm, kvm: allow uffd support in guest_memfd" from
  Mike Rapoport enables support for userfaultfd in guest_memfd.

- The 6 patch series "selftests/mm: skip several tests when thp is not
  available" from Chunyu Hu fixes several issues in the selftests code
  which were causing breakage when the tests were run on CONFIG_THP=n
  kernels.

- The 2 patch series "mm/mprotect: micro-optimization work" from Pedro
  Falcato implements a couple of nice speedups for mprotect().

- The 3 patch series "MAINTAINERS: update KHO and LIVE UPDATE entries"
  from Pratyush Yadav reflects upcoming changes in the maintenance of KHO,
  LUO, memfd_luo, kexec, crash, kdump and probably other kexec-based
  things - they are being moved out of mm.git and into a new git tree.

----------------------------------------------------------------
Alistair Popple (3):
      lib: test_hmm: evict device pages on file close to avoid use-after-free
      selftests/mm: hmm-tests: don't hardcode THP size to 2MB
      lib: test_hmm: implement a device release method

Andrew Stellman (1):
      zram: reject unrecognized type= values in recompress_store()

Arnd Bergmann (1):
      mm/vmscan: avoid false-positive -Wuninitialized warning

Baolin Wang (1):
      mm: vmscan: fix dirty folios throttling on cgroup v1 for MGLRU

Breno Leitao (8):
      mm: kmemleak: add CONFIG_DEBUG_KMEMLEAK_VERBOSE build option
      kho: add size parameter to kho_add_subtree()
      kho: rename fdt parameter to blob in kho_add/remove_subtree()
      kho: persist blob size in KHO FDT
      kho: fix kho_in_debugfs_init() to handle non-FDT blobs
      kho: kexec-metadata: track previous kernel chain
      kho: document kexec-metadata tracking feature
      mm/vmstat: fix vmstat_shepherd double-scheduling vmstat_update

Cao Ruichuang (1):
      selftests: mm: skip charge_reserved_hugetlb without killall

Chenghao Duan (6):
      mm/memfd: use folio_nr_pages() for shmem inode accounting
      mm/memfd_luo: optimize shmem_recalc_inode calls in retrieve path
      mm/memfd_luo: remove unnecessary memset in zero-size memfd path
      mm/memfd_luo: use i_size_write() to set inode size during retrieve
      mm/memfd_luo: fix physical address conversion in put_folios cleanup
      mm/memfd_luo: remove folio from page cache when accounting fails

Chunyu Hu (6):
      selftests/mm/guard-regions: skip collapse test when thp not enabled
      selftests/mm: soft-dirty: skip two tests when thp is not available
      selftests/mm: move write_file helper to vm_util
      selftests/mm/vm_util: robust write_file()
      selftests/mm: split_huge_page_test: skip the test when thp is not available
      selftests/mm: transhuge_stress: skip the test when thp not available

Dave Young (1):
      MAINTAINERS: update Dave's kdump reviewer email address

David Carlier (1):
      mm/page_io: use sio->len for PSWPIN accounting in sio_read_complete()

Davidlohr Bueso (1):
      mm/migrate_device: remove dead migration entry check in migrate_vma_collect_huge_pmd()

Denis M. Karpov (1):
      userfaultfd: allow registration of ranges below mmap_min_addr

Hao Ge (1):
      mm/alloc_tag: clear codetag for pages allocated before page_ext initialization

Jackie Liu (2):
      mm/damon/stat: fix memory leak on damon_start() failure in damon_stat_start()
      mm/mempolicy: fix memory leaks in weighted_interleave_auto_store()

Jan Kara (1):
      MAINTAINERS: add page cache reviewer

Kanchana P. Sridhar (2):
      mm: zswap: remove redundant checks in zswap_cpu_comp_dead()
      mm: zswap: tie per-CPU acomp_ctx lifetime to the pool

Kevin Brodsky (1):
      docs: proc: document ProtectionKey in smaps

Li Wang (1):
      selftests/mm: skip hugetlb_dio tests when DIO alignment is incompatible

Lorenzo Stoakes (1):
      mm/vma: remove __vma_check_mmap_hook()

Lorenzo Stoakes (Oracle) (2):
      MAINTAINERS: update MGLRU entry to reflect current status
      tools/testing/selftests: add merge test for partial msealed range

Mike Rapoport (Microsoft) (11):
      userfaultfd: introduce mfill_copy_folio_locked() helper
      userfaultfd: introduce struct mfill_state
      userfaultfd: introduce mfill_establish_pmd() helper
      userfaultfd: introduce mfill_get_vma() and mfill_put_vma()
      userfaultfd: retry copying with locks dropped in mfill_atomic_pte_copy()
      userfaultfd: move vma_can_userfault out of line
      userfaultfd: introduce vm_uffd_ops
      shmem, userfaultfd: use a VMA callback to handle UFFDIO_CONTINUE
      userfaultfd: introduce vm_uffd_ops->alloc_folio()
      shmem, userfaultfd: implement shmem uffd operations using vm_uffd_ops
      userfaultfd: mfill_atomic(): remove retry logic

Muchun Song (24):
      mm: memcontrol: remove dead code of checking parent memory cgroup
      mm: workingset: use folio_lruvec() in workingset_refault()
      mm: rename unlock_page_lruvec_irq and its variants
      mm: vmscan: refactor move_folios_to_lru()
      mm: memcontrol: allocate object cgroup for non-kmem case
      mm: memcontrol: return root object cgroup for root memory cgroup
      mm: memcontrol: prevent memory cgroup release in get_mem_cgroup_from_folio()
      buffer: prevent memory cgroup release in folio_alloc_buffers()
      writeback: prevent memory cgroup release in writeback module
      mm: memcontrol: prevent memory cgroup release in count_memcg_folio_events()
      mm: page_io: prevent memory cgroup release in page_io module
      mm: migrate: prevent memory cgroup release in folio_migrate_mapping()
      mm: mglru: prevent memory cgroup release in mglru
      mm: memcontrol: prevent memory cgroup release in mem_cgroup_swap_full()
      mm: workingset: prevent memory cgroup release in lru_gen_eviction()
      mm: workingset: prevent lruvec release in workingset_refault()
      mm: zswap: prevent lruvec release in zswap_folio_swapin()
      mm: swap: prevent lruvec release in lru_gen_clear_refs()
      mm: workingset: prevent lruvec release in workingset_activation()
      mm: memcontrol: prepare for reparenting LRU pages for lruvec lock
      mm: memcontrol: eliminate the problem of dying memory cgroup for LRU folios
      mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance helpers
      mm/sparse: fix preinited section_mem_map clobbering on failure path
      mm/sparse: fix comment for section map alignment

Pasha Tatashin (14):
      liveupdate: prevent double management of files
      memfd: implement get_id for memfd_luo
      selftests: liveupdate: add test for double preservation
      liveupdate: safely print untrusted strings
      liveupdate: synchronize lazy initialization of FLB private state
      liveupdate: protect file handler list with rwsem
      liveupdate: protect FLB lists with luo_register_rwlock
      liveupdate: defer FLB module refcounting to active sessions
      liveupdate: remove luo_session_quiesce()
      liveupdate: auto unregister FLBs on file handler unregistration
      liveupdate: remove liveupdate_test_unregister()
      liveupdate: make unregister functions return void
      liveupdate: defer file handler module refcounting to active sessions
      MAINTAINERS: update kexec/kdump maintainers entries

Pedro Falcato (2):
      mm/mprotect: move softleaf code out of the main function
      mm/mprotect: special-case small folios when applying permissions

Pratyush Yadav (Google) (3):
      MAINTAINERS: update KHO and LIVE UPDATE maintainers
      MAINTAINERS: drop include/linux/kho/abi/ from KHO
      MAINTAINERS: drop include/linux/liveupdate from LIVE UPDATE

Qi Zheng (14):
      mm: vmscan: prepare for the refactoring the move_folios_to_lru()
      mm: thp: prevent memory cgroup release in folio_split_queue_lock{_irqsave}()
      mm: zswap: prevent memory cgroup release in zswap_compress()
      mm: do not open-code lruvec lock
      mm: vmscan: prepare for reparenting traditional LRU folios
      mm: vmscan: prepare for reparenting MGLRU folios
      mm: memcontrol: refactor memcg_reparent_objcgs()
      mm: workingset: use lruvec_lru_size() to get the number of lru pages
      mm: memcontrol: refactor mod_memcg_state() and mod_memcg_lruvec_state()
      mm: memcontrol: prepare for reparenting non-hierarchical stats
      mm: memcontrol: convert objcg to be per-memcg per-node type
      mm: memcontrol: correct the type of stats_updates to unsigned long
      mm: memcontrol: change val type to long in __mod_memcg_{lruvec_}state()
      mm: memcontrol: correct the nr_pages parameter type of mem_cgroup_update_lru_size()

SeongJae Park (7):
      mm/damon/core: fix damon_call() vs kdamond_fn() exit race
      mm/damon/core: fix damos_walk() vs kdamond_fn() exit race
      mm/damon/core: validate damos_quota_goal->nid for node_mem_{used,free}_bp
      mm/damon/core: validate damos_quota_goal->nid for node_memcg_{used,free}_bp
      mm/damon/core: use time_in_range_open() for damos quota window start
      Docs/admin-guide/mm/damon/reclaim: warn commit_inputs vs param updates race
      Docs/admin-guide/mm/damon/lru_sort: warn commit_inputs vs param updates race

Sergey Senozhatsky (1):
      zram: do not forget to endio for partial discard requests

Suren Baghdasaryan (1):
      mm/vmscan: prevent MGLRU reclaim from pinning address space

Thorsten Blum (1):
      mm/hugetlb: fix early boot crash on parameters without '=' separator

Zhaoyang Huang (1):
      mm: remove '!root_reclaim' checking in should_abort_scan()

 CREDITS                                            |   8 +
 Documentation/admin-guide/mm/damon/lru_sort.rst    |   4 +
 Documentation/admin-guide/mm/damon/reclaim.rst     |   4 +
 Documentation/admin-guide/mm/kho.rst               |  41 +-
 Documentation/filesystems/proc.rst                 |   4 +
 MAINTAINERS                                        |  29 +-
 drivers/block/zram/zram_drv.c                      |   5 +-
 fs/buffer.c                                        |   4 +-
 fs/fs-writeback.c                                  |  22 +-
 fs/userfaultfd.c                                   |   2 -
 include/linux/alloc_tag.h                          |   2 +
 include/linux/damon.h                              |   2 +
 include/linux/fs.h                                 |   9 +-
 include/linux/kexec_handover.h                     |  13 +-
 include/linux/kho/abi/kexec_handover.h             |  20 +-
 include/linux/kho/abi/kexec_metadata.h             |  46 ++
 include/linux/liveupdate.h                         |  17 +-
 include/linux/memcontrol.h                         | 191 +++---
 include/linux/mm.h                                 |   5 +
 include/linux/mm_inline.h                          |   6 +
 include/linux/mmzone.h                             |  42 +-
 include/linux/pgalloc_tag.h                        |   2 +-
 include/linux/sched.h                              |   2 +-
 include/linux/shmem_fs.h                           |  14 -
 include/linux/swap.h                               |  25 +-
 include/linux/userfaultfd_k.h                      |  73 ++-
 include/trace/events/memcg.h                       |  10 +-
 include/trace/events/writeback.h                   |   3 +
 kernel/cgroup/cgroup.c                             |   9 +-
 kernel/liveupdate/kexec_handover.c                 | 158 ++++-
 kernel/liveupdate/kexec_handover_debugfs.c         |  55 +-
 kernel/liveupdate/kexec_handover_internal.h        |  15 +-
 kernel/liveupdate/luo_core.c                       |  11 +-
 kernel/liveupdate/luo_file.c                       | 112 ++--
 kernel/liveupdate/luo_flb.c                        | 182 +++---
 kernel/liveupdate/luo_internal.h                   |   7 +-
 kernel/liveupdate/luo_session.c                    |  46 +-
 lib/alloc_tag.c                                    | 109 ++++
 lib/test_hmm.c                                     | 130 ++--
 lib/test_kho.c                                     |   5 +-
 lib/tests/liveupdate.c                             |  18 -
 mm/Kconfig.debug                                   |  11 +
 mm/compaction.c                                    |  43 +-
 mm/damon/core.c                                    |  88 +--
 mm/damon/stat.c                                    |   5 +-
 mm/huge_memory.c                                   |  22 +-
 mm/hugetlb.c                                       |  18 +
 mm/kmemleak.c                                      |   2 +-
 mm/memblock.c                                      |   4 +-
 mm/memcontrol-v1.c                                 |  31 +-
 mm/memcontrol-v1.h                                 |   7 +
 mm/memcontrol.c                                    | 632 ++++++++++++-------
 mm/memfd_luo.c                                     |  34 +-
 mm/mempolicy.c                                     |  23 +-
 mm/migrate.c                                       |   2 +
 mm/migrate_device.c                                |   6 -
 mm/mlock.c                                         |   2 +-
 mm/mprotect.c                                      | 218 ++++---
 mm/page_alloc.c                                    |  10 +-
 mm/page_io.c                                       |  10 +-
 mm/percpu.c                                        |   2 +-
 mm/shmem.c                                         | 176 +++---
 mm/shrinker.c                                      |   6 +-
 mm/sparse.c                                        |   1 -
 mm/swap.c                                          |  59 +-
 mm/userfaultfd.c                                   | 682 ++++++++++++---------
 mm/util.c                                          |  10 -
 mm/vmscan.c                                        | 303 ++++++---
 mm/vmstat.c                                        |   2 +-
 mm/workingset.c                                    |  30 +-
 mm/zswap.c                                         | 187 +++---
 tools/testing/selftests/liveupdate/liveupdate.c    |  41 ++
 .../selftests/mm/charge_reserved_hugetlb.sh        |   5 +
 tools/testing/selftests/mm/guard-regions.c         |   4 +
 tools/testing/selftests/mm/hmm-tests.c             |  83 +--
 tools/testing/selftests/mm/hugetlb_dio.c           |  91 ++-
 tools/testing/selftests/mm/merge.c                 |  88 +++
 tools/testing/selftests/mm/soft-dirty.c            |   4 +-
 tools/testing/selftests/mm/split_huge_page_test.c  |  19 +-
 tools/testing/selftests/mm/thp_settings.c          |  35 +-
 tools/testing/selftests/mm/thp_settings.h          |   1 -
 tools/testing/selftests/mm/transhuge-stress.c      |   4 +
 tools/testing/selftests/mm/vm_util.c               |  24 +
 tools/testing/selftests/mm/vm_util.h               |   2 +
 84 files changed, 2814 insertions(+), 1675 deletions(-)
 create mode 100644 include/linux/kho/abi/kexec_metadata.h



^ permalink raw reply

* [PATCH 2/2] Documentation/binfmt-misc.rst: Make "P" flag path desc more precise
From: Charlie Jenkins @ 2026-04-19  4:11 UTC (permalink / raw)
  To: Jonathan Corbet, Shuah Khan, Kees Cook
  Cc: linux-doc, linux-mm, linux-kernel, Charlie Jenkins
In-Reply-To: <20260419-binfmt_misc_doc_update_p-v1-0-757c12f33cc2@gmail.com>

The "full path" is not passed through to the interpreter, but rather
whatever path was passed to execve. It is the user's shell, not the
kernel, that converts the executable name "blah" into the full path
name "/usr/local/bin/blah". Clarify this in the documentation by noting
that the path is the one found in execve and by mentioning the shell's
role in locating "blah".

Signed-off-by: Charlie Jenkins <thecharlesjenkins@gmail.com>
---
 Documentation/admin-guide/binfmt-misc.rst | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/Documentation/admin-guide/binfmt-misc.rst b/Documentation/admin-guide/binfmt-misc.rst
index 2e2be2922ba6..aabf6599ac49 100644
--- a/Documentation/admin-guide/binfmt-misc.rst
+++ b/Documentation/admin-guide/binfmt-misc.rst
@@ -56,16 +56,16 @@ Here is what the fields mean:
 
       ``P`` - preserve-argv[0]
             Legacy behavior of binfmt_misc is to overwrite
-            the original argv[0] with the full path to the binary. When this
-            flag is included, binfmt_misc will add an argument to the argument
-            vector for this purpose, thus preserving the original ``argv[0]``.
-            e.g. If your interp is set to ``/bin/foo`` and you run ``blah``
-            (which is in ``/usr/local/bin``), then the kernel will execute
-            ``/bin/foo`` with ``argv[]`` set to ``["/bin/foo",
-            "/usr/local/bin/blah", "blah"]``.  The interp can be aware of this
-            by checking if bit 0 in AT_FLAGS in the auxilary vector is set to 1
-            so it can execute ``/usr/local/bin/blah`` with ``argv[]`` set to
-            ``["blah"]``.
+            the original argv[0] with the path to the binary found in execve.
+            When this flag is included, binfmt_misc will add an argument to the
+            argument vector for this purpose, thus preserving the original
+            ``argv[0]``. e.g. If your interp is set to ``/bin/foo`` and you run
+            ``blah`` (which your shell finds in ``/usr/local/bin``), then the
+            kernel will execute ``/bin/foo`` with ``argv[]`` set to
+            ``["/bin/foo", "/usr/local/bin/blah", "blah"]``.  The interp can be
+            aware of this by checking if bit 0 in AT_FLAGS in the auxilary
+            vector is set to 1 so it can execute ``/usr/local/bin/blah`` with
+            ``argv[]`` set to ``["blah"]``.
       ``O`` - open-binary
 	    Legacy behavior of binfmt_misc is to pass the full path
             of the binary to the interpreter as an argument. When this flag is

-- 
2.53.0



^ permalink raw reply related

* [PATCH 1/2] Documentation/binfmt-misc.rst: Include AT_FLAGS info in "P" flag description
From: Charlie Jenkins @ 2026-04-19  4:11 UTC (permalink / raw)
  To: Jonathan Corbet, Shuah Khan, Kees Cook
  Cc: linux-doc, linux-mm, linux-kernel, Charlie Jenkins
In-Reply-To: <20260419-binfmt_misc_doc_update_p-v1-0-757c12f33cc2@gmail.com>

Commit 2347961b11d4 ("binfmt_misc: pass binfmt_misc flags to the
interpreter") added a bit to AT_FLAGS in the aux vector to notify an
interpreter that the 'P' flag was set in binfmt-misc. Clarify that the
interpreter can detect the 'P' flag by checking this bit.
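
As a rough illustration of the check described above, an interpreter
could test the bit roughly as follows. This is only a sketch: the mask
name P_FLAG_MASK and both helper names are made up for this example, and
the bit position (bit 0) is taken from the documentation text in this
patch. getauxval() and AT_FLAGS come from the C library headers.

```c
#include <stdbool.h>
#include <elf.h>	/* AT_FLAGS */
#include <sys/auxv.h>	/* getauxval() */

/* Hypothetical name; bit 0 of AT_FLAGS per the documentation text. */
#define P_FLAG_MASK 0x1UL

/* Return true if the given AT_FLAGS value has the 'P' flag bit set. */
static bool p_flag_in(unsigned long at_flags)
{
	return (at_flags & P_FLAG_MASK) != 0;
}

/* Return true if this process was started by binfmt_misc with 'P'. */
static bool started_with_p_flag(void)
{
	return p_flag_in(getauxval(AT_FLAGS));
}
```

If started_with_p_flag() returns true, the interpreter can expect the
extra argument described in the documentation and use it as the
original argv[0] of the program it runs.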

Signed-off-by: Charlie Jenkins <thecharlesjenkins@gmail.com>
---
 Documentation/admin-guide/binfmt-misc.rst | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/binfmt-misc.rst b/Documentation/admin-guide/binfmt-misc.rst
index 59cd902e3549..2e2be2922ba6 100644
--- a/Documentation/admin-guide/binfmt-misc.rst
+++ b/Documentation/admin-guide/binfmt-misc.rst
@@ -61,9 +61,11 @@ Here is what the fields mean:
             vector for this purpose, thus preserving the original ``argv[0]``.
             e.g. If your interp is set to ``/bin/foo`` and you run ``blah``
             (which is in ``/usr/local/bin``), then the kernel will execute
-            ``/bin/foo`` with ``argv[]`` set to ``["/bin/foo", "/usr/local/bin/blah", "blah"]``.  The interp has to be aware of this so it can
-            execute ``/usr/local/bin/blah``
-            with ``argv[]`` set to ``["blah"]``.
+            ``/bin/foo`` with ``argv[]`` set to ``["/bin/foo",
+            "/usr/local/bin/blah", "blah"]``.  The interp can be aware of this
+            by checking if bit 0 in AT_FLAGS in the auxilary vector is set to 1
+            so it can execute ``/usr/local/bin/blah`` with ``argv[]`` set to
+            ``["blah"]``.
       ``O`` - open-binary
 	    Legacy behavior of binfmt_misc is to pass the full path
             of the binary to the interpreter as an argument. When this flag is

-- 
2.53.0



^ permalink raw reply related

* [PATCH 0/2] Documentation/binfmt-misc.rst: Clarify "P" flag
From: Charlie Jenkins @ 2026-04-19  4:11 UTC (permalink / raw)
  To: Jonathan Corbet, Shuah Khan, Kees Cook
  Cc: linux-doc, linux-mm, linux-kernel, Charlie Jenkins

Improve the description of the "P" flag to explain that the interpreter
gets the path that was passed to execve, which is not necessarily the
full path, and document that AT_FLAGS can be read to see if the "P" flag
is set.

Signed-off-by: Charlie Jenkins <thecharlesjenkins@gmail.com>
---
Charlie Jenkins (2):
      Documentation/binfmt-misc.rst: Include AT_FLAGS info in "P" flag description
      Documentation/binfmt-misc.rst: Make "P" flag path desc more precise

 Documentation/admin-guide/binfmt-misc.rst | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)
---
base-commit: 028ef9c96e96197026887c0f092424679298aae8
change-id: ${change-id}

- Charlie



^ permalink raw reply

* Re: [PATCH v2 00/19] tracepoint: Avoid double static_branch evaluation at guarded call sites
From: Steven Rostedt @ 2026-04-18 23:04 UTC (permalink / raw)
  To: Vineeth Pillai (Google)
  Cc: Peter Zijlstra, Dmitry Ilvokhin, Masami Hiramatsu,
	Mathieu Desnoyers, Ingo Molnar, Jens Axboe, io-uring,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Alexei Starovoitov, Daniel Borkmann, Marcelo Ricardo Leitner,
	Xin Long, Jon Maloy, Aaron Conole, Eelco Chaudron, Ilya Maximets,
	netdev, bpf, linux-sctp, tipc-discussion, dev, Jiri Pirko,
	Oded Gabbay, Koby Elbaz, dri-devel, Rafael J. Wysocki,
	Viresh Kumar, Gautham R. Shenoy, Huang Rui, Mario Limonciello,
	Len Brown, Srinivas Pandruvada, linux-pm, MyungJoo Ham,
	Kyungmin Park, Chanwoo Choi, Christian König, Sumit Semwal,
	linaro-mm-sig, Eddie James, Andrew Jeffery, Joel Stanley,
	linux-fsi, David Airlie, Simona Vetter, Alex Deucher,
	Danilo Krummrich, Matthew Brost, Philipp Stanner, Harry Wentland,
	Leo Li, amd-gfx, Jiri Kosina, Benjamin Tissoires, linux-input,
	Wolfram Sang, linux-i2c, Mark Brown, Michael Hennerich,
	Nuno Sá, linux-spi, James E.J. Bottomley, Martin K. Petersen,
	linux-scsi, Chris Mason, David Sterba, linux-btrfs,
	Thomas Gleixner, Andrew Morton, SeongJae Park, linux-mm,
	Borislav Petkov, Dave Hansen, x86, linux-trace-kernel,
	linux-kernel
In-Reply-To: <20260323160052.17528-1-vineeth@bitbyteword.org>

On Mon, 23 Mar 2026 12:00:19 -0400
"Vineeth Pillai (Google)" <vineeth@bitbyteword.org> wrote:

>   if (trace_foo_enabled() && cond)
>       trace_call__foo(args);   /* calls __do_trace_foo() directly */

Hi Vineeth,

Could you rebase this series on top of 7.1-rc1 when it comes out?
Several of these patches were accepted already. Obviously drop those.
They were the patches that added the feature, and any where the
maintainer acked the patch.

Now that the feature has been accepted, if you post the patch series
again after 7.1-rc1 with all the patches that haven't been accepted
yet, then the maintainers can simply take them directly. As the feature
is now accepted, there's no dependency on it, and they don't need to go
through the tracing tree.

Thanks,

-- Steve


^ permalink raw reply

* Re: [LSF/MM/BPF TOPIC] Filesystem inode reclaim
From: Jeff Layton @ 2026-04-18 22:43 UTC (permalink / raw)
  To: Jan Kara; +Cc: Darrick J. Wong, linux-fsdevel, linux-mm, Matthew Wilcox, lsf-pc
In-Reply-To: <vpgymg4vq7fv2iebfjebxbufjvo2yre64lcwjgsbyivu2pxmpp@zhjplhrjayc5>

On Fri, 2026-04-10 at 11:43 +0200, Jan Kara wrote:
> On Thu 09-04-26 13:37:17, Jeff Layton wrote:
> > On Thu, 2026-04-09 at 09:12 -0700, Darrick J. Wong wrote:
> > > On Thu, Apr 09, 2026 at 11:16:44AM +0200, Jan Kara wrote:
> > > > Hello!
> > > > 
> > > > This is a recurring topic Matthew has been kicking forward for the last
> > > > year so let me maybe offer a fs-person point of view on the problem and
> > > > possible solutions. The problem is very simple: When a filesystem (ext4,
> > > > btrfs, vfat) is about to reclaim an inode, it sometimes needs to perform a
> > > > complex cleanup - like trimming of preallocated blocks beyond end of file,
> > > > making sure journalling machinery is done with the inode, etc.. This may
> > > > require reading metadata into memory which requires memory allocations and
> > > > as inode eviction cannot fail, these are effectively GFP_NOFAIL
> > > > allocations (and there are other reasons why it would be very difficult to
> > > > make some of these required allocations in the filesystems failable).
> > > > 
> > > > GFP_NOFAIL allocation from reclaim context (be it kswapd or direct reclaim)
> > > > trigger warnings - and for a good reason as forward progress isn't
> > > > guaranteed. Also it leaves a bad taste that we are performing sometimes
> > > > rather long running operations blocking on IO from reclaim context thus
> > > > stalling reclaim for substantial amount of time to free 1k worth of slab
> > > > cache.
> > > > 
> > > > I have been mulling over possible solutions since I don't think each
> > > > filesystem should be inventing a complex inode lifetime management scheme
> > > > as XFS has invented to solve these issues. Here's what I think we could do:
> > > > 
> > > > 1) Filesystems will be required to mark inodes that have non-trivial
> > > > cleanup work to do on reclaim with an inode flag I_RECLAIM_HARD (or
> > > > whatever :)). Usually I expect this to happen on first inode modification
> > > > or so. This will require some per-fs work but it shouldn't be that
> > > > difficult and filesystems can be adapted one-by-one as they decide to
> > > > address these warnings from reclaim.
> > > > 
> > > > 2) Inodes without I_RECLAIM_HARD will be reclaimed as usual directly from
> > > > kswapd / direct reclaim. I'm keeping this variant of inode reclaim for
> > > > performance reasons. I expect this to be a significant portion of inodes
> > > > on average and in particular for some workloads which scan a lot of inodes
> > > > (find through the whole fs or similar) the efficiency of inode reclaim is
> > > > one of the determining factors for their performance.
> > > > 
> > > > 3) Inodes with I_RECLAIM_HARD will be moved by the shrinker to a separate
> > > > per-sb list s_hard_reclaim_inodes and we'll queue work (per-sb work struct)
> > > > to process them.
> > > > 
> > > > 4) The work will walk s_hard_reclaim_inodes list and call evict() for each
> > > > inode, doing the hard work.
> > > > 
> > > > This way, kswapd / direct reclaim doesn't wait for hard to reclaim inodes
> > > > and they can work on freeing memory needed for freeing of hard to reclaim
> > > > inodes. So warnings about GFP_NOFAIL allocations aren't only papered over,
> > > > they should really be addressed.
> > > 
> > > This more or less sounds fine to me.
> > > 
> > > > One possible concern is that s_hard_reclaim_inodes list could grow out of
> > > > control for some workloads (in particular because there could be multiple
> > > > CPUs generating hard to reclaim inodes while the cleanup would be
> > > > single-threaded). This could be addressed by tracking number of inodes in
> > > > that list and if it grows over some limit, we could start throttling
> > > > processes when setting I_RECLAIM_HARD inode flag.
> > > 
> > > <nod> XFS does that, see xfs_inodegc_want_flush_work in
> > > xfs_inodegc_queue.
> > > 
> > > > There's also a simpler approach to this problem but with more radical
> > > > changes to behavior. For example getting rid of inode LRU completely -
> > > > inodes without dentries referencing them anymore should be rare and it
> > > > isn't very useful to cache them. So we can always drop inodes on last
> > > > iput() (as we currently do for example for unlinked inodes). But I have a
> > > > nagging feeling that somebody is depending on inode LRU somewhere - I'd
> > > > like poll the collective knowledge of what could possibly go wrong here :)
> > > 
> > > NFS, possibly? ;)
> > > 
> > 
> > NFS indeed.
> > 
> > Bear in mind that the NFS may fail d_revalidate checks on child
> > dentries when a parent directory is changed on the server. I imagine
> > some workloads might see a performance hit if a large file's dentry has
> > to be discarded and looked back up because we suddenly threw away a
> > bunch of useful data in the pagecache.
> 
> Thanks for filling in details! I was having vague memories that NFS was
> relying on inodes with 0 refcount not getting immediately evicted. I'll
> keep that in mind when thinking about solutions.
> 

In my mind, there are two classes of filesystems when it comes to the
dcache: ones where the kernel has perfect knowledge of the directory
tree, and ones where it does not. In general, in-memory and local
filesystems are the former, and distributed/network filesystems are the
latter.

It might make sense to allow filesystems to declare that they are of
the type where the kernel has perfect knowledge, and use that to do
this kind of optimization. It wouldn't get rid of the LRU entirely, but
could allow you to reduce the size significantly.
-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply

* [RFC PATCH v2.1 3/3] mm/damon/stat: detect and use fresh enabled value
From: SeongJae Park @ 2026-04-18 22:27 UTC (permalink / raw)
  Cc: SeongJae Park, # 6 . 17 . x, Andrew Morton, damon, linux-kernel,
	linux-mm
In-Reply-To: <20260418222758.39795-1-sj@kernel.org>

DAMON_STAT updates 'enabled' parameter value, which represents the
running status of its kdamond, when the user explicitly requests
start/stop of the kdamond.  The kdamond can, however, be stopped even if
the user did not explicitly request the stop, if
ctx->regions_score_histogram allocation fails at the beginning of the
execution of the kdamond.  Hence, if the kdamond is stopped by the
allocation failure, the value of the parameter can be stale.

Users could see the stale value and be confused.  The problem will only
rarely happen in real and common setups because the allocation is
arguably too small to fail.  Also, unlike the similar bugs that are now
fixed in DAMON_RECLAIM and DAMON_LRU_SORT, kdamond can be restarted in
this case, because DAMON_STAT force-updates the enabled parameter value
for user inputs.  The bug is a bug, though.

The issue stems from the fact that there are multiple events that can
change the status, and following all the events is challenging.
Dynamically detect and use the fresh status for the parameters when
those are requested.

The issue was discovered [1] by Sashiko.

[1] https://lore.kernel.org/20260416040602.88665-1-sj@kernel.org

Fixes: 369c415e6073 ("mm/damon: introduce DAMON_STAT module")
Cc: <stable@vger.kernel.org> # 6.17.x
Signed-off-by: SeongJae Park <sj@kernel.org>
---
 mm/damon/stat.c | 30 ++++++++++++++++++++----------
 1 file changed, 20 insertions(+), 10 deletions(-)

diff --git a/mm/damon/stat.c b/mm/damon/stat.c
index 99ba346f9e325..3951b762cbddf 100644
--- a/mm/damon/stat.c
+++ b/mm/damon/stat.c
@@ -19,14 +19,17 @@
 static int damon_stat_enabled_store(
 		const char *val, const struct kernel_param *kp);
 
+static int damon_stat_enabled_load(char *buffer,
+		const struct kernel_param *kp);
+
 static const struct kernel_param_ops enabled_param_ops = {
 	.set = damon_stat_enabled_store,
-	.get = param_get_bool,
+	.get = damon_stat_enabled_load,
 };
 
 static bool enabled __read_mostly = IS_ENABLED(
 	CONFIG_DAMON_STAT_ENABLED_DEFAULT);
-module_param_cb(enabled, &enabled_param_ops, &enabled, 0600);
+module_param_cb(enabled, &enabled_param_ops, NULL, 0600);
 MODULE_PARM_DESC(enabled, "Enable of disable DAMON_STAT");
 
 static unsigned long estimated_memory_bandwidth __read_mostly;
@@ -273,17 +276,23 @@ static void damon_stat_stop(void)
 	damon_stat_context = NULL;
 }
 
+static bool damon_stat_enabled(void)
+{
+	if (!damon_stat_context)
+		return false;
+	return damon_is_running(damon_stat_context);
+}
+
 static int damon_stat_enabled_store(
 		const char *val, const struct kernel_param *kp)
 {
-	bool is_enabled = enabled;
 	int err;
 
 	err = kstrtobool(val, &enabled);
 	if (err)
 		return err;
 
-	if (is_enabled == enabled)
+	if (damon_stat_enabled() == enabled)
 		return 0;
 
 	if (!damon_initialized())
@@ -293,16 +302,17 @@ static int damon_stat_enabled_store(
 		 */
 		return 0;
 
-	if (enabled) {
-		err = damon_stat_start();
-		if (err)
-			enabled = false;
-		return err;
-	}
+	if (enabled)
+		return damon_stat_start();
 	damon_stat_stop();
 	return 0;
 }
 
+static int damon_stat_enabled_load(char *buffer, const struct kernel_param *kp)
+{
+	return sprintf(buffer, "%c\n", damon_stat_enabled() ? 'Y' : 'N');
+}
+
 static int __init damon_stat_init(void)
 {
 	int err = 0;
-- 
2.47.3


^ permalink raw reply related

* [RFC PATCH v2.1 1/3] mm/damon/reclaim: detect and use fresh enabled and kdamond_pid values
From: SeongJae Park @ 2026-04-18 22:27 UTC (permalink / raw)
  Cc: SeongJae Park, # 5 . 19 . x, Andrew Morton, damon, linux-kernel,
	linux-mm, Liew Rui Yan
In-Reply-To: <20260418222758.39795-1-sj@kernel.org>

DAMON_RECLAIM updates 'enabled' and 'kdamond_pid' parameter values,
which represent the running status of its kdamond, when the user
explicitly requests start/stop of the kdamond.  The kdamond can,
however, also be stopped without an explicit user request, in the
following three events.

1. ctx->regions_score_histogram allocation failure at beginning of the
   execution,
2. damon_commit_ctx() failure due to invalid user input, and
3. damon_commit_ctx() failure due to its internal allocation failures.

Hence, if the kdamond is stopped by any of the above three events, the
values of the status parameters can be stale.  Users could see the stale
values and be confused.  This is already bad, but the real consequence
is worse.  DAMON_RECLAIM avoids unnecessary damon_start() and
damon_stop() calls based on the 'enabled' parameter value.  And the
update of 'enabled' parameter value depends on the damon_start() and
damon_stop() call results.  Hence, once the kdamond has been stopped by
the unintentional events, the user cannot restart the kdamond before the
system reboot.  For example, the issue can be reproduced via the steps
below.

    # cd /sys/module/damon_reclaim/parameters
    #
    # # start DAMON_RECLAIM
    # echo Y > enabled
    # ps -ef | grep kdamond
    root         806       2  0 17:53 ?        00:00:00 [kdamond.0]
    root         808     803  0 17:53 pts/4    00:00:00 grep kdamond
    #
    # # commit wrong input to stop kdamond without explicit stop request
    # echo 3 > addr_unit
    # echo Y > commit_inputs
    bash: echo: write error: Invalid argument
    #
    # # confirm kdamond is stopped
    # ps -ef | grep kdamond
    root         811     803  0 17:53 pts/4    00:00:00 grep kdamond
    #
    # # users can now see stale status
    # cat enabled
    Y
    # cat kdamond_pid
    806
    #
    # # even after fixing the wrong parameter,
    # # kdamond cannot be restarted.
    # echo 1 > addr_unit
    # echo Y > enabled
    # ps -ef | grep kdamond
    root         815     803  0 17:54 pts/4    00:00:00 grep kdamond

The problem will only rarely happen in real and common setups for the
following reasons.  The allocation failures are unlikely in such setups
since those allocations are arguably too small to fail.  Also sane users
on real production environments may not commit wrong input parameters.
But once it happens, the consequence is quite bad.  And the bug is a
bug.

The issue stems from the fact that there are multiple events that can
change the status, and following all the events is challenging.
Dynamically detect and use the fresh status for the parameters when
those are requested.

Fixes: e035c280f6df ("mm/damon/reclaim: support online inputs update")
Cc: <stable@vger.kernel.org> # 5.19.x
Co-developed-by: Liew Rui Yan <aethernet65535@gmail.com>
Signed-off-by: Liew Rui Yan <aethernet65535@gmail.com>
Signed-off-by: SeongJae Park <sj@kernel.org>
---
 mm/damon/reclaim.c | 85 ++++++++++++++++++++++++++++++----------------
 1 file changed, 55 insertions(+), 30 deletions(-)

diff --git a/mm/damon/reclaim.c b/mm/damon/reclaim.c
index 86da147786583..fe7fce26cf6ce 100644
--- a/mm/damon/reclaim.c
+++ b/mm/damon/reclaim.c
@@ -144,15 +144,6 @@ static unsigned long addr_unit __read_mostly = 1;
 static bool skip_anon __read_mostly;
 module_param(skip_anon, bool, 0600);
 
-/*
- * PID of the DAMON thread
- *
- * If DAMON_RECLAIM is enabled, this becomes the PID of the worker thread.
- * Else, -1.
- */
-static int kdamond_pid __read_mostly = -1;
-module_param(kdamond_pid, int, 0400);
-
 static struct damos_stat damon_reclaim_stat;
 DEFINE_DAMON_MODULES_DAMOS_STATS_PARAMS(damon_reclaim_stat,
 		reclaim_tried_regions, reclaimed_regions, quota_exceeds);
@@ -288,12 +279,8 @@ static int damon_reclaim_turn(bool on)
 {
 	int err;
 
-	if (!on) {
-		err = damon_stop(&ctx, 1);
-		if (!err)
-			kdamond_pid = -1;
-		return err;
-	}
+	if (!on)
+		return damon_stop(&ctx, 1);
 
 	err = damon_reclaim_apply_parameters();
 	if (err)
@@ -302,9 +289,6 @@ static int damon_reclaim_turn(bool on)
 	err = damon_start(&ctx, 1, true);
 	if (err)
 		return err;
-	kdamond_pid = damon_kdamond_pid(ctx);
-	if (kdamond_pid < 0)
-		return kdamond_pid;
 	return damon_call(ctx, &call_control);
 }
 
@@ -332,42 +316,83 @@ module_param_cb(addr_unit, &addr_unit_param_ops, &addr_unit, 0600);
 MODULE_PARM_DESC(addr_unit,
 	"Scale factor for DAMON_RECLAIM to ops address conversion (default: 1)");
 
+static bool damon_reclaim_enabled(void)
+{
+	if (!ctx)
+		return false;
+	return damon_is_running(ctx);
+}
+
 static int damon_reclaim_enabled_store(const char *val,
 		const struct kernel_param *kp)
 {
-	bool is_enabled = enabled;
-	bool enable;
 	int err;
 
-	err = kstrtobool(val, &enable);
+	err = kstrtobool(val, &enabled);
 	if (err)
 		return err;
 
-	if (is_enabled == enable)
+	if (damon_reclaim_enabled() == enabled)
 		return 0;
 
 	/* Called before init function.  The function will handle this. */
 	if (!damon_initialized())
-		goto set_param_out;
+		return 0;
 
-	err = damon_reclaim_turn(enable);
-	if (err)
-		return err;
+	return damon_reclaim_turn(enabled);
+}
 
-set_param_out:
-	enabled = enable;
-	return err;
+static int damon_reclaim_enabled_load(char *buffer,
+		const struct kernel_param *kp)
+{
+	return sprintf(buffer, "%c\n", damon_reclaim_enabled() ? 'Y' : 'N');
 }
 
 static const struct kernel_param_ops enabled_param_ops = {
 	.set = damon_reclaim_enabled_store,
-	.get = param_get_bool,
+	.get = damon_reclaim_enabled_load,
 };
 
 module_param_cb(enabled, &enabled_param_ops, &enabled, 0600);
 MODULE_PARM_DESC(enabled,
 	"Enable or disable DAMON_RECLAIM (default: disabled)");
 
+static int damon_reclaim_kdamond_pid_store(const char *val,
+		const struct kernel_param *kp)
+{
+	/*
+	 * kdamond_pid is read-only, but kernel command line could write it.
+	 * Do nothing here.
+	 */
+	return 0;
+}
+
+static int damon_reclaim_kdamond_pid_load(char *buffer,
+		const struct kernel_param *kp)
+{
+	int kdamond_pid = -1;
+
+	if (ctx) {
+		kdamond_pid = damon_kdamond_pid(ctx);
+		if (kdamond_pid < 0)
+			kdamond_pid = -1;
+	}
+	return sprintf(buffer, "%d\n", kdamond_pid);
+}
+
+static const struct kernel_param_ops kdamond_pid_param_ops = {
+	.set = damon_reclaim_kdamond_pid_store,
+	.get = damon_reclaim_kdamond_pid_load,
+};
+
+/*
+ * PID of the DAMON thread
+ *
+ * If DAMON_RECLAIM is enabled, this becomes the PID of the worker thread.
+ * Else, -1.
+ */
+module_param_cb(kdamond_pid, &kdamond_pid_param_ops, NULL, 0400);
+
 static int __init damon_reclaim_init(void)
 {
 	int err;
-- 
2.47.3


^ permalink raw reply related

* [RFC PATCH v2.1 2/3] mm/damon/lru_sort: detect and use fresh enabled and kdamond_pid values
From: SeongJae Park @ 2026-04-18 22:27 UTC (permalink / raw)
  Cc: SeongJae Park, # 6 . 0 . x, Andrew Morton, damon, linux-kernel,
	linux-mm, Liew Rui Yan
In-Reply-To: <20260418222758.39795-1-sj@kernel.org>

DAMON_LRU_SORT updates 'enabled' and 'kdamond_pid' parameter values,
which represent the running status of its kdamond, when the user
explicitly requests start/stop of the kdamond.  The kdamond can,
however, also be stopped without an explicit user request, in the
following three events.

1. ctx->regions_score_histogram allocation failure at beginning of the
   execution,
2. damon_commit_ctx() failure due to invalid user input, and
3. damon_commit_ctx() failure due to its internal allocation failures.

Hence, if the kdamond is stopped by any of the above three events, the
values of the status parameters can be stale.  Users could see the stale
values and be confused.  This is already bad, but the real consequence
is worse.  DAMON_LRU_SORT avoids unnecessary damon_start() and
damon_stop() calls based on the 'enabled' parameter value.  And the
update of the 'enabled' parameter value depends on the damon_start() and
damon_stop() call results.  Hence, once the kdamond has been stopped by
the unintentional events, the user cannot restart the kdamond before the
system reboot.  For example, the issue can be reproduced via the below
steps.

    # cd /sys/module/damon_lru_sort/parameters
    #
    # # start DAMON_LRU_SORT
    # echo Y > enabled
    # ps -ef | grep kdamond
    root         806       2  0 17:53 ?        00:00:00 [kdamond.0]
    root         808     803  0 17:53 pts/4    00:00:00 grep kdamond
    #
    # # commit wrong input to stop kdamond without explicit stop request
    # echo 3 > addr_unit
    # echo Y > commit_inputs
    bash: echo: write error: Invalid argument
    #
    # # confirm kdamond is stopped
    # ps -ef | grep kdamond
    root         811     803  0 17:53 pts/4    00:00:00 grep kdamond
    #
    # # users can now see the stale status
    # cat enabled
    Y
    # cat kdamond_pid
    806
    #
    # # even after fixing the wrong parameter,
    # # kdamond cannot be restarted.
    # echo 1 > addr_unit
    # echo Y > enabled
    # ps -ef | grep kdamond
    root         815     803  0 17:54 pts/4    00:00:00 grep kdamond

The problem will only rarely happen in real and common setups for the
following reasons.  The allocation failures are unlikely in such setups
since those allocations are arguably too small to fail.  Also sane users
on real production environments may not commit wrong input parameters.
But once it happens, the consequence is quite bad.  And the bug is a
bug.

The issue stems from the fact that there are multiple events that can
change the status, and following all the events is challenging.
Dynamically detect and use the fresh status for the parameters when
those are requested.

Fixes: 40e983cca927 ("mm/damon: introduce DAMON-based LRU-lists Sorting")
Cc: <stable@vger.kernel.org> # 6.0.x
Co-developed-by: Liew Rui Yan <aethernet65535@gmail.com>
Signed-off-by: Liew Rui Yan <aethernet65535@gmail.com>
Signed-off-by: SeongJae Park <sj@kernel.org>
---
 mm/damon/lru_sort.c | 85 +++++++++++++++++++++++++++++----------------
 1 file changed, 55 insertions(+), 30 deletions(-)

diff --git a/mm/damon/lru_sort.c b/mm/damon/lru_sort.c
index 554559d729760..8494040b1ee48 100644
--- a/mm/damon/lru_sort.c
+++ b/mm/damon/lru_sort.c
@@ -161,15 +161,6 @@ module_param(monitor_region_end, ulong, 0600);
  */
 static unsigned long addr_unit __read_mostly = 1;
 
-/*
- * PID of the DAMON thread
- *
- * If DAMON_LRU_SORT is enabled, this becomes the PID of the worker thread.
- * Else, -1.
- */
-static int kdamond_pid __read_mostly = -1;
-module_param(kdamond_pid, int, 0400);
-
 static struct damos_stat damon_lru_sort_hot_stat;
 DEFINE_DAMON_MODULES_DAMOS_STATS_PARAMS(damon_lru_sort_hot_stat,
 		lru_sort_tried_hot_regions, lru_sorted_hot_regions,
@@ -386,12 +377,8 @@ static int damon_lru_sort_turn(bool on)
 {
 	int err;
 
-	if (!on) {
-		err = damon_stop(&ctx, 1);
-		if (!err)
-			kdamond_pid = -1;
-		return err;
-	}
+	if (!on)
+		return damon_stop(&ctx, 1);
 
 	err = damon_lru_sort_apply_parameters();
 	if (err)
@@ -400,9 +387,6 @@ static int damon_lru_sort_turn(bool on)
 	err = damon_start(&ctx, 1, true);
 	if (err)
 		return err;
-	kdamond_pid = damon_kdamond_pid(ctx);
-	if (kdamond_pid < 0)
-		return kdamond_pid;
 	return damon_call(ctx, &call_control);
 }
 
@@ -430,42 +414,83 @@ module_param_cb(addr_unit, &addr_unit_param_ops, &addr_unit, 0600);
 MODULE_PARM_DESC(addr_unit,
 	"Scale factor for DAMON_LRU_SORT to ops address conversion (default: 1)");
 
+static bool damon_lru_sort_enabled(void)
+{
+	if (!ctx)
+		return false;
+	return damon_is_running(ctx);
+}
+
 static int damon_lru_sort_enabled_store(const char *val,
 		const struct kernel_param *kp)
 {
-	bool is_enabled = enabled;
-	bool enable;
 	int err;
 
-	err = kstrtobool(val, &enable);
+	err = kstrtobool(val, &enabled);
 	if (err)
 		return err;
 
-	if (is_enabled == enable)
+	if (damon_lru_sort_enabled() == enabled)
 		return 0;
 
 	/* Called before init function.  The function will handle this. */
 	if (!damon_initialized())
-		goto set_param_out;
+		return 0;
 
-	err = damon_lru_sort_turn(enable);
-	if (err)
-		return err;
+	return damon_lru_sort_turn(enabled);
+}
 
-set_param_out:
-	enabled = enable;
-	return err;
+static int damon_lru_sort_enabled_load(char *buffer,
+		const struct kernel_param *kp)
+{
+	return sprintf(buffer, "%c\n", damon_lru_sort_enabled() ? 'Y' : 'N');
 }
 
 static const struct kernel_param_ops enabled_param_ops = {
 	.set = damon_lru_sort_enabled_store,
-	.get = param_get_bool,
+	.get = damon_lru_sort_enabled_load,
 };
 
 module_param_cb(enabled, &enabled_param_ops, &enabled, 0600);
 MODULE_PARM_DESC(enabled,
 	"Enable or disable DAMON_LRU_SORT (default: disabled)");
 
+static int damon_lru_sort_kdamond_pid_store(const char *val,
+		const struct kernel_param *kp)
+{
+	/*
+	 * kdamond_pid is read-only, but kernel command line could write it.
+	 * Do nothing here.
+	 */
+	return 0;
+}
+
+static int damon_lru_sort_kdamond_pid_load(char *buffer,
+		const struct kernel_param *kp)
+{
+	int kdamond_pid = -1;
+
+	if (ctx) {
+		kdamond_pid = damon_kdamond_pid(ctx);
+		if (kdamond_pid < 0)
+			kdamond_pid = -1;
+	}
+	return sprintf(buffer, "%d\n", kdamond_pid);
+}
+
+static const struct kernel_param_ops kdamond_pid_param_ops = {
+	.set = damon_lru_sort_kdamond_pid_store,
+	.get = damon_lru_sort_kdamond_pid_load,
+};
+
+/*
+ * PID of the DAMON thread
+ *
+ * If DAMON_LRU_SORT is enabled, this becomes the PID of the worker thread.
+ * Else, -1.
+ */
+module_param_cb(kdamond_pid, &kdamond_pid_param_ops, NULL, 0400);
+
 static int __init damon_lru_sort_init(void)
 {
 	int err;
-- 
2.47.3


^ permalink raw reply related

* [RFC PATCH v2.1 0/3] mm/damon/modules: detect and use fresh status
From: SeongJae Park @ 2026-04-18 22:27 UTC (permalink / raw)
  Cc: SeongJae Park, # 5 . 19 . x, Andrew Morton, damon, linux-kernel,
	linux-mm

DAMON modules including DAMON_RECLAIM, DAMON_LRU_SORT and DAMON_STAT
commonly expose the kdamond running status via their parameters.  Under
certain scenarios including wrong user inputs and memory allocation
failures, those parameter values can be stale.  It can confuse users.
For DAMON_RECLAIM and DAMON_LRU_SORT, it even makes the kdamond unable
to be restarted before the system reboot.

The problem comes from the fact that there are multiple events for the
status changes and it is difficult to follow all the scenarios.  Fix
the issue by detecting and using the status on demand, instead of using
a cached status that is difficult to keep up to date.
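
The difference between the cached and the on-demand approach can be
illustrated with a small userspace sketch.  Everything in it (struct
worker, cached_enabled, the request_enable_*() helpers) is hypothetical
and only models the idea; it is not the DAMON code and does not use any
DAMON API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a DAMON context and its worker (kdamond). */
struct worker {
	bool running;
	int pid;
};

/* Cached status: updated only on explicit start requests. */
static bool cached_enabled;

static void worker_start(struct worker *w, int pid)
{
	w->running = true;
	w->pid = pid;
	cached_enabled = true;
}

/* Models the worker stopping on its own, e.g. after a failed commit. */
static void worker_die_unexpectedly(struct worker *w)
{
	w->running = false;
	w->pid = -1;
	/* cached_enabled is not updated here: this is the stale status. */
}

/* Fresh status: derived from the worker itself whenever it is asked for. */
static bool fresh_enabled(const struct worker *w)
{
	return w && w->running;
}

static int fresh_pid(const struct worker *w)
{
	return fresh_enabled(w) ? w->pid : -1;
}

/* Skips the start if the cached flag says the worker is already on. */
static bool request_enable_cached(struct worker *w, int pid)
{
	if (cached_enabled)	/* a stale 'Y' makes this a no-op forever */
		return w->running;
	worker_start(w, pid);
	return w->running;
}

/* Skips the start only if the worker is really running right now. */
static bool request_enable_fresh(struct worker *w, int pid)
{
	if (fresh_enabled(w))
		return true;
	worker_start(w, pid);
	return w->running;
}
```

In this model, once worker_die_unexpectedly() has run, the cached
variant refuses to restart the worker while the fresh variant restarts
it, which mirrors the reproducer in the patch descriptions.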

Patches 1-3 fix the bugs in DAMON_RECLAIM, DAMON_LRU_SORT and DAMON_STAT,
in that order.

Changes from RFC v2
- rfc v2: https://lore.kernel.org/20260418014439.6353-1-sj@kernel.org
- Add kdamond_pid set callbacks.
- Support multiple enabled parameters setup on boot commandline.
- Acknowledge the third patch was discovered by Sashiko.
Changes from v2
- v2: https://lore.kernel.org/20260413185249.5921-1-aethernet65535@gmail.com
- Add RFC tag back, for sashiko review.
- Detect and use fresh status instead of trying to catch up all scenarios.
- Change Liew from the responsible author to a credit-deserved co-developer.
- Move authorship responsibility to SJ.
- Add DAMON_STAT fix.
  - RFC of the fix was posted separately
    (https://lore.kernel.org/20260416143857.76146-1-sj@kernel.org), and
    only commit message wordsmithing is added in this version.
Changes from RFC
- rfc: https://lore.kernel.org/20260330164347.12772-1-aethernet65535@gmail.com
- Remove RFC tag.
- Remove 'damon_thread_status' structure and damon_update_thread_status()
  (SJ pointed out this was too much extension of core API for a problem
  that can be fixed more simply).
- Add a fallback in damon_{lru_sort, reclaim}_turn() 'N' path. If
  damon_stop() fails but kdamond is not running, forcefully reset the
  parameters.
- Reset 'enabled' and 'kdamond_pid' when damon_commit_ctx() fails in
  damon_{lru_sort, reclaim}_apply_parameters() (kdamond will terminate
  eventually in this case).

SeongJae Park (3):
  mm/damon/reclaim: detect and use fresh enabled and kdamond_pid values
  mm/damon/lru_sort: detect and use fresh enabled and kdamond_pid values
  mm/damon/stat: detect and use fresh enabled value

 mm/damon/lru_sort.c | 85 +++++++++++++++++++++++++++++----------------
 mm/damon/reclaim.c  | 85 +++++++++++++++++++++++++++++----------------
 mm/damon/stat.c     | 30 ++++++++++------
 3 files changed, 130 insertions(+), 70 deletions(-)


base-commit: 710b7b26c423290803f447f5ed2fb264e91cda56
-- 
2.47.3


^ permalink raw reply

* [PATCH] Documentation/binfmt-misc.rst: Specify aux vector for "O" flag description
From: Charlie Jenkins @ 2026-04-18 21:08 UTC (permalink / raw)
  To: Jonathan Corbet, Shuah Khan, Kees Cook
  Cc: linux-doc, linux-mm, linux-kernel, Charlie Jenkins

Instead of replacing the file path in the argument vector, the file
descriptor is passed as AT_EXECFD in the auxiliary vector. This appears
to have been the case at least since the git port, so update the
documentation to reflect this.
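
As a minimal sketch of what this means for an interpreter, the
descriptor could be looked up with getauxval(3) from glibc or musl. The
helper name interp_execfd() is hypothetical, and the errno-based
"not found" check assumes a C library that sets ENOENT for a missing
entry (glibc 2.19 or later, per its man page).

```c
#include <assert.h>
#include <elf.h>
#include <errno.h>
#include <sys/auxv.h>

/*
 * Return the descriptor passed via AT_EXECFD, or -1 if the auxiliary
 * vector has no such entry (e.g. the "O" flag was not used).
 */
static int interp_execfd(void)
{
	unsigned long val;

	errno = 0;
	val = getauxval(AT_EXECFD);
	/* getauxval() returns 0 for a missing entry; errno disambiguates. */
	if (val == 0 && errno == ENOENT)
		return -1;
	return (int)val;
}
```

An interpreter would typically fall back to opening the path from its
argument vector when this returns -1.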

Signed-off-by: Charlie Jenkins <thecharlesjenkins@gmail.com>
---
 Documentation/admin-guide/binfmt-misc.rst | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/binfmt-misc.rst b/Documentation/admin-guide/binfmt-misc.rst
index 59cd902e3549..c0a34fbf8022 100644
--- a/Documentation/admin-guide/binfmt-misc.rst
+++ b/Documentation/admin-guide/binfmt-misc.rst
@@ -68,10 +68,10 @@ Here is what the fields mean:
 	    Legacy behavior of binfmt_misc is to pass the full path
             of the binary to the interpreter as an argument. When this flag is
             included, binfmt_misc will open the file for reading and pass its
-            descriptor as an argument, instead of the full path, thus allowing
-            the interpreter to execute non-readable binaries. This feature
-            should be used with care - the interpreter has to be trusted not to
-            emit the contents of the non-readable binary.
+            descriptor into the auxilary vector with the key "AT_EXECFD", thus
+            allowing the interpreter to execute non-readable binaries. This
+            feature should be used with care - the interpreter has to be trusted
+            not to emit the contents of the non-readable binary.
       ``C`` - credentials
             Currently, the behavior of binfmt_misc is to calculate
             the credentials and security token of the new process according to

---
base-commit: 028ef9c96e96197026887c0f092424679298aae8
change-id: ${change-id}

- Charlie



^ permalink raw reply related

* [RFC PATCH 2/2] Documentation: maple_tree: Clarify behavior when using reserved values
From: Wei-Lin Chang @ 2026-04-18 20:47 UTC (permalink / raw)
  To: maple-tree, linux-mm, linux-doc, linux-kernel
  Cc: Liam R . Howlett, Alice Ryhl, Andrew Ballance, Jonathan Corbet,
	Shuah Khan, Wei-Lin Chang
In-Reply-To: <20260418204754.120405-1-weilin.chang@arm.com>

It doesn't matter whether the normal or the advanced API is used if the
user uses xa_{mk, to}_value when storing and retrieving the values. Just
specify that the normal API blocks usages of reserved values while the
advanced API does not.

Signed-off-by: Wei-Lin Chang <weilin.chang@arm.com>
---
 Documentation/core-api/maple_tree.rst | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/Documentation/core-api/maple_tree.rst b/Documentation/core-api/maple_tree.rst
index 15eda6742af8..54ea99c7bca7 100644
--- a/Documentation/core-api/maple_tree.rst
+++ b/Documentation/core-api/maple_tree.rst
@@ -30,9 +30,8 @@ Tree reserves values with the bottom two bits set to '10' which are below 4096
 (ie 2, 6, 10 .. 4094) for internal use.  If the entries may use reserved
 entries under the condition that their top bits are never 1, then the user can
 convert the entries using xa_mk_value() and convert them back by calling
-xa_to_value().  If the user needs to use a reserved value, then the user can
-convert the value when using the :ref:`maple-tree-advanced-api`, but are blocked
-by the normal API.
+xa_to_value().  Usage of reserved values is blocked by the normal API, and will
+cause undefined behavior if used with the :ref:`maple-tree-advanced-api`.
 
 The Maple Tree can also be configured to support searching for a gap of a given
 size (or larger).
-- 
2.43.0



^ permalink raw reply related

* [RFC PATCH 1/2] Documentation: maple_tree: Point out constraint when using xa_{mk, to}_value
From: Wei-Lin Chang @ 2026-04-18 20:47 UTC (permalink / raw)
  To: maple-tree, linux-mm, linux-doc, linux-kernel
  Cc: Liam R . Howlett, Alice Ryhl, Andrew Ballance, Jonathan Corbet,
	Shuah Khan, Wei-Lin Chang
In-Reply-To: <20260418204754.120405-1-weilin.chang@arm.com>

Using xa_{mk, to}_value when storing values loses the information of
the top bit because of the left shift. Point that out in the doc.
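
The loss can be seen in a small userspace sketch. The sketch_*() helpers
below are illustrative re-statements of the shift-and-tag encoding, not
the kernel definitions, and value_round_trips() is a hypothetical helper
added only for the demonstration.

```c
#include <assert.h>
#include <limits.h>

/* Encode: shift left by one and set bit 0 as the "value" tag. */
static void *sketch_mk_value(unsigned long v)
{
	return (void *)((v << 1) | 1);
}

/* Decode: drop the tag bit again. */
static unsigned long sketch_to_value(const void *entry)
{
	return (unsigned long)entry >> 1;
}

/* Returns 1 if v survives the encode/decode round trip unchanged. */
static int value_round_trips(unsigned long v)
{
	return sketch_to_value(sketch_mk_value(v)) == v;
}
```

Any value up to LONG_MAX round-trips, while a value with the top bit
set does not, because that bit is shifted out during encoding.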

Signed-off-by: Wei-Lin Chang <weilin.chang@arm.com>
---
 Documentation/core-api/maple_tree.rst | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/Documentation/core-api/maple_tree.rst b/Documentation/core-api/maple_tree.rst
index ccdd1615cf97..15eda6742af8 100644
--- a/Documentation/core-api/maple_tree.rst
+++ b/Documentation/core-api/maple_tree.rst
@@ -28,10 +28,11 @@ virtual memory areas.
 The Maple Tree can store values between ``0`` and ``ULONG_MAX``.  The Maple
 Tree reserves values with the bottom two bits set to '10' which are below 4096
 (ie 2, 6, 10 .. 4094) for internal use.  If the entries may use reserved
-entries then the users can convert the entries using xa_mk_value() and convert
-them back by calling xa_to_value().  If the user needs to use a reserved
-value, then the user can convert the value when using the
-:ref:`maple-tree-advanced-api`, but are blocked by the normal API.
+entries under the condition that their top bits are never 1, then the user can
+convert the entries using xa_mk_value() and convert them back by calling
+xa_to_value().  If the user needs to use a reserved value, then the user can
+convert the value when using the :ref:`maple-tree-advanced-api`, but are blocked
+by the normal API.
 
 The Maple Tree can also be configured to support searching for a gap of a given
 size (or larger).
-- 
2.43.0



^ permalink raw reply related

* [RFC PATCH 0/2] Documentation: maple_tree: Improve statements on reserved values
From: Wei-Lin Chang @ 2026-04-18 20:47 UTC (permalink / raw)
  To: maple-tree, linux-mm, linux-doc, linux-kernel
  Cc: Liam R . Howlett, Alice Ryhl, Andrew Ballance, Jonathan Corbet,
	Shuah Khan, Wei-Lin Chang

Hi,

While using the maple tree and reading its documentation, I found a few
bits confusing, mainly about the reserved values. So here are some
changes hoping to make things clearer.

I am not familiar with the implementation, so I might be getting things
wrong, hence this being RFC.

While looking at the code I also found that although the doc claims the
normal API blocks reserved value stores, the code checks this using
xa_is_advanced(), which only blocks values up to 1026, not up to the max
maple tree reserved value 4094. For this part I am not sure whether the
code needs to be changed or we can also improve the doc.
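
To make the gap concrete, here is a userspace sketch of the two checks.
It assumes both are a test of the bottom two bits plus an upper bound,
which matches the 1026 and 4094 limits quoted above; the sketch_*()
names and constants are hypothetical, and the real helpers should be
checked in the tree.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed upper bounds, taken from the numbers in this mail. */
#define SKETCH_RETRY_ENTRY		((256UL << 2) | 2)	/* 1026 */
#define SKETCH_MAPLE_RESERVED_RANGE	4096UL

/* Bottom two bits are '10'. */
static bool sketch_is_internal(unsigned long entry)
{
	return (entry & 3) == 2;
}

/* Models the check the normal API applies before storing. */
static bool sketch_is_advanced(unsigned long entry)
{
	return sketch_is_internal(entry) && entry <= SKETCH_RETRY_ENTRY;
}

/* Models the range the documentation describes as reserved. */
static bool sketch_is_reserved(unsigned long entry)
{
	return entry < SKETCH_MAPLE_RESERVED_RANGE &&
	       sketch_is_internal(entry);
}
```

Under these assumptions, a value such as 1030 is reserved according to
the documentation but would not be rejected by the store-time check.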

Any feedback is appreciated, thanks!

Wei-Lin Chang (2):
  Documentation: maple_tree: Point out constraint when using xa_{mk,
    to}_value
  Documentation: maple_tree: Clarify behavior when using reserved values

 Documentation/core-api/maple_tree.rst | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

-- 
2.43.0



^ permalink raw reply

* Re: [GIT PULL] memblock updates for v7.1-rc1
From: pr-tracker-bot @ 2026-04-18 18:40 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Linus Torvalds, Andrew Morton, Mike Rapoport, linux-mm,
	linux-kernel
In-Reply-To: <aeNIFuabuGbdVvDW@kernel.org>

The pull request you sent on Sat, 18 Apr 2026 12:00:06 +0300:

> https://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock tags/memblock-v7.1-rc1

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/9055c64567e9fc2a58d9382205bf3082f7bea141

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html


^ permalink raw reply

* Re: [PATCH] mm/migrate_device: Cleanup up PMD Checks and warnings
From: Sunny Patel @ 2026-04-18 17:18 UTC (permalink / raw)
  To: david
  Cc: akpm, apopple, byungchul, gourry, joshua.hahnjy, linux-kernel,
	linux-mm, matthew.brost, nueralspacetech, rakie.kim, sj,
	ying.huang, ziy
In-Reply-To: <82193ab8-e6f6-4664-8b4f-e30d280d8b1c@kernel.org>

On 4/17/26 01:52, SeongJae Park wrote:
> On Thu, 16 Apr 2026 21:44:15 +0200 "David Hildenbrand (Arm)" <david@kernel.org> wrote:
> 
>> On 4/14/26 16:13, Sunny Patel wrote:
> [...]
>>> @@ -865,12 +864,13 @@ static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
>>>  	if (userfaultfd_missing(vma))
>>>  		goto unlock_abort;
>>>  
>>> -	if (!pmd_none(*pmdp)) {
>>> +	if (pmd_present(*pmdp)) {
>>>  		if (!is_huge_zero_pmd(*pmdp))
>>>  			goto unlock_abort;
>>>  		flush = true;
>>> -	} else if (!pmd_none(*pmdp))
>>> +	} else if (!pmd_none(*pmdp)) {
>>>  		goto unlock_abort;
>>> +	}
>>>  
>>>  	add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR);
>>>  	folio_add_new_anon_rmap(folio, vma, addr, RMAP_EXCLUSIVE);
>>
>> is_huge_zero_pmd() checks pmd_present(), so we didn't have a bug before.
>>
>> We could also do:
>>
>> if (is_huge_zero_pmd(*pmdp)) {
>> 	flush = true;
>> } else if (!pmd_none(*pmdp)) {
>> 	goto unlock_abort;
>> }
> 
> Then we could even further remove the braces and reduce one more line, nice!

As far as I can see, the current implementation of is_huge_zero_pmd()
does not check pmd_present(), so the additional pmd_present() check is
required. Please let me know if anything else needs to be added here.

Thanks,
Sunny Patel


^ permalink raw reply

* Re: [PATCH v5 14/14] mm/vmscan: unify writeback reclaim statistic and throttling
From: Kairui Song @ 2026-04-18 16:57 UTC (permalink / raw)
  To: kasong
  Cc: linux-mm, Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
	Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
	Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
	Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
	linux-kernel, Qi Zheng, Baolin Wang
In-Reply-To: <20260413-mglru-reclaim-v5-14-8eaeacbddc44@tencent.com>

On Mon, Apr 13, 2026 at 12:53 AM Kairui Song via B4 Relay
<devnull+kasong.tencent.com@kernel.org> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> Currently MGLRU and non-MGLRU handle the reclaim statistic and
> writeback handling very differently, especially throttling.
> Basically MGLRU just ignored the throttling part.
>
> Let's just unify this part, use a helper to deduplicate the code
> so both setups will share the same behavior.
>
> Test using following reproducer using bash:
>
>   echo "Setup a slow device using dm delay"
>   dd if=/dev/zero of=/var/tmp/backing bs=1M count=2048
>   LOOP=$(losetup --show -f /var/tmp/backing)
>   mkfs.ext4 -q $LOOP
>   echo "0 $(blockdev --getsz $LOOP) delay $LOOP 0 0 $LOOP 0 1000" | \
>       dmsetup create slow_dev
>   mkdir -p /mnt/slow && mount /dev/mapper/slow_dev /mnt/slow
>
>   echo "Start writeback pressure"
>   sync && echo 3 > /proc/sys/vm/drop_caches
>   mkdir /sys/fs/cgroup/test_wb
>   echo 128M > /sys/fs/cgroup/test_wb/memory.max
>   (echo $BASHPID > /sys/fs/cgroup/test_wb/cgroup.procs && \
>       dd if=/dev/zero of=/mnt/slow/testfile bs=1M count=192)
>
>   echo "Clean up"
>   echo "0 $(blockdev --getsz $LOOP) error" | dmsetup load slow_dev
>   dmsetup resume slow_dev
>   umount -l /mnt/slow && sync
>   dmsetup remove slow_dev
>
> Before this commit, `dd` will get OOM killed immediately if
> MGLRU is enabled. Classic LRU is fine.
>
> After this commit, throttling is now effective and no more spin on
> LRU or premature OOM. Stress test on other workloads also looking good.
>
> Global throttling is not here yet, we will fix that separately later.
>
> Suggested-by: Chen Ridong <chenridong@huaweicloud.com>
> Tested-by: Leno Hou <lenohou@gmail.com>
> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/vmscan.c | 90 ++++++++++++++++++++++++++++---------------------------------
>  1 file changed, 41 insertions(+), 49 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a431f94ff3a3..43a3cadbb586 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1942,6 +1942,44 @@ static int current_may_throttle(void)
>         return !(current->flags & PF_LOCAL_THROTTLE);
>  }
>
> +static void handle_reclaim_writeback(unsigned long nr_taken,
> +                                    struct pglist_data *pgdat,
> +                                    struct scan_control *sc,
> +                                    struct reclaim_stat *stat)
> +{
> +       /*
> +        * If dirty folios are scanned that are not queued for IO, it
> +        * implies that flushers are not doing their job. This can
> +        * happen when memory pressure pushes dirty folios to the end of
> +        * the LRU before the dirty limits are breached and the dirty
> +        * data has expired. It can also happen when the proportion of
> +        * dirty folios grows not through writes but through memory
> +        * pressure reclaiming all the clean cache. And in some cases,
> +        * the flushers simply cannot keep up with the allocation
> +        * rate. Nudge the flusher threads in case they are asleep.
> +        */
> +       if (stat->nr_unqueued_dirty == nr_taken && nr_taken) {

While doing self review, I noticed a small problem here: It should
return without updating the counters below if nr_taken == 0. Currently
it only skips the flusher.

We might see nr_taken == 0 because MGLRU has a retry logic: if
shrink_folio_list returned some folios for being dirty or writeback,
and they became clean during that isolation time period, then MGLRU
will try calling shrink_folio_list again without doing isolation again.

This patch is still fine with the retry here in most cases. But if a
folio was returned by shrink_folio_list for being dirty, then suddenly
became clean and triggered the retry, then became dirty again. Now the
counter below might be skewed since a dirty folio is counted twice.
Still this is not a big issue, and I couldn't find a way to
reproduce this even on purpose, since that requires a few really short
time windows to hit together, and the result is also hardly
observable. But for 100% accuracy, I'll update this patch with:

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 71b4ef0e6735..af14efbc0cd8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1958,7 +1958,7 @@ static void handle_reclaim_writeback(unsigned long nr_taken,
         * the flushers simply cannot keep up with the allocation
         * rate. Nudge the flusher threads in case they are asleep.
         */
-       if (stat->nr_unqueued_dirty == nr_taken && nr_taken) {
+       if (stat->nr_unqueued_dirty == nr_taken) {
                wakeup_flusher_threads(WB_REASON_VMSCAN);
                /*
                 * For cgroupv1 dirty throttling is achieved by waking up
@@ -4830,7 +4830,9 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
 retry:
        reclaimed = shrink_folio_list(&list, pgdat, sc, &stat, false, memcg);
        sc->nr_reclaimed += reclaimed;
-       handle_reclaim_writeback(isolated, pgdat, sc, &stat);
+       /* Retry pass is only meant for clean folios without new isolation */
+       if (isolated)
+               handle_reclaim_writeback(isolated, pgdat, sc, &stat);
        trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
                        type_scanned, reclaimed, &stat, sc->priority,
                        type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);

Then it should be perfect.

We had better just remove that retry logic completely later. It's
meant to prevent folio_rotate_reclaimable from missing isolated folios,
and that should be done in a cleaner way. The current retry loop also may
lead to inaccurate tracepoint data, but that is not a new or major
problem, so I'm not touching that part for now.


^ permalink raw reply related

* Re: [PATCH v7 4/4] selftests/liveupdate: add test cases for LIVEUPDATE_SESSION_GET_NAME
From: Luca Boccassi @ 2026-04-18 16:34 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: kexec, linux-mm, graf, rppt, pratyush, brauner, linux-kernel
In-Reply-To: <a4jd64pkvamnwuwzu5k762rgn6sgzp6ldcoknkwhuddsrwcbyv@vxwdpp66x7bl>

On Sat, 18 Apr 2026 at 16:50, Pasha Tatashin <pasha.tatashin@soleen.com> wrote:
>
> On 04-18 15:09, luca.boccassi@gmail.com wrote:
> > From: Luca Boccassi <luca.boccassi@gmail.com>
> >
> > Verify that the new LIVEUPDATE_SESSION_GET_NAME ioctl works
> > as expected via new test cases in the existing liveupdate selftest.
> >
> > Signed-off-by: Luca Boccassi <luca.boccassi@gmail.com>
> > ---
> > v5: merge with LUO_SESSION_MAGIC series as they both change the same
> >     unit test file, to avoid merge conflicts
> >     split into separate patch
> > v6: add more test cases as suggested
> >     more verbose commit message
> >
> >  .../testing/selftests/liveupdate/liveupdate.c | 138 ++++++++++++++++++
> >  1 file changed, 138 insertions(+)
> >
> > diff --git a/tools/testing/selftests/liveupdate/liveupdate.c b/tools/testing/selftests/liveupdate/liveupdate.c
> > index d132b4685f76..ffbcbd465c9b 100644
> > --- a/tools/testing/selftests/liveupdate/liveupdate.c
> > +++ b/tools/testing/selftests/liveupdate/liveupdate.c
> > @@ -105,6 +105,22 @@ static int create_session(int lu_fd, const char *name)
> >       return args.fd;
> >  }
> >
> > +/* Helper function to get a session name via ioctl. */
> > +static int get_session_name(int session_fd, char *name, size_t name_len)
> > +{
> > +     struct liveupdate_session_get_name args = {};
> > +
> > +     args.size = sizeof(args);
> > +
> > +     if (ioctl(session_fd, LIVEUPDATE_SESSION_GET_NAME, &args))
> > +             return -errno;
> > +
> > +     strncpy(name, (char *)args.name, name_len - 1);
> > +     name[name_len - 1] = '\0';
> > +
> > +     return 0;
> > +}
> > +
> >  /*
> >   * Test Case: Create Duplicate Session
> >   *
> > @@ -385,4 +401,126 @@ TEST_F(liveupdate_device, session_fstat)
> >       ASSERT_EQ(close(session_fd2), 0);
> >  }
> >
> > +/*
> > + * Test Case: Get Session Name
> > + *
> > + * Verifies that the full session name can be retrieved from a session file
> > + * descriptor via ioctl.
> > + */
> > +TEST_F(liveupdate_device, get_session_name)
> > +{
> > +     char name_buf[LIVEUPDATE_SESSION_NAME_LENGTH] = {};
> > +     const char *session_name = "get-name-test-session";
> > +     int session_fd;
> > +
> > +     self->fd1 = open(LIVEUPDATE_DEV, O_RDWR);
> > +     if (self->fd1 < 0 && errno == ENOENT)
> > +             SKIP(return, "%s does not exist", LIVEUPDATE_DEV);
> > +     ASSERT_GE(self->fd1, 0);
> > +
> > +     session_fd = create_session(self->fd1, session_name);
> > +     ASSERT_GE(session_fd, 0);
> > +
> > +     ASSERT_EQ(get_session_name(session_fd, name_buf, sizeof(name_buf)), 0);
> > +     ASSERT_STREQ(name_buf, session_name);
> > +
> > +     ASSERT_EQ(close(session_fd), 0);
> > +}
> > +
> > +/*
> > + * Test Case: Get Session Name at Maximum Length
> > + *
> > + * Verifies that a session name using the full LIVEUPDATE_SESSION_NAME_LENGTH
> > + * (minus the null terminator) can be correctly retrieved.
> > + */
> > +TEST_F(liveupdate_device, get_session_name_max_length)
> > +{
> > +     char name_buf[LIVEUPDATE_SESSION_NAME_LENGTH] = {};
> > +     char long_name[LIVEUPDATE_SESSION_NAME_LENGTH];
> > +     int session_fd;
> > +
> > +     memset(long_name, 'A', sizeof(long_name) - 1);
> > +     long_name[sizeof(long_name) - 1] = '\0';
> > +
> > +     self->fd1 = open(LIVEUPDATE_DEV, O_RDWR);
> > +     if (self->fd1 < 0 && errno == ENOENT)
> > +             SKIP(return, "%s does not exist", LIVEUPDATE_DEV);
> > +     ASSERT_GE(self->fd1, 0);
> > +
> > +     session_fd = create_session(self->fd1, long_name);
> > +     ASSERT_GE(session_fd, 0);
> > +
> > +     ASSERT_EQ(get_session_name(session_fd, name_buf, sizeof(name_buf)), 0);
> > +     ASSERT_STREQ(name_buf, long_name);
> > +
> > +     ASSERT_EQ(close(session_fd), 0);
> > +}
> > +
> > +/*
> > + * Test Case: Create Session with No Null Termination
> > + *
> > + * Verifies that filling the entire 64-byte name field with non-null characters
> > + * (no '\0' terminator) is handled safely by the kernel. The kernel's strscpy
> > + * truncates to 63 characters, which we verify via get_session_name.
>
> [...]
>
> > + */
> > +TEST_F(liveupdate_device, create_session_no_null_termination)
> > +{
> > +     struct liveupdate_ioctl_create_session args = {};
> > +     char expected_name[LIVEUPDATE_SESSION_NAME_LENGTH];
> > +     char name_buf[LIVEUPDATE_SESSION_NAME_LENGTH] = {};
> > +     int session_fd;
> > +
> > +     self->fd1 = open(LIVEUPDATE_DEV, O_RDWR);
> > +     if (self->fd1 < 0 && errno == ENOENT)
> > +             SKIP(return, "%s does not exist", LIVEUPDATE_DEV);
> > +     ASSERT_GE(self->fd1, 0);
> > +
> > +     /* Fill entire name field with 'X', no null terminator */
> > +     args.size = sizeof(args);
> > +     memset(args.name, 'X', sizeof(args.name));
> > +
> > +     ASSERT_EQ(ioctl(self->fd1, LIVEUPDATE_IOCTL_CREATE_SESSION, &args), 0);
> > +     session_fd = args.fd;
> > +     ASSERT_GE(session_fd, 0);
> > +
> > +     /* Kernel should have truncated to 63 chars + '\0' */
> > +     memset(expected_name, 'X', sizeof(expected_name) - 1);
> > +     expected_name[sizeof(expected_name) - 1] = '\0';
> > +
> > +     ASSERT_EQ(get_session_name(session_fd, name_buf, sizeof(name_buf)), 0);
> > +     ASSERT_STREQ(name_buf, expected_name);
> > +
> > +     ASSERT_EQ(close(session_fd), 0);
> > +}
> > +
> > +/*
> > + * Test Case: Create Session with Empty Name
> > + *
> > + * Verifies that creating a session with an empty string name succeeds,
> > + * and that creating a second session with the same empty name fails
> > + * with EEXIST.
> > + */
>
> For the two cases mentioned above, we should implement a fix in the
> kernel:
>
> In luo_session_create(), we should verify that the name is at least 1
> character long and at most LIVEUPDATE_SESSION_NAME_LENGTH - 1.
>
> size_t len = strnlen(name, LIVEUPDATE_SESSION_NAME_LENGTH);
>
> if (!len || len > LIVEUPDATE_SESSION_NAME_LENGTH - 1)
>    return -EINVAL;
>
> This change should be submitted as a separate patch in your series.
>
> Additionally, please include a cover letter in the next version
> explaining the overall series. Once the cover letter is included, you
> can remove the individual patch versions from the stat areas.
>
> Pasha

Ok, all done in v8, thanks
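
For anyone following along, the length check suggested above can be
sketched in plain userspace C. This is only an illustration under stated
assumptions: session_name_valid() and NAME_LENGTH are hypothetical
stand-ins (64 is taken from the "64-byte name field" mentioned in the
test comment), and this is not the code that went into v8.

```c
#define _POSIX_C_SOURCE 200809L	/* for strnlen() */

#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Assumed stand-in for LIVEUPDATE_SESSION_NAME_LENGTH (64-byte field). */
#define NAME_LENGTH 64

/*
 * Hypothetical helper mirroring the suggested check: accept names of
 * 1..NAME_LENGTH-1 characters, reject empty names and buffers that
 * carry no null terminator within NAME_LENGTH bytes.
 */
static int session_name_valid(const char *name)
{
	size_t len;

	assert(name != NULL);

	/* strnlen() never reads more than NAME_LENGTH bytes. */
	len = strnlen(name, NAME_LENGTH);

	/* len == NAME_LENGTH means no terminator was found in the field. */
	if (!len || len > NAME_LENGTH - 1)
		return -EINVAL;

	return 0;
}
```

With a check of this shape, both the empty-name case and the
no-null-termination case are rejected up front instead of being silently
accepted or truncated.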



* [PATCH v8 6/6] selftests/liveupdate: add test cases for LIVEUPDATE_SESSION_GET_NAME
From: luca.boccassi @ 2026-04-18 16:28 UTC (permalink / raw)
  To: kexec
  Cc: linux-mm, graf, rppt, pasha.tatashin, pratyush, brauner,
	linux-kernel, Luca Boccassi
In-Reply-To: <20260418163358.2304490-1-luca.boccassi@gmail.com>

From: Luca Boccassi <luca.boccassi@gmail.com>

Verify that the new LIVEUPDATE_SESSION_GET_NAME ioctl works
as expected via new test cases in the existing liveupdate selftest.

Signed-off-by: Luca Boccassi <luca.boccassi@gmail.com>
---
 .../testing/selftests/liveupdate/liveupdate.c | 71 +++++++++++++++++++
 1 file changed, 71 insertions(+)

diff --git a/tools/testing/selftests/liveupdate/liveupdate.c b/tools/testing/selftests/liveupdate/liveupdate.c
index 5e99af0cc6e9..c21354dc9b93 100644
--- a/tools/testing/selftests/liveupdate/liveupdate.c
+++ b/tools/testing/selftests/liveupdate/liveupdate.c
@@ -105,6 +105,22 @@ static int create_session(int lu_fd, const char *name)
 	return args.fd;
 }
 
+/* Helper function to get a session name via ioctl. */
+static int get_session_name(int session_fd, char *name, size_t name_len)
+{
+	struct liveupdate_session_get_name args = {};
+
+	args.size = sizeof(args);
+
+	if (ioctl(session_fd, LIVEUPDATE_SESSION_GET_NAME, &args))
+		return -errno;
+
+	strncpy(name, (char *)args.name, name_len - 1);
+	name[name_len - 1] = '\0';
+
+	return 0;
+}
+
 /*
  * Test Case: Create Duplicate Session
  *
@@ -427,4 +443,59 @@ TEST_F(liveupdate_device, session_fstat)
 	ASSERT_EQ(close(session_fd2), 0);
 }
 
+/*
+ * Test Case: Get Session Name
+ *
+ * Verifies that the full session name can be retrieved from a session file
+ * descriptor via ioctl.
+ */
+TEST_F(liveupdate_device, get_session_name)
+{
+	char name_buf[LIVEUPDATE_SESSION_NAME_LENGTH] = {};
+	const char *session_name = "get-name-test-session";
+	int session_fd;
+
+	self->fd1 = open(LIVEUPDATE_DEV, O_RDWR);
+	if (self->fd1 < 0 && errno == ENOENT)
+		SKIP(return, "%s does not exist", LIVEUPDATE_DEV);
+	ASSERT_GE(self->fd1, 0);
+
+	session_fd = create_session(self->fd1, session_name);
+	ASSERT_GE(session_fd, 0);
+
+	ASSERT_EQ(get_session_name(session_fd, name_buf, sizeof(name_buf)), 0);
+	ASSERT_STREQ(name_buf, session_name);
+
+	ASSERT_EQ(close(session_fd), 0);
+}
+
+/*
+ * Test Case: Get Session Name at Maximum Length
+ *
+ * Verifies that a session name using the full LIVEUPDATE_SESSION_NAME_LENGTH
+ * (minus the null terminator) can be correctly retrieved.
+ */
+TEST_F(liveupdate_device, get_session_name_max_length)
+{
+	char name_buf[LIVEUPDATE_SESSION_NAME_LENGTH] = {};
+	char long_name[LIVEUPDATE_SESSION_NAME_LENGTH];
+	int session_fd;
+
+	memset(long_name, 'A', sizeof(long_name) - 1);
+	long_name[sizeof(long_name) - 1] = '\0';
+
+	self->fd1 = open(LIVEUPDATE_DEV, O_RDWR);
+	if (self->fd1 < 0 && errno == ENOENT)
+		SKIP(return, "%s does not exist", LIVEUPDATE_DEV);
+	ASSERT_GE(self->fd1, 0);
+
+	session_fd = create_session(self->fd1, long_name);
+	ASSERT_GE(session_fd, 0);
+
+	ASSERT_EQ(get_session_name(session_fd, name_buf, sizeof(name_buf)), 0);
+	ASSERT_STREQ(name_buf, long_name);
+
+	ASSERT_EQ(close(session_fd), 0);
+}
+
 TEST_HARNESS_MAIN
-- 
2.47.3




* [PATCH v8 5/6] liveupdate: add LIVEUPDATE_SESSION_GET_NAME ioctl
From: luca.boccassi @ 2026-04-18 16:28 UTC (permalink / raw)
  To: kexec
  Cc: linux-mm, graf, rppt, pasha.tatashin, pratyush, brauner,
	linux-kernel, Luca Boccassi
In-Reply-To: <20260418163358.2304490-1-luca.boccassi@gmail.com>

From: Luca Boccassi <luca.boccassi@gmail.com>

When userspace requests a session via the ioctl, it specifies a name and
gets an FD, but there is no ioctl to go the other way and get the name
from a LUO session FD. This is problematic especially when a userspace
orchestrator wants to check which FDs it is handling for clients without
manually scraping strings from procfs, or without procfs at all.

Add an ioctl to simply get the name from an FD.

Signed-off-by: Luca Boccassi <luca.boccassi@gmail.com>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
---
 include/uapi/linux/liveupdate.h | 21 +++++++++++++++++++++
 kernel/liveupdate/luo_session.c | 13 +++++++++++++
 2 files changed, 34 insertions(+)

diff --git a/include/uapi/linux/liveupdate.h b/include/uapi/linux/liveupdate.h
index 30bc66ee9436..3a9ff53b02e0 100644
--- a/include/uapi/linux/liveupdate.h
+++ b/include/uapi/linux/liveupdate.h
@@ -59,6 +59,7 @@ enum {
 	LIVEUPDATE_CMD_SESSION_PRESERVE_FD = LIVEUPDATE_CMD_SESSION_BASE,
 	LIVEUPDATE_CMD_SESSION_RETRIEVE_FD = 0x41,
 	LIVEUPDATE_CMD_SESSION_FINISH = 0x42,
+	LIVEUPDATE_CMD_SESSION_GET_NAME = 0x43,
 };
 
 /**
@@ -213,4 +214,24 @@ struct liveupdate_session_finish {
 #define LIVEUPDATE_SESSION_FINISH					\
 	_IO(LIVEUPDATE_IOCTL_TYPE, LIVEUPDATE_CMD_SESSION_FINISH)
 
+/**
+ * struct liveupdate_session_get_name - ioctl(LIVEUPDATE_SESSION_GET_NAME)
+ * @size:  Input; sizeof(struct liveupdate_session_get_name)
+ * @reserved: Input; Must be zero. Reserved for future use.
+ * @name:  Output; A null-terminated string with the full session name.
+ *
+ * Retrieves the full name of the session associated with this file descriptor.
+ * This is useful because the kernel may truncate the name shown in /proc.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+struct liveupdate_session_get_name {
+	__u32		size;
+	__u32		reserved;
+	__u8		name[LIVEUPDATE_SESSION_NAME_LENGTH];
+};
+
+#define LIVEUPDATE_SESSION_GET_NAME					\
+	_IO(LIVEUPDATE_IOCTL_TYPE, LIVEUPDATE_CMD_SESSION_GET_NAME)
+
 #endif /* _UAPI_LIVEUPDATE_H */
diff --git a/kernel/liveupdate/luo_session.c b/kernel/liveupdate/luo_session.c
index 21cbe99fc819..ca5e44274dcb 100644
--- a/kernel/liveupdate/luo_session.c
+++ b/kernel/liveupdate/luo_session.c
@@ -291,10 +291,21 @@ static int luo_session_finish(struct luo_session *session,
 	return luo_ucmd_respond(ucmd, sizeof(*argp));
 }
 
+static int luo_session_get_name(struct luo_session *session,
+				struct luo_ucmd *ucmd)
+{
+	struct liveupdate_session_get_name *argp = ucmd->cmd;
+
+	strscpy((char *)argp->name, session->name, sizeof(argp->name));
+
+	return luo_ucmd_respond(ucmd, sizeof(*argp));
+}
+
 union ucmd_buffer {
 	struct liveupdate_session_finish finish;
 	struct liveupdate_session_preserve_fd preserve;
 	struct liveupdate_session_retrieve_fd retrieve;
+	struct liveupdate_session_get_name get_name;
 };
 
 struct luo_ioctl_op {
@@ -321,6 +332,8 @@ static const struct luo_ioctl_op luo_session_ioctl_ops[] = {
 		 struct liveupdate_session_preserve_fd, token),
 	IOCTL_OP(LIVEUPDATE_SESSION_RETRIEVE_FD, luo_session_retrieve_fd,
 		 struct liveupdate_session_retrieve_fd, token),
+	IOCTL_OP(LIVEUPDATE_SESSION_GET_NAME, luo_session_get_name,
+		 struct liveupdate_session_get_name, name),
 };
 
 static long luo_session_ioctl(struct file *filep, unsigned int cmd,
-- 
2.47.3
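
As a rough userspace illustration of the argument block this patch
introduces, the sketch below mirrors the struct locally and shows how a
caller might prepare it and copy the result out safely. The names
get_name_args, get_name_args_init() and get_name_args_copy() are
hypothetical, and NAME_LENGTH = 64 is an assumption; real callers would
include <linux/liveupdate.h> and issue the ioctl on a session FD.

```c
#define _POSIX_C_SOURCE 200809L	/* for strnlen() */

#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed value; the real constant lives in <linux/liveupdate.h>. */
#define NAME_LENGTH 64

/* Local mirror of struct liveupdate_session_get_name from the patch. */
struct get_name_args {
	uint32_t size;
	uint32_t reserved;
	uint8_t  name[NAME_LENGTH];
};

/* Hypothetical helper: zero the block and fill in the size field. */
static void get_name_args_init(struct get_name_args *args)
{
	assert(args != NULL);
	memset(args, 0, sizeof(*args));	/* reserved must be zero */
	args->size = sizeof(*args);
}

/*
 * Hypothetical helper: copy the returned name into dst, always
 * null-terminating and never reading past the fixed-size name field.
 */
static void get_name_args_copy(const struct get_name_args *args,
			       char *dst, size_t dst_len)
{
	size_t n;

	assert(args != NULL && dst != NULL && dst_len > 0);

	n = strnlen((const char *)args->name, sizeof(args->name));
	if (n > dst_len - 1)
		n = dst_len - 1;

	memcpy(dst, args->name, n);
	dst[n] = '\0';
}
```

The copy helper bounds the read by the size of the name field as well as
the destination, so it stays safe even if the field were ever returned
without a terminator.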




* [PATCH v8 4/6] selftests/liveupdate: add test case for LUO_SESSION_MAGIC
From: luca.boccassi @ 2026-04-18 16:28 UTC (permalink / raw)
  To: kexec
  Cc: linux-mm, graf, rppt, pasha.tatashin, pratyush, brauner,
	linux-kernel, Luca Boccassi
In-Reply-To: <20260418163358.2304490-1-luca.boccassi@gmail.com>

From: Luca Boccassi <luca.boccassi@gmail.com>

Verify that fstat works as expected after the switch from anon_inode
to the new magic number.

Signed-off-by: Luca Boccassi <luca.boccassi@gmail.com>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
---
 .../testing/selftests/liveupdate/liveupdate.c | 40 +++++++++++++++++++
 1 file changed, 40 insertions(+)

diff --git a/tools/testing/selftests/liveupdate/liveupdate.c b/tools/testing/selftests/liveupdate/liveupdate.c
index f0a8e600c154..5e99af0cc6e9 100644
--- a/tools/testing/selftests/liveupdate/liveupdate.c
+++ b/tools/testing/selftests/liveupdate/liveupdate.c
@@ -22,9 +22,12 @@
 #include <fcntl.h>
 #include <string.h>
 #include <sys/ioctl.h>
+#include <sys/stat.h>
+#include <sys/vfs.h>
 #include <unistd.h>
 
 #include <linux/liveupdate.h>
+#include <linux/magic.h>
 
 #include "../kselftest.h"
 #include "../kselftest_harness.h"
@@ -387,4 +390,41 @@ TEST_F(liveupdate_device, create_session_empty_name)
 	EXPECT_EQ(session_fd, -EINVAL);
 }
 
+/*
+ * Test Case: Session fstat
+ *
+ * Verifies that fstatfs() on a session file descriptor reports the
+ * LUO_SESSION_MAGIC filesystem type, and that fstat() returns consistent
+ * inode numbers across different sessions (shared singleton inode).
+ */
+TEST_F(liveupdate_device, session_fstat)
+{
+	int session_fd1, session_fd2;
+	struct stat st1, st2;
+	struct statfs sfs;
+
+	self->fd1 = open(LIVEUPDATE_DEV, O_RDWR);
+	if (self->fd1 < 0 && errno == ENOENT)
+		SKIP(return, "%s does not exist", LIVEUPDATE_DEV);
+	ASSERT_GE(self->fd1, 0);
+
+	session_fd1 = create_session(self->fd1, "fstat-session-1");
+	ASSERT_GE(session_fd1, 0);
+
+	session_fd2 = create_session(self->fd1, "fstat-session-2");
+	ASSERT_GE(session_fd2, 0);
+
+	/* Verify the filesystem type is LUO_SESSION_MAGIC */
+	ASSERT_EQ(fstatfs(session_fd1, &sfs), 0);
+	EXPECT_EQ(sfs.f_type, LUO_SESSION_MAGIC);
+
+	/* Verify both sessions share the same inode number */
+	ASSERT_EQ(fstat(session_fd1, &st1), 0);
+	ASSERT_EQ(fstat(session_fd2, &st2), 0);
+	EXPECT_EQ(st1.st_ino, st2.st_ino);
+
+	ASSERT_EQ(close(session_fd1), 0);
+	ASSERT_EQ(close(session_fd2), 0);
+}
+
 TEST_HARNESS_MAIN
-- 
2.47.3
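
The filesystem-type half of the test above boils down to one fstatfs()
call. A minimal sketch of that query, usable on any file descriptor, is
shown below; fd_fs_magic() is a hypothetical helper name and is not part
of the selftest. A caller would compare the returned value against a
constant from <linux/magic.h>, such as the LUO_SESSION_MAGIC value added
by this series.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/vfs.h>

/*
 * Hypothetical helper: store the filesystem magic of the file behind
 * fd in *magic. Returns 0 on success or a negative errno value.
 */
static int fd_fs_magic(int fd, long *magic)
{
	struct statfs sfs;

	assert(magic != NULL);

	if (fstatfs(fd, &sfs))
		return -errno;

	*magic = (long)sfs.f_type;
	return 0;
}
```

Returning -errno follows the convention the selftest helpers already use,
so an invalid descriptor is reported as -EBADF rather than a bare -1.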



