stable.vger.kernel.org archive mirror
* [patch 021/227] mm: fs: fix lru_cache_disabled race in bh_lru
       [not found] <20220322143803.04a5e59a07e48284f196a2f9@linux-foundation.org>
@ 2022-03-22 21:39 ` Andrew Morton
  2022-03-22 21:42 ` [patch 073/227] mm: don't skip swap entry even if zap_details specified Andrew Morton
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 12+ messages in thread
From: Andrew Morton @ 2022-03-22 21:39 UTC (permalink / raw)
  To: stable, mtosatti, joaodias, cgoldswo, minchan, akpm, patches,
	linux-mm, mm-commits, torvalds, akpm

From: Minchan Kim <minchan@kernel.org>
Subject: mm: fs: fix lru_cache_disabled race in bh_lru

Check lru_cache_disabled() under bh_lru_lock.  Otherwise, the race below can
occur and migration of pages containing a buffer_head fails.

   CPU 0					CPU 1

bh_lru_install
                                       lru_cache_disable
  lru_cache_disabled = false
                                       atomic_inc(&lru_disable_count);
				       invalidate_bh_lrus_cpu of CPU 0
				       bh_lru_lock
				       __invalidate_bh_lrus
				       bh_lru_unlock
  bh_lru_lock
  install the bh
  bh_lru_unlock

When this race happens, a CMA allocation fails, which is critical for
workloads that depend on CMA.
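
To make the window concrete, here is a rough sketch of the two racing paths
(condensed from fs/buffer.c and mm/swap.c; not the exact kernel code):

	/* CPU 0: bh_lru_install(), before this patch */
	if (lru_cache_disabled())	/* checked without holding bh_lru_lock */
		return;
	/* <-- CPU 1 may disable the LRU and drain this CPU's bh_lrus right here */
	bh_lru_lock();
	/* install the bh into this_cpu_ptr(&bh_lrus); its refcount now pins the page */
	bh_lru_unlock();

	/* CPU 1: lru_cache_disable() */
	atomic_inc(&lru_disable_count);
	/* drain per-CPU caches, including invalidate_bh_lrus_cpu() on CPU 0 */

With the check moved under bh_lru_lock() (as in the hunk below), roughly
speaking either the install observes lru_cache_disabled() and bails out, or it
completes before CPU 1's invalidate_bh_lrus_cpu(), which then removes the
freshly installed bh.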

Link: https://lkml.kernel.org/r/20220308180709.2017638-1-minchan@kernel.org
Fixes: 8cc621d2f45d ("mm: fs: invalidate BH LRU during page migration")
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Chris Goldsworthy <cgoldswo@codeaurora.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: John Dias <joaodias@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/buffer.c |    8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

--- a/fs/buffer.c~mm-fs-fix-lru_cache_disabled-race-in-bh_lru
+++ a/fs/buffer.c
@@ -1235,16 +1235,18 @@ static void bh_lru_install(struct buffer
 	int i;
 
 	check_irqs_on();
+	bh_lru_lock();
+
 	/*
 	 * the refcount of buffer_head in bh_lru prevents dropping the
 	 * attached page(i.e., try_to_free_buffers) so it could cause
 	 * failing page migration.
 	 * Skip putting upcoming bh into bh_lru until migration is done.
 	 */
-	if (lru_cache_disabled())
+	if (lru_cache_disabled()) {
+		bh_lru_unlock();
 		return;
-
-	bh_lru_lock();
+	}
 
 	b = this_cpu_ptr(&bh_lrus);
 	for (i = 0; i < BH_LRU_SIZE; i++) {
_


* [patch 073/227] mm: don't skip swap entry even if zap_details specified
       [not found] <20220322143803.04a5e59a07e48284f196a2f9@linux-foundation.org>
  2022-03-22 21:39 ` [patch 021/227] mm: fs: fix lru_cache_disabled race in bh_lru Andrew Morton
@ 2022-03-22 21:42 ` Andrew Morton
  2022-03-22 21:43 ` [patch 097/227] mm/pages_alloc.c: don't create ZONE_MOVABLE beyond the end of a node Andrew Morton
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 12+ messages in thread
From: Andrew Morton @ 2022-03-22 21:42 UTC (permalink / raw)
  To: willy, vbabka, stable, shy828301, kirill, jhubbard, hughd, david,
	apopple, aarcange, peterx, akpm, patches, linux-mm, mm-commits,
	torvalds, akpm

From: Peter Xu <peterx@redhat.com>
Subject: mm: don't skip swap entry even if zap_details specified

Patch series "mm: Rework zap ptes on swap entries", v5.

Patch 1 should fix a long-standing bug in zap_pte_range()'s use of zap_details.
The risk is that some swap entries could be skipped when they should have been
zapped.

Migration entries are not the major concern because file-backed memory is always
zapped in the pattern "first without the page lock, then re-zap with the page lock",
hence the second zap will always make sure all migration entries have already been
recovered.

However, genuine swap entries can be skipped erroneously.  A reproducer is
provided in the commit message of patch 1.

Patches 2-4 are cleanups based on patch 1.  After the whole patchset is applied,
we should have a very clean view of zap_pte_range().

Only patch 1 needs to be backported to stable if necessary.


This patch (of 4):

The "details" pointer shouldn't be the token to decide whether we should
skip swap entries.

For example, when a caller specifies details->zap_mapping==NULL, the user wants
to zap all the pages (including COWed pages), so we need to look into swap
entries because there can be private COWed pages that were swapped out.

Skipping swap entries whenever details is non-NULL may wrongly leave some swap
entries behind when they should have been zapped.

A reproducer of the problem:

===8<===
        #define _GNU_SOURCE         /* See feature_test_macros(7) */
        #include <stdio.h>
        #include <assert.h>
        #include <unistd.h>
        #include <sys/mman.h>
        #include <sys/types.h>

        int page_size;
        int shmem_fd;
        char *buffer;

        void main(void)
        {
                int ret;
                char val;

                page_size = getpagesize();
                shmem_fd = memfd_create("test", 0);
                assert(shmem_fd >= 0);

                ret = ftruncate(shmem_fd, page_size * 2);
                assert(ret == 0);

                buffer = mmap(NULL, page_size * 2, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE, shmem_fd, 0);
                assert(buffer != MAP_FAILED);

                /* Write private page, swap it out */
                buffer[page_size] = 1;
                madvise(buffer, page_size * 2, MADV_PAGEOUT);

                /* This should drop private buffer[page_size] already */
                ret = ftruncate(shmem_fd, page_size);
                assert(ret == 0);
                /* Recover the size */
                ret = ftruncate(shmem_fd, page_size * 2);
                assert(ret == 0);

                /* Re-read the data, it should be all zero */
                val = buffer[page_size];
                if (val == 0)
                        printf("Good\n");
                else
                        printf("BUG\n");
        }
===8<===

We don't need to touch the pmd path, because pmds never had an issue with swap
entries.  For example, shmem pmd migration is always split to pte level first,
and the same holds for swapping of anonymous memory.

Add another helper should_zap_cows() so that we can also check whether we
should zap private mappings when there's no page pointer specified.

This patch drops the old "skip swap entries whenever details is set" trick, so
swap ptes are handled coherently.  Meanwhile the same check is applied to
migration entries, hwpoison entries and genuine swap entries.

To be explicit, we should still remember to keep the private entries if
even_cows==false, and always zap them when even_cows==true.
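
For reference, a rough sketch of how a caller encodes that choice in this kernel
version (modelled on unmap_mapping_pages(), which is not part of this patch;
simplified):

	struct zap_details details = { };

	/*
	 * even_cows == true  -> leave zap_mapping NULL, so should_zap_cows() is true
	 * even_cows == false -> only zap pages belonging to 'mapping'
	 */
	details.zap_mapping = even_cows ? NULL : mapping;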

The issue seems to have existed since the initial git commit.

[peterx@redhat.com: comment tweaks]
  Link: https://lkml.kernel.org/r/20220217060746.71256-2-peterx@redhat.com
Link: https://lkml.kernel.org/r/20220217060746.71256-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20220216094810.60572-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20220216094810.60572-2-peterx@redhat.com
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yang Shi <shy828301@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memory.c |   40 +++++++++++++++++++++++++++++++---------
 1 file changed, 31 insertions(+), 9 deletions(-)

--- a/mm/memory.c~mm-dont-skip-swap-entry-even-if-zap_details-specified
+++ a/mm/memory.c
@@ -1313,6 +1313,17 @@ struct zap_details {
 	struct folio *single_folio;	/* Locked folio to be unmapped */
 };
 
+/* Whether we should zap all COWed (private) pages too */
+static inline bool should_zap_cows(struct zap_details *details)
+{
+	/* By default, zap all pages */
+	if (!details)
+		return true;
+
+	/* Or, we zap COWed pages only if the caller wants to */
+	return !details->zap_mapping;
+}
+
 /*
  * We set details->zap_mapping when we want to unmap shared but keep private
  * pages. Return true if skip zapping this page, false otherwise.
@@ -1320,11 +1331,15 @@ struct zap_details {
 static inline bool
 zap_skip_check_mapping(struct zap_details *details, struct page *page)
 {
-	if (!details || !page)
+	/* If we can make a decision without *page.. */
+	if (should_zap_cows(details))
+		return false;
+
+	/* E.g. the caller passes NULL for the case of a zero page */
+	if (!page)
 		return false;
 
-	return details->zap_mapping &&
-		(details->zap_mapping != page_rmapping(page));
+	return details->zap_mapping != page_rmapping(page);
 }
 
 static unsigned long zap_pte_range(struct mmu_gather *tlb,
@@ -1405,17 +1420,24 @@ again:
 			continue;
 		}
 
-		/* If details->check_mapping, we leave swap entries. */
-		if (unlikely(details))
-			continue;
-
-		if (!non_swap_entry(entry))
+		if (!non_swap_entry(entry)) {
+			/* Genuine swap entry, hence a private anon page */
+			if (!should_zap_cows(details))
+				continue;
 			rss[MM_SWAPENTS]--;
-		else if (is_migration_entry(entry)) {
+		} else if (is_migration_entry(entry)) {
 			struct page *page;
 
 			page = pfn_swap_entry_to_page(entry);
+			if (zap_skip_check_mapping(details, page))
+				continue;
 			rss[mm_counter(page)]--;
+		} else if (is_hwpoison_entry(entry)) {
+			if (!should_zap_cows(details))
+				continue;
+		} else {
+			/* We should have covered all the swap entry types */
+			WARN_ON_ONCE(1);
 		}
 		if (unlikely(!free_swap_and_cache(entry)))
 			print_bad_pte(vma, addr, ptent, NULL);
_


* [patch 097/227] mm/pages_alloc.c: don't create ZONE_MOVABLE beyond the end of a node
       [not found] <20220322143803.04a5e59a07e48284f196a2f9@linux-foundation.org>
  2022-03-22 21:39 ` [patch 021/227] mm: fs: fix lru_cache_disabled race in bh_lru Andrew Morton
  2022-03-22 21:42 ` [patch 073/227] mm: don't skip swap entry even if zap_details specified Andrew Morton
@ 2022-03-22 21:43 ` Andrew Morton
  2022-03-22 21:44 ` [patch 111/227] mm: invalidate hwpoison page cache page in fault path Andrew Morton
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 12+ messages in thread
From: Andrew Morton @ 2022-03-22 21:43 UTC (permalink / raw)
  To: ziy, stable, osalvador, mgorman, jhubbard, david,
	anshuman.khandual, apopple, akpm, patches, linux-mm, mm-commits,
	torvalds, akpm

From: Alistair Popple <apopple@nvidia.com>
Subject: mm/pages_alloc.c: don't create ZONE_MOVABLE beyond the end of a node

ZONE_MOVABLE uses the remaining memory in each node.  Its starting pfn is
also aligned to MAX_ORDER_NR_PAGES.  It is possible for the remaining
memory in a node to be less than MAX_ORDER_NR_PAGES, meaning there is not
enough room for ZONE_MOVABLE on that node.

Unfortunately this condition is not checked for.  This leads to
zone_movable_pfn[] getting set to a pfn greater than the last pfn in a
node.
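
A hypothetical worked example (the numbers are illustrative, not taken from the
report below), assuming MAX_ORDER_NR_PAGES == 1024 (4MB with 4kB pages):

	unsigned long node_end_pfn  = 0x23c0;	/* one past the last pfn of the node */
	unsigned long movable_start = 0x2100;	/* remaining memory on this node */

	movable_start = roundup(movable_start, 1024);	/* -> 0x2400, beyond node_end_pfn */
	/* without the check added below, ZONE_MOVABLE now "starts" past the node's end */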

calculate_node_totalpages() then sets zone->present_pages to be greater
than zone->spanned_pages which is invalid, as spanned_pages represents the
maximum number of pages in a zone assuming no holes.

Subsequently it is possible that free_area_init_core() will observe a zone of
size zero which nevertheless has present pages.  In this case it will skip
setting up the zone, including the initialisation of its free_lists[].

However, populated_zone() checks zone->present_pages to see if a zone has
memory available.  This is used by iterators such as walk_zones_in_node().
pagetypeinfo_showfree() uses this to walk the free_list of each zone in each
node, which are assumed to be initialised because the zone is not empty.  As
free_area_init_core() never initialised free_lists[], this results in the
following kernel crash when trying to read /proc/pagetypeinfo:

[   67.534914] BUG: kernel NULL pointer dereference, address: 0000000000000000
[   67.535429] #PF: supervisor read access in kernel mode
[   67.535789] #PF: error_code(0x0000) - not-present page
[   67.536128] PGD 0 P4D 0
[   67.536305] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC NOPTI
[   67.536696] CPU: 0 PID: 456 Comm: cat Not tainted 5.16.0 #461
[   67.537096] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
[   67.537638] RIP: 0010:pagetypeinfo_show+0x163/0x460
[   67.537992] Code: 9e 82 e8 80 57 0e 00 49 8b 06 b9 01 00 00 00 4c 39 f0 75 16 e9 65 02 00 00 48 83 c1 01 48 81 f9 a0 86 01 00 0f 84 48 02 00 00 <48> 8b 00 4c 39 f0 75 e7 48 c7 c2 80 a2 e2 82 48 c7 c6 79 ef e3 82
[   67.538259] RSP: 0018:ffffc90001c4bd10 EFLAGS: 00010003
[   67.538259] RAX: 0000000000000000 RBX: ffff88801105f638 RCX: 0000000000000001
[   67.538259] RDX: 0000000000000001 RSI: 000000000000068b RDI: ffff8880163dc68b
[   67.538259] RBP: ffffc90001c4bd90 R08: 0000000000000001 R09: ffff8880163dc67e
[   67.538259] R10: 656c6261766f6d6e R11: 6c6261766f6d6e55 R12: ffff88807ffb4a00
[   67.538259] R13: ffff88807ffb49f8 R14: ffff88807ffb4580 R15: ffff88807ffb3000
[   67.538259] FS:  00007f9c83eff5c0(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000
[   67.538259] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   67.538259] CR2: 0000000000000000 CR3: 0000000013c8e000 CR4: 0000000000350ef0
[   67.538259] Call Trace:
[   67.538259]  <TASK>
[   67.538259]  seq_read_iter+0x128/0x460
[   67.538259]  ? aa_file_perm+0x1af/0x5f0
[   67.538259]  proc_reg_read_iter+0x51/0x80
[   67.538259]  ? lock_is_held_type+0xea/0x140
[   67.538259]  new_sync_read+0x113/0x1a0
[   67.538259]  vfs_read+0x136/0x1d0
[   67.538259]  ksys_read+0x70/0xf0
[   67.538259]  __x64_sys_read+0x1a/0x20
[   67.538259]  do_syscall_64+0x3b/0xc0
[   67.538259]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[   67.538259] RIP: 0033:0x7f9c83e23cce
[   67.538259] Code: c0 e9 b6 fe ff ff 50 48 8d 3d 6e 13 0a 00 e8 c9 e3 01 00 66 0f 1f 84 00 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 0f 05 <48> 3d 00 f0 ff ff 77 5a c3 66 0f 1f 84 00 00 00 00 00 48 83 ec 28
[   67.538259] RSP: 002b:00007fff116e1a08 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[   67.538259] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007f9c83e23cce
[   67.538259] RDX: 0000000000020000 RSI: 00007f9c83a2c000 RDI: 0000000000000003
[   67.538259] RBP: 00007f9c83a2c000 R08: 00007f9c83a2b010 R09: 0000000000000000
[   67.538259] R10: 00007f9c83f2d7d0 R11: 0000000000000246 R12: 0000000000000000
[   67.538259] R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000020000
[   67.538259]  </TASK>

Fix this by checking that the aligned zone_movable_pfn[] does not exceed the
end of the node; if it does, skip creating a movable zone on that node.

Link: https://lkml.kernel.org/r/20220215025831.2113067-1-apopple@nvidia.com
Fixes: 2a1e274acf0b ("Create the ZONE_MOVABLE zone")
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_alloc.c |    9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

--- a/mm/page_alloc.c~mm-pages_allocc-dont-create-zone_movable-beyond-the-end-of-a-node
+++ a/mm/page_alloc.c
@@ -7951,10 +7951,17 @@ restart:
 
 out2:
 	/* Align start of ZONE_MOVABLE on all nids to MAX_ORDER_NR_PAGES */
-	for (nid = 0; nid < MAX_NUMNODES; nid++)
+	for (nid = 0; nid < MAX_NUMNODES; nid++) {
+		unsigned long start_pfn, end_pfn;
+
 		zone_movable_pfn[nid] =
 			roundup(zone_movable_pfn[nid], MAX_ORDER_NR_PAGES);
 
+		get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
+		if (zone_movable_pfn[nid] >= end_pfn)
+			zone_movable_pfn[nid] = 0;
+	}
+
 out:
 	/* restore the node_state */
 	node_states[N_MEMORY] = saved_node_state;
_


* [patch 111/227] mm: invalidate hwpoison page cache page in fault path
       [not found] <20220322143803.04a5e59a07e48284f196a2f9@linux-foundation.org>
                   ` (2 preceding siblings ...)
  2022-03-22 21:43 ` [patch 097/227] mm/pages_alloc.c: don't create ZONE_MOVABLE beyond the end of a node Andrew Morton
@ 2022-03-22 21:44 ` Andrew Morton
  2022-03-22 21:45 ` [patch 147/227] mempolicy: mbind_range() set_policy() after vma_merge() Andrew Morton
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 12+ messages in thread
From: Andrew Morton @ 2022-03-22 21:44 UTC (permalink / raw)
  To: willy, stable, osalvador, naoya.horiguchi, mgorman, linmiaohe,
	jhubbard, hannes, riel, akpm, patches, linux-mm, mm-commits,
	torvalds, akpm

From: Rik van Riel <riel@surriel.com>
Subject: mm: invalidate hwpoison page cache page in fault path

Sometimes the page offlining code can leave behind a hwpoisoned clean page
cache page.  This can lead to programs being killed over and over and over
again as they fault in the hwpoisoned page, get killed, and then get
re-spawned by whatever wanted to run them.

This is particularly embarrassing when the page was offlined due to having
too many corrected memory errors.  Now we are killing tasks due to them
trying to access memory that probably isn't even corrupted.

This problem can be avoided by invalidating the page from the page fault
handler, which already has a branch for dealing with these kinds of pages.
With this patch we simply pretend the page fault was successful if the
page was invalidated, return to userspace, incur another page fault, read
in the file from disk (to a new memory page), and then everything works
again.
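
For context, invalidate_inode_page() only drops clean, unmapped page cache
pages; a rough sketch of its checks (simplified from mm/truncate.c, not the
verbatim code):

	/* returns non-zero only if the page could be removed from the page cache */
	if (!mapping)
		return 0;
	if (PageDirty(page) || PageWriteback(page))
		return 0;
	if (page_mapped(page))
		return 0;
	/* ...otherwise try to drop the clean page from the page cache... */

That is why the fault path below only pretends success when the invalidation
actually removed the page; otherwise it still returns VM_FAULT_HWPOISON.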

Link: https://lkml.kernel.org/r/20220212213740.423efcea@imladris.surriel.com
Signed-off-by: Rik van Riel <riel@surriel.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memory.c |    9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

--- a/mm/memory.c~mm-clean-up-hwpoison-page-cache-page-in-fault-path
+++ a/mm/memory.c
@@ -3877,11 +3877,16 @@ static vm_fault_t __do_fault(struct vm_f
 		return ret;
 
 	if (unlikely(PageHWPoison(vmf->page))) {
-		if (ret & VM_FAULT_LOCKED)
+		vm_fault_t poisonret = VM_FAULT_HWPOISON;
+		if (ret & VM_FAULT_LOCKED) {
+			/* Retry if a clean page was removed from the cache. */
+			if (invalidate_inode_page(vmf->page))
+				poisonret = 0;
 			unlock_page(vmf->page);
+		}
 		put_page(vmf->page);
 		vmf->page = NULL;
-		return VM_FAULT_HWPOISON;
+		return poisonret;
 	}
 
 	if (unlikely(!(ret & VM_FAULT_LOCKED)))
_


* [patch 147/227] mempolicy: mbind_range() set_policy() after vma_merge()
       [not found] <20220322143803.04a5e59a07e48284f196a2f9@linux-foundation.org>
                   ` (3 preceding siblings ...)
  2022-03-22 21:44 ` [patch 111/227] mm: invalidate hwpoison page cache page in fault path Andrew Morton
@ 2022-03-22 21:45 ` Andrew Morton
  2022-03-22 21:46 ` [patch 162/227] mm: madvise: return correct bytes advised with process_madvise Andrew Morton
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 12+ messages in thread
From: Andrew Morton @ 2022-03-22 21:45 UTC (permalink / raw)
  To: vbabka, stable, oleg, Liam.Howlett, hughd, akpm, patches,
	linux-mm, mm-commits, torvalds, akpm

From: Hugh Dickins <hughd@google.com>
Subject: mempolicy: mbind_range() set_policy() after vma_merge()

v2.6.34 commit 9d8cebd4bcd7 ("mm: fix mbind vma merge problem") introduced
vma_merge() to mbind_range(); but unlike madvise, mlock and mprotect, it used a
"continue" to the next vma, where its precedents instead update the flags on the
current vma before advancing: that left the vma with the wrong setting in the
infamous vma_merge() case 8.

v3.10 commit 1444f92c8498 ("mm: merging memory blocks resets mempolicy")
tried to fix that in vma_adjust(), without fully understanding the issue.

v3.11 commit 3964acd0dbec ("mm: mempolicy: fix mbind_range() &&
vma_adjust() interaction") reverted that, and went about the fix in the
right way, but chose to optimize out an unnecessary mpol_dup() with a
prior mpol_equal() test.  But on tmpfs, that also pessimized out the vital
call to its ->set_policy(), leaving the new mbind unenforced.

The user-visible effect was that the pages got allocated on the local node
(which happened to be node 0) after the mbind() caller had specifically asked
for them to be allocated on node 1.  No page migration was involved in the
reported case: the pages simply got allocated on the wrong node.
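
For illustration, a rough user-space sketch of how such placement can be
checked (this is not the original reproducer for the vma_merge() case 8
scenario; it assumes libnuma's <numaif.h> and a machine that actually has a
node 1):

        #define _GNU_SOURCE
        #include <numaif.h>        /* mbind(), move_pages() */
        #include <sys/mman.h>
        #include <unistd.h>
        #include <stdio.h>
        #include <assert.h>

        int main(void)
        {
                long psz = sysconf(_SC_PAGESIZE);
                unsigned long nodemask = 1UL << 1;      /* node 1 only */
                void *p = mmap(NULL, psz, PROT_READ | PROT_WRITE,
                               MAP_SHARED | MAP_ANONYMOUS, -1, 0);
                int status = -1;

                assert(p != MAP_FAILED);
                /* bind the (shmem-backed) mapping to node 1, then fault a page in */
                assert(mbind(p, psz, MPOL_BIND, &nodemask, 8 * sizeof(nodemask), 0) == 0);
                *(char *)p = 1;

                /* ask the kernel which node the page actually landed on */
                assert(move_pages(0, 1, &p, NULL, &status, 0) == 0);
                printf("page is on node %d\n", status);  /* expect 1 */
                return 0;
        }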

Just delete that optimization now (though it could be made conditional on
vma not having a set_policy).  Also remove the "next" variable: it turned
out to be blameless, but also pointless.

Link: https://lkml.kernel.org/r/319e4db9-64ae-4bca-92f0-ade85d342ff@google.com
Fixes: 3964acd0dbec ("mm: mempolicy: fix mbind_range() && vma_adjust() interaction")
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/mempolicy.c |    8 +-------
 1 file changed, 1 insertion(+), 7 deletions(-)

--- a/mm/mempolicy.c~mempolicy-mbind_range-set_policy-after-vma_merge
+++ a/mm/mempolicy.c
@@ -786,7 +786,6 @@ static int vma_replace_policy(struct vm_
 static int mbind_range(struct mm_struct *mm, unsigned long start,
 		       unsigned long end, struct mempolicy *new_pol)
 {
-	struct vm_area_struct *next;
 	struct vm_area_struct *prev;
 	struct vm_area_struct *vma;
 	int err = 0;
@@ -801,8 +800,7 @@ static int mbind_range(struct mm_struct
 	if (start > vma->vm_start)
 		prev = vma;
 
-	for (; vma && vma->vm_start < end; prev = vma, vma = next) {
-		next = vma->vm_next;
+	for (; vma && vma->vm_start < end; prev = vma, vma = vma->vm_next) {
 		vmstart = max(start, vma->vm_start);
 		vmend   = min(end, vma->vm_end);
 
@@ -817,10 +815,6 @@ static int mbind_range(struct mm_struct
 				 anon_vma_name(vma));
 		if (prev) {
 			vma = prev;
-			next = vma->vm_next;
-			if (mpol_equal(vma_policy(vma), new_pol))
-				continue;
-			/* vma_merge() joined vma && vma->next, case 8 */
 			goto replace;
 		}
 		if (vma->vm_start != vmstart) {
_


* [patch 162/227] mm: madvise: return correct bytes advised with process_madvise
       [not found] <20220322143803.04a5e59a07e48284f196a2f9@linux-foundation.org>
                   ` (4 preceding siblings ...)
  2022-03-22 21:45 ` [patch 147/227] mempolicy: mbind_range() set_policy() after vma_merge() Andrew Morton
@ 2022-03-22 21:46 ` Andrew Morton
  2022-03-22 21:46 ` [patch 163/227] mm: madvise: skip unmapped vma holes passed to process_madvise Andrew Morton
  2022-03-22 21:47 ` [patch 179/227] mm: only re-generate demotion targets when a numa node changes its N_CPU state Andrew Morton
  7 siblings, 0 replies; 12+ messages in thread
From: Andrew Morton @ 2022-03-22 21:46 UTC (permalink / raw)
  To: vbabka, surenb, stable, sfr, rientjes, nadav.amit, minchan,
	mhocko, quic_charante, akpm, patches, linux-mm, mm-commits,
	torvalds, akpm

From: Charan Teja Kalla <quic_charante@quicinc.com>
Subject: mm: madvise: return correct bytes advised with process_madvise

Patch series "mm: madvise: return correct bytes processed with
process_madvise", v2.  With the process_madvise(), always choose to return
non zero processed bytes over an error.  This can help the user to know on
which VMA, passed in the 'struct iovec' vector list, is failed to advise
thus can take the decission of retrying/skipping on that VMA.


This patch (of 2):

The process_madvise() system call returns an error even after processing some of
the VMAs passed in the 'struct iovec' vector list, which leaves the user confused
about where to restart the advice.  This also contradicts the syscall's man
page[1], which says the "return value may be less than the total number of
requested bytes, if an error occurred after some iovec elements were already
processed.".

Consider a user passing 10 VMAs in the 'struct iovec' vector list, of which 9 are
processed and one fails.  The syscall then just returns the error caused by the
failing VMA despite the first 9 having been processed, leaving the user confused
about which VMA failed.  Returning the number of bytes processed instead lets the
user know which VMA failed and retry/skip the advice on that VMA.

[1]https://man7.org/linux/man-pages/man2/process_madvise.2.html.
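
A minimal, hypothetical usage sketch showing how the return value is meant to be
read (syscall numbers and the MADV_COLD value are assumed for recent
kernels/headers; error handling omitted):

        #define _GNU_SOURCE
        #include <sys/syscall.h>
        #include <sys/mman.h>
        #include <sys/uio.h>
        #include <unistd.h>
        #include <stdio.h>

        #ifndef __NR_process_madvise
        #define __NR_process_madvise 440
        #endif
        #ifndef __NR_pidfd_open
        #define __NR_pidfd_open 434
        #endif
        #ifndef MADV_COLD
        #define MADV_COLD 20
        #endif

        int main(void)
        {
                long psz = sysconf(_SC_PAGESIZE);
                char *a = mmap(NULL, psz, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                char *b = mmap(NULL, psz, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                struct iovec vec[2] = { { a, psz }, { b, psz } };
                int pidfd = syscall(__NR_pidfd_open, getpid(), 0);
                long ret;

                a[0] = b[0] = 1;        /* fault the pages in */
                ret = syscall(__NR_process_madvise, pidfd, vec, 2, MADV_COLD, 0);

                /*
                 * With this patch, a failure on vec[1] still reports the bytes
                 * already advised for vec[0] instead of discarding that information.
                 */
                printf("advised %ld of %ld bytes\n", ret, 2 * psz);
                return 0;
        }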

Link: https://lkml.kernel.org/r/cover.1647008754.git.quic_charante@quicinc.com
Link: https://lkml.kernel.org/r/125b61a0edcee5c2db8658aed9d06a43a19ccafc.1647008754.git.quic_charante@quicinc.com
Fixes: ecb8ac8b1f14 ("mm/madvise: introduce process_madvise() syscall: an external memory hinting API")
Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/madvise.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

--- a/mm/madvise.c~mm-madvise-return-correct-bytes-advised-with-process_madvise
+++ a/mm/madvise.c
@@ -1435,8 +1435,7 @@ SYSCALL_DEFINE5(process_madvise, int, pi
 		iov_iter_advance(&iter, iovec.iov_len);
 	}
 
-	if (ret == 0)
-		ret = total_len - iov_iter_count(&iter);
+	ret = (total_len - iov_iter_count(&iter)) ? : ret;
 
 release_mm:
 	mmput(mm);
_


* [patch 163/227] mm: madvise: skip unmapped vma holes passed to process_madvise
       [not found] <20220322143803.04a5e59a07e48284f196a2f9@linux-foundation.org>
                   ` (5 preceding siblings ...)
  2022-03-22 21:46 ` [patch 162/227] mm: madvise: return correct bytes advised with process_madvise Andrew Morton
@ 2022-03-22 21:46 ` Andrew Morton
  2022-03-23  0:24   ` Minchan Kim
  2022-03-22 21:47 ` [patch 179/227] mm: only re-generate demotion targets when a numa node changes its N_CPU state Andrew Morton
  7 siblings, 1 reply; 12+ messages in thread
From: Andrew Morton @ 2022-03-22 21:46 UTC (permalink / raw)
  To: vbabka, surenb, stable, sfr, rientjes, nadav.amit, minchan,
	mhocko, quic_charante, akpm, patches, linux-mm, mm-commits,
	torvalds, akpm

From: Charan Teja Kalla <quic_charante@quicinc.com>
Subject: mm: madvise: skip unmapped vma holes passed to process_madvise

The process_madvise() system call is expected to skip holes in the VMA ranges
passed through the 'struct iovec' vector list.  But do_madvise(), which
process_madvise() calls for each range, returns ENOMEM when the range contains
unmapped holes, even though the mapped VMAs within it have been processed.

Thus process_madvise() should treat ENOMEM as expected, consider the passed
range as processed, and continue processing the other ranges in the vector
list.  Returning -ENOMEM to the user even though the VMAs were processed leaves
the user unable to figure out where to start the next madvise.
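
A rough user-space illustration of the hole case (syscall numbers and the
MADV_COLD value are assumed, as in a typical process_madvise() caller; error
handling omitted):

        #define _GNU_SOURCE
        #include <sys/syscall.h>
        #include <sys/mman.h>
        #include <sys/uio.h>
        #include <unistd.h>
        #include <stdio.h>

        #ifndef __NR_process_madvise
        #define __NR_process_madvise 440
        #endif
        #ifndef __NR_pidfd_open
        #define __NR_pidfd_open 434
        #endif
        #ifndef MADV_COLD
        #define MADV_COLD 20
        #endif

        int main(void)
        {
                long psz = sysconf(_SC_PAGESIZE);
                char *p = mmap(NULL, 3 * psz, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                struct iovec vec = { p, 3 * psz };
                int pidfd = syscall(__NR_pidfd_open, getpid(), 0);
                long ret;

                p[0] = p[2 * psz] = 1;          /* fault in the first and last page */
                munmap(p + psz, psz);           /* punch an unmapped hole in the middle */

                ret = syscall(__NR_process_madvise, pidfd, &vec, 1, MADV_COLD, 0);
                /*
                 * Without this patch the hole makes the whole call fail with ENOMEM;
                 * with it the hole is skipped and the iovec is counted as processed.
                 */
                printf("ret = %ld\n", ret);
                return 0;
        }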

Link: https://lkml.kernel.org/r/4f091776142f2ebf7b94018146de72318474e686.1647008754.git.quic_charante@quicinc.com
Fixes: ecb8ac8b1f14 ("mm/madvise: introduce process_madvise() syscall: an external memory hinting API")
Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/madvise.c |    9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

--- a/mm/madvise.c~mm-madvise-skip-unmapped-vma-holes-passed-to-process_madvise
+++ a/mm/madvise.c
@@ -1428,9 +1428,16 @@ SYSCALL_DEFINE5(process_madvise, int, pi
 
 	while (iov_iter_count(&iter)) {
 		iovec = iov_iter_iovec(&iter);
+		/*
+		 * do_madvise returns ENOMEM if unmapped holes are present
+		 * in the passed VMA. process_madvise() is expected to skip
+		 * unmapped holes passed to it in the 'struct iovec' list
+		 * and not fail because of them. Thus treat -ENOMEM return
+		 * from do_madvise as valid and continue processing.
+		 */
 		ret = do_madvise(mm, (unsigned long)iovec.iov_base,
 					iovec.iov_len, behavior);
-		if (ret < 0)
+		if (ret < 0 && ret != -ENOMEM)
 			break;
 		iov_iter_advance(&iter, iovec.iov_len);
 	}
_


* [patch 179/227] mm: only re-generate demotion targets when a numa node changes its N_CPU state
       [not found] <20220322143803.04a5e59a07e48284f196a2f9@linux-foundation.org>
                   ` (6 preceding siblings ...)
  2022-03-22 21:46 ` [patch 163/227] mm: madvise: skip unmapped vma holes passed to process_madvise Andrew Morton
@ 2022-03-22 21:47 ` Andrew Morton
  7 siblings, 0 replies; 12+ messages in thread
From: Andrew Morton @ 2022-03-22 21:47 UTC (permalink / raw)
  To: ying.huang, stable, huntbag, dave.hansen, baolin.wang, osalvador,
	akpm, patches, linux-mm, mm-commits, torvalds, akpm

From: Oscar Salvador <osalvador@suse.de>
Subject: mm: only re-generate demotion targets when a numa node changes its N_CPU state

Abhishek reported that after patch [1], hotplug operations are taking
~double the expected time.  [2]

The reason is that the CPU callbacks that migrate_on_reclaim_init() registers
always call set_migration_target_nodes() whenever a CPU is brought up or down.

But we only care about NUMA nodes going from having CPUs to becoming CPU-less,
and vice versa, as that is what influences the demotion target order.

We already have two CPU callbacks (vmstat_cpu_online() and vmstat_cpu_dead())
that check exactly that, so get rid of the CPU callbacks in
migrate_on_reclaim_init() and only call set_migration_target_nodes() from
vmstat_cpu_{dead,online}() whenever a NUMA node changes its N_CPU state.

[1] https://lore.kernel.org/linux-mm/20210721063926.3024591-2-ying.huang@intel.com/
[2] https://lore.kernel.org/linux-mm/eb438ddd-2919-73d4-bd9f-b7eecdd9577a@linux.vnet.ibm.com/

[osalvador@suse.de: add feedback from Huang Ying]
  Link: https://lkml.kernel.org/r/20220314150945.12694-1-osalvador@suse.de
Link: https://lkml.kernel.org/r/20220310120749.23077-1-osalvador@suse.de
Fixes: 884a6e5d1f93b ("mm/migrate: update node demotion order on hotplug events")
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reported-by: Abhishek Goel <huntbag@linux.vnet.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Abhishek Goel <huntbag@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/migrate.h |    8 ++++++
 mm/migrate.c            |   47 ++++++++------------------------------
 mm/vmstat.c             |   13 +++++++++-
 3 files changed, 30 insertions(+), 38 deletions(-)

--- a/include/linux/migrate.h~mm-only-re-generate-demotion-targets-when-a-numa-node-changes-its-n_cpu-state
+++ a/include/linux/migrate.h
@@ -48,7 +48,15 @@ int folio_migrate_mapping(struct address
 		struct folio *newfolio, struct folio *folio, int extra_count);
 
 extern bool numa_demotion_enabled;
+extern void migrate_on_reclaim_init(void);
+#ifdef CONFIG_HOTPLUG_CPU
+extern void set_migration_target_nodes(void);
 #else
+static inline void set_migration_target_nodes(void) {}
+#endif
+#else
+
+static inline void set_migration_target_nodes(void) {}
 
 static inline void putback_movable_pages(struct list_head *l) {}
 static inline int migrate_pages(struct list_head *l, new_page_t new,
--- a/mm/migrate.c~mm-only-re-generate-demotion-targets-when-a-numa-node-changes-its-n_cpu-state
+++ a/mm/migrate.c
@@ -3209,7 +3209,7 @@ again:
 /*
  * For callers that do not hold get_online_mems() already.
  */
-static void set_migration_target_nodes(void)
+void set_migration_target_nodes(void)
 {
 	get_online_mems();
 	__set_migration_target_nodes();
@@ -3273,51 +3273,24 @@ static int __meminit migrate_on_reclaim_
 	return notifier_from_errno(0);
 }
 
-/*
- * React to hotplug events that might affect the migration targets
- * like events that online or offline NUMA nodes.
- *
- * The ordering is also currently dependent on which nodes have
- * CPUs.  That means we need CPU on/offline notification too.
- */
-static int migration_online_cpu(unsigned int cpu)
-{
-	set_migration_target_nodes();
-	return 0;
-}
-
-static int migration_offline_cpu(unsigned int cpu)
-{
-	set_migration_target_nodes();
-	return 0;
-}
-
-static int __init migrate_on_reclaim_init(void)
+void __init migrate_on_reclaim_init(void)
 {
-	int ret;
-
 	node_demotion = kmalloc_array(nr_node_ids,
 				      sizeof(struct demotion_nodes),
 				      GFP_KERNEL);
 	WARN_ON(!node_demotion);
 
-	ret = cpuhp_setup_state_nocalls(CPUHP_MM_DEMOTION_DEAD, "mm/demotion:offline",
-					NULL, migration_offline_cpu);
+	hotplug_memory_notifier(migrate_on_reclaim_callback, 100);
 	/*
-	 * In the unlikely case that this fails, the automatic
-	 * migration targets may become suboptimal for nodes
-	 * where N_CPU changes.  With such a small impact in a
-	 * rare case, do not bother trying to do anything special.
+	 * At this point, all numa nodes with memory/CPUs have their state
+	 * properly set, so we can build the demotion order now.
+	 * Let us hold the cpu_hotplug lock just in case, as we could possibly
+	 * have CPU hotplug events during boot.
 	 */
-	WARN_ON(ret < 0);
-	ret = cpuhp_setup_state(CPUHP_AP_MM_DEMOTION_ONLINE, "mm/demotion:online",
-				migration_online_cpu, NULL);
-	WARN_ON(ret < 0);
-
-	hotplug_memory_notifier(migrate_on_reclaim_callback, 100);
-	return 0;
+	cpus_read_lock();
+	set_migration_target_nodes();
+	cpus_read_unlock();
 }
-late_initcall(migrate_on_reclaim_init);
 #endif /* CONFIG_HOTPLUG_CPU */
 
 bool numa_demotion_enabled = false;
--- a/mm/vmstat.c~mm-only-re-generate-demotion-targets-when-a-numa-node-changes-its-n_cpu-state
+++ a/mm/vmstat.c
@@ -28,6 +28,7 @@
 #include <linux/mm_inline.h>
 #include <linux/page_ext.h>
 #include <linux/page_owner.h>
+#include <linux/migrate.h>
 
 #include "internal.h"
 
@@ -2049,7 +2050,12 @@ static void __init init_cpu_node_state(v
 static int vmstat_cpu_online(unsigned int cpu)
 {
 	refresh_zone_stat_thresholds();
-	node_set_state(cpu_to_node(cpu), N_CPU);
+
+	if (!node_state(cpu_to_node(cpu), N_CPU)) {
+		node_set_state(cpu_to_node(cpu), N_CPU);
+		set_migration_target_nodes();
+	}
+
 	return 0;
 }
 
@@ -2072,6 +2078,8 @@ static int vmstat_cpu_dead(unsigned int
 		return 0;
 
 	node_clear_state(node, N_CPU);
+	set_migration_target_nodes();
+
 	return 0;
 }
 
@@ -2103,6 +2111,9 @@ void __init init_mm_internals(void)
 
 	start_shepherd_timer();
 #endif
+#if defined(CONFIG_MIGRATION) && defined(CONFIG_HOTPLUG_CPU)
+	migrate_on_reclaim_init();
+#endif
 #ifdef CONFIG_PROC_FS
 	proc_create_seq("buddyinfo", 0444, NULL, &fragmentation_op);
 	proc_create_seq("pagetypeinfo", 0400, NULL, &pagetypeinfo_op);
_


* Re: [patch 163/227] mm: madvise: skip unmapped vma holes passed to process_madvise
  2022-03-22 21:46 ` [patch 163/227] mm: madvise: skip unmapped vma holes passed to process_madvise Andrew Morton
@ 2022-03-23  0:24   ` Minchan Kim
  2022-03-23  2:08     ` Linus Torvalds
  2022-03-23  8:28     ` Michal Hocko
  0 siblings, 2 replies; 12+ messages in thread
From: Minchan Kim @ 2022-03-23  0:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: vbabka, surenb, stable, sfr, rientjes, nadav.amit, mhocko,
	quic_charante, patches, linux-mm, mm-commits, torvalds

On Tue, Mar 22, 2022 at 02:46:48PM -0700, Andrew Morton wrote:
> From: Charan Teja Kalla <quic_charante@quicinc.com>
> Subject: mm: madvise: skip unmapped vma holes passed to process_madvise
> 
> The process_madvise() system call is expected to skip holes in vma passed
> through 'struct iovec' vector list.  But do_madvise, which
> process_madvise() calls for each vma, returns ENOMEM in case of unmapped
> holes, despite the VMA is processed.
> 
> Thus process_madvise() should treat ENOMEM as expected and consider the
> VMA passed to as processed and continue processing other vma's in the
> vector list.  Returning -ENOMEM to user, despite the VMA is processed,
> will be unable to figure out where to start the next madvise.
> 
> Link: https://lkml.kernel.org/r/4f091776142f2ebf7b94018146de72318474e686.1647008754.git.quic_charante@quicinc.com

I thought it was still under discussion and Charan will post next
version along with previous patch
"mm: madvise: return correct bytes advised with process_madvise"

https://lore.kernel.org/linux-mm/7207b2f5-6b3e-aea4-aa1b-9c6d849abe34@quicinc.com/


> Fixes: ecb8ac8b1f14("mm/madvise: introduce process_madvise() syscall: an external memory hinting API")
> Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Nadav Amit <nadav.amit@gmail.com>
> Cc: Stephen Rothwell <sfr@canb.auug.org.au>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
> 
>  mm/madvise.c |    9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> --- a/mm/madvise.c~mm-madvise-skip-unmapped-vma-holes-passed-to-process_madvise
> +++ a/mm/madvise.c
> @@ -1428,9 +1428,16 @@ SYSCALL_DEFINE5(process_madvise, int, pi
>  
>  	while (iov_iter_count(&iter)) {
>  		iovec = iov_iter_iovec(&iter);
> +		/*
> +		 * do_madvise returns ENOMEM if unmapped holes are present
> +		 * in the passed VMA. process_madvise() is expected to skip
> +		 * unmapped holes passed to it in the 'struct iovec' list
> +		 * and not fail because of them. Thus treat -ENOMEM return
> +		 * from do_madvise as valid and continue processing.
> +		 */
>  		ret = do_madvise(mm, (unsigned long)iovec.iov_base,
>  					iovec.iov_len, behavior);
> -		if (ret < 0)
> +		if (ret < 0 && ret != -ENOMEM)
>  			break;
>  		iov_iter_advance(&iter, iovec.iov_len);
>  	}
> _


* Re: [patch 163/227] mm: madvise: skip unmapped vma holes passed to process_madvise
  2022-03-23  0:24   ` Minchan Kim
@ 2022-03-23  2:08     ` Linus Torvalds
  2022-03-23  8:28     ` Michal Hocko
  1 sibling, 0 replies; 12+ messages in thread
From: Linus Torvalds @ 2022-03-23  2:08 UTC (permalink / raw)
  To: Minchan Kim, Charan Teja Kalla
  Cc: Andrew Morton, Vlastimil Babka, Suren Baghdasaryan, stable,
	Stephen Rothwell, David Rientjes, Nadav Amit, Michal Hocko,
	patches, Linux-MM, mm-commits

On Tue, Mar 22, 2022 at 5:25 PM Minchan Kim <minchan@kernel.org> wrote:
>
> I thought it was still under discussion and Charan will post next
> version along with previous patch
> "mm: madvise: return correct bytes advised with process_madvise"
>
> https://lore.kernel.org/linux-mm/7207b2f5-6b3e-aea4-aa1b-9c6d849abe34@quicinc.com/

Hmm. It's merged now, as commit 08095d6310a7.

So any fixes please do it on top of that existing state ;(

            Linus


* Re: [patch 163/227] mm: madvise: skip unmapped vma holes passed to process_madvise
  2022-03-23  0:24   ` Minchan Kim
  2022-03-23  2:08     ` Linus Torvalds
@ 2022-03-23  8:28     ` Michal Hocko
  2022-03-23 15:47       ` Charan Teja Kalla
  1 sibling, 1 reply; 12+ messages in thread
From: Michal Hocko @ 2022-03-23  8:28 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, vbabka, surenb, stable, sfr, rientjes, nadav.amit,
	quic_charante, patches, linux-mm, mm-commits, torvalds

On Tue 22-03-22 17:24:58, Minchan Kim wrote:
> On Tue, Mar 22, 2022 at 02:46:48PM -0700, Andrew Morton wrote:
> > From: Charan Teja Kalla <quic_charante@quicinc.com>
> > Subject: mm: madvise: skip unmapped vma holes passed to process_madvise
> > 
> > The process_madvise() system call is expected to skip holes in vma passed
> > through 'struct iovec' vector list.  But do_madvise, which
> > process_madvise() calls for each vma, returns ENOMEM in case of unmapped
> > holes, despite the VMA is processed.
> > 
> > Thus process_madvise() should treat ENOMEM as expected and consider the
> > VMA passed to as processed and continue processing other vma's in the
> > vector list.  Returning -ENOMEM to user, despite the VMA is processed,
> > will be unable to figure out where to start the next madvise.
> > 
> > Link: https://lkml.kernel.org/r/4f091776142f2ebf7b94018146de72318474e686.1647008754.git.quic_charante@quicinc.com
> 
> I thought it was still under discussion and Charan will post next
> version along with previous patch
> "mm: madvise: return correct bytes advised with process_madvise"
> 
> https://lore.kernel.org/linux-mm/7207b2f5-6b3e-aea4-aa1b-9c6d849abe34@quicinc.com/

Yes, I am not even sure the new semantics are sensible[1]. We should discuss
that and see all the consequences. Changing the semantics of an existing
syscall is always tricky; going back and forth is even worse.
-- 
Michal Hocko
SUSE Labs


* Re: [patch 163/227] mm: madvise: skip unmapped vma holes passed to process_madvise
  2022-03-23  8:28     ` Michal Hocko
@ 2022-03-23 15:47       ` Charan Teja Kalla
  0 siblings, 0 replies; 12+ messages in thread
From: Charan Teja Kalla @ 2022-03-23 15:47 UTC (permalink / raw)
  To: Michal Hocko, linux-kernel
  Cc: Andrew Morton, vbabka, surenb, stable, sfr, rientjes, nadav.amit,
	patches, linux-mm, mm-commits, torvalds



On 3/23/2022 1:58 PM, Michal Hocko wrote:
> On Tue 22-03-22 17:24:58, Minchan Kim wrote:
>> On Tue, Mar 22, 2022 at 02:46:48PM -0700, Andrew Morton wrote:
>>> From: Charan Teja Kalla <quic_charante@quicinc.com>
>>> Subject: mm: madvise: skip unmapped vma holes passed to process_madvise
>>>
>>> The process_madvise() system call is expected to skip holes in vma passed
>>> through 'struct iovec' vector list.  But do_madvise, which
>>> process_madvise() calls for each vma, returns ENOMEM in case of unmapped
>>> holes, despite the VMA is processed.
>>>
>>> Thus process_madvise() should treat ENOMEM as expected and consider the
>>> VMA passed to as processed and continue processing other vma's in the
>>> vector list.  Returning -ENOMEM to user, despite the VMA is processed,
>>> will be unable to figure out where to start the next madvise.
>>>
>>> Link: https://lkml.kernel.org/r/4f091776142f2ebf7b94018146de72318474e686.1647008754.git.quic_charante@quicinc.com
>> I thought it was still under discussion and Charan will post next
>> version along with previous patch
>> "mm: madvise: return correct bytes advised with process_madvise"
>>
>> https://lore.kernel.org/linux-mm/7207b2f5-6b3e-aea4-aa1b-9c6d849abe34@quicinc.com/
> Yes, I am not even sure the new semantic is sensible[1]. We should discuss
> that and see all the consequences. Changing the semantic of an existing
> syscall is always tricky going back and forth is even worse.

Starting the discussion @
https://lore.kernel.org/linux-mm/cover.1648046642.git.quic_charante@quicinc.com/

Thanks,
Charan


