* [PATCH v17 00/21] per memcg lru lock
@ 2020-07-25 12:59 Alex Shi
2020-07-25 12:59 ` [PATCH v17 01/21] mm/vmscan: remove unnecessary lruvec adding Alex Shi
` (14 more replies)
0 siblings, 15 replies; 101+ messages in thread
From: Alex Shi @ 2020-07-25 12:59 UTC (permalink / raw)
To: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
mgorman-3eNAlZScCAx27rWaFMvyedHuzzzSOjJt,
tj-DgEjT+Ai2ygdnm+yROfE0A, hughd-hpIqsD4AKlfQT0dZR+AlfA,
khlebnikov-XoJtRXgx1JseBXzfvpsJ4g,
daniel.m.jordan-QHcLZuEGTsvQT0dZR+AlfA,
yang.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf,
willy-wEGCiKHe2LqWVfeAwA7xHQ, hannes-druUgvl0LCNAfugRpC6u6w,
lkp-ral2JQCrhuEAvxtiuMwx3w, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
cgroups-u79uwXL29TY76Z2rM5mHXA, shakeelb-hpIqsD4AKlfQT0dZR+AlfA,
iamjoonsoo.kim-Hm3cg6mZ9cc,
richard.weiyang-Re5JQEeQqe8AvxtiuMwx3w,
kirill-oKw7cIdHH8eLwutG50LtGA,
alexander.duyck-Re5JQEeQqe8AvxtiuMwx3w,
rong.a.chen-ral2JQCrhuEAvxtiuMwx3w
This new version is based on v5.8-rc6. It includes Hugh Dickins's fix in
mm/swap.c and the mm/mlock.c fix that Alexander Duyck pointed out, and
removes 'mm/mlock: reorder isolation sequence during munlock'.
Hi Johannes, Hugh, Alexander & Willy,
Would you like to give a Reviewed-by, since you addressed many of the
issues and gave lots of suggestions? Many thanks!
The current lru_lock is one per node, pgdat->lru_lock, which guards the
lru lists; but the lru lists were moved into memcg a long time ago. Still
using a per-node lru_lock is clearly unscalable: pages of different memcgs
have to compete with each other for a single lru_lock. This patchset uses a
per-lruvec (per memcg, per node) lru_lock in place of the per-node lru lock
to guard the lru lists, making them scalable across memcgs and gaining
performance.
Currently lru_lock still guards both the lru list and the page's lru bit;
that is fine. But if we want to take a page's specific lruvec lock, we need
to pin down the page's lruvec/memcg while locking: just taking the lruvec
lock first can be undermined by a concurrent memcg charge/migration of the
page. To fix this, we split out the clearing of the page's lru bit and use
it as the pinning action that blocks memcg changes. That is the reason for
the new atomic function TestClearPageLRU. So isolating a page now needs
both actions: TestClearPageLRU and holding the lru_lock.
The typical user of this is isolate_migratepages_block() in compaction.c:
we have to clear the lru bit before taking the lru lock, which serializes
page isolation against memcg page charge/migration, since either of those
changes the page's lruvec and hence the lru_lock embedded in it.
The above solution was suggested by Johannes Weiner, and this patchset is
built on his new memcg charge path. (Hugh Dickins tested and contributed
much code, from the compaction fix to general code polish; thanks a lot!)
The patchset consists of 3 parts:
1, some code cleanup and minimal optimization as preparation.
2, use TestClearPageLRU as the precondition for page isolation.
3, replace the per-node lru_lock with a per-memcg, per-node lru_lock.
Following Daniel Jordan's suggestion, I ran 208 'dd' tasks in 104
containers on a 2-socket * 26-core * HT box with a modified case:
https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice
With this patchset, readtwice performance increased by about 80% with
concurrent containers.
Thanks to Hugh Dickins and Konstantin Khlebnikov, who both brought up this
idea 8 years ago, and to the others who gave comments as well: Daniel
Jordan, Mel Gorman, Shakeel Butt, Matthew Wilcox, etc.
Thanks for the testing support from Intel 0day and Rong Chen, Fengguang Wu,
and Yun Wang. Hugh Dickins also shared his kbuild-swap case. Thanks!
Alex Shi (19):
mm/vmscan: remove unnecessary lruvec adding
mm/page_idle: no unlikely double check for idle page counting
mm/compaction: correct the comments of compact_defer_shift
mm/compaction: rename compact_deferred as compact_should_defer
mm/thp: move lru_add_page_tail func to huge_memory.c
mm/thp: clean up lru_add_page_tail
mm/thp: remove code path which never got into
mm/thp: narrow lru locking
mm/memcg: add debug checking in lock_page_memcg
mm/swap: fold vm event PGROTATED into pagevec_move_tail_fn
mm/lru: move lru_lock holding in func lru_note_cost_page
mm/lru: move lock into lru_note_cost
mm/lru: introduce TestClearPageLRU
mm/compaction: do page isolation first in compaction
mm/thp: add tail pages into lru anyway in split_huge_page()
mm/swap: serialize memcg changes in pagevec_lru_move_fn
mm/lru: replace pgdat lru_lock with lruvec lock
mm/lru: introduce the relock_page_lruvec function
mm/pgdat: remove pgdat lru_lock
Hugh Dickins (2):
mm/vmscan: use relock for move_pages_to_lru
mm/lru: revise the comments of lru_lock
Documentation/admin-guide/cgroup-v1/memcg_test.rst | 15 +-
Documentation/admin-guide/cgroup-v1/memory.rst | 21 +--
Documentation/trace/events-kmem.rst | 2 +-
Documentation/vm/unevictable-lru.rst | 22 +--
include/linux/compaction.h | 4 +-
include/linux/memcontrol.h | 98 ++++++++++
include/linux/mm_types.h | 2 +-
include/linux/mmzone.h | 6 +-
include/linux/page-flags.h | 1 +
include/linux/swap.h | 4 +-
include/trace/events/compaction.h | 2 +-
mm/compaction.c | 113 ++++++++----
mm/filemap.c | 4 +-
mm/huge_memory.c | 48 +++--
mm/memcontrol.c | 71 ++++++-
mm/memory.c | 3 -
mm/mlock.c | 43 +++--
mm/mmzone.c | 1 +
mm/page_alloc.c | 1 -
mm/page_idle.c | 8 -
mm/rmap.c | 4 +-
mm/swap.c | 203 ++++++++-------------
mm/swap_state.c | 2 -
mm/vmscan.c | 174 ++++++++++--------
mm/workingset.c | 2 -
25 files changed, 510 insertions(+), 344 deletions(-)
--
1.8.3.1
^ permalink raw reply [flat|nested] 101+ messages in thread* [PATCH v17 01/21] mm/vmscan: remove unnecessary lruvec adding 2020-07-25 12:59 [PATCH v17 00/21] per memcg lru lock Alex Shi @ 2020-07-25 12:59 ` Alex Shi 2020-08-06 3:47 ` Alex Shi 2020-07-25 12:59 ` [PATCH v17 02/21] mm/page_idle: no unlikely double check for idle page counting Alex Shi ` (13 subsequent siblings) 14 siblings, 1 reply; 101+ messages in thread From: Alex Shi @ 2020-07-25 12:59 UTC (permalink / raw) To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi, willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck, rong.a.chen We don't have to add a freeable page into lru and then remove from it. This change saves a couple of actions and makes the moving more clear. The SetPageLRU needs to be kept here for list intergrity. Otherwise: #0 mave_pages_to_lru #1 release_pages if (put_page_testzero()) if !put_page_testzero !PageLRU //skip lru_lock list_add(&page->lru,) list_add(&page->lru,) //corrupt [akpm@linux-foundation.org: coding style fixes] Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org --- mm/vmscan.c | 37 ++++++++++++++++++++++++------------- 1 file changed, 24 insertions(+), 13 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 749d239c62b2..ddb29d813d77 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1856,26 +1856,29 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec, while (!list_empty(list)) { page = lru_to_page(list); VM_BUG_ON_PAGE(PageLRU(page), page); + list_del(&page->lru); if (unlikely(!page_evictable(page))) { - list_del(&page->lru); spin_unlock_irq(&pgdat->lru_lock); putback_lru_page(page); 
spin_lock_irq(&pgdat->lru_lock); continue; } - lruvec = mem_cgroup_page_lruvec(page, pgdat); + /* + * The SetPageLRU needs to be kept here for list intergrity. + * Otherwise: + * #0 mave_pages_to_lru #1 release_pages + * if (put_page_testzero()) + * if !put_page_testzero + * !PageLRU //skip lru_lock + * list_add(&page->lru,) + * list_add(&page->lru,) //corrupt + */ SetPageLRU(page); - lru = page_lru(page); - nr_pages = hpage_nr_pages(page); - update_lru_size(lruvec, lru, page_zonenum(page), nr_pages); - list_move(&page->lru, &lruvec->lists[lru]); - - if (put_page_testzero(page)) { + if (unlikely(put_page_testzero(page))) { __ClearPageLRU(page); __ClearPageActive(page); - del_page_from_lru_list(page, lruvec, lru); if (unlikely(PageCompound(page))) { spin_unlock_irq(&pgdat->lru_lock); @@ -1883,11 +1886,19 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec, spin_lock_irq(&pgdat->lru_lock); } else list_add(&page->lru, &pages_to_free); - } else { - nr_moved += nr_pages; - if (PageActive(page)) - workingset_age_nonresident(lruvec, nr_pages); + + continue; } + + lruvec = mem_cgroup_page_lruvec(page, pgdat); + lru = page_lru(page); + nr_pages = hpage_nr_pages(page); + + update_lru_size(lruvec, lru, page_zonenum(page), nr_pages); + list_add(&page->lru, &lruvec->lists[lru]); + nr_moved += nr_pages; + if (PageActive(page)) + workingset_age_nonresident(lruvec, nr_pages); } /* -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 101+ messages in thread
* Re: [PATCH v17 01/21] mm/vmscan: remove unnecessary lruvec adding 2020-07-25 12:59 ` [PATCH v17 01/21] mm/vmscan: remove unnecessary lruvec adding Alex Shi @ 2020-08-06 3:47 ` Alex Shi 0 siblings, 0 replies; 101+ messages in thread From: Alex Shi @ 2020-08-06 3:47 UTC (permalink / raw) To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi, willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck, rong.a.chen On 2020/7/25 8:59 PM, Alex Shi wrote: > We don't have to add a freeable page into lru and then remove from it. > This change saves a couple of actions and makes the moving more clear. > > The SetPageLRU needs to be kept here for list intergrity. > Otherwise: > #0 mave_pages_to_lru #1 release_pages > if (put_page_testzero()) > if !put_page_testzero > !PageLRU //skip lru_lock > list_add(&page->lru,) > list_add(&page->lru,) //corrupt The race comments should be corrected to this: /* * The SetPageLRU needs to be kept here for list intergrity. 
* Otherwise: * #0 mave_pages_to_lru #1 release_pages * if !put_page_testzero * if (put_page_testzero()) * !PageLRU //skip lru_lock * SetPageLRU() * list_add(&page->lru,) * list_add(&page->lru,) */ > > [akpm@linux-foundation.org: coding style fixes] > Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Johannes Weiner <hannes@cmpxchg.org> > Cc: Tejun Heo <tj@kernel.org> > Cc: Matthew Wilcox <willy@infradead.org> > Cc: Hugh Dickins <hughd@google.com> > Cc: linux-mm@kvack.org > Cc: linux-kernel@vger.kernel.org > --- > mm/vmscan.c | 37 ++++++++++++++++++++++++------------- > 1 file changed, 24 insertions(+), 13 deletions(-) > > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 749d239c62b2..ddb29d813d77 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -1856,26 +1856,29 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec, > while (!list_empty(list)) { > page = lru_to_page(list); > VM_BUG_ON_PAGE(PageLRU(page), page); > + list_del(&page->lru); > if (unlikely(!page_evictable(page))) { > - list_del(&page->lru); > spin_unlock_irq(&pgdat->lru_lock); > putback_lru_page(page); > spin_lock_irq(&pgdat->lru_lock); > continue; > } > - lruvec = mem_cgroup_page_lruvec(page, pgdat); > > + /* > + * The SetPageLRU needs to be kept here for list intergrity. > + * Otherwise: > + * #0 mave_pages_to_lru #1 release_pages > + * if (put_page_testzero()) > + * if !put_page_testzero > + * !PageLRU //skip lru_lock > + * list_add(&page->lru,) > + * list_add(&page->lru,) //corrupt > + */ /* * The SetPageLRU needs to be kept here for list intergrity. 
* Otherwise: * #0 mave_pages_to_lru #1 release_pages * if !put_page_testzero * if (put_page_testzero()) * !PageLRU //skip lru_lock * SetPageLRU() * list_add(&page->lru,) * list_add(&page->lru,) */ > SetPageLRU(page); > - lru = page_lru(page); > > - nr_pages = hpage_nr_pages(page); > - update_lru_size(lruvec, lru, page_zonenum(page), nr_pages); > - list_move(&page->lru, &lruvec->lists[lru]); > - > - if (put_page_testzero(page)) { > + if (unlikely(put_page_testzero(page))) { > __ClearPageLRU(page); > __ClearPageActive(page); > - del_page_from_lru_list(page, lruvec, lru); > > if (unlikely(PageCompound(page))) { > spin_unlock_irq(&pgdat->lru_lock); > @@ -1883,11 +1886,19 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec, > spin_lock_irq(&pgdat->lru_lock); > } else > list_add(&page->lru, &pages_to_free); > - } else { > - nr_moved += nr_pages; > - if (PageActive(page)) > - workingset_age_nonresident(lruvec, nr_pages); > + > + continue; > } > + > + lruvec = mem_cgroup_page_lruvec(page, pgdat); > + lru = page_lru(page); > + nr_pages = hpage_nr_pages(page); > + > + update_lru_size(lruvec, lru, page_zonenum(page), nr_pages); > + list_add(&page->lru, &lruvec->lists[lru]); > + nr_moved += nr_pages; > + if (PageActive(page)) > + workingset_age_nonresident(lruvec, nr_pages); > } > > /* > ^ permalink raw reply [flat|nested] 101+ messages in thread
* [PATCH v17 02/21] mm/page_idle: no unlikely double check for idle page counting 2020-07-25 12:59 [PATCH v17 00/21] per memcg lru lock Alex Shi 2020-07-25 12:59 ` [PATCH v17 01/21] mm/vmscan: remove unnecessary lruvec adding Alex Shi @ 2020-07-25 12:59 ` Alex Shi 2020-07-25 12:59 ` [PATCH v17 03/21] mm/compaction: correct the comments of compact_defer_shift Alex Shi ` (12 subsequent siblings) 14 siblings, 0 replies; 101+ messages in thread From: Alex Shi @ 2020-07-25 12:59 UTC (permalink / raw) To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi, willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck, rong.a.chen As func comments mentioned, few isolated page missing be tolerated. So why not do further to drop the unlikely double check. That won't cause more idle pages, but reduce a lock contention. This is also a preparation for later new page isolation feature. Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org --- mm/page_idle.c | 8 -------- 1 file changed, 8 deletions(-) diff --git a/mm/page_idle.c b/mm/page_idle.c index 057c61df12db..5fdd753e151a 100644 --- a/mm/page_idle.c +++ b/mm/page_idle.c @@ -32,19 +32,11 @@ static struct page *page_idle_get_page(unsigned long pfn) { struct page *page = pfn_to_online_page(pfn); - pg_data_t *pgdat; if (!page || !PageLRU(page) || !get_page_unless_zero(page)) return NULL; - pgdat = page_pgdat(page); - spin_lock_irq(&pgdat->lru_lock); - if (unlikely(!PageLRU(page))) { - put_page(page); - page = NULL; - } - spin_unlock_irq(&pgdat->lru_lock); return page; } -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 101+ messages in thread
* [PATCH v17 03/21] mm/compaction: correct the comments of compact_defer_shift 2020-07-25 12:59 [PATCH v17 00/21] per memcg lru lock Alex Shi 2020-07-25 12:59 ` [PATCH v17 01/21] mm/vmscan: remove unnecessary lruvec adding Alex Shi 2020-07-25 12:59 ` [PATCH v17 02/21] mm/page_idle: no unlikely double check for idle page counting Alex Shi @ 2020-07-25 12:59 ` Alex Shi 2020-07-27 17:29 ` Alexander Duyck [not found] ` <1595681998-19193-1-git-send-email-alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> ` (11 subsequent siblings) 14 siblings, 1 reply; 101+ messages in thread From: Alex Shi @ 2020-07-25 12:59 UTC (permalink / raw) To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi, willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck, rong.a.chen There is no compact_defer_limit. It should be compact_defer_shift in use. and add compact_order_failed explanation. Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org --- include/linux/mmzone.h | 1 + mm/compaction.c | 2 +- 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index f6f884970511..14c668b7e793 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -512,6 +512,7 @@ struct zone { * On compaction failure, 1<<compact_defer_shift compactions * are skipped before trying again. The number attempted since * last failure is tracked with compact_considered. + * compact_order_failed is the minimum compaction failed order. 
*/ unsigned int compact_considered; unsigned int compact_defer_shift; diff --git a/mm/compaction.c b/mm/compaction.c index 86375605faa9..cd1ef9e5e638 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -136,7 +136,7 @@ void __ClearPageMovable(struct page *page) /* * Compaction is deferred when compaction fails to result in a page - * allocation success. 1 << compact_defer_limit compactions are skipped up + * allocation success. compact_defer_shift++, compactions are skipped up * to a limit of 1 << COMPACT_MAX_DEFER_SHIFT */ void defer_compaction(struct zone *zone, int order) -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 101+ messages in thread
* Re: [PATCH v17 03/21] mm/compaction: correct the comments of compact_defer_shift 2020-07-25 12:59 ` [PATCH v17 03/21] mm/compaction: correct the comments of compact_defer_shift Alex Shi @ 2020-07-27 17:29 ` Alexander Duyck [not found] ` <CAKgT0UfmbdhpUdGy+4VircovmJfiJy9m-MN_o0LChNT_kWRUng-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 101+ messages in thread From: Alexander Duyck @ 2020-07-27 17:29 UTC (permalink / raw) To: Alex Shi Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox, Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups, Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov, Rong Chen On Sat, Jul 25, 2020 at 6:00 AM Alex Shi <alex.shi@linux.alibaba.com> wrote: > > There is no compact_defer_limit. It should be compact_defer_shift in > use. and add compact_order_failed explanation. > > Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: linux-mm@kvack.org > Cc: linux-kernel@vger.kernel.org > --- > include/linux/mmzone.h | 1 + > mm/compaction.c | 2 +- > 2 files changed, 2 insertions(+), 1 deletion(-) > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index f6f884970511..14c668b7e793 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -512,6 +512,7 @@ struct zone { > * On compaction failure, 1<<compact_defer_shift compactions > * are skipped before trying again. The number attempted since > * last failure is tracked with compact_considered. > + * compact_order_failed is the minimum compaction failed order. 
> */ > unsigned int compact_considered; > unsigned int compact_defer_shift; > diff --git a/mm/compaction.c b/mm/compaction.c > index 86375605faa9..cd1ef9e5e638 100644 > --- a/mm/compaction.c > +++ b/mm/compaction.c > @@ -136,7 +136,7 @@ void __ClearPageMovable(struct page *page) > > /* > * Compaction is deferred when compaction fails to result in a page > - * allocation success. 1 << compact_defer_limit compactions are skipped up > + * allocation success. compact_defer_shift++, compactions are skipped up > * to a limit of 1 << COMPACT_MAX_DEFER_SHIFT > */ > void defer_compaction(struct zone *zone, int order) So this doesn't read right. I wouldn't keep the "++," in the explanation, and if we are going to refer to a limit of "1 << COMPACT_MAX_DEFER_SHIFT" then maybe this should be left as "1 << compact_defer_shift". ^ permalink raw reply [flat|nested] 101+ messages in thread
[parent not found: <CAKgT0UfmbdhpUdGy+4VircovmJfiJy9m-MN_o0LChNT_kWRUng-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH v17 03/21] mm/compaction: correct the comments of compact_defer_shift [not found] ` <CAKgT0UfmbdhpUdGy+4VircovmJfiJy9m-MN_o0LChNT_kWRUng-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2020-07-28 11:59 ` Alex Shi [not found] ` <3bd60e1b-a74e-050d-ade4-6e8f54e00b92-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> 0 siblings, 1 reply; 101+ messages in thread From: Alex Shi @ 2020-07-28 11:59 UTC (permalink / raw) To: Alexander Duyck Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox, Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups-u79uwXL29TY76Z2rM5mHXA, Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov, Rong Chen >> * Compaction is deferred when compaction fails to result in a page >> - * allocation success. 1 << compact_defer_limit compactions are skipped up >> + * allocation success. compact_defer_shift++, compactions are skipped up >> * to a limit of 1 << COMPACT_MAX_DEFER_SHIFT >> */ >> void defer_compaction(struct zone *zone, int order) > > So this doesn't read right. I wouldn't keep the "++," in the > explanation, and if we are going to refer to a limit of "1 << > COMPACT_MAX_DEFER_SHIFT" then maybe this should be left as "1 << > compact_defer_shift". > Thanks for comments! So is the changed patch fine? -- From 80ffde4c8e13ba2ad1ad5175dbaef245c2fe49bc Mon Sep 17 00:00:00 2001 From: Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> Date: Tue, 26 May 2020 09:47:01 +0800 Subject: [PATCH] mm/compaction: correct the comments of compact_defer_shift There is no compact_defer_limit. It should be compact_defer_shift in use. and add compact_order_failed explanation. 
Signed-off-by: Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> Cc: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org --- include/linux/mmzone.h | 1 + mm/compaction.c | 2 +- 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index f6f884970511..14c668b7e793 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -512,6 +512,7 @@ struct zone { * On compaction failure, 1<<compact_defer_shift compactions * are skipped before trying again. The number attempted since * last failure is tracked with compact_considered. + * compact_order_failed is the minimum compaction failed order. */ unsigned int compact_considered; unsigned int compact_defer_shift; diff --git a/mm/compaction.c b/mm/compaction.c index 86375605faa9..4950240cd455 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -136,7 +136,7 @@ void __ClearPageMovable(struct page *page) /* * Compaction is deferred when compaction fails to result in a page - * allocation success. 1 << compact_defer_limit compactions are skipped up + * allocation success. 1 << compact_defer_shift, compactions are skipped up * to a limit of 1 << COMPACT_MAX_DEFER_SHIFT */ void defer_compaction(struct zone *zone, int order) -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 101+ messages in thread
[parent not found: <3bd60e1b-a74e-050d-ade4-6e8f54e00b92-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org>]
* Re: [PATCH v17 03/21] mm/compaction: correct the comments of compact_defer_shift [not found] ` <3bd60e1b-a74e-050d-ade4-6e8f54e00b92-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> @ 2020-07-28 14:17 ` Alexander Duyck 0 siblings, 0 replies; 101+ messages in thread From: Alexander Duyck @ 2020-07-28 14:17 UTC (permalink / raw) To: Alex Shi Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox, Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups-u79uwXL29TY76Z2rM5mHXA, Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov, Rong Chen On Tue, Jul 28, 2020 at 4:59 AM Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> wrote: > > > >> * Compaction is deferred when compaction fails to result in a page > >> - * allocation success. 1 << compact_defer_limit compactions are skipped up > >> + * allocation success. compact_defer_shift++, compactions are skipped up > >> * to a limit of 1 << COMPACT_MAX_DEFER_SHIFT > >> */ > >> void defer_compaction(struct zone *zone, int order) > > > > So this doesn't read right. I wouldn't keep the "++," in the > > explanation, and if we are going to refer to a limit of "1 << > > COMPACT_MAX_DEFER_SHIFT" then maybe this should be left as "1 << > > compact_defer_shift". > > > > Thanks for comments! So is the changed patch fine? > -- > > From 80ffde4c8e13ba2ad1ad5175dbaef245c2fe49bc Mon Sep 17 00:00:00 2001 > From: Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> > Date: Tue, 26 May 2020 09:47:01 +0800 > Subject: [PATCH] mm/compaction: correct the comments of compact_defer_shift > > There is no compact_defer_limit. It should be compact_defer_shift in > use. and add compact_order_failed explanation. 
> > Signed-off-by: Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> > Cc: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> > Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org > Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > --- > include/linux/mmzone.h | 1 + > mm/compaction.c | 2 +- > 2 files changed, 2 insertions(+), 1 deletion(-) > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index f6f884970511..14c668b7e793 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -512,6 +512,7 @@ struct zone { > * On compaction failure, 1<<compact_defer_shift compactions > * are skipped before trying again. The number attempted since > * last failure is tracked with compact_considered. > + * compact_order_failed is the minimum compaction failed order. > */ > unsigned int compact_considered; > unsigned int compact_defer_shift; > diff --git a/mm/compaction.c b/mm/compaction.c > index 86375605faa9..4950240cd455 100644 > --- a/mm/compaction.c > +++ b/mm/compaction.c > @@ -136,7 +136,7 @@ void __ClearPageMovable(struct page *page) > > /* > * Compaction is deferred when compaction fails to result in a page > - * allocation success. 1 << compact_defer_limit compactions are skipped up > + * allocation success. 1 << compact_defer_shift, compactions are skipped up > * to a limit of 1 << COMPACT_MAX_DEFER_SHIFT > */ > void defer_compaction(struct zone *zone, int order) Yes, that looks better to me. Reviewed-by: Alexander Duyck <alexander.h.duyck-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> ^ permalink raw reply [flat|nested] 101+ messages in thread
[parent not found: <1595681998-19193-1-git-send-email-alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org>]
* [PATCH v17 04/21] mm/compaction: rename compact_deferred as compact_should_defer [not found] ` <1595681998-19193-1-git-send-email-alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> @ 2020-07-25 12:59 ` Alex Shi 2020-07-25 12:59 ` [PATCH v17 05/21] mm/thp: move lru_add_page_tail func to huge_memory.c Alex Shi ` (11 subsequent siblings) 12 siblings, 0 replies; 101+ messages in thread From: Alex Shi @ 2020-07-25 12:59 UTC (permalink / raw) To: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mgorman-3eNAlZScCAx27rWaFMvyedHuzzzSOjJt, tj-DgEjT+Ai2ygdnm+yROfE0A, hughd-hpIqsD4AKlfQT0dZR+AlfA, khlebnikov-XoJtRXgx1JseBXzfvpsJ4g, daniel.m.jordan-QHcLZuEGTsvQT0dZR+AlfA, yang.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf, willy-wEGCiKHe2LqWVfeAwA7xHQ, hannes-druUgvl0LCNAfugRpC6u6w, lkp-ral2JQCrhuEAvxtiuMwx3w, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, shakeelb-hpIqsD4AKlfQT0dZR+AlfA, iamjoonsoo.kim-Hm3cg6mZ9cc, richard.weiyang-Re5JQEeQqe8AvxtiuMwx3w, kirill-oKw7cIdHH8eLwutG50LtGA, alexander.duyck-Re5JQEeQqe8AvxtiuMwx3w, rong.a.chen-ral2JQCrhuEAvxtiuMwx3w Cc: Steven Rostedt, Ingo Molnar, Vlastimil Babka, Mike Kravetz The compact_deferred is a defer suggestion check, deferring action does in defer_compaction not here. so, better rename it to avoid confusing. 
Signed-off-by: Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> Cc: Steven Rostedt <rostedt-nx8X9YLhiw1AfugRpC6u6w@public.gmane.org> Cc: Ingo Molnar <mingo-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> Cc: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> Cc: Vlastimil Babka <vbabka-AlSwsSmVLrQ@public.gmane.org> Cc: Mike Kravetz <mike.kravetz-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org --- include/linux/compaction.h | 4 ++-- include/trace/events/compaction.h | 2 +- mm/compaction.c | 8 ++++---- 3 files changed, 7 insertions(+), 7 deletions(-) diff --git a/include/linux/compaction.h b/include/linux/compaction.h index 6fa0eea3f530..be9ed7437a38 100644 --- a/include/linux/compaction.h +++ b/include/linux/compaction.h @@ -100,7 +100,7 @@ extern enum compact_result compaction_suitable(struct zone *zone, int order, unsigned int alloc_flags, int highest_zoneidx); extern void defer_compaction(struct zone *zone, int order); -extern bool compaction_deferred(struct zone *zone, int order); +extern bool compaction_should_defer(struct zone *zone, int order); extern void compaction_defer_reset(struct zone *zone, int order, bool alloc_success); extern bool compaction_restarting(struct zone *zone, int order); @@ -199,7 +199,7 @@ static inline void defer_compaction(struct zone *zone, int order) { } -static inline bool compaction_deferred(struct zone *zone, int order) +static inline bool compaction_should_defer(struct zone *zone, int order) { return true; } diff --git a/include/trace/events/compaction.h b/include/trace/events/compaction.h index 54e5bf081171..33633c71df04 100644 --- a/include/trace/events/compaction.h +++ b/include/trace/events/compaction.h @@ -274,7 +274,7 @@ 1UL << __entry->defer_shift) ); -DEFINE_EVENT(mm_compaction_defer_template, mm_compaction_deferred, +DEFINE_EVENT(mm_compaction_defer_template, 
mm_compaction_should_defer, TP_PROTO(struct zone *zone, int order), diff --git a/mm/compaction.c b/mm/compaction.c index cd1ef9e5e638..f14780fc296a 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -154,7 +154,7 @@ void defer_compaction(struct zone *zone, int order) } /* Returns true if compaction should be skipped this time */ -bool compaction_deferred(struct zone *zone, int order) +bool compaction_should_defer(struct zone *zone, int order) { unsigned long defer_limit = 1UL << zone->compact_defer_shift; @@ -168,7 +168,7 @@ bool compaction_deferred(struct zone *zone, int order) if (zone->compact_considered >= defer_limit) return false; - trace_mm_compaction_deferred(zone, order); + trace_mm_compaction_should_defer(zone, order); return true; } @@ -2377,7 +2377,7 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order, enum compact_result status; if (prio > MIN_COMPACT_PRIORITY - && compaction_deferred(zone, order)) { + && compaction_should_defer(zone, order)) { rc = max_t(enum compact_result, COMPACT_DEFERRED, rc); continue; } @@ -2561,7 +2561,7 @@ static void kcompactd_do_work(pg_data_t *pgdat) if (!populated_zone(zone)) continue; - if (compaction_deferred(zone, cc.order)) + if (compaction_should_defer(zone, cc.order)) continue; if (compaction_suitable(zone, cc.order, 0, zoneid) != -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 101+ messages in thread
* [PATCH v17 05/21] mm/thp: move lru_add_page_tail func to huge_memory.c [not found] ` <1595681998-19193-1-git-send-email-alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> 2020-07-25 12:59 ` [PATCH v17 04/21] mm/compaction: rename compact_deferred as compact_should_defer Alex Shi @ 2020-07-25 12:59 ` Alex Shi 2020-07-25 12:59 ` [PATCH v17 09/21] mm/memcg: add debug checking in lock_page_memcg Alex Shi ` (10 subsequent siblings) 12 siblings, 0 replies; 101+ messages in thread From: Alex Shi @ 2020-07-25 12:59 UTC (permalink / raw) To: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mgorman-3eNAlZScCAx27rWaFMvyedHuzzzSOjJt, tj-DgEjT+Ai2ygdnm+yROfE0A, hughd-hpIqsD4AKlfQT0dZR+AlfA, khlebnikov-XoJtRXgx1JseBXzfvpsJ4g, daniel.m.jordan-QHcLZuEGTsvQT0dZR+AlfA, yang.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf, willy-wEGCiKHe2LqWVfeAwA7xHQ, hannes-druUgvl0LCNAfugRpC6u6w, lkp-ral2JQCrhuEAvxtiuMwx3w, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, shakeelb-hpIqsD4AKlfQT0dZR+AlfA, iamjoonsoo.kim-Hm3cg6mZ9cc, richard.weiyang-Re5JQEeQqe8AvxtiuMwx3w, kirill-oKw7cIdHH8eLwutG50LtGA, alexander.duyck-Re5JQEeQqe8AvxtiuMwx3w, rong.a.chen-ral2JQCrhuEAvxtiuMwx3w The func is only used in huge_memory.c, defining it in other file with a CONFIG_TRANSPARENT_HUGEPAGE macro restrict just looks weird. Let's move it THP. And make it static as Hugh Dickin suggested. Signed-off-by: Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> Reviewed-by: Kirill A. 
Shutemov <kirill.shutemov-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> Cc: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> Cc: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org> Cc: Matthew Wilcox <willy-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> Cc: Hugh Dickins <hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org --- include/linux/swap.h | 2 -- mm/huge_memory.c | 30 ++++++++++++++++++++++++++++++ mm/swap.c | 33 --------------------------------- 3 files changed, 30 insertions(+), 35 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 5b3216ba39a9..2c29399b29a0 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -339,8 +339,6 @@ extern void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages); extern void lru_note_cost_page(struct page *); extern void lru_cache_add(struct page *); -extern void lru_add_page_tail(struct page *page, struct page *page_tail, - struct lruvec *lruvec, struct list_head *head); extern void activate_page(struct page *); extern void mark_page_accessed(struct page *); extern void lru_add_drain(void); diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 78c84bee7e29..9e050b13f597 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2340,6 +2340,36 @@ static void remap_page(struct page *page) } } +static void lru_add_page_tail(struct page *page, struct page *page_tail, + struct lruvec *lruvec, struct list_head *list) +{ + VM_BUG_ON_PAGE(!PageHead(page), page); + VM_BUG_ON_PAGE(PageCompound(page_tail), page); + VM_BUG_ON_PAGE(PageLRU(page_tail), page); + lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock); + + if (!list) + SetPageLRU(page_tail); + + if (likely(PageLRU(page))) + list_add_tail(&page_tail->lru, &page->lru); + else if (list) { + /* page reclaim is reclaiming a huge page */ + get_page(page_tail); + list_add_tail(&page_tail->lru, list); + } 
else { + /* + * Head page has not yet been counted, as an hpage, + * so we must account for each subpage individually. + * + * Put page_tail on the list at the correct position + * so they all end up in order. + */ + add_page_to_lru_list_tail(page_tail, lruvec, + page_lru(page_tail)); + } +} + static void __split_huge_page_tail(struct page *head, int tail, struct lruvec *lruvec, struct list_head *list) { diff --git a/mm/swap.c b/mm/swap.c index a82efc33411f..7701d855873d 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -933,39 +933,6 @@ void __pagevec_release(struct pagevec *pvec) } EXPORT_SYMBOL(__pagevec_release); -#ifdef CONFIG_TRANSPARENT_HUGEPAGE -/* used by __split_huge_page_refcount() */ -void lru_add_page_tail(struct page *page, struct page *page_tail, - struct lruvec *lruvec, struct list_head *list) -{ - VM_BUG_ON_PAGE(!PageHead(page), page); - VM_BUG_ON_PAGE(PageCompound(page_tail), page); - VM_BUG_ON_PAGE(PageLRU(page_tail), page); - lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock); - - if (!list) - SetPageLRU(page_tail); - - if (likely(PageLRU(page))) - list_add_tail(&page_tail->lru, &page->lru); - else if (list) { - /* page reclaim is reclaiming a huge page */ - get_page(page_tail); - list_add_tail(&page_tail->lru, list); - } else { - /* - * Head page has not yet been counted, as an hpage, - * so we must account for each subpage individually. - * - * Put page_tail on the list at the correct position - * so they all end up in order. - */ - add_page_to_lru_list_tail(page_tail, lruvec, - page_lru(page_tail)); - } -} -#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ - static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec, void *arg) { -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 101+ messages in thread
* [PATCH v17 09/21] mm/memcg: add debug checking in lock_page_memcg [not found] ` <1595681998-19193-1-git-send-email-alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> 2020-07-25 12:59 ` [PATCH v17 04/21] mm/compaction: rename compact_deferred as compact_should_defer Alex Shi 2020-07-25 12:59 ` [PATCH v17 05/21] mm/thp: move lru_add_page_tail func to huge_memory.c Alex Shi @ 2020-07-25 12:59 ` Alex Shi 2020-07-25 12:59 ` [PATCH v17 12/21] mm/lru: move lock into lru_note_cost Alex Shi ` (9 subsequent siblings) 12 siblings, 0 replies; 101+ messages in thread From: Alex Shi @ 2020-07-25 12:59 UTC (permalink / raw) To: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mgorman-3eNAlZScCAx27rWaFMvyedHuzzzSOjJt, tj-DgEjT+Ai2ygdnm+yROfE0A, hughd-hpIqsD4AKlfQT0dZR+AlfA, khlebnikov-XoJtRXgx1JseBXzfvpsJ4g, daniel.m.jordan-QHcLZuEGTsvQT0dZR+AlfA, yang.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf, willy-wEGCiKHe2LqWVfeAwA7xHQ, hannes-druUgvl0LCNAfugRpC6u6w, lkp-ral2JQCrhuEAvxtiuMwx3w, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, shakeelb-hpIqsD4AKlfQT0dZR+AlfA, iamjoonsoo.kim-Hm3cg6mZ9cc, richard.weiyang-Re5JQEeQqe8AvxtiuMwx3w, kirill-oKw7cIdHH8eLwutG50LtGA, alexander.duyck-Re5JQEeQqe8AvxtiuMwx3w, rong.a.chen-ral2JQCrhuEAvxtiuMwx3w Cc: Michal Hocko, Vladimir Davydov Add a debug check in lock_page_memcg so that we get a warning if anything goes wrong here.
Suggested-by: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org> Signed-off-by: Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> Cc: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org> Cc: Michal Hocko <mhocko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> Cc: Vladimir Davydov <vdavydov.dev-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> Cc: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> Cc: cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org --- mm/memcontrol.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 5d45a9159af9..20c8ed69a930 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1983,6 +1983,12 @@ struct mem_cgroup *lock_page_memcg(struct page *page) if (unlikely(!memcg)) return NULL; +#ifdef CONFIG_PROVE_LOCKING + local_irq_save(flags); + might_lock(&memcg->move_lock); + local_irq_restore(flags); +#endif + if (atomic_read(&memcg->moving_account) <= 0) return memcg; -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 101+ messages in thread
* [PATCH v17 12/21] mm/lru: move lock into lru_note_cost [not found] ` <1595681998-19193-1-git-send-email-alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> ` (2 preceding siblings ...) 2020-07-25 12:59 ` [PATCH v17 09/21] mm/memcg: add debug checking in lock_page_memcg Alex Shi @ 2020-07-25 12:59 ` Alex Shi 2020-07-25 12:59 ` [PATCH v17 13/21] mm/lru: introduce TestClearPageLRU Alex Shi ` (8 subsequent siblings) 12 siblings, 0 replies; 101+ messages in thread From: Alex Shi @ 2020-07-25 12:59 UTC (permalink / raw) To: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mgorman-3eNAlZScCAx27rWaFMvyedHuzzzSOjJt, tj-DgEjT+Ai2ygdnm+yROfE0A, hughd-hpIqsD4AKlfQT0dZR+AlfA, khlebnikov-XoJtRXgx1JseBXzfvpsJ4g, daniel.m.jordan-QHcLZuEGTsvQT0dZR+AlfA, yang.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf, willy-wEGCiKHe2LqWVfeAwA7xHQ, hannes-druUgvl0LCNAfugRpC6u6w, lkp-ral2JQCrhuEAvxtiuMwx3w, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, shakeelb-hpIqsD4AKlfQT0dZR+AlfA, iamjoonsoo.kim-Hm3cg6mZ9cc, richard.weiyang-Re5JQEeQqe8AvxtiuMwx3w, kirill-oKw7cIdHH8eLwutG50LtGA, alexander.duyck-Re5JQEeQqe8AvxtiuMwx3w, rong.a.chen-ral2JQCrhuEAvxtiuMwx3w This patch moves lru_lock into lru_note_cost. It's a bit ugly and may cost more locking, but it's necessary for the later change from per-pgdat lru_lock to per-memcg lru_lock.
Signed-off-by: Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> Cc: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org> Cc: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org --- mm/swap.c | 5 +++-- mm/vmscan.c | 4 +--- 2 files changed, 4 insertions(+), 5 deletions(-) diff --git a/mm/swap.c b/mm/swap.c index b88ca630db70..f645965fde0e 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -269,7 +269,9 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages) { do { unsigned long lrusize; + struct pglist_data *pgdat = lruvec_pgdat(lruvec); + spin_lock_irq(&pgdat->lru_lock); /* Record cost event */ if (file) lruvec->file_cost += nr_pages; @@ -293,15 +295,14 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages) lruvec->file_cost /= 2; lruvec->anon_cost /= 2; } + spin_unlock_irq(&pgdat->lru_lock); } while ((lruvec = parent_lruvec(lruvec))); } void lru_note_cost_page(struct page *page) { - spin_lock_irq(&page_pgdat(page)->lru_lock); lru_note_cost(mem_cgroup_page_lruvec(page, page_pgdat(page)), page_is_file_lru(page), hpage_nr_pages(page)); - spin_unlock_irq(&page_pgdat(page)->lru_lock); } static void __activate_page(struct page *page, struct lruvec *lruvec) diff --git a/mm/vmscan.c b/mm/vmscan.c index ddb29d813d77..c1c4259b4de5 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1976,19 +1976,17 @@ static int current_may_throttle(void) &stat, false); spin_lock_irq(&pgdat->lru_lock); - move_pages_to_lru(lruvec, &page_list); __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken); - lru_note_cost(lruvec, file, stat.nr_pageout); item = current_is_kswapd() ? 
PGSTEAL_KSWAPD : PGSTEAL_DIRECT; if (!cgroup_reclaim(sc)) __count_vm_events(item, nr_reclaimed); __count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed); __count_vm_events(PGSTEAL_ANON + file, nr_reclaimed); - spin_unlock_irq(&pgdat->lru_lock); + lru_note_cost(lruvec, file, stat.nr_pageout); mem_cgroup_uncharge_list(&page_list); free_unref_page_list(&page_list); -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 101+ messages in thread
* [PATCH v17 13/21] mm/lru: introduce TestClearPageLRU [not found] ` <1595681998-19193-1-git-send-email-alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> ` (3 preceding siblings ...) 2020-07-25 12:59 ` [PATCH v17 12/21] mm/lru: move lock into lru_note_cost Alex Shi @ 2020-07-25 12:59 ` Alex Shi [not found] ` <1595681998-19193-14-git-send-email-alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> 2020-07-25 12:59 ` [PATCH v17 15/21] mm/thp: add tail pages into lru anyway in split_huge_page() Alex Shi ` (7 subsequent siblings) 12 siblings, 1 reply; 101+ messages in thread From: Alex Shi @ 2020-07-25 12:59 UTC (permalink / raw) To: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mgorman-3eNAlZScCAx27rWaFMvyedHuzzzSOjJt, tj-DgEjT+Ai2ygdnm+yROfE0A, hughd-hpIqsD4AKlfQT0dZR+AlfA, khlebnikov-XoJtRXgx1JseBXzfvpsJ4g, daniel.m.jordan-QHcLZuEGTsvQT0dZR+AlfA, yang.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf, willy-wEGCiKHe2LqWVfeAwA7xHQ, hannes-druUgvl0LCNAfugRpC6u6w, lkp-ral2JQCrhuEAvxtiuMwx3w, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, shakeelb-hpIqsD4AKlfQT0dZR+AlfA, iamjoonsoo.kim-Hm3cg6mZ9cc, richard.weiyang-Re5JQEeQqe8AvxtiuMwx3w, kirill-oKw7cIdHH8eLwutG50LtGA, alexander.duyck-Re5JQEeQqe8AvxtiuMwx3w, rong.a.chen-ral2JQCrhuEAvxtiuMwx3w Cc: Michal Hocko, Vladimir Davydov Currently lru_lock still guards both lru list and page's lru bit, that's ok. but if we want to use specific lruvec lock on the page, we need to pin down the page's lruvec/memcg during locking. Just taking lruvec lock first may be undermined by the page's memcg charge/migration. To fix this problem, we could clear the lru bit out of locking and use it as pin down action to block the page isolation in memcg changing. So now we do page isolating by both actions: TestClearPageLRU and hold the lru_lock. This patch start with the first part: TestClearPageLRU, which combines PageLRU check and ClearPageLRU into a macro func TestClearPageLRU. 
This function will be used as a page isolation precondition to prevent other isolations elsewhere. There may then be !PageLRU pages on an lru list, so the corresponding BUG() checks need to be removed. There are 2 rules for the lru bit now: 1, the lru bit still indicates whether a page is on an lru list; only during a brief window (while being isolated) may a page sit on an lru list without the lru bit set, but a page with the lru bit set must be on an lru list. 2, the lru bit has to be cleared before the page is deleted from the lru list. Hugh Dickins pointed out that when a page is on the free path and nobody else can take it, non-atomic lru bit clearing is better, as in __page_cache_release and release_pages. And there is no need for get_page() before clearing the lru bit in isolate_lru_page, since it '(1) Must be called with an elevated refcount on the page'. As Andrew Morton mentioned, this change dirties the cacheline even for a page that isn't on the LRU, but the loss is acceptable per the report from Rong Chen <rong.a.chen-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>: https://lkml.org/lkml/2020/3/4/173 Suggested-by: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org> Signed-off-by: Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> Cc: Hugh Dickins <hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> Cc: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org> Cc: Michal Hocko <mhocko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> Cc: Vladimir Davydov <vdavydov.dev-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> Cc: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Cc: cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org --- include/linux/page-flags.h | 1 + mm/mlock.c | 3 +-- mm/swap.c | 6 ++---- mm/vmscan.c | 18 +++++++----------- 4 files changed, 11 insertions(+), 17 deletions(-) diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 6be1aa559b1e..9554ed1387dc 100644 ---
a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -326,6 +326,7 @@ static inline void page_init_poison(struct page *page, size_t size) PAGEFLAG(Dirty, dirty, PF_HEAD) TESTSCFLAG(Dirty, dirty, PF_HEAD) __CLEARPAGEFLAG(Dirty, dirty, PF_HEAD) PAGEFLAG(LRU, lru, PF_HEAD) __CLEARPAGEFLAG(LRU, lru, PF_HEAD) + TESTCLEARFLAG(LRU, lru, PF_HEAD) PAGEFLAG(Active, active, PF_HEAD) __CLEARPAGEFLAG(Active, active, PF_HEAD) TESTCLEARFLAG(Active, active, PF_HEAD) PAGEFLAG(Workingset, workingset, PF_HEAD) diff --git a/mm/mlock.c b/mm/mlock.c index f8736136fad7..228ba5a8e0a5 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -108,13 +108,12 @@ void mlock_vma_page(struct page *page) */ static bool __munlock_isolate_lru_page(struct page *page, bool getpage) { - if (PageLRU(page)) { + if (TestClearPageLRU(page)) { struct lruvec *lruvec; lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page)); if (getpage) get_page(page); - ClearPageLRU(page); del_page_from_lru_list(page, lruvec, page_lru(page)); return true; } diff --git a/mm/swap.c b/mm/swap.c index f645965fde0e..5092fe9c8c47 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -83,10 +83,9 @@ static void __page_cache_release(struct page *page) struct lruvec *lruvec; unsigned long flags; + __ClearPageLRU(page); spin_lock_irqsave(&pgdat->lru_lock, flags); lruvec = mem_cgroup_page_lruvec(page, pgdat); - VM_BUG_ON_PAGE(!PageLRU(page), page); - __ClearPageLRU(page); del_page_from_lru_list(page, lruvec, page_off_lru(page)); spin_unlock_irqrestore(&pgdat->lru_lock, flags); } @@ -878,9 +877,8 @@ void release_pages(struct page **pages, int nr) spin_lock_irqsave(&locked_pgdat->lru_lock, flags); } - lruvec = mem_cgroup_page_lruvec(page, locked_pgdat); - VM_BUG_ON_PAGE(!PageLRU(page), page); __ClearPageLRU(page); + lruvec = mem_cgroup_page_lruvec(page, locked_pgdat); del_page_from_lru_list(page, lruvec, page_off_lru(page)); } diff --git a/mm/vmscan.c b/mm/vmscan.c index c1c4259b4de5..4183ae6b54b5 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ 
-1671,8 +1671,6 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan, page = lru_to_page(src); prefetchw_prev_lru_page(page, src, flags); - VM_BUG_ON_PAGE(!PageLRU(page), page); - nr_pages = compound_nr(page); total_scan += nr_pages; @@ -1769,21 +1767,19 @@ int isolate_lru_page(struct page *page) VM_BUG_ON_PAGE(!page_count(page), page); WARN_RATELIMIT(PageTail(page), "trying to isolate tail page"); - if (PageLRU(page)) { + if (TestClearPageLRU(page)) { pg_data_t *pgdat = page_pgdat(page); struct lruvec *lruvec; + int lru = page_lru(page); - spin_lock_irq(&pgdat->lru_lock); + get_page(page); lruvec = mem_cgroup_page_lruvec(page, pgdat); - if (PageLRU(page)) { - int lru = page_lru(page); - get_page(page); - ClearPageLRU(page); - del_page_from_lru_list(page, lruvec, lru); - ret = 0; - } + spin_lock_irq(&pgdat->lru_lock); + del_page_from_lru_list(page, lruvec, lru); spin_unlock_irq(&pgdat->lru_lock); + ret = 0; } + return ret; } -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 101+ messages in thread
[parent not found: <1595681998-19193-14-git-send-email-alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org>]
* Re: [PATCH v17 13/21] mm/lru: introduce TestClearPageLRU [not found] ` <1595681998-19193-14-git-send-email-alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> @ 2020-07-29 3:53 ` Alex Shi 2020-08-05 22:43 ` Alexander Duyck 0 siblings, 1 reply; 101+ messages in thread From: Alex Shi @ 2020-07-29 3:53 UTC (permalink / raw) To: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mgorman-3eNAlZScCAx27rWaFMvyedHuzzzSOjJt, tj-DgEjT+Ai2ygdnm+yROfE0A, hughd-hpIqsD4AKlfQT0dZR+AlfA, khlebnikov-XoJtRXgx1JseBXzfvpsJ4g, daniel.m.jordan-QHcLZuEGTsvQT0dZR+AlfA, yang.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf, willy-wEGCiKHe2LqWVfeAwA7xHQ, hannes-druUgvl0LCNAfugRpC6u6w, lkp-ral2JQCrhuEAvxtiuMwx3w, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, shakeelb-hpIqsD4AKlfQT0dZR+AlfA, iamjoonsoo.kim-Hm3cg6mZ9cc, richard.weiyang-Re5JQEeQqe8AvxtiuMwx3w, kirill-oKw7cIdHH8eLwutG50LtGA, alexander.duyck-Re5JQEeQqe8AvxtiuMwx3w, rong.a.chen-ral2JQCrhuEAvxtiuMwx3w Cc: Michal Hocko, Vladimir Davydov rewrite the commit log. From 9310c359b0049e3cc9827b771dc583d504bbf022 Mon Sep 17 00:00:00 2001 From: Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> Date: Sat, 25 Apr 2020 12:03:30 +0800 Subject: [PATCH v17 13/23] mm/lru: introduce TestClearPageLRU Currently lru_lock still guards both lru list and page's lru bit, that's ok. but if we want to use specific lruvec lock on the page, we need to pin down the page's lruvec/memcg during locking. Just taking lruvec lock first may be undermined by the page's memcg charge/migration. To fix this problem, we could clear the lru bit out of locking and use it as pin down action to block the page isolation in memcg changing. 
So now a standard steps of page isolation is following: 1, get_page(); #pin the page avoid to be free 2, TestClearPageLRU(); #block other isolation like memcg change 3, spin_lock on lru_lock; #serialize lru list access 4, delete page from lru list; The step 2 could be optimzed/replaced in scenarios which page is unlikely be accessed or be moved between memcgs. This patch start with the first part: TestClearPageLRU, which combines PageLRU check and ClearPageLRU into a macro func TestClearPageLRU. This function will be used as page isolation precondition to prevent other isolations some where else. Then there are may !PageLRU page on lru list, need to remove BUG() checking accordingly. There 2 rules for lru bit now: 1, the lru bit still indicate if a page on lru list, just in some temporary moment(isolating), the page may have no lru bit when it's on lru list. but the page still must be on lru list when the lru bit set. 2, have to remove lru bit before delete it from lru list. Hugh Dickins pointed that when a page is in free path and no one is possible to take it, non atomic lru bit clearing is better, like in __page_cache_release and release_pages. And no need get_page() before lru bit clear in isolate_lru_page, since it '(1) Must be called with an elevated refcount on the page'. As Andrew Morton mentioned this change would dirty cacheline for page isn't on LRU. 
But the lost would be acceptable with Rong Chen <rong.a.chen-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> report: https://lkml.org/lkml/2020/3/4/173 Suggested-by: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org> Signed-off-by: Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> Cc: Hugh Dickins <hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> Cc: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org> Cc: Michal Hocko <mhocko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> Cc: Vladimir Davydov <vdavydov.dev-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> Cc: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Cc: cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org --- include/linux/page-flags.h | 1 + mm/mlock.c | 3 +-- mm/swap.c | 6 ++---- mm/vmscan.c | 18 +++++++----------- 4 files changed, 11 insertions(+), 17 deletions(-) diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 6be1aa559b1e..9554ed1387dc 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -326,6 +326,7 @@ static inline void page_init_poison(struct page *page, size_t size) PAGEFLAG(Dirty, dirty, PF_HEAD) TESTSCFLAG(Dirty, dirty, PF_HEAD) __CLEARPAGEFLAG(Dirty, dirty, PF_HEAD) PAGEFLAG(LRU, lru, PF_HEAD) __CLEARPAGEFLAG(LRU, lru, PF_HEAD) + TESTCLEARFLAG(LRU, lru, PF_HEAD) PAGEFLAG(Active, active, PF_HEAD) __CLEARPAGEFLAG(Active, active, PF_HEAD) TESTCLEARFLAG(Active, active, PF_HEAD) PAGEFLAG(Workingset, workingset, PF_HEAD) diff --git a/mm/mlock.c b/mm/mlock.c index f8736136fad7..228ba5a8e0a5 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -108,13 +108,12 @@ void mlock_vma_page(struct page *page) */ static bool __munlock_isolate_lru_page(struct page *page, bool getpage) { - if (PageLRU(page)) { + if (TestClearPageLRU(page)) { struct lruvec *lruvec; lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page)); if 
(getpage) get_page(page); - ClearPageLRU(page); del_page_from_lru_list(page, lruvec, page_lru(page)); return true; } diff --git a/mm/swap.c b/mm/swap.c index f645965fde0e..5092fe9c8c47 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -83,10 +83,9 @@ static void __page_cache_release(struct page *page) struct lruvec *lruvec; unsigned long flags; + __ClearPageLRU(page); spin_lock_irqsave(&pgdat->lru_lock, flags); lruvec = mem_cgroup_page_lruvec(page, pgdat); - VM_BUG_ON_PAGE(!PageLRU(page), page); - __ClearPageLRU(page); del_page_from_lru_list(page, lruvec, page_off_lru(page)); spin_unlock_irqrestore(&pgdat->lru_lock, flags); } @@ -878,9 +877,8 @@ void release_pages(struct page **pages, int nr) spin_lock_irqsave(&locked_pgdat->lru_lock, flags); } - lruvec = mem_cgroup_page_lruvec(page, locked_pgdat); - VM_BUG_ON_PAGE(!PageLRU(page), page); __ClearPageLRU(page); + lruvec = mem_cgroup_page_lruvec(page, locked_pgdat); del_page_from_lru_list(page, lruvec, page_off_lru(page)); } diff --git a/mm/vmscan.c b/mm/vmscan.c index c1c4259b4de5..4183ae6b54b5 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1671,8 +1671,6 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan, page = lru_to_page(src); prefetchw_prev_lru_page(page, src, flags); - VM_BUG_ON_PAGE(!PageLRU(page), page); - nr_pages = compound_nr(page); total_scan += nr_pages; @@ -1769,21 +1767,19 @@ int isolate_lru_page(struct page *page) VM_BUG_ON_PAGE(!page_count(page), page); WARN_RATELIMIT(PageTail(page), "trying to isolate tail page"); - if (PageLRU(page)) { + if (TestClearPageLRU(page)) { pg_data_t *pgdat = page_pgdat(page); struct lruvec *lruvec; + int lru = page_lru(page); - spin_lock_irq(&pgdat->lru_lock); + get_page(page); lruvec = mem_cgroup_page_lruvec(page, pgdat); - if (PageLRU(page)) { - int lru = page_lru(page); - get_page(page); - ClearPageLRU(page); - del_page_from_lru_list(page, lruvec, lru); - ret = 0; - } + spin_lock_irq(&pgdat->lru_lock); + del_page_from_lru_list(page, lruvec, lru); 
spin_unlock_irq(&pgdat->lru_lock); + ret = 0; } + return ret; } -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 101+ messages in thread
* Re: [PATCH v17 13/21] mm/lru: introduce TestClearPageLRU 2020-07-29 3:53 ` Alex Shi @ 2020-08-05 22:43 ` Alexander Duyck [not found] ` <CAKgT0Ud1+FkJcTXR0MxZYFxd7mr=opdXfXKTqkmiu4NNMyT4bg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 101+ messages in thread From: Alexander Duyck @ 2020-08-05 22:43 UTC (permalink / raw) To: Alex Shi Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox, Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups, Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov, Rong Chen, Michal Hocko, Vladimir Davydov On Tue, Jul 28, 2020 at 8:53 PM Alex Shi <alex.shi@linux.alibaba.com> wrote: > > rewrite the commit log. > > From 9310c359b0049e3cc9827b771dc583d504bbf022 Mon Sep 17 00:00:00 2001 > From: Alex Shi <alex.shi@linux.alibaba.com> > Date: Sat, 25 Apr 2020 12:03:30 +0800 > Subject: [PATCH v17 13/23] mm/lru: introduce TestClearPageLRU > > Currently lru_lock still guards both lru list and page's lru bit, that's > ok. but if we want to use specific lruvec lock on the page, we need to > pin down the page's lruvec/memcg during locking. Just taking lruvec > lock first may be undermined by the page's memcg charge/migration. To > fix this problem, we could clear the lru bit out of locking and use > it as pin down action to block the page isolation in memcg changing. > > So now a standard steps of page isolation is following: > 1, get_page(); #pin the page avoid to be free > 2, TestClearPageLRU(); #block other isolation like memcg change > 3, spin_lock on lru_lock; #serialize lru list access > 4, delete page from lru list; > The step 2 could be optimzed/replaced in scenarios which page is > unlikely be accessed or be moved between memcgs. > > This patch start with the first part: TestClearPageLRU, which combines > PageLRU check and ClearPageLRU into a macro func TestClearPageLRU. 
This > function will be used as page isolation precondition to prevent other > isolations some where else. Then there are may !PageLRU page on lru > list, need to remove BUG() checking accordingly. > > There 2 rules for lru bit now: > 1, the lru bit still indicate if a page on lru list, just in some > temporary moment(isolating), the page may have no lru bit when > it's on lru list. but the page still must be on lru list when the > lru bit set. > 2, have to remove lru bit before delete it from lru list. > > Hugh Dickins pointed that when a page is in free path and no one is > possible to take it, non atomic lru bit clearing is better, like in > __page_cache_release and release_pages. > And no need get_page() before lru bit clear in isolate_lru_page, > since it '(1) Must be called with an elevated refcount on the page'. > > As Andrew Morton mentioned this change would dirty cacheline for page > isn't on LRU. But the lost would be acceptable with Rong Chen > <rong.a.chen@intel.com> report: > https://lkml.org/lkml/2020/3/4/173 > > Suggested-by: Johannes Weiner <hannes@cmpxchg.org> > Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> > Cc: Hugh Dickins <hughd@google.com> > Cc: Johannes Weiner <hannes@cmpxchg.org> > Cc: Michal Hocko <mhocko@kernel.org> > Cc: Vladimir Davydov <vdavydov.dev@gmail.com> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: linux-kernel@vger.kernel.org > Cc: cgroups@vger.kernel.org > Cc: linux-mm@kvack.org > --- > include/linux/page-flags.h | 1 + > mm/mlock.c | 3 +-- > mm/swap.c | 6 ++---- > mm/vmscan.c | 18 +++++++----------- > 4 files changed, 11 insertions(+), 17 deletions(-) > <snip> > diff --git a/mm/swap.c b/mm/swap.c > index f645965fde0e..5092fe9c8c47 100644 > --- a/mm/swap.c > +++ b/mm/swap.c > @@ -83,10 +83,9 @@ static void __page_cache_release(struct page *page) > struct lruvec *lruvec; > unsigned long flags; > > + __ClearPageLRU(page); > spin_lock_irqsave(&pgdat->lru_lock, flags); > lruvec = mem_cgroup_page_lruvec(page, pgdat); 
> - VM_BUG_ON_PAGE(!PageLRU(page), page); > - __ClearPageLRU(page); > del_page_from_lru_list(page, lruvec, page_off_lru(page)); > spin_unlock_irqrestore(&pgdat->lru_lock, flags); > } > @@ -878,9 +877,8 @@ void release_pages(struct page **pages, int nr) > spin_lock_irqsave(&locked_pgdat->lru_lock, flags); > } > > - lruvec = mem_cgroup_page_lruvec(page, locked_pgdat); > - VM_BUG_ON_PAGE(!PageLRU(page), page); > __ClearPageLRU(page); > + lruvec = mem_cgroup_page_lruvec(page, locked_pgdat); > del_page_from_lru_list(page, lruvec, page_off_lru(page)); > } > The more I look at this piece it seems like this change wasn't really necessary. If anything it seems like it could catch potential bugs as it was testing for the PageLRU flag before and then clearing it manually anyway. In addition it doesn't reduce the critical path by any significant amount so I am not sure these changes are providing any benefit. ^ permalink raw reply [flat|nested] 101+ messages in thread
[parent not found: <CAKgT0Ud1+FkJcTXR0MxZYFxd7mr=opdXfXKTqkmiu4NNMyT4bg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH v17 13/21] mm/lru: introduce TestClearPageLRU [not found] ` <CAKgT0Ud1+FkJcTXR0MxZYFxd7mr=opdXfXKTqkmiu4NNMyT4bg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2020-08-06 1:54 ` Alex Shi 2020-08-06 14:41 ` Alexander Duyck 0 siblings, 1 reply; 101+ messages in thread From: Alex Shi @ 2020-08-06 1:54 UTC (permalink / raw) To: Alexander Duyck Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox, Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups-u79uwXL29TY76Z2rM5mHXA, Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov, Rong Chen, Michal Hocko, Vladimir Davydov 在 2020/8/6 上午6:43, Alexander Duyck 写道: >> @@ -878,9 +877,8 @@ void release_pages(struct page **pages, int nr) >> spin_lock_irqsave(&locked_pgdat->lru_lock, flags); >> } >> >> - lruvec = mem_cgroup_page_lruvec(page, locked_pgdat); >> - VM_BUG_ON_PAGE(!PageLRU(page), page); >> __ClearPageLRU(page); >> + lruvec = mem_cgroup_page_lruvec(page, locked_pgdat); >> del_page_from_lru_list(page, lruvec, page_off_lru(page)); >> } >> > The more I look at this piece it seems like this change wasn't really > necessary. If anything it seems like it could catch potential bugs as > it was testing for the PageLRU flag before and then clearing it > manually anyway. In addition it doesn't reduce the critical path by > any significant amount so I am not sure these changes are providing > any benefit. Don't know hat kind of bug do you mean here, since the page is no one using, means no one could ClearPageLRU in other place, so if you like to keep the VM_BUG_ON_PAGE, that should be ok. Thanks! Alex ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH v17 13/21] mm/lru: introduce TestClearPageLRU 2020-08-06 1:54 ` Alex Shi @ 2020-08-06 14:41 ` Alexander Duyck 0 siblings, 0 replies; 101+ messages in thread From: Alexander Duyck @ 2020-08-06 14:41 UTC (permalink / raw) To: Alex Shi Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox, Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups, Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov, Rong Chen, Michal Hocko, Vladimir Davydov On Wed, Aug 5, 2020 at 6:54 PM Alex Shi <alex.shi@linux.alibaba.com> wrote: > > > > 在 2020/8/6 上午6:43, Alexander Duyck 写道: > >> @@ -878,9 +877,8 @@ void release_pages(struct page **pages, int nr) > >> spin_lock_irqsave(&locked_pgdat->lru_lock, flags); > >> } > >> > >> - lruvec = mem_cgroup_page_lruvec(page, locked_pgdat); > >> - VM_BUG_ON_PAGE(!PageLRU(page), page); > >> __ClearPageLRU(page); > >> + lruvec = mem_cgroup_page_lruvec(page, locked_pgdat); > >> del_page_from_lru_list(page, lruvec, page_off_lru(page)); > >> } > >> > > The more I look at this piece it seems like this change wasn't really > > necessary. If anything it seems like it could catch potential bugs as > > it was testing for the PageLRU flag before and then clearing it > > manually anyway. In addition it doesn't reduce the critical path by > > any significant amount so I am not sure these changes are providing > > any benefit. > > Don't know hat kind of bug do you mean here, since the page is no one using, means > no one could ClearPageLRU in other place, so if you like to keep the VM_BUG_ON_PAGE, > that should be ok. You kind of answered your own question. Basically the bug it would catch is if another thread were to clear the flag without getting a reference to the page first. My preference would be to leave this code as is for now. 
There isn't much value in either moving the lruvec lookup or removing the VM_BUG_ON_PAGE call, since the critical path size would barely be affected as it is only one or two operations anyway. What it comes down to is that the fewer unnecessary changes we make, the better. ^ permalink raw reply [flat|nested] 101+ messages in thread
* [PATCH v17 15/21] mm/thp: add tail pages into lru anyway in split_huge_page() [not found] ` <1595681998-19193-1-git-send-email-alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> ` (4 preceding siblings ...) 2020-07-25 12:59 ` [PATCH v17 13/21] mm/lru: introduce TestClearPageLRU Alex Shi @ 2020-07-25 12:59 ` Alex Shi 2020-07-25 12:59 ` [PATCH v17 17/21] mm/lru: replace pgdat lru_lock with lruvec lock Alex Shi ` (6 subsequent siblings) 12 siblings, 0 replies; 101+ messages in thread From: Alex Shi @ 2020-07-25 12:59 UTC (permalink / raw) To: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mgorman-3eNAlZScCAx27rWaFMvyedHuzzzSOjJt, tj-DgEjT+Ai2ygdnm+yROfE0A, hughd-hpIqsD4AKlfQT0dZR+AlfA, khlebnikov-XoJtRXgx1JseBXzfvpsJ4g, daniel.m.jordan-QHcLZuEGTsvQT0dZR+AlfA, yang.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf, willy-wEGCiKHe2LqWVfeAwA7xHQ, hannes-druUgvl0LCNAfugRpC6u6w, lkp-ral2JQCrhuEAvxtiuMwx3w, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, shakeelb-hpIqsD4AKlfQT0dZR+AlfA, iamjoonsoo.kim-Hm3cg6mZ9cc, richard.weiyang-Re5JQEeQqe8AvxtiuMwx3w, kirill-oKw7cIdHH8eLwutG50LtGA, alexander.duyck-Re5JQEeQqe8AvxtiuMwx3w, rong.a.chen-ral2JQCrhuEAvxtiuMwx3w Cc: Mika Penttilä split_huge_page() must start with PageLRU(head), and we are holding the lru_lock here. If the head's lru bit was cleared unexpectedly, warn about it. Signed-off-by: Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> Reviewed-by: Kirill A. Shutemov <kirill.shutemov-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> Cc: Kirill A.
Shutemov <kirill-oKw7cIdHH8eLwutG50LtGA@public.gmane.org> Cc: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> Cc: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org> Cc: Matthew Wilcox <willy-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> Cc: Hugh Dickins <hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> Cc: Mika Penttilä <mika.penttila-MRsr7dthA9VWk0Htik3J/w@public.gmane.org> Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org --- mm/huge_memory.c | 16 ++++++++++------ 1 file changed, 10 insertions(+), 6 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index d866b6e43434..28538444197b 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2348,15 +2348,19 @@ static void lru_add_page_tail(struct page *head, struct page *page_tail, VM_BUG_ON_PAGE(PageLRU(page_tail), head); lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock); - if (!list) - SetPageLRU(page_tail); - - if (likely(PageLRU(head))) - list_add_tail(&page_tail->lru, &head->lru); - else if (list) { + if (list) { /* page reclaim is reclaiming a huge page */ get_page(page_tail); list_add_tail(&page_tail->lru, list); + } else { + /* + * The split starts from PageLRU(head), and we are holding the + * lru_lock. + * Warn if the head's lru bit was cleared unexpectedly. + */ + VM_WARN_ON(!PageLRU(head)); + SetPageLRU(page_tail); + list_add_tail(&page_tail->lru, &head->lru); } } -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 101+ messages in thread
* [PATCH v17 17/21] mm/lru: replace pgdat lru_lock with lruvec lock [not found] ` <1595681998-19193-1-git-send-email-alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> ` (5 preceding siblings ...) 2020-07-25 12:59 ` [PATCH v17 15/21] mm/thp: add tail pages into lru anyway in split_huge_page() Alex Shi @ 2020-07-25 12:59 ` Alex Shi [not found] ` <1595681998-19193-18-git-send-email-alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> 2020-07-29 3:54 ` Alex Shi 2020-07-27 5:40 ` [PATCH v17 00/21] per memcg lru lock Alex Shi ` (5 subsequent siblings) 12 siblings, 2 replies; 101+ messages in thread From: Alex Shi @ 2020-07-25 12:59 UTC (permalink / raw) To: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mgorman-3eNAlZScCAx27rWaFMvyedHuzzzSOjJt, tj-DgEjT+Ai2ygdnm+yROfE0A, hughd-hpIqsD4AKlfQT0dZR+AlfA, khlebnikov-XoJtRXgx1JseBXzfvpsJ4g, daniel.m.jordan-QHcLZuEGTsvQT0dZR+AlfA, yang.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf, willy-wEGCiKHe2LqWVfeAwA7xHQ, hannes-druUgvl0LCNAfugRpC6u6w, lkp-ral2JQCrhuEAvxtiuMwx3w, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, shakeelb-hpIqsD4AKlfQT0dZR+AlfA, iamjoonsoo.kim-Hm3cg6mZ9cc, richard.weiyang-Re5JQEeQqe8AvxtiuMwx3w, kirill-oKw7cIdHH8eLwutG50LtGA, alexander.duyck-Re5JQEeQqe8AvxtiuMwx3w, rong.a.chen-ral2JQCrhuEAvxtiuMwx3w Cc: Michal Hocko, Vladimir Davydov This patch moves the per-node lru_lock into the lruvec, thus providing a lru_lock for each memcg on each node. So on a large machine, memcgs no longer have to contend for the per-node pgdat->lru_lock; each can proceed quickly under its own lru_lock. Since memcg charging was moved before lru insertion, page isolation can serialize the page's memcg, so the per-memcg lruvec lock is stable and can replace the per-node lru lock.
Following Daniel Jordan's suggestion, I ran 208 'dd' tasks in 104 containers on a 2-socket * 26-core * HT box with a modified case: https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice With this and later patches, readtwice performance increases by about 80% with concurrent containers. Also add a debug function in the locking paths which may give some clues if something gets out of hand. Hugh Dickins helped polish the patches, thanks! Reported-by: kernel test robot <lkp-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> Signed-off-by: Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> Cc: Hugh Dickins <hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> Cc: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> Cc: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org> Cc: Michal Hocko <mhocko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> Cc: Vladimir Davydov <vdavydov.dev-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> Cc: Yang Shi <yang.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> Cc: Matthew Wilcox <willy-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> Cc: Konstantin Khlebnikov <khlebnikov-XoJtRXgx1JseBXzfvpsJ4g@public.gmane.org> Cc: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org Cc: cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org --- include/linux/memcontrol.h | 58 +++++++++++++++++++++++++ include/linux/mmzone.h | 2 + mm/compaction.c | 67 ++++++++++++++++++----------- mm/huge_memory.c | 11 ++--- mm/memcontrol.c | 63 ++++++++++++++++++++++++++- mm/mlock.c | 47 +++++++++++++------- mm/mmzone.c | 1 + mm/swap.c | 104 +++++++++++++++++++++------------------- mm/vmscan.c | 70 ++++++++++++++++-------------- 9 files changed, 288 insertions(+), 135 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index e77197a62809..258901021c6c 100644 ---
a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -411,6 +411,19 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg, struct mem_cgroup *get_mem_cgroup_from_page(struct page *page); +struct lruvec *lock_page_lruvec(struct page *page); +struct lruvec *lock_page_lruvec_irq(struct page *page); +struct lruvec *lock_page_lruvec_irqsave(struct page *page, + unsigned long *flags); + +#ifdef CONFIG_DEBUG_VM +void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page); +#else +static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page) +{ +} +#endif + static inline struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){ return css ? container_of(css, struct mem_cgroup, css) : NULL; @@ -892,6 +905,31 @@ static inline void mem_cgroup_put(struct mem_cgroup *memcg) { } +static inline struct lruvec *lock_page_lruvec(struct page *page) +{ + struct pglist_data *pgdat = page_pgdat(page); + + spin_lock(&pgdat->__lruvec.lru_lock); + return &pgdat->__lruvec; +} + +static inline struct lruvec *lock_page_lruvec_irq(struct page *page) +{ + struct pglist_data *pgdat = page_pgdat(page); + + spin_lock_irq(&pgdat->__lruvec.lru_lock); + return &pgdat->__lruvec; +} + +static inline struct lruvec *lock_page_lruvec_irqsave(struct page *page, + unsigned long *flagsp) +{ + struct pglist_data *pgdat = page_pgdat(page); + + spin_lock_irqsave(&pgdat->__lruvec.lru_lock, *flagsp); + return &pgdat->__lruvec; +} + static inline struct mem_cgroup * mem_cgroup_iter(struct mem_cgroup *root, struct mem_cgroup *prev, @@ -1126,6 +1164,10 @@ static inline void count_memcg_page_event(struct page *page, void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx) { } + +static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page) +{ +} #endif /* CONFIG_MEMCG */ /* idx can be of type enum memcg_stat_item or node_stat_item */ @@ -1255,6 +1297,22 @@ static inline struct lruvec *parent_lruvec(struct lruvec 
*lruvec) return mem_cgroup_lruvec(memcg, lruvec_pgdat(lruvec)); } +static inline void unlock_page_lruvec(struct lruvec *lruvec) +{ + spin_unlock(&lruvec->lru_lock); +} + +static inline void unlock_page_lruvec_irq(struct lruvec *lruvec) +{ + spin_unlock_irq(&lruvec->lru_lock); +} + +static inline void unlock_page_lruvec_irqrestore(struct lruvec *lruvec, + unsigned long flags) +{ + spin_unlock_irqrestore(&lruvec->lru_lock, flags); +} + #ifdef CONFIG_CGROUP_WRITEBACK struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb); diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 14c668b7e793..30b961a9a749 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -249,6 +249,8 @@ enum lruvec_flags { }; struct lruvec { + /* per lruvec lru_lock for memcg */ + spinlock_t lru_lock; struct list_head lists[NR_LRU_LISTS]; /* * These track the cost of reclaiming one LRU - file or anon - diff --git a/mm/compaction.c b/mm/compaction.c index 2da2933fe56b..88bbd2e93895 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -787,7 +787,7 @@ static bool too_many_isolated(pg_data_t *pgdat) unsigned long nr_scanned = 0, nr_isolated = 0; struct lruvec *lruvec; unsigned long flags = 0; - bool locked = false; + struct lruvec *locked_lruvec = NULL; struct page *page = NULL, *valid_page = NULL; unsigned long start_pfn = low_pfn; bool skip_on_failure = false; @@ -847,11 +847,21 @@ static bool too_many_isolated(pg_data_t *pgdat) * contention, to give chance to IRQs. Abort completely if * a fatal signal is pending. 
*/ - if (!(low_pfn % SWAP_CLUSTER_MAX) - && compact_unlock_should_abort(&pgdat->lru_lock, - flags, &locked, cc)) { - low_pfn = 0; - goto fatal_pending; + if (!(low_pfn % SWAP_CLUSTER_MAX)) { + if (locked_lruvec) { + unlock_page_lruvec_irqrestore(locked_lruvec, + flags); + locked_lruvec = NULL; + } + + if (fatal_signal_pending(current)) { + cc->contended = true; + + low_pfn = 0; + goto fatal_pending; + } + + cond_resched(); } if (!pfn_valid_within(low_pfn)) @@ -922,10 +932,9 @@ static bool too_many_isolated(pg_data_t *pgdat) */ if (unlikely(__PageMovable(page)) && !PageIsolated(page)) { - if (locked) { - spin_unlock_irqrestore(&pgdat->lru_lock, - flags); - locked = false; + if (locked_lruvec) { + unlock_page_lruvec_irqrestore(locked_lruvec, flags); + locked_lruvec = NULL; } if (!isolate_movable_page(page, isolate_mode)) @@ -966,10 +975,20 @@ static bool too_many_isolated(pg_data_t *pgdat) if (!TestClearPageLRU(page)) goto isolate_fail_put; + rcu_read_lock(); + lruvec = mem_cgroup_page_lruvec(page, pgdat); + /* If we already hold the lock, we can skip some rechecking */ - if (!locked) { - locked = compact_lock_irqsave(&pgdat->lru_lock, - &flags, cc); + if (lruvec != locked_lruvec) { + if (locked_lruvec) + unlock_page_lruvec_irqrestore(locked_lruvec, + flags); + + compact_lock_irqsave(&lruvec->lru_lock, &flags, cc); + locked_lruvec = lruvec; + rcu_read_unlock(); + + lruvec_memcg_debug(lruvec, page); /* Try get exclusive access under lock */ if (!skip_updated) { @@ -988,9 +1007,8 @@ static bool too_many_isolated(pg_data_t *pgdat) SetPageLRU(page); goto isolate_fail_put; } - } - - lruvec = mem_cgroup_page_lruvec(page, pgdat); + } else + rcu_read_unlock(); /* The whole page is taken off the LRU; skip the tail pages. 
*/ if (PageCompound(page)) @@ -1023,9 +1041,9 @@ static bool too_many_isolated(pg_data_t *pgdat) isolate_fail_put: /* Avoid potential deadlock in freeing page under lru_lock */ - if (locked) { - spin_unlock_irqrestore(&pgdat->lru_lock, flags); - locked = false; + if (locked_lruvec) { + unlock_page_lruvec_irqrestore(locked_lruvec, flags); + locked_lruvec = NULL; } put_page(page); @@ -1039,9 +1057,10 @@ static bool too_many_isolated(pg_data_t *pgdat) * page anyway. */ if (nr_isolated) { - if (locked) { - spin_unlock_irqrestore(&pgdat->lru_lock, flags); - locked = false; + if (locked_lruvec) { + unlock_page_lruvec_irqrestore(locked_lruvec, + flags); + locked_lruvec = NULL; } putback_movable_pages(&cc->migratepages); cc->nr_migratepages = 0; @@ -1068,8 +1087,8 @@ static bool too_many_isolated(pg_data_t *pgdat) page = NULL; isolate_abort: - if (locked) - spin_unlock_irqrestore(&pgdat->lru_lock, flags); + if (locked_lruvec) + unlock_page_lruvec_irqrestore(locked_lruvec, flags); if (page) { SetPageLRU(page); put_page(page); diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 28538444197b..a0cb95891ae5 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2346,7 +2346,7 @@ static void lru_add_page_tail(struct page *head, struct page *page_tail, VM_BUG_ON_PAGE(!PageHead(head), head); VM_BUG_ON_PAGE(PageCompound(page_tail), head); VM_BUG_ON_PAGE(PageLRU(page_tail), head); - lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock); + lockdep_assert_held(&lruvec->lru_lock); if (list) { /* page reclaim is reclaiming a huge page */ @@ -2430,7 +2430,6 @@ static void __split_huge_page(struct page *page, struct list_head *list, pgoff_t end) { struct page *head = compound_head(page); - pg_data_t *pgdat = page_pgdat(head); struct lruvec *lruvec; struct address_space *swap_cache = NULL; unsigned long offset = 0; @@ -2447,10 +2446,8 @@ static void __split_huge_page(struct page *page, struct list_head *list, xa_lock(&swap_cache->i_pages); } - /* prevent PageLRU to go away from under 
us, and freeze lru stats */ - spin_lock(&pgdat->lru_lock); - - lruvec = mem_cgroup_page_lruvec(head, pgdat); + /* lock lru list/PageCompound, ref freezed by page_ref_freeze */ + lruvec = lock_page_lruvec(head); for (i = HPAGE_PMD_NR - 1; i >= 1; i--) { __split_huge_page_tail(head, i, lruvec, list); @@ -2471,7 +2468,7 @@ static void __split_huge_page(struct page *page, struct list_head *list, } ClearPageCompound(head); - spin_unlock(&pgdat->lru_lock); + unlock_page_lruvec(lruvec); /* Caller disabled irqs, so they are still disabled here */ split_page_owner(head, HPAGE_PMD_ORDER); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 20c8ed69a930..d6746656cc39 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1196,6 +1196,19 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg, return ret; } +#ifdef CONFIG_DEBUG_VM +void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page) +{ + if (mem_cgroup_disabled()) + return; + + if (!page->mem_cgroup) + VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != root_mem_cgroup, page); + else + VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != page->mem_cgroup, page); +} +#endif + /** * mem_cgroup_page_lruvec - return lruvec for isolating/putting an LRU page * @page: the page @@ -1215,7 +1228,8 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd goto out; } - memcg = page->mem_cgroup; + VM_BUG_ON_PAGE(PageTail(page), page); + memcg = READ_ONCE(page->mem_cgroup); /* * Swapcache readahead pages are added to the LRU - and * possibly migrated - before they are charged. 
@@ -1236,6 +1250,51 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd return lruvec; } +struct lruvec *lock_page_lruvec(struct page *page) +{ + struct lruvec *lruvec; + struct pglist_data *pgdat = page_pgdat(page); + + rcu_read_lock(); + lruvec = mem_cgroup_page_lruvec(page, pgdat); + spin_lock(&lruvec->lru_lock); + rcu_read_unlock(); + + lruvec_memcg_debug(lruvec, page); + + return lruvec; +} + +struct lruvec *lock_page_lruvec_irq(struct page *page) +{ + struct lruvec *lruvec; + struct pglist_data *pgdat = page_pgdat(page); + + rcu_read_lock(); + lruvec = mem_cgroup_page_lruvec(page, pgdat); + spin_lock_irq(&lruvec->lru_lock); + rcu_read_unlock(); + + lruvec_memcg_debug(lruvec, page); + + return lruvec; +} + +struct lruvec *lock_page_lruvec_irqsave(struct page *page, unsigned long *flags) +{ + struct lruvec *lruvec; + struct pglist_data *pgdat = page_pgdat(page); + + rcu_read_lock(); + lruvec = mem_cgroup_page_lruvec(page, pgdat); + spin_lock_irqsave(&lruvec->lru_lock, *flags); + rcu_read_unlock(); + + lruvec_memcg_debug(lruvec, page); + + return lruvec; +} + /** * mem_cgroup_update_lru_size - account for adding or removing an lru page * @lruvec: mem_cgroup per zone lru vector @@ -2999,7 +3058,7 @@ void __memcg_kmem_uncharge_page(struct page *page, int order) /* * Because tail pages are not marked as "used", set it. We're under - * pgdat->lru_lock and migration entries setup in all page mappings. + * lruvec->lru_lock and migration entries setup in all page mappings. */ void mem_cgroup_split_huge_fixup(struct page *head) { diff --git a/mm/mlock.c b/mm/mlock.c index 228ba5a8e0a5..5d40d259a931 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -106,12 +106,10 @@ void mlock_vma_page(struct page *page) * Isolate a page from LRU with optional get_page() pin. * Assumes lru_lock already held and page already pinned. 
*/ -static bool __munlock_isolate_lru_page(struct page *page, bool getpage) +static bool __munlock_isolate_lru_page(struct page *page, + struct lruvec *lruvec, bool getpage) { if (TestClearPageLRU(page)) { - struct lruvec *lruvec; - - lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page)); if (getpage) get_page(page); del_page_from_lru_list(page, lruvec, page_lru(page)); @@ -181,7 +179,7 @@ static void __munlock_isolation_failed(struct page *page) unsigned int munlock_vma_page(struct page *page) { int nr_pages; - pg_data_t *pgdat = page_pgdat(page); + struct lruvec *lruvec; /* For try_to_munlock() and to serialize with page migration */ BUG_ON(!PageLocked(page)); @@ -189,11 +187,16 @@ unsigned int munlock_vma_page(struct page *page) VM_BUG_ON_PAGE(PageTail(page), page); /* - * Serialize with any parallel __split_huge_page_refcount() which + * Serialize split tail pages in __split_huge_page_tail() which * might otherwise copy PageMlocked to part of the tail pages before * we clear it in the head page. It also stabilizes hpage_nr_pages(). + * TestClearPageLRU can't be used here to block page isolation, since + * an out-of-lock clear_page_mlock may interfere with the + * PageLRU/PageMlocked sequence, same as __pagevec_lru_add_fn, and + * place the page on the wrong lru list here. So rely on PageLocked + * to stop lruvec changes in mem_cgroup_move_account().
*/ - spin_lock_irq(&pgdat->lru_lock); + lruvec = lock_page_lruvec_irq(page); if (!TestClearPageMlocked(page)) { /* Potentially, PTE-mapped THP: do not skip the rest PTEs */ @@ -204,15 +207,15 @@ unsigned int munlock_vma_page(struct page *page) nr_pages = hpage_nr_pages(page); __mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages); - if (__munlock_isolate_lru_page(page, true)) { - spin_unlock_irq(&pgdat->lru_lock); + if (__munlock_isolate_lru_page(page, lruvec, true)) { + unlock_page_lruvec_irq(lruvec); __munlock_isolated_page(page); goto out; } __munlock_isolation_failed(page); unlock_out: - spin_unlock_irq(&pgdat->lru_lock); + unlock_page_lruvec_irq(lruvec); out: return nr_pages - 1; @@ -292,23 +295,34 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone) int nr = pagevec_count(pvec); int delta_munlocked = -nr; struct pagevec pvec_putback; + struct lruvec *lruvec = NULL; int pgrescued = 0; pagevec_init(&pvec_putback); /* Phase 1: page isolation */ - spin_lock_irq(&zone->zone_pgdat->lru_lock); for (i = 0; i < nr; i++) { struct page *page = pvec->pages[i]; + struct lruvec *new_lruvec; + + /* block memcg change in mem_cgroup_move_account */ + lock_page_memcg(page); + new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page)); + if (new_lruvec != lruvec) { + if (lruvec) + unlock_page_lruvec_irq(lruvec); + lruvec = lock_page_lruvec_irq(page); + } if (TestClearPageMlocked(page)) { /* * We already have pin from follow_page_mask() * so we can spare the get_page() here. */ - if (__munlock_isolate_lru_page(page, false)) + if (__munlock_isolate_lru_page(page, lruvec, false)) { + unlock_page_memcg(page); continue; - else + } else __munlock_isolation_failed(page); } else { delta_munlocked++; @@ -320,11 +334,14 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone) * pin. We cannot do it under lru_lock however. If it's * the last pin, __page_cache_release() would deadlock. 
*/ + unlock_page_memcg(page); pagevec_add(&pvec_putback, pvec->pages[i]); pvec->pages[i] = NULL; } - __mod_zone_page_state(zone, NR_MLOCK, delta_munlocked); - spin_unlock_irq(&zone->zone_pgdat->lru_lock); + if (lruvec) { + __mod_zone_page_state(zone, NR_MLOCK, delta_munlocked); + unlock_page_lruvec_irq(lruvec); + } /* Now we can release pins of pages that we are not munlocking */ pagevec_release(&pvec_putback); diff --git a/mm/mmzone.c b/mm/mmzone.c index 4686fdc23bb9..3750a90ed4a0 100644 --- a/mm/mmzone.c +++ b/mm/mmzone.c @@ -91,6 +91,7 @@ void lruvec_init(struct lruvec *lruvec) enum lru_list lru; memset(lruvec, 0, sizeof(struct lruvec)); + spin_lock_init(&lruvec->lru_lock); for_each_lru(lru) INIT_LIST_HEAD(&lruvec->lists[lru]); diff --git a/mm/swap.c b/mm/swap.c index 3029b3f74811..09edac441eb6 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -79,15 +79,13 @@ static DEFINE_PER_CPU(struct lru_pvecs, lru_pvecs) = { static void __page_cache_release(struct page *page) { if (PageLRU(page)) { - pg_data_t *pgdat = page_pgdat(page); struct lruvec *lruvec; unsigned long flags; __ClearPageLRU(page); - spin_lock_irqsave(&pgdat->lru_lock, flags); - lruvec = mem_cgroup_page_lruvec(page, pgdat); + lruvec = lock_page_lruvec_irqsave(page, &flags); del_page_from_lru_list(page, lruvec, page_off_lru(page)); - spin_unlock_irqrestore(&pgdat->lru_lock, flags); + unlock_page_lruvec_irqrestore(lruvec, flags); } __ClearPageWaiters(page); } @@ -206,32 +204,30 @@ static void pagevec_lru_move_fn(struct pagevec *pvec, void (*move_fn)(struct page *page, struct lruvec *lruvec)) { int i; - struct pglist_data *pgdat = NULL; - struct lruvec *lruvec; + struct lruvec *lruvec = NULL; unsigned long flags = 0; for (i = 0; i < pagevec_count(pvec); i++) { struct page *page = pvec->pages[i]; - struct pglist_data *pagepgdat = page_pgdat(page); - - if (pagepgdat != pgdat) { - if (pgdat) - spin_unlock_irqrestore(&pgdat->lru_lock, flags); - pgdat = pagepgdat; - spin_lock_irqsave(&pgdat->lru_lock, flags); - } + 
struct lruvec *new_lruvec; /* block memcg migration during page moving between lru */ if (!TestClearPageLRU(page)) continue; - lruvec = mem_cgroup_page_lruvec(page, pgdat); + new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page)); + if (lruvec != new_lruvec) { + if (lruvec) + unlock_page_lruvec_irqrestore(lruvec, flags); + lruvec = lock_page_lruvec_irqsave(page, &flags); + } + (*move_fn)(page, lruvec); SetPageLRU(page); } - if (pgdat) - spin_unlock_irqrestore(&pgdat->lru_lock, flags); + if (lruvec) + unlock_page_lruvec_irqrestore(lruvec, flags); release_pages(pvec->pages, pvec->nr); pagevec_reinit(pvec); } @@ -274,9 +270,8 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages) { do { unsigned long lrusize; - struct pglist_data *pgdat = lruvec_pgdat(lruvec); - spin_lock_irq(&pgdat->lru_lock); + spin_lock_irq(&lruvec->lru_lock); /* Record cost event */ if (file) lruvec->file_cost += nr_pages; @@ -300,7 +295,7 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages) lruvec->file_cost /= 2; lruvec->anon_cost /= 2; } - spin_unlock_irq(&pgdat->lru_lock); + spin_unlock_irq(&lruvec->lru_lock); } while ((lruvec = parent_lruvec(lruvec))); } @@ -365,12 +360,13 @@ static inline void activate_page_drain(int cpu) void activate_page(struct page *page) { pg_data_t *pgdat = page_pgdat(page); + struct lruvec *lruvec; page = compound_head(page); - spin_lock_irq(&pgdat->lru_lock); + lruvec = lock_page_lruvec_irq(page); if (PageLRU(page)) - __activate_page(page, mem_cgroup_page_lruvec(page, pgdat)); - spin_unlock_irq(&pgdat->lru_lock); + __activate_page(page, lruvec); + unlock_page_lruvec_irq(lruvec); } #endif @@ -817,8 +813,7 @@ void release_pages(struct page **pages, int nr) { int i; LIST_HEAD(pages_to_free); - struct pglist_data *locked_pgdat = NULL; - struct lruvec *lruvec; + struct lruvec *lruvec = NULL; unsigned long uninitialized_var(flags); unsigned int uninitialized_var(lock_batch); @@ -828,21 +823,20 @@ void 
release_pages(struct page **pages, int nr) /* * Make sure the IRQ-safe lock-holding time does not get * excessive with a continuous string of pages from the - * same pgdat. The lock is held only if pgdat != NULL. + * same lruvec. The lock is held only if lruvec != NULL. */ - if (locked_pgdat && ++lock_batch == SWAP_CLUSTER_MAX) { - spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags); - locked_pgdat = NULL; + if (lruvec && ++lock_batch == SWAP_CLUSTER_MAX) { + unlock_page_lruvec_irqrestore(lruvec, flags); + lruvec = NULL; } if (is_huge_zero_page(page)) continue; if (is_zone_device_page(page)) { - if (locked_pgdat) { - spin_unlock_irqrestore(&locked_pgdat->lru_lock, - flags); - locked_pgdat = NULL; + if (lruvec) { + unlock_page_lruvec_irqrestore(lruvec, flags); + lruvec = NULL; } /* * ZONE_DEVICE pages that return 'false' from @@ -861,28 +855,28 @@ void release_pages(struct page **pages, int nr) continue; if (PageCompound(page)) { - if (locked_pgdat) { - spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags); - locked_pgdat = NULL; + if (lruvec) { + unlock_page_lruvec_irqrestore(lruvec, flags); + lruvec = NULL; } __put_compound_page(page); continue; } if (PageLRU(page)) { - struct pglist_data *pgdat = page_pgdat(page); + struct lruvec *new_lruvec; - if (pgdat != locked_pgdat) { - if (locked_pgdat) - spin_unlock_irqrestore(&locked_pgdat->lru_lock, + new_lruvec = mem_cgroup_page_lruvec(page, + page_pgdat(page)); + if (new_lruvec != lruvec) { + if (lruvec) + unlock_page_lruvec_irqrestore(lruvec, flags); lock_batch = 0; - locked_pgdat = pgdat; - spin_lock_irqsave(&locked_pgdat->lru_lock, flags); + lruvec = lock_page_lruvec_irqsave(page, &flags); } __ClearPageLRU(page); - lruvec = mem_cgroup_page_lruvec(page, locked_pgdat); del_page_from_lru_list(page, lruvec, page_off_lru(page)); } @@ -892,8 +886,8 @@ void release_pages(struct page **pages, int nr) list_add(&page->lru, &pages_to_free); } - if (locked_pgdat) - spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags); + 
if (lruvec) + unlock_page_lruvec_irqrestore(lruvec, flags); mem_cgroup_uncharge_list(&pages_to_free); free_unref_page_list(&pages_to_free); @@ -981,26 +975,24 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec) void __pagevec_lru_add(struct pagevec *pvec) { int i; - struct pglist_data *pgdat = NULL; - struct lruvec *lruvec; + struct lruvec *lruvec = NULL; unsigned long flags = 0; for (i = 0; i < pagevec_count(pvec); i++) { struct page *page = pvec->pages[i]; - struct pglist_data *pagepgdat = page_pgdat(page); + struct lruvec *new_lruvec; - if (pagepgdat != pgdat) { - if (pgdat) - spin_unlock_irqrestore(&pgdat->lru_lock, flags); - pgdat = pagepgdat; - spin_lock_irqsave(&pgdat->lru_lock, flags); + new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page)); + if (lruvec != new_lruvec) { + if (lruvec) + unlock_page_lruvec_irqrestore(lruvec, flags); + lruvec = lock_page_lruvec_irqsave(page, &flags); } - lruvec = mem_cgroup_page_lruvec(page, pgdat); __pagevec_lru_add_fn(page, lruvec); } - if (pgdat) - spin_unlock_irqrestore(&pgdat->lru_lock, flags); + if (lruvec) + unlock_page_lruvec_irqrestore(lruvec, flags); release_pages(pvec->pages, pvec->nr); pagevec_reinit(pvec); } diff --git a/mm/vmscan.c b/mm/vmscan.c index f77748adc340..168c1659e430 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1774,15 +1774,13 @@ int isolate_lru_page(struct page *page) WARN_RATELIMIT(PageTail(page), "trying to isolate tail page"); if (TestClearPageLRU(page)) { - pg_data_t *pgdat = page_pgdat(page); struct lruvec *lruvec; int lru = page_lru(page); get_page(page); - lruvec = mem_cgroup_page_lruvec(page, pgdat); - spin_lock_irq(&pgdat->lru_lock); + lruvec = lock_page_lruvec_irq(page); del_page_from_lru_list(page, lruvec, lru); - spin_unlock_irq(&pgdat->lru_lock); + unlock_page_lruvec_irq(lruvec); ret = 0; } @@ -1849,20 +1847,22 @@ static int too_many_isolated(struct pglist_data *pgdat, int file, static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec, 
struct list_head *list) { - struct pglist_data *pgdat = lruvec_pgdat(lruvec); int nr_pages, nr_moved = 0; LIST_HEAD(pages_to_free); struct page *page; + struct lruvec *orig_lruvec = lruvec; enum lru_list lru; while (!list_empty(list)) { + struct lruvec *new_lruvec = NULL; + page = lru_to_page(list); VM_BUG_ON_PAGE(PageLRU(page), page); list_del(&page->lru); if (unlikely(!page_evictable(page))) { - spin_unlock_irq(&pgdat->lru_lock); + spin_unlock_irq(&lruvec->lru_lock); putback_lru_page(page); - spin_lock_irq(&pgdat->lru_lock); + spin_lock_irq(&lruvec->lru_lock); continue; } @@ -1876,6 +1876,12 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec, * list_add(&page->lru,) * list_add(&page->lru,) //corrupt */ + new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page)); + if (new_lruvec != lruvec) { + if (lruvec) + spin_unlock_irq(&lruvec->lru_lock); + lruvec = lock_page_lruvec_irq(page); + } SetPageLRU(page); if (unlikely(put_page_testzero(page))) { @@ -1883,16 +1889,15 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec, __ClearPageActive(page); if (unlikely(PageCompound(page))) { - spin_unlock_irq(&pgdat->lru_lock); + spin_unlock_irq(&lruvec->lru_lock); destroy_compound_page(page); - spin_lock_irq(&pgdat->lru_lock); + spin_lock_irq(&lruvec->lru_lock); } else list_add(&page->lru, &pages_to_free); continue; } - lruvec = mem_cgroup_page_lruvec(page, pgdat); lru = page_lru(page); nr_pages = hpage_nr_pages(page); @@ -1902,6 +1907,11 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec, if (PageActive(page)) workingset_age_nonresident(lruvec, nr_pages); } + if (orig_lruvec != lruvec) { + if (lruvec) + spin_unlock_irq(&lruvec->lru_lock); + spin_lock_irq(&orig_lruvec->lru_lock); + } /* * To save our caller's stack, now use input list for pages to free. 
@@ -1957,7 +1967,7 @@ static int current_may_throttle(void) lru_add_drain(); - spin_lock_irq(&pgdat->lru_lock); + spin_lock_irq(&lruvec->lru_lock); nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list, &nr_scanned, sc, lru); @@ -1969,7 +1979,7 @@ static int current_may_throttle(void) __count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned); __count_vm_events(PGSCAN_ANON + file, nr_scanned); - spin_unlock_irq(&pgdat->lru_lock); + spin_unlock_irq(&lruvec->lru_lock); if (nr_taken == 0) return 0; @@ -1977,7 +1987,7 @@ static int current_may_throttle(void) nr_reclaimed = shrink_page_list(&page_list, pgdat, sc, 0, &stat, false); - spin_lock_irq(&pgdat->lru_lock); + spin_lock_irq(&lruvec->lru_lock); move_pages_to_lru(lruvec, &page_list); __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken); @@ -1986,7 +1996,7 @@ static int current_may_throttle(void) __count_vm_events(item, nr_reclaimed); __count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed); __count_vm_events(PGSTEAL_ANON + file, nr_reclaimed); - spin_unlock_irq(&pgdat->lru_lock); + spin_unlock_irq(&lruvec->lru_lock); lru_note_cost(lruvec, file, stat.nr_pageout); mem_cgroup_uncharge_list(&page_list); @@ -2039,7 +2049,7 @@ static void shrink_active_list(unsigned long nr_to_scan, lru_add_drain(); - spin_lock_irq(&pgdat->lru_lock); + spin_lock_irq(&lruvec->lru_lock); nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold, &nr_scanned, sc, lru); @@ -2049,7 +2059,7 @@ static void shrink_active_list(unsigned long nr_to_scan, __count_vm_events(PGREFILL, nr_scanned); __count_memcg_events(lruvec_memcg(lruvec), PGREFILL, nr_scanned); - spin_unlock_irq(&pgdat->lru_lock); + spin_unlock_irq(&lruvec->lru_lock); while (!list_empty(&l_hold)) { cond_resched(); @@ -2095,7 +2105,7 @@ static void shrink_active_list(unsigned long nr_to_scan, /* * Move pages back to the lru list. 
*/ - spin_lock_irq(&pgdat->lru_lock); + spin_lock_irq(&lruvec->lru_lock); nr_activate = move_pages_to_lru(lruvec, &l_active); nr_deactivate = move_pages_to_lru(lruvec, &l_inactive); @@ -2106,7 +2116,7 @@ static void shrink_active_list(unsigned long nr_to_scan, __count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate); __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken); - spin_unlock_irq(&pgdat->lru_lock); + spin_unlock_irq(&lruvec->lru_lock); mem_cgroup_uncharge_list(&l_active); free_unref_page_list(&l_active); @@ -2696,10 +2706,10 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc) /* * Determine the scan balance between anon and file LRUs. */ - spin_lock_irq(&pgdat->lru_lock); + spin_lock_irq(&target_lruvec->lru_lock); sc->anon_cost = target_lruvec->anon_cost; sc->file_cost = target_lruvec->file_cost; - spin_unlock_irq(&pgdat->lru_lock); + spin_unlock_irq(&target_lruvec->lru_lock); /* * Target desirable inactive:active list ratios for the anon @@ -4275,24 +4285,22 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order) */ void check_move_unevictable_pages(struct pagevec *pvec) { - struct lruvec *lruvec; - struct pglist_data *pgdat = NULL; + struct lruvec *lruvec = NULL; int pgscanned = 0; int pgrescued = 0; int i; for (i = 0; i < pvec->nr; i++) { struct page *page = pvec->pages[i]; - struct pglist_data *pagepgdat = page_pgdat(page); + struct lruvec *new_lruvec; pgscanned++; - if (pagepgdat != pgdat) { - if (pgdat) - spin_unlock_irq(&pgdat->lru_lock); - pgdat = pagepgdat; - spin_lock_irq(&pgdat->lru_lock); + new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page)); + if (lruvec != new_lruvec) { + if (lruvec) + unlock_page_lruvec_irq(lruvec); + lruvec = lock_page_lruvec_irq(page); } - lruvec = mem_cgroup_page_lruvec(page, pgdat); if (!PageLRU(page) || !PageUnevictable(page)) continue; @@ -4308,10 +4316,10 @@ void check_move_unevictable_pages(struct pagevec *pvec) } } - if (pgdat) { + if 
(lruvec) { __count_vm_events(UNEVICTABLE_PGRESCUED, pgrescued); __count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned); - spin_unlock_irq(&pgdat->lru_lock); + unlock_page_lruvec_irq(lruvec); } } EXPORT_SYMBOL_GPL(check_move_unevictable_pages); -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 101+ messages in thread
* Re: [PATCH v17 17/21] mm/lru: replace pgdat lru_lock with lruvec lock [not found] ` <1595681998-19193-18-git-send-email-alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> @ 2020-07-27 23:34 ` Alexander Duyck [not found] ` <CAKgT0UdaW4Rf43yULhQBuP07vQgmoPbaWHGKv1Z7fEPP6jH83w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2020-08-06 7:41 ` Alex Shi 1 sibling, 1 reply; 101+ messages in thread From: Alexander Duyck @ 2020-07-27 23:34 UTC (permalink / raw) To: Alex Shi Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox, Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups-u79uwXL29TY76Z2rM5mHXA, Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov, Rong Chen, Michal Hocko, Vladimir Davydov On Sat, Jul 25, 2020 at 6:01 AM Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> wrote: > > This patch moves per node lru_lock into lruvec, thus bring a lru_lock for > each of memcg per node. So on a large machine, each of memcg don't > have to suffer from per node pgdat->lru_lock competition. They could go > fast with their self lru_lock. > > After move memcg charge before lru inserting, page isolation could > serialize page's memcg, then per memcg lruvec lock is stable and could > replace per node lru lock. > > According to Daniel Jordan's suggestion, I run 208 'dd' with on 104 > containers on a 2s * 26cores * HT box with a modefied case: > https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice > > With this and later patches, the readtwice performance increases about > 80% within concurrent containers. > > Also add a debug func in locking which may give some clues if there are > sth out of hands. > > Hugh Dickins helped on patch polish, thanks! 
> > Reported-by: kernel test robot <lkp-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> > Signed-off-by: Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> > Cc: Hugh Dickins <hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> > Cc: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> > Cc: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org> > Cc: Michal Hocko <mhocko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> > Cc: Vladimir Davydov <vdavydov.dev-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> > Cc: Yang Shi <yang.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> > Cc: Matthew Wilcox <willy-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> > Cc: Konstantin Khlebnikov <khlebnikov-XoJtRXgx1JseBXzfvpsJ4g@public.gmane.org> > Cc: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> > Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org > Cc: cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > --- > include/linux/memcontrol.h | 58 +++++++++++++++++++++++++ > include/linux/mmzone.h | 2 + > mm/compaction.c | 67 ++++++++++++++++++----------- > mm/huge_memory.c | 11 ++--- > mm/memcontrol.c | 63 ++++++++++++++++++++++++++- > mm/mlock.c | 47 +++++++++++++------- > mm/mmzone.c | 1 + > mm/swap.c | 104 +++++++++++++++++++++------------------------ > mm/vmscan.c | 70 ++++++++++++++++-------------- > 9 files changed, 288 insertions(+), 135 deletions(-) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index e77197a62809..258901021c6c 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -411,6 +411,19 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg, > > struct mem_cgroup *get_mem_cgroup_from_page(struct page *page); > > +struct lruvec *lock_page_lruvec(struct page *page); > +struct lruvec *lock_page_lruvec_irq(struct page *page); > +struct lruvec *lock_page_lruvec_irqsave(struct page *page, > + unsigned long *flags); > 
+ > +#ifdef CONFIG_DEBUG_VM > +void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page); > +#else > +static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page) > +{ > +} > +#endif > + > static inline > struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){ > return css ? container_of(css, struct mem_cgroup, css) : NULL; > @@ -892,6 +905,31 @@ static inline void mem_cgroup_put(struct mem_cgroup *memcg) > { > } > > +static inline struct lruvec *lock_page_lruvec(struct page *page) > +{ > + struct pglist_data *pgdat = page_pgdat(page); > + > + spin_lock(&pgdat->__lruvec.lru_lock); > + return &pgdat->__lruvec; > +} > + > +static inline struct lruvec *lock_page_lruvec_irq(struct page *page) > +{ > + struct pglist_data *pgdat = page_pgdat(page); > + > + spin_lock_irq(&pgdat->__lruvec.lru_lock); > + return &pgdat->__lruvec; > +} > + > +static inline struct lruvec *lock_page_lruvec_irqsave(struct page *page, > + unsigned long *flagsp) > +{ > + struct pglist_data *pgdat = page_pgdat(page); > + > + spin_lock_irqsave(&pgdat->__lruvec.lru_lock, *flagsp); > + return &pgdat->__lruvec; > +} > + > static inline struct mem_cgroup * > mem_cgroup_iter(struct mem_cgroup *root, > struct mem_cgroup *prev, > @@ -1126,6 +1164,10 @@ static inline void count_memcg_page_event(struct page *page, > void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx) > { > } > + > +static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page) > +{ > +} > #endif /* CONFIG_MEMCG */ > > /* idx can be of type enum memcg_stat_item or node_stat_item */ > @@ -1255,6 +1297,22 @@ static inline struct lruvec *parent_lruvec(struct lruvec *lruvec) > return mem_cgroup_lruvec(memcg, lruvec_pgdat(lruvec)); > } > > +static inline void unlock_page_lruvec(struct lruvec *lruvec) > +{ > + spin_unlock(&lruvec->lru_lock); > +} > + > +static inline void unlock_page_lruvec_irq(struct lruvec *lruvec) > +{ > + spin_unlock_irq(&lruvec->lru_lock); > +} > 
+ > +static inline void unlock_page_lruvec_irqrestore(struct lruvec *lruvec, > + unsigned long flags) > +{ > + spin_unlock_irqrestore(&lruvec->lru_lock, flags); > +} > + > #ifdef CONFIG_CGROUP_WRITEBACK > > struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb); > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index 14c668b7e793..30b961a9a749 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -249,6 +249,8 @@ enum lruvec_flags { > }; > > struct lruvec { > + /* per lruvec lru_lock for memcg */ > + spinlock_t lru_lock; > struct list_head lists[NR_LRU_LISTS]; > /* > * These track the cost of reclaiming one LRU - file or anon - > diff --git a/mm/compaction.c b/mm/compaction.c > index 2da2933fe56b..88bbd2e93895 100644 > --- a/mm/compaction.c > +++ b/mm/compaction.c > @@ -787,7 +787,7 @@ static bool too_many_isolated(pg_data_t *pgdat) > unsigned long nr_scanned = 0, nr_isolated = 0; > struct lruvec *lruvec; > unsigned long flags = 0; > - bool locked = false; > + struct lruvec *locked_lruvec = NULL; > struct page *page = NULL, *valid_page = NULL; > unsigned long start_pfn = low_pfn; > bool skip_on_failure = false; > @@ -847,11 +847,21 @@ static bool too_many_isolated(pg_data_t *pgdat) > * contention, to give chance to IRQs. Abort completely if > * a fatal signal is pending. > */ > - if (!(low_pfn % SWAP_CLUSTER_MAX) > - && compact_unlock_should_abort(&pgdat->lru_lock, > - flags, &locked, cc)) { > - low_pfn = 0; > - goto fatal_pending; > + if (!(low_pfn % SWAP_CLUSTER_MAX)) { > + if (locked_lruvec) { > + unlock_page_lruvec_irqrestore(locked_lruvec, > + flags); > + locked_lruvec = NULL; > + } > + > + if (fatal_signal_pending(current)) { > + cc->contended = true; > + > + low_pfn = 0; > + goto fatal_pending; > + } > + > + cond_resched(); > } > > if (!pfn_valid_within(low_pfn)) I'm noticing this patch introduces a bunch of noise. What is the reason for getting rid of compact_unlock_should_abort? 
It seems like you just open coded it here. If there is some sort of issue with it then it might be better to replace it as part of a preparatory patch before you introduce this one, as changes like this make it harder to review. It might make more sense to look at modifying compact_unlock_should_abort and compact_lock_irqsave (which always returns true so should probably be void) to address the deficiencies they have that make them unusable for you. > @@ -922,10 +932,9 @@ static bool too_many_isolated(pg_data_t *pgdat) > */ > if (unlikely(__PageMovable(page)) && > !PageIsolated(page)) { > - if (locked) { > - spin_unlock_irqrestore(&pgdat->lru_lock, > - flags); > - locked = false; > + if (locked_lruvec) { > + unlock_page_lruvec_irqrestore(locked_lruvec, flags); > + locked_lruvec = NULL; > } > > if (!isolate_movable_page(page, isolate_mode)) > @@ -966,10 +975,20 @@ static bool too_many_isolated(pg_data_t *pgdat) > if (!TestClearPageLRU(page)) > goto isolate_fail_put; > > + rcu_read_lock(); > + lruvec = mem_cgroup_page_lruvec(page, pgdat); > + > /* If we already hold the lock, we can skip some rechecking */ > - if (!locked) { > - locked = compact_lock_irqsave(&pgdat->lru_lock, > - &flags, cc); > + if (lruvec != locked_lruvec) { > + if (locked_lruvec) > + unlock_page_lruvec_irqrestore(locked_lruvec, > + flags); > + > + compact_lock_irqsave(&lruvec->lru_lock, &flags, cc); > + locked_lruvec = lruvec; > + rcu_read_unlock(); > + > + lruvec_memcg_debug(lruvec, page); > > /* Try get exclusive access under lock */ > if (!skip_updated) { So this bit makes things a bit complicated. From what I can tell, the comment about exclusive access under the lock is supposed to apply to the pageblock via the lru_lock. However, you are having to retest the lock for each page because it is possible the page was moved to another memory cgroup while the lru_lock was released, correct?
So in this case is the lru vector lock really providing any protection for the skip_updated portion of this code block if the lock isn't exclusive to the pageblock? In theory this would probably make more sense to have protected the skip bits under the zone lock, but I imagine that was avoided due to the additional overhead. > @@ -988,9 +1007,8 @@ static bool too_many_isolated(pg_data_t *pgdat) > SetPageLRU(page); > goto isolate_fail_put; > } > - } > - > - lruvec = mem_cgroup_page_lruvec(page, pgdat); > + } else > + rcu_read_unlock(); > > /* The whole page is taken off the LRU; skip the tail pages. */ > if (PageCompound(page)) > @@ -1023,9 +1041,9 @@ static bool too_many_isolated(pg_data_t *pgdat) > > isolate_fail_put: > /* Avoid potential deadlock in freeing page under lru_lock */ > - if (locked) { > - spin_unlock_irqrestore(&pgdat->lru_lock, flags); > - locked = false; > + if (locked_lruvec) { > + unlock_page_lruvec_irqrestore(locked_lruvec, flags); > + locked_lruvec = NULL; > } > put_page(page); > > @@ -1039,9 +1057,10 @@ static bool too_many_isolated(pg_data_t *pgdat) > * page anyway. 
> */ > if (nr_isolated) { > - if (locked) { > - spin_unlock_irqrestore(&pgdat->lru_lock, flags); > - locked = false; > + if (locked_lruvec) { > + unlock_page_lruvec_irqrestore(locked_lruvec, > + flags); > + locked_lruvec = NULL; > } > putback_movable_pages(&cc->migratepages); > cc->nr_migratepages = 0; > @@ -1068,8 +1087,8 @@ static bool too_many_isolated(pg_data_t *pgdat) > page = NULL; > > isolate_abort: > - if (locked) > - spin_unlock_irqrestore(&pgdat->lru_lock, flags); > + if (locked_lruvec) > + unlock_page_lruvec_irqrestore(locked_lruvec, flags); > if (page) { > SetPageLRU(page); > put_page(page); <snip> > diff --git a/mm/vmscan.c b/mm/vmscan.c > index f77748adc340..168c1659e430 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -1774,15 +1774,13 @@ int isolate_lru_page(struct page *page) > WARN_RATELIMIT(PageTail(page), "trying to isolate tail page"); > > if (TestClearPageLRU(page)) { > - pg_data_t *pgdat = page_pgdat(page); > struct lruvec *lruvec; > int lru = page_lru(page); > > get_page(page); > - lruvec = mem_cgroup_page_lruvec(page, pgdat); > - spin_lock_irq(&pgdat->lru_lock); > + lruvec = lock_page_lruvec_irq(page); > del_page_from_lru_list(page, lruvec, lru); > - spin_unlock_irq(&pgdat->lru_lock); > + unlock_page_lruvec_irq(lruvec); > ret = 0; > } > > @@ -1849,20 +1847,22 @@ static int too_many_isolated(struct pglist_data *pgdat, int file, > static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec, > struct list_head *list) > { > - struct pglist_data *pgdat = lruvec_pgdat(lruvec); > int nr_pages, nr_moved = 0; > LIST_HEAD(pages_to_free); > struct page *page; > + struct lruvec *orig_lruvec = lruvec; > enum lru_list lru; > > while (!list_empty(list)) { > + struct lruvec *new_lruvec = NULL; > + > page = lru_to_page(list); > VM_BUG_ON_PAGE(PageLRU(page), page); > list_del(&page->lru); > if (unlikely(!page_evictable(page))) { > - spin_unlock_irq(&pgdat->lru_lock); > + spin_unlock_irq(&lruvec->lru_lock); > putback_lru_page(page); > - 
spin_lock_irq(&pgdat->lru_lock); > + spin_lock_irq(&lruvec->lru_lock); > continue; > } > > @@ -1876,6 +1876,12 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec, > * list_add(&page->lru,) > * list_add(&page->lru,) //corrupt > */ > + new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page)); > + if (new_lruvec != lruvec) { > + if (lruvec) > + spin_unlock_irq(&lruvec->lru_lock); > + lruvec = lock_page_lruvec_irq(page); > + } > SetPageLRU(page); > > if (unlikely(put_page_testzero(page))) { I was going through the code of the entire patch set and I noticed these changes in move_pages_to_lru. What is the reason for adding the new_lruvec logic? My understanding is that we are moving the pages to the lruvec provided are we not?If so why do we need to add code to get a new lruvec? The code itself seems to stand out from the rest of the patch as it is introducing new code instead of replacing existing locking code, and it doesn't match up with the description of what this function is supposed to do since it changes the lruvec. > @@ -1883,16 +1889,15 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec, > __ClearPageActive(page); > > if (unlikely(PageCompound(page))) { > - spin_unlock_irq(&pgdat->lru_lock); > + spin_unlock_irq(&lruvec->lru_lock); > destroy_compound_page(page); > - spin_lock_irq(&pgdat->lru_lock); > + spin_lock_irq(&lruvec->lru_lock); > } else > list_add(&page->lru, &pages_to_free); > > continue; > } > > - lruvec = mem_cgroup_page_lruvec(page, pgdat); > lru = page_lru(page); > nr_pages = hpage_nr_pages(page); > > @@ -1902,6 +1907,11 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec, > if (PageActive(page)) > workingset_age_nonresident(lruvec, nr_pages); > } > + if (orig_lruvec != lruvec) { > + if (lruvec) > + spin_unlock_irq(&lruvec->lru_lock); > + spin_lock_irq(&orig_lruvec->lru_lock); > + } > > /* > * To save our caller's stack, now use input list for pages to free. 
* Re: [PATCH v17 17/21] mm/lru: replace pgdat lru_lock with lruvec lock [not found] ` <CAKgT0UdaW4Rf43yULhQBuP07vQgmoPbaWHGKv1Z7fEPP6jH83w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2020-07-28 7:15 ` Alex Shi 2020-07-28 11:19 ` Alex Shi 2020-07-28 15:39 ` Alex Shi 2 siblings, 0 replies; 101+ messages in thread From: Alex Shi @ 2020-07-28 7:15 UTC (permalink / raw) To: Alexander Duyck Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox, Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups-u79uwXL29TY76Z2rM5mHXA, Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov, Rong Chen, Michal Hocko, Vladimir Davydov 在 2020/7/28 上午7:34, Alexander Duyck 写道: >> @@ -847,11 +847,21 @@ static bool too_many_isolated(pg_data_t *pgdat) >> * contention, to give chance to IRQs. Abort completely if >> * a fatal signal is pending. >> */ >> - if (!(low_pfn % SWAP_CLUSTER_MAX) >> - && compact_unlock_should_abort(&pgdat->lru_lock, >> - flags, &locked, cc)) { >> - low_pfn = 0; >> - goto fatal_pending; >> + if (!(low_pfn % SWAP_CLUSTER_MAX)) { >> + if (locked_lruvec) { >> + unlock_page_lruvec_irqrestore(locked_lruvec, >> + flags); >> + locked_lruvec = NULL; >> + } >> + >> + if (fatal_signal_pending(current)) { >> + cc->contended = true; >> + >> + low_pfn = 0; >> + goto fatal_pending; >> + } >> + >> + cond_resched(); >> } >> >> if (!pfn_valid_within(low_pfn)) > > I'm noticing this patch introduces a bunch of noise. What is the > reason for getting rid of compact_unlock_should_abort? It seems like > you just open coded it here. If there is some sort of issue with it > then it might be better to replace it as part of a preparatory patch > before you introduce this one as changes like this make it harder to > review. Thanks for comments, Alex. the func compact_unlock_should_abort should be removed since one of parameters changed from 'bool *locked' to 'struct lruvec *lruvec'. So it's not applicable now. 
I had to open-code it here instead of keeping a function with only one user. > > It might make more sense to look at modifying > compact_unlock_should_abort and compact_lock_irqsave (which always > returns true so should probably be a void) to address the deficiencies > they have that make them unusable for you. I am wondering whether people would prefer a preparation patch that just open-codes compact_unlock_should_abort and changes the bool return to void. Would you like that? >> @@ -966,10 +975,20 @@ static bool too_many_isolated(pg_data_t *pgdat) >> if (!TestClearPageLRU(page)) >> goto isolate_fail_put; >> >> + rcu_read_lock(); >> + lruvec = mem_cgroup_page_lruvec(page, pgdat); >> + >> /* If we already hold the lock, we can skip some rechecking */ >> - if (!locked) { >> - locked = compact_lock_irqsave(&pgdat->lru_lock, >> - &flags, cc); >> + if (lruvec != locked_lruvec) { >> + if (locked_lruvec) >> + unlock_page_lruvec_irqrestore(locked_lruvec, >> + flags); >> + >> + compact_lock_irqsave(&lruvec->lru_lock, &flags, cc); >> + locked_lruvec = lruvec; >> + rcu_read_unlock(); >> + >> + lruvec_memcg_debug(lruvec, page); >> >> /* Try get exclusive access under lock */ >> if (!skip_updated) { > > So this bit makes things a bit complicated. From what I can can tell > the comment about exclusive access under the lock is supposed to apply > to the pageblock via the lru_lock. However you are having to retest > the lock for each page because it is possible the page was moved to > another memory cgroup while the lru_lock was released correct? So in The pageblock is pfn-aligned, so the pages in it may not be on the same memcg to begin with. And yes, a page's memcg may change as well. > this case is the lru vector lock really providing any protection for > the skip_updated portion of this code block if the lock isn't > exclusive to the pageblock? In theory this would probably make more > sense to have protected the skip bits under the zone lock, but I > imagine that was avoided due to the additional overhead.
When we change to lruvec->lru_lock, it does the same thing as pgdat->lru_lock; we just get a bit more chance to reach here, find out that this pageblock is skippable, and quit. Yes, logically, the pgdat lru_lock seems better, but since we are already holding a lru_lock, it's fine not to bother with more locks. > >> @@ -1876,6 +1876,12 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec, >> * list_add(&page->lru,) >> * list_add(&page->lru,) //corrupt >> */ >> + new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page)); >> + if (new_lruvec != lruvec) { >> + if (lruvec) >> + spin_unlock_irq(&lruvec->lru_lock); >> + lruvec = lock_page_lruvec_irq(page); >> + } >> SetPageLRU(page); >> >> if (unlikely(put_page_testzero(page))) { > > I was going through the code of the entire patch set and I noticed > these changes in move_pages_to_lru. What is the reason for adding the > new_lruvec logic? My understanding is that we are moving the pages to > the lruvec provided are we not?If so why do we need to add code to get > a new lruvec? The code itself seems to stand out from the rest of the > patch as it is introducing new code instead of replacing existing > locking code, and it doesn't match up with the description of what > this function is supposed to do since it changes the lruvec. The code is here because some bugs happened. I will check it again anyway. Thanks!
* Re: [PATCH v17 17/21] mm/lru: replace pgdat lru_lock with lruvec lock [not found] ` <CAKgT0UdaW4Rf43yULhQBuP07vQgmoPbaWHGKv1Z7fEPP6jH83w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2020-07-28 7:15 ` Alex Shi @ 2020-07-28 11:19 ` Alex Shi [not found] ` <ccd01046-451c-463d-7c5d-9c32794f4b1e-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> 2020-07-28 15:39 ` Alex Shi 2 siblings, 1 reply; 101+ messages in thread From: Alex Shi @ 2020-07-28 11:19 UTC (permalink / raw) To: Alexander Duyck Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox, Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups-u79uwXL29TY76Z2rM5mHXA, Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov, Rong Chen, Michal Hocko, Vladimir Davydov 在 2020/7/28 上午7:34, Alexander Duyck 写道: >> @@ -1876,6 +1876,12 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec, >> * list_add(&page->lru,) >> * list_add(&page->lru,) //corrupt >> */ >> + new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page)); >> + if (new_lruvec != lruvec) { >> + if (lruvec) >> + spin_unlock_irq(&lruvec->lru_lock); >> + lruvec = lock_page_lruvec_irq(page); >> + } >> SetPageLRU(page); >> >> if (unlikely(put_page_testzero(page))) { > I was going through the code of the entire patch set and I noticed > these changes in move_pages_to_lru. What is the reason for adding the > new_lruvec logic? My understanding is that we are moving the pages to > the lruvec provided are we not?If so why do we need to add code to get > a new lruvec? The code itself seems to stand out from the rest of the > patch as it is introducing new code instead of replacing existing > locking code, and it doesn't match up with the description of what > this function is supposed to do since it changes the lruvec. 
This new_lruvec is the replacement for the removed line, shown in the following code: >> - lruvec = mem_cgroup_page_lruvec(page, pgdat); This recheck is for when the page has moved to the root memcg; otherwise it causes this bug: [ 2081.240795] BUG: kernel NULL pointer dereference, address: 0000000000000000 [ 2081.248125] #PF: supervisor read access in kernel mode [ 2081.253627] #PF: error_code(0x0000) - not-present page [ 2081.259124] PGD 8000000044cb0067 P4D 8000000044cb0067 PUD 95c9067 PMD 0 [ 2081.266193] Oops: 0000 [#1] PREEMPT SMP PTI [ 2081.270740] CPU: 5 PID: 131 Comm: kswapd0 Kdump: loaded Tainted: G W 5.8.0-rc6-00025-gc708f8a0db47 #45 [ 2081.281960] Hardware name: Alibaba X-Dragon CN 01/20G4B, BIOS 1ALSP016 05/21/2018 [ 2081.290054] RIP: 0010:do_raw_spin_trylock+0x5/0x40 [ 2081.295209] Code: 76 82 48 89 df e8 bb fe ff ff eb 8c 89 c6 48 89 df e8 4f dd ff ff 66 90 eb 8b 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 <8b> 07 85 c0 75 28 ba 01 00 00 00 f0 0f b1 17 75 1d 65 8b 05 03 6a [ 2081.314832] RSP: 0018:ffffc900002ebac8 EFLAGS: 00010082 [ 2081.320410] RAX: 0000000000000000 RBX: 0000000000000018 RCX: 0000000000000000 [ 2081.327907] RDX: ffff888035833480 RSI: 0000000000000000 RDI: 0000000000000000 [ 2081.335407] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000001 [ 2081.342907] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001 [ 2081.350405] R13: dead000000000100 R14: 0000000000000000 R15: ffffc900002ebbb0 [ 2081.357908] FS: 0000000000000000(0000) GS:ffff88807a200000(0000) knlGS:0000000000000000 [ 2081.366619] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 2081.372717] CR2: 0000000000000000 CR3: 0000000031228005 CR4: 00000000003606e0 [ 2081.380215] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 2081.387713] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 2081.395198] Call Trace: [ 2081.398008] _raw_spin_lock_irq+0x47/0x80 [ 2081.402387] ?
move_pages_to_lru+0x566/0xb80 [ 2081.407028] move_pages_to_lru+0x566/0xb80 [ 2081.411495] shrink_active_list+0x355/0xa70 [ 2081.416054] shrink_lruvec+0x4f7/0x810 [ 2081.420176] ? mem_cgroup_iter+0xb6/0x410 [ 2081.424558] shrink_node+0x1cc/0x8d0 [ 2081.428510] balance_pgdat+0x3cf/0x760 [ 2081.432634] kswapd+0x232/0x660 [ 2081.436147] ? finish_wait+0x80/0x80 [ 2081.440093] ? balance_pgdat+0x760/0x760 [ 2081.444382] kthread+0x17e/0x1b0 [ 2081.447975] ? kthread_park+0xc0/0xc0 [ 2081.452005] ret_from_fork+0x22/0x30 Thanks! Alex > >> @@ -1883,16 +1889,15 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec, >> __ClearPageActive(page); >> >> if (unlikely(PageCompound(page))) { >> - spin_unlock_irq(&pgdat->lru_lock); >> + spin_unlock_irq(&lruvec->lru_lock); >> destroy_compound_page(page); >> - spin_lock_irq(&pgdat->lru_lock); >> + spin_lock_irq(&lruvec->lru_lock); >> } else >> list_add(&page->lru, &pages_to_free); >> >> continue; >> } >> >> - lruvec = mem_cgroup_page_lruvec(page, pgdat); >> lru = page_lru(page); >> nr_pages = hpage_nr_pages(page); ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH v17 17/21] mm/lru: replace pgdat lru_lock with lruvec lock [not found] ` <ccd01046-451c-463d-7c5d-9c32794f4b1e-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> @ 2020-07-28 14:54 ` Alexander Duyck 2020-07-29 1:00 ` Alex Shi 0 siblings, 1 reply; 101+ messages in thread From: Alexander Duyck @ 2020-07-28 14:54 UTC (permalink / raw) To: Alex Shi Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox, Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups-u79uwXL29TY76Z2rM5mHXA, Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov, Rong Chen, Michal Hocko, Vladimir Davydov On Tue, Jul 28, 2020 at 4:20 AM Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> wrote: > > > > 在 2020/7/28 上午7:34, Alexander Duyck 写道: > >> @@ -1876,6 +1876,12 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec, > >> * list_add(&page->lru,) > >> * list_add(&page->lru,) //corrupt > >> */ > >> + new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page)); > >> + if (new_lruvec != lruvec) { > >> + if (lruvec) > >> + spin_unlock_irq(&lruvec->lru_lock); > >> + lruvec = lock_page_lruvec_irq(page); > >> + } > >> SetPageLRU(page); > >> > >> if (unlikely(put_page_testzero(page))) { > > I was going through the code of the entire patch set and I noticed > > these changes in move_pages_to_lru. What is the reason for adding the > > new_lruvec logic? My understanding is that we are moving the pages to > > the lruvec provided are we not?If so why do we need to add code to get > > a new lruvec? The code itself seems to stand out from the rest of the > > patch as it is introducing new code instead of replacing existing > > locking code, and it doesn't match up with the description of what > > this function is supposed to do since it changes the lruvec. 
> > this new_lruvec is the replacement of removed line, as following code: > >> - lruvec = mem_cgroup_page_lruvec(page, pgdat); > This recheck is for the page move the root memcg, otherwise it cause the bug: Okay, now I see where the issue is. You moved this code so now it has a different effect than it did before. You are relocking things before you needed to. Don't forget that when you came into this function you already had the lock. In addition the patch is broken as it currently stands as you aren't using similar logic in the code just above this addition if you encounter an evictable page. As a result this is really difficult to review as there are subtle bugs here. I suppose the correct fix is to get rid of this line, but it should be placed everywhere the original function was calling spin_lock_irq(). In addition I would consider changing the arguments/documentation for move_pages_to_lru. You aren't moving the pages to lruvec, so there is probably no need to pass that as an argument. Instead I would pass pgdat since that isn't going to be moving and is the only thing you actually derive based on the original lruvec. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH v17 17/21] mm/lru: replace pgdat lru_lock with lruvec lock 2020-07-28 14:54 ` Alexander Duyck @ 2020-07-29 1:00 ` Alex Shi [not found] ` <09aeced7-cc36-0c9a-d40b-451db9dc54cc-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> 0 siblings, 1 reply; 101+ messages in thread From: Alex Shi @ 2020-07-29 1:00 UTC (permalink / raw) To: Alexander Duyck Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox, Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups, Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov, Rong Chen, Michal Hocko, Vladimir Davydov 在 2020/7/28 下午10:54, Alexander Duyck 写道: > On Tue, Jul 28, 2020 at 4:20 AM Alex Shi <alex.shi@linux.alibaba.com> wrote: >> >> >> >> 在 2020/7/28 上午7:34, Alexander Duyck 写道: >>>> @@ -1876,6 +1876,12 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec, >>>> * list_add(&page->lru,) >>>> * list_add(&page->lru,) //corrupt >>>> */ >>>> + new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page)); >>>> + if (new_lruvec != lruvec) { >>>> + if (lruvec) >>>> + spin_unlock_irq(&lruvec->lru_lock); >>>> + lruvec = lock_page_lruvec_irq(page); >>>> + } >>>> SetPageLRU(page); >>>> >>>> if (unlikely(put_page_testzero(page))) { >>> I was going through the code of the entire patch set and I noticed >>> these changes in move_pages_to_lru. What is the reason for adding the >>> new_lruvec logic? My understanding is that we are moving the pages to >>> the lruvec provided are we not?If so why do we need to add code to get >>> a new lruvec? The code itself seems to stand out from the rest of the >>> patch as it is introducing new code instead of replacing existing >>> locking code, and it doesn't match up with the description of what >>> this function is supposed to do since it changes the lruvec. 
>> >> this new_lruvec is the replacement of removed line, as following code: >>>> - lruvec = mem_cgroup_page_lruvec(page, pgdat); >> This recheck is for the page move the root memcg, otherwise it cause the bug: > > Okay, now I see where the issue is. You moved this code so now it has > a different effect than it did before. You are relocking things before > you needed to. Don't forget that when you came into this function you > already had the lock. In addition the patch is broken as it currently > stands as you aren't using similar logic in the code just above this > addition if you encounter an evictable page. As a result this is > really difficult to review as there are subtle bugs here. Why you think its a bug? the relock only happens if locked lruvec is different. and unlock the old one. > > I suppose the correct fix is to get rid of this line, but it should > be placed everywhere the original function was calling > spin_lock_irq(). > > In addition I would consider changing the arguments/documentation for > move_pages_to_lru. You aren't moving the pages to lruvec, so there is > probably no need to pass that as an argument. Instead I would pass > pgdat since that isn't going to be moving and is the only thing you > actually derive based on the original lruvec. yes, The comments should be changed with the line was introduced from long ago. :) Anyway, I am wondering if it worth a v18 version resend? > ^ permalink raw reply [flat|nested] 101+ messages in thread
[parent not found: <09aeced7-cc36-0c9a-d40b-451db9dc54cc-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org>]
* Re: [PATCH v17 17/21] mm/lru: replace pgdat lru_lock with lruvec lock [not found] ` <09aeced7-cc36-0c9a-d40b-451db9dc54cc-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> @ 2020-07-29 1:27 ` Alexander Duyck [not found] ` <CAKgT0UfCv9u3UaJnzh7CYu_nCggV8yesZNu4oxMGn4+mJYiFUw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 101+ messages in thread From: Alexander Duyck @ 2020-07-29 1:27 UTC (permalink / raw) To: Alex Shi Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox, Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups-u79uwXL29TY76Z2rM5mHXA, Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov, Rong Chen, Michal Hocko, Vladimir Davydov On Tue, Jul 28, 2020 at 6:00 PM Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> wrote: > > > > 在 2020/7/28 下午10:54, Alexander Duyck 写道: > > On Tue, Jul 28, 2020 at 4:20 AM Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> wrote: > >> > >> > >> > >> 在 2020/7/28 上午7:34, Alexander Duyck 写道: > >>>> @@ -1876,6 +1876,12 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec, > >>>> * list_add(&page->lru,) > >>>> * list_add(&page->lru,) //corrupt > >>>> */ > >>>> + new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page)); > >>>> + if (new_lruvec != lruvec) { > >>>> + if (lruvec) > >>>> + spin_unlock_irq(&lruvec->lru_lock); > >>>> + lruvec = lock_page_lruvec_irq(page); > >>>> + } > >>>> SetPageLRU(page); > >>>> > >>>> if (unlikely(put_page_testzero(page))) { > >>> I was going through the code of the entire patch set and I noticed > >>> these changes in move_pages_to_lru. What is the reason for adding the > >>> new_lruvec logic? My understanding is that we are moving the pages to > >>> the lruvec provided are we not?If so why do we need to add code to get > >>> a new lruvec? 
The code itself seems to stand out from the rest of the > >>> patch as it is introducing new code instead of replacing existing > >>> locking code, and it doesn't match up with the description of what > >>> this function is supposed to do since it changes the lruvec. > >> > >> this new_lruvec is the replacement of removed line, as following code: > >>>> - lruvec = mem_cgroup_page_lruvec(page, pgdat); > >> This recheck is for the page move the root memcg, otherwise it cause the bug: > > > > Okay, now I see where the issue is. You moved this code so now it has > > a different effect than it did before. You are relocking things before > > you needed to. Don't forget that when you came into this function you > > already had the lock. In addition the patch is broken as it currently > > stands as you aren't using similar logic in the code just above this > > addition if you encounter an evictable page. As a result this is > > really difficult to review as there are subtle bugs here. > > Why you think its a bug? the relock only happens if locked lruvec is different. > and unlock the old one. The section I am talking about with the bug is this section here: while (!list_empty(list)) { + struct lruvec *new_lruvec = NULL; + page = lru_to_page(list); VM_BUG_ON_PAGE(PageLRU(page), page); list_del(&page->lru); if (unlikely(!page_evictable(page))) { - spin_unlock_irq(&pgdat->lru_lock); + spin_unlock_irq(&lruvec->lru_lock); putback_lru_page(page); - spin_lock_irq(&pgdat->lru_lock); + spin_lock_irq(&lruvec->lru_lock); continue; } Basically it probably is not advisable to be retaking the lruvec->lru_lock directly as the lruvec may have changed so it wouldn't be correct for the next page. It would make more sense to be using your API and calling unlock_page_lruvec_irq and lock_page_lruvec_irq instead of using the lock directly. > > > > I suppose the correct fix is to get rid of this line, but it should > > be placed everywhere the original function was calling > > spin_lock_irq(). 
> > > > In addition I would consider changing the arguments/documentation for > > move_pages_to_lru. You aren't moving the pages to lruvec, so there is > > probably no need to pass that as an argument. Instead I would pass > > pgdat since that isn't going to be moving and is the only thing you > > actually derive based on the original lruvec. > > yes, The comments should be changed with the line was introduced from long ago. :) > Anyway, I am wondering if it worth a v18 version resend? So I have been looking over the function itself and I wonder if it isn't worth looking at rewriting this to optimize the locking behavior to minimize the number of times we have to take the LRU lock. I have some code I am working on that I plan to submit as an RFC in the next day or so after I can get it smoke tested. The basic idea would be to defer returning the evictiable pages or freeing the compound pages until after we have processed the pages that can be moved while still holding the lock. I would think it should reduce the lock contention significantly while improving the throughput. ^ permalink raw reply [flat|nested] 101+ messages in thread
[parent not found: <CAKgT0UfCv9u3UaJnzh7CYu_nCggV8yesZNu4oxMGn4+mJYiFUw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH v17 17/21] mm/lru: replace pgdat lru_lock with lruvec lock [not found] ` <CAKgT0UfCv9u3UaJnzh7CYu_nCggV8yesZNu4oxMGn4+mJYiFUw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2020-07-29 2:27 ` Alex Shi 0 siblings, 0 replies; 101+ messages in thread From: Alex Shi @ 2020-07-29 2:27 UTC (permalink / raw) To: Alexander Duyck Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox, Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups-u79uwXL29TY76Z2rM5mHXA, Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov, Rong Chen, Michal Hocko, Vladimir Davydov 在 2020/7/29 上午9:27, Alexander Duyck 写道: > On Tue, Jul 28, 2020 at 6:00 PM Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> wrote: >> >> >> >> 在 2020/7/28 下午10:54, Alexander Duyck 写道: >>> On Tue, Jul 28, 2020 at 4:20 AM Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> wrote: >>>> >>>> >>>> >>>> 在 2020/7/28 上午7:34, Alexander Duyck 写道: >>>>>> @@ -1876,6 +1876,12 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec, >>>>>> * list_add(&page->lru,) >>>>>> * list_add(&page->lru,) //corrupt >>>>>> */ >>>>>> + new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page)); >>>>>> + if (new_lruvec != lruvec) { >>>>>> + if (lruvec) >>>>>> + spin_unlock_irq(&lruvec->lru_lock); >>>>>> + lruvec = lock_page_lruvec_irq(page); >>>>>> + } >>>>>> SetPageLRU(page); >>>>>> >>>>>> if (unlikely(put_page_testzero(page))) { >>>>> I was going through the code of the entire patch set and I noticed >>>>> these changes in move_pages_to_lru. What is the reason for adding the >>>>> new_lruvec logic? My understanding is that we are moving the pages to >>>>> the lruvec provided are we not?If so why do we need to add code to get >>>>> a new lruvec? 
The code itself seems to stand out from the rest of the >>>>> patch as it is introducing new code instead of replacing existing >>>>> locking code, and it doesn't match up with the description of what >>>>> this function is supposed to do since it changes the lruvec. >>>> >>>> this new_lruvec is the replacement of removed line, as following code: >>>>>> - lruvec = mem_cgroup_page_lruvec(page, pgdat); >>>> This recheck is for the page move the root memcg, otherwise it cause the bug: >>> >>> Okay, now I see where the issue is. You moved this code so now it has >>> a different effect than it did before. You are relocking things before >>> you needed to. Don't forget that when you came into this function you >>> already had the lock. In addition the patch is broken as it currently >>> stands as you aren't using similar logic in the code just above this >>> addition if you encounter an evictable page. As a result this is >>> really difficult to review as there are subtle bugs here. >> >> Why you think its a bug? the relock only happens if locked lruvec is different. >> and unlock the old one. > > The section I am talking about with the bug is this section here: > while (!list_empty(list)) { > + struct lruvec *new_lruvec = NULL; > + > page = lru_to_page(list); > VM_BUG_ON_PAGE(PageLRU(page), page); > list_del(&page->lru); > if (unlikely(!page_evictable(page))) { > - spin_unlock_irq(&pgdat->lru_lock); > + spin_unlock_irq(&lruvec->lru_lock); > putback_lru_page(page); > - spin_lock_irq(&pgdat->lru_lock); > + spin_lock_irq(&lruvec->lru_lock); It would be still fine. The lruvec->lru_lock will be checked again before we take and use it. And this lock will optimized in patch 19th which did by Hugh Dickins. > continue; > } > > Basically it probably is not advisable to be retaking the > lruvec->lru_lock directly as the lruvec may have changed so it > wouldn't be correct for the next page. 
It would make more sense to be > using your API and calling unlock_page_lruvec_irq and > lock_page_lruvec_irq instead of using the lock directly. > >>> >>> I suppose the correct fix is to get rid of this line, but it should >>> be placed everywhere the original function was calling >>> spin_lock_irq(). >>> >>> In addition I would consider changing the arguments/documentation for >>> move_pages_to_lru. You aren't moving the pages to lruvec, so there is >>> probably no need to pass that as an argument. Instead I would pass >>> pgdat since that isn't going to be moving and is the only thing you >>> actually derive based on the original lruvec. >> >> yes, The comments should be changed with the line was introduced from long ago. :) >> Anyway, I am wondering if it worth a v18 version resend? > > So I have been looking over the function itself and I wonder if it > isn't worth looking at rewriting this to optimize the locking behavior > to minimize the number of times we have to take the LRU lock. I have > some code I am working on that I plan to submit as an RFC in the next > day or so after I can get it smoke tested. The basic idea would be to > defer returning the evictiable pages or freeing the compound pages > until after we have processed the pages that can be moved while still > holding the lock. I would think it should reduce the lock contention > significantly while improving the throughput. > I had tried once, but the freeing page cross onto release_pages which hard to deal with. I am very glad to wait your patch, and hope it could be resovled. :) Thanks Alex ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH v17 17/21] mm/lru: replace pgdat lru_lock with lruvec lock [not found] ` <CAKgT0UdaW4Rf43yULhQBuP07vQgmoPbaWHGKv1Z7fEPP6jH83w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2020-07-28 7:15 ` Alex Shi 2020-07-28 11:19 ` Alex Shi @ 2020-07-28 15:39 ` Alex Shi [not found] ` <1fd45e69-3a50-aae8-bcc4-47d891a5e263-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> 2 siblings, 1 reply; 101+ messages in thread From: Alex Shi @ 2020-07-28 15:39 UTC (permalink / raw) To: Alexander Duyck Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox, Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups-u79uwXL29TY76Z2rM5mHXA, Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov, Rong Chen, Michal Hocko, Vladimir Davydov 在 2020/7/28 上午7:34, Alexander Duyck 写道: > It might make more sense to look at modifying > compact_unlock_should_abort and compact_lock_irqsave (which always > returns true so should probably be a void) to address the deficiencies > they have that make them unusable for you. One of possible reuse for the func compact_unlock_should_abort, could be like the following, the locked parameter reused different in 2 places. but, it's seems no this style usage in kernel, isn't it? 
Thanks Alex From 41d5ce6562f20f74bc6ac2db83e226ac28d56e90 Mon Sep 17 00:00:00 2001 From: Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> Date: Tue, 28 Jul 2020 21:19:32 +0800 Subject: [PATCH] compaction polishing Signed-off-by: Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> --- mm/compaction.c | 71 ++++++++++++++++++++++++--------------------------------- 1 file changed, 30 insertions(+), 41 deletions(-) diff --git a/mm/compaction.c b/mm/compaction.c index c28a43481f01..36fce988de3e 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -479,20 +479,20 @@ static bool test_and_set_skip(struct compact_control *cc, struct page *page, * * Always returns true which makes it easier to track lock state in callers. */ -static bool compact_lock_irqsave(spinlock_t *lock, unsigned long *flags, +static void compact_lock_irqsave(spinlock_t *lock, unsigned long *flags, struct compact_control *cc) __acquires(lock) { /* Track if the lock is contended in async mode */ if (cc->mode == MIGRATE_ASYNC && !cc->contended) { if (spin_trylock_irqsave(lock, *flags)) - return true; + return; cc->contended = true; } spin_lock_irqsave(lock, *flags); - return true; + return; } /* @@ -511,11 +511,11 @@ static bool compact_lock_irqsave(spinlock_t *lock, unsigned long *flags, * scheduled) */ static bool compact_unlock_should_abort(spinlock_t *lock, - unsigned long flags, bool *locked, struct compact_control *cc) + unsigned long flags, void **locked, struct compact_control *cc) { if (*locked) { spin_unlock_irqrestore(lock, flags); - *locked = false; + *locked = NULL; } if (fatal_signal_pending(current)) { @@ -543,7 +543,7 @@ static unsigned long isolate_freepages_block(struct compact_control *cc, int nr_scanned = 0, total_isolated = 0; struct page *cursor; unsigned long flags = 0; - bool locked = false; + struct compact_control *locked = NULL; unsigned long blockpfn = *start_pfn; unsigned int order; @@ -565,7 +565,7 @@ static unsigned long 
isolate_freepages_block(struct compact_control *cc, */ if (!(blockpfn % SWAP_CLUSTER_MAX) && compact_unlock_should_abort(&cc->zone->lock, flags, - &locked, cc)) + (void**)&locked, cc)) break; nr_scanned++; @@ -599,8 +599,8 @@ static unsigned long isolate_freepages_block(struct compact_control *cc, * recheck as well. */ if (!locked) { - locked = compact_lock_irqsave(&cc->zone->lock, - &flags, cc); + compact_lock_irqsave(&cc->zone->lock, &flags, cc); + locked = cc; /* Recheck this is a buddy page under lock */ if (!PageBuddy(page)) @@ -787,7 +787,7 @@ static bool too_many_isolated(pg_data_t *pgdat) unsigned long nr_scanned = 0, nr_isolated = 0; struct lruvec *lruvec; unsigned long flags = 0; - struct lruvec *locked_lruvec = NULL; + struct lruvec *locked = NULL; struct page *page = NULL, *valid_page = NULL; unsigned long start_pfn = low_pfn; bool skip_on_failure = false; @@ -847,21 +847,11 @@ static bool too_many_isolated(pg_data_t *pgdat) * contention, to give chance to IRQs. Abort completely if * a fatal signal is pending. 
*/ - if (!(low_pfn % SWAP_CLUSTER_MAX)) { - if (locked_lruvec) { - unlock_page_lruvec_irqrestore(locked_lruvec, - flags); - locked_lruvec = NULL; - } - - if (fatal_signal_pending(current)) { - cc->contended = true; - - low_pfn = 0; - goto fatal_pending; - } - - cond_resched(); + if (!(low_pfn % SWAP_CLUSTER_MAX) + && compact_unlock_should_abort(&locked->lru_lock, flags, + (void**)&locked, cc)) { + low_pfn = 0; + goto fatal_pending; } if (!pfn_valid_within(low_pfn)) @@ -932,9 +922,9 @@ static bool too_many_isolated(pg_data_t *pgdat) */ if (unlikely(__PageMovable(page)) && !PageIsolated(page)) { - if (locked_lruvec) { - unlock_page_lruvec_irqrestore(locked_lruvec, flags); - locked_lruvec = NULL; + if (locked) { + unlock_page_lruvec_irqrestore(locked, flags); + locked = NULL; } if (!isolate_movable_page(page, isolate_mode)) @@ -979,13 +969,13 @@ static bool too_many_isolated(pg_data_t *pgdat) lruvec = mem_cgroup_page_lruvec(page, pgdat); /* If we already hold the lock, we can skip some rechecking */ - if (lruvec != locked_lruvec) { - if (locked_lruvec) - unlock_page_lruvec_irqrestore(locked_lruvec, + if (lruvec != locked) { + if (locked) + unlock_page_lruvec_irqrestore(locked, flags); compact_lock_irqsave(&lruvec->lru_lock, &flags, cc); - locked_lruvec = lruvec; + locked = lruvec; rcu_read_unlock(); lruvec_memcg_debug(lruvec, page); @@ -1041,9 +1031,9 @@ static bool too_many_isolated(pg_data_t *pgdat) isolate_fail_put: /* Avoid potential deadlock in freeing page under lru_lock */ - if (locked_lruvec) { - unlock_page_lruvec_irqrestore(locked_lruvec, flags); - locked_lruvec = NULL; + if (locked) { + unlock_page_lruvec_irqrestore(locked, flags); + locked = NULL; } put_page(page); @@ -1057,10 +1047,9 @@ static bool too_many_isolated(pg_data_t *pgdat) * page anyway. 
*/ if (nr_isolated) { - if (locked_lruvec) { - unlock_page_lruvec_irqrestore(locked_lruvec, - flags); - locked_lruvec = NULL; + if (locked) { + unlock_page_lruvec_irqrestore(locked, flags); + locked = NULL; } putback_movable_pages(&cc->migratepages); cc->nr_migratepages = 0; @@ -1087,8 +1076,8 @@ static bool too_many_isolated(pg_data_t *pgdat) page = NULL; isolate_abort: - if (locked_lruvec) - unlock_page_lruvec_irqrestore(locked_lruvec, flags); + if (locked) + unlock_page_lruvec_irqrestore(locked, flags); if (page) { SetPageLRU(page); put_page(page); -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 101+ messages in thread
[parent not found: <1fd45e69-3a50-aae8-bcc4-47d891a5e263-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org>]
* Re: [PATCH v17 17/21] mm/lru: replace pgdat lru_lock with lruvec lock [not found] ` <1fd45e69-3a50-aae8-bcc4-47d891a5e263-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> @ 2020-07-28 15:55 ` Alexander Duyck 2020-07-29 0:48 ` Alex Shi 0 siblings, 1 reply; 101+ messages in thread From: Alexander Duyck @ 2020-07-28 15:55 UTC (permalink / raw) To: Alex Shi Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox, Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups-u79uwXL29TY76Z2rM5mHXA, Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov, Rong Chen, Michal Hocko, Vladimir Davydov On Tue, Jul 28, 2020 at 8:40 AM Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> wrote: > > > > 在 2020/7/28 上午7:34, Alexander Duyck 写道: > > It might make more sense to look at modifying > > compact_unlock_should_abort and compact_lock_irqsave (which always > > returns true so should probably be a void) to address the deficiencies > > they have that make them unusable for you. > > One of possible reuse for the func compact_unlock_should_abort, could be > like the following, the locked parameter reused different in 2 places. > but, it's seems no this style usage in kernel, isn't it? 
> > Thanks > Alex > > From 41d5ce6562f20f74bc6ac2db83e226ac28d56e90 Mon Sep 17 00:00:00 2001 > From: Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> > Date: Tue, 28 Jul 2020 21:19:32 +0800 > Subject: [PATCH] compaction polishing > > Signed-off-by: Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> > --- > mm/compaction.c | 71 ++++++++++++++++++++++++--------------------------------- > 1 file changed, 30 insertions(+), 41 deletions(-) > > diff --git a/mm/compaction.c b/mm/compaction.c > index c28a43481f01..36fce988de3e 100644 > --- a/mm/compaction.c > +++ b/mm/compaction.c > @@ -479,20 +479,20 @@ static bool test_and_set_skip(struct compact_control *cc, struct page *page, > * > * Always returns true which makes it easier to track lock state in callers. > */ > -static bool compact_lock_irqsave(spinlock_t *lock, unsigned long *flags, > +static void compact_lock_irqsave(spinlock_t *lock, unsigned long *flags, > struct compact_control *cc) > __acquires(lock) > { > /* Track if the lock is contended in async mode */ > if (cc->mode == MIGRATE_ASYNC && !cc->contended) { > if (spin_trylock_irqsave(lock, *flags)) > - return true; > + return; > > cc->contended = true; > } > > spin_lock_irqsave(lock, *flags); > - return true; > + return; > } > > /* > @@ -511,11 +511,11 @@ static bool compact_lock_irqsave(spinlock_t *lock, unsigned long *flags, > * scheduled) > */ > static bool compact_unlock_should_abort(spinlock_t *lock, > - unsigned long flags, bool *locked, struct compact_control *cc) > + unsigned long flags, void **locked, struct compact_control *cc) Instead of passing both a void pointer and the lock why not just pass the pointer to the lock pointer? You could combine lock and locked into a single argument and save yourself some extra effort. 
> { > if (*locked) { > spin_unlock_irqrestore(lock, flags); > - *locked = false; > + *locked = NULL; > } > > if (fatal_signal_pending(current)) { > @@ -543,7 +543,7 @@ static unsigned long isolate_freepages_block(struct compact_control *cc, > int nr_scanned = 0, total_isolated = 0; > struct page *cursor; > unsigned long flags = 0; > - bool locked = false; > + struct compact_control *locked = NULL; > unsigned long blockpfn = *start_pfn; > unsigned int order; > > @@ -565,7 +565,7 @@ static unsigned long isolate_freepages_block(struct compact_control *cc, > */ > if (!(blockpfn % SWAP_CLUSTER_MAX) > && compact_unlock_should_abort(&cc->zone->lock, flags, > - &locked, cc)) > + (void**)&locked, cc)) > break; > > nr_scanned++; > @@ -599,8 +599,8 @@ static unsigned long isolate_freepages_block(struct compact_control *cc, > * recheck as well. > */ > if (!locked) { > - locked = compact_lock_irqsave(&cc->zone->lock, > - &flags, cc); > + compact_lock_irqsave(&cc->zone->lock, &flags, cc); > + locked = cc; > > /* Recheck this is a buddy page under lock */ > if (!PageBuddy(page)) If you have to provide a pointer you might as well just provide a pointer to the zone lock since that is the thing that is actually holding the lock at this point and would be consistent with your other uses of the locked value. One possibility would be to change the return type so that you return a pointer to the lock you are using. Then the code would look closer to the lruvec code you are already using. > @@ -787,7 +787,7 @@ static bool too_many_isolated(pg_data_t *pgdat) > unsigned long nr_scanned = 0, nr_isolated = 0; > struct lruvec *lruvec; > unsigned long flags = 0; > - struct lruvec *locked_lruvec = NULL; > + struct lruvec *locked = NULL; > struct page *page = NULL, *valid_page = NULL; > unsigned long start_pfn = low_pfn; > bool skip_on_failure = false; > @@ -847,21 +847,11 @@ static bool too_many_isolated(pg_data_t *pgdat) > * contention, to give chance to IRQs. 
Abort completely if > * a fatal signal is pending. > */ > - if (!(low_pfn % SWAP_CLUSTER_MAX)) { > - if (locked_lruvec) { > - unlock_page_lruvec_irqrestore(locked_lruvec, > - flags); > - locked_lruvec = NULL; > - } > - > - if (fatal_signal_pending(current)) { > - cc->contended = true; > - > - low_pfn = 0; > - goto fatal_pending; > - } > - > - cond_resched(); > + if (!(low_pfn % SWAP_CLUSTER_MAX) > + && compact_unlock_should_abort(&locked->lru_lock, flags, > + (void**)&locked, cc)) { An added advantage to making locked a pointer to a spinlock is that you could reduce the number of pointers you have to pass. Instead of messing with &locked->lru_lock you would just pass the pointer to locked resulting in fewer arguments being passed and if it is NULL you skip the whole unlock pass. > + low_pfn = 0; > + goto fatal_pending; > } > > if (!pfn_valid_within(low_pfn)) > @@ -932,9 +922,9 @@ static bool too_many_isolated(pg_data_t *pgdat) > */ > if (unlikely(__PageMovable(page)) && > !PageIsolated(page)) { > - if (locked_lruvec) { > - unlock_page_lruvec_irqrestore(locked_lruvec, flags); > - locked_lruvec = NULL; > + if (locked) { > + unlock_page_lruvec_irqrestore(locked, flags); > + locked = NULL; > } > > if (!isolate_movable_page(page, isolate_mode)) > @@ -979,13 +969,13 @@ static bool too_many_isolated(pg_data_t *pgdat) > lruvec = mem_cgroup_page_lruvec(page, pgdat); > > /* If we already hold the lock, we can skip some rechecking */ > - if (lruvec != locked_lruvec) { > - if (locked_lruvec) > - unlock_page_lruvec_irqrestore(locked_lruvec, > + if (lruvec != locked) { > + if (locked) > + unlock_page_lruvec_irqrestore(locked, > flags); > > compact_lock_irqsave(&lruvec->lru_lock, &flags, cc); > - locked_lruvec = lruvec; > + locked = lruvec; > rcu_read_unlock(); > > lruvec_memcg_debug(lruvec, page); > @@ -1041,9 +1031,9 @@ static bool too_many_isolated(pg_data_t *pgdat) > > isolate_fail_put: > /* Avoid potential deadlock in freeing page under lru_lock */ > - if (locked_lruvec) { > 
- unlock_page_lruvec_irqrestore(locked_lruvec, flags); > - locked_lruvec = NULL; > + if (locked) { > + unlock_page_lruvec_irqrestore(locked, flags); > + locked = NULL; > } > put_page(page); > > @@ -1057,10 +1047,9 @@ static bool too_many_isolated(pg_data_t *pgdat) > * page anyway. > */ > if (nr_isolated) { > - if (locked_lruvec) { > - unlock_page_lruvec_irqrestore(locked_lruvec, > - flags); > - locked_lruvec = NULL; > + if (locked) { > + unlock_page_lruvec_irqrestore(locked, flags); > + locked = NULL; > } > putback_movable_pages(&cc->migratepages); > cc->nr_migratepages = 0; > @@ -1087,8 +1076,8 @@ static bool too_many_isolated(pg_data_t *pgdat) > page = NULL; > > isolate_abort: > - if (locked_lruvec) > - unlock_page_lruvec_irqrestore(locked_lruvec, flags); > + if (locked) > + unlock_page_lruvec_irqrestore(locked, flags); > if (page) { > SetPageLRU(page); > put_page(page); > -- > 1.8.3.1 > ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH v17 17/21] mm/lru: replace pgdat lru_lock with lruvec lock 2020-07-28 15:55 ` Alexander Duyck @ 2020-07-29 0:48 ` Alex Shi 0 siblings, 0 replies; 101+ messages in thread From: Alex Shi @ 2020-07-29 0:48 UTC (permalink / raw) To: Alexander Duyck Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox, Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups, Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov, Rong Chen, Michal Hocko, Vladimir Davydov 在 2020/7/28 下午11:55, Alexander Duyck 写道: >> /* >> @@ -511,11 +511,11 @@ static bool compact_lock_irqsave(spinlock_t *lock, unsigned long *flags, >> * scheduled) >> */ >> static bool compact_unlock_should_abort(spinlock_t *lock, >> - unsigned long flags, bool *locked, struct compact_control *cc) >> + unsigned long flags, void **locked, struct compact_control *cc) > Instead of passing both a void pointer and the lock why not just pass > the pointer to the lock pointer? You could combine lock and locked > into a single argument and save yourself some extra effort. > the passed locked pointer could be rewrite in the func, that is unacceptable if it is a lock which could be used other place. And it is alreay dangerous to NULL a local pointer. In fact, I perfer the orignal verion, not so smart but rebust enough for future changes, right? Thanks Alex >> { >> if (*locked) { >> spin_unlock_irqrestore(lock, flags); >> - *locked = false; >> + *locked = NULL; >> } >> >> if (fatal_signal_pending(current)) { ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH v17 17/21] mm/lru: replace pgdat lru_lock with lruvec lock [not found] ` <1595681998-19193-18-git-send-email-alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> 2020-07-27 23:34 ` Alexander Duyck @ 2020-08-06 7:41 ` Alex Shi 1 sibling, 0 replies; 101+ messages in thread From: Alex Shi @ 2020-08-06 7:41 UTC (permalink / raw) To: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mgorman-3eNAlZScCAx27rWaFMvyedHuzzzSOjJt, tj-DgEjT+Ai2ygdnm+yROfE0A, hughd-hpIqsD4AKlfQT0dZR+AlfA, khlebnikov-XoJtRXgx1JseBXzfvpsJ4g, daniel.m.jordan-QHcLZuEGTsvQT0dZR+AlfA, yang.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf, willy-wEGCiKHe2LqWVfeAwA7xHQ, hannes-druUgvl0LCNAfugRpC6u6w, lkp-ral2JQCrhuEAvxtiuMwx3w, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, shakeelb-hpIqsD4AKlfQT0dZR+AlfA, iamjoonsoo.kim-Hm3cg6mZ9cc, richard.weiyang-Re5JQEeQqe8AvxtiuMwx3w, kirill-oKw7cIdHH8eLwutG50LtGA, alexander.duyck-Re5JQEeQqe8AvxtiuMwx3w, rong.a.chen-ral2JQCrhuEAvxtiuMwx3w Cc: Michal Hocko, Vladimir Davydov Hi Johannes, Michal, From page to its lruvec, a few memory access under lock cause extra cost. Would you like to save the per memcg lruvec pointer to page->private? Thanks Alex ÔÚ 2020/7/25 ÏÂÎç8:59, Alex Shi дµÀ: > /** > * mem_cgroup_page_lruvec - return lruvec for isolating/putting an LRU page > * @page: the page > @@ -1215,7 +1228,8 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd > goto out; > } > > - memcg = page->mem_cgroup; > + VM_BUG_ON_PAGE(PageTail(page), page); > + memcg = READ_ONCE(page->mem_cgroup); > /* > * Swapcache readahead pages are added to the LRU - and > * possibly migrated - before they are charged. 
> @@ -1236,6 +1250,51 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd > return lruvec; > } > > +struct lruvec *lock_page_lruvec(struct page *page) > +{ > + struct lruvec *lruvec; > + struct pglist_data *pgdat = page_pgdat(page); > + > + rcu_read_lock(); > + lruvec = mem_cgroup_page_lruvec(page, pgdat); > + spin_lock(&lruvec->lru_lock); > + rcu_read_unlock(); > + > + lruvec_memcg_debug(lruvec, page); > + > + return lruvec; > +} > + ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH v17 17/21] mm/lru: replace pgdat lru_lock with lruvec lock 2020-07-25 12:59 ` [PATCH v17 17/21] mm/lru: replace pgdat lru_lock with lruvec lock Alex Shi [not found] ` <1595681998-19193-18-git-send-email-alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> @ 2020-07-29 3:54 ` Alex Shi 1 sibling, 0 replies; 101+ messages in thread From: Alex Shi @ 2020-07-29 3:54 UTC (permalink / raw) To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi, willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck, rong.a.chen Cc: Michal Hocko, Vladimir Davydov rewrite the commit log. From 5e9340444632d69cf10c8db521577d0637819c5f Mon Sep 17 00:00:00 2001 From: Alex Shi <alex.shi@linux.alibaba.com> Date: Tue, 26 May 2020 17:27:52 +0800 Subject: [PATCH v17 17/23] mm/lru: replace pgdat lru_lock with lruvec lock This patch moves per node lru_lock into lruvec, thus bring a lru_lock for each of memcg per node. So on a large machine, each of memcg don't have to suffer from per node pgdat->lru_lock competition. They could go fast with their self lru_lock. After move memcg charge before lru inserting, page isolation could serialize page's memcg, then per memcg lruvec lock is stable and could replace per node lru lock. In func isolate_migratepages_block, compact_unlock_should_abort is opend, and lock_page_lruvec logical is embedded for tight process. Also add a debug func in locking which may give some clues if there are sth out of hands. According to Daniel Jordan's suggestion, I run 208 'dd' with on 104 containers on a 2s * 26cores * HT box with a modefied case: https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice With this and later patches, the readtwice performance increases about 80% within concurrent containers. 
On a large machine without memcg, the extra rechecks of whether the page's lruvec has changed in a few places slightly increase lock holding time, causing a small regression. Hugh Dickins helped on patch polish, thanks! Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Yang Shi <yang.shi@linux.alibaba.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Tejun Heo <tj@kernel.org> Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org Cc: cgroups@vger.kernel.org --- include/linux/memcontrol.h | 58 +++++++++++++++++++++++++ include/linux/mmzone.h | 2 + mm/compaction.c | 67 ++++++++++++++++++----------- mm/huge_memory.c | 11 ++--- mm/memcontrol.c | 63 ++++++++++++++++++++++++++- mm/mlock.c | 47 +++++++++++++------- mm/mmzone.c | 1 + mm/swap.c | 104 +++++++++++++++++++++------------------- mm/vmscan.c | 70 ++++++++++++++++-------------- 9 files changed, 288 insertions(+), 135 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index e77197a62809..258901021c6c 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -411,6 +411,19 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg, struct mem_cgroup *get_mem_cgroup_from_page(struct page *page); +struct lruvec *lock_page_lruvec(struct page *page); +struct lruvec *lock_page_lruvec_irq(struct page *page); +struct lruvec *lock_page_lruvec_irqsave(struct page *page, + unsigned long *flags); + +#ifdef CONFIG_DEBUG_VM +void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page); +#else +static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page) +{ +} +#endif + static inline struct mem_cgroup *mem_cgroup_from_css(struct 
cgroup_subsys_state *css){ return css ? container_of(css, struct mem_cgroup, css) : NULL; @@ -892,6 +905,31 @@ static inline void mem_cgroup_put(struct mem_cgroup *memcg) { } +static inline struct lruvec *lock_page_lruvec(struct page *page) +{ + struct pglist_data *pgdat = page_pgdat(page); + + spin_lock(&pgdat->__lruvec.lru_lock); + return &pgdat->__lruvec; +} + +static inline struct lruvec *lock_page_lruvec_irq(struct page *page) +{ + struct pglist_data *pgdat = page_pgdat(page); + + spin_lock_irq(&pgdat->__lruvec.lru_lock); + return &pgdat->__lruvec; +} + +static inline struct lruvec *lock_page_lruvec_irqsave(struct page *page, + unsigned long *flagsp) +{ + struct pglist_data *pgdat = page_pgdat(page); + + spin_lock_irqsave(&pgdat->__lruvec.lru_lock, *flagsp); + return &pgdat->__lruvec; +} + static inline struct mem_cgroup * mem_cgroup_iter(struct mem_cgroup *root, struct mem_cgroup *prev, @@ -1126,6 +1164,10 @@ static inline void count_memcg_page_event(struct page *page, void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx) { } + +static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page) +{ +} #endif /* CONFIG_MEMCG */ /* idx can be of type enum memcg_stat_item or node_stat_item */ @@ -1255,6 +1297,22 @@ static inline struct lruvec *parent_lruvec(struct lruvec *lruvec) return mem_cgroup_lruvec(memcg, lruvec_pgdat(lruvec)); } +static inline void unlock_page_lruvec(struct lruvec *lruvec) +{ + spin_unlock(&lruvec->lru_lock); +} + +static inline void unlock_page_lruvec_irq(struct lruvec *lruvec) +{ + spin_unlock_irq(&lruvec->lru_lock); +} + +static inline void unlock_page_lruvec_irqrestore(struct lruvec *lruvec, + unsigned long flags) +{ + spin_unlock_irqrestore(&lruvec->lru_lock, flags); +} + #ifdef CONFIG_CGROUP_WRITEBACK struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb); diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 14c668b7e793..30b961a9a749 100644 --- a/include/linux/mmzone.h +++ 
b/include/linux/mmzone.h @@ -249,6 +249,8 @@ enum lruvec_flags { }; struct lruvec { + /* per lruvec lru_lock for memcg */ + spinlock_t lru_lock; struct list_head lists[NR_LRU_LISTS]; /* * These track the cost of reclaiming one LRU - file or anon - diff --git a/mm/compaction.c b/mm/compaction.c index 72f135330f81..c28a43481f01 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -787,7 +787,7 @@ static bool too_many_isolated(pg_data_t *pgdat) unsigned long nr_scanned = 0, nr_isolated = 0; struct lruvec *lruvec; unsigned long flags = 0; - bool locked = false; + struct lruvec *locked_lruvec = NULL; struct page *page = NULL, *valid_page = NULL; unsigned long start_pfn = low_pfn; bool skip_on_failure = false; @@ -847,11 +847,21 @@ static bool too_many_isolated(pg_data_t *pgdat) * contention, to give chance to IRQs. Abort completely if * a fatal signal is pending. */ - if (!(low_pfn % SWAP_CLUSTER_MAX) - && compact_unlock_should_abort(&pgdat->lru_lock, - flags, &locked, cc)) { - low_pfn = 0; - goto fatal_pending; + if (!(low_pfn % SWAP_CLUSTER_MAX)) { + if (locked_lruvec) { + unlock_page_lruvec_irqrestore(locked_lruvec, + flags); + locked_lruvec = NULL; + } + + if (fatal_signal_pending(current)) { + cc->contended = true; + + low_pfn = 0; + goto fatal_pending; + } + + cond_resched(); } if (!pfn_valid_within(low_pfn)) @@ -922,10 +932,9 @@ static bool too_many_isolated(pg_data_t *pgdat) */ if (unlikely(__PageMovable(page)) && !PageIsolated(page)) { - if (locked) { - spin_unlock_irqrestore(&pgdat->lru_lock, - flags); - locked = false; + if (locked_lruvec) { + unlock_page_lruvec_irqrestore(locked_lruvec, flags); + locked_lruvec = NULL; } if (!isolate_movable_page(page, isolate_mode)) @@ -966,10 +975,20 @@ static bool too_many_isolated(pg_data_t *pgdat) if (!TestClearPageLRU(page)) goto isolate_fail_put; + rcu_read_lock(); + lruvec = mem_cgroup_page_lruvec(page, pgdat); + /* If we already hold the lock, we can skip some rechecking */ - if (!locked) { - locked = 
compact_lock_irqsave(&pgdat->lru_lock, - &flags, cc); + if (lruvec != locked_lruvec) { + if (locked_lruvec) + unlock_page_lruvec_irqrestore(locked_lruvec, + flags); + + compact_lock_irqsave(&lruvec->lru_lock, &flags, cc); + locked_lruvec = lruvec; + rcu_read_unlock(); + + lruvec_memcg_debug(lruvec, page); /* Try get exclusive access under lock */ if (!skip_updated) { @@ -988,9 +1007,8 @@ static bool too_many_isolated(pg_data_t *pgdat) SetPageLRU(page); goto isolate_fail_put; } - } - - lruvec = mem_cgroup_page_lruvec(page, pgdat); + } else + rcu_read_unlock(); /* The whole page is taken off the LRU; skip the tail pages. */ if (PageCompound(page)) @@ -1023,9 +1041,9 @@ static bool too_many_isolated(pg_data_t *pgdat) isolate_fail_put: /* Avoid potential deadlock in freeing page under lru_lock */ - if (locked) { - spin_unlock_irqrestore(&pgdat->lru_lock, flags); - locked = false; + if (locked_lruvec) { + unlock_page_lruvec_irqrestore(locked_lruvec, flags); + locked_lruvec = NULL; } put_page(page); @@ -1039,9 +1057,10 @@ static bool too_many_isolated(pg_data_t *pgdat) * page anyway. 
*/ if (nr_isolated) { - if (locked) { - spin_unlock_irqrestore(&pgdat->lru_lock, flags); - locked = false; + if (locked_lruvec) { + unlock_page_lruvec_irqrestore(locked_lruvec, + flags); + locked_lruvec = NULL; } putback_movable_pages(&cc->migratepages); cc->nr_migratepages = 0; @@ -1068,8 +1087,8 @@ static bool too_many_isolated(pg_data_t *pgdat) page = NULL; isolate_abort: - if (locked) - spin_unlock_irqrestore(&pgdat->lru_lock, flags); + if (locked_lruvec) + unlock_page_lruvec_irqrestore(locked_lruvec, flags); if (page) { SetPageLRU(page); put_page(page); diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 28538444197b..a0cb95891ae5 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2346,7 +2346,7 @@ static void lru_add_page_tail(struct page *head, struct page *page_tail, VM_BUG_ON_PAGE(!PageHead(head), head); VM_BUG_ON_PAGE(PageCompound(page_tail), head); VM_BUG_ON_PAGE(PageLRU(page_tail), head); - lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock); + lockdep_assert_held(&lruvec->lru_lock); if (list) { /* page reclaim is reclaiming a huge page */ @@ -2430,7 +2430,6 @@ static void __split_huge_page(struct page *page, struct list_head *list, pgoff_t end) { struct page *head = compound_head(page); - pg_data_t *pgdat = page_pgdat(head); struct lruvec *lruvec; struct address_space *swap_cache = NULL; unsigned long offset = 0; @@ -2447,10 +2446,8 @@ static void __split_huge_page(struct page *page, struct list_head *list, xa_lock(&swap_cache->i_pages); } - /* prevent PageLRU to go away from under us, and freeze lru stats */ - spin_lock(&pgdat->lru_lock); - - lruvec = mem_cgroup_page_lruvec(head, pgdat); + /* lock lru list/PageCompound, ref freezed by page_ref_freeze */ + lruvec = lock_page_lruvec(head); for (i = HPAGE_PMD_NR - 1; i >= 1; i--) { __split_huge_page_tail(head, i, lruvec, list); @@ -2471,7 +2468,7 @@ static void __split_huge_page(struct page *page, struct list_head *list, } ClearPageCompound(head); - spin_unlock(&pgdat->lru_lock); + 
unlock_page_lruvec(lruvec); /* Caller disabled irqs, so they are still disabled here */ split_page_owner(head, HPAGE_PMD_ORDER); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 20c8ed69a930..d6746656cc39 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1196,6 +1196,19 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg, return ret; } +#ifdef CONFIG_DEBUG_VM +void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page) +{ + if (mem_cgroup_disabled()) + return; + + if (!page->mem_cgroup) + VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != root_mem_cgroup, page); + else + VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != page->mem_cgroup, page); +} +#endif + /** * mem_cgroup_page_lruvec - return lruvec for isolating/putting an LRU page * @page: the page @@ -1215,7 +1228,8 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd goto out; } - memcg = page->mem_cgroup; + VM_BUG_ON_PAGE(PageTail(page), page); + memcg = READ_ONCE(page->mem_cgroup); /* * Swapcache readahead pages are added to the LRU - and * possibly migrated - before they are charged. 
@@ -1236,6 +1250,51 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd return lruvec; } +struct lruvec *lock_page_lruvec(struct page *page) +{ + struct lruvec *lruvec; + struct pglist_data *pgdat = page_pgdat(page); + + rcu_read_lock(); + lruvec = mem_cgroup_page_lruvec(page, pgdat); + spin_lock(&lruvec->lru_lock); + rcu_read_unlock(); + + lruvec_memcg_debug(lruvec, page); + + return lruvec; +} + +struct lruvec *lock_page_lruvec_irq(struct page *page) +{ + struct lruvec *lruvec; + struct pglist_data *pgdat = page_pgdat(page); + + rcu_read_lock(); + lruvec = mem_cgroup_page_lruvec(page, pgdat); + spin_lock_irq(&lruvec->lru_lock); + rcu_read_unlock(); + + lruvec_memcg_debug(lruvec, page); + + return lruvec; +} + +struct lruvec *lock_page_lruvec_irqsave(struct page *page, unsigned long *flags) +{ + struct lruvec *lruvec; + struct pglist_data *pgdat = page_pgdat(page); + + rcu_read_lock(); + lruvec = mem_cgroup_page_lruvec(page, pgdat); + spin_lock_irqsave(&lruvec->lru_lock, *flags); + rcu_read_unlock(); + + lruvec_memcg_debug(lruvec, page); + + return lruvec; +} + /** * mem_cgroup_update_lru_size - account for adding or removing an lru page * @lruvec: mem_cgroup per zone lru vector @@ -2999,7 +3058,7 @@ void __memcg_kmem_uncharge_page(struct page *page, int order) /* * Because tail pages are not marked as "used", set it. We're under - * pgdat->lru_lock and migration entries setup in all page mappings. + * lruvec->lru_lock and migration entries setup in all page mappings. */ void mem_cgroup_split_huge_fixup(struct page *head) { diff --git a/mm/mlock.c b/mm/mlock.c index 228ba5a8e0a5..5d40d259a931 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -106,12 +106,10 @@ void mlock_vma_page(struct page *page) * Isolate a page from LRU with optional get_page() pin. * Assumes lru_lock already held and page already pinned. 
*/ -static bool __munlock_isolate_lru_page(struct page *page, bool getpage) +static bool __munlock_isolate_lru_page(struct page *page, + struct lruvec *lruvec, bool getpage) { if (TestClearPageLRU(page)) { - struct lruvec *lruvec; - - lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page)); if (getpage) get_page(page); del_page_from_lru_list(page, lruvec, page_lru(page)); @@ -181,7 +179,7 @@ static void __munlock_isolation_failed(struct page *page) unsigned int munlock_vma_page(struct page *page) { int nr_pages; - pg_data_t *pgdat = page_pgdat(page); + struct lruvec *lruvec; /* For try_to_munlock() and to serialize with page migration */ BUG_ON(!PageLocked(page)); @@ -189,11 +187,16 @@ unsigned int munlock_vma_page(struct page *page) VM_BUG_ON_PAGE(PageTail(page), page); /* - * Serialize with any parallel __split_huge_page_refcount() which + * Serialize split tail pages in __split_huge_page_tail() which * might otherwise copy PageMlocked to part of the tail pages before * we clear it in the head page. It also stabilizes hpage_nr_pages(). + * TestClearPageLRU can't be used here to block page isolation, since + * out of lock clear_page_mlock may interfer PageLRU/PageMlocked + * sequence, same as __pagevec_lru_add_fn, and lead the page place to + * wrong lru list here. So relay on PageLocked to stop lruvec change + * in mem_cgroup_move_account(). 
*/ - spin_lock_irq(&pgdat->lru_lock); + lruvec = lock_page_lruvec_irq(page); if (!TestClearPageMlocked(page)) { /* Potentially, PTE-mapped THP: do not skip the rest PTEs */ @@ -204,15 +207,15 @@ unsigned int munlock_vma_page(struct page *page) nr_pages = hpage_nr_pages(page); __mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages); - if (__munlock_isolate_lru_page(page, true)) { - spin_unlock_irq(&pgdat->lru_lock); + if (__munlock_isolate_lru_page(page, lruvec, true)) { + unlock_page_lruvec_irq(lruvec); __munlock_isolated_page(page); goto out; } __munlock_isolation_failed(page); unlock_out: - spin_unlock_irq(&pgdat->lru_lock); + unlock_page_lruvec_irq(lruvec); out: return nr_pages - 1; @@ -292,23 +295,34 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone) int nr = pagevec_count(pvec); int delta_munlocked = -nr; struct pagevec pvec_putback; + struct lruvec *lruvec = NULL; int pgrescued = 0; pagevec_init(&pvec_putback); /* Phase 1: page isolation */ - spin_lock_irq(&zone->zone_pgdat->lru_lock); for (i = 0; i < nr; i++) { struct page *page = pvec->pages[i]; + struct lruvec *new_lruvec; + + /* block memcg change in mem_cgroup_move_account */ + lock_page_memcg(page); + new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page)); + if (new_lruvec != lruvec) { + if (lruvec) + unlock_page_lruvec_irq(lruvec); + lruvec = lock_page_lruvec_irq(page); + } if (TestClearPageMlocked(page)) { /* * We already have pin from follow_page_mask() * so we can spare the get_page() here. */ - if (__munlock_isolate_lru_page(page, false)) + if (__munlock_isolate_lru_page(page, lruvec, false)) { + unlock_page_memcg(page); continue; - else + } else __munlock_isolation_failed(page); } else { delta_munlocked++; @@ -320,11 +334,14 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone) * pin. We cannot do it under lru_lock however. If it's * the last pin, __page_cache_release() would deadlock. 
*/ + unlock_page_memcg(page); pagevec_add(&pvec_putback, pvec->pages[i]); pvec->pages[i] = NULL; } - __mod_zone_page_state(zone, NR_MLOCK, delta_munlocked); - spin_unlock_irq(&zone->zone_pgdat->lru_lock); + if (lruvec) { + __mod_zone_page_state(zone, NR_MLOCK, delta_munlocked); + unlock_page_lruvec_irq(lruvec); + } /* Now we can release pins of pages that we are not munlocking */ pagevec_release(&pvec_putback); diff --git a/mm/mmzone.c b/mm/mmzone.c index 4686fdc23bb9..3750a90ed4a0 100644 --- a/mm/mmzone.c +++ b/mm/mmzone.c @@ -91,6 +91,7 @@ void lruvec_init(struct lruvec *lruvec) enum lru_list lru; memset(lruvec, 0, sizeof(struct lruvec)); + spin_lock_init(&lruvec->lru_lock); for_each_lru(lru) INIT_LIST_HEAD(&lruvec->lists[lru]); diff --git a/mm/swap.c b/mm/swap.c index 3029b3f74811..09edac441eb6 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -79,15 +79,13 @@ static DEFINE_PER_CPU(struct lru_pvecs, lru_pvecs) = { static void __page_cache_release(struct page *page) { if (PageLRU(page)) { - pg_data_t *pgdat = page_pgdat(page); struct lruvec *lruvec; unsigned long flags; __ClearPageLRU(page); - spin_lock_irqsave(&pgdat->lru_lock, flags); - lruvec = mem_cgroup_page_lruvec(page, pgdat); + lruvec = lock_page_lruvec_irqsave(page, &flags); del_page_from_lru_list(page, lruvec, page_off_lru(page)); - spin_unlock_irqrestore(&pgdat->lru_lock, flags); + unlock_page_lruvec_irqrestore(lruvec, flags); } __ClearPageWaiters(page); } @@ -206,32 +204,30 @@ static void pagevec_lru_move_fn(struct pagevec *pvec, void (*move_fn)(struct page *page, struct lruvec *lruvec)) { int i; - struct pglist_data *pgdat = NULL; - struct lruvec *lruvec; + struct lruvec *lruvec = NULL; unsigned long flags = 0; for (i = 0; i < pagevec_count(pvec); i++) { struct page *page = pvec->pages[i]; - struct pglist_data *pagepgdat = page_pgdat(page); - - if (pagepgdat != pgdat) { - if (pgdat) - spin_unlock_irqrestore(&pgdat->lru_lock, flags); - pgdat = pagepgdat; - spin_lock_irqsave(&pgdat->lru_lock, flags); - } + 
struct lruvec *new_lruvec; /* block memcg migration during page moving between lru */ if (!TestClearPageLRU(page)) continue; - lruvec = mem_cgroup_page_lruvec(page, pgdat); + new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page)); + if (lruvec != new_lruvec) { + if (lruvec) + unlock_page_lruvec_irqrestore(lruvec, flags); + lruvec = lock_page_lruvec_irqsave(page, &flags); + } + (*move_fn)(page, lruvec); SetPageLRU(page); } - if (pgdat) - spin_unlock_irqrestore(&pgdat->lru_lock, flags); + if (lruvec) + unlock_page_lruvec_irqrestore(lruvec, flags); release_pages(pvec->pages, pvec->nr); pagevec_reinit(pvec); } @@ -274,9 +270,8 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages) { do { unsigned long lrusize; - struct pglist_data *pgdat = lruvec_pgdat(lruvec); - spin_lock_irq(&pgdat->lru_lock); + spin_lock_irq(&lruvec->lru_lock); /* Record cost event */ if (file) lruvec->file_cost += nr_pages; @@ -300,7 +295,7 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages) lruvec->file_cost /= 2; lruvec->anon_cost /= 2; } - spin_unlock_irq(&pgdat->lru_lock); + spin_unlock_irq(&lruvec->lru_lock); } while ((lruvec = parent_lruvec(lruvec))); } @@ -365,12 +360,13 @@ static inline void activate_page_drain(int cpu) void activate_page(struct page *page) { pg_data_t *pgdat = page_pgdat(page); + struct lruvec *lruvec; page = compound_head(page); - spin_lock_irq(&pgdat->lru_lock); + lruvec = lock_page_lruvec_irq(page); if (PageLRU(page)) - __activate_page(page, mem_cgroup_page_lruvec(page, pgdat)); - spin_unlock_irq(&pgdat->lru_lock); + __activate_page(page, lruvec); + unlock_page_lruvec_irq(lruvec); } #endif @@ -817,8 +813,7 @@ void release_pages(struct page **pages, int nr) { int i; LIST_HEAD(pages_to_free); - struct pglist_data *locked_pgdat = NULL; - struct lruvec *lruvec; + struct lruvec *lruvec = NULL; unsigned long uninitialized_var(flags); unsigned int uninitialized_var(lock_batch); @@ -828,21 +823,20 @@ void 
release_pages(struct page **pages, int nr) /* * Make sure the IRQ-safe lock-holding time does not get * excessive with a continuous string of pages from the - * same pgdat. The lock is held only if pgdat != NULL. + * same lruvec. The lock is held only if lruvec != NULL. */ - if (locked_pgdat && ++lock_batch == SWAP_CLUSTER_MAX) { - spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags); - locked_pgdat = NULL; + if (lruvec && ++lock_batch == SWAP_CLUSTER_MAX) { + unlock_page_lruvec_irqrestore(lruvec, flags); + lruvec = NULL; } if (is_huge_zero_page(page)) continue; if (is_zone_device_page(page)) { - if (locked_pgdat) { - spin_unlock_irqrestore(&locked_pgdat->lru_lock, - flags); - locked_pgdat = NULL; + if (lruvec) { + unlock_page_lruvec_irqrestore(lruvec, flags); + lruvec = NULL; } /* * ZONE_DEVICE pages that return 'false' from @@ -861,28 +855,28 @@ void release_pages(struct page **pages, int nr) continue; if (PageCompound(page)) { - if (locked_pgdat) { - spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags); - locked_pgdat = NULL; + if (lruvec) { + unlock_page_lruvec_irqrestore(lruvec, flags); + lruvec = NULL; } __put_compound_page(page); continue; } if (PageLRU(page)) { - struct pglist_data *pgdat = page_pgdat(page); + struct lruvec *new_lruvec; - if (pgdat != locked_pgdat) { - if (locked_pgdat) - spin_unlock_irqrestore(&locked_pgdat->lru_lock, + new_lruvec = mem_cgroup_page_lruvec(page, + page_pgdat(page)); + if (new_lruvec != lruvec) { + if (lruvec) + unlock_page_lruvec_irqrestore(lruvec, flags); lock_batch = 0; - locked_pgdat = pgdat; - spin_lock_irqsave(&locked_pgdat->lru_lock, flags); + lruvec = lock_page_lruvec_irqsave(page, &flags); } __ClearPageLRU(page); - lruvec = mem_cgroup_page_lruvec(page, locked_pgdat); del_page_from_lru_list(page, lruvec, page_off_lru(page)); } @@ -892,8 +886,8 @@ void release_pages(struct page **pages, int nr) list_add(&page->lru, &pages_to_free); } - if (locked_pgdat) - spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags); + 
if (lruvec) + unlock_page_lruvec_irqrestore(lruvec, flags); mem_cgroup_uncharge_list(&pages_to_free); free_unref_page_list(&pages_to_free); @@ -981,26 +975,24 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec) void __pagevec_lru_add(struct pagevec *pvec) { int i; - struct pglist_data *pgdat = NULL; - struct lruvec *lruvec; + struct lruvec *lruvec = NULL; unsigned long flags = 0; for (i = 0; i < pagevec_count(pvec); i++) { struct page *page = pvec->pages[i]; - struct pglist_data *pagepgdat = page_pgdat(page); + struct lruvec *new_lruvec; - if (pagepgdat != pgdat) { - if (pgdat) - spin_unlock_irqrestore(&pgdat->lru_lock, flags); - pgdat = pagepgdat; - spin_lock_irqsave(&pgdat->lru_lock, flags); + new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page)); + if (lruvec != new_lruvec) { + if (lruvec) + unlock_page_lruvec_irqrestore(lruvec, flags); + lruvec = lock_page_lruvec_irqsave(page, &flags); } - lruvec = mem_cgroup_page_lruvec(page, pgdat); __pagevec_lru_add_fn(page, lruvec); } - if (pgdat) - spin_unlock_irqrestore(&pgdat->lru_lock, flags); + if (lruvec) + unlock_page_lruvec_irqrestore(lruvec, flags); release_pages(pvec->pages, pvec->nr); pagevec_reinit(pvec); } diff --git a/mm/vmscan.c b/mm/vmscan.c index f77748adc340..168c1659e430 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1774,15 +1774,13 @@ int isolate_lru_page(struct page *page) WARN_RATELIMIT(PageTail(page), "trying to isolate tail page"); if (TestClearPageLRU(page)) { - pg_data_t *pgdat = page_pgdat(page); struct lruvec *lruvec; int lru = page_lru(page); get_page(page); - lruvec = mem_cgroup_page_lruvec(page, pgdat); - spin_lock_irq(&pgdat->lru_lock); + lruvec = lock_page_lruvec_irq(page); del_page_from_lru_list(page, lruvec, lru); - spin_unlock_irq(&pgdat->lru_lock); + unlock_page_lruvec_irq(lruvec); ret = 0; } @@ -1849,20 +1847,22 @@ static int too_many_isolated(struct pglist_data *pgdat, int file, static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec, 
struct list_head *list) { - struct pglist_data *pgdat = lruvec_pgdat(lruvec); int nr_pages, nr_moved = 0; LIST_HEAD(pages_to_free); struct page *page; + struct lruvec *orig_lruvec = lruvec; enum lru_list lru; while (!list_empty(list)) { + struct lruvec *new_lruvec = NULL; + page = lru_to_page(list); VM_BUG_ON_PAGE(PageLRU(page), page); list_del(&page->lru); if (unlikely(!page_evictable(page))) { - spin_unlock_irq(&pgdat->lru_lock); + spin_unlock_irq(&lruvec->lru_lock); putback_lru_page(page); - spin_lock_irq(&pgdat->lru_lock); + spin_lock_irq(&lruvec->lru_lock); continue; } @@ -1876,6 +1876,12 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec, * list_add(&page->lru,) * list_add(&page->lru,) //corrupt */ + new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page)); + if (new_lruvec != lruvec) { + if (lruvec) + spin_unlock_irq(&lruvec->lru_lock); + lruvec = lock_page_lruvec_irq(page); + } SetPageLRU(page); if (unlikely(put_page_testzero(page))) { @@ -1883,16 +1889,15 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec, __ClearPageActive(page); if (unlikely(PageCompound(page))) { - spin_unlock_irq(&pgdat->lru_lock); + spin_unlock_irq(&lruvec->lru_lock); destroy_compound_page(page); - spin_lock_irq(&pgdat->lru_lock); + spin_lock_irq(&lruvec->lru_lock); } else list_add(&page->lru, &pages_to_free); continue; } - lruvec = mem_cgroup_page_lruvec(page, pgdat); lru = page_lru(page); nr_pages = hpage_nr_pages(page); @@ -1902,6 +1907,11 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec, if (PageActive(page)) workingset_age_nonresident(lruvec, nr_pages); } + if (orig_lruvec != lruvec) { + if (lruvec) + spin_unlock_irq(&lruvec->lru_lock); + spin_lock_irq(&orig_lruvec->lru_lock); + } /* * To save our caller's stack, now use input list for pages to free. 
@@ -1957,7 +1967,7 @@ static int current_may_throttle(void) lru_add_drain(); - spin_lock_irq(&pgdat->lru_lock); + spin_lock_irq(&lruvec->lru_lock); nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list, &nr_scanned, sc, lru); @@ -1969,7 +1979,7 @@ static int current_may_throttle(void) __count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned); __count_vm_events(PGSCAN_ANON + file, nr_scanned); - spin_unlock_irq(&pgdat->lru_lock); + spin_unlock_irq(&lruvec->lru_lock); if (nr_taken == 0) return 0; @@ -1977,7 +1987,7 @@ static int current_may_throttle(void) nr_reclaimed = shrink_page_list(&page_list, pgdat, sc, 0, &stat, false); - spin_lock_irq(&pgdat->lru_lock); + spin_lock_irq(&lruvec->lru_lock); move_pages_to_lru(lruvec, &page_list); __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken); @@ -1986,7 +1996,7 @@ static int current_may_throttle(void) __count_vm_events(item, nr_reclaimed); __count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed); __count_vm_events(PGSTEAL_ANON + file, nr_reclaimed); - spin_unlock_irq(&pgdat->lru_lock); + spin_unlock_irq(&lruvec->lru_lock); lru_note_cost(lruvec, file, stat.nr_pageout); mem_cgroup_uncharge_list(&page_list); @@ -2039,7 +2049,7 @@ static void shrink_active_list(unsigned long nr_to_scan, lru_add_drain(); - spin_lock_irq(&pgdat->lru_lock); + spin_lock_irq(&lruvec->lru_lock); nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold, &nr_scanned, sc, lru); @@ -2049,7 +2059,7 @@ static void shrink_active_list(unsigned long nr_to_scan, __count_vm_events(PGREFILL, nr_scanned); __count_memcg_events(lruvec_memcg(lruvec), PGREFILL, nr_scanned); - spin_unlock_irq(&pgdat->lru_lock); + spin_unlock_irq(&lruvec->lru_lock); while (!list_empty(&l_hold)) { cond_resched(); @@ -2095,7 +2105,7 @@ static void shrink_active_list(unsigned long nr_to_scan, /* * Move pages back to the lru list. 
*/ - spin_lock_irq(&pgdat->lru_lock); + spin_lock_irq(&lruvec->lru_lock); nr_activate = move_pages_to_lru(lruvec, &l_active); nr_deactivate = move_pages_to_lru(lruvec, &l_inactive); @@ -2106,7 +2116,7 @@ static void shrink_active_list(unsigned long nr_to_scan, __count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate); __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken); - spin_unlock_irq(&pgdat->lru_lock); + spin_unlock_irq(&lruvec->lru_lock); mem_cgroup_uncharge_list(&l_active); free_unref_page_list(&l_active); @@ -2696,10 +2706,10 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc) /* * Determine the scan balance between anon and file LRUs. */ - spin_lock_irq(&pgdat->lru_lock); + spin_lock_irq(&target_lruvec->lru_lock); sc->anon_cost = target_lruvec->anon_cost; sc->file_cost = target_lruvec->file_cost; - spin_unlock_irq(&pgdat->lru_lock); + spin_unlock_irq(&target_lruvec->lru_lock); /* * Target desirable inactive:active list ratios for the anon @@ -4275,24 +4285,22 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order) */ void check_move_unevictable_pages(struct pagevec *pvec) { - struct lruvec *lruvec; - struct pglist_data *pgdat = NULL; + struct lruvec *lruvec = NULL; int pgscanned = 0; int pgrescued = 0; int i; for (i = 0; i < pvec->nr; i++) { struct page *page = pvec->pages[i]; - struct pglist_data *pagepgdat = page_pgdat(page); + struct lruvec *new_lruvec; pgscanned++; - if (pagepgdat != pgdat) { - if (pgdat) - spin_unlock_irq(&pgdat->lru_lock); - pgdat = pagepgdat; - spin_lock_irq(&pgdat->lru_lock); + new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page)); + if (lruvec != new_lruvec) { + if (lruvec) + unlock_page_lruvec_irq(lruvec); + lruvec = lock_page_lruvec_irq(page); } - lruvec = mem_cgroup_page_lruvec(page, pgdat); if (!PageLRU(page) || !PageUnevictable(page)) continue; @@ -4308,10 +4316,10 @@ void check_move_unevictable_pages(struct pagevec *pvec) } } - if (pgdat) { + if 
(lruvec) { __count_vm_events(UNEVICTABLE_PGRESCUED, pgrescued); __count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned); - spin_unlock_irq(&pgdat->lru_lock); + unlock_page_lruvec_irq(lruvec); } } EXPORT_SYMBOL_GPL(check_move_unevictable_pages); -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 101+ messages in thread
* Re: [PATCH v17 00/21] per memcg lru lock [not found] ` <1595681998-19193-1-git-send-email-alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> ` (6 preceding siblings ...) 2020-07-25 12:59 ` [PATCH v17 17/21] mm/lru: replace pgdat lru_lock with lruvec lock Alex Shi @ 2020-07-27 5:40 ` Alex Shi [not found] ` <49d4f3bf-ccce-3c97-3a4c-f5cefe2d623a-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> 2020-07-31 21:31 ` Alexander Duyck ` (4 subsequent siblings) 12 siblings, 1 reply; 101+ messages in thread From: Alex Shi @ 2020-07-27 5:40 UTC (permalink / raw) To: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mgorman-3eNAlZScCAx27rWaFMvyedHuzzzSOjJt, tj-DgEjT+Ai2ygdnm+yROfE0A, hughd-hpIqsD4AKlfQT0dZR+AlfA, khlebnikov-XoJtRXgx1JseBXzfvpsJ4g, daniel.m.jordan-QHcLZuEGTsvQT0dZR+AlfA, yang.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf, willy-wEGCiKHe2LqWVfeAwA7xHQ, hannes-druUgvl0LCNAfugRpC6u6w, lkp-ral2JQCrhuEAvxtiuMwx3w, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, shakeelb-hpIqsD4AKlfQT0dZR+AlfA, iamjoonsoo.kim-Hm3cg6mZ9cc, richard.weiyang-Re5JQEeQqe8AvxtiuMwx3w, kirill-oKw7cIdHH8eLwutG50LtGA, alexander.duyck-Re5JQEeQqe8AvxtiuMwx3w, rong.a.chen-ral2JQCrhuEAvxtiuMwx3w The standard sequence for new page isolation is the following: 1, get_page(); # pin the page so it cannot be freed 2, TestClearPageLRU(); # serialize against other isolation, and also memcg change 3, spin_lock on lru_lock; # serialize lru list access Step 2 could be optimized/replaced in scenarios where the page is unlikely to be accessed by others. On 2020/7/25 8:59 PM, Alex Shi wrote: > The new version is based on v5.8-rc6. It includes Hugh Dickins' fix in > mm/swap.c and the mm/mlock.c fix which Alexander Duyck pointed out, and > removes 'mm/mlock: reorder isolation sequence during munlock' > > Hi Johannes & Hugh & Alexander & Willy, > > Could you give a Reviewed-by, since you addressed many of the issues and > gave lots of suggestions? Many thanks! 
> > Currently, lru_lock is one per node, pgdat->lru_lock, guarding the > lru lists; but the lru lists themselves were moved into memcg long ago. Still > using a per-node lru_lock is clearly unscalable: pages in every memcg have > to compete with each other for a single lru_lock. This patchset uses a per > lruvec/memcg lru_lock to replace the per-node lru lock guarding the lru lists, > making it scalable for memcgs and gaining performance. > > Currently lru_lock still guards both the lru list and the page's lru bit; that's fine. > But if we want to use a specific lruvec lock for the page, we need to pin down > the page's lruvec/memcg while locking. Just taking the lruvec lock first may be > undermined by the page's memcg charge/migration. To fix this problem, we > clear the page's lru bit and use that as the pin-down action which blocks > memcg changes. That's the reason for the new atomic function TestClearPageLRU. > So isolating a page now requires both actions: TestClearPageLRU and holding the > lru_lock. > > The typical usage of this is isolate_migratepages_block() in compaction.c; > we have to take the lru bit before the lru lock, which serializes page isolation > against memcg page charge/migration, since those change the page's lruvec and > hence the lru_lock in it. > > The above solution was suggested by Johannes Weiner and, based on his new memcg > charge path, led to this patchset. (Hugh Dickins tested and contributed much > code, from the compaction fix to general code polish, thanks a lot!). > > The patchset includes 3 parts: > 1, some code cleanup and minimal optimization, as preparation. 
> 2, use TestClearPageLRU as page isolation's precondition
> 3, replace the per node lru_lock with a per memcg per node lru_lock
>
> Following Daniel Jordan's suggestion, I ran 208 'dd' tasks in 104
> containers on a 2s * 26cores * HT box with a modified case:
> https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice
> With this patchset, the readtwice performance increased by about 80%
> in concurrent containers.
>
> Thanks to Hugh Dickins and Konstantin Khlebnikov, who both brought up
> this idea 8 years ago, and to the others who gave comments as well:
> Daniel Jordan, Mel Gorman, Shakeel Butt, Matthew Wilcox etc.
>
> Thanks for the testing support from Intel 0day and Rong Chen, Fengguang
> Wu, and Yun Wang. Hugh Dickins also shared his kbuild-swap case. Thanks!
>
>
> Alex Shi (19):
>   mm/vmscan: remove unnecessary lruvec adding
>   mm/page_idle: no unlikely double check for idle page counting
>   mm/compaction: correct the comments of compact_defer_shift
>   mm/compaction: rename compact_deferred as compact_should_defer
>   mm/thp: move lru_add_page_tail func to huge_memory.c
>   mm/thp: clean up lru_add_page_tail
>   mm/thp: remove code path which never got into
>   mm/thp: narrow lru locking
>   mm/memcg: add debug checking in lock_page_memcg
>   mm/swap: fold vm event PGROTATED into pagevec_move_tail_fn
>   mm/lru: move lru_lock holding in func lru_note_cost_page
>   mm/lru: move lock into lru_note_cost
>   mm/lru: introduce TestClearPageLRU
>   mm/compaction: do page isolation first in compaction
>   mm/thp: add tail pages into lru anyway in split_huge_page()
>   mm/swap: serialize memcg changes in pagevec_lru_move_fn
>   mm/lru: replace pgdat lru_lock with lruvec lock
>   mm/lru: introduce the relock_page_lruvec function
>   mm/pgdat: remove pgdat lru_lock
>
> Hugh Dickins (2):
>   mm/vmscan: use relock for move_pages_to_lru
>   mm/lru: revise the comments of lru_lock
>
>  Documentation/admin-guide/cgroup-v1/memcg_test.rst | 15 +-
>
Documentation/admin-guide/cgroup-v1/memory.rst | 21 +-- > Documentation/trace/events-kmem.rst | 2 +- > Documentation/vm/unevictable-lru.rst | 22 +-- > include/linux/compaction.h | 4 +- > include/linux/memcontrol.h | 98 ++++++++++ > include/linux/mm_types.h | 2 +- > include/linux/mmzone.h | 6 +- > include/linux/page-flags.h | 1 + > include/linux/swap.h | 4 +- > include/trace/events/compaction.h | 2 +- > mm/compaction.c | 113 ++++++++---- > mm/filemap.c | 4 +- > mm/huge_memory.c | 48 +++-- > mm/memcontrol.c | 71 ++++++- > mm/memory.c | 3 - > mm/mlock.c | 43 +++-- > mm/mmzone.c | 1 + > mm/page_alloc.c | 1 - > mm/page_idle.c | 8 - > mm/rmap.c | 4 +- > mm/swap.c | 203 ++++++++------------- > mm/swap_state.c | 2 - > mm/vmscan.c | 174 ++++++++++-------- > mm/workingset.c | 2 - > 25 files changed, 510 insertions(+), 344 deletions(-) > ^ permalink raw reply [flat|nested] 101+ messages in thread
[parent not found: <49d4f3bf-ccce-3c97-3a4c-f5cefe2d623a-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org>]
* Re: [PATCH v17 00/21] per memcg lru lock [not found] ` <49d4f3bf-ccce-3c97-3a4c-f5cefe2d623a-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> @ 2020-07-29 14:49 ` Alex Shi 2020-07-29 18:06 ` Hugh Dickins 0 siblings, 1 reply; 101+ messages in thread From: Alex Shi @ 2020-07-29 14:49 UTC (permalink / raw) To: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mgorman-3eNAlZScCAx27rWaFMvyedHuzzzSOjJt, tj-DgEjT+Ai2ygdnm+yROfE0A, hughd-hpIqsD4AKlfQT0dZR+AlfA, khlebnikov-XoJtRXgx1JseBXzfvpsJ4g, daniel.m.jordan-QHcLZuEGTsvQT0dZR+AlfA, yang.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf, willy-wEGCiKHe2LqWVfeAwA7xHQ, hannes-druUgvl0LCNAfugRpC6u6w, lkp-ral2JQCrhuEAvxtiuMwx3w, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, shakeelb-hpIqsD4AKlfQT0dZR+AlfA, iamjoonsoo.kim-Hm3cg6mZ9cc, richard.weiyang-Re5JQEeQqe8AvxtiuMwx3w, kirill-oKw7cIdHH8eLwutG50LtGA, alexander.duyck-Re5JQEeQqe8AvxtiuMwx3w, rong.a.chen-ral2JQCrhuEAvxtiuMwx3w Is there any comments or suggestion for this patchset? Any hints will be very appreciated. Thanks Alex 在 2020/7/27 下午1:40, Alex Shi 写道: > A standard for new page isolation steps like the following: > 1, get_page(); #pin the page avoid be free > 2, TestClearPageLRU(); #serialize other isolation, also memcg change > 3, spin_lock on lru_lock; #serialize lru list access > The step 2 could be optimzed/replaced in scenarios which page is unlikely > be accessed by others. > > > > 在 2020/7/25 下午8:59, Alex Shi 写道: >> The new version which bases on v5.8-rc6. It includes Hugh Dickins fix in >> mm/swap.c and mm/mlock.c fix which Alexander Duyck pointed out, then >> removes 'mm/mlock: reorder isolation sequence during munlock' >> >> Hi Johanness & Hugh & Alexander & Willy, >> >> Could you like to give a reviewed by since you address much of issue and >> give lots of suggestions! Many thanks! ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH v17 00/21] per memcg lru lock 2020-07-29 14:49 ` Alex Shi @ 2020-07-29 18:06 ` Hugh Dickins 2020-07-30 2:16 ` Alex Shi 0 siblings, 1 reply; 101+ messages in thread From: Hugh Dickins @ 2020-07-29 18:06 UTC (permalink / raw) To: Alex Shi Cc: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi, willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck, rong.a.chen On Wed, 29 Jul 2020, Alex Shi wrote: > > Is there any comments or suggestion for this patchset? > Any hints will be very appreciated. Alex: it is now v5.8-rc7, obviously too late for this patchset to make v5.9, so I'm currently concentrated on checking some patches headed for v5.9 (and some bugfix patches of my own that I don't get time to send): I'll get back to responding on lru_lock in a week or two's time. Hugh ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH v17 00/21] per memcg lru lock
  2020-07-29 18:06 ` Hugh Dickins
@ 2020-07-30 2:16 ` Alex Shi
  [not found] ` <08c8797d-1935-7b41-b8db-d22f054912ac-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org>
  0 siblings, 1 reply; 101+ messages in thread
From: Alex Shi @ 2020-07-30 2:16 UTC (permalink / raw)
To: Hugh Dickins
Cc: akpm, mgorman, tj, khlebnikov, daniel.m.jordan, yang.shi, willy,
    hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
    iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck, rong.a.chen

On 2020/7/30 2:06 AM, Hugh Dickins wrote:
> On Wed, 29 Jul 2020, Alex Shi wrote:
>>
>> Is there any comments or suggestion for this patchset?
>> Any hints will be very appreciated.
>
> Alex: it is now v5.8-rc7, obviously too late for this patchset to make
> v5.9, so I'm currently concentrated on checking some patches headed for
> v5.9 (and some bugfix patches of my own that I don't get time to send):
> I'll get back to responding on lru_lock in a week or two's time.

Hi Hugh,

Thanks a lot for the response! It's fine to wait longer, but things would
be more efficient if the review effort were concentrated... I am still too
new to the mm area.

Thanks
Alex

^ permalink raw reply [flat|nested] 101+ messages in thread
[parent not found: <08c8797d-1935-7b41-b8db-d22f054912ac-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org>]
* Re: [PATCH v17 00/21] per memcg lru lock [not found] ` <08c8797d-1935-7b41-b8db-d22f054912ac-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> @ 2020-08-03 15:07 ` Michal Hocko [not found] ` <20200803150704.GV5174-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 0 siblings, 1 reply; 101+ messages in thread From: Michal Hocko @ 2020-08-03 15:07 UTC (permalink / raw) To: Alex Shi Cc: Hugh Dickins, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mgorman-3eNAlZScCAx27rWaFMvyedHuzzzSOjJt, tj-DgEjT+Ai2ygdnm+yROfE0A, khlebnikov-XoJtRXgx1JseBXzfvpsJ4g, daniel.m.jordan-QHcLZuEGTsvQT0dZR+AlfA, yang.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf, willy-wEGCiKHe2LqWVfeAwA7xHQ, hannes-druUgvl0LCNAfugRpC6u6w, lkp-ral2JQCrhuEAvxtiuMwx3w, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, shakeelb-hpIqsD4AKlfQT0dZR+AlfA, iamjoonsoo.kim-Hm3cg6mZ9cc, richard.weiyang-Re5JQEeQqe8AvxtiuMwx3w, kirill-oKw7cIdHH8eLwutG50LtGA, alexander.duyck-Re5JQEeQqe8AvxtiuMwx3w, rong.a.chen-ral2JQCrhuEAvxtiuMwx3w On Thu 30-07-20 10:16:13, Alex Shi wrote: > > > 在 2020/7/30 上午2:06, Hugh Dickins 写道: > > On Wed, 29 Jul 2020, Alex Shi wrote: > >> > >> Is there any comments or suggestion for this patchset? > >> Any hints will be very appreciated. > > > > Alex: it is now v5.8-rc7, obviously too late for this patchset to make > > v5.9, so I'm currently concentrated on checking some patches headed for > > v5.9 (and some bugfix patches of my own that I don't get time to send): > > I'll get back to responding on lru_lock in a week or two's time. > > Hi Hugh, > > Thanks a lot for response! It's fine to wait longer. > But thing would be more efficient if review get concentrated... > I am still too new in mm area. I am sorry and owe you a review but it is hard to find time for that. This is a large change and the review will be really far from trivial. 
If this version is mostly stable then I would recommend not posting new
versions and simply reminding the people you expect reviews from with a
targeted ping.
--
Michal Hocko
SUSE Labs

^ permalink raw reply [flat|nested] 101+ messages in thread
[parent not found: <20200803150704.GV5174-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* Re: [PATCH v17 00/21] per memcg lru lock [not found] ` <20200803150704.GV5174-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2020-08-04 6:14 ` Alex Shi 0 siblings, 0 replies; 101+ messages in thread From: Alex Shi @ 2020-08-04 6:14 UTC (permalink / raw) To: Michal Hocko Cc: Hugh Dickins, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mgorman-3eNAlZScCAx27rWaFMvyedHuzzzSOjJt, tj-DgEjT+Ai2ygdnm+yROfE0A, khlebnikov-XoJtRXgx1JseBXzfvpsJ4g, daniel.m.jordan-QHcLZuEGTsvQT0dZR+AlfA, yang.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf, willy-wEGCiKHe2LqWVfeAwA7xHQ, hannes-druUgvl0LCNAfugRpC6u6w, lkp-ral2JQCrhuEAvxtiuMwx3w, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, shakeelb-hpIqsD4AKlfQT0dZR+AlfA, iamjoonsoo.kim-Hm3cg6mZ9cc, richard.weiyang-Re5JQEeQqe8AvxtiuMwx3w, kirill-oKw7cIdHH8eLwutG50LtGA, alexander.duyck-Re5JQEeQqe8AvxtiuMwx3w, rong.a.chen-ral2JQCrhuEAvxtiuMwx3w 在 2020/8/3 下午11:07, Michal Hocko 写道: > On Thu 30-07-20 10:16:13, Alex Shi wrote: >> >> >> 在 2020/7/30 上午2:06, Hugh Dickins 写道: >>> On Wed, 29 Jul 2020, Alex Shi wrote: >>>> >>>> Is there any comments or suggestion for this patchset? >>>> Any hints will be very appreciated. >>> >>> Alex: it is now v5.8-rc7, obviously too late for this patchset to make >>> v5.9, so I'm currently concentrated on checking some patches headed for >>> v5.9 (and some bugfix patches of my own that I don't get time to send): >>> I'll get back to responding on lru_lock in a week or two's time. >> >> Hi Hugh, >> >> Thanks a lot for response! It's fine to wait longer. >> But thing would be more efficient if review get concentrated... >> I am still too new in mm area. > > I am sorry and owe you a review but it is hard to find time for that. > This is a large change and the review will be really far from trivial. > If this version is mostly stable then I would recommend not posting new > versions and simply remind people you expect the review from by a > targeted ping. 
>

Hi Michal,

Thanks a lot for the reminder! Except for an update to patch [PATCH v17
18/21] mm/lru: introduce the relock_page_lruvec function from Alexander,
the patchset is stable on 5.8. On linux-next, though, there are the
hpage_nr_pages -> thp_nr_pages function rename and the lru_note_cost
changes, so a new update is needed there.

And I have another 3 patches, following this patchset, which do cleanup
and optimizing. Are they worth a new patchset, or should I just post the
updates here?

Thanks
Alex

^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH v17 00/21] per memcg lru lock [not found] ` <1595681998-19193-1-git-send-email-alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> ` (7 preceding siblings ...) 2020-07-27 5:40 ` [PATCH v17 00/21] per memcg lru lock Alex Shi @ 2020-07-31 21:31 ` Alexander Duyck 2020-08-04 8:36 ` Alex Shi ` (3 subsequent siblings) 12 siblings, 0 replies; 101+ messages in thread From: Alexander Duyck @ 2020-07-31 21:31 UTC (permalink / raw) To: Alex Shi Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox, Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups-u79uwXL29TY76Z2rM5mHXA, Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov, Rong Chen On Sat, Jul 25, 2020 at 6:00 AM Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> wrote: > > The new version which bases on v5.8-rc6. It includes Hugh Dickins fix in > mm/swap.c and mm/mlock.c fix which Alexander Duyck pointed out, then > removes 'mm/mlock: reorder isolation sequence during munlock' > > Hi Johanness & Hugh & Alexander & Willy, > > Could you like to give a reviewed by since you address much of issue and > give lots of suggestions! Many thanks! > I just finished getting a test pass done on the patches. I'm still seeing a regression on the will-it-scale/page_fault3 test but it is now only 3% instead of the 20% that it was so it may just be noise at this point. I'll try to make sure to get my review feedback wrapped up early next week. Thanks. - Alex ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH v17 00/21] per memcg lru lock [not found] ` <1595681998-19193-1-git-send-email-alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> ` (8 preceding siblings ...) 2020-07-31 21:31 ` Alexander Duyck @ 2020-08-04 8:36 ` Alex Shi 2020-08-04 8:36 ` Alex Shi ` (2 subsequent siblings) 12 siblings, 0 replies; 101+ messages in thread From: Alex Shi @ 2020-08-04 8:36 UTC (permalink / raw) To: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mgorman-3eNAlZScCAx27rWaFMvyedHuzzzSOjJt, tj-DgEjT+Ai2ygdnm+yROfE0A, hughd-hpIqsD4AKlfQT0dZR+AlfA, khlebnikov-XoJtRXgx1JseBXzfvpsJ4g, daniel.m.jordan-QHcLZuEGTsvQT0dZR+AlfA, yang.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf, willy-wEGCiKHe2LqWVfeAwA7xHQ, hannes-druUgvl0LCNAfugRpC6u6w, lkp-ral2JQCrhuEAvxtiuMwx3w, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, shakeelb-hpIqsD4AKlfQT0dZR+AlfA, iamjoonsoo.kim-Hm3cg6mZ9cc, richard.weiyang-Re5JQEeQqe8AvxtiuMwx3w, kirill-oKw7cIdHH8eLwutG50LtGA, alexander.duyck-Re5JQEeQqe8AvxtiuMwx3w, rong.a.chen-ral2JQCrhuEAvxtiuMwx3w, Michal Hocko, Vladimir Davydov From 6f3ac2a72448291a88f50df836d829a23e7df736 Mon Sep 17 00:00:00 2001 From: Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> Date: Sat, 25 Jul 2020 22:52:11 +0800 Subject: [PATCH 2/3] mm/mlock: remove __munlock_isolate_lru_page The func only has one caller, remove it to clean up code and simplify code. Signed-off-by: Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> Cc: Kirill A. 
Shutemov <kirill.shutemov-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> Cc: Vlastimil Babka <vbabka-AlSwsSmVLrQ@public.gmane.org> Cc: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org --- mm/mlock.c | 22 ++++------------------ 1 file changed, 4 insertions(+), 18 deletions(-) diff --git a/mm/mlock.c b/mm/mlock.c index 46a05e6ec5ba..40a8bb79c65e 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -102,23 +102,6 @@ void mlock_vma_page(struct page *page) } /* - * Isolate a page from LRU with optional get_page() pin. - * Assumes lru_lock already held and page already pinned. - */ -static bool __munlock_isolate_lru_page(struct page *page, - struct lruvec *lruvec, bool getpage) -{ - if (TestClearPageLRU(page)) { - if (getpage) - get_page(page); - del_page_from_lru_list(page, lruvec, page_lru(page)); - return true; - } - - return false; -} - -/* * Finish munlock after successful page isolation * * Page must be locked. This is a wrapper for try_to_munlock() @@ -300,7 +283,10 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone) * We already have pin from follow_page_mask() * so we can spare the get_page() here. */ - if (__munlock_isolate_lru_page(page, lruvec, false)) { + if (TestClearPageLRU(page)) { + enum lru_list lru = page_lru(page); + + del_page_from_lru_list(page, lruvec, lru); unlock_page_memcg(page); continue; } else -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 101+ messages in thread
* Re: [PATCH v17 00/21] per memcg lru lock [not found] ` <1595681998-19193-1-git-send-email-alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> ` (9 preceding siblings ...) 2020-08-04 8:36 ` Alex Shi @ 2020-08-04 8:36 ` Alex Shi 2020-08-04 8:37 ` Alex Shi 2020-08-04 8:37 ` Alex Shi 12 siblings, 0 replies; 101+ messages in thread From: Alex Shi @ 2020-08-04 8:36 UTC (permalink / raw) To: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mgorman-3eNAlZScCAx27rWaFMvyedHuzzzSOjJt, tj-DgEjT+Ai2ygdnm+yROfE0A, hughd-hpIqsD4AKlfQT0dZR+AlfA, khlebnikov-XoJtRXgx1JseBXzfvpsJ4g, daniel.m.jordan-QHcLZuEGTsvQT0dZR+AlfA, yang.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf, willy-wEGCiKHe2LqWVfeAwA7xHQ, hannes-druUgvl0LCNAfugRpC6u6w, lkp-ral2JQCrhuEAvxtiuMwx3w, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, shakeelb-hpIqsD4AKlfQT0dZR+AlfA, iamjoonsoo.kim-Hm3cg6mZ9cc, richard.weiyang-Re5JQEeQqe8AvxtiuMwx3w, kirill-oKw7cIdHH8eLwutG50LtGA, alexander.duyck-Re5JQEeQqe8AvxtiuMwx3w, rong.a.chen-ral2JQCrhuEAvxtiuMwx3w, Vladimir Davydov, Michal Hocko From e2918c8fa741442255a2f12659f95dae94fdfe5d Mon Sep 17 00:00:00 2001 From: Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> Date: Sat, 1 Aug 2020 22:49:31 +0800 Subject: [PATCH 3/3] mm/swap.c: optimizing __pagevec_lru_add lru_lock The current relock will unlock/lock lru_lock with every time lruvec changes, so it would cause frequency relock if 2 memcgs are reading file simultaneously. This patch will record the involved lru_lock and only hold them once in above scenario. That could reduce the lock contention. Using per cpu data intead of local stack data to avoid repeatly INIT_LIST_HEAD action. 
[lkp-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org: found a build issue in the original patch, thanks] Suggested-by: Konstantin Khlebnikov <koct9i-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> Signed-off-by: Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> Cc: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org --- mm/swap.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 51 insertions(+), 6 deletions(-) diff --git a/mm/swap.c b/mm/swap.c index d88a6c650a7c..e227fec6983c 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -72,6 +72,27 @@ static DEFINE_PER_CPU(struct lru_pvecs, lru_pvecs) = { .lock = INIT_LOCAL_LOCK(lock), }; +struct pvlvs { + struct list_head lists[PAGEVEC_SIZE]; + struct lruvec *vecs[PAGEVEC_SIZE]; +}; +static DEFINE_PER_CPU(struct pvlvs, pvlvs); + +static int __init pvlvs_init(void) { + int i, cpu; + struct pvlvs *pvecs; + + for (cpu = 0; cpu < NR_CPUS; cpu++) { + if (!cpu_possible(cpu)) + continue; + pvecs = per_cpu_ptr(&pvlvs, cpu); + for (i = 0; i < PAGEVEC_SIZE; i++) + INIT_LIST_HEAD(&pvecs->lists[i]); + } + return 0; +} +subsys_initcall(pvlvs_init); + /* * This path almost never happens for VM activity - pages are normally * freed via pagevecs. But it gets used by networking. @@ -963,18 +984,42 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec) */ void __pagevec_lru_add(struct pagevec *pvec) { - int i; + int i, j, total = 0; struct lruvec *lruvec = NULL; unsigned long flags = 0; + struct page *page; + struct pvlvs *lvs = this_cpu_ptr(&pvlvs); + /* Sort the same lruvec pages on a list. 
*/ for (i = 0; i < pagevec_count(pvec); i++) { - struct page *page = pvec->pages[i]; + page = pvec->pages[i]; + lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page)); + + /* Try to find a same lruvec */ + for (j = 0; j <= total; j++) + if (lruvec == lvs->vecs[j]) + break; + /* A new lruvec */ + if (j > total) { + lvs->vecs[total] = lruvec; + j = total; + total++; + } - lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags); - __pagevec_lru_add_fn(page, lruvec); + list_add(&page->lru, &lvs->lists[j]); } - if (lruvec) - unlock_page_lruvec_irqrestore(lruvec, flags); + + for (i = 0; i < total; i++) { + spin_lock_irqsave(&lvs->vecs[i]->lru_lock, flags); + while (!list_empty(&lvs->lists[i])) { + page = lru_to_page(&lvs->lists[i]); + list_del(&page->lru); + __pagevec_lru_add_fn(page, lvs->vecs[i]); + } + spin_unlock_irqrestore(&lvs->vecs[i]->lru_lock, flags); + lvs->vecs[i] = NULL; + } + release_pages(pvec->pages, pvec->nr); pagevec_reinit(pvec); } -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 101+ messages in thread
* Re: [PATCH v17 00/21] per memcg lru lock [not found] ` <1595681998-19193-1-git-send-email-alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> ` (10 preceding siblings ...) 2020-08-04 8:36 ` Alex Shi @ 2020-08-04 8:37 ` Alex Shi 2020-08-04 8:37 ` Alex Shi 12 siblings, 0 replies; 101+ messages in thread From: Alex Shi @ 2020-08-04 8:37 UTC (permalink / raw) To: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mgorman-3eNAlZScCAx27rWaFMvyedHuzzzSOjJt, tj-DgEjT+Ai2ygdnm+yROfE0A, hughd-hpIqsD4AKlfQT0dZR+AlfA, khlebnikov-XoJtRXgx1JseBXzfvpsJ4g, daniel.m.jordan-QHcLZuEGTsvQT0dZR+AlfA, yang.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf, willy-wEGCiKHe2LqWVfeAwA7xHQ, hannes-druUgvl0LCNAfugRpC6u6w, lkp-ral2JQCrhuEAvxtiuMwx3w, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, shakeelb-hpIqsD4AKlfQT0dZR+AlfA, iamjoonsoo.kim-Hm3cg6mZ9cc, richard.weiyang-Re5JQEeQqe8AvxtiuMwx3w, kirill-oKw7cIdHH8eLwutG50LtGA, alexander.duyck-Re5JQEeQqe8AvxtiuMwx3w, rong.a.chen-ral2JQCrhuEAvxtiuMwx3w, Vladimir Davydov, Michal Hocko From 0696a2a4a8ca5e9bf62f208126ea4af7727d2edc Mon Sep 17 00:00:00 2001 From: Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> Date: Sat, 25 Jul 2020 22:31:03 +0800 Subject: [PATCH 1/3] mm/mlock: remove lru_lock on TestClearPageMlocked in munlock_vma_page In the func munlock_vma_page, the page must be PageLocked as well as pages in split_huge_page series funcs. Thus the PageLocked is enough to serialize both funcs. So we could relief the TestClearPageMlocked/hpage_nr_pages which are not necessary under lru lock. As to another munlock func __munlock_pagevec, which no PageLocked protection and should remain lru protecting. Signed-off-by: Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> Cc: Kirill A. 
Shutemov <kirill.shutemov-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> Cc: Vlastimil Babka <vbabka-AlSwsSmVLrQ@public.gmane.org> Cc: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org --- mm/mlock.c | 41 +++++++++++++++-------------------------- 1 file changed, 15 insertions(+), 26 deletions(-) diff --git a/mm/mlock.c b/mm/mlock.c index 0448409184e3..46a05e6ec5ba 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -69,9 +69,9 @@ void clear_page_mlock(struct page *page) * * See __pagevec_lru_add_fn for more explanation. */ - if (!isolate_lru_page(page)) { + if (!isolate_lru_page(page)) putback_lru_page(page); - } else { + else { /* * We lost the race. the page already moved to evictable list. */ @@ -178,7 +178,6 @@ static void __munlock_isolation_failed(struct page *page) unsigned int munlock_vma_page(struct page *page) { int nr_pages; - struct lruvec *lruvec; /* For try_to_munlock() and to serialize with page migration */ BUG_ON(!PageLocked(page)); @@ -186,37 +185,22 @@ unsigned int munlock_vma_page(struct page *page) VM_BUG_ON_PAGE(PageTail(page), page); /* - * Serialize split tail pages in __split_huge_page_tail() which - * might otherwise copy PageMlocked to part of the tail pages before - * we clear it in the head page. It also stabilizes thp_nr_pages(). - * TestClearPageLRU can't be used here to block page isolation, since - * out of lock clear_page_mlock may interfer PageLRU/PageMlocked - * sequence, same as __pagevec_lru_add_fn, and lead the page place to - * wrong lru list here. So relay on PageLocked to stop lruvec change - * in mem_cgroup_move_account(). + * Serialize split tail pages in __split_huge_page_tail() by + * lock_page(); Do TestClearPageMlocked/PageLRU sequence like + * clear_page_mlock(). 
*/ - lruvec = lock_page_lruvec_irq(page); - - if (!TestClearPageMlocked(page)) { + if (!TestClearPageMlocked(page)) /* Potentially, PTE-mapped THP: do not skip the rest PTEs */ - nr_pages = 1; - goto unlock_out; - } + return 0; nr_pages = thp_nr_pages(page); __mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages); - if (__munlock_isolate_lru_page(page, lruvec, true)) { - unlock_page_lruvec_irq(lruvec); + if (!isolate_lru_page(page)) __munlock_isolated_page(page); - goto out; - } - __munlock_isolation_failed(page); - -unlock_out: - unlock_page_lruvec_irq(lruvec); + else + __munlock_isolation_failed(page); -out: return nr_pages - 1; } @@ -305,6 +289,11 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone) /* block memcg change in mem_cgroup_move_account */ lock_page_memcg(page); + /* + * Serialize split tail pages in __split_huge_page_tail() which + * might otherwise copy PageMlocked to part of the tail pages + * before we clear it in the head page. + */ lruvec = relock_page_lruvec_irq(page, lruvec); if (TestClearPageMlocked(page)) { /* -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 101+ messages in thread
* Re: [PATCH v17 00/21] per memcg lru lock [not found] ` <1595681998-19193-1-git-send-email-alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> ` (11 preceding siblings ...) 2020-08-04 8:37 ` Alex Shi @ 2020-08-04 8:37 ` Alex Shi 12 siblings, 0 replies; 101+ messages in thread From: Alex Shi @ 2020-08-04 8:37 UTC (permalink / raw) To: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mgorman-3eNAlZScCAx27rWaFMvyedHuzzzSOjJt, tj-DgEjT+Ai2ygdnm+yROfE0A, hughd-hpIqsD4AKlfQT0dZR+AlfA, khlebnikov-XoJtRXgx1JseBXzfvpsJ4g, daniel.m.jordan-QHcLZuEGTsvQT0dZR+AlfA, yang.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf, willy-wEGCiKHe2LqWVfeAwA7xHQ, hannes-druUgvl0LCNAfugRpC6u6w, lkp-ral2JQCrhuEAvxtiuMwx3w, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, shakeelb-hpIqsD4AKlfQT0dZR+AlfA, iamjoonsoo.kim-Hm3cg6mZ9cc, richard.weiyang-Re5JQEeQqe8AvxtiuMwx3w, kirill-oKw7cIdHH8eLwutG50LtGA, alexander.duyck-Re5JQEeQqe8AvxtiuMwx3w, rong.a.chen-ral2JQCrhuEAvxtiuMwx3w, Vladimir Davydov, Michal Hocko From e2918c8fa741442255a2f12659f95dae94fdfe5d Mon Sep 17 00:00:00 2001 From: Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> Date: Tue, 4 Aug 2020 16:20:02 +0800 Subject: [PATCH 0/3] optimzing following per memcg lru_lock The first 2 patches are code clean up. And the 3rd one is a lru_add optimize. Alex Shi (3): mm/mlock: remove lru_lock on TestClearPageMlocked in munlock_vma_page mm/mlock: remove __munlock_isolate_lru_page mm/swap.c: optimizing __pagevec_lru_add lru_lock mm/mlock.c | 63 +++++++++++++++++++------------------------------------------- mm/swap.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++++++------ 2 files changed, 70 insertions(+), 50 deletions(-) -- 1.8.3.1 ^ permalink raw reply [flat|nested] 101+ messages in thread
* [PATCH v17 06/21] mm/thp: clean up lru_add_page_tail 2020-07-25 12:59 [PATCH v17 00/21] per memcg lru lock Alex Shi ` (3 preceding siblings ...) [not found] ` <1595681998-19193-1-git-send-email-alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> @ 2020-07-25 12:59 ` Alex Shi 2020-07-25 12:59 ` [PATCH v17 07/21] mm/thp: remove code path which never got into Alex Shi ` (9 subsequent siblings) 14 siblings, 0 replies; 101+ messages in thread From: Alex Shi @ 2020-07-25 12:59 UTC (permalink / raw) To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi, willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck, rong.a.chen Since the first parameter is only used by head page, it's better to make it explicit. Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org --- mm/huge_memory.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 9e050b13f597..b18f21da4dac 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2340,19 +2340,19 @@ static void remap_page(struct page *page) } } -static void lru_add_page_tail(struct page *page, struct page *page_tail, +static void lru_add_page_tail(struct page *head, struct page *page_tail, struct lruvec *lruvec, struct list_head *list) { - VM_BUG_ON_PAGE(!PageHead(page), page); - VM_BUG_ON_PAGE(PageCompound(page_tail), page); - VM_BUG_ON_PAGE(PageLRU(page_tail), page); + VM_BUG_ON_PAGE(!PageHead(head), head); + VM_BUG_ON_PAGE(PageCompound(page_tail), head); + VM_BUG_ON_PAGE(PageLRU(page_tail), head); lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock); if (!list) SetPageLRU(page_tail); - if 
(likely(PageLRU(page))) - list_add_tail(&page_tail->lru, &page->lru); + if (likely(PageLRU(head))) + list_add_tail(&page_tail->lru, &head->lru); else if (list) { /* page reclaim is reclaiming a huge page */ get_page(page_tail); -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 101+ messages in thread
* [PATCH v17 07/21] mm/thp: remove code path which never got into
  2020-07-25 12:59 [PATCH v17 00/21] per memcg lru lock Alex Shi
  ` (4 preceding siblings ...)
  2020-07-25 12:59 ` [PATCH v17 06/21] mm/thp: clean up lru_add_page_tail Alex Shi
@ 2020-07-25 12:59 ` Alex Shi
  2020-07-25 12:59 ` [PATCH v17 08/21] mm/thp: narrow lru locking Alex Shi
  ` (8 subsequent siblings)
  14 siblings, 0 replies; 101+ messages in thread
From: Alex Shi @ 2020-07-25 12:59 UTC (permalink / raw)
To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
    willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
    iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck, rong.a.chen

split_huge_page() is never called on a page which isn't on an lru list,
so this code never got a chance to run, and should not run: it would add
tail pages to an lru list whose head page isn't there. Although the bug
was never triggered, it had better be removed for code correctness.

BTW, it would look better to set a WARN() on the wrong path, but that
path will be changed by the incoming new page isolation func, so just
leave it for now.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Kirill A.
Shutemov <kirill@shutemov.name> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org --- mm/huge_memory.c | 10 ---------- 1 file changed, 10 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index b18f21da4dac..1fb4147ff854 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2357,16 +2357,6 @@ static void lru_add_page_tail(struct page *head, struct page *page_tail, /* page reclaim is reclaiming a huge page */ get_page(page_tail); list_add_tail(&page_tail->lru, list); - } else { - /* - * Head page has not yet been counted, as an hpage, - * so we must account for each subpage individually. - * - * Put page_tail on the list at the correct position - * so they all end up in order. - */ - add_page_to_lru_list_tail(page_tail, lruvec, - page_lru(page_tail)); } } -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 101+ messages in thread
* [PATCH v17 08/21] mm/thp: narrow lru locking 2020-07-25 12:59 [PATCH v17 00/21] per memcg lru lock Alex Shi ` (5 preceding siblings ...) 2020-07-25 12:59 ` [PATCH v17 07/21] mm/thp: remove code path which never got into Alex Shi @ 2020-07-25 12:59 ` Alex Shi 2020-07-25 12:59 ` [PATCH v17 10/21] mm/swap: fold vm event PGROTATED into pagevec_move_tail_fn Alex Shi ` (7 subsequent siblings) 14 siblings, 0 replies; 101+ messages in thread From: Alex Shi @ 2020-07-25 12:59 UTC (permalink / raw) To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi, willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck, rong.a.chen Cc: Andrea Arcangeli With the current sequence, lru_lock and the page cache xa_lock have no reason to be held together; nesting them isn't necessary. Let's narrow the lru locking, but keep the local_irq_disable() to block interrupt re-entry and statistics updates. Hugh Dickins' point: split_huge_page_to_list() was already silly to be using the _irqsave variant: it has just been taking sleeping locks, so it would already be broken if entered with interrupts enabled. So we can save passing the flags argument down to __split_huge_page(). Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. 
Shutemov <kirill@shutemov.name> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org --- mm/huge_memory.c | 25 +++++++++++++------------ 1 file changed, 13 insertions(+), 12 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 1fb4147ff854..d866b6e43434 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2423,7 +2423,7 @@ static void __split_huge_page_tail(struct page *head, int tail, } static void __split_huge_page(struct page *page, struct list_head *list, - pgoff_t end, unsigned long flags) + pgoff_t end) { struct page *head = compound_head(page); pg_data_t *pgdat = page_pgdat(head); @@ -2432,8 +2432,6 @@ static void __split_huge_page(struct page *page, struct list_head *list, unsigned long offset = 0; int i; - lruvec = mem_cgroup_page_lruvec(head, pgdat); - /* complete memcg works before add pages to LRU */ mem_cgroup_split_huge_fixup(head); @@ -2445,6 +2443,11 @@ static void __split_huge_page(struct page *page, struct list_head *list, xa_lock(&swap_cache->i_pages); } + /* prevent PageLRU to go away from under us, and freeze lru stats */ + spin_lock(&pgdat->lru_lock); + + lruvec = mem_cgroup_page_lruvec(head, pgdat); + for (i = HPAGE_PMD_NR - 1; i >= 1; i--) { __split_huge_page_tail(head, i, lruvec, list); /* Some pages can be beyond i_size: drop them from page cache */ @@ -2464,6 +2467,8 @@ static void __split_huge_page(struct page *page, struct list_head *list, } ClearPageCompound(head); + spin_unlock(&pgdat->lru_lock); + /* Caller disabled irqs, so they are still disabled here */ split_page_owner(head, HPAGE_PMD_ORDER); @@ -2481,8 +2486,7 @@ static void __split_huge_page(struct page *page, struct list_head *list, page_ref_add(head, 2); xa_unlock(&head->mapping->i_pages); } - - spin_unlock_irqrestore(&pgdat->lru_lock, flags); + local_irq_enable(); remap_page(head); @@ 
-2621,12 +2625,10 @@ bool can_split_huge_page(struct page *page, int *pextra_pins) int split_huge_page_to_list(struct page *page, struct list_head *list) { struct page *head = compound_head(page); - struct pglist_data *pgdata = NODE_DATA(page_to_nid(head)); struct deferred_split *ds_queue = get_deferred_split_queue(head); struct anon_vma *anon_vma = NULL; struct address_space *mapping = NULL; int count, mapcount, extra_pins, ret; - unsigned long flags; pgoff_t end; VM_BUG_ON_PAGE(is_huge_zero_page(head), head); @@ -2687,9 +2689,8 @@ int split_huge_page_to_list(struct page *page, struct list_head *list) unmap_page(head); VM_BUG_ON_PAGE(compound_mapcount(head), head); - /* prevent PageLRU to go away from under us, and freeze lru stats */ - spin_lock_irqsave(&pgdata->lru_lock, flags); - + /* block interrupt reentry in xa_lock and spinlock */ + local_irq_disable(); if (mapping) { XA_STATE(xas, &mapping->i_pages, page_index(head)); @@ -2719,7 +2720,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list) __dec_node_page_state(head, NR_FILE_THPS); } - __split_huge_page(page, list, end, flags); + __split_huge_page(page, list, end); if (PageSwapCache(head)) { swp_entry_t entry = { .val = page_private(head) }; @@ -2738,7 +2739,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list) spin_unlock(&ds_queue->split_queue_lock); fail: if (mapping) xa_unlock(&mapping->i_pages); - spin_unlock_irqrestore(&pgdata->lru_lock, flags); + local_irq_enable(); remap_page(head); ret = -EBUSY; } -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 101+ messages in thread
* [PATCH v17 10/21] mm/swap: fold vm event PGROTATED into pagevec_move_tail_fn 2020-07-25 12:59 [PATCH v17 00/21] per memcg lru lock Alex Shi ` (6 preceding siblings ...) 2020-07-25 12:59 ` [PATCH v17 08/21] mm/thp: narrow lru locking Alex Shi @ 2020-07-25 12:59 ` Alex Shi 2020-07-25 12:59 ` [PATCH v17 11/21] mm/lru: move lru_lock holding in func lru_note_cost_page Alex Shi ` (6 subsequent siblings) 14 siblings, 0 replies; 101+ messages in thread From: Alex Shi @ 2020-07-25 12:59 UTC (permalink / raw) To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi, willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck, rong.a.chen Fold the PGROTATED event collection into the pagevec_move_tail_fn callback, like the other callbacks used with pagevec_lru_move_fn. Now all users of pagevec_lru_move_fn are the same, and there is no need for the 3rd parameter; this simplifies the calls. [lkp@intel.com: found a build issue in the original patch, thanks] Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org --- mm/swap.c | 66 +++++++++++++++++++++++---------------------------------- 1 file changed, 24 insertions(+), 42 deletions(-) diff --git a/mm/swap.c b/mm/swap.c index 7701d855873d..dc8b02cdddcb 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -204,8 +204,7 @@ int get_kernel_page(unsigned long start, int write, struct page **pages) EXPORT_SYMBOL_GPL(get_kernel_page); static void pagevec_lru_move_fn(struct pagevec *pvec, - void (*move_fn)(struct page *page, struct lruvec *lruvec, void *arg), - void *arg) + void (*move_fn)(struct page *page, struct lruvec *lruvec)) { int i; struct pglist_data *pgdat = NULL; @@ -224,7 +223,7 @@ static void pagevec_lru_move_fn(struct pagevec *pvec, } lruvec = mem_cgroup_page_lruvec(page, pgdat); - (*move_fn)(page, lruvec, arg); + (*move_fn)(page, lruvec); } if (pgdat) 
spin_unlock_irqrestore(&pgdat->lru_lock, flags); @@ -232,35 +231,23 @@ static void pagevec_lru_move_fn(struct pagevec *pvec, pagevec_reinit(pvec); } -static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec, - void *arg) +static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec) { - int *pgmoved = arg; - if (PageLRU(page) && !PageUnevictable(page)) { del_page_from_lru_list(page, lruvec, page_lru(page)); ClearPageActive(page); add_page_to_lru_list_tail(page, lruvec, page_lru(page)); - (*pgmoved) += hpage_nr_pages(page); + __count_vm_events(PGROTATED, hpage_nr_pages(page)); } } /* - * pagevec_move_tail() must be called with IRQ disabled. - * Otherwise this may cause nasty races. - */ -static void pagevec_move_tail(struct pagevec *pvec) -{ - int pgmoved = 0; - - pagevec_lru_move_fn(pvec, pagevec_move_tail_fn, &pgmoved); - __count_vm_events(PGROTATED, pgmoved); -} - -/* * Writeback is about to end against a page which has been marked for immediate * reclaim. If it still appears to be reclaimable, move it to the tail of the * inactive list. + * + * pagevec_move_tail_fn() must be called with IRQ disabled. + * Otherwise this may cause nasty races. 
*/ void rotate_reclaimable_page(struct page *page) { @@ -273,7 +260,7 @@ void rotate_reclaimable_page(struct page *page) local_lock_irqsave(&lru_rotate.lock, flags); pvec = this_cpu_ptr(&lru_rotate.pvec); if (!pagevec_add(pvec, page) || PageCompound(page)) - pagevec_move_tail(pvec); + pagevec_lru_move_fn(pvec, pagevec_move_tail_fn); local_unlock_irqrestore(&lru_rotate.lock, flags); } } @@ -315,8 +302,7 @@ void lru_note_cost_page(struct page *page) page_is_file_lru(page), hpage_nr_pages(page)); } -static void __activate_page(struct page *page, struct lruvec *lruvec, - void *arg) +static void __activate_page(struct page *page, struct lruvec *lruvec) { if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) { int lru = page_lru_base_type(page); @@ -340,7 +326,7 @@ static void activate_page_drain(int cpu) struct pagevec *pvec = &per_cpu(lru_pvecs.activate_page, cpu); if (pagevec_count(pvec)) - pagevec_lru_move_fn(pvec, __activate_page, NULL); + pagevec_lru_move_fn(pvec, __activate_page); } static bool need_activate_page_drain(int cpu) @@ -358,7 +344,7 @@ void activate_page(struct page *page) pvec = this_cpu_ptr(&lru_pvecs.activate_page); get_page(page); if (!pagevec_add(pvec, page) || PageCompound(page)) - pagevec_lru_move_fn(pvec, __activate_page, NULL); + pagevec_lru_move_fn(pvec, __activate_page); local_unlock(&lru_pvecs.lock); } } @@ -374,7 +360,7 @@ void activate_page(struct page *page) page = compound_head(page); spin_lock_irq(&pgdat->lru_lock); - __activate_page(page, mem_cgroup_page_lruvec(page, pgdat), NULL); + __activate_page(page, mem_cgroup_page_lruvec(page, pgdat)); spin_unlock_irq(&pgdat->lru_lock); } #endif @@ -526,8 +512,7 @@ void lru_cache_add_active_or_unevictable(struct page *page, * be write it out by flusher threads as this is much more effective * than the single-page writeout from reclaim. 
*/ -static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec, - void *arg) +static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec) { int lru; bool active; @@ -574,8 +559,7 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec, } } -static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec, - void *arg) +static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec) { if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) { int lru = page_lru_base_type(page); @@ -592,8 +576,7 @@ static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec, } } -static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec, - void *arg) +static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec) { if (PageLRU(page) && PageAnon(page) && PageSwapBacked(page) && !PageSwapCache(page) && !PageUnevictable(page)) { @@ -636,21 +619,21 @@ void lru_add_drain_cpu(int cpu) /* No harm done if a racing interrupt already did this */ local_lock_irqsave(&lru_rotate.lock, flags); - pagevec_move_tail(pvec); + pagevec_lru_move_fn(pvec, pagevec_move_tail_fn); local_unlock_irqrestore(&lru_rotate.lock, flags); } pvec = &per_cpu(lru_pvecs.lru_deactivate_file, cpu); if (pagevec_count(pvec)) - pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL); + pagevec_lru_move_fn(pvec, lru_deactivate_file_fn); pvec = &per_cpu(lru_pvecs.lru_deactivate, cpu); if (pagevec_count(pvec)) - pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL); + pagevec_lru_move_fn(pvec, lru_deactivate_fn); pvec = &per_cpu(lru_pvecs.lru_lazyfree, cpu); if (pagevec_count(pvec)) - pagevec_lru_move_fn(pvec, lru_lazyfree_fn, NULL); + pagevec_lru_move_fn(pvec, lru_lazyfree_fn); activate_page_drain(cpu); } @@ -679,7 +662,7 @@ void deactivate_file_page(struct page *page) pvec = this_cpu_ptr(&lru_pvecs.lru_deactivate_file); if (!pagevec_add(pvec, page) || PageCompound(page)) - pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, 
NULL); + pagevec_lru_move_fn(pvec, lru_deactivate_file_fn); local_unlock(&lru_pvecs.lock); } } @@ -701,7 +684,7 @@ void deactivate_page(struct page *page) pvec = this_cpu_ptr(&lru_pvecs.lru_deactivate); get_page(page); if (!pagevec_add(pvec, page) || PageCompound(page)) - pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL); + pagevec_lru_move_fn(pvec, lru_deactivate_fn); local_unlock(&lru_pvecs.lock); } } @@ -723,7 +706,7 @@ void mark_page_lazyfree(struct page *page) pvec = this_cpu_ptr(&lru_pvecs.lru_lazyfree); get_page(page); if (!pagevec_add(pvec, page) || PageCompound(page)) - pagevec_lru_move_fn(pvec, lru_lazyfree_fn, NULL); + pagevec_lru_move_fn(pvec, lru_lazyfree_fn); local_unlock(&lru_pvecs.lock); } } @@ -933,8 +916,7 @@ void __pagevec_release(struct pagevec *pvec) } EXPORT_SYMBOL(__pagevec_release); -static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec, - void *arg) +static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec) { enum lru_list lru; int was_unevictable = TestClearPageUnevictable(page); @@ -993,7 +975,7 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec, */ void __pagevec_lru_add(struct pagevec *pvec) { - pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn, NULL); + pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn); } /** -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 101+ messages in thread
* [PATCH v17 11/21] mm/lru: move lru_lock holding in func lru_note_cost_page 2020-07-25 12:59 [PATCH v17 00/21] per memcg lru lock Alex Shi ` (7 preceding siblings ...) 2020-07-25 12:59 ` [PATCH v17 10/21] mm/swap: fold vm event PGROTATED into pagevec_move_tail_fn Alex Shi @ 2020-07-25 12:59 ` Alex Shi [not found] ` <1595681998-19193-12-git-send-email-alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> 2020-07-25 12:59 ` [PATCH v17 14/21] mm/compaction: do page isolation first in compaction Alex Shi ` (5 subsequent siblings) 14 siblings, 1 reply; 101+ messages in thread From: Alex Shi @ 2020-07-25 12:59 UTC (permalink / raw) To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi, willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck, rong.a.chen This is a cleanup patch with no functional change. Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org --- mm/memory.c | 3 --- mm/swap.c | 2 ++ mm/swap_state.c | 2 -- mm/workingset.c | 2 -- 4 files changed, 2 insertions(+), 7 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 87ec87cdc1ff..dafc5585517e 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3150,10 +3150,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) * XXX: Move to lru_cache_add() when it * supports new vs putback */ - spin_lock_irq(&page_pgdat(page)->lru_lock); lru_note_cost_page(page); - spin_unlock_irq(&page_pgdat(page)->lru_lock); - lru_cache_add(page); swap_readpage(page, true); } diff --git a/mm/swap.c b/mm/swap.c index dc8b02cdddcb..b88ca630db70 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -298,8 +298,10 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages) void lru_note_cost_page(struct page *page) { + spin_lock_irq(&page_pgdat(page)->lru_lock); lru_note_cost(mem_cgroup_page_lruvec(page, page_pgdat(page)), 
page_is_file_lru(page), hpage_nr_pages(page)); + spin_unlock_irq(&page_pgdat(page)->lru_lock); } static void __activate_page(struct page *page, struct lruvec *lruvec) diff --git a/mm/swap_state.c b/mm/swap_state.c index 05889e8e3c97..080be52db6a8 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -440,9 +440,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, } /* XXX: Move to lru_cache_add() when it supports new vs putback */ - spin_lock_irq(&page_pgdat(page)->lru_lock); lru_note_cost_page(page); - spin_unlock_irq(&page_pgdat(page)->lru_lock); /* Caller will initiate read into locked page */ SetPageWorkingset(page); diff --git a/mm/workingset.c b/mm/workingset.c index 50b7937bab32..337d5b9ad132 100644 --- a/mm/workingset.c +++ b/mm/workingset.c @@ -372,9 +372,7 @@ void workingset_refault(struct page *page, void *shadow) if (workingset) { SetPageWorkingset(page); /* XXX: Move to lru_cache_add() when it supports new vs putback */ - spin_lock_irq(&page_pgdat(page)->lru_lock); lru_note_cost_page(page); - spin_unlock_irq(&page_pgdat(page)->lru_lock); inc_lruvec_state(lruvec, WORKINGSET_RESTORE); } out: -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 101+ messages in thread
[parent not found: <1595681998-19193-12-git-send-email-alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org>]
* Re: [PATCH v17 11/21] mm/lru: move lru_lock holding in func lru_note_cost_page [not found] ` <1595681998-19193-12-git-send-email-alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> @ 2020-08-05 21:18 ` Alexander Duyck 0 siblings, 0 replies; 101+ messages in thread From: Alexander Duyck @ 2020-08-05 21:18 UTC (permalink / raw) To: Alex Shi Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox, Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups-u79uwXL29TY76Z2rM5mHXA, Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov, Rong Chen On Sat, Jul 25, 2020 at 6:00 AM Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> wrote: > > It's a clean up patch w/o function changes. > > Signed-off-by: Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> > Cc: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org> > Cc: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> > Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org > Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Reviewed-by: Alexander Duyck <alexander.h.duyck-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> > --- > mm/memory.c | 3 --- > mm/swap.c | 2 ++ > mm/swap_state.c | 2 -- > mm/workingset.c | 2 -- > 4 files changed, 2 insertions(+), 7 deletions(-) > > diff --git a/mm/memory.c b/mm/memory.c > index 87ec87cdc1ff..dafc5585517e 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -3150,10 +3150,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > * XXX: Move to lru_cache_add() when it > * supports new vs putback > */ > - spin_lock_irq(&page_pgdat(page)->lru_lock); > lru_note_cost_page(page); > - spin_unlock_irq(&page_pgdat(page)->lru_lock); > - > lru_cache_add(page); > swap_readpage(page, true); > } > diff --git a/mm/swap.c b/mm/swap.c > index dc8b02cdddcb..b88ca630db70 100644 > --- a/mm/swap.c > +++ b/mm/swap.c > @@ -298,8 +298,10 @@ void lru_note_cost(struct lruvec 
*lruvec, bool file, unsigned int nr_pages) > > void lru_note_cost_page(struct page *page) > { > + spin_lock_irq(&page_pgdat(page)->lru_lock); > lru_note_cost(mem_cgroup_page_lruvec(page, page_pgdat(page)), > page_is_file_lru(page), hpage_nr_pages(page)); > + spin_unlock_irq(&page_pgdat(page)->lru_lock); > } > > static void __activate_page(struct page *page, struct lruvec *lruvec) > diff --git a/mm/swap_state.c b/mm/swap_state.c > index 05889e8e3c97..080be52db6a8 100644 > --- a/mm/swap_state.c > +++ b/mm/swap_state.c > @@ -440,9 +440,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, > } > > /* XXX: Move to lru_cache_add() when it supports new vs putback */ > - spin_lock_irq(&page_pgdat(page)->lru_lock); > lru_note_cost_page(page); > - spin_unlock_irq(&page_pgdat(page)->lru_lock); > > /* Caller will initiate read into locked page */ > SetPageWorkingset(page); > diff --git a/mm/workingset.c b/mm/workingset.c > index 50b7937bab32..337d5b9ad132 100644 > --- a/mm/workingset.c > +++ b/mm/workingset.c > @@ -372,9 +372,7 @@ void workingset_refault(struct page *page, void *shadow) > if (workingset) { > SetPageWorkingset(page); > /* XXX: Move to lru_cache_add() when it supports new vs putback */ > - spin_lock_irq(&page_pgdat(page)->lru_lock); > lru_note_cost_page(page); > - spin_unlock_irq(&page_pgdat(page)->lru_lock); > inc_lruvec_state(lruvec, WORKINGSET_RESTORE); > } > out: > -- > 1.8.3.1 > ^ permalink raw reply [flat|nested] 101+ messages in thread
* [PATCH v17 14/21] mm/compaction: do page isolation first in compaction 2020-07-25 12:59 [PATCH v17 00/21] per memcg lru lock Alex Shi ` (8 preceding siblings ...) 2020-07-25 12:59 ` [PATCH v17 11/21] mm/lru: move lru_lock holding in func lru_note_cost_page Alex Shi @ 2020-07-25 12:59 ` Alex Shi [not found] ` <1595681998-19193-15-git-send-email-alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> 2020-07-25 12:59 ` [PATCH v17 16/21] mm/swap: serialize memcg changes in pagevec_lru_move_fn Alex Shi ` (4 subsequent siblings) 14 siblings, 1 reply; 101+ messages in thread From: Alex Shi @ 2020-07-25 12:59 UTC (permalink / raw) To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi, willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck, rong.a.chen Currently, compaction takes the lru_lock and then does page isolation, which works fine with pgdat->lru_lock, since any page isolation would compete for that lru_lock. If we want to change to a memcg lru_lock, we have to isolate the page before taking the lru_lock; isolation then blocks the page's memcg change, which relies on page isolation too. Then we can safely use the per-memcg lru_lock later. The new page isolation uses the previously introduced TestClearPageLRU() + pgdat lru locking, which will be changed to the memcg lru lock later. Hugh Dickins <hughd@google.com> fixed the following bugs in this patch's early version: Fix lots of crashes under compaction load: isolate_migratepages_block() must clean up appropriately when rejecting a page, setting PageLRU again if it had been cleared; and a put_page() after get_page_unless_zero() cannot safely be done while holding locked_lruvec - it may turn out to be the final put_page(), which will take an lruvec lock when PageLRU. 
And move __isolate_lru_page_prepare back after get_page_unless_zero to make trylock_page() safe: trylock_page() is not safe to use at this time: its setting PG_locked can race with the page being freed or allocated ("Bad page"), and can also erase flags being set by one of those "sole owners" of a freshly allocated page who use non-atomic __SetPageFlag(). Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org --- include/linux/swap.h | 2 +- mm/compaction.c | 42 +++++++++++++++++++++++++++++++++--------- mm/vmscan.c | 46 ++++++++++++++++++++++++++-------------------- 3 files changed, 60 insertions(+), 30 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 2c29399b29a0..6d23d3beeff7 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -358,7 +358,7 @@ extern void lru_cache_add_active_or_unevictable(struct page *page, extern unsigned long zone_reclaimable_pages(struct zone *zone); extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order, gfp_t gfp_mask, nodemask_t *mask); -extern int __isolate_lru_page(struct page *page, isolate_mode_t mode); +extern int __isolate_lru_page_prepare(struct page *page, isolate_mode_t mode); extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, unsigned long nr_pages, gfp_t gfp_mask, diff --git a/mm/compaction.c b/mm/compaction.c index f14780fc296a..2da2933fe56b 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -869,6 +869,7 @@ static bool too_many_isolated(pg_data_t *pgdat) if (!valid_page && IS_ALIGNED(low_pfn, pageblock_nr_pages)) { if (!cc->ignore_skip_hint && get_pageblock_skip(page)) { low_pfn = end_pfn; + page = NULL; goto isolate_abort; } valid_page = page; @@ -950,6 +951,21 @@ static bool too_many_isolated(pg_data_t 
*pgdat) if (!(cc->gfp_mask & __GFP_FS) && page_mapping(page)) goto isolate_fail; + /* + * Be careful not to clear PageLRU until after we're + * sure the page is not being freed elsewhere -- the + * page release code relies on it. + */ + if (unlikely(!get_page_unless_zero(page))) + goto isolate_fail; + + if (__isolate_lru_page_prepare(page, isolate_mode) != 0) + goto isolate_fail_put; + + /* Try isolate the page */ + if (!TestClearPageLRU(page)) + goto isolate_fail_put; + /* If we already hold the lock, we can skip some rechecking */ if (!locked) { locked = compact_lock_irqsave(&pgdat->lru_lock, @@ -962,10 +978,6 @@ static bool too_many_isolated(pg_data_t *pgdat) goto isolate_abort; } - /* Recheck PageLRU and PageCompound under lock */ - if (!PageLRU(page)) - goto isolate_fail; - /* * Page become compound since the non-locked check, * and it's on LRU. It can only be a THP so the order @@ -973,16 +985,13 @@ static bool too_many_isolated(pg_data_t *pgdat) */ if (unlikely(PageCompound(page) && !cc->alloc_contig)) { low_pfn += compound_nr(page) - 1; - goto isolate_fail; + SetPageLRU(page); + goto isolate_fail_put; } } lruvec = mem_cgroup_page_lruvec(page, pgdat); - /* Try isolate the page */ - if (__isolate_lru_page(page, isolate_mode) != 0) - goto isolate_fail; - /* The whole page is taken off the LRU; skip the tail pages. 
*/ if (PageCompound(page)) low_pfn += compound_nr(page) - 1; @@ -1011,6 +1020,15 @@ static bool too_many_isolated(pg_data_t *pgdat) } continue; + +isolate_fail_put: + /* Avoid potential deadlock in freeing page under lru_lock */ + if (locked) { + spin_unlock_irqrestore(&pgdat->lru_lock, flags); + locked = false; + } + put_page(page); + isolate_fail: if (!skip_on_failure) continue; @@ -1047,9 +1065,15 @@ static bool too_many_isolated(pg_data_t *pgdat) if (unlikely(low_pfn > end_pfn)) low_pfn = end_pfn; + page = NULL; + isolate_abort: if (locked) spin_unlock_irqrestore(&pgdat->lru_lock, flags); + if (page) { + SetPageLRU(page); + put_page(page); + } /* * Updated the cached scanner pfn once the pageblock has been scanned diff --git a/mm/vmscan.c b/mm/vmscan.c index 4183ae6b54b5..f77748adc340 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1544,20 +1544,20 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone, * * returns 0 on success, -ve errno on failure. */ -int __isolate_lru_page(struct page *page, isolate_mode_t mode) +int __isolate_lru_page_prepare(struct page *page, isolate_mode_t mode) { int ret = -EINVAL; - /* Only take pages on the LRU. */ - if (!PageLRU(page)) - return ret; - /* Compaction should not handle unevictable pages but CMA can do so */ if (PageUnevictable(page) && !(mode & ISOLATE_UNEVICTABLE)) return ret; ret = -EBUSY; + /* Only take pages on the LRU. */ + if (!PageLRU(page)) + return ret; + /* * To minimise LRU disruption, the caller can indicate that it only * wants to isolate pages it will be able to operate on without @@ -1598,20 +1598,9 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode) if ((mode & ISOLATE_UNMAPPED) && page_mapped(page)) return ret; - if (likely(get_page_unless_zero(page))) { - /* - * Be careful not to clear PageLRU until after we're - * sure the page is not being freed elsewhere -- the - * page release code relies on it. 
- */ - ClearPageLRU(page); - ret = 0; - } - - return ret; + return 0; } - /* * Update LRU sizes after isolating pages. The LRU size updates must * be complete before mem_cgroup_update_lru_size due to a sanity check. @@ -1691,17 +1680,34 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan, * only when the page is being freed somewhere else. */ scan += nr_pages; - switch (__isolate_lru_page(page, mode)) { + switch (__isolate_lru_page_prepare(page, mode)) { case 0: + /* + * Be careful not to clear PageLRU until after we're + * sure the page is not being freed elsewhere -- the + * page release code relies on it. + */ + if (unlikely(!get_page_unless_zero(page))) + goto busy; + + if (!TestClearPageLRU(page)) { + /* + * This page may in other isolation path, + * but we still hold lru_lock. + */ + put_page(page); + goto busy; + } + nr_taken += nr_pages; nr_zone_taken[page_zonenum(page)] += nr_pages; list_move(&page->lru, dst); break; - +busy: case -EBUSY: /* else it is being freed elsewhere */ list_move(&page->lru, src); - continue; + break; default: BUG(); -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 101+ messages in thread
[parent not found: <1595681998-19193-15-git-send-email-alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org>]
* Re: [PATCH v17 14/21] mm/compaction: do page isolation first in compaction [not found] ` <1595681998-19193-15-git-send-email-alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> @ 2020-08-04 21:35 ` Alexander Duyck 2020-08-06 18:38 ` Alexander Duyck 2020-08-17 22:58 ` [PATCH v17 14/21] mm/compaction: do page isolation first in compaction Alexander Duyck 2 siblings, 0 replies; 101+ messages in thread From: Alexander Duyck @ 2020-08-04 21:35 UTC (permalink / raw) To: Alex Shi Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox, Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups-u79uwXL29TY76Z2rM5mHXA, Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov, Rong Chen On Sat, Jul 25, 2020 at 6:00 AM Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> wrote: > > Currently, compaction would get the lru_lock and then do page isolation > which works fine with pgdat->lru_lock, since any page isoltion would > compete for the lru_lock. If we want to change to memcg lru_lock, we > have to isolate the page before getting lru_lock, thus isoltion would > block page's memcg change which relay on page isoltion too. Then we > could safely use per memcg lru_lock later. > > The new page isolation use previous introduced TestClearPageLRU() + > pgdat lru locking which will be changed to memcg lru lock later. > > Hugh Dickins <hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> fixed following bugs in this patch's > early version: > > Fix lots of crashes under compaction load: isolate_migratepages_block() > must clean up appropriately when rejecting a page, setting PageLRU again > if it had been cleared; and a put_page() after get_page_unless_zero() > cannot safely be done while holding locked_lruvec - it may turn out to > be the final put_page(), which will take an lruvec lock when PageLRU. 
> And move __isolate_lru_page_prepare back after get_page_unless_zero to > make trylock_page() safe: > trylock_page() is not safe to use at this time: its setting PG_locked > can race with the page being freed or allocated ("Bad page"), and can > also erase flags being set by one of those "sole owners" of a freshly > allocated page who use non-atomic __SetPageFlag(). > > Suggested-by: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org> > Signed-off-by: Hugh Dickins <hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> > Signed-off-by: Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> > Cc: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> > Cc: Matthew Wilcox <willy-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> > Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org > --- > include/linux/swap.h | 2 +- > mm/compaction.c | 42 +++++++++++++++++++++++++++++++++--------- > mm/vmscan.c | 46 ++++++++++++++++++++++++++-------------------- > 3 files changed, 60 insertions(+), 30 deletions(-) > > diff --git a/include/linux/swap.h b/include/linux/swap.h > index 2c29399b29a0..6d23d3beeff7 100644 > --- a/include/linux/swap.h > +++ b/include/linux/swap.h > @@ -358,7 +358,7 @@ extern void lru_cache_add_active_or_unevictable(struct page *page, > extern unsigned long zone_reclaimable_pages(struct zone *zone); > extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order, > gfp_t gfp_mask, nodemask_t *mask); > -extern int __isolate_lru_page(struct page *page, isolate_mode_t mode); > +extern int __isolate_lru_page_prepare(struct page *page, isolate_mode_t mode); > extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, > unsigned long nr_pages, > gfp_t gfp_mask, > diff --git a/mm/compaction.c b/mm/compaction.c > index f14780fc296a..2da2933fe56b 100644 > --- a/mm/compaction.c > +++ b/mm/compaction.c > @@ -869,6 +869,7 @@ static bool 
too_many_isolated(pg_data_t *pgdat) > if (!valid_page && IS_ALIGNED(low_pfn, pageblock_nr_pages)) { > if (!cc->ignore_skip_hint && get_pageblock_skip(page)) { > low_pfn = end_pfn; > + page = NULL; > goto isolate_abort; > } > valid_page = page; > @@ -950,6 +951,21 @@ static bool too_many_isolated(pg_data_t *pgdat) > if (!(cc->gfp_mask & __GFP_FS) && page_mapping(page)) > goto isolate_fail; > > + /* > + * Be careful not to clear PageLRU until after we're > + * sure the page is not being freed elsewhere -- the > + * page release code relies on it. > + */ > + if (unlikely(!get_page_unless_zero(page))) > + goto isolate_fail; > + > + if (__isolate_lru_page_prepare(page, isolate_mode) != 0) > + goto isolate_fail_put; > + > + /* Try isolate the page */ > + if (!TestClearPageLRU(page)) > + goto isolate_fail_put; > + > /* If we already hold the lock, we can skip some rechecking */ > if (!locked) { > locked = compact_lock_irqsave(&pgdat->lru_lock, So this flow doesn't match what we have below in isolate_lru_pages(). I went digging through the history and realized I brought this up before and you referenced the following patch from Hugh: https://lore.kernel.org/lkml/alpine.LSU.2.11.2006111529010.10801-fupSdm12i1nKWymIFiNcPA@public.gmane.org/ As such I am assuming this flow is needed because we aren't holding an LRU lock, and the flow in mm/vmscan.c works because that is being called while holding an LRU lock. I am wondering if we are overcomplicating things by keeping the LRU check in __isolate_lru_page_prepare(). If we were to pull it out then you could just perform the get_page_unless_zero and TestClearPageLRU check before you call the function, and you could consolidate the code so that it could be combined into a single function as below. 
So for example you could combine them into:

static inline bool get_lru_page_unless_zero(struct page *page)
{
	/*
	 * Be careful not to clear PageLRU until after we're
	 * sure the page is not being freed elsewhere -- the
	 * page release code relies on it.
	 */
	if (unlikely(!get_page_unless_zero(page)))
		return false;

	if (TestClearPageLRU(page))
		return true;

	put_page(page);
	return false;
}

Then the logic becomes that you have to either call get_lru_page_unless_zero before calling __isolate_lru_page_prepare, or you have to be holding the LRU lock. > @@ -962,10 +978,6 @@ static bool too_many_isolated(pg_data_t *pgdat) > goto isolate_abort; > } > > - /* Recheck PageLRU and PageCompound under lock */ > - if (!PageLRU(page)) > - goto isolate_fail; > - > /* > * Page become compound since the non-locked check, > * and it's on LRU. It can only be a THP so the order > @@ -973,16 +985,13 @@ static bool too_many_isolated(pg_data_t *pgdat) > */ > if (unlikely(PageCompound(page) && !cc->alloc_contig)) { > low_pfn += compound_nr(page) - 1; > - goto isolate_fail; > + SetPageLRU(page); > + goto isolate_fail_put; > } > } > > lruvec = mem_cgroup_page_lruvec(page, pgdat); > > - /* Try isolate the page */ > - if (__isolate_lru_page(page, isolate_mode) != 0) > - goto isolate_fail; > - > /* The whole page is taken off the LRU; skip the tail pages. 
*/ > if (PageCompound(page)) > low_pfn += compound_nr(page) - 1; > @@ -1011,6 +1020,15 @@ static bool too_many_isolated(pg_data_t *pgdat) > } > > continue; > + > +isolate_fail_put: > + /* Avoid potential deadlock in freeing page under lru_lock */ > + if (locked) { > + spin_unlock_irqrestore(&pgdat->lru_lock, flags); > + locked = false; > + } > + put_page(page); > + > isolate_fail: > if (!skip_on_failure) > continue; > @@ -1047,9 +1065,15 @@ static bool too_many_isolated(pg_data_t *pgdat) > if (unlikely(low_pfn > end_pfn)) > low_pfn = end_pfn; > > + page = NULL; > + > isolate_abort: > if (locked) > spin_unlock_irqrestore(&pgdat->lru_lock, flags); > + if (page) { > + SetPageLRU(page); > + put_page(page); > + } > > /* > * Updated the cached scanner pfn once the pageblock has been scanned > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 4183ae6b54b5..f77748adc340 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -1544,20 +1544,20 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone, > * > * returns 0 on success, -ve errno on failure. > */ > -int __isolate_lru_page(struct page *page, isolate_mode_t mode) > +int __isolate_lru_page_prepare(struct page *page, isolate_mode_t mode) > { > int ret = -EINVAL; > > - /* Only take pages on the LRU. */ > - if (!PageLRU(page)) > - return ret; > - > /* Compaction should not handle unevictable pages but CMA can do so */ > if (PageUnevictable(page) && !(mode & ISOLATE_UNEVICTABLE)) > return ret; > > ret = -EBUSY; > > + /* Only take pages on the LRU. */ > + if (!PageLRU(page)) > + return ret; > + > /* > * To minimise LRU disruption, the caller can indicate that it only > * wants to isolate pages it will be able to operate on without So the question I would have is if we really need to be checking PageLRU here? I wonder if this isn't another spot where we would be better served by just assuming that PageLRU has been checked while holding the lock, or tested and cleared while holding a page reference. 
The original patch from Hugh referenced above mentions a desire to do away with __isolate_lru_page_prepare entirely, so I wonder if it wouldn't be good to be proactive and pull out the bits we think we might need versus the ones we don't. > @@ -1598,20 +1598,9 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode) > if ((mode & ISOLATE_UNMAPPED) && page_mapped(page)) > return ret; > > - if (likely(get_page_unless_zero(page))) { > - /* > - * Be careful not to clear PageLRU until after we're > - * sure the page is not being freed elsewhere -- the > - * page release code relies on it. > - */ > - ClearPageLRU(page); > - ret = 0; > - } > - > - return ret; > + return 0; > } > > - /* > * Update LRU sizes after isolating pages. The LRU size updates must > * be complete before mem_cgroup_update_lru_size due to a sanity check. > @@ -1691,17 +1680,34 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan, > * only when the page is being freed somewhere else. > */ > scan += nr_pages; > - switch (__isolate_lru_page(page, mode)) { > + switch (__isolate_lru_page_prepare(page, mode)) { So after looking through the code I realized that "mode" here will always be either 0 or ISOLATE_UNMAPPED. I assume this is why we aren't worried about the trylock_page call messing things up. With that said it looks like the function just breaks down to three tests, first for PageUnevictable(), then PageLRU(), and then possibly page_mapped(). As such I believe dropping the PageLRU check from the function as I suggested above should be safe, since the test is at risk of racing anyway (the bit could be cleared out from under us), and the bit isn't really protecting anything since we are holding the LRU lock anyway. > case 0: > + /* > + * Be careful not to clear PageLRU until after we're > + * sure the page is not being freed elsewhere -- the > + * page release code relies on it. 
> + */ > + if (unlikely(!get_page_unless_zero(page))) > + goto busy; > + > + if (!TestClearPageLRU(page)) { > + /* > + * This page may in other isolation path, > + * but we still hold lru_lock. > + */ > + put_page(page); > + goto busy; > + } > + This piece could be consolidated via the single function I called out above. > nr_taken += nr_pages; > nr_zone_taken[page_zonenum(page)] += nr_pages; > list_move(&page->lru, dst); > break; > - > +busy: > case -EBUSY: > /* else it is being freed elsewhere */ > list_move(&page->lru, src); > - continue; > + break; > default: > BUG(); > -- > 1.8.3.1 > ^ permalink raw reply [flat|nested] 101+ messages in thread
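[The get_page_unless_zero() + TestClearPageLRU() ordering debated in this email, and the combined helper Alexander sketches, can be modeled in standalone C11 atomics. This is an illustrative sketch only, not kernel code: struct page_model and the *_model names are invented, with atomic_int standing in for the page refcount and atomic_bool for the PG_lru bit.]

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Invented model of a struct page: refcount plus the PG_lru flag. */
struct page_model {
	atomic_int refcount;
	atomic_bool lru;
};

/* Models get_page_unless_zero(): take a reference only if one is already
 * held somewhere, so we never resurrect a page that is being freed. */
static bool get_page_unless_zero_model(struct page_model *page)
{
	int ref = atomic_load(&page->refcount);

	while (ref > 0)
		if (atomic_compare_exchange_weak(&page->refcount, &ref, ref + 1))
			return true;
	return false;	/* refcount hit zero: page is being freed */
}

static void put_page_model(struct page_model *page)
{
	atomic_fetch_sub(&page->refcount, 1);
}

/* The combined helper: reference first, then atomically claim the LRU
 * flag. atomic_exchange models TestClearPageLRU(), so exactly one racing
 * caller wins; a loser backs out the reference it took. */
static bool get_lru_page_unless_zero_model(struct page_model *page)
{
	if (!get_page_unless_zero_model(page))
		return false;
	if (atomic_exchange(&page->lru, false))
		return true;	/* we own the isolation */
	put_page_model(page);
	return false;		/* another isolator beat us to it */
}
```

Because the reference is taken before the flag is cleared, the losing path's put_page_model() can never be the final put while the winner still sees the page on the LRU, which is the invariant the quoted comment ("be careful not to clear PageLRU until after we're sure the page is not being freed") insists on.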
* Re: [PATCH v17 14/21] mm/compaction: do page isolation first in compaction [not found] ` <1595681998-19193-15-git-send-email-alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> 2020-08-04 21:35 ` Alexander Duyck @ 2020-08-06 18:38 ` Alexander Duyck [not found] ` <CAKgT0UcbBv=QBK9ErqLKXoNLYxFz52L4fiiHy4h6zKdBs=YPOg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2020-08-17 22:58 ` [PATCH v17 14/21] mm/compaction: do page isolation first in compaction Alexander Duyck 2 siblings, 1 reply; 101+ messages in thread From: Alexander Duyck @ 2020-08-06 18:38 UTC (permalink / raw) To: Alex Shi Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox, Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups-u79uwXL29TY76Z2rM5mHXA, Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov, Rong Chen On Sat, Jul 25, 2020 at 6:00 AM Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> wrote: > > Currently, compaction would get the lru_lock and then do page isolation, > which works fine with pgdat->lru_lock, since any page isolation would > compete for the lru_lock. If we want to change to a memcg lru_lock, we > have to isolate the page before getting the lru_lock; isolation would then > also block the page's memcg change, which relies on page isolation too. Then we > could safely use the per-memcg lru_lock later. > > The new page isolation uses the previously introduced TestClearPageLRU() + > pgdat lru locking, which will be changed to the memcg lru lock later. 
> > Hugh Dickins <hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> fixed following bugs in this patch's > early version: > > Fix lots of crashes under compaction load: isolate_migratepages_block() > must clean up appropriately when rejecting a page, setting PageLRU again > if it had been cleared; and a put_page() after get_page_unless_zero() > cannot safely be done while holding locked_lruvec - it may turn out to > be the final put_page(), which will take an lruvec lock when PageLRU. > And move __isolate_lru_page_prepare back after get_page_unless_zero to > make trylock_page() safe: > trylock_page() is not safe to use at this time: its setting PG_locked > can race with the page being freed or allocated ("Bad page"), and can > also erase flags being set by one of those "sole owners" of a freshly > allocated page who use non-atomic __SetPageFlag(). > > Suggested-by: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org> > Signed-off-by: Hugh Dickins <hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> > Signed-off-by: Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> > Cc: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> > Cc: Matthew Wilcox <willy-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> > Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org > --- > include/linux/swap.h | 2 +- > mm/compaction.c | 42 +++++++++++++++++++++++++++++++++--------- > mm/vmscan.c | 46 ++++++++++++++++++++++++++-------------------- > 3 files changed, 60 insertions(+), 30 deletions(-) > > diff --git a/include/linux/swap.h b/include/linux/swap.h > index 2c29399b29a0..6d23d3beeff7 100644 > --- a/include/linux/swap.h > +++ b/include/linux/swap.h > @@ -358,7 +358,7 @@ extern void lru_cache_add_active_or_unevictable(struct page *page, > extern unsigned long zone_reclaimable_pages(struct zone *zone); > extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order, > gfp_t gfp_mask, 
nodemask_t *mask); > -extern int __isolate_lru_page(struct page *page, isolate_mode_t mode); > +extern int __isolate_lru_page_prepare(struct page *page, isolate_mode_t mode); > extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, > unsigned long nr_pages, > gfp_t gfp_mask, > diff --git a/mm/compaction.c b/mm/compaction.c > index f14780fc296a..2da2933fe56b 100644 > --- a/mm/compaction.c > +++ b/mm/compaction.c > @@ -869,6 +869,7 @@ static bool too_many_isolated(pg_data_t *pgdat) > if (!valid_page && IS_ALIGNED(low_pfn, pageblock_nr_pages)) { > if (!cc->ignore_skip_hint && get_pageblock_skip(page)) { > low_pfn = end_pfn; > + page = NULL; > goto isolate_abort; > } > valid_page = page; > @@ -950,6 +951,21 @@ static bool too_many_isolated(pg_data_t *pgdat) > if (!(cc->gfp_mask & __GFP_FS) && page_mapping(page)) > goto isolate_fail; > > + /* > + * Be careful not to clear PageLRU until after we're > + * sure the page is not being freed elsewhere -- the > + * page release code relies on it. > + */ > + if (unlikely(!get_page_unless_zero(page))) > + goto isolate_fail; > + > + if (__isolate_lru_page_prepare(page, isolate_mode) != 0) > + goto isolate_fail_put; > + > + /* Try isolate the page */ > + if (!TestClearPageLRU(page)) > + goto isolate_fail_put; > + > /* If we already hold the lock, we can skip some rechecking */ > if (!locked) { > locked = compact_lock_irqsave(&pgdat->lru_lock, > @@ -962,10 +978,6 @@ static bool too_many_isolated(pg_data_t *pgdat) > goto isolate_abort; > } > > - /* Recheck PageLRU and PageCompound under lock */ > - if (!PageLRU(page)) > - goto isolate_fail; > - > /* > * Page become compound since the non-locked check, > * and it's on LRU. 
It can only be a THP so the order > @@ -973,16 +985,13 @@ static bool too_many_isolated(pg_data_t *pgdat) > */ > if (unlikely(PageCompound(page) && !cc->alloc_contig)) { > low_pfn += compound_nr(page) - 1; > - goto isolate_fail; > + SetPageLRU(page); > + goto isolate_fail_put; > } > } > > lruvec = mem_cgroup_page_lruvec(page, pgdat); > > - /* Try isolate the page */ > - if (__isolate_lru_page(page, isolate_mode) != 0) > - goto isolate_fail; > - > /* The whole page is taken off the LRU; skip the tail pages. */ > if (PageCompound(page)) > low_pfn += compound_nr(page) - 1; > @@ -1011,6 +1020,15 @@ static bool too_many_isolated(pg_data_t *pgdat) > } > > continue; > + > +isolate_fail_put: > + /* Avoid potential deadlock in freeing page under lru_lock */ > + if (locked) { > + spin_unlock_irqrestore(&pgdat->lru_lock, flags); > + locked = false; > + } > + put_page(page); > + > isolate_fail: > if (!skip_on_failure) > continue; > @@ -1047,9 +1065,15 @@ static bool too_many_isolated(pg_data_t *pgdat) > if (unlikely(low_pfn > end_pfn)) > low_pfn = end_pfn; > > + page = NULL; > + > isolate_abort: > if (locked) > spin_unlock_irqrestore(&pgdat->lru_lock, flags); > + if (page) { > + SetPageLRU(page); > + put_page(page); > + } > > /* > * Updated the cached scanner pfn once the pageblock has been scanned We should probably be calling SetPageLRU before we release the lru lock instead of after. It might make sense to just call it before we get here, similar to how you did in the isolate_fail_put case a few lines later. Otherwise this seems to violate the rules you had set up earlier where we were only going to be setting the LRU bit while holding the LRU lock. ^ permalink raw reply [flat|nested] 101+ messages in thread
[parent not found: <CAKgT0UcbBv=QBK9ErqLKXoNLYxFz52L4fiiHy4h6zKdBs=YPOg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH v17 14/21] mm/compaction: do page isolation first in compaction [not found] ` <CAKgT0UcbBv=QBK9ErqLKXoNLYxFz52L4fiiHy4h6zKdBs=YPOg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2020-08-07 3:24 ` Alex Shi [not found] ` <241ca157-104f-4f0d-7d5b-de394443788d-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> 0 siblings, 1 reply; 101+ messages in thread From: Alex Shi @ 2020-08-07 3:24 UTC (permalink / raw) To: Alexander Duyck Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox, Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups-u79uwXL29TY76Z2rM5mHXA, Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov, Rong Chen On 2020/8/7 at 2:38 AM, Alexander Duyck wrote: >> + >> isolate_abort: >> if (locked) >> spin_unlock_irqrestore(&pgdat->lru_lock, flags); >> + if (page) { >> + SetPageLRU(page); >> + put_page(page); >> + } >> >> /* >> * Updated the cached scanner pfn once the pageblock has been scanned > We should probably be calling SetPageLRU before we release the lru > lock instead of before. It might make sense to just call it before we > get here, similar to how you did in the isolate_fail_put case a few > lines later. Otherwise this seems to violate the rules you had set up > earlier where we were only going to be setting the LRU bit while > holding the LRU lock. Hi Alex, Setting the bit outside the lock should be fine here; I never said we must set the bit while holding the lock. And this page was taken with get_page_unless_zero(), so there is no worry about it being released. Thanks Alex ^ permalink raw reply [flat|nested] 101+ messages in thread
[parent not found: <241ca157-104f-4f0d-7d5b-de394443788d-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org>]
* Re: [PATCH v17 14/21] mm/compaction: do page isolation first in compaction [not found] ` <241ca157-104f-4f0d-7d5b-de394443788d-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> @ 2020-08-07 14:51 ` Alexander Duyck [not found] ` <CAKgT0UdSrarC8j+G=LYRSadcaG6yNCoCfeVpFjEiHRJb4A77-g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 101+ messages in thread From: Alexander Duyck @ 2020-08-07 14:51 UTC (permalink / raw) To: Alex Shi Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox, Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups-u79uwXL29TY76Z2rM5mHXA, Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov, Rong Chen On Thu, Aug 6, 2020 at 8:25 PM Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> wrote: > > > > 在 2020/8/7 上午2:38, Alexander Duyck 写道: > >> + > >> isolate_abort: > >> if (locked) > >> spin_unlock_irqrestore(&pgdat->lru_lock, flags); > >> + if (page) { > >> + SetPageLRU(page); > >> + put_page(page); > >> + } > >> > >> /* > >> * Updated the cached scanner pfn once the pageblock has been scanned > > We should probably be calling SetPageLRU before we release the lru > > lock instead of before. It might make sense to just call it before we > > get here, similar to how you did in the isolate_fail_put case a few > > lines later. Otherwise this seems to violate the rules you had set up > > earlier where we were only going to be setting the LRU bit while > > holding the LRU lock. > > Hi Alex, > > Set out of lock here should be fine. I never said we must set the bit in locking. > And this page is get by get_page_unless_zero(), no warry on release. > > Thanks > Alex I wonder if this entire section shouldn't be restructured. This is the only spot I can see where we are resetting the LRU flag instead of pulling the page from the LRU list with the lock held. Looking over the code it seems like something like that should be possible. 
I am not sure the LRU lock is really protecting us in either the PageCompound check or the skip bits. It seems like holding a reference on the page should prevent it from switching between compound and not, and the skip bits are per pageblock while the LRU bits are per node/memcg, which I would think implies that we could have multiple LRU locks that could apply to a single skip bit. ^ permalink raw reply [flat|nested] 101+ messages in thread
[parent not found: <CAKgT0UdSrarC8j+G=LYRSadcaG6yNCoCfeVpFjEiHRJb4A77-g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH v17 14/21] mm/compaction: do page isolation first in compaction [not found] ` <CAKgT0UdSrarC8j+G=LYRSadcaG6yNCoCfeVpFjEiHRJb4A77-g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2020-08-10 13:10 ` Alex Shi [not found] ` <8dbd004e-8eba-f1ec-a5eb-5dc551978936-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> 0 siblings, 1 reply; 101+ messages in thread From: Alex Shi @ 2020-08-10 13:10 UTC (permalink / raw) To: Alexander Duyck Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox, Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups-u79uwXL29TY76Z2rM5mHXA, Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov, Rong Chen On 2020/8/7 at 10:51 PM, Alexander Duyck wrote: > I wonder if this entire section shouldn't be restructured. This is the > only spot I can see where we are resetting the LRU flag instead of > pulling the page from the LRU list with the lock held. Looking over > the code it seems like something like that should be possible. I am > not sure the LRU lock is really protecting us in either the > PageCompound check nor the skip bits. It seems like holding a > reference on the page should prevent it from switching between > compound or not, and the skip bits are per pageblock with the LRU bits > being per node/memcg which I would think implies that we could have > multiple LRU locks that could apply to a single skip bit. Hi Alexander, I haven't found a problem with the compound or skip bit usage yet. Would you clarify the issue you are concerned about? Thanks! ^ permalink raw reply [flat|nested] 101+ messages in thread
[parent not found: <8dbd004e-8eba-f1ec-a5eb-5dc551978936-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org>]
* Re: [PATCH v17 14/21] mm/compaction: do page isolation first in compaction [not found] ` <8dbd004e-8eba-f1ec-a5eb-5dc551978936-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> @ 2020-08-10 14:41 ` Alexander Duyck 2020-08-11 8:22 ` Alex Shi 0 siblings, 1 reply; 101+ messages in thread From: Alexander Duyck @ 2020-08-10 14:41 UTC (permalink / raw) To: Alex Shi Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox, Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups-u79uwXL29TY76Z2rM5mHXA, Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov, Rong Chen On Mon, Aug 10, 2020 at 6:10 AM Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> wrote: > > > > 在 2020/8/7 下午10:51, Alexander Duyck 写道: > > I wonder if this entire section shouldn't be restructured. This is the > > only spot I can see where we are resetting the LRU flag instead of > > pulling the page from the LRU list with the lock held. Looking over > > the code it seems like something like that should be possible. I am > > not sure the LRU lock is really protecting us in either the > > PageCompound check nor the skip bits. It seems like holding a > > reference on the page should prevent it from switching between > > compound or not, and the skip bits are per pageblock with the LRU bits > > being per node/memcg which I would think implies that we could have > > multiple LRU locks that could apply to a single skip bit. > > Hi Alexander, > > I don't find problem yet on compound or skip bit usage. Would you clarify the > issue do you concerned? > > Thanks! The point I was getting at is that the LRU lock is being used to protect these and with your changes I don't think that makes sense anymore. The skip bits are per-pageblock bits. With your change the LRU lock is now per memcg first and then per node. As such I do not believe it really provides any sort of exclusive access to the skip bits. 
I still have to look into this more, but it seems like you need a lock per either section or zone that can be used to protect those bits and deal with this sooner rather than waiting until you have found an LRU page. The one part that is confusing though is that the definition of the skip bits seems to call out that they are a hint since they are not protected by a lock, but that is exactly what has been happening here. The point I was getting at with the PageCompound check is that instead of needing the LRU lock you should be able to look at PageCompound as soon as you call get_page_unless_zero() and preempt the need to set the LRU bit again. Instead of trying to rely on the LRU lock to guarantee that the page hasn't been merged you could just rely on the fact that you are holding a reference to it so it isn't going to switch between being compound or order 0 since it cannot be freed. It spoils the idea I originally had of combining the logic for get_page_unless_zero and TestClearPageLRU into a single function, but the advantage is you aren't clearing the LRU flag unless you are actually going to pull the page from the LRU list. My main worry is that this is the one spot where we appear to be clearing the LRU bit without ever actually pulling the page off of the LRU list, and I am thinking we would be better served by addressing the skip and PageCompound checks earlier rather than adding code to set the bit again if either of those cases are encountered. This way we don't pseudo-pin pages in the LRU if they are compound or supposed to be skipped. Thanks. - Alex ^ permalink raw reply [flat|nested] 101+ messages in thread
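[The reordering Alexander proposes here, take the reference first, check PageCompound under that reference, and only clear the LRU flag last, can be sketched with the same standalone C11 model used above. This is an assumed illustration, not the actual patch: the cpage_model type and try_isolate_model name are invented, and the plain `compound` field stands in for PageCompound, which cannot change while a reference pins the page.]

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Invented model: refcount, PG_lru flag, and a compound marker that is
 * stable for as long as at least one reference is held. */
struct cpage_model {
	atomic_int refcount;
	atomic_bool lru;
	bool compound;
};

/* Reference first, compound check second, LRU clear last: a rejected
 * compound page never has its LRU flag cleared and then re-set, so it
 * is never pseudo-pinned on the LRU by a failed isolation attempt. */
static bool try_isolate_model(struct cpage_model *page, bool alloc_contig)
{
	int ref = atomic_load(&page->refcount);

	do {
		if (ref == 0)
			return false;	/* page is being freed */
	} while (!atomic_compare_exchange_weak(&page->refcount, &ref, ref + 1));

	if (page->compound && !alloc_contig) {
		atomic_fetch_sub(&page->refcount, 1);	/* put_page */
		return false;	/* rejected; LRU flag never touched */
	}
	if (atomic_exchange(&page->lru, false))
		return true;	/* page is ours to isolate */
	atomic_fetch_sub(&page->refcount, 1);
	return false;		/* another isolator won the race */
}
```

In this ordering the reference, not the LRU lock, is what keeps the compound state stable across the check, which is exactly the argument made in the email above.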
* Re: [PATCH v17 14/21] mm/compaction: do page isolation first in compaction 2020-08-10 14:41 ` Alexander Duyck @ 2020-08-11 8:22 ` Alex Shi [not found] ` <d9818e06-95f1-9f21-05c0-98f29ea96d89-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> 0 siblings, 1 reply; 101+ messages in thread From: Alex Shi @ 2020-08-11 8:22 UTC (permalink / raw) To: Alexander Duyck Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox, Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups, Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov, Rong Chen 在 2020/8/10 下午10:41, Alexander Duyck 写道: > On Mon, Aug 10, 2020 at 6:10 AM Alex Shi <alex.shi@linux.alibaba.com> wrote: >> >> >> >> 在 2020/8/7 下午10:51, Alexander Duyck 写道: >>> I wonder if this entire section shouldn't be restructured. This is the >>> only spot I can see where we are resetting the LRU flag instead of >>> pulling the page from the LRU list with the lock held. Looking over >>> the code it seems like something like that should be possible. I am >>> not sure the LRU lock is really protecting us in either the >>> PageCompound check nor the skip bits. It seems like holding a >>> reference on the page should prevent it from switching between >>> compound or not, and the skip bits are per pageblock with the LRU bits >>> being per node/memcg which I would think implies that we could have >>> multiple LRU locks that could apply to a single skip bit. >> >> Hi Alexander, >> >> I don't find problem yet on compound or skip bit usage. Would you clarify the >> issue do you concerned? >> >> Thanks! > > The point I was getting at is that the LRU lock is being used to > protect these and with your changes I don't think that makes sense > anymore. > > The skip bits are per-pageblock bits. With your change the LRU lock is > now per memcg first and then per node. As such I do not believe it > really provides any sort of exclusive access to the skip bits. 
I still > have to look into this more, but it seems like you need a lock per > either section or zone that can be used to protect those bits and deal > with this sooner rather than waiting until you have found an LRU page. > The one part that is confusing though is that the definition of the > skip bits seems to call out that they are a hint since they are not > protected by a lock, but that is exactly what has been happening here. > The skip bits are safe here: even if one skip action races with another, the block will still be skipped. The skip action just tries not to compact too much; it is not an exclusive action that needs to avoid races. > The point I was getting at with the PageCompound check is that instead > of needing the LRU lock you should be able to look at PageCompound as > soon as you call get_page_unless_zero() and preempt the need to set > the LRU bit again. Instead of trying to rely on the LRU lock to > guarantee that the page hasn't been merged you could just rely on the > fact that you are holding a reference to it so it isn't going to > switch between being compound or order 0 since it cannot be freed. It > spoils the idea I originally had of combining the logic for > get_page_unless_zero and TestClearPageLRU into a single function, but > the advantage is you aren't clearing the LRU flag unless you are > actually going to pull the page from the LRU list. Sorry, I still cannot follow you here. The compound-page part of the code is unchanged and follows the original logic. Would you like to post new code so we can see whether it works? Thanks Alex > > My main worry is that this is the one spot where we appear to be > clearing the LRU bit without ever actually pulling the page off of the > LRU list, and I am thinking we would be better served by addressing > the skip and PageCompound checks earlier rather than adding code to > set the bit again if either of those cases are encountered. This way > we don't pseudo-pin pages in the LRU if they are compound or supposed > to be skipped. 
> > Thanks. > > - Alex > ^ permalink raw reply [flat|nested] 101+ messages in thread
[parent not found: <d9818e06-95f1-9f21-05c0-98f29ea96d89-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org>]
* Re: [PATCH v17 14/21] mm/compaction: do page isolation first in compaction [not found] ` <d9818e06-95f1-9f21-05c0-98f29ea96d89-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> @ 2020-08-11 14:47 ` Alexander Duyck 2020-08-12 11:43 ` Alex Shi 0 siblings, 1 reply; 101+ messages in thread From: Alexander Duyck @ 2020-08-11 14:47 UTC (permalink / raw) To: Alex Shi Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox, Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups-u79uwXL29TY76Z2rM5mHXA, Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov, Rong Chen On Tue, Aug 11, 2020 at 1:23 AM Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> wrote: > > > > 在 2020/8/10 下午10:41, Alexander Duyck 写道: > > On Mon, Aug 10, 2020 at 6:10 AM Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> wrote: > >> > >> > >> > >> 在 2020/8/7 下午10:51, Alexander Duyck 写道: > >>> I wonder if this entire section shouldn't be restructured. This is the > >>> only spot I can see where we are resetting the LRU flag instead of > >>> pulling the page from the LRU list with the lock held. Looking over > >>> the code it seems like something like that should be possible. I am > >>> not sure the LRU lock is really protecting us in either the > >>> PageCompound check nor the skip bits. It seems like holding a > >>> reference on the page should prevent it from switching between > >>> compound or not, and the skip bits are per pageblock with the LRU bits > >>> being per node/memcg which I would think implies that we could have > >>> multiple LRU locks that could apply to a single skip bit. > >> > >> Hi Alexander, > >> > >> I don't find problem yet on compound or skip bit usage. Would you clarify the > >> issue do you concerned? > >> > >> Thanks! > > > > The point I was getting at is that the LRU lock is being used to > > protect these and with your changes I don't think that makes sense > > anymore. 
> > > > The skip bits are per-pageblock bits. With your change the LRU lock is > > now per memcg first and then per node. As such I do not believe it > > really provides any sort of exclusive access to the skip bits. I still > > have to look into this more, but it seems like you need a lock per > > either section or zone that can be used to protect those bits and deal > > with this sooner rather than waiting until you have found an LRU page. > > The one part that is confusing though is that the definition of the > > skip bits seems to call out that they are a hint since they are not > > protected by a lock, but that is exactly what has been happening here. > > > > The skip bits are safe here, since even it race with other skip action, > It will still skip out. The skip action is try not to compaction too much, > not a exclusive action needs avoid race. That would be the case if it didn't have the impact that they currently do on the compaction process. What I am getting at is that a race was introduced when you placed this test between the clearing of the LRU flag and the actual pulling of the page from the LRU list. So if you tested the skip bits before clearing the LRU flag then I would be okay with the code; however, because it is triggering an abort after the LRU flag is cleared, you are creating a situation where multiple processes will be stomping all over each other, as you can have each thread essentially take a page via the LRU flag, but only one thread will process a page and it could skip over all other pages that preemptively had their LRU flag cleared. If you take a look at test_and_set_skip, the function only acts on the pageblock-aligned PFN for a given range. With the changes you have in place now, that would mean that only one thread would ever actually call this function anyway, since the first PFN would take the LRU flag so no other thread could follow through and test or set the bit as well. 
The expectation before was that all threads would encounter this test and either proceed after setting the bit for the first PFN or abort after testing the first PFN. With you changes only the first thread actually runs this test and then it and the others will likely encounter multiple failures as they are all clearing LRU bits simultaneously and tripping each other up. That is why the skip bit must have a test and set done before you even get to the point of clearing the LRU flag. > > The point I was getting at with the PageCompound check is that instead > > of needing the LRU lock you should be able to look at PageCompound as > > soon as you call get_page_unless_zero() and preempt the need to set > > the LRU bit again. Instead of trying to rely on the LRU lock to > > guarantee that the page hasn't been merged you could just rely on the > > fact that you are holding a reference to it so it isn't going to > > switch between being compound or order 0 since it cannot be freed. It > > spoils the idea I originally had of combining the logic for > > get_page_unless_zero and TestClearPageLRU into a single function, but > > the advantage is you aren't clearing the LRU flag unless you are > > actually going to pull the page from the LRU list. > > Sorry, I still can not follow you here. Compound code part is unchanged > and follow the original logical. So would you like to pose a new code to > see if its works? No there are significant changes as you reordered all of the operations. Prior to your change the LRU bit was checked, but not cleared before testing for PageCompound. Now you are clearing it before you are testing if it is a compound page. So if compaction is running we will be seeing the pages in the LRU stay put, but the compound bit flickering off and on if the compound page is encountered with the wrong or NULL lruvec. What I was suggesting is that the PageCompound test probably doesn't need to be concerned with the lock after your changes. 
You could test it after you call get_page_unless_zero() and before you call __isolate_lru_page_prepare(). Instead of relying on the LRU lock to protect us from the page switching between compound and not we would be relying on the fact that we are holding a reference to the page so it should not be freed and transition between compound or not. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH v17 14/21] mm/compaction: do page isolation first in compaction 2020-08-11 14:47 ` Alexander Duyck @ 2020-08-12 11:43 ` Alex Shi [not found] ` <9581db48-cef3-788a-7f5a-8548fee56c13-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> 0 siblings, 1 reply; 101+ messages in thread From: Alex Shi @ 2020-08-12 11:43 UTC (permalink / raw) To: Alexander Duyck Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox, Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups, Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov, Rong Chen 在 2020/8/11 下午10:47, Alexander Duyck 写道: > On Tue, Aug 11, 2020 at 1:23 AM Alex Shi <alex.shi@linux.alibaba.com> wrote: >> >> >> >> 在 2020/8/10 下午10:41, Alexander Duyck 写道: >>> On Mon, Aug 10, 2020 at 6:10 AM Alex Shi <alex.shi@linux.alibaba.com> wrote: >>>> >>>> >>>> >>>> 在 2020/8/7 下午10:51, Alexander Duyck 写道: >>>>> I wonder if this entire section shouldn't be restructured. This is the >>>>> only spot I can see where we are resetting the LRU flag instead of >>>>> pulling the page from the LRU list with the lock held. Looking over >>>>> the code it seems like something like that should be possible. I am >>>>> not sure the LRU lock is really protecting us in either the >>>>> PageCompound check nor the skip bits. It seems like holding a >>>>> reference on the page should prevent it from switching between >>>>> compound or not, and the skip bits are per pageblock with the LRU bits >>>>> being per node/memcg which I would think implies that we could have >>>>> multiple LRU locks that could apply to a single skip bit. >>>> >>>> Hi Alexander, >>>> >>>> I don't find problem yet on compound or skip bit usage. Would you clarify the >>>> issue do you concerned? >>>> >>>> Thanks! >>> >>> The point I was getting at is that the LRU lock is being used to >>> protect these and with your changes I don't think that makes sense >>> anymore. 
>>> >>> The skip bits are per-pageblock bits. With your change the LRU lock is >>> now per memcg first and then per node. As such I do not believe it >>> really provides any sort of exclusive access to the skip bits. I still >>> have to look into this more, but it seems like you need a lock per >>> either section or zone that can be used to protect those bits and deal >>> with this sooner rather than waiting until you have found an LRU page. >>> The one part that is confusing though is that the definition of the >>> skip bits seems to call out that they are a hint since they are not >>> protected by a lock, but that is exactly what has been happening here. >>> >> >> The skip bits are safe here, since even it race with other skip action, >> It will still skip out. The skip action is try not to compaction too much, >> not a exclusive action needs avoid race. > > That would be the case if it didn't have the impact that they > currently do on the compaction process. What I am getting at is that a > race was introduced when you placed this test between the clearing of > the LRU flag and the actual pulling of the page from the LRU list. So > if you tested the skip bits before clearing the LRU flag then I would > be okay with the code, however because it is triggering an abort after Hi Alexander, Thanks a lot for comments and suggestions! 
I have tried your suggestion: Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> --- mm/compaction.c | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/mm/compaction.c b/mm/compaction.c index b99c96c4862d..6c881dee8c9a 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -988,6 +988,13 @@ static bool too_many_isolated(pg_data_t *pgdat) if (__isolate_lru_page_prepare(page, isolate_mode) != 0) goto isolate_fail_put; + /* Try get exclusive access under lock */ + if (!skip_updated) { + skip_updated = true; + if (test_and_set_skip(cc, page, low_pfn)) + goto isolate_fail_put; + } + /* Try isolate the page */ if (!TestClearPageLRU(page)) goto isolate_fail_put; @@ -1006,13 +1013,6 @@ static bool too_many_isolated(pg_data_t *pgdat) lruvec_memcg_debug(lruvec, page); - /* Try get exclusive access under lock */ - if (!skip_updated) { - skip_updated = true; - if (test_and_set_skip(cc, page, low_pfn)) - goto isolate_abort; - } - /* * Page become compound since the non-locked check, * and it's on LRU. It can only be a THP so the order -- Performance of case-lru-file-mmap-read in vm-scalability dropped a bit; not helpful. > the LRU flag is cleared then you are creating a situation where > multiple processes will be stomping all over each other as you can > have each thread essentially take a page via the LRU flag, but only > one thread will process a page and it could skip over all other pages > that preemptively had their LRU flag cleared. It does increase crowding a bit here, but the lru_lock reduces some of it, and the skip bit lets threads stop each other via an array check (bitmap). So compared to the whole-node lru_lock, the net profit is clear in patch 17. > > If you take a look at the test_and_set_skip the function only acts on > the pageblock aligned PFN for a given range. 
WIth the changes you have > in place now that would mean that only one thread would ever actually > call this function anyway since the first PFN would take the LRU flag > so no other thread could follow through and test or set the bit as Is it good that only one process can do test_and_set_skip? Isn't that what the 'skip' bit is meant for? > well. The expectation before was that all threads would encounter this > test and either proceed after setting the bit for the first PFN or > abort after testing the first PFN. With you changes only the first > thread actually runs this test and then it and the others will likely > encounter multiple failures as they are all clearing LRU bits > simultaneously and tripping each other up. That is why the skip bit > must have a test and set done before you even get to the point of > clearing the LRU flag. It makes things worse on my machine; would you like to give it a try yourself? > >>> The point I was getting at with the PageCompound check is that instead >>> of needing the LRU lock you should be able to look at PageCompound as >>> soon as you call get_page_unless_zero() and preempt the need to set >>> the LRU bit again. Instead of trying to rely on the LRU lock to >>> guarantee that the page hasn't been merged you could just rely on the >>> fact that you are holding a reference to it so it isn't going to >>> switch between being compound or order 0 since it cannot be freed. It >>> spoils the idea I originally had of combining the logic for >>> get_page_unless_zero and TestClearPageLRU into a single function, but >>> the advantage is you aren't clearing the LRU flag unless you are >>> actually going to pull the page from the LRU list. >> >> Sorry, I still can not follow you here. Compound code part is unchanged >> and follow the original logical. So would you like to pose a new code to >> see if its works? > > No there are significant changes as you reordered all of the > operations. 
Prior to your change the LRU bit was checked, but not > cleared before testing for PageCompound. Now you are clearing it > before you are testing if it is a compound page. So if compaction is > running we will be seeing the pages in the LRU stay put, but the > compound bit flickering off and on if the compound page is encountered > with the wrong or NULL lruvec. What I was suggesting is that the The lruvec could be wrong or NULL here; that is the cornerstone of the whole patchset. > PageCompound test probably doesn't need to be concerned with the lock > after your changes. You could test it after you call > get_page_unless_zero() and before you call > __isolate_lru_page_prepare(). Instead of relying on the LRU lock to > protect us from the page switching between compound and not we would > be relying on the fact that we are holding a reference to the page so > it should not be freed and transition between compound or not. > I have tried the patch as you suggested; it gives no clear help on performance in the above vm-scalability case. Maybe that's because we already check the same thing before taking the lock. diff --git a/mm/compaction.c b/mm/compaction.c index b99c96c4862d..cf2ac5148001 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -985,6 +985,16 @@ static bool too_many_isolated(pg_data_t *pgdat) if (unlikely(!get_page_unless_zero(page))) goto isolate_fail; + /* + * Page become compound since the non-locked check, + * and it's on LRU. It can only be a THP so the order + * is safe to read and it's 0 for tail pages. + */ + if (unlikely(PageCompound(page) && !cc->alloc_contig)) { + low_pfn += compound_nr(page) - 1; + goto isolate_fail_put; + } + if (__isolate_lru_page_prepare(page, isolate_mode) != 0) goto isolate_fail_put; @@ -1013,16 +1023,6 @@ static bool too_many_isolated(pg_data_t *pgdat) goto isolate_abort; } - /* - * Page become compound since the non-locked check, - * and it's on LRU. It can only be a THP so the order - * is safe to read and it's 0 for tail pages. 
- */ - if (unlikely(PageCompound(page) && !cc->alloc_contig)) { - low_pfn += compound_nr(page) - 1; - SetPageLRU(page); - goto isolate_fail_put; - } } else rcu_read_unlock(); Thanks Alex ^ permalink raw reply related [flat|nested] 101+ messages in thread
[parent not found: <9581db48-cef3-788a-7f5a-8548fee56c13-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org>]
* Re: [PATCH v17 14/21] mm/compaction: do page isolation first in compaction [not found] ` <9581db48-cef3-788a-7f5a-8548fee56c13-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> @ 2020-08-12 12:16 ` Alex Shi 2020-08-12 16:51 ` Alexander Duyck 1 sibling, 0 replies; 101+ messages in thread From: Alex Shi @ 2020-08-12 12:16 UTC (permalink / raw) To: Alexander Duyck Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox, Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups-u79uwXL29TY76Z2rM5mHXA, Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov, Rong Chen 在 2020/8/12 下午7:43, Alex Shi 写道: >>> Sorry, I still can not follow you here. Compound code part is unchanged >>> and follow the original logical. So would you like to pose a new code to >>> see if its works? >> No there are significant changes as you reordered all of the >> operations. Prior to your change the LRU bit was checked, but not >> cleared before testing for PageCompound. Now you are clearing it >> before you are testing if it is a compound page. So if compaction is >> running we will be seeing the pages in the LRU stay put, but the >> compound bit flickering off and on if the compound page is encountered >> with the wrong or NULL lruvec. What I was suggesting is that the > The lruvec could be wrong or NULL here, that is the base stone of whole > patchset. > Sorry for typo. s/could/could not/ ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH v17 14/21] mm/compaction: do page isolation first in compaction [not found] ` <9581db48-cef3-788a-7f5a-8548fee56c13-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> 2020-08-12 12:16 ` Alex Shi @ 2020-08-12 16:51 ` Alexander Duyck 2020-08-13 1:46 ` Alex Shi 1 sibling, 1 reply; 101+ messages in thread From: Alexander Duyck @ 2020-08-12 16:51 UTC (permalink / raw) To: Alex Shi Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox, Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups-u79uwXL29TY76Z2rM5mHXA, Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov, Rong Chen On Wed, Aug 12, 2020 at 4:44 AM Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> wrote: > > > > 在 2020/8/11 下午10:47, Alexander Duyck 写道: > > On Tue, Aug 11, 2020 at 1:23 AM Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> wrote: > >> > >> > >> > >> 在 2020/8/10 下午10:41, Alexander Duyck 写道: > >>> On Mon, Aug 10, 2020 at 6:10 AM Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> wrote: > >>>> > >>>> > >>>> > >>>> 在 2020/8/7 下午10:51, Alexander Duyck 写道: > >>>>> I wonder if this entire section shouldn't be restructured. This is the > >>>>> only spot I can see where we are resetting the LRU flag instead of > >>>>> pulling the page from the LRU list with the lock held. Looking over > >>>>> the code it seems like something like that should be possible. I am > >>>>> not sure the LRU lock is really protecting us in either the > >>>>> PageCompound check nor the skip bits. It seems like holding a > >>>>> reference on the page should prevent it from switching between > >>>>> compound or not, and the skip bits are per pageblock with the LRU bits > >>>>> being per node/memcg which I would think implies that we could have > >>>>> multiple LRU locks that could apply to a single skip bit. 
> >>>> > >>>> Hi Alexander, > >>>> > >>>> I don't find problem yet on compound or skip bit usage. Would you clarify the > >>>> issue do you concerned? > >>>> > >>>> Thanks! > >>> > >>> The point I was getting at is that the LRU lock is being used to > >>> protect these and with your changes I don't think that makes sense > >>> anymore. > >>> > >>> The skip bits are per-pageblock bits. With your change the LRU lock is > >>> now per memcg first and then per node. As such I do not believe it > >>> really provides any sort of exclusive access to the skip bits. I still > >>> have to look into this more, but it seems like you need a lock per > >>> either section or zone that can be used to protect those bits and deal > >>> with this sooner rather than waiting until you have found an LRU page. > >>> The one part that is confusing though is that the definition of the > >>> skip bits seems to call out that they are a hint since they are not > >>> protected by a lock, but that is exactly what has been happening here. > >>> > >> > >> The skip bits are safe here, since even it race with other skip action, > >> It will still skip out. The skip action is try not to compaction too much, > >> not a exclusive action needs avoid race. > > > > That would be the case if it didn't have the impact that they > > currently do on the compaction process. What I am getting at is that a > > race was introduced when you placed this test between the clearing of > > the LRU flag and the actual pulling of the page from the LRU list. So > > if you tested the skip bits before clearing the LRU flag then I would > > be okay with the code, however because it is triggering an abort after > > Hi Alexander, > > Thanks a lot for comments and suggestions! 
> > I have tried your suggestion: > > Signed-off-by: Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> > --- > mm/compaction.c | 14 +++++++------- > 1 file changed, 7 insertions(+), 7 deletions(-) > > diff --git a/mm/compaction.c b/mm/compaction.c > index b99c96c4862d..6c881dee8c9a 100644 > --- a/mm/compaction.c > +++ b/mm/compaction.c > @@ -988,6 +988,13 @@ static bool too_many_isolated(pg_data_t *pgdat) > if (__isolate_lru_page_prepare(page, isolate_mode) != 0) > goto isolate_fail_put; > > + /* Try get exclusive access under lock */ > + if (!skip_updated) { > + skip_updated = true; > + if (test_and_set_skip(cc, page, low_pfn)) > + goto isolate_fail_put; > + } > + > /* Try isolate the page */ > if (!TestClearPageLRU(page)) > goto isolate_fail_put; I would have made this much sooner. Probably before you call get_page_unless_zero so as to avoid the unnecessary atomic operations. > @@ -1006,13 +1013,6 @@ static bool too_many_isolated(pg_data_t *pgdat) > > lruvec_memcg_debug(lruvec, page); > > - /* Try get exclusive access under lock */ > - if (!skip_updated) { > - skip_updated = true; > - if (test_and_set_skip(cc, page, low_pfn)) > - goto isolate_abort; > - } > - > /* > * Page become compound since the non-locked check, > * and it's on LRU. It can only be a THP so the order > -- > > Performance of case-lru-file-mmap-read in vm-scalibity is dropped a bit. not > helpful So one issue with this change is that it is still too late to be of much benefit. Really you should probably be doing this much sooner, for example somewhere before the get_page_unless_zero(). Also the thing that still has me scratching my head is the "Try get exclusive access under lock" comment. The function declaration says this is supposed to be a hint, but we were using the LRU lock to synchronize it. 
I'm wondering if we should really be protecting this with the zone lock since we are modifying the pageblock flags which also contain the migration type value for the pageblock and are only modified while holding the zone lock. > > the LRU flag is cleared then you are creating a situation where > > multiple processes will be stomping all over each other as you can > > have each thread essentially take a page via the LRU flag, but only > > one thread will process a page and it could skip over all other pages > > that preemptively had their LRU flag cleared. > > It increase a bit crowd here, but lru_lock do reduce some them, and skip_bit > could stop each other in a array check(bitmap). So compare to whole node > lru_lock, the net profit is clear in patch 17. My concern is that what you can end up with is multiple threads all working over the same pageblock for isolation. With the old code the LRU lock was used to make certain that test_and_set_skip was being synchronized on the first page in the pageblock so you would only have one thread going through and working a single pageblock. However after your changes it doesn't seem like the test_and_set_skip has that protection since only one thread will ever be able to successfully call it for the first page in the pageblock assuming that the LRU flag is set on the first page in the pageblock block. > > > > If you take a look at the test_and_set_skip the function only acts on > > the pageblock aligned PFN for a given range. WIth the changes you have > > in place now that would mean that only one thread would ever actually > > call this function anyway since the first PFN would take the LRU flag > > so no other thread could follow through and test or set the bit as > > Is this good for only one process could do test_and_set_skip? is that > the 'skip' meaning to be? 
So only one thread really getting to fully use test_and_set_skip is good, however the issue is that there is nothing to synchronize the testing from the other threads. As a result the other threads could have isolated other pages within the pageblock before the thread that is calling test_and_set_skip will get to complete the setting of the skip bit. This will result in isolation failures for the thread that set the skip bit which may be undesirable behavior. With the old code the threads were all synchronized on testing the first PFN in the pageblock while holding the LRU lock and that is what we lost. My concern is the cases where skip_on_failure == true are going to fail much more often now as the threads can easily interfere with each other. > > well. The expectation before was that all threads would encounter this > > test and either proceed after setting the bit for the first PFN or > > abort after testing the first PFN. With you changes only the first > > thread actually runs this test and then it and the others will likely > > encounter multiple failures as they are all clearing LRU bits > > simultaneously and tripping each other up. That is why the skip bit > > must have a test and set done before you even get to the point of > > clearing the LRU flag. > > It make the things warse in my machine, would you like to have a try by yourself? I plan to do that. I have already been working on a few things to clean up and optimize your patch set further. I will try to submit an RFC this evening so we can discuss. > > > >>> The point I was getting at with the PageCompound check is that instead > >>> of needing the LRU lock you should be able to look at PageCompound as > >>> soon as you call get_page_unless_zero() and preempt the need to set > >>> the LRU bit again. 
Instead of trying to rely on the LRU lock to > >>> guarantee that the page hasn't been merged you could just rely on the > >>> fact that you are holding a reference to it so it isn't going to > >>> switch between being compound or order 0 since it cannot be freed. It > >>> spoils the idea I originally had of combining the logic for > >>> get_page_unless_zero and TestClearPageLRU into a single function, but > >>> the advantage is you aren't clearing the LRU flag unless you are > >>> actually going to pull the page from the LRU list. > >> > >> Sorry, I still can not follow you here. Compound code part is unchanged > >> and follow the original logical. So would you like to pose a new code to > >> see if its works? > > > > No there are significant changes as you reordered all of the > > operations. Prior to your change the LRU bit was checked, but not > > cleared before testing for PageCompound. Now you are clearing it > > before you are testing if it is a compound page. So if compaction is > > running we will be seeing the pages in the LRU stay put, but the > > compound bit flickering off and on if the compound page is encountered > > with the wrong or NULL lruvec. What I was suggesting is that the > > The lruvec could be wrong or NULL here, that is the base stone of whole > patchset. Sorry I had a typo in my comment as well as it is the LRU bit that will be flickering, not the compound. The goal here is to avoid clearing the LRU bit unless we are sure we are going to take the lruvec lock and pull the page from the list. > > PageCompound test probably doesn't need to be concerned with the lock > > after your changes. You could test it after you call > > get_page_unless_zero() and before you call > > __isolate_lru_page_prepare(). 
Instead of relying on the LRU lock to > > protect us from the page switching between compound and not we would > > be relying on the fact that we are holding a reference to the page so > > it should not be freed and transition between compound or not. > > > > I have tried the patch as your suggested, it has no clear help on performance > on above vm-scaliblity case. Maybe it's due to we checked the same thing > before lock already. > > diff --git a/mm/compaction.c b/mm/compaction.c > index b99c96c4862d..cf2ac5148001 100644 > --- a/mm/compaction.c > +++ b/mm/compaction.c > @@ -985,6 +985,16 @@ static bool too_many_isolated(pg_data_t *pgdat) > if (unlikely(!get_page_unless_zero(page))) > goto isolate_fail; > > + /* > + * Page become compound since the non-locked check, > + * and it's on LRU. It can only be a THP so the order > + * is safe to read and it's 0 for tail pages. > + */ > + if (unlikely(PageCompound(page) && !cc->alloc_contig)) { > + low_pfn += compound_nr(page) - 1; > + goto isolate_fail_put; > + } > + > if (__isolate_lru_page_prepare(page, isolate_mode) != 0) > goto isolate_fail_put; > > @@ -1013,16 +1023,6 @@ static bool too_many_isolated(pg_data_t *pgdat) > goto isolate_abort; > } > > - /* > - * Page become compound since the non-locked check, > - * and it's on LRU. It can only be a THP so the order > - * is safe to read and it's 0 for tail pages. > - */ > - if (unlikely(PageCompound(page) && !cc->alloc_contig)) { > - low_pfn += compound_nr(page) - 1; > - SetPageLRU(page); > - goto isolate_fail_put; > - } > } else > rcu_read_unlock(); > So actually there is more we could do than just this. Specifically a few lines below the rcu_read_lock there is yet another PageCompound check that sets low_pfn yet again. 
So in theory we could combine both of those and modify the code so you end up with something more like: @@ -968,6 +974,16 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, if (unlikely(!get_page_unless_zero(page))) goto isolate_fail; + if (PageCompound(page)) { + const unsigned int order = compound_order(page); + + if (likely(order < MAX_ORDER)) + low_pfn += (1UL << order) - 1; + + if (unlikely(!cc->alloc_contig)) + goto isolate_fail_put; + } + if (__isolate_lru_page_prepare(page, isolate_mode) != 0) goto isolate_fail_put; Doing this you would be more likely to skip over the entire compound page in a single jump should you not be able to either take the LRU bit or encounter a busy page in __isolate_Lru_page_prepare. I had copied this bit from an earlier check and modified it as I was not sure I can guarantee that this is a THP since we haven't taken the LRU lock yet. However I believe the page cannot be split up while we are holding the extra reference so the PageCompound flag and order should not change until we call put_page. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH v17 14/21] mm/compaction: do page isolation first in compaction 2020-08-12 16:51 ` Alexander Duyck @ 2020-08-13 1:46 ` Alex Shi [not found] ` <3828d045-17e4-16aa-f0e6-d5dda7ad6b1b-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> 0 siblings, 1 reply; 101+ messages in thread From: Alex Shi @ 2020-08-13 1:46 UTC (permalink / raw) To: Alexander Duyck Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox, Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups, Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov, Rong Chen 在 2020/8/13 上午12:51, Alexander Duyck 写道: > On Wed, Aug 12, 2020 at 4:44 AM Alex Shi <alex.shi@linux.alibaba.com> wrote: >> >> >> >> 在 2020/8/11 下午10:47, Alexander Duyck 写道: >>> On Tue, Aug 11, 2020 at 1:23 AM Alex Shi <alex.shi@linux.alibaba.com> wrote: >>>> >>>> >>>> >>>> 在 2020/8/10 下午10:41, Alexander Duyck 写道: >>>>> On Mon, Aug 10, 2020 at 6:10 AM Alex Shi <alex.shi@linux.alibaba.com> wrote: >>>>>> >>>>>> >>>>>> >>>>>> 在 2020/8/7 下午10:51, Alexander Duyck 写道: >>>>>>> I wonder if this entire section shouldn't be restructured. This is the >>>>>>> only spot I can see where we are resetting the LRU flag instead of >>>>>>> pulling the page from the LRU list with the lock held. Looking over >>>>>>> the code it seems like something like that should be possible. I am >>>>>>> not sure the LRU lock is really protecting us in either the >>>>>>> PageCompound check nor the skip bits. It seems like holding a >>>>>>> reference on the page should prevent it from switching between >>>>>>> compound or not, and the skip bits are per pageblock with the LRU bits >>>>>>> being per node/memcg which I would think implies that we could have >>>>>>> multiple LRU locks that could apply to a single skip bit. >>>>>> >>>>>> Hi Alexander, >>>>>> >>>>>> I don't find problem yet on compound or skip bit usage. Would you clarify the >>>>>> issue do you concerned? >>>>>> >>>>>> Thanks! 
>>>>> >>>>> The point I was getting at is that the LRU lock is being used to >>>>> protect these and with your changes I don't think that makes sense >>>>> anymore. >>>>> >>>>> The skip bits are per-pageblock bits. With your change the LRU lock is >>>>> now per memcg first and then per node. As such I do not believe it >>>>> really provides any sort of exclusive access to the skip bits. I still >>>>> have to look into this more, but it seems like you need a lock per >>>>> either section or zone that can be used to protect those bits and deal >>>>> with this sooner rather than waiting until you have found an LRU page. >>>>> The one part that is confusing though is that the definition of the >>>>> skip bits seems to call out that they are a hint since they are not >>>>> protected by a lock, but that is exactly what has been happening here. >>>>> >>>> >>>> The skip bits are safe here, since even it race with other skip action, >>>> It will still skip out. The skip action is try not to compaction too much, >>>> not a exclusive action needs avoid race. >>> >>> That would be the case if it didn't have the impact that they >>> currently do on the compaction process. What I am getting at is that a >>> race was introduced when you placed this test between the clearing of >>> the LRU flag and the actual pulling of the page from the LRU list. So >>> if you tested the skip bits before clearing the LRU flag then I would >>> be okay with the code, however because it is triggering an abort after >> >> Hi Alexander, >> >> Thanks a lot for comments and suggestions! 
>> >> I have tried your suggestion: >> >> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> >> --- >> mm/compaction.c | 14 +++++++------- >> 1 file changed, 7 insertions(+), 7 deletions(-) >> >> diff --git a/mm/compaction.c b/mm/compaction.c >> index b99c96c4862d..6c881dee8c9a 100644 >> --- a/mm/compaction.c >> +++ b/mm/compaction.c >> @@ -988,6 +988,13 @@ static bool too_many_isolated(pg_data_t *pgdat) >> if (__isolate_lru_page_prepare(page, isolate_mode) != 0) >> goto isolate_fail_put; >> >> + /* Try get exclusive access under lock */ >> + if (!skip_updated) { >> + skip_updated = true; >> + if (test_and_set_skip(cc, page, low_pfn)) >> + goto isolate_fail_put; >> + } >> + >> /* Try isolate the page */ >> if (!TestClearPageLRU(page)) >> goto isolate_fail_put; > > I would have made this much sooner. Probably before you call > get_page_unless_zero so as to avoid the unnecessary atomic operations. > >> @@ -1006,13 +1013,6 @@ static bool too_many_isolated(pg_data_t *pgdat) >> >> lruvec_memcg_debug(lruvec, page); >> >> - /* Try get exclusive access under lock */ >> - if (!skip_updated) { >> - skip_updated = true; >> - if (test_and_set_skip(cc, page, low_pfn)) >> - goto isolate_abort; >> - } >> - >> /* >> * Page become compound since the non-locked check, >> * and it's on LRU. It can only be a THP so the order >> -- >> >> Performance of case-lru-file-mmap-read in vm-scalibity is dropped a bit. not >> helpful > > So one issue with this change is that it is still too late to be of > much benefit. Really you should probably be doing this much sooner, > for example somewhere before the get_page_unless_zero(). Also the > thing that still has me scratching my head is the "Try get exclusive > access under lock" comment. The function declaration says this is > supposed to be a hint, but we were using the LRU lock to synchronize > it. 
> I'm wondering if we should really be protecting this with the zone
> lock since we are modifying the pageblock flags which also contain the
> migration type value for the pageblock and are only modified while
> holding the zone lock.

The zone lock is probably better; you can try it and test.

>
>>> the LRU flag is cleared then you are creating a situation where
>>> multiple processes will be stomping all over each other as you can
>>> have each thread essentially take a page via the LRU flag, but only
>>> one thread will process a page and it could skip over all other pages
>>> that preemptively had their LRU flag cleared.
>>
>> It increases contention a bit here, but the lru_lock does reduce some of
>> it, and the skip bit can stop the threads via an array check (bitmap).
>> So compared with the whole-node lru_lock, the net profit is clear in
>> patch 17.
>
> My concern is that what you can end up with is multiple threads all
> working over the same pageblock for isolation. With the old code the
> LRU lock was used to make certain that test_and_set_skip was being
> synchronized on the first page in the pageblock so you would only have
> one thread going through and working a single pageblock. However after
> your changes it doesn't seem like the test_and_set_skip has that
> protection since only one thread will ever be able to successfully
> call it for the first page in the pageblock assuming that the LRU flag
> is set on the first page in the pageblock.
>
>>>
>>> If you take a look at the test_and_set_skip the function only acts on
>>> the pageblock aligned PFN for a given range. With the changes you have
>>> in place now that would mean that only one thread would ever actually
>>> call this function anyway since the first PFN would take the LRU flag
>>> so no other thread could follow through and test or set the bit as
>>
>> Is it good that only one process can do test_and_set_skip? Isn't that
>> what 'skip' is meant to be?
>
> So only one thread really getting to fully use test_and_set_skip is
> good, however the issue is that there is nothing to synchronize the
> testing from the other threads. As a result the other threads could
> have isolated other pages within the pageblock before the thread that
> is calling test_and_set_skip will get to complete the setting of the
> skip bit. This will result in isolation failures for the thread that
> set the skip bit which may be undesirable behavior.
>
> With the old code the threads were all synchronized on testing the
> first PFN in the pageblock while holding the LRU lock and that is what
> we lost. My concern is the cases where skip_on_failure == true are
> going to fail much more often now as the threads can easily interfere
> with each other.

I have a patch to fix this, which is on
https://github.com/alexshi/linux.git lrunext

>
>>> well. The expectation before was that all threads would encounter this
>>> test and either proceed after setting the bit for the first PFN or
>>> abort after testing the first PFN. With your changes only the first
>>> thread actually runs this test and then it and the others will likely
>>> encounter multiple failures as they are all clearing LRU bits
>>> simultaneously and tripping each other up. That is why the skip bit
>>> must have a test and set done before you even get to the point of
>>> clearing the LRU flag.
>>
>> It makes things worse on my machine; would you like to try it yourself?
>
> I plan to do that. I have already been working on a few things to
> clean up and optimize your patch set further. I will try to submit an
> RFC this evening so we can discuss.
>

Glad to see your new code soon.
Would you like to do it based on
https://github.com/alexshi/linux.git lrunext

>>>
>>>>> The point I was getting at with the PageCompound check is that instead
>>>>> of needing the LRU lock you should be able to look at PageCompound as
>>>>> soon as you call get_page_unless_zero() and preempt the need to set
>>>>> the LRU bit again. Instead of trying to rely on the LRU lock to
>>>>> guarantee that the page hasn't been merged you could just rely on the
>>>>> fact that you are holding a reference to it so it isn't going to
>>>>> switch between being compound or order 0 since it cannot be freed. It
>>>>> spoils the idea I originally had of combining the logic for
>>>>> get_page_unless_zero and TestClearPageLRU into a single function, but
>>>>> the advantage is you aren't clearing the LRU flag unless you are
>>>>> actually going to pull the page from the LRU list.
>>>>
>>>> Sorry, I still cannot follow you here. The compound-page code is
>>>> unchanged and follows the original logic. So would you like to post new
>>>> code to see if it works?
>>>
>>> No there are significant changes as you reordered all of the
>>> operations. Prior to your change the LRU bit was checked, but not
>>> cleared before testing for PageCompound. Now you are clearing it
>>> before you are testing if it is a compound page. So if compaction is
>>> running we will be seeing the pages in the LRU stay put, but the
>>> compound bit flickering off and on if the compound page is encountered
>>> with the wrong or NULL lruvec. What I was suggesting is that the
>>
>> The lruvec could be wrong or NULL here; that is the cornerstone of the
>> whole patchset.
>
> Sorry I had a typo in my comment as well as it is the LRU bit that
> will be flickering, not the compound. The goal here is to avoid
> clearing the LRU bit unless we are sure we are going to take the
> lruvec lock and pull the page from the list.
>
>>> PageCompound test probably doesn't need to be concerned with the lock
>>> after your changes.
>>> You could test it after you call
>>> get_page_unless_zero() and before you call
>>> __isolate_lru_page_prepare(). Instead of relying on the LRU lock to
>>> protect us from the page switching between compound and not we would
>>> be relying on the fact that we are holding a reference to the page so
>>> it should not be freed and transition between compound or not.
>>>
>>
>> I have tried the patch as you suggested; it gives no clear performance
>> help in the above vm-scalability case. Maybe that is because we already
>> checked the same thing before taking the lock.
>>
>> diff --git a/mm/compaction.c b/mm/compaction.c
>> index b99c96c4862d..cf2ac5148001 100644
>> --- a/mm/compaction.c
>> +++ b/mm/compaction.c
>> @@ -985,6 +985,16 @@ static bool too_many_isolated(pg_data_t *pgdat)
>> if (unlikely(!get_page_unless_zero(page)))
>> goto isolate_fail;
>>
>> + /*
>> + * Page become compound since the non-locked check,
>> + * and it's on LRU. It can only be a THP so the order
>> + * is safe to read and it's 0 for tail pages.
>> + */
>> + if (unlikely(PageCompound(page) && !cc->alloc_contig)) {
>> + low_pfn += compound_nr(page) - 1;
>> + goto isolate_fail_put;
>> + }
>> +
>> if (__isolate_lru_page_prepare(page, isolate_mode) != 0)
>> goto isolate_fail_put;
>>
>> @@ -1013,16 +1023,6 @@ static bool too_many_isolated(pg_data_t *pgdat)
>> goto isolate_abort;
>> }
>>
>> - /*
>> - * Page become compound since the non-locked check,
>> - * and it's on LRU. It can only be a THP so the order
>> - * is safe to read and it's 0 for tail pages.
>> - */
>> - if (unlikely(PageCompound(page) && !cc->alloc_contig)) {
>> - low_pfn += compound_nr(page) - 1;
>> - SetPageLRU(page);
>> - goto isolate_fail_put;
>> - }
>> } else
>> rcu_read_unlock();
>>
>
> So actually there is more we could do than just this. Specifically a
> few lines below the rcu_read_lock there is yet another PageCompound
> check that sets low_pfn yet again.
> So in theory we could combine both
> of those and modify the code so you end up with something more like:
> @@ -968,6 +974,16 @@ isolate_migratepages_block(struct compact_control
> *cc, unsigned long low_pfn,
> if (unlikely(!get_page_unless_zero(page)))
> goto isolate_fail;
>
> + if (PageCompound(page)) {
> + const unsigned int order = compound_order(page);
> +
> + if (likely(order < MAX_ORDER))
> + low_pfn += (1UL << order) - 1;
> +
> + if (unlikely(!cc->alloc_contig))
> + goto isolate_fail_put;
>

The current code doesn't check this unless 'locked' changed. But anyway,
checking it for every page may have no performance impact.

> + }
> +
> if (__isolate_lru_page_prepare(page, isolate_mode) != 0)
> goto isolate_fail_put;
>
> Doing this you would be more likely to skip over the entire compound
> page in a single jump should you not be able to either take the LRU
> bit or encounter a busy page in __isolate_lru_page_prepare. I had
> copied this bit from an earlier check and modified it as I was not
> sure I can guarantee that this is a THP since we haven't taken the LRU
> lock yet. However I believe the page cannot be split up while we are
> holding the extra reference so the PageCompound flag and order should
> not change until we call put_page.
>

It looks like lock_page() protects this rather than get_page(), since
get_page() still works after the split function is called.

Thanks
Alex

^ permalink raw reply	[flat|nested] 101+ messages in thread
[parent not found: <3828d045-17e4-16aa-f0e6-d5dda7ad6b1b-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org>]
* Re: [PATCH v17 14/21] mm/compaction: do page isolation first in compaction [not found] ` <3828d045-17e4-16aa-f0e6-d5dda7ad6b1b-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> @ 2020-08-13 2:17 ` Alexander Duyck [not found] ` <CAKgT0Ud6ZQ4ZTm1cAUKCdb8FMu0fk9vXgf-bnmb0aY5ndDHwyA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2020-08-13 4:02 ` [RFC PATCH 0/3] " Alexander Duyck 1 sibling, 1 reply; 101+ messages in thread From: Alexander Duyck @ 2020-08-13 2:17 UTC (permalink / raw) To: Alex Shi Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox, Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups-u79uwXL29TY76Z2rM5mHXA, Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov, Rong Chen On Wed, Aug 12, 2020 at 6:47 PM Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> wrote: > > > > 在 2020/8/13 上午12:51, Alexander Duyck 写道: > > On Wed, Aug 12, 2020 at 4:44 AM Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> wrote: > >> > >> > >> > >> 在 2020/8/11 下午10:47, Alexander Duyck 写道: > >>> On Tue, Aug 11, 2020 at 1:23 AM Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> wrote: > >>>> > >>>> > >>>> > >>>> 在 2020/8/10 下午10:41, Alexander Duyck 写道: > >>>>> On Mon, Aug 10, 2020 at 6:10 AM Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAg@public.gmane.orgm> wrote: > >>>>>> > >>>>>> > >>>>>> > >>>>>> 在 2020/8/7 下午10:51, Alexander Duyck 写道: > >>>>>>> I wonder if this entire section shouldn't be restructured. This is the > >>>>>>> only spot I can see where we are resetting the LRU flag instead of > >>>>>>> pulling the page from the LRU list with the lock held. Looking over > >>>>>>> the code it seems like something like that should be possible. I am > >>>>>>> not sure the LRU lock is really protecting us in either the > >>>>>>> PageCompound check nor the skip bits. 
It seems like holding a > >>>>>>> reference on the page should prevent it from switching between > >>>>>>> compound or not, and the skip bits are per pageblock with the LRU bits > >>>>>>> being per node/memcg which I would think implies that we could have > >>>>>>> multiple LRU locks that could apply to a single skip bit. > >>>>>> > >>>>>> Hi Alexander, > >>>>>> > >>>>>> I don't find problem yet on compound or skip bit usage. Would you clarify the > >>>>>> issue do you concerned? > >>>>>> > >>>>>> Thanks! > >>>>> > >>>>> The point I was getting at is that the LRU lock is being used to > >>>>> protect these and with your changes I don't think that makes sense > >>>>> anymore. > >>>>> > >>>>> The skip bits are per-pageblock bits. With your change the LRU lock is > >>>>> now per memcg first and then per node. As such I do not believe it > >>>>> really provides any sort of exclusive access to the skip bits. I still > >>>>> have to look into this more, but it seems like you need a lock per > >>>>> either section or zone that can be used to protect those bits and deal > >>>>> with this sooner rather than waiting until you have found an LRU page. > >>>>> The one part that is confusing though is that the definition of the > >>>>> skip bits seems to call out that they are a hint since they are not > >>>>> protected by a lock, but that is exactly what has been happening here. > >>>>> > >>>> > >>>> The skip bits are safe here, since even it race with other skip action, > >>>> It will still skip out. The skip action is try not to compaction too much, > >>>> not a exclusive action needs avoid race. > >>> > >>> That would be the case if it didn't have the impact that they > >>> currently do on the compaction process. What I am getting at is that a > >>> race was introduced when you placed this test between the clearing of > >>> the LRU flag and the actual pulling of the page from the LRU list. 
So > >>> if you tested the skip bits before clearing the LRU flag then I would > >>> be okay with the code, however because it is triggering an abort after > >> > >> Hi Alexander, > >> > >> Thanks a lot for comments and suggestions! > >> > >> I have tried your suggestion: > >> > >> Signed-off-by: Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> > >> --- > >> mm/compaction.c | 14 +++++++------- > >> 1 file changed, 7 insertions(+), 7 deletions(-) > >> > >> diff --git a/mm/compaction.c b/mm/compaction.c > >> index b99c96c4862d..6c881dee8c9a 100644 > >> --- a/mm/compaction.c > >> +++ b/mm/compaction.c > >> @@ -988,6 +988,13 @@ static bool too_many_isolated(pg_data_t *pgdat) > >> if (__isolate_lru_page_prepare(page, isolate_mode) != 0) > >> goto isolate_fail_put; > >> > >> + /* Try get exclusive access under lock */ > >> + if (!skip_updated) { > >> + skip_updated = true; > >> + if (test_and_set_skip(cc, page, low_pfn)) > >> + goto isolate_fail_put; > >> + } > >> + > >> /* Try isolate the page */ > >> if (!TestClearPageLRU(page)) > >> goto isolate_fail_put; > > > > I would have made this much sooner. Probably before you call > > get_page_unless_zero so as to avoid the unnecessary atomic operations. > > > >> @@ -1006,13 +1013,6 @@ static bool too_many_isolated(pg_data_t *pgdat) > >> > >> lruvec_memcg_debug(lruvec, page); > >> > >> - /* Try get exclusive access under lock */ > >> - if (!skip_updated) { > >> - skip_updated = true; > >> - if (test_and_set_skip(cc, page, low_pfn)) > >> - goto isolate_abort; > >> - } > >> - > >> /* > >> * Page become compound since the non-locked check, > >> * and it's on LRU. It can only be a THP so the order > >> -- > >> > >> Performance of case-lru-file-mmap-read in vm-scalibity is dropped a bit. not > >> helpful > > > > So one issue with this change is that it is still too late to be of > > much benefit. Really you should probably be doing this much sooner, > > for example somewhere before the get_page_unless_zero(). 
Also the > > thing that still has me scratching my head is the "Try get exclusive > > access under lock" comment. The function declaration says this is > > supposed to be a hint, but we were using the LRU lock to synchronize > > it. I'm wondering if we should really be protecting this with the zone > > lock since we are modifying the pageblock flags which also contain the > > migration type value for the pageblock and are only modified while > > holding the zone lock. > > zone lock is probability better. you can try and test. So I spent a good chunk of today looking the code over and what I realized is that we probably don't even really need to have this code protected by the zone lock since the LRU bit in the pageblock should do most of the work for us. In addition we can get rid of the test portion of this and just make it a set only operation if I am not mistaken. > >>> the LRU flag is cleared then you are creating a situation where > >>> multiple processes will be stomping all over each other as you can > >>> have each thread essentially take a page via the LRU flag, but only > >>> one thread will process a page and it could skip over all other pages > >>> that preemptively had their LRU flag cleared. > >> > >> It increase a bit crowd here, but lru_lock do reduce some them, and skip_bit > >> could stop each other in a array check(bitmap). So compare to whole node > >> lru_lock, the net profit is clear in patch 17. > > > > My concern is that what you can end up with is multiple threads all > > working over the same pageblock for isolation. With the old code the > > LRU lock was used to make certain that test_and_set_skip was being > > synchronized on the first page in the pageblock so you would only have > > one thread going through and working a single pageblock. 
However after > > your changes it doesn't seem like the test_and_set_skip has that > > protection since only one thread will ever be able to successfully > > call it for the first page in the pageblock assuming that the LRU flag > > is set on the first page in the pageblock block. > > > >>> > >>> If you take a look at the test_and_set_skip the function only acts on > >>> the pageblock aligned PFN for a given range. WIth the changes you have > >>> in place now that would mean that only one thread would ever actually > >>> call this function anyway since the first PFN would take the LRU flag > >>> so no other thread could follow through and test or set the bit as > >> > >> Is this good for only one process could do test_and_set_skip? is that > >> the 'skip' meaning to be? > > > > So only one thread really getting to fully use test_and_set_skip is > > good, however the issue is that there is nothing to synchronize the > > testing from the other threads. As a result the other threads could > > have isolated other pages within the pageblock before the thread that > > is calling test_and_set_skip will get to complete the setting of the > > skip bit. This will result in isolation failures for the thread that > > set the skip bit which may be undesirable behavior. > > > > With the old code the threads were all synchronized on testing the > > first PFN in the pageblock while holding the LRU lock and that is what > > we lost. My concern is the cases where skip_on_failure == true are > > going to fail much more often now as the threads can easily interfere > > with each other. > > I have a patch to fix this, which is on > https://github.com/alexshi/linux.git lrunext I don't think that patch helps to address anything. You are now failing to set the bit in the case that something modifies the pageblock flags while you are attempting to do so. I think it would be better to just leave the cmpxchg loop as it is. > > > >>> well. 
The expectation before was that all threads would encounter this > >>> test and either proceed after setting the bit for the first PFN or > >>> abort after testing the first PFN. With you changes only the first > >>> thread actually runs this test and then it and the others will likely > >>> encounter multiple failures as they are all clearing LRU bits > >>> simultaneously and tripping each other up. That is why the skip bit > >>> must have a test and set done before you even get to the point of > >>> clearing the LRU flag. > >> > >> It make the things warse in my machine, would you like to have a try by yourself? > > > > I plan to do that. I have already been working on a few things to > > clean up and optimize your patch set further. I will try to submit an > > RFC this evening so we can discuss. > > > > Glad to see your new code soon. Would you like do it base on > https://github.com/alexshi/linux.git lrunext I can rebase off of that tree. It may add another half hour or so. I have barely had any time to test my code. When I enabled some of the debugging features in the kernel related to using the vm-scalability tests the boot time became incredibly slow so I may just make certain I can boot and not mess the system up before submitting my patches as an RFC. I can probably try testing them more tomorrow. > >>> > >>>>> The point I was getting at with the PageCompound check is that instead > >>>>> of needing the LRU lock you should be able to look at PageCompound as > >>>>> soon as you call get_page_unless_zero() and preempt the need to set > >>>>> the LRU bit again. Instead of trying to rely on the LRU lock to > >>>>> guarantee that the page hasn't been merged you could just rely on the > >>>>> fact that you are holding a reference to it so it isn't going to > >>>>> switch between being compound or order 0 since it cannot be freed. 
It > >>>>> spoils the idea I originally had of combining the logic for > >>>>> get_page_unless_zero and TestClearPageLRU into a single function, but > >>>>> the advantage is you aren't clearing the LRU flag unless you are > >>>>> actually going to pull the page from the LRU list. > >>>> > >>>> Sorry, I still can not follow you here. Compound code part is unchanged > >>>> and follow the original logical. So would you like to pose a new code to > >>>> see if its works? > >>> > >>> No there are significant changes as you reordered all of the > >>> operations. Prior to your change the LRU bit was checked, but not > >>> cleared before testing for PageCompound. Now you are clearing it > >>> before you are testing if it is a compound page. So if compaction is > >>> running we will be seeing the pages in the LRU stay put, but the > >>> compound bit flickering off and on if the compound page is encountered > >>> with the wrong or NULL lruvec. What I was suggesting is that the > >> > >> The lruvec could be wrong or NULL here, that is the base stone of whole > >> patchset. > > > > Sorry I had a typo in my comment as well as it is the LRU bit that > > will be flickering, not the compound. The goal here is to avoid > > clearing the LRU bit unless we are sure we are going to take the > > lruvec lock and pull the page from the list. > > > >>> PageCompound test probably doesn't need to be concerned with the lock > >>> after your changes. You could test it after you call > >>> get_page_unless_zero() and before you call > >>> __isolate_lru_page_prepare(). Instead of relying on the LRU lock to > >>> protect us from the page switching between compound and not we would > >>> be relying on the fact that we are holding a reference to the page so > >>> it should not be freed and transition between compound or not. > >>> > >> > >> I have tried the patch as your suggested, it has no clear help on performance > >> on above vm-scaliblity case. 
Maybe it's due to we checked the same thing > >> before lock already. > >> > >> diff --git a/mm/compaction.c b/mm/compaction.c > >> index b99c96c4862d..cf2ac5148001 100644 > >> --- a/mm/compaction.c > >> +++ b/mm/compaction.c > >> @@ -985,6 +985,16 @@ static bool too_many_isolated(pg_data_t *pgdat) > >> if (unlikely(!get_page_unless_zero(page))) > >> goto isolate_fail; > >> > >> + /* > >> + * Page become compound since the non-locked check, > >> + * and it's on LRU. It can only be a THP so the order > >> + * is safe to read and it's 0 for tail pages. > >> + */ > >> + if (unlikely(PageCompound(page) && !cc->alloc_contig)) { > >> + low_pfn += compound_nr(page) - 1; > >> + goto isolate_fail_put; > >> + } > >> + > >> if (__isolate_lru_page_prepare(page, isolate_mode) != 0) > >> goto isolate_fail_put; > >> > >> @@ -1013,16 +1023,6 @@ static bool too_many_isolated(pg_data_t *pgdat) > >> goto isolate_abort; > >> } > >> > >> - /* > >> - * Page become compound since the non-locked check, > >> - * and it's on LRU. It can only be a THP so the order > >> - * is safe to read and it's 0 for tail pages. > >> - */ > >> - if (unlikely(PageCompound(page) && !cc->alloc_contig)) { > >> - low_pfn += compound_nr(page) - 1; > >> - SetPageLRU(page); > >> - goto isolate_fail_put; > >> - } > >> } else > >> rcu_read_unlock(); > >> > > > > So actually there is more we could do than just this. Specifically a > > few lines below the rcu_read_lock there is yet another PageCompound > > check that sets low_pfn yet again. 
> > So in theory we could combine both
> > of those and modify the code so you end up with something more like:
> > @@ -968,6 +974,16 @@ isolate_migratepages_block(struct compact_control
> > *cc, unsigned long low_pfn,
> > if (unlikely(!get_page_unless_zero(page)))
> > goto isolate_fail;
> >
> > + if (PageCompound(page)) {
> > + const unsigned int order = compound_order(page);
> > +
> > + if (likely(order < MAX_ORDER))
> > + low_pfn += (1UL << order) - 1;
> > +
> > + if (unlikely(!cc->alloc_contig))
> > + goto isolate_fail_put;
> >
>
> The current code doesn't check this unless 'locked' changed. But anyway,
> checking it for every page may have no performance impact.

Yes and no. The same code is also run outside the lock and that is why
I suggested merging the two and creating this block of logic. It will
be clearer once I have done some initial smoke testing and submitted
my patch.

> > + }
> > +
> > if (__isolate_lru_page_prepare(page, isolate_mode) != 0)
> > goto isolate_fail_put;
> >
> > Doing this you would be more likely to skip over the entire compound
> > page in a single jump should you not be able to either take the LRU
> > bit or encounter a busy page in __isolate_lru_page_prepare. I had
> > copied this bit from an earlier check and modified it as I was not
> > sure I can guarantee that this is a THP since we haven't taken the LRU
> > lock yet. However I believe the page cannot be split up while we are
> > holding the extra reference so the PageCompound flag and order should
> > not change until we call put_page.
> >
> > It looks like lock_page() protects this rather than get_page(), since
> > get_page() still works after the split function is called.

So I thought that the call to page_ref_freeze that is used in
functions like split_huge_page_to_list is meant to address this case.
What it is essentially doing is setting the reference count to zero if
the count is at the expected value.
So with the get_page_unless_zero it would either fail because the value is already zero, or the page_ref_freeze would fail because the count would be one higher than the expected value. Either that or I am still missing another piece in the understanding of this. Thanks. - Alex ^ permalink raw reply [flat|nested] 101+ messages in thread
[parent not found: <CAKgT0Ud6ZQ4ZTm1cAUKCdb8FMu0fk9vXgf-bnmb0aY5ndDHwyA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH v17 14/21] mm/compaction: do page isolation first in compaction [not found] ` <CAKgT0Ud6ZQ4ZTm1cAUKCdb8FMu0fk9vXgf-bnmb0aY5ndDHwyA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2020-08-13 3:52 ` Alex Shi 0 siblings, 0 replies; 101+ messages in thread From: Alex Shi @ 2020-08-13 3:52 UTC (permalink / raw) To: Alexander Duyck Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox, Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups-u79uwXL29TY76Z2rM5mHXA, Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov, Rong Chen 在 2020/8/13 上午10:17, Alexander Duyck 写道: >> zone lock is probability better. you can try and test. > So I spent a good chunk of today looking the code over and what I > realized is that we probably don't even really need to have this code > protected by the zone lock since the LRU bit in the pageblock should > do most of the work for us. In addition we can get rid of the test > portion of this and just make it a set only operation if I am not > mistaken. > >>>>> the LRU flag is cleared then you are creating a situation where >>>>> multiple processes will be stomping all over each other as you can >>>>> have each thread essentially take a page via the LRU flag, but only >>>>> one thread will process a page and it could skip over all other pages >>>>> that preemptively had their LRU flag cleared. >>>> It increase a bit crowd here, but lru_lock do reduce some them, and skip_bit >>>> could stop each other in a array check(bitmap). So compare to whole node >>>> lru_lock, the net profit is clear in patch 17. >>> My concern is that what you can end up with is multiple threads all >>> working over the same pageblock for isolation. With the old code the >>> LRU lock was used to make certain that test_and_set_skip was being >>> synchronized on the first page in the pageblock so you would only have >>> one thread going through and working a single pageblock. 
However after >>> your changes it doesn't seem like the test_and_set_skip has that >>> protection since only one thread will ever be able to successfully >>> call it for the first page in the pageblock assuming that the LRU flag >>> is set on the first page in the pageblock block. >>> >>>>> If you take a look at the test_and_set_skip the function only acts on >>>>> the pageblock aligned PFN for a given range. WIth the changes you have >>>>> in place now that would mean that only one thread would ever actually >>>>> call this function anyway since the first PFN would take the LRU flag >>>>> so no other thread could follow through and test or set the bit as >>>> Is this good for only one process could do test_and_set_skip? is that >>>> the 'skip' meaning to be? >>> So only one thread really getting to fully use test_and_set_skip is >>> good, however the issue is that there is nothing to synchronize the >>> testing from the other threads. As a result the other threads could >>> have isolated other pages within the pageblock before the thread that >>> is calling test_and_set_skip will get to complete the setting of the >>> skip bit. This will result in isolation failures for the thread that >>> set the skip bit which may be undesirable behavior. >>> >>> With the old code the threads were all synchronized on testing the >>> first PFN in the pageblock while holding the LRU lock and that is what >>> we lost. My concern is the cases where skip_on_failure == true are >>> going to fail much more often now as the threads can easily interfere >>> with each other. >> I have a patch to fix this, which is on >> https://github.com/alexshi/linux.git lrunext > I don't think that patch helps to address anything. You are now > failing to set the bit in the case that something modifies the > pageblock flags while you are attempting to do so. I think it would be > better to just leave the cmpxchg loop as it is. 
It does increase case-lru-file-mmap-read performance in vm-scalability
by about 3%. Yes, I am glad to see it can be made better.

>
>>>>> well. The expectation before was that all threads would encounter this
>>>>> test and either proceed after setting the bit for the first PFN or
>>>>> abort after testing the first PFN. With your changes only the first
>>>>> thread actually runs this test and then it and the others will likely
>>>>> encounter multiple failures as they are all clearing LRU bits
>>>>> simultaneously and tripping each other up. That is why the skip bit
>>>>> must have a test and set done before you even get to the point of
>>>>> clearing the LRU flag.
>>>> It makes things worse on my machine; would you like to try it yourself?
>>> I plan to do that. I have already been working on a few things to
>>> clean up and optimize your patch set further. I will try to submit an
>>> RFC this evening so we can discuss.
>>>
>> Glad to see your new code soon. Would you like to do it based on
>> https://github.com/alexshi/linux.git lrunext
> I can rebase off of that tree. It may add another half hour or so. I
> have barely had any time to test my code. When I enabled some of the
> debugging features in the kernel related to using the vm-scalability
> tests the boot time became incredibly slow so I may just make certain
> I can boot and not mess the system up before submitting my patches as
> an RFC. I can probably try testing them more tomorrow.
>
>>>>>>> The point I was getting at with the PageCompound check is that instead
>>>>>>> of needing the LRU lock you should be able to look at PageCompound as
>>>>>>> soon as you call get_page_unless_zero() and preempt the need to set
>>>>>>> the LRU bit again.
Instead of trying to rely on the LRU lock to >>>>>>> guarantee that the page hasn't been merged you could just rely on the >>>>>>> fact that you are holding a reference to it so it isn't going to >>>>>>> switch between being compound or order 0 since it cannot be freed. It >>>>>>> spoils the idea I originally had of combining the logic for >>>>>>> get_page_unless_zero and TestClearPageLRU into a single function, but >>>>>>> the advantage is you aren't clearing the LRU flag unless you are >>>>>>> actually going to pull the page from the LRU list. >>>>>> Sorry, I still can not follow you here. Compound code part is unchanged >>>>>> and follow the original logical. So would you like to pose a new code to >>>>>> see if its works? >>>>> No there are significant changes as you reordered all of the >>>>> operations. Prior to your change the LRU bit was checked, but not >>>>> cleared before testing for PageCompound. Now you are clearing it >>>>> before you are testing if it is a compound page. So if compaction is >>>>> running we will be seeing the pages in the LRU stay put, but the >>>>> compound bit flickering off and on if the compound page is encountered >>>>> with the wrong or NULL lruvec. What I was suggesting is that the >>>> The lruvec could be wrong or NULL here, that is the base stone of whole >>>> patchset. >>> Sorry I had a typo in my comment as well as it is the LRU bit that >>> will be flickering, not the compound. The goal here is to avoid >>> clearing the LRU bit unless we are sure we are going to take the >>> lruvec lock and pull the page from the list. >>> >>>>> PageCompound test probably doesn't need to be concerned with the lock >>>>> after your changes. You could test it after you call >>>>> get_page_unless_zero() and before you call >>>>> __isolate_lru_page_prepare(). 
Instead of relying on the LRU lock to >>>>> protect us from the page switching between compound and not we would >>>>> be relying on the fact that we are holding a reference to the page so >>>>> it should not be freed and transition between compound or not. >>>>> >>>> I have tried the patch as your suggested, it has no clear help on performance >>>> on above vm-scaliblity case. Maybe it's due to we checked the same thing >>>> before lock already. >>>> >>>> diff --git a/mm/compaction.c b/mm/compaction.c >>>> index b99c96c4862d..cf2ac5148001 100644 >>>> --- a/mm/compaction.c >>>> +++ b/mm/compaction.c >>>> @@ -985,6 +985,16 @@ static bool too_many_isolated(pg_data_t *pgdat) >>>> if (unlikely(!get_page_unless_zero(page))) >>>> goto isolate_fail; >>>> >>>> + /* >>>> + * Page become compound since the non-locked check, >>>> + * and it's on LRU. It can only be a THP so the order >>>> + * is safe to read and it's 0 for tail pages. >>>> + */ >>>> + if (unlikely(PageCompound(page) && !cc->alloc_contig)) { >>>> + low_pfn += compound_nr(page) - 1; >>>> + goto isolate_fail_put; >>>> + } >>>> + >>>> if (__isolate_lru_page_prepare(page, isolate_mode) != 0) >>>> goto isolate_fail_put; >>>> >>>> @@ -1013,16 +1023,6 @@ static bool too_many_isolated(pg_data_t *pgdat) >>>> goto isolate_abort; >>>> } >>>> >>>> - /* >>>> - * Page become compound since the non-locked check, >>>> - * and it's on LRU. It can only be a THP so the order >>>> - * is safe to read and it's 0 for tail pages. >>>> - */ >>>> - if (unlikely(PageCompound(page) && !cc->alloc_contig)) { >>>> - low_pfn += compound_nr(page) - 1; >>>> - SetPageLRU(page); >>>> - goto isolate_fail_put; >>>> - } >>>> } else >>>> rcu_read_unlock(); >>>> >>> So actually there is more we could do than just this. Specifically a >>> few lines below the rcu_read_lock there is yet another PageCompound >>> check that sets low_pfn yet again. 
So in theory we could combine both >>> of those and modify the code so you end up with something more like: >>> @@ -968,6 +974,16 @@ isolate_migratepages_block(struct compact_control >>> *cc, unsigned long low_pfn, >>> if (unlikely(!get_page_unless_zero(page))) >>> goto isolate_fail; >>> >>> + if (PageCompound(page)) { >>> + const unsigned int order = compound_order(page); >>> + >>> + if (likely(order < MAX_ORDER)) >>> + low_pfn += (1UL << order) - 1; >>> + >>> + if (unlikely(!cc->alloc_contig)) >>> + goto isolate_fail_put; >>> >> The current don't check this unless locked changed. But anyway check it >> every page may have no performance impact. > Yes and no. The same code is also ran outside the lock and that is why > I suggested merging the two and creating this block of logic. It will > be clearer once I have done some initial smoke testing and submitted > my patch. > >> + } >>> + >>> if (__isolate_lru_page_prepare(page, isolate_mode) != 0) >>> goto isolate_fail_put; >>> >>> Doing this you would be more likely to skip over the entire compound >>> page in a single jump should you not be able to either take the LRU >>> bit or encounter a busy page in __isolate_Lru_page_prepare. I had >>> copied this bit from an earlier check and modified it as I was not >>> sure I can guarantee that this is a THP since we haven't taken the LRU >>> lock yet. However I believe the page cannot be split up while we are >>> holding the extra reference so the PageCompound flag and order should >>> not change until we call put_page. >>> >> It looks like the lock_page protect this instead of get_page that just works >> after split func called. > So I thought that the call to page_ref_freeze that is used in > functions like split_huge_page_to_list is meant to address this case. > What it is essentially doing is setting the reference count to zero if > the count is at the expected value. 
> So with the get_page_unless_zero
> it would either fail because the value is already zero, or the
> page_ref_freeze would fail because the count would be one higher than
> the expected value. Either that or I am still missing another piece in
> the understanding of this.

Uh, the xa_lock or anon_vma lock taken up front guards the refcount here, so it is a long locking path...

Thanks
Alex

^ permalink raw reply	[flat|nested] 101+ messages in thread
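The page_ref_freeze() interplay debated above can be modelled in a few lines. The sketch below is standalone userspace C, not kernel code: `struct page_model` and both function names are illustrative stand-ins for the kernel's get_page_unless_zero()/page_ref_freeze() pair, under the assumption that a THP split can only proceed once the refcount is frozen to zero.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Standalone model of the refcount protocol discussed above; not
 * kernel code.  A single atomic counter stands in for page->_refcount. */
struct page_model {
	atomic_int refcount;
};

/* Mirrors get_page_unless_zero(): take a reference only if the count
 * is not already zero. */
static bool get_page_unless_zero_model(struct page_model *p)
{
	int old = atomic_load(&p->refcount);

	while (old != 0) {
		/* On failure the CAS reloads 'old' and we retry. */
		if (atomic_compare_exchange_weak(&p->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* Mirrors page_ref_freeze(): drop the count to zero only if it equals
 * the expected value; a split can proceed only when this succeeds. */
static bool page_ref_freeze_model(struct page_model *p, int expected)
{
	int exp = expected;

	return atomic_compare_exchange_strong(&p->refcount, &exp, 0);
}
```

While the isolation path holds its extra reference, the freeze fails and the compound page cannot be split, which matches the reasoning in the message above.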
* [RFC PATCH 0/3] Re: [PATCH v17 14/21] mm/compaction: do page isolation first in compaction [not found] ` <3828d045-17e4-16aa-f0e6-d5dda7ad6b1b-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> 2020-08-13 2:17 ` Alexander Duyck @ 2020-08-13 4:02 ` Alexander Duyck [not found] ` <20200813035100.13054.25671.stgit-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org> 1 sibling, 1 reply; 101+ messages in thread From: Alexander Duyck @ 2020-08-13 4:02 UTC (permalink / raw) To: alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf Cc: yang.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf, lkp-ral2JQCrhuEAvxtiuMwx3w, rong.a.chen-ral2JQCrhuEAvxtiuMwx3w, khlebnikov-XoJtRXgx1JseBXzfvpsJ4g, kirill-oKw7cIdHH8eLwutG50LtGA, hughd-hpIqsD4AKlfQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, alexander.duyck-Re5JQEeQqe8AvxtiuMwx3w, daniel.m.jordan-QHcLZuEGTsvQT0dZR+AlfA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, shakeelb-hpIqsD4AKlfQT0dZR+AlfA, willy-wEGCiKHe2LqWVfeAwA7xHQ, hannes-druUgvl0LCNAfugRpC6u6w, tj-DgEjT+Ai2ygdnm+yROfE0A, cgroups-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, richard.weiyang-Re5JQEeQqe8AvxtiuMwx3w, mgorman-3eNAlZScCAx27rWaFMvyedHuzzzSOjJt, iamjoonsoo.kim-Hm3cg6mZ9cc Here are the patches I had discussed earlier to address the issues in isolate_migratepages_block. They are based on the tree at: https://github.com/alexshi/linux.git lrunext The first patch is mostly cleanup to address the RCU locking in the function. The second addresses the test_and_set_skip issue, and the third relocates PageCompound. I did some digging into the history of the skip bits and since they are only supposed to be a hint I thought we could probably just drop the testing portion of the call since the LRU flag is preventing more than one thread from accessing the function anyway so it would make sense to just switch it to a set operation similar to what happens when low_pfn == end_pfn at the end of the call. I have only had a chance to build test these since rebasing on the tree. 
In addition I am not 100% certain the PageCompound changes are correct as they operate on the assumption that get_page_unless_zero is enough to keep a compound page from being split up. I plan on doing some testing tomorrow, but thought I would push these out now so that we could discuss them. --- Alexander Duyck (3): mm: Drop locked from isolate_migratepages_block mm: Drop use of test_and_set_skip in favor of just setting skip mm: Identify compound pages sooner in isolate_migratepages_block mm/compaction.c | 126 +++++++++++++++++++------------------------------------ 1 file changed, 44 insertions(+), 82 deletions(-) ^ permalink raw reply [flat|nested] 101+ messages in thread
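The compound-page relocation in patch 3 builds on the pfn arithmetic discussed earlier in the thread. As a standalone illustration (not kernel code; the helper name is made up), this is how advancing `low_pfn` skips a whole huge page in one jump:

```c
/* Illustrative sketch of the compound-page skip arithmetic: when the
 * scanner meets a compound page of the given order, it advances
 * low_pfn past the whole page in one jump.  The "- 1" accounts for
 * the loop's own low_pfn++ at the end of each iteration. */
static unsigned long skip_compound(unsigned long low_pfn, unsigned int order)
{
	return low_pfn + (1UL << order) - 1;
}
```

For an order-9 THP (512 base pages) met at pfn 1024, the function returns 1535, so after the loop increment the next pfn actually scanned is 1536.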
[parent not found: <20200813035100.13054.25671.stgit-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>]
* [RFC PATCH 1/3] mm: Drop locked from isolate_migratepages_block [not found] ` <20200813035100.13054.25671.stgit-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org> @ 2020-08-13 4:02 ` Alexander Duyck [not found] ` <20200813040224.13054.96724.stgit-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org> 2020-08-13 4:02 ` [RFC PATCH 2/3] mm: Drop use of test_and_set_skip in favor of just setting skip Alexander Duyck 2020-08-13 4:02 ` [RFC PATCH 3/3] mm: Identify compound pages sooner in isolate_migratepages_block Alexander Duyck 2 siblings, 1 reply; 101+ messages in thread From: Alexander Duyck @ 2020-08-13 4:02 UTC (permalink / raw) To: alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf Cc: yang.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf, lkp-ral2JQCrhuEAvxtiuMwx3w, rong.a.chen-ral2JQCrhuEAvxtiuMwx3w, khlebnikov-XoJtRXgx1JseBXzfvpsJ4g, kirill-oKw7cIdHH8eLwutG50LtGA, hughd-hpIqsD4AKlfQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, alexander.duyck-Re5JQEeQqe8AvxtiuMwx3w, daniel.m.jordan-QHcLZuEGTsvQT0dZR+AlfA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, shakeelb-hpIqsD4AKlfQT0dZR+AlfA, willy-wEGCiKHe2LqWVfeAwA7xHQ, hannes-druUgvl0LCNAfugRpC6u6w, tj-DgEjT+Ai2ygdnm+yROfE0A, cgroups-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, richard.weiyang-Re5JQEeQqe8AvxtiuMwx3w, mgorman-3eNAlZScCAx27rWaFMvyedHuzzzSOjJt, iamjoonsoo.kim-Hm3cg6mZ9cc From: Alexander Duyck <alexander.h.duyck-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> We can drop the need for the locked variable by making use of the lruvec_holds_page_lru_lock function. By doing this we can avoid some rcu locking ugliness for the case where the lruvec is still holding the LRU lock associated with the page. Instead we can just use the lruvec and if it is NULL we assume the lock was released. 
Signed-off-by: Alexander Duyck <alexander.h.duyck-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> --- mm/compaction.c | 45 ++++++++++++++++++++------------------------- 1 file changed, 20 insertions(+), 25 deletions(-) diff --git a/mm/compaction.c b/mm/compaction.c index b99c96c4862d..5021a18ef722 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -803,9 +803,8 @@ static bool too_many_isolated(pg_data_t *pgdat) { pg_data_t *pgdat = cc->zone->zone_pgdat; unsigned long nr_scanned = 0, nr_isolated = 0; - struct lruvec *lruvec; + struct lruvec *lruvec = NULL; unsigned long flags = 0; - struct lruvec *locked = NULL; struct page *page = NULL, *valid_page = NULL; unsigned long start_pfn = low_pfn; bool skip_on_failure = false; @@ -866,9 +865,9 @@ static bool too_many_isolated(pg_data_t *pgdat) * a fatal signal is pending. */ if (!(low_pfn % SWAP_CLUSTER_MAX)) { - if (locked) { - unlock_page_lruvec_irqrestore(locked, flags); - locked = NULL; + if (lruvec) { + unlock_page_lruvec_irqrestore(lruvec, flags); + lruvec = NULL; } if (fatal_signal_pending(current)) { @@ -949,9 +948,9 @@ static bool too_many_isolated(pg_data_t *pgdat) */ if (unlikely(__PageMovable(page)) && !PageIsolated(page)) { - if (locked) { - unlock_page_lruvec_irqrestore(locked, flags); - locked = NULL; + if (lruvec) { + unlock_page_lruvec_irqrestore(lruvec, flags); + lruvec = NULL; } if (!isolate_movable_page(page, isolate_mode)) @@ -992,16 +991,13 @@ static bool too_many_isolated(pg_data_t *pgdat) if (!TestClearPageLRU(page)) goto isolate_fail_put; - rcu_read_lock(); - lruvec = mem_cgroup_page_lruvec(page, pgdat); - /* If we already hold the lock, we can skip some rechecking */ - if (lruvec != locked) { - if (locked) - unlock_page_lruvec_irqrestore(locked, flags); + if (!lruvec || !lruvec_holds_page_lru_lock(page, lruvec)) { + if (lruvec) + unlock_page_lruvec_irqrestore(lruvec, flags); + lruvec = mem_cgroup_page_lruvec(page, pgdat); compact_lock_irqsave(&lruvec->lru_lock, &flags, cc); - locked = lruvec; 
rcu_read_unlock(); lruvec_memcg_debug(lruvec, page); @@ -1023,8 +1019,7 @@ static bool too_many_isolated(pg_data_t *pgdat) SetPageLRU(page); goto isolate_fail_put; } - } else - rcu_read_unlock(); + } /* The whole page is taken off the LRU; skip the tail pages. */ if (PageCompound(page)) @@ -1057,9 +1052,9 @@ static bool too_many_isolated(pg_data_t *pgdat) isolate_fail_put: /* Avoid potential deadlock in freeing page under lru_lock */ - if (locked) { - unlock_page_lruvec_irqrestore(locked, flags); - locked = NULL; + if (lruvec) { + unlock_page_lruvec_irqrestore(lruvec, flags); + lruvec = NULL; } put_page(page); @@ -1073,9 +1068,9 @@ static bool too_many_isolated(pg_data_t *pgdat) * page anyway. */ if (nr_isolated) { - if (locked) { - unlock_page_lruvec_irqrestore(locked, flags); - locked = NULL; + if (lruvec) { + unlock_page_lruvec_irqrestore(lruvec, flags); + lruvec = NULL; } putback_movable_pages(&cc->migratepages); cc->nr_migratepages = 0; @@ -1102,8 +1097,8 @@ static bool too_many_isolated(pg_data_t *pgdat) page = NULL; isolate_abort: - if (locked) - unlock_page_lruvec_irqrestore(locked, flags); + if (lruvec) + unlock_page_lruvec_irqrestore(lruvec, flags); if (page) { SetPageLRU(page); put_page(page); ^ permalink raw reply related [flat|nested] 101+ messages in thread
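The bookkeeping pattern this patch adopts, where the lruvec pointer itself records whether its lock is held (NULL meaning "not held"), can be sketched outside the kernel. The types below are illustrative stand-ins; a plain flag models the spinlock state:

```c
#include <stdbool.h>
#include <stddef.h>

/* Standalone sketch, not kernel code: 'locked' stands in for the real
 * lru_lock state. */
struct lruvec_model {
	bool locked;
};

/* Release whichever lruvec lock we hold, if any, and forget it. */
static void unlock_if_held(struct lruvec_model **held)
{
	if (*held) {
		(*held)->locked = false;
		*held = NULL;
	}
}

/* Switch to a new lruvec's lock only when it differs from the one we
 * already hold -- the "skip some rechecking" fast path in the patch. */
static void lock_lruvec(struct lruvec_model **held, struct lruvec_model *want)
{
	if (*held == want)
		return;
	unlock_if_held(held);
	want->locked = true;
	*held = want;
}
```

With this shape there is no separate `locked` variable to keep in sync: testing `held` against NULL answers "do we hold a lock" and comparing it against a candidate answers "do we hold the right one".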
[parent not found: <20200813040224.13054.96724.stgit-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>]
* Re: [RFC PATCH 1/3] mm: Drop locked from isolate_migratepages_block
  [not found]     ` <20200813040224.13054.96724.stgit-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
@ 2020-08-13  6:56       ` Alex Shi
  [not found]         ` <8ea9e186-b223-fb1b-5c82-2aa43c5e9f10-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org>
  2020-08-13  7:44       ` Alex Shi
  1 sibling, 1 reply; 101+ messages in thread
From: Alex Shi @ 2020-08-13  6:56 UTC (permalink / raw)
To: Alexander Duyck

On 2020/8/13 12:02 PM, Alexander Duyck wrote:
> From: Alexander Duyck <alexander.h.duyck-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
>
> We can drop the need for the locked variable by making use of the
> lruvec_holds_page_lru_lock function. By doing this we can avoid some rcu
> locking ugliness for the case where the lruvec is still holding the LRU
> lock associated with the page. Instead we can just use the lruvec and if it
> is NULL we assume the lock was released.
>
> Signed-off-by: Alexander Duyck <alexander.h.duyck-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
> ---
>  mm/compaction.c |   45 ++++++++++++++++++++-------------------------
>  1 file changed, 20 insertions(+), 25 deletions(-)

Thanks a lot!
Would the community be OK with keeping this patch as a standalone follow-up
to the whole patchset?

^ permalink raw reply	[flat|nested] 101+ messages in thread
[parent not found: <8ea9e186-b223-fb1b-5c82-2aa43c5e9f10-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org>]
* Re: [RFC PATCH 1/3] mm: Drop locked from isolate_migratepages_block [not found] ` <8ea9e186-b223-fb1b-5c82-2aa43c5e9f10-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> @ 2020-08-13 14:32 ` Alexander Duyck [not found] ` <CAKgT0UcRFqXUOJ+QjgtjdQE6A7EMgAc_v9b7+mXy-ZJLvG2AgQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 101+ messages in thread From: Alexander Duyck @ 2020-08-13 14:32 UTC (permalink / raw) To: Alex Shi Cc: Yang Shi, kbuild test robot, Rong Chen, Konstantin Khlebnikov, Kirill A. Shutemov, Hugh Dickins, LKML, Daniel Jordan, linux-mm, Shakeel Butt, Matthew Wilcox, Johannes Weiner, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Andrew Morton, Wei Yang, Mel Gorman, Joonsoo Kim On Wed, Aug 12, 2020 at 11:57 PM Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> wrote: > > > > 在 2020/8/13 下午12:02, Alexander Duyck 写道: > > From: Alexander Duyck <alexander.h.duyck-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> > > > > We can drop the need for the locked variable by making use of the > > lruvec_holds_page_lru_lock function. By doing this we can avoid some rcu > > locking ugliness for the case where the lruvec is still holding the LRU > > lock associated with the page. Instead we can just use the lruvec and if it > > is NULL we assume the lock was released. > > > > Signed-off-by: Alexander Duyck <alexander.h.duyck-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> > > --- > > mm/compaction.c | 45 ++++++++++++++++++++------------------------- > > 1 file changed, 20 insertions(+), 25 deletions(-) > > Thanks a lot! > Don't know if community is ok if we keep the patch following whole patchset alone? I am fine with you squashing it with another patch if you want. In theory this could probably be squashed in with the earlier patch I submitted that introduced lruvec_holds_page_lru_lock or some other patch. It is mostly just a cleanup anyway as it gets us away from needing to hold the RCU read lock in the case that we already have the correct lruvec. 
^ permalink raw reply [flat|nested] 101+ messages in thread
[parent not found: <CAKgT0UcRFqXUOJ+QjgtjdQE6A7EMgAc_v9b7+mXy-ZJLvG2AgQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [RFC PATCH 1/3] mm: Drop locked from isolate_migratepages_block
  [not found]             ` <CAKgT0UcRFqXUOJ+QjgtjdQE6A7EMgAc_v9b7+mXy-ZJLvG2AgQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2020-08-14  7:25               ` Alex Shi
  0 siblings, 0 replies; 101+ messages in thread
From: Alex Shi @ 2020-08-14  7:25 UTC (permalink / raw)
To: Alexander Duyck

On 2020/8/13 10:32 PM, Alexander Duyck wrote:
>> Thanks a lot!
>> Would the community be OK with keeping this patch as a standalone follow-up
>> to the whole patchset?
> I am fine with you squashing it with another patch if you want. In
> theory this could probably be squashed in with the earlier patch I
> submitted that introduced lruvec_holds_page_lru_lock or some other
> patch. It is mostly just a cleanup anyway as it gets us away from
> needing to hold the RCU read lock in the case that we already have the
> correct lruvec.

Hi Alexander,

Thanks a lot! It looks like it's better to fold it into patch 17.

Alex

^ permalink raw reply	[flat|nested] 101+ messages in thread
* Re: [RFC PATCH 1/3] mm: Drop locked from isolate_migratepages_block
  [not found]     ` <20200813040224.13054.96724.stgit-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
  2020-08-13  6:56       ` Alex Shi
@ 2020-08-13  7:44       ` Alex Shi
  2020-08-13 14:26         ` Alexander Duyck
  1 sibling, 1 reply; 101+ messages in thread
From: Alex Shi @ 2020-08-13  7:44 UTC (permalink / raw)
To: Alexander Duyck

On 2020/8/13 12:02 PM, Alexander Duyck wrote:
> -		rcu_read_lock();
> -		lruvec = mem_cgroup_page_lruvec(page, pgdat);
> -
>  		/* If we already hold the lock, we can skip some rechecking */
> -		if (lruvec != locked) {
> -			if (locked)
> -				unlock_page_lruvec_irqrestore(locked, flags);
> +		if (!lruvec || !lruvec_holds_page_lru_lock(page, lruvec)) {

Oops, lruvec_holds_page_lru_lock() needs rcu_read_lock().
> +			if (lruvec)
> +				unlock_page_lruvec_irqrestore(lruvec, flags);
>
> +			lruvec = mem_cgroup_page_lruvec(page, pgdat);
>  			compact_lock_irqsave(&lruvec->lru_lock, &flags, cc);
> -			locked = lruvec;
>  			rcu_read_unlock();
>

and some bugs:
[  534.564741] CPU: 23 PID: 545 Comm: kcompactd1 Kdump: loaded Tainted: G S      W    5.8.0-next-20200803-00028-g9a7ff2cd6e5c #85
[  534.577320] Hardware name: Alibaba Alibaba Cloud ECS/Alibaba Cloud ECS, BIOS 1.0.PL.IP.P.027.02 05/29/2020
[  534.587693] Call Trace:
[  534.590522]  dump_stack+0x96/0xd0
[  534.594231]  ___might_sleep.cold.90+0xff/0x115
[  534.599102]  kcompactd+0x24b/0x370
[  534.602904]  ? finish_wait+0x80/0x80
[  534.606897]  ? kcompactd_do_work+0x3d0/0x3d0
[  534.611566]  kthread+0x14e/0x170
[  534.615182]  ? kthread_park+0x80/0x80
[  534.619252]  ret_from_fork+0x1f/0x30
[  535.629483] BUG: sleeping function called from invalid context at include/linux/freezer.h:57
[  535.638691] in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 545, name: kcompactd1
[  535.647601] INFO: lockdep is turned off.

^ permalink raw reply	[flat|nested] 101+ messages in thread
* Re: [RFC PATCH 1/3] mm: Drop locked from isolate_migratepages_block 2020-08-13 7:44 ` Alex Shi @ 2020-08-13 14:26 ` Alexander Duyck 0 siblings, 0 replies; 101+ messages in thread From: Alexander Duyck @ 2020-08-13 14:26 UTC (permalink / raw) To: Alex Shi Cc: Yang Shi, kbuild test robot, Rong Chen, Konstantin Khlebnikov, Kirill A. Shutemov, Hugh Dickins, LKML, Daniel Jordan, linux-mm, Shakeel Butt, Matthew Wilcox, Johannes Weiner, Tejun Heo, cgroups, Andrew Morton, Wei Yang, Mel Gorman, Joonsoo Kim On Thu, Aug 13, 2020 at 12:45 AM Alex Shi <alex.shi@linux.alibaba.com> wrote: > > > > 在 2020/8/13 下午12:02, Alexander Duyck 写道: > > - rcu_read_lock(); > > - lruvec = mem_cgroup_page_lruvec(page, pgdat); > > - > > /* If we already hold the lock, we can skip some rechecking */ > > - if (lruvec != locked) { > > - if (locked) > > - unlock_page_lruvec_irqrestore(locked, flags); > > + if (!lruvec || !lruvec_holds_page_lru_lock(page, lruvec)) { > > Ops, lruvec_holds_page_lru_lock need rcu_read_lock. How so? The reason I wrote lruvec_holds_page_lru_lock the way I did is that it is simply comparing the pointers held by the page and the lruvec. It is never actually accessing any of the values, just the pointers. As such we should be able to compare the two since the lruvec is still locked and the the memcg and pgdat held by the lruvec should not be changed. Likewise with the page pointers assuming the values match. 
> > + if (lruvec) > > + unlock_page_lruvec_irqrestore(lruvec, flags); > > > > + lruvec = mem_cgroup_page_lruvec(page, pgdat); > > compact_lock_irqsave(&lruvec->lru_lock, &flags, cc); > > - locked = lruvec; > > rcu_read_unlock(); > > > > and some bugs: > [ 534.564741] CPU: 23 PID: 545 Comm: kcompactd1 Kdump: loaded Tainted: G S W 5.8.0-next-20200803-00028-g9a7ff2cd6e5c #85 > [ 534.577320] Hardware name: Alibaba Alibaba Cloud ECS/Alibaba Cloud ECS, BIOS 1.0.PL.IP.P.027.02 05/29/2020 > [ 534.587693] Call Trace: > [ 534.590522] dump_stack+0x96/0xd0 > [ 534.594231] ___might_sleep.cold.90+0xff/0x115 > [ 534.599102] kcompactd+0x24b/0x370 > [ 534.602904] ? finish_wait+0x80/0x80 > [ 534.606897] ? kcompactd_do_work+0x3d0/0x3d0 > [ 534.611566] kthread+0x14e/0x170 > [ 534.615182] ? kthread_park+0x80/0x80 > [ 534.619252] ret_from_fork+0x1f/0x30 > [ 535.629483] BUG: sleeping function called from invalid context at include/linux/freezer.h:57 > [ 535.638691] in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 545, name: kcompactd1 > [ 535.647601] INFO: lockdep is turned off. Ah, I see the bug now. It isn't the lruvec_holds_page_lru_lock that needs the LRU lock. This is an issue as a part of a merge conflict. There should have been an rcu_read_lock added before mem_cgroup_page_lruvec. ^ permalink raw reply [flat|nested] 101+ messages in thread
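The pointer-only comparison described in this reply can be shown in isolation. The sketch below is standalone userspace C with illustrative stand-in types, not the kernel's actual implementation: the point is that deciding whether the held lruvec covers a page only compares the memcg and pgdat pointers recorded on both sides, without dereferencing either.

```c
#include <stdbool.h>

/* Illustrative stand-ins; in the kernel these are struct mem_cgroup
 * and pg_data_t. */
struct memcg_model { int id; };
struct pgdat_model { int node; };

struct page_info {
	struct memcg_model *memcg;
	struct pgdat_model *pgdat;
};

struct lruvec_info {
	struct memcg_model *memcg;
	struct pgdat_model *pgdat;
};

/* Model of lruvec_holds_page_lru_lock(): pure pointer comparison,
 * never following the pointers. */
static bool lruvec_holds_page_lru_lock_model(const struct page_info *page,
					     const struct lruvec_info *lruvec)
{
	return page->memcg == lruvec->memcg && page->pgdat == lruvec->pgdat;
}
```

Because nothing is dereferenced, the check itself does not need RCU protection; as the follow-up notes, the missing rcu_read_lock belonged before the subsequent mem_cgroup_page_lruvec() call, which does follow the pointers.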
* [RFC PATCH 2/3] mm: Drop use of test_and_set_skip in favor of just setting skip [not found] ` <20200813035100.13054.25671.stgit-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org> 2020-08-13 4:02 ` [RFC PATCH 1/3] mm: Drop locked from isolate_migratepages_block Alexander Duyck @ 2020-08-13 4:02 ` Alexander Duyck [not found] ` <20200813040232.13054.82417.stgit-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org> 2020-08-13 4:02 ` [RFC PATCH 3/3] mm: Identify compound pages sooner in isolate_migratepages_block Alexander Duyck 2 siblings, 1 reply; 101+ messages in thread From: Alexander Duyck @ 2020-08-13 4:02 UTC (permalink / raw) To: alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf Cc: yang.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf, lkp-ral2JQCrhuEAvxtiuMwx3w, rong.a.chen-ral2JQCrhuEAvxtiuMwx3w, khlebnikov-XoJtRXgx1JseBXzfvpsJ4g, kirill-oKw7cIdHH8eLwutG50LtGA, hughd-hpIqsD4AKlfQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, alexander.duyck-Re5JQEeQqe8AvxtiuMwx3w, daniel.m.jordan-QHcLZuEGTsvQT0dZR+AlfA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, shakeelb-hpIqsD4AKlfQT0dZR+AlfA, willy-wEGCiKHe2LqWVfeAwA7xHQ, hannes-druUgvl0LCNAfugRpC6u6w, tj-DgEjT+Ai2ygdnm+yROfE0A, cgroups-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, richard.weiyang-Re5JQEeQqe8AvxtiuMwx3w, mgorman-3eNAlZScCAx27rWaFMvyedHuzzzSOjJt, iamjoonsoo.kim-Hm3cg6mZ9cc From: Alexander Duyck <alexander.h.duyck-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> The only user of test_and_set_skip was isolate_migratepages_block and it was using it after a call that was testing and clearing the LRU flag. As such it really didn't need to be behind the LRU lock anymore as it wasn't really fulfilling its purpose. With that being the case we can simply drop the bit and instead directly just call the set_pageblock_skip function if the page we are working on is the valid_page at the start of the pageblock. 
It shouldn't be possible for us to encounter the bit being set since we obtained the LRU flag for the first page in the pageblock which means we would have exclusive access to setting the skip bit. As such we don't need to worry about the abort case since no other thread will be able to call what used to be test_and_set_skip. Since we have dropped the late abort case we can drop the code that was clearing the LRU flag and calling page_put since the abort case will now not be holding a reference to a page. Signed-off-by: Alexander Duyck <alexander.h.duyck-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> --- mm/compaction.c | 50 +++++++------------------------------------------- 1 file changed, 7 insertions(+), 43 deletions(-) diff --git a/mm/compaction.c b/mm/compaction.c index 5021a18ef722..c1e9918f9dd4 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -399,29 +399,6 @@ void reset_isolation_suitable(pg_data_t *pgdat) } } -/* - * Sets the pageblock skip bit if it was clear. Note that this is a hint as - * locks are not required for read/writers. Returns true if it was already set. 
- */ -static bool test_and_set_skip(struct compact_control *cc, struct page *page, - unsigned long pfn) -{ - bool skip; - - /* Do no update if skip hint is being ignored */ - if (cc->ignore_skip_hint) - return false; - - if (!IS_ALIGNED(pfn, pageblock_nr_pages)) - return false; - - skip = get_pageblock_skip(page); - if (!skip && !cc->no_set_skip_hint) - skip = !set_pageblock_skip(page); - - return skip; -} - static void update_cached_migrate(struct compact_control *cc, unsigned long pfn) { struct zone *zone = cc->zone; @@ -480,12 +457,6 @@ static inline void update_pageblock_skip(struct compact_control *cc, static void update_cached_migrate(struct compact_control *cc, unsigned long pfn) { } - -static bool test_and_set_skip(struct compact_control *cc, struct page *page, - unsigned long pfn) -{ - return false; -} #endif /* CONFIG_COMPACTION */ /* @@ -895,7 +866,6 @@ static bool too_many_isolated(pg_data_t *pgdat) if (!valid_page && IS_ALIGNED(low_pfn, pageblock_nr_pages)) { if (!cc->ignore_skip_hint && get_pageblock_skip(page)) { low_pfn = end_pfn; - page = NULL; goto isolate_abort; } valid_page = page; @@ -991,6 +961,13 @@ static bool too_many_isolated(pg_data_t *pgdat) if (!TestClearPageLRU(page)) goto isolate_fail_put; + /* Indicate that we want exclusive access to this pageblock */ + if (page == valid_page) { + skip_updated = true; + if (!cc->ignore_skip_hint) + set_pageblock_skip(page); + } + /* If we already hold the lock, we can skip some rechecking */ if (!lruvec || !lruvec_holds_page_lru_lock(page, lruvec)) { if (lruvec) @@ -1002,13 +979,6 @@ static bool too_many_isolated(pg_data_t *pgdat) lruvec_memcg_debug(lruvec, page); - /* Try get exclusive access under lock */ - if (!skip_updated) { - skip_updated = true; - if (test_and_set_skip(cc, page, low_pfn)) - goto isolate_abort; - } - /* * Page become compound since the non-locked check, * and it's on LRU. 
It can only be a THP so the order @@ -1094,15 +1064,9 @@ static bool too_many_isolated(pg_data_t *pgdat) if (unlikely(low_pfn > end_pfn)) low_pfn = end_pfn; - page = NULL; - isolate_abort: if (lruvec) unlock_page_lruvec_irqrestore(lruvec, flags); - if (page) { - SetPageLRU(page); - put_page(page); - } /* * Updated the cached scanner pfn once the pageblock has been scanned ^ permalink raw reply related [flat|nested] 101+ messages in thread
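The exclusivity argument in this patch rests on TestClearPageLRU() being an atomic test-and-clear: exactly one caller observes the bit set, so the winner can set the pageblock skip hint with a plain store. A standalone model of that primitive (userspace C, not kernel code; the flag layout is illustrative):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative stand-in for the PG_lru bit in page->flags. */
#define PG_LRU_BIT (1UL << 0)

/* Model of TestClearPageLRU(): atomically clear the bit and report
 * whether this caller was the one that saw it set.  At most one
 * concurrent caller can return true for a given page. */
static bool test_clear_lru_model(atomic_ulong *flags)
{
	unsigned long old = atomic_fetch_and(flags, ~PG_LRU_BIT);

	return (old & PG_LRU_BIT) != 0;
}
```

Since only the winner reaches the code that marks the pageblock, no other thread can race to set the skip bit for the same first page, which is why the test half of test_and_set_skip becomes redundant.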
[parent not found: <20200813040232.13054.82417.stgit-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>]
* Re: [RFC PATCH 2/3] mm: Drop use of test_and_set_skip in favor of just setting skip
  [not found]     ` <20200813040232.13054.82417.stgit-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
@ 2020-08-14  7:19       ` Alex Shi
  [not found]         ` <6c072332-ff16-757d-99dd-b8fbae131a0c-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org>
  2020-08-18  6:50       ` Alex Shi
  1 sibling, 1 reply; 101+ messages in thread
From: Alex Shi @ 2020-08-14  7:19 UTC (permalink / raw)
To: Alexander Duyck

On 2020/8/13 12:02 PM, Alexander Duyck wrote:
>
> Since we have dropped the late abort case we can drop the code that was
> clearing the LRU flag and calling page_put since the abort case will now
> not be holding a reference to a page.
>
> Signed-off-by: Alexander Duyck <alexander.h.duyck-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>

The case-lru-file-mmap-read case seems to drop about 3% with this patch in
rough testing on my 80-core machine.

Thanks
Alex

^ permalink raw reply	[flat|nested] 101+ messages in thread
[parent not found: <6c072332-ff16-757d-99dd-b8fbae131a0c-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org>]
* Re: [RFC PATCH 2/3] mm: Drop use of test_and_set_skip in favor of just setting skip [not found] ` <6c072332-ff16-757d-99dd-b8fbae131a0c-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> @ 2020-08-14 14:24 ` Alexander Duyck [not found] ` <CAKgT0Uf0TbRBVsuGZ1bgh5rdFp+vARkP=+GgD4-DP3Gy6cj+pA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 101+ messages in thread From: Alexander Duyck @ 2020-08-14 14:24 UTC (permalink / raw) To: Alex Shi Cc: Yang Shi, kbuild test robot, Rong Chen, Konstantin Khlebnikov, Kirill A. Shutemov, Hugh Dickins, LKML, Daniel Jordan, linux-mm, Shakeel Butt, Matthew Wilcox, Johannes Weiner, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Andrew Morton, Wei Yang, Mel Gorman, Joonsoo Kim On Fri, Aug 14, 2020 at 12:19 AM Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> wrote: > > > > 在 2020/8/13 下午12:02, Alexander Duyck 写道: > > > > Since we have dropped the late abort case we can drop the code that was > > clearing the LRU flag and calling page_put since the abort case will now > > not be holding a reference to a page. > > > > Signed-off-by: Alexander Duyck <alexander.h.duyck-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> > > seems the case-lru-file-mmap-read case drop about 3% on this patch in a rough testing. > on my 80 core machine. I'm not sure how it could have that much impact on the performance since the total effect would just be dropping what should be a redundant test since we tested the skip bit before we took the LRU bit, so we shouldn't need to test it again after. I finally got my test setup working last night. I'll have to do some testing in my environment and I can start trying to see what is going on. Thanks. - Alex ^ permalink raw reply [flat|nested] 101+ messages in thread
[parent not found: <CAKgT0Uf0TbRBVsuGZ1bgh5rdFp+vARkP=+GgD4-DP3Gy6cj+pA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [RFC PATCH 2/3] mm: Drop use of test_and_set_skip in favor of just setting skip [not found] ` <CAKgT0Uf0TbRBVsuGZ1bgh5rdFp+vARkP=+GgD4-DP3Gy6cj+pA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2020-08-14 21:15 ` Alexander Duyck [not found] ` <650ab639-e66f-5ca6-a9a5-31e61c134ae7@linux.alibaba.com> 0 siblings, 1 reply; 101+ messages in thread From: Alexander Duyck @ 2020-08-14 21:15 UTC (permalink / raw) To: Alex Shi Cc: Yang Shi, kbuild test robot, Rong Chen, Konstantin Khlebnikov, Kirill A. Shutemov, Hugh Dickins, LKML, Daniel Jordan, linux-mm, Shakeel Butt, Matthew Wilcox, Johannes Weiner, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Andrew Morton, Wei Yang, Mel Gorman, Joonsoo Kim On Fri, Aug 14, 2020 at 7:24 AM Alexander Duyck <alexander.duyck-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote: > > On Fri, Aug 14, 2020 at 12:19 AM Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> wrote: > > > > > > > > 在 2020/8/13 下午12:02, Alexander Duyck 写道: > > > > > > Since we have dropped the late abort case we can drop the code that was > > > clearing the LRU flag and calling page_put since the abort case will now > > > not be holding a reference to a page. > > > > > > Signed-off-by: Alexander Duyck <alexander.h.duyck-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> > > > > seems the case-lru-file-mmap-read case drop about 3% on this patch in a rough testing. > > on my 80 core machine. > > I'm not sure how it could have that much impact on the performance > since the total effect would just be dropping what should be a > redundant test since we tested the skip bit before we took the LRU > bit, so we shouldn't need to test it again after. > > I finally got my test setup working last night. I'll have to do some > testing in my environment and I can start trying to see what is going > on. So I ran the case-lru-file-mmap-read a few times and I don't see how it is supposed to be testing the compaction code. 
It doesn't seem like compaction is running, at least on my system, as a
result of the test script.

I wonder if testing this code wouldn't be better done using something
like thpscale from the mmtests (https://github.com/gormanm/mmtests)? It
seems past changes to the compaction code were tested using that, and
the config script for the test explains that it is designed specifically
to stress the compaction code. I have the test up and running now and
hope to collect results over the weekend.

There is one change I will probably make to this patch and that is to
place the new code that is setting skip_updated where the old code was
calling test_and_set_skip_bit. By doing that we can avoid extra checks
and it should help to reduce possible collisions when setting the skip
bit in the pageblock flags.

Thanks.

- Alex

^ permalink raw reply	[flat|nested] 101+ messages in thread
* Re: [RFC PATCH 2/3] mm: Drop use of test_and_set_skip in favor of just setting skip [not found] ` <650ab639-e66f-5ca6-a9a5-31e61c134ae7-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> @ 2020-08-17 15:38 ` Alexander Duyck 0 siblings, 0 replies; 101+ messages in thread From: Alexander Duyck @ 2020-08-17 15:38 UTC (permalink / raw) To: Alex Shi Cc: Yang Shi, kbuild test robot, Rong Chen, Konstantin Khlebnikov, Kirill A. Shutemov, Hugh Dickins, LKML, Daniel Jordan, linux-mm, Shakeel Butt, Matthew Wilcox, Johannes Weiner, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Andrew Morton, Wei Yang, Mel Gorman, Joonsoo Kim On Sat, Aug 15, 2020 at 2:51 AM Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> wrote: > > > > 在 2020/8/15 上午5:15, Alexander Duyck 写道: > > On Fri, Aug 14, 2020 at 7:24 AM Alexander Duyck > > <alexander.duyck-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote: > >> > >> On Fri, Aug 14, 2020 at 12:19 AM Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> wrote: > >>> > >>> > >>> > >>> 在 2020/8/13 下午12:02, Alexander Duyck 写道: > >>>> > >>>> Since we have dropped the late abort case we can drop the code that was > >>>> clearing the LRU flag and calling page_put since the abort case will now > >>>> not be holding a reference to a page. > >>>> > >>>> Signed-off-by: Alexander Duyck <alexander.h.duyck-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> > >>> > >>> seems the case-lru-file-mmap-read case drop about 3% on this patch in a rough testing. > >>> on my 80 core machine. > >> > >> I'm not sure how it could have that much impact on the performance > >> since the total effect would just be dropping what should be a > >> redundant test since we tested the skip bit before we took the LRU > >> bit, so we shouldn't need to test it again after. > >> > >> I finally got my test setup working last night. I'll have to do some > >> testing in my environment and I can start trying to see what is going > >> on. 
> >
> > So I ran the case-lru-file-mmap-read a few times and I don't see how
> > it is supposed to be testing the compaction code. It doesn't seem like
> > compaction is running, at least on my system, as a result of the test
> > script.
>
> attached my kernel config; it is used on my machine,

I'm just wondering what the margin of error is on the tests you are
running. What is the variance between runs? I wonder if 3% falls into
the range of noise, or of possible changes due to just code shifting
around. In order for the code to have shown any change it needs to be
run, and I didn't see the tests triggering compaction on my test system.
I'm wondering how much memory you have available in the system you were
testing on, such that the test was enough to trigger compaction?

> > I wonder if testing this code wouldn't be better done using
> > something like thpscale from the
> > mmtests (https://github.com/gormanm/mmtests)? It seems past changes to
> > the compaction code were tested using that, and the config script for
> > the test explains that it is designed specifically to stress the
> > compaction code. I have the test up and running now and hope to
> > collect results over the weekend.
>
> I did the testing, but an awkward thing is that I failed to get a
> result; maybe some packages are missing.

So one thing I noticed is that if you have over 128GB of memory in the
system it will fail unless you update the sysctl value vm.max_map_count.
It defaulted to somewhere close to 64K, and I increased it 20X to 1280K
in order for the test to run without failing on the mmap calls.

The other edit I had to make was to the config file, as the test system
I was on had about 1TB of RAM and my home partition only had about 800GB
to spare, so I had to reduce the map size from 8/10 to 5/8.

> # ../../compare-kernels.sh
>
> thpscale Fault Latencies
> Can't locate List/BinarySearch.pm in @INC (@INC contains: /root/mmtests/bin/lib /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vend.
> BEGIN failed--compilation aborted at /root/mmtests/bin/lib/MMTests/Stat.pm line 13.
> Compilation failed in require at /root/mmtests/work/log/../../bin/compare-mmtests.pl line 13.
> BEGIN failed--compilation aborted at /root/mmtests/work/log/../../bin/compare-mmtests.pl line 13.

I had to install List::BinarySearch.pm. It required installing the cpan
perl libraries.

> >
> > There is one change I will probably make to this patch and that is to
> > place the new code that is setting skip_updated where the old code was
> > calling test_and_set_skip_bit. By doing that we can avoid extra checks
> > and it should help to reduce possible collisions when setting the skip
> > bit in the pageblock flags.
>
> the problem may be on the cmpxchg of pb flags; that may involve other
> blocks' changes.

That is the only thing I can think of just based on code review.
Although that would imply multiple compaction threads are running, and
as I said, in my tests I never saw kcompactd wake up, so I don't think
the tests you were mentioning were enough to stress compaction.

^ permalink raw reply	[flat|nested] 101+ messages in thread
* Re: [RFC PATCH 2/3] mm: Drop use of test_and_set_skip in favor of just setting skip [not found] ` <20200813040232.13054.82417.stgit-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org> 2020-08-14 7:19 ` Alex Shi @ 2020-08-18 6:50 ` Alex Shi 1 sibling, 0 replies; 101+ messages in thread From: Alex Shi @ 2020-08-18 6:50 UTC (permalink / raw) To: Alexander Duyck Cc: yang.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf, lkp-ral2JQCrhuEAvxtiuMwx3w, rong.a.chen-ral2JQCrhuEAvxtiuMwx3w, khlebnikov-XoJtRXgx1JseBXzfvpsJ4g, kirill-oKw7cIdHH8eLwutG50LtGA, hughd-hpIqsD4AKlfQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, daniel.m.jordan-QHcLZuEGTsvQT0dZR+AlfA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, shakeelb-hpIqsD4AKlfQT0dZR+AlfA, willy-wEGCiKHe2LqWVfeAwA7xHQ, hannes-druUgvl0LCNAfugRpC6u6w, tj-DgEjT+Ai2ygdnm+yROfE0A, cgroups-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, richard.weiyang-Re5JQEeQqe8AvxtiuMwx3w, mgorman-3eNAlZScCAx27rWaFMvyedHuzzzSOjJt, iamjoonsoo.kim-Hm3cg6mZ9cc 在 2020/8/13 下午12:02, Alexander Duyck 写道: > From: Alexander Duyck <alexander.h.duyck-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> > > The only user of test_and_set_skip was isolate_migratepages_block and it > was using it after a call that was testing and clearing the LRU flag. As > such it really didn't need to be behind the LRU lock anymore as it wasn't > really fulfilling its purpose. > > With that being the case we can simply drop the bit and instead directly > just call the set_pageblock_skip function if the page we are working on is > the valid_page at the start of the pageblock. It shouldn't be possible for > us to encounter the bit being set since we obtained the LRU flag for the > first page in the pageblock which means we would have exclusive access to > setting the skip bit. As such we don't need to worry about the abort case > since no other thread will be able to call what used to be > test_and_set_skip. 
> > Since we have dropped the late abort case we can drop the code that was > clearing the LRU flag and calling page_put since the abort case will now > not be holding a reference to a page. > > Signed-off-by: Alexander Duyck <alexander.h.duyck-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> After my false sharing remove on pageblock_flags, this patch looks fine with a minor change > --- > mm/compaction.c | 50 +++++++------------------------------------------- > 1 file changed, 7 insertions(+), 43 deletions(-) > > diff --git a/mm/compaction.c b/mm/compaction.c > index 5021a18ef722..c1e9918f9dd4 100644 > --- a/mm/compaction.c > +++ b/mm/compaction.c > @@ -399,29 +399,6 @@ void reset_isolation_suitable(pg_data_t *pgdat) > } > } > > -/* > - * Sets the pageblock skip bit if it was clear. Note that this is a hint as > - * locks are not required for read/writers. Returns true if it was already set. > - */ > -static bool test_and_set_skip(struct compact_control *cc, struct page *page, > - unsigned long pfn) > -{ > - bool skip; > - > - /* Do no update if skip hint is being ignored */ > - if (cc->ignore_skip_hint) > - return false; > - > - if (!IS_ALIGNED(pfn, pageblock_nr_pages)) > - return false; > - > - skip = get_pageblock_skip(page); > - if (!skip && !cc->no_set_skip_hint) > - skip = !set_pageblock_skip(page); > - > - return skip; > -} > - > static void update_cached_migrate(struct compact_control *cc, unsigned long pfn) > { > struct zone *zone = cc->zone; > @@ -480,12 +457,6 @@ static inline void update_pageblock_skip(struct compact_control *cc, > static void update_cached_migrate(struct compact_control *cc, unsigned long pfn) > { > } > - > -static bool test_and_set_skip(struct compact_control *cc, struct page *page, > - unsigned long pfn) > -{ > - return false; > -} > #endif /* CONFIG_COMPACTION */ > > /* > @@ -895,7 +866,6 @@ static bool too_many_isolated(pg_data_t *pgdat) > if (!valid_page && IS_ALIGNED(low_pfn, pageblock_nr_pages)) { > if (!cc->ignore_skip_hint && 
get_pageblock_skip(page)) { > low_pfn = end_pfn; > - page = NULL; > goto isolate_abort; > } > valid_page = page; > @@ -991,6 +961,13 @@ static bool too_many_isolated(pg_data_t *pgdat) > if (!TestClearPageLRU(page)) > goto isolate_fail_put; > > + /* Indicate that we want exclusive access to this pageblock */ > + if (page == valid_page) { > + skip_updated = true; > + if (!cc->ignore_skip_hint) if (!cc->ignore_skip_hint && !cc->no_set_skip_hint) no_set_skip_hint needs to add here. Thanks Alex > + set_pageblock_skip(page); > + } > + > /* If we already hold the lock, we can skip some rechecking */ > if (!lruvec || !lruvec_holds_page_lru_lock(page, lruvec)) { > if (lruvec) > @@ -1002,13 +979,6 @@ static bool too_many_isolated(pg_data_t *pgdat) > > lruvec_memcg_debug(lruvec, page); > > - /* Try get exclusive access under lock */ > - if (!skip_updated) { > - skip_updated = true; > - if (test_and_set_skip(cc, page, low_pfn)) > - goto isolate_abort; > - } > - > /* > * Page become compound since the non-locked check, > * and it's on LRU. It can only be a THP so the order > @@ -1094,15 +1064,9 @@ static bool too_many_isolated(pg_data_t *pgdat) > if (unlikely(low_pfn > end_pfn)) > low_pfn = end_pfn; > > - page = NULL; > - > isolate_abort: > if (lruvec) > unlock_page_lruvec_irqrestore(lruvec, flags); > - if (page) { > - SetPageLRU(page); > - put_page(page); > - } > > /* > * Updated the cached scanner pfn once the pageblock has been scanned > ^ permalink raw reply [flat|nested] 101+ messages in thread
* [RFC PATCH 3/3] mm: Identify compound pages sooner in isolate_migratepages_block [not found] ` <20200813035100.13054.25671.stgit-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org> 2020-08-13 4:02 ` [RFC PATCH 1/3] mm: Drop locked from isolate_migratepages_block Alexander Duyck 2020-08-13 4:02 ` [RFC PATCH 2/3] mm: Drop use of test_and_set_skip in favor of just setting skip Alexander Duyck @ 2020-08-13 4:02 ` Alexander Duyck [not found] ` <20200813040240.13054.76770.stgit-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org> 2 siblings, 1 reply; 101+ messages in thread From: Alexander Duyck @ 2020-08-13 4:02 UTC (permalink / raw) To: alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf Cc: yang.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf, lkp-ral2JQCrhuEAvxtiuMwx3w, rong.a.chen-ral2JQCrhuEAvxtiuMwx3w, khlebnikov-XoJtRXgx1JseBXzfvpsJ4g, kirill-oKw7cIdHH8eLwutG50LtGA, hughd-hpIqsD4AKlfQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, alexander.duyck-Re5JQEeQqe8AvxtiuMwx3w, daniel.m.jordan-QHcLZuEGTsvQT0dZR+AlfA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, shakeelb-hpIqsD4AKlfQT0dZR+AlfA, willy-wEGCiKHe2LqWVfeAwA7xHQ, hannes-druUgvl0LCNAfugRpC6u6w, tj-DgEjT+Ai2ygdnm+yROfE0A, cgroups-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, richard.weiyang-Re5JQEeQqe8AvxtiuMwx3w, mgorman-3eNAlZScCAx27rWaFMvyedHuzzzSOjJt, iamjoonsoo.kim-Hm3cg6mZ9cc From: Alexander Duyck <alexander.h.duyck-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> Since we are holding a reference to the page much sooner in isolate_migratepages_block we could move the PageCompound check out of the LRU locked section and instead just place it after get_page_unless_zero. By doing this we can allow any of the items that might trigger a failure to trigger a failure for the compound page rather than the order 0 page and as a result we should be able to process the pageblock faster. In addition by testing for PageCompound sooner we can avoid having the LRU flag cleared and then reset in the exception case. 
As a result this should prevent any possible races where another thread might be attempting to pull the LRU pages from the list. Signed-off-by: Alexander Duyck <alexander.h.duyck-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> --- mm/compaction.c | 33 ++++++++++++++++++--------------- 1 file changed, 18 insertions(+), 15 deletions(-) diff --git a/mm/compaction.c b/mm/compaction.c index c1e9918f9dd4..3803f129fd6a 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -954,6 +954,24 @@ static bool too_many_isolated(pg_data_t *pgdat) if (unlikely(!get_page_unless_zero(page))) goto isolate_fail; + /* + * Page is compound. We know the order before we know if it is + * on the LRU so we cannot assume it is THP. However since the + * page will have the LRU validated shortly we can use the value + * to skip over this page for now or validate the LRU is set and + * then isolate the entire compound page if we are isolating to + * generate a CMA page. + */ + if (PageCompound(page)) { + const unsigned int order = compound_order(page); + + if (likely(order < MAX_ORDER)) + low_pfn += (1UL << order) - 1; + + if (!cc->alloc_contig) + goto isolate_fail_put; + } + if (__isolate_lru_page_prepare(page, isolate_mode) != 0) goto isolate_fail_put; @@ -978,23 +996,8 @@ static bool too_many_isolated(pg_data_t *pgdat) rcu_read_unlock(); lruvec_memcg_debug(lruvec, page); - - /* - * Page become compound since the non-locked check, - * and it's on LRU. It can only be a THP so the order - * is safe to read and it's 0 for tail pages. - */ - if (unlikely(PageCompound(page) && !cc->alloc_contig)) { - low_pfn += compound_nr(page) - 1; - SetPageLRU(page); - goto isolate_fail_put; - } } - /* The whole page is taken off the LRU; skip the tail pages. */ - if (PageCompound(page)) - low_pfn += compound_nr(page) - 1; - /* Successfully isolated */ del_page_from_lru_list(page, lruvec, page_lru(page)); mod_node_page_state(page_pgdat(page), ^ permalink raw reply related [flat|nested] 101+ messages in thread
* Re: [RFC PATCH 3/3] mm: Identify compound pages sooner in isolate_migratepages_block [not found] ` <20200813040240.13054.76770.stgit-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org> @ 2020-08-14 7:20 ` Alex Shi 0 siblings, 0 replies; 101+ messages in thread From: Alex Shi @ 2020-08-14 7:20 UTC (permalink / raw) To: Alexander Duyck Cc: yang.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf, lkp-ral2JQCrhuEAvxtiuMwx3w, rong.a.chen-ral2JQCrhuEAvxtiuMwx3w, khlebnikov-XoJtRXgx1JseBXzfvpsJ4g, kirill-oKw7cIdHH8eLwutG50LtGA, hughd-hpIqsD4AKlfQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, daniel.m.jordan-QHcLZuEGTsvQT0dZR+AlfA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, shakeelb-hpIqsD4AKlfQT0dZR+AlfA, willy-wEGCiKHe2LqWVfeAwA7xHQ, hannes-druUgvl0LCNAfugRpC6u6w, tj-DgEjT+Ai2ygdnm+yROfE0A, cgroups-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, richard.weiyang-Re5JQEeQqe8AvxtiuMwx3w, mgorman-3eNAlZScCAx27rWaFMvyedHuzzzSOjJt, iamjoonsoo.kim-Hm3cg6mZ9cc It has a slight performance drop too... Thanks Alex 在 2020/8/13 下午12:02, Alexander Duyck 写道: > Signed-off-by: Alexander Duyck <alexander.h.duyck-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> > --- > mm/compaction.c | 33 ++++++++++++++++++--------------- > 1 file changed, 18 insertions(+), 15 deletions(-) ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH v17 14/21] mm/compaction: do page isolation first in compaction
  [not found] ` <1595681998-19193-15-git-send-email-alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org>
  2020-08-04 21:35 ` Alexander Duyck
  2020-08-06 18:38 ` Alexander Duyck
@ 2020-08-17 22:58 ` Alexander Duyck
  2 siblings, 0 replies; 101+ messages in thread
From: Alexander Duyck @ 2020-08-17 22:58 UTC (permalink / raw)
To: Alex Shi
Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Shakeel Butt, Joonsoo Kim,
	Wei Yang, Kirill A. Shutemov, Rong Chen

> @@ -1691,17 +1680,34 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>  	 * only when the page is being freed somewhere else.
>  	 */
>  	scan += nr_pages;
> -	switch (__isolate_lru_page(page, mode)) {
> +	switch (__isolate_lru_page_prepare(page, mode)) {
>  	case 0:
> +		/*
> +		 * Be careful not to clear PageLRU until after we're
> +		 * sure the page is not being freed elsewhere -- the
> +		 * page release code relies on it.
> +		 */
> +		if (unlikely(!get_page_unless_zero(page)))
> +			goto busy;
> +
> +		if (!TestClearPageLRU(page)) {
> +			/*
> +			 * This page may in other isolation path,
> +			 * but we still hold lru_lock.
> +			 */
> +			put_page(page);
> +			goto busy;
> +		}
> +

So I was reviewing the code and came across this piece. It has me a bit
concerned, since we are calling put_page while holding the LRU lock,
which was taken before calling the function. We should be fine in terms
of not encountering a deadlock, since with the LRU bit cleared the page
shouldn't grab the LRU lock again; however, we could end up grabbing the
zone lock while holding the LRU lock, which would be an issue.
One other thought I had is that this might be safe because the
assumption would be that another thread is holding a reference on the
page, has already called TestClearPageLRU on the page and retrieved the
LRU bit, and is waiting on us to release the LRU lock before it can pull
the page off of the list. In that case put_page will never decrement the
reference count to 0. I believe that is the current case, but I cannot
be certain.

I'm just wondering if we should replace the put_page(page) with a
WARN_ON(put_page_testzero(page)) and a bit more documentation. If I am
not mistaken, it should never be possible for the reference count to
actually hit zero here.

Thanks.

- Alex

^ permalink raw reply	[flat|nested] 101+ messages in thread
* [PATCH v17 16/21] mm/swap: serialize memcg changes in pagevec_lru_move_fn
  2020-07-25 12:59 [PATCH v17 00/21] per memcg lru lock Alex Shi
  ` (9 preceding siblings ...)
  2020-07-25 12:59 ` [PATCH v17 14/21] mm/compaction: do page isolation first in compaction Alex Shi
@ 2020-07-25 12:59 ` Alex Shi
  2020-07-25 12:59 ` [PATCH v17 18/21] mm/lru: introduce the relock_page_lruvec function Alex Shi
  ` (3 subsequent siblings)
  14 siblings, 0 replies; 101+ messages in thread
From: Alex Shi @ 2020-07-25 12:59 UTC (permalink / raw)
To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen

Hugh Dickins found a memcg change bug in the original version: if we
want to change the pgdat->lru_lock to memcg's lruvec lock, we have to
serialize mem_cgroup_move_account against pagevec_lru_move_fn. A
possible bad scenario would look like:

	cpu 0					cpu 1
	lruvec = mem_cgroup_page_lruvec()
						if (!isolate_lru_page())
							mem_cgroup_move_account
	spin_lock_irqsave(&lruvec->lru_lock)	<== wrong lock.

So we need ClearPageLRU to block isolate_lru_page(), which serializes
the memcg change, and then the PageLRU check in the move_fn callees is
removed as a consequence.
Reported-by: Hugh Dickins <hughd@google.com> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org --- mm/swap.c | 46 ++++++++++++++++++++++++++++++++++++---------- 1 file changed, 36 insertions(+), 10 deletions(-) diff --git a/mm/swap.c b/mm/swap.c index 5092fe9c8c47..3029b3f74811 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -221,8 +221,14 @@ static void pagevec_lru_move_fn(struct pagevec *pvec, spin_lock_irqsave(&pgdat->lru_lock, flags); } + /* block memcg migration during page moving between lru */ + if (!TestClearPageLRU(page)) + continue; + lruvec = mem_cgroup_page_lruvec(page, pgdat); (*move_fn)(page, lruvec); + + SetPageLRU(page); } if (pgdat) spin_unlock_irqrestore(&pgdat->lru_lock, flags); @@ -232,7 +238,7 @@ static void pagevec_lru_move_fn(struct pagevec *pvec, static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec) { - if (PageLRU(page) && !PageUnevictable(page)) { + if (!PageUnevictable(page)) { del_page_from_lru_list(page, lruvec, page_lru(page)); ClearPageActive(page); add_page_to_lru_list_tail(page, lruvec, page_lru(page)); @@ -306,7 +312,7 @@ void lru_note_cost_page(struct page *page) static void __activate_page(struct page *page, struct lruvec *lruvec) { - if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) { + if (!PageActive(page) && !PageUnevictable(page)) { int lru = page_lru_base_type(page); int nr_pages = hpage_nr_pages(page); @@ -362,7 +368,8 @@ void activate_page(struct page *page) page = compound_head(page); spin_lock_irq(&pgdat->lru_lock); - __activate_page(page, mem_cgroup_page_lruvec(page, pgdat)); + if (PageLRU(page)) + __activate_page(page, mem_cgroup_page_lruvec(page, pgdat)); spin_unlock_irq(&pgdat->lru_lock); } #endif @@ -520,9 +527,6 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec) bool active; int nr_pages = 
hpage_nr_pages(page); - if (!PageLRU(page)) - return; - if (PageUnevictable(page)) return; @@ -563,7 +567,7 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec) static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec) { - if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) { + if (PageActive(page) && !PageUnevictable(page)) { int lru = page_lru_base_type(page); int nr_pages = hpage_nr_pages(page); @@ -580,7 +584,7 @@ static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec) static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec) { - if (PageLRU(page) && PageAnon(page) && PageSwapBacked(page) && + if (PageAnon(page) && PageSwapBacked(page) && !PageSwapCache(page) && !PageUnevictable(page)) { bool active = PageActive(page); int nr_pages = hpage_nr_pages(page); @@ -654,7 +658,7 @@ void deactivate_file_page(struct page *page) * In a workload with many unevictable page such as mprotect, * unevictable page deactivation for accelerating reclaim is pointless. 
*/ - if (PageUnevictable(page)) + if (PageUnevictable(page) || !PageLRU(page)) return; if (likely(get_page_unless_zero(page))) { @@ -976,7 +980,29 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec) */ void __pagevec_lru_add(struct pagevec *pvec) { - pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn); + int i; + struct pglist_data *pgdat = NULL; + struct lruvec *lruvec; + unsigned long flags = 0; + + for (i = 0; i < pagevec_count(pvec); i++) { + struct page *page = pvec->pages[i]; + struct pglist_data *pagepgdat = page_pgdat(page); + + if (pagepgdat != pgdat) { + if (pgdat) + spin_unlock_irqrestore(&pgdat->lru_lock, flags); + pgdat = pagepgdat; + spin_lock_irqsave(&pgdat->lru_lock, flags); + } + + lruvec = mem_cgroup_page_lruvec(page, pgdat); + __pagevec_lru_add_fn(page, lruvec); + } + if (pgdat) + spin_unlock_irqrestore(&pgdat->lru_lock, flags); + release_pages(pvec->pages, pvec->nr); + pagevec_reinit(pvec); } /** -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 101+ messages in thread
* [PATCH v17 18/21] mm/lru: introduce the relock_page_lruvec function 2020-07-25 12:59 [PATCH v17 00/21] per memcg lru lock Alex Shi ` (10 preceding siblings ...) 2020-07-25 12:59 ` [PATCH v17 16/21] mm/swap: serialize memcg changes in pagevec_lru_move_fn Alex Shi @ 2020-07-25 12:59 ` Alex Shi [not found] ` <1595681998-19193-19-git-send-email-alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> 2020-07-25 12:59 ` [PATCH v17 19/21] mm/vmscan: use relock for move_pages_to_lru Alex Shi ` (2 subsequent siblings) 14 siblings, 1 reply; 101+ messages in thread From: Alex Shi @ 2020-07-25 12:59 UTC (permalink / raw) To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi, willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck, rong.a.chen Cc: Thomas Gleixner, Andrey Ryabinin Use this new function to replace repeated same code, no func change. Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Hugh Dickins <hughd@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: linux-kernel@vger.kernel.org Cc: cgroups@vger.kernel.org Cc: linux-mm@kvack.org --- include/linux/memcontrol.h | 40 ++++++++++++++++++++++++++++++++++++++++ mm/mlock.c | 9 +-------- mm/swap.c | 33 +++++++-------------------------- mm/vmscan.c | 8 +------- 4 files changed, 49 insertions(+), 41 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 258901021c6c..6e670f991b42 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -1313,6 +1313,46 @@ static inline void unlock_page_lruvec_irqrestore(struct lruvec *lruvec, spin_unlock_irqrestore(&lruvec->lru_lock, flags); } 
+/* Don't lock again iff page's lruvec locked */ +static inline struct lruvec *relock_page_lruvec_irq(struct page *page, + struct lruvec *locked_lruvec) +{ + struct pglist_data *pgdat = page_pgdat(page); + bool locked; + + rcu_read_lock(); + locked = mem_cgroup_page_lruvec(page, pgdat) == locked_lruvec; + rcu_read_unlock(); + + if (locked) + return locked_lruvec; + + if (locked_lruvec) + unlock_page_lruvec_irq(locked_lruvec); + + return lock_page_lruvec_irq(page); +} + +/* Don't lock again iff page's lruvec locked */ +static inline struct lruvec *relock_page_lruvec_irqsave(struct page *page, + struct lruvec *locked_lruvec, unsigned long *flags) +{ + struct pglist_data *pgdat = page_pgdat(page); + bool locked; + + rcu_read_lock(); + locked = mem_cgroup_page_lruvec(page, pgdat) == locked_lruvec; + rcu_read_unlock(); + + if (locked) + return locked_lruvec; + + if (locked_lruvec) + unlock_page_lruvec_irqrestore(locked_lruvec, *flags); + + return lock_page_lruvec_irqsave(page, flags); +} + #ifdef CONFIG_CGROUP_WRITEBACK struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb); diff --git a/mm/mlock.c b/mm/mlock.c index 5d40d259a931..bc2fb3bfbe7a 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -303,17 +303,10 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone) /* Phase 1: page isolation */ for (i = 0; i < nr; i++) { struct page *page = pvec->pages[i]; - struct lruvec *new_lruvec; /* block memcg change in mem_cgroup_move_account */ lock_page_memcg(page); - new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page)); - if (new_lruvec != lruvec) { - if (lruvec) - unlock_page_lruvec_irq(lruvec); - lruvec = lock_page_lruvec_irq(page); - } - + lruvec = relock_page_lruvec_irq(page, lruvec); if (TestClearPageMlocked(page)) { /* * We already have pin from follow_page_mask() diff --git a/mm/swap.c b/mm/swap.c index 09edac441eb6..6d9c7288f7de 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -209,19 +209,12 @@ static void pagevec_lru_move_fn(struct pagevec *pvec, 
for (i = 0; i < pagevec_count(pvec); i++) { struct page *page = pvec->pages[i]; - struct lruvec *new_lruvec; /* block memcg migration during page moving between lru */ if (!TestClearPageLRU(page)) continue; - new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page)); - if (lruvec != new_lruvec) { - if (lruvec) - unlock_page_lruvec_irqrestore(lruvec, flags); - lruvec = lock_page_lruvec_irqsave(page, &flags); - } - + lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags); (*move_fn)(page, lruvec); SetPageLRU(page); @@ -864,17 +857,12 @@ void release_pages(struct page **pages, int nr) } if (PageLRU(page)) { - struct lruvec *new_lruvec; - - new_lruvec = mem_cgroup_page_lruvec(page, - page_pgdat(page)); - if (new_lruvec != lruvec) { - if (lruvec) - unlock_page_lruvec_irqrestore(lruvec, - flags); + struct lruvec *prev_lruvec = lruvec; + + lruvec = relock_page_lruvec_irqsave(page, lruvec, + &flags); + if (prev_lruvec != lruvec) lock_batch = 0; - lruvec = lock_page_lruvec_irqsave(page, &flags); - } __ClearPageLRU(page); del_page_from_lru_list(page, lruvec, page_off_lru(page)); @@ -980,15 +968,8 @@ void __pagevec_lru_add(struct pagevec *pvec) for (i = 0; i < pagevec_count(pvec); i++) { struct page *page = pvec->pages[i]; - struct lruvec *new_lruvec; - - new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page)); - if (lruvec != new_lruvec) { - if (lruvec) - unlock_page_lruvec_irqrestore(lruvec, flags); - lruvec = lock_page_lruvec_irqsave(page, &flags); - } + lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags); __pagevec_lru_add_fn(page, lruvec); } if (lruvec) diff --git a/mm/vmscan.c b/mm/vmscan.c index 168c1659e430..bdb53a678e7e 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -4292,15 +4292,9 @@ void check_move_unevictable_pages(struct pagevec *pvec) for (i = 0; i < pvec->nr; i++) { struct page *page = pvec->pages[i]; - struct lruvec *new_lruvec; pgscanned++; - new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page)); - if (lruvec != new_lruvec) { - if 
(lruvec) - unlock_page_lruvec_irq(lruvec); - lruvec = lock_page_lruvec_irq(page); - } + lruvec = relock_page_lruvec_irq(page, lruvec); if (!PageLRU(page) || !PageUnevictable(page)) continue; -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 101+ messages in thread
* Re: [PATCH v17 18/21] mm/lru: introduce the relock_page_lruvec function [not found] ` <1595681998-19193-19-git-send-email-alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> @ 2020-07-29 17:52 ` Alexander Duyck [not found] ` <CAKgT0UdFDcz=CQ+6mzcjh-apwy3UyPqAuOozvYr+2PSCNQrENA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2020-07-31 21:14 ` [PATCH RFC] mm: Add function for testing if the current lruvec lock is valid alexander.h.duyck-ral2JQCrhuEAvxtiuMwx3w 1 sibling, 1 reply; 101+ messages in thread From: Alexander Duyck @ 2020-07-29 17:52 UTC (permalink / raw) To: Alex Shi Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox, Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups-u79uwXL29TY76Z2rM5mHXA, Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov, Rong Chen, Thomas Gleixner, Andrey Ryabinin On Sat, Jul 25, 2020 at 6:00 AM Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> wrote: > > Use this new function to replace repeated same code, no func change. 
> > Signed-off-by: Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> > Cc: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org> > Cc: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> > Cc: Thomas Gleixner <tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org> > Cc: Andrey Ryabinin <aryabinin-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org> > Cc: Matthew Wilcox <willy-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> > Cc: Mel Gorman <mgorman-3eNAlZScCAx27rWaFMvyedHuzzzSOjJt@public.gmane.org> > Cc: Konstantin Khlebnikov <khlebnikov-XoJtRXgx1JseBXzfvpsJ4g@public.gmane.org> > Cc: Hugh Dickins <hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> > Cc: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> > Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > Cc: cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org > --- > include/linux/memcontrol.h | 40 ++++++++++++++++++++++++++++++++++++++++ > mm/mlock.c | 9 +-------- > mm/swap.c | 33 +++++++-------------------------- > mm/vmscan.c | 8 +------- > 4 files changed, 49 insertions(+), 41 deletions(-) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 258901021c6c..6e670f991b42 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -1313,6 +1313,46 @@ static inline void unlock_page_lruvec_irqrestore(struct lruvec *lruvec, > spin_unlock_irqrestore(&lruvec->lru_lock, flags); > } > > +/* Don't lock again iff page's lruvec locked */ > +static inline struct lruvec *relock_page_lruvec_irq(struct page *page, > + struct lruvec *locked_lruvec) > +{ > + struct pglist_data *pgdat = page_pgdat(page); > + bool locked; > + > + rcu_read_lock(); > + locked = mem_cgroup_page_lruvec(page, pgdat) == locked_lruvec; > + rcu_read_unlock(); > + > + if (locked) > + return locked_lruvec; > + > + if (locked_lruvec) > + unlock_page_lruvec_irq(locked_lruvec); > + > + return lock_page_lruvec_irq(page); > +} > + > 
+/* Don't lock again iff page's lruvec locked */ > +static inline struct lruvec *relock_page_lruvec_irqsave(struct page *page, > + struct lruvec *locked_lruvec, unsigned long *flags) > +{ > + struct pglist_data *pgdat = page_pgdat(page); > + bool locked; > + > + rcu_read_lock(); > + locked = mem_cgroup_page_lruvec(page, pgdat) == locked_lruvec; > + rcu_read_unlock(); > + > + if (locked) > + return locked_lruvec; > + > + if (locked_lruvec) > + unlock_page_lruvec_irqrestore(locked_lruvec, *flags); > + > + return lock_page_lruvec_irqsave(page, flags); > +} > + Looking these over, they seem to be pretty inefficient for what they do. Basically, in the worst case (locked_lruvec == NULL) you end up calling mem_cgroup_page_lruvec and doing the rcu_read_lock/unlock a couple of times for a single page. It might make more sense to structure this like: if (locked_lruvec) { if (lruvec_holds_page_lru_lock(page, locked_lruvec)) return locked_lruvec; unlock_page_lruvec_irqrestore(locked_lruvec, *flags); } return lock_page_lruvec_irqsave(page, flags); The other piece that has me scratching my head is that I wonder if we couldn't do this without needing the rcu_read_lock. For example, what if we were to compare the page's mem_cgroup pointer to the memcg back pointer stored in the mem_cgroup_per_node? It seems like ordering things this way would significantly reduce the overhead of the pointer chasing needed to see if the page is in the locked lruvec or not. 
> #ifdef CONFIG_CGROUP_WRITEBACK > > struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb); > diff --git a/mm/mlock.c b/mm/mlock.c > index 5d40d259a931..bc2fb3bfbe7a 100644 > --- a/mm/mlock.c > +++ b/mm/mlock.c > @@ -303,17 +303,10 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone) > /* Phase 1: page isolation */ > for (i = 0; i < nr; i++) { > struct page *page = pvec->pages[i]; > - struct lruvec *new_lruvec; > > /* block memcg change in mem_cgroup_move_account */ > lock_page_memcg(page); > - new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page)); > - if (new_lruvec != lruvec) { > - if (lruvec) > - unlock_page_lruvec_irq(lruvec); > - lruvec = lock_page_lruvec_irq(page); > - } > - > + lruvec = relock_page_lruvec_irq(page, lruvec); > if (TestClearPageMlocked(page)) { > /* > * We already have pin from follow_page_mask() > diff --git a/mm/swap.c b/mm/swap.c > index 09edac441eb6..6d9c7288f7de 100644 > --- a/mm/swap.c > +++ b/mm/swap.c > @@ -209,19 +209,12 @@ static void pagevec_lru_move_fn(struct pagevec *pvec, > > for (i = 0; i < pagevec_count(pvec); i++) { > struct page *page = pvec->pages[i]; > - struct lruvec *new_lruvec; > > /* block memcg migration during page moving between lru */ > if (!TestClearPageLRU(page)) > continue; > > - new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page)); > - if (lruvec != new_lruvec) { > - if (lruvec) > - unlock_page_lruvec_irqrestore(lruvec, flags); > - lruvec = lock_page_lruvec_irqsave(page, &flags); > - } > - > + lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags); > (*move_fn)(page, lruvec); > > SetPageLRU(page); > @@ -864,17 +857,12 @@ void release_pages(struct page **pages, int nr) > } > > if (PageLRU(page)) { > - struct lruvec *new_lruvec; > - > - new_lruvec = mem_cgroup_page_lruvec(page, > - page_pgdat(page)); > - if (new_lruvec != lruvec) { > - if (lruvec) > - unlock_page_lruvec_irqrestore(lruvec, > - flags); > + struct lruvec *prev_lruvec = lruvec; > + > + lruvec = 
relock_page_lruvec_irqsave(page, lruvec, > + &flags); > + if (prev_lruvec != lruvec) > lock_batch = 0; > - lruvec = lock_page_lruvec_irqsave(page, &flags); > - } > > __ClearPageLRU(page); > del_page_from_lru_list(page, lruvec, page_off_lru(page)); > @@ -980,15 +968,8 @@ void __pagevec_lru_add(struct pagevec *pvec) > > for (i = 0; i < pagevec_count(pvec); i++) { > struct page *page = pvec->pages[i]; > - struct lruvec *new_lruvec; > - > - new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page)); > - if (lruvec != new_lruvec) { > - if (lruvec) > - unlock_page_lruvec_irqrestore(lruvec, flags); > - lruvec = lock_page_lruvec_irqsave(page, &flags); > - } > > + lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags); > __pagevec_lru_add_fn(page, lruvec); > } > if (lruvec) > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 168c1659e430..bdb53a678e7e 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -4292,15 +4292,9 @@ void check_move_unevictable_pages(struct pagevec *pvec) > > for (i = 0; i < pvec->nr; i++) { > struct page *page = pvec->pages[i]; > - struct lruvec *new_lruvec; > > pgscanned++; > - new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page)); > - if (lruvec != new_lruvec) { > - if (lruvec) > - unlock_page_lruvec_irq(lruvec); > - lruvec = lock_page_lruvec_irq(page); > - } > + lruvec = relock_page_lruvec_irq(page, lruvec); > > if (!PageLRU(page) || !PageUnevictable(page)) > continue; > -- > 1.8.3.1 > ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH v17 18/21] mm/lru: introduce the relock_page_lruvec function [not found] ` <CAKgT0UdFDcz=CQ+6mzcjh-apwy3UyPqAuOozvYr+2PSCNQrENA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2020-07-30 6:08 ` Alex Shi [not found] ` <3345bfbf-ebe9-b5e0-a731-77dd7d76b0c9-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> 0 siblings, 1 reply; 101+ messages in thread From: Alex Shi @ 2020-07-30 6:08 UTC (permalink / raw) To: Alexander Duyck Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox, Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups-u79uwXL29TY76Z2rM5mHXA, Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov, Rong Chen, Thomas Gleixner, Andrey Ryabinin On 2020/7/30 at 1:52 AM, Alexander Duyck wrote: >> + rcu_read_lock(); >> + locked = mem_cgroup_page_lruvec(page, pgdat) == locked_lruvec; >> + rcu_read_unlock(); >> + >> + if (locked) >> + return locked_lruvec; >> + >> + if (locked_lruvec) >> + unlock_page_lruvec_irqrestore(locked_lruvec, *flags); >> + >> + return lock_page_lruvec_irqsave(page, flags); >> +} >> + > So looking these over they seem to be pretty inefficient for what they > do. Basically in the worst case (locked_lruvec == NULL) you end up calling > mem_cgroup_page_lruvec and all the rcu_read_lock/unlock a couple of times > for a single page. It might make more sense to structure this like: > if (locked_lruvec) { Uh, we still need to check this page's lruvec, and that needs an rcu_read_lock. To save a mem_cgroup_page_lruvec call, we have to open-code lock_page_lruvec, as you mentioned before. > if (lruvec_holds_page_lru_lock(page, locked_lruvec)) > return locked_lruvec; > > unlock_page_lruvec_irqrestore(locked_lruvec, *flags); > } > return lock_page_lruvec_irqsave(page, flags); > > The other piece that has me scratching my head is that I wonder if we > couldn't do this without needing the rcu_read_lock. 
For example, what > if we were to compare the page mem_cgroup pointer to the memcg back > pointer stored in the mem_cgroup_per_node? It seems like ordering > things this way would significantly reduce the overhead due to the > pointer chasing to see if the page is in the locked lruvec or not. > > If page->mem_cgroup is always charged, the following could be better. +/* Don't lock again iff page's lruvec locked */ +static inline struct lruvec *relock_page_lruvec_irqsave(struct page *page, + struct lruvec *locked_lruvec, unsigned long *flags) +{ + struct lruvec *lruvec; + + if (mem_cgroup_disabled()) + return locked_lruvec; + + /* user page always be charged */ + VM_BUG_ON_PAGE(!page->mem_cgroup, page); + + rcu_read_lock(); + if (likely(lruvec_memcg(locked_lruvec) == page->mem_cgroup)) { + rcu_read_unlock(); + return locked_lruvec; + } + + if (locked_lruvec) + unlock_page_lruvec_irqrestore(locked_lruvec, *flags); + + lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page)); + spin_lock_irqsave(&lruvec->lru_lock, *flags); + rcu_read_unlock(); + lruvec_memcg_debug(lruvec, page); + + return lruvec; +} + User pages are always charged, since readahead pages are charged now, and it looks like we can also apply the patch below. I will test it to see if there are other exceptions. commit 826128346e50f6c60c513e166998466b593becad Author: Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> Date: Thu Jul 30 13:58:38 2020 +0800 mm/memcg: remove useless check on page->mem_cgroup Since readahead pages will be charged to a memcg too, we don't need to check for this exception now. 
Signed-off-by: Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> diff --git a/mm/memcontrol.c b/mm/memcontrol.c index af96217f2ec5..0c7f6bed199b 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1336,12 +1336,6 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd VM_BUG_ON_PAGE(PageTail(page), page); memcg = READ_ONCE(page->mem_cgroup); - /* - * Swapcache readahead pages are added to the LRU - and - * possibly migrated - before they are charged. - */ - if (!memcg) - memcg = root_mem_cgroup; mz = mem_cgroup_page_nodeinfo(memcg, page); lruvec = &mz->lruvec; @@ -6962,10 +6956,7 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage) if (newpage->mem_cgroup) return; - /* Swapcache readahead pages can get replaced before being charged */ memcg = oldpage->mem_cgroup; - if (!memcg) - return; /* Force-charge the new page. The old one will be freed soon */ nr_pages = thp_nr_pages(newpage); @@ -7160,10 +7151,6 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry) memcg = page->mem_cgroup; - /* Readahead page, never charged */ - if (!memcg) - return; - /* * In case the memcg owning these pages has been offlined and doesn't * have an ID allocated to it anymore, charge the closest online ^ permalink raw reply related [flat|nested] 101+ messages in thread
* Re: [PATCH v17 18/21] mm/lru: introduce the relock_page_lruvec function [not found] ` <3345bfbf-ebe9-b5e0-a731-77dd7d76b0c9-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> @ 2020-07-31 14:20 ` Alexander Duyck 0 siblings, 0 replies; 101+ messages in thread From: Alexander Duyck @ 2020-07-31 14:20 UTC (permalink / raw) To: Alex Shi Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox, Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups-u79uwXL29TY76Z2rM5mHXA, Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov, Rong Chen, Thomas Gleixner, Andrey Ryabinin On Wed, Jul 29, 2020 at 11:08 PM Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> wrote: > > > > On 2020/7/30 at 1:52 AM, Alexander Duyck wrote: > >> + rcu_read_lock(); > >> + locked = mem_cgroup_page_lruvec(page, pgdat) == locked_lruvec; > >> + rcu_read_unlock(); > >> + > >> + if (locked) > >> + return locked_lruvec; > >> + > >> + if (locked_lruvec) > >> + unlock_page_lruvec_irqrestore(locked_lruvec, *flags); > >> + > >> + return lock_page_lruvec_irqsave(page, flags); > >> +} > >> + > > So looking these over they seem to be pretty inefficient for what they > > do. Basically in the worst case (locked_lruvec == NULL) you end up calling > > mem_cgroup_page_lruvec and all the rcu_read_lock/unlock a couple of times > > for a single page. It might make more sense to structure this like: > > if (locked_lruvec) { > > Uh, we still need to check this page's lruvec, and that needs an rcu_read_lock. > To save a mem_cgroup_page_lruvec call, we have to open-code lock_page_lruvec, > as you mentioned before. > > > if (lruvec_holds_page_lru_lock(page, locked_lruvec)) > > return locked_lruvec; > > > > unlock_page_lruvec_irqrestore(locked_lruvec, *flags); > > } > > return lock_page_lruvec_irqsave(page, flags); > > > > The other piece that has me scratching my head is that I wonder if we > > couldn't do this without needing the rcu_read_lock. 
For example, what > > if we were to compare the page mem_cgroup pointer to the memcg back > > pointer stored in the mem_cgroup_per_node? It seems like ordering > > things this way would significantly reduce the overhead due to the > > pointer chasing to see if the page is in the locked lruvec or not. > > > > If page->mem_cgroup is always charged, the following could be better. > > +/* Don't lock again iff page's lruvec locked */ > +static inline struct lruvec *relock_page_lruvec_irqsave(struct page *page, > + struct lruvec *locked_lruvec, unsigned long *flags) > +{ > + struct lruvec *lruvec; > + > + if (mem_cgroup_disabled()) > + return locked_lruvec; > + > + /* user page always be charged */ > + VM_BUG_ON_PAGE(!page->mem_cgroup, page); > + > + rcu_read_lock(); > + if (likely(lruvec_memcg(locked_lruvec) == page->mem_cgroup)) { > + rcu_read_unlock(); > + return locked_lruvec; > + } > + > + if (locked_lruvec) > + unlock_page_lruvec_irqrestore(locked_lruvec, *flags); > + > + lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page)); > + spin_lock_irqsave(&lruvec->lru_lock, *flags); > + rcu_read_unlock(); > + lruvec_memcg_debug(lruvec, page); > + > + return lruvec; > +} > + I understand that you have to use the rcu_lock when you want to acquire the lruvec via mem_cgroup_page_lruvec(). That is why I didn't do away with the call to lock_page_lruvec_irqsave() at the end of the function. However, it doesn't make sense to do it when you are already holding the locked_lruvec and simply getting the container of it in order to compare pointer values. One thing I was getting at with the lruvec_holds_page_lru_lock() function I had introduced in my example is that the code paths for the two relock functions are very similar. If we could move all the logic for identifying if we can reuse the lock into a single function, it would cut down on the redundancy quite a bit as well. 
In addition, by testing for locked_lruvec != NULL before we do the comparison, we can save ourselves some unnecessary testing in the case where it is NULL. The thought I had was to try to avoid the rcu_lock entirely. Basically you just need to compare the pgdat value and the memcg between the page and the lruvec. As long as they both point to the same values then you should have the correct lruvec and no need to relock. There is no need to take the rcu_lock as long as you aren't dereferencing something, and if you are just comparing the pointers that should be fine. The fallback if mem_cgroup_disabled() is to make certain the page's pgdat->__lruvec is the address belonging to the lruvec. > User pages are always charged, since readahead pages are charged now, > and it looks like we can also apply this patch. I will test it to see if there are > other exceptions. Yes, that would simplify things a bit, as the code I had was having to use a ternary to test for root_mem_cgroup if page->mem_cgroup was NULL. I should be able to finish up testing today and will submit a few clean-up patches as RFC to get your thoughts/feedback. > commit 826128346e50f6c60c513e166998466b593becad > Author: Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> > Date: Thu Jul 30 13:58:38 2020 +0800 > > mm/memcg: remove useless check on page->mem_cgroup > > Since readahead pages will be charged to a memcg too, we don't need to > check for this exception now. 
> > Signed-off-by: Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index af96217f2ec5..0c7f6bed199b 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -1336,12 +1336,6 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd > > VM_BUG_ON_PAGE(PageTail(page), page); > memcg = READ_ONCE(page->mem_cgroup); > - /* > - * Swapcache readahead pages are added to the LRU - and > - * possibly migrated - before they are charged. > - */ > - if (!memcg) > - memcg = root_mem_cgroup; > > mz = mem_cgroup_page_nodeinfo(memcg, page); > lruvec = &mz->lruvec; > @@ -6962,10 +6956,7 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage) > if (newpage->mem_cgroup) > return; > > - /* Swapcache readahead pages can get replaced before being charged */ > memcg = oldpage->mem_cgroup; > - if (!memcg) > - return; > > /* Force-charge the new page. The old one will be freed soon */ > nr_pages = thp_nr_pages(newpage); > @@ -7160,10 +7151,6 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry) > > memcg = page->mem_cgroup; > > - /* Readahead page, never charged */ > - if (!memcg) > - return; > - > /* > * In case the memcg owning these pages has been offlined and doesn't > * have an ID allocated to it anymore, charge the closest online > ^ permalink raw reply [flat|nested] 101+ messages in thread
* [PATCH RFC] mm: Add function for testing if the current lruvec lock is valid [not found] ` <1595681998-19193-19-git-send-email-alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> 2020-07-29 17:52 ` Alexander Duyck @ 2020-07-31 21:14 ` alexander.h.duyck-ral2JQCrhuEAvxtiuMwx3w [not found] ` <159622999150.2576729.14455020813024958573.stgit-+uVpp3jiz/RcxmDmkzA3yGt3HXsI98Cx0E9HWUfgJXw@public.gmane.org> 1 sibling, 1 reply; 101+ messages in thread From: alexander.h.duyck-ral2JQCrhuEAvxtiuMwx3w @ 2020-07-31 21:14 UTC (permalink / raw) To: alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, alexander.duyck-Re5JQEeQqe8AvxtiuMwx3w, aryabinin-5HdwGun5lf+gSpxsJD1C4w, cgroups-u79uwXL29TY76Z2rM5mHXA, daniel.m.jordan-QHcLZuEGTsvQT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w, hughd-hpIqsD4AKlfQT0dZR+AlfA, iamjoonsoo.kim-Hm3cg6mZ9cc, khlebnikov-XoJtRXgx1JseBXzfvpsJ4g, kirill-oKw7cIdHH8eLwutG50LtGA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, lkp-ral2JQCrhuEAvxtiuMwx3w, mgorman-3eNAlZScCAx27rWaFMvyedHuzzzSOjJt, richard.weiyang-Re5JQEeQqe8AvxtiuMwx3w, rong.a.chen-ral2JQCrhuEAvxtiuMwx3w, shakeelb-hpIqsD4AKlfQT0dZR+AlfA, tglx-hfZtesqFncYOwBW4kG4KsQ, tj-DgEjT+Ai2ygdnm+yROfE0A, willy-wEGCiKHe2LqWVfeAwA7xHQ, yang.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf From: Alexander Duyck <alexander.h.duyck-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> When testing for relock we can avoid the need for RCU locking if we simply compare the page pgdat and memcg pointers versus those that the lruvec is holding. By doing this we can avoid the extra pointer walks and accesses of the memory cgroup. In addition we can avoid the checks entirely if lruvec is currently NULL. 
Signed-off-by: Alexander Duyck <alexander.h.duyck-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> --- include/linux/memcontrol.h | 52 +++++++++++++++++++++++++++----------------- 1 file changed, 32 insertions(+), 20 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 6e670f991b42..7a02f00bf3de 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -405,6 +405,22 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg, struct lruvec *mem_cgroup_page_lruvec(struct page *, struct pglist_data *); +static inline bool lruvec_holds_page_lru_lock(struct page *page, + struct lruvec *lruvec) +{ + pg_data_t *pgdat = page_pgdat(page); + const struct mem_cgroup *memcg; + struct mem_cgroup_per_node *mz; + + if (mem_cgroup_disabled()) + return lruvec == &pgdat->__lruvec; + + mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec); + memcg = page->mem_cgroup ? : root_mem_cgroup; + + return lruvec->pgdat == pgdat && mz->memcg == memcg; +} + struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p); struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm); @@ -880,6 +896,14 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page, return &pgdat->__lruvec; } +static inline bool lruvec_holds_page_lru_lock(struct page *page, + struct lruvec *lruvec) +{ + pg_data_t *pgdat = page_pgdat(page); + + return lruvec == &pgdat->__lruvec; +} + static inline struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg) { return NULL; @@ -1317,18 +1341,12 @@ static inline void unlock_page_lruvec_irqrestore(struct lruvec *lruvec, static inline struct lruvec *relock_page_lruvec_irq(struct page *page, struct lruvec *locked_lruvec) { - struct pglist_data *pgdat = page_pgdat(page); - bool locked; + if (locked_lruvec) { + if (lruvec_holds_page_lru_lock(page, locked_lruvec)) + return locked_lruvec; - rcu_read_lock(); - locked = mem_cgroup_page_lruvec(page, pgdat) == locked_lruvec; - rcu_read_unlock(); 
- - if (locked) - return locked_lruvec; - - if (locked_lruvec) unlock_page_lruvec_irq(locked_lruvec); + } return lock_page_lruvec_irq(page); } @@ -1337,18 +1355,12 @@ static inline struct lruvec *relock_page_lruvec_irq(struct page *page, static inline struct lruvec *relock_page_lruvec_irqsave(struct page *page, struct lruvec *locked_lruvec, unsigned long *flags) { - struct pglist_data *pgdat = page_pgdat(page); - bool locked; - - rcu_read_lock(); - locked = mem_cgroup_page_lruvec(page, pgdat) == locked_lruvec; - rcu_read_unlock(); - - if (locked) - return locked_lruvec; + if (locked_lruvec) { + if (lruvec_holds_page_lru_lock(page, locked_lruvec)) + return locked_lruvec; - if (locked_lruvec) unlock_page_lruvec_irqrestore(locked_lruvec, *flags); + } return lock_page_lruvec_irqsave(page, flags); } ^ permalink raw reply related [flat|nested] 101+ messages in thread
* Re: [PATCH RFC] mm: Add function for testing if the current lruvec lock is valid [not found] ` <159622999150.2576729.14455020813024958573.stgit-+uVpp3jiz/RcxmDmkzA3yGt3HXsI98Cx0E9HWUfgJXw@public.gmane.org> @ 2020-07-31 23:54 ` Alex Shi 2020-08-02 18:20 ` Alexander Duyck 0 siblings, 1 reply; 101+ messages in thread From: Alex Shi @ 2020-07-31 23:54 UTC (permalink / raw) To: alexander.h.duyck-ral2JQCrhuEAvxtiuMwx3w Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, alexander.duyck-Re5JQEeQqe8AvxtiuMwx3w, aryabinin-5HdwGun5lf+gSpxsJD1C4w, cgroups-u79uwXL29TY76Z2rM5mHXA, daniel.m.jordan-QHcLZuEGTsvQT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w, hughd-hpIqsD4AKlfQT0dZR+AlfA, iamjoonsoo.kim-Hm3cg6mZ9cc, khlebnikov-XoJtRXgx1JseBXzfvpsJ4g, kirill-oKw7cIdHH8eLwutG50LtGA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, lkp-ral2JQCrhuEAvxtiuMwx3w, mgorman-3eNAlZScCAx27rWaFMvyedHuzzzSOjJt, richard.weiyang-Re5JQEeQqe8AvxtiuMwx3w, rong.a.chen-ral2JQCrhuEAvxtiuMwx3w, shakeelb-hpIqsD4AKlfQT0dZR+AlfA, tglx-hfZtesqFncYOwBW4kG4KsQ, tj-DgEjT+Ai2ygdnm+yROfE0A, willy-wEGCiKHe2LqWVfeAwA7xHQ, yang.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf It looks much better than mine, and could replace 'mm/lru: introduce the relock_page_lruvec function' with you as author. :) BTW, is it the rcu_read_lock that causes the will-it-scale/page_fault3 regression you mentioned in another email? Thanks Alex On 2020/8/1 at 5:14 AM, alexander.h.duyck-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org wrote: > From: Alexander Duyck <alexander.h.duyck-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> > > When testing for relock we can avoid the need for RCU locking if we simply > compare the page pgdat and memcg pointers versus those that the lruvec is > holding. By doing this we can avoid the extra pointer walks and accesses of > the memory cgroup. > > In addition we can avoid the checks entirely if lruvec is currently NULL. 
> > Signed-off-by: Alexander Duyck <alexander.h.duyck-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> > --- > include/linux/memcontrol.h | 52 +++++++++++++++++++++++++++----------------- > 1 file changed, 32 insertions(+), 20 deletions(-) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 6e670f991b42..7a02f00bf3de 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -405,6 +405,22 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg, > > struct lruvec *mem_cgroup_page_lruvec(struct page *, struct pglist_data *); > > +static inline bool lruvec_holds_page_lru_lock(struct page *page, > + struct lruvec *lruvec) > +{ > + pg_data_t *pgdat = page_pgdat(page); > + const struct mem_cgroup *memcg; > + struct mem_cgroup_per_node *mz; > + > + if (mem_cgroup_disabled()) > + return lruvec == &pgdat->__lruvec; > + > + mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec); > + memcg = page->mem_cgroup ? : root_mem_cgroup; > + > + return lruvec->pgdat == pgdat && mz->memcg == memcg; > +} > + > struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p); > > struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm); > @@ -880,6 +896,14 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page, > return &pgdat->__lruvec; > } > > +static inline bool lruvec_holds_page_lru_lock(struct page *page, > + struct lruvec *lruvec) > +{ > + pg_data_t *pgdat = page_pgdat(page); > + > + return lruvec == &pgdat->__lruvec; > +} > + > static inline struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg) > { > return NULL; > @@ -1317,18 +1341,12 @@ static inline void unlock_page_lruvec_irqrestore(struct lruvec *lruvec, > static inline struct lruvec *relock_page_lruvec_irq(struct page *page, > struct lruvec *locked_lruvec) > { > - struct pglist_data *pgdat = page_pgdat(page); > - bool locked; > + if (locked_lruvec) { > + if (lruvec_holds_page_lru_lock(page, locked_lruvec)) > + return 
locked_lruvec; > > - rcu_read_lock(); > - locked = mem_cgroup_page_lruvec(page, pgdat) == locked_lruvec; > - rcu_read_unlock(); > - > - if (locked) > - return locked_lruvec; > - > - if (locked_lruvec) > unlock_page_lruvec_irq(locked_lruvec); > + } > > return lock_page_lruvec_irq(page); > } > @@ -1337,18 +1355,12 @@ static inline struct lruvec *relock_page_lruvec_irq(struct page *page, > static inline struct lruvec *relock_page_lruvec_irqsave(struct page *page, > struct lruvec *locked_lruvec, unsigned long *flags) > { > - struct pglist_data *pgdat = page_pgdat(page); > - bool locked; > - > - rcu_read_lock(); > - locked = mem_cgroup_page_lruvec(page, pgdat) == locked_lruvec; > - rcu_read_unlock(); > - > - if (locked) > - return locked_lruvec; > + if (locked_lruvec) { > + if (lruvec_holds_page_lru_lock(page, locked_lruvec)) > + return locked_lruvec; > > - if (locked_lruvec) > unlock_page_lruvec_irqrestore(locked_lruvec, *flags); > + } > > return lock_page_lruvec_irqsave(page, flags); > } > ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH RFC] mm: Add function for testing if the current lruvec lock is valid 2020-07-31 23:54 ` Alex Shi @ 2020-08-02 18:20 ` Alexander Duyck 2020-08-04 6:13 ` Alex Shi 0 siblings, 1 reply; 101+ messages in thread From: Alexander Duyck @ 2020-08-02 18:20 UTC (permalink / raw) To: Alex Shi Cc: Duyck, Alexander H, Andrew Morton, Andrey Ryabinin, cgroups, Daniel Jordan, Johannes Weiner, Hugh Dickins, Joonsoo Kim, Konstantin Khlebnikov, Kirill A. Shutemov, LKML, linux-mm, kbuild test robot, Mel Gorman, Wei Yang, Rong Chen, Shakeel Butt, Thomas Gleixner, Tejun Heo, Matthew Wilcox <willy> Feel free to fold it into your patches if you want. I think Hugh was the one that had submitted a patch that addressed it, and it looks like you folded that into your v17 set. It was probably what he had identified: the additional LRU checks needing to be removed from the code. Thanks. - Alex On Fri, Jul 31, 2020 at 4:55 PM Alex Shi <alex.shi@linux.alibaba.com> wrote: > > It looks much better than mine, and could replace 'mm/lru: introduce the relock_page_lruvec function' > with you as author. :) > > BTW, > is it the rcu_read_lock that causes the will-it-scale/page_fault3 regression you mentioned in another > email? > > Thanks > Alex > > On 2020/8/1 at 5:14 AM, alexander.h.duyck@intel.com wrote: > > From: Alexander Duyck <alexander.h.duyck@linux.intel.com> > > > > When testing for relock we can avoid the need for RCU locking if we simply > > compare the page pgdat and memcg pointers versus those that the lruvec is > > holding. By doing this we can avoid the extra pointer walks and accesses of > > the memory cgroup. > > > > In addition we can avoid the checks entirely if lruvec is currently NULL. 
> > > > Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> > > --- > > include/linux/memcontrol.h | 52 +++++++++++++++++++++++++++----------------- > > 1 file changed, 32 insertions(+), 20 deletions(-) > > > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > > index 6e670f991b42..7a02f00bf3de 100644 > > --- a/include/linux/memcontrol.h > > +++ b/include/linux/memcontrol.h > > @@ -405,6 +405,22 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg, > > > > struct lruvec *mem_cgroup_page_lruvec(struct page *, struct pglist_data *); > > > > +static inline bool lruvec_holds_page_lru_lock(struct page *page, > > + struct lruvec *lruvec) > > +{ > > + pg_data_t *pgdat = page_pgdat(page); > > + const struct mem_cgroup *memcg; > > + struct mem_cgroup_per_node *mz; > > + > > + if (mem_cgroup_disabled()) > > + return lruvec == &pgdat->__lruvec; > > + > > + mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec); > > + memcg = page->mem_cgroup ? 
: root_mem_cgroup; > > + > > + return lruvec->pgdat == pgdat && mz->memcg == memcg; > > +} > > + > > struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p); > > > > struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm); > > @@ -880,6 +896,14 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page, > > return &pgdat->__lruvec; > > } > > > > +static inline bool lruvec_holds_page_lru_lock(struct page *page, > > + struct lruvec *lruvec) > > +{ > > + pg_data_t *pgdat = page_pgdat(page); > > + > > + return lruvec == &pgdat->__lruvec; > > +} > > + > > static inline struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg) > > { > > return NULL; > > @@ -1317,18 +1341,12 @@ static inline void unlock_page_lruvec_irqrestore(struct lruvec *lruvec, > > static inline struct lruvec *relock_page_lruvec_irq(struct page *page, > > struct lruvec *locked_lruvec) > > { > > - struct pglist_data *pgdat = page_pgdat(page); > > - bool locked; > > + if (locked_lruvec) { > > + if (lruvec_holds_page_lru_lock(page, locked_lruvec)) > > + return locked_lruvec; > > > > - rcu_read_lock(); > > - locked = mem_cgroup_page_lruvec(page, pgdat) == locked_lruvec; > > - rcu_read_unlock(); > > - > > - if (locked) > > - return locked_lruvec; > > - > > - if (locked_lruvec) > > unlock_page_lruvec_irq(locked_lruvec); > > + } > > > > return lock_page_lruvec_irq(page); > > } > > @@ -1337,18 +1355,12 @@ static inline struct lruvec *relock_page_lruvec_irq(struct page *page, > > static inline struct lruvec *relock_page_lruvec_irqsave(struct page *page, > > struct lruvec *locked_lruvec, unsigned long *flags) > > { > > - struct pglist_data *pgdat = page_pgdat(page); > > - bool locked; > > - > > - rcu_read_lock(); > > - locked = mem_cgroup_page_lruvec(page, pgdat) == locked_lruvec; > > - rcu_read_unlock(); > > - > > - if (locked) > > - return locked_lruvec; > > + if (locked_lruvec) { > > + if (lruvec_holds_page_lru_lock(page, locked_lruvec)) > > + return locked_lruvec; > > > > - 
if (locked_lruvec) > > unlock_page_lruvec_irqrestore(locked_lruvec, *flags); > > + } > > > > return lock_page_lruvec_irqsave(page, flags); > > } > > > ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH RFC] mm: Add function for testing if the current lruvec lock is valid
  2020-08-02 18:20 ` Alexander Duyck
@ 2020-08-04  6:13 ` Alex Shi
  0 siblings, 0 replies; 101+ messages in thread
From: Alex Shi @ 2020-08-04 6:13 UTC (permalink / raw)
To: Alexander Duyck
Cc: Duyck, Alexander H, Andrew Morton, Andrey Ryabinin, cgroups, Daniel Jordan, Johannes Weiner, Hugh Dickins, Joonsoo Kim, Konstantin Khlebnikov, Kirill A. Shutemov, LKML, linux-mm, kbuild test robot, Mel Gorman, Wei Yang, Rong Chen, Shakeel Butt, Thomas Gleixner, Tejun Heo, Matthew Wilcox <willy>

On 2020/8/3 2:20 AM, Alexander Duyck wrote:
> Feel free to fold it into your patches if you want.
>
> I think Hugh was the one that had submitted a patch that addressed it,
> and it looks like you folded that into your v17 set. It was probably
> what he had identified which was the additional LRU checks needing to
> be removed from the code.

Yes, Hugh's patch was folded into [PATCH v17 16/21] mm/swap: serialize memcg changes in pagevec_lru_move_fn, and your change is in patch 18. They do not seem to interfere with each other; both patches are fine. Thanks

>
> Thanks.
>
> - Alex
>
> On Fri, Jul 31, 2020 at 4:55 PM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>>
>> It looks much better than mine, and could replace 'mm/lru: introduce the relock_page_lruvec function' with you signed as author. :)
>>
>> BTW,
>> is it the rcu_read_lock that causes the will-it-scale/page_fault3 regression you mentioned in another letter?
>>
>> Thanks
>> Alex
>>
>> On 2020/8/1 5:14 AM, alexander.h.duyck@intel.com wrote:
>>> From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
>>>
>>> When testing for relock we can avoid the need for RCU locking if we simply
>>> compare the page pgdat and memcg pointers versus those that the lruvec is
>>> holding. By doing this we can avoid the extra pointer walks and accesses of
>>> the memory cgroup.
>>>
>>> In addition we can avoid the checks entirely if lruvec is currently NULL. 
>>> >>> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> >>> --- >>> include/linux/memcontrol.h | 52 +++++++++++++++++++++++++++----------------- >>> 1 file changed, 32 insertions(+), 20 deletions(-) >>> >>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h >>> index 6e670f991b42..7a02f00bf3de 100644 >>> --- a/include/linux/memcontrol.h >>> +++ b/include/linux/memcontrol.h >>> @@ -405,6 +405,22 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg, >>> >>> struct lruvec *mem_cgroup_page_lruvec(struct page *, struct pglist_data *); >>> >>> +static inline bool lruvec_holds_page_lru_lock(struct page *page, >>> + struct lruvec *lruvec) >>> +{ >>> + pg_data_t *pgdat = page_pgdat(page); >>> + const struct mem_cgroup *memcg; >>> + struct mem_cgroup_per_node *mz; >>> + >>> + if (mem_cgroup_disabled()) >>> + return lruvec == &pgdat->__lruvec; >>> + >>> + mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec); >>> + memcg = page->mem_cgroup ? 
: root_mem_cgroup; >>> + >>> + return lruvec->pgdat == pgdat && mz->memcg == memcg; >>> +} >>> + >>> struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p); >>> >>> struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm); >>> @@ -880,6 +896,14 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page, >>> return &pgdat->__lruvec; >>> } >>> >>> +static inline bool lruvec_holds_page_lru_lock(struct page *page, >>> + struct lruvec *lruvec) >>> +{ >>> + pg_data_t *pgdat = page_pgdat(page); >>> + >>> + return lruvec == &pgdat->__lruvec; >>> +} >>> + >>> static inline struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg) >>> { >>> return NULL; >>> @@ -1317,18 +1341,12 @@ static inline void unlock_page_lruvec_irqrestore(struct lruvec *lruvec, >>> static inline struct lruvec *relock_page_lruvec_irq(struct page *page, >>> struct lruvec *locked_lruvec) >>> { >>> - struct pglist_data *pgdat = page_pgdat(page); >>> - bool locked; >>> + if (locked_lruvec) { >>> + if (lruvec_holds_page_lru_lock(page, locked_lruvec)) >>> + return locked_lruvec; >>> >>> - rcu_read_lock(); >>> - locked = mem_cgroup_page_lruvec(page, pgdat) == locked_lruvec; >>> - rcu_read_unlock(); >>> - >>> - if (locked) >>> - return locked_lruvec; >>> - >>> - if (locked_lruvec) >>> unlock_page_lruvec_irq(locked_lruvec); >>> + } >>> >>> return lock_page_lruvec_irq(page); >>> } >>> @@ -1337,18 +1355,12 @@ static inline struct lruvec *relock_page_lruvec_irq(struct page *page, >>> static inline struct lruvec *relock_page_lruvec_irqsave(struct page *page, >>> struct lruvec *locked_lruvec, unsigned long *flags) >>> { >>> - struct pglist_data *pgdat = page_pgdat(page); >>> - bool locked; >>> - >>> - rcu_read_lock(); >>> - locked = mem_cgroup_page_lruvec(page, pgdat) == locked_lruvec; >>> - rcu_read_unlock(); >>> - >>> - if (locked) >>> - return locked_lruvec; >>> + if (locked_lruvec) { >>> + if (lruvec_holds_page_lru_lock(page, locked_lruvec)) >>> + return locked_lruvec; >>> >>> - 
if (locked_lruvec) >>> unlock_page_lruvec_irqrestore(locked_lruvec, *flags); >>> + } >>> >>> return lock_page_lruvec_irqsave(page, flags); >>> } >>> >> ^ permalink raw reply [flat|nested] 101+ messages in thread
* [PATCH v17 19/21] mm/vmscan: use relock for move_pages_to_lru 2020-07-25 12:59 [PATCH v17 00/21] per memcg lru lock Alex Shi ` (11 preceding siblings ...) 2020-07-25 12:59 ` [PATCH v17 18/21] mm/lru: introduce the relock_page_lruvec function Alex Shi @ 2020-07-25 12:59 ` Alex Shi 2020-08-03 22:49 ` Alexander Duyck 2020-07-25 12:59 ` [PATCH v17 20/21] mm/pgdat: remove pgdat lru_lock Alex Shi 2020-07-25 12:59 ` [PATCH v17 21/21] mm/lru: revise the comments of lru_lock Alex Shi 14 siblings, 1 reply; 101+ messages in thread From: Alex Shi @ 2020-07-25 12:59 UTC (permalink / raw) To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi, willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck, rong.a.chen Cc: Andrey Ryabinin, Jann Horn From: Hugh Dickins <hughd@google.com> Use the relock function to replace relocking action. And try to save few lock times. Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tejun Heo <tj@kernel.org> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Jann Horn <jannh@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: cgroups@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org --- mm/vmscan.c | 17 ++++++----------- 1 file changed, 6 insertions(+), 11 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index bdb53a678e7e..078a1640ec60 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1854,15 +1854,15 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec, enum lru_list lru; while (!list_empty(list)) { - struct lruvec *new_lruvec = NULL; - page = lru_to_page(list); VM_BUG_ON_PAGE(PageLRU(page), page); list_del(&page->lru); if (unlikely(!page_evictable(page))) { - spin_unlock_irq(&lruvec->lru_lock); + if 
(lruvec) { + spin_unlock_irq(&lruvec->lru_lock); + lruvec = NULL; + } putback_lru_page(page); - spin_lock_irq(&lruvec->lru_lock); continue; } @@ -1876,12 +1876,7 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec, * list_add(&page->lru,) * list_add(&page->lru,) //corrupt */ - new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page)); - if (new_lruvec != lruvec) { - if (lruvec) - spin_unlock_irq(&lruvec->lru_lock); - lruvec = lock_page_lruvec_irq(page); - } + lruvec = relock_page_lruvec_irq(page, lruvec); SetPageLRU(page); if (unlikely(put_page_testzero(page))) { @@ -1890,8 +1885,8 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec, if (unlikely(PageCompound(page))) { spin_unlock_irq(&lruvec->lru_lock); + lruvec = NULL; destroy_compound_page(page); - spin_lock_irq(&lruvec->lru_lock); } else list_add(&page->lru, &pages_to_free); -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 101+ messages in thread
* Re: [PATCH v17 19/21] mm/vmscan: use relock for move_pages_to_lru 2020-07-25 12:59 ` [PATCH v17 19/21] mm/vmscan: use relock for move_pages_to_lru Alex Shi @ 2020-08-03 22:49 ` Alexander Duyck [not found] ` <CAKgT0UebLfdju0Ny9ad5bigzAazqpzfwk2_JNQQ9yEHYyVm5-Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 101+ messages in thread From: Alexander Duyck @ 2020-08-03 22:49 UTC (permalink / raw) To: Alex Shi Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox, Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups, Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov, Rong Chen, Andrey Ryabinin, Jann Horn On Sat, Jul 25, 2020 at 6:00 AM Alex Shi <alex.shi@linux.alibaba.com> wrote: > > From: Hugh Dickins <hughd@google.com> > > Use the relock function to replace relocking action. And try to save few > lock times. > > Signed-off-by: Hugh Dickins <hughd@google.com> > Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Tejun Heo <tj@kernel.org> > Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> > Cc: Jann Horn <jannh@google.com> > Cc: Mel Gorman <mgorman@techsingularity.net> > Cc: Johannes Weiner <hannes@cmpxchg.org> > Cc: Matthew Wilcox <willy@infradead.org> > Cc: Hugh Dickins <hughd@google.com> > Cc: cgroups@vger.kernel.org > Cc: linux-kernel@vger.kernel.org > Cc: linux-mm@kvack.org I am assuming this is only separate from patch 18 because of the fact that it is from Hugh and not yourself. Otherwise I would recommend folding this into patch 18. Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH v17 19/21] mm/vmscan: use relock for move_pages_to_lru
  [not found] ` <CAKgT0UebLfdju0Ny9ad5bigzAazqpzfwk2_JNQQ9yEHYyVm5-Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2020-08-04  6:23 ` Alex Shi
  0 siblings, 0 replies; 101+ messages in thread
From: Alex Shi @ 2020-08-04 6:23 UTC (permalink / raw)
To: Alexander Duyck
Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox, Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups-u79uwXL29TY76Z2rM5mHXA, Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov, Rong Chen, Andrey Ryabinin, Jann Horn

On 2020/8/4 6:49 AM, Alexander Duyck wrote:
>> Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org
> I am assuming this is only separate from patch 18 because of the fact
> that it is from Hugh and not yourself. Otherwise I would recommend
> folding this into patch 18.

Yes, that's the reason this patch is kept separate.

>
> Reviewed-by: Alexander Duyck <alexander.h.duyck-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>

Thanks a lot!
Alex

^ permalink raw reply [flat|nested] 101+ messages in thread
* [PATCH v17 20/21] mm/pgdat: remove pgdat lru_lock 2020-07-25 12:59 [PATCH v17 00/21] per memcg lru lock Alex Shi ` (12 preceding siblings ...) 2020-07-25 12:59 ` [PATCH v17 19/21] mm/vmscan: use relock for move_pages_to_lru Alex Shi @ 2020-07-25 12:59 ` Alex Shi [not found] ` <1595681998-19193-21-git-send-email-alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> 2020-07-25 12:59 ` [PATCH v17 21/21] mm/lru: revise the comments of lru_lock Alex Shi 14 siblings, 1 reply; 101+ messages in thread From: Alex Shi @ 2020-07-25 12:59 UTC (permalink / raw) To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi, willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck, rong.a.chen Now pgdat.lru_lock was replaced by lruvec lock. It's not used anymore. Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org Cc: cgroups@vger.kernel.org --- include/linux/mmzone.h | 1 - mm/page_alloc.c | 1 - 2 files changed, 2 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 30b961a9a749..8af956aa13cf 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -735,7 +735,6 @@ struct deferred_split { /* Write-intensive fields used by page reclaim */ ZONE_PADDING(_pad1_) - spinlock_t lru_lock; #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT /* diff --git a/mm/page_alloc.c b/mm/page_alloc.c index e028b87ce294..4d7df42b32d6 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -6721,7 +6721,6 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat) init_waitqueue_head(&pgdat->pfmemalloc_wait); pgdat_page_ext_init(pgdat); - spin_lock_init(&pgdat->lru_lock); lruvec_init(&pgdat->__lruvec); } -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 
101+ messages in thread
* Re: [PATCH v17 20/21] mm/pgdat: remove pgdat lru_lock [not found] ` <1595681998-19193-21-git-send-email-alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> @ 2020-08-03 22:42 ` Alexander Duyck [not found] ` <CAKgT0UfZg5Wf2qNJ_=VPO1Cj8YuifZN8rG_X4Btq86ADmsVZFw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 101+ messages in thread From: Alexander Duyck @ 2020-08-03 22:42 UTC (permalink / raw) To: Alex Shi Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox, Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups-u79uwXL29TY76Z2rM5mHXA, Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov, Rong Chen On Sat, Jul 25, 2020 at 6:00 AM Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> wrote: > > Now pgdat.lru_lock was replaced by lruvec lock. It's not used anymore. > > Signed-off-by: Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> > Cc: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> > Cc: Konstantin Khlebnikov <khlebnikov-XoJtRXgx1JseBXzfvpsJ4g@public.gmane.org> > Cc: Hugh Dickins <hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> > Cc: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org> > Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org > Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > Cc: cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org I really think this would be better just squashed into patch 18 instead of as a standalone patch since you were moving all of the locking anyway so it would be more likely to trigger build errors if somebody didn't move a lock somewhere that was referencing this. That said this change is harmless at this point. 
Reviewed-by: Alexander Duyck <alexander.h.duyck-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> > --- > include/linux/mmzone.h | 1 - > mm/page_alloc.c | 1 - > 2 files changed, 2 deletions(-) > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index 30b961a9a749..8af956aa13cf 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -735,7 +735,6 @@ struct deferred_split { > > /* Write-intensive fields used by page reclaim */ > ZONE_PADDING(_pad1_) > - spinlock_t lru_lock; > > #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT > /* > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index e028b87ce294..4d7df42b32d6 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -6721,7 +6721,6 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat) > init_waitqueue_head(&pgdat->pfmemalloc_wait); > > pgdat_page_ext_init(pgdat); > - spin_lock_init(&pgdat->lru_lock); > lruvec_init(&pgdat->__lruvec); > } > > -- > 1.8.3.1 > ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH v17 20/21] mm/pgdat: remove pgdat lru_lock [not found] ` <CAKgT0UfZg5Wf2qNJ_=VPO1Cj8YuifZN8rG_X4Btq86ADmsVZFw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2020-08-03 22:45 ` Alexander Duyck [not found] ` <CAKgT0UciRJCPs_zrxri1pEJmJVKkHpEq=AFiVpJE99JJQe=Xrg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 101+ messages in thread From: Alexander Duyck @ 2020-08-03 22:45 UTC (permalink / raw) To: Alex Shi Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox, Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups-u79uwXL29TY76Z2rM5mHXA, Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov, Rong Chen Just to correct a typo, I meant patch 17, not 18. in the comment below. On Mon, Aug 3, 2020 at 3:42 PM Alexander Duyck <alexander.duyck-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote: > > On Sat, Jul 25, 2020 at 6:00 AM Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> wrote: > > > > Now pgdat.lru_lock was replaced by lruvec lock. It's not used anymore. > > > > Signed-off-by: Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> > > Cc: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> > > Cc: Konstantin Khlebnikov <khlebnikov-XoJtRXgx1JseBXzfvpsJ4g@public.gmane.org> > > Cc: Hugh Dickins <hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> > > Cc: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org> > > Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org > > Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > > Cc: cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > > I really think this would be better just squashed into patch 18 > instead of as a standalone patch since you were moving all of the > locking anyway so it would be more likely to trigger build errors if > somebody didn't move a lock somewhere that was referencing this. > > That said this change is harmless at this point. 
> > Reviewed-by: Alexander Duyck <alexander.h.duyck-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH v17 20/21] mm/pgdat: remove pgdat lru_lock
  [not found] ` <CAKgT0UciRJCPs_zrxri1pEJmJVKkHpEq=AFiVpJE99JJQe=Xrg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2020-08-04  6:22 ` Alex Shi
  0 siblings, 0 replies; 101+ messages in thread
From: Alex Shi @ 2020-08-04 6:22 UTC (permalink / raw)
To: Alexander Duyck
Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox, Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups-u79uwXL29TY76Z2rM5mHXA, Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov, Rong Chen

On 2020/8/4 6:45 AM, Alexander Duyck wrote:
> Just to correct a typo, I meant patch 17, not 18, in the comment below.
>
>
> On Mon, Aug 3, 2020 at 3:42 PM Alexander Duyck
> <alexander.duyck-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>
>> On Sat, Jul 25, 2020 at 6:00 AM Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org> wrote:
>>>
>>> Now pgdat.lru_lock was replaced by lruvec lock. It's not used anymore.
>>>
>>> Signed-off-by: Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org>
>>> Cc: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
>>> Cc: Konstantin Khlebnikov <khlebnikov-XoJtRXgx1JseBXzfvpsJ4g@public.gmane.org>
>>> Cc: Hugh Dickins <hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
>>> Cc: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
>>> Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org
>>> Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>> Cc: cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>
>> I really think this would be better just squashed into patch 18
>> instead of as a standalone patch since you were moving all of the
>> locking anyway so it would be more likely to trigger build errors if
>> somebody didn't move a lock somewhere that was referencing this.

Thanks for the comments! If someone changed the lru_lock between patch 17 and this one, it would cause more trouble than a build error here. 
:) So don't worry about that. But on the other hand, I really insist on having a ceremony to remove this lock...

>>
>> That said this change is harmless at this point.
>>
>> Reviewed-by: Alexander Duyck <alexander.h.duyck-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>

Thanks a lot for the review!

^ permalink raw reply [flat|nested] 101+ messages in thread
* [PATCH v17 21/21] mm/lru: revise the comments of lru_lock 2020-07-25 12:59 [PATCH v17 00/21] per memcg lru lock Alex Shi ` (13 preceding siblings ...) 2020-07-25 12:59 ` [PATCH v17 20/21] mm/pgdat: remove pgdat lru_lock Alex Shi @ 2020-07-25 12:59 ` Alex Shi 2020-08-03 22:37 ` Alexander Duyck 14 siblings, 1 reply; 101+ messages in thread From: Alex Shi @ 2020-07-25 12:59 UTC (permalink / raw) To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi, willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck, rong.a.chen Cc: Andrey Ryabinin, Jann Horn From: Hugh Dickins <hughd@google.com> Since we changed the pgdat->lru_lock to lruvec->lru_lock, it's time to fix the incorrect comments in code. Also fixed some zone->lru_lock comment error from ancient time. etc. Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tejun Heo <tj@kernel.org> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Jann Horn <jannh@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: cgroups@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org --- Documentation/admin-guide/cgroup-v1/memcg_test.rst | 15 +++------------ Documentation/admin-guide/cgroup-v1/memory.rst | 21 +++++++++------------ Documentation/trace/events-kmem.rst | 2 +- Documentation/vm/unevictable-lru.rst | 22 ++++++++-------------- include/linux/mm_types.h | 2 +- include/linux/mmzone.h | 2 +- mm/filemap.c | 4 ++-- mm/memcontrol.c | 2 +- mm/rmap.c | 4 ++-- mm/vmscan.c | 12 ++++++++---- 10 files changed, 36 insertions(+), 50 deletions(-) diff --git a/Documentation/admin-guide/cgroup-v1/memcg_test.rst b/Documentation/admin-guide/cgroup-v1/memcg_test.rst index 3f7115e07b5d..0b9f91589d3d 100644 --- 
a/Documentation/admin-guide/cgroup-v1/memcg_test.rst +++ b/Documentation/admin-guide/cgroup-v1/memcg_test.rst @@ -133,18 +133,9 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y. 8. LRU ====== - Each memcg has its own private LRU. Now, its handling is under global - VM's control (means that it's handled under global pgdat->lru_lock). - Almost all routines around memcg's LRU is called by global LRU's - list management functions under pgdat->lru_lock. - - A special function is mem_cgroup_isolate_pages(). This scans - memcg's private LRU and call __isolate_lru_page() to extract a page - from LRU. - - (By __isolate_lru_page(), the page is removed from both of global and - private LRU.) - + Each memcg has its own vector of LRUs (inactive anon, active anon, + inactive file, active file, unevictable) of pages from each node, + each LRU handled under a single lru_lock for that memcg and node. 9. Typical Tests. ================= diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst index 12757e63b26c..24450696579f 100644 --- a/Documentation/admin-guide/cgroup-v1/memory.rst +++ b/Documentation/admin-guide/cgroup-v1/memory.rst @@ -285,20 +285,17 @@ When oom event notifier is registered, event will be delivered. 2.6 Locking ----------- - lock_page_cgroup()/unlock_page_cgroup() should not be called under - the i_pages lock. +Lock order is as follows: - Other lock order is following: + Page lock (PG_locked bit of page->flags) + mm->page_table_lock or split pte_lock + lock_page_memcg (memcg->move_lock) + mapping->i_pages lock + lruvec->lru_lock. - PG_locked. - mm->page_table_lock - pgdat->lru_lock - lock_page_cgroup. - - In many cases, just lock_page_cgroup() is called. - - per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by - pgdat->lru_lock, it has no lock of its own. 
+Per-node-per-memcgroup LRU (cgroup's private LRU) is guarded by +lruvec->lru_lock; PG_lru bit of page->flags is cleared before +isolating a page from its LRU under lruvec->lru_lock. 2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM) ----------------------------------------------- diff --git a/Documentation/trace/events-kmem.rst b/Documentation/trace/events-kmem.rst index 555484110e36..68fa75247488 100644 --- a/Documentation/trace/events-kmem.rst +++ b/Documentation/trace/events-kmem.rst @@ -69,7 +69,7 @@ When pages are freed in batch, the also mm_page_free_batched is triggered. Broadly speaking, pages are taken off the LRU lock in bulk and freed in batch with a page list. Significant amounts of activity here could indicate that the system is under memory pressure and can also indicate -contention on the zone->lru_lock. +contention on the lruvec->lru_lock. 4. Per-CPU Allocator Activity ============================= diff --git a/Documentation/vm/unevictable-lru.rst b/Documentation/vm/unevictable-lru.rst index 17d0861b0f1d..0e1490524f53 100644 --- a/Documentation/vm/unevictable-lru.rst +++ b/Documentation/vm/unevictable-lru.rst @@ -33,7 +33,7 @@ reclaim in Linux. The problems have been observed at customer sites on large memory x86_64 systems. To illustrate this with an example, a non-NUMA x86_64 platform with 128GB of -main memory will have over 32 million 4k pages in a single zone. When a large +main memory will have over 32 million 4k pages in a single node. When a large fraction of these pages are not evictable for any reason [see below], vmscan will spend a lot of time scanning the LRU lists looking for the small fraction of pages that are evictable. This can result in a situation where all CPUs are @@ -55,7 +55,7 @@ unevictable, either by definition or by circumstance, in the future. 
The Unevictable Page List ------------------------- -The Unevictable LRU infrastructure consists of an additional, per-zone, LRU list +The Unevictable LRU infrastructure consists of an additional, per-node, LRU list called the "unevictable" list and an associated page flag, PG_unevictable, to indicate that the page is being managed on the unevictable list. @@ -84,15 +84,9 @@ The unevictable list does not differentiate between file-backed and anonymous, swap-backed pages. This differentiation is only important while the pages are, in fact, evictable. -The unevictable list benefits from the "arrayification" of the per-zone LRU +The unevictable list benefits from the "arrayification" of the per-node LRU lists and statistics originally proposed and posted by Christoph Lameter. -The unevictable list does not use the LRU pagevec mechanism. Rather, -unevictable pages are placed directly on the page's zone's unevictable list -under the zone lru_lock. This allows us to prevent the stranding of pages on -the unevictable list when one task has the page isolated from the LRU and other -tasks are changing the "evictability" state of the page. - Memory Control Group Interaction -------------------------------- @@ -101,8 +95,8 @@ The unevictable LRU facility interacts with the memory control group [aka memory controller; see Documentation/admin-guide/cgroup-v1/memory.rst] by extending the lru_list enum. -The memory controller data structure automatically gets a per-zone unevictable -list as a result of the "arrayification" of the per-zone LRU lists (one per +The memory controller data structure automatically gets a per-node unevictable +list as a result of the "arrayification" of the per-node LRU lists (one per lru_list enum element). The memory controller tracks the movement of pages to and from the unevictable list. @@ -196,7 +190,7 @@ for the sake of expediency, to leave a unevictable page on one of the regular active/inactive LRU lists for vmscan to deal with. 
 vmscan checks for such pages in all of the shrink_{active|inactive|page}_list()
 functions and will "cull" such pages that it encounters: that is, it diverts
 those pages to the
-unevictable list for the zone being scanned.
+unevictable list for the node being scanned.

 There may be situations where a page is mapped into a VM_LOCKED VMA, but the
 page is not marked as PG_mlocked. Such pages will make it all the way to
@@ -328,7 +322,7 @@
 If the page was NOT already mlocked, mlock_vma_page() attempts to isolate the
 page from the LRU, as it is likely on the appropriate active or inactive list
 at that time. If the isolate_lru_page() succeeds, mlock_vma_page() will put
 back the page - by calling putback_lru_page() - which will notice that the page
-is now mlocked and divert the page to the zone's unevictable list. If
+is now mlocked and divert the page to the node's unevictable list. If
 mlock_vma_page() is unable to isolate the page from the LRU, vmscan will handle
 it later if and when it attempts to reclaim the page.
@@ -603,7 +597,7 @@ Some examples of these unevictable pages on the LRU lists are:
 unevictable list in mlock_vma_page().

 shrink_inactive_list() also diverts any unevictable pages that it finds on the
-inactive lists to the appropriate zone's unevictable list.
+inactive lists to the appropriate node's unevictable list.

 shrink_inactive_list() should only see SHM_LOCK'd pages that became SHM_LOCK'd
 after shrink_active_list() had moved them to the inactive list, or pages mapped

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 64ede5f150dc..44738cdb5a55 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -78,7 +78,7 @@ struct page {
 struct {	/* Page cache and anonymous pages */
 /**
  * @lru: Pageout list, eg. active_list protected by
- * pgdat->lru_lock. Sometimes used as a generic list
+ * lruvec->lru_lock. Sometimes used as a generic list
  * by the page owner.
  */
 struct list_head lru;

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 8af956aa13cf..c92289a4e14d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -115,7 +115,7 @@ static inline bool free_area_empty(struct free_area *area, int migratetype)
 struct pglist_data;

 /*
- * zone->lock and the zone lru_lock are two of the hottest locks in the kernel.
+ * zone->lock and the lru_lock are two of the hottest locks in the kernel.
  * So add a wild amount of padding here to ensure that they fall into separate
  * cachelines. There are very few zone structures in the machine, so space
  * consumption is not a concern here.

diff --git a/mm/filemap.c b/mm/filemap.c
index 385759c4ce4b..3083557a1ce6 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -101,8 +101,8 @@
  * ->swap_lock (try_to_unmap_one)
  * ->private_lock (try_to_unmap_one)
  * ->i_pages lock (try_to_unmap_one)
- * ->pgdat->lru_lock (follow_page->mark_page_accessed)
- * ->pgdat->lru_lock (check_pte_range->isolate_lru_page)
+ * ->lruvec->lru_lock (follow_page->mark_page_accessed)
+ * ->lruvec->lru_lock (check_pte_range->isolate_lru_page)
  * ->private_lock (page_remove_rmap->set_page_dirty)
  * ->i_pages lock (page_remove_rmap->set_page_dirty)
  * bdi.wb->list_lock (page_remove_rmap->set_page_dirty)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d6746656cc39..a018d7c8a3f2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3057,7 +3057,7 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE

 /*
- * Because tail pages are not marked as "used", set it. We're under
+ * Because tail pages are not marked as "used", set it. Don't need
  * lruvec->lru_lock and migration entries setup in all page mappings.
  */
 void mem_cgroup_split_huge_fixup(struct page *head)

diff --git a/mm/rmap.c b/mm/rmap.c
index 5fe2dedce1fc..7f6e95680c47 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -28,12 +28,12 @@
  * hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
  * anon_vma->rwsem
  * mm->page_table_lock or pte_lock
- * pgdat->lru_lock (in mark_page_accessed, isolate_lru_page)
  * swap_lock (in swap_duplicate, swap_info_get)
  * mmlist_lock (in mmput, drain_mmlist and others)
  * mapping->private_lock (in __set_page_dirty_buffers)
- * mem_cgroup_{begin,end}_page_stat (memcg->move_lock)
+ * lock_page_memcg move_lock (in __set_page_dirty_buffers)
  * i_pages lock (widely used)
+ * lruvec->lru_lock (in lock_page_lruvec_irq)
  * inode->i_lock (in set_page_dirty's __mark_inode_dirty)
  * bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
  * sb_lock (within inode_lock in fs/fs-writeback.c)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 078a1640ec60..bb3ac52de058 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1620,14 +1620,16 @@ static __always_inline void update_lru_sizes(struct lruvec *lruvec,
 }

 /**
- * pgdat->lru_lock is heavily contended. Some of the functions that
+ * Isolating page from the lruvec to fill in @dst list by nr_to_scan times.
+ *
+ * lruvec->lru_lock is heavily contended. Some of the functions that
  * shrink the lists perform better by taking out a batch of pages
  * and working on them outside the LRU lock.
  *
  * For pagecache intensive workloads, this function is the hottest
  * spot in the kernel (apart from copy_*_user functions).
  *
- * Appropriate locks must be held before calling this function.
+ * Lru_lock must be held before calling this function.
  *
  * @nr_to_scan: The number of eligible pages to look through on the list.
  * @lruvec: The LRU vector to pull pages from.
@@ -1826,14 +1828,16 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,

 /*
  * This moves pages from @list to corresponding LRU list.
+ * The pages from @list is out of any lruvec, and in the end list reuses as
+ * pages_to_free list.
  *
  * We move them the other way if the page is referenced by one or more
  * processes, from rmap.
  *
  * If the pages are mostly unmapped, the processing is fast and it is
- * appropriate to hold zone_lru_lock across the whole operation. But if
+ * appropriate to hold lru_lock across the whole operation. But if
  * the pages are mapped, the processing is slow (page_referenced()) so we
- * should drop zone_lru_lock around each page. It's impossible to balance
+ * should drop lru_lock around each page. It's impossible to balance
  * this, so instead we remove the pages from the LRU while processing them.
  * It is safe to rely on PG_active against the non-LRU pages in here because
  * nobody will play with that bit on a non-LRU page.
--
1.8.3.1

^ permalink raw reply related [flat|nested] 101+ messages in thread
* Re: [PATCH v17 21/21] mm/lru: revise the comments of lru_lock

From: Alexander Duyck @ 2020-08-03 22:37 UTC
To: Alex Shi
Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox, Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups, Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov, Rong Chen, Andrey Ryabinin, Jann Horn

On Sat, Jul 25, 2020 at 6:00 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
> From: Hugh Dickins <hughd@google.com>
>
> Since we changed pgdat->lru_lock to lruvec->lru_lock, it's time to
> fix the now-incorrect comments in the code. This also fixes some
> ancient zone->lru_lock comment errors.
>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
> Cc: Jann Horn <jannh@google.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: cgroups@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org
> ---
>  Documentation/admin-guide/cgroup-v1/memcg_test.rst | 15 +++------------
>  Documentation/admin-guide/cgroup-v1/memory.rst | 21 +++++++++------------
>  Documentation/trace/events-kmem.rst | 2 +-
>  Documentation/vm/unevictable-lru.rst | 22 ++++++++--------------
>  include/linux/mm_types.h | 2 +-
>  include/linux/mmzone.h | 2 +-
>  mm/filemap.c | 4 ++--
>  mm/memcontrol.c | 2 +-
>  mm/rmap.c | 4 ++--
>  mm/vmscan.c | 12 ++++++++----
>  10 files changed, 36 insertions(+),
50 deletions(-) > > diff --git a/Documentation/admin-guide/cgroup-v1/memcg_test.rst b/Documentation/admin-guide/cgroup-v1/memcg_test.rst > index 3f7115e07b5d..0b9f91589d3d 100644 > --- a/Documentation/admin-guide/cgroup-v1/memcg_test.rst > +++ b/Documentation/admin-guide/cgroup-v1/memcg_test.rst > @@ -133,18 +133,9 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y. > > 8. LRU > ====== > - Each memcg has its own private LRU. Now, its handling is under global > - VM's control (means that it's handled under global pgdat->lru_lock). > - Almost all routines around memcg's LRU is called by global LRU's > - list management functions under pgdat->lru_lock. > - > - A special function is mem_cgroup_isolate_pages(). This scans > - memcg's private LRU and call __isolate_lru_page() to extract a page > - from LRU. > - > - (By __isolate_lru_page(), the page is removed from both of global and > - private LRU.) > - > + Each memcg has its own vector of LRUs (inactive anon, active anon, > + inactive file, active file, unevictable) of pages from each node, > + each LRU handled under a single lru_lock for that memcg and node. > > 9. Typical Tests. > ================= > diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst > index 12757e63b26c..24450696579f 100644 > --- a/Documentation/admin-guide/cgroup-v1/memory.rst > +++ b/Documentation/admin-guide/cgroup-v1/memory.rst > @@ -285,20 +285,17 @@ When oom event notifier is registered, event will be delivered. > 2.6 Locking > ----------- > > - lock_page_cgroup()/unlock_page_cgroup() should not be called under > - the i_pages lock. > +Lock order is as follows: > > - Other lock order is following: > + Page lock (PG_locked bit of page->flags) > + mm->page_table_lock or split pte_lock > + lock_page_memcg (memcg->move_lock) > + mapping->i_pages lock > + lruvec->lru_lock. > > - PG_locked. > - mm->page_table_lock > - pgdat->lru_lock > - lock_page_cgroup. 
> - > - In many cases, just lock_page_cgroup() is called. > - > - per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by > - pgdat->lru_lock, it has no lock of its own. > +Per-node-per-memcgroup LRU (cgroup's private LRU) is guarded by > +lruvec->lru_lock; PG_lru bit of page->flags is cleared before > +isolating a page from its LRU under lruvec->lru_lock. > > 2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM) > ----------------------------------------------- > diff --git a/Documentation/trace/events-kmem.rst b/Documentation/trace/events-kmem.rst > index 555484110e36..68fa75247488 100644 > --- a/Documentation/trace/events-kmem.rst > +++ b/Documentation/trace/events-kmem.rst > @@ -69,7 +69,7 @@ When pages are freed in batch, the also mm_page_free_batched is triggered. > Broadly speaking, pages are taken off the LRU lock in bulk and > freed in batch with a page list. Significant amounts of activity here could > indicate that the system is under memory pressure and can also indicate > -contention on the zone->lru_lock. > +contention on the lruvec->lru_lock. > > 4. Per-CPU Allocator Activity > ============================= > diff --git a/Documentation/vm/unevictable-lru.rst b/Documentation/vm/unevictable-lru.rst > index 17d0861b0f1d..0e1490524f53 100644 > --- a/Documentation/vm/unevictable-lru.rst > +++ b/Documentation/vm/unevictable-lru.rst > @@ -33,7 +33,7 @@ reclaim in Linux. The problems have been observed at customer sites on large > memory x86_64 systems. > > To illustrate this with an example, a non-NUMA x86_64 platform with 128GB of > -main memory will have over 32 million 4k pages in a single zone. When a large > +main memory will have over 32 million 4k pages in a single node. When a large > fraction of these pages are not evictable for any reason [see below], vmscan > will spend a lot of time scanning the LRU lists looking for the small fraction > of pages that are evictable. 
> This can result in a situation where all CPUs are

I'm not entirely sure this makes sense. If the system is non-NUMA then you
don't have nodes, do you?

> @@ -55,7 +55,7 @@ unevictable, either by definition or by circumstance, in the future.
> The Unevictable Page List
> -------------------------
>
> -The Unevictable LRU infrastructure consists of an additional, per-zone, LRU list
> +The Unevictable LRU infrastructure consists of an additional, per-node, LRU list
> called the "unevictable" list and an associated page flag, PG_unevictable, to
> indicate that the page is being managed on the unevictable list.

This isn't quite true either, is it? It is per-memcg and per-node, isn't it?

> @@ -84,15 +84,9 @@ The unevictable list does not differentiate between file-backed and anonymous,
> swap-backed pages. This differentiation is only important while the pages are,
> in fact, evictable.
>
> -The unevictable list benefits from the "arrayification" of the per-zone LRU
> +The unevictable list benefits from the "arrayification" of the per-node LRU
> lists and statistics originally proposed and posted by Christoph Lameter.

Again, per-node x per-memcg. The list is not stored in just a per-node
structure, it is also per-memcg.

> -The unevictable list does not use the LRU pagevec mechanism. Rather,
> -unevictable pages are placed directly on the page's zone's unevictable list
> -under the zone lru_lock. This allows us to prevent the stranding of pages on
> -the unevictable list when one task has the page isolated from the LRU and other
> -tasks are changing the "evictability" state of the page.
> -
>
> Memory Control Group Interaction
> --------------------------------
> @@ -101,8 +95,8 @@ The unevictable LRU facility interacts with the memory control group [aka
> memory controller; see Documentation/admin-guide/cgroup-v1/memory.rst] by extending the
> lru_list enum.
> > -The memory controller data structure automatically gets a per-zone unevictable > -list as a result of the "arrayification" of the per-zone LRU lists (one per > +The memory controller data structure automatically gets a per-node unevictable > +list as a result of the "arrayification" of the per-node LRU lists (one per > lru_list enum element). The memory controller tracks the movement of pages to > and from the unevictable list. Again, per-memcg and per-node. > @@ -196,7 +190,7 @@ for the sake of expediency, to leave a unevictable page on one of the regular > active/inactive LRU lists for vmscan to deal with. vmscan checks for such > pages in all of the shrink_{active|inactive|page}_list() functions and will > "cull" such pages that it encounters: that is, it diverts those pages to the > -unevictable list for the zone being scanned. > +unevictable list for the node being scanned. Another spot where memcg and node apply, not just node. > There may be situations where a page is mapped into a VM_LOCKED VMA, but the > page is not marked as PG_mlocked. Such pages will make it all the way to > @@ -328,7 +322,7 @@ If the page was NOT already mlocked, mlock_vma_page() attempts to isolate the > page from the LRU, as it is likely on the appropriate active or inactive list > at that time. If the isolate_lru_page() succeeds, mlock_vma_page() will put > back the page - by calling putback_lru_page() - which will notice that the page > -is now mlocked and divert the page to the zone's unevictable list. If > +is now mlocked and divert the page to the node's unevictable list. If > mlock_vma_page() is unable to isolate the page from the LRU, vmscan will handle > it later if and when it attempts to reclaim the page. > Maybe instead of just replacing zone with node it might work better to use wording such as "node specific memcg unevictable LRU list". > @@ -603,7 +597,7 @@ Some examples of these unevictable pages on the LRU lists are: > unevictable list in mlock_vma_page(). 
> > shrink_inactive_list() also diverts any unevictable pages that it finds on the > -inactive lists to the appropriate zone's unevictable list. > +inactive lists to the appropriate node's unevictable list. > > shrink_inactive_list() should only see SHM_LOCK'd pages that became SHM_LOCK'd > after shrink_active_list() had moved them to the inactive list, or pages mapped Same here. > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > index 64ede5f150dc..44738cdb5a55 100644 > --- a/include/linux/mm_types.h > +++ b/include/linux/mm_types.h > @@ -78,7 +78,7 @@ struct page { > struct { /* Page cache and anonymous pages */ > /** > * @lru: Pageout list, eg. active_list protected by > - * pgdat->lru_lock. Sometimes used as a generic list > + * lruvec->lru_lock. Sometimes used as a generic list > * by the page owner. > */ > struct list_head lru; > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index 8af956aa13cf..c92289a4e14d 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -115,7 +115,7 @@ static inline bool free_area_empty(struct free_area *area, int migratetype) > struct pglist_data; > > /* > - * zone->lock and the zone lru_lock are two of the hottest locks in the kernel. > + * zone->lock and the lru_lock are two of the hottest locks in the kernel. > * So add a wild amount of padding here to ensure that they fall into separate > * cachelines. There are very few zone structures in the machine, so space > * consumption is not a concern here. So I don't believe you are using ZONE_PADDING in any way to try and protect the LRU lock currently. At least you aren't using it in the lruvec. As such it might make sense to just drop the reference to the lru_lock here. That reminds me that we still need to review the placement of the lru_lock and determine if there might be a better placement and/or padding that might improve performance when under heavy stress. 
> diff --git a/mm/filemap.c b/mm/filemap.c > index 385759c4ce4b..3083557a1ce6 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -101,8 +101,8 @@ > * ->swap_lock (try_to_unmap_one) > * ->private_lock (try_to_unmap_one) > * ->i_pages lock (try_to_unmap_one) > - * ->pgdat->lru_lock (follow_page->mark_page_accessed) > - * ->pgdat->lru_lock (check_pte_range->isolate_lru_page) > + * ->lruvec->lru_lock (follow_page->mark_page_accessed) > + * ->lruvec->lru_lock (check_pte_range->isolate_lru_page) > * ->private_lock (page_remove_rmap->set_page_dirty) > * ->i_pages lock (page_remove_rmap->set_page_dirty) > * bdi.wb->list_lock (page_remove_rmap->set_page_dirty) > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index d6746656cc39..a018d7c8a3f2 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -3057,7 +3057,7 @@ void __memcg_kmem_uncharge_page(struct page *page, int order) > #ifdef CONFIG_TRANSPARENT_HUGEPAGE > > /* > - * Because tail pages are not marked as "used", set it. We're under > + * Because tail pages are not marked as "used", set it. Don't need > * lruvec->lru_lock and migration entries setup in all page mappings. 
> */ > void mem_cgroup_split_huge_fixup(struct page *head) > diff --git a/mm/rmap.c b/mm/rmap.c > index 5fe2dedce1fc..7f6e95680c47 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -28,12 +28,12 @@ > * hugetlb_fault_mutex (hugetlbfs specific page fault mutex) > * anon_vma->rwsem > * mm->page_table_lock or pte_lock > - * pgdat->lru_lock (in mark_page_accessed, isolate_lru_page) > * swap_lock (in swap_duplicate, swap_info_get) > * mmlist_lock (in mmput, drain_mmlist and others) > * mapping->private_lock (in __set_page_dirty_buffers) > - * mem_cgroup_{begin,end}_page_stat (memcg->move_lock) > + * lock_page_memcg move_lock (in __set_page_dirty_buffers) > * i_pages lock (widely used) > + * lruvec->lru_lock (in lock_page_lruvec_irq) > * inode->i_lock (in set_page_dirty's __mark_inode_dirty) > * bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty) > * sb_lock (within inode_lock in fs/fs-writeback.c) > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 078a1640ec60..bb3ac52de058 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -1620,14 +1620,16 @@ static __always_inline void update_lru_sizes(struct lruvec *lruvec, > } > > /** > - * pgdat->lru_lock is heavily contended. Some of the functions that > + * Isolating page from the lruvec to fill in @dst list by nr_to_scan times. > + * > + * lruvec->lru_lock is heavily contended. Some of the functions that > * shrink the lists perform better by taking out a batch of pages > * and working on them outside the LRU lock. > * > * For pagecache intensive workloads, this function is the hottest > * spot in the kernel (apart from copy_*_user functions). > * > - * Appropriate locks must be held before calling this function. > + * Lru_lock must be held before calling this function. > * > * @nr_to_scan: The number of eligible pages to look through on the list. > * @lruvec: The LRU vector to pull pages from. 
> @@ -1826,14 +1828,16 @@ static int too_many_isolated(struct pglist_data *pgdat, int file, > > /* > * This moves pages from @list to corresponding LRU list. > + * The pages from @list is out of any lruvec, and in the end list reuses as > + * pages_to_free list. > * > * We move them the other way if the page is referenced by one or more > * processes, from rmap. > * > * If the pages are mostly unmapped, the processing is fast and it is > - * appropriate to hold zone_lru_lock across the whole operation. But if > + * appropriate to hold lru_lock across the whole operation. But if > * the pages are mapped, the processing is slow (page_referenced()) so we > - * should drop zone_lru_lock around each page. It's impossible to balance > + * should drop lru_lock around each page. It's impossible to balance > * this, so instead we remove the pages from the LRU while processing them. > * It is safe to rely on PG_active against the non-LRU pages in here because > * nobody will play with that bit on a non-LRU page. > -- > 1.8.3.1 > ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH v17 21/21] mm/lru: revise the comments of lru_lock

From: Alex Shi @ 2020-08-04 10:04 UTC
To: Alexander Duyck
Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox, Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups, Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov, Rong Chen, Andrey Ryabinin, Jann Horn

On 2020/8/4 6:37 AM, Alexander Duyck wrote:
>>
>> shrink_inactive_list() also diverts any unevictable pages that it finds on the
>> -inactive lists to the appropriate zone's unevictable list.
>> +inactive lists to the appropriate node's unevictable list.
>>
>> shrink_inactive_list() should only see SHM_LOCK'd pages that became SHM_LOCK'd
>> after shrink_active_list() had moved them to the inactive list, or pages mapped
> Same here.

The lruvec is actually used per memcg per node, and it falls back to the
node if memcg is disabled. So the comments are still right.

And most of the changes just fix leftovers from the zone->lru_lock to
pgdat->lru_lock change.

>
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 64ede5f150dc..44738cdb5a55 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -78,7 +78,7 @@ struct page {
>> struct { /* Page cache and anonymous pages */
>> /**
>> * @lru: Pageout list, eg. active_list protected by
>> - * pgdat->lru_lock. Sometimes used as a generic list
>> + * lruvec->lru_lock. Sometimes used as a generic list
>> * by the page owner.
>> */
>> struct list_head lru;
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index 8af956aa13cf..c92289a4e14d 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -115,7 +115,7 @@ static inline bool free_area_empty(struct free_area *area, int migratetype)
>> struct pglist_data;
>>
>> /*
>> - * zone->lock and the zone lru_lock are two of the hottest locks in the kernel.
>> + * zone->lock and the lru_lock are two of the hottest locks in the kernel.
>> * So add a wild amount of padding here to ensure that they fall into separate
>> * cachelines. There are very few zone structures in the machine, so space
>> * consumption is not a concern here.
> So I don't believe you are using ZONE_PADDING in any way to try and
> protect the LRU lock currently. At least you aren't using it in the
> lruvec. As such it might make sense to just drop the reference to the
> lru_lock here. That reminds me that we still need to review the
> placement of the lru_lock and determine if there might be a better
> placement and/or padding that might improve performance when under
> heavy stress.
>

Right, does the following look better?

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ccc76590f823..0ed520954843 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -113,8 +113,7 @@ static inline bool free_area_empty(struct free_area *area, int migratetype)
 struct pglist_data;

 /*
- * zone->lock and the lru_lock are two of the hottest locks in the kernel.
- * So add a wild amount of padding here to ensure that they fall into separate
+ * Add a wild amount of padding here to ensure datas fall into separate
  * cachelines. There are very few zone structures in the machine, so space
  * consumption is not a concern here.
  */

Thanks!
Alex
* Re: [PATCH v17 21/21] mm/lru: revise the comments of lru_lock

From: Alexander Duyck @ 2020-08-04 14:29 UTC
To: Alex Shi
Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox, Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups, Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov, Rong Chen, Andrey Ryabinin, Jann Horn

On Tue, Aug 4, 2020 at 3:04 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
>
>
> On 2020/8/4 6:37 AM, Alexander Duyck wrote:
> >>
> >> shrink_inactive_list() also diverts any unevictable pages that it finds on the
> >> -inactive lists to the appropriate zone's unevictable list.
> >> +inactive lists to the appropriate node's unevictable list.
> >>
> >> shrink_inactive_list() should only see SHM_LOCK'd pages that became SHM_LOCK'd
> >> after shrink_active_list() had moved them to the inactive list, or pages mapped
> > Same here.
>
> lruvec is used per memcg per node actually, and it fallback to node if memcg disabled.
> So the comments are still right.
>
> And most of changes just fix from zone->lru_lock to pgdat->lru_lock change.

Actually in my mind one thing that might work better would be to
explain what the lruvec is and where it resides. Then replace zone
with lruvec since that is really where the unevictable list resides.
Then it would be correct for both the memcg and pgdat case.
> > > >> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > >> index 64ede5f150dc..44738cdb5a55 100644 > >> --- a/include/linux/mm_types.h > >> +++ b/include/linux/mm_types.h > >> @@ -78,7 +78,7 @@ struct page { > >> struct { /* Page cache and anonymous pages */ > >> /** > >> * @lru: Pageout list, eg. active_list protected by > >> - * pgdat->lru_lock. Sometimes used as a generic list > >> + * lruvec->lru_lock. Sometimes used as a generic list > >> * by the page owner. > >> */ > >> struct list_head lru; > >> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > >> index 8af956aa13cf..c92289a4e14d 100644 > >> --- a/include/linux/mmzone.h > >> +++ b/include/linux/mmzone.h > >> @@ -115,7 +115,7 @@ static inline bool free_area_empty(struct free_area *area, int migratetype) > >> struct pglist_data; > >> > >> /* > >> - * zone->lock and the zone lru_lock are two of the hottest locks in the kernel. > >> + * zone->lock and the lru_lock are two of the hottest locks in the kernel. > >> * So add a wild amount of padding here to ensure that they fall into separate > >> * cachelines. There are very few zone structures in the machine, so space > >> * consumption is not a concern here. > > So I don't believe you are using ZONE_PADDING in any way to try and > > protect the LRU lock currently. At least you aren't using it in the > > lruvec. As such it might make sense to just drop the reference to the > > lru_lock here. That reminds me that we still need to review the > > placement of the lru_lock and determine if there might be a better > > placement and/or padding that might improve performance when under > > heavy stress. > > > > Right, is it the following looks better? 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index ccc76590f823..0ed520954843 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -113,8 +113,7 @@ static inline bool free_area_empty(struct free_area *area, int migratetype)
> struct pglist_data;
>
> /*
> - * zone->lock and the lru_lock are two of the hottest locks in the kernel.
> - * So add a wild amount of padding here to ensure that they fall into separate
> + * Add a wild amount of padding here to ensure datas fall into separate
> * cachelines. There are very few zone structures in the machine, so space
> * consumption is not a concern here.
> */
>
> Thanks!
> Alex

I would maybe tweak it to make sure it is clear that we are using this
to pad out items that are likely to cause cache thrash, such as various
hot spinlocks.
* Re: [PATCH v17 21/21] mm/lru: revise the comments of lru_lock

From: Alex Shi @ 2020-08-06 1:39 UTC
To: Alexander Duyck
Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox, Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups, Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov, Rong Chen, Andrey Ryabinin, Jann Horn

On 2020/8/4 10:29 PM, Alexander Duyck wrote:
> On Tue, Aug 4, 2020 at 3:04 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>>
>>
>>
>> On 2020/8/4 6:37 AM, Alexander Duyck wrote:
>>>>
>>>> shrink_inactive_list() also diverts any unevictable pages that it finds on the
>>>> -inactive lists to the appropriate zone's unevictable list.
>>>>
>>>> shrink_inactive_list() should only see SHM_LOCK'd pages that became SHM_LOCK'd
>>>> after shrink_active_list() had moved them to the inactive list, or pages mapped
>>> Same here.
>>
>> lruvec is used per memcg per node actually, and it fallback to node if memcg disabled.
>> So the comments are still right.
>>
>> And most of changes just fix from zone->lru_lock to pgdat->lru_lock change.
>
> Actually in my mind one thing that might work better would be to
> explain what the lruvec is and where it resides. Then replace zone
> with lruvec since that is really where the unevictable list resides.
> Then it would be correct for both the memcg and pgdat case.

Would you like to revise the doc along those lines?
> >>> >>>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h >>>> index 64ede5f150dc..44738cdb5a55 100644 >>>> --- a/include/linux/mm_types.h >>>> +++ b/include/linux/mm_types.h >>>> @@ -78,7 +78,7 @@ struct page { >>>> struct { /* Page cache and anonymous pages */ >>>> /** >>>> * @lru: Pageout list, eg. active_list protected by >>>> - * pgdat->lru_lock. Sometimes used as a generic list >>>> + * lruvec->lru_lock. Sometimes used as a generic list >>>> * by the page owner. >>>> */ >>>> struct list_head lru; >>>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h >>>> index 8af956aa13cf..c92289a4e14d 100644 >>>> --- a/include/linux/mmzone.h >>>> +++ b/include/linux/mmzone.h >>>> @@ -115,7 +115,7 @@ static inline bool free_area_empty(struct free_area *area, int migratetype) >>>> struct pglist_data; >>>> >>>> /* >>>> - * zone->lock and the zone lru_lock are two of the hottest locks in the kernel. >>>> + * zone->lock and the lru_lock are two of the hottest locks in the kernel. >>>> * So add a wild amount of padding here to ensure that they fall into separate >>>> * cachelines. There are very few zone structures in the machine, so space >>>> * consumption is not a concern here. >>> So I don't believe you are using ZONE_PADDING in any way to try and >>> protect the LRU lock currently. At least you aren't using it in the >>> lruvec. As such it might make sense to just drop the reference to the >>> lru_lock here. That reminds me that we still need to review the >>> placement of the lru_lock and determine if there might be a better >>> placement and/or padding that might improve performance when under >>> heavy stress. >>> >> >> Right, is it the following looks better? 
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index ccc76590f823..0ed520954843 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -113,8 +113,7 @@ static inline bool free_area_empty(struct free_area *area, int migratetype)
>> struct pglist_data;
>>
>> /*
>> - * zone->lock and the lru_lock are two of the hottest locks in the kernel.
>> - * So add a wild amount of padding here to ensure that they fall into separate
>> + * Add a wild amount of padding here to ensure datas fall into separate
>> * cachelines. There are very few zone structures in the machine, so space
>> * consumption is not a concern here.
>> */
>>
>> Thanks!
>> Alex
>
> I would maybe tweak it to make sure it is clear that we are using this
> to pad out items that are likely to cause cache thrash such as various
> hot spinlocks and such.

I'd appreciate it if you could improve the doc. :)

Thanks
Alex
* Re: [PATCH v17 21/21] mm/lru: revise the comments of lru_lock

From: Alexander Duyck @ 2020-08-06 16:27 UTC
To: Alex Shi
Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox, Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups, Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov, Rong Chen, Andrey Ryabinin, Jann Horn

On Wed, Aug 5, 2020 at 6:39 PM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
>
> On 2020/8/4 10:29 PM, Alexander Duyck wrote:
> > On Tue, Aug 4, 2020 at 3:04 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
> >>
> >>
> >>
> >> On 2020/8/4 6:37 AM, Alexander Duyck wrote:
> >>>>
> >>>> shrink_inactive_list() also diverts any unevictable pages that it finds on the
> >>>> -inactive lists to the appropriate zone's unevictable list.
> >>>> +inactive lists to the appropriate node's unevictable list.
> >>>>
> >>>> shrink_inactive_list() should only see SHM_LOCK'd pages that became SHM_LOCK'd
> >>>> after shrink_active_list() had moved them to the inactive list, or pages mapped
> >>> Same here.
> >>
> >> lruvec is used per memcg per node actually, and it fallback to node if memcg disabled.
> >> So the comments are still right.
> >>
> >> And most of changes just fix from zone->lru_lock to pgdat->lru_lock change.
> >
> > Actually in my mind one thing that might work better would be to
> > explain what the lruvec is and where it resides. Then replace zone
> > with lruvec since that is really where the unevictable list resides.
> > Then it would be correct for both the memcg and pgdat case.
>
> Could you like to revise the doc as your thought?
> >
> >>>
> >>>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> >>>> index 64ede5f150dc..44738cdb5a55 100644
> >>>> --- a/include/linux/mm_types.h
> >>>> +++ b/include/linux/mm_types.h
> >>>> @@ -78,7 +78,7 @@ struct page {
> >>>>          struct {        /* Page cache and anonymous pages */
> >>>>                  /**
> >>>>                   * @lru: Pageout list, eg. active_list protected by
> >>>> -                 * pgdat->lru_lock. Sometimes used as a generic list
> >>>> +                 * lruvec->lru_lock. Sometimes used as a generic list
> >>>>                   * by the page owner.
> >>>>                   */
> >>>>                  struct list_head lru;
> >>>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> >>>> index 8af956aa13cf..c92289a4e14d 100644
> >>>> --- a/include/linux/mmzone.h
> >>>> +++ b/include/linux/mmzone.h
> >>>> @@ -115,7 +115,7 @@ static inline bool free_area_empty(struct free_area *area, int migratetype)
> >>>>  struct pglist_data;
> >>>>
> >>>>  /*
> >>>> - * zone->lock and the zone lru_lock are two of the hottest locks in the kernel.
> >>>> + * zone->lock and the lru_lock are two of the hottest locks in the kernel.
> >>>>   * So add a wild amount of padding here to ensure that they fall into separate
> >>>>   * cachelines. There are very few zone structures in the machine, so space
> >>>>   * consumption is not a concern here.
> >>> So I don't believe you are using ZONE_PADDING in any way to try and
> >>> protect the LRU lock currently. At least you aren't using it in the
> >>> lruvec. As such it might make sense to just drop the reference to the
> >>> lru_lock here. That reminds me that we still need to review the
> >>> placement of the lru_lock and determine if there might be a better
> >>> placement and/or padding that might improve performance when under
> >>> heavy stress.
> >>>
> >>
> >> Right, does the following look better?
> >>
> >> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> >> index ccc76590f823..0ed520954843 100644
> >> --- a/include/linux/mmzone.h
> >> +++ b/include/linux/mmzone.h
> >> @@ -113,8 +113,7 @@ static inline bool free_area_empty(struct free_area *area, int migratetype)
> >>  struct pglist_data;
> >>
> >>  /*
> >> - * zone->lock and the lru_lock are two of the hottest locks in the kernel.
> >> - * So add a wild amount of padding here to ensure that they fall into separate
> >> + * Add a wild amount of padding here to ensure datas fall into separate
> >>  * cachelines. There are very few zone structures in the machine, so space
> >>  * consumption is not a concern here.
> >>  */
> >>
> >> Thanks!
> >> Alex
> >
> > I would maybe tweak it to make sure it is clear that we are using this
> > to pad out items that are likely to cause cache thrash such as various
> > hot spinlocks and such.
> >
>
> I'd appreciate it if you could improve the doc. :)

Give me a day or so. I will submit a follow-on patch with some cleanup
for the comments.

Thanks.

- Alex

^ permalink raw reply	[flat|nested] 101+ messages in thread
end of thread, other threads: [~2020-08-18 6:50 UTC | newest]

Thread overview: 101+ messages (download: mbox.gz / follow: Atom feed)
2020-07-25 12:59 [PATCH v17 00/21] per memcg lru lock Alex Shi
2020-07-25 12:59 ` [PATCH v17 01/21] mm/vmscan: remove unnecessary lruvec adding Alex Shi
2020-08-06 3:47 ` Alex Shi
2020-07-25 12:59 ` [PATCH v17 02/21] mm/page_idle: no unlikely double check for idle page counting Alex Shi
2020-07-25 12:59 ` [PATCH v17 03/21] mm/compaction: correct the comments of compact_defer_shift Alex Shi
2020-07-27 17:29 ` Alexander Duyck
2020-07-28 11:59 ` Alex Shi
2020-07-28 14:17 ` Alexander Duyck
2020-07-25 12:59 ` [PATCH v17 04/21] mm/compaction: rename compact_deferred as compact_should_defer Alex Shi
2020-07-25 12:59 ` [PATCH v17 05/21] mm/thp: move lru_add_page_tail func to huge_memory.c Alex Shi
2020-07-25 12:59 ` [PATCH v17 09/21] mm/memcg: add debug checking in lock_page_memcg Alex Shi
2020-07-25 12:59 ` [PATCH v17 12/21] mm/lru: move lock into lru_note_cost Alex Shi
2020-07-25 12:59 ` [PATCH v17 13/21] mm/lru: introduce TestClearPageLRU Alex Shi
2020-07-29 3:53 ` Alex Shi
2020-08-05 22:43 ` Alexander Duyck
2020-08-06 1:54 ` Alex Shi
2020-08-06 14:41 ` Alexander Duyck
2020-07-25 12:59 ` [PATCH v17 15/21] mm/thp: add tail pages into lru anyway in split_huge_page() Alex Shi
2020-07-25 12:59 ` [PATCH v17 17/21] mm/lru: replace pgdat lru_lock with lruvec lock Alex Shi
2020-07-27 23:34 ` Alexander Duyck
2020-07-28 7:15 ` Alex Shi
2020-07-28 11:19 ` Alex Shi
2020-07-28 14:54 ` Alexander Duyck
2020-07-29 1:00 ` Alex Shi
2020-07-29 1:27 ` Alexander Duyck
2020-07-29 2:27 ` Alex Shi
2020-07-28 15:39 ` Alex Shi
2020-07-28 15:55 ` Alexander Duyck
2020-07-29 0:48 ` Alex Shi
2020-08-06 7:41 ` Alex Shi
2020-07-29 3:54 ` Alex Shi
2020-07-27 5:40 ` [PATCH v17 00/21] per memcg lru lock Alex Shi
2020-07-29 14:49 ` Alex Shi
2020-07-29 18:06 ` Hugh Dickins
2020-07-30 2:16 ` Alex Shi
2020-08-03 15:07 ` Michal Hocko
2020-08-04 6:14 ` Alex Shi
2020-07-31 21:31 ` Alexander Duyck
2020-08-04 8:36 ` Alex Shi
2020-08-04 8:36 ` Alex Shi
2020-08-04 8:37 ` Alex Shi
2020-08-04 8:37 ` Alex Shi
2020-07-25 12:59 ` [PATCH v17 06/21] mm/thp: clean up lru_add_page_tail Alex Shi
2020-07-25 12:59 ` [PATCH v17 07/21] mm/thp: remove code path which never got into Alex Shi
2020-07-25 12:59 ` [PATCH v17 08/21] mm/thp: narrow lru locking Alex Shi
2020-07-25 12:59 ` [PATCH v17 10/21] mm/swap: fold vm event PGROTATED into pagevec_move_tail_fn Alex Shi
2020-07-25 12:59 ` [PATCH v17 11/21] mm/lru: move lru_lock holding in func lru_note_cost_page Alex Shi
2020-08-05 21:18 ` Alexander Duyck
2020-07-25 12:59 ` [PATCH v17 14/21] mm/compaction: do page isolation first in compaction Alex Shi
2020-08-04 21:35 ` Alexander Duyck
2020-08-06 18:38 ` Alexander Duyck
2020-08-07 3:24 ` Alex Shi
2020-08-07 14:51 ` Alexander Duyck
2020-08-10 13:10 ` Alex Shi
2020-08-10 14:41 ` Alexander Duyck
2020-08-11 8:22 ` Alex Shi
2020-08-11 14:47 ` Alexander Duyck
2020-08-12 11:43 ` Alex Shi
2020-08-12 12:16 ` Alex Shi
2020-08-12 16:51 ` Alexander Duyck
2020-08-13 1:46 ` Alex Shi
2020-08-13 2:17 ` Alexander Duyck
2020-08-13 3:52 ` Alex Shi
2020-08-13 4:02 ` [RFC PATCH 0/3] " Alexander Duyck
2020-08-13 4:02 ` [RFC PATCH 1/3] mm: Drop locked from isolate_migratepages_block Alexander Duyck
2020-08-13 6:56 ` Alex Shi
2020-08-13 14:32 ` Alexander Duyck
2020-08-14 7:25 ` Alex Shi
2020-08-13 7:44 ` Alex Shi
2020-08-13 14:26 ` Alexander Duyck
2020-08-13 4:02 ` [RFC PATCH 2/3] mm: Drop use of test_and_set_skip in favor of just setting skip Alexander Duyck
2020-08-14 7:19 ` Alex Shi
2020-08-14 14:24 ` Alexander Duyck
2020-08-14 21:15 ` Alexander Duyck
2020-08-17 15:38 ` Alexander Duyck
2020-08-18 6:50 ` Alex Shi
2020-08-13 4:02 ` [RFC PATCH 3/3] mm: Identify compound pages sooner in isolate_migratepages_block Alexander Duyck
2020-08-14 7:20 ` Alex Shi
2020-08-17 22:58 ` [PATCH v17 14/21] mm/compaction: do page isolation first in compaction Alexander Duyck
2020-07-25 12:59 ` [PATCH v17 16/21] mm/swap: serialize memcg changes in pagevec_lru_move_fn Alex Shi
2020-07-25 12:59 ` [PATCH v17 18/21] mm/lru: introduce the relock_page_lruvec function Alex Shi
2020-07-29 17:52 ` Alexander Duyck
2020-07-30 6:08 ` Alex Shi
2020-07-31 14:20 ` Alexander Duyck
2020-07-31 21:14 ` [PATCH RFC] mm: Add function for testing if the current lruvec lock is valid alexander.h.duyck-ral2JQCrhuEAvxtiuMwx3w
2020-07-31 23:54 ` Alex Shi
2020-08-02 18:20 ` Alexander Duyck
2020-08-04 6:13 ` Alex Shi
2020-07-25 12:59 ` [PATCH v17 19/21] mm/vmscan: use relock for move_pages_to_lru Alex Shi
2020-08-03 22:49 ` Alexander Duyck
2020-08-04 6:23 ` Alex Shi
2020-07-25 12:59 ` [PATCH v17 20/21] mm/pgdat: remove pgdat lru_lock Alex Shi
2020-08-03 22:42 ` Alexander Duyck
2020-08-03 22:45 ` Alexander Duyck
2020-08-04 6:22 ` Alex Shi
2020-07-25 12:59 ` [PATCH v17 21/21] mm/lru: revise the comments of lru_lock Alex Shi
2020-08-03 22:37 ` Alexander Duyck
2020-08-04 10:04 ` Alex Shi
2020-08-04 14:29 ` Alexander Duyck
[not found] ` <CAKgT0UckqbmYJDE3L2Bg1Nr=Y=GT0OBx1GEhaZ14EbRTzd8tiw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2020-08-06 1:39 ` Alex Shi
2020-08-06 16:27 ` Alexander Duyck