From: SeongJae Park <sj38.park@gmail.com>
To: David Hildenbrand <david@redhat.com>
Cc: SeongJae Park <sj38.park@gmail.com>,
	akpm@linux-foundation.org, markubo@amazon.com,
	SeongJae Park <sjpark@amazon.de>,
	Jonathan.Cameron@Huawei.com, acme@kernel.org,
	alexander.shishkin@linux.intel.com, amit@kernel.org,
	benh@kernel.crashing.org, brendanhiggins@google.com,
	corbet@lwn.net, dwmw@amazon.com, elver@google.com,
	fan.du@intel.com, foersleo@amazon.de, greg@kroah.com,
	gthelen@google.com, guoju.fgj@alibaba-inc.com,
	jgowans@amazon.com, joe@perches.com, mgorman@suse.de,
	mheyne@amazon.de, minchan@kernel.org, mingo@redhat.com,
	namhyung@kernel.org, peterz@infradead.org, riel@surriel.com,
	rientjes@google.com, rostedt@goodmis.org, rppt@kernel.org,
	shakeelb@google.com, shuah@kernel.org, sieberf@amazon.com,
	snu@zelle79.org, vbabka@suse.cz, vdavydov.dev@gmail.com,
	zgf574564920@gmail.com, linux-damon@amazon.com,
	linux-mm@kvack.org, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH v34 05/13] mm/damon: Implement primitives for the virtual memory address spaces
Date: Fri, 27 Aug 2021 11:06:51 +0000
Message-ID: <20210827110651.1950-1-sjpark@amazon.de>
In-Reply-To: <3b094493-9c1e-6024-bfd5-7eca66399b7e@redhat.com>

From: SeongJae Park <sjpark@amazon.de>

On Thu, 26 Aug 2021 23:42:19 +0200 David Hildenbrand <david@redhat.com> wrote:

> On 26.08.21 19:29, SeongJae Park wrote:
> > From: SeongJae Park <sjpark@amazon.de>
> > 
> > Hello David,
> > 
> > 
> > On Thu, 26 Aug 2021 16:09:23 +0200 David Hildenbrand <david@redhat.com> wrote:
> > 
> >>> +static void damon_va_mkold(struct mm_struct *mm, unsigned long addr)
> >>> +{
> >>> +	pte_t *pte = NULL;
> >>> +	pmd_t *pmd = NULL;
> >>> +	spinlock_t *ptl;
> >>> +
> >>
> >> I just stumbled over this, sorry for the dumb questions:
> > 
> > I appreciate the great questions!
> > 
> >>
> >>
> >> a) What do we know about that region we are messing with?
> >>
> >> AFAIU, just like follow_pte() and follow_pfn(), follow_invalidate_pte()
> >> should only be called on VM_IO and raw VM_PFNMAP mappings in general
> >> (see the doc of follow_pte()). Do you even know that it's within a
> >> single VMA and that there are no concurrent modifications?
> > 
> > We have no idea about the region at this moment.  However, if we successfully
> > get the pte or pmd under the protection of the page table lock, we ensure the
> > page for the pte or pmd is an online LRU page with damon_get_page(), before
> > updating the pte or pmd's PAGE_ACCESSED bit.  We release the page table lock
> > only after the update.
> > 
> > And concurrent VMA changes don't matter here because we read and write only
> > the page table.  If the address is not mapped or not backed by LRU pages, we
> > simply treat it as not accessed.
> 
> Reading/writing page tables is the real problem.
> 
> > 
> >>
> >> b) Which locks are we holding?
> >>
> >> I hope we're holding the mmap lock in read mode at least. Or how are you
> >> making sure there are no concurrent modifications to page tables / VMA
> >> layout ... ?
> >>
> >>> +	if (follow_invalidate_pte(mm, addr, NULL, &pte, &pmd, &ptl))
> > 
> > All the operations are protected by the page table lock of the pte or pmd, so
> > no concurrent page table modification would happen.  As previously mentioned,
> > because we read and update only the page table, we don't care about VMAs and
> > therefore we don't need to hold the mmap lock here.
> 
> See below, that's unfortunately not sufficient.
> 
> > 
> > Outside of this function, DAMON reads the VMAs to learn which address ranges
> > are not mapped, and uses that information to avoid wastefully checking
> > accesses to those areas.  That happens only occasionally (once per 60 seconds
> > by default), and DAMON holds the mmap read lock in that case.
> > 
> > Nonetheless, I agree the usage of follow_invalidate_pte() here could be very
> > confusing for readers.  It would be better to implement and use DAMON's own
> > page table walk logic.  Of course, I might be missing something important.  If
> > you think so, please don't hesitate to yell at me.
> 
> 
> I'm certainly not going to yell :) But unfortunately I'll have to tell 
> you that what you are doing is in my understanding fundamentally broken.
> 
> See, page tables might get removed at any time
> a) By munmap() code even while holding the mmap semaphore in read (!)
> b) By khugepaged holding the mmap lock in write mode
> 
> The rules are (ignoring the rmap side of things)
> 
> a) You can walk page tables inside a known VMA with the mmap semaphore 
> held in read mode. If you drop the mmap sem, you have to re-validate the 
> VMA! Anything could have changed in the meantime. This is essentially 
> what mm/pagewalk.c does.
> 
> b) You can walk page tables ignoring VMAs with the mmap semaphore held 
> in write mode.
> 
> c) You can walk page tables lockless if the architecture supports it and
> you have interrupts disabled the whole time. But you are not allowed to
> write.
> 
> With what you're doing, you might end up reading random garbage as page 
> table pointers, or writing random garbage to pages that are no longer 
> used as page tables.
> 
> Take a look at mm/gup.c:lockless_pages_from_mm() to see how difficult it 
> is to walk page tables lockless. And it only works because page table 
> freeing code synchronizes either via IPI or fake-rcu before actually 
> freeing a page table.
> 
> follow_invalidate_pte() is, in general, the wrong thing to use. It's
> specialized to VM_IO and VM_PFNMAP. Take a look at the difference in
> complexity between follow_invalidate_pte() and mm/pagewalk.c!
> 
> I'm really sorry, but as far as I can tell, this is locking-wise broken 
> and follow_invalidate_pte() is the wrong interface to use here.
> 
> Someone can most certainly correct me if I'm wrong, or if I'm missing 
> something regarding your implementation, but if you take a look around, 
> you won't find any code walking page tables without at least holding the 
> mmap sem in read mode -- for a good reason.

Thank you very much for this kind explanation, David!  I will send a patch for
this soon.
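
To make rule (a) above concrete, here is a minimal user-space sketch of the
discipline: take the lock in read mode, look the VMA up again, and touch the
"page table" only inside that VMA.  This is a toy model under stated
assumptions, not kernel code and not the actual patch: every toy_* name is
hypothetical, the "mmap lock" is just a reader counter so that the ordering can
be checked with assert() in a single thread, and one bool stands in for the
accessed bit.  In the kernel the natural fit would presumably be the
mm/pagewalk.c walker that David points to.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * User-space toy model, NOT kernel code.  Every toy_* name is hypothetical.
 * The "mmap lock" is a plain reader counter so the locking discipline can
 * be checked with assert() in a single thread.
 */
struct toy_vma {
	unsigned long start, end;	/* covers [start, end) */
	bool accessed;			/* stands in for the PTE accessed bit */
};

struct toy_mm {
	int readers;			/* > 0 while "held in read mode" */
	struct toy_vma vmas[4];
	size_t nr_vmas;
};

static void toy_mmap_read_lock(struct toy_mm *mm)
{
	mm->readers++;
}

static void toy_mmap_read_unlock(struct toy_mm *mm)
{
	assert(mm->readers > 0);
	mm->readers--;
}

/* Look up the VMA covering addr; only legal with the lock held. */
static struct toy_vma *toy_find_vma(struct toy_mm *mm, unsigned long addr)
{
	assert(mm->readers > 0);
	for (size_t i = 0; i < mm->nr_vmas; i++)
		if (addr >= mm->vmas[i].start && addr < mm->vmas[i].end)
			return &mm->vmas[i];
	return NULL;
}

/*
 * Rule (a): take the lock, (re)validate the VMA, and touch the "page
 * table" only inside that VMA.  Returns true if the bit was cleared,
 * false if the address is not covered by any VMA.
 */
static bool toy_mkold(struct toy_mm *mm, unsigned long addr)
{
	bool cleared = false;
	struct toy_vma *vma;

	toy_mmap_read_lock(mm);
	vma = toy_find_vma(mm, addr);
	if (vma) {
		vma->accessed = false;
		cleared = true;
	}
	toy_mmap_read_unlock(mm);
	return cleared;
}
```

The point of the model is only the ordering: the lookup and the update both
happen between lock and unlock, and an address outside every VMA is reported as
"not cleared" instead of being dereferenced.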


Thanks,
SJ

> 
> -- 
> Thanks,
> 
> David / dhildenb
> 

