Re: [PATCH 2/4] mm: Send a single IPI to TLB flush multiple pages when unmapping

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Minchan Kim <minchan@kernel.org>
To: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>, Rik van Riel <riel@redhat.com>,
	Linux-MM <linux-mm@kvack.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Dave Hansen <dave.hansen@intel.com>,
	Andi Kleen <andi@firstfloor.org>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 2/4] mm: Send a single IPI to TLB flush multiple pages when unmapping
Date: Thu, 16 Apr 2015 15:38:26 +0900	[thread overview]
Message-ID: <20150416063826.GA7721@blaptop> (raw)
In-Reply-To: <20150415212855.GI14842@suse.de>

Hello Mel,

On Wed, Apr 15, 2015 at 10:28:55PM +0100, Mel Gorman wrote:
> On Wed, Apr 15, 2015 at 02:16:49PM -0700, Hugh Dickins wrote:
> > On Wed, 15 Apr 2015, Rik van Riel wrote:
> > > On 04/15/2015 06:42 AM, Mel Gorman wrote:
> > > > An IPI is sent to flush remote TLBs when a page is unmapped that was
> > > > recently accessed by other CPUs. There are many circumstances where this
> > > > happens but the obvious one is kswapd reclaiming pages belonging to a
> > > > running process as kswapd and the task are likely running on separate CPUs.
> > > > 
> > > > On small machines, this is not a significant problem but as machine
> > > > gets larger with more cores and more memory, the cost of these IPIs can
> > > > be high. This patch uses a structure similar in principle to a pagevec
> > > > to collect a list of PFNs and CPUs that require flushing. It then sends
> > > > one IPI to flush the list of PFNs. A new TLB flush helper is required for
> > > > this and one is added for x86. Other architectures will need to decide if
> > > > batching like this is both safe and worth the memory overhead. Specifically
> > > > the requirement is;
> > > > 
> > > > 	If a clean page is unmapped and not immediately flushed, the
> > > > 	architecture must guarantee that a write to that page from a CPU
> > > > 	with a cached TLB entry will trap a page fault.
> > > > 
> > > > This is essentially what the kernel already depends on but the window is
> > > > much larger with this patch applied and is worth highlighting.
> > > 
> > > This means we already have a (hard to hit?) data corruption
> > > issue in the kernel.  We can lose data if we unmap a writable
> > > but not dirty pte from a file page, and the task writes before
> > > we flush the TLB.
> > 
> > I don't think so.  IIRC, when the CPU needs to set the dirty bit,
> > it doesn't just do that in its TLB entry, but has to fetch and update
> > the actual pte entry - and at that point discovers it's no longer
> > valid so traps, as Mel says.
> > 
> 
> This is what I'm expecting i.e. clean->dirty transition is write-through
> to the PTE which is now unmapped and it traps. I'm assuming there is an
> architectural guarantee that it happens but could not find an explicit
> statement in the docs. I'm hoping Dave or Andi can check with the relevant
> people on my behalf.

A dumb question. It's not related to your patch but MADV_FREE.

clean->dirty transition is *atomic* as well as write-through?
I'm really confusing.
It seems most arches use xchg for ptep_get_and_clear so it's
atomic but some of arches without defining __HAVE_ARCH_PTEP_GET_AND_CLEAR
will use non-atomic version in include/asm-generic/pgtable.h.

        #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
        static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
                                               unsigned long address,
                                               pte_t *ptep)
        {
                pte_t pte = *ptep;
                pte_clear(mm, address, ptep);
                return pte;
        }
        #endif

I hope they have own lock or something to protect a race between software
and hardware(ie, CPU set dirty bit by itself).

Anyway, if there is a problem about that, we might see data corruption
but didn't so I guess it's atomic.
Otherwise, MADV_FREE will break, too.
I'd like to confirm that.

Thanks.

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

WARNING: multiple messages have this Message-ID (diff)

From: Minchan Kim <minchan@kernel.org>
To: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>, Rik van Riel <riel@redhat.com>,
	Linux-MM <linux-mm@kvack.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Dave Hansen <dave.hansen@intel.com>,
	Andi Kleen <andi@firstfloor.org>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 2/4] mm: Send a single IPI to TLB flush multiple pages when unmapping
Date: Thu, 16 Apr 2015 15:38:26 +0900	[thread overview]
Message-ID: <20150416063826.GA7721@blaptop> (raw)
In-Reply-To: <20150415212855.GI14842@suse.de>

Hello Mel,

On Wed, Apr 15, 2015 at 10:28:55PM +0100, Mel Gorman wrote:
> On Wed, Apr 15, 2015 at 02:16:49PM -0700, Hugh Dickins wrote:
> > On Wed, 15 Apr 2015, Rik van Riel wrote:
> > > On 04/15/2015 06:42 AM, Mel Gorman wrote:
> > > > An IPI is sent to flush remote TLBs when a page is unmapped that was
> > > > recently accessed by other CPUs. There are many circumstances where this
> > > > happens but the obvious one is kswapd reclaiming pages belonging to a
> > > > running process as kswapd and the task are likely running on separate CPUs.
> > > > 
> > > > On small machines, this is not a significant problem but as machine
> > > > gets larger with more cores and more memory, the cost of these IPIs can
> > > > be high. This patch uses a structure similar in principle to a pagevec
> > > > to collect a list of PFNs and CPUs that require flushing. It then sends
> > > > one IPI to flush the list of PFNs. A new TLB flush helper is required for
> > > > this and one is added for x86. Other architectures will need to decide if
> > > > batching like this is both safe and worth the memory overhead. Specifically
> > > > the requirement is;
> > > > 
> > > > 	If a clean page is unmapped and not immediately flushed, the
> > > > 	architecture must guarantee that a write to that page from a CPU
> > > > 	with a cached TLB entry will trap a page fault.
> > > > 
> > > > This is essentially what the kernel already depends on but the window is
> > > > much larger with this patch applied and is worth highlighting.
> > > 
> > > This means we already have a (hard to hit?) data corruption
> > > issue in the kernel.  We can lose data if we unmap a writable
> > > but not dirty pte from a file page, and the task writes before
> > > we flush the TLB.
> > 
> > I don't think so.  IIRC, when the CPU needs to set the dirty bit,
> > it doesn't just do that in its TLB entry, but has to fetch and update
> > the actual pte entry - and at that point discovers it's no longer
> > valid so traps, as Mel says.
> > 
> 
> This is what I'm expecting i.e. clean->dirty transition is write-through
> to the PTE which is now unmapped and it traps. I'm assuming there is an
> architectural guarantee that it happens but could not find an explicit
> statement in the docs. I'm hoping Dave or Andi can check with the relevant
> people on my behalf.

A dumb question. It's not related to your patch but MADV_FREE.

clean->dirty transition is *atomic* as well as write-through?
I'm really confusing.
It seems most arches use xchg for ptep_get_and_clear so it's
atomic but some of arches without defining __HAVE_ARCH_PTEP_GET_AND_CLEAR
will use non-atomic version in include/asm-generic/pgtable.h.

        #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
        static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
                                               unsigned long address,
                                               pte_t *ptep)
        {
                pte_t pte = *ptep;
                pte_clear(mm, address, ptep);
                return pte;
        }
        #endif

I hope they have own lock or something to protect a race between software
and hardware(ie, CPU set dirty bit by itself).

Anyway, if there is a problem about that, we might see data corruption
but didn't so I guess it's atomic.
Otherwise, MADV_FREE will break, too.
I'd like to confirm that.

Thanks.

-- 
Kind regards,
Minchan Kim

next prev parent reply	other threads:[~2015-04-16  6:38 UTC|newest]

Thread overview: 52+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-04-15 10:42 [RFC PATCH 0/4] TLB flush multiple pages with a single IPI Mel Gorman
2015-04-15 10:42 ` Mel Gorman
2015-04-15 10:42 ` [PATCH 1/4] x86, mm: Trace when an IPI is about to be sent Mel Gorman
2015-04-15 10:42   ` Mel Gorman
2015-04-15 10:42 ` [PATCH 2/4] mm: Send a single IPI to TLB flush multiple pages when unmapping Mel Gorman
2015-04-15 10:42   ` Mel Gorman
2015-04-15 21:03   ` Rik van Riel
2015-04-15 21:03     ` Rik van Riel
2015-04-15 21:16     ` Hugh Dickins
2015-04-15 21:16       ` Hugh Dickins
2015-04-15 21:28       ` Mel Gorman
2015-04-15 21:28         ` Mel Gorman
2015-04-15 21:32         ` Dave Hansen
2015-04-15 21:32           ` Dave Hansen
2015-04-16  6:38         ` Minchan Kim [this message]
2015-04-16  6:38           ` Minchan Kim
2015-04-16  8:07           ` Mel Gorman
2015-04-16  8:07             ` Mel Gorman
2015-04-16  8:29             ` Minchan Kim
2015-04-16  8:29               ` Minchan Kim
2015-04-16  9:19               ` Mel Gorman
2015-04-16  9:19                 ` Mel Gorman
2015-04-16 23:30                 ` Minchan Kim
2015-04-16 23:30                   ` Minchan Kim
2015-04-15 22:20   ` Andi Kleen
2015-04-15 22:20     ` Andi Kleen
2015-04-15 22:53     ` Mel Gorman
2015-04-15 22:53       ` Mel Gorman
2015-04-15 10:42 ` [PATCH 3/4] mm: Gather more PFNs before sending a TLB to flush unmapped pages Mel Gorman
2015-04-15 10:42   ` Mel Gorman
2015-04-15 11:42   ` Peter Zijlstra
2015-04-15 11:42     ` Peter Zijlstra
2015-04-15 12:15     ` Mel Gorman
2015-04-15 12:15       ` Mel Gorman
2015-04-15 12:24       ` Peter Zijlstra
2015-04-15 12:24         ` Peter Zijlstra
2015-04-15 12:56         ` Mel Gorman
2015-04-15 12:56           ` Mel Gorman
2015-04-15 10:42 ` [PATCH 4/4] mm: migrate: Batch TLB flushing when unmapping pages for migration Mel Gorman
2015-04-15 10:42   ` Mel Gorman
2015-04-15 21:06   ` Hugh Dickins
2015-04-15 21:06     ` Hugh Dickins
2015-04-15 21:44     ` Mel Gorman
2015-04-15 21:44       ` Mel Gorman
2015-04-15 23:50       ` Hugh Dickins
2015-04-15 23:50         ` Hugh Dickins
  -- strict thread matches above, loose matches on Subject: below --
2015-04-16 10:22 [RFC PATCH 0/4] TLB flush multiple pages with a single IPI v2 Mel Gorman
2015-04-16 10:22 ` [PATCH 2/4] mm: Send a single IPI to TLB flush multiple pages when unmapping Mel Gorman
2015-04-16 10:22   ` Mel Gorman
2015-04-16 15:52   ` Rik van Riel
2015-04-16 15:52     ` Rik van Riel
2015-04-16 19:21   ` Hugh Dickins
2015-04-16 19:21     ` Hugh Dickins

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20150416063826.GA7721@blaptop \
    --to=minchan@kernel.org \
    --cc=andi@firstfloor.org \
    --cc=dave.hansen@intel.com \
    --cc=hannes@cmpxchg.org \
    --cc=hughd@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mgorman@suse.de \
    --cc=riel@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.