linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Ren Qiaowei <qiaowei.ren@intel.com>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>, Ingo Molnar <mingo@redhat.com>,
	Dave Hansen <dave.hansen@intel.com>,
	x86@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	linux-ia64@vger.kernel.org, linux-mips@linux-mips.org
Subject: Re: [PATCH v9 11/12] x86, mpx: cleanup unused bound tables
Date: Mon, 27 Oct 2014 11:13:55 +0800	[thread overview]
Message-ID: <544DB873.1010207@intel.com> (raw)
In-Reply-To: <alpine.DEB.2.11.1410241451280.5308@nanos>

On 10/24/2014 10:40 PM, Thomas Gleixner wrote:
> On Sun, 12 Oct 2014, Qiaowei Ren wrote:
>> Since we are doing the freeing from munmap() (and other paths like it),
>> we hold mmap_sem for write. If we fault, the page fault handler will
>> attempt to acquire mmap_sem for read and we will deadlock. For now, to
>> avoid deadlock, we disable page faults while touching the bounds directory
>> entry. This keeps us from being able to free the tables in this case.
>> This deficiency will be addressed in later patches.
>
> This is wrong to begin with. You need a proper design for addressing
> this short coming in the first place. Retrofitting it into your
> current design will just not work at all or end up with some major
> mess.
>
> The problem to solve is, that the kernel needs to unmap the bounds
> table if the data mapping which it covers goes away.
>
> You decided to do that at the end of unmap with mmap_sem write
> held. As I explained last time, that's not an absolute requirement,
> it's just a random choice. And that choice is obvioulsy not a really
> good one as your own observation of the page fault issue proves.
>
> So perhaps we should look at it the other way round. We already know
> before the actual unmap happens what's the start and the end of the
> area.
>
> So we can be way smarter and do the following:
>
>     mpx_pre_unmap(from, len);
>
>     down_write(mmap_sem);
>     do_unmap(from, len);
>     up_write(mmap_sem);
>
>     mpx_post_unmap(mpx_unmap);
>
> int mpx_pre_unmap(...)
> {
>      down_write(mm->bd_sem);
>
>      down_read(mm->mmap_sem);
>
>      Here we do the following:
>
>      1) Invalidate the bounds table entries for the to be unmapped area in
>         the bounds directory. This can fault as we only hold mmap sem for
>         read.
>
>      2) Mark the corresponding VMAs which should be unmapped and
>         removed.  This can be done with mmap sem down read held as the
>         VMA chain cannot be changed and the 'Mark for removal" field is
>         protected by mm->bd_sem.
>
>         For each to be removed VMA we increment mm->bd_remove_vmas
>
>      Holding mm->bd_sem also prevents that a new bound table to be
>      inserted, if we do the whole protection thing right.
>
>      up_read(mm->mmap_sem);
> }
>
> void mpx_post_unmap(void)
> {
> 	if (mm->bd_remove_vmas) {
> 	   down_write(mm->mmap_sem);
>   	   cleanup_to_be_removed_vmas();		
> 	   up_write(mm->mmap_sem);
> 	}
>
> 	up_write(mm->bd_sem);
> }
>

If so, I guess that there are some questions needed to be considered:

1) Almost all palces which call do_munmap() will need to add 
mpx_pre_unmap/post_unmap calls, like vm_munmap(), mremap(), shmdt(), etc..

2) before mpx_post_unmap() call, it is possible for those bounds tables 
within mm->bd_remove_vmas to be re-used.

In this case, userspace may do new mapping and access one address which 
will cover one of those bounds tables. During this period, HW will check 
if one bounds table exist, if yes one fault won't be produced.

3) According to Dave, those bounds tables related to adjacent VMAs 
within the start and the end possibly don't have to be fully unmmaped, 
and we only need free the part of backing physical memory.

> The mpx_pre_unmap/post_unmap calls can be either added explicit to the
> relevant down_write/unmap/up_write code pathes or you hide it behind
> some wrapper function.
>
> Now you need to acquire mm->bd_sem for the fault handling code as well
> in order to serialize the creation of new bound table entries. In that
> case a down_read(mm->bd_sem) is sufficient as the cmpxchg() prevents
> the insertion of multiple tables for the same directory entries.
>
> So we'd end up with the following serialization scheme:
>
> prctl enable/disable management:  down_write(mm->bd_sem);
>
> bounds violation:    		  down_read(mm->bd_sem);
>
> directory entry allocation:	  down_read(mm->bd_sem);
>
> directory entry removal:	  down_write(mm->bd_sem);
>
> Now we need to check whether there is a potential deadlock lurking
> around the corner. We get the following nesting scenarios:
>
> - prctl enable/disable management:  down_write(mm->bd_sem);
>
>     No mmap_sem nesting at all
>
> - bounds violation:    	     down_read(mm->bd_sem);
>
>     Might nest down_read(mm->mmap_sem) if the copy code from user space
>     faults.
>
> - directory entry allocation:	     down_read(mm->bd_sem);
>
>     Nests down_write(mm->mmap_sem) for the VMA allocation.
>
>     Might nest down_read(mm->mmap_sem) if the cmpxchg() for the
>     directory entry faults
>
> - directory entry removal:	  down_write(mm->bd_sem);
>
>     Might nest down_read(mm->mmap_sem) if the invalidation of the
>     directory entry faults
>
> In other words here is the possible nesting:
>
>    #1 down_read(mm->bd_sem);  down_read(mm->mmap_sem);
>
>    #2 down_read(mm->bd_sem);  down_write(mm->mmap_sem);
>
>    #3 down_write(mm->bd_sem); down_read(mm->mmap_sem);
>
>    #4 down_write(mm->bd_sem); down_write(mm->mmap_sem);
>
> We never execute any of those nested on a single task. The bounds
> fault handler is never nested as it always comes from user space. So
> we should be safe.
>
> And this prevents all the nasty race conditions I discussed in
> Portland with Dave which happen when we try do to stuff outside of the
> mmap_sem write held region.
>
> Performance wise it's not an issue either. The prctl case is not a a
> hotpath anyway. The fault handlers get away with down_read() and the
> unmap code is heavyweight on mmap sem write held anyway.
>
> Thoughts?
>
> Thanks,
>
> 	tglx
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2014-10-27  3:17 UTC|newest]

Thread overview: 44+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-10-12  4:41 [PATCH v9 00/12] Intel MPX support Qiaowei Ren
2014-10-12  4:41 ` [PATCH v9 01/12] x86, mpx: introduce VM_MPX to indicate that a VMA is MPX specific Qiaowei Ren
2014-10-12  4:41 ` [PATCH v9 02/12] x86, mpx: rename cfg_reg_u and status_reg Qiaowei Ren
2014-10-12  4:41 ` [PATCH v9 03/12] x86, mpx: add MPX specific mmap interface Qiaowei Ren
2014-10-12  4:41 ` [PATCH v9 04/12] x86, mpx: add MPX to disaabled features Qiaowei Ren
2014-10-12  4:41 ` [PATCH v9 05/12] x86, mpx: on-demand kernel allocation of bounds tables Qiaowei Ren
2014-10-24 12:08   ` Thomas Gleixner
2014-10-27  3:20     ` Ren Qiaowei
2014-10-28 17:43     ` Dave Hansen
2014-10-28 17:57       ` Thomas Gleixner
2014-10-12  4:41 ` [PATCH v9 06/12] mpx: extend siginfo structure to include bound violation information Qiaowei Ren
2014-10-12  4:41 ` [PATCH v9 07/12] mips: sync struct siginfo with general version Qiaowei Ren
2014-10-12  4:41 ` [PATCH v9 08/12] ia64: " Qiaowei Ren
2014-10-12  4:41 ` [PATCH v9 09/12] x86, mpx: decode MPX instruction to get bound violation information Qiaowei Ren
2014-10-24 12:36   ` Thomas Gleixner
2014-10-27  1:43     ` Ren, Qiaowei
2014-10-27 20:36       ` Thomas Gleixner
2014-10-28  5:58         ` Ren Qiaowei
2014-10-31 20:16         ` Dave Hansen
2014-10-31 20:33           ` Thomas Gleixner
2014-10-30 22:38   ` Dave Hansen
2014-10-31  2:12     ` Ren Qiaowei
2014-10-31  9:09       ` Thomas Gleixner
2014-10-12  4:41 ` [PATCH v9 10/12] x86, mpx: add prctl commands PR_MPX_ENABLE_MANAGEMENT, PR_MPX_DISABLE_MANAGEMENT Qiaowei Ren
2014-10-24 12:49   ` Thomas Gleixner
2014-10-24 15:10     ` Thomas Gleixner
2014-10-27  2:17     ` Ren, Qiaowei
2014-10-27 20:38       ` Thomas Gleixner
2014-10-28  5:57         ` Ren Qiaowei
2014-10-12  4:41 ` [PATCH v9 11/12] x86, mpx: cleanup unused bound tables Qiaowei Ren
2014-10-24 14:40   ` Thomas Gleixner
2014-10-27  3:13     ` Ren Qiaowei [this message]
2014-10-27 20:49       ` Thomas Gleixner
2014-10-28  5:56         ` Ren Qiaowei
2014-10-28 10:42           ` Thomas Gleixner
2014-11-03 20:53         ` Dave Hansen
2014-11-03 21:29           ` Thomas Gleixner
2014-11-04 16:00             ` Dave Hansen
2014-11-04 17:02               ` Thomas Gleixner
2014-11-06 21:50     ` Dave Hansen
2014-11-11 18:27       ` Thomas Gleixner
2014-11-11 20:44         ` Dave Hansen
2014-11-11 21:36           ` Thomas Gleixner
2014-10-12  4:41 ` [PATCH v9 12/12] x86, mpx: add documentation on Intel MPX Qiaowei Ren

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=544DB873.1010207@intel.com \
    --to=qiaowei.ren@intel.com \
    --cc=dave.hansen@intel.com \
    --cc=hpa@zytor.com \
    --cc=linux-ia64@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mips@linux-mips.org \
    --cc=linux-mm@kvack.org \
    --cc=mingo@redhat.com \
    --cc=tglx@linutronix.de \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).