Linux userland API discussions
 help / color / mirror / Atom feed
* Re: [PATCH] x86,seccomp,prctl: Remove PR_TSC_SIGSEGV and seccomp TSC filtering
From: Peter Zijlstra @ 2014-10-04  8:13 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel@vger.kernel.org, Ingo Molnar, Kees Cook,
	Andrea Arcangeli, Erik Bosman, H. Peter Anvin, Linux API,
	Michael Kerrisk-manpages, Paul Mackerras,
	Arnaldo Carvalho de Melo, X86 ML
In-Reply-To: <CALCETrVtK6w4smnRCTED=csAyt3WNNOaZE_WRzvECuSx260X3w@mail.gmail.com>

On Fri, Oct 03, 2014 at 02:15:24PM -0700, Andy Lutomirski wrote:
> On Fri, Oct 3, 2014 at 2:12 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Fri, Oct 03, 2014 at 02:04:53PM -0700, Andy Lutomirski wrote:
> >> On Fri, Oct 3, 2014 at 2:02 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> >> > Something like so.. slightly less ugly and possibly with more
> >> > complicated conditions setting the cr4 if you want to fix tsc vs seccomp
> >> > as well.
> >>
> >> This will crash anything that tries rdpmc in an allow-everything
> >> seccomp sandbox.  It's also not very compatible with my grand scheme
> >> of allowing rdtsc to be turned off without breaking clock_gettime. :)
> >
> > Well, we clear cap_user_rdpmc, so everybody who still tries it gets what
> > he deserves, no problem there.
> 
> Oh, interesting.
> 
> To continue playing devil's advocate, what if you do perf_event_open,
> then mmap it, then start the seccomp sandbox?

We update that cap bit on every update to the self-monitor state, and in
a perfect world people would also check the cap bit every time they try
and read it, and fall back to the syscall. So we could just clear it..
but I can imagine reality ruining things here.

> My draft patches are currently tracking the number of perf_event mmaps
> per mm.  I'm not thrilled with it, but it's straightforward.  And I
> still need to benchmark cr4 writes, which is tedious, because I can't
> do it from user code.

Should be fairly straight fwd from kernel space, get a tsc stamp,
read+write cr4 1000 times, get another tsc read, and maybe do that
several times. No?

^ permalink raw reply

* Re: [PATCH 08/17] mm: madvise MADV_USERFAULT
From: Mike Hommey @ 2014-10-03 23:13 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: qemu-devel, kvm, linux-kernel, linux-mm, linux-api,
	Linus Torvalds, Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini,
	Rik van Riel, Mel Gorman, Andy Lutomirski, Andrew Morton,
	Sasha Levin, Hugh Dickins, Peter Feiner,
	\"Dr. David Alan Gilbert\", Christopher Covington,
	Johannes Weiner, Android Kernel Team, Robert Love,
	Dmitry Adamushko, Neil Brown, Taras Glek, Jan Kara
In-Reply-To: <1412356087-16115-9-git-send-email-aarcange@redhat.com>

On Fri, Oct 03, 2014 at 07:07:58PM +0200, Andrea Arcangeli wrote:
> MADV_USERFAULT is a new madvise flag that will set VM_USERFAULT in the
> vma flags. Whenever VM_USERFAULT is set in an anonymous vma, if
> userland touches a still unmapped virtual address, a sigbus signal is
> sent instead of allocating a new page. The sigbus signal handler will
> then resolve the page fault in userland by calling the
> remap_anon_pages syscall.

What does "unmapped virtual address" mean in this context?

Mike

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH] x86,seccomp,prctl: Remove PR_TSC_SIGSEGV and seccomp TSC filtering
From: Andy Lutomirski @ 2014-10-03 21:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Ingo Molnar,
	Kees Cook, Andrea Arcangeli, Erik Bosman, H. Peter Anvin,
	Linux API, Michael Kerrisk-manpages, Paul Mackerras,
	Arnaldo Carvalho de Melo, X86 ML
In-Reply-To: <20141003211204.GQ10583-IIpfhp3q70z/8w/KjCw3T+5/BudmfyzbbVWyRVo5IupeoWH0uzbU5w@public.gmane.org>

On Fri, Oct 3, 2014 at 2:12 PM, Peter Zijlstra <peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> wrote:
> On Fri, Oct 03, 2014 at 02:04:53PM -0700, Andy Lutomirski wrote:
>> On Fri, Oct 3, 2014 at 2:02 PM, Peter Zijlstra <peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> wrote:
>
>> > Something like so.. slightly less ugly and possibly with more
>> > complicated conditions setting the cr4 if you want to fix tsc vs seccomp
>> > as well.
>>
>> This will crash anything that tries rdpmc in an allow-everything
>> seccomp sandbox.  It's also not very compatible with my grand scheme
>> of allowing rdtsc to be turned off without breaking clock_gettime. :)
>
> Well, we clear cap_user_rdpmc, so everybody who still tries it gets what
> he deserves, no problem there.

Oh, interesting.

To continue playing devil's advocate, what if you do perf_event_open,
then mmap it, then start the seccomp sandbox?

My draft patches are currently tracking the number of perf_event mmaps
per mm.  I'm not thrilled with it, but it's straightforward.  And I
still need to benchmark cr4 writes, which is tedious, because I can't
do it from user code.

--Andy

^ permalink raw reply

* Re: [PATCH] x86,seccomp,prctl: Remove PR_TSC_SIGSEGV and seccomp TSC filtering
From: Peter Zijlstra @ 2014-10-03 21:12 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Ingo Molnar,
	Kees Cook, Andrea Arcangeli, Erik Bosman, H. Peter Anvin,
	Linux API, Michael Kerrisk-manpages, Paul Mackerras,
	Arnaldo Carvalho de Melo, X86 ML
In-Reply-To: <CALCETrW7OCuAiK31iRvXgXJfcf3FE4GKjpKQ0doWFyUpETzT9A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Fri, Oct 03, 2014 at 02:04:53PM -0700, Andy Lutomirski wrote:
> On Fri, Oct 3, 2014 at 2:02 PM, Peter Zijlstra <peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> wrote:

> > Something like so.. slightly less ugly and possibly with more
> > complicated conditions setting the cr4 if you want to fix tsc vs seccomp
> > as well.
> 
> This will crash anything that tries rdpmc in an allow-everything
> seccomp sandbox.  It's also not very compatible with my grand scheme
> of allowing rdtsc to be turned off without breaking clock_gettime. :)

Well, we clear cap_user_rdpmc, so everybody who still tries it gets what
he deserves, no problem there.

^ permalink raw reply

* Re: [PATCH] x86,seccomp,prctl: Remove PR_TSC_SIGSEGV and seccomp TSC filtering
From: Andy Lutomirski @ 2014-10-03 21:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Ingo Molnar,
	Kees Cook, Andrea Arcangeli, Erik Bosman, H. Peter Anvin,
	Linux API, Michael Kerrisk-manpages, Paul Mackerras,
	Arnaldo Carvalho de Melo, X86 ML
In-Reply-To: <20141003210213.GG6324-IIpfhp3q70z/8w/KjCw3T+5/BudmfyzbbVWyRVo5IupeoWH0uzbU5w@public.gmane.org>

On Fri, Oct 3, 2014 at 2:02 PM, Peter Zijlstra <peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> wrote:
> On Fri, Oct 03, 2014 at 10:44:43PM +0200, Peter Zijlstra wrote:
>> On Fri, Oct 03, 2014 at 01:27:52PM -0700, Andy Lutomirski wrote:
>> > On Fri, Oct 3, 2014 at 1:22 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>> > >
>> > > We could make the rule be that RDPMC is enabled if a perf event is
>> > > mmapped or TIF_SECCOMP is clear, but I'd prefer to be convinced that
>> > > there's an actual performance issue first.  Ideally we can get this
>> > > all working with no API or ABI change at all.
>> >
>> > No, we can't use that rule.  But we could say that RDPMC is enabled if
>> > a perf event is mmapped and no thread in the mm uses seccomp.  I'll
>> > grumble a little bit about adding yet another piece of seccomp state.
>>
>> Well, we could simply disable the RDPMC for everything TIF_SECCOMP.
>> Should be fairly straight fwd.
>
>
> Something like so.. slightly less ugly and possibly with more
> complicated conditions setting the cr4 if you want to fix tsc vs seccomp
> as well.

This will crash anything that tries rdpmc in an allow-everything
seccomp sandbox.  It's also not very compatible with my grand scheme
of allowing rdtsc to be turned off without breaking clock_gettime. :)

I'll send out a real set of patches in the next few days.  I'll even
try to benchmark them :)

>
> ---
>  arch/x86/kernel/cpu/perf_event.c | 13 ++++++++++++-
>  arch/x86/kernel/process.c        | 24 +++++++++++++++++-------
>  2 files changed, 29 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
> index 16c73022306e..cfc42ff5d901 100644
> --- a/arch/x86/kernel/cpu/perf_event.c
> +++ b/arch/x86/kernel/cpu/perf_event.c
> @@ -1869,6 +1869,17 @@ static ssize_t set_attr_rdpmc(struct device *cdev,
>         return count;
>  }
>
> +void perf_change_rdpmc(bool on, unsigned long *cr4)
> +{
> +       if (x86_pmu.attr_rdpmc_broken)
> +               return;
> +
> +       if (on)
> +               *cr4 |= X86_CR4_PCE;
> +       else
> +               *cr4 &= ~X86_CR4_PCE;
> +}
> +
>  static DEVICE_ATTR(rdpmc, S_IRUSR | S_IWUSR, get_attr_rdpmc, set_attr_rdpmc);
>
>  static struct attribute *x86_pmu_attrs[] = {
> @@ -1928,7 +1939,7 @@ void arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now)
>
>         userpg->cap_user_time = 0;
>         userpg->cap_user_time_zero = 0;
> -       userpg->cap_user_rdpmc = x86_pmu.attr_rdpmc;
> +       userpg->cap_user_rdpmc = x86_pmu.attr_rdpmc && test_thread_flag(TIF_SECCOMP);
>         userpg->pmc_width = x86_pmu.cntval_bits;
>
>         if (!sched_clock_stable())
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index e127ddaa2d5a..b74c0400851e 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -201,12 +201,15 @@ void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p,
>                       struct tss_struct *tss)
>  {
>         struct thread_struct *prev, *next;
> +       struct thread_info *pi, *ni;
>
>         prev = &prev_p->thread;
>         next = &next_p->thread;
>
> -       if (test_tsk_thread_flag(prev_p, TIF_BLOCKSTEP) ^
> -           test_tsk_thread_flag(next_p, TIF_BLOCKSTEP)) {
> +       pi = task_thread_info(prev_p);
> +       ni = task_thread_info(next_p);
> +
> +       if ((pi->flags & _TIF_BLOCKSTEP) ^ (ni->flags & _TIF_BLOCKSTEP)) {
>                 unsigned long debugctl = get_debugctlmsr();
>
>                 debugctl &= ~DEBUGCTLMSR_BTF;
> @@ -216,13 +219,20 @@ void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p,
>                 update_debugctlmsr(debugctl);
>         }
>
> -       if (test_tsk_thread_flag(prev_p, TIF_NOTSC) ^
> -           test_tsk_thread_flag(next_p, TIF_NOTSC)) {
> +       if ((pi->flags & (_TIF_NOTSC | _TIF_SECCOMP)) ^
> +           (ni->flags & (_TIF_NOTSC | _TIF_SECCOMP))) {
> +               extern void perf_change_rdpmc(bool, unsigned long *);
> +               unsigned long cr4 = read_cr4();
> +
>                 /* prev and next are different */
> -               if (test_tsk_thread_flag(next_p, TIF_NOTSC))
> -                       hard_disable_TSC();
> +               if (ni->flags & _TIF_NOTSC)
> +                       cr4 |= X86_CR4_TSD;
>                 else
> -                       hard_enable_TSC();
> +                       cr4 &= ~X86_CR4_TSD;
> +
> +               perf_change_rdpmc(!(ni->flags & _TIF_SECCOMP), &cr4);
> +
> +               write_cr4(cr4);
>         }
>
>         if (test_tsk_thread_flag(next_p, TIF_IO_BITMAP)) {



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply

* Re: [PATCH] x86,seccomp,prctl: Remove PR_TSC_SIGSEGV and seccomp TSC filtering
From: Peter Zijlstra @ 2014-10-03 21:04 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Ingo Molnar,
	Kees Cook, Andrea Arcangeli, Erik Bosman, H. Peter Anvin,
	Linux API, Michael Kerrisk-manpages, Paul Mackerras,
	Arnaldo Carvalho de Melo, X86 ML
In-Reply-To: <20141003210213.GG6324-IIpfhp3q70z/8w/KjCw3T+5/BudmfyzbbVWyRVo5IupeoWH0uzbU5w@public.gmane.org>

On Fri, Oct 03, 2014 at 11:02:13PM +0200, Peter Zijlstra wrote:
> @@ -1928,7 +1939,7 @@ void arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now)
>  
>  	userpg->cap_user_time = 0;
>  	userpg->cap_user_time_zero = 0;
> -	userpg->cap_user_rdpmc = x86_pmu.attr_rdpmc;
> +	userpg->cap_user_rdpmc = x86_pmu.attr_rdpmc && test_thread_flag(TIF_SECCOMP);

&& !test_thread_flag(TIF_SECCOMP) would probably work better

>  	userpg->pmc_width = x86_pmu.cntval_bits;
>  
>  	if (!sched_clock_stable())

^ permalink raw reply

* Re: [PATCH] x86,seccomp,prctl: Remove PR_TSC_SIGSEGV and seccomp TSC filtering
From: Peter Zijlstra @ 2014-10-03 21:02 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Ingo Molnar,
	Kees Cook, Andrea Arcangeli, Erik Bosman, H. Peter Anvin,
	Linux API, Michael Kerrisk-manpages, Paul Mackerras,
	Arnaldo Carvalho de Melo, X86 ML
In-Reply-To: <20141003204443.GP10583-IIpfhp3q70z/8w/KjCw3T+5/BudmfyzbbVWyRVo5IupeoWH0uzbU5w@public.gmane.org>

On Fri, Oct 03, 2014 at 10:44:43PM +0200, Peter Zijlstra wrote:
> On Fri, Oct 03, 2014 at 01:27:52PM -0700, Andy Lutomirski wrote:
> > On Fri, Oct 3, 2014 at 1:22 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
> > >
> > > We could make the rule be that RDPMC is enabled if a perf event is
> > > mmapped or TIF_SECCOMP is clear, but I'd prefer to be convinced that
> > > there's an actual performance issue first.  Ideally we can get this
> > > all working with no API or ABI change at all.
> > 
> > No, we can't use that rule.  But we could say that RDPMC is enabled if
> > a perf event is mmapped and no thread in the mm uses seccomp.  I'll
> > grumble a little bit about adding yet another piece of seccomp state.
> 
> Well, we could simply disable the RDPMC for everything TIF_SECCOMP.
> Should be fairly straight fwd.


Something like so.. slightly less ugly and possibly with more
complicated conditions setting the cr4 if you want to fix tsc vs seccomp
as well.

---
 arch/x86/kernel/cpu/perf_event.c | 13 ++++++++++++-
 arch/x86/kernel/process.c        | 24 +++++++++++++++++-------
 2 files changed, 29 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 16c73022306e..cfc42ff5d901 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1869,6 +1869,17 @@ static ssize_t set_attr_rdpmc(struct device *cdev,
 	return count;
 }
 
+void perf_change_rdpmc(bool on, unsigned long *cr4)
+{
+	if (x86_pmu.attr_rdpmc_broken)
+		return;
+
+	if (on)
+		*cr4 |= X86_CR4_PCE;
+	else
+		*cr4 &= ~X86_CR4_PCE;
+}
+
 static DEVICE_ATTR(rdpmc, S_IRUSR | S_IWUSR, get_attr_rdpmc, set_attr_rdpmc);
 
 static struct attribute *x86_pmu_attrs[] = {
@@ -1928,7 +1939,7 @@ void arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now)
 
 	userpg->cap_user_time = 0;
 	userpg->cap_user_time_zero = 0;
-	userpg->cap_user_rdpmc = x86_pmu.attr_rdpmc;
+	userpg->cap_user_rdpmc = x86_pmu.attr_rdpmc && test_thread_flag(TIF_SECCOMP);
 	userpg->pmc_width = x86_pmu.cntval_bits;
 
 	if (!sched_clock_stable())
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index e127ddaa2d5a..b74c0400851e 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -201,12 +201,15 @@ void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p,
 		      struct tss_struct *tss)
 {
 	struct thread_struct *prev, *next;
+	struct thread_info *pi, *ni;
 
 	prev = &prev_p->thread;
 	next = &next_p->thread;
 
-	if (test_tsk_thread_flag(prev_p, TIF_BLOCKSTEP) ^
-	    test_tsk_thread_flag(next_p, TIF_BLOCKSTEP)) {
+	pi = task_thread_info(prev_p);
+	ni = task_thread_info(next_p);
+
+	if ((pi->flags & _TIF_BLOCKSTEP) ^ (ni->flags & _TIF_BLOCKSTEP)) {
 		unsigned long debugctl = get_debugctlmsr();
 
 		debugctl &= ~DEBUGCTLMSR_BTF;
@@ -216,13 +219,20 @@ void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p,
 		update_debugctlmsr(debugctl);
 	}
 
-	if (test_tsk_thread_flag(prev_p, TIF_NOTSC) ^
-	    test_tsk_thread_flag(next_p, TIF_NOTSC)) {
+	if ((pi->flags & (_TIF_NOTSC | _TIF_SECCOMP)) ^
+	    (ni->flags & (_TIF_NOTSC | _TIF_SECCOMP))) {
+		extern void perf_change_rdpmc(bool, unsigned long *);
+		unsigned long cr4 = read_cr4();
+
 		/* prev and next are different */
-		if (test_tsk_thread_flag(next_p, TIF_NOTSC))
-			hard_disable_TSC();
+		if (ni->flags & _TIF_NOTSC)
+			cr4 |= X86_CR4_TSD;
 		else
-			hard_enable_TSC();
+			cr4 &= ~X86_CR4_TSD;
+
+		perf_change_rdpmc(!(ni->flags & _TIF_SECCOMP), &cr4);
+
+		write_cr4(cr4);
 	}
 
 	if (test_tsk_thread_flag(next_p, TIF_IO_BITMAP)) {

^ permalink raw reply related

* Re: [PATCH 01/17] mm: gup: add FOLL_TRIED
From: Paolo Bonzini @ 2014-10-03 20:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrea Arcangeli, qemu-devel, KVM list, Linux Kernel Mailing List,
	linux-mm, Linux API, Andres Lagar-Cavilla, Dave Hansen,
	Rik van Riel, Mel Gorman, Andy Lutomirski, Andrew Morton,
	Sasha Levin, Hugh Dickins, Peter Feiner, \Dr. David Alan Gilbert,
	Christopher Covington, Johannes Weiner, Android Kernel Team,
	Robert Love
In-Reply-To: <CA+55aFwEO9=ieSb0uipfQ+qP9nP1ps=NCTe0HL9arK1cRYaJPg@mail.gmail.com>

> This needs more explanation than that one-liner comment. Make the
> commit message explain why the new FOLL_TRIED flag exists.

This patch actually is extracted from a 3.18 commit in the KVM tree,
https://git.kernel.org/cgit/virt/kvm/kvm.git/commit/?h=next&id=234b239b.

Here is how that patch uses the flag:

    /*
     * The previous call has now waited on the IO. Now we can
     * retry and complete. Pass TRIED to ensure we do not re
     * schedule async IO (see e.g. filemap_fault).
     */
    down_read(&mm->mmap_sem);
    npages = __get_user_pages(tsk, mm, addr, 1, flags | FOLL_TRIED,
                              pagep, NULL, NULL);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH] x86,seccomp,prctl: Remove PR_TSC_SIGSEGV and seccomp TSC filtering
From: Andy Lutomirski @ 2014-10-03 20:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel@vger.kernel.org, Ingo Molnar, Kees Cook,
	Andrea Arcangeli, Erik Bosman, H. Peter Anvin, Linux API,
	Michael Kerrisk-manpages, Paul Mackerras,
	Arnaldo Carvalho de Melo, X86 ML
In-Reply-To: <20141003204256.GO10583@worktop.programming.kicks-ass.net>

On Fri, Oct 3, 2014 at 1:42 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Oct 03, 2014 at 01:22:22PM -0700, Andy Lutomirski wrote:
>> > So the problem with the default deny is that its:
>> >  1) pointless -- the attacker can do sys_perf_event_open() just fine;
>>
>> Not if the attacker is in a seccomp sandbox.
>
> Clearly :-)
>
>> >  2) and expensive -- the people trying to measure performance get the
>> >     penalty of the CR4 write.
>>
>> Does this matter for performance measuring?  I'm not 100% clear on how all
>> the perf_event stuff gets used in practice, but, by my very vague
>> understanding, there are two main workflows:
>>
>> a) perf record, etc: one process creates a ringbuffer and wakes up
>> rarely to record the contents.  The process being recorded doesn't
>> have a perf_event mapped, so the cr4 switch will only happen when
>> waking up the perf process.
>>
>> perf record prints stuff like "[ perf record: Woken up 1 times to
>> write data ]", which seems to confirm my understanding.
>>
>> b) self-monitoring.  A task mmaps a perf_event, does rdpmc, does
>> something, and does rdpmc again.  In that case, there's no context
>> switch.
>
> Agreed, but with the default disable, the self-monitoring task will get
> CR4 writes to its context switches and will be slower.
>
> Whereas if we default on and only disable with seccomp then only the
> seccomp tasks will suffer the burden of a CR4 write at context switch
> time.
>
> And I don't see a security implication in the default.

I don't see a security implication, but we can fight over whether
seccomp performance or perf self monitoring performance is more
important.  I spent a bunch of time making seccomp *much* faster for
3.18, and things like Chromium (the browser) could have lots of
context switched into and out of seccomp. :)

I suspect that the overhead only really matters when running as a VM guest.

Hmm.  Can we switch lazily?  I don't really like the idea, but
non-seccomp, non-perf-event tasks could, in principle, just inherit
the PCE value from whatever had the CPU last.

>
>> > So I would suggest a default on, but allow a disable for the seccomp
>> > users, which might have also disabled the syscall. Note that is is
>> > possible to disable RDPMC while still allowing the syscall.
>>
>> Disabling RDPMC per-process while still allowing the syscall will need
>> a bunch of work, right?
>
> Not much, all you need to do is make
> perf_event_mmmap_page::cap_user_rdpmc be 0 and userspace should never
> use rdpmc. Currently we set that bit based on the global sysfs rdpmc
> value, but we could easy bitwise AND it with a per process value.
>
>> What happens if the same perf_event is mapped
>> by two different users?
>
> Not entirely clear on what you mean here.

Can you open a perf_event fd, fork, and mmap it in the child?  What if
you map it in the parent and the child?  Then cap_user_rdpmc has to
match for both mappings.  Or is this impossible?

>
>> We could make the rule be that RDPMC is enabled if a perf event is
>> mmapped or TIF_SECCOMP is clear, but I'd prefer to be convinced that
>> there's an actual performance issue first.  Ideally we can get this
>> all working with no API or ABI change at all.
>
> Well I just want to not have extra CR4 writes by default (and by default
> I don't care about seccomp).
>
>> P.S. Hey, Intel, let us context switch RDPMC accessibility of the
>> individual counters, please :)
>
> Ha sure, make it more complicated why don't you ;-) But yes, that has
> come up before.

--Andy

^ permalink raw reply

* Re: [PATCH] x86,seccomp,prctl: Remove PR_TSC_SIGSEGV and seccomp TSC filtering
From: Andy Lutomirski @ 2014-10-03 20:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Ingo Molnar,
	Kees Cook, Andrea Arcangeli, Erik Bosman, H. Peter Anvin,
	Linux API, Michael Kerrisk-manpages, Paul Mackerras,
	Arnaldo Carvalho de Melo, X86 ML
In-Reply-To: <20141003204443.GP10583-IIpfhp3q70z/8w/KjCw3T+5/BudmfyzbbVWyRVo5IupeoWH0uzbU5w@public.gmane.org>

On Fri, Oct 3, 2014 at 1:44 PM, Peter Zijlstra <peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> wrote:
> On Fri, Oct 03, 2014 at 01:27:52PM -0700, Andy Lutomirski wrote:
>> On Fri, Oct 3, 2014 at 1:22 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>> >
>> > We could make the rule be that RDPMC is enabled if a perf event is
>> > mmapped or TIF_SECCOMP is clear, but I'd prefer to be convinced that
>> > there's an actual performance issue first.  Ideally we can get this
>> > all working with no API or ABI change at all.
>>
>> No, we can't use that rule.  But we could say that RDPMC is enabled if
>> a perf event is mmapped and no thread in the mm uses seccomp.  I'll
>> grumble a little bit about adding yet another piece of seccomp state.
>
> Well, we could simply disable the RDPMC for everything TIF_SECCOMP.
> Should be fairly straight fwd.

That won't work.  I bet there are plenty of existing users of fairly
wide-open seccomp sandboxes that allow perf_event in.

-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply

* Re: [PATCH] x86,seccomp,prctl: Remove PR_TSC_SIGSEGV and seccomp TSC filtering
From: Peter Zijlstra @ 2014-10-03 20:44 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Ingo Molnar,
	Kees Cook, Andrea Arcangeli, Erik Bosman, H. Peter Anvin,
	Linux API, Michael Kerrisk-manpages, Paul Mackerras,
	Arnaldo Carvalho de Melo, X86 ML
In-Reply-To: <CALCETrWfrWpdMCAYySMAMGCHU3XRkNGmeMTECTE=PXQUfjGPZA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Fri, Oct 03, 2014 at 01:27:52PM -0700, Andy Lutomirski wrote:
> On Fri, Oct 3, 2014 at 1:22 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
> >
> > We could make the rule be that RDPMC is enabled if a perf event is
> > mmapped or TIF_SECCOMP is clear, but I'd prefer to be convinced that
> > there's an actual performance issue first.  Ideally we can get this
> > all working with no API or ABI change at all.
> 
> No, we can't use that rule.  But we could say that RDPMC is enabled if
> a perf event is mmapped and no thread in the mm uses seccomp.  I'll
> grumble a little bit about adding yet another piece of seccomp state.

Well, we could simply disable the RDPMC for everything TIF_SECCOMP.
Should be fairly straight fwd.

^ permalink raw reply

* Re: [PATCH] x86,seccomp,prctl: Remove PR_TSC_SIGSEGV and seccomp TSC filtering
From: Peter Zijlstra @ 2014-10-03 20:42 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Ingo Molnar,
	Kees Cook, Andrea Arcangeli, Erik Bosman, H. Peter Anvin,
	Linux API, Michael Kerrisk-manpages, Paul Mackerras,
	Arnaldo Carvalho de Melo, X86 ML
In-Reply-To: <CALCETrVvFP66s5XOmSKaC8Vq73=uh11819HOOLkVTu7jJZotew-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Fri, Oct 03, 2014 at 01:22:22PM -0700, Andy Lutomirski wrote:
> > So the problem with the default deny is that its:
> >  1) pointless -- the attacker can do sys_perf_event_open() just fine;
> 
> Not if the attacker is in a seccomp sandbox.

Clearly :-)

> >  2) and expensive -- the people trying to measure performance get the
> >     penalty of the CR4 write.
> 
> Does this matter for performance measuring?  I'm not 100% clear on how all
> the perf_event stuff gets used in practice, but, by my very vague
> understanding, there are two main workflows:
> 
> a) perf record, etc: one process creates a ringbuffer and wakes up
> rarely to record the contents.  The process being recorded doesn't
> have a perf_event mapped, so the cr4 switch will only happen when
> waking up the perf process.
> 
> perf record prints stuff like "[ perf record: Woken up 1 times to
> write data ]", which seems to confirm my understanding.
> 
> b) self-monitoring.  A task mmaps a perf_event, does rdpmc, does
> something, and does rdpmc again.  In that case, there's no context
> switch.

Agreed, but with the default disable, the self-monitoring task will get
CR4 writes to its context switches and will be slower.

Whereas if we default on and only disable with seccomp then only the
seccomp tasks will suffer the burden of a CR4 write at context switch
time.

And I don't see a security implication in the default.

> > So I would suggest a default on, but allow a disable for the seccomp
> > users, which might have also disabled the syscall. Note that is is
> > possible to disable RDPMC while still allowing the syscall.
> 
> Disabling RDPMC per-process while still allowing the syscall will need
> a bunch of work, right? 

Not much, all you need to do is make
perf_event_mmmap_page::cap_user_rdpmc be 0 and userspace should never
use rdpmc. Currently we set that bit based on the global sysfs rdpmc
value, but we could easy bitwise AND it with a per process value.

> What happens if the same perf_event is mapped
> by two different users?

Not entirely clear on what you mean here.

> We could make the rule be that RDPMC is enabled if a perf event is
> mmapped or TIF_SECCOMP is clear, but I'd prefer to be convinced that
> there's an actual performance issue first.  Ideally we can get this
> all working with no API or ABI change at all.

Well I just want to not have extra CR4 writes by default (and by default
I don't care about seccomp).

> P.S. Hey, Intel, let us context switch RDPMC accessibility of the
> individual counters, please :)

Ha sure, make it more complicated why don't you ;-) But yes, that has
come up before.

^ permalink raw reply

* Re: [PATCH] x86,seccomp,prctl: Remove PR_TSC_SIGSEGV and seccomp TSC filtering
From: Andy Lutomirski @ 2014-10-03 20:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel@vger.kernel.org, Ingo Molnar, Kees Cook,
	Andrea Arcangeli, Erik Bosman, H. Peter Anvin, Linux API,
	Michael Kerrisk-manpages, Paul Mackerras,
	Arnaldo Carvalho de Melo, X86 ML
In-Reply-To: <CALCETrVvFP66s5XOmSKaC8Vq73=uh11819HOOLkVTu7jJZotew@mail.gmail.com>

On Fri, Oct 3, 2014 at 1:22 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> We could make the rule be that RDPMC is enabled if a perf event is
> mmapped or TIF_SECCOMP is clear, but I'd prefer to be convinced that
> there's an actual performance issue first.  Ideally we can get this
> all working with no API or ABI change at all.

No, we can't use that rule.  But we could say that RDPMC is enabled if
a perf event is mmapped and no thread in the mm uses seccomp.  I'll
grumble a little bit about adding yet another piece of seccomp state.

--Andy

^ permalink raw reply

* Re: [PATCH] x86,seccomp,prctl: Remove PR_TSC_SIGSEGV and seccomp TSC filtering
From: Andy Lutomirski @ 2014-10-03 20:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Ingo Molnar,
	Kees Cook, Andrea Arcangeli, Erik Bosman, H. Peter Anvin,
	Linux API, Michael Kerrisk-manpages, Paul Mackerras,
	Arnaldo Carvalho de Melo, X86 ML
In-Reply-To: <20141003201409.GM10583-IIpfhp3q70z/8w/KjCw3T+5/BudmfyzbbVWyRVo5IupeoWH0uzbU5w@public.gmane.org>

On Fri, Oct 3, 2014 at 1:14 PM, Peter Zijlstra <peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> wrote:
> On Fri, Oct 03, 2014 at 10:27:47AM -0700, Andy Lutomirski wrote:
>> [adding linux-api.  whoops.]
>>
>> On Fri, Oct 3, 2014 at 10:18 AM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>> > PR_SET_TSC / PR_TSC_SIGSEGV is a security feature to prevent heavily
>> > sandboxed programs from learning the time, presumably to avoid
>> > disclosing the wall clock and to make timing attacks much harder to
>> > exploit.
>> >
>> > Unfortunately, this feature is very insecure, for multiple reasons,
>> > and has probably been insecure since before it was written.
>> >
>> > Weakness 1: Before Linux 3.16, the vvar page and the HPET (!) were
>> > part of the kernel's fixmap, so any user process could read them.
>> > The vvar page contains low-resolution timing information (with real
>> > wall clock and frequency data), and the HPET can be used for high
>> > precision timing.  Even in Linux 3.16, there clean way to disable
>> > access to these pages.
>> >
>> > Weakness 2: On most configurations, most or all userspace processes
>> > have unrestricted access to RDPMC, which is even better than RDTSC
>> > for exploiting timing attacks.
>> >
>> > I would like to fix both of these issues.  I want to deny access to
>> > RDPMC to processes that haven't asked for access via
>> > perf_event_open.  I also want to implement real TSC blocking, which
>> > will require some vdso enhancements
>
> So the problem with the default deny is that its:
>  1) pointless -- the attacker can do sys_perf_event_open() just fine;

Not if the attacker is in a seccomp sandbox.

>  2) and expensive -- the people trying to measure performance get the
>     penalty of the CR4 write.

Does this matter for performance measuring?  I'm not 100% clear on how all
the perf_event stuff gets used in practice, but, by my very vague
understanding, there are two main workflows:

a) perf record, etc: one process creates a ringbuffer and wakes up
rarely to record the contents.  The process being recorded doesn't
have a perf_event mapped, so the cr4 switch will only happen when
waking up the perf process.

perf record prints stuff like "[ perf record: Woken up 1 times to
write data ]", which seems to confirm my understanding.

b) self-monitoring.  A task mmaps a perf_event, does rdpmc, does
something, and does rdpmc again.  In that case, there's no context
switch.

>
> So I would suggest a default on, but allow a disable for the seccomp
> users, which might have also disabled the syscall. Note that is is
> possible to disable RDPMC while still allowing the syscall.

Disabling RDPMC per-process while still allowing the syscall will need
a bunch of work, right?  What happens if the same perf_event is mapped
by two different users?

We could make the rule be that RDPMC is enabled if a perf event is
mmapped or TIF_SECCOMP is clear, but I'd prefer to be convinced that
there's an actual performance issue first.  Ideally we can get this
all working with no API or ABI change at all.

P.S. Hey, Intel, let us context switch RDPMC accessibility of the
individual counters, please :)

--Andy

^ permalink raw reply

* Re: [PATCH] x86,seccomp,prctl: Remove PR_TSC_SIGSEGV and seccomp TSC filtering
From: Peter Zijlstra @ 2014-10-03 20:15 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andrea Arcangeli, Paul Mackerras, Arnaldo Carvalho de Melo,
	X86 ML, linux-kernel@vger.kernel.org, Ingo Molnar, Kees Cook,
	Erik Bosman, H. Peter Anvin, Linux API, Michael Kerrisk-manpages
In-Reply-To: <CALCETrWP7QSb4dFh_phUK+7P-XookgL8UXgvHi-e4icukUF2Og@mail.gmail.com>

On Fri, Oct 03, 2014 at 10:59:15AM -0700, Andy Lutomirski wrote:
> RDPMC is controlled by CR4.PCE, whereas RDTSC is controlled by
> CR4.TSD,

This is my understanding too, they're separate things.

^ permalink raw reply

* Re: [PATCH] x86,seccomp,prctl: Remove PR_TSC_SIGSEGV and seccomp TSC filtering
From: Peter Zijlstra @ 2014-10-03 20:14 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Ingo Molnar,
	Kees Cook, Andrea Arcangeli, Erik Bosman, H. Peter Anvin,
	Linux API, Michael Kerrisk-manpages, Paul Mackerras,
	Arnaldo Carvalho de Melo, X86 ML
In-Reply-To: <CALCETrUfCrvidOS6VvUpWFAcHUrPUs58zSQqGRC5UOTS=E37rw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Fri, Oct 03, 2014 at 10:27:47AM -0700, Andy Lutomirski wrote:
> [adding linux-api.  whoops.]
> 
> On Fri, Oct 3, 2014 at 10:18 AM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
> > PR_SET_TSC / PR_TSC_SIGSEGV is a security feature to prevent heavily
> > sandboxed programs from learning the time, presumably to avoid
> > disclosing the wall clock and to make timing attacks much harder to
> > exploit.
> >
> > Unfortunately, this feature is very insecure, for multiple reasons,
> > and has probably been insecure since before it was written.
> >
> > Weakness 1: Before Linux 3.16, the vvar page and the HPET (!) were
> > part of the kernel's fixmap, so any user process could read them.
> > The vvar page contains low-resolution timing information (with real
> > wall clock and frequency data), and the HPET can be used for high
> > precision timing.  Even in Linux 3.16, there clean way to disable
> > access to these pages.
> >
> > Weakness 2: On most configurations, most or all userspace processes
> > have unrestricted access to RDPMC, which is even better than RDTSC
> > for exploiting timing attacks.
> >
> > I would like to fix both of these issues.  I want to deny access to
> > RDPMC to processes that haven't asked for access via
> > perf_event_open.  I also want to implement real TSC blocking, which
> > will require some vdso enhancements

So the problem with the default deny is that its:
 1) pointless -- the attacker can do sys_perf_event_open() just fine;
 2) and expensive -- the people trying to measure performance get the
    penalty of the CR4 write.

So I would suggest a default on, but allow a disable for the seccomp
users, which might have also disabled the syscall. Note that is is
possible to disable RDPMC while still allowing the syscall.

^ permalink raw reply

* Re: [PATCH v3 3/5] selftests/ipc: change test to use ksft framework
From: Davidlohr Bueso @ 2014-10-03 20:01 UTC (permalink / raw)
  To: Shuah Khan
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r,
	colin.king-Z7WLFzj8eWMS+FvcfC7Uqw, luto-kltTT9wpgjJwATOyAt5JVQ,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <542EFC36.1010306-JPH+aEBZ4P+UEJcrhfAQsw@public.gmane.org>

On Fri, 2014-10-03 at 13:42 -0600, Shuah Khan wrote:
> On 10/03/2014 11:39 AM, Davidlohr Bueso wrote:
> > On Fri, 2014-10-03 at 09:36 -0600, Shuah Khan wrote:
> >>  	msgque.key = ftok(argv[0], 822155650);
> >>  	if (msgque.key == -1) {
> >> -		printf("Can't make key\n");
> >> -		return -errno;
> >> +		printf("Can't make key: %d\n", -errno);
> > 
> > So printing a numeric value is quite useless when users actually run
> > into these errors -- which is why I like err() so much. How about using
> > strerror() instead?
> > 
> 
> Yes. using perror() does give better information. There are other
> places in this file that use errno. How about I make that a separate
> patch and catch all of them at once to use perror() as a follow-up
> change? That way I fix all at once without adding more changes to
> this patch.

Sounds good.

^ permalink raw reply

* Re: [PATCH v3 3/5] selftests/ipc: change test to use ksft framework
From: Shuah Khan @ 2014-10-03 19:42 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r,
	colin.king-Z7WLFzj8eWMS+FvcfC7Uqw, luto-kltTT9wpgjJwATOyAt5JVQ,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Shuah Khan
In-Reply-To: <1412357993.20838.4.camel-dxKd5G12XOI1EaDjlw0dpg@public.gmane.org>

On 10/03/2014 11:39 AM, Davidlohr Bueso wrote:
> On Fri, 2014-10-03 at 09:36 -0600, Shuah Khan wrote:
>>  	msgque.key = ftok(argv[0], 822155650);
>>  	if (msgque.key == -1) {
>> -		printf("Can't make key\n");
>> -		return -errno;
>> +		printf("Can't make key: %d\n", -errno);
> 
> So printing a numeric value is quite useless when users actually run
> into these errors -- which is why I like err() so much. How about using
> strerror() instead?
> 

Yes. using perror() does give better information. There are other
places in this file that use errno. How about I make that a separate
patch and catch all of them at once to use perror() as a follow-up
change? That way I fix all at once without adding more changes to
this patch.

thanks,
-- Shuah

-- 
Shuah Khan
Sr. Linux Kernel Developer
Samsung Research America (Silicon Valley)
shuahkh-JPH+aEBZ4P+UEJcrhfAQsw@public.gmane.org | (970) 217-8978

^ permalink raw reply

* Re: [PATCH 10/17] mm: rmap preparation for remap_anon_pages
From: Linus Torvalds @ 2014-10-03 18:31 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: qemu-devel, KVM list, Linux Kernel Mailing List, linux-mm,
	Linux API, Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini,
	Rik van Riel, Mel Gorman, Andy Lutomirski, Andrew Morton,
	Sasha Levin, Hugh Dickins, Peter Feiner, \Dr. David Alan Gilbert\,
	Christopher Covington, Johannes Weiner, Android Kernel Team,
	Robert Love, Dmitry Adamushko, Neil Brown, Mike Hommey,
	Taras Glek <tg>
In-Reply-To: <1412356087-16115-11-git-send-email-aarcange@redhat.com>

On Fri, Oct 3, 2014 at 10:08 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
>
> Overall this looks a fairly small change to the rmap code, notably
> less intrusive than the nonlinear vmas created by remap_file_pages.

Considering that remap_file_pages() was an unmitigated disaster, and
-mm has a patch to remove it entirely, I'm not at all convinced this
is a good argument.

We thought remap_file_pages() was a good idea, and it really really
really wasn't. Almost nobody used it, why would the anonymous page
case be any different?

            Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH 04/17] mm: gup: make get_user_pages_fast and __get_user_pages_fast latency conscious
From: Linus Torvalds @ 2014-10-03 18:23 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: qemu-devel, KVM list, Linux Kernel Mailing List, linux-mm,
	Linux API, Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini,
	Rik van Riel, Mel Gorman, Andy Lutomirski, Andrew Morton,
	Sasha Levin, Hugh Dickins, Peter Feiner, \Dr. David Alan Gilbert\,
	Christopher Covington, Johannes Weiner, Android Kernel Team,
	Robert Love, Dmitry Adamushko, Neil Brown, Mike Hommey,
	Taras Glek <tg>
In-Reply-To: <1412356087-16115-5-git-send-email-aarcange@redhat.com>

On Fri, Oct 3, 2014 at 10:07 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> This teaches gup_fast and __gup_fast to re-enable irqs and
> cond_resched() if possible every BATCH_PAGES.

This is disgusting.

Many (most?) __gup_fast() users just want a single page, and the
stupid overhead of the multi-page version is already unnecessary.
This just makes things much worse.

Quite frankly, we should make a single-page version of __gup_fast(),
and convert existign users to use that. After that, the few multi-page
users could have this extra latency control stuff.

And yes, the single-page version of get_user_pages_fast() is actually
latency-critical. shared futexes hit it hard, and yes, I've seen this
in profiles.

                  Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH 01/17] mm: gup: add FOLL_TRIED
From: Linus Torvalds @ 2014-10-03 18:15 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: qemu-devel, KVM list, Linux Kernel Mailing List, linux-mm,
	Linux API, Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini,
	Rik van Riel, Mel Gorman, Andy Lutomirski, Andrew Morton,
	Sasha Levin, Hugh Dickins, Peter Feiner, \Dr. David Alan Gilbert\,
	Christopher Covington, Johannes Weiner, Android Kernel Team,
	Robert Love, Dmitry Adamushko, Neil Brown, Mike Hommey,
	Taras Glek <tg>
In-Reply-To: <1412356087-16115-2-git-send-email-aarcange@redhat.com>

This needs more explanation than that one-liner comment. Make the
commit message explain why the new FOLL_TRIED flag exists.

      Linus

On Fri, Oct 3, 2014 at 10:07 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> From: Andres Lagar-Cavilla <andreslc@google.com>
>
> Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
> Signed-off-by: Andres Lagar-Cavilla <andreslc@google.com>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH] x86,seccomp,prctl: Remove PR_TSC_SIGSEGV and seccomp TSC filtering
From: Andy Lutomirski @ 2014-10-03 17:59 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Paul Mackerras, Arnaldo Carvalho de Melo, X86 ML,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Peter Zijlstra, Ingo Molnar, Kees Cook, Erik Bosman,
	H. Peter Anvin, Linux API, Michael Kerrisk-manpages
In-Reply-To: <20141003174141.GR2342-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

[cc's re-added]

On Fri, Oct 3, 2014 at 10:41 AM, Andrea Arcangeli <aarcange-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> Hi,
>
> On Fri, Oct 03, 2014 at 10:18:14AM -0700, Andy Lutomirski wrote:
>> Weakness 2: On most configurations, most or all userspace processes
>> have unrestricted access to RDPMC, which is even better than RDTSC
>> for exploiting timing attacks.
>
> "User access of the RDPMC instruction is not guaranteed. Like RDTSC,
> user access is controlled by a bit in CR4. CR4.PCE (bit-8) controls
> whether or not a user program can execute the RDPMC instruction
> without faulting"
>
> I don't think there's was a seccomp leak of RDPMC because of this, the
> rdtsc and rdpmc seems to be linked to the same cr4 tweak.

RDPMC is controlled by CR4.PCE, whereas RDTSC is controlled by
CR4.TSD, so, unless there's magic that I'm unaware of, seccomp never
blocked RDPMC access.  I don't know whether circa-2008 kernels set PCE
(or perhaps left it set if BIOS set it) at boot, but certainly almost
everyone on any kernel for the last few years has RDPMC enabled in
ring 3.

>
> The vsyscall data was leaked right, but you can't compare the
> two. Sure it's better to block that too but it's not comparable to
> give tsc access to apps running under seccomp.
>
> The time of the day isn't secret either (ok it could be an issue if
> you intend to run the system on some secret time in the past or future
> but this sounds not a practical issue).
>
> What's not public info and should never be leaked to seccomp
> sandboxes, is the tsc at that kind of cycle count granularity (and the
> various gettimeofday variants with nanosecond granularity). I thought
> RDPMC was blocked too with the same CR4 tweak... if that wasn't the
> case and you could get tsc granular information into a seccomp
> sandbox, that's not ok because it allows for covert channel attacks.

The HPET very fine granularity.  It's slow to access, but that isn't
necessarily a problem for attackers.

I agree that this is problematic, and I want to fix it.  The trouble
is that I'm not sure it's fixable in a sane manner with the current
semantics.  CR4.PCE, in particular, will need to be a function of both
per-thread state (PR_TSC_SIGSEGV) and per-mm state (whether perf_event
self-monitoring is on).  Getting the context switching logic correct
without hurting the common case too much will be quite complicated.

On top of this, supporting something like PR_TSC_SIGSEGV per-thread
using seccomp mode 2 should really be done by redirecting vdso-based
timing to use syscalls, and that's fundamentally per mm.

Hence my proposal of removing the current insecure model so that
adding a secure variant will be straightforward, rather than trying to
shoehorn a fix on top of the current ABI.

The eventual fix could still disable the TSC in a strict seccomp task
by default if the task is single threaded.  Yes, that won't quite
cover old Chromium-style sandboxes, but those are rapidly being
replaced by new designs using seccomp mode 2 anyway.

Also, keep in mind that multithreaded attackers can exploit timing
attacks without hardware help at all: one thread can just run a timing
loop.

--Andy

^ permalink raw reply

* Re: [PATCH v3 3/5] selftests/ipc: change test to use ksft framework
From: Davidlohr Bueso @ 2014-10-03 17:39 UTC (permalink / raw)
  To: Shuah Khan; +Cc: akpm, gregkh, colin.king, luto, linux-api, linux-kernel
In-Reply-To: <fb7ed86d44adbc7f97f9e42b314e4fce5616959e.1412349412.git.shuahkh@osg.samsung.com>

On Fri, 2014-10-03 at 09:36 -0600, Shuah Khan wrote:
>  	msgque.key = ftok(argv[0], 822155650);
>  	if (msgque.key == -1) {
> -		printf("Can't make key\n");
> -		return -errno;
> +		printf("Can't make key: %d\n", -errno);

So printing a numeric value is quite useless when users actually run
into these errors -- which is why I like err() so much. How about using
strerror() instead?

Thanks,
Davidlohr

^ permalink raw reply

* Re: [PATCH] x86,seccomp,prctl: Remove PR_TSC_SIGSEGV and seccomp TSC filtering
From: Andy Lutomirski @ 2014-10-03 17:27 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Peter Zijlstra, Ingo Molnar, Kees Cook, Andrea Arcangeli,
	Erik Bosman, H. Peter Anvin, Linux API, Michael Kerrisk-manpages
  Cc: Paul Mackerras, Arnaldo Carvalho de Melo, X86 ML, Andy Lutomirski
In-Reply-To: <fc0c2447cbc39257941c6b118388c024b719353a.1412356529.git.luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org>

[adding linux-api.  whoops.]

On Fri, Oct 3, 2014 at 10:18 AM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
> PR_SET_TSC / PR_TSC_SIGSEGV is a security feature to prevent heavily
> sandboxed programs from learning the time, presumably to avoid
> disclosing the wall clock and to make timing attacks much harder to
> exploit.
>
> Unfortunately, this feature is very insecure, for multiple reasons,
> and has probably been insecure since before it was written.
>
> Weakness 1: Before Linux 3.16, the vvar page and the HPET (!) were
> part of the kernel's fixmap, so any user process could read them.
> The vvar page contains low-resolution timing information (with real
> wall clock and frequency data), and the HPET can be used for high
> precision timing.  Even in Linux 3.16, there clean way to disable
> access to these pages.
>
> Weakness 2: On most configurations, most or all userspace processes
> have unrestricted access to RDPMC, which is even better than RDTSC
> for exploiting timing attacks.
>
> I would like to fix both of these issues.  I want to deny access to
> RDPMC to processes that haven't asked for access via
> perf_event_open.  I also want to implement real TSC blocking, which
> will require some vdso enhancements
>
> The problem is that both of these fixes will be per-mm, not per
> task.  So PR_SET_TSC will be barely supportable.
>
> Therefore, I'm proposing the radical solution of ripping out the old
> ABI to make room for the new.
>
> Enabling strict seccomp mode no longer disables the TSC.
>
> PR_GET_TSC still works and returns PR_TSC_ENABLED.  PR_SET_TSC now
> return -EINVAL if you try to set PR_TSC_SIGSEGV.
>
> Signed-off-by: Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org>
> ---
>  .../prctl/disable-tsc-ctxt-sw-stress-test.c        | 96 ----------------------
>  .../prctl/disable-tsc-on-off-stress-test.c         | 95 ---------------------
>  Documentation/prctl/disable-tsc-test.c             | 94 ---------------------
>  arch/x86/include/asm/processor.h                   |  7 --
>  arch/x86/include/asm/thread_info.h                 |  4 +-
>  arch/x86/include/asm/tsc.h                         |  2 -
>  arch/x86/kernel/process.c                          | 67 ---------------
>  include/uapi/linux/prctl.h                         |  7 +-
>  kernel/seccomp.c                                   |  3 -
>  kernel/sys.c                                       | 12 +--
>  10 files changed, 14 insertions(+), 373 deletions(-)
>  delete mode 100644 Documentation/prctl/disable-tsc-ctxt-sw-stress-test.c
>  delete mode 100644 Documentation/prctl/disable-tsc-on-off-stress-test.c
>  delete mode 100644 Documentation/prctl/disable-tsc-test.c
>
> diff --git a/Documentation/prctl/disable-tsc-ctxt-sw-stress-test.c b/Documentation/prctl/disable-tsc-ctxt-sw-stress-test.c
> deleted file mode 100644
> index f8e8e95e81fd..000000000000
> --- a/Documentation/prctl/disable-tsc-ctxt-sw-stress-test.c
> +++ /dev/null
> @@ -1,96 +0,0 @@
> -/*
> - * Tests for prctl(PR_GET_TSC, ...) / prctl(PR_SET_TSC, ...)
> - *
> - * Tests if the control register is updated correctly
> - * at context switches
> - *
> - * Warning: this test will cause a very high load for a few seconds
> - *
> - */
> -
> -#include <stdio.h>
> -#include <stdlib.h>
> -#include <unistd.h>
> -#include <signal.h>
> -#include <inttypes.h>
> -#include <wait.h>
> -
> -
> -#include <sys/prctl.h>
> -#include <linux/prctl.h>
> -
> -/* Get/set the process' ability to use the timestamp counter instruction */
> -#ifndef PR_GET_TSC
> -#define PR_GET_TSC 25
> -#define PR_SET_TSC 26
> -# define PR_TSC_ENABLE         1   /* allow the use of the timestamp counter */
> -# define PR_TSC_SIGSEGV                2   /* throw a SIGSEGV instead of reading the TSC */
> -#endif
> -
> -uint64_t rdtsc() {
> -uint32_t lo, hi;
> -/* We cannot use "=A", since this would use %rax on x86_64 */
> -__asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
> -return (uint64_t)hi << 32 | lo;
> -}
> -
> -void sigsegv_expect(int sig)
> -{
> -       /* */
> -}
> -
> -void segvtask(void)
> -{
> -       if (prctl(PR_SET_TSC, PR_TSC_SIGSEGV) < 0)
> -       {
> -               perror("prctl");
> -               exit(0);
> -       }
> -       signal(SIGSEGV, sigsegv_expect);
> -       alarm(10);
> -       rdtsc();
> -       fprintf(stderr, "FATAL ERROR, rdtsc() succeeded while disabled\n");
> -       exit(0);
> -}
> -
> -
> -void sigsegv_fail(int sig)
> -{
> -       fprintf(stderr, "FATAL ERROR, rdtsc() failed while enabled\n");
> -       exit(0);
> -}
> -
> -void rdtsctask(void)
> -{
> -       if (prctl(PR_SET_TSC, PR_TSC_ENABLE) < 0)
> -       {
> -               perror("prctl");
> -               exit(0);
> -       }
> -       signal(SIGSEGV, sigsegv_fail);
> -       alarm(10);
> -       for(;;) rdtsc();
> -}
> -
> -
> -int main(int argc, char **argv)
> -{
> -       int n_tasks = 100, i;
> -
> -       fprintf(stderr, "[No further output means we're allright]\n");
> -
> -       for (i=0; i<n_tasks; i++)
> -               if (fork() == 0)
> -               {
> -                       if (i & 1)
> -                               segvtask();
> -                       else
> -                               rdtsctask();
> -               }
> -
> -       for (i=0; i<n_tasks; i++)
> -               wait(NULL);
> -
> -       exit(0);
> -}
> -
> diff --git a/Documentation/prctl/disable-tsc-on-off-stress-test.c b/Documentation/prctl/disable-tsc-on-off-stress-test.c
> deleted file mode 100644
> index 1fcd91445375..000000000000
> --- a/Documentation/prctl/disable-tsc-on-off-stress-test.c
> +++ /dev/null
> @@ -1,95 +0,0 @@
> -/*
> - * Tests for prctl(PR_GET_TSC, ...) / prctl(PR_SET_TSC, ...)
> - *
> - * Tests if the control register is updated correctly
> - * when set with prctl()
> - *
> - * Warning: this test will cause a very high load for a few seconds
> - *
> - */
> -
> -#include <stdio.h>
> -#include <stdlib.h>
> -#include <unistd.h>
> -#include <signal.h>
> -#include <inttypes.h>
> -#include <wait.h>
> -
> -
> -#include <sys/prctl.h>
> -#include <linux/prctl.h>
> -
> -/* Get/set the process' ability to use the timestamp counter instruction */
> -#ifndef PR_GET_TSC
> -#define PR_GET_TSC 25
> -#define PR_SET_TSC 26
> -# define PR_TSC_ENABLE         1   /* allow the use of the timestamp counter */
> -# define PR_TSC_SIGSEGV                2   /* throw a SIGSEGV instead of reading the TSC */
> -#endif
> -
> -/* snippet from wikipedia :-) */
> -
> -uint64_t rdtsc() {
> -uint32_t lo, hi;
> -/* We cannot use "=A", since this would use %rax on x86_64 */
> -__asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
> -return (uint64_t)hi << 32 | lo;
> -}
> -
> -int should_segv = 0;
> -
> -void sigsegv_cb(int sig)
> -{
> -       if (!should_segv)
> -       {
> -               fprintf(stderr, "FATAL ERROR, rdtsc() failed while enabled\n");
> -               exit(0);
> -       }
> -       if (prctl(PR_SET_TSC, PR_TSC_ENABLE) < 0)
> -       {
> -               perror("prctl");
> -               exit(0);
> -       }
> -       should_segv = 0;
> -
> -       rdtsc();
> -}
> -
> -void task(void)
> -{
> -       signal(SIGSEGV, sigsegv_cb);
> -       alarm(10);
> -       for(;;)
> -       {
> -               rdtsc();
> -               if (should_segv)
> -               {
> -                       fprintf(stderr, "FATAL ERROR, rdtsc() succeeded while disabled\n");
> -                       exit(0);
> -               }
> -               if (prctl(PR_SET_TSC, PR_TSC_SIGSEGV) < 0)
> -               {
> -                       perror("prctl");
> -                       exit(0);
> -               }
> -               should_segv = 1;
> -       }
> -}
> -
> -
> -int main(int argc, char **argv)
> -{
> -       int n_tasks = 100, i;
> -
> -       fprintf(stderr, "[No further output means we're allright]\n");
> -
> -       for (i=0; i<n_tasks; i++)
> -               if (fork() == 0)
> -                       task();
> -
> -       for (i=0; i<n_tasks; i++)
> -               wait(NULL);
> -
> -       exit(0);
> -}
> -
> diff --git a/Documentation/prctl/disable-tsc-test.c b/Documentation/prctl/disable-tsc-test.c
> deleted file mode 100644
> index 843c81eac235..000000000000
> --- a/Documentation/prctl/disable-tsc-test.c
> +++ /dev/null
> @@ -1,94 +0,0 @@
> -/*
> - * Tests for prctl(PR_GET_TSC, ...) / prctl(PR_SET_TSC, ...)
> - *
> - * Basic test to test behaviour of PR_GET_TSC and PR_SET_TSC
> - */
> -
> -#include <stdio.h>
> -#include <stdlib.h>
> -#include <unistd.h>
> -#include <signal.h>
> -#include <inttypes.h>
> -
> -
> -#include <sys/prctl.h>
> -#include <linux/prctl.h>
> -
> -/* Get/set the process' ability to use the timestamp counter instruction */
> -#ifndef PR_GET_TSC
> -#define PR_GET_TSC 25
> -#define PR_SET_TSC 26
> -# define PR_TSC_ENABLE         1   /* allow the use of the timestamp counter */
> -# define PR_TSC_SIGSEGV                2   /* throw a SIGSEGV instead of reading the TSC */
> -#endif
> -
> -const char *tsc_names[] =
> -{
> -       [0] = "[not set]",
> -       [PR_TSC_ENABLE] = "PR_TSC_ENABLE",
> -       [PR_TSC_SIGSEGV] = "PR_TSC_SIGSEGV",
> -};
> -
> -uint64_t rdtsc() {
> -uint32_t lo, hi;
> -/* We cannot use "=A", since this would use %rax on x86_64 */
> -__asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
> -return (uint64_t)hi << 32 | lo;
> -}
> -
> -void sigsegv_cb(int sig)
> -{
> -       int tsc_val = 0;
> -
> -       printf("[ SIG_SEGV ]\n");
> -       printf("prctl(PR_GET_TSC, &tsc_val); ");
> -       fflush(stdout);
> -
> -       if ( prctl(PR_GET_TSC, &tsc_val) == -1)
> -               perror("prctl");
> -
> -       printf("tsc_val == %s\n", tsc_names[tsc_val]);
> -       printf("prctl(PR_SET_TSC, PR_TSC_ENABLE)\n");
> -       fflush(stdout);
> -       if ( prctl(PR_SET_TSC, PR_TSC_ENABLE) == -1)
> -               perror("prctl");
> -
> -       printf("rdtsc() == ");
> -}
> -
> -int main(int argc, char **argv)
> -{
> -       int tsc_val = 0;
> -
> -       signal(SIGSEGV, sigsegv_cb);
> -
> -       printf("rdtsc() == %llu\n", (unsigned long long)rdtsc());
> -       printf("prctl(PR_GET_TSC, &tsc_val); ");
> -       fflush(stdout);
> -
> -       if ( prctl(PR_GET_TSC, &tsc_val) == -1)
> -               perror("prctl");
> -
> -       printf("tsc_val == %s\n", tsc_names[tsc_val]);
> -       printf("rdtsc() == %llu\n", (unsigned long long)rdtsc());
> -       printf("prctl(PR_SET_TSC, PR_TSC_ENABLE)\n");
> -       fflush(stdout);
> -
> -       if ( prctl(PR_SET_TSC, PR_TSC_ENABLE) == -1)
> -               perror("prctl");
> -
> -       printf("rdtsc() == %llu\n", (unsigned long long)rdtsc());
> -       printf("prctl(PR_SET_TSC, PR_TSC_SIGSEGV)\n");
> -       fflush(stdout);
> -
> -       if ( prctl(PR_SET_TSC, PR_TSC_SIGSEGV) == -1)
> -               perror("prctl");
> -
> -       printf("rdtsc() == ");
> -       fflush(stdout);
> -       printf("%llu\n", (unsigned long long)rdtsc());
> -       fflush(stdout);
> -
> -       exit(EXIT_SUCCESS);
> -}
> -
> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> index eb71ec794732..d9ed8489bc04 100644
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -946,13 +946,6 @@ extern void start_thread(struct pt_regs *regs, unsigned long new_ip,
>
>  #define KSTK_EIP(task)         (task_pt_regs(task)->ip)
>
> -/* Get/set a process' ability to use the timestamp counter instruction */
> -#define GET_TSC_CTL(adr)       get_tsc_mode((adr))
> -#define SET_TSC_CTL(val)       set_tsc_mode((val))
> -
> -extern int get_tsc_mode(unsigned long adr);
> -extern int set_tsc_mode(unsigned int val);
> -
>  extern u16 amd_get_nb_id(int cpu);
>
>  static inline uint32_t hypervisor_cpuid_base(const char *sig, uint32_t leaves)
> diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
> index 854053889d4d..ac9ed8b13aa8 100644
> --- a/arch/x86/include/asm/thread_info.h
> +++ b/arch/x86/include/asm/thread_info.h
> @@ -78,7 +78,6 @@ struct thread_info {
>  #define TIF_MCE_NOTIFY         10      /* notify userspace of an MCE */
>  #define TIF_USER_RETURN_NOTIFY 11      /* notify kernel of userspace return */
>  #define TIF_UPROBE             12      /* breakpointed or singlestepping */
> -#define TIF_NOTSC              16      /* TSC is not accessible in userland */
>  #define TIF_IA32               17      /* IA32 compatibility process */
>  #define TIF_FORK               18      /* ret_from_fork */
>  #define TIF_NOHZ               19      /* in adaptive nohz mode */
> @@ -103,7 +102,6 @@ struct thread_info {
>  #define _TIF_MCE_NOTIFY                (1 << TIF_MCE_NOTIFY)
>  #define _TIF_USER_RETURN_NOTIFY        (1 << TIF_USER_RETURN_NOTIFY)
>  #define _TIF_UPROBE            (1 << TIF_UPROBE)
> -#define _TIF_NOTSC             (1 << TIF_NOTSC)
>  #define _TIF_IA32              (1 << TIF_IA32)
>  #define _TIF_FORK              (1 << TIF_FORK)
>  #define _TIF_NOHZ              (1 << TIF_NOHZ)
> @@ -145,7 +143,7 @@ struct thread_info {
>
>  /* flags to check in __switch_to() */
>  #define _TIF_WORK_CTXSW                                                        \
> -       (_TIF_IO_BITMAP|_TIF_NOTSC|_TIF_BLOCKSTEP)
> +       (_TIF_IO_BITMAP|_TIF_BLOCKSTEP)
>
>  #define _TIF_WORK_CTXSW_PREV (_TIF_WORK_CTXSW|_TIF_USER_RETURN_NOTIFY)
>  #define _TIF_WORK_CTXSW_NEXT (_TIF_WORK_CTXSW)
> diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
> index 94605c0e9cee..dc8fefa8b672 100644
> --- a/arch/x86/include/asm/tsc.h
> +++ b/arch/x86/include/asm/tsc.h
> @@ -17,8 +17,6 @@ typedef unsigned long long cycles_t;
>  extern unsigned int cpu_khz;
>  extern unsigned int tsc_khz;
>
> -extern void disable_TSC(void);
> -
>  static inline cycles_t get_cycles(void)
>  {
>         unsigned long long ret = 0;
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index f804dc935d2a..73e6a57e24d9 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -137,64 +137,6 @@ void flush_thread(void)
>                 free_thread_xstate(tsk);
>  }
>
> -static void hard_disable_TSC(void)
> -{
> -       write_cr4(read_cr4() | X86_CR4_TSD);
> -}
> -
> -void disable_TSC(void)
> -{
> -       preempt_disable();
> -       if (!test_and_set_thread_flag(TIF_NOTSC))
> -               /*
> -                * Must flip the CPU state synchronously with
> -                * TIF_NOTSC in the current running context.
> -                */
> -               hard_disable_TSC();
> -       preempt_enable();
> -}
> -
> -static void hard_enable_TSC(void)
> -{
> -       write_cr4(read_cr4() & ~X86_CR4_TSD);
> -}
> -
> -static void enable_TSC(void)
> -{
> -       preempt_disable();
> -       if (test_and_clear_thread_flag(TIF_NOTSC))
> -               /*
> -                * Must flip the CPU state synchronously with
> -                * TIF_NOTSC in the current running context.
> -                */
> -               hard_enable_TSC();
> -       preempt_enable();
> -}
> -
> -int get_tsc_mode(unsigned long adr)
> -{
> -       unsigned int val;
> -
> -       if (test_thread_flag(TIF_NOTSC))
> -               val = PR_TSC_SIGSEGV;
> -       else
> -               val = PR_TSC_ENABLE;
> -
> -       return put_user(val, (unsigned int __user *)adr);
> -}
> -
> -int set_tsc_mode(unsigned int val)
> -{
> -       if (val == PR_TSC_SIGSEGV)
> -               disable_TSC();
> -       else if (val == PR_TSC_ENABLE)
> -               enable_TSC();
> -       else
> -               return -EINVAL;
> -
> -       return 0;
> -}
> -
>  void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p,
>                       struct tss_struct *tss)
>  {
> @@ -214,15 +156,6 @@ void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p,
>                 update_debugctlmsr(debugctl);
>         }
>
> -       if (test_tsk_thread_flag(prev_p, TIF_NOTSC) ^
> -           test_tsk_thread_flag(next_p, TIF_NOTSC)) {
> -               /* prev and next are different */
> -               if (test_tsk_thread_flag(next_p, TIF_NOTSC))
> -                       hard_disable_TSC();
> -               else
> -                       hard_enable_TSC();
> -       }
> -
>         if (test_tsk_thread_flag(next_p, TIF_IO_BITMAP)) {
>                 /*
>                  * Copy the relevant range of the IO bitmap.
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index 58afc04c107e..9a2fea65c965 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -68,7 +68,12 @@
>  #define PR_CAPBSET_READ 23
>  #define PR_CAPBSET_DROP 24
>
> -/* Get/set the process' ability to use the timestamp counter instruction */
> +/*
> + * Get/set the process' ability to use the timestamp counter instruction
> + * Deprecated: PR_GET_TSC always reports PR_TSC_ENABLE, and PR_SET_TSC can
> + * only be used to set PR_TSC_ENABLE.  On non-x86 systems, neither of these
> + * prctls exists at all.
> + */
>  #define PR_GET_TSC 25
>  #define PR_SET_TSC 26
>  # define PR_TSC_ENABLE         1       /* allow the use of the timestamp counter */
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 44eb005c6695..bdc60dc68e56 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -678,9 +678,6 @@ static long seccomp_set_mode_strict(void)
>         if (!seccomp_may_assign_mode(seccomp_mode))
>                 goto out;
>
> -#ifdef TIF_NOTSC
> -       disable_TSC();
> -#endif
>         seccomp_assign_mode(current, seccomp_mode);
>         ret = 0;
>
> diff --git a/kernel/sys.c b/kernel/sys.c
> index ce8129192a26..5dd1dedb7bf9 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -88,9 +88,6 @@
>  #ifndef GET_TSC_CTL
>  # define GET_TSC_CTL(a)                (-EINVAL)
>  #endif
> -#ifndef SET_TSC_CTL
> -# define SET_TSC_CTL(a)                (-EINVAL)
> -#endif
>
>  /*
>   * this is where the system-wide overflow UID and GID are defined, for
> @@ -1917,12 +1914,17 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>         case PR_SET_SECCOMP:
>                 error = prctl_set_seccomp(arg2, (char __user *)arg3);
>                 break;
> +#ifdef CONFIG_X86
>         case PR_GET_TSC:
> -               error = GET_TSC_CTL(arg2);
> +               error = put_user(PR_TSC_ENABLE, (int __user *)arg2);
>                 break;
>         case PR_SET_TSC:
> -               error = SET_TSC_CTL(arg2);
> +               if (arg2 == PR_TSC_ENABLE)
> +                       error = 0;
> +               else
> +                       error = -EINVAL;
>                 break;
> +#endif
>         case PR_TASK_PERF_EVENTS_DISABLE:
>                 error = perf_event_task_disable();
>                 break;
> --
> 1.9.3
>



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply

* [PATCH 17/17] userfaultfd: implement USERFAULTFD_RANGE_REGISTER|UNREGISTER
From: Andrea Arcangeli @ 2014-10-03 17:08 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api
  Cc: Linus Torvalds, Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini,
	Rik van Riel, Mel Gorman, Andy Lutomirski, Andrew Morton,
	Sasha Levin, Hugh Dickins, Peter Feiner,
	\"Dr. David Alan Gilbert\", Christopher Covington,
	Johannes Weiner, Android Kernel Team, Robert Love,
	Dmitry Adamushko, Neil Brown, Mike Hommey, Taras Glek,
	Jan Kara <jac>
In-Reply-To: <1412356087-16115-1-git-send-email-aarcange@redhat.com>

This adds two protocol commands to the userfaultfd protocol.

To register memory regions into userfaultfd you can write 16 bytes as:

	 [ start|0x1, end ]

to unregister write:

	 [ start|0x2, end ]

End is "start+len" (not start+len-1). Same as vma->vm_end.

This also enforces the constraint that start and end must both be page
aligned (so the last two bits become available to implement the
USERFAULTFD_RANGE_REGISTER|UNREGISTER commands).

This way there can be multiple userfaultfd for each process and each
one can register into its own virtual memory ranges.

If an userfaultfd tries to register into a virtual memory range
already registered into a different userfaultfd, -EBUSY will be
returned by the write() syscall.

userfaultfd can register into allocated ranges that don't have
MADV_USERFAULT set, but if MADV_USERFAULT is not set, no userfault
will fire on those.

Only if MADV_USERFAULT is set on the virtual memory range, and the
userfaultfd registered into the same range, the userfaultfd protocol
will engage.

If only MADV_USERFAULT is set and there's no userfaultfd registered on
a memory range, only a SIGBUS will be raised and the page fault will
not engage the userfaultfd protocol.

This also makes the handle_userfault() safe against race conditions
with regard to the mmap_sem by requiring FAULT_FLAG_ALLOW_RETRY to be
set the first time a fault is raised by any thread. In turn to work
reliably, the userfaultd depends on the gup_locked|unlocked patchset
to be applied.

If get_user_pages() is run on virtual memory ranges registered into
the userfaultfd, handle_userfault() will return VM_FAULT_SIGBUS and
gup() will return -EFAULT, because get_user_pages() doesn't allow
handle_userfault() to release the mmap_sem and in turn we cannot
safely engage the userfaultfd protocol. So the remaining
get_user_pages() calls must be restricted to memory ranges that we
know are not tracked through the userfaultfd protocol for the
userfaultfd to be reliable.

The only exception of a get_user_pages() that can safely run into an
userfaultfd triggering a -EFAULT is ptrace. ptrace would otherwise
hang so it's actually ok if it will get a -EFAULT instead of
hanging. But it would be ok also to phase out get_user_pages()
completely and have ptrace hang on the userfault (the hang can be
resolved sending SIGKILL to gdb or whatever process that is calling
ptrace). We could also decide to retain the current -EFAULT behavior
of ptrace using get_user_pages_locked with a NULL locked parameter so
the FAULT_FLAG_ALLOW_RETRY flag will not be set. Either ways would be
safe.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/userfaultfd.c            | 411 +++++++++++++++++++++++++++-----------------
 include/linux/mm.h          |   2 +-
 include/linux/mm_types.h    |  11 ++
 include/linux/userfaultfd.h |  19 +-
 mm/madvise.c                |   3 +-
 mm/mempolicy.c              |   4 +-
 mm/mlock.c                  |   3 +-
 mm/mmap.c                   |  39 +++--
 mm/mprotect.c               |   3 +-
 9 files changed, 320 insertions(+), 175 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 2667d0d..49bbd3b 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -23,6 +23,7 @@
 #include <linux/anon_inodes.h>
 #include <linux/syscalls.h>
 #include <linux/userfaultfd.h>
+#include <linux/mempolicy.h>
 
 struct userfaultfd_ctx {
 	/* pseudo fd refcounting */
@@ -37,6 +38,8 @@ struct userfaultfd_ctx {
 	unsigned int state;
 	/* released */
 	bool released;
+	/* mm with one ore more vmas attached to this userfaultfd_ctx */
+	struct mm_struct *mm;
 };
 
 struct userfaultfd_wait_queue {
@@ -49,6 +52,10 @@ struct userfaultfd_wait_queue {
 #define USERFAULTFD_PROTOCOL ((__u64) 0xaa)
 #define USERFAULTFD_UNKNOWN_PROTOCOL ((__u64) -1ULL)
 
+#define USERFAULTFD_RANGE_REGISTER ((__u64) 0x1)
+#define USERFAULTFD_RANGE_UNREGISTER ((__u64) 0x2)
+#define USERFAULTFD_RANGE_MASK (~((__u64) 0x3))
+
 enum {
 	USERFAULTFD_STATE_ASK_PROTOCOL,
 	USERFAULTFD_STATE_ACK_PROTOCOL,
@@ -56,43 +63,6 @@ enum {
 	USERFAULTFD_STATE_RUNNING,
 };
 
-/**
- * struct mm_slot - userlandfd information per mm that is being scanned
- * @link: link to the mm_slots hash list
- * @mm: the mm that this information is valid for
- * @ctx: userfaultfd context for this mm
- */
-struct mm_slot {
-	struct hlist_node link;
-	struct mm_struct *mm;
-	struct userfaultfd_ctx ctx;
-	struct rcu_head rcu_head;
-};
-
-#define MM_USERLANDFD_HASH_BITS 10
-static DEFINE_HASHTABLE(mm_userlandfd_hash, MM_USERLANDFD_HASH_BITS);
-
-static DEFINE_MUTEX(mm_userlandfd_mutex);
-
-static struct mm_slot *get_mm_slot(struct mm_struct *mm)
-{
-	struct mm_slot *slot;
-
-	hash_for_each_possible_rcu(mm_userlandfd_hash, slot, link,
-				   (unsigned long)mm)
-		if (slot->mm == mm)
-			return slot;
-
-	return NULL;
-}
-
-static void insert_to_mm_userlandfd_hash(struct mm_struct *mm,
-					 struct mm_slot *mm_slot)
-{
-	mm_slot->mm = mm;
-	hash_add_rcu(mm_userlandfd_hash, &mm_slot->link, (unsigned long)mm);
-}
-
 static int userfaultfd_wake_function(wait_queue_t *wq, unsigned mode,
 				     int wake_flags, void *key)
 {
@@ -122,30 +92,10 @@ out:
  *
  * Returns: In case of success, returns not zero.
  */
-static int userfaultfd_ctx_get(struct userfaultfd_ctx *ctx)
+static void userfaultfd_ctx_get(struct userfaultfd_ctx *ctx)
 {
-	/*
-	 * If it's already released don't get it. This can race
-	 * against userfaultfd_release, if the race triggers it'll be
-	 * handled safely by the handle_userfault main loop
-	 * (userfaultfd_release will take the mmap_sem for writing to
-	 * flush out all in-flight userfaults). This check is only an
-	 * optimization.
-	 */
-	if (unlikely(ACCESS_ONCE(ctx->released)))
-		return 0;
-	return atomic_inc_not_zero(&ctx->refcount);
-}
-
-static void userfaultfd_free(struct userfaultfd_ctx *ctx)
-{
-	struct mm_slot *mm_slot = container_of(ctx, struct mm_slot, ctx);
-
-	mutex_lock(&mm_userlandfd_mutex);
-	hash_del_rcu(&mm_slot->link);
-	mutex_unlock(&mm_userlandfd_mutex);
-
-	kfree_rcu(mm_slot, rcu_head);
+	if (!atomic_inc_not_zero(&ctx->refcount))
+		BUG();
 }
 
 /**
@@ -158,8 +108,10 @@ static void userfaultfd_free(struct userfaultfd_ctx *ctx)
  */
 static void userfaultfd_ctx_put(struct userfaultfd_ctx *ctx)
 {
-	if (atomic_dec_and_test(&ctx->refcount))
-		userfaultfd_free(ctx);
+	if (atomic_dec_and_test(&ctx->refcount)) {
+		mmdrop(ctx->mm);
+		kfree(ctx);
+	}
 }
 
 /*
@@ -181,25 +133,55 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
 		     unsigned int flags)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	struct mm_slot *slot;
 	struct userfaultfd_ctx *ctx;
 	struct userfaultfd_wait_queue uwq;
-	int ret;
 
 	BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
 
-	rcu_read_lock();
-	slot = get_mm_slot(mm);
-	if (!slot) {
-		rcu_read_unlock();
+	ctx = vma->vm_userfaultfd_ctx.ctx;
+	if (!ctx)
 		return VM_FAULT_SIGBUS;
-	}
-	ctx = &slot->ctx;
-	if (!userfaultfd_ctx_get(ctx)) {
-		rcu_read_unlock();
+
+	BUG_ON(ctx->mm != mm);
+
+	/*
+	 * If it's already released don't get it. This avoids to loop
+	 * in __get_user_pages if userfaultfd_release waits on the
+	 * caller of handle_userfault to release the mmap_sem.
+	 */
+	if (unlikely(ACCESS_ONCE(ctx->released)))
+		return VM_FAULT_SIGBUS;
+
+	/* check that we can return VM_FAULT_RETRY */
+	if (unlikely(!(flags & FAULT_FLAG_ALLOW_RETRY))) {
+		/*
+		 * Validate the invariant that nowait must allow retry
+		 * to be sure not to return SIGBUS erroneously on
+		 * nowait invocations.
+		 */
+		BUG_ON(flags & FAULT_FLAG_RETRY_NOWAIT);
+#ifdef CONFIG_DEBUG_VM
+		if (printk_ratelimit()) {
+			printk(KERN_WARNING
+			       "FAULT_FLAG_ALLOW_RETRY missing %x\n", flags);
+			dump_stack();
+		}
+#endif
 		return VM_FAULT_SIGBUS;
 	}
-	rcu_read_unlock();
+
+	/*
+	 * Handle nowait, not much to do other than tell it to retry
+	 * and wait.
+	 */
+	if (flags & FAULT_FLAG_RETRY_NOWAIT)
+		return VM_FAULT_RETRY;
+
+	/* take the reference before dropping the mmap_sem */
+	userfaultfd_ctx_get(ctx);
+
+	/* be gentle and immediately relinquish the mmap_sem */
+	up_read(&mm->mmap_sem);
 
 	init_waitqueue_func_entry(&uwq.wq, userfaultfd_wake_function);
 	uwq.wq.private = current;
@@ -214,60 +196,15 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
 	 */
 	__add_wait_queue(&ctx->fault_wqh, &uwq.wq);
 	for (;;) {
-		set_current_state(TASK_INTERRUPTIBLE);
-		if (fatal_signal_pending(current)) {
-			/*
-			 * If we have to fail because the task is
-			 * killed just retry the fault either by
-			 * returning to userland or through
-			 * VM_FAULT_RETRY if we come from a page fault
-			 * and a fatal signal is pending.
-			 */
-			ret = 0;
-			if (flags & FAULT_FLAG_KILLABLE) {
-				/*
-				 * If FAULT_FLAG_KILLABLE is set we
-				 * and there's a fatal signal pending
-				 * can return VM_FAULT_RETRY
-				 * regardless if
-				 * FAULT_FLAG_ALLOW_RETRY is set or
-				 * not as long as we release the
-				 * mmap_sem. The page fault will
-				 * return stright to userland then to
-				 * handle the fatal signal.
-				 */
-				up_read(&mm->mmap_sem);
-				ret = VM_FAULT_RETRY;
-			}
-			break;
-		}
-		if (!uwq.pending || ACCESS_ONCE(ctx->released)) {
-			ret = 0;
-			if (flags & FAULT_FLAG_ALLOW_RETRY) {
-				ret = VM_FAULT_RETRY;
-				if (!(flags & FAULT_FLAG_RETRY_NOWAIT))
-					up_read(&mm->mmap_sem);
-			}
-			break;
-		}
-		if (((FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT) &
-		     flags) ==
-		    (FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT)) {
-			ret = VM_FAULT_RETRY;
-			/*
-			 * The mmap_sem must not be released if
-			 * FAULT_FLAG_RETRY_NOWAIT is set despite we
-			 * return VM_FAULT_RETRY (FOLL_NOWAIT case).
-			 */
+		set_current_state(TASK_KILLABLE);
+		if (!uwq.pending || ACCESS_ONCE(ctx->released) ||
+		    fatal_signal_pending(current))
 			break;
-		}
 		spin_unlock(&ctx->fault_wqh.lock);
-		up_read(&mm->mmap_sem);
 
 		wake_up_poll(&ctx->fd_wqh, POLLIN);
 		schedule();
 
-		down_read(&mm->mmap_sem);
 		spin_lock(&ctx->fault_wqh.lock);
 	}
 	__remove_wait_queue(&ctx->fault_wqh, &uwq.wq);
@@ -276,30 +213,53 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
 
 	/*
 	 * ctx may go away after this if the userfault pseudo fd is
-	 * released by another CPU.
+	 * already released.
 	 */
 	userfaultfd_ctx_put(ctx);
 
-	return ret;
+	return VM_FAULT_RETRY;
 }
 
 static int userfaultfd_release(struct inode *inode, struct file *file)
 {
 	struct userfaultfd_ctx *ctx = file->private_data;
-	struct mm_slot *mm_slot = container_of(ctx, struct mm_slot, ctx);
+	struct mm_struct *mm = ctx->mm;
+	struct vm_area_struct *vma, *prev;
 	__u64 range[2] = { 0ULL, -1ULL };
 
 	ACCESS_ONCE(ctx->released) = true;
 
 	/*
-	 * Flush page faults out of all CPUs to avoid race conditions
-	 * against ctx->released. All page faults must be retried
-	 * without returning VM_FAULT_SIGBUS if the get_mm_slot and
-	 * userfaultfd_ctx_get both succeeds but ctx->released is set.
+	 * Flush page faults out of all CPUs. NOTE: all page faults
+	 * must be retried without returning VM_FAULT_SIGBUS if
+	 * userfaultfd_ctx_get() succeeds but vma->vma_userfault_ctx
+	 * changes while handle_userfault released the mmap_sem. So
+	 * it's critical that released is set to true (above), before
+	 * taking the mmap_sem for writing.
 	 */
-	down_write(&mm_slot->mm->mmap_sem);
-	up_write(&mm_slot->mm->mmap_sem);
+	down_write(&mm->mmap_sem);
+	prev = NULL;
+	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+		if (vma->vm_userfaultfd_ctx.ctx != ctx)
+			continue;
+		prev = vma_merge(mm, prev, vma->vm_start, vma->vm_end,
+				 vma->vm_flags, vma->anon_vma,
+				 vma->vm_file, vma->vm_pgoff,
+				 vma_policy(vma),
+				 NULL_VM_USERFAULTFD_CTX);
+		if (prev)
+			vma = prev;
+		else
+			prev = vma;
+		vma->vm_userfaultfd_ctx = NULL_VM_USERFAULTFD_CTX;
+	}
+	up_write(&mm->mmap_sem);
 
+	/*
+	 * After no new page faults can wait on this fautl_wqh, flush
+	 * the last page faults that may have been already waiting on
+	 * the fault_wqh.
+	 */
 	spin_lock(&ctx->fault_wqh.lock);
 	__wake_up_locked_key(&ctx->fault_wqh, TASK_NORMAL, 0, range);
 	spin_unlock(&ctx->fault_wqh.lock);
@@ -454,6 +414,140 @@ static int wake_userfault(struct userfaultfd_ctx *ctx, __u64 *range)
 	return ret;
 }
 
+static ssize_t userfaultfd_range_register(struct userfaultfd_ctx *ctx,
+					  unsigned long start,
+					  unsigned long end)
+{
+	struct mm_struct *mm = ctx->mm;
+	struct vm_area_struct *vma, *prev;
+	int ret;
+
+	down_write(&mm->mmap_sem);
+	vma = find_vma(mm, start);
+	if (!vma)
+		return -ENOMEM;
+	if (vma->vm_start >= end)
+		return -EINVAL;
+
+	prev = vma->vm_prev;
+	if (vma->vm_start < start)
+		prev = vma;
+
+	ret = 0;
+	/* we got an overlap so start the splitting */
+	do {
+		if (vma->vm_userfaultfd_ctx.ctx == ctx)
+			goto next;
+		if (vma->vm_userfaultfd_ctx.ctx) {
+			ret = -EBUSY;
+			break;
+		}
+		prev = vma_merge(mm, prev, start, end, vma->vm_flags,
+				 vma->anon_vma, vma->vm_file, vma->vm_pgoff,
+				 vma_policy(vma),
+				 ((struct vm_userfaultfd_ctx){ ctx }));
+		if (prev) {
+			vma = prev;
+			vma->vm_userfaultfd_ctx.ctx = ctx;
+			goto next;
+		}
+		if (vma->vm_start < start) {
+			ret = split_vma(mm, vma, start, 1);
+			if (ret < 0)
+				break;
+		}
+		if (vma->vm_end > end) {
+			ret = split_vma(mm, vma, end, 0);
+			if (ret < 0)
+				break;
+		}
+		vma->vm_userfaultfd_ctx.ctx = ctx;
+	next:
+		start = vma->vm_end;
+		vma = vma->vm_next;
+	} while (vma && vma->vm_start < end);
+	up_write(&mm->mmap_sem);
+
+	return ret;
+}
+
+static ssize_t userfaultfd_range_unregister(struct userfaultfd_ctx *ctx,
+					    unsigned long start,
+					    unsigned long end)
+{
+	struct mm_struct *mm = ctx->mm;
+	struct vm_area_struct *vma, *prev;
+	int ret;
+
+	down_write(&mm->mmap_sem);
+	vma = find_vma(mm, start);
+	if (!vma)
+		return -ENOMEM;
+	if (vma->vm_start >= end)
+		return -EINVAL;
+
+	prev = vma->vm_prev;
+	if (vma->vm_start < start)
+		prev = vma;
+
+	ret = 0;
+	/* we got an overlap so start the splitting */
+	do {
+		if (!vma->vm_userfaultfd_ctx.ctx)
+			goto next;
+		if (vma->vm_userfaultfd_ctx.ctx != ctx) {
+			ret = -EBUSY;
+			break;
+		}
+		prev = vma_merge(mm, prev, start, end, vma->vm_flags,
+				 vma->anon_vma, vma->vm_file, vma->vm_pgoff,
+				 vma_policy(vma),
+				 NULL_VM_USERFAULTFD_CTX);
+		if (prev) {
+			vma = prev;
+			vma->vm_userfaultfd_ctx = NULL_VM_USERFAULTFD_CTX;
+			goto next;
+		}
+		if (vma->vm_start < start) {
+			ret = split_vma(mm, vma, start, 1);
+			if (ret < 0)
+				break;
+		}
+		if (vma->vm_end > end) {
+			ret = split_vma(mm, vma, end, 0);
+			if (ret < 0)
+				break;
+		}
+		vma->vm_userfaultfd_ctx.ctx = NULL;
+	next:
+		start = vma->vm_end;
+		vma = vma->vm_next;
+	} while (vma && vma->vm_start < end);
+	up_write(&mm->mmap_sem);
+
+	return ret;
+}
+
+static ssize_t userfaultfd_handle_range(struct userfaultfd_ctx *ctx,
+					__u64 *range)
+{
+	unsigned long start, end;
+
+	start = range[0] & USERFAULTFD_RANGE_MASK;
+	end = range[1];
+	BUG_ON(end <= start);
+	if (end > TASK_SIZE)
+		return -ENOMEM;
+
+	if (range[0] & USERFAULTFD_RANGE_REGISTER) {
+		BUG_ON(range[0] & USERFAULTFD_RANGE_UNREGISTER);
+		return userfaultfd_range_register(ctx, start, end);
+	} else {
+		BUG_ON(!(range[0] & USERFAULTFD_RANGE_UNREGISTER));
+		return userfaultfd_range_unregister(ctx, start, end);
+	}
+}
+
 static ssize_t userfaultfd_write(struct file *file, const char __user *buf,
 				 size_t count, loff_t *ppos)
 {
@@ -483,9 +577,24 @@ static ssize_t userfaultfd_write(struct file *file, const char __user *buf,
 		return -EINVAL;
 	if (copy_from_user(&range, buf, sizeof(range)))
 		return -EFAULT;
-	if (range[0] >= range[1])
+	/* the range mask requires 2 bits */
+	BUILD_BUG_ON(PAGE_SHIFT < 2);
+	if (range[0] & ~PAGE_MASK & USERFAULTFD_RANGE_MASK)
+		return -EINVAL;
+	if ((range[0] & ~USERFAULTFD_RANGE_MASK) == ~USERFAULTFD_RANGE_MASK)
+		return -EINVAL;
+	if (range[1] & ~PAGE_MASK)
+		return -EINVAL;
+	if ((range[0] & PAGE_MASK) >= (range[1] & PAGE_MASK))
 		return -ERANGE;
 
+	/* handle the register/unregister commands */
+	if (range[0] & ~USERFAULTFD_RANGE_MASK) {
+		ssize_t ret = userfaultfd_handle_range(ctx, range);
+		BUG_ON(ret > 0);
+		return ret < 0 ? ret : sizeof(range);
+	}
+
 	/* always take the fd_wqh lock before the fault_wqh lock */
 	if (find_userfault(ctx, NULL, POLLOUT))
 		if (!wake_userfault(ctx, range))
@@ -552,7 +661,9 @@ static const struct file_operations userfaultfd_fops = {
 static struct file *userfaultfd_file_create(int flags)
 {
 	struct file *file;
-	struct mm_slot *mm_slot;
+	struct userfaultfd_ctx *ctx;
+
+	BUG_ON(!current->mm);
 
 	/* Check the UFFD_* constants for consistency.  */
 	BUILD_BUG_ON(UFFD_CLOEXEC != O_CLOEXEC);
@@ -562,33 +673,25 @@ static struct file *userfaultfd_file_create(int flags)
 	if (flags & ~UFFD_SHARED_FCNTL_FLAGS)
 		goto out;
 
-	mm_slot = kmalloc(sizeof(*mm_slot), GFP_KERNEL);
+	ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
 	file = ERR_PTR(-ENOMEM);
-	if (!mm_slot)
+	if (!ctx)
 		goto out;
 
-	mutex_lock(&mm_userlandfd_mutex);
-	file = ERR_PTR(-EBUSY);
-	if (get_mm_slot(current->mm))
-		goto out_free_unlock;
-
-	atomic_set(&mm_slot->ctx.refcount, 1);
-	init_waitqueue_head(&mm_slot->ctx.fault_wqh);
-	init_waitqueue_head(&mm_slot->ctx.fd_wqh);
-	mm_slot->ctx.flags = flags;
-	mm_slot->ctx.state = USERFAULTFD_STATE_ASK_PROTOCOL;
-	mm_slot->ctx.released = false;
-
-	file = anon_inode_getfile("[userfaultfd]", &userfaultfd_fops,
-				  &mm_slot->ctx,
+	atomic_set(&ctx->refcount, 1);
+	init_waitqueue_head(&ctx->fault_wqh);
+	init_waitqueue_head(&ctx->fd_wqh);
+	ctx->flags = flags;
+	ctx->state = USERFAULTFD_STATE_ASK_PROTOCOL;
+	ctx->released = false;
+	ctx->mm = current->mm;
+	/* prevent the mm struct to be freed */
+	atomic_inc(&ctx->mm->mm_count);
+
+	file = anon_inode_getfile("[userfaultfd]", &userfaultfd_fops, ctx,
 				  O_RDWR | (flags & UFFD_SHARED_FCNTL_FLAGS));
 	if (IS_ERR(file))
-	out_free_unlock:
-		kfree(mm_slot);
-	else
-		insert_to_mm_userlandfd_hash(current->mm,
-					     mm_slot);
-	mutex_unlock(&mm_userlandfd_mutex);
+		kfree(ctx);
 out:
 	return file;
 }
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 71dbe03..cd60938 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1779,7 +1779,7 @@ extern int vma_adjust(struct vm_area_struct *vma, unsigned long start,
 extern struct vm_area_struct *vma_merge(struct mm_struct *,
 	struct vm_area_struct *prev, unsigned long addr, unsigned long end,
 	unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t,
-	struct mempolicy *);
+	struct mempolicy *, struct vm_userfaultfd_ctx);
 extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *);
 extern int split_vma(struct mm_struct *,
 	struct vm_area_struct *, unsigned long addr, int new_below);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 2c876d1..bb78fa8 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -238,6 +238,16 @@ struct vm_region {
 						* this region */
 };
 
+#ifdef CONFIG_USERFAULTFD
+#define NULL_VM_USERFAULTFD_CTX ((struct vm_userfaultfd_ctx) { NULL, })
+struct vm_userfaultfd_ctx {
+	struct userfaultfd_ctx *ctx;
+};
+#else /* CONFIG_USERFAULTFD */
+#define NULL_VM_USERFAULTFD_CTX ((struct vm_userfaultfd_ctx) {})
+struct vm_userfaultfd_ctx {};
+#endif /* CONFIG_USERFAULTFD */
+
 /*
  * This struct defines a memory VMM memory area. There is one of these
  * per VM-area/task.  A VM area is any part of the process virtual memory
@@ -308,6 +318,7 @@ struct vm_area_struct {
 #ifdef CONFIG_NUMA
 	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
 #endif
+	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
 };
 
 struct core_thread {
diff --git a/include/linux/userfaultfd.h b/include/linux/userfaultfd.h
index b7caef5..25f49db 100644
--- a/include/linux/userfaultfd.h
+++ b/include/linux/userfaultfd.h
@@ -29,14 +29,27 @@
 int handle_userfault(struct vm_area_struct *vma, unsigned long address,
 		     unsigned int flags);
 
+static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
+					struct vm_userfaultfd_ctx vm_ctx)
+{
+	return vma->vm_userfaultfd_ctx.ctx == vm_ctx.ctx;
+}
+
 #else /* CONFIG_USERFAULTFD */
 
-static int handle_userfault(struct vm_area_struct *vma, unsigned long address,
-			    unsigned int flags)
+static inline int handle_userfault(struct vm_area_struct *vma,
+				   unsigned long address,
+				   unsigned int flags)
 {
 	return VM_FAULT_SIGBUS;
 }
 
-#endif
+static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
+					struct vm_userfaultfd_ctx vm_ctx)
+{
+	return true;
+}
+
+#endif /* CONFIG_USERFAULTFD */
 
 #endif /* _LINUX_USERFAULTFD_H */
diff --git a/mm/madvise.c b/mm/madvise.c
index 24620c0..4bb9a68 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -117,7 +117,8 @@ static long madvise_behavior(struct vm_area_struct *vma,
 
 	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
 	*prev = vma_merge(mm, *prev, start, end, new_flags, vma->anon_vma,
-				vma->vm_file, pgoff, vma_policy(vma));
+			  vma->vm_file, pgoff, vma_policy(vma),
+			  vma->vm_userfaultfd_ctx);
 	if (*prev) {
 		vma = *prev;
 		goto success;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 8f5330d..bf54e9c 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -769,8 +769,8 @@ static int mbind_range(struct mm_struct *mm, unsigned long start,
 		pgoff = vma->vm_pgoff +
 			((vmstart - vma->vm_start) >> PAGE_SHIFT);
 		prev = vma_merge(mm, prev, vmstart, vmend, vma->vm_flags,
-				  vma->anon_vma, vma->vm_file, pgoff,
-				  new_pol);
+				 vma->anon_vma, vma->vm_file, pgoff,
+				 new_pol, vma->vm_userfaultfd_ctx);
 		if (prev) {
 			vma = prev;
 			next = vma->vm_next;
diff --git a/mm/mlock.c b/mm/mlock.c
index ce84cb0..ccb537e 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -566,7 +566,8 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
 
 	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
 	*prev = vma_merge(mm, *prev, start, end, newflags, vma->anon_vma,
-			  vma->vm_file, pgoff, vma_policy(vma));
+			  vma->vm_file, pgoff, vma_policy(vma),
+			  vma->vm_userfaultfd_ctx);
 	if (*prev) {
 		vma = *prev;
 		goto success;
diff --git a/mm/mmap.c b/mm/mmap.c
index c0a3637..303f45b 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -41,6 +41,7 @@
 #include <linux/notifier.h>
 #include <linux/memory.h>
 #include <linux/printk.h>
+#include <linux/userfaultfd.h>
 
 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
@@ -901,7 +902,8 @@ again:			remove_next = 1 + (end > next->vm_end);
  * per-vma resources, so we don't attempt to merge those.
  */
 static inline int is_mergeable_vma(struct vm_area_struct *vma,
-			struct file *file, unsigned long vm_flags)
+				struct file *file, unsigned long vm_flags,
+				struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
 {
 	/*
 	 * VM_SOFTDIRTY should not prevent from VMA merging, if we
@@ -917,6 +919,8 @@ static inline int is_mergeable_vma(struct vm_area_struct *vma,
 		return 0;
 	if (vma->vm_ops && vma->vm_ops->close)
 		return 0;
+	if (!is_mergeable_vm_userfaultfd_ctx(vma, vm_userfaultfd_ctx))
+		return 0;
 	return 1;
 }
 
@@ -947,9 +951,11 @@ static inline int is_mergeable_anon_vma(struct anon_vma *anon_vma1,
  */
 static int
 can_vma_merge_before(struct vm_area_struct *vma, unsigned long vm_flags,
-	struct anon_vma *anon_vma, struct file *file, pgoff_t vm_pgoff)
+		     struct anon_vma *anon_vma, struct file *file,
+		     pgoff_t vm_pgoff,
+		     struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
 {
-	if (is_mergeable_vma(vma, file, vm_flags) &&
+	if (is_mergeable_vma(vma, file, vm_flags, vm_userfaultfd_ctx) &&
 	    is_mergeable_anon_vma(anon_vma, vma->anon_vma, vma)) {
 		if (vma->vm_pgoff == vm_pgoff)
 			return 1;
@@ -966,9 +972,11 @@ can_vma_merge_before(struct vm_area_struct *vma, unsigned long vm_flags,
  */
 static int
 can_vma_merge_after(struct vm_area_struct *vma, unsigned long vm_flags,
-	struct anon_vma *anon_vma, struct file *file, pgoff_t vm_pgoff)
+		    struct anon_vma *anon_vma, struct file *file,
+		    pgoff_t vm_pgoff,
+		    struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
 {
-	if (is_mergeable_vma(vma, file, vm_flags) &&
+	if (is_mergeable_vma(vma, file, vm_flags, vm_userfaultfd_ctx) &&
 	    is_mergeable_anon_vma(anon_vma, vma->anon_vma, vma)) {
 		pgoff_t vm_pglen;
 		vm_pglen = vma_pages(vma);
@@ -1011,7 +1019,8 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
 			struct vm_area_struct *prev, unsigned long addr,
 			unsigned long end, unsigned long vm_flags,
 		     	struct anon_vma *anon_vma, struct file *file,
-			pgoff_t pgoff, struct mempolicy *policy)
+			pgoff_t pgoff, struct mempolicy *policy,
+			struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
 {
 	pgoff_t pglen = (end - addr) >> PAGE_SHIFT;
 	struct vm_area_struct *area, *next;
@@ -1038,14 +1047,17 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
 	if (prev && prev->vm_end == addr &&
   			mpol_equal(vma_policy(prev), policy) &&
 			can_vma_merge_after(prev, vm_flags,
-						anon_vma, file, pgoff)) {
+					    anon_vma, file, pgoff,
+					    vm_userfaultfd_ctx)) {
 		/*
 		 * OK, it can.  Can we now merge in the successor as well?
 		 */
 		if (next && end == next->vm_start &&
 				mpol_equal(policy, vma_policy(next)) &&
 				can_vma_merge_before(next, vm_flags,
-					anon_vma, file, pgoff+pglen) &&
+						     anon_vma, file,
+						     pgoff+pglen,
+						     vm_userfaultfd_ctx) &&
 				is_mergeable_anon_vma(prev->anon_vma,
 						      next->anon_vma, NULL)) {
 							/* cases 1, 6 */
@@ -1066,7 +1078,8 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
 	if (next && end == next->vm_start &&
  			mpol_equal(policy, vma_policy(next)) &&
 			can_vma_merge_before(next, vm_flags,
-					anon_vma, file, pgoff+pglen)) {
+					     anon_vma, file, pgoff+pglen,
+					     vm_userfaultfd_ctx)) {
 		if (prev && addr < prev->vm_end)	/* case 4 */
 			err = vma_adjust(prev, prev->vm_start,
 				addr, prev->vm_pgoff, NULL);
@@ -1548,7 +1561,8 @@ munmap_back:
 	/*
 	 * Can we just expand an old mapping?
 	 */
-	vma = vma_merge(mm, prev, addr, addr + len, vm_flags, NULL, file, pgoff, NULL);
+	vma = vma_merge(mm, prev, addr, addr + len, vm_flags,
+			NULL, file, pgoff, NULL, NULL_VM_USERFAULTFD_CTX);
 	if (vma)
 		goto out;
 
@@ -2670,7 +2684,7 @@ static unsigned long do_brk(unsigned long addr, unsigned long len)
 
 	/* Can we just expand an old private anonymous mapping? */
 	vma = vma_merge(mm, prev, addr, addr + len, flags,
-					NULL, NULL, pgoff, NULL);
+			NULL, NULL, pgoff, NULL, NULL_VM_USERFAULTFD_CTX);
 	if (vma)
 		goto out;
 
@@ -2829,7 +2843,8 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 	if (find_vma_links(mm, addr, addr + len, &prev, &rb_link, &rb_parent))
 		return NULL;	/* should never get here */
 	new_vma = vma_merge(mm, prev, addr, addr + len, vma->vm_flags,
-			vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma));
+			    vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
+			    vma->vm_userfaultfd_ctx);
 	if (new_vma) {
 		/*
 		 * Source vma may have been merged into new_vma
diff --git a/mm/mprotect.c b/mm/mprotect.c
index c43d557..2ee5aa7 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -294,7 +294,8 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
 	 */
 	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
 	*pprev = vma_merge(mm, *pprev, start, end, newflags,
-			vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma));
+			   vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
+			   vma->vm_userfaultfd_ctx);
 	if (*pprev) {
 		vma = *pprev;
 		goto success;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox