* [PATCH 0/1] uprobes: Kill __replace_page(), change uprobe_write_opcode() to rely on gup(WRITE)
@ 2013-12-09 21:18 Oleg Nesterov
2013-12-09 21:18 ` [PATCH 1/1] " Oleg Nesterov
2013-12-10 2:08 ` [PATCH 0/1] " Linus Torvalds
0 siblings, 2 replies; 10+ messages in thread
From: Oleg Nesterov @ 2013-12-09 21:18 UTC (permalink / raw)
To: Linus Torvalds, H. Peter Anvin
Cc: Ananth N Mavinakayanahalli, Andi Kleen, Borislav Petkov,
Hugh Dickins, Ingo Molnar, Jiri Kosina, Peter Zijlstra,
Srikar Dronamraju, linux-kernel
Hello.
It is not clear to me if Linus still dislikes this change or not.
Let me send the patch "officially" so that it can be nacked if I
misunderstood the result of discussion.
Changes:
- add a huge comment above gup(WRITE | FORCE)
- add WARN_ON(!(PageAnon() && page_mapcount() == 1))
to ensure it works as expected
If (say, on x86) we can avoid the pte games, we can simply add
if (IS_ENABLED(CONFIG_WHATEVER)) {
copy_to_page(...);
set_page_dirty_locked(page);
goto put;
}
right after the 2nd get_user_pages().
In any case I believe it would be very nice to kill __replace_page(),
and even the fact this patch removes include(mm/internal.h) makes me
think this patch makes sense. Assuming it is correct.
Oleg.
---
int uprobe_write_opcode(struct mm_struct *mm, unsigned long vaddr,
uprobe_opcode_t opcode)
{
struct page *page;
struct vm_area_struct *vma;
pte_t *ptep, entry;
spinlock_t *ptlp;
int ret;
/* Read the page with vaddr into memory */
ret = get_user_pages(NULL, mm, vaddr, 1, 0, 1, &page, NULL);
if (ret < 0)
return ret;
ret = verify_opcode(page, vaddr, &opcode);
if (ret <= 0)
goto put;
retry:
put_page(page);
/*
* Break the mapping unless the page is already anonymous and
* unshare the page, see the WARN_ON() below.
*
* We never write to the VM_SHARED vma, every caller must check
* valid_vma(). FOLL_WRITE | FOLL_FORCE should anonymize this
* page unless uprobe_write_opcode() was already called in the
* past or the application itself did mprotect(PROT_WRITE) and
* wrote into this page.
*
* If it was already anonymous it can be shared due to dup_mm(),
* in this case do_wp_page() or do_swap_page() will do another
* cow to unshare, so we can safely modify it.
*/
ret = get_user_pages(NULL, mm, vaddr, 1, 1, 1, &page, &vma);
if (ret < 0)
return ret;
ptep = page_check_address(page, mm, vaddr, &ptlp, 0);
if (!ptep)
goto retry;
ret = 0;
if (WARN_ON(!PageAnon(page) || page_mapcount(page) != 1)) {
dump_page(page);
ret = -EFAULT;
goto unlock;
}
/* Unmap this page to ensure that nobody can execute it */
flush_cache_page(vma, vaddr, pte_pfn(*ptep));
entry = ptep_clear_flush(vma, vaddr, ptep);
/* Nobody can fault in this page, modify it */
copy_to_page(page, vaddr, &opcode, UPROBE_SWBP_INSN_SIZE);
/* Restore the old mapping */
entry = pte_mkdirty(entry);
flush_icache_page(vma, page);
set_pte_at(mm, vaddr, ptep, entry);
update_mmu_cache(vma, vaddr, ptep);
unlock:
pte_unmap_unlock(ptep, ptlp);
put:
put_page(page);
return ret;
}
^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH 1/1] uprobes: Kill __replace_page(), change uprobe_write_opcode() to rely on gup(WRITE)
2013-12-09 21:18 [PATCH 0/1] uprobes: Kill __replace_page(), change uprobe_write_opcode() to rely on gup(WRITE) Oleg Nesterov
@ 2013-12-09 21:18 ` Oleg Nesterov
2013-12-10 2:08 ` [PATCH 0/1] " Linus Torvalds
1 sibling, 0 replies; 10+ messages in thread
From: Oleg Nesterov @ 2013-12-09 21:18 UTC (permalink / raw)
To: Linus Torvalds, H. Peter Anvin
Cc: Ananth N Mavinakayanahalli, Andi Kleen, Borislav Petkov,
Hugh Dickins, Ingo Molnar, Jiri Kosina, Peter Zijlstra,
Srikar Dronamraju, linux-kernel
__replace_page() logic is nontrivial, it plays with vm internals.
uprobe_write_opcode() can rely on gup(WRITE|FORCE) which should
break the mapping and unshare the page. After that we can unmap
this page temporary to ensure that nobody can access it, modify
its content, and restore the same pte + _PAGE_DIRTY.
This greatly simplifies the code and avoids the extra alloc_page()
if uprobe_write_opcode() was already called in the past.
Note: perhaps we do not need pte games at least on x86 which
always updates a single byte, we can trivially change this code
later to do copy_to_page() + set_page_dirty_lock() after the 2nd
get_user_pages() if CONFIG_X86.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
kernel/events/uprobes.c | 133 +++++++++++++++++-----------------------------
1 files changed, 49 insertions(+), 84 deletions(-)
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 24b7d6c..1bae429 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -33,7 +33,6 @@
#include <linux/swap.h> /* try_to_free_swap */
#include <linux/ptrace.h> /* user_enable_single_step */
#include <linux/kdebug.h> /* notifier mechanism */
-#include "../../mm/internal.h" /* munlock_vma_page */
#include <linux/percpu-rwsem.h>
#include <linux/task_work.h>
@@ -114,65 +113,6 @@ static loff_t vaddr_to_offset(struct vm_area_struct *vma, unsigned long vaddr)
}
/**
- * __replace_page - replace page in vma by new page.
- * based on replace_page in mm/ksm.c
- *
- * @vma: vma that holds the pte pointing to page
- * @addr: address the old @page is mapped at
- * @page: the cowed page we are replacing by kpage
- * @kpage: the modified page we replace page by
- *
- * Returns 0 on success, -EFAULT on failure.
- */
-static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
- struct page *page, struct page *kpage)
-{
- struct mm_struct *mm = vma->vm_mm;
- spinlock_t *ptl;
- pte_t *ptep;
- int err;
- /* For mmu_notifiers */
- const unsigned long mmun_start = addr;
- const unsigned long mmun_end = addr + PAGE_SIZE;
-
- /* For try_to_free_swap() and munlock_vma_page() below */
- lock_page(page);
-
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
- err = -EAGAIN;
- ptep = page_check_address(page, mm, addr, &ptl, 0);
- if (!ptep)
- goto unlock;
-
- get_page(kpage);
- page_add_new_anon_rmap(kpage, vma, addr);
-
- if (!PageAnon(page)) {
- dec_mm_counter(mm, MM_FILEPAGES);
- inc_mm_counter(mm, MM_ANONPAGES);
- }
-
- flush_cache_page(vma, addr, pte_pfn(*ptep));
- ptep_clear_flush(vma, addr, ptep);
- set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
-
- page_remove_rmap(page);
- if (!page_mapped(page))
- try_to_free_swap(page);
- pte_unmap_unlock(ptep, ptl);
-
- if (vma->vm_flags & VM_LOCKED)
- munlock_vma_page(page);
- put_page(page);
-
- err = 0;
- unlock:
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
- unlock_page(page);
- return err;
-}
-
-/**
* is_swbp_insn - check if instruction is breakpoint instruction.
* @insn: instruction to be checked.
* Default implementation of is_swbp_insn
@@ -264,43 +204,68 @@ static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t
int uprobe_write_opcode(struct mm_struct *mm, unsigned long vaddr,
uprobe_opcode_t opcode)
{
- struct page *old_page, *new_page;
+ struct page *page;
struct vm_area_struct *vma;
+ pte_t *ptep, entry;
+ spinlock_t *ptlp;
int ret;
-retry:
/* Read the page with vaddr into memory */
- ret = get_user_pages(NULL, mm, vaddr, 1, 0, 1, &old_page, &vma);
- if (ret <= 0)
+ ret = get_user_pages(NULL, mm, vaddr, 1, 0, 1, &page, NULL);
+ if (ret < 0)
return ret;
- ret = verify_opcode(old_page, vaddr, &opcode);
+ ret = verify_opcode(page, vaddr, &opcode);
if (ret <= 0)
- goto put_old;
+ goto put;
- ret = -ENOMEM;
- new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vaddr);
- if (!new_page)
- goto put_old;
-
- __SetPageUptodate(new_page);
+ retry:
+ put_page(page);
+ /*
+ * Break the mapping unless the page is already anonymous and
+ * unshare the page, see the WARN_ON() below.
+ *
+ * We never write to the VM_SHARED vma, every caller must check
+ * valid_vma(). FOLL_WRITE | FOLL_FORCE should anonymize this
+ * page unless uprobe_write_opcode() was already called in the
+ * past or the application itself did mprotect(PROT_WRITE) and
+ * wrote into this page.
+ *
+ * If it was already anonymous it can be shared due to dup_mm(),
+ * in this case do_wp_page() or do_swap_page() will do another
+ * cow to unshare, so we can safely modify it.
+ */
+ ret = get_user_pages(NULL, mm, vaddr, 1, 1, 1, &page, &vma);
+ if (ret < 0)
+ return ret;
- copy_highpage(new_page, old_page);
- copy_to_page(new_page, vaddr, &opcode, UPROBE_SWBP_INSN_SIZE);
+ ptep = page_check_address(page, mm, vaddr, &ptlp, 0);
+ if (!ptep)
+ goto retry;
- ret = anon_vma_prepare(vma);
- if (ret)
- goto put_new;
+ ret = 0;
+ if (WARN_ON(!PageAnon(page) || page_mapcount(page) != 1)) {
+ dump_page(page);
+ ret = -EFAULT;
+ goto unlock;
+ }
- ret = __replace_page(vma, vaddr, old_page, new_page);
+ /* Unmap this page to ensure that nobody can execute it */
+ flush_cache_page(vma, vaddr, pte_pfn(*ptep));
+ entry = ptep_clear_flush(vma, vaddr, ptep);
-put_new:
- page_cache_release(new_page);
-put_old:
- put_page(old_page);
+ /* Nobody can fault in this page, modify it */
+ copy_to_page(page, vaddr, &opcode, UPROBE_SWBP_INSN_SIZE);
- if (unlikely(ret == -EAGAIN))
- goto retry;
+ /* Restore the old mapping */
+ entry = pte_mkdirty(entry);
+ flush_icache_page(vma, page);
+ set_pte_at(mm, vaddr, ptep, entry);
+ update_mmu_cache(vma, vaddr, ptep);
+ unlock:
+ pte_unmap_unlock(ptep, ptlp);
+ put:
+ put_page(page);
return ret;
}
--
1.5.5.1
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH 0/1] uprobes: Kill __replace_page(), change uprobe_write_opcode() to rely on gup(WRITE)
2013-12-09 21:18 [PATCH 0/1] uprobes: Kill __replace_page(), change uprobe_write_opcode() to rely on gup(WRITE) Oleg Nesterov
2013-12-09 21:18 ` [PATCH 1/1] " Oleg Nesterov
@ 2013-12-10 2:08 ` Linus Torvalds
2013-12-10 19:18 ` Oleg Nesterov
1 sibling, 1 reply; 10+ messages in thread
From: Linus Torvalds @ 2013-12-10 2:08 UTC (permalink / raw)
To: Oleg Nesterov
Cc: H. Peter Anvin, Ananth N Mavinakayanahalli, Andi Kleen,
Borislav Petkov, Hugh Dickins, Ingo Molnar, Jiri Kosina,
Peter Zijlstra, Srikar Dronamraju, Linux Kernel Mailing List
On Mon, Dec 9, 2013 at 1:18 PM, Oleg Nesterov <oleg@redhat.com> wrote:
>
> It is not clear to me if Linus still dislikes this change or not.
> Let me send the patch "officially" so that it can be nacked if I
> misunderstood the result of discussion.
So quite frankly, if the caller guarantees that it has looked up the
vma, and verified it, then I'm ok with it.
But in that case, I think you should pass in the vma as an argument,
and verify it for now. Because quite frankly, the reason I reacted to
this all is that it is NOT AT ALL obvious that the callers actually do
that.
That uprobe_write_opcode() also only works if the instruction is
within a single page, another thing that it doesn't actually check.
And again, it is not at all obvious by looking at the callers that
that check has ever been done. The interface looks like you can just
rewrite an arbitrary byte range using that function.
In other words: looking at that function, my immediate reaction is
still "it's buggy". Why? Because it seems to make assumptions about
the callers that are never checked and are not at all obvious that
they have ever *been* checked. In fact, callers can clearly call that
thing without ever even looking up the vma (which it must to do check
that it's not shared), and we look it up *again* by using that slow
nasty get_user_pages() thing.
Also, quite frankly, I think the routine is still just horrible. If I
read it right, even after that cleanup you do a page table walk
*THREE* times:
- 2x get_user_pages()
- page_check_address()
and that's without the whole "retry" thing. It all seems pretty damn
pointless. Wouldn't it be better to do it *once*, and then the retry
logic is for "oops, we need to cow the page we looked up, so we need
to drop the page table lock and allocate a new page".
So I'd actually prefer this to (a) pass in the vma (to make it obvious
that the caller must look it up!) (b) add a few sanity checks (not
page-crossing) and I think possibly (c) make it a bit smarter still.
Linus
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 0/1] uprobes: Kill __replace_page(), change uprobe_write_opcode() to rely on gup(WRITE)
2013-12-10 2:08 ` [PATCH 0/1] " Linus Torvalds
@ 2013-12-10 19:18 ` Oleg Nesterov
2013-12-10 19:38 ` Linus Torvalds
0 siblings, 1 reply; 10+ messages in thread
From: Oleg Nesterov @ 2013-12-10 19:18 UTC (permalink / raw)
To: Linus Torvalds
Cc: H. Peter Anvin, Ananth N Mavinakayanahalli, Andi Kleen,
Borislav Petkov, Hugh Dickins, Ingo Molnar, Jiri Kosina,
Peter Zijlstra, Srikar Dronamraju, Linux Kernel Mailing List
On 12/09, Linus Torvalds wrote:
>
> On Mon, Dec 9, 2013 at 1:18 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > It is not clear to me if Linus still dislikes this change or not.
> > Let me send the patch "officially" so that it can be nacked if I
> > misunderstood the result of discussion.
>
> So quite frankly, if the caller guarantees that it has looked up the
> vma, and verified it, then I'm ok with it.
>
> But in that case, I think you should pass in the vma as an argument,
> and verify it for now. Because quite frankly, the reason I reacted to
> this all is that it is NOT AT ALL obvious that the callers actually do
> that.
>
> That uprobe_write_opcode() also only works if the instruction is
> within a single page, another thing that it doesn't actually check.
> And again, it is not at all obvious by looking at the callers that
> that check has ever been done. The interface looks like you can just
> rewrite an arbitrary byte range using that function.
OK. I'll update the comment above uprobe_write_opcode() to make this
all clear. It is wrong anyway, ->mmap_sem is held for writing.
> In fact, callers can clearly call that
> thing without ever even looking up the vma (which it must to do check
> that it's not shared), and we look it up *again* by using that slow
> nasty get_user_pages() thing.
Well, if we use get_user_pages() we have vma for free. And this is the
same vma which was verified by the caller who must check valid_vma(),
the caller doesn't drop mmap_sem.
I agree, this should be documented. And yes, we can pass vma instead
of mm, but this needs a separate change. We need to change the signature
of set_swbp() and set_orig_insn(). And the current code also relies on
get_user_pages(&vma) in this sense.
But I am not sure about "and verify it for now" above. Do you mean
that uprobe_write_opcode() should do another valid_vma() ? I don't
think this will look good, unless we put this check into WARN_ON().
Or just verify that the argument matches the vma found by gup() ?
And perhaps we can do this later? This change will conflict with
the pending arm patches (arm reimplements set_swbp() because it
uses another insn != UPROBE_SWBP_INSN depending on original insn).
This doesn't look like a serious problem to me.
> Also, quite frankly, I think the routine is still just horrible.
I agree. I do think it becomes better (if the patch is correct),
just because we do not play with vm internals, at least that much.
> If I
> read it right, even after that cleanup you do a page table walk
> *THREE* times:
>
> - 2x get_user_pages()
Yes. The 1st get_user_pages() + verify_opcode() is only needed to
avoid the unnecessary write (and cow). Without this check, say,
uprobe_unregister() can create the anonymous cow'ed page even if
this task was not probed.
This makes sense, but perhaps we can optimize this somehow, not sure.
But this was not added by the patch.
> - page_check_address()
Yes. The patch adds another page table walk, this is unfortunate.
> and that's without the whole "retry" thing. It all seems pretty damn
> pointless. Wouldn't it be better to do it *once*, and then the retry
> logic is for "oops, we need to cow the page we looked up, so we need
> to drop the page table lock and allocate a new page".
Hmm, OK. I guess we can rely on the same PageAnon() + page_mapcount()
check, just we should not WARN() the first time. Plus we should probably
also check page_swapcount(). But I need to recheck.
Thanks. I'll try to make v2.
Oleg.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 0/1] uprobes: Kill __replace_page(), change uprobe_write_opcode() to rely on gup(WRITE)
2013-12-10 19:18 ` Oleg Nesterov
@ 2013-12-10 19:38 ` Linus Torvalds
2013-12-10 20:04 ` Oleg Nesterov
0 siblings, 1 reply; 10+ messages in thread
From: Linus Torvalds @ 2013-12-10 19:38 UTC (permalink / raw)
To: Oleg Nesterov
Cc: H. Peter Anvin, Ananth N Mavinakayanahalli, Andi Kleen,
Borislav Petkov, Hugh Dickins, Ingo Molnar, Jiri Kosina,
Peter Zijlstra, Srikar Dronamraju, Linux Kernel Mailing List
On Tue, Dec 10, 2013 at 11:18 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>
> Well, if we use get_user_pages() we have vma for free.
No we don't. get_user_pages() is expensive as hell.
We'd be *much* better off using get_user_pages_fast() if possible -
and I bet _is_ possible in 99% of all cases.
So saying that we get vma "for free" is complete BS. We get it "for
very expensive."
> But I am not sure about "and verify it for now" above. Do you mean
> that uprobe_write_opcode() should do another valid_vma() ?
No. I think we should just do something like
if (vma->vm_flags & (VM_SHARED | VM_HUGETLB))
return -EINVAL;
and actually make the logic obvious. We should *also* make sure we do
the right thing for THP, because the VM_HUGETLB check isn't obviously
sufficient. Does the code really do the right thing for a writable
hugepage? I'd like that to be *obvious*.
Quite frankly, I think that adding a few of these kinds of *obvious*
checks and then just looking up the page table _once_ - instead of
three times - would make the code safer and faster.
As it is, it is neither safe nor fast. It *may* be safe if callers do
all the right black magic, but it's by no means obvious.
Linus
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 0/1] uprobes: Kill __replace_page(), change uprobe_write_opcode() to rely on gup(WRITE)
2013-12-10 19:38 ` Linus Torvalds
@ 2013-12-10 20:04 ` Oleg Nesterov
2013-12-10 20:16 ` Linus Torvalds
2013-12-10 20:16 ` Oleg Nesterov
0 siblings, 2 replies; 10+ messages in thread
From: Oleg Nesterov @ 2013-12-10 20:04 UTC (permalink / raw)
To: Linus Torvalds
Cc: H. Peter Anvin, Ananth N Mavinakayanahalli, Andi Kleen,
Borislav Petkov, Hugh Dickins, Ingo Molnar, Jiri Kosina,
Peter Zijlstra, Srikar Dronamraju, Linux Kernel Mailing List
On 12/10, Linus Torvalds wrote:
>
> On Tue, Dec 10, 2013 at 11:18 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > Well, if we use get_user_pages() we have vma for free.
>
> No we don't. get_user_pages() is expensive as hell.
I understand, I said "if we use get_user_pages()".
> We'd be *much* better off using get_user_pages_fast() if possible -
> and I bet _is_ possible in 99% of all cases.
We can't. get_user_pages_fast() takes mmap_sem.
Perhaps we can do something, say add FOLL_FAST_ONLY, but until then
we need the 1st get_user_pages(write => 0) anyway.
> > But I am not sure about "and verify it for now" above. Do you mean
> > that uprobe_write_opcode() should do another valid_vma() ?
>
> No. I think we should just do something like
>
> if (vma->vm_flags & (VM_SHARED | VM_HUGETLB))
> return -EINVAL;
OK, but WARN_ON(VM_SHARED | VM_HUGETLB) looks better imho. Because
this should not happen.
> then just looking up the page table _once_ - instead of
> three times - would make the code safer and faster.
Unlikely we can do this only once... At least I do not see how.
> It *may* be safe if callers do
> all the right black magic, but it's by no means obvious.
I agree, this needs more documentation in any case.
Oleg.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 0/1] uprobes: Kill __replace_page(), change uprobe_write_opcode() to rely on gup(WRITE)
2013-12-10 20:04 ` Oleg Nesterov
@ 2013-12-10 20:16 ` Linus Torvalds
2013-12-10 21:24 ` Oleg Nesterov
2013-12-10 20:16 ` Oleg Nesterov
1 sibling, 1 reply; 10+ messages in thread
From: Linus Torvalds @ 2013-12-10 20:16 UTC (permalink / raw)
To: Oleg Nesterov
Cc: H. Peter Anvin, Ananth N Mavinakayanahalli, Andi Kleen,
Borislav Petkov, Hugh Dickins, Ingo Molnar, Jiri Kosina,
Peter Zijlstra, Srikar Dronamraju, Linux Kernel Mailing List
On Tue, Dec 10, 2013 at 12:04 PM, Oleg Nesterov <oleg@redhat.com> wrote:
>> We'd be *much* better off using get_user_pages_fast() if possible -
>> and I bet _is_ possible in 99% of all cases.
>
> We can't. get_user_pages_fast() takes mmap_sem.
Yeah, and we need to look up the page table entry anyway, so what we
actually want here is just the page table walker, none of the "get
page" crap at all.
So the core function should (I think) just do something like:
- new_page = NULL
- get page table lock
- look up page tables
- if it's dirty and private, or we have a new page for it, replace
the instruction
- drop page table lock
- we're done, free the new page if we didn't use it.
with a retry for the "uhhuh, it's not dirty and private, and we don't
have a new page", so we do
- drop the lock
- use a "get_user()" or something to page it in.
- allocate a new page if necessary
- goto retry
which gets us a single page table walk for the common case, and a
retry for the "uhhuh, we needed to page it in or allocate a new page"
condition.
Put another way: I actually think the existing "__replace_page()" code
is closer to being good than that disgusting uprobe_write_opcode()
function. I think you may be getting rid of the wrong ugly function.
Linus
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 0/1] uprobes: Kill __replace_page(), change uprobe_write_opcode() to rely on gup(WRITE)
2013-12-10 20:04 ` Oleg Nesterov
2013-12-10 20:16 ` Linus Torvalds
@ 2013-12-10 20:16 ` Oleg Nesterov
2013-12-10 20:19 ` Linus Torvalds
1 sibling, 1 reply; 10+ messages in thread
From: Oleg Nesterov @ 2013-12-10 20:16 UTC (permalink / raw)
To: Linus Torvalds
Cc: H. Peter Anvin, Ananth N Mavinakayanahalli, Andi Kleen,
Borislav Petkov, Hugh Dickins, Ingo Molnar, Jiri Kosina,
Peter Zijlstra, Srikar Dronamraju, Linux Kernel Mailing List
On 12/10, Oleg Nesterov wrote:
>
> On 12/10, Linus Torvalds wrote:
> >
> > We'd be *much* better off using get_user_pages_fast() if possible -
> > and I bet _is_ possible in 99% of all cases.
>
> We can't. get_user_pages_fast() takes mmap_sem.
Hmm. I am stupid, there is __get_user_pages_fast(). OK, I'll try
to think more.
Oleg.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 0/1] uprobes: Kill __replace_page(), change uprobe_write_opcode() to rely on gup(WRITE)
2013-12-10 20:16 ` Oleg Nesterov
@ 2013-12-10 20:19 ` Linus Torvalds
0 siblings, 0 replies; 10+ messages in thread
From: Linus Torvalds @ 2013-12-10 20:19 UTC (permalink / raw)
To: Oleg Nesterov
Cc: H. Peter Anvin, Ananth N Mavinakayanahalli, Andi Kleen,
Borislav Petkov, Hugh Dickins, Ingo Molnar, Jiri Kosina,
Peter Zijlstra, Srikar Dronamraju, Linux Kernel Mailing List
On Tue, Dec 10, 2013 at 12:16 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> On 12/10, Oleg Nesterov wrote:
>>
>> On 12/10, Linus Torvalds wrote:
>> >
>> > We'd be *much* better off using get_user_pages_fast() if possible -
>> > and I bet _is_ possible in 99% of all cases.
>>
>> We can't. get_user_pages_fast() takes mmap_sem.
>
> Hmm. I am stupid, there is __get_user_pages_fast(). OK, I'll try
> to think more.
There is indeed the non-locking version, however, since we want to get
the pte-pointer anyway, I really think you were right in complaining
about get_user_pages_fast() in the first place, and we'd be much
better off just doing the page table lookup ourselves. Which we have
to do later *anyway*.
The whole [__]get_user_pages[_fast]() family of functions are for
people who want the *page*. We want more than that. We really want the
page table entry, since we have to potentially change it.
Linus
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 0/1] uprobes: Kill __replace_page(), change uprobe_write_opcode() to rely on gup(WRITE)
2013-12-10 20:16 ` Linus Torvalds
@ 2013-12-10 21:24 ` Oleg Nesterov
0 siblings, 0 replies; 10+ messages in thread
From: Oleg Nesterov @ 2013-12-10 21:24 UTC (permalink / raw)
To: Linus Torvalds
Cc: H. Peter Anvin, Ananth N Mavinakayanahalli, Andi Kleen,
Borislav Petkov, Hugh Dickins, Ingo Molnar, Jiri Kosina,
Peter Zijlstra, Srikar Dronamraju, Linux Kernel Mailing List
On 12/10, Linus Torvalds wrote:
>
> On Tue, Dec 10, 2013 at 12:04 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> >> We'd be *much* better off using get_user_pages_fast() if possible -
> >> and I bet _is_ possible in 99% of all cases.
> >
> > We can't. get_user_pages_fast() takes mmap_sem.
>
> Yeah, and we need to look up the page table entry anyway, so what we
> actually want here is just the page table walker,
Yes, this is clear.
And damn, somehow I forgot that _fast() can (obviously) only use
current->mm.
> none of the "get
> page" crap at all.
>
> So the core function should (I think) just do something like:
I'll try to think, but this is what I actually tried to avoid.
I mean,
> Put another way: I actually think the existing "__replace_page()" code
> is closer to being good than that disgusting uprobe_write_opcode()
> function. I think you may be getting rid of the wrong ugly function.
And perhaps you are right.
But my only motivation was: rely on gup() to simplify this code. Yes,
gup() is slow, but it works and we can avoid the "nontrivial" things
like page_remove_rmap/munlock_vma_page.
If we want to optimize this code, then this patch obviously goes to
the wrong direction, I agree.
Oleg.
^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2013-12-10 21:24 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-12-09 21:18 [PATCH 0/1] uprobes: Kill __replace_page(), change uprobe_write_opcode() to rely on gup(WRITE) Oleg Nesterov
2013-12-09 21:18 ` [PATCH 1/1] " Oleg Nesterov
2013-12-10 2:08 ` [PATCH 0/1] " Linus Torvalds
2013-12-10 19:18 ` Oleg Nesterov
2013-12-10 19:38 ` Linus Torvalds
2013-12-10 20:04 ` Oleg Nesterov
2013-12-10 20:16 ` Linus Torvalds
2013-12-10 21:24 ` Oleg Nesterov
2013-12-10 20:16 ` Oleg Nesterov
2013-12-10 20:19 ` Linus Torvalds
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox