From: Andrew Morton <akpm@linux-foundation.org>
To: Brice Goglin <Brice.Goglin@inria.fr>
Cc: Christoph Lameter <cl@linux-foundation.org>,
LKML <linux-kernel@vger.kernel.org>
Subject: Re: [RFC/PATCH] No get_user/put_user while holding mmap_sem in do_pages_stat?
Date: Sat, 6 Dec 2008 18:50:50 -0800 [thread overview]
Message-ID: <20081206185050.544f67fc.akpm@linux-foundation.org> (raw)
In-Reply-To: <493B3179.7030204@inria.fr>
On Sun, 07 Dec 2008 03:14:17 +0100 Brice Goglin <Brice.Goglin@inria.fr> wrote:
> Hello,
>
> I have been seeing some deadlocks that seem to be related to do_pages_stat()
> page-faulting while holding the mmap_sem. Part of my sys_move_pages() rework
> has been applied to 2.6.28-rc. So do_pages_stat() now gets page addresses
> from user-space (and puts the result back to user-space) while holding the
> mmap_sem for read. If there's a page-fault there, the page-fault handler
> grabs the mmap_sem for read again. But if another thread took it for write
> in the meantime (for instance in mprotect), it deadlocks since rwsem readers
> are blocked if a writer is already waiting.
>
> Reading the archives, I see some similar deadlocks a couple years ago but
> I can't find the final answer/fix. From what I understand, the mmap_sem
> fairness could not be changed easily. So I am not sure whether accessing
> user-space while holding mmap_sem for read is still valid/recommended today.
> But the behavior of do_pages_stat() changed between 2.6.27 and 2.6.28-rc
> because of my patch, and this deadlock seems to be happening for real.
Yes, that's still a bug.
Was lockdep able to tell you about this in any way?
> So I would like to fix this small regression in 2.6.28.
s/small/fairly large/:)
> The patch below seems
> to make my do_pages_stat() deadlock disappear here.
>
> Brice
>
>
>
> [PATCH] No get_user/put_user while holding mmap_sem in do_pages_stat
>
> Since commit 2f007e74bb85b9fc4eab28524052161703300f1a, do_pages_stat()
> gets the page address from user-space and puts the corresponding status
> back while holding the mmap_sem for read. There is no need to hold
> mmap_sem there while some page-faults may occur.
>
> This patch adds a temporary address and status buffer so as to only hold
> mmap_sem while working on these kernel buffers. This is implemented by
> extracting do_pages_stat_array() out of do_pages_stat().
>
> Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 1e0d6b2..4350101 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -987,25 +987,18 @@ out:
> /*
> * Determine the nodes of an array of pages and store it in an array of status.
> */
> -static int do_pages_stat(struct mm_struct *mm, unsigned long nr_pages,
> - const void __user * __user *pages,
> - int __user *status)
> +static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
> + const void * __user *pages, int *status)
> {
> unsigned long i;
> - int err;
>
> down_read(&mm->mmap_sem);
>
> - for (i = 0; i < nr_pages; i++) {
> - const void __user *p;
> - unsigned long addr;
> + for (i = 0; i < nr_pages; i++, pages++, status++) {
> + unsigned long addr = (unsigned long)(*pages);
Directly dereferencing a user pointer is very bad. Fortunately, it's
just that the above __user annotation is now wrong.
> struct vm_area_struct *vma;
> struct page *page;
> -
> - err = -EFAULT;
> - if (get_user(p, pages+i))
> - goto out;
> - addr = (unsigned long) p;
> + int err;
>
> vma = find_vma(mm, addr);
> if (!vma)
> @@ -1024,12 +1017,59 @@ static int do_pages_stat(struct mm_struct *mm, unsigned long nr_pages,
>
> err = page_to_nid(page);
> set_status:
> - put_user(err, status+i);
> + *status = err;
> + }
> +
> + up_read(&mm->mmap_sem);
> +}
> +
> +/*
> + * Determine the nodes of a user array of pages and store it in
> + * a user array of status.
> + */
> +static int do_pages_stat(struct mm_struct *mm, unsigned long nr_pages,
> + const void __user * __user *pages,
> + int __user *status)
> +{
> + const void * __user *chunk_pages;
This is not a userspace pointer.
> + int *chunk_status;
> + unsigned long i,chunk_nr;
> + int err;
> +
> + err = -ENOMEM;
> + chunk_pages = (const void * __user *)__get_free_page(GFP_KERNEL);
> + if (!chunk_pages)
> + goto out;
> + chunk_status = (int *)__get_free_page(GFP_KERNEL);
> + if (!chunk_status)
> + goto out_with_chunk_pages;
Given that this code use to perform acceptably (presumably) doing one
page at a time, I suspect that you could have retained that behaviour,
avoiding the complexity of those two arrays. Would an additional
down_read()/up_read() per page have been unacceptably costly?
The arrays could have been allocated on the stack, I expect. 16 slots
is enough?
> + chunk_nr = PAGE_SIZE/max(sizeof(*chunk_pages), sizeof(*chunk_status));
> + for(i = 0; i < nr_pages; i += chunk_nr, pages += chunk_nr, status += chunk_nr) {
Please try to make this more checkpatch-friendly. Moving the
alteration of `pages' and `status' to the end of the loop would fix
that, and would result in clearer (IMO) code.
And simply using pages[chunk_nr] everywhere would clean stuff up (IMO).
> + if (chunk_nr + i > nr_pages)
> + chunk_nr = nr_pages - i;
> +
> + err = copy_from_user(chunk_pages, pages, chunk_nr * sizeof(*chunk_pages));
> + if (err) {
> + err = -EFAULT;
> + goto out_with_chunk_status;
> + }
> +
> + do_pages_stat_array(mm, chunk_nr, chunk_pages, chunk_status);
> +
> + err = copy_to_user(status, chunk_status, chunk_nr * sizeof(*chunk_status));
> + if (err) {
> + err = -EFAULT;
> + goto out_with_chunk_status;
> + }
> }
> err = 0;
>
> +out_with_chunk_status:
> + free_page((unsigned long)chunk_status);
> +out_with_chunk_pages:
> + free_page((unsigned long)chunk_pages);
> out:
> - up_read(&mm->mmap_sem);
> return err;
> }
It's a small semantic change, isn't it? We release the semaphore in
the middle of the operation, thus presenting possibly
non-internally-consistent results to userspace. Why does this not matter?
Given the number of __user errors this patch added, I'd recommend that
v2 be checked with sparse, please.
next prev parent reply other threads:[~2008-12-07 2:51 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2008-12-07 2:14 [RFC/PATCH] No get_user/put_user while holding mmap_sem in do_pages_stat? Brice Goglin
2008-12-07 2:50 ` Andrew Morton [this message]
2008-12-07 14:21 ` Brice Goglin
2008-12-09 14:19 ` Christoph Lameter
2008-12-09 16:35 ` Peter Zijlstra
2008-12-09 16:47 ` Brice Goglin
2008-12-09 17:46 ` Peter Zijlstra
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20081206185050.544f67fc.akpm@linux-foundation.org \
--to=akpm@linux-foundation.org \
--cc=Brice.Goglin@inria.fr \
--cc=cl@linux-foundation.org \
--cc=linux-kernel@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.