* [RFC/PATCH] No get_user/put_user while holding mmap_sem in do_pages_stat? @ 2008-12-07 2:14 Brice Goglin 2008-12-07 2:50 ` Andrew Morton 0 siblings, 1 reply; 7+ messages in thread From: Brice Goglin @ 2008-12-07 2:14 UTC (permalink / raw) To: Andrew Morton, Christoph Lameter; +Cc: LKML Hello, I have been seeing some deadlocks that seem to be related to do_pages_stat() page-faulting while holding the mmap_sem. Part of my sys_move_pages() rework has been applied to 2.6.28-rc. So do_pages_stat() now gets page addresses from user-space (and puts the result back to user-space) while holding the mmap_sem for read. If there's a page-fault there, the page-fault handler grabs the mmap_sem for read again. But if another thread took it for write in the meantime (for instance in mprotect), it deadlocks since rwsem readers are blocked if a writer is already waiting. Reading the archives, I see some similar deadlocks a couple years ago but I can't find the final answer/fix. From what I understand, the mmap_sem fairness could not be changed easily. So I am not sure whether accessing user-space while holding mmap_sem for read is still valid/recommended today. But the behavior of do_pages_stat() changed between 2.6.27 and 2.6.28-rc because of my patch, and this deadlock seems to be happening for real. So I would like to fix this small regression in 2.6.28. The patch below seems to make my do_pages_stat() deadlock disappear here. Brice [PATCH] No get_user/put_user while holding mmap_sem in do_pages_stat Since commit 2f007e74bb85b9fc4eab28524052161703300f1a, do_pages_stat() gets the page address from user-space and puts the corresponding status back while holding the mmap_sem for read. There is no need to hold mmap_sem there while some page-faults may occur. This patch adds a temporary address and status buffer so as to only hold mmap_sem while working on these kernel buffers. This is implemented by extracting do_pages_stat_array() out of do_pages_stat(). Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr> diff --git a/mm/migrate.c b/mm/migrate.c index 1e0d6b2..4350101 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -987,25 +987,18 @@ out: /* * Determine the nodes of an array of pages and store it in an array of status. */ -static int do_pages_stat(struct mm_struct *mm, unsigned long nr_pages, - const void __user * __user *pages, - int __user *status) +static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages, + const void * __user *pages, int *status) { unsigned long i; - int err; down_read(&mm->mmap_sem); - for (i = 0; i < nr_pages; i++) { - const void __user *p; - unsigned long addr; + for (i = 0; i < nr_pages; i++, pages++, status++) { + unsigned long addr = (unsigned long)(*pages); struct vm_area_struct *vma; struct page *page; - - err = -EFAULT; - if (get_user(p, pages+i)) - goto out; - addr = (unsigned long) p; + int err; vma = find_vma(mm, addr); if (!vma) @@ -1024,12 +1017,59 @@ static int do_pages_stat(struct mm_struct *mm, unsigned long nr_pages, err = page_to_nid(page); set_status: - put_user(err, status+i); + *status = err; + } + + up_read(&mm->mmap_sem); +} + +/* + * Determine the nodes of a user array of pages and store it in + * a user array of status. + */ +static int do_pages_stat(struct mm_struct *mm, unsigned long nr_pages, + const void __user * __user *pages, + int __user *status) +{ + const void * __user *chunk_pages; + int *chunk_status; + unsigned long i,chunk_nr; + int err; + + err = -ENOMEM; + chunk_pages = (const void * __user *)__get_free_page(GFP_KERNEL); + if (!chunk_pages) + goto out; + chunk_status = (int *)__get_free_page(GFP_KERNEL); + if (!chunk_status) + goto out_with_chunk_pages; + + chunk_nr = PAGE_SIZE/max(sizeof(*chunk_pages), sizeof(*chunk_status)); + for(i = 0; i < nr_pages; i += chunk_nr, pages += chunk_nr, status += chunk_nr) { + if (chunk_nr + i > nr_pages) + chunk_nr = nr_pages - i; + + err = copy_from_user(chunk_pages, pages, chunk_nr * sizeof(*chunk_pages)); + if (err) { + err = -EFAULT; + goto out_with_chunk_status; + } + + do_pages_stat_array(mm, chunk_nr, chunk_pages, chunk_status); + + err = copy_to_user(status, chunk_status, chunk_nr * sizeof(*chunk_status)); + if (err) { + err = -EFAULT; + goto out_with_chunk_status; + } } err = 0; +out_with_chunk_status: + free_page((unsigned long)chunk_status); +out_with_chunk_pages: + free_page((unsigned long)chunk_pages); out: - up_read(&mm->mmap_sem); return err; } ^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [RFC/PATCH] No get_user/put_user while holding mmap_sem in do_pages_stat? 2008-12-07 2:14 [RFC/PATCH] No get_user/put_user while holding mmap_sem in do_pages_stat? Brice Goglin @ 2008-12-07 2:50 ` Andrew Morton 2008-12-07 14:21 ` Brice Goglin 0 siblings, 1 reply; 7+ messages in thread From: Andrew Morton @ 2008-12-07 2:50 UTC (permalink / raw) To: Brice Goglin; +Cc: Christoph Lameter, LKML On Sun, 07 Dec 2008 03:14:17 +0100 Brice Goglin <Brice.Goglin@inria.fr> wrote: > Hello, > > I have been seeing some deadlocks that seem to be related to do_pages_stat() > page-faulting while holding the mmap_sem. Part of my sys_move_pages() rework > has been applied to 2.6.28-rc. So do_pages_stat() now gets page addresses > from user-space (and puts the result back to user-space) while holding the > mmap_sem for read. If there's a page-fault there, the page-fault handler > grabs the mmap_sem for read again. But if another thread took it for write > in the meantime (for instance in mprotect), it deadlocks since rwsem readers > are blocked if a writer is already waiting. > > Reading the archives, I see some similar deadlocks a couple years ago but > I can't find the final answer/fix. From what I understand, the mmap_sem > fairness could not be changed easily. So I am not sure whether accessing > user-space while holding mmap_sem for read is still valid/recommended today. > But the behavior of do_pages_stat() changed between 2.6.27 and 2.6.28-rc > because of my patch, and this deadlock seems to be happening for real. Yes, that's still a bug. Was lockdep able to tell you about this in any way? > So I would like to fix this small regression in 2.6.28. s/small/fairly large/:) > The patch below seems > to make my do_pages_stat() deadlock disappear here. > > Brice > > > > [PATCH] No get_user/put_user while holding mmap_sem in do_pages_stat > > Since commit 2f007e74bb85b9fc4eab28524052161703300f1a, do_pages_stat() > gets the page address from user-space and puts the corresponding status > back while holding the mmap_sem for read. There is no need to hold > mmap_sem there while some page-faults may occur. > > This patch adds a temporary address and status buffer so as to only hold > mmap_sem while working on these kernel buffers. This is implemented by > extracting do_pages_stat_array() out of do_pages_stat(). > > Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr> > > diff --git a/mm/migrate.c b/mm/migrate.c > index 1e0d6b2..4350101 100644 > --- a/mm/migrate.c > +++ b/mm/migrate.c > @@ -987,25 +987,18 @@ out: > /* > * Determine the nodes of an array of pages and store it in an array of status. > */ > -static int do_pages_stat(struct mm_struct *mm, unsigned long nr_pages, > - const void __user * __user *pages, > - int __user *status) > +static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages, > + const void * __user *pages, int *status) > { > unsigned long i; > - int err; > > down_read(&mm->mmap_sem); > > - for (i = 0; i < nr_pages; i++) { > - const void __user *p; > - unsigned long addr; > + for (i = 0; i < nr_pages; i++, pages++, status++) { > + unsigned long addr = (unsigned long)(*pages); Directly dereferencing a user pointer is very bad. Fortunately, it's just that the above __user annotation is now wrong. > struct vm_area_struct *vma; > struct page *page; > - > - err = -EFAULT; > - if (get_user(p, pages+i)) > - goto out; > - addr = (unsigned long) p; > + int err; > > vma = find_vma(mm, addr); > if (!vma) > @@ -1024,12 +1017,59 @@ static int do_pages_stat(struct mm_struct *mm, unsigned long nr_pages, > > err = page_to_nid(page); > set_status: > - put_user(err, status+i); > + *status = err; > + } > + > + up_read(&mm->mmap_sem); > +} > + > +/* > + * Determine the nodes of a user array of pages and store it in > + * a user array of status. > + */ > +static int do_pages_stat(struct mm_struct *mm, unsigned long nr_pages, > + const void __user * __user *pages, > + int __user *status) > +{ > + const void * __user *chunk_pages; This is not a userspace pointer. > + int *chunk_status; > + unsigned long i,chunk_nr; > + int err; > + > + err = -ENOMEM; > + chunk_pages = (const void * __user *)__get_free_page(GFP_KERNEL); > + if (!chunk_pages) > + goto out; > + chunk_status = (int *)__get_free_page(GFP_KERNEL); > + if (!chunk_status) > + goto out_with_chunk_pages; Given that this code use to perform acceptably (presumably) doing one page at a time, I suspect that you could have retained that behaviour, avoiding the complexity of those two arrays. Would an additional down_read()/up_read() per page have been unacceptably costly? The arrays could have been allocated on the stack, I expect. 16 slots is enough? > + chunk_nr = PAGE_SIZE/max(sizeof(*chunk_pages), sizeof(*chunk_status)); > + for(i = 0; i < nr_pages; i += chunk_nr, pages += chunk_nr, status += chunk_nr) { Please try to make this more checkpatch-friendly. Moving the alteration of `pages' and `status' to the end of the loop would fix that, and would result in clearer (IMO) code. And simply using pages[chunk_nr] everywhere would clean stuff up (IMO). > + if (chunk_nr + i > nr_pages) > + chunk_nr = nr_pages - i; > + > + err = copy_from_user(chunk_pages, pages, chunk_nr * sizeof(*chunk_pages)); > + if (err) { > + err = -EFAULT; > + goto out_with_chunk_status; > + } > + > + do_pages_stat_array(mm, chunk_nr, chunk_pages, chunk_status); > + > + err = copy_to_user(status, chunk_status, chunk_nr * sizeof(*chunk_status)); > + if (err) { > + err = -EFAULT; > + goto out_with_chunk_status; > + } > } > err = 0; > > +out_with_chunk_status: > + free_page((unsigned long)chunk_status); > +out_with_chunk_pages: > + free_page((unsigned long)chunk_pages); > out: > - up_read(&mm->mmap_sem); > return err; > } It's a small semantic change, isn't it? We release the semaphore in the middle of the operation, thus presenting possibly non-internally-consistent results to userspace. Why does this not matter? Given the number of __user errors this patch added, I'd recommend that v2 be checked with sparse, please. ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [RFC/PATCH] No get_user/put_user while holding mmap_sem in do_pages_stat? 2008-12-07 2:50 ` Andrew Morton @ 2008-12-07 14:21 ` Brice Goglin 2008-12-09 14:19 ` Christoph Lameter 2008-12-09 16:35 ` Peter Zijlstra 0 siblings, 2 replies; 7+ messages in thread From: Brice Goglin @ 2008-12-07 14:21 UTC (permalink / raw) To: Andrew Morton; +Cc: Christoph Lameter, LKML Andrew Morton wrote: > Was lockdep able to tell you about this in any way? > With CONFIG_PROVE_LOCKING (assuming that it's enough), it doesn't detect the problem for real. It just says "possible recursive locking detected" between do_page_fault and sys_move_pages. I actually understood the problem after hitting sysrq-t and always getting the following backtraces: * thread 1 [<ffffffff80430cb8>] __down_write_nested+0x9c/0xb4 [<ffffffff8028c2c4>] sys_mprotect+0xcc/0x230 * thread 2 [<ffffffff80430d73>] __down_read+0x9c/0xb4 [<ffffffff80223f5c>] do_page_fault+0x551/0x9d9 [...] [<ffffffff8043111a>] error_exit+0x0/0x70 [<ffffffff802f5bec>] cap_task_movememory+0x0/0x3 [<ffffffff8032b99d>] __put_user_4+0x1d/0x30 [<ffffffff802a091f>] sys_move_pages+0x453/0x4c0 > Directly dereferencing a user pointer is very bad. Fortunately, it's > just that the above __user annotation is now wrong. > Sorry, I totally messed up my annotations. I fixed up the annotations and applied your other changes. The new patch (below) makes sparse and checkpatch.pl happy. > Given that this code use to perform acceptably (presumably) doing one > page at a time, I suspect that you could have retained that behaviour, > avoiding the complexity of those two arrays. Would an additional > down_read()/up_read() per page have been unacceptably costly? > > The arrays could have been allocated on the stack, I expect. 16 slots > is enough? > I thought down/up_read would be more expensive but it does not seem to be that bad: * original code: do_pages_stat stats pages at 30-45GB/s (depending on buffer size) on a quad-quad-core 1.9GHz opteron * if moving down/up inside the main loop in do_pages_stat() so as to down/up once per page, we get to 20-30GB/s * if working on __get_free_page-based arrays (my first patch), we reach 35GB/s for large buffers (32, 128MB, or so), but it is very slow for small buffers (about 500MB/s for 32kB). * if allocating 16-slots arrays on the stack, we get back to 30-45GB/s (new patch below) So 16-slots on the stack looks like a good code-complexity/performance compromise to me. do_pages_stat() performance doesn't look critical anyway, as long as it's not very slow. > It's a small semantic change, isn't it? We release the semaphore in > the middle of the operation, thus presenting possibly > non-internally-consistent results to userspace. Why does this not matter? > Unless I am mistaken, both do_pages_stat() and do_pages_move() only take the mmap_sem for read, they can happen concurrently. I think you never had any guarantee such as migrating in one thread being seen as "atomic" from another thread stat'ing the same buffer. By the way, in 2.6.29, sys_move_pages() will release the semaphore in the middle of page migration as well. My rework patch removed the need to vmalloc a huge array, get_user everything in there, and then migrate all pages at once. The new code will get_user a page-size array, migrate it, and switch to the next slots (releasing the semaphore in-between). If we look at do_pages_stat() "atomicity" versus another operation that takes the mmap_sem for write, I don't see any actual problem. Either do_pages_stat() doesn't care much about the operation occuring while it released the mmap_sem (mprotect or so), or user-threads should be synchronized instead of stat'ing some VMAs that are removed/reduced in parallel (mmap/munmap/...). Brice [PATCH] No get_user/put_user while holding mmap_sem in do_pages_stat Since commit 2f007e74bb85b9fc4eab28524052161703300f1a, do_pages_stat() gets the page address from user-space and puts the corresponding status back while holding the mmap_sem for read. There is no need to hold mmap_sem there while some page-faults may occur. This patch adds a temporary address and status buffer so as to only hold mmap_sem while working on these kernel buffers. This is implemented by extracting do_pages_stat_array() out of do_pages_stat(). diff --git a/mm/migrate.c b/mm/migrate.c index ded190d..bc2c773 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1018,25 +1018,18 @@ out: /* * Determine the nodes of an array of pages and store it in an array of status. */ -static int do_pages_stat(struct mm_struct *mm, unsigned long nr_pages, - const void __user * __user *pages, - int __user *status) +static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages, + const void __user **pages, int *status) { unsigned long i; - int err; down_read(&mm->mmap_sem); for (i = 0; i < nr_pages; i++) { - const void __user *p; - unsigned long addr; + unsigned long addr = (unsigned long)(*pages); struct vm_area_struct *vma; struct page *page; - - err = -EFAULT; - if (get_user(p, pages+i)) - goto out; - addr = (unsigned long) p; + int err; vma = find_vma(mm, addr); if (!vma) @@ -1055,12 +1048,52 @@ static int do_pages_stat(struct mm_struct *mm, unsigned long nr_pages, err = page_to_nid(page); set_status: - put_user(err, status+i); + *status = err; + + pages++; + status++; + } + + up_read(&mm->mmap_sem); +} + +/* + * Determine the nodes of a user array of pages and store it in + * a user array of status. + */ +static int do_pages_stat(struct mm_struct *mm, unsigned long nr_pages, + const void __user * __user *pages, + int __user *status) +{ +#define DO_PAGES_STAT_CHUNK_NR 16 + const void __user *chunk_pages[DO_PAGES_STAT_CHUNK_NR]; + int chunk_status[DO_PAGES_STAT_CHUNK_NR]; + unsigned long i, chunk_nr = DO_PAGES_STAT_CHUNK_NR; + int err; + + for (i = 0; i < nr_pages; i += chunk_nr) { + if (chunk_nr + i > nr_pages) + chunk_nr = nr_pages - i; + + err = copy_from_user(chunk_pages, &pages[i], + chunk_nr * sizeof(*chunk_pages)); + if (err) { + err = -EFAULT; + goto out; + } + + do_pages_stat_array(mm, chunk_nr, chunk_pages, chunk_status); + + err = copy_to_user(&status[i], chunk_status, + chunk_nr * sizeof(*chunk_status)); + if (err) { + err = -EFAULT; + goto out; + } } err = 0; out: - up_read(&mm->mmap_sem); return err; } ^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [RFC/PATCH] No get_user/put_user while holding mmap_sem in do_pages_stat? 2008-12-07 14:21 ` Brice Goglin @ 2008-12-09 14:19 ` Christoph Lameter 2008-12-09 16:35 ` Peter Zijlstra 1 sibling, 0 replies; 7+ messages in thread From: Christoph Lameter @ 2008-12-09 14:19 UTC (permalink / raw) To: Brice Goglin; +Cc: Andrew Morton, LKML I guess the simplest solution would be to move the taking of mmap_sem into the loop. Mean taking mmap_sem for every page that we determine the status of. Signed-off-by: Christoph Lameter <clameter@sgi.com> Index: linux-2.6/mm/migrate.c =================================================================== --- linux-2.6.orig/mm/migrate.c 2008-12-09 08:07:08.796603952 -0600 +++ linux-2.6/mm/migrate.c 2008-12-09 08:08:36.400116263 -0600 @@ -994,8 +994,6 @@ unsigned long i; int err; - down_read(&mm->mmap_sem); - for (i = 0; i < nr_pages; i++) { const void __user *p; unsigned long addr; @@ -1007,12 +1005,17 @@ goto out; addr = (unsigned long) p; + down_read(&mm->mmap_sem); + vma = find_vma(mm, addr); - if (!vma) + if (!vma) { + up_read(&mm->mmap_sem); goto set_status; - + } page = follow_page(vma, addr, 0); + up_read(&mm->mmap_sem); + err = PTR_ERR(page); if (IS_ERR(page)) goto set_status; @@ -1029,7 +1032,6 @@ err = 0; out: - up_read(&mm->mmap_sem); return err; } ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [RFC/PATCH] No get_user/put_user while holding mmap_sem in do_pages_stat? 2008-12-07 14:21 ` Brice Goglin 2008-12-09 14:19 ` Christoph Lameter @ 2008-12-09 16:35 ` Peter Zijlstra 2008-12-09 16:47 ` Brice Goglin 1 sibling, 1 reply; 7+ messages in thread From: Peter Zijlstra @ 2008-12-09 16:35 UTC (permalink / raw) To: Brice Goglin; +Cc: Andrew Morton, Christoph Lameter, LKML On Sun, 2008-12-07 at 15:21 +0100, Brice Goglin wrote: > Andrew Morton wrote: > > Was lockdep able to tell you about this in any way? > > > > With CONFIG_PROVE_LOCKING (assuming that it's enough), it doesn't detect > the problem for real. It just says "possible recursive locking detected" > between do_page_fault and sys_move_pages. That is real - how much more real do you need a description of a recursive deadlock to be? > I actually understood the > problem after hitting sysrq-t and always getting the following backtraces: > * thread 1 > [<ffffffff80430cb8>] __down_write_nested+0x9c/0xb4 > [<ffffffff8028c2c4>] sys_mprotect+0xcc/0x230 > * thread 2 > [<ffffffff80430d73>] __down_read+0x9c/0xb4 > [<ffffffff80223f5c>] do_page_fault+0x551/0x9d9 > [...] > [<ffffffff8043111a>] error_exit+0x0/0x70 > [<ffffffff802f5bec>] cap_task_movememory+0x0/0x3 > [<ffffffff8032b99d>] __put_user_4+0x1d/0x30 > [<ffffffff802a091f>] sys_move_pages+0x453/0x4c0 ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [RFC/PATCH] No get_user/put_user while holding mmap_sem in do_pages_stat? 2008-12-09 16:35 ` Peter Zijlstra @ 2008-12-09 16:47 ` Brice Goglin 2008-12-09 17:46 ` Peter Zijlstra 0 siblings, 1 reply; 7+ messages in thread From: Brice Goglin @ 2008-12-09 16:47 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Andrew Morton, Christoph Lameter, LKML Peter Zijlstra wrote: > On Sun, 2008-12-07 at 15:21 +0100, Brice Goglin wrote: > >> Andrew Morton wrote: >> >>> Was lockdep able to tell you about this in any way? >>> >>> >> With CONFIG_PROVE_LOCKING (assuming that it's enough), it doesn't detect >> the problem for real. It just says "possible recursive locking detected" >> between do_page_fault and sys_move_pages. >> > > That is real - how much more real do you need a description of a > recursive deadlock to be? > Well, it's a recursive down_read. It could be ok if we had the guarantee that nobody else would be doing down_write in the middle. lockdep only complained about this recursive down_read when there was a down_write actually causing the deadlock, but it didn't say anything about this down_write in the log. It would be great if lockdep could say "recursive read-lock is deadlocking because this other guy (with its backtrace) took for write in the middle". I needed sysrq-t to get this info. Brice ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [RFC/PATCH] No get_user/put_user while holding mmap_sem in do_pages_stat? 2008-12-09 16:47 ` Brice Goglin @ 2008-12-09 17:46 ` Peter Zijlstra 0 siblings, 0 replies; 7+ messages in thread From: Peter Zijlstra @ 2008-12-09 17:46 UTC (permalink / raw) To: Brice Goglin; +Cc: Andrew Morton, Christoph Lameter, LKML On Tue, 2008-12-09 at 17:47 +0100, Brice Goglin wrote: > Peter Zijlstra wrote: > > On Sun, 2008-12-07 at 15:21 +0100, Brice Goglin wrote: > > > >> Andrew Morton wrote: > >> > >>> Was lockdep able to tell you about this in any way? > >>> > >>> > >> With CONFIG_PROVE_LOCKING (assuming that it's enough), it doesn't detect > >> the problem for real. It just says "possible recursive locking detected" > >> between do_page_fault and sys_move_pages. > >> > > > > That is real - how much more real do you need a description of a > > recursive deadlock to be? > > > > Well, it's a recursive down_read. It could be ok if we had the guarantee > that nobody else would be doing down_write in the middle. lockdep only > complained about this recursive down_read when there was a down_write > actually causing the deadlock, but it didn't say anything about this > down_write in the log. > > It would be great if lockdep could say "recursive read-lock is > deadlocking because this other guy (with its backtrace) took for write > in the middle". I needed sysrq-t to get this info. rwsem does not support recursive read-locks, so irrespective of write side locks, its a bug. ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2008-12-09 17:46 UTC | newest] Thread overview: 7+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2008-12-07 2:14 [RFC/PATCH] No get_user/put_user while holding mmap_sem in do_pages_stat? Brice Goglin 2008-12-07 2:50 ` Andrew Morton 2008-12-07 14:21 ` Brice Goglin 2008-12-09 14:19 ` Christoph Lameter 2008-12-09 16:35 ` Peter Zijlstra 2008-12-09 16:47 ` Brice Goglin 2008-12-09 17:46 ` Peter Zijlstra
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox