* [PATCH] proc: pagemap: Hold mmap_sem during page walk
@ 2010-03-31 17:23 San Mehat
  2010-03-31 17:54 ` Linus Torvalds
  0 siblings, 1 reply; 18+ messages in thread
From: San Mehat @ 2010-03-31 17:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: San Mehat, Brian Swetland, Matt Mackall, Dave Hansen,
	Andrew Morton, Linus Torvalds

If the mmap_sem is not held while we walk_page_range(), then
it is possible for find_vma() to race with a remove_vma_list()
caused by do_munmap() (or others).

Unable to handle kernel paging request at virtual address 6b6b6b5b
Internal error: Oops: 5 [#1] PREEMPT
CPU: 0    Not tainted  (2.6.32.9-27154-ge3e6e27 #1)
PC is at find_vma+0x40/0x7c
LR is at walk_page_range+0x70/0x230
pc : [<c00aa3ac>]    lr : [<c00b298c>]    psr: 20000013
sp : c6aa9eb8  ip : 6b6b6b53  fp : c6a58f60
r10: c7e1d1b8  r9 : 0001bca0  r8 : 47000000
r7 : c6aa9f80  r6 : c6aa8000  r5 : 46fbd000  r4 : 6b6b6b6b
r3 : c7ca4820  r2 : 6b6b6b6b  r1 : 46fbd000  r0 : c70e3e40
Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
Control: 10c5787d  Table: 26574019  DAC: 00000015
[<c00aa3ac>] (find_vma+0x40/0x7c) from [<c00b298c>] (walk_page_range+0x70/0x230)
[<c00b298c>] (walk_page_range+0x70/0x230) from [<c00f5d3c>] (pagemap_read+0x1a4/0x278)
[<c00f5d3c>] (pagemap_read+0x1a4/0x278) from [<c00bac40>] (vfs_read+0xa8/0x150)
[<c00bac40>] (vfs_read+0xa8/0x150) from [<c00bad94>] (sys_read+0x3c/0x68)
[<c00bad94>] (sys_read+0x3c/0x68) from [<c0026f00>] (ret_fast_syscall+0x0/0x2c)
Code: 98bd8010 e5932004 e3a00000 ea000008 (e5124010)

Signed-off-by: San Mehat <san@google.com>
Cc: Brian Swetland <swetland@google.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
---
 fs/proc/task_mmu.c |    3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 2a1bef9..3f300c1 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -726,8 +726,6 @@
static ssize_t pagemap_read(struct file *file, char __user *buf,
 	down_read(&current->mm->mmap_sem);
 	ret = get_user_pages(current, current->mm, uaddr, pagecount,
 				1, 0, pages, NULL);
-	up_read(&current->mm->mmap_sem);
-
 	if (ret < 0)
 		goto out_free;
@@ -776,6 +774,7 @@ out_pages:
 		page_cache_release(page);
 	}
 out_free:
+	up_read(&current->mm->mmap_sem);
 	kfree(pages);
 out_mm:
 	mmput(mm);
--
1.7.0.1

^ permalink raw reply related	[flat|nested] 18+ messages in thread
* Re: [PATCH] proc: pagemap: Hold mmap_sem during page walk
  2010-03-31 17:23 [PATCH] proc: pagemap: Hold mmap_sem during page walk San Mehat
@ 2010-03-31 17:54 ` Linus Torvalds
  2010-03-31 21:40   ` Matt Mackall
  0 siblings, 1 reply; 18+ messages in thread
From: Linus Torvalds @ 2010-03-31 17:54 UTC (permalink / raw)
  To: San Mehat
  Cc: linux-kernel, Brian Swetland, Matt Mackall, Dave Hansen,
	Andrew Morton

On Wed, 31 Mar 2010, San Mehat wrote:
>
> If the mmap_sem is not held while we walk_page_range(), then
> it is possible for find_vma() to race with a remove_vma_list()
> caused by do_munmap() (or others).

I think you've found a bug, but I also look at that code and say "that's
just totally insane".

Why does it do that initial "get_user_pages()" at all? It never _uses_
that 'pages' array except to mark the pages dirty, but that's insane,
since as far as I can see the way it actually dirties the pages in
question is by doing a regular "put_user(pfn, pm->out);". And that will
dirty the pages in hardware (or put_user).

Also, I get the feeling that the _reason_ it is not doing that down_read()
is that it would dead-lock the whole system, exactly on that "put_user()",
if somebody else did a down_write() in another thread. In that case you
have:

	thread#1		thread#2
	--------		--------

	down_read()
	...
				down_write() - blocks
	...
	put_user();
	 .. page fault ..
	 down_read();  **DEADLOCK**

because our down_read() tries to be fair to the down_write().

So I think your patch would just create _different_ trouble.

I get the _feeling_ that the whole point of that 'pages' array was to not
do that put_user() at all, but write to the physical pages through that
array. But the code looks totally buggy.

I would seriously suggest that we consider removing the 'pagemap'
interface. The way that code looks, it's just broken.

Matt - give me a reason (which includes either a patch to fix this sh*t up
or telling me why I'm wrong, but _also_ includes a real independent reason
to keep that thing around regardless) to not remove it all.

The whole notion seems to be utterly misdesigned.

		Linus

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [PATCH] proc: pagemap: Hold mmap_sem during page walk 2010-03-31 17:54 ` Linus Torvalds @ 2010-03-31 21:40 ` Matt Mackall 2010-04-01 1:33 ` Linus Torvalds 0 siblings, 1 reply; 18+ messages in thread From: Matt Mackall @ 2010-03-31 21:40 UTC (permalink / raw) To: Linus Torvalds Cc: San Mehat, linux-kernel, Brian Swetland, Dave Hansen, Andrew Morton On Wed, 2010-03-31 at 10:54 -0700, Linus Torvalds wrote: > > On Wed, 31 Mar 2010, San Mehat wrote: > > > > If the mmap_sem is not held while we walk_page_range(), then > > it is possible for find_vma() to race with a remove_vma_list() > > caused by do_munmap() (or others). > > I think you've found a bug, but I also look at that code and say "that's > just totally insane". > > Why does it do that initial "get_user_pages()" at all? It never _uses_ > that 'pages' array except to mark the pages dirty, but that's insane, > since as far as I can see the way it actually dirties the pages in > question is by doing a regular "put_user(pfn, pm->out);". And that will > dirty the pages in hardware (or put_user). > > Also, I get the feeling that the _reason_ it is not doing that down_read() > is that it would dead-lock the whole system, exactly on that "put_user()", > if somebody else did a down_write() in another thread. In that case you > have: > > thread#1 thread#2 > -------- -------- > > down_read() > ... > down_write() - blocks > ... > put_user(); > .. page fault .. > down_read(); **DEADLOCK ** > > > because our down_read() tries to be fair to the down_write(). > > So I think your patch would just create _different_ trouble. > > I get the _feeling_ that the whole point of that 'pages' array was to not > do that put_user() at all, but write to the physical pages through that > array. But the code looks totally buggy. > > I would seriously suggest that we consider removing the 'pagemap' > interface. The way that code looks, it's just broken. 
> > Matt - give me a reason (which includes either a patch to fix this sh*t up > or telling me why I'm wrong, but _also_ includes a real independent reason > to keep that thing around regardless) to not remove it all. > > The whole notion seems to be utterly misdesigned. Linus, I must say your charm has really worn thin. I've just stuck a post-it on my monitor saying "don't be Linus" to remind me not to be rude to my contributors. If I recall correctly, the get_user_pages is there to force all the output pages to be guaranteed present before we later fill them so that the output needn't be double-buffered and the locking and recursion on the page tables is significantly simpler and faster. put_user is then actually easier than "writing to the physical pages through the array". You're right that the SetPageDirty() at cleanup is redundant (but harmless), and I probably copied that pattern from another user of get_user_pages. Earlier versions of the pagewalk code studiously avoided calling find_vma and only looked at the page tables (with either caller doing locking or accepting the non-atomicity) to avoid the sort of issue that San has run into, but it looks like I forgot about that and let it sneak back in when other folks added more hugepage support. The deeper problem is that hugepages have various nasty layering violations associated with them like being tied to vmas so until some hugepage genius shows up and tells us how to do this cleanly, we'll probably have to deal with the associated mmap_sem. San, your patch is touching the wrong mm_sem, I think. The interesting races are going to happen on the target mm, not current's (generally not the same). Holding the mm_sem across the entire walk is also prohibitive, it probably needs to be localized to individual find_vma calls. -- http://selenic.com : development and support for Mercurial and Linux ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] proc: pagemap: Hold mmap_sem during page walk 2010-03-31 21:40 ` Matt Mackall @ 2010-04-01 1:33 ` Linus Torvalds 2010-04-01 2:10 ` KOSAKI Motohiro 2010-04-01 3:20 ` Matt Mackall 0 siblings, 2 replies; 18+ messages in thread From: Linus Torvalds @ 2010-04-01 1:33 UTC (permalink / raw) To: Matt Mackall Cc: San Mehat, linux-kernel, Brian Swetland, Dave Hansen, Andrew Morton On Wed, 31 Mar 2010, Matt Mackall wrote: > > Linus, I must say your charm has really worn thin. I've just stuck a > post-it on my monitor saying "don't be Linus" to remind me not to be > rude to my contributors. You didn't actually answer the problem, though. I'm rude, because I think the code is buggy. I pointed out how, and why I think it's pretty fundamental. You quoted it, but you didn't answer it. > If I recall correctly, the get_user_pages is there to force all the > output pages to be guaranteed present before we later fill them so that > the output needn't be double-buffered and the locking and recursion on > the page tables is significantly simpler and faster. put_user is then > actually easier than "writing to the physical pages through the array". Umm. Why would get_user_pages() guarantee that the pages don't get swapped out in the meantime, and you end up with a page fault _anyway_? Yes, it makes those page faults much rarer, but that just makes things much worse. Now you have a nasty subtle fundamental bug that is hard to trigger too. NOTE! The fact that we mark the page dirty (and increment the page count) in get_user_pages() (not in the later loop, which does indeed look pointless) should mean that the page doesn't go away physically, and if it gets mapped out the same page will be mapped back in on access. That's how things like direct-IO work (modulo a concurrent fork(), when they don't work, but that's a separate issue). 
But the problem with the deadlock is not that "same physical page" issue, it's simply the problem of the page fault code itself wanting to do a down_read() on the mmap_sem. So the fact that the physical page is reliable in the array doesn't actually solve the bug I was pointing out. See? So the nr = get_user_pages(.. 1, 0, ..) ... for_each_page() mark_it_dirty(); pattern is a valid pattern, but it is _only_ valid if you then do the write to the physical page you looked up. If you do the accesses through the virtual addresses, it makes zero sense. > You're right that the SetPageDirty() at cleanup is redundant (but > harmless), and I probably copied that pattern from another user of > get_user_pages. Fine. So then you'd do get_user_pages() and do _nothing_ with the array? Please explain what the point of that is, again? See above: there are valid reasons for things like direct-IO to do that exact "get_user_pages()" pattern - they mark the pages dirty both early _and_ late, and both page dirtying is required: - the first one (done by get_user_pages() itself) is to make sure that an anonymous page doesn't get switched around to another anonymous page that just happens to have the same contents (ie if the page gets swapped out, the swapout code will turn it into a swap-cache page). Once it's a swap-out page, the page refcount will then guarantee that it doesn't get replaced with anything else, so now you can trust the physical page. - the second "mark it dirty afterwards" is required in case swapout did happen, and the page got marked clean before the direct-IO actually wrote to it. So you mark it dirty again - this time not to force any swap cache things, but simply because there might have been a race between cleaning the page and the thing that wrote to it through the physical address. So there is a very real reason for that pattern existing. It's just that that reason has nothing to do with locking the thing into the page tables. 
That is absolutely NOT guaranteed by the get_user_pages() physical page pinning (munmap() is an extreme example of this, but I think swapout will clear it too in try_to_unmap_one(). In fact, even without swapping the page out, the accessed and dirty bit stuff may end up being done by software and have that page fault happen. What I'm trying to explain is simply that all these VM rules mean that you must never do a virtual user space access while holding mmap_sem. And doing get_user_pages() in no way changes that fundamental rule. Now, I will _happily_ admit that the above is all very complicated, and very easy to get wrong. There is a real reason why I hate direct-IO. It's a total nightmare from a VM standpoint. We've had bugs here. So I can well see that people don't always understand all the rules. Quite frankly, the only reason that /proc/self/pagemap code got merged is because it came through Andrew. Andrew knows the VM layer. Probably way better than I do by now. So when I get patches that touch things like this through the -mm tree, I tend to apply them. And looking now at the code again, I really think it was a mistake. > Earlier versions of the pagewalk code studiously avoided calling > find_vma and only looked at the page tables (with either caller doing > locking or accepting the non-atomicity) to avoid the sort of issue that > San has run into, but it looks like I forgot about that and let it sneak > back in when other folks added more hugepage support. Ok, that would fix the problem. The page tables can be accessed even without holding mmap_sem. And once you don't need mmap_sem, all the deadlocks go away. The get_user_pages() pattern still doesn't make sense, but at that point it's a _harmless_ problem, and doesn't really hurt. See what I'm saying? 
> The deeper problem is that hugepages have various nasty layering > violations associated with them like being tied to vmas so until some > hugepage genius shows up and tells us how to do this cleanly, we'll > probably have to deal with the associated mmap_sem. > > San, your patch is touching the wrong mm_sem, I think. The interesting > races are going to happen on the target mm, not current's (generally not > the same). The thing is, you need to hold the mmap_sem over the whole loop in pagemap_pte_range (since you're using the vma inside the loop). And that means that you'd hold mmap sem over the add_to_pagemap(), and the put_user() reference. Which has deadlock bug I pointed out, and that you totally ignored. The fact that it's a separate mm doesn't change the deadlock in _any_ shape or form. It _could_ be the same mm, but even if it was another VM entirely, it would just make the deadlock possible as an ABBA thing across two different VM's instead of through a single VM. > Holding the mm_sem across the entire walk is also prohibitive, it > probably needs to be localized to individual find_vma calls. You'd need to do it for each individual page, as far as I can tell, and move the find_vma() inside the loop itself in order to avoid the whole "hold mmap-sem while potentially doing a page fault" case. At that point, it looks like it wouldn't be buggy any more, it would just suck horribly from a performance standpoint. So Matt, please actually address the _bug_ I pointed out rather than talk about other things. And yes, getting rid of the vma accesses sounds like it would fix it best. If that means that it doesn't work for hugepages, so be it. Linus ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] proc: pagemap: Hold mmap_sem during page walk
  2010-04-01 1:33 ` Linus Torvalds
@ 2010-04-01 2:10   ` KOSAKI Motohiro
  2010-04-01 3:20   ` Matt Mackall
  1 sibling, 0 replies; 18+ messages in thread
From: KOSAKI Motohiro @ 2010-04-01 2:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kosaki.motohiro, Matt Mackall, San Mehat, linux-kernel,
	Brian Swetland, Dave Hansen, Andrew Morton

> So there is a very real reason for that pattern existing. It's just that
> that reason has nothing to do with locking the thing into the page tables.
> That is absolutely NOT guaranteed by the get_user_pages() physical page
> pinning (munmap() is an extreme example of this, but I think swapout will
> clear it too in try_to_unmap_one().

Right. Incrementing page->count only affects vmscan, preventing the page
from being freed. It prevents neither pageout I/O nor unmapping of the
page from a process.

Thanks.

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [PATCH] proc: pagemap: Hold mmap_sem during page walk 2010-04-01 1:33 ` Linus Torvalds 2010-04-01 2:10 ` KOSAKI Motohiro @ 2010-04-01 3:20 ` Matt Mackall 2010-04-01 4:27 ` Linus Torvalds 2010-04-01 5:54 ` KOSAKI Motohiro 1 sibling, 2 replies; 18+ messages in thread From: Matt Mackall @ 2010-04-01 3:20 UTC (permalink / raw) To: Linus Torvalds Cc: San Mehat, linux-kernel, Brian Swetland, Dave Hansen, Andrew Morton On Wed, 2010-03-31 at 18:33 -0700, Linus Torvalds wrote: > > On Wed, 31 Mar 2010, Matt Mackall wrote: > > > > Linus, I must say your charm has really worn thin. I've just stuck a > > post-it on my monitor saying "don't be Linus" to remind me not to be > > rude to my contributors. > > You didn't actually answer the problem, though. > > I'm rude, because I think the code is buggy. And what does that achieve? I've got plenty of other work I could be doing where people are nice to me when asking me to fix bugs. > I pointed out how, and why I > think it's pretty fundamental. You quoted it, but you didn't answer it. Yes, I was muddling the distinction between pinned in page cache and pinned in the mm, and you've just now re-clarified it for me. So I'll agree the current code is bogus. > So Matt, please actually address the _bug_ I pointed out rather than talk > about other things. And yes, getting rid of the vma accesses sounds like > it would fix it best. If that means that it doesn't work for hugepages, so > be it. That'd actually take us back to where it was when it hit mainline, which would make a lot of people unhappy. I wouldn't be one of them as there thankfully aren't any huge pages in my world. But I'm convinced put_user() must go. In which case, get_user_pages() stays, and I've got to switch things to direct physical page access into that array. Even if I fix that, I believe San's original bug can still be triggered though, as all the new callers to find_vma are run outside of the target's mm_sem. Fixing that should be reasonably straight-forward. 
-- http://selenic.com : development and support for Mercurial and Linux ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] proc: pagemap: Hold mmap_sem during page walk 2010-04-01 3:20 ` Matt Mackall @ 2010-04-01 4:27 ` Linus Torvalds 2010-04-01 5:54 ` KOSAKI Motohiro 1 sibling, 0 replies; 18+ messages in thread From: Linus Torvalds @ 2010-04-01 4:27 UTC (permalink / raw) To: Matt Mackall Cc: San Mehat, linux-kernel, Brian Swetland, Dave Hansen, Andrew Morton On Wed, 31 Mar 2010, Matt Mackall wrote: > > > > I'm rude, because I think the code is buggy. > > And what does that achieve? I've got plenty of other work I could be > doing where people are nice to me when asking me to fix bugs. I would suggest you go back and read my original email once more, now that you realize that you had simply not understood the difference between physical page pinning and virtual page pinning. Seriously. Now that you understand why I called the code buggy, maybe you realize that calling the code "insane and misdesigned" is actually not overly rude: it's just an accurate representation of the state of the code. And if you read the mail once more, you'll also notice that every single derogatory remark was about the _code_, not you. Oh, and I did ask you for an explanation for why we shouldn't just remove it. There can't be all that many users. Because quite frankly, if you apparently want to keep the vma around, the code is going to get way more complex and ugly. You may be able to avoid some of the _worst_ crap if you require that user pointers have to always be u64-aligned. Yes, that's a very ugly and non-intuitive requirement for a read() interface, but probably better than the alternative. Or maybe just do the double buffering, and limiting pagemap reads to fairly small chunks at a time. Linus ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] proc: pagemap: Hold mmap_sem during page walk 2010-04-01 3:20 ` Matt Mackall 2010-04-01 4:27 ` Linus Torvalds @ 2010-04-01 5:54 ` KOSAKI Motohiro 2010-04-01 5:55 ` KAMEZAWA Hiroyuki 1 sibling, 1 reply; 18+ messages in thread From: KOSAKI Motohiro @ 2010-04-01 5:54 UTC (permalink / raw) To: Matt Mackall Cc: kosaki.motohiro, Linus Torvalds, San Mehat, linux-kernel, Brian Swetland, Dave Hansen, Andrew Morton > > So Matt, please actually address the _bug_ I pointed out rather than talk > > about other things. And yes, getting rid of the vma accesses sounds like > > it would fix it best. If that means that it doesn't work for hugepages, so > > be it. > > That'd actually take us back to where it was when it hit mainline, which > would make a lot of people unhappy. I wouldn't be one of them as there > thankfully aren't any huge pages in my world. But I'm convinced > put_user() must go. In which case, get_user_pages() stays, and I've got > to switch things to direct physical page access into that array. no. direct physical page access for /proc is completely wrong idea, i think. please imazine the caller process is multi threaded and it use fork case, example scenario) 1. the parent process has thread-A and thread-B. 2. thread-A call read_pagemap 3. read_pagemap grab the page-C 3. at the same time, thread-B call fork(), now page-C pointed from two process. 4. thread-B touch page-C, cow occur, then the parent process has cowed page (page-C') and the child process has original page-C. 5. thread-A write page-C by physical page access, then the child page is modified, instead parent one. I just recommend simply do double buffering. thanks. > Even if I fix that, I believe San's original bug can still be triggered > though, as all the new callers to find_vma are run outside of the > target's mm_sem. Fixing that should be reasonably straight-forward. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] proc: pagemap: Hold mmap_sem during page walk
  2010-04-01 5:54 ` KOSAKI Motohiro
@ 2010-04-01 5:55   ` KAMEZAWA Hiroyuki
  2010-04-01 6:05     ` KOSAKI Motohiro
  0 siblings, 1 reply; 18+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-04-01 5:55 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Matt Mackall, Linus Torvalds, San Mehat, linux-kernel,
	Brian Swetland, Dave Hansen, Andrew Morton, n-horiguchi

On Thu, 1 Apr 2010 14:54:43 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> > > So Matt, please actually address the _bug_ I pointed out rather than talk
> > > about other things. And yes, getting rid of the vma accesses sounds like
> > > it would fix it best. If that means that it doesn't work for hugepages, so
> > > be it.
> >
> > That'd actually take us back to where it was when it hit mainline, which
> > would make a lot of people unhappy. I wouldn't be one of them as there
> > thankfully aren't any huge pages in my world. But I'm convinced
> > put_user() must go. In which case, get_user_pages() stays, and I've got
> > to switch things to direct physical page access into that array.
>
> No. Direct physical page access for /proc is a completely wrong idea, I
> think. Please imagine the case where the caller process is multi-threaded
> and uses fork().
>
> Example scenario:
>  1. The parent process has thread-A and thread-B.
>  2. Thread-A calls read_pagemap.
>  3. read_pagemap grabs page-C.
>  4. At the same time, thread-B calls fork(); now page-C is pointed to
>     from two processes.
>  5. Thread-B touches page-C, COW occurs, and then the parent process has
>     the COWed page (page-C') while the child process has the original
>     page-C.
>  6. Thread-A writes page-C by physical page access; then the child's
>     page is modified instead of the parent's.
>
> I just recommend simply doing double buffering.
>

Like this?
CC'ed Horiguchi, he touched the hugepage part of this code recently.

==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

In the initial design, walk_page_range() was designed just for walking the
page table and it didn't require mmap_sem. Now, find_vma() etc. are used
in walk_page_range() and we need mmap_sem around it.

This patch adds mmap_sem around walk_page_range().

Because /proc/<pid>/pagemap's callback routine uses put_user(), we have to
get rid of it to do a sane fix.

Reported-by: San Mehat <san@google.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 fs/proc/task_mmu.c |   89 ++++++++++++++++++++++-------------------------------
 1 file changed, 38 insertions(+), 51 deletions(-)

Index: linux-2.6.34-rc3/fs/proc/task_mmu.c
===================================================================
--- linux-2.6.34-rc3.orig/fs/proc/task_mmu.c
+++ linux-2.6.34-rc3/fs/proc/task_mmu.c
@@ -406,8 +406,11 @@ static int show_smap(struct seq_file *m,
 
 	memset(&mss, 0, sizeof mss);
 	mss.vma = vma;
-	if (vma->vm_mm && !is_vm_hugetlb_page(vma))
+	if (vma->vm_mm && !is_vm_hugetlb_page(vma)) {
+		down_read(&vma->vm_mm->mmap_sem);
 		walk_page_range(vma->vm_start, vma->vm_end, &smaps_walk);
+		up_read(&vma->vm_mm->mmap_sem);
+	}
 
 	show_map_vma(m, vma);
 
@@ -552,7 +555,8 @@ const struct file_operations proc_clear_
 };
 
 struct pagemapread {
-	u64 __user *out, *end;
+	int pos, len;
+	u64 *buffer;
 };
 
 #define PM_ENTRY_BYTES sizeof(u64)
@@ -575,10 +579,8 @@ struct pagemapread {
 static int add_to_pagemap(unsigned long addr, u64 pfn,
 			  struct pagemapread *pm)
 {
-	if (put_user(pfn, pm->out))
-		return -EFAULT;
-	pm->out++;
-	if (pm->out >= pm->end)
+	pm->buffer[pm->pos++] = pfn;
+	if (pm->pos >= pm->len)
 		return PM_END_OF_BUFFER;
 	return 0;
 }
@@ -720,21 +722,20 @@ static int pagemap_hugetlb_range(pte_t *
  * determine which areas of memory are actually mapped and llseek to
  * skip over unmapped regions.
  */
+#define PAGEMAP_WALK_SIZE	(PMD_SIZE)
 static ssize_t pagemap_read(struct file *file, char __user *buf,
 			    size_t count, loff_t *ppos)
 {
 	struct task_struct *task = get_proc_task(file->f_path.dentry->d_inode);
-	struct page **pages, *page;
-	unsigned long uaddr, uend;
 	struct mm_struct *mm;
 	struct pagemapread pm;
-	int pagecount;
 	int ret = -ESRCH;
 	struct mm_walk pagemap_walk = {};
 	unsigned long src;
 	unsigned long svpfn;
 	unsigned long start_vaddr;
 	unsigned long end_vaddr;
+	int copied = 0;
 
 	if (!task)
 		goto out;
@@ -757,34 +758,10 @@ static ssize_t pagemap_read(struct file
 	if (!mm)
 		goto out_task;
 
-
-	uaddr = (unsigned long)buf & PAGE_MASK;
-	uend = (unsigned long)(buf + count);
-	pagecount = (PAGE_ALIGN(uend) - uaddr) / PAGE_SIZE;
-	ret = 0;
-	if (pagecount == 0)
+	pm.len = PM_ENTRY_BYTES * (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
+	pm.buffer = kmalloc(pm.len, GFP_KERNEL);
+	if (!pm.buffer)
 		goto out_mm;
-	pages = kcalloc(pagecount, sizeof(struct page *), GFP_KERNEL);
-	ret = -ENOMEM;
-	if (!pages)
-		goto out_mm;
-
-	down_read(&current->mm->mmap_sem);
-	ret = get_user_pages(current, current->mm, uaddr, pagecount,
-				1, 0, pages, NULL);
-	up_read(&current->mm->mmap_sem);
-
-	if (ret < 0)
-		goto out_free;
-
-	if (ret != pagecount) {
-		pagecount = ret;
-		ret = -EFAULT;
-		goto out_pages;
-	}
-
-	pm.out = (u64 __user *)buf;
-	pm.end = (u64 __user *)(buf + count);
 
 	pagemap_walk.pmd_entry = pagemap_pte_range;
 	pagemap_walk.pte_hole = pagemap_pte_hole;
@@ -807,23 +784,33 @@ static ssize_t pagemap_read(struct file
 	 * user buffer is tracked in "pm", and the walk
 	 * will stop when we hit the end of the buffer.
 	 */
-	ret = walk_page_range(start_vaddr, end_vaddr, &pagemap_walk);
-	if (ret == PM_END_OF_BUFFER)
-		ret = 0;
-	/* don't need mmap_sem for these, but this looks cleaner */
-	*ppos += (char __user *)pm.out - buf;
-	if (!ret)
-		ret = (char __user *)pm.out - buf;
-
-out_pages:
-	for (; pagecount; pagecount--) {
-		page = pages[pagecount-1];
-		if (!PageReserved(page))
-			SetPageDirty(page);
-		page_cache_release(page);
+	while (count && (start_vaddr < end_vaddr)) {
+		int len;
+		unsigned long end;
+		pm.pos = 0;
+		start_vaddr += PAGEMAP_WALK_SIZE;
+		end = start_vaddr + PAGEMAP_WALK_SIZE;
+		down_read(&mm->mmap_sem);
+		ret = walk_page_range(start_vaddr, end, &pagemap_walk);
+		up_read(&mm->mmap_sem);
+
+		len = PM_ENTRY_BYTES * pm.pos;
+		if (len > count)
+			len = count;
+		if (copy_to_user((char __user *)buf, pm.buffer, len) < 0) {
+			ret = -EFAULT;
+			goto out_free;
+		}
+		copied += len;
+		buf += len;
+		count -= len;
 	}
+	*ppos += copied;
+	if (!ret || ret == PM_END_OF_BUFFER)
+		ret = copied;
+
 out_free:
-	kfree(pages);
+	kfree(pm.buffer);
 out_mm:
	mmput(mm);
 out_task:

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [PATCH] proc: pagemap: Hold mmap_sem during page walk 2010-04-01 5:55 ` KAMEZAWA Hiroyuki @ 2010-04-01 6:05 ` KOSAKI Motohiro 2010-04-01 6:09 ` KAMEZAWA Hiroyuki 0 siblings, 1 reply; 18+ messages in thread From: KOSAKI Motohiro @ 2010-04-01 6:05 UTC (permalink / raw) To: KAMEZAWA Hiroyuki Cc: kosaki.motohiro, Matt Mackall, Linus Torvalds, San Mehat, linux-kernel, Brian Swetland, Dave Hansen, Andrew Morton, n-horiguchi > > I just recommend simply do double buffering. > > > Like this ? > CC'ed Horiguchi, he touched hugepage part of this code recently. I like this patch. but I have few comment. > > == > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> > > In initial design, walk_page_range() was designed just for walking page table and > it didn't require mmap_sem. Now, find_vma() etc.. are used in walk_page_range() > and we need mmap_sem around it. > > This patch adds mmap_sem around walk_page_range(). > > Because /proc/<pid>/pagemap's callback routine use put_user(), we have to get > rid of it to do sane fix. > > Reported-by: San Mehat <san@google.com> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> > --- > fs/proc/task_mmu.c | 89 ++++++++++++++++++++++------------------------------- > 1 file changed, 38 insertions(+), 51 deletions(-) > > Index: linux-2.6.34-rc3/fs/proc/task_mmu.c > =================================================================== > --- linux-2.6.34-rc3.orig/fs/proc/task_mmu.c > +++ linux-2.6.34-rc3/fs/proc/task_mmu.c > @@ -406,8 +406,11 @@ static int show_smap(struct seq_file *m, > > memset(&mss, 0, sizeof mss); > mss.vma = vma; > - if (vma->vm_mm && !is_vm_hugetlb_page(vma)) > + if (vma->vm_mm && !is_vm_hugetlb_page(vma)) { > + down_read(&vma->vm_mm->mmap_sem); > walk_page_range(vma->vm_start, vma->vm_end, &smaps_walk); > + up_read(&vma->vm_mm->mmap_sem); > + } m_start already take mmap_sem. then I don't think this part is necessary. 
> > show_map_vma(m, vma); > > @@ -552,7 +555,8 @@ const struct file_operations proc_clear_ > }; > > struct pagemapread { > - u64 __user *out, *end; > + int pos, len; > + u64 *buffer; > }; > > #define PM_ENTRY_BYTES sizeof(u64) > @@ -575,10 +579,8 @@ struct pagemapread { > static int add_to_pagemap(unsigned long addr, u64 pfn, > struct pagemapread *pm) > { > - if (put_user(pfn, pm->out)) > - return -EFAULT; > - pm->out++; > - if (pm->out >= pm->end) > + pm->buffer[pm->pos++] = pfn; > + if (pm->pos >= pm->len) > return PM_END_OF_BUFFER; > return 0; > } > @@ -720,21 +722,20 @@ static int pagemap_hugetlb_range(pte_t * > * determine which areas of memory are actually mapped and llseek to > * skip over unmapped regions. > */ > +#define PAGEMAP_WALK_SIZE (PMD_SIZE) > static ssize_t pagemap_read(struct file *file, char __user *buf, > size_t count, loff_t *ppos) > { > struct task_struct *task = get_proc_task(file->f_path.dentry->d_inode); > - struct page **pages, *page; > - unsigned long uaddr, uend; > struct mm_struct *mm; > struct pagemapread pm; > - int pagecount; > int ret = -ESRCH; > struct mm_walk pagemap_walk = {}; > unsigned long src; > unsigned long svpfn; > unsigned long start_vaddr; > unsigned long end_vaddr; > + int copied = 0; > > if (!task) > goto out; > @@ -757,34 +758,10 @@ static ssize_t pagemap_read(struct file > if (!mm) > goto out_task; > > - > - uaddr = (unsigned long)buf & PAGE_MASK; > - uend = (unsigned long)(buf + count); > - pagecount = (PAGE_ALIGN(uend) - uaddr) / PAGE_SIZE; > - ret = 0; > - if (pagecount == 0) > + pm.len = PM_ENTRY_BYTES * (PAGEMAP_WALK_SIZE >> PAGE_SHIFT); > + pm.buffer = kmalloc(pm.len, GFP_KERNEL); In /proc interface, GFP_TEMPORARY is recommened rather than GFP_KERNEL. it help to prevent unnecessary external fragmentation. 
> + if (!pm.buffer) > goto out_mm; > - pages = kcalloc(pagecount, sizeof(struct page *), GFP_KERNEL); > - ret = -ENOMEM; > - if (!pages) > - goto out_mm; > - > - down_read(¤t->mm->mmap_sem); > - ret = get_user_pages(current, current->mm, uaddr, pagecount, > - 1, 0, pages, NULL); > - up_read(¤t->mm->mmap_sem); > - > - if (ret < 0) > - goto out_free; > - > - if (ret != pagecount) { > - pagecount = ret; > - ret = -EFAULT; > - goto out_pages; > - } > - > - pm.out = (u64 __user *)buf; > - pm.end = (u64 __user *)(buf + count); > > pagemap_walk.pmd_entry = pagemap_pte_range; > pagemap_walk.pte_hole = pagemap_pte_hole; > @@ -807,23 +784,33 @@ static ssize_t pagemap_read(struct file > * user buffer is tracked in "pm", and the walk > * will stop when we hit the end of the buffer. > */ > - ret = walk_page_range(start_vaddr, end_vaddr, &pagemap_walk); > - if (ret == PM_END_OF_BUFFER) > - ret = 0; > - /* don't need mmap_sem for these, but this looks cleaner */ > - *ppos += (char __user *)pm.out - buf; > - if (!ret) > - ret = (char __user *)pm.out - buf; > - > -out_pages: > - for (; pagecount; pagecount--) { > - page = pages[pagecount-1]; > - if (!PageReserved(page)) > - SetPageDirty(page); > - page_cache_release(page); > + while (count && (start_vaddr < end_vaddr)) { > + int len; > + unsigned long end; > + pm.pos = 0; > + start_vaddr += PAGEMAP_WALK_SIZE; > + end = start_vaddr + PAGEMAP_WALK_SIZE; > + down_read(&mm->mmap_sem); > + ret = walk_page_range(start_vaddr, end, &pagemap_walk); > + up_read(&mm->mmap_sem); > + > + len = PM_ENTRY_BYTES * pm.pos; > + if (len > count) > + len = count; > + if (copy_to_user((char __user *)buf, pm.buffer, len) < 0) { > + ret = -EFAULT; > + goto out_free; > + } > + copied += len; > + buf += len; > + count -= len; > } > + *ppos += copied; > + if (!ret || ret == PM_END_OF_BUFFER) > + ret = copied; > + > out_free: > - kfree(pages); > + kfree(pm.buffer); > out_mm: > mmput(mm); > out_task: > ^ permalink raw reply [flat|nested] 18+ messages in thread
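The core of KAMEZAWA's rework is the double-buffering shape of the read loop: fill a small private buffer per chunk while holding the lock, then copy it to the caller's buffer with the lock dropped, so the page walk never faults on user memory. A minimal user-space sketch of that shape (all names here are hypothetical stand-ins, not the kernel API):

```c
#include <stddef.h>
#include <string.h>

#define ENTRY_BYTES   sizeof(unsigned long long)
#define CHUNK_ENTRIES 8   /* stands in for PAGEMAP_WALK_SIZE >> PAGE_SHIFT */

/* Hypothetical stand-in for walk_page_range() + add_to_pagemap():
   fills `buffer` with up to CHUNK_ENTRIES entries starting at `start`
   and returns how many entries it produced. */
static size_t walk_chunk(unsigned long long start, unsigned long long *buffer)
{
    size_t pos;
    for (pos = 0; pos < CHUNK_ENTRIES; pos++)
        buffer[pos] = start + pos;   /* fake "pagemap entry" */
    return pos;
}

/* Mirrors the shape of the reworked pagemap_read(): walk in fixed-size
   chunks into a private buffer, then flush each chunk to the caller. */
size_t chunked_read(unsigned char *out, size_t count,
                    unsigned long long start, unsigned long long end)
{
    unsigned long long buffer[CHUNK_ENTRIES];
    size_t copied = 0;

    while (count && start < end) {
        /* in the kernel this section runs under down_read(&mm->mmap_sem) */
        size_t produced = walk_chunk(start, buffer);
        size_t len = produced * ENTRY_BYTES;

        if (len > count)
            len = count;
        /* copy_to_user() analogue, performed with the lock dropped */
        memcpy(out + copied, buffer, len);
        copied += len;
        count -= len;
        start += CHUNK_ENTRIES;
    }
    return copied;
}
```

The key property this models: the producer only ever writes the private `buffer`, so a page fault (or, in the kernel, a `put_user()`) can never happen while the walk holds `mmap_sem`.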
* Re: [PATCH] proc: pagemap: Hold mmap_sem during page walk 2010-04-01 6:05 ` KOSAKI Motohiro @ 2010-04-01 6:09 ` KAMEZAWA Hiroyuki 2010-04-01 6:34 ` KAMEZAWA Hiroyuki 0 siblings, 1 reply; 18+ messages in thread From: KAMEZAWA Hiroyuki @ 2010-04-01 6:09 UTC (permalink / raw) To: KOSAKI Motohiro Cc: Matt Mackall, Linus Torvalds, San Mehat, linux-kernel, Brian Swetland, Dave Hansen, Andrew Morton, n-horiguchi On Thu, 1 Apr 2010 15:05:38 +0900 (JST) KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote: memset(&mss, 0, sizeof mss); > > mss.vma = vma; > > - if (vma->vm_mm && !is_vm_hugetlb_page(vma)) > > + if (vma->vm_mm && !is_vm_hugetlb_page(vma)) { > > + down_read(&vma->vm_mm->mmap_sem); > > walk_page_range(vma->vm_start, vma->vm_end, &smaps_walk); > > + up_read(&vma->vm_mm->mmap_sem); > > + } > > m_start already take mmap_sem. then I don't think this part is necessary. > Right. <snip> > > - > > - uaddr = (unsigned long)buf & PAGE_MASK; > > - uend = (unsigned long)(buf + count); > > - pagecount = (PAGE_ALIGN(uend) - uaddr) / PAGE_SIZE; > > - ret = 0; > > - if (pagecount == 0) > > + pm.len = PM_ENTRY_BYTES * (PAGEMAP_WALK_SIZE >> PAGE_SHIFT); > > + pm.buffer = kmalloc(pm.len, GFP_KERNEL); > > In /proc interface, GFP_TEMPORARY is recommened rather than GFP_KERNEL. > it help to prevent unnecessary external fragmentation. > Ok. From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> In initial design, walk_page_range() was designed just for walking page table and it didn't require mmap_sem. Now, find_vma() etc.. are used in walk_page_range() and we need mmap_sem around it. This patch adds mmap_sem around walk_page_range(). Because /proc/<pid>/pagemap's callback routine use put_user(), we have to get rid of it to do sane fix. Changelog: - removed unnecessary change in smaps. 
- use GFP_TEMPORARY instead of GFP_KERNEL Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> --- fs/proc/task_mmu.c | 86 ++++++++++++++++++++++------------------------------- 1 file changed, 36 insertions(+), 50 deletions(-) Index: linux-2.6.34-rc3/fs/proc/task_mmu.c =================================================================== --- linux-2.6.34-rc3.orig/fs/proc/task_mmu.c +++ linux-2.6.34-rc3/fs/proc/task_mmu.c @@ -406,6 +406,7 @@ static int show_smap(struct seq_file *m, memset(&mss, 0, sizeof mss); mss.vma = vma; + /* mmap_sem is held in m_start */ if (vma->vm_mm && !is_vm_hugetlb_page(vma)) walk_page_range(vma->vm_start, vma->vm_end, &smaps_walk); @@ -552,7 +553,8 @@ const struct file_operations proc_clear_ }; struct pagemapread { - u64 __user *out, *end; + int pos, len; + u64 *buffer; }; #define PM_ENTRY_BYTES sizeof(u64) @@ -575,10 +577,8 @@ struct pagemapread { static int add_to_pagemap(unsigned long addr, u64 pfn, struct pagemapread *pm) { - if (put_user(pfn, pm->out)) - return -EFAULT; - pm->out++; - if (pm->out >= pm->end) + pm->buffer[pm->pos++] = pfn; + if (pm->pos >= pm->len) return PM_END_OF_BUFFER; return 0; } @@ -720,21 +720,20 @@ static int pagemap_hugetlb_range(pte_t * * determine which areas of memory are actually mapped and llseek to * skip over unmapped regions. 
*/ +#define PAGEMAP_WALK_SIZE (PMD_SIZE) static ssize_t pagemap_read(struct file *file, char __user *buf, size_t count, loff_t *ppos) { struct task_struct *task = get_proc_task(file->f_path.dentry->d_inode); - struct page **pages, *page; - unsigned long uaddr, uend; struct mm_struct *mm; struct pagemapread pm; - int pagecount; int ret = -ESRCH; struct mm_walk pagemap_walk = {}; unsigned long src; unsigned long svpfn; unsigned long start_vaddr; unsigned long end_vaddr; + int copied = 0; if (!task) goto out; @@ -757,34 +756,10 @@ static ssize_t pagemap_read(struct file if (!mm) goto out_task; - - uaddr = (unsigned long)buf & PAGE_MASK; - uend = (unsigned long)(buf + count); - pagecount = (PAGE_ALIGN(uend) - uaddr) / PAGE_SIZE; - ret = 0; - if (pagecount == 0) + pm.len = PM_ENTRY_BYTES * (PAGEMAP_WALK_SIZE >> PAGE_SHIFT); + pm.buffer = kmalloc(pm.len, GFP_TEMPORARY); + if (!pm.buffer) goto out_mm; - pages = kcalloc(pagecount, sizeof(struct page *), GFP_KERNEL); - ret = -ENOMEM; - if (!pages) - goto out_mm; - - down_read(¤t->mm->mmap_sem); - ret = get_user_pages(current, current->mm, uaddr, pagecount, - 1, 0, pages, NULL); - up_read(¤t->mm->mmap_sem); - - if (ret < 0) - goto out_free; - - if (ret != pagecount) { - pagecount = ret; - ret = -EFAULT; - goto out_pages; - } - - pm.out = (u64 __user *)buf; - pm.end = (u64 __user *)(buf + count); pagemap_walk.pmd_entry = pagemap_pte_range; pagemap_walk.pte_hole = pagemap_pte_hole; @@ -807,23 +782,34 @@ static ssize_t pagemap_read(struct file * user buffer is tracked in "pm", and the walk * will stop when we hit the end of the buffer. 
*/ - ret = walk_page_range(start_vaddr, end_vaddr, &pagemap_walk); - if (ret == PM_END_OF_BUFFER) - ret = 0; - /* don't need mmap_sem for these, but this looks cleaner */ - *ppos += (char __user *)pm.out - buf; - if (!ret) - ret = (char __user *)pm.out - buf; - -out_pages: - for (; pagecount; pagecount--) { - page = pages[pagecount-1]; - if (!PageReserved(page)) - SetPageDirty(page); - page_cache_release(page); + while (count && (start_vaddr < end_vaddr)) { + int len; + unsigned long end; + + pm.pos = 0; + start_vaddr += PAGEMAP_WALK_SIZE; + end = start_vaddr + PAGEMAP_WALK_SIZE; + down_read(&mm->mmap_sem); + ret = walk_page_range(start_vaddr, end, &pagemap_walk); + up_read(&mm->mmap_sem); + + len = PM_ENTRY_BYTES * pm.pos; + if (len > count) + len = count; + if (copy_to_user((char __user *)buf, pm.buffer, len) < 0) { + ret = -EFAULT; + goto out_free; + } + copied += len; + buf += len; + count -= len; } + *ppos += copied; + if (!ret || ret == PM_END_OF_BUFFER) + ret = copied; + out_free: - kfree(pages); + kfree(pm.buffer); out_mm: mmput(mm); out_task: ^ permalink raw reply [flat|nested] 18+ messages in thread
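With the rework, add_to_pagemap() no longer touches user memory at all: it appends to the private buffer and reports a full buffer via a sentinel return value that walk_page_range() propagates up. A small user-space model of that callback (the sentinel's numeric value is private to the kernel file; 1 is used here only for illustration):

```c
#define PM_END_OF_BUFFER 1   /* illustrative value; private in the kernel */

struct pagemapread_model {
    int pos, len;                 /* len counts entries in this model */
    unsigned long long *buffer;
};

/* Model of the reworked add_to_pagemap(): store locally, never fault. */
static int add_entry(unsigned long long pfn, struct pagemapread_model *pm)
{
    pm->buffer[pm->pos++] = pfn;
    if (pm->pos >= pm->len)
        return PM_END_OF_BUFFER;
    return 0;
}
```

The caller treats PM_END_OF_BUFFER as "stop walking, flush the buffer", not as an error, which is why pagemap_read() later checks `if (!ret || ret == PM_END_OF_BUFFER)` before reporting the copied byte count.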
* Re: [PATCH] proc: pagemap: Hold mmap_sem during page walk 2010-04-01 6:09 ` KAMEZAWA Hiroyuki @ 2010-04-01 6:34 ` KAMEZAWA Hiroyuki 2010-04-01 7:09 ` Matt Mackall 2010-04-01 15:10 ` Linus Torvalds 0 siblings, 2 replies; 18+ messages in thread From: KAMEZAWA Hiroyuki @ 2010-04-01 6:34 UTC (permalink / raw) To: KAMEZAWA Hiroyuki Cc: KOSAKI Motohiro, Matt Mackall, Linus Torvalds, San Mehat, linux-kernel, Brian Swetland, Dave Hansen, Andrew Morton, n-horiguchi On Thu, 1 Apr 2010 15:09:56 +0900 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote: > > + pm.pos = 0; > + start_vaddr += PAGEMAP_WALK_SIZE; > + end = start_vaddr + PAGEMAP_WALK_SIZE; Sigh...this is bad.. == From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> In initial design, walk_page_range() was designed just for walking page table and it didn't require mmap_sem. Now, find_vma() etc.. are used in walk_page_range() and we need mmap_sem around it. This patch adds mmap_sem around walk_page_range(). Because /proc/<pid>/pagemap's callback routine use put_user(), we have to get rid of it to do sane fix. Changelog: - fixed start_vaddr calculation - removed unnecessary cast. - removed unnecessary change in smaps. - use GFP_TEMPORARY instead of GFP_KERNEL - use min(). 
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> --- fs/proc/task_mmu.c | 84 +++++++++++++++++++++-------------------------------- 1 file changed, 34 insertions(+), 50 deletions(-) Index: linux-2.6.34-rc3/fs/proc/task_mmu.c =================================================================== --- linux-2.6.34-rc3.orig/fs/proc/task_mmu.c +++ linux-2.6.34-rc3/fs/proc/task_mmu.c @@ -406,6 +406,7 @@ static int show_smap(struct seq_file *m, memset(&mss, 0, sizeof mss); mss.vma = vma; + /* mmap_sem is held in m_start */ if (vma->vm_mm && !is_vm_hugetlb_page(vma)) walk_page_range(vma->vm_start, vma->vm_end, &smaps_walk); @@ -552,7 +553,8 @@ const struct file_operations proc_clear_ }; struct pagemapread { - u64 __user *out, *end; + int pos, len; + u64 *buffer; }; #define PM_ENTRY_BYTES sizeof(u64) @@ -575,10 +577,8 @@ struct pagemapread { static int add_to_pagemap(unsigned long addr, u64 pfn, struct pagemapread *pm) { - if (put_user(pfn, pm->out)) - return -EFAULT; - pm->out++; - if (pm->out >= pm->end) + pm->buffer[pm->pos++] = pfn; + if (pm->pos >= pm->len) return PM_END_OF_BUFFER; return 0; } @@ -720,21 +720,20 @@ static int pagemap_hugetlb_range(pte_t * * determine which areas of memory are actually mapped and llseek to * skip over unmapped regions. 
*/ +#define PAGEMAP_WALK_SIZE (PMD_SIZE) static ssize_t pagemap_read(struct file *file, char __user *buf, size_t count, loff_t *ppos) { struct task_struct *task = get_proc_task(file->f_path.dentry->d_inode); - struct page **pages, *page; - unsigned long uaddr, uend; struct mm_struct *mm; struct pagemapread pm; - int pagecount; int ret = -ESRCH; struct mm_walk pagemap_walk = {}; unsigned long src; unsigned long svpfn; unsigned long start_vaddr; unsigned long end_vaddr; + int copied = 0; if (!task) goto out; @@ -757,35 +756,11 @@ static ssize_t pagemap_read(struct file if (!mm) goto out_task; - - uaddr = (unsigned long)buf & PAGE_MASK; - uend = (unsigned long)(buf + count); - pagecount = (PAGE_ALIGN(uend) - uaddr) / PAGE_SIZE; - ret = 0; - if (pagecount == 0) - goto out_mm; - pages = kcalloc(pagecount, sizeof(struct page *), GFP_KERNEL); - ret = -ENOMEM; - if (!pages) + pm.len = PM_ENTRY_BYTES * (PAGEMAP_WALK_SIZE >> PAGE_SHIFT); + pm.buffer = kmalloc(pm.len, GFP_TEMPORARY); + if (!pm.buffer) goto out_mm; - down_read(¤t->mm->mmap_sem); - ret = get_user_pages(current, current->mm, uaddr, pagecount, - 1, 0, pages, NULL); - up_read(¤t->mm->mmap_sem); - - if (ret < 0) - goto out_free; - - if (ret != pagecount) { - pagecount = ret; - ret = -EFAULT; - goto out_pages; - } - - pm.out = (u64 __user *)buf; - pm.end = (u64 __user *)(buf + count); - pagemap_walk.pmd_entry = pagemap_pte_range; pagemap_walk.pte_hole = pagemap_pte_hole; pagemap_walk.hugetlb_entry = pagemap_hugetlb_range; @@ -807,23 +782,32 @@ static ssize_t pagemap_read(struct file * user buffer is tracked in "pm", and the walk * will stop when we hit the end of the buffer. 
*/ - ret = walk_page_range(start_vaddr, end_vaddr, &pagemap_walk); - if (ret == PM_END_OF_BUFFER) - ret = 0; - /* don't need mmap_sem for these, but this looks cleaner */ - *ppos += (char __user *)pm.out - buf; - if (!ret) - ret = (char __user *)pm.out - buf; - -out_pages: - for (; pagecount; pagecount--) { - page = pages[pagecount-1]; - if (!PageReserved(page)) - SetPageDirty(page); - page_cache_release(page); + while (count && (start_vaddr < end_vaddr)) { + int len; + unsigned long end; + + pm.pos = 0; + end = min(start_vaddr + PAGEMAP_WALK_SIZE, end_vaddr); + down_read(&mm->mmap_sem); + ret = walk_page_range(start_vaddr, end, &pagemap_walk); + up_read(&mm->mmap_sem); + start_vaddr += PAGEMAP_WALK_SIZE; + + len = min(count, PM_ENTRY_BYTES * pm.pos); + if (copy_to_user(buf, pm.buffer, len) < 0) { + ret = -EFAULT; + goto out_free; + } + copied += len; + buf += len; + count -= len; } + *ppos += copied; + if (!ret || ret == PM_END_OF_BUFFER) + ret = copied; + out_free: - kfree(pages); + kfree(pm.buffer); out_mm: mmput(mm); out_task: ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] proc: pagemap: Hold mmap_sem during page walk 2010-04-01 6:34 ` KAMEZAWA Hiroyuki @ 2010-04-01 7:09 ` Matt Mackall 2010-04-01 7:21 ` KOSAKI Motohiro 2010-04-01 15:10 ` Linus Torvalds 1 sibling, 1 reply; 18+ messages in thread From: Matt Mackall @ 2010-04-01 7:09 UTC (permalink / raw) To: KAMEZAWA Hiroyuki Cc: KOSAKI Motohiro, Linus Torvalds, San Mehat, linux-kernel, Brian Swetland, Dave Hansen, Andrew Morton, n-horiguchi On Thu, 2010-04-01 at 15:34 +0900, KAMEZAWA Hiroyuki wrote: > On Thu, 1 Apr 2010 15:09:56 +0900 > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote: > > > > > + pm.pos = 0; > > + start_vaddr += PAGEMAP_WALK_SIZE; > > + end = start_vaddr + PAGEMAP_WALK_SIZE; > > Sigh...this is bad.. > > == > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> > > In initial design, walk_page_range() was designed just for walking page table and > it didn't require mmap_sem. Now, find_vma() etc.. are used in walk_page_range() > and we need mmap_sem around it. This looks pretty reasonable. However, it also looks very similar to my first version of pagemap (which started with double-buffering). It's going to need re-testing to make sure it hasn't reintroduced any wrapping, alignment, or off-by-one bugs that have already been ironed out once or twice. -- http://selenic.com : development and support for Mercurial and Linux ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] proc: pagemap: Hold mmap_sem during page walk 2010-04-01 7:09 ` Matt Mackall @ 2010-04-01 7:21 ` KOSAKI Motohiro 0 siblings, 0 replies; 18+ messages in thread From: KOSAKI Motohiro @ 2010-04-01 7:21 UTC (permalink / raw) To: Matt Mackall Cc: kosaki.motohiro, KAMEZAWA Hiroyuki, Linus Torvalds, San Mehat, linux-kernel, Brian Swetland, Dave Hansen, Andrew Morton, n-horiguchi > On Thu, 2010-04-01 at 15:34 +0900, KAMEZAWA Hiroyuki wrote: > > On Thu, 1 Apr 2010 15:09:56 +0900 > > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote: > > > > > > > > + pm.pos = 0; > > > + start_vaddr += PAGEMAP_WALK_SIZE; > > > + end = start_vaddr + PAGEMAP_WALK_SIZE; > > > > Sigh...this is bad.. > > > > == > > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> > > > > In initial design, walk_page_range() was designed just for walking page table and > > it didn't require mmap_sem. Now, find_vma() etc.. are used in walk_page_range() > > and we need mmap_sem around it. > > This looks pretty reasonable. However, it also looks very similar to my > first version of pagemap (which started with double-buffering). It's > going to need re-testing to make sure it hasn't reintroduced any > wrapping, alignment, or off-by-one bugs that have already been ironed > out once or twice. OK, I'm looking for your test result. :) Thanks matt for your contribution! ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] proc: pagemap: Hold mmap_sem during page walk 2010-04-01 6:34 ` KAMEZAWA Hiroyuki 2010-04-01 7:09 ` Matt Mackall @ 2010-04-01 15:10 ` Linus Torvalds 2010-04-02 0:11 ` KAMEZAWA Hiroyuki 1 sibling, 1 reply; 18+ messages in thread From: Linus Torvalds @ 2010-04-01 15:10 UTC (permalink / raw) To: KAMEZAWA Hiroyuki Cc: KOSAKI Motohiro, Matt Mackall, San Mehat, linux-kernel, Brian Swetland, Dave Hansen, Andrew Morton, n-horiguchi On Thu, 1 Apr 2010, KAMEZAWA Hiroyuki wrote: > > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> > > In initial design, walk_page_range() was designed just for walking page table and > it didn't require mmap_sem. Now, find_vma() etc.. are used in walk_page_range() > and we need mmap_sem around it. > > This patch adds mmap_sem around walk_page_range(). > > Because /proc/<pid>/pagemap's callback routine use put_user(), we have to get > rid of it to do sane fix. > > Changelog: > - fixed start_vaddr calculation > - removed unnecessary cast. > - removed unnecessary change in smaps. > - use GFP_TEMPORARY instead of GFP_KERNEL > - use min(). Looks mostly correct to me (but just looking at the source, no testing, obviously). And I like how the double buffering removes more lines of code than it adds. However, I think there is a subtle problem with this: > + while (count && (start_vaddr < end_vaddr)) { > + int len; > + unsigned long end; > + > + pm.pos = 0; > + end = min(start_vaddr + PAGEMAP_WALK_SIZE, end_vaddr); > + down_read(&mm->mmap_sem); > + ret = walk_page_range(start_vaddr, end, &pagemap_walk); > + up_read(&mm->mmap_sem); > + start_vaddr += PAGEMAP_WALK_SIZE; I think "start_vaddr + PAGEMAP_WALK_SIZE" might overflow, and then 'end' ends up being odd. 
You'll never notice on architectures where the user space doesn't go all the way up to the end (walk_page_range will return 0 etc), but it will do the wrong thing if 'start' is close to the end, end is _at_ the end, and you'll not be able to read that range (because of the overflow). So I do think you should do something like end = start_vaddr + PAGEMAP_WALK_SIZE; /* overflow? or final chunk? */ if (end < start_vaddr || end > end_vaddr) end = end_vaddr; instead of using 'min()'. (This only matters if TASK_SIZE_OF() can be ~0ul, but I think that can happen on sparc, for example) Linus ^ permalink raw reply [flat|nested] 18+ messages in thread
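Linus's point is easiest to see with concrete numbers: when `start_vaddr + PAGEMAP_WALK_SIZE` wraps past the top of the address space, `min()` picks the tiny wrapped value and the walk range collapses, whereas the explicit check treats the wrap like the final chunk. A user-space sketch of both variants (not kernel code; WALK_SIZE is an arbitrary stand-in for PAGEMAP_WALK_SIZE):

```c
#include <limits.h>

#define WALK_SIZE 0x200000UL   /* stands in for PAGEMAP_WALK_SIZE */

/* The min()-based chunk end: wrong when start + WALK_SIZE wraps past 0. */
unsigned long chunk_end_min(unsigned long start, unsigned long end_vaddr)
{
    unsigned long end = start + WALK_SIZE;
    return end < end_vaddr ? end : end_vaddr;
}

/* Linus's suggested version: treat wraparound like the final chunk. */
unsigned long chunk_end_fixed(unsigned long start, unsigned long end_vaddr)
{
    unsigned long end = start + WALK_SIZE;
    /* overflow? or final chunk? */
    if (end < start || end > end_vaddr)
        end = end_vaddr;
    return end;
}
```

For a `start` near ULONG_MAX, `chunk_end_min()` returns an `end` far *below* `start`, so the last stretch of the address space would never be readable; `chunk_end_fixed()` clamps to `end_vaddr` instead.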
* Re: [PATCH] proc: pagemap: Hold mmap_sem during page walk 2010-04-01 15:10 ` Linus Torvalds @ 2010-04-02 0:11 ` KAMEZAWA Hiroyuki 2010-04-02 14:30 ` Matt Mackall 0 siblings, 1 reply; 18+ messages in thread From: KAMEZAWA Hiroyuki @ 2010-04-02 0:11 UTC (permalink / raw) To: Linus Torvalds Cc: KOSAKI Motohiro, Matt Mackall, San Mehat, linux-kernel, Brian Swetland, Dave Hansen, Andrew Morton, n-horiguchi On Thu, 1 Apr 2010 08:10:40 -0700 (PDT) Linus Torvalds <torvalds@linux-foundation.org> wrote: > > + while (count && (start_vaddr < end_vaddr)) { > > + int len; > > + unsigned long end; > > + > > + pm.pos = 0; > > + end = min(start_vaddr + PAGEMAP_WALK_SIZE, end_vaddr); > > + down_read(&mm->mmap_sem); > > + ret = walk_page_range(start_vaddr, end, &pagemap_walk); > > + up_read(&mm->mmap_sem); > > + start_vaddr += PAGEMAP_WALK_SIZE; > > I think "start_vaddr + PAGEMAP_WALK_SIZE" might overflow, and then 'end' > ends up being odd. You'll never notice on architectures where the user > space doesn't go all the way up to the end (walk_page_range will return 0 > etc), but it will do the wrong thing if 'start' is close to the end, end > is _at_ the end, and you'll not be able to read that range (because of the > overflow). > I didn't notice that. Thanks. > So I do think you should do something like > > end = start_vaddr + PAGEMAP_WALK_SIZE; > /* overflow? or final chunk? */ > if (end < start_vaddr || end > end_vaddr) > end = end_vaddr; > > instead of using 'min()'. > Ok, here. Now: end = start_vaddr + PAGEMAP_WALK_SIZE; if (end < start_vaddr || end > end_vaddr) end = end_vaddr; ....walk.... start_vaddr = end; Only tested on x86-64. Thanks, -Kame == From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> In initial design, walk_page_range() was designed just for walking page table and it didn't require mmap_sem. Now, find_vma() etc.. are used in walk_page_range() and we need mmap_sem around it. This patch adds mmap_sem around walk_page_range(). 
Because /proc/<pid>/pagemap's callback routine use put_user(), we have to get rid of it to do sane fix. Changelog: 2010/Apr/2 - fixed start_vaddr and end overflow Changelog: 2010/Apr/1 - fixed start_vaddr calculation - removed unnecessary cast. - removed unnecessary change in smaps. - use GFP_TEMPORARY instead of GFP_KERNEL Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> --- fs/proc/task_mmu.c | 87 ++++++++++++++++++++++------------------------------- 1 file changed, 37 insertions(+), 50 deletions(-) Index: linux-2.6.34-rc3/fs/proc/task_mmu.c =================================================================== --- linux-2.6.34-rc3.orig/fs/proc/task_mmu.c +++ linux-2.6.34-rc3/fs/proc/task_mmu.c @@ -406,6 +406,7 @@ static int show_smap(struct seq_file *m, memset(&mss, 0, sizeof mss); mss.vma = vma; + /* mmap_sem is held in m_start */ if (vma->vm_mm && !is_vm_hugetlb_page(vma)) walk_page_range(vma->vm_start, vma->vm_end, &smaps_walk); @@ -552,7 +553,8 @@ const struct file_operations proc_clear_ }; struct pagemapread { - u64 __user *out, *end; + int pos, len; + u64 *buffer; }; #define PM_ENTRY_BYTES sizeof(u64) @@ -575,10 +577,8 @@ struct pagemapread { static int add_to_pagemap(unsigned long addr, u64 pfn, struct pagemapread *pm) { - if (put_user(pfn, pm->out)) - return -EFAULT; - pm->out++; - if (pm->out >= pm->end) + pm->buffer[pm->pos++] = pfn; + if (pm->pos >= pm->len) return PM_END_OF_BUFFER; return 0; } @@ -720,21 +720,20 @@ static int pagemap_hugetlb_range(pte_t * * determine which areas of memory are actually mapped and llseek to * skip over unmapped regions. 
*/ +#define PAGEMAP_WALK_SIZE (PMD_SIZE) static ssize_t pagemap_read(struct file *file, char __user *buf, size_t count, loff_t *ppos) { struct task_struct *task = get_proc_task(file->f_path.dentry->d_inode); - struct page **pages, *page; - unsigned long uaddr, uend; struct mm_struct *mm; struct pagemapread pm; - int pagecount; int ret = -ESRCH; struct mm_walk pagemap_walk = {}; unsigned long src; unsigned long svpfn; unsigned long start_vaddr; unsigned long end_vaddr; + int copied = 0; if (!task) goto out; @@ -757,34 +756,10 @@ static ssize_t pagemap_read(struct file if (!mm) goto out_task; - - uaddr = (unsigned long)buf & PAGE_MASK; - uend = (unsigned long)(buf + count); - pagecount = (PAGE_ALIGN(uend) - uaddr) / PAGE_SIZE; - ret = 0; - if (pagecount == 0) + pm.len = PM_ENTRY_BYTES * (PAGEMAP_WALK_SIZE >> PAGE_SHIFT); + pm.buffer = kmalloc(pm.len, GFP_TEMPORARY); + if (!pm.buffer) goto out_mm; - pages = kcalloc(pagecount, sizeof(struct page *), GFP_KERNEL); - ret = -ENOMEM; - if (!pages) - goto out_mm; - - down_read(¤t->mm->mmap_sem); - ret = get_user_pages(current, current->mm, uaddr, pagecount, - 1, 0, pages, NULL); - up_read(¤t->mm->mmap_sem); - - if (ret < 0) - goto out_free; - - if (ret != pagecount) { - pagecount = ret; - ret = -EFAULT; - goto out_pages; - } - - pm.out = (u64 __user *)buf; - pm.end = (u64 __user *)(buf + count); pagemap_walk.pmd_entry = pagemap_pte_range; pagemap_walk.pte_hole = pagemap_pte_hole; @@ -807,23 +782,35 @@ static ssize_t pagemap_read(struct file * user buffer is tracked in "pm", and the walk * will stop when we hit the end of the buffer. 
*/ - ret = walk_page_range(start_vaddr, end_vaddr, &pagemap_walk); - if (ret == PM_END_OF_BUFFER) - ret = 0; - /* don't need mmap_sem for these, but this looks cleaner */ - *ppos += (char __user *)pm.out - buf; - if (!ret) - ret = (char __user *)pm.out - buf; - -out_pages: - for (; pagecount; pagecount--) { - page = pages[pagecount-1]; - if (!PageReserved(page)) - SetPageDirty(page); - page_cache_release(page); + while (count && (start_vaddr < end_vaddr)) { + int len; + unsigned long end; + + pm.pos = 0; + end = start_vaddr + PAGEMAP_WALK_SIZE; + /* overflow ? */ + if (end < start_vaddr || end > end_vaddr) + end = end_vaddr; + down_read(&mm->mmap_sem); + ret = walk_page_range(start_vaddr, end, &pagemap_walk); + up_read(&mm->mmap_sem); + start_vaddr = end; + + len = min(count, PM_ENTRY_BYTES * pm.pos); + if (copy_to_user(buf, pm.buffer, len) < 0) { + ret = -EFAULT; + goto out_free; + } + copied += len; + buf += len; + count -= len; } + *ppos += copied; + if (!ret || ret == PM_END_OF_BUFFER) + ret = copied; + out_free: - kfree(pages); + kfree(pm.buffer); out_mm: mmput(mm); out_task: ^ permalink raw reply [flat|nested] 18+ messages in thread
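Nothing in this final version changes the user-visible ABI: /proc/<pid>/pagemap remains a flat array of u64 entries, one per virtual page, so readers seek to `(vaddr / page_size) * 8` and read back whole 8-byte entries, exactly as the comment about llseek-ing over unmapped regions describes. The offset arithmetic a reader relies on is just:

```c
#define PM_ENTRY_BYTES 8ULL   /* sizeof(u64), one pagemap entry per page */

/* Byte offset into /proc/<pid>/pagemap for a given virtual address. */
unsigned long long pagemap_offset(unsigned long long vaddr,
                                  unsigned long long page_size)
{
    return (vaddr / page_size) * PM_ENTRY_BYTES;
}

/* Number of whole entries a read of `count` bytes at an entry-aligned
   offset can return. */
unsigned long long pagemap_entries(unsigned long long count)
{
    return count / PM_ENTRY_BYTES;
}
```

This is pure arithmetic, not run against a live kernel here; with a 4 KiB page size, one 2 MiB walk chunk (PMD_SIZE on x86-64) corresponds to 512 entries, i.e. the 4096-byte buffer the patch allocates.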
* Re: [PATCH] proc: pagemap: Hold mmap_sem during page walk 2010-04-02 0:11 ` KAMEZAWA Hiroyuki @ 2010-04-02 14:30 ` Matt Mackall 2010-04-06 6:48 ` KAMEZAWA Hiroyuki 0 siblings, 1 reply; 18+ messages in thread From: Matt Mackall @ 2010-04-02 14:30 UTC (permalink / raw) To: KAMEZAWA Hiroyuki Cc: Linus Torvalds, KOSAKI Motohiro, San Mehat, linux-kernel, Brian Swetland, Dave Hansen, Andrew Morton, n-horiguchi On Fri, Apr 02, 2010 at 09:11:29AM +0900, KAMEZAWA Hiroyuki wrote: > int ret = -ESRCH; ... > + pm.len = PM_ENTRY_BYTES * (PAGEMAP_WALK_SIZE >> PAGE_SHIFT); > + pm.buffer = kmalloc(pm.len, GFP_TEMPORARY); > + if (!pm.buffer) > goto out_mm; ... > out_mm: > mmput(mm); Looks like this gets the wrong return code? -- Mathematics is the supreme nostalgia of our time. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] proc: pagemap: Hold mmap_sem during page walk 2010-04-02 14:30 ` Matt Mackall @ 2010-04-06 6:48 ` KAMEZAWA Hiroyuki 0 siblings, 0 replies; 18+ messages in thread From: KAMEZAWA Hiroyuki @ 2010-04-06 6:48 UTC (permalink / raw) To: Matt Mackall Cc: Linus Torvalds, KOSAKI Motohiro, San Mehat, linux-kernel, Brian Swetland, Dave Hansen, Andrew Morton, n-horiguchi On Fri, 2 Apr 2010 09:30:58 -0500 Matt Mackall <mpm@selenic.com> wrote: > On Fri, Apr 02, 2010 at 09:11:29AM +0900, KAMEZAWA Hiroyuki wrote: > > > int ret = -ESRCH; > ... > > + pm.len = PM_ENTRY_BYTES * (PAGEMAP_WALK_SIZE >> PAGE_SHIFT); > > + pm.buffer = kmalloc(pm.len, GFP_TEMPORARY); > > + if (!pm.buffer) > > goto out_mm; > ... > > out_mm: > > mmput(mm); > > Looks like this gets the wrong return code? > I'm sorry, and thank you for pointing that out. I confirmed that the merged version has the fixed code. Regards, -Kame ^ permalink raw reply [flat|nested] 18+ messages in thread
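The mistake Matt caught is a classic of goto-style cleanup: `ret` must be set to the right error immediately before each branch that can fail, or the function returns a stale code — here, -ESRCH from the top of the function instead of -ENOMEM when the kmalloc() fails. A condensed model of the bug and the fix (function names and the boolean parameters are hypothetical):

```c
#include <errno.h>

/* Buggy shape: ret still holds -ESRCH when the allocation fails. */
int read_buggy(int task_ok, int alloc_ok)
{
    int ret = -ESRCH;
    if (!task_ok)
        goto out;
    /* ... */
    if (!alloc_ok)
        goto out;        /* BUG: reports -ESRCH instead of -ENOMEM */
    ret = 0;
out:
    return ret;
}

/* Fixed shape: set ret just before the branch that can fail. */
int read_fixed(int task_ok, int alloc_ok)
{
    int ret = -ESRCH;
    if (!task_ok)
        goto out;
    ret = -ENOMEM;
    if (!alloc_ok)
        goto out;
    ret = 0;
out:
    return ret;
}
```

Keeping each error assignment adjacent to its failure test is what makes this pattern reviewable; the merged version of the patch follows the fixed shape.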
end of thread, other threads:[~2010-04-06 6:52 UTC | newest]

Thread overview: 18+ messages:

2010-03-31 17:23 [PATCH] proc: pagemap: Hold mmap_sem during page walk San Mehat
2010-03-31 17:54 ` Linus Torvalds
2010-03-31 21:40 ` Matt Mackall
2010-04-01 1:33 ` Linus Torvalds
2010-04-01 2:10 ` KOSAKI Motohiro
2010-04-01 3:20 ` Matt Mackall
2010-04-01 4:27 ` Linus Torvalds
2010-04-01 5:54 ` KOSAKI Motohiro
2010-04-01 5:55 ` KAMEZAWA Hiroyuki
2010-04-01 6:05 ` KOSAKI Motohiro
2010-04-01 6:09 ` KAMEZAWA Hiroyuki
2010-04-01 6:34 ` KAMEZAWA Hiroyuki
2010-04-01 7:09 ` Matt Mackall
2010-04-01 7:21 ` KOSAKI Motohiro
2010-04-01 15:10 ` Linus Torvalds
2010-04-02 0:11 ` KAMEZAWA Hiroyuki
2010-04-02 14:30 ` Matt Mackall
2010-04-06 6:48 ` KAMEZAWA Hiroyuki