* [PATCH] thp+memcg-numa: fix BUG at include/linux/mm.h:370!
@ 2011-03-14  8:08 Hugh Dickins
  2011-03-14 15:25 ` Minchan Kim
  2011-03-14 15:52 ` Andrea Arcangeli
  0 siblings, 2 replies; 14+ messages in thread
From: Hugh Dickins @ 2011-03-14  8:08 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andrea Arcangeli, Andrew Morton, David Rientjes, linux-kernel, linux-mm

THP's collapse_huge_page() has an understandable but ugly difference
in when its huge page is allocated: inside if NUMA but outside if not.
It's hardly surprising that the memcg failure path forgot that, freeing
the page in the non-NUMA case, then hitting a VM_BUG_ON in get_page()
(or even worse, using the freed page).

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/huge_memory.c |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

--- 2.6.38-rc8/mm/huge_memory.c	2011-03-08 09:27:16.000000000 -0800
+++ linux/mm/huge_memory.c	2011-03-13 18:26:21.000000000 -0700
@@ -1762,6 +1762,10 @@ static void collapse_huge_page(struct mm
 #ifndef CONFIG_NUMA
 	VM_BUG_ON(!*hpage);
 	new_page = *hpage;
+	if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
+		up_read(&mm->mmap_sem);
+		return;
+	}
 #else
 	VM_BUG_ON(*hpage);
 	/*
@@ -1781,12 +1785,12 @@ static void collapse_huge_page(struct mm
 		*hpage = ERR_PTR(-ENOMEM);
 		return;
 	}
-#endif
 	if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
 		up_read(&mm->mmap_sem);
 		put_page(new_page);
 		return;
 	}
+#endif

 	/* after allocating the hugepage upgrade to mmap_sem write mode */
 	up_read(&mm->mmap_sem);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [PATCH] thp+memcg-numa: fix BUG at include/linux/mm.h:370!
  2011-03-14  8:08 [PATCH] thp+memcg-numa: fix BUG at include/linux/mm.h:370! Hugh Dickins
@ 2011-03-14 15:25 ` Minchan Kim
  2011-03-14 15:52 ` Andrea Arcangeli
  1 sibling, 0 replies; 14+ messages in thread
From: Minchan Kim @ 2011-03-14 15:25 UTC (permalink / raw)
To: Hugh Dickins
Cc: Linus Torvalds, Andrea Arcangeli, Andrew Morton, David Rientjes, linux-kernel, linux-mm

On Mon, Mar 14, 2011 at 5:08 PM, Hugh Dickins <hughd@google.com> wrote:
> THP's collapse_huge_page() has an understandable but ugly difference
> in when its huge page is allocated: inside if NUMA but outside if not.
> It's hardly surprising that the memcg failure path forgot that, freeing
> the page in the non-NUMA case, then hitting a VM_BUG_ON in get_page()
> (or even worse, using the freed page).
>
> Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

Thanks, Hugh.

--
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [PATCH] thp+memcg-numa: fix BUG at include/linux/mm.h:370!
  2011-03-14  8:08 [PATCH] thp+memcg-numa: fix BUG at include/linux/mm.h:370! Hugh Dickins
  2011-03-14 15:25 ` Minchan Kim
@ 2011-03-14 15:52 ` Andrea Arcangeli
  2011-03-14 16:37   ` Hugh Dickins
  1 sibling, 1 reply; 14+ messages in thread
From: Andrea Arcangeli @ 2011-03-14 15:52 UTC (permalink / raw)
To: Hugh Dickins
Cc: Linus Torvalds, Andrew Morton, David Rientjes, linux-kernel, linux-mm

On Mon, Mar 14, 2011 at 01:08:47AM -0700, Hugh Dickins wrote:
>  mm/huge_memory.c |    6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> --- 2.6.38-rc8/mm/huge_memory.c	2011-03-08 09:27:16.000000000 -0800
> +++ linux/mm/huge_memory.c	2011-03-13 18:26:21.000000000 -0700
> @@ -1762,6 +1762,10 @@ static void collapse_huge_page(struct mm
>  #ifndef CONFIG_NUMA
>  	VM_BUG_ON(!*hpage);
>  	new_page = *hpage;
> +	if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
> +		up_read(&mm->mmap_sem);
> +		return;
> +	}
>  #else
>  	VM_BUG_ON(*hpage);
>  	/*
> @@ -1781,12 +1785,12 @@ static void collapse_huge_page(struct mm
>  		*hpage = ERR_PTR(-ENOMEM);
>  		return;
>  	}
> -#endif
>  	if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
>  		up_read(&mm->mmap_sem);
>  		put_page(new_page);
>  		return;
>  	}
> +#endif
>
>  	/* after allocating the hugepage upgrade to mmap_sem write mode */
>  	up_read(&mm->mmap_sem);

Correct! I'd suggest fixing it without duplicating the
mem_cgroup_newpage_charge. It's just one more put_page than needed
with memcg enabled and NUMA disabled (I guess most memcg testing
happened with NUMA enabled). The larger diff likely rejects with -mm
NUMA code... I'll try to diff it with a smaller -U2 (this code has
little chance to be misplaced) that may allow it to apply clean
regardless of the merging order, so it may make life easier.

It may have been overkill to keep the NUMA case separated in order to
avoid spurious allocations for the not-NUMA case; the code became more
complex and I'm not sure if it's really worthwhile. The optimization
makes sense but it's minor and it created more complexity than
strictly needed. For now we can't change it in the short term as it
has been tested this way, but if people dislike the additional
complexity that this optimization created, I'm not against dropping it
in the future. Your comment was positive about the optimization (you
said understandable) so I wanted to share my current thinking on these
ifdefs...

Thanks,
Andrea

===
Subject: thp+memcg-numa: fix BUG at include/linux/mm.h:370!

From: Hugh Dickins <hughd@google.com>

THP's collapse_huge_page() has an understandable but ugly difference
in when its huge page is allocated: inside if NUMA but outside if not.
It's hardly surprising that the memcg failure path forgot that, freeing
the page in the non-NUMA case, then hitting a VM_BUG_ON in get_page()
(or even worse, using the freed page).

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index dbe99a5..bf41114 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1785,5 +1785,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
 		up_read(&mm->mmap_sem);
+#ifdef CONFIG_NUMA
 		put_page(new_page);
+#endif
 		return;
 	}

^ permalink raw reply related	[flat|nested] 14+ messages in thread
* Re: [PATCH] thp+memcg-numa: fix BUG at include/linux/mm.h:370!
  2011-03-14 15:52 ` Andrea Arcangeli
@ 2011-03-14 16:37   ` Hugh Dickins
  2011-03-14 16:56     ` Linus Torvalds
  2011-03-14 16:59     ` [PATCH] mm: PageBuddy and mapcount underflows robustness Andrea Arcangeli
  0 siblings, 2 replies; 14+ messages in thread
From: Hugh Dickins @ 2011-03-14 16:37 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Linus Torvalds, Andrew Morton, David Rientjes, linux-kernel, linux-mm

On Mon, 14 Mar 2011, Andrea Arcangeli wrote:
>
> Correct! I'd suggest fixing it without duplicating the
> mem_cgroup_newpage_charge. It's just one more put_page than needed
> with memcg enabled and NUMA disabled (I guess most memcg testing
> happened with NUMA enabled). The larger diff likely rejects with -mm
> NUMA code... I'll try to diff it with a smaller -U2 (this code has
> little chance to be misplaced) that may allow it to apply clean
> regardless of the merging order, so it may make life easier.

I did try it that way at first (didn't help when I mistakenly put
#ifndef instead of #ifdef around the put_page!), but was repulsed
by seeing yet another #ifdef CONFIG_NUMA, so went with the duplicating
version - which Linus has now taken.

> It may have been overkill to keep the NUMA case separated in order to
> avoid spurious allocations for the not-NUMA case; the code became more
> complex and I'm not sure if it's really worthwhile. The optimization
> makes sense but it's minor and it created more complexity than
> strictly needed. For now we can't change it in the short term as it
> has been tested this way, but if people dislike the additional
> complexity that this optimization created, I'm not against dropping it
> in the future. Your comment was positive about the optimization (you
> said understandable) so I wanted to share my current thinking on these
> ifdefs...

I was certainly tempted to remove all the non-NUMA cases, but as you
say, now is not the right time for that since we needed a quick bugfix.

I do appreciate why you did it that way, it is nicer to be allocating
on the outside, though unsuitable in the NUMA case. But given how this
bug has passed unnoticed for so long, it seems like nobody has been
testing non-NUMA, so yes, best to simplify and make non-NUMA do the
same as NUMA in 2.6.39.

Since Linus has taken my version that you didn't like, perhaps you can
get even by sending him your "mm: PageBuddy cleanups" patch, the version
I didn't like (for its silly branches) so was reluctant to push myself.

I'd really like to see that fix in, since it's a little hard to argue
for in -stable, being all about a system which is already unstable.

But I think it needs a stronger title than "PageBuddy cleanups" -
"fix BUG in bad_page()"?

Hugh

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [PATCH] thp+memcg-numa: fix BUG at include/linux/mm.h:370!
  2011-03-14 16:37 ` Hugh Dickins
@ 2011-03-14 16:56   ` Linus Torvalds
  2011-03-14 17:17     ` Andrea Arcangeli
  1 sibling, 1 reply; 14+ messages in thread
From: Linus Torvalds @ 2011-03-14 16:56 UTC (permalink / raw)
To: Hugh Dickins
Cc: Andrea Arcangeli, Andrew Morton, David Rientjes, linux-kernel, linux-mm

On Mon, Mar 14, 2011 at 9:37 AM, Hugh Dickins <hughd@google.com> wrote:
>
> I did try it that way at first (didn't help when I mistakenly put
> #ifndef instead of #ifdef around the put_page!), but was repulsed
> by seeing yet another #ifdef CONFIG_NUMA, so went with the duplicating
> version - which Linus has now taken.

I have to admit to being repulsed by the whole patch, but my main
source of "that's effin ugly" was from the crazy lock handling.

Does mem_cgroup_newpage_charge() even _need_ the mmap_sem at all? And
if not, why not release the read-lock early? And even if it _does_
need it, why not do

        ret = mem_cgroup_newpage_charge();
        up_read(&mm->mmap_sem);
        if (ret) {
                ...

finally, the #ifdef CONFIG_NUMA is ugly, but it's ugly in the return
path of the function too, and the nicer way would probably be to have
it in one place and do something like

        /*
         * The allocation rules are different for the NUMA/non-NUMA cases
         * For the NUMA case, we allocate here, for the non-numa case we
         * use the allocation in *hpage
         */
        static inline struct page *collapse_alloc_hugepage(struct page **hpage)
        {
        #ifdef CONFIG_NUMA
                VM_BUG_ON(*hpage);
                return alloc_hugepage_vma(khugepaged_defrag(), vma, address, node);
        #else
                VM_BUG_ON(!*hpage);
                return *hpage;
        #endif
        }

        static inline void collapse_free_hugepage(struct page *page)
        {
        #ifdef CONFIG_NUMA
                put_page(page);
        #else
                /* Nothing to do */
        #endif
        }

and use that instead. The point being that the #ifdef'fery now ends up
being in a much more targeted area and much better abstracted, rather
than in the middle of code, and ugly as sin.

But as mentioned, the lock handling is disgusting. Why is it even safe
to drop and re-take the lock at all?

                          Linus

^ permalink raw reply	[flat|nested] 14+ messages in thread
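The two-helper pattern sketched above can be exercised outside the kernel. The following is a hypothetical user-space reduction (plain C; `struct page` is a stub, `malloc` stands in for the hugepage allocator, and only the `collapse_*` names come from the mail), showing how the `#ifdef` is confined to two small helpers while the call site stays identical in both configurations:

```c
#include <assert.h>
#include <stdlib.h>

struct page { int refcount; };          /* stub, not the kernel's struct page */

#ifdef CONFIG_NUMA
/* NUMA: allocate here; the caller must not have preallocated */
static struct page *collapse_alloc_hugepage(struct page **hpage)
{
	assert(*hpage == NULL);         /* models VM_BUG_ON(*hpage) */
	struct page *p = malloc(sizeof(*p));
	if (p)
		p->refcount = 1;
	return p;
}
static void collapse_free_hugepage(struct page *page)
{
	free(page);                     /* models put_page() */
}
#else
/* non-NUMA: the caller (khugepaged) preallocated into *hpage */
static struct page *collapse_alloc_hugepage(struct page **hpage)
{
	assert(*hpage != NULL);         /* models VM_BUG_ON(!*hpage) */
	return *hpage;
}
static void collapse_free_hugepage(struct page *page)
{
	(void)page;                     /* nothing to do: *hpage is kept for reuse */
}
#endif
```

Either way, the collapse path just calls `collapse_alloc_hugepage()` and, on a charge failure, `collapse_free_hugepage()`, with no `#ifdef` in the middle of the function - which is the bug class the original patch fixed.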
* Re: [PATCH] thp+memcg-numa: fix BUG at include/linux/mm.h:370!
  2011-03-14 16:56 ` Linus Torvalds
@ 2011-03-14 17:17   ` Andrea Arcangeli
  2011-03-14 19:58     ` Johannes Weiner
  0 siblings, 1 reply; 14+ messages in thread
From: Andrea Arcangeli @ 2011-03-14 17:17 UTC (permalink / raw)
To: Linus Torvalds
Cc: Hugh Dickins, Andrew Morton, David Rientjes, linux-kernel, linux-mm, Johannes Weiner, Minchan Kim

Hello,

On Mon, Mar 14, 2011 at 09:56:10AM -0700, Linus Torvalds wrote:
> On Mon, Mar 14, 2011 at 9:37 AM, Hugh Dickins <hughd@google.com> wrote:
> >
> > I did try it that way at first (didn't help when I mistakenly put
> > #ifndef instead of #ifdef around the put_page!), but was repulsed
> > by seeing yet another #ifdef CONFIG_NUMA, so went with the duplicating
> > version - which Linus has now taken.
>
> I have to admit to being repulsed by the whole patch, but my main
> source of "that's effin ugly" was from the crazy lock handling.
>
> Does mem_cgroup_newpage_charge() even _need_ the mmap_sem at all? And
> if not, why not release the read-lock early? And even if it _does_
> need it, why not do

The mmap_sem is needed for the page allocation, or the "vma" can
become a dangling pointer (the vma is passed to alloc_hugepage_vma).

>         ret = mem_cgroup_newpage_charge();
>         up_read(&mm->mmap_sem);
>         if (ret) {
>                 ...
>
> finally, the #ifdef CONFIG_NUMA is ugly, but it's ugly in the return
> path of the function too, and the nicer way would probably be to have
> it in one place and do something like
>
>         /*
>          * The allocation rules are different for the NUMA/non-NUMA cases
>          * For the NUMA case, we allocate here, for the non-numa case we
>          * use the allocation in *hpage
>          */
>         static inline struct page *collapse_alloc_hugepage(struct page **hpage)
>         {
>         #ifdef CONFIG_NUMA
>                 VM_BUG_ON(*hpage);
>                 return alloc_hugepage_vma(khugepaged_defrag(), vma, address, node);
>         #else
>                 VM_BUG_ON(!*hpage);
>                 return *hpage;
>         #endif
>         }
>
>         static inline void collapse_free_hugepage(struct page *page)
>         {
>         #ifdef CONFIG_NUMA
>                 put_page(page);
>         #else
>                 /* Nothing to do */
>         #endif
>         }
>
> and use that instead. The point being that the #ifdef'fery now ends up
> being in a much more targeted area and much better abstracted, rather
> than in the middle of code, and ugly as sin.

Agreed about the cleanup. I didn't add a new function for it like I
probably should have, to make it less ugly...

About mem_cgroup_newpage_charge, I think you're right that it won't
need the mmap_sem. Running it under it is surely safe. But if it's not
needed, we can move the up_read before the mem_cgroup_newpage_charge
like you suggested. Johannes/Minchan, could you confirm the mmap_sem
isn't needed around mem_cgroup_newpage_charge? The mm and new_page are
stable without the mmap_sem; only the vma goes away, but the memcg
shouldn't care.

> But as mentioned, the lock handling is disgusting. Why is it even safe
> to drop and re-take the lock at all?

We have to upgrade the rwsem from read to write (a hugepmd isn't
allowed to materialize under code that runs with the mmap_sem in read
mode; that's one of the invariants needed to be safe when
split_huge_page_pmd does nothing and lets the regular pte walk go
ahead). It is safe because we revalidate the vma after dropping the
read lock and taking the write lock.

It's generally impossible to upgrade the lock without first dropping
it if more than one thread does that in parallel on the same mm (they
all hold the read lock, so somebody has to drop the read lock and
revalidate before anyone has a chance to take the write lock).

Now interestingly I notice that in this very case khugepaged is single
threaded and no other place would call upgrade_read() on the mmap_sem
anywhere, so it probably would be safe, but there's no method for that
(because it'd need to be called by at most one thread at once on the
same mm, and that's probably not considered a useful case, even if it
probably would be in collapse_huge_page).

If we ever thread khugepaged to run on more than one core then we'd be
forced to drop the lock too (unless we make the scan on the same mm
mutually exclusive, which isn't the best for a large mm anyway). But I
doubt we'll ever need to thread khugepaged; it's blazing fast (unlike
the ksmd scan).

So if we implement an upgrade_read() we might be able to remove a
find_vma in collapse_huge_page. It's not very important to optimize
though, as the memory copy should be the biggest cost anyway.

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [PATCH] thp+memcg-numa: fix BUG at include/linux/mm.h:370!
  2011-03-14 17:17 ` Andrea Arcangeli
@ 2011-03-14 19:58   ` Johannes Weiner
  2011-03-14 23:42     ` KAMEZAWA Hiroyuki
  2011-04-01 21:21     ` Andrea Arcangeli
  0 siblings, 2 replies; 14+ messages in thread
From: Johannes Weiner @ 2011-03-14 19:58 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Linus Torvalds, Hugh Dickins, Andrew Morton, David Rientjes, linux-kernel, linux-mm, Minchan Kim

On Mon, Mar 14, 2011 at 06:17:31PM +0100, Andrea Arcangeli wrote:
> On Mon, Mar 14, 2011 at 09:56:10AM -0700, Linus Torvalds wrote:
> > Does mem_cgroup_newpage_charge() even _need_ the mmap_sem at all? And
> > if not, why not release the read-lock early? And even if it _does_
> > need it, why not do

[...]

> About mem_cgroup_newpage_charge I think you're right it won't need the
> mmap_sem. Running it under it is sure safe. But if it's not needed we
> can move the up_read before the mem_cgroup_newpage_charge like you
> suggested. Johannes/Minchan could you confirm the mmap_sem isn't
> needed around mem_cgroup_newpage_charge? The mm and new_page are
> stable without the mmap_sem, only the vma goes away but the memcg
> shouldn't care.

We don't care about the vma. It's all about assigning the physical
page to the memcg that mm->owner belongs to.

It would be the first callsite not holding the mmap_sem, but that is
only because all existing sites are fault handlers that don't drop the
lock for other reasons.

I am not aware of anything that would rely on the lock in there, or
would not deserve to break if it did.

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [PATCH] thp+memcg-numa: fix BUG at include/linux/mm.h:370!
  2011-03-14 19:58 ` Johannes Weiner
@ 2011-03-14 23:42   ` KAMEZAWA Hiroyuki
  2011-04-01 21:21   ` Andrea Arcangeli
  0 siblings, 0 replies; 14+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-03-14 23:42 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrea Arcangeli, Linus Torvalds, Hugh Dickins, Andrew Morton, David Rientjes, linux-kernel, linux-mm, Minchan Kim

On Mon, 14 Mar 2011 20:58:23 +0100
Johannes Weiner <jweiner@redhat.com> wrote:

> On Mon, Mar 14, 2011 at 06:17:31PM +0100, Andrea Arcangeli wrote:
> > On Mon, Mar 14, 2011 at 09:56:10AM -0700, Linus Torvalds wrote:
> > > Does mem_cgroup_newpage_charge() even _need_ the mmap_sem at all? And
> > > if not, why not release the read-lock early? And even if it _does_
> > > need it, why not do
>
> [...]
>
> > About mem_cgroup_newpage_charge I think you're right it won't need the
> > mmap_sem. Running it under it is sure safe. But if it's not needed we
> > can move the up_read before the mem_cgroup_newpage_charge like you
> > suggested. Johannes/Minchan could you confirm the mmap_sem isn't
> > needed around mem_cgroup_newpage_charge? The mm and new_page are
> > stable without the mmap_sem, only the vma goes away but the memcg
> > shouldn't care.
>
> We don't care about the vma. It's all about assigning the physical
> page to the memcg that mm->owner belongs to.
>
> It would be the first callsite not holding the mmap_sem, but that is
> only because all existing sites are fault handlers that don't drop the
> lock for other reasons.
>
> I am not aware of anything that would rely on the lock in there, or
> would not deserve to break if it did.

The mmap_sem is not required to be held, provided an uncharge() is
done when the vma turns out to be a stale pointer.

Thanks,
-Kame

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [PATCH] thp+memcg-numa: fix BUG at include/linux/mm.h:370!
  2011-03-14 19:58 ` Johannes Weiner
  2011-03-14 23:42   ` KAMEZAWA Hiroyuki
@ 2011-04-01 21:21   ` Andrea Arcangeli
  1 sibling, 0 replies; 14+ messages in thread
From: Andrea Arcangeli @ 2011-04-01 21:21 UTC (permalink / raw)
To: Johannes Weiner
Cc: Linus Torvalds, Hugh Dickins, Andrew Morton, David Rientjes, linux-kernel, linux-mm, Minchan Kim

On Mon, Mar 14, 2011 at 08:58:23PM +0100, Johannes Weiner wrote:
> We don't care about the vma. It's all about assigning the physical
> page to the memcg that mm->owner belongs to.
>
> It would be the first callsite not holding the mmap_sem, but that is
> only because all existing sites are fault handlers that don't drop the
> lock for other reasons.

I was afraid it'd be the first callsite; this is why I wasn't excited
to push it in 2.6.38, but Linus is right and we should micro-optimize
it for 2.6.39.

> I am not aware of anything that would rely on the lock in there, or
> would not deserve to break if it did.

Thanks for double checking. What about this? (The only problem is that
the thp-vmstat patch in -mm will then reject; maybe I should rediff it
against -mm instead, as you wish.)

===
Subject: thp: optimize memcg charge in khugepaged

From: Andrea Arcangeli <aarcange@redhat.com>

We don't need to hold the mmap_sem through
mem_cgroup_newpage_charge(); the mmap_sem is only held to keep the vma
stable, and we don't need the vma stable anymore after we return from
alloc_hugepage_vma().

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0a619e0..c61d9ad 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1762,12 +1762,9 @@ static void collapse_huge_page(struct mm_struct *mm,
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);

 #ifndef CONFIG_NUMA
+	up_read(&mm->mmap_sem);
 	VM_BUG_ON(!*hpage);
 	new_page = *hpage;
-	if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
-		up_read(&mm->mmap_sem);
-		return;
-	}
 #else
 	VM_BUG_ON(*hpage);
 	/*
@@ -1782,20 +1779,20 @@ static void collapse_huge_page(struct mm_struct *mm,
 	 */
 	new_page = alloc_hugepage_vma(khugepaged_defrag(), vma, address,
 				      node, __GFP_OTHER_NODE);
+	/* after allocating the hugepage upgrade to mmap_sem write mode */
+	up_read(&mm->mmap_sem);
 	if (unlikely(!new_page)) {
-		up_read(&mm->mmap_sem);
 		*hpage = ERR_PTR(-ENOMEM);
 		return;
 	}
+#endif
+
 	if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
-		up_read(&mm->mmap_sem);
+#ifdef CONFIG_NUMA
 		put_page(new_page);
+#endif
 		return;
 	}
-#endif
-
-	/* after allocating the hugepage upgrade to mmap_sem write mode */
-	up_read(&mm->mmap_sem);

 	/*
 	 * Prevent all access to pagetables with the exception of

^ permalink raw reply related	[flat|nested] 14+ messages in thread
* [PATCH] mm: PageBuddy and mapcount underflows robustness
  2011-03-14 16:37 ` Hugh Dickins
  2011-03-14 16:56   ` Linus Torvalds
@ 2011-03-14 16:59   ` Andrea Arcangeli
  2011-03-14 17:30     ` Linus Torvalds
  1 sibling, 1 reply; 14+ messages in thread
From: Andrea Arcangeli @ 2011-03-14 16:59 UTC (permalink / raw)
To: Hugh Dickins
Cc: Linus Torvalds, Andrew Morton, David Rientjes, linux-kernel, linux-mm

Hi Hugh,

On Mon, Mar 14, 2011 at 09:37:43AM -0700, Hugh Dickins wrote:
> version - which Linus has now taken.

That was fast, so that solves all merging order rejects in the first
place, as we'll all rebase on upstream.

> I was certainly tempted to remove all the non-NUMA cases, but as you
> say, now is not the right time for that since we needed a quick bugfix.

Correct. Too much risk in making the not-NUMA case work like the NUMA
case.

> I do appreciate why you did it that way, it is nicer to be allocating

BTW, one reason it was done this way is also because the not-NUMA case
was the original code, and when I added the NUMA awareness to
khugepaged I didn't want to risk regressions to the existing case that
I knew worked fine.

> on the outside, though unsuitable in the NUMA case. But given how this
> bug has passed unnoticed for so long, it seems like nobody has been
> testing non-NUMA, so yes, best to simplify and make non-NUMA do the
> same as NUMA in 2.6.39.

These days the NUMA case clearly gets more testing than the not-NUMA
case, especially for features like memcg that are mostly needed on
NUMA systems and not so much on small laptops or other not-NUMA
hardware. I'm unsure if it's so urgent to remove it; maybe a little
benchmarking with khugepaged at 100% load may be worth trying first.
But if there's no real change in the rate of increase of the
khugepaged/pages_collapsed counter, I'm surely ok if it gets removed
later.

> Since Linus has taken my version that you didn't like, perhaps you can

I'm ok with your version... no problem.

> get even by sending him your "mm: PageBuddy cleanups" patch, the version
> I didn't like (for its silly branches) so was reluctant to push myself.

Ok, that slipped even my own aa.git tree, as it was more an RFC when I
posted it and it didn't fix a real bug but just made the code more
robust at runtime in case of hardware memory corruption or software
bugs.

> I'd really like to see that fix in, since it's a little hard to argue
> for in -stable, being all about a system which is already unstable.
>
> But I think it needs a stronger title than "PageBuddy cleanups" -
> "fix BUG in bad_page()"?

I think this can comfortably wait for 2.6.39-rc. I think it's best if
Andrew merges it so it gets digested through -mm for a while. But let
me know if you prefer something else. So here it is with a slightly
updated header.

===
Subject: mm: PageBuddy and mapcount underflows robustness

From: Andrea Arcangeli <aarcange@redhat.com>

bad_page could VM_BUG_ON(!PageBuddy(page)) inside __ClearPageBuddy().
I prefer to keep the VM_BUG_ON for safety and to add an if to solve
it.

Change the _mapcount value indicating PageBuddy from -2 to -1024*1024
for more robustness against page_mapcount() underflows.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Hugh Dickins <hughd@google.com>
---
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f6385fc..fa16ba0 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -402,16 +402,22 @@ static inline void init_page_count(struct page *page)
 /*
  * PageBuddy() indicate that the page is free and in the buddy system
  * (see mm/page_alloc.c).
+ *
+ * PAGE_BUDDY_MAPCOUNT_VALUE must be <= -2 but better not too close to
+ * -2 so that an underflow of the page_mapcount() won't be mistaken
+ * for a genuine PAGE_BUDDY_MAPCOUNT_VALUE.
  */
+#define PAGE_BUDDY_MAPCOUNT_VALUE (-1024*1024)
+
 static inline int PageBuddy(struct page *page)
 {
-	return atomic_read(&page->_mapcount) == -2;
+	return atomic_read(&page->_mapcount) == PAGE_BUDDY_MAPCOUNT_VALUE;
 }

 static inline void __SetPageBuddy(struct page *page)
 {
 	VM_BUG_ON(atomic_read(&page->_mapcount) != -1);
-	atomic_set(&page->_mapcount, -2);
+	atomic_set(&page->_mapcount, PAGE_BUDDY_MAPCOUNT_VALUE);
 }

 static inline void __ClearPageBuddy(struct page *page)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a873e61..8aac134 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -286,7 +286,9 @@ static void bad_page(struct page *page)

 	/* Don't complain about poisoned pages */
 	if (PageHWPoison(page)) {
-		__ClearPageBuddy(page);
+		/* __ClearPageBuddy VM_BUG_ON(!PageBuddy(page)) */
+		if (PageBuddy(page))
+			__ClearPageBuddy(page);
 		return;
 	}

@@ -317,7 +319,8 @@ static void bad_page(struct page *page)
 	dump_stack();
 out:
 	/* Leave bad fields for debug, except PageBuddy could make trouble */
-	__ClearPageBuddy(page);
+	if (PageBuddy(page)) /* __ClearPageBuddy VM_BUG_ON(!PageBuddy(page)) */
+		__ClearPageBuddy(page);
 	add_taint(TAINT_BAD_PAGE);
 }

^ permalink raw reply related	[flat|nested] 14+ messages in thread
* Re: [PATCH] mm: PageBuddy and mapcount underflows robustness
  2011-03-14 16:59 ` [PATCH] mm: PageBuddy and mapcount underflows robustness Andrea Arcangeli
@ 2011-03-14 17:30   ` Linus Torvalds
  2011-03-17 23:16     ` Andrea Arcangeli
  0 siblings, 1 reply; 14+ messages in thread
From: Linus Torvalds @ 2011-03-14 17:30 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Hugh Dickins, Andrew Morton, David Rientjes, linux-kernel, linux-mm

On Mon, Mar 14, 2011 at 9:59 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
>
> +#define PAGE_BUDDY_MAPCOUNT_VALUE (-1024*1024)

I realize that this is a nitpick, but from a code generation
standpoint, large random constants like these are just nasty.

I would suggest aiming for constants that are easy to generate and/or
fit better in the code stream. In many encoding schemes (eg x86), -128
is much easier to generate, since it fits in a signed byte and allows
small instructions etc. And in most RISC encodings, 8- or 16-bit
constants can be encoded much more easily than something like your
current one, and bigger ones often end up resulting in a load from
memory or at least several immediate-building instructions.

> -       __ClearPageBuddy(page);
> +       if (PageBuddy(page)) /* __ClearPageBuddy VM_BUG_ON(!PageBuddy(page)) */
> +               __ClearPageBuddy(page);

Also, this is just disgusting. It adds no safety here to have that
VM_BUG_ON(), so it's just unnecessary code generation to do this.
Also, we don't even WANT to do that stupid "__ClearPageBuddy()" in the
first place! What those two code-sites actually want are just a simple

        reset_page_mapcount(page);

which does the right thing in _general_, and not just for the buddy
case - we want to reset the mapcount for other reasons than just
pagebuddy (ie the underflow/overflow case).

And it avoids the VM_BUG_ON() too, making the crazy conditionals be not needed.

No?

                         Linus

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [PATCH] mm: PageBuddy and mapcount underflows robustness
  2011-03-14 17:30 ` Linus Torvalds
@ 2011-03-17 23:16 ` Andrea Arcangeli
  2011-03-18 21:34 ` Hugh Dickins
  0 siblings, 1 reply; 14+ messages in thread
From: Andrea Arcangeli @ 2011-03-17 23:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Andrew Morton, David Rientjes, linux-kernel, linux-mm

On Mon, Mar 14, 2011 at 10:30:11AM -0700, Linus Torvalds wrote:
> On Mon, Mar 14, 2011 at 9:59 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> >
> > +#define PAGE_BUDDY_MAPCOUNT_VALUE (-1024*1024)
>
> I realize that this is a nitpick, but from a code generation
> standpoint, large random constants like these are just nasty.
>
> I would suggest aiming for constants that are easy to generate and/or
> fit better in the code stream. In many encoding schemes (eg x86), -128
> is much easier to generate, since it fits in a signed byte and allows
> small instructions etc. And in most RISC encodings, 8- or 16-bit
> constants can be encoded much more easily than something like your
> current one, and bigger ones often end up resulting in a load from
> memory or at least several immediate-building instructions.

-128 sure is ok with me. It's extremely unlikely to be off by 127
times, and if we're off by more than 127 times we're still going to
detect the error.

> Also, this is just disgusting. It adds no safety here to have that
> VM_BUG_ON(), so it's just unnecessary code generation to do this.
> Also, we don't even WANT to do that stupid "__ClearPageBuddy()" in the
> first place! What those two code-sites actually want are just a simple
>
>     reset_page_mapcount(page);
>
> which does the right thing in _general_, and not just for the buddy
> case - we want to reset the mapcount for other reasons than just
> pagebuddy (ie the underflow/overflow case).

Agreed.

> And it avoids the VM_BUG_ON() too, making the crazy conditionals be
> not needed.

Well, using reset_page_mapcount in the two error sites didn't require
me to remove the VM_BUG_ON from __ClearPageBuddy, so I left it there...

===
Subject: mm: PageBuddy and mapcount robustness

From: Andrea Arcangeli <aarcange@redhat.com>

Change the _mapcount value indicating PageBuddy from -2 to -128 for
more robustness against page_mapcount() underflows.

Use reset_page_mapcount instead of __ClearPageBuddy in bad_page to
ignore the previous retval of PageBuddy().

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Hugh Dickins <hughd@google.com>
---
 include/linux/mm.h |   11 +++++++++--
 mm/page_alloc.c    |    4 ++--
 2 files changed, 11 insertions(+), 4 deletions(-)

--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -402,16 +402,23 @@ static inline void init_page_count(struct page *page)
 /*
  * PageBuddy() indicate that the page is free and in the buddy system
  * (see mm/page_alloc.c).
+ *
+ * PAGE_BUDDY_MAPCOUNT_VALUE must be <= -2 but better not too close to
+ * -2 so that an underflow of the page_mapcount() won't be mistaken
+ * for a genuine PAGE_BUDDY_MAPCOUNT_VALUE. -128 can be created very
+ * efficiently by most CPU architectures.
  */
+#define PAGE_BUDDY_MAPCOUNT_VALUE (-128)
+
 static inline int PageBuddy(struct page *page)
 {
-	return atomic_read(&page->_mapcount) == -2;
+	return atomic_read(&page->_mapcount) == PAGE_BUDDY_MAPCOUNT_VALUE;
 }

 static inline void __SetPageBuddy(struct page *page)
 {
 	VM_BUG_ON(atomic_read(&page->_mapcount) != -1);
-	atomic_set(&page->_mapcount, -2);
+	atomic_set(&page->_mapcount, PAGE_BUDDY_MAPCOUNT_VALUE);
 }

 static inline void __ClearPageBuddy(struct page *page)
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -286,7 +286,7 @@ static void bad_page(struct page *page)

 	/* Don't complain about poisoned pages */
 	if (PageHWPoison(page)) {
-		__ClearPageBuddy(page);
+		reset_page_mapcount(page); /* remove PageBuddy */
 		return;
 	}

@@ -317,7 +317,7 @@ static void bad_page(struct page *page)
 	dump_stack();
 out:
 	/* Leave bad fields for debug, except PageBuddy could make trouble */
-	__ClearPageBuddy(page);
+	reset_page_mapcount(page); /* remove PageBuddy */
 	add_taint(TAINT_BAD_PAGE);
 }
* Re: [PATCH] mm: PageBuddy and mapcount underflows robustness
  2011-03-17 23:16 ` Andrea Arcangeli
@ 2011-03-18 21:34 ` Hugh Dickins
  2011-03-23 22:58 ` [stable] " Greg KH
  0 siblings, 1 reply; 14+ messages in thread
From: Hugh Dickins @ 2011-03-18 21:34 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linus Torvalds, Andrew Morton, David Rientjes, linux-kernel, linux-mm, stable

On Fri, 18 Mar 2011, Andrea Arcangeli wrote:
> On Mon, Mar 14, 2011 at 10:30:11AM -0700, Linus Torvalds wrote:
> Subject: mm: PageBuddy and mapcount robustness
>
> From: Andrea Arcangeli <aarcange@redhat.com>
>
> Change the _mapcount value indicating PageBuddy from -2 to -128 for
> more robustness against page_mapcount() underflows.
>
> Use reset_page_mapcount instead of __ClearPageBuddy in bad_page to
> ignore the previous retval of PageBuddy().
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Reported-by: Hugh Dickins <hughd@google.com>

Yes, this version satisfies my objections too.
I'd say Acked-by, but I see that it's already in, great.

I've Cc'ed stable@kernel.org: please can we have this in 2.6.38.1,
since 2.6.38 regressed the recovery from bad page states,
inadvertently changing them to a fatal error when CONFIG_DEBUG_VM.

Thanks,
Hugh

commit ef2b4b95a63a1d23958dcb99eb2c6898eddc87d0
Author: Andrea Arcangeli <aarcange@redhat.com>
Date:   Fri Mar 18 00:16:35 2011 +0100

    mm: PageBuddy and mapcount robustness

    Change the _mapcount value indicating PageBuddy from -2 to -128 for
    more robustness against page_mapcount() underflows.

    Use reset_page_mapcount instead of __ClearPageBuddy in bad_page to
    ignore the previous retval of PageBuddy().

    Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
    Reported-by: Hugh Dickins <hughd@google.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 679300c..ff83798 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -402,16 +402,23 @@ static inline void init_page_count(struct page *page)
 /*
  * PageBuddy() indicate that the page is free and in the buddy system
  * (see mm/page_alloc.c).
+ *
+ * PAGE_BUDDY_MAPCOUNT_VALUE must be <= -2 but better not too close to
+ * -2 so that an underflow of the page_mapcount() won't be mistaken
+ * for a genuine PAGE_BUDDY_MAPCOUNT_VALUE. -128 can be created very
+ * efficiently by most CPU architectures.
  */
+#define PAGE_BUDDY_MAPCOUNT_VALUE (-128)
+
 static inline int PageBuddy(struct page *page)
 {
-	return atomic_read(&page->_mapcount) == -2;
+	return atomic_read(&page->_mapcount) == PAGE_BUDDY_MAPCOUNT_VALUE;
 }

 static inline void __SetPageBuddy(struct page *page)
 {
 	VM_BUG_ON(atomic_read(&page->_mapcount) != -1);
-	atomic_set(&page->_mapcount, -2);
+	atomic_set(&page->_mapcount, PAGE_BUDDY_MAPCOUNT_VALUE);
 }

 static inline void __ClearPageBuddy(struct page *page)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bd76256..7945247 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -286,7 +286,7 @@ static void bad_page(struct page *page)

 	/* Don't complain about poisoned pages */
 	if (PageHWPoison(page)) {
-		__ClearPageBuddy(page);
+		reset_page_mapcount(page); /* remove PageBuddy */
 		return;
 	}

@@ -317,7 +317,7 @@ static void bad_page(struct page *page)
 	dump_stack();
 out:
 	/* Leave bad fields for debug, except PageBuddy could make trouble */
-	__ClearPageBuddy(page);
+	reset_page_mapcount(page); /* remove PageBuddy */
 	add_taint(TAINT_BAD_PAGE);
 }
* Re: [stable] [PATCH] mm: PageBuddy and mapcount underflows robustness
  2011-03-18 21:34 ` Hugh Dickins
@ 2011-03-23 22:58 ` Greg KH
  0 siblings, 0 replies; 14+ messages in thread
From: Greg KH @ 2011-03-23 22:58 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrea Arcangeli, linux-kernel, linux-mm, David Rientjes, Andrew Morton, Linus Torvalds, stable

On Fri, Mar 18, 2011 at 02:34:03PM -0700, Hugh Dickins wrote:
> On Fri, 18 Mar 2011, Andrea Arcangeli wrote:
> > On Mon, Mar 14, 2011 at 10:30:11AM -0700, Linus Torvalds wrote:
> > Subject: mm: PageBuddy and mapcount robustness
> >
> > From: Andrea Arcangeli <aarcange@redhat.com>
> >
> > Change the _mapcount value indicating PageBuddy from -2 to -128 for
> > more robustness against page_mapcount() underflows.
> >
> > Use reset_page_mapcount instead of __ClearPageBuddy in bad_page to
> > ignore the previous retval of PageBuddy().
> >
> > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> > Reported-by: Hugh Dickins <hughd@google.com>
>
> Yes, this version satisfies my objections too.
> I'd say Acked-by, but I see that it's already in, great.
>
> I've Cc'ed stable@kernel.org: please can we have this in 2.6.38.1,
> since 2.6.38 regressed the recovery from bad page states,
> inadvertently changing them to a fatal error when CONFIG_DEBUG_VM.

Now queued up for 2.6.38.2, thanks.

greg k-h
end of thread, other threads: [~2011-04-01 21:21 UTC | newest]

Thread overview: 14+ messages:
2011-03-14  8:08 [PATCH] thp+memcg-numa: fix BUG at include/linux/mm.h:370! Hugh Dickins
2011-03-14 15:25 ` Minchan Kim
2011-03-14 15:52 ` Andrea Arcangeli
2011-03-14 16:37 ` Hugh Dickins
2011-03-14 16:56 ` Linus Torvalds
2011-03-14 17:17 ` Andrea Arcangeli
2011-03-14 19:58 ` Johannes Weiner
2011-03-14 23:42 ` KAMEZAWA Hiroyuki
2011-04-01 21:21 ` Andrea Arcangeli
2011-03-14 16:59 ` [PATCH] mm: PageBuddy and mapcount underflows robustness Andrea Arcangeli
2011-03-14 17:30 ` Linus Torvalds
2011-03-17 23:16 ` Andrea Arcangeli
2011-03-18 21:34 ` Hugh Dickins
2011-03-23 22:58 ` [stable] " Greg KH