* [PATCH] mm: hugetlbfs: fix hugetlbfs optimization @ 2013-11-05 22:10 Andrea Arcangeli 2013-11-07 17:18 ` Khalid Aziz 2013-11-12 19:22 ` Pravin Shelar 0 siblings, 2 replies; 5+ messages in thread From: Andrea Arcangeli @ 2013-11-05 22:10 UTC (permalink / raw) To: Khalid Aziz Cc: gregkh, bhutchings, pshelar, cl, hannes, mel, riel, minchan, andi, akpm, torvalds, linux-mm Hi, this patch is an alternative implementation of the hugetlbfs directio optimization discussed earlier. We've been looking into this with Khalid last week and an earlier version of this patch (fully equivalent as far as CPU overhead is concerned) was benchmarked by Khalid and it didn't degrade performance compared to the PageHuge check in current upstream code, so we should be good. The patch applies cleanly only after reverting 7cb2ef56e6a8b7b368b2e883a0a47d02fed66911, it's much easier to review it in this form as it avoids all the alignment changes. I'll resend to Andrew against current upstream by squashing it with the revert after reviews. I wished to remove the _mapcount tailpage refcounting for slab and hugetlbfs tails too, but if the last put_page of a slab tail happens after the slab page isn't a slab page anymore (but still compound as it wasn't freed yet because of the tail pin), a VM_BUG_ON would trigger during the last (unpinning) put_page(slab_tail) with the mapcount underflow: VM_BUG_ON(page_mapcount(page) <= 0); Not even sure if any driver is doing anything like that, but the current code would allow it, Pravin should know more about when exactly in which conditions the last put_page is done on slab tail pages. It shall be possible to remove the _mapcount refcounting anyway, as it is only read by split_huge_page and so it doesn't actually matter if it underflows, but I prefer to keep the VM_BUG_ON. In fact I added one more VM_BUG_ON(!PageHead()) even in this patch. I also didn't notice we missed a PageHead check before calling __put_single_page(page_head), so I corrected that. It sounds very unlikely that it could have ever triggered but still better to fix it. I just booted it... not very well tested yet. Help with the testing appreciated :). ===== ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] mm: hugetlbfs: fix hugetlbfs optimization 2013-11-05 22:10 [PATCH] mm: hugetlbfs: fix hugetlbfs optimization Andrea Arcangeli @ 2013-11-07 17:18 ` Khalid Aziz 2013-11-12 19:22 ` Pravin Shelar 1 sibling, 0 replies; 5+ messages in thread From: Khalid Aziz @ 2013-11-07 17:18 UTC (permalink / raw) To: Andrea Arcangeli Cc: gregkh, bhutchings, pshelar, cl, hannes, mel, riel, minchan, andi, akpm, torvalds, linux-mm On 11/05/2013 03:10 PM, Andrea Arcangeli wrote: > Hi, > > this patch is an alternative implementation of the hugetlbfs directio > optimization discussed earlier. We've been looking into this with > Khalid last week and an earlier version of this patch (fully > equivalent as far as CPU overhead is concerned) was benchmarked by > Khalid and it didn't degrade performance compared to the PageHuge > check in current upstream code, so we should be good. > > The patch applies cleanly only after reverting > 7cb2ef56e6a8b7b368b2e883a0a47d02fed66911, it's much easier to review > it in this form as it avoids all the alignment changes. I'll resend to > Andrew against current upstream by squashing it with the revert after > reviews. > > I wished to remove the _mapcount tailpage refcounting for slab and > hugetlbfs tails too, but if the last put_page of a slab tail happens > after the slab page isn't a slab page anymore (but still compound as > it wasn't freed yet because of the tail pin), a VM_BUG_ON would > trigger during the last (unpinning) put_page(slab_tail) with the > mapcount underflow: > > VM_BUG_ON(page_mapcount(page) <= 0); > > Not even sure if any driver is doing anything like that, but the > current code would allow it, Pravin should know more about when > exactly in which conditions the last put_page is done on slab tail > pages. > > It shall be possible to remove the _mapcount refcounting anyway, as it > is only read by split_huge_page and so it doesn't actually matter if > it underflows, but I prefer to keep the VM_BUG_ON. In fact I added one > more VM_BUG_ON(!PageHead()) even in this patch. > > I also didn't notice we missed a PageHead check before calling > __put_single_page(page_head), so I corrected that. It sounds very > unlikely that it could have ever triggered but still better to fix it. > > I just booted it... not very well tested yet. Help with the testing > appreciated :). > Hi Andrea, I ran performance tests on this patch. I am seeing 4.5% degradation with 1M random reads, 9.2% degradation with 64K random reads and 13.8% degradation with 8K using the orion tool. Just to double check, I repeated the same tests with the last version of patch we had exchanged before some of the final tweaks you made and degradation with that patch is 0.7% for 1M, 2.3% with 64K and 3% for 8K. Actual numbers are in the table below (all numbers are in MB/s): 3.12.0 3.12.0+this_patch 3.12.0+prev_patch ------- ------------------ ----------------- 1M 8467 8087 (-4.5%) 8409 (-0.7%) 64K 4049 3675 (-9.2%) 3957 (-2.3%) 8K 732 631 (-13.8%) 710 (-3.0%) One of the tweaks is causing problems. I will try to isolate which one. -- Khalid -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] mm: hugetlbfs: fix hugetlbfs optimization 2013-11-05 22:10 [PATCH] mm: hugetlbfs: fix hugetlbfs optimization Andrea Arcangeli 2013-11-07 17:18 ` Khalid Aziz @ 2013-11-12 19:22 ` Pravin Shelar 2013-11-13 16:10 ` Andrea Arcangeli 1 sibling, 1 reply; 5+ messages in thread From: Pravin Shelar @ 2013-11-12 19:22 UTC (permalink / raw) To: Andrea Arcangeli Cc: Khalid Aziz, gregkh, Ben Hutchings, Christoph Lameter, hannes, mel, riel, minchan, andi, Andrew Morton, torvalds, linux-mm On Tue, Nov 5, 2013 at 2:10 PM, Andrea Arcangeli <aarcange@redhat.com> wrote: > Hi, > > this patch is an alternative implementation of the hugetlbfs directio > optimization discussed earlier. We've been looking into this with > Khalid last week and an earlier version of this patch (fully > equivalent as far as CPU overhead is concerned) was benchmarked by > Khalid and it didn't degrade performance compared to the PageHuge > check in current upstream code, so we should be good. > > The patch applies cleanly only after reverting > 7cb2ef56e6a8b7b368b2e883a0a47d02fed66911, it's much easier to review > it in this form as it avoids all the alignment changes. I'll resend to > Andrew against current upstream by squashing it with the revert after > reviews. > > I wished to remove the _mapcount tailpage refcounting for slab and > hugetlbfs tails too, but if the last put_page of a slab tail happens > after the slab page isn't a slab page anymore (but still compound as > it wasn't freed yet because of the tail pin), a VM_BUG_ON would > trigger during the last (unpinning) put_page(slab_tail) with the > mapcount underflow: > > VM_BUG_ON(page_mapcount(page) <= 0); > > Not even sure if any driver is doing anything like that, but the > current code would allow it, Pravin should know more about when > exactly in which conditions the last put_page is done on slab tail > pages. > Yes, This can happen when slab object is directly passed for IO and it is done in few filesystems (ocfs, xfs) when I checked last time. > It shall be possible to remove the _mapcount refcounting anyway, as it > is only read by split_huge_page and so it doesn't actually matter if > it underflows, but I prefer to keep the VM_BUG_ON. In fact I added one > more VM_BUG_ON(!PageHead()) even in this patch. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] mm: hugetlbfs: fix hugetlbfs optimization 2013-11-12 19:22 ` Pravin Shelar @ 2013-11-13 16:10 ` Andrea Arcangeli 2013-11-15 17:00 ` Andrea Arcangeli 0 siblings, 1 reply; 5+ messages in thread From: Andrea Arcangeli @ 2013-11-13 16:10 UTC (permalink / raw) To: Pravin Shelar Cc: Khalid Aziz, gregkh, Ben Hutchings, Christoph Lameter, hannes, mel, riel, minchan, andi, Andrew Morton, torvalds, linux-mm On Tue, Nov 12, 2013 at 11:22:50AM -0800, Pravin Shelar wrote: > On Tue, Nov 5, 2013 at 2:10 PM, Andrea Arcangeli <aarcange@redhat.com> wrote: > > Hi, > > > > this patch is an alternative implementation of the hugetlbfs directio > > optimization discussed earlier. We've been looking into this with > > Khalid last week and an earlier version of this patch (fully > > equivalent as far as CPU overhead is concerned) was benchmarked by > > Khalid and it didn't degrade performance compared to the PageHuge > > check in current upstream code, so we should be good. > > > > The patch applies cleanly only after reverting > > 7cb2ef56e6a8b7b368b2e883a0a47d02fed66911, it's much easier to review > > it in this form as it avoids all the alignment changes. I'll resend to > > Andrew against current upstream by squashing it with the revert after > > reviews. > > > > I wished to remove the _mapcount tailpage refcounting for slab and > > hugetlbfs tails too, but if the last put_page of a slab tail happens > > after the slab page isn't a slab page anymore (but still compound as > > it wasn't freed yet because of the tail pin), a VM_BUG_ON would > > trigger during the last (unpinning) put_page(slab_tail) with the > > mapcount underflow: > > > > VM_BUG_ON(page_mapcount(page) <= 0); > > > > Not even sure if any driver is doing anything like that, but the > > current code would allow it, Pravin should know more about when > > exactly in which conditions the last put_page is done on slab tail > > pages. > > > Yes, This can happen when slab object is directly passed for IO and it > is done in few filesystems (ocfs, xfs) when I checked last time. About the slab case however, it cannot be that the tail pin obtained with get_page(tail_page), is the last reference on the compound page when it gets released through put_page. kfree/kmem_cache_free would allow reuse of the whole compound page as a different slab object without passing through the buddy allocator, so totally disregarding any tail page pin. So in short I believe it's safe to remove the mapcount refcounting from slab tail page pinning because all tail pins should be released before the kfree/kmem_cache_free, and in turn before the PG_slab flag has been cleared (which happens before the slab code releases the last reference count on the slab page to free it). It is also safe to for hugetlbfs as the hugetlbfs destructor is wiped only after the last hugetlbfs reference count has been released and no more put_page can happen then. So I think we've room for optimizations to fill the performance gap compared to the PageHuge check at the top of put_page (which also skips the mapcount tail page refcounting), but I would keep the optimization to skip the tail page refcounting incremental, and it makes sense to apply it to slab too if we do it, so we keep the hugetlbfs and slab cases identical (which is simpler to maintain). -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] mm: hugetlbfs: fix hugetlbfs optimization 2013-11-13 16:10 ` Andrea Arcangeli @ 2013-11-15 17:00 ` Andrea Arcangeli 0 siblings, 0 replies; 5+ messages in thread From: Andrea Arcangeli @ 2013-11-15 17:00 UTC (permalink / raw) To: Pravin Shelar Cc: Khalid Aziz, gregkh, Ben Hutchings, Christoph Lameter, hannes, mel, riel, minchan, andi, Andrew Morton, torvalds, linux-mm Hi, so while optimizing away the _mapcount tail page refcounting for slab and hugetlbfs pages incremental with the fix for the hugetlbfs optimization I just sent, I also noticed another bug in the current code (already fixed by the patch). gup_fast is still increasing _mapcount for hugetlbfs, but it's not decreased in put_page. It's only noticeable if you do: echo 0 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages On current upstream I get: BUG: Bad page state in process bash pfn:59a01 page:ffffea000139b038 count:0 mapcount:10 mapping: (null) index:0x0 page flags: 0x1c00000000008000(tail) Modules linked in: CPU: 6 PID: 2018 Comm: bash Not tainted 3.12.0+ #25 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 0000000000000009 ffff880079cb5cc8 ffffffff81640e8b 0000000000000006 ffffea000139b038 ffff880079cb5ce8 ffffffff8115bb15 00000000000002c1 ffffea000139b038 ffff880079cb5d48 ffffffff8115bd83 ffff880079cb5de8 Call Trace: [<ffffffff81640e8b>] dump_stack+0x55/0x76 [<ffffffff8115bb15>] bad_page+0xd5/0x130 [<ffffffff8115bd83>] free_pages_prepare+0x213/0x280 [<ffffffff8115df16>] __free_pages+0x36/0x80 [<ffffffff8119b011>] update_and_free_page+0xc1/0xd0 [<ffffffff8119b512>] free_pool_huge_page+0xc2/0xe0 [<ffffffff8119b8cc>] set_max_huge_pages.part.58+0x14c/0x220 [<ffffffff81308a8c>] ? _kstrtoull+0x2c/0x90 [<ffffffff8119ba70>] nr_hugepages_store_common.isra.60+0xd0/0xf0 [<ffffffff8119bac3>] nr_hugepages_store+0x13/0x20 [<ffffffff812f763f>] kobj_attr_store+0xf/0x20 [<ffffffff812354e9>] sysfs_write_file+0x189/0x1e0 [<ffffffff811baff5>] vfs_write+0xc5/0x1f0 [<ffffffff811bb505>] SyS_write+0x55/0xb0 [<ffffffff81651712>] system_call_fastpath+0x16/0x1b So good thing I stopped the hugetlbfs optimization from going into stable. I'll send a v2 of this work as a patchset of 3 patches where the first is the same identical patch I already sent but incremental to upstream and it contains all the fixes needed including for the above problem. Patch 1/3 should be applied more urgently as it fixes all those various bugs. The 2/3 and 3/3 can be deferred. The patch 3/3 especially should be benchmarked in the usual 8GB/sec setup before being applied, unless it makes a real difference I wouldn't apply it because it tends to slowdown the THP case a bit and it complicates things a bit more. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2013-11-15 17:01 UTC | newest] Thread overview: 5+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2013-11-05 22:10 [PATCH] mm: hugetlbfs: fix hugetlbfs optimization Andrea Arcangeli 2013-11-07 17:18 ` Khalid Aziz 2013-11-12 19:22 ` Pravin Shelar 2013-11-13 16:10 ` Andrea Arcangeli 2013-11-15 17:00 ` Andrea Arcangeli
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).