* [PATCH] page_alloc.c: inline __rmqueue() @ 2017-10-09 5:44 Aaron Lu 2017-10-09 7:37 ` Anshuman Khandual 2017-10-09 20:23 ` Dave Hansen 0 siblings, 2 replies; 16+ messages in thread From: Aaron Lu @ 2017-10-09 5:44 UTC (permalink / raw) To: linux-mm, lkml Cc: Andrew Morton, Andi Kleen, Dave Hansen, Huang Ying, Tim Chen, Kemi Wang __rmqueue() is called by rmqueue_bulk() and rmqueue() under zone->lock and that lock can be heavily contended with memory intensive applications. Since __rmqueue() is a small function, inline it can save us some time. With the will-it-scale/page_fault1/process benchmark, when using nr_cpu processes to stress buddy: On a 2 sockets Intel-Skylake machine: base %change head 77342 +6.3% 82203 will-it-scale.per_process_ops On a 4 sockets Intel-Skylake machine: base %change head 75746 +4.6% 79248 will-it-scale.per_process_ops This patch adds inline to __rmqueue(). Signed-off-by: Aaron Lu <aaron.lu@intel.com> --- mm/page_alloc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 0e309ce4a44a..c9605c7ebaf6 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2291,7 +2291,7 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype) * Do the hard work of removing an element from the buddy allocator. * Call me with the zone->lock already held. */ -static struct page *__rmqueue(struct zone *zone, unsigned int order, +static inline struct page *__rmqueue(struct zone *zone, unsigned int order, int migratetype) { struct page *page; -- 2.13.6 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH] page_alloc.c: inline __rmqueue() 2017-10-09 5:44 [PATCH] page_alloc.c: inline __rmqueue() Aaron Lu @ 2017-10-09 7:37 ` Anshuman Khandual 2017-10-09 7:53 ` Aaron Lu 2017-10-09 20:23 ` Dave Hansen 1 sibling, 1 reply; 16+ messages in thread From: Anshuman Khandual @ 2017-10-09 7:37 UTC (permalink / raw) To: Aaron Lu, linux-mm, lkml Cc: Andrew Morton, Andi Kleen, Dave Hansen, Huang Ying, Tim Chen, Kemi Wang On 10/09/2017 11:14 AM, Aaron Lu wrote: > __rmqueue() is called by rmqueue_bulk() and rmqueue() under zone->lock > and that lock can be heavily contended with memory intensive applications. > > Since __rmqueue() is a small function, inline it can save us some time. > With the will-it-scale/page_fault1/process benchmark, when using nr_cpu > processes to stress buddy: > > On a 2 sockets Intel-Skylake machine: > base %change head > 77342 +6.3% 82203 will-it-scale.per_process_ops > > On a 4 sockets Intel-Skylake machine: > base %change head > 75746 +4.6% 79248 will-it-scale.per_process_ops > > This patch adds inline to __rmqueue(). > > Signed-off-by: Aaron Lu <aaron.lu@intel.com> Ran it through kernel bench and ebizzy micro benchmarks. Results were comparable with and without the patch. May be these are not the appropriate tests for this inlining improvement. Anyways it does not have any performance degradation either. Reviewed-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> Tested-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] page_alloc.c: inline __rmqueue() 2017-10-09 7:37 ` Anshuman Khandual @ 2017-10-09 7:53 ` Aaron Lu 0 siblings, 0 replies; 16+ messages in thread From: Aaron Lu @ 2017-10-09 7:53 UTC (permalink / raw) To: Anshuman Khandual Cc: linux-mm, lkml, Andrew Morton, Andi Kleen, Dave Hansen, Huang Ying, Tim Chen, Kemi Wang On Mon, Oct 09, 2017 at 01:07:36PM +0530, Anshuman Khandual wrote: > On 10/09/2017 11:14 AM, Aaron Lu wrote: > > __rmqueue() is called by rmqueue_bulk() and rmqueue() under zone->lock > > and that lock can be heavily contended with memory intensive applications. > > > > Since __rmqueue() is a small function, inline it can save us some time. > > With the will-it-scale/page_fault1/process benchmark, when using nr_cpu > > processes to stress buddy: > > > > On a 2 sockets Intel-Skylake machine: > > base %change head > > 77342 +6.3% 82203 will-it-scale.per_process_ops > > > > On a 4 sockets Intel-Skylake machine: > > base %change head > > 75746 +4.6% 79248 will-it-scale.per_process_ops > > > > This patch adds inline to __rmqueue(). > > > > Signed-off-by: Aaron Lu <aaron.lu@intel.com> > > Ran it through kernel bench and ebizzy micro benchmarks. Results > were comparable with and without the patch. May be these are not > the appropriate tests for this inlining improvement. Anyways it I think so. The benefit only appears when the lock contention is huge enough, e.g. perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath is as high as 80% with the workload I have used. > does not have any performance degradation either. > > Reviewed-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> > Tested-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> Thanks! -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] page_alloc.c: inline __rmqueue() 2017-10-09 5:44 [PATCH] page_alloc.c: inline __rmqueue() Aaron Lu 2017-10-09 7:37 ` Anshuman Khandual @ 2017-10-09 20:23 ` Dave Hansen 2017-10-10 2:51 ` Aaron Lu 1 sibling, 1 reply; 16+ messages in thread From: Dave Hansen @ 2017-10-09 20:23 UTC (permalink / raw) To: Aaron Lu, linux-mm, lkml Cc: Andrew Morton, Andi Kleen, Huang Ying, Tim Chen, Kemi Wang On 10/08/2017 10:44 PM, Aaron Lu wrote: > __rmqueue() is called by rmqueue_bulk() and rmqueue() under zone->lock > and that lock can be heavily contended with memory intensive applications. What does "memory intensive" mean? I'd probably just say: "The two __rmqueue() call sites are in very hot page allocator paths." > Since __rmqueue() is a small function, inline it can save us some time. > With the will-it-scale/page_fault1/process benchmark, when using nr_cpu > processes to stress buddy: Please include a description of the test and a link to the source. > On a 2 sockets Intel-Skylake machine: > base %change head > 77342 +6.3% 82203 will-it-scale.per_process_ops What's the unit here? That seems ridiculously low for page_fault1. It's usually in the millions. > On a 4 sockets Intel-Skylake machine: > base %change head > 75746 +4.6% 79248 will-it-scale.per_process_ops It's probably worth noting the reason that this is _less_ beneficial on a larger system. I'd also just put this in text rather than wasting space in tables like that. It took me a few minutes to figure out what the table was trying top say. This is one of those places where LKP output is harmful. Why not just say: This patch improved the benchmark by 6.3% on a 2-socket system and 4.6% on a 4-socket system. > This patch adds inline to __rmqueue(). How much text bloat does this cost? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] page_alloc.c: inline __rmqueue() 2017-10-09 20:23 ` Dave Hansen @ 2017-10-10 2:51 ` Aaron Lu 2017-10-10 2:56 ` [PATCH v2] mm/page_alloc.c: " Aaron Lu 0 siblings, 1 reply; 16+ messages in thread From: Aaron Lu @ 2017-10-10 2:51 UTC (permalink / raw) To: Dave Hansen Cc: linux-mm, lkml, Andrew Morton, Andi Kleen, Huang Ying, Tim Chen, Kemi Wang On Mon, Oct 09, 2017 at 01:23:34PM -0700, Dave Hansen wrote: > On 10/08/2017 10:44 PM, Aaron Lu wrote: > > On a 2 sockets Intel-Skylake machine: > > base %change head > > 77342 +6.3% 82203 will-it-scale.per_process_ops > > What's the unit here? That seems ridiculously low for page_fault1. > It's usually in the millions. per_process_ops = processes/nr_process since nr_process here is nr_cpu, so on the 2 sockets machine with 104 CPUs, processes are 8043568(base) and 8549112(head), which are in the millions as you correctly pointed out. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
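To make the relationship concrete, the figures above can be checked directly against the table in the original changelog: the 2-socket machine has 104 CPUs, so

    will-it-scale.per_process_ops = processes / nr_process
    base: 8043568 / 104 = 77342
    head: 8549112 / 104 = 82203

which are exactly the per_process_ops numbers quoted in the v1 changelog.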
* [PATCH v2] mm/page_alloc.c: inline __rmqueue() 2017-10-10 2:51 ` Aaron Lu @ 2017-10-10 2:56 ` Aaron Lu 2017-10-10 5:19 ` Dave Hansen 0 siblings, 1 reply; 16+ messages in thread From: Aaron Lu @ 2017-10-10 2:56 UTC (permalink / raw) To: linux-mm, lkml Cc: Dave Hansen, Andrew Morton, Andi Kleen, Huang Ying, Tim Chen, Kemi Wang, Anshuman Khandual __rmqueue() is called by rmqueue_bulk() and rmqueue() under zone->lock and the two __rmqueue() call sites are in very hot page allocator paths. Since __rmqueue() is a small function, inlining it can save us some time. With the will-it-scale/page_fault1/process benchmark, when using nr_cpu processes to stress buddy, this patch improved the benchmark by 6.3% on a 2-socket Intel-Skylake system and 4.6% on a 4-socket Intel-Skylake system. The benefit is smaller on the 4-socket machine because the lock contention there (perf-profile/native_queued_spin_lock_slowpath=81%) is less severe than on the 2-socket machine (84%). What the benchmark does is: it forks nr_cpu processes and then each process does the following in a loop: 1 mmap() 128M of anonymous space; 2 write to each page there to trigger actual page allocation; 3 munmap() it. https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault1.c This patch adds inline to __rmqueue() and vmlinux's size doesn't change after this patch according to size(1). without this patch: text data bss dec hex filename 9968576 5793372 17715200 33477148 1fed21c vmlinux with this patch: text data bss dec hex filename 9968576 5793372 17715200 33477148 1fed21c vmlinux Reviewed-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> Tested-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> Signed-off-by: Aaron Lu <aaron.lu@intel.com> --- v2: change commit message according to Dave Hansen's suggestion. mm/page_alloc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 0e309ce4a44a..c9605c7ebaf6 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2291,7 +2291,7 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype) * Do the hard work of removing an element from the buddy allocator. * Call me with the zone->lock already held. */ -static struct page *__rmqueue(struct zone *zone, unsigned int order, +static inline struct page *__rmqueue(struct zone *zone, unsigned int order, int migratetype) { struct page *page; -- 2.13.6 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 16+ messages in thread
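For readers unfamiliar with will-it-scale, the per-process loop described in the changelog above boils down to something like the following sketch. This is an illustration based on that description, not the actual page_fault1.c source (the real harness also handles iteration counting, timing and result reporting), and the function name is made up here:

    #include <unistd.h>
    #include <sys/mman.h>

    #define MAPSIZE (128UL * 1024 * 1024)   /* step 1: 128M of anonymous space */

    static void stress_buddy_once(void)
    {
            long pagesize = sysconf(_SC_PAGESIZE);
            char *p = mmap(NULL, MAPSIZE, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (p == MAP_FAILED)
                    return;
            /* step 2: write to each page to force actual page allocation */
            for (unsigned long off = 0; off < MAPSIZE; off += pagesize)
                    p[off] = 1;
            /* step 3: unmap, returning the pages to the allocator */
            munmap(p, MAPSIZE);
    }

Each of the nr_cpu forked processes runs such an iteration in a loop, so with 4K pages every pass drives roughly 32k page allocations and frees through the buddy allocator, which is what makes zone->lock so hot in this test.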
* Re: [PATCH v2] mm/page_alloc.c: inline __rmqueue() 2017-10-10 2:56 ` [PATCH v2] mm/page_alloc.c: " Aaron Lu @ 2017-10-10 5:19 ` Dave Hansen 2017-10-10 5:43 ` Aaron Lu 0 siblings, 1 reply; 16+ messages in thread From: Dave Hansen @ 2017-10-10 5:19 UTC (permalink / raw) To: Aaron Lu, linux-mm, lkml Cc: Andrew Morton, Andi Kleen, Huang Ying, Tim Chen, Kemi Wang, Anshuman Khandual On 10/09/2017 07:56 PM, Aaron Lu wrote: > This patch adds inline to __rmqueue() and vmlinux' size doesn't have any > change after this patch according to size(1). > > without this patch: > text data bss dec hex filename > 9968576 5793372 17715200 33477148 1fed21c vmlinux > > with this patch: > text data bss dec hex filename > 9968576 5793372 17715200 33477148 1fed21c vmlinux This is unexpected. Could you double-check this, please? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v2] mm/page_alloc.c: inline __rmqueue() 2017-10-10 5:19 ` Dave Hansen @ 2017-10-10 5:43 ` Aaron Lu 2017-10-10 21:45 ` Andrew Morton 0 siblings, 1 reply; 16+ messages in thread From: Aaron Lu @ 2017-10-10 5:43 UTC (permalink / raw) To: Dave Hansen Cc: linux-mm, lkml, Andrew Morton, Andi Kleen, Huang Ying, Tim Chen, Kemi Wang, Anshuman Khandual On Mon, Oct 09, 2017 at 10:19:52PM -0700, Dave Hansen wrote: > On 10/09/2017 07:56 PM, Aaron Lu wrote: > > This patch adds inline to __rmqueue() and vmlinux' size doesn't have any > > change after this patch according to size(1). > > > > without this patch: > > text data bss dec hex filename > > 9968576 5793372 17715200 33477148 1fed21c vmlinux > > > > with this patch: > > text data bss dec hex filename > > 9968576 5793372 17715200 33477148 1fed21c vmlinux > > This is unexpected. Could you double-check this, please? mm/page_alloc.o has size changes: Without this patch: $ size mm/page_alloc.o text data bss dec hex filename 36695 9792 8396 54883 d663 mm/page_alloc.o With this patch: $ size mm/page_alloc.o text data bss dec hex filename 37511 9792 8396 55699 d993 mm/page_alloc.o But vmlinux doesn't. It's not clear to me what happened, do you want to me dig this out? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v2] mm/page_alloc.c: inline __rmqueue() 2017-10-10 5:43 ` Aaron Lu @ 2017-10-10 21:45 ` Andrew Morton 2017-10-10 22:27 ` Andi Kleen 2017-10-11 2:34 ` Aaron Lu 0 siblings, 2 replies; 16+ messages in thread From: Andrew Morton @ 2017-10-10 21:45 UTC (permalink / raw) To: Aaron Lu Cc: Dave Hansen, linux-mm, lkml, Andi Kleen, Huang Ying, Tim Chen, Kemi Wang, Anshuman Khandual On Tue, 10 Oct 2017 13:43:43 +0800 Aaron Lu <aaron.lu@intel.com> wrote: > On Mon, Oct 09, 2017 at 10:19:52PM -0700, Dave Hansen wrote: > > On 10/09/2017 07:56 PM, Aaron Lu wrote: > > > This patch adds inline to __rmqueue() and vmlinux' size doesn't have any > > > change after this patch according to size(1). > > > > > > without this patch: > > > text data bss dec hex filename > > > 9968576 5793372 17715200 33477148 1fed21c vmlinux > > > > > > with this patch: > > > text data bss dec hex filename > > > 9968576 5793372 17715200 33477148 1fed21c vmlinux > > > > This is unexpected. Could you double-check this, please? > > mm/page_alloc.o has size changes: > > Without this patch: > $ size mm/page_alloc.o > text data bss dec hex filename > 36695 9792 8396 54883 d663 mm/page_alloc.o > > With this patch: > $ size mm/page_alloc.o > text data bss dec hex filename > 37511 9792 8396 55699 d993 mm/page_alloc.o > > But vmlinux doesn't. > > It's not clear to me what happened, do you want to me dig this out? There's weird stuff going on. With x86_64 gcc-4.8.4 Patch not applied: akpm3:/usr/local/google/home/akpm/k/25> nm mm/page_alloc.o|grep __rmqueue 0000000000002a00 t __rmqueue Patch applied: akpm3:/usr/local/google/home/akpm/k/25> nm mm/page_alloc.o|grep __rmqueue 000000000000039f t __rmqueue_fallback 0000000000001220 t __rmqueue_smallest So inlining __rmqueue has caused the compiler to decide to uninline __rmqueue_fallback and __rmqueue_smallest, which largely undoes the effect of your patch. `inline' is basically advisory (or ignored) in modern gcc's. So gcc has felt free to ignore it in __rmqueue_fallback and __rmqueue_smallest because gcc thinks it knows best. That's why we created __always_inline, to grab gcc by the scruff of its neck. So... I think this patch could do with quite a bit more care, tuning and testing with various gcc versions. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
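The distinction Andrew draws can be seen outside the kernel too. Below is a small, self-contained userspace illustration (a hypothetical demo file, not kernel code): plain inline is only a hint that gcc is free to ignore, while the always_inline function attribute, which is essentially what the kernel's __always_inline macro adds, forces the call to be expanded at the call site. Compare the generated code, e.g. with objdump -d, at -O0 where gcc normally does no inlining on its own:

    /* inline_demo.c: build with "gcc -O0 -c inline_demo.c" */
    static inline int advisory(int x)
    {
            /* plain inline: only advisory, gcc may still emit a real call */
            return x * 41 + 7;
    }

    static inline __attribute__((always_inline)) int forced(int x)
    {
            /* always_inline: gcc expands this at every call site,
             * or errors out if it cannot */
            return x * 13 + 3;
    }

    int caller(int x)
    {
            return advisory(x) + forced(x);
    }

At -O0 the call to advisory() typically remains an out-of-line call while forced() is folded into caller(); at -O2 gcc usually inlines both. That freedom on the compiler's part is exactly why the plain-inline v2 patch behaved differently between builds.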
* Re: [PATCH v2] mm/page_alloc.c: inline __rmqueue() 2017-10-10 21:45 ` Andrew Morton @ 2017-10-10 22:27 ` Andi Kleen 2017-10-11 2:34 ` Aaron Lu 1 sibling, 0 replies; 16+ messages in thread From: Andi Kleen @ 2017-10-10 22:27 UTC (permalink / raw) To: Andrew Morton Cc: Aaron Lu, Dave Hansen, linux-mm, lkml, Huang Ying, Tim Chen, Kemi Wang, Anshuman Khandual > `inline' is basically advisory (or ignored) in modern gcc's. So gcc > has felt free to ignore it in __rmqueue_fallback and __rmqueue_smallest > because gcc thinks it knows best. That's why we created > __always_inline, to grab gcc by the scruff of its neck. > > So... I think this patch could do with quite a bit more care, tuning > and testing with various gcc versions. We should just everything in the hot path mark __always_inline. -Andi -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v2] mm/page_alloc.c: inline __rmqueue() 2017-10-10 21:45 ` Andrew Morton 2017-10-10 22:27 ` Andi Kleen @ 2017-10-11 2:34 ` Aaron Lu 2017-10-13 6:31 ` [PATCH] mm/page_alloc: make sure __rmqueue() etc. always inline Aaron Lu 1 sibling, 1 reply; 16+ messages in thread From: Aaron Lu @ 2017-10-11 2:34 UTC (permalink / raw) To: Andrew Morton Cc: Dave Hansen, linux-mm, lkml, Andi Kleen, Huang Ying, Tim Chen, Kemi Wang, Anshuman Khandual On Tue, Oct 10, 2017 at 02:45:45PM -0700, Andrew Morton wrote: > On Tue, 10 Oct 2017 13:43:43 +0800 Aaron Lu <aaron.lu@intel.com> wrote: > > > On Mon, Oct 09, 2017 at 10:19:52PM -0700, Dave Hansen wrote: > > > On 10/09/2017 07:56 PM, Aaron Lu wrote: > > > > This patch adds inline to __rmqueue() and vmlinux' size doesn't have any > > > > change after this patch according to size(1). > > > > > > > > without this patch: > > > > text data bss dec hex filename > > > > 9968576 5793372 17715200 33477148 1fed21c vmlinux > > > > > > > > with this patch: > > > > text data bss dec hex filename > > > > 9968576 5793372 17715200 33477148 1fed21c vmlinux > > > > > > This is unexpected. Could you double-check this, please? > > > > mm/page_alloc.o has size changes: > > > > Without this patch: > > $ size mm/page_alloc.o > > text data bss dec hex filename > > 36695 9792 8396 54883 d663 mm/page_alloc.o > > > > With this patch: > > $ size mm/page_alloc.o > > text data bss dec hex filename > > 37511 9792 8396 55699 d993 mm/page_alloc.o > > > > But vmlinux doesn't. > > > > It's not clear to me what happened, do you want to me dig this out? > > There's weird stuff going on. > > With x86_64 gcc-4.8.4 > > Patch not applied: > > akpm3:/usr/local/google/home/akpm/k/25> nm mm/page_alloc.o|grep __rmqueue > 0000000000002a00 t __rmqueue > > Patch applied: > > akpm3:/usr/local/google/home/akpm/k/25> nm mm/page_alloc.o|grep __rmqueue > 000000000000039f t __rmqueue_fallback > 0000000000001220 t __rmqueue_smallest > > So inlining __rmqueue has caused the compiler to decide to uninline > __rmqueue_fallback and __rmqueue_smallest, which largely undoes the > effect of your patch. > > `inline' is basically advisory (or ignored) in modern gcc's. So gcc > has felt free to ignore it in __rmqueue_fallback and __rmqueue_smallest > because gcc thinks it knows best. That's why we created > __always_inline, to grab gcc by the scruff of its neck. This is a good point and I agree with Andi to use always_inline for those functions that we really want to inline. > > So... I think this patch could do with quite a bit more care, tuning > and testing with various gcc versions. I did some more testing. With x86_64 gcc-4.6.3 available from kernel.org crosstool: Patch not applied: [aaron@aaronlu linux]$ nm mm/page_alloc.o |grep __rmqueue 00000000000023f0 t __rmqueue 00000000000027c0 t __rmqueue_pcplist.isra.95 Patch applied: [aaron@aaronlu linux]$ nm mm/page_alloc.o |grep __rmqueue 0000000000002950 t __rmqueue_pcplist.isra.95 Works expected. With self built x86_64 gcc-4.8.4: Patch not applied: [aaron@aaronlu linux]$ nm mm/page_alloc.o |grep __rmqueue 0000000000001f20 t __rmqueue Patch applied: [aaron@aaronlu linux]$ nm mm/page_alloc.o |grep __rmqueue Works expected.(conflicts with your result though). I also tested gcc-4.9.4, gcc-5.3.1, gcc-6.4.0 and gcc-7.2.1, all have the same output as the above gcc-4.8.4. Then I realized CONFIG_OPTIMIZE_INLINING which I always disabled as suggested by the help message(If unsure, say N). Turnining that config on indeed caused gcc-4.8.4 to emit __rmqueue_fallback here. 
I think I'll just mark those functions always_inline. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
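The CONFIG_OPTIMIZE_INLINING observation explains the conflicting nm results above. In kernels of that era, when the option is disabled, the compiler header redefines plain inline so that it already carries the always_inline attribute; roughly along these lines (a simplified recollection of the 4.x include/linux/compiler-gcc.h, which also adds notrace and checks CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING):

    #ifndef CONFIG_OPTIMIZE_INLINING
    #define inline inline __attribute__((always_inline))
    #endif

So with CONFIG_OPTIMIZE_INLINING=n the v2 patch's plain inline was already being forced, and only with the option enabled does gcc regain the freedom to keep __rmqueue_fallback() and friends out of line.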
* [PATCH] mm/page_alloc: make sure __rmqueue() etc. always inline 2017-10-11 2:34 ` Aaron Lu @ 2017-10-13 6:31 ` Aaron Lu 2017-10-17 11:32 ` Vlastimil Babka 0 siblings, 1 reply; 16+ messages in thread From: Aaron Lu @ 2017-10-13 6:31 UTC (permalink / raw) To: Andrew Morton Cc: Dave Hansen, linux-mm, lkml, Andi Kleen, Huang Ying, Tim Chen, Kemi Wang, Anshuman Khandual __rmqueue(), __rmqueue_fallback(), __rmqueue_smallest() and __rmqueue_cma_fallback() are all in the page allocator's hot path and should be finished as soon as possible. One way to make them faster is by making them inline. But as Andrew Morton and Andi Kleen pointed out: https://lkml.org/lkml/2017/10/10/1252 https://lkml.org/lkml/2017/10/10/1279 To make sure they are inlined, we should use __always_inline for them. With the will-it-scale/page_fault1/process benchmark, when using nr_cpu processes to stress buddy, the results for will-it-scale.processes with and without the patch are: On a 2-socket Intel-Skylake machine: compiler base head gcc-4.4.7 6496131 6911823 +6.4% gcc-4.9.4 7225110 7731072 +7.0% gcc-5.4.1 7054224 7688146 +9.0% gcc-6.2.0 7059794 7651675 +8.4% On a 4-socket Intel-Skylake machine: compiler base head gcc-4.4.7 13162890 13508193 +2.6% gcc-4.9.4 14997463 15484353 +3.2% gcc-5.4.1 14708711 15449805 +5.0% gcc-6.2.0 14574099 15349204 +5.3% The above 4 compilers are used because I've done the tests through Intel's Linux Kernel Performance (LKP) infrastructure and they are the available compilers there. The benefit is smaller on the 4-socket machine because the lock contention there (perf-profile/native_queued_spin_lock_slowpath=81%) is less severe than on the 2-socket machine (85%). What the benchmark does is: it forks nr_cpu processes and then each process does the following in a loop: 1 mmap() 128M of anonymous space; 2 write to each page there to trigger actual page allocation; 3 munmap() it. https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault1.c Binary size wise, I have locally built them with different compilers: [aaron@aaronlu obj]$ size */*/mm/page_alloc.o text data bss dec hex filename 37409 9904 8524 55837 da1d gcc-4.9.4/base/mm/page_alloc.o 38273 9904 8524 56701 dd7d gcc-4.9.4/head/mm/page_alloc.o 37465 9840 8428 55733 d9b5 gcc-5.5.0/base/mm/page_alloc.o 38169 9840 8428 56437 dc75 gcc-5.5.0/head/mm/page_alloc.o 37573 9840 8428 55841 da21 gcc-6.4.0/base/mm/page_alloc.o 38261 9840 8428 56529 dcd1 gcc-6.4.0/head/mm/page_alloc.o 36863 9840 8428 55131 d75b gcc-7.2.0/base/mm/page_alloc.o 37711 9840 8428 55979 daab gcc-7.2.0/head/mm/page_alloc.o Text size increased by about 800 bytes for mm/page_alloc.o. [aaron@aaronlu obj]$ size */*/vmlinux text data bss dec hex filename 10342757 5903208 17723392 33969357 20654cd gcc-4.9.4/base/vmlinux 10342757 5903208 17723392 33969357 20654cd gcc-4.9.4/head/vmlinux 10332448 5836608 17715200 33884256 2050860 gcc-5.5.0/base/vmlinux 10332448 5836608 17715200 33884256 2050860 gcc-5.5.0/head/vmlinux 10094546 5836696 17715200 33646442 201676a gcc-6.4.0/base/vmlinux 10094546 5836696 17715200 33646442 201676a gcc-6.4.0/head/vmlinux 10018775 5828732 17715200 33562707 2002053 gcc-7.2.0/base/vmlinux 10018775 5828732 17715200 33562707 2002053 gcc-7.2.0/head/vmlinux Text size for vmlinux has no change though, probably due to function alignment.
Signed-off-by: Aaron Lu <aaron.lu@intel.com> --- mm/page_alloc.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 0e309ce4a44a..0fe3e2095268 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1794,7 +1794,7 @@ static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags * Go through the free lists for the given migratetype and remove * the smallest available page from the freelists */ -static inline +static __always_inline struct page *__rmqueue_smallest(struct zone *zone, unsigned int order, int migratetype) { @@ -1838,7 +1838,7 @@ static int fallbacks[MIGRATE_TYPES][4] = { }; #ifdef CONFIG_CMA -static struct page *__rmqueue_cma_fallback(struct zone *zone, +static __always_inline struct page *__rmqueue_cma_fallback(struct zone *zone, unsigned int order) { return __rmqueue_smallest(zone, order, MIGRATE_CMA); @@ -2219,7 +2219,7 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac, * deviation from the rest of this file, to make the for loop * condition simpler. */ -static inline bool +static __always_inline bool __rmqueue_fallback(struct zone *zone, int order, int start_migratetype) { struct free_area *area; @@ -2291,8 +2291,8 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype) * Do the hard work of removing an element from the buddy allocator. * Call me with the zone->lock already held. */ -static struct page *__rmqueue(struct zone *zone, unsigned int order, - int migratetype) +static __always_inline struct page * +__rmqueue(struct zone *zone, unsigned int order, int migratetype) { struct page *page; -- 2.13.6 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH] mm/page_alloc: make sure __rmqueue() etc. always inline 2017-10-13 6:31 ` [PATCH] mm/page_alloc: make sure __rmqueue() etc. always inline Aaron Lu @ 2017-10-17 11:32 ` Vlastimil Babka 2017-10-18 1:53 ` Lu, Aaron 0 siblings, 1 reply; 16+ messages in thread From: Vlastimil Babka @ 2017-10-17 11:32 UTC (permalink / raw) To: Aaron Lu, Andrew Morton Cc: Dave Hansen, linux-mm, lkml, Andi Kleen, Huang Ying, Tim Chen, Kemi Wang, Anshuman Khandual On 10/13/2017 08:31 AM, Aaron Lu wrote: > __rmqueue(), __rmqueue_fallback(), __rmqueue_smallest() and > __rmqueue_cma_fallback() are all in page allocator's hot path and > better be finished as soon as possible. One way to make them faster > is by making them inline. But as Andrew Morton and Andi Kleen pointed > out: > https://lkml.org/lkml/2017/10/10/1252 > https://lkml.org/lkml/2017/10/10/1279 > To make sure they are inlined, we should use __always_inline for them. > > With the will-it-scale/page_fault1/process benchmark, when using nr_cpu > processes to stress buddy, the results for will-it-scale.processes with > and without the patch are: > > On a 2-sockets Intel-Skylake machine: > > compiler base head > gcc-4.4.7 6496131 6911823 +6.4% > gcc-4.9.4 7225110 7731072 +7.0% > gcc-5.4.1 7054224 7688146 +9.0% > gcc-6.2.0 7059794 7651675 +8.4% > > On a 4-sockets Intel-Skylake machine: > > compiler base head > gcc-4.4.7 13162890 13508193 +2.6% > gcc-4.9.4 14997463 15484353 +3.2% > gcc-5.4.1 14708711 15449805 +5.0% > gcc-6.2.0 14574099 15349204 +5.3% > > The above 4 compilers are used becuase I've done the tests through Intel's > Linux Kernel Performance(LKP) infrastructure and they are the available > compilers there. > > The benefit being less on 4 sockets machine is due to the lock contention > there(perf-profile/native_queued_spin_lock_slowpath=81%) is less severe > than on the 2 sockets machine(85%). > > What the benchmark does is: it forks nr_cpu processes and then each > process does the following: > 1 mmap() 128M anonymous space; > 2 writes to each page there to trigger actual page allocation; > 3 munmap() it. > in a loop. > https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault1.c Are transparent hugepages enabled? If yes, __rmqueue() is called from rmqueue(), and there's only one page fault (and __rmqueue()) per 512 "writes to each page". If not, __rmqueue() is called from rmqueue_bulk() in bursts once pcplists are depleted. I guess it's the latter, otherwise I wouldn't expect a function call to have such visible overhead. I guess what would help much more would be a bulk __rmqueue_smallest() to grab multiple pages from the freelists. But can't argue with your numbers against this patch. > Binary size wise, I have locally built them with different compilers: > > [aaron@aaronlu obj]$ size */*/mm/page_alloc.o > text data bss dec hex filename > 37409 9904 8524 55837 da1d gcc-4.9.4/base/mm/page_alloc.o > 38273 9904 8524 56701 dd7d gcc-4.9.4/head/mm/page_alloc.o > 37465 9840 8428 55733 d9b5 gcc-5.5.0/base/mm/page_alloc.o > 38169 9840 8428 56437 dc75 gcc-5.5.0/head/mm/page_alloc.o > 37573 9840 8428 55841 da21 gcc-6.4.0/base/mm/page_alloc.o > 38261 9840 8428 56529 dcd1 gcc-6.4.0/head/mm/page_alloc.o > 36863 9840 8428 55131 d75b gcc-7.2.0/base/mm/page_alloc.o > 37711 9840 8428 55979 daab gcc-7.2.0/head/mm/page_alloc.o > > Text size increased about 800 bytes for mm/page_alloc.o. BTW, do you know about ./scripts/bloat-o-meter? 
:) With gcc 7.2.1: > ./scripts/bloat-o-meter base.o mm/page_alloc.o add/remove: 1/2 grow/shrink: 2/0 up/down: 2493/-1649 (844) function old new delta get_page_from_freelist 2898 4937 +2039 steal_suitable_fallback - 365 +365 find_suitable_fallback 31 120 +89 find_suitable_fallback.part 115 - -115 __rmqueue 1534 - -1534 > [aaron@aaronlu obj]$ size */*/vmlinux > text data bss dec hex filename > 10342757 5903208 17723392 33969357 20654cd gcc-4.9.4/base/vmlinux > 10342757 5903208 17723392 33969357 20654cd gcc-4.9.4/head/vmlinux > 10332448 5836608 17715200 33884256 2050860 gcc-5.5.0/base/vmlinux > 10332448 5836608 17715200 33884256 2050860 gcc-5.5.0/head/vmlinux > 10094546 5836696 17715200 33646442 201676a gcc-6.4.0/base/vmlinux > 10094546 5836696 17715200 33646442 201676a gcc-6.4.0/head/vmlinux > 10018775 5828732 17715200 33562707 2002053 gcc-7.2.0/base/vmlinux > 10018775 5828732 17715200 33562707 2002053 gcc-7.2.0/head/vmlinux > > Text size for vmlinux has no change though, probably due to function > alignment. Yep that's useless to show. These differences do add up though, until they eventually cross the alignment boundary. Thanks, Vlastimil > > Signed-off-by: Aaron Lu <aaron.lu@intel.com> > --- > mm/page_alloc.c | 10 +++++----- > 1 file changed, 5 insertions(+), 5 deletions(-) > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 0e309ce4a44a..0fe3e2095268 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -1794,7 +1794,7 @@ static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags > * Go through the free lists for the given migratetype and remove > * the smallest available page from the freelists > */ > -static inline > +static __always_inline > struct page *__rmqueue_smallest(struct zone *zone, unsigned int order, > int migratetype) > { > @@ -1838,7 +1838,7 @@ static int fallbacks[MIGRATE_TYPES][4] = { > }; > > #ifdef CONFIG_CMA > -static struct page *__rmqueue_cma_fallback(struct zone *zone, > +static __always_inline struct page *__rmqueue_cma_fallback(struct zone *zone, > unsigned int order) > { > return __rmqueue_smallest(zone, order, MIGRATE_CMA); > @@ -2219,7 +2219,7 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac, > * deviation from the rest of this file, to make the for loop > * condition simpler. > */ > -static inline bool > +static __always_inline bool > __rmqueue_fallback(struct zone *zone, int order, int start_migratetype) > { > struct free_area *area; > @@ -2291,8 +2291,8 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype) > * Do the hard work of removing an element from the buddy allocator. > * Call me with the zone->lock already held. > */ > -static struct page *__rmqueue(struct zone *zone, unsigned int order, > - int migratetype) > +static __always_inline struct page * > +__rmqueue(struct zone *zone, unsigned int order, int migratetype) > { > struct page *page; > > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] mm/page_alloc: make sure __rmqueue() etc. always inline 2017-10-17 11:32 ` Vlastimil Babka @ 2017-10-18 1:53 ` Lu, Aaron 2017-10-18 6:28 ` Vlastimil Babka 0 siblings, 1 reply; 16+ messages in thread From: Lu, Aaron @ 2017-10-18 1:53 UTC (permalink / raw) To: vbabka@suse.cz, akpm@linux-foundation.org Cc: linux-kernel@vger.kernel.org, tim.c.chen@linux.intel.com, khandual@linux.vnet.ibm.com, linux-mm@kvack.org, ak@linux.intel.com, Wang, Kemi, Hansen, Dave, Huang, Ying On Tue, 2017-10-17 at 13:32 +0200, Vlastimil Babka wrote: > On 10/13/2017 08:31 AM, Aaron Lu wrote: > > __rmqueue(), __rmqueue_fallback(), __rmqueue_smallest() and > > __rmqueue_cma_fallback() are all in page allocator's hot path and > > better be finished as soon as possible. One way to make them faster > > is by making them inline. But as Andrew Morton and Andi Kleen pointed > > out: > > https://lkml.org/lkml/2017/10/10/1252 > > https://lkml.org/lkml/2017/10/10/1279 > > To make sure they are inlined, we should use __always_inline for them. > > > > With the will-it-scale/page_fault1/process benchmark, when using nr_cpu > > processes to stress buddy, the results for will-it-scale.processes with > > and without the patch are: > > > > On a 2-sockets Intel-Skylake machine: > > > > compiler base head > > gcc-4.4.7 6496131 6911823 +6.4% > > gcc-4.9.4 7225110 7731072 +7.0% > > gcc-5.4.1 7054224 7688146 +9.0% > > gcc-6.2.0 7059794 7651675 +8.4% > > > > On a 4-sockets Intel-Skylake machine: > > > > compiler base head > > gcc-4.4.7 13162890 13508193 +2.6% > > gcc-4.9.4 14997463 15484353 +3.2% > > gcc-5.4.1 14708711 15449805 +5.0% > > gcc-6.2.0 14574099 15349204 +5.3% > > > > The above 4 compilers are used becuase I've done the tests through Intel's > > Linux Kernel Performance(LKP) infrastructure and they are the available > > compilers there. > > > > The benefit being less on 4 sockets machine is due to the lock contention > > there(perf-profile/native_queued_spin_lock_slowpath=81%) is less severe > > than on the 2 sockets machine(85%). > > > > What the benchmark does is: it forks nr_cpu processes and then each > > process does the following: > > 1 mmap() 128M anonymous space; > > 2 writes to each page there to trigger actual page allocation; > > 3 munmap() it. > > in a loop. > > https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault1.c > > Are transparent hugepages enabled? If yes, __rmqueue() is called from > rmqueue(), and there's only one page fault (and __rmqueue()) per 512 > "writes to each page". If not, __rmqueue() is called from rmqueue_bulk() > in bursts once pcplists are depleted. I guess it's the latter, otherwise > I wouldn't expect a function call to have such visible overhead. THP is disabled. I should have mentioned this in the changelog, sorry about that. > > I guess what would help much more would be a bulk __rmqueue_smallest() > to grab multiple pages from the freelists. But can't argue with your Do I understand you correctly that you suggest to use a bulk __rmqueue_smallest(), say __rmqueue_smallest_bulk(). With that, instead of looping pcp->batch times in rmqueue_bulk(), a single call to __rmqueue_smallest_bulk() is enough and __rmqueue_smallest_bulk() will loop pcp->batch times to get those pages? Then it feels like __rmqueue_smallest_bulk() has become rmqueue_bulk(), or do I miss something? > numbers against this patch. 
> > > Binary size wise, I have locally built them with different compilers: > > > > [aaron@aaronlu obj]$ size */*/mm/page_alloc.o > > text data bss dec hex filename > > 37409 9904 8524 55837 da1d gcc-4.9.4/base/mm/page_alloc.o > > 38273 9904 8524 56701 dd7d gcc-4.9.4/head/mm/page_alloc.o > > 37465 9840 8428 55733 d9b5 gcc-5.5.0/base/mm/page_alloc.o > > 38169 9840 8428 56437 dc75 gcc-5.5.0/head/mm/page_alloc.o > > 37573 9840 8428 55841 da21 gcc-6.4.0/base/mm/page_alloc.o > > 38261 9840 8428 56529 dcd1 gcc-6.4.0/head/mm/page_alloc.o > > 36863 9840 8428 55131 d75b gcc-7.2.0/base/mm/page_alloc.o > > 37711 9840 8428 55979 daab gcc-7.2.0/head/mm/page_alloc.o > > > > Text size increased about 800 bytes for mm/page_alloc.o. > > BTW, do you know about ./scripts/bloat-o-meter? :) NO!!! Thanks for bringing this up :) > With gcc 7.2.1: > > ./scripts/bloat-o-meter base.o mm/page_alloc.o > > add/remove: 1/2 grow/shrink: 2/0 up/down: 2493/-1649 (844) Nice, it clearly showed 844 bytes bloat. > function old new delta > get_page_from_freelist 2898 4937 +2039 > steal_suitable_fallback - 365 +365 > find_suitable_fallback 31 120 +89 > find_suitable_fallback.part 115 - -115 > __rmqueue 1534 - -1534 > > > > [aaron@aaronlu obj]$ size */*/vmlinux > > text data bss dec hex filename > > 10342757 5903208 17723392 33969357 20654cd gcc-4.9.4/base/vmlinux > > 10342757 5903208 17723392 33969357 20654cd gcc-4.9.4/head/vmlinux > > 10332448 5836608 17715200 33884256 2050860 gcc-5.5.0/base/vmlinux > > 10332448 5836608 17715200 33884256 2050860 gcc-5.5.0/head/vmlinux > > 10094546 5836696 17715200 33646442 201676a gcc-6.4.0/base/vmlinux > > 10094546 5836696 17715200 33646442 201676a gcc-6.4.0/head/vmlinux > > 10018775 5828732 17715200 33562707 2002053 gcc-7.2.0/base/vmlinux > > 10018775 5828732 17715200 33562707 2002053 gcc-7.2.0/head/vmlinux > > > > Text size for vmlinux has no change though, probably due to function > > alignment. > > Yep that's useless to show. These differences do add up though, until > they eventually cross the alignment boundary. Agreed. But you know, it is the hot path, the performance improvement might be worth it. ^ permalink raw reply [flat|nested] 16+ messages in thread
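For concreteness, the bulk helper being discussed would presumably look something like the sketch below. This is a purely hypothetical illustration of the idea, reusing the __rmqueue_smallest() signature from the patch; it ignores the fallback migratetypes, per-migratetype accounting and statistics updates that the real rmqueue_bulk() has to handle:

    /* hypothetical: with zone->lock already held, grab up to "count"
     * pages of the given order and queue them on "list" */
    static unsigned int __rmqueue_smallest_bulk(struct zone *zone,
                            unsigned int order, int migratetype,
                            unsigned int count, struct list_head *list)
    {
            unsigned int allocated;

            for (allocated = 0; allocated < count; allocated++) {
                    struct page *page;

                    page = __rmqueue_smallest(zone, order, migratetype);
                    if (!page)
                            break;
                    list_add_tail(&page->lru, list);
            }
            return allocated;
    }

As Aaron notes above, once __rmqueue() is inlined this is essentially what the existing rmqueue_bulk() loop already compiles down to, which is why the idea does not promise much additional gain.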
* Re: [PATCH] mm/page_alloc: make sure __rmqueue() etc. always inline 2017-10-18 1:53 ` Lu, Aaron @ 2017-10-18 6:28 ` Vlastimil Babka 2017-10-18 8:57 ` Aaron Lu 0 siblings, 1 reply; 16+ messages in thread From: Vlastimil Babka @ 2017-10-18 6:28 UTC (permalink / raw) To: Lu, Aaron, akpm@linux-foundation.org Cc: linux-kernel@vger.kernel.org, tim.c.chen@linux.intel.com, khandual@linux.vnet.ibm.com, linux-mm@kvack.org, ak@linux.intel.com, Wang, Kemi, Hansen, Dave, Huang, Ying On 10/18/2017 03:53 AM, Lu, Aaron wrote: > On Tue, 2017-10-17 at 13:32 +0200, Vlastimil Babka wrote: >> >> Are transparent hugepages enabled? If yes, __rmqueue() is called from >> rmqueue(), and there's only one page fault (and __rmqueue()) per 512 >> "writes to each page". If not, __rmqueue() is called from rmqueue_bulk() >> in bursts once pcplists are depleted. I guess it's the latter, otherwise >> I wouldn't expect a function call to have such visible overhead. > > THP is disabled. I should have mentioned this in the changelog, sorry > about that. OK, then it makes sense! >> >> I guess what would help much more would be a bulk __rmqueue_smallest() >> to grab multiple pages from the freelists. But can't argue with your > > Do I understand you correctly that you suggest to use a bulk > __rmqueue_smallest(), say __rmqueue_smallest_bulk(). With that, instead > of looping pcp->batch times in rmqueue_bulk(), a single call to > __rmqueue_smallest_bulk() is enough and __rmqueue_smallest_bulk() will > loop pcp->batch times to get those pages? Yeah, but I looked at it more closely, and maybe there's not much to gain after all. E.g., there seem to be no atomic counter updates that would benefit from batching, or expensive setup/cleanup in __rmqueue_smallest(). > Then it feels like __rmqueue_smallest_bulk() has become rmqueue_bulk(), > or do I miss something? Right, looks like thanks to inlining, the compiler can already achieve most of the potential gains. >> With gcc 7.2.1: >>> ./scripts/bloat-o-meter base.o mm/page_alloc.o >> >> add/remove: 1/2 grow/shrink: 2/0 up/down: 2493/-1649 (844) > > Nice, it clearly showed 844 bytes bloat. > >> function old new delta >> get_page_from_freelist 2898 4937 +2039 >> steal_suitable_fallback - 365 +365 >> find_suitable_fallback 31 120 +89 >> find_suitable_fallback.part 115 - -115 >> __rmqueue 1534 - -1534 It also shows that steal_suitable_fallback() is no longer inlined. Which is fine, because that should ideally be rarely executed. >> >>> [aaron@aaronlu obj]$ size */*/vmlinux >>> text data bss dec hex filename >>> 10342757 5903208 17723392 33969357 20654cd gcc-4.9.4/base/vmlinux >>> 10342757 5903208 17723392 33969357 20654cd gcc-4.9.4/head/vmlinux >>> 10332448 5836608 17715200 33884256 2050860 gcc-5.5.0/base/vmlinux >>> 10332448 5836608 17715200 33884256 2050860 gcc-5.5.0/head/vmlinux >>> 10094546 5836696 17715200 33646442 201676a gcc-6.4.0/base/vmlinux >>> 10094546 5836696 17715200 33646442 201676a gcc-6.4.0/head/vmlinux >>> 10018775 5828732 17715200 33562707 2002053 gcc-7.2.0/base/vmlinux >>> 10018775 5828732 17715200 33562707 2002053 gcc-7.2.0/head/vmlinux >>> >>> Text size for vmlinux has no change though, probably due to function >>> alignment. >> >> Yep that's useless to show. These differences do add up though, until >> they eventually cross the alignment boundary. > > Agreed. > But you know, it is the hot path, the performance improvement might be > worth it. 
I'd agree, so you can add Acked-by: Vlastimil Babka <vbabka@suse.cz> -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] mm/page_alloc: make sure __rmqueue() etc. always inline 2017-10-18 6:28 ` Vlastimil Babka @ 2017-10-18 8:57 ` Aaron Lu 0 siblings, 0 replies; 16+ messages in thread From: Aaron Lu @ 2017-10-18 8:57 UTC (permalink / raw) To: Vlastimil Babka Cc: akpm@linux-foundation.org, linux-kernel@vger.kernel.org, tim.c.chen@linux.intel.com, khandual@linux.vnet.ibm.com, linux-mm@kvack.org, ak@linux.intel.com, Wang, Kemi, Hansen, Dave, Huang, Ying On Wed, Oct 18, 2017 at 08:28:56AM +0200, Vlastimil Babka wrote: > On 10/18/2017 03:53 AM, Lu, Aaron wrote: > > On Tue, 2017-10-17 at 13:32 +0200, Vlastimil Babka wrote: > >> With gcc 7.2.1: > >>> ./scripts/bloat-o-meter base.o mm/page_alloc.o > >> > >> add/remove: 1/2 grow/shrink: 2/0 up/down: 2493/-1649 (844) > > > > Nice, it clearly showed 844 bytes bloat. > > > >> function old new delta > >> get_page_from_freelist 2898 4937 +2039 > >> steal_suitable_fallback - 365 +365 > >> find_suitable_fallback 31 120 +89 > >> find_suitable_fallback.part 115 - -115 > >> __rmqueue 1534 - -1534 > > It also shows that steal_suitable_fallback() is no longer inlined. Which > is fine, because that should ideally be rarely executed. Ah right, so this script is really good for analysing inline changes. > > >> > >>> [aaron@aaronlu obj]$ size */*/vmlinux > >>> text data bss dec hex filename > >>> 10342757 5903208 17723392 33969357 20654cd gcc-4.9.4/base/vmlinux > >>> 10342757 5903208 17723392 33969357 20654cd gcc-4.9.4/head/vmlinux > >>> 10332448 5836608 17715200 33884256 2050860 gcc-5.5.0/base/vmlinux > >>> 10332448 5836608 17715200 33884256 2050860 gcc-5.5.0/head/vmlinux > >>> 10094546 5836696 17715200 33646442 201676a gcc-6.4.0/base/vmlinux > >>> 10094546 5836696 17715200 33646442 201676a gcc-6.4.0/head/vmlinux > >>> 10018775 5828732 17715200 33562707 2002053 gcc-7.2.0/base/vmlinux > >>> 10018775 5828732 17715200 33562707 2002053 gcc-7.2.0/head/vmlinux > >>> > >>> Text size for vmlinux has no change though, probably due to function > >>> alignment. > >> > >> Yep that's useless to show. These differences do add up though, until > >> they eventually cross the alignment boundary. > > > > Agreed. > > But you know, it is the hot path, the performance improvement might be > > worth it. > > I'd agree, so you can add > > Acked-by: Vlastimil Babka <vbabka@suse.cz> Thanks! -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread