From: Yury <yury.norov@gmail.com>
To: Alexey Klimov <klimov.linux@gmail.com>,
Cassidy Burden <cburden@codeaurora.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
linux-arm-msm@vger.kernel.org,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
linux-arm-kernel@lists.infradead.org,
"David S. Miller" <davem@davemloft.net>,
Daniel Borkmann <dborkman@redhat.com>,
Hannes Frederic Sowa <hannes@stressinduktion.org>,
Lai Jiangshan <laijs@cn.fujitsu.com>,
Mark Salter <msalter@redhat.com>,
AKASHI Takahiro <takahiro.akashi@linaro.org>,
Thomas Graf <tgraf@suug.ch>,
Valentin Rothberg <valentinrothberg@gmail.com>,
Chris Wilson <chris@chris-wilson.co.uk>,
Rasmus Villemoes <linux@rasmusvillemoes.dk>,
linux@horizon.com
Subject: Re: [PATCH] lib: Make _find_next_bit helper function inline
Date: Sat, 29 Aug 2015 18:15:15 +0300 [thread overview]
Message-ID: <55E1CC83.1010007@gmail.com> (raw)
In-Reply-To: <CALW4P+LU2iLkT7d=BiaC_=oSJ6K_g152VsVSjfcQGUKx_5q4tQ@mail.gmail.com>
On 24.08.2015 01:53, Alexey Klimov wrote:
> Hi Cassidy,
>
>
> On Wed, Jul 29, 2015 at 11:40 PM, Cassidy Burden <cburden@codeaurora.org> wrote:
>> I changed the test module to now set the entire array to all 0/1s and
>> only flip a few bits. There appears to be a performance benefit, but
>> it's only 2-3% better (if that). If the main benefit of the original
>> patch was to save space then inlining definitely doesn't seem worth the
>> small gains in real use cases.
>>
>> find_next_zero_bit (us)
>> old new inline
>> 14440 17080 17086
>> 4779 5181 5069
>> 10844 12720 12746
>> 9642 11312 11253
>> 3858 3818 3668
>> 10540 12349 12307
>> 12470 14716 14697
>> 5403 6002 5942
>> 2282 1820 1418
>> 13632 16056 15998
>> 11048 13019 13030
>> 6025 6790 6706
>> 13255 15586 15605
>> 3038 2744 2539
>> 10353 12219 12239
>> 10498 12251 12322
>> 14767 17452 17454
>> 12785 15048 15052
>> 1655 1034 691
>> 9924 11611 11558
>>
>> find_next_bit (us)
>> old new inline
>> 8535 9936 9667
>> 14666 17372 16880
>> 2315 1799 1355
>> 6578 9092 8806
>> 6548 7558 7274
>> 9448 11213 10821
>> 3467 3497 3449
>> 2719 3079 2911
>> 6115 7989 7796
>> 13582 16113 15643
>> 4643 4946 4766
>> 3406 3728 3536
>> 7118 9045 8805
>> 3174 3011 2701
>> 13300 16780 16252
>> 14285 16848 16330
>> 11583 13669 13207
>> 13063 15455 14989
>> 12661 14955 14500
>> 12068 14166 13790
>>
>> On 7/29/2015 6:30 AM, Alexey Klimov wrote:
>>>
>>> I will re-check on another machine. It's really interesting if
>>> __always_inline makes things better for aarch64 and worse for x86_64. It
>>> will be nice if someone will check it on x86_64 too.
>>
>>
>> Very odd, this may be related to the other compiler optimizations Yuri
>> mentioned?
>
> It's better to ask Yury, i hope he can answer some day.
>
> Do you need to re-check this (with more iterations or on another machine(s))?
>
Hi, Alexey, Cassidy,
(restoring Rasmus, George)
I found no difference between original and inline versions for x86_64:
(Intel(R) Core(TM) i7-2630QM CPU @ 2.00GHz)
find_next_bit find_next_zero_bit
old new inline old new inline
24 27 28 22 28 28
24 27 28 23 27 28
24 27 28 23 27 28
Inspecting assembler code, I found that GCC wants to see helper separated,
even if you provide '__always_inline':
inline <find_next_bit_new>: current <find_next_bit_new>:
280: cmp %rdx,%rsi 210: cmp %rdx,%rsi
283: jbe 295 <find_next_bit_new+0x15> 213: jbe 227 <find_next_bit_new+0x17>
285: test %rsi,%rsi 215: test %rsi,%rsi
288: je 295 <find_next_bit_new+0x15> 218: je 227 <find_next_bit_new+0x17>
28a: push %rbp 21a: push %rbp
28b: mov %rsp,%rbp 21b: xor %ecx,%ecx
28e: callq 0 <find_next_bit_new.part.0> 21d: mov %rsp,%rbp
293: pop %rbp 220: callq 0 <_find_next_bit.part.0>
294: retq 225: pop %rbp
295: mov %rsi,%rax 226: retq
298: retq 227: mov %rsi,%rax
299: nopl 0x0(%rax) 22a: retq
22b: nopl 0x0(%rax,%rax,1)
So things are looking like x86_64 gcc (at least 4.9.2 build for Ubuntu)
ignores '__always_inline' hint as well as 'inline'. But in case of
__always_inline compiler does something not really smart: it introduces
<find_next_bit_new.part.0> and <find_next_zero_bit_new.part.1> helpers
and so increases text size from 0x250 to 0x2b9 bytes, but doesn't really
inline to optimize push/pop and call/ret. I don't like inline, as I
already told, but I believe that complete disabling is bad idea.
Maybe someone knows another trick to make inline work?
WARNING: multiple messages have this Message-ID (diff)
From: yury.norov@gmail.com (Yury)
To: linux-arm-kernel@lists.infradead.org
Subject: [PATCH] lib: Make _find_next_bit helper function inline
Date: Sat, 29 Aug 2015 18:15:15 +0300 [thread overview]
Message-ID: <55E1CC83.1010007@gmail.com> (raw)
In-Reply-To: <CALW4P+LU2iLkT7d=BiaC_=oSJ6K_g152VsVSjfcQGUKx_5q4tQ@mail.gmail.com>
On 24.08.2015 01:53, Alexey Klimov wrote:
> Hi Cassidy,
>
>
> On Wed, Jul 29, 2015 at 11:40 PM, Cassidy Burden <cburden@codeaurora.org> wrote:
>> I changed the test module to now set the entire array to all 0/1s and
>> only flip a few bits. There appears to be a performance benefit, but
>> it's only 2-3% better (if that). If the main benefit of the original
>> patch was to save space then inlining definitely doesn't seem worth the
>> small gains in real use cases.
>>
>> find_next_zero_bit (us)
>> old new inline
>> 14440 17080 17086
>> 4779 5181 5069
>> 10844 12720 12746
>> 9642 11312 11253
>> 3858 3818 3668
>> 10540 12349 12307
>> 12470 14716 14697
>> 5403 6002 5942
>> 2282 1820 1418
>> 13632 16056 15998
>> 11048 13019 13030
>> 6025 6790 6706
>> 13255 15586 15605
>> 3038 2744 2539
>> 10353 12219 12239
>> 10498 12251 12322
>> 14767 17452 17454
>> 12785 15048 15052
>> 1655 1034 691
>> 9924 11611 11558
>>
>> find_next_bit (us)
>> old new inline
>> 8535 9936 9667
>> 14666 17372 16880
>> 2315 1799 1355
>> 6578 9092 8806
>> 6548 7558 7274
>> 9448 11213 10821
>> 3467 3497 3449
>> 2719 3079 2911
>> 6115 7989 7796
>> 13582 16113 15643
>> 4643 4946 4766
>> 3406 3728 3536
>> 7118 9045 8805
>> 3174 3011 2701
>> 13300 16780 16252
>> 14285 16848 16330
>> 11583 13669 13207
>> 13063 15455 14989
>> 12661 14955 14500
>> 12068 14166 13790
>>
>> On 7/29/2015 6:30 AM, Alexey Klimov wrote:
>>>
>>> I will re-check on another machine. It's really interesting if
>>> __always_inline makes things better for aarch64 and worse for x86_64. It
>>> will be nice if someone will check it on x86_64 too.
>>
>>
>> Very odd, this may be related to the other compiler optimizations Yuri
>> mentioned?
>
> It's better to ask Yury, i hope he can answer some day.
>
> Do you need to re-check this (with more iterations or on another machine(s))?
>
Hi, Alexey, Cassidy,
(restoring Rasmus, George)
I found no difference between original and inline versions for x86_64:
(Intel(R) Core(TM) i7-2630QM CPU @ 2.00GHz)
find_next_bit find_next_zero_bit
old new inline old new inline
24 27 28 22 28 28
24 27 28 23 27 28
24 27 28 23 27 28
Inspecting assembler code, I found that GCC wants to see helper separated,
even if you provide '__always_inline':
inline <find_next_bit_new>: current <find_next_bit_new>:
280: cmp %rdx,%rsi 210: cmp %rdx,%rsi
283: jbe 295 <find_next_bit_new+0x15> 213: jbe 227 <find_next_bit_new+0x17>
285: test %rsi,%rsi 215: test %rsi,%rsi
288: je 295 <find_next_bit_new+0x15> 218: je 227 <find_next_bit_new+0x17>
28a: push %rbp 21a: push %rbp
28b: mov %rsp,%rbp 21b: xor %ecx,%ecx
28e: callq 0 <find_next_bit_new.part.0> 21d: mov %rsp,%rbp
293: pop %rbp 220: callq 0 <_find_next_bit.part.0>
294: retq 225: pop %rbp
295: mov %rsi,%rax 226: retq
298: retq 227: mov %rsi,%rax
299: nopl 0x0(%rax) 22a: retq
22b: nopl 0x0(%rax,%rax,1)
So things are looking like x86_64 gcc (at least 4.9.2 build for Ubuntu)
ignores '__always_inline' hint as well as 'inline'. But in case of
__always_inline compiler does something not really smart: it introduces
<find_next_bit_new.part.0> and <find_next_zero_bit_new.part.1> helpers
and so increases text size from 0x250 to 0x2b9 bytes, but doesn't really
inline to optimize push/pop and call/ret. I don't like inline, as I
already told, but I believe that complete disabling is bad idea.
Maybe someone knows another trick to make inline work?
next prev parent reply other threads:[~2015-08-29 15:15 UTC|newest]
Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-07-28 19:09 [PATCH] lib: Make _find_next_bit helper function inline Cassidy Burden
2015-07-28 19:09 ` Cassidy Burden
2015-07-28 21:23 ` Yury
2015-07-28 21:23 ` Yury
2015-07-28 21:38 ` Yury
2015-07-28 21:38 ` Yury
2015-07-28 21:45 ` Andrew Morton
2015-07-28 21:45 ` Andrew Morton
2015-07-29 13:30 ` Alexey Klimov
2015-07-29 13:30 ` Alexey Klimov
2015-07-29 20:40 ` Cassidy Burden
2015-07-29 20:40 ` Cassidy Burden
2015-08-23 22:53 ` Alexey Klimov
2015-08-23 22:53 ` Alexey Klimov
2015-08-29 15:15 ` Yury [this message]
2015-08-29 15:15 ` Yury
2015-08-30 21:47 ` Rasmus Villemoes
2015-08-30 21:47 ` Rasmus Villemoes
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=55E1CC83.1010007@gmail.com \
--to=yury.norov@gmail.com \
--cc=akpm@linux-foundation.org \
--cc=cburden@codeaurora.org \
--cc=chris@chris-wilson.co.uk \
--cc=davem@davemloft.net \
--cc=dborkman@redhat.com \
--cc=hannes@stressinduktion.org \
--cc=klimov.linux@gmail.com \
--cc=laijs@cn.fujitsu.com \
--cc=linux-arm-kernel@lists.infradead.org \
--cc=linux-arm-msm@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux@horizon.com \
--cc=linux@rasmusvillemoes.dk \
--cc=msalter@redhat.com \
--cc=takahiro.akashi@linaro.org \
--cc=tgraf@suug.ch \
--cc=valentinrothberg@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.