From: Jiri Wiesner <jwiesner@suse.de>
To: Julian Anastasov <ja@ssi.bg>
Cc: Simon Horman <horms@verge.net.au>,
lvs-devel@vger.kernel.org,
yunhong-cgl jiang <xintian1976@gmail.com>,
dust.li@linux.alibaba.com
Subject: Re: [RFC PATCHv6 3/7] ipvs: use u64_stats_t for the per-cpu counters
Date: Tue, 15 Nov 2022 13:26:02 +0100 [thread overview]
Message-ID: <20221115122602.GJ3484@incl> (raw)
In-Reply-To: <a7bf3435-aa8f-31a5-b93b-f0ad58fdc3a4@ssi.bg>
On Sat, Nov 12, 2022 at 06:01:36PM +0200, Julian Anastasov wrote:
>
> Hello,
>
> On Sat, 12 Nov 2022, Jiri Wiesner wrote:
>
> > On Sat, Nov 12, 2022 at 10:00:01AM +0100, Jiri Wiesner wrote:
> > > On Mon, Oct 31, 2022 at 04:56:43PM +0200, Julian Anastasov wrote:
> > > > Use the provided u64_stats_t type to avoid
> > > > load/store tearing.
> >
> > I think only 32-bit machines have a problem storing a 64-bit counter atomically. u64_stats_fetch_begin() and u64_stats_fetch_retry() should take care of that.
>
> Some compilers for 64-bit can use two 32-bit
> loads (load tearing) while we require atomic 64-bit read
> in case reader is interrupted by updater. That is why READ_ONCE
> is used now. I don't know which compilers can do this. That is
> why I switched to u64_stats_t to correctly use the u64_stats
> interface. And our data is 8-byte aligned. 32-bit are using
> seqcount_t, so they are protected for this problem.
All right. It is counter-intuitive but I guess those compiles have their reasons. I think it would not hurt to expand the description of the patch with the explanation above.
> > > Converting the per cpu stat to u64_stats_t means that the compiler cannot optimize the memory access and addition on x86_64. Previously, the summation of per cpu counters in ip_vs_chain_estimation() looked like:
> > > 15b65: add (%rsi),%r14
> > > 15b68: add 0x8(%rsi),%r15
> > > 15b6c: add 0x10(%rsi),%r13
> > > 15b70: add 0x18(%rsi),%r12
> > > 15b74: add 0x20(%rsi),%rbp
> > > The u64_stats_read() calls in ip_vs_chain_estimation() turned it into:
> > > 159d5: mov (%rcx),%r11
> > > 159d8: mov 0x8(%rcx),%r10
> > > 159dc: mov 0x10(%rcx),%r9
> > > 159e0: mov 0x18(%rcx),%rdi
> > > 159e4: mov 0x20(%rcx),%rdx
> > > 159e8: add %r11,%r14
> > > 159eb: add %r10,%r13
> > > 159ee: add %r9,%r12
> > > 159f1: add %rdi,%rbp
> > > 159f4: add %rdx,%rbx
> > > I guess that is not a big deal because the mov should be the instruction taking the most time on account of accessing per cpu regions of other CPUs. The add will be fast.
> >
> > Another concern is the number of registers needed for the summation. Previously, 5 registers were needed. Now, 10 registers are needed. This would matter mostly if the size of the stack frame of ip_vs_chain_estimation() and the number of callee-saved registers stored to the stack changed, but introducing u64_stats_read() in ip_vs_chain_estimation() did not change that.
>
> It happens in this way probably because compiler should not
> reorder two or more successive instances of READ_ONCE (rwonce.h),
> something that hurts u64_stats_read(). As result, it multiplies
> up to 5 times the needed temp storage (registers or stack).
>
> Another option would be to use something like
> this below, try on top of v6. It is rude to add ifdefs here but
> we should avoid unneeded code for 64-bit. On 32-bit, code is
> more but operations are same. On 64-bit it should order
> read+add one by one which can run in parallel. In my compile
> tests it uses 3 times movq(READ_ONCE)+addq(to sum) with same temp
> register which defeats fully parallel operations and for the
> last 2 counters uses movq,movq,addq,addq with different
> registers which can run in parallel. It seems there are no free
> registers to make it for all 5 counters.
>
> With this patch the benefit is that compiler should
> not hold all 5 values before adding them but on
> x86 this patch shows some problem: it does not allocate
> 5 new temp registers to read and add the values in parallel.
> Without this patch, as you show it, it somehow allocates 5+5
> registers which allows parallel operations.
>
> If you think we are better with such change, we
> can include it in patch 4. It will be better in case one day
> we add more counters and there are more available registers.
> You can notice the difference probably by comparing the
> chain_max value in performance mode.
I have tested the change to ip_vs_chain_estimation() and ran profiling with bus-cycles, which are proportional to time spent executing code. The number of estimator per kthread was the same in both tests - 88800. It turns out the change made the code slower:
> # Util1 Util2 Diff Command Shared Object Symbol CPU
> 1,124,932,796 1,143,867,951 18,935,155 ( 1.7%) ipvs-e:0:0 [ip_vs] all all
> 16,480,274 16,853,752 373,478 ( 2.3%) ipvs-e:0:1 [ip_vs] all all
> 0 21,987 21,987 (100.0%) ipvs-e:0:0 [kvm] all all
> 4,622,992 4,366,150 -256,842 ( 5.6%) ipvs-e:0:1 [kernel.kallsyms] all all
> 180,226,857 164,665,271 -15,561,586 ( 8.6%) ipvs-e:0:0 [kernel.kallsyms] all all
The most significant component was the time spent in ip_vs_chain_estimation():
> # Util1 Util2 Diff Command Shared Object Symbol CPU
> 1,124,414,110 1,143,352,233 18,938,123 ( 1.7%) ipvs-e:0:0 [ip_vs] ip_vs_chain_estimation all
> 16,291,044 16,574,498 283,454 ( 1.7%) ipvs-e:0:1 [ip_vs] ip_vs_chain_estimation all
> 189,230 279,254 90,024 ( 47.6%) ipvs-e:0:1 [ip_vs] ip_vs_estimation_kthread all
> 518,686 515,718 -2,968 ( 0.6%) ipvs-e:0:0 [ip_vs] ip_vs_estimation_kthread all
> 1,562,512 1,377,609 -184,903 ( 11.8%) ipvs-e:0:0 [kernel.kallsyms] kthread_should_stop all
> 2,435,803 2,138,404 -297,399 ( 12.2%) ipvs-e:0:1 [kernel.kallsyms] _find_next_bit all
> 16,304,577 15,786,553 -518,024 ( 3.2%) ipvs-e:0:0 [kernel.kallsyms] _raw_spin_lock all
> 151,110,969 137,172,160 -13,938,809 ( 9.2%) ipvs-e:0:0 [kernel.kallsyms] _find_next_bit all
The disassembly shows it is the mov instructions (there is a skid of 1 instruction) what takes the most time:
>Percent | Source code & Disassembly of kcore for bus-cycles (55354 samples, percent: local period)
> : ffffffffc0bad980 <ip_vs_chain_estimation>:
> v6 patchset v6 patchset + ip_vs_chain_estimation() change
>-------------------------------------------------------------------------------------------------------
> 0.00 : nopl 0x0(%rax,%rax,1) 0.00 : nopl 0x0(%rax,%rax,1)
> 0.00 : push %r15 0.00 : push %r15
> 0.00 : push %r14 0.00 : push %r14
> 0.00 : push %r13 0.00 : push %r13
> 0.00 : push %r12 0.00 : push %r12
> 0.00 : push %rbp 0.00 : push %rbp
> 0.00 : push %rbx 0.00 : push %rbx
> 0.00 : sub $0x8,%rsp 0.00 : sub $0x8,%rsp
> 0.00 : mov (%rdi),%r15 0.00 : mov (%rdi),%r15
> 0.00 : test %r15,%r15 0.00 : test %r15,%r15
> 0.00 : je 0xffffffffc0badaf1 0.00 : je 0xffffffffc0b7faf1
> 0.00 : call 0xffffffff8bcd33a0 0.00 : call 0xffffffffa1ed33a0
> 0.00 : test %al,%al 0.00 : test %al,%al
> 0.00 : jne 0xffffffffc0badaf1 0.00 : jne 0xffffffffc0b7faf1
> 0.00 : mov -0x334ccb22(%rip),%esi 0.00 : mov -0x1d29eb22(%rip),%esi
> 0.00 : xor %eax,%eax 0.00 : xor %eax,%eax
> 0.00 : xor %ebx,%ebx 0.00 : xor %ebx,%ebx
> 0.00 : xor %ebp,%ebp 0.00 : xor %ebp,%ebp
> 0.00 : xor %r12d,%r12d 0.00 : xor %r12d,%r12d
> 0.00 : xor %r13d,%r13d 0.00 : xor %r13d,%r13d
> 0.05 : xor %r14d,%r14d 0.08 : xor %r14d,%r14d
> 0.00 : jmp 0xffffffffc0bad9f7 0.00 : jmp 0xffffffffc0b7f9f7
> 0.14 : movslq %eax,%rdx 0.08 : movslq %eax,%rdx
> 0.00 : mov 0x68(%r15),%rcx 0.00 : mov 0x68(%r15),%rcx
> 11.92 : add $0x1,%eax 12.25 : add $0x1,%eax
> 0.00 : add -0x72ead560(,%rdx,8),%rcx 0.00 : add -0x5ccad560(,%rdx,8),%rcx
> 4.07 : mov (%rcx),%r11 3.84 : mov (%rcx),%rdx
> 34.28 : mov 0x8(%rcx),%r10 34.17 : add %rdx,%r14
> 10.35 : mov 0x10(%rcx),%r9 2.62 : mov 0x8(%rcx),%rdx
> 7.28 : mov 0x18(%rcx),%rdi 9.33 : add %rdx,%r13
> 7.70 : mov 0x20(%rcx),%rdx 1.82 : mov 0x10(%rcx),%rdx
> 6.80 : add %r11,%r14 6.11 : add %rdx,%r12
> 0.27 : add %r10,%r13 1.26 : mov 0x18(%rcx),%rdx
> 0.36 : add %r9,%r12 6.00 : add %rdx,%rbp
> 2.24 : add %rdi,%rbp 2.97 : mov 0x20(%rcx),%rdx
> 1.21 : add %rdx,%rbx 5.78 : add %rdx,%rbx
> 0.07 : movslq %eax,%rdx 1.08 : movslq %eax,%rdx
> 0.35 : mov $0xffffffff8d6e0860,%rdi 0.00 : mov $0xffffffffa38e0860,%rdi
> 2.26 : call 0xffffffff8c16e270 2.62 : call 0xffffffffa236e270
> 0.17 : mov -0x334ccb7c(%rip),%esi 0.11 : mov -0x1d29eb7c(%rip),%esi
> 0.00 : cmp %eax,%esi 0.00 : cmp %eax,%esi
> 0.00 : ja 0xffffffffc0bad9c3 0.00 : ja 0xffffffffc0b7f9c3
> 0.00 : lea 0x70(%r15),%rdx 0.00 : lea 0x70(%r15),%rdx
> 0.00 : mov %rdx,%rdi 0.00 : mov %rdx,%rdi
> 0.04 : mov %rdx,(%rsp) 0.05 : mov %rdx,(%rsp)
> 0.00 : call 0xffffffff8c74d800 0.00 : call 0xffffffffa294d800
> 0.06 : mov %r14,%rax 0.07 : mov %r14,%rax
> 0.00 : sub 0x20(%r15),%rax 0.00 : sub 0x20(%r15),%rax
> 9.55 : mov 0x38(%r15),%rcx 8.97 : mov 0x38(%r15),%rcx
> 0.01 : mov (%rsp),%rdx 0.03 : mov (%rsp),%rdx
The results indicate that the extra add instructions should not matter - memory accesses (the mov instructions) are the bottleneck.
--
Jiri Wiesner
SUSE Labs
next prev parent reply other threads:[~2022-11-15 12:26 UTC|newest]
Thread overview: 24+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-10-31 14:56 [RFC PATCHv6 0/7] ipvs: Use kthreads for stats Julian Anastasov
2022-10-31 14:56 ` [RFC PATCHv6 1/7] ipvs: add rcu protection to stats Julian Anastasov
2022-11-12 7:51 ` Jiri Wiesner
2022-10-31 14:56 ` [RFC PATCHv6 2/7] ipvs: use common functions for stats allocation Julian Anastasov
2022-11-12 8:22 ` Jiri Wiesner
2022-10-31 14:56 ` [RFC PATCHv6 3/7] ipvs: use u64_stats_t for the per-cpu counters Julian Anastasov
2022-11-12 9:00 ` Jiri Wiesner
2022-11-12 9:09 ` Jiri Wiesner
2022-11-12 16:01 ` Julian Anastasov
2022-11-14 11:46 ` Julian Anastasov
2022-11-15 12:26 ` Jiri Wiesner [this message]
2022-11-15 16:53 ` Julian Anastasov
2022-11-16 17:37 ` Jiri Wiesner
2022-11-19 7:46 ` Jiri Wiesner
2022-10-31 14:56 ` [RFC PATCHv6 4/7] ipvs: use kthreads for stats estimation Julian Anastasov
2022-11-10 15:39 ` Jiri Wiesner
2022-11-10 20:16 ` Julian Anastasov
2022-11-11 17:21 ` Jiri Wiesner
2022-11-21 16:05 ` Jiri Wiesner
2022-10-31 14:56 ` [RFC PATCHv6 5/7] ipvs: add est_cpulist and est_nice sysctl vars Julian Anastasov
2022-11-21 16:29 ` Jiri Wiesner
2022-10-31 14:56 ` [RFC PATCHv6 6/7] ipvs: run_estimation should control the kthread tasks Julian Anastasov
2022-11-21 16:32 ` Jiri Wiesner
2022-10-31 14:56 ` [RFC PATCHv6 7/7] ipvs: debug the tick time Julian Anastasov
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20221115122602.GJ3484@incl \
--to=jwiesner@suse.de \
--cc=dust.li@linux.alibaba.com \
--cc=horms@verge.net.au \
--cc=ja@ssi.bg \
--cc=lvs-devel@vger.kernel.org \
--cc=xintian1976@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.