* tc filter insertion rate degradation
@ 2019-01-21 11:24 Vlad Buslov
2019-01-22 17:33 ` Eric Dumazet
0 siblings, 1 reply; 6+ messages in thread
From: Vlad Buslov @ 2019-01-21 11:24 UTC (permalink / raw)
To: Eric Dumazet
Cc: Linux Kernel Network Developers, Yevgeny Kliteynik, Yossef Efraim,
Maor Gottlieb
Hi Eric,
I've been investigating a significant tc filter insertion rate degradation,
and it seems to be caused by your commit 001c96db0181 ("net: align
gnet_stats_basic_cpu struct"). With this commit the insertion rate drops
from ~65k rules/sec to ~43k rules/sec when inserting 1m rules from a file
in tc batch mode on my machine.
A tc perf profile indicates that the pcpu allocator now consumes 2x the CPU:
1) Before:
Samples: 63K of event 'cycles:ppp', Event count (approx.): 48796480071
Children  Self  Command  Shared Object  Symbol
+ 21.19% 3.38% tc [kernel.vmlinux] [k] pcpu_alloc
+ 3.45% 0.25% tc [kernel.vmlinux] [k] pcpu_alloc_area
2) After:
Samples: 92K of event 'cycles:ppp', Event count (approx.): 71446806550
Children  Self  Command  Shared Object  Symbol
+ 44.67% 3.99% tc [kernel.vmlinux] [k] pcpu_alloc
+ 19.25% 0.22% tc [kernel.vmlinux] [k] pcpu_alloc_area
It seems that it takes much more work for the pcpu allocator to satisfy
allocations with the new, stricter alignment requirement. Not sure whether
that is expected behavior in this case.
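For context, each rule that keeps per-cpu stats does roughly the following
at insert time (simplified sketch, not the exact tc/act call chain):

        /* one 16-byte per-cpu counter pair per rule; after the commit the
         * struct also asks for 16-byte alignment instead of the natural 8 */
        struct gnet_stats_basic_cpu __percpu *cpu_bstats;

        cpu_bstats = netdev_alloc_pcpu_stats(struct gnet_stats_basic_cpu);
        if (!cpu_bstats)
                return -ENOMEM;

so the 1m-rule batch turns into on the order of 1m pcpu_alloc() calls, each
now carrying the stricter alignment.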
Regards,
Vlad
* Re: tc filter insertion rate degradation
2019-01-21 11:24 tc filter insertion rate degradation Vlad Buslov
@ 2019-01-22 17:33 ` Eric Dumazet
2019-01-22 21:18 ` Tejun Heo
2019-01-24 17:21 ` Dennis Zhou
0 siblings, 2 replies; 6+ messages in thread
From: Eric Dumazet @ 2019-01-22 17:33 UTC (permalink / raw)
To: Vlad Buslov, Dennis Zhou, Tejun Heo
Cc: Linux Kernel Network Developers, Yevgeny Kliteynik, Yossef Efraim,
Maor Gottlieb
On Mon, Jan 21, 2019 at 3:24 AM Vlad Buslov <vladbu@mellanox.com> wrote:
>
> Hi Eric,
>
> I've been investigating significant tc filter insertion rate degradation
> and it seems it is caused by your commit 001c96db0181 ("net: align
> gnet_stats_basic_cpu struct"). With this commit insertion rate is
> reduced from ~65k rules/sec to ~43k rules/sec when inserting 1m rules
> from file in tc batch mode on my machine.
>
> Tc perf profile indicates that pcpu allocator now consumes 2x CPU:
>
> 1) Before:
>
> Samples: 63K of event 'cycles:ppp', Event count (approx.): 48796480071
> Children  Self  Command  Shared Object  Symbol
> + 21.19% 3.38% tc [kernel.vmlinux] [k] pcpu_alloc
> + 3.45% 0.25% tc [kernel.vmlinux] [k] pcpu_alloc_area
>
> 2) After:
>
> Samples: 92K of event 'cycles:ppp', Event count (approx.): 71446806550
> Children  Self  Command  Shared Object  Symbol
> + 44.67% 3.99% tc [kernel.vmlinux] [k] pcpu_alloc
> + 19.25% 0.22% tc [kernel.vmlinux] [k] pcpu_alloc_area
>
> It seems that it takes much more work for pcpu allocator to perform
> allocation with new stricter alignment requirements. Not sure if it is
> expected behavior or not in this case.
>
> Regards,
> Vlad
Hi Vlad
I guess this is more a question for per-cpu allocator experts / maintainers ?
16-byte alignment for 16-byte objects sounds quite reasonable [1]
It also means that if your workload mostly consists of setting up and
dismantling tc filters, rather than really using them, you might go back
to atomics instead of expensive per-cpu storage
(i.e. optimize the control path instead of the data path).
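As a sketch of the alternative only (rule_stats and the field names are
made up here for illustration, this is not existing code):

        struct rule_stats {
                atomic64_t bytes;
                atomic64_t packets;
        };

        /* nothing per-cpu to allocate at rule insertion time; every
         * matching packet pays for shared atomic read-modify-writes */
        atomic64_add(skb->len, &stats->bytes);
        atomic64_inc(&stats->packets);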
Thanks !
[1] We even might make this generic as in :
diff --git a/mm/percpu.c b/mm/percpu.c
index 27a25bf1275b7233d28cc0b126256e0f8a2b7f4f..bbf4ad37ae893fc1da5523889dd147f046852cc7 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1362,7 +1362,11 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
*/
if (unlikely(align < PCPU_MIN_ALLOC_SIZE))
align = PCPU_MIN_ALLOC_SIZE;
-
+ while (align < L1_CACHE_BYTES && (align << 1) <= size) {
+ if (size % (align << 1))
+ break;
+ align <<= 1;
+ }
size = ALIGN(size, PCPU_MIN_ALLOC_SIZE);
bits = size >> PCPU_MIN_ALLOC_SHIFT;
bit_align = align >> PCPU_MIN_ALLOC_SHIFT;
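If it helps to see what the loop does, here is the same logic pulled out
into a user-space toy (L1_CACHE_BYTES hard-coded to 64, which is an
assumption about the target, and PCPU_MIN_ALLOC_SIZE rounding ignored):

#include <stdio.h>
#include <stddef.h>

#define L1_CACHE_BYTES 64       /* assumed; depends on the architecture */

static size_t widen_align(size_t size, size_t align)
{
        /* same loop as the pcpu_alloc() hunk above */
        while (align < L1_CACHE_BYTES && (align << 1) <= size) {
                if (size % (align << 1))
                        break;
                align <<= 1;
        }
        return align;
}

int main(void)
{
        printf("%zu\n", widen_align(16, 8));    /* 16: the stats-pair case */
        printf("%zu\n", widen_align(24, 8));    /* 8: 24 is not a multiple of 16 */
        printf("%zu\n", widen_align(128, 8));   /* 64: capped at L1_CACHE_BYTES */
        return 0;
}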
* Re: tc filter insertion rate degradation
2019-01-22 17:33 ` Eric Dumazet
@ 2019-01-22 21:18 ` Tejun Heo
2019-01-22 22:40 ` Eric Dumazet
2019-01-24 17:21 ` Dennis Zhou
1 sibling, 1 reply; 6+ messages in thread
From: Tejun Heo @ 2019-01-22 21:18 UTC (permalink / raw)
To: Eric Dumazet
Cc: Vlad Buslov, Dennis Zhou, Linux Kernel Network Developers,
Yevgeny Kliteynik, Yossef Efraim, Maor Gottlieb
Hello,
On Tue, Jan 22, 2019 at 09:33:10AM -0800, Eric Dumazet wrote:
> > 1) Before:
> >
> > Samples: 63K of event 'cycles:ppp', Event count (approx.): 48796480071
> > Children  Self  Command  Shared Object  Symbol
> > + 21.19% 3.38% tc [kernel.vmlinux] [k] pcpu_alloc
> > + 3.45% 0.25% tc [kernel.vmlinux] [k] pcpu_alloc_area
> >
> > 2) After:
> >
> > Samples: 92K of event 'cycles:ppp', Event count (approx.): 71446806550
> > Children  Self  Command  Shared Object  Symbol
> > + 44.67% 3.99% tc [kernel.vmlinux] [k] pcpu_alloc
> > + 19.25% 0.22% tc [kernel.vmlinux] [k] pcpu_alloc_area
The allocator hint only remembers the max available size per chunk but
not the alignment, so depending on the allocation pattern, an alignment
requirement change can lead to way more scanning per alloc attempt.
Shouldn't be too difficult to improve tho.
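As a toy model of why (this is not the real mm/percpu.c data structure,
just an illustration): a size-only hint can say a free run is big enough
while the requested alignment makes it unusable.

#include <stdbool.h>
#include <stddef.h>

/* can a free run of `len` bytes starting at chunk offset `off` hold a
 * request of `size` bytes aligned to `align`? */
static bool run_fits(size_t off, size_t len, size_t size, size_t align)
{
        size_t start = (off + align - 1) & ~(align - 1);        /* round up */

        return start - off + size <= len;
}

/*
 * run_fits(8, 16, 16,  8) -> true   (old 8-byte alignment is satisfied)
 * run_fits(8, 16, 16, 16) -> false  (same run, stricter alignment fails)
 *
 * so the cached "max free size" stops being a reliable fast answer and
 * more runs/chunks get scanned per allocation.
 */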
> I guess this is more a question for per-cpu allocator experts / maintainers ?
>
> 16-byte alignment for 16-byte objects sounds quite reasonable [1]
>
> It also means that if your workload mostly consists of setting up and
> dismantling tc filters, rather than really using them, you might go back
> to atomics instead of expensive per-cpu storage
> (i.e. optimize the control path instead of the data path).
>
> Thanks !
>
> [1] We even might make this generic as in :
>
> diff --git a/mm/percpu.c b/mm/percpu.c
> index 27a25bf1275b7233d28cc0b126256e0f8a2b7f4f..bbf4ad37ae893fc1da5523889dd147f046852cc7 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -1362,7 +1362,11 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
> */
> if (unlikely(align < PCPU_MIN_ALLOC_SIZE))
> align = PCPU_MIN_ALLOC_SIZE;
> -
> + while (align < L1_CACHE_BYTES && (align << 1) <= size) {
> + if (size % (align << 1))
> + break;
> + align <<= 1;
> + }
Percpu storage is expensive and cache line sharing tends to be less of
a problem (cuz they're per-cpu), so it is useful to support custom
alignments for tighter packing.
Thanks.
--
tejun
* Re: tc filter insertion rate degradation
2019-01-22 21:18 ` Tejun Heo
@ 2019-01-22 22:40 ` Eric Dumazet
0 siblings, 0 replies; 6+ messages in thread
From: Eric Dumazet @ 2019-01-22 22:40 UTC (permalink / raw)
To: Tejun Heo
Cc: Vlad Buslov, Dennis Zhou, Linux Kernel Network Developers,
Yevgeny Kliteynik, Yossef Efraim, Maor Gottlieb
On Tue, Jan 22, 2019 at 1:18 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> Percpu storage is expensive and cache line sharing tends to be less of
> a problem (cuz they're per-cpu), so it is useful to support custom
> alignments for tighter packing.
>
We have BPF percpu maps of two 8-byte counters (a packets counter and a
bytes counter), with millions of slots.
We update the pair for every packet sent on the hosts.
BPF uses an alignment of 8 (that cannot be changed/tuned, at least in all
call sites from kernel/bpf/hashtab.c).
If we are lucky, each of these pairs lands within a single cache line.
But when we are not lucky, 25% of the pairs cross a cache line, reducing
performance under DDoS.
Using a nicer alignment in our case does not consume more RAM, and we did
not notice any extra cost of per-cpu allocations because we keep them in
the slow path (control path).
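For the cache-line part, a quick user-space check (64-byte lines assumed;
this only enumerates the possible start offsets within one line, the
fraction on real hosts also depends on how neighbouring objects get
packed):

#include <stdio.h>

int main(void)
{
        const unsigned line = 64, size = 16;    /* one packets+bytes pair */
        unsigned align;

        for (align = 8; align <= 16; align <<= 1) {
                unsigned crossing = 0, slots = 0, off;

                for (off = 0; off < line; off += align) {
                        slots++;
                        if (off + size > line)  /* pair spills into the next line */
                                crossing++;
                }
                printf("align %2u: %u of %u start offsets cross a line\n",
                       align, crossing, slots);
        }
        return 0;
}

With 16-byte alignment a 16-byte pair can never straddle two lines, which
is exactly what the gnet_stats change buys.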
* Re: tc filter insertion rate degradation
2019-01-22 17:33 ` Eric Dumazet
2019-01-22 21:18 ` Tejun Heo
@ 2019-01-24 17:21 ` Dennis Zhou
2019-01-29 19:22 ` Vlad Buslov
1 sibling, 1 reply; 6+ messages in thread
From: Dennis Zhou @ 2019-01-24 17:21 UTC (permalink / raw)
To: Eric Dumazet, Vlad Buslov
Cc: Tejun Heo, Linux Kernel Network Developers, Yevgeny Kliteynik,
Yossef Efraim, Maor Gottlieb
Hi Vlad and Eric,
On Tue, Jan 22, 2019 at 09:33:10AM -0800, Eric Dumazet wrote:
> On Mon, Jan 21, 2019 at 3:24 AM Vlad Buslov <vladbu@mellanox.com> wrote:
> >
> > Hi Eric,
> >
> > I've been investigating significant tc filter insertion rate degradation
> > and it seems it is caused by your commit 001c96db0181 ("net: align
> > gnet_stats_basic_cpu struct"). With this commit insertion rate is
> > reduced from ~65k rules/sec to ~43k rules/sec when inserting 1m rules
> > from file in tc batch mode on my machine.
> >
> > Tc perf profile indicates that pcpu allocator now consumes 2x CPU:
> >
> > 1) Before:
> >
> > Samples: 63K of event 'cycles:ppp', Event count (approx.): 48796480071
> > Children  Self  Command  Shared Object  Symbol
> > + 21.19% 3.38% tc [kernel.vmlinux] [k] pcpu_alloc
> > + 3.45% 0.25% tc [kernel.vmlinux] [k] pcpu_alloc_area
> >
> > 2) After:
> >
> > Samples: 92K of event 'cycles:ppp', Event count (approx.): 71446806550
> > Children  Self  Command  Shared Object  Symbol
> > + 44.67% 3.99% tc [kernel.vmlinux] [k] pcpu_alloc
> > + 19.25% 0.22% tc [kernel.vmlinux] [k] pcpu_alloc_area
> >
> > It seems that it takes much more work for pcpu allocator to perform
> > allocation with new stricter alignment requirements. Not sure if it is
> > expected behavior or not in this case.
> >
> > Regards,
> > Vlad
Would you mind sharing a little more information with me:
1) output before and after a run of /sys/kernel/debug/percpu_stats
2) a full perf output
3) a reproducer
I'm a little surprised we're spending this much time in pcpu_alloc_area();
my immediate guess is that it might be due to constantly breaking the hint.
>
> Hi Vlad
>
> I guess this is more a question for per-cpu allocator experts / maintainers ?
>
> 16-byte alignment for 16-byte objects sounds quite reasonable [1]
>
The alignment request seems reasonable. But as Tejun mentioned in a
reply to this, the overhead of forced alignment would be both in percpu
memory itself and in allocation time due to the stricter requirement.
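To make the memory-side overhead concrete (back-of-the-envelope only,
ignoring PCPU_MIN_ALLOC_SIZE granularity): a forced 16-byte alignment can
leave holes behind neighbours that do not end on a 16-byte boundary, e.g.:

#include <stdio.h>

int main(void)
{
        unsigned end;

        /* padding needed to place a 16-byte-aligned object right after an
         * allocation ending at offset `end` within a chunk */
        for (end = 0; end < 16; end += 4) {
                unsigned start = (end + 15) & ~15u;

                printf("previous object ends at %2u -> %2u bytes of padding\n",
                       end, start - end);
        }
        return 0;
}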
Thanks,
Dennis
* Re: tc filter insertion rate degradation
2019-01-24 17:21 ` Dennis Zhou
@ 2019-01-29 19:22 ` Vlad Buslov
0 siblings, 0 replies; 6+ messages in thread
From: Vlad Buslov @ 2019-01-29 19:22 UTC (permalink / raw)
To: Dennis Zhou
Cc: Eric Dumazet, Tejun Heo, Linux Kernel Network Developers,
Yevgeny Kliteynik, Yossef Efraim, Maor Gottlieb
On Thu 24 Jan 2019 at 17:21, Dennis Zhou <dennis@kernel.org> wrote:
> Hi Vlad and Eric,
>
> On Tue, Jan 22, 2019 at 09:33:10AM -0800, Eric Dumazet wrote:
>> On Mon, Jan 21, 2019 at 3:24 AM Vlad Buslov <vladbu@mellanox.com> wrote:
>> >
>> > Hi Eric,
>> >
>> > I've been investigating significant tc filter insertion rate degradation
>> > and it seems it is caused by your commit 001c96db0181 ("net: align
>> > gnet_stats_basic_cpu struct"). With this commit insertion rate is
>> > reduced from ~65k rules/sec to ~43k rules/sec when inserting 1m rules
>> > from file in tc batch mode on my machine.
>> >
>> > Tc perf profile indicates that pcpu allocator now consumes 2x CPU:
>> >
>> > 1) Before:
>> >
>> > Samples: 63K of event 'cycles:ppp', Event count (approx.): 48796480071
>> > Children  Self  Command  Shared Object  Symbol
>> > + 21.19% 3.38% tc [kernel.vmlinux] [k] pcpu_alloc
>> > + 3.45% 0.25% tc [kernel.vmlinux] [k] pcpu_alloc_area
>> >
>> > 2) After:
>> >
>> > Samples: 92K of event 'cycles:ppp', Event count (approx.): 71446806550
>> > Children  Self  Command  Shared Object  Symbol
>> > + 44.67% 3.99% tc [kernel.vmlinux] [k] pcpu_alloc
>> > + 19.25% 0.22% tc [kernel.vmlinux] [k] pcpu_alloc_area
>> >
>> > It seems that it takes much more work for pcpu allocator to perform
>> > allocation with new stricter alignment requirements. Not sure if it is
>> > expected behavior or not in this case.
>> >
>> > Regards,
>> > Vlad
>
> Would you mind sharing a little more information with me:
> 1) output before and after a run of /sys/kernel/debug/percpu_stats
Hi Dennis,
Some of these files are quite large, so I put them in my Dropbox.
Output before:
Percpu Memory Statistics
Allocation Info:
----------------------------------------
unit_size : 262144
static_size : 139160
reserved_size : 8192
dyn_size : 28776
atom_size : 2097152
alloc_size : 2097152
Global Stats:
----------------------------------------
nr_alloc : 3343
nr_dealloc : 752
nr_cur_alloc : 2591
nr_max_alloc : 2598
nr_chunks : 3
nr_max_chunks : 3
min_alloc_size : 4
max_alloc_size : 8208
empty_pop_pages : 3
Per Chunk Stats:
----------------------------------------
Chunk: <- Reserved Chunk
nr_alloc : 5
max_alloc_size : 320
empty_pop_pages : 0
first_bit : 1002
free_bytes : 7448
contig_bytes : 7424
sum_frag : 24
max_frag : 24
cur_min_alloc : 16
cur_med_alloc : 64
cur_max_alloc : 320
Chunk: <- First Chunk
nr_alloc : 479
max_alloc_size : 8208
empty_pop_pages : 0
first_bit : 8192
free_bytes : 0
contig_bytes : 0
sum_frag : 0
max_frag : 0
cur_min_alloc : 4
cur_med_alloc : 24
cur_max_alloc : 8208
Chunk:
nr_alloc : 1925
max_alloc_size : 8208
empty_pop_pages : 0
first_bit : 63102
free_bytes : 852
contig_bytes : 12
sum_frag : 852
max_frag : 12
cur_min_alloc : 4
cur_med_alloc : 8
cur_max_alloc : 8208
Chunk:
nr_alloc : 182
max_alloc_size : 936
empty_pop_pages : 3
first_bit : 21
free_bytes : 256452
contig_bytes : 255120
sum_frag : 1332
max_frag : 368
cur_min_alloc : 8
cur_med_alloc : 20
cur_max_alloc : 320
After: https://www.dropbox.com/s/unyzhx4vgo2x30e/stats_after?dl=0
> 2) a full perf output
https://www.dropbox.com/s/isfcxca3npn5slx/perf.data?dl=0
> 3) a reproducer
$ sudo tc -b add.0
Example batch file: https://www.dropbox.com/s/ey7cbl5nwu5p0tg/add.0?dl=0
Thanks,
Vlad
Thread overview: 6+ messages (thread ended 2019-01-29 19:22 UTC)
2019-01-21 11:24 tc filter insertion rate degradation Vlad Buslov
2019-01-22 17:33 ` Eric Dumazet
2019-01-22 21:18 ` Tejun Heo
2019-01-22 22:40 ` Eric Dumazet
2019-01-24 17:21 ` Dennis Zhou
2019-01-29 19:22 ` Vlad Buslov