From: Andrew Morton <akpm@linux-foundation.org>
To: Eric Dumazet <dada1@cosmosbay.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>,
"David S. Miller" <davem@davemloft.net>,
netdev@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: include/linux/pcounter.h
Date: Sat, 16 Feb 2008 11:26:18 -0800 [thread overview]
Message-ID: <20080216112618.ec450f9b.akpm@linux-foundation.org> (raw)
In-Reply-To: <47B6D12A.3090401@cosmosbay.com>
On Sat, 16 Feb 2008 13:03:54 +0100 Eric Dumazet <dada1@cosmosbay.com> wrote:
> Andrew Morton a écrit :
> > On Sat, 16 Feb 2008 11:07:29 +0100 Eric Dumazet <dada1@cosmosbay.com> wrote:
> >
> >> Andrew, pcounter is a temporary abstraction.
> >
> > It's buggy! Main problems are a) possible return of negative numbers b)
> > some of the API can't be from preemptible code c) excessive interrupt-off
> > time on some machines if used from irq-disabled sections.
This is starting to get tiresome. Please do not waste my time and everyone
else's by defending the indefensible.
> a) We dont care of possibly off values when reading /proc/net/sockstat
So take the damn thing out of ./lib/ and hide it down in networking
somewhere.
> Same arguments apply for percpu_counters.
No it does not, Not at all. Callers regularly use percpu_counters to count
the number of some object. These counts never go negative!
And how the heck can you say that we don't care if /proc/net/sockstat goes
negative? You don't know that. Someone's system-monitoring app might very
well take drastic action if it sees connection-related statistics go
negative, or around 4 gigaconnections.
> b) It is called from network parts where preemption is disabled.
It's in ./lib/
> net/ipv4/inet_timewait_sock.c:94:
> sock_prot_inuse_add(sk->sk_prot, -1);
> net/ipv4/inet_hashtables.c:291: sock_prot_inuse_add(sk->sk_prot, 1);
> net/ipv4/inet_hashtables.c:335: sock_prot_inuse_add(sk->sk_prot, 1);
> net/ipv4/inet_hashtables.c:357: sock_prot_inuse_add(sk->sk_prot, 1);
> net/ipv4/inet_hashtables.c:390: sock_prot_inuse_add(sk->sk_prot, -1);
> net/ipv4/raw.c:95: sock_prot_inuse_add(sk->sk_prot, 1);
> net/ipv4/raw.c:104: sock_prot_inuse_add(sk->sk_prot, -1);
> net/ipv4/udp.c:235: sock_prot_inuse_add(sk->sk_prot, 1);
> net/ipv6/ipv6_sockglue.c:271:
> sock_prot_inuse_add(sk->sk_prot, -1);
> net/ipv6/ipv6_sockglue.c:272:
> sock_prot_inuse_add(&tcp_prot, 1);
> net/ipv6/ipv6_sockglue.c:285:
> sock_prot_inuse_add(sk->sk_prot, -1);
> net/ipv6/ipv6_sockglue.c:286:
> sock_prot_inuse_add(prot, 1);
> net/ipv6/inet6_hashtables.c:46: sock_prot_inuse_add(sk->sk_prot, 1);
> net/ipv6/inet6_hashtables.c:207: sock_prot_inuse_add(sk->sk_prot, 1);
>
> c) No need to play with irq games here.
It's in ./lib/
> >
> >> It is temporaty because it will vanish as soon as Christoph Clameter (or
> >> somebody else) provides real cheap per cpu counter implementation.
> >
> > numbers?
> >
> > most of percpu_counter_add() is only executed once per FBC_BATCH calls.
>
>
>
> >
> >> At time I introduced it in network tree (locally, not meant to invade kernel
> >> land and makes you unhappy :) ), the goals were :
> >
> > Well maybe as a temporary networking-only thing OK, based upon
> > performance-tested results. But I don't think the present code is suitable
> > as part of the kernel-wide toolkit.
> >
> >> Some counters (total sockets count) were a single integer, that were doing
> >> ping-pong between cpus (SMP/NUMA). As they are basically lazy values (as we
> >> dont really need to read their value), using plain atomic_t was overkill.
> >>
> >> Using a plain percpu_counters was expensive (NR_CPUS*(32+sizeof(void *))
> >> instead of num_possible_cpus()*4).
> >
> > No, percpu_counters use alloc_percpu(), which is O(num_possible_cpus), not
> > O(NR_CPUS).
>
> You are playing with words.
I assumed the comment was comparing with percpu_counters.
Seems that it was comparing with some hand-rolled static array of NR_CPUS
entries or something.
> >
> >> We want really fast write side. Real fast.
> >
> > eh? It's called on a per-connection basis, not on a per-packet basis?
>
> Yes, per connection basis. Some workloads want to open/close more than 1000
> sockets per second.
ie: slowpath
> You are right we also need to reduce all atomic inc/dec done per packet :)
>
> A pcounter/perc_cpu for the device refcounter would be a very nice win, but as
> number of devices in the system might be very big, David said no to a percpu
> conversion. We will reconsider this when new percpu stuff can go in, so that
> the memory cost will be minimal (4 bytes per cpu per device)
>
> >
> >> Read side is when you do a "cat /proc/net/sockstat".
> >> That is ... once in a while...
> >
> > For the current single caller. But it's in ./lib/.
> >
> > And there's always someone out there who does whatever we don't expect them
> > to do.
> >
> >> Now when we allocate a new socket, code to increment the "socket count" is :
> >>
> >> c03a74a8 <tcp_pcounter_add>:
> >> c03a74a8: b8 90 26 5f c0 mov $0xc05f2690,%eax
> >> c03a74ad: 64 8b 0d 10 f1 5e c0 mov %fs:0xc05ef110,%ecx
> >> c03a74b4: 01 14 01 add %edx,(%ecx,%eax,1)
> >> c03a74b7: c3 ret
> >
> > I can't find that code. I suspect that's the DEFINE_PER_CPU flavour, which
> > isn't used anywhere afaict. Plus this omits the local_irq_save/restore (or
> > preempt_disable/enable) and the indirect function call, which can be
> > expensive.
>
> Please do : nm vmlinux | grep tcp_pcounter_add
>
> c03a74a8 t tcp_pcounter_add
>
> It should be there, even if its a static function
>
Probably - the macros hide it well. And the function itself doesn't tell
the whole story: callers must still do preempt_dsable or local_irq_enable
and must wear the cost of an indirect call.
> >
> >> It is even clearly stated at the top of include/linux/pcounter.h
> >>
> >> /*
> >> * Using a dynamic percpu 'int' variable has a cost :
> >> * 1) Extra dereference
> >> * Current per_cpu_ptr() implementation uses an array per 'percpu variable'.
> >> * 2) memory cost of NR_CPUS*(32+sizeof(void *)) instead of num_possible_cpus()*4
> >> *
> >> * This pcounter implementation is an abstraction to be able to use
> >> * either a static or a dynamic per cpu variable.
> >> * One dynamic per cpu variable gets a fast & cheap implementation, we can
> >> * change pcounter implementation too.
> >> */
> >>
> >>
> >> We all agree.
> >>
> >
> > No we don't. That comment is afaict wrong about the memory consumption and
> > the abstraction *isn't useful*.
>
> Fact is that we need percpu 32bits counters, and we need to have pointers to them.
>
> Current percpu_counters cannot cope that.
yes they can.
> >
> > Why do we want some abstraction which makes alloc_percpu() storage and
> > DEFINE_PERCPU storage "look the same"? What use is there in that? One is
> > per-object storage and one is singleton storage - they're quite different
> > things and they are used in quite different situations and they are
> > basically never interchangeable. Yet we add this pretend-they're-the-same
> > wrapper around them which costs us an indirect function call on the
> > fastpath.
>
> I believe all 'pcounter' are in fact statically allocated 'one per struct
> proto' to track inuse count. (search for DEFINE_PROTO_INUSE() uses)
>
> # find net include|xargs grep -n DEFINE_PROTO_INUSE
> net/dccp/ipv6.c:1105:DEFINE_PROTO_INUSE(dccp_v6)
> net/dccp/ipv4.c:920:DEFINE_PROTO_INUSE(dccp_v4)
> net/ipv6/udp.c:1001:DEFINE_PROTO_INUSE(udpv6)
> net/ipv6/udplite.c:43:DEFINE_PROTO_INUSE(udplitev6)
> net/ipv6/raw.c:1187:DEFINE_PROTO_INUSE(rawv6)
> net/ipv6/tcp_ipv6.c:2108:DEFINE_PROTO_INUSE(tcpv6)
> net/ipv4/udp.c:1477:DEFINE_PROTO_INUSE(udp)
> net/ipv4/udplite.c:47:DEFINE_PROTO_INUSE(udplite)
> net/ipv4/tcp_ipv4.c:2406:DEFINE_PROTO_INUSE(tcp)
> net/ipv4/raw.c:828:DEFINE_PROTO_INUSE(raw)
> net/sctp/socket.c:6461:DEFINE_PROTO_INUSE(sctp)
> net/sctp/socket.c:6495:DEFINE_PROTO_INUSE(sctpv6)
> include/net/sock.h:624:# define DEFINE_PROTO_INUSE(NAME) DEFINE_PCOUNTER(NAME)
> include/net/sock.h:644:# define DEFINE_PROTO_INUSE(NAME)
>
> So pcounter_alloc()/pcounter_free() could just be deleted from pcounter.h
Well if you do that and if pcounters become helpers functions around
DEFINE_PERCPU memory then things are starting to get a bit more sensible.
We can then remove the indirect function call.
Yes, a general wrapper around a DEFINE_PERCPU'ed counter could be a useful
thing to have. It could use raw_smp_processor_id() and local_add().
Problem is the read side will still be far too inefficient for it to be
presentable as a general-purpose library - it really will need to do all
the batching which experience told us was needed in percpu_counters.
The read side should also present two interfaces: one which returns a +ve
or -ve value and one which returns a never-negative value.
> indirect functions calls are everywhere in kernel, network, fs, everywhere.
That doesn't make them fast.
> As soon we can put in 'struct pcounter' the address of a percpu variable, we
> wont need anymore pointers to the pcounter_add()/getval() function.
>
> mov 0(pcounter),%eax # get the address of a percpuvar
> add %edx,fs:%eax
>
> This just need the percpu work done by SGI guys to be finished.
So... I need to keep watching the tree to make sure that nobody actually
uses this code in ./lib/?
next prev parent reply other threads:[~2008-02-16 19:28 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2008-02-04 9:44 include/linux/pcounter.h Andrew Morton
2008-02-05 0:20 ` include/linux/pcounter.h David Miller
2008-02-05 0:43 ` include/linux/pcounter.h Andrew Morton
2008-02-16 3:37 ` include/linux/pcounter.h Andrew Morton
2008-02-16 10:07 ` include/linux/pcounter.h Eric Dumazet
2008-02-16 10:50 ` include/linux/pcounter.h Andrew Morton
2008-02-16 12:03 ` include/linux/pcounter.h Eric Dumazet
2008-02-16 19:26 ` Andrew Morton [this message]
2008-02-16 19:39 ` include/linux/pcounter.h Arjan van de Ven
2008-02-17 5:54 ` include/linux/pcounter.h David Miller
2008-02-26 9:24 ` include/linux/pcounter.h Ingo Molnar
2008-02-26 21:35 ` include/linux/pcounter.h David Miller
2008-02-27 7:36 ` include/linux/pcounter.h Ingo Molnar
2008-02-28 1:27 ` include/linux/pcounter.h Arnaldo Carvalho de Melo
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20080216112618.ec450f9b.akpm@linux-foundation.org \
--to=akpm@linux-foundation.org \
--cc=dada1@cosmosbay.com \
--cc=davem@davemloft.net \
--cc=herbert@gondor.apana.org.au \
--cc=linux-kernel@vger.kernel.org \
--cc=netdev@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.