From: Stephen Hemminger <shemminger@vyatta.com>
To: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Yakov Lerner <iler.ml@gmail.com>,
netdev@vger.kernel.org, Eric Dumazet <eric.dumazet@gmail.com>,
David Miller <davem@davemloft.net>
Subject: Re: [PATCH] /proc/net/tcp, overhead removed
Date: Mon, 28 Sep 2009 16:24:17 -0700
Message-ID: <20090928162417.59640672@nehalam>
In-Reply-To: <4AC13697.4090707@gmail.com>
On Tue, 29 Sep 2009 00:20:07 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Yakov Lerner wrote:
> > On Sun, Sep 27, 2009 at 12:53, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >> Yakov Lerner wrote:
> >>> /proc/net/tcp does 20,000 sockets in 60-80 milliseconds, with this patch.
> >>>
> >>> The overhead was in tcp_seq_start(). See analysis (3) below.
> >>> The patch is against Linus git tree (1). The patch is small.
> >>>
> >>> ------------  -----------  ---------------------------------------------
> >>> Before patch  After patch  20,000 sockets (10,000 tw + 10,000 estab) (2)
> >>> ------------  -----------  ---------------------------------------------
> >>> 6 sec         0.06 sec     dd bs=1k if=/proc/net/tcp >/dev/null
> >>> 1.5 sec       0.06 sec     dd bs=4k if=/proc/net/tcp >/dev/null
> >>>
> >>> 1.9 sec       0.16 sec     netstat -4ant >/dev/null
> >>> ------------  -----------  ---------------------------------------------
> >>>
> >>> This is a ~25x improvement.
> >>> The new time does not depend on the read block size.
> >>> Speed of netstat, naturally, improves, too; both -4 and -6.
> >>> /proc/net/tcp6 does 20,000 sockets in 100 millisec.
> >>>
> >>> (1) against git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
> >>>
> >>> (2) Used 'manysock' utility to stress system with large number of sockets:
> >>> "manysock 10000 10000" - 10,000 tw + 10,000 estab ip4 sockets.
> >>> "manysock -6 10000 10000" - 10,000 tw + 10,000 estab ip6 sockets.
> >>> Found at http://ilerner.3b1.org/manysock/manysock.c
> >>>
> >>> (3) Algorithmic analysis.
> >>> Old algorithm.
> >>>
> >>> During 'cat </proc/net/tcp', tcp_seq_start() is called O(numsockets) times (4).
> >>> On average, every call to tcp_seq_start() scans half the whole hashtable. Ouch.
> >>> This is O(numsockets * hashsize). 95-99% of 'cat </proc/net/tcp' is spent in
> >>> tcp_seq_start()->tcp_get_idx. This overhead is eliminated by new algorithm,
> >>> which is O(numsockets + hashsize).
> >>>
> >>> New algorithm.
> >>>
> >>> The new algorithm is O(numsockets + hashsize). We jump to the right
> >>> hash bucket in tcp_seq_start(), without scanning half the hash.
> >>> To jump right to the hash bucket corresponding to *pos in tcp_seq_start(),
> >>> we reuse three pieces of state (st->num, st->bucket, st->sbucket)
> >>> as follows:
> >>> - we check that requested pos >= last seen pos (st->num), the typical case.
> >>> - if so, we jump to bucket st->bucket
> >>> - to arrive at the right item after the beginning of st->bucket, we
> >>> keep in st->sbucket the position corresponding to the beginning of
> >>> bucket.
> >>>
> >>> (4) Explanation of O( numsockets * hashsize) of old algorithm.
> >>>
> >>> tcp_seq_start() is called once for every ~7 lines of netstat output
> >>> if readsize is 1kb, or once for every ~28 lines if readsize >= 4kb.
> >>> Since record length of /proc/net/tcp records is 150 bytes, formula for
> >>> number of calls to tcp_seq_start() is
> >>> (numsockets * 150 / min(4096,readsize)).
> >>> Netstat uses 4kb readsize (newer versions), or 1kb (older versions).
> >>> Note that speed of old algorithm does not improve above 4kb blocksize.
> >>>
> >>> Speed of the new algorithm does not depend on blocksize.
> >>>
> >>> Speed of the new algorithm does not perceptibly depend on hashsize (which
> >>> depends on ramsize). Speed of old algorithm drops with bigger hashsize.
> >>>
> >>> (5) Reporting order.
> >>>
> >>> Reporting order is exactly the same as before if the hash does not change underfoot.
> >>> When hash elements come and go during report, reporting order will be
> >>> same as that of tcpdiag.
> >>>
> >>> Signed-off-by: Yakov Lerner <iler.ml@gmail.com>

Does the netlink interface used by the ss command have the same problem?
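
For readers following the quoted analysis (3), here is a minimal user-space
sketch of the two seek strategies. It is not the kernel code or the patch:
the toy table, the Cursor class and the function names are all hypothetical,
and the fields num/bucket/sbucket merely mirror the st->num, st->bucket and
st->sbucket state named in the description. The default of 7 entries per
read is taken from the "~7 lines" per 1kb read quoted above. The step
counter shows why the old walk grows like numsockets * hashsize while the
resumed walk stays within numsockets + hashsize.

```python
import random


def build_table(numsockets, hashsize, seed=0):
    """Toy hash table: a list of buckets, each a list of socket ids."""
    rng = random.Random(seed)
    table = [[] for _ in range(hashsize)]
    for sock in range(numsockets):
        table[rng.randrange(hashsize)].append(sock)
    return table


class Cursor:
    """Hypothetical iterator state, loosely modelled on the quoted fields."""

    def __init__(self):
        self.num = 0      # last position served
        self.bucket = 0   # bucket that held that position
        self.sbucket = 0  # position of the first entry of that bucket
        self.steps = 0    # number of buckets visited so far


def _walk(table, pos, cur):
    """Walk forward from (cur.bucket, cur.sbucket) to entry number pos."""
    b, base = cur.bucket, cur.sbucket
    while b < len(table):
        cur.steps += 1
        n = len(table[b])
        if pos < base + n:
            cur.num, cur.bucket, cur.sbucket = pos, b, base
            return table[b][pos - base]
        base += n
        b += 1
    return None


def seek_old(table, pos, cur):
    """Old behaviour: every start rescans from bucket 0."""
    cur.bucket, cur.sbucket = 0, 0
    return _walk(table, pos, cur)


def seek_new(table, pos, cur):
    """New behaviour: resume from the saved bucket unless pos went backwards."""
    if pos < cur.num:
        cur.bucket, cur.sbucket = 0, 0
    return _walk(table, pos, cur)


def dump(table, seek, per_read=7):
    """Read the whole table, calling `seek` once per simulated read()."""
    cur = Cursor()
    total = sum(len(bucket) for bucket in table)
    out = []
    pos = 0
    while pos < total:
        out.append(seek(table, pos, cur))
        # Entries after the first one in a read are served incrementally.
        for p in range(pos + 1, min(pos + per_read, total)):
            out.append(seek_new(table, p, cur))
        pos += per_read
    return out, cur.steps


if __name__ == "__main__":
    table = build_table(2000, 4096)
    for name, seek in (("old", seek_old), ("new", seek_new)):
        _, steps = dump(table, seek)
        print(name, "buckets visited:", steps)
```

Both variants report the entries in the same order as long as the table does
not change during the dump; only the number of buckets visited differs.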