From mboxrd@z Thu Jan 1 00:00:00 1970
From: Yakov Lerner
Subject: Re: [PATCH] /proc/net/tcp, overhead removed
Date: Tue, 29 Sep 2009 10:43:09 +0300
Message-ID:
References: <1254000675-8327-1-git-send-email-iler.ml@gmail.com> <4ABF360E.7080301@gmail.com> <4AC13697.4090707@gmail.com> <20090928162417.59640672@nehalam>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: QUOTED-PRINTABLE
Cc: Eric Dumazet , netdev@vger.kernel.org, David Miller
To: Stephen Hemminger
Return-path:
Received: from fg-out-1718.google.com ([72.14.220.158]:30171 "EHLO fg-out-1718.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751146AbZI2HnI convert rfc822-to-8bit (ORCPT ); Tue, 29 Sep 2009 03:43:08 -0400
Received: by fg-out-1718.google.com with SMTP id 22so1692171fge.1 for ; Tue, 29 Sep 2009 00:43:11 -0700 (PDT)
In-Reply-To: <20090928162417.59640672@nehalam>
Sender: netdev-owner@vger.kernel.org
List-ID:

On Tue, Sep 29, 2009 at 02:24, Stephen Hemminger wrote:
> On Tue, 29 Sep 2009 00:20:07 +0200
> Eric Dumazet wrote:
>
>> Yakov Lerner wrote:
>> > On Sun, Sep 27, 2009 at 12:53, Eric Dumazet wrote:
>> >> Yakov Lerner wrote:
>> >>> /proc/net/tcp does 20,000 sockets in 60-80 milliseconds, with this patch.
>> >>>
>> >>> The overhead was in tcp_seq_start(). See analysis (3) below.
>> >>> The patch is against Linus' git tree (1). The patch is small.
>> >>>
>> >>> ------------  -----------   ------------------------------------
>> >>> Before patch  After patch   20,000 sockets (10,000 tw + 10,000 estab)(2)
>> >>> ------------  -----------   ------------------------------------
>> >>> 6 sec         0.06 sec      dd bs=1k if=/proc/net/tcp >/dev/null
>> >>> 1.5 sec       0.06 sec      dd bs=4k if=/proc/net/tcp >/dev/null
>> >>>
>> >>> 1.9 sec       0.16 sec      netstat -4ant >/dev/null
>> >>> ------------  -----------   ------------------------------------
>> >>>
>> >>> This is a ~25x improvement.
>> >>> The new time does not depend on the read blocksize.
>> >>> Speed of netstat, naturally, improves, too; both -4 and -6.
>> >>> /proc/net/tcp6 does 20,000 sockets in 100 millisec.
>> >>>
>> >>> (1) against git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
>> >>>
>> >>> (2) Used 'manysock' utility to stress system with large number of sockets:
>> >>>   "manysock 10000 10000"    - 10,000 tw + 10,000 estab ip4 sockets.
>> >>>   "manysock -6 10000 10000" - 10,000 tw + 10,000 estab ip6 sockets.
>> >>> Found at http://ilerner.3b1.org/manysock/manysock.c
>> >>>
>> >>> (3) Algorithmic analysis.
>> >>>     Old algorithm.
>> >>>
>> >>> During 'cat [...]
>> >>> On average, every call to tcp_seq_start() scans half the whole hashtable. Ouch.
>> >>> This is O(numsockets * hashsize). 95-99% of 'cat [...]
>> >>> tcp_seq_start()->tcp_get_idx. This overhead is eliminated by the new algorithm,
>> >>> which is O(numsockets + hashsize).
>> >>>
>> >>>     New algorithm.
>> >>>
>> >>> The new algorithm is O(numsockets + hashsize). We jump to the right
>> >>> hash bucket in tcp_seq_start(), without scanning half the hash.
>> >>> To jump right to the hash bucket corresponding to *pos in tcp_seq_start(),
>> >>> we reuse three pieces of state (st->num, st->bucket, st->sbucket)
>> >>> as follows:
>> >>>  - we check that requested pos >= last seen pos (st->num), the typical case.
>> >>>  - if so, we jump to bucket st->bucket
>> >>>  - to arrive at the right item after the beginning of st->bucket, we
>> >>> keep in st->sbucket the position corresponding to the beginning of
>> >>> the bucket.
>> >>>
>> >>> (4) Explanation of O(numsockets * hashsize) of old algorithm.
>> >>>
>> >>> tcp_seq_start() is called once for every ~7 lines of netstat output
>> >>> if readsize is 1kb, or once for every ~28 lines if readsize >= 4kb.
>> >>> Since record length of /proc/net/tcp records is 150 bytes, formula for
>> >>> number of calls to tcp_seq_start() is
>> >>>             (numsockets * 150 / min(4096,readsize)).
>> >>> Netstat uses 4kb readsize (newer versions), or 1kb (older versions).
>> >>> Note that speed of old algorithm does not improve above 4kb blocksize.
>> >>>
>> >>> Speed of the new algorithm does not depend on blocksize.
>> >>>
>> >>> Speed of the new algorithm does not perceptibly depend on hashsize (which
>> >>> depends on ramsize). Speed of old algorithm drops with bigger hashsize.
>> >>>
>> >>> (5) Reporting order.
>> >>>
>> >>> Reporting order is exactly the same as before if the hash does not change underfoot.
>> >>> When hash elements come and go during the report, reporting order will be
>> >>> the same as that of tcpdiag.
>> >>>
>> >>> Signed-off-by: Yakov Lerner
>
> Does the netlink interface used by ss command have the problem?

No. It's /proc/net/tcp that has the fixable problem.

Yakov
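
P.S. For readers who want to see the difference described in (3) and (4) without reading the patch, here is a rough toy model in Python. It is only a sketch, not the kernel code: the table layout, the function names (build_table, emit, dump_old, dump_new, start_calls) and the "steps" counter are all made up for illustration. The variables bucket and sbucket loosely mirror st->bucket and st->sbucket from the description above, and per_read stands for the ~7 records that fit in one 1kb read.

```python
# Toy model (not kernel code): count bucket-advance steps needed to dump
# every entry of a hash table through a chunked, seq_file-style reader.

def build_table(hashsize, numsockets):
    """Scatter numsockets entries over hashsize buckets (hypothetical layout)."""
    table = [[] for _ in range(hashsize)]
    for i in range(numsockets):
        table[(i * 7919) % hashsize].append(i)
    return table


def emit(table, bucket, sbucket, pos, count):
    """Emit up to `count` entries starting at global position `pos`.

    `sbucket` is the global position of the first entry of `bucket`.
    Returns (entries, bucket, sbucket, steps); steps counts bucket advances.
    """
    out, steps = [], 0
    while bucket < len(table) and len(out) < count:
        offset = pos - sbucket
        if offset < len(table[bucket]):
            out.append(table[bucket][offset])
            pos += 1
        else:
            sbucket += len(table[bucket])
            bucket += 1
            steps += 1
    return out, bucket, sbucket, steps


def dump_old(table, per_read):
    """Old scheme: each read restarts at bucket 0 and walks forward to pos."""
    total = sum(len(b) for b in table)
    out, steps, pos = [], 0, 0
    while pos < total:
        bucket, sbucket = 0, 0
        while sbucket + len(table[bucket]) <= pos:
            sbucket += len(table[bucket])
            bucket += 1
            steps += 1
        chunk, _, _, walked = emit(table, bucket, sbucket, pos, per_read)
        out += chunk
        pos += len(chunk)
        steps += walked
    return out, steps


def dump_new(table, per_read):
    """New scheme: remember (bucket, sbucket) between reads and resume there."""
    total = sum(len(b) for b in table)
    out, steps, pos = [], 0, 0
    bucket, sbucket = 0, 0
    while pos < total:
        chunk, bucket, sbucket, walked = emit(table, bucket, sbucket, pos, per_read)
        out += chunk
        pos += len(chunk)
        steps += walked
    return out, steps


def start_calls(numsockets, readsize, record_len=150):
    """Approximate number of tcp_seq_start() calls, per the formula in (4)."""
    return numsockets * record_len / min(4096, readsize)


if __name__ == "__main__":
    table = build_table(hashsize=4096, numsockets=2000)
    print("old steps:", dump_old(table, per_read=7)[1])
    print("new steps:", dump_new(table, per_read=7)[1])
    print("start calls, 1kb reads:", start_calls(20000, 1024))
```

In this model both functions return the entries in the same order; only the number of bucket steps differs. The old scheme pays roughly (number of reads) * (half the table) steps, while the new one never advances through more buckets than the table holds.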