virt-manager broken by bind(0) in net-next.

netdev.vger.kernel.org archive mirror

* virt-manager broken by bind(0) in net-next.
       [not found] ` <20090129103544.GC22110@redhat.com>
@ 2009-01-30  5:35   ` Stephen Hemminger
  2009-01-30  8:16     ` Evgeniy Polyakov
  2009-01-30  6:50   ` Stephen Hemminger
  1 sibling, 1 reply; 28+ messages in thread
From: Stephen Hemminger @ 2009-01-30  5:35 UTC (permalink / raw)
  To: Daniel P. Berrange, Fedora/Linux Management Tools, David Miller,
	Evgeniy Polyakov
  Cc: berrange, Fedora/Linux Management Tools, netdev

[-- Attachment #1: Type: text/plain, Size: 2333 bytes --]

On Thu, 29 Jan 2009 10:35:44 +0000
"Daniel P. Berrange" <berrange@redhat.com> wrote:

> On Wed, Jan 28, 2009 at 09:21:14PM -0800, Stephen Hemminger wrote:
> > This is probably related to the new GEM code. But on 2.6.29-rc2 if I start up the virtual
> > machine manager then run a guest, the display gets screwed up.
> > 
> > virt-machine-manager
> >   click local-host (System)
> >   Run one of the existing VM's
> >  
> > The virtual console window then causes a dialog about allowing remote access to display;
> > (this never happened with earlier kernels), regression #1
> > 
> > Then if I allow it multiple copies of the window start cloning and general chaos ensues.
> 
> You'll have to provide more useful information than 'screwed up' and
> 'general chaos' if we're to properly diagnose this. A screenshot of what
> is wrong if there's a graphics rendering problem would be a start.
> 
> Also, what GTK-VNC version do you have ?  Make sure it is at least
> 0.3.8, so that it is using Cairo for rendering, and not old buggy
> OpenGL based GtkGLExt.
> 
> Daniel

The problem is only in the net-next tree (not mainline 2.6.29-rcX).
Bisection points to this commit as the problem:

a9d8f9110d7e953c2f2b521087a4179677843c2a is first bad commit
commit a9d8f9110d7e953c2f2b521087a4179677843c2a
Author: Evgeniy Polyakov <zbr@ioremap.net>
Date:   Mon Jan 19 16:46:02 2009 -0800

    inet: Allowing more than 64k connections and heavily optimize bind(0) time.
    
    With simple extension to the binding mechanism, which allows to bind more
    than 64k sockets (or smaller amount, depending on sysctl parameters),
    we have to traverse the whole bind hash table to find out empty bucket.
    And while it is not a problem for example for 32k connections, bind()
    completion time grows exponentially (since after each successful binding
    we have to traverse one bucket more to find empty one) even if we start
    each time from random offset inside the hash table.
    
    So, when hash table is full, and we want to add another socket, we have
    to traverse the whole table no matter what, so effectively this will be
    the worst case performance and it will be constant.
    
    Attached picture shows bind() time depending on number of already bound
    sockets.

Not sure why, but it breaks VNC; see the attached screenshot.

[-- Attachment #2: Screenshot.png --]
[-- Type: image/png, Size: 178001 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* virt-manager broken by bind(0) in net-next.
       [not found] ` <20090129103544.GC22110@redhat.com>
  2009-01-30  5:35   ` virt-manager broken by bind(0) in net-next Stephen Hemminger
@ 2009-01-30  6:50   ` Stephen Hemminger
  1 sibling, 0 replies; 28+ messages in thread
From: Stephen Hemminger @ 2009-01-30  6:50 UTC (permalink / raw)
  To: Daniel P. Berrange, Fedora/Linux Management Tools, David Miller,
	Evgeniy Polyakov
  Cc: berrange, Fedora/Linux Management Tools, netdev

Resending without the screenshot, which upsets the vger spam filter!
The bind(0) optimization in David's net-next breaks virt machine manager when
a guest console window is started. 

virt-machine-manager
  click local-host (System)
  Run one of the existing VM's
 
The virtual console window then causes a dialog about allowing remote access to the display
(this never happened with earlier kernels): regression #1.

Then, if I allow it, multiple copies of the window start cloning and general chaos ensues.
The problem is only in the net-next tree (not mainline 2.6.29-rcX).

Bisected down to the following commit, which breaks virt-manager:


a9d8f9110d7e953c2f2b521087a4179677843c2a is first bad commit
commit a9d8f9110d7e953c2f2b521087a4179677843c2a
Author: Evgeniy Polyakov <zbr@ioremap.net>
Date:   Mon Jan 19 16:46:02 2009 -0800

    inet: Allowing more than 64k connections and heavily optimize bind(0) time.
    
    With simple extension to the binding mechanism, which allows to bind more
    than 64k sockets (or smaller amount, depending on sysctl parameters),
    we have to traverse the whole bind hash table to find out empty bucket.
    And while it is not a problem for example for 32k connections, bind()
    completion time grows exponentially (since after each successful binding
    we have to traverse one bucket more to find empty one) even if we start
    each time from random offset inside the hash table.
    
    So, when hash table is full, and we want to add another socket, we have
    to traverse the whole table no matter what, so effectively this will be
    the worst case performance and it will be constant.
    


* Re: virt-manager broken by bind(0) in net-next.
  2009-01-30  5:35   ` virt-manager broken by bind(0) in net-next Stephen Hemminger
@ 2009-01-30  8:16     ` Evgeniy Polyakov
       [not found]       ` <20090130081600.GA2717-i6C2adt8DTjR7s880joybQ@public.gmane.org>
  0 siblings, 1 reply; 28+ messages in thread
From: Evgeniy Polyakov @ 2009-01-30  8:16 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Daniel P. Berrange, Fedora/Linux Management Tools, David Miller,
	netdev

Hi.

On Thu, Jan 29, 2009 at 09:35:49PM -0800, Stephen Hemminger (shemminger@vyatta.com) wrote:
> > > This is probably related to the new GEM code. But on 2.6.29-rc2 if I start up the virtual
> > > machine manager then run a guest, the display gets screwed up.
> > > 
> > > virt-machine-manager
> > >   click local-host (System)
> > >   Run one of the existing VM's
> > >  
> > > The virtual console window then causes a dialog about allowing remote access to display;
> > > (this never happened with earlier kernels), regression #1
> > > 
> > > Then if I allow it multiple copies of the window start cloning and general chaos ensues.
> > 
> > You'll have to provide more useful information than 'screwed up' and
> > 'general chaos' if we're to properly diagnose this. A screenshot of what
> > is wrong if there's a graphics rendering problem would be a start.
> > 
> > Also, what GTK-VNC version do you have ?  Make sure it is at least
> > 0.3.8, so that it is using Cairo for rendering, and not old buggy
> > OpenGL based GtkGLExt.
> > 
> > Daniel
> 
> The problem is only in the net-next tree (not mainline 2.6.29-rcX).
> Bisection points to this commit as the problem:
> 
> a9d8f9110d7e953c2f2b521087a4179677843c2a is first bad commit
> commit a9d8f9110d7e953c2f2b521087a4179677843c2a
> Author: Evgeniy Polyakov <zbr@ioremap.net>
> Date:   Mon Jan 19 16:46:02 2009 -0800

Any chance to get a bit more information about what this console does?
For example, how does it bind, and can you get an strace?
Would you be able to run a debug patch that dumps lots of info into
dmesg and slows things down a bit?

-- 
	Evgeniy Polyakov


* Re: virt-manager broken by bind(0) in net-next.
       [not found]       ` <20090130081600.GA2717-i6C2adt8DTjR7s880joybQ@public.gmane.org>
@ 2009-01-30 10:27         ` Daniel P. Berrange
  2009-01-30 11:21           ` Evgeniy Polyakov
  0 siblings, 1 reply; 28+ messages in thread
From: Daniel P. Berrange @ 2009-01-30 10:27 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: netdev, David Miller,
	Fedora/Linux Management Tools

On Fri, Jan 30, 2009 at 11:16:00AM +0300, Evgeniy Polyakov wrote:
> Hi.
> 
> On Thu, Jan 29, 2009 at 09:35:49PM -0800, Stephen Hemminger (shemminger@vyatta.com) wrote:
> > > > This is probably related to the new GEM code. But on 2.6.29-rc2 if I start up the virtual
> > > > machine manager then run a guest, the display gets screwed up.
> > > > 
> > > > virt-machine-manager
> > > >   click local-host (System)
> > > >   Run one of the existing VM's
> > > >  
> > > > The virtual console window then causes a dialog about allowing remote access to display;
> > > > (this never happened with earlier kernels), regression #1
> > > > 
> > > > Then if I allow it multiple copies of the window start cloning and general chaos ensues.
> > > 
> > > > You'll have to provide more useful information than 'screwed up' and
> > > > 'general chaos' if we're to properly diagnose this. A screenshot of what
> > > is wrong if there's a graphics rendering problem would be a start.
> > > 
> > > Also, what GTK-VNC version do you have ?  Make sure it is at least
> > > 0.3.8, so that it is using Cairo for rendering, and not old buggy
> > > OpenGL based GtkGLExt.
> > > 
> > > Daniel
> > 
> > The problem is only in the net-next tree (not mainline 2.6.29-rcX).
> > Bisection points to this commit as the problem:
> > 
> > a9d8f9110d7e953c2f2b521087a4179677843c2a is first bad commit
> > commit a9d8f9110d7e953c2f2b521087a4179677843c2a
> > Author: Evgeniy Polyakov <zbr@ioremap.net>
> > Date:   Mon Jan 19 16:46:02 2009 -0800
> 
> Any chance to get a bit more information about what this console does?

The virt-manager console is basically just a plain old boring VNC client.
It uses GTK-VNC to establish its VNC network connection, and that doesn't
do anything unusual AFAIK. We use getaddrinfo() to resolve the hostname,
and then try each of its results in turn, until we successfully connect
to the VNC server. We don't explicitly bind() to the client port, just
let the kernel pick it for us. The code in question is the "gvnc_open_host"
function in gvnc.c, which starts at about line 2910:

http://freehg.org/u/aliguori/gtk-vnc.hg/file/d68935d582f0/src/gvnc.c
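
For readers without the gtk-vnc source at hand, here is a minimal sketch of the pattern described above: resolve with getaddrinfo(), try each result in turn, and never call bind(), so the kernel autoselects the local port inside connect(). The helper name connect_first is hypothetical; this is not the gvnc_open_host code, and it uses a blocking connect() for brevity.

```c
#include <assert.h>
#include <netdb.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from gtk-vnc): try each getaddrinfo() result
 * in turn and return the first socket that connects, or -1 on failure.
 * No bind() is issued, so the kernel picks the local port in connect().
 */
static int connect_first(const char *host, const char *port)
{
	struct addrinfo hints, *res, *ai;
	int fd = -1;

	assert(host != NULL && port != NULL);

	memset(&hints, 0, sizeof(hints));
	hints.ai_family = AF_UNSPEC;		/* accept IPv4 and IPv6 results */
	hints.ai_socktype = SOCK_STREAM;

	if (getaddrinfo(host, port, &hints, &res) != 0)
		return -1;

	for (ai = res; ai != NULL; ai = ai->ai_next) {
		fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
		if (fd < 0)
			continue;
		if (connect(fd, ai->ai_addr, ai->ai_addrlen) == 0)
			break;			/* connected: keep this socket */
		close(fd);
		fd = -1;
	}

	freeaddrinfo(res);
	return fd;
}
```

A call such as connect_first("127.0.0.1", "5900") would return a connected descriptor if a server is listening there, and -1 otherwise.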

Regards,
Daniel 
-- 
|: Red Hat, Engineering, London   -o-   http://people.redhat.com/berrange/ :|
|: http://libvirt.org  -o-  http://virt-manager.org  -o-  http://ovirt.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505  -o-  F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|


* Re: virt-manager broken by bind(0) in net-next.
  2009-01-30 10:27         ` Daniel P. Berrange
@ 2009-01-30 11:21           ` Evgeniy Polyakov
  2009-01-30 12:53             ` Herbert Xu
  0 siblings, 1 reply; 28+ messages in thread
From: Evgeniy Polyakov @ 2009-01-30 11:21 UTC (permalink / raw)
  To: Daniel P. Berrange
  Cc: Stephen Hemminger, Fedora/Linux Management Tools, David Miller,
	netdev

On Fri, Jan 30, 2009 at 10:27:49AM +0000, Daniel P. Berrange (berrange@redhat.com) wrote:
> The virt-manager console is basically just a plain old boring VNC client.
> It uses GTK-VNC to establish its VNC network connection, and that doesn't
> do anything unusual AFAIK. We use getaddrinfo() to resolve the hostname,
> and then try each of its results in turn, until we successfully connect
> to the VNC server. We don't explicitly bind() to the client port, just
> let the kernel pick it for us. The code in question, is the "gvnc_open_host"
> method from gvnc.c,  which starts at about line 2910
> 
> http://freehg.org/u/aliguori/gtk-vnc.hg/file/d68935d582f0/src/gvnc.c

So it is not an explicit bind() call, but port autoselection in
connect(). Can you check what errno is returned?
Did I understand correctly that connect() fails, you try a different
address, but then suddenly all those sockets become 'alive'?

-- 
	Evgeniy Polyakov


* Re: virt-manager broken by bind(0) in net-next.
  2009-01-30 11:21           ` Evgeniy Polyakov
@ 2009-01-30 12:53             ` Herbert Xu
       [not found]               ` <20090130125337.GA7155-lOAM2aK0SrRLBo1qDEOMRrpzq4S04n8Q@public.gmane.org>
  0 siblings, 1 reply; 28+ messages in thread
From: Herbert Xu @ 2009-01-30 12:53 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: berrange, shemminger, et-mgmt-tools, davem, netdev

Evgeniy Polyakov <zbr@ioremap.net> wrote:
>
> So it is not an explicit bind() call, but port autoselection in
> connect(). Can you check what errno is returned?
> Did I understand correctly that connect() fails, you try a different
> address, but then suddenly all those sockets become 'alive'?

Yes, I think a good strace vs. a bad strace would be really helpful
in these cases.

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* Re: virt-manager broken by bind(0) in net-next.
       [not found]               ` <20090130125337.GA7155-lOAM2aK0SrRLBo1qDEOMRrpzq4S04n8Q@public.gmane.org>
@ 2009-01-30 17:57                 ` Stephen Hemminger
  2009-01-30 18:41                   ` Eric Dumazet
  0 siblings, 1 reply; 28+ messages in thread
From: Stephen Hemminger @ 2009-01-30 17:57 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Evgeniy Polyakov, davem, et-mgmt-tools, netdev

On Fri, 30 Jan 2009 23:53:37 +1100
Herbert Xu <herbert@gondor.apana.org.au> wrote:

> Evgeniy Polyakov <zbr@ioremap.net> wrote:
> >
> > So it is not an explicit bind() call, but port autoselection in
> > connect(). Can you check what errno is returned?
> > Did I understand correctly that connect() fails, you try a different
> > address, but then suddenly all those sockets become 'alive'?
> 
> Yes, I think a good strace vs. a bad strace would be really helpful
> in these cases.
> 
> Thanks,

I have the strace but it comes up no different.
What is different is that in the broken case (net-next), I see
IPv6 being used:

State      Recv-Q Send-Q      Local Address:Port          Peer Address:Port   
ESTAB      23769  0        ::ffff:127.0.0.1:5900      ::ffff:127.0.0.1:55987   
ESTAB      0      0               127.0.0.1:55987            127.0.0.1:5900

and in the working case (2.6.29-rc3), IPv4 is being used:
State      Recv-Q Send-Q      Local Address:Port          Peer Address:Port   
ESTAB      0      0               127.0.0.1:58894            127.0.0.1:5901    
ESTAB      0      0               127.0.0.1:5901             127.0.0.1:58894 

Relevant bits of strace in broken case are:

7276  socket(PF_NETLINK, SOCK_RAW, 0)   = 21
7276  bind(21, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 0
7276  getsockname(21, {sa_family=AF_NETLINK, pid=7276, groups=00000000}, [66387309494284]) = 0
7276  sendto(21, "\24\0\0\0\26\0\1\3\353<\203I\0\0\0\0\0\0\0\0", 20, 0, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 20
7276  recvmsg(21, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"0\0\0\0\24\0\2\0\353<\203Il\34\0\0\2\10\200\376\1\0\0\0"..., 4096}], msg_controllen=0, msg_flags=0}, 0) = 168
7276  recvmsg(21, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"@\0\0\0\24\0\2\0\353<\203Il\34\0\0\n\200\200\376\1\0\0"..., 4096}], msg_controllen=0, msg_flags=0}, 0) = 256
7276  recvmsg(21, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"\24\0\0\0\3\0\2\0\353<\203Il\34\0\0\0\0\0\0\1\0\0\0\24"..., 4096}], msg_controllen=0, msg_flags=0}, 0) = 20
7276  close(21)                         = 0
7276  socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 21
7276  fcntl(21, F_GETFL)                = 0x2 (flags O_RDWR)
7276  fcntl(21, F_SETFL, O_RDWR|O_NONBLOCK) = 0
7276  fstat(21, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
7276  fcntl(21, F_GETFL)                = 0x802 (flags O_RDWR|O_NONBLOCK)
7276  connect(21, {sa_family=AF_INET, sin_port=htons(5900), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
7276  rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
7276  rt_sigprocmask(SIG_SETMASK, [], [], 8) = 0
7276  read(5, 0xca5af4, 4096)           = -1 EAGAIN (Resource temporarily unavailable)
7276  poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=10, events=POLLIN|POLLPRI}, {fd=12, events=POLLIN|POLLPRI}, {fd=13, events=POLLIN|POLLPRI}, {fd=14, events=POLLIN|POLLPRI}, {fd=15, events=POLLIN}, {fd=16, events=POLLIN}, {fd=17, events=POLLIN}, {fd=18, events=POLLIN}, {fd=21, events=POLLOUT, revents=POLLOUT}], 11, 844) = 1
7276  read(18, 0x7fff4fa96a1f, 1)       = -1 EAGAIN (Resource temporarily unavailable)
7276  rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
7276  rt_sigprocmask(SIG_SETMASK, [], [], 8) = 0
7276  connect(21, {sa_family=AF_INET, sin_port=htons(5900), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
7276  rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
7276  rt_sigprocmask(SIG_SETMASK, [], [], 8) = 0
7276  read(5, 0xca5af4, 4096)           = -1 EAGAIN (Resource temporarily unavailable)
7276  poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=10, events=POLLIN|POLLPRI}, {fd=12, events=POLLIN|POLLPRI}, {fd=13, events=POLLIN|POLLPRI}, {fd=14, events=POLLIN|POLLPRI}, {fd=15, events=POLLIN}, {fd=16, events=POLLIN}, {fd=17, events=POLLIN}, {fd=18, events=POLLIN}], 10, 0) = 0
7276  read(18, 0x7fff4fa96a1f, 1)       = -1 EAGAIN (Resource temporarily unavailable)
7276  rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
7276  rt_sigprocmask(SIG_SETMASK, [], [], 8) = 0
7276  read(21, "RFB 003.007\n", 4096)   = 12
7276  write(21, "RFB 003.007\0", 12)    = 12
7276  read(21, 0x18c5170, 4096)         = -1 EAGAIN (Resource temporarily unavailable)
7276  rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
7276  rt_sigprocmask(SIG_SETMASK, [], [], 8) = 0
7276  read(5, 0xca5af4, 4096)           = -1 EAGAIN (Resource temporarily unavailable)
7276  poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=10, events=POLLIN|POLLPRI}, {fd=12, events=POLLIN|POLLPRI}, {fd=13, events=POLLIN|POLLPRI}, {fd=14, events=POLLIN|POLLPRI}, {fd=15, events=POLLIN}, {fd=16, events=POLLIN}, {fd=17, events=POLLIN}, {fd=18, events=POLLIN}, {fd=21, events=POLLIN, revents=POLLIN}], 11, 842) = 1
7276  read(18, 0x7fff4fa96a1f, 1)       = -1 EAGAIN (Resource temporarily unavailable)
7276  rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
7276  rt_sigprocmask(SIG_SETMASK, [], [], 8) = 0
7276  read(21, "\2\22\1", 4096)         = 3
7276  write(21, "\22", 1)               = 1
7276  brk(0x1c6b000)                    = 0x1c6b000
7276  access("/dev/random", R_OK)       = 0
7276  access("/dev/urandom", R_OK)      = 0
7276  open("/dev/urandom", O_RDONLY)    = 22
7276  fcntl(22, F_GETFD)                = 0
7276  fcntl(22, F_SETFD, FD_CLOEXEC)    = 0
7276  select(23, [22], NULL, NULL, {3, 0}) = 1 (in [22], left {2, 999998})
7276  read(22, "\316\n\4!\224\227\215\276\\b\224\272\334,Y\256\4\236\245"..., 120) = 120
7276  select(23, [22], NULL, NULL, {3, 0}) = 1 (in [22], left {2, 999999})
7276  read(22, "\236\216s*\211\347\\\245\217\2549\24!\242\216\257t\327"..., 120) = 120
7276  select(23, [22], NULL, NULL, {3, 0}) = 1 (in [22], left {2, 999999})
7276  read(22, "E\20\376E\322\366W\10t\342\273\2734\217\5\250\212\235\335"..., 120) = 120
7276  select(23, [22], NULL, NULL, {3, 0}) = 1 (in [22], left {2, 999999})
7276  read(22, "\325\325\355\327\356\323\37\17\256|\34\375\223\4\340\323"..., 120) = 120
7276  select(23, [22], NULL, NULL, {3, 0}) = 1 (in [22], left {2, 999999})
7276  read(22, "=\371\222\340\354\21e0\271\5-\337e\273h\207uS \225\321"..., 120) = 120
7276  select(23, [22], NULL, NULL, {3, 0}) = 1 (in [22], left {2, 999999})
7276  read(22, "C\316\236\301\315\257{|,\217\253\321 ]W\212\217H\342\222"..., 120) = 120
7276  select(23, [22], NULL, NULL, {3, 0}) = 1 (in [22], left {2, 999999})
7276  read(22, ",{\201\272V\246\257^\t\214\374\377\360\357;\26\226w\370"..., 120) = 120
7276  select(23, [22], NULL, NULL, {3, 0}) = 1 (in [22], left {2, 999999})
7276  read(22, "\33\330\31\363L\25\243\360+\21J\315\227\251\364y\276\356"..., 120) = 120
7276  select(23, [22], NULL, NULL, {3, 0}) = 1 (in [22], left {2, 999999})
7276  read(22, "\335u\377\235mf\34-\227\221\"\21y,Y\336a*9\25=H\350\334"..., 120) = 120
7276  select(23, [22], NULL, NULL, {3, 0}) = 1 (in [22], left {2, 999999})
7276  read(22, "U\274\270\373\326?Ly\232\24\2a\367\261DA\223N\273M\255"..., 120) = 120
7276  select(23, [22], NULL, NULL, {3, 0}) = 1 (in [22], left {2, 999999})
7276  read(22, "\221\237PJY\342\260\207z\360W \274\303\360q@E8\246\355"..., 120) = 120
7276  select(23, [22], NULL, NULL, {3, 0}) = 1 (in [22], left {2, 999999})
7276  read(22, ":{\177\347\20\246\373\345M;\243:\35\347j\302\317\2737\244"..., 120) = 120
7276  select(23, [22], NULL, NULL, {3, 0}) = 1 (in [22], left {2, 999999})
7276  read(22, "\326\317\16\363\27\35\351\226o?@c\251\320\323\0\274\301"..., 120) = 120
7276  select(23, [22], NULL, NULL, {3, 0}) = 1 (in [22], left {2, 999999})
7276  read(22, "\23N\257I\345\224Fi\364\7M\340\213\321\365\351\253;\4\16"..., 120) = 120
7276  select(23, [22], NULL, NULL, {3, 0}) = 1 (in [22], left {2, 999999})
7276  read(22, "p\317 \344\313\273\215\250G0-\212}\202\v(\354\207 \223"..., 120) = 120
7276  select(23, [22], NULL, NULL, {3, 0}) = 1 (in [22], left {2, 999999})
7276  read(22, "\225\211\206_\2\220\3\222\3523@\353\203J^\324\320;r\206"..., 120) = 120
7276  select(23, [22], NULL, NULL, {3, 0}) = 1 (in [22], left {2, 999999})
7276  read(22, "\344\251H\230#\244\302x\235\226\315J\364\207\221)\215&"..., 120) = 120
7276  select(23, [22], NULL, NULL, {3, 0}) = 1 (in [22], left {2, 999999})
7276  read(22, "3qS\366\343G\372\0)\340\313j\20`\300\3476\215}\35o>\6\305"..., 120) = 120
7276  select(23, [22], NULL, NULL, {3, 0}) = 1 (in [22], left {2, 999999})
7276  read(22, "\371N9\213\261\230\341\211m/\224h\267lj\2\311\"\374\210"..., 120) = 120
7276  select(23, [22], NULL, NULL, {3, 0}) = 1 (in [22], left {2, 999999})
7276  read(22, "^\32ccG\271mh\302\324\244cu\325J\324B\210\245\237&\377"..., 120) = 120
7276  select(23, [22], NULL, NULL, {3, 0}) = 1 (in [22], left {2, 999999})
7276  read(22, "N\3463\324\372)\264\310\272\34\25\210POvoA#z\234\362LI"..., 120) = 120
7276  select(23, [22], NULL, NULL, {3, 0}) = 1 (in [22], left {2, 999999})
7276  read(22, "\362\327g\330i\\\t\10:\357g\243Y\260]\346\235o\337e\30"..., 120) = 120
7276  select(23, [22], NULL, NULL, {3, 0}) = 1 (in [22], left {2, 999999})
7276  read(22, "\377L\35\272WE\346\256g#\367qK\255\350\323P\323\366\350"..., 120) = 120
7276  select(23, [22], NULL, NULL, {3, 0}) = 1 (in [22], left {2, 999999})
7276  read(22, "\37\31\275\361\302\201/\234\327^m|\362/@\332\356\225`8"..., 120) = 120


* Re: virt-manager broken by bind(0) in net-next.
  2009-01-30 17:57                 ` Stephen Hemminger
@ 2009-01-30 18:41                   ` Eric Dumazet
  2009-01-30 21:50                     ` Evgeniy Polyakov
       [not found]                     ` <498349F7.4050300-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
  0 siblings, 2 replies; 28+ messages in thread
From: Eric Dumazet @ 2009-01-30 18:41 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Herbert Xu, Evgeniy Polyakov, berrange, et-mgmt-tools, davem,
	netdev

Stephen Hemminger a écrit :
> On Fri, 30 Jan 2009 23:53:37 +1100
> Herbert Xu <herbert@gondor.apana.org.au> wrote:
> 
>> Evgeniy Polyakov <zbr@ioremap.net> wrote:
>>> So it is not an explicit bind() call, but port autoselection in
>>> connect(). Can you check what errno is returned?
>>> Did I understand correctly that connect() fails, you try a different
>>> address, but then suddenly all those sockets become 'alive'?
>> Yes, I think a good strace vs. a bad strace would be really helpful
>> in these cases.
>>
>> Thanks,
> 
> I have the strace but it comes up no different.
> What is different is that in the broken case (net-next), I see
> IPv6 being used:
> 
> State      Recv-Q Send-Q      Local Address:Port          Peer Address:Port   
> ESTAB      23769  0        ::ffff:127.0.0.1:5900      ::ffff:127.0.0.1:55987   
> ESTAB      0      0               127.0.0.1:55987            127.0.0.1:5900
> 
> and in the working case (2.6.29-rc3), IPv4 is being used:
> State      Recv-Q Send-Q      Local Address:Port          Peer Address:Port   
> ESTAB      0      0               127.0.0.1:58894            127.0.0.1:5901    
> ESTAB      0      0               127.0.0.1:5901             127.0.0.1:58894 
> 

Reviewing commit a9d8f9110d7e953c2f2b521087a4179677843c2a

I see use of a hashinfo->bsockets field that :

- lacks proper lock/synchronization
- suffers from cache line ping pongs on SMP

Also there might be a problem at line 175

if (sk->sk_reuse && sk->sk_state != TCP_LISTEN && --attempts >= 0) { 
	spin_unlock(&head->lock);
	goto again;

If we entered inet_csk_get_port() with a non null snum, we can "goto again"
while it was not expected.

diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index df8e72f..752c6b2 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -172,7 +172,8 @@ tb_found:
 		} else {
 			ret = 1;
 			if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb)) {
-				if (sk->sk_reuse && sk->sk_state != TCP_LISTEN && --attempts >= 0) {
+				if (sk->sk_reuse && sk->sk_state != TCP_LISTEN &&
+					smallest_size == -1 &&  --attempts >= 0) {
 					spin_unlock(&head->lock);
 					goto again;
 				}




* Re: virt-manager broken by bind(0) in net-next.
  2009-01-30 18:41                   ` Eric Dumazet
@ 2009-01-30 21:50                     ` Evgeniy Polyakov
  2009-01-30 22:30                       ` Eric Dumazet
       [not found]                       ` <20090130215008.GB12210-i6C2adt8DTjR7s880joybQ@public.gmane.org>
       [not found]                     ` <498349F7.4050300-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
  1 sibling, 2 replies; 28+ messages in thread
From: Evgeniy Polyakov @ 2009-01-30 21:50 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Stephen Hemminger, Herbert Xu, berrange, et-mgmt-tools, davem,
	netdev

On Fri, Jan 30, 2009 at 07:41:59PM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote:
> Reviewing commit a9d8f9110d7e953c2f2b521087a4179677843c2a
> 
> I see use of a hashinfo->bsockets field that :
> 
> - lacks proper lock/synchronization

It should contain a rough number of sockets; there is no need to be very
precise because of this heuristic.

> - suffers from cache line ping pongs on SMP

I used a free alignment slot so that the socket structure would not be
increased.

> Also there might be a problem at line 175
> 
> if (sk->sk_reuse && sk->sk_state != TCP_LISTEN && --attempts >= 0) { 
> 	spin_unlock(&head->lock);
> 	goto again;
> 
> If we entered inet_csk_get_port() with a non null snum, we can "goto again"
> while it was not expected.
> 
> diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> index df8e72f..752c6b2 100644
> --- a/net/ipv4/inet_connection_sock.c
> +++ b/net/ipv4/inet_connection_sock.c
> @@ -172,7 +172,8 @@ tb_found:
>  		} else {
>  			ret = 1;
>  			if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb)) {
> -				if (sk->sk_reuse && sk->sk_state != TCP_LISTEN && --attempts >= 0) {
> +				if (sk->sk_reuse && sk->sk_state != TCP_LISTEN &&
> +					smallest_size == -1 &&  --attempts >= 0) {

I think it should be smallest_size != -1, since we really want to go
to the again label when the heuristic is used, which in turn changes
smallest_size.

-- 
	Evgeniy Polyakov


* Re: virt-manager broken by bind(0) in net-next.
  2009-01-30 21:50                     ` Evgeniy Polyakov
@ 2009-01-30 22:30                       ` Eric Dumazet
  2009-01-30 22:51                         ` Evgeniy Polyakov
       [not found]                       ` <20090130215008.GB12210-i6C2adt8DTjR7s880joybQ@public.gmane.org>
  1 sibling, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2009-01-30 22:30 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Stephen Hemminger, Herbert Xu, berrange, et-mgmt-tools, davem,
	netdev

Evgeniy Polyakov a écrit :
> On Fri, Jan 30, 2009 at 07:41:59PM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote:
>> Reviewing commit a9d8f9110d7e953c2f2b521087a4179677843c2a
>>
>> I see use of a hashinfo->bsockets field that :
>>
>> - lacks proper lock/synchronization
> 
> It should contain a rough number of sockets; there is no need to be very
> precise because of this heuristic.

Denying there is a bug is... well... I don't know what to say.

I wonder why we still use atomic_t all over the kernel.

> 
>> - suffers from cache line ping pongs on SMP
> 
> I used a free alignment slot so that the socket structure would not be
> increased.

Are you kidding ?

bsockets is not part of socket structure, but part of "struct inet_hashinfo",
shared by all cpus and accessed several thousand times per second on many
machines.

Please read the comment three lines after 'the free alignment slot'
you chose.... You just introduced one write on a cache line
that is supposed to *not* be written.

        unsigned int                    bhash_size;
        int                             bsockets;

        struct kmem_cache               *bind_bucket_cachep;

        /* All the above members are written once at bootup and
         * never written again _or_ are predominantly read-access.
         *
         * Now align to a new cache line as all the following members
         * might be often dirty.
         */
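
To make the cache-line point concrete, here is a simplified userspace sketch, not the kernel's struct inet_hashinfo. It assumes a 64-byte cache line and uses C11 _Alignas where the kernel uses ____cacheline_aligned_in_smp; the struct and helper names are hypothetical. The idea is that a frequently written counter should start on a different line from the read-mostly members.

```c
#include <assert.h>
#include <stddef.h>

#define CACHELINE 64	/* assumed line size, for illustration only */

/* Simplified stand-in for the layout under discussion. */
struct hashinfo_sketch {
	/* Read-mostly: written once at boot, then only read. */
	void		*bhash;
	unsigned int	 bhash_size;
	void		*bind_bucket_cachep;

	/* Frequently written: forced onto the start of a new cache line. */
	_Alignas(CACHELINE) int bsockets;
};

/* Returns nonzero if the two member offsets fall in the same cache line. */
static int same_cacheline(size_t off_a, size_t off_b)
{
	return off_a / CACHELINE == off_b / CACHELINE;
}

/* Checks that the hot counter does not share a line with bhash_size. */
static void check_layout(void)
{
	assert(!same_cacheline(offsetof(struct hashinfo_sketch, bhash_size),
			       offsetof(struct hashinfo_sketch, bsockets)));
}
```

Without the alignment specifier, bsockets would sit directly after the read-mostly members, so every update would dirty the line those members are read from.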



> 
>> Also there might be a problem at line 175
>>
>> if (sk->sk_reuse && sk->sk_state != TCP_LISTEN && --attempts >= 0) { 
>> 	spin_unlock(&head->lock);
>> 	goto again;
>>
>> If we entered inet_csk_get_port() with a non null snum, we can "goto again"
>> while it was not expected.
>>
>> diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
>> index df8e72f..752c6b2 100644
>> --- a/net/ipv4/inet_connection_sock.c
>> +++ b/net/ipv4/inet_connection_sock.c
>> @@ -172,7 +172,8 @@ tb_found:
>>  		} else {
>>  			ret = 1;
>>  			if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb)) {
>> -				if (sk->sk_reuse && sk->sk_state != TCP_LISTEN && --attempts >= 0) {
>> +				if (sk->sk_reuse && sk->sk_state != TCP_LISTEN &&
>> +					smallest_size == -1 &&  --attempts >= 0) {
> 
> I think it should be smallest_size != -1, since we really want to go
> to the again label when the heuristic is used, which in turn changes
> smallest_size.
> 

Yep




* Re: virt-manager broken by bind(0) in net-next.
  2009-01-30 22:30                       ` Eric Dumazet
@ 2009-01-30 22:51                         ` Evgeniy Polyakov
       [not found]                           ` <20090130225113.GA13977-i6C2adt8DTjR7s880joybQ@public.gmane.org>
  0 siblings, 1 reply; 28+ messages in thread
From: Evgeniy Polyakov @ 2009-01-30 22:51 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Stephen Hemminger, Herbert Xu, berrange, et-mgmt-tools, davem,
	netdev

On Fri, Jan 30, 2009 at 11:30:22PM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote:
> > It should contain a rough number of sockets; there is no need to be very
> > precise because of this heuristic.
> 
> Denying there is a bug is... well... I don't know what to say.
> 
> I wonder why we still use atomic_t all over the kernel.

It is not a bug. It is not supposed to be precise. At all.
I implemented a simple heuristic for when a different bind port selection
algorithm should start: roughly when the number of opened sockets equals
some predefined value (a sysctl at the moment, but it could be 64k or
anything else). So if that number is loosely maintained and does not
precisely correspond to the number of sockets, it is not a problem.

You also saw the 'again' label, which has the magic number 5. It is another
heuristic: since the lock is dropped after the bind bucket check, and we
have selected a bucket, it is possible that a non-reuse socket will be added
to the bucket, so we will have to rerun the process. I limited this to
5 attempts only, since it is better than what we have right now (I
never saw more than 2 attempts needed in the tests), when the number of
bound sockets does not exceed 64k.

> > I used a free alignment slot so that the socket structure would not be
> > increased.
> 
> Are you kidding ?
> 
> bsockets is not part of socket structure, but part of "struct inet_hashinfo",

Yes, I mistyped.

> shared by all cpus and accessed several thousand times per second on many
> machines.
> 
> Please read the comment three lines after 'the free alignment slot'
> you chose.... You just introduced one write on a cache line
> that is supposed to *not* be written.

I have no objection to moving this somewhere toward the end of the structure,
for example after bind_bucket_cachep.

--- ./include/net/inet_hashtables.h~	2009-01-19 22:19:11.000000000 +0300
+++ ./include/net/inet_hashtables.h	2009-01-31 01:48:21.000000000 +0300
@@ -134,7 +134,6 @@
 	struct inet_bind_hashbucket	*bhash;
 
 	unsigned int			bhash_size;
-	int				bsockets;
 
 	struct kmem_cache		*bind_bucket_cachep;
 
@@ -148,6 +147,7 @@
 	 * table where wildcard'd TCP sockets can exist.  Hash function here
 	 * is just local port number.
 	 */
+	int				bsockets;
 	struct inet_listen_hashbucket	listening_hash[INET_LHTABLE_SIZE]
 					____cacheline_aligned_in_smp;
 
--- ./net/ipv4/inet_connection_sock.c~	2009-01-19 22:21:08.000000000 +0300
+++ ./net/ipv4/inet_connection_sock.c	2009-01-31 01:50:20.000000000 +0300
@@ -172,7 +172,8 @@
 		} else {
 			ret = 1;
 			if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb)) {
-				if (sk->sk_reuse && sk->sk_state != TCP_LISTEN && --attempts >= 0) {
+				if (sk->sk_reuse && sk->sk_state != TCP_LISTEN &&
+					smallest_size != -1 && --attempts >= 0) {
 					spin_unlock(&head->lock);
 					goto again;
 				}


-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: virt-manager broken by bind(0) in net-next.
       [not found]                           ` <20090130225113.GA13977-i6C2adt8DTjR7s880joybQ@public.gmane.org>
@ 2009-01-31  0:36                             ` Stephen Hemminger
  2009-01-31  8:35                               ` Evgeniy Polyakov
  2009-01-31  2:52                             ` Stephen Hemminger
  1 sibling, 1 reply; 28+ messages in thread
From: Stephen Hemminger @ 2009-01-31  0:36 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Herbert Xu, et-mgmt-tools-H+wXaHxf7aLQT0dZR+AlfA,
	netdev-u79uwXL29TY76Z2rM5mHXA, Eric Dumazet,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q

On Sat, 31 Jan 2009 01:51:14 +0300
Evgeniy Polyakov <zbr-i6C2adt8DTjR7s880joybQ@public.gmane.org> wrote:

> On Fri, Jan 30, 2009 at 11:30:22PM +0100, Eric Dumazet (dada1-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org) wrote:
> > > It should contain a rough number of sockets; there is no need to be very
> > > precise because of this heuristic.
> > 
> > Denying there is a bug is... well... I dont know what to say.
> > 
> > I wonder why we still use atomic_t all over the kernel.
> 
> It is not a bug. It is not supposed to be precise. At all.
> I implemented a simple heuristic for when a different bind port selection
> algorithm should start: roughly when the number of opened sockets equals
> some predefined value (a sysctl at the moment, but it could be 64k or
> anything else), so if that number is loosely maintained and does not
> precisely correspond to the number of sockets, it is not a problem.
> 
> You also saw the 'again' label, which has the magic number 5 - it is another
> heuristic - since the lock is dropped after the bind bucket check, and we
> selected it, it is possible that a non-reuse socket will be added into the
> bucket, so we will have to rerun the process again. I limited this to
> 5 attempts only, since it is better than what we have right now (I
> never saw more than 2 attempts needed in the tests), when the number of
> bound sockets does not exceed 64k.
> 
>

How is any of this supposed to fix the bug?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: virt-manager broken by bind(0) in net-next.
       [not found]                           ` <20090130225113.GA13977-i6C2adt8DTjR7s880joybQ@public.gmane.org>
  2009-01-31  0:36                             ` Stephen Hemminger
@ 2009-01-31  2:52                             ` Stephen Hemminger
  2009-01-31  8:37                               ` Evgeniy Polyakov
  1 sibling, 1 reply; 28+ messages in thread
From: Stephen Hemminger @ 2009-01-31  2:52 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Herbert Xu, et-mgmt-tools-H+wXaHxf7aLQT0dZR+AlfA,
	netdev-u79uwXL29TY76Z2rM5mHXA, Eric Dumazet,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q

My working hypothesis is:
  1. Something about Evgeniy's patch makes IPV6 (actually IPV4 in IPV6) be
     preferred over plain IPV4.
  2. Vino server (VNC) doesn't think ::ffff:127.0.0.1 is really the localhost
  3. protocol gets screwed up after that.

It is probably reproducible with other services that support IPV6.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: virt-manager broken by bind(0) in net-next.
  2009-01-31  0:36                             ` Stephen Hemminger
@ 2009-01-31  8:35                               ` Evgeniy Polyakov
  0 siblings, 0 replies; 28+ messages in thread
From: Evgeniy Polyakov @ 2009-01-31  8:35 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Eric Dumazet, Herbert Xu, berrange, et-mgmt-tools, davem, netdev

On Fri, Jan 30, 2009 at 04:36:00PM -0800, Stephen Hemminger (shemminger@vyatta.com) wrote:
> > It is not a bug. It is not supposed to be precise. At all.
> > I implemented a simple heuristic for when a different bind port selection
> > algorithm should start: roughly when the number of opened sockets equals
> > some predefined value (a sysctl at the moment, but it could be 64k or
> > anything else), so if that number is loosely maintained and does not
> > precisely correspond to the number of sockets, it is not a problem.
> > 
> > You also saw the 'again' label, which has the magic number 5 - it is another
> > heuristic - since the lock is dropped after the bind bucket check, and we
> > selected it, it is possible that a non-reuse socket will be added into the
> > bucket, so we will have to rerun the process again. I limited this to
> > 5 attempts only, since it is better than what we have right now (I
> > never saw more than 2 attempts needed in the tests), when the number of
> > bound sockets does not exceed 64k.
> > 
> >
> 
> How is any of this supposed to fix the bug?

Nothing from above fixes the bug. It was an explanation of how things
work. The patch is based on Eric's observation about the unconditional
(compared to the old code) attempt to get a new socket bucket when the
code should just return.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: virt-manager broken by bind(0) in net-next.
  2009-01-31  2:52                             ` Stephen Hemminger
@ 2009-01-31  8:37                               ` Evgeniy Polyakov
  2009-01-31  9:17                                 ` Eric Dumazet
  0 siblings, 1 reply; 28+ messages in thread
From: Evgeniy Polyakov @ 2009-01-31  8:37 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Eric Dumazet, Herbert Xu, berrange, et-mgmt-tools, davem, netdev

On Fri, Jan 30, 2009 at 06:52:24PM -0800, Stephen Hemminger (shemminger@vyatta.com) wrote:
> My working hypothesis is:
>   1. Something about Evgeniy's patch makes IPV6 (actually IPV4 in IPV6) be
>      preferred over plain IPV4.
>   2. Vino server (VNC) doesn't think ::ffff:127.0.0.1 is really the localhost
>   3. protocol gets screwed up after that.
> 
> It is probably reproducible with other services that support IPV6.

getaddrinfo() returns a list of addresses, and IPv6 was the first one iirc.
Previously the port selection bailed out, but with my change it tries again
for no reason. With the patch I sent based on Eric's observation
things should be fine.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: virt-manager broken by bind(0) in net-next.
  2009-01-31  8:37                               ` Evgeniy Polyakov
@ 2009-01-31  9:17                                 ` Eric Dumazet
  2009-01-31  9:31                                   ` Evgeniy Polyakov
  0 siblings, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2009-01-31  9:17 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Stephen Hemminger, Herbert Xu, berrange, et-mgmt-tools, davem,
	netdev

Evgeniy Polyakov a écrit :
> On Fri, Jan 30, 2009 at 06:52:24PM -0800, Stephen Hemminger (shemminger@vyatta.com) wrote:
>> My working hypothesis is:
>>   1. Something about Evgeniy's patch makes IPV6 (actually IPV4 in IPV6) be
>>      preferred over plain IPV4.
>>   2. Vino server (VNC) doesn't think ::ffff:127.0.0.1 is really the localhost
>>   3. protocol gets screwed up after that.
>>
>> It is probably reproducible with other services that support IPV6.
> 
> getaddrinfo() returns list of addresses and IPv6 was the first one iirc.
> Previously it bailed out, but with my change it will try again without
> reason for doing this. With the patch I sent based on Eric's observation
> things should be fine.
> 

The problem is that your patch is wrong, Evgeniy. Please think about it a
little bit more and resubmit it.

Take the time to run this $0.02 program, before and after your upcoming fix :


$ cat size.c
#include <net/inet_hashtables.h>
extern int printf(const char *, ...);
int main(int argc, char *argv[])
{
        printf("offsetof(struct inet_hashinfo, bsockets)=0x%x\n",
                offsetof(struct inet_hashinfo, bsockets));
        return 0;
}
$ make size.o ; gcc -o size size.o ; ./size
  CHK     include/linux/version.h
  CHK     include/linux/utsrelease.h
  SYMLINK include/asm -> include/asm-x86
  CALL    scripts/checksyscalls.sh
  CC      size.o
offsetof(struct inet_hashinfo, bsockets)=0x18


Whether the offset of bsockets is 0x18 or 0x20, the result is the same: bad,
because it is in the same cache line as ehash, ehash_locks, ehash_size,
ehash_locks_mask, bhash and bhash_size, unless your cpu is a Pentium.

Also, I suggest you change bsockets to something more appropriate, eg a
percpu counter.

Thank you.
Eric


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: virt-manager broken by bind(0) in net-next.
  2009-01-31  9:17                                 ` Eric Dumazet
@ 2009-01-31  9:31                                   ` Evgeniy Polyakov
  2009-01-31  9:49                                     ` Eric Dumazet
  0 siblings, 1 reply; 28+ messages in thread
From: Evgeniy Polyakov @ 2009-01-31  9:31 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Stephen Hemminger, Herbert Xu, berrange, et-mgmt-tools, davem,
	netdev

On Sat, Jan 31, 2009 at 10:17:44AM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote:
> > getaddrinfo() returns list of addresses and IPv6 was the first one iirc.
> > Previously it bailed out, but with my change it will try again without
> > reason for doing this. With the patch I sent based on Eric's observation
> > things should be fine.
> 
> The problem is that your patch is wrong, Evgeniy. Please think about it a
> little bit more and resubmit it.

No, the patch should be ok. And the part of it which moves bsockets around
was added because of your complaint that it is written into a read-mostly
cache line. It is not a fix and has nothing to do with the problem at all.

> Take the time to run this $0.02 program, before and after your upcoming fix :

It is not a fix, but an enhancement, which really has nothing to do with
the bug in question :)
The fix is to return an error if socket binding does not use the heuristic.

> offset of bsockets being 0x18 or 0x20 is same result : bad because in
> same cache line than ehash, ehash_locks, ehash_size, ehash_locks_mask,
> bhash, bhash_size, unless your cpu is a Pentium.

The attached patch makes a difference; I'm curious whether it will make any
difference in the benchmarks.

> Also, I suggest you change bsockets to something more appropriate, eg a
> percpu counter.

I thought about that first, but found that looping over every cpu and
summing the total number of allocated/freed sockets would have noticeably
bigger overhead than having a loosely maintained number of sockets.

For reference: this patch has nothing to do with the bug we discuss here.
The proper patch (without the need to move bsockets around) was sent
earlier; it forces the port selection codepath to return an error when the
new selection heuristic is not used.

--- ./include/net/inet_hashtables.h.orig	2009-01-31 12:27:41.000000000 +0300
+++ ./include/net/inet_hashtables.h	2009-01-31 12:28:15.000000000 +0300
@@ -134,7 +134,6 @@
 	struct inet_bind_hashbucket	*bhash;

 	unsigned int			bhash_size;
-	int				bsockets;

 	struct kmem_cache		*bind_bucket_cachep;

@@ -150,6 +149,8 @@
 	 */
 	struct inet_listen_hashbucket	listening_hash[INET_LHTABLE_SIZE]
 					____cacheline_aligned_in_smp;
+	
+	int				bsockets ____cacheline_aligned_in_smp;

 };

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: virt-manager broken by bind(0) in net-next.
  2009-01-31  9:31                                   ` Evgeniy Polyakov
@ 2009-01-31  9:49                                     ` Eric Dumazet
  2009-01-31  9:56                                       ` Evgeniy Polyakov
  0 siblings, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2009-01-31  9:49 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Stephen Hemminger, Herbert Xu, berrange, et-mgmt-tools, davem,
	netdev

Evgeniy Polyakov a écrit :
> On Sat, Jan 31, 2009 at 10:17:44AM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote:
>>> getaddrinfo() returns list of addresses and IPv6 was the first one iirc.
>>> Previously it bailed out, but with my change it will try again without
>>> reason for doing this. With the patch I sent based on Eric's observation
>>> things should be fine.
>> The problem is that your patch is wrong, Evgeniy. Please think about it a
>> little bit more and resubmit it.
> 
> No, patch should be ok. And its part which moves bsockets around was
> added because of your complaints, that it is written into read-mostly
> cache line. It is not a fix and has nothing with the problem at all.
> 
>> Take the time to run this $0.02 program, before and after your upcoming fix :
> 
> It is not a fix, but enhancement, which really has nothing with the bug
> in question :)
> Fix is to return an error if socket binding does not use the heuristic.
> 
>> offset of bsockets being 0x18 or 0x20 is same result : bad because in
>> same cache line than ehash, ehash_locks, ehash_size, ehash_locks_mask,
>> bhash, bhash_size, unless your cpu is a Pentium.
> 
> Attached patch makes difference, I'm curious if it ever make any
> difference in the benchmarks.
> 
>> Also, I suggest you change bsockets to something more appropriate, eg a
>> percpu counter.
> 
> I thought about that first, but found that looping over every cpu and
> summing the total number of allocated/freed sockets would have noticeably
> bigger overhead than having a loosely maintained number of sockets.
> 
> For the reference. This patch has nothing with the bug we discuss here,
> the proper patch (without need to move bsockets around) was sent
> earlier, which forces port selection codepath to return error when new
> selection heuristic is not used.
> 
> --- ./include/net/inet_hashtables.h.orig	2009-01-31 12:27:41.000000000 +0300
> +++ ./include/net/inet_hashtables.h	2009-01-31 12:28:15.000000000 +0300
> @@ -134,7 +134,6 @@
>  	struct inet_bind_hashbucket	*bhash;
>  
>  	unsigned int			bhash_size;
> -	int				bsockets;
>  
>  	struct kmem_cache		*bind_bucket_cachep;
>  
> @@ -150,6 +149,8 @@
>  	 */
>  	struct inet_listen_hashbucket	listening_hash[INET_LHTABLE_SIZE]
>  					____cacheline_aligned_in_smp;
> +	
> +	int				bsockets ____cacheline_aligned_in_smp;
>  
>  };
>  
> 


It appears you are always right, I have nothing to say then.

Stupid I am.

I vote for a plain revert of your initial patch, since you are unaware
of the performance problems it introduces. Then, probably nobody cares
about my complaints, so don't worry.



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: virt-manager broken by bind(0) in net-next.
  2009-01-31  9:49                                     ` Eric Dumazet
@ 2009-01-31  9:56                                       ` Evgeniy Polyakov
  2009-01-31 10:17                                         ` Eric Dumazet
  0 siblings, 1 reply; 28+ messages in thread
From: Evgeniy Polyakov @ 2009-01-31  9:56 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Stephen Hemminger, Herbert Xu, berrange, et-mgmt-tools, davem,
	netdev

On Sat, Jan 31, 2009 at 10:49:00AM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote:
> It appears you are always right, I have nothing to say then.
> 
> Stupid I am.
> 
> I vote for a plain revert of your initial patch, since you are unaware
> of the performance problems it introduces. Then, probably nobody cares
> about my complaints, so don't worry.

Eric, do not take it so personally :) After all it is only a matter of
how we enjoy the process and have fun with the development.

Really, I appreciate your work and help, and likely this
misunderstanding happened because of a bad mix of the original bug and
this performance implication. The original bug really has nothing to do
with what we discuss here. And while the performance problem with bound
socket creation may be visible, I did not observe it, while the idea
implemented with this approach shows up clearly in the graph I posted.
So I vote with both hands to further improve it by moving things around so
that there would be no unneeded cache flushes during updates of this
field.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: virt-manager broken by bind(0) in net-next.
  2009-01-31  9:56                                       ` Evgeniy Polyakov
@ 2009-01-31 10:17                                         ` Eric Dumazet
  2009-02-01 12:42                                           ` Evgeniy Polyakov
  0 siblings, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2009-01-31 10:17 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Stephen Hemminger, Herbert Xu, berrange, et-mgmt-tools, davem,
	netdev

Evgeniy Polyakov a écrit :
> On Sat, Jan 31, 2009 at 10:49:00AM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote:
>> It appears you are always right, I have nothing to say then.
>>
>> Stupid I am.
>>
>> I vote for a plain revert of your initial patch, since you are unaware
>> of the performance problems it introduces. Then, probably nobody cares
>> about my complaints, so don't worry.
> 
> Eric, do not take it so personally :) After all it is only a matter of
> how we enjoy the process and have fun with the development.
> 
> Really, I appreciate your work and help, and likely this
> misunderstanding happened because of a bad mix of the original bug and
> this performance implication. The original bug really has nothing to do
> with what we discuss here. And while the performance problem with bound
> socket creation may be visible, I did not observe it, while the idea
> implemented with this approach shows up clearly in the graph I posted.
> So I vote with both hands to further improve it by moving things around so
> that there would be no unneeded cache flushes during updates of this
> field.
> 

OK OK, as I said, don't worry, it was not a strong feeling from me, I was
only a little bit upset, that's all.

We only need to know if the *fix* solves Stephen's problem.

About the performance effects of careful variable placement and the percpu
counter strategy, you might consult this as an example:

http://lkml.indiana.edu/hypermail/linux/kernel/0812.1/01624.html

Now, with these patches applied, try to see the effect of your new bsockets
field on a network workload doing a lot of socket bind()/unbind() calls...

With current kernels, you probably won't notice because of inode/dcache hot
cache lines, but it might change eventually...



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: virt-manager broken by bind(0) in net-next.
       [not found]                     ` <498349F7.4050300-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
@ 2009-02-01  5:29                       ` Stephen Hemminger
  0 siblings, 0 replies; 28+ messages in thread
From: Stephen Hemminger @ 2009-02-01  5:29 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Herbert Xu, et-mgmt-tools-H+wXaHxf7aLQT0dZR+AlfA,
	netdev-u79uwXL29TY76Z2rM5mHXA, Evgeniy Polyakov,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q

On Fri, 30 Jan 2009 19:41:59 +0100
Eric Dumazet <dada1-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org> wrote:

> Stephen Hemminger a écrit :
> > On Fri, 30 Jan 2009 23:53:37 +1100
> > Herbert Xu <herbert-lOAM2aK0SrRLBo1qDEOMRrpzq4S04n8Q@public.gmane.org> wrote:
> > 
> >> Evgeniy Polyakov <zbr-i6C2adt8DTjR7s880joybQ@public.gmane.org> wrote:
> >>> So it is not explicit bind call, but port autoselection in the
> >>> connect(). Can you check what errno is returned?
> >>> Did I understand it right, that connect fails, you try different
> >>> address, but then suddenly all those sockets become 'alive'?
> >> Yes, I think a good strace vs. a bad strace would be really helpful
> >> in these cases.
> >>
> >> Thanks,
> > 
> > I have the strace but it comes up no different.
> > What is different is that in the broken case (net-next), I see
> > IPV6 being used:
> > 
> > State      Recv-Q Send-Q      Local Address:Port          Peer Address:Port   
> > ESTAB      23769  0        ::ffff:127.0.0.1:5900      ::ffff:127.0.0.1:55987   
> > ESTAB      0      0               127.0.0.1:55987            127.0.0.1:5900
> > 
> > and in the working case (2.6.29-rc3), IPV4 is being used
> > State      Recv-Q Send-Q      Local Address:Port          Peer Address:Port   
> > ESTAB      0      0               127.0.0.1:58894            127.0.0.1:5901    
> > ESTAB      0      0               127.0.0.1:5901             127.0.0.1:58894 
> > 
> 
> Reviewing commit a9d8f9110d7e953c2f2b521087a4179677843c2a
> 
> I see use of a hashinfo->bsockets field that :
> 
> - lacks proper lock/synchronization
> - suffers from cache line ping pongs on SMP
> 
> Also there might be a problem at line 175
> 
> if (sk->sk_reuse && sk->sk_state != TCP_LISTEN && --attempts >= 0) { 
> 	spin_unlock(&head->lock);
> 	goto again;
> 
> If we entered inet_csk_get_port() with a non null snum, we can "goto again"
> while it was not expected.
> 
> diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> index df8e72f..752c6b2 100644
> --- a/net/ipv4/inet_connection_sock.c
> +++ b/net/ipv4/inet_connection_sock.c
> @@ -172,7 +172,8 @@ tb_found:
>  		} else {
>  			ret = 1;
>  			if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb)) {
> -				if (sk->sk_reuse && sk->sk_state != TCP_LISTEN && --attempts >= 0) {
> +				if (sk->sk_reuse && sk->sk_state != TCP_LISTEN &&
> +					smallest_size == -1 &&  --attempts >= 0) {
>  					spin_unlock(&head->lock);
>  					goto again;
>  				}
> 
> 

That didn't fix it.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: virt-manager broken by bind(0) in net-next.
       [not found]                       ` <20090130215008.GB12210-i6C2adt8DTjR7s880joybQ@public.gmane.org>
@ 2009-02-01  5:58                         ` Stephen Hemminger
  2009-02-01  9:07                           ` David Miller
  2009-02-01 12:44                           ` Evgeniy Polyakov
  0 siblings, 2 replies; 28+ messages in thread
From: Stephen Hemminger @ 2009-02-01  5:58 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Herbert Xu, et-mgmt-tools-H+wXaHxf7aLQT0dZR+AlfA,
	netdev-u79uwXL29TY76Z2rM5mHXA, Eric Dumazet,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q

On Sat, 31 Jan 2009 00:50:08 +0300
Evgeniy Polyakov <zbr-i6C2adt8DTjR7s880joybQ@public.gmane.org> wrote:

> On Fri, Jan 30, 2009 at 07:41:59PM +0100, Eric Dumazet (dada1-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org) wrote:
> > Reviewing commit a9d8f9110d7e953c2f2b521087a4179677843c2a
> > 
> > I see use of a hashinfo->bsockets field that :
> > 
> > - lacks proper lock/synchronization
> 
> It should contain a rough number of sockets; there is no need to be very
> precise because of this heuristic.
> 
> > - suffers from cache line ping pongs on SMP
> 
> I used a free alignment slot so that the socket structure would not be
> increased.
> 
> > Also there might be a problem at line 175
> > 
> > if (sk->sk_reuse && sk->sk_state != TCP_LISTEN && --attempts >= 0) { 
> > 	spin_unlock(&head->lock);
> > 	goto again;
> > 
> > If we entered inet_csk_get_port() with a non null snum, we can "goto again"
> > while it was not expected.
> > 
> > diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> > index df8e72f..752c6b2 100644
> > --- a/net/ipv4/inet_connection_sock.c
> > +++ b/net/ipv4/inet_connection_sock.c
> > @@ -172,7 +172,8 @@ tb_found:
> >  		} else {
> >  			ret = 1;
> >  			if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb)) {
> > -				if (sk->sk_reuse && sk->sk_state != TCP_LISTEN && --attempts >= 0) {
> > +				if (sk->sk_reuse && sk->sk_state != TCP_LISTEN &&
> > +					smallest_size == -1 &&  --attempts >= 0) {
> 
> > I think it should be smallest_size != -1, since we really want to go
> > to the again label when the heuristic is used, which in turn changes
> > smallest_size.
> 

Yes, this fixes the problem, not sure who wants the honors for sending a signed off
version.




--- a/net/ipv4/inet_connection_sock.c	2009-01-31 21:18:45.433239861 -0800
+++ b/net/ipv4/inet_connection_sock.c	2009-01-31 21:30:14.720825414 -0800
@@ -172,7 +172,8 @@ tb_found:
 		} else {
 			ret = 1;
 			if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb)) {
-				if (sk->sk_reuse && sk->sk_state != TCP_LISTEN && --attempts >= 0) {
+				if (sk->sk_reuse && sk->sk_state != TCP_LISTEN &&
+				    smallest_size != -1 && --attempts >= 0) {
 					spin_unlock(&head->lock);
 					goto again;
 				}

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: virt-manager broken by bind(0) in net-next.
  2009-02-01  5:58                         ` Stephen Hemminger
@ 2009-02-01  9:07                           ` David Miller
  2009-02-01 12:44                           ` Evgeniy Polyakov
  1 sibling, 0 replies; 28+ messages in thread
From: David Miller @ 2009-02-01  9:07 UTC (permalink / raw)
  To: shemminger; +Cc: zbr, dada1, herbert, berrange, et-mgmt-tools, netdev

From: Stephen Hemminger <shemminger@vyatta.com>
Date: Sat, 31 Jan 2009 21:58:50 -0800

> Yes, this fixes the problem, not sure who wants the honors for
> sending a signed off version.

I'll sort it out.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: virt-manager broken by bind(0) in net-next.
  2009-01-31 10:17                                         ` Eric Dumazet
@ 2009-02-01 12:42                                           ` Evgeniy Polyakov
  2009-02-01 16:12                                             ` Eric Dumazet
  0 siblings, 1 reply; 28+ messages in thread
From: Evgeniy Polyakov @ 2009-02-01 12:42 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Stephen Hemminger, Herbert Xu, berrange, et-mgmt-tools, davem,
	netdev

Hi Eric.

On Sat, Jan 31, 2009 at 11:17:15AM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote:
> We only need to know if the *fix* is solving Stephen problem
> 
> About performance effects of careful variable placement and percpu counter
> strategy you might consult as an example :
> 
> http://lkml.indiana.edu/hypermail/linux/kernel/0812.1/01624.html

Impressive, but to be 100% fair it is not only because of the cache line
issues :)

> Now, with these patches applied, try to see effect of your new bsockets field
> on a network workload doing lot of socket bind()/unbind() calls...
> 
> With current kernels, you probably wont notice because of inode/dcache hot
> cache lines, but it might change eventually...

David applied the patch which fixed the problem, so we can return to the
cache line issues. What do you think about the last version, where the
bsockets field was placed at the very end of the structure with the
____cacheline_aligned_in_smp attribute?

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: virt-manager broken by bind(0) in net-next.
  2009-02-01  5:58                         ` Stephen Hemminger
  2009-02-01  9:07                           ` David Miller
@ 2009-02-01 12:44                           ` Evgeniy Polyakov
  1 sibling, 0 replies; 28+ messages in thread
From: Evgeniy Polyakov @ 2009-02-01 12:44 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Eric Dumazet, Herbert Xu, berrange, et-mgmt-tools, davem, netdev

On Sat, Jan 31, 2009 at 09:58:50PM -0800, Stephen Hemminger (shemminger@vyatta.com) wrote:
> > I think it should be smallest_size != -1, since we really want to go
> > to the again label when the heuristic is used, which in turn changes
> > smallest_size.
> 
> Yes, this fixes the problem, not sure who wants the honors for sending a signed off
> version.

Thanks for testing Stephen. David applied that version and cut the
Gordian knot :)

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: virt-manager broken by bind(0) in net-next.
  2009-02-01 12:42                                           ` Evgeniy Polyakov
@ 2009-02-01 16:12                                             ` Eric Dumazet
  2009-02-01 17:40                                               ` Evgeniy Polyakov
  0 siblings, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2009-02-01 16:12 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Stephen Hemminger, Herbert Xu, berrange, et-mgmt-tools, davem,
	netdev

Evgeniy Polyakov a écrit :
> Hi Eric.
> 
> On Sat, Jan 31, 2009 at 11:17:15AM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote:
>> We only need to know if the *fix* is solving Stephen problem
>>
>> About performance effects of careful variable placement and percpu counter
>> strategy you might consult as an example :
>>
>> http://lkml.indiana.edu/hypermail/linux/kernel/0812.1/01624.html
> 
> Impressive, but to be 100% fair it is not only because of the cache line
> issues :)
> 
>> Now, with these patches applied, try to see effect of your new bsockets field
>> on a network workload doing lot of socket bind()/unbind() calls...
>>
>> With current kernels, you probably wont notice because of inode/dcache hot
>> cache lines, but it might change eventually...
> 
> David applied the patch which fixed the problem, so we can return to the
> cache line issues. What do you think about the last version, where the
> bsockets field was placed at the very end of the structure with the
> ____cacheline_aligned_in_smp attribute?
> 

Yes, at a minimum, move it away from the first cache line.

And use atomic_t so that we don't have to discuss accumulated
errors on SMP for this variable. We will see later whether a percpu counter
is wanted or not.

Thank you

[PATCH] net: move bsockets outside of read only beginning of struct inet_hashinfo

And switch bsockets to atomic_t since it might be changed in parallel.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 include/net/inet_hashtables.h   |    3 ++-
 net/ipv4/inet_connection_sock.c |    2 +-
 net/ipv4/inet_hashtables.c      |    5 +++--
 3 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 8d98dc7..a44e224 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -134,7 +134,7 @@ struct inet_hashinfo {
 	struct inet_bind_hashbucket	*bhash;
 
 	unsigned int			bhash_size;
-	int				bsockets;
+	/* 4 bytes hole on 64 bit */
 
 	struct kmem_cache		*bind_bucket_cachep;
 
@@ -151,6 +151,7 @@ struct inet_hashinfo {
 	struct inet_listen_hashbucket	listening_hash[INET_LHTABLE_SIZE]
 					____cacheline_aligned_in_smp;
 
+	atomic_t			bsockets;
 };
 
 static inline struct inet_ehash_bucket *inet_ehash_bucket(
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 9bc6a18..22cd19e 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -119,7 +119,7 @@ again:
 					    (tb->num_owners < smallest_size || smallest_size == -1)) {
 						smallest_size = tb->num_owners;
 						smallest_rover = rover;
-						if (hashinfo->bsockets > (high - low) + 1) {
+						if (atomic_read(&hashinfo->bsockets) > (high - low) + 1) {
 							spin_unlock(&head->lock);
 							snum = smallest_rover;
 							goto have_snum;
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index d7b6178..625cc5f 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -62,7 +62,7 @@ void inet_bind_hash(struct sock *sk, struct inet_bind_bucket *tb,
 {
 	struct inet_hashinfo *hashinfo = sk->sk_prot->h.hashinfo;
 
-	hashinfo->bsockets++;
+	atomic_inc(&hashinfo->bsockets);
 
 	inet_sk(sk)->num = snum;
 	sk_add_bind_node(sk, &tb->owners);
@@ -81,7 +81,7 @@ static void __inet_put_port(struct sock *sk)
 	struct inet_bind_hashbucket *head = &hashinfo->bhash[bhash];
 	struct inet_bind_bucket *tb;
 
-	hashinfo->bsockets--;
+	atomic_dec(&hashinfo->bsockets);
 
 	spin_lock(&head->lock);
 	tb = inet_csk(sk)->icsk_bind_hash;
@@ -532,6 +532,7 @@ void inet_hashinfo_init(struct inet_hashinfo *h)
 {
 	int i;
 
+	atomic_set(&h->bsockets, 0);
 	for (i = 0; i < INET_LHTABLE_SIZE; i++) {
 		spin_lock_init(&h->listening_hash[i].lock);
 		INIT_HLIST_NULLS_HEAD(&h->listening_hash[i].head,



* Re: virt-manager broken by bind(0) in net-next.
  2009-02-01 16:12                                             ` Eric Dumazet
@ 2009-02-01 17:40                                               ` Evgeniy Polyakov
  2009-02-01 20:31                                                 ` David Miller
  0 siblings, 1 reply; 28+ messages in thread
From: Evgeniy Polyakov @ 2009-02-01 17:40 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Stephen Hemminger, Herbert Xu, berrange, et-mgmt-tools, davem,
	netdev

On Sun, Feb 01, 2009 at 05:12:41PM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote:
> > David applied the patch which fixed the problem, so we can return to the
> > cache line issues. What do you think about the last version where
> > bsockets field was placed at the very end of the structure and with
> > ____cacheline_aligned_in_smp attribute?
> > 
> 
> Yes, at a minimum, move it away from first cache line.
> 
> And use atomic_t so that we don't have to discuss accumulated
> errors on SMP for this variable. We will see later whether a percpu
> counter is wanted or not.
> 
> Thank you
> 
> [PATCH] net: move bsockets outside of read only beginning of struct inet_hashinfo
> 
> And switch bsockets to atomic_t since it might be changed in parallel.
> 
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

Ok, let's do it this way. Ack.
Thank you Eric.

-- 
	Evgeniy Polyakov


* Re: virt-manager broken by bind(0) in net-next.
  2009-02-01 17:40                                               ` Evgeniy Polyakov
@ 2009-02-01 20:31                                                 ` David Miller
  0 siblings, 0 replies; 28+ messages in thread
From: David Miller @ 2009-02-01 20:31 UTC (permalink / raw)
  To: zbr; +Cc: dada1, shemminger, herbert, berrange, et-mgmt-tools, netdev

From: Evgeniy Polyakov <zbr@ioremap.net>
Date: Sun, 1 Feb 2009 20:40:25 +0300

> On Sun, Feb 01, 2009 at 05:12:41PM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote:
> > [PATCH] net: move bsockets outside of read only beginning of struct inet_hashinfo
> > 
> > And switch bsockets to atomic_t since it might be changed in parallel.
> > 
> > Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
> 
> Ok, let's do it this way. Ack.
> Thank you Eric.

Applied, thanks everyone.


Thread overview: 28+ messages
     [not found] <20090128212114.38be3e8c@extreme>
     [not found] ` <20090129103544.GC22110@redhat.com>
2009-01-30  5:35   ` virt-manager broken by bind(0) in net-next Stephen Hemminger
2009-01-30  8:16     ` Evgeniy Polyakov
     [not found]       ` <20090130081600.GA2717-i6C2adt8DTjR7s880joybQ@public.gmane.org>
2009-01-30 10:27         ` Daniel P. Berrange
2009-01-30 11:21           ` Evgeniy Polyakov
2009-01-30 12:53             ` Herbert Xu
     [not found]               ` <20090130125337.GA7155-lOAM2aK0SrRLBo1qDEOMRrpzq4S04n8Q@public.gmane.org>
2009-01-30 17:57                 ` Stephen Hemminger
2009-01-30 18:41                   ` Eric Dumazet
2009-01-30 21:50                     ` Evgeniy Polyakov
2009-01-30 22:30                       ` Eric Dumazet
2009-01-30 22:51                         ` Evgeniy Polyakov
     [not found]                           ` <20090130225113.GA13977-i6C2adt8DTjR7s880joybQ@public.gmane.org>
2009-01-31  0:36                             ` Stephen Hemminger
2009-01-31  8:35                               ` Evgeniy Polyakov
2009-01-31  2:52                             ` Stephen Hemminger
2009-01-31  8:37                               ` Evgeniy Polyakov
2009-01-31  9:17                                 ` Eric Dumazet
2009-01-31  9:31                                   ` Evgeniy Polyakov
2009-01-31  9:49                                     ` Eric Dumazet
2009-01-31  9:56                                       ` Evgeniy Polyakov
2009-01-31 10:17                                         ` Eric Dumazet
2009-02-01 12:42                                           ` Evgeniy Polyakov
2009-02-01 16:12                                             ` Eric Dumazet
2009-02-01 17:40                                               ` Evgeniy Polyakov
2009-02-01 20:31                                                 ` David Miller
     [not found]                       ` <20090130215008.GB12210-i6C2adt8DTjR7s880joybQ@public.gmane.org>
2009-02-01  5:58                         ` Stephen Hemminger
2009-02-01  9:07                           ` David Miller
2009-02-01 12:44                           ` Evgeniy Polyakov
     [not found]                     ` <498349F7.4050300-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
2009-02-01  5:29                       ` Stephen Hemminger
2009-01-30  6:50   ` Stephen Hemminger
