* RFC Hanging clean-up of a namespace
From: Hans Schillstrom @ 2012-01-19 11:07 UTC
To: netdev@vger.kernel.org, Eric W. Biederman

Hello,
Closing of a namespace (container) can be delayed by ~2 minutes
due to TCP timers, e.g. TCP time-wait (and of course other things too).

I think there should be some kind of "forced close" of the network stack,
e.g. in free_nsproxy().

A netns pre-close callback might do the job...

Any ideas how to solve this?

--
Regards
Hans Schillstrom <hans.schillstrom@ericsson.com>
* Re: RFC Hanging clean-up of a namespace
From: David Lamparter @ 2012-01-19 13:31 UTC
To: Hans Schillstrom
Cc: netdev@vger.kernel.org, Eric W. Biederman

On Thu, Jan 19, 2012 at 12:07:09PM +0100, Hans Schillstrom wrote:
> Closing of a namespace (container) can be delayed by ~ 2 minutes
> due to tcp timers ex tcp time wait (and of course other things too).
>
> I think there should be some kind of "forced close" of the Network stack
> in ex free_nsproxy()
>
> A netns pre-close callback might do the job....
>
> Any ideas how to solve this ?

I'm interested in this too; right off I don't see a workable alternative to
your suggestion. Each subsystem will have to implement that callback and
release its references on the netns.

-David
* Re: RFC Hanging clean-up of a namespace
From: David Miller @ 2012-01-19 17:40 UTC
To: hans.schillstrom
Cc: netdev, ebiederm

From: Hans Schillstrom <hans.schillstrom@ericsson.com>
Date: Thu, 19 Jan 2012 12:07:09 +0100

> Closing of a namespace (container) can be delayed by ~ 2 minutes
> due to tcp timers ex tcp time wait (and of course other things too).
>
> I think there should be some kind of "forced close" of the Network stack
> in ex free_nsproxy()

I think this is unwise.

Keeping the timewait sockets around is necessary to absorb any lingering
packets in the network meant for those sockets.

If you truncate this activity, and then try to create another socket with
the same ID you'll run into the very problems time-wait is meant to
solve.

It's an unfortunate delay, but one you will have to live with.
* Re: RFC Hanging clean-up of a namespace 2012-01-19 17:40 ` David Miller @ 2012-01-19 19:01 ` David Lamparter 2012-01-19 19:06 ` David Miller 2012-01-19 19:40 ` Hans Schillström 1 sibling, 1 reply; 28+ messages in thread From: David Lamparter @ 2012-01-19 19:01 UTC (permalink / raw) To: David Miller; +Cc: hans.schillstrom, netdev, ebiederm On Thu, Jan 19, 2012 at 12:40:02PM -0500, David Miller wrote: > From: Hans Schillstrom <hans.schillstrom@ericsson.com> > Date: Thu, 19 Jan 2012 12:07:09 +0100 > > > Closing of a namespace (container) can be delayed by ~ 2 minutes > > due to tcp timers ex tcp time wait (and of cource other things too). > > > > I think there should be some kind of "forced close" of the Network stack > > in ex free_nsproxy() > > I think this is unwise. > > Keeping the timewait sockets around is necessary to absorb any lingering > packets in the network meant for those sockets. > > If you truncate this activity, and then try to create another socket with > the same ID you'll run into the very problems time-wait is meant to > solve. A network namespace is for practical matters a separate host on the network. Killing the namespace therefore is akin to shutting down that host, which on a real metal host doesn't wait for timewait sockets either. Creating a socket with the same parameters would actually require installing a network environment similar to the closed namespace first; if an user really does that he can reasonably anticipate the same issues as arise from removing a host from the network and giving its address to another host. -David ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: RFC Hanging clean-up of a namespace 2012-01-19 19:01 ` David Lamparter @ 2012-01-19 19:06 ` David Miller 2012-01-19 19:25 ` David Lamparter 0 siblings, 1 reply; 28+ messages in thread From: David Miller @ 2012-01-19 19:06 UTC (permalink / raw) To: equinox; +Cc: hans.schillstrom, netdev, ebiederm From: David Lamparter <equinox@diac24.net> Date: Thu, 19 Jan 2012 20:01:14 +0100 > On Thu, Jan 19, 2012 at 12:40:02PM -0500, David Miller wrote: >> From: Hans Schillstrom <hans.schillstrom@ericsson.com> >> Date: Thu, 19 Jan 2012 12:07:09 +0100 >> >> > Closing of a namespace (container) can be delayed by ~ 2 minutes >> > due to tcp timers ex tcp time wait (and of cource other things too). >> > >> > I think there should be some kind of "forced close" of the Network stack >> > in ex free_nsproxy() >> >> I think this is unwise. >> >> Keeping the timewait sockets around is necessary to absorb any lingering >> packets in the network meant for those sockets. >> >> If you truncate this activity, and then try to create another socket with >> the same ID you'll run into the very problems time-wait is meant to >> solve. > > A network namespace is for practical matters a separate host on the > network. Killing the namespace therefore is akin to shutting down that > host, which on a real metal host doesn't wait for timewait sockets > either. > > Creating a socket with the same parameters would actually require > installing a network environment similar to the closed namespace first; > if an user really does that he can reasonably anticipate the same issues > as arise from removing a host from the network and giving its address to > another host. The assumption is that the address is moving, which might not be true. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: RFC Hanging clean-up of a namespace 2012-01-19 19:06 ` David Miller @ 2012-01-19 19:25 ` David Lamparter 2012-01-19 19:31 ` David Miller 0 siblings, 1 reply; 28+ messages in thread From: David Lamparter @ 2012-01-19 19:25 UTC (permalink / raw) To: David Miller; +Cc: equinox, hans.schillstrom, netdev, ebiederm On Thu, Jan 19, 2012 at 02:06:21PM -0500, David Miller wrote: > From: David Lamparter <equinox@diac24.net> > > On Thu, Jan 19, 2012 at 12:40:02PM -0500, David Miller wrote: > >> From: Hans Schillstrom <hans.schillstrom@ericsson.com> > >> > Closing of a namespace (container) can be delayed by ~ 2 minutes > >> > due to tcp timers ex tcp time wait (and of cource other things too). > >> > > >> > I think there should be some kind of "forced close" of the Network stack > >> > in ex free_nsproxy() > >> > >> I think this is unwise. > >> > >> Keeping the timewait sockets around is necessary to absorb any lingering > >> packets in the network meant for those sockets. > >> > >> If you truncate this activity, and then try to create another socket with > >> the same ID you'll run into the very problems time-wait is meant to > >> solve. > > > > A network namespace is for practical matters a separate host on the > > network. Killing the namespace therefore is akin to shutting down that > > host, which on a real metal host doesn't wait for timewait sockets > > either. > > > > Creating a socket with the same parameters would actually require > > installing a network environment similar to the closed namespace first; > > if an user really does that he can reasonably anticipate the same issues > > as arise from removing a host from the network and giving its address to > > another host. > > The assumption is that the address is moving, which might not be true. I don't understand what you mean, what address may not be moving? We're talking about dropping a netns. All of its addresses disappear, all of its soft devices disappear. 
Its hard devices fall back into the
init namespace, is that what you're referring to?

Or are you referring to the case where the network namespace is recreated
immediately after? That would be akin to a reboot, and again a physical
box wouldn't wait for timewait sockets...
* Re: RFC Hanging clean-up of a namespace 2012-01-19 19:25 ` David Lamparter @ 2012-01-19 19:31 ` David Miller 2012-01-19 19:53 ` David Lamparter 0 siblings, 1 reply; 28+ messages in thread From: David Miller @ 2012-01-19 19:31 UTC (permalink / raw) To: equinox; +Cc: hans.schillstrom, netdev, ebiederm From: David Lamparter <equinox@diac24.net> Date: Thu, 19 Jan 2012 20:25:41 +0100 > On Thu, Jan 19, 2012 at 02:06:21PM -0500, David Miller wrote: >> From: David Lamparter <equinox@diac24.net> >> > On Thu, Jan 19, 2012 at 12:40:02PM -0500, David Miller wrote: >> >> From: Hans Schillstrom <hans.schillstrom@ericsson.com> >> >> > Closing of a namespace (container) can be delayed by ~ 2 minutes >> >> > due to tcp timers ex tcp time wait (and of cource other things too). >> >> > >> >> > I think there should be some kind of "forced close" of the Network stack >> >> > in ex free_nsproxy() >> >> >> >> I think this is unwise. >> >> >> >> Keeping the timewait sockets around is necessary to absorb any lingering >> >> packets in the network meant for those sockets. >> >> >> >> If you truncate this activity, and then try to create another socket with >> >> the same ID you'll run into the very problems time-wait is meant to >> >> solve. >> > >> > A network namespace is for practical matters a separate host on the >> > network. Killing the namespace therefore is akin to shutting down that >> > host, which on a real metal host doesn't wait for timewait sockets >> > either. >> > >> > Creating a socket with the same parameters would actually require >> > installing a network environment similar to the closed namespace first; >> > if an user really does that he can reasonably anticipate the same issues >> > as arise from removing a host from the network and giving its address to >> > another host. >> >> The assumption is that the address is moving, which might not be true. > > I don't understand what you mean, what address may not be moving? > > We're talking about dropping a netns. 
All of its addresses disappear,
> all of its soft devices disappear. Its hard devices fall back into the
> init namespace, is that what you're referring to?

And then you immediately start up a new netns with the same address
and then resets go back to lingering TCP packets the time-waits would
have consumed.

The reason this is different from a host reboot is that a host reboot
takes some amount of time, which even if around 30 seconds is superior
in behavior to what can happen with netns which can be created almost
instantly.

I totally disagree with the idea to truncate time-wait under the
circumstances being suggested here in this thread.

Maybe what you want is to keep a small lingering mini-netns state around
so that the time-wait sockets can stay and do their job yet you can
still clear out the main netns object.

Then if a new netns is created that tries to reuse the address used by
the mini-netns which hasn't cleared yet, you give -EAGAIN until all
the timewaits expire.

That's much more acceptable to me than what is being proposed, which is
complete garbage.
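The lingering-state idea in this message can be sketched as a toy user-space model. This is not kernel code: the class `LingerRegistry`, its method names, the 60-second default, and the example addresses are all hypothetical and chosen only for illustration. The one detail taken from the message is the behaviour itself: a torn-down namespace leaves its addresses reserved, and a reuse attempt gets `-EAGAIN` until the time-wait period has passed.

```python
import errno


class LingerRegistry:
    """Toy model (hypothetical): addresses of torn-down namespaces stay
    reserved until their time-wait entries would have expired."""

    def __init__(self, timewait_len=60.0):
        self.timewait_len = timewait_len  # assumed time-wait length in seconds
        self._busy_until = {}             # address -> time at which it is free again

    def teardown(self, addresses, now):
        """Record a namespace exit; its addresses linger for timewait_len."""
        for addr in addresses:
            self._busy_until[addr] = now + self.timewait_len

    def claim(self, addr, now):
        """Return 0 if addr may be reused, -EAGAIN while it still lingers."""
        expiry = self._busy_until.get(addr)
        if expiry is not None and now < expiry:
            return -errno.EAGAIN
        self._busy_until.pop(addr, None)  # forget expired reservations
        return 0
```

With these assumptions, a namespace torn down at t=0 blocks reuse of its address at t=5 but not at t=60, while unrelated addresses are never blocked. This is also the behaviour the following reply objects to, since a restarted container would have to wait out the full period.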
* Re: RFC Hanging clean-up of a namespace 2012-01-19 19:31 ` David Miller @ 2012-01-19 19:53 ` David Lamparter 2012-01-19 20:27 ` David Miller 0 siblings, 1 reply; 28+ messages in thread From: David Lamparter @ 2012-01-19 19:53 UTC (permalink / raw) To: David Miller; +Cc: equinox, hans.schillstrom, netdev, ebiederm On Thu, Jan 19, 2012 at 02:31:05PM -0500, David Miller wrote: > >> >> Keeping the timewait sockets around is necessary to absorb any lingering > >> >> packets in the network meant for those sockets. [...] > >> The assumption is that the address is moving, which might not be true. > > > > I don't understand what you mean, what address may not be moving? > > > > We're talking about dropping a netns. All of its addresses disappear, > > all of its soft devices disappear. Its hard devices fall back into the > > init namespace, is that what you're referring to? > > And then you immediately start up a new netns with the same address > and then resets go back to lingering TCP packets the time-waits would > have consumed. > > The reason this is different from a host reboot is that a host reboot > takes some amount of time, which even if around 30 seconds is superior > in behavior to what can happen with netns which can be created almost > instantly. Arjan van de Ven booted Linux in 5 seconds in 2008, cf. http://lwn.net/Articles/299483/ On the TCP timewait scale of time, this is pretty much "immediate". [..] > Then if a new netns is created that tries to reuse the address used by > the mini-netns which hasn't cleared yet, you give -EAGAIN until all > the timewaits expire. The effect of this is that you end up being unable to reboot lxc based virtualised hosts without waiting 2 minutes for the TCP timers to expire. That sounds completely unacceptable to me. Another perspective of this is looking at the device references held by a namespace. I can without any issue and at any time move a network device into another namespace. 
This creates exactly the same situation where
the TCP stack that now has the device may reset old connections.

(Now you may argue that one should remove the network devices from a
netns before closing it. One could do that, yes, but that would require
having access to the netns before it actually goes down. That's
problematic if the system inside the netns shuts down uncleanly. Also,
the now-device-devoid network namespace will hold kernel resources for
no good reason.)

> That's much more acceptable to me than what is being proposed, which is
> complete garbage.

I respectfully disagree.
* Re: RFC Hanging clean-up of a namespace 2012-01-19 19:53 ` David Lamparter @ 2012-01-19 20:27 ` David Miller 2012-01-19 21:03 ` David Lamparter 2012-01-19 21:24 ` Eric W. Biederman 0 siblings, 2 replies; 28+ messages in thread From: David Miller @ 2012-01-19 20:27 UTC (permalink / raw) To: equinox; +Cc: hans.schillstrom, netdev, ebiederm From: David Lamparter <equinox@diac24.net> Date: Thu, 19 Jan 2012 20:53:49 +0100 > On Thu, Jan 19, 2012 at 02:31:05PM -0500, David Miller wrote: >> >> >> Keeping the timewait sockets around is necessary to absorb any lingering >> >> >> packets in the network meant for those sockets. > [...] >> >> The assumption is that the address is moving, which might not be true. >> > >> > I don't understand what you mean, what address may not be moving? >> > >> > We're talking about dropping a netns. All of its addresses disappear, >> > all of its soft devices disappear. Its hard devices fall back into the >> > init namespace, is that what you're referring to? >> >> And then you immediately start up a new netns with the same address >> and then resets go back to lingering TCP packets the time-waits would >> have consumed. >> >> The reason this is different from a host reboot is that a host reboot >> takes some amount of time, which even if around 30 seconds is superior >> in behavior to what can happen with netns which can be created almost >> instantly. > > Arjan van de Ven booted Linux in 5 seconds in 2008, > cf. http://lwn.net/Articles/299483/ > > On the TCP timewait scale of time, this is pretty much "immediate". > > [..] >> Then if a new netns is created that tries to reuse the address used by >> the mini-netns which hasn't cleared yet, you give -EAGAIN until all >> the timewaits expire. > > The effect of this is that you end up being unable to reboot lxc based > virtualised hosts without waiting 2 minutes for the TCP timers to > expire. That sounds completely unacceptable to me. 
All you are saying to me is that we are on a trajectory to major problems if it becomes pervasive that time-wait gets cancelled out and addresses then get reused so quickly.
* Re: RFC Hanging clean-up of a namespace
From: David Lamparter @ 2012-01-19 21:03 UTC
To: David Miller
Cc: equinox, hans.schillstrom, netdev, ebiederm

On Thu, Jan 19, 2012 at 03:27:52PM -0500, David Miller wrote:
> From: David Lamparter <equinox@diac24.net>
> > On Thu, Jan 19, 2012 at 02:31:05PM -0500, David Miller wrote:
> >> Then if a new netns is created that tries to reuse the address used by
> >> the mini-netns which hasn't cleared yet, you give -EAGAIN until all
> >> the timewaits expire.
> >
> > The effect of this is that you end up being unable to reboot lxc based
> > virtualised hosts without waiting 2 minutes for the TCP timers to
> > expire. That sounds completely unacceptable to me.
>
> All you are saying to me is that we are on a trajectory to major problems
> if it becomes pervasive that time-wait gets cancelled out and addresses
> then get reused so quickly.

On the funny side, RFC 793 page 28 actually states under
"Knowing When to Keep Quiet" that

  "To be sure that a TCP does not create a segment that carries a
  sequence number which may be duplicated by an old segment remaining in
  the network, the TCP must keep quiet for a maximum segment lifetime
  (MSL) before assigning any sequence numbers upon starting up or
  recovering from a crash in which memory of sequence numbers in use was
  lost. For this specification the MSL is taken to be 2 minutes."

Let's implement that, I bet people will love a 2 minute wait on TCP
connections after booting :-)

-David
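The quiet-time rule quoted above reduces to simple arithmetic. The sketch below assumes times in seconds measured from an arbitrary origin; the function names are hypothetical, and only the 2-minute MSL comes from the RFC 793 text quoted in the message.

```python
MSL_SECONDS = 120  # RFC 793: "the MSL is taken to be 2 minutes"


def quiet_time_remaining(boot_time, now, msl=MSL_SECONDS):
    """Seconds a freshly started TCP would still have to keep quiet."""
    return max(0.0, (boot_time + msl) - now)


def may_assign_sequence_numbers(boot_time, now, msl=MSL_SECONDS):
    """True once a full MSL has elapsed since startup or crash recovery."""
    return quiet_time_remaining(boot_time, now, msl) == 0.0
```

Under this rule a host that finishes booting 5 seconds after power-on, like the fast-boot case cited earlier in the thread, would still have 115 seconds of quiet time left.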
* Re: RFC Hanging clean-up of a namespace 2012-01-19 20:27 ` David Miller 2012-01-19 21:03 ` David Lamparter @ 2012-01-19 21:24 ` Eric W. Biederman 2012-01-19 21:40 ` David Lamparter 2012-01-19 21:40 ` Hagen Paul Pfeifer 1 sibling, 2 replies; 28+ messages in thread From: Eric W. Biederman @ 2012-01-19 21:24 UTC (permalink / raw) To: David Miller; +Cc: equinox, hans.schillstrom, netdev David Miller <davem@davemloft.net> writes: > From: David Lamparter <equinox@diac24.net> > Date: Thu, 19 Jan 2012 20:53:49 +0100 > >> On Thu, Jan 19, 2012 at 02:31:05PM -0500, David Miller wrote: >>> >> >> Keeping the timewait sockets around is necessary to absorb any lingering >>> >> >> packets in the network meant for those sockets. >> [...] >>> >> The assumption is that the address is moving, which might not be true. >>> > >>> > I don't understand what you mean, what address may not be moving? >>> > >>> > We're talking about dropping a netns. All of its addresses disappear, >>> > all of its soft devices disappear. Its hard devices fall back into the >>> > init namespace, is that what you're referring to? >>> >>> And then you immediately start up a new netns with the same address >>> and then resets go back to lingering TCP packets the time-waits would >>> have consumed. >>> >>> The reason this is different from a host reboot is that a host reboot >>> takes some amount of time, which even if around 30 seconds is superior >>> in behavior to what can happen with netns which can be created almost >>> instantly. >> >> Arjan van de Ven booted Linux in 5 seconds in 2008, >> cf. http://lwn.net/Articles/299483/ >> >> On the TCP timewait scale of time, this is pretty much "immediate". >> >> [..] >>> Then if a new netns is created that tries to reuse the address used by >>> the mini-netns which hasn't cleared yet, you give -EAGAIN until all >>> the timewaits expire. 
>>
>> The effect of this is that you end up being unable to reboot lxc based
>> virtualised hosts without waiting 2 minutes for the TCP timers to
>> expire. That sounds completely unacceptable to me.
>
> All you are saying to me is that we are on a trajectory to major problems
> if it becomes pervasive that time-wait gets cancelled out and addresses
> then get reused so quickly.

This thread is a fascinating disconnect from reality all of the way
around.

- inet_twsk_purge already implements throwing out of timewait sockets
  when a network namespace is being cleaned up. So the RFC is nonsense.

- Keeping the timewait sockets past the point where we purge them in the
  code can achieve nothing. We don't have any userspace processes or network
  devices associated with the timewait sockets at the point we get rid
  of them. The network namespace exists so long as a userspace process
  can find it. The network namespace exit is asynchronous in its own
  workqueue so userspace definitely is not blocked.

- I don't see anything obvious that we can do in the kernel that will
  make the situation better than it is today.

I'm not arguing that we should reuse addresses quickly. I see value in
the tcp_timewait mechanism. I'm just saying this thread seems to be
discussing some other network stack than the one that lives in the
linux kernel.

Eric
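The purge behaviour described in the first bullet can be illustrated with a small user-space model. It does not reproduce the signature or data structures of the kernel's inet_twsk_purge; `TimewaitEntry`, `purge_namespace`, the string namespace labels, and the flat list standing in for the socket table are all assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass
class TimewaitEntry:
    """Hypothetical stand-in for one time-wait socket."""
    netns: str     # label of the owning network namespace
    local: tuple   # (address, port)
    remote: tuple  # (address, port)


def purge_namespace(table, dying_netns):
    """Drop every time-wait entry owned by dying_netns.

    Modifies `table` in place and returns the number of entries dropped.
    """
    keep = [tw for tw in table if tw.netns != dying_netns]
    dropped = len(table) - len(keep)
    table[:] = keep
    return dropped
```

Entries belonging to other namespaces are left alone, which matches the point of the message: only the dying namespace's time-wait state is thrown away, and nothing is left for it to block on.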
* Re: RFC Hanging clean-up of a namespace
From: David Lamparter @ 2012-01-19 21:40 UTC
To: Eric W. Biederman
Cc: David Miller, equinox, hans.schillstrom, netdev

On Thu, Jan 19, 2012 at 01:24:13PM -0800, Eric W. Biederman wrote:
> This thread is a fascinating disconnect from reality all of the way
> around.
>
> - inet_twsk_purge already implements throwing out of timewait sockets
>   when a network namespace is being cleaned up. So the RFC is nonsense.

Indeed it does. Thanks for pointing that out, I should've tested that
first thing... I'm sorry, I'll try to do better next time.

Hans, I guess if you experienced delays on netns closing, they must have
a different source? Maybe there's a bug in another place that should be
cleaned up on namespace exit?

-David
* Re: RFC Hanging clean-up of a namespace
From: Hagen Paul Pfeifer @ 2012-01-19 21:40 UTC
To: Eric W. Biederman
Cc: David Miller, equinox, hans.schillstrom, netdev

* Eric W. Biederman | 2012-01-19 13:24:13 [-0800]:

>This thread is a fascinating disconnect from reality all of the way
>around.
>
>- inet_twsk_purge already implements throwing out of timewait sockets
>  when a network namespace is being cleaned up. So the RFC is nonsense.

This is how it is implemented, not how it should be. TIME_WAIT is not the
problem, it is there to keep the stack from sending wrong RST messages. Maybe
the 2*MSL could be fixed by a more accurate 2*RTT.

>- Keeping the timewait sockets at that point we purge them in the code
>  can achieve nothing. We don't have any userspace processes or network
>  devices associated with the timewait sockets at the point we get rid
>  of them. The network namespace exists so long as a userspace process
>  can find it. The network namespace exit is asynchronous in its own
>  workqueue so userspace definitely is not blocked.

Another possible solution could be a netns global TIME_WAIT list. But this
is a little bit expensive and out of the question, although it _is_ a
convenient solution.

The "5 second reboot" is no argument: it shows a discrepancy between TCP
requirements and the actual situation.

Hagen
* Re: RFC Hanging clean-up of a namespace
From: David Lamparter @ 2012-01-19 21:47 UTC
To: Hagen Paul Pfeifer
Cc: Eric W. Biederman, David Miller, equinox, hans.schillstrom, netdev

On Thu, Jan 19, 2012 at 10:40:53PM +0100, Hagen Paul Pfeifer wrote:
> * Eric W. Biederman | 2012-01-19 13:24:13 [-0800]:
>
> >This thread is a fascinating disconnect from reality all of the way
> >around.
> >
> >- inet_twsk_purge already implements throwing out of timewait sockets
> >  when a network namespaces is being cleaned up. So the RFC is nonsense.
>
> This is how it is implemented, not how it should be. TIME_WAIT is not the
> problem, it is there to keep the stack from sending wrong RST messages. Maybe
> the 2*MSL could be fixed by a more accurate 2*RTT.

I may have made that argument hidden behind a joke, but please refer to
the development of RFC 793's bootup "quiet time". The reason no one
sticks to this quiet time is that TCP timestamps have obsoleted it by
providing a better frame of reference. Refer to RFC 1323 Appendix B
"DUPLICATES FROM EARLIER CONNECTION INCARNATIONS".

-David
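The "better frame of reference" argument rests on comparing 32-bit timestamp values in a way that survives wraparound, which is the test RFC 1323 uses for PAWS (a segment whose timestamp is older than the most recent one seen is not acceptable). A minimal sketch of that comparison follows; the function names are hypothetical, and the sketch deliberately ignores everything else a real stack does, such as the handling of idle connections.

```python
def ts_before(a, b):
    """True if 32-bit timestamp a is older than b under modular arithmetic."""
    diff = (a - b) & 0xFFFFFFFF
    return diff >= 0x80000000  # difference is negative as a signed 32-bit value


def paws_accept(seg_tsval, ts_recent):
    """Accept a segment unless its timestamp is older than the last one seen."""
    return not ts_before(seg_tsval, ts_recent)
```

A stale duplicate carrying timestamp 5 is rejected once 10 has been seen, while a value of 3 that follows 0xFFFFFFF0 is accepted because it is newer after wraparound. The check only helps when both ends actually negotiate timestamps, which is the objection raised in the replies.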
* Re: RFC Hanging clean-up of a namespace 2012-01-19 21:47 ` David Lamparter @ 2012-01-19 22:10 ` Rick Jones 2012-01-19 22:16 ` Hagen Paul Pfeifer 2012-01-19 22:37 ` David Miller 2 siblings, 0 replies; 28+ messages in thread From: Rick Jones @ 2012-01-19 22:10 UTC (permalink / raw) To: David Lamparter Cc: Hagen Paul Pfeifer, Eric W. Biederman, David Miller, hans.schillstrom, netdev On 01/19/2012 01:47 PM, David Lamparter wrote: > On Thu, Jan 19, 2012 at 10:40:53PM +0100, Hagen Paul Pfeifer wrote: >> * Eric W. Biederman | 2012-01-19 13:24:13 [-0800]: >> >>> This thread is a fascinating disconnect from reality all of the way >>> around. >>> >>> - inet_twsk_purge already implements throwing out of timewait sockets >>> when a network namespaces is being cleaned up. So the RFC is nonsense. >> >> This is how it is implemented, not how it should be. TIME_WAIT is not the >> problem, it is there to keep the stack from sending wrong RST messages. Maybe >> the 2*MSL could be fixed by a more accurate 2*RTT. > > I may have made that argument hidden behind a joke, but please refer to > the development of RFC 793's bootup "quiet time". The reason no one > sticks to this quiet time is that TCP timestamps have obsoleted it by > providing a better frame of reference. Refer to RFC 1323 Appendix B > "DUPLICATES FROM EARLIER CONNECTION INCARNATIONS". May one actually, safely assume that timestamps are universally in place? rick jones suspects that no one stuck to the quiet time in the (more distant) timestamp-free past simply because booting took "long enough" anyway... ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: RFC Hanging clean-up of a namespace
From: Hagen Paul Pfeifer @ 2012-01-19 22:16 UTC
To: David Lamparter
Cc: Eric W. Biederman, David Miller, hans.schillstrom, netdev

* David Lamparter | 2012-01-19 22:47:18 [+0100]:

>I may have made that argument hidden behind a joke, but please refer to
>the development of RFC 793's bootup "quiet time". The reason no one
>sticks to this quiet time is that TCP timestamps have obsoleted it by
>providing a better frame of reference. Refer to RFC 1323 Appendix B
>"DUPLICATES FROM EARLIER CONNECTION INCARNATIONS".

Appendix B of RFC 1323 handles "Connection Incarnations"; TIME_WAIT guards
against wrong stack behavior and avoids sending a RST packet instead of a
final ACK/FIN for a retransmission. These are two independent mechanisms.

Hagen
* Re: RFC Hanging clean-up of a namespace 2012-01-19 21:47 ` David Lamparter 2012-01-19 22:10 ` Rick Jones 2012-01-19 22:16 ` Hagen Paul Pfeifer @ 2012-01-19 22:37 ` David Miller 2 siblings, 0 replies; 28+ messages in thread From: David Miller @ 2012-01-19 22:37 UTC (permalink / raw) To: equinox; +Cc: hagen, ebiederm, hans.schillstrom, netdev From: David Lamparter <equinox@diac24.net> Date: Thu, 19 Jan 2012 22:47:18 +0100 > On Thu, Jan 19, 2012 at 10:40:53PM +0100, Hagen Paul Pfeifer wrote: >> * Eric W. Biederman | 2012-01-19 13:24:13 [-0800]: >> >> >This thread is a fascinating disconnect from reality all of the way >> >around. >> > >> >- inet_twsk_purge already implements throwing out of timewait sockets >> > when a network namespaces is being cleaned up. So the RFC is nonsense. >> >> This is how it is implemented, not how it should be. TIME_WAIT is not the >> problem, it is there to keep the stack from sending wrong RST messages. Maybe >> the 2*MSL could be fixed by a more accurate 2*RTT. > > I may have made that argument hidden behind a joke, but please refer to > the development of RFC 793's bootup "quiet time". The reason no one > sticks to this quiet time is that TCP timestamps have obsoleted it by > providing a better frame of reference. Refer to RFC 1323 Appendix B > "DUPLICATES FROM EARLIER CONNECTION INCARNATIONS". I recommend you go and see what percentage of TCP connections actually have timestamps enabled, you might be surprised. I believe there are some good up to date figures in the google TCP papers. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: RFC Hanging clean-up of a namespace
From: Hans Schillstrom @ 2012-01-20 6:08 UTC
To: Hagen Paul Pfeifer
Cc: Eric W. Biederman, David Miller, equinox@diac24.net, netdev@vger.kernel.org

On Thursday 19 January 2012 22:40:53 Hagen Paul Pfeifer wrote:
> * Eric W. Biederman | 2012-01-19 13:24:13 [-0800]:
>
> >This thread is a fascinating disconnect from reality all of the way
> >around.
> >
> >- inet_twsk_purge already implements throwing out of timewait sockets
> >  when a network namespaces is being cleaned up. So the RFC is nonsense.
>
> This is how it is implemented, not how it should be. TIME_WAIT is not the
> problem, it is there to keep the stack from sending wrong RST messages. Maybe
> the 2*MSL could be fixed by a more accurate 2*RTT.
>

I was only referring to my printk's, i.e. the last sockets leaving the
namespace were from tcp_timer() with state 7, 2 minutes after
free_nsproxy() was called (and I assumed that was the time_wait).

> >- Keeping the timewait sockets at that point we purge them in the code
> >  can achieve nothing. We don't have any userspace processes or network
> >  devices associated with the timewait sockets at the point we get rid
> >  of them. The network namespace exists so long as a userspace process
> >  can find it. The network namespace exit is asynchronous in it's own
> >  workqueue so userspace definitely is not blocked.
>

One example of a real-life problem is when a container crashes while using
a VLAN from a physical interface, and you automatically reboot that
container. A new namespace is created with that VLAN again, and what happens?
That VLAN id is busy (waiting for tcp_timer) and the container start fails...
So you have to wait a couple of minutes :-(

> Another possible solution could be a netns global TIME_WAIT list. But this
> is a little bit expensive and out of question - but this _is_ a convenient
> solution.
>
> The "5 second reboot" is no argument - it show a discrepance between TCP
> requirements and the actual situation.
>

--
Regards
Hans Schillstrom <hans.schillstrom@ericsson.com>
* Re: RFC Hanging clean-up of a namespace
From: Eric W. Biederman @ 2012-01-20 10:08 UTC
To: Hans Schillstrom
Cc: Hagen Paul Pfeifer, David Miller, equinox@diac24.net, netdev@vger.kernel.org

Hans Schillstrom <hans.schillstrom@ericsson.com> writes:

> On Thursday 19 January 2012 22:40:53 Hagen Paul Pfeifer wrote:
>> * Eric W. Biederman | 2012-01-19 13:24:13 [-0800]:
>>
>> >This thread is a fascinating disconnect from reality all of the way
>> >around.
>> >
>> >- inet_twsk_purge already implements throwing out of timewait sockets
>> >  when a network namespace is being cleaned up. So the RFC is nonsense.
>>
>> This is how it is implemented, not how it should be. TIME_WAIT is not the
>> problem, it is there to keep the stack from sending wrong RST messages. Maybe
>> the 2*MSL could be fixed by a more accurate 2*RTT.
>>
>
> I was only referring to my printk's i.e. the last sockets leaving the namespace was
> from tcp_timer() with state 7, 2 minutes after free_nsproxy() was called.
> (and assumed that was the time_wait)

Which kernel are you running? I can't find a mention of a function
named tcp_timer() anywhere in the kernel since 2.6.16 when the kernel
was put into git.

There is a file named net/ipv4/tcp_timer.c

But if you are actually describing normal sockets and not timewait
sockets then it is remotely possible that something like what you are
talking about is happening. Normal sockets keep the network namespace
alive. So if something was keeping the sockets open. Like perhaps a
process that has one of your sockets from your network namespace open
then it could happen.

nsproxy is not the only place that references to the network namespace
are allowed to live that keep the network namespace alive.

>> >- Keeping the timewait sockets at that point we purge them in the code
>> >  can achieve nothing. We don't have any userspace processes or network
>> >  devices associated with the timewait sockets at the point we get rid
>> >  of them. The network namespace exists so long as a userspace process
>> >  can find it. The network namespace exit is asynchronous in its own
>> >  workqueue so userspace definitely is not blocked.
>> >
>
> One example of a real life problem is when a container crashes where a VLAN from
> a physical interface is used in the container, and you automatically reboot
> that container. A new namespace is created with that VLAN again and what happens?
> That VLAN id is busy (waiting for tcp_timer) and the container start fails ...
> So you have to wait a couple of minutes :-(

Yes the vlan is busy until the network namespace is cleaned up, and
we get as far as calling dellink on the network namespace.

There are a lot of reasons why a network namespace would not be cleaned
up immediately. Especially in older kernels.

One problem people running older kernels had troubles with was vsftp
created an empty network namespace for every connection. On kernels pre
2.6.34 I think before we had batching support for cleaning up network
devices and network namespaces the kernel could simply not keep up with
the rate that vsftp was creating and destroying network namespaces, and
would slowly fall farther and farther behind in its cleanup.

If you are running an older kernel it is quite possible that you are
missing some cleanups. It is also possible that you are hitting one of
the cases where we can only destroy 4 network devices a second and you
have lots of network devices dying with your network namespace.

Eric

* Re: RFC Hanging clean-up of a namespace
From: Hans Schillstrom @ 2012-01-20 11:51 UTC
To: Eric W. Biederman
Cc: Hagen Paul Pfeifer, David Miller, equinox@diac24.net, netdev@vger.kernel.org

On Friday 20 January 2012 11:08:37 Eric W. Biederman wrote:
> Hans Schillstrom <hans.schillstrom@ericsson.com> writes:
>
> > On Thursday 19 January 2012 22:40:53 Hagen Paul Pfeifer wrote:
> >> * Eric W. Biederman | 2012-01-19 13:24:13 [-0800]:
> >>
> >> >This thread is a fascinating disconnect from reality all of the way
> >> >around.
> >> >
> >> >- inet_twsk_purge already implements throwing out of timewait sockets
> >> >  when a network namespace is being cleaned up. So the RFC is nonsense.
> >>
> >> This is how it is implemented, not how it should be. TIME_WAIT is not the
> >> problem, it is there to keep the stack from sending wrong RST messages. Maybe
> >> the 2*MSL could be fixed by a more accurate 2*RTT.
> >>
> >
> > I was only referring to my printk's i.e. the last sockets leaving the namespace was
> > from tcp_timer() with state 7, 2 minutes after free_nsproxy() was called.
> > (and assumed that was the time_wait)
>
> Which kernel are you running?

3.2.0

> I can't find a mention of a function
> named tcp_timer() anywhere in the kernel since 2.6.16 when the kernel
> was put into git.

Sorry, it was tcp_write_timer() in tcp_timer.c

>
> There is a file named net/ipv4/tcp_timer.c
>
> But if you are actually describing normal sockets and not timewait
> sockets then it is remotely possible that something like what you are
> talking about is happening.

Hmm, state 7 is TCP_CLOSE.
I simply assumed that it was TCP_TIME_WAIT ...

> Normal sockets keep the network namespace
> alive. So if something was keeping the sockets open. Like perhaps a
> process that has one of your sockets from your network namespace open
> then it could happen.

We had a number of procs. with tcp connections open, and killed proc 1 (lxc-init)
i.e. all procs. in the ns got killed within a few ms.
(or at least no visible traces left)

> nsproxy is not the only place that references to the network namespace
> are allowed to live that keep the network namespace alive.
>
> >> >- Keeping the timewait sockets at that point we purge them in the code
> >> >  can achieve nothing. We don't have any userspace processes or network
> >> >  devices associated with the timewait sockets at the point we get rid
> >> >  of them. The network namespace exists so long as a userspace process
> >> >  can find it. The network namespace exit is asynchronous in its own
> >> >  workqueue so userspace definitely is not blocked.
> >> >
> >
> > One example of a real life problem is when a container crashes where a VLAN from
> > a physical interface is used in the container, and you automatically reboot
> > that container. A new namespace is created with that VLAN again and what happens?
> > That VLAN id is busy (waiting for tcp_timer) and the container start fails ...
> > So you have to wait a couple of minutes :-(
>
> Yes the vlan is busy until the network namespace is cleaned up, and
> we get as far as calling dellink on the network namespace.
>
> There are a lot of reasons why a network namespace would not be cleaned
> up immediately. Especially in older kernels.
>
> One problem people running older kernels had troubles with was vsftp
> created an empty network namespace for every connection. On kernels pre
> 2.6.34 I think before we had batching support for cleaning up network
> devices and network namespaces the kernel could simply not keep up with
> the rate that vsftp was creating and destroying network namespaces, and
> would slowly fall farther and farther behind in its cleanup.
>
> If you are running an older kernel it is quite possible that you are
> missing some cleanups. It is also possible that you are hitting one of
> the cases where we can only destroy 4 network devices a second and you
> have lots of network devices dying with your network namespace.
>

We started with 2.6.32 but the cleanup process didn't work; we always end up
with ref-counts on loopback

Thanks
Hans

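For readers decoding the printk output discussed above: the numeric socket states follow the enum in the kernel's include/net/tcp_states.h, where 6 is TCP_TIME_WAIT and 7 is TCP_CLOSE. A minimal lookup sketch follows; the table mirrors that enum, while the helper name tcp_state_name is made up for illustration.

```python
# Numeric TCP socket states as numbered in the kernel's
# include/net/tcp_states.h (values 1..11).
TCP_STATES = {
    1: "TCP_ESTABLISHED",
    2: "TCP_SYN_SENT",
    3: "TCP_SYN_RECV",
    4: "TCP_FIN_WAIT1",
    5: "TCP_FIN_WAIT2",
    6: "TCP_TIME_WAIT",
    7: "TCP_CLOSE",
    8: "TCP_CLOSE_WAIT",
    9: "TCP_LAST_ACK",
    10: "TCP_LISTEN",
    11: "TCP_CLOSING",
}


def tcp_state_name(state: int) -> str:
    """Return the symbolic name for a numeric state (hypothetical helper)."""
    return TCP_STATES.get(state, f"UNKNOWN({state})")


# The value seen in the printk's versus the one that was assumed.
print(tcp_state_name(7))
print(tcp_state_name(6))
```

So a socket reported with state 7 is not a timewait socket, which is why the inet_twsk_purge path does not apply to it.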
* Re: RFC Hanging clean-up of a namespace
From: Eric W. Biederman @ 2012-01-20 20:55 UTC
To: Hans Schillstrom
Cc: Hagen Paul Pfeifer, David Miller, equinox@diac24.net, netdev@vger.kernel.org

Hans Schillstrom <hans.schillstrom@ericsson.com> writes:

> On Friday 20 January 2012 11:08:37 Eric W. Biederman wrote:
>> Hans Schillstrom <hans.schillstrom@ericsson.com> writes:
>>
>> > On Thursday 19 January 2012 22:40:53 Hagen Paul Pfeifer wrote:
>> >> * Eric W. Biederman | 2012-01-19 13:24:13 [-0800]:
>> >>
>> >> >This thread is a fascinating disconnect from reality all of the way
>> >> >around.
>> >> >
>> >> >- inet_twsk_purge already implements throwing out of timewait sockets
>> >> >  when a network namespace is being cleaned up. So the RFC is nonsense.
>> >>
>> >> This is how it is implemented, not how it should be. TIME_WAIT is not the
>> >> problem, it is there to keep the stack from sending wrong RST messages. Maybe
>> >> the 2*MSL could be fixed by a more accurate 2*RTT.
>> >>
>> >
>> > I was only referring to my printk's i.e. the last sockets leaving the namespace was
>> > from tcp_timer() with state 7, 2 minutes after free_nsproxy() was called.
>> > (and assumed that was the time_wait)
>>
>> Which kernel are you running?
>
> 3.2.0
>
>> I can't find a mention of a function
>> named tcp_timer() anywhere in the kernel since 2.6.16 when the kernel
>> was put into git.
>
> Sorry, it was tcp_write_timer() in tcp_timer.c

Now that is different. It sounds like your socket is flushing pending
writes that haven't made it to the network and is having trouble. I
can't imagine an argument for not writing everything in the socket
buffers to the network if at all possible.

> We had a number of procs. with tcp connections open, and killed proc 1 (lxc-init)
> i.e. all procs. in the ns got killed within a few ms.
> (or at least no visible traces left)

My current hypothesis is that the namespace actually didn't get freed
until the tcp socket finished closing. You can check by looking at when
__put_net and then cleanup_net are called.

> We started with 2.6.32 but the cleanup process didn't work; we always end up
> with ref-counts on loopback

Yeah a bunch of references get transferred to the loopback device so any
reference count buglets in the ip stack show up that way.

I have been wondering if there is a good way to guarantee network
device ref counts are balanced. Because those problems are a royal pain
to track down.

Eric

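The hypothesis above rests on reference counting: the namespace is only torn down once the last holder drops its reference, and an open socket counts as a holder. A toy user-space model of that rule is sketched below. It is not kernel code; the class NetNamespace and its methods are invented for illustration, and only the ordering of events is meant to match the description in the thread.

```python
class NetNamespace:
    """Toy model of a refcounted namespace (illustrative, not kernel code)."""

    def __init__(self):
        self.refcount = 0
        self.cleaned_up = False

    def get(self):
        # A new holder (process, socket, ...) takes a reference.
        self.refcount += 1

    def put(self):
        # A holder drops its reference; the last drop triggers cleanup.
        self.refcount -= 1
        if self.refcount == 0:
            self.cleaned_up = True


ns = NetNamespace()
ns.get()   # reference held on behalf of the container's processes
ns.get()   # reference held by a TCP socket with unsent data

ns.put()   # all processes are killed
print("after processes exit:", ns.cleaned_up)

ns.put()   # the socket finally finishes closing
print("after socket closes:", ns.cleaned_up)
```

In this model, killing every process is not enough; cleanup waits for the socket, which is the delay being investigated.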
* Re: RFC Hanging clean-up of a namespace
From: Hans Schillstrom @ 2012-01-23 6:07 UTC
To: Eric W. Biederman
Cc: Hagen Paul Pfeifer, David Miller, equinox@diac24.net, netdev@vger.kernel.org

On Friday 20 January 2012 21:55:27 Eric W. Biederman wrote:
> Hans Schillstrom <hans.schillstrom@ericsson.com> writes:
>
> > On Friday 20 January 2012 11:08:37 Eric W. Biederman wrote:
> >> Hans Schillstrom <hans.schillstrom@ericsson.com> writes:
> >>
> >> > On Thursday 19 January 2012 22:40:53 Hagen Paul Pfeifer wrote:
> >> >> * Eric W. Biederman | 2012-01-19 13:24:13 [-0800]:
> >> >>
> >> >> >This thread is a fascinating disconnect from reality all of the way
> >> >> >around.
> >> >> >
> >> >> >- inet_twsk_purge already implements throwing out of timewait sockets
> >> >> >  when a network namespace is being cleaned up. So the RFC is nonsense.
> >> >>
> >> >> This is how it is implemented, not how it should be. TIME_WAIT is not the
> >> >> problem, it is there to keep the stack from sending wrong RST messages. Maybe
> >> >> the 2*MSL could be fixed by a more accurate 2*RTT.
> >> >>
> >> >
> >> > I was only referring to my printk's i.e. the last sockets leaving the namespace was
> >> > from tcp_timer() with state 7, 2 minutes after free_nsproxy() was called.
> >> > (and assumed that was the time_wait)
> >>
> >> Which kernel are you running?
> >
> > 3.2.0
> >
> >> I can't find a mention of a function
> >> named tcp_timer() anywhere in the kernel since 2.6.16 when the kernel
> >> was put into git.
> >
> > Sorry, it was tcp_write_timer() in tcp_timer.c
>
> Now that is different. It sounds like your socket is flushing pending
> writes that haven't made it to the network and is having trouble. I
> can't imagine an argument for not writing everything in the socket
> buffers to the network if at all possible.
>
> > We had a number of procs. with tcp connections open, and killed proc 1 (lxc-init)
> > i.e. all procs. in the ns got killed within a few ms.
> > (or at least no visible traces left)
>
> My current hypothesis is that the namespace actually didn't get freed
> until the tcp socket finished closing. You can check by looking at when
> __put_net and then cleanup_net are called.

__put_net() is called just after tcp_write_timer() fires and then
cleanup_net()

>
> > We started with 2.6.32 but the cleanup process didn't work; we always end up
> > with ref-counts on loopback
>
> Yeah a bunch of references get transferred to the loopback device so any
> reference count buglets in the ip stack show up that way.
>
> I have been wondering if there is a good way to guarantee network
> device ref counts are balanced. Because those problems are a royal pain
> to track down.
>

I made some traceback functions/macros for put_net/get_net with a linked list
so you can trace which ones are left at exit.
There was also a primitive "call stack" trace that more or less ends up in the
same point :-)

Maybe we should have some kind of DEBUG/TRACE options at compile time for it...

--
Regards
Hans Schillstrom <hans.schillstrom@ericsson.com>

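The tracing idea described above, recording each get and removing the record on the matching put so that leftovers identify the leak, can be sketched in user space. This is only an analogy to the get_net/put_net macros mentioned in the message, whose actual code is not shown in the thread; the class TrackedNet and its method names are hypothetical, and a dict stands in for the linked list.

```python
import traceback


class TrackedNet:
    """Toy tracker of outstanding references (illustrative only)."""

    def __init__(self):
        self._next_token = 0
        self._holders = {}  # token -> (label, captured call stack)

    def get(self, label):
        # Record who took the reference and from where.
        token = self._next_token
        self._next_token += 1
        self._holders[token] = (label, traceback.format_stack(limit=4))
        return token

    def put(self, token):
        # The matching release removes the record.
        del self._holders[token]

    def outstanding(self):
        # Whatever is left at exit points at the unbalanced get.
        return [label for label, _stack in self._holders.values()]


net = TrackedNet()
a = net.get("nsproxy")
b = net.get("tcp socket")
c = net.get("route cache entry")
net.put(a)
net.put(c)
print(net.outstanding())
```

Keeping the captured stack next to each label is what makes the "which ones are left at exit" question answerable without re-running the workload.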
* Re: RFC Hanging clean-up of a namespace
From: Eric W. Biederman @ 2012-01-23 6:25 UTC
To: Hans Schillstrom
Cc: Hagen Paul Pfeifer, David Miller, equinox@diac24.net, netdev@vger.kernel.org

Hans Schillstrom <hans.schillstrom@ericsson.com> writes:

> On Friday 20 January 2012 21:55:27 Eric W. Biederman wrote:
>> My current hypothesis is that the namespace actually didn't get freed
>> until the tcp socket finished closing. You can check by looking at when
>> __put_net and then cleanup_net are called.
>
> __put_net() is called just after tcp_write_timer() fires and then
> cleanup_net()

Hypothesis confirmed. Your speed problem is that it is taking 2 minutes
in the pathological case for your tcp socket to close.

Do you have any clue why it is taking your sockets so long to close?
Is the other side simply not responding?

Eric

* Re: RFC Hanging clean-up of a namespace
From: Hans Schillstrom @ 2012-01-23 6:58 UTC
To: Eric W. Biederman
Cc: Hagen Paul Pfeifer, David Miller, equinox@diac24.net, netdev@vger.kernel.org

On Monday 23 January 2012 07:25:52 Eric W. Biederman wrote:
> Hans Schillstrom <hans.schillstrom@ericsson.com> writes:
>
> > On Friday 20 January 2012 21:55:27 Eric W. Biederman wrote:
> >> My current hypothesis is that the namespace actually didn't get freed
> >> until the tcp socket finished closing. You can check by looking at when
> >> __put_net and then cleanup_net are called.
> >
> > __put_net() is called just after tcp_write_timer() fires and then
> > cleanup_net()
>
> Hypothesis confirmed. Your speed problem is that it is taking 2 minutes
> in the pathological case for your tcp socket to close.
>
> Do you have any clue why it is taking your sockets so long to close?
> Is the other side simply not responding?
>

The root cause of death is that the other side (init_net namespace) dies first
and when it dies all containers will be killed ...

/Hans

* Re: RFC Hanging clean-up of a namespace
From: Eric W. Biederman @ 2012-01-23 7:17 UTC
To: Hans Schillstrom
Cc: Hagen Paul Pfeifer, David Miller, equinox@diac24.net, netdev@vger.kernel.org

Hans Schillstrom <hans.schillstrom@ericsson.com> writes:

> On Monday 23 January 2012 07:25:52 Eric W. Biederman wrote:
>> Hans Schillstrom <hans.schillstrom@ericsson.com> writes:
>>
>> > On Friday 20 January 2012 21:55:27 Eric W. Biederman wrote:
>> >> My current hypothesis is that the namespace actually didn't get freed
>> >> until the tcp socket finished closing. You can check by looking at when
>> >> __put_net and then cleanup_net are called.
>> >
>> > __put_net() is called just after tcp_write_timer() fires and then
>> > cleanup_net()
>>
>> Hypothesis confirmed. Your speed problem is that it is taking 2 minutes
>> in the pathological case for your tcp socket to close.
>>
>> Do you have any clue why it is taking your sockets so long to close?
>> Is the other side simply not responding?
>>
>
> The root cause of death is that the other side (init_net namespace) dies first
> and when it dies all containers will be killed ...

?????

init_net can not die. init_net must not die. It makes no sense for the
ref count on init_net to drop to 0. Among other places every kernel
thread uses init_net. Furthermore the network namespaces are
independent so even the impossible death of the initial network
namespace should not kill a child network namespace.

Did I misunderstand what you said?

If you have a setup where you stop being able to talk to the outside
world because you were relaying through the initial network namespace
and the relay through the initial network namespace stopped functioning
that makes sense. So effectively all of your packets are being dropped
on the floor the tcp retransmit behavior on closing a socket makes
sense. I just don't get how you have triggered that state.

Eric

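The message above attributes the delay to TCP retransmit behavior on a closing socket whose packets are all being dropped. As a rough illustration of how retransmission timeouts that double on every attempt add up, here is a small arithmetic sketch. The initial timeout, retry count, and cap passed in below are illustrative parameters only; they are not taken from this thread and have not been checked against any particular kernel version.

```python
def total_retransmit_wait(initial_rto, retries, max_rto):
    """Sum the waits of `retries` attempts whose timeout doubles each time.

    All three parameters are caller-supplied assumptions, in seconds
    (except `retries`, a count); the timeout is capped at `max_rto`.
    """
    total = 0.0
    rto = initial_rto
    for _ in range(retries):
        total += rto
        rto = min(rto * 2, max_rto)
    return total


# Illustrative values only: a sub-second first timeout and a handful of
# retries already add up to a wait on the order of minutes.
print(total_retransmit_wait(initial_rto=0.2, retries=9, max_rto=120.0))
```

Because the series doubles, most of the total comes from the last one or two attempts, so a socket with unacknowledged data and an unreachable peer can hold its namespace reference far longer than the first timeout suggests.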
* Re: RFC Hanging clean-up of a namespace
From: Hans Schillstrom @ 2012-01-23 7:30 UTC
To: Eric W. Biederman
Cc: Hagen Paul Pfeifer, David Miller, equinox@diac24.net, netdev@vger.kernel.org

On Monday 23 January 2012 08:17:00 Eric W. Biederman wrote:
> Hans Schillstrom <hans.schillstrom@ericsson.com> writes:
>
> > On Monday 23 January 2012 07:25:52 Eric W. Biederman wrote:
> >> Hans Schillstrom <hans.schillstrom@ericsson.com> writes:
> >>
> >> > On Friday 20 January 2012 21:55:27 Eric W. Biederman wrote:
> >> >> My current hypothesis is that the namespace actually didn't get freed
> >> >> until the tcp socket finished closing. You can check by looking at when
> >> >> __put_net and then cleanup_net are called.
> >> >
> >> > __put_net() is called just after tcp_write_timer() fires and then
> >> > cleanup_net()
> >>
> >> Hypothesis confirmed. Your speed problem is that it is taking 2 minutes
> >> in the pathological case for your tcp socket to close.
> >>
> >> Do you have any clue why it is taking your sockets so long to close?
> >> Is the other side simply not responding?
> >>
> >
> > The root cause of death is that the other side (init_net namespace) dies first
> > and when it dies all containers will be killed ...
>
> ?????
>
> init_net can not die. init_net must not die. It makes no sense for the
> ref count on init_net to drop to 0. Among other places every kernel
> thread uses init_net. Furthermore the network namespaces are
> independent so even the impossible death of the initial network
> namespace should not kill a child network namespace.
>
> Did I misunderstand what you said?

Yes you did, sorry for my bad explanation.

>
> If you have a setup where you stop being able to talk to the outside
> world because you were relaying through the initial network namespace
> and the relay through the initial network namespace stopped functioning
> that makes sense. So effectively all of your packets are being dropped
> on the floor the tcp retransmit behavior on closing a socket makes
> sense. I just don't get how you have triggered that state.
>

We have a process in the root namespace (init_net) that has
tcp connections into a number of containers (in the same machine).

If the control process in root ns dies all containers will also be killed,
i.e. there can easily be outstanding messages to and from the containers.
So I guess that's why I see the tcp_write_timer()

--
Regards
Hans Schillstrom <hans.schillstrom@ericsson.com>

* Re: RFC Hanging clean-up of a namespace
From: Eric W. Biederman @ 2012-01-23 7:55 UTC
To: Hans Schillstrom
Cc: Hagen Paul Pfeifer, David Miller, equinox@diac24.net, netdev@vger.kernel.org

Hans Schillstrom <hans.schillstrom@ericsson.com> writes:

> We have a process in the root namespace (init_net) that has
> tcp connections into a number of containers (in the same machine).
>
> If the control process in root ns dies all containers will also be killed,
> i.e. there can easily be outstanding messages to and from the containers.
> So I guess that's why I see the tcp_write_timer()

The control process is an important piece of this but it should not be
sufficient to cause tcp sockets to take forever to close, as the kernel
manages sockets.

Do you shut down communication between your namespaces before they are
finished cleaning up?

The easy way to communicate between namespaces would be just to use a
veth pair and the veth pair will go away when the non init_net container
goes away. Using a veth pair should ensure you have communications
between your namespaces so your local sockets can shut down quickly.

It feels like right now that whatever your shutdown/restart process is
that you are shooting yourself in the foot and triggering those long
socket close times. I expect with a slightly different order of
operations you can avoid this long shutdown problem.

Eric

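The veth suggestion above can be sketched with the standard iproute2 commands for creating a veth pair and moving one end into a named network namespace. The helper below only builds the argument lists; running them needs root and an existing namespace, so execution is left to the caller. The function name and the interface and namespace names are hypothetical, chosen for illustration.

```python
def veth_setup_commands(host_if, peer_if, netns):
    """Build iproute2 argv lists for a veth pair (hypothetical helper).

    The first command creates the pair in the current namespace; the
    second moves the peer end into the named namespace `netns`.
    """
    return [
        ["ip", "link", "add", host_if, "type", "veth", "peer", "name", peer_if],
        ["ip", "link", "set", peer_if, "netns", netns],
    ]


# Example names are made up; pass each list to subprocess.run() as root
# to actually apply it.
for argv in veth_setup_commands("veth-host", "veth-ct", "container1"):
    print(" ".join(argv))
```

Since the peer end lives inside the container's namespace, the pair disappears together with that namespace, which is the property the message relies on.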
* RE: RFC Hanging clean-up of a namespace
From: Hans Schillström @ 2012-01-19 19:40 UTC
To: David Miller
Cc: netdev@vger.kernel.org, ebiederm@xmission.com

>Date: Thu, 19 Jan 2012 12:07:09 +0100
>
>> Closing of a namespace (container) can be delayed by ~ 2 minutes
>> due to tcp timers e.g. tcp time wait (and of course other things too).
>>
>> I think there should be some kind of "forced close" of the Network stack
>> in e.g. free_nsproxy()
>
>I think this is unwise.
>
>Keeping the timewait sockets around is necessary to absorb any lingering
>packets in the network meant for those sockets.
>
>If you truncate this activity, and then try to create another socket with
>the same ID you'll run into the very problems time-wait is meant to
>solve.
>
>It's an unfortunate delay, but one you will have to live with.

I think the whole clean-up of the namespace is too weak; there are too
many things that can go wrong, and they do.
I was thinking more of a way to make a forced close, e.g. by cleaning the
stack, where time wait was one point.

end of thread, other threads:[~2012-01-23 7:52 UTC | newest]

Thread overview: 28+ messages
2012-01-19 11:07 RFC Hanging clean-up of a namespace Hans Schillstrom
2012-01-19 13:31 ` David Lamparter
2012-01-19 17:40 ` David Miller
2012-01-19 19:01 ` David Lamparter
2012-01-19 19:06 ` David Miller
2012-01-19 19:25 ` David Lamparter
2012-01-19 19:31 ` David Miller
2012-01-19 19:53 ` David Lamparter
2012-01-19 20:27 ` David Miller
2012-01-19 21:03 ` David Lamparter
2012-01-19 21:24 ` Eric W. Biederman
2012-01-19 21:40 ` David Lamparter
2012-01-19 21:40 ` Hagen Paul Pfeifer
2012-01-19 21:47 ` David Lamparter
2012-01-19 22:10 ` Rick Jones
2012-01-19 22:16 ` Hagen Paul Pfeifer
2012-01-19 22:37 ` David Miller
2012-01-20 6:08 ` Hans Schillstrom
2012-01-20 10:08 ` Eric W. Biederman
2012-01-20 11:51 ` Hans Schillstrom
2012-01-20 20:55 ` Eric W. Biederman
2012-01-23 6:07 ` Hans Schillstrom
2012-01-23 6:25 ` Eric W. Biederman
2012-01-23 6:58 ` Hans Schillstrom
2012-01-23 7:17 ` Eric W. Biederman
2012-01-23 7:30 ` Hans Schillstrom
2012-01-23 7:55 ` Eric W. Biederman
2012-01-19 19:40 ` Hans Schillström