RFC Hanging clean-up of a namespace


* RFC Hanging clean-up of a namespace
@ 2012-01-19 11:07 Hans Schillstrom
  2012-01-19 13:31 ` David Lamparter
  2012-01-19 17:40 ` David Miller
  0 siblings, 2 replies; 28+ messages in thread
From: Hans Schillstrom @ 2012-01-19 11:07 UTC (permalink / raw)
  To: netdev@vger.kernel.org, Eric W. Biederman

Hello,

Closing of a namespace (container) can be delayed by ~ 2 minutes
due to TCP timers, e.g. TCP time-wait (and of course other things too).

I think there should be some kind of "forced close" of the network stack,
e.g. in free_nsproxy().

A netns pre-close callback might do the job.

Any ideas on how to solve this?

-- 
Regards
Hans Schillstrom <hans.schillstrom@ericsson.com>

* Re: RFC Hanging clean-up of a namespace
  2012-01-19 11:07 RFC Hanging clean-up of a namespace Hans Schillstrom
@ 2012-01-19 13:31 ` David Lamparter
  2012-01-19 17:40 ` David Miller
  1 sibling, 0 replies; 28+ messages in thread
From: David Lamparter @ 2012-01-19 13:31 UTC (permalink / raw)
  To: Hans Schillstrom; +Cc: netdev@vger.kernel.org, Eric W. Biederman

On Thu, Jan 19, 2012 at 12:07:09PM +0100, Hans Schillstrom wrote:
> Closing of a namespace (container) can be delayed by ~ 2 minutes
> due to TCP timers, e.g. TCP time-wait (and of course other things too).
>
> I think there should be some kind of "forced close" of the network stack,
> e.g. in free_nsproxy().
>
> A netns pre-close callback might do the job.
>
> Any ideas on how to solve this?

I'm interested in this too; right off I don't see a workable alternative
to your suggestion. Each subsystem will have to implement that callback
and release its references on the netns.


-David

* Re: RFC Hanging clean-up of a namespace
  2012-01-19 11:07 RFC Hanging clean-up of a namespace Hans Schillstrom
  2012-01-19 13:31 ` David Lamparter
@ 2012-01-19 17:40 ` David Miller
  2012-01-19 19:01   ` David Lamparter
  2012-01-19 19:40   ` Hans Schillström
  1 sibling, 2 replies; 28+ messages in thread
From: David Miller @ 2012-01-19 17:40 UTC (permalink / raw)
  To: hans.schillstrom; +Cc: netdev, ebiederm

From: Hans Schillstrom <hans.schillstrom@ericsson.com>
Date: Thu, 19 Jan 2012 12:07:09 +0100

> Closing of a namespace (container) can be delayed by ~ 2 minutes
> due to TCP timers, e.g. TCP time-wait (and of course other things too).
>
> I think there should be some kind of "forced close" of the network stack,
> e.g. in free_nsproxy().

I think this is unwise.

Keeping the timewait sockets around is necessary to absorb any lingering
packets in the network meant for those sockets.

If you truncate this activity, and then try to create another socket with
the same ID you'll run into the very problems time-wait is meant to
solve.

It's an unfortunate delay, but one you will have to live with.
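
The point about lingering packets can be shown with a toy model. The Python sketch below is not kernel code; the `Stack` class, its `receive` method and the example addresses are invented purely for illustration. It shows a stale segment being absorbed quietly while a time-wait entry exists, and drawing a reset once that entry has been discarded.

```python
from dataclasses import dataclass, field
from typing import Set, Tuple

# A connection is identified by (local addr, local port, remote addr, remote port).
FourTuple = Tuple[str, int, str, int]


@dataclass
class Stack:
    """Toy TCP endpoint that tracks nothing but its time-wait entries."""
    time_wait: Set[FourTuple] = field(default_factory=set)

    def receive(self, conn: FourTuple) -> str:
        # A segment for a connection still in time-wait is absorbed quietly.
        if conn in self.time_wait:
            return "absorbed"
        # With no state left for the connection, the stack answers with a reset.
        return "RST"


conn = ("10.0.0.2", 80, "10.0.0.9", 40000)
stack = Stack(time_wait={conn})
print(stack.receive(conn))   # time-wait entry still present
stack.time_wait.clear()      # entry discarded early
print(stack.receive(conn))   # the same stale segment now triggers a reset
```

The model deliberately ignores sequence numbers and timers; it only captures which of the two outcomes a late segment meets.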

* Re: RFC Hanging clean-up of a namespace
  2012-01-19 17:40 ` David Miller
@ 2012-01-19 19:01   ` David Lamparter
  2012-01-19 19:06     ` David Miller
  2012-01-19 19:40   ` Hans Schillström
  1 sibling, 1 reply; 28+ messages in thread
From: David Lamparter @ 2012-01-19 19:01 UTC (permalink / raw)
  To: David Miller; +Cc: hans.schillstrom, netdev, ebiederm

On Thu, Jan 19, 2012 at 12:40:02PM -0500, David Miller wrote:
> From: Hans Schillstrom <hans.schillstrom@ericsson.com>
> Date: Thu, 19 Jan 2012 12:07:09 +0100
> 
> > Closing of a namespace (container) can be delayed by ~ 2 minutes
> > due to TCP timers, e.g. TCP time-wait (and of course other things too).
> >
> > I think there should be some kind of "forced close" of the network stack,
> > e.g. in free_nsproxy().
> 
> I think this is unwise.
> 
> Keeping the timewait sockets around is necessary to absorb any lingering
> packets in the network meant for those sockets.
> 
> If you truncate this activity, and then try to create another socket with
> the same ID you'll run into the very problems time-wait is meant to
> solve.

A network namespace is, for practical purposes, a separate host on the
network. Killing the namespace is therefore akin to shutting down that
host, and a bare-metal host doesn't wait for timewait sockets when it
shuts down either.

Creating a socket with the same parameters would actually require
setting up a network environment similar to the closed namespace first;
if a user really does that, he can reasonably anticipate the same issues
as arise from removing a host from the network and giving its address to
another host.

-David

* Re: RFC Hanging clean-up of a namespace
  2012-01-19 19:01   ` David Lamparter
@ 2012-01-19 19:06     ` David Miller
  2012-01-19 19:25       ` David Lamparter
  0 siblings, 1 reply; 28+ messages in thread
From: David Miller @ 2012-01-19 19:06 UTC (permalink / raw)
  To: equinox; +Cc: hans.schillstrom, netdev, ebiederm

From: David Lamparter <equinox@diac24.net>
Date: Thu, 19 Jan 2012 20:01:14 +0100

> On Thu, Jan 19, 2012 at 12:40:02PM -0500, David Miller wrote:
>> From: Hans Schillstrom <hans.schillstrom@ericsson.com>
>> Date: Thu, 19 Jan 2012 12:07:09 +0100
>> 
>> > Closing of a namespace (container) can be delayed by ~ 2 minutes
>> > due to TCP timers, e.g. TCP time-wait (and of course other things too).
>> >
>> > I think there should be some kind of "forced close" of the network stack,
>> > e.g. in free_nsproxy().
>> 
>> I think this is unwise.
>> 
>> Keeping the timewait sockets around is necessary to absorb any lingering
>> packets in the network meant for those sockets.
>> 
>> If you truncate this activity, and then try to create another socket with
>> the same ID you'll run into the very problems time-wait is meant to
>> solve.
> 
> A network namespace is for practical matters a separate host on the
> network. Killing the namespace therefore is akin to shutting down that
> host, which on a real metal host doesn't wait for timewait sockets
> either.
> 
> Creating a socket with the same parameters would actually require
> installing a network environment similar to the closed namespace first;
> if a user really does that he can reasonably anticipate the same issues
> as arise from removing a host from the network and giving its address to
> another host.

The assumption is that the address is moving, which might not be true.

* Re: RFC Hanging clean-up of a namespace
  2012-01-19 19:06     ` David Miller
@ 2012-01-19 19:25       ` David Lamparter
  2012-01-19 19:31         ` David Miller
  0 siblings, 1 reply; 28+ messages in thread
From: David Lamparter @ 2012-01-19 19:25 UTC (permalink / raw)
  To: David Miller; +Cc: equinox, hans.schillstrom, netdev, ebiederm

On Thu, Jan 19, 2012 at 02:06:21PM -0500, David Miller wrote:
> From: David Lamparter <equinox@diac24.net>
> > On Thu, Jan 19, 2012 at 12:40:02PM -0500, David Miller wrote:
> >> From: Hans Schillstrom <hans.schillstrom@ericsson.com>
> >> > Closing of a namespace (container) can be delayed by ~ 2 minutes
> >> > due to TCP timers, e.g. TCP time-wait (and of course other things too).
> >> >
> >> > I think there should be some kind of "forced close" of the network stack,
> >> > e.g. in free_nsproxy().
> >> 
> >> I think this is unwise.
> >> 
> >> Keeping the timewait sockets around is necessary to absorb any lingering
> >> packets in the network meant for those sockets.
> >> 
> >> If you truncate this activity, and then try to create another socket with
> >> the same ID you'll run into the very problems time-wait is meant to
> >> solve.
> > 
> > A network namespace is for practical matters a separate host on the
> > network. Killing the namespace therefore is akin to shutting down that
> > host, which on a real metal host doesn't wait for timewait sockets
> > either.
> > 
> > Creating a socket with the same parameters would actually require
> > installing a network environment similar to the closed namespace first;
> > if a user really does that he can reasonably anticipate the same issues
> > as arise from removing a host from the network and giving its address to
> > another host.
> 
> The assumption is that the address is moving, which might not be true.

I don't understand what you mean. What address may not be moving?

We're talking about dropping a netns. All of its addresses disappear,
all of its soft devices disappear. Its hard devices fall back into the
init namespace. Is that what you're referring to?

Or are you referring to the case where the network namespace is
recreated immediately after? That would be akin to a reboot, and again a
physical box wouldn't wait for timewait sockets...

* Re: RFC Hanging clean-up of a namespace
  2012-01-19 19:25       ` David Lamparter
@ 2012-01-19 19:31         ` David Miller
  2012-01-19 19:53           ` David Lamparter
  0 siblings, 1 reply; 28+ messages in thread
From: David Miller @ 2012-01-19 19:31 UTC (permalink / raw)
  To: equinox; +Cc: hans.schillstrom, netdev, ebiederm

From: David Lamparter <equinox@diac24.net>
Date: Thu, 19 Jan 2012 20:25:41 +0100

> On Thu, Jan 19, 2012 at 02:06:21PM -0500, David Miller wrote:
>> From: David Lamparter <equinox@diac24.net>
>> > On Thu, Jan 19, 2012 at 12:40:02PM -0500, David Miller wrote:
>> >> From: Hans Schillstrom <hans.schillstrom@ericsson.com>
>> >> > Closing of a namespace (container) can be delayed by ~ 2 minutes
>> >> > due to TCP timers, e.g. TCP time-wait (and of course other things too).
>> >> >
>> >> > I think there should be some kind of "forced close" of the network stack,
>> >> > e.g. in free_nsproxy().
>> >> 
>> >> I think this is unwise.
>> >> 
>> >> Keeping the timewait sockets around is necessary to absorb any lingering
>> >> packets in the network meant for those sockets.
>> >> 
>> >> If you truncate this activity, and then try to create another socket with
>> >> the same ID you'll run into the very problems time-wait is meant to
>> >> solve.
>> > 
>> > A network namespace is for practical matters a separate host on the
>> > network. Killing the namespace therefore is akin to shutting down that
>> > host, which on a real metal host doesn't wait for timewait sockets
>> > either.
>> > 
>> > Creating a socket with the same parameters would actually require
>> > installing a network environment similar to the closed namespace first;
>> > if a user really does that he can reasonably anticipate the same issues
>> > as arise from removing a host from the network and giving its address to
>> > another host.
>> 
>> The assumption is that the address is moving, which might not be true.
> 
> I don't understand what you mean, what address may not be moving?
> 
> We're talking about dropping a netns. All of its addresses disappear,
> all of its soft devices disappear. Its hard devices fall back into the
> init namespace, is that what you're referring to?

And then you immediately start up a new netns with the same address,
and resets go out in response to lingering TCP packets that the
time-waits would have consumed.

The reason this is different from a host reboot is that a host reboot
takes some amount of time, which even if around 30 seconds is superior
in behavior to what can happen with netns which can be created almost
instantly.

I totally disagree with the idea to truncate time-wait under the
circumstances being suggested here in this thread.

Maybe what you want is to keep a small lingering mini-netns state
around so that the time-wait sockets can stay and do their job yet you
can still clear out the main netns object.

Then if a new netns is created that tries to reuse the address used by
the mini-netns which hasn't cleared yet, you give -EAGAIN until all
the timewaits expire.

That's much more acceptable to me than what is being proposed, which is
complete garbage.
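
The mini-netns idea above can be sketched roughly as follows. This is a user-space Python illustration only: the `LingeringAddresses` class and its `retire`/`claim` methods are hypothetical names, nothing like this exists in the kernel as far as this thread shows, and the 60-second default is merely a placeholder for however long the time-wait state would linger.

```python
import errno


class LingeringAddresses:
    """Hypothetical registry of addresses whose time-wait state has not expired."""

    def __init__(self):
        self._expiry = {}  # address -> time at which the last time-wait ends

    def retire(self, address, now, timewait_len=60.0):
        # Called when the namespace owning `address` is torn down.
        self._expiry[address] = now + timewait_len

    def claim(self, address, now):
        # Refuse reuse while time-wait state for the address may still exist.
        expiry = self._expiry.get(address)
        if expiry is not None and now < expiry:
            raise OSError(errno.EAGAIN, "address still has lingering time-wait state")
        self._expiry.pop(address, None)
        return address


reg = LingeringAddresses()
reg.retire("192.0.2.10", now=0.0)
try:
    reg.claim("192.0.2.10", now=5.0)      # too early: refused
except OSError as exc:
    print(exc.errno == errno.EAGAIN)
print(reg.claim("192.0.2.10", now=61.0))  # after expiry: allowed
```

Timestamps are passed in explicitly rather than read from a clock so that the behaviour is easy to follow and to test.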

* Re: RFC Hanging clean-up of a namespace
  2012-01-19 19:31         ` David Miller
@ 2012-01-19 19:53           ` David Lamparter
  2012-01-19 20:27             ` David Miller
  0 siblings, 1 reply; 28+ messages in thread
From: David Lamparter @ 2012-01-19 19:53 UTC (permalink / raw)
  To: David Miller; +Cc: equinox, hans.schillstrom, netdev, ebiederm

On Thu, Jan 19, 2012 at 02:31:05PM -0500, David Miller wrote:
> >> >> Keeping the timewait sockets around is necessary to absorb any lingering
> >> >> packets in the network meant for those sockets.
[...]
> >> The assumption is that the address is moving, which might not be true.
> > 
> > I don't understand what you mean, what address may not be moving?
> > 
> > We're talking about dropping a netns. All of its addresses disappear,
> > all of its soft devices disappear. Its hard devices fall back into the
> > init namespace, is that what you're referring to?
> 
> And then you immediately start up a new netns with the same address
> and then resets go back to lingering TCP packets the time-waits would
> have consumed.
> 
> The reason this is different from a host reboot is that a host reboot
> takes some amount of time, which even if around 30 seconds is superior
> in behavior to what can happen with netns which can be created almost
> instantly.

Arjan van de Ven booted Linux in 5 seconds in 2008,
cf. http://lwn.net/Articles/299483/

On the TCP timewait scale of time, this is pretty much "immediate".

[..]
> Then if a new netns is created that tries to reuse the address used by
> the mini-netns which hasn't cleared yet, you give -EAGAIN until all
> the timewaits expire.

The effect of this is that you end up being unable to reboot lxc based
virtualised hosts without waiting 2 minutes for the TCP timers to
expire. That sounds completely unacceptable to me.

Another way to look at this is through the device references held by a
namespace. I can, without any issue and at any time, move a network device
into another namespace. This creates exactly the same situation, where
the TCP stack that now has the device may reset old connections.

(Now you may argue that one should remove the network devices from a
netns before closing it. One could do that, yes, but that would require
having access to the netns before it actually goes down. That's
problematic if the system inside the netns shuts down uncleanly. Also,
the now-device-devoid network namespace will hold kernel resources for
no good reason.)

> That's much more acceptable to me than what is being proposed, which is
> complete garbage.

I respectfully disagree.

* Re: RFC Hanging clean-up of a namespace
  2012-01-19 19:53           ` David Lamparter
@ 2012-01-19 20:27             ` David Miller
  2012-01-19 21:03               ` David Lamparter
  2012-01-19 21:24               ` Eric W. Biederman
  0 siblings, 2 replies; 28+ messages in thread
From: David Miller @ 2012-01-19 20:27 UTC (permalink / raw)
  To: equinox; +Cc: hans.schillstrom, netdev, ebiederm

From: David Lamparter <equinox@diac24.net>
Date: Thu, 19 Jan 2012 20:53:49 +0100

> On Thu, Jan 19, 2012 at 02:31:05PM -0500, David Miller wrote:
>> >> >> Keeping the timewait sockets around is necessary to absorb any lingering
>> >> >> packets in the network meant for those sockets.
> [...]
>> >> The assumption is that the address is moving, which might not be true.
>> > 
>> > I don't understand what you mean, what address may not be moving?
>> > 
>> > We're talking about dropping a netns. All of its addresses disappear,
>> > all of its soft devices disappear. Its hard devices fall back into the
>> > init namespace, is that what you're referring to?
>> 
>> And then you immediately start up a new netns with the same address
>> and then resets go back to lingering TCP packets the time-waits would
>> have consumed.
>> 
>> The reason this is different from a host reboot is that a host reboot
>> takes some amount of time, which even if around 30 seconds is superior
>> in behavior to what can happen with netns which can be created almost
>> instantly.
> 
> Arjan van de Ven booted Linux in 5 seconds in 2008,
> cf. http://lwn.net/Articles/299483/
> 
> On the TCP timewait scale of time, this is pretty much "immediate".
> 
> [..]
>> Then if a new netns is created that tries to reuse the address used by
>> the mini-netns which hasn't cleared yet, you give -EAGAIN until all
>> the timewaits expire.
> 
> The effect of this is that you end up being unable to reboot lxc based
> virtualised hosts without waiting 2 minutes for the TCP timers to
> expire. That sounds completely unacceptable to me.

All you are saying to me is that we are on a trajectory to major problems
if it becomes pervasive that time-wait gets cancelled out and addresses
then get reused so quickly.

* Re: RFC Hanging clean-up of a namespace
  2012-01-19 20:27             ` David Miller
@ 2012-01-19 21:03               ` David Lamparter
  2012-01-19 21:24               ` Eric W. Biederman
  1 sibling, 0 replies; 28+ messages in thread
From: David Lamparter @ 2012-01-19 21:03 UTC (permalink / raw)
  To: David Miller; +Cc: equinox, hans.schillstrom, netdev, ebiederm

On Thu, Jan 19, 2012 at 03:27:52PM -0500, David Miller wrote:
> From: David Lamparter <equinox@diac24.net>
> > On Thu, Jan 19, 2012 at 02:31:05PM -0500, David Miller wrote:
> >> Then if a new netns is created that tries to reuse the address used by
> >> the mini-netns which hasn't cleared yet, you give -EAGAIN until all
> >> the timewaits expire.
> > 
> > The effect of this is that you end up being unable to reboot lxc based
> > virtualised hosts without waiting 2 minutes for the TCP timers to
> > expire. That sounds completely unacceptable to me.
> 
> All you are saying to me is that we are on a trajectory to major problems
> if it becomes pervasive that time-wait gets cancelled out and addresses
> then get reused so quickly.

On the funny side, RFC 793 page 28 actually states under "Knowing When
to Keep Quiet" that
  "To be sure that a TCP does not create a segment that carries a
  sequence number which may be duplicated by an old segment remaining in
  the network, the TCP must keep quiet for a maximum segment lifetime
  (MSL) before assigning any sequence numbers upon starting up or
  recovering from a crash in which memory of sequence numbers in use was
  lost. For this specification the MSL is taken to be 2 minutes."

Let's implement that, I bet people will love a 2 minute wait on TCP
connections after booting :-)


-David
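
The quiet-time rule quoted above comes down to simple arithmetic. A small Python sketch, assuming the 2-minute MSL from the RFC 793 text; the function name `quiet_time_remaining` is made up for this example:

```python
MSL_SECONDS = 120  # RFC 793 takes the MSL to be 2 minutes


def quiet_time_remaining(seconds_since_boot: float) -> float:
    """Seconds a host following the rule strictly would still have to stay silent."""
    return max(0.0, MSL_SECONDS - seconds_since_boot)


# A machine that boots in 5 seconds would still have to wait almost the full MSL.
print(quiet_time_remaining(5))
# A machine that took longer than the MSL to boot would not have to wait at all.
print(quiet_time_remaining(150))
```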

* Re: RFC Hanging clean-up of a namespace
  2012-01-19 20:27             ` David Miller
  2012-01-19 21:03               ` David Lamparter
@ 2012-01-19 21:24               ` Eric W. Biederman
  2012-01-19 21:40                 ` David Lamparter
  2012-01-19 21:40                 ` Hagen Paul Pfeifer
  1 sibling, 2 replies; 28+ messages in thread
From: Eric W. Biederman @ 2012-01-19 21:24 UTC (permalink / raw)
  To: David Miller; +Cc: equinox, hans.schillstrom, netdev

David Miller <davem@davemloft.net> writes:

> From: David Lamparter <equinox@diac24.net>
> Date: Thu, 19 Jan 2012 20:53:49 +0100
>
>> On Thu, Jan 19, 2012 at 02:31:05PM -0500, David Miller wrote:
>>> >> >> Keeping the timewait sockets around is necessary to absorb any lingering
>>> >> >> packets in the network meant for those sockets.
>> [...]
>>> >> The assumption is that the address is moving, which might not be true.
>>> > 
>>> > I don't understand what you mean, what address may not be moving?
>>> > 
>>> > We're talking about dropping a netns. All of its addresses disappear,
>>> > all of its soft devices disappear. Its hard devices fall back into the
>>> > init namespace, is that what you're referring to?
>>> 
>>> And then you immediately start up a new netns with the same address
>>> and then resets go back to lingering TCP packets the time-waits would
>>> have consumed.
>>> 
>>> The reason this is different from a host reboot is that a host reboot
>>> takes some amount of time, which even if around 30 seconds is superior
>>> in behavior to what can happen with netns which can be created almost
>>> instantly.
>> 
>> Arjan van de Ven booted Linux in 5 seconds in 2008,
>> cf. http://lwn.net/Articles/299483/
>> 
>> On the TCP timewait scale of time, this is pretty much "immediate".
>> 
>> [..]
>>> Then if a new netns is created that tries to reuse the address used by
>>> the mini-netns which hasn't cleared yet, you give -EAGAIN until all
>>> the timewaits expire.
>> 
>> The effect of this is that you end up being unable to reboot lxc based
>> virtualised hosts without waiting 2 minutes for the TCP timers to
>> expire. That sounds completely unacceptable to me.
>
> All you are saying to me is that we are on a trajectory to major problems
> if it becomes pervasive that time-wait gets cancelled out and addresses
> then get reused so quickly.

This thread is a fascinating disconnect from reality all of the way
around.

- inet_twsk_purge already implements throwing out of timewait sockets
  when a network namespace is being cleaned up.  So the RFC is nonsense.

- Keeping the timewait sockets beyond the point where the code purges
  them can achieve nothing.  We don't have any userspace processes or
  network devices associated with the timewait sockets at the point we
  get rid of them.  The network namespace exists as long as a userspace
  process can find it.  The network namespace exit is asynchronous, in
  its own workqueue, so userspace definitely is not blocked.

- I don't see anything obvious that we can do in the kernel that will
  make the situation better than it is today.

I'm not arguing that we should reuse addresses quickly.  I see value
in the tcp_timewait mechanism.  I'm just saying this thread seems
to be discussing some other network stack than the one that lives
in the linux kernel.

Eric
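
The purge described in the first point can be pictured with a very small model. The Python sketch below is only a simplified illustration of the idea of dropping every timewait entry that belongs to the namespace being cleaned up; it does not reproduce the kernel's inet_twsk_purge or its data structures, and `TimewaitSock` and `purge_timewait` are names invented here.

```python
from collections import namedtuple

# Minimal stand-in for a timewait socket: owning namespace plus the two endpoints.
TimewaitSock = namedtuple("TimewaitSock", "netns local remote")


def purge_timewait(table, dying_netns):
    """Return the table without the entries owned by the namespace being torn down."""
    return [tw for tw in table if tw.netns != dying_netns]


table = [
    TimewaitSock("ns1", ("10.0.0.2", 80), ("10.0.0.9", 40000)),
    TimewaitSock("ns2", ("10.0.1.2", 22), ("10.0.1.9", 40001)),
    TimewaitSock("ns1", ("10.0.0.2", 80), ("10.0.0.9", 40002)),
]
remaining = purge_timewait(table, "ns1")
print([tw.netns for tw in remaining])
```

Entries belonging to other namespaces are left untouched, which is the property that matters for the argument in the message.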

* Re: RFC Hanging clean-up of a namespace
  2012-01-19 21:24               ` Eric W. Biederman
@ 2012-01-19 21:40                 ` David Lamparter
  2012-01-19 21:40                 ` Hagen Paul Pfeifer
  1 sibling, 0 replies; 28+ messages in thread
From: David Lamparter @ 2012-01-19 21:40 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: David Miller, equinox, hans.schillstrom, netdev

On Thu, Jan 19, 2012 at 01:24:13PM -0800, Eric W. Biederman wrote:
> This thread is a fascinating disconnect from reality all of the way
> around.
> 
> - inet_twsk_purge already implements throwing out of timewait sockets
>   when a network namespace is being cleaned up.  So the RFC is nonsense.

Indeed it does. Thanks for pointing that out, I should've tested that
first thing... I'm sorry, I'll try better next time.

Hans, I guess if you experienced delays on netns closing, they must have
a different source? Maybe there's a bug in another place that should be
cleaned up on namespace exit?


-David

* Re: RFC Hanging clean-up of a namespace
  2012-01-19 21:24               ` Eric W. Biederman
  2012-01-19 21:40                 ` David Lamparter
@ 2012-01-19 21:40                 ` Hagen Paul Pfeifer
  2012-01-19 21:47                   ` David Lamparter
  2012-01-20  6:08                   ` Hans Schillstrom
  1 sibling, 2 replies; 28+ messages in thread
From: Hagen Paul Pfeifer @ 2012-01-19 21:40 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: David Miller, equinox, hans.schillstrom, netdev

* Eric W. Biederman | 2012-01-19 13:24:13 [-0800]:

>This thread is a fascinating disconnect from reality all of the way
>around.
>
>- inet_twsk_purge already implements throwing out of timewait sockets
>  when a network namespace is being cleaned up.  So the RFC is nonsense.

This is how it is implemented, not how it should be. TIME_WAIT is not the
problem, it is there to keep the stack from sending wrong RST messages. Maybe
the 2*MSL could be fixed by a more accurate 2*RTT.

>- Keeping the timewait sockets at that point we purge them in the code
>  can achieve nothing.  We don't have any userspace processes or network
>  devices associated with the timewait sockets at the point we get rid
>  of them.  The network namespace exists so long as a userspace process
>  can find it.  The network namespace exit is asynchronous in its own
>  workqueue so userspace definitely is not blocked.

Another possible solution could be a netns-global TIME_WAIT list. But this
is a little bit expensive and out of the question - but this _is_ a convenient
solution.

The "5 second reboot" is no argument - it shows a discrepancy between TCP
requirements and the actual situation.

Hagen

* Re: RFC Hanging clean-up of a namespace
  2012-01-19 21:40                 ` Hagen Paul Pfeifer
@ 2012-01-19 21:47                   ` David Lamparter
  2012-01-19 22:10                     ` Rick Jones
                                       ` (2 more replies)
  2012-01-20  6:08                   ` Hans Schillstrom
  1 sibling, 3 replies; 28+ messages in thread
From: David Lamparter @ 2012-01-19 21:47 UTC (permalink / raw)
  To: Hagen Paul Pfeifer
  Cc: Eric W. Biederman, David Miller, equinox, hans.schillstrom,
	netdev

On Thu, Jan 19, 2012 at 10:40:53PM +0100, Hagen Paul Pfeifer wrote:
> * Eric W. Biederman | 2012-01-19 13:24:13 [-0800]:
> 
> >This thread is a fascinating disconnect from reality all of the way
> >around.
> >
> >- inet_twsk_purge already implements throwing out of timewait sockets
> >  when a network namespace is being cleaned up.  So the RFC is nonsense.
> 
> This is how it is implemented, not how it should be. TIME_WAIT is not the
> problem, it is there to keep the stack from sending wrong RST messages. Maybe
> the 2*MSL could be fixed by a more accurate 2*RTT.

I may have made that argument hidden behind a joke, but please refer to
the development of RFC 793's bootup "quiet time". The reason no one
sticks to this quiet time is that TCP timestamps have obsoleted it by
providing a better frame of reference. Refer to RFC 1323 Appendix B
"DUPLICATES FROM EARLIER CONNECTION INCARNATIONS".


-David
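
The timestamp argument can be illustrated with the comparison that RFC 1323's PAWS test relies on: a segment whose timestamp value is older than the most recent one seen on the connection is treated as a duplicate. A minimal Python sketch, assuming 32-bit timestamps compared modulo 2^32; the function names are made up for this example and it does not describe how any particular stack implements the check.

```python
MASK32 = 0xFFFFFFFF


def ts_before(a: int, b: int) -> bool:
    """True if 32-bit timestamp a is older than b, allowing for wrap-around."""
    return ((a - b) & MASK32) > 0x7FFFFFFF


def paws_reject(seg_tsval: int, ts_recent: int) -> bool:
    # A segment stamped earlier than the latest timestamp seen is an old duplicate.
    return ts_before(seg_tsval, ts_recent)


print(paws_reject(100, 200))         # stale segment
print(paws_reject(200, 100))         # fresh segment
print(paws_reject(0xFFFFFFF0, 5))    # stale, timestamp clock has since wrapped
```

The modular comparison is what lets the check keep working after the timestamp counter wraps around.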

* Re: RFC Hanging clean-up of a namespace
  2012-01-19 21:47                   ` David Lamparter
@ 2012-01-19 22:10                     ` Rick Jones
  2012-01-19 22:16                     ` Hagen Paul Pfeifer
  2012-01-19 22:37                     ` David Miller
  2 siblings, 0 replies; 28+ messages in thread
From: Rick Jones @ 2012-01-19 22:10 UTC (permalink / raw)
  To: David Lamparter
  Cc: Hagen Paul Pfeifer, Eric W. Biederman, David Miller,
	hans.schillstrom, netdev

On 01/19/2012 01:47 PM, David Lamparter wrote:
> On Thu, Jan 19, 2012 at 10:40:53PM +0100, Hagen Paul Pfeifer wrote:
>> * Eric W. Biederman | 2012-01-19 13:24:13 [-0800]:
>>
>>> This thread is a fascinating disconnect from reality all of the way
>>> around.
>>>
>>> - inet_twsk_purge already implements throwing out of timewait sockets
>>>   when a network namespace is being cleaned up.  So the RFC is nonsense.
>>
>> This is how it is implemented, not how it should be. TIME_WAIT is not the
>> problem, it is there to keep the stack from sending wrong RST messages. Maybe
>> the 2*MSL could be fixed by a more accurate 2*RTT.
>
> I may have made that argument hidden behind a joke, but please refer to
> the development of RFC 793's bootup "quiet time". The reason no one
> sticks to this quiet time is that TCP timestamps have obsoleted it by
> providing a better frame of reference. Refer to RFC 1323 Appendix B
> "DUPLICATES FROM EARLIER CONNECTION INCARNATIONS".

May one actually, safely assume that timestamps are universally in place?

rick jones
suspects that no one stuck to the quiet time in the (more distant) 
timestamp-free past simply because booting took "long enough" anyway...

* Re: RFC Hanging clean-up of a namespace
  2012-01-19 21:47                   ` David Lamparter
  2012-01-19 22:10                     ` Rick Jones
@ 2012-01-19 22:16                     ` Hagen Paul Pfeifer
  2012-01-19 22:37                     ` David Miller
  2 siblings, 0 replies; 28+ messages in thread
From: Hagen Paul Pfeifer @ 2012-01-19 22:16 UTC (permalink / raw)
  To: David Lamparter; +Cc: Eric W. Biederman, David Miller, hans.schillstrom, netdev

* David Lamparter | 2012-01-19 22:47:18 [+0100]:

>I may have made that argument hidden behind a joke, but please refer to
>the development of RFC 793's bootup "quiet time". The reason no one
>sticks to this quiet time is that TCP timestamps have obsoleted it by
>providing a better frame of reference. Refer to RFC 1323 Appendix B
>"DUPLICATES FROM EARLIER CONNECTION INCARNATIONS".

Appendix B of RFC 1323 handles "Connection Incarnations"; TIME_WAIT guards
against incorrect stack behavior, avoiding a RST packet being sent instead of a
final ACK/FIN for a retransmission. These are two independent mechanisms.


Hagen

* Re: RFC Hanging clean-up of a namespace
  2012-01-19 21:47                   ` David Lamparter
  2012-01-19 22:10                     ` Rick Jones
  2012-01-19 22:16                     ` Hagen Paul Pfeifer
@ 2012-01-19 22:37                     ` David Miller
  2 siblings, 0 replies; 28+ messages in thread
From: David Miller @ 2012-01-19 22:37 UTC (permalink / raw)
  To: equinox; +Cc: hagen, ebiederm, hans.schillstrom, netdev

From: David Lamparter <equinox@diac24.net>
Date: Thu, 19 Jan 2012 22:47:18 +0100

> On Thu, Jan 19, 2012 at 10:40:53PM +0100, Hagen Paul Pfeifer wrote:
>> * Eric W. Biederman | 2012-01-19 13:24:13 [-0800]:
>> 
>> >This thread is a fascinating disconnect from reality all of the way
>> >around.
>> >
>> >- inet_twsk_purge already implements throwing out of timewait sockets
>> >  when a network namespace is being cleaned up.  So the RFC is nonsense.
>> 
>> This is how it is implemented, not how it should be. TIME_WAIT is not the
>> problem, it is there to keep the stack from sending wrong RST messages. Maybe
>> the 2*MSL could be fixed by a more accurate 2*RTT.
> 
> I may have made that argument hidden behind a joke, but please refer to
> the development of RFC 793's bootup "quiet time". The reason no one
> sticks to this quiet time is that TCP timestamps have obsoleted it by
> providing a better frame of reference. Refer to RFC 1323 Appendix B
> "DUPLICATES FROM EARLIER CONNECTION INCARNATIONS".

I recommend you go and see what percentage of TCP connections actually
have timestamps enabled; you might be surprised.  I believe there are
some good up-to-date figures in the Google TCP papers.

* Re: RFC Hanging clean-up of a namespace
  2012-01-19 21:40                 ` Hagen Paul Pfeifer
  2012-01-19 21:47                   ` David Lamparter
@ 2012-01-20  6:08                   ` Hans Schillstrom
  2012-01-20 10:08                     ` Eric W. Biederman
  1 sibling, 1 reply; 28+ messages in thread
From: Hans Schillstrom @ 2012-01-20  6:08 UTC (permalink / raw)
  To: Hagen Paul Pfeifer
  Cc: Eric W. Biederman, David Miller, equinox@diac24.net,
	netdev@vger.kernel.org

On Thursday 19 January 2012 22:40:53 Hagen Paul Pfeifer wrote:
> * Eric W. Biederman | 2012-01-19 13:24:13 [-0800]:
> 
> >This thread is a fascinating disconnect from reality all of the way
> >around.
> >
> >- inet_twsk_purge already implements throwing out of timewait sockets
> >  when a network namespace is being cleaned up.  So the RFC is nonsense.
> 
> This is how it is implemented, not how it should be. TIME_WAIT is not the
> problem, it is there to keep the stack from sending wrong RST messages. Maybe
> the 2*MSL could be fixed by a more accurate 2*RTT.
> 

I was only referring to my printk's, i.e. the last sockets leaving the namespace came
from tcp_timer() with state 7, 2 minutes after free_nsproxy() was called
(and I assumed that was the time_wait).

> >- Keeping the timewait sockets at that point we purge them in the code
> >  can achieve nothing.  We don't have any userspace processes or network
> >  devices associated with the timewait sockets at the point we get rid
> >  of them.  The network namespace exists so long as a userspace process
> >  can find it.  The network namespace exit is asynchronous in it's own
> >  workqueue so userspace definitely is not blocked.
> 

One example of a real-life problem is when a container crashes where a VLAN from
a physical interface is used in the container, and you automatically reboot
that container.  A new namespace is created with that VLAN again, and what happens?
That VLAN id is busy (waiting for tcp_timer) and the container start fails ...
So you have to wait a couple of minutes :-(

> Another possible solution could be a netns global TIME_WAIT list. But this
> is a little bit expensive and out of question - but this _is_ a convenient
> solution.
> 
> The "5 second reboot" is no argument - it show a discrepance between TCP
> requirements and the actual situation.
> 


-- 
Regards
Hans Schillstrom <hans.schillstrom@ericsson.com>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: RFC Hanging clean-up of a namespace
  2012-01-20  6:08                   ` Hans Schillstrom
@ 2012-01-20 10:08                     ` Eric W. Biederman
  2012-01-20 11:51                       ` Hans Schillstrom
  0 siblings, 1 reply; 28+ messages in thread
From: Eric W. Biederman @ 2012-01-20 10:08 UTC (permalink / raw)
  To: Hans Schillstrom
  Cc: Hagen Paul Pfeifer, David Miller, equinox@diac24.net,
	netdev@vger.kernel.org

Hans Schillstrom <hans.schillstrom@ericsson.com> writes:

> On Thursday 19 January 2012 22:40:53 Hagen Paul Pfeifer wrote:
>> * Eric W. Biederman | 2012-01-19 13:24:13 [-0800]:
>> 
>> >This thread is a fascinating disconnect from reality all of the way
>> >around.
>> >
>> >- inet_twsk_purge already implements throwing out of timewait sockets
>> >  when a network namespaces is being cleaned up.  So the RFC is nonsense.
>> 
>> This is how it is implemented, not how it should be. TIME_WAIT is not the
>> problem, it is there to keep the stack from sending wrong RST messages. Maybe
>> the 2*MSL could be fixed by a more accurate 2*RTT.
>> 
>
> I was only refering to my printk's i.e. the last sockets leaving the namespace was
> from tcp_timer() with state 7, 2 minutes after free_nsproxy() was called.
> (and assumed that was the time_wait)

Which kernel are you running?  I can't find a mention of a function
named tcp_timer() anywhere in the kernel since 2.6.16 when the kernel
was put into git.

There is a file named net/ipv4/tcp_timer.c

But if you are actually describing normal sockets and not timewait
sockets then it is remotely possible that something like what you are
talking about is happening.  Normal sockets keep the network namespace
alive.  So if something was keeping the sockets open, like perhaps a
process that has one of your sockets from your network namespace open,
then it could happen.

nsproxy is not the only place where references that keep the network
namespace alive are allowed to live.

>> >- Keeping the timewait sockets at that point we purge them in the code
>> >  can achieve nothing.  We don't have any userspace processes or network
>> >  devices associated with the timewait sockets at the point we get rid
>> >  of them.  The network namespace exists so long as a userspace process
>> >  can find it.  The network namespace exit is asynchronous in it's own
>> >  workqueue so userspace definitely is not blocked.
>> 
>
> One example of a real life problem is when a container crash where a VLAN from
> a physical interface is used in the container, and you automatically reboot
> that container.  A new namespace is created with that VLAN again and what happens ?
> That VLAN id is busy (waiting for tcp_timer) and the continer start fails ...
> So you have to wait a couple of minutes :-(

Yes, the vlan is busy until the network namespace is cleaned up and
we get as far as calling dellink on the network namespace.

There are a lot of reasons why a network namespace would not be cleaned
up immediately.  Especially in older kernels.

One problem people running older kernels had trouble with was that vsftp
created an empty network namespace for every connection.  On kernels pre
2.6.34, I think, before we had batching support for cleaning up network
devices and network namespaces, the kernel could simply not keep up with
the rate at which vsftp was creating and destroying network namespaces, and
would slowly fall farther and farther behind in its cleanup.

If you are running an older kernel it is quite possible that you are
missing some cleanups.  It is also possible that you are hitting one of
the cases where we can only destroy 4 network devices a second and you
have lots of network devices dying with your network namespace.

Eric

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: RFC Hanging clean-up of a namespace
  2012-01-20 10:08                     ` Eric W. Biederman
@ 2012-01-20 11:51                       ` Hans Schillstrom
  2012-01-20 20:55                         ` Eric W. Biederman
  0 siblings, 1 reply; 28+ messages in thread
From: Hans Schillstrom @ 2012-01-20 11:51 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Hagen Paul Pfeifer, David Miller, equinox@diac24.net,
	netdev@vger.kernel.org

On Friday 20 January 2012 11:08:37 Eric W. Biederman wrote:
> Hans Schillstrom <hans.schillstrom@ericsson.com> writes:
> 
> > On Thursday 19 January 2012 22:40:53 Hagen Paul Pfeifer wrote:
> >> * Eric W. Biederman | 2012-01-19 13:24:13 [-0800]:
> >> 
> >> >This thread is a fascinating disconnect from reality all of the way
> >> >around.
> >> >
> >> >- inet_twsk_purge already implements throwing out of timewait sockets
> >> >  when a network namespaces is being cleaned up.  So the RFC is nonsense.
> >> 
> >> This is how it is implemented, not how it should be. TIME_WAIT is not the
> >> problem, it is there to keep the stack from sending wrong RST messages. Maybe
> >> the 2*MSL could be fixed by a more accurate 2*RTT.
> >> 
> >
> > I was only refering to my printk's i.e. the last sockets leaving the namespace was
> > from tcp_timer() with state 7, 2 minutes after free_nsproxy() was called.
> > (and assumed that was the time_wait)
> 
> Which kernel are you running?  

 3.2.0 

> I can't find a mention of a function
> named tcp_timer() anywhere in the kernel since 2.6.16 when the kernel
> was put into git.

Sorry, it was tcp_write_timer() in tcp_timer.c

> 
> There is a file named net/ipv4/tcp_timer.c
> 
> But if you are actually describing normal sockets and not timewait
> sockets then it is remotely possible that something like what you are
> talking about is happening. 

Hmm, state 7 is TCP_CLOSE; I simply assumed that it was TCP_TIME_WAIT ...
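
For reference, a minimal sketch of the state numbering being discussed.
The table follows the kernel's include/net/tcp_states.h as of the 3.x
era; the helper names and the sample /proc/net/tcp row are illustrative
assumptions, not something taken from this thread.

```python
# TCP socket state numbers as defined in include/net/tcp_states.h.
TCP_STATES = {
    1: "TCP_ESTABLISHED",
    2: "TCP_SYN_SENT",
    3: "TCP_SYN_RECV",
    4: "TCP_FIN_WAIT1",
    5: "TCP_FIN_WAIT2",
    6: "TCP_TIME_WAIT",
    7: "TCP_CLOSE",
    8: "TCP_CLOSE_WAIT",
    9: "TCP_LAST_ACK",
    10: "TCP_LISTEN",
    11: "TCP_CLOSING",
}


def tcp_state_name(state):
    """Map a numeric sk_state value to its symbolic name."""
    return TCP_STATES.get(state, "UNKNOWN(%d)" % state)


def state_from_proc_line(line):
    """Return the numeric state from one data row of /proc/net/tcp.

    The fourth whitespace-separated column ("st") holds the state in hex.
    """
    return int(line.split()[3], 16)


# Illustrative row only (a listener on 127.0.0.1:22), not real output.
SAMPLE_ROW = ("   0: 0100007F:0016 00000000:0000 0A "
              "00000000:00000000 00:00000000 00000000     0        0 12345")

if __name__ == "__main__":
    print(tcp_state_name(7))
    print(tcp_state_name(state_from_proc_line(SAMPLE_ROW)))
```

A printk showing state 7 therefore points at a socket in TCP_CLOSE, not
at a timewait socket (which would be 6).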


> Normal sockets keep the network namespace
> alive.  So if something was keeping the sockets open.  Like perhaps a
> process that has one of your sockets from your network namespace open
> then it could happen.

We had a number of procs. with tcp connections open, and killed proc 1 (lxc-init),
i.e. all procs. in the ns got killed within a few ms
(or at least no visible traces were left).

> nsproxy is not the only place that references to the network namespace
> are allowed to live that keep the network namespace alive.
> 
> >> >- Keeping the timewait sockets at that point we purge them in the code
> >> >  can achieve nothing.  We don't have any userspace processes or network
> >> >  devices associated with the timewait sockets at the point we get rid
> >> >  of them.  The network namespace exists so long as a userspace process
> >> >  can find it.  The network namespace exit is asynchronous in it's own
> >> >  workqueue so userspace definitely is not blocked.
> >> 
> >
> > One example of a real life problem is when a container crash where a VLAN from
> > a physical interface is used in the container, and you automatically reboot
> > that container.  A new namespace is created with that VLAN again and what happens ?
> > That VLAN id is busy (waiting for tcp_timer) and the continer start fails ...
> > So you have to wait a couple of minutes :-(
> 
> Yes the vlan is busy until that the network namespace is cleaned up, and
> we get as far as calling dellink on the network namespace.
> 
> There are a lot of reasons why a network namespace would not be cleaned
> up immediately.  Especially in older kernels.
> 
> One problem people running older kernels had troubles with was vsftp
> created an empty network namespace for every connection.  On kernels pre
> 2.6.34 I think before we had batching support for cleaning up network
> devices and network namespaces the kernel could simply not keep up with
> the rate that vsftp was creating and destroying network namespaces, and
> would slowly fall farther and farther behind in it's cleanup.
> 
> If you are running an older kernel it is quite possible that you are
> missing some cleanups.  It is also possible that you are hitting one of
> the cases where we can only destroy 4 network devices a second and you
> have lots of network devices dying with your network namespace.
> 

We started with 2.6.32, but the cleanup process didn't work; we always ended up
with ref-counts on loopback.

Thanks
Hans

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: RFC Hanging clean-up of a namespace
  2012-01-20 11:51                       ` Hans Schillstrom
@ 2012-01-20 20:55                         ` Eric W. Biederman
  2012-01-23  6:07                           ` Hans Schillstrom
  0 siblings, 1 reply; 28+ messages in thread
From: Eric W. Biederman @ 2012-01-20 20:55 UTC (permalink / raw)
  To: Hans Schillstrom
  Cc: Hagen Paul Pfeifer, David Miller, equinox@diac24.net,
	netdev@vger.kernel.org

Hans Schillstrom <hans.schillstrom@ericsson.com> writes:

> On Friday 20 January 2012 11:08:37 Eric W. Biederman wrote:
>> Hans Schillstrom <hans.schillstrom@ericsson.com> writes:
>> 
>> > On Thursday 19 January 2012 22:40:53 Hagen Paul Pfeifer wrote:
>> >> * Eric W. Biederman | 2012-01-19 13:24:13 [-0800]:
>> >> 
>> >> >This thread is a fascinating disconnect from reality all of the way
>> >> >around.
>> >> >
>> >> >- inet_twsk_purge already implements throwing out of timewait sockets
>> >> >  when a network namespaces is being cleaned up.  So the RFC is nonsense.
>> >> 
>> >> This is how it is implemented, not how it should be. TIME_WAIT is not the
>> >> problem, it is there to keep the stack from sending wrong RST messages. Maybe
>> >> the 2*MSL could be fixed by a more accurate 2*RTT.
>> >> 
>> >
>> > I was only refering to my printk's i.e. the last sockets leaving the namespace was
>> > from tcp_timer() with state 7, 2 minutes after free_nsproxy() was called.
>> > (and assumed that was the time_wait)
>> 
>> Which kernel are you running?  
>
>  3.2.0 
>
>> I can't find a mention of a function
>> named tcp_timer() anywhere in the kernel since 2.6.16 when the kernel
>> was put into git.
>
> Sorry, it was tcp_write_timer() in tcp_timer.c

Now that is different.  It sounds like your socket is flushing pending
writes that haven't made it to the network and is having trouble.   I
can't imagine an argument for not writing everything in the socket
buffers to the network if at all possible.

> We had a number of procs. with tcp connections open, and kill proc 1 (lxc-init)
> i.e. all procs. in the ns got killed within a few ms. 
> (or at least no visible traces left)

My current hypothesis is that the namespace actually didn't get freed
until the tcp socket finished closing.  You can check by looking at when
__put_net and then cleanup_net are called.

> We started with 2.6.32 but the cleanup process didn't work we always end up
> with ref-counts on loopback

Yeah, a bunch of references get transferred to the loopback device, so any
reference count buglets in the ip stack show up that way.

I have been wondering if there is a good way to guarantee network
device ref counts are balanced, because those problems are a royal pain
to track down.

Eric

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: RFC Hanging clean-up of a namespace
  2012-01-20 20:55                         ` Eric W. Biederman
@ 2012-01-23  6:07                           ` Hans Schillstrom
  2012-01-23  6:25                             ` Eric W. Biederman
  0 siblings, 1 reply; 28+ messages in thread
From: Hans Schillstrom @ 2012-01-23  6:07 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Hagen Paul Pfeifer, David Miller, equinox@diac24.net,
	netdev@vger.kernel.org

On Friday 20 January 2012 21:55:27 Eric W. Biederman wrote:
> Hans Schillstrom <hans.schillstrom@ericsson.com> writes:
> 
> > On Friday 20 January 2012 11:08:37 Eric W. Biederman wrote:
> >> Hans Schillstrom <hans.schillstrom@ericsson.com> writes:
> >> 
> >> > On Thursday 19 January 2012 22:40:53 Hagen Paul Pfeifer wrote:
> >> >> * Eric W. Biederman | 2012-01-19 13:24:13 [-0800]:
> >> >> 
> >> >> >This thread is a fascinating disconnect from reality all of the way
> >> >> >around.
> >> >> >
> >> >> >- inet_twsk_purge already implements throwing out of timewait sockets
> >> >> >  when a network namespaces is being cleaned up.  So the RFC is nonsense.
> >> >> 
> >> >> This is how it is implemented, not how it should be. TIME_WAIT is not the
> >> >> problem, it is there to keep the stack from sending wrong RST messages. Maybe
> >> >> the 2*MSL could be fixed by a more accurate 2*RTT.
> >> >> 
> >> >
> >> > I was only refering to my printk's i.e. the last sockets leaving the namespace was
> >> > from tcp_timer() with state 7, 2 minutes after free_nsproxy() was called.
> >> > (and assumed that was the time_wait)
> >> 
> >> Which kernel are you running?  
> >
> >  3.2.0 
> >
> >> I can't find a mention of a function
> >> named tcp_timer() anywhere in the kernel since 2.6.16 when the kernel
> >> was put into git.
> >
> > Sorry, it was tcp_write_timer() in tcp_timer.c
> 
> Now that is different.  It sounds like your socket is flushing pending
> writes that haven't made it to the network and is having trouble.   I
> can't imagine an argument for not writing everything in the socket
> buffers to the network if at all possible.
> 
> > We had a number of procs. with tcp connections open, and kill proc 1 (lxc-init)
> > i.e. all procs. in the ns got killed within a few ms. 
> > (or at least no visible traces left)
> 
> My current hypothesis is that the namespace actually didn't get freed
> until the tcp socket finished closing.  You can check by looking at when
> __put_net and then cleanup_net are called.

__put_net() is called just after tcp_write_timer() fires, followed by cleanup_net().

> 
> > We started with 2.6.32 but the cleanup process didn't work we always end up
> > with ref-counts on loopback
> 
> Yeah a bunch of references get transfered to the loopback device so any
> reference count buglets in the ip stack show up that way.
> 
> I have been wondering if there is a good way to guarantee network
> device ref counts are balanced.  Because those problems are a royal pain
> to track down.
> 

I made some traceback functions/macros for put_net/get_net with a linked list so you can
trace which ones are left at exit.
There was also a primitive "call stack" trace that more or less ends up at the same point :-)

Maybe we should have some kind of DEBUG/TRACE options at compile time for it...
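
The idea of tracking get/put pairs so that leftover holders can be
listed at exit can be sketched in user space as follows.  This is a toy
model only, not the kernel macros mentioned above; the class name
RefTracker and the tags are hypothetical.

```python
import traceback


class RefTracker:
    """Toy model of tracked get/put pairs (not kernel code)."""

    def __init__(self):
        self._next_token = 0
        self._holders = {}  # token -> (tag, call stack at get() time)

    def get(self, tag):
        """Take a reference and remember who took it and from where."""
        token = self._next_token
        self._next_token += 1
        self._holders[token] = (tag, traceback.format_stack(limit=4))
        return token

    def put(self, token):
        """Drop a reference; an unbalanced put raises KeyError."""
        del self._holders[token]

    def outstanding(self):
        """Return the sorted tags of references that were never put."""
        return sorted(tag for tag, _stack in self._holders.values())


if __name__ == "__main__":
    tracker = RefTracker()
    sock_ref = tracker.get("tcp_sock")
    proxy_ref = tracker.get("nsproxy")
    tracker.put(proxy_ref)
    # Whatever is still listed here is what keeps the object alive.
    print(tracker.outstanding())
```

Storing the call stack with each holder is what makes a leftover
reference traceable to the code path that took it.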
-- 
Regards
Hans Schillstrom <hans.schillstrom@ericsson.com>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: RFC Hanging clean-up of a namespace
  2012-01-23  6:07                           ` Hans Schillstrom
@ 2012-01-23  6:25                             ` Eric W. Biederman
  2012-01-23  6:58                               ` Hans Schillstrom
  0 siblings, 1 reply; 28+ messages in thread
From: Eric W. Biederman @ 2012-01-23  6:25 UTC (permalink / raw)
  To: Hans Schillstrom
  Cc: Hagen Paul Pfeifer, David Miller, equinox@diac24.net,
	netdev@vger.kernel.org

Hans Schillstrom <hans.schillstrom@ericsson.com> writes:

> On Friday 20 January 2012 21:55:27 Eric W. Biederman wrote:
>> My current hypothesis is that the namespace actually didn't get freed
>> until the tcp socket finished closing.  You can check by looking at when
>> __put_net and then cleanup_net are called.
>
> __put_net() is called just after tcp_write_timer() fires and then
> cleanup_net()

Hypothesis confirmed.  Your speed problem is that it is taking 2 minutes
in the pathological case for your tcp socket to close.

Do you have any clue why it is taking your sockets so long to close?
Is the other side simply not responding?

Eric

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: RFC Hanging clean-up of a namespace
  2012-01-23  6:25                             ` Eric W. Biederman
@ 2012-01-23  6:58                               ` Hans Schillstrom
  2012-01-23  7:17                                 ` Eric W. Biederman
  0 siblings, 1 reply; 28+ messages in thread
From: Hans Schillstrom @ 2012-01-23  6:58 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Hagen Paul Pfeifer, David Miller, equinox@diac24.net,
	netdev@vger.kernel.org

On Monday 23 January 2012 07:25:52 Eric W. Biederman wrote:
> Hans Schillstrom <hans.schillstrom@ericsson.com> writes:
> 
> > On Friday 20 January 2012 21:55:27 Eric W. Biederman wrote:
> >> My current hypothesis is that the namespace actually didn't get freed
> >> until the tcp socket finished closing.  You can check by looking at when
> >> __put_net and then cleanup_net are called.
> >
> > __put_net() is called just after tcp_write_timer() fires and then
> > cleanup_net()
> 
> Hypothesis confirmed.  Your speed problem is that it is taking 2 minutes
> in the pathological case for your tcp socket to close.
> 
> Do you have any clue why it is taking your sockets so long to close?
> Is the other side simply not responding?
> 

The root cause of death is that the other side (init_net namespace) dies first,
and when it dies all containers will be killed ...

/Hans

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: RFC Hanging clean-up of a namespace
  2012-01-23  6:58                               ` Hans Schillstrom
@ 2012-01-23  7:17                                 ` Eric W. Biederman
  2012-01-23  7:30                                   ` Hans Schillstrom
  0 siblings, 1 reply; 28+ messages in thread
From: Eric W. Biederman @ 2012-01-23  7:17 UTC (permalink / raw)
  To: Hans Schillstrom
  Cc: Hagen Paul Pfeifer, David Miller, equinox@diac24.net,
	netdev@vger.kernel.org

Hans Schillstrom <hans.schillstrom@ericsson.com> writes:

> On Monday 23 January 2012 07:25:52 Eric W. Biederman wrote:
>> Hans Schillstrom <hans.schillstrom@ericsson.com> writes:
>> 
>> > On Friday 20 January 2012 21:55:27 Eric W. Biederman wrote:
>> >> My current hypothesis is that the namespace actually didn't get freed
>> >> until the tcp socket finished closing.  You can check by looking at when
>> >> __put_net and then cleanup_net are called.
>> >
>> > __put_net() is called just after tcp_write_timer() fires and then
>> > cleanup_net()
>> 
>> Hypothesis confirmed.  Your speed problem is that it is taking 2 minutes
>> in the pathological case for your tcp socket to close.
>> 
>> Do you have any clue why it is taking your sockets so long to close?
>> Is the other side simply not responding?
>> 
>
> The root cause of death is that the other side (init_net namespace) dies first 
> and when it dies all containers will be killed ...

?????

init_net can not die.  init_net must not die.  It makes no sense for the
ref count on init_net to drop to 0.  Among other places every kernel
thread uses init_net.  Furthermore the network namespaces are
independent so even the impossible death of the initial network
namespace should not kill a child network namespace.

Did I misunderstand what you said?

If you have a setup where you stop being able to talk to the outside
world because you were relaying through the initial network namespace
and the relay through the initial network namespace stopped functioning,
that makes sense.  If effectively all of your packets are being dropped
on the floor, the tcp retransmit behavior on closing a socket makes
sense.  I just don't get how you have triggered that state.
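
As a rough illustration of why retransmits into a black hole can add up
to minutes, here is a small sketch of exponential backoff arithmetic.
The 200 ms initial RTO, the 120 s cap and the retry count of 8 are
assumptions chosen for the example, not values measured in this thread,
and the function name is hypothetical.

```python
def total_retransmit_time(initial_rto, retries, rto_max=120.0):
    """Seconds spent waiting before giving up on unacknowledged data.

    Models one timeout after the original transmission plus one after
    each retransmission, doubling the timeout each time up to rto_max.
    """
    total = 0.0
    rto = initial_rto
    for _ in range(retries + 1):
        total += min(rto, rto_max)
        rto *= 2
    return total


if __name__ == "__main__":
    # Assumed values: 200 ms initial RTO, 8 retransmissions.
    print("%.1f seconds" % total_retransmit_time(0.2, 8))
```

With those assumed numbers the total comes to a bit over 100 seconds,
which is the same order of magnitude as the delay reported here.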

Eric

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: RFC Hanging clean-up of a namespace
  2012-01-23  7:17                                 ` Eric W. Biederman
@ 2012-01-23  7:30                                   ` Hans Schillstrom
  2012-01-23  7:55                                     ` Eric W. Biederman
  0 siblings, 1 reply; 28+ messages in thread
From: Hans Schillstrom @ 2012-01-23  7:30 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Hagen Paul Pfeifer, David Miller, equinox@diac24.net,
	netdev@vger.kernel.org

On Monday 23 January 2012 08:17:00 Eric W. Biederman wrote:
> Hans Schillstrom <hans.schillstrom@ericsson.com> writes:
> 
> > On Monday 23 January 2012 07:25:52 Eric W. Biederman wrote:
> >> Hans Schillstrom <hans.schillstrom@ericsson.com> writes:
> >> 
> >> > On Friday 20 January 2012 21:55:27 Eric W. Biederman wrote:
> >> >> My current hypothesis is that the namespace actually didn't get freed
> >> >> until the tcp socket finished closing.  You can check by looking at when
> >> >> __put_net and then cleanup_net are called.
> >> >
> >> > __put_net() is called just after tcp_write_timer() fires and then
> >> > cleanup_net()
> >> 
> >> Hypothesis confirmed.  Your speed problem is that it is taking 2 minutes
> >> in the pathological case for your tcp socket to close.
> >> 
> >> Do you have any clue why it is taking your sockets so long to close?
> >> Is the other side simply not responding?
> >> 
> >
> > The root cause of death is that the other side (init_net namespace) dies first 
> > and when it dies all containers will be killed ...
> 
> ?????
> 
> init_net can not die.  init_net must not die.  It makes no sense for the
> ref count on init_net to drop to 0.  Among other places every kernel
> thread uses init_net.  Furthermore the network namespaces are
> independent so even the impossible death of the initial network
> namespace should not kill a child network namespace.
> 
> Did I misunderstand what you said?

Yes you did, sorry for my bad explanation.

> 
> If you have a setup where you stop being able to talk to the outside
> world because you were relaying through the initial network namespace
> and the relay through the initial network namespace stopped functioning
> that makes sense.  So effectively all of your packets are being dropped
> on the floor the tcp retransmit behavior on closing a socket makes
> sense.  I just don't get how you have triggered that state.
> 

We have a process in the root namespace (init_net) that has
tcp connections into a number of containers (on the same machine).

If the control process in the root ns dies, all containers will also be killed,
i.e. there can easily be outstanding messages to and from the containers.
So I guess that's why I see the tcp_write_timer().

-- 
Regards
Hans Schillstrom <hans.schillstrom@ericsson.com>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: RFC Hanging clean-up of a namespace
  2012-01-23  7:30                                   ` Hans Schillstrom
@ 2012-01-23  7:55                                     ` Eric W. Biederman
  0 siblings, 0 replies; 28+ messages in thread
From: Eric W. Biederman @ 2012-01-23  7:55 UTC (permalink / raw)
  To: Hans Schillstrom
  Cc: Hagen Paul Pfeifer, David Miller, equinox@diac24.net,
	netdev@vger.kernel.org

Hans Schillstrom <hans.schillstrom@ericsson.com> writes:

> We  have a process in root name space (init_net) that have 
> tcp connections into a number of containers (in the same machine)
>
> If the control process in root ns dies all containers will also be killed
> i.e. there can easilly be out standing messages to and from the containers.
> So I guess that's why I see the tcp_write_timer()

The control process is an important piece of this, but it should not be
sufficient to cause tcp sockets to take forever to close, as the kernel
manages sockets.  Do you shut down communication between your namespaces
before they have finished cleaning up?

The easy way to communicate between namespaces would be just to use
a veth pair and the veth pair will go away when the non init_net
container goes away.  Using a veth pair should ensure you have
communications between your namespaces so your local sockets
can shutdown quickly.
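
A veth setup of that kind can be sketched as the iproute2 commands
below.  The interface and namespace names are hypothetical, the helper
only builds the argument lists, and actually running them needs root
and an existing named namespace.

```python
def veth_setup_commands(host_if, peer_if, netns):
    """Build the iproute2 commands for a veth pair into a namespace.

    Returns argument lists suitable for subprocess.run(); nothing is
    executed here.
    """
    return [
        # Create both ends of the pair in the current namespace.
        ["ip", "link", "add", host_if, "type", "veth",
         "peer", "name", peer_if],
        # Move the peer end into the target namespace.
        ["ip", "link", "set", peer_if, "netns", netns],
        # Bring up the end that stays behind.
        ["ip", "link", "set", host_if, "up"],
    ]


if __name__ == "__main__":
    # Hypothetical names, printed rather than executed.
    for cmd in veth_setup_commands("veth-host", "veth-ct", "container1"):
        print(" ".join(cmd))
```

Because the peer device lives inside the container's namespace, the
whole pair is removed when that namespace is cleaned up.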

Right now it feels like, whatever your shutdown/restart process
is, you are shooting yourself in the foot and triggering those
long socket close times.  I expect that with a slightly different order
of operations you can avoid this long shutdown problem.

Eric

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: RFC Hanging clean-up of a namespace
  2012-01-19 17:40 ` David Miller
  2012-01-19 19:01   ` David Lamparter
@ 2012-01-19 19:40   ` Hans Schillström
  1 sibling, 0 replies; 28+ messages in thread
From: Hans Schillström @ 2012-01-19 19:40 UTC (permalink / raw)
  To: David Miller; +Cc: netdev@vger.kernel.org, ebiederm@xmission.com

>Date: Thu, 19 Jan 2012 12:07:09 +0100
>
>> Closing of a namespace (container) can be delayed by ~ 2 minutes
>> due to tcp timers ex tcp time wait (and of cource other things too).
>>
>> I think there should be some kind of "forced close" of the Network stack
>> in ex free_nsproxy()
>
>I think this is unwise.
>
>Keeping the timewait sockets around is necessary to absorb any lingering
>packets in the network meant for those sockets.
>
>If you truncate this activity, and then try to create another socket with
>the same ID you'll run into the very problems time-wait is meant to
>solve.
>
>It's an unfortunate delay, but one you will have to live with.

I think the whole clean-up of the namespace is too weak; there are too many things that
can go wrong, and they do.

I was more thinking of a way to make a forced close,
e.g. by cleaning the stack, where time wait was one point.

^ permalink raw reply	[flat|nested] 28+ messages in thread

Thread overview: 28+ messages
2012-01-19 11:07 RFC Hanging clean-up of a namespace Hans Schillstrom
2012-01-19 13:31 ` David Lamparter
2012-01-19 17:40 ` David Miller
2012-01-19 19:01   ` David Lamparter
2012-01-19 19:06     ` David Miller
2012-01-19 19:25       ` David Lamparter
2012-01-19 19:31         ` David Miller
2012-01-19 19:53           ` David Lamparter
2012-01-19 20:27             ` David Miller
2012-01-19 21:03               ` David Lamparter
2012-01-19 21:24               ` Eric W. Biederman
2012-01-19 21:40                 ` David Lamparter
2012-01-19 21:40                 ` Hagen Paul Pfeifer
2012-01-19 21:47                   ` David Lamparter
2012-01-19 22:10                     ` Rick Jones
2012-01-19 22:16                     ` Hagen Paul Pfeifer
2012-01-19 22:37                     ` David Miller
2012-01-20  6:08                   ` Hans Schillstrom
2012-01-20 10:08                     ` Eric W. Biederman
2012-01-20 11:51                       ` Hans Schillstrom
2012-01-20 20:55                         ` Eric W. Biederman
2012-01-23  6:07                           ` Hans Schillstrom
2012-01-23  6:25                             ` Eric W. Biederman
2012-01-23  6:58                               ` Hans Schillstrom
2012-01-23  7:17                                 ` Eric W. Biederman
2012-01-23  7:30                                   ` Hans Schillstrom
2012-01-23  7:55                                     ` Eric W. Biederman
2012-01-19 19:40   ` Hans Schillström
