Re: [PATCH] ipt_CLUSTERIP: Add network device notifier

From: Kirill Tkhai <ktkhai@parallels.com>
To: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: <netfilter-devel@vger.kernel.org>,
	Patrick McHardy <kaber@trash.net>,
	Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>, <tkhai@yandex.ru>,
	"Pavel Emelyanov" <xemul@parallels.com>
Subject: Re: [PATCH] ipt_CLUSTERIP: Add network device notifier
Date: Wed, 11 Jun 2014 16:10:06 +0400
Message-ID: <1402488606.32126.72.camel@tkhai>
In-Reply-To: <20140611120027.GA23485@localhost>

On Wed, 11/06/2014 at 14:00 +0200, Pablo Neira Ayuso wrote:
> On Wed, Jun 11, 2014 at 03:55:18PM +0400, Kirill Tkhai wrote:
> > On Wed, 11/06/2014 at 13:49 +0200, Pablo Neira Ayuso wrote:
> > > On Wed, Jun 11, 2014 at 03:44:39PM +0400, Kirill Tkhai wrote:
> > > > Hi, Pablo,
> > > > 
> > > > On Mon, 28/04/2014 at 16:23 +0200, Pablo Neira Ayuso wrote:
> > > > > Hi,
> > > > > 
> > > > > On Mon, Apr 07, 2014 at 03:58:49PM +0400, Kirill Tkhai wrote:
> > > > > > The CLUSTERIP target calls dev_hold() in .checkentry but only calls
> > > > > > dev_put() in .destroy, so unregister_netdevice catches the leak:
> > > > > > 
> > > > > > # modprobe dummy
> > > > > > # iptables -A INPUT -d 10.31.3.236 -j CLUSTERIP --new --hashmode sourceip -i dummy0 --clustermac 01:aa:7b:47:f7:d7 --total-nodes 2 --local-node 1
> > > > > > # rmmod dummy
> > > > > >
> > > > > >   Message from syslogd@localhost ...
> > > > > >     kernel: unregister_netdevice: waiting for dummy0 to become free. Usage count = 1
> > > > > >
> > > > > [...]
> > > > > >  1 file changed, 134 insertions(+), 12 deletions(-)
> > > > > 
> > > > > I have gone over this patch several times, and I'm not very happy
> > > > > about taking this fix:
> > > > > 
> > > > > 1) It's quite a large fix for a situation that seems unlikely to me.
> > > > 
> > > > We have several reports from container users who bumped into this.
> > > > The hang happens on netns stop, and it's 100% reproducible. Every time
> > > > a container is stopping or a device is going away, the unregistration
> > > > fails and hangs if CLUSTERIP is used. So we would like to have some
> > > > fix for this.
> > > 
> > > How is this combination being triggered there? I mean:
> > > 
> > > # modprobe dummy
> > > # iptables -A INPUT -d 10.31.3.236 -j CLUSTERIP ...
> > > # rmmod dummy
> > > 
> > > Is it something included in some scripts that automate the setup?
> > 
> > It's a sample of how to trigger this. The problem is not in rmmod.
> > 
> > In practice it happens when a container is stopping and its device is
> > going away. It's not OpenVZ-specific; current LXC has the same problem.
> 
> But that sample should be really easy to trigger if you're getting
> lots of reports for this.
> 
> Are your users really hitting that problem by accident? It seems quite
> a rare condition to me. Please clarify.

We hit it so many times that we had to add this ugly workaround
to our kernels in mid-2011:

commit 56ec6942c28cd6823f1481da8d0df829672f03d3
Author: Konstantin Khlebnikov <khlebnikov@openvz.org>
Date:   Mon Aug 8 12:11:16 2011 +0400

    kernel.spec v2.6.32-131.6.1.el6-042stab026.1-vz
---
 net/core/dev.c | 96 +++++++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 91 insertions(+), 5 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index eaa31c1..d29131d 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5483,6 +5483,64 @@ out:
 EXPORT_SYMBOL(register_netdev);
 
 /*
+ * We do horrible things -- we left a netdevice
+ * in "leaked" state, which means we release as much
+ * resources as possible but the device will remain
+ * present in namespace because someone holds a reference.
+ *
+ * The idea is to be able to force stop VE.
+ */
+static void ve_netdev_leak(struct net_device *dev)
+{
+	struct napi_struct *p, *n;
+
+	dev->is_leaked = 1;
+	barrier();
+
+	/*
+	 * Make sure we're unable to tx/rx
+	 * network packets to outside.
+	 */
+	WARN_ON_ONCE(dev->flags & IFF_UP);
+	WARN_ON_ONCE(dev->qdisc != &noop_qdisc);
+
+	rtnl_lock();
+
+	/*
+	 * No address and napi after that.
+	 */
+	dev_addr_flush(dev);
+	list_for_each_entry_safe(p, n, &dev->napi_list, dev_list)
+		netif_napi_del(p);
+
+	/*
+	 * No release_net() here since the device remains
+	 * present in the namespace.
+	 */
+
+	__rtnl_unlock();
+
+	put_beancounter(netdev_bc(dev)->exec_ub);
+	put_beancounter(netdev_bc(dev)->owner_ub);
+
+	netdev_bc(dev)->exec_ub		= get_beancounter(get_ub0());
+	netdev_bc(dev)->owner_ub	= get_beancounter(get_ub0());
+
+	/*
+	 * Since we've already screwed the device and releasing
+	 * it in a normal way is not possible anymore, we're
+	 * to be sure the device will remain here forever.
+	 */
+	dev_hold(dev);
+
+	synchronize_net();
+
+	pr_emerg("Device (%s:%d:%u:%p) marked as leaked\n",
+		 dev->name, atomic_read(&dev->refcnt) - 1,
+		 VEID(dev->owner_env), dev);
+}
+
+/*
  * netdev_wait_allrefs - wait until all references are gone.
  *
  * This is called when unregistering network devices.
@@ -5493,9 +5551,10 @@ EXPORT_SYMBOL(register_netdev);
  * We can get stuck here if buggy protocols don't correctly
  * call dev_put.
  */
-static void netdev_wait_allrefs(struct net_device *dev)
+static int netdev_wait_allrefs(struct net_device *dev)
 {
 	unsigned long rebroadcast_time, warning_time;
+	int i = 0;
 
 	rebroadcast_time = warning_time = jiffies;
 	while (atomic_read(&dev->refcnt) != 0) {
@@ -5525,12 +5584,27 @@ static void netdev_wait_allrefs(struct net_device *dev)
 
 		if (time_after(jiffies, warning_time + 10 * HZ)) {
 			printk(KERN_EMERG "unregister_netdevice: "
-			       "waiting for %s to become free. Usage "
-			       "count = %d\n",
-			       dev->name, atomic_read(&dev->refcnt));
+			       "waiting for %s=%p to become free. Usage "
+			       "count = %d\n ve=%u",
+			       dev->name, dev, atomic_read(&dev->refcnt),
+			       VEID(get_exec_env()));
 			warning_time = jiffies;
 		}
+
+		/*
+		 * If device has lost the reference we might stuck
+		 * in this loop forever not having a chance the VE
+		 * to stop.
+		 */
+		if (++i > 200) { /* give 50 seconds to try */
+			if (!ve_is_super(dev->owner_env)) {
+				ve_netdev_leak(dev);
+				return -EBUSY;
+			}
+		}
 	}
+
+	return 0;
 }
 
 /* The sequence is:
@@ -5585,7 +5659,12 @@ void netdev_run_todo(void)
 
 		on_each_cpu(flush_backlog, dev, 1);
 
-		netdev_wait_allrefs(dev);
+		/*
+		 * Even if device get stuck here we are
+		 * to proceed the rest of the list.
+		 */
+		if (netdev_wait_allrefs(dev))
+			continue;
 
 		/* paranoia */
 		BUG_ON(atomic_read(&dev->refcnt));
@@ -5768,6 +5847,13 @@ void free_netdev(struct net_device *dev)
 {
 	struct napi_struct *p, *n;
 
+	if (dev->is_leaked) {
+		pr_emerg("%s: device %s=%p is leaked\n",
+			 __func__, dev->name, dev);
+		dump_stack();
+		return;
+	}
+
 	release_net(dev_net(dev));
 
 	kfree(dev->_tx);
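
For readers skimming the diff, the core idea of the workaround (bound the
wait for references, then mark the device as leaked and carry on) can be
sketched in userspace C. This is only an illustration under stated
assumptions: struct fake_dev, fake_dev_leak(), wait_allrefs_bounded() and
drop_one_ref() are hypothetical names, not kernel symbols, and the poll
callback merely stands in for one pass of the real wait loop. The sketch
checks the bound before each poll, which differs slightly from the ordering
in the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct net_device; not a kernel type. */
struct fake_dev {
	int refcnt;     /* outstanding references */
	bool is_leaked; /* set once we give up waiting */
};

/*
 * Mark the device as leaked and pin it with an extra reference,
 * loosely mirroring the dev_hold() at the end of ve_netdev_leak().
 */
static void fake_dev_leak(struct fake_dev *dev)
{
	dev->is_leaked = true;
	dev->refcnt++;
}

/*
 * Poll until refcnt reaches zero, but give up after max_polls polls.
 * Returns 0 when all references are gone, -EBUSY when the device
 * had to be marked as leaked.
 */
static int wait_allrefs_bounded(struct fake_dev *dev, int max_polls,
				void (*poll_once)(struct fake_dev *))
{
	int i = 0;

	assert(max_polls > 0);
	while (dev->refcnt != 0) {
		if (i++ >= max_polls) {
			fake_dev_leak(dev);
			return -EBUSY;
		}
		if (poll_once != NULL)
			poll_once(dev);
	}
	return 0;
}

/* Example poll callback: a well-behaved holder drops one reference. */
static void drop_one_ref(struct fake_dev *dev)
{
	if (dev->refcnt > 0)
		dev->refcnt--;
}
```

A caller would treat a non-zero return the way netdev_run_todo() does in the
diff: skip the normal teardown for that device and continue with the rest of
the list. Passing drop_one_ref models a holder that eventually releases its
references; passing NULL models a holder that never does, which is the
CLUSTERIP case described in this thread.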

