Re: [PATCH net 1/2] macvlan: forbid L2 fowarding offload for macvtap

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: "Michael S. Tsirkin" <mst@redhat.com>
To: John Fastabend <john.fastabend@gmail.com>
Cc: Jason Wang <jasowang@redhat.com>,
	John Fastabend <john.r.fastabend@intel.com>,
	Neil Horman <nhorman@tuxdriver.com>,
	davem@davemloft.net, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, Vlad Yasevich <vyasevic@redhat.com>
Subject: Re: [PATCH net 1/2] macvlan: forbid L2 fowarding offload for macvtap
Date: Thu, 9 Jan 2014 09:17:22 +0200	[thread overview]
Message-ID: <20140109071721.GD19559@redhat.com> (raw)
In-Reply-To: <52CDA186.1080601@gmail.com>

On Wed, Jan 08, 2014 at 11:05:42AM -0800, John Fastabend wrote:
> [...]
> 
> >>>OK I think I'm finally putting all the pieces together thanks.
> >>>
> >>>Do you know why macvtap is setting dev->tx_queue_len by default? If you
> >>>zero this then the noqueue_qdisc is used and the q->enqueue check in
> >>>dev_queue_xmit will fail.
> >>
> >>It was introduced in commit 8a35747a5d13b99e076b0222729e0caa48cb69b6
> >>("macvtap: Limit packet queue length") to limit the length of socket
> >>receive queue of macvtap. But I'm not sure whether the qdisc is a
> >>byproduct of this commit, maybe we can switch to use another name
> >>instead of just reuse dev->tx_queue_length.
> >
> >You mean tx_queue_len really, right?
> >
> >Problem is tx_queue_len can be accessed using netlink sysfs or ioctl,
> >so if someone uses these to control or check the # of packets that
> >can be queued by device, this will break.
> >
> >How about adding ndo_set_tx_queue_len then?
> >
> >At some point we wanted to decouple queue length from tx_queue_length
> >for tun as well, so that would be benefitial there as well.
> 
> On the receive side we need to limit the receive queue and the
> dev->tx_queue_len value was used to do this.
> 
> However on the tx side when dev->tx_queue_len is set the default
> qdisc pfifo_fast or mq is used depending on if there is multiqueue
> or not. Unless the user specifies some numtxqueues when creating
> the macvtap device then it will be a single queue device and use
> the pfifo_fast qdisc.
> 
> This is the default behaviour users could zero txqueuelen when
> they create the device because it is a stacked device with
> NETIF_F_LLTX using the lower devices qdisc makes sense but this
> would appear (from code inspection) to break macvtap_forward()?
> 
>         if (skb_queue_len(&q->sk.sk_receive_queue) >= dev->tx_queue_len)
>                 goto drop;
> 
> I'm not sure any of this is a problem other than its a bit
> confusing to overload tx_queue_len for the rx_queue_len but the
> precedent has been there for sometime.

So how about ndo ops to access tx_queue_len then?
This way we can set tx_queue_len to 0 and use some
other field to store the rx_queue_len without users noticing.
Along the lines of the below (it's a partial patch just
to give you the idea):

diff --git a/net/core/dev_ioctl.c b/net/core/dev_ioctl.c
index 5b7d0e1..e526b46 100644
--- a/net/core/dev_ioctl.c
+++ b/net/core/dev_ioctl.c
@@ -167,7 +167,10 @@ static int dev_ifsioc_locked(struct net *net, struct ifreq *ifr, unsigned int cm
 		return 0;
 
 	case SIOCGIFTXQLEN:
-		ifr->ifr_qlen = dev->tx_queue_len;
+		if (dev->netdev_ops->ndo_get_tx_queue_len)
+			ifr->ifr_qlen = dev->netdev_ops->ndo_get_tx_queue_len(dev);
+		else
+			ifr->ifr_qlen = dev->tx_queue_len;
 		return 0;
 
 	default:
@@ -296,7 +299,10 @@ static int dev_ifsioc(struct net *net, struct ifreq *ifr, unsigned int cmd)
 	case SIOCSIFTXQLEN:
 		if (ifr->ifr_qlen < 0)
 			return -EINVAL;
-		dev->tx_queue_len = ifr->ifr_qlen;
+		if (dev->netdev_ops->ndo_set_tx_queue_len)
+			dev->netdev_ops->ndo_set_tx_queue_len(dev, ifr->ifr_qlen);
+		else
+			dev->tx_queue_len = ifr->ifr_qlen;
 		return 0;
 
 	case SIOCSIFNAME:
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index d954b56..f2fd9d5 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -280,10 +280,31 @@ NETDEVICE_SHOW_RW(flags, fmt_hex);
 
 static int change_tx_queue_len(struct net_device *net, unsigned long new_len)
 {
-	net->tx_queue_len = new_len;
+	if (dev->netdev_ops->ndo_set_tx_queue_len)
+		dev->netdev_ops->ndo_set_tx_queue_len(dev, new_len);
+	else
+		dev->tx_queue_len = new_len;
 	return 0;
 }
 
+static ssize_t format_tx_queue_len(const struct net_device *net, char *buf)
+{
+	unsigned long tx_queue_len;
+
+	if (dev->netdev_ops->ndo_get_tx_queue_len)
+		tx_queue_len = dev->netdev_ops->ndo_get_tx_queue_len(dev);
+	else
+		tx_queue_len = dev->tx_queue_len;
+
+	return sprintf(buf, fmt_ulong, tx_queue_len);
+}
+
+static ssize_t tx_queue_len_show(struct device *dev,
+				 struct device_attribute *attr, char *buf)
+{
+	return netdev_show(dev, attr, buf, format_tx_queue_len);
+}
+
 static ssize_t tx_queue_len_store(struct device *dev,
 				  struct device_attribute *attr,
 				  const char *buf, size_t len)
@@ -293,7 +314,7 @@ static ssize_t tx_queue_len_store(struct device *dev,
 
 	return netdev_store(dev, attr, buf, len, change_tx_queue_len);
 }
-NETDEVICE_SHOW_RW(tx_queue_len, fmt_ulong);
+DEVICE_ATTR_RW(tx_queue_len);
 
 static ssize_t ifalias_store(struct device *dev, struct device_attribute *attr,
 			     const char *buf, size_t len)
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 2a0e21d..9276e17 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -876,6 +876,7 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
 	struct nlattr *attr, *af_spec;
 	struct rtnl_af_ops *af_ops;
 	struct net_device *upper_dev = netdev_master_upper_dev_get(dev);
+	unsigned long tx_queue_len;
 
 	ASSERT_RTNL();
 	nlh = nlmsg_put(skb, pid, seq, type, sizeof(*ifm), flags);
@@ -890,8 +891,13 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
 	ifm->ifi_flags = dev_get_flags(dev);
 	ifm->ifi_change = change;
 
+	if (dev->netdev_ops->ndo_get_tx_queue_len)
+		tx_queue_len = dev->netdev_ops->ndo_get_tx_queue_len(dev);
+	else
+		tx_queue_len = dev->tx_queue_len;
+
 	if (nla_put_string(skb, IFLA_IFNAME, dev->name) ||
-	    nla_put_u32(skb, IFLA_TXQLEN, dev->tx_queue_len) ||
+	    nla_put_u32(skb, IFLA_TXQLEN, tx_queue_len) ||
 	    nla_put_u8(skb, IFLA_OPERSTATE,
 		       netif_running(dev) ? dev->operstate : IF_OPER_DOWN) ||
 	    nla_put_u8(skb, IFLA_LINKMODE, dev->link_mode) ||
@@ -1453,8 +1459,14 @@ static int do_setlink(struct net_device *dev, struct ifinfomsg *ifm,
 		modified = 1;
 	}
 
-	if (tb[IFLA_TXQLEN])
-		dev->tx_queue_len = nla_get_u32(tb[IFLA_TXQLEN]);
+	if (tb[IFLA_TXQLEN]) {
+		u32 new_len = nla_get_u32(tb[IFLA_TXQLEN]);
+
+		if (dev->netdev_ops->ndo_set_tx_queue_len)
+			dev->netdev_ops->ndo_set_tx_queue_len(dev, new_len);
+		else
+			dev->tx_queue_len = new_len;
+	}
 
 	if (tb[IFLA_OPERSTATE])
 		set_operstate(dev, nla_get_u8(tb[IFLA_OPERSTATE]));
@@ -1692,8 +1704,14 @@ struct net_device *rtnl_create_link(struct net *net,
 	if (tb[IFLA_BROADCAST])
 		memcpy(dev->broadcast, nla_data(tb[IFLA_BROADCAST]),
 				nla_len(tb[IFLA_BROADCAST]));
-	if (tb[IFLA_TXQLEN])
-		dev->tx_queue_len = nla_get_u32(tb[IFLA_TXQLEN]);
+	if (tb[IFLA_TXQLEN]) {
+		u32 new_len = nla_get_u32(tb[IFLA_TXQLEN]);
+
+		if (dev->netdev_ops->ndo_set_tx_queue_len)
+			dev->netdev_ops->ndo_set_tx_queue_len(dev, new_len);
+		else
+			dev->tx_queue_len = new_len;
+	}
 	if (tb[IFLA_OPERSTATE])
 		set_operstate(dev, nla_get_u8(tb[IFLA_OPERSTATE]));
 	if (tb[IFLA_LINKMODE])

Anyone objects?

> It is a change in behaviour
> though in net-next. Previously dev_queue_xmit() was not used so
> the qdisc layer was never hit on the macvtap device. Now with
> dev_queue_xmit() if the user attaches multiple macvlan queues to a
> single qdisc queue this should still work but wont be optimal. Ideally
> the user should create as many qdisc queues at creation time via
> numtxqueues as macvtap queues will be used during runtime so that there
> is a 1:1 mapping or use some heuristic to avoid cases where there
> is a many to 1 mapping.
> 
> From my perspective it would be OK to push this configuration/policy
> to the management layer. After all it is the entity that knows how
> to distribute system resources amongst the objects running over the
> macvtap devices. The relevance for me is I defaulted the l2 offloaded
> macvlans to single queue devices and wanted to note in net-next this
> is the same policy used in the non-offloaded case.
> 
> Bit long-winded there.
> 
> Thanks,
> John

I think it's a real problem you are pointing out - a performance
regression for multiqueue devices.
If we really think no qdisc is typically the right thing to do, 
we should just make it the default I think, but I agree
just making tx_queue_len 0 doesn't work without other changes.

What do others think about extra ndo ops so devices can decouple
the internal tx_queue_len from the userspace-visible value?

next prev parent reply	other threads:[~2014-01-09  7:17 UTC|newest]

Thread overview: 35+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-01-06  3:21 [PATCH net 1/2] macvlan: forbid L2 fowarding offload for macvtap Jason Wang
2014-01-06  3:21 ` [PATCH net 2/2] net: core: explicitly select a txq before doing l2 forwarding Jason Wang
2014-01-06 12:04   ` Jeff Kirsher
2014-01-06 12:42   ` Neil Horman
2014-01-06 15:06     ` John Fastabend
2014-01-06 15:29       ` Neil Horman
2014-01-07  3:42     ` Jason Wang
2014-01-07 13:17       ` Neil Horman
2014-01-08  3:21         ` Jason Wang
2014-01-08 14:40           ` Neil Horman
2014-01-09  8:28             ` Jason Wang
2014-01-09 11:53               ` Neil Horman
2014-01-07  8:22   ` John Fastabend
2014-01-07  8:37     ` John Fastabend
2014-01-06  7:35 ` [PATCH net 1/2] macvlan: forbid L2 fowarding offload for macvtap John Fastabend
2014-01-06  7:54   ` Jason Wang
2014-01-06 12:26     ` Neil Horman
2014-01-07  3:10       ` Jason Wang
2014-01-07  5:15         ` John Fastabend
2014-01-07  6:22           ` Jason Wang
2014-01-07  7:26             ` John Fastabend
2014-01-07  9:00               ` Jason Wang
2014-01-08 12:55                 ` Michael S. Tsirkin
2014-01-08 19:05                   ` John Fastabend
2014-01-09  7:17                     ` Michael S. Tsirkin [this message]
2014-01-09  8:55                       ` Jason Wang
2014-01-09 21:39                         ` Stephen Hemminger
2014-01-09 22:03                           ` Michael S. Tsirkin
2014-01-09 22:20                             ` Stephen Hemminger
2014-01-10  7:06                           ` Jason Wang
2014-01-10 16:40                             ` Vlad Yasevich
2014-01-07  5:16         ` John Fastabend
2014-01-06 20:47 ` David Miller
2014-01-07  3:17   ` Jason Wang
2014-01-07  5:57     ` David Miller

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:5b7d0e1 dfblob:e526b46 dfblob:d954b56 dfblob:f2fd9d5
dfblob:2a0e21d dfblob:9276e17 )
 OR (
bs:"Re: [PATCH net 1/2] macvlan: forbid L2 fowarding offload for macvtap" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20140109071721.GD19559@redhat.com \
    --to=mst@redhat.com \
    --cc=davem@davemloft.net \
    --cc=jasowang@redhat.com \
    --cc=john.fastabend@gmail.com \
    --cc=john.r.fastabend@intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=netdev@vger.kernel.org \
    --cc=nhorman@tuxdriver.com \
    --cc=vyasevic@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).