From: "Michael S. Tsirkin" <mst@redhat.com>
To: John Fastabend <john.fastabend@gmail.com>
Cc: Jason Wang <jasowang@redhat.com>,
John Fastabend <john.r.fastabend@intel.com>,
Neil Horman <nhorman@tuxdriver.com>,
davem@davemloft.net, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org, Vlad Yasevich <vyasevic@redhat.com>
Subject: Re: [PATCH net 1/2] macvlan: forbid L2 fowarding offload for macvtap
Date: Thu, 9 Jan 2014 09:17:22 +0200 [thread overview]
Message-ID: <20140109071721.GD19559@redhat.com> (raw)
In-Reply-To: <52CDA186.1080601@gmail.com>
On Wed, Jan 08, 2014 at 11:05:42AM -0800, John Fastabend wrote:
> [...]
>
> >>>OK I think I'm finally putting all the pieces together thanks.
> >>>
> >>>Do you know why macvtap is setting dev->tx_queue_len by default? If you
> >>>zero this then the noqueue_qdisc is used and the q->enqueue check in
> >>>dev_queue_xmit will fail.
> >>
> >>It was introduced in commit 8a35747a5d13b99e076b0222729e0caa48cb69b6
> >>("macvtap: Limit packet queue length") to limit the length of socket
> >>receive queue of macvtap. But I'm not sure whether the qdisc is a
> >>byproduct of this commit, maybe we can switch to use another name
> >>instead of just reuse dev->tx_queue_length.
> >
> >You mean tx_queue_len really, right?
> >
> >Problem is tx_queue_len can be accessed using netlink sysfs or ioctl,
> >so if someone uses these to control or check the # of packets that
> >can be queued by device, this will break.
> >
> >How about adding ndo_set_tx_queue_len then?
> >
> >At some point we wanted to decouple queue length from tx_queue_length
> >for tun as well, so that would be benefitial there as well.
>
> On the receive side we need to limit the receive queue and the
> dev->tx_queue_len value was used to do this.
>
> However on the tx side when dev->tx_queue_len is set the default
> qdisc pfifo_fast or mq is used depending on if there is multiqueue
> or not. Unless the user specifies some numtxqueues when creating
> the macvtap device then it will be a single queue device and use
> the pfifo_fast qdisc.
>
> This is the default behaviour users could zero txqueuelen when
> they create the device because it is a stacked device with
> NETIF_F_LLTX using the lower devices qdisc makes sense but this
> would appear (from code inspection) to break macvtap_forward()?
>
> if (skb_queue_len(&q->sk.sk_receive_queue) >= dev->tx_queue_len)
> goto drop;
>
> I'm not sure any of this is a problem other than its a bit
> confusing to overload tx_queue_len for the rx_queue_len but the
> precedent has been there for sometime.
So how about ndo ops to access tx_queue_len then?
This way we can set tx_queue_len to 0 and use some
other field to store the rx_queue_len without users noticing.
Along the lines of the below (it's a partial patch just
to give you the idea):
diff --git a/net/core/dev_ioctl.c b/net/core/dev_ioctl.c
index 5b7d0e1..e526b46 100644
--- a/net/core/dev_ioctl.c
+++ b/net/core/dev_ioctl.c
@@ -167,7 +167,10 @@ static int dev_ifsioc_locked(struct net *net, struct ifreq *ifr, unsigned int cm
return 0;
case SIOCGIFTXQLEN:
- ifr->ifr_qlen = dev->tx_queue_len;
+ if (dev->netdev_ops->ndo_get_tx_queue_len)
+ ifr->ifr_qlen = dev->netdev_ops->ndo_get_tx_queue_len(dev);
+ else
+ ifr->ifr_qlen = dev->tx_queue_len;
return 0;
default:
@@ -296,7 +299,10 @@ static int dev_ifsioc(struct net *net, struct ifreq *ifr, unsigned int cmd)
case SIOCSIFTXQLEN:
if (ifr->ifr_qlen < 0)
return -EINVAL;
- dev->tx_queue_len = ifr->ifr_qlen;
+ if (dev->netdev_ops->ndo_set_tx_queue_len)
+ dev->netdev_ops->ndo_set_tx_queue_len(dev, ifr->ifr_qlen);
+ else
+ dev->tx_queue_len = ifr->ifr_qlen;
return 0;
case SIOCSIFNAME:
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index d954b56..f2fd9d5 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -280,10 +280,31 @@ NETDEVICE_SHOW_RW(flags, fmt_hex);
static int change_tx_queue_len(struct net_device *net, unsigned long new_len)
{
- net->tx_queue_len = new_len;
+ if (dev->netdev_ops->ndo_set_tx_queue_len)
+ dev->netdev_ops->ndo_set_tx_queue_len(dev, new_len);
+ else
+ dev->tx_queue_len = new_len;
return 0;
}
+static ssize_t format_tx_queue_len(const struct net_device *net, char *buf)
+{
+ unsigned long tx_queue_len;
+
+ if (dev->netdev_ops->ndo_get_tx_queue_len)
+ tx_queue_len = dev->netdev_ops->ndo_get_tx_queue_len(dev);
+ else
+ tx_queue_len = dev->tx_queue_len;
+
+ return sprintf(buf, fmt_ulong, tx_queue_len);
+}
+
+static ssize_t tx_queue_len_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ return netdev_show(dev, attr, buf, format_tx_queue_len);
+}
+
static ssize_t tx_queue_len_store(struct device *dev,
struct device_attribute *attr,
const char *buf, size_t len)
@@ -293,7 +314,7 @@ static ssize_t tx_queue_len_store(struct device *dev,
return netdev_store(dev, attr, buf, len, change_tx_queue_len);
}
-NETDEVICE_SHOW_RW(tx_queue_len, fmt_ulong);
+DEVICE_ATTR_RW(tx_queue_len);
static ssize_t ifalias_store(struct device *dev, struct device_attribute *attr,
const char *buf, size_t len)
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 2a0e21d..9276e17 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -876,6 +876,7 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
struct nlattr *attr, *af_spec;
struct rtnl_af_ops *af_ops;
struct net_device *upper_dev = netdev_master_upper_dev_get(dev);
+ unsigned long tx_queue_len;
ASSERT_RTNL();
nlh = nlmsg_put(skb, pid, seq, type, sizeof(*ifm), flags);
@@ -890,8 +891,13 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
ifm->ifi_flags = dev_get_flags(dev);
ifm->ifi_change = change;
+ if (dev->netdev_ops->ndo_get_tx_queue_len)
+ tx_queue_len = dev->netdev_ops->ndo_get_tx_queue_len(dev);
+ else
+ tx_queue_len = dev->tx_queue_len;
+
if (nla_put_string(skb, IFLA_IFNAME, dev->name) ||
- nla_put_u32(skb, IFLA_TXQLEN, dev->tx_queue_len) ||
+ nla_put_u32(skb, IFLA_TXQLEN, tx_queue_len) ||
nla_put_u8(skb, IFLA_OPERSTATE,
netif_running(dev) ? dev->operstate : IF_OPER_DOWN) ||
nla_put_u8(skb, IFLA_LINKMODE, dev->link_mode) ||
@@ -1453,8 +1459,14 @@ static int do_setlink(struct net_device *dev, struct ifinfomsg *ifm,
modified = 1;
}
- if (tb[IFLA_TXQLEN])
- dev->tx_queue_len = nla_get_u32(tb[IFLA_TXQLEN]);
+ if (tb[IFLA_TXQLEN]) {
+ u32 new_len = nla_get_u32(tb[IFLA_TXQLEN]);
+
+ if (dev->netdev_ops->ndo_set_tx_queue_len)
+ dev->netdev_ops->ndo_set_tx_queue_len(dev, new_len);
+ else
+ dev->tx_queue_len = new_len;
+ }
if (tb[IFLA_OPERSTATE])
set_operstate(dev, nla_get_u8(tb[IFLA_OPERSTATE]));
@@ -1692,8 +1704,14 @@ struct net_device *rtnl_create_link(struct net *net,
if (tb[IFLA_BROADCAST])
memcpy(dev->broadcast, nla_data(tb[IFLA_BROADCAST]),
nla_len(tb[IFLA_BROADCAST]));
- if (tb[IFLA_TXQLEN])
- dev->tx_queue_len = nla_get_u32(tb[IFLA_TXQLEN]);
+ if (tb[IFLA_TXQLEN]) {
+ u32 new_len = nla_get_u32(tb[IFLA_TXQLEN]);
+
+ if (dev->netdev_ops->ndo_set_tx_queue_len)
+ dev->netdev_ops->ndo_set_tx_queue_len(dev, new_len);
+ else
+ dev->tx_queue_len = new_len;
+ }
if (tb[IFLA_OPERSTATE])
set_operstate(dev, nla_get_u8(tb[IFLA_OPERSTATE]));
if (tb[IFLA_LINKMODE])
Anyone objects?
> It is a change in behaviour
> though in net-next. Previously dev_queue_xmit() was not used so
> the qdisc layer was never hit on the macvtap device. Now with
> dev_queue_xmit() if the user attaches multiple macvlan queues to a
> single qdisc queue this should still work but wont be optimal. Ideally
> the user should create as many qdisc queues at creation time via
> numtxqueues as macvtap queues will be used during runtime so that there
> is a 1:1 mapping or use some heuristic to avoid cases where there
> is a many to 1 mapping.
>
> From my perspective it would be OK to push this configuration/policy
> to the management layer. After all it is the entity that knows how
> to distribute system resources amongst the objects running over the
> macvtap devices. The relevance for me is I defaulted the l2 offloaded
> macvlans to single queue devices and wanted to note in net-next this
> is the same policy used in the non-offloaded case.
>
> Bit long-winded there.
>
> Thanks,
> John
I think it's a real problem you are pointing out - a performance
regression for multiqueue devices.
If we really think no qdisc is typically the right thing to do,
we should just make it the default I think, but I agree
just making tx_queue_len 0 doesn't work without other changes.
What do others think about extra ndo ops so devices can decouple
the internal tx_queue_len from the userspace-visible value?
next prev parent reply other threads:[~2014-01-09 7:17 UTC|newest]
Thread overview: 43+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-01-06 3:21 [PATCH net 1/2] macvlan: forbid L2 fowarding offload for macvtap Jason Wang
2014-01-06 3:21 ` [PATCH net 2/2] net: core: explicitly select a txq before doing l2 forwarding Jason Wang
2014-01-06 3:21 ` Jason Wang
2014-01-06 12:04 ` [E1000-devel] " Jeff Kirsher
2014-01-06 12:04 ` Jeff Kirsher
2014-01-06 12:42 ` Neil Horman
2014-01-06 15:06 ` John Fastabend
2014-01-06 15:06 ` John Fastabend
2014-01-06 15:29 ` Neil Horman
2014-01-06 15:29 ` Neil Horman
2014-01-07 3:42 ` Jason Wang
2014-01-07 3:42 ` Jason Wang
2014-01-07 13:17 ` Neil Horman
2014-01-08 3:21 ` Jason Wang
2014-01-08 3:21 ` Jason Wang
2014-01-08 14:40 ` Neil Horman
2014-01-09 8:28 ` Jason Wang
2014-01-09 8:28 ` Jason Wang
2014-01-09 11:53 ` Neil Horman
2014-01-09 11:53 ` Neil Horman
2014-01-07 8:22 ` John Fastabend
2014-01-07 8:37 ` John Fastabend
2014-01-06 7:35 ` [PATCH net 1/2] macvlan: forbid L2 fowarding offload for macvtap John Fastabend
2014-01-06 7:54 ` Jason Wang
2014-01-06 12:26 ` Neil Horman
2014-01-07 3:10 ` Jason Wang
2014-01-07 5:15 ` John Fastabend
2014-01-07 6:22 ` Jason Wang
2014-01-07 7:26 ` John Fastabend
2014-01-07 9:00 ` Jason Wang
2014-01-08 12:55 ` Michael S. Tsirkin
2014-01-08 19:05 ` John Fastabend
2014-01-09 7:17 ` Michael S. Tsirkin [this message]
2014-01-09 8:55 ` Jason Wang
2014-01-09 21:39 ` Stephen Hemminger
2014-01-09 22:03 ` Michael S. Tsirkin
2014-01-09 22:20 ` Stephen Hemminger
2014-01-10 7:06 ` Jason Wang
2014-01-10 16:40 ` Vlad Yasevich
2014-01-07 5:16 ` John Fastabend
2014-01-06 20:47 ` David Miller
2014-01-07 3:17 ` Jason Wang
2014-01-07 5:57 ` David Miller
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20140109071721.GD19559@redhat.com \
--to=mst@redhat.com \
--cc=davem@davemloft.net \
--cc=jasowang@redhat.com \
--cc=john.fastabend@gmail.com \
--cc=john.r.fastabend@intel.com \
--cc=linux-kernel@vger.kernel.org \
--cc=netdev@vger.kernel.org \
--cc=nhorman@tuxdriver.com \
--cc=vyasevic@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.