* [PATCH net 1/2] macvlan: forbid L2 fowarding offload for macvtap @ 2014-01-06 3:21 Jason Wang 2014-01-06 3:21 ` [PATCH net 2/2] net: core: explicitly select a txq before doing l2 forwarding Jason Wang ` (2 more replies) 0 siblings, 3 replies; 35+ messages in thread From: Jason Wang @ 2014-01-06 3:21 UTC (permalink / raw) To: davem, netdev, linux-kernel; +Cc: mst, Jason Wang, John Fastabend, Neil Horman L2 fowarding offload will bypass the rx handler of real device. This will make the packet could not be forwarded to macvtap device. Another problem is the dev_hard_start_xmit() called for macvtap does not have any synchronization. Fix this by forbidding L2 forwarding for macvtap. Cc: John Fastabend <john.r.fastabend@intel.com> Cc: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: Jason Wang <jasowang@redhat.com> --- drivers/net/macvlan.c | 5 ++++- 1 files changed, 4 insertions(+), 1 deletions(-) diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c index 60406b0..5360f73 100644 --- a/drivers/net/macvlan.c +++ b/drivers/net/macvlan.c @@ -338,6 +338,8 @@ static const struct header_ops macvlan_hard_header_ops = { .cache_update = eth_header_cache_update, }; +static struct rtnl_link_ops macvlan_link_ops; + static int macvlan_open(struct net_device *dev) { struct macvlan_dev *vlan = netdev_priv(dev); @@ -353,7 +355,8 @@ static int macvlan_open(struct net_device *dev) goto hash_add; } - if (lowerdev->features & NETIF_F_HW_L2FW_DOFFLOAD) { + if (lowerdev->features & NETIF_F_HW_L2FW_DOFFLOAD && + dev->rtnl_link_ops == &macvlan_link_ops) { vlan->fwd_priv = lowerdev->netdev_ops->ndo_dfwd_add_station(lowerdev, dev); -- 1.7.1 ^ permalink raw reply related [flat|nested] 35+ messages in thread
* [PATCH net 2/2] net: core: explicitly select a txq before doing l2 forwarding 2014-01-06 3:21 [PATCH net 1/2] macvlan: forbid L2 fowarding offload for macvtap Jason Wang @ 2014-01-06 3:21 ` Jason Wang 2014-01-06 12:04 ` Jeff Kirsher ` (2 more replies) 2014-01-06 7:35 ` [PATCH net 1/2] macvlan: forbid L2 fowarding offload for macvtap John Fastabend 2014-01-06 20:47 ` David Miller 2 siblings, 3 replies; 35+ messages in thread From: Jason Wang @ 2014-01-06 3:21 UTC (permalink / raw) To: davem, netdev, linux-kernel Cc: John Fastabend, e1000-devel, Jason Wang, Neil Horman, mst Currently, the tx queue were selected implicitly in ndo_dfwd_start_xmit(). The will cause several issues: - NETIF_F_LLTX was forced for macvlan device in this case which lead extra lock contention. - dev_hard_start_xmit() was called with NULL txq which bypasses the net device watchdog - dev_hard_start_xmit() does not check txq everywhere which will lead a crash when tso is disabled for lower device. Fix this by explicitly introducing a select queue method just for l2 forwarding offload (ndo_dfwd_select_queue), and introducing dfwd_direct_xmit() to do the queue selecting and transmitting for l2 forwarding. With this fixes, NETIF_F_LLTX could be preserved for macvlan and there's no need to check txq against NULL in dev_hard_start_xmit(). In the future, it was also required for macvtap l2 forwarding support since it provides a necessary synchronization method. Cc: John Fastabend <john.r.fastabend@intel.com> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: e1000-devel@lists.sourceforge.net Signed-off-by: Jason Wang <jasowang@redhat.com> --- drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 15 +++++++++---- drivers/net/macvlan.c | 3 +- include/linux/netdevice.h | 11 +++++++++ net/core/dev.c | 28 ++++++++++++++++++++++++- 4 files changed, 49 insertions(+), 8 deletions(-) diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c index cc06854..ee71cf7 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c @@ -7629,16 +7629,20 @@ static void ixgbe_fwd_del(struct net_device *pdev, void *priv) kfree(fwd_adapter); } +static u16 ixgbe_fwd_select_queue(struct net_device *dev, struct sk_buff *skb, + void *priv) +{ + struct ixgbe_fwd_adapter *fwd_adapter = priv; + return skb->queue_mapping + fwd_adapter->tx_base_queue; +} + static netdev_tx_t ixgbe_fwd_xmit(struct sk_buff *skb, struct net_device *dev, void *priv) { struct ixgbe_fwd_adapter *fwd_adapter = priv; - unsigned int queue; - struct ixgbe_ring *tx_ring; - - queue = skb->queue_mapping + fwd_adapter->tx_base_queue; - tx_ring = fwd_adapter->real_adapter->tx_ring[queue]; + struct ixgbe_ring *tx_ring = + fwd_adapter->real_adapter->tx_ring[skb->queue_mapping]; return __ixgbe_xmit_frame(skb, dev, tx_ring); } @@ -7689,6 +7693,7 @@ static const struct net_device_ops ixgbe_netdev_ops = { .ndo_bridge_getlink = ixgbe_ndo_bridge_getlink, .ndo_dfwd_add_station = ixgbe_fwd_add, .ndo_dfwd_del_station = ixgbe_fwd_del, + .ndo_dfwd_select_queue = ixgbe_fwd_select_queue, .ndo_dfwd_start_xmit = ixgbe_fwd_xmit, }; diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c index 5360f73..2cbbce3 100644 --- a/drivers/net/macvlan.c +++ b/drivers/net/macvlan.c @@ -299,7 +299,7 @@ netdev_tx_t macvlan_start_xmit(struct sk_buff *skb, if (vlan->fwd_priv) { skb->dev = vlan->lowerdev; - ret = dev_hard_start_xmit(skb, skb->dev, NULL, vlan->fwd_priv); + ret = dfwd_direct_xmit(skb, skb->dev, vlan->fwd_priv); } else { ret = macvlan_queue_xmit(skb, dev); } @@ -366,7 +366,6 @@ static int macvlan_open(struct net_device *dev) if (IS_ERR_OR_NULL(vlan->fwd_priv)) { vlan->fwd_priv = NULL; } else { - dev->features &= ~NETIF_F_LLTX; return 0; } } diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index d9a550b..dbfd476 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -975,6 +975,11 @@ struct netdev_phys_port_id { * by 'ndo_dfwd_add_station'. 'pdev' is the net device backing * the station and priv is the structure returned by the add * operation. + * u16 (*ndo_dfwd_select_queue)(struct net_device *dev, + * struct sk_buff *skb, + * void *priv); + * Called to decide which queue to xmit over the accelerated station when + * device supports multiple transmit queues. * netdev_tx_t (*ndo_dfwd_start_xmit)(struct sk_buff *skb, * struct net_device *dev, * void *priv); @@ -1123,6 +1128,10 @@ struct net_device_ops { void (*ndo_dfwd_del_station)(struct net_device *pdev, void *priv); + u16 (*ndo_dfwd_select_queue)(struct net_device *dev, + struct sk_buff *skb, + void *priv); + netdev_tx_t (*ndo_dfwd_start_xmit) (struct sk_buff *skb, struct net_device *dev, void *priv); @@ -2416,6 +2425,8 @@ int dev_set_mac_address(struct net_device *, struct sockaddr *); int dev_change_carrier(struct net_device *, bool new_carrier); int dev_get_phys_port_id(struct net_device *dev, struct netdev_phys_port_id *ppid); +int dfwd_direct_xmit(struct sk_buff *skb, struct net_device *dev, + void *accel_priv); int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev, struct netdev_queue *txq, void *accel_priv); int dev_forward_skb(struct net_device *dev, struct sk_buff *skb); diff --git a/net/core/dev.c b/net/core/dev.c index 4fc1722..bc2b03f 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -2538,6 +2538,32 @@ static inline int skb_needs_linearize(struct sk_buff *skb, !(features & NETIF_F_SG))); } +int dfwd_direct_xmit(struct sk_buff *skb, struct net_device *dev, + void *accel_priv) +{ + struct netdev_queue *txq; + int ret = NETDEV_TX_BUSY; + int index; + + BUG_ON(!dev->netdev_ops->ndo_dfwd_select_queue); + index = dev->netdev_ops->ndo_dfwd_select_queue(dev, skb, + accel_priv); + + local_bh_disable(); + + skb_set_queue_mapping(skb, index); + txq = netdev_get_tx_queue(dev, index); + + HARD_TX_LOCK(dev, txq, smp_processor_id()); + if (!netif_xmit_frozen_or_stopped(txq)) + ret = dev_hard_start_xmit(skb, dev, txq, accel_priv); + HARD_TX_UNLOCK(dev, txq); + + local_bh_enable(); + return ret; +} +EXPORT_SYMBOL_GPL(dfwd_direct_xmit); + int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev, struct netdev_queue *txq, void *accel_priv) { @@ -2611,7 +2637,7 @@ int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev, rc = ops->ndo_start_xmit(skb, dev); trace_net_dev_xmit(skb, rc, dev, skb_len); - if (rc == NETDEV_TX_OK && txq) + if (rc == NETDEV_TX_OK) txq_trans_update(txq); return rc; } -- 1.7.1 ------------------------------------------------------------------------------ Rapidly troubleshoot problems before they affect your business. Most IT organizations don't have a clear picture of how application performance affects their revenue. With AppDynamics, you get 100% visibility into your Java,.NET, & PHP application. Start your 15-day FREE TRIAL of AppDynamics Pro! http://pubads.g.doubleclick.net/gampad/clk?id=84349831&iu=/4140/ostg.clktrk _______________________________________________ E1000-devel mailing list E1000-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/e1000-devel To learn more about Intel® Ethernet, visit http://communities.intel.com/community/wired ^ permalink raw reply related [flat|nested] 35+ messages in thread
* Re: [PATCH net 2/2] net: core: explicitly select a txq before doing l2 forwarding 2014-01-06 3:21 ` [PATCH net 2/2] net: core: explicitly select a txq before doing l2 forwarding Jason Wang @ 2014-01-06 12:04 ` Jeff Kirsher 2014-01-06 12:42 ` Neil Horman 2014-01-07 8:22 ` John Fastabend 2 siblings, 0 replies; 35+ messages in thread From: Jeff Kirsher @ 2014-01-06 12:04 UTC (permalink / raw) To: Jason Wang, aaron.f.brown Cc: Neil Horman, mst, e1000-devel, netdev, linux-kernel, John Fastabend, davem [-- Attachment #1.1: Type: text/plain, Size: 1634 bytes --] On Mon, 2014-01-06 at 11:21 +0800, Jason Wang wrote: > Currently, the tx queue were selected implicitly in > ndo_dfwd_start_xmit(). The > will cause several issues: > > - NETIF_F_LLTX was forced for macvlan device in this case which lead > extra lock > contention. > - dev_hard_start_xmit() was called with NULL txq which bypasses the > net device > watchdog > - dev_hard_start_xmit() does not check txq everywhere which will lead > a crash > when tso is disabled for lower device. > > Fix this by explicitly introducing a select queue method just for l2 > forwarding > offload (ndo_dfwd_select_queue), and introducing dfwd_direct_xmit() to > do the > queue selecting and transmitting for l2 forwarding. > > With this fixes, NETIF_F_LLTX could be preserved for macvlan and > there's no need > to check txq against NULL in dev_hard_start_xmit(). > > In the future, it was also required for macvtap l2 forwarding support > since it > provides a necessary synchronization method. > > Cc: John Fastabend <john.r.fastabend@intel.com> > Cc: Neil Horman <nhorman@tuxdriver.com> > Cc: e1000-devel@lists.sourceforge.net > Signed-off-by: Jason Wang <jasowang@redhat.com> > --- > drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 15 +++++++++---- > drivers/net/macvlan.c | 3 +- > include/linux/netdevice.h | 11 +++++++++ > net/core/dev.c | 28 > ++++++++++++++++++++++++- > 4 files changed, 49 insertions(+), 8 deletions(-) Thanks Jason, I have added this to my queue since it has changes against ixgbe. [-- Attachment #1.2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 819 bytes --] [-- Attachment #2: Type: text/plain, Size: 455 bytes --] ------------------------------------------------------------------------------ Rapidly troubleshoot problems before they affect your business. Most IT organizations don't have a clear picture of how application performance affects their revenue. With AppDynamics, you get 100% visibility into your Java,.NET, & PHP application. Start your 15-day FREE TRIAL of AppDynamics Pro! http://pubads.g.doubleclick.net/gampad/clk?id=84349831&iu=/4140/ostg.clktrk [-- Attachment #3: Type: text/plain, Size: 257 bytes --] _______________________________________________ E1000-devel mailing list E1000-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/e1000-devel To learn more about Intel® Ethernet, visit http://communities.intel.com/community/wired ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH net 2/2] net: core: explicitly select a txq before doing l2 forwarding 2014-01-06 3:21 ` [PATCH net 2/2] net: core: explicitly select a txq before doing l2 forwarding Jason Wang 2014-01-06 12:04 ` Jeff Kirsher @ 2014-01-06 12:42 ` Neil Horman 2014-01-06 15:06 ` John Fastabend 2014-01-07 3:42 ` Jason Wang 2014-01-07 8:22 ` John Fastabend 2 siblings, 2 replies; 35+ messages in thread From: Neil Horman @ 2014-01-06 12:42 UTC (permalink / raw) To: Jason Wang; +Cc: davem, netdev, linux-kernel, mst, John Fastabend, e1000-devel On Mon, Jan 06, 2014 at 11:21:07AM +0800, Jason Wang wrote: > Currently, the tx queue were selected implicitly in ndo_dfwd_start_xmit(). The > will cause several issues: > > - NETIF_F_LLTX was forced for macvlan device in this case which lead extra lock > contention. > - dev_hard_start_xmit() was called with NULL txq which bypasses the net device > watchdog > - dev_hard_start_xmit() does not check txq everywhere which will lead a crash > when tso is disabled for lower device. > > Fix this by explicitly introducing a select queue method just for l2 forwarding > offload (ndo_dfwd_select_queue), and introducing dfwd_direct_xmit() to do the > queue selecting and transmitting for l2 forwarding. > > With this fixes, NETIF_F_LLTX could be preserved for macvlan and there's no need > to check txq against NULL in dev_hard_start_xmit(). > > In the future, it was also required for macvtap l2 forwarding support since it > provides a necessary synchronization method. > > Cc: John Fastabend <john.r.fastabend@intel.com> > Cc: Neil Horman <nhorman@tuxdriver.com> > Cc: e1000-devel@lists.sourceforge.net > Signed-off-by: Jason Wang <jasowang@redhat.com> Instead of creating another operation here to do special queue selection, why not just have ndo_dfwd_start_xmit include a pointer to a pointer in its argument list, so it can pass the txq it used back to the caller (dev_hard_start_xmit)? ndo_dfwd_start_xmit already knows which queue set to pick from (since their reserved for the device doing the transmitting). It seems more clear to me than creating a new netdevice operation. As for the crash issue, I'm not sure what you mean. Where in dev_hard_start_xmit would we need to check txq that we're not currently, and what crash results? Also, can you elaborate on what you mean by additional lock contention? What contention do you see that goes above and beyond the normal locking required by txq access? I suppose its extra locking above and beyond in the macvtap case, where you would otherwise never hit hardware, but that not the only use case, and I think the solution there is likely to add some code in the macvlan feature set handler so that NETIF_F_LLTX is cleared if you disable the hardware forwarding acceleration via ethtool. Regards Neil > ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH net 2/2] net: core: explicitly select a txq before doing l2 forwarding 2014-01-06 12:42 ` Neil Horman @ 2014-01-06 15:06 ` John Fastabend 2014-01-06 15:29 ` Neil Horman 2014-01-07 3:42 ` Jason Wang 1 sibling, 1 reply; 35+ messages in thread From: John Fastabend @ 2014-01-06 15:06 UTC (permalink / raw) To: Neil Horman Cc: mst, e1000-devel, netdev, Jason Wang, linux-kernel, John Fastabend, davem On 01/06/2014 04:42 AM, Neil Horman wrote: > On Mon, Jan 06, 2014 at 11:21:07AM +0800, Jason Wang wrote: >> Currently, the tx queue were selected implicitly in ndo_dfwd_start_xmit(). The >> will cause several issues: >> >> - NETIF_F_LLTX was forced for macvlan device in this case which lead extra lock >> contention. >> - dev_hard_start_xmit() was called with NULL txq which bypasses the net device >> watchdog >> - dev_hard_start_xmit() does not check txq everywhere which will lead a crash >> when tso is disabled for lower device. >> >> Fix this by explicitly introducing a select queue method just for l2 forwarding >> offload (ndo_dfwd_select_queue), and introducing dfwd_direct_xmit() to do the >> queue selecting and transmitting for l2 forwarding. >> >> With this fixes, NETIF_F_LLTX could be preserved for macvlan and there's no need >> to check txq against NULL in dev_hard_start_xmit(). >> >> In the future, it was also required for macvtap l2 forwarding support since it >> provides a necessary synchronization method. >> >> Cc: John Fastabend <john.r.fastabend@intel.com> >> Cc: Neil Horman <nhorman@tuxdriver.com> >> Cc: e1000-devel@lists.sourceforge.net >> Signed-off-by: Jason Wang <jasowang@redhat.com> > > Instead of creating another operation here to do special queue selection, why > not just have ndo_dfwd_start_xmit include a pointer to a pointer in its argument > list, so it can pass the txq it used back to the caller (dev_hard_start_xmit)? > ndo_dfwd_start_xmit already knows which queue set to pick from (since their > reserved for the device doing the transmitting). It seems more clear to me than > creating a new netdevice operation. > > As for the crash issue, I'm not sure what you mean. Where in > dev_hard_start_xmit would we need to check txq that we're not currently, and > what crash results? > > Also, can you elaborate on what you mean by additional lock contention? What > contention do you see that goes above and beyond the normal locking required by > txq access? I suppose its extra locking above and beyond in the macvtap case, > where you would otherwise never hit hardware, but that not the only use case, > and I think the solution there is likely to add some code in the macvlan feature > set handler so that NETIF_F_LLTX is cleared if you disable the hardware > forwarding acceleration via ethtool. > NETIF_F_LLTX is cleared in macvlan_open() which should be used in the macvtap case. if (lowerdev->features & NETIF_F_HW_L2FW_DOFFLOAD) { vlan->fwd_priv = lowerdev->netdev_ops->ndo_dfwd_add_station(lowerdev, dev); /* If we get a NULL pointer back, or if we get an error * then we should just fall through to the non accelerated path */ if (IS_ERR_OR_NULL(vlan->fwd_priv)) { vlan->fwd_priv = NULL; } else { dev->features &= ~NETIF_F_LLTX; return 0; } } Thanks, John -- John Fastabend Intel Corporation ------------------------------------------------------------------------------ Rapidly troubleshoot problems before they affect your business. Most IT organizations don't have a clear picture of how application performance affects their revenue. With AppDynamics, you get 100% visibility into your Java,.NET, & PHP application. Start your 15-day FREE TRIAL of AppDynamics Pro! http://pubads.g.doubleclick.net/gampad/clk?id=84349831&iu=/4140/ostg.clktrk _______________________________________________ E1000-devel mailing list E1000-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/e1000-devel To learn more about Intel® Ethernet, visit http://communities.intel.com/community/wired ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH net 2/2] net: core: explicitly select a txq before doing l2 forwarding 2014-01-06 15:06 ` John Fastabend @ 2014-01-06 15:29 ` Neil Horman 0 siblings, 0 replies; 35+ messages in thread From: Neil Horman @ 2014-01-06 15:29 UTC (permalink / raw) To: John Fastabend Cc: mst, e1000-devel, netdev, Jason Wang, linux-kernel, John Fastabend, davem On Mon, Jan 06, 2014 at 07:06:25AM -0800, John Fastabend wrote: > On 01/06/2014 04:42 AM, Neil Horman wrote: > >On Mon, Jan 06, 2014 at 11:21:07AM +0800, Jason Wang wrote: > >>Currently, the tx queue were selected implicitly in ndo_dfwd_start_xmit(). The > >>will cause several issues: > >> > >>- NETIF_F_LLTX was forced for macvlan device in this case which lead extra lock > >> contention. > >>- dev_hard_start_xmit() was called with NULL txq which bypasses the net device > >> watchdog > >>- dev_hard_start_xmit() does not check txq everywhere which will lead a crash > >> when tso is disabled for lower device. > >> > >>Fix this by explicitly introducing a select queue method just for l2 forwarding > >>offload (ndo_dfwd_select_queue), and introducing dfwd_direct_xmit() to do the > >>queue selecting and transmitting for l2 forwarding. > >> > >>With this fixes, NETIF_F_LLTX could be preserved for macvlan and there's no need > >>to check txq against NULL in dev_hard_start_xmit(). > >> > >>In the future, it was also required for macvtap l2 forwarding support since it > >>provides a necessary synchronization method. > >> > >>Cc: John Fastabend <john.r.fastabend@intel.com> > >>Cc: Neil Horman <nhorman@tuxdriver.com> > >>Cc: e1000-devel@lists.sourceforge.net > >>Signed-off-by: Jason Wang <jasowang@redhat.com> > > > >Instead of creating another operation here to do special queue selection, why > >not just have ndo_dfwd_start_xmit include a pointer to a pointer in its argument > >list, so it can pass the txq it used back to the caller (dev_hard_start_xmit)? > >ndo_dfwd_start_xmit already knows which queue set to pick from (since their > >reserved for the device doing the transmitting). It seems more clear to me than > >creating a new netdevice operation. > > > >As for the crash issue, I'm not sure what you mean. Where in > >dev_hard_start_xmit would we need to check txq that we're not currently, and > >what crash results? > > > >Also, can you elaborate on what you mean by additional lock contention? What > >contention do you see that goes above and beyond the normal locking required by > >txq access? I suppose its extra locking above and beyond in the macvtap case, > >where you would otherwise never hit hardware, but that not the only use case, > >and I think the solution there is likely to add some code in the macvlan feature > >set handler so that NETIF_F_LLTX is cleared if you disable the hardware > >forwarding acceleration via ethtool. > > > > NETIF_F_LLTX is cleared in macvlan_open() which should be used in the > macvtap case. > Thats right, since accelerated hardware tx queue doesn't participate in the network stack queue locking, the upper device needs to do it. Thanks! Neil > if (lowerdev->features & NETIF_F_HW_L2FW_DOFFLOAD) { > vlan->fwd_priv = > > lowerdev->netdev_ops->ndo_dfwd_add_station(lowerdev, dev); > > /* If we get a NULL pointer back, or if we get an error > * then we should just fall through to the non > accelerated path > */ > if (IS_ERR_OR_NULL(vlan->fwd_priv)) { > vlan->fwd_priv = NULL; > } else { > dev->features &= ~NETIF_F_LLTX; > return 0; > } > } > > Thanks, > John > > > -- > John Fastabend Intel Corporation > ------------------------------------------------------------------------------ Rapidly troubleshoot problems before they affect your business. Most IT organizations don't have a clear picture of how application performance affects their revenue. With AppDynamics, you get 100% visibility into your Java,.NET, & PHP application. Start your 15-day FREE TRIAL of AppDynamics Pro! http://pubads.g.doubleclick.net/gampad/clk?id=84349831&iu=/4140/ostg.clktrk _______________________________________________ E1000-devel mailing list E1000-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/e1000-devel To learn more about Intel® Ethernet, visit http://communities.intel.com/community/wired ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH net 2/2] net: core: explicitly select a txq before doing l2 forwarding 2014-01-06 12:42 ` Neil Horman 2014-01-06 15:06 ` John Fastabend @ 2014-01-07 3:42 ` Jason Wang 2014-01-07 13:17 ` Neil Horman 1 sibling, 1 reply; 35+ messages in thread From: Jason Wang @ 2014-01-07 3:42 UTC (permalink / raw) To: Neil Horman; +Cc: mst, e1000-devel, netdev, linux-kernel, John Fastabend, davem On 01/06/2014 08:42 PM, Neil Horman wrote: > On Mon, Jan 06, 2014 at 11:21:07AM +0800, Jason Wang wrote: >> Currently, the tx queue were selected implicitly in ndo_dfwd_start_xmit(). The >> will cause several issues: >> >> - NETIF_F_LLTX was forced for macvlan device in this case which lead extra lock >> contention. >> - dev_hard_start_xmit() was called with NULL txq which bypasses the net device >> watchdog >> - dev_hard_start_xmit() does not check txq everywhere which will lead a crash >> when tso is disabled for lower device. >> >> Fix this by explicitly introducing a select queue method just for l2 forwarding >> offload (ndo_dfwd_select_queue), and introducing dfwd_direct_xmit() to do the >> queue selecting and transmitting for l2 forwarding. >> >> With this fixes, NETIF_F_LLTX could be preserved for macvlan and there's no need >> to check txq against NULL in dev_hard_start_xmit(). >> >> In the future, it was also required for macvtap l2 forwarding support since it >> provides a necessary synchronization method. >> >> Cc: John Fastabend <john.r.fastabend@intel.com> >> Cc: Neil Horman <nhorman@tuxdriver.com> >> Cc: e1000-devel@lists.sourceforge.net >> Signed-off-by: Jason Wang <jasowang@redhat.com> > Instead of creating another operation here to do special queue selection, why > not just have ndo_dfwd_start_xmit include a pointer to a pointer in its argument > list, so it can pass the txq it used back to the caller (dev_hard_start_xmit)? > ndo_dfwd_start_xmit already knows which queue set to pick from (since their > reserved for the device doing the transmitting). It seems more clear to me than > creating a new netdevice operation. See commit 8ffab51b3dfc54876f145f15b351c41f3f703195 ("macvlan: lockless tx path"). The point is keep the tx path lockless to be efficient and simplicity for management. And macvtap multiqueue was also implemented with this assumption. The real contention should be done in the txq of lower device instead of macvlan itself. This is also needed for multiqueue macvtap. > > As for the crash issue, I'm not sure what you mean. Where in > dev_hard_start_xmit would we need to check txq that we're not currently, and > what crash results? Well, see current dev_hard_start_xmit(), if lower device does not support tso or tso is disabled, in gso path: gso: ... txq_trans_update(txq); if (unlikely(netif_xmit_stopped(txq) && skb->next)) There's an obvious NULL pointer dereference. > > Also, can you elaborate on what you mean by additional lock contention? If the lower device has NETIF_F_LLTX, then both macvlan txq lock and the lock of device itself must be held before doing transmission. In the case, the macvlan txq lock contention is obvious unnecessary. > What > contention do you see that goes above and beyond the normal locking required by > txq access? As I said above, the point is keeping the lockess tx path and make the contention of txq for real device instead of macvlan itself. > I suppose its extra locking above and beyond in the macvtap case, > where you would otherwise never hit hardware, but that not the only use case, > and I think the solution there is likely to add some code in the macvlan feature > set handler so that NETIF_F_LLTX is cleared if you disable the hardware > forwarding acceleration via ethtool. > > Regards > Neil > ------------------------------------------------------------------------------ Rapidly troubleshoot problems before they affect your business. Most IT organizations don't have a clear picture of how application performance affects their revenue. With AppDynamics, you get 100% visibility into your Java,.NET, & PHP application. Start your 15-day FREE TRIAL of AppDynamics Pro! http://pubads.g.doubleclick.net/gampad/clk?id=84349831&iu=/4140/ostg.clktrk _______________________________________________ E1000-devel mailing list E1000-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/e1000-devel To learn more about Intel® Ethernet, visit http://communities.intel.com/community/wired ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH net 2/2] net: core: explicitly select a txq before doing l2 forwarding 2014-01-07 3:42 ` Jason Wang @ 2014-01-07 13:17 ` Neil Horman 2014-01-08 3:21 ` Jason Wang 0 siblings, 1 reply; 35+ messages in thread From: Neil Horman @ 2014-01-07 13:17 UTC (permalink / raw) To: Jason Wang; +Cc: davem, netdev, linux-kernel, mst, John Fastabend, e1000-devel On Tue, Jan 07, 2014 at 11:42:24AM +0800, Jason Wang wrote: > On 01/06/2014 08:42 PM, Neil Horman wrote: > > On Mon, Jan 06, 2014 at 11:21:07AM +0800, Jason Wang wrote: > >> Currently, the tx queue were selected implicitly in ndo_dfwd_start_xmit(). The > >> will cause several issues: > >> > >> - NETIF_F_LLTX was forced for macvlan device in this case which lead extra lock > >> contention. > >> - dev_hard_start_xmit() was called with NULL txq which bypasses the net device > >> watchdog > >> - dev_hard_start_xmit() does not check txq everywhere which will lead a crash > >> when tso is disabled for lower device. > >> > >> Fix this by explicitly introducing a select queue method just for l2 forwarding > >> offload (ndo_dfwd_select_queue), and introducing dfwd_direct_xmit() to do the > >> queue selecting and transmitting for l2 forwarding. > >> > >> With this fixes, NETIF_F_LLTX could be preserved for macvlan and there's no need > >> to check txq against NULL in dev_hard_start_xmit(). > >> > >> In the future, it was also required for macvtap l2 forwarding support since it > >> provides a necessary synchronization method. > >> > >> Cc: John Fastabend <john.r.fastabend@intel.com> > >> Cc: Neil Horman <nhorman@tuxdriver.com> > >> Cc: e1000-devel@lists.sourceforge.net > >> Signed-off-by: Jason Wang <jasowang@redhat.com> > > Instead of creating another operation here to do special queue selection, why > > not just have ndo_dfwd_start_xmit include a pointer to a pointer in its argument > > list, so it can pass the txq it used back to the caller (dev_hard_start_xmit)? > > ndo_dfwd_start_xmit already knows which queue set to pick from (since their > > reserved for the device doing the transmitting). It seems more clear to me than > > creating a new netdevice operation. > > See commit 8ffab51b3dfc54876f145f15b351c41f3f703195 ("macvlan: lockless > tx path"). The point is keep the tx path lockless to be efficient and > simplicity for management. And macvtap multiqueue was also implemented > with this assumption. The real contention should be done in the txq of > lower device instead of macvlan itself. This is also needed for > multiqueue macvtap. > > Ok, I see how you're preserving LLTX here, and thats great, but it doesn't really buy us anything that I can see. If a macvlan is using hardware acceleration, it needs to arbitrate access to that hardware. Weather thats done by locking the lowerdev's tx queue lock or by enforcing locking on the macvlan itself is equivalent. The decision to use dfwd hardware acceleration is made on open, so its not like theres any traffic that can avoid the lock, as it all goes through the hardware. All I see that this has bought us is an extra net_device method (which isn't a big deal, but not necessecary as I see it). It seems like the right solution here is to use ethtool to disable the dfwd acceleration feature on the hardware if you don't want to incur the additional locking. > > As for the crash issue, I'm not sure what you mean. Where in > > dev_hard_start_xmit would we need to check txq that we're not currently, and > > what crash results? > > Well, see current dev_hard_start_xmit(), if lower device does not > support tso or tso is disabled, in gso path: > > gso: > ... > txq_trans_update(txq); > if (unlikely(netif_xmit_stopped(txq) && skb->next)) > > There's an obvious NULL pointer dereference. Yeah, I saw that after I wrote my note, sorry about that. However, it doesn't change what I said above. i don't think theres a need to add an additional select_queue method here, just add a pointer to a pointer arg to the dfwd xmit routine to fill in the queue that was selected on transmit. That should resolve all the potential NULL pointers without the need to ad additional methods to net_device_ops. > > > > Also, can you elaborate on what you mean by additional lock contention? > > If the lower device has NETIF_F_LLTX, then both macvlan txq lock and the > lock of device itself must be held before doing transmission. In the > case, the macvlan txq lock contention is obvious unnecessary. Not in its current implementation. The lowerdevice's NETIF_F_LLTX are ignored during transmit because we're using a privately allocated set of queues assigned to the transmitting macvlan (i.e. there is not contention on the lowerdevice for any single given macvlan). Thats why we have to clear the NETIF_F_LLTX field on the macvlan itself. using dfwd acceleration is effectively giving a macvlan its own hardware, that needs arbitration between every context thats trying to transmit over it. That said, I did just notice that if the macvlan has multiple queues we will still use a single tx queue on transmit, we should map those queues to multiple hardware queues. I'll fix that shortly. > > What > > contention do you see that goes above and beyond the normal locking required by > > txq access? > > As I said above, the point is keeping the lockess tx path and make the > contention of txq for real device instead of macvlan itself. But you've not done that. You've just moved the locking down to the lowerdev, which is fine I suppose, but doesn't actually improve anything, since we will still lock exactly as many time as if we do the locking in the macvlan. Regards Neil > > I suppose its extra locking above and beyond in the macvtap case, > > where you would otherwise never hit hardware, but that not the only use case, > > and I think the solution there is likely to add some code in the macvlan feature > > set handler so that NETIF_F_LLTX is cleared if you disable the hardware > > forwarding acceleration via ethtool. > > > > Regards > > Neil > > > > ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH net 2/2] net: core: explicitly select a txq before doing l2 forwarding 2014-01-07 13:17 ` Neil Horman @ 2014-01-08 3:21 ` Jason Wang 2014-01-08 14:40 ` Neil Horman 0 siblings, 1 reply; 35+ messages in thread From: Jason Wang @ 2014-01-08 3:21 UTC (permalink / raw) To: Neil Horman; +Cc: mst, e1000-devel, netdev, linux-kernel, John Fastabend, davem On 01/07/2014 09:17 PM, Neil Horman wrote: > On Tue, Jan 07, 2014 at 11:42:24AM +0800, Jason Wang wrote: >> On 01/06/2014 08:42 PM, Neil Horman wrote: >>> On Mon, Jan 06, 2014 at 11:21:07AM +0800, Jason Wang wrote: >>>> Currently, the tx queue were selected implicitly in ndo_dfwd_start_xmit(). The >>>> will cause several issues: >>>> >>>> - NETIF_F_LLTX was forced for macvlan device in this case which lead extra lock >>>> contention. >>>> - dev_hard_start_xmit() was called with NULL txq which bypasses the net device >>>> watchdog >>>> - dev_hard_start_xmit() does not check txq everywhere which will lead a crash >>>> when tso is disabled for lower device. >>>> >>>> Fix this by explicitly introducing a select queue method just for l2 forwarding >>>> offload (ndo_dfwd_select_queue), and introducing dfwd_direct_xmit() to do the >>>> queue selecting and transmitting for l2 forwarding. >>>> >>>> With this fixes, NETIF_F_LLTX could be preserved for macvlan and there's no need >>>> to check txq against NULL in dev_hard_start_xmit(). >>>> >>>> In the future, it was also required for macvtap l2 forwarding support since it >>>> provides a necessary synchronization method. >>>> >>>> Cc: John Fastabend <john.r.fastabend@intel.com> >>>> Cc: Neil Horman <nhorman@tuxdriver.com> >>>> Cc: e1000-devel@lists.sourceforge.net >>>> Signed-off-by: Jason Wang <jasowang@redhat.com> >>> Instead of creating another operation here to do special queue selection, why >>> not just have ndo_dfwd_start_xmit include a pointer to a pointer in its argument >>> list, so it can pass the txq it used back to the caller (dev_hard_start_xmit)? >>> ndo_dfwd_start_xmit already knows which queue set to pick from (since their >>> reserved for the device doing the transmitting). It seems more clear to me than >>> creating a new netdevice operation. >> See commit 8ffab51b3dfc54876f145f15b351c41f3f703195 ("macvlan: lockless >> tx path"). The point is keep the tx path lockless to be efficient and >> simplicity for management. And macvtap multiqueue was also implemented >> with this assumption. The real contention should be done in the txq of >> lower device instead of macvlan itself. This is also needed for >> multiqueue macvtap. > Ok, I see how you're preserving LLTX here, and thats great, but it doesn't > really buy us anything that I can see. If a macvlan is using hardware > acceleration, it needs to arbitrate access to that hardware. Weather thats done > by locking the lowerdev's tx queue lock or by enforcing locking on the macvlan > itself is equivalent. The decision to use dfwd hardware acceleration is made on > open, so its not like theres any traffic that can avoid the lock, as it all goes > through the hardware. All I see that this has bought us is an extra net_device > method (which isn't a big deal, but not necessecary as I see it). As I replied to patch 1/2, looking at the code itself again. The locking on the lowerdev's tx queue is really need since we need synchronize with other control path. Two examples are dev watchdog and ixgbe_down() both of which will try to hold tx lock to synchronize the with transmission. Without holding the lowerdev tx lock, we may have more serious issues. Also, it's a little strange for a net device has two modes. Future developers need to care about two different tx lock paths which is sub optimal. For the issue of an extra net_device method, if you don't like we can reuse the ndo_select_queue by also passing the accel_priv to that method. > > It seems like the right solution here is to use ethtool to disable the dfwd > acceleration feature on the hardware if you don't want to incur the additional > locking. > >>> As for the crash issue, I'm not sure what you mean. Where in >>> dev_hard_start_xmit would we need to check txq that we're not currently, and >>> what crash results? >> Well, see current dev_hard_start_xmit(), if lower device does not >> support tso or tso is disabled, in gso path: >> >> gso: >> ... >> txq_trans_update(txq); >> if (unlikely(netif_xmit_stopped(txq) && skb->next)) >> >> There's an obvious NULL pointer dereference. > Yeah, I saw that after I wrote my note, sorry about that. However, it doesn't > change what I said above. i don't think theres a need to add an additional > select_queue method here, just add a pointer to a pointer arg to the dfwd xmit > routine to fill in the queue that was selected on transmit. That should resolve > all the potential NULL pointers without the need to ad additional methods to > net_device_ops. > >>> Also, can you elaborate on what you mean by additional lock contention? >> If the lower device has NETIF_F_LLTX, then both macvlan txq lock and the >> lock of device itself must be held before doing transmission. In the >> case, the macvlan txq lock contention is obvious unnecessary. > Not in its current implementation. The lowerdevice's NETIF_F_LLTX are ignored > during transmit because we're using a privately allocated set of queues assigned > to the transmitting macvlan (i.e. there is not contention on the lowerdevice for > any single given macvlan). Thats why we have to clear the NETIF_F_LLTX field on > the macvlan itself. using dfwd acceleration is effectively giving a macvlan its > own hardware, that needs arbitration between every context thats trying to > transmit over it. > > > That said, I did just notice that if the macvlan has multiple queues we will > still use a single tx queue on transmit, we should map those queues to multiple > hardware queues. I'll fix that shortly. > >>> What >>> contention do you see that goes above and beyond the normal locking required by >>> txq access? >> As I said above, the point is keeping the lockess tx path and make the >> contention of txq for real device instead of macvlan itself. > But you've not done that. You've just moved the locking down to the lowerdev, > which is fine I suppose, but doesn't actually improve anything, since we will > still lock exactly as many time as if we do the locking in the macvlan. > > Regards > Neil > >>> I suppose its extra locking above and beyond in the macvtap case, >>> where you would otherwise never hit hardware, but that not the only use case, >>> and I think the solution there is likely to add some code in the macvlan feature >>> set handler so that NETIF_F_LLTX is cleared if you disable the hardware >>> forwarding acceleration via ethtool. >>> >>> Regards >>> Neil >>> >> > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ------------------------------------------------------------------------------ Rapidly troubleshoot problems before they affect your business. Most IT organizations don't have a clear picture of how application performance affects their revenue. With AppDynamics, you get 100% visibility into your Java,.NET, & PHP application. Start your 15-day FREE TRIAL of AppDynamics Pro! http://pubads.g.doubleclick.net/gampad/clk?id=84349831&iu=/4140/ostg.clktrk _______________________________________________ E1000-devel mailing list E1000-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/e1000-devel To learn more about Intel® Ethernet, visit http://communities.intel.com/community/wired ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH net 2/2] net: core: explicitly select a txq before doing l2 forwarding 2014-01-08 3:21 ` Jason Wang @ 2014-01-08 14:40 ` Neil Horman 2014-01-09 8:28 ` Jason Wang 0 siblings, 1 reply; 35+ messages in thread From: Neil Horman @ 2014-01-08 14:40 UTC (permalink / raw) To: Jason Wang; +Cc: davem, netdev, linux-kernel, mst, John Fastabend, e1000-devel On Wed, Jan 08, 2014 at 11:21:21AM +0800, Jason Wang wrote: > On 01/07/2014 09:17 PM, Neil Horman wrote: > > On Tue, Jan 07, 2014 at 11:42:24AM +0800, Jason Wang wrote: > >> On 01/06/2014 08:42 PM, Neil Horman wrote: > >>> On Mon, Jan 06, 2014 at 11:21:07AM +0800, Jason Wang wrote: > >>>> Currently, the tx queue were selected implicitly in ndo_dfwd_start_xmit(). The > >>>> will cause several issues: > >>>> > >>>> - NETIF_F_LLTX was forced for macvlan device in this case which lead extra lock > >>>> contention. > >>>> - dev_hard_start_xmit() was called with NULL txq which bypasses the net device > >>>> watchdog > >>>> - dev_hard_start_xmit() does not check txq everywhere which will lead a crash > >>>> when tso is disabled for lower device. > >>>> > >>>> Fix this by explicitly introducing a select queue method just for l2 forwarding > >>>> offload (ndo_dfwd_select_queue), and introducing dfwd_direct_xmit() to do the > >>>> queue selecting and transmitting for l2 forwarding. > >>>> > >>>> With this fixes, NETIF_F_LLTX could be preserved for macvlan and there's no need > >>>> to check txq against NULL in dev_hard_start_xmit(). > >>>> > >>>> In the future, it was also required for macvtap l2 forwarding support since it > >>>> provides a necessary synchronization method. > >>>> > >>>> Cc: John Fastabend <john.r.fastabend@intel.com> > >>>> Cc: Neil Horman <nhorman@tuxdriver.com> > >>>> Cc: e1000-devel@lists.sourceforge.net > >>>> Signed-off-by: Jason Wang <jasowang@redhat.com> > >>> Instead of creating another operation here to do special queue selection, why > >>> not just have ndo_dfwd_start_xmit include a pointer to a pointer in its argument > >>> list, so it can pass the txq it used back to the caller (dev_hard_start_xmit)? > >>> ndo_dfwd_start_xmit already knows which queue set to pick from (since their > >>> reserved for the device doing the transmitting). It seems more clear to me than > >>> creating a new netdevice operation. > >> See commit 8ffab51b3dfc54876f145f15b351c41f3f703195 ("macvlan: lockless > >> tx path"). The point is keep the tx path lockless to be efficient and > >> simplicity for management. And macvtap multiqueue was also implemented > >> with this assumption. The real contention should be done in the txq of > >> lower device instead of macvlan itself. This is also needed for > >> multiqueue macvtap. > > Ok, I see how you're preserving LLTX here, and thats great, but it doesn't > > really buy us anything that I can see. If a macvlan is using hardware > > acceleration, it needs to arbitrate access to that hardware. Weather thats done > > by locking the lowerdev's tx queue lock or by enforcing locking on the macvlan > > itself is equivalent. The decision to use dfwd hardware acceleration is made on > > open, so its not like theres any traffic that can avoid the lock, as it all goes > > through the hardware. All I see that this has bought us is an extra net_device > > method (which isn't a big deal, but not necessecary as I see it). > > As I replied to patch 1/2, looking at the code itself again. The locking > on the lowerdev's tx queue is really need since we need synchronize with > other control path. Two examples are dev watchdog and ixgbe_down() both > of which will try to hold tx lock to synchronize the with transmission. > Without holding the lowerdev tx lock, we may have more serious issues. > Also, it's a little strange for a net device has two modes. Future > developers need to care about two different tx lock paths which is sub > optimal. > Ok, having looked at this for a few hours, I agree, locking in the lowerdev has some definiate advantages in plugging the holes you've pointed out. > For the issue of an extra net_device method, if you don't like we can > reuse the ndo_select_queue by also passing the accel_priv to that method. I do, that actually simplifies things, since it lets us use the entire dev_hard_start_xmit path unmodified, which gives us the locking your looking for without having to create a new slimmed down variant of dev_hard_start_xmit. Regards Neil ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH net 2/2] net: core: explicitly select a txq before doing l2 forwarding 2014-01-08 14:40 ` Neil Horman @ 2014-01-09 8:28 ` Jason Wang 2014-01-09 11:53 ` Neil Horman 0 siblings, 1 reply; 35+ messages in thread From: Jason Wang @ 2014-01-09 8:28 UTC (permalink / raw) To: Neil Horman; +Cc: mst, e1000-devel, netdev, linux-kernel, John Fastabend, davem On 01/08/2014 10:40 PM, Neil Horman wrote: > On Wed, Jan 08, 2014 at 11:21:21AM +0800, Jason Wang wrote: >> On 01/07/2014 09:17 PM, Neil Horman wrote: >>> On Tue, Jan 07, 2014 at 11:42:24AM +0800, Jason Wang wrote: >>>> On 01/06/2014 08:42 PM, Neil Horman wrote: >>>>> On Mon, Jan 06, 2014 at 11:21:07AM +0800, Jason Wang wrote: >>>>>> Currently, the tx queue were selected implicitly in ndo_dfwd_start_xmit(). The >>>>>> will cause several issues: >>>>>> >>>>>> - NETIF_F_LLTX was forced for macvlan device in this case which lead extra lock >>>>>> contention. >>>>>> - dev_hard_start_xmit() was called with NULL txq which bypasses the net device >>>>>> watchdog >>>>>> - dev_hard_start_xmit() does not check txq everywhere which will lead a crash >>>>>> when tso is disabled for lower device. >>>>>> >>>>>> Fix this by explicitly introducing a select queue method just for l2 forwarding >>>>>> offload (ndo_dfwd_select_queue), and introducing dfwd_direct_xmit() to do the >>>>>> queue selecting and transmitting for l2 forwarding. >>>>>> >>>>>> With this fixes, NETIF_F_LLTX could be preserved for macvlan and there's no need >>>>>> to check txq against NULL in dev_hard_start_xmit(). >>>>>> >>>>>> In the future, it was also required for macvtap l2 forwarding support since it >>>>>> provides a necessary synchronization method. >>>>>> >>>>>> Cc: John Fastabend <john.r.fastabend@intel.com> >>>>>> Cc: Neil Horman <nhorman@tuxdriver.com> >>>>>> Cc: e1000-devel@lists.sourceforge.net >>>>>> Signed-off-by: Jason Wang <jasowang@redhat.com> >>>>> Instead of creating another operation here to do special queue selection, why >>>>> not just have ndo_dfwd_start_xmit include a pointer to a pointer in its argument >>>>> list, so it can pass the txq it used back to the caller (dev_hard_start_xmit)? >>>>> ndo_dfwd_start_xmit already knows which queue set to pick from (since their >>>>> reserved for the device doing the transmitting). It seems more clear to me than >>>>> creating a new netdevice operation. >>>> See commit 8ffab51b3dfc54876f145f15b351c41f3f703195 ("macvlan: lockless >>>> tx path"). The point is keep the tx path lockless to be efficient and >>>> simplicity for management. And macvtap multiqueue was also implemented >>>> with this assumption. The real contention should be done in the txq of >>>> lower device instead of macvlan itself. This is also needed for >>>> multiqueue macvtap. >>> Ok, I see how you're preserving LLTX here, and thats great, but it doesn't >>> really buy us anything that I can see. If a macvlan is using hardware >>> acceleration, it needs to arbitrate access to that hardware. Weather thats done >>> by locking the lowerdev's tx queue lock or by enforcing locking on the macvlan >>> itself is equivalent. The decision to use dfwd hardware acceleration is made on >>> open, so its not like theres any traffic that can avoid the lock, as it all goes >>> through the hardware. All I see that this has bought us is an extra net_device >>> method (which isn't a big deal, but not necessecary as I see it). >> As I replied to patch 1/2, looking at the code itself again. The locking >> on the lowerdev's tx queue is really need since we need synchronize with >> other control path. Two examples are dev watchdog and ixgbe_down() both >> of which will try to hold tx lock to synchronize the with transmission. >> Without holding the lowerdev tx lock, we may have more serious issues. >> Also, it's a little strange for a net device has two modes. Future >> developers need to care about two different tx lock paths which is sub >> optimal. >> > Ok, having looked at this for a few hours, I agree, locking in the lowerdev has > some definiate advantages in plugging the holes you've pointed out. > >> For the issue of an extra net_device method, if you don't like we can >> reuse the ndo_select_queue by also passing the accel_priv to that method. > I do, that actually simplifies things, since it lets us use the entire > dev_hard_start_xmit path unmodified, which gives us the locking your looking for > without having to create a new slimmed down variant of dev_hard_start_xmit. > > Regards > Neil Right, will post V2. ------------------------------------------------------------------------------ CenturyLink Cloud: The Leader in Enterprise Cloud Services. Learn Why More Businesses Are Choosing CenturyLink Cloud For Critical Workloads, Development Environments & Everything In Between. Get a Quote or Start a Free Trial Today. http://pubads.g.doubleclick.net/gampad/clk?id=119420431&iu=/4140/ostg.clktrk _______________________________________________ E1000-devel mailing list E1000-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/e1000-devel To learn more about Intel® Ethernet, visit http://communities.intel.com/community/wired ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH net 2/2] net: core: explicitly select a txq before doing l2 forwarding 2014-01-09 8:28 ` Jason Wang @ 2014-01-09 11:53 ` Neil Horman 0 siblings, 0 replies; 35+ messages in thread From: Neil Horman @ 2014-01-09 11:53 UTC (permalink / raw) To: Jason Wang; +Cc: mst, e1000-devel, netdev, linux-kernel, John Fastabend, davem On Thu, Jan 09, 2014 at 04:28:49PM +0800, Jason Wang wrote: > On 01/08/2014 10:40 PM, Neil Horman wrote: > > On Wed, Jan 08, 2014 at 11:21:21AM +0800, Jason Wang wrote: > >> On 01/07/2014 09:17 PM, Neil Horman wrote: > >>> On Tue, Jan 07, 2014 at 11:42:24AM +0800, Jason Wang wrote: > >>>> On 01/06/2014 08:42 PM, Neil Horman wrote: > >>>>> On Mon, Jan 06, 2014 at 11:21:07AM +0800, Jason Wang wrote: > >>>>>> Currently, the tx queue were selected implicitly in ndo_dfwd_start_xmit(). The > >>>>>> will cause several issues: > >>>>>> > >>>>>> - NETIF_F_LLTX was forced for macvlan device in this case which lead extra lock > >>>>>> contention. > >>>>>> - dev_hard_start_xmit() was called with NULL txq which bypasses the net device > >>>>>> watchdog > >>>>>> - dev_hard_start_xmit() does not check txq everywhere which will lead a crash > >>>>>> when tso is disabled for lower device. > >>>>>> > >>>>>> Fix this by explicitly introducing a select queue method just for l2 forwarding > >>>>>> offload (ndo_dfwd_select_queue), and introducing dfwd_direct_xmit() to do the > >>>>>> queue selecting and transmitting for l2 forwarding. > >>>>>> > >>>>>> With this fixes, NETIF_F_LLTX could be preserved for macvlan and there's no need > >>>>>> to check txq against NULL in dev_hard_start_xmit(). > >>>>>> > >>>>>> In the future, it was also required for macvtap l2 forwarding support since it > >>>>>> provides a necessary synchronization method. > >>>>>> > >>>>>> Cc: John Fastabend <john.r.fastabend@intel.com> > >>>>>> Cc: Neil Horman <nhorman@tuxdriver.com> > >>>>>> Cc: e1000-devel@lists.sourceforge.net > >>>>>> Signed-off-by: Jason Wang <jasowang@redhat.com> > >>>>> Instead of creating another operation here to do special queue selection, why > >>>>> not just have ndo_dfwd_start_xmit include a pointer to a pointer in its argument > >>>>> list, so it can pass the txq it used back to the caller (dev_hard_start_xmit)? > >>>>> ndo_dfwd_start_xmit already knows which queue set to pick from (since their > >>>>> reserved for the device doing the transmitting). It seems more clear to me than > >>>>> creating a new netdevice operation. > >>>> See commit 8ffab51b3dfc54876f145f15b351c41f3f703195 ("macvlan: lockless > >>>> tx path"). The point is keep the tx path lockless to be efficient and > >>>> simplicity for management. And macvtap multiqueue was also implemented > >>>> with this assumption. The real contention should be done in the txq of > >>>> lower device instead of macvlan itself. This is also needed for > >>>> multiqueue macvtap. > >>> Ok, I see how you're preserving LLTX here, and thats great, but it doesn't > >>> really buy us anything that I can see. If a macvlan is using hardware > >>> acceleration, it needs to arbitrate access to that hardware. Weather thats done > >>> by locking the lowerdev's tx queue lock or by enforcing locking on the macvlan > >>> itself is equivalent. The decision to use dfwd hardware acceleration is made on > >>> open, so its not like theres any traffic that can avoid the lock, as it all goes > >>> through the hardware. All I see that this has bought us is an extra net_device > >>> method (which isn't a big deal, but not necessecary as I see it). > >> As I replied to patch 1/2, looking at the code itself again. The locking > >> on the lowerdev's tx queue is really need since we need synchronize with > >> other control path. Two examples are dev watchdog and ixgbe_down() both > >> of which will try to hold tx lock to synchronize the with transmission. > >> Without holding the lowerdev tx lock, we may have more serious issues. > >> Also, it's a little strange for a net device has two modes. Future > >> developers need to care about two different tx lock paths which is sub > >> optimal. > >> > > Ok, having looked at this for a few hours, I agree, locking in the lowerdev has > > some definiate advantages in plugging the holes you've pointed out. > > > >> For the issue of an extra net_device method, if you don't like we can > >> reuse the ndo_select_queue by also passing the accel_priv to that method. > > I do, that actually simplifies things, since it lets us use the entire > > dev_hard_start_xmit path unmodified, which gives us the locking your looking for > > without having to create a new slimmed down variant of dev_hard_start_xmit. > > > > Regards > > Neil > > Right, will post V2. > Thanks Neil ------------------------------------------------------------------------------ CenturyLink Cloud: The Leader in Enterprise Cloud Services. Learn Why More Businesses Are Choosing CenturyLink Cloud For Critical Workloads, Development Environments & Everything In Between. Get a Quote or Start a Free Trial Today. http://pubads.g.doubleclick.net/gampad/clk?id=119420431&iu=/4140/ostg.clktrk _______________________________________________ E1000-devel mailing list E1000-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/e1000-devel To learn more about Intel® Ethernet, visit http://communities.intel.com/community/wired ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH net 2/2] net: core: explicitly select a txq before doing l2 forwarding 2014-01-06 3:21 ` [PATCH net 2/2] net: core: explicitly select a txq before doing l2 forwarding Jason Wang 2014-01-06 12:04 ` Jeff Kirsher 2014-01-06 12:42 ` Neil Horman @ 2014-01-07 8:22 ` John Fastabend 2014-01-07 8:37 ` John Fastabend 2 siblings, 1 reply; 35+ messages in thread From: John Fastabend @ 2014-01-07 8:22 UTC (permalink / raw) To: Jason Wang; +Cc: davem, netdev, linux-kernel, mst, Neil Horman, e1000-devel On 1/5/2014 7:21 PM, Jason Wang wrote: > Currently, the tx queue were selected implicitly in ndo_dfwd_start_xmit(). The > will cause several issues: > > - NETIF_F_LLTX was forced for macvlan device in this case which lead extra lock > contention. > - dev_hard_start_xmit() was called with NULL txq which bypasses the net device > watchdog > - dev_hard_start_xmit() does not check txq everywhere which will lead a crash > when tso is disabled for lower device. > > Fix this by explicitly introducing a select queue method just for l2 forwarding > offload (ndo_dfwd_select_queue), and introducing dfwd_direct_xmit() to do the > queue selecting and transmitting for l2 forwarding. > > With this fixes, NETIF_F_LLTX could be preserved for macvlan and there's no need > to check txq against NULL in dev_hard_start_xmit(). > > In the future, it was also required for macvtap l2 forwarding support since it > provides a necessary synchronization method. > > Cc: John Fastabend <john.r.fastabend@intel.com> > Cc: Neil Horman <nhorman@tuxdriver.com> > Cc: e1000-devel@lists.sourceforge.net > Signed-off-by: Jason Wang <jasowang@redhat.com> > --- [...] > index 4fc1722..bc2b03f 100644 > --- a/net/core/dev.c > +++ b/net/core/dev.c > @@ -2538,6 +2538,32 @@ static inline int skb_needs_linearize(struct sk_buff *skb, > !(features & NETIF_F_SG))); > } > > +int dfwd_direct_xmit(struct sk_buff *skb, struct net_device *dev, > + void *accel_priv) > +{ > + struct netdev_queue *txq; > + int ret = NETDEV_TX_BUSY; > + int index; > + > + BUG_ON(!dev->netdev_ops->ndo_dfwd_select_queue); > + index = dev->netdev_ops->ndo_dfwd_select_queue(dev, skb, > + accel_priv); > + > + local_bh_disable(); > + > + skb_set_queue_mapping(skb, index); How about replacing the index calculation and skb_set_queue_mapping with netdev_pick_tx(). Then we don't need to add a new op and the existing XPS, tx hash and select_queue() op works. > + txq = netdev_get_tx_queue(dev, index); > + > + HARD_TX_LOCK(dev, txq, smp_processor_id()); > + if (!netif_xmit_frozen_or_stopped(txq)) > + ret = dev_hard_start_xmit(skb, dev, txq, accel_priv); > + HARD_TX_UNLOCK(dev, txq); > + > + local_bh_enable(); > + return ret; > +} > +EXPORT_SYMBOL_GPL(dfwd_direct_xmit); > + > int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev, > struct netdev_queue *txq, void *accel_priv) > { > @@ -2611,7 +2637,7 @@ int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev, > rc = ops->ndo_start_xmit(skb, dev); > > trace_net_dev_xmit(skb, rc, dev, skb_len); > - if (rc == NETDEV_TX_OK && txq) > + if (rc == NETDEV_TX_OK) > txq_trans_update(txq); Removing the check here rather than adding more checks in the gso case as I suggested in the other thread seems cleaner. Thanks! John > return rc; > } > ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH net 2/2] net: core: explicitly select a txq before doing l2 forwarding 2014-01-07 8:22 ` John Fastabend @ 2014-01-07 8:37 ` John Fastabend 0 siblings, 0 replies; 35+ messages in thread From: John Fastabend @ 2014-01-07 8:37 UTC (permalink / raw) To: Jason Wang; +Cc: davem, netdev, linux-kernel, mst, Neil Horman, e1000-devel [...] >> >> +int dfwd_direct_xmit(struct sk_buff *skb, struct net_device *dev, >> + void *accel_priv) >> +{ >> + struct netdev_queue *txq; >> + int ret = NETDEV_TX_BUSY; >> + int index; >> + >> + BUG_ON(!dev->netdev_ops->ndo_dfwd_select_queue); >> + index = dev->netdev_ops->ndo_dfwd_select_queue(dev, skb, >> + accel_priv); >> + >> + local_bh_disable(); >> + >> + skb_set_queue_mapping(skb, index); > > How about replacing the index calculation and skb_set_queue_mapping with > netdev_pick_tx(). Then we don't need to add a new op and the existing > XPS, tx hash and select_queue() op works. > Sorry for the noise that was dumb it wouldn't work. ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH net 1/2] macvlan: forbid L2 fowarding offload for macvtap 2014-01-06 3:21 [PATCH net 1/2] macvlan: forbid L2 fowarding offload for macvtap Jason Wang 2014-01-06 3:21 ` [PATCH net 2/2] net: core: explicitly select a txq before doing l2 forwarding Jason Wang @ 2014-01-06 7:35 ` John Fastabend 2014-01-06 7:54 ` Jason Wang 2014-01-06 20:47 ` David Miller 2 siblings, 1 reply; 35+ messages in thread From: John Fastabend @ 2014-01-06 7:35 UTC (permalink / raw) To: Jason Wang; +Cc: davem, netdev, linux-kernel, mst, John Fastabend, Neil Horman On 01/05/2014 07:21 PM, Jason Wang wrote: > L2 fowarding offload will bypass the rx handler of real device. This will make > the packet could not be forwarded to macvtap device. Another problem is the > dev_hard_start_xmit() called for macvtap does not have any synchronization. > > Fix this by forbidding L2 forwarding for macvtap. > > Cc: John Fastabend <john.r.fastabend@intel.com> > Cc: Neil Horman <nhorman@tuxdriver.com> > Signed-off-by: Jason Wang <jasowang@redhat.com> > --- > drivers/net/macvlan.c | 5 ++++- > 1 files changed, 4 insertions(+), 1 deletions(-) > I must be missing something. The lower layer device should set skb->dev to the correct macvtap device on receive so that in netif_receive_skb_core() the macvtap handler is hit. Skipping the macvlan receive handler should be OK because the switching was done by the hardware. If I read macvtap.c correctly macvlan_common_newlink() is called with 'dev' where 'dev' is the macvtap device. Any idea what I'm missing? I guess I'll need to setup a macvtap test case. And what synchronization are you worried about on dev_hard_start_xmit()? In the L2 forwarding offload case macvlan_open() clears the NETIF_F_LLTX flag so HARD_TX_LOCK protects the driver txq. We might hit this warning in dev_queue_xmit() though, net_crit_ratelimited("Virtual device %s asks to queue packet!\n", Perhaps we can remove it. > diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c > index 60406b0..5360f73 100644 > --- a/drivers/net/macvlan.c > +++ b/drivers/net/macvlan.c > @@ -338,6 +338,8 @@ static const struct header_ops macvlan_hard_header_ops = { > .cache_update = eth_header_cache_update, > }; > > +static struct rtnl_link_ops macvlan_link_ops; > + > static int macvlan_open(struct net_device *dev) > { > struct macvlan_dev *vlan = netdev_priv(dev); > @@ -353,7 +355,8 @@ static int macvlan_open(struct net_device *dev) > goto hash_add; > } > > - if (lowerdev->features & NETIF_F_HW_L2FW_DOFFLOAD) { > + if (lowerdev->features & NETIF_F_HW_L2FW_DOFFLOAD && > + dev->rtnl_link_ops == &macvlan_link_ops) { > vlan->fwd_priv = > lowerdev->netdev_ops->ndo_dfwd_add_station(lowerdev, dev); > > -- John Fastabend Intel Corporation ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH net 1/2] macvlan: forbid L2 fowarding offload for macvtap 2014-01-06 7:35 ` [PATCH net 1/2] macvlan: forbid L2 fowarding offload for macvtap John Fastabend @ 2014-01-06 7:54 ` Jason Wang 2014-01-06 12:26 ` Neil Horman 0 siblings, 1 reply; 35+ messages in thread From: Jason Wang @ 2014-01-06 7:54 UTC (permalink / raw) To: John Fastabend Cc: davem, netdev, linux-kernel, mst, John Fastabend, Neil Horman On 01/06/2014 03:35 PM, John Fastabend wrote: > On 01/05/2014 07:21 PM, Jason Wang wrote: >> L2 fowarding offload will bypass the rx handler of real device. This >> will make >> the packet could not be forwarded to macvtap device. Another problem >> is the >> dev_hard_start_xmit() called for macvtap does not have any >> synchronization. >> >> Fix this by forbidding L2 forwarding for macvtap. >> >> Cc: John Fastabend <john.r.fastabend@intel.com> >> Cc: Neil Horman <nhorman@tuxdriver.com> >> Signed-off-by: Jason Wang <jasowang@redhat.com> >> --- >> drivers/net/macvlan.c | 5 ++++- >> 1 files changed, 4 insertions(+), 1 deletions(-) >> > > I must be missing something. > > The lower layer device should set skb->dev to the correct macvtap > device on receive so that in netif_receive_skb_core() the macvtap > handler is hit. Skipping the macvlan receive handler should be OK > because the switching was done by the hardware. If I read macvtap.c > correctly macvlan_common_newlink() is called with 'dev' where 'dev' > is the macvtap device. Any idea what I'm missing? I guess I'll need > to setup a macvtap test case. Unlike macvlan, macvtap depends on rx handler on the lower device to work. In this case macvlan_handle_frame() will call macvtap_receive(). So doing netif_receive_skb_core() for macvtap device directly won't work since we need to forward the packet to userspace instead of kernel. For net-next.git, it may work since commit 6acf54f1cf0a6747bac9fea26f34cfc5a9029523 let macvtap device register an rx handler for itself. > > And what synchronization are you worried about on dev_hard_start_xmit()? > In the L2 forwarding offload case macvlan_open() clears the NETIF_F_LLTX > flag so HARD_TX_LOCK protects the driver txq. We might hit this warning > in dev_queue_xmit() though, > > net_crit_ratelimited("Virtual device %s asks to queue packet!\n", > > Perhaps we can remove it. The problem is macvtap does not call dev_queue_xmit() for macvlan device. It calls macvlan_start_xmit() directly from macvtap_get_user(). So HARD_TX_LOCK was not done for the txq. > >> diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c >> index 60406b0..5360f73 100644 >> --- a/drivers/net/macvlan.c >> +++ b/drivers/net/macvlan.c >> @@ -338,6 +338,8 @@ static const struct header_ops >> macvlan_hard_header_ops = { >> .cache_update = eth_header_cache_update, >> }; >> >> +static struct rtnl_link_ops macvlan_link_ops; >> + >> static int macvlan_open(struct net_device *dev) >> { >> struct macvlan_dev *vlan = netdev_priv(dev); >> @@ -353,7 +355,8 @@ static int macvlan_open(struct net_device *dev) >> goto hash_add; >> } >> >> - if (lowerdev->features & NETIF_F_HW_L2FW_DOFFLOAD) { >> + if (lowerdev->features & NETIF_F_HW_L2FW_DOFFLOAD && >> + dev->rtnl_link_ops == &macvlan_link_ops) { >> vlan->fwd_priv = >> lowerdev->netdev_ops->ndo_dfwd_add_station(lowerdev, >> dev); >> >> > > ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH net 1/2] macvlan: forbid L2 fowarding offload for macvtap 2014-01-06 7:54 ` Jason Wang @ 2014-01-06 12:26 ` Neil Horman 2014-01-07 3:10 ` Jason Wang 0 siblings, 1 reply; 35+ messages in thread From: Neil Horman @ 2014-01-06 12:26 UTC (permalink / raw) To: Jason Wang Cc: John Fastabend, davem, netdev, linux-kernel, mst, John Fastabend On Mon, Jan 06, 2014 at 03:54:21PM +0800, Jason Wang wrote: > On 01/06/2014 03:35 PM, John Fastabend wrote: > > On 01/05/2014 07:21 PM, Jason Wang wrote: > >> L2 fowarding offload will bypass the rx handler of real device. This > >> will make > >> the packet could not be forwarded to macvtap device. Another problem > >> is the > >> dev_hard_start_xmit() called for macvtap does not have any > >> synchronization. > >> > >> Fix this by forbidding L2 forwarding for macvtap. > >> > >> Cc: John Fastabend <john.r.fastabend@intel.com> > >> Cc: Neil Horman <nhorman@tuxdriver.com> > >> Signed-off-by: Jason Wang <jasowang@redhat.com> > >> --- > >> drivers/net/macvlan.c | 5 ++++- > >> 1 files changed, 4 insertions(+), 1 deletions(-) > >> > > > > I must be missing something. > > > > The lower layer device should set skb->dev to the correct macvtap > > device on receive so that in netif_receive_skb_core() the macvtap > > handler is hit. Skipping the macvlan receive handler should be OK > > because the switching was done by the hardware. If I read macvtap.c > > correctly macvlan_common_newlink() is called with 'dev' where 'dev' > > is the macvtap device. Any idea what I'm missing? I guess I'll need > > to setup a macvtap test case. > > Unlike macvlan, macvtap depends on rx handler on the lower device to > work. In this case macvlan_handle_frame() will call macvtap_receive(). > So doing netif_receive_skb_core() for macvtap device directly won't work > since we need to forward the packet to userspace instead of kernel. > > For net-next.git, it may work since commit > 6acf54f1cf0a6747bac9fea26f34cfc5a9029523 let macvtap device register an > rx handler for itself. I agree, this seems like it should already be fixed with the above commit. With this the macvlan rx handler should effectively be a no-op as far as the reception of frames is concerned. As long as the driver sets the dev correctly to the macvtap device (and it appears to), macvtap will get frames to user space, regardless of weather the software or hardware did the switching. If thats the case though, I think the solution is moving that fix to -stable (pending testing of course), rather than comming up with a new fix. > > > > And what synchronization are you worried about on dev_hard_start_xmit()? > > In the L2 forwarding offload case macvlan_open() clears the NETIF_F_LLTX > > flag so HARD_TX_LOCK protects the driver txq. We might hit this warning > > in dev_queue_xmit() though, > > > > net_crit_ratelimited("Virtual device %s asks to queue packet!\n", > > > > Perhaps we can remove it. > > The problem is macvtap does not call dev_queue_xmit() for macvlan > device. It calls macvlan_start_xmit() directly from macvtap_get_user(). > So HARD_TX_LOCK was not done for the txq. This seems to also be fixed by 6acf54f1cf0a6747bac9fea26f34cfc5a9029523. Macvtap does, as of that commit use dev_queue_xmit for the transmission of frames to the lowerdevice. Regards Neil ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH net 1/2] macvlan: forbid L2 fowarding offload for macvtap 2014-01-06 12:26 ` Neil Horman @ 2014-01-07 3:10 ` Jason Wang 2014-01-07 5:15 ` John Fastabend 2014-01-07 5:16 ` John Fastabend 0 siblings, 2 replies; 35+ messages in thread From: Jason Wang @ 2014-01-07 3:10 UTC (permalink / raw) To: Neil Horman Cc: John Fastabend, davem, netdev, linux-kernel, mst, John Fastabend, Vlad Yasevich On 01/06/2014 08:26 PM, Neil Horman wrote: > On Mon, Jan 06, 2014 at 03:54:21PM +0800, Jason Wang wrote: >> On 01/06/2014 03:35 PM, John Fastabend wrote: >>> On 01/05/2014 07:21 PM, Jason Wang wrote: >>>> L2 fowarding offload will bypass the rx handler of real device. This >>>> will make >>>> the packet could not be forwarded to macvtap device. Another problem >>>> is the >>>> dev_hard_start_xmit() called for macvtap does not have any >>>> synchronization. >>>> >>>> Fix this by forbidding L2 forwarding for macvtap. >>>> >>>> Cc: John Fastabend <john.r.fastabend@intel.com> >>>> Cc: Neil Horman <nhorman@tuxdriver.com> >>>> Signed-off-by: Jason Wang <jasowang@redhat.com> >>>> --- >>>> drivers/net/macvlan.c | 5 ++++- >>>> 1 files changed, 4 insertions(+), 1 deletions(-) >>>> >>> I must be missing something. >>> >>> The lower layer device should set skb->dev to the correct macvtap >>> device on receive so that in netif_receive_skb_core() the macvtap >>> handler is hit. Skipping the macvlan receive handler should be OK >>> because the switching was done by the hardware. If I read macvtap.c >>> correctly macvlan_common_newlink() is called with 'dev' where 'dev' >>> is the macvtap device. Any idea what I'm missing? I guess I'll need >>> to setup a macvtap test case. >> Unlike macvlan, macvtap depends on rx handler on the lower device to >> work. In this case macvlan_handle_frame() will call macvtap_receive(). >> So doing netif_receive_skb_core() for macvtap device directly won't work >> since we need to forward the packet to userspace instead of kernel. >> >> For net-next.git, it may work since commit >> 6acf54f1cf0a6747bac9fea26f34cfc5a9029523 let macvtap device register an >> rx handler for itself. > I agree, this seems like it should already be fixed with the above commit. With > this the macvlan rx handler should effectively be a no-op as far as the > reception of frames is concerned. As long as the driver sets the dev correctly > to the macvtap device (and it appears to), macvtap will get frames to user > space, regardless of weather the software or hardware did the switching. If > thats the case though, I think the solution is moving that fix to -stable > (pending testing of course), rather than comming up with a new fix. > >>> And what synchronization are you worried about on dev_hard_start_xmit()? >>> In the L2 forwarding offload case macvlan_open() clears the NETIF_F_LLTX >>> flag so HARD_TX_LOCK protects the driver txq. We might hit this warning >>> in dev_queue_xmit() though, >>> >>> net_crit_ratelimited("Virtual device %s asks to queue packet!\n", >>> >>> Perhaps we can remove it. >> The problem is macvtap does not call dev_queue_xmit() for macvlan >> device. It calls macvlan_start_xmit() directly from macvtap_get_user(). >> So HARD_TX_LOCK was not done for the txq. > This seems to also be fixed by 6acf54f1cf0a6747bac9fea26f34cfc5a9029523. > Macvtap does, as of that commit use dev_queue_xmit for the transmission of > frames to the lowerdevice. Unfortunately not. This commit has a side effect that it in fact disables the multiqueue macvtap transmission. Since all macvtap queues will contend on a single qdisc lock. For L2 forwarding offload itself, more issues need to be addressed for multiqueue macvtap: - ndo_dfwd_add_station() can only create queues per device at ndo_open, but multiqueue macvtap allows user to create and destroy queues at their will and at any time. - it looks that ixgbe has a upper limit of 4 queues per station, but macvtap currently allows up to 16 queues per device. So more works need to be done and unless those above 3 issues were addressed, this patch is really needed to make sure macvtap works. > > Regards > Neil > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH net 1/2] macvlan: forbid L2 fowarding offload for macvtap 2014-01-07 3:10 ` Jason Wang @ 2014-01-07 5:15 ` John Fastabend 2014-01-07 6:22 ` Jason Wang 2014-01-07 5:16 ` John Fastabend 1 sibling, 1 reply; 35+ messages in thread From: John Fastabend @ 2014-01-07 5:15 UTC (permalink / raw) To: Jason Wang Cc: Neil Horman, davem, netdev, linux-kernel, mst, John Fastabend, Vlad Yasevich On 01/06/2014 07:10 PM, Jason Wang wrote: > On 01/06/2014 08:26 PM, Neil Horman wrote: >> On Mon, Jan 06, 2014 at 03:54:21PM +0800, Jason Wang wrote: >>> On 01/06/2014 03:35 PM, John Fastabend wrote: >>>> On 01/05/2014 07:21 PM, Jason Wang wrote: >>>>> L2 fowarding offload will bypass the rx handler of real device. This >>>>> will make >>>>> the packet could not be forwarded to macvtap device. Another problem >>>>> is the >>>>> dev_hard_start_xmit() called for macvtap does not have any >>>>> synchronization. >>>>> >>>>> Fix this by forbidding L2 forwarding for macvtap. >>>>> >>>>> Cc: John Fastabend <john.r.fastabend@intel.com> >>>>> Cc: Neil Horman <nhorman@tuxdriver.com> >>>>> Signed-off-by: Jason Wang <jasowang@redhat.com> >>>>> --- >>>>> drivers/net/macvlan.c | 5 ++++- >>>>> 1 files changed, 4 insertions(+), 1 deletions(-) >>>>> >>>> I must be missing something. >>>> >>>> The lower layer device should set skb->dev to the correct macvtap >>>> device on receive so that in netif_receive_skb_core() the macvtap >>>> handler is hit. Skipping the macvlan receive handler should be OK >>>> because the switching was done by the hardware. If I read macvtap.c >>>> correctly macvlan_common_newlink() is called with 'dev' where 'dev' >>>> is the macvtap device. Any idea what I'm missing? I guess I'll need >>>> to setup a macvtap test case. >>> Unlike macvlan, macvtap depends on rx handler on the lower device to >>> work. In this case macvlan_handle_frame() will call macvtap_receive(). >>> So doing netif_receive_skb_core() for macvtap device directly won't work >>> since we need to forward the packet to userspace instead of kernel. >>> >>> For net-next.git, it may work since commit >>> 6acf54f1cf0a6747bac9fea26f34cfc5a9029523 let macvtap device register an >>> rx handler for itself. >> I agree, this seems like it should already be fixed with the above commit. With >> this the macvlan rx handler should effectively be a no-op as far as the >> reception of frames is concerned. As long as the driver sets the dev correctly >> to the macvtap device (and it appears to), macvtap will get frames to user >> space, regardless of weather the software or hardware did the switching. If >> thats the case though, I think the solution is moving that fix to -stable >> (pending testing of course), rather than comming up with a new fix. >> >>>> And what synchronization are you worried about on dev_hard_start_xmit()? >>>> In the L2 forwarding offload case macvlan_open() clears the NETIF_F_LLTX >>>> flag so HARD_TX_LOCK protects the driver txq. We might hit this warning >>>> in dev_queue_xmit() though, >>>> >>>> net_crit_ratelimited("Virtual device %s asks to queue packet!\n", >>>> >>>> Perhaps we can remove it. >>> The problem is macvtap does not call dev_queue_xmit() for macvlan >>> device. It calls macvlan_start_xmit() directly from macvtap_get_user(). >>> So HARD_TX_LOCK was not done for the txq. >> This seems to also be fixed by 6acf54f1cf0a6747bac9fea26f34cfc5a9029523. >> Macvtap does, as of that commit use dev_queue_xmit for the transmission of >> frames to the lowerdevice. > > Unfortunately not. This commit has a side effect that it in fact > disables the multiqueue macvtap transmission. Since all macvtap queues > will contend on a single qdisc lock. > They will only contend on a single qdisc lock if the lower device has 1 queue. Perhaps defaulting the L2 forwarding devices to 1queue was a mistake. But the same issue arises when running macvtap over a non-multiqueue nic. Or even if you have a multiqueue device and create many more macvtap queues than the lower device has queues. Shouldn't the macvtap configuration take into account the lowest level devices queues? How does using the L2 forwarding device change the contention issues? Without the L2 forwarding LLTX is enabled but the qdisc lock, etc is still acquired on the device below the macvlan. The ixgbe driver as it is currently written can be configured for up to 4 queues by setting numtxqueues when the device is created. I assume when creating macvtap queues the user needs to account for the number of queues supported by the lower device. > For L2 forwarding offload itself, more issues need to be addressed for > multiqueue macvtap: > > - ndo_dfwd_add_station() can only create queues per device at ndo_open, > but multiqueue macvtap allows user to create and destroy queues at their > will and at any time. same argument as above, isn't this the same when running macvtap without the l2 offloads over a real device? I expect you hit the same contention points when running over a real device. > - it looks that ixgbe has a upper limit of 4 queues per station, but > macvtap currently allows up to 16 queues per device. > The 4 limit was to simplify the code because the queue mapping in the driver gets complicated if it is greater than 4. We can probably increase this latter. But sorry reiterating how is this different than a macvtap on a real device that supports a max of 4 queues? > So more works need to be done and unless those above 3 issues were > addressed, this patch is really needed to make sure macvtap works. > Agreed there is a lot more work here to improve things I'm just not sure we need to disable this now. Also note its the l2 forwarding should be disabled by default so a user would have to enable the feature flag. Thanks, John -- John Fastabend Intel Corporation ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH net 1/2] macvlan: forbid L2 fowarding offload for macvtap 2014-01-07 5:15 ` John Fastabend @ 2014-01-07 6:22 ` Jason Wang 2014-01-07 7:26 ` John Fastabend 0 siblings, 1 reply; 35+ messages in thread From: Jason Wang @ 2014-01-07 6:22 UTC (permalink / raw) To: John Fastabend Cc: Neil Horman, davem, netdev, linux-kernel, mst, John Fastabend, Vlad Yasevich On 01/07/2014 01:15 PM, John Fastabend wrote: > On 01/06/2014 07:10 PM, Jason Wang wrote: >> On 01/06/2014 08:26 PM, Neil Horman wrote: >>> On Mon, Jan 06, 2014 at 03:54:21PM +0800, Jason Wang wrote: >>>> On 01/06/2014 03:35 PM, John Fastabend wrote: >>>>> On 01/05/2014 07:21 PM, Jason Wang wrote: >>>>>> L2 fowarding offload will bypass the rx handler of real device. This >>>>>> will make >>>>>> the packet could not be forwarded to macvtap device. Another problem >>>>>> is the >>>>>> dev_hard_start_xmit() called for macvtap does not have any >>>>>> synchronization. >>>>>> >>>>>> Fix this by forbidding L2 forwarding for macvtap. >>>>>> >>>>>> Cc: John Fastabend <john.r.fastabend@intel.com> >>>>>> Cc: Neil Horman <nhorman@tuxdriver.com> >>>>>> Signed-off-by: Jason Wang <jasowang@redhat.com> >>>>>> --- >>>>>> drivers/net/macvlan.c | 5 ++++- >>>>>> 1 files changed, 4 insertions(+), 1 deletions(-) >>>>>> >>>>> I must be missing something. >>>>> >>>>> The lower layer device should set skb->dev to the correct macvtap >>>>> device on receive so that in netif_receive_skb_core() the macvtap >>>>> handler is hit. Skipping the macvlan receive handler should be OK >>>>> because the switching was done by the hardware. If I read macvtap.c >>>>> correctly macvlan_common_newlink() is called with 'dev' where 'dev' >>>>> is the macvtap device. Any idea what I'm missing? I guess I'll need >>>>> to setup a macvtap test case. >>>> Unlike macvlan, macvtap depends on rx handler on the lower device to >>>> work. In this case macvlan_handle_frame() will call macvtap_receive(). >>>> So doing netif_receive_skb_core() for macvtap device directly won't >>>> work >>>> since we need to forward the packet to userspace instead of kernel. >>>> >>>> For net-next.git, it may work since commit >>>> 6acf54f1cf0a6747bac9fea26f34cfc5a9029523 let macvtap device >>>> register an >>>> rx handler for itself. >>> I agree, this seems like it should already be fixed with the above >>> commit. With >>> this the macvlan rx handler should effectively be a no-op as far as the >>> reception of frames is concerned. As long as the driver sets the >>> dev correctly >>> to the macvtap device (and it appears to), macvtap will get frames >>> to user >>> space, regardless of weather the software or hardware did the >>> switching. If >>> thats the case though, I think the solution is moving that fix to >>> -stable >>> (pending testing of course), rather than comming up with a new fix. >>> >>>>> And what synchronization are you worried about on >>>>> dev_hard_start_xmit()? >>>>> In the L2 forwarding offload case macvlan_open() clears the >>>>> NETIF_F_LLTX >>>>> flag so HARD_TX_LOCK protects the driver txq. We might hit this >>>>> warning >>>>> in dev_queue_xmit() though, >>>>> >>>>> net_crit_ratelimited("Virtual device %s asks to queue packet!\n", >>>>> >>>>> Perhaps we can remove it. >>>> The problem is macvtap does not call dev_queue_xmit() for macvlan >>>> device. It calls macvlan_start_xmit() directly from >>>> macvtap_get_user(). >>>> So HARD_TX_LOCK was not done for the txq. >>> This seems to also be fixed by >>> 6acf54f1cf0a6747bac9fea26f34cfc5a9029523. >>> Macvtap does, as of that commit use dev_queue_xmit for the >>> transmission of >>> frames to the lowerdevice. >> >> Unfortunately not. This commit has a side effect that it in fact >> disables the multiqueue macvtap transmission. Since all macvtap queues >> will contend on a single qdisc lock. >> > > They will only contend on a single qdisc lock if the lower device has > 1 queue. I think we are talking about 6acf54f1cf0a6747bac9fea26f34cfc5a9029523. The qdisc or txq lock were macvlan device itself since dev_queue_xmit() was called for macvlan device itself. So even if lower device has multiple txqs, if you just create a one queue macvlan device, you will get lock contention on macvlan device. And even if you explicitly specifying the txq numbers ( though I don't believe most management software will do this) when creating the macvlan/macvtap device, you must also configure the XPS for macvlan to make sure it has the possibility of using multiple transmit queues. > Perhaps defaulting the L2 forwarding devices to 1queue was a > mistake. But the same issue arises when running macvtap over a > non-multiqueue nic. Or even if you have a multiqueue device and create > many more macvtap queues than the lower device has queues. > > Shouldn't the macvtap configuration take into account the lowest level > devices queues? See commit 8ffab51b3dfc54876f145f15b351c41f3f703195 ("macvlan: lockless tx path"). It allows the management to create a device without worrying the underlying device. > How does using the L2 forwarding device change the > contention issues? Without the L2 forwarding LLTX is enabled but the > qdisc lock, etc is still acquired on the device below the macvlan. > That's the point. We need make sure the txq selection and qdisc lock were done for the lower device not for the macvlan device itself. Then macvlan can automatically benefit from the multi-queue capable lower devices. But L2 forwarding needs to contend on the txq lock on macvlan device itself, which is unnecessary and can complex the management. > The ixgbe driver as it is currently written can be configured for up to > 4 queues by setting numtxqueues when the device is created. I assume > when creating macvtap queues the user needs to account for the number > of queues supported by the lower device. > We'd better not complicate the task of management, lockless tx path work very well so we can just keep it. Btw, there's no way for the user to know the maximum number of queues that L2 forwarding supports. >> For L2 forwarding offload itself, more issues need to be addressed for >> multiqueue macvtap: >> >> - ndo_dfwd_add_station() can only create queues per device at ndo_open, >> but multiqueue macvtap allows user to create and destroy queues at their >> will and at any time. > > same argument as above, isn't this the same when running macvtap without > the l2 offloads over a real device? I expect you hit the same contention > points when running over a real device. Not true and not only for contention. Macvtap allows user to create or destroy a queue by simply open or close to character device /dev/tapX. But currently, we do nothing when a new queue was created or destroyed for L2 forwarding offload. For contention, lockless tx path make the contention only happens for the txq or qdisc for the lower device, but L2 forwarding offload make contention also happen for the macvlan device itself. > > >> - it looks that ixgbe has a upper limit of 4 queues per station, but >> macvtap currently allows up to 16 queues per device. >> > > The 4 limit was to simplify the code because the queue mapping in the > driver gets complicated if it is greater than 4. We can probably > increase this latter. But sorry reiterating how is this different than > a macvtap on a real device that supports a max of 4 queues? Well, it maybe easy. I just point out possible issues we may meet currently. > >> So more works need to be done and unless those above 3 issues were >> addressed, this patch is really needed to make sure macvtap works. >> > > Agreed there is a lot more work here to improve things I'm just not > sure we need to disable this now. Also note its the l2 forwarding > should be disabled by default so a user would have to enable the > feature flag. Even if it was disabled by default. We should not surprise the user who want to enable it for macvtap. > > Thanks, > John > Thanks ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH net 1/2] macvlan: forbid L2 fowarding offload for macvtap 2014-01-07 6:22 ` Jason Wang @ 2014-01-07 7:26 ` John Fastabend 2014-01-07 9:00 ` Jason Wang 0 siblings, 1 reply; 35+ messages in thread From: John Fastabend @ 2014-01-07 7:26 UTC (permalink / raw) To: Jason Wang Cc: John Fastabend, Neil Horman, davem, netdev, linux-kernel, mst, Vlad Yasevich [...] >>> Unfortunately not. This commit has a side effect that it in fact >>> disables the multiqueue macvtap transmission. Since all macvtap queues >>> will contend on a single qdisc lock. >>> >> >> They will only contend on a single qdisc lock if the lower device has >> 1 queue. > > I think we are talking about 6acf54f1cf0a6747bac9fea26f34cfc5a9029523. Yes. > > The qdisc or txq lock were macvlan device itself since dev_queue_xmit() > was called for macvlan device itself. So even if lower device has > multiple txqs, if you just create a one queue macvlan device, you will > get lock contention on macvlan device. And even if you explicitly > specifying the txq numbers ( though I don't believe most management > software will do this) when creating the macvlan/macvtap device, you > must also configure the XPS for macvlan to make sure it has the > possibility of using multiple transmit queues. > OK I think I'm finally putting all the pieces together thanks. Do you know why macvtap is setting dev->tx_queue_len by default? If you zero this then the noqueue_qdisc is used and the q->enqueue check in dev_queue_xmit will fail. Also if XPS is not configured then skb_tx_hash is used so multiple transmit queues will still be used. >> Perhaps defaulting the L2 forwarding devices to 1queue was a >> mistake. But the same issue arises when running macvtap over a >> non-multiqueue nic. Or even if you have a multiqueue device and create >> many more macvtap queues than the lower device has queues. >> >> Shouldn't the macvtap configuration take into account the lowest level >> devices queues? > > See commit 8ffab51b3dfc54876f145f15b351c41f3f703195 ("macvlan: lockless > tx path"). It allows the management to create a device without worrying > the underlying device. OK. >> How does using the L2 forwarding device change the >> contention issues? Without the L2 forwarding LLTX is enabled but the >> qdisc lock, etc is still acquired on the device below the macvlan. >> > > That's the point. We need make sure the txq selection and qdisc lock > were done for the lower device not for the macvlan device itself. Then > macvlan can automatically benefit from the multi-queue capable lower > devices. But L2 forwarding needs to contend on the txq lock on macvlan > device itself, which is unnecessary and can complex the management. If I make the l2 forwarding defaults a bit better then using the L2 forwarding case should not be any more complex. And because the queues are dedicated to the macvtap device any contention from qdisc lock, etc comes from the upper device only. Also if I get the bandwidth controls in we can set the max/min bandwidth per macvtap device this way. That is future work though. >> The ixgbe driver as it is currently written can be configured for up to >> 4 queues by setting numtxqueues when the device is created. I assume >> when creating macvtap queues the user needs to account for the number >> of queues supported by the lower device. >> > > We'd better not complicate the task of management, lockless tx path work > very well so we can just keep it. Btw, there's no way for the user to > know the maximum number of queues that L2 forwarding supports. Good point I'll add an attribute to query it. >>> For L2 forwarding offload itself, more issues need to be addressed for >>> multiqueue macvtap: >>> >>> - ndo_dfwd_add_station() can only create queues per device at ndo_open, >>> but multiqueue macvtap allows user to create and destroy queues at their >>> will and at any time. >> >> same argument as above, isn't this the same when running macvtap without >> the l2 offloads over a real device? I expect you hit the same contention >> points when running over a real device. > > Not true and not only for contention. > > Macvtap allows user to create or destroy a queue by simply open or close > to character device /dev/tapX. But currently, we do nothing when a new > queue was created or destroyed for L2 forwarding offload. > > For contention, lockless tx path make the contention only happens for > the txq or qdisc for the lower device, but L2 forwarding offload make > contention also happen for the macvlan device itself. Right, but there will be less contention there because those queues are a dedicated resource for the upper device. At this point I think I need to put together a real testbed and benchmark some of this with netperf and perf running to get real numbers. When I originally did the l2 forwarding I did not do any testing with multiple macvtap queues and only very limited work with macvtap. > >> >> >>> - it looks that ixgbe has a upper limit of 4 queues per station, but >>> macvtap currently allows up to 16 queues per device. >>> >> >> The 4 limit was to simplify the code because the queue mapping in the >> driver gets complicated if it is greater than 4. We can probably >> increase this latter. But sorry reiterating how is this different than >> a macvtap on a real device that supports a max of 4 queues? > > Well, it maybe easy. I just point out possible issues we may meet currently. Right. >> >>> So more works need to be done and unless those above 3 issues were >>> addressed, this patch is really needed to make sure macvtap works. >>> >> >> Agreed there is a lot more work here to improve things I'm just not >> sure we need to disable this now. Also note its the l2 forwarding >> should be disabled by default so a user would have to enable the >> feature flag. > > Even if it was disabled by default. We should not surprise the user who > want to enable it for macvtap. So the question is what to do in net while we improve net-next. Either we fix the crash from the null txq and note that with l2 forwarding some non default configuration is needed for optimal performance OR for now disable it as your patch does. I would prefer to fix the crash and note the configuration but I see your point about surprising users so could go either way. Neil any thoughts? To fix the null txq in the gso case adding a check for a non-null txq before calling txq_trans_update() makes sense to me. We already have the check in the non-gso case so making it symmetric fixes it. >> >> Thanks, >> John >> > > Thanks > ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH net 1/2] macvlan: forbid L2 fowarding offload for macvtap 2014-01-07 7:26 ` John Fastabend @ 2014-01-07 9:00 ` Jason Wang 2014-01-08 12:55 ` Michael S. Tsirkin 0 siblings, 1 reply; 35+ messages in thread From: Jason Wang @ 2014-01-07 9:00 UTC (permalink / raw) To: John Fastabend Cc: John Fastabend, Neil Horman, davem, netdev, linux-kernel, mst, Vlad Yasevich On 01/07/2014 03:26 PM, John Fastabend wrote: > [...] > >>>> Unfortunately not. This commit has a side effect that it in fact >>>> disables the multiqueue macvtap transmission. Since all macvtap queues >>>> will contend on a single qdisc lock. >>>> >>> >>> They will only contend on a single qdisc lock if the lower device has >>> 1 queue. >> >> I think we are talking about 6acf54f1cf0a6747bac9fea26f34cfc5a9029523. > > Yes. > >> >> The qdisc or txq lock were macvlan device itself since dev_queue_xmit() >> was called for macvlan device itself. So even if lower device has >> multiple txqs, if you just create a one queue macvlan device, you will >> get lock contention on macvlan device. And even if you explicitly >> specifying the txq numbers ( though I don't believe most management >> software will do this) when creating the macvlan/macvtap device, you >> must also configure the XPS for macvlan to make sure it has the >> possibility of using multiple transmit queues. >> > > OK I think I'm finally putting all the pieces together thanks. > > Do you know why macvtap is setting dev->tx_queue_len by default? If you > zero this then the noqueue_qdisc is used and the q->enqueue check in > dev_queue_xmit will fail. It was introduced in commit 8a35747a5d13b99e076b0222729e0caa48cb69b6 ("macvtap: Limit packet queue length") to limit the length of socket receive queue of macvtap. But I'm not sure whether the qdisc is a byproduct of this commit, maybe we can switch to use another name instead of just reuse dev->tx_queue_length. > > Also if XPS is not configured then skb_tx_hash is used so multiple > transmit queues will still be used. > True. >>> Perhaps defaulting the L2 forwarding devices to 1queue was a >>> mistake. But the same issue arises when running macvtap over a >>> non-multiqueue nic. Or even if you have a multiqueue device and create >>> many more macvtap queues than the lower device has queues. >>> >>> Shouldn't the macvtap configuration take into account the lowest level >>> devices queues? >> >> See commit 8ffab51b3dfc54876f145f15b351c41f3f703195 ("macvlan: lockless >> tx path"). It allows the management to create a device without worrying >> the underlying device. > > OK. > >>> How does using the L2 forwarding device change the >>> contention issues? Without the L2 forwarding LLTX is enabled but the >>> qdisc lock, etc is still acquired on the device below the macvlan. >>> >> >> That's the point. We need make sure the txq selection and qdisc lock >> were done for the lower device not for the macvlan device itself. Then >> macvlan can automatically benefit from the multi-queue capable lower >> devices. But L2 forwarding needs to contend on the txq lock on macvlan >> device itself, which is unnecessary and can complex the management. > > If I make the l2 forwarding defaults a bit better then using the L2 > forwarding case should not be any more complex. And because the queues > are dedicated to the macvtap device any contention from qdisc lock, etc > comes from the upper device only. At very least the txq of lower device should be held in order to be synchronized with management path. Consider txq lock were often held by netif_tx_disable() before trying to down the card. Current cold does not hold txq lock, so it loses the synchronization which may cause issues. And the code also does not check whether the txq has been stopped before trying to start the transmission. > Also if I get the bandwidth controls > in we can set the max/min bandwidth per macvtap device this way. That > is future work though. > That will be a nice feature. >>> The ixgbe driver as it is currently written can be configured for up to >>> 4 queues by setting numtxqueues when the device is created. I assume >>> when creating macvtap queues the user needs to account for the number >>> of queues supported by the lower device. >>> >> >> We'd better not complicate the task of management, lockless tx path work >> very well so we can just keep it. Btw, there's no way for the user to >> know the maximum number of queues that L2 forwarding supports. > > Good point I'll add an attribute to query it. > >>>> For L2 forwarding offload itself, more issues need to be addressed for >>>> multiqueue macvtap: >>>> >>>> - ndo_dfwd_add_station() can only create queues per device at >>>> ndo_open, >>>> but multiqueue macvtap allows user to create and destroy queues at >>>> their >>>> will and at any time. >>> >>> same argument as above, isn't this the same when running macvtap >>> without >>> the l2 offloads over a real device? I expect you hit the same >>> contention >>> points when running over a real device. >> >> Not true and not only for contention. >> >> Macvtap allows user to create or destroy a queue by simply open or close >> to character device /dev/tapX. But currently, we do nothing when a new >> queue was created or destroyed for L2 forwarding offload. >> >> For contention, lockless tx path make the contention only happens for >> the txq or qdisc for the lower device, but L2 forwarding offload make >> contention also happen for the macvlan device itself. > > Right, but there will be less contention there because those queues > are a dedicated resource for the upper device. Yes and this is also true if we only do synchronization on the lower device since only dedicated queues could be selected. > > At this point I think I need to put together a real testbed and > benchmark some of this with netperf and perf running to get real > numbers. When I originally did the l2 forwarding I did not do any > testing with multiple macvtap queues and only very limited work with > macvtap. > As I said above, holding the txq lock of lower device seems a must and we should not get regression if NETIF_F_LLTX is kept. But I agree we need some test. >> >>> >>> >>>> - it looks that ixgbe has a upper limit of 4 queues per station, but >>>> macvtap currently allows up to 16 queues per device. >>>> >>> >>> The 4 limit was to simplify the code because the queue mapping in the >>> driver gets complicated if it is greater than 4. We can probably >>> increase this latter. But sorry reiterating how is this different than >>> a macvtap on a real device that supports a max of 4 queues? >> >> Well, it maybe easy. I just point out possible issues we may meet >> currently. > > Right. > >>> >>>> So more works need to be done and unless those above 3 issues were >>>> addressed, this patch is really needed to make sure macvtap works. >>>> >>> >>> Agreed there is a lot more work here to improve things I'm just not >>> sure we need to disable this now. Also note its the l2 forwarding >>> should be disabled by default so a user would have to enable the >>> feature flag. >> >> Even if it was disabled by default. We should not surprise the user who >> want to enable it for macvtap. > > So the question is what to do in net while we improve net-next. Either > we fix the crash from the null txq and note that with l2 forwarding > some non default configuration is needed for optimal performance OR > for now disable it as your patch does. I would prefer to fix the crash > and note the configuration but I see your point about surprising users > so could go either way. > It's much safer to disable l2 forwarding offload for macvtap temporarily consider it has several issues.We can re-enable it when everything is ready in net-next. We we really need to hold the txq lock of lower device, only add more check of NULL pointer is not sufficient. So explicitly select a txq is still needed. And I don't see any conflicts between this and future enhancement. Also I don't see any drawback of using NETIF_F_LLTX for l2 forwarding. So we'd better keep it. > Neil any thoughts? > > To fix the null txq in the gso case adding a check for a non-null > txq before calling txq_trans_update() makes sense to me. We already > have the check in the non-gso case so making it symmetric fixes it. > >>> >>> Thanks, >>> John >>> >> >> Thanks >> > > -- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH net 1/2] macvlan: forbid L2 fowarding offload for macvtap 2014-01-07 9:00 ` Jason Wang @ 2014-01-08 12:55 ` Michael S. Tsirkin 2014-01-08 19:05 ` John Fastabend 0 siblings, 1 reply; 35+ messages in thread From: Michael S. Tsirkin @ 2014-01-08 12:55 UTC (permalink / raw) To: Jason Wang Cc: John Fastabend, John Fastabend, Neil Horman, davem, netdev, linux-kernel, Vlad Yasevich On Tue, Jan 07, 2014 at 05:00:29PM +0800, Jason Wang wrote: > On 01/07/2014 03:26 PM, John Fastabend wrote: > > [...] > > > >>>> Unfortunately not. This commit has a side effect that it in fact > >>>> disables the multiqueue macvtap transmission. Since all macvtap queues > >>>> will contend on a single qdisc lock. > >>>> > >>> > >>> They will only contend on a single qdisc lock if the lower device has > >>> 1 queue. > >> > >> I think we are talking about 6acf54f1cf0a6747bac9fea26f34cfc5a9029523. > > > > Yes. > > > >> > >> The qdisc or txq lock were macvlan device itself since dev_queue_xmit() > >> was called for macvlan device itself. So even if lower device has > >> multiple txqs, if you just create a one queue macvlan device, you will > >> get lock contention on macvlan device. And even if you explicitly > >> specifying the txq numbers ( though I don't believe most management > >> software will do this) when creating the macvlan/macvtap device, you > >> must also configure the XPS for macvlan to make sure it has the > >> possibility of using multiple transmit queues. > >> > > > > OK I think I'm finally putting all the pieces together thanks. > > > > Do you know why macvtap is setting dev->tx_queue_len by default? If you > > zero this then the noqueue_qdisc is used and the q->enqueue check in > > dev_queue_xmit will fail. > > It was introduced in commit 8a35747a5d13b99e076b0222729e0caa48cb69b6 > ("macvtap: Limit packet queue length") to limit the length of socket > receive queue of macvtap. But I'm not sure whether the qdisc is a > byproduct of this commit, maybe we can switch to use another name > instead of just reuse dev->tx_queue_length. You mean tx_queue_len really, right? Problem is tx_queue_len can be accessed using netlink sysfs or ioctl, so if someone uses these to control or check the # of packets that can be queued by device, this will break. How about adding ndo_set_tx_queue_len then? At some point we wanted to decouple queue length from tx_queue_length for tun as well, so that would be benefitial there as well. > > > > Also if XPS is not configured then skb_tx_hash is used so multiple > > transmit queues will still be used. > > > > True. > >>> Perhaps defaulting the L2 forwarding devices to 1queue was a > >>> mistake. But the same issue arises when running macvtap over a > >>> non-multiqueue nic. Or even if you have a multiqueue device and create > >>> many more macvtap queues than the lower device has queues. > >>> > >>> Shouldn't the macvtap configuration take into account the lowest level > >>> devices queues? > >> > >> See commit 8ffab51b3dfc54876f145f15b351c41f3f703195 ("macvlan: lockless > >> tx path"). It allows the management to create a device without worrying > >> the underlying device. > > > > OK. > > > >>> How does using the L2 forwarding device change the > >>> contention issues? Without the L2 forwarding LLTX is enabled but the > >>> qdisc lock, etc is still acquired on the device below the macvlan. > >>> > >> > >> That's the point. We need make sure the txq selection and qdisc lock > >> were done for the lower device not for the macvlan device itself. Then > >> macvlan can automatically benefit from the multi-queue capable lower > >> devices. But L2 forwarding needs to contend on the txq lock on macvlan > >> device itself, which is unnecessary and can complex the management. > > > > If I make the l2 forwarding defaults a bit better then using the L2 > > forwarding case should not be any more complex. And because the queues > > are dedicated to the macvtap device any contention from qdisc lock, etc > > comes from the upper device only. > > At very least the txq of lower device should be held in order to be > synchronized with management path. Consider txq lock were often held by > netif_tx_disable() before trying to down the card. Current cold does not > hold txq lock, so it loses the synchronization which may cause issues. > And the code also does not check whether the txq has been stopped before > trying to start the transmission. > > > > Also if I get the bandwidth controls > > in we can set the max/min bandwidth per macvtap device this way. That > > is future work though. > > > > That will be a nice feature. > >>> The ixgbe driver as it is currently written can be configured for up to > >>> 4 queues by setting numtxqueues when the device is created. I assume > >>> when creating macvtap queues the user needs to account for the number > >>> of queues supported by the lower device. > >>> > >> > >> We'd better not complicate the task of management, lockless tx path work > >> very well so we can just keep it. Btw, there's no way for the user to > >> know the maximum number of queues that L2 forwarding supports. > > > > Good point I'll add an attribute to query it. > > > >>>> For L2 forwarding offload itself, more issues need to be addressed for > >>>> multiqueue macvtap: > >>>> > >>>> - ndo_dfwd_add_station() can only create queues per device at > >>>> ndo_open, > >>>> but multiqueue macvtap allows user to create and destroy queues at > >>>> their > >>>> will and at any time. > >>> > >>> same argument as above, isn't this the same when running macvtap > >>> without > >>> the l2 offloads over a real device? I expect you hit the same > >>> contention > >>> points when running over a real device. > >> > >> Not true and not only for contention. > >> > >> Macvtap allows user to create or destroy a queue by simply open or close > >> to character device /dev/tapX. But currently, we do nothing when a new > >> queue was created or destroyed for L2 forwarding offload. > >> > >> For contention, lockless tx path make the contention only happens for > >> the txq or qdisc for the lower device, but L2 forwarding offload make > >> contention also happen for the macvlan device itself. > > > > Right, but there will be less contention there because those queues > > are a dedicated resource for the upper device. > > Yes and this is also true if we only do synchronization on the lower > device since only dedicated queues could be selected. > > > > At this point I think I need to put together a real testbed and > > benchmark some of this with netperf and perf running to get real > > numbers. When I originally did the l2 forwarding I did not do any > > testing with multiple macvtap queues and only very limited work with > > macvtap. > > > > As I said above, holding the txq lock of lower device seems a must and > we should not get regression if NETIF_F_LLTX is kept. But I agree we > need some test. > >> > >>> > >>> > >>>> - it looks that ixgbe has a upper limit of 4 queues per station, but > >>>> macvtap currently allows up to 16 queues per device. > >>>> > >>> > >>> The 4 limit was to simplify the code because the queue mapping in the > >>> driver gets complicated if it is greater than 4. We can probably > >>> increase this latter. But sorry reiterating how is this different than > >>> a macvtap on a real device that supports a max of 4 queues? > >> > >> Well, it maybe easy. I just point out possible issues we may meet > >> currently. > > > > Right. > > > >>> > >>>> So more works need to be done and unless those above 3 issues were > >>>> addressed, this patch is really needed to make sure macvtap works. > >>>> > >>> > >>> Agreed there is a lot more work here to improve things I'm just not > >>> sure we need to disable this now. Also note its the l2 forwarding > >>> should be disabled by default so a user would have to enable the > >>> feature flag. > >> > >> Even if it was disabled by default. We should not surprise the user who > >> want to enable it for macvtap. > > > > So the question is what to do in net while we improve net-next. Either > > we fix the crash from the null txq and note that with l2 forwarding > > some non default configuration is needed for optimal performance OR > > for now disable it as your patch does. I would prefer to fix the crash > > and note the configuration but I see your point about surprising users > > so could go either way. > > > > It's much safer to disable l2 forwarding offload for macvtap temporarily > consider it has several issues.We can re-enable it when everything is > ready in net-next. We we really need to hold the txq lock of lower > device, only add more check of NULL pointer is not sufficient. So > explicitly select a txq is still needed. And I don't see any conflicts > between this and future enhancement. > > Also I don't see any drawback of using NETIF_F_LLTX for l2 forwarding. > So we'd better keep it. > > Neil any thoughts? > > > > To fix the null txq in the gso case adding a check for a non-null > > txq before calling txq_trans_update() makes sense to me. We already > > have the check in the non-gso case so making it symmetric fixes it. > > > >>> > >>> Thanks, > >>> John > >>> > >> > >> Thanks > >> > > > > -- > > To unsubscribe from this list: send the line "unsubscribe netdev" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH net 1/2] macvlan: forbid L2 fowarding offload for macvtap 2014-01-08 12:55 ` Michael S. Tsirkin @ 2014-01-08 19:05 ` John Fastabend 2014-01-09 7:17 ` Michael S. Tsirkin 0 siblings, 1 reply; 35+ messages in thread From: John Fastabend @ 2014-01-08 19:05 UTC (permalink / raw) To: Michael S. Tsirkin, Jason Wang Cc: John Fastabend, Neil Horman, davem, netdev, linux-kernel, Vlad Yasevich [...] >>> OK I think I'm finally putting all the pieces together thanks. >>> >>> Do you know why macvtap is setting dev->tx_queue_len by default? If you >>> zero this then the noqueue_qdisc is used and the q->enqueue check in >>> dev_queue_xmit will fail. >> >> It was introduced in commit 8a35747a5d13b99e076b0222729e0caa48cb69b6 >> ("macvtap: Limit packet queue length") to limit the length of socket >> receive queue of macvtap. But I'm not sure whether the qdisc is a >> byproduct of this commit, maybe we can switch to use another name >> instead of just reuse dev->tx_queue_length. > > You mean tx_queue_len really, right? > > Problem is tx_queue_len can be accessed using netlink sysfs or ioctl, > so if someone uses these to control or check the # of packets that > can be queued by device, this will break. > > How about adding ndo_set_tx_queue_len then? > > At some point we wanted to decouple queue length from tx_queue_length > for tun as well, so that would be benefitial there as well. On the receive side we need to limit the receive queue and the dev->tx_queue_len value was used to do this. However on the tx side when dev->tx_queue_len is set the default qdisc pfifo_fast or mq is used depending on if there is multiqueue or not. Unless the user specifies some numtxqueues when creating the macvtap device then it will be a single queue device and use the pfifo_fast qdisc. This is the default behaviour users could zero txqueuelen when they create the device because it is a stacked device with NETIF_F_LLTX using the lower devices qdisc makes sense but this would appear (from code inspection) to break macvtap_forward()? if (skb_queue_len(&q->sk.sk_receive_queue) >= dev->tx_queue_len) goto drop; I'm not sure any of this is a problem other than its a bit confusing to overload tx_queue_len for the rx_queue_len but the precedent has been there for sometime. It is a change in behaviour though in net-next. Previously dev_queue_xmit() was not used so the qdisc layer was never hit on the macvtap device. Now with dev_queue_xmit() if the user attaches multiple macvlan queues to a single qdisc queue this should still work but wont be optimal. Ideally the user should create as many qdisc queues at creation time via numtxqueues as macvtap queues will be used during runtime so that there is a 1:1 mapping or use some heuristic to avoid cases where there is a many to 1 mapping. From my perspective it would be OK to push this configuration/policy to the management layer. After all it is the entity that knows how to distribute system resources amongst the objects running over the macvtap devices. The relevance for me is I defaulted the l2 offloaded macvlans to single queue devices and wanted to note in net-next this is the same policy used in the non-offloaded case. Bit long-winded there. Thanks, John ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH net 1/2] macvlan: forbid L2 fowarding offload for macvtap 2014-01-08 19:05 ` John Fastabend @ 2014-01-09 7:17 ` Michael S. Tsirkin 2014-01-09 8:55 ` Jason Wang 0 siblings, 1 reply; 35+ messages in thread From: Michael S. Tsirkin @ 2014-01-09 7:17 UTC (permalink / raw) To: John Fastabend Cc: Jason Wang, John Fastabend, Neil Horman, davem, netdev, linux-kernel, Vlad Yasevich On Wed, Jan 08, 2014 at 11:05:42AM -0800, John Fastabend wrote: > [...] > > >>>OK I think I'm finally putting all the pieces together thanks. > >>> > >>>Do you know why macvtap is setting dev->tx_queue_len by default? If you > >>>zero this then the noqueue_qdisc is used and the q->enqueue check in > >>>dev_queue_xmit will fail. > >> > >>It was introduced in commit 8a35747a5d13b99e076b0222729e0caa48cb69b6 > >>("macvtap: Limit packet queue length") to limit the length of socket > >>receive queue of macvtap. But I'm not sure whether the qdisc is a > >>byproduct of this commit, maybe we can switch to use another name > >>instead of just reuse dev->tx_queue_length. > > > >You mean tx_queue_len really, right? > > > >Problem is tx_queue_len can be accessed using netlink sysfs or ioctl, > >so if someone uses these to control or check the # of packets that > >can be queued by device, this will break. > > > >How about adding ndo_set_tx_queue_len then? > > > >At some point we wanted to decouple queue length from tx_queue_length > >for tun as well, so that would be benefitial there as well. > > On the receive side we need to limit the receive queue and the > dev->tx_queue_len value was used to do this. > > However on the tx side when dev->tx_queue_len is set the default > qdisc pfifo_fast or mq is used depending on if there is multiqueue > or not. Unless the user specifies some numtxqueues when creating > the macvtap device then it will be a single queue device and use > the pfifo_fast qdisc. > > This is the default behaviour users could zero txqueuelen when > they create the device because it is a stacked device with > NETIF_F_LLTX using the lower devices qdisc makes sense but this > would appear (from code inspection) to break macvtap_forward()? > > if (skb_queue_len(&q->sk.sk_receive_queue) >= dev->tx_queue_len) > goto drop; > > I'm not sure any of this is a problem other than its a bit > confusing to overload tx_queue_len for the rx_queue_len but the > precedent has been there for sometime. So how about ndo ops to access tx_queue_len then? This way we can set tx_queue_len to 0 and use some other field to store the rx_queue_len without users noticing. Along the lines of the below (it's a partial patch just to give you the idea): diff --git a/net/core/dev_ioctl.c b/net/core/dev_ioctl.c index 5b7d0e1..e526b46 100644 --- a/net/core/dev_ioctl.c +++ b/net/core/dev_ioctl.c @@ -167,7 +167,10 @@ static int dev_ifsioc_locked(struct net *net, struct ifreq *ifr, unsigned int cm return 0; case SIOCGIFTXQLEN: - ifr->ifr_qlen = dev->tx_queue_len; + if (dev->netdev_ops->ndo_get_tx_queue_len) + ifr->ifr_qlen = dev->netdev_ops->ndo_get_tx_queue_len(dev); + else + ifr->ifr_qlen = dev->tx_queue_len; return 0; default: @@ -296,7 +299,10 @@ static int dev_ifsioc(struct net *net, struct ifreq *ifr, unsigned int cmd) case SIOCSIFTXQLEN: if (ifr->ifr_qlen < 0) return -EINVAL; - dev->tx_queue_len = ifr->ifr_qlen; + if (dev->netdev_ops->ndo_set_tx_queue_len) + dev->netdev_ops->ndo_set_tx_queue_len(dev, ifr->ifr_qlen); + else + dev->tx_queue_len = ifr->ifr_qlen; return 0; case SIOCSIFNAME: diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c index d954b56..f2fd9d5 100644 --- a/net/core/net-sysfs.c +++ b/net/core/net-sysfs.c @@ -280,10 +280,31 @@ NETDEVICE_SHOW_RW(flags, fmt_hex); static int change_tx_queue_len(struct net_device *net, unsigned long new_len) { - net->tx_queue_len = new_len; + if (dev->netdev_ops->ndo_set_tx_queue_len) + dev->netdev_ops->ndo_set_tx_queue_len(dev, new_len); + else + dev->tx_queue_len = new_len; return 0; } +static ssize_t format_tx_queue_len(const struct net_device *net, char *buf) +{ + unsigned long tx_queue_len; + + if (dev->netdev_ops->ndo_get_tx_queue_len) + tx_queue_len = dev->netdev_ops->ndo_get_tx_queue_len(dev); + else + tx_queue_len = dev->tx_queue_len; + + return sprintf(buf, fmt_ulong, tx_queue_len); +} + +static ssize_t tx_queue_len_show(struct device *dev, + struct device_attribute *attr, char *buf) +{ + return netdev_show(dev, attr, buf, format_tx_queue_len); +} + static ssize_t tx_queue_len_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t len) @@ -293,7 +314,7 @@ static ssize_t tx_queue_len_store(struct device *dev, return netdev_store(dev, attr, buf, len, change_tx_queue_len); } -NETDEVICE_SHOW_RW(tx_queue_len, fmt_ulong); +DEVICE_ATTR_RW(tx_queue_len); static ssize_t ifalias_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t len) diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c index 2a0e21d..9276e17 100644 --- a/net/core/rtnetlink.c +++ b/net/core/rtnetlink.c @@ -876,6 +876,7 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev, struct nlattr *attr, *af_spec; struct rtnl_af_ops *af_ops; struct net_device *upper_dev = netdev_master_upper_dev_get(dev); + unsigned long tx_queue_len; ASSERT_RTNL(); nlh = nlmsg_put(skb, pid, seq, type, sizeof(*ifm), flags); @@ -890,8 +891,13 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev, ifm->ifi_flags = dev_get_flags(dev); ifm->ifi_change = change; + if (dev->netdev_ops->ndo_get_tx_queue_len) + tx_queue_len = dev->netdev_ops->ndo_get_tx_queue_len(dev); + else + tx_queue_len = dev->tx_queue_len; + if (nla_put_string(skb, IFLA_IFNAME, dev->name) || - nla_put_u32(skb, IFLA_TXQLEN, dev->tx_queue_len) || + nla_put_u32(skb, IFLA_TXQLEN, tx_queue_len) || nla_put_u8(skb, IFLA_OPERSTATE, netif_running(dev) ? dev->operstate : IF_OPER_DOWN) || nla_put_u8(skb, IFLA_LINKMODE, dev->link_mode) || @@ -1453,8 +1459,14 @@ static int do_setlink(struct net_device *dev, struct ifinfomsg *ifm, modified = 1; } - if (tb[IFLA_TXQLEN]) - dev->tx_queue_len = nla_get_u32(tb[IFLA_TXQLEN]); + if (tb[IFLA_TXQLEN]) { + u32 new_len = nla_get_u32(tb[IFLA_TXQLEN]); + + if (dev->netdev_ops->ndo_set_tx_queue_len) + dev->netdev_ops->ndo_set_tx_queue_len(dev, new_len); + else + dev->tx_queue_len = new_len; + } if (tb[IFLA_OPERSTATE]) set_operstate(dev, nla_get_u8(tb[IFLA_OPERSTATE])); @@ -1692,8 +1704,14 @@ struct net_device *rtnl_create_link(struct net *net, if (tb[IFLA_BROADCAST]) memcpy(dev->broadcast, nla_data(tb[IFLA_BROADCAST]), nla_len(tb[IFLA_BROADCAST])); - if (tb[IFLA_TXQLEN]) - dev->tx_queue_len = nla_get_u32(tb[IFLA_TXQLEN]); + if (tb[IFLA_TXQLEN]) { + u32 new_len = nla_get_u32(tb[IFLA_TXQLEN]); + + if (dev->netdev_ops->ndo_set_tx_queue_len) + dev->netdev_ops->ndo_set_tx_queue_len(dev, new_len); + else + dev->tx_queue_len = new_len; + } if (tb[IFLA_OPERSTATE]) set_operstate(dev, nla_get_u8(tb[IFLA_OPERSTATE])); if (tb[IFLA_LINKMODE]) Anyone objects? > It is a change in behaviour > though in net-next. Previously dev_queue_xmit() was not used so > the qdisc layer was never hit on the macvtap device. Now with > dev_queue_xmit() if the user attaches multiple macvlan queues to a > single qdisc queue this should still work but wont be optimal. Ideally > the user should create as many qdisc queues at creation time via > numtxqueues as macvtap queues will be used during runtime so that there > is a 1:1 mapping or use some heuristic to avoid cases where there > is a many to 1 mapping. > > From my perspective it would be OK to push this configuration/policy > to the management layer. After all it is the entity that knows how > to distribute system resources amongst the objects running over the > macvtap devices. The relevance for me is I defaulted the l2 offloaded > macvlans to single queue devices and wanted to note in net-next this > is the same policy used in the non-offloaded case. > > Bit long-winded there. > > Thanks, > John I think it's a real problem you are pointing out - a performance regression for multiqueue devices. If we really think no qdisc is typically the right thing to do, we should just make it the default I think, but I agree just making tx_queue_len 0 doesn't work without other changes. What do others think about extra ndo ops so devices can decouple the internal tx_queue_len from the userspace-visible value? ^ permalink raw reply related [flat|nested] 35+ messages in thread
* Re: [PATCH net 1/2] macvlan: forbid L2 fowarding offload for macvtap 2014-01-09 7:17 ` Michael S. Tsirkin @ 2014-01-09 8:55 ` Jason Wang 2014-01-09 21:39 ` Stephen Hemminger 0 siblings, 1 reply; 35+ messages in thread From: Jason Wang @ 2014-01-09 8:55 UTC (permalink / raw) To: Michael S. Tsirkin, John Fastabend Cc: John Fastabend, Neil Horman, davem, netdev, linux-kernel, Vlad Yasevich On 01/09/2014 03:17 PM, Michael S. Tsirkin wrote: > On Wed, Jan 08, 2014 at 11:05:42AM -0800, John Fastabend wrote: >> [...] >> >>>>> OK I think I'm finally putting all the pieces together thanks. >>>>> >>>>> Do you know why macvtap is setting dev->tx_queue_len by default? If you >>>>> zero this then the noqueue_qdisc is used and the q->enqueue check in >>>>> dev_queue_xmit will fail. >>>> It was introduced in commit 8a35747a5d13b99e076b0222729e0caa48cb69b6 >>>> ("macvtap: Limit packet queue length") to limit the length of socket >>>> receive queue of macvtap. But I'm not sure whether the qdisc is a >>>> byproduct of this commit, maybe we can switch to use another name >>>> instead of just reuse dev->tx_queue_length. >>> You mean tx_queue_len really, right? >>> >>> Problem is tx_queue_len can be accessed using netlink sysfs or ioctl, >>> so if someone uses these to control or check the # of packets that >>> can be queued by device, this will break. >>> >>> How about adding ndo_set_tx_queue_len then? >>> >>> At some point we wanted to decouple queue length from tx_queue_length >>> for tun as well, so that would be benefitial there as well. >> On the receive side we need to limit the receive queue and the >> dev->tx_queue_len value was used to do this. >> >> However on the tx side when dev->tx_queue_len is set the default >> qdisc pfifo_fast or mq is used depending on if there is multiqueue >> or not. Unless the user specifies some numtxqueues when creating >> the macvtap device then it will be a single queue device and use >> the pfifo_fast qdisc. >> >> This is the default behaviour users could zero txqueuelen when >> they create the device because it is a stacked device with >> NETIF_F_LLTX using the lower devices qdisc makes sense but this >> would appear (from code inspection) to break macvtap_forward()? >> >> if (skb_queue_len(&q->sk.sk_receive_queue) >= dev->tx_queue_len) >> goto drop; >> >> I'm not sure any of this is a problem other than its a bit >> confusing to overload tx_queue_len for the rx_queue_len but the >> precedent has been there for sometime. > So how about ndo ops to access tx_queue_len then? > This way we can set tx_queue_len to 0 and use some > other field to store the rx_queue_len without users noticing. > Along the lines of the below (it's a partial patch just > to give you the idea): > > diff --git a/net/core/dev_ioctl.c b/net/core/dev_ioctl.c > index 5b7d0e1..e526b46 100644 > --- a/net/core/dev_ioctl.c > +++ b/net/core/dev_ioctl.c > @@ -167,7 +167,10 @@ static int dev_ifsioc_locked(struct net *net, struct ifreq *ifr, unsigned int cm > return 0; > > case SIOCGIFTXQLEN: > - ifr->ifr_qlen = dev->tx_queue_len; > + if (dev->netdev_ops->ndo_get_tx_queue_len) > + ifr->ifr_qlen = dev->netdev_ops->ndo_get_tx_queue_len(dev); > + else > + ifr->ifr_qlen = dev->tx_queue_len; > return 0; > > default: > @@ -296,7 +299,10 @@ static int dev_ifsioc(struct net *net, struct ifreq *ifr, unsigned int cmd) > case SIOCSIFTXQLEN: > if (ifr->ifr_qlen < 0) > return -EINVAL; > - dev->tx_queue_len = ifr->ifr_qlen; > + if (dev->netdev_ops->ndo_set_tx_queue_len) > + dev->netdev_ops->ndo_set_tx_queue_len(dev, ifr->ifr_qlen); > + else > + dev->tx_queue_len = ifr->ifr_qlen; > return 0; > > case SIOCSIFNAME: > diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c > index d954b56..f2fd9d5 100644 > --- a/net/core/net-sysfs.c > +++ b/net/core/net-sysfs.c > @@ -280,10 +280,31 @@ NETDEVICE_SHOW_RW(flags, fmt_hex); > > static int change_tx_queue_len(struct net_device *net, unsigned long new_len) > { > - net->tx_queue_len = new_len; > + if (dev->netdev_ops->ndo_set_tx_queue_len) > + dev->netdev_ops->ndo_set_tx_queue_len(dev, new_len); > + else > + dev->tx_queue_len = new_len; > return 0; > } > > +static ssize_t format_tx_queue_len(const struct net_device *net, char *buf) > +{ > + unsigned long tx_queue_len; > + > + if (dev->netdev_ops->ndo_get_tx_queue_len) > + tx_queue_len = dev->netdev_ops->ndo_get_tx_queue_len(dev); > + else > + tx_queue_len = dev->tx_queue_len; > + > + return sprintf(buf, fmt_ulong, tx_queue_len); > +} > + > +static ssize_t tx_queue_len_show(struct device *dev, > + struct device_attribute *attr, char *buf) > +{ > + return netdev_show(dev, attr, buf, format_tx_queue_len); > +} > + > static ssize_t tx_queue_len_store(struct device *dev, > struct device_attribute *attr, > const char *buf, size_t len) > @@ -293,7 +314,7 @@ static ssize_t tx_queue_len_store(struct device *dev, > > return netdev_store(dev, attr, buf, len, change_tx_queue_len); > } > -NETDEVICE_SHOW_RW(tx_queue_len, fmt_ulong); > +DEVICE_ATTR_RW(tx_queue_len); > > static ssize_t ifalias_store(struct device *dev, struct device_attribute *attr, > const char *buf, size_t len) > diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c > index 2a0e21d..9276e17 100644 > --- a/net/core/rtnetlink.c > +++ b/net/core/rtnetlink.c > @@ -876,6 +876,7 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev, > struct nlattr *attr, *af_spec; > struct rtnl_af_ops *af_ops; > struct net_device *upper_dev = netdev_master_upper_dev_get(dev); > + unsigned long tx_queue_len; > > ASSERT_RTNL(); > nlh = nlmsg_put(skb, pid, seq, type, sizeof(*ifm), flags); > @@ -890,8 +891,13 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev, > ifm->ifi_flags = dev_get_flags(dev); > ifm->ifi_change = change; > > + if (dev->netdev_ops->ndo_get_tx_queue_len) > + tx_queue_len = dev->netdev_ops->ndo_get_tx_queue_len(dev); > + else > + tx_queue_len = dev->tx_queue_len; > + > if (nla_put_string(skb, IFLA_IFNAME, dev->name) || > - nla_put_u32(skb, IFLA_TXQLEN, dev->tx_queue_len) || > + nla_put_u32(skb, IFLA_TXQLEN, tx_queue_len) || > nla_put_u8(skb, IFLA_OPERSTATE, > netif_running(dev) ? dev->operstate : IF_OPER_DOWN) || > nla_put_u8(skb, IFLA_LINKMODE, dev->link_mode) || > @@ -1453,8 +1459,14 @@ static int do_setlink(struct net_device *dev, struct ifinfomsg *ifm, > modified = 1; > } > > - if (tb[IFLA_TXQLEN]) > - dev->tx_queue_len = nla_get_u32(tb[IFLA_TXQLEN]); > + if (tb[IFLA_TXQLEN]) { > + u32 new_len = nla_get_u32(tb[IFLA_TXQLEN]); > + > + if (dev->netdev_ops->ndo_set_tx_queue_len) > + dev->netdev_ops->ndo_set_tx_queue_len(dev, new_len); > + else > + dev->tx_queue_len = new_len; > + } > > if (tb[IFLA_OPERSTATE]) > set_operstate(dev, nla_get_u8(tb[IFLA_OPERSTATE])); > @@ -1692,8 +1704,14 @@ struct net_device *rtnl_create_link(struct net *net, > if (tb[IFLA_BROADCAST]) > memcpy(dev->broadcast, nla_data(tb[IFLA_BROADCAST]), > nla_len(tb[IFLA_BROADCAST])); > - if (tb[IFLA_TXQLEN]) > - dev->tx_queue_len = nla_get_u32(tb[IFLA_TXQLEN]); > + if (tb[IFLA_TXQLEN]) { > + u32 new_len = nla_get_u32(tb[IFLA_TXQLEN]); > + > + if (dev->netdev_ops->ndo_set_tx_queue_len) > + dev->netdev_ops->ndo_set_tx_queue_len(dev, new_len); > + else > + dev->tx_queue_len = new_len; > + } > if (tb[IFLA_OPERSTATE]) > set_operstate(dev, nla_get_u8(tb[IFLA_OPERSTATE])); > if (tb[IFLA_LINKMODE]) > > Anyone objects? What if use do want a qdisc and want to change the its queue length for tun/macvlan? And the the name tx_queue_length is misleading. For tun it may make sense since it was used in transmission path. For macvtap it was not. So maybe what we need is just a new ioctl for both tun/macvtap and a new feature flag. If user create the device with new feature flag, the socket receive queue length could be changed by ioctl instead of dev->tx_queue_length. If not, the old behaviour could be kept. >> It is a change in behaviour >> though in net-next. Previously dev_queue_xmit() was not used so >> the qdisc layer was never hit on the macvtap device. Now with >> dev_queue_xmit() if the user attaches multiple macvlan queues to a >> single qdisc queue this should still work but wont be optimal. Ideally >> the user should create as many qdisc queues at creation time via >> numtxqueues as macvtap queues will be used during runtime so that there >> is a 1:1 mapping or use some heuristic to avoid cases where there >> is a many to 1 mapping. >> >> From my perspective it would be OK to push this configuration/policy >> to the management layer. After all it is the entity that knows how >> to distribute system resources amongst the objects running over the >> macvtap devices. The relevance for me is I defaulted the l2 offloaded >> macvlans to single queue devices and wanted to note in net-next this >> is the same policy used in the non-offloaded case. >> >> Bit long-winded there. >> >> Thanks, >> John > I think it's a real problem you are pointing out - a performance > regression for multiqueue devices. > If we really think no qdisc is typically the right thing to do, > we should just make it the default I think, but I agree > just making tx_queue_len 0 doesn't work without other changes. > > What do others think about extra ndo ops so devices can decouple > the internal tx_queue_len from the userspace-visible value? > > ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH net 1/2] macvlan: forbid L2 fowarding offload for macvtap 2014-01-09 8:55 ` Jason Wang @ 2014-01-09 21:39 ` Stephen Hemminger 2014-01-09 22:03 ` Michael S. Tsirkin 2014-01-10 7:06 ` Jason Wang 0 siblings, 2 replies; 35+ messages in thread From: Stephen Hemminger @ 2014-01-09 21:39 UTC (permalink / raw) To: Jason Wang Cc: Michael S. Tsirkin, John Fastabend, John Fastabend, Neil Horman, davem, netdev, linux-kernel, Vlad Yasevich On Thu, 09 Jan 2014 16:55:07 +0800 Jason Wang <jasowang@redhat.com> wrote: > What if use do want a qdisc and want to change the its queue length for > tun/macvlan? And the the name tx_queue_length is misleading. For tun it > may make sense since it was used in transmission path. For macvtap it > was not. So maybe what we need is just a new ioctl for both tun/macvtap > and a new feature flag. If user create the device with new feature flag, > the socket receive queue length could be changed by ioctl instead of > dev->tx_queue_length. If not, the old behaviour could be kept. The overloading of tx_queue_len in macvtap was the original design mistake. Can't this just be undone and expose rx_queue_len as sysfs attribute? ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH net 1/2] macvlan: forbid L2 fowarding offload for macvtap 2014-01-09 21:39 ` Stephen Hemminger @ 2014-01-09 22:03 ` Michael S. Tsirkin 2014-01-09 22:20 ` Stephen Hemminger 2014-01-10 7:06 ` Jason Wang 1 sibling, 1 reply; 35+ messages in thread From: Michael S. Tsirkin @ 2014-01-09 22:03 UTC (permalink / raw) To: Stephen Hemminger Cc: Jason Wang, John Fastabend, John Fastabend, Neil Horman, davem, netdev, linux-kernel, Vlad Yasevich On Thu, Jan 09, 2014 at 01:39:08PM -0800, Stephen Hemminger wrote: > On Thu, 09 Jan 2014 16:55:07 +0800 > Jason Wang <jasowang@redhat.com> wrote: > > > What if use do want a qdisc and want to change the its queue length for > > tun/macvlan? And the the name tx_queue_length is misleading. For tun it > > may make sense since it was used in transmission path. For macvtap it > > was not. So maybe what we need is just a new ioctl for both tun/macvtap > > and a new feature flag. If user create the device with new feature flag, > > the socket receive queue length could be changed by ioctl instead of > > dev->tx_queue_length. If not, the old behaviour could be kept. > > The overloading of tx_queue_len in macvtap was the original design mistake. > Can't this just be undone and expose rx_queue_len as sysfs attribute? Yes but we need to avoid breaking user-visible ABI. So I think we'll need to catch any access attempts and redirect them to the new rx_queue_len. I posted a patch like this using new ndo_set_tx_queue_len/ndo_get_tx_queue_len. Have you seen it? What do you think? ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH net 1/2] macvlan: forbid L2 fowarding offload for macvtap 2014-01-09 22:03 ` Michael S. Tsirkin @ 2014-01-09 22:20 ` Stephen Hemminger 0 siblings, 0 replies; 35+ messages in thread From: Stephen Hemminger @ 2014-01-09 22:20 UTC (permalink / raw) To: Michael S. Tsirkin Cc: Jason Wang, John Fastabend, John Fastabend, Neil Horman, davem, netdev, linux-kernel, Vlad Yasevich On Fri, 10 Jan 2014 00:03:23 +0200 "Michael S. Tsirkin" <mst@redhat.com> wrote: > On Thu, Jan 09, 2014 at 01:39:08PM -0800, Stephen Hemminger wrote: > > On Thu, 09 Jan 2014 16:55:07 +0800 > > Jason Wang <jasowang@redhat.com> wrote: > > > > > What if use do want a qdisc and want to change the its queue length for > > > tun/macvlan? And the the name tx_queue_length is misleading. For tun it > > > may make sense since it was used in transmission path. For macvtap it > > > was not. So maybe what we need is just a new ioctl for both tun/macvtap > > > and a new feature flag. If user create the device with new feature flag, > > > the socket receive queue length could be changed by ioctl instead of > > > dev->tx_queue_length. If not, the old behaviour could be kept. > > > > The overloading of tx_queue_len in macvtap was the original design mistake. > > Can't this just be undone and expose rx_queue_len as sysfs attribute? > > Yes but we need to avoid breaking user-visible ABI. I think in this case, it was a mistake and hasn't been around long enough to cause serious damage. > So I think we'll need to catch any access attempts and redirect them to > the new rx_queue_len. I posted a patch like this using new > ndo_set_tx_queue_len/ndo_get_tx_queue_len. Have you seen it? What do > you think? It encourages others to do/make the same mistake so I don't like it. ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH net 1/2] macvlan: forbid L2 fowarding offload for macvtap 2014-01-09 21:39 ` Stephen Hemminger 2014-01-09 22:03 ` Michael S. Tsirkin @ 2014-01-10 7:06 ` Jason Wang 2014-01-10 16:40 ` Vlad Yasevich 1 sibling, 1 reply; 35+ messages in thread From: Jason Wang @ 2014-01-10 7:06 UTC (permalink / raw) To: Stephen Hemminger Cc: Michael S. Tsirkin, John Fastabend, John Fastabend, Neil Horman, davem, netdev, linux-kernel, Vlad Yasevich On 01/10/2014 05:39 AM, Stephen Hemminger wrote: > On Thu, 09 Jan 2014 16:55:07 +0800 > Jason Wang <jasowang@redhat.com> wrote: > >> What if use do want a qdisc and want to change the its queue length for >> tun/macvlan? And the the name tx_queue_length is misleading. For tun it >> may make sense since it was used in transmission path. For macvtap it >> was not. So maybe what we need is just a new ioctl for both tun/macvtap >> and a new feature flag. If user create the device with new feature flag, >> the socket receive queue length could be changed by ioctl instead of >> dev->tx_queue_length. If not, the old behaviour could be kept. > The overloading of tx_queue_len in macvtap was the original design mistake. > Can't this just be undone and expose rx_queue_len as sysfs attribute? That works. But we current allow user to change the socket sndbuf through TUNSNDBUF. Maybe we need a similar one for receive. ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH net 1/2] macvlan: forbid L2 fowarding offload for macvtap 2014-01-10 7:06 ` Jason Wang @ 2014-01-10 16:40 ` Vlad Yasevich 0 siblings, 0 replies; 35+ messages in thread From: Vlad Yasevich @ 2014-01-10 16:40 UTC (permalink / raw) To: Jason Wang, Stephen Hemminger Cc: Michael S. Tsirkin, John Fastabend, John Fastabend, Neil Horman, davem, netdev, linux-kernel On 01/10/2014 02:06 AM, Jason Wang wrote: > On 01/10/2014 05:39 AM, Stephen Hemminger wrote: >> On Thu, 09 Jan 2014 16:55:07 +0800 >> Jason Wang <jasowang@redhat.com> wrote: >> >>> What if use do want a qdisc and want to change the its queue length for >>> tun/macvlan? And the the name tx_queue_length is misleading. For tun it >>> may make sense since it was used in transmission path. For macvtap it >>> was not. So maybe what we need is just a new ioctl for both tun/macvtap >>> and a new feature flag. If user create the device with new feature flag, >>> the socket receive queue length could be changed by ioctl instead of >>> dev->tx_queue_length. If not, the old behaviour could be kept. >> The overloading of tx_queue_len in macvtap was the original design mistake. >> Can't this just be undone and expose rx_queue_len as sysfs attribute? > > That works. But we current allow user to change the socket sndbuf > through TUNSNDBUF. Maybe we need a similar one for receive. > That would make sense. Since the user interacts with tun fd almost as a socket and there is actually a socket hiding in the kernel, it almost begs for actual SO_SNDBUF/SO_RCVBUF support :) -vlad ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH net 1/2] macvlan: forbid L2 fowarding offload for macvtap 2014-01-07 3:10 ` Jason Wang 2014-01-07 5:15 ` John Fastabend @ 2014-01-07 5:16 ` John Fastabend 1 sibling, 0 replies; 35+ messages in thread From: John Fastabend @ 2014-01-07 5:16 UTC (permalink / raw) To: Jason Wang Cc: Neil Horman, davem, netdev, linux-kernel, mst, John Fastabend, Vlad Yasevich On 01/06/2014 07:10 PM, Jason Wang wrote: > On 01/06/2014 08:26 PM, Neil Horman wrote: >> On Mon, Jan 06, 2014 at 03:54:21PM +0800, Jason Wang wrote: >>> On 01/06/2014 03:35 PM, John Fastabend wrote: >>>> On 01/05/2014 07:21 PM, Jason Wang wrote: >>>>> L2 fowarding offload will bypass the rx handler of real device. This >>>>> will make >>>>> the packet could not be forwarded to macvtap device. Another problem >>>>> is the >>>>> dev_hard_start_xmit() called for macvtap does not have any >>>>> synchronization. >>>>> >>>>> Fix this by forbidding L2 forwarding for macvtap. >>>>> >>>>> Cc: John Fastabend <john.r.fastabend@intel.com> >>>>> Cc: Neil Horman <nhorman@tuxdriver.com> >>>>> Signed-off-by: Jason Wang <jasowang@redhat.com> >>>>> --- >>>>> drivers/net/macvlan.c | 5 ++++- >>>>> 1 files changed, 4 insertions(+), 1 deletions(-) >>>>> >>>> I must be missing something. >>>> >>>> The lower layer device should set skb->dev to the correct macvtap >>>> device on receive so that in netif_receive_skb_core() the macvtap >>>> handler is hit. Skipping the macvlan receive handler should be OK >>>> because the switching was done by the hardware. If I read macvtap.c >>>> correctly macvlan_common_newlink() is called with 'dev' where 'dev' >>>> is the macvtap device. Any idea what I'm missing? I guess I'll need >>>> to setup a macvtap test case. >>> Unlike macvlan, macvtap depends on rx handler on the lower device to >>> work. In this case macvlan_handle_frame() will call macvtap_receive(). >>> So doing netif_receive_skb_core() for macvtap device directly won't work >>> since we need to forward the packet to userspace instead of kernel. >>> >>> For net-next.git, it may work since commit >>> 6acf54f1cf0a6747bac9fea26f34cfc5a9029523 let macvtap device register an >>> rx handler for itself. >> I agree, this seems like it should already be fixed with the above commit. With >> this the macvlan rx handler should effectively be a no-op as far as the >> reception of frames is concerned. As long as the driver sets the dev correctly >> to the macvtap device (and it appears to), macvtap will get frames to user >> space, regardless of weather the software or hardware did the switching. If >> thats the case though, I think the solution is moving that fix to -stable >> (pending testing of course), rather than comming up with a new fix. >> >>>> And what synchronization are you worried about on dev_hard_start_xmit()? >>>> In the L2 forwarding offload case macvlan_open() clears the NETIF_F_LLTX >>>> flag so HARD_TX_LOCK protects the driver txq. We might hit this warning >>>> in dev_queue_xmit() though, >>>> >>>> net_crit_ratelimited("Virtual device %s asks to queue packet!\n", >>>> >>>> Perhaps we can remove it. >>> The problem is macvtap does not call dev_queue_xmit() for macvlan >>> device. It calls macvlan_start_xmit() directly from macvtap_get_user(). >>> So HARD_TX_LOCK was not done for the txq. >> This seems to also be fixed by 6acf54f1cf0a6747bac9fea26f34cfc5a9029523. >> Macvtap does, as of that commit use dev_queue_xmit for the transmission of >> frames to the lowerdevice. > > Unfortunately not. This commit has a side effect that it in fact > disables the multiqueue macvtap transmission. Since all macvtap queues > will contend on a single qdisc lock. > They will only contend on a single qdisc lock if the lower device has 1 queue. Perhaps defaulting the L2 forwarding devices to 1queue was a mistake. But the same issue arises when running macvtap over a non-multiqueue nic. Or even if you have a multiqueue device and create many more macvtap queues than the lower device has queues. Shouldn't the macvtap configuration take into account the lowest level devices queues? How does using the L2 forwarding device change the contention issues? Without the L2 forwarding LLTX is enabled but the qdisc lock, etc is still acquired on the device below the macvlan. The ixgbe driver as it is currently written can be configured for up to 4 queues by setting numtxqueues when the device is created. I assume when creating macvtap queues the user needs to account for the number of queues supported by the lower device. > For L2 forwarding offload itself, more issues need to be addressed for > multiqueue macvtap: > > - ndo_dfwd_add_station() can only create queues per device at ndo_open, > but multiqueue macvtap allows user to create and destroy queues at their > will and at any time. same argument as above, isn't this the same when running macvtap without the l2 offloads over a real device? I expect you hit the same contention points when running over a real device. > - it looks that ixgbe has a upper limit of 4 queues per station, but > macvtap currently allows up to 16 queues per device. > The 4 limit was to simplify the code because the queue mapping in the driver gets complicated if it is greater than 4. We can probably increase this latter. But sorry reiterating how is this different than a macvtap on a real device that supports a max of 4 queues? > So more works need to be done and unless those above 3 issues were > addressed, this patch is really needed to make sure macvtap works. > Agreed there is a lot more work here to improve things I'm just not sure we need to disable this now. Also note its the l2 forwarding should be disabled by default so a user would have to enable the feature flag. Thanks, John -- John Fastabend Intel Corporation ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH net 1/2] macvlan: forbid L2 fowarding offload for macvtap 2014-01-06 3:21 [PATCH net 1/2] macvlan: forbid L2 fowarding offload for macvtap Jason Wang 2014-01-06 3:21 ` [PATCH net 2/2] net: core: explicitly select a txq before doing l2 forwarding Jason Wang 2014-01-06 7:35 ` [PATCH net 1/2] macvlan: forbid L2 fowarding offload for macvtap John Fastabend @ 2014-01-06 20:47 ` David Miller 2014-01-07 3:17 ` Jason Wang 2 siblings, 1 reply; 35+ messages in thread From: David Miller @ 2014-01-06 20:47 UTC (permalink / raw) To: jasowang; +Cc: netdev, linux-kernel, mst, john.r.fastabend, nhorman From: Jason Wang <jasowang@redhat.com> Date: Mon, 6 Jan 2014 11:21:06 +0800 > L2 fowarding offload will bypass the rx handler of real device. This will make > the packet could not be forwarded to macvtap device. Another problem is the > dev_hard_start_xmit() called for macvtap does not have any synchronization. > > Fix this by forbidding L2 forwarding for macvtap. > > Cc: John Fastabend <john.r.fastabend@intel.com> > Cc: Neil Horman <nhorman@tuxdriver.com> > Signed-off-by: Jason Wang <jasowang@redhat.com> I think I agree with Neil that the rx_handler change might be the best way to fix this. That change seems to have a lot of nice unintended side effects, no? ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH net 1/2] macvlan: forbid L2 fowarding offload for macvtap 2014-01-06 20:47 ` David Miller @ 2014-01-07 3:17 ` Jason Wang 2014-01-07 5:57 ` David Miller 0 siblings, 1 reply; 35+ messages in thread From: Jason Wang @ 2014-01-07 3:17 UTC (permalink / raw) To: David Miller; +Cc: netdev, linux-kernel, mst, john.r.fastabend, nhorman On 01/07/2014 04:47 AM, David Miller wrote: > From: Jason Wang <jasowang@redhat.com> > Date: Mon, 6 Jan 2014 11:21:06 +0800 > >> L2 fowarding offload will bypass the rx handler of real device. This will make >> the packet could not be forwarded to macvtap device. Another problem is the >> dev_hard_start_xmit() called for macvtap does not have any synchronization. >> >> Fix this by forbidding L2 forwarding for macvtap. >> >> Cc: John Fastabend <john.r.fastabend@intel.com> >> Cc: Neil Horman <nhorman@tuxdriver.com> >> Signed-off-by: Jason Wang <jasowang@redhat.com> > I think I agree with Neil that the rx_handler change might be the best > way to fix this. That change seems to have a lot of nice unintended > side effects, no? Not all sides effects are nice. One obvious issue is it disables the multiqueue macvtap transmission, since all queues will contend on a single qdisc lock of macvlan. And even more, multiqueue macvtap support creating and destroying a queue on demand which is not supported by L2 forwarding offload. So L2 forwarding offload needs more fixes to let the multiqueue macvtap works. Currently, we really need this patch to make sure macvtap works as expected. ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH net 1/2] macvlan: forbid L2 fowarding offload for macvtap 2014-01-07 3:17 ` Jason Wang @ 2014-01-07 5:57 ` David Miller 0 siblings, 0 replies; 35+ messages in thread From: David Miller @ 2014-01-07 5:57 UTC (permalink / raw) To: jasowang; +Cc: netdev, linux-kernel, mst, john.r.fastabend, nhorman From: Jason Wang <jasowang@redhat.com> Date: Tue, 07 Jan 2014 11:17:06 +0800 > On 01/07/2014 04:47 AM, David Miller wrote: >> From: Jason Wang <jasowang@redhat.com> >> Date: Mon, 6 Jan 2014 11:21:06 +0800 >> >>> L2 fowarding offload will bypass the rx handler of real device. This will make >>> the packet could not be forwarded to macvtap device. Another problem is the >>> dev_hard_start_xmit() called for macvtap does not have any synchronization. >>> >>> Fix this by forbidding L2 forwarding for macvtap. >>> >>> Cc: John Fastabend <john.r.fastabend@intel.com> >>> Cc: Neil Horman <nhorman@tuxdriver.com> >>> Signed-off-by: Jason Wang <jasowang@redhat.com> >> I think I agree with Neil that the rx_handler change might be the best >> way to fix this. That change seems to have a lot of nice unintended >> side effects, no? > > Not all sides effects are nice. > > One obvious issue is it disables the multiqueue macvtap transmission, > since all queues will contend on a single qdisc lock of macvlan. And > even more, multiqueue macvtap support creating and destroying a queue on > demand which is not supported by L2 forwarding offload. > > So L2 forwarding offload needs more fixes to let the multiqueue macvtap > works. Currently, we really need this patch to make sure macvtap works > as expected. Ok I moved these two patches back to "Under Review". These are pretty last minute and we'll need to make a decision on what to do before Friday if you want these changes to really make it into 3.13 ^ permalink raw reply [flat|nested] 35+ messages in thread
end of thread, other threads:[~2014-01-10 16:41 UTC | newest] Thread overview: 35+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2014-01-06 3:21 [PATCH net 1/2] macvlan: forbid L2 fowarding offload for macvtap Jason Wang 2014-01-06 3:21 ` [PATCH net 2/2] net: core: explicitly select a txq before doing l2 forwarding Jason Wang 2014-01-06 12:04 ` Jeff Kirsher 2014-01-06 12:42 ` Neil Horman 2014-01-06 15:06 ` John Fastabend 2014-01-06 15:29 ` Neil Horman 2014-01-07 3:42 ` Jason Wang 2014-01-07 13:17 ` Neil Horman 2014-01-08 3:21 ` Jason Wang 2014-01-08 14:40 ` Neil Horman 2014-01-09 8:28 ` Jason Wang 2014-01-09 11:53 ` Neil Horman 2014-01-07 8:22 ` John Fastabend 2014-01-07 8:37 ` John Fastabend 2014-01-06 7:35 ` [PATCH net 1/2] macvlan: forbid L2 fowarding offload for macvtap John Fastabend 2014-01-06 7:54 ` Jason Wang 2014-01-06 12:26 ` Neil Horman 2014-01-07 3:10 ` Jason Wang 2014-01-07 5:15 ` John Fastabend 2014-01-07 6:22 ` Jason Wang 2014-01-07 7:26 ` John Fastabend 2014-01-07 9:00 ` Jason Wang 2014-01-08 12:55 ` Michael S. Tsirkin 2014-01-08 19:05 ` John Fastabend 2014-01-09 7:17 ` Michael S. Tsirkin 2014-01-09 8:55 ` Jason Wang 2014-01-09 21:39 ` Stephen Hemminger 2014-01-09 22:03 ` Michael S. Tsirkin 2014-01-09 22:20 ` Stephen Hemminger 2014-01-10 7:06 ` Jason Wang 2014-01-10 16:40 ` Vlad Yasevich 2014-01-07 5:16 ` John Fastabend 2014-01-06 20:47 ` David Miller 2014-01-07 3:17 ` Jason Wang 2014-01-07 5:57 ` David Miller
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).