Netdev List
 help / color / mirror / Atom feed
* Re: Quota in __qdisc_run()
From: Eric Dumazet @ 2014-10-07 17:32 UTC (permalink / raw)
  To: David Miller
  Cc: hannes, brouer, netdev, therbert, fw, dborkman, jhs,
	alexander.duyck, john.r.fastabend, dave.taht, toke
In-Reply-To: <20141007.131938.1410434352331637585.davem@davemloft.net>

On Tue, 2014-10-07 at 13:19 -0400, David Miller wrote:

> Yes, this makes sense, do a full qdisc_restart() cycle without boundaries,
> then check how much quota was used afterwards to guard the outermost loop.

I am testing this, and also am testing the xmit_more patch for I40E.

Will send patches today.

Thanks

^ permalink raw reply

* Re: [PATCH v2 7/7] driver-core: add preferred async probe option for built-in and modules
From: Tejun Heo @ 2014-10-07 17:34 UTC (permalink / raw)
  To: Luis R. Rodriguez
  Cc: Luis R. Rodriguez, gregkh, dmitry.torokhov, tiwai, arjan, teg,
	rmilasan, werner, oleg, hare, bpoirier, santosh, pmladek, dbueso,
	linux-kernel, Tetsuo Handa, Joseph Salisbury, Kay Sievers,
	One Thousand Gnomes, Tim Gardner, Pierre Fersing, Andrew Morton,
	Nagalakshmi Nandigama, Praveen Krishnamoorthy, Sreekanth Reddy,
	Abhijit Mahajan, Casey Leedom
In-Reply-To: <20141006231046.GD14081@wotan.suse.de>

Hello,

On Tue, Oct 07, 2014 at 01:10:46AM +0200, Luis R. Rodriguez wrote:
> On Mon, Oct 06, 2014 at 05:01:18PM -0400, Tejun Heo wrote:
> > For in-kernel stuff, we already have a clear
> > synchronization point where we already synchronize all async calls.
> > Shouldn't we be flushing these async probes there too?
> 
> This seems to be addressing if what I meant by prepared, "ready", so let
> me address this as I do think its important.
> 
> By async calls do you mean users of async_schedule()? I see it

Yes.

> also uses system_unbound_wq as well but I do not see anyone calling
> flush_workqueue(system_unbound_wq) on the kernel. We do use
> async_synchronize_full() on kernel_init() but that just waits.

But you can create a new workqueue and queue all the async probing
work items there and flush the workqueue right after
async_synchronize_full().

...
> bus.enable_kern_async=1 would still also serve as a helper for the driver core
> to figure out if it should use async probe then on modules if prefer_async_probe
> was enabled. Let me know if you figure out a way to avoid it.

Why do we need the choice at all?  It always should, no?

Thanks.

-- 
tejun

^ permalink raw reply

* Re: macvlan: optimizing the receive path?
From: Jason Baron @ 2014-10-07 17:35 UTC (permalink / raw)
  To: Vlad Yasevich, David Miller; +Cc: netdev@vger.kernel.org, kaber@trash.net
In-Reply-To: <5432936D.7010906@gmail.com>

On 10/06/2014 09:04 AM, Vlad Yasevich wrote:
> On 10/04/2014 08:42 PM, David Miller wrote:
>> From: Jason Baron <jbaron@akamai.com>
>> Date: Thu, 02 Oct 2014 16:28:13 -0400
>>
>>> --- a/drivers/net/macvlan.c
>>> +++ b/drivers/net/macvlan.c
>>> @@ -321,8 +321,8 @@ static rx_handler_result_t macvlan_handle_frame(struct sk_buff **pskb)
>>>         skb->dev = dev;
>>>         skb->pkt_type = PACKET_HOST;
>>>  
>>> -       ret = netif_rx(skb);
>>> -
>>> +      macvlan_count_rx(vlan, len, true, 0);
>>> +      return RX_HANDLER_ANOTHER;
>>>  out:
>>>         macvlan_count_rx(vlan, len, ret == NET_RX_SUCCESS, 0);
>>>         return RX_HANDLER_CONSUMED;
>>
>> That last argument to macvlan_count_rx() is a bool and thus should be
>> specified as "false".  Yes I know other areas of this file get it
>> wrong too.
>>

ok. I can fix those up too while here.

>> Also, what about GRO?  Won't we get GRO processing if we do this via
>> netif_rx() but not via the RX_HANDLER_ANOTHER route?  Just curious...
> 
> Wouldn't GRO already happen at the lower level?  For macvlan-to-macvlan,
> you'd typically have large packets so no need for GRO.
> 

Yes, afaict gro is happening a layer below __netif_receive_skb_core().

Here are some results of this optimization on 3.17 using macvlan with
lxc. Test case is (average of 3 runs):

for i in {35,50,65,80,95,110,125,140,155};
do super_netperf $i netperf -H $ip -t TCP_RR;
done

trans./sec (3.17)

494016
612806
673100
696982
710494
716830
714729
713478
711056

trans./sec (3.17 + macvlan patch)

517159  +(4.684733558%)
628382  +(2.541860742%)
669688  -(0.5069080835%)
706181  +(1.319833855%)
716660  +(0.8677995555%)
719581  +(0.3838661811%)
718738  +(0.5609585358%)
718904  +(0.7605470482%)
718344  +(1.02509555%)

On the host I can see that the idle time goes to 0, so this would
appear to be an improvement. I also observed that enqueue_to_backlog()
and process_backlog() are no longer in the 'perf' profiles as
expected.

So if there are no objections, I will post as a formal patch.

Thanks,

-Jason

^ permalink raw reply

* Re: [PATCH v2 7/7] driver-core: add preferred async probe option for built-in and modules
From: Luis R. Rodriguez @ 2014-10-07 17:50 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Luis R. Rodriguez, gregkh, dmitry.torokhov, tiwai, arjan, teg,
	rmilasan, werner, oleg, hare, bpoirier, santosh, pmladek, dbueso,
	linux-kernel, Tetsuo Handa, Joseph Salisbury, Kay Sievers,
	One Thousand Gnomes, Tim Gardner, Pierre Fersing, Andrew Morton,
	Nagalakshmi Nandigama, Praveen Krishnamoorthy, Sreekanth Reddy,
	Abhijit Mahajan, Casey Leedom
In-Reply-To: <20141007173404.GB31328@mtj.dyndns.org>

On Tue, Oct 07, 2014 at 01:34:04PM -0400, Tejun Heo wrote:
> Hello,
> 
> On Tue, Oct 07, 2014 at 01:10:46AM +0200, Luis R. Rodriguez wrote:
> > On Mon, Oct 06, 2014 at 05:01:18PM -0400, Tejun Heo wrote:
> > > For in-kernel stuff, we already have a clear
> > > synchronization point where we already synchronize all async calls.
> > > Shouldn't we be flushing these async probes there too?
> > 
> > This seems to be addressing if what I meant by prepared, "ready", so let
> > me address this as I do think its important.
> > 
> > By async calls do you mean users of async_schedule()? I see it
> 
> Yes.
> 
> > also uses system_unbound_wq as well but I do not see anyone calling
> > flush_workqueue(system_unbound_wq) on the kernel. We do use
> > async_synchronize_full() on kernel_init() but that just waits.
> 
> But you can create a new workqueue and queue all the async probing
> work items there and flush the workqueue right after
> async_synchronize_full().

On second thought I would prefer to avoid this, I see this being good
to help with old userspace but other than that I don't see a requirement
for new userspace. Do you?

> ...
> > bus.enable_kern_async=1 would still also serve as a helper for the driver core
> > to figure out if it should use async probe then on modules if prefer_async_probe
> > was enabled. Let me know if you figure out a way to avoid it.
> 
> Why do we need the choice at all?  It always should, no?

I'm OK to live with that, in that case I see no point to bus.enable_kern_async=1
at all.

  Luis

^ permalink raw reply

* Re: [PATCH v2 7/7] driver-core: add preferred async probe option for built-in and modules
From: Tejun Heo @ 2014-10-07 17:55 UTC (permalink / raw)
  To: Luis R. Rodriguez
  Cc: Luis R. Rodriguez, gregkh, dmitry.torokhov, tiwai, arjan, teg,
	rmilasan, werner, oleg, hare, bpoirier, santosh, pmladek, dbueso,
	linux-kernel, Tetsuo Handa, Joseph Salisbury, Kay Sievers,
	One Thousand Gnomes, Tim Gardner, Pierre Fersing, Andrew Morton,
	Nagalakshmi Nandigama, Praveen Krishnamoorthy, Sreekanth Reddy,
	Abhijit Mahajan, Casey Leedom
In-Reply-To: <20141007175010.GH14081@wotan.suse.de>

Hello,

On Tue, Oct 07, 2014 at 07:50:10PM +0200, Luis R. Rodriguez wrote:
> On Tue, Oct 07, 2014 at 01:34:04PM -0400, Tejun Heo wrote:
> > But you can create a new workqueue and queue all the async probing
> > work items there and flush the workqueue right after
> > async_synchronize_full().
> 
> On second thought I would prefer to avoid this, I see this being good
> to help with old userspace but other than that I don't see a requirement
> for new userspace. Do you?

Hmmm... we batch up and do everything parallel, so I'm not sure how
much gain we'd be looking at by not waiting for at the end before
jumping into the userland.  Also, it's a bit of an orthogonal issue.
If we wanna skip such synchornization point before passing control to
userland, why are we applying that to this but not
async_synchronize_full() which has a far larger impact?  It's weird to
synchronize one while not the other, so yeah, if there are actual
benefits we can consider it but let's do it separately.

Thanks.

-- 
tejun

^ permalink raw reply

* Re: macvlan: optimizing the receive path?
From: Eric Dumazet @ 2014-10-07 18:00 UTC (permalink / raw)
  To: Jason Baron
  Cc: Vlad Yasevich, David Miller, netdev@vger.kernel.org,
	kaber@trash.net
In-Reply-To: <5434246E.1000403@akamai.com>

On Tue, 2014-10-07 at 13:35 -0400, Jason Baron wrote:

> So if there are no objections, I will post as a formal patch.

I think the code used netif_rx() as a precaution because of
multicast/broadcasts.

Now these are taken care (if multicast filter + work queue for
broadcasts), it seems ok to have a fast path.

^ permalink raw reply

* Re: [PATCH] net: Add ndo_gso_check
From: Tom Herbert @ 2014-10-07 18:13 UTC (permalink / raw)
  To: David Miller, Linux Netdev List, Or Gerlitz
In-Reply-To: <1411962607-27878-1-git-send-email-therbert@google.com>

David,

I don't think there are any outstanding objections to this patch.
Will you be able to apply it or do you need something more to be done?

Thanks,
Tom


On Sun, Sep 28, 2014 at 8:50 PM, Tom Herbert <therbert@google.com> wrote:
> Add ndo_gso_check which a device can define to indicate whether is
> is capable of doing GSO on a packet. This funciton would be called from
> the stack to determine whether software GSO is needed to be done. A
> driver should populate this function if it advertises GSO types for
> which there are combinations that it wouldn't be able to handle. For
> instance a device that performs UDP tunneling might only implement
> support for transparent Ethernet bridging type of inner packets
> or might have limitations on lengths of inner headers.
>
> Signed-off-by: Tom Herbert <therbert@google.com>
> ---
>  include/linux/netdevice.h | 12 +++++++++++-
>  net/core/dev.c            |  2 +-
>  2 files changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 9f5d293..f8c2027 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -997,6 +997,12 @@ typedef u16 (*select_queue_fallback_t)(struct net_device *dev,
>   *     Callback to use for xmit over the accelerated station. This
>   *     is used in place of ndo_start_xmit on accelerated net
>   *     devices.
> + * bool        (*ndo_gso_check) (struct sk_buff *skb,
> + *                       struct net_device *dev);
> + *     Called by core transmit path to determine if device is capable of
> + *     performing GSO on a packet. The device returns true if it is
> + *     able to GSO the packet, false otherwise. If the return value is
> + *     false the stack will do software GSO.
>   */
>  struct net_device_ops {
>         int                     (*ndo_init)(struct net_device *dev);
> @@ -1146,6 +1152,8 @@ struct net_device_ops {
>                                                         struct net_device *dev,
>                                                         void *priv);
>         int                     (*ndo_get_lock_subclass)(struct net_device *dev);
> +       bool                    (*ndo_gso_check) (struct sk_buff *skb,
> +                                                 struct net_device *dev);
>  };
>
>  /**
> @@ -3536,10 +3544,12 @@ static inline bool skb_gso_ok(struct sk_buff *skb, netdev_features_t features)
>                (!skb_has_frag_list(skb) || (features & NETIF_F_FRAGLIST));
>  }
>
> -static inline bool netif_needs_gso(struct sk_buff *skb,
> +static inline bool netif_needs_gso(struct net_device *dev, struct sk_buff *skb,
>                                    netdev_features_t features)
>  {
>         return skb_is_gso(skb) && (!skb_gso_ok(skb, features) ||
> +               (dev->netdev_ops->ndo_gso_check &&
> +                !dev->netdev_ops->ndo_gso_check(skb, dev)) ||
>                 unlikely((skb->ip_summed != CHECKSUM_PARTIAL) &&
>                          (skb->ip_summed != CHECKSUM_UNNECESSARY)));
>  }
> diff --git a/net/core/dev.c b/net/core/dev.c
> index e2ced01..8c2b9bb 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2680,7 +2680,7 @@ struct sk_buff *validate_xmit_skb(struct sk_buff *skb, struct net_device *dev)
>         if (skb->encapsulation)
>                 features &= dev->hw_enc_features;
>
> -       if (netif_needs_gso(skb, features)) {
> +       if (netif_needs_gso(dev, skb, features)) {
>                 struct sk_buff *segs;
>
>                 segs = skb_gso_segment(skb, features);
> --
> 2.1.0.rc2.206.gedb03e5
>

^ permalink raw reply

* Re: [PATCH] net: Add ndo_gso_check
From: Alexei Starovoitov @ 2014-10-07 18:15 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Tom Herbert, Jesse Gross, Or Gerlitz, Alexander Duyck,
	John Fastabend, Jeff Kirsher, David Miller, Linux Netdev List,
	Thomas Graf, Pravin Shelar, Andy Zhou
In-Reply-To: <1412702616.11091.143.camel@edumazet-glaptop2.roam.corp.google.com>

On Tue, Oct 7, 2014 at 10:23 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2014-10-07 at 09:50 -0700, Alexei Starovoitov wrote:
>
>> it's definitely more difficult to properly implement
>> CHECKSUM_UNNECESSARY in HW, but it's worth it.
>> CHECKSUM_COMPLETE is a burden on software. Old NICs
>> used to do that, but overhead of recomputing csum for every
>> step of packet parsing and header modifications is too high.
>> sw routers, bridges and < L4 networking devices are
>> simpler and faster with CHECKSUM_UNNECESSARY.
>
> Really this is wrong. Once you validated/pulled a header,
> adjusting the complete checksum is in the order of 10 cycles or so,
> ie less than 1% of the other costs.

correct, but there is also postpull() cost for every pull.
and there are many of them for encapsulated traffic.
It's small, but it's not zero.

> UNNECESSARY usually requests complex NIC firmware, and usually the NIC
> has fewer cores than the host.
>
> It mostly worked for basic IP+TCP kind of traffic, but once you want
> complex cloud models, it is a major pain.
>
> If we need to optimize csum_partial() for short lengths, lets do it,
> instead of pushing hardware vendors adding more and more schemes.

csum_partial() is in asm already. probably not much more can be squeezed.
I'm not suggesting that NICs must always do UNNECESSARY.
COMPLETE is a low bar which is magnitude better than NONE,
but if HW has programmable parser it should be able to take advantage
of it and UNNECESSARY is an established model.

^ permalink raw reply

* [PATCH net-next v2 1/1] tipc: fix bug in multicast congestion handling
From: Jon Maloy @ 2014-10-07 18:12 UTC (permalink / raw)
  To: davem
  Cc: netdev, Paul Gortmaker, erik.hugne, ying.xue, maloy,
	tipc-discussion, Jon Maloy

One aim of commit 50100a5e39461b2a61d6040e73c384766c29975d ("tipc:
use pseudo message to wake up sockets after link congestion") was
to handle link congestion abatement in a uniform way for both unicast
and multicast transmit. However, the latter doesn't work correctly,
and has been broken since the referenced commit was applied.

If a user now sends a burst of multicast messages that is big
enough to cause broadcast link congestion, it will be put to sleep,
and not be waked up when the congestion abates as it should be.

This has two reasons. First, the flag that is used, TIPC_WAKEUP_USERS,
is set correctly, but in the wrong field. Instead of setting it in the
'action_flags' field of the arrival node struct, it is by mistake set
in the dummy node struct that is owned by the broadcast link, where it
will never tested for. Second, we cannot use the same flag for waking
up unicast and multicast users, since the function tipc_node_unlock()
needs to pick the wakeup pseudo messages to deliver from different
queues. It must hence be able to distinguish between the two cases.

This commit solves this problem by adding a new flag
TIPC_WAKEUP_BCAST_USERS, and a new function tipc_bclink_wakeup_user().
The latter is to be called by tipc_node_unlock() when the named flag,
now set in the correct field, is encountered.

v2: using explicit 'unsigned int' declaration instead of 'uint', as
per comment from David Miller.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
---
 net/tipc/bcast.c | 14 +++++++++++++-
 net/tipc/bcast.h |  2 +-
 net/tipc/node.c  |  5 +++++
 net/tipc/node.h  |  3 ++-
 4 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/net/tipc/bcast.c b/net/tipc/bcast.c
index b2bbe69..b8670bf 100644
--- a/net/tipc/bcast.c
+++ b/net/tipc/bcast.c
@@ -226,6 +226,17 @@ static void bclink_retransmit_pkt(u32 after, u32 to)
 }
 
 /**
+ * tipc_bclink_wakeup_users - wake up pending users
+ *
+ * Called with no locks taken
+ */
+void tipc_bclink_wakeup_users(void)
+{
+	while (skb_queue_len(&bclink->link.waiting_sks))
+		tipc_sk_rcv(skb_dequeue(&bclink->link.waiting_sks));
+}
+
+/**
  * tipc_bclink_acknowledge - handle acknowledgement of broadcast packets
  * @n_ptr: node that sent acknowledgement info
  * @acked: broadcast sequence # that has been acknowledged
@@ -300,7 +311,8 @@ void tipc_bclink_acknowledge(struct tipc_node *n_ptr, u32 acked)
 		bclink_set_last_sent();
 	}
 	if (unlikely(released && !skb_queue_empty(&bcl->waiting_sks)))
-		bclink->node.action_flags |= TIPC_WAKEUP_USERS;
+		n_ptr->action_flags |= TIPC_WAKEUP_BCAST_USERS;
+
 exit:
 	tipc_bclink_unlock();
 }
diff --git a/net/tipc/bcast.h b/net/tipc/bcast.h
index 4875d95..e7b0f85 100644
--- a/net/tipc/bcast.h
+++ b/net/tipc/bcast.h
@@ -99,5 +99,5 @@ int  tipc_bclink_set_queue_limits(u32 limit);
 void tipc_bcbearer_sort(struct tipc_node_map *nm_ptr, u32 node, bool action);
 uint  tipc_bclink_get_mtu(void);
 int tipc_bclink_xmit(struct sk_buff *buf);
-
+void tipc_bclink_wakeup_users(void);
 #endif
diff --git a/net/tipc/node.c b/net/tipc/node.c
index 17e6378..90cee4a 100644
--- a/net/tipc/node.c
+++ b/net/tipc/node.c
@@ -552,6 +552,7 @@ void tipc_node_unlock(struct tipc_node *node)
 	LIST_HEAD(conn_sks);
 	struct sk_buff_head waiting_sks;
 	u32 addr = 0;
+	unsigned int flags = node->action_flags;
 
 	if (likely(!node->action_flags)) {
 		spin_unlock_bh(&node->lock);
@@ -572,6 +573,7 @@ void tipc_node_unlock(struct tipc_node *node)
 		node->action_flags &= ~TIPC_NOTIFY_NODE_UP;
 		addr = node->addr;
 	}
+	node->action_flags &= ~TIPC_WAKEUP_BCAST_USERS;
 	spin_unlock_bh(&node->lock);
 
 	while (!skb_queue_empty(&waiting_sks))
@@ -583,6 +585,9 @@ void tipc_node_unlock(struct tipc_node *node)
 	if (!list_empty(&nsub_list))
 		tipc_nodesub_notify(&nsub_list);
 
+	if (flags & TIPC_WAKEUP_BCAST_USERS)
+		tipc_bclink_wakeup_users();
+
 	if (addr)
 		tipc_named_node_up(addr);
 }
diff --git a/net/tipc/node.h b/net/tipc/node.h
index 522d6f3..67513c3 100644
--- a/net/tipc/node.h
+++ b/net/tipc/node.h
@@ -59,7 +59,8 @@ enum {
 	TIPC_WAIT_OWN_LINKS_DOWN	= (1 << 2),
 	TIPC_NOTIFY_NODE_DOWN		= (1 << 3),
 	TIPC_NOTIFY_NODE_UP		= (1 << 4),
-	TIPC_WAKEUP_USERS		= (1 << 5)
+	TIPC_WAKEUP_USERS		= (1 << 5),
+	TIPC_WAKEUP_BCAST_USERS		= (1 << 6)
 };
 
 /**
-- 
1.9.1

^ permalink raw reply related

* Re: Quota in __qdisc_run()
From: Jesper Dangaard Brouer @ 2014-10-07 18:03 UTC (permalink / raw)
  To: David Miller
  Cc: eric.dumazet, hannes, netdev, therbert, fw, dborkman, jhs,
	alexander.duyck, john.r.fastabend, dave.taht, toke, brouer
In-Reply-To: <20141007.131938.1410434352331637585.davem@davemloft.net>

On Tue, 07 Oct 2014 13:19:38 -0400 (EDT)
David Miller <davem@davemloft.net> wrote:

> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 07 Oct 2014 08:01:20 -0700
> 
> > On Tue, 2014-10-07 at 16:43 +0200, Hannes Frederic Sowa wrote:
> > 
> >> This needs to be:
> >> 
> >> do
> >>    ...
> >> while ((iskb = iskb->next))
> > 
> > I do not feel needed to break the bulk dequeue at precise quota
> > boundary. These quotas are advisory, and bql prefers to get its full
> > budget for appropriate feedback from TX completion.
> > 
> > Quota was a packet quota, which was quite irrelevant if segmentation had
> > to be done, so I would just let the dequeue be done so that we benefit
> > from optimal xmit_more.
> 
> Yes, this makes sense, do a full qdisc_restart() cycle without boundaries,
> then check how much quota was used afterwards to guard the outermost loop.

According to my measurements, at 10Gbit/s TCP_STREAM test the BQL limit
is 381528 bytes / 1514 = 252 packets, that will (potentially) be bulk
dequeued at once (with your version of the patch).

It seems to have the potential to exceed the weight_p(64) quite a lot.
And with e.g. TX ring size 512, we also also challenge the drivers at
this early adoption phase of tailptr writes.  Just saying...

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: Quota in __qdisc_run()
From: Jesper Dangaard Brouer @ 2014-10-07 18:37 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, hannes, netdev, therbert, fw, dborkman, jhs,
	alexander.duyck, john.r.fastabend, dave.taht, toke, brouer
In-Reply-To: <1412703132.11091.144.camel@edumazet-glaptop2.roam.corp.google.com>

On Tue, 07 Oct 2014 10:32:12 -0700
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Tue, 2014-10-07 at 13:19 -0400, David Miller wrote:
> 
> > Yes, this makes sense, do a full qdisc_restart() cycle without boundaries,
> > then check how much quota was used afterwards to guard the outermost loop.
> 
> I am testing this, and also am testing the xmit_more patch for I40E.

Check, I'm also testing both yours and Hannes patch.

Results at:
 http://people.netfilter.org/hawk/qdisc/measure18_restore_quota_fairness/
 http://people.netfilter.org/hawk/qdisc/measure19_restore_quota_erics/
 http://people.netfilter.org/hawk/qdisc/measure20_no_quota_baseline_at_git_02c0fc1/

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [PATCH] net: Add ndo_gso_check
From: David Miller @ 2014-10-07 18:47 UTC (permalink / raw)
  To: alexei.starovoitov
  Cc: therbert, jesse, gerlitz.or, alexander.h.duyck, john.r.fastabend,
	jeffrey.t.kirsher, netdev, tgraf, pshelar, azhou
In-Reply-To: <CAADnVQJUhEmER7C_OOs0dM_mD8REbYT89SG1dfu2b79AUn9eow@mail.gmail.com>

From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Date: Tue, 7 Oct 2014 10:18:25 -0700

> On Tue, Oct 7, 2014 at 10:05 AM, David Miller <davem@davemloft.net> wrote:
>> From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
>> Date: Tue, 7 Oct 2014 09:50:50 -0700
>>
>>> CHECKSUM_COMPLETE is a burden on software.
>>
>> I totally disagree, it's the most software friendly checksumming
>> offload mechanism possible.  I wish every card did it.
>>
>> CHECKSUM_COMPLETE means that any sub-protocol or tunneling mechanism
>> can be trivially supported without any modifications to hardware, and
>> it therefore makes checksum offloading of new protocols require no
>> hardware changes whatsoever.
> 
> yes, of course. My point is that if HW can parse the packet and validate
> csum it should do that, since it's faster for the stack on top.
> HW can fall back to CHECKSUM_COMPLETE if it fails to parse, for example.
> I think some NICs do exactly that.

I am totally against boolean "yes/no" protocol specific checksum
validation by HW.

It's not faster.  You have to look at the pseudo-header and bring it into
the CPU cache _anyways_, so negating it and 2's complementing it into
the CHECKSUM_COMPLETE value for validation is free.

There is no performance advantage whatsoever to use another checksumming
scheme.

^ permalink raw reply

* Re: macvlan: optimizing the receive path?
From: David Miller @ 2014-10-07 18:49 UTC (permalink / raw)
  To: jbaron; +Cc: vyasevich, netdev, kaber
In-Reply-To: <5434246E.1000403@akamai.com>

From: Jason Baron <jbaron@akamai.com>
Date: Tue, 07 Oct 2014 13:35:42 -0400

> So if there are no objections, I will post as a formal patch.

No objections from me, thanks for all the info.

^ permalink raw reply

* Re: [PATCH] net: Add ndo_gso_check
From: Eric Dumazet @ 2014-10-07 18:50 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Tom Herbert, Jesse Gross, Or Gerlitz, Alexander Duyck,
	John Fastabend, Jeff Kirsher, David Miller, Linux Netdev List,
	Thomas Graf, Pravin Shelar, Andy Zhou
In-Reply-To: <CAADnVQL_6tEhgQE9aCLe-icG4pyS1dMECGhiQ7+OHEao1GXk3g@mail.gmail.com>

On Tue, 2014-10-07 at 11:15 -0700, Alexei Starovoitov wrote:

> csum_partial() is in asm already. probably not much more can be squeezed.

Again this is wrong assumption. Its assembly and damn slow.

Take a look at commit 99f0b958b194f7d88973f1c2190d207e0a2c7e79
for details.

^ permalink raw reply

* Re: [PATCH net-next v2 1/1] tipc: fix bug in multicast congestion handling
From: David Miller @ 2014-10-07 18:50 UTC (permalink / raw)
  To: jon.maloy
  Cc: netdev, paul.gortmaker, erik.hugne, ying.xue, maloy,
	tipc-discussion
In-Reply-To: <1412705554-24252-1-git-send-email-jon.maloy@ericsson.com>

From: Jon Maloy <jon.maloy@ericsson.com>
Date: Tue,  7 Oct 2014 14:12:34 -0400

> One aim of commit 50100a5e39461b2a61d6040e73c384766c29975d ("tipc:
> use pseudo message to wake up sockets after link congestion") was
> to handle link congestion abatement in a uniform way for both unicast
> and multicast transmit. However, the latter doesn't work correctly,
> and has been broken since the referenced commit was applied.
> 
> If a user now sends a burst of multicast messages that is big
> enough to cause broadcast link congestion, it will be put to sleep,
> and not be waked up when the congestion abates as it should be.
> 
> This has two reasons. First, the flag that is used, TIPC_WAKEUP_USERS,
> is set correctly, but in the wrong field. Instead of setting it in the
> 'action_flags' field of the arrival node struct, it is by mistake set
> in the dummy node struct that is owned by the broadcast link, where it
> will never tested for. Second, we cannot use the same flag for waking
> up unicast and multicast users, since the function tipc_node_unlock()
> needs to pick the wakeup pseudo messages to deliver from different
> queues. It must hence be able to distinguish between the two cases.
> 
> This commit solves this problem by adding a new flag
> TIPC_WAKEUP_BCAST_USERS, and a new function tipc_bclink_wakeup_user().
> The latter is to be called by tipc_node_unlock() when the named flag,
> now set in the correct field, is encountered.
> 
> v2: using explicit 'unsigned int' declaration instead of 'uint', as
> per comment from David Miller.
> 
> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>

Applied, thanks Jon.

^ permalink raw reply

* Re: [PATCH] net: Add ndo_gso_check
From: David Miller @ 2014-10-07 18:51 UTC (permalink / raw)
  To: alexei.starovoitov
  Cc: eric.dumazet, therbert, jesse, gerlitz.or, alexander.h.duyck,
	john.r.fastabend, jeffrey.t.kirsher, netdev, tgraf, pshelar,
	azhou
In-Reply-To: <CAADnVQL_6tEhgQE9aCLe-icG4pyS1dMECGhiQ7+OHEao1GXk3g@mail.gmail.com>

From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Date: Tue, 7 Oct 2014 11:15:32 -0700

> but if HW has programmable parser it should be able to take advantage
> of it and UNNECESSARY is an established model.

Strongly disagree.

^ permalink raw reply

* Re: [PATCH v2 7/7] driver-core: add preferred async probe option for built-in and modules
From: Luis R. Rodriguez @ 2014-10-07 18:55 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Greg Kroah-Hartman, Dmitry Torokhov, Takashi Iwai,
	Arjan van de Ven, Tom Gundersen, Robert Milasan, werner,
	Oleg Nesterov, hare, Benjamin Poirier, Santosh Rastapur,
	Petr Mladek, dbueso, linux-kernel@vger.kernel.org, Tetsuo Handa,
	Joseph Salisbury, Kay Sievers, One Thousand Gnomes, Tim Gardner,
	Pierre Fersing, Andrew Morton, Nagalakshmi Nandigama, Praveen 
In-Reply-To: <20141007175503.GE31328@mtj.dyndns.org>

On Tue, Oct 7, 2014 at 10:55 AM, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> On Tue, Oct 07, 2014 at 07:50:10PM +0200, Luis R. Rodriguez wrote:
>> On Tue, Oct 07, 2014 at 01:34:04PM -0400, Tejun Heo wrote:
>> > But you can create a new workqueue and queue all the async probing
>> > work items there and flush the workqueue right after
>> > async_synchronize_full().
>>
>> On second thought I would prefer to avoid this, I see this being good
>> to help with old userspace but other than that I don't see a requirement
>> for new userspace. Do you?
>
> Hmmm... we batch up and do everything parallel, so I'm not sure how
> much gain we'd be looking at by not waiting for at the end before
> jumping into the userland.  Also, it's a bit of an orthogonal issue.
> If we wanna skip such synchornization point before passing control to
> userland, why are we applying that to this but not
> async_synchronize_full() which has a far larger impact?  It's weird to
> synchronize one while not the other, so yeah, if there are actual
> benefits we can consider it but let's do it separately.

OK I'll just kill bus.enable_kern_async=1 to enable built-in async
probe support *and* also have prefer_async_probe *always* be
respected, whether modular or not.

  Luis

^ permalink raw reply

* Re: [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access
From: Neil Horman @ 2014-10-07 18:59 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: John Fastabend, Daniel Borkmann, John Fastabend,
	Jesper Dangaard Brouer, John W. Linville, Florian Westphal,
	gerlitz.or, netdev, john.ronciak, amirv, eric.dumazet, danny.zhou,
	Willem de Bruijn
In-Reply-To: <1412637971.706532.175886517.077550BE@webmail.messagingengine.com>

On Tue, Oct 07, 2014 at 01:26:11AM +0200, Hannes Frederic Sowa wrote:
> Hi John,
> 
> On Mon, Oct 6, 2014, at 22:37, John Fastabend wrote:
> > > I find the six additional ndo ops a bit worrisome as we are adding more
> > > and more subsystem specific ndoops to this struct. I would like to see
> > > some unification here, but currently cannot make concrete proposals,
> > > sorry.
> > 
> > I agree it seems like a bit much. One thought was to split the ndo
> > ops into categories. Switch ops, MACVLAN ops, basic ops and with this
> > userspace queue ops. This sort of goes along with some of the switch
> > offload work which is going to add a handful more ops as best I can
> > tell.
> 
> Thanks for your mail, you answered all of my questions.
> 
> Have you looked at <https://code.google.com/p/kernel/wiki/ProjectUnetq>?
> Willem (also in Cc) used sysfs files which get mmaped to represent the
> tx/rx descriptors. The representation was independent of the device and
> IIRC the prototype used a write(fd, "", 1) to signal the kernel it
> should proceed with tx. I agree, it would be great to be syscall-free
> here.
> 
> For the semantics of the descriptors we could also easily generate files
> in sysfs. I thought about something like tracepoints already do for
> representing the data in the ringbuffer depending on the event:
> 
> -- >8 --
> # cat /sys/kernel/debug/tracing/events/net/net_dev_queue/format 
> name: net_dev_queue
> ID: 1006
> format:
> 	field:unsigned short common_type;       offset:0;       size:2;
> 	signed:0;
> 	field:unsigned char common_flags;       offset:2;       size:1;
> 	signed:0;
> 	field:unsigned char common_preempt_count;       offset:3;      
> 	size:1; signed:0;
> 	field:int common_pid;   offset:4;       size:4; signed:1;
> 
> 	field:void * skbaddr;   offset:8;       size:8; signed:0;
> 	field:unsigned int len; offset:16;      size:4; signed:0;
> 	field:__data_loc char[] name;   offset:20;      size:4;
> 	signed:1;
> 
> print fmt: "dev=%s skbaddr=%p len=%u", __get_str(name), REC->skbaddr,
> REC->len
> -- >8 --
> 
> Maybe the macros from tracing are reusable (TP_STRUCT__entry), e.g.
> endianess would need to be added. Hopefully there is already a user
> space parser somewhere in the perf sources. An easier to parse binary
> representation could be added easily and maybe even something vDSO alike
> if people care about that.
> 
> Maybe this open/mmap per queue also kills some of the ndo_ops?
> 
> Bye,
> Hannes
> 


John-
	I don't know if its of use to you here, but I was experimenting awhile
ago with af_packet memory mapping, using the protection bits in the page tables
as a doorbell mechanism.  I scrapped the work as the performance bottleneck for
af_packet wasn't found in the syscall trap time, but it occurs to me, it might
be useful for you here, in that, using this mechanism, if you keep the transmit
ring non-empty, you only encur the cost of a single trap to start the transmit
process.  Let me know if you want to see it.

Neil

^ permalink raw reply

* Re: [PATCH v2 7/7] driver-core: add preferred async probe option for built-in and modules
From: Luis R. Rodriguez @ 2014-10-07 19:07 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Greg Kroah-Hartman, Dmitry Torokhov, Takashi Iwai,
	Arjan van de Ven, Tom Gundersen, Robert Milasan, werner,
	Oleg Nesterov, hare, Benjamin Poirier, Santosh Rastapur,
	Petr Mladek, dbueso, linux-kernel@vger.kernel.org, Tetsuo Handa,
	Joseph Salisbury, Kay Sievers, One Thousand Gnomes, Tim Gardner,
	Pierre Fersing, Andrew Morton, Nagalakshmi Nandigama, Praveen 
In-Reply-To: <CAB=NE6V_1yyYPhzbV_yRzt7fg=_8yF2YYNBz4qH_2TbH0nozcw@mail.gmail.com>

On Tue, Oct 7, 2014 at 11:55 AM, Luis R. Rodriguez <mcgrof@suse.com> wrote:
> OK I'll just kill bus.enable_kern_async=1 to enable built-in async
> probe support *and* also have prefer_async_probe *always* be
> respected, whether modular or not.

Well and I just realized you *do* want to flush, so will throw that in
too without an option to skip it.

 Luis

^ permalink raw reply

* Re: [PATCH] net: Add ndo_gso_check
From: Tom Herbert @ 2014-10-07 19:10 UTC (permalink / raw)
  To: David Miller
  Cc: Alexei Starovoitov, Eric Dumazet, Jesse Gross, Or Gerlitz,
	Alexander Duyck, John Fastabend, Jeff Kirsher, Linux Netdev List,
	Thomas Graf, Pravin B Shelar, Andy Zhou
In-Reply-To: <20141007.145117.1442458821633299880.davem@davemloft.net>

On Tue, Oct 7, 2014 at 11:51 AM, David Miller <davem@davemloft.net> wrote:
>
> From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
> Date: Tue, 7 Oct 2014 11:15:32 -0700
>
> > but if HW has programmable parser it should be able to take advantage
> > of it and UNNECESSARY is an established model.
>
> Strongly disagree.


Another problem with UNNECESSARY is that the stack has no way to
validate what the HW is doing. If HW starts setting UNNECESSARY for
packets with bad checksums, this sort of bug can be really difficult
to detect. With CHECKSUM_COMPLETE it's much harder to have undetected
false positives like this, especially when we need to include pseudo
header in validation. UNNECESSARY really shouldn't be considered
robust for deployment IMO.

^ permalink raw reply

* Re: Quota in __qdisc_run()
From: Eric Dumazet @ 2014-10-07 19:10 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: David Miller, hannes, netdev, therbert, fw, dborkman, jhs,
	alexander.duyck, john.r.fastabend, dave.taht, toke
In-Reply-To: <20141007200329.5d20a27e@redhat.com>

On Tue, 2014-10-07 at 20:03 +0200, Jesper Dangaard Brouer wrote:

> According to my measurements, at 10Gbit/s TCP_STREAM test the BQL limit
> is 381528 bytes / 1514 = 252 packets, that will (potentially) be bulk
> dequeued at once (with your version of the patch).
> 

That's because you use a single queue maybe ?

In reality, 10Gbe NIC are used in multiqueue mode ...

Here we have limits around 2 TSO packets.

Even with only 4 tx queues I have :

# sar -n DEV 3 3 |grep eth1
12:05:19 PM      eth1 147217.67 809066.67   9488.71 1196207.78      0.00      0.00      0.00
12:05:22 PM      eth1 145958.00 807822.33   9407.48 1194366.73      0.00      0.00      0.00
12:05:25 PM      eth1 147502.33 804739.33   9507.26 1189804.23      0.00      0.00      0.33
Average:         eth1 146892.67 807209.44   9467.82 1193459.58      0.00      0.00      0.11


grep . /sys/class/net/eth1/queues/tx*/byte_queue_limits/{inflight,limit}
/sys/class/net/eth1/queues/tx-0/byte_queue_limits/inflight:115064
/sys/class/net/eth1/queues/tx-1/byte_queue_limits/inflight:0
/sys/class/net/eth1/queues/tx-2/byte_queue_limits/inflight:0
/sys/class/net/eth1/queues/tx-3/byte_queue_limits/inflight:0
/sys/class/net/eth1/queues/tx-0/byte_queue_limits/limit:102952
/sys/class/net/eth1/queues/tx-1/byte_queue_limits/limit:124148
/sys/class/net/eth1/queues/tx-2/byte_queue_limits/limit:102952
/sys/class/net/eth1/queues/tx-3/byte_queue_limits/limit:136260


> It seems to have the potential to exceed the weight_p(64) quite a lot.
> And with e.g. TX ring size 512, we also also challenge the drivers at
> this early adoption phase of tailptr writes.  Just saying...
> 

Yep, but remind we want to squeeze bugs out of the drivers, then add
additional knobs later.

Whatever limit we choose in core networking stack (being 64 packets for
example), hardware might have different constraints that need to be
taken care of in the driver.

^ permalink raw reply

* Re: bridge: Save frag_max_size between PRE_ROUTING and POST_ROUTING
From: David Miller @ 2014-10-07 19:13 UTC (permalink / raw)
  To: herbert; +Cc: fw, netfilter-devel, bsd, stephen, netdev, eric.dumazet, davidn
In-Reply-To: <20141005040022.GA14118@gondor.apana.org.au>

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Sun, 5 Oct 2014 12:00:22 +0800

> bridge: Save frag_max_size between PRE_ROUTING and POST_ROUTING
> 
> As we may defragment the packet in IPv4 PRE_ROUTING and refragment
> it after POST_ROUTING we should save the value of frag_max_size.
> 
> This is still very wrong as the bridge is supposed to leave the
> packets intact, meaning that the right thing to do is to use the
> original frag_list for fragmentation.
> 
> Unfortunately we don't currently guarantee that the frag_list is
> left untouched throughout netfilter so until this changes this is
> the best we can do.
> 
> There is also a spot in FORWARD where it appears that we can
> forward a packet without going through fragmentation, mark it
> so that we can fix it later.
> 
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Applied, thanks Herbert.

^ permalink raw reply

* Re: [PATCH] Update Intel Ethernet Driver maintainers list
From: David Miller @ 2014-10-07 19:13 UTC (permalink / raw)
  To: jeffrey.t.kirsher
  Cc: alexander.h.duyck, netdev, matthew.vick, alexander.duyck
In-Reply-To: <1412373876.2408.18.camel@jtkirshe-mobl>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Fri, 03 Oct 2014 15:04:36 -0700

> On Fri, 2014-10-03 at 14:45 -0700, Alexander Duyck wrote:
>> I will no longer be working for Intel as of today.  As such I am removing
>> myself from the maintainers list and adding my replacement, Matthew Vick
>> as he will be taking over maintenance of the fm10k driver.
>> 
>> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> 
> Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> 
> Dave, no need to wait for a pull request on this one. :-)  We will miss
> you Alex.

Applied.

^ permalink raw reply

* sunvnet and ->xmit_more
From: David Miller @ 2014-10-07 19:18 UTC (permalink / raw)
  To: netdev; +Cc: david.stevens, Raghuram.Kothakota, sowmini.varadhan


David and others working on sunvnet, I just wanted to point out that
in the net-next tree there is a new facility that can improve
performance quite a bit in sunvnet.

Basically in the ->ndo_start_xmit() handler, if you see skb->xmit_more
set then the stack is telling you that it guarentees that another
packet will be given to you immediately when ->ndo_start_xmit()
returns.

This means that, unless you have filled up your TX queue, you can
defer the TX indication to the device.

For example, in the virtio_net driver the test is:

	if (__netif_subqueue_stopped(dev, qnum) || !skb->xmit_more)
		virtqueue_kick(sq->vq);

The pktgen module also has a new "burst" parameter you can use to test
out this facility directly, and the qdisc layer has heuristics for
dequeueing multiple packets at a time for normal traffic.

Just FYI...

^ permalink raw reply

* Re: sunvnet and ->xmit_more
From: Sowmini Varadhan @ 2014-10-07 19:29 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, david.stevens, Raghuram.Kothakota
In-Reply-To: <20141007.151849.84417269004453869.davem@davemloft.net>

On (10/07/14 15:18), David Miller wrote:
> 
> David and others working on sunvnet, I just wanted to point out that
> in the net-next tree there is a new facility that can improve
> performance quite a bit in sunvnet.
> 
> Basically in the ->ndo_start_xmit() handler, if you see skb->xmit_more
> set then the stack is telling you that it guarentees that another
> packet will be given to you immediately when ->ndo_start_xmit()
> returns.
> 
> This means that, unless you have filled up your TX queue, you can
> defer the TX indication to the device.

I'm not sure how this can be useful to sunvnet- in sunvnet's case
we send the TX indication at the *start* of a burst, so if xmit_more
was set, sure- we can send out another packet immediately, and
avoid another START message (which we already do today), but 
nothing else to be gained from xmit_more?

BTW, I have most of the NAPI done, getting it stress-tested etc
(the recent jumbo commit added a few more races between vnet_port_remove
and vnet_start_xmit, thanks to the extra clean_timer) but I figure
I might as well fully test this internally since net-next is closed
for the moment anyway?

> 
> For example, in the virtio_net driver the test is:
> 
> 	if (__netif_subqueue_stopped(dev, qnum) || !skb->xmit_more)
> 		virtqueue_kick(sq->vq);
> 
> The pktgen module also has a new "burst" parameter you can use to test
> out this facility directly, and the qdisc layer has heuristics for
> dequeueing multiple packets at a time for normal traffic.
> 
> Just FYI...
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox