Netdev List
* Re: [PATCH iproute2 net-next 2/4] tc: flower: document SCTP ip_proto
From: Stephen Hemminger @ 2016-12-01 18:50 UTC (permalink / raw)
  To: Simon Horman; +Cc: netdev
In-Reply-To: <1480434693-29397-3-git-send-email-simon.horman@netronome.com>

On Tue, 29 Nov 2016 16:51:31 +0100
Simon Horman <simon.horman@netronome.com> wrote:

> Add SCTP ip_proto to help text and man page.
> 
> Signed-off-by: Simon Horman <simon.horman@netronome.com>

Sorry, this doesn't apply to the current net-next branch in iproute2 git.
Probably some of the other changes modified the formatting.


* RE: [PATCH v2] sh_eth: remove unchecked interrupts
From: Chris Brandt @ 2016-12-01 18:53 UTC (permalink / raw)
  To: Sergei Shtylyov, Geert Uytterhoeven
  Cc: David Miller, Simon Horman, Geert Uytterhoeven,
	netdev@vger.kernel.org, Linux-Renesas
In-Reply-To: <25b92de0-806b-342b-7556-06b96b948b2c@cogentembedded.com>

Hi Geert,

On 12/1/2016, Sergei Shtylyov wrote:
> 
> On 12/01/2016 05:42 PM, Geert Uytterhoeven wrote:
> 
> >> --- a/drivers/net/ethernet/renesas/sh_eth.c
> >> +++ b/drivers/net/ethernet/renesas/sh_eth.c
> >> @@ -518,7 +518,7 @@ static struct sh_eth_cpu_data r7s72100_data = {
> >>
> >>         .ecsr_value     = ECSR_ICD,
> >>         .ecsipr_value   = ECSIPR_ICDIP,
> >> -       .eesipr_value   = 0xff7f009f,
> >> +       .eesipr_value   = 0xe77f009f,
> >
> > Comment not directly related to the merits of this patch: the EESIPR
> > bit definitions seem to be identical to those for EESR, so we can get
> > rid of the hardcoded values here?
> 
>     Do you mean using the #define's? We have EESIPR bits also declared,
> see enum DMAC_IM_BIT,


Is your idea to get rid of .eesipr_value for devices that have values
that are the same as .eesr_err_check?


For example in sh_eth_dev_init():

	sh_eth_modify(ndev, EESR, 0, 0);
	mdp->irq_enabled = true;
-	sh_eth_write(ndev, mdp->cd->eesipr_value, EESIPR);
+	if (mdp->cd->eesipr_value)
+		sh_eth_write(ndev, mdp->cd->eesipr_value, EESIPR);
+	else
+		sh_eth_write(ndev, mdp->cd->eesr_err_check, EESIPR);


Chris
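
The fallback proposed in the diff above can be sketched in isolation. This is a minimal userspace illustration, not driver code: struct cpu_data_sketch and eesipr_mask() are hypothetical names that only mirror the two fields the diff touches.

```c
#include <stdint.h>

/* Hypothetical stand-in for the two relevant fields of the driver's
 * per-SoC data; the real structure has many more members. */
struct cpu_data_sketch {
	uint32_t eesipr_value;   /* explicit interrupt mask, 0 if not provided */
	uint32_t eesr_err_check; /* EESR error bits the driver checks */
};

/* Pick the value to program into EESIPR: use the explicit mask when one
 * is given, otherwise fall back to the error-check bits. */
uint32_t eesipr_mask(const struct cpu_data_sketch *cd)
{
	return cd->eesipr_value ? cd->eesipr_value : cd->eesr_err_check;
}
```

One consequence of this shape is that an explicit mask of 0 cannot be told apart from "no mask provided", so a device that really wants all interrupts masked would need some other marker.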



* Re: [PATCH iproute2 V4 0/3] tc: Support for ip tunnel metadata set/unset/classify
From: Stephen Hemminger @ 2016-12-01 18:54 UTC (permalink / raw)
  To: Amir Vadai
  Cc: netdev, David S. Miller, Jiri Benc, Or Gerlitz, Hadar Har-Zion,
	Roi Dayan
In-Reply-To: <20161201114446.30333-1-amir@vadai.me>

On Thu,  1 Dec 2016 13:44:43 +0200
Amir Vadai <amir@vadai.me> wrote:

> Hi,
> 
> This short series adds support for matching and setting metadata for ip tunnel
> shared device using the TC system, introduced in kernel 4.9 [1].
> 
> Applied and tested on top of commit f3f339e9590a ("cleanup debris from revert")
> 
> Example usage:
> 
> $ tc filter add dev vxlan0 protocol ip parent ffff: \
>     flower \
>       enc_src_ip 11.11.0.2 \
>       enc_dst_ip 11.11.0.1 \
>       enc_key_id 11 \
>       dst_ip 11.11.11.1 \
>     action mirred egress redirect dev vnet0
> 
> $ tc filter add dev net0 protocol ip parent ffff: \
>     flower \
>       ip_proto 1 \
>       dst_ip 11.11.11.2 \
>     action tunnel_key set \
>       src_ip 11.11.0.1 \
>       dst_ip 11.11.0.2 \
>       id 11 \
>     action mirred egress redirect dev vxlan0
> 
> [1] - d1ba24feb466 ("Merge branch 'act_tunnel_key'")
> 
> Thanks,
> Amir
> 
> Changes from V3:
> - Fix bad wording in the man page about the use of the 'unset' operation
> 
> Changes from V2:
> - Use const where needed
> - Don't lose return value
> - Introduce rta_getattr_be16() and rta_getattr_be32()
> 
> Changes from V1:
> - Updated Patch 2/2 ("tc/act_tunnel: Introduce ip tunnel action") commit log
> 	and the man page tc-tunnel_key to reflect the fact that 'unset' operation is
> 	not mandatory.
> 	And describe when it might be needed.
> - Rename the 'release' operation to 'unset'
> 
> Amir Vadai (3):
>   libnetlink: Introduce rta_getattr_be*()
>   tc/cls_flower: Classify packet in ip tunnels
>   tc/act_tunnel: Introduce ip tunnel action
> 
>  bridge/fdb.c                         |   4 +-
>  include/libnetlink.h                 |   9 ++
>  include/linux/tc_act/tc_tunnel_key.h |  42 ++++++
>  ip/iplink_geneve.c                   |   2 +-
>  ip/iplink_vxlan.c                    |   2 +-
>  man/man8/tc-flower.8                 |  17 ++-
>  man/man8/tc-tunnel_key.8             | 112 +++++++++++++++
>  tc/Makefile                          |   1 +
>  tc/f_flower.c                        |  84 +++++++++++-
>  tc/m_tunnel_key.c                    | 258 +++++++++++++++++++++++++++++++++++
>  10 files changed, 522 insertions(+), 9 deletions(-)
>  create mode 100644 include/linux/tc_act/tc_tunnel_key.h
>  create mode 100644 man/man8/tc-tunnel_key.8
>  create mode 100644 tc/m_tunnel_key.c

I cleared up patch backlog and got net-next branch up to date with kernel
headers, and this patch series does not apply cleanly anymore.

Please rebase and resubmit.


* Re: [PATCH net-next iproute2 PATCH 2/2 v2] ss: Add inet raw sockets information gathering via netlink diag interface
From: Stephen Hemminger @ 2016-12-01 18:57 UTC (permalink / raw)
  To: Cyrill Gorcunov; +Cc: netdev, avagin
In-Reply-To: <1478092496-7540-3-git-send-email-gorcunov@gmail.com>

On Wed,  2 Nov 2016 16:14:56 +0300
Cyrill Gorcunov <gorcunov@gmail.com> wrote:

> unix, tcp, udp[lite], packet, netlink sockets already support diag
> interface for their collection and killing. Implement support
> for raw sockets.
> 
> Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>

Applied both patches, but needed to remove inet_diag.h since
already updated kernel headers.


* Re: [PATCH iproute2 1/2] ss: print new tcp_info fields: delivery_rate and app_limited
From: Stephen Hemminger @ 2016-12-01 19:01 UTC (permalink / raw)
  To: Neal Cardwell; +Cc: netdev, Yuchung Cheng, Eric Dumazet, Soheil Hassas Yeganeh
In-Reply-To: <1480616500-16919-1-git-send-email-ncardwell@google.com>

On Thu,  1 Dec 2016 13:21:39 -0500
Neal Cardwell <ncardwell@google.com> wrote:

> Dump the new delivery_rate and delivery_rate_app_limited fields that
> were added to tcp_info in Linux v4.9.
> 
> Example output:
>   pacing_rate 65.7Mbps delivery_rate 62.9Mbps
> 
> And for the application-limited case this looks like:
>   pacing_rate 1031.1Mbps delivery_rate 87.4Mbps app_limited
> 
> Signed-off-by: Neal Cardwell <ncardwell@google.com>
> Signed-off-by: Yuchung Cheng <ycheng@google.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>

Looks good, applied to net-next branch


* Re: Initial thoughts on TXDP
From: Tom Herbert @ 2016-12-01 19:05 UTC (permalink / raw)
  To: Sowmini Varadhan; +Cc: Linux Kernel Network Developers
In-Reply-To: <20161201135508.GB24547@oracle.com>

On Thu, Dec 1, 2016 at 5:55 AM, Sowmini Varadhan
<sowmini.varadhan@oracle.com> wrote:
> On (11/30/16 14:54), Tom Herbert wrote:
>>
>> Posting for discussion....
>    :
>> One simplifying assumption we might make is that TXDP is primarily for
>> optimizing latency, specifically request/response type operations
>> (think HPC, HFT, flash server, or other tightly coupled communications
>> within the datacenter). Notably, I don't think that saving CPU is as
>> relevant to TXDP, in fact we have already seen that CPU utilization
>> can be traded off for lower latency via spin polling. Similar to XDP
>> though, we might assume that single CPU performance is relevant (i.e.
>> on a cache server we'd like to spin as few CPUs as needed and no more
>> to handle the load and maintain throughput and latency requirements).
>> High throughput (ops/sec) and low variance should be side effects of
>> any design.
>
> I'm sending this with some hesitation (esp as the flamebait threads
> are starting up - I have no interest in getting into food-fights!!),
> because it sounds like the HPC/request-response use-case you have in mind
> (HTTP based?) is very likely different than the one the DB use-cases in
> my environment (RDBMS, Cluster req/responses). But to provide some
> perspective from the latter use-case..
>
> We also have request-response transactions, but CPU utilization
> is extremely critical- many DB operations are highly CPU bound,
> so it's not acceptable for the network to hog CPU util by polling.
> In that sense, the DB req/resp model has a lot of overlap with the
> Suricata use-case.
>
Hi Sowmini,

Polling does not necessarily imply that networking monopolizes the CPU
except when the CPU is otherwise idle. Presumably the application
drives the polling when it is ready to receive work.

> Also we need a select()able socket, because we have to deal with
> input from several sources- network I/O, but also disk, and
> file-system I/O. So need to make sure there is no starvation,
> and that we multiplex between  I/O sources efficiently
>
Yes, that is a requirement.

> and one other critical difference from the hot-potato-forwarding
> model (the sort of OVS model that DPDK etc might arguably be a fit for)
> does not apply: in order to figure out the ethernet and IP headers
> in the response correctly at all times (in the face of things like VRRP,
> gw changes, gw's mac addr changes etc) the application should really
> be listening on NETLINK sockets for modifications to the networking
> state - again points to needing a select() socket set where you can
> have both the I/O fds and the netlink socket,
>
I would think that management would not be implemented in a
fast path processing thread for an application.

> For all of these reasons, we are investigating approaches similar to
> Suricata- PF_PACKET with TPACKETV2 (since we need both Tx and Rx,
> and so far, tpacketv2 seems "good enough"). FWIW, we also took
> a look at netmap and so far have not seen any significant benefits
> to netmap over pf_packet.. investigation still ongoing.
>
>>   - Call into TCP/IP stack with page data directly from driver-- no
>> skbuff allocation or interface. This is essentially provided by the
>
> I'm curious- one thing that came out of the IPsec evaluation
> is that TSO is very valuable for performance, and this is most easily
> accessed via the sk_buff interfaces.  I have not had a chance
> to review your patches yet, but isnt that an issue if you bypass
> sk_buff usage? But I should probably go and review your patchset..
>
The *SOs are always an interesting question. They make for great
benchmarks, but in real life the amount of benefit is somewhat
unclear. Under the wrong conditions, like all cwnds have collapsed or
received packets for flows are small or so mixed that we can't get
much aggregation, SO provides no benefit and in fact becomes
overhead. Relying on any amount of segmentation offload in real
deployment is risky; for instance we've seen some video servers
deployed that were able to serve line rate at 90% CPU in testing (SO
was effective) but ended up needing 110% CPU in deployment when a
hiccup caused all cwnds to collapse. Moral of the story is provision
your servers assuming the worst case conditions that would render
opportunistic offloads useless.

For the GSO and GRO the rationale is that performing the extra SW
processing to do the offloads is significantly less expensive than
running each packet through the full stack. This is true in a
multi-layered generalized stack. In TXDP, however, we should be able
to optimize the stack data path such that that would no longer be
true. For instance, if we can process the packets received on a
connection quickly enough so that it's about the same or just a little
more costly than GRO processing then we might bypass GRO entirely.
TSO is probably still relevant in TXDP since it reduces overheads
processing TX in the device itself.

Tom

> --Sowmini


* Re: [PATCH] netfilter: avoid warn and OOM on vmalloc call
From: Marcelo Ricardo Leitner @ 2016-12-01 19:08 UTC (permalink / raw)
  To: Andrey Konovalov
  Cc: Florian Westphal, Neil Horman, netdev, netfilter-devel, LKML
In-Reply-To: <CAAeHK+wjP=h_4YxB6VUc+FjKcZi9igmyTs3nPAuUJeNomYSA0w@mail.gmail.com>

On Thu, Dec 01, 2016 at 10:42:22AM +0100, Andrey Konovalov wrote:
> On Wed, Nov 30, 2016 at 8:42 PM, Marcelo Ricardo Leitner
> <marcelo.leitner@gmail.com> wrote:
> > Hi Andrey,
> >
> > Please let me know how this works for you. It seems good here, though
> > your poc may still trigger OOM through other means.
> 
> Hi Marcelo,
> 
> Don't see any reports with this patch.
> 
> Thanks!

Thanks Andrey.
I'll post a v2 after a few more tests here and to s/OOM/OOM killer/ in
most of the changelog.

> 
> >
> > Thanks,
> > Marcelo
> >
> > ---8<---
> >
> > Andrey Konovalov reported that this vmalloc call is based on an
> > userspace request and that it's spewing traces, which may flood the logs
> > and cause DoS if abused.
> >
> > Florian Westphal also mentioned that this call should not trigger OOM,
> > as kmalloc one is already not triggering it.
> >
> > This patch brings the vmalloc call in sync to kmalloc and disables the
> > warn trace on allocation failure and also disable OOM invocation.
> >
> > Note, however, that under such stress situation, other places may
> > trigger OOM invocation.
> >
> > Reported-by: Andrey Konovalov <andreyknvl@google.com>
> > Cc: Florian Westphal <fw@strlen.de>
> > Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
> > ---
> >  net/netfilter/x_tables.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
> > index fc4977456c30e098197b4f987b758072c9cf60d9..dece525bf83a0098dad607fce665cd0bde228362 100644
> > --- a/net/netfilter/x_tables.c
> > +++ b/net/netfilter/x_tables.c
> > @@ -958,7 +958,9 @@ struct xt_table_info *xt_alloc_table_info(unsigned int size)
> >         if (sz <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
> >                 info = kmalloc(sz, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY);
> >         if (!info) {
> > -               info = vmalloc(sz);
> > +               info = __vmalloc(sz, GFP_KERNEL | __GFP_NOWARN |
> > +                                    __GFP_NORETRY | __GFP_HIGHMEM,
> > +                                PAGE_KERNEL);
> >                 if (!info)
> >                         return NULL;
> >         }
> > --
> > 2.9.3
> >
> 


* Re: [PATCH net-next iproute2 PATCH 2/2 v2] ss: Add inet raw sockets information gathering via netlink diag interface
From: Cyrill Gorcunov @ 2016-12-01 19:13 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev, avagin
In-Reply-To: <20161201105701.1c54b258@xeon-e3>

On Thu, Dec 01, 2016 at 10:57:01AM -0800, Stephen Hemminger wrote:
> 
> Applied both patches, but needed to remove inet_diag.h since
> already updated kernel headers.

Thank you! I think we might need to extend the matching interface
for killing raw sockets in the near future, because for now it
matches too broadly. I put this into my todo list; once I finish with
more urgent tasks I will get back to this one.


* Re: [WIP] net+mlx4: auto doorbell
From: Jesper Dangaard Brouer @ 2016-12-01 19:17 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Saeed Mahameed, Rick Jones, Linux Netdev List, Saeed Mahameed,
	Tariq Toukan, brouer
In-Reply-To: <1480611857.18162.319.camel@edumazet-glaptop3.roam.corp.google.com>


On Thu, 01 Dec 2016 09:04:17 -0800 Eric Dumazet <eric.dumazet@gmail.com> wrote:

> BTW, if you are doing tests on mlx4 40Gbit,

I'm mostly testing with mlx5 50Gbit, but I do have 40G NIC in the
machines too.

>  would you check the
> following quick/dirty hack, using lots of low-rate flows ?

What tool should I use to send "low-rate flows"?

And what am I looking for?

> mlx4 has really hard time to transmit small TSO packets (2 or 3 MSS)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> index 12ea3405f442..96940666abd3 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> @@ -2631,6 +2631,11 @@ static void mlx4_en_del_vxlan_port(struct  net_device *dev,
>         queue_work(priv->mdev->workqueue, &priv->vxlan_del_task);
>  }
>  
> +static unsigned int mlx4_gso_segs_min = 4; /* TSO packets with fewer than 4 segments are segmented */
> +module_param_named(mlx4_gso_segs_min, mlx4_gso_segs_min, uint, 0644);
> +MODULE_PARM_DESC(mlx4_gso_segs_min, "threshold for software segmentation of small TSO packets");
> +
> +
>  static netdev_features_t mlx4_en_features_check(struct sk_buff *skb,
>                                                 struct net_device *dev,
>                                                 netdev_features_t features)
> @@ -2651,6 +2656,8 @@ static netdev_features_t mlx4_en_features_check(struct sk_buff *skb,
>                     (udp_hdr(skb)->dest != priv->vxlan_port))
>                         features &= ~(NETIF_F_CSUM_MASK | NETIF_F_GSO_MASK);
>         }
> +       if (skb_is_gso(skb) && skb_shinfo(skb)->gso_segs < mlx4_gso_segs_min)
> +               features &= ~NETIF_F_GSO_MASK;
>  
>         return features;
>  }
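
The intent of the quoted hack, as the comment on mlx4_gso_segs_min states, is to have small TSO packets segmented in software. A features_check hook gets that effect by clearing the GSO bits from the feature set it returns. The sketch below shows only that masking step in userspace; the type, the bit values and the function name are hypothetical and do not match the kernel's NETIF_F_* layout.

```c
#include <stdint.h>
#include <stdbool.h>

typedef uint64_t features_sketch_t;

/* Hypothetical bit layout, for illustration only. */
#define F_CSUM_SKETCH      ((features_sketch_t)0x01)
#define F_GSO_MASK_SKETCH  ((features_sketch_t)0xf0)

/* Return the feature set to use for one packet. For a GSO packet with
 * fewer than gso_segs_min segments, clear every GSO bit so the caller
 * falls back to software segmentation; other bits are left untouched. */
features_sketch_t features_check_sketch(features_sketch_t features,
					bool is_gso,
					unsigned int gso_segs,
					unsigned int gso_segs_min)
{
	if (is_gso && gso_segs < gso_segs_min)
		features &= ~F_GSO_MASK_SKETCH;
	return features;
}
```

Note the complement: ANDing with the mask itself would keep only the GSO bits and drop everything else, which is the opposite of what is wanted here.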
 

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [WIP] net+mlx4: auto doorbell
From: Willem de Bruijn @ 2016-12-01 19:24 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Eric Dumazet, Jesper Dangaard Brouer, Willem de Bruijn,
	Rick Jones, Linux Kernel Network Developers, Saeed Mahameed,
	Tariq Toukan, Achiad Shochat
In-Reply-To: <CALx6S34ENTbUmCGx_4izNHoXbdy5UHuvUesbFGw+8kQSidesEg@mail.gmail.com>

>>> > So we end up with one cpu doing the ndo_start_xmit() and TX completions,
>>> > and RX work.

This problem is somewhat tangential to the doorbell avoidance discussion.

>>> >
>>> > This problem is magnified when XPS is used, if one mono-threaded application deals with
>>> > thousands of TCP sockets.
>>> >
>> We have thousands of applications, and some of them 'kind of multicast'
>> events to a broad number of TCP sockets.
>>
>> Very often, applications writers use a single thread for doing this,
>> when all they need is to send small packets to 10,000 sockets, and they
>> do not really care of doing this very fast. They also do not want to
>> hurt other applications sharing the NIC.
>>
>> Very often, process scheduler will also run this single thread in a
>> single cpu, ie avoiding expensive migrations if they are not needed.
>>
>> Problem is this behavior locks one TX queue for the duration of the
>> multicast, since XPS will force all the TX packets to go to one TX
>> queue.
>>
> The fact that XPS is forcing TX packets to go over one CPU is a result
> of how you've configured XPS. There are other potential
> configurations that present different tradeoffs.

Right, XPS supports multiple txqueues mappings, using skb_tx_hash
to decide among them. Unfortunately cross-cpu is more expensive
than tx + completion on the same core, so this is suboptimal for
the common case where there is no excessive load imbalance.

> For instance, TX
> queues are plentiful nowadays to the point that we could map a number
> of queues to each CPU while respecting NUMA locality between the
> sending CPU and where TX completions occur. If the set is sufficiently
> large this would also help to address device lock contention. The
> other thing I'm wondering is if Willem's concepts distributing DOS
> attacks in RPS might be applicable in XPS. If we could detect that a
> TX queue is "under attack" maybe we can automatically backoff to
> distributing the load to more queues in XPS somehow.

If only targeting states of imbalance, that indeed could work. For the
10,000 socket case, instead of load balancing qdisc servicing, we
could perhaps modify tx queue selection in __netdev_pick_tx to
choose another queue if the initial choice is paused or otherwise
backlogged.
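
A rough userspace sketch of that selection rule, under stated assumptions: struct txq_sketch, pick_tx_sketch() and the backlog limit are hypothetical, and nothing here reflects how __netdev_pick_tx is actually structured.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-queue state; the kernel's queue structures differ. */
struct txq_sketch {
	bool stopped;         /* queue is paused */
	unsigned int backlog; /* packets currently waiting on this queue */
};

/* Keep the preferred queue unless it is stopped or above backlog_limit.
 * Otherwise scan the remaining queues in order and take the first usable
 * one. If every queue is busy, stay on the preferred queue. */
size_t pick_tx_sketch(const struct txq_sketch *q, size_t nq,
		      size_t preferred, unsigned int backlog_limit)
{
	for (size_t i = 0; i < nq; i++) {
		size_t idx = (preferred + i) % nq;

		if (!q[idx].stopped && q[idx].backlog <= backlog_limit)
			return idx;
	}
	return preferred;
}
```

In the balanced case the loop returns on its first iteration, so the common path keeps tx and completion on the same queue and only the overloaded case pays for spreading.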


* Re: [PATCH] net: asix: Fix AX88772_suspend() USB vendor commands failure issues
From: David Miller @ 2016-12-01 19:35 UTC (permalink / raw)
  To: allan
  Cc: jonathanh, freddy, Dean_Jenkins, Mark_Craske, robert.foss,
	ivecera, john.stultz, vpalatin, stephen, grundler, changchias,
	andrew, tremyfr, colin.king, linux-usb, netdev, linux-kernel,
	vpalatin
In-Reply-To: <00d701d24ae3$d4f4f2a0$7eded7e0$@asix.com.tw>

From: "ASIX_Allan [Office]" <allan@asix.com.tw>
Date: Wed, 30 Nov 2016 16:29:08 +0800

> The change fixes AX88772_suspend() USB vendor commands failure issues.
> 
> Signed-off-by: Allan Chou <allan@asix.com.tw>
> Tested-by: Allan Chou <allan@asix.com.tw>
> Tested-by: Jon Hunter <jonathanh@nvidia.com>

Patch applied, thanks.


* Re: [net-next] rtnetlink: return the correct error code
From: David Miller @ 2016-12-01 19:39 UTC (permalink / raw)
  To: zhangshengju; +Cc: netdev
In-Reply-To: <1480495054-5114-1-git-send-email-zhangshengju@cmss.chinamobile.com>

From: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Date: Wed, 30 Nov 2016 16:37:34 +0800

> Before this patch, function ndo_dflt_fdb_dump() will always return code
> from uc fdb dump. The return code of mc fdb dump is lost.
> 
> Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>

Applied.


* Re: [PATCH net-next 2/3] net/act_pedit: Support using offset relative to the conventional network headers
From: David Miller @ 2016-12-01 19:41 UTC (permalink / raw)
  To: amir; +Cc: netdev, jhs, ogerlitz, hadarh
In-Reply-To: <20161130090928.14816-3-amir@vadai.me>

From: Amir Vadai <amir@vadai.me>
Date: Wed, 30 Nov 2016 11:09:27 +0200

> @@ -119,18 +119,45 @@ static bool offset_valid(struct sk_buff *skb, int offset)
>  	return true;
>  }
>  
> +static int pedit_skb_hdr_offset(struct sk_buff *skb,
> +				enum pedit_header_type htype, int *hoffset)
> +{
> +	int ret = -1;
> +
> +	switch (htype) {
> +	case PEDIT_HDR_TYPE_ETH:
> +		if (skb_mac_header_was_set(skb)) {
> +			*hoffset = skb_mac_offset(skb);
> +			ret = 0;
> +		}
> +		break;
> +	case PEDIT_HDR_TYPE_RAW:
> +	case PEDIT_HDR_TYPE_IP4:
> +	case PEDIT_HDR_TYPE_IP6:
> +		*hoffset = skb_network_offset(skb);
> +		ret = 0;
> +		break;
> +	case PEDIT_HDR_TYPE_TCP:
> +	case PEDIT_HDR_TYPE_UDP:
> +		if (skb_transport_header_was_set(skb)) {
> +			*hoffset = skb_transport_offset(skb);
> +			ret = 0;
> +		}
> +		break;
> +	}
> +
> +	return ret;
> +}
> +

The only distinction between the cases is "L2", "L3", and "L4".

Therefore I don't see any reason to break it down into IP4 vs. IP6 vs.
RAW, for example.  They all map to the same thing.

So why not just have PEDIT_HDR_TYPE_L2, PEDIT_HDR_TYPE_L3, and
PEDIT_HDR_TYPE_L4?  It definitely seems more straightforward
and cleaner that way.
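
As a sketch of what the three-way split might look like, here is a userspace version of the helper with the header types collapsed to L2, L3 and L4. The enum names carry a _SKETCH suffix and struct skb_sketch is a hypothetical stand-in for the sk_buff accessors used in the quoted patch; none of these names come from the actual series.

```c
#include <stdbool.h>

/* Hypothetical header types following the L2/L3/L4 suggestion. */
enum pedit_hdr_sketch {
	PEDIT_HDR_SKETCH_L2,
	PEDIT_HDR_SKETCH_L3,
	PEDIT_HDR_SKETCH_L4,
};

/* Hypothetical stand-in for the header offsets kept in an sk_buff. */
struct skb_sketch {
	bool mac_set;
	int mac_off;
	int net_off;
	bool transport_set;
	int transport_off;
};

/* Store the base offset for the requested layer in *hoffset.
 * Returns 0 on success, -1 if that header has not been set. */
int hdr_offset_sketch(const struct skb_sketch *skb,
		      enum pedit_hdr_sketch htype, int *hoffset)
{
	switch (htype) {
	case PEDIT_HDR_SKETCH_L2:
		if (!skb->mac_set)
			return -1;
		*hoffset = skb->mac_off;
		return 0;
	case PEDIT_HDR_SKETCH_L3:
		*hoffset = skb->net_off;
		return 0;
	case PEDIT_HDR_SKETCH_L4:
		if (!skb->transport_set)
			return -1;
		*hoffset = skb->transport_off;
		return 0;
	}
	return -1;
}
```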

Thanks.


* Re: [PATCH v2] tun: Use netif_receive_skb instead of netif_rx
From: David Miller @ 2016-12-01 19:43 UTC (permalink / raw)
  To: andreyknvl
  Cc: herbert, jasowang, edumazet, pmk, pabeni, mst, soheil, elfring,
	rppt, netdev, linux-kernel, dvyukov, kcc, syzkaller
In-Reply-To: <1480584880-48651-1-git-send-email-andreyknvl@google.com>

From: Andrey Konovalov <andreyknvl@google.com>
Date: Thu,  1 Dec 2016 10:34:40 +0100

> This patch changes tun.c to call netif_receive_skb instead of netif_rx
> when a packet is received (if CONFIG_4KSTACKS is not enabled to avoid
> stack exhaustion). The difference between the two is that netif_rx queues
> the packet into the backlog, and netif_receive_skb processes the packet
> in the current context.
> 
> This patch is required for syzkaller [1] to collect coverage from packet
> receive paths, when a packet being received through tun (syzkaller collects
> coverage per process in the process context).
> 
> As mentioned by Eric this change also speeds up tun/tap. As measured by
> Peter it speeds up his closed-loop single-stream tap/OVS benchmark by
> about 23%, from 700k packets/second to 867k packets/second.
> 
> A similar patch was introduced back in 2010 [2, 3], but the author found
> out that the patch doesn't help with the task he had in mind (for cgroups
> to shape network traffic based on the original process) and decided not to
> go further with it. The main concern back then was about possible stack
> exhaustion with 4K stacks.
> 
> [1] https://github.com/google/syzkaller
> 
> [2] https://www.spinics.net/lists/netdev/thrd440.html#130570
> 
> [3] https://www.spinics.net/lists/netdev/msg130570.html
> 
> Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
> ---
> 
> Changes since v1:
> - incorporate Eric's note about speed improvements in commit description
> - use netif_receive_skb only if CONFIG_4KSTACKS is not enabled

Applied to net-next, thanks!


* Re: Initial thoughts on TXDP
From: Rick Jones @ 2016-12-01 19:48 UTC (permalink / raw)
  To: Tom Herbert, Sowmini Varadhan; +Cc: Linux Kernel Network Developers
In-Reply-To: <CALx6S35DCyi_2z1pqCLaB1bVyNykP_J3YaYEXUT8xxmuzyBDwA@mail.gmail.com>

On 12/01/2016 11:05 AM, Tom Herbert wrote:
> For the GSO and GRO the rationale is that performing the extra SW
> processing to do the offloads is significantly less expensive than
> running each packet through the full stack. This is true in a
> multi-layered generalized stack. In TXDP, however, we should be able
> to optimize the stack data path such that that would no longer be
> true. For instance, if we can process the packets received on a
> connection quickly enough so that it's about the same or just a little
> more costly than GRO processing then we might bypass GRO entirely.
> TSO is probably still relevant in TXDP since it reduces overheads
> processing TX in the device itself.

Just how much per-packet path-length are you thinking will go away under 
the likes of TXDP?  It is admittedly "just" netperf but losing TSO/GSO 
does some non-trivial things to effective overhead (service demand) and 
so throughput:

stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -- -P 12867
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

  87380  16384  16384    10.00      9260.24   2.02     -1.00    0.428   -1.000
stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 tso off gso off
stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -- -P 12867
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

  87380  16384  16384    10.00      5621.82   4.25     -1.00    1.486   -1.000

And that is still with the stretch-ACKs induced by GRO at the receiver.

Losing GRO has quite similar results:
stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t TCP_MAERTS -- -P 12867
MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Recv     Send     Recv    Send
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

  87380  16384  16384    10.00      9154.02   4.00     -1.00    0.860   -1.000
stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 gro off

stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t TCP_MAERTS -- -P 12867
MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Recv     Send     Recv    Send
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

  87380  16384  16384    10.00      4212.06   5.36     -1.00    2.502   -1.000

I'm sure there is a very non-trivial "it depends" component here - 
netperf will get the peak benefit from *SO and so one will see the peak 
difference in service demands - but even if one gets only 6 segments per 
*SO that is a lot of path-length to make-up.
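
To put the four runs above side by side, the two quantities of interest are the relative throughput loss and the growth in local service demand. The helpers below are a small sketch for that arithmetic; pct_drop() and demand_ratio() are hypothetical names, and the inputs are the figures from the netperf output above.

```c
/* Relative throughput loss, in percent, going from `before` to `after`. */
double pct_drop(double before, double after)
{
	return 100.0 * (before - after) / before;
}

/* Factor by which the service demand (us/KB) grew. */
double demand_ratio(double before, double after)
{
	return after / before;
}
```

With TSO/GSO off, pct_drop(9260.24, 5621.82) comes to about 39% and demand_ratio(0.428, 1.486) to about 3.5x. With GRO off, pct_drop(9154.02, 4212.06) comes to about 54% and demand_ratio(0.860, 2.502) to about 2.9x.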

4.4 kernel, BE3 NICs ... E5-2640 0 @ 2.50GHz

And even if one does have the CPU cycles to burn so to speak, the effect 
on power consumption needs to be included in the calculus.

happy benchmarking,

rick jones


* Re: [PATCH v3 net-next 2/3] openvswitch: Use is_skb_forwardable() for length check.
From: Pravin Shelar @ 2016-12-01 19:50 UTC (permalink / raw)
  To: Jiri Benc; +Cc: Jarno Rajahalme, Linux Kernel Network Developers, Eric Garver
In-Reply-To: <20161130145159.3cee7ba4@griffin>

On Wed, Nov 30, 2016 at 5:51 AM, Jiri Benc <jbenc@redhat.com> wrote:
> On Tue, 29 Nov 2016 15:30:52 -0800, Jarno Rajahalme wrote:
>> @@ -504,11 +485,20 @@ void ovs_vport_send(struct vport *vport, struct sk_buff *skb, u8 mac_proto)
>>               goto drop;
>>       }
>>
>> -     if (unlikely(packet_length(skb, vport->dev) > mtu &&
>> -                  !skb_is_gso(skb))) {
>> -             net_warn_ratelimited("%s: dropped over-mtu packet: %d > %d\n",
>> -                                  vport->dev->name,
>> -                                  packet_length(skb, vport->dev), mtu);
>> +     if (unlikely(!is_skb_forwardable(vport->dev, skb))) {
>
> How does this work when the vlan tag is accelerated? Then we can be
> over MTU, yet the check will pass.
>

This is not changing any behavior compared to current OVS vlan checks.
Single vlan header is not considered for MTU check.
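
The length rule described here can be sketched as follows, assuming the limit is the MTU plus the link-layer header plus room for one VLAN tag, and that GSO packets are let through because they are segmented later. The structs and forwardable_sketch() are hypothetical userspace stand-ins; the kernel helper performs further checks (for example on device state) that are omitted.

```c
#include <stdbool.h>

#define VLAN_HLEN_SKETCH 4 /* bytes in one 802.1Q tag */

/* Hypothetical stand-ins for the device and packet fields involved. */
struct dev_sketch {
	unsigned int mtu;
	unsigned int hard_header_len;
};

struct pkt_sketch {
	unsigned int len;
	bool is_gso;
};

/* A packet passes if it fits within MTU + link header + one VLAN tag,
 * or if it is a GSO packet that will be segmented before transmit. */
bool forwardable_sketch(const struct dev_sketch *dev,
			const struct pkt_sketch *pkt)
{
	unsigned int limit = dev->mtu + dev->hard_header_len + VLAN_HLEN_SKETCH;

	if (pkt->len <= limit)
		return true;
	return pkt->is_gso;
}
```

Because the limit always reserves room for one tag, the same bound applies whether or not a tag is present in the packet data, which is why a single VLAN header does not count against the MTU.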


* Re: Initial thoughts on TXDP
From: Tom Herbert @ 2016-12-01 19:51 UTC (permalink / raw)
  To: Florian Westphal; +Cc: Linux Kernel Network Developers
In-Reply-To: <20161201024407.GE26507@breakpoint.cc>

On Wed, Nov 30, 2016 at 6:44 PM, Florian Westphal <fw@strlen.de> wrote:
> Tom Herbert <tom@herbertland.com> wrote:
>> Posting for discussion....
>
> Warning: You are not going to like this reply...
>
>> Now that XDP seems to be nicely gaining traction
>
> Yes, I regret to see that.  XDP seems useful to create impressive
> benchmark numbers (and little else).
>
> I will send a separate email to keep that flamebait part away from
> this thread though.
>
> [..]
>
>> addresses the performance gap for stateless packet processing). The
>> problem statement is analogous to that which we had for XDP, namely
>> can we create a mode in the kernel that offer the same performance
>> that is seen with L4 protocols over kernel bypass
>
> Why?  If you want to bypass the kernel, then DO IT.
>
I don't want kernel bypass. I want the Linux stack to provide
something close to bare metal performance for TCP/UDP for some latency
sensitive applications we run.

> There is nothing wrong with DPDK.  The ONLY problem is that the kernel
> does not offer a userspace fastpath like Windows RIO or FreeBSDs netmap.
>
> But even without that its not difficult to get DPDK running.
>
That is not true for large scale deployments. Also, TXDP is about
accelerating transport layers like TCP, DPDK is just the interface
from userspace to device queues. We need a whole lot more with DPDK, a
userspace TCP/IP stack for instance, to consider that we have an
equivalent functionality.

> (T)XDP seems born from spite, not technical rationale.
> IMO everyone would be better off if we'd just have something netmap-esque
> in the network core (also see below).
>
>> I imagine there are a few reasons why userspace TCP stacks can get
>> good performance:
>>
>> - Spin polling (we already can do this in kernel)
>> - Lockless, I would assume that threads typically have exclusive
>> access to a queue pair for a connection
>> - Minimal TCP/IP stack code
>> - Zero copy TX/RX
>> - Light weight structures for queuing
>> - No context switches
>> - Fast data path for in order, uncongested flows
>> - Silo'ing between application and device queues
>
> I only see two cases:
>
> 1. Many applications running (standard Os model) that need to
> send/receive data
> -> Linux Network Stack
>
> 2. Single dedicated application that does all rx/tx
>
> -> no queueing needed (can block network rx completely if receiver
> is slow)
> -> no allocations needed at runtime at all
> -> no locking needed (single produce, single consumer)
>
> If you have #2 and you need to be fast etc then full userspace
> bypass is fine.  We will -- no matter what we do in kernel -- never
> be able to keep up with the speed you can get with that
> because we have to deal with #1.  (Plus the ease of use/freedom of doing
> userspace programming).  And yes, I think that #2 is something we
> should address solely by providing netmap or something similar.
>
> But even considering #1 there are ways to speed stack up:
>
> I'd kill RPF/RPS so we don't have IPI anymore and skb stays
> on same cpu up to where it gets queued (ofo or rx queue).
>
The reference to RPS and RFS is only to move packets off the hot CPU
that are not in the datapath. For instance, if we get a FIN for a
connection we can put this into a slow path, since FIN processing is
not latency sensitive but may take a considerable amount of CPU to
process.
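
A minimal sketch of that split, assuming a hypothetical per-segment
classifier (classify_segment() and the RX_* names are made up for
illustration, nothing like this exists in the stack today). Only FIN is
taken from the discussion; treating SYN, RST and out-of-order segments
as slow path is an added assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* TCP flag bits as they appear in the flags byte of the TCP header. */
#define TCP_FLAG_FIN 0x01u
#define TCP_FLAG_SYN 0x02u
#define TCP_FLAG_RST 0x04u
#define TCP_FLAG_ACK 0x10u

enum rx_path { RX_FAST_PATH, RX_SLOW_PATH };

/*
 * Hypothetical classifier: in-order data and pure ACKs stay inline on
 * the receiving CPU. FIN is deferred as described above; SYN, RST and
 * out-of-order segments are deferred as an extra assumption.
 */
static enum rx_path classify_segment(uint8_t tcp_flags, bool in_order)
{
	if (tcp_flags & (TCP_FLAG_FIN | TCP_FLAG_SYN | TCP_FLAG_RST))
		return RX_SLOW_PATH;
	if (!in_order)
		return RX_SLOW_PATH;
	return RX_FAST_PATH;
}
```

Segments classified as RX_SLOW_PATH are the ones that could be steered
to another CPU with RPS/RFS-like machinery.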

> Then we could tell driver what happened with the skb it gave us, e.g.
> we can tell driver it can do immediate page/dma reuse, for example
> in pure ack case as opposed to skb sitting in ofo or receive queue.
>
> (RPS/RFS functionality could still be provided via one of the gazillion
>  hooks we now have in the stack for those that need/want it), so I do
> not think we would lose functionality.
>
>>   - Call into TCP/IP stack with page data directly from driver-- no
>> skbuff allocation or interface. This is essentially provided by the
>> XDP API although we would need to generalize the interface to call
>> stack functions (I previously posted patches for that). We will also
>> need a new action, XDP_HELD?, that indicates the XDP function held the
>> packet (put on a socket for instance).
>
> Seems this will not work at all with the planned page pool thing when
> pages start to be held indefinitely.
>
The processing needed to gift a page to the stack shouldn't be very
different from what a driver needs to do when an skbuff is created
after XDP_PASS is returned. We probably would want to pass additional
metadata, things like checksum and vlan information from the received
descriptor, to the stack. A callback can be included if the stack
decides it wants to hold on to the buffer and the driver needs to do
dma_sync etc. for that.

> You can also never get even close to userspace offload stacks once you
> need/do this; allocations in hotpath are too expensive.
>
> [..]
>
>>   - When we transmit, it would be nice to go straight from TCP
>> connection to an XDP device queue and in particular skip the qdisc
>> layer. This follows the principle of low latency being first criteria.
>
> It will never be lower than userspace offloads so anyone with serious
> "low latency" requirement (trading) will use that instead.
>
Maybe, but the question is how close can we get? If we can get within
say 10-20% performance that would be a win.

> Whats your target audience?
>
Many applications, but the most recent one that seems to be driving the
need for very low latency is machine learning. The competition here
really isn't DPDK but is still RDMA (tomorrow's technology for the
past twenty years ;-) ). When the apps guys run their tests, they see
a huge difference between RDMA performance and the stack out of the
box-- like latency for an op goes from 1 usec to 30 usecs. So the apps
guys naturally want RDMA, but anyone in kernel or network ops knows
the nightmare that deploying RDMA entails. If we can get the latencies
and variance down to something comparable (say <5 usecs) then we have
a much stronger argument that we can avoid the immense costs that RDMA
brings in.

>> longer latencies in effect which likely means TXDP isn't appropriate
>> in such a cases. BQL is also out, however we would want the TX
>> batching of XDP.
>
> Right, congestion control and buffer bloat are totally overrated .. 8-(
>
> So far I haven't seen anything that would need XDP at all.
>
> What makes it technically impossible to apply these miracles to the
> stack...?
>
> E.g. "mini-skb": Even if we assume that this provides a speedup
> (where does that come from? should make no difference if a 32 or
>  320 byte buffer gets allocated).
>
It's the zero'ing of three cache lines. I believe we talked about that
at netdev.

> If we assume that its the zeroing of sk_buff (but iirc it made little
> to no difference), could add
>
> unsigned long skb_extensions[1];
>
> to sk_buff, then move everything not needed for tcp fastpath
> (e.g. secpath, conntrack, nf_bridge, tunnel encap, tc, ...)
> below that
>
Yes, that's the intent.

> Then convert accesses to accessors and init it on demand.
>
> One could probably also split cb[] into a smaller fastpath area
> and another one at the end that won't be touched at allocation time.
>
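
To make the marker-plus-accessor idea concrete, here is a userspace
sketch under stated assumptions: struct mini_skb and all mini_skb_*()
helpers are invented for illustration and are not the real sk_buff
layout or API. Allocation clears only the fields above the marker; the
extension area is zeroed the first time an accessor touches it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for sk_buff; field names are illustrative only. */
struct mini_skb {
	/* Fast-path area: zeroed on every allocation. */
	unsigned int len;
	unsigned int data_len;
	bool ext_valid;		/* set once the extension area is initialised */

	/* Marker: nothing from here on is touched at allocation time. */
	unsigned long skb_extensions[1];
	void *secpath;
	void *nfct;
	void *nf_bridge;
};

/* Allocation-time init: clear only the bytes before the marker. */
static void mini_skb_init(struct mini_skb *skb)
{
	memset(skb, 0, offsetof(struct mini_skb, skb_extensions));
}

/* On-demand init of everything from the marker to the end. */
static void mini_skb_init_ext(struct mini_skb *skb)
{
	size_t off = offsetof(struct mini_skb, skb_extensions);

	memset((char *)skb + off, 0, sizeof(*skb) - off);
	skb->ext_valid = true;
}

/* Accessor: slow-path users pay for the zeroing, the fast path does not. */
static void **mini_skb_secpath(struct mini_skb *skb)
{
	if (!skb->ext_valid)
		mini_skb_init_ext(skb);
	return &skb->secpath;
}
```

Whether this saves anything measurable depends on how many cache lines
end up above the marker, which the sketch does not try to model.
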
>> Miscellaneous
>
>> contemplating that connections/sockets can be bound to particularly
>> CPUs and that any operations (socket operations, timers, receive
>> processing) must occur on that CPU. The CPU would be the one where RX
>> happens. Note this implies perfect silo'ing, everything for driver RX
>> to application processing happens inline on the CPU. The stack would
>> not cross CPUs for a connection while in this mode.
>
> Again don't see how this relates to xdp.  Could also be done with
> current stack if we make rps/rfs pluggable since nothing else
> currently pushes skb to another cpu (except when scheduler is involved
> via tc mirred, netfilter userspace queueing etc) but that is always
> explicit (i.e. skb leaves softirq protection).
>
> Can we please fix and improve what we already have rather than creating
> yet another NIH thing that will have to be maintained forever?
>
That's what we are doing, and this is a major reason why we need to
improve Linux as opposed to introducing parallel stacks. The cost of
supporting modifications to Linux pales in comparison to what we would
need to support a parallel stack.

Tom

> Thanks.

^ permalink raw reply

* Re: [patch net-next v3 11/12] mlxsw: spectrum_router: Request a dump of FIB tables during init
From: David Miller @ 2016-12-01 20:04 UTC (permalink / raw)
  To: idosch
  Cc: hannes, jiri, netdev, idosch, eladr, yotamg, nogahf, arkadis,
	ogerlitz, roopa, dsa, nikolay, andy, vivien.didelot, andrew,
	f.fainelli, alexander.h.duyck, kaber
In-Reply-To: <20161130163229.rkxvuwukgg35ktrx@splinter.mtl.com>


Hannes and Ido,

It looks like we are very close to having this in mergeable shape, can
you guys work out this final issue and figure out if it really is
a merge stopper or not?

Thanks.

^ permalink raw reply

* Re: [PATCH 2/2] net: rfkill: Add rfkill-any LED trigger
From: Michał Kępień @ 2016-12-01 20:08 UTC (permalink / raw)
  To: kbuild test robot
  Cc: kbuild-all, Johannes Berg, David S . Miller, linux-wireless,
	netdev, linux-kernel
In-Reply-To: <201612020131.aDbI7Mq9%fengguang.wu@intel.com>

> Hi Michał,
> 
> [auto build test ERROR on mac80211-next/master]
> [also build test ERROR on v4.9-rc7 next-20161201]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
> 
> url:    https://github.com/0day-ci/linux/commits/Micha-K-pie/net-rfkill-Cleanup-error-handling-in-rfkill_init/20161202-002119
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next.git master
> config: i386-randconfig-x004-201648 (attached as .config)
> compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
> reproduce:
>         # save the attached .config to linux build tree
>         make ARCH=i386 
> 
> All errors (new ones prefixed by >>):
> 
>    net/rfkill/core.c: In function 'rfkill_set_block':
> >> net/rfkill/core.c:354:2: error: implicit declaration of function '__rfkill_any_led_trigger_event' [-Werror=implicit-function-declaration]
>      __rfkill_any_led_trigger_event();
>      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>    net/rfkill/core.c: In function 'rfkill_init':
>    net/rfkill/core.c:1349:1: warning: label 'error_led_trigger' defined but not used [-Wunused-label]
>     error_led_trigger:
>     ^~~~~~~~~~~~~~~~~
>    At top level:
>    net/rfkill/core.c:243:13: warning: 'rfkill_any_led_trigger_unregister' defined but not used [-Wunused-function]
>     static void rfkill_any_led_trigger_unregister(void)
>                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>    net/rfkill/core.c:238:12: warning: 'rfkill_any_led_trigger_register' defined but not used [-Wunused-function]
>     static int rfkill_any_led_trigger_register(void)
>                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>    cc1: some warnings being treated as errors
> 
> vim +/__rfkill_any_led_trigger_event +354 net/rfkill/core.c
> 
>    348		rfkill->state &= ~RFKILL_BLOCK_SW_SETCALL;
>    349		rfkill->state &= ~RFKILL_BLOCK_SW_PREV;
>    350		curr = rfkill->state & RFKILL_BLOCK_SW;
>    351		spin_unlock_irqrestore(&rfkill->lock, flags);
>    352	
>    353		rfkill_led_trigger_event(rfkill);
>  > 354		__rfkill_any_led_trigger_event();
>    355	
>    356		if (prev != curr)
>    357			rfkill_event(rfkill);

Thanks, these are obviously all valid concerns.  Sorry for being sloppy
with the ifdefs.  If I get positive feedback on the proposed feature
itself, all these issues (and the warning pointed out in the other
message) will be resolved in v2.

-- 
Best regards,
Michał Kępień

^ permalink raw reply

* Re: [WIP] net+mlx4: auto doorbell
From: Eric Dumazet @ 2016-12-01 20:11 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Saeed Mahameed, Rick Jones, Linux Netdev List, Saeed Mahameed,
	Tariq Toukan
In-Reply-To: <20161201201707.5f51a02e@redhat.com>

On Thu, 2016-12-01 at 20:17 +0100, Jesper Dangaard Brouer wrote:
> On Thu, 01 Dec 2016 09:04:17 -0800 Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> > BTW, if you are doing tests on mlx4 40Gbit,
> 
> I'm mostly testing with mlx5 50Gbit, but I do have 40G NIC in the
> machines too.
> 
> >  would you check the
> > following quick/dirty hack, using lots of low-rate flows ?
> 
> What tool should I use to send "low-rate flows"?
> 

You could use https://github.com/google/neper

It supports SO_MAX_PACING_RATE, and you could launch 1600 flows, rate
limited to 3028000 bytes per second  (so sending one 2-MSS TSO packet
every ms per flow)
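
The arithmetic behind those numbers can be checked with a small
sketch. It assumes 1514 bytes per full-size frame (1500-byte MTU plus
a 14-byte Ethernet header), which is what makes 2 frames per ms come
out at 3028000 bytes per second; the helper names are made up for
illustration. SO_MAX_PACING_RATE is the real socket option and takes
bytes per second.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/socket.h>

#define FRAME_BYTES 1514u	/* assumption: 1500 MTU + 14 Ethernet header */

/* Bytes per second needed to send `frames` full frames every `interval_us`. */
static uint64_t pacing_rate_bytes_per_sec(unsigned int frames,
					  unsigned int interval_us)
{
	return (uint64_t)frames * FRAME_BYTES * 1000000u / interval_us;
}

/* Aggregate offered load in bits per second across all flows. */
static uint64_t aggregate_bits_per_sec(unsigned int flows,
				       uint64_t rate_bytes_per_sec)
{
	return (uint64_t)flows * rate_bytes_per_sec * 8u;
}

/* Apply the per-flow limit to a socket; returns 0 on success, -1 on error. */
static int set_pacing_rate(int fd, unsigned int rate_bytes_per_sec)
{
#ifdef SO_MAX_PACING_RATE
	return setsockopt(fd, SOL_SOCKET, SO_MAX_PACING_RATE,
			  &rate_bytes_per_sec, sizeof(rate_bytes_per_sec));
#else
	(void)fd;
	(void)rate_bytes_per_sec;
	return -1;		/* option not available on this platform */
#endif
}
```

With 2 frames every 1000 us this gives 3028000 bytes per second per
flow, and 1600 such flows add up to roughly 38.8 Gbit/s, i.e. close to
line rate on a 40Gbit link.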



> And what am I looking for?

Max throughput, in packets per second :/

^ permalink raw reply

* Re: Initial thoughts on TXDP
From: Sowmini Varadhan @ 2016-12-01 20:13 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Linux Kernel Network Developers
In-Reply-To: <CALx6S35DCyi_2z1pqCLaB1bVyNykP_J3YaYEXUT8xxmuzyBDwA@mail.gmail.com>

On (12/01/16 11:05), Tom Herbert wrote:
> 
> Polling does not necessarily imply that networking monopolizes the CPU
> except when the CPU is otherwise idle. Presumably the application
> drives the polling when it is ready to receive work.

I'm not grokking that- "if the cpu is idle, we want to busy-poll
and make it 0% idle"?  Keeping CPU 0% idle has all sorts
of issues, see slide 20 of
 http://www.slideshare.net/shemminger/dpdk-performance

> > and one other critical difference from the hot-potato-forwarding
> > model (the sort of OVS model that DPDK etc might aruguably be a fit for)
> > does not apply: in order to figure out the ethernet and IP headers
> > in the response correctly at all times (in the face of things like VRRP,
> > gw changes, gw's mac addr changes etc) the application should really
> > be listening on NETLINK sockets for modifications to the networking
> > state - again points to needing a select() socket set where you can
> > have both the I/O fds and the netlink socket,
> >
> I would think that that is management would not be implemented in a
> fast path processing thread for an application.

sure, but my point was that *XDP and other stack-bypass methods need
to provide a select()able socket: when your use-case is not about just
networking, you have to snoop on changes to the control plane, and update
your data path. In the OVS case (pure networking) the OVS control plane
updates are intrinsic to OVS. For the rest of the request/response world,
we need a select()able socket set to do this elegantly (not really
possible in DPDK, for example)
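
For illustration, a minimal sketch of what such a socket set looks like
with today's sockets API: one rtnetlink listener plus one data fd in a
single poll() call. The helper names and READY_* flags are made up; the
choice of multicast groups is an assumption about which control-plane
changes the application cares about.

```c
#include <assert.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

#define READY_DATA    0x1
#define READY_CONTROL 0x2

/* Open an rtnetlink socket subscribed to link, neighbour and route events. */
static int open_rtnl_listener(void)
{
	struct sockaddr_nl addr;
	int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);

	if (fd < 0)
		return -1;
	memset(&addr, 0, sizeof(addr));
	addr.nl_family = AF_NETLINK;
	addr.nl_groups = RTMGRP_LINK | RTMGRP_NEIGH | RTMGRP_IPV4_ROUTE;
	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/*
 * Wait on the data fd and the control-plane fd together.
 * Returns a READY_* bitmask, 0 on timeout, or -1 on error.
 */
static int wait_data_or_control(int data_fd, int ctrl_fd, int timeout_ms)
{
	struct pollfd fds[2] = {
		{ .fd = data_fd, .events = POLLIN },
		{ .fd = ctrl_fd, .events = POLLIN },
	};
	int ready = 0;

	if (poll(fds, 2, timeout_ms) < 0)
		return -1;
	if (fds[0].revents & POLLIN)
		ready |= READY_DATA;
	if (fds[1].revents & POLLIN)
		ready |= READY_CONTROL;
	return ready;
}
```

The point is that the data path only fits into this loop if it is
reachable through a file descriptor in the first place.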


> The *SOs are always an interesting question. They make for great
> benchmarks, but in real life the amount of benefit is somewhat
> unclear. Under the wrong conditions, like all cwnds have collapsed or

I think Rick's already bringing up this one.

--Sowmini

^ permalink raw reply

* Re: Initial thoughts on TXDP
From: Tom Herbert @ 2016-12-01 20:18 UTC (permalink / raw)
  To: Rick Jones; +Cc: Sowmini Varadhan, Linux Kernel Network Developers
In-Reply-To: <aac93b13-6298-b9eb-7f3c-b074f22c388c@hpe.com>

On Thu, Dec 1, 2016 at 11:48 AM, Rick Jones <rick.jones2@hpe.com> wrote:
> On 12/01/2016 11:05 AM, Tom Herbert wrote:
>>
>> For the GSO and GRO the rationale is that performing the extra SW
>> processing to do the offloads is significantly less expensive than
>> running each packet through the full stack. This is true in a
>> multi-layered generalized stack. In TXDP, however, we should be able
>> to optimize the stack data path such that that would no longer be
>> true. For instance, if we can process the packets received on a
>> connection quickly enough so that it's about the same or just a little
>> more costly than GRO processing then we might bypass GRO entirely.
>> TSO is probably still relevant in TXDP since it reduces overheads
>> processing TX in the device itself.
>
>
> Just how much per-packet path-length are you thinking will go away under the
> likes of TXDP?  It is admittedly "just" netperf but losing TSO/GSO does some
> non-trivial things to effective overhead (service demand) and so throughput:
>
For plain in-order TCP packets I believe we should be able to process
each packet at nearly the same speed as GRO. Most of the protocol
processing we do in GRO and in the stack is the same; the
difference is that we need to do a connection lookup in the stack
path (note we now do this in UDP GRO and that hasn't shown up as a
major hit). We also need to consider enqueue/dequeue on the socket,
which is a major reason to try for lockless sockets in this instance.
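
As a rough illustration of what lockless enqueue/dequeue can look like
when exactly one CPU produces and one consumes (the silo'ed case), here
is a single-producer/single-consumer ring using C11 atomics. This is a
generic sketch, not a proposal for the actual socket queue; all names
are hypothetical and it is only correct under the one-producer,
one-consumer assumption.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 256u		/* must be a power of two */

struct spsc_ring {
	_Atomic unsigned int head;	/* written only by the producer */
	_Atomic unsigned int tail;	/* written only by the consumer */
	void *slot[RING_SIZE];
};

/* Producer side (e.g. driver RX): returns false if the ring is full. */
static bool ring_enqueue(struct spsc_ring *r, void *pkt)
{
	unsigned int head = atomic_load_explicit(&r->head, memory_order_relaxed);
	unsigned int tail = atomic_load_explicit(&r->tail, memory_order_acquire);

	if (head - tail == RING_SIZE)
		return false;
	r->slot[head & (RING_SIZE - 1)] = pkt;
	/* Publish the slot contents before the new head becomes visible. */
	atomic_store_explicit(&r->head, head + 1, memory_order_release);
	return true;
}

/* Consumer side (e.g. application read): returns NULL if the ring is empty. */
static void *ring_dequeue(struct spsc_ring *r)
{
	unsigned int tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
	unsigned int head = atomic_load_explicit(&r->head, memory_order_acquire);
	void *pkt;

	if (head == tail)
		return NULL;
	pkt = r->slot[tail & (RING_SIZE - 1)];
	/* Hand the slot back to the producer only after reading it. */
	atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
	return pkt;
}
```

No lock or atomic read-modify-write is needed on either side, which is
the property that makes the per-packet cost small.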

> stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -- -P
> 12867
> MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to
> np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
> Recv   Send    Send                          Utilization       Service
> Demand
> Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> Size   Size    Size     Time     Throughput  local    remote   local remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB
>
>  87380  16384  16384    10.00      9260.24   2.02     -1.00    0.428 -1.000
> stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 tso off gso off
> stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -- -P
> 12867
> MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to
> np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
> Recv   Send    Send                          Utilization       Service
> Demand
> Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> Size   Size    Size     Time     Throughput  local    remote   local remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB
>
>  87380  16384  16384    10.00      5621.82   4.25     -1.00    1.486 -1.000
>
> And that is still with the stretch-ACKs induced by GRO at the receiver.
>
Sure, but try running something that emulates a more realistic workload
than a TCP stream, like an RR test with relatively small payloads and
many connections.
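
For reading the quoted tables: the service demand column can be
reproduced from the utilization and throughput columns. The sketch
below assumes that KB means 1024 bytes and that the utilization is
averaged over 24 logical CPUs; neither is stated in the thread, the
CPU count is only inferred because it makes the numbers line up.

```c
#include <assert.h>

/*
 * CPU microseconds spent per KB transferred.
 *   util_pct: system-wide CPU utilization in percent
 *   ncpus:    logical CPUs the percentage is averaged over (assumed)
 *   mbps:     throughput in 10^6 bits/s
 */
static double service_demand_us_per_kb(double util_pct, unsigned int ncpus,
				       double mbps)
{
	double cpu_us_per_sec = util_pct / 100.0 * ncpus * 1e6;
	double kb_per_sec = mbps * 1e6 / 8.0 / 1024.0;

	return cpu_us_per_sec / kb_per_sec;
}
```

Under those assumptions, 2.02% at 9260.24 Mbit/s comes out near the
reported 0.428 us/KB, and 4.25% at 5621.82 Mbit/s near 1.486 us/KB, so
turning TSO/GSO off costs about 3.5x in CPU per byte on this test.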

> Losing GRO has quite similar results:
> stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t
> TCP_MAERTS -- -P 12867
> MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to
> np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
> Recv   Send    Send                          Utilization       Service
> Demand
> Socket Socket  Message  Elapsed              Recv     Send     Recv    Send
> Size   Size    Size     Time     Throughput  local    remote   local remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB
>
>  87380  16384  16384    10.00      9154.02   4.00     -1.00    0.860 -1.000
> stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 gro off
>
> stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t
> TCP_MAERTS -- -P 12867
> MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to
> np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
> Recv   Send    Send                          Utilization       Service
> Demand
> Socket Socket  Message  Elapsed              Recv     Send     Recv    Send
> Size   Size    Size     Time     Throughput  local    remote   local remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB
>
>  87380  16384  16384    10.00      4212.06   5.36     -1.00    2.502 -1.000
>
> I'm sure there is a very non-trivial "it depends" component here - netperf
> will get the peak benefit from *SO and so one will see the peak difference
> in service demands - but even if one gets only 6 segments per *SO that is a
> lot of path-length to make-up.
>
True, but I think there's a lot of path we'll be able to cut out. In
this mode we don't need IPtables, Netfilter, input route, IPvlan
check, or other similar lookups. Once we've successfully matched an
established TCP state, anything related to policy on both TX and RX for
that connection is inferred from the state. We want the processing
path in this case to be concerned only with protocol processing
and the interface to the user.

> 4.4 kernel, BE3 NICs ... E5-2640 0 @ 2.50GHz
>
> And even if one does have the CPU cycles to burn so to speak, the effect on
> power consumption needs to be included in the calculus.
>
Definitely, power consumption is the downside of spin polling CPUs.
As I said, we should never be spinning any more CPUs than
necessary to handle the load.

Tom

> happy benchmarking,
>
> rick jones

^ permalink raw reply

* Re: [WIP] net+mlx4: auto doorbell
From: David Miller @ 2016-12-01 20:20 UTC (permalink / raw)
  To: eric.dumazet; +Cc: brouer, saeedm, rick.jones2, netdev, saeedm, tariqt
In-Reply-To: <1480611857.18162.319.camel@edumazet-glaptop3.roam.corp.google.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 01 Dec 2016 09:04:17 -0800

> On Thu, 2016-12-01 at 17:04 +0100, Jesper Dangaard Brouer wrote:
> 
>> When qdisc layer or trafgen/af_packet see this indication it knows it
>> should/must flush the queue when it don't have more work left.  Perhaps
>> through net_tx_action(), by registering itself and e.g. if qdisc_run()
>> is called and queue is empty then check if queue needs a flush. I would
>> also allow driver to flush and clear this bit.
> 
> net_tx_action() is not normally called, unless BQL limit is hit and/or
> some qdiscs with throttling (HTB, TBF, FQ, ...)

The one thing I wonder about is whether we should "ramp up" into a mode
where the NAPI poll does the doorbells instead of going directly there.

Maybe I misunderstand your algorithm, but it looks to me like if there
are any active packets in the TX queue at enqueue time you will defer
the doorbell to the interrupt handler.

Let's say we put 1 packet in, and hit the doorbell.

Then another packet comes in and we defer the doorbell to the IRQ.

At this point there are a couple things I'm unclear about.

For example, if we didn't hit the doorbell, will the chip still take a
peek at the second descriptor?  Depending upon how the doorbell works
it might, or it might not.

Either way, wouldn't there be a possible condition where the chip
wouldn't see the second enqueued packet and we'd thus have the wire
idle until the interrupt + NAPI runs and hits the doorbell?

This is why I think we should "ramp up" the doorbell deferral, in
order to avoid this potential wire idle time situation.

Maybe the situation I'm worried about is not possible, so please
explain it to me :-)
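
To pin down the scenario, here is a toy model of the behaviour as I
read it from the description above. It is not the mlx4 patch: the
struct and function names are invented, and it assumes the chip only
looks at descriptors it has been told about via the doorbell, which is
exactly the part I'm unsure of.

```c
#include <assert.h>
#include <stdbool.h>

struct txq_model {
	unsigned int prod;	/* descriptors written by the driver */
	unsigned int cons;	/* descriptors completed by the NIC */
	unsigned int doorbell;	/* producer index last told to the NIC */
	unsigned int doorbells;	/* number of doorbell (MMIO) writes issued */
};

static void ring_doorbell(struct txq_model *q)
{
	q->doorbell = q->prod;
	q->doorbells++;
}

/* Enqueue one packet; returns true if the doorbell was rung immediately. */
static bool txq_xmit(struct txq_model *q)
{
	bool busy = q->prod != q->cons;	/* packets already in flight? */

	q->prod++;
	if (busy)
		return false;		/* defer to the completion handler */
	ring_doorbell(q);
	return true;
}

/* TX completion (IRQ + NAPI): reap finished work, flush deferred doorbell. */
static void txq_complete(struct txq_model *q)
{
	q->cons = q->doorbell;		/* NIC finished all it was told about */
	if (q->doorbell != q->prod)
		ring_doorbell(q);
}

/* Descriptors queued by the driver that the NIC has not been told about. */
static unsigned int txq_hidden_from_nic(const struct txq_model *q)
{
	return q->prod - q->doorbell;
}
```

In this model the second packet stays hidden from the NIC until
txq_complete() runs, which is the potential wire idle window.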

^ permalink raw reply

* Re: [RFC net-next 0/3] net: bridge: Allow CPU port configuration
From: Florian Fainelli @ 2016-12-01 20:21 UTC (permalink / raw)
  To: Ido Schimmel; +Cc: idosch, andrew, vivien.didelot, netdev, bridge, jiri, davem
In-Reply-To: <20161123134856.cwk6sznnwa7p4xtq@splinter.mtl.com>

On 11/23/2016 05:48 AM, Ido Schimmel wrote:
> Hi Florian,
> 
> On Tue, Nov 22, 2016 at 09:56:30AM -0800, Florian Fainelli wrote:
>> On 11/22/2016 09:41 AM, Ido Schimmel wrote:
>>> Hi Florian,
>>>
>>> On Mon, Nov 21, 2016 at 11:09:22AM -0800, Florian Fainelli wrote:
>>>> Hi all,
>>>>
>>>> This patch series allows using the bridge master interface to configure
>>>> an Ethernet switch port's CPU/management port with different VLAN attributes than
>>>> those of the bridge downstream ports/members.
>>>>
>>>> Jiri, Ido, Andrew, Vivien, please review the impact on mlxsw and mv88e6xxx, I
>>>> tested this with b53 and a mockup DSA driver.
>>>
>>> We'll need to add a check in mlxsw and ignore any VLAN configuration for
>>> the bridge device itself. Otherwise, any configuration done on br0 will
>>> be propagated to all of its slaves, which is incorrect.
>>>
>>>>
>>>> Open questions:
>>>>
>>>> - if we have more than one bridge on top of a physical switch, the driver
>>>>   should keep track of that and verify that we are not going to change
>>>>   the CPU port VLAN attributes in a way that results in incompatible settings
>>>>   to be applied
>>>>
>>>> - if the default behavior is to have all VLANs associated with the CPU port
>>>>   be ingressing/egressing tagged to the CPU, is this really useful?
>>>
>>> First of all, I want to be sure that when we say "CPU port", we're
>>> talking about the same thing. In mlxsw, the CPU port is a pipe between
>>> the device and the host, through which all packets trapped to the host
>>> go through. So, when a packet is trapped, the driver reads its Rx
>>> descriptor, checks through which port it ingressed, resolves its netdev,
>>> sets skb->dev accordingly and injects it to the Rx path via
>>> netif_receive_skb(). The CPU port itself isn't represented using a
>>> netdev.
>>
>> In the case of DSA, the CPU port is a normal Ethernet MAC driver, but in
>> premise, this driver plus the DSA tag protocol hook do exactly the same
>> things as you just describe.
> 
> Thanks for the detailed explanation! I also took the time to read
> dsa.txt, however I still don't quite understand the motivation for
> VLAN filtering on the CPU port. In which cases would you like to prevent
> packets from going to the host due to their VLAN header? This change
> would make sense to me if you only had a limited number of VLANs you
> could enable on the CPU port, but from your response I understand that
> this isn't the case.

It's not so much about VLAN filtering per se, but more about the default
VLAN membership of the CPU port, in the absence of any explicit
configuration. As a user, I find it a little inconvenient to have to
create one VLAN interface per VLAN that I am adding to the bridge to be
able to terminate that traffic properly towards the host/CPU/management
interface, especially when this VLAN is untagged.

This is really the motivation for these patches: if there is only one
VLAN configured, and it's the default VLAN for all ports, then the
bridge master interface also terminates this VLAN with the same
properties as those added by default (typically with default_pvid: VID 1
untagged, unless changed of course).

If you don't want that as a user, you now have the ability to change
it, and make this VLAN (or any other for that matter) be terminated
differently at the host/CPU/management port level than how it is
egressing at the downstream ports that are part of that VLAN too.

Maybe it's a bit overkill...

> 
> FWIW, I don't have a problem with patches if they are useful for you,
> I'm just trying to understand the use case. We can easily patch mlxsw to
> ignore calls with orig_dev=br0.

OK, if I resubmit, I will try to take care of mlxsw and rocker as well.

Thanks!
-- 
Florian

^ permalink raw reply

* [PATCH iproute2 1/1] tc: updated man page to reflect handle-id use in filter GET command.
From: Roman Mashak @ 2016-12-01 20:20 UTC (permalink / raw)
  To: stephen; +Cc: netdev, sathya.perla, Roman Mashak

Signed-off-by: Roman Mashak <mrv@mojatatu.com>
---
 man/man8/tc.8 | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/man/man8/tc.8 b/man/man8/tc.8
index 8a47a2b..d957ffa 100644
--- a/man/man8/tc.8
+++ b/man/man8/tc.8
@@ -32,7 +32,9 @@ class-id ] qdisc
 DEV
 .B [ parent
 qdisc-id
-.B | root ] protocol
+.B | root ] [ handle
+handle-id ]
+.B protocol
 protocol
 .B prio
 priority filtertype
@@ -577,7 +579,7 @@ it is created.
 
 .TP
 get
-Displays a single filter given the interface, parent ID, priority, protocol and handle ID.
+Displays a single filter given the interface, qdisc-id, priority, protocol and handle-id.
 
 .TP
 show
-- 
1.9.1

^ permalink raw reply related

