Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH net] geneve: add ttl inherit support
From: David Miller @ 2018-09-13  3:38 UTC (permalink / raw)
  To: liuhangbin; +Cc: netdev, pshelar, jbenc, sbrivio
In-Reply-To: <1536717861-21590-1-git-send-email-liuhangbin@gmail.com>

From: Hangbin Liu <liuhangbin@gmail.com>
Date: Wed, 12 Sep 2018 10:04:21 +0800

> Similar with commit 72f6d71e491e6 ("vxlan: add ttl inherit support"),
> currently ttl == 0 means "use whatever default value" on geneve instead
> of inherit inner ttl. To respect compatibility with old behavior, let's
> add a new IFLA_GENEVE_TTL_INHERIT for geneve ttl inherit support.
> 
> Reported-by: Jianlin Shi <jishi@redhat.com>
> Suggested-by: Jiri Benc <jbenc@redhat.com>
> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>

Applied to net-next.

^ permalink raw reply

* Re: Re: [PATCH net-next v2 0/5] virtio: support packed ring
From: Tiwei Bie @ 2018-09-13  8:59 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, virtualization, linux-kernel, netdev, virtio-dev,
	wexu, jfreimann
In-Reply-To: <20180912121457-mutt-send-email-mst@kernel.org>

On Wed, Sep 12, 2018 at 12:16:32PM -0400, Michael S. Tsirkin wrote:
> On Tue, Sep 11, 2018 at 01:37:26PM +0800, Tiwei Bie wrote:
> > On Mon, Sep 10, 2018 at 11:33:17AM +0800, Jason Wang wrote:
> > > On 2018年09月10日 11:00, Tiwei Bie wrote:
> > > > On Fri, Sep 07, 2018 at 09:00:49AM -0400, Michael S. Tsirkin wrote:
> > > > > On Fri, Sep 07, 2018 at 09:22:25AM +0800, Tiwei Bie wrote:
> > > > > > On Mon, Aug 27, 2018 at 05:00:40PM +0300, Michael S. Tsirkin wrote:
> > > > > > > Are there still plans to test the performance with vost pmd?
> > > > > > > vhost doesn't seem to show a performance gain ...
> > > > > > > 
> > > > > > I tried some performance tests with vhost PMD. In guest, the
> > > > > > XDP program will return XDP_DROP directly. And in host, testpmd
> > > > > > will do txonly fwd.
> > > > > > 
> > > > > > When burst size is 1 and packet size is 64 in testpmd and
> > > > > > testpmd needs to iterate 5 Tx queues (but only the first two
> > > > > > queues are enabled) to prepare and inject packets, I got ~12%
> > > > > > performance boost (5.7Mpps -> 6.4Mpps). And if the vhost PMD
> > > > > > is faster (e.g. just need to iterate the first two queues to
> > > > > > prepare and inject packets), then I got similar performance
> > > > > > for both rings (~9.9Mpps) (packed ring's performance can be
> > > > > > lower, because it's more complicated in driver.)
> > > > > > 
> > > > > > I think packed ring makes vhost PMD faster, but it doesn't make
> > > > > > the driver faster. In packed ring, the ring is simplified, and
> > > > > > the handling of the ring in vhost (device) is also simplified,
> > > > > > but things are not simplified in driver, e.g. although there is
> > > > > > no desc table in the virtqueue anymore, driver still needs to
> > > > > > maintain a private desc state table (which is still managed as
> > > > > > a list in this patch set) to support the out-of-order desc
> > > > > > processing in vhost (device).
> > > > > > 
> > > > > > I think this patch set is mainly to make the driver have a full
> > > > > > functional support for the packed ring, which makes it possible
> > > > > > to leverage the packed ring feature in vhost (device). But I'm
> > > > > > not sure whether there is any other better idea, I'd like to
> > > > > > hear your thoughts. Thanks!
> > > > > Just this: Jens seems to report a nice gain with virtio and
> > > > > vhost pmd across the board. Try to compare virtio and
> > > > > virtio pmd to see what does pmd do better?
> > > > The virtio PMD (drivers/net/virtio) in DPDK doesn't need to share
> > > > the virtio ring operation code with other drivers and is highly
> > > > optimized for network. E.g. in Rx, the Rx burst function won't
> > > > chain descs. So the ID management for the Rx ring can be quite
> > > > simple and straightforward, we just need to initialize these IDs
> > > > when initializing the ring and don't need to change these IDs
> > > > in data path anymore (the mergable Rx code in that patch set
> > > > assumes the descs will be written back in order, which should be
> > > > fixed. I.e., the ID in the desc should be used to index vq->descx[]).
> > > > The Tx code in that patch set also assumes the descs will be
> > > > written back by device in order, which should be fixed.
> > > 
> > > Yes it is. I think I've pointed it out in some early version of pmd patch.
> > > So I suspect part (or all) of the boost may come from in order feature.
> > > 
> > > > 
> > > > But in kernel virtio driver, the virtio_ring.c is very generic.
> > > > The enqueue (virtqueue_add()) and dequeue (virtqueue_get_buf_ctx())
> > > > functions need to support all the virtio devices and should be
> > > > able to handle all the possible cases that may happen. So although
> > > > the packed ring can be very efficient in some cases, currently
> > > > the room to optimize the performance in kernel's virtio_ring.c
> > > > isn't that much. If we want to take the fully advantage of the
> > > > packed ring's efficiency, we need some further e.g. API changes
> > > > in virtio_ring.c, which shouldn't be part of this patch set.
> > > 
> > > Could you please share more thoughts on this e.g how to improve the API?
> > > Notice since the API is shared by both split ring and packed ring, it may
> > > improve the performance of split ring as well. One can easily imagine a
> > > batching API, but it does not have many real users now, the only case is the
> > > XDP transmission which can accept an array of XDP frames.
> > 
> > I don't have detailed thoughts on this yet. But kernel's
> > virtio_ring.c is quite generic compared with what we did
> > in virtio PMD.
> 
> In what way? What are some things that aren't implemented there?

Below is the code corresponding to the virtqueue_add()
for Rx ring in virtio PMD:

https://github.com/DPDK/dpdk/blob/3605968c2fa7/drivers/net/virtio/virtio_rxtx.c#L278-L304

And below is the code of virtqueue_add() in Linux:

https://github.com/torvalds/linux/blob/54eda9df17f3/drivers/virtio/virtio_ring.c#L275-L417

In virtio PMD, for each packet (mbuf), the code is pretty
straightforward, it will just check whether there is one
available desc. If it's true, it will just fill this desc
directly.

But in virtqueue_add(), it's obvious that, the logic is
much more complicated or generic. It's supposed to be
able to handle sglist (which may consist of multiple IN
buffers and multiple OUT buffers at the same time), and
it will try to use indirect descriptors. Then it needs
several loops to parse the sglist. That's why I said
it's quite generic.

> 
> If what you say is true then we should take a careful look
> and not supporting these generic things with packed layout.
> Once we do support them it will be too late and we won't
> be able to get performance back.

I think it's a good point that we don't need to support
everything in packed ring (especially these which would
hurt the performance), as the packed ring aims at high
performance. I'm also wondering about the features. Is
there any possibility that we won't support the out of
order processing (at least not by default) in packed ring?
If I didn't miss anything, the need to support out of order
processing in packed ring will make the data structure
inside the driver not cache friendly which is similar to
the case of the descriptor table in the split ring (the
difference is that, it only happens in driver now).


> 
> 
> 
> > > 
> > > > So
> > > > I still think this patch set is mainly to make the kernel virtio
> > > > driver to have a full functional support of the packed ring, and
> > > > we can't expect impressive performance boost with it.
> > > 
> > > We can only gain when virtio ring layout is the bottleneck. If there're
> > > bottlenecks elsewhere, we probably won't see any increasing in the numbers.
> > > Vhost-net is an example, and lots of optimizations have proved that virtio
> > > ring is not the main bottleneck for the current codes. I suspect it also the
> > > case of virtio driver. Did perf tell us any interesting things in virtio
> > > driver?
> > > 
> > > Thanks
> > > 
> > > > 
> > > > > 
> > > > > > > On Wed, Jul 11, 2018 at 10:27:06AM +0800, Tiwei Bie wrote:
> > > > > > > > Hello everyone,
> > > > > > > > 
> > > > > > > > This patch set implements packed ring support in virtio driver.
> > > > > > > > 
> > > > > > > > Some functional tests have been done with Jason's
> > > > > > > > packed ring implementation in vhost:
> > > > > > > > 
> > > > > > > > https://lkml.org/lkml/2018/7/3/33
> > > > > > > > 
> > > > > > > > Both of ping and netperf worked as expected.
> > > > > > > > 
> > > > > > > > v1 -> v2:
> > > > > > > > - Use READ_ONCE() to read event off_wrap and flags together (Jason);
> > > > > > > > - Add comments related to ccw (Jason);
> > > > > > > > 
> > > > > > > > RFC (v6) -> v1:
> > > > > > > > - Avoid extra virtio_wmb() in virtqueue_enable_cb_delayed_packed()
> > > > > > > >    when event idx is off (Jason);
> > > > > > > > - Fix bufs calculation in virtqueue_enable_cb_delayed_packed() (Jason);
> > > > > > > > - Test the state of the desc at used_idx instead of last_used_idx
> > > > > > > >    in virtqueue_enable_cb_delayed_packed() (Jason);
> > > > > > > > - Save wrap counter (as part of queue state) in the return value
> > > > > > > >    of virtqueue_enable_cb_prepare_packed();
> > > > > > > > - Refine the packed ring definitions in uapi;
> > > > > > > > - Rebase on the net-next tree;
> > > > > > > > 
> > > > > > > > RFC v5 -> RFC v6:
> > > > > > > > - Avoid tracking addr/len/flags when DMA API isn't used (MST/Jason);
> > > > > > > > - Define wrap counter as bool (Jason);
> > > > > > > > - Use ALIGN() in vring_init_packed() (Jason);
> > > > > > > > - Avoid using pointer to track `next` in detach_buf_packed() (Jason);
> > > > > > > > - Add comments for barriers (Jason);
> > > > > > > > - Don't enable RING_PACKED on ccw for now (noticed by Jason);
> > > > > > > > - Refine the memory barrier in virtqueue_poll();
> > > > > > > > - Add a missing memory barrier in virtqueue_enable_cb_delayed_packed();
> > > > > > > > - Remove the hacks in virtqueue_enable_cb_prepare_packed();
> > > > > > > > 
> > > > > > > > RFC v4 -> RFC v5:
> > > > > > > > - Save DMA addr, etc in desc state (Jason);
> > > > > > > > - Track used wrap counter;
> > > > > > > > 
> > > > > > > > RFC v3 -> RFC v4:
> > > > > > > > - Make ID allocation support out-of-order (Jason);
> > > > > > > > - Various fixes for EVENT_IDX support;
> > > > > > > > 
> > > > > > > > RFC v2 -> RFC v3:
> > > > > > > > - Split into small patches (Jason);
> > > > > > > > - Add helper virtqueue_use_indirect() (Jason);
> > > > > > > > - Just set id for the last descriptor of a list (Jason);
> > > > > > > > - Calculate the prev in virtqueue_add_packed() (Jason);
> > > > > > > > - Fix/improve desc suppression code (Jason/MST);
> > > > > > > > - Refine the code layout for XXX_split/packed and wrappers (MST);
> > > > > > > > - Fix the comments and API in uapi (MST);
> > > > > > > > - Remove the BUG_ON() for indirect (Jason);
> > > > > > > > - Some other refinements and bug fixes;
> > > > > > > > 
> > > > > > > > RFC v1 -> RFC v2:
> > > > > > > > - Add indirect descriptor support - compile test only;
> > > > > > > > - Add event suppression supprt - compile test only;
> > > > > > > > - Move vring_packed_init() out of uapi (Jason, MST);
> > > > > > > > - Merge two loops into one in virtqueue_add_packed() (Jason);
> > > > > > > > - Split vring_unmap_one() for packed ring and split ring (Jason);
> > > > > > > > - Avoid using '%' operator (Jason);
> > > > > > > > - Rename free_head -> next_avail_idx (Jason);
> > > > > > > > - Add comments for virtio_wmb() in virtqueue_add_packed() (Jason);
> > > > > > > > - Some other refinements and bug fixes;
> > > > > > > > 
> > > > > > > > Thanks!
> > > > > > > > 
> > > > > > > > Tiwei Bie (5):
> > > > > > > >    virtio: add packed ring definitions
> > > > > > > >    virtio_ring: support creating packed ring
> > > > > > > >    virtio_ring: add packed ring support
> > > > > > > >    virtio_ring: add event idx support in packed ring
> > > > > > > >    virtio_ring: enable packed ring
> > > > > > > > 
> > > > > > > >   drivers/s390/virtio/virtio_ccw.c   |   14 +
> > > > > > > >   drivers/virtio/virtio_ring.c       | 1365 ++++++++++++++++++++++------
> > > > > > > >   include/linux/virtio_ring.h        |    8 +-
> > > > > > > >   include/uapi/linux/virtio_config.h |    3 +
> > > > > > > >   include/uapi/linux/virtio_ring.h   |   43 +
> > > > > > > >   5 files changed, 1157 insertions(+), 276 deletions(-)
> > > > > > > > 
> > > > > > > > -- 
> > > > > > > > 2.18.0
> > > > > > > ---------------------------------------------------------------------
> > > > > > > To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> > > > > > > For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
> > > > > > > 
> > > 

^ permalink raw reply

* Re: [PATCH net-next V2] virtio_net: ethtool tx napi configuration
From: Jason Wang @ 2018-09-13  9:07 UTC (permalink / raw)
  To: mst; +Cc: davem, virtualization, netdev, linux-kernel, Willem de Bruijn
In-Reply-To: <20180913053545.18585-1-jasowang@redhat.com>



On 2018年09月13日 13:35, Jason Wang wrote:
> Implement ethtool .set_coalesce (-C) and .get_coalesce (-c) handlers.
> Interrupt moderation is currently not supported, so these accept and
> display the default settings of 0 usec and 1 frame.
>
> Toggle tx napi through a bit in tx-frames. So as to not interfere
> with possible future interrupt moderation, value 1 means tx napi while
> value 0 means not.
>
> To properly synchronize with the data path, tx napi is disabled and
> tx lock is held when changing the value of napi weight. And two more
> places that can access tx napi weight:
>
> - speculative tx polling in rx napi, we can leave it as is since it
>    not a must for correctness.
> - skb_xmit_done(), one more check of napi weight is added before
>    trying to enable tx to avoid tx to be disabled forever if napi is
>    disabled after skb_xmit_done() but before the napi
>
> Link: https://patchwork.ozlabs.org/patch/948149/
> Suggested-by: Jason Wang <jasowang@redhat.com>
> Signed-off-by: Willem de Bruijn <willemb@google.com>
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
> Changes from V1:
> - try to synchronize with datapath to allow changing mode when
>    interface is up.
> - use tx-frames 0 as to disable tx napi while tx-frames 1 to enable tx napi
> ---
>   drivers/net/virtio_net.c | 64 +++++++++++++++++++++++++++++++++++++++-
>   1 file changed, 63 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 765920905226..6e70864f5899 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -66,6 +66,8 @@ DECLARE_EWMA(pkt_len, 0, 64)
>   
>   #define VIRTNET_DRIVER_VERSION "1.0.0"
>   
> +static const u32 ethtool_coalesce_napi_mask = (1UL << 10);
> +
>   static const unsigned long guest_offloads[] = {
>   	VIRTIO_NET_F_GUEST_TSO4,
>   	VIRTIO_NET_F_GUEST_TSO6,
> @@ -1444,7 +1446,10 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget)
>   
>   	virtqueue_napi_complete(napi, sq->vq, 0);
>   
> -	if (sq->vq->num_free >= 2 + MAX_SKB_FRAGS)
> +	/* Check napi.weight to avoid tx stall since it could be set
> +	 * to zero by ethtool after skb_xmit_done().
> +	 */
> +	if (sq->vq->num_free >= 2 + MAX_SKB_FRAGS || !sq->napi.weight)
>   		netif_tx_wake_queue(txq);
>   
>   	return 0;
> @@ -2181,6 +2186,61 @@ static int virtnet_get_link_ksettings(struct net_device *dev,
>   	return 0;
>   }
>   
> +static int virtnet_set_coalesce(struct net_device *dev,
> +				struct ethtool_coalesce *ec)
> +{
> +	struct ethtool_coalesce ec_default = {
> +		.cmd = ETHTOOL_SCOALESCE,
> +		.rx_max_coalesced_frames = 1,
> +	};
> +	struct virtnet_info *vi = netdev_priv(dev);
> +	int i, napi_weight;
> +
> +	if (ec->tx_max_coalesced_frames > 1)
> +		return -EINVAL;
> +
> +	ec_default.tx_max_coalesced_frames = ec->tx_max_coalesced_frames;
> +	napi_weight = ec->tx_max_coalesced_frames ? NAPI_POLL_WEIGHT : 0;
> +
> +	/* disallow changes to fields not explicitly tested above */
> +	if (memcmp(ec, &ec_default, sizeof(ec_default)))
> +		return -EINVAL;
> +
> +	if (napi_weight ^ vi->sq[0].napi.weight) {
> +		for (i = 0; i < vi->max_queue_pairs; i++) {
> +			struct netdev_queue *txq =
> +			       netdev_get_tx_queue(vi->dev, i);
> +
> +			virtnet_napi_tx_disable(&vi->sq[i].napi);
> +			__netif_tx_lock_bh(txq);

Need to check netif_running() before disabling napi. Otherwise we may 
have a infinite loop here.

The discussion is still ongoing, if we decide to go set_coalesce, I will 
post a V3.

Thanks

> +			vi->sq[i].napi.weight = napi_weight;
> +			__netif_tx_unlock_bh(txq);
> +			virtnet_napi_tx_enable(vi, vi->sq[i].vq,
> +					       &vi->sq[i].napi);
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +static int virtnet_get_coalesce(struct net_device *dev,
> +				struct ethtool_coalesce *ec)
> +{
> +	struct ethtool_coalesce ec_default = {
> +		.cmd = ETHTOOL_GCOALESCE,
> +		.rx_max_coalesced_frames = 1,
> +		.tx_max_coalesced_frames = 0,
> +	};
> +	struct virtnet_info *vi = netdev_priv(dev);
> +
> +	memcpy(ec, &ec_default, sizeof(ec_default));
> +
> +	if (vi->sq[0].napi.weight)
> +		ec->tx_max_coalesced_frames = 1;
> +
> +	return 0;
> +}
> +
>   static void virtnet_init_settings(struct net_device *dev)
>   {
>   	struct virtnet_info *vi = netdev_priv(dev);
> @@ -2219,6 +2279,8 @@ static const struct ethtool_ops virtnet_ethtool_ops = {
>   	.get_ts_info = ethtool_op_get_ts_info,
>   	.get_link_ksettings = virtnet_get_link_ksettings,
>   	.set_link_ksettings = virtnet_set_link_ksettings,
> +	.set_coalesce = virtnet_set_coalesce,
> +	.get_coalesce = virtnet_get_coalesce,
>   };
>   
>   static void virtnet_freeze_down(struct virtio_device *vdev)

^ permalink raw reply

* [PATCH] stmmac: fix valid numbers of unicast filter entries
From: Jongsung Kim @ 2018-09-13  9:32 UTC (permalink / raw)
  To: linux-kernel, netdev, Giuseppe Cavallaro, Alexandre Torgue,
	Jose Abreu, David S . Miller
  Cc: Chanho Min, Jongsung Kim

Synopsys DWC Ethernet MAC can be configured to have 1..32, 64, or
128 unicast filter entries. (Table 7-8 MAC Address Registers from
databook) Fix dwmac1000_validate_ucast_entries() to accept values
between 1 and 32 in addition.

Signed-off-by: Jongsung Kim <neidhard.kim@lge.com>
---
 drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
index 3609c7b..2b800ce 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
@@ -67,7 +67,7 @@ static int dwmac1000_validate_mcast_bins(int mcast_bins)
  * Description:
  * This function validates the number of Unicast address entries supported
  * by a particular Synopsys 10/100/1000 controller. The Synopsys controller
- * supports 1, 32, 64, or 128 Unicast filter entries for it's Unicast filter
+ * supports 1..32, 64, or 128 Unicast filter entries for it's Unicast filter
  * logic. This function validates a valid, supported configuration is
  * selected, and defaults to 1 Unicast address if an unsupported
  * configuration is selected.
@@ -77,8 +77,7 @@ static int dwmac1000_validate_ucast_entries(int ucast_entries)
 	int x = ucast_entries;
 
 	switch (x) {
-	case 1:
-	case 32:
+	case 1 ... 32:
 	case 64:
 	case 128:
 		break;
-- 
2.7.4

^ permalink raw reply related

* Re: Re: [PATCH net-next v2 0/5] virtio: support packed ring
From: Jason Wang @ 2018-09-13  9:47 UTC (permalink / raw)
  To: Tiwei Bie, Michael S. Tsirkin
  Cc: virtualization, linux-kernel, netdev, virtio-dev, wexu, jfreimann
In-Reply-To: <20180913085919.GA42049@fbsd1.sh.intel.com>



On 2018年09月13日 16:59, Tiwei Bie wrote:
>> If what you say is true then we should take a careful look
>> and not supporting these generic things with packed layout.
>> Once we do support them it will be too late and we won't
>> be able to get performance back.
> I think it's a good point that we don't need to support
> everything in packed ring (especially these which would
> hurt the performance), as the packed ring aims at high
> performance. I'm also wondering about the features. Is
> there any possibility that we won't support the out of
> order processing (at least not by default) in packed ring?
> If I didn't miss anything, the need to support out of order
> processing in packed ring will make the data structure
> inside the driver not cache friendly which is similar to
> the case of the descriptor table in the split ring (the
> difference is that, it only happens in driver now).

Out of order is not the only user, DMA is another one. We don't have 
used ring(len), so we need to maintain buffer length somewhere even for 
in order device. But if it's not too late, I second for a OUT_OF_ORDER 
feature. Starting from in order can have much simpler code in driver.

Thanks

^ permalink raw reply

* Re: [PATCH bpf-next 4/4] tools/bpf: bpftool: add net support
From: Yonghong Song @ 2018-09-13  4:57 UTC (permalink / raw)
  To: Daniel Borkmann, ast, netdev; +Cc: kernel-team
In-Reply-To: <b5b3d620-80a5-3b00-e84f-12ebce6c6925@iogearbox.net>



On 9/12/18 3:29 PM, Daniel Borkmann wrote:
> On 09/06/2018 01:58 AM, Yonghong Song wrote:
>> Add "bpftool net" support. Networking devices are enumerated
>> to dump device index/name associated with xdp progs.
>>
>> For each networking device, tc classes and qdiscs are enumerated
>> in order to check their bpf filters.
>> In addition, root handle and clsact ingress/egress are also checked for
>> bpf filters.  Not all filter information is printed out. Only ifindex,
>> kind, filter name, prog_id and tag are printed out, which are good
>> enough to show attachment information. If the filter action
>> is a bpf action, its bpf program id, bpf name and tag will be
>> printed out as well.
>>
>> For example,
>>    $ ./bpftool net
>>    xdp [
>>    ifindex 2 devname eth0 prog_id 198
>>    ]
> 
> Could we make the output more terse? E.g. the 'ifindex' and 'devname' is basically
> zero information but will take lots of space. 'eth0 (2)' would for example make it
> shorter. Also info is missing whether the attached prog is driver/hw/generic XDP. :(

Right 'eth0 (2)' is a good idea. Similarly to other bpftool plain 
output, agree we should have concise printout.
For the above xdp output, I guess I tested on an old kernel and will 
test on newer kernel to ensure it works properly.

> 
>>    tc_filters [
>>    ifindex 2 kind qdisc_htb name prefix_matcher.o:[cls_prefix_matcher_htb]
>>              prog_id 111727 tag d08fe3b4319bc2fd act []
>>    ifindex 2 kind qdisc_clsact_ingress name fbflow_icmp
>>              prog_id 130246 tag 3f265c7f26db62c9 act []
>>    ifindex 2 kind qdisc_clsact_egress name prefix_matcher.o:[cls_prefix_matcher_clsact]
>>              prog_id 111726 tag 99a197826974c876
>>    ifindex 2 kind qdisc_clsact_egress name cls_fg_dscp
>>              prog_id 108619 tag dc4630674fd72dcc act []
>>    ifindex 2 kind qdisc_clsact_egress name fbflow_egress
>>              prog_id 130245 tag 72d2d830d6888d2c
>>    ]
> 
> Similar comment here. Do we need the tag here? I think it's not really needed, e.g.
> the output of bpftool perf [0] doesn't provide it either and therefore makes the list
> as nice one-liners, so overview is much nicer there. Is there a reason that tc progs
> do not show dev name as opposed to xdp progs? Can we shorten everything to make it
> a one-liner like in bpftool perf?

Yes, we should remove 'tag'. Users can use prog_id to find tag.
There is no IFNAME attribute in the tc filter return message.
But I already got them in previous iplink message, so I can
print out here.

> Should we have a small indicator here if the tc prog was offloaded?

Not sure about this one. As you suggested in the next email, we
can have a message like if users want more information they
can use more specific tools 'ip link ...', 'tc filter ...' etc.

> 
> Does the dump work with tc shared blocks?

I does not.I did not have experiences with tc shared block and that is 
why I did not add support for it.

> 
> Should we also dump networking related cgroup BPF progs here under bpftool net?

probably not, but in additional to the above suggestions about using 
different tools, we can add a message to suggest check `bpftool cgroup 
tree` for cgroup related net programs.

> 
> Thanks,
> Daniel
> 
>    [0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b04df400c30235fa347313c9e2a0695549bd2c8e
> 

^ permalink raw reply

* Re: [PATCH bpf-next 4/4] tools/bpf: bpftool: add net support
From: Yonghong Song @ 2018-09-13  5:01 UTC (permalink / raw)
  To: Daniel Borkmann, ast, netdev; +Cc: kernel-team
In-Reply-To: <654fd0ce-5be5-0288-d7ed-42dab183f871@iogearbox.net>



On 9/12/18 3:40 PM, Daniel Borkmann wrote:
> On 09/13/2018 12:29 AM, Daniel Borkmann wrote:
>> On 09/06/2018 01:58 AM, Yonghong Song wrote:
>>> Add "bpftool net" support. Networking devices are enumerated
>>> to dump device index/name associated with xdp progs.
>>>
>>> For each networking device, tc classes and qdiscs are enumerated
>>> in order to check their bpf filters.
>>> In addition, root handle and clsact ingress/egress are also checked for
>>> bpf filters.  Not all filter information is printed out. Only ifindex,
>>> kind, filter name, prog_id and tag are printed out, which are good
>>> enough to show attachment information. If the filter action
>>> is a bpf action, its bpf program id, bpf name and tag will be
>>> printed out as well.
>>>
>>> For example,
>>>    $ ./bpftool net
>>>    xdp [
>>>    ifindex 2 devname eth0 prog_id 198
>>>    ]
>>
>> Could we make the output more terse? E.g. the 'ifindex' and 'devname' is basically
>> zero information but will take lots of space. 'eth0 (2)' would for example make it
>> shorter. Also info is missing whether the attached prog is driver/hw/generic XDP. :(
>>
>>>    tc_filters [
>>>    ifindex 2 kind qdisc_htb name prefix_matcher.o:[cls_prefix_matcher_htb]
>>>              prog_id 111727 tag d08fe3b4319bc2fd act []
>>>    ifindex 2 kind qdisc_clsact_ingress name fbflow_icmp
>>>              prog_id 130246 tag 3f265c7f26db62c9 act []
>>>    ifindex 2 kind qdisc_clsact_egress name prefix_matcher.o:[cls_prefix_matcher_clsact]
>>>              prog_id 111726 tag 99a197826974c876
>>>    ifindex 2 kind qdisc_clsact_egress name cls_fg_dscp
>>>              prog_id 108619 tag dc4630674fd72dcc act []
>>>    ifindex 2 kind qdisc_clsact_egress name fbflow_egress
>>>              prog_id 130245 tag 72d2d830d6888d2c
>>>    ]
>>
>> Similar comment here. Do we need the tag here? I think it's not really needed, e.g.
>> the output of bpftool perf [0] doesn't provide it either and therefore makes the list
>> as nice one-liners, so overview is much nicer there. Is there a reason that tc progs
>> do not show dev name as opposed to xdp progs? Can we shorten everything to make it
>> a one-liner like in bpftool perf?
>>
>> Should we have a small indicator here if the tc prog was offloaded?
>>
>> Does the dump work with tc shared blocks?
>>
>> Should we also dump networking related cgroup BPF progs here under bpftool net?
> 
> One more thought, I think it would make sense to also explicitly document current
> limitations of the progs that this listing does not show where user should consult
> other tools aka iproute2 e.g. lwt, seg6, tc bpf actions, etc, so the current scope
> would be more clear for users.

I will do a followup patch to address a few issues mentioned in my 
previous email (mostly more concise output, briefly mentioning 
limitation of the tool, etc.) and also add more information in the
documentation and also in the 'bpf net help' output to set user's
expectation right about what this tool can do and how to find more
information.

> 
>> Thanks,
>> Daniel
>>
>>    [0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b04df400c30235fa347313c9e2a0695549bd2c8e
>>
> 

^ permalink raw reply

* Re: [PATCHv2] net: ipv4: Use BUG_ON directly instead of a if condition followed by BUG
From: Simon Horman @ 2018-09-13 10:28 UTC (permalink / raw)
  To: zhong jiang; +Cc: davem, edumazet, kuznet, netdev, linux-kernel
In-Reply-To: <1536671593-14746-1-git-send-email-zhongjiang@huawei.com>

On Tue, Sep 11, 2018 at 09:13:13PM +0800, zhong jiang wrote:
> The if condition can be removed if we use BUG_ON directly.
> The issule is detected with the help of Coccinelle.
> 
> Signed-off-by: zhong jiang <zhongjiang@huawei.com>
> ---
>  net/ipv4/tcp_input.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 62508a2..526c05e 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -4934,8 +4934,8 @@ void tcp_rbtree_insert(struct rb_root *root, struct sk_buff *skb)
>  			BUG_ON(offset < 0);
>  			if (size > 0) {
>  				size = min(copy, size);
> -				if (skb_copy_bits(skb, offset, skb_put(nskb, size), size))
> -					BUG();
> +				BUG_ON(skb_copy_bits(skb, offset,
> +						     skb_put(nskb, size), size));

Perhaps things have changed but it doesn't seem desirable to me to use
BUG_ON with statements that have side-effects in case at some point BUG_ON
gets compiled out.

>  				TCP_SKB_CB(nskb)->end_seq += size;
>  				copy -= size;
>  				start += size;
> @@ -5327,8 +5327,8 @@ static void tcp_urg(struct sock *sk, struct sk_buff *skb, const struct tcphdr *t
>  		/* Is the urgent pointer pointing into this packet? */
>  		if (ptr < skb->len) {
>  			u8 tmp;
> -			if (skb_copy_bits(skb, ptr, &tmp, 1))
> -				BUG();
> +
> +			BUG_ON(skb_copy_bits(skb, ptr, &tmp, 1));
>  			tp->urg_data = TCP_URG_VALID | tmp;
>  			if (!sock_flag(sk, SOCK_DEAD))
>  				sk->sk_data_ready(sk);
> -- 
> 1.7.12.4
> 

^ permalink raw reply

* Re: Regression caused by commit 7bb05b85bc2d ("r8169: don't use MSI-X on RTL8106e")
From: Jian-Hong Pan @ 2018-09-13  5:50 UTC (permalink / raw)
  To: Kai-Heng Feng, Heiner Kallweit
  Cc: Thomas Gleixner, Linux Netdev List, Linux Kernel Mailing List,
	Linux Upstreaming Team, Daniel Drake, Steve Dodd
In-Reply-To: <BF554FD5-17BA-4F16-9FDF-705D1FC76C12@canonical.com>

2018-09-12 16:19 GMT+08:00 Kai-Heng Feng <kai.heng.feng@canonical.com>:
> at 14:32, Thomas Gleixner <tglx@linutronix.de> wrote:
>
>> On Wed, 12 Sep 2018, Kai-Heng Feng wrote:
>>
>>> There's a Dell machine with RTL8106e stops to work after S3 since the
>>> commit introduced. So I am wondering if it's possible to revert the
>>> commit and use DMI/subsystem id based quirk table?
>>
>>
>> Probably.
>
>
> Hopefully Jian-Hong can cook up a quirk table for the issue.

Module r8169 gets nothing in the PCI BAR after system resumes which
makes MSI-X fail on some ASUS laptops equipped with RTL8106e chip.
https://www.spinics.net/lists/linux-pci/msg75598.html

Actually, I am waiting for the patch "PCI: Reprogram bridge prefetch
registers on resume" being merged.
https://marc.info/?l=linux-pm&m=153680987814299&w=2

It resolves the drivers which get nothing in PCI BAR after system resumes.

After that, I can remove the falling back code of RTL8106e.

Heiner, any comment?

Regards,
Jian-Hong Pan

>>
>>> It's because of commit bc976233a872 ("genirq/msi, x86/vector: Prevent
>>> reservation mode for non maskable MSI") cleared the reservation mode, and
>>> I
>>> can see this after S3:
>>>
>>> [   94.872838] do_IRQ: 3.33 No irq handler for vector
>>
>>
>> It's not because of that commit, really. There is a interrupt sent after
>> resume to the wrong vector for whatever reason. The MSI vector cannot be
>> masked it seems in the device, but the driver should quiescen the device
>> to
>> a point where it does not send interrupts.
>
>
> Understood.
>
>>
>>> If the device uses MSI-X instead of MSI, the issue doesn't happen because
>>> of
>>> reservation mode.
>>
>>
>> Reservation mode has absolutely nothing to do with that. What prevents the
>> issue is the fact that MSI-X can be masked by the IRQ core.
>
>
> So in this case I think keep the device using MSI-X is a better route, it's
> MSI-X capable anyway.
>
>>
>>> Is it something should be handled by x86 BIOS? Because I don't see this
>>> issue
>>> when I use Suspend-to-Idle, which doesn't use BIOS to do suspend.
>>
>>
>> Suspend to idle works completely different and I don't see the BIOS at
>> fault here. it's more an issue of MSI not being maskable on that device,
>> which can't be fixed in BIOS or it's some half quiescened state which is
>> used when suspending and that's a pure driver issue.
>
>
> Understood.
> Thanks for all the info!
>
> Kai-Heng
>
>>
>> Thanks,
>>
>>         tglx
>
>
>

^ permalink raw reply

* Re: [PATCH iproute2] iproute2: fix use-after-free
From: Mahesh Bandewar (महेश बंडेवार) @ 2018-09-13  6:07 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Mahesh Bandewar, netdev
In-Reply-To: <20180912173320.68048381@xeon-e3>

On Wed, Sep 12, 2018 at 5:33 PM, Stephen Hemminger
<stephen@networkplumber.org> wrote:
>
> On Wed, 12 Sep 2018 16:29:28 -0700
> Mahesh Bandewar <mahesh@bandewar.net> wrote:
>
> > From: Mahesh Bandewar <maheshb@google.com>
> >
> > A local program using iproute2 lib pointed out the issue and looking
> > at the code it is pretty obvious -
> >
> >     a = (struct nlmsghdr *)b;
> >     ...
> >     free(b);
> >     if (a->nlmsg_seq == seq)
> >     ...
> >
> > Fixes: 86bf43c7c2fd ("lib/libnetlink: update rtnl_talk to support malloc buff at run time")
> > Signed-off-by: Mahesh Bandewar <maheshb@google.com>
>
> Yes, this is a real problem.
>
> Maybe a minimal patch like this would be enough:
actually that will leave the memory leak at the 'goto next' line (just
few lines below) since that jump will overwrite the buf.

> diff --git a/lib/libnetlink.c b/lib/libnetlink.c
> index 928de1dd16d8..ab2d8452e4a1 100644
> --- a/lib/libnetlink.c
> +++ b/lib/libnetlink.c
> @@ -661,6 +661,8 @@ next:
>                                 if (l < sizeof(struct nlmsgerr)) {
>                                         fprintf(stderr, "ERROR truncated\n");
>                                 } else if (!err->error) {
> +                                       __u32 err_seq = h->nlmsg_seq;
> +
>                                         /* check messages from kernel */
>                                         nl_dump_ext_ack(h, errfn);
>
> @@ -668,7 +670,8 @@ next:
>                                                 *answer = (struct nlmsghdr *)buf;
>                                         else
>                                                 free(buf);
> -                                       if (h->nlmsg_seq == seq)
> +
> +                                       if (err_seq == seq)
>                                                 return 0;
>                                         else if (i < iovlen)
>                                                 goto next;
>

^ permalink raw reply

* Re: [PATCH 1/2] netlink: add NLA_REJECT policy type
From: Johannes Berg @ 2018-09-13 11:25 UTC (permalink / raw)
  To: Michal Kubecek
  Cc: linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20180913104955.GE29691-OEaqT8BN2ewCVLCxKZUutA@public.gmane.org>

On Thu, 2018-09-13 at 12:49 +0200, Michal Kubecek wrote:

> >  		if (type > 0 && type <= maxtype) {
> >  			if (policy) {
> > -				err = validate_nla(nla, maxtype, policy);
> > +				err = validate_nla(nla, maxtype, policy,
> > +						   extack);
> >  				if (err < 0) {
> > -					NL_SET_ERR_MSG_ATTR(extack, nla,
> > -							    "Attribute failed policy validation");
> > +					NL_SET_BAD_ATTR(extack, nla);
> > +					if (extack && !extack->_msg)
> > +						NL_SET_ERR_MSG(extack,
> > +							       "Attribute failed policy validation");
> >  					goto errout;
> >  				}
> >  			}
> > -- 
> 
> Technically, this would change the outcome when nla_parse() is called
> with extack->_msg already set nad validate_nla() fails on something else
> than NLA_REJECT; it will preserve the previous message in such case.
> But I don't think this is a serious problem.

Yes, that's true. I looked at quite a few of the setters just now (there
are ~500 already, wow!), and they all set & return, so this shouldn't be
an issue.

johannes

^ permalink raw reply

* Re: [PATCH 2/2] netlink: add ethernet address policy types
From: Michal Kubecek @ 2018-09-13 11:58 UTC (permalink / raw)
  To: Johannes Berg
  Cc: linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, Johannes Berg
In-Reply-To: <20180913084603.7979-2-johannes-cdvu00un1VgdHxzADdlk8Q@public.gmane.org>

On Thu, Sep 13, 2018 at 10:46:03AM +0200, Johannes Berg wrote:
> From: Johannes Berg <johannes.berg-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> 
> Commonly, ethernet addresses are just using a policy of
> 	{ .len = ETH_ALEN }
> which leaves userspace free to send more data than it should,
> which may hide bugs.
> 
> Introduce NLA_ETH_ADDR which checks for exact size, and rejects
> the attribute if the length isn't ETH_ALEN.
> 
> Also add NLA_ETH_ADDR_COMPAT which can be used in place of the
> pure length policy, but will, in addition, trigger a netlink
> attribute warning if the passed value is too long.
> 
> Signed-off-by: Johannes Berg <johannes.berg-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> ---

The code looks correct to me but I have some doubts. Having a special
policy for MAC addresses may lead to adding one for IPv4 address (maybe
not, we can use NLA_U32 for them), IPv6 addresses and other data types
with fixed length. Wouldn't it be more helpful to add a variant of
NLA_BINARY (NLA_BINARY_EXACT?) which would fail/warn if attribute length
isn't equal to .len?

Michal Kubecek


>  include/net/netlink.h | 4 ++++
>  lib/nlattr.c          | 8 ++++++++
>  2 files changed, 12 insertions(+)
> 
> diff --git a/include/net/netlink.h b/include/net/netlink.h
> index 04e40fcc70d6..1139163c0db0 100644
> --- a/include/net/netlink.h
> +++ b/include/net/netlink.h
> @@ -181,6 +181,8 @@ enum {
>  	NLA_S64,
>  	NLA_BITFIELD32,
>  	NLA_REJECT,
> +	NLA_ETH_ADDR,
> +	NLA_ETH_ADDR_COMPAT,
>  	__NLA_TYPE_MAX,
>  };
>  
> @@ -213,6 +215,8 @@ enum {
>   *                         data must point to a u32 value of valid flags
>   *    NLA_REJECT           Reject this attribute, validation data may point
>   *                         to a string to report as the error in extended ACK.
> + *    NLA_ETH_ADDR         Ethernet address, rejected if not exactly 6 octets.
> + *    NLA_ETH_ADDR_COMPAT  Ethernet address, only warns if not exactly 6 octets.
>   *    All other            Minimum length of attribute payload
>   *
>   * Example:
> diff --git a/lib/nlattr.c b/lib/nlattr.c
> index 56e0aae5cf23..b7c519f6e12a 100644
> --- a/lib/nlattr.c
> +++ b/lib/nlattr.c
> @@ -29,6 +29,8 @@ static const u8 nla_attr_len[NLA_TYPE_MAX+1] = {
>  	[NLA_S16]	= sizeof(s16),
>  	[NLA_S32]	= sizeof(s32),
>  	[NLA_S64]	= sizeof(s64),
> +	[NLA_ETH_ADDR]	= ETH_ALEN,
> +	[NLA_ETH_ADDR_COMPAT] = ETH_ALEN,
>  };
>  
>  static const u8 nla_attr_minlen[NLA_TYPE_MAX+1] = {
> @@ -42,6 +44,7 @@ static const u8 nla_attr_minlen[NLA_TYPE_MAX+1] = {
>  	[NLA_S16]	= sizeof(s16),
>  	[NLA_S32]	= sizeof(s32),
>  	[NLA_S64]	= sizeof(s64),
> +	[NLA_ETH_ADDR_COMPAT] = ETH_ALEN,
>  };
>  
>  static int validate_nla_bitfield32(const struct nlattr *nla,
> @@ -93,6 +96,11 @@ static int validate_nla(const struct nlattr *nla, int maxtype,
>  			extack->_msg = pt->validation_data;
>  		return -EINVAL;
>  
> +	case NLA_ETH_ADDR:
> +		if (attrlen != ETH_ALEN)
> +			return -ERANGE;
> +		break;
> +
>  	case NLA_FLAG:
>  		if (attrlen > 0)
>  			return -ERANGE;
> -- 
> 2.14.4
> 

^ permalink raw reply

* Re: [PATCH net-next] NFC: nci: remove redundant variable 'status'
From: YueHaibing @ 2018-09-13 12:06 UTC (permalink / raw)
  To: davem, sameo; +Cc: linux-kernel, netdev
In-Reply-To: <20180801072020.14696-1-yuehaibing@huawei.com>

Ping.

On 2018/8/1 15:20, YueHaibing wrote:
> After commit d8cd37ed2fc8 ("NFC: nci: Fix improper management of HCI return code")
> variable 'status' is being assigned but never used,
> so can be removed. Also make a trival cleanup.
> 
> Signed-off-by: YueHaibing <yuehaibing@huawei.com>
> ---
>  net/nfc/nci/hci.c | 10 ++--------
>  1 file changed, 2 insertions(+), 8 deletions(-)
> 
> diff --git a/net/nfc/nci/hci.c b/net/nfc/nci/hci.c
> index ddfc52a..4debba8 100644
> --- a/net/nfc/nci/hci.c
> +++ b/net/nfc/nci/hci.c
> @@ -370,17 +370,11 @@ static void nci_hci_resp_received(struct nci_dev *ndev, u8 pipe,
>  				  u8 result, struct sk_buff *skb)
>  {
>  	struct nci_conn_info    *conn_info;
> -	u8 status = result;
>  
>  	conn_info = ndev->hci_dev->conn_info;
> -	if (!conn_info) {
> -		status = NCI_STATUS_REJECTED;
> -		goto exit;
> -	}
> -
> -	conn_info->rx_skb = skb;
> +	if (conn_info)
> +		conn_info->rx_skb = skb;
>  
> -exit:
>  	nci_req_complete(ndev, NCI_STATUS_OK);
>  }
>  
> 

^ permalink raw reply

* Re: [PATCH 2/2] netlink: add ethernet address policy types
From: Michal Kubecek @ 2018-09-13 12:24 UTC (permalink / raw)
  To: Johannes Berg
  Cc: linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1536840966.4160.6.camel-cdvu00un1VgdHxzADdlk8Q@public.gmane.org>

On Thu, Sep 13, 2018 at 02:16:06PM +0200, Johannes Berg wrote:
> On Thu, 2018-09-13 at 14:12 +0200, Michal Kubecek wrote:
> > On Thu, Sep 13, 2018 at 02:02:53PM +0200, Johannes Berg wrote:
> > > On Thu, 2018-09-13 at 13:58 +0200, Michal Kubecek wrote:
> > > 
> > > > The code looks correct to me but I have some doubts. Having a special
> > > > policy for MAC addresses may lead to adding one for IPv4 address (maybe
> > > > not, we can use NLA_U32 for them), IPv6 addresses and other data types
> > > > with fixed length. Wouldn't it be more helpful to add a variant of
> > > > NLA_BINARY (NLA_BINARY_EXACT?) which would fail/warn if attribute length
> > > > isn't equal to .len?
> > > 
> > > Yeah, I guess we could do that, and then
> > > 
> > > #define NLA_ETH_ADDR .len = ETH_ALEN, .type = NLA_BINARY_EXACT
> > > #define NLA_IP6_ADDR .len = 16, .type = NLA_BINARY_EXACT
> > > 
> > > or so?
> > 
> > Maybe rather
> > 
> >   #define NLA_ETH_ADDR NLA_BINARY_EXACT, .len = ETH_ALEN
> >   #define NLA_IP6_ADDR NLA_BINARY_EXACT, .len = sizeof(struct in6_addr)
> > 
> > so that one could write
> > 
> >   { .type = NLA_ETH_ADDR }
> 
> Yeah, that's possible. I considered it for a second, but it was slightly
> too magical for my taste :-)
> 
> Better pick a different "namespace", perhaps NLA_POLICY_ETH_ADDR or so?

Right, that sounds better. I'm afraid anything too tricky would
inevitably trick people into using it in an unexpected way and ending up
with very confusing error messages.

Michal Kubecek

^ permalink raw reply

* Re: [PATCH stable 4.4 0/9] fix SegmentSmack in stable branch (CVE-2018-5390)
From: Greg KH @ 2018-09-13 12:32 UTC (permalink / raw)
  To: Michal Kubecek
  Cc: maowenan, dwmw2, netdev, eric.dumazet, edumazet, davem, ycheng,
	jdw, stable, Takashi Iwai
In-Reply-To: <20180816152409.GK10648@kroah.com>

On Thu, Aug 16, 2018 at 05:24:09PM +0200, Greg KH wrote:
> On Thu, Aug 16, 2018 at 02:33:56PM +0200, Michal Kubecek wrote:
> > On Thu, Aug 16, 2018 at 08:05:50PM +0800, maowenan wrote:
> > > On 2018/8/16 19:39, Michal Kubecek wrote:
> > > > 
> > > > I suspect you may be doing something wrong with your tests. I checked
> > > > the segmentsmack testcase and the CPU utilization on receiving side
> > > > (with sending 10 times as many packets as default) went down from ~100%
> > > > to ~3% even when comparing what is in stable 4.4 now against older 4.4
> > > > kernel.
> > > 
> > > There seems no obvious problem when you send packets with default
> > > parameter in Segmentsmack POC, Which is also very related with your
> > > server's hardware configuration. Please try with below parameter to
> > > form OFO packets
> > 
> > I did and even with these (questionable, see below) changes, I did not
> > get more than 10% (of one core) by receiving ksoftirqd.
> > 
> > >       for (i = 0; i < 1024; i++)      // 128->1024
> > ...
> > >       usleep(10*1000); // Adjust this and packet count to match the target!, sleep 100ms->10ms
> > 
> > The comment in the testcase source suggests to do _one_ of these two
> > changes so that you generate 10 times as many packets as the original
> > testcase. You did both so that you end up sending 102400 packets per
> > second. With 55 byte long packets, this kind of attack requires at least
> > 5.5 MB/s (44 Mb/s) of throughput. This is no longer a "low packet rate
> > DoS", I'm afraid.
> > 
> > Anyway, even at this rate, I only get ~10% of one core (Intel E5-2697).
> > 
> > What I can see, though, is that with current stable 4.4 code, modified
> > testcase which sends something like
> > 
> >   2:3, 3:4, ..., 3001:3002, 3003:3004, 3004:3005, ... 6001:6002, ...
> > 
> > I quickly eat 6 MB of memory for receive queue of one socket while
> > earlier 4.4 kernels only take 200-300 KB. I didn't test latest 4.4 with
> > Takashi's follow-up yet but I'm pretty sure it will help while
> > preserving nice performance when using the original segmentsmack
> > testcase (with increased packet ratio).
> 
> Ok, for now I've applied Takashi's fix to the 4.4 stable queue and will
> push out a new 4.4-rc later tonight.  Can everyone standardize on that
> and test and let me know if it does, or does not, fix the reported
> issues?
> 
> If not, we can go from there and evaluate this much larger patch series.
> But let's try the simple thing first.

So, is the issue still present on the latest 4.4 release?  Has anyone
tested it?  If not, I'm more than willing to look at backported patches,
but I want to ensure that they really are needed here.

thanks,

greg k-h

^ permalink raw reply

* Re: [PATCH stable 4.4 0/9] fix SegmentSmack in stable branch (CVE-2018-5390)
From: Eric Dumazet @ 2018-09-13 12:44 UTC (permalink / raw)
  To: gregkh
  Cc: mkubecek, maowenan, David Woodhouse, netdev, Eric Dumazet,
	David Miller, Yuchung Cheng, jdw, stable, tiwai
In-Reply-To: <20180913123238.GI2268@kroah.com>

On Thu, Sep 13, 2018 at 5:32 AM Greg KH <gregkh@linux-foundation.org> wrote:
>
> On Thu, Aug 16, 2018 at 05:24:09PM +0200, Greg KH wrote:
> > On Thu, Aug 16, 2018 at 02:33:56PM +0200, Michal Kubecek wrote:
> > > On Thu, Aug 16, 2018 at 08:05:50PM +0800, maowenan wrote:
> > > > On 2018/8/16 19:39, Michal Kubecek wrote:
> > > > >
> > > > > I suspect you may be doing something wrong with your tests. I checked
> > > > > the segmentsmack testcase and the CPU utilization on receiving side
> > > > > (with sending 10 times as many packets as default) went down from ~100%
> > > > > to ~3% even when comparing what is in stable 4.4 now against older 4.4
> > > > > kernel.
> > > >
> > > > There seems no obvious problem when you send packets with default
> > > > parameter in Segmentsmack POC, Which is also very related with your
> > > > server's hardware configuration. Please try with below parameter to
> > > > form OFO packets
> > >
> > > I did and even with these (questionable, see below) changes, I did not
> > > get more than 10% (of one core) by receiving ksoftirqd.
> > >
> > > >       for (i = 0; i < 1024; i++)      // 128->1024
> > > ...
> > > >       usleep(10*1000); // Adjust this and packet count to match the target!, sleep 100ms->10ms
> > >
> > > The comment in the testcase source suggests to do _one_ of these two
> > > changes so that you generate 10 times as many packets as the original
> > > testcase. You did both so that you end up sending 102400 packets per
> > > second. With 55 byte long packets, this kind of attack requires at least
> > > 5.5 MB/s (44 Mb/s) of throughput. This is no longer a "low packet rate
> > > DoS", I'm afraid.
> > >
> > > Anyway, even at this rate, I only get ~10% of one core (Intel E5-2697).
> > >
> > > What I can see, though, is that with current stable 4.4 code, modified
> > > testcase which sends something like
> > >
> > >   2:3, 3:4, ..., 3001:3002, 3003:3004, 3004:3005, ... 6001:6002, ...
> > >
> > > I quickly eat 6 MB of memory for receive queue of one socket while
> > > earlier 4.4 kernels only take 200-300 KB. I didn't test latest 4.4 with
> > > Takashi's follow-up yet but I'm pretty sure it will help while
> > > preserving nice performance when using the original segmentsmack
> > > testcase (with increased packet ratio).
> >
> > Ok, for now I've applied Takashi's fix to the 4.4 stable queue and will
> > push out a new 4.4-rc later tonight.  Can everyone standardize on that
> > and test and let me know if it does, or does not, fix the reported
> > issues?
> >
> > If not, we can go from there and evaluate this much larger patch series.
> > But let's try the simple thing first.
>
> So, is the issue still present on the latest 4.4 release?  Has anyone
> tested it?  If not, I'm more than willing to look at backported patches,
> but I want to ensure that they really are needed here.
>
> thanks,

Honestly, TCP stack without rb-tree for the OOO queue is vulnerable,
even with non malicious sender,
but with big enough TCP receive window and a not favorable network.

So a malicious peer can definitely send packets needed to make TCP
stack behave in O(N), which is pretty bad if N is big...

9f5afeae51526b3ad7b7cb21ee8b145ce6ea7a7a ("tcp: use an RB tree for ooo
receive queue")
was proven to be almost bug free [1], and should be backported if possible.

[1] bug fixed :
76f0dcbb5ae1a7c3dbeec13dd98233b8e6b0b32a tcp: fix a stale ooo_last_skb
after a replace

^ permalink raw reply

* Re: [PATCH 2/2] netlink: add ethernet address policy types
From: Johannes Berg @ 2018-09-13 12:46 UTC (permalink / raw)
  To: Michal Kubecek
  Cc: linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20180913122412.GI29691-OEaqT8BN2ewCVLCxKZUutA@public.gmane.org>

On Thu, 2018-09-13 at 14:24 +0200, Michal Kubecek wrote:
> On Thu, Sep 13, 2018 at 02:16:06PM +0200, Johannes Berg wrote:
> > On Thu, 2018-09-13 at 14:12 +0200, Michal Kubecek wrote:
> > > On Thu, Sep 13, 2018 at 02:02:53PM +0200, Johannes Berg wrote:
> > > > On Thu, 2018-09-13 at 13:58 +0200, Michal Kubecek wrote:
> > > > 
> > > > > The code looks correct to me but I have some doubts. Having a special
> > > > > policy for MAC addresses may lead to adding one for IPv4 address (maybe
> > > > > not, we can use NLA_U32 for them), IPv6 addresses and other data types
> > > > > with fixed length. Wouldn't it be more helpful to add a variant of
> > > > > NLA_BINARY (NLA_BINARY_EXACT?) which would fail/warn if attribute length
> > > > > isn't equal to .len?
> > > > 
> > > > Yeah, I guess we could do that, and then
> > > > 
> > > > #define NLA_ETH_ADDR .len = ETH_ALEN, .type = NLA_BINARY_EXACT
> > > > #define NLA_IP6_ADDR .len = 16, .type = NLA_BINARY_EXACT
> > > > 
> > > > or so?
> > > 
> > > Maybe rather
> > > 
> > >   #define NLA_ETH_ADDR NLA_BINARY_EXACT, .len = ETH_ALEN
> > >   #define NLA_IP6_ADDR NLA_BINARY_EXACT, .len = sizeof(struct in6_addr)
> > > 
> > > so that one could write
> > > 
> > >   { .type = NLA_ETH_ADDR }
> > 
> > Yeah, that's possible. I considered it for a second, but it was slightly
> > too magical for my taste :-)
> > 
> > Better pick a different "namespace", perhaps NLA_POLICY_ETH_ADDR or so?
> 
> Right, that sounds better. I'm afraid anything too tricky would
> inevitably trick people into using it in an unexpected way and ending up
> with very confusing error messages.

Right.

Then again though, we already have NLA_MSECS which is basically
equivalent to NLA_U64 afaict, so why not have more types?

It doesn't really cost us that much, just a few bytes in the validation?

Also, with .type = NLA_ETH_ADDR_COMPAT we could get a warning, which is
not true for just checking .len.

OTOH, you could argue that adding two types for ethernet addresses, two
for IPv6 addresses, and possibly more quickly adds up to make that "just
a few bytes" matter ...

johannes

^ permalink raw reply

* Re: [PATCH v3] PCI: Reprogram bridge prefetch registers on resume
From: Peter Wu @ 2018-09-13  7:43 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Daniel Drake, Bjorn Helgaas, Linux PCI, Linux Upstreaming Team,
	nouveau, Linux PM, kherbst, Rafael Wysocki, Keith Busch,
	jonathan.derrick, Thomas Martitz, David Miller, Heiner Kallweit,
	netdev, nic_swsd, rchang
In-Reply-To: <CAJZ5v0iBt+mYy4K9YNavqyQkivNqwBp3zyK5QmqGR+kWGsRO4g@mail.gmail.com>

On Thu, Sep 13, 2018 at 08:52:29AM +0200, Rafael J. Wysocki wrote:
> On Thu, Sep 13, 2018 at 5:37 AM Daniel Drake <drake@endlessm.com> wrote:
> >
> > On 38+ Intel-based Asus products, the nvidia GPU becomes unusable
> > after S3 suspend/resume. The affected products include multiple
> > generations of nvidia GPUs and Intel SoCs. After resume, nouveau logs
> > many errors such as:
> >
> >     fifo: fault 00 [READ] at 0000005555555000 engine 00 [GR] client 04
> >           [HUB/FE] reason 4a [] on channel -1 [007fa91000 unknown]
> >     DRM: failed to idle channel 0 [DRM]
> >
> > Similarly, the nvidia proprietary driver also fails after resume
> > (black screen, 100% CPU usage in Xorg process). We shipped a sample
> > to Nvidia for diagnosis, and their response indicated that it's a
> > problem with the parent PCI bridge (on the Intel SoC), not the GPU.
> >
> > Runtime suspend/resume works fine, only S3 suspend is affected.
> >
> > We found a workaround: on resume, rewrite the Intel PCI bridge
> > 'Prefetchable Base Upper 32 Bits' register (PCI_PREF_BASE_UPPER32). In
> > the cases that I checked, this register has value 0 and we just have to
> > rewrite that value.
> >
> > Linux already saves and restores PCI config space during suspend/resume,
> > but this register was being skipped because upon resume, it already
> > has value 0 (the correct, pre-suspend value).
> >
> > Intel appear to have previously acknowledged this behaviour and the
> > requirement to rewrite this register.
> > https://bugzilla.kernel.org/show_bug.cgi?id=116851#c23
> >
> > Based on that, rewrite the prefetch register values even when that
> > appears unnecessary.
> >
> > We have confirmed this solution on all the affected models we have
> > in-hands (X542UQ, UX533FD, X530UN, V272UN).
> >
> > Additionally, this solves an issue where r8169 MSI-X interrupts were
> > broken after S3 suspend/resume on Asus X441UAR. This issue was recently
> > worked around in commit 7bb05b85bc2d ("r8169: don't use MSI-X on
> > RTL8106e"). It also fixes the same issue on RTL6186evl/8111evl on an
> > Aimfor-tech laptop that we had not yet patched. I suspect it will also
> > fix the issue that was worked around in commit 7c53a722459c ("r8169:
> > don't use MSI-X on RTL8168g").
> >
> > Thomas Martitz reports that this change also solves an issue where
> > the AMD Radeon Polaris 10 GPU on the HP Zbook 14u G5 is unresponsive
> > after S3 suspend/resume.
> >
> > Link: https://bugzilla.kernel.org/show_bug.cgi?id=201069
> > Signed-off-by: Daniel Drake <drake@endlessm.com>
> 
> Still
> 
> Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Reviewed-By: Peter Wu <peter@lekensteyn.nl>

> > ---
> >  drivers/pci/pci.c | 25 +++++++++++++++++--------
> >  1 file changed, 17 insertions(+), 8 deletions(-)
> >
> > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > index 29ff9619b5fa..5d58220b6997 100644
> > --- a/drivers/pci/pci.c
> > +++ b/drivers/pci/pci.c
> > @@ -1289,12 +1289,12 @@ int pci_save_state(struct pci_dev *dev)
> >  EXPORT_SYMBOL(pci_save_state);
> >
> >  static void pci_restore_config_dword(struct pci_dev *pdev, int offset,
> > -                                    u32 saved_val, int retry)
> > +                                    u32 saved_val, int retry, bool force)
> >  {
> >         u32 val;
> >
> >         pci_read_config_dword(pdev, offset, &val);
> > -       if (val == saved_val)
> > +       if (!force && val == saved_val)
> >                 return;
> >
> >         for (;;) {
> > @@ -1313,25 +1313,34 @@ static void pci_restore_config_dword(struct pci_dev *pdev, int offset,
> >  }
> >
> >  static void pci_restore_config_space_range(struct pci_dev *pdev,
> > -                                          int start, int end, int retry)
> > +                                          int start, int end, int retry,
> > +                                          bool force)
> >  {
> >         int index;
> >
> >         for (index = end; index >= start; index--)
> >                 pci_restore_config_dword(pdev, 4 * index,
> >                                          pdev->saved_config_space[index],
> > -                                        retry);
> > +                                        retry, force);
> >  }
> >
> >  static void pci_restore_config_space(struct pci_dev *pdev)
> >  {
> >         if (pdev->hdr_type == PCI_HEADER_TYPE_NORMAL) {
> > -               pci_restore_config_space_range(pdev, 10, 15, 0);
> > +               pci_restore_config_space_range(pdev, 10, 15, 0, false);
> >                 /* Restore BARs before the command register. */
> > -               pci_restore_config_space_range(pdev, 4, 9, 10);
> > -               pci_restore_config_space_range(pdev, 0, 3, 0);
> > +               pci_restore_config_space_range(pdev, 4, 9, 10, false);
> > +               pci_restore_config_space_range(pdev, 0, 3, 0, false);
> > +       } else if (pdev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
> > +               pci_restore_config_space_range(pdev, 12, 15, 0, false);
> > +               /* Force rewriting of prefetch registers to avoid
> > +                * S3 resume issues on Intel PCI bridges that occur when
> > +                * these registers are not explicitly written.
> > +                */
> > +               pci_restore_config_space_range(pdev, 9, 11, 0, true);
> > +               pci_restore_config_space_range(pdev, 0, 8, 0, false);
> >         } else {
> > -               pci_restore_config_space_range(pdev, 0, 15, 0);
> > +               pci_restore_config_space_range(pdev, 0, 15, 0, false);
> >         }
> >  }
> >
> > --
> > 2.17.1
> >

^ permalink raw reply

* [PATCH net-next v3 0/2] net: stmmac: Coalesce and tail addr fixes
From: Jose Abreu @ 2018-09-13  8:02 UTC (permalink / raw)
  To: netdev
  Cc: Jose Abreu, Florian Fainelli, Neil Armstrong, Jerome Brunet,
	Martin Blumenstingl, David S. Miller, Joao Pinto,
	Giuseppe Cavallaro, Alexandre Torgue

The fix for coalesce timer and a fix in tail address setting that impacts
XGMAC2 operation.

Cc: Florian Fainelli <f.fainelli@gmail.com>
Cc: Neil Armstrong <narmstrong@baylibre.com>
Cc: Jerome Brunet <jbrunet@baylibre.com>
Cc: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Joao Pinto <jpinto@synopsys.com>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>

Jose Abreu (2):
  net: stmmac: Rework coalesce timer and fix multi-queue races
  net: stmmac: Fixup the tail addr setting in xmit path

 drivers/net/ethernet/stmicro/stmmac/common.h      |   4 +-
 drivers/net/ethernet/stmicro/stmmac/stmmac.h      |  14 +-
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 238 ++++++++++++----------
 include/linux/stmmac.h                            |   1 +
 4 files changed, 149 insertions(+), 108 deletions(-)

-- 
2.7.4

^ permalink raw reply

* [PATCH net-next v3 1/2] net: stmmac: Rework coalesce timer and fix multi-queue races
From: Jose Abreu @ 2018-09-13  8:02 UTC (permalink / raw)
  To: netdev
  Cc: Jose Abreu, Florian Fainelli, Neil Armstrong, Jerome Brunet,
	Martin Blumenstingl, David S. Miller, Joao Pinto,
	Giuseppe Cavallaro, Alexandre Torgue
In-Reply-To: <cover.1536762575.git.joabreu@synopsys.com>

This follows David Miller advice and tries to fix coalesce timer in
multi-queue scenarios.

We are now using per-queue coalesce values and per-queue TX timer.

Coalesce timer default values was changed to 1ms and the coalesce frames
to 25.

Tested in B2B setup between XGMAC2 and GMAC5.

Signed-off-by: Jose Abreu <joabreu@synopsys.com>
Cc: Florian Fainelli <f.fainelli@gmail.com>
Cc: Neil Armstrong <narmstrong@baylibre.com>
Cc: Jerome Brunet <jbrunet@baylibre.com>
Cc: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Joao Pinto <jpinto@synopsys.com>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>
---
 drivers/net/ethernet/stmicro/stmmac/common.h      |   4 +-
 drivers/net/ethernet/stmicro/stmmac/stmmac.h      |  14 +-
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 233 ++++++++++++----------
 include/linux/stmmac.h                            |   1 +
 4 files changed, 146 insertions(+), 106 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h b/drivers/net/ethernet/stmicro/stmmac/common.h
index 1854f270ad66..b1b305f8f414 100644
--- a/drivers/net/ethernet/stmicro/stmmac/common.h
+++ b/drivers/net/ethernet/stmicro/stmmac/common.h
@@ -258,10 +258,10 @@ struct stmmac_safety_stats {
 #define MAX_DMA_RIWT		0xff
 #define MIN_DMA_RIWT		0x20
 /* Tx coalesce parameters */
-#define STMMAC_COAL_TX_TIMER	40000
+#define STMMAC_COAL_TX_TIMER	1000
 #define STMMAC_MAX_COAL_TX_TICK	100000
 #define STMMAC_TX_MAX_FRAMES	256
-#define STMMAC_TX_FRAMES	64
+#define STMMAC_TX_FRAMES	25
 
 /* Packets types */
 enum packets_types {
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac.h b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
index c0a855b7ab3b..63e1064b27a2 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac.h
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
@@ -48,6 +48,8 @@ struct stmmac_tx_info {
 
 /* Frequently used values are kept adjacent for cache effect */
 struct stmmac_tx_queue {
+	u32 tx_count_frames;
+	struct timer_list txtimer;
 	u32 queue_index;
 	struct stmmac_priv *priv_data;
 	struct dma_extended_desc *dma_etx ____cacheline_aligned_in_smp;
@@ -73,7 +75,14 @@ struct stmmac_rx_queue {
 	u32 rx_zeroc_thresh;
 	dma_addr_t dma_rx_phy;
 	u32 rx_tail_addr;
+};
+
+struct stmmac_channel {
 	struct napi_struct napi ____cacheline_aligned_in_smp;
+	struct stmmac_priv *priv_data;
+	u32 index;
+	int has_rx;
+	int has_tx;
 };
 
 struct stmmac_tc_entry {
@@ -109,14 +118,12 @@ struct stmmac_pps_cfg {
 
 struct stmmac_priv {
 	/* Frequently used values are kept adjacent for cache effect */
-	u32 tx_count_frames;
 	u32 tx_coal_frames;
 	u32 tx_coal_timer;
 
 	int tx_coalesce;
 	int hwts_tx_en;
 	bool tx_path_in_lpi_mode;
-	struct timer_list txtimer;
 	bool tso;
 
 	unsigned int dma_buf_sz;
@@ -137,6 +144,9 @@ struct stmmac_priv {
 	/* TX Queue */
 	struct stmmac_tx_queue tx_queue[MTL_MAX_TX_QUEUES];
 
+	/* Generic channel for NAPI */
+	struct stmmac_channel channel[STMMAC_CH_MAX];
+
 	bool oldlink;
 	int speed;
 	int oldduplex;
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 9f458bb16f2a..ab9cc0143ff2 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -148,12 +148,14 @@ static void stmmac_verify_args(void)
 static void stmmac_disable_all_queues(struct stmmac_priv *priv)
 {
 	u32 rx_queues_cnt = priv->plat->rx_queues_to_use;
+	u32 tx_queues_cnt = priv->plat->tx_queues_to_use;
+	u32 maxq = max(rx_queues_cnt, tx_queues_cnt);
 	u32 queue;
 
-	for (queue = 0; queue < rx_queues_cnt; queue++) {
-		struct stmmac_rx_queue *rx_q = &priv->rx_queue[queue];
+	for (queue = 0; queue < maxq; queue++) {
+		struct stmmac_channel *ch = &priv->channel[queue];
 
-		napi_disable(&rx_q->napi);
+		napi_disable(&ch->napi);
 	}
 }
 
@@ -164,12 +166,14 @@ static void stmmac_disable_all_queues(struct stmmac_priv *priv)
 static void stmmac_enable_all_queues(struct stmmac_priv *priv)
 {
 	u32 rx_queues_cnt = priv->plat->rx_queues_to_use;
+	u32 tx_queues_cnt = priv->plat->tx_queues_to_use;
+	u32 maxq = max(rx_queues_cnt, tx_queues_cnt);
 	u32 queue;
 
-	for (queue = 0; queue < rx_queues_cnt; queue++) {
-		struct stmmac_rx_queue *rx_q = &priv->rx_queue[queue];
+	for (queue = 0; queue < maxq; queue++) {
+		struct stmmac_channel *ch = &priv->channel[queue];
 
-		napi_enable(&rx_q->napi);
+		napi_enable(&ch->napi);
 	}
 }
 
@@ -1843,18 +1847,18 @@ static void stmmac_dma_operation_mode(struct stmmac_priv *priv)
  * @queue: TX queue index
  * Description: it reclaims the transmit resources after transmission completes.
  */
-static void stmmac_tx_clean(struct stmmac_priv *priv, u32 queue)
+static int stmmac_tx_clean(struct stmmac_priv *priv, int budget, u32 queue)
 {
 	struct stmmac_tx_queue *tx_q = &priv->tx_queue[queue];
 	unsigned int bytes_compl = 0, pkts_compl = 0;
-	unsigned int entry;
+	unsigned int entry, count = 0;
 
-	netif_tx_lock(priv->dev);
+	__netif_tx_lock_bh(netdev_get_tx_queue(priv->dev, queue));
 
 	priv->xstats.tx_clean++;
 
 	entry = tx_q->dirty_tx;
-	while (entry != tx_q->cur_tx) {
+	while ((entry != tx_q->cur_tx) && (count < budget)) {
 		struct sk_buff *skb = tx_q->tx_skbuff[entry];
 		struct dma_desc *p;
 		int status;
@@ -1870,6 +1874,8 @@ static void stmmac_tx_clean(struct stmmac_priv *priv, u32 queue)
 		if (unlikely(status & tx_dma_own))
 			break;
 
+		count++;
+
 		/* Make sure descriptor fields are read after reading
 		 * the own bit.
 		 */
@@ -1937,7 +1943,10 @@ static void stmmac_tx_clean(struct stmmac_priv *priv, u32 queue)
 		stmmac_enable_eee_mode(priv);
 		mod_timer(&priv->eee_ctrl_timer, STMMAC_LPI_T(eee_timer));
 	}
-	netif_tx_unlock(priv->dev);
+
+	__netif_tx_unlock_bh(netdev_get_tx_queue(priv->dev, queue));
+
+	return count;
 }
 
 /**
@@ -2020,6 +2029,33 @@ static bool stmmac_safety_feat_interrupt(struct stmmac_priv *priv)
 	return false;
 }
 
+static int stmmac_napi_check(struct stmmac_priv *priv, u32 chan)
+{
+	int status = stmmac_dma_interrupt_status(priv, priv->ioaddr,
+						 &priv->xstats, chan);
+	struct stmmac_channel *ch = &priv->channel[chan];
+	bool needs_work = false;
+
+	if ((status & handle_rx) && ch->has_rx) {
+		needs_work = true;
+	} else {
+		status &= ~handle_rx;
+	}
+
+	if ((status & handle_tx) && ch->has_tx) {
+		needs_work = true;
+	} else {
+		status &= ~handle_tx;
+	}
+
+	if (needs_work && napi_schedule_prep(&ch->napi)) {
+		stmmac_disable_dma_irq(priv, priv->ioaddr, chan);
+		__napi_schedule(&ch->napi);
+	}
+
+	return status;
+}
+
 /**
  * stmmac_dma_interrupt - DMA ISR
  * @priv: driver private structure
@@ -2034,57 +2070,14 @@ static void stmmac_dma_interrupt(struct stmmac_priv *priv)
 	u32 channels_to_check = tx_channel_count > rx_channel_count ?
 				tx_channel_count : rx_channel_count;
 	u32 chan;
-	bool poll_scheduled = false;
 	int status[max_t(u32, MTL_MAX_TX_QUEUES, MTL_MAX_RX_QUEUES)];
 
 	/* Make sure we never check beyond our status buffer. */
 	if (WARN_ON_ONCE(channels_to_check > ARRAY_SIZE(status)))
 		channels_to_check = ARRAY_SIZE(status);
 
-	/* Each DMA channel can be used for rx and tx simultaneously, yet
-	 * napi_struct is embedded in struct stmmac_rx_queue rather than in a
-	 * stmmac_channel struct.
-	 * Because of this, stmmac_poll currently checks (and possibly wakes)
-	 * all tx queues rather than just a single tx queue.
-	 */
 	for (chan = 0; chan < channels_to_check; chan++)
-		status[chan] = stmmac_dma_interrupt_status(priv, priv->ioaddr,
-				&priv->xstats, chan);
-
-	for (chan = 0; chan < rx_channel_count; chan++) {
-		if (likely(status[chan] & handle_rx)) {
-			struct stmmac_rx_queue *rx_q = &priv->rx_queue[chan];
-
-			if (likely(napi_schedule_prep(&rx_q->napi))) {
-				stmmac_disable_dma_irq(priv, priv->ioaddr, chan);
-				__napi_schedule(&rx_q->napi);
-				poll_scheduled = true;
-			}
-		}
-	}
-
-	/* If we scheduled poll, we already know that tx queues will be checked.
-	 * If we didn't schedule poll, see if any DMA channel (used by tx) has a
-	 * completed transmission, if so, call stmmac_poll (once).
-	 */
-	if (!poll_scheduled) {
-		for (chan = 0; chan < tx_channel_count; chan++) {
-			if (status[chan] & handle_tx) {
-				/* It doesn't matter what rx queue we choose
-				 * here. We use 0 since it always exists.
-				 */
-				struct stmmac_rx_queue *rx_q =
-					&priv->rx_queue[0];
-
-				if (likely(napi_schedule_prep(&rx_q->napi))) {
-					stmmac_disable_dma_irq(priv,
-							priv->ioaddr, chan);
-					__napi_schedule(&rx_q->napi);
-				}
-				break;
-			}
-		}
-	}
+		status[chan] = stmmac_napi_check(priv, chan);
 
 	for (chan = 0; chan < tx_channel_count; chan++) {
 		if (unlikely(status[chan] & tx_hard_error_bump_tc)) {
@@ -2233,6 +2226,13 @@ static int stmmac_init_dma_engine(struct stmmac_priv *priv)
 	return ret;
 }
 
+static void stmmac_tx_timer_arm(struct stmmac_priv *priv, u32 queue)
+{
+	struct stmmac_tx_queue *tx_q = &priv->tx_queue[queue];
+
+	mod_timer(&tx_q->txtimer, STMMAC_COAL_TIMER(priv->tx_coal_timer));
+}
+
 /**
  * stmmac_tx_timer - mitigation sw timer for tx.
  * @data: data pointer
@@ -2241,13 +2241,14 @@ static int stmmac_init_dma_engine(struct stmmac_priv *priv)
  */
 static void stmmac_tx_timer(struct timer_list *t)
 {
-	struct stmmac_priv *priv = from_timer(priv, t, txtimer);
-	u32 tx_queues_count = priv->plat->tx_queues_to_use;
-	u32 queue;
+	struct stmmac_tx_queue *tx_q = from_timer(tx_q, t, txtimer);
+	struct stmmac_priv *priv = tx_q->priv_data;
+	struct stmmac_channel *ch;
+
+	ch = &priv->channel[tx_q->queue_index];
 
-	/* let's scan all the tx queues */
-	for (queue = 0; queue < tx_queues_count; queue++)
-		stmmac_tx_clean(priv, queue);
+	if (likely(napi_schedule_prep(&ch->napi)))
+		__napi_schedule(&ch->napi);
 }
 
 /**
@@ -2260,11 +2261,17 @@ static void stmmac_tx_timer(struct timer_list *t)
  */
 static void stmmac_init_tx_coalesce(struct stmmac_priv *priv)
 {
+	u32 tx_channel_count = priv->plat->tx_queues_to_use;
+	u32 chan;
+
 	priv->tx_coal_frames = STMMAC_TX_FRAMES;
 	priv->tx_coal_timer = STMMAC_COAL_TX_TIMER;
-	timer_setup(&priv->txtimer, stmmac_tx_timer, 0);
-	priv->txtimer.expires = STMMAC_COAL_TIMER(priv->tx_coal_timer);
-	add_timer(&priv->txtimer);
+
+	for (chan = 0; chan < tx_channel_count; chan++) {
+		struct stmmac_tx_queue *tx_q = &priv->tx_queue[chan];
+
+		timer_setup(&tx_q->txtimer, stmmac_tx_timer, 0);
+	}
 }
 
 static void stmmac_set_rings_length(struct stmmac_priv *priv)
@@ -2592,6 +2599,7 @@ static void stmmac_hw_teardown(struct net_device *dev)
 static int stmmac_open(struct net_device *dev)
 {
 	struct stmmac_priv *priv = netdev_priv(dev);
+	u32 chan;
 	int ret;
 
 	stmmac_check_ether_addr(priv);
@@ -2688,7 +2696,9 @@ static int stmmac_open(struct net_device *dev)
 	if (dev->phydev)
 		phy_stop(dev->phydev);
 
-	del_timer_sync(&priv->txtimer);
+	for (chan = 0; chan < priv->plat->tx_queues_to_use; chan++)
+		del_timer_sync(&priv->tx_queue[chan].txtimer);
+
 	stmmac_hw_teardown(dev);
 init_error:
 	free_dma_desc_resources(priv);
@@ -2708,6 +2718,7 @@ static int stmmac_open(struct net_device *dev)
 static int stmmac_release(struct net_device *dev)
 {
 	struct stmmac_priv *priv = netdev_priv(dev);
+	u32 chan;
 
 	if (priv->eee_enabled)
 		del_timer_sync(&priv->eee_ctrl_timer);
@@ -2722,7 +2733,8 @@ static int stmmac_release(struct net_device *dev)
 
 	stmmac_disable_all_queues(priv);
 
-	del_timer_sync(&priv->txtimer);
+	for (chan = 0; chan < priv->plat->tx_queues_to_use; chan++)
+		del_timer_sync(&priv->tx_queue[chan].txtimer);
 
 	/* Free the IRQ lines */
 	free_irq(dev->irq, dev);
@@ -2936,14 +2948,13 @@ static netdev_tx_t stmmac_tso_xmit(struct sk_buff *skb, struct net_device *dev)
 	priv->xstats.tx_tso_nfrags += nfrags;
 
 	/* Manage tx mitigation */
-	priv->tx_count_frames += nfrags + 1;
-	if (likely(priv->tx_coal_frames > priv->tx_count_frames)) {
-		mod_timer(&priv->txtimer,
-			  STMMAC_COAL_TIMER(priv->tx_coal_timer));
-	} else {
-		priv->tx_count_frames = 0;
+	tx_q->tx_count_frames += nfrags + 1;
+	if (priv->tx_coal_frames <= tx_q->tx_count_frames) {
 		stmmac_set_tx_ic(priv, desc);
 		priv->xstats.tx_set_ic_bit++;
+		tx_q->tx_count_frames = 0;
+	} else {
+		stmmac_tx_timer_arm(priv, queue);
 	}
 
 	skb_tx_timestamp(skb);
@@ -3146,14 +3157,13 @@ static netdev_tx_t stmmac_xmit(struct sk_buff *skb, struct net_device *dev)
 	 * This approach takes care about the fragments: desc is the first
 	 * element in case of no SG.
 	 */
-	priv->tx_count_frames += nfrags + 1;
-	if (likely(priv->tx_coal_frames > priv->tx_count_frames)) {
-		mod_timer(&priv->txtimer,
-			  STMMAC_COAL_TIMER(priv->tx_coal_timer));
-	} else {
-		priv->tx_count_frames = 0;
+	tx_q->tx_count_frames += nfrags + 1;
+	if (priv->tx_coal_frames <= tx_q->tx_count_frames) {
 		stmmac_set_tx_ic(priv, desc);
 		priv->xstats.tx_set_ic_bit++;
+		tx_q->tx_count_frames = 0;
+	} else {
+		stmmac_tx_timer_arm(priv, queue);
 	}
 
 	skb_tx_timestamp(skb);
@@ -3199,6 +3209,7 @@ static netdev_tx_t stmmac_xmit(struct sk_buff *skb, struct net_device *dev)
 	netdev_tx_sent_queue(netdev_get_tx_queue(dev, queue), skb->len);
 
 	stmmac_enable_dma_transmission(priv, priv->ioaddr);
+
 	stmmac_set_tx_tail_ptr(priv, priv->ioaddr, tx_q->tx_tail_addr, queue);
 
 	return NETDEV_TX_OK;
@@ -3319,6 +3330,7 @@ static inline void stmmac_rx_refill(struct stmmac_priv *priv, u32 queue)
 static int stmmac_rx(struct stmmac_priv *priv, int limit, u32 queue)
 {
 	struct stmmac_rx_queue *rx_q = &priv->rx_queue[queue];
+	struct stmmac_channel *ch = &priv->channel[queue];
 	unsigned int entry = rx_q->cur_rx;
 	int coe = priv->hw->rx_csum;
 	unsigned int next_entry;
@@ -3491,7 +3503,7 @@ static int stmmac_rx(struct stmmac_priv *priv, int limit, u32 queue)
 			else
 				skb->ip_summed = CHECKSUM_UNNECESSARY;
 
-			napi_gro_receive(&rx_q->napi, skb);
+			napi_gro_receive(&ch->napi, skb);
 
 			priv->dev->stats.rx_packets++;
 			priv->dev->stats.rx_bytes += frame_len;
@@ -3514,27 +3526,33 @@ static int stmmac_rx(struct stmmac_priv *priv, int limit, u32 queue)
  *  Description :
  *  To look at the incoming frames and clear the tx resources.
  */
-static int stmmac_poll(struct napi_struct *napi, int budget)
+static int stmmac_napi_poll(struct napi_struct *napi, int budget)
 {
-	struct stmmac_rx_queue *rx_q =
-		container_of(napi, struct stmmac_rx_queue, napi);
-	struct stmmac_priv *priv = rx_q->priv_data;
-	u32 tx_count = priv->plat->tx_queues_to_use;
-	u32 chan = rx_q->queue_index;
-	int work_done = 0;
-	u32 queue;
+	struct stmmac_channel *ch =
+		container_of(napi, struct stmmac_channel, napi);
+	struct stmmac_priv *priv = ch->priv_data;
+	int work_done = 0, work_rem = budget;
+	u32 chan = ch->index;
 
 	priv->xstats.napi_poll++;
 
-	/* check all the queues */
-	for (queue = 0; queue < tx_count; queue++)
-		stmmac_tx_clean(priv, queue);
+	if (ch->has_tx) {
+		int done = stmmac_tx_clean(priv, work_rem, chan);
 
-	work_done = stmmac_rx(priv, budget, rx_q->queue_index);
-	if (work_done < budget) {
-		napi_complete_done(napi, work_done);
-		stmmac_enable_dma_irq(priv, priv->ioaddr, chan);
+		work_done += done;
+		work_rem -= done;
+	}
+
+	if (ch->has_rx) {
+		int done = stmmac_rx(priv, work_rem, chan);
+
+		work_done += done;
+		work_rem -= done;
 	}
+
+	if (work_done < budget && napi_complete_done(napi, work_done))
+		stmmac_enable_dma_irq(priv, priv->ioaddr, chan);
+
 	return work_done;
 }
 
@@ -4198,8 +4216,8 @@ int stmmac_dvr_probe(struct device *device,
 {
 	struct net_device *ndev = NULL;
 	struct stmmac_priv *priv;
+	u32 queue, maxq;
 	int ret = 0;
-	u32 queue;
 
 	ndev = alloc_etherdev_mqs(sizeof(struct stmmac_priv),
 				  MTL_MAX_TX_QUEUES,
@@ -4322,11 +4340,22 @@ int stmmac_dvr_probe(struct device *device,
 			 "Enable RX Mitigation via HW Watchdog Timer\n");
 	}
 
-	for (queue = 0; queue < priv->plat->rx_queues_to_use; queue++) {
-		struct stmmac_rx_queue *rx_q = &priv->rx_queue[queue];
+	/* Setup channels NAPI */
+	maxq = max(priv->plat->rx_queues_to_use, priv->plat->tx_queues_to_use);
 
-		netif_napi_add(ndev, &rx_q->napi, stmmac_poll,
-			       (8 * priv->plat->rx_queues_to_use));
+	for (queue = 0; queue < maxq; queue++) {
+		struct stmmac_channel *ch = &priv->channel[queue];
+
+		ch->priv_data = priv;
+		ch->index = queue;
+
+		if (queue < priv->plat->rx_queues_to_use)
+			ch->has_rx = true;
+		if (queue < priv->plat->tx_queues_to_use)
+			ch->has_tx = true;
+
+		netif_napi_add(ndev, &ch->napi, stmmac_napi_poll,
+			       NAPI_POLL_WEIGHT);
 	}
 
 	mutex_init(&priv->lock);
@@ -4372,10 +4401,10 @@ int stmmac_dvr_probe(struct device *device,
 	    priv->hw->pcs != STMMAC_PCS_RTBI)
 		stmmac_mdio_unregister(ndev);
 error_mdio_register:
-	for (queue = 0; queue < priv->plat->rx_queues_to_use; queue++) {
-		struct stmmac_rx_queue *rx_q = &priv->rx_queue[queue];
+	for (queue = 0; queue < maxq; queue++) {
+		struct stmmac_channel *ch = &priv->channel[queue];
 
-		netif_napi_del(&rx_q->napi);
+		netif_napi_del(&ch->napi);
 	}
 error_hw_init:
 	destroy_workqueue(priv->wq);
diff --git a/include/linux/stmmac.h b/include/linux/stmmac.h
index c43e9a01b892..7ddfc65586b0 100644
--- a/include/linux/stmmac.h
+++ b/include/linux/stmmac.h
@@ -30,6 +30,7 @@
 
 #define MTL_MAX_RX_QUEUES	8
 #define MTL_MAX_TX_QUEUES	8
+#define STMMAC_CH_MAX		8
 
 #define STMMAC_RX_COE_NONE	0
 #define STMMAC_RX_COE_TYPE1	1
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next v3 2/2] net: stmmac: Fixup the tail addr setting in xmit path
From: Jose Abreu @ 2018-09-13  8:02 UTC (permalink / raw)
  To: netdev
  Cc: Jose Abreu, David S. Miller, Joao Pinto, Giuseppe Cavallaro,
	Alexandre Torgue, Florian Fainelli
In-Reply-To: <cover.1536762575.git.joabreu@synopsys.com>

Currently we are always setting the tail address of descriptor list to
the end of the pre-allocated list.

According to databook this is not correct. Tail address should point to
the last available descriptor + 1, which means we have to update the
tail address everytime we call the xmit function.

This should make no impact in older versions of MAC but in newer
versions there are some DMA features which allows the IP to fetch
descriptors in advance and in a non sequential order so its critical
that we set the tail address correctly.

Signed-off-by: Jose Abreu <joabreu@synopsys.com>
Fixes: f748be531d70 ("stmmac: support new GMAC4")
Cc: David S. Miller <davem@davemloft.net>
Cc: Joao Pinto <jpinto@synopsys.com>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>
Cc: Florian Fainelli <f.fainelli@gmail.com>
---
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index ab9cc0143ff2..75896d6ba6e2 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -2213,8 +2213,7 @@ static int stmmac_init_dma_engine(struct stmmac_priv *priv)
 		stmmac_init_tx_chan(priv, priv->ioaddr, priv->plat->dma_cfg,
 				    tx_q->dma_tx_phy, chan);
 
-		tx_q->tx_tail_addr = tx_q->dma_tx_phy +
-			    (DMA_TX_SIZE * sizeof(struct dma_desc));
+		tx_q->tx_tail_addr = tx_q->dma_tx_phy;
 		stmmac_set_tx_tail_ptr(priv, priv->ioaddr,
 				       tx_q->tx_tail_addr, chan);
 	}
@@ -3003,6 +3002,7 @@ static netdev_tx_t stmmac_tso_xmit(struct sk_buff *skb, struct net_device *dev)
 
 	netdev_tx_sent_queue(netdev_get_tx_queue(dev, queue), skb->len);
 
+	tx_q->tx_tail_addr = tx_q->dma_tx_phy + (tx_q->cur_tx * sizeof(*desc));
 	stmmac_set_tx_tail_ptr(priv, priv->ioaddr, tx_q->tx_tail_addr, queue);
 
 	return NETDEV_TX_OK;
@@ -3210,6 +3210,7 @@ static netdev_tx_t stmmac_xmit(struct sk_buff *skb, struct net_device *dev)
 
 	stmmac_enable_dma_transmission(priv, priv->ioaddr);
 
+	tx_q->tx_tail_addr = tx_q->dma_tx_phy + (tx_q->cur_tx * sizeof(*desc));
 	stmmac_set_tx_tail_ptr(priv, priv->ioaddr, tx_q->tx_tail_addr, queue);
 
 	return NETDEV_TX_OK;
-- 
2.7.4

^ permalink raw reply related

* [RFC PATCH iproute2-next] System specification health API
From: Eran Ben Elisha @ 2018-09-13  8:18 UTC (permalink / raw)
  To: netdev, Jiri Pirko, Andy Gospodarek, Michael Chan, Jakub Kicinski,
	Simon Horman, Alexander Duyck
  Cc: Andrew Lunn, Florian Fainelli, Tal Alon, Ariel Almog,
	Eran Ben Elisha

The health spec is targeted for Real Time Alerting, in order to know when
something bad had happened to a PCI device
- Provide alert debug information
- Self healing
- If problem needs vendor support, provide a way to gather all needed debugging
  information.

The health contains sensors which sense for malfunction. Once sensor triggered,
actions such as logs and correction can be taken.
Sensors are sensing the health state and can trigger correction action.

The sensors are divided into the following groups
- Hardware sensor - a sensor which is triggered by the device due to
  malfunction.
- Software sensor - a sensor which is triggered by the software due to
  malfunction.
Both group of sensors can be triggered due to error event or due to a periodic check.

Actions are the way to handle sensor events. Action can be in one of the
following groups:
- Dump -  SW trace, SW dump, HW trace, HW dump
- Reset - Surgical correction (e.g. modify Q, flush Q, reset of device, etc)
Actions can be performed by SW or HW.

User is allowed to enable or disable sensors and sensor2action mapping.

This RFC man page patch describes the suggested API of devlink-health in order
to control sensors and actions.

Eran Ben Elisha (1):
  man: Add devlink health man page

 man/man8/devlink-health.8 | 171 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 171 insertions(+)
 create mode 100644 man/man8/devlink-health.8

-- 
1.8.3.1

^ permalink raw reply

* [RFC PATCH iproute2-next] man: Add devlink health man page
From: Eran Ben Elisha @ 2018-09-13  8:18 UTC (permalink / raw)
  To: netdev, Jiri Pirko, Andy Gospodarek, Michael Chan, Jakub Kicinski,
	Simon Horman, Alexander Duyck
  Cc: Andrew Lunn, Florian Fainelli, Tal Alon, Ariel Almog,
	Eran Ben Elisha
In-Reply-To: <1536826696-9413-1-git-send-email-eranbe@mellanox.com>

Add devlink-health man page. Devlink-health tool will control device
health attributes, sensors, actions and logging.

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>

-------------------------------------------------------
Copy paste man output to here for easier review process of the RFC.

DEVLINK-HEALTH(8)                                                                                               Linux                                                                                              DEVLINK-HEALTH(8)

NAME
       devlink-health - devlink health configuration

SYNOPSIS
       devlink [ OPTIONS ] health  { COMMAND | help }

       OPTIONS := { -V[ersion] | -n[no-nice-names] }

       devlink health show [ DEV ] [ sensor NAME ]

       devlink health sensor set DEV name NAME [ action NAME { active | inactive } ]"

       devlink health action set DEV name NAME period PERIOD count COUNT fail { ignore | down }

       devlink health action reinit DEV name NAME

       devlink health help

DESCRIPTION
       devlink-health tool allows user to configure the way driver treats unexpected status. The tool allows configuration of the sensors that can trigger health activity. Set for each sensor the follow up operations, such as,
       reset and dump of info. In addition, set the health activity termination action.

   devlink health show - Display devlink health sensors and actions attributes
       DEV - Specifies the devlink device to show.  If this argument is omitted, all devices are listed.

           Format is:
             BUS_NAME/BUS_ADDRESS

       sensor NAME - Specifies the devlink sensor to show.

   devlink health sensor set - sets devlink health sensor attributes
       DEV    Specifies the devlink device to show.

       name NAME
              Name of the sensor to set.

       action NAME { active | inactive }
                  Specify which actions to activate and which to deactivate once a sensor was triggered. actions can be dump, reset, etc.

   devlink health action set - sets devlink action attributes
       DEV    Specifies the devlink device to set.

       name NAME
              Specifies the devlink action to set.

       period PERIOD
              The period on which we limit the amount of performed actions, measured in seconds.

       count COUNT
              The maximum amount of actions performed in a limit time frame.

       fail   { ignore | down }
                  Specify the behavior once count limit was reached.

                  ignore - Ignore errors without execution of any action.

                  down - Driver will remain in nonoperational state.

   devlink health action reinit - reset devlink action attributes (period, count, fail, etc)
       DEV    Specifies the devlink device to set.

       name NAME
              Specifies the devlink action to set.

EXAMPLES
       devlink health show
           Shows the health state of all devlink devices on the system.

       devlink health show pci/0000:01:00.0
           Shows the health state of specified devlink device.

       devlink health sensor set pci/0000:01:00.0 name TX_COMP_ERROR action reset off action dump on
           Sets TX_COMP_ERROR sensor parameters for a specific device.

       devlink health action set pci/0000:01:00.0 name reset period 3600 count 5 fail ignore
           Sets health attributes for reset action.

SEE ALSO
       devlink(8), devlink-port(8), devlink-sb(8), devlink-monitor(8), devlink-dev(8),

AUTHOR
       Eran ben Elisha <eranbe@mellanox.com>

iproute2                                                                                                     15 Aug 2018                                                                                           DEVLINK-HEALTH(8)
---
 man/man8/devlink-health.8 | 171 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 171 insertions(+)
 create mode 100644 man/man8/devlink-health.8

diff --git a/man/man8/devlink-health.8 b/man/man8/devlink-health.8
new file mode 100644
index 000000000000..ac28b020be0d
--- /dev/null
+++ b/man/man8/devlink-health.8
@@ -0,0 +1,171 @@
+.TH DEVLINK\-HEALTH 8 "15 Aug 2018" "iproute2" "Linux"
+.SH NAME
+devlink-health \- devlink health configuration
+.SH SYNOPSIS
+.sp
+.ad l
+.in +8
+.ti -8
+.B devlink
+.RI "[ " OPTIONS " ]"
+.BR health
+.RI  " { " COMMAND " | "
+.BR help " }"
+.sp
+
+.ti -8
+.IR OPTIONS " := { "
+\fB\-V\fR[\fIersion\fR] |
+\fB\-n\fR[\fIno-nice-names\fR] }
+
+.ti -8
+.B devlink health show
+.RI "[ " DEV " ]"
+.RI "[ "
+.B sensor
+.IR NAME
+.RI "]"
+
+.ti -8
+.B devlink health sensor set
+.IR DEV
+.B name
+.IR NAME
+.RI "[ "
+.BR action
+.IR NAME
+.R "{" active "|" inactive "}" ]"
+
+.ti -8
+.B devlink health action set
+.IR DEV
+.B name
+.IR NAME
+.BR period
+.IR PERIOD
+.BR count
+.IR COUNT
+.BR fail " { "
+.IR ignore
+.BR "| "
+.IR down
+.R "} "
+
+.ti -8
+.B devlink health action reinit
+.IR DEV
+.B name
+.IR NAME
+
+.ti -8
+.B devlink health help
+
+.SH "DESCRIPTION"
+.B devlink-health
+tool allows user to configure the way driver treats unexpected status. The tool allows configuration of the sensors that can trigger health activity. Set for each sensor the follow up operations, such as, reset and dump of info. In addition, set the health activity termination action.
+
+.SS devlink health show - Display devlink health sensors and actions attributes
+.PP
+.B "DEV"
+- Specifies the devlink device to show.
+If this argument is omitted, all devices are listed.
+
+.in +4
+Format is:
+.in +2
+BUS_NAME/BUS_ADDRESS
+
+.PP
+.BR sensor
+.IR "NAME"
+- Specifies the devlink sensor to show.
+
+.SS devlink health sensor set - sets devlink health sensor attributes
+
+.TP
+.B "DEV"
+Specifies the devlink device to show.
+
+.TP
+.BI name " NAME"
+Name of the sensor to set.
+
+.TP
+.BR action
+.IR NAME
+.R "{" active "|" inactive "} "
+.in +4
+Specify which actions to activate and which to deactivate once a sensor was triggered. actions can be dump, reset, etc.
+
+.SS devlink health action set - sets devlink action attributes
+
+.TP
+.B "DEV"
+Specifies the devlink device to set.
+
+.TP
+.BI name " NAME"
+Specifies the devlink action to set.
+
+.TP
+.BI period " PERIOD"
+The period on which we limit the amount of performed actions, measured in seconds.
+
+.TP
+.BI count " COUNT"
+The maximum amount of actions performed in a limit time frame.
+
+.TP
+.BR fail
+.R "{" ignore "|" down "}"
+.in +4
+Specify the behavior once count limit was reached.
+
+.I ignore
+- Ignore errors without execution of any action.
+
+.I down
+- Driver will remain in nonoperational state.
+
+.SS devlink health action reinit - reset devlink action attributes (period, count, fail, etc)
+
+.TP
+.B "DEV"
+Specifies the devlink device to set.
+
+.TP
+.BI name " NAME"
+Specifies the devlink action to set.
+
+.SH "EXAMPLES"
+.PP
+devlink health show
+.RS 4
+Shows the health state of all devlink devices on the system.
+.RE
+.PP
+devlink health show pci/0000:01:00.0
+.RS 4
+Shows the health state of specified devlink device.
+.RE
+.PP
+devlink health sensor set pci/0000:01:00.0 name TX_COMP_ERROR action reset off action dump on
+.RS 4
+Sets TX_COMP_ERROR sensor parameters for a specific device.
+.RE
+.PP
+devlink health action set pci/0000:01:00.0 name reset period 3600 count 5 fail ignore
+.RS 4
+Sets health attributes for reset action.
+.RE
+
+.SH SEE ALSO
+.BR devlink (8),
+.BR devlink-port (8),
+.BR devlink-sb (8),
+.BR devlink-monitor (8),
+.BR devlink-dev (8),
+.br
+
+.SH AUTHOR
+Eran ben Elisha <eranbe@mellanox.com>
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH 4.18 005/197] net: sched: Fix memory exposure from short TCA_U32_SEL
From: Greg Kroah-Hartman @ 2018-09-13 13:29 UTC (permalink / raw)
  To: linux-kernel
  Cc: Greg Kroah-Hartman, stable, Al Viro, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, David S. Miller, netdev, Kees Cook
In-Reply-To: <20180913131841.568116777@linuxfoundation.org>

4.18-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Kees Cook <keescook@chromium.org>

[ Upstream commit 98c8f125fd8a6240ea343c1aa50a1be9047791b8 ]

Via u32_change(), TCA_U32_SEL has an unspecified type in the netlink
policy, so max length isn't enforced, only minimum. This means nkeys
(from userspace) was being trusted without checking the actual size of
nla_len(), which could lead to a memory over-read, and ultimately an
exposure via a call to u32_dump(). Reachability is CAP_NET_ADMIN within
a namespace.

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 net/sched/cls_u32.c |   10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -912,6 +912,7 @@ static int u32_change(struct net *net, s
 	struct nlattr *opt = tca[TCA_OPTIONS];
 	struct nlattr *tb[TCA_U32_MAX + 1];
 	u32 htid, flags = 0;
+	size_t sel_size;
 	int err;
 #ifdef CONFIG_CLS_U32_PERF
 	size_t size;
@@ -1074,8 +1075,13 @@ static int u32_change(struct net *net, s
 	}
 
 	s = nla_data(tb[TCA_U32_SEL]);
+	sel_size = struct_size(s, keys, s->nkeys);
+	if (nla_len(tb[TCA_U32_SEL]) < sel_size) {
+		err = -EINVAL;
+		goto erridr;
+	}
 
-	n = kzalloc(sizeof(*n) + s->nkeys*sizeof(struct tc_u32_key), GFP_KERNEL);
+	n = kzalloc(offsetof(typeof(*n), sel) + sel_size, GFP_KERNEL);
 	if (n == NULL) {
 		err = -ENOBUFS;
 		goto erridr;
@@ -1090,7 +1096,7 @@ static int u32_change(struct net *net, s
 	}
 #endif
 
-	memcpy(&n->sel, s, sizeof(*s) + s->nkeys*sizeof(struct tc_u32_key));
+	memcpy(&n->sel, s, sel_size);
 	RCU_INIT_POINTER(n->ht_up, ht);
 	n->handle = handle;
 	n->fshift = s->hmask ? ffs(ntohl(s->hmask)) - 1 : 0;

^ permalink raw reply

* [PATCH 4.18 028/197] r8169: set RxConfig after tx/rx is enabled for RTL8169sb/8110sb devices
From: Greg Kroah-Hartman @ 2018-09-13 13:29 UTC (permalink / raw)
  To: linux-kernel
  Cc: Greg Kroah-Hartman, stable, Heiner Kallweit, David S. Miller,
	netdev, Realtek linux nic maintainers, Azat Khuzhin
In-Reply-To: <20180913131841.568116777@linuxfoundation.org>

4.18-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Azat Khuzhin <a3at.mail@gmail.com>

[ Upstream commit 05212ba8132b42047ab5d63d759c6f9c28e7eab5 ]

I have two Ethernet adapters:
  r8169 0000:03:01.0 eth0: RTL8169sb/8110sb, 00:14:d1:14:2d:49, XID 10000000, IRQ 18
  r8169 0000:01:00.0 eth0: RTL8168e/8111e, 64:66:b3:11:14:5d, XID 2c200000, IRQ 30
And after upgrading from linux 4.15 [1] to linux 4.18+ [2] RTL8169sb failed to
receive any packets. tcpdump shows a lot of checksum mismatch.

  [1]: a0f79386a4968b4925da6db2d1daffd0605a4402
  [2]: 0519359784328bfa92bf0931bf0cff3b58c16932 (4.19 merge window opened)

I started bisecting and the found that [3] breaks it. According to [4]:
  "For 8110S, 8110SB, and 8110SC series, the initial value of RxConfig
  needs to be set after the tx/rx is enabled."
So I moved rtl_init_rxcfg() after enabling tx/rs and now my adapter works
(RTL8168e works too).

  [3]: 3559d81e76bfe3803e89f2e04cf6ef7ab4f3aace
  [4]: e542a2269f232d61270ceddd42b73a4348dee2bb ("r8169: adjust the RxConfig
settings.")

Also drop "rx" from rtl_set_rx_tx_config_registers(), since it does nothing
with it already.

Fixes: 3559d81e76bfe3803e89f2e04cf6ef7ab4f3aace ("r8169: simplify
rtl_hw_start_8169")

Cc: Heiner Kallweit <hkallweit1@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Cc: Realtek linux nic maintainers <nic_swsd@realtek.com>
Signed-off-by: Azat Khuzhin <a3at.mail@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 drivers/net/ethernet/realtek/r8169.c |    6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -5039,7 +5039,7 @@ static void rtl8169_hw_reset(struct rtl8
 	rtl_hw_reset(tp);
 }
 
-static void rtl_set_rx_tx_config_registers(struct rtl8169_private *tp)
+static void rtl_set_tx_config_registers(struct rtl8169_private *tp)
 {
 	/* Set DMA burst size and Interframe Gap Time */
 	RTL_W32(tp, TxConfig, (TX_DMA_BURST << TxDMAShift) |
@@ -5150,12 +5150,14 @@ static void rtl_hw_start(struct  rtl8169
 
 	rtl_set_rx_max_size(tp);
 	rtl_set_rx_tx_desc_registers(tp);
-	rtl_set_rx_tx_config_registers(tp);
+	rtl_set_tx_config_registers(tp);
 	RTL_W8(tp, Cfg9346, Cfg9346_Lock);
 
 	/* Initially a 10 us delay. Turned it into a PCI commit. - FR */
 	RTL_R8(tp, IntrMask);
 	RTL_W8(tp, ChipCmd, CmdTxEnb | CmdRxEnb);
+	rtl_init_rxcfg(tp);
+
 	rtl_set_rx_mode(tp->dev);
 	/* no early-rx interrupts */
 	RTL_W16(tp, MultiIntr, RTL_R16(tp, MultiIntr) & 0xf000);

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox