Netdev List


* Re: [PATCH net-next v3 03/10] net: sched: refactor block offloads counter usage
From: Jiri Pirko @ 2019-08-26 15:46 UTC (permalink / raw)
  To: Vlad Buslov; +Cc: netdev, jhs, xiyou.wangcong, davem, jakub.kicinski, pablo
In-Reply-To: <20190826134506.9705-4-vladbu@mellanox.com>

Mon, Aug 26, 2019 at 03:44:59PM CEST, vladbu@mellanox.com wrote:
>Without rtnl lock protection filters can no longer safely manage block
>offloads counter themselves. Refactor cls API to protect block offloadcnt
>with tcf_block->cb_lock that is already used to protect driver callback
>list and nooffloaddevcnt counter. The counter can be modified by concurrent
>tasks by new functions that execute block callbacks (which is safe with
>previous patch that changed its type to atomic_t), however, block
>bind/unbind code that checks the counter value takes cb_lock in write mode
>to exclude any concurrent modifications. This approach prevents race
>conditions between bind/unbind and callback execution code but allows for
>concurrency for tc rule update path.
>
>Move block offload counter, filter in hardware counter and filter flags
>management from classifiers into cls hardware offloads API. Make functions
>tcf_block_offload_{inc|dec}() and tc_cls_offload_cnt_update() to be cls API
>private. Implement following new cls API to be used instead:
>
>  tc_setup_cb_add() - non-destructive filter add. If filter that wasn't
>  already in hardware is successfully offloaded, increment block offloads
>  counter, set filter in hardware counter and flag. On failure, previously
>  offloaded filter is considered to be intact and offloads counter is not
>  decremented.
>
>  tc_setup_cb_replace() - destructive filter replace. Release existing
>  filter block offload counter and reset its in hardware counter and flag.
>  Set new filter in hardware counter and flag. On failure, previously
>  offloaded filter is considered to be destroyed and offload counter is
>  decremented.
>
>  tc_setup_cb_destroy() - filter destroy. Unconditionally decrement block
>  offloads counter.
>
>  tc_setup_cb_reoffload() - reoffload filter to single cb. Execute cb() and
>  call tc_cls_offload_cnt_update() if cb() didn't return an error.
>
>Refactor all offload-capable classifiers to atomically offload filters to
>hardware, change block offload counter, and set filter in hardware counter
>and flag by means of the new cls API functions.
>
>Signed-off-by: Vlad Buslov <vladbu@mellanox.com>

Acked-by: Jiri Pirko <jiri@mellanox.com>
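
As a rough illustration of the counter rules listed in the quoted commit message, the following single-threaded sketch models add, replace, and destroy. Every name in it (model_block, model_filter, model_cb_add, and so on) is hypothetical. The driver callback is reduced to a function that returns the number of devices that accepted the rule, or a negative value on failure, and the cb_lock read/write locking is deliberately left out. It is a model under those assumptions, not the kernel implementation.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for tcf_block: only the offload counter is modelled. */
struct model_block {
	atomic_int offloadcnt;
};

/* Hypothetical stand-in for a classifier filter's offload state. */
struct model_filter {
	bool in_hw;               /* models the "in hardware" flag */
	unsigned int in_hw_count; /* models the per-filter in-hardware counter */
};

/* Stand-in for running the block callbacks: > 0 means that many devices
 * accepted the rule, <= 0 means the offload did not happen.
 */
typedef int (*model_cb_t)(void *priv);

int model_cb_accept(void *priv) { (void)priv; return 1; }
int model_cb_reject(void *priv) { (void)priv; return -1; }

/* Drop a filter's contribution to the block counter, if it has one. */
void model_cnt_release(struct model_block *b, struct model_filter *f)
{
	if (f->in_hw) {
		atomic_fetch_sub(&b->offloadcnt, 1);
		f->in_hw = false;
		f->in_hw_count = 0;
	}
}

/* Non-destructive add: on failure the previous state is left intact. */
int model_cb_add(struct model_block *b, struct model_filter *f,
		 model_cb_t cb, void *priv)
{
	int ok = cb(priv);

	if (ok <= 0)
		return ok;
	if (!f->in_hw) {
		atomic_fetch_add(&b->offloadcnt, 1);
		f->in_hw = true;
	}
	f->in_hw_count = (unsigned int)ok;
	return ok;
}

/* Destructive replace: the old filter is released before the callback runs,
 * so a failure leaves the counter decremented.
 */
int model_cb_replace(struct model_block *b, struct model_filter *old_f,
		     struct model_filter *new_f, model_cb_t cb, void *priv)
{
	int ok;

	if (old_f)
		model_cnt_release(b, old_f);
	ok = cb(priv);
	if (ok <= 0)
		return ok;
	atomic_fetch_add(&b->offloadcnt, 1);
	new_f->in_hw = true;
	new_f->in_hw_count = (unsigned int)ok;
	return ok;
}

/* Destroy: run the callback, then release whatever the filter had counted. */
void model_cb_destroy(struct model_block *b, struct model_filter *f,
		      model_cb_t cb, void *priv)
{
	cb(priv);
	model_cnt_release(b, f);
}
```

The asymmetry between add and replace is the point of the sketch: a failed model_cb_add() changes nothing, while a failed model_cb_replace() has already given up the old filter's count.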


* Re: [PATCH net-next v3 04/10] net: sched: notify classifier on successful offload add/delete
From: Jiri Pirko @ 2019-08-26 15:51 UTC (permalink / raw)
  To: Vlad Buslov; +Cc: netdev, jhs, xiyou.wangcong, davem, jakub.kicinski, pablo
In-Reply-To: <20190826134506.9705-5-vladbu@mellanox.com>

Mon, Aug 26, 2019 at 03:45:00PM CEST, vladbu@mellanox.com wrote:
>To remove dependency on rtnl lock, extend classifier ops with new
>ops->hw_add() and ops->hw_del() callbacks. Call them from cls API while
>holding cb_lock every time a filter is successfully added to or deleted from
>hardware.
>
>Implement the new API in flower classifier. Use it to manage hw_filters
>list under cb_lock protection, instead of relying on rtnl lock to
>synchronize with concurrent fl_reoffload() call.
>
>Signed-off-by: Vlad Buslov <vladbu@mellanox.com>

Acked-by: Jiri Pirko <jiri@mellanox.com>
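
To make the notification idea in the quoted text concrete, here is a minimal sketch of optional per-classifier callbacks that the core invokes only after a successful hardware operation. The struct and function names are hypothetical, the callback signatures are simplified compared with whatever the patch actually defines, and the hw_filters list is reduced to a counter. The caller is assumed to hold cb_lock, which the model does not represent.

```c
#include <stddef.h>

/* Hypothetical, reduced classifier ops: only the two new notifications. */
struct model_cls_ops {
	void (*hw_add)(void *type_data);
	void (*hw_del)(void *type_data);
};

/* Stand-in for flower's hw_filters list: only its length is tracked. */
struct model_hw_filters {
	unsigned int count;
};

void model_flower_hw_add(void *type_data)
{
	struct model_hw_filters *list = type_data;

	list->count++;
}

void model_flower_hw_del(void *type_data)
{
	struct model_hw_filters *list = type_data;

	if (list->count)
		list->count--;
}

const struct model_cls_ops model_flower_ops = {
	.hw_add = model_flower_hw_add,
	.hw_del = model_flower_hw_del,
};

/* Core side, add path: notify only if the driver callbacks succeeded and the
 * classifier registered a callback. cb_lock is assumed to be held.
 */
void model_notify_add(const struct model_cls_ops *ops, int cb_result,
		      void *type_data)
{
	if (cb_result > 0 && ops->hw_add)
		ops->hw_add(type_data);
}

/* Core side, delete path: same rule for the delete notification. */
void model_notify_del(const struct model_cls_ops *ops, void *type_data)
{
	if (ops->hw_del)
		ops->hw_del(type_data);
}
```

Because the list is only touched from these callbacks, whatever lock the core holds around them is enough to serialize list updates against a concurrent reoffload in this model.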


* Re: [PATCH] samples: bpf: add max_pckt_size option at xdp_adjust_tail
From: Maciej Fijalkowski @ 2019-08-26 15:54 UTC (permalink / raw)
  To: Daniel T. Lee; +Cc: Daniel Borkmann, Alexei Starovoitov, netdev
In-Reply-To: <20190826095722.28229-1-danieltimlee@gmail.com>

On Mon, 26 Aug 2019 18:57:22 +0900
"Daniel T. Lee" <danieltimlee@gmail.com> wrote:

> Currently, at xdp_adjust_tail_kern.c, MAX_PCKT_SIZE is limited
> to 600. To make this size flexible, a new map 'pcktsz' is added.
> 
> By updating new packet size to this map from the userland,
> xdp_adjust_tail_kern.o will use this value as a new max_pckt_size.
> 
> If no '-P <MAX_PCKT_SIZE>' option is used, the size of maximum packet
> will be 600 as a default.
> 
> Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
> ---
>  samples/bpf/xdp_adjust_tail_kern.c | 23 +++++++++++++++++++----
>  samples/bpf/xdp_adjust_tail_user.c | 21 +++++++++++++++++++--
>  2 files changed, 38 insertions(+), 6 deletions(-)
> 
> diff --git a/samples/bpf/xdp_adjust_tail_kern.c b/samples/bpf/xdp_adjust_tail_kern.c
> index 411fdb21f8bc..4d53af370b68 100644
> --- a/samples/bpf/xdp_adjust_tail_kern.c
> +++ b/samples/bpf/xdp_adjust_tail_kern.c
> @@ -25,6 +25,13 @@
>  #define ICMP_TOOBIG_SIZE 98
>  #define ICMP_TOOBIG_PAYLOAD_SIZE 92
>  
> +struct bpf_map_def SEC("maps") pcktsz = {
> +	.type = BPF_MAP_TYPE_ARRAY,
> +	.key_size = sizeof(__u32),
> +	.value_size = sizeof(__u32),
> +	.max_entries = 1,
> +};
> +
>  struct bpf_map_def SEC("maps") icmpcnt = {
>  	.type = BPF_MAP_TYPE_ARRAY,
>  	.key_size = sizeof(__u32),
> @@ -64,7 +71,8 @@ static __always_inline void ipv4_csum(void *data_start, int data_size,
>  	*csum = csum_fold_helper(*csum);
>  }
>  
> -static __always_inline int send_icmp4_too_big(struct xdp_md *xdp)
> +static __always_inline int send_icmp4_too_big(struct xdp_md *xdp,
> +					      __u32 max_pckt_size)
>  {
>  	int headroom = (int)sizeof(struct iphdr) + (int)sizeof(struct icmphdr);
>  
> @@ -92,7 +100,7 @@ static __always_inline int send_icmp4_too_big(struct xdp_md *xdp)
>  	orig_iph = data + off;
>  	icmp_hdr->type = ICMP_DEST_UNREACH;
>  	icmp_hdr->code = ICMP_FRAG_NEEDED;
> -	icmp_hdr->un.frag.mtu = htons(MAX_PCKT_SIZE-sizeof(struct ethhdr));
> +	icmp_hdr->un.frag.mtu = htons(max_pckt_size - sizeof(struct ethhdr));
>  	icmp_hdr->checksum = 0;
>  	ipv4_csum(icmp_hdr, ICMP_TOOBIG_PAYLOAD_SIZE, &csum);
>  	icmp_hdr->checksum = csum;
> @@ -118,14 +126,21 @@ static __always_inline int handle_ipv4(struct xdp_md *xdp)
>  {
>  	void *data_end = (void *)(long)xdp->data_end;
>  	void *data = (void *)(long)xdp->data;
> +	__u32 max_pckt_size = MAX_PCKT_SIZE;
> +	__u32 *pckt_sz;
> +	__u32 key = 0;
>  	int pckt_size = data_end - data;
>  	int offset;
>  
> -	if (pckt_size > MAX_PCKT_SIZE) {
> +	pckt_sz = bpf_map_lookup_elem(&pcktsz, &key);
> +	if (pckt_sz && *pckt_sz)
> +		max_pckt_size = *pckt_sz;
> +
> +	if (pckt_size > max_pckt_size) {
>  		offset = pckt_size - ICMP_TOOBIG_SIZE;
>  		if (bpf_xdp_adjust_tail(xdp, 0 - offset))
>  			return XDP_PASS;
> -		return send_icmp4_too_big(xdp);
> +		return send_icmp4_too_big(xdp, max_pckt_size);
>  	}
>  	return XDP_PASS;
>  }
> diff --git a/samples/bpf/xdp_adjust_tail_user.c b/samples/bpf/xdp_adjust_tail_user.c
> index a3596b617c4c..dd3befa5e1fe 100644
> --- a/samples/bpf/xdp_adjust_tail_user.c
> +++ b/samples/bpf/xdp_adjust_tail_user.c
> @@ -72,6 +72,7 @@ static void usage(const char *cmd)
>  	printf("Usage: %s [...]\n", cmd);
>  	printf("    -i <ifname|ifindex> Interface\n");
>  	printf("    -T <stop-after-X-seconds> Default: 0 (forever)\n");
> +	printf("    -P <MAX_PCKT_SIZE> Default: 600\n");
>  	printf("    -S use skb-mode\n");
>  	printf("    -N enforce native mode\n");
>  	printf("    -F force loading prog\n");
> @@ -85,9 +86,11 @@ int main(int argc, char **argv)
>  		.prog_type	= BPF_PROG_TYPE_XDP,
>  	};
>  	unsigned char opt_flags[256] = {};
> -	const char *optstr = "i:T:SNFh";
> +	const char *optstr = "i:T:P:SNFh";
>  	struct bpf_prog_info info = {};
>  	__u32 info_len = sizeof(info);
> +	__u32 max_pckt_size = 0;
> +	__u32 key = 0;
>  	unsigned int kill_after_s = 0;
>  	int i, prog_fd, map_fd, opt;
>  	struct bpf_object *obj;
> @@ -110,6 +113,9 @@ int main(int argc, char **argv)
>  		case 'T':
>  			kill_after_s = atoi(optarg);
>  			break;
> +		case 'P':
> +			max_pckt_size = atoi(optarg);
> +			break;
>  		case 'S':
>  			xdp_flags |= XDP_FLAGS_SKB_MODE;
>  			break;
> @@ -150,9 +156,20 @@ int main(int argc, char **argv)
>  	if (bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd))
>  		return 1;
>  
> +	/* update pcktsz map */
>  	map = bpf_map__next(NULL, obj);
>  	if (!map) {
> -		printf("finding a map in obj file failed\n");
> +		printf("finding a pcktsz map in obj file failed\n");
> +		return 1;
> +	}
> +	map_fd = bpf_map__fd(map);

Consider using bpf_object__find_map_fd_by_name() here.
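
The difference matters because walking maps with bpf_map__next() depends on the order in which the maps appear in the object file, while a lookup by name does not. The sketch below models only that idea with a hypothetical table and a hypothetical model_find_map_fd_by_name() helper; it does not call libbpf. It assumes, as I believe is the case for the real helper, that failure is reported as a negative value, so callers should test for "< 0" rather than "== 0".

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the maps of a loaded object file. */
struct model_map {
	const char *name;
	int fd;
};

/* Models a by-name lookup: returns the map's fd, or a negative errno when
 * no map has that name. File descriptor 0 would be a valid result, which is
 * why callers must not treat 0 as the failure value.
 */
int model_find_map_fd_by_name(const struct model_map *maps, size_t n,
			      const char *name)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (strcmp(maps[i].name, name) == 0)
			return maps[i].fd;
	return -ENOENT;
}
```

With this shape, adding a new map in front of icmpcnt (as the patch does with pcktsz) cannot silently change which map the user-space side ends up reading.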

> +	if (max_pckt_size)
> +		bpf_map_update_elem(map_fd, &key, &max_pckt_size, BPF_ANY);
> +
> +	/* fetch icmpcnt map */
> +	map = bpf_map__next(map, obj);
> +	if (!map) {
> +		printf("finding a icmpcnt map in obj file failed\n");
>  		return 1;
>  	}
>  	map_fd = bpf_map__fd(map);



* Re: [PATCH bpf] nfp: bpf: fix latency bug when updating stack index register
From: Jakub Kicinski @ 2019-08-26 15:57 UTC (permalink / raw)
  To: Song Liu
  Cc: Alexei Starovoitov, Daniel Borkmann, bpf, Networking, OSS Drivers,
	Jiong Wang
In-Reply-To: <CAPhsuW7_dSEPJOdKApQFU-aVmEXgOwmqLS7S1FC4JtnzjR6OiQ@mail.gmail.com>

On Sun, Aug 25, 2019 at 10:37 PM Song Liu <liu.song.a23@gmail.com> wrote:
> On Fri, Aug 23, 2019 at 7:04 PM Jakub Kicinski wrote:
> > From: Jiong Wang <jiong.wang@netronome.com>
> >
> > NFP is using Local Memory to model stack. LM_addr could be used as base of
> > a 16 32-bit word region of Local Memory. Then, if the stack offset is
> > beyond the current region, the local index needs to be updated. The update
> > needs at least three cycles to take effect, therefore the sequence normally
> > looks like:
> >
> >   local_csr_wr[ActLMAddr3, gprB_5]
> >   nop
> >   nop
> >   nop
> >
> > If the local index switch happens on a narrow load, then the instruction
> > preparing value to zero high 32-bit of the destination register could be
> > counted as one cycle, the sequence then could be something like:
> >
> >   local_csr_wr[ActLMAddr3, gprB_5]
> >   nop
> >   nop
> >   immed[gprB_5, 0]
> >
> > However, we have zero extension optimization that zeroing high 32-bit could
> > be eliminated, therefore above IMMED insn won't be available for which case
> > the first sequence needs to be generated.
> >
> > Fixes: 0b4de1ff19bf ("nfp: bpf: eliminate zero extension code-gen")
> > Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
> > Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> I haven't looked into the code yet. But ^^^ should be
>
> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
>
> right?

I prefer Review on code I review, ack on code I ack, and sign-off on
code I co-author.
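
For readers following the technical part of the quoted commit message, the padding rule can be written as a small calculation. This is only a sketch of the arithmetic described there (a three-cycle latency after local_csr_wr, with any guaranteed following instruction counting as one cycle). The function and macro names are hypothetical and it is not the driver's code generator.

```c
/* Cycles that must pass after local_csr_wr before the new index is usable,
 * per the quoted commit message.
 */
#define MODEL_LM_UPDATE_LATENCY 3u

/* Number of nop instructions to emit after the CSR write, given how many
 * independent instructions are guaranteed to follow it anyway.
 */
unsigned int model_lm_switch_nops(unsigned int guaranteed_filler_insns)
{
	if (guaranteed_filler_insns >= MODEL_LM_UPDATE_LATENCY)
		return 0;
	return MODEL_LM_UPDATE_LATENCY - guaranteed_filler_insns;
}
```

In these terms, the bug being fixed is a generator that assumed one filler instruction (the IMMED that zeroes the high 32 bits) and emitted two nops, even when the zero-extension optimization had removed that instruction and three nops were needed.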


* Re: [PATCH net-next v3 4/6] net: dsa: mv88e6xxx: simplify SERDES code for Topaz and Peridot
From: Vivien Didelot @ 2019-08-26 16:08 UTC (permalink / raw)
  To: Marek Behun; +Cc: netdev, Andrew Lunn, Florian Fainelli, Vladimir Oltean
In-Reply-To: <20190825183609.4a9cc0d7@nic.cz>

Hi Marek,

On Sun, 25 Aug 2019 18:36:09 +0200, Marek Behun <marek.behun@nic.cz> wrote:
> > Aren't you relying on -ENODEV as well?
> 
> Vivien, I am not relying on -ENODEV. I changed the serdes_get_lane
> semantics:
>  - previously:
>    - if port has a lane for current cmode, return given lane number
>    - otherwise return -ENODEV
>    - if another error occurred during serdes_get_lane, return that error
>      (this never happened, because all implementations only need port
>      number and cmode, and cmode is cached, so no function was called
>      that could err)
>  - after this commit:
>    - if port has a lane for current cmode, return 0 and put lane number
>      into *lane
>    - otherwise return 0 and put -1 into *lane
>    - if an error occurred, return that error number
> 
> I removed the -ENODEV semantics for "no lane on port" event.
> There are two reasons for this:
>   1. once you requested lane number to be put into a place pointed to
>      by a pointer, rather than the return value, the code seemed better
>      to me (you may of course disagree, this is a personal opinion) when
>      I did:
>        if (err)
>            return err;
>        if (lane < 0)
>            return 0;
>      rather than
>        if (err == -ENODEV)
>            return 0;
>        if (err)
>            return err;

A single return path for invalid queries, optionally checking for a specific
error, is more idiomatic than checking in two places, which can lead to
mistakes, as happened in your previous patch. So this is more readable:

    if (err)
        return err;

or:

    if (err && err != -ENODEV)
        return err;

or:

    if (err) {
        if (err == -ENODEV)
            err = 0;
        return err;
    }

>   2. some future implementation may actually need to call some MDIO
>      read/write functions, which may or may not return -ENODEV. That
>      could conflict with the -ENODEV returned when there is no lane.

The current code is already using -ENODEV to inform about "no lane for port",
even if it can be used by lower level functions, same as -EINVAL. That is fine.

So if you have to respin the series again, I would really prefer to see an
unsigned lane parameter, otherwise, fine...
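
To make the convention concrete, here is a small self-contained sketch of the "-ENODEV means no lane" style with an unsigned lane output parameter and a single error path in the caller. The function names, and the choice of ports 9 and 10 as the only ports with a lane, are hypothetical and only serve the example.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical lookup: in this model only ports 9 and 10 have a SERDES lane.
 * Returns 0 and fills *lane on success, -ENODEV when the port has no lane.
 */
int model_serdes_get_lane(int port, unsigned int *lane)
{
	if (port == 9 || port == 10) {
		*lane = (unsigned int)port;
		return 0;
	}
	return -ENODEV;
}

/* Caller with a single error path: "no lane" is not a failure here, every
 * other error is passed up unchanged.
 */
int model_serdes_power(int port, bool *powered)
{
	unsigned int lane;
	int err;

	*powered = false;
	err = model_serdes_get_lane(port, &lane);
	if (err) {
		if (err == -ENODEV)
			err = 0;
		return err;
	}

	/* A real driver would program the lane here; the model only records it. */
	(void)lane;
	*powered = true;
	return 0;
}
```

The trade-off Marek raises still applies to this shape: if a lower-level call inside the lookup could itself return -ENODEV, the caller cannot tell that apart from "no lane on this port".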


Thanks,

	Vivien


* Re: [patch net-next rfc 3/7] net: rtnetlink: add commands to add and delete alternative ifnames
From: Jiri Pirko @ 2019-08-26 16:09 UTC (permalink / raw)
  To: David Ahern
  Cc: Roopa Prabhu, netdev, David Miller, Jakub Kicinski,
	Stephen Hemminger, dcbw, Michal Kubecek, Andrew Lunn, parav,
	Saeed Mahameed, mlxsw
In-Reply-To: <20190813065617.GK2428@nanopsycho>

Tue, Aug 13, 2019 at 08:56:17AM CEST, jiri@resnulli.us wrote:
>Mon, Aug 12, 2019 at 06:01:59PM CEST, dsahern@gmail.com wrote:
>>On 8/12/19 2:31 AM, Jiri Pirko wrote:
>>> Mon, Aug 12, 2019 at 03:37:26AM CEST, dsahern@gmail.com wrote:
>>>> On 8/11/19 7:34 PM, David Ahern wrote:
>>>>> On 8/10/19 12:30 AM, Jiri Pirko wrote:
>>>>>> Could you please write me an example message of add/remove?
>>>>>
>>>>> altnames are for existing netdevs, yes? existing netdevs have an id and
>>>>> a name - 2 existing references for identifying the existing netdev for
>>>>> which an altname will be added. Even using the altname as the main
>>>>> 'handle' for a setlink change, I see no reason why the GETLINK api can
>>>>> not take the IFLA_ALT_IFNAME and return the full details of the
>>>>> device if the altname is unique.
>>>>>
>>>>> So, what do the new RTM commands give you that you can not do with
>>>>> RTM_*LINK?
>>>>>
>>>>
>>>>
>>>> To put this another way, the ALT_NAME is an attribute of an object - a
>>>> LINK. It is *not* a separate object which requires its own set of
>>>> commands for manipulating.
>>> 
>>> Okay, again, could you provide example of a message to add/remove
>>> altname using existing setlink message? Thanks!
>>> 
>>
>>Examples from your cover letter with updates
>>
>>$ ip link set dummy0 altname someothername
>>$ ip link set dummy0 altname someotherveryveryveryverylongname
>>
>>$ ip link set dummy0 del altname someothername
>>$ ip link set dummy0 del altname someotherveryveryveryverylongname
>>
>>This is syntactic sugar for what is really happening:
>>
>>RTM_NEWLINK, dummy0, IFLA_ALT_IFNAME
>>
>>if you are allowing many alt names, then yes, you need a flag to say
>>delete this specific one which is covered by Roopa's nested suggestion.
>
>Yeah, so you need an op inside the message. We are on the same page,
>thanks.

DaveA, Roopa. Do you insist on doing add/remove of altnames in the
existing setlink command using embedded message op attrs? I'm asking
because after some time thinking about it, it still feels wrong to me :/

If this were a generic netlink API, we would just add a couple more
commands. What is so different that we can't add commands here?
The code is also much simpler: easy error handling, no need for
rollback, no possibly inconsistent state, etc.



* Re: [PATCH v2 net-next 2/2] net: dsa: tag_8021q: Restore bridge VLANs when enabling vlan_filtering
From: Vladimir Oltean @ 2019-08-26 16:13 UTC (permalink / raw)
  To: Vivien Didelot
  Cc: Florian Fainelli, Andrew Lunn, Ido Schimmel, Roopa Prabhu,
	nikolay, David S. Miller, netdev
In-Reply-To: <20190826112049.GB27025@t480s.localdomain>

Hi Vivien,

On Mon, 26 Aug 2019 at 18:20, Vivien Didelot <vivien.didelot@gmail.com> wrote:
>
> Hi Vladimir,
>
> On Sun, 25 Aug 2019 21:44:54 +0300, Vladimir Oltean <olteanv@gmail.com> wrote:
> > -     if (enabled)
> > -             err = dsa_port_vid_add(upstream_dp, tx_vid, 0);
> > -     else
> > -             err = dsa_port_vid_del(upstream_dp, tx_vid);
> > +     err = dsa_8021q_vid_apply(ds, upstream, tx_vid, 0, enabled);
> >       if (err) {
> >               dev_err(ds->dev, "Failed to apply TX VID %d on port %d: %d\n",
> >                       tx_vid, upstream, err);
> >               return err;
> >       }
> >
> > -     return 0;
> > +     if (!enabled)
> > +             err = dsa_8021q_restore_pvid(ds, port);
> > +
> > +     return err;
> >  }
>
> I did not dig that much into tag_8021q.c yet. From seeing this portion,
> I'm just wondering if these two helpers couldn't be part of the same logic
> as they both act upon the "enabled" condition?
>
> Otherwise I have no complaints about the series.
>

I thought too about trying to merge the 2 into the same function (not
a lot, though).
But consider that they do different things in the "!enabled" case:
- dsa_8021q_vid_apply: check if this specific vid (provided as
argument) was installed in the bridge, and if so, restore it
- dsa_8021q_restore_pvid: search for the bridge port's pvid, and restore that
I don't think that the end result will look cleaner if I merge these 2 things.

>
> Thanks,
>
>         Vivien

Thanks,
-Vladimir


* Re: [PATCH bpf] nfp: bpf: fix latency bug when updating stack index register
From: Alexei Starovoitov @ 2019-08-26 16:18 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Song Liu, Daniel Borkmann, bpf, Networking, OSS Drivers,
	Jiong Wang
In-Reply-To: <CAJpBn1z736w5_uv7apwyy82vzcnc9c5Gua_9ZyUy-pSEwnQewA@mail.gmail.com>

On Mon, Aug 26, 2019 at 8:57 AM Jakub Kicinski
<jakub.kicinski@netronome.com> wrote:
>
> On Sun, Aug 25, 2019 at 10:37 PM Song Liu <liu.song.a23@gmail.com> wrote:
> > On Fri, Aug 23, 2019 at 7:04 PM Jakub Kicinski wrote:
> > > From: Jiong Wang <jiong.wang@netronome.com>
> > >
> > > NFP is using Local Memory to model stack. LM_addr could be used as base of
> > > a 16 32-bit word region of Local Memory. Then, if the stack offset is
> > > beyond the current region, the local index needs to be updated. The update
> > > needs at least three cycles to take effect, therefore the sequence normally
> > > looks like:
> > >
> > >   local_csr_wr[ActLMAddr3, gprB_5]
> > >   nop
> > >   nop
> > >   nop
> > >
> > > If the local index switch happens on a narrow load, then the instruction
> > > preparing value to zero high 32-bit of the destination register could be
> > > counted as one cycle, the sequence then could be something like:
> > >
> > >   local_csr_wr[ActLMAddr3, gprB_5]
> > >   nop
> > >   nop
> > >   immed[gprB_5, 0]
> > >
> > > However, we have zero extension optimization that zeroing high 32-bit could
> > > be eliminated, therefore above IMMED insn won't be available for which case
> > > the first sequence needs to be generated.
> > >
> > > Fixes: 0b4de1ff19bf ("nfp: bpf: eliminate zero extension code-gen")
> > > Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
> > > Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> > I haven't looked into the code yet. But ^^^ should be
> >
> > Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> >
> > right?
>
> I prefer Review on code I review, ack on code I ack, and sign-off on
> code I co-author.

I believe if you're sending somebody else's patch you have to add your SOB
in addition to their 'Author:' and their SOB fields.


* [PATCH net] tcp: remove empty skb from write queue in error cases
From: Eric Dumazet @ 2019-08-26 16:19 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Eric Dumazet, Soheil Hassas Yeganeh, Neal Cardwell,
	Eric Dumazet, Jason Baron, Vladimir Rutsky

Vladimir Rutsky reported stuck TCP sessions after memory pressure
events. An edge-triggered epoll() user would never receive an EPOLLOUT
notification allowing them to retry a sendmsg().

Jason tested the case of sk_stream_alloc_skb() returning NULL,
but there are other paths that could lead both sendmsg() and sendpage()
to return -1 (EAGAIN), with an empty skb queued on the write queue.

This patch makes sure we remove this empty skb so that
Jason's code can detect that the queue is empty, and
call sk->sk_write_space(sk) accordingly.

Fixes: ce5ec440994b ("tcp: ensure epoll edge trigger wakeup when write queue is empty")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jason Baron <jbaron@akamai.com>
Reported-by: Vladimir Rutsky <rutsky@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
---
 net/ipv4/tcp.c | 30 ++++++++++++++++++++----------
 1 file changed, 20 insertions(+), 10 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 77b485d60b9d0e00edc4e2f0d6c5bb3a9460b23b..61082065b26a068975c411b74eb46739ab0632ca 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -935,6 +935,22 @@ static int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
 	return mss_now;
 }
 
+/* In some cases, both sendpage() and sendmsg() could have added
+ * an skb to the write queue, but failed adding payload on it.
+ * We need to remove it to consume less memory, but more
+ * importantly be able to generate EPOLLOUT for Edge Trigger epoll()
+ * users.
+ */
+static void tcp_remove_empty_skb(struct sock *sk, struct sk_buff *skb)
+{
+	if (skb && !skb->len) {
+		tcp_unlink_write_queue(skb, sk);
+		if (tcp_write_queue_empty(sk))
+			tcp_chrono_stop(sk, TCP_CHRONO_BUSY);
+		sk_wmem_free_skb(sk, skb);
+	}
+}
+
 ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
 			 size_t size, int flags)
 {
@@ -1064,6 +1080,7 @@ ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
 	return copied;
 
 do_error:
+	tcp_remove_empty_skb(sk, tcp_write_queue_tail(sk));
 	if (copied)
 		goto out;
 out_err:
@@ -1388,18 +1405,11 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 	sock_zerocopy_put(uarg);
 	return copied + copied_syn;
 
+do_error:
+	skb = tcp_write_queue_tail(sk);
 do_fault:
-	if (!skb->len) {
-		tcp_unlink_write_queue(skb, sk);
-		/* It is the one place in all of TCP, except connection
-		 * reset, where we can be unlinking the send_head.
-		 */
-		if (tcp_write_queue_empty(sk))
-			tcp_chrono_stop(sk, TCP_CHRONO_BUSY);
-		sk_wmem_free_skb(sk, skb);
-	}
+	tcp_remove_empty_skb(sk, skb);
 
-do_error:
 	if (copied + copied_syn)
 		goto out;
 out_err:
-- 
2.23.0.187.g17f5b7556c-goog
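
The effect of the patch can be illustrated with a small user-space model. It is a sketch under simplifying assumptions: the write queue is a fixed array of payload lengths, "calling sk->sk_write_space(sk)" is reduced to setting a flag, and all names (model_sock, model_remove_empty_tail, model_send_error) are hypothetical rather than kernel symbols.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_QUEUE_MAX 8

/* Hypothetical stand-in for a socket's write queue: each slot holds the
 * payload length of one queued skb, and the last used slot is the tail.
 */
struct model_sock {
	size_t skb_len[MODEL_QUEUE_MAX];
	unsigned int qlen;
	bool write_space_called; /* models sk->sk_write_space(sk) having run */
};

/* Queue a new skb with the given payload length; returns false if full. */
bool model_queue_skb(struct model_sock *sk, size_t len)
{
	if (sk->qlen == MODEL_QUEUE_MAX)
		return false;
	sk->skb_len[sk->qlen++] = len;
	return true;
}

/* Models the helper added by the patch: drop the tail skb if it carries no
 * payload, so an empty skb never keeps the queue looking non-empty.
 */
void model_remove_empty_tail(struct model_sock *sk)
{
	if (sk->qlen && sk->skb_len[sk->qlen - 1] == 0)
		sk->qlen--;
}

/* Models the error exit of a send call that copied nothing: clean up first,
 * then wake the writer only if the queue is really empty.
 */
void model_send_error(struct model_sock *sk)
{
	model_remove_empty_tail(sk);
	if (sk->qlen == 0)
		sk->write_space_called = true;
}
```

Without the cleanup step, a queue holding only an empty skb would fail the emptiness test, the wakeup would be skipped, and an edge-triggered waiter would have nothing left to trigger it.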



* Re: [PATCH bpf] nfp: bpf: fix latency bug when updating stack index register
From: Daniel Borkmann @ 2019-08-26 16:25 UTC (permalink / raw)
  To: Alexei Starovoitov, Jakub Kicinski
  Cc: Song Liu, bpf, Networking, OSS Drivers, Jiong Wang
In-Reply-To: <CAADnVQ++TEUK=Cb3sCyunFyYFcpXu=NK71P4-1rEWEGCGewU7A@mail.gmail.com>

On 8/26/19 6:18 PM, Alexei Starovoitov wrote:
> On Mon, Aug 26, 2019 at 8:57 AM Jakub Kicinski
> <jakub.kicinski@netronome.com> wrote:
>> On Sun, Aug 25, 2019 at 10:37 PM Song Liu <liu.song.a23@gmail.com> wrote:
>>> On Fri, Aug 23, 2019 at 7:04 PM Jakub Kicinski wrote:
>>>> From: Jiong Wang <jiong.wang@netronome.com>
>>>>
>>>> NFP is using Local Memory to model stack. LM_addr could be used as base of
>>>> a 16 32-bit word region of Local Memory. Then, if the stack offset is
>>>> beyond the current region, the local index needs to be updated. The update
>>>> needs at least three cycles to take effect, therefore the sequence normally
>>>> looks like:
>>>>
>>>>    local_csr_wr[ActLMAddr3, gprB_5]
>>>>    nop
>>>>    nop
>>>>    nop
>>>>
>>>> If the local index switch happens on a narrow load, then the instruction
>>>> preparing value to zero high 32-bit of the destination register could be
>>>> counted as one cycle, the sequence then could be something like:
>>>>
>>>>    local_csr_wr[ActLMAddr3, gprB_5]
>>>>    nop
>>>>    nop
>>>>    immed[gprB_5, 0]
>>>>
>>>> However, we have zero extension optimization that zeroing high 32-bit could
>>>> be eliminated, therefore above IMMED insn won't be available for which case
>>>> the first sequence needs to be generated.
>>>>
>>>> Fixes: 0b4de1ff19bf ("nfp: bpf: eliminate zero extension code-gen")
>>>> Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
>>>> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
>>> I haven't looked into the code yet. But ^^^ should be
>>>
>>> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
>>>
>>> right?
>>
>> I prefer Review on code I review, ack on code I ack, and sign-off on
>> code I co-author.
> 
> I believe if you're sending somebody else's patch you have to add your SOB
> in addition to their 'Author:' and their SOB fields.

+1, for co-authoring there's a 'Co-authored-by:' tag which seems to be frequently
used these days.


* [bpf-next, v2] samples: bpf: add max_pckt_size option at xdp_adjust_tail
From: Daniel T. Lee @ 2019-08-26 16:25 UTC (permalink / raw)
  To: Daniel Borkmann, Alexei Starovoitov; +Cc: netdev

Currently, at xdp_adjust_tail_kern.c, MAX_PCKT_SIZE is limited
to 600. To make this size flexible, a new map 'pcktsz' is added.

By updating new packet size to this map from the userland,
xdp_adjust_tail_kern.o will use this value as a new max_pckt_size.

If no '-P <MAX_PCKT_SIZE>' option is used, the size of maximum packet
will be 600 as a default.

Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>

---
Changes in v2:
    - Change the helper to fetch map from 'bpf_map__next' to 
    'bpf_object__find_map_fd_by_name'.

 samples/bpf/xdp_adjust_tail_kern.c | 23 +++++++++++++++++++----
 samples/bpf/xdp_adjust_tail_user.c | 27 +++++++++++++++++++++------
 2 files changed, 40 insertions(+), 10 deletions(-)

diff --git a/samples/bpf/xdp_adjust_tail_kern.c b/samples/bpf/xdp_adjust_tail_kern.c
index 411fdb21f8bc..d6d84ffe6a7a 100644
--- a/samples/bpf/xdp_adjust_tail_kern.c
+++ b/samples/bpf/xdp_adjust_tail_kern.c
@@ -25,6 +25,13 @@
 #define ICMP_TOOBIG_SIZE 98
 #define ICMP_TOOBIG_PAYLOAD_SIZE 92
 
+struct bpf_map_def SEC("maps") pcktsz = {
+	.type = BPF_MAP_TYPE_ARRAY,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(__u32),
+	.max_entries = 1,
+};
+
 struct bpf_map_def SEC("maps") icmpcnt = {
 	.type = BPF_MAP_TYPE_ARRAY,
 	.key_size = sizeof(__u32),
@@ -64,7 +71,8 @@ static __always_inline void ipv4_csum(void *data_start, int data_size,
 	*csum = csum_fold_helper(*csum);
 }
 
-static __always_inline int send_icmp4_too_big(struct xdp_md *xdp)
+static __always_inline int send_icmp4_too_big(struct xdp_md *xdp,
+					      __u32 max_pckt_size)
 {
 	int headroom = (int)sizeof(struct iphdr) + (int)sizeof(struct icmphdr);
 
@@ -92,7 +100,7 @@ static __always_inline int send_icmp4_too_big(struct xdp_md *xdp)
 	orig_iph = data + off;
 	icmp_hdr->type = ICMP_DEST_UNREACH;
 	icmp_hdr->code = ICMP_FRAG_NEEDED;
-	icmp_hdr->un.frag.mtu = htons(MAX_PCKT_SIZE-sizeof(struct ethhdr));
+	icmp_hdr->un.frag.mtu = htons(max_pckt_size - sizeof(struct ethhdr));
 	icmp_hdr->checksum = 0;
 	ipv4_csum(icmp_hdr, ICMP_TOOBIG_PAYLOAD_SIZE, &csum);
 	icmp_hdr->checksum = csum;
@@ -118,14 +126,21 @@ static __always_inline int handle_ipv4(struct xdp_md *xdp)
 {
 	void *data_end = (void *)(long)xdp->data_end;
 	void *data = (void *)(long)xdp->data;
+	__u32 max_pckt_size = MAX_PCKT_SIZE;
+	__u32 *pckt_sz;
+	__u32 key = 0;
 	int pckt_size = data_end - data;
 	int offset;
 
-	if (pckt_size > MAX_PCKT_SIZE) {
+	pckt_sz = bpf_map_lookup_elem(&pcktsz, &key);
+	if (pckt_sz && *pckt_sz)
+		max_pckt_size = *pckt_sz;
+
+	if (pckt_size > max_pckt_size) {
 		offset = pckt_size - ICMP_TOOBIG_SIZE;
 		if (bpf_xdp_adjust_tail(xdp, 0 - offset))
 			return XDP_PASS;
-		return send_icmp4_too_big(xdp);
+		return send_icmp4_too_big(xdp, max_pckt_size);
 	}
 	return XDP_PASS;
 }
diff --git a/samples/bpf/xdp_adjust_tail_user.c b/samples/bpf/xdp_adjust_tail_user.c
index a3596b617c4c..29ade7caf841 100644
--- a/samples/bpf/xdp_adjust_tail_user.c
+++ b/samples/bpf/xdp_adjust_tail_user.c
@@ -72,6 +72,7 @@ static void usage(const char *cmd)
 	printf("Usage: %s [...]\n", cmd);
 	printf("    -i <ifname|ifindex> Interface\n");
 	printf("    -T <stop-after-X-seconds> Default: 0 (forever)\n");
+	printf("    -P <MAX_PCKT_SIZE> Default: 600\n");
 	printf("    -S use skb-mode\n");
 	printf("    -N enforce native mode\n");
 	printf("    -F force loading prog\n");
@@ -85,13 +86,14 @@ int main(int argc, char **argv)
 		.prog_type	= BPF_PROG_TYPE_XDP,
 	};
 	unsigned char opt_flags[256] = {};
-	const char *optstr = "i:T:SNFh";
+	const char *optstr = "i:T:P:SNFh";
 	struct bpf_prog_info info = {};
 	__u32 info_len = sizeof(info);
+	__u32 max_pckt_size = 0;
+	__u32 key = 0;
 	unsigned int kill_after_s = 0;
 	int i, prog_fd, map_fd, opt;
 	struct bpf_object *obj;
-	struct bpf_map *map;
 	char filename[256];
 	int err;
 
@@ -110,6 +112,9 @@ int main(int argc, char **argv)
 		case 'T':
 			kill_after_s = atoi(optarg);
 			break;
+		case 'P':
+			max_pckt_size = atoi(optarg);
+			break;
 		case 'S':
 			xdp_flags |= XDP_FLAGS_SKB_MODE;
 			break;
@@ -150,12 +155,22 @@ int main(int argc, char **argv)
 	if (bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd))
 		return 1;
 
-	map = bpf_map__next(NULL, obj);
-	if (!map) {
-		printf("finding a map in obj file failed\n");
+	/* update pcktsz map */
+	if (max_pckt_size) {
+		map_fd = bpf_object__find_map_fd_by_name(obj, "pcktsz");
+		if (map_fd < 0) {
+			printf("finding a pcktsz map in obj file failed\n");
+			return 1;
+		}
+		bpf_map_update_elem(map_fd, &key, &max_pckt_size, BPF_ANY);
+	}
+
+	/* fetch icmpcnt map */
+	map_fd = bpf_object__find_map_fd_by_name(obj, "icmpcnt");
+	if (map_fd < 0) {
+		printf("finding a icmpcnt map in obj file failed\n");
 		return 1;
 	}
-	map_fd = bpf_map__fd(map);
 
 	if (!prog_fd) {
 		printf("load_bpf_file: %s\n", strerror(errno));
-- 
2.20.1
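
The kernel-side logic of this patch boils down to two small decisions: which limit applies, and which MTU is advertised in the ICMP "fragmentation needed" reply. The sketch below restates them as plain C so they can be checked outside of BPF. The function names are hypothetical, the map lookup is replaced by a pointer that may be NULL, and the byte-order conversion done by htons() in the patch is omitted. The default of 600 comes from the commit message, and 14 is the size of struct ethhdr.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_DEFAULT_MAX_PCKT_SIZE 600u /* default from the commit message */
#define MODEL_ETH_HEADER_LEN 14u         /* sizeof(struct ethhdr) */

/* Pick the limit: a missing map entry or a stored 0 both mean "use the
 * default", mirroring the "pckt_sz && *pckt_sz" test in the patch.
 */
uint32_t model_effective_max(const uint32_t *map_value)
{
	if (map_value && *map_value)
		return *map_value;
	return MODEL_DEFAULT_MAX_PCKT_SIZE;
}

/* MTU reported to the sender: the limit minus the Ethernet header,
 * in host byte order here.
 */
uint16_t model_advertised_mtu(uint32_t max_pckt_size)
{
	return (uint16_t)(max_pckt_size - MODEL_ETH_HEADER_LEN);
}

/* A packet is trimmed and answered with ICMP only if it exceeds the limit. */
bool model_is_too_big(uint32_t pckt_size, const uint32_t *map_value)
{
	return pckt_size > model_effective_max(map_value);
}
```

One consequence worth noting: because 0 is treated as "unset", a user cannot configure a limit of 0, and a value below the Ethernet header length would wrap in the MTU subtraction, so a user-space range check on the -P argument may be worth considering.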



* Re: [PATCH] samples: bpf: add max_pckt_size option at xdp_adjust_tail
From: Daniel T. Lee @ 2019-08-26 16:26 UTC (permalink / raw)
  To: Maciej Fijalkowski; +Cc: Daniel Borkmann, Alexei Starovoitov, netdev
In-Reply-To: <20190826175420.000021f3@gmail.com>

On Tue, Aug 27, 2019 at 12:54 AM Maciej Fijalkowski
<maciejromanfijalkowski@gmail.com> wrote:
>
> On Mon, 26 Aug 2019 18:57:22 +0900
> "Daniel T. Lee" <danieltimlee@gmail.com> wrote:
>
> > Currently, at xdp_adjust_tail_kern.c, MAX_PCKT_SIZE is limited
> > to 600. To make this size flexible, a new map 'pcktsz' is added.
> >
> > By updating new packet size to this map from the userland,
> > xdp_adjust_tail_kern.o will use this value as a new max_pckt_size.
> >
> > If no '-P <MAX_PCKT_SIZE>' option is used, the size of maximum packet
> > will be 600 as a default.
> >
> > Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
> > ---
> >  samples/bpf/xdp_adjust_tail_kern.c | 23 +++++++++++++++++++----
> >  samples/bpf/xdp_adjust_tail_user.c | 21 +++++++++++++++++++--
> >  2 files changed, 38 insertions(+), 6 deletions(-)
> >
> > diff --git a/samples/bpf/xdp_adjust_tail_kern.c b/samples/bpf/xdp_adjust_tail_kern.c
> > index 411fdb21f8bc..4d53af370b68 100644
> > --- a/samples/bpf/xdp_adjust_tail_kern.c
> > +++ b/samples/bpf/xdp_adjust_tail_kern.c
> > @@ -25,6 +25,13 @@
> >  #define ICMP_TOOBIG_SIZE 98
> >  #define ICMP_TOOBIG_PAYLOAD_SIZE 92
> >
> > +struct bpf_map_def SEC("maps") pcktsz = {
> > +     .type = BPF_MAP_TYPE_ARRAY,
> > +     .key_size = sizeof(__u32),
> > +     .value_size = sizeof(__u32),
> > +     .max_entries = 1,
> > +};
> > +
> >  struct bpf_map_def SEC("maps") icmpcnt = {
> >       .type = BPF_MAP_TYPE_ARRAY,
> >       .key_size = sizeof(__u32),
> > @@ -64,7 +71,8 @@ static __always_inline void ipv4_csum(void *data_start, int data_size,
> >       *csum = csum_fold_helper(*csum);
> >  }
> >
> > -static __always_inline int send_icmp4_too_big(struct xdp_md *xdp)
> > +static __always_inline int send_icmp4_too_big(struct xdp_md *xdp,
> > +                                           __u32 max_pckt_size)
> >  {
> >       int headroom = (int)sizeof(struct iphdr) + (int)sizeof(struct icmphdr);
> >
> > @@ -92,7 +100,7 @@ static __always_inline int send_icmp4_too_big(struct xdp_md *xdp)
> >       orig_iph = data + off;
> >       icmp_hdr->type = ICMP_DEST_UNREACH;
> >       icmp_hdr->code = ICMP_FRAG_NEEDED;
> > -     icmp_hdr->un.frag.mtu = htons(MAX_PCKT_SIZE-sizeof(struct ethhdr));
> > +     icmp_hdr->un.frag.mtu = htons(max_pckt_size - sizeof(struct ethhdr));
> >       icmp_hdr->checksum = 0;
> >       ipv4_csum(icmp_hdr, ICMP_TOOBIG_PAYLOAD_SIZE, &csum);
> >       icmp_hdr->checksum = csum;
> > @@ -118,14 +126,21 @@ static __always_inline int handle_ipv4(struct xdp_md *xdp)
> >  {
> >       void *data_end = (void *)(long)xdp->data_end;
> >       void *data = (void *)(long)xdp->data;
> > +     __u32 max_pckt_size = MAX_PCKT_SIZE;
> > +     __u32 *pckt_sz;
> > +     __u32 key = 0;
> >       int pckt_size = data_end - data;
> >       int offset;
> >
> > -     if (pckt_size > MAX_PCKT_SIZE) {
> > +     pckt_sz = bpf_map_lookup_elem(&pcktsz, &key);
> > +     if (pckt_sz && *pckt_sz)
> > +             max_pckt_size = *pckt_sz;
> > +
> > +     if (pckt_size > max_pckt_size) {
> >               offset = pckt_size - ICMP_TOOBIG_SIZE;
> >               if (bpf_xdp_adjust_tail(xdp, 0 - offset))
> >                       return XDP_PASS;
> > -             return send_icmp4_too_big(xdp);
> > +             return send_icmp4_too_big(xdp, max_pckt_size);
> >       }
> >       return XDP_PASS;
> >  }
> > diff --git a/samples/bpf/xdp_adjust_tail_user.c b/samples/bpf/xdp_adjust_tail_user.c
> > index a3596b617c4c..dd3befa5e1fe 100644
> > --- a/samples/bpf/xdp_adjust_tail_user.c
> > +++ b/samples/bpf/xdp_adjust_tail_user.c
> > @@ -72,6 +72,7 @@ static void usage(const char *cmd)
> >       printf("Usage: %s [...]\n", cmd);
> >       printf("    -i <ifname|ifindex> Interface\n");
> >       printf("    -T <stop-after-X-seconds> Default: 0 (forever)\n");
> > +     printf("    -P <MAX_PCKT_SIZE> Default: 600\n");
> >       printf("    -S use skb-mode\n");
> >       printf("    -N enforce native mode\n");
> >       printf("    -F force loading prog\n");
> > @@ -85,9 +86,11 @@ int main(int argc, char **argv)
> >               .prog_type      = BPF_PROG_TYPE_XDP,
> >       };
> >       unsigned char opt_flags[256] = {};
> > -     const char *optstr = "i:T:SNFh";
> > +     const char *optstr = "i:T:P:SNFh";
> >       struct bpf_prog_info info = {};
> >       __u32 info_len = sizeof(info);
> > +     __u32 max_pckt_size = 0;
> > +     __u32 key = 0;
> >       unsigned int kill_after_s = 0;
> >       int i, prog_fd, map_fd, opt;
> >       struct bpf_object *obj;
> > @@ -110,6 +113,9 @@ int main(int argc, char **argv)
> >               case 'T':
> >                       kill_after_s = atoi(optarg);
> >                       break;
> > +             case 'P':
> > +                     max_pckt_size = atoi(optarg);
> > +                     break;
> >               case 'S':
> >                       xdp_flags |= XDP_FLAGS_SKB_MODE;
> >                       break;
> > @@ -150,9 +156,20 @@ int main(int argc, char **argv)
> >       if (bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd))
> >               return 1;
> >
> > +     /* update pcktsz map */
> >       map = bpf_map__next(NULL, obj);
> >       if (!map) {
> > -             printf("finding a map in obj file failed\n");
> > +             printf("finding a pcktsz map in obj file failed\n");
> > +             return 1;
> > +     }
> > +     map_fd = bpf_map__fd(map);
>
> Consider using bpf_object__find_map_fd_by_name() here.
>
> > +     if (max_pckt_size)
> > +             bpf_map_update_elem(map_fd, &key, &max_pckt_size, BPF_ANY);
> > +
> > +     /* fetch icmpcnt map */
> > +     map = bpf_map__next(map, obj);
> > +     if (!map) {
> > +             printf("finding a icmpcnt map in obj file failed\n");
> >               return 1;
> >       }
> >       map_fd = bpf_map__fd(map);
>

Thanks for the review!
I'll update it right away.
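
For reference, a minimal sketch of the fallback the patch adds on the
kernel side, modeled in plain C so it compiles outside of BPF. The helper
name effective_max_pckt_size() is made up for illustration; only the 600
default and the "missing or zero means default" rule come from the patch:

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_PCKT_SIZE 600 /* default from xdp_adjust_tail_kern.c */

/* map_value models the pointer returned by bpf_map_lookup_elem():
 * NULL when the lookup fails, pointing at 0 when userspace never
 * passed -P.
 */
static uint32_t effective_max_pckt_size(const uint32_t *map_value)
{
	if (map_value && *map_value)
		return *map_value;
	return MAX_PCKT_SIZE;
}
```

So running without -P leaves the map entry at 0 and the program keeps
using 600.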


* Re: [PATCH bpf-next v2 2/4] xsk: add proper barriers and {READ, WRITE}_ONCE-correctness for state
From: Björn Töpel @ 2019-08-26 16:34 UTC (permalink / raw)
  To: Ilya Maximets, Björn Töpel, ast, daniel, netdev
  Cc: magnus.karlsson, magnus.karlsson, bpf, jonathan.lemon,
	syzbot+c82697e3043781e08802, hdanton
In-Reply-To: <14576fd3-69ce-6493-5a38-c47566851d4e@samsung.com>

On 2019-08-26 17:24, Ilya Maximets wrote:
> This changes the error code a bit.
> Previously:
>     umem exists + xs unbound    --> EINVAL
>     no umem     + xs unbound    --> EBADF
>     xs bound to different dev/q --> EINVAL
> 
> With this change:
>     umem exists + xs unbound    --> EBADF
>     no umem     + xs unbound    --> EBADF
>     xs bound to different dev/q --> EINVAL
> 
> Just a note. Not sure if this is important.
> 

Note that this is for *shared* umem, so it's very seldom used. Still,
you're right that, strictly speaking, this is a uapi break, but I'd
still vote for the change. I find it hard to see that anyone relies on
EINVAL/EBADF for shared umem bind.
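
To make the difference concrete, here is a small model of the table above.
It is only an illustration of the three listed cases, not the xsk_bind()
code; the enum and function names are made up:

```c
#include <errno.h>

/* The three cases from the table (shared umem bind only). */
enum shared_bind_case {
	UMEM_EXISTS_XS_UNBOUND,
	NO_UMEM_XS_UNBOUND,
	XS_BOUND_OTHER_DEV_Q,
};

/* Error code reported before this change. */
static int shared_bind_errno_before(enum shared_bind_case c)
{
	switch (c) {
	case NO_UMEM_XS_UNBOUND:
		return EBADF;
	case UMEM_EXISTS_XS_UNBOUND:
	case XS_BOUND_OTHER_DEV_Q:
	default:
		return EINVAL;
	}
}

/* Error code reported with this change. */
static int shared_bind_errno_after(enum shared_bind_case c)
{
	switch (c) {
	case UMEM_EXISTS_XS_UNBOUND:
	case NO_UMEM_XS_UNBOUND:
		return EBADF;
	case XS_BOUND_OTHER_DEV_Q:
	default:
		return EINVAL;
	}
}
```

Only the first case changes (EINVAL becomes EBADF); the other two are
the same before and after.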

Opinions? :-)


Björn


* Re: [PATCH v2] riscv: add support for SECCOMP and SECCOMP_FILTER
From: David Abdurachmanov @ 2019-08-26 16:39 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Paul Walmsley, Palmer Dabbelt, Albert Ou, Oleg Nesterov,
	Kees Cook, Andy Lutomirski, Will Drewry, Shuah Khan,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David Abdurachmanov, Thomas Gleixner,
	Allison Randal, Alexios Zavras, Anup Patel, Vincent Chen,
	Alan Kao, linux-riscv, linux-kernel, linux-kselftest, netdev, bpf,
	me
In-Reply-To: <20190826145756.GB4664@cisco>

On Mon, Aug 26, 2019 at 7:57 AM Tycho Andersen <tycho@tycho.ws> wrote:
>
> Hi,
>
> On Fri, Aug 23, 2019 at 05:30:53PM -0700, Paul Walmsley wrote:
> > On Thu, 22 Aug 2019, David Abdurachmanov wrote:
> >
> > > There is one failing kernel selftest: global.user_notification_signal
> >
> > Also - could you follow up with the author of this failing test to see if
> > we can get some more clarity about what might be going wrong here?  It
> > appears that the failing test was added in commit 6a21cc50f0c7f ("seccomp:
> > add a return code to trap to userspace") by Tycho Andersen
> > <tycho@tycho.ws>.
>
> Can you post an strace and a cat of /proc/$pid/stack for both tasks
> where it gets stuck? I don't have any riscv hardware, and it "works
> for me" on x86 and arm64 with 100 tries.

I don't have a build with SECCOMP for the board right now, so it
will have to wait. I just finished a new kernel (almost rc6) for Fedora,
but it will take time to assemble new repositories and a disk image.

There is an older disk image available (5.2.0-rc7 kernel with v2 SECCOMP)
for QEMU or libvirt/QEMU:

https://dl.fedoraproject.org/pub/alt/risc-v/disk-images/fedora/rawhide/20190703.n.0/Developer/
https://fedoraproject.org/wiki/Architectures/RISC-V/Installing#Boot_with_libvirt

(If you are interested in trying it locally.)

IIRC I attempted to connect with strace, but it quickly returns and fails
properly. Simply put, strace unblocks whatever is stuck.

david


* Re: [PATCH bpf] nfp: bpf: fix latency bug when updating stack index register
From: Jakub Kicinski @ 2019-08-26 16:41 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Alexei Starovoitov, Song Liu, bpf, Networking, OSS Drivers,
	Jiong Wang
In-Reply-To: <1417962c-e63d-6c46-bf07-9284f5332583@iogearbox.net>

On Mon, 26 Aug 2019 18:25:10 +0200, Daniel Borkmann wrote:
> On 8/26/19 6:18 PM, Alexei Starovoitov wrote:
> > On Mon, Aug 26, 2019 at 8:57 AM Jakub Kicinski
> > <jakub.kicinski@netronome.com> wrote:  
> >> On Sun, Aug 25, 2019 at 10:37 PM Song Liu <liu.song.a23@gmail.com> wrote:  
> >>> On Fri, Aug 23, 2019 at 7:04 PM Jakub Kicinski wrote:  
> >>>> From: Jiong Wang <jiong.wang@netronome.com>
> >>>>
> >>>> NFP is using Local Memory to model stack. LM_addr could be used as base of
> >>>> a 16 32-bit word region of Local Memory. Then, if the stack offset is
> >>>> beyond the current region, the local index needs to be updated. The update
> >>>> needs at least three cycles to take effect, therefore the sequence normally
> >>>> looks like:
> >>>>
> >>>>    local_csr_wr[ActLMAddr3, gprB_5]
> >>>>    nop
> >>>>    nop
> >>>>    nop
> >>>>
> >>>> If the local index switch happens on a narrow loads, then the instruction
> >>>> preparing value to zero high 32-bit of the destination register could be
> >>>> counted as one cycle, the sequence then could be something like:
> >>>>
> >>>>    local_csr_wr[ActLMAddr3, gprB_5]
> >>>>    nop
> >>>>    nop
> >>>>    immed[gprB_5, 0]
> >>>>
> >>>> However, we have zero extension optimization that zeroing high 32-bit could
> >>>> be eliminated, therefore above IMMED insn won't be available for which case
> >>>> the first sequence needs to be generated.
> >>>>
> >>>> Fixes: 0b4de1ff19bf ("nfp: bpf: eliminate zero extension code-gen")
> >>>> Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
> >>>> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>  
> >>> I haven't looked into the code yet. But ^^^ should be
> >>>
> >>> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> >>>
> >>> right?  
> >>
> >> I prefer Review on code I review, ack on code I ack, and sign-off on
> >> code I co-author.  
> > 
> > I believe if you're sending somebody else patch you have to add your SOB
> > in addition to their 'Author:' and their SOB fields.  
> 
> +1, for co-authoring there's a 'Co-authored-by:' tag which seems to be frequently
> used these days.

Ack, there is a difference between co-authoring the code and
co-authoring in the sense of step-by-step guidance. I've been doing
this for 6 years now, and nobody ever complained :)

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>

Is that enough or should I repost?


* Re: Unable to create htb tc classes more than 64K
From: Jesper Dangaard Brouer @ 2019-08-26 16:45 UTC (permalink / raw)
  To: Akshat Kakkar
  Cc: brouer, Cong Wang, NetFilter, lartc, netdev, Eric Dumazet,
	Toke Høiland-Jørgensen, Anton Danilov
In-Reply-To: <CAA5aLPjzX+9YFRGgCgceHjkU0=e6x8YMENfp_cC9fjfHYK3e+A@mail.gmail.com>

On Sun, 18 Aug 2019 00:34:33 +0530
Akshat Kakkar <akshat.1984@gmail.com> wrote:

> My goal is not just to make as many classes as possible, but also to
> use them to do rate limiting per ip per server. Say, I have a list of
> 10000 IPs and more than 100 servers. So simply if I want few IPs to
> get speed of says 1Mbps per server but others say speed of 2 Mbps per
> server. How can I achieve this without having 10000 x 100 classes.
> These numbers can be large than this and hence I am looking for a
> generic solution to this.

As Eric Dumazet also points out indirectly, you will be creating a huge
bottleneck for SMP/multi-core CPUs, as your HTB root qdisc is a
serialization point for all egress traffic that all CPUs will need to
take a lock on.

It sounds like your use-case is not global rate limiting, but instead
the goal is to rate limit customers or services (to something
significantly lower than NIC link speed).  To get scalability, in this
case, you can instead use the MQ qdisc (as Eric also points out).
I have an example script here[1] that shows how to set up MQ as the
root qdisc and add HTB leaves based on how many TX queues the interface
has via /sys/class/net/$DEV/queues/tx-*/

[1] https://github.com/xdp-project/xdp-cpumap-tc/blob/master/bin/tc_mq_htb_setup_example.sh

You are not done yet.  To solve the TX-queue lock contention, the
traffic needs to be redirected to the appropriate/correct TX CPUs. This
can be done either with an RSS (Receive Side Scaling) HW ethtool
adjustment (reduce the hash to L3 IPs only), or RPS (Receive Packet
Steering), or with XDP cpumap redirect.
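
To illustrate the steering requirement (this is only a model, not code
from the repositories referenced in this mail): whichever mechanism is
used, all packets for a given customer IP must end up on the same CPU,
and thus on the same TX queue and HTB leaf. The helper name and the
modulo mapping below are assumptions made for the example:

```c
#include <stdint.h>

/* Hypothetical mapping: pick a CPU purely from the L3 address, so every
 * packet for one customer IP lands on the same CPU. A real setup would
 * rely on RSS, RPS or an XDP cpumap lookup instead of this modulo.
 */
static unsigned int tx_cpu_for_ip(uint32_t customer_ip, unsigned int nr_cpus)
{
	if (nr_cpus == 0)
		return 0;
	return customer_ip % nr_cpus;
}
```

The point is only that the mapping is a pure function of the IP, so the
per-customer HTB class is only ever touched from one CPU.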

The XDP cpumap redirect feature is implemented with XDP+TC BPF code
here[2]. Note that XPS can interfere with this, so there is an XPS
disable script here[3].

[2] https://github.com/xdp-project/xdp-cpumap-tc
[3] https://github.com/xdp-project/xdp-cpumap-tc/blob/master/bin/xps_setup.sh

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [PATCH 1/1] netfilter: nf_tables: fib: Drop IPV6 packages if IPv6 is disabled on boot
From: Leonardo Bras @ 2019-08-26 16:47 UTC (permalink / raw)
  To: Florian Westphal, Pablo Neira Ayuso
  Cc: netfilter-devel, coreteam, netdev, linux-kernel, Jozsef Kadlecsik,
	David S. Miller
In-Reply-To: <20190821095844.me6kscvnfruinseu@salvia>

Hello Pablo, Florian,

I implemented a V2 of this patch with the changes you proposed.
Could you please give your feedback on that patch?
https://lkml.org/lkml/2019/8/21/527

Thanks!

On Wed, 2019-08-21 at 11:58 +0200, Pablo Neira Ayuso wrote:
> On Tue, Aug 20, 2019 at 01:15:58PM -0300, Leonardo Bras wrote:
> > On Tue, 2019-08-20 at 07:36 +0200, Florian Westphal wrote:
> > > Wouldn't fib_netdev.c have the same problem?
> > Probably, but I haven't hit this issue yet.
> > 
> > > If so, might be better to place this test in both
> > > nft_fib6_eval_type and nft_fib6_eval.
> > 
> > I think that is possible, and not very hard to do.
> > 
> > But in my humble viewpoint, it looks like it's nft_fib_inet_eval() and
> > nft_fib_netdev_eval() have the responsibility to choose a valid
> > protocol or drop the package. 
> > I am not sure if it would be a good move to transfer this
> > responsibility to nft_fib6_eval_type() and nft_fib6_eval(), so I would
> > rather add the same test to nft_fib_netdev_eval().
> > 
> > Does it make sense?
> 
> Please, update common code to netdev and ip6 extensions as Florian
> suggests.
> 
> Thanks.



* Re: [PATCH net] tcp: remove empty skb from write queue in error cases
From: Soheil Hassas Yeganeh @ 2019-08-26 16:50 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, netdev, Neal Cardwell, Eric Dumazet,
	Jason Baron, Vladimir Rutsky
In-Reply-To: <20190826161915.81676-1-edumazet@google.com>

On Mon, Aug 26, 2019 at 12:19 PM Eric Dumazet <edumazet@google.com> wrote:
>
> Vladimir Rutsky reported stuck TCP sessions after memory pressure
> events. Edge Trigger epoll() user would never receive an EPOLLOUT
> notification allowing them to retry a sendmsg().
>
> Jason tested the case of sk_stream_alloc_skb() returning NULL,
> but there are other paths that could lead both sendmsg() and sendpage()
> to return -1 (EAGAIN), with an empty skb queued on the write queue.
>
> This patch makes sure we remove this empty skb so that
> Jason code can detect that the queue is empty, and
> call sk->sk_write_space(sk) accordingly.
>
> Fixes: ce5ec440994b ("tcp: ensure epoll edge trigger wakeup when write queue is empty")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Jason Baron <jbaron@akamai.com>
> Reported-by: Vladimir Rutsky <rutsky@google.com>
> Cc: Soheil Hassas Yeganeh <soheil@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>

Acked-by: Soheil Hassas Yeganeh <soheil@google.com>

Nice find!
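
For anyone following along, a toy model of why the leftover empty skb
matters. This is not the kernel code: the struct and helpers below are
simplified stand-ins invented for illustration, with the queue reduced
to an array of payload lengths:

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_QUEUE_MAX 8

/* Toy write queue: len[i] is the payload length of the i-th queued skb. */
struct toy_write_queue {
	size_t len[TOY_QUEUE_MAX];
	size_t count;
};

/* Mirrors the idea of tcp_remove_empty_skb(): drop the tail skb if it
 * carries no payload.
 */
static void toy_remove_empty_tail(struct toy_write_queue *q)
{
	if (q->count && q->len[q->count - 1] == 0)
		q->count--;
}

/* The EPOLLOUT wakeup logic keys off the write queue being empty. */
static bool toy_write_queue_empty(const struct toy_write_queue *q)
{
	return q->count == 0;
}
```

With a single zero-length skb queued, toy_write_queue_empty() is false
until toy_remove_empty_tail() runs, which is the state that left
edge-triggered users waiting for an EPOLLOUT that never came.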

> ---
>  net/ipv4/tcp.c | 30 ++++++++++++++++++++----------
>  1 file changed, 20 insertions(+), 10 deletions(-)
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 77b485d60b9d0e00edc4e2f0d6c5bb3a9460b23b..61082065b26a068975c411b74eb46739ab0632ca 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -935,6 +935,22 @@ static int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
>         return mss_now;
>  }
>
> +/* In some cases, both sendpage() and sendmsg() could have added
> + * an skb to the write queue, but failed adding payload on it.
> + * We need to remove it to consume less memory, but more
> + * importantly be able to generate EPOLLOUT for Edge Trigger epoll()
> + * users.
> + */
> +static void tcp_remove_empty_skb(struct sock *sk, struct sk_buff *skb)
> +{
> +       if (skb && !skb->len) {
> +               tcp_unlink_write_queue(skb, sk);
> +               if (tcp_write_queue_empty(sk))
> +                       tcp_chrono_stop(sk, TCP_CHRONO_BUSY);
> +               sk_wmem_free_skb(sk, skb);
> +       }
> +}
> +
>  ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
>                          size_t size, int flags)
>  {
> @@ -1064,6 +1080,7 @@ ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
>         return copied;
>
>  do_error:
> +       tcp_remove_empty_skb(sk, tcp_write_queue_tail(sk));
>         if (copied)
>                 goto out;
>  out_err:
> @@ -1388,18 +1405,11 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
>         sock_zerocopy_put(uarg);
>         return copied + copied_syn;
>
> +do_error:
> +       skb = tcp_write_queue_tail(sk);
>  do_fault:
> -       if (!skb->len) {
> -               tcp_unlink_write_queue(skb, sk);
> -               /* It is the one place in all of TCP, except connection
> -                * reset, where we can be unlinking the send_head.
> -                */
> -               if (tcp_write_queue_empty(sk))
> -                       tcp_chrono_stop(sk, TCP_CHRONO_BUSY);
> -               sk_wmem_free_skb(sk, skb);
> -       }
> +       tcp_remove_empty_skb(sk, skb);
>
> -do_error:
>         if (copied + copied_syn)
>                 goto out;
>  out_err:
> --
> 2.23.0.187.g17f5b7556c-goog
>


* Re: [patch net-next rfc 3/7] net: rtnetlink: add commands to add and delete alternative ifnames
From: Jakub Kicinski @ 2019-08-26 16:55 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: David Ahern, Roopa Prabhu, netdev, David Miller,
	Stephen Hemminger, dcbw, Michal Kubecek, Andrew Lunn, parav,
	Saeed Mahameed, mlxsw
In-Reply-To: <20190826160916.GE2309@nanopsycho.orion>

On Mon, 26 Aug 2019 18:09:16 +0200, Jiri Pirko wrote:
> DaveA, Roopa. Do you insist on doing add/remove of altnames in the
> existing setlist command using embedded message op attrs? I'm asking
> because after some time thinking about it, it still feels wrong to me :/
> 
> If this would be a generic netlink api, we would just add another couple
> of commands. What is so different we can't add commands here?
> It is also much simpler code. Easy error handling, no need for
> rollback, no possibly inconsistent state, etc.

+1 the separate op feels like a better uapi to me as well.

Perhaps we could redo the iproute2 command line interface to make the
name the primary object? Would that address your concern Dave and Roopa?


* Re: [PATCH v2 0/3] Add NETIF_F_HW_BR_CAP feature
From: Florian Fainelli @ 2019-08-26 17:01 UTC (permalink / raw)
  To: Andrew Lunn, Horatiu Vultur
  Cc: roopa, nikolay, davem, UNGLinuxDriver, alexandre.belloni,
	allan.nielsen, netdev, linux-kernel, bridge
In-Reply-To: <20190826123811.GA13411@lunn.ch>

On 8/26/19 5:38 AM, Andrew Lunn wrote:
> On Mon, Aug 26, 2019 at 10:11:12AM +0200, Horatiu Vultur wrote:
>> When a network port is added to a bridge then the port is added in
>> promisc mode. Some HW that has bridge capabilities(can learn, forward,
>> flood etc the frames) they are disabling promisc mode in the network
>> driver when the port is added to the SW bridge.
>>
>> This patch adds the feature NETIF_F_HW_BR_CAP so that the network ports
>> that have this feature will not be set in promisc mode when they are
>> added to a SW bridge.
>>
>> In this way the HW that has bridge capabilities don't need to send all the
>> traffic to the CPU and can also implement the promisc mode and toggle it
>> using the command 'ip link set dev swp promisc on'
> 
> Hi Horatiu
> 
> I'm still not convinced this is needed. The model is, the hardware is
> there to accelerate what Linux can do in software. Any peculiarities
> of the accelerator should be hidden in the driver.  If the accelerator
> can do its job without needing promisc mode, do that in the driver.
> 
> So you are trying to differentiate between promisc mode because the
> interface is a member of a bridge, and promisc mode because some
> application, like pcap, has asked for promisc mode.
> 
> dev->promiscuity is a counter. So what you can do it look at its
> value, and how the interface is being used. If the interface is not a
> member of a bridge, and the count > 0, enable promisc mode in the
> accelerator. If the interface is a member of a bridge, and the count >
> 1, enable promisc mode in the accelerator.

That is an excellent suggestion actually.
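
A minimal sketch of that check, written as standalone C rather than
driver code. The function name is hypothetical; the two parameters model
dev->promiscuity and whether the interface is a bridge member:

```c
#include <stdbool.h>

/* The bridge itself accounts for one promiscuity reference on its
 * ports, so a bridge port only needs promisc mode in the accelerator
 * when something else (e.g. pcap) asked for it as well.
 */
static bool want_hw_promisc(unsigned int promiscuity, bool is_bridge_port)
{
	return promiscuity > (is_bridge_port ? 1u : 0u);
}
```

A driver would evaluate something like this whenever the promiscuity
count or the bridge membership changes.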

Horatiu, the other issue with your approach here is that the features
don't propagate to/from lower/upper/real devices, so if e.g.: you have a
VLAN interface enslaved as a part of the bridge, or a bond, or a tunnel
interface, the logic won't make us check NETIF_F_HW_BR_CAP because those
virtual network devices won't inherit it from their real device. I am
not suggesting you fix this with your patch series, but rather, seek a
driver local solution.
-- 
Florian


* Re: [PATCH 12/16] arm64: prefer __section from compiler_attributes.h
From: Nick Desaulniers @ 2019-08-26 17:03 UTC (permalink / raw)
  To: Miguel Ojeda
  Cc: Will Deacon, Andrew Morton, Sedat Dilek, Josh Poimboeuf,
	Yonghong Song, clang-built-linux, Catalin Marinas,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Andrey Konovalov, Greg Kroah-Hartman, Enrico Weigelt,
	Suzuki K Poulose, Thomas Gleixner, Masayoshi Mizuma,
	Shaokun Zhang, Alexios Zavras, Allison Randal, Linux ARM,
	linux-kernel, Network Development, bpf
In-Reply-To: <CANiq72mcSniCzMzW6AX_5tG5W2edjEmZ=Rf=jo-Mw3H-9RVJqw@mail.gmail.com>

On Sat, Aug 24, 2019 at 5:48 AM Miguel Ojeda
<miguel.ojeda.sandonis@gmail.com> wrote:
>
> On Sat, Aug 24, 2019 at 1:25 PM Will Deacon <will@kernel.org> wrote:
> >
> > Which bit are you pinging about? This patch (12/16) has been in -next for a
> > while and is queued in the arm64 tree for 5.4. The Oops/boot issue is
> > addressed in patch 14 which probably needs to be sent as a separate patch
> > (with a commit message) if it's targetting 5.3 and, I assume, routed via
> > somebody like akpm.
>
> I was pinging about the bit I was quoting, i.e. whether the Oops in
> the cover letter was #14 indeed. Also, since Nick said he wanted to
> get this ASAP through compiler-attributes, I assumed he wanted it to
> be in 5.3, but I have not seen the independent patch.
>
> Since he seems busy, I will write a better commit message myself and
> send it to Linus next week.

Sorry, very hectic week here last week.  I'll try to get the import
bit split off, collect the acks/reviewed-by tags, and resend a v2 of
the series this week.
-- 
Thanks,
~Nick Desaulniers


* KASAN: slab-out-of-bounds Read in sctp_inq_pop
From: syzbot @ 2019-08-26 17:14 UTC (permalink / raw)
  To: davem, linux-kernel, linux-sctp, marcelo.leitner, netdev, nhorman,
	syzkaller-bugs, vyasevich

Hello,

syzbot found the following crash on:

HEAD commit:    9733a7c6 Add linux-next specific files for 20190823
git tree:       linux-next
console output: https://syzkaller.appspot.com/x/log.txt?x=143ec11e600000
kernel config:  https://syzkaller.appspot.com/x/.config?x=f6c78a1438582bd1
dashboard link: https://syzkaller.appspot.com/bug?extid=3ca06c5cb35ee3fc1f89
compiler:       gcc (GCC) 9.0.0 20181231 (experimental)

Unfortunately, I don't have any reproducer for this crash yet.

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+3ca06c5cb35ee3fc1f89@syzkaller.appspotmail.com

==================================================================
BUG: KASAN: slab-out-of-bounds in sctp_inq_pop+0xafd/0xd80 net/sctp/inqueue.c:201
Read of size 2 at addr ffff8880a4e37222 by task syz-executor.3/32407

CPU: 1 PID: 32407 Comm: syz-executor.3 Not tainted 5.3.0-rc5-next-20190823 #72
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x172/0x1f0 lib/dump_stack.c:113
  print_address_description.cold+0xd4/0x306 mm/kasan/report.c:351
  __kasan_report.cold+0x1b/0x36 mm/kasan/report.c:482
  kasan_report+0x12/0x17 mm/kasan/common.c:610
  __asan_report_load2_noabort+0x14/0x20 mm/kasan/generic_report.c:130
  sctp_inq_pop+0xafd/0xd80 net/sctp/inqueue.c:201
  sctp_endpoint_bh_rcv+0x184/0x8d0 net/sctp/endpointola.c:335
  sctp_inq_push+0x1e4/0x280 net/sctp/inqueue.c:80
  sctp_rcv+0x2807/0x3590 net/sctp/input.c:256
  sctp6_rcv+0x17/0x30 net/sctp/ipv6.c:1049
  ip6_protocol_deliver_rcu+0x2fe/0x1660 net/ipv6/ip6_input.c:397
  ip6_input_finish+0x84/0x170 net/ipv6/ip6_input.c:438
  NF_HOOK include/linux/netfilter.h:305 [inline]
  NF_HOOK include/linux/netfilter.h:299 [inline]
  ip6_input+0xe4/0x3f0 net/ipv6/ip6_input.c:447
  dst_input include/net/dst.h:442 [inline]
  ip6_sublist_rcv_finish+0x98/0x1e0 net/ipv6/ip6_input.c:84
  ip6_list_rcv_finish net/ipv6/ip6_input.c:118 [inline]
  ip6_sublist_rcv+0x80c/0xcf0 net/ipv6/ip6_input.c:282
  ipv6_list_rcv+0x373/0x4b0 net/ipv6/ip6_input.c:316
  __netif_receive_skb_list_ptype net/core/dev.c:5049 [inline]
  __netif_receive_skb_list_core+0x1a2/0x9d0 net/core/dev.c:5087
  __netif_receive_skb_list net/core/dev.c:5149 [inline]
  netif_receive_skb_list_internal+0x7eb/0xe60 net/core/dev.c:5244
  gro_normal_list.part.0+0x1e/0xb0 net/core/dev.c:5757
  gro_normal_list net/core/dev.c:5755 [inline]
  gro_normal_one net/core/dev.c:5769 [inline]
  napi_frags_finish net/core/dev.c:5782 [inline]
  napi_gro_frags+0xa6a/0xea0 net/core/dev.c:5855
  tun_get_user+0x2e98/0x3fa0 drivers/net/tun.c:1974
  tun_chr_write_iter+0xbd/0x156 drivers/net/tun.c:2020
  call_write_iter include/linux/fs.h:1890 [inline]
  do_iter_readv_writev+0x5f8/0x8f0 fs/read_write.c:693
  do_iter_write fs/read_write.c:976 [inline]
  do_iter_write+0x17b/0x380 fs/read_write.c:957
  vfs_writev+0x1b3/0x2f0 fs/read_write.c:1021
  do_writev+0x15b/0x330 fs/read_write.c:1064
  __do_sys_writev fs/read_write.c:1137 [inline]
  __se_sys_writev fs/read_write.c:1134 [inline]
  __x64_sys_writev+0x75/0xb0 fs/read_write.c:1134
  do_syscall_64+0xfa/0x760 arch/x86/entry/common.c:290
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x459731
Code: 75 14 b8 14 00 00 00 0f 05 48 3d 01 f0 ff ff 0f 83 34 b9 fb ff c3 48 83 ec 08 e8 fa 2c 00 00 48 89 04 24 b8 14 00 00 00 0f 05 <48> 8b 3c 24 48 89 c2 e8 43 2d 00 00 48 89 d0 48 83 c4 08 48 3d 01
RSP: 002b:00007fb4cd361ba0 EFLAGS: 00000293 ORIG_RAX: 0000000000000014
RAX: ffffffffffffffda RBX: 000000000000002a RCX: 0000000000459731
RDX: 0000000000000001 RSI: 00007fb4cd361c00 RDI: 00000000000000f0
RBP: 000000000075bf20 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000293 R12: 00007fb4cd3626d4
R13: 00000000004c87e3 R14: 00000000004df640 R15: 00000000ffffffff

Allocated by task 32407:
  save_stack+0x23/0x90 mm/kasan/common.c:69
  set_track mm/kasan/common.c:77 [inline]
  __kasan_kmalloc mm/kasan/common.c:486 [inline]
  __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:459
  kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:494
  slab_post_alloc_hook mm/slab.h:584 [inline]
  slab_alloc mm/slab.c:3319 [inline]
  kmem_cache_alloc+0x121/0x710 mm/slab.c:3483
  __build_skb+0x26/0x70 net/core/skbuff.c:310
  __napi_alloc_skb+0x1d2/0x300 net/core/skbuff.c:523
  napi_alloc_skb include/linux/skbuff.h:2801 [inline]
  napi_get_frags net/core/dev.c:5742 [inline]
  napi_get_frags+0x65/0x140 net/core/dev.c:5737
  tun_napi_alloc_frags drivers/net/tun.c:1473 [inline]
  tun_get_user+0x16bd/0x3fa0 drivers/net/tun.c:1834
  tun_chr_write_iter+0xbd/0x156 drivers/net/tun.c:2020
  call_write_iter include/linux/fs.h:1890 [inline]
  do_iter_readv_writev+0x5f8/0x8f0 fs/read_write.c:693
  do_iter_write fs/read_write.c:976 [inline]
  do_iter_write+0x17b/0x380 fs/read_write.c:957
  vfs_writev+0x1b3/0x2f0 fs/read_write.c:1021
  do_writev+0x15b/0x330 fs/read_write.c:1064
  __do_sys_writev fs/read_write.c:1137 [inline]
  __se_sys_writev fs/read_write.c:1134 [inline]
  __x64_sys_writev+0x75/0xb0 fs/read_write.c:1134
  do_syscall_64+0xfa/0x760 arch/x86/entry/common.c:290
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 3891:
  save_stack+0x23/0x90 mm/kasan/common.c:69
  set_track mm/kasan/common.c:77 [inline]
  __kasan_slab_free+0x102/0x150 mm/kasan/common.c:448
  kasan_slab_free+0xe/0x10 mm/kasan/common.c:456
  __cache_free mm/slab.c:3425 [inline]
  kmem_cache_free+0x86/0x320 mm/slab.c:3693
  kfree_skbmem net/core/skbuff.c:623 [inline]
  kfree_skbmem+0xc5/0x150 net/core/skbuff.c:617
  __kfree_skb net/core/skbuff.c:680 [inline]
  consume_skb net/core/skbuff.c:838 [inline]
  consume_skb+0x103/0x3b0 net/core/skbuff.c:832
  skb_free_datagram+0x1b/0x100 net/core/datagram.c:328
  netlink_recvmsg+0x6c6/0xf50 net/netlink/af_netlink.c:1996
  sock_recvmsg_nosec net/socket.c:871 [inline]
  sock_recvmsg net/socket.c:889 [inline]
  sock_recvmsg+0xce/0x110 net/socket.c:885
  ___sys_recvmsg+0x271/0x5a0 net/socket.c:2480
  __sys_recvmsg+0x102/0x1d0 net/socket.c:2537
  __do_sys_recvmsg net/socket.c:2547 [inline]
  __se_sys_recvmsg net/socket.c:2544 [inline]
  __x64_sys_recvmsg+0x78/0xb0 net/socket.c:2544
  do_syscall_64+0xfa/0x760 arch/x86/entry/common.c:290
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

The buggy address belongs to the object at ffff8880a4e37140
  which belongs to the cache skbuff_head_cache of size 224
The buggy address is located 2 bytes to the right of
  224-byte region [ffff8880a4e37140, ffff8880a4e37220)
The buggy address belongs to the page:
page:ffffea0002938dc0 refcount:1 mapcount:0 mapping:ffff88821b6a3a80 index:0x0
flags: 0x1fffc0000000200(slab)
raw: 01fffc0000000200 ffffea000257fa88 ffffea00023a2008 ffff88821b6a3a80
raw: 0000000000000000 ffff8880a4e37000 000000010000000c 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
  ffff8880a4e37100: fc fc fc fc fc fc fc fc 00 00 00 00 00 00 00 00
  ffff8880a4e37180: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> ffff8880a4e37200: 00 00 00 00 fc fc fc fc fc fc fc fc fc fc fc fc
                                ^
  ffff8880a4e37280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ffff8880a4e37300: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
==================================================================
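
The address arithmetic in the report above can be checked with a short
standalone sketch. The helper names are made up for illustration; the
8-byte shadow granule is the one used by generic KASAN, and the
addresses in the usage note are copied from this report:

```c
#include <stdint.h>

#define KASAN_SHADOW_GRANULE 8 /* one shadow byte covers 8 bytes of memory */

/* How far past the end of the object the faulting address lies. */
static uint64_t bytes_past_object(uint64_t obj_start, uint64_t obj_size,
				  uint64_t addr)
{
	return addr - (obj_start + obj_size);
}

/* Index of the shadow byte for addr within a dump row starting at row_base. */
static unsigned int shadow_column(uint64_t row_base, uint64_t addr)
{
	return (unsigned int)((addr - row_base) / KASAN_SHADOW_GRANULE);
}
```

For the object at ffff8880a4e37140 of size 224 and the access at
ffff8880a4e37222, bytes_past_object() gives 2, and shadow_column() for
the row starting at ffff8880a4e37200 gives 4, which is the first fc
(redzone) byte in that row.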


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.


* Re: [PATCH net] tcp: remove empty skb from write queue in error cases
From: Neal Cardwell @ 2019-08-26 17:18 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, netdev, Soheil Hassas Yeganeh, Eric Dumazet,
	Jason Baron, Vladimir Rutsky
In-Reply-To: <20190826161915.81676-1-edumazet@google.com>

On Mon, Aug 26, 2019 at 12:19 PM Eric Dumazet <edumazet@google.com> wrote:
>
> Vladimir Rutsky reported stuck TCP sessions after memory pressure
> events. Edge Trigger epoll() user would never receive an EPOLLOUT
> notification allowing them to retry a sendmsg().
>
> Jason tested the case of sk_stream_alloc_skb() returning NULL,
> but there are other paths that could lead both sendmsg() and sendpage()
> to return -1 (EAGAIN), with an empty skb queued on the write queue.
>
> This patch makes sure we remove this empty skb so that
> Jason code can detect that the queue is empty, and
> call sk->sk_write_space(sk) accordingly.
>
> Fixes: ce5ec440994b ("tcp: ensure epoll edge trigger wakeup when write queue is empty")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Jason Baron <jbaron@akamai.com>
> Reported-by: Vladimir Rutsky <rutsky@google.com>
> Cc: Soheil Hassas Yeganeh <soheil@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> ---

Acked-by: Neal Cardwell <ncardwell@google.com>

Nice detective work. :-) Thanks, Eric!

neal


* [PATCH v1 1/3] can: mcp251x: Use devm_clk_get_optional() to get the input clock
From: Andy Shevchenko @ 2019-08-26 17:26 UTC (permalink / raw)
  To: Wolfgang Grandegger, Marc Kleine-Budde, linux-can,
	David S. Miller, netdev
  Cc: Andy Shevchenko

Simplify the code which fetches the input clock by using
devm_clk_get_optional(). This comes with a small functional change: previously
all clock lookup errors were ignored when platform data was present. Now every
error is returned to the caller. If no input clock is present,
devm_clk_get_optional() returns NULL instead of an error, which matches the
behavior of the old code.
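
The resulting frequency selection can be modelled in userspace roughly as
follows. This is only a sketch under stated assumptions: the ERR_PTR helpers
are simplified stand-ins for the kernel ones, clk_get_rate() is mocked on the
assumption that a NULL clock reports a rate of 0, pick_freq() is a
hypothetical name, and the -ERANGE return for the sanity check is assumed
because the hunk below does not show it.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified userspace stand-ins for the kernel's error-pointer helpers. */
#define MAX_ERRNO 4095

static void *ERR_PTR(long err)
{
	return (void *)(intptr_t)err;
}

static long PTR_ERR(const void *p)
{
	return (long)(intptr_t)p;
}

static int IS_ERR(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

/* Mock clock; only the rate matters for this sketch. */
struct clk {
	unsigned long rate;
};

/* Assumption: a NULL (absent optional) clock reports a rate of 0. */
static unsigned long clk_get_rate(const struct clk *clk)
{
	return clk ? clk->rate : 0;
}

/*
 * Hypothetical helper mirroring the probe logic: a real error is returned,
 * an absent clock falls back to the platform data frequency (if any), and
 * the result must pass the 1 MHz .. 25 MHz sanity check.
 */
static long pick_freq(struct clk *clk, const unsigned long *pdata_freq)
{
	unsigned long freq;

	if (IS_ERR(clk))
		return PTR_ERR(clk);

	freq = clk_get_rate(clk);
	if (freq == 0 && pdata_freq)
		freq = *pdata_freq;

	if (freq < 1000000 || freq > 25000000)
		return -ERANGE;	/* assumed error code */

	return (long)freq;
}
```

With a 16 MHz clock present, pick_freq() returns the clock rate. With no clock
(NULL) and platform data of 8 MHz, it returns 8000000. With an error pointer,
the error is returned even when platform data is available, which is the
functional change described above.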

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
---
 drivers/net/can/spi/mcp251x.c | 30 ++++++++++++------------------
 1 file changed, 12 insertions(+), 18 deletions(-)

diff --git a/drivers/net/can/spi/mcp251x.c b/drivers/net/can/spi/mcp251x.c
index 58992fd61cb9..e04b578f2b1f 100644
--- a/drivers/net/can/spi/mcp251x.c
+++ b/drivers/net/can/spi/mcp251x.c
@@ -1014,15 +1014,13 @@ static int mcp251x_can_probe(struct spi_device *spi)
 	struct clk *clk;
 	int freq, ret;
 
-	clk = devm_clk_get(&spi->dev, NULL);
-	if (IS_ERR(clk)) {
-		if (pdata)
-			freq = pdata->oscillator_frequency;
-		else
-			return PTR_ERR(clk);
-	} else {
-		freq = clk_get_rate(clk);
-	}
+	clk = devm_clk_get_optional(&spi->dev, NULL);
+	if (IS_ERR(clk))
+		return PTR_ERR(clk);
+
+	freq = clk_get_rate(clk);
+	if (freq == 0 && pdata)
+		freq = pdata->oscillator_frequency;
 
 	/* Sanity check */
 	if (freq < 1000000 || freq > 25000000)
@@ -1033,11 +1031,9 @@ static int mcp251x_can_probe(struct spi_device *spi)
 	if (!net)
 		return -ENOMEM;
 
-	if (!IS_ERR(clk)) {
-		ret = clk_prepare_enable(clk);
-		if (ret)
-			goto out_free;
-	}
+	ret = clk_prepare_enable(clk);
+	if (ret)
+		goto out_free;
 
 	net->netdev_ops = &mcp251x_netdev_ops;
 	net->flags |= IFF_ECHO;
@@ -1122,8 +1118,7 @@ static int mcp251x_can_probe(struct spi_device *spi)
 	mcp251x_power_enable(priv->power, 0);
 
 out_clk:
-	if (!IS_ERR(clk))
-		clk_disable_unprepare(clk);
+	clk_disable_unprepare(clk);
 
 out_free:
 	free_candev(net);
@@ -1141,8 +1136,7 @@ static int mcp251x_can_remove(struct spi_device *spi)
 
 	mcp251x_power_enable(priv->power, 0);
 
-	if (!IS_ERR(priv->clk))
-		clk_disable_unprepare(priv->clk);
+	clk_disable_unprepare(priv->clk);
 
 	free_candev(net);
 
-- 
2.23.0.rc1



* [PATCH v1 3/3] can: mcp251x: Call wrapper instead of regulator_disable()
From: Andy Shevchenko @ 2019-08-26 17:26 UTC (permalink / raw)
  To: Wolfgang Grandegger, Marc Kleine-Budde, linux-can,
	David S. Miller, netdev
  Cc: Andy Shevchenko
In-Reply-To: <20190826172623.79378-1-andriy.shevchenko@linux.intel.com>

There is no need to check for regulator presence in ->suspend()
since the mcp251x_power_enable() wrapper does it for us. Because of this,
we may unconditionally set the AFTER_SUSPEND_POWER flag.
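
To illustrate why the explicit check becomes redundant, here is a minimal
userspace sketch. It assumes the wrapper treats a NULL or error-pointer
regulator as a no-op, which is what the description above implies. The
regulator calls are mocks, power_enable() and suspend_power() are hypothetical
names, and the flag value is assumed.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095
#define AFTER_SUSPEND_POWER 4	/* value assumed for this sketch */

/* Simplified userspace stand-in for the kernel helper of the same name. */
static int IS_ERR_OR_NULL(const void *p)
{
	return !p || (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

/* Mock regulator that only counts enable/disable calls. */
struct regulator {
	int enable_count;
};

static int regulator_enable(struct regulator *reg)
{
	reg->enable_count++;
	return 0;
}

static int regulator_disable(struct regulator *reg)
{
	reg->enable_count--;
	return 0;
}

/* Assumed shape of the wrapper: a missing regulator is silently accepted. */
static int power_enable(struct regulator *reg, int enable)
{
	if (IS_ERR_OR_NULL(reg))
		return 0;

	return enable ? regulator_enable(reg) : regulator_disable(reg);
}

/* Hypothetical model of the tail of ->suspend() after this patch. */
static int suspend_power(struct regulator *reg, int *after_suspend)
{
	power_enable(reg, 0);
	*after_suspend |= AFTER_SUSPEND_POWER;

	return 0;
}
```

One consequence is that AFTER_SUSPEND_POWER is now set even when no regulator
exists. Under the same assumption about the wrapper, the matching enable call
on resume would simply be a no-op in that case.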

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
---
 drivers/net/can/spi/mcp251x.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/net/can/spi/mcp251x.c b/drivers/net/can/spi/mcp251x.c
index 0b7e743ca0a0..6ee0ea51399a 100644
--- a/drivers/net/can/spi/mcp251x.c
+++ b/drivers/net/can/spi/mcp251x.c
@@ -1162,10 +1162,8 @@ static int __maybe_unused mcp251x_can_suspend(struct device *dev)
 		priv->after_suspend = AFTER_SUSPEND_DOWN;
 	}
 
-	if (!IS_ERR_OR_NULL(priv->power)) {
-		regulator_disable(priv->power);
-		priv->after_suspend |= AFTER_SUSPEND_POWER;
-	}
+	mcp251x_power_enable(priv->power, 0);
+	priv->after_suspend |= AFTER_SUSPEND_POWER;
 
 	return 0;
 }
-- 
2.23.0.rc1

