Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH] net: ethernet: fec: Add missing SPEED_
From: LABBE Corentin @ 2018-10-18 18:47 UTC (permalink / raw)
  To: Florian Fainelli; +Cc: andrew, davem, fugang.duan, linux-kernel, netdev
In-Reply-To: <2621cbc9-47ed-ce2a-b7ee-262f17dc138f@gmail.com>

On Thu, Oct 18, 2018 at 11:39:24AM -0700, Florian Fainelli wrote:
> On 10/18/2018 08:05 AM, Corentin Labbe wrote:
> > Since commit 58056c1e1b0e ("net: ethernet: Use phy_set_max_speed() to limit advertised speed"), the fec driver is unable to get any link.
> > This is due to missing SPEED_.
> 
> But SPEED_1000 is defined in include/uapi/linux/ethtool.h as 1000, so
> surely this would amount to the same code paths being taken or am I
> missing something here?

The bisect session pointed your patch, reverting it fix the issue.
BUT since the fix seemed trivial I sent the patch without more test then compile it.
Sorry, I have just found some minutes ago that it didnt fix the issue.

But your patch is still the cause for sure.

^ permalink raw reply

* [PATCH bpf-next] tools: bpftool: use 4 context mode for the NFP disasm
From: Jakub Kicinski @ 2018-10-18 18:34 UTC (permalink / raw)
  To: alexei.starovoitov, daniel; +Cc: netdev, oss-drivers, Jakub Kicinski

The nfp driver is currently always JITing the BPF for 4 context/thread
mode of the NFP flow processors.  Tell this to the disassembler,
otherwise some registers may be incorrectly decoded.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Jiong Wang <jiong.wang@netronome.com>
---
 tools/bpf/bpftool/common.c     |  5 ++++-
 tools/bpf/bpftool/jit_disasm.c |  4 +++-
 tools/bpf/bpftool/main.h       |  6 ++++--
 tools/bpf/bpftool/prog.c       | 14 +++++++++-----
 4 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c
index 3318da8060bd..25af85304ebe 100644
--- a/tools/bpf/bpftool/common.c
+++ b/tools/bpf/bpftool/common.c
@@ -554,7 +554,9 @@ static int read_sysfs_netdev_hex_int(char *devname, const char *entry_name)
 	return read_sysfs_hex_int(full_path);
 }
 
-const char *ifindex_to_bfd_name_ns(__u32 ifindex, __u64 ns_dev, __u64 ns_ino)
+const char *
+ifindex_to_bfd_params(__u32 ifindex, __u64 ns_dev, __u64 ns_ino,
+		      const char **opt)
 {
 	char devname[IF_NAMESIZE];
 	int vendor_id;
@@ -579,6 +581,7 @@ const char *ifindex_to_bfd_name_ns(__u32 ifindex, __u64 ns_dev, __u64 ns_ino)
 		    device_id != 0x6000 &&
 		    device_id != 0x6003)
 			p_info("Unknown NFP device ID, assuming it is NFP-6xxx arch");
+		*opt = "ctx4";
 		return "NFP-6xxx";
 	default:
 		p_err("Can't get bfd arch name for device vendor id 0x%04x",
diff --git a/tools/bpf/bpftool/jit_disasm.c b/tools/bpf/bpftool/jit_disasm.c
index 87439320ef70..c75ffd9ce2bb 100644
--- a/tools/bpf/bpftool/jit_disasm.c
+++ b/tools/bpf/bpftool/jit_disasm.c
@@ -77,7 +77,7 @@ static int fprintf_json(void *out, const char *fmt, ...)
 }
 
 void disasm_print_insn(unsigned char *image, ssize_t len, int opcodes,
-		       const char *arch)
+		       const char *arch, const char *disassembler_options)
 {
 	disassembler_ftype disassemble;
 	struct disassemble_info info;
@@ -116,6 +116,8 @@ void disasm_print_insn(unsigned char *image, ssize_t len, int opcodes,
 
 	info.arch = bfd_get_arch(bfdf);
 	info.mach = bfd_get_mach(bfdf);
+	if (disassembler_options)
+		info.disassembler_options = disassembler_options;
 	info.buffer = image;
 	info.buffer_length = len;
 
diff --git a/tools/bpf/bpftool/main.h b/tools/bpf/bpftool/main.h
index 28ee769bd11b..28322ace2856 100644
--- a/tools/bpf/bpftool/main.h
+++ b/tools/bpf/bpftool/main.h
@@ -145,13 +145,15 @@ int map_parse_fd(int *argc, char ***argv);
 int map_parse_fd_and_info(int *argc, char ***argv, void *info, __u32 *info_len);
 
 void disasm_print_insn(unsigned char *image, ssize_t len, int opcodes,
-		       const char *arch);
+		       const char *arch, const char *disassembler_options);
 void print_data_json(uint8_t *data, size_t len);
 void print_hex_data_json(uint8_t *data, size_t len);
 
 unsigned int get_page_size(void);
 unsigned int get_possible_cpus(void);
-const char *ifindex_to_bfd_name_ns(__u32 ifindex, __u64 ns_dev, __u64 ns_ino);
+const char *
+ifindex_to_bfd_params(__u32 ifindex, __u64 ns_dev, __u64 ns_ino,
+		      const char **opt);
 
 struct btf_dumper {
 	const struct btf *btf;
diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
index 335028968dfb..5302ee282409 100644
--- a/tools/bpf/bpftool/prog.c
+++ b/tools/bpf/bpftool/prog.c
@@ -449,6 +449,7 @@ static int do_dump(int argc, char **argv)
 	unsigned long *func_ksyms = NULL;
 	struct bpf_prog_info info = {};
 	unsigned int *func_lens = NULL;
+	const char *disasm_opt = NULL;
 	unsigned int nr_func_ksyms;
 	unsigned int nr_func_lens;
 	struct dump_data dd = {};
@@ -607,9 +608,10 @@ static int do_dump(int argc, char **argv)
 		const char *name = NULL;
 
 		if (info.ifindex) {
-			name = ifindex_to_bfd_name_ns(info.ifindex,
-						      info.netns_dev,
-						      info.netns_ino);
+			name = ifindex_to_bfd_params(info.ifindex,
+						     info.netns_dev,
+						     info.netns_ino,
+						     &disasm_opt);
 			if (!name)
 				goto err_free;
 		}
@@ -651,7 +653,8 @@ static int do_dump(int argc, char **argv)
 					printf("%s:\n", sym_name);
 				}
 
-				disasm_print_insn(img, lens[i], opcodes, name);
+				disasm_print_insn(img, lens[i], opcodes, name,
+						  disasm_opt);
 				img += lens[i];
 
 				if (json_output)
@@ -663,7 +666,8 @@ static int do_dump(int argc, char **argv)
 			if (json_output)
 				jsonw_end_array(json_wtr);
 		} else {
-			disasm_print_insn(buf, *member_len, opcodes, name);
+			disasm_print_insn(buf, *member_len, opcodes, name,
+					  disasm_opt);
 		}
 	} else if (visual) {
 		if (json_output)
-- 
2.17.1

^ permalink raw reply related

* Re: [PATCH v2 net] r8169: fix NAPI handling under high load
From: David Miller @ 2018-10-18 18:34 UTC (permalink / raw)
  To: hkallweit1; +Cc: nic_swsd, netdev
In-Reply-To: <8d75498e-6f43-a699-3a0a-fb4a2b5dc860@gmail.com>

From: Heiner Kallweit <hkallweit1@gmail.com>
Date: Thu, 18 Oct 2018 19:56:01 +0200

> rtl_rx() and rtl_tx() are called only if the respective bits are set
> in the interrupt status register. Under high load NAPI may not be
> able to process all data (work_done == budget) and it will schedule
> subsequent calls to the poll callback.
> rtl_ack_events() however resets the bits in the interrupt status
> register, therefore subsequent calls to rtl8169_poll() won't call
> rtl_rx() and rtl_tx() - chip interrupts are still disabled.
> 
> Fix this by calling rtl_rx() and rtl_tx() independent of the bits
> set in the interrupt status register. Both functions will detect
> if there's nothing to do for them.
> 
> Fixes: da78dbff2e05 ("r8169: remove work from irq handler.")
> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
> ---
> v2:
> - added "Fixes" tag

Applied and queued up for -stable, thanks!

And I agree that moving the IRQ ACKing to the hw irq handler is
net-next material.

^ permalink raw reply

* Issues in error queue polling
From: Ricardo Biehl Pasquali @ 2018-10-18 18:26 UTC (permalink / raw)
  To: netdev; +Cc: jacob.e.keller, vinicius.gomes, davem

The commit 7d4c04fc170087119727 ("net: add option to enable
error queue packets waking select") (2013-03-28) introduced
SO_SELECT_ERR_QUEUE, which masks POLLPRI with POLLERR event
return in some socket poll callbacks.

POLLERR event issued with sock_queue_err_skb() did not wake
up a poll when POLLERR is the only requested event because
sk_data_ready() (sock_def_readable()) was used and it
doesn't mask POLLERR in poll wake up:

wake_up_interruptible_sync_poll(&wq->wait,
                                EPOLLIN | EPOLLPRI |
                                EPOLLRDNORM | EPOLLRDBAND);

If POLLIN or POLLPRI are requested, for example, poll does
wake up.

POLLERR wakeup by requesting POLLPRI is possible without
set SO_SELECT_ERR_QUEUE. All the option does is masking
POLLPRI as a returned event before poll returns. poll
would return anyway because of POLLERR.

Also, the sentence "[...] enable software to wait on error
queue packets without waking up for regular data on the
socket." from the above commit is not true.

A POLLIN event issued via sock_def_readable() wakes up
threads waiting for POLLPRI, and vice versa. However,
poll() does not return, sleeping again, as the requested
events do not match events.

The commit 6e5d58fdc9bedd0255a8 ("skbuff: Fix not waking
applications when errors are enqueued") (2018-03-14) make
POLLERR alone wake up poll. It replaces sk_data_ready()
(sock_def_readable()) with sk_error_report()
(sock_def_error_report()). This makes "POLLERR wake up by
requesting POLLPRI" obsolete.

Rationale:

POLLIN-only and POLLERR-only wake up are useful when there
is a receiving thread, a sending thread, and a thread that
get transmit timestamps. The thread polling on POLLERR will
not wake up when regular data arrives (POLLIN). The thread
polling on POLLIN will not wake up when tx timestamps are
ready (POLLERR).

One solution is adding an option that disable POLLERR as
requested event. This is in the Virtual File System
subsystem, not in the network, though.

This solves the problem of waking up other threads that
not interested in error queue. Thus allowing a separate
thread take care of error queue (useful for receiving
transmit timestamps).

However, this may not be a good solution as it depends on
polling behavior. Thoughts?

By the way, as the kernel development shouldn't break user
space, SO_SELECT_ERR_QUEUE can become a compatibility
option.

P.S.: The kernel is slowly getting bigger (and perhaps
      messy) with compatibility code. One day, the
      compatibility code should be moved to a
      compatibility-only place, or completely dropped
      (unlikely).

	pasquali

^ permalink raw reply

* Re: [iproute2 PATCH] tc: flower: Classify packets based port ranges
From: Nambiar, Amritha @ 2018-10-18 18:25 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: stephen, netdev, jakub.kicinski, sridhar.samudrala, jhs,
	xiyou.wangcong
In-Reply-To: <20181018122136.GB4558@nanopsycho.orion>

On 10/18/2018 5:21 AM, Jiri Pirko wrote:
> Fri, Oct 12, 2018 at 03:54:42PM CEST, amritha.nambiar@intel.com wrote:
> 
> [...]
> 
>> @@ -1516,6 +1625,22 @@ static int flower_print_opt(struct filter_util *qu, FILE *f,
>> 	if (nl_type >= 0)
>> 		flower_print_port("src_port", tb[nl_type]);
>>
>> +	if (flower_port_range_attr_type(ip_proto, FLOWER_ENDPOINT_DST, &range)
>> +	    == 0) {
>> +		flower_print_port_range("dst_port_min",
>> +					tb[range.min_port_type]);
>> +		flower_print_port_range("dst_port_max",
>> +					tb[range.max_port_type]);
> 
> The input and output of iproute2 utils, tc included should be in sync.
> So you need to print "range x-y" here.
> 

Agree, will fix in v2. Thanks!

> 
>> +	}
>> +
>> +	if (flower_port_range_attr_type(ip_proto, FLOWER_ENDPOINT_SRC, &range)
>> +	    == 0) {
>> +		flower_print_port_range("src_port_min",
>> +					tb[range.min_port_type]);
>> +		flower_print_port_range("src_port_max",
>> +					tb[range.max_port_type]);
>> +	}
>> +
>> 	flower_print_tcp_flags("tcp_flags", tb[TCA_FLOWER_KEY_TCP_FLAGS],
>> 			       tb[TCA_FLOWER_KEY_TCP_FLAGS_MASK]);
>>
>>

^ permalink raw reply

* Re: [net-next PATCH] net: sched: cls_flower: Classify packets using port ranges
From: Nambiar, Amritha @ 2018-10-18 18:24 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, davem, jakub.kicinski, sridhar.samudrala, jhs,
	xiyou.wangcong
In-Reply-To: <20181018121727.GA4558@nanopsycho.orion>

On 10/18/2018 5:17 AM, Jiri Pirko wrote:
> Fri, Oct 12, 2018 at 03:53:30PM CEST, amritha.nambiar@intel.com wrote:
>> Added support in tc flower for filtering based on port ranges.
>> This is a rework of the RFC patch at:
>> https://patchwork.ozlabs.org/patch/969595/
>>
>> Example:
>> 1. Match on a port range:
>> -------------------------
>> $ tc filter add dev enp4s0 protocol ip parent ffff:\
>>  prio 1 flower ip_proto tcp dst_port range 20-30 skip_hw\
>>  action drop
>>
>> $ tc -s filter show dev enp4s0 parent ffff:
>> filter protocol ip pref 1 flower chain 0
>> filter protocol ip pref 1 flower chain 0 handle 0x1
>>  eth_type ipv4
>>  ip_proto tcp
>>  dst_port_min 20
>>  dst_port_max 30
>>  skip_hw
>>  not_in_hw
>>        action order 1: gact action drop
>>         random type none pass val 0
>>         index 1 ref 1 bind 1 installed 181 sec used 5 sec
>>        Action statistics:
>>        Sent 460 bytes 10 pkt (dropped 10, overlimits 0 requeues 0)
>>        backlog 0b 0p requeues 0
>>
>> 2. Match on IP address and port range:
>> --------------------------------------
>> $ tc filter add dev enp4s0 protocol ip parent ffff:\
>>  prio 1 flower dst_ip 192.168.1.1 ip_proto tcp dst_port range 100-200\
>>  skip_hw action drop
>>
>> $ tc -s filter show dev enp4s0 parent ffff:
>> filter protocol ip pref 1 flower chain 0 handle 0x2
>>  eth_type ipv4
>>  ip_proto tcp
>>  dst_ip 192.168.1.1
>>  dst_port_min 100
>>  dst_port_max 200
>>  skip_hw
>>  not_in_hw
>>        action order 1: gact action drop
>>         random type none pass val 0
>>         index 2 ref 1 bind 1 installed 28 sec used 6 sec
>>        Action statistics:
>>        Sent 460 bytes 10 pkt (dropped 10, overlimits 0 requeues 0)
>>        backlog 0b 0p requeues 0
>>
>> Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
>> ---
>> include/uapi/linux/pkt_cls.h |    5 ++
>> net/sched/cls_flower.c       |  134 ++++++++++++++++++++++++++++++++++++++++--
>> 2 files changed, 132 insertions(+), 7 deletions(-)
>>
>> diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
>> index 401d0c1..b569308 100644
>> --- a/include/uapi/linux/pkt_cls.h
>> +++ b/include/uapi/linux/pkt_cls.h
>> @@ -405,6 +405,11 @@ enum {
>> 	TCA_FLOWER_KEY_UDP_SRC,		/* be16 */
>> 	TCA_FLOWER_KEY_UDP_DST,		/* be16 */
>>
>> +	TCA_FLOWER_KEY_PORT_SRC_MIN,	/* be16 */
>> +	TCA_FLOWER_KEY_PORT_SRC_MAX,	/* be16 */
>> +	TCA_FLOWER_KEY_PORT_DST_MIN,	/* be16 */
>> +	TCA_FLOWER_KEY_PORT_DST_MAX,	/* be16 */
>> +
>> 	TCA_FLOWER_FLAGS,
>> 	TCA_FLOWER_KEY_VLAN_ID,		/* be16 */
>> 	TCA_FLOWER_KEY_VLAN_PRIO,	/* u8   */
>> diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
>> index 9aada2d..5f135f0 100644
>> --- a/net/sched/cls_flower.c
>> +++ b/net/sched/cls_flower.c
>> @@ -55,6 +55,9 @@ struct fl_flow_key {
>> 	struct flow_dissector_key_ip ip;
>> 	struct flow_dissector_key_ip enc_ip;
>> 	struct flow_dissector_key_enc_opts enc_opts;
>> +
>> +	struct flow_dissector_key_ports tp_min;
>> +	struct flow_dissector_key_ports tp_max;
>> } __aligned(BITS_PER_LONG / 8); /* Ensure that we can do comparisons as longs. */
>>
>> struct fl_flow_mask_range {
>> @@ -103,6 +106,11 @@ struct cls_fl_filter {
>> 	struct net_device *hw_dev;
>> };
>>
>> +enum fl_endpoint {
>> +	FLOWER_ENDPOINT_DST,
>> +	FLOWER_ENDPOINT_SRC
>> +};
>> +
>> static const struct rhashtable_params mask_ht_params = {
>> 	.key_offset = offsetof(struct fl_flow_mask, key),
>> 	.key_len = sizeof(struct fl_flow_key),
>> @@ -179,11 +187,86 @@ static void fl_clear_masked_range(struct fl_flow_key *key,
>> 	memset(fl_key_get_start(key, mask), 0, fl_mask_range(mask));
>> }
>>
>> +static int fl_range_compare_params(struct cls_fl_filter *filter,
>> +				   struct fl_flow_key *key,
>> +				   struct fl_flow_key *mkey,
>> +				   enum fl_endpoint endpoint)
>> +{
>> +	__be16 min_mask, max_mask, min_val, max_val;
>> +
>> +	if (endpoint == FLOWER_ENDPOINT_DST) {
>> +		min_mask = htons(filter->mask->key.tp_min.dst);
>> +		max_mask = htons(filter->mask->key.tp_max.dst);
>> +		min_val = htons(filter->key.tp_min.dst);
>> +		max_val = htons(filter->key.tp_max.dst);
>> +
>> +		if (min_mask && max_mask) {
>> +			if (htons(key->tp.dst) < min_val ||
>> +			    htons(key->tp.dst) > max_val)
>> +				return -1;
>> +
>> +			/* skb does not have min and max values */
>> +			mkey->tp_min.dst = filter->mkey.tp_min.dst;
>> +			mkey->tp_max.dst = filter->mkey.tp_max.dst;
>> +		}
>> +	} else {
>> +		min_mask = htons(filter->mask->key.tp_min.src);
>> +		max_mask = htons(filter->mask->key.tp_max.src);
>> +		min_val = htons(filter->key.tp_min.src);
>> +		max_val = htons(filter->key.tp_max.src);
>> +
>> +		if (min_mask && max_mask) {
>> +			if (htons(key->tp.src) < min_val ||
>> +			    htons(key->tp.src) > max_val)
>> +				return -1;
>> +
>> +			/* skb does not have min and max values */
>> +			mkey->tp_min.src = filter->mkey.tp_min.src;
>> +			mkey->tp_max.src = filter->mkey.tp_max.src;
>> +		}
> 
> You basically have 2 functions in 1 here. Just have 2 functions:
> fl_port_range_dst_cmp()
> and
> fl_port_range_src_cmp()
> 
> And avoid the "endpoint enum.
> Also, as you return -1 or 0, just make it bool.
> 

Makes sense. Will do.

> 
>> +	}
>> +	return 0;
>> +}
>> +
>> +static struct cls_fl_filter *fl_lookup_range(struct fl_flow_mask *mask,
>> +					     struct fl_flow_key *mkey,
>> +					     struct fl_flow_key *key)
>> +{
>> +	struct cls_fl_filter *filter, *f;
>> +	int ret;
>> +
>> +	list_for_each_entry_rcu(filter, &mask->filters, list) {
>> +		ret = fl_range_compare_params(filter, key, mkey,
>> +					      FLOWER_ENDPOINT_DST);
>> +		if (ret < 0)
>> +			continue;
>> +
>> +		ret = fl_range_compare_params(filter, key, mkey,
>> +					      FLOWER_ENDPOINT_SRC);
>> +		if (ret < 0)
>> +			continue;
>> +
>> +		f = rhashtable_lookup_fast(&mask->ht,
>> +					   fl_key_get_start(mkey, mask),
>> +					   mask->filter_ht_params);
>> +		if (f)
>> +			return f;
>> +	}
>> +	return NULL;
>> +}
>> +
>> static struct cls_fl_filter *fl_lookup(struct fl_flow_mask *mask,
>> -				       struct fl_flow_key *mkey)
>> +				       struct fl_flow_key *mkey,
>> +				       struct fl_flow_key *key, bool is_skb)
>> {
>> -	return rhashtable_lookup_fast(&mask->ht, fl_key_get_start(mkey, mask),
>> -				      mask->filter_ht_params);
>> +	if ((!(mask->key.tp_min.dst && mask->key.tp_max.dst) &&
>> +	     !(mask->key.tp_min.src && mask->key.tp_max.src)) || !is_skb) {
> 
> Would be probably good to have a dedicated bit to check for and decide
> if you do normal/range lookup. This is fast path. 
> 

Will fix in v2.

> 
>> +		return  rhashtable_lookup_fast(&mask->ht,
> 
> Remove double space   ^^
> 

Will fix in v2.

> 
>> +					       fl_key_get_start(mkey, mask),
>> +					       mask->filter_ht_params);
>> +	}
>> +	/* Classify based on range */
>> +	return fl_lookup_range(mask, mkey, key);
>> }
>>
>> static int fl_classify(struct sk_buff *skb, const struct tcf_proto *tp,
>> @@ -207,8 +290,8 @@ static int fl_classify(struct sk_buff *skb, const struct tcf_proto *tp,
>> 		skb_flow_dissect(skb, &mask->dissector, &skb_key, 0);
>>
>> 		fl_set_masked_key(&skb_mkey, &skb_key, mask);
>> +		f = fl_lookup(mask, &skb_mkey, &skb_key, true);
>>
>> -		f = fl_lookup(mask, &skb_mkey);
>> 		if (f && !tc_skip_sw(f->flags)) {
>> 			*res = f->res;
>> 			return tcf_exts_exec(skb, &f->exts, res);
>> @@ -909,6 +992,23 @@ static int fl_set_key(struct net *net, struct nlattr **tb,
>> 			       sizeof(key->arp.tha));
>> 	}
>>
>> +	if (key->basic.ip_proto == IPPROTO_TCP ||
>> +	    key->basic.ip_proto == IPPROTO_UDP ||
>> +	    key->basic.ip_proto == IPPROTO_SCTP) {
>> +		fl_set_key_val(tb, &key->tp_min.dst,
>> +			       TCA_FLOWER_KEY_PORT_DST_MIN, &mask->tp_min.dst,
>> +			       TCA_FLOWER_UNSPEC, sizeof(key->tp_min.dst));
>> +		fl_set_key_val(tb, &key->tp_max.dst,
>> +			       TCA_FLOWER_KEY_PORT_DST_MAX, &mask->tp_max.dst,
>> +			       TCA_FLOWER_UNSPEC, sizeof(key->tp_max.dst));
>> +		fl_set_key_val(tb, &key->tp_min.src,
>> +			       TCA_FLOWER_KEY_PORT_SRC_MIN, &mask->tp_min.src,
>> +			       TCA_FLOWER_UNSPEC, sizeof(key->tp_min.src));
>> +		fl_set_key_val(tb, &key->tp_max.src,
>> +			       TCA_FLOWER_KEY_PORT_SRC_MAX, &mask->tp_max.src,
>> +			       TCA_FLOWER_UNSPEC, sizeof(key->tp_max.src));
>> +	}
>> +
>> 	if (tb[TCA_FLOWER_KEY_ENC_IPV4_SRC] ||
>> 	    tb[TCA_FLOWER_KEY_ENC_IPV4_DST]) {
>> 		key->enc_control.addr_type = FLOW_DISSECTOR_KEY_IPV4_ADDRS;
>> @@ -1026,8 +1126,7 @@ static void fl_init_dissector(struct flow_dissector *dissector,
>> 			     FLOW_DISSECTOR_KEY_IPV4_ADDRS, ipv4);
>> 	FL_KEY_SET_IF_MASKED(mask, keys, cnt,
>> 			     FLOW_DISSECTOR_KEY_IPV6_ADDRS, ipv6);
>> -	FL_KEY_SET_IF_MASKED(mask, keys, cnt,
>> -			     FLOW_DISSECTOR_KEY_PORTS, tp);
>> +	FL_KEY_SET(keys, cnt, FLOW_DISSECTOR_KEY_PORTS, tp);
>> 	FL_KEY_SET_IF_MASKED(mask, keys, cnt,
>> 			     FLOW_DISSECTOR_KEY_IP, ip);
>> 	FL_KEY_SET_IF_MASKED(mask, keys, cnt,
>> @@ -1227,7 +1326,7 @@ static int fl_change(struct net *net, struct sk_buff *in_skb,
>> 		goto errout_idr;
>>
>> 	if (!tc_skip_sw(fnew->flags)) {
>> -		if (!fold && fl_lookup(fnew->mask, &fnew->mkey)) {
>> +		if (!fold && fl_lookup(fnew->mask, &fnew->mkey, NULL, false)) {
> 
> 
> I don't undestand why do you need the "is_skb" arg here. Could you
> please explain?
> 
> Thanks!
> 

The reason to keep the 'is_skb' arg is because, fl_lookup is called in
two cases, one for skb classification and another for checking if a
filter exists every-time a new filter is added. In case of skb
classification, we need to go through the range-comparator to decide if
the skb's port-value falls within the range-filter's min and max limits.
In case of filter validation, the range-filter that we are trying to add
will have min and max values, and we are validating it against other
range-filters with min and max values. So, rhashtable lookup will
suffice here and there is no need to go through the range-comparator in
this case. In the above code, we are validating if a range-filter
exists, so 'is_skb' is false.

> 
>> 			err = -EEXIST;
>> 			goto errout_mask;
>> 		}
>> @@ -1800,6 +1899,27 @@ static int fl_dump_key(struct sk_buff *skb, struct net *net,
>> 				  sizeof(key->arp.tha))))
>> 		goto nla_put_failure;
>>
>> +	if ((key->basic.ip_proto == IPPROTO_TCP ||
>> +	     key->basic.ip_proto == IPPROTO_UDP ||
>> +	     key->basic.ip_proto == IPPROTO_SCTP) &&
>> +	     (fl_dump_key_val(skb, &key->tp_min.dst,
>> +			      TCA_FLOWER_KEY_PORT_DST_MIN,
>> +			      &mask->tp_min.dst, TCA_FLOWER_UNSPEC,
>> +			      sizeof(key->tp_min.dst)) ||
>> +	      fl_dump_key_val(skb, &key->tp_max.dst,
>> +			      TCA_FLOWER_KEY_PORT_DST_MAX,
>> +			      &mask->tp_max.dst, TCA_FLOWER_UNSPEC,
>> +			      sizeof(key->tp_max.dst)) ||
>> +	      fl_dump_key_val(skb, &key->tp_min.src,
>> +			      TCA_FLOWER_KEY_PORT_SRC_MIN,
>> +			      &mask->tp_min.src, TCA_FLOWER_UNSPEC,
>> +			      sizeof(key->tp_min.src)) ||
>> +	      fl_dump_key_val(skb, &key->tp_max.src,
>> +			      TCA_FLOWER_KEY_PORT_SRC_MAX,
>> +			      &mask->tp_max.src, TCA_FLOWER_UNSPEC,
>> +			      sizeof(key->tp_max.src))))
>> +		goto nla_put_failure;
>> +
>> 	if (key->enc_control.addr_type == FLOW_DISSECTOR_KEY_IPV4_ADDRS &&
>> 	    (fl_dump_key_val(skb, &key->enc_ipv4.src,
>> 			    TCA_FLOWER_KEY_ENC_IPV4_SRC, &mask->enc_ipv4.src,
>>

^ permalink raw reply

* Re: [PATCH net-next 0/2] sctp: fix sk_wmem_queued and use it to check for writable space
From: David Miller @ 2018-10-18 18:24 UTC (permalink / raw)
  To: lucien.xin; +Cc: netdev, linux-sctp, marcelo.leitner, nhorman
In-Reply-To: <cover.1539716812.git.lucien.xin@gmail.com>

From: Xin Long <lucien.xin@gmail.com>
Date: Wed, 17 Oct 2018 03:07:49 +0800

> sctp doesn't count and use asoc sndbuf_used, sk sk_wmem_alloc and
> sk_wmem_queued properly, which also causes some problem.
> 
> This patchset is to improve it.

Series applied, thanks.

^ permalink raw reply

* Re: [PATCH] net/ipv4: fix tcp_poll for SMC fallback
From: David Miller @ 2018-10-18 18:22 UTC (permalink / raw)
  To: kgraul; +Cc: netdev, ubraun, hch, linux-s390
In-Reply-To: <20181016144554.49660-1-kgraul@linux.ibm.com>

From: Karsten Graul <kgraul@linux.ibm.com>
Date: Tue, 16 Oct 2018 16:45:54 +0200

> Commit dd979b4df817 ("net: simplify sock_poll_wait") breaks tcp_poll for 
> SMC fallback: An AF_SMC socket establishes an internal TCP socket for the 
> CLC handshake with the remote peer. Whenever the SMC connection can not be 
> established this CLC socket is used as a fallback. All socket operations on the 
> SMC socket are then forwarded to the CLC socket. In case of poll, the 
> file->private_data pointer references the SMC socket because the CLC socket has 
> no file assigned. This causes tcp_poll to wait on the wrong socket.
> 
> This patch fixes the issue by (re)introducing a sock_poll_wait variant with 
> a socket parameter, and let tcp_poll use this variant.
> 
> Fixes: dd979b4df817 ("net: simplify sock_poll_wait")
> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>

Generally speaking, if the invariant of filp->private_data == sock
does not actually hold true universally, I'd rather revert the
simplifications and add a big comment to sock_poll_wait() explaining
this.

Thank you.

^ permalink raw reply

* Re: [PATCH v2 net] r8169: fix NAPI handling under high load
From: Heiner Kallweit @ 2018-10-18 18:17 UTC (permalink / raw)
  To: Eric Dumazet, David Miller, Realtek linux nic maintainers
  Cc: netdev@vger.kernel.org
In-Reply-To: <05eb4233-bf98-4aeb-73f6-94fec46ca3de@gmail.com>

On 18.10.2018 20:02, Eric Dumazet wrote:
> 
> 
> On 10/18/2018 10:56 AM, Heiner Kallweit wrote:
>> rtl_rx() and rtl_tx() are called only if the respective bits are set
>> in the interrupt status register. Under high load NAPI may not be
>> able to process all data (work_done == budget) and it will schedule
>> subsequent calls to the poll callback.
>> rtl_ack_events() however resets the bits in the interrupt status
>> register, therefore subsequent calls to rtl8169_poll() won't call
>> rtl_rx() and rtl_tx() - chip interrupts are still disabled.
>>
>> Fix this by calling rtl_rx() and rtl_tx() independent of the bits
>> set in the interrupt status register. Both functions will detect
>> if there's nothing to do for them.
>>
>> Fixes: da78dbff2e05 ("r8169: remove work from irq handler.")
>> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
>> ---
>> v2:
>> - added "Fixes" tag
>> ---
>>  drivers/net/ethernet/realtek/r8169.c | 8 +++-----
>>  1 file changed, 3 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
>> index 8c4f49adc..7caf3b7e9 100644
>> --- a/drivers/net/ethernet/realtek/r8169.c
>> +++ b/drivers/net/ethernet/realtek/r8169.c
>> @@ -6528,17 +6528,15 @@ static int rtl8169_poll(struct napi_struct *napi, int budget)
>>  	struct rtl8169_private *tp = container_of(napi, struct rtl8169_private, napi);
>>  	struct net_device *dev = tp->dev;
>>  	u16 enable_mask = RTL_EVENT_NAPI | tp->event_slow;
>> -	int work_done= 0;
>> +	int work_done;
>>  	u16 status;
>>  
>>  	status = rtl_get_events(tp);
>>  	rtl_ack_events(tp, status & ~tp->event_slow);
>>  
>> -	if (status & RTL_EVENT_NAPI_RX)
>> -		work_done = rtl_rx(dev, tp, (u32) budget);
>> +	work_done = rtl_rx(dev, tp, (u32) budget);
>>  
>> -	if (status & RTL_EVENT_NAPI_TX)
>> -		rtl_tx(dev, tp);
>> +	rtl_tx(dev, tp);
>>  
>>  	if (status & tp->event_slow) {
>>  		enable_mask &= ~tp->event_slow;
>>
> 
> One has to wonder why rtl8169_poll(), which might be called in a loop under DOS,
> has to call rtl_ack_events() ?
> 
> Should not this be called one time from hard irq handler instead ?
> 
Yes, it should. I have a patch in my tree including this, in addition it
moves everything from rtl_slow_event_work() to the hard irq handler.
However, due to recent changes, this patch may not apply cleanly to
older kernel versions.
Calling rtl_ack_events() in the poll callback isn't nice but not directly
a bug, so I'd prefer to submit this to net-next.

> 

^ permalink raw reply

* Re: [PATCH bpf-next v2 02/13] bpf: btf: Add BTF_KIND_FUNC and BTF_KIND_FUNC_PROTO
From: Martin Lau @ 2018-10-18 18:12 UTC (permalink / raw)
  To: Edward Cree
  Cc: Yonghong Song, Alexei Starovoitov, daniel@iogearbox.net,
	netdev@vger.kernel.org, Kernel Team
In-Reply-To: <45d2473b-76f6-4450-785e-bcc0a7e4ab5d@solarflare.com>

On Thu, Oct 18, 2018 at 05:47:11PM +0100, Edward Cree wrote:
> On 17/10/18 18:50, Martin Lau wrote:
> > On Wed, Oct 17, 2018 at 10:25:21AM -0700, Yonghong Song wrote:
> >> What you stated is true, BTF_KIND_FUNC_PROTO corresponds to
> >> dwarf subroutine tag which has no name while BTF_KIND_FUNC
> >> must have a valid name. The original design is to have both
> >> since they are corresponding to different dwarf constructs.
> >>
> >> Martin, what do you think?
> > I prefer to have separate kinds.  We need a way to distinguish them.
> > For example, the BTF verifier is checking it.  Having two kinds is
> > cleaner instead of resorting to other hints from 'struct btf_type'.
> > We don't lack of bits for kind.
> But my point is that (a) they can be distinguished by how they are
>  used, and (b) the _only_ difference between them is how they are
>  used.  In this C code:
>     int a = 5;
>     int foo(int x) { return a; }
>     int *b = &a;
>     int (*bar)(int) = &foo;
>  foo and *bar are _the same type_, just as a and *b are.  It's just
>  that C has a slightly odd way of writing
>     int foo(int) = lambda x: a;
>  and foo itself is implicitly sorta-const.
> What am I missing?
The BTF verification and bpf_prog_load() has to treat
them differently.

Are you suggesting they can be treated the same for
the kernel verification purpose?

or

The concern is, having two kind, BTF_KIND_FUNC_PROTO and
BTF_KIND_FUNC, is confusing because they have almost the
same BTF metadata?

>From the BTF perspective, they are different because
they are allowed to contain different information.
For example, only "foo" can have func_info in patch 5
(the to-be-added line_info can only belong to "foo" also).

This patch 2 is also doing checks on func_proto.
e.g. in btf_ptr_resolve().

The early idea is to have "foo_a", "foo_b", "foo_c"...
in a separate BTF section called "func" section (on top of
the current "type" and "string" section).  By doing this,
only one BTF_KIND_FUNC would be needed as you suggested
because "foo" could be clearly distinguished from "bar" by
observing they belong to different BTF sections.  The down
side is the "func" section will have its own id space separated
from the current "type" section id space and it is less ideal
for debugging purpose.

Hence, "foo" is contained in the "type" section also
and added FUNC_PROTO to distinguish them.

^ permalink raw reply

* [PATCH v3 net-next 0/2] Add support for Microchip Technology KSZ9131 10/100/1000 Ethernet PHY
From: Yuiko Oshino @ 2018-10-18 19:06 UTC (permalink / raw)
  To: davem, robh+dt, devicetree, f.fainelli, andrew
  Cc: linux-kernel, mark.rutland, m.felsch, Markus.Niebel, netdev,
	UNGLinuxDriver


This is the initial driver for Microchip KSZ9131 10/100/1000 Ethernet PHY

v3:
- KSZ9131 uses picosecond units for values of devicetree properties.
- rewrite micrel.c and micrel-ksz90x1.txt to use the picosecond values.
v2:
- Creating a series from two related patches.

Yuiko Oshino (2):
  net: phy: micrel: add Microchip KSZ9131 initial driver
  dt-bindings: net: add support for Microchip KSZ9131

 .../devicetree/bindings/net/micrel-ksz90x1.txt     |  28 ++++-
 drivers/net/phy/micrel.c                           | 130 ++++++++++++++++++++-
 include/linux/micrel_phy.h                         |   1 +
 3 files changed, 157 insertions(+), 2 deletions(-)

-- 
2.7.4

^ permalink raw reply

* Re: [PATCH v2 net] r8169: fix NAPI handling under high load
From: Eric Dumazet @ 2018-10-18 18:02 UTC (permalink / raw)
  To: Heiner Kallweit, David Miller, Realtek linux nic maintainers
  Cc: netdev@vger.kernel.org
In-Reply-To: <8d75498e-6f43-a699-3a0a-fb4a2b5dc860@gmail.com>



On 10/18/2018 10:56 AM, Heiner Kallweit wrote:
> rtl_rx() and rtl_tx() are called only if the respective bits are set
> in the interrupt status register. Under high load NAPI may not be
> able to process all data (work_done == budget) and it will schedule
> subsequent calls to the poll callback.
> rtl_ack_events() however resets the bits in the interrupt status
> register, therefore subsequent calls to rtl8169_poll() won't call
> rtl_rx() and rtl_tx() - chip interrupts are still disabled.
> 
> Fix this by calling rtl_rx() and rtl_tx() independent of the bits
> set in the interrupt status register. Both functions will detect
> if there's nothing to do for them.
> 
> Fixes: da78dbff2e05 ("r8169: remove work from irq handler.")
> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
> ---
> v2:
> - added "Fixes" tag
> ---
>  drivers/net/ethernet/realtek/r8169.c | 8 +++-----
>  1 file changed, 3 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
> index 8c4f49adc..7caf3b7e9 100644
> --- a/drivers/net/ethernet/realtek/r8169.c
> +++ b/drivers/net/ethernet/realtek/r8169.c
> @@ -6528,17 +6528,15 @@ static int rtl8169_poll(struct napi_struct *napi, int budget)
>  	struct rtl8169_private *tp = container_of(napi, struct rtl8169_private, napi);
>  	struct net_device *dev = tp->dev;
>  	u16 enable_mask = RTL_EVENT_NAPI | tp->event_slow;
> -	int work_done= 0;
> +	int work_done;
>  	u16 status;
>  
>  	status = rtl_get_events(tp);
>  	rtl_ack_events(tp, status & ~tp->event_slow);
>  
> -	if (status & RTL_EVENT_NAPI_RX)
> -		work_done = rtl_rx(dev, tp, (u32) budget);
> +	work_done = rtl_rx(dev, tp, (u32) budget);
>  
> -	if (status & RTL_EVENT_NAPI_TX)
> -		rtl_tx(dev, tp);
> +	rtl_tx(dev, tp);
>  
>  	if (status & tp->event_slow) {
>  		enable_mask &= ~tp->event_slow;
> 

One has to wonder why rtl8169_poll(), which might be called in a loop under DOS,
has to call rtl_ack_events() ?

Should not this be called one time from hard irq handler instead ?

^ permalink raw reply

* [PATCH v2 net] r8169: fix NAPI handling under high load
From: Heiner Kallweit @ 2018-10-18 17:56 UTC (permalink / raw)
  To: David Miller, Realtek linux nic maintainers; +Cc: netdev@vger.kernel.org

rtl_rx() and rtl_tx() are called only if the respective bits are set
in the interrupt status register. Under high load NAPI may not be
able to process all data (work_done == budget) and it will schedule
subsequent calls to the poll callback.
rtl_ack_events() however resets the bits in the interrupt status
register, therefore subsequent calls to rtl8169_poll() won't call
rtl_rx() and rtl_tx() - chip interrupts are still disabled.

Fix this by calling rtl_rx() and rtl_tx() independent of the bits
set in the interrupt status register. Both functions will detect
if there's nothing to do for them.

Fixes: da78dbff2e05 ("r8169: remove work from irq handler.")
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
---
v2:
- added "Fixes" tag
---
 drivers/net/ethernet/realtek/r8169.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 8c4f49adc..7caf3b7e9 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -6528,17 +6528,15 @@ static int rtl8169_poll(struct napi_struct *napi, int budget)
 	struct rtl8169_private *tp = container_of(napi, struct rtl8169_private, napi);
 	struct net_device *dev = tp->dev;
 	u16 enable_mask = RTL_EVENT_NAPI | tp->event_slow;
-	int work_done= 0;
+	int work_done;
 	u16 status;

 	status = rtl_get_events(tp);
 	rtl_ack_events(tp, status & ~tp->event_slow);

-	if (status & RTL_EVENT_NAPI_RX)
-		work_done = rtl_rx(dev, tp, (u32) budget);
+	work_done = rtl_rx(dev, tp, (u32) budget);

-	if (status & RTL_EVENT_NAPI_TX)
-		rtl_tx(dev, tp);
+	rtl_tx(dev, tp);

 	if (status & tp->event_slow) {
 		enable_mask &= ~tp->event_slow;
-- 
2.19.1

^ permalink raw reply related

* RE: [PATCH net-next] netpoll: allow cleanup to be synchronous
From: Banerjee, Debabrata @ 2018-10-18 17:47 UTC (permalink / raw)
  To: 'Neil Horman'; +Cc: David Miller, netdev@vger.kernel.org
In-Reply-To: <20181018165935.GC2601@hmswarspite.think-freely.org>

> From: Neil Horman <nhorman@tuxdriver.com> 

> The team driver still seems
> to be an outlier there though I think , in that it doesn't guarantee the holding
> of rtnl in its port add/delete paths.

Not seeing where this is the case in the team driver today? They were actually
calling __netpoll_cleanup synchronously already.

> It might be worth cleaning that up and simply replacing all the calls to
> __netpoll_free_async with direct calls to __netpoll_cleanup.  That would let
> you eliminate the former function alltogether.

I believe v2 accomplishes that, please review.

-Deb

^ permalink raw reply

* Images 34
From: Nancy @ 2018-10-18 10:00 UTC (permalink / raw)
  To: netdev

We are an image team who can process 400+ images each day.

If you need any image editing service, please let us know.

Image cut out and clipping path, masking.
Such as for ecommerce photos, jewelry photos retouching, beauty and skin
images
and wedding photos.

We give test editing for your photos if you send us some.

Thanks,
Nancy

^ permalink raw reply

* Re: [bpf-next v2 1/2] bpf: skmsg, fix psock create on existing kcm/tls port
From: Eric Dumazet @ 2018-10-18 17:34 UTC (permalink / raw)
  To: John Fastabend, ast, daniel; +Cc: netdev
In-Reply-To: <1539840004-26433-2-git-send-email-john.fastabend@gmail.com>



On 10/17/2018 10:20 PM, John Fastabend wrote:
> Before using the psock returned by sk_psock_get() when adding it to a
> sockmap we need to ensure it is actually a sockmap based psock.
> Previously we were only checking this after incrementing the reference
> counter which was an error. This resulted in a slab-out-of-bounds
> error when the psock was not actually a sockmap type.
> 
> This moves the check up so the reference counter is only used
> if it is a sockmap psock.
> 
> Eric reported the following KASAN BUG,
> 
> BUG: KASAN: slab-out-of-bounds in atomic_read include/asm-generic/atomic-instrumented.h:21 [inline]
> BUG: KASAN: slab-out-of-bounds in refcount_inc_not_zero_checked+0x97/0x2f0 lib/refcount.c:120
> Read of size 4 at addr ffff88019548be58 by task syz-executor4/22387
> 
> CPU: 1 PID: 22387 Comm: syz-executor4 Not tainted 4.19.0-rc7+ #264
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:77 [inline]
>  dump_stack+0x1c4/0x2b4 lib/dump_stack.c:113
>  print_address_description.cold.8+0x9/0x1ff mm/kasan/report.c:256
>  kasan_report_error mm/kasan/report.c:354 [inline]
>  kasan_report.cold.9+0x242/0x309 mm/kasan/report.c:412
>  check_memory_region_inline mm/kasan/kasan.c:260 [inline]
>  check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267
>  kasan_check_read+0x11/0x20 mm/kasan/kasan.c:272
>  atomic_read include/asm-generic/atomic-instrumented.h:21 [inline]
>  refcount_inc_not_zero_checked+0x97/0x2f0 lib/refcount.c:120
>  sk_psock_get include/linux/skmsg.h:379 [inline]
>  sock_map_link.isra.6+0x41f/0xe30 net/core/sock_map.c:178
>  sock_hash_update_common+0x19b/0x11e0 net/core/sock_map.c:669
>  sock_hash_update_elem+0x306/0x470 net/core/sock_map.c:738
>  map_update_elem+0x819/0xdf0 kernel/bpf/syscall.c:818
> 
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
> Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface")
> ---
>  include/linux/skmsg.h | 25 ++++++++++++++++++++-----
>  net/core/sock_map.c   | 11 ++++++-----
>  2 files changed, 26 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
> index 677b673..f44ca6b 100644
> --- a/include/linux/skmsg.h
> +++ b/include/linux/skmsg.h
> @@ -275,11 +275,6 @@ static inline struct sk_psock *sk_psock(const struct sock *sk)
>  	return rcu_dereference_sk_user_data(sk);
>  }
>  
> -static inline bool sk_has_psock(struct sock *sk)
> -{
> -	return sk_psock(sk) != NULL && sk->sk_prot->recvmsg == tcp_bpf_recvmsg;
> -}
> -
>  static inline void sk_psock_queue_msg(struct sk_psock *psock,
>  				      struct sk_msg *msg)
>  {
> @@ -379,6 +374,26 @@ static inline bool sk_psock_test_state(const struct sk_psock *psock,
>  	return test_bit(bit, &psock->state);
>  }
>  
> +static inline struct sk_psock *sk_psock_get_checked(struct sock *sk)
> +{
> +	struct sk_psock *psock;
> +
> +	rcu_read_lock();
> +	psock = sk_psock(sk);
> +	if (psock) {
> +		if (sk->sk_prot->recvmsg != tcp_bpf_recvmsg) {
> +			psock = ERR_PTR(-EBUSY);
> +			goto out;
> +		}
> +
> +		if (!refcount_inc_not_zero(&psock->refcnt))
> +			psock = NULL;

Caller is using IS_ERR(), so you probably want to :

			psock = ERR_PTR(-E<something>);



> +	}
> +out:
> +	rcu_read_unlock();
> +	return psock;
> +}
> +
>  static inline struct sk_psock *sk_psock_get(struct sock *sk)
>  {
>  	struct sk_psock *psock;
> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
> index 3c0e44c..be6092a 100644
> --- a/net/core/sock_map.c
> +++ b/net/core/sock_map.c
> @@ -175,12 +175,13 @@ static int sock_map_link(struct bpf_map *map, struct sk_psock_progs *progs,
>  		}
>  	}
>  
> -	psock = sk_psock_get(sk);
> +	psock = sk_psock_get_checked(sk);
> +	if (IS_ERR(psock)) {
> +		ret = PTR_ERR(psock);
> +		goto out_progs;
> +	}
> +
>  	if (psock) {
> -		if (!sk_has_psock(sk)) {
> -			ret = -EBUSY;
> -			goto out_progs;
> -		}
>  		if ((msg_parser && READ_ONCE(psock->progs.msg_parser)) ||
>  		    (skb_progs  && READ_ONCE(psock->progs.skb_parser))) {
>  			sk_psock_put(sk, psock);
> 

^ permalink raw reply

* Re: [PATCH bpf-next v3 7/7] selftests/bpf: add test cases for queue and stack maps
From: Mauricio Vasquez @ 2018-10-18 17:33 UTC (permalink / raw)
  To: Song Liu; +Cc: Alexei Starovoitov, Daniel Borkmann, Networking
In-Reply-To: <CAPhsuW4dPb8goUE46G2P2f=kXdFwgfRQzPuOjtkO8qB2SErOhw@mail.gmail.com>


On 10/18/18 11:36 AM, Song Liu wrote:
> On Thu, Oct 18, 2018 at 6:16 AM Mauricio Vasquez B
> <mauricio.vasquez@polito.it> wrote:
>> test_maps:
>> Tests that queue/stack maps are behaving correctly even in corner cases
>>
>> test_progs:
>> Tests new ebpf helpers
>>
>> Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
>> ---
>>   tools/lib/bpf/bpf.c                                |   12 ++
>>   tools/lib/bpf/bpf.h                                |    2
>>   tools/testing/selftests/bpf/Makefile               |    5 +
>>   tools/testing/selftests/bpf/bpf_helpers.h          |    7 +
>>   tools/testing/selftests/bpf/test_maps.c            |  122 ++++++++++++++++++++
>>   tools/testing/selftests/bpf/test_progs.c           |   99 ++++++++++++++++
>>   tools/testing/selftests/bpf/test_queue_map.c       |    4 +
>>   tools/testing/selftests/bpf/test_queue_stack_map.h |   59 ++++++++++
>>   tools/testing/selftests/bpf/test_stack_map.c       |    4 +
>>   9 files changed, 313 insertions(+), 1 deletion(-)
>>   create mode 100644 tools/testing/selftests/bpf/test_queue_map.c
>>   create mode 100644 tools/testing/selftests/bpf/test_queue_stack_map.h
>>   create mode 100644 tools/testing/selftests/bpf/test_stack_map.c
>>
>> diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
>> index d70a255cb05e..03f9bcc4ef50 100644
>> --- a/tools/lib/bpf/bpf.c
>> +++ b/tools/lib/bpf/bpf.c
>> @@ -278,6 +278,18 @@ int bpf_map_lookup_elem(int fd, const void *key, void *value)
>>          return sys_bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
>>   }
>>
>> +int bpf_map_lookup_and_delete_elem(int fd, const void *key, void *value)
>> +{
>> +       union bpf_attr attr;
>> +
>> +       bzero(&attr, sizeof(attr));
>> +       attr.map_fd = fd;
>> +       attr.key = ptr_to_u64(key);
>> +       attr.value = ptr_to_u64(value);
>> +
>> +       return sys_bpf(BPF_MAP_LOOKUP_AND_DELETE_ELEM, &attr, sizeof(attr));
>> +}
>> +
>>   int bpf_map_delete_elem(int fd, const void *key)
>>   {
>>          union bpf_attr attr;
>> diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
>> index 258c3c178333..26a51538213c 100644
>> --- a/tools/lib/bpf/bpf.h
>> +++ b/tools/lib/bpf/bpf.h
>> @@ -99,6 +99,8 @@ LIBBPF_API int bpf_map_update_elem(int fd, const void *key, const void *value,
>>                                     __u64 flags);
>>
>>   LIBBPF_API int bpf_map_lookup_elem(int fd, const void *key, void *value);
>> +LIBBPF_API int bpf_map_lookup_and_delete_elem(int fd, const void *key,
>> +                                             void *value);
>>   LIBBPF_API int bpf_map_delete_elem(int fd, const void *key);
>>   LIBBPF_API int bpf_map_get_next_key(int fd, const void *key, void *next_key);
>>   LIBBPF_API int bpf_obj_pin(int fd, const char *pathname);
>> diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
>> index d99dd6fc3fbe..e39dfb4e7970 100644
>> --- a/tools/testing/selftests/bpf/Makefile
>> +++ b/tools/testing/selftests/bpf/Makefile
>> @@ -37,7 +37,7 @@ TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o test
>>          test_lwt_seg6local.o sendmsg4_prog.o sendmsg6_prog.o test_lirc_mode2_kern.o \
>>          get_cgroup_id_kern.o socket_cookie_prog.o test_select_reuseport_kern.o \
>>          test_skb_cgroup_id_kern.o bpf_flow.o netcnt_prog.o \
>> -       test_sk_lookup_kern.o test_xdp_vlan.o
>> +       test_sk_lookup_kern.o test_xdp_vlan.o test_queue_map.o test_stack_map.o
>>
>>   # Order correspond to 'make run_tests' order
>>   TEST_PROGS := test_kmod.sh \
>> @@ -118,6 +118,9 @@ CLANG_FLAGS = -I. -I./include/uapi -I../../../include/uapi \
>>   $(OUTPUT)/test_l4lb_noinline.o: CLANG_FLAGS += -fno-inline
>>   $(OUTPUT)/test_xdp_noinline.o: CLANG_FLAGS += -fno-inline
>>
>> +$(OUTPUT)/test_queue_map.o: test_queue_stack_map.h
>> +$(OUTPUT)/test_stack_map.o: test_queue_stack_map.h
> This looks weird. You meant the .c files, right?

Queue and stack tests are very similar, in order to avoid a lot of code 
duplication I created a single test_queue_stack_map.h file where all the 
logic is placed.

There are two .c files (test_queue_map.c and test_stack_map.c) they 
define the map type and include test_queue_stack_map.h.

test_queue_map.o and test_stack_map.o create an implicit dependency on 
test_queue_map.c and test_stack_map.c, but the dependency on the header 
is also needed, so I added those two lines.

>
>> +
>>   BTF_LLC_PROBE := $(shell $(LLC) -march=bpf -mattr=help 2>&1 | grep dwarfris)
>>   BTF_PAHOLE_PROBE := $(shell $(BTF_PAHOLE) --help 2>&1 | grep BTF)
>>   BTF_OBJCOPY_PROBE := $(shell $(LLVM_OBJCOPY) --help 2>&1 | grep -i 'usage.*llvm')
>> diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
>> index fda8c162d0df..6407a3df0f3b 100644
>> --- a/tools/testing/selftests/bpf/bpf_helpers.h
>> +++ b/tools/testing/selftests/bpf/bpf_helpers.h
>> @@ -16,6 +16,13 @@ static int (*bpf_map_update_elem)(void *map, void *key, void *value,
>>          (void *) BPF_FUNC_map_update_elem;
>>   static int (*bpf_map_delete_elem)(void *map, void *key) =
>>          (void *) BPF_FUNC_map_delete_elem;
>> +static int (*bpf_map_push_elem)(void *map, void *value,
>> +                               unsigned long long flags) =
>> +       (void *) BPF_FUNC_map_push_elem;
>> +static int (*bpf_map_pop_elem)(void *map, void *value) =
>> +       (void *) BPF_FUNC_map_pop_elem;
>> +static int (*bpf_map_peek_elem)(void *map, void *value) =
>> +       (void *) BPF_FUNC_map_peek_elem;
>>   static int (*bpf_probe_read)(void *dst, int size, void *unsafe_ptr) =
>>          (void *) BPF_FUNC_probe_read;
>>   static unsigned long long (*bpf_ktime_get_ns)(void) =
>> diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c
>> index 9b552c0fc47d..4db2116e52be 100644
>> --- a/tools/testing/selftests/bpf/test_maps.c
>> +++ b/tools/testing/selftests/bpf/test_maps.c
>> @@ -15,6 +15,7 @@
>>   #include <string.h>
>>   #include <assert.h>
>>   #include <stdlib.h>
>> +#include <time.h>
>>
>>   #include <sys/wait.h>
>>   #include <sys/socket.h>
>> @@ -471,6 +472,122 @@ static void test_devmap(int task, void *data)
>>          close(fd);
>>   }
>>
>> +static void test_queuemap(int task, void *data)
>> +{
>> +       const int MAP_SIZE = 32;
>> +       __u32 vals[MAP_SIZE + MAP_SIZE/2], val;
>> +       int fd, i;
>> +
>> +       /* Fill test values to be used */
>> +       for (i = 0; i < MAP_SIZE + MAP_SIZE/2; i++)
>> +               vals[i] = rand();
>> +
>> +       /* Invalid key size */
>> +       fd = bpf_create_map(BPF_MAP_TYPE_QUEUE, 4, sizeof(val), MAP_SIZE,
>> +                           map_flags);
>> +       assert(fd < 0 && errno == EINVAL);
>> +
>> +       fd = bpf_create_map(BPF_MAP_TYPE_QUEUE, 0, sizeof(val), MAP_SIZE,
>> +                           map_flags);
>> +       /* Queue map does not support BPF_F_NO_PREALLOC */
>> +       if (map_flags & BPF_F_NO_PREALLOC) {
>> +               assert(fd < 0 && errno == EINVAL);
>> +               return;
>> +       }
>> +       if (fd < 0) {
>> +               printf("Failed to create queuemap '%s'!\n", strerror(errno));
>> +               exit(1);
>> +       }
>> +
>> +       /* Push MAP_SIZE elements */
>> +       for (i = 0; i < MAP_SIZE; i++)
>> +               assert(bpf_map_update_elem(fd, NULL, &vals[i], 0) == 0);
>> +
>> +       /* Check that element cannot be pushed due to max_entries limit */
>> +       assert(bpf_map_update_elem(fd, NULL, &val, 0) == -1 &&
>> +              errno == E2BIG);
>> +
>> +       /* Peek element */
>> +       assert(bpf_map_lookup_elem(fd, NULL, &val) == 0 && val == vals[0]);
>> +
>> +       /* Replace half elements */
>> +       for (i = MAP_SIZE; i < MAP_SIZE + MAP_SIZE/2; i++)
>> +               assert(bpf_map_update_elem(fd, NULL, &vals[i], BPF_EXIST) == 0);
>> +
>> +       /* Pop all elements */
>> +       for (i = MAP_SIZE/2; i < MAP_SIZE + MAP_SIZE/2; i++)
>> +               assert(bpf_map_lookup_and_delete_elem(fd, NULL, &val) == 0 &&
>> +                      val == vals[i]);
>> +
>> +       /* Check that there are not elements left */
>> +       assert(bpf_map_lookup_and_delete_elem(fd, NULL, &val) == -1 &&
>> +              errno == ENOENT);
>> +
>> +       /* Check that non supported functions set errno to EINVAL */
>> +       assert(bpf_map_delete_elem(fd, NULL) == -1 && errno == EINVAL);
>> +       assert(bpf_map_get_next_key(fd, NULL, NULL) == -1 && errno == EINVAL);
>> +
>> +       close(fd);
>> +}
>> +
>> +static void test_stackmap(int task, void *data)
>> +{
>> +       const int MAP_SIZE = 32;
>> +       __u32 vals[MAP_SIZE + MAP_SIZE/2], val;
>> +       int fd, i;
>> +
>> +       /* Fill test values to be used */
>> +       for (i = 0; i < MAP_SIZE + MAP_SIZE/2; i++)
>> +               vals[i] = rand();
>> +
>> +       /* Invalid key size */
>> +       fd = bpf_create_map(BPF_MAP_TYPE_STACK, 4, sizeof(val), MAP_SIZE,
>> +                           map_flags);
>> +       assert(fd < 0 && errno == EINVAL);
>> +
>> +       fd = bpf_create_map(BPF_MAP_TYPE_STACK, 0, sizeof(val), MAP_SIZE,
>> +                           map_flags);
>> +       /* Stack map does not support BPF_F_NO_PREALLOC */
>> +       if (map_flags & BPF_F_NO_PREALLOC) {
>> +               assert(fd < 0 && errno == EINVAL);
>> +               return;
>> +       }
>> +       if (fd < 0) {
>> +               printf("Failed to create stackmap '%s'!\n", strerror(errno));
>> +               exit(1);
>> +       }
>> +
>> +       /* Push MAP_SIZE elements */
>> +       for (i = 0; i < MAP_SIZE; i++)
>> +               assert(bpf_map_update_elem(fd, NULL, &vals[i], 0) == 0);
>> +
>> +       /* Check that element cannot be pushed due to max_entries limit */
>> +       assert(bpf_map_update_elem(fd, NULL, &val, 0) == -1 &&
>> +              errno == E2BIG);
>> +
>> +       /* Peek element */
>> +       assert(bpf_map_lookup_elem(fd, NULL, &val) == 0 && val == vals[i - 1]);
>> +
>> +       /* Replace half elements */
>> +       for (i = MAP_SIZE; i < MAP_SIZE + MAP_SIZE/2; i++)
>> +               assert(bpf_map_update_elem(fd, NULL, &vals[i], BPF_EXIST) == 0);
>> +
>> +       /* Pop all elements */
>> +       for (i = MAP_SIZE + MAP_SIZE/2 - 1; i >= MAP_SIZE/2; i--)
>> +               assert(bpf_map_lookup_and_delete_elem(fd, NULL, &val) == 0 &&
>> +                      val == vals[i]);
>> +
>> +       /* Check that there are not elements left */
>> +       assert(bpf_map_lookup_and_delete_elem(fd, NULL, &val) == -1 &&
>> +              errno == ENOENT);
>> +
>> +       /* Check that non supported functions set errno to EINVAL */
>> +       assert(bpf_map_delete_elem(fd, NULL) == -1 && errno == EINVAL);
>> +       assert(bpf_map_get_next_key(fd, NULL, NULL) == -1 && errno == EINVAL);
>> +
>> +       close(fd);
>> +}
>> +
>>   #include <sys/socket.h>
>>   #include <sys/ioctl.h>
>>   #include <arpa/inet.h>
>> @@ -1434,10 +1551,15 @@ static void run_all_tests(void)
>>          test_map_wronly();
>>
>>          test_reuseport_array();
>> +
>> +       test_queuemap(0, NULL);
>> +       test_stackmap(0, NULL);
>>   }
>>
>>   int main(void)
>>   {
>> +       srand(time(NULL));
>> +
>>          map_flags = 0;
>>          run_all_tests();
>>
>> diff --git a/tools/testing/selftests/bpf/test_progs.c b/tools/testing/selftests/bpf/test_progs.c
>> index e8becca9c521..2d3c04f45530 100644
>> --- a/tools/testing/selftests/bpf/test_progs.c
>> +++ b/tools/testing/selftests/bpf/test_progs.c
>> @@ -1735,8 +1735,105 @@ static void test_reference_tracking()
>>          bpf_object__close(obj);
>>   }
>>
>> +enum {
>> +       QUEUE,
>> +       STACK,
>> +};
>> +
>> +static void test_queue_stack_map(int type)
>> +{
>> +       const int MAP_SIZE = 32;
>> +       __u32 vals[MAP_SIZE], duration, retval, size, val;
>> +       int i, err, prog_fd, map_in_fd, map_out_fd;
>> +       char file[32], buf[128];
>> +       struct bpf_object *obj;
>> +       struct iphdr *iph = (void *)buf + sizeof(struct ethhdr);
>> +
>> +       /* Fill test values to be used */
>> +       for (i = 0; i < MAP_SIZE; i++)
>> +               vals[i] = rand();
>> +
>> +       if (type == QUEUE)
>> +               strncpy(file, "./test_queue_map.o", sizeof(file));
>> +       else if (type == STACK)
>> +               strncpy(file, "./test_stack_map.o", sizeof(file));
>> +       else
>> +               return;
>> +
>> +       err = bpf_prog_load(file, BPF_PROG_TYPE_SCHED_CLS, &obj, &prog_fd);
>> +       if (err) {
>> +               error_cnt++;
>> +               return;
>> +       }
>> +
>> +       map_in_fd = bpf_find_map(__func__, obj, "map_in");
>> +       if (map_in_fd < 0)
>> +               goto out;
>> +
>> +       map_out_fd = bpf_find_map(__func__, obj, "map_out");
>> +       if (map_out_fd < 0)
>> +               goto out;
>> +
>> +       /* Push 32 elements to the input map */
>> +       for (i = 0; i < MAP_SIZE; i++) {
>> +               err = bpf_map_update_elem(map_in_fd, NULL, &vals[i], 0);
>> +               if (err) {
>> +                       error_cnt++;
>> +                       goto out;
>> +               }
>> +       }
>> +
>> +       /* The eBPF program pushes iph.saddr in the output map,
>> +        * pops the input map and saves this value in iph.daddr
>> +        */
>> +       for (i = 0; i < MAP_SIZE; i++) {
>> +               if (type == QUEUE) {
>> +                       val = vals[i];
>> +                       pkt_v4.iph.saddr = vals[i] * 5;
>> +               } else if (type == STACK) {
>> +                       val = vals[MAP_SIZE - 1 - i];
>> +                       pkt_v4.iph.saddr = vals[MAP_SIZE - 1 - i] * 5;
>> +               }
>> +
>> +               err = bpf_prog_test_run(prog_fd, 1, &pkt_v4, sizeof(pkt_v4),
>> +                                       buf, &size, &retval, &duration);
>> +               if (err || retval || size != sizeof(pkt_v4) ||
>> +                   iph->daddr != val)
>> +                       break;
>> +       }
>> +
>> +       CHECK(err || retval || size != sizeof(pkt_v4) || iph->daddr != val,
>> +             "bpf_map_pop_elem",
>> +             "err %d errno %d retval %d size %d iph->daddr %u\n",
>> +             err, errno, retval, size, iph->daddr);
>> +
>> +       /* Queue is empty, program should return TC_ACT_SHOT */
>> +       err = bpf_prog_test_run(prog_fd, 1, &pkt_v4, sizeof(pkt_v4),
>> +                               buf, &size, &retval, &duration);
>> +       CHECK(err || retval != 2 /* TC_ACT_SHOT */|| size != sizeof(pkt_v4),
>> +             "check-queue-stack-map-empty",
>> +             "err %d errno %d retval %d size %d\n",
>> +             err, errno, retval, size);
>> +
>> +       /* Check that the program pushed elements correctly */
>> +       for (i = 0; i < MAP_SIZE; i++) {
>> +               err = bpf_map_lookup_and_delete_elem(map_out_fd, NULL, &val);
>> +               if (err || val != vals[i] * 5)
>> +                       break;
>> +       }
>> +
>> +       CHECK(i != MAP_SIZE && (err || val != vals[i] * 5),
>> +             "bpf_map_push_elem", "err %d value %u\n", err, val);
>> +
>> +out:
>> +       pkt_v4.iph.saddr = 0;
>> +       bpf_object__close(obj);
>> +}
>> +
>>   int main(void)
>>   {
>> +       srand(time(NULL));
>> +
>>          jit_enabled = is_jit_enabled();
>>
>>          test_pkt_access();
>> @@ -1757,6 +1854,8 @@ int main(void)
>>          test_task_fd_query_rawtp();
>>          test_task_fd_query_tp();
>>          test_reference_tracking();
>> +       test_queue_stack_map(QUEUE);
>> +       test_queue_stack_map(STACK);
>>
>>          printf("Summary: %d PASSED, %d FAILED\n", pass_cnt, error_cnt);
>>          return error_cnt ? EXIT_FAILURE : EXIT_SUCCESS;
>> diff --git a/tools/testing/selftests/bpf/test_queue_map.c b/tools/testing/selftests/bpf/test_queue_map.c
>> new file mode 100644
>> index 000000000000..87db1f9da33d
>> --- /dev/null
>> +++ b/tools/testing/selftests/bpf/test_queue_map.c
>> @@ -0,0 +1,4 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +// Copyright (c) 2018 Politecnico di Torino
>> +#define MAP_TYPE BPF_MAP_TYPE_QUEUE
>> +#include "test_queue_stack_map.h"
>> diff --git a/tools/testing/selftests/bpf/test_queue_stack_map.h b/tools/testing/selftests/bpf/test_queue_stack_map.h
>> new file mode 100644
>> index 000000000000..295b9b3bc5c7
>> --- /dev/null
>> +++ b/tools/testing/selftests/bpf/test_queue_stack_map.h
>> @@ -0,0 +1,59 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +// Copyright (c) 2018 Politecnico di Torino
>> +#include <stddef.h>
>> +#include <string.h>
>> +#include <linux/bpf.h>
>> +#include <linux/if_ether.h>
>> +#include <linux/ip.h>
>> +#include <linux/pkt_cls.h>
>> +#include "bpf_helpers.h"
>> +
>> +int _version SEC("version") = 1;
>> +
>> +struct bpf_map_def __attribute__ ((section("maps"), used)) map_in = {
>> +       .type = MAP_TYPE,
>> +       .key_size = 0,
>> +       .value_size = sizeof(__u32),
>> +       .max_entries = 32,
>> +       .map_flags = 0,
>> +};
>> +
>> +struct bpf_map_def __attribute__ ((section("maps"), used)) map_out = {
>> +       .type = MAP_TYPE,
>> +       .key_size = 0,
>> +       .value_size = sizeof(__u32),
>> +       .max_entries = 32,
>> +       .map_flags = 0,
>> +};
>> +
>> +SEC("test")
>> +int _test(struct __sk_buff *skb)
>> +{
>> +       void *data_end = (void *)(long)skb->data_end;
>> +       void *data = (void *)(long)skb->data;
>> +       struct ethhdr *eth = (struct ethhdr *)(data);
>> +       __u32 value;
>> +       int err;
>> +
>> +       if (eth + 1 > data_end)
>> +               return TC_ACT_SHOT;
>> +
>> +       struct iphdr *iph = (struct iphdr *)(eth + 1);
>> +
>> +       if (iph + 1 > data_end)
>> +               return TC_ACT_SHOT;
>> +
>> +       err = bpf_map_pop_elem(&map_in, &value);
>> +       if (err)
>> +               return TC_ACT_SHOT;
>> +
>> +       iph->daddr = value;
>> +
>> +       err = bpf_map_push_elem(&map_out, &iph->saddr, 0);
>> +       if (err)
>> +               return TC_ACT_SHOT;
>> +
>> +       return TC_ACT_OK;
>> +}
>> +
>> +char _license[] SEC("license") = "GPL";
>> diff --git a/tools/testing/selftests/bpf/test_stack_map.c b/tools/testing/selftests/bpf/test_stack_map.c
>> new file mode 100644
>> index 000000000000..31c3880e6da0
>> --- /dev/null
>> +++ b/tools/testing/selftests/bpf/test_stack_map.c
>> @@ -0,0 +1,4 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +// Copyright (c) 2018 Politecnico di Torino
>> +#define MAP_TYPE BPF_MAP_TYPE_STACK
>> +#include "test_queue_stack_map.h"
>>

^ permalink raw reply

* Re: [net-next 00/11][pull request] 1GbE Intel Wired LAN Driver Updates 2018-10-17
From: David Miller @ 2018-10-18 17:27 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, nhorman, sassmann, sasha.neftin
In-Reply-To: <20181017222322.2171-1-jeffrey.t.kirsher@intel.com>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Wed, 17 Oct 2018 15:23:11 -0700

> This series adds support for the new igc driver.
> 
> The igc driver is the new client driver supporting the Intel I225
> Ethernet Controller, which supports 2.5GbE speeds.  The reason for
> creating a new client driver, instead of adding support for the new
> device in e1000e, is that the silicon behaves more like devices
> supported in igb driver.  It also did not make sense to add a client
> part, to the igb driver which supports only 1GbE server parts.
> 
> This initial set of patches is designed for basic support (i.e. link and
> pass traffic).  Follow-on patch series will add more advanced support
> like VLAN, Wake-on-LAN, etc..

Pulled, thanks Jeff.

^ permalink raw reply

* Re: [pull request][net-next 00/14] Mellanox, mlx5 updates
From: David Miller @ 2018-10-18 17:26 UTC (permalink / raw)
  To: saeedm; +Cc: netdev
In-Reply-To: <20181018000859.16212-1-saeedm@mellanox.com>

From: Saeed Mahameed <saeedm@mellanox.com>
Date: Wed, 17 Oct 2018 17:08:45 -0700

> This series from Paul, Or and Mark provides the support for 
> e-switch tc offloading of multiple priorities and chains, the series
> is based on a merge commit with mlx5-next branch for patches that are
> already submitted and reviewed through the netdev and rdma mailing
> lists.
> 
> For more information please see tag log below.
> 
> Please pull and let me know if there is any problem.

Pulled, thanks Saeed.

^ permalink raw reply

* Re: [net-next 03/11] igc: Add netdev
From: Jakub Kicinski @ 2018-10-18 17:15 UTC (permalink / raw)
  To: Jeff Kirsher; +Cc: davem, Sasha Neftin, netdev, nhorman, sassmann
In-Reply-To: <20181017222322.2171-4-jeffrey.t.kirsher@intel.com>

On Wed, 17 Oct 2018 15:23:14 -0700, Jeff Kirsher wrote:
> +/**
> + * igc_ioctl - I/O control method
> + * @netdev: network interface device structure
> + * @ifreq: frequency

Is it? :)

> + * @cmd: command
> + */
> +static int igc_ioctl(struct net_device *netdev, struct ifreq *ifr, int cmd)
> +{
> +	switch (cmd) {
> +	default:
> +		return -EOPNOTSUPP;
> +	}
> +}

You don't seem to be adding anything to this function in the series.
Why add the stub?

^ permalink raw reply

* Re: [net-next 01/11] igc: Add skeletal frame for Intel(R) 2.5G Ethernet Controller support
From: Jakub Kicinski @ 2018-10-18 17:14 UTC (permalink / raw)
  To: Jeff Kirsher; +Cc: davem, Sasha Neftin, netdev, nhorman, sassmann
In-Reply-To: <20181017222322.2171-2-jeffrey.t.kirsher@intel.com>

On Wed, 17 Oct 2018 15:23:12 -0700, Jeff Kirsher wrote:
> From: Sasha Neftin <sasha.neftin@intel.com>
> 
> This patch adds the beginning framework onto which I am going to add
> the igc driver which supports the Intel(R) I225-LM/I225-V 2.5G
> Ethernet Controller.
> 
> Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
> Tested-by: Aaron Brown <aaron.f.brown@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

bunch of minor nit picks

> diff --git a/drivers/net/ethernet/intel/igc/igc.h b/drivers/net/ethernet/intel/igc/igc.h
> new file mode 100644
> index 000000000000..afe595cfcf63
> --- /dev/null
> +++ b/drivers/net/ethernet/intel/igc/igc.h
> @@ -0,0 +1,29 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* Copyright (c)  2018 Intel Corporation */
> +
> +#ifndef _IGC_H_
> +#define _IGC_H_
> +
> +#include <linux/kobject.h>
> +
> +#include <linux/pci.h>
> +#include <linux/netdevice.h>
> +#include <linux/vmalloc.h>
> +
> +#include <linux/ethtool.h>
> +
> +#include <linux/sctp.h>
> +
> +#define IGC_ERR(args...) pr_err("igc: " args)

Looks like you're reimplementing pr_fmt()

> +#define PFX "igc: "
> +
> +#include <linux/timecounter.h>
> +#include <linux/net_tstamp.h>
> +#include <linux/ptp_clock_kernel.h>

Splitting the includes looks fairly weird.  Also, you're not using any
of them here.

> +/* main */
> +extern char igc_driver_name[];
> +extern char igc_driver_version[];
> +
> +#endif /* _IGC_H_ */
> diff --git a/drivers/net/ethernet/intel/igc/igc_hw.h b/drivers/net/ethernet/intel/igc/igc_hw.h
> new file mode 100644
> index 000000000000..aa68b4516700
> --- /dev/null
> +++ b/drivers/net/ethernet/intel/igc/igc_hw.h
> @@ -0,0 +1,10 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* Copyright (c)  2018 Intel Corporation */
> +
> +#ifndef _IGC_HW_H_
> +#define _IGC_HW_H_
> +
> +#define IGC_DEV_ID_I225_LM			0x15F2
> +#define IGC_DEV_ID_I225_V			0x15F3
> +
> +#endif /* _IGC_HW_H_ */
> diff --git a/drivers/net/ethernet/intel/igc/igc_main.c b/drivers/net/ethernet/intel/igc/igc_main.c
> new file mode 100644
> index 000000000000..753749ce5ae0
> --- /dev/null
> +++ b/drivers/net/ethernet/intel/igc/igc_main.c
> @@ -0,0 +1,146 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c)  2018 Intel Corporation */
> +
> +#include <linux/module.h>
> +#include <linux/types.h>
> +
> +#include "igc.h"
> +#include "igc_hw.h"
> +
> +#define DRV_VERSION	"0.0.1-k"

You can just use the kernel version, it works pretty well in presence
of backports.

> +#define DRV_SUMMARY	"Intel(R) 2.5G Ethernet Linux Driver"
> +
> +MODULE_AUTHOR("Intel Corporation, <linux.nics@intel.com>");
> +MODULE_DESCRIPTION(DRV_SUMMARY);
> +MODULE_LICENSE("GPL v2");
> +MODULE_VERSION(DRV_VERSION);
> +
> +char igc_driver_name[] = "igc";
> +char igc_driver_version[] = DRV_VERSION;
> +static const char igc_driver_string[] = DRV_SUMMARY;
> +static const char igc_copyright[] =
> +	"Copyright(c) 2018 Intel Corporation.";
> +
> +static const struct pci_device_id igc_pci_tbl[] = {
> +	{ PCI_VDEVICE(INTEL, IGC_DEV_ID_I225_LM) },
> +	{ PCI_VDEVICE(INTEL, IGC_DEV_ID_I225_V) },
> +	/* required last entry */
> +	{0, }
> +};
> +
> +MODULE_DEVICE_TABLE(pci, igc_pci_tbl);
> +
> +/**
> + * igc_probe - Device Initialization Routine
> + * @pdev: PCI device information struct
> + * @ent: entry in igc_pci_tbl
> + *
> + * Returns 0 on success, negative on failure
> + *
> + * igc_probe initializes an adapter identified by a pci_dev structure.
> + * The OS initialization, configuring the adapter private structure,
> + * and a hardware reset occur.
> + */
> +static int igc_probe(struct pci_dev *pdev,
> +		     const struct pci_device_id *ent)
> +{
> +	int err, pci_using_dac;
> +
> +	err = pci_enable_device_mem(pdev);
> +	if (err)
> +		return err;
> +
> +	pci_using_dac = 0;
> +	err = dma_set_mask(&pdev->dev, DMA_BIT_MASK(64));
> +	if (!err) {
> +		err = dma_set_coherent_mask(&pdev->dev,
> +					    DMA_BIT_MASK(64));
> +		if (!err)
> +			pci_using_dac = 1;

You never use this pci_using_dac.  dma_set_mask_and_coherent() maybe?

> +	} else {
> +		err = dma_set_mask(&pdev->dev, DMA_BIT_MASK(32));
> +		if (err) {
> +			err = dma_set_coherent_mask(&pdev->dev,
> +						    DMA_BIT_MASK(32));
> +			if (err) {
> +				IGC_ERR("Wrong DMA configuration, aborting\n");
> +				goto err_dma;
> +			}
> +		}
> +	}
> +
> +	err = pci_request_selected_regions(pdev,
> +					   pci_select_bars(pdev,
> +							   IORESOURCE_MEM),
> +					   igc_driver_name);
> +	if (err)
> +		goto err_pci_reg;
> +
> +	pci_set_master(pdev);
> +	err = pci_save_state(pdev);

And you never look at err?

> +	return 0;
> +
> +err_pci_reg:
> +err_dma:

The error label should be named after what it points to, not where it's
coming from.

> +	pci_disable_device(pdev);
> +	return err;
> +}
> +
> +/**
> + * igc_remove - Device Removal Routine
> + * @pdev: PCI device information struct
> + *
> + * igc_remove is called by the PCI subsystem to alert the driver
> + * that it should release a PCI device.  This could be caused by a
> + * Hot-Plug event, or because the driver is going to be removed from
> + * memory.
> + */
> +static void igc_remove(struct pci_dev *pdev)
> +{
> +	pci_release_selected_regions(pdev,
> +				     pci_select_bars(pdev, IORESOURCE_MEM));
> +
> +	pci_disable_device(pdev);
> +}
> +
> +static struct pci_driver igc_driver = {
> +	.name     = igc_driver_name,
> +	.id_table = igc_pci_tbl,
> +	.probe    = igc_probe,
> +	.remove   = igc_remove,
> +};
> +
> +/**
> + * igc_init_module - Driver Registration Routine
> + *
> + * igc_init_module is the first routine called when the driver is
> + * loaded. All it does is register with the PCI subsystem.
> + */
> +static int __init igc_init_module(void)
> +{
> +	int ret;
> +
> +	pr_info("%s - version %s\n",
> +		igc_driver_string, igc_driver_version);
> +
> +	pr_info("%s\n", igc_copyright);
> +
> +	ret = pci_register_driver(&igc_driver);
> +	return ret;

Why the variable?

> +}
> +
> +module_init(igc_init_module);
> +
> +/**
> + * igc_exit_module - Driver Exit Cleanup Routine
> + *
> + * igc_exit_module is called just before the driver is removed
> + * from memory.
> + */
> +static void __exit igc_exit_module(void)
> +{
> +	pci_unregister_driver(&igc_driver);
> +}
> +
> +module_exit(igc_exit_module);
> +/* igc_main.c */

I'd argue most editors make it fairly clear which file one is
editing, hence making this sort of comments entirely superfluous :)

^ permalink raw reply

* Re: [danielwa@cisco.com: Re: gianfar: Implement MAC reset and reconfig procedure]
From: Daniel Walker @ 2018-10-18 17:11 UTC (permalink / raw)
  To: Claudiu Manoil; +Cc: Hemant Ramdasi, netdev@vger.kernel.org
In-Reply-To: <HE1PR04MB1145DAA9F214275D6489B06B96F80@HE1PR04MB1145.eurprd04.prod.outlook.com>

On Thu, Oct 18, 2018 at 04:49:26PM +0000, Claudiu Manoil wrote:
> I can only advise you to check whether the MACCFG2 register settings are consistent 
> at this point, when ping fails.  You should check the I/F Mode bits (22-23) and the
> Full Duplex bit (31), in big-endian format.  If these do not match the 100Mbps full 
> duplex link mode, then it might be that another thread (probably doing reset_gfar) 
> changes MACCFG2 concurrently.  I think MACCFG2 may be dumped with ethtool -d.
> I can get my hands on a board no sooner than maybe next week.

A board won't help you .. I'm running on customer hardware which you don't have
access to.

After boot up you have MACCFG2 = 0x7205 which is the same
as the INIT settings.

After the interface is brought up adjust_link() changes to MACCFG2  = 0x7105
which I think is MII.

0x7105 stays after the interface is brought down until gfar_mac_reset sets it to
0x7205 (GMII) .. then adjust link resets it to 0x7105 (MII) ..

That goes on and on each time to interface is brought down/up. It seems like
this is what your expecting to happen, but it doesn't seems to work %100 of the
time.

Daniel

^ permalink raw reply

* [PATCH v3] isdn: hfc_{pci,sx}: Avoid empty body if statements
From: Nathan Chancellor @ 2018-10-19  1:11 UTC (permalink / raw)
  To: Karsten Keil, David S. Miller
  Cc: netdev, linux-kernel, Masahiro Yamada, Nathan Chancellor
In-Reply-To: <20181018034935.16819-1-natechancellor@gmail.com>

Clang warns:

drivers/isdn/hisax/hfc_pci.c:131:34: error: if statement has empty body
[-Werror,-Wempty-body]
        if (Read_hfc(cs, HFCPCI_INT_S1));
                                        ^
drivers/isdn/hisax/hfc_pci.c:131:34: note: put the semicolon on a
separate line to silence this warning

In my attempt to hide the warnings because I thought they didn't serve
any purpose[1], Masahiro Yamada pointed out that {Read,Write}_hfc in
hci_pci.c should be using a standard register access method; otherwise,
the compiler will just remove the if statements.

For hfc_pci, use the versions of {Read,Write}_hfc found in
drivers/isdn/hardware/mISDN/hfc_pCI.h while converting pci_io to be
'void __iomem *' (and clean up ioremap) then remove the empty if
statements.

For hfc_sx, {Read,Write}_hfc are already use a proper register accessor
(inb, outb) so just remove the unnecessary if statements.

[1]: https://lore.kernel.org/lkml/20181016021454.11953-1-natechancellor@gmail.com/

Link: https://github.com/ClangBuiltLinux/linux/issues/66
Suggested-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
---

v1 -> v2:

* Remove unnecessary cast to void, just remove if statements (thanks to
  review from Masahiro).

* Clean up commit message.

v2 -> v3:

* Convert hfc_pci to properly use readb/writeb to avoid static checker
  issues.

 drivers/isdn/hisax/hfc_pci.c | 11 +++++------
 drivers/isdn/hisax/hfc_pci.h |  4 ++--
 drivers/isdn/hisax/hfc_sx.c  |  6 +++---
 drivers/isdn/hisax/hisax.h   |  2 +-
 4 files changed, 11 insertions(+), 12 deletions(-)

diff --git a/drivers/isdn/hisax/hfc_pci.c b/drivers/isdn/hisax/hfc_pci.c
index 8e5b03161b2f..ea0e4c6de3fb 100644
--- a/drivers/isdn/hisax/hfc_pci.c
+++ b/drivers/isdn/hisax/hfc_pci.c
@@ -86,7 +86,7 @@ release_io_hfcpci(struct IsdnCardState *cs)
 	pci_free_consistent(cs->hw.hfcpci.dev, 0x8000,
 			    cs->hw.hfcpci.fifos, cs->hw.hfcpci.dma);
 	cs->hw.hfcpci.fifos = NULL;
-	iounmap((void *)cs->hw.hfcpci.pci_io);
+	iounmap(cs->hw.hfcpci.pci_io);
 }
 
 /********************************************************************************/
@@ -128,7 +128,7 @@ reset_hfcpci(struct IsdnCardState *cs)
 	Write_hfc(cs, HFCPCI_INT_M1, cs->hw.hfcpci.int_m1);
 
 	/* Clear already pending ints */
-	if (Read_hfc(cs, HFCPCI_INT_S1));
+	Read_hfc(cs, HFCPCI_INT_S1);
 
 	Write_hfc(cs, HFCPCI_STATES, HFCPCI_LOAD_STATE | 2);	/* HFC ST 2 */
 	udelay(10);
@@ -158,7 +158,7 @@ reset_hfcpci(struct IsdnCardState *cs)
 	/* Finally enable IRQ output */
 	cs->hw.hfcpci.int_m2 = HFCPCI_IRQ_ENABLE;
 	Write_hfc(cs, HFCPCI_INT_M2, cs->hw.hfcpci.int_m2);
-	if (Read_hfc(cs, HFCPCI_INT_S1));
+	Read_hfc(cs, HFCPCI_INT_S1);
 }
 
 /***************************************************/
@@ -1537,7 +1537,7 @@ hfcpci_bh(struct work_struct *work)
 					cs->hw.hfcpci.int_m1 &= ~HFCPCI_INTS_TIMER;
 					Write_hfc(cs, HFCPCI_INT_M1, cs->hw.hfcpci.int_m1);
 					/* Clear already pending ints */
-					if (Read_hfc(cs, HFCPCI_INT_S1));
+					Read_hfc(cs, HFCPCI_INT_S1);
 					Write_hfc(cs, HFCPCI_STATES, 4 | HFCPCI_LOAD_STATE);
 					udelay(10);
 					Write_hfc(cs, HFCPCI_STATES, 4);
@@ -1692,7 +1692,7 @@ setup_hfcpci(struct IsdnCard *card)
 		printk(KERN_WARNING "HFC-PCI: No IRQ for PCI card found\n");
 		return (0);
 	}
-	cs->hw.hfcpci.pci_io = (char *)(unsigned long)dev_hfcpci->resource[1].start;
+	cs->hw.hfcpci.pci_io = ioremap(dev_hfcpci->resource[1].start, 256);
 	printk(KERN_INFO "HiSax: HFC-PCI card manufacturer: %s card name: %s\n", id_list[i].vendor_name, id_list[i].card_name);
 
 	if (!cs->hw.hfcpci.pci_io) {
@@ -1716,7 +1716,6 @@ setup_hfcpci(struct IsdnCard *card)
 		return 0;
 	}
 	pci_write_config_dword(cs->hw.hfcpci.dev, 0x80, (u32)cs->hw.hfcpci.dma);
-	cs->hw.hfcpci.pci_io = ioremap((ulong) cs->hw.hfcpci.pci_io, 256);
 	printk(KERN_INFO
 	       "HFC-PCI: defined at mem %p fifo %p(%lx) IRQ %d HZ %d\n",
 	       cs->hw.hfcpci.pci_io,
diff --git a/drivers/isdn/hisax/hfc_pci.h b/drivers/isdn/hisax/hfc_pci.h
index 4e58700a3e61..4c3b3ba35726 100644
--- a/drivers/isdn/hisax/hfc_pci.h
+++ b/drivers/isdn/hisax/hfc_pci.h
@@ -228,8 +228,8 @@ typedef union {
 } fifo_area;
 
 
-#define Write_hfc(a, b, c) (*(((u_char *)a->hw.hfcpci.pci_io) + b) = c)
-#define Read_hfc(a, b) (*(((u_char *)a->hw.hfcpci.pci_io) + b))
+#define Write_hfc(a, b, c) (writeb(c, (a->hw.hfcpci.pci_io) + b))
+#define Read_hfc(a, b) (readb((a->hw.hfcpci.pci_io) + b))
 
 extern void main_irq_hcpci(struct BCState *bcs);
 extern void releasehfcpci(struct IsdnCardState *cs);
diff --git a/drivers/isdn/hisax/hfc_sx.c b/drivers/isdn/hisax/hfc_sx.c
index 4d3b4b2f2612..12af628d9b2c 100644
--- a/drivers/isdn/hisax/hfc_sx.c
+++ b/drivers/isdn/hisax/hfc_sx.c
@@ -381,7 +381,7 @@ reset_hfcsx(struct IsdnCardState *cs)
 	Write_hfc(cs, HFCSX_INT_M1, cs->hw.hfcsx.int_m1);
 
 	/* Clear already pending ints */
-	if (Read_hfc(cs, HFCSX_INT_S1));
+	Read_hfc(cs, HFCSX_INT_S1);
 
 	Write_hfc(cs, HFCSX_STATES, HFCSX_LOAD_STATE | 2);	/* HFC ST 2 */
 	udelay(10);
@@ -411,7 +411,7 @@ reset_hfcsx(struct IsdnCardState *cs)
 	/* Finally enable IRQ output */
 	cs->hw.hfcsx.int_m2 = HFCSX_IRQ_ENABLE;
 	Write_hfc(cs, HFCSX_INT_M2, cs->hw.hfcsx.int_m2);
-	if (Read_hfc(cs, HFCSX_INT_S2));
+	Read_hfc(cs, HFCSX_INT_S2);
 }
 
 /***************************************************/
@@ -1288,7 +1288,7 @@ hfcsx_bh(struct work_struct *work)
 					cs->hw.hfcsx.int_m1 &= ~HFCSX_INTS_TIMER;
 					Write_hfc(cs, HFCSX_INT_M1, cs->hw.hfcsx.int_m1);
 					/* Clear already pending ints */
-					if (Read_hfc(cs, HFCSX_INT_S1));
+					Read_hfc(cs, HFCSX_INT_S1);
 
 					Write_hfc(cs, HFCSX_STATES, 4 | HFCSX_LOAD_STATE);
 					udelay(10);
diff --git a/drivers/isdn/hisax/hisax.h b/drivers/isdn/hisax/hisax.h
index 338d0408b377..40080e06421c 100644
--- a/drivers/isdn/hisax/hisax.h
+++ b/drivers/isdn/hisax/hisax.h
@@ -703,7 +703,7 @@ struct hfcPCI_hw {
 	unsigned char nt_mode;
 	int nt_timer;
 	struct pci_dev *dev;
-	unsigned char *pci_io; /* start of PCI IO memory */
+	void __iomem *pci_io; /* start of PCI IO memory */
 	dma_addr_t dma; /* dma handle for Fifos */
 	void *fifos; /* FIFO memory */
 	int last_bfifo_cnt[2]; /* marker saving last b-fifo frame count */
-- 
2.19.1

^ permalink raw reply related

* Re: [PATCH v3 lora-next 1/5] regmap: Add regmap_noinc_write API
From: Andreas Färber @ 2018-10-18 16:59 UTC (permalink / raw)
  To: Ben Whitten, Mark Brown
  Cc: starnight, hasnain.virk, netdev, liuxuenetmail, shess,
	Ben Whitten, Greg Kroah-Hartman, Rafael J. Wysocki, linux-kernel,
	Stefan Rehm, linux-spi@vger.kernel.org
In-Reply-To: <1539361567-3602-2-git-send-email-ben.whitten@lairdtech.com>

Am 12.10.18 um 18:26 schrieb Ben Whitten:
> The regmap API had a noinc_read function added for instances where devices
> supported returning data from an internal FIFO in a single read.
> 
> This commit adds the noinc_write variant to allow writing to a non
> incrementing register, this is used in devices such as the sx1301 for
> loading firmware.
> 
> Signed-off-by: Ben Whitten <ben.whitten@lairdtech.com>

Reviewed-by: Andreas Färber <afaerber@suse.de>

Mark, please take this one through your tree - I'll rebase the LoRa
parts on linux-next then.

Thanks,
Andreas

> ---
>  drivers/base/regmap/internal.h |  3 ++
>  drivers/base/regmap/regmap.c   | 77 ++++++++++++++++++++++++++++++++++++++++++
>  include/linux/regmap.h         | 19 +++++++++++
>  3 files changed, 99 insertions(+)
> 
> diff --git a/drivers/base/regmap/internal.h b/drivers/base/regmap/internal.h
> index a6bf34d63..404f123 100644
> --- a/drivers/base/regmap/internal.h
> +++ b/drivers/base/regmap/internal.h
> @@ -94,11 +94,13 @@ struct regmap {
>  	bool (*readable_reg)(struct device *dev, unsigned int reg);
>  	bool (*volatile_reg)(struct device *dev, unsigned int reg);
>  	bool (*precious_reg)(struct device *dev, unsigned int reg);
> +	bool (*writeable_noinc_reg)(struct device *dev, unsigned int reg);
>  	bool (*readable_noinc_reg)(struct device *dev, unsigned int reg);
>  	const struct regmap_access_table *wr_table;
>  	const struct regmap_access_table *rd_table;
>  	const struct regmap_access_table *volatile_table;
>  	const struct regmap_access_table *precious_table;
> +	const struct regmap_access_table *wr_noinc_table;
>  	const struct regmap_access_table *rd_noinc_table;
>  
>  	int (*reg_read)(void *context, unsigned int reg, unsigned int *val);
> @@ -183,6 +185,7 @@ bool regmap_writeable(struct regmap *map, unsigned int reg);
>  bool regmap_readable(struct regmap *map, unsigned int reg);
>  bool regmap_volatile(struct regmap *map, unsigned int reg);
>  bool regmap_precious(struct regmap *map, unsigned int reg);
> +bool regmap_writeable_noinc(struct regmap *map, unsigned int reg);
>  bool regmap_readable_noinc(struct regmap *map, unsigned int reg);
>  
>  int _regmap_write(struct regmap *map, unsigned int reg,
> diff --git a/drivers/base/regmap/regmap.c b/drivers/base/regmap/regmap.c
> index 0360a90..d4f1fc6 100644
> --- a/drivers/base/regmap/regmap.c
> +++ b/drivers/base/regmap/regmap.c
> @@ -168,6 +168,17 @@ bool regmap_precious(struct regmap *map, unsigned int reg)
>  	return false;
>  }
>  
> +bool regmap_writeable_noinc(struct regmap *map, unsigned int reg)
> +{
> +	if (map->writeable_noinc_reg)
> +		return map->writeable_noinc_reg(map->dev, reg);
> +
> +	if (map->wr_noinc_table)
> +		return regmap_check_range_table(map, reg, map->wr_noinc_table);
> +
> +	return true;
> +}
> +
>  bool regmap_readable_noinc(struct regmap *map, unsigned int reg)
>  {
>  	if (map->readable_noinc_reg)
> @@ -777,11 +788,13 @@ struct regmap *__regmap_init(struct device *dev,
>  	map->rd_table = config->rd_table;
>  	map->volatile_table = config->volatile_table;
>  	map->precious_table = config->precious_table;
> +	map->wr_noinc_table = config->wr_noinc_table;
>  	map->rd_noinc_table = config->rd_noinc_table;
>  	map->writeable_reg = config->writeable_reg;
>  	map->readable_reg = config->readable_reg;
>  	map->volatile_reg = config->volatile_reg;
>  	map->precious_reg = config->precious_reg;
> +	map->writeable_noinc_reg = config->writeable_noinc_reg;
>  	map->readable_noinc_reg = config->readable_noinc_reg;
>  	map->cache_type = config->cache_type;
>  
> @@ -1298,6 +1311,7 @@ int regmap_reinit_cache(struct regmap *map, const struct regmap_config *config)
>  	map->readable_reg = config->readable_reg;
>  	map->volatile_reg = config->volatile_reg;
>  	map->precious_reg = config->precious_reg;
> +	map->writeable_noinc_reg = config->writeable_noinc_reg;
>  	map->readable_noinc_reg = config->readable_noinc_reg;
>  	map->cache_type = config->cache_type;
>  
> @@ -1898,6 +1912,69 @@ int regmap_raw_write(struct regmap *map, unsigned int reg,
>  EXPORT_SYMBOL_GPL(regmap_raw_write);
>  
>  /**
> + * regmap_noinc_write(): Write data from a register without incrementing the
> + *			register number
> + *
> + * @map: Register map to write to
> + * @reg: Register to write to
> + * @val: Pointer to data buffer
> + * @val_len: Length of output buffer in bytes.
> + *
> + * The regmap API usually assumes that bulk bus write operations will write a
> + * range of registers. Some devices have certain registers for which a write
> + * operation can write to an internal FIFO.
> + *
> + * The target register must be volatile but registers after it can be
> + * completely unrelated cacheable registers.
> + *
> + * This will attempt multiple writes as required to write val_len bytes.
> + *
> + * A value of zero will be returned on success, a negative errno will be
> + * returned in error cases.
> + */
> +int regmap_noinc_write(struct regmap *map, unsigned int reg,
> +		      const void *val, size_t val_len)
> +{
> +	size_t write_len;
> +	int ret;
> +
> +	if (!map->bus)
> +		return -EINVAL;
> +	if (!map->bus->write)
> +		return -ENOTSUPP;
> +	if (val_len % map->format.val_bytes)
> +		return -EINVAL;
> +	if (!IS_ALIGNED(reg, map->reg_stride))
> +		return -EINVAL;
> +	if (val_len == 0)
> +		return -EINVAL;
> +
> +	map->lock(map->lock_arg);
> +
> +	if (!regmap_volatile(map, reg) || !regmap_writeable_noinc(map, reg)) {
> +		ret = -EINVAL;
> +		goto out_unlock;
> +	}
> +
> +	while (val_len) {
> +		if (map->max_raw_write && map->max_raw_write < val_len)
> +			write_len = map->max_raw_write;
> +		else
> +			write_len = val_len;
> +		ret = _regmap_raw_write(map, reg, val, write_len);
> +		if (ret)
> +			goto out_unlock;
> +		val = ((u8 *)val) + write_len;
> +		val_len -= write_len;
> +	}
> +
> +out_unlock:
> +	map->unlock(map->lock_arg);
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(regmap_noinc_write);
> +
> +/**
>   * regmap_field_update_bits_base() - Perform a read/modify/write cycle a
>   *                                   register field.
>   *
> diff --git a/include/linux/regmap.h b/include/linux/regmap.h
> index 379505a..de04dc4 100644
> --- a/include/linux/regmap.h
> +++ b/include/linux/regmap.h
> @@ -268,6 +268,13 @@ typedef void (*regmap_unlock)(void *);
>   *                field is NULL but precious_table (see below) is not, the
>   *                check is performed on such table (a register is precious if
>   *                it belongs to one of the ranges specified by precious_table).
> + * @writeable_noinc_reg: Optional callback returning true if the register
> + *			supports multiple write operations without incrementing
> + *			the register number. If this field is NULL but
> + *			wr_noinc_table (see below) is not, the check is
> + *			performed on such table (a register is no increment
> + *			writeable if it belongs to one of the ranges specified
> + *			by wr_noinc_table).
>   * @readable_noinc_reg: Optional callback returning true if the register
>   *			supports multiple read operations without incrementing
>   *			the register number. If this field is NULL but
> @@ -302,6 +309,7 @@ typedef void (*regmap_unlock)(void *);
>   * @rd_table:     As above, for read access.
>   * @volatile_table: As above, for volatile registers.
>   * @precious_table: As above, for precious registers.
> + * @wr_noinc_table: As above, for no increment writeable registers.
>   * @rd_noinc_table: As above, for no increment readable registers.
>   * @reg_defaults: Power on reset values for registers (for use with
>   *                register cache support).
> @@ -352,6 +360,7 @@ struct regmap_config {
>  	bool (*readable_reg)(struct device *dev, unsigned int reg);
>  	bool (*volatile_reg)(struct device *dev, unsigned int reg);
>  	bool (*precious_reg)(struct device *dev, unsigned int reg);
> +	bool (*writeable_noinc_reg)(struct device *dev, unsigned int reg);
>  	bool (*readable_noinc_reg)(struct device *dev, unsigned int reg);
>  
>  	bool disable_locking;
> @@ -369,6 +378,7 @@ struct regmap_config {
>  	const struct regmap_access_table *rd_table;
>  	const struct regmap_access_table *volatile_table;
>  	const struct regmap_access_table *precious_table;
> +	const struct regmap_access_table *wr_noinc_table;
>  	const struct regmap_access_table *rd_noinc_table;
>  	const struct reg_default *reg_defaults;
>  	unsigned int num_reg_defaults;
> @@ -979,6 +989,8 @@ int regmap_write(struct regmap *map, unsigned int reg, unsigned int val);
>  int regmap_write_async(struct regmap *map, unsigned int reg, unsigned int val);
>  int regmap_raw_write(struct regmap *map, unsigned int reg,
>  		     const void *val, size_t val_len);
> +int regmap_noinc_write(struct regmap *map, unsigned int reg,
> +		     const void *val, size_t val_len);
>  int regmap_bulk_write(struct regmap *map, unsigned int reg, const void *val,
>  			size_t val_count);
>  int regmap_multi_reg_write(struct regmap *map, const struct reg_sequence *regs,
> @@ -1222,6 +1234,13 @@ static inline int regmap_raw_write_async(struct regmap *map, unsigned int reg,
>  	return -EINVAL;
>  }
>  
> +static inline int regmap_noinc_write(struct regmap *map, unsigned int reg,
> +				    const void *val, size_t val_len)
> +{
> +	WARN_ONCE(1, "regmap API is disabled");
> +	return -EINVAL;
> +}
> +
>  static inline int regmap_bulk_write(struct regmap *map, unsigned int reg,
>  				    const void *val, size_t val_count)
>  {
> 


-- 
SUSE Linux GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply

* Re: [PATCH net-next] netpoll: allow cleanup to be synchronous
From: Neil Horman @ 2018-10-18 16:59 UTC (permalink / raw)
  To: Banerjee, Debabrata; +Cc: David Miller, netdev@vger.kernel.org
In-Reply-To: <5777390a08cb43709615237dcd4ada83@usma1ex-dag1mb2.msg.corp.akamai.com>

On Thu, Oct 18, 2018 at 03:17:06PM +0000, Banerjee, Debabrata wrote:
> > From: Neil Horman <nhorman@tuxdriver.com>
> 
> > Agreed, this doesn't make sense.  If you want a synchronous cleanup, create
> > a wrapper function that creates a wait queue, calls __netpoll_free_async,
> > and blocks on the wait queue completion.  Modify the cleanup_work
> > method(s) to complete the wait queue, and you've got what you want.
> > 
> > Neil
> 
> Actually the patch looks bad, but it seems to turn out the rtnl is always
> held by the parent when it is called, and asynchronous cleanup doesn't
> seem to be necessary at all. Perhaps you could share your deadlock
> test case for 2cde6acd49da?
> 

That commit doesn't fix a deadlock, it fixes a modification of data that must
only be modified under the protection of the rtnl lock.  From the commit log:

__netpoll_rcu_free is used to free netpoll structures when the rtnl_lock is
    already held.  The mechanism is used to asynchronously call __netpoll_cleanup
    outside of the holding of the rtnl_lock, so as to avoid deadlock.
    Unfortunately, __netpoll_cleanup modifies pointers (dev->np), which means the
    rtnl_lock must be held while calling it.  Further, it cannot be held, because
    rcu callbacks may be issued in softirq contexts, which cannot sleep.

"The mechanism" is referring to the use of rcu callbacks, which origionally were
put in place to allow the freeing of data outside the holding of the rtnl, which
did avoid deadlock.

This patch fixes the fact that __netpoll_cleanup modifies the dev->np pointer
outside of the rtnl_lock, which is disallowed because readers of dev->np assume
it will be stable while rtnl is held.

As for the workqueue not being needed anymore, it definately used to be, but it
appears that most of the instances where a deadlock might occur have been
modified such that it is no longer the case.  The team driver still seems to be
an outlier there though I think , in that it doesn't guarantee the holding of
rtnl in its port add/delete paths.  

It might be worth cleaning that up and simply replacing all the calls to
__netpoll_free_async with direct calls to __netpoll_cleanup.  That would let you
eliminate the former function alltogether.

Neil

> Patch v2 coming next.
> 
> -Deb
> 

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox