Netdev List
 help / color / mirror / Atom feed
* [PATCH v8 bpf-next 0/2] bpf: add cg_skb_is_valid_access
From: Song Liu @ 2018-10-19 16:57 UTC (permalink / raw)
  To: netdev; +Cc: ast, daniel, kernel-team, edumazet, Song Liu

Changes v7 -> v8:
1. Dynamically allocate the dummy sk to avoid race conditions.

Changes v6 -> v7:
1. Make dummy sk a global variable (test_run_sk).

Changes v5 -> v6:
1. Fixed dummy sk in bpf_prog_test_run_skb() as suggested by Eric Dumazet.

Changes v4 -> v5:
1. Replaced bpf_compute_and_save_data_pointers() with
   bpf_compute_and_save_data_end();
   Replaced bpf_restore_data_pointers() with bpf_restore_data_end().
2. Fixed indentation in test_verifier.c

Changes v3 -> v4:
1. Fixed crash issue reported by Alexei.

Changes v2 -> v3:
1. Added helper function bpf_compute_and_save_data_pointers() and
   bpf_restore_data_pointers().

Changes v1 -> v2:
1. Updated the list of read-only fields, and read-write fields.
2. Added dummy sk to bpf_prog_test_run_skb().

This set enables BPF program of type BPF_PROG_TYPE_CGROUP_SKB to access
some __skb_buff data directly.

Song Liu (2):
  bpf: add cg_skb_is_valid_access for BPF_PROG_TYPE_CGROUP_SKB
  bpf: add tests for direct packet access from CGROUP_SKB

 include/linux/filter.h                      |  21 +++
 kernel/bpf/cgroup.c                         |   6 +
 net/bpf/test_run.c                          |  15 ++
 net/core/filter.c                           |  36 ++++-
 tools/testing/selftests/bpf/test_verifier.c | 171 ++++++++++++++++++++
 5 files changed, 248 insertions(+), 1 deletion(-)

^ permalink raw reply

* [PATCH net] net/ipv6: Fix index counter for unicast addresses in in6_dump_addrs
From: David Ahern @ 2018-10-19 17:00 UTC (permalink / raw)
  To: netdev; +Cc: davem, stephen, David Ahern

From: David Ahern <dsahern@gmail.com>

The loop wants to skip previously dumped addresses, so loops until
current index >= saved index. If the message fills it wants to save
the index for the next address to dump - ie., the one that did not
fit in the current message.

Currently, it is incrementing the index counter before comparing to the
saved index, and then the saved index is off by 1 - it assumes the
current address is going to fit in the message.

Change the index handling to increment only after a succesful dump.

Fixes: 502a2ffd7376a ("ipv6: convert idev_list to list macros")
Signed-off-by: David Ahern <dsahern@gmail.com>
---
 net/ipv6/addrconf.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index c63ccce6425f..4e81ff2f4588 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -4928,8 +4928,8 @@ static int in6_dump_addrs(struct inet6_dev *idev, struct sk_buff *skb,
 
 		/* unicast address incl. temp addr */
 		list_for_each_entry(ifa, &idev->addr_list, if_list) {
-			if (++ip_idx < s_ip_idx)
-				continue;
+			if (ip_idx < s_ip_idx)
+				goto next;
 			err = inet6_fill_ifaddr(skb, ifa,
 						NETLINK_CB(cb->skb).portid,
 						cb->nlh->nlmsg_seq,
@@ -4938,6 +4938,8 @@ static int in6_dump_addrs(struct inet6_dev *idev, struct sk_buff *skb,
 			if (err < 0)
 				break;
 			nl_dump_check_consistent(cb, nlmsg_hdr(skb));
+next:
+			ip_idx++;
 		}
 		break;
 	}
-- 
2.11.0

^ permalink raw reply related

* [PATCH v8 bpf-next 1/2] bpf: add cg_skb_is_valid_access for BPF_PROG_TYPE_CGROUP_SKB
From: Song Liu @ 2018-10-19 16:57 UTC (permalink / raw)
  To: netdev; +Cc: ast, daniel, kernel-team, edumazet, Song Liu
In-Reply-To: <20181019165758.1410213-1-songliubraving@fb.com>

BPF programs of BPF_PROG_TYPE_CGROUP_SKB need to access headers in the
skb. This patch enables direct access of skb for these programs.

Two helper functions bpf_compute_and_save_data_end() and
bpf_restore_data_end() are introduced. There are used in
__cgroup_bpf_run_filter_skb(), to compute proper data_end for the
BPF program, and restore original data afterwards.

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 include/linux/filter.h | 21 +++++++++++++++++++++
 kernel/bpf/cgroup.c    |  6 ++++++
 net/core/filter.c      | 36 +++++++++++++++++++++++++++++++++++-
 3 files changed, 62 insertions(+), 1 deletion(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 5771874bc01e..91b4c934f02e 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -548,6 +548,27 @@ static inline void bpf_compute_data_pointers(struct sk_buff *skb)
 	cb->data_end  = skb->data + skb_headlen(skb);
 }
 
+/* Similar to bpf_compute_data_pointers(), except that save orginal
+ * data in cb->data and cb->meta_data for restore.
+ */
+static inline void bpf_compute_and_save_data_end(
+	struct sk_buff *skb, void **saved_data_end)
+{
+	struct bpf_skb_data_end *cb = (struct bpf_skb_data_end *)skb->cb;
+
+	*saved_data_end = cb->data_end;
+	cb->data_end  = skb->data + skb_headlen(skb);
+}
+
+/* Restore data saved by bpf_compute_data_pointers(). */
+static inline void bpf_restore_data_end(
+	struct sk_buff *skb, void *saved_data_end)
+{
+	struct bpf_skb_data_end *cb = (struct bpf_skb_data_end *)skb->cb;
+
+	cb->data_end = saved_data_end;
+}
+
 static inline u8 *bpf_skb_cb(struct sk_buff *skb)
 {
 	/* eBPF programs may read/write skb->cb[] area to transfer meta
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index 00f6ed2e4f9a..9425c2fb872f 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -553,6 +553,7 @@ int __cgroup_bpf_run_filter_skb(struct sock *sk,
 {
 	unsigned int offset = skb->data - skb_network_header(skb);
 	struct sock *save_sk;
+	void *saved_data_end;
 	struct cgroup *cgrp;
 	int ret;
 
@@ -566,8 +567,13 @@ int __cgroup_bpf_run_filter_skb(struct sock *sk,
 	save_sk = skb->sk;
 	skb->sk = sk;
 	__skb_push(skb, offset);
+
+	/* compute pointers for the bpf prog */
+	bpf_compute_and_save_data_end(skb, &saved_data_end);
+
 	ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[type], skb,
 				 bpf_prog_run_save_cb);
+	bpf_restore_data_end(skb, saved_data_end);
 	__skb_pull(skb, offset);
 	skb->sk = save_sk;
 	return ret == 1 ? 0 : -EPERM;
diff --git a/net/core/filter.c b/net/core/filter.c
index 1a3ac6c46873..e3ca30bd6840 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5346,6 +5346,40 @@ static bool sk_filter_is_valid_access(int off, int size,
 	return bpf_skb_is_valid_access(off, size, type, prog, info);
 }
 
+static bool cg_skb_is_valid_access(int off, int size,
+				   enum bpf_access_type type,
+				   const struct bpf_prog *prog,
+				   struct bpf_insn_access_aux *info)
+{
+	switch (off) {
+	case bpf_ctx_range(struct __sk_buff, tc_classid):
+	case bpf_ctx_range(struct __sk_buff, data_meta):
+	case bpf_ctx_range(struct __sk_buff, flow_keys):
+		return false;
+	}
+	if (type == BPF_WRITE) {
+		switch (off) {
+		case bpf_ctx_range(struct __sk_buff, mark):
+		case bpf_ctx_range(struct __sk_buff, priority):
+		case bpf_ctx_range_till(struct __sk_buff, cb[0], cb[4]):
+			break;
+		default:
+			return false;
+		}
+	}
+
+	switch (off) {
+	case bpf_ctx_range(struct __sk_buff, data):
+		info->reg_type = PTR_TO_PACKET;
+		break;
+	case bpf_ctx_range(struct __sk_buff, data_end):
+		info->reg_type = PTR_TO_PACKET_END;
+		break;
+	}
+
+	return bpf_skb_is_valid_access(off, size, type, prog, info);
+}
+
 static bool lwt_is_valid_access(int off, int size,
 				enum bpf_access_type type,
 				const struct bpf_prog *prog,
@@ -7038,7 +7072,7 @@ const struct bpf_prog_ops xdp_prog_ops = {
 
 const struct bpf_verifier_ops cg_skb_verifier_ops = {
 	.get_func_proto		= cg_skb_func_proto,
-	.is_valid_access	= sk_filter_is_valid_access,
+	.is_valid_access	= cg_skb_is_valid_access,
 	.convert_ctx_access	= bpf_convert_ctx_access,
 };
 
-- 
2.17.1

^ permalink raw reply related

* Re: [PATCH bpf-next v2 02/13] bpf: btf: Add BTF_KIND_FUNC and BTF_KIND_FUNC_PROTO
From: Edward Cree @ 2018-10-19 17:04 UTC (permalink / raw)
  To: Martin Lau
  Cc: Yonghong Song, Alexei Starovoitov, daniel@iogearbox.net,
	netdev@vger.kernel.org, Kernel Team
In-Reply-To: <20181018211954.egdolymm3zqgphmd@kafai-mbp.dhcp.thefacebook.com>

On 18/10/18 22:19, Martin Lau wrote:
> As I have mentioned earlier, it is also special to
> the kernel because the BTF verifier and bpf_prog_load()
> need to do different checks for FUNC and FUNC_PROTO to
> ensure they are sane.
>
> First, we need to agree that the kernel needs to verify
> them differently.
>
> and then we can proceed to discuss how to distinguish them.
> We picked the current way to avoid adding a
> new BTF function section and keep it
> strict forward to distinguish them w/o relying
> on other hints from 'struct btf_type'.
>
> Are you suggesting another way of doing it?
But you *do* have such a new section.
The patch comment talks about a 'FuncInfo Table' which appears to
 map (section, insn_idx) to type_id.  (I think this gets added in
 .BTF.ext per patch 9?)  So when you're looking at a FUNC type
 because you looked up a type_id from that table, you know it's
 the signature of a subprogram, and you're checking it as such.
Whereas, if you were doing something with some other type and it
 referenced a FUNC type (e.g., in the patch comment's example,
 you're checking foo's first argument against the type bar) in
 its type_id, you know you're using it as a formal type (a FUNC_
 PROTO in your parlance) and not as a subprogram.
The context in which you are using a type entry tells you which
 kind it is.  And the verifier can and should be smart enough to
 know what it's doing in this way.

And it's entirely reasonable for the same type entry to get used
 for both those cases; in my example, you'd have a FUNC type for
 int foo(int), referenced both by the func_info entry for foo
 and by the PTR type for bar.  And if you had another subprogram
 int baz(int), its func_info entry could reference the same
 type_id, because the (reference to the) name of the function
 should live in the func_info, not in the type.

What you are proposing seems to be saying "if we have this
 particular special btf_kind, then this BTF entry doesn't just
 define a type, it declares a subprogram of that type".  Oh,
 and with the name of the type as the subprogram name.  Which
 just creates blurry confusion as to whether BTF entries define
 types or declare objects; IMNSHO the correct approach is for
 objects to be declared elsewhere and to reference BTF types by
 their type_id.
Which is what the func_info table in patch 9 appears to do.

(It also rather bothers me the way we are using special type
 names to associate maps with their k/v types, rather than
 extending the records in the maps section to include type_ids
 referencing them.  It's the same kind of weird implicitness,
 and if I'd spotted it when it was going in I'd've nacked it,
 but I suppose it's ABI now and too late to change.)

-Ed

^ permalink raw reply

* Re: [PATCH net-next] rocker: Drop pointless static qualifier
From: David Miller @ 2018-10-19 17:42 UTC (permalink / raw)
  To: yuehaibing; +Cc: jiri, netdev, kernel-janitors
In-Reply-To: <1539950579-182423-1-git-send-email-yuehaibing@huawei.com>

From: YueHaibing <yuehaibing@huawei.com>
Date: Fri, 19 Oct 2018 12:02:59 +0000

> There is no need to have the 'struct rocker_desc_info *desc_info'
> variable static since new value always be assigned before use it.
> 
> Signed-off-by: YueHaibing <yuehaibing@huawei.com>

Applied, thank you.

^ permalink raw reply

* (unknown)
From: David Miller @ 2018-10-19 17:46 UTC (permalink / raw)
  To: dhowells; +Cc: netdev, linux-afs
In-Reply-To: <11408.1539960053@warthog.procyon.org.uk>

From: David Howells <dhowells@redhat.com>
Date: Fri, 19 Oct 2018 15:40:53 +0100

> Is there going to be a merge of net into net-next before the merge
> window opens?  Or do you have a sample merge that I can rebase my
> afs-next branch on?

I'll be doing a net to net-next merge some time today.

^ permalink raw reply

* [net  1/1] tipc: eliminate message disordering during binding table update
From: Jon Maloy @ 2018-10-19 17:55 UTC (permalink / raw)
  To: davem, netdev
  Cc: gordan.mihaljevic, tung.q.nguyen, hoang.h.le, jon.maloy,
	canh.d.luu, ying.xue, tipc-discussion

We have seen the following race scenario:
1) named_distribute() builds a "bulk" message, containing a PUBLISH
   item for a certain publication. This is based on the contents of
   the binding tables's 'cluster_scope' list.
2) tipc_named_withdraw() removes the same publication from the list,
   bulds a WITHDRAW message and distributes it to all cluster nodes.
3) tipc_named_node_up(), which was calling named_distribute(), sends
   out the bulk message built under 1)
4) The WITHDRAW message arrives at the just detected node, finds
   no corresponding publication, and is dropped.
5) The PUBLISH item arrives at the same node, is added to its binding
   table, and remains there forever.

This arrival disordering was earlier taken care of by the backlog queue,
originally added for a different purpose, which was removed in the
commit referred to below, but we now need a different solution.
In this commit, we replace the rcu lock protecting the 'cluster_scope'
list with a regular RW lock which comprises even the sending of the
bulk message. This both guarantees both the list integrity and the
message sending order. We will later add a commit which cleans up
this code further.

Note that this commit needs recently added commit d3092b2efca1 ("tipc:
fix unsafe rcu locking when accessing publication list") to apply
cleanly.

Fixes: 37922ea4a310 ("tipc: permit overlapping service ranges in name table")
Reported-by: Tuong Lien Tong <tuong.t.lien@dektech.com.au>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
---
 net/tipc/name_distr.c | 18 ++++++++++--------
 net/tipc/name_table.c |  1 +
 net/tipc/name_table.h |  1 +
 3 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/net/tipc/name_distr.c b/net/tipc/name_distr.c
index 3cfeb9d..61219f0 100644
--- a/net/tipc/name_distr.c
+++ b/net/tipc/name_distr.c
@@ -94,8 +94,9 @@ struct sk_buff *tipc_named_publish(struct net *net, struct publication *publ)
 		list_add_tail_rcu(&publ->binding_node, &nt->node_scope);
 		return NULL;
 	}
-	list_add_tail_rcu(&publ->binding_node, &nt->cluster_scope);
-
+	write_lock_bh(&nt->cluster_scope_lock);
+	list_add_tail(&publ->binding_node, &nt->cluster_scope);
+	write_unlock_bh(&nt->cluster_scope_lock);
 	skb = named_prepare_buf(net, PUBLICATION, ITEM_SIZE, 0);
 	if (!skb) {
 		pr_warn("Publication distribution failure\n");
@@ -112,11 +113,13 @@ struct sk_buff *tipc_named_publish(struct net *net, struct publication *publ)
  */
 struct sk_buff *tipc_named_withdraw(struct net *net, struct publication *publ)
 {
+	struct name_table *nt = tipc_name_table(net);
 	struct sk_buff *buf;
 	struct distr_item *item;
 
-	list_del_rcu(&publ->binding_node);
-
+	write_lock_bh(&nt->cluster_scope_lock);
+	list_del(&publ->binding_node);
+	write_unlock_bh(&nt->cluster_scope_lock);
 	if (publ->scope == TIPC_NODE_SCOPE)
 		return NULL;
 
@@ -147,7 +150,7 @@ static void named_distribute(struct net *net, struct sk_buff_head *list,
 			ITEM_SIZE) * ITEM_SIZE;
 	u32 msg_rem = msg_dsz;
 
-	list_for_each_entry_rcu(publ, pls, binding_node) {
+	list_for_each_entry(publ, pls, binding_node) {
 		/* Prepare next buffer: */
 		if (!skb) {
 			skb = named_prepare_buf(net, PUBLICATION, msg_rem,
@@ -189,11 +192,10 @@ void tipc_named_node_up(struct net *net, u32 dnode)
 
 	__skb_queue_head_init(&head);
 
-	rcu_read_lock();
+	read_lock_bh(&nt->cluster_scope_lock);
 	named_distribute(net, &head, dnode, &nt->cluster_scope);
-	rcu_read_unlock();
-
 	tipc_node_xmit(net, &head, dnode, 0);
+	read_unlock_bh(&nt->cluster_scope_lock);
 }
 
 /**
diff --git a/net/tipc/name_table.c b/net/tipc/name_table.c
index 66d5b2c..bff241f 100644
--- a/net/tipc/name_table.c
+++ b/net/tipc/name_table.c
@@ -744,6 +744,7 @@ int tipc_nametbl_init(struct net *net)
 
 	INIT_LIST_HEAD(&nt->node_scope);
 	INIT_LIST_HEAD(&nt->cluster_scope);
+	rwlock_init(&nt->cluster_scope_lock);
 	tn->nametbl = nt;
 	spin_lock_init(&tn->nametbl_lock);
 	return 0;
diff --git a/net/tipc/name_table.h b/net/tipc/name_table.h
index 892bd75..f790663 100644
--- a/net/tipc/name_table.h
+++ b/net/tipc/name_table.h
@@ -100,6 +100,7 @@ struct name_table {
 	struct hlist_head services[TIPC_NAMETBL_SIZE];
 	struct list_head node_scope;
 	struct list_head cluster_scope;
+	rwlock_t cluster_scope_lock;
 	u32 local_publ_count;
 };
 
-- 
2.1.4

^ permalink raw reply related

* Re: [PATCH v2 net-next 0/8] net: dsa: microchip: Modify KSZ9477 DSA driver in preparation to add other KSZ switch drivers
From: Florian Fainelli @ 2018-10-19 17:56 UTC (permalink / raw)
  To: Tristram.Ha, marek.vasut
  Cc: arkadis, UNGLinuxDriver, netdev, andrew, pavel, ruediger.schmitt
In-Reply-To: <93AF473E2DA327428DE3D46B72B1E9FD411AC863@CHN-SV-EXMX02.mchp-main.com>

On 08/16/2018 02:34 PM, Tristram.Ha@microchip.com wrote:
>> -----Original Message-----
>> From: Florian Fainelli <f.fainelli@gmail.com>
>> Sent: Wednesday, August 15, 2018 5:29 PM
>> To: Tristram Ha - C24268 <Tristram.Ha@microchip.com>; Andrew Lunn
>> <andrew@lunn.ch>; Pavel Machek <pavel@ucw.cz>; Ruediger Schmitt
>> <ruediger.schmitt@philips.com>
>> Cc: Arkadi Sharshevsky <arkadis@mellanox.com>; UNGLinuxDriver
>> <UNGLinuxDriver@microchip.com>; netdev@vger.kernel.org
>> Subject: Re: [PATCH v2 net-next 0/8] net: dsa: microchip: Modify KSZ9477
>> DSA driver in preparation to add other KSZ switch drivers
>>
>> On 12/05/2017 05:46 PM, Tristram.Ha@microchip.com wrote:
>>> From: Tristram Ha <Tristram.Ha@microchip.com>
>>>
>>> This series of patches is to modify the original KSZ9477 DSA driver so
>>> that other KSZ switch drivers can be added and use the common code.
>>>
>>> There are several steps to accomplish this achievement.  First is to
>>> rename some function names with a prefix to indicate chip specific
>>> function.  Second is to move common code into header that can be shared.
>>> Last is to modify tag_ksz.c so that it can handle many tail tag formats
>>> used by different KSZ switch drivers.
>>>
>>> ksz_common.c will contain the common code used by all KSZ switch drivers.
>>> ksz9477.c will contain KSZ9477 code from the original ksz_common.c.
>>> ksz9477_spi.c is renamed from ksz_spi.c.
>>> ksz9477_reg.h is renamed from ksz_9477_reg.h.
>>> ksz_common.h is added to provide common code access to KSZ switch
>>> drivers.
>>> ksz_spi.h is added to provide common SPI access functions to KSZ SPI
>>> drivers.
>>
>> Is something gating this series from getting included? It's been nearly
>> 8 months now and this has not been include nor resubmitted, any plans to
>> rebase that patch series and work towards inclusion in net-next when it
>> opens back again?
>>
>> Thank you!
> 
> Sorry for the long delay.  I will restart my kernel submission effort next month
> after finishing the work on current development project.
> 

Tristram, any chance of resubmitting this or should someone with access
to those switches take up your series and submit it?
-- 
Florian

^ permalink raw reply

* [PATCH][next] igc: fix error return handling from call to netif_set_real_num_tx_queues
From: Colin King @ 2018-10-19 18:16 UTC (permalink / raw)
  To: Sasha Neftin, Jeff Kirsher, David S . Miller, intel-wired-lan
  Cc: kernel-janitors, netdev

From: Colin Ian King <colin.king@canonical.com>

The call to netif_set_real_num_tx_queues is not assigning the error
return to variable err even though the next line checks err for an
error.  Fix this by adding the missing err assignment.

Detected by CoverityScan, CID#1474551 ("Logically dead code")

Fixes: 3df25e4c1e66 ("igc: Add interrupt support")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
---
 drivers/net/ethernet/intel/igc/igc_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/igc/igc_main.c b/drivers/net/ethernet/intel/igc/igc_main.c
index 9d85707e8a81..80ddbd987764 100644
--- a/drivers/net/ethernet/intel/igc/igc_main.c
+++ b/drivers/net/ethernet/intel/igc/igc_main.c
@@ -3358,7 +3358,7 @@ static int __igc_open(struct net_device *netdev, bool resuming)
 		goto err_req_irq;
 
 	/* Notify the stack of the actual queue counts. */
-	netif_set_real_num_tx_queues(netdev, adapter->num_tx_queues);
+	err = netif_set_real_num_tx_queues(netdev, adapter->num_tx_queues);
 	if (err)
 		goto err_set_queues;
 
-- 
2.19.1

^ permalink raw reply related

* Re: C45 Phys and PHY_FORCING state
From: Florian Fainelli @ 2018-10-19 18:43 UTC (permalink / raw)
  To: Jose Abreu, Andrew Lunn
  Cc: David S. Miller, netdev@vger.kernel.org, Joao Pinto
In-Reply-To: <cd0c9fe6-8b13-041b-1f80-caf3c6250df5@synopsys.com>

On 10/19/2018 05:02 AM, Jose Abreu wrote:
> Hello Andrew and Florian,
> 
> Currently I have a 10G C45 phy that is fixed at 10G link. This
> version does not support auto negotiation so I'm turning off the
> feature in phydev struct field. I found out that when I do this
> phylib is not composing C45 frames and is instead using C22. This
> is due to call to genphy_udpate_link() which doesn't work on my
> phy because it doesn't support C22.
> 
> If I apply attached patch then things work perfectly fine. Can
> you please review it ?

Looks reasonable, I could not find other functions in the state machine
that were not already abstracting the clause type, or letting a driver
callback be called. Can you submit this as a formal patch against
net-next (and not attached, but inline)?

I would suggest creating a helper, e.g: phy_update_link() that way
everything is well namespaced and clear within the state machine itself.
-- 
Florian

^ permalink raw reply

* [PATCH net-next 7/7] net: hns3: Add enable and process hw errors of TM scheduler
From: Salil Mehta @ 2018-10-19 19:15 UTC (permalink / raw)
  To: davem
  Cc: salil.mehta, yisen.zhuang, lipeng321, mehta.salil, netdev,
	linux-kernel, linuxarm, Shiju Jose
In-Reply-To: <20181019191532.10088-1-salil.mehta@huawei.com>

From: Shiju Jose <shiju.jose@huawei.com>

This patch enables and process hw errors of TM scheduler and
QCN(Quantized Congestion Control).

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
 .../net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h |   8 +
 .../net/ethernet/hisilicon/hns3/hns3pf/hclge_err.c | 286 +++++++++++++++++++++
 .../net/ethernet/hisilicon/hns3/hns3pf/hclge_err.h |   3 +
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c    |   6 +
 4 files changed, 303 insertions(+)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h
index f23cf78..872cd4b 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h
@@ -211,6 +211,12 @@ enum hclge_opcode_type {
 	HCLGE_OPC_LED_STATUS_CFG	= 0xB000,
 
 	/* Error INT commands */
+	HCLGE_TM_SCH_ECC_INT_EN		= 0x0829,
+	HCLGE_TM_SCH_ECC_ERR_RINT_CMD	= 0x082d,
+	HCLGE_TM_SCH_ECC_ERR_RINT_CE	= 0x082f,
+	HCLGE_TM_SCH_ECC_ERR_RINT_NFE	= 0x0830,
+	HCLGE_TM_SCH_ECC_ERR_RINT_FE	= 0x0831,
+	HCLGE_TM_SCH_MBIT_ECC_INFO_CMD	= 0x0833,
 	HCLGE_COMMON_ECC_INT_CFG	= 0x1505,
 	HCLGE_IGU_EGU_TNL_INT_QUERY	= 0x1802,
 	HCLGE_IGU_EGU_TNL_INT_EN	= 0x1803,
@@ -218,6 +224,8 @@ enum hclge_opcode_type {
 	HCLGE_IGU_COMMON_INT_QUERY	= 0x1805,
 	HCLGE_IGU_COMMON_INT_EN		= 0x1806,
 	HCLGE_IGU_COMMON_INT_CLR	= 0x1807,
+	HCLGE_TM_QCN_MEM_INT_CFG	= 0x1A14,
+	HCLGE_TM_QCN_MEM_INT_INFO_CMD	= 0x1A17,
 	HCLGE_PPP_CMD0_INT_CMD		= 0x2100,
 	HCLGE_PPP_CMD1_INT_CMD		= 0x2101,
 	HCLGE_NCSI_INT_QUERY		= 0x2400,
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.c
index ea73def..f7e363b 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.c
@@ -213,6 +213,129 @@ static const struct hclge_hw_error hclge_ppp_mpf_int3[] = {
 	{ /* sentinel */ }
 };
 
+struct hclge_tm_sch_ecc_info {
+	const char *name;
+};
+
+static const struct hclge_tm_sch_ecc_info hclge_tm_sch_ecc_err[7][15] = {
+	{
+		{ .name = "QSET_QUEUE_CTRL:PRI_LEN TAB" },
+		{ .name = "QSET_QUEUE_CTRL:SPA_LEN TAB" },
+		{ .name = "QSET_QUEUE_CTRL:SPB_LEN TAB" },
+		{ .name = "QSET_QUEUE_CTRL:WRRA_LEN TAB" },
+		{ .name = "QSET_QUEUE_CTRL:WRRB_LEN TAB" },
+		{ .name = "QSET_QUEUE_CTRL:SPA_HPTR TAB" },
+		{ .name = "QSET_QUEUE_CTRL:SPB_HPTR TAB" },
+		{ .name = "QSET_QUEUE_CTRL:WRRA_HPTR TAB" },
+		{ .name = "QSET_QUEUE_CTRL:WRRB_HPTR TAB" },
+		{ .name = "QSET_QUEUE_CTRL:QS_LINKLIST TAB" },
+		{ .name = "QSET_QUEUE_CTRL:SPA_TPTR TAB" },
+		{ .name = "QSET_QUEUE_CTRL:SPB_TPTR TAB" },
+		{ .name = "QSET_QUEUE_CTRL:WRRA_TPTR TAB" },
+		{ .name = "QSET_QUEUE_CTRL:WRRB_TPTR TAB" },
+		{ .name = "QSET_QUEUE_CTRL:QS_DEFICITCNT TAB" },
+	},
+	{
+		{ .name = "ROCE_QUEUE_CTRL:QS_LEN TAB" },
+		{ .name = "ROCE_QUEUE_CTRL:QS_TPTR TAB" },
+		{ .name = "ROCE_QUEUE_CTRL:QS_HPTR TAB" },
+		{ .name = "ROCE_QUEUE_CTRL:QLINKLIST TAB" },
+		{ .name = "ROCE_QUEUE_CTRL:QCLEN TAB" },
+	},
+	{
+		{ .name = "NIC_QUEUE_CTRL:QS_LEN TAB" },
+		{ .name = "NIC_QUEUE_CTRL:QS_TPTR TAB" },
+		{ .name = "NIC_QUEUE_CTRL:QS_HPTR TAB" },
+		{ .name = "NIC_QUEUE_CTRL:QLINKLIST TAB" },
+		{ .name = "NIC_QUEUE_CTRL:QCLEN TAB" },
+	},
+	{
+		{ .name = "RAM_CFG_CTRL:CSHAP TAB" },
+		{ .name = "RAM_CFG_CTRL:PSHAP TAB" },
+	},
+	{
+		{ .name = "SHAPER_CTRL:PSHAP TAB" },
+	},
+	{
+		{ .name = "MSCH_CTRL" },
+	},
+	{
+		{ .name = "TOP_CTRL" },
+	},
+};
+
+static const struct hclge_hw_error hclge_tm_sch_err_int[] = {
+	{ .int_msk = BIT(0), .msg = "tm_sch_ecc_1bit_err" },
+	{ .int_msk = BIT(1), .msg = "tm_sch_ecc_mbit_err" },
+	{ .int_msk = BIT(2), .msg = "tm_sch_port_shap_sub_fifo_wr_full_err" },
+	{ .int_msk = BIT(3), .msg = "tm_sch_port_shap_sub_fifo_rd_empty_err" },
+	{ .int_msk = BIT(4), .msg = "tm_sch_pg_pshap_sub_fifo_wr_full_err" },
+	{ .int_msk = BIT(5), .msg = "tm_sch_pg_pshap_sub_fifo_rd_empty_err" },
+	{ .int_msk = BIT(6), .msg = "tm_sch_pg_cshap_sub_fifo_wr_full_err" },
+	{ .int_msk = BIT(7), .msg = "tm_sch_pg_cshap_sub_fifo_rd_empty_err" },
+	{ .int_msk = BIT(8), .msg = "tm_sch_pri_pshap_sub_fifo_wr_full_err" },
+	{ .int_msk = BIT(9), .msg = "tm_sch_pri_pshap_sub_fifo_rd_empty_err" },
+	{ .int_msk = BIT(10), .msg = "tm_sch_pri_cshap_sub_fifo_wr_full_err" },
+	{ .int_msk = BIT(11), .msg = "tm_sch_pri_cshap_sub_fifo_rd_empty_err" },
+	{ .int_msk = BIT(12),
+	  .msg = "tm_sch_port_shap_offset_fifo_wr_full_err" },
+	{ .int_msk = BIT(13),
+	  .msg = "tm_sch_port_shap_offset_fifo_rd_empty_err" },
+	{ .int_msk = BIT(14),
+	  .msg = "tm_sch_pg_pshap_offset_fifo_wr_full_err" },
+	{ .int_msk = BIT(15),
+	  .msg = "tm_sch_pg_pshap_offset_fifo_rd_empty_err" },
+	{ .int_msk = BIT(16),
+	  .msg = "tm_sch_pg_cshap_offset_fifo_wr_full_err" },
+	{ .int_msk = BIT(17),
+	  .msg = "tm_sch_pg_cshap_offset_fifo_rd_empty_err" },
+	{ .int_msk = BIT(18),
+	  .msg = "tm_sch_pri_pshap_offset_fifo_wr_full_err" },
+	{ .int_msk = BIT(19),
+	  .msg = "tm_sch_pri_pshap_offset_fifo_rd_empty_err" },
+	{ .int_msk = BIT(20),
+	  .msg = "tm_sch_pri_cshap_offset_fifo_wr_full_err" },
+	{ .int_msk = BIT(21),
+	  .msg = "tm_sch_pri_cshap_offset_fifo_rd_empty_err" },
+	{ .int_msk = BIT(22), .msg = "tm_sch_rq_fifo_wr_full_err" },
+	{ .int_msk = BIT(23), .msg = "tm_sch_rq_fifo_rd_empty_err" },
+	{ .int_msk = BIT(24), .msg = "tm_sch_nq_fifo_wr_full_err" },
+	{ .int_msk = BIT(25), .msg = "tm_sch_nq_fifo_rd_empty_err" },
+	{ .int_msk = BIT(26), .msg = "tm_sch_roce_up_fifo_wr_full_err" },
+	{ .int_msk = BIT(27), .msg = "tm_sch_roce_up_fifo_rd_empty_err" },
+	{ .int_msk = BIT(28), .msg = "tm_sch_rcb_byte_fifo_wr_full_err" },
+	{ .int_msk = BIT(29), .msg = "tm_sch_rcb_byte_fifo_rd_empty_err" },
+	{ .int_msk = BIT(30), .msg = "tm_sch_ssu_byte_fifo_wr_full_err" },
+	{ .int_msk = BIT(31), .msg = "tm_sch_ssu_byte_fifo_rd_empty_err" },
+	{ /* sentinel */ }
+};
+
+static const struct hclge_hw_error hclge_qcn_ecc_err_int[] = {
+	{ .int_msk = BIT(0), .msg = "qcn_byte_mem_ecc_1bit_err" },
+	{ .int_msk = BIT(1), .msg = "qcn_byte_mem_ecc_mbit_err" },
+	{ .int_msk = BIT(2), .msg = "qcn_time_mem_ecc_1bit_err" },
+	{ .int_msk = BIT(3), .msg = "qcn_time_mem_ecc_mbit_err" },
+	{ .int_msk = BIT(4), .msg = "qcn_fb_mem_ecc_1bit_err" },
+	{ .int_msk = BIT(5), .msg = "qcn_fb_mem_ecc_mbit_err" },
+	{ .int_msk = BIT(6), .msg = "qcn_link_mem_ecc_1bit_err" },
+	{ .int_msk = BIT(7), .msg = "qcn_link_mem_ecc_mbit_err" },
+	{ .int_msk = BIT(8), .msg = "qcn_rate_mem_ecc_1bit_err" },
+	{ .int_msk = BIT(9), .msg = "qcn_rate_mem_ecc_mbit_err" },
+	{ .int_msk = BIT(10), .msg = "qcn_tmplt_mem_ecc_1bit_err" },
+	{ .int_msk = BIT(11), .msg = "qcn_tmplt_mem_ecc_mbit_err" },
+	{ .int_msk = BIT(12), .msg = "qcn_shap_cfg_mem_ecc_1bit_err" },
+	{ .int_msk = BIT(13), .msg = "qcn_shap_cfg_mem_ecc_mbit_err" },
+	{ .int_msk = BIT(14), .msg = "qcn_gp0_barrel_mem_ecc_1bit_err" },
+	{ .int_msk = BIT(15), .msg = "qcn_gp0_barrel_mem_ecc_mbit_err" },
+	{ .int_msk = BIT(16), .msg = "qcn_gp1_barrel_mem_ecc_1bit_err" },
+	{ .int_msk = BIT(17), .msg = "qcn_gp1_barrel_mem_ecc_mbit_err" },
+	{ .int_msk = BIT(18), .msg = "qcn_gp2_barrel_mem_ecc_1bit_err" },
+	{ .int_msk = BIT(19), .msg = "qcn_gp2_barrel_mem_ecc_mbit_err" },
+	{ .int_msk = BIT(20), .msg = "qcn_gp3_barral_mem_ecc_1bit_err" },
+	{ .int_msk = BIT(21), .msg = "qcn_gp3_barral_mem_ecc_mbit_err" },
+	{ /* sentinel */ }
+};
+
 static void hclge_log_error(struct device *dev,
 			    const struct hclge_hw_error *err_list,
 			    u32 err_sts)
@@ -501,6 +624,47 @@ static int hclge_enable_ppp_error(struct hclge_dev *hdev, bool en)
 	return ret;
 }
 
+int hclge_enable_tm_hw_error(struct hclge_dev *hdev, bool en)
+{
+	struct device *dev = &hdev->pdev->dev;
+	struct hclge_desc desc;
+	int ret;
+
+	/* enable TM SCH hw errors */
+	hclge_cmd_setup_basic_desc(&desc, HCLGE_TM_SCH_ECC_INT_EN, false);
+	if (en)
+		desc.data[0] = cpu_to_le32(HCLGE_TM_SCH_ECC_ERR_INT_EN);
+	else
+		desc.data[0] = 0;
+
+	ret = hclge_cmd_send(&hdev->hw, &desc, 1);
+	if (ret) {
+		dev_err(dev, "failed(%d) to configure TM SCH errors\n", ret);
+		return ret;
+	}
+
+	/* enable TM QCN hw errors */
+	ret = hclge_cmd_query_error(hdev, &desc, HCLGE_TM_QCN_MEM_INT_CFG,
+				    0, 0, 0);
+	if (ret) {
+		dev_err(dev, "failed(%d) to read TM QCN CFG status\n", ret);
+		return ret;
+	}
+
+	hclge_cmd_reuse_desc(&desc, false);
+	if (en)
+		desc.data[1] = cpu_to_le32(HCLGE_TM_QCN_MEM_ERR_INT_EN);
+	else
+		desc.data[1] = 0;
+
+	ret = hclge_cmd_send(&hdev->hw, &desc, 1);
+	if (ret)
+		dev_err(dev,
+			"failed(%d) to configure TM QCN mem errors\n", ret);
+
+	return ret;
+}
+
 static void hclge_process_common_error(struct hclge_dev *hdev,
 				       enum hclge_err_int_type type)
 {
@@ -736,6 +900,125 @@ static void hclge_process_ppp_error(struct hclge_dev *hdev,
 			ret);
 }
 
+static void hclge_process_tm_sch_error(struct hclge_dev *hdev)
+{
+	struct device *dev = &hdev->pdev->dev;
+	const struct hclge_tm_sch_ecc_info *tm_sch_ecc_info;
+	struct hclge_desc desc;
+	u32 ecc_info;
+	u8 module_no;
+	u8 ram_no;
+	int ret;
+
+	/* read TM scheduler errors */
+	ret = hclge_cmd_query_error(hdev, &desc,
+				    HCLGE_TM_SCH_MBIT_ECC_INFO_CMD, 0, 0, 0);
+	if (ret) {
+		dev_err(dev, "failed(%d) to read SCH mbit ECC err info\n", ret);
+		return;
+	}
+	ecc_info = le32_to_cpu(desc.data[0]);
+
+	ret = hclge_cmd_query_error(hdev, &desc,
+				    HCLGE_TM_SCH_ECC_ERR_RINT_CMD, 0, 0, 0);
+	if (ret) {
+		dev_err(dev, "failed(%d) to read SCH ECC err status\n", ret);
+		return;
+	}
+
+	/* log TM scheduler errors */
+	if (le32_to_cpu(desc.data[0])) {
+		hclge_log_error(dev, &hclge_tm_sch_err_int[0],
+				le32_to_cpu(desc.data[0]));
+		if (le32_to_cpu(desc.data[0]) & 0x2) {
+			module_no = (ecc_info >> 20) & 0xF;
+			ram_no = (ecc_info >> 16) & 0xF;
+			tm_sch_ecc_info =
+				&hclge_tm_sch_ecc_err[module_no][ram_no];
+			dev_warn(dev, "ecc err module:ram=%s\n",
+				 tm_sch_ecc_info->name);
+			dev_warn(dev, "ecc memory address = 0x%x\n",
+				 ecc_info & 0xFFFF);
+		}
+	}
+
+	/* clear TM scheduler errors */
+	ret = hclge_cmd_clear_error(hdev, &desc, NULL, 0, 0);
+	if (ret) {
+		dev_err(dev, "failed(%d) to clear TM SCH error status\n", ret);
+		return;
+	}
+
+	ret = hclge_cmd_query_error(hdev, &desc,
+				    HCLGE_TM_SCH_ECC_ERR_RINT_CE, 0, 0, 0);
+	if (ret) {
+		dev_err(dev, "failed(%d) to read SCH CE status\n", ret);
+		return;
+	}
+
+	ret = hclge_cmd_clear_error(hdev, &desc, NULL, 0, 0);
+	if (ret) {
+		dev_err(dev, "failed(%d) to clear TM SCH CE status\n", ret);
+		return;
+	}
+
+	ret = hclge_cmd_query_error(hdev, &desc,
+				    HCLGE_TM_SCH_ECC_ERR_RINT_NFE, 0, 0, 0);
+	if (ret) {
+		dev_err(dev, "failed(%d) to read SCH NFE status\n", ret);
+		return;
+	}
+
+	ret = hclge_cmd_clear_error(hdev, &desc, NULL, 0, 0);
+	if (ret) {
+		dev_err(dev, "failed(%d) to clear TM SCH NFE status\n", ret);
+		return;
+	}
+
+	ret = hclge_cmd_query_error(hdev, &desc,
+				    HCLGE_TM_SCH_ECC_ERR_RINT_FE, 0, 0, 0);
+	if (ret) {
+		dev_err(dev, "failed(%d) to read SCH FE status\n", ret);
+		return;
+	}
+
+	ret = hclge_cmd_clear_error(hdev, &desc, NULL, 0, 0);
+	if (ret)
+		dev_err(dev, "failed(%d) to clear TM SCH FE status\n", ret);
+}
+
+static void hclge_process_tm_qcn_error(struct hclge_dev *hdev)
+{
+	struct device *dev = &hdev->pdev->dev;
+	struct hclge_desc desc;
+	int ret;
+
+	/* read QCN errors */
+	ret = hclge_cmd_query_error(hdev, &desc,
+				    HCLGE_TM_QCN_MEM_INT_INFO_CMD, 0, 0, 0);
+	if (ret) {
+		dev_err(dev, "failed(%d) to read QCN ECC err status\n", ret);
+		return;
+	}
+
+	/* log QCN errors */
+	if (le32_to_cpu(desc.data[0]))
+		hclge_log_error(dev, &hclge_qcn_ecc_err_int[0],
+				le32_to_cpu(desc.data[0]));
+
+	/* clear QCN errors */
+	ret = hclge_cmd_clear_error(hdev, &desc, NULL, 0, 0);
+	if (ret)
+		dev_err(dev, "failed(%d) to clear QCN error status\n", ret);
+}
+
+static void hclge_process_tm_error(struct hclge_dev *hdev,
+				   enum hclge_err_int_type type)
+{
+	hclge_process_tm_sch_error(hdev);
+	hclge_process_tm_qcn_error(hdev);
+}
+
 static const struct hclge_hw_blk hw_blk[] = {
 	{ .msk = BIT(0), .name = "IGU_EGU",
 	  .enable_error = hclge_enable_igu_egu_error,
@@ -743,6 +1026,9 @@ static const struct hclge_hw_blk hw_blk[] = {
 	{ .msk = BIT(5), .name = "COMMON",
 	  .enable_error = hclge_enable_common_error,
 	  .process_error = hclge_process_common_error, },
+	{ .msk = BIT(4), .name = "TM",
+	  .enable_error = hclge_enable_tm_hw_error,
+	  .process_error = hclge_process_tm_error, },
 	{ .msk = BIT(1), .name = "PPP",
 	  .enable_error = hclge_enable_ppp_error,
 	  .process_error = hclge_process_ppp_error, },
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.h b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.h
index c6d3739..e0e3b58 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.h
@@ -37,6 +37,8 @@
 #define HCLGE_PPP_MPF_ECC_ERR_INT2_EN_MASK	0x003F
 #define HCLGE_PPP_MPF_ECC_ERR_INT3_EN	0x003F
 #define HCLGE_PPP_MPF_ECC_ERR_INT3_EN_MASK	0x003F
+#define HCLGE_TM_SCH_ECC_ERR_INT_EN	0x3
+#define HCLGE_TM_QCN_MEM_ERR_INT_EN	0xFFFFFF
 #define HCLGE_NCSI_ERR_INT_EN	0x3
 #define HCLGE_NCSI_ERR_INT_TYPE	0x9
 
@@ -76,5 +78,6 @@ struct hclge_hw_error {
 };
 
 int hclge_hw_error_set_state(struct hclge_dev *hdev, bool state);
+int hclge_enable_tm_hw_error(struct hclge_dev *hdev, bool en);
 pci_ers_result_t hclge_process_ras_hw_error(struct hnae3_ae_dev *ae_dev);
 #endif
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index 082ea97..5234b53 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -6881,6 +6881,12 @@ static int hclge_reset_ae_dev(struct hnae3_ae_dev *ae_dev)
 		return ret;
 	}
 
+	/* Re-enable the TM hw error interrupts because
+	 * they get disabled on core/global reset.
+	 */
+	if (hclge_enable_tm_hw_error(hdev, true))
+		dev_err(&pdev->dev, "failed to enable TM hw error interrupts\n");
+
 	dev_info(&pdev->dev, "Reset done, %s driver initialization finished.\n",
 		 HCLGE_DRIVER_NAME);
 
-- 
2.7.4

^ permalink raw reply related

* Re: [PATCH net-next] net: ethernet: ti: cpsw: don't flush mcast entries while switch promisc mode
From: Ivan Khoronzhuk @ 2018-10-19 19:24 UTC (permalink / raw)
  To: Grygorii Strashko; +Cc: davem, linux-omap, netdev, linux-kernel
In-Reply-To: <36da9bb2-38b7-cc70-9569-8895e20c6d1f@ti.com>

On Fri, Oct 19, 2018 at 12:23:28PM -0500, Grygorii Strashko wrote:
>
>
>On 10/19/18 7:04 AM, Ivan Khoronzhuk wrote:
>>On Thu, Oct 18, 2018 at 07:03:06PM -0500, Grygorii Strashko wrote:
>>>
>>>
>>>On 10/18/18 1:00 PM, Ivan Khoronzhuk wrote:
>>>>No need now to flush mcast entries in switch mode while toggling to
>>>>promiscuous mode. It's not needed as vlan reg_mcast = ALL_PORTS
>>>>and mcast/vlan ports = ALL_PORTS, the same happening for vlan
>>>>unreg_mcast, it's set to ALL_PORT_MASK just after calling promisc
>>>>mode routine by calling set allmulti. I suppose main reason to flush
>>>>them is to use unreg_mcast to receive all to host port. Thus, now, all
>>>>mcast packets are received anyway and no reason to flush mcast entries
>>>>unsafely, as they were synced with __dev_mc_sync() previously and are
>>>>not restored. Another way is to _dev_mc_unsync() them, but no need.
>>>
>>>User have possibility to add additional mcast entries or edit 
>>>existing in switch-mode, which is now done using custom tool. So, 
>>>Host in promisc
>>>mode will not receive packets for mcast address X if port mask for this
>>>addr set to (ALL_PORTS - HOST_PORT). Am I missing smth?
>>
>>I didn't take into account the custom tool changing entries directly,
>>but even in this case there is at least a couple of interesting
>>questions:
>>
>>1) Before the patch applied only several days ago -
>>   5da1948969bc1991920987ce4361ea56046e5a98
>>   "ti: cpsw: fix lost of mcast packets while rx_mode update"
>>   It was impossible to do correctly anyway, as all mcast entries not
>>   in the mc list were flushed (after rx_mode cb), by:
>>   cpsw_ale_flush_multicast(cpsw->ale, ALE_ALL_PORTS, vid);
>>   and those in mc, rewritten by adding them back in corrected form.
>>   ... or this cb was not supposed to be called at all ...
>
>It's not allowed to manipulate ALE table in dual_mac mode, so your
>patches are safe in dual_mac mode. For switch-mode (unless we  move
>forward with switch dev) standard linux interfaces allows create
>default mcast entries which then (if required) corrected using custom
>tool now.
It's not related to my patches at all, this as it was from very beginning.
My proposition with __dev_mc_unsync(priv->ndev, NULL) is safe as for
dual_mac mode as for switch mode.

>
>>
>>2) What is the reason to add mcast switch entires
>>   (ALL_PORTS - HOST_PORT) if its function is added anyway by
>>   unreq_mcast & (ALL_PORTS - HOST_PORT) ?
>>   So, doesn't matter it's added or not - it will work :-|.
>
>because in switch mode not all traffic directed to the Host port -
No objections, that's exactly I'm talking about here, only p1&p2.
And looks like you forgot about single port board reusing switch
implementation, which by fact is not a switch but reuse this code.
Proposition to flush all its mcast table w/o restore - softly
speaking isn't best choice.

>only in promisc mode. Reason safety and performance - Host should not
>receive traffic which is not designated for it.
Any objections.

>
>promiscuous in switch mode:
>- disables learning
>- enables unicast flooding to Host port
>- enables unregistered multi-cast flooding to the Host port
>In other words, CPSW will continue forwarding packets between P1&P2, 
>but also will "duplicate" packets to Host port. This will work only 
>for
>vlans which have host port as member.
Sorry, but what part was not clearly described in the cover letter?
If you mean vlan entries added not by linux (linux vlans have host
port), and having no host port as a memeber - flush() hasn't been
helping in this case also.

>
>>
>>3) Even so, toggling promisc mode will clear all these changes anyway,
>>   even I will call _dev_mc_unsync() after flushing them.
>
>there can be records which are not under control of Linux now.
yes, and they are flushed w/o restore (along with linux ones).
that's why fix is needed.

>
>>
>>4) If user can tune ALE table by hand, what stops him do it after moving
>>   to promisc mode, seems he knows what he's doing?
>>
>>5) It could be possible only for not default vlan entries, but mcast
>>   vlan support is not supported yet. Who is gona restore those
>>   entries after promisc off?
>>
>>This behaviour is arguable, and flushing mcast entries can bring more
>>issues then leaving. For me it doesn't matter, I can archive the same
>>by adding after flush one line, it's even shorter:
>>__dev_mc_unsync(priv->ndev, NULL);
>
>Again, unless we move forward with switch dev you can't assume that
>Linux stack has full control over ALE table.
and that's why Linux stack is flushing all not created by it mcast entries...
but this what it was doing w/o my changes ).
after my patch it won't, Look:
- it adds/removes entries under control of linux
- it doesn't touch entries added whoever but not a linux
- it restores only entries added linux
  (it was before also, but with flushing all not related entries..
  ..deleting what you don't want to delete)

>Sry, hence this patch is
>not a fix and can introduce changes in current behavior and cause
>regression reports - NACK.
Sorry but what regression you are talking about, what the problem with
__dev_mc_unsync(priv->ndev, NULL)? after flush? This is more than fix.
Restoring entries that were under control of linux is not a fix?
It doesn't touch anything added by custom tool and restores entries
in mc list like it was before. According to your comments, it seems
absolutely one fix you need, ...and don't forget about BBB with one port.

>
>-- 
>regards,
>-grygorii

-- 
Regards,
Ivan Khoronzhuk

^ permalink raw reply

* Re: [PATCH bpf-next v2 02/13] bpf: btf: Add BTF_KIND_FUNC and BTF_KIND_FUNC_PROTO
From: Martin Lau @ 2018-10-19 19:36 UTC (permalink / raw)
  To: Edward Cree
  Cc: Yonghong Song, Alexei Starovoitov, daniel@iogearbox.net,
	netdev@vger.kernel.org, Kernel Team
In-Reply-To: <852b205b-1d32-1d85-5dcf-0edfa8efcd6a@solarflare.com>

On Fri, Oct 19, 2018 at 06:04:11PM +0100, Edward Cree wrote:
> On 18/10/18 22:19, Martin Lau wrote:
> > As I have mentioned earlier, it is also special to
> > the kernel because the BTF verifier and bpf_prog_load()
> > need to do different checks for FUNC and FUNC_PROTO to
> > ensure they are sane.
> >
> > First, we need to agree that the kernel needs to verify
> > them differently.
> >
> > and then we can proceed to discuss how to distinguish them.
> > We picked the current way to avoid adding a
> > new BTF function section and keep it
> > strict forward to distinguish them w/o relying
> > on other hints from 'struct btf_type'.
> >
> > Are you suggesting another way of doing it?
> But you *do* have such a new section.
> The patch comment talks about a 'FuncInfo Table' which appears to
Note that the new section, which contains the FuncInfo Table,
is in a new ELF section ".BTF.ext" instead of the ".BTF".
It is not in the ".BTF" section because it is only useful during
bpf_prog_load().

I was meaning a new function section within ".BTF".

>  map (section, insn_idx) to type_id.  (I think this gets added in
>  .BTF.ext per patch 9?)  So when you're looking at a FUNC type
>  because you looked up a type_id from that table, you know it's
>  the signature of a subprogram, and you're checking it as such.
> Whereas, if you were doing something with some other type and it
>  referenced a FUNC type (e.g., in the patch comment's example,
>  you're checking foo's first argument against the type bar) in
>  its type_id, you know you're using it as a formal type (a FUNC_
>  PROTO in your parlance) and not as a subprogram.
> The context in which you are using a type entry tells you which
>  kind it is.  And the verifier can and should be smart enough to
>  know what it's doing in this way.
> 
> And it's entirely reasonable for the same type entry to get used
>  for both those cases; in my example, you'd have a FUNC type for
>  int foo(int), referenced both by the func_info entry for foo
>  and by the PTR type for bar.  And if you had another subprogram
>  int baz(int), its func_info entry could reference the same
>  type_id, because the (reference to the) name of the function
>  should live in the func_info, not in the type.
IIUC, I think what you are suggesting here is to use (type_id, name)
to describe DW_TAG_subprogram "int foo1(int) {}", "int foo2(int) {}",
"int foo3(int) {}" where type_id here is referring to the same
DW_TAG_subroutine_type, and only define that _one_
DW_TAG_subroutine_type in the BTF "type" section.

That will require more manipulation/type-merging in the dwarf2btf
process and it could get quite complicated.

Note that CTF is also fully spelling out the return type
and arg types for each DW_TAG_subprogram in a separate
function section (still within the same ELF section).
The only difference here is they are merged into the type
section and FUNC_PROTO is added.

If the concern is having both FUNC and FUNC_PROTO is confusing,
we could go back to the CTF way which adds a new function section
in ".BTF" and it is only for DW_TAG_subprogram.
BTF_KIND_FUNC_PROTO is then no longer necessary.
Some of new BTF verifier checkings may actually go away also.
The down side is there will be two id spaces.

> 
> What you are proposing seems to be saying "if we have this
>  particular special btf_kind, then this BTF entry doesn't just
>  define a type, it declares a subprogram of that type".  Oh,
>  and with the name of the type as the subprogram name.  Which
>  just creates blurry confusion as to whether BTF entries define
>  types or declare objects; IMNSHO the correct approach is for
>  objects to be declared elsewhere and to reference BTF types by
>  their type_id.
> Which is what the func_info table in patch 9 appears to do.
> 
> (It also rather bothers me the way we are using special type
>  names to associate maps with their k/v types, rather than
>  extending the records in the maps section to include type_ids
>  referencing them.  It's the same kind of weird implicitness,
>  and if I'd spotted it when it was going in I'd've nacked it,
>  but I suppose it's ABI now and too late to change.)
> 
> -Ed

^ permalink raw reply

* Re: [PATCH ghak90 (was ghak32) V4 02/10] audit: add container id
From: Paul Moore @ 2018-10-19 19:38 UTC (permalink / raw)
  To: rgb
  Cc: containers, linux-api, linux-audit, linux-fsdevel, linux-kernel,
	netdev, netfilter-devel, ebiederm, luto, carlos, dhowells, viro,
	simo, Eric Paris, Serge Hallyn
In-Reply-To: <12396e378a78ee8dd38c75f7730d67d8fbb08e02.1533065887.git.rgb@redhat.com>

On Tue, Jul 31, 2018 at 4:11 PM Richard Guy Briggs <rgb@redhat.com> wrote:
>
> Implement the proc fs write to set the audit container identifier of a
> process, emitting an AUDIT_CONTAINER_OP record to document the event.
>
> This is a write from the container orchestrator task to a proc entry of
> the form /proc/PID/audit_containerid where PID is the process ID of the
> newly created task that is to become the first task in a container, or
> an additional task added to a container.
>
> The write expects up to a u64 value (unset: 18446744073709551615).
>
> The writer must have capability CAP_AUDIT_CONTROL.
>
> This will produce a record such as this:
>   type=CONTAINER_ID msg=audit(2018-06-06 12:39:29.636:26949) : op=set opid=2209 old-contid=18446744073709551615 contid=123456 pid=628 auid=root uid=root tty=ttyS0 ses=1 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 comm=bash exe=/usr/bin/bash res=yes

You need to update the record type in the example above.

> The "op" field indicates an initial set.  The "pid" to "ses" fields are
> the orchestrator while the "opid" field is the object's PID, the process
> being "contained".  Old and new audit container identifier values are
> given in the "contid" fields, while res indicates its success.

I understand Steve's concern around the "op" field, but I think it
might be a bit premature to think we might not need to do some sort of
audit container ID management in the future that would want to make
use of the CONTAINER_OP message type.  I would like to see the "op"
field preserved.

> It is not permitted to unset the audit container identifier.
> A child inherits its parent's audit container identifier.
>
> See: https://github.com/linux-audit/audit-kernel/issues/90
> See: https://github.com/linux-audit/audit-userspace/issues/51
> See: https://github.com/linux-audit/audit-testsuite/issues/64
> See: https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID
>
> Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> Acked-by: Serge Hallyn <serge@hallyn.com>
> Acked-by: Steve Grubb <sgrubb@redhat.com>
> ---
>  fs/proc/base.c             | 37 +++++++++++++++++++++++++
>  include/linux/audit.h      | 24 ++++++++++++++++
>  include/uapi/linux/audit.h |  2 ++
>  kernel/auditsc.c           | 68 ++++++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 131 insertions(+)

...

> @@ -2112,6 +2114,72 @@ int audit_set_loginuid(kuid_t loginuid)
>  }
>
>  /**
> + * audit_set_contid - set current task's audit_context contid
> + * @contid: contid value
> + *
> + * Returns 0 on success, -EPERM on permission failure.
> + *
> + * Called (set) from fs/proc/base.c::proc_contid_write().
> + */
> +int audit_set_contid(struct task_struct *task, u64 contid)
> +{
> +       u64 oldcontid;
> +       int rc = 0;
> +       struct audit_buffer *ab;
> +       uid_t uid;
> +       struct tty_struct *tty;
> +       char comm[sizeof(current->comm)];
> +
> +       task_lock(task);
> +       /* Can't set if audit disabled */
> +       if (!task->audit) {
> +               task_unlock(task);
> +               return -ENOPROTOOPT;
> +       }
> +       oldcontid = audit_get_contid(task);
> +       read_lock(&tasklist_lock);

I assume lockdep was happy with nesting the tasklist_lock inside the task lock?

> +       /* Don't allow the audit containerid to be unset */
> +       if (!audit_contid_valid(contid))
> +               rc = -EINVAL;
> +       /* if we don't have caps, reject */
> +       else if (!capable(CAP_AUDIT_CONTROL))
> +               rc = -EPERM;
> +       /* if task has children or is not single-threaded, deny */
> +       else if (!list_empty(&task->children))
> +               rc = -EBUSY;
> +       else if (!(thread_group_leader(task) && thread_group_empty(task)))
> +               rc = -EALREADY;
> +       read_unlock(&tasklist_lock);
> +       if (!rc)
> +               task->audit->contid = contid;
> +       task_unlock(task);
> +
> +       if (!audit_enabled)
> +               return rc;
> +
> +       ab = audit_log_start(audit_context(), GFP_KERNEL, AUDIT_CONTAINER_OP);
> +       if (!ab)
> +               return rc;
> +
> +       uid = from_kuid(&init_user_ns, task_uid(current));
> +       tty = audit_get_tty(current);
> +       audit_log_format(ab, "op=set opid=%d old-contid=%llu contid=%llu pid=%d uid=%u auid=%u tty=%s ses=%u",
> +                        task_tgid_nr(task), oldcontid, contid,
> +                        task_tgid_nr(current), uid,
> +                        from_kuid(&init_user_ns, audit_get_loginuid(current)),
> +                        tty ? tty_name(tty) : "(none)",
> +                        audit_get_sessionid(current));
> +       audit_put_tty(tty);
> +       audit_log_task_context(ab);
> +       audit_log_format(ab, " comm=");
> +       audit_log_untrustedstring(ab, get_task_comm(comm, current));
> +       audit_log_d_path_exe(ab, current->mm);
> +       audit_log_format(ab, " res=%d", !rc);
> +       audit_log_end(ab);
> +       return rc;
> +}

-- 
paul moore
www.paul-moore.com

^ permalink raw reply

* [PATCH net-next 2/4] net/ipv6: Remove ip_idx arg to in6_dump_addrs
From: David Ahern @ 2018-10-19 19:45 UTC (permalink / raw)
  To: netdev; +Cc: davem, David Ahern
In-Reply-To: <20181019194530.3590-1-dsahern@kernel.org>

From: David Ahern <dsahern@gmail.com>

ip_idx is always 0 going into in6_dump_addrs; it is passed as a pointer
to save the last good index into cb. Since cb is already argument to
in6_dump_addrs, just save the value there.

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 net/ipv6/addrconf.c | 16 ++++++----------
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index e39c284e2954..6b659846ff8a 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -4955,14 +4955,13 @@ static int inet6_fill_ifacaddr(struct sk_buff *skb, struct ifacaddr6 *ifaca,
 
 /* called with rcu_read_lock() */
 static int in6_dump_addrs(struct inet6_dev *idev, struct sk_buff *skb,
-			  struct netlink_callback *cb,
-			  int s_ip_idx, int *p_ip_idx,
+			  struct netlink_callback *cb, int s_ip_idx,
 			  struct inet6_fill_args *fillargs)
 {
 	struct ifmcaddr6 *ifmca;
 	struct ifacaddr6 *ifaca;
+	int ip_idx = 0;
 	int err = 1;
-	int ip_idx = *p_ip_idx;
 
 	read_lock_bh(&idev->lock);
 	switch (fillargs->type) {
@@ -5012,7 +5011,7 @@ static int in6_dump_addrs(struct inet6_dev *idev, struct sk_buff *skb,
 		break;
 	}
 	read_unlock_bh(&idev->lock);
-	*p_ip_idx = ip_idx;
+	cb->args[2] = ip_idx;
 	return err;
 }
 
@@ -5081,16 +5080,15 @@ static int inet6_dump_addr(struct sk_buff *skb, struct netlink_callback *cb,
 	};
 	struct net *net = sock_net(skb->sk);
 	struct net *tgt_net = net;
+	int idx, s_idx, s_ip_idx;
 	int h, s_h;
-	int idx, ip_idx;
-	int s_idx, s_ip_idx;
 	struct net_device *dev;
 	struct inet6_dev *idev;
 	struct hlist_head *head;
 
 	s_h = cb->args[0];
 	s_idx = idx = cb->args[1];
-	s_ip_idx = ip_idx = cb->args[2];
+	s_ip_idx = cb->args[2];
 
 	if (cb->strict_check) {
 		int err;
@@ -5111,12 +5109,11 @@ static int inet6_dump_addr(struct sk_buff *skb, struct netlink_callback *cb,
 				goto cont;
 			if (h > s_h || idx > s_idx)
 				s_ip_idx = 0;
-			ip_idx = 0;
 			idev = __in6_dev_get(dev);
 			if (!idev)
 				goto cont;
 
-			if (in6_dump_addrs(idev, skb, cb, s_ip_idx, &ip_idx,
+			if (in6_dump_addrs(idev, skb, cb, s_ip_idx,
 					   &fillargs) < 0)
 				goto done;
 cont:
@@ -5127,7 +5124,6 @@ static int inet6_dump_addr(struct sk_buff *skb, struct netlink_callback *cb,
 	rcu_read_unlock();
 	cb->args[0] = h;
 	cb->args[1] = idx;
-	cb->args[2] = ip_idx;
 	if (fillargs.netnsid >= 0)
 		put_net(tgt_net);
 
-- 
2.11.0

^ permalink raw reply related

* [PATCH net-next 0/4] net: Add support for dumping addresses for a specific device
From: David Ahern @ 2018-10-19 19:45 UTC (permalink / raw)
  To: netdev; +Cc: davem, David Ahern

From: David Ahern <dsahern@gmail.com>

Use the recently added kernel side filter infrastructure to add support
for dumping addresses only for a specific device.

Patch 1 creates an IPv4 version similar to IPv6's in6_dump_addrs function.

Patch 2 simplifies in6_dump_addrs by moving index tracking of IP
addresses from inet6_dump_addr to in6_dump_addrs.

Patches 3 and 4 use the device-based address dump helpers to limit a
dump to just the addresses on a specific device.

David Ahern (4):
  net/ipv4: Move loop over addresses in dumps into in_dev_dump_addr
  net/ipv6: Remove ip_idx arg to in6_dump_addrs
  net/ipv4: Add support for dumping addresses for a specific device
  net/ipv6: Add support for dumping addresses for a specific device

 net/ipv4/devinet.c  | 77 +++++++++++++++++++++++++++++++++++++++--------------
 net/ipv6/addrconf.c | 43 +++++++++++++++++++-----------
 2 files changed, 85 insertions(+), 35 deletions(-)

-- 
2.11.0

^ permalink raw reply

* [PATCH net-next 3/4] net/ipv4: Add support for dumping addresses for a specific device
From: David Ahern @ 2018-10-19 19:45 UTC (permalink / raw)
  To: netdev; +Cc: davem, David Ahern
In-Reply-To: <20181019194530.3590-1-dsahern@kernel.org>

From: David Ahern <dsahern@gmail.com>

If an RTM_GETADDR dump request has ifa_index set in the ifaddrmsg
header, then return only the addresses for that device.

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 net/ipv4/devinet.c | 28 +++++++++++++++++++++++-----
 1 file changed, 23 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 67f382c560ba..63d5b58fbfdb 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -109,6 +109,7 @@ struct inet_fill_args {
 	int event;
 	unsigned int flags;
 	int netnsid;
+	int ifindex;
 };
 
 #define IN4_ADDR_HSIZE_SHIFT	8
@@ -1663,8 +1664,9 @@ static int inet_fill_ifaddr(struct sk_buff *skb, struct in_ifaddr *ifa,
 static int inet_valid_dump_ifaddr_req(const struct nlmsghdr *nlh,
 				      struct inet_fill_args *fillargs,
 				      struct net **tgt_net, struct sock *sk,
-				      struct netlink_ext_ack *extack)
+				      struct netlink_callback *cb)
 {
+	struct netlink_ext_ack *extack = cb->extack;
 	struct nlattr *tb[IFA_MAX+1];
 	struct ifaddrmsg *ifm;
 	int err, i;
@@ -1679,9 +1681,11 @@ static int inet_valid_dump_ifaddr_req(const struct nlmsghdr *nlh,
 		NL_SET_ERR_MSG(extack, "ipv4: Invalid values in header for address dump request");
 		return -EINVAL;
 	}
-	if (ifm->ifa_index) {
-		NL_SET_ERR_MSG(extack, "ipv4: Filter by device index not supported for address dump");
-		return -EINVAL;
+
+	fillargs->ifindex = ifm->ifa_index;
+	if (fillargs->ifindex) {
+		cb->answer_flags |= NLM_F_DUMP_FILTERED;
+		fillargs->flags |= NLM_F_DUMP_FILTERED;
 	}
 
 	err = nlmsg_parse_strict(nlh, sizeof(*ifm), tb, IFA_MAX,
@@ -1765,9 +1769,22 @@ static int inet_dump_ifaddr(struct sk_buff *skb, struct netlink_callback *cb)
 
 	if (cb->strict_check) {
 		err = inet_valid_dump_ifaddr_req(nlh, &fillargs, &tgt_net,
-						 skb->sk, cb->extack);
+						 skb->sk, cb);
 		if (err < 0)
 			return err;
+
+		if (fillargs.ifindex) {
+			dev = __dev_get_by_index(tgt_net, fillargs.ifindex);
+			if (!dev)
+				return -ENODEV;
+
+			in_dev = __in_dev_get_rtnl(dev);
+			if (in_dev) {
+				err = in_dev_dump_addr(in_dev, skb, cb, s_ip_idx,
+						       &fillargs);
+			}
+			goto put_tgt_net;
+		}
 	}
 
 	for (h = s_h; h < NETDEV_HASHENTRIES; h++, s_idx = 0) {
@@ -1800,6 +1817,7 @@ static int inet_dump_ifaddr(struct sk_buff *skb, struct netlink_callback *cb)
 done:
 	cb->args[0] = h;
 	cb->args[1] = idx;
+put_tgt_net:
 	if (fillargs.netnsid >= 0)
 		put_net(tgt_net);
 
-- 
2.11.0

^ permalink raw reply related

* [PATCH net-next 1/4] net/ipv4: Move loop over addresses on a device into in_dev_dump_addr
From: David Ahern @ 2018-10-19 19:45 UTC (permalink / raw)
  To: netdev; +Cc: davem, David Ahern
In-Reply-To: <20181019194530.3590-1-dsahern@kernel.org>

From: David Ahern <dsahern@gmail.com>

Similar to IPv6 move the logic that walks over the ipv4 address list
for a device into a helper.

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 net/ipv4/devinet.c | 49 ++++++++++++++++++++++++++++++++++---------------
 1 file changed, 34 insertions(+), 15 deletions(-)

diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index d122ebbe5980..67f382c560ba 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -1713,6 +1713,32 @@ static int inet_valid_dump_ifaddr_req(const struct nlmsghdr *nlh,
 	return 0;
 }
 
+static int in_dev_dump_addr(struct in_device *in_dev, struct sk_buff *skb,
+			    struct netlink_callback *cb, int s_ip_idx,
+			    struct inet_fill_args *fillargs)
+{
+	struct in_ifaddr *ifa;
+	int ip_idx = 0;
+	int err;
+
+	for (ifa = in_dev->ifa_list; ifa; ifa = ifa->ifa_next, ip_idx++) {
+		if (ip_idx < s_ip_idx)
+			continue;
+
+		err = inet_fill_ifaddr(skb, ifa, fillargs);
+		if (err < 0)
+			goto done;
+
+		nl_dump_check_consistent(cb, nlmsg_hdr(skb));
+	}
+	err = 0;
+
+done:
+	cb->args[2] = ip_idx;
+
+	return err;
+}
+
 static int inet_dump_ifaddr(struct sk_buff *skb, struct netlink_callback *cb)
 {
 	const struct nlmsghdr *nlh = cb->nlh;
@@ -1727,19 +1753,17 @@ static int inet_dump_ifaddr(struct sk_buff *skb, struct netlink_callback *cb)
 	struct net *tgt_net = net;
 	int h, s_h;
 	int idx, s_idx;
-	int ip_idx, s_ip_idx;
+	int s_ip_idx;
 	struct net_device *dev;
 	struct in_device *in_dev;
-	struct in_ifaddr *ifa;
 	struct hlist_head *head;
+	int err;
 
 	s_h = cb->args[0];
 	s_idx = idx = cb->args[1];
-	s_ip_idx = ip_idx = cb->args[2];
+	s_ip_idx = cb->args[2];
 
 	if (cb->strict_check) {
-		int err;
-
 		err = inet_valid_dump_ifaddr_req(nlh, &fillargs, &tgt_net,
 						 skb->sk, cb->extack);
 		if (err < 0)
@@ -1761,15 +1785,11 @@ static int inet_dump_ifaddr(struct sk_buff *skb, struct netlink_callback *cb)
 			if (!in_dev)
 				goto cont;
 
-			for (ifa = in_dev->ifa_list, ip_idx = 0; ifa;
-			     ifa = ifa->ifa_next, ip_idx++) {
-				if (ip_idx < s_ip_idx)
-					continue;
-				if (inet_fill_ifaddr(skb, ifa, &fillargs) < 0) {
-					rcu_read_unlock();
-					goto done;
-				}
-				nl_dump_check_consistent(cb, nlmsg_hdr(skb));
+			err = in_dev_dump_addr(in_dev, skb, cb, s_ip_idx,
+					       &fillargs);
+			if (err < 0) {
+				rcu_read_unlock();
+				goto done;
 			}
 cont:
 			idx++;
@@ -1780,7 +1800,6 @@ static int inet_dump_ifaddr(struct sk_buff *skb, struct netlink_callback *cb)
 done:
 	cb->args[0] = h;
 	cb->args[1] = idx;
-	cb->args[2] = ip_idx;
 	if (fillargs.netnsid >= 0)
 		put_net(tgt_net);
 
-- 
2.11.0

^ permalink raw reply related

* [PATCH net-next 4/4] net/ipv6: Add support for dumping addresses for a specific device
From: David Ahern @ 2018-10-19 19:45 UTC (permalink / raw)
  To: netdev; +Cc: davem, David Ahern
In-Reply-To: <20181019194530.3590-1-dsahern@kernel.org>

From: David Ahern <dsahern@gmail.com>

If an RTM_GETADDR dump request has ifa_index set in the ifaddrmsg
header, then return only the addresses for that device.

Since inet6_dump_addr is reused for multicast and anycast addresses,
this adds support for device specfic dumps of RTM_GETMULTICAST and
RTM_GETANYCAST as well.

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 net/ipv6/addrconf.c | 27 ++++++++++++++++++++++-----
 1 file changed, 22 insertions(+), 5 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 6b659846ff8a..45b84dd5c4eb 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -4821,6 +4821,7 @@ struct inet6_fill_args {
 	int event;
 	unsigned int flags;
 	int netnsid;
+	int ifindex;
 	enum addr_type_t type;
 };
 
@@ -5018,8 +5019,9 @@ static int in6_dump_addrs(struct inet6_dev *idev, struct sk_buff *skb,
 static int inet6_valid_dump_ifaddr_req(const struct nlmsghdr *nlh,
 				       struct inet6_fill_args *fillargs,
 				       struct net **tgt_net, struct sock *sk,
-				       struct netlink_ext_ack *extack)
+				       struct netlink_callback *cb)
 {
+	struct netlink_ext_ack *extack = cb->extack;
 	struct nlattr *tb[IFA_MAX+1];
 	struct ifaddrmsg *ifm;
 	int err, i;
@@ -5034,9 +5036,11 @@ static int inet6_valid_dump_ifaddr_req(const struct nlmsghdr *nlh,
 		NL_SET_ERR_MSG_MOD(extack, "Invalid values in header for address dump request");
 		return -EINVAL;
 	}
-	if (ifm->ifa_index) {
-		NL_SET_ERR_MSG_MOD(extack, "Filter by device index not supported for address dump");
-		return -EINVAL;
+
+	fillargs->ifindex = ifm->ifa_index;
+	if (fillargs->ifindex) {
+		cb->answer_flags |= NLM_F_DUMP_FILTERED;
+		fillargs->flags |= NLM_F_DUMP_FILTERED;
 	}
 
 	err = nlmsg_parse_strict(nlh, sizeof(*ifm), tb, IFA_MAX,
@@ -5094,9 +5098,21 @@ static int inet6_dump_addr(struct sk_buff *skb, struct netlink_callback *cb,
 		int err;
 
 		err = inet6_valid_dump_ifaddr_req(nlh, &fillargs, &tgt_net,
-						  skb->sk, cb->extack);
+						  skb->sk, cb);
 		if (err < 0)
 			return err;
+
+		if (fillargs.ifindex) {
+			dev = __dev_get_by_index(tgt_net, fillargs.ifindex);
+			if (!dev)
+				return -ENODEV;
+			idev = __in6_dev_get(dev);
+			if (idev) {
+				err = in6_dump_addrs(idev, skb, cb, s_ip_idx,
+						     &fillargs);
+			}
+			goto put_tgt_net;
+		}
 	}
 
 	rcu_read_lock();
@@ -5124,6 +5140,7 @@ static int inet6_dump_addr(struct sk_buff *skb, struct netlink_callback *cb,
 	rcu_read_unlock();
 	cb->args[0] = h;
 	cb->args[1] = idx;
+put_tgt_net:
 	if (fillargs.netnsid >= 0)
 		put_net(tgt_net);
 
-- 
2.11.0

^ permalink raw reply related

* Re: [PATCH bpf-next v3 0/7] Implement queue/stack maps
From: Daniel Borkmann @ 2018-10-19 20:08 UTC (permalink / raw)
  To: Mauricio Vasquez B, Alexei Starovoitov, netdev; +Cc: Song Liu
In-Reply-To: <153986856416.9127.9618539079636149043.stgit@kernel>

On 10/18/2018 03:16 PM, Mauricio Vasquez B wrote:
> In some applications this is needed have a pool of free elements, for
> example the list of free L4 ports in a SNAT.  None of the current maps allow
> to do it as it is not possible to get any element without having they key
> it is associated to, even if it were possible, the lack of locking mecanishms in
> eBPF would do it almost impossible to be implemented without data races.
> 
> This patchset implements two new kind of eBPF maps: queue and stack.
> Those maps provide to eBPF programs the peek, push and pop operations, and for
> userspace applications a new bpf_map_lookup_and_delete_elem() is added.
> 
> Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
> 
> v2 -> v3:
>  - Remove "almost dead code" in syscall.c
>  - Remove unnecessary copy_from_user in bpf_map_lookup_and_delete_elem
>  - Rebase
> 
> v1 -> v2:
>  - Put ARG_PTR_TO_UNINIT_MAP_VALUE logic into a separated patch
>  - Fix missing __this_cpu_dec & preempt_enable calls in kernel/bpf/syscall.c
> 
> RFC v4 -> v1:
>  - Remove roundup to power of 2 in memory allocation
>  - Remove count and use a free slot to check if queue/stack is empty
>  - Use if + assigment for wrapping indexes
>  - Fix some minor style issues
>  - Squash two patches together
> 
> RFC v3 -> RFC v4:
>  - Revert renaming of kernel/bpf/stackmap.c
>  - Remove restriction on value size
>  - Remove len arguments from peek/pop helpers
>  - Add new ARG_PTR_TO_UNINIT_MAP_VALUE
> 
> RFC v2 -> RFC v3:
>  - Return elements by value instead that by reference
>  - Implement queue/stack base on array and head + tail indexes
>  - Rename stack trace related files to avoid confusion and conflicts
> 
> RFC v1 -> RFC v2:
>  - Create two separate maps instead of single one + flags
>  - Implement bpf_map_lookup_and_delete syscall
>  - Support peek operation
>  - Define replacement policy through flags in the update() method
>  - Add eBPF side tests
> 
> ---
> 
> Mauricio Vasquez B (7):
>       bpf: rename stack trace map operations
>       bpf/syscall: allow key to be null in map functions
>       bpf/verifier: add ARG_PTR_TO_UNINIT_MAP_VALUE
>       bpf: add queue and stack maps
>       bpf: add MAP_LOOKUP_AND_DELETE_ELEM syscall
>       Sync uapi/bpf.h to tools/include
>       selftests/bpf: add test cases for queue and stack maps
> 
> 
>  include/linux/bpf.h                                |    7 
>  include/linux/bpf_types.h                          |    4 
>  include/uapi/linux/bpf.h                           |   30 ++
>  kernel/bpf/Makefile                                |    2 
>  kernel/bpf/core.c                                  |    3 
>  kernel/bpf/helpers.c                               |   43 +++
>  kernel/bpf/queue_stack_maps.c                      |  288 ++++++++++++++++++++
>  kernel/bpf/stackmap.c                              |    2 
>  kernel/bpf/syscall.c                               |   91 ++++++
>  kernel/bpf/verifier.c                              |   28 ++
>  net/core/filter.c                                  |    6 
>  tools/include/uapi/linux/bpf.h                     |   30 ++
>  tools/lib/bpf/bpf.c                                |   12 +
>  tools/lib/bpf/bpf.h                                |    2 
>  tools/testing/selftests/bpf/Makefile               |    5 
>  tools/testing/selftests/bpf/bpf_helpers.h          |    7 
>  tools/testing/selftests/bpf/test_maps.c            |  122 ++++++++
>  tools/testing/selftests/bpf/test_progs.c           |   99 +++++++
>  tools/testing/selftests/bpf/test_queue_map.c       |    4 
>  tools/testing/selftests/bpf/test_queue_stack_map.h |   59 ++++
>  tools/testing/selftests/bpf/test_stack_map.c       |    4 
>  21 files changed, 834 insertions(+), 14 deletions(-)
>  create mode 100644 kernel/bpf/queue_stack_maps.c
>  create mode 100644 tools/testing/selftests/bpf/test_queue_map.c
>  create mode 100644 tools/testing/selftests/bpf/test_queue_stack_map.h
>  create mode 100644 tools/testing/selftests/bpf/test_stack_map.c
> 
> --
> 

Series:

Acked-by: Daniel Borkmann <daniel@iogearbox.net>

^ permalink raw reply

* [PATCH net] qlcnic: fix a return in qlcnic_dcb_get_capability()
From: Dan Carpenter @ 2018-10-19 20:11 UTC (permalink / raw)
  To: Shahed Shaikh, Sucheta Chakraborty
  Cc: Manish Chopra, Dept-GELinuxNICDev, David S. Miller, netdev,
	kernel-janitors

These functions are supposed to return one on failure and zero on
success.  Returning a zero here could cause uninitialized variable
bugs in several of the callers.  For example:

    drivers/scsi/cxgbi/cxgb4i/cxgb4i.c:1660 get_iscsi_dcb_priority()
    error: uninitialized symbol 'caps'.

Fixes: 48365e485275 ("qlcnic: dcb: Add support for CEE Netlink interface.")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
---
 drivers/net/ethernet/qlogic/qlcnic/qlcnic_dcb.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_dcb.c b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_dcb.c
index 4b76c69fe86d..834208e55f7b 100644
--- a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_dcb.c
+++ b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_dcb.c
@@ -883,7 +883,7 @@ static u8 qlcnic_dcb_get_capability(struct net_device *netdev, int capid,
 	struct qlcnic_adapter *adapter = netdev_priv(netdev);
 
 	if (!test_bit(QLCNIC_DCB_STATE, &adapter->dcb->state))
-		return 0;
+		return 1;
 
 	switch (capid) {
 	case DCB_CAP_ATTR_PG:
-- 
2.11.0

^ permalink raw reply related

* [PATCH net-next] net: ethernet: ti: cpsw: unsync mcast entries while switch promisc mode
From: Ivan Khoronzhuk @ 2018-10-19 20:25 UTC (permalink / raw)
  To: grygorii.strashko, davem
  Cc: linux-omap, netdev, linux-kernel, Ivan Khoronzhuk

After flushing all mcast entries from the table, the ones contained in
mc list of ndev are not restored when promisc mode is toggled off,
because they are considered as synched with ALE, thus, in order to
restore them after promisc mode - reset syncing info. This fix
touches only switch mode devices, including single port boards
like Beagle Bone.

Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
---

Based on net-nex/master
and is logical continuation of the
https://lore.kernel.org/patchwork/patch/1001633/

 drivers/net/ethernet/ti/cpsw.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index 226be2a56c1f..f7753b240ced 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -640,6 +640,7 @@ static void cpsw_set_promiscious(struct net_device *ndev, bool enable)
 
 			/* Clear all mcast from ALE */
 			cpsw_ale_flush_multicast(ale, ALE_ALL_PORTS, -1);
+			__dev_mc_unsync(ndev, NULL);
 
 			/* Flood All Unicast Packets to Host port */
 			cpsw_ale_control_set(ale, 0, ALE_P0_UNI_FLOOD, 1);
-- 
2.17.1

^ permalink raw reply related

* [PATCH iproute2-next] iplink: Remove flags argument from iplink_get
From: David Ahern @ 2018-10-19 20:28 UTC (permalink / raw)
  To: netdev, stephen; +Cc: David Ahern

From: David Ahern <dsahern@gmail.com>

iplink_get has 1 caller and the flags arg is 0, so just remove it.

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 ip/ip_common.h | 2 +-
 ip/ipaddress.c | 2 +-
 ip/iplink.c    | 4 ++--
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/ip/ip_common.h b/ip/ip_common.h
index 200be5e23dd1..458a9cb7ff2c 100644
--- a/ip/ip_common.h
+++ b/ip/ip_common.h
@@ -91,7 +91,7 @@ void vrf_reset(void);
 int netns_identify_pid(const char *pidstr, char *name, int len);
 int do_seg6(int argc, char **argv);
 
-int iplink_get(unsigned int flags, char *name, __u32 filt_mask);
+int iplink_get(char *name, __u32 filt_mask);
 int iplink_ifla_xstats(int argc, char **argv);
 
 int ip_linkaddr_list(int family, req_filter_fn_t filter_fn,
diff --git a/ip/ipaddress.c b/ip/ipaddress.c
index c0c1fbbe4c74..9481f241cb36 100644
--- a/ip/ipaddress.c
+++ b/ip/ipaddress.c
@@ -1940,7 +1940,7 @@ static int ipaddr_list_flush_or_save(int argc, char **argv, int action)
 	 * the link device
 	 */
 	if (filter_dev && filter.group == -1 && do_link == 1) {
-		if (iplink_get(0, filter_dev, RTEXT_FILTER_VF) < 0) {
+		if (iplink_get(filter_dev, RTEXT_FILTER_VF) < 0) {
 			perror("Cannot send link get request");
 			delete_json_obj();
 			exit(1);
diff --git a/ip/iplink.c b/ip/iplink.c
index 50ccb49a0263..9f39e3826c19 100644
--- a/ip/iplink.c
+++ b/ip/iplink.c
@@ -1087,11 +1087,11 @@ static int iplink_modify(int cmd, unsigned int flags, int argc, char **argv)
 	return 0;
 }
 
-int iplink_get(unsigned int flags, char *name, __u32 filt_mask)
+int iplink_get(char *name, __u32 filt_mask)
 {
 	struct iplink_req req = {
 		.n.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg)),
-		.n.nlmsg_flags = NLM_F_REQUEST | flags,
+		.n.nlmsg_flags = NLM_F_REQUEST,
 		.n.nlmsg_type = RTM_GETLINK,
 		.i.ifi_family = preferred_family,
 	};
-- 
2.11.0

^ permalink raw reply related

* Re: [PATCH bpf-next v3 0/7] Implement queue/stack maps
From: Alexei Starovoitov @ 2018-10-19 20:30 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: Mauricio Vasquez B, Alexei Starovoitov, netdev, Song Liu
In-Reply-To: <831d49fa-c131-b66a-80d2-65f8172467d5@iogearbox.net>

On Fri, Oct 19, 2018 at 10:08:08PM +0200, Daniel Borkmann wrote:
> On 10/18/2018 03:16 PM, Mauricio Vasquez B wrote:
> > In some applications this is needed have a pool of free elements, for
> > example the list of free L4 ports in a SNAT.  None of the current maps allow
> > to do it as it is not possible to get any element without having they key
> > it is associated to, even if it were possible, the lack of locking mecanishms in
> > eBPF would do it almost impossible to be implemented without data races.
> > 
> > This patchset implements two new kind of eBPF maps: queue and stack.
> > Those maps provide to eBPF programs the peek, push and pop operations, and for
> > userspace applications a new bpf_map_lookup_and_delete_elem() is added.
> > 
> > Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
> Acked-by: Daniel Borkmann <daniel@iogearbox.net>

Applied, Thanks

^ permalink raw reply

* Re: [PATCH bpf-next] bpf: Extend the sk_lookup() helper to XDP hookpoint.
From: Daniel Borkmann @ 2018-10-19 20:32 UTC (permalink / raw)
  To: Joe Stringer, Martin KaFai Lau
  Cc: Nitin Hande, netdev, ast, Jesper Brouer, john fastabend
In-Reply-To: <CAOftzPjyeD2-nGW+NPC4sbxLcQY_CFT5HikXYeKUEWCbRcrpQg@mail.gmail.com>

On 10/19/2018 06:47 PM, Joe Stringer wrote:
> On Thu, 18 Oct 2018 at 22:07, Martin Lau <kafai@fb.com> wrote:
>> On Thu, Oct 18, 2018 at 04:52:40PM -0700, Joe Stringer wrote:
>>> On Thu, 18 Oct 2018 at 14:20, Daniel Borkmann <daniel@iogearbox.net> wrote:
>>>> On 10/18/2018 11:06 PM, Joe Stringer wrote:
>>>>> On Thu, 18 Oct 2018 at 11:54, Nitin Hande <nitin.hande@gmail.com> wrote:
>>>> [...]
>>>>>> Open Issue
>>>>>> * The underlying code relies on presence of an skb to find out the
>>>>>> right sk for the case of REUSEPORT socket option. Since there is
>>>>>> no skb available at XDP hookpoint, the helper function will return
>>>>>> the first available sk based off the 5 tuple hash. If the desire
>>>>>> is to return a particular sk matching reuseport_cb function, please
>>>>>> suggest way to tackle it, which can be addressed in a future commit.
>>>>
>>>>>> Signed-off-by: Nitin Hande <Nitin.Hande@gmail.com>
>>>>>
>>>>> Thanks Nitin, LGTM overall.
>>>>>
>>>>> The REUSEPORT thing suggests that the usage of this helper from XDP
>>>>> layer may lead to a different socket being selected vs. the equivalent
>>>>> call at TC hook, or other places where the selection may occur. This
>>>>> could be a bit counter-intuitive.
>>>>>
>>>>> One thought I had to work around this was to introduce a flag,
>>>>> something like BPF_F_FIND_REUSEPORT_SK_BY_HASH. This flag would
>>>>> effectively communicate in the API that the bpf_sk_lookup_xxx()
>>>>> functions will only select a REUSEPORT socket based on the hash and
>>>>> not by, for example BPF_PROG_TYPE_SK_REUSEPORT programs. The absence
>>>>> of the flag would support finding REUSEPORT sockets by other
>>>>> mechanisms (which would be allowed for now from TC hooks but would be
>>>>> disallowed from XDP, since there's no specific plan to support this).
>>>>
>>>> Hmm, given skb is NULL here the only way to lookup the socket in such
>>>> scenario is based on hash, that is, inet_ehashfn() / inet6_ehashfn(),
>>>> perhaps alternative is to pass this hash in from XDP itself to the
>>>> helper so it could be custom selector. Do you have a specific use case
>>>> on this for XDP (just curious)?
>>>
>>> I don't have a use case for SO_REUSEPORT introspection from XDP, so
>>> I'm primarily thinking from the perspective of making the behaviour
>>> clear in the API in a way that leaves open the possibility for a
>>> reasonable implementation in future. From that perspective, my main
>>> concern is that it may surprise some BPF writers that the same
>>> "bpf_sk_lookup_tcp()" call (with identical parameters) may have
>>> different behaviour at TC vs. XDP layers, as the BPF selection of
>>> sockets is respected at TC but not at XDP.
>>>
>>> FWIW we're already out of parameters for the actual call, so if we
>>> wanted to allow passing a hash in, we'd need to either dedicate half
>>> the 'flags' field for this configurable hash, or consider adding the
>>> new hash parameter to 'struct bpf_sock_tuple'.
>>>
>>> +Martin for any thoughts on SO_REUSEPORT and XDP here.
>> The XDP/TC prog has read access to the sk fields through
>> 'struct bpf_sock'?
>>
>> A quick thought...
>> Considering all sk in the same reuse->socks[] share
>> many things (e.g. family,type,protocol,ip,port..etc are the same),
>> I wonder returning which particular sk from reuse->socks[] will
>> matter too much since most of the fields from 'struct bpf_sock' will
>> be the same.  Some of fields in 'struct bpf_sock' could be different
>> though, like priority?  Hence, another possibility is to limit the
>> accessible fields for the XDP prog.  Only allow accessing the fields
>> that must be the same among the sk in the same reuse->socks[].
> 
> This sounds pretty reasonable to me.

Agree, and in any case this difference in returned sk selection should
probably also be documented in the uapi helper description.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox