Netdev List
* Re: [PATCH net-next] cxgb4: Check for kvzalloc allocation failure
From: YueHaibing @ 2018-05-25  1:39 UTC (permalink / raw)
  To: David Miller; +Cc: ganeshgr, linux-kernel, netdev
In-Reply-To: <20180524.110743.522760687215216591.davem@davemloft.net>

On 2018/5/24 23:07, David Miller wrote:
> From: YueHaibing <yuehaibing@huawei.com>
> Date: Tue, 22 May 2018 15:07:18 +0800
> 
>> diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
>> index 130d1ee..019cffe 100644
>> --- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
>> +++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
>> @@ -4135,6 +4135,10 @@ static int adap_init0(struct adapter *adap)
>>  		 * card
>>  		 */
>>  		card_fw = kvzalloc(sizeof(*card_fw), GFP_KERNEL);
>> +		if (!card_fw) {
>> +			ret = -ENOMEM;
>> +			goto bye;
>> +		}
>>  
> 
> On error, this leaks fw_info.

Hi David,

I checked: fw_info is an element of fw_info_array, which is statically
allocated, so none of the members of struct fw_info need to be freed.

It looks like this:

static struct fw_info fw_info_array[] = {
	{
		.chip = CHELSIO_T4,
		.fs_name = FW4_CFNAME,
		.fw_mod_name = FW4_FNAME,
		.fw_hdr = {
			.chip = FW_HDR_CHIP_T4,
			.fw_ver = __cpu_to_be32(FW_VERSION(T4)),
			.intfver_nic = FW_INTFVER(T4, NIC),
			.intfver_vnic = FW_INTFVER(T4, VNIC),
			.intfver_ri = FW_INTFVER(T4, RI),
			.intfver_iscsi = FW_INTFVER(T4, ISCSI),
			.intfver_fcoe = FW_INTFVER(T4, FCOE),
		},
	}, {
		........

Am I missing something?
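
The argument above can be sketched in plain user-space C. This is only a
model under stated assumptions: fw_info_model, fw_info_table,
find_fw_info_model() and adap_init_model() are hypothetical names, the
firmware file names are made up, and calloc() stands in for kvzalloc().
It shows that a pointer into a static table owns no memory, so the
allocation-failure path has nothing else to release.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, trimmed-down stand-in for the driver's struct fw_info. */
struct fw_info_model {
	int chip;
	const char *fw_mod_name;
};

/* Static storage duration: exists for the whole program, never freed. */
static const struct fw_info_model fw_info_table[] = {
	{ .chip = 4, .fw_mod_name = "example-t4.bin" },	/* made-up name */
	{ .chip = 5, .fw_mod_name = "example-t5.bin" },	/* made-up name */
};

/* Return a pointer into the static table, or NULL for an unknown chip. */
static const struct fw_info_model *find_fw_info_model(int chip)
{
	size_t i;

	for (i = 0; i < sizeof(fw_info_table) / sizeof(fw_info_table[0]); i++)
		if (fw_info_table[i].chip == chip)
			return &fw_info_table[i];
	return NULL;
}

/*
 * Model of the error path under discussion. fail_alloc simulates the
 * allocation returning NULL.
 */
static int adap_init_model(int chip, int fail_alloc)
{
	const struct fw_info_model *info = find_fw_info_model(chip);
	void *card_fw;

	if (!info)
		return -EINVAL;

	card_fw = fail_alloc ? NULL : calloc(1, 64);
	if (!card_fw)
		return -ENOMEM;	/* info points at static data: nothing to free */

	free(card_fw);		/* only the heap buffer is ever released */
	return 0;
}
```

Calling adap_init_model(4, 1) exercises the failure branch that the patch
adds; the lookup result is simply dropped.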


* linux-next: manual merge of the scsi tree with the net-next tree
From: Mark Brown @ 2018-05-25  1:38 UTC (permalink / raw)
  To: James Bottomley, Chad Dupuis, Martin K. Petersen, linux-scsi,
	David S. Miller, netdev
  Cc: Linux-Next Mailing List, Linux Kernel Mailing List


Hi James,

Today's linux-next merge of the scsi tree got a conflict in:

  drivers/scsi/qedf/qedf.h

between commit:

  8673daf4f55bf3b91 ("qedf: Add get_generic_tlv_data handler.")

from the net-next tree and commit:

  4b9b7fabb39b3e9d7 ("scsi: qedf: Improve firmware debug dump handling")

from the scsi tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non-trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

diff --cc drivers/scsi/qedf/qedf.h
index cabb6af60fb8,2372a40326f8..000000000000
--- a/drivers/scsi/qedf/qedf.h
+++ b/drivers/scsi/qedf/qedf.h
@@@ -501,9 -499,8 +504,10 @@@ extern int qedf_post_io_req(struct qedf
  extern void qedf_process_seq_cleanup_compl(struct qedf_ctx *qedf,
  	struct fcoe_cqe *cqe, struct qedf_ioreq *io_req);
  extern int qedf_send_flogi(struct qedf_ctx *qedf);
 +extern void qedf_get_protocol_tlv_data(void *dev, void *data);
  extern void qedf_fp_io_handler(struct work_struct *work);
 +extern void qedf_get_generic_tlv_data(void *dev, struct qed_generic_tlvs *data);
+ extern void qedf_wq_grcdump(struct work_struct *work);
  
  #define FCOE_WORD_TO_BYTE  4
  #define QEDF_MAX_TASK_NUM	0xFFFF



* Re: [PATCH bpf-next v5 0/7] bpf: implement BPF_TASK_FD_QUERY
From: Alexei Starovoitov @ 2018-05-25  1:32 UTC (permalink / raw)
  To: Daniel Borkmann, Yonghong Song, peterz, netdev; +Cc: kernel-team
In-Reply-To: <fde7c055-9409-e7de-1576-acd17e65dae1@iogearbox.net>

On 5/24/18 5:27 PM, Daniel Borkmann wrote:
> On 05/24/2018 08:21 PM, Yonghong Song wrote:
>> Currently, suppose a userspace application has loaded a bpf program
>> and attached it to a tracepoint/kprobe/uprobe, and a bpf
>> introspection tool, e.g., bpftool, wants to show which bpf program
>> is attached to which tracepoint/kprobe/uprobe. Such attachment
>> information will be really useful to understand the overall bpf
>> deployment in the system.
>>
>> There is a name field (16 bytes) for each program, which could
>> be used to encode the attachment point. There are some drawbacks
>> to this approach. First, a bpftool user (e.g., an admin) may not
>> really understand the association between the name and the
>> attachment point. Second, if one program is attached to multiple
>> places, encoding a proper name which can imply all these
>> attachments becomes difficult.
>>
>> This patch introduces a new bpf subcommand BPF_TASK_FD_QUERY.
>> Given a pid and fd, this command will return bpf related information
>> to user space. Right now it only supports tracepoint/kprobe/uprobe
>> perf event fd's. For such a fd, BPF_TASK_FD_QUERY will return
>>    . prog_id
>>    . tracepoint name, or
>>    . k[ret]probe funcname + offset or kernel addr, or
>>    . u[ret]probe filename + offset
>> to the userspace.
>> The user can use "bpftool prog" to find more information about
>> bpf program itself with prog_id.
>>
>> Patch #1 adds function perf_get_event() in kernel/events/core.c.
>> Patch #2 implements the bpf subcommand BPF_TASK_FD_QUERY.
>> Patch #3 syncs tools bpf.h header and also adds bpf_task_fd_query()
>> in the libbpf library for samples/selftests/bpftool to use.
>> Patch #4 adds ksym_get_addr() utility function.
>> Patch #5 adds a test in samples/bpf for querying k[ret]probes and
>> u[ret]probes.
>> Patch #6 adds a test in tools/testing/selftests/bpf for querying
>> raw_tracepoint and tracepoint.
>> Patch #7 adds a new subcommand "perf" to bpftool.
>>
>> Changelogs:
>>   v4 -> v5:
>>      . return strlen(buf) instead of strlen(buf) + 1
>>        in the attr.buf_len. As long as the user provides a
>>        non-empty buffer, it will be filled with an empty
>>        string, a truncated string, or the full string
>>        based on the buffer size and the length of
>>        to-be-copied string.
>>   v3 -> v4:
>>      . made attr buf_len input/output. The length of
>>        actual buffer is written to buf_len so user space knows
>>        what is actually needed. If user provides a buffer
>>        with length >= 1 but less than required, do partial
>>        copy and return -ENOSPC.
>>      . code simplification with put_user.
>>      . changed query result attach_info to fd_type.
>>      . add tests at selftests/bpf to test zero len, null buf and
>>        insufficient buf.
>>   v2 -> v3:
>>      . made perf_get_event() return perf_event pointer const.
>>        this was to ensure that event fields are not meddled.
>>      . detect whether the new BPF_TASK_FD_QUERY is supported or
>>        not in "bpftool perf" and warn users if it is not.
>>   v1 -> v2:
>>      . changed bpf subcommand name from BPF_PERF_EVENT_QUERY
>>        to BPF_TASK_FD_QUERY.
>>      . fixed various "bpftool perf" issues and added documentation
>>        and auto-completion.
>>
>> Yonghong Song (7):
>>   perf/core: add perf_get_event() to return perf_event given a struct
>>     file
>>   bpf: introduce bpf subcommand BPF_TASK_FD_QUERY
>>   tools/bpf: sync kernel header bpf.h and add bpf_task_fd_query in
>>     libbpf
>>   tools/bpf: add ksym_get_addr() in trace_helpers
>>   samples/bpf: add a samples/bpf test for BPF_TASK_FD_QUERY
>>   tools/bpf: add two BPF_TASK_FD_QUERY tests in test_progs
>>   tools/bpftool: add perf subcommand
>>
>>  include/linux/perf_event.h                       |   5 +
>>  include/linux/trace_events.h                     |  17 +
>>  include/uapi/linux/bpf.h                         |  26 ++
>>  kernel/bpf/syscall.c                             | 131 ++++++++
>>  kernel/events/core.c                             |   8 +
>>  kernel/trace/bpf_trace.c                         |  48 +++
>>  kernel/trace/trace_kprobe.c                      |  29 ++
>>  kernel/trace/trace_uprobe.c                      |  22 ++
>>  samples/bpf/Makefile                             |   4 +
>>  samples/bpf/task_fd_query_kern.c                 |  19 ++
>>  samples/bpf/task_fd_query_user.c                 | 382 +++++++++++++++++++++++
>>  tools/bpf/bpftool/Documentation/bpftool-perf.rst |  81 +++++
>>  tools/bpf/bpftool/Documentation/bpftool.rst      |   5 +-
>>  tools/bpf/bpftool/bash-completion/bpftool        |   9 +
>>  tools/bpf/bpftool/main.c                         |   3 +-
>>  tools/bpf/bpftool/main.h                         |   1 +
>>  tools/bpf/bpftool/perf.c                         | 246 +++++++++++++++
>>  tools/include/uapi/linux/bpf.h                   |  26 ++
>>  tools/lib/bpf/bpf.c                              |  23 ++
>>  tools/lib/bpf/bpf.h                              |   3 +
>>  tools/testing/selftests/bpf/test_progs.c         | 158 ++++++++++
>>  tools/testing/selftests/bpf/trace_helpers.c      |  12 +
>>  tools/testing/selftests/bpf/trace_helpers.h      |   1 +
>>  23 files changed, 1257 insertions(+), 2 deletions(-)
>>  create mode 100644 samples/bpf/task_fd_query_kern.c
>>  create mode 100644 samples/bpf/task_fd_query_user.c
>>  create mode 100644 tools/bpf/bpftool/Documentation/bpftool-perf.rst
>>  create mode 100644 tools/bpf/bpftool/perf.c
>
> LGTM, series:
>
> Acked-by: Daniel Borkmann <daniel@iogearbox.net>

Applied to bpf-next. Thanks, everyone.
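
The buf/buf_len contract described in the quoted v3 -> v4 and v4 -> v5
changelog entries can be sketched as a user-space model. This is not the
kernel code: query_copy_model() is a hypothetical name, memcpy() stands
in for the copy to user space, and returning 0 for a NULL or zero-length
buffer is an assumption of this sketch.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * On return, *buf_len holds strlen(src), i.e. the length without the
 * terminating NUL, no matter how much was actually copied.
 */
static int query_copy_model(const char *src, char *buf, size_t *buf_len)
{
	size_t need = strlen(src);
	size_t have = *buf_len;

	*buf_len = need;		/* always report the full length */

	if (!buf || have == 0)
		return 0;		/* assumed: length probe, nothing copied */

	if (have >= need + 1) {
		memcpy(buf, src, need + 1);	/* whole string plus NUL */
		return 0;
	}

	memcpy(buf, src, have - 1);	/* partial copy ... */
	buf[have - 1] = '\0';		/* ... still NUL-terminated */
	return -ENOSPC;
}
```

With an 8-byte buffer and the 9-character string "sys_enter", the model
copies "sys_ent", sets *buf_len to 9 and returns -ENOSPC, so the caller
knows that a buffer of 10 bytes would hold the full string.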


* Re: [PATCH] PCI: allow drivers to limit the number of VFs to 0
From: Jakub Kicinski @ 2018-05-25  1:20 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Bjorn Helgaas, linux-pci, netdev, Sathya Perla, Felix Manlunas,
	alexander.duyck, john.fastabend, Jacob Keller, Donald Dutile,
	oss-drivers, Christoph Hellwig
In-Reply-To: <20180524235748.GD15320@bhelgaas-glaptop.roam.corp.google.com>

Hi Bjorn!

On Thu, 24 May 2018 18:57:48 -0500, Bjorn Helgaas wrote:
> On Mon, Apr 02, 2018 at 03:46:52PM -0700, Jakub Kicinski wrote:
> > Some user space depends on enabling sriov_totalvfs number of VFs
> > to not fail, e.g.:
> > 
> > $ cat .../sriov_totalvfs > .../sriov_numvfs
> > 
> > For devices whose VF support depends on the loaded FW we have the
> > pci_sriov_{g,s}et_totalvfs() API.  However, this API uses 0 as
> > a special "unset" value, meaning drivers can't limit sriov_totalvfs
> > to 0.  Remove the special values completely and simply initialize
> > driver_max_VFs to total_VFs.  Then always use driver_max_VFs.
> > Add a helper for drivers to reset the VF limit back to total.  
> 
> I still can't really make sense out of the changelog.
>
> I think part of the reason it's confusing is because there are two
> things going on:
> 
>   1) You want this:
>   
>        pci_sriov_set_totalvfs(dev, 0);
>        x = pci_sriov_get_totalvfs(dev);
> 
>      to return 0 instead of total_VFs.  That seems to connect with
>      your subject line.  It means "sriov_totalvfs" in sysfs could be
>      0, but I don't know how that is useful (I'm sure it is; just
>      educate me :))

Let me just quote the bug report that got filed on our internal bug
tracker :)

  When testing Juju Openstack with Ubuntu 18.04, enabling SR-IOV causes
  errors because Juju gets the sriov_totalvfs for SR-IOV-capable device
  then tries to set that as the sriov_numvfs parameter.

  For SR-IOV incapable FW, the sriov_totalvfs parameter should be 0, 
  but it's set to max.  When FW is switched to flower*, the correct 
  sriov_totalvfs value is presented.

* flower is a project name

My understanding is OpenStack uses sriov_totalvfs to determine how many
VFs can be enabled, looks like this is the code:

http://git.openstack.org/cgit/openstack/charm-neutron-openvswitch/tree/hooks/neutron_ovs_utils.py#n464

>   2) You're adding the pci_sriov_reset_totalvfs() interface.  I'm not
>      sure what you intend for this.  Is *every* driver supposed to
>      call it in .remove()?  Could/should this be done in the core
>      somehow instead of depending on every driver?

Good question, I was just thinking yesterday we may want to call it
from the core, but I don't think it's strictly necessary nor always
sufficient (we may reload FW without re-probing).

We have a device which supports a different number of VFs depending on
the FW loaded.  Some legacy FWs do not inform the driver how many VFs
they can support, because they support max.  So the flow in our driver
is this:

load_fw(dev);
...
max_vfs = ask_fw_for_max_vfs(dev);
if (max_vfs >= 0)
	return pci_sriov_set_totalvfs(dev, max_vfs);
else /* FW didn't tell us, assume max */
	return pci_sriov_reset_totalvfs(dev); 

We also reset the max on device remove, but that's not strictly
necessary.

Other users of pci_sriov_set_totalvfs() always know the value to set
the total to (either always get it from FW or it's a constant).

If you prefer we can work out the correct max for those legacy cases in
the driver as well, although it seemed cleaner to just ask the core,
since it already has total_VFs value handy :)
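
The semantics being proposed can be sketched as a small user-space
model. Everything here is hypothetical (struct sriov_model and the
sriov_model_*() helpers are not PCI core functions, and the reset helper
returns void for simplicity); only the field names total_VFs and
driver_max_VFs come from the thread. The point is that the limit starts
at total_VFs, so 0 no longer has to mean "unset".

```c
#include <errno.h>

struct sriov_model {
	int total_VFs;		/* maximum advertised by the device */
	int driver_max_VFs;	/* limit imposed by the driver */
};

static void sriov_model_init(struct sriov_model *s, int total_VFs)
{
	s->total_VFs = total_VFs;
	s->driver_max_VFs = total_VFs;	/* no "0 means unset" sentinel */
}

static int sriov_model_set_totalvfs(struct sriov_model *s, int numvfs)
{
	if (numvfs < 0 || numvfs > s->total_VFs)
		return -EINVAL;
	s->driver_max_VFs = numvfs;	/* 0 is a legitimate limit */
	return 0;
}

static int sriov_model_get_totalvfs(const struct sriov_model *s)
{
	return s->driver_max_VFs;	/* always the driver limit */
}

static void sriov_model_reset_totalvfs(struct sriov_model *s)
{
	s->driver_max_VFs = s->total_VFs;
}

/* fw_max_vfs < 0 models legacy FW that does not report a VF count. */
static int driver_apply_fw_limit(struct sriov_model *s, int fw_max_vfs)
{
	if (fw_max_vfs >= 0)
		return sriov_model_set_totalvfs(s, fw_max_vfs);

	sriov_model_reset_totalvfs(s);	/* FW didn't tell us, assume max */
	return 0;
}
```

In this model, a FW that reports 0 VFs makes the getter return 0, and a
later reload of a legacy FW restores the limit to total_VFs.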

> I'm also having a hard time connecting your user-space command example
> with the rest of this.  Maybe it will make more sense to me tomorrow
> after some coffee.

OpenStack assumes it will always be able to set sriov_numvfs to
sriov_totalvfs, see this 'if':

http://git.openstack.org/cgit/openstack/charm-neutron-openvswitch/tree/hooks/neutron_ovs_utils.py#n512

I tried to morph that into a concise bash command, but clearly failed.
Sorry about the lack of clarity! :(


* Re: [PATCH v2 bpf-next 1/5] bpf: Hooks for sys_sendmsg
From: Daniel Borkmann @ 2018-05-25  0:59 UTC (permalink / raw)
  To: Andrey Ignatov, netdev; +Cc: davem, kafai, ast, kernel-team
In-Reply-To: <ba05de3cd3b6af6d3300d9c5623976f4aec161b0.1527031931.git.rdna@fb.com>

On 05/23/2018 01:40 AM, Andrey Ignatov wrote:
[...]
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index ff4d4ba..a1f9ba2 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -900,6 +900,7 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
>  {
>  	struct inet_sock *inet = inet_sk(sk);
>  	struct udp_sock *up = udp_sk(sk);
> +	DECLARE_SOCKADDR(struct sockaddr_in *, usin, msg->msg_name);
>  	struct flowi4 fl4_stack;
>  	struct flowi4 *fl4;
>  	int ulen = len;
> @@ -954,8 +955,7 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
>  	/*
>  	 *	Get and verify the address.
>  	 */
> -	if (msg->msg_name) {
> -		DECLARE_SOCKADDR(struct sockaddr_in *, usin, msg->msg_name);
> +	if (usin) {
>  		if (msg->msg_namelen < sizeof(*usin))
>  			return -EINVAL;
>  		if (usin->sin_family != AF_INET) {
> @@ -1009,6 +1009,22 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
>  		rcu_read_unlock();
>  	}
>  
> +	if (!connected) {
> +		err = BPF_CGROUP_RUN_PROG_UDP4_SENDMSG_LOCK(sk,
> +					    (struct sockaddr *)usin, &ipc.addr);
> +		if (err)
> +			goto out_free;
> +		if (usin) {
> +			if (usin->sin_port == 0) {
> +				/* BPF program set invalid port. Reject it. */
> +				err = -EINVAL;
> +				goto out_free;
> +			}
> +			daddr = usin->sin_addr.s_addr;
> +			dport = usin->sin_port;
> +		}
> +	}
> +
>  	saddr = ipc.addr;
>  	ipc.addr = faddr = daddr;
>  
> diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
> index 2839c1b..67c44b5 100644
> --- a/net/ipv6/udp.c
> +++ b/net/ipv6/udp.c
> @@ -1315,6 +1315,29 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
>  		fl6.saddr = np->saddr;
>  	fl6.fl6_sport = inet->inet_sport;
>  
> +	if (!connected) {
> +		err = BPF_CGROUP_RUN_PROG_UDP6_SENDMSG_LOCK(sk,
> +					   (struct sockaddr *)sin6, &fl6.saddr);
> +		if (err)
> +			goto out_no_dst;
> +		if (sin6) {
> +			if (ipv6_addr_v4mapped(&sin6->sin6_addr)) {
> +				/* BPF program rewrote IPv6-only by IPv4-mapped
> +				 * IPv6. It's currently unsupported.
> +				 */
> +				err = -ENOTSUPP;
> +				goto out_no_dst;
> +			}
> +			if (sin6->sin6_port == 0) {
> +				/* BPF program set invalid port. Reject it. */
> +				err = -EINVAL;
> +				goto out_no_dst;
> +			}
> +			fl6.fl6_dport = sin6->sin6_port;
> +			fl6.daddr = sin6->sin6_addr;
> +		}

Hmm, this extra work here and in v4 case should probably all be done under
the static key? Otherwise we'll do the extra work for checking sin6 and
setting up fl6 twice? Also, when not enabled, couldn't we run into the case
of ipv6_addr_v4mapped() as well? If I'm spotting this right, then we would
bail out though we shouldn't normally?

> +	}
> +
>  	final_p = fl6_update_dst(&fl6, opt, &final);
>  	if (final_p)
>  		connected = false;
> @@ -1394,6 +1417,7 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
>  
>  out:
>  	dst_release(dst);
> +out_no_dst:
>  	fl6_sock_release(flowlabel);
>  	txopt_put(opt_to_free);
>  	if (!err)
> 


* [PATCH net-next v5 2/2] openvswitch: Support conntrack zone limit
From: Yi-Hung Wei @ 2018-05-25  0:56 UTC (permalink / raw)
  To: netdev, pshelar; +Cc: Yi-Hung Wei
In-Reply-To: <1527209803-48274-1-git-send-email-yihung.wei@gmail.com>

Currently, nf_conntrack_max is used to limit the maximum number of
conntrack entries in the conntrack table for every network namespace.
VMs and containers that reside in the same namespace share the same
conntrack table, and the total number of conntrack entries for all of
them is limited by nf_conntrack_max.  In this case, if one VM or
container abuses its use of conntrack entries, it blocks the others
from committing valid conntrack entries into the conntrack table.
Even if we can possibly put the VMs in different network namespaces,
the current nf_conntrack_max configuration is rigid: we cannot give
different VMs/containers different limits on the number of conntrack
entries.

To address this issue, this patch proposes a fine-grained mechanism
that can further limit the number of conntrack entries per zone.  For
example, we can assign a different zone to each VM and set a conntrack
limit for each zone.  With this isolation, a misbehaving VM only
consumes the conntrack entries in its own zone and does not affect
other well-behaved VMs.  Moreover, users can set different conntrack
limits for different zones according to their preference.

The proposed implementation uses Netfilter's nf_conncount backend to
count the number of connections in a particular zone.  If the number of
connections is above the configured limit, ovs will return ENOMEM to
userspace.  If userspace does not configure the zone limit, the limit
defaults to zero, which means no limit and is backward compatible with
the behavior without this patch.

The following high-level APIs are provided to userspace:
  - OVS_CT_LIMIT_CMD_SET:
    * set default connection limit for all zones
    * set the connection limit for a particular zone
  - OVS_CT_LIMIT_CMD_DEL:
    * remove the connection limit for a particular zone
  - OVS_CT_LIMIT_CMD_GET:
    * get the default connection limit for all zones
    * get the connection limit for a particular zone
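
The set/del/get behavior listed above, together with the fallback to the
default limit, can be sketched as a user-space model. This is an
illustration only: it uses plain singly linked buckets instead of RCU
hlists, takes no locks, and all names (zone_table_model, zone_limit_*(),
zone_check()) are hypothetical. The bucket count and the "0 means
unlimited" convention follow the patch.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define ZONE_BUCKETS	512	/* power of two, like CT_LIMIT_HASH_BUCKETS */
#define LIMIT_UNLIMITED	0

struct zone_limit_model {
	struct zone_limit_model *next;
	uint16_t zone;
	uint32_t limit;
};

struct zone_table_model {
	uint32_t default_limit;
	struct zone_limit_model *buckets[ZONE_BUCKETS];
};

static struct zone_limit_model **bucket_of(struct zone_table_model *t,
					   uint16_t zone)
{
	return &t->buckets[zone & (ZONE_BUCKETS - 1)];
}

/* Set or overwrite the limit of one zone. */
static int zone_limit_set(struct zone_table_model *t, uint16_t zone,
			  uint32_t limit)
{
	struct zone_limit_model **head = bucket_of(t, zone);
	struct zone_limit_model *e;

	for (e = *head; e; e = e->next) {
		if (e->zone == zone) {
			e->limit = limit;
			return 0;
		}
	}

	e = malloc(sizeof(*e));
	if (!e)
		return -ENOMEM;
	e->zone = zone;
	e->limit = limit;
	e->next = *head;
	*head = e;
	return 0;
}

/* Remove the per-zone limit; the zone falls back to the default. */
static void zone_limit_del(struct zone_table_model *t, uint16_t zone)
{
	struct zone_limit_model **pp = bucket_of(t, zone);

	for (; *pp; pp = &(*pp)->next) {
		if ((*pp)->zone == zone) {
			struct zone_limit_model *dead = *pp;

			*pp = dead->next;
			free(dead);
			return;
		}
	}
}

/* Per-zone limit if configured, otherwise the default limit. */
static uint32_t zone_limit_get(struct zone_table_model *t, uint16_t zone)
{
	struct zone_limit_model *e;

	for (e = *bucket_of(t, zone); e; e = e->next)
		if (e->zone == zone)
			return e->limit;
	return t->default_limit;
}

/* Reject only when a limit is configured and the count is above it. */
static int zone_check(struct zone_table_model *t, uint16_t zone,
		      uint32_t connections)
{
	uint32_t limit = zone_limit_get(t, zone);

	if (limit == LIMIT_UNLIMITED)
		return 0;
	return connections > limit ? -ENOMEM : 0;
}
```

Zones 7 and 519 hash to the same bucket in this model (519 & 511 == 7),
which is why every lookup compares the stored zone id rather than
trusting the bucket alone.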

Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
---
 net/openvswitch/Kconfig     |   3 +-
 net/openvswitch/conntrack.c | 551 +++++++++++++++++++++++++++++++++++++++++++-
 net/openvswitch/conntrack.h |   9 +-
 net/openvswitch/datapath.c  |   7 +-
 net/openvswitch/datapath.h  |   3 +
 5 files changed, 567 insertions(+), 6 deletions(-)

diff --git a/net/openvswitch/Kconfig b/net/openvswitch/Kconfig
index 2650205cdaf9..89da9512ec1e 100644
--- a/net/openvswitch/Kconfig
+++ b/net/openvswitch/Kconfig
@@ -9,7 +9,8 @@ config OPENVSWITCH
 		   (NF_CONNTRACK && ((!NF_DEFRAG_IPV6 || NF_DEFRAG_IPV6) && \
 				     (!NF_NAT || NF_NAT) && \
 				     (!NF_NAT_IPV4 || NF_NAT_IPV4) && \
-				     (!NF_NAT_IPV6 || NF_NAT_IPV6)))
+				     (!NF_NAT_IPV6 || NF_NAT_IPV6) && \
+				     (!NETFILTER_CONNCOUNT || NETFILTER_CONNCOUNT)))
 	select LIBCRC32C
 	select MPLS
 	select NET_MPLS_GSO
diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index 02fc343feb66..284aca2a252d 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -16,8 +16,11 @@
 #include <linux/tcp.h>
 #include <linux/udp.h>
 #include <linux/sctp.h>
+#include <linux/static_key.h>
 #include <net/ip.h>
+#include <net/genetlink.h>
 #include <net/netfilter/nf_conntrack_core.h>
+#include <net/netfilter/nf_conntrack_count.h>
 #include <net/netfilter/nf_conntrack_helper.h>
 #include <net/netfilter/nf_conntrack_labels.h>
 #include <net/netfilter/nf_conntrack_seqadj.h>
@@ -76,6 +79,31 @@ struct ovs_conntrack_info {
 #endif
 };
 
+#if	IS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+#define OVS_CT_LIMIT_UNLIMITED	0
+#define OVS_CT_LIMIT_DEFAULT OVS_CT_LIMIT_UNLIMITED
+#define CT_LIMIT_HASH_BUCKETS 512
+static DEFINE_STATIC_KEY_FALSE(ovs_ct_limit_enabled);
+
+struct ovs_ct_limit {
+	/* Elements in ovs_ct_limit_info->limits hash table */
+	struct hlist_node hlist_node;
+	struct rcu_head rcu;
+	u16 zone;
+	u32 limit;
+};
+
+struct ovs_ct_limit_info {
+	u32 default_limit;
+	struct hlist_head *limits;
+	struct nf_conncount_data *data;
+};
+
+static const struct nla_policy ct_limit_policy[OVS_CT_LIMIT_ATTR_MAX + 1] = {
+	[OVS_CT_LIMIT_ATTR_ZONE_LIMIT] = { .type = NLA_NESTED, },
+};
+#endif
+
 static bool labels_nonzero(const struct ovs_key_ct_labels *labels);
 
 static void __ovs_ct_free_action(struct ovs_conntrack_info *ct_info);
@@ -1036,6 +1064,89 @@ static bool labels_nonzero(const struct ovs_key_ct_labels *labels)
 	return false;
 }
 
+#if	IS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+static struct hlist_head *ct_limit_hash_bucket(
+	const struct ovs_ct_limit_info *info, u16 zone)
+{
+	return &info->limits[zone & (CT_LIMIT_HASH_BUCKETS - 1)];
+}
+
+/* Call with ovs_mutex */
+static void ct_limit_set(const struct ovs_ct_limit_info *info,
+			 struct ovs_ct_limit *new_ct_limit)
+{
+	struct ovs_ct_limit *ct_limit;
+	struct hlist_head *head;
+
+	head = ct_limit_hash_bucket(info, new_ct_limit->zone);
+	hlist_for_each_entry_rcu(ct_limit, head, hlist_node) {
+		if (ct_limit->zone == new_ct_limit->zone) {
+			hlist_replace_rcu(&ct_limit->hlist_node,
+					  &new_ct_limit->hlist_node);
+			kfree_rcu(ct_limit, rcu);
+			return;
+		}
+	}
+
+	hlist_add_head_rcu(&new_ct_limit->hlist_node, head);
+}
+
+/* Call with ovs_mutex */
+static void ct_limit_del(const struct ovs_ct_limit_info *info, u16 zone)
+{
+	struct ovs_ct_limit *ct_limit;
+	struct hlist_head *head;
+	struct hlist_node *n;
+
+	head = ct_limit_hash_bucket(info, zone);
+	hlist_for_each_entry_safe(ct_limit, n, head, hlist_node) {
+		if (ct_limit->zone == zone) {
+			hlist_del_rcu(&ct_limit->hlist_node);
+			kfree_rcu(ct_limit, rcu);
+			return;
+		}
+	}
+}
+
+/* Call with RCU read lock */
+static u32 ct_limit_get(const struct ovs_ct_limit_info *info, u16 zone)
+{
+	struct ovs_ct_limit *ct_limit;
+	struct hlist_head *head;
+
+	head = ct_limit_hash_bucket(info, zone);
+	hlist_for_each_entry_rcu(ct_limit, head, hlist_node) {
+		if (ct_limit->zone == zone)
+			return ct_limit->limit;
+	}
+
+	return info->default_limit;
+}
+
+static int ovs_ct_check_limit(struct net *net,
+			      const struct ovs_conntrack_info *info,
+			      const struct nf_conntrack_tuple *tuple)
+{
+	struct ovs_net *ovs_net = net_generic(net, ovs_net_id);
+	const struct ovs_ct_limit_info *ct_limit_info = ovs_net->ct_limit_info;
+	u32 per_zone_limit, connections;
+	u32 conncount_key;
+
+	conncount_key = info->zone.id;
+
+	per_zone_limit = ct_limit_get(ct_limit_info, info->zone.id);
+	if (per_zone_limit == OVS_CT_LIMIT_UNLIMITED)
+		return 0;
+
+	connections = nf_conncount_count(net, ct_limit_info->data,
+					 &conncount_key, tuple, &info->zone);
+	if (connections > per_zone_limit)
+		return -ENOMEM;
+
+	return 0;
+}
+#endif
+
 /* Lookup connection and confirm if unconfirmed. */
 static int ovs_ct_commit(struct net *net, struct sw_flow_key *key,
 			 const struct ovs_conntrack_info *info,
@@ -1054,6 +1165,21 @@ static int ovs_ct_commit(struct net *net, struct sw_flow_key *key,
 	if (!ct)
 		return 0;
 
+#if	IS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+	if (static_branch_unlikely(&ovs_ct_limit_enabled)) {
+		if (!nf_ct_is_confirmed(ct)) {
+			err = ovs_ct_check_limit(net, info,
+				&ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple);
+			if (err) {
+				net_warn_ratelimited("openvswitch: zone: %u "
+					"exceeds conntrack limit\n",
+					info->zone.id);
+				return err;
+			}
+		}
+	}
+#endif
+
 	/* Set the conntrack event mask if given.  NEW and DELETE events have
 	 * their own groups, but the NFNLGRP_CONNTRACK_UPDATE group listener
 	 * typically would receive many kinds of updates.  Setting the event
@@ -1655,7 +1781,420 @@ static void __ovs_ct_free_action(struct ovs_conntrack_info *ct_info)
 		nf_ct_tmpl_free(ct_info->ct);
 }
 
-void ovs_ct_init(struct net *net)
+#if	IS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+static int ovs_ct_limit_init(struct net *net, struct ovs_net *ovs_net)
+{
+	int i, err;
+
+	ovs_net->ct_limit_info = kmalloc(sizeof(*ovs_net->ct_limit_info),
+					 GFP_KERNEL);
+	if (!ovs_net->ct_limit_info)
+		return -ENOMEM;
+
+	ovs_net->ct_limit_info->default_limit = OVS_CT_LIMIT_DEFAULT;
+	ovs_net->ct_limit_info->limits =
+		kmalloc_array(CT_LIMIT_HASH_BUCKETS, sizeof(struct hlist_head),
+			      GFP_KERNEL);
+	if (!ovs_net->ct_limit_info->limits) {
+		kfree(ovs_net->ct_limit_info);
+		return -ENOMEM;
+	}
+
+	for (i = 0; i < CT_LIMIT_HASH_BUCKETS; i++)
+		INIT_HLIST_HEAD(&ovs_net->ct_limit_info->limits[i]);
+
+	ovs_net->ct_limit_info->data =
+		nf_conncount_init(net, NFPROTO_INET, sizeof(u32));
+
+	if (IS_ERR(ovs_net->ct_limit_info->data)) {
+		err = PTR_ERR(ovs_net->ct_limit_info->data);
+		kfree(ovs_net->ct_limit_info->limits);
+		kfree(ovs_net->ct_limit_info);
+		pr_err("openvswitch: failed to init nf_conncount %d\n", err);
+		return err;
+	}
+	return 0;
+}
+
+static void ovs_ct_limit_exit(struct net *net, struct ovs_net *ovs_net)
+{
+	const struct ovs_ct_limit_info *info = ovs_net->ct_limit_info;
+	int i;
+
+	nf_conncount_destroy(net, NFPROTO_INET, info->data);
+	for (i = 0; i < CT_LIMIT_HASH_BUCKETS; ++i) {
+		struct hlist_head *head = &info->limits[i];
+		struct ovs_ct_limit *ct_limit;
+
+		hlist_for_each_entry_rcu(ct_limit, head, hlist_node)
+			kfree_rcu(ct_limit, rcu);
+	}
+	kfree(ovs_net->ct_limit_info->limits);
+	kfree(ovs_net->ct_limit_info);
+}
+
+static struct sk_buff *
+ovs_ct_limit_cmd_reply_start(struct genl_info *info, u8 cmd,
+			     struct ovs_header **ovs_reply_header)
+{
+	struct ovs_header *ovs_header = info->userhdr;
+	struct sk_buff *skb;
+
+	skb = genlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
+	if (!skb)
+		return ERR_PTR(-ENOMEM);
+
+	*ovs_reply_header = genlmsg_put(skb, info->snd_portid,
+					info->snd_seq,
+					&dp_ct_limit_genl_family, 0, cmd);
+
+	if (!*ovs_reply_header) {
+		nlmsg_free(skb);
+		return ERR_PTR(-EMSGSIZE);
+	}
+	(*ovs_reply_header)->dp_ifindex = ovs_header->dp_ifindex;
+
+	return skb;
+}
+
+static bool check_zone_id(int zone_id, u16 *pzone)
+{
+	if (zone_id >= 0 && zone_id <= 65535) {
+		*pzone = (u16)zone_id;
+		return true;
+	}
+	return false;
+}
+
+static int ovs_ct_limit_set_zone_limit(struct nlattr *nla_zone_limit,
+				       struct ovs_ct_limit_info *info)
+{
+	struct ovs_zone_limit *zone_limit;
+	int rem;
+	u16 zone;
+
+	rem = NLA_ALIGN(nla_len(nla_zone_limit));
+	zone_limit = (struct ovs_zone_limit *)nla_data(nla_zone_limit);
+
+	while (rem >= sizeof(*zone_limit)) {
+		if (unlikely(zone_limit->zone_id ==
+				OVS_ZONE_LIMIT_DEFAULT_ZONE)) {
+			ovs_lock();
+			info->default_limit = zone_limit->limit;
+			ovs_unlock();
+		} else if (unlikely(!check_zone_id(
+				zone_limit->zone_id, &zone))) {
+			OVS_NLERR(true, "zone id is out of range");
+		} else {
+			struct ovs_ct_limit *ct_limit;
+
+			ct_limit = kmalloc(sizeof(*ct_limit), GFP_KERNEL);
+			if (!ct_limit)
+				return -ENOMEM;
+
+			ct_limit->zone = zone;
+			ct_limit->limit = zone_limit->limit;
+
+			ovs_lock();
+			ct_limit_set(info, ct_limit);
+			ovs_unlock();
+		}
+		rem -= NLA_ALIGN(sizeof(*zone_limit));
+		zone_limit = (struct ovs_zone_limit *)((u8 *)zone_limit +
+				NLA_ALIGN(sizeof(*zone_limit)));
+	}
+
+	if (rem)
+		OVS_NLERR(true, "set zone limit has %d unknown bytes", rem);
+
+	return 0;
+}
+
+static int ovs_ct_limit_del_zone_limit(struct nlattr *nla_zone_limit,
+				       struct ovs_ct_limit_info *info)
+{
+	struct ovs_zone_limit *zone_limit;
+	int rem;
+	u16 zone;
+
+	rem = NLA_ALIGN(nla_len(nla_zone_limit));
+	zone_limit = (struct ovs_zone_limit *)nla_data(nla_zone_limit);
+
+	while (rem >= sizeof(*zone_limit)) {
+		if (unlikely(zone_limit->zone_id ==
+				OVS_ZONE_LIMIT_DEFAULT_ZONE)) {
+			ovs_lock();
+			info->default_limit = OVS_CT_LIMIT_DEFAULT;
+			ovs_unlock();
+		} else if (unlikely(!check_zone_id(
+				zone_limit->zone_id, &zone))) {
+			OVS_NLERR(true, "zone id is out of range");
+		} else {
+			ovs_lock();
+			ct_limit_del(info, zone);
+			ovs_unlock();
+		}
+		rem -= NLA_ALIGN(sizeof(*zone_limit));
+		zone_limit = (struct ovs_zone_limit *)((u8 *)zone_limit +
+				NLA_ALIGN(sizeof(*zone_limit)));
+	}
+
+	if (rem)
+		OVS_NLERR(true, "del zone limit has %d unknown bytes", rem);
+
+	return 0;
+}
+
+static int ovs_ct_limit_get_default_limit(struct ovs_ct_limit_info *info,
+					  struct sk_buff *reply)
+{
+	struct ovs_zone_limit zone_limit;
+	int err;
+
+	zone_limit.zone_id = OVS_ZONE_LIMIT_DEFAULT_ZONE;
+	zone_limit.limit = info->default_limit;
+	err = nla_put_nohdr(reply, sizeof(zone_limit), &zone_limit);
+	if (err)
+		return err;
+
+	return 0;
+}
+
+static int __ovs_ct_limit_get_zone_limit(struct net *net,
+					 struct nf_conncount_data *data,
+					 u16 zone_id, u32 limit,
+					 struct sk_buff *reply)
+{
+	struct nf_conntrack_zone ct_zone;
+	struct ovs_zone_limit zone_limit;
+	u32 conncount_key = zone_id;
+
+	zone_limit.zone_id = zone_id;
+	zone_limit.limit = limit;
+	nf_ct_zone_init(&ct_zone, zone_id, NF_CT_DEFAULT_ZONE_DIR, 0);
+
+	zone_limit.count = nf_conncount_count(net, data, &conncount_key, NULL,
+					      &ct_zone);
+	return nla_put_nohdr(reply, sizeof(zone_limit), &zone_limit);
+}
+
+static int ovs_ct_limit_get_zone_limit(struct net *net,
+				       struct nlattr *nla_zone_limit,
+				       struct ovs_ct_limit_info *info,
+				       struct sk_buff *reply)
+{
+	struct ovs_zone_limit *zone_limit;
+	int rem, err;
+	u32 limit;
+	u16 zone;
+
+	rem = NLA_ALIGN(nla_len(nla_zone_limit));
+	zone_limit = (struct ovs_zone_limit *)nla_data(nla_zone_limit);
+
+	while (rem >= sizeof(*zone_limit)) {
+		if (unlikely(zone_limit->zone_id ==
+				OVS_ZONE_LIMIT_DEFAULT_ZONE)) {
+			err = ovs_ct_limit_get_default_limit(info, reply);
+			if (err)
+				return err;
+		} else if (unlikely(!check_zone_id(zone_limit->zone_id,
+							&zone))) {
+			OVS_NLERR(true, "zone id is out of range");
+		} else {
+			rcu_read_lock();
+			limit = ct_limit_get(info, zone);
+			rcu_read_unlock();
+
+			err = __ovs_ct_limit_get_zone_limit(
+				net, info->data, zone, limit, reply);
+			if (err)
+				return err;
+		}
+		rem -= NLA_ALIGN(sizeof(*zone_limit));
+		zone_limit = (struct ovs_zone_limit *)((u8 *)zone_limit +
+				NLA_ALIGN(sizeof(*zone_limit)));
+	}
+
+	if (rem)
+		OVS_NLERR(true, "get zone limit has %d unknown bytes", rem);
+
+	return 0;
+}
+
+static int ovs_ct_limit_get_all_zone_limit(struct net *net,
+					   struct ovs_ct_limit_info *info,
+					   struct sk_buff *reply)
+{
+	struct ovs_ct_limit *ct_limit;
+	struct hlist_head *head;
+	int i, err = 0;
+
+	err = ovs_ct_limit_get_default_limit(info, reply);
+	if (err)
+		return err;
+
+	rcu_read_lock();
+	for (i = 0; i < CT_LIMIT_HASH_BUCKETS; ++i) {
+		head = &info->limits[i];
+		hlist_for_each_entry_rcu(ct_limit, head, hlist_node) {
+			err = __ovs_ct_limit_get_zone_limit(net, info->data,
+				ct_limit->zone, ct_limit->limit, reply);
+			if (err)
+				goto exit_err;
+		}
+	}
+
+exit_err:
+	rcu_read_unlock();
+	return err;
+}
+
+static int ovs_ct_limit_cmd_set(struct sk_buff *skb, struct genl_info *info)
+{
+	struct nlattr **a = info->attrs;
+	struct sk_buff *reply;
+	struct ovs_header *ovs_reply_header;
+	struct ovs_net *ovs_net = net_generic(sock_net(skb->sk), ovs_net_id);
+	struct ovs_ct_limit_info *ct_limit_info = ovs_net->ct_limit_info;
+	int err;
+
+	reply = ovs_ct_limit_cmd_reply_start(info, OVS_CT_LIMIT_CMD_SET,
+					     &ovs_reply_header);
+	if (IS_ERR(reply))
+		return PTR_ERR(reply);
+
+	if (!a[OVS_CT_LIMIT_ATTR_ZONE_LIMIT]) {
+		err = -EINVAL;
+		goto exit_err;
+	}
+
+	err = ovs_ct_limit_set_zone_limit(a[OVS_CT_LIMIT_ATTR_ZONE_LIMIT],
+					  ct_limit_info);
+	if (err)
+		goto exit_err;
+
+	static_branch_enable(&ovs_ct_limit_enabled);
+
+	genlmsg_end(reply, ovs_reply_header);
+	return genlmsg_reply(reply, info);
+
+exit_err:
+	nlmsg_free(reply);
+	return err;
+}
+
+static int ovs_ct_limit_cmd_del(struct sk_buff *skb, struct genl_info *info)
+{
+	struct nlattr **a = info->attrs;
+	struct sk_buff *reply;
+	struct ovs_header *ovs_reply_header;
+	struct ovs_net *ovs_net = net_generic(sock_net(skb->sk), ovs_net_id);
+	struct ovs_ct_limit_info *ct_limit_info = ovs_net->ct_limit_info;
+	int err;
+
+	reply = ovs_ct_limit_cmd_reply_start(info, OVS_CT_LIMIT_CMD_DEL,
+					     &ovs_reply_header);
+	if (IS_ERR(reply))
+		return PTR_ERR(reply);
+
+	if (!a[OVS_CT_LIMIT_ATTR_ZONE_LIMIT]) {
+		err = -EINVAL;
+		goto exit_err;
+	}
+
+	err = ovs_ct_limit_del_zone_limit(a[OVS_CT_LIMIT_ATTR_ZONE_LIMIT],
+					  ct_limit_info);
+	if (err)
+		goto exit_err;
+
+	genlmsg_end(reply, ovs_reply_header);
+	return genlmsg_reply(reply, info);
+
+exit_err:
+	nlmsg_free(reply);
+	return err;
+}
+
+static int ovs_ct_limit_cmd_get(struct sk_buff *skb, struct genl_info *info)
+{
+	struct nlattr **a = info->attrs;
+	struct nlattr *nla_reply;
+	struct sk_buff *reply;
+	struct ovs_header *ovs_reply_header;
+	struct net *net = sock_net(skb->sk);
+	struct ovs_net *ovs_net = net_generic(net, ovs_net_id);
+	struct ovs_ct_limit_info *ct_limit_info = ovs_net->ct_limit_info;
+	int err;
+
+	reply = ovs_ct_limit_cmd_reply_start(info, OVS_CT_LIMIT_CMD_GET,
+					     &ovs_reply_header);
+	if (IS_ERR(reply))
+		return PTR_ERR(reply);
+
+	nla_reply = nla_nest_start(reply, OVS_CT_LIMIT_ATTR_ZONE_LIMIT);
+
+	if (a[OVS_CT_LIMIT_ATTR_ZONE_LIMIT]) {
+		err = ovs_ct_limit_get_zone_limit(
+			net, a[OVS_CT_LIMIT_ATTR_ZONE_LIMIT], ct_limit_info,
+			reply);
+		if (err)
+			goto exit_err;
+	} else {
+		err = ovs_ct_limit_get_all_zone_limit(net, ct_limit_info,
+						      reply);
+		if (err)
+			goto exit_err;
+	}
+
+	nla_nest_end(reply, nla_reply);
+	genlmsg_end(reply, ovs_reply_header);
+	return genlmsg_reply(reply, info);
+
+exit_err:
+	nlmsg_free(reply);
+	return err;
+}
+
+static struct genl_ops ct_limit_genl_ops[] = {
+	{ .cmd = OVS_CT_LIMIT_CMD_SET,
+		.flags = GENL_ADMIN_PERM, /* Requires CAP_NET_ADMIN
+					   * privilege. */
+		.policy = ct_limit_policy,
+		.doit = ovs_ct_limit_cmd_set,
+	},
+	{ .cmd = OVS_CT_LIMIT_CMD_DEL,
+		.flags = GENL_ADMIN_PERM, /* Requires CAP_NET_ADMIN
+					   * privilege. */
+		.policy = ct_limit_policy,
+		.doit = ovs_ct_limit_cmd_del,
+	},
+	{ .cmd = OVS_CT_LIMIT_CMD_GET,
+		.flags = 0,		  /* OK for unprivileged users. */
+		.policy = ct_limit_policy,
+		.doit = ovs_ct_limit_cmd_get,
+	},
+};
+
+static const struct genl_multicast_group ovs_ct_limit_multicast_group = {
+	.name = OVS_CT_LIMIT_MCGROUP,
+};
+
+struct genl_family dp_ct_limit_genl_family __ro_after_init = {
+	.hdrsize = sizeof(struct ovs_header),
+	.name = OVS_CT_LIMIT_FAMILY,
+	.version = OVS_CT_LIMIT_VERSION,
+	.maxattr = OVS_CT_LIMIT_ATTR_MAX,
+	.netnsok = true,
+	.parallel_ops = true,
+	.ops = ct_limit_genl_ops,
+	.n_ops = ARRAY_SIZE(ct_limit_genl_ops),
+	.mcgrps = &ovs_ct_limit_multicast_group,
+	.n_mcgrps = 1,
+	.module = THIS_MODULE,
+};
+#endif
+
+int ovs_ct_init(struct net *net)
 {
 	unsigned int n_bits = sizeof(struct ovs_key_ct_labels) * BITS_PER_BYTE;
 	struct ovs_net *ovs_net = net_generic(net, ovs_net_id);
@@ -1666,12 +2205,22 @@ void ovs_ct_init(struct net *net)
 	} else {
 		ovs_net->xt_label = true;
 	}
+
+#if	IS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+	return ovs_ct_limit_init(net, ovs_net);
+#else
+	return 0;
+#endif
 }
 
 void ovs_ct_exit(struct net *net)
 {
 	struct ovs_net *ovs_net = net_generic(net, ovs_net_id);
 
+#if	IS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+	ovs_ct_limit_exit(net, ovs_net);
+#endif
+
 	if (ovs_net->xt_label)
 		nf_connlabels_put(net);
 }
diff --git a/net/openvswitch/conntrack.h b/net/openvswitch/conntrack.h
index 399dfdd2c4f9..900dadd70974 100644
--- a/net/openvswitch/conntrack.h
+++ b/net/openvswitch/conntrack.h
@@ -17,10 +17,11 @@
 #include "flow.h"
 
 struct ovs_conntrack_info;
+struct ovs_ct_limit_info;
 enum ovs_key_attr;
 
 #if IS_ENABLED(CONFIG_NF_CONNTRACK)
-void ovs_ct_init(struct net *);
+int ovs_ct_init(struct net *);
 void ovs_ct_exit(struct net *);
 bool ovs_ct_verify(struct net *, enum ovs_key_attr attr);
 int ovs_ct_copy_action(struct net *, const struct nlattr *,
@@ -44,7 +45,7 @@ void ovs_ct_free_action(const struct nlattr *a);
 #else
 #include <linux/errno.h>
 
-static inline void ovs_ct_init(struct net *net) { }
+static inline int ovs_ct_init(struct net *net) { return 0; }
 
 static inline void ovs_ct_exit(struct net *net) { }
 
@@ -104,4 +105,8 @@ static inline void ovs_ct_free_action(const struct nlattr *a) { }
 
 #define CT_SUPPORTED_MASK 0
 #endif /* CONFIG_NF_CONNTRACK */
+
+#if IS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+extern struct genl_family dp_ct_limit_genl_family;
+#endif
 #endif /* ovs_conntrack.h */
diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index 015e24e08909..a61818e94396 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -2288,6 +2288,9 @@ static struct genl_family * const dp_genl_families[] = {
 	&dp_flow_genl_family,
 	&dp_packet_genl_family,
 	&dp_meter_genl_family,
+#if	IS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+	&dp_ct_limit_genl_family,
+#endif
 };
 
 static void dp_unregister_genl(int n_families)
@@ -2323,8 +2326,7 @@ static int __net_init ovs_init_net(struct net *net)
 
 	INIT_LIST_HEAD(&ovs_net->dps);
 	INIT_WORK(&ovs_net->dp_notify_work, ovs_dp_notify_wq);
-	ovs_ct_init(net);
-	return 0;
+	return ovs_ct_init(net);
 }
 
 static void __net_exit list_vports_from_net(struct net *net, struct net *dnet,
@@ -2469,3 +2471,4 @@ MODULE_ALIAS_GENL_FAMILY(OVS_VPORT_FAMILY);
 MODULE_ALIAS_GENL_FAMILY(OVS_FLOW_FAMILY);
 MODULE_ALIAS_GENL_FAMILY(OVS_PACKET_FAMILY);
 MODULE_ALIAS_GENL_FAMILY(OVS_METER_FAMILY);
+MODULE_ALIAS_GENL_FAMILY(OVS_CT_LIMIT_FAMILY);
diff --git a/net/openvswitch/datapath.h b/net/openvswitch/datapath.h
index 523d65526766..c9eb267c6f7e 100644
--- a/net/openvswitch/datapath.h
+++ b/net/openvswitch/datapath.h
@@ -144,6 +144,9 @@ struct dp_upcall_info {
 struct ovs_net {
 	struct list_head dps;
 	struct work_struct dp_notify_work;
+#if	IS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+	struct ovs_ct_limit_info *ct_limit_info;
+#endif
 
 	/* Module reference for configuring conntrack. */
 	bool xt_label;
-- 
2.7.4

* [PATCH net-next v5 1/2] openvswitch: Add conntrack limit netlink definition
From: Yi-Hung Wei @ 2018-05-25  0:56 UTC (permalink / raw)
  To: netdev, pshelar; +Cc: Yi-Hung Wei
In-Reply-To: <1527209803-48274-1-git-send-email-yihung.wei@gmail.com>

Define the netlink messages and attributes that userspace and the kernel
use to communicate about the conntrack limit feature.

Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
---
 include/uapi/linux/openvswitch.h | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index 713e56ce681f..863aabaa5cc9 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -937,4 +937,32 @@ enum ovs_meter_band_type {
 
 #define OVS_METER_BAND_TYPE_MAX (__OVS_METER_BAND_TYPE_MAX - 1)
 
+/* Conntrack limit */
+#define OVS_CT_LIMIT_FAMILY  "ovs_ct_limit"
+#define OVS_CT_LIMIT_MCGROUP "ovs_ct_limit"
+#define OVS_CT_LIMIT_VERSION 0x1
+
+enum ovs_ct_limit_cmd {
+	OVS_CT_LIMIT_CMD_UNSPEC,
+	OVS_CT_LIMIT_CMD_SET,		/* Add or modify ct limit. */
+	OVS_CT_LIMIT_CMD_DEL,		/* Delete ct limit. */
+	OVS_CT_LIMIT_CMD_GET		/* Get ct limit. */
+};
+
+enum ovs_ct_limit_attr {
+	OVS_CT_LIMIT_ATTR_UNSPEC,
+	OVS_CT_LIMIT_ATTR_ZONE_LIMIT,	/* Nested struct ovs_zone_limit. */
+	__OVS_CT_LIMIT_ATTR_MAX
+};
+
+#define OVS_CT_LIMIT_ATTR_MAX (__OVS_CT_LIMIT_ATTR_MAX - 1)
+
+#define OVS_ZONE_LIMIT_DEFAULT_ZONE -1
+
+struct ovs_zone_limit {
+	int zone_id;
+	__u32 limit;
+	__u32 count;
+};
+
 #endif /* _LINUX_OPENVSWITCH_H */
-- 
2.7.4
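
As a rough illustration of the wire format implied by the definitions above, the following userspace sketch walks an OVS_CT_LIMIT_ATTR_ZONE_LIMIT payload as back-to-back struct ovs_zone_limit entries, assuming each entry is padded to netlink's 4-byte alignment. The struct and OVS_ZONE_LIMIT_DEFAULT_ZONE are copied from the patch; ALIGN4() and count_zone_limits() are hypothetical names used only for this example and are not part of the series.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copied from the uapi additions in this patch. */
#define OVS_ZONE_LIMIT_DEFAULT_ZONE -1

struct ovs_zone_limit {
	int zone_id;
	uint32_t limit;
	uint32_t count;
};

/* Hypothetical stand-in for NLA_ALIGN(): round up to a multiple of 4. */
#define ALIGN4(len) (((len) + 3u) & ~(size_t)3u)

/*
 * Hypothetical helper: walk a payload holding back-to-back entries.
 * Returns the number of complete entries, stores in *n_default how many
 * of them address the default zone, and stores in *leftover the number
 * of trailing bytes that do not form a complete entry.
 */
static size_t count_zone_limits(const unsigned char *payload, size_t len,
				size_t *n_default, size_t *leftover)
{
	size_t n = 0;

	*n_default = 0;
	while (len >= sizeof(struct ovs_zone_limit)) {
		struct ovs_zone_limit zl;
		size_t step = ALIGN4(sizeof(zl));

		/* Copy out instead of casting to avoid unaligned access. */
		memcpy(&zl, payload, sizeof(zl));
		if (zl.zone_id == OVS_ZONE_LIMIT_DEFAULT_ZONE)
			(*n_default)++;
		n++;

		/* Never step past the end of the payload. */
		if (step > len)
			step = len;
		payload += step;
		len -= step;
	}

	*leftover = len;
	return n;
}
```

A non-zero leftover corresponds to the "unknown bytes" case that the implementation patch reports through OVS_NLERR().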

* [PATCH net-next v5 0/2] openvswitch: Support conntrack zone limit
From: Yi-Hung Wei @ 2018-05-25  0:56 UTC (permalink / raw)
  To: netdev, pshelar; +Cc: Yi-Hung Wei

Currently, nf_conntrack_max limits the maximum number of conntrack
entries in the conntrack table of each network namespace.  VMs and
containers that reside in the same namespace share one conntrack table,
so the total number of conntrack entries for all of them is bounded by
nf_conntrack_max.  In this case, if one VM or container consumes an
excessive number of conntrack entries, it blocks the others from
committing valid conntrack entries into the conntrack table.  Even if we
put each VM in a different network namespace, the current
nf_conntrack_max configuration is too rigid: we cannot give different
VMs or containers different limits on the number of conntrack entries.

To address this issue, this series proposes a fine-grained mechanism
that further limits the number of conntrack entries per zone.  For
example, we can assign a different zone to each VM and set a conntrack
limit on each zone.  With this isolation, a misbehaving VM only consumes
the conntrack entries in its own zone and does not affect other
well-behaved VMs.  Moreover, users can set different conntrack limits on
different zones according to their needs.

The proposed implementation uses Netfilter's nf_conncount backend to
count the number of connections in a particular zone.  If the number of
connections is above the configured limit, OVS returns ENOMEM to
userspace.  If userspace does not configure a zone limit, the limit
defaults to zero, which means no limit and is backward compatible with
the behavior without this patch.

The first patch defines the conntrack limit netlink definition, and the
second patch provides the implementation.
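
The per-zone accounting described above can be pictured with a small userspace sketch.  It assumes that a commit is refused once it would push the zone's count past its limit, and that a limit of zero means "unlimited" as stated in the cover letter.  struct zone_state and zone_commit() are hypothetical names; the actual series counts connections through nf_conncount rather than a plain counter.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical per-zone state, for illustration only. */
struct zone_state {
	uint32_t limit;	/* 0 means "no limit", as in the cover letter */
	uint32_t count;	/* connections currently accounted to the zone */
};

/*
 * Hypothetical helper: try to account one more connection to a zone.
 * Returns 0 on success, or -ENOMEM when the zone is already at its limit.
 */
static int zone_commit(struct zone_state *z)
{
	if (z->limit != 0 && z->count >= z->limit)
		return -ENOMEM;

	z->count++;
	return 0;
}
```

With one zone_state per VM, a zone that hits its limit only fails its own commits; other zones keep their full budget.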

v4->v5:
  - Addresses comments from Pravin: log an error message in
    ovs_ct_limit_init(), handle deletion of the default limit, and
    add a common helper for getting a zone limit.
  - Rebases to master.

v3->v4:
  - Addresses comments from Pravin: simplify the netlink API and
    remove unnecessary RCU locking.
  - Rebases to master.

v2->v3:
  - Addresses comments from Pravin: use static keys to check whether
    the ovs_ct_limit feature is in use, only check ct_limit when a ct
    entry is unconfirmed, and report rate-limited warning messages when
    the ct limit is reached.
  - Rebases to master.

v1->v2:
  - Fixes commit log typos suggested by Greg.
  - Fixes memory free issue that Julia found.

Yi-Hung Wei (2):
  openvswitch: Add conntrack limit netlink definition
  openvswitch: Support conntrack zone limit

 include/uapi/linux/openvswitch.h |  28 ++
 net/openvswitch/Kconfig          |   3 +-
 net/openvswitch/conntrack.c      | 551 ++++++++++++++++++++++++++++++++++++++-
 net/openvswitch/conntrack.h      |   9 +-
 net/openvswitch/datapath.c       |   7 +-
 net/openvswitch/datapath.h       |   3 +
 6 files changed, 595 insertions(+), 6 deletions(-)

-- 
2.7.4

* Re: [Bridge] [PATCH net-next] net: bridge: add support for port isolation
From: Toshiaki Makita @ 2018-05-25  0:47 UTC (permalink / raw)
  To: Nikolay Aleksandrov, netdev; +Cc: roopa, bridge, davem
In-Reply-To: <20180524085648.5934-1-nikolay@cumulusnetworks.com>

On 2018/05/24 17:56, Nikolay Aleksandrov wrote:
> This patch adds support for a new port flag - BR_ISOLATED. If it is set
> then isolated ports cannot communicate with each other, but they can
> still communicate with non-isolated ports. The same can be achieved via
> ACLs, but they don't scale to a large number of ports and the
> complexity of the rules grows. This feature can be used to achieve
> isolated vlan functionality (similar to pvlan) as well, though currently
> it will be port-wide (for all vlans on the port). The new test in
> should_deliver uses data that is already cache hot and the new boolean
> is used to avoid an additional source port test in should_deliver.
> 
> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>

I sometimes need this kind of configuration and have used VLANs for it.
I guess that does not scale in your case, which is why you added this
feature. I wonder whether this kind of feature is common in hardware
switches.

FWIW,

Reviewed-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
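
For readers skimming the thread, the forwarding rule described in the commit message can be sketched in isolation as below. This is a standalone illustration, not the patch itself: BR_ISOLATED uses the bit position from the patch, while isolation_blocks() is a hypothetical helper that takes both ports' flags directly instead of reading the source-port state from the skb control block.

```c
#include <stdbool.h>

#define BR_ISOLATED (1UL << 16)	/* BIT(16), as in the patch */

/*
 * Hypothetical helper: a frame is blocked only when both the port it
 * arrived on and the port it would leave through are isolated.
 */
static bool isolation_blocks(unsigned long src_flags, unsigned long dst_flags)
{
	bool src_isolated = (src_flags & BR_ISOLATED) != 0;
	bool dst_isolated = (dst_flags & BR_ISOLATED) != 0;

	return src_isolated && dst_isolated;
}
```

Traffic between an isolated and a non-isolated port passes in either direction; only the isolated-to-isolated case is dropped.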

> ---
>  include/linux/if_bridge.h    | 1 +
>  include/uapi/linux/if_link.h | 1 +
>  net/bridge/br_forward.c      | 3 ++-
>  net/bridge/br_input.c        | 1 +
>  net/bridge/br_netlink.c      | 9 ++++++++-
>  net/bridge/br_private.h      | 9 +++++++++
>  net/bridge/br_sysfs_if.c     | 2 ++
>  7 files changed, 24 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/if_bridge.h b/include/linux/if_bridge.h
> index 585d27182425..7843b98e1c6e 100644
> --- a/include/linux/if_bridge.h
> +++ b/include/linux/if_bridge.h
> @@ -50,6 +50,7 @@ struct br_ip_list {
>  #define BR_VLAN_TUNNEL		BIT(13)
>  #define BR_BCAST_FLOOD		BIT(14)
>  #define BR_NEIGH_SUPPRESS	BIT(15)
> +#define BR_ISOLATED		BIT(16)
>  
>  #define BR_DEFAULT_AGEING_TIME	(300 * HZ)
>  
> diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
> index b85266420bfb..cf01b6824244 100644
> --- a/include/uapi/linux/if_link.h
> +++ b/include/uapi/linux/if_link.h
> @@ -333,6 +333,7 @@ enum {
>  	IFLA_BRPORT_BCAST_FLOOD,
>  	IFLA_BRPORT_GROUP_FWD_MASK,
>  	IFLA_BRPORT_NEIGH_SUPPRESS,
> +	IFLA_BRPORT_ISOLATED,
>  	__IFLA_BRPORT_MAX
>  };
>  #define IFLA_BRPORT_MAX (__IFLA_BRPORT_MAX - 1)
> diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c
> index 7a7fd672ccf2..9019f326fe81 100644
> --- a/net/bridge/br_forward.c
> +++ b/net/bridge/br_forward.c
> @@ -30,7 +30,8 @@ static inline int should_deliver(const struct net_bridge_port *p,
>  	vg = nbp_vlan_group_rcu(p);
>  	return ((p->flags & BR_HAIRPIN_MODE) || skb->dev != p->dev) &&
>  		br_allowed_egress(vg, skb) && p->state == BR_STATE_FORWARDING &&
> -		nbp_switchdev_allowed_egress(p, skb);
> +		nbp_switchdev_allowed_egress(p, skb) &&
> +		!br_skb_isolated(p, skb);
>  }
>  
>  int br_dev_queue_push_xmit(struct net *net, struct sock *sk, struct sk_buff *skb)
> diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
> index 7f98a7d25866..72074276c088 100644
> --- a/net/bridge/br_input.c
> +++ b/net/bridge/br_input.c
> @@ -114,6 +114,7 @@ int br_handle_frame_finish(struct net *net, struct sock *sk, struct sk_buff *skb
>  		goto drop;
>  
>  	BR_INPUT_SKB_CB(skb)->brdev = br->dev;
> +	BR_INPUT_SKB_CB(skb)->src_port_isolated = !!(p->flags & BR_ISOLATED);
>  
>  	if (IS_ENABLED(CONFIG_INET) &&
>  	    (skb->protocol == htons(ETH_P_ARP) ||
> diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
> index 015f465c514b..9f5eb05b0373 100644
> --- a/net/bridge/br_netlink.c
> +++ b/net/bridge/br_netlink.c
> @@ -139,6 +139,7 @@ static inline size_t br_port_info_size(void)
>  		+ nla_total_size(1)	/* IFLA_BRPORT_PROXYARP_WIFI */
>  		+ nla_total_size(1)	/* IFLA_BRPORT_VLAN_TUNNEL */
>  		+ nla_total_size(1)	/* IFLA_BRPORT_NEIGH_SUPPRESS */
> +		+ nla_total_size(1)	/* IFLA_BRPORT_ISOLATED */
>  		+ nla_total_size(sizeof(struct ifla_bridge_id))	/* IFLA_BRPORT_ROOT_ID */
>  		+ nla_total_size(sizeof(struct ifla_bridge_id))	/* IFLA_BRPORT_BRIDGE_ID */
>  		+ nla_total_size(sizeof(u16))	/* IFLA_BRPORT_DESIGNATED_PORT */
> @@ -213,7 +214,8 @@ static int br_port_fill_attrs(struct sk_buff *skb,
>  							BR_VLAN_TUNNEL)) ||
>  	    nla_put_u16(skb, IFLA_BRPORT_GROUP_FWD_MASK, p->group_fwd_mask) ||
>  	    nla_put_u8(skb, IFLA_BRPORT_NEIGH_SUPPRESS,
> -		       !!(p->flags & BR_NEIGH_SUPPRESS)))
> +		       !!(p->flags & BR_NEIGH_SUPPRESS)) ||
> +	    nla_put_u8(skb, IFLA_BRPORT_ISOLATED, !!(p->flags & BR_ISOLATED)))
>  		return -EMSGSIZE;
>  
>  	timerval = br_timer_value(&p->message_age_timer);
> @@ -660,6 +662,7 @@ static const struct nla_policy br_port_policy[IFLA_BRPORT_MAX + 1] = {
>  	[IFLA_BRPORT_VLAN_TUNNEL] = { .type = NLA_U8 },
>  	[IFLA_BRPORT_GROUP_FWD_MASK] = { .type = NLA_U16 },
>  	[IFLA_BRPORT_NEIGH_SUPPRESS] = { .type = NLA_U8 },
> +	[IFLA_BRPORT_ISOLATED]	= { .type = NLA_U8 },
>  };
>  
>  /* Change the state of the port and notify spanning tree */
> @@ -810,6 +813,10 @@ static int br_setport(struct net_bridge_port *p, struct nlattr *tb[])
>  	if (err)
>  		return err;
>  
> +	err = br_set_port_flag(p, tb, IFLA_BRPORT_ISOLATED, BR_ISOLATED);
> +	if (err)
> +		return err;
> +
>  	br_port_flags_change(p, old_flags ^ p->flags);
>  	return 0;
>  }
> diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
> index 742f40aefdaf..11520ed528b0 100644
> --- a/net/bridge/br_private.h
> +++ b/net/bridge/br_private.h
> @@ -423,6 +423,7 @@ struct br_input_skb_cb {
>  #endif
>  
>  	bool proxyarp_replied;
> +	bool src_port_isolated;
>  
>  #ifdef CONFIG_BRIDGE_VLAN_FILTERING
>  	bool vlan_filtered;
> @@ -574,6 +575,14 @@ int br_forward_finish(struct net *net, struct sock *sk, struct sk_buff *skb);
>  void br_flood(struct net_bridge *br, struct sk_buff *skb,
>  	      enum br_pkt_type pkt_type, bool local_rcv, bool local_orig);
>  
> +/* return true if both source port and dest port are isolated */
> +static inline bool br_skb_isolated(const struct net_bridge_port *to,
> +				   const struct sk_buff *skb)
> +{
> +	return BR_INPUT_SKB_CB(skb)->src_port_isolated &&
> +	       (to->flags & BR_ISOLATED);
> +}
> +
>  /* br_if.c */
>  void br_port_carrier_check(struct net_bridge_port *p, bool *notified);
>  int br_add_bridge(struct net *net, const char *name);
> diff --git a/net/bridge/br_sysfs_if.c b/net/bridge/br_sysfs_if.c
> index fd31ad83ec7b..f99c5bf5c906 100644
> --- a/net/bridge/br_sysfs_if.c
> +++ b/net/bridge/br_sysfs_if.c
> @@ -192,6 +192,7 @@ BRPORT_ATTR_FLAG(proxyarp_wifi, BR_PROXYARP_WIFI);
>  BRPORT_ATTR_FLAG(multicast_flood, BR_MCAST_FLOOD);
>  BRPORT_ATTR_FLAG(broadcast_flood, BR_BCAST_FLOOD);
>  BRPORT_ATTR_FLAG(neigh_suppress, BR_NEIGH_SUPPRESS);
> +BRPORT_ATTR_FLAG(isolated, BR_ISOLATED);
>  
>  #ifdef CONFIG_BRIDGE_IGMP_SNOOPING
>  static ssize_t show_multicast_router(struct net_bridge_port *p, char *buf)
> @@ -243,6 +244,7 @@ static const struct brport_attribute *brport_attrs[] = {
>  	&brport_attr_broadcast_flood,
>  	&brport_attr_group_fwd_mask,
>  	&brport_attr_neigh_suppress,
> +	&brport_attr_isolated,
>  	NULL
>  };
>  
> 

-- 
Toshiaki Makita

* Re: [PATCH v3 net] stmmac: Added support for 802.1ad S-TAG stripping
From: Toshiaki Makita @ 2018-05-25  0:34 UTC (permalink / raw)
  To: Elad Nachman, Jose Abreu, Florian Fainelli, David Miller
  Cc: netdev, peppe.cavallaro, alexandre.torgue
In-Reply-To: <c9c605d9-dff6-4909-e90f-e3b7e179edb6@gmail.com>

On 2018/05/25 1:56, Elad Nachman wrote:
> stmmac reception handler calls stmmac_rx_vlan() to strip the vlan before calling napi_gro_receive().
> 
> The function assumes VLAN tagged frames are always tagged with 802.1Q protocol,
> and assigns ETH_P_8021Q to the skb by hard-coding the parameter on call to __vlan_hwaccel_put_tag() .
> 
> This causes packets not to be passed to the VLAN slave if it was created with 802.1AD protocol
> (ip link add link eth0 eth0.100 type vlan proto 802.1ad id 100).
> 
> This fix passes the protocol from the VLAN header into __vlan_hwaccel_put_tag()
> instead of using the hard-coded value of ETH_P_8021Q.
> NETIF_F_HW_VLAN_STAG_RX was added to the net device features to reflect this new support.
> 
> Signed-off-by: Elad Nachman <eladn@gilat.com>
> 
> ---
>  drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 16 +++++++++-------
>  1 file changed, 9 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> index b65e2d1..2d2f37f 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> @@ -3293,17 +3293,19 @@ static netdev_tx_t stmmac_xmit(struct sk_buff *skb, struct net_device *dev)
>  
>  static void stmmac_rx_vlan(struct net_device *dev, struct sk_buff *skb)
>  {
> -	struct ethhdr *ehdr;
> +	struct vlan_ethhdr *veth;
>  	u16 vlanid;
> +	__be16 vlan_proto;
>  
> -	if ((dev->features & NETIF_F_HW_VLAN_CTAG_RX) ==
> -	    NETIF_F_HW_VLAN_CTAG_RX &&
> +	if ((dev->features & (NETIF_F_HW_VLAN_CTAG_RX|NETIF_F_HW_VLAN_STAG_RX)) ==
> +	    (NETIF_F_HW_VLAN_CTAG_RX|NETIF_F_HW_VLAN_STAG_RX) &&

This condition is not correct: with it, CTAG is not stripped when
HW_VLAN_STAG_RX is disabled, even if HW_VLAN_CTAG_RX is enabled.

The correct behavior is stripping CTAG when CTAG_RX is enabled and
stripping STAG when STAG_RX is enabled, so this code cannot be
protocol-agnostic. I suggested handling only CTAG in this driver because
I thought adding STAG support will make this unnecessarily complicated.
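
To make the per-protocol rule concrete, here is a minimal standalone sketch of the decision, assuming the tag protocol (TPID) has already been read from the frame in host byte order. F_CTAG_RX and F_STAG_RX are hypothetical stand-ins for the NETIF_F_HW_VLAN_*_RX bits, and should_strip_tag() is a hypothetical helper, not something from the driver.

```c
#include <stdbool.h>
#include <stdint.h>

#define ETH_P_8021Q	0x8100	/* C-TAG TPID (802.1Q) */
#define ETH_P_8021AD	0x88A8	/* S-TAG TPID (802.1ad) */

/* Hypothetical feature bits; the kernel's NETIF_F_* values differ. */
#define F_CTAG_RX	(1u << 0)
#define F_STAG_RX	(1u << 1)

/*
 * Hypothetical helper: strip a tag only when the offload matching its
 * protocol is enabled. tpid is in host byte order.
 */
static bool should_strip_tag(unsigned int features, uint16_t tpid)
{
	switch (tpid) {
	case ETH_P_8021Q:
		return (features & F_CTAG_RX) != 0;
	case ETH_P_8021AD:
		return (features & F_STAG_RX) != 0;
	default:
		return false;
	}
}
```

Unlike a single combined feature test, this keeps CTAG stripping working when only CTAG_RX is set.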

However, I have now noticed that this driver does not seem able to
toggle CTAG_RX or STAG_RX, because hw_features does not include them.
So this code should work even though the condition is wrong. But then
why do we need to check whether dev->features includes CTAG_RX here at
all? It is always included. Removing this check seems sufficient.

>  	    !__vlan_get_tag(skb, &vlanid)) {
>  		/* pop the vlan tag */
> -		ehdr = (struct ethhdr *)skb->data;
> -		memmove(skb->data + VLAN_HLEN, ehdr, ETH_ALEN * 2);
> +		veth = (struct vlan_ethhdr *)skb->data;
> +		vlan_proto = veth->h_vlan_proto;
> +		memmove(skb->data + VLAN_HLEN, veth, ETH_ALEN * 2);
>  		skb_pull(skb, VLAN_HLEN);
> -		__vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q), vlanid);
> +		__vlan_hwaccel_put_tag(skb, vlan_proto, vlanid);
>  	}
>  }
>  
> @@ -4344,7 +4346,7 @@ int stmmac_dvr_probe(struct device *device,
>  	ndev->watchdog_timeo = msecs_to_jiffies(watchdog);
>  #ifdef STMMAC_VLAN_TAG_USED
>  	/* Both mac100 and gmac support receive VLAN tag detection */
> -	ndev->features |= NETIF_F_HW_VLAN_CTAG_RX;
> +	ndev->features |= (NETIF_F_HW_VLAN_CTAG_RX|NETIF_F_HW_VLAN_STAG_RX);
>  #endif
>  	priv->msg_enable = netif_msg_init(debug, default_msg_level);
>  
> 

-- 
Toshiaki Makita

* Re: [PATCH net] vhost: synchronize IOTLB message with dev cleanup
From: Michael S. Tsirkin @ 2018-05-25  0:33 UTC (permalink / raw)
  To: Jason Wang; +Cc: kvm, virtualization, netdev, linux-kernel
In-Reply-To: <1526990337-24892-1-git-send-email-jasowang@redhat.com>

On Tue, May 22, 2018 at 07:58:57PM +0800, Jason Wang wrote:
> DaeRyong Jeong reports a race between vhost_dev_cleanup() and
> vhost_process_iotlb_msg():
> 
> Thread interleaving:
> CPU0 (vhost_process_iotlb_msg)			CPU1 (vhost_dev_cleanup)
> (In the case of both VHOST_IOTLB_UPDATE and
> VHOST_IOTLB_INVALIDATE)
> =====						=====
> 						vhost_umem_clean(dev->iotlb);
> if (!dev->iotlb) {
> 	        ret = -EFAULT;
> 		        break;
> }
> 						dev->iotlb = NULL;
> 
> The reason is that we don't synchronize between them. Fix this by
> protecting vhost_process_iotlb_msg() with the dev mutex.
> 
> Reported-by: DaeRyong Jeong <threeearcat@gmail.com>
> Fixes: 6b1e6cc7855b0 ("vhost: new device IOTLB API")
> Signed-off-by: Jason Wang <jasowang@redhat.com>

We should think of a way to have a per-vq lock here, but for now:

Acked-by: Michael S. Tsirkin <mst@redhat.com>
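
As a userspace analogy of the locking pattern in this fix, the sketch below performs the NULL check and any later use under one mutex. It assumes the cleanup side runs with the same mutex held, which is an assumption of this example and not something the patch itself shows. struct dev, process_iotlb_msg() and dev_cleanup() are hypothetical stand-ins, and a pthread mutex replaces the kernel mutex.

```c
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for struct vhost_dev. */
struct dev {
	pthread_mutex_t mutex;
	void *iotlb;
};

/* Check and (notionally) use d->iotlb while holding the device mutex. */
static int process_iotlb_msg(struct dev *d)
{
	int ret = 0;

	pthread_mutex_lock(&d->mutex);
	if (!d->iotlb)
		ret = -EFAULT;
	/* Otherwise d->iotlb would be used here, still under the lock. */
	pthread_mutex_unlock(&d->mutex);

	return ret;
}

/* Assumed to take the same mutex, so it cannot interleave with the above. */
static void dev_cleanup(struct dev *d)
{
	pthread_mutex_lock(&d->mutex);
	d->iotlb = NULL;	/* real code would release the table first */
	pthread_mutex_unlock(&d->mutex);
}
```

Because both paths serialize on one lock, the check in process_iotlb_msg() can no longer be invalidated between the test and the use.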

> ---
>  drivers/vhost/vhost.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index f3bd8e9..f0be5f3 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -981,6 +981,7 @@ static int vhost_process_iotlb_msg(struct vhost_dev *dev,
>  {
>  	int ret = 0;
>  
> +	mutex_lock(&dev->mutex);
>  	vhost_dev_lock_vqs(dev);
>  	switch (msg->type) {
>  	case VHOST_IOTLB_UPDATE:
> @@ -1016,6 +1017,8 @@ static int vhost_process_iotlb_msg(struct vhost_dev *dev,
>  	}
>  
>  	vhost_dev_unlock_vqs(dev);
> +	mutex_unlock(&dev->mutex);
> +
>  	return ret;
>  }
>  ssize_t vhost_chr_write_iter(struct vhost_dev *dev,
> -- 
> 2.7.4

* Re: [PATCH bpf-next v5 0/7] bpf: implement BPF_TASK_FD_QUERY
From: Daniel Borkmann @ 2018-05-25  0:27 UTC (permalink / raw)
  To: Yonghong Song, peterz, ast, netdev; +Cc: kernel-team
In-Reply-To: <20180524182111.454612-1-yhs@fb.com>

On 05/24/2018 08:21 PM, Yonghong Song wrote:
> Currently, suppose a userspace application has loaded a bpf program
> and attached it to a tracepoint/kprobe/uprobe, and a bpf
> introspection tool, e.g., bpftool, wants to show which bpf program
> is attached to which tracepoint/kprobe/uprobe. Such attachment
> information will be really useful to understand the overall bpf
> deployment in the system.
> 
> There is a name field (16 bytes) for each program, which could
> be used to encode the attachment point. There are some drawbacks
> to this approach. First, a bpftool user (e.g., an admin) may not
> really understand the association between the name and the
> attachment point. Second, if one program is attached to multiple
> places, encoding a proper name which can imply all these
> attachments becomes difficult.
> 
> This patch introduces a new bpf subcommand BPF_TASK_FD_QUERY.
> Given a pid and fd, this command will return bpf related information
> to user space. Right now it only supports tracepoint/kprobe/uprobe
> perf event fd's. For such a fd, BPF_TASK_FD_QUERY will return
>    . prog_id
>    . tracepoint name, or
>    . k[ret]probe funcname + offset or kernel addr, or
>    . u[ret]probe filename + offset
> to the userspace.
> The user can use "bpftool prog" to find more information about
> bpf program itself with prog_id.
> 
> Patch #1 adds function perf_get_event() in kernel/events/core.c.
> Patch #2 implements the bpf subcommand BPF_TASK_FD_QUERY.
> Patch #3 syncs tools bpf.h header and also adds bpf_task_fd_query()
> in the libbpf library for samples/selftests/bpftool to use.
> Patch #4 adds ksym_get_addr() utility function.
> Patch #5 adds a test in samples/bpf for querying k[ret]probes and
> u[ret]probes.
> Patch #6 adds a test in tools/testing/selftests/bpf for querying
> raw_tracepoint and tracepoint.
> Patch #7 adds a new subcommand "perf" to bpftool.
> 
> Changelogs:
>   v4 -> v5:
>      . return strlen(buf) instead of strlen(buf) + 1
>        in the attr.buf_len. As long as the user provides
>        a non-empty buffer, it will be filled with an empty
>        string, a truncated string, or the full string
>        based on the buffer size and the length of
>        the to-be-copied string.
>   v3 -> v4:
>      . made attr buf_len input/output. The length of
>        the actual buffer is written to buf_len so user space knows
>        what is actually needed. If the user provides a buffer
>        with length >= 1 but less than required, do partial
>        copy and return -ENOSPC.
>      . code simplification with put_user.
>      . changed query result attach_info to fd_type.
>      . add tests at selftests/bpf to test zero len, null buf and
>        insufficient buf.
>   v2 -> v3:
>      . made perf_get_event() return perf_event pointer const.
>        this was to ensure that event fields are not meddled.
>      . detect whether the new BPF_TASK_FD_QUERY is supported or
>        not in "bpftool perf" and warn users if it is not.
>   v1 -> v2:
>      . changed bpf subcommand name from BPF_PERF_EVENT_QUERY
>        to BPF_TASK_FD_QUERY.
>      . fixed various "bpftool perf" issues and added documentation
>        and auto-completion.
> 
> Yonghong Song (7):
>   perf/core: add perf_get_event() to return perf_event given a struct
>     file
>   bpf: introduce bpf subcommand BPF_TASK_FD_QUERY
>   tools/bpf: sync kernel header bpf.h and add bpf_task_fd_query in
>     libbpf
>   tools/bpf: add ksym_get_addr() in trace_helpers
>   samples/bpf: add a samples/bpf test for BPF_TASK_FD_QUERY
>   tools/bpf: add two BPF_TASK_FD_QUERY tests in test_progs
>   tools/bpftool: add perf subcommand
> 
>  include/linux/perf_event.h                       |   5 +
>  include/linux/trace_events.h                     |  17 +
>  include/uapi/linux/bpf.h                         |  26 ++
>  kernel/bpf/syscall.c                             | 131 ++++++++
>  kernel/events/core.c                             |   8 +
>  kernel/trace/bpf_trace.c                         |  48 +++
>  kernel/trace/trace_kprobe.c                      |  29 ++
>  kernel/trace/trace_uprobe.c                      |  22 ++
>  samples/bpf/Makefile                             |   4 +
>  samples/bpf/task_fd_query_kern.c                 |  19 ++
>  samples/bpf/task_fd_query_user.c                 | 382 +++++++++++++++++++++++
>  tools/bpf/bpftool/Documentation/bpftool-perf.rst |  81 +++++
>  tools/bpf/bpftool/Documentation/bpftool.rst      |   5 +-
>  tools/bpf/bpftool/bash-completion/bpftool        |   9 +
>  tools/bpf/bpftool/main.c                         |   3 +-
>  tools/bpf/bpftool/main.h                         |   1 +
>  tools/bpf/bpftool/perf.c                         | 246 +++++++++++++++
>  tools/include/uapi/linux/bpf.h                   |  26 ++
>  tools/lib/bpf/bpf.c                              |  23 ++
>  tools/lib/bpf/bpf.h                              |   3 +
>  tools/testing/selftests/bpf/test_progs.c         | 158 ++++++++++
>  tools/testing/selftests/bpf/trace_helpers.c      |  12 +
>  tools/testing/selftests/bpf/trace_helpers.h      |   1 +
>  23 files changed, 1257 insertions(+), 2 deletions(-)
>  create mode 100644 samples/bpf/task_fd_query_kern.c
>  create mode 100644 samples/bpf/task_fd_query_user.c
>  create mode 100644 tools/bpf/bpftool/Documentation/bpftool-perf.rst
>  create mode 100644 tools/bpf/bpftool/perf.c

LGTM, series:

Acked-by: Daniel Borkmann <daniel@iogearbox.net>
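
The buf_len semantics described in the v3 -> v4 and v4 -> v5 changelog entries quoted above can be sketched in plain C as follows. This is only an illustration of the described behaviour, not the kernel code: query_copy() is a hypothetical helper, and returning 0 for a pure length probe (NULL or zero-length buffer) is an assumption of this sketch.

```c
#include <errno.h>
#include <string.h>

/*
 * Hypothetical helper. On entry *buf_len is the size of the caller's
 * buffer; on return it holds strlen(src), excluding the terminating NUL.
 */
static int query_copy(const char *src, char *buf, unsigned int *buf_len)
{
	unsigned int cap = *buf_len;
	unsigned int len = (unsigned int)strlen(src);

	*buf_len = len;
	if (!buf || cap == 0)
		return 0;		/* length probe, nothing copied */

	if (cap >= len + 1) {
		memcpy(buf, src, len + 1);	/* full string plus NUL */
		return 0;
	}

	memcpy(buf, src, cap - 1);	/* truncated, still NUL-terminated */
	buf[cap - 1] = '\0';
	return -ENOSPC;
}
```

A caller that gets -ENOSPC can allocate buf_len + 1 bytes and retry to obtain the full string.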

* Re: [PATCH] PCI: allow drivers to limit the number of VFs to 0
From: Bjorn Helgaas @ 2018-05-24 23:57 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Bjorn Helgaas, linux-pci, netdev, Sathya Perla, Felix Manlunas,
	alexander.duyck, john.fastabend, Jacob Keller, Donald Dutile,
	oss-drivers, Christoph Hellwig
In-Reply-To: <20180402224652.4058-1-jakub.kicinski@netronome.com>

Hi Jakub,

On Mon, Apr 02, 2018 at 03:46:52PM -0700, Jakub Kicinski wrote:
> Some user space depends on enabling sriov_totalvfs number of VFs
> to not fail, e.g.:
> 
> $ cat .../sriov_totalvfs > .../sriov_numvfs
> 
> For devices whose VF support depends on the loaded FW we have the
> pci_sriov_{g,s}et_totalvfs() API.  However, this API uses 0 as
> a special "unset" value, meaning drivers can't limit sriov_totalvfs
> to 0.  Remove the special values completely and simply initialize
> driver_max_VFs to total_VFs.  Then always use driver_max_VFs.
> Add a helper for drivers to reset the VF limit back to total.

I still can't really make sense out of the changelog.

I think part of the reason it's confusing is because there are two
things going on:

  1) You want this:
  
       pci_sriov_set_totalvfs(dev, 0);
       x = pci_sriov_get_totalvfs(dev);

     to return 0 instead of total_VFs.  That seems to connect with
     your subject line.  It means "sriov_totalvfs" in sysfs could be
     0, but I don't know how that is useful (I'm sure it is; just
     educate me :))

  2) You're adding the pci_sriov_reset_totalvfs() interface.  I'm not
     sure what you intend for this.  Is *every* driver supposed to
     call it in .remove()?  Could/should this be done in the core
     somehow instead of depending on every driver?

I'm also having a hard time connecting your user-space command example
with the rest of this.  Maybe it will make more sense to me tomorrow
after some coffee.
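
One way to picture point 1 is the small model below, which contrasts a getter that treats 0 as "unset" with one that always returns the driver limit. It is a sketch based only on the changelog's description: struct iov_model and all four helpers are hypothetical names, not the functions in drivers/pci/iov.c.

```c
#include <stdint.h>

/* Hypothetical model of the two fields named in the changelog. */
struct iov_model {
	uint16_t total_VFs;		/* what the device advertises */
	uint16_t driver_max_VFs;	/* limit requested by the driver */
};

/* Scheme with a sentinel: 0 means "no driver limit", so 0 cannot be set. */
static uint16_t get_totalvfs_sentinel(const struct iov_model *iov)
{
	return iov->driver_max_VFs ? iov->driver_max_VFs : iov->total_VFs;
}

/* Proposed scheme: the limit starts at total_VFs and is always used. */
static void iov_model_init(struct iov_model *iov, uint16_t total)
{
	iov->total_VFs = total;
	iov->driver_max_VFs = total;
}

static uint16_t get_totalvfs(const struct iov_model *iov)
{
	return iov->driver_max_VFs;
}

/* Counterpart of the proposed reset helper: back to the device total. */
static void reset_totalvfs(struct iov_model *iov)
{
	iov->driver_max_VFs = iov->total_VFs;
}
```

In this model, a driver limit of 0 reads back as 0 under the proposed scheme, whereas the sentinel scheme reports the device total.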

> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> ---
>  drivers/net/ethernet/netronome/nfp/nfp_main.c |  6 +++---
>  drivers/pci/iov.c                             | 27 +++++++++++++++++++++------
>  include/linux/pci.h                           |  2 ++
>  3 files changed, 26 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/net/ethernet/netronome/nfp/nfp_main.c b/drivers/net/ethernet/netronome/nfp/nfp_main.c
> index c4b1f344b4da..a76d177e40dd 100644
> --- a/drivers/net/ethernet/netronome/nfp/nfp_main.c
> +++ b/drivers/net/ethernet/netronome/nfp/nfp_main.c
> @@ -123,7 +123,7 @@ static int nfp_pcie_sriov_read_nfd_limit(struct nfp_pf *pf)
>  		return pci_sriov_set_totalvfs(pf->pdev, pf->limit_vfs);
>  
>  	pf->limit_vfs = ~0;
> -	pci_sriov_set_totalvfs(pf->pdev, 0); /* 0 is unset */
> +	pci_sriov_reset_totalvfs(pf->pdev);
>  	/* Allow any setting for backwards compatibility if symbol not found */
>  	if (err == -ENOENT)
>  		return 0;
> @@ -537,7 +537,7 @@ static int nfp_pci_probe(struct pci_dev *pdev,
>  err_net_remove:
>  	nfp_net_pci_remove(pf);
>  err_sriov_unlimit:
> -	pci_sriov_set_totalvfs(pf->pdev, 0);
> +	pci_sriov_reset_totalvfs(pf->pdev);
>  err_fw_unload:
>  	kfree(pf->rtbl);
>  	nfp_mip_close(pf->mip);
> @@ -570,7 +570,7 @@ static void nfp_pci_remove(struct pci_dev *pdev)
>  	nfp_hwmon_unregister(pf);
>  
>  	nfp_pcie_sriov_disable(pdev);
> -	pci_sriov_set_totalvfs(pf->pdev, 0);
> +	pci_sriov_reset_totalvfs(pf->pdev);
>  
>  	nfp_net_pci_remove(pf);
>  
> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> index 677924ae0350..c63ea870d8be 100644
> --- a/drivers/pci/iov.c
> +++ b/drivers/pci/iov.c
> @@ -443,6 +443,7 @@ static int sriov_init(struct pci_dev *dev, int pos)
>  	iov->nres = nres;
>  	iov->ctrl = ctrl;
>  	iov->total_VFs = total;
> +	iov->driver_max_VFs = total;
>  	pci_read_config_word(dev, pos + PCI_SRIOV_VF_DID, &iov->vf_device);
>  	iov->pgsz = pgsz;
>  	iov->self = dev;
> @@ -788,12 +789,29 @@ int pci_sriov_set_totalvfs(struct pci_dev *dev, u16 numvfs)
>  }
>  EXPORT_SYMBOL_GPL(pci_sriov_set_totalvfs);
>  
> +/**
> + * pci_sriov_reset_totalvfs -- return the TotalVFs value to the default
> + * @dev: the PCI PF device
> + *
> + * Should be called from PF driver's remove routine with
> + * device's mutex held.
> + */
> +void pci_sriov_reset_totalvfs(struct pci_dev *dev)
> +{
> +	/* Shouldn't change if VFs already enabled */
> +	if (!dev->is_physfn || dev->sriov->ctrl & PCI_SRIOV_CTRL_VFE)
> +		return;
> +
> +	dev->sriov->driver_max_VFs = dev->sriov->total_VFs;
> +}
> +EXPORT_SYMBOL_GPL(pci_sriov_reset_totalvfs);
> +
>  /**
>   * pci_sriov_get_totalvfs -- get total VFs supported on this device
>   * @dev: the PCI PF device
>   *
> - * For a PCIe device with SRIOV support, return the PCIe
> - * SRIOV capability value of TotalVFs or the value of driver_max_VFs
> + * For a PCIe device with SRIOV support, return the value of driver_max_VFs
> + * which can be equal to the PCIe SRIOV capability value of TotalVFs or lower
>   * if the driver reduced it.  Otherwise 0.
>   */
>  int pci_sriov_get_totalvfs(struct pci_dev *dev)
> @@ -801,9 +819,6 @@ int pci_sriov_get_totalvfs(struct pci_dev *dev)
>  	if (!dev->is_physfn)
>  		return 0;
>  
> -	if (dev->sriov->driver_max_VFs)
> -		return dev->sriov->driver_max_VFs;
> -
> -	return dev->sriov->total_VFs;
> +	return dev->sriov->driver_max_VFs;
>  }
>  EXPORT_SYMBOL_GPL(pci_sriov_get_totalvfs);
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 024a1beda008..95fde8850393 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -1952,6 +1952,7 @@ void pci_iov_remove_virtfn(struct pci_dev *dev, int id);
>  int pci_num_vf(struct pci_dev *dev);
>  int pci_vfs_assigned(struct pci_dev *dev);
>  int pci_sriov_set_totalvfs(struct pci_dev *dev, u16 numvfs);
> +void pci_sriov_reset_totalvfs(struct pci_dev *dev);
>  int pci_sriov_get_totalvfs(struct pci_dev *dev);
>  resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno);
>  void pci_vf_drivers_autoprobe(struct pci_dev *dev, bool probe);
> @@ -1978,6 +1979,7 @@ static inline int pci_vfs_assigned(struct pci_dev *dev)
>  { return 0; }
>  static inline int pci_sriov_set_totalvfs(struct pci_dev *dev, u16 numvfs)
>  { return 0; }
> +static inline void pci_sriov_reset_totalvfs(struct pci_dev *dev) { }
>  static inline int pci_sriov_get_totalvfs(struct pci_dev *dev)
>  { return 0; }
>  static inline resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno)
> -- 
> 2.16.2
> 

^ permalink raw reply

* Re: [v8, bpf-next, 4/9] net/wireless/iwlwifi: fix iwlwifi_dev_ucode_error tracepoint
From: Steven Rostedt @ 2018-05-24 23:39 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Johannes Berg, Alexei Starovoitov, davem, daniel, torvalds,
	peterz, mathieu.desnoyers, netdev, kernel-team, linux-api,
	linux-wireless
In-Reply-To: <20180524232837.24jvdsdiohkpj7fs@ast-mbp>

On Thu, 24 May 2018 16:28:39 -0700
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> Ohh. I didn't realize that networking wireless doesn't fall under netdev.
> I thought wireless folks are silent because they are embarrassed
> by a function with 17 arguments.

Please let's refrain from the demeaning comments.

I agree with your argument, but not the tone.

-- Steve

^ permalink raw reply

* Re: [PATCH 00/14] Modify action API for implementing lockless actions
From: Cong Wang @ 2018-05-24 23:34 UTC (permalink / raw)
  To: Vlad Buslov
  Cc: Linux Kernel Network Developers, David Miller, Jamal Hadi Salim,
	Jiri Pirko, Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal,
	Alexei Starovoitov, Daniel Borkmann, Eric Dumazet, Kees Cook,
	LKML, NetFilter, coreteam, kliteyn
In-Reply-To: <1526308035-12484-1-git-send-email-vladbu@mellanox.com>

On Mon, May 14, 2018 at 7:27 AM, Vlad Buslov <vladbu@mellanox.com> wrote:
> Currently, all netlink protocol handlers for updating rules, actions and
> qdiscs are protected with single global rtnl lock which removes any
> possibility for parallelism. This patch set is a first step to remove
> rtnl lock dependency from TC rules update path. It updates act API to
> use atomic operations, rcu and spinlocks for fine-grained locking. It
> also extends the API with functions that are needed to update existing
> actions for parallel execution.

Can you give a summary here for what and how it is achieved?

You said this is the first step. What do you want to achieve in this
very first step, and how do you achieve it? Do you break the RTNL
lock down to, for a quick example, a per-device lock? Or do you
remove it completely, and if so, for what reason?

I went through the descriptions of all 14 patches (but not any code),
and I still have no clue how you successfully avoid RTNL. Please don't
make me read your code to understand that; there must be some
high-level justification of how it works. Without it, I don't even want
to read the code.

Thanks.

^ permalink raw reply

* Re: [PATCH bpf-next v2 0/3] bpf: add boot parameters for sysctl knobs
From: Alexei Starovoitov @ 2018-05-24 23:34 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Eugene Syromiatnikov, netdev, linux-kernel, linux-doc, Kees Cook,
	Kai-Heng Feng, Daniel Borkmann, Alexei Starovoitov,
	Jonathan Corbet, Jiri Olsa
In-Reply-To: <20180524094108.066d885a@redhat.com>

On Thu, May 24, 2018 at 09:41:08AM +0200, Jesper Dangaard Brouer wrote:
> On Wed, 23 May 2018 15:02:45 -0700
> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
> > On Wed, May 23, 2018 at 02:18:19PM +0200, Eugene Syromiatnikov wrote:
> > > Some BPF sysctl knobs affect the loading of BPF programs, and during
> > > system boot/init stages these sysctls are not yet configured.
> > > A concrete example is systemd, that has implemented loading of BPF
> > > programs.
> > > 
> > > Thus, to allow controlling these settings at early boot, this patch set
> > > adds the ability to change the default setting of these sysctl knobs
> > > as well as option to override them via a boot-time kernel parameter
> > > (in order to avoid rebuilding kernel each time a need of changing these
> > > defaults arises).
> > > 
> > > The sysctl knobs in question are kernel.unprivileged_bpf_disable,
> > > net.core.bpf_jit_harden, and net.core.bpf_jit_kallsyms.  
> > 
> > - systemd is root. today it only uses cgroup-bpf progs which require root,
> >   so disabling unpriv during boot time makes no difference to systemd.
> >   what is the actual reason at the present time?
> > 
> > - say in the future systemd wants to use so_reuseport+bpf for faster
> >   networking. With unpriv disable during boot, it will force systemd
> >   to do such networking from root, which will lower its security barrier.
> >   How does that make sense?
> > 
> > - bpf_jit_kallsyms sysctl has immediate effect on loaded programs.
> >   Flipping it during the boot or right after or any time after
> >   is the same thing. Why add such boot flag then?
> > 
> > - jit_harden can be turned on by systemd. so turning it during the boot
> >   will make systemd progs to be constant blinded.
> >   Constant blinding protects kernel from unprivileged JIT spraying.
> >   Are you worried that systemd will attack the kernel with JIT spraying?
> 
> 
> I think you are missing that we want the ability to change these
> defaults in order to avoid depending on /etc/sysctl.conf settings, and
> that these sysctl.conf settings happen too late.

What does 'happens too late' mean?
Too late for what?
sysctl.conf has plenty of system critical knobs like
kernel.perf_event_paranoid, kernel.core_pattern, etc
The behavior of the host is drastically different after sysctl config
is applied.

> For example with jit_harden, there will be a difference between the
> loaded BPF program that got loaded at boot-time with systemd (no
> constant blinding) and when someone reloads that systemd service after
> /etc/sysctl.conf has been evaluated and bpf_jit_harden has been set (now
> slower due to constant blinding).   This is inconsistent behavior.

net.core.bpf_jit_harden can be flipped back and forth at run-time,
so bpf progs before and after will be either blinded or not.
I don't see any inconsistency.
In general I think bootparams should be used only for things
like kpti=on/off that cannot be set by sysctl.

^ permalink raw reply

* Re: [v8, bpf-next, 4/9] net/wireless/iwlwifi: fix iwlwifi_dev_ucode_error tracepoint
From: Alexei Starovoitov @ 2018-05-24 23:28 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Alexei Starovoitov, davem-fT/PcQaiUtIeIZ0/mPfg9Q,
	daniel-FeC+5ew28dpmcu3hnIyYJQ,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	peterz-wEGCiKHe2LqWVfeAwA7xHQ, rostedt-nx8X9YLhiw1AfugRpC6u6w,
	mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w,
	netdev-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1527073388.3759.21.camel-cdvu00un1VgdHxzADdlk8Q@public.gmane.org>

On Wed, May 23, 2018 at 01:03:08PM +0200, Johannes Berg wrote:
> On Wed, 2018-03-28 at 12:05 -0700, Alexei Starovoitov wrote:
> > fix iwlwifi_dev_ucode_error tracepoint to pass pointer to a table
> > instead of all 17 arguments by value.
> > dvm/main.c and mvm/utils.c have 'struct iwl_error_event_table'
> > defined with very similar yet subtly different fields and offsets.
> > tracepoint is still common and using definition of 'struct iwl_error_event_table'
> > from dvm/commands.h while copying fields.
> > Long term this tracepoint probably should be split into two.
> 
> It would've been nice to CC the wireless list for wireless related
> patches ...

Ohh. I didn't realize that networking wireless doesn't fall under netdev.
I thought wireless folks are silent because they are embarrassed
by a function with 17 arguments.

> > --- a/drivers/net/wireless/intel/iwlwifi/iwl-devtrace.c
> > +++ b/drivers/net/wireless/intel/iwlwifi/iwl-devtrace.c
> > @@ -30,6 +30,7 @@
> >  #ifndef __CHECKER__
> >  #include "iwl-trans.h"
> >  
> > +#include "dvm/commands.h"
> 
> In particular, this breaks the whole driver abstraction.
> 
> > +++ b/drivers/net/wireless/intel/iwlwifi/mvm/utils.c
> > @@ -549,12 +549,7 @@ static void iwl_mvm_dump_lmac_error_log(struct iwl_mvm *mvm, u32 base)
> >  
> >         IWL_ERR(mvm, "Loaded firmware version: %s\n", mvm->fw->fw_version);
> >  
> > -       trace_iwlwifi_dev_ucode_error(trans->dev, table.error_id, table.tsf_low,
> > -                                     table.data1, table.data2, table.data3,
> > -                                     table.blink2, table.ilink1,
> > -                                     table.ilink2, table.bcon_time, table.gp1,
> > -                                     table.gp2, table.fw_rev_type, table.major,
> > -                                     table.minor, table.hw_ver, table.brd_ver);
> > +       trace_iwlwifi_dev_ucode_error(trans->dev, &table, table.hw_ver, table.brd_ver);
> 
> This is also utterly wrong because mvm has - for better or worse - a
> different type "struct iwl_error_event_table" in this file ...

As I was trying to explain in the commit log, a single struct
is used in both places, but the differences between the two
"struct iwl_error_event_table" definitions are carefully matched
field by field. For two extra fields that was not
possible, so they are passed separately, as you can see above.
I believe the tracepoint output is exactly
the same before and after the patch.
I guess you see the breakage because new fields got
added into one "struct iwl_error_event_table",
but were not added to its evil twin "struct iwl_error_event_table"
with the same name after the patch landed?
imo wireless folks need to avoid such naming conflicts.
I suggest isolating the common fields into a separate base struct and
giving the two children structs different names.
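
A rough sketch of that layout, with made-up names: the struct and function
names below are hypothetical, the field list is only an illustrative subset,
and mvm_only_field is a placeholder. It also assumes the common fields can
be copied into the base struct, since the real firmware table layouts differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fields shared by both flavours (illustrative subset only). */
struct iwl_error_event_common {
	uint32_t error_id;
	uint32_t tsf_low;
	uint32_t data1;
};

/* Hypothetical dvm-specific table: common part first, then extras. */
struct iwl_dvm_error_event_table {
	struct iwl_error_event_common common;
	uint32_t hw_ver;
	uint32_t brd_ver;
};

/* Hypothetical mvm-specific table with a different tail. */
struct iwl_mvm_error_event_table {
	struct iwl_error_event_common common;
	uint32_t mvm_only_field; /* placeholder, not a real field name */
};

/* A shared consumer (e.g. a tracepoint) only needs the common part. */
static uint32_t trace_error_id(const struct iwl_error_event_common *c)
{
	assert(c != NULL);
	return c->error_id;
}
```

With distinct type names, passing the wrong table to a flavour-specific
helper becomes a compile-time error instead of a silent layout mismatch.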

^ permalink raw reply

* [PATCH] ath6kl: mark expected switch fall-throughs
From: Gustavo A. R. Silva @ 2018-05-24 23:13 UTC (permalink / raw)
  To: Kalle Valo, David S. Miller
  Cc: linux-wireless, netdev, linux-kernel, Gustavo A. R. Silva

In preparation for enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
---
 drivers/net/wireless/ath/ath6kl/cfg80211.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/wireless/ath/ath6kl/cfg80211.c b/drivers/net/wireless/ath/ath6kl/cfg80211.c
index 2ba8cf3..29e32cd 100644
--- a/drivers/net/wireless/ath/ath6kl/cfg80211.c
+++ b/drivers/net/wireless/ath/ath6kl/cfg80211.c
@@ -3898,17 +3898,17 @@ int ath6kl_cfg80211_init(struct ath6kl *ar)
 	wiphy->max_scan_ie_len = 1000; /* FIX: what is correct limit? */
 	switch (ar->hw.cap) {
 	case WMI_11AN_CAP:
-		ht = true;
+		ht = true; /* fall through */
 	case WMI_11A_CAP:
 		band_5gig = true;
 		break;
 	case WMI_11GN_CAP:
-		ht = true;
+		ht = true; /* fall through */
 	case WMI_11G_CAP:
 		band_2gig = true;
 		break;
 	case WMI_11AGN_CAP:
-		ht = true;
+		ht = true; /* fall through */
 	case WMI_11AG_CAP:
 		band_2gig = true;
 		band_5gig = true;
-- 
2.7.4

^ permalink raw reply related

* [PATCH] ath5k: mark expected switch fall-through
From: Gustavo A. R. Silva @ 2018-05-24 23:07 UTC (permalink / raw)
  To: Jiri Slaby, Nick Kossifidis, Luis R. Rodriguez, Kalle Valo,
	David S. Miller
  Cc: linux-wireless, netdev, linux-kernel, Gustavo A. R. Silva

In preparation for enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
---
 drivers/net/wireless/ath/ath5k/pcu.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/wireless/ath/ath5k/pcu.c b/drivers/net/wireless/ath/ath5k/pcu.c
index f23c851..05140d8 100644
--- a/drivers/net/wireless/ath/ath5k/pcu.c
+++ b/drivers/net/wireless/ath/ath5k/pcu.c
@@ -670,6 +670,7 @@ ath5k_hw_init_beacon_timers(struct ath5k_hw *ah, u32 next_beacon, u32 interval)
 		break;
 	case NL80211_IFTYPE_ADHOC:
 		AR5K_REG_ENABLE_BITS(ah, AR5K_TXCFG, AR5K_TXCFG_ADHOC_BCN_ATIM);
+		/* fall through */
 	default:
 		/* On non-STA modes timer1 is used as next DMA
 		 * beacon alert (DBA) timer and timer2 as next
-- 
2.7.4

^ permalink raw reply related

* Re: [PATCH net-next] net: phy: convert further flags in struct phy_device to bit-field
From: Florian Fainelli @ 2018-05-24 23:03 UTC (permalink / raw)
  To: Heiner Kallweit, Andrew Lunn, David Miller; +Cc: netdev@vger.kernel.org
In-Reply-To: <d148e574-2e29-a52f-7da0-13ef1ead927a@gmail.com>

On 05/24/2018 01:15 PM, Heiner Kallweit wrote:
> This patch is a follow-up to 87e5808d52b6 ("net: phy: replace bool
> members in struct phy_device with bit-fields") and converts further
> flags to bit-fields.

This looks fine, but then you would also have to clean up all code that
does phydev->asym_pause = 1 and phydev->pause = 1 to use true/false
instead. I am not sure there is much value in doing that for these
fields: they are exposed to drivers, so there is a risk
of possible breakage.
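
A minimal sketch of one way such breakage could show up. It assumes some
driver stores a value other than 0 or 1 (for example a masked register bit)
into these fields, which is an assumption and not something taken from an
existing driver. The struct and function names are hypothetical.

```c
#include <assert.h>

/* Hypothetical "before" layout: plain ints keep whatever value is stored. */
struct phy_flags_int {
	int pause;
	int asym_pause;
};

/* Hypothetical "after" layout: one-bit fields keep only the lowest bit. */
struct phy_flags_bits {
	unsigned pause:1;
	unsigned asym_pause:1;
};

/* Store v in the int field and read it back unchanged. */
static int store_int(int v)
{
	struct phy_flags_int f = { 0, 0 };

	f.pause = v;
	return f.pause;
}

/* Store v in the 1-bit field and read back what survived. */
static int store_bits(int v)
{
	struct phy_flags_bits f = { 0, 0 };

	assert(v >= 0);
	f.pause = v;	/* converted modulo 2 for an unsigned 1-bit field */
	return f.pause;
}
```

Storing 2 reads back as 2 from the int field but as 0 from the bit-field, so
a test like "if (phydev->pause)" would silently change its result.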

Thanks!

> 
> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
> ---
>  include/linux/phy.h | 17 ++++++++---------
>  1 file changed, 8 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/phy.h b/include/linux/phy.h
> index 6cd090984..cc66f2834 100644
> --- a/include/linux/phy.h
> +++ b/include/linux/phy.h
> @@ -418,21 +418,20 @@ struct phy_device {
>  	/* The most recently read link state */
>  	unsigned link:1;
>  
> +	/* forced speed & duplex (no autoneg)
> +	 * partner speed & duplex & pause (autoneg)
> +	 */
> +	unsigned pause:1;
> +	unsigned asym_pause:1;
> +	int speed;
> +	int duplex;
> +
>  	enum phy_state state;
>  
>  	u32 dev_flags;
>  
>  	phy_interface_t interface;
>  
> -	/*
> -	 * forced speed & duplex (no autoneg)
> -	 * partner speed & duplex & pause (autoneg)
> -	 */
> -	int speed;
> -	int duplex;
> -	int pause;
> -	int asym_pause;
> -
>  	/* Enabled Interrupts */
>  	u32 interrupts;
>  
> 


-- 
Florian

^ permalink raw reply

* [PATCH] ath10k: htt_tx: mark expected switch fall-throughs
From: Gustavo A. R. Silva @ 2018-05-24 22:59 UTC (permalink / raw)
  To: Kalle Valo, David S. Miller
  Cc: ath10k, linux-wireless, netdev, linux-kernel, Gustavo A. R. Silva

In preparation for enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

Notice that in this particular case, I replaced "pass through" with
a proper "fall through" comment, which is what GCC is expecting
to find.
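
For reference, a minimal standalone sketch of the annotated pattern. The
enum, flag and function names are made up for illustration and are not the
driver's.

```c
#include <assert.h>

/* Hypothetical stand-ins for the driver's TX/RX modes and flag bits. */
enum txrx_mode { MODE_RAW, MODE_NATIVE_WIFI, MODE_ETHERNET };

#define FLAG_MAC_HDR_PRESENT 0x1u
#define FLAG_FRAG_DESC       0x2u

static unsigned int tx_flags(enum txrx_mode mode)
{
	unsigned int flags = 0;

	switch (mode) {
	case MODE_RAW:
	case MODE_NATIVE_WIFI:
		flags |= FLAG_MAC_HDR_PRESENT;
		/* fall through */
	case MODE_ETHERNET:
		flags |= FLAG_FRAG_DESC;
		break;
	default:
		assert(0 && "unknown mode");
	}
	return flags;
}
```

The comment sits directly before the next case label, so the compiler can
treat the missing break as intentional.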

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
---
 drivers/net/wireless/ath/ath10k/htt_tx.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/wireless/ath/ath10k/htt_tx.c b/drivers/net/wireless/ath/ath10k/htt_tx.c
index 5d8b97a..89157c5 100644
--- a/drivers/net/wireless/ath/ath10k/htt_tx.c
+++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
@@ -1202,7 +1202,7 @@ static int ath10k_htt_tx_32(struct ath10k_htt *htt,
 	case ATH10K_HW_TXRX_RAW:
 	case ATH10K_HW_TXRX_NATIVE_WIFI:
 		flags0 |= HTT_DATA_TX_DESC_FLAGS0_MAC_HDR_PRESENT;
-		/* pass through */
+		/* fall through */
 	case ATH10K_HW_TXRX_ETHERNET:
 		if (ar->hw_params.continuous_frag_desc) {
 			ext_desc_t = htt->frag_desc.vaddr_desc_32;
@@ -1404,7 +1404,7 @@ static int ath10k_htt_tx_64(struct ath10k_htt *htt,
 	case ATH10K_HW_TXRX_RAW:
 	case ATH10K_HW_TXRX_NATIVE_WIFI:
 		flags0 |= HTT_DATA_TX_DESC_FLAGS0_MAC_HDR_PRESENT;
-		/* pass through */
+		/* fall through */
 	case ATH10K_HW_TXRX_ETHERNET:
 		if (ar->hw_params.continuous_frag_desc) {
 			ext_desc_t = htt->frag_desc.vaddr_desc_64;
-- 
2.7.4

^ permalink raw reply related

* Re: 4.16 issue with mbim modem and ping with size > 14552 bytes
From: Daniele Palmas @ 2018-05-24 22:54 UTC (permalink / raw)
  To: Greg KH; +Cc: netdev, linux-usb
In-Reply-To: <20180524155334.GA28874@kroah.com>

Hi Greg,

2018-05-24 17:53 GMT+02:00 Greg KH <gregkh@linuxfoundation.org>:
> On Thu, May 24, 2018 at 05:04:49PM +0200, Daniele Palmas wrote:
>> Hello,
>>
>> I have an issue with a USB MBIM modem when trying to send more than
>> 14552 bytes with ping. It looks to me like a kernel issue, though not at
>> the cdc_mbim or cdc_ncm level; I'm not sure, so I'm reporting the
>> issue.
>>
>> My kernel is 4.16. The device is the following:
>
> Do older kernels work, or is this something that has always been
> there?
>

Not tested yet; I'm going to do that.

> I ask, as my mobile provider does horrible things to large packet sizes.
> So much so that I have to set the mtu to 1280 just to get things to work
> properly when tethering my phone through to my laptop.  So this might be
> a network provider issue :)
>

Yeah, I thought the same, so I tried the same scenario with Windows 10,
and it works fine there.

Thanks,
Daniele

> thanks,
>
> greg k-h

^ permalink raw reply

* Re: [PATCH net] packet: in packet_snd start writing at link layer allocation
From: Willem de Bruijn @ 2018-05-24 22:13 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: David Miller, Network Development, Eric Dumazet, Willem de Bruijn,
	Maor Gottlieb
In-Reply-To: <CAF=yD-JapgdzDxtt+noXEm2Zj4dy=9N1_ALYBsz-TXA5CwtTkQ@mail.gmail.com>

On Thu, May 24, 2018 at 1:01 PM, Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
> On Thu, May 24, 2018 at 11:17 AM, Willem de Bruijn
> <willemdebruijn.kernel@gmail.com> wrote:
>> On Thu, May 24, 2018 at 11:07 AM, Tariq Toukan <tariqt@mellanox.com> wrote:
>>>
>>>
>>> On 14/05/2018 3:20 AM, David Miller wrote:
>>>>
>>>> From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
>>>> Date: Fri, 11 May 2018 13:24:25 -0400
>>>>
>>>>> From: Willem de Bruijn <willemb@google.com>
>>>>>
>>>>> Packet sockets allow construction of packets shorter than
>>>>> dev->hard_header_len to accommodate protocols with variable length
>>>>> link layer headers. These packets are padded to dev->hard_header_len,
>>>>> because some device drivers interpret that as a minimum packet size.
>>>>>
>>>>> packet_snd reserves dev->hard_header_len bytes on allocation.
>>>>> SOCK_DGRAM sockets call skb_push in dev_hard_header() to ensure that
>>>>> link layer headers are stored in the reserved range. SOCK_RAW sockets
>>>>> do the same in tpacket_snd, but not in packet_snd.
>>>>>
>>>>> Syzbot was able to send a zero byte packet to a device with massive
>>>>> 116B link layer header, causing padding to cross over into skb_shinfo.
>>>>> Fix this by writing from the start of the llheader reserved range also
>>>>> in the case of packet_snd/SOCK_RAW.
>>>>>
>>>>> Update skb_set_network_header to the new offset. This also corrects
>>>>> it for SOCK_DGRAM, where it incorrectly double counted reserve due to
>>>>> the skb_push in dev_hard_header.
>>>>>
>>>>> Fixes: 9ed988cd5915 ("packet: validate variable length ll headers")
>>>>> Reported-by: syzbot+71d74a5406d02057d559@syzkaller.appspotmail.com
>>>>> Signed-off-by: Willem de Bruijn <willemb@google.com>
>>>>
>>>>
>>>> Applied and queued up for -stable, thanks Willem.
>>>>
>>>
>>> Hi,
>>>
>>> One of our regression tests started failing. Once this patch is reverted,
>>> test passes.
>>>
>>> The tests add flow steering rules on the receiver side, and on the sender
>>> side they send packets with some RAW socket applications. The receiver
>>> side then gets a completion with error.
>>>
>>> Our verification team compared the packets between the stable and the broken
>>> version, in the broken version we have some extra bytes at the end of the
>>> packet.
>>>
>>> It looks like some bad push to the SKB, maybe the conditional reserved
>>> addition should be more strict?
>>>
>>> Any idea?
>>
>> Thanks for reporting, sorry for the breakage.
>>
>> I think I might. This skb_push moves back the start of skb->data in the
>> same way that tpacket_snd does. But it does not reduce the length
>> passed to skb_put, so this might double count hard_header_len.
>>
>> Let me construct a test.
>
> Indeed.
>
> Still verifying, but this almost certainly has to be
>
>   @@ -2911,7 +2912,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
>                   if (unlikely(offset < 0))
>                           goto out_free;
>           } else if (reserve) {
>   -               skb_push(skb, reserve);
>   +               skb_reserve(skb, -reserve);
>           }
>
> to move the start of the packet without changing its length.

I sent http://patchwork.ozlabs.org/patch/920126/

Again, thanks a lot for reporting this, Tariq. I'm working on some
packet socket boundary condition tests for tools/testing/selftests/net,
so that I cannot push such a mistake again.

^ permalink raw reply

* [PATCH net] packet: fix reserve calculation
From: Willem de Bruijn @ 2018-05-24 22:10 UTC (permalink / raw)
  To: netdev; +Cc: davem, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Commit b84bbaf7a6c8 ("packet: in packet_snd start writing at link
layer allocation") ensures that packet_snd always starts writing
the link layer header in reserved headroom allocated for this
purpose.

This is needed because packets may be shorter than hard_header_len,
in which case the space up to hard_header_len may be zeroed. But
that necessary padding is not accounted for in skb->len.

The fix, however, is buggy. It calls skb_push, which grows skb->len
when moving skb->data back. But in this case packet length should not
change.

Instead, call skb_reserve, which moves both skb->data and skb->tail
back, without changing length.
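
The difference can be sketched with a small user-space model. The struct
skb_model type and the model_push() and model_reserve() helpers are
hypothetical stand-ins that track only offsets and length; they are not
the kernel implementations of skb_push() and skb_reserve().

```c
#include <assert.h>

/* Hypothetical model of the sk_buff fields involved, as buffer offsets. */
struct skb_model {
	int data; /* offset of the first byte of packet data */
	int tail; /* offset one past the last byte of packet data */
	int len;  /* number of bytes of packet data */
};

/* Models skb_push(): move data back and grow the packet by len bytes. */
static void model_push(struct skb_model *skb, int len)
{
	assert(skb->data >= len);
	skb->data -= len;
	skb->len += len;
}

/* Models skb_reserve(): shift data and tail together; length unchanged. */
static void model_reserve(struct skb_model *skb, int len)
{
	assert(skb->data + len >= 0);
	skb->data += len;
	skb->tail += len;
}
```

Starting from data = tail = 14 and len = 0, model_push(skb, 14) leaves
len at 14 before any payload is added, while model_reserve(skb, -14) moves
both pointers to 0 and keeps len at 0.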

Fixes: b84bbaf7a6c8 ("packet: in packet_snd start writing at link layer allocation")
Reported-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
---
 net/packet/af_packet.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index e9422fe45179..acb7b86574cd 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2911,7 +2911,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
 		if (unlikely(offset < 0))
 			goto out_free;
 	} else if (reserve) {
-		skb_push(skb, reserve);
+		skb_reserve(skb, -reserve);
 	}
 
 	/* Returns -EFAULT on error */
-- 
2.17.0.921.gf22659ad46-goog

^ permalink raw reply related

* Re: [PATCH net-next 0/8] nfp: offload LAG for tc flower egress
From: Jakub Kicinski @ 2018-05-24 22:01 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: David Miller, Linux Netdev List, oss-drivers, Jiri Pirko,
	Jay Vosburgh, Veaceslav Falico, Andy Gospodarek
In-Reply-To: <CAJ3xEMhJckJq6HDFm_QTtDP_SG1jPJ55q1b-_Vg0WoC_UqO_Wg@mail.gmail.com>

On Thu, 24 May 2018 22:26:03 +0300, Or Gerlitz wrote:
> On Thu, May 24, 2018 at 9:49 PM, Jakub Kicinski
> <jakub.kicinski@netronome.com> wrote:
> > On Thu, 24 May 2018 20:04:56 +0300, Or Gerlitz wrote:  
> 
> >> Does this apply also to non-uplink representors? if yes, what is the use case?
> >>
> >> We are looking at supporting uplink lag in the sriov switchdev scheme - we refer to
> >> it as "vf lag" -- b/c the netdev and rdma devices seen by the VF are actually
> >> subject to HA and/or LAG - I wasn't sure if/how you limit this series
> >> to uplink reprs  
> >
> > I don't think we have a limitation on the output port within the LAG.
> > But keep in mind in our devices all ports belong to the same eswitch/PF
> > so bonding uplink ports is generally sufficient, I'm not sure VF
> > bonding adds much HA.  IOW AFAIK we support VF bonding because HW can do
> > it easily, not because we have a strong use case for it.  
> 
> To make it clear, vf lag is code name for uplink lag, I think we want
> to say that we provide the VM a lagged VF, anyway, again, the lag is
> done on the uplink reps not on the vf reps.

Ah, ack, same use case here!

> Unlike the uplink port which is physical one, the vf vport is virtual
> one, what could be the benefit to bond two vports?

I'm not sure what it could be :)  We can also bond an uplink and a VF!
All outputs on the nfp work the same way, so why limit ourselves if we
can do it? :)

^ permalink raw reply

