Netdev List
 help / color / mirror / Atom feed
* RE: netif_receive_skb is taking long time
From: Keyur Amrutbhai Patel @ 2018-10-25 17:22 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev@vger.kernel.org
In-Reply-To: <5d039fa2-95fc-dbdf-4ab0-d5cccebe8bcc@gmail.com>

Hi Eric,

First of all thank you for replying and giving some spotlight.

First step would be to read Documentation/networking/scaling.txt and see if anything there helps.
 - This is good article. I had gone through it.  Any suggestion on RSS? How to configure it? Do I need to take care anything specially in my NIC driver?

Have you tried to profile the kernel and see if some contention or hot function appears ?
- I have added time stampings in different functions. That is how I came to know that almost ~3375 neno seconds are used by just " netif_receive_skb " don’t know why. With less than that time my DMA operation is finishes and descriptors are managed.
Current time consuming function are " netif_receive_skb " and " napi_alloc_skb " these two function calls are taking maximum about of time

Maybe use a faster cpu, or remove not needed features like too heavy netfilter rules.
- I am using Intex Xeon Platinum series processors. These are fast enough CPUs available in market with 64 cores. 2 CPU nodes (each has 32 core)

We can not really answer your question, you do not provide enough information.
- Please let me know what additional details you need. We have 6 queues in HW. Each is mapped to MSI-X vector. Each vector is giving interrupt on different CPU. From interrupt I am scheduling napi and from napi poll function I am getting DMA page and constructing skb and passing it to network layer with "netif_receive_skb".

Let me know additional details which are required.

Regards,
Keyur

-----Original Message-----
From: Eric Dumazet <eric.dumazet@gmail.com> 
Sent: Thursday, October 25, 2018 10:38 PM
To: Keyur Amrutbhai Patel <keyurp@xilinx.com>; netdev@vger.kernel.org
Subject: Re: netif_receive_skb is taking long time

EXTERNAL EMAIL

On 10/25/2018 08:39 AM, Keyur Amrutbhai Patel wrote:
> Hi,
>
> In my NIC driver "netif_receive_skb" is taking too long time. Almost 3375 neno seconds. Which is more than whole packet processing from interrupt.
>
> Could anyone please help me to understand what could be the reason behind this? How to solve it to take minimum time?
>
> Is there any standard calls which we need to follow in order to get faster performance?
>

First step would be to read Documentation/networking/scaling.txt and see if anything there helps.

Have you tried to profile the kernel and see if some contention or hot function appears ?

Maybe use a faster cpu, or remove not needed features like too heavy netfilter rules.

We can not really answer your question, you do not provide enough information.

^ permalink raw reply

* [PATCH 1/1] ipmr: Make cache queue length configurable
From: Brodie Greenfield @ 2018-10-26  2:02 UTC (permalink / raw)
  To: davem, kuznet, yoshfuji
  Cc: netdev, linux-kernel, chris.packham, luuk.paulussen,
	Brodie Greenfield

We want to be able to keep more spaces available in our queue for
processing incoming multicast traffic (adding (S,G) entries) - this lets
us learn more groups faster, rather than dropping them at this stage.

Signed-off-by: Brodie Greenfield <brodie.greenfield@alliedtelesis.co.nz>
---
 Documentation/networking/ip-sysctl.txt | 7 +++++++
 include/net/netns/ipv4.h               | 1 +
 include/uapi/linux/sysctl.h            | 1 +
 kernel/sysctl_binary.c                 | 1 +
 net/ipv4/af_inet.c                     | 2 ++
 net/ipv4/ipmr.c                        | 4 +++-
 net/ipv4/sysctl_net_ipv4.c             | 7 +++++++
 7 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 960de8fe3f40..dfc70ef6c42b 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -864,6 +864,13 @@ ip_local_reserved_ports - list of comma separated ranges
 
 	Default: Empty
 
+ip_mr_cache_queue_length - INTEGER
+	Limit the number of multicast packets we can have in the queue to be
+	resolved.
+	Bear in mind that this causes an O(n) traversal of the same size when
+	the queue is full. This should be considered if increasing.
+	Default: 10
+
 ip_unprivileged_port_start - INTEGER
 	This is a per-namespace sysctl.  It defines the first
 	unprivileged port in the network namespace.  Privileged ports
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index e47503b4e4d1..1ca5cabe2d3b 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -184,6 +184,7 @@ struct netns_ipv4 {
 	int sysctl_igmp_max_msf;
 	int sysctl_igmp_llm_reports;
 	int sysctl_igmp_qrv;
+	int sysctl_ip_mr_cache_queue_length;
 
 	struct ping_group_range ping_group_range;
 
diff --git a/include/uapi/linux/sysctl.h b/include/uapi/linux/sysctl.h
index d71013fffaf6..32e32d4904cd 100644
--- a/include/uapi/linux/sysctl.h
+++ b/include/uapi/linux/sysctl.h
@@ -425,6 +425,7 @@ enum
 	NET_TCP_ALLOWED_CONG_CONTROL=123,
 	NET_TCP_MAX_SSTHRESH=124,
 	NET_TCP_FRTO_RESPONSE=125,
+	NET_IPV4_IP_MR_CACHE_QUEUE_LENGTH=126,
 };
 
 enum {
diff --git a/kernel/sysctl_binary.c b/kernel/sysctl_binary.c
index 07148b497451..8db94e8d97ed 100644
--- a/kernel/sysctl_binary.c
+++ b/kernel/sysctl_binary.c
@@ -367,6 +367,7 @@ static const struct bin_table bin_net_ipv4_table[] = {
 	{ CTL_INT,	NET_IPV4_LOCAL_PORT_RANGE,		"ip_local_port_range" },
 	{ CTL_INT,	NET_IPV4_IGMP_MAX_MEMBERSHIPS,		"igmp_max_memberships" },
 	{ CTL_INT,	NET_IPV4_IGMP_MAX_MSF,			"igmp_max_msf" },
+	{ CTL_INT,	NET_IPV4_IP_MR_CACHE_QUEUE_LENGTH,	"ip_mr_cache_queue_length" },
 	{ CTL_INT,	NET_IPV4_INET_PEER_THRESHOLD,		"inet_peer_threshold" },
 	{ CTL_INT,	NET_IPV4_INET_PEER_MINTTL,		"inet_peer_minttl" },
 	{ CTL_INT,	NET_IPV4_INET_PEER_MAXTTL,		"inet_peer_maxttl" },
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 1fbe2f815474..4b78d12aca36 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1818,6 +1818,8 @@ static __net_init int inet_init_net(struct net *net)
 	net->ipv4.sysctl_igmp_llm_reports = 1;
 	net->ipv4.sysctl_igmp_qrv = 2;
 
+	/* ipmr unresolved queue length max */
+	net->ipv4.sysctl_ip_mr_cache_queue_length = 10;
 	return 0;
 }
 
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 5660adcf7a04..2864f80e2f2a 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -1128,6 +1128,7 @@ static int ipmr_cache_unresolved(struct mr_table *mrt, vifi_t vifi,
 	struct mfc_cache *c;
 	bool found = false;
 	int err;
+	struct net *net = dev_net(dev);
 
 	spin_lock_bh(&mfc_unres_lock);
 	list_for_each_entry(c, &mrt->mfc_unres_queue, _c.list) {
@@ -1140,7 +1141,8 @@ static int ipmr_cache_unresolved(struct mr_table *mrt, vifi_t vifi,
 
 	if (!found) {
 		/* Create a new entry if allowable */
-		if (atomic_read(&mrt->cache_resolve_queue_len) >= 10 ||
+		if (atomic_read(&mrt->cache_resolve_queue_len) >=
+		    net->ipv4.sysctl_ip_mr_cache_queue_length ||
 		    (c = ipmr_cache_alloc_unres()) == NULL) {
 			spin_unlock_bh(&mfc_unres_lock);
 
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 891ed2f91467..b249932ee24e 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -772,6 +772,13 @@ static struct ctl_table ipv4_net_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec
 	},
+	{
+		.procname	= "ip_mr_cache_queue_length",
+		.data		= &init_net.ipv4.sysctl_ip_mr_cache_queue_length,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec
+	},
 #ifdef CONFIG_IP_MULTICAST
 	{
 		.procname	= "igmp_qrv",
-- 
2.19.1

^ permalink raw reply related

* Re: netif_receive_skb is taking long time
From: Eric Dumazet @ 2018-10-25 17:32 UTC (permalink / raw)
  To: Keyur Amrutbhai Patel, Eric Dumazet; +Cc: netdev@vger.kernel.org
In-Reply-To: <BN6PR02MB2641D817810960AAE59B9CDEB7F70@BN6PR02MB2641.namprd02.prod.outlook.com>


Please do not top post, and use normal quoting.

On 10/25/2018 10:22 AM, Keyur Amrutbhai Patel wrote:
> Hi Eric,
> 
> First of all thank you for replying and giving some spotlight.
> 
> First step would be to read Documentation/networking/scaling.txt and see if anything there helps.
>  - This is good article. I had gone through it.  Any suggestion on RSS? How to configure it? Do I need to take care anything specially in my NIC driver?

Just read the page and apply the various configurations.

> 
> Have you tried to profile the kernel and see if some contention or hot function appears ?
> - I have added time stampings in different functions. That is how I came to know that almost ~3375 neno seconds are used by just " netif_receive_skb " don’t know why. With less than that time my DMA operation is finishes and descriptors are managed.
> Current time consuming function are " netif_receive_skb " and " napi_alloc_skb " these two function calls are taking maximum about of time
> 

So... networking spend more time in upper stacks than a driver.

A driver does almost nothing, just passing around bits that that NIC put in memory.

In most workloads, a driver would not use more than 5% of total cpu cycles.

Now, if all you need is to impress your friends/boss about some
crazy number of RX packets per second,
just do not allocate skbs, and not call netif_receive_skb(),
use something like XDP to drop incoming frames :)

> Maybe use a faster cpu, or remove not needed features like too heavy netfilter rules.
> - I am using Intex Xeon Platinum series processors. These are fast enough CPUs available in market with 64 cores. 2 CPU nodes (each has 32 core)
> 
> We can not really answer your question, you do not provide enough information.
> - Please let me know what additional details you need. We have 6 queues in HW. Each is mapped to MSI-X vector. Each vector is giving interrupt on different CPU. From interrupt I am scheduling napi and from napi poll function I am getting DMA page and constructing skb and passing it to network layer with "netif_receive_skb".
> 
> Let me know additional details which are required.
> 
> Regards,
> Keyur
> 
> -----Original Message-----
> From: Eric Dumazet <eric.dumazet@gmail.com> 
> Sent: Thursday, October 25, 2018 10:38 PM
> To: Keyur Amrutbhai Patel <keyurp@xilinx.com>; netdev@vger.kernel.org
> Subject: Re: netif_receive_skb is taking long time
> 
> EXTERNAL EMAIL
> 
> On 10/25/2018 08:39 AM, Keyur Amrutbhai Patel wrote:
>> Hi,
>>
>> In my NIC driver "netif_receive_skb" is taking too long time. Almost 3375 neno seconds. Which is more than whole packet processing from interrupt.
>>
>> Could anyone please help me to understand what could be the reason behind this? How to solve it to take minimum time?
>>
>> Is there any standard calls which we need to follow in order to get faster performance?
>>
> 
> First step would be to read Documentation/networking/scaling.txt and see if anything there helps.
> 
> Have you tried to profile the kernel and see if some contention or hot function appears ?
> 
> Maybe use a faster cpu, or remove not needed features like too heavy netfilter rules.
> 
> We can not really answer your question, you do not provide enough information.
> 

^ permalink raw reply

* Re: Fw: [Bug 201423] New: eth0: hw csum failure
From: Eric Dumazet @ 2018-10-25 17:38 UTC (permalink / raw)
  To: Andre Tomt, Eric Dumazet, Eric Dumazet
  Cc: Stephen Hemminger, netdev, rossi.f, Dimitris Michailidis
In-Reply-To: <aa5628c9-41ae-37cb-6bba-e47c4d16cb78@tomt.net>



On 10/24/2018 12:41 PM, Andre Tomt wrote:
> 
> It eventually showed up again with mlx4, on 4.18.16 + fix and also on 4.19. I still do not have a useful packet capture.
> 
> It is running a torrent client serving up various linux distributions.
>

Have you also applied this fix ?

https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/?id=db4f1be3ca9b0ef7330763d07bf4ace83ad6f913

^ permalink raw reply

* Re: netif_receive_skb is taking long time
From: David Miller @ 2018-10-25 17:43 UTC (permalink / raw)
  To: keyurp; +Cc: eric.dumazet, netdev
In-Reply-To: <BN6PR02MB2641D817810960AAE59B9CDEB7F70@BN6PR02MB2641.namprd02.prod.outlook.com>

From: Keyur Amrutbhai Patel <keyurp@xilinx.com>
Date: Thu, 25 Oct 2018 17:22:02 +0000

> Current time consuming function are " netif_receive_skb " and "
> napi_alloc_skb " these two function calls are taking maximum about
> of time

netif_receive_skb() calls the entire networking stack receive path.
So measuring it by itself it not very useful.

Use 'perf' or a similar tool to fully profile the kernel and get a
more detailed analysis.

^ permalink raw reply

* Re: [PATCH net] net: ethernet: cadence: fix socket buffer corruption problem
From: David Miller @ 2018-10-25 18:32 UTC (permalink / raw)
  To: Tristram.Ha; +Cc: nicolas.ferre, UNGLinuxDriver, netdev
In-Reply-To: <1540417883-8476-1-git-send-email-Tristram.Ha@microchip.com>

From: <Tristram.Ha@microchip.com>
Date: Wed, 24 Oct 2018 14:51:23 -0700

> From: Tristram Ha <Tristram.Ha@microchip.com>
> 
> Socket buffer is not re-created when headroom is 2 and tailroom is 1.
> 
> Signed-off-by: Tristram Ha <Tristram.Ha@microchip.com>

Applied.

^ permalink raw reply

* Re: [PATCH] octeontx2-af: Use GFP_ATOMIC under spin lock
From: David Miller @ 2018-10-25 18:36 UTC (permalink / raw)
  To: weiyongjun1; +Cc: sgoutham, lcherian, gakula, jerinj, netdev, kernel-janitors
In-Reply-To: <1540431746-176759-1-git-send-email-weiyongjun1@huawei.com>

From: Wei Yongjun <weiyongjun1@huawei.com>
Date: Thu, 25 Oct 2018 01:42:26 +0000

> The function nix_update_mce_list() is called from
> nix_update_bcast_mce_list(), and a spin lock is held
> here, so we should use GFP_ATOMIC instead.
> 
> Fixes: 4b05528ebf0c ("octeontx2-af: Update bcast list upon NIXLF alloc/free")
> Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>

I'm applying this.

I'm really disappointed in how the octeontx2 driver submission has done.

The Intel folks can get an entire new driver in with 2 series
of patches, we're on the 3rd or 4th here and the driver still
isn't completely enough to have basic functionality working.

This driver is huge, overly complicated, and is being submitted in a
very painful way.

^ permalink raw reply

* Re: [PATCH net] net: ethernet: cadence: fix socket buffer corruption problem
From: Florian Fainelli @ 2018-10-25 18:41 UTC (permalink / raw)
  To: David Miller, Tristram.Ha; +Cc: nicolas.ferre, UNGLinuxDriver, netdev
In-Reply-To: <20181025.113246.2298193822589257914.davem@davemloft.net>

On 10/25/18 11:32 AM, David Miller wrote:
> From: <Tristram.Ha@microchip.com>
> Date: Wed, 24 Oct 2018 14:51:23 -0700
> 
>> From: Tristram Ha <Tristram.Ha@microchip.com>
>>
>> Socket buffer is not re-created when headroom is 2 and tailroom is 1.
>>
>> Signed-off-by: Tristram Ha <Tristram.Ha@microchip.com>
> 
> Applied.

No fixes tag?
-- 
Florian

^ permalink raw reply

* Re: [net-next][PATCH] net/ipv4: fix a net leak
From: Bjørn Mork @ 2018-10-25 18:43 UTC (permalink / raw)
  To: David Ahern; +Cc: Li RongQing, netdev, David Miller
In-Reply-To: <b2fe3de8-44b7-1ba9-3543-ac8abee44bf6@gmail.com>

David Ahern <dsahern@gmail.com> writes:
> On 10/24/18 9:02 AM, David Ahern wrote:
>> On 10/24/18 3:36 AM, Li RongQing wrote:
>>> put net when input a invalid ifindex, otherwise it will be leaked
>>>
>>> Fixes: 5fcd266a9f64("net/ipv4: Add support for dumping addresses for a specific device")
>>> Cc: David Ahern <dsahern@gmail.com>
>>> Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
>>> Signed-off-by: Li RongQing <lirongqing@baidu.com>
>>> ---
>>>  net/ipv4/devinet.c | 4 +++-
>>>  1 file changed, 3 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
>>> index 63d5b58fbfdb..fd0c5a47e742 100644
>>> --- a/net/ipv4/devinet.c
>>> +++ b/net/ipv4/devinet.c
>>> @@ -1775,8 +1775,10 @@ static int inet_dump_ifaddr(struct sk_buff *skb, struct netlink_callback *cb)
>>>  
>>>  		if (fillargs.ifindex) {
>>>  			dev = __dev_get_by_index(tgt_net, fillargs.ifindex);
>>> -			if (!dev)
>>> +			if (!dev) {
>>> +				put_net(tgt_net);
>>>  				return -ENODEV;
>>> +			}
>>>  
>>>  			in_dev = __in_dev_get_rtnl(dev);
>>>  			if (in_dev) {
>>>
>> 
>> Good catch. IPv6 has the same problem. Will fix that one.
>> 
> Actually remove that 'Reviewed-by'. You should only call put_net if
> (fillargs.netnsid >= 0)
>
> DaveM: just want to call this out since I mistakenly added the
> Reviewed-by. This patch should be dropped.

Hmm, I see that you implemented that.  But I believe it's still buggy if
called with an invalid netnsid.

inet_valid_dump_ifaddr_req() will bail out with an error, but only
*after* setting fillargs->netnsid:

                if (i == IFA_TARGET_NETNSID) {
                        struct net *net;

                        fillargs->netnsid = nla_get_s32(tb[i]);

                        net = rtnl_get_net_ns_capable(sk, fillargs->netnsid);
                        if (IS_ERR(net)) {
                                NL_SET_ERR_MSG(extack, "ipv4: Invalid target network namespace id");
                                return PTR_ERR(net);
                        }
                        *tgt_net = net;
                } else {



So inet_dump_ifaddr() ends up doing put_net(tgt_net):


                err = inet_valid_dump_ifaddr_req(nlh, &fillargs, &tgt_net,
                                                 skb->sk, cb);
                if (err < 0)
                        goto put_tgt_net;
..
put_tgt_net:
        if (fillargs.netnsid >= 0)
                put_net(tgt_net);



I believe you should set fillargs->netnsid back to -1 in the
inet_valid_dump_ifaddr_req() error path, or use a temp variable to avoid
changing it unless get_net is successful.



Bjørn

^ permalink raw reply

* Re: [net-next][PATCH] net/ipv4: fix a net leak
From: David Ahern @ 2018-10-25 18:46 UTC (permalink / raw)
  To: Bjørn Mork; +Cc: Li RongQing, netdev, David Miller
In-Reply-To: <87lg6l7tq2.fsf@miraculix.mork.no>

On 10/25/18 12:43 PM, Bjørn Mork wrote:
> 
> inet_valid_dump_ifaddr_req() will bail out with an error, but only
> *after* setting fillargs->netnsid:
> 
>                 if (i == IFA_TARGET_NETNSID) {
>                         struct net *net;
> 
>                         fillargs->netnsid = nla_get_s32(tb[i]);
> 
>                         net = rtnl_get_net_ns_capable(sk, fillargs->netnsid);
>                         if (IS_ERR(net)) {
>                                 NL_SET_ERR_MSG(extack, "ipv4: Invalid target network namespace id");
>                                 return PTR_ERR(net);
>                         }
>                         *tgt_net = net;
>                 } else {
> 
> 
> 
> So inet_dump_ifaddr() ends up doing put_net(tgt_net):
> 
> 
>                 err = inet_valid_dump_ifaddr_req(nlh, &fillargs, &tgt_net,
>                                                  skb->sk, cb);
>                 if (err < 0)
>                         goto put_tgt_net;
> ..
> put_tgt_net:
>         if (fillargs.netnsid >= 0)
>                 put_net(tgt_net);
> 
> 
> 
> I believe you should set fillargs->netnsid back to -1 in the
> inet_valid_dump_ifaddr_req() error path, or use a temp variable to avoid
> changing it unless get_net is successful.

good point. either use of an intermediate or resetting nsid on failure.
Will you send a patch to fix ipv4 and v6?

Thanks,

^ permalink raw reply

* for your images 15
From: Kate @ 2018-10-25 12:05 UTC (permalink / raw)
  To: netdev

We are an imaging team who can process 300+ images daily.

If you need any image editing service, please contact us today.

We do mainly images cut out and clipping path, masking.
Such as for your ecommerce photos, jewelry photos retouching, also it is
for beauty portraits and skin images
and wedding photos.

We provide test editing if you send some photos.

Thanks,
Kate

^ permalink raw reply

* for your images 17
From: Kate @ 2018-10-24 12:06 UTC (permalink / raw)
  To: netdev

We are an imaging team who can process 300+ images daily.

If you need any image editing service, please contact us today.

We do mainly images cut out and clipping path, masking.
Such as for your ecommerce photos, jewelry photos retouching, also it is
for beauty portraits and skin images
and wedding photos.

We provide test editing if you send some photos.

Thanks,
Kate

^ permalink raw reply

* for your images 12
From: Kate @ 2018-10-24 11:37 UTC (permalink / raw)
  To: netdev

We are an imaging team who can process 300+ images daily.

If you need any image editing service, please contact us today.

We do mainly images cut out and clipping path, masking.
Such as for your ecommerce photos, jewelry photos retouching, also it is
for beauty portraits and skin images
and wedding photos.

We provide test editing if you send some photos.

Thanks,
Kate

^ permalink raw reply

* [PATCH] net/{ipv4,ipv6}: Do not put target net if input nsid is invalid
From: Bjørn Mork @ 2018-10-25 19:18 UTC (permalink / raw)
  To: netdev; +Cc: Li RongQing, David Miller, Bjørn Mork, David Ahern

The cleanup path will put the target net when netnsid is set.  So we must
reset netnsid if the input is invalid.

Fixes: d7e38611b81e ("net/ipv4: Put target net when address dump fails due to bad attributes")
Fixes: 242afaa6968c ("net/ipv6: Put target net when address dump fails due to bad attributes")
Cc: David Ahern <dsahern@gmail.com>
Signed-off-by: Bjørn Mork <bjorn@mork.no>
---
 net/ipv4/devinet.c  | 1 +
 net/ipv6/addrconf.c | 1 +
 2 files changed, 2 insertions(+)

diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 9250b309c742..a34602ae27de 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -1704,6 +1704,7 @@ static int inet_valid_dump_ifaddr_req(const struct nlmsghdr *nlh,
 
 			net = rtnl_get_net_ns_capable(sk, fillargs->netnsid);
 			if (IS_ERR(net)) {
+				fillargs->netnsid = -1;
 				NL_SET_ERR_MSG(extack, "ipv4: Invalid target network namespace id");
 				return PTR_ERR(net);
 			}
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 7eb09c86fa13..63a808d5af15 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -5058,6 +5058,7 @@ static int inet6_valid_dump_ifaddr_req(const struct nlmsghdr *nlh,
 			fillargs->netnsid = nla_get_s32(tb[i]);
 			net = rtnl_get_net_ns_capable(sk, fillargs->netnsid);
 			if (IS_ERR(net)) {
+				fillargs->netnsid = -1;
 				NL_SET_ERR_MSG_MOD(extack, "Invalid target network namespace id");
 				return PTR_ERR(net);
 			}
-- 
2.11.0

^ permalink raw reply related

* Re: [PATCH v3 2/2] net: qcom/emac: add phy-handle support for ACPI
From: Andrew Lunn @ 2018-10-25 19:24 UTC (permalink / raw)
  To: Wang Dongsheng; +Cc: timur, yu.zheng, f.fainelli, netdev
In-Reply-To: <7935985e49270ad2948b2a52d26510bdf55572e6.1540459999.git.dongsheng.wang@hxt-semitech.com>

On Thu, Oct 25, 2018 at 06:09:15PM +0800, Wang Dongsheng wrote:
> Use "phy-handle" to porint an internal MDIO device port.

Hi Dongsheng

You are basically defining how all future ACPI based MAC drivers get
access to their PHY. This needs to become part of the ACPI standard,
etc.

This code should not be hidden away in the emac driver. It needs to be
placed somewhere public so other drivers can use it. And it needs good
documentation, including an example of what needs to go into the ACPI
tables, etc.

Thanks
	Andrew

^ permalink raw reply

* Re: [PATCH] net/{ipv4,ipv6}: Do not put target net if input nsid is invalid
From: David Ahern @ 2018-10-25 20:16 UTC (permalink / raw)
  To: Bjørn Mork, netdev; +Cc: Li RongQing, David Miller
In-Reply-To: <20181025191825.23936-1-bjorn@mork.no>

On 10/25/18 1:18 PM, Bjørn Mork wrote:
> The cleanup path will put the target net when netnsid is set.  So we must
> reset netnsid if the input is invalid.
> 
> Fixes: d7e38611b81e ("net/ipv4: Put target net when address dump fails due to bad attributes")
> Fixes: 242afaa6968c ("net/ipv6: Put target net when address dump fails due to bad attributes")
> Cc: David Ahern <dsahern@gmail.com>
> Signed-off-by: Bjørn Mork <bjorn@mork.no>
> ---
>  net/ipv4/devinet.c  | 1 +
>  net/ipv6/addrconf.c | 1 +
>  2 files changed, 2 insertions(+)
> 

Reviewed-by: David Ahern <dsahern@gmail.com>

^ permalink raw reply

* [4.18-stable 1/1] netfilter: use kvmalloc_array to allocate memory for hashtable
From: Mark Asselstine @ 2018-10-25 20:18 UTC (permalink / raw)
  To: netdev, David Miller

David,

Please promote mainline commit 285189c78eeb6f684a024b86fb5997d10c6aa564 
[netfilter: use kvmalloc_array to allocate memory for hashtable] to 
linux-4.18.y stable. As it happens this not only fixes the issue described in 
the commit log, it also solves the issue of kmemleak reporting false positives 
of 'struct nf_conn' objects.

unreferenced object 0xffff9af78fa6de00 (size 256): 
  comm "rdate", pid 4215, jiffies 4299506036 (age 115.149s) 
  hex dump (first 32 bytes): 
    01 00 00 00 00 00 00 00 0a 00 96 98 f7 9a ff ff ................ 
    45 e6 00 00 00 00 00 00 10 99 a3 94 f7 9a ff ff E............... 
  backtrace: 
    [<0000000006b47d03>] kmem_cache_alloc+0x146/0x200 
    [<00000000dbb53245>] __nf_conntrack_alloc.isra.13+0x4d/0x170[nf_conntrack] 
    [<000000008c1c1285>] init_conntrack+0x6a/0x2f0 [nf_conntrack] 
    [<00000000a6dd3a04>] nf_conntrack_in+0x2c5/0x360 [nf_conntrack] 
    [<0000000000213d80>] ipv4_conntrack_local+0x5d/0x70 [nf_conntrack_ipv4] 
    [<00000000d98fc633>] nf_hook_slow+0x48/0xd0 
    [<00000000fea9b61e>] __ip_local_out+0xbd/0xf0 
    [<00000000e1418ed2>] ip_local_out+0x1c/0x50 
    [<0000000071f63135>] ip_queue_xmit+0x15f/0x3d0 
    [<000000008fb87cfd>] __tcp_transmit_skb+0x5bf/0xab0 
    [<0000000073c7808d>] tcp_connect+0x648/0x830 
    [<000000000e12e101>] tcp_v4_connect+0x458/0x4d0 
    [<000000003223764c>] __inet_stream_connect+0xe2/0x380 
    [<000000005c32d180>] inet_stream_connect+0x3b/0x60 
    [<00000000465bcd15>] __sys_connect+0xce/0x100 
    [<0000000055a63178>] __x64_sys_connect+0x1a/0x20 

The main object pointer to these struct nf_conn objects is 'salted' with 
ip_conntrack_info in sk_buff._nfct, and as such is not a viable pointer to 
this object by the kmemleak logic.

The only other consistent reference to these objects or contents is found in 
the hash table. But it appears that kmemleak does not scan the 
nf_conntrack_hash which is initialized in nf_ct_alloc_hashtable() via 
__get_free_pages(). This results in the objects appearing as "leaks".

I could solve this by keeping the original code and adding a call to 
kmemleak_alloc() in nf_ct_alloc_hashtable() and similarly a call to 
kmemleak_free() in nf_ct_free_hashtable(). But since this mainline commit 
exists which happens to also sort out this issue we are most likely best to do 
the backport and kill two birds with one stone.

He Zhe previously sent out a patch to this list "[RFC] [PATCH] netfilter: Fix 
kmemleak false positive reports". With the additional analysis summarized here 
that patch should not be considered for merging.

Thanks,
Mark Asselstine

^ permalink raw reply

* Re: [PATCH 1/2] Bluetooth: Add quirk for reading BD_ADDR from fwnode property
From: Balakrishna Godavarthi @ 2018-10-26  5:01 UTC (permalink / raw)
  To: Matthias Kaehlcke
  Cc: Marcel Holtmann, Johan Hedberg, David S . Miller, Loic Poulain,
	linux-bluetooth, netdev, linux-kernel, Brian Norris,
	Dmitry Grinberg, hemantg
In-Reply-To: <4041ef05cdd70d28d665d3288c4d4c43@codeaurora.org>

Hi Matthias,

I missed to add a point here.

On 2018-10-25 20:06, Balakrishna Godavarthi wrote:
> On 2018-10-25 05:51, Matthias Kaehlcke wrote:
>> Add HCI_QUIRK_USE_BDADDR_PROPERTY to allow controllers to retrieve
>> the public Bluetooth address from the firmware node property
>> 'local-bd-address'. If quirk is set and the property does not exist
>> or is invalid the controller is marked as unconfigured.
>> 
>> Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
>> ---
>> hci_dev_get_bd_addr_from_property() currently assumes that the
>> firmware node with 'local-bd-address' is from hdev->dev.parent, not
>> sure if this universally true. However if it is true for existing
>> device that might use this interface we can assume this for now
>> (unless there is a clear solution now), and cross the bridge of
>> finding an alternative when we actually encounter the situation.
>> One option could be to look for the first parent that has a fwnode.
>> ---
>>  include/net/bluetooth/hci.h | 12 +++++++++++
>>  net/bluetooth/hci_core.c    | 42 
>> +++++++++++++++++++++++++++++++++++++
>>  net/bluetooth/mgmt.c        |  6 ++++--
>>  3 files changed, 58 insertions(+), 2 deletions(-)
>> 
>> diff --git a/include/net/bluetooth/hci.h b/include/net/bluetooth/hci.h
>> index cdd9f1fe7cfa..a5d748099752 100644
>> --- a/include/net/bluetooth/hci.h
>> +++ b/include/net/bluetooth/hci.h
>> @@ -158,6 +158,18 @@ enum {
>>  	 */
>>  	HCI_QUIRK_INVALID_BDADDR,
>> 
>> +	/* When this quirk is set, the public Bluetooth address
>> +	 * initially reported by HCI Read BD Address command
>> +	 * is considered invalid. The public BD Address can be
>> +	 * specified in the fwnode property 'local-bd-address'.
>> +	 * If this property does not exist or is invalid controller
>> +	 * configuration is required before this device can be used.
>> +	 *
>> +	 * This quirk can be set before hci_register_dev is called or
>> +	 * during the hdev->setup vendor callback.
>> +	 */
>> +	HCI_QUIRK_USE_BDADDR_PROPERTY,
>> +
>>  	/* When this quirk is set, the duplicate filtering during
>>  	 * scanning is based on Bluetooth devices addresses. To allow
>>  	 * RSSI based updates, restart scanning if needed.
>> diff --git a/net/bluetooth/hci_core.c b/net/bluetooth/hci_core.c
>> index 74b29c7d841c..97214262c4fb 100644
>> --- a/net/bluetooth/hci_core.c
>> +++ b/net/bluetooth/hci_core.c
>> @@ -30,6 +30,7 @@
>>  #include <linux/rfkill.h>
>>  #include <linux/debugfs.h>
>>  #include <linux/crypto.h>
>> +#include <linux/property.h>
>>  #include <asm/unaligned.h>
>> 
>>  #include <net/bluetooth/bluetooth.h>
>> @@ -1355,9 +1356,40 @@ int hci_inquiry(void __user *arg)
>>  	return err;
>>  }
>> 
>> +/**
>> + * hci_dev_get_bd_addr_from_property - Get the Bluetooth Device 
>> Address
>> + *				       (BD_ADDR) for a HCI device from
>> + *				       a firmware node property.
>> + * @hdev:	The HCI device
>> + *
>> + * Search the firmware node for 'local-bd-address'.
>> + *
>> + * All-zero BD addresses are rejected, because those could be 
>> properties
>> + * that exist in the firmware tables, but were not updated by the 
>> firmware. For
>> + * example, the DTS could define 'local-bd-address', with zero BD 
>> addresses.
>> + */
>> +static int hci_dev_get_bd_addr_from_property(struct hci_dev *hdev)
>> +{
>> +	struct fwnode_handle *fwnode = dev_fwnode(hdev->dev.parent);
>> +	bdaddr_t ba;
>> +	int ret;
>> +
>> +	ret = fwnode_property_read_u8_array(fwnode, "local-bd-address",
>> +					    (u8 *)&ba, sizeof(ba));
>> +	if (ret < 0)
>> +		return ret;
>> +	if (!bacmp(&ba, BDADDR_ANY))
>> +		return -ENODATA;
>> +
>> +	hdev->public_addr = ba;
>> +
>> +	return 0;
>> +}
>> +
>>  static int hci_dev_do_open(struct hci_dev *hdev)
>>  {
>>  	int ret = 0;
>> +	bool bd_addr_set = false;
>> 
>>  	BT_DBG("%s %p", hdev->name, hdev);
>> 
>> @@ -1422,6 +1454,16 @@ static int hci_dev_do_open(struct hci_dev 
>> *hdev)
>>  		if (hdev->setup)
>>  			ret = hdev->setup(hdev);
>> 
>> +		if (test_bit(HCI_QUIRK_USE_BDADDR_PROPERTY, &hdev->quirks)) {
>> +			if (!hci_dev_get_bd_addr_from_property(hdev))
>> +				if (hdev->set_bdaddr &&
>> +				    !hdev->set_bdaddr(hdev, &hdev->public_addr))
>> +					bd_addr_set = true;

Can we check the return status of hdev->setup() before calling 
hdev->set_bdaddr().
some vendors assign hdev->set_baddr helper before calling hdev->setup().
https://elixir.bootlin.com/linux/v4.19-rc8/source/drivers/bluetooth/btqcomsmd.c#L194
There will no use in calling hdev->set_baddr() if hdev->setup() fails.

>> +
>> +			if (!bd_addr_set)
>> +				hci_dev_set_flag(hdev, HCI_UNCONFIGURED);
>> +		}
>> +
>>  		/* The transport driver can set these quirks before
>>  		 * creating the HCI device or in its setup callback.
>>  		 *
>> diff --git a/net/bluetooth/mgmt.c b/net/bluetooth/mgmt.c
>> index 3bdc8f3ca259..3d9edb752403 100644
>> --- a/net/bluetooth/mgmt.c
>> +++ b/net/bluetooth/mgmt.c
>> @@ -551,7 +551,8 @@ static bool is_configured(struct hci_dev *hdev)
>>  	    !hci_dev_test_flag(hdev, HCI_EXT_CONFIGURED))
>>  		return false;
>> 
>> -	if (test_bit(HCI_QUIRK_INVALID_BDADDR, &hdev->quirks) &&
>> +	if ((test_bit(HCI_QUIRK_INVALID_BDADDR, &hdev->quirks) ||
>> +	     test_bit(HCI_QUIRK_USE_BDADDR_PROPERTY, &hdev->quirks)) &&
>>  	    !bacmp(&hdev->public_addr, BDADDR_ANY))
>>  		return false;
>> 
>> @@ -566,7 +567,8 @@ static __le32 get_missing_options(struct hci_dev 
>> *hdev)
>>  	    !hci_dev_test_flag(hdev, HCI_EXT_CONFIGURED))
>>  		options |= MGMT_OPTION_EXTERNAL_CONFIG;
>> 
>> -	if (test_bit(HCI_QUIRK_INVALID_BDADDR, &hdev->quirks) &&
>> +	if ((test_bit(HCI_QUIRK_INVALID_BDADDR, &hdev->quirks) ||
>> +	     test_bit(HCI_QUIRK_USE_BDADDR_PROPERTY, &hdev->quirks)) &&
>>  	    !bacmp(&hdev->public_addr, BDADDR_ANY))
>>  		options |= MGMT_OPTION_PUBLIC_ADDRESS;
> 
> Looks fine to me.
> 
> Reviewed-by: Balakrishna Godavarthi <bgodavar@codeaurora.org>

-- 
Regards
Balakrishna.

^ permalink raw reply

* [RFC] net: stmmac: RX Jumbo packet size > 8191 problem
From: Thor Thayer @ 2018-10-25 20:41 UTC (permalink / raw)
  To: Giuseppe CAVALLARO, alexandre.torgue, joabreu, netdev

Hi,

I'm running into a weird issue at the DMA boundary for large packets 
(>8192) that I can't explain.  I'm hoping someone here has an idea on 
why I'm seeing this issue.

This is the Synopsys DesignWare Ethernet GMAC core (3.74) using the 
stmmac driver found at drivers/net/ethernet/stmicro/stmmac.

If I ping with data sizes that exceed the first DMA buffer size (size 
set to 8191), ping reports a data mismatch as follows at byte #8144:

$ ping -c 1 -M do -s 8150 192.168.1.99
PING 192.168.1.99 (192.168.1.99) 8150(8178) bytes of data.
8158 bytes from 192.168.1.99: icmp_seq=1 ttl=64 time=0.669 ms
wrong data byte #8144 should be 0xd0 but was 0x0
#16	10 11 12 13 14 15 16 17 18 19 1a 1b 1c 1d 1e 1f 20 21 22 23 24 25 26 
27 28 29 2a 2b 2c 2d 2e 2f
%< ---------------snip--------------------------------------
#8112	b0 b1 b2 b3 b4 b5 b6 b7 b8 b9 ba bb bc bd be bf c0 c1 c2 c3 c4 c5 
c6 c7 c8 c9 ca cb cc cd ce cf
#8144	0 0 0 0 d0 d1
         ^^^^^^^
Notice the 4 bytes of 0 there before the expected byte of d0. I 
confirmed the on-wire result with wireshark - same data packet as shown 
above.

Looking at the queue, I'm seeing these values in the RX descriptors (I'm 
using ring mode, enhanced descriptors).
0xa0040320 0x9fff1fff 0x7a358042 0x7a35a042
  ^des0      ^des1      ^des2      ^desc3

desc0 => 8196 bytes, OWN, First & Last Descriptor, Frame type = Eth
desc1 => Disable IRQ on done, Rx Buffer2 sz = 8191, Rx Buffer1 sz = 8191
desc2 => Buffer 1 Addr Pointer
desc3 => Buffer 2 Addr Pointer

If I adjust init_desc3() and refill_desc3() to initialize desc3 to 
desc2+BUF_SIZE_8KiB-4, I get a descriptor as show below and ping 
completes successfully.
0xa0040320 0x9fff1fff 0x77df8042 0x77dfa03e
                                   ^ this is now different

But I'm not sure why the -4 works because desc3 overlaps into the end of 
the first DMA buffer area (des2) which is counterintuitive.

At first I thought the 4 extra bytes were the FCS but that should occur 
at the end of the complete transfer, so I'd expect it to be at the end 
of all the data (in buffer2)

Here is the change that works. I ran a ping sweep with packet sizes from 
8100 to 8300 successfully with this change.
-------------------------------------------------------
$ git diff
diff --git a/drivers/net/ethernet/stmicro/stmmac/ring_mode.c 
b/drivers/net/ethernet/stmicro/stmmac/ring_mode.c
index abc3f85270cd..b52be0235d8f 100644
--- a/drivers/net/ethernet/stmicro/stmmac/ring_mode.c
+++ b/drivers/net/ethernet/stmicro/stmmac/ring_mode.c
@@ -115,13 +115,13 @@ static void refill_desc3(void *priv_ptr, struct 
dma_desc *p)

         /* Fill DES3 in case of RING mode */
         if (priv->dma_buf_sz >= BUF_SIZE_8KiB)
-               p->des3 = cpu_to_le32(le32_to_cpu(p->des2) + BUF_SIZE_8KiB);
+               p->des3 = cpu_to_le32(le32_to_cpu(p->des2) +
+                                     BUF_SIZE_8KiB - 4);

  }

  /* In ring mode we need to fill the desc3 because it is used as buffer */
  static void init_desc3(struct dma_desc *p)
  {
-       p->des3 = cpu_to_le32(le32_to_cpu(p->des2) + BUF_SIZE_8KiB);
+       p->des3 = cpu_to_le32(le32_to_cpu(p->des2) + BUF_SIZE_8KiB - 4);
  }

  static void clean_desc3(void *priv_ptr, struct dma_desc *p)
-----------------------------------------------------------

Any thoughts on why I need to change the indexing?

Thanks,

Thor

^ permalink raw reply related

* Re: [PATCH ghak90 (was ghak32) V4 03/10] audit: log container info of syscalls
From: Paul Moore @ 2018-10-25 20:40 UTC (permalink / raw)
  To: rgb
  Cc: sgrubb, simo, carlos, linux-api, containers, linux-kernel,
	dhowells, linux-audit, netfilter-devel, ebiederm, luto, netdev,
	linux-fsdevel, Eric Paris, Serge Hallyn, viro
In-Reply-To: <20181025173830.4yklhnrydt5qvr67@madcap2.tricolour.ca>

On Thu, Oct 25, 2018 at 1:38 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2018-10-25 17:57, Steve Grubb wrote:
> > On Thu, 25 Oct 2018 08:27:32 -0400
> > Richard Guy Briggs <rgb@redhat.com> wrote:
> >
> > > On 2018-10-25 06:49, Paul Moore wrote:
> > > > On Thu, Oct 25, 2018 at 2:06 AM Steve Grubb <sgrubb@redhat.com>
> > > > wrote:
> > > > > On Wed, 24 Oct 2018 20:42:55 -0400
> > > > > Richard Guy Briggs <rgb@redhat.com> wrote:
> > > > > > On 2018-10-24 16:55, Paul Moore wrote:
> > > > > > > On Wed, Oct 24, 2018 at 11:15 AM Richard Guy Briggs
> > > > > > > <rgb@redhat.com> wrote:
> > > > > > > > On 2018-10-19 19:16, Paul Moore wrote:
> > > > > > > > > On Sun, Aug 5, 2018 at 4:32 AM Richard Guy Briggs
> > > > > > > > > <rgb@redhat.com> wrote:
> > > >
> > > > ...
> > > >
> > > > > > > > > > +/*
> > > > > > > > > > + * audit_log_contid - report container info
> > > > > > > > > > + * @tsk: task to be recorded
> > > > > > > > > > + * @context: task or local context for record
> > > > > > > > > > + * @op: contid string description
> > > > > > > > > > + */
> > > > > > > > > > +int audit_log_contid(struct task_struct *tsk,
> > > > > > > > > > +                            struct audit_context
> > > > > > > > > > *context, char *op) +{
> > > > > > > > > > +       struct audit_buffer *ab;
> > > > > > > > > > +
> > > > > > > > > > +       if (!audit_contid_set(tsk))
> > > > > > > > > > +               return 0;
> > > > > > > > > > +       /* Generate AUDIT_CONTAINER record with
> > > > > > > > > > container ID */
> > > > > > > > > > +       ab = audit_log_start(context, GFP_KERNEL,
> > > > > > > > > > AUDIT_CONTAINER);
> > > > > > > > > > +       if (!ab)
> > > > > > > > > > +               return -ENOMEM;
> > > > > > > > > > +       audit_log_format(ab, "op=%s contid=%llu",
> > > > > > > > > > +                        op, audit_get_contid(tsk));
> > > > > > > > > > +       audit_log_end(ab);
> > > > > > > > > > +       return 0;
> > > > > > > > > > +}
> > > > > > > > > > +EXPORT_SYMBOL(audit_log_contid);
> > > > > > > > >
> > > > > > > > > As discussed in the previous iteration of the patch, I
> > > > > > > > > prefer AUDIT_CONTAINER_ID here over AUDIT_CONTAINER.  If
> > > > > > > > > you feel strongly about keeping it as-is with
> > > > > > > > > AUDIT_CONTAINER I suppose I could live with that, but it
> > > > > > > > > is isn't my first choice.
> > > > > > > >
> > > > > > > > I don't have a strong opinion on this one, mildly
> > > > > > > > preferring the shorter one only because it is shorter.
> > > > > > >
> > > > > > > We already have multiple AUDIT_CONTAINER* record types, so it
> > > > > > > seems as though we should use "AUDIT_CONTAINER" as a prefix
> > > > > > > of sorts, rather than a type itself.
> > > > > >
> > > > > > I'm fine with that.  I'd still like to hear Steve's input.  He
> > > > > > had stronger opinions than me.
> > > > >
> > > > > The creation event should be separate and distinct from the
> > > > > continuing use when its used as a supplemental record. IOW,
> > > > > binding the ID to a container is part of the lifecycle and needs
> > > > > to be kept distinct.
> > > >
> > > > Steve's comment is pretty ambiguous when it comes to AUDIT_CONTAINER
> > > > vs AUDIT_CONTAINER_ID, but one could argue that AUDIT_CONTAINER_ID
> > > > helps distinguish the audit container id marking record and gets to
> > > > what I believe is the spirit of Steve's comment.  Taking this in
> > > > context with my previous remarks, let's switch to using
> > > > AUDIT_CONTAINER_ID.
> > >
> > > I suspect Steve is mixing up AUDIT_CONTAINER_OP with
> > > AUDIT_CONTAINER_ID, confusing the fact that they are two seperate
> > > records.  As a summary, the suggested records are:
> > >     CONTAINER_OP    audit container identifier creation
> > >     CONTAINER       audit container identifier aux record to an
> > > event
> > >
> > > and what Paul is suggesting (which is fine by me) is:
> > >     CONTAINER_OP    audit container identifier creation event
> > >     CONTAINER_ID    audit container identifier aux record to
> > > an event
> > >
> > > Steve, please indicate you are fine with this.
> >
> > I thought it was:
>
> It *was*.  It was changed at Paul's request in this v3 thread:
>         https://www.redhat.com/archives/linux-audit/2018-July/msg00087.html
>
> And listed in the examples and changelog to this v4 patchset:
>         https://www.redhat.com/archives/linux-audit/2018-July/msg00178.html
>
> It is also listed in this userspace patchset update v4 (which should
> also have had a changelog added to it, note to self...):
>         https://www.redhat.com/archives/linux-audit/2018-July/msg00189.html
>
> I realize it is hard to keep up with all the detail changes in these
> patchsets...
>
> > CONTAINER_ID audit container identifier creation event
> > CONTAINER audit container identifier aux record to an event
> >
> > Or vice versa. Don't mix up creation of the identifier with operations.
>
> Exactly what I'm trying to avoid...  Worded another way: "Don't mix up
> the creation operation with routine reporting of the identifier in
> events."  Steve, can you and Paul discuss and agree on what they should
> be called?  I don't have a horse in this race, but I need to record the
> result of that run.  ;-)

See my previous comments, I think I've been pretty clear on what I
would like to see.

-- 
paul moore
www.paul-moore.com

^ permalink raw reply

* [PATCH net-next] net/ipv6: Block IPv6 addrconf on team ports
From: Chas Williams @ 2018-10-25 21:02 UTC (permalink / raw)
  To: davem
  Cc: netdev, j.vosburgh, vfalico, andy, jiri, kuznet, yoshfuji,
	Chas Williams

netif_is_lag_port should be used to identify link aggregation ports.
For this to work, we need to reorganize the bonding and team drivers
so that the necessary flags are set before dev_open is called.

commit 31e77c93e432 ("sched/fair: Update blocked load when newly idle")
made this decision originally based on the IFF_SLAVE flag which isn't
used by the team driver.  Note, we do need to retain the IFF_SLAVE
check for the eql driver.

Signed-off-by: Chas Williams <3chas3@gmail.com>
---
 drivers/net/bonding/bond_main.c | 4 ++--
 drivers/net/team/team.c         | 7 +++++--
 net/ipv6/addrconf.c             | 2 +-
 3 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index ffa37adb7681..5cdad164332b 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1536,6 +1536,7 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev,
 
 	/* set slave flag before open to prevent IPv6 addrconf */
 	slave_dev->flags |= IFF_SLAVE;
+	slave_dev->priv_flags |= IFF_BONDING;
 
 	/* open the slave since the application closed it */
 	res = dev_open(slave_dev);
@@ -1544,7 +1545,6 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev,
 		goto err_restore_mac;
 	}
 
-	slave_dev->priv_flags |= IFF_BONDING;
 	/* initialize slave stats */
 	dev_get_stats(new_slave->dev, &new_slave->slave_stats);
 
@@ -1804,10 +1804,10 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev,
 	slave_disable_netpoll(new_slave);
 
 err_close:
-	slave_dev->priv_flags &= ~IFF_BONDING;
 	dev_close(slave_dev);
 
 err_restore_mac:
+	slave_dev->priv_flags &= ~IFF_BONDING;
 	slave_dev->flags &= ~IFF_SLAVE;
 	if (!bond->params.fail_over_mac ||
 	    BOND_MODE(bond) != BOND_MODE_ACTIVEBACKUP) {
diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
index db633ae9f784..8fc7d57e9f6d 100644
--- a/drivers/net/team/team.c
+++ b/drivers/net/team/team.c
@@ -1128,14 +1128,12 @@ static int team_upper_dev_link(struct team *team, struct team_port *port,
 					   &lag_upper_info, extack);
 	if (err)
 		return err;
-	port->dev->priv_flags |= IFF_TEAM_PORT;
 	return 0;
 }
 
 static void team_upper_dev_unlink(struct team *team, struct team_port *port)
 {
 	netdev_upper_dev_unlink(port->dev, team->dev);
-	port->dev->priv_flags &= ~IFF_TEAM_PORT;
 }
 
 static void __team_port_change_port_added(struct team_port *port, bool linkup);
@@ -1214,6 +1212,9 @@ static int team_port_add(struct team *team, struct net_device *port_dev,
 		goto err_port_enter;
 	}
 
+	/* set slave flag before open to prevent IPv6 addrconf */
+	port->dev->priv_flags |= IFF_TEAM_PORT;
+
 	err = dev_open(port_dev);
 	if (err) {
 		netdev_dbg(dev, "Device %s opening failed\n",
@@ -1292,6 +1293,7 @@ static int team_port_add(struct team *team, struct net_device *port_dev,
 	dev_close(port_dev);
 
 err_dev_open:
+	port->dev->priv_flags &= ~IFF_TEAM_PORT;
 	team_port_leave(team, port);
 	team_port_set_orig_dev_addr(port);
 
@@ -1328,6 +1330,7 @@ static int team_port_del(struct team *team, struct net_device *port_dev)
 	dev_uc_unsync(port_dev, dev);
 	dev_mc_unsync(port_dev, dev);
 	dev_close(port_dev);
+	port->dev->priv_flags &= ~IFF_TEAM_PORT;
 	team_port_leave(team, port);
 
 	__team_option_inst_mark_removed_port(team, port);
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 45b84dd5c4eb..121f863022ed 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -3482,7 +3482,7 @@ static int addrconf_notify(struct notifier_block *this, unsigned long event,
 
 	case NETDEV_UP:
 	case NETDEV_CHANGE:
-		if (dev->flags & IFF_SLAVE)
+		if (netif_is_lag_port(dev) || dev->flags & IFF_SLAVE)
 			break;
 
 		if (idev && idev->cnf.disable_ipv6)
-- 
2.14.4

^ permalink raw reply related

* Re: [PATCH net-next] net/ipv6: Block IPv6 addrconf on team ports
From: Jiri Pirko @ 2018-10-25 21:10 UTC (permalink / raw)
  To: Chas Williams; +Cc: davem, netdev, j.vosburgh, vfalico, andy, kuznet, yoshfuji
In-Reply-To: <20181025210227.25544-1-3chas3@gmail.com>

Thu, Oct 25, 2018 at 11:02:27PM CEST, 3chas3@gmail.com wrote:
>netif_is_lag_port should be used to identify link aggregation ports.
>For this to work, we need to reorganize the bonding and team drivers
>so that the necessary flags are set before dev_open is called.
>
>commit 31e77c93e432 ("sched/fair: Update blocked load when newly idle")
>made this decision originally based on the IFF_SLAVE flag which isn't
>used by the team driver.  Note, we do need to retain the IFF_SLAVE
>check for the eql driver.
>
>Signed-off-by: Chas Williams <3chas3@gmail.com>
>---
> drivers/net/bonding/bond_main.c | 4 ++--
> drivers/net/team/team.c         | 7 +++++--
> net/ipv6/addrconf.c             | 2 +-

Subject talks about "team" yet you modify bond and team. Confusing..

^ permalink raw reply

* Re: [PATCH ghak90 (was ghak32) V4 03/10] audit: log container info of syscalls
From: Steve Grubb @ 2018-10-25 21:55 UTC (permalink / raw)
  To: Paul Moore
  Cc: rgb, simo, carlos, linux-api, containers, linux-kernel, dhowells,
	linux-audit, netfilter-devel, ebiederm, luto, netdev,
	linux-fsdevel, Eric Paris, Serge Hallyn, viro
In-Reply-To: <CAHC9VhScaG8aOFYRV5hPXEKob1QRth5YFEbHNY=ZgKKAKxBBsQ@mail.gmail.com>

On Thu, 25 Oct 2018 16:40:19 -0400
Paul Moore <paul@paul-moore.com> wrote:

> On Thu, Oct 25, 2018 at 1:38 PM Richard Guy Briggs <rgb@redhat.com>
> wrote:
> > On 2018-10-25 17:57, Steve Grubb wrote:  
> > > On Thu, 25 Oct 2018 08:27:32 -0400
> > > Richard Guy Briggs <rgb@redhat.com> wrote:
> > >  
> > > > On 2018-10-25 06:49, Paul Moore wrote:  
> > > > > On Thu, Oct 25, 2018 at 2:06 AM Steve Grubb
> > > > > <sgrubb@redhat.com> wrote:  
> > > > > > On Wed, 24 Oct 2018 20:42:55 -0400
> > > > > > Richard Guy Briggs <rgb@redhat.com> wrote:  
> > > > > > > On 2018-10-24 16:55, Paul Moore wrote:  
> > > > > > > > On Wed, Oct 24, 2018 at 11:15 AM Richard Guy Briggs
> > > > > > > > <rgb@redhat.com> wrote:  
> > > > > > > > > On 2018-10-19 19:16, Paul Moore wrote:  
> > > > > > > > > > On Sun, Aug 5, 2018 at 4:32 AM Richard Guy Briggs
> > > > > > > > > > <rgb@redhat.com> wrote:  
> > > > >
> > > > > ...
> > > > >  
> > > > > > > > > > > +/*
> > > > > > > > > > > + * audit_log_contid - report container info
> > > > > > > > > > > + * @tsk: task to be recorded
> > > > > > > > > > > + * @context: task or local context for record
> > > > > > > > > > > + * @op: contid string description
> > > > > > > > > > > + */
> > > > > > > > > > > +int audit_log_contid(struct task_struct *tsk,
> > > > > > > > > > > +                            struct audit_context
> > > > > > > > > > > *context, char *op) +{
> > > > > > > > > > > +       struct audit_buffer *ab;
> > > > > > > > > > > +
> > > > > > > > > > > +       if (!audit_contid_set(tsk))
> > > > > > > > > > > +               return 0;
> > > > > > > > > > > +       /* Generate AUDIT_CONTAINER record with
> > > > > > > > > > > container ID */
> > > > > > > > > > > +       ab = audit_log_start(context, GFP_KERNEL,
> > > > > > > > > > > AUDIT_CONTAINER);
> > > > > > > > > > > +       if (!ab)
> > > > > > > > > > > +               return -ENOMEM;
> > > > > > > > > > > +       audit_log_format(ab, "op=%s contid=%llu",
> > > > > > > > > > > +                        op,
> > > > > > > > > > > audit_get_contid(tsk));
> > > > > > > > > > > +       audit_log_end(ab);
> > > > > > > > > > > +       return 0;
> > > > > > > > > > > +}
> > > > > > > > > > > +EXPORT_SYMBOL(audit_log_contid);  
> > > > > > > > > >
> > > > > > > > > > As discussed in the previous iteration of the
> > > > > > > > > > patch, I prefer AUDIT_CONTAINER_ID here over
> > > > > > > > > > AUDIT_CONTAINER.  If you feel strongly about
> > > > > > > > > > keeping it as-is with AUDIT_CONTAINER I suppose I
> > > > > > > > > > could live with that, but it is isn't my first
> > > > > > > > > > choice.  
> > > > > > > > >
> > > > > > > > > I don't have a strong opinion on this one, mildly
> > > > > > > > > preferring the shorter one only because it is
> > > > > > > > > shorter.  
> > > > > > > >
> > > > > > > > We already have multiple AUDIT_CONTAINER* record types,
> > > > > > > > so it seems as though we should use "AUDIT_CONTAINER"
> > > > > > > > as a prefix of sorts, rather than a type itself.  
> > > > > > >
> > > > > > > I'm fine with that.  I'd still like to hear Steve's
> > > > > > > input.  He had stronger opinions than me.  
> > > > > >
> > > > > > The creation event should be separate and distinct from the
> > > > > > continuing use when its used as a supplemental record. IOW,
> > > > > > binding the ID to a container is part of the lifecycle and
> > > > > > needs to be kept distinct.  
> > > > >
> > > > > Steve's comment is pretty ambiguous when it comes to
> > > > > AUDIT_CONTAINER vs AUDIT_CONTAINER_ID, but one could argue
> > > > > that AUDIT_CONTAINER_ID helps distinguish the audit container
> > > > > id marking record and gets to what I believe is the spirit of
> > > > > Steve's comment.  Taking this in context with my previous
> > > > > remarks, let's switch to using AUDIT_CONTAINER_ID.  
> > > >
> > > > I suspect Steve is mixing up AUDIT_CONTAINER_OP with
> > > > AUDIT_CONTAINER_ID, confusing the fact that they are two
> > > > seperate records.  As a summary, the suggested records are:
> > > >     CONTAINER_OP    audit container identifier creation
> > > >     CONTAINER       audit container identifier aux record to an
> > > > event
> > > >
> > > > and what Paul is suggesting (which is fine by me) is:
> > > >     CONTAINER_OP    audit container identifier creation event
> > > >     CONTAINER_ID    audit container identifier aux record to
> > > > an event
> > > >
> > > > Steve, please indicate you are fine with this.  
> > >
> > > I thought it was:  
> >
> > It *was*.  It was changed at Paul's request in this v3 thread:
> >         https://www.redhat.com/archives/linux-audit/2018-July/msg00087.html
> >
> > And listed in the examples and changelog to this v4 patchset:
> >         https://www.redhat.com/archives/linux-audit/2018-July/msg00178.html
> >
> > It is also listed in this userspace patchset update v4 (which should
> > also have had a changelog added to it, note to self...):
> >         https://www.redhat.com/archives/linux-audit/2018-July/msg00189.html
> >
> > I realize it is hard to keep up with all the detail changes in these
> > patchsets...
> >  
> > > CONTAINER_ID audit container identifier creation event
> > > event. CONTAINER audit container identifier aux record to an
> > > event
> > >
> > > Or vice versa. Don't mix up creation of the identifier with
> > > operations.  
> >
> > Exactly what I'm trying to avoid...  Worded another way: "Don't mix
> > up the creation operation with routine reporting of the identifier
> > in events."  Steve, can you and Paul discuss and agree on what they
> > should be called?  I don't have a horse in this race, but I need to
> > record the result of that run.  ;-)  
> 
> See my previous comments, I think I've been pretty clear on what I
> would like to see.

And historically speaking setting audit loginuid produces a LOGIN
event, so it only makes sense to consider binding container ID to
container as a CONTAINER event. For other supplemental records, we name
things what they are: PATH, CWD, SOCKADDR, etc. So, CONTAINER_ID makes
sense. CONTAINER_OP sounds like its for operations on a container. Do
we have any operations on a container?

-Steve

^ permalink raw reply

* Re: [PATCH net-next] net/ipv6: Block IPv6 addrconf on team ports
From: Jay Vosburgh @ 2018-10-25 21:59 UTC (permalink / raw)
  To: Chas Williams; +Cc: davem, netdev, vfalico, andy, jiri, kuznet, yoshfuji
In-Reply-To: <20181025210227.25544-1-3chas3@gmail.com>

Chas Williams <3chas3@gmail.com> wrote:

>netif_is_lag_port should be used to identify link aggregation ports.
>For this to work, we need to reorganize the bonding and team drivers
>so that the necessary flags are set before dev_open is called.
>
>commit 31e77c93e432 ("sched/fair: Update blocked load when newly idle")
>made this decision originally based on the IFF_SLAVE flag which isn't
>used by the team driver.  Note, we do need to retain the IFF_SLAVE
>check for the eql driver.

	Is 31e77c93e432 the correct commit reference?  I don't see
anything in there about IFF_SLAVE or bonding; it's a patch to the
process scheduler.

	And, as Jiri said, the subject doesn't mention bonding.

>Signed-off-by: Chas Williams <3chas3@gmail.com>
>---
> drivers/net/bonding/bond_main.c | 4 ++--
> drivers/net/team/team.c         | 7 +++++--
> net/ipv6/addrconf.c             | 2 +-
> 3 files changed, 8 insertions(+), 5 deletions(-)
>
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index ffa37adb7681..5cdad164332b 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -1536,6 +1536,7 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev,
> 
> 	/* set slave flag before open to prevent IPv6 addrconf */
> 	slave_dev->flags |= IFF_SLAVE;
>+	slave_dev->priv_flags |= IFF_BONDING;
> 
> 	/* open the slave since the application closed it */
> 	res = dev_open(slave_dev);
>@@ -1544,7 +1545,6 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev,
> 		goto err_restore_mac;
> 	}
> 
>-	slave_dev->priv_flags |= IFF_BONDING;
> 	/* initialize slave stats */
> 	dev_get_stats(new_slave->dev, &new_slave->slave_stats);
> 
>@@ -1804,10 +1804,10 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev,
> 	slave_disable_netpoll(new_slave);
> 
> err_close:
>-	slave_dev->priv_flags &= ~IFF_BONDING;
> 	dev_close(slave_dev);
> 
> err_restore_mac:
>+	slave_dev->priv_flags &= ~IFF_BONDING;
> 	slave_dev->flags &= ~IFF_SLAVE;
> 	if (!bond->params.fail_over_mac ||
> 	    BOND_MODE(bond) != BOND_MODE_ACTIVEBACKUP) {
>diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
>index db633ae9f784..8fc7d57e9f6d 100644
>--- a/drivers/net/team/team.c
>+++ b/drivers/net/team/team.c
>@@ -1128,14 +1128,12 @@ static int team_upper_dev_link(struct team *team, struct team_port *port,
> 					   &lag_upper_info, extack);
> 	if (err)
> 		return err;
>-	port->dev->priv_flags |= IFF_TEAM_PORT;
> 	return 0;
> }
> 
> static void team_upper_dev_unlink(struct team *team, struct team_port *port)
> {
> 	netdev_upper_dev_unlink(port->dev, team->dev);
>-	port->dev->priv_flags &= ~IFF_TEAM_PORT;
> }
> 
> static void __team_port_change_port_added(struct team_port *port, bool linkup);
>@@ -1214,6 +1212,9 @@ static int team_port_add(struct team *team, struct net_device *port_dev,
> 		goto err_port_enter;
> 	}
> 
>+	/* set slave flag before open to prevent IPv6 addrconf */
>+	port->dev->priv_flags |= IFF_TEAM_PORT;
>+
> 	err = dev_open(port_dev);
> 	if (err) {
> 		netdev_dbg(dev, "Device %s opening failed\n",
>@@ -1292,6 +1293,7 @@ static int team_port_add(struct team *team, struct net_device *port_dev,
> 	dev_close(port_dev);
> 
> err_dev_open:
>+	port->dev->priv_flags &= ~IFF_TEAM_PORT;
> 	team_port_leave(team, port);
> 	team_port_set_orig_dev_addr(port);
> 
>@@ -1328,6 +1330,7 @@ static int team_port_del(struct team *team, struct net_device *port_dev)
> 	dev_uc_unsync(port_dev, dev);
> 	dev_mc_unsync(port_dev, dev);
> 	dev_close(port_dev);
>+	port->dev->priv_flags &= ~IFF_TEAM_PORT;
> 	team_port_leave(team, port);
> 
> 	__team_option_inst_mark_removed_port(team, port);
>diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
>index 45b84dd5c4eb..121f863022ed 100644
>--- a/net/ipv6/addrconf.c
>+++ b/net/ipv6/addrconf.c
>@@ -3482,7 +3482,7 @@ static int addrconf_notify(struct notifier_block *this, unsigned long event,
> 
> 	case NETDEV_UP:
> 	case NETDEV_CHANGE:
>-		if (dev->flags & IFF_SLAVE)
>+		if (netif_is_lag_port(dev) || dev->flags & IFF_SLAVE)

	Note that netvsc_vf_join() also uses IFF_SLAVE in order skip
IPv6 addrconf for netvsc devices; I don't believe its usage will pass
netif_is_lag_port().  It looks like the above will work, but your commit
message mentions eql as the reason for retaining the IFF_SLAVE test, and
eql isn't the only user of IFF_SLAVE in this manner.

	-J

> 			break;
> 
> 		if (idev && idev->cnf.disable_ipv6)
>-- 
>2.14.4
>

---
	-Jay Vosburgh, jay.vosburgh@canonical.com

^ permalink raw reply

* Re: [PATCH net-next] net/ipv6: Block IPv6 addrconf on team ports
From: Chas Williams @ 2018-10-25 22:08 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: davem, netdev, j.vosburgh, vfalico, andy, kuznet, yoshfuji
In-Reply-To: <20181025211008.GA2229@nanopsycho.orion>



On 10/25/2018 05:10 PM, Jiri Pirko wrote:
> Thu, Oct 25, 2018 at 11:02:27PM CEST, 3chas3@gmail.com wrote:
>> netif_is_lag_port should be used to identify link aggregation ports.
>> For this to work, we need to reorganize the bonding and team drivers
>> so that the necessary flags are set before dev_open is called.
>>
>> commit 31e77c93e432 ("sched/fair: Update blocked load when newly idle")
>> made this decision originally based on the IFF_SLAVE flag which isn't
>> used by the team driver.  Note, we do need to retain the IFF_SLAVE
>> check for the eql driver.
>>
>> Signed-off-by: Chas Williams <3chas3@gmail.com>
>> ---
>> drivers/net/bonding/bond_main.c | 4 ++--
>> drivers/net/team/team.c         | 7 +++++--
>> net/ipv6/addrconf.c             | 2 +-
> 
> Subject talks about "team" yet you modify bond and team. Confusing..

The subject discusses what I want to do. The body of the message
covers how I had to do it. The behavior of bonding with respect to
addrconf isn't changed but netif_is_lag_port is picky about the
flags it wants to see from bonding.  So some bonding changes are
necessary.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox