Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [RFC] ethtool: Support for driver private ioctl's
From: Andrew Lunn @ 2018-04-06 14:47 UTC (permalink / raw)
  To: Jose Abreu
  Cc: Florian Fainelli, David Miller, Jakub Jelinek, Jeff Garzik,
	Tim Hockin, Eli Kupermann, Chris Leech, Scott Feldman,
	Ben Hutchings, netdev, Joao Pinto
In-Reply-To: <27c05ec7-1f85-8e4a-be06-70d6d80e8a10@synopsys.com>

On Fri, Apr 06, 2018 at 02:51:15PM +0100, Jose Abreu wrote:
> Hi Florian,
> 
> On 05-04-2018 16:50, Florian Fainelli wrote:
> >
> > On 04/05/2018 03:47 AM, Jose Abreu wrote:
> >> Hi All,
> >>
> >> I would like to know your opinion regarding adding support for
> >> driver private ioctl's in ethtool.
> >>
> >> Background: Synopsys Ethernet IP's have a certain number of
> >> features which can be reconfigured at runtime. Giving you two
> >> examples: One of the most recent one is the safety features,
> >> which can be enabled/disabled and forced at runtime.

Hi Jose

Is there a reason somebody would decide to use the Ethernet in
'unsafe' mode? Cannot you just turn it on by default?

	 Andrew

^ permalink raw reply

* Re: [RFC] ethtool: Support for driver private ioctl's
From: Jose Abreu @ 2018-04-06 14:51 UTC (permalink / raw)
  To: Andrew Lunn, Jose Abreu
  Cc: Florian Fainelli, David Miller, Jakub Jelinek, Jeff Garzik,
	Tim Hockin, Eli Kupermann, Chris Leech, Scott Feldman,
	Ben Hutchings, netdev, Joao Pinto
In-Reply-To: <20180406144701.GO17495@lunn.ch>

Hi Andrew,

On 06-04-2018 15:47, Andrew Lunn wrote:
> On Fri, Apr 06, 2018 at 02:51:15PM +0100, Jose Abreu wrote:
>> Hi Florian,
>>
>> On 05-04-2018 16:50, Florian Fainelli wrote:
>>> On 04/05/2018 03:47 AM, Jose Abreu wrote:
>>>> Hi All,
>>>>
>>>> I would like to know your opinion regarding adding support for
>>>> driver private ioctl's in ethtool.
>>>>
>>>> Background: Synopsys Ethernet IP's have a certain number of
>>>> features which can be reconfigured at runtime. Giving you two
>>>> examples: One of the most recent one is the safety features,
>>>> which can be enabled/disabled and forced at runtime.
> Hi Jose
>
> Is there a reason somebody would decide to use the Ethernet in
> 'unsafe' mode? Cannot you just turn it on by default?

Yes, its already on by default. I was just trying to give an
example of an user-reconfigurable feature, maybe it was not the
best one :)

Thanks and Best Regards,
Jose Miguel Abreu

>
> 	 Andrew

^ permalink raw reply

* Re: TCP one-by-one acking - RFC interpretation question
From: Michal Kubecek @ 2018-04-06 15:03 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <65f7ac65-e6d1-11f2-408e-39eb542ec817@gmail.com>

On Fri, Apr 06, 2018 at 05:01:29AM -0700, Eric Dumazet wrote:
> 
> 
> On 04/06/2018 03:05 AM, Michal Kubecek wrote:
> > Hello,
> > 
> > I encountered a strange behaviour of some (non-linux) TCP stack which
> > I believe is incorrect but support engineers from the company producing
> > it claim is OK.
> > 
> > Assume a client (sender, Linux 4.4 kernel) sends a stream of MSS sized
> > segments but segments 2, 4 and 6 do not reach the server (receiver):
> > 
> >          ACK             SAK             SAK             SAK
> >       +-------+-------+-------+-------+-------+-------+-------+
> >       |   1   |   2   |   3   |   4   |   5   |   6   |   7   |
> >       +-------+-------+-------+-------+-------+-------+-------+
> >     34273   35701   37129   38557   39985   41413   42841   44269
> > 
> > When segment 2 is retransmitted after RTO timeout, normal response would
> > be ACK-ing segment 3 (38557) with SACK for 5 and 7 (39985-41413 and
> > 42841-44269).
> > 
> > However, this server stack responds with two separate ACKs:
> > 
> >   - ACK 37129, SACK 37129-38557 39985-41413 42841-44269
> >   - ACK 38557, SACK 39985-41413 42841-44269
> 
> Hmmm... Yes this seems very very wrong and lazy.
> 
> Have you verified behavior of more recent linux kernel to such threats ?

No, unfortunately the problem was only encountered by our customer in
production environment (they tried to reproduce in a test lab but no
luck). They are running backups to NFS server and it happens from time
to time (in the order of hours, IIUC). So it would be probably hard to
let them try with more recent kernel.

On the other hand, they reported that SLE11 clients (kernel 3.0) do not
run into this kind of problem. It was originally reported as a
a regression on migration from SLE11-SP4 (3.0 kernel) to SLE12-SP2 (4.4
kernel) and the problem was reported as "SLE12-SP2 is ignoring dupacks"
(which seems to be mostly caused by the switch to RACK).

It also seems that part of the problem is specific packet loss pattern
where at some point, many packets are lost in "every second" pattern.
The customer finally started to investigate this problem and it seems it
has something to do with their bonding setup (they provided no details,
my guess is packets are divided over two paths and one of them fails).

> packetdrill test would be relatively easy to write.

I'll try but I have very little experience with writing packetdrill
scripts so it will probably take some time.

> Regardless of this broken alien stack, we might be able to work around
> this faster than the vendor is able to fix and deploy a new stack.
> 
> ( https://en.wikipedia.org/wiki/Robustness_principle )
> Be conservative in what you do, be liberal in what you accept from
> others...

I was thinking about this a bit. "Fixing" the acknowledgment number
could do the trick but it doesn't feel correct. We might use the fact
that TSecr of both ACKs above matches TSval of the retransmission which
triggered them so that RTT calculated from timestamp would be the right
one. So perhaps something like "prefer timestamp RTT if measured RTT
seems way too off". But I'm not sure if it couldn't break other use
cases where (high) measured RTT is actually correct, rather than (low)
timestamp RTT.

Michal Kubecek

^ permalink raw reply

* Re: [PATCH v2] net: thunderx: rework mac addresses list to u64 array
From: David Miller @ 2018-04-06 15:06 UTC (permalink / raw)
  To: Vadim.Lomovtsev
  Cc: sgoutham, sunil.kovvuri, rric, linux-arm-kernel, netdev,
	linux-kernel, dnelson, gustavo, Vadim.Lomovtsev
In-Reply-To: <20180406111425.14636-1-Vadim.Lomovtsev@caviumnetworks.com>

From: Vadim Lomovtsev <Vadim.Lomovtsev@caviumnetworks.com>
Date: Fri,  6 Apr 2018 04:14:25 -0700

> diff --git a/drivers/net/ethernet/cavium/thunder/nic.h b/drivers/net/ethernet/cavium/thunder/nic.h
> index 5fc46c5a4f36..448d1fafc827 100644
> --- a/drivers/net/ethernet/cavium/thunder/nic.h
> +++ b/drivers/net/ethernet/cavium/thunder/nic.h
> @@ -265,14 +265,9 @@ struct nicvf_drv_stats {
>  
>  struct cavium_ptp;
>  
> -struct xcast_addr {
> -	struct list_head list;
> -	u64              addr;
> -};
> -
>  struct xcast_addr_list {
> -	struct list_head list;
>  	int              count;
> +	u64              mc[];
>  };
>  
>  struct nicvf_work {
> diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
> index 1e9a31fef729..a26d8bc92e01 100644
> --- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c
> +++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
> @@ -1929,7 +1929,7 @@ static void nicvf_set_rx_mode_task(struct work_struct *work_arg)
>  						  work.work);
>  	struct nicvf *nic = container_of(vf_work, struct nicvf, rx_mode_work);
>  	union nic_mbx mbx = {};
> -	struct xcast_addr *xaddr, *next;
> +	u8 idx = 0;
        ^^^^^^^^^^^

>  
>  	if (!vf_work)
>  		return;
> @@ -1956,16 +1956,10 @@ static void nicvf_set_rx_mode_task(struct work_struct *work_arg)
>  	/* check if we have any specific MACs to be added to PF DMAC filter */
>  	if (vf_work->mc) {
>  		/* now go through kernel list of MACs and add them one by one */
> -		list_for_each_entry_safe(xaddr, next,
> -					 &vf_work->mc->list, list) {
> +		for (idx = 0; idx < vf_work->mc->count; idx++) {

vf_work->mx->count is an 'int' therefore 'idx' should be declared 'int' as well,
not a 'u8'.

^ permalink raw reply

* Re: [RFC 0/9] bpf: Add buildid check support
From: Jiri Olsa @ 2018-04-06 15:07 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Jiri Olsa, Alexei Starovoitov, Daniel Borkmann, lkml, netdev,
	linux-kbuild, Quentin Monnet, Eugene Syromiatnikov, Jiri Benc,
	Stanislav Kozina, Jerome Marchand, Arnaldo Carvalho de Melo,
	Masahiro Yamada, Michal Marek, Jiri Kosina
In-Reply-To: <20180406013719.ekptkxbwzpjeaanu@ast-mbp.dhcp.thefacebook.com>

On Thu, Apr 05, 2018 at 06:37:23PM -0700, Alexei Starovoitov wrote:
> On Thu, Apr 05, 2018 at 05:16:36PM +0200, Jiri Olsa wrote:
> > hi,
> > eBPF programs loaded for kprobes are allowed to read kernel
> > internal structures. We check the provided kernel version
> > to ensure that the program is loaded for the proper kernel. 
> > 
> > The problem is that the version check is not enough, because
> > it only follows the version setup from kernel's Makefile.
> > However, the internal kernel structures change based on the
> > .config data, so in practise we have different kernels with
> > same version.
> > 
> > The eBPF kprobe program thus then get loaded for different
> > kernel than it's been built for, get wrong data (silently)
> > and provide misleading output.
> > 
> > This patchset implements additional check in eBPF loading code
> > on provided build ID (from kernel's elf image, .notes section
> > GNU build ID) to ensure we load the eBPF program on correct
> > kernel.
> > 
> > Also available in here (based on bpf-next/master):
> >   https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git
> >   bpf/checksum
> > 
> > This patchset consists of several changes:
> > 
> > - adding CONFIG_BUILDID_H option that instructs the build
> >   to generate uapi header file with build ID data, that
> >   will be included by eBPF program
> > 
> > - adding CONFIG_BPF_BUILDID_CHECK option and new bpf_attr
> >   field to allow build ID checking when loading the eBPF
> >   program
> > 
> > - changing libbpf to read and pass build ID to the kernel
> > 
> > - several small side fixes
> > 
> > - example perf eBPF code in bpf-samples/bpf-stdout-example.c
> >   to show the build ID support/usage.
> > 
> >     # perf record -vv  -e ./bpf-samples/bpf-stdout-example.c kill 2>&1 | grep buildid
> >     libbpf: section(7) buildid, size 21, link 0, flags 3, type=1
> >     libbpf: kernel buildid of ./bpf-samples/bpf-stdout-example.c is: 6e25edeb408513184e2753bebad25d42314501a0
> > 
> >   The buildid is provided the same way we provide kernel
> >   version, in a special "buildid" section:
> > 
> >     # cat ./bpf-samples/bpf-stdout-example.c
> >     ...
> >     #include <linux/buildid.h>
> > 
> >     char _buildid[] SEC("buildid") = LINUX_BUILDID_DATA;
> >     ...
> > 
> >   where LINUX_BUILDID_DATA is defined in the generated buildid.h.
> > 
> > please note it's an RFC ;-) any comments and suggestions are welcome
> 
> I think this is overkill.
> 
> We're very heavy users of kprobe+bpf. It's used for lots
> of different cases and usage is constantly growing,
> but I haven't seen a single case of :
> 
> > The eBPF kprobe program thus then get loaded for different
> > kernel than it's been built for, get wrong data (silently)
> > and provide misleading output.
> 
> but I saw plenty of the opposite. People pre-compile the program
> and hack kernel version when they load, since they know in advance
> that kprobe+bpf doesn't use any kernel specific things.
> The existing kernel version check for kprobe+bpf is already annoying
> to them.

perhaps verifier could detect this (via bpf_probe_read usage) and disable
the version check automaticaly for such program?

and in the same way force the version check (or buildid when enabled)
once the bpf_probe_read is detected

thanks,
jirka

^ permalink raw reply

* Re: [PATCH net-next 0/5] net: stmmac: Stop using hard-coded callbacks
From: David Miller @ 2018-04-06 15:12 UTC (permalink / raw)
  To: Jose.Abreu; +Cc: netdev, Joao.Pinto, peppe.cavallaro, alexandre.torgue
In-Reply-To: <cover.1523019346.git.joabreu@synopsys.com>

From: Jose Abreu <Jose.Abreu@synopsys.com>
Date: Fri,  6 Apr 2018 14:08:14 +0100

> This a starting point for a cleanup and re-organization of stmmac.
> 
> In this series we stop using hard-coded callbacks along the code and use
> instead helpers which are defined in a single place ("hwif.h").
> 
> This brings several advantages:
> 	1) Less typing :)
> 	2) Guaranteed function pointer check
> 	3) More flexibility
> 
> By 2) we stop using the repeated pattern of:
> 	if (priv->hw->mac->some_func)
> 		priv->hw->mac->some_func(...)
> 
> I didn't check but I expect the final .ko will be bigger with this series
> because *all* of function pointers are checked.
> 
> Anyway, I hope this can make the code more readable and more flexible now.

The net-next tree is closed, please resubmit this series when it opens
back up.

Thank you.

^ permalink raw reply

* Re: [PATCH] dp83640: Ensure against premature access to PHY registers after reset
From: David Miller @ 2018-04-06 15:13 UTC (permalink / raw)
  To: andrew
  Cc: esben.haabendal, netdev, eha, richardcochran, f.fainelli,
	linux-kernel
In-Reply-To: <20180406141410.GI17495@lunn.ch>

From: Andrew Lunn <andrew@lunn.ch>
Date: Fri, 6 Apr 2018 16:14:10 +0200

> On Fri, Apr 06, 2018 at 04:05:40PM +0200, Esben Haabendal wrote:
>> From: Esben Haabendal <eha@deif.com>
>> 
>> Signed-off-by: Esben Haabendal <eha@deif.com>
>> ---
>>  drivers/net/phy/dp83640.c | 17 +++++++++++++++++
>>  1 file changed, 17 insertions(+)
>> 
>> diff --git a/drivers/net/phy/dp83640.c b/drivers/net/phy/dp83640.c
>> index 654f42d00092..48403170096a 100644
>> --- a/drivers/net/phy/dp83640.c
>> +++ b/drivers/net/phy/dp83640.c
>> @@ -1207,6 +1207,22 @@ static void dp83640_remove(struct phy_device *phydev)
>>  	kfree(dp83640);
>>  }
>>  
>> +static int dp83640_soft_reset(struct phy_device *phydev)
>> +{
>> +	int ret;
>> +
>> +	ret = genphy_soft_reset(phydev);
>> +	if (ret < 0)
>> +		return ret;
>> +
>> +	/* From DP83640 datasheet: "Software driver code must wait 3 us
>> +	 * following a software reset before allowing further serial MII
>> +	 * operations with the DP83640." */
>> +	udelay(3);
> 
> Hi Esben
> 
> The accuracy of udelay() is not guaranteed. So you probably want to be
> a bit pessimistic, and use 10.

Agreed.

^ permalink raw reply

* Re: [PATCH v2] net: thunderx: rework mac addresses list to u64 array
From: Vadim Lomovtsev @ 2018-04-06 15:14 UTC (permalink / raw)
  To: David Miller
  Cc: sgoutham, sunil.kovvuri, rric, linux-arm-kernel, netdev,
	linux-kernel, dnelson, gustavo, Vadim.Lomovtsev
In-Reply-To: <20180406.110603.978796669734920248.davem@davemloft.net>

On Fri, Apr 06, 2018 at 11:06:03AM -0400, David Miller wrote:
> From: Vadim Lomovtsev <Vadim.Lomovtsev@caviumnetworks.com>
> Date: Fri,  6 Apr 2018 04:14:25 -0700
> 
> > diff --git a/drivers/net/ethernet/cavium/thunder/nic.h b/drivers/net/ethernet/cavium/thunder/nic.h
> > index 5fc46c5a4f36..448d1fafc827 100644
> > --- a/drivers/net/ethernet/cavium/thunder/nic.h
> > +++ b/drivers/net/ethernet/cavium/thunder/nic.h
> > @@ -265,14 +265,9 @@ struct nicvf_drv_stats {
> >  
> >  struct cavium_ptp;
> >  
> > -struct xcast_addr {
> > -	struct list_head list;
> > -	u64              addr;
> > -};
> > -
> >  struct xcast_addr_list {
> > -	struct list_head list;
> >  	int              count;
> > +	u64              mc[];
> >  };
> >  
> >  struct nicvf_work {
> > diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
> > index 1e9a31fef729..a26d8bc92e01 100644
> > --- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c
> > +++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
> > @@ -1929,7 +1929,7 @@ static void nicvf_set_rx_mode_task(struct work_struct *work_arg)
> >  						  work.work);
> >  	struct nicvf *nic = container_of(vf_work, struct nicvf, rx_mode_work);
> >  	union nic_mbx mbx = {};
> > -	struct xcast_addr *xaddr, *next;
> > +	u8 idx = 0;
>         ^^^^^^^^^^^
> 
> >  
> >  	if (!vf_work)
> >  		return;
> > @@ -1956,16 +1956,10 @@ static void nicvf_set_rx_mode_task(struct work_struct *work_arg)
> >  	/* check if we have any specific MACs to be added to PF DMAC filter */
> >  	if (vf_work->mc) {
> >  		/* now go through kernel list of MACs and add them one by one */
> > -		list_for_each_entry_safe(xaddr, next,
> > -					 &vf_work->mc->list, list) {
> > +		for (idx = 0; idx < vf_work->mc->count; idx++) {
> 
> vf_work->mx->count is an 'int' therefore 'idx' should be declared 'int' as well,
> not a 'u8'.

My bad, sorry.
Will post v4 shortly then.


WBR,
Vadim

^ permalink raw reply

* Re: [PATCH v3] net: thunderx: rework mac addresses list to u64 array
From: Vadim Lomovtsev @ 2018-04-06 15:16 UTC (permalink / raw)
  To: sgoutham, sunil.kovvuri, rric, linux-arm-kernel, netdev,
	linux-kernel
  Cc: dnelson, Vadim Lomovtsev
In-Reply-To: <20180406140443.15181-1-Vadim.Lomovtsev@caviumnetworks.com>

Self-NACK here, because of https://lkml.org/lkml/2018/4/6/724

Sorry for noise.

Vadim

On Fri, Apr 06, 2018 at 07:04:43AM -0700, Vadim Lomovtsev wrote:
> From: Vadim Lomovtsev <Vadim.Lomovtsev@cavium.com>
> 
> It is too expensive to pass u64 values via linked list, instead
> allocate array for them by overall number of mac addresses from netdev.
> 
> This eventually removes multiple kmalloc() calls, aviod memory
> fragmentation and allow to put single null check on kmalloc
> return value in order to prevent a potential null pointer dereference.
> 
> Addresses-Coverity-ID: 1467429 ("Dereference null return value")
> Fixes: 37c3347eb247 ("net: thunderx: add ndo_set_rx_mode callback implementation for VF")
> Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
> Signed-off-by: Vadim Lomovtsev <Vadim.Lomovtsev@cavium.com>
> ---
> Changes from v1 to v2:
>  - C99 syntax: update xcast_addr_list struct field mc[0] -> mc[];
> Changes from v2 to v3:
>  - update commit description with 'Reported-by: Dan Carpenter';
>  - update size calculations for mc list to offsetof() call
>    instead of explicit arithmetic;
> ---
>  drivers/net/ethernet/cavium/thunder/nic.h        |  7 +-----
>  drivers/net/ethernet/cavium/thunder/nicvf_main.c | 28 +++++++++---------------
>  2 files changed, 11 insertions(+), 24 deletions(-)
> 
> diff --git a/drivers/net/ethernet/cavium/thunder/nic.h b/drivers/net/ethernet/cavium/thunder/nic.h
> index 5fc46c5a4f36..448d1fafc827 100644
> --- a/drivers/net/ethernet/cavium/thunder/nic.h
> +++ b/drivers/net/ethernet/cavium/thunder/nic.h
> @@ -265,14 +265,9 @@ struct nicvf_drv_stats {
>  
>  struct cavium_ptp;
>  
> -struct xcast_addr {
> -	struct list_head list;
> -	u64              addr;
> -};
> -
>  struct xcast_addr_list {
> -	struct list_head list;
>  	int              count;
> +	u64              mc[];
>  };
>  
>  struct nicvf_work {
> diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
> index 1e9a31fef729..7d9e58533a83 100644
> --- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c
> +++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
> @@ -1929,7 +1929,7 @@ static void nicvf_set_rx_mode_task(struct work_struct *work_arg)
>  						  work.work);
>  	struct nicvf *nic = container_of(vf_work, struct nicvf, rx_mode_work);
>  	union nic_mbx mbx = {};
> -	struct xcast_addr *xaddr, *next;
> +	u8 idx = 0;
>  
>  	if (!vf_work)
>  		return;
> @@ -1956,16 +1956,10 @@ static void nicvf_set_rx_mode_task(struct work_struct *work_arg)
>  	/* check if we have any specific MACs to be added to PF DMAC filter */
>  	if (vf_work->mc) {
>  		/* now go through kernel list of MACs and add them one by one */
> -		list_for_each_entry_safe(xaddr, next,
> -					 &vf_work->mc->list, list) {
> +		for (idx = 0; idx < vf_work->mc->count; idx++) {
>  			mbx.xcast.msg = NIC_MBOX_MSG_ADD_MCAST;
> -			mbx.xcast.data.mac = xaddr->addr;
> +			mbx.xcast.data.mac = vf_work->mc->mc[idx];
>  			nicvf_send_msg_to_pf(nic, &mbx);
> -
> -			/* after receiving ACK from PF release memory */
> -			list_del(&xaddr->list);
> -			kfree(xaddr);
> -			vf_work->mc->count--;
>  		}
>  		kfree(vf_work->mc);
>  	}
> @@ -1996,17 +1990,15 @@ static void nicvf_set_rx_mode(struct net_device *netdev)
>  			mode |= BGX_XCAST_MCAST_FILTER;
>  			/* here we need to copy mc addrs */
>  			if (netdev_mc_count(netdev)) {
> -				struct xcast_addr *xaddr;
> -
> -				mc_list = kmalloc(sizeof(*mc_list), GFP_ATOMIC);
> -				INIT_LIST_HEAD(&mc_list->list);
> +				mc_list = kmalloc(offsetof(typeof(*mc_list),
> +							   mc[netdev_mc_count(netdev)]),
> +						  GFP_ATOMIC);
> +				if (unlikely(!mc_list))
> +					return;
> +				mc_list->count = 0;
>  				netdev_hw_addr_list_for_each(ha, &netdev->mc) {
> -					xaddr = kmalloc(sizeof(*xaddr),
> -							GFP_ATOMIC);
> -					xaddr->addr =
> +					mc_list->mc[mc_list->count] =
>  						ether_addr_to_u64(ha->addr);
> -					list_add_tail(&xaddr->list,
> -						      &mc_list->list);
>  					mc_list->count++;
>  				}
>  			}
> -- 
> 2.14.3
> 

^ permalink raw reply

* Re: [PATCH net 1/3] lan78xx: PHY DSP registers initialization to address EEE link drop issues with long cables
From: David Miller @ 2018-04-06 15:21 UTC (permalink / raw)
  To: andrew; +Cc: raghuramchary.jallipalli, netdev, unglinuxdriver, woojung.huh
In-Reply-To: <20180406144342.GN17495@lunn.ch>

From: Andrew Lunn <andrew@lunn.ch>
Date: Fri, 6 Apr 2018 16:43:42 +0200

> On Fri, Apr 06, 2018 at 11:42:02AM +0530, Raghuram Chary J wrote:
>> The patch is to configure DSP registers of PHY device
>> to handle Gbe-EEE failures with >40m cable length.
>> 
>> Fixes: 55d7de9de6c3 ("Microchip's LAN7800 family USB 2/3 to 10/100/1000 Ethernet device driver")
>> Signed-off-by: Raghuram Chary J <raghuramchary.jallipalli@microchip.com>
>> ---
>>  drivers/net/phy/microchip.c  | 123 ++++++++++++++++++++++++++++++++++++++++++-
>>  include/linux/microchipphy.h |   8 +++
>>  2 files changed, 130 insertions(+), 1 deletion(-)
>> 
>> diff --git a/drivers/net/phy/microchip.c b/drivers/net/phy/microchip.c
>> index 0f293ef28935..174ae9808722 100644
>> --- a/drivers/net/phy/microchip.c
>> +++ b/drivers/net/phy/microchip.c
>> @@ -20,6 +20,7 @@
>>  #include <linux/ethtool.h>
>>  #include <linux/phy.h>
>>  #include <linux/microchipphy.h>
>> +#include <linux/delay.h>
>>  
>>  #define DRIVER_AUTHOR	"WOOJUNG HUH <woojung.huh@microchip.com>"
>>  #define DRIVER_DESC	"Microchip LAN88XX PHY driver"
>> @@ -66,6 +67,107 @@ static int lan88xx_suspend(struct phy_device *phydev)
>>  	return 0;
>>  }
>>  
>> +static void lan88xx_TR_reg_set(struct phy_device *phydev, u16 regaddr,
>> +			       u32 data)
>> +{
>> +	int val;
>> +	u16 buf;
>> +
>> +	/* Get access to token ring page */
>> +	phy_write(phydev, LAN88XX_EXT_PAGE_ACCESS,
>> +		  LAN88XX_EXT_PAGE_ACCESS_TR);
> 
> Hi Raghuram
> 
> You might want to look at phy_read_paged(), phy_write_paged(), etc.
> 
> There can be race conditions with paged access.

Yep, so something like:

static void lan88xx_TR_reg_set(struct phy_device *phydev, u16 regaddr,
			       u32 data)
{
	int save_page, val;
	u16 buf;

	save_page = phy_save_page(phydev);
	phy_write_paged(phydev, LAN88XX_EXT_PAGE_ACCESS_TR,
			LAN88XX_EXT_PAGE_TR_LOW_DATA, (data & 0xFFFF));
	phy_write_paged(phydev, LAN88XX_EXT_PAGE_ACCESS_TR,
			LAN88XX_EXT_PAGE_TR_HIGH_DATA,
			(data & 0x00FF0000) >> 16);

	/* Config control bits [15:13] of register */
	buf = (regaddr & ~(0x3 << 13));/* Clr [14:13] to write data in reg */
	buf |= 0x8000; /* Set [15] to Packet transmit */

	phy_write_paged(phydev, LAN88XX_EXT_PAGE_ACCESS_TR,
			LAN88XX_EXT_PAGE_TR_CR, buf);
	usleep_range(1000, 2000);/* Wait for Data to be written */

	val = phy_read_paged(phydev, LAN88XX_EXT_PAGE_ACCESS_TR,
			     LAN88XX_EXT_PAGE_TR_CR);
	if (!(val & 0x8000))
		pr_warn("TR Register[0x%X] configuration failed\n", regaddr);

	phy_restore_page(phydev, save_page, 0);
}

Since PHY accesses and thus things like phy_save_page() can fail, the
return type of this function should be changed to 'int' and some error
checking should be added.

^ permalink raw reply

* [net 2/2] ice: Bug fixes in ethtool code
From: Jeff Kirsher @ 2018-04-06 15:36 UTC (permalink / raw)
  To: davem
  Cc: Anirudh Venkataramanan, netdev, nhorman, sassmann, jogreene,
	Jeff Kirsher
In-Reply-To: <20180406153630.9321-1-jeffrey.t.kirsher@intel.com>

From: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>

1) Return correct size from ice_get_regs_len.
2) Fix incorrect use of ARRAY_SIZE in ice_get_regs.

Fixes: fcea6f3da546 (ice: Add stats and ethtool support)
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Tony Brelinski <tonyx.brelinski@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_ethtool.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_ethtool.c b/drivers/net/ethernet/intel/ice/ice_ethtool.c
index 186764a5c263..1db304c01d10 100644
--- a/drivers/net/ethernet/intel/ice/ice_ethtool.c
+++ b/drivers/net/ethernet/intel/ice/ice_ethtool.c
@@ -156,7 +156,7 @@ ice_get_drvinfo(struct net_device *netdev, struct ethtool_drvinfo *drvinfo)
 
 static int ice_get_regs_len(struct net_device __always_unused *netdev)
 {
-	return ARRAY_SIZE(ice_regs_dump_list);
+	return sizeof(ice_regs_dump_list);
 }
 
 static void
@@ -170,7 +170,7 @@ ice_get_regs(struct net_device *netdev, struct ethtool_regs *regs, void *p)
 
 	regs->version = 1;
 
-	for (i = 0; i < ARRAY_SIZE(ice_regs_dump_list) / sizeof(u32); ++i)
+	for (i = 0; i < ARRAY_SIZE(ice_regs_dump_list); ++i)
 		regs_buf[i] = rd32(hw, ice_regs_dump_list[i]);
 }
 
-- 
2.14.3

^ permalink raw reply related

* [net 0/2][pull request] Intel Wired LAN Driver Updates 2018-04-06
From: Jeff Kirsher @ 2018-04-06 15:36 UTC (permalink / raw)
  To: davem; +Cc: Jeff Kirsher, netdev, nhorman, sassmann, jogreene

This series contains a couple of fixes for the new ice driver.

Wei Yongjun fixes the return error code for error case during init.

Anirudh fixes the incorrect use of ARRAY_SIZE() in the ice ethtool code
and fixed "for" loop calculations.

The following are changes since commit 3239534a79ee6f20cffd974173a1e62e0730e8ac:
  net/sched: fix NULL dereference in the error path of tcf_bpf_init()
and are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-queue 100GbE

Anirudh Venkataramanan (1):
  ice: Bug fixes in ethtool code

Wei Yongjun (1):
  ice: Fix error return code in ice_init_hw()

 drivers/net/ethernet/intel/ice/ice_common.c  | 4 +++-
 drivers/net/ethernet/intel/ice/ice_ethtool.c | 4 ++--
 2 files changed, 5 insertions(+), 3 deletions(-)

-- 
2.14.3

^ permalink raw reply

* [net 1/2] ice: Fix error return code in ice_init_hw()
From: Jeff Kirsher @ 2018-04-06 15:36 UTC (permalink / raw)
  To: davem; +Cc: Wei Yongjun, netdev, nhorman, sassmann, jogreene, Jeff Kirsher
In-Reply-To: <20180406153630.9321-1-jeffrey.t.kirsher@intel.com>

From: Wei Yongjun <weiyongjun1@huawei.com>

Fix to return error code ICE_ERR_NO_MEMORY from the alloc error
handling case instead of 0, as done elsewhere in this function.

Fixes: dc49c7723676 ("ice: Get MAC/PHY/link info and scheduler topology")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Acked-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Tony Brelinski <tonyx.brelinski@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_common.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_common.c b/drivers/net/ethernet/intel/ice/ice_common.c
index 385f5d425d19..21977ec984c4 100644
--- a/drivers/net/ethernet/intel/ice/ice_common.c
+++ b/drivers/net/ethernet/intel/ice/ice_common.c
@@ -468,8 +468,10 @@ enum ice_status ice_init_hw(struct ice_hw *hw)
 	mac_buf_len = sizeof(struct ice_aqc_manage_mac_read_resp);
 	mac_buf = devm_kzalloc(ice_hw_to_dev(hw), mac_buf_len, GFP_KERNEL);
 
-	if (!mac_buf)
+	if (!mac_buf) {
+		status = ICE_ERR_NO_MEMORY;
 		goto err_unroll_fltr_mgmt_struct;
+	}
 
 	status = ice_aq_manage_mac_read(hw, mac_buf, mac_buf_len, NULL);
 	devm_kfree(ice_hw_to_dev(hw), mac_buf);
-- 
2.14.3

^ permalink raw reply related

* Re: [net 0/2][pull request] Intel Wired LAN Driver Updates 2018-04-06
From: David Miller @ 2018-04-06 15:39 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, nhorman, sassmann, jogreene
In-Reply-To: <20180406153630.9321-1-jeffrey.t.kirsher@intel.com>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Fri,  6 Apr 2018 08:36:28 -0700

> This series contains a couple of fixes for the new ice driver.
> 
> Wei Yongjun fixes the return error code for error case during init.
> 
> Anirudh fixes the incorrect use of ARRAY_SIZE() in the ice ethtool code
> and fixed "for" loop calculations.

Pulled, thanks Jeff.

^ permalink raw reply

* Re: [PATCH net-next] netns: filter uevents correctly
From: Christian Brauner @ 2018-04-06 16:07 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Kirill Tkhai, davem, gregkh, netdev, linux-kernel, avagin, serge
In-Reply-To: <874lko2y22.fsf@xmission.com>

On Fri, Apr 06, 2018 at 09:45:41AM -0500, Eric W. Biederman wrote:
> Christian Brauner <christian.brauner@canonical.com> writes:
> 
> > On Thu, Apr 05, 2018 at 10:59:49PM -0500, Eric W. Biederman wrote:
> >> Christian Brauner <christian.brauner@canonical.com> writes:
> >> 
> >> > On Thu, Apr 05, 2018 at 05:26:59PM +0300, Kirill Tkhai wrote:
> >> >> On 05.04.2018 17:07, Christian Brauner wrote:
> >> >> > On Thu, Apr 05, 2018 at 04:01:03PM +0300, Kirill Tkhai wrote:
> >> >> >> On 04.04.2018 22:48, Christian Brauner wrote:
> >> >> >>> commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
> >> >> >>>
> >> >> >>> enabled sending hotplug events into all network namespaces back in 2010.
> >> >> >>> Over time the set of uevents that get sent into all network namespaces has
> >> >> >>> shrunk. We have now reached the point where hotplug events for all devices
> >> >> >>> that carry a namespace tag are filtered according to that namespace.
> >> >> >>>
> >> >> >>> Specifically, they are filtered whenever the namespace tag of the kobject
> >> >> >>> does not match the namespace tag of the netlink socket. One example are
> >> >> >>> network devices. Uevents for network devices only show up in the network
> >> >> >>> namespaces these devices are moved to or created in.
> >> >> >>>
> >> >> >>> However, any uevent for a kobject that does not have a namespace tag
> >> >> >>> associated with it will not be filtered and we will *try* to broadcast it
> >> >> >>> into all network namespaces.
> >> >> >>>
> >> >> >>> The original patchset was written in 2010 before user namespaces were a
> >> >> >>> thing. With the introduction of user namespaces sending out uevents became
> >> >> >>> partially isolated as they were filtered by user namespaces:
> >> >> >>>
> >> >> >>> net/netlink/af_netlink.c:do_one_broadcast()
> >> >> >>>
> >> >> >>> if (!net_eq(sock_net(sk), p->net)) {
> >> >> >>>         if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
> >> >> >>>                 return;
> >> >> >>>
> >> >> >>>         if (!peernet_has_id(sock_net(sk), p->net))
> >> >> >>>                 return;
> >> >> >>>
> >> >> >>>         if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
> >> >> >>>                              CAP_NET_BROADCAST))
> >> >> >>>         j       return;
> >> >> >>> }
> >> >> >>>
> >> >> >>> The file_ns_capable() check will check whether the caller had
> >> >> >>> CAP_NET_BROADCAST at the time of opening the netlink socket in the user
> >> >> >>> namespace of interest. This check is fine in general but seems insufficient
> >> >> >>> to me when paired with uevents. The reason is that devices always belong to
> >> >> >>> the initial user namespace so uevents for kobjects that do not carry a
> >> >> >>> namespace tag should never be sent into another user namespace. This has
> >> >> >>> been the intention all along. But there's one case where this breaks,
> >> >> >>> namely if a new user namespace is created by root on the host and an
> >> >> >>> identity mapping is established between root on the host and root in the
> >> >> >>> new user namespace. Here's a reproducer:
> >> >> >>>
> >> >> >>>  sudo unshare -U --map-root
> >> >> >>>  udevadm monitor -k
> >> >> >>>  # Now change to initial user namespace and e.g. do
> >> >> >>>  modprobe kvm
> >> >> >>>  # or
> >> >> >>>  rmmod kvm
> >> >> >>>
> >> >> >>> will allow the non-initial user namespace to retrieve all uevents from the
> >> >> >>> host. This seems very anecdotal given that in the general case user
> >> >> >>> namespaces do not see any uevents and also can't really do anything useful
> >> >> >>> with them.
> >> >> >>>
> >> >> >>> Additionally, it is now possible to send uevents from userspace. As such we
> >> >> >>> can let a sufficiently privileged (CAP_SYS_ADMIN in the owning user
> >> >> >>> namespace of the network namespace of the netlink socket) userspace process
> >> >> >>> make a decision what uevents should be sent.
> >> >> >>>
> >> >> >>> This makes me think that we should simply ensure that uevents for kobjects
> >> >> >>> that do not carry a namespace tag are *always* filtered by user namespace
> >> >> >>> in kobj_bcast_filter(). Specifically:
> >> >> >>> - If the owning user namespace of the uevent socket is not init_user_ns the
> >> >> >>>   event will always be filtered.
> >> >> >>> - If the network namespace the uevent socket belongs to was created in the
> >> >> >>>   initial user namespace but was opened from a non-initial user namespace
> >> >> >>>   the event will be filtered as well.
> >> >> >>> Put another way, uevents for kobjects not carrying a namespace tag are now
> >> >> >>> always only sent to the initial user namespace. The regression potential
> >> >> >>> for this is near to non-existent since user namespaces can't really do
> >> >> >>> anything with interesting devices.
> >> >> >>>
> >> >> >>> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> >> >> >>> ---
> >> >> >>>  lib/kobject_uevent.c | 10 +++++++++-
> >> >> >>>  1 file changed, 9 insertions(+), 1 deletion(-)
> >> >> >>>
> >> >> >>> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> >> >> >>> index 15ea216a67ce..cb98cddb6e3b 100644
> >> >> >>> --- a/lib/kobject_uevent.c
> >> >> >>> +++ b/lib/kobject_uevent.c
> >> >> >>> @@ -251,7 +251,15 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
> >> >> >>>  		return sock_ns != ns;
> >> >> >>>  	}
> >> >> >>>  
> >> >> >>> -	return 0;
> >> >> >>> +	/*
> >> >> >>> +	 * The kobject does not carry a namespace tag so filter by user
> >> >> >>> +	 * namespace below.
> >> >> >>> +	 */
> >> >> >>> +	if (sock_net(dsk)->user_ns != &init_user_ns)
> >> >> >>> +		return 1;
> >> >> >>> +
> >> >> >>> +	/* Check if socket was opened from non-initial user namespace. */
> >> >> >>> +	return sk_user_ns(dsk) != &init_user_ns;
> >> >> >>>  }
> >> >> >>>  #endif
> >> >> >>
> >> >> >> So, this prohibits to listen events of all devices except network-related
> >> >> >> in containers? If it's so, I don't think it's a good solution. Uevents is not
> >> >> > 
> >> >> > No, this is not correct: As it is right now *without my patch* no
> >> >> > non-initial user namespace is receiving *any uevents* but those
> >> >> > specifically namespaced such as those for network devices. This patch
> >> >> > doesn't change that at all. The commit message outlines this in detail
> >> >> > how this comes about.
> >> >> > There is only one case where this currently breaks and this is as I
> >> >> > outlined explicitly in my commit message when you create a new user
> >> >> > namespace and map container(0) -> host(0). This patch fixes this.
> >> >> 
> >> >> Could you please point the place, where non-initial user namespaces are filtered?
> >> >> I only see the kobj_bcast_filter() logic, and it used to return 0, which means "accepted".
> >> >> Now it will return 1 sometimes.
> >> >
> >> > Oh sure, it's in the commit message though. The callchain is
> >> > lib/kobject_uevent.c:kobject_uevent_net_broadcast() ->
> >> > nnet/netlink/af_netlink.c:netlink_broadcast_filtered() ->
> >> > net/netlink/af_netlink.c:do_one_broadcast():
> >> >
> >> > This codepiece will check whether the openened socket holds
> >> > CAP_NET_BROADCAST in the user namespace of the target network namespace
> >> > which it won't because we don't have device namespaces and all devices
> >> > belong to the initial set of namespaces.
> >> >
> >> >         if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
> >> >                              CAP_NET_BROADCAST))
> >> >         j       return;
> >> >
> >> 
> >> The above that only applies if someone has set NETLINK_F_LISTEN_ALL_NSID
> >> on their socket and has had someone with the appropriate privileges
> >> assign a peerrnetid.
> >> 
> >> All of which is to say that file_ns_capable is not nearly as applicable
> >> as it might be, and if you can pass the other two checks I think it is
> >> pointless (because the peernet attributes are not generated for
> >> kobj_uevents) but valid to receive events from outside your network
> >> namespace.
> >> 
> >> 
> >> I might be missing something but I don't see anything excluding network
> >> namespaces owned by !init_user_ns excluded from the kobject_uevent
> >> logic.
> >> 
> >> The uevent_sock_list has one entry per network namespace. And
> >> kobject_uevent_net_broacast appears to walk each one.
> >
> > Yeah, it definitely does.
> >
> >> 
> >> I had a memory of filtering uevent messages and I had a memory
> >> that the netlink_has_listeners had a per network namespace effect.
> >> Neither seems true from my inspection of the code tonight.
> >
> > Yeah, I drew the same conclusion.
> >
> >> 
> >> If we are not filtering ordinary uevents at least at the user namespace
> >> level that does seem to be at least a nuisance.
> >
> > This patch would filter based on user namespace and bump it from "we're
> > accidently doing the right thing" to "we're doing the right on purpose"
> > and without accidently leaking uevents in some corner cases.
> > Using kobj_bcast_filter() for this seems like the right thing to do.
> > This sounds like you don't disagree with the approach I'm taking here?
> 
> How are we accidentally filtering?  If we can not describe how the code
> currently works to achieve something we don't understand it well enough
> to change it.

I absolutely agree and without doing a lot of semantics here I just
meant that so far this all worked on accident that doesn't presuppose
that we understand why it worked.

> 
> 
> 
> > On a sidenote, it also very much feels like we're leaking information if
> > we're not filtering based on user namespaces on purpose. Non-initial
> > user namespaces should not be able to receive information about device
> > attribues simply via netlink uevent messages. At least not *trivially*
> > [1] by opening a netlink uevent socket.
> 
> This is not at all about isolation.  Devices don't belong to user

Sure, but there's still a difference between an unprivileged host user
listening to uevents and user namespaces that they might have delegated.

> namespaces.  This is about it being useless/silly to send events
> to those sockets as the receiver could not do anything with the uevent.

Yes, indeed unless a sufficiently privileged process explicitly decides
that a given uevent should be sent.

> At a practical level there should be no receivers.  Plus performance
> issues.  At least my memory is that any unprivileged user on the system
> is allowed to listen to those events.

Any unprivileged user is allowed to listen to uevents if they have
net_broadcast in the user namespace the uevent socket was opened in;
unless I'm misreading.

> 
> At least at one point udev listening and reacting to events in every
> network namespace in containers was a significant slow down on the
> system.

Sure, because we're holding a global lock on the list of all uevent
sockets and then go on to walk all of mc_list for each of those sockets
when broadcasting to guarantee that uevents are correctly ordered:

	mutex_lock(&uevent_sock_mutex);
	/* we will send an event, so request a new sequence number */
	retval = add_uevent_var(env, "SEQNUM=%llu", (unsigned long long)++uevent_seqnum);
	if (retval) {
		mutex_unlock(&uevent_sock_mutex);
		goto exit;
	}
	retval = kobject_uevent_net_broadcast(kobj, env, action_string,
					      devpath);
	mutex_unlock(&uevent_sock_mutex);

which means that the time it takes for uevents to be sent out increases
drastically the more listeners you have in more network namespaces. Even
for just one network namespace you have quite a lot:

ss -e -f netlink -a | grep uevent

then just look at those that you can clearly associate with a single
program I typically get ~20 listeners.

> 
> The report was that adding netlink_has_listeners greatly sped up uevent
> sending.   That testing appears faulty.  I think the real gain was
> userspace not listening to uevents in every container.

Banal point but the truth is probably not userspace changes in general
but that most massive container deployments are app containers. So
there's no udevd.

> 
> Which means uevent injection for containers has some real technical
> challenges to overcome before it can be wide spread.
> 
> >> Christian can you dig a little deeper into this.  I have a feeling that
> >> there are some real efficiency improvements that we could make to
> >> kobject_uevent_net_broadcast if nothing else.
> >
> > Yeah, for sure! I already started doing this. This patch here is
> > basically a preparatory patch for that work which is on my todo.
> 
> Limiting the network namespaces on uevent_sock_list to just network
> namespaces owned by the initial user namespaces would be a lot better
> than your patch.

I had this idea as well but I decided against it because of namespace
tag carrying kobjects. The only kobjects that fall into this category
are network devices so I need to think through this again and more
thoroughly.

> 
> 
> At a practical level I am concerned what will happen when we stand up a
> uevent listener aka udev in say 100 or maybe 1000 unprivileged
> containers on a system.  History suggests the current data structures
> won't scale to the problem, and the fixes that were put in place kernel
> side only did not change anything.  It was only the userspace fixes
> that made a difference.
> 
> Which suggests either improving the isolation between network namespaces
> for netlink multicast sockets or putting:
> 
> 
> 	if (!net_eq(sock_net(sk), p->net)) {
> 		if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
> 			return;
> 
> 		if (!peernet_has_id(sock_net(sk), p->net))
> 			return;
> 
> 		if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
> 				     CAP_NET_BROADCAST))
> 			return;
> 	}
> 
> In a filter and having an appropriate filter so that
> netlink_broadcast_filtered only needs to be called once from
> kobject_uevent_net_broadcast.

Maybe, but I'd like to try and find a solution that let's us wipe the
uevent socket list.

> 
> It is more difficult but in practice I expect we have far more to gain
> by fixing the mc_list and the netlink_has_listeners function to be per
> network namespace basis.  Handling the NETLINK_F_LISTEN_ALL_NSID will
> be trickier in that case but the current semantics appear correct
> for that case.

Yeah, I'll dig into this.

> 
> But before we do anything we have to test and see if the assertion
> that we don't make it to netlink listeners in network namespaces
> that are outside of the init_user_ns is true.   If it is true we need to
> find the code that makes it true.

Yeah, I'm on this as well.

Thanks!
Christian

^ permalink raw reply

* Re: [PATCH net-next] netns: filter uevents correctly
From: Eric W. Biederman @ 2018-04-06 16:48 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Kirill Tkhai, davem, gregkh, netdev, linux-kernel, avagin, serge
In-Reply-To: <20180406160757.GA16281@gmail.com>

Christian Brauner <christian.brauner@canonical.com> writes:

>> At a practical level there should be no receivers.  Plus performance
>> issues.  At least my memory is that any unprivileged user on the system
>> is allowed to listen to those events.
>
> Any unprivileged user is allowed to listen to uevents if they have
> net_broadcast in the user namespace the uevent socket was opened in;
> unless I'm misreading.

I believe you are.

This code in do_one_broadcast.

	if (!net_eq(sock_net(sk), p->net)) {
		if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
			return;

		if (!peernet_has_id(sock_net(sk), p->net))
			return;

		if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
				     CAP_NET_BROADCAST))
			return;
	}

Used to just be:
	if (!net_eq(sock_net(sk), p->net))
        	return;

Which makes sense when you have a shared hash table and a shared mc_list
between network namespaces.

There is a non-container use of network namespaces where you just need
different contexts were ip addresses can overlap etc.  In that
configuration where a single program is mananging multiple network
namespaces being able to listen to rtnetlink events in all of them is an
advantage.

For that case a special socket option NETLINK_F_LISTEN_ALL_NSID was
added that allowed one socket to listen for events from multiple network
namespaces.

If we rework the code in af_netlink.c that matters.  However for just
understanding uevents you can assume there are no sockets with
NETLINK_F_LISTEN_ALL_NSID set.

Eric

^ permalink raw reply

* Re: TCP one-by-one acking - RFC interpretation question
From: Eric Dumazet @ 2018-04-06 16:49 UTC (permalink / raw)
  To: Michal Kubecek, Eric Dumazet; +Cc: netdev, Yuchung Cheng, Neal Cardwell
In-Reply-To: <20180406150345.i3t7cu5g6dmoc3io@unicorn.suse.cz>

Cc Neal and Yuchung if they missed this thread.

On 04/06/2018 08:03 AM, Michal Kubecek wrote:
> On Fri, Apr 06, 2018 at 05:01:29AM -0700, Eric Dumazet wrote:
>>
>>
>> On 04/06/2018 03:05 AM, Michal Kubecek wrote:
>>> Hello
>>>
>>> I encountered a strange behaviour of some (non-linux) TCP stack which
>>> I believe is incorrect but support engineers from the company producing
>>> it claim is OK.
>>>
>>> Assume a client (sender, Linux 4.4 kernel) sends a stream of MSS sized
>>> segments but segments 2, 4 and 6 do not reach the server (receiver):
>>>
>>>          ACK             SAK             SAK             SAK
>>>       +-------+-------+-------+-------+-------+-------+-------+
>>>       |   1   |   2   |   3   |   4   |   5   |   6   |   7   |
>>>       +-------+-------+-------+-------+-------+-------+-------+
>>>     34273   35701   37129   38557   39985   41413   42841   44269
>>>
>>> When segment 2 is retransmitted after RTO timeout, normal response would
>>> be ACK-ing segment 3 (38557) with SACK for 5 and 7 (39985-41413 and
>>> 42841-44269).
>>>
>>> However, this server stack responds with two separate ACKs:
>>>
>>>   - ACK 37129, SACK 37129-38557 39985-41413 42841-44269
>>>   - ACK 38557, SACK 39985-41413 42841-44269
>>
>> Hmmm... Yes this seems very very wrong and lazy.
>>
>> Have you verified behavior of more recent linux kernel to such threats ?
> 
> No, unfortunately the problem was only encountered by our customer in
> production environment (they tried to reproduce in a test lab but no
> luck). They are running backups to NFS server and it happens from time
> to time (in the order of hours, IIUC). So it would be probably hard to
> let them try with more recent kernel.
> 
> On the other hand, they reported that SLE11 clients (kernel 3.0) do not
> run into this kind of problem. It was originally reported as a
> a regression on migration from SLE11-SP4 (3.0 kernel) to SLE12-SP2 (4.4
> kernel) and the problem was reported as "SLE12-SP2 is ignoring dupacks"
> (which seems to be mostly caused by the switch to RACK).
> 
> It also seems that part of the problem is specific packet loss pattern
> where at some point, many packets are lost in "every second" pattern.
> The customer finally started to investigate this problem and it seems it
> has something to do with their bonding setup (they provided no details,
> my guess is packets are divided over two paths and one of them fails).
> 
>> packetdrill test would be relatively easy to write.
> 
> I'll try but I have very little experience with writing packetdrill
> scripts so it will probably take some time.
> 
>> Regardless of this broken alien stack, we might be able to work around
>> this faster than the vendor is able to fix and deploy a new stack.
>>
>> ( https://en.wikipedia.org/wiki/Robustness_principle )
>> Be conservative in what you do, be liberal in what you accept from
>> others...
> 
> I was thinking about this a bit. "Fixing" the acknowledgment number
> could do the trick but it doesn't feel correct. We might use the fact
> that TSecr of both ACKs above matches TSval of the retransmission which
> triggered them so that RTT calculated from timestamp would be the right
> one. So perhaps something like "prefer timestamp RTT if measured RTT
> seems way too off". But I'm not sure if it couldn't break other use
> cases where (high) measured RTT is actually correct, rather than (low)
> timestamp RTT.
> 
> Michal Kubecek
> 

^ permalink raw reply

* Re: [PATCH 3/9] kbuild: Do not pass arguments to link-vmlinux.sh
From: Jiri Olsa @ 2018-04-06 16:54 UTC (permalink / raw)
  To: Masahiro Yamada
  Cc: Jiri Olsa, Alexei Starovoitov, Daniel Borkmann, lkml, netdev,
	Linux Kbuild mailing list, Quentin Monnet, Eugene Syromiatnikov,
	Jiri Benc, Stanislav Kozina, Jerome Marchand,
	Arnaldo Carvalho de Melo, Michal Marek, Jiri Kosina
In-Reply-To: <CAK7LNATU9_zBH9DpayNOCGsFVL0K+Y8Np0z7wxFuwO3Cf1K_tg@mail.gmail.com>

On Fri, Apr 06, 2018 at 09:59:59AM +0900, Masahiro Yamada wrote:
> 2018-04-06 3:59 GMT+09:00 Jiri Olsa <jolsa@redhat.com>:
> > On Fri, Apr 06, 2018 at 12:50:00AM +0900, Masahiro Yamada wrote:
> >> 2018-04-06 0:16 GMT+09:00 Jiri Olsa <jolsa@kernel.org>:
> >> > There's no need to pass LD* arguments to link-vmlinux.sh,
> >> > because they are passed as variables. The only argument
> >> > the link-vmlinux.sh supports is the 'clean' argument.
> >> >
> >> > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> >> > ---
> >>
> >> Wrong.
> >>
> >> $(LD) $(LDFLAGS) $(LDFLAGS_vmlinux)
> >> exist here so that any change in them
> >> invokes scripts/linkk-vmlinux.sh
> >
> > sry, I can't see that.. but it's just a side fix,
> > which is actually not needed for the rest
> >
> > I'll check on more and address this separately
> 
> 
> The link command is recorded in .vmlinux.cmd

i see.. just for the sake command line,

thanks for explanation
jirka

^ permalink raw reply

* Re: [PATCH] dp83640: Ensure against premature access to PHY registers after reset
From: Esben Haabendal @ 2018-04-06 17:06 UTC (permalink / raw)
  To: David Miller; +Cc: andrew, netdev, richardcochran, f.fainelli, linux-kernel
In-Reply-To: <20180406.111337.1908168293065420432.davem@davemloft.net>

David Miller <davem@davemloft.net> writes:

> From: Andrew Lunn <andrew@lunn.ch>
> Date: Fri, 6 Apr 2018 16:14:10 +0200
>
>> On Fri, Apr 06, 2018 at 04:05:40PM +0200, Esben Haabendal wrote:
>>> From: Esben Haabendal <eha@deif.com>
>>> 
>>> Signed-off-by: Esben Haabendal <eha@deif.com>
>>> ---
>>>  drivers/net/phy/dp83640.c | 17 +++++++++++++++++
>>>  1 file changed, 17 insertions(+)
>>> 
>>> diff --git a/drivers/net/phy/dp83640.c b/drivers/net/phy/dp83640.c
>>> index 654f42d00092..48403170096a 100644
>>> --- a/drivers/net/phy/dp83640.c
>>> +++ b/drivers/net/phy/dp83640.c
>>> @@ -1207,6 +1207,22 @@ static void dp83640_remove(struct phy_device *phydev)
>>>  	kfree(dp83640);
>>>  }
>>>  
>>> +static int dp83640_soft_reset(struct phy_device *phydev)
>>> +{
>>> +	int ret;
>>> +
>>> +	ret = genphy_soft_reset(phydev);
>>> +	if (ret < 0)
>>> +		return ret;
>>> +
>>> +	/* From DP83640 datasheet: "Software driver code must wait 3 us
>>> +	 * following a software reset before allowing further serial MII
>>> +	 * operations with the DP83640." */
>>> +	udelay(3);
>> 
>> Hi Esben
>> 
>> The accuracy of udelay() is not guaranteed. So you probably want to be
>> a bit pessimistic, and use 10.

Ok, will do.

/Esben

^ permalink raw reply

* [PATCH v2] dp83640: Ensure against premature access to PHY registers after reset
From: Esben Haabendal @ 2018-04-06 17:08 UTC (permalink / raw)
  To: netdev; +Cc: andrew, richardcochran, f.fainelli, linux-kernel, Esben Haabendal
In-Reply-To: <20180406140540.13511-1-esben.haabendal@gmail.com>

From: Esben Haabendal <eha@deif.com>

Signed-off-by: Esben Haabendal <eha@deif.com>
---
 drivers/net/phy/dp83640.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/drivers/net/phy/dp83640.c b/drivers/net/phy/dp83640.c
index 654f42d00092..a6c87793d899 100644
--- a/drivers/net/phy/dp83640.c
+++ b/drivers/net/phy/dp83640.c
@@ -1207,6 +1207,23 @@ static void dp83640_remove(struct phy_device *phydev)
 	kfree(dp83640);
 }
 
+static int dp83640_soft_reset(struct phy_device *phydev)
+{
+	int ret;
+
+	ret = genphy_soft_reset(phydev);
+	if (ret < 0)
+		return ret;
+
+	/* From DP83640 datasheet: "Software driver code must wait 3 us
+	 * following a software reset before allowing further serial MII
+	 * operations with the DP83640."
+	 */
+	udelay(10);		/* Taking udelay inaccuracy into account */
+
+	return 0;
+}
+
 static int dp83640_config_init(struct phy_device *phydev)
 {
 	struct dp83640_private *dp83640 = phydev->priv;
@@ -1501,6 +1518,7 @@ static struct phy_driver dp83640_driver = {
 	.flags		= PHY_HAS_INTERRUPT,
 	.probe		= dp83640_probe,
 	.remove		= dp83640_remove,
+	.soft_reset	= dp83640_soft_reset,
 	.config_init	= dp83640_config_init,
 	.ack_interrupt  = dp83640_ack_interrupt,
 	.config_intr    = dp83640_config_intr,
-- 
2.16.3

^ permalink raw reply related

* Re: [PATCH] net: phy: marvell: Enable interrupt function on LED2 pin
From: Esben Haabendal @ 2018-04-06 17:10 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Bhadram Varka, Rasmus Villemoes, Florian Fainelli, open list,
	netdev@vger.kernel.org
In-Reply-To: <20180405144355.GB32663@lunn.ch>

Andrew Lunn <andrew@lunn.ch> writes:

>> diff --git a/drivers/net/phy/marvell.c b/drivers/net/phy/marvell.c index
>> 0e0978d8a0eb..f03a510f1247 100644
>> --- a/drivers/net/phy/marvell.c
>> +++ b/drivers/net/phy/marvell.c
>> @@ -457,6 +457,21 @@ static int marvell_of_reg_init(struct phy_device
>> *phydev) } #endif /* CONFIG_OF_MDIO */
>>  
>> +static int m88e1318_config_intr(struct phy_device *phydev) {
>> +	int err;
>> +
>> +	err = marvell_config_intr(phydev);
>> +	if (err)
>> +		return err;
>> +
>> +	/* Setup LED[2] as interrupt pin (active low) */
>> +	return phy_modify(phydev, MII_88E1318S_PHY_LED_TCR,
>> +			  MII_88E1318S_PHY_LED_TCR_FORCE_INT,
>> +			  MII_88E1318S_PHY_LED_TCR_INTn_ENABLE |
>> +			  MII_88E1318S_PHY_LED_TCR_INT_ACTIVE_LOW);
>> 
>> Can we move this part of the code to m88e1121_config_init() ?
>> 
>> Every time whether we disable or enable the interrupts this part of code
>> will execute.
>
> Yes, doing this once would be better. But please allow the LED pin to
> be used as an LED when not using interrupts. phy_interrupt_is_valid()
> should be involved somehow.

This should be addressed in v2 of the patch which I have already posted.

/Esben

^ permalink raw reply

* [RFC PATCH v3 bpf-next 0/5] bpf/verifier: subprog/func_call simplifications
From: Edward Cree @ 2018-04-06 17:11 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann; +Cc: netdev

By storing subprog boundaries as a subprogno mark on each insn, rather than
 a start (and implicit end) for each subprog, the code handling subprogs can
 in various places be made simpler, more explicit, and more efficient (i.e.
 better asymptotic complexity).  This also lays the ground for possible
 future work in which an extension of the check_subprogs() code to cover
 basic blocks could replace check_cfg(), which currently largely duplicates
 the insn-walking part.

Some other changes were also included to support this:
* Per-subprog info is stored in env->subprog_info, an array of structs,
  rather than several arrays with a common index.
* Subprog numbers and corresponding subprog_info index are now 0-based;
  subprog 0 is the main()-function of the program, starting at insn 0.
* Call graph is now stored in the new bpf_subprog_info struct; used here
  for check_max_stack_depth() but may have other uses too.

Along with this, patch #5 puts parent pointers (used by liveness analysis)
 in the registers instead of the func_state or verifier_state, so that we
 don't need skip_callee() machinery.  This also does the right thing for
 stack slots, so they don't need their own special handling for liveness
 marking either.

Testing done on this version:
* test_verifier
* test_progs (mostly)
* test_kmod.sh
* tested whether processed-insn numbers (hence pruning) change.
  (Couldn't test with programs containing pseudo-calls (e.g.
  test_xdp_noinline.o), because iproute2 rejects them in its bpf lib.)
  I tested with some Cilium object files, and found that in some cases
  the numbers *do* change as compared to bpf-next.
  - bpf_lxc_opt_-DDROP_ALL.o   26713 -> 17753
  - bpf_lxc_opt_-DUNKNOWN.o    36604 -> 23324
  - bpf_netdev.o               10003 -> 9984
  I'm not yet sure why this is, especially as these object files do not
  contain pseudo-calls.
  One possibility I see: in BPF_MOV64_REG handling, we copy the register
  state, which includes the parent pointer.  But that shouldn't matter,
  because we just read-marked the src reg (via check_reg_arg), and we
  write-mark the dst reg immediately after.  Same goes for other places
  we copy reg states around (e.g. stack reads and writes).
  So I don't quite see how this can affect pruning; but _something_ did.

Changes from v2->v3:
* Added RFC tags back since bpf-next is closed right now.
* Split up first patch into three parts to avoid doing too many things at
  once.
* Restore check_subprogs() as a separate pass rather than doing the work
  on-the-fly in do_check().
* Removed changes in jit_subprogs() handling which were there to support
  non-contiguous subprogs, since that's no longer needed.

Changes from v1->v2:
* No longer allows non-contiguous subprogs.
* No longer allows LD_ABS|IND and pseudo-calls in the same prog.

Edward Cree (5):
  bpf/verifier: store subprog info in struct bpf_subprog_info
  bpf/verifier: rewrite subprog boundary detection
  bpf/verifier: update selftests
  bpf/verifier: use call graph to efficiently check max stack depth
  bpf/verifier: per-register parent pointers

 include/linux/bpf_verifier.h                |  21 +-
 kernel/bpf/verifier.c                       | 617 ++++++++++++++--------------
 tools/testing/selftests/bpf/test_verifier.c |  51 ++-
 3 files changed, 354 insertions(+), 335 deletions(-)

^ permalink raw reply

* [RFC PATCH v3 bpf-next 1/5] bpf/verifier: store subprog info in struct bpf_subprog_info
From: Edward Cree @ 2018-04-06 17:13 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann; +Cc: netdev
In-Reply-To: <b445b3c5-04f8-7425-2a3c-9bc54a04add2@solarflare.com>

Per-subprog info is stored in env->subprog_info, an array of structs,
 rather than multiple arrays with a common index.

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 include/linux/bpf_verifier.h |  9 +++++---
 kernel/bpf/verifier.c        | 49 +++++++++++++++++++++++---------------------
 2 files changed, 32 insertions(+), 26 deletions(-)

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 7e61c395fddf..8f70dc181e23 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -173,6 +173,11 @@ static inline bool bpf_verifier_log_needed(const struct bpf_verifier_log *log)
 
 #define BPF_MAX_SUBPROGS 256
 
+struct bpf_subprog_info {
+	u32 start; /* insn idx of function entry point */
+	u16 stack_depth; /* max. stack depth used by this function */
+};
+
 /* single container for all structs
  * one verifier_env per bpf_check() call
  */
@@ -191,9 +196,7 @@ struct bpf_verifier_env {
 	bool seen_direct_write;
 	struct bpf_insn_aux_data *insn_aux_data; /* array of per-insn state */
 	struct bpf_verifier_log log;
-	u32 subprog_starts[BPF_MAX_SUBPROGS];
-	/* computes the stack depth of each bpf function */
-	u16 subprog_stack_depth[BPF_MAX_SUBPROGS + 1];
+	struct bpf_subprog_info subprog_info[BPF_MAX_SUBPROGS + 1];
 	u32 subprog_cnt;
 };
 
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 5dd1dcb902bf..df4d742360d9 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -738,18 +738,19 @@ enum reg_arg_type {
 
 static int cmp_subprogs(const void *a, const void *b)
 {
-	return *(int *)a - *(int *)b;
+	return ((struct bpf_subprog_info *)a)->start -
+	       ((struct bpf_subprog_info *)b)->start;
 }
 
 static int find_subprog(struct bpf_verifier_env *env, int off)
 {
-	u32 *p;
+	struct bpf_subprog_info *p;
 
-	p = bsearch(&off, env->subprog_starts, env->subprog_cnt,
-		    sizeof(env->subprog_starts[0]), cmp_subprogs);
+	p = bsearch(&off, env->subprog_info, env->subprog_cnt,
+		    sizeof(env->subprog_info[0]), cmp_subprogs);
 	if (!p)
 		return -ENOENT;
-	return p - env->subprog_starts;
+	return p - env->subprog_info;
 
 }
 
@@ -769,9 +770,9 @@ static int add_subprog(struct bpf_verifier_env *env, int off)
 		verbose(env, "too many subprograms\n");
 		return -E2BIG;
 	}
-	env->subprog_starts[env->subprog_cnt++] = off;
-	sort(env->subprog_starts, env->subprog_cnt,
-	     sizeof(env->subprog_starts[0]), cmp_subprogs, NULL);
+	env->subprog_info[env->subprog_cnt++].start = off;
+	sort(env->subprog_info, env->subprog_cnt,
+	     sizeof(env->subprog_info[0]), cmp_subprogs, NULL);
 	return 0;
 }
 
@@ -802,14 +803,14 @@ static int check_subprogs(struct bpf_verifier_env *env)
 
 	if (env->log.level > 1)
 		for (i = 0; i < env->subprog_cnt; i++)
-			verbose(env, "func#%d @%d\n", i, env->subprog_starts[i]);
+			verbose(env, "func#%d @%d\n", i, env->subprog_info[i].start);
 
 	/* now check that all jumps are within the same subprog */
 	subprog_start = 0;
 	if (env->subprog_cnt == cur_subprog)
 		subprog_end = insn_cnt;
 	else
-		subprog_end = env->subprog_starts[cur_subprog++];
+		subprog_end = env->subprog_info[cur_subprog++].start;
 	for (i = 0; i < insn_cnt; i++) {
 		u8 code = insn[i].code;
 
@@ -837,7 +838,7 @@ static int check_subprogs(struct bpf_verifier_env *env)
 			if (env->subprog_cnt == cur_subprog)
 				subprog_end = insn_cnt;
 			else
-				subprog_end = env->subprog_starts[cur_subprog++];
+				subprog_end = env->subprog_info[cur_subprog++].start;
 		}
 	}
 	return 0;
@@ -1470,13 +1471,13 @@ static int update_stack_depth(struct bpf_verifier_env *env,
 			      const struct bpf_func_state *func,
 			      int off)
 {
-	u16 stack = env->subprog_stack_depth[func->subprogno];
+	u16 stack = env->subprog_info[func->subprogno].stack_depth;
 
 	if (stack >= -off)
 		return 0;
 
 	/* update known max for given subprogram */
-	env->subprog_stack_depth[func->subprogno] = -off;
+	env->subprog_info[func->subprogno].stack_depth = -off;
 	return 0;
 }
 
@@ -1498,7 +1499,8 @@ static int check_max_stack_depth(struct bpf_verifier_env *env)
 	/* round up to 32-bytes, since this is granularity
 	 * of interpreter stack size
 	 */
-	depth += round_up(max_t(u32, env->subprog_stack_depth[subprog], 1), 32);
+	depth += round_up(max_t(u32, env->subprog_info[subprog].stack_depth, 1),
+			  32);
 	if (depth > MAX_BPF_STACK) {
 		verbose(env, "combined stack size of %d calls is %d. Too large\n",
 			frame + 1, depth);
@@ -1508,7 +1510,7 @@ static int check_max_stack_depth(struct bpf_verifier_env *env)
 	if (env->subprog_cnt == subprog)
 		subprog_end = insn_cnt;
 	else
-		subprog_end = env->subprog_starts[subprog];
+		subprog_end = env->subprog_info[subprog].start;
 	for (; i < subprog_end; i++) {
 		if (insn[i].code != (BPF_JMP | BPF_CALL))
 			continue;
@@ -1539,7 +1541,8 @@ static int check_max_stack_depth(struct bpf_verifier_env *env)
 	 */
 	if (frame == 0)
 		return 0;
-	depth -= round_up(max_t(u32, env->subprog_stack_depth[subprog], 1), 32);
+	depth -= round_up(max_t(u32, env->subprog_info[subprog].stack_depth, 1),
+			  32);
 	frame--;
 	i = ret_insn[frame];
 	subprog = ret_prog[frame];
@@ -1559,7 +1562,7 @@ static int get_callee_stack_depth(struct bpf_verifier_env *env,
 		return -EFAULT;
 	}
 	subprog++;
-	return env->subprog_stack_depth[subprog];
+	return env->subprog_info[subprog].stack_depth;
 }
 #endif
 
@@ -4860,14 +4863,14 @@ static int do_check(struct bpf_verifier_env *env)
 	verbose(env, "processed %d insns (limit %d), stack depth ",
 		insn_processed, BPF_COMPLEXITY_LIMIT_INSNS);
 	for (i = 0; i < env->subprog_cnt + 1; i++) {
-		u32 depth = env->subprog_stack_depth[i];
+		u32 depth = env->subprog_info[i].stack_depth;
 
 		verbose(env, "%d", depth);
 		if (i + 1 < env->subprog_cnt + 1)
 			verbose(env, "+");
 	}
 	verbose(env, "\n");
-	env->prog->aux->stack_depth = env->subprog_stack_depth[0];
+	env->prog->aux->stack_depth = env->subprog_info[0].stack_depth;
 	return 0;
 }
 
@@ -5074,9 +5077,9 @@ static void adjust_subprog_starts(struct bpf_verifier_env *env, u32 off, u32 len
 	if (len == 1)
 		return;
 	for (i = 0; i < env->subprog_cnt; i++) {
-		if (env->subprog_starts[i] < off)
+		if (env->subprog_info[i].start < off)
 			continue;
-		env->subprog_starts[i] += len - 1;
+		env->subprog_info[i].start += len - 1;
 	}
 }
 
@@ -5274,7 +5277,7 @@ static int jit_subprogs(struct bpf_verifier_env *env)
 		if (env->subprog_cnt == i)
 			subprog_end = prog->len;
 		else
-			subprog_end = env->subprog_starts[i];
+			subprog_end = env->subprog_info[i].start;
 
 		len = subprog_end - subprog_start;
 		func[i] = bpf_prog_alloc(bpf_prog_size(len), GFP_USER);
@@ -5291,7 +5294,7 @@ static int jit_subprogs(struct bpf_verifier_env *env)
 		 * Long term would need debug info to populate names
 		 */
 		func[i]->aux->name[0] = 'F';
-		func[i]->aux->stack_depth = env->subprog_stack_depth[i];
+		func[i]->aux->stack_depth = env->subprog_info[i].stack_depth;
 		func[i]->jit_requested = 1;
 		func[i] = bpf_int_jit_compile(func[i]);
 		if (!func[i]->jited) {

^ permalink raw reply related

* [RFC PATCH v3 bpf-next 3/5] bpf/verifier: update selftests
From: Edward Cree @ 2018-04-06 17:14 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann; +Cc: netdev
In-Reply-To: <b445b3c5-04f8-7425-2a3c-9bc54a04add2@solarflare.com>

Error messages for some bad programs have changed.
Also added a test ("calls: interleaved functions") to ensure that subprogs
 are required to be contiguous.
It wasn't entirely clear to me what "calls: wrong recursive calls" was
 meant to test for, since all of the JMP|CALL insns are unreachable.  I've
 changed it so that they are now reachable, which causes static back-edges
 to be detected (since that, like insn reachability, is now tested before
 subprog boundaries are determined).

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 tools/testing/selftests/bpf/test_verifier.c | 51 ++++++++++++++++++-----------
 1 file changed, 32 insertions(+), 19 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index 3e7718b1a9ae..d53522d20072 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -646,7 +646,7 @@ static struct bpf_test tests[] = {
 		.insns = {
 			BPF_ALU64_REG(BPF_MOV, BPF_REG_0, BPF_REG_2),
 		},
-		.errstr = "not an exit",
+		.errstr = "no exit/jump at end of program",
 		.result = REJECT,
 	},
 	{
@@ -9442,13 +9442,13 @@ static struct bpf_test tests[] = {
 			BPF_EXIT_INSN(),
 		},
 		.prog_type = BPF_PROG_TYPE_TRACEPOINT,
-		.errstr = "last insn is not an exit or jmp",
+		.errstr = "no exit/jump at end of subprog 0 (insn 0)",
 		.result = REJECT,
 	},
 	{
 		"calls: wrong recursive calls",
 		.insns = {
-			BPF_JMP_IMM(BPF_JA, 0, 0, 4),
+			BPF_JMP_IMM(BPF_JA, 0, 0, 3),
 			BPF_JMP_IMM(BPF_JA, 0, 0, 4),
 			BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 1, 0, -2),
 			BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 1, 0, -2),
@@ -9457,7 +9457,7 @@ static struct bpf_test tests[] = {
 			BPF_EXIT_INSN(),
 		},
 		.prog_type = BPF_PROG_TYPE_TRACEPOINT,
-		.errstr = "jump out of range",
+		.errstr = "jump from insn 0 to 4 crosses",
 		.result = REJECT,
 	},
 	{
@@ -9508,7 +9508,7 @@ static struct bpf_test tests[] = {
 			BPF_EXIT_INSN(),
 		},
 		.prog_type = BPF_PROG_TYPE_TRACEPOINT,
-		.errstr = "jump out of range",
+		.errstr = "jump from insn 1 to 5 crosses",
 		.result = REJECT,
 	},
 	{
@@ -9787,7 +9787,7 @@ static struct bpf_test tests[] = {
 			BPF_EXIT_INSN(),
 		},
 		.prog_type = BPF_PROG_TYPE_SCHED_CLS,
-		.errstr = "jump out of range from insn 1 to 4",
+		.errstr = "jump from insn 1 to 4 crosses",
 		.result = REJECT,
 	},
 	{
@@ -9803,13 +9803,12 @@ static struct bpf_test tests[] = {
 			BPF_ALU64_REG(BPF_ADD, BPF_REG_7, BPF_REG_0),
 			BPF_MOV64_REG(BPF_REG_0, BPF_REG_7),
 			BPF_EXIT_INSN(),
-			BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
-				    offsetof(struct __sk_buff, len)),
+			BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_1, 8),
 			BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, -3),
 			BPF_EXIT_INSN(),
 		},
 		.prog_type = BPF_PROG_TYPE_TRACEPOINT,
-		.errstr = "jump out of range from insn 11 to 9",
+		.errstr = "jump from insn 11 to 9 crosses",
 		.result = REJECT,
 	},
 	{
@@ -9861,7 +9860,7 @@ static struct bpf_test tests[] = {
 			BPF_EXIT_INSN(),
 		},
 		.prog_type = BPF_PROG_TYPE_TRACEPOINT,
-		.errstr = "invalid destination",
+		.errstr = "call to invalid destination",
 		.result = REJECT,
 	},
 	{
@@ -9873,7 +9872,7 @@ static struct bpf_test tests[] = {
 			BPF_EXIT_INSN(),
 		},
 		.prog_type = BPF_PROG_TYPE_TRACEPOINT,
-		.errstr = "invalid destination",
+		.errstr = "call to invalid destination",
 		.result = REJECT,
 	},
 	{
@@ -9886,7 +9885,7 @@ static struct bpf_test tests[] = {
 			BPF_EXIT_INSN(),
 		},
 		.prog_type = BPF_PROG_TYPE_TRACEPOINT,
-		.errstr = "jump out of range",
+		.errstr = "jump from insn 3 to 1 crosses",
 		.result = REJECT,
 	},
 	{
@@ -9899,7 +9898,7 @@ static struct bpf_test tests[] = {
 			BPF_EXIT_INSN(),
 		},
 		.prog_type = BPF_PROG_TYPE_TRACEPOINT,
-		.errstr = "jump out of range",
+		.errstr = "jump from insn 0 to 4 crosses",
 		.result = REJECT,
 	},
 	{
@@ -9913,7 +9912,7 @@ static struct bpf_test tests[] = {
 			BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, 0, -2),
 		},
 		.prog_type = BPF_PROG_TYPE_TRACEPOINT,
-		.errstr = "not an exit",
+		.errstr = "no exit/jump at end of program",
 		.result = REJECT,
 	},
 	{
@@ -9927,7 +9926,7 @@ static struct bpf_test tests[] = {
 			BPF_EXIT_INSN(),
 		},
 		.prog_type = BPF_PROG_TYPE_TRACEPOINT,
-		.errstr = "last insn",
+		.errstr = "no exit/jump at end of subprog",
 		.result = REJECT,
 	},
 	{
@@ -9942,7 +9941,7 @@ static struct bpf_test tests[] = {
 			BPF_EXIT_INSN(),
 		},
 		.prog_type = BPF_PROG_TYPE_TRACEPOINT,
-		.errstr = "last insn",
+		.errstr = "no exit/jump at end of subprog",
 		.result = REJECT,
 	},
 	{
@@ -9982,12 +9981,11 @@ static struct bpf_test tests[] = {
 			BPF_ALU64_REG(BPF_ADD, BPF_REG_7, BPF_REG_0),
 			BPF_MOV64_REG(BPF_REG_0, BPF_REG_7),
 			BPF_MOV64_REG(BPF_REG_0, BPF_REG_0),
-			BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
-				    offsetof(struct __sk_buff, len)),
+			BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_1, 8),
 			BPF_EXIT_INSN(),
 		},
 		.prog_type = BPF_PROG_TYPE_TRACEPOINT,
-		.errstr = "not an exit",
+		.errstr = "no exit/jump at end of subprog",
 		.result = REJECT,
 	},
 	{
@@ -11423,6 +11421,21 @@ static struct bpf_test tests[] = {
 		.errstr = "BPF_XADD stores into R2 packet",
 		.prog_type = BPF_PROG_TYPE_XDP,
 	},
+	{
+		"calls: interleaved functions",
+		.insns = {
+			BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 1, 0, 2),
+			BPF_MOV64_IMM(BPF_REG_0, 1),
+			BPF_JMP_IMM(BPF_JA, 0, 0, 2),
+			BPF_MOV64_IMM(BPF_REG_0, 2),
+			BPF_JMP_IMM(BPF_JA, 0, 0, 1),
+			BPF_EXIT_INSN(),
+			BPF_EXIT_INSN(),
+		},
+		.prog_type = BPF_PROG_TYPE_TRACEPOINT,
+		.errstr = "jump from insn 2 to 5 crosses",
+		.result = REJECT,
+	},
 };
 
 static int probe_filter_length(const struct bpf_insn *fp)

^ permalink raw reply related

* [RFC PATCH v3 bpf-next 2/5] bpf/verifier: rewrite subprog boundary detection
From: Edward Cree @ 2018-04-06 17:13 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann; +Cc: netdev
In-Reply-To: <b445b3c5-04f8-7425-2a3c-9bc54a04add2@solarflare.com>

By storing a subprogno in each insn's aux data, we avoid the need to keep
 the list of subprog starts sorted or bsearch() it in find_subprog().
Also, get rid of the weird one-based indexing of subprog numbers.

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 include/linux/bpf_verifier.h |   3 +-
 kernel/bpf/verifier.c        | 284 ++++++++++++++++++++++++++-----------------
 2 files changed, 177 insertions(+), 110 deletions(-)

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 8f70dc181e23..17990dd56e65 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -146,6 +146,7 @@ struct bpf_insn_aux_data {
 		s32 call_imm;			/* saved imm field of call insn */
 	};
 	int ctx_field_size; /* the ctx field size for load insn, maybe 0 */
+	u16 subprogno; /* subprog in which this insn resides */
 	bool seen; /* this insn was processed by the verifier */
 };
 
@@ -196,7 +197,7 @@ struct bpf_verifier_env {
 	bool seen_direct_write;
 	struct bpf_insn_aux_data *insn_aux_data; /* array of per-insn state */
 	struct bpf_verifier_log log;
-	struct bpf_subprog_info subprog_info[BPF_MAX_SUBPROGS + 1];
+	struct bpf_subprog_info subprog_info[BPF_MAX_SUBPROGS];
 	u32 subprog_cnt;
 };
 
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index df4d742360d9..587eb445bfa2 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -736,109 +736,171 @@ enum reg_arg_type {
 	DST_OP_NO_MARK	/* same as above, check only, don't mark */
 };
 
-static int cmp_subprogs(const void *a, const void *b)
+static int find_subprog(struct bpf_verifier_env *env, int insn_idx)
 {
-	return ((struct bpf_subprog_info *)a)->start -
-	       ((struct bpf_subprog_info *)b)->start;
-}
-
-static int find_subprog(struct bpf_verifier_env *env, int off)
-{
-	struct bpf_subprog_info *p;
-
-	p = bsearch(&off, env->subprog_info, env->subprog_cnt,
-		    sizeof(env->subprog_info[0]), cmp_subprogs);
-	if (!p)
-		return -ENOENT;
-	return p - env->subprog_info;
+	int insn_cnt = env->prog->len;
+	u32 subprogno;
 
+	if (insn_idx >= insn_cnt || insn_idx < 0) {
+		verbose(env, "find_subprog of invalid insn_idx %d\n", insn_idx);
+		return -EINVAL;
+	}
+	subprogno = env->insn_aux_data[insn_idx].subprogno;
+	/* validate that we are at start of subprog */
+	if (env->subprog_info[subprogno].start != insn_idx) {
+		verbose(env, "insn_idx %d is in subprog %u but that starts at %d\n",
+			insn_idx, subprogno, env->subprog_info[subprogno].start);
+		return -EINVAL;
+	}
+	return subprogno;
 }
 
 static int add_subprog(struct bpf_verifier_env *env, int off)
 {
 	int insn_cnt = env->prog->len;
+	struct bpf_subprog_info *info;
 	int ret;
 
 	if (off >= insn_cnt || off < 0) {
 		verbose(env, "call to invalid destination\n");
 		return -EINVAL;
 	}
-	ret = find_subprog(env, off);
-	if (ret >= 0)
-		return 0;
-	if (env->subprog_cnt >= BPF_MAX_SUBPROGS) {
-		verbose(env, "too many subprograms\n");
-		return -E2BIG;
+	ret = env->insn_aux_data[off].subprogno;
+	/* At the time add_subprog is called, only (some) entry points are
+	 * marked with an aux->subprogno; 'body' insns aren't.  Thus, a
+	 * subprogno of 0 means either "not yet marked as an entry point" or
+	 * "entry point of main(), i.e. insn 0".
+	 * However, a call to insn 0 is never legal (it would necessarily imply
+	 * recursion, which check_cfg will catch), therefore we can assume that
+	 * we will only be called with off == 0 once, and in that case we
+	 * should go ahead and add the subprog entry.
+	 */
+	if (!ret) {
+		if (env->subprog_cnt >= BPF_MAX_SUBPROGS) {
+			verbose(env, "too many subprograms\n");
+			return -E2BIG;
+		}
+		ret = env->subprog_cnt++;
+		info = &env->subprog_info[ret];
+		info->start = off;
+		info->stack_depth = 0;
+		env->insn_aux_data[off].subprogno = ret;
 	}
-	env->subprog_info[env->subprog_cnt++].start = off;
-	sort(env->subprog_info, env->subprog_cnt,
-	     sizeof(env->subprog_info[0]), cmp_subprogs, NULL);
-	return 0;
+	return ret;
 }
 
 static int check_subprogs(struct bpf_verifier_env *env)
 {
-	int i, ret, subprog_start, subprog_end, off, cur_subprog = 0;
-	struct bpf_insn *insn = env->prog->insnsi;
-	int insn_cnt = env->prog->len;
+	struct bpf_insn_aux_data *aux = env->insn_aux_data;
+	struct bpf_insn *insns = env->prog->insnsi;
+	int insn_cnt = env->prog->len, i, err;
+	int cur_subprog;
 
-	/* determine subprog starts. The end is one before the next starts */
-	for (i = 0; i < insn_cnt; i++) {
-		if (insn[i].code != (BPF_JMP | BPF_CALL))
-			continue;
-		if (insn[i].src_reg != BPF_PSEUDO_CALL)
-			continue;
-		if (!env->allow_ptr_leaks) {
-			verbose(env, "function calls to other bpf functions are allowed for root only\n");
-			return -EPERM;
-		}
-		if (bpf_prog_is_dev_bound(env->prog->aux)) {
-			verbose(env, "function calls in offloaded programs are not supported yet\n");
-			return -EINVAL;
+	/* Find subprog starts */
+	err = add_subprog(env, 0); /* subprog 0, main(), starts at insn 0 */
+	if (err < 0)
+		return err;
+	for (i = 0; i < insn_cnt; i++)
+		if (insns[i].code == (BPF_JMP | BPF_CALL) &&
+		    insns[i].src_reg == BPF_PSEUDO_CALL) {
+			if (!env->allow_ptr_leaks) {
+				verbose(env, "function calls to other bpf functions are allowed for root only\n");
+				return -EPERM;
+			}
+			if (bpf_prog_is_dev_bound(env->prog->aux)) {
+				verbose(env, "function calls in offloaded programs are not supported yet\n");
+				return -EINVAL;
+			}
+			/* add_subprog marks the callee entry point with the
+			 * new subprogno.
+			 */
+			err = add_subprog(env, i + insns[i].imm + 1);
+			if (err < 0)
+				return err;
 		}
-		ret = add_subprog(env, i + insn[i].imm + 1);
-		if (ret < 0)
-			return ret;
-	}
 
 	if (env->log.level > 1)
 		for (i = 0; i < env->subprog_cnt; i++)
 			verbose(env, "func#%d @%d\n", i, env->subprog_info[i].start);
 
-	/* now check that all jumps are within the same subprog */
-	subprog_start = 0;
-	if (env->subprog_cnt == cur_subprog)
-		subprog_end = insn_cnt;
-	else
-		subprog_end = env->subprog_info[cur_subprog++].start;
+	/* Fill in subprogno on each insn of function body, on the assumption
+	 * that each subprog is a contiguous block of insns.  This will be
+	 * validated by the next step
+	 */
+	cur_subprog = 0;
+	for (i = 0; i < insn_cnt; i++)
+		if (aux[i].subprogno)
+			cur_subprog = aux[i].subprogno;
+		else
+			aux[i].subprogno = cur_subprog;
+
+	/* Check that control never flows from one subprog to another except by
+	 * a pseudo-call insn.
+	 */
 	for (i = 0; i < insn_cnt; i++) {
-		u8 code = insn[i].code;
-
-		if (BPF_CLASS(code) != BPF_JMP)
-			goto next;
-		if (BPF_OP(code) == BPF_EXIT || BPF_OP(code) == BPF_CALL)
-			goto next;
-		off = i + insn[i].off + 1;
-		if (off < subprog_start || off >= subprog_end) {
-			verbose(env, "jump out of range from insn %d to %d\n", i, off);
-			return -EINVAL;
+		bool fallthrough = false, jump = false;
+		u8 opcode = BPF_OP(insns[i].code);
+		int target;
+
+		/* Determine where control flow from this insn goes */
+		if (BPF_CLASS(insns[i].code) != BPF_JMP) {
+			fallthrough = true;
+		} else {
+			switch (opcode) {
+			case BPF_EXIT:
+				/* This insn doesn't go anywhere.  If we're in
+				 * a subprog, then the return target is handled
+				 * as a fallthrough on the call insn, so no need
+				 * to worry about it here.
+				 */
+				continue;
+			case BPF_CALL:
+				/* If pseudo-call, already handled marking the
+				 * callee.  Both kinds of call will eventually
+				 * return and pass to the following insn.
+				 */
+				fallthrough = true;
+				break;
+			case BPF_JA:
+				/* unconditional jump; mark target */
+				jump = true;
+				target = i + insns[i].off + 1;
+				break;
+			default:
+				/* conditional jump; branch both ways */
+				fallthrough = true;
+				jump = true;
+				target = i + insns[i].off + 1;
+				break;
+			}
 		}
-next:
-		if (i == subprog_end - 1) {
-			/* to avoid fall-through from one subprog into another
-			 * the last insn of the subprog should be either exit
-			 * or unconditional jump back
-			 */
-			if (code != (BPF_JMP | BPF_EXIT) &&
-			    code != (BPF_JMP | BPF_JA)) {
-				verbose(env, "last insn is not an exit or jmp\n");
+
+		/* Check that that control flow doesn't leave the subprog */
+		cur_subprog = aux[i].subprogno;
+		if (fallthrough) {
+			if (i + 1 >= insn_cnt) {
+				verbose(env, "no exit/jump at end of program (insn %d)\n",
+					i);
+				return -EINVAL;
+			}
+			if (aux[i + 1].subprogno != cur_subprog) {
+				verbose(env, "no exit/jump at end of subprog %d (insn %d)\n",
+					cur_subprog, i);
+				return -EINVAL;
+			}
+		}
+		if (jump) {
+			if (target < 0 || target >= insn_cnt) {
+				verbose(env, "jump out of range from insn %d to %d\n",
+					i, target);
+				return -EINVAL;
+			}
+			if (aux[target].subprogno != cur_subprog) {
+				verbose(env, "jump from insn %d to %d crosses from subprog %d to %d\n",
+					i, target, cur_subprog,
+					aux[target].subprogno);
 				return -EINVAL;
 			}
-			subprog_start = subprog_end;
-			if (env->subprog_cnt == cur_subprog)
-				subprog_end = insn_cnt;
-			else
-				subprog_end = env->subprog_info[cur_subprog++].start;
 		}
 	}
 	return 0;
@@ -1489,7 +1551,8 @@ static int update_stack_depth(struct bpf_verifier_env *env,
  */
 static int check_max_stack_depth(struct bpf_verifier_env *env)
 {
-	int depth = 0, frame = 0, subprog = 0, i = 0, subprog_end;
+	int depth = 0, frame = 0, subprog = 0, i = 0;
+	struct bpf_insn_aux_data *aux = env->insn_aux_data;
 	struct bpf_insn *insn = env->prog->insnsi;
 	int insn_cnt = env->prog->len;
 	int ret_insn[MAX_CALL_FRAMES];
@@ -1507,11 +1570,7 @@ static int check_max_stack_depth(struct bpf_verifier_env *env)
 		return -EACCES;
 	}
 continue_func:
-	if (env->subprog_cnt == subprog)
-		subprog_end = insn_cnt;
-	else
-		subprog_end = env->subprog_info[subprog].start;
-	for (; i < subprog_end; i++) {
+	for (; i < insn_cnt && aux[i].subprogno == subprog; i++) {
 		if (insn[i].code != (BPF_JMP | BPF_CALL))
 			continue;
 		if (insn[i].src_reg != BPF_PSEUDO_CALL)
@@ -1528,7 +1587,6 @@ static int check_max_stack_depth(struct bpf_verifier_env *env)
 				  i);
 			return -EFAULT;
 		}
-		subprog++;
 		frame++;
 		if (frame >= MAX_CALL_FRAMES) {
 			WARN_ONCE(1, "verifier bug. Call stack is too deep\n");
@@ -1561,7 +1619,6 @@ static int get_callee_stack_depth(struct bpf_verifier_env *env,
 			  start);
 		return -EFAULT;
 	}
-	subprog++;
 	return env->subprog_info[subprog].stack_depth;
 }
 #endif
@@ -2100,7 +2157,7 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
 	case BPF_FUNC_tail_call:
 		if (map->map_type != BPF_MAP_TYPE_PROG_ARRAY)
 			goto error;
-		if (env->subprog_cnt) {
+		if (env->subprog_cnt > 1) {
 			verbose(env, "tail_calls are not allowed in programs with bpf-to-bpf calls\n");
 			return -EINVAL;
 		}
@@ -2245,6 +2302,9 @@ static int check_func_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 	}
 
 	target_insn = *insn_idx + insn->imm;
+	/* We will increment insn_idx (PC) in do_check() after handling this
+	 * call insn, so the actual start of the function is target + 1.
+	 */
 	subprog = find_subprog(env, target_insn + 1);
 	if (subprog < 0) {
 		verbose(env, "verifier bug. No program starts at insn %d\n",
@@ -2272,7 +2332,7 @@ static int check_func_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 			/* remember the callsite, it will be used by bpf_exit */
 			*insn_idx /* callsite */,
 			state->curframe + 1 /* frameno within this callchain */,
-			subprog + 1 /* subprog number within this prog */);
+			subprog /* subprog number within this prog */);
 
 	/* copy r1 - r5 args that callee can access */
 	for (i = BPF_REG_1; i <= BPF_REG_5; i++)
@@ -3831,7 +3891,7 @@ static int check_ld_abs(struct bpf_verifier_env *env, struct bpf_insn *insn)
 		return -EINVAL;
 	}
 
-	if (env->subprog_cnt) {
+	if (env->subprog_cnt > 1) {
 		/* when program has LD_ABS insn JITs and interpreter assume
 		 * that r1 == ctx == skb which is not the case for callees
 		 * that can have arbitrary arguments. It's problematic
@@ -4020,10 +4080,6 @@ static int check_cfg(struct bpf_verifier_env *env)
 	int ret = 0;
 	int i, t;
 
-	ret = check_subprogs(env);
-	if (ret < 0)
-		return ret;
-
 	insn_state = kcalloc(insn_cnt, sizeof(int), GFP_KERNEL);
 	if (!insn_state)
 		return -ENOMEM;
@@ -4862,11 +4918,11 @@ static int do_check(struct bpf_verifier_env *env)
 
 	verbose(env, "processed %d insns (limit %d), stack depth ",
 		insn_processed, BPF_COMPLEXITY_LIMIT_INSNS);
-	for (i = 0; i < env->subprog_cnt + 1; i++) {
+	for (i = 0; i < env->subprog_cnt; i++) {
 		u32 depth = env->subprog_info[i].stack_depth;
 
 		verbose(env, "%d", depth);
-		if (i + 1 < env->subprog_cnt + 1)
+		if (i + 1 < env->subprog_cnt)
 			verbose(env, "+");
 	}
 	verbose(env, "\n");
@@ -5238,12 +5294,12 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
 static int jit_subprogs(struct bpf_verifier_env *env)
 {
 	struct bpf_prog *prog = env->prog, **func, *tmp;
-	int i, j, subprog_start, subprog_end = 0, len, subprog;
 	struct bpf_insn *insn;
 	void *old_bpf_func;
+	int i, j, subprog;
 	int err = -ENOMEM;
 
-	if (env->subprog_cnt == 0)
+	if (env->subprog_cnt <= 1)
 		return 0;
 
 	for (i = 0, insn = prog->insnsi; i < prog->len; i++, insn++) {
@@ -5259,7 +5315,7 @@ static int jit_subprogs(struct bpf_verifier_env *env)
 		/* temporarily remember subprog id inside insn instead of
 		 * aux_data, since next loop will split up all insns into funcs
 		 */
-		insn->off = subprog + 1;
+		insn->off = subprog;
 		/* remember original imm in case JIT fails and fallback
 		 * to interpreter will be needed
 		 */
@@ -5268,23 +5324,24 @@ static int jit_subprogs(struct bpf_verifier_env *env)
 		insn->imm = 1;
 	}
 
-	func = kzalloc(sizeof(prog) * (env->subprog_cnt + 1), GFP_KERNEL);
+	func = kzalloc(sizeof(prog) * env->subprog_cnt, GFP_KERNEL);
 	if (!func)
 		return -ENOMEM;
 
-	for (i = 0; i <= env->subprog_cnt; i++) {
-		subprog_start = subprog_end;
-		if (env->subprog_cnt == i)
-			subprog_end = prog->len;
-		else
-			subprog_end = env->subprog_info[i].start;
+	for (i = 0; i < env->subprog_cnt; i++) {
+		struct bpf_subprog_info *info = &env->subprog_info[i];
+		int len, end = prog->len;
+
+		if (i + 1 < env->subprog_cnt)
+			end = info[1].start;
+		len = end - info->start;
 
-		len = subprog_end - subprog_start;
 		func[i] = bpf_prog_alloc(bpf_prog_size(len), GFP_USER);
 		if (!func[i])
 			goto out_free;
-		memcpy(func[i]->insnsi, &prog->insnsi[subprog_start],
+		memcpy(func[i]->insnsi, prog->insnsi + info->start,
 		       len * sizeof(struct bpf_insn));
+
 		func[i]->type = prog->type;
 		func[i]->len = len;
 		if (bpf_prog_calc_tag(func[i]))
@@ -5307,7 +5364,7 @@ static int jit_subprogs(struct bpf_verifier_env *env)
 	 * now populate all bpf_calls with correct addresses and
 	 * run last pass of JIT
 	 */
-	for (i = 0; i <= env->subprog_cnt; i++) {
+	for (i = 0; i < env->subprog_cnt; i++) {
 		insn = func[i]->insnsi;
 		for (j = 0; j < func[i]->len; j++, insn++) {
 			if (insn->code != (BPF_JMP | BPF_CALL) ||
@@ -5315,12 +5372,16 @@ static int jit_subprogs(struct bpf_verifier_env *env)
 				continue;
 			subprog = insn->off;
 			insn->off = 0;
+			if (subprog < 0 || subprog >= env->subprog_cnt) {
+				err = -EFAULT;
+				goto out_free;
+			}
 			insn->imm = (u64 (*)(u64, u64, u64, u64, u64))
 				func[subprog]->bpf_func -
 				__bpf_call_base;
 		}
 	}
-	for (i = 0; i <= env->subprog_cnt; i++) {
+	for (i = 0; i < env->subprog_cnt; i++) {
 		old_bpf_func = func[i]->bpf_func;
 		tmp = bpf_int_jit_compile(func[i]);
 		if (tmp != func[i] || func[i]->bpf_func != old_bpf_func) {
@@ -5334,7 +5395,7 @@ static int jit_subprogs(struct bpf_verifier_env *env)
 	/* finally lock prog and jit images for all functions and
 	 * populate kallsysm
 	 */
-	for (i = 0; i <= env->subprog_cnt; i++) {
+	for (i = 0; i < env->subprog_cnt; i++) {
 		bpf_prog_lock_ro(func[i]);
 		bpf_prog_kallsyms_add(func[i]);
 	}
@@ -5351,7 +5412,7 @@ static int jit_subprogs(struct bpf_verifier_env *env)
 			continue;
 		insn->off = env->insn_aux_data[i].call_imm;
 		subprog = find_subprog(env, i + insn->off + 1);
-		addr  = (unsigned long)func[subprog + 1]->bpf_func;
+		addr  = (unsigned long)func[subprog]->bpf_func;
 		addr &= PAGE_MASK;
 		insn->imm = (u64 (*)(u64, u64, u64, u64, u64))
 			    addr - __bpf_call_base;
@@ -5360,10 +5421,10 @@ static int jit_subprogs(struct bpf_verifier_env *env)
 	prog->jited = 1;
 	prog->bpf_func = func[0]->bpf_func;
 	prog->aux->func = func;
-	prog->aux->func_cnt = env->subprog_cnt + 1;
+	prog->aux->func_cnt = env->subprog_cnt;
 	return 0;
 out_free:
-	for (i = 0; i <= env->subprog_cnt; i++)
+	for (i = 0; i < env->subprog_cnt; i++)
 		if (func[i])
 			bpf_jit_free(func[i]);
 	kfree(func);
@@ -5393,6 +5454,7 @@ static int fixup_call_args(struct bpf_verifier_env *env)
 		err = jit_subprogs(env);
 		if (err == 0)
 			return 0;
+		verbose(env, "failed to jit_subprogs %d\n", err);
 	}
 #ifndef CONFIG_BPF_JIT_ALWAYS_ON
 	for (i = 0; i < prog->len; i++, insn++) {
@@ -5682,6 +5744,10 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr)
 
 	env->allow_ptr_leaks = capable(CAP_SYS_ADMIN);
 
+	ret = check_subprogs(env);
+	if (ret < 0)
+		goto skip_full_check;
+
 	ret = check_cfg(env);
 	if (ret < 0)
 		goto skip_full_check;

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox