Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: virtio_net failover and initramfs
From: Siwei Liu @ 2018-08-22  7:27 UTC (permalink / raw)
  To: Harald Hoyer
  Cc: Samudrala, Sridhar, Jiri Pirko, initramfs, Michael S. Tsirkin,
	Netdev, vijay.balakrishna, si-wei liu, liran.alon
In-Reply-To: <87171e54-1d4f-f755-ab90-7c1373268950@redhat.com>

On Wed, Aug 22, 2018 at 12:23 AM, Harald Hoyer <harald@redhat.com> wrote:
> On 22.08.2018 09:17, Siwei Liu wrote:
>> On Tue, Aug 21, 2018 at 6:44 AM, Harald Hoyer <harald@redhat.com> wrote:
>>> On 17.08.2018 21:09, Samudrala, Sridhar wrote:
>>>> On 8/17/2018 2:56 AM, Harald Hoyer wrote:
>>>>> On 17.08.2018 11:51, Harald Hoyer wrote:
>>>>>> On 16.08.2018 00:17, Siwei Liu wrote:
>>>>>>> On Wed, Aug 15, 2018 at 12:05 PM, Samudrala, Sridhar
>>>>>>> <sridhar.samudrala@intel.com> wrote:
>>>>>>>> On 8/14/2018 5:03 PM, Siwei Liu wrote:
>>>>>>>>> Are we sure all userspace apps skip and ignore slave interfaces by
>>>>>>>>> just looking at "IFLA_MASTER" attribute?
>>>>>>>>>
>>>>>>>>> When STANDBY is enabled on virtio-net, a failover master interface
>>>>>>>>> will appear, which automatically enslaves the virtio device. But it is
>>>>>>>>> found out that iSCSI (or any network boot) cannot boot strap over the
>>>>>>>>> new failover interface together with a standby virtio (without any VF
>>>>>>>>> or PT device in place).
>>>>>>>>>
>>>>>>>>> Dracut (initramfs) ends up with timeout and dropping into emergency shell:
>>>>>>>>>
>>>>>>>>> [  228.170425] dracut-initqueue[377]: Warning: dracut-initqueue
>>>>>>>>> timeout - starting timeout scripts
>>>>>>>>> [  228.171788] dracut-initqueue[377]: Warning: Could not boot.
>>>>>>>>>            Starting Dracut Emergency Shell...
>>>>>>>>> Generating "/run/initramfs/rdsosreport.txt"
>>>>>>>>> Entering emergency mode. Exit the shell to continue.
>>>>>>>>> Type "journalctl" to view system logs.
>>>>>>>>> You might want to save "/run/initramfs/rdsosreport.txt" to a USB stick or
>>>>>>>>> /boot
>>>>>>>>> after mounting them and attach it to a bug report.
>>>>>>>>> dracut:/# ip l sh
>>>>>>>>> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
>>>>>>>>> mode DEFAULT group default qlen 1000
>>>>>>>>>       link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>>>>>>>>> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue
>>>>>>>>> state UP mode DEFAULT group default qlen 1000
>>>>>>>>>       link/ether 9a:46:22:ae:33:54 brd ff:ff:ff:ff:ff:ff\
>>>>>>>>> 3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast
>>>>>>>>> master eth0 state UP mode DEFAULT group default qlen 1000
>>>>>>>>>       link/ether 9a:46:22:ae:33:54 brd ff:ff:ff:ff:ff:ff
>>>>>>>>> dracut:/#
>>>>>>>>>
>>>>>>>>> If changing dracut code to ignore eth1 (with IFLA_MASTER attr),
>>>>>>>>> network boot starts to work.
>>>>>>>>
>>>>>>>> Does dracut by default tries to use all the interfaces that are UP?
>>>>>>>>
>>>>>>> Yes. The specific dracut cmdline of our case is "ip=dhcp
>>>>>>> netroot=iscsi:... ", but it's not specific to iscsi boot. And because
>>>>>>> of same MAC address for failover and standby, while dracut tries to
>>>>>>> run DHCP on all interfaces that are up it eventually gets same route
>>>>>>> for each interface. Those conflict route entries kill off the network
>>>>>>> connection.
>>>>>>>
>>>>>>>>> The reason is that dracut has its own means to differentiate virtual
>>>>>>>>> interfaces for network boot: it does not look at IFLA_MASTER and
>>>>>>>>> ignores slave interfaces. Instead, users have to provide explicit
>>>>>>>>> option e.g. bond=eth0,eth1 in the boot line, then dracut would know
>>>>>>>>> the config and ignore the slave interfaces.
>>>>>>>>
>>>>>>>> Isn't it possible to specify the interface that should be used for network
>>>>>>>> boot?
>>>>>>> As I understand it, one can only specify interface name for running
>>>>>>> DHCP but not select interface for network boot.  We want DHCP to run
>>>>>>> on every NIC that is up (excluding the enslaved interfaces), and only
>>>>>>> one of them can get a route entry to the network boot server (ie.g.
>>>>>>> iSCSI target).
>>>>>>>
>>>>>>>>
>>>>>>>>> However, with automatic creation of failover interface that assumption
>>>>>>>>> is no longer true. Can we change dracut to ignore all slave interface
>>>>>>>>> by checking  IFLA_MASTER? I don't think so. It has a large impact to
>>>>>>>>> existing configs.
>>>>>>>>
>>>>>>>> What is the issue with checking for IFLA_MASTER? I guess this is used with
>>>>>>>> team/bonding setups.
>>>>>>> That should be discussed within and determined by the dracut
>>>>>>> community. But the current dracut code doesn't check IFLA_MASTER for
>>>>>>> team or bonding specifically. I guess this change might have broader
>>>>>>> impact to existing userspace that might be already relying on the
>>>>>>> current behaviour.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> -Siwei
>>>>>> Is there a sysfs flag for IFF_SLAVE? Or any "ip" output I can use to detect, that it is a IFF_SLAVE?
>>>>>>
>>>>> Oh, it's the other way around.. dracut should ignore "master" (eth1).
>>>> In the above example eth0 is the net_failover device and eth1 is the lower virtio_net device.
>>>> "ip" output of eth1 shows "master eth0". It indicates that eth0 is its upper/master device.
>>>> This information can also be obtained via sysfs too. /sys/class/net/eth1/upper_eth0
>>>>>
>>>>> Can the master enslave the "eth0", if it is already "UP" and busy later on?
>>>> eth0 is the master/failover device and eth1 gets registered as its slave via NETDEV_REGISTER event.
>>>> dracut should ignore eth1 in this setup.
>>>
>>>
>>> Care to test, if that fixes your case?
>>> https://github.com/dracutdevs/dracut/pull/450/files
>>
>> Sorry, I forgot to mention that some of our setups do not have
>> 'ip=dhcp' explicitly specified, but are still able to boot from iSCSI
>> target using "netroot=iscsi:...". Move the iface_is_enslaved check out
>> from the parsing of ip= lines?
>>
>> -Siwei
>>
>
> Does that mean, you want to use the enslaved interface?

Not really. I would like to ignore enslaved interfaces *even if ip= is
not specified*.

-Siwei

^ permalink raw reply

* Re: virtio_net failover and initramfs
From: Harald Hoyer @ 2018-08-22  7:23 UTC (permalink / raw)
  To: Siwei Liu
  Cc: Samudrala, Sridhar, Jiri Pirko, initramfs, Michael S. Tsirkin,
	Netdev, vijay.balakrishna, si-wei liu, liran.alon
In-Reply-To: <CADGSJ22QksMTm3CuFXUVAGoiqy5VVuGAG9FHTRy6dt60gJ7KSA@mail.gmail.com>

On 22.08.2018 09:17, Siwei Liu wrote:
> On Tue, Aug 21, 2018 at 6:44 AM, Harald Hoyer <harald@redhat.com> wrote:
>> On 17.08.2018 21:09, Samudrala, Sridhar wrote:
>>> On 8/17/2018 2:56 AM, Harald Hoyer wrote:
>>>> On 17.08.2018 11:51, Harald Hoyer wrote:
>>>>> On 16.08.2018 00:17, Siwei Liu wrote:
>>>>>> On Wed, Aug 15, 2018 at 12:05 PM, Samudrala, Sridhar
>>>>>> <sridhar.samudrala@intel.com> wrote:
>>>>>>> On 8/14/2018 5:03 PM, Siwei Liu wrote:
>>>>>>>> Are we sure all userspace apps skip and ignore slave interfaces by
>>>>>>>> just looking at "IFLA_MASTER" attribute?
>>>>>>>>
>>>>>>>> When STANDBY is enabled on virtio-net, a failover master interface
>>>>>>>> will appear, which automatically enslaves the virtio device. But it is
>>>>>>>> found out that iSCSI (or any network boot) cannot boot strap over the
>>>>>>>> new failover interface together with a standby virtio (without any VF
>>>>>>>> or PT device in place).
>>>>>>>>
>>>>>>>> Dracut (initramfs) ends up with timeout and dropping into emergency shell:
>>>>>>>>
>>>>>>>> [  228.170425] dracut-initqueue[377]: Warning: dracut-initqueue
>>>>>>>> timeout - starting timeout scripts
>>>>>>>> [  228.171788] dracut-initqueue[377]: Warning: Could not boot.
>>>>>>>>            Starting Dracut Emergency Shell...
>>>>>>>> Generating "/run/initramfs/rdsosreport.txt"
>>>>>>>> Entering emergency mode. Exit the shell to continue.
>>>>>>>> Type "journalctl" to view system logs.
>>>>>>>> You might want to save "/run/initramfs/rdsosreport.txt" to a USB stick or
>>>>>>>> /boot
>>>>>>>> after mounting them and attach it to a bug report.
>>>>>>>> dracut:/# ip l sh
>>>>>>>> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
>>>>>>>> mode DEFAULT group default qlen 1000
>>>>>>>>       link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>>>>>>>> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue
>>>>>>>> state UP mode DEFAULT group default qlen 1000
>>>>>>>>       link/ether 9a:46:22:ae:33:54 brd ff:ff:ff:ff:ff:ff\
>>>>>>>> 3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast
>>>>>>>> master eth0 state UP mode DEFAULT group default qlen 1000
>>>>>>>>       link/ether 9a:46:22:ae:33:54 brd ff:ff:ff:ff:ff:ff
>>>>>>>> dracut:/#
>>>>>>>>
>>>>>>>> If changing dracut code to ignore eth1 (with IFLA_MASTER attr),
>>>>>>>> network boot starts to work.
>>>>>>>
>>>>>>> Does dracut by default tries to use all the interfaces that are UP?
>>>>>>>
>>>>>> Yes. The specific dracut cmdline of our case is "ip=dhcp
>>>>>> netroot=iscsi:... ", but it's not specific to iscsi boot. And because
>>>>>> of same MAC address for failover and standby, while dracut tries to
>>>>>> run DHCP on all interfaces that are up it eventually gets same route
>>>>>> for each interface. Those conflict route entries kill off the network
>>>>>> connection.
>>>>>>
>>>>>>>> The reason is that dracut has its own means to differentiate virtual
>>>>>>>> interfaces for network boot: it does not look at IFLA_MASTER and
>>>>>>>> ignores slave interfaces. Instead, users have to provide explicit
>>>>>>>> option e.g. bond=eth0,eth1 in the boot line, then dracut would know
>>>>>>>> the config and ignore the slave interfaces.
>>>>>>>
>>>>>>> Isn't it possible to specify the interface that should be used for network
>>>>>>> boot?
>>>>>> As I understand it, one can only specify interface name for running
>>>>>> DHCP but not select interface for network boot.  We want DHCP to run
>>>>>> on every NIC that is up (excluding the enslaved interfaces), and only
>>>>>> one of them can get a route entry to the network boot server (ie.g.
>>>>>> iSCSI target).
>>>>>>
>>>>>>>
>>>>>>>> However, with automatic creation of failover interface that assumption
>>>>>>>> is no longer true. Can we change dracut to ignore all slave interface
>>>>>>>> by checking  IFLA_MASTER? I don't think so. It has a large impact to
>>>>>>>> existing configs.
>>>>>>>
>>>>>>> What is the issue with checking for IFLA_MASTER? I guess this is used with
>>>>>>> team/bonding setups.
>>>>>> That should be discussed within and determined by the dracut
>>>>>> community. But the current dracut code doesn't check IFLA_MASTER for
>>>>>> team or bonding specifically. I guess this change might have broader
>>>>>> impact to existing userspace that might be already relying on the
>>>>>> current behaviour.
>>>>>>
>>>>>> Thanks,
>>>>>> -Siwei
>>>>> Is there a sysfs flag for IFF_SLAVE? Or any "ip" output I can use to detect, that it is a IFF_SLAVE?
>>>>>
>>>> Oh, it's the other way around.. dracut should ignore "master" (eth1).
>>> In the above example eth0 is the net_failover device and eth1 is the lower virtio_net device.
>>> "ip" output of eth1 shows "master eth0". It indicates that eth0 is its upper/master device.
>>> This information can also be obtained via sysfs too. /sys/class/net/eth1/upper_eth0
>>>>
>>>> Can the master enslave the "eth0", if it is already "UP" and busy later on?
>>> eth0 is the master/failover device and eth1 gets registered as its slave via NETDEV_REGISTER event.
>>> dracut should ignore eth1 in this setup.
>>
>>
>> Care to test, if that fixes your case?
>> https://github.com/dracutdevs/dracut/pull/450/files
> 
> Sorry, I forgot to mention that some of our setups do not have
> 'ip=dhcp' explicitly specified, but are still able to boot from iSCSI
> target using "netroot=iscsi:...". Move the iface_is_enslaved check out
> from the parsing of ip= lines?
> 
> -Siwei
> 

Does that mean, you want to use the enslaved interface?

^ permalink raw reply

* Re: [bpf-next RFC 0/3] Introduce eBPF flow dissector
From: Daniel Borkmann @ 2018-08-22  7:22 UTC (permalink / raw)
  To: Petar Penkov, Alexei Starovoitov
  Cc: Petar Penkov, Networking, David S . Miller, Alexei Starovoitov,
	simon.horman, Willem de Bruijn
In-Reply-To: <CAG4SDVVPc=uRev7kpo7bAJ2OU5AJHGHdkpJnN+VG4c_XCD4jJQ@mail.gmail.com>

On 08/22/2018 02:19 AM, Petar Penkov wrote:
> On Mon, Aug 20, 2018 at 1:52 PM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>> On Thu, Aug 16, 2018 at 09:44:20AM -0700, Petar Penkov wrote:
>>> From: Petar Penkov <ppenkov@google.com>
[...]
>>> 3/ The BPF program cannot use direct packet access everywhere because it
>>> uses an offset, initially supplied by the flow dissector.  Because the
>>> initial value of this non-constant offset comes from outside of the
>>> program, the verifier does not know what its value is, and it cannot verify
>>> that it is within packet bounds. Therefore, direct packet access programs
>>> get rejected.
>>
>> this part doesn't seem to match the code.
>> direct packet access is allowed and usable even for fragmented skbs.
>> in such case only linear part of skb is in "direct access".
> 
> I am not sure I understand. What I meant was that I use bpf_skb_load_bytes
> rather than direct packet access because the offset at which I read headers,
> nhoff, depends on an initial value that cannot be statically verified - namely
> what __skb_flow_dissect provides. Is there an alternative approach I should
> be taking here, and/or am I misunderstanding direct access?

You can still use direct packet access with it, the only thing you would
need to make sure is that the initial offset is bounded (e.g. test if
larger than some const and then drop the packet, or '& <const>') so that
the verifier can make sure the alu op won't cause overflow, then you can
add this to pkt_data, and later on open an access range with the usual test
like pkt_data' + <const> > pkt_end.

Thanks,
Daniel

^ permalink raw reply

* Re: virtio_net failover and initramfs
From: Siwei Liu @ 2018-08-22  7:17 UTC (permalink / raw)
  To: Harald Hoyer
  Cc: Samudrala, Sridhar, Jiri Pirko, initramfs, Michael S. Tsirkin,
	Netdev, vijay.balakrishna, si-wei liu, liran.alon
In-Reply-To: <aad05ef2-b2d2-e549-06d8-a8a9a454352d@redhat.com>

On Tue, Aug 21, 2018 at 6:44 AM, Harald Hoyer <harald@redhat.com> wrote:
> On 17.08.2018 21:09, Samudrala, Sridhar wrote:
>> On 8/17/2018 2:56 AM, Harald Hoyer wrote:
>>> On 17.08.2018 11:51, Harald Hoyer wrote:
>>>> On 16.08.2018 00:17, Siwei Liu wrote:
>>>>> On Wed, Aug 15, 2018 at 12:05 PM, Samudrala, Sridhar
>>>>> <sridhar.samudrala@intel.com> wrote:
>>>>>> On 8/14/2018 5:03 PM, Siwei Liu wrote:
>>>>>>> Are we sure all userspace apps skip and ignore slave interfaces by
>>>>>>> just looking at "IFLA_MASTER" attribute?
>>>>>>>
>>>>>>> When STANDBY is enabled on virtio-net, a failover master interface
>>>>>>> will appear, which automatically enslaves the virtio device. But it is
>>>>>>> found out that iSCSI (or any network boot) cannot boot strap over the
>>>>>>> new failover interface together with a standby virtio (without any VF
>>>>>>> or PT device in place).
>>>>>>>
>>>>>>> Dracut (initramfs) ends up with timeout and dropping into emergency shell:
>>>>>>>
>>>>>>> [  228.170425] dracut-initqueue[377]: Warning: dracut-initqueue
>>>>>>> timeout - starting timeout scripts
>>>>>>> [  228.171788] dracut-initqueue[377]: Warning: Could not boot.
>>>>>>>            Starting Dracut Emergency Shell...
>>>>>>> Generating "/run/initramfs/rdsosreport.txt"
>>>>>>> Entering emergency mode. Exit the shell to continue.
>>>>>>> Type "journalctl" to view system logs.
>>>>>>> You might want to save "/run/initramfs/rdsosreport.txt" to a USB stick or
>>>>>>> /boot
>>>>>>> after mounting them and attach it to a bug report.
>>>>>>> dracut:/# ip l sh
>>>>>>> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
>>>>>>> mode DEFAULT group default qlen 1000
>>>>>>>       link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>>>>>>> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue
>>>>>>> state UP mode DEFAULT group default qlen 1000
>>>>>>>       link/ether 9a:46:22:ae:33:54 brd ff:ff:ff:ff:ff:ff\
>>>>>>> 3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast
>>>>>>> master eth0 state UP mode DEFAULT group default qlen 1000
>>>>>>>       link/ether 9a:46:22:ae:33:54 brd ff:ff:ff:ff:ff:ff
>>>>>>> dracut:/#
>>>>>>>
>>>>>>> If changing dracut code to ignore eth1 (with IFLA_MASTER attr),
>>>>>>> network boot starts to work.
>>>>>>
>>>>>> Does dracut by default tries to use all the interfaces that are UP?
>>>>>>
>>>>> Yes. The specific dracut cmdline of our case is "ip=dhcp
>>>>> netroot=iscsi:... ", but it's not specific to iscsi boot. And because
>>>>> of same MAC address for failover and standby, while dracut tries to
>>>>> run DHCP on all interfaces that are up it eventually gets same route
>>>>> for each interface. Those conflict route entries kill off the network
>>>>> connection.
>>>>>
>>>>>>> The reason is that dracut has its own means to differentiate virtual
>>>>>>> interfaces for network boot: it does not look at IFLA_MASTER and
>>>>>>> ignores slave interfaces. Instead, users have to provide explicit
>>>>>>> option e.g. bond=eth0,eth1 in the boot line, then dracut would know
>>>>>>> the config and ignore the slave interfaces.
>>>>>>
>>>>>> Isn't it possible to specify the interface that should be used for network
>>>>>> boot?
>>>>> As I understand it, one can only specify interface name for running
>>>>> DHCP but not select interface for network boot.  We want DHCP to run
>>>>> on every NIC that is up (excluding the enslaved interfaces), and only
>>>>> one of them can get a route entry to the network boot server (ie.g.
>>>>> iSCSI target).
>>>>>
>>>>>>
>>>>>>> However, with automatic creation of failover interface that assumption
>>>>>>> is no longer true. Can we change dracut to ignore all slave interface
>>>>>>> by checking  IFLA_MASTER? I don't think so. It has a large impact to
>>>>>>> existing configs.
>>>>>>
>>>>>> What is the issue with checking for IFLA_MASTER? I guess this is used with
>>>>>> team/bonding setups.
>>>>> That should be discussed within and determined by the dracut
>>>>> community. But the current dracut code doesn't check IFLA_MASTER for
>>>>> team or bonding specifically. I guess this change might have broader
>>>>> impact to existing userspace that might be already relying on the
>>>>> current behaviour.
>>>>>
>>>>> Thanks,
>>>>> -Siwei
>>>> Is there a sysfs flag for IFF_SLAVE? Or any "ip" output I can use to detect, that it is a IFF_SLAVE?
>>>>
>>> Oh, it's the other way around.. dracut should ignore "master" (eth1).
>> In the above example eth0 is the net_failover device and eth1 is the lower virtio_net device.
>> "ip" output of eth1 shows "master eth0". It indicates that eth0 is its upper/master device.
>> This information can also be obtained via sysfs too. /sys/class/net/eth1/upper_eth0
>>>
>>> Can the master enslave the "eth0", if it is already "UP" and busy later on?
>> eth0 is the master/failover device and eth1 gets registered as its slave via NETDEV_REGISTER event.
>> dracut should ignore eth1 in this setup.
>
>
> Care to test, if that fixes your case?
> https://github.com/dracutdevs/dracut/pull/450/files

Sorry, I forgot to mention that some of our setups do not have
'ip=dhcp' explicitly specified, but are still able to boot from iSCSI
target using "netroot=iscsi:...". Move the iface_is_enslaved check out
from the parsing of ip= lines?

-Siwei

^ permalink raw reply

* Re: Experimental fix for MSI-X issue on r8169
From: Steve Dodd @ 2018-08-22  6:41 UTC (permalink / raw)
  To: David Miller; +Cc: Jian-Hong Pan, Heiner Kallweit, Lou Reed, netdev, linux
In-Reply-To: <20180821.212423.615788696464102670.davem@davemloft.net>

On 22 August 2018 at 05:24, David Miller <davem@davemloft.net> wrote:
> From: Jian-Hong Pan <jian-hong@endlessm.com>

>> [   56.462464] r8169 0000:02:00.0: MSI-X entry: context resume:
>> ffffffff ffffffff ffffffff ffffffff
>  ...
>> uh!  The MSI-X entry seems missed after resume on this laptop!
>
> Yeah, having all of the MSI-X entry values be all-1's is not a good
> sign.

I'm slightly confused as to why my machine doesn't even get to
printing the debugging message on resume..?

S.

^ permalink raw reply

* Re: [PATCH] datapath.c: fix missing return value check of nla_nest_start()
From: Pravin Shelar @ 2018-08-22  6:40 UTC (permalink / raw)
  To: David S. Miller; +Cc: Jason Wood, Linux Kernel Network Developers
In-Reply-To: <20180821.163806.1699637902725064185.davem@davemloft.net>

On Tue, Aug 21, 2018 at 4:38 PM David Miller <davem@davemloft.net> wrote:
>
> From: Pravin Shelar <pshelar@ovn.org>
> Date: Tue, 21 Aug 2018 15:38:28 -0700
>
> > On Fri, Aug 17, 2018 at 1:15 AM Jiecheng Wu <jasonwood2031@gmail.com> wrote:
> >>
> >> Function queue_userspace_packet() defined in net/openvswitch/datapath.c calls nla_nest_start() to allocate memory for struct nlattr which is dereferenced immediately. As nla_nest_start() may return NULL on failure, this code piece may cause NULL pointer dereference bug.
> >> ---
> >>  net/openvswitch/datapath.c | 4 ++++
> >>  1 file changed, 4 insertions(+)
> >>
> >> diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
> >> index 0f5ce77..ff4457d 100644
> >> --- a/net/openvswitch/datapath.c
> >> +++ b/net/openvswitch/datapath.c
> >> @@ -460,6 +460,8 @@ static int queue_userspace_packet(struct datapath *dp, struct sk_buff *skb,
> >>
> >>         if (upcall_info->egress_tun_info) {
> >>                 nla = nla_nest_start(user_skb, OVS_PACKET_ATTR_EGRESS_TUN_KEY);
> >> +               if (!nla)
> >> +                       return -EMSGSIZE;
> > It is not possible, since user_skb is allocated to accommodate all
> > netlink attributes.
>
> Pravin, common practice is to always check nla_*() return values even if the
> SKB is allocated with "enough space".
>
> Those calculations can have bugs, and these checks are therefore helpful to
> avoid crashes and memory corruption in such cases.
>
OK, in that case this patch needs to proper error handling.

^ permalink raw reply

* I NEED A TRUSTWORTHY PARTNER.
From: SIMON KAFANDO @ 2018-08-22  6:32 UTC (permalink / raw)


-- 
Dear Friend,

                      Greetings!
I am confidence you are good with your family. Please may I request
your urgent assistance in transferring the sum of ($10.7M) to your
account? It is 100% risk free and you will receive 35% of the total
sum for your kind assistance. On confirmation of your interest I will
send more details.

With Regards,
Mr. Simon Kafando

^ permalink raw reply

* Re: [PATCH 06/11] net: ethernet: renesas: use SPDX identifier for Renesas drivers
From: Simon Horman @ 2018-08-22  9:36 UTC (permalink / raw)
  To: Wolfram Sang
  Cc: linux-renesas-soc, Kuninori Morimoto, Sergei Shtylyov,
	David S. Miller, netdev, linux-kernel
In-Reply-To: <20180821220233.9202-7-wsa+renesas@sang-engineering.com>

On Wed, Aug 22, 2018 at 12:02:19AM +0200, Wolfram Sang wrote:
> Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>

Reviewed-by: Simon Horman <horms+renesas@verge.net.au>

^ permalink raw reply

* Re: Experimental fix for MSI-X issue on r8169
From: Heiner Kallweit @ 2018-08-22  5:56 UTC (permalink / raw)
  To: David Miller, jian-hong; +Cc: steved424, gogen, netdev, linux
In-Reply-To: <20180821.212423.615788696464102670.davem@davemloft.net>

On 22.08.2018 06:24, David Miller wrote:
> From: Jian-Hong Pan <jian-hong@endlessm.com>
> Date: Wed, 22 Aug 2018 11:01:02 +0800
> 
>  ...
>> [   56.462464] r8169 0000:02:00.0: MSI-X entry: context resume:
>> ffffffff ffffffff ffffffff ffffffff
>  ...
>> uh!  The MSI-X entry seems missed after resume on this laptop!
> 
> Yeah, having all of the MSI-X entry values be all-1's is not a good
> sign.
> 
all-1's seems to indicate that PCI access to the MSI-X table
BAR/region fails. Because falling back to MSI helps, accessing
the other BAR/region with the memory-mapped registers works.
I'll check with Realtek whether this symptom rings any bell.


> But this is quite a curious set of debugging traces we now have.
> 
> In the working case, the vector number in the DATA field seems
> to change, which suggests that something is assigning new values
> and programming them into these fields at resume time.
> 
> But in the failing cases, all of the values are garbage.
> 
> I would expect, given what the working trace looks like, that in the
> failing case some values would be wrong and the DATA value would have
> some new yet valid value.  But that is not what we are seeing here.
> 
> Weird.
> 
> 

^ permalink raw reply

* [PATCH net] net/ncsi: Fixup .dumpit message flags and ID check in Netlink handler
From: Samuel Mendoza-Jonas @ 2018-08-22  4:57 UTC (permalink / raw)
  To: netdev; +Cc: Samuel Mendoza-Jonas, David S . Miller, linux-kernel, openbmc

The ncsi_pkg_info_all_nl() .dumpit handler is missing the NLM_F_MULTI
flag, causing additional package information after the first to be lost.
Also fixup a sanity check in ncsi_write_package_info() to reject out of
range package IDs.

Signed-off-by: Samuel Mendoza-Jonas <sam@mendozajonas.com>
---
 net/ncsi/ncsi-netlink.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ncsi/ncsi-netlink.c b/net/ncsi/ncsi-netlink.c
index 82e6edf9c5d9..45f33d6dedf7 100644
--- a/net/ncsi/ncsi-netlink.c
+++ b/net/ncsi/ncsi-netlink.c
@@ -100,7 +100,7 @@ static int ncsi_write_package_info(struct sk_buff *skb,
 	bool found;
 	int rc;

-	if (id > ndp->package_num) {
+	if (id > ndp->package_num - 1) {
 		netdev_info(ndp->ndev.dev, "NCSI: No package with id %u\n", id);
 		return -ENODEV;
 	}
@@ -240,7 +240,7 @@ static int ncsi_pkg_info_all_nl(struct sk_buff *skb,
 		return 0; /* done */

 	hdr = genlmsg_put(skb, NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq,
-			  &ncsi_genl_family, 0,  NCSI_CMD_PKG_INFO);
+			  &ncsi_genl_family, NLM_F_MULTI,  NCSI_CMD_PKG_INFO);
 	if (!hdr) {
 		rc = -EMSGSIZE;
 		goto err;
-- 
2.18.0

^ permalink raw reply related

* Re: Experimental fix for MSI-X issue on r8169
From: David Miller @ 2018-08-22  4:24 UTC (permalink / raw)
  To: jian-hong; +Cc: hkallweit1, steved424, gogen, netdev, linux
In-Reply-To: <CAPpJ_eezpOZtHKqMRhMhUHAKDtwwhBzZKmy7OXsy52Zbr0RZ_w@mail.gmail.com>

From: Jian-Hong Pan <jian-hong@endlessm.com>
Date: Wed, 22 Aug 2018 11:01:02 +0800

 ...
> [   56.462464] r8169 0000:02:00.0: MSI-X entry: context resume:
> ffffffff ffffffff ffffffff ffffffff
 ...
> uh!  The MSI-X entry seems missed after resume on this laptop!

Yeah, having all of the MSI-X entry values be all-1's is not a good
sign.

But this is quite a curious set of debugging traces we now have.

In the working case, the vector number in the DATA field seems
to change, which suggests that something is assigning new values
and programming them into these fields at resume time.

But in the failing cases, all of the values are garbage.

I would expect, given what the working trace looks like, that in the
failing case some values would be wrong and the DATA value would have
some new yet valid value.  But that is not what we are seeing here.

Weird.

^ permalink raw reply

* Re: [PATCH] selftests: net: move fragment forwarding/config up a level
From: Ido Schimmel @ 2018-08-22  6:45 UTC (permalink / raw)
  To: Shuah Khan; +Cc: Anders Roxell, davem, linux-kernel, netdev, linux-kselftest
In-Reply-To: <fa188bdd-2f4c-66fc-a803-5d53425c3daf@kernel.org>

On Tue, Aug 21, 2018 at 01:41:54PM -0600, Shuah Khan wrote:
> On 08/21/2018 12:56 PM, Ido Schimmel wrote:
> > On Tue, Aug 21, 2018 at 06:12:12PM +0200, Anders Roxell wrote:
> >> 'make kselftest-merge' assumes that the config files for the tests are
> >> located under the 'main' tet dir, like tools/testing/selftests/net/ and
> >> not in a subdir to net.
> > 
> > The tests under tools/testing/selftests/net/forwarding/ aren't executed
> > as part of the Makefile. The config file is there mainly so that people
> > will know which config options they need in order to run the tests.
> > 
> > The tests can be added to the Makefile, but some of them take a few
> > minutes to complete which is probably against "Don't take too long;"
> > mentioned in Documentation/dev-tools/kselftest.rst.
> > 
> 
> I don't see any reason why these shouldn't be added. With the number of
> tests that get run by default, time has gone up. The goal is to run more
> tests not less. There are some stress/destructive tests that continue to
> be left out of the Makefile.

Thanks for clarifying. I'll add the tests.

^ permalink raw reply

* Re: [PATCH 05/11] can: rcar: use SPDX identifier for Renesas drivers
From: Wolfram Sang @ 2018-08-22  6:30 UTC (permalink / raw)
  To: Fabio Estevam
  Cc: Wolfram Sang, Philippe Ombredanne, Linux-Renesas,
	Kuninori Morimoto, Wolfgang Grandegger, Marc Kleine-Budde,
	David S. Miller, linux-can, netdev, linux-kernel
In-Reply-To: <CAOMZO5AkVNiYMGdED9erL=YMaKcmtS5BzZxw2gO1KLXEaPnD7A@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 276 bytes --]


> > +// SPDX-License-Identifier: GPL-2.0-or-later
> 
> According to Documentation/process/license-rules.rst the format should
> be like this instead:
> 
> // SPDX-License-Identifier: GPL-2.0+

According to https://spdx.org/licenses/ it should be what I did above.


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: Experimental fix for MSI-X issue on r8169
From: Jian-Hong Pan @ 2018-08-22  3:01 UTC (permalink / raw)
  To: Heiner Kallweit
  Cc: Steve Dodd, Lou Reed, netdev@vger.kernel.org,
	Linux Upstreaming Team
In-Reply-To: <98df6b0a-44df-9c0d-ddf6-11b892ccd990@gmail.com>

2018-08-22 5:19 GMT+08:00 Heiner Kallweit <hkallweit1@gmail.com>:
> On 20.08.2018 05:47, Jian-Hong Pan wrote:
>> 2018-08-20 4:34 GMT+08:00 Heiner Kallweit <hkallweit1@gmail.com>:
>>> The three of you reported an MSI-X-related error when the system
>>> resumes from suspend. This has been fixed for now by disabling MSI-X
>>> on certain chip versions. However more versions may be affected.
>>>
>>> I checked with Realtek and they confirmed that on certain chip
>>> versions a MSIX-related value in PCI config space is reset when
>>> resuming from S3.
>>>
>>> I would appreciate if you could test the following experimental patch
>>> and whether warning "MSIX address lost, re-configuring" appears in
>>> your dmesg output after resume from suspend.
>>>
>>> Thanks a lot for your efforts.
>>
>> Tested with the experiment patch on ASUS X441UAR.
>>
>> This is the information before suspend:
>>
>> dev@endless:~$ dmesg | grep r8169
>> [   10.279565] libphy: r8169: probed
>> [   10.279947] r8169 0000:02:00.0 eth0: RTL8106e, 0c:9d:92:32:67:b4,
>> XID 44900000, IRQ 127
>> [   10.445952] r8169 0000:02:00.0 enp2s0: renamed from eth0
>> [   15.676229] Generic PHY r8169-200:00: attached PHY driver [Generic
>> PHY] (mii_bus:phy_addr=r8169-200:00, irq=IGNORE)
>> [   17.455392] r8169 0000:02:00.0 enp2s0: Link is Up - 100Mbps/Full -
>> flow control off
>>
>> dev@endless:~$ ip addr show enp2s0
>> 4: enp2s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast
>> state UP group default qlen 1000
>>     link/ether 0c:9d:92:32:67:b4 brd ff:ff:ff:ff:ff:ff
>>     inet 10.100.13.152/24 brd 10.100.13.255 scope global noprefixroute
>> dynamic enp2s0
>>        valid_lft 86347sec preferred_lft 86347sec
>>     inet6 fe80::2873:a2a9:6ca1:c79d/64 scope link noprefixroute
>>        valid_lft forever preferred_lft forever
>>
>> This is the information after resume:
>>
>> dev@endless:~$ dmesg | grep r8169
>> [   10.279565] libphy: r8169: probed
>> [   10.279947] r8169 0000:02:00.0 eth0: RTL8106e, 0c:9d:92:32:67:b4,
>> XID 44900000, IRQ 127
>> [   10.445952] r8169 0000:02:00.0 enp2s0: renamed from eth0
>> [   15.676229] Generic PHY r8169-200:00: attached PHY driver [Generic
>> PHY] (mii_bus:phy_addr=r8169-200:00, irq=IGNORE)
>> [   17.455392] r8169 0000:02:00.0 enp2s0: Link is Up - 100Mbps/Full -
>> flow control off
>> [   95.594265] r8169 0000:02:00.0 enp2s0: Link is Down
>> [   96.242074] Generic PHY r8169-200:00: attached PHY driver [Generic
>> PHY] (mii_bus:phy_addr=r8169-200:00, irq=IGNORE)
>>
>> dev@endless:~$ ip addr show enp2s0
>> 4: enp2s0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc
>> pfifo_fast state DOWN group default qlen 1000
>>     link/ether 0c:9d:92:32:67:b4 brd ff:ff:ff:ff:ff:ff
>>
>> There is no "MSIX address lost, re-configuring" in dmesg.
>> The ethernet interface is still down after resume.
>>
>
> Thanks a lot for testing. Unfortunately I don't have test hardware
> affected by this MSI-X issue, so maybe you can help me to understand
> the issue a little better.
>
> Below is a patch printing the MSI-X table entry in different contexts,
> it's not supposed to fix anything. Could you please let me know
> what the output is on your system?
> I want to get an idea whether the issue clears the complete entry or
> just corrupts certain parts.

Here is the test result on ASUS X441UAR with this patch:

dev@endless:~$ dmesg | grep -E "(r8169|enp2s0)"
[    8.980001] r8169 0000:02:00.0: MSI-X entry: context probe: fee01004 0 40ef 1
[    8.981594] libphy: r8169: probed
[    8.981769] r8169 0000:02:00.0 eth0: RTL8106e, 0c:9d:92:32:67:b4,
XID 44900000, IRQ 127
[    9.479848] r8169 0000:02:00.0 enp2s0: renamed from eth0
[   11.332834] IPv6: ADDRCONF(NETDEV_UP): enp2s0: link is not ready
[   11.336350] Generic PHY r8169-200:00: attached PHY driver [Generic
PHY] (mii_bus:phy_addr=r8169-200:00, irq=IGNORE)
[   11.574892] IPv6: ADDRCONF(NETDEV_UP): enp2s0: link is not ready
[   11.581816] r8169 0000:02:00.0 enp2s0: Link is Down
[   13.190535] r8169 0000:02:00.0 enp2s0: Link is Up - 100Mbps/Full -
flow control off
[   13.190548] IPv6: ADDRCONF(NETDEV_CHANGE): enp2s0: link becomes ready
[   56.227974] r8169 0000:02:00.0: MSI-X entry: context suspend:
fee04004 0 4024 0
[   56.462464] r8169 0000:02:00.0: MSI-X entry: context resume:
ffffffff ffffffff ffffffff ffffffff
[   58.406713] r8169 0000:02:00.0 enp2s0: Link is Up - 100Mbps/Full -
flow control off
[   58.766740] IPv6: ADDRCONF(NETDEV_UP): enp2s0: link is not ready
[   58.767331] Generic PHY r8169-200:00: attached PHY driver [Generic
PHY] (mii_bus:phy_addr=r8169-200:00, irq=IGNORE)
[   59.003660] IPv6: ADDRCONF(NETDEV_UP): enp2s0: link is not ready

uh!  The MSI-X entry seems missed after resume on this laptop!

Ethernet interface status after resume:
dev@endless:~$ ip addr show enp2s0
3: enp2s0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc
pfifo_fast state DOWN group default qlen 1000
    link/ether 0c:9d:92:32:67:b4 brd ff:ff:ff:ff:ff:ff

Regards,
Jian-Hong Pan

> That's what I get on my system (RTL8168E-VL). In your case you'll come
> only till the first suspend.
>
> [    3.743404] r8169 0000:03:00.0: MSI-X entry: context probe: fee01004 0 40ef 1
> [   29.539250] r8169 0000:03:00.0: MSI-X entry: context suspend: fee02004 0 4028 0
> [   29.837457] r8169 0000:03:00.0: MSI-X entry: context resume: fee01004 0 402b 0
> [   36.921370] r8169 0000:03:00.0: MSI-X entry: context suspend: fee01004 0 402b 0
> [   37.239407] r8169 0000:03:00.0: MSI-X entry: context resume: fee01004 0 402b 0
>
> diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
> index 54f53c8c0..f32645119 100644
> --- a/drivers/net/ethernet/realtek/r8169.c
> +++ b/drivers/net/ethernet/realtek/r8169.c
> @@ -11,6 +11,7 @@
>  #include <linux/module.h>
>  #include <linux/moduleparam.h>
>  #include <linux/pci.h>
> +#include <linux/msi.h>
>  #include <linux/netdevice.h>
>  #include <linux/etherdevice.h>
>  #include <linux/delay.h>
> @@ -6822,6 +6823,20 @@ rtl8169_get_stats64(struct net_device *dev, struct rtnl_link_stats64 *stats)
>         pm_runtime_put_noidle(&pdev->dev);
>  }
>
> +static void rtl_print_msix_entry(struct rtl8169_private *tp, const char *context)
> +{
> +       struct msi_desc *desc = first_pci_msi_entry(tp->pci_dev);
> +       u32 data[4];
> +
> +       data[0] = readl(desc->mask_base + PCI_MSIX_ENTRY_LOWER_ADDR);
> +       data[1] = readl(desc->mask_base + PCI_MSIX_ENTRY_UPPER_ADDR);
> +       data[2] = readl(desc->mask_base + PCI_MSIX_ENTRY_DATA);
> +       data[3] = readl(desc->mask_base + PCI_MSIX_ENTRY_VECTOR_CTRL);
> +
> +       dev_info(tp_to_dev(tp), "MSI-X entry: context %s: %x %x %x %x\n",
> +                context, data[0], data[1], data[2], data[3]);
> +}
> +
>  static void rtl8169_net_suspend(struct net_device *dev)
>  {
>         struct rtl8169_private *tp = netdev_priv(dev);
> @@ -6846,9 +6861,12 @@ static int rtl8169_suspend(struct device *device)
>  {
>         struct pci_dev *pdev = to_pci_dev(device);
>         struct net_device *dev = pci_get_drvdata(pdev);
> +       struct rtl8169_private *tp = netdev_priv(dev);
>
>         rtl8169_net_suspend(dev);
>
> +       rtl_print_msix_entry(tp, "suspend");
> +
>         return 0;
>  }
>
> @@ -6875,6 +6893,9 @@ static int rtl8169_resume(struct device *device)
>  {
>         struct pci_dev *pdev = to_pci_dev(device);
>         struct net_device *dev = pci_get_drvdata(pdev);
> +       struct rtl8169_private *tp = netdev_priv(dev);
> +
> +       rtl_print_msix_entry(tp, "resume");
>
>         if (netif_running(dev))
>                 __rtl8169_resume(dev);
> @@ -7075,11 +7096,6 @@ static int rtl_alloc_irq(struct rtl8169_private *tp)
>                 RTL_W8(tp, Config2, RTL_R8(tp, Config2) & ~MSIEnable);
>                 RTL_W8(tp, Cfg9346, Cfg9346_Lock);
>                 flags = PCI_IRQ_LEGACY;
> -       } else if (tp->mac_version == RTL_GIGA_MAC_VER_40) {
> -               /* This version was reported to have issues with resume
> -                * from suspend when using MSI-X
> -                */
> -               flags = PCI_IRQ_LEGACY | PCI_IRQ_MSI;
>         } else {
>                 flags = PCI_IRQ_ALL_TYPES;
>         }
> @@ -7354,6 +7370,8 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
>                 return rc;
>         }
>
> +       rtl_print_msix_entry(tp, "probe");
> +
>         tp->saved_wolopts = __rtl8169_get_wol(tp);
>
>         mutex_init(&tp->wk.mutex);
> --
> 2.18.0
>

^ permalink raw reply

* Re: [PATCH] strparser: remove any offset before parsing messages
From: Dominique Martinet @ 2018-08-22  5:47 UTC (permalink / raw)
  To: Doron Roberts-Kedes
  Cc: Tom Herbert, Dave Watson, David S. Miller, netdev, linux-kernel
In-Reply-To: <20180822023308.GA5970@doronrk-mbp.dhcp.thefacebook.com>

Doron Roberts-Kedes wrote on Tue, Aug 21, 2018:
> > strparser logic in that case -- it might work to pull in the parser
> > function but it might not work in rcv for all I know, or the next user
> > might think that since pull is ok some other operation on the skb is as
> > well...
> 
> Just to make sure I understand, is it possible you meant to say that the
> other way around? Surely the rcv callback can do whatever it wants with
> the skb. Its the parse callback that may need to be a little more
> careful with the skb.

Hmm, right, when we call the rcv callback we remove the skb from
strp->skb_head, and by the next iteration we have a new clone of the
orig_skb so it looks safe, but that's not something that's obvious at
first glance; it works because the skb was cloned at the start of the
loop.

> For the parse case, why not just clone and pull? 

Hmm, if we clone I guess there is no side effect to fear, that could be
acceptable... It feels that we're just pushing more overhead on to
kcm/sockmap though; in strparser's __strp_recv we know we can pull
safely so doing it there feels simpler to me.

After testing the overhead doesn't seem to be bad though, it looks like
it's less than the noise level on my laptop on performance governor; the
only thing to pay attention to is that if we clone here I'll need to
also add another pull in sockmap's rcv function
(smap_read_sock_strparser) that doesn't handle offset either.

If we only pull just before parsing I think rcv is also guaranteed to
have no offset, it has to start with the same offset as the parsing
unless the user changed it, right?

Anyway there's progress, we're down to these two-ish choices if I
followed correctly:
 - add a flag in strp_callbacks that says offsets are ok or not for
parsing, and just pull if it's set.
Now we're back to that I'd be moving the pull just before parsing like I
did in v0, as that's easier to follow.
 or
 - add a (clone?+)pull in kcm_rcv_strparser and smap_parse_func_strparser
(+ in smap_read_sock_strparser if clone)

(As a side note, I noticed this patch is bugged, orig_offset/eaten
shouldn't be reset to zero after the pull (and thus orig_len doesn't
need changing either); that's what I get for trying to "simplify"
something that was simpler in the v0 to me...)

> > As I wrote above, I think it should not be possible, so we're not
> > even talking about a small percentage here.
> > The reason I didn't use skb_pull (the head-only variant) is that I'd
> > rather have the overhead than a BUG() if I'm wrong on this...
> 
> A printk in that section when (orig_offset + eaten > skb_headlen(head)) 
> confirms that this case is not uncommon or impossible. Would have to do
> more work to see how many hundreds of times per second, but it is not a
> philosophical concern.

Hmm, right, it does happen if I force two bigger packets in a single
write() on my reproducer; I guess my workload didn't exhibit that
behaviour with a 9p client...

I've tried measuring that overhead as well by writing a more complex bpf
program that would fetch the offset in the skb but for some reason I'm
reading a 0 offset when it's not zero... well, not like there's much
choice for this at this point anyway; I don't think we'll do this
without pull, I'll put that on background.

-- 
Dominique

^ permalink raw reply

* Re: unregister_netdevice: waiting for DEV to become free (2)
From: Julian Anastasov @ 2018-08-22  4:11 UTC (permalink / raw)
  To: Cong Wang
  Cc: syzbot+30209ea299c09d8785c9, ddstreet, Dmitry Vyukov, LKML,
	Linux Kernel Network Developers, syzkaller-bugs
In-Reply-To: <CAM_iQpVM4wHkW4RKMuDj_Jjof3XbJmAsN0SdSRfneaf94CL0cw@mail.gmail.com>


	Hello,

On Mon, 20 Aug 2018, Cong Wang wrote:

> For the one I look into, dst_cache doesn't matter, because the xmit
> path doesn't even use tunnel dst_cache at all, and it is ip6tnl0 FB
> device, unlike this one which is tun device.

	Ops, of course, it is dev tun and not ip tun...

> For the one I look into, it seems some fib6_info is not released for
> some reason. It seems to be the one created by addrconf_prefix_route(),
> which is supposed to be released by fib6_clean_tree() I think, but it
> never happens.

	May be, it is not direct reference to dev but one
that is moved to loopback, like from dst, route... The repro.c
creates permanent neighbours and addresses.

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply

* Re: [PATCH v1 3/3] net: WireGuard secure network tunnel
From: Andrew Lunn @ 2018-08-22  0:23 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: David Miller, Stephen Hemminger, Netdev, Greg Kroah-Hartman
In-Reply-To: <CAHmME9qriuodvMo3FyfA0TEkRubxu7_Dbs=0aFrF6XOcZZtQfg@mail.gmail.com>

On Tue, Aug 21, 2018 at 04:59:52PM -0700, Jason A. Donenfeld wrote:
> On Tue, Aug 21, 2018 at 4:54 PM David Miller <davem@davemloft.net> wrote:
> >
> > From: "Jason A. Donenfeld" <Jason@zx2c4.com>
> > Date: Tue, 21 Aug 2018 16:41:50 -0700
> >
> > > Is 100 in fact acceptable for new code? 120? 180?  What's the
> > > generally accepted limit these days?
> >
> > Please keep it as close to 80 columns as possible.
> >
> > Line breaks are not ugly, please embrace them :)
> 
> Okay! Will do. Thanks for the response.

Hi Jason

I find the coding style document,
Documentation/process/coding-style.rst makes good arguments about why
it is like this:

   Now, some people will claim that having 8-character indentations
   makes the code move too far to the right, and makes it hard to read
   on a 80-character terminal screen.  The answer to that is that if
   you need more than 3 levels of indentation, you're screwed anyway,
   and should fix your program.

   In short, 8-char indents make things easier to read, and have the
   added benefit of warning you when you're nesting your functions too
   deep.  Heed that warning.

If you were to decided on 100, you might then want to go more than 3
levels deep, rather than heed the warning and refactor the code into
lots of smaller functions.

There is also:

   Functions should be short and sweet, and do just one thing.  They
   should fit on one or two screenfuls of text (the ISO/ANSI screen
   size is 80x24, as we all know), and do one thing and do that well.

It is worth reading the document in full, just for the rationale.

	Andrew

^ permalink raw reply

* Re: [bpf-next RFC 0/3] Introduce eBPF flow dissector
From: Petar Penkov @ 2018-08-22  0:19 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Petar Penkov, Networking, David S . Miller, Alexei Starovoitov,
	Daniel Borkmann, simon.horman, Willem de Bruijn
In-Reply-To: <20180820205205.jj7e75pulwqkorpr@ast-mbp>

On Mon, Aug 20, 2018 at 1:52 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Thu, Aug 16, 2018 at 09:44:20AM -0700, Petar Penkov wrote:
>> From: Petar Penkov <ppenkov@google.com>
>>
>> This patch series hardens the RX stack by allowing flow dissection in BPF,
>> as previously discussed [1]. Because of the rigorous checks of the BPF
>> verifier, this provides significant security guarantees. In particular, the
>> BPF flow dissector cannot get inside of an infinite loop, as with
>> CVE-2013-4348, because BPF programs are guaranteed to terminate. It cannot
>> read outside of packet bounds, because all memory accesses are checked.
>> Also, with BPF the administrator can decide which protocols to support,
>> reducing potential attack surface. Rarely encountered protocols can be
>> excluded from dissection and the program can be updated without kernel
>> recompile or reboot if a bug is discovered.
>>
>> Patch 1 adds infrastructure to execute a BPF program in __skb_flow_dissect.
>> This includes a new BPF program and attach type.
>>
>> Patch 2 adds a flow dissector program in BPF. This parses most protocols in
>> __skb_flow_dissect in BPF for a subset of flow keys (basic, control, ports,
>> and address types).
>>
>> Patch 3 adds a selftest that attaches the BPF program to the flow dissector
>> and sends traffic with different levels of encapsulation.
>
> Overall I fully support the direction. Few things to consider:
>
>> This RFC patchset exposes a few design considerations:
>>
>> 1/ Because the flow dissector key definitions live in
>> include/linux/net/flow_dissector.h, they are not visible from userspace,
>> and the flow keys definitions need to be copied in the BPF program.
>
> I don't think copy-paste avoids the issue of uapi.
> Anything used by BPF program is uapi.
> The only exception is offsets of kernel internal structures
> passed into bpf_probe_read().
> So we have several options:
> 1. be honest and say 'struct flow_dissect_key*' is now uapi
> 2. wrap all of them into 'struct bpf_flow_dissect_key*' and do rewrites
>   when/if 'struct flow_dissect_key*' changes
> 3. wait for BTF to solve it for tracing use case and for this one two.
> The idea is that kernel internal structs can be defined in bpf prog
> and since they will be described precisely in BTF that comes with the prog
> the kernel can validate that prog's BTF matches what kernel thinks it has.
> imo that's the most flexible, but BTF for all of vmlinux won't be ready
> tomorrow and looks like this patch set is ready to go, so I would go with 1 or 2.

I will pursue #2 then as both you and David are in support of it.

>
>> 2/ An alternative to adding a new hook would have been to attach flow
>> dissection programs at the XDP hook. Because this hook is executed before
>> GRO, it would have to execute on every MSS, which would be more
>> computationally expensive. Furthermore, the XDP hook is executed before an
>> SKB has been allocated and there is no clear way to move the dissected keys
>> into the SKB after it has been allocated. Eventually, perhaps a single pass
>> can implement both GRO and flow dissection -- but napi_gro_cb shows that a
>> lot more flow state would need to be parsed for this.
>
> global flow_dissect bpf hook semantics are problematic for testing.
> like patch 3 test affects the whole system including all containers.
> should the hook be per-netns or may be per netdevice?
>

Having the hook be per-netns would definitely be cleaner for testing.
I will look into
refactoring the hook from global to per-netns.

>> 3/ The BPF program cannot use direct packet access everywhere because it
>> uses an offset, initially supplied by the flow dissector.  Because the
>> initial value of this non-constant offset comes from outside of the
>> program, the verifier does not know what its value is, and it cannot verify
>> that it is within packet bounds. Therefore, direct packet access programs
>> get rejected.
>
> this part doesn't seem to match the code.
> direct packet access is allowed and usable even for fragmented skbs.
> in such case only linear part of skb is in "direct access".

I am not sure I understand. What I meant was that I use bpf_skb_load_bytes
rather than direct packet access because the offset at which I read headers,
nhoff, depends on an initial value that cannot be statically verified - namely
what __skb_flow_dissect provides. Is there an alternative approach I should
be taking here, and/or am I misunderstanding direct access?

>
> Last bit, I'm curious about, how the 'demo' flow dissector program
> from patch 2 fairs vs in-kernel dissector when performance tested?
>

I used the test in patch 3 to compare the two implementations and the
difference seemed to be small, with the BPF flow dissector being  slightly
slower.

Thank you so much for your feedback, Alexei and David!

^ permalink raw reply

* Re: [PATCH bpf] bpf, sockmap: fix sock_hash_alloc and reject zero-sized keys
From: Song Liu @ 2018-08-22  0:12 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: Alexei Starovoitov, John Fastabend, Networking
In-Reply-To: <e444ac8bc5dce13f118d43f0277f5f0bccf4402d.1534888782.git.daniel@iogearbox.net>

On Tue, Aug 21, 2018 at 3:02 PM, Daniel Borkmann <daniel@iogearbox.net> wrote:
> Currently, it is possible to create a sock hash map with key size
> of 0 and have the kernel return a fd back to user space. This is
> invalid for hash maps (and kernel also hasn't been tested for zero
> key size support in general at this point). Thus, reject such
> configuration.
>
> Fixes: 81110384441a ("bpf: sockmap, add hash map support")
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Acked-by: John Fastabend <john.fastabend@gmail.com>

Acked-by: Song Liu <songliubraving@fb.com>

> ---
>  kernel/bpf/sockmap.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
> index 98e621a..60ceb0e 100644
> --- a/kernel/bpf/sockmap.c
> +++ b/kernel/bpf/sockmap.c
> @@ -2140,7 +2140,9 @@ static struct bpf_map *sock_hash_alloc(union bpf_attr *attr)
>                 return ERR_PTR(-EPERM);
>
>         /* check sanity of attributes */
> -       if (attr->max_entries == 0 || attr->value_size != 4 ||
> +       if (attr->max_entries == 0 ||
> +           attr->key_size == 0 ||
> +           attr->value_size != 4 ||
>             attr->map_flags & ~SOCK_CREATE_FLAG_MASK)
>                 return ERR_PTR(-EINVAL);
>
> --
> 2.9.5
>

^ permalink raw reply

* Re: [PATCH v1 3/3] net: WireGuard secure network tunnel
From: Jason A. Donenfeld @ 2018-08-21 23:59 UTC (permalink / raw)
  To: David Miller; +Cc: Stephen Hemminger, Netdev, Greg Kroah-Hartman
In-Reply-To: <20180821.165442.1494103888623363991.davem@davemloft.net>

On Tue, Aug 21, 2018 at 4:54 PM David Miller <davem@davemloft.net> wrote:
>
> From: "Jason A. Donenfeld" <Jason@zx2c4.com>
> Date: Tue, 21 Aug 2018 16:41:50 -0700
>
> > Is 100 in fact acceptable for new code? 120? 180?  What's the
> > generally accepted limit these days?
>
> Please keep it as close to 80 columns as possible.
>
> Line breaks are not ugly, please embrace them :)

Okay! Will do. Thanks for the response.

Jason

^ permalink raw reply

* Re: [PATCH v1 3/3] net: WireGuard secure network tunnel
From: David Miller @ 2018-08-21 23:54 UTC (permalink / raw)
  To: Jason; +Cc: stephen, netdev, gregkh
In-Reply-To: <CAHmME9oKvhQhn-8tjpNfKFgNPLH9aZAPO=iEwi2M4w2YBa5JBQ@mail.gmail.com>

From: "Jason A. Donenfeld" <Jason@zx2c4.com>
Date: Tue, 21 Aug 2018 16:41:50 -0700

> Is 100 in fact acceptable for new code? 120? 180?  What's the
> generally accepted limit these days?

Please keep it as close to 80 columns as possible.

Line breaks are not ugly, please embrace them :)

^ permalink raw reply

* Re: [PATCH v1 3/3] net: WireGuard secure network tunnel
From: Jason A. Donenfeld @ 2018-08-21 23:41 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Netdev, David Miller, Greg Kroah-Hartman
In-Reply-To: <20180731132204.2bc30d15@xeon-e3>

On Tue, Jul 31, 2018 at 1:22 PM Stephen Hemminger
<stephen@networkplumber.org> wrote:
>
> On Tue, 31 Jul 2018 21:11:02 +0200
> "Jason A. Donenfeld" <Jason@zx2c4.com> wrote:
>
> > +static int walk_by_peer(struct allowedips_node __rcu *top, u8 bits, struct allowedips_cursor *cursor, struct wireguard_peer *peer, int (*func)(void *ctx, const u8 *ip, u8 cidr, int family), void *ctx, struct mutex *lock)
> > +{
>
> Please break lines at something reasonable like 100 characters.

As I mentioned earlier, things will certainly be that way for v2, and
I'm in the process of that. I do care quite a bit about having "pretty
code", and so while I'm wrapping, I want to do it well and
consistently. Indeed 100 characters is far more reasonable than 80 --
things wind up much less annoying. And 120 is even easier. I don't
want to have to start renaming nice descriptive struct names and nice
descriptive struct member names just to avoid the line breaks in
function signatures, or something silly like that. In any case, before
I take out my word wrapping paint brush and make the prettiest
wrapping I can, I wanted to double check and confirm what the norm is
for the net tree. Is 100 in fact acceptable for new code? 120? 180?
What's the generally accepted limit these days?

(And, folks, please don't let this inquiry spiral out of control into
a bikeshed thread. I'm narrowly interested in knowing what the facts
are on what the norm is and what will be accepted here, rather than a
cacophony of random opinions.)

^ permalink raw reply

* Re: [PATCH] datapath.c: fix missing return value check of nla_nest_start()
From: David Miller @ 2018-08-21 23:38 UTC (permalink / raw)
  To: pshelar; +Cc: jasonwood2031, netdev
In-Reply-To: <CAOrHB_DgqCAGcMnyDngmvwtC8fvX9epp6KW5+DV7kn9LbBWkyQ@mail.gmail.com>

From: Pravin Shelar <pshelar@ovn.org>
Date: Tue, 21 Aug 2018 15:38:28 -0700

> On Fri, Aug 17, 2018 at 1:15 AM Jiecheng Wu <jasonwood2031@gmail.com> wrote:
>>
>> Function queue_userspace_packet() defined in net/openvswitch/datapath.c calls nla_nest_start() to allocate memory for struct nlattr which is dereferenced immediately. As nla_nest_start() may return NULL on failure, this code piece may cause NULL pointer dereference bug.
>> ---
>>  net/openvswitch/datapath.c | 4 ++++
>>  1 file changed, 4 insertions(+)
>>
>> diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
>> index 0f5ce77..ff4457d 100644
>> --- a/net/openvswitch/datapath.c
>> +++ b/net/openvswitch/datapath.c
>> @@ -460,6 +460,8 @@ static int queue_userspace_packet(struct datapath *dp, struct sk_buff *skb,
>>
>>         if (upcall_info->egress_tun_info) {
>>                 nla = nla_nest_start(user_skb, OVS_PACKET_ATTR_EGRESS_TUN_KEY);
>> +               if (!nla)
>> +                       return -EMSGSIZE;
> It is not possible, since user_skb is allocated to accommodate all
> netlink attributes.

Pravin, common practice is to always check nla_*() return values even if the
SKB is allocated with "enough space".

Those calculations can have bugs, and these checks are therefore helpful to
avoid crashes and memory corruption in such cases.

Thank you.

^ permalink raw reply

* Re: Experimental fix for MSI-X issue on r8169
From: David Miller @ 2018-08-21 23:32 UTC (permalink / raw)
  To: hkallweit1; +Cc: jian-hong, steved424, gogen, netdev
In-Reply-To: <98df6b0a-44df-9c0d-ddf6-11b892ccd990@gmail.com>

From: Heiner Kallweit <hkallweit1@gmail.com>
Date: Tue, 21 Aug 2018 23:19:04 +0200

> That's what I get on my system (RTL8168E-VL). In your case you'll come
> only till the first suspend.
> 
> [    3.743404] r8169 0000:03:00.0: MSI-X entry: context probe: fee01004 0 40ef 1

On probe, MSI-X is masked (ie. disabled) and is configured to use:

	address: 0x00000000fee01004
	data:    0x00000000000040ef

> [   29.539250] r8169 0000:03:00.0: MSI-X entry: context suspend: fee02004 0 4028 0

At suspend time, MSI-X is unmasked (ie. enabled) and is configured to use:

	address: 0x00000000fee01004
	data:    0x0000000000004028

> [   29.837457] r8169 0000:03:00.0: MSI-X entry: context resume: fee01004 0 402b 0

At reume time, MSI-X is unmasked (ie. enabled) and is configured to use:

	address: 0x00000000fee01004
	data:    0x000000000000402b

> [   36.921370] r8169 0000:03:00.0: MSI-X entry: context suspend: fee01004 0 402b 0

Second suspend:

	address: 0x00000000fee01004
	data:    0x000000000000402b

> [   37.239407] r8169 0000:03:00.0: MSI-X entry: context resume: fee01004 0 402b 0

Second resume:

	address: 0x00000000fee01004
	data:    0x000000000000402b

And this all looks normal.  The data field is changing when you first up the device
and interrupts are enabled.  This is where the request_irq happens, the MSI
vector is allocated, and that vector number is written to the data field of the
MSI-X entry.

It looks like this (re-)allocation of MSI vectors happens on every resume as well.

And that's why the data field changes each resume.

^ permalink raw reply

* Re: [PATCH] libbpf: Remove the duplicate checking of function storage
From: Taeung Song @ 2018-08-22  2:54 UTC (permalink / raw)
  To: Daniel Borkmann, Jakub Kicinski
  Cc: Alexei Starovoitov, Linux Netdev List, LKML
In-Reply-To: <e0799ee4-875d-de62-e6a1-cc461fe38be7@iogearbox.net>



On 08/22/2018 05:11 AM, Daniel Borkmann wrote:
> On 08/21/2018 06:46 PM, Jakub Kicinski wrote:
>> On Tue, Aug 21, 2018 at 6:12 PM, Taeung Song <treeze.taeung@gmail.com> wrote:
>>> After the commit eac7d84519a3 ("tools: libbpf: don't return '.text'
>>> as a program for multi-function programs"), bpf_program__next()
>>> in bpf_object__for_each_program skips the function storage such as .text,
>>> so eliminate the duplicate checking.
>>>
>>> Cc: Jakub Kicinski <jakub.kicinski@netronome.com>
>>> Signed-off-by: Taeung Song <treeze.taeung@gmail.com>
>>
>> Looks reasonable, but you may need to repost once bpf-next is open:
>>
>> https://www.kernel.org/doc/Documentation/networking/netdev-FAQ.txt
>>
>> Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> 
> Agree, please resubmit once bpf-next opens up again. Thanks!
> 

OK ! I will.

Thanks,
Taeung

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox