Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH v3 net-next 00/11] ibmvnic: Updated reset handler and code fixes
From: David Miller @ 2017-05-02 19:49 UTC (permalink / raw)
  To: nfont; +Cc: netdev, brking, jallen, muvic, tlfalcon
In-Reply-To: <20170502185006.28214.89072.stgit@ltcalpine2-lp23.aus.stglabs.ibm.com>

From: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Date: Tue, 02 May 2017 14:52:11 -0400

> This set of patches multiple code fixes and a new rest handler
> for the ibmvnic driver. In order to implement the new reset handler
> for the ibmvnic driver resource initialization needed to be moved to
> its own routine, a state variable is introduced to replace the
> various is_* flags in the driver, and a new routine to handle the
> assorted reasons the driver can be reset.
> 
> v3 updates:
> 
> Patch 10/11: Correct patch subject line to be a description of the patch.
> 
> v2 updates:
> 
> Patch 11/11: Use __netif_subqueue_stopped() instead of
> netif_subqueue_stopped() to avoid possible use of an un-initialized
> skb variable.

Your patches have trailing whitespace errors, please fix:

Applying: ibmvnic: Move resource initialization to its own routine
Applying: ibmvnic: Replace is_closed with state field
Applying: ibmvnic: Updated reset handling
/home/davem/src/GIT/net-next/.git/rebase-apply/patch:104: trailing whitespace.

warning: 1 line adds whitespace errors.
Applying: ibmvnic: Delete napi's when releasing driver resources
Applying: ibmvnic: Whitespace correction in release_rx_pools
Applying: ibmvnic: Clean up tx pools when closing
Applying: ibmvnic: Wait for any pending scrqs entries at driver close
/home/davem/src/GIT/net-next/.git/rebase-apply/patch:68: trailing whitespace.

warning: 1 line adds whitespace errors.
Applying: ibmvnic: Check for driver reset first in ibmvnic_xmit
Applying: ibmvnic: Continue skb processing after skb completion error
Applying: ibmvnic: Record SKB RX queue during poll
Applying: ibmvnic: Move queue restarting in ibmvnic_tx_complete

^ permalink raw reply

* Re: [PATCH 0/9] net: thunderx: Adds XDP support
From: David Miller @ 2017-05-02 19:47 UTC (permalink / raw)
  To: sunil.kovvuri; +Cc: netdev, linux-kernel, linux-arm-kernel, sgoutham
In-Reply-To: <1493730418-24606-1-git-send-email-sunil.kovvuri@gmail.com>

From: sunil.kovvuri@gmail.com
Date: Tue,  2 May 2017 18:36:49 +0530

> From: Sunil Goutham <sgoutham@cavium.com>
> 
> This patch series adds support for XDP to ThunderX NIC driver
> which is used on CN88xx, CN81xx and CN83xx platforms. 
> 
> Patches 1-4 are performance improvement and cleanup patches
> which are done keeping XDP performance bottlenecks in view.
> Rest of the patches adds actual XDP support.

Series applied, thanks for doing this work.

Do you have any performance numbers?

^ permalink raw reply

* Re: [net-next PATCH 0/4] Improve bpf ELF-loader under samples/bpf
From: David Miller @ 2017-05-02 19:40 UTC (permalink / raw)
  To: brouer; +Cc: kafai, netdev, eric, borkmann, alexei.starovoitov
In-Reply-To: <149372826543.22268.3617359219409721129.stgit@firesoul>

From: Jesper Dangaard Brouer <brouer@redhat.com>
Date: Tue, 02 May 2017 14:31:45 +0200

> This series improves and fixes bpf ELF loader and programs under
> samples/bpf.  The bpf_load.c created some hard to debug issues when
> the struct (bpf_map_def) used in the ELF maps section format changed
> in commit fb30d4b71214 ("bpf: Add tests for map-in-map").
> 
> This was hotfixed in commit 409526bea3c3 ("samples/bpf: bpf_load.c
> detect and abort if ELF maps section size is wrong") by detecting the
> issue and aborting the program.
> 
> In most situations the bpf-loader should be able to handle these kind
> of changes to the struct size.  This patch series aim to do proper
> backward and forward compabilility handling when loading ELF files.
> 
> This series also adjust the callback that was introduced in commit
> 9fd63d05f3e8 ("bpf: Allow bpf sample programs (*_user.c) to change
> bpf_map_def") to use the new bpf_map_data structure, before more users
> start to use this callback.
> 
> Hoping these changes can make the merge window, as above mentioned
> commits have not been merged yet, and it would be good to avoid users
> hitting these issues.

Alexei and Daniel, please review.

^ permalink raw reply

* Re: [PATCH] ipx: call ipxitf_put() in ioctl error path
From: David Miller @ 2017-05-02 19:35 UTC (permalink / raw)
  To: dan.carpenter; +Cc: netdev, security, secalert
In-Reply-To: <20170502105853.yz3ny6lyso6isdvx@mwanda>

From: Dan Carpenter <dan.carpenter@oracle.com>
Date: Tue, 2 May 2017 13:58:53 +0300

> We should call ipxitf_put() if the copy_to_user() fails.
> 
> Reported-by: 李强 <liqiang6-s@360.cn>
> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>

Applied, thanks Dan.

^ permalink raw reply

* Re: [PATCH 1/2] PCI: Add new PCIe Fabric End Node flag, PCI_DEV_FLAGS_NO_RELAXED_ORDERING
From: Raj, Ashok @ 2017-05-02 19:34 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Casey Leedom, Bjorn Helgaas, leedom, Michael Werner,
	Ganesh Goudar, Arjun V, David Miller, Asit K Mallick,
	Patrick J Cramer, Suravee Suthikulpanit, Bob Shaw, h,
	Ding Tianhong, Mark Rutland, Amir Ancel, Gabriele Paoloni,
	Catalin Marinas, Will Deacon, LinuxArm, David Laight, Jeff 
In-Reply-To: <CAKgT0UcxjGCZu_QV-eF7_yWkDBbQpLpdaPqBf1gji8XXM5Cc=w@mail.gmail.com>

On Tue, May 02, 2017 at 11:10:22AM -0700, Alexander Duyck wrote:
> On Tue, May 2, 2017 at 9:53 AM, Raj, Ashok <ashok.raj@intel.com> wrote:
> > On Tue, May 02, 2017 at 09:39:34AM -0700, Alexander Duyck wrote:
> >> On Mon, May 1, 2017 at 4:13 PM, Casey Leedom <leedom@chelsio.com> wrote:
> >> > The new flag PCI_DEV_FLAGS_NO_RELAXED_ORDERING indicates that the Relaxed
> >> > Ordering Attribute should not be used on Transaction Layer Packets destined
> >> > for the PCIe End Node so flagged.  Initially flagged this way are Intel
> >> > E5-26xx Root Complex Ports which suffer from a Flow Control Credit
> >> > Performance Problem and AMD A1100 ARM ("SEATTLE") Root Complex Ports which
> >> > don't obey PCIe 3.0 ordering rules which can lead to Data Corruption.
> >>
> >> So this is a good first step though might I suggest one other change.
> >>
> >> We may want to add logic to the core PCIe code that clears the "Enable
> >> Relaxed Ordering" bit in the device control register for all devices
> >> hanging off of this root complex. Assuming the devices conform to the
> >> PCIe spec doing that should disable relaxed ordering in a device
> >> agnostic way that then enables us at a driver level to just enable the
> >> feature always without having to perform any checks for your flag. We
> >> could probably do that as a part of probing the PCIe interfaces
> >> hanging off of these devices.
> >
> > I suppose you don't want to turn off RO completely on the device. When
> > traffic is targetted to mmio for peer to peer reasons RO has performance
> > upside. The specific issue with these root ports indicate limitation using
> > RO for traffic targetting coherent memory.
> 
> Actually my main concern here is virtualization. If I take the PCIe
> function and direct assign it I have no way of seeing the root complex
> flag as it is now virtualized away. In the meantime the guest now has
> the ability to enable the function and sees nothing that says you
> can't enable relaxed ordering which in turn ends up potentially
> causing data corruption on the system. I want relaxed ordering
> disabled before I even consider assigning it to the guest on the
> systems where this would be an issue.
> 
> I prefer to err on the side of caution with this. Enabling Relaxed
> Ordering is technically a performance enhancement, so we function but
> not as well as we would like, while having it enabled when there are
> issues can lead to data corruption. I would weigh the risk of data
> corruption the thing to be avoided and of much higher priority than
> enabling improved performance. As such I say we should default the
> relaxed ordering attribute to off in general and look at
> "white-listing" it in for various architectures and/or chipsets that
> support/need it rather than having it enabled by default and trying to
> switch it off after the fact when we find some new issue.

I agree, after thinking about it a bit more.. even for transactions going to
p2p, i'm just reading the pcie spec and some sections aren't super clear
about completion redirect and ACS rules for p2p.

Also it appears the device control default value is 1 for enabling
Relaxed Ordering. This means we should probably save these states across
resets/FLR for e.g. To ensure perf isn't affected after a FLR.
> 
> So for example, in the case of x86 it seems like there are multiple
> root complexes that have issues, and the gains for enabling it with
> standard DMA to host memory are small. As such we may want to default
> it to off via the architecture specific PCIe code and then look at
> having "white-list" cases where we enable it for things like
> peer-to-peer accesses. In the case of SPARC we could look at
> defaulting it to on, and only "black-list" any cases where there might
> be issues since SPARC relies on this in a significant way for
> performance. In the case of ARM and other architectures it is a bit of
> a toss-up. I would say we could just default it to on for now and
> "black-list" anything that doesn't work, or we could go the other way
> like I suggested for x86. It all depends on what the ARM community
> would want to agree on for this. I would say unless it makes a
> significant difference like it does in the case of SPARC we are
> probably better off just defaulting it to off.
> 
> - Alex

^ permalink raw reply

* Re: [patch net-next repost] net: sched: add helpers to handle extended actions
From: David Miller @ 2017-05-02 19:34 UTC (permalink / raw)
  To: jiri; +Cc: netdev, jhs, xiyou.wangcong, mlxsw
In-Reply-To: <1493712720-1501-1-git-send-email-jiri@resnulli.us>

From: Jiri Pirko <jiri@resnulli.us>
Date: Tue,  2 May 2017 10:12:00 +0200

> From: Jiri Pirko <jiri@mellanox.com>
> 
> Jump is now the only one using value action opcode. This is going to
> change soon. So introduce helpers to work with this. Convert TC_ACT_JUMP.
> 
> This also fixes the TC_ACT_JUMP check, which is incorrectly done as a
> bit check, not a value check.
> 
> Fixes: e0ee84ded796 ("net sched actions: Complete the JUMPX opcode")
> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
> ---
> Dave, I'm sending this for -net-next although I know it is closed. But
> the mentioned commit is not yet in -net. Feel free to take this either
> to -net-next or -net, whatever suits you better. Thanks.

Applied to net-next, thanks Jiri.

^ permalink raw reply

* Re: [PATCH net 0/2] qed*: PTP bug fixes.
From: David Miller @ 2017-05-02 19:33 UTC (permalink / raw)
  To: sudarsana.kalluru; +Cc: richardcochran, netdev, Yuval.Mintz
In-Reply-To: <20170502081103.4252-1-sudarsana.kalluru@cavium.com>

From: Sudarsana Reddy Kalluru <sudarsana.kalluru@cavium.com>
Date: Tue, 2 May 2017 01:11:01 -0700

> The series addresses couple of issues in the PTP implementation.

Series applied.

^ permalink raw reply

* Re: [PATCH v2] stmmac: Add support for SIMATIC IOT2000 platform
From: David Miller @ 2017-05-02 19:31 UTC (permalink / raw)
  To: jan.kiszka
  Cc: peppe.cavallaro, alexandre.torgue, netdev, linux-kernel,
	sascha.weisenberger, andy.shevchenko
In-Reply-To: <aa7a7f82-7933-1a97-12c4-b51efab498c9@siemens.com>

From: Jan Kiszka <jan.kiszka@siemens.com>
Date: Tue, 2 May 2017 09:58:00 +0200

> The IOT2000 is industrial controller platform, derived from the Intel
> Galileo Gen2 board. The variant IOT2020 comes with one LAN port, the
> IOT2040 has two of them. They can be told apart based on the board asset
> tag in the DMI table.
> 
> Based on patch by Sascha Weisenberger.
> 
> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
> Signed-off-by: Sascha Weisenberger <sascha.weisenberger@siemens.com>
> ---
> 
> Changes in v2:
> - reformatted match conditions [Andy]

Applied, thanks.

^ permalink raw reply

* Re: [PATCH net v4 00/12] Fix possbile memleaks when fail to register_netdevice
From: David Miller @ 2017-05-02 19:30 UTC (permalink / raw)
  To: gfree.wind
  Cc: jiri, mareklindner, sw, a, kuznet, jmorris, yoshfuji, kaber,
	steffen.klassert, herbert, netdev
In-Reply-To: <cover.1493699451.git.gfree.wind@foxmail.com>

From: gfree.wind@foxmail.com
Date: Tue,  2 May 2017 13:58:42 +0800

> These following drivers allocate kinds of resources in its ndo_init
> func, free some of them or all in the destructor func. Then there is
> one memleak that some errors happen after register_netdevice invokes
> the ndo_init callback. Because only the ndo_uninit callback is invoked
> in the error handler of register_netdevice, but destructor not.
> 
> In my original approach, I tried to free the resources in the newlink
> func when fail to register_netdevice, like destructor did except not
> free the net_dev. This method is not good when destructor is changed,
> and the memleak could be not fixed when there is no newlink callback.
> 
> Now create one new func used to free the resources in the destructor,
> and the ndo_uninit func also could invokes it when fail to register
> the net_device by comparing the dev->reg_state with NETREG_UNINITIALIZED.
> If there is no existing ndo_uninit, just add one.
> 
> This solution doesn't only make sure free all resources in any case,
> but also follows the original desgin that some resources could be kept
> until the destructor executes normally after register the device
> successfully.

I want to think about this some more.

It is really unfortunate that resources are allocated strictly from
the ndo_init() yet released in two different callbacks which are
invoked only in certain (different) situations.

Just the fact that we have to make an internal netdev state test
in the ndo_uninit callback to get this right is a big red flag
to me.

^ permalink raw reply

* Re: [PATCH net v2] net: hns: fix ethtool_get_strings overflow in hns driver
From: David Miller @ 2017-05-02 19:08 UTC (permalink / raw)
  To: lixiaoping3
  Cc: huangdaode, lipeng321, yisen.zhuang, salil.mehta, yankejian,
	mbrugger, netdev, linux-kernel, linuxarm
In-Reply-To: <20170502024652.39536-1-lixiaoping3@huawei.com>

From: Timmy Li <lixiaoping3@huawei.com>
Date: Tue, 2 May 2017 10:46:52 +0800

> hns_get_sset_count() returns HNS_NET_STATS_CNT and the data space allocated
> is not enough for ethtool_get_strings(), which will cause random memory
> corruption.
> 
> When SLAB and DEBUG_SLAB are both enabled, memory corruptions like the
> the following can be observed without this patch:
> [   43.115200] Slab corruption (Not tainted): Acpi-ParseExt start=ffff801fb0b69030, len=80
> [   43.115206] Redzone: 0x9f911029d006462/0x5f78745f31657070.
> [   43.115208] Last user: [<5f7272655f746b70>](0x5f7272655f746b70)
> [   43.115214] 010: 70 70 65 31 5f 74 78 5f 70 6b 74 00 6b 6b 6b 6b  ppe1_tx_pkt.kkkk
> [   43.115217] 030: 70 70 65 31 5f 74 78 5f 70 6b 74 5f 6f 6b 00 6b  ppe1_tx_pkt_ok.k
> [   43.115218] Next obj: start=ffff801fb0b69098, len=80
> [   43.115220] Redzone: 0x706d655f6f666966/0x9f911029d74e35b.
> [   43.115229] Last user: [<ffff0000084b11b0>](acpi_os_release_object+0x28/0x38)
> [   43.115231] 000: 74 79 00 6b 6b 6b 6b 6b 70 70 65 31 5f 74 78 5f  ty.kkkkkppe1_tx_
> [   43.115232] 010: 70 6b 74 5f 65 72 72 5f 63 73 75 6d 5f 66 61 69  pkt_err_csum_fai
> 
> Signed-off-by: Timmy Li <lixiaoping3@huawei.com>
> ---
> Changelog:
> 
> v1 -> v2:
>     * Remove unnecessary parenthesis

Applied, thank you.

^ permalink raw reply

* Re: [PATCH net] tcp: fix wraparound issue in tcp_lp
From: David Miller @ 2017-05-02 19:07 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev
In-Reply-To: <1493677788.31837.25.camel@edumazet-glaptop3.roam.corp.google.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 01 May 2017 15:29:48 -0700

> From: Eric Dumazet <edumazet@google.com>
> 
> Be careful when comparing tcp_time_stamp to some u32 quantity,
> otherwise result can be surprising.
> 
> Fixes: 7c106d7e782b ("[TCP]: TCP Low Priority congestion control")
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied and queued up for -stable, thanks.

^ permalink raw reply

* Re: [PATCH net] bpf, arm64: fix jit branch offset related to ldimm64
From: David Miller @ 2017-05-02 19:06 UTC (permalink / raw)
  To: daniel; +Cc: ast, netdev, xi.wang, catalin.marinas, zlim.lnx, linux-arm-kernel
In-Reply-To: <5f7d105a1830ed789f1b416dc17b02fce0bd070e.1493749880.git.daniel@iogearbox.net>

From: Daniel Borkmann <daniel@iogearbox.net>
Date: Tue,  2 May 2017 20:34:54 +0200

> When the instruction right before the branch destination is
> a 64 bit load immediate, we currently calculate the wrong
> jump offset in the ctx->offset[] array as we only account
> one instruction slot for the 64 bit load immediate although
> it uses two BPF instructions. Fix it up by setting the offset
> into the right slot after we incremented the index.
 ...
> Also, add a couple of test cases to make sure JITs pass
> this test. Tested on Cavium ThunderX ARMv8. The added
> test cases all pass after the fix.
> 
> Fixes: 8eee539ddea0 ("arm64: bpf: fix out-of-bounds read in bpf2a64_offset()")
> Reported-by: David S. Miller <davem@davemloft.net>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Acked-by: Alexei Starovoitov <ast@kernel.org>

Applied and queued up for -stable, thanks!

I also applied your XADD patch as well.

Thanks again.

^ permalink raw reply

* Re: [PATCH net-next RFC 1/1] net netlink: Add new type NLA_FLAG_BITS
From: David Miller @ 2017-05-02 19:03 UTC (permalink / raw)
  To: jhs; +Cc: netdev, jiri, xiyou.wangcong
In-Reply-To: <1493562519-15563-1-git-send-email-jhs@emojatatu.com>

From: Jamal Hadi Salim <jhs@mojatatu.com>
Date: Sun, 30 Apr 2017 10:28:39 -0400

> Generic bitflags attribute content sent to the kernel by user.
> With this type the user can either set or unset a flag in the
> kernel.

You asked for feedback, here it is :-)

I think this is overengineered.

Just define a u32 for the value, and mask which defines which bits are
legitimate and defined.  Any bit outside of the legitimate mask must
be zero.

Simple.

^ permalink raw reply

* Re: [PATCH] net: ethernet: stmmac: properly set PS bit in MII configurations during reset
From: David Miller @ 2017-05-02 19:00 UTC (permalink / raw)
  To: thomas.petazzoni; +Cc: peppe.cavallaro, alexandre.torgue, netdev, stable
In-Reply-To: <1493286329-24448-1-git-send-email-thomas.petazzoni@free-electrons.com>


Someone needs to review this patch.

^ permalink raw reply

* Re: [PATCH net-next v2] net: ipv6: make sure multicast packets are not forwarded beyond the different scopes
From: David Miller @ 2017-05-02 18:59 UTC (permalink / raw)
  To: donatas.abraitis; +Cc: netdev, stable
In-Reply-To: <20170427071202.84403-1-donatas.abraitis@gmail.com>

From: Donatas Abraitis <donatas.abraitis@gmail.com>
Date: Thu, 27 Apr 2017 10:12:02 +0300

> 	    RFC4291 2.7 Routers must not forward any multicast packets
> 	    beyond of the scope indicated by the scop field in the
> 	    destination multicast address.
> 
> Signed-off-by: Donatas Abraitis <donatas.abraitis@gmail.com>

I think it's a ">=" test which is needed here, not pure equality.
Scopes are subsets of other scopes and are therefore allowed within
eachother.

Did you actually see misbehavior due to this issue, or see a real
bonafide conformance test fail?

If you're just reading the RFC and sticking tests here and there based
upon what you read, without any testing or real life verification of
the issue, this is _strongly_ discouraged.

It would even be ok if you merely showed how another open source
networking stack makes this test.

^ permalink raw reply

* [PATCH] bonding: Use seq_putc() in bond_info_show_master()
From: SF Markus Elfring @ 2017-05-02 18:52 UTC (permalink / raw)
  To: netdev, Andy Gospodarek, Jay Vosburgh, Veaceslav Falico
  Cc: LKML, kernel-janitors

From: Markus Elfring <elfring@users.sourceforge.net>
Date: Tue, 2 May 2017 20:48:36 +0200

A few single characters should be put into a sequence.
Thus use the corresponding function "seq_putc".

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
---
 drivers/net/bonding/bond_procfs.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/bonding/bond_procfs.c b/drivers/net/bonding/bond_procfs.c
index d8d4ada034b7..bdc4db184d94 100644
--- a/drivers/net/bonding/bond_procfs.c
+++ b/drivers/net/bonding/bond_procfs.c
@@ -72,7 +72,7 @@ static void bond_info_show_master(struct seq_file *seq)
 		seq_printf(seq, " (fail_over_mac %s)", optval->string);
 	}
 
-	seq_printf(seq, "\n");
+	seq_putc(seq, '\n');
 
 	if (bond_mode_uses_xmit_hash(bond)) {
 		optval = bond_opt_get_val(BOND_OPT_XMIT_HASH,
@@ -117,11 +117,11 @@ static void bond_info_show_master(struct seq_file *seq)
 			if (!bond->params.arp_targets[i])
 				break;
 			if (printed)
-				seq_printf(seq, ",");
+				seq_putc(seq, ',');
 			seq_printf(seq, " %pI4", &bond->params.arp_targets[i]);
 			printed = 1;
 		}
-		seq_printf(seq, "\n");
+		seq_putc(seq, '\n');
 	}
 
 	if (BOND_MODE(bond) == BOND_MODE_8023AD) {
-- 
2.12.2

^ permalink raw reply related

* Re: [PATCH net-next iproute2 1/3] netlink: import netlink message parsing from kernel
From: David Miller @ 2017-05-02 18:51 UTC (permalink / raw)
  To: dsa; +Cc: stephen, netdev, jakub.kicinski
In-Reply-To: <3acddce9-5421-637e-8b2b-ae8dd5c7b29a@cumulusnetworks.com>

From: David Ahern <dsa@cumulusnetworks.com>
Date: Tue, 2 May 2017 12:39:51 -0600

> On 5/2/17 12:03 PM, Stephen Hemminger wrote:
>> Then use libmnl it is already used in several other places in iproute2.
>> Eventually, I would like to use it everywhere and get rid of old netlink parser.
>> 
> 
> Why? libmnl is not going to simplify the iproute2 code.

Agreed.

> Look at attribute validation. Importing the kernel code into iproute2,
> the API is very familiar to anyone hacking on the kernel side:
> 
> +       if (nla_parse(tb, NLMSGERR_ATTR_MAX, attr, alen, extack_policy)
> != 0) {
> +               fprintf(stderr,
> +                       "Failed to parse extended error attributes\n");
> +               return 0;
> +       }
> +
> 
> ie., you pass a policy to the parse routine and the checking is part of
> nla_parse. The implementation of nla_parse and validate_nla are quite
> easy to read.
> 
> Now take a look at what devlink has for validation - attr_cb. IMO very
> unreadable and puts the burden on the app using the libmnl API.

Also agreed.

Stephen I totally agree with David, asking him to use libmnl for this is
not reasonable nor is it even a good idea given the above.

^ permalink raw reply

* Re: [PATCH net-next iproute2 1/3] netlink: import netlink message parsing from kernel
From: David Ahern @ 2017-05-02 18:39 UTC (permalink / raw)
  To: Stephen Hemminger, David Miller; +Cc: netdev, jakub.kicinski
In-Reply-To: <20170502110314.66d9ee89@xeon-e3>

On 5/2/17 12:03 PM, Stephen Hemminger wrote:
> Then use libmnl it is already used in several other places in iproute2.
> Eventually, I would like to use it everywhere and get rid of old netlink parser.
> 

Why? libmnl is not going to simplify the iproute2 code.

Look at attribute validation. Importing the kernel code into iproute2,
the API is very familiar to anyone hacking on the kernel side:

+       if (nla_parse(tb, NLMSGERR_ATTR_MAX, attr, alen, extack_policy)
!= 0) {
+               fprintf(stderr,
+                       "Failed to parse extended error attributes\n");
+               return 0;
+       }
+

ie., you pass a policy to the parse routine and the checking is part of
nla_parse. The implementation of nla_parse and validate_nla are quite
easy to read.

Now take a look at what devlink has for validation - attr_cb. IMO very
unreadable and puts the burden on the app using the libmnl API.

^ permalink raw reply

* Re: net/smc and the RDMA core
From: Bart Van Assche @ 2017-05-02 18:39 UTC (permalink / raw)
  To: hch-jcswGhMUV9g@public.gmane.org,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org,
	ubraun-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <d9214af6-1c6f-9f95-fc00-3e4a316b4f81-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>

On Tue, 2017-05-02 at 14:25 +0200, Ursula Braun wrote:
> if you can point out specific issues, we will be happy to work with you
> to get them addressed!

Hello Ursula,

My list of issues that I would like to see addressed can be found below. Doug,
Christoph and others may have additional inputs. The issues that have not yet
been mentioned in other e-mails are:
- The SMC driver only supports one RDMA transport type (RoCE v1) but
  none of the other RDMA transport types (RoCE v2, IB and iWARP). New
  RDMA drivers should support all RDMA transport types transparently.
  The traditional approach to support multiple RDMA transport types is
  by using the RDMA/CM to establish connections.
- The implementation of the SMC driver only supports RoCEv1. This is
  a very unfortunate choice because:
  - RoCEv1 is not routable and hence is limited to a single Ethernet
    broadcast domain.
  - RoCEv1 packets escape a whole bunch of mechanisms that only work
    for IP packets. Firewalls pass all RoCEv1 packets and switches
    do not restrict RoCEv1 packets to a single VLAN. This means that
    if the network configuration is changed after an SMC connection
    has been set up such that IP communication between the endpoints
    of an SMC connection is blocked that the SMC RoCEv1 packets will
    not be blocked by the network equipment of which the configuration
    has just been changed.
- As already mentioned by Christoph, the SMC implementation uses RDMA
  calls that probably will be deprecated soon (ib_create_cq()) and
  should use the RDMA R/W API instead of building sge lists itself.

Bart.--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH net] bpf, arm64: fix jit branch offset related to ldimm64
From: Daniel Borkmann @ 2017-05-02 18:34 UTC (permalink / raw)
  To: davem
  Cc: ast, netdev, xi.wang, catalin.marinas, zlim.lnx, linux-arm-kernel,
	Daniel Borkmann

When the instruction right before the branch destination is
a 64 bit load immediate, we currently calculate the wrong
jump offset in the ctx->offset[] array as we only account
one instruction slot for the 64 bit load immediate although
it uses two BPF instructions. Fix it up by setting the offset
into the right slot after we incremented the index.

Before (ldimm64 test 1):

  [...]
  00000020:  52800007  mov w7, #0x0 // #0
  00000024:  d2800060  mov x0, #0x3 // #3
  00000028:  d2800041  mov x1, #0x2 // #2
  0000002c:  eb01001f  cmp x0, x1
  00000030:  54ffff82  b.cs 0x00000020
  00000034:  d29fffe7  mov x7, #0xffff // #65535
  00000038:  f2bfffe7  movk x7, #0xffff, lsl #16
  0000003c:  f2dfffe7  movk x7, #0xffff, lsl #32
  00000040:  f2ffffe7  movk x7, #0xffff, lsl #48
  00000044:  d29dddc7  mov x7, #0xeeee // #61166
  00000048:  f2bdddc7  movk x7, #0xeeee, lsl #16
  0000004c:  f2ddddc7  movk x7, #0xeeee, lsl #32
  00000050:  f2fdddc7  movk x7, #0xeeee, lsl #48
  [...]

After (ldimm64 test 1):

  [...]
  00000020:  52800007  mov w7, #0x0 // #0
  00000024:  d2800060  mov x0, #0x3 // #3
  00000028:  d2800041  mov x1, #0x2 // #2
  0000002c:  eb01001f  cmp x0, x1
  00000030:  540000a2  b.cs 0x00000044
  00000034:  d29fffe7  mov x7, #0xffff // #65535
  00000038:  f2bfffe7  movk x7, #0xffff, lsl #16
  0000003c:  f2dfffe7  movk x7, #0xffff, lsl #32
  00000040:  f2ffffe7  movk x7, #0xffff, lsl #48
  00000044:  d29dddc7  mov x7, #0xeeee // #61166
  00000048:  f2bdddc7  movk x7, #0xeeee, lsl #16
  0000004c:  f2ddddc7  movk x7, #0xeeee, lsl #32
  00000050:  f2fdddc7  movk x7, #0xeeee, lsl #48
  [...]

Also, add a couple of test cases to make sure JITs pass
this test. Tested on Cavium ThunderX ARMv8. The added
test cases all pass after the fix.

Fixes: 8eee539ddea0 ("arm64: bpf: fix out-of-bounds read in bpf2a64_offset()")
Reported-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Xi Wang <xi.wang@gmail.com>
---
 ( Based against net-next where BPF related patches are usually
   routed, if something else is preferred please let me know. )

 arch/arm64/net/bpf_jit_comp.c |  8 ++++----
 lib/test_bpf.c                | 45 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 49 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
index a785554..ce8ab04 100644
--- a/arch/arm64/net/bpf_jit_comp.c
+++ b/arch/arm64/net/bpf_jit_comp.c
@@ -779,14 +779,14 @@ static int build_body(struct jit_ctx *ctx)
 		int ret;
 
 		ret = build_insn(insn, ctx);
-
-		if (ctx->image == NULL)
-			ctx->offset[i] = ctx->idx;
-
 		if (ret > 0) {
 			i++;
+			if (ctx->image == NULL)
+				ctx->offset[i] = ctx->idx;
 			continue;
 		}
+		if (ctx->image == NULL)
+			ctx->offset[i] = ctx->idx;
 		if (ret)
 			return ret;
 	}
diff --git a/lib/test_bpf.c b/lib/test_bpf.c
index 0362da0..2e38502 100644
--- a/lib/test_bpf.c
+++ b/lib/test_bpf.c
@@ -4656,6 +4656,51 @@ static int bpf_fill_ld_abs_vlan_push_pop(struct bpf_test *self)
 		{ },
 		{ { 0, 1 } },
 	},
+	{
+		/* Mainly testing JIT + imm64 here. */
+		"JMP_JGE_X: ldimm64 test 1",
+		.u.insns_int = {
+			BPF_ALU32_IMM(BPF_MOV, R0, 0),
+			BPF_LD_IMM64(R1, 3),
+			BPF_LD_IMM64(R2, 2),
+			BPF_JMP_REG(BPF_JGE, R1, R2, 2),
+			BPF_LD_IMM64(R0, 0xffffffffffffffffUL),
+			BPF_LD_IMM64(R0, 0xeeeeeeeeeeeeeeeeUL),
+			BPF_EXIT_INSN(),
+		},
+		INTERNAL,
+		{ },
+		{ { 0, 0xeeeeeeeeU } },
+	},
+	{
+		"JMP_JGE_X: ldimm64 test 2",
+		.u.insns_int = {
+			BPF_ALU32_IMM(BPF_MOV, R0, 0),
+			BPF_LD_IMM64(R1, 3),
+			BPF_LD_IMM64(R2, 2),
+			BPF_JMP_REG(BPF_JGE, R1, R2, 0),
+			BPF_LD_IMM64(R0, 0xffffffffffffffffUL),
+			BPF_EXIT_INSN(),
+		},
+		INTERNAL,
+		{ },
+		{ { 0, 0xffffffffU } },
+	},
+	{
+		"JMP_JGE_X: ldimm64 test 3",
+		.u.insns_int = {
+			BPF_ALU32_IMM(BPF_MOV, R0, 1),
+			BPF_LD_IMM64(R1, 3),
+			BPF_LD_IMM64(R2, 2),
+			BPF_JMP_REG(BPF_JGE, R1, R2, 4),
+			BPF_LD_IMM64(R0, 0xffffffffffffffffUL),
+			BPF_LD_IMM64(R0, 0xeeeeeeeeeeeeeeeeUL),
+			BPF_EXIT_INSN(),
+		},
+		INTERNAL,
+		{ },
+		{ { 0, 1 } },
+	},
 	/* BPF_JMP | BPF_JNE | BPF_X */
 	{
 		"JMP_JNE_X: if (3 != 2) return 1",
-- 
1.9.3

^ permalink raw reply related

* [PATCH] net: ipv6: Fix warning of freeing alive inet6 address
From: Mike Manning @ 2017-05-02 18:30 UTC (permalink / raw)
  To: netdev

While this is not reproducible manually, Andrey's syzkaller program hit
the warning "IPv6: Freeing alive inet6 address" with this part trace:

inet6_ifa_finish_destroy+0x12e/0x190 c:894
in6_ifa_put ./include/net/addrconf.h:330
addrconf_dad_work+0x4e9/0x1040 net/ipv6/addrconf.c:3963

The fix is to call in6_ifa_put() for the inet6_ifaddr before rather
than after calling addrconf_ifdown(), as the latter may remove it from
the address hash table.

Fixes: 85b51b12115c ("net: ipv6: Remove addresses for failures with strict DAD")
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Mike Manning <mmanning@brocade.com>
---
 net/ipv6/addrconf.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 80ce478..361993a 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -3902,8 +3902,11 @@ static void addrconf_dad_work(struct work_struct *w)
 	} else if (action == DAD_ABORT) {
 		in6_ifa_hold(ifp);
 		addrconf_dad_stop(ifp, 1);
-		if (disable_ipv6)
+		if (disable_ipv6) {
+			in6_ifa_put(ifp);
 			addrconf_ifdown(idev->dev, 0);
+			goto unlock;
+		}
 		goto out;
 	}
 
@@ -3950,6 +3953,7 @@ static void addrconf_dad_work(struct work_struct *w)
 		      ifp->dad_nonce);
 out:
 	in6_ifa_put(ifp);
+unlock:
 	rtnl_unlock();
 }
 
-- 
2.1.4

^ permalink raw reply related

* Re: TPACKET_V3 timeout bug?
From: Guy Harris @ 2017-05-02 18:19 UTC (permalink / raw)
  To: chetan loke; +Cc: Andrew Lunn, Sowmini Varadhan, netdev, tcpdump-workers
In-Reply-To: <C289F9B5-A8DE-45B1-9485-AE322850B462@alum.mit.edu>

On May 2, 2017, at 10:54 AM, Guy Harris <guy@alum.mit.edu> wrote:

> Yes, there's a case where user space wasn't being woken up.

See also this thread from almost 3 years ago, beginning with

	http://marc.info/?l=linux-netdev&m=140633612828824&w=2

and this patch thread from almost 2 1/2 years ago:

	https://lkml.org/lkml/2014/12/18/466

^ permalink raw reply

* Re: [PATCH 1/2] PCI: Add new PCIe Fabric End Node flag, PCI_DEV_FLAGS_NO_RELAXED_ORDERING
From: Alexander Duyck @ 2017-05-02 18:10 UTC (permalink / raw)
  To: Raj, Ashok
  Cc: Casey Leedom, Bjorn Helgaas, leedom, Michael Werner,
	Ganesh Goudar, Arjun V, David Miller, Asit K Mallick,
	Patrick J Cramer, Suravee Suthikulpanit, Bob Shaw, h,
	Ding Tianhong, Mark Rutland, Amir Ancel, Gabriele Paoloni,
	Catalin Marinas, Will Deacon, LinuxArm, David Laight, Jeff 
In-Reply-To: <20170502165329.GB26406@linux.intel.com>

On Tue, May 2, 2017 at 9:53 AM, Raj, Ashok <ashok.raj@intel.com> wrote:
> On Tue, May 02, 2017 at 09:39:34AM -0700, Alexander Duyck wrote:
>> On Mon, May 1, 2017 at 4:13 PM, Casey Leedom <leedom@chelsio.com> wrote:
>> > The new flag PCI_DEV_FLAGS_NO_RELAXED_ORDERING indicates that the Relaxed
>> > Ordering Attribute should not be used on Transaction Layer Packets destined
>> > for the PCIe End Node so flagged.  Initially flagged this way are Intel
>> > E5-26xx Root Complex Ports which suffer from a Flow Control Credit
>> > Performance Problem and AMD A1100 ARM ("SEATTLE") Root Complex Ports which
>> > don't obey PCIe 3.0 ordering rules which can lead to Data Corruption.
>>
>> So this is a good first step though might I suggest one other change.
>>
>> We may want to add logic to the core PCIe code that clears the "Enable
>> Relaxed Ordering" bit in the device control register for all devices
>> hanging off of this root complex. Assuming the devices conform to the
>> PCIe spec doing that should disable relaxed ordering in a device
>> agnostic way that then enables us at a driver level to just enable the
>> feature always without having to perform any checks for your flag. We
>> could probably do that as a part of probing the PCIe interfaces
>> hanging off of these devices.
>
> I suppose you don't want to turn off RO completely on the device. When
> traffic is targetted to mmio for peer to peer reasons RO has performance
> upside. The specific issue with these root ports indicate limitation using
> RO for traffic targetting coherent memory.

Actually my main concern here is virtualization. If I take the PCIe
function and direct assign it I have no way of seeing the root complex
flag as it is now virtualized away. In the meantime the guest now has
the ability to enable the function and sees nothing that says you
can't enable relaxed ordering which in turn ends up potentially
causing data corruption on the system. I want relaxed ordering
disabled before I even consider assigning it to the guest on the
systems where this would be an issue.

I prefer to err on the side of caution with this. Enabling Relaxed
Ordering is technically a performance enhancement, so we function but
not as well as we would like, while having it enabled when there are
issues can lead to data corruption. I would weigh the risk of data
corruption the thing to be avoided and of much higher priority than
enabling improved performance. As such I say we should default the
relaxed ordering attribute to off in general and look at
"white-listing" it in for various architectures and/or chipsets that
support/need it rather than having it enabled by default and trying to
switch it off after the fact when we find some new issue.

So for example, in the case of x86 it seems like there are multiple
root complexes that have issues, and the gains for enabling it with
standard DMA to host memory are small. As such we may want to default
it to off via the architecture specific PCIe code and then look at
having "white-list" cases where we enable it for things like
peer-to-peer accesses. In the case of SPARC we could look at
defaulting it to on, and only "black-list" any cases where there might
be issues since SPARC relies on this in a significant way for
performance. In the case of ARM and other architectures it is a bit of
a toss-up. I would say we could just default it to on for now and
"black-list" anything that doesn't work, or we could go the other way
like I suggested for x86. It all depends on what the ARM community
would want to agree on for this. I would say unless it makes a
significant difference like it does in the case of SPARC we are
probably better off just defaulting it to off.

- Alex

^ permalink raw reply

* Re: [PATCH net-next iproute2 1/3] netlink: import netlink message parsing from kernel
From: Stephen Hemminger @ 2017-05-02 18:03 UTC (permalink / raw)
  To: David Miller; +Cc: dsa, netdev, jakub.kicinski
In-Reply-To: <20170502.130032.1854491688518080191.davem@davemloft.net>

On Tue, 02 May 2017 13:00:32 -0400 (EDT)
David Miller <davem@davemloft.net> wrote:

> From: David Ahern <dsa@cumulusnetworks.com>
> Date: Tue, 2 May 2017 10:51:23 -0600
> 
> > On 5/2/17 9:25 AM, Stephen Hemminger wrote:  
> >> Please either use existing netlink attribute code in libnetlink.h
> >> (rta_getattr_u32 etc) or use libmnl like devlink.  
> > 
> > All of the existing rta_ functions take a struct rta_attr; netlink
> > messages use struct nlattr. It's just wrong to use rta functions for
> > netlink messages.  
> 
> Agreed.
> 
> > There is no existing parse function for nlattr. There is no existing
> > validate function - for nlattr or rta_attr. So I do not see any overlap
> > with existing code.  
> 
> Also agreed.

Then use libmnl it is already used in several other places in iproute2.
Eventually, I would like to use it everywhere and get rid of old netlink parser.

^ permalink raw reply

* Re: TPACKET_V3 timeout bug?
From: Guy Harris @ 2017-05-02 17:54 UTC (permalink / raw)
  To: chetan loke; +Cc: Andrew Lunn, Sowmini Varadhan, netdev, tcpdump-workers
In-Reply-To: <CAAsGZS5ir2Ha1swhdGCg7ibgYRORZykZcu089=oJ51NT4GKDuA@mail.gmail.com>

On May 2, 2017, at 8:04 AM, chetan loke <loke.chetan@gmail.com> wrote:

> On Sat, Apr 15, 2017 at 7:41 PM, Guy Harris <guy@alum.mit.edu> wrote:
>> On Apr 15, 2017, at 7:10 PM, Andrew Lunn <andrew@lunn.ch> wrote:
>> 
>>> Do you think this is a kernel problem, libpcap problem, or an
>>> application problem?
>> 
> 
> Its clearly a kernel regression.
> 
> If you look at if_packet.h, I have explicitly called out all the cases
> for the return/status codes. When I first merged the functionality in
> 3.11(or 3.12 I think) I had the logic in place to retire the block
> with or without packets in it. I think there was one case where we
> wouldn't wake up userspace. Someone checked in a fix for that. Now I
> am not sure the regression happened as part of that bug fix or
> sometime later. If you diff 3.12 against the latest you will find the
> regression. Look for prb_retire_rx_blk_timer_expired().

Yes, there's a case where user space wasn't being woken up.

As I said in

	https://github.com/the-tcpdump-group/libpcap/issues/335#issuecomment-30280794

It appeared, at the time, that PF_PACKET sockets delivered a wakeup when a packet is put in a buffer block or dropped due to no buffer blocks being empty, but not when a buffer block is handed to userland.

This means that if the kernel's timer expires, and there are no packets in the current buffer block being filled by the kernel, that buffer block will be handed to userland, but userland won't be woken up to tell it to consume that block.

Thus, libpcap will consume that block only if either:

	* a packet is put in a buffer block, meaning it must pass the filter and there must be a current buffer block, belonging to the kernel, into which to put it;
	* a packet arrives and passes the filter, but there are no current buffer blocks belonging to the kernel, so it's dropped;
	* the poll() times out.

So, with a low packet acceptance rate (either because there isn't much network traffic or because there is but most of it is rejected by the packet filter), and with a poll() timeout of -1, meaning "block forever", 1) will happen infrequently, and 3) will never happen. With an in-kernel timeout rate significantly lower than the rate of packet acceptance, the timeout will often occur when there are no packets in the current buffer block, in which case the kernel will hand an empty buffer block to userland and not tell userland about it.

If that happens often enough in sequence to cause all buffer blocks to be handed to userland before any wakeups occur, the kernel now has no buffer blocks into which to put packets, and the next time a packet arrives, it will be dropped, and a wakeup will finally occur. libpcap will drain the ring, handing all buffer blocks to the kernel, but it won't have any packets to process!

So this is ultimately a problem with the TPACKET_V3 code in the kernel. I personally think that it should not deliver empty buffer blocks to userland, and that it also should not deliver a wakeup when a packet is accepted, and should deliver a wakeup whenever a buffer block is handed to userland. I'll report this to somebody and let them decide which of those changes should be done.

If you want to deliver empty buffer blocks to userland, that's fine, but make sure you wake up userland so that it can process those packets rather than leaving them there taking up space in the ring buffer.

And if you insist on delivering a wakeup when a packet is accepted - a wakeup that libpcap, at least, won't do anything with, as there's nothing useful for it to do with that wakeup - also make sure you deliver a wakeup when a buffer block is handed to userland, which is what libpcap cares about.

> I cannot speak on behalf of user-space wrappers developed around
> tpacket_v3 but the intention(from the kernel POV) of the block_timer
> *is* to unblock the capture/user process/thread so that it does NOT
> stay blocked for an indefinite period of time. The header explicitly
> specifies that contract.

That's not part of the contract for libpcap, as it's a question of what the underlying capture mechanism does, and we don't necessarily have any control over that; if a particular capture mechanism used by libpcap has that as part of its contract, that's OK, but libpcap-based applications shouldn't depend on it.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox