Netdev List

Netdev List
 help / color / mirror / Atom feed

* [iproute PATCH 1/2] link_gre6: Fix for changing tclass/flowlabel
From: Phil Sutter @ 2017-09-01 14:08 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev
In-Reply-To: <20170901140809.13230-1-phil@nwl.cc>

When trying to change tclass or flowlabel of a GREv6 tunnel which has
the respective value set already, the code accidentally bitwise OR'ed
the old and the new value, leading to unexpected results. Fix this by
clearing the relevant bits of flowinfo variable prior to assigning the
new value.

Fixes: af89576d7a8c4 ("iproute2: GRE over IPv6 tunnel support.")
Signed-off-by: Phil Sutter <phil@nwl.cc>
---
 ip/link_gre6.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/ip/link_gre6.c b/ip/link_gre6.c
index 4d3d4b54210b9..447ac5d78ab7b 100644
--- a/ip/link_gre6.c
+++ b/ip/link_gre6.c
@@ -288,6 +288,7 @@ get_failed:
 			else {
 				if (get_u8(&uval, *argv, 16))
 					invarg("invalid TClass", *argv);
+				flowinfo &= ~IP6_FLOWINFO_TCLASS;
 				flowinfo |= htonl((__u32)uval << 20) & IP6_FLOWINFO_TCLASS;
 				flags &= ~IP6_TNL_F_USE_ORIG_TCLASS;
 			}
@@ -303,6 +304,7 @@ get_failed:
 					invarg("invalid Flowlabel", *argv);
 				if (uval > 0xFFFFF)
 					invarg("invalid Flowlabel", *argv);
+				flowinfo &= ~IP6_FLOWINFO_FLOWLABEL;
 				flowinfo |= htonl(uval) & IP6_FLOWINFO_FLOWLABEL;
 				flags &= ~IP6_TNL_F_USE_ORIG_FLOWLABEL;
 			}
-- 
2.13.1

^ permalink raw reply related

* [iproute PATCH 0/2] Fix and enhance link_gre6
From: Phil Sutter @ 2017-09-01 14:08 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev

Changing a tunnel's flowlabel value was broken if it was set to a
non-zero value before. Since the same problem existed for tclass, patch
1 fixes both instances at once.

Patch 2 enhances 'ip link show' to also print the tclass value. This
change was necessary to properly test the first patch's result.

Phil Sutter (2):
  link_gre6: Fix for changing tclass/flowlabel
  link_gre6: Print the tunnel's tclass setting

 ip/link_gre6.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

-- 
2.13.1

^ permalink raw reply

* [iproute PATCH 2/2] link_gre6: Print the tunnel's tclass setting
From: Phil Sutter @ 2017-09-01 14:08 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev
In-Reply-To: <20170901140809.13230-1-phil@nwl.cc>

Print the value analogous to flowlabel. While being at it, also break
the overlong lines to not exceed 80 characters boundary.

Signed-off-by: Phil Sutter <phil@nwl.cc>
---
 ip/link_gre6.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/ip/link_gre6.c b/ip/link_gre6.c
index 447ac5d78ab7b..78b5215c65037 100644
--- a/ip/link_gre6.c
+++ b/ip/link_gre6.c
@@ -462,7 +462,14 @@ static void gre_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
 	if (flags & IP6_TNL_F_USE_ORIG_FLOWLABEL)
 		fprintf(f, "flowlabel inherit ");
 	else
-		fprintf(f, "flowlabel 0x%05x ", ntohl(flowinfo & IP6_FLOWINFO_FLOWLABEL));
+		fprintf(f, "flowlabel 0x%05x ",
+			ntohl(flowinfo & IP6_FLOWINFO_FLOWLABEL));
+
+	if (flags & IP6_TNL_F_USE_ORIG_TCLASS)
+		fprintf(f, "tclass inherit ");
+	else
+		fprintf(f, "tclass 0x%02x ",
+			ntohl(flowinfo & IP6_FLOWINFO_TCLASS) >> 20);
 
 	if (flags & IP6_TNL_F_RCV_DSCP_COPY)
 		fprintf(f, "dscp inherit ");
-- 
2.13.1

^ permalink raw reply related

* Re: [PATCH v4 4/5] net: stmmac: dwmac-sun8i: choose internal PHY via phy-is-integrated
From: Rob Herring @ 2017-09-01 14:04 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Corentin Labbe, Mark Rutland, Maxime Ripard, Chen-Yu Tsai,
	Russell King, Giuseppe CAVALLARO, Alexandre Torgue,
	Florian Fainelli, Icenowy Zheng, netdev,
	devicetree@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <20170831205928.GS22289@lunn.ch>

On Thu, Aug 31, 2017 at 3:59 PM, Andrew Lunn <andrew@lunn.ch> wrote:
> On Thu, Aug 31, 2017 at 03:18:03PM -0500, Rob Herring wrote:
>> On Sat, Aug 26, 2017 at 11:20:51PM +0200, Andrew Lunn wrote:
>> > Hi Corentin
>> >
>> > I think we have now all agreed this is an mdio-mux, plus it is also an
>> > MII mux. We should represent that in device tree. This patchset does
>> > this. However, as it is now, the mux structure in DT is ignored. All
>> > it does is search for the phy-is-integrated flags and goes on that.
>> >
>> > I made the comment that the device tree representation cannot be
>> > implemented using an MDIO mux driver, because of driver loading
>> > issues.  However, the core of the MDIO mux code is just a library,
>> > symbols exported as GPL, free for anything to use.
>> >
>> > What i think should happen is the mdio-mux is implemented inside the
>> > MAC driver, using the mux-core as a library. The device tree structure
>> > of a mix is then reflected within Linux. The mux switch callback is
>> > implemented within the MAC driver. So it can reset the MAC when the
>> > mux is switched. The 'phy-is-integrated' property is then no longer
>> > needed.
>> >
>> > I would suggest a binding something like:
>>
>> This is looks better to me, but...
>>
>> > emac: ethernet@1c0b000 {
>> >         compatible = "allwinner,sun8i-h3-emac";
>> >         syscon = <&syscon>;
>> >         reg = <0x01c0b000 0x104>;
>> >         interrupts = <GIC_SPI 82 IRQ_TYPE_LEVEL_HIGH>;
>> >         interrupt-names = "macirq";
>> >         resets = <&ccu RST_BUS_EMAC>;
>> >         reset-names = "stmmaceth";
>> >         clocks = <&ccu CLK_BUS_EMAC>;
>> >         clock-names = "stmmaceth";
>> >         #address-cells = <1>;
>> >         #size-cells = <0>;
>> >
>> >         phy-handle = <&int_mii_phy>;
>> >         phy-mode = "mii";
>> >         allwinner,leds-active-low;
>> >
>> >         mdio: mdio {
>> >                 #address-cells = <1>;
>> >                 #size-cells = <0>;
>> >     }
>>
>> Why do you need this node still?
>
> Hi Rob
>
> It might not be needed, depending on how it is implemented. But:
>
> Documentation/devicetree/bindings/net/mdio-mux.txt
>
> It is normal for an mdio bus mux to have a phandle back to the parent
> mdio bus.  Also, i think the stmmac driver will only instantiate the
> mdio bus if there is a node for it in the device tree.

You don't have a phandle to the parent mdio bus though.

I think we should allow for both case. The mux could be within the bus
hierarchy. Depends where the controls are. Another alternative someone
will try sooner or later is using the new mux control binding.

Rob

^ permalink raw reply

* [PATCH 2/2] Revert "i40e: remove WQ_UNBOUND and the task limit of our workqueue"
From: Stefan Assmann @ 2017-09-01 14:02 UTC (permalink / raw)
  To: intel-wired-lan; +Cc: netdev, davem, jeffrey.t.kirsher, sassmann
In-Reply-To: <20170901140234.7840-1-sassmann@kpanic.de>

This reverts commit 4d5957cbdecdbb77d24c1465caadd801c07afa4a.

Due to this workqueue change the eeprom check right after flashing
firmware fails, although the flash itself completed successfully.
The error observed looks like this
i40e 0000:88:00.0: eeprom check failed (-62), Tx/Rx traffic disabled
The NIC is fully operational after the flash and even after a cold boot
any follow-up eeprom checks succeed.

This needs to be investigated, but for now it should be more important
to make sure the firmware update works as expected.

Signed-off-by: Stefan Assmann <sassmann@kpanic.de>
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 6498da8806cb..069b6683e1b0 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -12160,14 +12160,12 @@ static int __init i40e_init_module(void)
 		i40e_driver_string, i40e_driver_version_str);
 	pr_info("%s: %s\n", i40e_driver_name, i40e_copyright);
 
-	/* There is no need to throttle the number of active tasks because
-	 * each device limits its own task using a state bit for scheduling
-	 * the service task, and the device tasks do not interfere with each
-	 * other, so we don't set a max task limit. We must set WQ_MEM_RECLAIM
-	 * since we need to be able to guarantee forward progress even under
-	 * memory pressure.
+	/* we will see if single thread per module is enough for now,
+	 * it can't be any worse than using the system workqueue which
+	 * was already single threaded
 	 */
-	i40e_wq = alloc_workqueue("%s", WQ_MEM_RECLAIM, 0, i40e_driver_name);
+	i40e_wq = alloc_workqueue("%s", WQ_UNBOUND | WQ_MEM_RECLAIM, 1,
+				  i40e_driver_name);
 	if (!i40e_wq) {
 		pr_err("%s: Failed to create workqueue\n", i40e_driver_name);
 		return -ENOMEM;
-- 
2.13.5

^ permalink raw reply related

* [PATCH 1/2] i40e: use non-locking i40e_read_nvm_word() function during nvmupdate
From: Stefan Assmann @ 2017-09-01 14:02 UTC (permalink / raw)
  To: intel-wired-lan; +Cc: netdev, davem, jeffrey.t.kirsher, sassmann
In-Reply-To: <20170901140234.7840-1-sassmann@kpanic.de>

During firmware update the adminq gets locked. Currently during the
update i40e_calc_nvm_checksum() calls i40e_read_nvm_word() which itself
tries to acquire the adminq lock. This fails as the lock is already
taken.
This results in the firmware update to fail, leaving the eeprom in a
corrupt state.

To fix this, introducing __i40e_read_nvm_word() which avoids locking and
replace the calls to i40e_read_nvm_word() in i40e_calc_nvm_checksum()
with the newly introduced function. The change matches the i40e
sourceforge driver as much as possible.

With this in place the firmware update is now working again.
Fixes: ("96a39aed25e6 i40e: Acquire NVM lock before reads on all devices")

Signed-off-by: Stefan Assmann <sassmann@kpanic.de>
---
 drivers/net/ethernet/intel/i40e/i40e_nvm.c | 24 ++++++++++++++++++++++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_nvm.c b/drivers/net/ethernet/intel/i40e/i40e_nvm.c
index 96afef98a08f..aa4cfdc51d2b 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_nvm.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_nvm.c
@@ -322,6 +322,26 @@ i40e_status i40e_read_nvm_word(struct i40e_hw *hw, u16 offset,
 }
 
 /**
+ * __i40e_read_nvm_word - Reads nvm word, assumes caller does the locking
+ * @hw: pointer to the HW structure
+ * @offset: offset of the Shadow RAM word to read (0x000000 - 0x001FFF)
+ * @data: word read from the Shadow RAM
+ *
+ * Reads one 16 bit word from the Shadow RAM using the GLNVM_SRCTL register.
+ **/
+static i40e_status __i40e_read_nvm_word(struct i40e_hw *hw,
+					u16 offset, u16 *data)
+{
+	enum i40e_status_code ret_code = 0;
+
+	if (hw->flags & I40E_HW_FLAG_AQ_SRCTL_ACCESS_ENABLE)
+		ret_code = i40e_read_nvm_word_aq(hw, offset, data);
+	else
+		ret_code = i40e_read_nvm_word_srctl(hw, offset, data);
+	return ret_code;
+}
+
+/**
  * i40e_read_nvm_buffer_srctl - Reads Shadow RAM buffer via SRCTL register
  * @hw: pointer to the HW structure
  * @offset: offset of the Shadow RAM word to read (0x000000 - 0x001FFF).
@@ -516,14 +536,14 @@ static i40e_status i40e_calc_nvm_checksum(struct i40e_hw *hw,
 	data = (u16 *)vmem.va;
 
 	/* read pointer to VPD area */
-	ret_code = i40e_read_nvm_word(hw, I40E_SR_VPD_PTR, &vpd_module);
+	ret_code = __i40e_read_nvm_word(hw, I40E_SR_VPD_PTR, &vpd_module);
 	if (ret_code) {
 		ret_code = I40E_ERR_NVM_CHECKSUM;
 		goto i40e_calc_nvm_checksum_exit;
 	}
 
 	/* read pointer to PCIe Alt Auto-load module */
-	ret_code = i40e_read_nvm_word(hw, I40E_SR_PCIE_ALT_AUTO_LOAD_PTR,
+	ret_code = __i40e_read_nvm_word(hw, I40E_SR_PCIE_ALT_AUTO_LOAD_PTR,
 				      &pcie_alt_module);
 	if (ret_code) {
 		ret_code = I40E_ERR_NVM_CHECKSUM;
-- 
2.13.5

^ permalink raw reply related

* [PATCH 0/2] i40e: fix firmware update
From: Stefan Assmann @ 2017-09-01 14:02 UTC (permalink / raw)
  To: intel-wired-lan; +Cc: netdev, davem, jeffrey.t.kirsher, sassmann

The first patch fixes the firmware update which is currently broken and
results in a bad flash (corrupt firmware). Recovery is possible with a
fixed driver.
The second patch reverts a commit that causes the firmware checksum
verification to fail right after a successful flash. This is related to
a recent workqueue change. Haven't gotten to the bottom of this yet, but
for the sake of a smooth firmware update experience let's revert the
commit for now.

Stefan Assmann (2):
  i40e: use non-locking i40e_read_nvm_word() function during nvmupdate
  Revert "i40e: remove WQ_UNBOUND and the task limit of our workqueue"

 drivers/net/ethernet/intel/i40e/i40e_main.c | 12 +++++-------
 drivers/net/ethernet/intel/i40e/i40e_nvm.c  | 24 ++++++++++++++++++++++--
 2 files changed, 27 insertions(+), 9 deletions(-)

-- 
2.13.5

^ permalink raw reply

* Re: [PATCH] pktgen: add a new sample script for 40G and above link testing
From: Robert Hoo @ 2017-09-01 13:57 UTC (permalink / raw)
  To: Waskiewicz Jr, Peter; +Cc: Jesper Dangaard Brouer, netdev@vger.kernel.org
In-Reply-To: <E0D909EE5BB15A4699798539EA149D7F077932D8@ORSMSX103.amr.corp.intel.com>

On Fri, 2017-08-25 at 14:24 +0000, Waskiewicz Jr, Peter wrote:
> On 8/25/17 5:19 AM, Jesper Dangaard Brouer wrote:
> >>
> >> Tested with Intel XL710 NIC with Cisco 3172 switch.
> >>
> >> It would be even slightly better if the irqbalance service is turned
> >> off outside.
> > 
> > Yes, if you don't turn-off (kill) irqbalance it will move around the
> > IRQs behind your back...
> 
> Or you can use the --banirq option to irqbalance to ignore your device's 
> interrupts as targets for balancing.

Oh, I wasn't aware of this parameter. I will be glad to have try later.
Meanwhile, in my test above, the irqbalance service just affect result
very little.
> 
> Cheers,
> -PJ

^ permalink raw reply

* Re: [PATCH] pktgen: add a new sample script for 40G and above link testing
From: Robert Hoo @ 2017-09-01 13:53 UTC (permalink / raw)
  To: Tariq Toukan; +Cc: davem, brouer, kyle.leet, netdev
In-Reply-To: <800696cf-477e-52bf-24ae-a0a6c19a5f2d@mellanox.com>

On Sun, 2017-08-27 at 11:25 +0300, Tariq Toukan wrote:
> 
> On 25/08/2017 12:26 PM, Robert Hoo wrote:
> > (Sorry for yesterday's wrong sending, I finally fixed my MTA and git
> > send-email settings.)
> > 
> > It's hard to benchmark 40G+ network bandwidth using ordinary
> > tools like iperf, netperf (see reference 1).
> > Pktgen, packet generator from Kernel sapce, shall be a candidate.
> > I then tried with pktgen multiqueue sample scripts, but still
> > cannot reach line rate.
> 
> Try samples 03 and 04.

Thanks Tariq for review. Sorry for late reply; I do this part time.

Yes, I just tried sample 03 and 04. They can approximately reach 40G
line rate; though still slightly less than my script :) (see my reply
to Jesper).
> 
> > I then derived this NUMA awared irq affinity sample script from
> > multi-queue sample one, successfully benchmarked 40G link. I think this can
> > also be useful for 100G reference, though I haven't got device to test yet.
> > 
> > This script simply does:
> > Detect $DEV's NUMA node belonging.
> > Bind each thread (processor from that NUMA node) with each $DEV queue's
> > irq affinity, 1:1 mapping.
> > How many '-t' threads input determines how many queues will be
> > utilized.
> 
> I agree this is an essential capability.
> This was the main reason I added support for the -f argument.
> Using it, I could choose cores of local NUMA, especially for single 
> thread, or when cores of the NUMA are sequential.

Indeed this argument is very helpful.
Sorry I haven't taken it into consideration in v1. I should consider
this, what if user designate '-f'. I can improve this in v2.
> 
> > 
> > Tested with Intel XL710 NIC with Cisco 3172 switch.
> > 
> > It would be even slightly better if the irqbalance service is turned
> > off outside.
> > 
> > Referrences:
> > https://people.netfilter.org/hawk/presentations/LCA2015/net_stack_challenges_100G_LCA2015.pdf
> > http://www.intel.cn/content/dam/www/public/us/en/documents/reference-guides/xl710-x710-performance-tuning-linux-guide.pdf
> > 
> > Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
> > ---
> 
> Regards,
> Tariq Toukan

^ permalink raw reply

* Re: [PATCH] pktgen: add a new sample script for 40G and above link testing
From: Robert Hoo @ 2017-09-01 13:48 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: netdev@vger.kernel.org, davem, tariqt, kyle.leet
In-Reply-To: <20170825111921.061713c8@redhat.com>

On Fri, 2017-08-25 at 11:19 +0200, Jesper Dangaard Brouer wrote:
> (please don't use BCC on the netdev list, replies might miss the list in cc)
> 
> Comments inlined below:
> 
> On Fri, 25 Aug 2017 10:24:30 +0800 Robert Hoo <robert.hu@intel.com> wrote:
> 
> > From: Robert Ho <robert.hu@intel.com>
> > 
> > It's hard to benchmark 40G+ network bandwidth using ordinary
> > tools like iperf, netperf. I then tried with pktgen multiqueue sample
> > scripts, but still cannot reach line rate.
> 
> The pktgen_sample02_multiqueue.sh does not use burst or skb_cloning.
> Thus, the performance will suffer.
> 
> See the samples that use the burst feature:
>   pktgen_sample03_burst_single_flow.sh
>   pktgen_sample05_flow_per_thread.sh
> 
> With the pktgen "burst" feature, I can easily generate 40G.  Generating
> 100G is also possible, but often you will hit some HW limits before the
> pktgen limit.  I experienced hitting both (1) PCIe Gen3 x8 limit, and (2)
> memory bandwidth limit.

Thanks Jesper for review. Sorry for late reply, I do this part time.

I just tried 'pktgen_sample03_burst_single_flow.sh' and 'pktgen_sample05_flow_per_thread.sh'
cmd:
	./pktgen_sample05_flow_per_thread.sh -i ens801 -s 1500 -m 3c:fd:fe:9d:6f:f0 -t 2 -v -x -d 192.168.0.107
	./pktgen_sample03_burst_single_flow.sh -i ens801 -s 1500 -m 3c:fd:fe:9d:6f:f0 -t 2 -v -x -d 192.168.0.107

indeed, they can achieve nearly 40G. (though still slightly less than my 
script). pktgen_sample03 and pktgen_sample05 can approximately achieve 38xxxMb/sec ~ 39xxxMb/sec;
my script can achieve 40xxxMb/sec ~ 41xxxMb/sec. (threads >= 2)

So a general question: is it still necessary to continue my sample06_numa_awared_queue_irq_affinity work? as sample03
and sample05 already approximately achieved 40G line rate.

> 
> 
> > I then derived this NUMA awared irq affinity sample script from
> > multi-queue sample one, successfully benchmarked 40G link. I think this can
> > also be useful for 100G reference, though I haven't got device to test.
> 
> Okay, so your issue was really related to NUMA irq affinity.  I do feel
> that IRQ tuning lives outside the realm of the pktgen scripts, but
> looking closer at your script, I it doesn't look like you change the
> IRQ setting which is good.  

Sorry I don't quite understand above. I changed the irq affinities.
See "echo $thread > /proc/irq/${irq_array[$i]}/smp_affinity_list".
You would not like me to change it? I can restore them to original at the end
of the script.
> 
> You introduce some helper functions take makes it possible to extract
> NUMA information in the shell script code, really cool.  I would like
> to see these functions being integrated into the function.sh file.

Yes, it is doable, if you maintainer think so.
> 
>  
> > This script simply does:
> > Detect $DEV's NUMA node belonging.
> > Bind each thread (processor from that NUMA node) with each $DEV queue's
> > irq affinity, 1:1 mapping.
> > How many '-t' threads input determines how many queues will be
> > utilized.
> > 
> > Tested with Intel XL710 NIC with Cisco 3172 switch.
> > 
> > It would be even slightly better if the irqbalance service is turned
> > off outside.
> 
> Yes, if you don't turn-off (kill) irqbalance it will move around the
> IRQs behind your back...

Yes; while the experiment result turns out it affects just very little.
> 
>  
> > Referrences:
> > https://people.netfilter.org/hawk/presentations/LCA2015/net_stack_challenges_100G_LCA2015.pdf
> > http://www.intel.cn/content/dam/www/public/us/en/documents/reference-guides/xl710-x710-performance-tuning-linux-guide.pdf
> > 
> > Signed-off-by: Robert Hoo <robert.hu@intel.com>
> > ---
> >  ...tgen_sample06_numa_awared_queue_irq_affinity.sh | 132 +++++++++++++++++++++
> >  1 file changed, 132 insertions(+)
> >  create mode 100755 samples/pktgen/pktgen_sample06_numa_awared_queue_irq_affinity.sh
> > 
> > diff --git a/samples/pktgen/pktgen_sample06_numa_awared_queue_irq_affinity.sh b/samples/pktgen/pktgen_sample06_numa_awared_queue_irq_affinity.sh
> > new file mode 100755
> > index 0000000..f0ee25c
> > --- /dev/null
> > +++ b/samples/pktgen/pktgen_sample06_numa_awared_queue_irq_affinity.sh
> > @@ -0,0 +1,132 @@
> > +#!/bin/bash
> > +#
> > +# Multiqueue: Using pktgen threads for sending on multiple CPUs
> > +#  * adding devices to kernel threads which are in the same NUMA node
> > +#  * bound devices queue's irq affinity to the threads, 1:1 mapping
> > +#  * notice the naming scheme for keeping device names unique
> > +#  * nameing scheme: dev@thread_number
> > +#  * flow variation via random UDP source port
> > +#
> > +basedir=`dirname $0`
> > +source ${basedir}/functions.sh
> > +root_check_run_with_sudo "$@"
> > +#
> > +# Required param: -i dev in $DEV
> > +source ${basedir}/parameters.sh
> > +
> > +get_iface_node()
> > +{
> > +	echo `cat /sys/class/net/$1/device/numa_node`
> 
> Here you could use the following shell trick to avoid using "cat":
> 
>  echo $(</sys/class/net/$1/device/numa_node)

Thanks for teaching. Indeed this is more concise.
> 
> It looks like you don't handle the case of -1, which indicate non-NUMA
> system.  You need to use something like::
> 
> get_iface_node()
> {
>     local node=$(</sys/class/net/$1/device/numa_node)
>     if [[ $node == -1 ]]; then
> 	echo 0
>     else
> 	echo $node
>     fi
> }

Yes, I can amend in v2.
> 
> 
> > +}
> > +
> > +get_iface_irqs()
> > +{
> > +	local IFACE=$1
> > +	local queues="${IFACE}-.*TxRx"
> > +
> > +	irqs=$(grep "$queues" /proc/interrupts | cut -f1 -d:)
> > +	[ -z "$irqs" ] && irqs=$(grep $IFACE /proc/interrupts | cut -f1 -d:)
> > +	[ -z "$irqs" ] && irqs=$(for i in `ls -Ux /sys/class/net/$IFACE/device/msi_irqs` ;\
> > +		do grep "$i:.*TxRx" /proc/interrupts | grep -v fdir | cut -f 1 -d : ;\
> > +	    done)
> 
> Nice that you handle all these different methods.  I personally look
> in /proc/irq/*/$IFACE*/../smp_affinity_list , like (copy-paste):
> 
> echo " --- Align IRQs ---"
> # I've named my NICs ixgbe1 + ixgbe2
> for F in /proc/irq/*/ixgbe*-TxRx-*/../smp_affinity_list; do
>    # Extract irqname e.g. "ixgbe2-TxRx-2"
>    irqname=$(basename $(dirname $(dirname $F))) ;
>    # Substring pattern removal
>    hwq_nr=${irqname#*-*-}
>    echo $hwq_nr > $F
>    #grep . -H $F;
> done
> grep -H . /proc/irq/*/ixgbe*/../smp_affinity_list
> 
> Maybe I should switch to use:
>    /sys/class/net/$IFACE/device/msi_irqs/*
>  
> 
> > +	[ -z "$irqs" ] && echo "Error: Could not find interrupts for $IFACE"
> 
> In the error case you should let the script die.  There is a helper
> function for this called "err" (where first arg is the exitcode, which
> is useful to detect the reason your script failed).

Yes, I noticed that helper function and changed some of my original "echo Error"s;
this is a missing in my code clear/tidy work. I can amend in v2.
> 
> 
> > +	echo $irqs
> > +}
> 
> > +get_node_cpus()
> > +{
> > +	local node=$1
> > +	local node_cpu_list
> > +	local node_cpu_range_list=`cut -f1- -d, --output-delimiter=" " \
> > +			/sys/devices/system/node/node$node/cpulist`
> > +
> > +	for cpu_range in $node_cpu_range_list
> > +	do
> > +		node_cpu_list="$node_cpu_list "`seq -s " " ${cpu_range//-/ }`
> > +	done
> > +
> > +	echo $node_cpu_list
> > +}
> > +
> > +
> > +# Base Config
> > +DELAY="0"        # Zero means max speed
> > +COUNT="20000000"   # Zero means indefinitely
> > +[ -z "$CLONE_SKB" ] && CLONE_SKB="0"
> > +
> > +# Flow variation random source port between min and max
> > +UDP_MIN=9
> > +UDP_MAX=109
> > +
> > +node=`get_iface_node $DEV`
> > +irq_array=(`get_iface_irqs $DEV`)
> > +cpu_array=(`get_node_cpus $node`)
> 
> Nice trick to generate an array.
> 
> > +
> > +[ $THREADS -gt ${#irq_array[*]} -o $THREADS -gt ${#cpu_array[*]}  ] && \
> > +	err 1 "Thread number $THREADS exceeds: min (${#irq_array[*]},${#cpu_array[*]})"
> > +
> > +# (example of setting default params in your script)
> > +if [ -z "$DEST_IP" ]; then
> > +    [ -z "$IP6" ] && DEST_IP="198.18.0.42" || DEST_IP="FD00::1"
> > +fi
> > +[ -z "$DST_MAC" ] && DST_MAC="90:e2:ba:ff:ff:ff"
> > +
> > +# General cleanup everything since last run
> > +pg_ctrl "reset"
> > +
> > +# Threads are specified with parameter -t value in $THREADS
> > +for ((i = 0; i < $THREADS; i++)); do
> > +    # The device name is extended with @name, using thread number to
> > +    # make then unique, but any name will do.
> > +    # Set the queue's irq affinity to this $thread (processor)
> > +    thread=${cpu_array[$i]}
> > +    dev=${DEV}@${thread}
> > +    echo $thread > /proc/irq/${irq_array[$i]}/smp_affinity_list
> > +    echo "irq ${irq_array[$i]} is set affinity to `cat /proc/irq/${irq_array[$i]}/smp_affinity_list`"
> > +
> > +    # Add remove all other devices and add_device $dev to thread
> > +    pg_thread $thread "rem_device_all"
> > +    pg_thread $thread "add_device" $dev
> > +
> > +    # select queue and bind the queue and $dev in 1:1 relationship
> > +    queue_num=$i
> > +    echo "queue number is $queue_num"
> > +    pg_set $dev "queue_map_min $queue_num"
> > +    pg_set $dev "queue_map_max $queue_num"
> > +
> > +    # Notice config queue to map to cpu (mirrors smp_processor_id())
> > +    # It is beneficial to map IRQ /proc/irq/*/smp_affinity 1:1 to CPU number
> > +    pg_set $dev "flag QUEUE_MAP_CPU"
> > +
> > +    # Base config of dev
> > +    pg_set $dev "count $COUNT"
> > +    pg_set $dev "clone_skb $CLONE_SKB"
> > +    pg_set $dev "pkt_size $PKT_SIZE"
> > +    pg_set $dev "delay $DELAY"
> > +
> > +    # Flag example disabling timestamping
> > +    pg_set $dev "flag NO_TIMESTAMP"
> > +
> > +    # Destination
> > +    pg_set $dev "dst_mac $DST_MAC"
> > +    pg_set $dev "dst$IP6 $DEST_IP"
> > +
> > +    # Setup random UDP port src range
> > +    pg_set $dev "flag UDPSRC_RND"
> > +    pg_set $dev "udp_src_min $UDP_MIN"
> > +    pg_set $dev "udp_src_max $UDP_MAX"
> > +done
> > +
> > +# start_run
> > +echo "Running... ctrl^C to stop" >&2
> > +pg_ctrl "start"
> > +echo "Done" >&2
> > +
> > +# Print results
> > +for ((i = 0; i < $THREADS; i++)); do
> > +    thread=${cpu_array[$i]}
> > +    dev=${DEV}@${thread}
> > +    echo "Device: $dev"
> > +    cat /proc/net/pktgen/$dev | grep -A2 "Result:"
> > +done
> 
> 
> 

^ permalink raw reply

* [PATCH] qlcnic: remove redundant zero check on retries counter
From: Colin King @ 2017-09-01 13:44 UTC (permalink / raw)
  To: Harish Patil, Manish Chopra, Dept-GELinuxNICDev, netdev
  Cc: kernel-janitors, linux-kernel

From: Colin Ian King <colin.king@canonical.com>

At the end of the do while loop the integer counter retries will
always be zero and so the subsequent check to see if it is zero
is always true and therefore redundant.  Remove the redundant check
and always return -EIO on this return path.  Also unbreak the literal
string in dev_err message to clean up a checkpatch warning.

Detected by CoverityScan, CID#744279 ("Logically dead code")

Signed-off-by: Colin Ian King <colin.king@canonical.com>
---
 drivers/net/ethernet/qlogic/qlcnic/qlcnic_init.c | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_init.c b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_init.c
index be41e4c77b65..c48a0e2d4d7e 100644
--- a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_init.c
+++ b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_init.c
@@ -592,13 +592,9 @@ qlcnic_receive_peg_ready(struct qlcnic_adapter *adapter)
 
 	} while (--retries);
 
-	if (!retries) {
-		dev_err(&adapter->pdev->dev, "Receive Peg initialization not "
-			      "complete, state: 0x%x.\n", val);
-		return -EIO;
-	}
-
-	return 0;
+	dev_err(&adapter->pdev->dev, "Receive Peg initialization not complete, state: 0x%x.\n",
+		val);
+	return -EIO;
 }
 
 int
-- 
2.14.1

^ permalink raw reply related

* Re: [PATCH v2 5/5] net: mdio-mux: fix unbalanced put_device
From: Andrew Lunn @ 2017-09-01 13:38 UTC (permalink / raw)
  To: Corentin Labbe; +Cc: f.fainelli, netdev, linux-kernel
In-Reply-To: <20170901115604.27513-6-clabbe.montjoie@gmail.com>

On Fri, Sep 01, 2017 at 01:56:04PM +0200, Corentin Labbe wrote:
> mdio_mux_uninit() call put_device (unconditionally) because of
> of_mdio_find_bus() in mdio_mux_init.
> But of_mdio_find_bus is only called if mux_bus is empty.
> If mux_bus is set, mdio_mux_uninit will print a "refcount_t: underflow"
> trace.
> 
> This patch add a get_device in the other branch of "if (mux_bus)".
> 
> Signed-off-by: Corentin Labbe <clabbe.montjoie@gmail.com>

Reviewed-by: Andrew Lunn <andrew@lunn.ch>

    Andrew

^ permalink raw reply

* Re: [PATCH v2 3/5] net: mdio-mux: printing driver version is useless
From: Andrew Lunn @ 2017-09-01 13:38 UTC (permalink / raw)
  To: Corentin Labbe; +Cc: f.fainelli, netdev, linux-kernel
In-Reply-To: <20170901115604.27513-4-clabbe.montjoie@gmail.com>

On Fri, Sep 01, 2017 at 01:56:02PM +0200, Corentin Labbe wrote:
> Remove the driver version information because this information
> is not useful in an upstream kernel driver.
> 
> Signed-off-by: Corentin Labbe <clabbe.montjoie@gmail.com>

Reviewed-by: Andrew Lunn <andrew@lunn.ch>

    Andrew

^ permalink raw reply

* Re: [PATCH net-next] doc: document MSG_ZEROCOPY
From: Willem de Bruijn @ 2017-09-01 13:37 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Network Development, David Miller, Willem de Bruijn
In-Reply-To: <20170901144424.72f9c9e9@redhat.com>

On Fri, Sep 1, 2017 at 8:44 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> On Thu, 31 Aug 2017 17:00:13 -0400
> Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
>
>> +More Info
>> +---------
>> +
>> +Much of this document was derived from a longer paper presented at
>> +netdev 2.1. For more in-depth information see that paper and talk,
>> +the excellent reporting over at LWN.net or read the original code.
>> +
>> +  paper, slides, video
>> +    https://netdevconf.org/2.1/session.html?debruijn
>> +
>> +  LWN article
>> +    https://lwn.net/Articles/726917/
>> +
>> +  patchset
>> +    [PATCH net-next v4 0/9] socket sendmsg MSG_ZEROCOPY
>> +    https://lwn.net/Articles/730010/
>> +    https://www.spinics.net/lists/netdev/msg447552.html
>
> IMHO I think it would be better to use the type links also used in the
> git log.  If you look at the kernel git log, then the "Link:" tag have
> the form: http://lkml.kernel.org/r/
> And you can simply append the "Message-Id:" email header.
>
> In this case the link would be:
>   http://lkml.kernel.org/r/20170803202945.70750-1-willemdebruijn.kernel@gmail.com

I was not aware of that. Thanks, I'll send a v2.

^ permalink raw reply

* Re: [PATCH net-next v2 4/4] Documentation/bindings: net: marvell-pp2: add the link interrupt
From: Andrew Lunn @ 2017-09-01 13:35 UTC (permalink / raw)
  To: Antoine Tenart
  Cc: davem, gregory.clement, thomas.petazzoni, nadavh, linux,
	linux-kernel, mw, stefanc, miquel.raynal, netdev
In-Reply-To: <20170901090455.32316-5-antoine.tenart@free-electrons.com>

On Fri, Sep 01, 2017 at 11:04:55AM +0200, Antoine Tenart wrote:
> A link interrupt can be described. Document this valid interrupt name.
> 
> Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com>
> Tested-by: Marcin Wojtas <mw@semihalf.com>

Reviewed-by: Andrew Lunn <andrew@lunn.ch>

    Andrew

^ permalink raw reply

* Re: [PATCH net-next v2 2/4] net: mvpp2: make the phy optional
From: Andrew Lunn @ 2017-09-01 13:34 UTC (permalink / raw)
  To: Antoine Tenart
  Cc: davem, gregory.clement, thomas.petazzoni, nadavh, linux,
	linux-kernel, mw, stefanc, miquel.raynal, netdev
In-Reply-To: <20170901090455.32316-3-antoine.tenart@free-electrons.com>

On Fri, Sep 01, 2017 at 11:04:53AM +0200, Antoine Tenart wrote:
> There is not necessarily a PHY between the GoP and the physical port.
> However, the driver currently makes the "phy" property mandatory,
> contrary to what is stated in the device tree bindings. This patch makes
> the PHY optional, and aligns the PPv2 driver on its device tree
> documentation. However if a PHY is provided, the GoP link interrupt
> won't be used.
> 
> With this patch switches directly connected to the serdes lanes and SFP
> ports on the Armada 8040-db and Armada 7040-db can be used if the link
> interrupt is described in the device tree.
> 
> Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com>
> Tested-by: Marcin Wojtas <mw@semihalf.com>

Reviewed-by: Andrew Lunn <andrew@lunn.ch>

    Andrew

^ permalink raw reply

* Re: [PATCH net-next v2 1/4] net: mvpp2: take advantage of the is_rgmii helper
From: Andrew Lunn @ 2017-09-01 13:33 UTC (permalink / raw)
  To: Antoine Tenart
  Cc: davem, gregory.clement, thomas.petazzoni, nadavh, linux,
	linux-kernel, mw, stefanc, miquel.raynal, netdev
In-Reply-To: <20170901090455.32316-2-antoine.tenart@free-electrons.com>

On Fri, Sep 01, 2017 at 11:04:52AM +0200, Antoine Tenart wrote:
> Convert all RGMII checks to use the phy_interface_mode_is_rgmii()
> helper. This is a cosmetic patch.
> 
> Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com>

Reviewed-by: Andrew Lunn <andrew@lunn.ch>

    Andrew

^ permalink raw reply

* Re: [PATCH net-next 2/2] flow_dissector: Add limits for encapsulation and EH
From: Hannes Frederic Sowa @ 2017-09-01 13:32 UTC (permalink / raw)
  To: Tom Herbert; +Cc: davem, netdev, alex.popov
In-Reply-To: <20170831222239.21509-3-tom@quantonium.net>

Tom Herbert <tom@quantonium.net> writes:

> In flow dissector there are no limits to the number of nested
> encapsulations that might be dissected which makes for a nice DOS
> attack. This patch limits for dissecting nested encapsulations
> as well as for dissecting over extension headers.

I was actually more referring to your patch, because the flow dissector
right now is not stack recursive. Your changes would make it doing
recursion on the stack. But it seems something along the lines is anyway
needed. See below.

> Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Signed-off-by: Tom Herbert <tom@quantonium.net>
> ---
>  net/core/flow_dissector.c | 48 ++++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 45 insertions(+), 3 deletions(-)
>
> diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
> index 5110180a3e96..1bca748de27d 100644
> --- a/net/core/flow_dissector.c
> +++ b/net/core/flow_dissector.c
> @@ -396,6 +396,35 @@ __skb_flow_dissect_ipv6(const struct sk_buff *skb,
>  	key_ip->ttl = iph->hop_limit;
>  }
>  
> +/* Maximum number of nested encapsulations that can be processed in
> + * __skb_flow_dissect
> + */
> +#define MAX_FLOW_DISSECT_ENCAPS	5

I think you can exactly parse one encapsulation layer safely. This is
because you can only keep state on one encapsulation layer protocol
right now.

For example this scenario:

I would like to circumvent tc flower rules that filter based on vlan. I
simply construct a packet looking like:

Ethernet|Vlan|IP|GRE|Ethernet|Vlan|

because we don't recurse in the flow keys either, the second vlan header
would overwrite the information of the first one.

> +
> +static bool skb_flow_dissect_encap_allowed(int *num_encaps, unsigned int *flags)
> +{
> +	++*num_encaps;
> +
> +	if (*num_encaps >= MAX_FLOW_DISSECT_ENCAPS) {
> +		if (*num_encaps == MAX_FLOW_DISSECT_ENCAPS) {
> +			/* Allow one more pass but ignore disregard
> +			 * further encapsulations
> +			 */
> +			*flags |= FLOW_DISSECTOR_F_STOP_AT_ENCAP;
> +		} else {
> +			/* Max encaps reached */
> +			return  false;
> +		}
> +	}
> +
> +	return true;
> +}
> +
> +/* Maximum number of extension headers can be processed in __skb_flow_dissect
> + * per IPv6 packet
> + */
> +#define MAX_FLOW_DISSECT_EH	5

I would at least allow each extension header once, DEST_OPS twice, thus
let's say 12? It is safe in this version because it does not consume
stack space anyway and doesn't update internal state.

> +
>  /**
>   * __skb_flow_dissect - extract the flow_keys struct and return it
>   * @skb: sk_buff to extract the flow from, can be NULL if the rest are specified
> @@ -426,6 +455,7 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
>  	struct flow_dissector_key_tags *key_tags;
>  	struct flow_dissector_key_vlan *key_vlan;
>  	enum flow_dissect_ret fdret;
> +	int num_eh, num_encaps = 0;
>  	bool skip_vlan = false;
>  	u8 ip_proto = 0;
>  	bool ret;
> @@ -714,7 +744,9 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
>  	case FLOW_DISSECT_RET_OUT_GOOD:
>  		goto out_good;
>  	case FLOW_DISSECT_RET_PROTO_AGAIN:
> -		goto proto_again;
> +		if (skb_flow_dissect_encap_allowed(&num_encaps, &flags))
> +			goto proto_again;

I think you should get the check to the proto_again label. In case you
loop to often you can `goto out_good'.

[...]

Bye,
Hannes

^ permalink raw reply

* Re: [RFC net-next 0/8] net: dsa: Multi-queue awareness
From: Andrew Lunn @ 2017-09-01 13:29 UTC (permalink / raw)
  To: Florian Fainelli; +Cc: jiri, jhs, netdev, davem, xiyou.wangcong, vivien.didelot
In-Reply-To: <7d738ef5-c312-e0b3-3605-1f31fa7dc019@gmail.com>

> I suppose that you could somehow use TC to influence how the traffic
> from host to CPU works, but without a "CPU" port representor the
> question is how do we get that done? If we used "eth0" we need to
> callback into the switch driver for programming..

We need to compare how the different switches work with respect to
QoS. Marvell switches do a lot of the classification on the ingress
port where it determines what queue the frame should be placed in on
the egress port. The egress port then schedules its queues.

This does not map to TC too well.

> Regarding the last patch in this series, what I would ideally to replace
> it with is something along the lines of:
> 
> tc bind dev sw0p0 queue 0 dev eth0 queue 16

Why do you need this? sw0p0 has 8 queues? So i assume you use TC on
sw0p0 to place frames into these queues? The queue then gets passed
transparently down through the conduit interface and then used by the
tagger. I don't see why you need eth0 here? We try our best to avoid
eth0 wherever possible, it causes confusion. So i would prefer not to
have to use eth0 with TC commands.

     Andrew

^ permalink raw reply

* Re: tip -ENOBOOT - bisected to locking/refcounts, x86/asm: Implement fast refcount overflow protection
From: Mike Galbraith @ 2017-09-01 13:09 UTC (permalink / raw)
  To: Kees Cook
  Cc: David S. Miller, Peter Zijlstra, LKML, Ingo Molnar,
	Reshetova, Elena, Network Development
In-Reply-To: <1504249070.17604.20.camel@gmx.de>

On Fri, 2017-09-01 at 08:57 +0200, Mike Galbraith wrote:
> On Thu, 2017-08-31 at 11:45 -0700, Kees Cook wrote:
> > On Thu, Aug 31, 2017 at 10:19 AM, Mike Galbraith <efault@gmx.de> wrote:
> > > On Thu, 2017-08-31 at 10:00 -0700, Kees Cook wrote:
> > >>
> > >> Oh! So it's gcc-version sensitive? That's alarming. Is this mapping correct:
> > >>
> > >> 4.8.5: WARN, eventual kernel hang
> > >> 6.3.1, 7.0.1: WARN, but continues working
> > >
> > > Yeah, that's correct.  I find that troubling, simply because this gcc
> > > version has been through one hell of a lot of kernels with me.  Yeah, I
> > > know, that doesn't exempt it from having bugs, but color me suspicious.
> > 
> > I still can't hit this with a 4.8.5 build. :(
> > 
> > With _RATELIMIT removed, this should, in theory, report whatever goes
> > negative first...
> 
> I applied the other patch you posted, and built with gcc-6.3.1 to
> remove the gcc-4.8.5 aspect.  Look below the resulting splat.

Grr, that one has a in6_dev_getx() line missing for the first
increment, where things go pear shaped.

With that added, looking at counter both before, and after incl, with a
trace_printk() in the exception handler showing it doing its saturate
thing, irqs disabled across the whole damn refcount_inc(), and even
booting box nr_cpus=1 for extra credit...

HTH can that first refcount_inc() get there?

# tracer: nop
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
         systemd-1     [000] d..1     1.937284: in6_dev_getx: PRE refs.counter:3
         systemd-1     [000] d..1     1.937295: ex_handler_refcount: *(int *)regs->cx = -1073741824
         systemd-1     [000] d..1     1.937296: in6_dev_getx: POST refs.counter:-1073741824
         systemd-1     [000] d..1     1.937296: in6_dev_getx: PRE refs.counter:-1073741824
         systemd-1     [000] d..1     1.937297: ex_handler_refcount: *(int *)regs->cx = -1073741824
         systemd-1     [000] d..1     1.937297: in6_dev_getx: POST refs.counter:-1073741824
         systemd-1     [000] d..1     1.937297: in6_dev_getx: PRE refs.counter:-1073741824
         systemd-1     [000] d..1     1.937298: ex_handler_refcount: *(int *)regs->cx = -1073741824
         systemd-1     [000] d..1     1.937299: in6_dev_getx: POST refs.counter:-1073741824

---
 arch/x86/include/asm/refcount.h |   14 ++++++++++++++
 arch/x86/mm/extable.c           |    1 +
 include/net/addrconf.h          |   12 ++++++++++++
 net/ipv6/route.c                |    6 +++---
 4 files changed, 30 insertions(+), 3 deletions(-)

--- a/arch/x86/include/asm/refcount.h
+++ b/arch/x86/include/asm/refcount.h
@@ -55,6 +55,20 @@ static __always_inline void refcount_inc
 		: : "cc", "cx");
 }
 
+static __always_inline void refcount_inc_x(refcount_t *r)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	trace_printk("PRE refs.counter:%d\n", r->refs.counter);
+	asm volatile(LOCK_PREFIX "incl %0\n\t"
+		REFCOUNT_CHECK_LT_ZERO
+		: [counter] "+m" (r->refs.counter)
+		: : "cc", "cx");
+	trace_printk("POST refs.counter:%d\n", r->refs.counter);
+	local_irq_restore(flags);
+}
+
 static __always_inline void refcount_dec(refcount_t *r)
 {
 	asm volatile(LOCK_PREFIX "decl %0\n\t"
--- a/arch/x86/mm/extable.c
+++ b/arch/x86/mm/extable.c
@@ -45,6 +45,7 @@ bool ex_handler_refcount(const struct ex
 {
 	/* First unconditionally saturate the refcount. */
 	*(int *)regs->cx = INT_MIN / 2;
+	trace_printk("*(int *)regs->cx = %d\n", *(int *)regs->cx);
 
 	/*
 	 * Strictly speaking, this reports the fixup destination, not
--- a/include/net/addrconf.h
+++ b/include/net/addrconf.h
@@ -321,6 +321,18 @@ static inline struct inet6_dev *in6_dev_
 	return idev;
 }
 
+static inline struct inet6_dev *in6_dev_getx(const struct net_device *dev)
+{
+	struct inet6_dev *idev;
+
+	rcu_read_lock();
+	idev = rcu_dereference(dev->ip6_ptr);
+	if (idev)
+		refcount_inc_x(&idev->refcnt);
+	rcu_read_unlock();
+	return idev;
+}
+
 static inline struct neigh_parms *__in6_dev_nd_parms_get_rcu(const struct net_device *dev)
 {
 	struct inet6_dev *idev = __in6_dev_get(dev);
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -4041,12 +4041,12 @@ void __init ip6_route_init_special_entri
 	 * the loopback reference in rt6_info will not be taken, do it
 	 * manually for init_net */
 	init_net.ipv6.ip6_null_entry->dst.dev = init_net.loopback_dev;
-	init_net.ipv6.ip6_null_entry->rt6i_idev = in6_dev_get(init_net.loopback_dev);
+	init_net.ipv6.ip6_null_entry->rt6i_idev = in6_dev_getx(init_net.loopback_dev);
   #ifdef CONFIG_IPV6_MULTIPLE_TABLES
 	init_net.ipv6.ip6_prohibit_entry->dst.dev = init_net.loopback_dev;
-	init_net.ipv6.ip6_prohibit_entry->rt6i_idev = in6_dev_get(init_net.loopback_dev);
+	init_net.ipv6.ip6_prohibit_entry->rt6i_idev = in6_dev_getx(init_net.loopback_dev);
 	init_net.ipv6.ip6_blk_hole_entry->dst.dev = init_net.loopback_dev;
-	init_net.ipv6.ip6_blk_hole_entry->rt6i_idev = in6_dev_get(init_net.loopback_dev);
+	init_net.ipv6.ip6_blk_hole_entry->rt6i_idev = in6_dev_getx(init_net.loopback_dev);
   #endif
 }
 

^ permalink raw reply

* Re: [RFC net-next 0/5] TSN: Add qdisc-based config interfaces for traffic shapers
From: Richard Cochran @ 2017-09-01 13:03 UTC (permalink / raw)
  To: Vinicius Costa Gomes
  Cc: netdev, jhs, xiyou.wangcong, jiri, intel-wired-lan, andre.guedes,
	ivan.briano, jesus.sanchez-palencia, boon.leong.ong
In-Reply-To: <20170901012625.14838-1-vinicius.gomes@intel.com>


I happy to see this posted.  At first glance, it seems like a step in
the right direction.

On Thu, Aug 31, 2017 at 06:26:20PM -0700, Vinicius Costa Gomes wrote:
>  * Time-aware shaper (802.1Qbv):
...
>    S 0x01 300
>    S 0x03 500
> 
>    This means that there are two intervals, the first will have the gate
>    for traffic class 0 open for 300 nanoseconds, the second will have
>    both traffic classes open for 500 nanoseconds.

The i210 doesn't support this in HW, or does it?

>  * Frame Preemption (802.1Qbu):
> 
>    To control even further the latency, it may prove useful to signal which
>    traffic classes are marked as preemptable. For that, 'taprio' provides the
>    preemption command so you set each traffic class as preemptable or not:
> 
>    $ tc qdisc (...) \
>         preemption 0 1 1 1

Neither can the i210 preempt frames, or what am I missing?

The timing of this RFC is good, as I am just finishing up an RFC that
implements time-based transmit using the i210.  I'll try and get that
out ASAP.

Thanks,
Richard

^ permalink raw reply

* Re: [PATCH 1/3] security: bpf: Add eBPF LSM hooks to security module
From: Stephen Smalley @ 2017-09-01 12:50 UTC (permalink / raw)
  To: Chenbo Feng, linux-security-module
  Cc: Chenbo Feng, SELinux, netdev, Alexei Starovoitov, lorenzo
In-Reply-To: <20170831205635.80256-2-chenbofeng.kernel@gmail.com>

On Thu, 2017-08-31 at 13:56 -0700, Chenbo Feng wrote:
> From: Chenbo Feng <fengc@google.com>
> 
> Introduce 5 LSM hooks to provide finer granularity controls on eBPF
> related operations including create eBPF maps, modify and read eBPF
> maps
> content and load eBPF programs to the kernel. Hooks use the new
> security
> pointer inside the eBPF map struct to store the owner's security
> information and the different security modules can perform different
> checks based on the information stored inside the security field.
> 
> Signed-off-by: Chenbo Feng <fengc@google.com>
> ---
>  include/linux/lsm_hooks.h | 41
> +++++++++++++++++++++++++++++++++++++++++
>  include/linux/security.h  | 36 ++++++++++++++++++++++++++++++++++++
>  security/security.c       | 28 ++++++++++++++++++++++++++++
>  3 files changed, 105 insertions(+)
> 
> diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
> index ce02f76a6188..3aaf9a08a983 100644
> --- a/include/linux/lsm_hooks.h
> +++ b/include/linux/lsm_hooks.h
> @@ -1353,6 +1353,32 @@
>   *	@inode we wish to get the security context of.
>   *	@ctx is a pointer in which to place the allocated security
> context.
>   *	@ctxlen points to the place to put the length of @ctx.
> + *
> + * Security hooks for using the eBPF maps and programs
> functionalities through
> + * eBPF syscalls.
> + *
> + * @bpf_map_create:
> + *	Check permissions prior to creating a new bpf map.
> + *	Return 0 if the permission is granted.
> + *
> + * @bpf_map_modify:
> + *	Check permission prior to insert, update and delete map
> content.
> + *	@map pointer to the struct bpf_map that contains map
> information.
> + *	Return 0 if the permission is granted.
> + *
> + * @bpf_map_read:
> + *	Check permission prior to read a bpf map content.
> + *	@map pointer to the struct bpf_map that contains map
> information.
> + *	Return 0 if the permission is granted.
> + *
> + * @bpf_prog_load:
> + *	Check permission prior to load eBPF program.
> + *	Return 0 if the permission is granted.
> + *
> + * @bpf_post_create:
> + *	Initialize the bpf object security field inside struct
> bpf_maps and
> + *	it is used for future security checks.
> + *
>   */
>  union security_list_options {
>  	int (*binder_set_context_mgr)(struct task_struct *mgr);
> @@ -1685,6 +1711,14 @@ union security_list_options {
>  				struct audit_context *actx);
>  	void (*audit_rule_free)(void *lsmrule);
>  #endif /* CONFIG_AUDIT */
> +
> +#ifdef CONFIG_BPF_SYSCALL
> +	int (*bpf_map_create)(void);
> +	int (*bpf_map_read)(struct bpf_map *map);
> +	int (*bpf_map_modify)(struct bpf_map *map);
> +	int (*bpf_prog_load)(void);
> +	int (*bpf_post_create)(struct bpf_map *map);
> +#endif /* CONFIG_BPF_SYSCALL */
>  };
>  
>  struct security_hook_heads {
> @@ -1905,6 +1939,13 @@ struct security_hook_heads {
>  	struct list_head audit_rule_match;
>  	struct list_head audit_rule_free;
>  #endif /* CONFIG_AUDIT */
> +#ifdef CONFIG_BPF_SYSCALL
> +	struct list_head bpf_map_create;
> +	struct list_head bpf_map_read;
> +	struct list_head bpf_map_modify;
> +	struct list_head bpf_prog_load;
> +	struct list_head bpf_post_create;
> +#endif /* CONFIG_BPF_SYSCALL */
>  } __randomize_layout;
>  
>  /*
> diff --git a/include/linux/security.h b/include/linux/security.h
> index 458e24bea2d4..0656a4f74d14 100644
> --- a/include/linux/security.h
> +++ b/include/linux/security.h
> @@ -31,6 +31,7 @@
>  #include <linux/string.h>
>  #include <linux/mm.h>
>  #include <linux/fs.h>
> +#include <linux/bpf.h>
>  
>  struct linux_binprm;
>  struct cred;
> @@ -1735,6 +1736,41 @@ static inline void securityfs_remove(struct
> dentry *dentry)
>  
>  #endif
>  
> +#ifdef CONFIG_BPF_SYSCALL
> +#ifdef CONFIG_SECURITY
> +int security_map_create(void);
> +int security_map_modify(struct bpf_map *map);
> +int security_map_read(struct bpf_map *map);
> +int security_prog_load(void);
> +int security_post_create(struct bpf_map *map);
> +#else
> +static inline int security_map_create(void)
> +{
> +	return 0;
> +}
> +
> +static inline int security_map_read(struct bpf_map *map)
> +{
> +	return 0;
> +}
> +
> +static inline int security_map_modify(struct bpf_map *map)
> +{
> +	return 0;
> +}
> +
> +static inline int security_prog_load(void)
> +{
> +	return 0;
> +}
> +
> +static inline int security_post_create(struct bpf_map *map)
> +{
> +	return 0;
> +}
> +#endif /* CONFIG_SECURITY */
> +#endif /* CONFIG_BPF_SYSCALL */

These should be named consistently with the ones in lsm_hooks.h and
should unambiguously indicate that these are hooks for bpf
objects/operations, i.e. security_bpf_map_create(),
security_bpf_map_read(), etc.

Do you need this level of granularity?

Could you coalesce the map_create() and post_map_create() hooks into
one hook and just unwind the create in that case?

Why do you label bpf maps but not bpf progs?  Should we be controlling
the ability to attach/detach a bpf prog (partly controlled by
CAP_NET_ADMIN, but also somewhat broad in scope and doesn't allow
control based on who created the prog)?

Should there be a top-level security_bpf_use() hook and permission
check that limits ability to use bpf() at all?

> +
>  #ifdef CONFIG_SECURITY
>  
>  static inline char *alloc_secdata(void)
> diff --git a/security/security.c b/security/security.c
> index 55b5997e4b72..02272f93a89e 100644
> --- a/security/security.c
> +++ b/security/security.c
> @@ -12,6 +12,7 @@
>   *	(at your option) any later version.
>   */
>  
> +#include <linux/bpf.h>
>  #include <linux/capability.h>
>  #include <linux/dcache.h>
>  #include <linux/module.h>
> @@ -1708,3 +1709,30 @@ int security_audit_rule_match(u32 secid, u32
> field, u32 op, void *lsmrule,
>  				actx);
>  }
>  #endif /* CONFIG_AUDIT */
> +
> +#ifdef CONFIG_BPF_SYSCALL
> +int security_map_create(void)
> +{
> +	return call_int_hook(bpf_map_create, 0);
> +}
> +
> +int security_map_modify(struct bpf_map *map)
> +{
> +	return call_int_hook(bpf_map_modify, 0, map);
> +}
> +
> +int security_map_read(struct bpf_map *map)
> +{
> +	return call_int_hook(bpf_map_read, 0, map);
> +}
> +
> +int security_prog_load(void)
> +{
> +	return call_int_hook(bpf_prog_load, 0);
> +}
> +
> +int security_post_create(struct bpf_map *map)
> +{
> +	return call_int_hook(bpf_post_create, 0, map);
> +}
> +#endif /* CONFIG_BPF_SYSCALL */

^ permalink raw reply

* Re: [PATCH net-next] doc: document MSG_ZEROCOPY
From: Jesper Dangaard Brouer @ 2017-09-01 12:44 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: brouer, netdev, davem, Willem de Bruijn
In-Reply-To: <20170831210013.85220-1-willemdebruijn.kernel@gmail.com>

On Thu, 31 Aug 2017 17:00:13 -0400
Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:

> +More Info
> +---------
> +
> +Much of this document was derived from a longer paper presented at
> +netdev 2.1. For more in-depth information see that paper and talk,
> +the excellent reporting over at LWN.net or read the original code.
> +
> +  paper, slides, video
> +    https://netdevconf.org/2.1/session.html?debruijn
> +
> +  LWN article
> +    https://lwn.net/Articles/726917/
> +
> +  patchset
> +    [PATCH net-next v4 0/9] socket sendmsg MSG_ZEROCOPY
> +    https://lwn.net/Articles/730010/
> +    https://www.spinics.net/lists/netdev/msg447552.html

IMHO I think it would be better to use the type links also used in the
git log.  If you look at the kernel git log, then the "Link:" tag have
the form: http://lkml.kernel.org/r/
And you can simply append the "Message-Id:" email header.

In this case the link would be:
  http://lkml.kernel.org/r/20170803202945.70750-1-willemdebruijn.kernel@gmail.com
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* [PATCH net] udp: fix secpath leak
From: Paolo Abeni @ 2017-09-01 12:42 UTC (permalink / raw)
  To: netdev; +Cc: David S. Miller, Yossi Kuperman, Paul Moore

From: Yossi Kuperman <yossiku@mellanox.com>

After commit dce4551cb2ad ("udp: preserve head state for IP_CMSG_PASSSEC")
we preserve the secpath for the whole skb lifecycle, but we also
end up leaking a reference to it.

We must clear the head state on skb reception, if secpath is
present.

Fixes: dce4551cb2ad ("udp: preserve head state for IP_CMSG_PASSSEC")
Signed-off-by: Yossi Kuperman <yossiku@mellanox.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/ipv4/udp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index a6dc48d76a29..62344804baae 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1176,7 +1176,7 @@ static void udp_set_dev_scratch(struct sk_buff *skb)
 	scratch->csum_unnecessary = !!skb_csum_unnecessary(skb);
 	scratch->is_linear = !skb_is_nonlinear(skb);
 #endif
-	if (likely(!skb->_skb_refdst))
+	if (likely(!skb->_skb_refdst && !skb_sec_path(skb)))
 		scratch->_tsize_state |= UDP_SKB_IS_STATELESS;
 }

-- 
2.13.5

^ permalink raw reply related

* Re: [PATCH net-next 1/2] flow_dissector: Cleanup control flow
From: Hannes Frederic Sowa @ 2017-09-01 12:35 UTC (permalink / raw)
  To: Tom Herbert; +Cc: davem, netdev, alex.popov
In-Reply-To: <20170831222239.21509-2-tom@quantonium.net>

Tom Herbert <tom@quantonium.net> writes:

> __skb_flow_dissect is riddled with gotos that make discerning the flow,
> debugging, and extending the capability difficult. This patch
> reorganizes things so that we only perform goto's after the two main
> switch statements (no gotos within the cases now). It also eliminates
> several goto labels so that there are only two labels that can be target
> for goto.

The problem with the

fdret = ... ;
break;

is that we now have to count curly braces to look what break does
actually break and where fdret is being processed. With the number of
context lines you send for the patch this is hard to review.

I tend to like the gotos a bit more for now.

[...]

Bye,
Hannes

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox