Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: Boot failures with net-next after rebase to v4.17.0-rc1
From: Linus Torvalds @ 2018-04-24 20:04 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev@vger.kernel.org, LKML, David Miller,
	Toke Høiland-Jørgensen, Paul E. McKenney, David Ahern
In-Reply-To: <20180424215429.1de8b1b3@redhat.com>

On Tue, Apr 24, 2018 at 12:54 PM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> Hi all,
>
> I'm experiencing boot failures with net-next git-tree after it got
> rebased/merged with Linus'es tree at v4.17.0-rc1.

I suspect it's the global bit stuff that came in very late in the
merge window, and had been developed and tested for a while before,
but showed some problems under some configs.

The fix is currently in the x86/pti tree in -tip, see:

   x86/pti: Fix boot problems from Global-bit setting

and I expect it will percolate upstream soon.

In the meantime, it would be good to verify that merging that x86/pti
branch fixes it for you?

There is another candidate for boot problems - do you happen to have
CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled? That can under certain
circumstances get a percpu setup page fault because memory hadn't been
initialized sufficiently.

The fix there is to move the mm_init() call one step earlier in
init_main(): start_kernel() (to before trap_init()).

And if it's neither of the above, I think you'll need to help bisect it.

               Linus

^ permalink raw reply

* Re: [PATCH] net/tls: remove redundant second null check on sgout
From: David Miller @ 2018-04-24 20:02 UTC (permalink / raw)
  To: colin.king
  Cc: ilyal, aviadye, davejwatson, netdev, kernel-janitors,
	linux-kernel
In-Reply-To: <20180424123658.6541-1-colin.king@canonical.com>

From: Colin King <colin.king@canonical.com>
Date: Tue, 24 Apr 2018 13:36:58 +0100

> From: Colin Ian King <colin.king@canonical.com>
> 
> A duplicated null check on sgout is redundant as it is known to be
> already true because of the identical earlier check. Remove it.
> Detected by cppcheck:
> 
> net/tls/tls_sw.c:696: (warning) Identical inner 'if' condition is always
> true.
> 
> Signed-off-by: Colin Ian King <colin.king@canonical.com>

Applied to net-next, thank you.

^ permalink raw reply

* Re: [PATCH] fsl/fman_port: remove redundant check on port->rev_info.major
From: David Miller @ 2018-04-24 20:01 UTC (permalink / raw)
  To: colin.king; +Cc: madalin.bucur, netdev, kernel-janitors, linux-kernel
In-Reply-To: <20180424113945.16371-1-colin.king@canonical.com>

From: Colin King <colin.king@canonical.com>
Date: Tue, 24 Apr 2018 12:39:45 +0100

> From: Colin Ian King <colin.king@canonical.com>
> 
> The check port->rev_info.major >= 6 is being performed twice, thus
> the inner second check is always true and is redundant, hence it
> can be removed. Detected by cppcheck.
> 
> drivers/net/ethernet/freescale/fman/fman_port.c:1394]: (warning)
> Identical inner 'if' condition is always true.
> 
> Signed-off-by: Colin Ian King <colin.king@canonical.com>

Applied to net-next, thank you.

^ permalink raw reply

* Re: [PATCH net-next v5 0/3] kernel: add support to collect hardware logs in crash recovery kernel
From: David Miller @ 2018-04-24 19:54 UTC (permalink / raw)
  To: rahul.lakkireddy
  Cc: netdev, kexec, linux-fsdevel, linux-kernel, viro, ebiederm,
	stephen, akpm, torvalds, ganeshgr, nirranjan, indranil
In-Reply-To: <cover.1524329561.git.rahul.lakkireddy@chelsio.com>

From: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Date: Sat, 21 Apr 2018 22:35:52 +0530

> Patch 1 adds API to vmcore module to allow drivers to register callback
> to collect the device specific hardware/firmware logs.  The logs will
> be added to /proc/vmcore as elf notes.
> 
> Patch 2 updates read and mmap logic to append device specific hardware/
> firmware logs as elf notes.
> 
> Patch 3 shows a cxgb4 driver example using the API to collect
> hardware/firmware logs in crash recovery kernel, before hardware is
> initialized.

Are there any serious remaining objections to this series?  I'm going to
integrate this into net-next soon if not.

Thank you.

^ permalink raw reply

* Boot failures with net-next after rebase to v4.17.0-rc1
From: Jesper Dangaard Brouer @ 2018-04-24 19:54 UTC (permalink / raw)
  To: netdev@vger.kernel.org
  Cc: brouer, LKML, David Miller, Toke Høiland-Jørgensen,
	Paul E. McKenney, Linus Torvalds, David Ahern

Hi all,

I'm experiencing boot failures with net-next git-tree after it got
rebased/merged with Linus'es tree at v4.17.0-rc1.

The boot problem only occurs for certain kernel configs. I've bisected
the config problem down to enabling CONFIG_PREEMPT=y and resulting
dependencies in below diff.

Is this a know problem?
Have others experienced this too?

This happens for me on two different (x86_64) testlab machines...
I also tested on Linus'es tree at v4.17-rc2, and problem also exists
for me there.
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


--- config21-steps-works        2018-04-24 21:33:42.353751894 +0200
+++ config20-steps-bad  2018-04-24 21:27:19.852654328 +0200
@@ -131,7 +131,7 @@
 #
 # RCU Subsystem
 #
-CONFIG_TREE_RCU=y
+CONFIG_PREEMPT_RCU=y
 # CONFIG_RCU_EXPERT is not set
 CONFIG_SRCU=y
 CONFIG_TREE_SRCU=y
@@ -421,11 +421,7 @@
 CONFIG_BFQ_GROUP_IOSCHED=y
 CONFIG_PREEMPT_NOTIFIERS=y
 CONFIG_ASN1=y
-CONFIG_INLINE_SPIN_UNLOCK_IRQ=y
-CONFIG_INLINE_READ_UNLOCK=y
-CONFIG_INLINE_READ_UNLOCK_IRQ=y
-CONFIG_INLINE_WRITE_UNLOCK=y
-CONFIG_INLINE_WRITE_UNLOCK_IRQ=y
+CONFIG_UNINLINE_SPIN_UNLOCK=y
 CONFIG_ARCH_SUPPORTS_ATOMIC_RMW=y
 CONFIG_MUTEX_SPIN_ON_OWNER=y
 CONFIG_RWSEM_SPIN_ON_OWNER=y
@@ -497,9 +493,10 @@
 CONFIG_SCHED_SMT=y
 CONFIG_SCHED_MC=y
 CONFIG_SCHED_MC_PRIO=y
-CONFIG_PREEMPT_NONE=y
+# CONFIG_PREEMPT_NONE is not set
 # CONFIG_PREEMPT_VOLUNTARY is not set
-# CONFIG_PREEMPT is not set
+CONFIG_PREEMPT=y
+CONFIG_PREEMPT_COUNT=y
 CONFIG_X86_LOCAL_APIC=y
 CONFIG_X86_IO_APIC=y
 CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS=y
@@ -3931,6 +3928,7 @@
 # CONFIG_SCHEDSTATS is not set
 # CONFIG_SCHED_STACK_END_CHECK is not set
 # CONFIG_DEBUG_TIMEKEEPING is not set
+# CONFIG_DEBUG_PREEMPT is not set
 
 #
 # Lock Debugging (spinlocks, mutexes, etc...)
@@ -3996,6 +3994,7 @@
 CONFIG_FUNCTION_GRAPH_TRACER=y
 # CONFIG_PREEMPTIRQ_EVENTS is not set
 # CONFIG_IRQSOFF_TRACER is not set
+# CONFIG_PREEMPT_TRACER is not set
 # CONFIG_SCHED_TRACER is not set
 CONFIG_HWLAT_TRACER=y
 # CONFIG_FTRACE_SYSCALLS is not set

^ permalink raw reply

* Re: [net 0/6][pull request] Intel Wired LAN Driver Updates 2018-04-24
From: Jeff Kirsher @ 2018-04-24 19:43 UTC (permalink / raw)
  To: davem; +Cc: netdev, nhorman, sassmann, jogreene
In-Reply-To: <20180424192911.22786-1-jeffrey.t.kirsher@intel.com>

[-- Attachment #1: Type: text/plain, Size: 2310 bytes --]

On Tue, 2018-04-24 at 12:29 -0700, Jeff Kirsher wrote:
> This series contains fixes to ixgbevf, igb and ice drivers.
> 
> Colin Ian King fixes the return value on error for the new XDP
> support
> that went into ixgbevf for 4.16.

Oops, I meant 4.17, not 4.16.

> 
> Vinicius provides a fix for queue 0 for igb, which was not receiving
> all
> the credits it needed when QAV mode was enabled.
> 
> Anirudh provides several fixes for the new ice driver, starting with
> properly initializing num_nodes_added to zero.  Fixed up a code
> comment
> to better reflect what is really going on in the code.  Fixed how to
> detect if an OICR interrupt has occurred to a more reliable method.
> 
> Md Fahad fixes the ice driver to allocate the right amount of memory
> when reading and storing the devices MAC addresses.  The device can
> have
> up to 2 MAC addresses (LAN and WoL), while WoL is currently not
> supported, we need to ensure it can be properly handled when support
> is
> added.
> 
> The following are changes since commit
> 9cf2f437ca5b39828984064fad213e68fc17ef11:
>   team: fix netconsole setup over team
> and are available in the git repository at:
>   git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-queue
> 1GbE
> 
> Anirudh Venkataramanan (2):
>   ice: Fix initialization for num_nodes_added
>   ice: Fix incorrect comment for action type
> 
> Ben Shelton (1):
>   ice: Do not check INTEVENT bit for OICR interrupts
> 
> Colin Ian King (1):
>   ixgbevf: ensure xdp_ring resources are free'd on error exit
> 
> Md Fahad Iqbal Polash (1):
>   ice: Fix insufficient memory issue in ice_aq_manage_mac_read
> 
> Vinicius Costa Gomes (1):
>   igb: Fix the transmission mode of queue 0 for Qav mode
> 
>  drivers/net/ethernet/intel/ice/ice_adminq_cmd.h   |  2 +-
>  drivers/net/ethernet/intel/ice/ice_common.c       | 22
> +++++++++++++++++-----
>  drivers/net/ethernet/intel/ice/ice_hw_autogen.h   |  2 --
>  drivers/net/ethernet/intel/ice/ice_main.c         |  4 ----
>  drivers/net/ethernet/intel/ice/ice_sched.c        |  4 ++--
>  drivers/net/ethernet/intel/igb/igb_main.c         | 17
> ++++++++++++++++-
>  drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c |  2 +-
>  7 files changed, 37 insertions(+), 16 deletions(-)
> 

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH net-next 2/2 v1] netns: isolate seqnums to use per-netns locks
From: David Miller @ 2018-04-24 19:39 UTC (permalink / raw)
  To: christian.brauner
  Cc: ebiederm, netdev, linux-kernel, avagin, ktkhai, serge, gregkh
In-Reply-To: <20180423102443.16627-3-christian.brauner@ubuntu.com>

From: Christian Brauner <christian.brauner@ubuntu.com>
Date: Mon, 23 Apr 2018 12:24:43 +0200

> +	#ifdef CONFIG_NET
> +		seqnum = get_ns_uevent_seqnum_by_vpid();
> +	#else
> +		seqnum = uevent_seqnum;
> +	#endif

Please don't indend the code like this.

By indenting the CPP directives, which should be at column zero, the
actual code became double indented.

Thank you.

^ permalink raw reply

* [net 6/6] ice: Fix insufficient memory issue in ice_aq_manage_mac_read
From: Jeff Kirsher @ 2018-04-24 19:29 UTC (permalink / raw)
  To: davem
  Cc: Md Fahad Iqbal Polash, netdev, nhorman, sassmann, jogreene,
	Anirudh Venkataramanan, Jeff Kirsher
In-Reply-To: <20180424192911.22786-1-jeffrey.t.kirsher@intel.com>

From: Md Fahad Iqbal Polash <md.fahad.iqbal.polash@intel.com>

For the MAC read operation, the device can return up to two (LAN and WoL)
MAC addresses. Without access to adequate memory, the device will return
an error. Fixed this by allocating the right amount of memory. Also, logic
to detect and copy the LAN MAC address into the port_info structure has
been added. Note that the WoL MAC address is ignored currently as the WoL
feature isn't supported yet.

Fixes: dc49c7723676 ("ice: Get MAC/PHY/link info and scheduler topology")
Signed-off-by: Md Fahad Iqbal Polash <md.fahad.iqbal.polash@intel.com>
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Tony Brelinski <tonyx.brelinski@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_common.c | 22 +++++++++++++++++-----
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_common.c b/drivers/net/ethernet/intel/ice/ice_common.c
index 21977ec984c4..71d032cc5fa7 100644
--- a/drivers/net/ethernet/intel/ice/ice_common.c
+++ b/drivers/net/ethernet/intel/ice/ice_common.c
@@ -78,6 +78,7 @@ ice_aq_manage_mac_read(struct ice_hw *hw, void *buf, u16 buf_size,
 	struct ice_aq_desc desc;
 	enum ice_status status;
 	u16 flags;
+	u8 i;
 
 	cmd = &desc.params.mac_read;
 
@@ -98,8 +99,16 @@ ice_aq_manage_mac_read(struct ice_hw *hw, void *buf, u16 buf_size,
 		return ICE_ERR_CFG;
 	}
 
-	ether_addr_copy(hw->port_info->mac.lan_addr, resp->mac_addr);
-	ether_addr_copy(hw->port_info->mac.perm_addr, resp->mac_addr);
+	/* A single port can report up to two (LAN and WoL) addresses */
+	for (i = 0; i < cmd->num_addr; i++)
+		if (resp[i].addr_type == ICE_AQC_MAN_MAC_ADDR_TYPE_LAN) {
+			ether_addr_copy(hw->port_info->mac.lan_addr,
+					resp[i].mac_addr);
+			ether_addr_copy(hw->port_info->mac.perm_addr,
+					resp[i].mac_addr);
+			break;
+		}
+
 	return 0;
 }
 
@@ -464,9 +473,12 @@ enum ice_status ice_init_hw(struct ice_hw *hw)
 	if (status)
 		goto err_unroll_sched;
 
-	/* Get port MAC information */
-	mac_buf_len = sizeof(struct ice_aqc_manage_mac_read_resp);
-	mac_buf = devm_kzalloc(ice_hw_to_dev(hw), mac_buf_len, GFP_KERNEL);
+	/* Get MAC information */
+	/* A single port can report up to two (LAN and WoL) addresses */
+	mac_buf = devm_kcalloc(ice_hw_to_dev(hw), 2,
+			       sizeof(struct ice_aqc_manage_mac_read_resp),
+			       GFP_KERNEL);
+	mac_buf_len = 2 * sizeof(struct ice_aqc_manage_mac_read_resp);
 
 	if (!mac_buf) {
 		status = ICE_ERR_NO_MEMORY;
-- 
2.14.3

^ permalink raw reply related

* [net 5/6] ice: Do not check INTEVENT bit for OICR interrupts
From: Jeff Kirsher @ 2018-04-24 19:29 UTC (permalink / raw)
  To: davem
  Cc: Ben Shelton, netdev, nhorman, sassmann, jogreene,
	Anirudh Venkataramanan, Jeff Kirsher
In-Reply-To: <20180424192911.22786-1-jeffrey.t.kirsher@intel.com>

From: Ben Shelton <benjamin.h.shelton@intel.com>

According to the hardware spec, checking the INTEVENT bit isn't a
reliable way to detect if an OICR interrupt has occurred. This is
because this bit can be cleared by the hardware/firmware before the
interrupt service routine has run. So instead, just check for OICR
events every time.

Fixes: 940b61af02f4 ("ice: Initialize PF and setup miscellaneous interrupt")
Signed-off-by: Ben Shelton <benjamin.h.shelton@intel.com>
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Tony Brelinski <tonyx.brelinski@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_hw_autogen.h | 2 --
 drivers/net/ethernet/intel/ice/ice_main.c       | 4 ----
 2 files changed, 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_hw_autogen.h b/drivers/net/ethernet/intel/ice/ice_hw_autogen.h
index 1b9e2ef48a9d..499904874b3f 100644
--- a/drivers/net/ethernet/intel/ice/ice_hw_autogen.h
+++ b/drivers/net/ethernet/intel/ice/ice_hw_autogen.h
@@ -121,8 +121,6 @@
 #define PFINT_FW_CTL_CAUSE_ENA_S	30
 #define PFINT_FW_CTL_CAUSE_ENA_M	BIT(PFINT_FW_CTL_CAUSE_ENA_S)
 #define PFINT_OICR			0x0016CA00
-#define PFINT_OICR_INTEVENT_S		0
-#define PFINT_OICR_INTEVENT_M		BIT(PFINT_OICR_INTEVENT_S)
 #define PFINT_OICR_HLP_RDY_S		14
 #define PFINT_OICR_HLP_RDY_M		BIT(PFINT_OICR_HLP_RDY_S)
 #define PFINT_OICR_CPM_RDY_S		15
diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
index 210b7910f1cd..5299caf55a7f 100644
--- a/drivers/net/ethernet/intel/ice/ice_main.c
+++ b/drivers/net/ethernet/intel/ice/ice_main.c
@@ -1722,9 +1722,6 @@ static irqreturn_t ice_misc_intr(int __always_unused irq, void *data)
 	oicr = rd32(hw, PFINT_OICR);
 	ena_mask = rd32(hw, PFINT_OICR_ENA);
 
-	if (!(oicr & PFINT_OICR_INTEVENT_M))
-		goto ena_intr;
-
 	if (oicr & PFINT_OICR_GRST_M) {
 		u32 reset;
 		/* we have a reset warning */
@@ -1782,7 +1779,6 @@ static irqreturn_t ice_misc_intr(int __always_unused irq, void *data)
 	}
 	ret = IRQ_HANDLED;
 
-ena_intr:
 	/* re-enable interrupt causes that are not handled during this pass */
 	wr32(hw, PFINT_OICR_ENA, ena_mask);
 	if (!test_bit(__ICE_DOWN, pf->state)) {
-- 
2.14.3

^ permalink raw reply related

* [net 2/6] igb: Fix the transmission mode of queue 0 for Qav mode
From: Jeff Kirsher @ 2018-04-24 19:29 UTC (permalink / raw)
  To: davem
  Cc: Vinicius Costa Gomes, netdev, nhorman, sassmann, jogreene,
	Jeff Kirsher
In-Reply-To: <20180424192911.22786-1-jeffrey.t.kirsher@intel.com>

From: Vinicius Costa Gomes <vinicius.gomes@intel.com>

When Qav mode is enabled, queue 0 should be kept on Stream Reservation
mode. From the i210 datasheet, section 8.12.19:

"Note: Queue0 QueueMode must be set to 1b when TransmitMode is set to
Qav." ("QueueMode 1b" represents the Stream Reservation mode)

The solution is to give queue 0 the all the credits it might need, so
it has priority over queue 1.

A situation where this can happen is when cbs is "installed" only on
queue 1, leaving queue 0 alone. For example:

$ tc qdisc replace dev enp2s0 handle 100: parent root mqprio num_tc 3 \
     	   map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0

$ tc qdisc replace dev enp2s0 parent 100:2 cbs locredit -1470 \
     	   hicredit 30 sendslope -980000 idleslope 20000 offload 1

Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/igb/igb_main.c | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index c1c0bc30a16d..cce7ada89255 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -1700,7 +1700,22 @@ static void igb_configure_cbs(struct igb_adapter *adapter, int queue,
 	WARN_ON(hw->mac.type != e1000_i210);
 	WARN_ON(queue < 0 || queue > 1);
 
-	if (enable) {
+	if (enable || queue == 0) {
+		/* i210 does not allow the queue 0 to be in the Strict
+		 * Priority mode while the Qav mode is enabled, so,
+		 * instead of disabling strict priority mode, we give
+		 * queue 0 the maximum of credits possible.
+		 *
+		 * See section 8.12.19 of the i210 datasheet, "Note:
+		 * Queue0 QueueMode must be set to 1b when
+		 * TransmitMode is set to Qav."
+		 */
+		if (queue == 0 && !enable) {
+			/* max "linkspeed" idleslope in kbps */
+			idleslope = 1000000;
+			hicredit = ETH_FRAME_LEN;
+		}
+
 		set_tx_desc_fetch_prio(hw, queue, TX_QUEUE_PRIO_HIGH);
 		set_queue_mode(hw, queue, QUEUE_MODE_STREAM_RESERVATION);
 
-- 
2.14.3

^ permalink raw reply related

* [net 0/6][pull request] Intel Wired LAN Driver Updates 2018-04-24
From: Jeff Kirsher @ 2018-04-24 19:29 UTC (permalink / raw)
  To: davem; +Cc: Jeff Kirsher, netdev, nhorman, sassmann, jogreene

This series contains fixes to ixgbevf, igb and ice drivers.

Colin Ian King fixes the return value on error for the new XDP support
that went into ixgbevf for 4.16.

Vinicius provides a fix for queue 0 for igb, which was not receiving all
the credits it needed when QAV mode was enabled.

Anirudh provides several fixes for the new ice driver, starting with
properly initializing num_nodes_added to zero.  Fixed up a code comment
to better reflect what is really going on in the code.  Fixed how to
detect if an OICR interrupt has occurred to a more reliable method.

Md Fahad fixes the ice driver to allocate the right amount of memory
when reading and storing the devices MAC addresses.  The device can have
up to 2 MAC addresses (LAN and WoL), while WoL is currently not
supported, we need to ensure it can be properly handled when support is
added.

The following are changes since commit 9cf2f437ca5b39828984064fad213e68fc17ef11:
  team: fix netconsole setup over team
and are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-queue 1GbE

Anirudh Venkataramanan (2):
  ice: Fix initialization for num_nodes_added
  ice: Fix incorrect comment for action type

Ben Shelton (1):
  ice: Do not check INTEVENT bit for OICR interrupts

Colin Ian King (1):
  ixgbevf: ensure xdp_ring resources are free'd on error exit

Md Fahad Iqbal Polash (1):
  ice: Fix insufficient memory issue in ice_aq_manage_mac_read

Vinicius Costa Gomes (1):
  igb: Fix the transmission mode of queue 0 for Qav mode

 drivers/net/ethernet/intel/ice/ice_adminq_cmd.h   |  2 +-
 drivers/net/ethernet/intel/ice/ice_common.c       | 22 +++++++++++++++++-----
 drivers/net/ethernet/intel/ice/ice_hw_autogen.h   |  2 --
 drivers/net/ethernet/intel/ice/ice_main.c         |  4 ----
 drivers/net/ethernet/intel/ice/ice_sched.c        |  4 ++--
 drivers/net/ethernet/intel/igb/igb_main.c         | 17 ++++++++++++++++-
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c |  2 +-
 7 files changed, 37 insertions(+), 16 deletions(-)

-- 
2.14.3

^ permalink raw reply

* [net 1/6] ixgbevf: ensure xdp_ring resources are free'd on error exit
From: Jeff Kirsher @ 2018-04-24 19:29 UTC (permalink / raw)
  To: davem; +Cc: Colin Ian King, netdev, nhorman, sassmann, jogreene, Jeff Kirsher
In-Reply-To: <20180424192911.22786-1-jeffrey.t.kirsher@intel.com>

From: Colin Ian King <colin.king@canonical.com>

The current error handling for failed resource setup for xdp_ring
data is a break out of the loop and returning 0 indicated everything
was OK, when in fact it is not.  Fix this by exiting via the
error exit label err_setup_tx that will clean up the resources
correctly and return and error status.

Detected by CoverityScan, CID#1466879 ("Logically dead code")

Fixes: 21092e9ce8b1 ("ixgbevf: Add support for XDP_TX action")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 3d9033f26eff..e3d04f226d57 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -3420,7 +3420,7 @@ static int ixgbevf_setup_all_tx_resources(struct ixgbevf_adapter *adapter)
 		if (!err)
 			continue;
 		hw_dbg(&adapter->hw, "Allocation for XDP Queue %u failed\n", j);
-		break;
+		goto err_setup_tx;
 	}
 
 	return 0;
-- 
2.14.3

^ permalink raw reply related

* [net 4/6] ice: Fix incorrect comment for action type
From: Jeff Kirsher @ 2018-04-24 19:29 UTC (permalink / raw)
  To: davem
  Cc: Anirudh Venkataramanan, netdev, nhorman, sassmann, jogreene,
	Jeff Kirsher
In-Reply-To: <20180424192911.22786-1-jeffrey.t.kirsher@intel.com>

From: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>

Action type 5 defines large action generic values. Fix comment to
reflect that better.

Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Tony Brelinski <tonyx.brelinski@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_adminq_cmd.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h b/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
index 5b13ca1bd85f..7dc5f045e969 100644
--- a/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
+++ b/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
@@ -586,7 +586,7 @@ struct ice_sw_rule_lg_act {
 #define ICE_LG_ACT_MIRROR_VSI_ID_S	3
 #define ICE_LG_ACT_MIRROR_VSI_ID_M	(0x3FF << ICE_LG_ACT_MIRROR_VSI_ID_S)
 
-	/* Action type = 5 - Large Action */
+	/* Action type = 5 - Generic Value */
 #define ICE_LG_ACT_GENERIC		0x5
 #define ICE_LG_ACT_GENERIC_VALUE_S	3
 #define ICE_LG_ACT_GENERIC_VALUE_M	(0xFFFF << ICE_LG_ACT_GENERIC_VALUE_S)
-- 
2.14.3

^ permalink raw reply related

* [net 3/6] ice: Fix initialization for num_nodes_added
From: Jeff Kirsher @ 2018-04-24 19:29 UTC (permalink / raw)
  To: davem
  Cc: Anirudh Venkataramanan, netdev, nhorman, sassmann, jogreene,
	Jeff Kirsher
In-Reply-To: <20180424192911.22786-1-jeffrey.t.kirsher@intel.com>

From: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>

ice_sched_add_nodes_to_layer is used recursively, and so we start
with num_nodes_added being 0. This way, in case of an error or if
num_nodes is NULL, the function just returns 0 to indicate that no
nodes were added.

Fixes: 5513b920a4f7 ("ice: Update Tx scheduler tree for VSI multi-Tx queue support")
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Tony Brelinski <tonyx.brelinski@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_sched.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_sched.c b/drivers/net/ethernet/intel/ice/ice_sched.c
index f16ff3e4a840..2e6c1d92cc88 100644
--- a/drivers/net/ethernet/intel/ice/ice_sched.c
+++ b/drivers/net/ethernet/intel/ice/ice_sched.c
@@ -751,14 +751,14 @@ ice_sched_add_nodes_to_layer(struct ice_port_info *pi,
 	u16 num_added = 0;
 	u32 temp;
 
+	*num_nodes_added = 0;
+
 	if (!num_nodes)
 		return status;
 
 	if (!parent || layer < hw->sw_entry_point_layer)
 		return ICE_ERR_PARAM;
 
-	*num_nodes_added = 0;
-
 	/* max children per node per layer */
 	max_child_nodes =
 	    le16_to_cpu(hw->layer_info[parent->tx_sched_layer].max_children);
-- 
2.14.3

^ permalink raw reply related

* Re: [PATCH net-next v2 0/2] openvswitch: Support conntrack zone limit
From: David Miller @ 2018-04-24 19:03 UTC (permalink / raw)
  To: yihung.wei; +Cc: pshelar, netdev, fw
In-Reply-To: <CAG1aQhKtZ_4AYuKBTzEwG1YwUr9sFchcyh+eWXB_i64GSW_Z8A@mail.gmail.com>

From: Yi-Hung Wei <yihung.wei@gmail.com>
Date: Tue, 24 Apr 2018 11:21:33 -0700

> On Tue, Apr 24, 2018 at 10:42 AM, David Miller <davem@davemloft.net> wrote:
>> From: Pravin Shelar <pshelar@ovn.org>
>> Date: Mon, 23 Apr 2018 23:34:48 -0700
>>
>>> OK. Thanks for the info.
>>
>> So, ACK, Reviewed-by, etc.? :-)
>>
> 
> Parvin provides feedback in a previous email.  I will address them and
> send out v3.

Aha, I see, thanks for explaining.

^ permalink raw reply

* Re: [RFC PATCH ghak32 V2 01/13] audit: add container id
From: Paul Moore @ 2018-04-24 19:01 UTC (permalink / raw)
  To: Richard Guy Briggs
  Cc: cgroups, containers, linux-api, Linux-Audit Mailing List,
	linux-fsdevel, LKML, netdev, ebiederm, luto, jlayton, carlos,
	dhowells, viro, simo, Eric Paris, serge
In-Reply-To: <20180424020200.imonhbkwtb73luxl@madcap2.tricolour.ca>

On Mon, Apr 23, 2018 at 10:02 PM, Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2018-04-23 19:15, Paul Moore wrote:
>> On Sat, Apr 21, 2018 at 10:34 AM, Richard Guy Briggs <rgb@redhat.com> wrote:
>> > On 2018-04-18 19:47, Paul Moore wrote:
>> >> On Fri, Mar 16, 2018 at 5:00 AM, Richard Guy Briggs <rgb@redhat.com> wrote:
>> >> > Implement the proc fs write to set the audit container ID of a process,
>> >> > emitting an AUDIT_CONTAINER record to document the event.
>> >> >
>> >> > This is a write from the container orchestrator task to a proc entry of
>> >> > the form /proc/PID/containerid where PID is the process ID of the newly
>> >> > created task that is to become the first task in a container, or an
>> >> > additional task added to a container.
>> >> >
>> >> > The write expects up to a u64 value (unset: 18446744073709551615).
>> >> >
>> >> > This will produce a record such as this:
>> >> > type=CONTAINER msg=audit(1519903238.968:261): op=set pid=596 uid=0 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 auid=0 tty=pts0 ses=1 opid=596 old-contid=18446744073709551615 contid=123455 res=0
>> >> >
>> >> > The "op" field indicates an initial set.  The "pid" to "ses" fields are
>> >> > the orchestrator while the "opid" field is the object's PID, the process
>> >> > being "contained".  Old and new container ID values are given in the
>> >> > "contid" fields, while res indicates its success.
>> >> >
>> >> > It is not permitted to self-set, unset or re-set the container ID.  A
>> >> > child inherits its parent's container ID, but then can be set only once
>> >> > after.
>> >> >
>> >> > See: https://github.com/linux-audit/audit-kernel/issues/32
>> >> >
>> >> > Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
>> >> > ---
>> >> >  fs/proc/base.c             | 37 ++++++++++++++++++++
>> >> >  include/linux/audit.h      | 16 +++++++++
>> >> >  include/linux/init_task.h  |  4 ++-
>> >> >  include/linux/sched.h      |  1 +
>> >> >  include/uapi/linux/audit.h |  2 ++
>> >> >  kernel/auditsc.c           | 84 ++++++++++++++++++++++++++++++++++++++++++++++
>> >> >  6 files changed, 143 insertions(+), 1 deletion(-)

...

>> >> >  /* audit_rule_data supports filter rules with both integer and string
>> >> >   * fields.  It corresponds with AUDIT_ADD_RULE, AUDIT_DEL_RULE and
>> >> > diff --git a/kernel/auditsc.c b/kernel/auditsc.c
>> >> > index 4e0a4ac..29c8482 100644
>> >> > --- a/kernel/auditsc.c
>> >> > +++ b/kernel/auditsc.c
>> >> > @@ -2073,6 +2073,90 @@ int audit_set_loginuid(kuid_t loginuid)
>> >> >         return rc;
>> >> >  }
>> >> >
>> >> > +static int audit_set_containerid_perm(struct task_struct *task, u64 containerid)
>> >> > +{
>> >> > +       struct task_struct *parent;
>> >> > +       u64 pcontainerid, ccontainerid;
>> >> > +
>> >> > +       /* Don't allow to set our own containerid */
>> >> > +       if (current == task)
>> >> > +               return -EPERM;
>> >>
>> >> Why not?  Is there some obvious security concern that I missing?
>> >
>> > We then lose the distinction in the AUDIT_CONTAINER record between the
>> > initiating PID and the target PID.  This was outlined in the proposal.
>>
>> I just went back and reread the v3 proposal and I still don't see a
>> good explanation of this.  Why is this bad?  What's the security
>> concern?
>
> I don't remember, specifically.  Maybe this has been addressed by the
> check for children/threads or identical parent container ID.  So, I'm
> reluctantly willing to remove that check for now.

Okay.  For the record, if someone can explain to me why this
restriction saves us from some terrible situation I'm all for leaving
it.  I'm just opposed to restrictions without solid reasoning behind
them.

>> > Having said that, I'm still not sure we have protected sufficiently from
>> > a child turning around and setting it's parent's as yet unset or
>> > inherited audit container ID.
>>
>> Yes, I believe we only want to let a task set the audit container for
>> it's children (or itself/threads if we decide to allow that, see
>> above).  There *has* to be a function to check to see if a task if a
>> child of a given task ... right? ... although this is likely to be a
>> pointer traversal and locking nightmare ... hmmm.
>
> Isn't that just (struct task_struct)parent == (struct
> task_struct)child->parent (or ->real_parent)?
>
> And now that I say that, it is covered by the following patch's child
> check, so as long as we keep that, we should be fine.

I was thinking of checking not just current's immediate children, but
any of it's descendants as I believe that is what we want to limit,
yes?  I just worry that it isn't really practical to perform that
check.

>> >> I ask because I suppose it might be possible for some container
>> >> runtime to do a fork, setup some of the environment and them exec the
>> >> container (before you answer the obvious "namespaces!" please remember
>> >> we're not trying to define containers).
>> >
>> > I don't think namespaces have any bearing on this concern since none are
>> > required.
>> >
>> >> > +       /* Don't allow the containerid to be unset */
>> >> > +       if (!cid_valid(containerid))
>> >> > +               return -EINVAL;
>> >> > +       /* if we don't have caps, reject */
>> >> > +       if (!capable(CAP_AUDIT_CONTROL))
>> >> > +               return -EPERM;
>> >> > +       /* if containerid is unset, allow */
>> >> > +       if (!audit_containerid_set(task))
>> >> > +               return 0;
>> >> > +       /* it is already set, and not inherited from the parent, reject */
>> >> > +       ccontainerid = audit_get_containerid(task);
>> >> > +       rcu_read_lock();
>> >> > +       parent = rcu_dereference(task->real_parent);
>> >> > +       rcu_read_unlock();
>> >> > +       task_lock(parent);
>> >> > +       pcontainerid = audit_get_containerid(parent);
>> >> > +       task_unlock(parent);
>> >> > +       if (ccontainerid != pcontainerid)
>> >> > +               return -EPERM;
>> >> > +       return 0;

I'm looking at the parent checks again and I wonder if the logic above
is what we really want.  Maybe it is, but I'm not sure.

Things I'm wondering about:

* "ccontainerid" and "containerid" are too close in name, I kept
confusing myself when looking at this code.  Please change one.  Bonus
points if it is shorter.

* What if the orchestrator wants to move the task to a new container?
Right now it looks like you can only do that once, then then the
task's audit container ID will no longer be the same as real_parent
... or does the orchestrator change that?  *Can* the orchestrator
change real_parent (I suspect the answer is "no")?

* I think the key is the relationship between current and task, not
between task and task->real_parent.  I believe what we really care
about is that task is a descendant of current.  We might also want to
allow current to change the audit container ID if it holds
CAP_AUDIT_CONTROL, regardless of it's relationship with task.

>> >> > +static void audit_log_set_containerid(struct task_struct *task, u64 oldcontainerid,
>> >> > +                                     u64 containerid, int rc)
>> >> > +{
>> >> > +       struct audit_buffer *ab;
>> >> > +       uid_t uid;
>> >> > +       struct tty_struct *tty;
>> >> > +
>> >> > +       if (!audit_enabled)
>> >> > +               return;
>> >> > +
>> >> > +       ab = audit_log_start(NULL, GFP_KERNEL, AUDIT_CONTAINER);
>> >> > +       if (!ab)
>> >> > +               return;
>> >> > +
>> >> > +       uid = from_kuid(&init_user_ns, task_uid(current));
>> >> > +       tty = audit_get_tty(current);
>> >> > +
>> >> > +       audit_log_format(ab, "op=set pid=%d uid=%u", task_tgid_nr(current), uid);
>> >> > +       audit_log_task_context(ab);
>> >> > +       audit_log_format(ab, " auid=%u tty=%s ses=%u opid=%d old-contid=%llu contid=%llu res=%d",
>> >> > +                        from_kuid(&init_user_ns, audit_get_loginuid(current)),
>> >> > +                        tty ? tty_name(tty) : "(none)", audit_get_sessionid(current),
>> >> > +                        task_tgid_nr(task), oldcontainerid, containerid, !rc);
>> >> > +
>> >> > +       audit_put_tty(tty);
>> >> > +       audit_log_end(ab);
>> >> > +}
>> >> > +
>> >> > +/**
>> >> > + * audit_set_containerid - set current task's audit_context containerid
>> >> > + * @containerid: containerid value
>> >> > + *
>> >> > + * Returns 0 on success, -EPERM on permission failure.
>> >> > + *
>> >> > + * Called (set) from fs/proc/base.c::proc_containerid_write().
>> >> > + */
>> >> > +int audit_set_containerid(struct task_struct *task, u64 containerid)
>> >> > +{
>> >> > +       u64 oldcontainerid;
>> >> > +       int rc;
>> >> > +
>> >> > +       oldcontainerid = audit_get_containerid(task);
>> >> > +
>> >> > +       rc = audit_set_containerid_perm(task, containerid);
>> >> > +       if (!rc) {
>> >> > +               task_lock(task);
>> >> > +               task->containerid = containerid;
>> >> > +               task_unlock(task);
>> >> > +       }
>> >> > +
>> >> > +       audit_log_set_containerid(task, oldcontainerid, containerid, rc);
>> >> > +       return rc;
>> >>
>> >> Why are audit_set_containerid_perm() and audit_log_containerid()
>> >> separate functions?
>> >
>> > (I assume you mean audit_log_set_containerid()?)
>>
>> Yep.  My fingers got tired typing in that function name and decided a
>> shortcut was necessary.
>>
>> > It seemed clearer that all the permission checking was in one function
>> > and its return code could be used to report the outcome when logging the
>> > (attempted) action.  This is the same structure as audit_set_loginuid()
>> > and it made sense.
>>
>> When possible I really like it when the permission checks are in the
>> same function as the code which does the work; it's less likely to get
>> abused that way (you have to willfully bypass the access checks).  The
>> exceptions might be if you wanted to reuse the access control code, or
>> insert a modular access mechanism (e.g. LSMs).
>
> I don't follow how it could be abused.  The return code from the perm
> check gates setting the value and is used in the success field in the
> log.

If the permission checks are in the same function body as the code
which does the work you have to either split the function, or rewrite
it, if you want to bypass the permission checks.  It may be more of a
style issue than an actual safety issue, but the comments about
single-use functions in the same scope is the tie breaker.

-- 
paul moore
www.paul-moore.com

^ permalink raw reply

* Re: [PATCH bpf-next 08/15] bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP
From: Björn Töpel @ 2018-04-24 18:58 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Karlsson, Magnus, Alexander Duyck, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	Daniel Borkmann, Michael S. Tsirkin, Network Development,
	Björn Töpel, michael.lundkvist, Brandeburg, Jesse,
	Singhai, Anjali, Zhang, Qi Z
In-Reply-To: <CAF=yD-+W8=wDXc1=wHi8KF0whgyFLqvo=tROOq16XfA7MDkR+Q@mail.gmail.com>

2018-04-24 18:56 GMT+02:00 Willem de Bruijn <willemdebruijn.kernel@gmail.com>:
> On Mon, Apr 23, 2018 at 9:56 AM, Björn Töpel <bjorn.topel@gmail.com> wrote:
>> From: Björn Töpel <bjorn.topel@intel.com>
>>
>> The xskmap is yet another BPF map, very much inspired by
>> dev/cpu/sockmap, and is a holder of AF_XDP sockets. A user application
>> adds AF_XDP sockets into the map, and by using the bpf_redirect_map
>> helper, an XDP program can redirect XDP frames to an AF_XDP socket.
>>
>> Note that a socket that is bound to certain ifindex/queue index will
>> *only* accept XDP frames from that netdev/queue index. If an XDP
>> program tries to redirect from a netdev/queue index other than what
>> the socket is bound to, the frame will not be received on the socket.
>>
>> A socket can reside in multiple maps.
>>
>> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
>
>> +struct xsk_map_entry {
>> +       struct xdp_sock *xs;
>> +       struct rcu_head rcu;
>> +};
>
>> +struct xdp_sock *__xsk_map_lookup_elem(struct bpf_map *map, u32 key)
>> +{
>> +       struct xsk_map *m = container_of(map, struct xsk_map, map);
>> +       struct xsk_map_entry *entry;
>> +
>> +       if (key >= map->max_entries)
>> +               return NULL;
>> +
>> +       entry = READ_ONCE(m->xsk_map[key]);
>> +       return entry ? entry->xs : NULL;
>> +}
>
> This dynamically allocated structure adds an extra cacheline lookup. If
> xdp_sock gets an rcu_head, it can be linked into the map directly.

Nice one! I'll try this out!

^ permalink raw reply

* Re: [PATCH net-next 5/8] net: mscc: Add initial Ocelot switch support
From: Alexandre Belloni @ 2018-04-24 18:52 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Florian Fainelli, David S . Miller, Allan Nielsen,
	razvan.stefanescu, po.liu, Thomas Petazzoni, netdev, devicetree,
	linux-kernel, linux-mips
In-Reply-To: <20180330145008.GE28244@lunn.ch>

I realise now that I didn't reply to this comment:

On 30/03/2018 16:50:08+0200, Andrew Lunn wrote:
> > The fact is that ocelot doesn't have separate controls. The port is
> > either forwarding or not. If it is not forwarding, then there is nothing
> > to tell the HW to do.
> 
> Think about the following sequence:
> 
> ip link set lan0 up
> 
> After this command, i expect to see packets on lan0 arrive at the
> host, tcpdump to work, etc. This probably means the port is in
> 'forwarding' mode, or for B53, STP is disabled.
> 

On Ocelot, forwarding packets to the host (i.e. forwarding frames
received on the port to the cpu port) is separate from bridging ports
together. So after that command, the host can receive packets on lan0.


-- 
Alexandre Belloni, Bootlin (formerly Free Electrons)
Embedded Linux and Kernel engineering
https://bootlin.com

^ permalink raw reply

* Re: [PATCH bpf-next 05/15] xsk: add support for bind for Rx
From: Björn Töpel @ 2018-04-24 18:43 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Karlsson, Magnus, Alexander Duyck, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	Daniel Borkmann, Michael S. Tsirkin, Network Development,
	michael.lundkvist, Brandeburg, Jesse, Singhai, Anjali,
	Zhang, Qi Z
In-Reply-To: <CAF=yD-J-KfXzEKNZPr55PP4c2gxziU=6nPKJ3sty1EB3quvUdA@mail.gmail.com>

2018-04-24 18:55 GMT+02:00 Willem de Bruijn <willemdebruijn.kernel@gmail.com>:
> On Mon, Apr 23, 2018 at 9:56 AM, Björn Töpel <bjorn.topel@gmail.com> wrote:
>> From: Magnus Karlsson <magnus.karlsson@intel.com>
>>
>> Here, the bind syscall is added. Binding an AF_XDP socket, means
>> associating the socket to an umem, a netdev and a queue index. This
>> can be done in two ways.
>>
>> The first way, creating a "socket from scratch". Create the umem using
>> the XDP_UMEM_REG setsockopt and an associated fill queue with
>> XDP_UMEM_FILL_QUEUE. Create the Rx queue using the XDP_RX_QUEUE
>> setsockopt. Call bind passing ifindex and queue index ("channel" in
>> ethtool speak).
>>
>> The second way to bind a socket, is simply skipping the
>> umem/netdev/queue index, and passing another already setup AF_XDP
>> socket. The new socket will then have the same umem/netdev/queue index
>> as the parent so it will share the same umem. You must also set the
>> flags field in the socket address to XDP_SHARED_UMEM.
>>
>> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
>> ---
>
>> +static struct socket *xsk_lookup_xsk_from_fd(int fd, int *err)
>> +{
>> +       struct socket *sock;
>> +
>> +       *err = -ENOTSOCK;
>> +       sock = sockfd_lookup(fd, err);
>> +       if (!sock)
>> +               return NULL;
>> +
>> +       if (sock->sk->sk_family != PF_XDP) {
>> +               *err = -ENOPROTOOPT;
>> +               sockfd_put(sock);
>> +               return NULL;
>> +       }
>> +
>> +       *err = 0;
>> +       return sock;
>> +}
>
> In this and similar cases, can use ERR_PTR to avoid the extra argument.

Noted. Thanks!

^ permalink raw reply

* Re: [PATCH v3] kvmalloc: always use vmalloc if CONFIG_DEBUG_SG
From: Mikulas Patocka @ 2018-04-24 18:41 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Michal Hocko, David Miller, Andrew Morton, linux-mm, eric.dumazet,
	edumazet, netdev, linux-kernel, mst, jasowang, virtualization,
	dm-devel, Vlastimil Babka, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim
In-Reply-To: <20180424171651.GC30577@bombadil.infradead.org>



On Tue, 24 Apr 2018, Matthew Wilcox wrote:

> On Tue, Apr 24, 2018 at 08:29:14AM -0400, Mikulas Patocka wrote:
> > 
> > 
> > On Mon, 23 Apr 2018, Matthew Wilcox wrote:
> > 
> > > On Mon, Apr 23, 2018 at 08:06:16PM -0400, Mikulas Patocka wrote:
> > > > Some bugs (such as buffer overflows) are better detected
> > > > with kmalloc code, so we must test the kmalloc path too.
> > > 
> > > Well now, this brings up another item for the collective TODO list --
> > > implement redzone checks for vmalloc.  Unless this is something already
> > > taken care of by kasan or similar.
> > 
> > The kmalloc overflow testing is also not ideal - it rounds the size up to 
> > the next slab size and detects buffer overflows only at this boundary.
> > 
> > Some times ago, I made a "kmalloc guard" patch that places a magic number 
> > immediatelly after the requested size - so that it can detect overflows at 
> > byte boundary 
> > ( https://www.redhat.com/archives/dm-devel/2014-September/msg00018.html )
> > 
> > That patch found a bug in crypto code:
> > ( http://lkml.iu.edu/hypermail/linux/kernel/1409.1/02325.html )
> 
> Is it still worth doing this, now we have kasan?

The kmalloc guard has much lower overhead than kasan.

(BTW. when I tried kasan, it oopsed with persistent memory)

Mikulas

^ permalink raw reply

* Re: [PATCH bpf-next 07/15] xsk: add Rx receive functions and poll support
From: Björn Töpel @ 2018-04-24 18:32 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Karlsson, Magnus, Alexander Duyck, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	Daniel Borkmann, Michael S. Tsirkin, Network Development,
	Björn Töpel, michael.lundkvist, Brandeburg, Jesse,
	Singhai, Anjali, Zhang, Qi Z
In-Reply-To: <CAF=yD-LAqKaC5U1ZR7ZNyZpmBonp9W68qp8z3v-P8cKFkXe4AA@mail.gmail.com>

2018-04-24 18:56 GMT+02:00 Willem de Bruijn <willemdebruijn.kernel@gmail.com>:
> On Mon, Apr 23, 2018 at 9:56 AM, Björn Töpel <bjorn.topel@gmail.com> wrote:
>> From: Björn Töpel <bjorn.topel@intel.com>
>>
>> Here the actual receive functions of AF_XDP are implemented, that in a
>> later commit, will be called from the XDP layers.
>>
>> There's one set of functions for the XDP_DRV side and another for
>> XDP_SKB (generic).
>>
>> Support for the poll syscall is also implemented.
>>
>> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
>> ---
>
>> +/* Common functions operating for both RXTX and umem queues */
>> +
>> +static inline u32 xskq_nb_avail(struct xsk_queue *q, u32 dcnt)
>> +{
>> +       u32 entries = q->prod_tail - q->cons_tail;
>> +
>> +       if (entries == 0) {
>> +               /* Refresh the local pointer */
>> +               q->prod_tail = READ_ONCE(q->ring->producer);
>> +       }
>> +
>> +       entries = q->prod_tail - q->cons_tail;
>
> Probably meant to be inside the branch? Though I see the same
> pattern in the userspace example program.
>

Yes! Nasty C&P going on here... :-(

>> +static inline u32 *xskq_validate_id(struct xsk_queue *q)
>> +{
>> +       while (q->cons_tail != q->cons_head) {
>> +               struct xdp_umem_ring *ring = (struct xdp_umem_ring *)q->ring;
>> +               unsigned int idx = q->cons_tail & q->ring_mask;
>> +
>> +               if (xskq_is_valid_id(q, ring->desc[idx]))
>> +                       return &ring->desc[idx];
>
> Missing a q->cons_tail increment in this loop?

Indeed! Good catch! Thanks!


Björn

^ permalink raw reply

* Re: [net-next v2] ipv6: sr: Compute flowlabel for outer IPv6 header of seg6 encap mode
From: Ahmed Abdelsalam @ 2018-04-24 18:24 UTC (permalink / raw)
  To: Ahmed Abdelsalam
  Cc: davem, dav.lebrun, kuznet, yoshfuji, netdev, linux-kernel
In-Reply-To: <1524592795-1467-1-git-send-email-amsalam20@gmail.com>

On Tue, 24 Apr 2018 19:59:55 +0200
Ahmed Abdelsalam <amsalam20@gmail.com> wrote:

> This patch has been tested for IPv6, IPv4, and L2 traffic.
> 
> Signed-off-by: Ahmed Abdelsalam <amsalam20@gmail.com>
> ---
>  include/net/netns/ipv6.h   |  1 +
>  net/ipv6/seg6_iptunnel.c   | 24 ++++++++++++++++++++++--
>  net/ipv6/sysctl_net_ipv6.c |  8 ++++++++
>  3 files changed, 31 insertions(+), 2 deletions(-)
> 
> diff --git a/include/net/netns/ipv6.h b/include/net/netns/ipv6.h
> index 97b3a54..c978a31 100644
> --- a/include/net/netns/ipv6.h
> +++ b/include/net/netns/ipv6.h
> @@ -43,6 +43,7 @@ struct netns_sysctl_ipv6 {
>  	int max_hbh_opts_cnt;
>  	int max_dst_opts_len;
>  	int max_hbh_opts_len;
> +	int seg6_flowlabel;
>  };
>  
>  struct netns_ipv6 {
> diff --git a/net/ipv6/seg6_iptunnel.c b/net/ipv6/seg6_iptunnel.c
> index 5fe1394..3d9cd86 100644
> --- a/net/ipv6/seg6_iptunnel.c
> +++ b/net/ipv6/seg6_iptunnel.c
> @@ -91,6 +91,24 @@ static void set_tun_src(struct net *net, struct net_device *dev,
>  	rcu_read_unlock();
>  }
>  
> +/* Compute flowlabel for outer IPv6 header */
> +__be32 seg6_make_flowlabel(struct net *net, struct sk_buff *skb,
> +			   struct ipv6hdr *inner_hdr)

David, please take v3 of the patch. 
I re-defined seg6_make_flowlabel () as static to fix the kbuild test robot.

-- 
Ahmed Abdelsalam <amsalam20@gmail.com>

^ permalink raw reply

* [net-next v3] ipv6: sr: Compute flowlabel for outer IPv6 header of seg6 encap mode
From: Ahmed Abdelsalam @ 2018-04-24 18:23 UTC (permalink / raw)
  To: davem, dav.lebrun, kuznet, yoshfuji, netdev, linux-kernel
  Cc: Ahmed Abdelsalam

ECMP (equal-cost multipath) hashes are typically computed on the packets'
5-tuple(src IP, dst IP, src port, dst port, L4 proto).

For encapsulated packets, the L4 data is not readily available and ECMP
hashing will often revert to (src IP, dst IP). This will lead to traffic
polarization on a single ECMP path, causing congestion and waste of network
capacity.

In IPv6, the 20-bit flow label field is also used as part of the ECMP hash.
In the lack of L4 data, the hashing will be on (src IP, dst IP, flow
label). Having a non-zero flow label is thus important for proper traffic
load balancing when L4 data is unavailable (i.e., when packets are
encapsulated).

Currently, the seg6_do_srh_encap() function extracts the original packet's
flow label and set it as the outer IPv6 flow label. There are two issues
with this behaviour:

a) There is no guarantee that the inner flow label is set by the source.
b) If the original packet is not IPv6, the flow label will be set to
zero (e.g., IPv4 or L2 encap).

This patch adds a function, named seg6_make_flowlabel(), that computes a
flow label from a given skb. It supports IPv6, IPv4 and L2 payloads, and
leverages the per namespace 'seg6_flowlabel" sysctl value.

The currently support behaviours are as follows:
-1 set flowlabel to zero.
0 copy flowlabel from Inner paceket in case of Inner IPv6
(Set flowlabel to 0 in case IPv4/L2)
1 Compute the flowlabel using seg6_make_flowlabel()

This patch has been tested for IPv6, IPv4, and L2 traffic.

Signed-off-by: Ahmed Abdelsalam <amsalam20@gmail.com>
---
 include/net/netns/ipv6.h   |  1 +
 net/ipv6/seg6_iptunnel.c   | 24 ++++++++++++++++++++++--
 net/ipv6/sysctl_net_ipv6.c |  8 ++++++++
 3 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/include/net/netns/ipv6.h b/include/net/netns/ipv6.h
index 97b3a54..c978a31 100644
--- a/include/net/netns/ipv6.h
+++ b/include/net/netns/ipv6.h
@@ -43,6 +43,7 @@ struct netns_sysctl_ipv6 {
 	int max_hbh_opts_cnt;
 	int max_dst_opts_len;
 	int max_hbh_opts_len;
+	int seg6_flowlabel;
 };
 
 struct netns_ipv6 {
diff --git a/net/ipv6/seg6_iptunnel.c b/net/ipv6/seg6_iptunnel.c
index 5fe1394..9898926 100644
--- a/net/ipv6/seg6_iptunnel.c
+++ b/net/ipv6/seg6_iptunnel.c
@@ -91,6 +91,24 @@ static void set_tun_src(struct net *net, struct net_device *dev,
 	rcu_read_unlock();
 }
 
+/* Compute flowlabel for outer IPv6 header */
+static __be32 seg6_make_flowlabel(struct net *net, struct sk_buff *skb,
+				  struct ipv6hdr *inner_hdr)
+{
+	int do_flowlabel = net->ipv6.sysctl.seg6_flowlabel;
+	__be32 flowlabel = 0;
+	u32 hash;
+
+	if (do_flowlabel > 0) {
+		hash = skb_get_hash(skb);
+		rol32(hash, 16);
+		flowlabel = (__force __be32)hash & IPV6_FLOWLABEL_MASK;
+	} else if (!do_flowlabel && skb->protocol == htons(ETH_P_IPV6)) {
+		flowlabel = ip6_flowlabel(inner_hdr);
+	}
+	return flowlabel;
+}
+
 /* encapsulate an IPv6 packet within an outer IPv6 header with a given SRH */
 int seg6_do_srh_encap(struct sk_buff *skb, struct ipv6_sr_hdr *osrh, int proto)
 {
@@ -99,6 +117,7 @@ int seg6_do_srh_encap(struct sk_buff *skb, struct ipv6_sr_hdr *osrh, int proto)
 	struct ipv6hdr *hdr, *inner_hdr;
 	struct ipv6_sr_hdr *isrh;
 	int hdrlen, tot_len, err;
+	__be32 flowlabel;
 
 	hdrlen = (osrh->hdrlen + 1) << 3;
 	tot_len = hdrlen + sizeof(*hdr);
@@ -119,12 +138,13 @@ int seg6_do_srh_encap(struct sk_buff *skb, struct ipv6_sr_hdr *osrh, int proto)
 	 * decapsulation will overwrite inner hlim with outer hlim
 	 */
 
+	flowlabel = seg6_make_flowlabel(net, skb, inner_hdr);
 	if (skb->protocol == htons(ETH_P_IPV6)) {
 		ip6_flow_hdr(hdr, ip6_tclass(ip6_flowinfo(inner_hdr)),
-			     ip6_flowlabel(inner_hdr));
+			     flowlabel);
 		hdr->hop_limit = inner_hdr->hop_limit;
 	} else {
-		ip6_flow_hdr(hdr, 0, 0);
+		ip6_flow_hdr(hdr, 0, flowlabel);
 		hdr->hop_limit = ip6_dst_hoplimit(skb_dst(skb));
 	}
 
diff --git a/net/ipv6/sysctl_net_ipv6.c b/net/ipv6/sysctl_net_ipv6.c
index 6fbdef6..e15cd37 100644
--- a/net/ipv6/sysctl_net_ipv6.c
+++ b/net/ipv6/sysctl_net_ipv6.c
@@ -152,6 +152,13 @@ static struct ctl_table ipv6_table_template[] = {
 		.extra1		= &zero,
 		.extra2		= &one,
 	},
+	{
+		.procname	= "seg6_flowlabel",
+		.data		= &init_net.ipv6.sysctl.seg6_flowlabel,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec
+	},
 	{ }
 };
 
@@ -217,6 +224,7 @@ static int __net_init ipv6_sysctl_net_init(struct net *net)
 	ipv6_table[12].data = &net->ipv6.sysctl.max_dst_opts_len;
 	ipv6_table[13].data = &net->ipv6.sysctl.max_hbh_opts_len;
 	ipv6_table[14].data = &net->ipv6.sysctl.multipath_hash_policy,
+	ipv6_table[15].data = &net->ipv6.sysctl.seg6_flowlabel;
 
 	ipv6_route_table = ipv6_route_sysctl_init(net);
 	if (!ipv6_route_table)
-- 
2.1.4

^ permalink raw reply related

* Re: [PATCH 5/5] PCI: remove PCI_DMA_BUS_IS_PHYS
From: Palmer Dabbelt @ 2018-04-24 18:23 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: iommu, linux-arch, linux-block, linux-ide, linux-scsi, netdev,
	davem, linux-kernel
In-Reply-To: <20180424181625.22410-6-hch@lst.de>

On Tue, 24 Apr 2018 11:16:25 PDT (-0700), Christoph Hellwig wrote:
> This was used by the ide, scsi and networking code in the past to
> determine if they should bounce payloads.  Now that the dma mapping
> always have to support dma to all physical memory (thanks to swiotlb
> for non-iommu systems) there is no need to this crude hack any more.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
> [...]
> diff --git a/arch/riscv/include/asm/pci.h b/arch/riscv/include/asm/pci.h
> index 0f2fc9ef20fc..b3638c505728 100644
> --- a/arch/riscv/include/asm/pci.h
> +++ b/arch/riscv/include/asm/pci.h
> @@ -26,9 +26,6 @@
>  /* RISC-V shim does not initialize PCI bus */
>  #define pcibios_assign_all_busses() 1
>
> -/* We do not have an IOMMU */
> -#define PCI_DMA_BUS_IS_PHYS 1
> -
>  extern int isa_dma_bridge_buggy;
>
>  #ifdef CONFIG_PCI

Thanks!

Acked-by: Palmer Dabbelt <palmer@sifive.com> (For the RISC-V change)

^ permalink raw reply

* Re: [PATCH net-next v2 0/2] openvswitch: Support conntrack zone limit
From: Yi-Hung Wei @ 2018-04-24 18:21 UTC (permalink / raw)
  To: David Miller
  Cc: Pravin Shelar, Linux Kernel Network Developers, Florian Westphal
In-Reply-To: <20180424.134219.297866275253495288.davem@davemloft.net>

On Tue, Apr 24, 2018 at 10:42 AM, David Miller <davem@davemloft.net> wrote:
> From: Pravin Shelar <pshelar@ovn.org>
> Date: Mon, 23 Apr 2018 23:34:48 -0700
>
>> OK. Thanks for the info.
>
> So, ACK, Reviewed-by, etc.? :-)
>

Parvin provides feedback in a previous email.  I will address them and
send out v3.

Thanks,

-Yi-Hung

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox