Netdev List
 help / color / mirror / Atom feed
* [net 4/7] ice: Use capability count returned by the firmware
From: Jeff Kirsher @ 2018-10-24 21:47 UTC (permalink / raw)
  To: davem; +Cc: Anirudh Venkataramanan, netdev, nhorman, sassmann, Jeff Kirsher
In-Reply-To: <20181024214731.26036-1-jeffrey.t.kirsher@intel.com>

From: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>

The firmware now returns the capability count in the command buffer.
Use it.

Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_common.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_common.c b/drivers/net/ethernet/intel/ice/ice_common.c
index c52f450f2c0d..78df54b25bf1 100644
--- a/drivers/net/ethernet/intel/ice/ice_common.c
+++ b/drivers/net/ethernet/intel/ice/ice_common.c
@@ -1531,9 +1531,7 @@ ice_aq_discover_caps(struct ice_hw *hw, void *buf, u16 buf_size, u32 *cap_count,
 	if (!status)
 		ice_parse_caps(hw, buf, le32_to_cpu(cmd->count), opc);
 	else if (hw->adminq.sq_last_status == ICE_AQ_RC_ENOMEM)
-		*cap_count =
-			DIV_ROUND_UP(le16_to_cpu(desc.datalen),
-				     sizeof(struct ice_aqc_list_caps_elem));
+		*cap_count = le32_to_cpu(cmd->count);
 	return status;
 }
 
-- 
2.17.2

^ permalink raw reply related

* [net 5/7] ice: Introduce ice_dev_onetime_setup
From: Jeff Kirsher @ 2018-10-24 21:47 UTC (permalink / raw)
  To: davem; +Cc: Anirudh Venkataramanan, netdev, nhorman, sassmann, Jeff Kirsher
In-Reply-To: <20181024214731.26036-1-jeffrey.t.kirsher@intel.com>

From: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>

ice_dev_onetime_setup contains a couple of driver workarounds for current
firmware limitations. These workarounds are expected to go away once
these limitations are fixed in the firmware.

On a firmware release that has these issues addressed, these workarounds
(while unnecessary) will not break anything.

Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_common.c   | 19 +++++++++++++++++++
 drivers/net/ethernet/intel/ice/ice_common.h   |  3 +++
 .../net/ethernet/intel/ice/ice_hw_autogen.h   |  2 ++
 drivers/net/ethernet/intel/ice/ice_lib.c      |  1 +
 4 files changed, 25 insertions(+)

diff --git a/drivers/net/ethernet/intel/ice/ice_common.c b/drivers/net/ethernet/intel/ice/ice_common.c
index 78df54b25bf1..5a91a9087d1e 100644
--- a/drivers/net/ethernet/intel/ice/ice_common.c
+++ b/drivers/net/ethernet/intel/ice/ice_common.c
@@ -42,6 +42,23 @@ static enum ice_status ice_set_mac_type(struct ice_hw *hw)
 	return 0;
 }
 
+/**
+ * ice_dev_onetime_setup - Temporary HW/FW workarounds
+ * @hw: pointer to the HW structure
+ *
+ * This function provides temporary workarounds for certain issues
+ * that are expected to be fixed in the HW/FW.
+ */
+void ice_dev_onetime_setup(struct ice_hw *hw)
+{
+	/* configure Rx - set non pxe mode */
+	wr32(hw, GLLAN_RCTL_0, 0x1);
+
+#define MBX_PF_VT_PFALLOC	0x00231E80
+	/* set VFs per PF */
+	wr32(hw, MBX_PF_VT_PFALLOC, rd32(hw, PF_VT_PFALLOC_HIF));
+}
+
 /**
  * ice_clear_pf_cfg - Clear PF configuration
  * @hw: pointer to the hardware structure
@@ -740,6 +757,8 @@ enum ice_status ice_init_hw(struct ice_hw *hw)
 	if (status)
 		goto err_unroll_sched;
 
+	ice_dev_onetime_setup(hw);
+
 	/* Get MAC information */
 	/* A single port can report up to two (LAN and WoL) addresses */
 	mac_buf = devm_kcalloc(ice_hw_to_dev(hw), 2,
diff --git a/drivers/net/ethernet/intel/ice/ice_common.h b/drivers/net/ethernet/intel/ice/ice_common.h
index 1900681289a4..876347e32b6f 100644
--- a/drivers/net/ethernet/intel/ice/ice_common.h
+++ b/drivers/net/ethernet/intel/ice/ice_common.h
@@ -34,6 +34,9 @@ ice_sq_send_cmd(struct ice_hw *hw, struct ice_ctl_q_info *cq,
 		struct ice_sq_cd *cd);
 void ice_clear_pxe_mode(struct ice_hw *hw);
 enum ice_status ice_get_caps(struct ice_hw *hw);
+
+void ice_dev_onetime_setup(struct ice_hw *hw);
+
 enum ice_status
 ice_write_rxq_ctx(struct ice_hw *hw, struct ice_rlan_ctx *rlan_ctx,
 		  u32 rxq_index);
diff --git a/drivers/net/ethernet/intel/ice/ice_hw_autogen.h b/drivers/net/ethernet/intel/ice/ice_hw_autogen.h
index a6679a9bfd3a..228afcad6fc3 100644
--- a/drivers/net/ethernet/intel/ice/ice_hw_autogen.h
+++ b/drivers/net/ethernet/intel/ice/ice_hw_autogen.h
@@ -157,6 +157,7 @@
 #define VPINT_ALLOC_LAST_S			12
 #define VPINT_ALLOC_LAST_M			ICE_M(0x7FF, 12)
 #define VPINT_ALLOC_VALID_M			BIT(31)
+#define GLLAN_RCTL_0				0x002941F8
 #define QRX_CONTEXT(_i, _QRX)			(0x00280000 + ((_i) * 8192 + (_QRX) * 4))
 #define QRX_CTRL(_QRX)				(0x00120000 + ((_QRX) * 4))
 #define QRX_CTRL_MAX_INDEX			2047
@@ -320,6 +321,7 @@
 #define GLV_UPRCL(_i)				(0x003B2000 + ((_i) * 8))
 #define GLV_UPTCH(_i)				(0x0030A004 + ((_i) * 8))
 #define GLV_UPTCL(_i)				(0x0030A000 + ((_i) * 8))
+#define PF_VT_PFALLOC_HIF			0x0009DD80
 #define VSIQF_HKEY_MAX_INDEX			12
 #define VSIQF_HLUT_MAX_INDEX			15
 #define VFINT_DYN_CTLN(_i)			(0x00003800 + ((_i) * 4))
diff --git a/drivers/net/ethernet/intel/ice/ice_lib.c b/drivers/net/ethernet/intel/ice/ice_lib.c
index e750702bcdce..5bacad01f0c9 100644
--- a/drivers/net/ethernet/intel/ice/ice_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_lib.c
@@ -2529,6 +2529,7 @@ int ice_vsi_rebuild(struct ice_vsi *vsi)
 	vsi->hw_base_vector = 0;
 	ice_vsi_clear_rings(vsi);
 	ice_vsi_free_arrays(vsi, false);
+	ice_dev_onetime_setup(&vsi->back->hw);
 	ice_vsi_set_num_qs(vsi);
 
 	/* Initialize VSI struct elements and create VSI in FW */
-- 
2.17.2

^ permalink raw reply related

* [net 3/7] ice: Update expected FW version
From: Jeff Kirsher @ 2018-10-24 21:47 UTC (permalink / raw)
  To: davem; +Cc: Anirudh Venkataramanan, netdev, nhorman, sassmann, Jeff Kirsher
In-Reply-To: <20181024214731.26036-1-jeffrey.t.kirsher@intel.com>

From: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>

Update to the current firmware major and minor version which are
1 and 3 respectively.

Also remove an empty comment line.

Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_controlq.h | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_controlq.h b/drivers/net/ethernet/intel/ice/ice_controlq.h
index 437f832fd7c4..0038a4109c99 100644
--- a/drivers/net/ethernet/intel/ice/ice_controlq.h
+++ b/drivers/net/ethernet/intel/ice/ice_controlq.h
@@ -19,11 +19,10 @@
 
 /* Defines that help manage the driver vs FW API checks.
  * Take a look at ice_aq_ver_check in ice_controlq.c for actual usage.
- *
  */
 #define EXP_FW_API_VER_BRANCH		0x00
-#define EXP_FW_API_VER_MAJOR		0x00
-#define EXP_FW_API_VER_MINOR		0x01
+#define EXP_FW_API_VER_MAJOR		0x01
+#define EXP_FW_API_VER_MINOR		0x03
 
 /* Different control queue types: These are mainly for SW consumption. */
 enum ice_ctl_q {
-- 
2.17.2

^ permalink raw reply related

* [net 2/7] ice: Change device ID define names to align with branding string
From: Jeff Kirsher @ 2018-10-24 21:47 UTC (permalink / raw)
  To: davem; +Cc: Anirudh Venkataramanan, netdev, nhorman, sassmann, Jeff Kirsher
In-Reply-To: <20181024214731.26036-1-jeffrey.t.kirsher@intel.com>

From: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>

Basically remove references to C810 and use E810C (from the branding
string) instead.

Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_devids.h | 6 +++---
 drivers/net/ethernet/intel/ice/ice_main.c   | 6 +++---
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_devids.h b/drivers/net/ethernet/intel/ice/ice_devids.h
index a6f0a5c0c305..f8d5c661d0ba 100644
--- a/drivers/net/ethernet/intel/ice/ice_devids.h
+++ b/drivers/net/ethernet/intel/ice/ice_devids.h
@@ -6,10 +6,10 @@
 
 /* Device IDs */
 /* Intel(R) Ethernet Controller E810-C for backplane */
-#define ICE_DEV_ID_C810_BACKPLANE	0x1591
+#define ICE_DEV_ID_E810C_BACKPLANE	0x1591
 /* Intel(R) Ethernet Controller E810-C for QSFP */
-#define ICE_DEV_ID_C810_QSFP		0x1592
+#define ICE_DEV_ID_E810C_QSFP		0x1592
 /* Intel(R) Ethernet Controller E810-C for SFP */
-#define ICE_DEV_ID_C810_SFP		0x1593
+#define ICE_DEV_ID_E810C_SFP		0x1593
 
 #endif /* _ICE_DEVIDS_H_ */
diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
index 8f61b375e768..0084b7290b2b 100644
--- a/drivers/net/ethernet/intel/ice/ice_main.c
+++ b/drivers/net/ethernet/intel/ice/ice_main.c
@@ -2271,9 +2271,9 @@ static void ice_remove(struct pci_dev *pdev)
  *   Class, Class Mask, private data (not used) }
  */
 static const struct pci_device_id ice_pci_tbl[] = {
-	{ PCI_VDEVICE(INTEL, ICE_DEV_ID_C810_BACKPLANE), 0 },
-	{ PCI_VDEVICE(INTEL, ICE_DEV_ID_C810_QSFP), 0 },
-	{ PCI_VDEVICE(INTEL, ICE_DEV_ID_C810_SFP), 0 },
+	{ PCI_VDEVICE(INTEL, ICE_DEV_ID_E810C_BACKPLANE), 0 },
+	{ PCI_VDEVICE(INTEL, ICE_DEV_ID_E810C_QSFP), 0 },
+	{ PCI_VDEVICE(INTEL, ICE_DEV_ID_E810C_SFP), 0 },
 	/* required last entry */
 	{ 0, }
 };
-- 
2.17.2

^ permalink raw reply related

* [net 6/7] ice: Allocate VF interrupts and set queue map
From: Jeff Kirsher @ 2018-10-24 21:47 UTC (permalink / raw)
  To: davem; +Cc: Anirudh Venkataramanan, netdev, nhorman, sassmann, Jeff Kirsher
In-Reply-To: <20181024214731.26036-1-jeffrey.t.kirsher@intel.com>

From: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>

Allocate VF interrupts using VPINT_ALLOC_PCI. Multiple interrupts are
specified as a range from "first" to "last".

Also, according to the spec, the queue mapping for a VF needs to be set
in both contig and scatter queue modes. So make this change as well.

Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_hw_autogen.h  |  6 ++++++
 drivers/net/ethernet/intel/ice/ice_virtchnl_pf.c | 15 +++++++++++----
 2 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_hw_autogen.h b/drivers/net/ethernet/intel/ice/ice_hw_autogen.h
index 228afcad6fc3..5fdea6ec7675 100644
--- a/drivers/net/ethernet/intel/ice/ice_hw_autogen.h
+++ b/drivers/net/ethernet/intel/ice/ice_hw_autogen.h
@@ -157,6 +157,12 @@
 #define VPINT_ALLOC_LAST_S			12
 #define VPINT_ALLOC_LAST_M			ICE_M(0x7FF, 12)
 #define VPINT_ALLOC_VALID_M			BIT(31)
+#define VPINT_ALLOC_PCI(_VF)			(0x0009D000 + ((_VF) * 4))
+#define VPINT_ALLOC_PCI_FIRST_S			0
+#define VPINT_ALLOC_PCI_FIRST_M			ICE_M(0x7FF, 0)
+#define VPINT_ALLOC_PCI_LAST_S			12
+#define VPINT_ALLOC_PCI_LAST_M			ICE_M(0x7FF, 12)
+#define VPINT_ALLOC_PCI_VALID_M			BIT(31)
 #define GLLAN_RCTL_0				0x002941F8
 #define QRX_CONTEXT(_i, _QRX)			(0x00280000 + ((_i) * 8192 + (_QRX) * 4))
 #define QRX_CTRL(_QRX)				(0x00120000 + ((_QRX) * 4))
diff --git a/drivers/net/ethernet/intel/ice/ice_virtchnl_pf.c b/drivers/net/ethernet/intel/ice/ice_virtchnl_pf.c
index c25e486706f3..45f10f8f01dc 100644
--- a/drivers/net/ethernet/intel/ice/ice_virtchnl_pf.c
+++ b/drivers/net/ethernet/intel/ice/ice_virtchnl_pf.c
@@ -173,6 +173,7 @@ static void ice_dis_vf_mappings(struct ice_vf *vf)
 	vsi = pf->vsi[vf->lan_vsi_idx];
 
 	wr32(hw, VPINT_ALLOC(vf->vf_id), 0);
+	wr32(hw, VPINT_ALLOC_PCI(vf->vf_id), 0);
 
 	first = vf->first_vector_idx;
 	last = first + pf->num_vf_msix - 1;
@@ -519,6 +520,10 @@ static void ice_ena_vf_mappings(struct ice_vf *vf)
 	       VPINT_ALLOC_VALID_M);
 	wr32(hw, VPINT_ALLOC(vf->vf_id), reg);
 
+	reg = (((first << VPINT_ALLOC_PCI_FIRST_S) & VPINT_ALLOC_PCI_FIRST_M) |
+	       ((last << VPINT_ALLOC_PCI_LAST_S) & VPINT_ALLOC_PCI_LAST_M) |
+	       VPINT_ALLOC_PCI_VALID_M);
+	wr32(hw, VPINT_ALLOC_PCI(vf->vf_id), reg);
 	/* map the interrupts to its functions */
 	for (v = first; v <= last; v++) {
 		reg = (((abs_vf_id << GLINT_VECT2FUNC_VF_NUM_S) &
@@ -528,10 +533,11 @@ static void ice_ena_vf_mappings(struct ice_vf *vf)
 		wr32(hw, GLINT_VECT2FUNC(v), reg);
 	}
 
+	/* set regardless of mapping mode */
+	wr32(hw, VPLAN_TXQ_MAPENA(vf->vf_id), VPLAN_TXQ_MAPENA_TX_ENA_M);
+
 	/* VF Tx queues allocation */
 	if (vsi->tx_mapping_mode == ICE_VSI_MAP_CONTIG) {
-		wr32(hw, VPLAN_TXQ_MAPENA(vf->vf_id),
-		     VPLAN_TXQ_MAPENA_TX_ENA_M);
 		/* set the VF PF Tx queue range
 		 * VFNUMQ value should be set to (number of queues - 1). A value
 		 * of 0 means 1 queue and a value of 255 means 256 queues
@@ -546,10 +552,11 @@ static void ice_ena_vf_mappings(struct ice_vf *vf)
 			"Scattered mode for VF Tx queues is not yet implemented\n");
 	}
 
+	/* set regardless of mapping mode */
+	wr32(hw, VPLAN_RXQ_MAPENA(vf->vf_id), VPLAN_RXQ_MAPENA_RX_ENA_M);
+
 	/* VF Rx queues allocation */
 	if (vsi->rx_mapping_mode == ICE_VSI_MAP_CONTIG) {
-		wr32(hw, VPLAN_RXQ_MAPENA(vf->vf_id),
-		     VPLAN_RXQ_MAPENA_RX_ENA_M);
 		/* set the VF PF Rx queue range
 		 * VFNUMQ value should be set to (number of queues - 1). A value
 		 * of 0 means 1 queue and a value of 255 means 256 queues
-- 
2.17.2

^ permalink raw reply related

* [net 0/7][pull request] Intel Wired LAN Driver Fixes 2018-10-24
From: Jeff Kirsher @ 2018-10-24 21:47 UTC (permalink / raw)
  To: davem; +Cc: Jeff Kirsher, netdev, nhorman, sassmann

This series contains fixes for the ice driver.

Anirudh fixes a namespace issue which was introduced with a previous
patch to remove ice_netpoll.  Fixed up the device ID define names to
align with the branding string names.  Use the capability count returned
by the firmware, instead of calculating the count.  Introduced driver
workarounds due to current firmware limitations.  Fixed the queue
mapping for a VF, which needs to be set in the config and scatter queue
modes.  Fixed the driver which is setup to handle link status events
(LSE), even though the firmware does not have this feature yet, so add
the ability to poll for link status changes while we wait for updated
firmware.

The following are changes since commit 44adbac8f7217040be97928cd19998259d9d4418:
  Merge branch 'work.tty-ioctl' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
and are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-queue 100GbE

Anirudh Venkataramanan (7):
  ice: Make ice_msix_clean_rings static
  ice: Change device ID define names to align with branding string
  ice: Update expected FW version
  ice: Use capability count returned by the firmware
  ice: Introduce ice_dev_onetime_setup
  ice: Allocate VF interrupts and set queue map
  ice: Poll for link status change

 drivers/net/ethernet/intel/ice/ice_common.c   |  52 ++++----
 drivers/net/ethernet/intel/ice/ice_common.h   |   9 +-
 drivers/net/ethernet/intel/ice/ice_controlq.h |   5 +-
 drivers/net/ethernet/intel/ice/ice_devids.h   |   6 +-
 .../net/ethernet/intel/ice/ice_hw_autogen.h   |   8 ++
 drivers/net/ethernet/intel/ice/ice_lib.c      |   3 +-
 drivers/net/ethernet/intel/ice/ice_lib.h      |   1 -
 drivers/net/ethernet/intel/ice/ice_main.c     | 116 ++++--------------
 .../net/ethernet/intel/ice/ice_virtchnl_pf.c  |  15 ++-
 9 files changed, 77 insertions(+), 138 deletions(-)

-- 
2.17.2

^ permalink raw reply

* [net 1/7] ice: Make ice_msix_clean_rings static
From: Jeff Kirsher @ 2018-10-24 21:47 UTC (permalink / raw)
  To: davem; +Cc: Anirudh Venkataramanan, netdev, nhorman, sassmann, Jeff Kirsher
In-Reply-To: <20181024214731.26036-1-jeffrey.t.kirsher@intel.com>

From: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>

commit 158a08a694c4e ("ice: remove ndo_poll_controller") removed
ice_netpoll and introduced a namespace warning for ice_msix_clean_rings.
Fix the namespace warning by making ice_msix_clean_rings static.

Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_lib.c | 2 +-
 drivers/net/ethernet/intel/ice/ice_lib.h | 1 -
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_lib.c b/drivers/net/ethernet/intel/ice/ice_lib.c
index 49f1940772ed..e750702bcdce 100644
--- a/drivers/net/ethernet/intel/ice/ice_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_lib.c
@@ -433,7 +433,7 @@ int ice_vsi_clear(struct ice_vsi *vsi)
  * @irq: interrupt number
  * @data: pointer to a q_vector
  */
-irqreturn_t ice_msix_clean_rings(int __always_unused irq, void *data)
+static irqreturn_t ice_msix_clean_rings(int __always_unused irq, void *data)
 {
 	struct ice_q_vector *q_vector = (struct ice_q_vector *)data;
 
diff --git a/drivers/net/ethernet/intel/ice/ice_lib.h b/drivers/net/ethernet/intel/ice/ice_lib.h
index 677db40338f5..3831b4f0960a 100644
--- a/drivers/net/ethernet/intel/ice/ice_lib.h
+++ b/drivers/net/ethernet/intel/ice/ice_lib.h
@@ -73,5 +73,4 @@ int ice_vsi_cfg_tc(struct ice_vsi *vsi, u8 ena_tc);
 
 int ice_vsi_manage_rss_lut(struct ice_vsi *vsi, bool ena);
 
-irqreturn_t ice_msix_clean_rings(int __always_unused irq, void *data);
 #endif /* !_ICE_LIB_H_ */
-- 
2.17.2

^ permalink raw reply related

* Re: [PATCH bpf 0/7] Batch of direct packet access fixes for BPF
From: Song Liu @ 2018-10-24 21:43 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: Alexei Starovoitov, Networking
In-Reply-To: <20181024200549.8516-1-daniel@iogearbox.net>

On Wed, Oct 24, 2018 at 1:08 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> Several fixes to get direct packet access in order from verifier
> side. Also test suite fix to run cg_skb as unpriv and an improvement
> to make direct packet write less error prone in future.
>
> Thanks!
>
> Daniel Borkmann (7):
>   bpf: fix test suite to enable all unpriv program types
>   bpf: disallow direct packet access for unpriv in cg_skb
>   bpf: fix direct packet access for flow dissector progs
>   bpf: fix cg_skb types to hint access type in may_access_direct_pkt_data
>   bpf: fix direct packet write into pop/peek helpers
>   bpf: fix leaking uninitialized memory on pop/peek helpers
>   bpf: make direct packet write unclone more robust
>
>  kernel/bpf/helpers.c                        |  2 --
>  kernel/bpf/queue_stack_maps.c               |  2 ++
>  kernel/bpf/verifier.c                       | 13 ++++++++++---
>  net/core/filter.c                           | 17 +++++++++++++++++
>  tools/testing/selftests/bpf/test_verifier.c | 15 +++++++++++++--
>  5 files changed, 42 insertions(+), 7 deletions(-)
>
> --
> 2.9.5
>

Other than the nitpick on 7/7, for the series:

Acked-by: Song Liu <songliubraving@fb.com>

^ permalink raw reply

* Re: [PATCH bpf 7/7] bpf: make direct packet write unclone more robust
From: Song Liu @ 2018-10-24 21:42 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: Alexei Starovoitov, Networking
In-Reply-To: <20181024200549.8516-8-daniel@iogearbox.net>

On Wed, Oct 24, 2018 at 1:06 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> Given this seems to be quite fragile and can easily slip through the
> cracks, lets make direct packet write more robust by requiring that
> future program types which allow for such write must provide a prologue
> callback. In case of XDP and sk_msg it's noop, thus add a generic noop
> handler there. The latter starts out with NULL data/data_end unconditionally
> when sg pages are shared.
>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Acked-by: Alexei Starovoitov <ast@kernel.org>
> ---
>  kernel/bpf/verifier.c |  6 +++++-
>  net/core/filter.c     | 11 +++++++++++
>  2 files changed, 16 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 5fc9a65..171a2c8 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -5709,7 +5709,11 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
>         bool is_narrower_load;
>         u32 target_size;
>
> -       if (ops->gen_prologue) {
> +       if (ops->gen_prologue || env->seen_direct_write) {
> +               if (!ops->gen_prologue) {
> +                       verbose(env, "bpf verifier is misconfigured\n");
> +                       return -EINVAL;
> +               }

nit: how about this?

diff --git i/kernel/bpf/verifier.c w/kernel/bpf/verifier.c
index 6fbe7a8afed7..d35078024e35 100644
--- i/kernel/bpf/verifier.c
+++ w/kernel/bpf/verifier.c
@@ -5286,6 +5286,11 @@ static int convert_ctx_accesses(struct
bpf_verifier_env *env)
        bool is_narrower_load;
        u32 target_size;

+       if (!ops->gen_prologue && env->seen_direct_write) {
+               verbose(env, "bpf verifier is misconfigured\n");
+               return -EINVAL;
+       }
+
        if (ops->gen_prologue) {
                cnt = ops->gen_prologue(insn_buf, env->seen_direct_write,
                                        env->prog);

^ permalink raw reply related

* Re: [PATCH net] net/ipv6: Allow onlink routes to have a device mismatch if it is the default route
From: David Miller @ 2018-10-24 21:37 UTC (permalink / raw)
  To: dsahern; +Cc: netdev, dsahern
In-Reply-To: <20181024205839.24689-1-dsahern@kernel.org>

From: David Ahern <dsahern@kernel.org>
Date: Wed, 24 Oct 2018 13:58:39 -0700

> From: David Ahern <dsahern@gmail.com>
> 
> The intent of ip6_route_check_nh_onlink is to make sure the gateway
> given for an onlink route is not actually on a connected route for
> a different interface (e.g., 2001:db8:1::/64 is on dev eth1 and then
> an onlink route has a via 2001:db8:1::1 dev eth2). If the gateway
> lookup hits the default route then it most likely will be a different
> interface than the onlink route which is ok.
> 
> Update ip6_route_check_nh_onlink to disregard the device mismatch
> if the gateway lookup hits the default route. Turns out the existing
> onlink tests are passing because there is no default route or it is
> an unreachable default, so update the onlink tests to have a default
> route other than unreachable.
> 
> Fixes: fc1e64e1092f6 ("net/ipv6: Add support for onlink flag")
> Signed-off-by: David Ahern <dsahern@gmail.com>

Applied and queued up for -stable.

^ permalink raw reply

* Re: [PATCH ghak90 (was ghak32) V4 03/10] audit: log container info of syscalls
From: Steve Grubb @ 2018-10-25  6:06 UTC (permalink / raw)
  To: Richard Guy Briggs
  Cc: Paul Moore, simo, carlos, linux-api, containers, linux-kernel,
	dhowells, linux-audit, netfilter-devel, ebiederm, luto, netdev,
	linux-fsdevel, Eric Paris, Serge Hallyn, viro
In-Reply-To: <20181025004255.zl7p7j6gztouh2hh@madcap2.tricolour.ca>

On Wed, 24 Oct 2018 20:42:55 -0400
Richard Guy Briggs <rgb@redhat.com> wrote:

> On 2018-10-24 16:55, Paul Moore wrote:
> > On Wed, Oct 24, 2018 at 11:15 AM Richard Guy Briggs
> > <rgb@redhat.com> wrote:  
> > > On 2018-10-19 19:16, Paul Moore wrote:  
> > > > On Sun, Aug 5, 2018 at 4:32 AM Richard Guy Briggs
> > > > <rgb@redhat.com> wrote:  
> > > > > Create a new audit record AUDIT_CONTAINER to document the
> > > > > audit container identifier of a process if it is present.
> > > > >
> > > > > Called from audit_log_exit(), syscalls are covered.
> > > > >
> > > > > A sample raw event:
> > > > > type=SYSCALL msg=audit(1519924845.499:257): arch=c000003e
> > > > > syscall=257 success=yes exit=3 a0=ffffff9c a1=56374e1cef30
> > > > > a2=241 a3=1b6 items=2 ppid=606 pid=635 auid=0 uid=0 gid=0
> > > > > euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=3
> > > > > comm="bash" exe="/usr/bin/bash"
> > > > > subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
> > > > > key="tmpcontainerid" type=CWD msg=audit(1519924845.499:257):
> > > > > cwd="/root" type=PATH msg=audit(1519924845.499:257): item=0
> > > > > name="/tmp/" inode=13863 dev=00:27 mode=041777 ouid=0 ogid=0
> > > > > rdev=00:00 obj=system_u:object_r:tmp_t:s0 nametype= PARENT
> > > > > cap_fp=0000000000000000 cap_fi=0000000000000000 cap_fe=0
> > > > > cap_fver=0 type=PATH msg=audit(1519924845.499:257): item=1
> > > > > name="/tmp/tmpcontainerid" inode=17729 dev=00:27 mode=0100644
> > > > > ouid=0 ogid=0 rdev=00:00
> > > > > obj=unconfined_u:object_r:user_tmp_t:s0 nametype=CREATE
> > > > > cap_fp=0000000000000000 cap_fi=0000000000000000 cap_fe=0
> > > > > cap_fver=0 type=PROCTITLE msg=audit(1519924845.499:257):
> > > > > proctitle=62617368002D6300736C65657020313B206563686F2074657374203E202F746D702F746D70636F6E7461696E65726964
> > > > > type=CONTAINER msg=audit(1519924845.499:257): op=task
> > > > > contid=123458
> > > > >
> > > > > See: https://github.com/linux-audit/audit-kernel/issues/90
> > > > > See: https://github.com/linux-audit/audit-userspace/issues/51
> > > > > See: https://github.com/linux-audit/audit-testsuite/issues/64
> > > > > See:
> > > > > https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID
> > > > > Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Acked-by:
> > > > > Serge Hallyn <serge@hallyn.com> Acked-by: Steve Grubb
> > > > > <sgrubb@redhat.com> ---
> > > > >  include/linux/audit.h      |  7 +++++++
> > > > >  include/uapi/linux/audit.h |  1 +
> > > > >  kernel/audit.c             | 24 ++++++++++++++++++++++++
> > > > >  kernel/auditsc.c           |  3 +++
> > > > >  4 files changed, 35 insertions(+)  
> > > >
> > > > ...
> > > >  
> > > > > @@ -2045,6 +2045,30 @@ void audit_log_session_info(struct
> > > > > audit_buffer *ab) audit_log_format(ab, " auid=%u ses=%u",
> > > > > auid, sessionid); }
> > > > >
> > > > > +/*
> > > > > + * audit_log_contid - report container info
> > > > > + * @tsk: task to be recorded
> > > > > + * @context: task or local context for record
> > > > > + * @op: contid string description
> > > > > + */
> > > > > +int audit_log_contid(struct task_struct *tsk,
> > > > > +                            struct audit_context *context,
> > > > > char *op) +{
> > > > > +       struct audit_buffer *ab;
> > > > > +
> > > > > +       if (!audit_contid_set(tsk))
> > > > > +               return 0;
> > > > > +       /* Generate AUDIT_CONTAINER record with container ID
> > > > > */
> > > > > +       ab = audit_log_start(context, GFP_KERNEL,
> > > > > AUDIT_CONTAINER);
> > > > > +       if (!ab)
> > > > > +               return -ENOMEM;
> > > > > +       audit_log_format(ab, "op=%s contid=%llu",
> > > > > +                        op, audit_get_contid(tsk));
> > > > > +       audit_log_end(ab);
> > > > > +       return 0;
> > > > > +}
> > > > > +EXPORT_SYMBOL(audit_log_contid);  
> > > >
> > > > As discussed in the previous iteration of the patch, I prefer
> > > > AUDIT_CONTAINER_ID here over AUDIT_CONTAINER.  If you feel
> > > > strongly about keeping it as-is with AUDIT_CONTAINER I suppose
> > > > I could live with that, but it is isn't my first choice.  
> > >
> > > I don't have a strong opinion on this one, mildly preferring the
> > > shorter one only because it is shorter.  
> > 
> > We already have multiple AUDIT_CONTAINER* record types, so it seems
> > as though we should use "AUDIT_CONTAINER" as a prefix of sorts,
> > rather than a type itself.  
> 
> I'm fine with that.  I'd still like to hear Steve's input.  He had
> stronger opinions than me.

The creation event should be separate and distinct from the continuing
use when its used as a supplemental record. IOW, binding the ID to a
container is part of the lifecycle and needs to be kept distinct.

-Steve

> > > > However, I do care about the "op" field in this record.  It just
> > > > doesn't make any sense; the way you are using it it is more of a
> > > > context field than an operations field, and even then why is the
> > > > context important from a logging and/or security perspective?
> > > > Drop it please.  
> > >
> > > I'll rename it to whatever you like.  I'd suggest "ref=".  The
> > > reason I think it is important is there are multiple sources that
> > > aren't always obvious from the other records to which it is
> > > associated.  In the case of ptrace and signals, there can be many
> > > target tasks listed (OBJ_PID) with no other way to distinguish
> > > the matching audit container identifier records all for one
> > > event.  This is in addition to the default syscall container
> > > identifier record.  I'm not currently happy with the text content
> > > to link the two, but that should be solvable (most obvious is
> > > taret PID).  Throwing away this information seems shortsighted.  
> > 
> > It would be helpful if you could generate real audit events
> > demonstrating the problems you are describing, as well as a more
> > standard syscall event, so we can discuss some possible solutions.  
> 
> If the auditted process is in a container and it ptraces or signals
> another process in a container, there will be two AUDIT_CONTAINER
> records for the same event that won't be identified as to which record
> belongs to which process or other record (SYSCALL vs 1+ OBJ_PID
> records).  There could be many signals recorded, each with their own
> OBJ_PID record.  The first is stored in the audit context and
> additional ones are stored in a chained struct that can accommodate
> 16 entries each.
> 
> (See audit_signal_info(), __audit_ptrace().)
> 
> (As a side note, on code inspection it appears that a signal target
> would get overwritten by a ptrace action if they were to happen in
> that order.)
> 
> > paul moore  
> 
> - RGB
> 
> --
> Richard Guy Briggs <rgb@redhat.com>
> Sr. S/W Engineer, Kernel Security, Base Operating Systems
> Remote, Ottawa, Red Hat Canada
> IRC: rgb, SunRaycer
> Voice: +1.647.777.2635, Internal: (81) 32635
> 
> --
> Linux-audit mailing list
> Linux-audit@redhat.com
> https://www.redhat.com/mailman/listinfo/linux-audit

^ permalink raw reply

* Re: [PATCH net] net: sched: Remove TCA_OPTIONS from policy
From: David Miller @ 2018-10-24 21:35 UTC (permalink / raw)
  To: dsahern; +Cc: netdev, pupilla, dsahern
In-Reply-To: <20181024153249.15374-1-dsahern@kernel.org>

From: David Ahern <dsahern@kernel.org>
Date: Wed, 24 Oct 2018 08:32:49 -0700

> From: David Ahern <dsahern@gmail.com>
> 
> Marco reported an error with hfsc:
> root@Calimero:~# tc qdisc add dev eth0 root handle 1:0 hfsc default 1
> Error: Attribute failed policy validation.
> 
> Apparently a few implementations pass TCA_OPTIONS as a binary instead
> of nested attribute, so drop TCA_OPTIONS from the policy.
> 
> Fixes: 8b4c3cdd9dd8 ("net: sched: Add policy validation for tc attributes")
> Reported-by: Marco Berizzi <pupilla@libero.it>
> Signed-off-by: David Ahern <dsahern@gmail.com>

That's unfortunate... applied, thanks.

^ permalink raw reply

* Re: [PATCH net-next] octeontx2-af: Copy the right amount of memory
From: David Miller @ 2018-10-24 21:25 UTC (permalink / raw)
  To: dan.carpenter; +Cc: sgoutham, lcherian, gakula, jerinj, netdev, kernel-janitors
In-Reply-To: <20181024083221.humvwh2pefovptcd@kili.mountain>

From: Dan Carpenter <dan.carpenter@oracle.com>
Date: Wed, 24 Oct 2018 11:32:21 +0300

> This is a copy and paste bug where we copied the sizeof() from the chunk
> before.  We're copying more data than intended but the destination is a
> union so it doesn't cause memory corruption.
> 
> Fixes: ffb0abd7e9cb ("octeontx2-af: NIX AQ instruction enqueue support")
> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>

Applied, thanks Dan.

^ permalink raw reply

* Re: [PATCH net-next 1/3] net/sock: factor out dequeue/peek with offset code
From: Alexei Starovoitov @ 2018-10-24 21:23 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: netdev, David S. Miller, Eric Dumazet, kafai, daniel
In-Reply-To: <4c94ee8fe77a51d61927bfff46441abc15172193.camel@redhat.com>

On Tue, Oct 23, 2018 at 09:28:03AM +0200, Paolo Abeni wrote:
> Hi,
> 
> On Mon, 2018-10-22 at 21:49 -0700, Alexei Starovoitov wrote:
> > On Mon, May 15, 2017 at 11:01:42AM +0200, Paolo Abeni wrote:
> > > And update __sk_queue_drop_skb() to work on the specified queue.
> > > This will help the udp protocol to use an additional private
> > > rx queue in a later patch.
> > > 
> > > Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> > > ---
> > >  include/linux/skbuff.h |  7 ++++
> > >  include/net/sock.h     |  4 +--
> > >  net/core/datagram.c    | 90 ++++++++++++++++++++++++++++----------------------
> > >  3 files changed, 60 insertions(+), 41 deletions(-)
> > > 
> > > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > > index a098d95..bfc7892 100644
> > > --- a/include/linux/skbuff.h
> > > +++ b/include/linux/skbuff.h
> > > @@ -3056,6 +3056,13 @@ static inline void skb_frag_list_init(struct sk_buff *skb)
> > >  
> > >  int __skb_wait_for_more_packets(struct sock *sk, int *err, long *timeo_p,
> > >  				const struct sk_buff *skb);
> > > +struct sk_buff *__skb_try_recv_from_queue(struct sock *sk,
> > > +					  struct sk_buff_head *queue,
> > > +					  unsigned int flags,
> > > +					  void (*destructor)(struct sock *sk,
> > > +							   struct sk_buff *skb),
> > > +					  int *peeked, int *off, int *err,
> > > +					  struct sk_buff **last);
> > >  struct sk_buff *__skb_try_recv_datagram(struct sock *sk, unsigned flags,
> > >  					void (*destructor)(struct sock *sk,
> > >  							   struct sk_buff *skb),
> > > diff --git a/include/net/sock.h b/include/net/sock.h
> > > index 66349e4..49d226f 100644
> > > --- a/include/net/sock.h
> > > +++ b/include/net/sock.h
> > > @@ -2035,8 +2035,8 @@ void sk_reset_timer(struct sock *sk, struct timer_list *timer,
> > >  
> > >  void sk_stop_timer(struct sock *sk, struct timer_list *timer);
> > >  
> > > -int __sk_queue_drop_skb(struct sock *sk, struct sk_buff *skb,
> > > -			unsigned int flags,
> > > +int __sk_queue_drop_skb(struct sock *sk, struct sk_buff_head *sk_queue,
> > > +			struct sk_buff *skb, unsigned int flags,
> > >  			void (*destructor)(struct sock *sk,
> > >  					   struct sk_buff *skb));
> > >  int __sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb);
> > > diff --git a/net/core/datagram.c b/net/core/datagram.c
> > > index db1866f2..a4592b4 100644
> > > --- a/net/core/datagram.c
> > > +++ b/net/core/datagram.c
> > > @@ -161,6 +161,43 @@ static struct sk_buff *skb_set_peeked(struct sk_buff *skb)
> > >  	return skb;
> > >  }
> > >  
> > > +struct sk_buff *__skb_try_recv_from_queue(struct sock *sk,
> > > +					  struct sk_buff_head *queue,
> > > +					  unsigned int flags,
> > > +					  void (*destructor)(struct sock *sk,
> > > +							   struct sk_buff *skb),
> > > +					  int *peeked, int *off, int *err,
> > > +					  struct sk_buff **last)
> > > +{
> > > +	struct sk_buff *skb;
> > > +
> > > +	*last = queue->prev;
> > 
> > this refactoring changed the behavior.
> > Now queue->prev is returned as last.
> > Whereas it was *last = queue before.
> > 
> > > +	skb_queue_walk(queue, skb) {
> > 
> > and *last = skb assignment is gone too.
> > 
> > Was this intentional ? 
> 
> Yes.
> 
> > Is this the right behavior?
> 
> I think so. queue->prev is the last skb in the queue. With the old
> code,   __skb_try_recv_datagram(), when returning NULL, used the
> instructions you quoted to overall set 'last' to the last skb in the
> queue. We did not use 'last' elsewhere. So overall this just reduce the
> number of instructions inside the loop. (unless I'm missing something).

Right. On the second glance it does appear to be correct.

> Are you experiencing any specific issues due to the mentioned commit?

yes.

Just like what Baoyou Xie reported https://lore.kernel.org/patchwork/patch/962802/
we're hitting infinite loop in __skb_recv_datagram() on 4.11 kernel.
and different, but also buggy, behavior on the net-next.

In particular __skb_try_recv_datagram() returns immediately,
because skb_queue_empty() is true (sk->sk_receive_queue.next == &sk->sk_receive_queue)

But __skb_wait_for_more_packets() also returns immediately
because if (sk->sk_receive_queue.prev != skb) is also true.

There is a link list corruption in sk_receive_queue.

list->next == list, but list->prev still points to valid skb.
Before your commit we had
*last = queue;
and we had this infinite loop I described above.
After your commit
*last = queue->next;
which assigns buggy pointer into *last, but that accidentally
makes if (sk->sk_receive_queue.prev != skb) to be false
and __skb_wait_for_more_packets() goes into schedule_timeout().
Eventually bad things happen too, but in the different spot.

The corruption is somehow related to netlink_recvmsg() just like in that
Baoyou Xie report.

The typical stack trace is
__skb_wait_for_more_packets+0x64/0x140
? skb_gro_receive+0x310/0x310
__skb_recv_datagram+0x5c/0xa0
skb_recv_datagram+0x31/0x40
netlink_recvmsg+0x51/0x3c0
? sock_write_iter+0xf8/0x110
SYSC_recvfrom+0x116/0x190

We didn't figure out a way to reproduce it yet.
kasan didn't help.
The way netlink socket pushes skbs into sk_receive_queue and drains it
all looks correct. We thought it could be MSG_PEAK related, but
skb->users refcnting also looks correct.

If anyone have any ideas what things to try, I'm all ears.

^ permalink raw reply

* Re: [PATCH] r8169: Add new device ID support
From: David Miller @ 2018-10-24 21:22 UTC (permalink / raw)
  To: shawn.lin; +Cc: nic_swsd, netdev, hkallweit1
In-Reply-To: <1540345607-110155-1-git-send-email-shawn.lin@rock-chips.com>

From: Shawn Lin <shawn.lin@rock-chips.com>
Date: Wed, 24 Oct 2018 09:46:47 +0800

> It's found my r8169 ethernet card at hand has a device ID
> of 0x0000 which wasn't on the list of rtl8169_pci_tbl. Add
> a new entry to make it work:
> 
> [2.165785] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
> [2.165863] r8169 0000:01:00.0: enabling device (0000 -> 0003)
> [2.167110] r8169 0000:01:00.0 eth0: RTL8168c/8111c at 0xffffff80089be000,
> 00:e0:4c:21:00:17, XID 1c4000c0 IRQ 208
> [2.167128] r8169 0000:01:00.0 eth0: jumbo features [frames: 6128
> bytes, tx checksumming: ko]
> 
> [root@rk1808:/]# lspci
> 00:00.0 Class 0604: 1d87:1808
> 01:00.0 Class 0200: 10ec:0000
> 
> Signed-off-by: Shawn Lin <shawn.lin@rock-chips.com>

I'm stil not terribly confident in this change, a device ID of zero is
really unusual.

Heiner, what do you think?

^ permalink raw reply

* Re: [PATCH net v2] net: udp: fix handling of CHECKSUM_COMPLETE packets
From: David Miller @ 2018-10-24 21:21 UTC (permalink / raw)
  To: stranche; +Cc: eric.dumazet, netdev, samanthakumar, edumazet
In-Reply-To: <1540332271-15564-1-git-send-email-stranche@codeaurora.org>

From: Sean Tranchetti <stranche@codeaurora.org>
Date: Tue, 23 Oct 2018 16:04:31 -0600

> Current handling of CHECKSUM_COMPLETE packets by the UDP stack is
> incorrect for any packet that has an incorrect checksum value.
> 
> udp4/6_csum_init() will both make a call to
> __skb_checksum_validate_complete() to initialize/validate the csum
> field when receiving a CHECKSUM_COMPLETE packet. When this packet
> fails validation, skb->csum will be overwritten with the pseudoheader
> checksum so the packet can be fully validated by software, but the
> skb->ip_summed value will be left as CHECKSUM_COMPLETE so that way
> the stack can later warn the user about their hardware spewing bad
> checksums. Unfortunately, leaving the SKB in this state can cause
> problems later on in the checksum calculation.
> 
> Since the the packet is still marked as CHECKSUM_COMPLETE,
> udp_csum_pull_header() will SUBTRACT the checksum of the UDP header
> from skb->csum instead of adding it, leaving us with a garbage value
> in that field. Once we try to copy the packet to userspace in the
> udp4/6_recvmsg(), we'll make a call to skb_copy_and_csum_datagram_msg()
> to checksum the packet data and add it in the garbage skb->csum value
> to perform our final validation check.
> 
> Since the value we're validating is not the proper checksum, it's possible
> that the folded value could come out to 0, causing us not to drop the
> packet. Instead, we believe that the packet was checksummed incorrectly
> by hardware since skb->ip_summed is still CHECKSUM_COMPLETE, and we attempt
> to warn the user with netdev_rx_csum_fault(skb->dev);
> 
> Unfortunately, since this is the UDP path, skb->dev has been overwritten
> by skb->dev_scratch and is no longer a valid pointer, so we end up
> reading invalid memory.

Just want to say that it has always been complicated in this area due to
the fact that we do this deferral of final checksum validation to when we
copy the packet into userspace.  For example, poll() needs to do special
things, etc.

Because we have to make it seem as if we dropped the packet with a bad
checksum from the point of view of what the user sees during recvmsg()
and poll() calls.  But until we do that checksum validation, we don't
know exactly what the situation is.

> This patch addresses this problem in two ways:
> 	1) Do not use the dev pointer when calling netdev_rx_csum_fault()
> 	   from skb_copy_and_csum_datagram_msg(). Since this gets called
> 	   from the UDP path where skb->dev has been overwritten, we have
> 	   no way of knowing if the pointer is still valid. Also for the
> 	   sake of consistency with the other uses of
> 	   netdev_rx_csum_fault(), don't attempt to call it if the
> 	   packet was checksummed by software.
> 
> 	2) Add better CHECKSUM_COMPLETE handling to udp4/6_csum_init().
> 	   If we receive a packet that's CHECKSUM_COMPLETE that fails
> 	   verification (i.e. skb->csum_valid == 0), check who performed
> 	   the calculation. It's possible that the checksum was done in
> 	   software by the network stack earlier (such as Netfilter's
> 	   CONNTRACK module), and if that says the checksum is bad,
> 	   we can drop the packet immediately instead of waiting until
> 	   we try and copy it to userspace. Otherwise, we need to
> 	   mark the SKB as CHECKSUM_NONE, since the skb->csum field
> 	   no longer contains the full packet checksum after the
> 	   call to __skb_checksum_validate_complete().
> 
> Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing")

Can't count on my hands how many regressions are a result of that change and
it's subtle side effects. :-/

> Fixes: c84d949057ca ("udp: copy skb->truesize in the first cache line")
> Cc: Sam Kumar <samanthakumar@google.com>
> Cc: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Sean Tranchetti <stranche@codeaurora.org>

Applied and queued up for -stable, thank you.

^ permalink raw reply

* Re: [PATCH net 0/4] net: Fixups for recent dump filtering changes
From: David Miller @ 2018-10-24 21:07 UTC (permalink / raw)
  To: dsahern; +Cc: netdev, lirongqing, dsahern
In-Reply-To: <20181024195902.17479-1-dsahern@kernel.org>

From: David Ahern <dsahern@kernel.org>
Date: Wed, 24 Oct 2018 12:58:58 -0700

> Li RongQing noted that tgt_net is leaked in ipv4 due to the recent change
> to handle address dumps for a specific device. The report also applies to
> ipv6 and other error paths. Patches 1 and 2 fix those leaks.
> 
> Patch 3 stops route dumps from erroring out when dumping across address
> families and a table id is given. This is needed in preparation for
> patch 4.
> 
> Patch 4 updates the rtnl_dump_all to handle a failure in one of the dumpit
> functions. At the moment, if an address dump returns an error the dump all
> loop breaks but the error is dropped. The result can be no data is returned
> and no error either leaving the user wondering about the addresses.
> 
> Patches were tested with a modified iproute2 to add invalid data to the
> dump request causing each specific failure path to be hit in addition
> to positive testing that it works as it should when given valid data.

Series applied, thanks David.

^ permalink raw reply

* Re: Regression in 4.19 net/phy/realtek: garbled sysfs output
From: David Miller @ 2018-10-24 20:59 UTC (permalink / raw)
  To: holger; +Cc: netdev, jaswinder.singh, hkallweit1
In-Reply-To: <5e989478-0d33-cd05-efa7-fe3ec54856ab@applied-asynchrony.com>

From: Holger Hoffstätte <holger@applied-asynchrony.com>
Date: Wed, 24 Oct 2018 21:36:02 +0200

Adding Heiner to CC:

> Since 4.19 r8169 depends on phylib:
> 
> $lsmod | grep r8169
> r8169                  81920  0
> libphy                 57344  2 r8169,realtek
> 
> Unfortunately this now gives me the following sysfs error:
> 
> $cd /sys/module/realtek/drivers
> $ls -l
> ls: cannot access 'mdio_bus:RTL8201F 10/100Mbps Ethernet': No such
> file or directory
> total 0
> lrwxrwxrwx 1 root root 0 Oct 24 21:09 'mdio_bus:RTL8201CP Ethernet' ->
> '../../../bus/mdio_bus/drivers/RTL8201CP Ethernet'
> l????????? ? ?  ?  ?  ? 'mdio_bus:RTL8201F 10/100Mbps Ethernet'
> lrwxrwxrwx 1 root root 0 Oct 24 21:09 'mdio_bus:RTL8211 Gigabit
> Ethernet' -> '../../../bus/mdio_bus/drivers/RTL8211 Gigabit Ethernet'
> [..]
> 
> Apparently the forward slash in "10/100Mbps Ethernet" is interpreted
> as
> directory separator that leads nowhere, and was introduced in commit
> 513588dd44b ("net: phy: realtek: add RTL8201F phy-id and functions").
> 
> Would it be acceptable to change the name simply to "RTL8201F
> Ethernet"?

^ permalink raw reply

* [PATCH net] net/ipv6: Allow onlink routes to have a device mismatch if it is the default route
From: David Ahern @ 2018-10-24 20:58 UTC (permalink / raw)
  To: netdev, davem; +Cc: David Ahern

From: David Ahern <dsahern@gmail.com>

The intent of ip6_route_check_nh_onlink is to make sure the gateway
given for an onlink route is not actually on a connected route for
a different interface (e.g., 2001:db8:1::/64 is on dev eth1 and then
an onlink route has a via 2001:db8:1::1 dev eth2). If the gateway
lookup hits the default route then it most likely will be a different
interface than the onlink route which is ok.

Update ip6_route_check_nh_onlink to disregard the device mismatch
if the gateway lookup hits the default route. Turns out the existing
onlink tests are passing because there is no default route or it is
an unreachable default, so update the onlink tests to have a default
route other than unreachable.

Fixes: fc1e64e1092f6 ("net/ipv6: Add support for onlink flag")
Signed-off-by: David Ahern <dsahern@gmail.com>
---
 net/ipv6/route.c                                |  2 ++
 tools/testing/selftests/net/fib-onlink-tests.sh | 14 +++++++-------
 2 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index e3226284e480..2a7423c39456 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2745,6 +2745,8 @@ static int ip6_route_check_nh_onlink(struct net *net,
 	grt = ip6_nh_lookup_table(net, cfg, gw_addr, tbid, 0);
 	if (grt) {
 		if (!grt->dst.error &&
+		    /* ignore match if it is the default route */
+		    grt->from && !ipv6_addr_any(&grt->from->fib6_dst.addr) &&
 		    (grt->rt6i_flags & flags || dev != grt->dst.dev)) {
 			NL_SET_ERR_MSG(extack,
 				       "Nexthop has invalid gateway or device mismatch");
diff --git a/tools/testing/selftests/net/fib-onlink-tests.sh b/tools/testing/selftests/net/fib-onlink-tests.sh
index 3991ad1a368d..864f865eee55 100755
--- a/tools/testing/selftests/net/fib-onlink-tests.sh
+++ b/tools/testing/selftests/net/fib-onlink-tests.sh
@@ -167,8 +167,8 @@ setup()
 	# add vrf table
 	ip li add ${VRF} type vrf table ${VRF_TABLE}
 	ip li set ${VRF} up
-	ip ro add table ${VRF_TABLE} unreachable default
-	ip -6 ro add table ${VRF_TABLE} unreachable default
+	ip ro add table ${VRF_TABLE} unreachable default metric 8192
+	ip -6 ro add table ${VRF_TABLE} unreachable default metric 8192
 
 	# create test interfaces
 	ip li add ${NETIFS[p1]} type veth peer name ${NETIFS[p2]}
@@ -185,20 +185,20 @@ setup()
 	for n in 1 3 5 7; do
 		ip li set ${NETIFS[p${n}]} up
 		ip addr add ${V4ADDRS[p${n}]}/24 dev ${NETIFS[p${n}]}
-		ip addr add ${V6ADDRS[p${n}]}/64 dev ${NETIFS[p${n}]}
+		ip addr add ${V6ADDRS[p${n}]}/64 dev ${NETIFS[p${n}]} nodad
 	done
 
 	# move peer interfaces to namespace and add addresses
 	for n in 2 4 6 8; do
 		ip li set ${NETIFS[p${n}]} netns ${PEER_NS} up
 		ip -netns ${PEER_NS} addr add ${V4ADDRS[p${n}]}/24 dev ${NETIFS[p${n}]}
-		ip -netns ${PEER_NS} addr add ${V6ADDRS[p${n}]}/64 dev ${NETIFS[p${n}]}
+		ip -netns ${PEER_NS} addr add ${V6ADDRS[p${n}]}/64 dev ${NETIFS[p${n}]} nodad
 	done
 
-	set +e
+	ip -6 ro add default via ${V6ADDRS[p3]/::[0-9]/::64}
+	ip -6 ro add table ${VRF_TABLE} default via ${V6ADDRS[p7]/::[0-9]/::64}
 
-	# let DAD complete - assume default of 1 probe
-	sleep 1
+	set +e
 }
 
 cleanup()
-- 
2.11.0

^ permalink raw reply related

* Re: [PATCH ghak90 (was ghak32) V4 03/10] audit: log container info of syscalls
From: Paul Moore @ 2018-10-24 20:55 UTC (permalink / raw)
  To: rgb
  Cc: containers, linux-api, linux-audit, linux-fsdevel, linux-kernel,
	netdev, netfilter-devel, ebiederm, luto, carlos, dhowells, viro,
	simo, Eric Paris, Serge Hallyn
In-Reply-To: <20181024151439.lavhanabsyxdrdvo@madcap2.tricolour.ca>

On Wed, Oct 24, 2018 at 11:15 AM Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2018-10-19 19:16, Paul Moore wrote:
> > On Sun, Aug 5, 2018 at 4:32 AM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > Create a new audit record AUDIT_CONTAINER to document the audit
> > > container identifier of a process if it is present.
> > >
> > > Called from audit_log_exit(), syscalls are covered.
> > >
> > > A sample raw event:
> > > type=SYSCALL msg=audit(1519924845.499:257): arch=c000003e syscall=257 success=yes exit=3 a0=ffffff9c a1=56374e1cef30 a2=241 a3=1b6 items=2 ppid=606 pid=635 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=3 comm="bash" exe="/usr/bin/bash" subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 key="tmpcontainerid"
> > > type=CWD msg=audit(1519924845.499:257): cwd="/root"
> > > type=PATH msg=audit(1519924845.499:257): item=0 name="/tmp/" inode=13863 dev=00:27 mode=041777 ouid=0 ogid=0 rdev=00:00 obj=system_u:object_r:tmp_t:s0 nametype= PARENT cap_fp=0000000000000000 cap_fi=0000000000000000 cap_fe=0 cap_fver=0
> > > type=PATH msg=audit(1519924845.499:257): item=1 name="/tmp/tmpcontainerid" inode=17729 dev=00:27 mode=0100644 ouid=0 ogid=0 rdev=00:00 obj=unconfined_u:object_r:user_tmp_t:s0 nametype=CREATE cap_fp=0000000000000000 cap_fi=0000000000000000 cap_fe=0 cap_fver=0
> > > type=PROCTITLE msg=audit(1519924845.499:257): proctitle=62617368002D6300736C65657020313B206563686F2074657374203E202F746D702F746D70636F6E7461696E65726964
> > > type=CONTAINER msg=audit(1519924845.499:257): op=task contid=123458
> > >
> > > See: https://github.com/linux-audit/audit-kernel/issues/90
> > > See: https://github.com/linux-audit/audit-userspace/issues/51
> > > See: https://github.com/linux-audit/audit-testsuite/issues/64
> > > See: https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID
> > > Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> > > Acked-by: Serge Hallyn <serge@hallyn.com>
> > > Acked-by: Steve Grubb <sgrubb@redhat.com>
> > > ---
> > >  include/linux/audit.h      |  7 +++++++
> > >  include/uapi/linux/audit.h |  1 +
> > >  kernel/audit.c             | 24 ++++++++++++++++++++++++
> > >  kernel/auditsc.c           |  3 +++
> > >  4 files changed, 35 insertions(+)
> >
> > ...
> >
> > > @@ -2045,6 +2045,30 @@ void audit_log_session_info(struct audit_buffer *ab)
> > >         audit_log_format(ab, " auid=%u ses=%u", auid, sessionid);
> > >  }
> > >
> > > +/*
> > > + * audit_log_contid - report container info
> > > + * @tsk: task to be recorded
> > > + * @context: task or local context for record
> > > + * @op: contid string description
> > > + */
> > > +int audit_log_contid(struct task_struct *tsk,
> > > +                            struct audit_context *context, char *op)
> > > +{
> > > +       struct audit_buffer *ab;
> > > +
> > > +       if (!audit_contid_set(tsk))
> > > +               return 0;
> > > +       /* Generate AUDIT_CONTAINER record with container ID */
> > > +       ab = audit_log_start(context, GFP_KERNEL, AUDIT_CONTAINER);
> > > +       if (!ab)
> > > +               return -ENOMEM;
> > > +       audit_log_format(ab, "op=%s contid=%llu",
> > > +                        op, audit_get_contid(tsk));
> > > +       audit_log_end(ab);
> > > +       return 0;
> > > +}
> > > +EXPORT_SYMBOL(audit_log_contid);
> >
> > As discussed in the previous iteration of the patch, I prefer
> > AUDIT_CONTAINER_ID here over AUDIT_CONTAINER.  If you feel strongly
> > about keeping it as-is with AUDIT_CONTAINER I suppose I could live
> > with that, but it is isn't my first choice.
>
> I don't have a strong opinion on this one, mildly preferring the shorter
> one only because it is shorter.

We already have multiple AUDIT_CONTAINER* record types, so it seems as
though we should use "AUDIT_CONTAINER" as a prefix of sorts, rather
than a type itself.

> > However, I do care about the "op" field in this record.  It just
> > doesn't make any sense; the way you are using it it is more of a
> > context field than an operations field, and even then why is the
> > context important from a logging and/or security perspective?  Drop it
> > please.
>
> I'll rename it to whatever you like.  I'd suggest "ref=".  The reason I
> think it is important is there are multiple sources that aren't always
> obvious from the other records to which it is associated.  In the case
> of ptrace and signals, there can be many target tasks listed (OBJ_PID)
> with no other way to distinguish the matching audit container identifier
> records all for one event.  This is in addition to the default syscall
> container identifier record.  I'm not currently happy with the text
> content to link the two, but that should be solvable (most obvious is
> taret PID).  Throwing away this information seems shortsighted.

It would be helpful if you could generate real audit events
demonstrating the problems you are describing, as well as a more
standard syscall event, so we can discuss some possible solutions.

-- 
paul moore
www.paul-moore.com

^ permalink raw reply

* Re: [PATCH] netfilter: conntrack: fix calculation of next bucket number in early_drop
From: Dmitry Safonov @ 2018-10-25  4:52 UTC (permalink / raw)
  To: Vasily Khoruzhick, Pablo Neira Ayuso, Jozsef Kadlecsik,
	Florian Westphal, David S. Miller, netfilter-devel, coreteam,
	netdev, linux-kernel
  Cc: stable
In-Reply-To: <20181025034853.23573-1-vasilykh@arista.com>

On 10/25/18 4:48 AM, Vasily Khoruzhick wrote:
> If there's no entry to drop in bucket that corresponds to the hash,
> early_drop() should look for it in other buckets. But since it increments
> hash instead of bucket number, it actually looks in the same bucket 8
> times: hsize is 16k by default (14 bits) and hash is 32-bit value, so
> reciprocal_scale(hash, hsize) returns the same value for hash..hash+7 in
> most cases.
> 
> Fix it by increasing bucket number instead of hash and rename _hash
> to bucket to avoid future confusion.
> 
> Fixes: 3e86638e9a0b ("netfilter: conntrack: consider ct netns in early_drop logic")
> Cc: <stable@vger.kernel.org> # v4.7+
> Signed-off-by: Vasily Khoruzhick <vasilykh@arista.com>

Nice work!

Reviewed-by: Dmitry Safonov <dima@arista.com>

> ---
>   net/netfilter/nf_conntrack_core.c | 11 +++++++----
>   1 file changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
> index ca1168d67fac..a04af246b184 100644
> --- a/net/netfilter/nf_conntrack_core.c
> +++ b/net/netfilter/nf_conntrack_core.c
> @@ -1073,19 +1073,22 @@ static unsigned int early_drop_list(struct net *net,
>   	return drops;
>   }
>   
> -static noinline int early_drop(struct net *net, unsigned int _hash)
> +static noinline int early_drop(struct net *net, unsigned int hash)
>   {
>   	unsigned int i;
>   
>   	for (i = 0; i < NF_CT_EVICTION_RANGE; i++) {
>   		struct hlist_nulls_head *ct_hash;
> -		unsigned int hash, hsize, drops;
> +		unsigned int bucket, hsize, drops;
>   
>   		rcu_read_lock();
>   		nf_conntrack_get_ht(&ct_hash, &hsize);
> -		hash = reciprocal_scale(_hash++, hsize);
> +		if (!i)
> +			bucket = reciprocal_scale(hash, hsize);
> +		else
> +			bucket = (bucket + 1) % hsize;
>   
> -		drops = early_drop_list(net, &ct_hash[hash]);
> +		drops = early_drop_list(net, &ct_hash[bucket]);
>   		rcu_read_unlock();
>   
>   		if (drops) {
> 

-- 
           Dima

^ permalink raw reply

* Re: Regression in 4.19 net/phy/realtek: garbled sysfs output
From: Andrew Lunn @ 2018-10-24 20:12 UTC (permalink / raw)
  To: Holger Hoffstätte; +Cc: Netdev, Jassi Brar, David S. Miller
In-Reply-To: <5e989478-0d33-cd05-efa7-fe3ec54856ab@applied-asynchrony.com>

On Wed, Oct 24, 2018 at 09:36:02PM +0200, Holger Hoffstätte wrote:
> Hi,
> 
> Since 4.19 r8169 depends on phylib:
> 
> $lsmod | grep r8169
> r8169                  81920  0
> libphy                 57344  2 r8169,realtek
> 
> Unfortunately this now gives me the following sysfs error:
> 
> $cd /sys/module/realtek/drivers
> $ls -l
> ls: cannot access 'mdio_bus:RTL8201F 10/100Mbps Ethernet': No such file or directory
> total 0
> lrwxrwxrwx 1 root root 0 Oct 24 21:09 'mdio_bus:RTL8201CP Ethernet' -> '../../../bus/mdio_bus/drivers/RTL8201CP Ethernet'
> l????????? ? ?    ?    ?            ? 'mdio_bus:RTL8201F 10/100Mbps Ethernet'
> lrwxrwxrwx 1 root root 0 Oct 24 21:09 'mdio_bus:RTL8211 Gigabit Ethernet' -> '../../../bus/mdio_bus/drivers/RTL8211 Gigabit Ethernet'
> [..]
> 
> Apparently the forward slash in "10/100Mbps Ethernet" is interpreted as
> directory separator that leads nowhere, and was introduced in commit
> 513588dd44b ("net: phy: realtek: add RTL8201F phy-id and functions").
> 
> Would it be acceptable to change the name simply to "RTL8201F Ethernet"?

Hi Holger

Or use "RTL8201F Fast Ethernet"

I wonder if other drivers have similar problems?

davicom.c:      .name           = "Davicom DM9161B/C",
intel-xway.c:           .name           = "Intel XWAY PHY11G (PEF 7071/PEF 7072) v1.3",
intel-xway.c:           .name           = "Intel XWAY PHY11G (PEF 7071/PEF 7072) v1.4",
intel-xway.c:           .name           = "Intel XWAY PHY11G (PEF 7071/PEF 7072) v1.5 / v1.6",
intel-xway.c:           .name           = "Intel XWAY PHY22F (PEF 7061) v1.5 / v1.6",
smsc.c:	 .name	       = "SMSC LAN8710/LAN8720",

	 Andrew

^ permalink raw reply

* [PATCH bpf 7/7] bpf: make direct packet write unclone more robust
From: Daniel Borkmann @ 2018-10-24 20:05 UTC (permalink / raw)
  To: ast; +Cc: netdev, Daniel Borkmann
In-Reply-To: <20181024200549.8516-1-daniel@iogearbox.net>

Given this seems to be quite fragile and can easily slip through the
cracks, lets make direct packet write more robust by requiring that
future program types which allow for such write must provide a prologue
callback. In case of XDP and sk_msg it's noop, thus add a generic noop
handler there. The latter starts out with NULL data/data_end unconditionally
when sg pages are shared.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/verifier.c |  6 +++++-
 net/core/filter.c     | 11 +++++++++++
 2 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 5fc9a65..171a2c8 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -5709,7 +5709,11 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
 	bool is_narrower_load;
 	u32 target_size;
 
-	if (ops->gen_prologue) {
+	if (ops->gen_prologue || env->seen_direct_write) {
+		if (!ops->gen_prologue) {
+			verbose(env, "bpf verifier is misconfigured\n");
+			return -EINVAL;
+		}
 		cnt = ops->gen_prologue(insn_buf, env->seen_direct_write,
 					env->prog);
 		if (cnt >= ARRAY_SIZE(insn_buf)) {
diff --git a/net/core/filter.c b/net/core/filter.c
index 3fdddfa..cd648d0 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5644,6 +5644,15 @@ static bool sock_filter_is_valid_access(int off, int size,
 					       prog->expected_attach_type);
 }
 
+static int bpf_noop_prologue(struct bpf_insn *insn_buf, bool direct_write,
+			     const struct bpf_prog *prog)
+{
+	/* Neither direct read nor direct write requires any preliminary
+	 * action.
+	 */
+	return 0;
+}
+
 static int bpf_unclone_prologue(struct bpf_insn *insn_buf, bool direct_write,
 				const struct bpf_prog *prog, int drop_verdict)
 {
@@ -7210,6 +7219,7 @@ const struct bpf_verifier_ops xdp_verifier_ops = {
 	.get_func_proto		= xdp_func_proto,
 	.is_valid_access	= xdp_is_valid_access,
 	.convert_ctx_access	= xdp_convert_ctx_access,
+	.gen_prologue		= bpf_noop_prologue,
 };
 
 const struct bpf_prog_ops xdp_prog_ops = {
@@ -7308,6 +7318,7 @@ const struct bpf_verifier_ops sk_msg_verifier_ops = {
 	.get_func_proto		= sk_msg_func_proto,
 	.is_valid_access	= sk_msg_is_valid_access,
 	.convert_ctx_access	= sk_msg_convert_ctx_access,
+	.gen_prologue		= bpf_noop_prologue,
 };
 
 const struct bpf_prog_ops sk_msg_prog_ops = {
-- 
2.9.5

^ permalink raw reply related

* [PATCH bpf 6/7] bpf: fix leaking uninitialized memory on pop/peek helpers
From: Daniel Borkmann @ 2018-10-24 20:05 UTC (permalink / raw)
  To: ast; +Cc: netdev, Daniel Borkmann, Mauricio Vasquez B
In-Reply-To: <20181024200549.8516-1-daniel@iogearbox.net>

Commit f1a2e44a3aec ("bpf: add queue and stack maps") added helpers
with ARG_PTR_TO_UNINIT_MAP_VALUE. Meaning, the helper is supposed to
fill the map value buffer with data instead of reading from it like
in other helpers such as map update. However, given the buffer is
allowed to be uninitialized (since we fill it in the helper anyway),
it also means that the helper is obliged to wipe the memory in case
of an error in order to not allow for leaking uninitialized memory.
Given pop/peek is both handled inside __{stack,queue}_map_get(),
lets wipe it there on error case, that is, empty stack/queue.

Fixes: f1a2e44a3aec ("bpf: add queue and stack maps")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Mauricio Vasquez B <mauricio.vasquez@polito.it>
---
 kernel/bpf/queue_stack_maps.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/bpf/queue_stack_maps.c b/kernel/bpf/queue_stack_maps.c
index 12a93fb..8bbd72d 100644
--- a/kernel/bpf/queue_stack_maps.c
+++ b/kernel/bpf/queue_stack_maps.c
@@ -122,6 +122,7 @@ static int __queue_map_get(struct bpf_map *map, void *value, bool delete)
 	raw_spin_lock_irqsave(&qs->lock, flags);
 
 	if (queue_stack_map_is_empty(qs)) {
+		memset(value, 0, qs->map.value_size);
 		err = -ENOENT;
 		goto out;
 	}
@@ -151,6 +152,7 @@ static int __stack_map_get(struct bpf_map *map, void *value, bool delete)
 	raw_spin_lock_irqsave(&qs->lock, flags);
 
 	if (queue_stack_map_is_empty(qs)) {
+		memset(value, 0, qs->map.value_size);
 		err = -ENOENT;
 		goto out;
 	}
-- 
2.9.5

^ permalink raw reply related

* [PATCH bpf 2/7] bpf: disallow direct packet access for unpriv in cg_skb
From: Daniel Borkmann @ 2018-10-24 20:05 UTC (permalink / raw)
  To: ast; +Cc: netdev, Daniel Borkmann, Song Liu
In-Reply-To: <20181024200549.8516-1-daniel@iogearbox.net>

Commit b39b5f411dcf ("bpf: add cg_skb_is_valid_access for
BPF_PROG_TYPE_CGROUP_SKB") added support for returning pkt pointers
for direct packet access. Given this program type is allowed for both
unprivileged and privileged users, we shouldn't allow unprivileged
ones to use it, e.g. besides others one reason would be to avoid any
potential speculation on the packet test itself, thus guard this for
root only.

Fixes: b39b5f411dcf ("bpf: add cg_skb_is_valid_access for BPF_PROG_TYPE_CGROUP_SKB")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Song Liu <songliubraving@fb.com>
---
 net/core/filter.c                           | 6 ++++++
 tools/testing/selftests/bpf/test_verifier.c | 2 +-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 35c6933..3fdddfa 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5496,7 +5496,13 @@ static bool cg_skb_is_valid_access(int off, int size,
 	case bpf_ctx_range(struct __sk_buff, data_meta):
 	case bpf_ctx_range(struct __sk_buff, flow_keys):
 		return false;
+	case bpf_ctx_range(struct __sk_buff, data):
+	case bpf_ctx_range(struct __sk_buff, data_end):
+		if (!capable(CAP_SYS_ADMIN))
+			return false;
+		break;
 	}
+
 	if (type == BPF_WRITE) {
 		switch (off) {
 		case bpf_ctx_range(struct __sk_buff, mark):
diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index 8e1a79d..36f3d30 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -4892,7 +4892,7 @@ static struct bpf_test tests[] = {
 		},
 		.result = ACCEPT,
 		.result_unpriv = REJECT,
-		.errstr_unpriv = "R3 pointer comparison prohibited",
+		.errstr_unpriv = "invalid bpf_context access off=76 size=4",
 		.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
 	},
 	{
-- 
2.9.5

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox