Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH bpf-next v3 4/9] bpf/verifier: improve register value range tracking with ARSH
From: Edward Cree @ 2018-04-23 12:25 UTC (permalink / raw)
  To: Yonghong Song, ast, daniel, netdev; +Cc: kernel-team
In-Reply-To: <20180420221842.742330-5-yhs@fb.com>

On 20/04/18 23:18, Yonghong Song wrote:
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 3c8bb92..01c215d 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -2975,6 +2975,32 @@ static int adjust_scalar_min_max_vals(struct bpf_verifier_env *env,
>  		/* We may learn something more from the var_off */
>  		__update_reg_bounds(dst_reg);
>  		break;
> +	case BPF_ARSH:
> +		if (umax_val >= insn_bitness) {
> +			/* Shifts greater than 31 or 63 are undefined.
> +			 * This includes shifts by a negative number.
> +			 */
> +			mark_reg_unknown(env, regs, insn->dst_reg);
> +			break;
> +		}
> +		if (dst_reg->smin_value < 0)
> +			dst_reg->smin_value >>= umin_val;
> +		else
> +			dst_reg->smin_value >>= umax_val;
> +		if (dst_reg->smax_value < 0)
> +			dst_reg->smax_value >>= umax_val;
> +		else
> +			dst_reg->smax_value >>= umin_val;
> +		if (src_known)
> +			dst_reg->var_off = tnum_rshift(dst_reg->var_off,
> +						       umin_val);
tnum_rshift is an unsigned shift, it won't do what you want here.
I think you could write a tnum_arshift that looks something like this
 (UNTESTED!):

    struct tnum tnum_arshift(struct tnum a, u8 shift)
    {
        return TNUM(((s64)a.value) >> shift, ((s64)a.mask) >> shift);
    }
Theory: if value sign bit is 1 then number is known negative so populate
 upper bits with known 1s.  If mask sign bit is 1 then number might be
 negative so populate upper bits with unknown.  Otherwise, number is
 known positive so populate upper bits with known 0s.

> +		else
> +			dst_reg->var_off = tnum_rshift(tnum_unknown, umin_val);
Applying the above here, tnum_arshift(tnum_unknown, ...) would always just
 return tnum_unknown, so just do "dst_reg->var_off = tnum_unknown;".
The reason for the corresponding logic in the BPF_RSH case is that a right
 logical shift _always_ populates upper bits with zeroes.
In any case these 'else' branches are currently never taken because they
 fall foul of the check Alexei added just before the switch,
    if (!src_known &&
        opcode != BPF_ADD && opcode != BPF_SUB && opcode != BPF_AND) {
        __mark_reg_unknown(dst_reg);
        return 0;
    }
So I can guarantee you haven't tested this code :-)

> +		dst_reg->umin_value >>= umax_val;
> +		dst_reg->umax_value >>= umin_val;
FWIW I think the way to handle umin/umax here is to blow them away and
 just rely on inferring new ubounds from the sbounds (i.e. the inverse of
 what we do just above in case BPF_RSH) since BPF_ARSH is essentially an
 operation on the signed value.  I don't think there is a need to support
 cases where the unsigned bounds of a signed shift of a value that may
 cross the sign boundary at (1<<63) are needed to verify a program.
(Unlike in the unsigned shift case, it is at least _possible_ for there to
 be information from the ubounds that we can't get from the sbounds - but
 it's a contrived case that isn't likely to be useful in real programs.)

-Ed
> +		/* We may learn something more from the var_off */
> +		__update_reg_bounds(dst_reg);
> +		break;
>  	default:
>  		mark_reg_unknown(env, regs, insn->dst_reg);
>  		break;

^ permalink raw reply

* [PATCH net-next 2/2] qed: Add configuration information to register dump and debug data
From: Denis Bolotin @ 2018-04-23 11:56 UTC (permalink / raw)
  To: davem; +Cc: netdev, ariel.elior, Denis Bolotin
In-Reply-To: <20180423115605.8531-1-denis.bolotin@cavium.com>

Configuration information is added to the debug data collection, in
addition to register dump.
Added qed_dbg_nvm_image() that receives an image type, allocates a
buffer and reads the image. The images are saved in the buffers and the
dump size is updated.

Signed-off-by: Denis Bolotin <denis.bolotin@cavium.com>
Signed-off-by: Ariel Elior <ariel.elior@cavium.com>
---
 drivers/net/ethernet/qlogic/qed/qed_debug.c | 113 +++++++++++++++++++++++++++-
 drivers/net/ethernet/qlogic/qed/qed_mcp.c   |  14 +++-
 drivers/net/ethernet/qlogic/qed/qed_mcp.h   |  14 ++++
 include/linux/qed/qed_if.h                  |   3 +
 4 files changed, 139 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_debug.c b/drivers/net/ethernet/qlogic/qed/qed_debug.c
index 4926c55..b3211c7 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_debug.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_debug.c
@@ -7778,6 +7778,57 @@ int qed_dbg_igu_fifo_size(struct qed_dev *cdev)
 	return qed_dbg_feature_size(cdev, DBG_FEATURE_IGU_FIFO);
 }
 
+int qed_dbg_nvm_image_length(struct qed_hwfn *p_hwfn,
+			     enum qed_nvm_images image_id, u32 *length)
+{
+	struct qed_nvm_image_att image_att;
+	int rc;
+
+	*length = 0;
+	rc = qed_mcp_get_nvm_image_att(p_hwfn, image_id, &image_att);
+	if (rc)
+		return rc;
+
+	*length = image_att.length;
+
+	return rc;
+}
+
+int qed_dbg_nvm_image(struct qed_dev *cdev, void *buffer,
+		      u32 *num_dumped_bytes, enum qed_nvm_images image_id)
+{
+	struct qed_hwfn *p_hwfn =
+		&cdev->hwfns[cdev->dbg_params.engine_for_debug];
+	u32 len_rounded, i;
+	__be32 val;
+	int rc;
+
+	*num_dumped_bytes = 0;
+	rc = qed_dbg_nvm_image_length(p_hwfn, image_id, &len_rounded);
+	if (rc)
+		return rc;
+
+	DP_NOTICE(p_hwfn->cdev,
+		  "Collecting a debug feature [\"nvram image %d\"]\n",
+		  image_id);
+
+	len_rounded = roundup(len_rounded, sizeof(u32));
+	rc = qed_mcp_get_nvm_image(p_hwfn, image_id, buffer, len_rounded);
+	if (rc)
+		return rc;
+
+	/* QED_NVM_IMAGE_NVM_META image is not swapped like other images */
+	if (image_id != QED_NVM_IMAGE_NVM_META)
+		for (i = 0; i < len_rounded; i += 4) {
+			val = cpu_to_be32(*(u32 *)(buffer + i));
+			*(u32 *)(buffer + i) = val;
+		}
+
+	*num_dumped_bytes = len_rounded;
+
+	return rc;
+}
+
 int qed_dbg_protection_override(struct qed_dev *cdev, void *buffer,
 				u32 *num_dumped_bytes)
 {
@@ -7831,6 +7882,9 @@ enum debug_print_features {
 	IGU_FIFO = 6,
 	PHY = 7,
 	FW_ASSERTS = 8,
+	NVM_CFG1 = 9,
+	DEFAULT_CFG = 10,
+	NVM_META = 11,
 };
 
 static u32 qed_calc_regdump_header(enum debug_print_features feature,
@@ -7965,13 +8019,61 @@ int qed_dbg_all_data(struct qed_dev *cdev, void *buffer)
 		DP_ERR(cdev, "qed_dbg_mcp_trace failed. rc = %d\n", rc);
 	}
 
+	/* nvm cfg1 */
+	rc = qed_dbg_nvm_image(cdev,
+			       (u8 *)buffer + offset + REGDUMP_HEADER_SIZE,
+			       &feature_size, QED_NVM_IMAGE_NVM_CFG1);
+	if (!rc) {
+		*(u32 *)((u8 *)buffer + offset) =
+		    qed_calc_regdump_header(NVM_CFG1, cur_engine,
+					    feature_size, omit_engine);
+		offset += (feature_size + REGDUMP_HEADER_SIZE);
+	} else if (rc != -ENOENT) {
+		DP_ERR(cdev,
+		       "qed_dbg_nvm_image failed for image  %d (%s), rc = %d\n",
+		       QED_NVM_IMAGE_NVM_CFG1, "QED_NVM_IMAGE_NVM_CFG1", rc);
+	}
+
+	/* nvm default */
+	rc = qed_dbg_nvm_image(cdev,
+			       (u8 *)buffer + offset + REGDUMP_HEADER_SIZE,
+			       &feature_size, QED_NVM_IMAGE_DEFAULT_CFG);
+	if (!rc) {
+		*(u32 *)((u8 *)buffer + offset) =
+		    qed_calc_regdump_header(DEFAULT_CFG, cur_engine,
+					    feature_size, omit_engine);
+		offset += (feature_size + REGDUMP_HEADER_SIZE);
+	} else if (rc != -ENOENT) {
+		DP_ERR(cdev,
+		       "qed_dbg_nvm_image failed for image %d (%s), rc = %d\n",
+		       QED_NVM_IMAGE_DEFAULT_CFG, "QED_NVM_IMAGE_DEFAULT_CFG",
+		       rc);
+	}
+
+	/* nvm meta */
+	rc = qed_dbg_nvm_image(cdev,
+			       (u8 *)buffer + offset + REGDUMP_HEADER_SIZE,
+			       &feature_size, QED_NVM_IMAGE_NVM_META);
+	if (!rc) {
+		*(u32 *)((u8 *)buffer + offset) =
+		    qed_calc_regdump_header(NVM_META, cur_engine,
+					    feature_size, omit_engine);
+		offset += (feature_size + REGDUMP_HEADER_SIZE);
+	} else if (rc != -ENOENT) {
+		DP_ERR(cdev,
+		       "qed_dbg_nvm_image failed for image %d (%s), rc = %d\n",
+		       QED_NVM_IMAGE_NVM_META, "QED_NVM_IMAGE_NVM_META", rc);
+	}
+
 	return 0;
 }
 
 int qed_dbg_all_data_size(struct qed_dev *cdev)
 {
+	struct qed_hwfn *p_hwfn =
+		&cdev->hwfns[cdev->dbg_params.engine_for_debug];
+	u32 regs_len = 0, image_len = 0;
 	u8 cur_engine, org_engine;
-	u32 regs_len = 0;
 
 	org_engine = qed_get_debug_engine(cdev);
 	for (cur_engine = 0; cur_engine < cdev->num_hwfns; cur_engine++) {
@@ -7993,6 +8095,15 @@ int qed_dbg_all_data_size(struct qed_dev *cdev)
 
 	/* Engine common */
 	regs_len += REGDUMP_HEADER_SIZE + qed_dbg_mcp_trace_size(cdev);
+	qed_dbg_nvm_image_length(p_hwfn, QED_NVM_IMAGE_NVM_CFG1, &image_len);
+	if (image_len)
+		regs_len += REGDUMP_HEADER_SIZE + image_len;
+	qed_dbg_nvm_image_length(p_hwfn, QED_NVM_IMAGE_DEFAULT_CFG, &image_len);
+	if (image_len)
+		regs_len += REGDUMP_HEADER_SIZE + image_len;
+	qed_dbg_nvm_image_length(p_hwfn, QED_NVM_IMAGE_NVM_META, &image_len);
+	if (image_len)
+		regs_len += REGDUMP_HEADER_SIZE + image_len;
 
 	return regs_len;
 }
diff --git a/drivers/net/ethernet/qlogic/qed/qed_mcp.c b/drivers/net/ethernet/qlogic/qed/qed_mcp.c
index 1377ad1..0550f0e 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_mcp.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_mcp.c
@@ -2529,7 +2529,7 @@ int qed_mcp_nvm_info_populate(struct qed_hwfn *p_hwfn)
 	return rc;
 }
 
-static int
+int
 qed_mcp_get_nvm_image_att(struct qed_hwfn *p_hwfn,
 			  enum qed_nvm_images image_id,
 			  struct qed_nvm_image_att *p_image_att)
@@ -2545,6 +2545,15 @@ int qed_mcp_nvm_info_populate(struct qed_hwfn *p_hwfn)
 	case QED_NVM_IMAGE_FCOE_CFG:
 		type = NVM_TYPE_FCOE_CFG;
 		break;
+	case QED_NVM_IMAGE_NVM_CFG1:
+		type = NVM_TYPE_NVM_CFG1;
+		break;
+	case QED_NVM_IMAGE_DEFAULT_CFG:
+		type = NVM_TYPE_DEFAULT_CFG;
+		break;
+	case QED_NVM_IMAGE_NVM_META:
+		type = NVM_TYPE_META;
+		break;
 	default:
 		DP_NOTICE(p_hwfn, "Unknown request of image_id %08x\n",
 			  image_id);
@@ -2588,9 +2597,6 @@ int qed_mcp_get_nvm_image(struct qed_hwfn *p_hwfn,
 		return -EINVAL;
 	}
 
-	/* Each NVM image is suffixed by CRC; Upper-layer has no need for it */
-	image_att.length -= 4;
-
 	if (image_att.length > buffer_len) {
 		DP_VERBOSE(p_hwfn,
 			   QED_MSG_STORAGE,
diff --git a/drivers/net/ethernet/qlogic/qed/qed_mcp.h b/drivers/net/ethernet/qlogic/qed/qed_mcp.h
index dd62c38..3af3896 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_mcp.h
+++ b/drivers/net/ethernet/qlogic/qed/qed_mcp.h
@@ -486,6 +486,20 @@ struct qed_nvm_image_att {
  * @brief Allows reading a whole nvram image
  *
  * @param p_hwfn
+ * @param image_id - image to get attributes for
+ * @param p_image_att - image attributes structure into which to fill data
+ *
+ * @return int - 0 - operation was successful.
+ */
+int
+qed_mcp_get_nvm_image_att(struct qed_hwfn *p_hwfn,
+			  enum qed_nvm_images image_id,
+			  struct qed_nvm_image_att *p_image_att);
+
+/**
+ * @brief Allows reading a whole nvram image
+ *
+ * @param p_hwfn
  * @param image_id - image requested for reading
  * @param p_buffer - allocated buffer into which to fill data
  * @param buffer_len - length of the allocated buffer.
diff --git a/include/linux/qed/qed_if.h b/include/linux/qed/qed_if.h
index b5b2bc9..e53f9c7 100644
--- a/include/linux/qed/qed_if.h
+++ b/include/linux/qed/qed_if.h
@@ -159,6 +159,9 @@ struct qed_dcbx_get {
 enum qed_nvm_images {
 	QED_NVM_IMAGE_ISCSI_CFG,
 	QED_NVM_IMAGE_FCOE_CFG,
+	QED_NVM_IMAGE_NVM_CFG1,
+	QED_NVM_IMAGE_DEFAULT_CFG,
+	QED_NVM_IMAGE_NVM_META,
 };
 
 struct qed_link_eee_params {
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net-next 1/2] qed: Delete unused parameter p_ptt from mcp APIs
From: Denis Bolotin @ 2018-04-23 11:56 UTC (permalink / raw)
  To: davem; +Cc: netdev, ariel.elior, Denis Bolotin
In-Reply-To: <20180423115605.8531-1-denis.bolotin@cavium.com>

Since nvm images attributes are cached during driver load, acquiring ptt
is not needed when calling qed_mcp_get_nvm_image().

Signed-off-by: Denis Bolotin <denis.bolotin@cavium.com>
Signed-off-by: Ariel Elior <ariel.elior@cavium.com>
---
 drivers/net/ethernet/qlogic/qed/qed_main.c | 9 +--------
 drivers/net/ethernet/qlogic/qed/qed_mcp.c  | 4 +---
 drivers/net/ethernet/qlogic/qed/qed_mcp.h  | 2 --
 3 files changed, 2 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_main.c b/drivers/net/ethernet/qlogic/qed/qed_main.c
index 9854aa9..d1d3787 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_main.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_main.c
@@ -1894,15 +1894,8 @@ static int qed_nvm_get_image(struct qed_dev *cdev, enum qed_nvm_images type,
 			     u8 *buf, u16 len)
 {
 	struct qed_hwfn *hwfn = QED_LEADING_HWFN(cdev);
-	struct qed_ptt *ptt = qed_ptt_acquire(hwfn);
-	int rc;
-
-	if (!ptt)
-		return -EAGAIN;
 
-	rc = qed_mcp_get_nvm_image(hwfn, ptt, type, buf, len);
-	qed_ptt_release(hwfn, ptt);
-	return rc;
+	return qed_mcp_get_nvm_image(hwfn, type, buf, len);
 }
 
 static int qed_set_coalesce(struct qed_dev *cdev, u16 rx_coal, u16 tx_coal,
diff --git a/drivers/net/ethernet/qlogic/qed/qed_mcp.c b/drivers/net/ethernet/qlogic/qed/qed_mcp.c
index ec0d425..1377ad1 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_mcp.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_mcp.c
@@ -2531,7 +2531,6 @@ int qed_mcp_nvm_info_populate(struct qed_hwfn *p_hwfn)
 
 static int
 qed_mcp_get_nvm_image_att(struct qed_hwfn *p_hwfn,
-			  struct qed_ptt *p_ptt,
 			  enum qed_nvm_images image_id,
 			  struct qed_nvm_image_att *p_image_att)
 {
@@ -2569,7 +2568,6 @@ int qed_mcp_nvm_info_populate(struct qed_hwfn *p_hwfn)
 }
 
 int qed_mcp_get_nvm_image(struct qed_hwfn *p_hwfn,
-			  struct qed_ptt *p_ptt,
 			  enum qed_nvm_images image_id,
 			  u8 *p_buffer, u32 buffer_len)
 {
@@ -2578,7 +2576,7 @@ int qed_mcp_get_nvm_image(struct qed_hwfn *p_hwfn,
 
 	memset(p_buffer, 0, buffer_len);
 
-	rc = qed_mcp_get_nvm_image_att(p_hwfn, p_ptt, image_id, &image_att);
+	rc = qed_mcp_get_nvm_image_att(p_hwfn, image_id, &image_att);
 	if (rc)
 		return rc;
 
diff --git a/drivers/net/ethernet/qlogic/qed/qed_mcp.h b/drivers/net/ethernet/qlogic/qed/qed_mcp.h
index 8a5c988..dd62c38 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_mcp.h
+++ b/drivers/net/ethernet/qlogic/qed/qed_mcp.h
@@ -486,7 +486,6 @@ struct qed_nvm_image_att {
  * @brief Allows reading a whole nvram image
  *
  * @param p_hwfn
- * @param p_ptt
  * @param image_id - image requested for reading
  * @param p_buffer - allocated buffer into which to fill data
  * @param buffer_len - length of the allocated buffer.
@@ -494,7 +493,6 @@ struct qed_nvm_image_att {
  * @return 0 iff p_buffer now contains the nvram image.
  */
 int qed_mcp_get_nvm_image(struct qed_hwfn *p_hwfn,
-			  struct qed_ptt *p_ptt,
 			  enum qed_nvm_images image_id,
 			  u8 *p_buffer, u32 buffer_len);
 
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net-next 0/2] Add configuration information to register dump and debug data
From: Denis Bolotin @ 2018-04-23 11:56 UTC (permalink / raw)
  To: davem; +Cc: netdev, ariel.elior, Denis Bolotin

The purpose of this patchset is to add configuration information to the
debug data collection, which already contains register dump.
The first patch (removing the ptt) is essential because it prevents the
unnecessary ptt acquirement when calling mcp APIs.

Denis Bolotin (2):
  qed: Delete unused parameter p_ptt from mcp APIs
  qed: Add configuration information to register dump and debug data

 drivers/net/ethernet/qlogic/qed/qed_debug.c | 113 +++++++++++++++++++++++++++-
 drivers/net/ethernet/qlogic/qed/qed_main.c  |   9 +--
 drivers/net/ethernet/qlogic/qed/qed_mcp.c   |  18 +++--
 drivers/net/ethernet/qlogic/qed/qed_mcp.h   |  16 +++-
 include/linux/qed/qed_if.h                  |   3 +
 5 files changed, 141 insertions(+), 18 deletions(-)

-- 
1.8.3.1

^ permalink raw reply

* Re: kTLS in combination with mlx4 is very unstable
From: Andre Tomt @ 2018-04-23 11:44 UTC (permalink / raw)
  To: netdev; +Cc: davejwatson, Tariq Toukan
In-Reply-To: <bfb1e726-24ea-89e9-61e7-8e43e82cd23b@tomt.net>

On 22. april 2018 23:21, Andre Tomt wrote:
> Hello!
> 
> kTLS looks fun, so I decided to play with it. It is quite spiffy - 
> however with mlx4 I get kernel crashes I'm not seeing when testing on 
> ixgbe.
> 
> For testing I'm using a git build of the "stream reflector" cubemap[1] 
> configured with kTLS and 8 worker threads running on 4 physical cores, 
> loading it up with a ~13Mbps MPEG-TS stream pulled from satelite TV.
> 
> The kernel seems to get increasingly unstable as I load it up with 
> client connections. At about 9Gbps and 700 connections, it is okay at 
> least for a while - it might run fine for say 45 minutes. Once it gets 
> to 20 - 30Gbps, the kernel will usually start spewing OOPSes within 
> minutes and the traffic drops.
> 
> Some bad interaction between mlx4 and kTLS?
> 
> Hardware is a quad core Xeon-D 1520 using a dual port Mellanox 
> ConnectX-3 VPI with a single 40Gbps ethernet link configured. Mellanox 
> mlx4 driver settings are kernel.org upstream defaults. Interface is 
> configured with FQ qdisc and sockets are using BBR congestion control.
> 
> Tested on kernel 4.14.34, 4.15.17, and 4.16.2 - 4.16.3.
> 
> [1] https://git.sesse.net/?p=cubemap
> 
> First OOPS (from 4.16.3)

It also blows up with a similar trace on 4.17-rc2.

^ permalink raw reply

* [iptables 2/2] extensions: libip6t_srh: add test-cases for matching previous, next and last SID
From: Ahmed Abdelsalam @ 2018-04-23 10:48 UTC (permalink / raw)
  To: pablo, fw, davem, dav.lebrun, linux-kernel, netfilter-devel,
	coreteam, netdev
  Cc: Ahmed Abdelsalam
In-Reply-To: <1524480503-1883-1-git-send-email-amsalam20@gmail.com>

This patch adds some test-cases to "libip6t_srh.t" for matching previous SID,
next SID, and last SID.

Signed-off-by: Ahmed Abdelsalam <amsalam20@gmail.com>
---
 extensions/libip6t_srh.t | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/extensions/libip6t_srh.t b/extensions/libip6t_srh.t
index 08897d5..88a379e 100644
--- a/extensions/libip6t_srh.t
+++ b/extensions/libip6t_srh.t
@@ -23,4 +23,8 @@
 -m srh ! --srh-tag 0;=;OK
 -m srh --srh-next-hdr 17 --srh-segs-left-eq 1 --srh-last-entry-eq 4 --srh-tag 0;=;OK
 -m srh ! --srh-next-hdr 17 ! --srh-segs-left-eq 0 --srh-tag 0;=;OK
+-m srh --srh-psid A::2/64 --srh-nsid B2::/128 --srh-lsid C::/0;=;OK
+-m srh ! --srh-psid A::2/64 ! --srh-nsid B2::/128 ! --srh-lsid C::/0;=;OK
+-m srh --srh-psid A::2 --srh-nsid B2:: --srh-lsid C::;=;OK
+-m srh ! --srh-psid A::2 ! --srh-nsid B2:: ! --srh-lsid C::;=;OK
 -m srh;=;OK
-- 
2.1.4

^ permalink raw reply related

* [nf-next] netfilter: extend SRH match to support matching previous, next and last SID
From: Ahmed Abdelsalam @ 2018-04-23 10:48 UTC (permalink / raw)
  To: pablo, fw, davem, dav.lebrun, linux-kernel, netfilter-devel,
	coreteam, netdev
  Cc: Ahmed Abdelsalam
In-Reply-To: <1524480503-1883-1-git-send-email-amsalam20@gmail.com>

IPv6 Segment Routing Header (SRH) contains a list of SIDs to be crossed by
SR encapsulated packet. Each SID is encoded as an IPv6 prefix.

When a Firewall receives an SR encapsulated packet, it should be able to
identify which node previously processed the packet (previous SID), which
node is going to process the packet next (next SID), and which node is the
last to process the packet (last SID) which represent the final destination
of the packet in case of inline SR mode.

An example use-case of using these features could be SID list that includes
two firewalls. When the second firewall receives a packet, it can check
whether the packet has been processed by the first firewall or not. Based on
that check, it decides to apply all rules, apply just subset of the rules,
or totally skip all rules and forward the packet to the next SID.

This patch extends SRH match to support matching previous SID, next SID, and
last SID.

Signed-off-by: Ahmed Abdelsalam <amsalam20@gmail.com>
---
 include/uapi/linux/netfilter_ipv6/ip6t_srh.h | 22 +++++++++++++--
 net/ipv6/netfilter/ip6t_srh.c                | 41 +++++++++++++++++++++++++++-
 2 files changed, 60 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/netfilter_ipv6/ip6t_srh.h b/include/uapi/linux/netfilter_ipv6/ip6t_srh.h
index f3cc0ef..9808382 100644
--- a/include/uapi/linux/netfilter_ipv6/ip6t_srh.h
+++ b/include/uapi/linux/netfilter_ipv6/ip6t_srh.h
@@ -17,7 +17,10 @@
 #define IP6T_SRH_LAST_GT        0x0100
 #define IP6T_SRH_LAST_LT        0x0200
 #define IP6T_SRH_TAG            0x0400
-#define IP6T_SRH_MASK           0x07FF
+#define IP6T_SRH_PSID           0x0800
+#define IP6T_SRH_NSID           0x1000
+#define IP6T_SRH_LSID           0x2000
+#define IP6T_SRH_MASK           0x3FFF
 
 /* Values for "mt_invflags" field in struct ip6t_srh */
 #define IP6T_SRH_INV_NEXTHDR    0x0001
@@ -31,7 +34,10 @@
 #define IP6T_SRH_INV_LAST_GT    0x0100
 #define IP6T_SRH_INV_LAST_LT    0x0200
 #define IP6T_SRH_INV_TAG        0x0400
-#define IP6T_SRH_INV_MASK       0x07FF
+#define IP6T_SRH_INV_PSID       0x0800
+#define IP6T_SRH_INV_NSID       0x1000
+#define IP6T_SRH_INV_LSID       0x2000
+#define IP6T_SRH_INV_MASK       0x3FFF
 
 /**
  *      struct ip6t_srh - SRH match options
@@ -40,6 +46,12 @@
  *      @ segs_left: Segments left field of SRH
  *      @ last_entry: Last entry field of SRH
  *      @ tag: Tag field of SRH
+ *      @ psid_addr: Address of previous SID in SRH SID list
+ *      @ nsid_addr: Address of NEXT SID in SRH SID list
+ *      @ lsid_addr: Address of LAST SID in SRH SID list
+ *      @ psid_msk: Mask of previous SID in SRH SID list
+ *      @ nsid_msk: Mask of next SID in SRH SID list
+ *      @ lsid_msk: MAsk of last SID in SRH SID list
  *      @ mt_flags: match options
  *      @ mt_invflags: Invert the sense of match options
  */
@@ -50,6 +62,12 @@ struct ip6t_srh {
 	__u8                    segs_left;
 	__u8                    last_entry;
 	__u16                   tag;
+	struct in6_addr		psid_addr;
+	struct in6_addr		nsid_addr;
+	struct in6_addr		lsid_addr;
+	struct in6_addr		psid_msk;
+	struct in6_addr		nsid_msk;
+	struct in6_addr		lsid_msk;
 	__u16                   mt_flags;
 	__u16                   mt_invflags;
 };
diff --git a/net/ipv6/netfilter/ip6t_srh.c b/net/ipv6/netfilter/ip6t_srh.c
index 33719d5..2b5cc73 100644
--- a/net/ipv6/netfilter/ip6t_srh.c
+++ b/net/ipv6/netfilter/ip6t_srh.c
@@ -30,7 +30,9 @@ static bool srh_mt6(const struct sk_buff *skb, struct xt_action_param *par)
 	const struct ip6t_srh *srhinfo = par->matchinfo;
 	struct ipv6_sr_hdr *srh;
 	struct ipv6_sr_hdr _srh;
-	int hdrlen, srhoff = 0;
+	int hdrlen, psidoff, nsidoff, lsidoff, srhoff = 0;
+	struct in6_addr *psid, *nsid, *lsid;
+	struct in6_addr _psid, _nsid, _lsid;
 
 	if (ipv6_find_hdr(skb, &srhoff, IPPROTO_ROUTING, NULL, NULL) < 0)
 		return false;
@@ -114,6 +116,43 @@ static bool srh_mt6(const struct sk_buff *skb, struct xt_action_param *par)
 		if (NF_SRH_INVF(srhinfo, IP6T_SRH_INV_TAG,
 				!(srh->tag == srhinfo->tag)))
 			return false;
+
+	/* Previous SID matching */
+	if (srhinfo->mt_flags & IP6T_SRH_PSID) {
+		if (srh->segments_left == srh->first_segment)
+			return false;
+		psidoff = srhoff + sizeof(struct ipv6_sr_hdr) +
+			  ((srh->segments_left + 1) * sizeof(struct in6_addr));
+		psid = skb_header_pointer(skb, psidoff, sizeof(_psid), &_psid);
+		if (NF_SRH_INVF(srhinfo, IP6T_SRH_INV_PSID,
+				ipv6_masked_addr_cmp(psid, &srhinfo->psid_msk,
+						     &srhinfo->psid_addr)))
+			return false;
+	}
+
+	/* Next SID matching */
+	if (srhinfo->mt_flags & IP6T_SRH_NSID) {
+		if (srh->segments_left == 0)
+			return false;
+		nsidoff = srhoff + sizeof(struct ipv6_sr_hdr) +
+			  ((srh->segments_left - 1) * sizeof(struct in6_addr));
+		nsid = skb_header_pointer(skb, nsidoff, sizeof(_nsid), &_nsid);
+		if (NF_SRH_INVF(srhinfo, IP6T_SRH_INV_NSID,
+				ipv6_masked_addr_cmp(nsid, &srhinfo->nsid_msk,
+						     &srhinfo->nsid_addr)))
+			return false;
+	}
+
+	/* Last SID matching */
+	if (srhinfo->mt_flags & IP6T_SRH_LSID) {
+		lsidoff = srhoff + sizeof(struct ipv6_sr_hdr);
+		lsid = skb_header_pointer(skb, lsidoff, sizeof(_lsid), &_lsid);
+		if (NF_SRH_INVF(srhinfo, IP6T_SRH_INV_LSID,
+				ipv6_masked_addr_cmp(lsid, &srhinfo->lsid_msk,
+						     &srhinfo->lsid_addr)))
+			return false;
+	}
+
 	return true;
 }
 
-- 
2.1.4

^ permalink raw reply related

* [iptables 1/2] extensions: libip6t_srh: support matching previous, next and last SID
From: Ahmed Abdelsalam @ 2018-04-23 10:48 UTC (permalink / raw)
  To: pablo, fw, davem, dav.lebrun, linux-kernel, netfilter-devel,
	coreteam, netdev
  Cc: Ahmed Abdelsalam

This patch extends the libip6t_srh shared library to support matching
previous SID, next SID, and last SID.

Signed-off-by: Ahmed Abdelsalam <amsalam20@gmail.com>
---
 extensions/libip6t_srh.c                | 65 ++++++++++++++++++++++++++++++++-
 include/linux/netfilter_ipv6/ip6t_srh.h | 22 ++++++++++-
 2 files changed, 84 insertions(+), 3 deletions(-)

diff --git a/extensions/libip6t_srh.c b/extensions/libip6t_srh.c
index ac0ae08..5acc2ee 100644
--- a/extensions/libip6t_srh.c
+++ b/extensions/libip6t_srh.c
@@ -22,6 +22,9 @@ enum {
 	O_SRH_LAST_GT,
 	O_SRH_LAST_LT,
 	O_SRH_TAG,
+	O_SRH_PSID,
+	O_SRH_NSID,
+	O_SRH_LSID,
 };
 
 static void srh_help(void)
@@ -38,7 +41,10 @@ static void srh_help(void)
 "[!] --srh-last-entry-eq 	last_entry      Last Entry value of SRH\n"
 "[!] --srh-last-entry-gt 	last_entry      Last Entry value of SRH\n"
 "[!] --srh-last-entry-lt 	last_entry      Last Entry value of SRH\n"
-"[!] --srh-tag			tag             Tag value of SRH\n");
+"[!] --srh-tag			tag             Tag value of SRH\n"
+"[!] --srh-psid			addr[/mask]	SRH previous SID\n"
+"[!] --srh-nsid			addr[/mask]	SRH next SID\n"
+"[!] --srh-lsid			addr[/mask]	SRH Last SID\n");
 }
 
 #define s struct ip6t_srh
@@ -65,6 +71,12 @@ static const struct xt_option_entry srh_opts[] = {
 	.flags = XTOPT_INVERT | XTOPT_PUT, XTOPT_POINTER(s, last_entry)},
 	{ .name = "srh-tag", .id = O_SRH_TAG, .type = XTTYPE_UINT16,
 	.flags = XTOPT_INVERT | XTOPT_PUT, XTOPT_POINTER(s, tag)},
+	{ .name = "srh-psid", .id = O_SRH_PSID, .type = XTTYPE_HOSTMASK,
+	.flags = XTOPT_INVERT},
+	{ .name = "srh-nsid", .id = O_SRH_NSID, .type = XTTYPE_HOSTMASK,
+	.flags = XTOPT_INVERT},
+	{ .name = "srh-lsid", .id = O_SRH_LSID, .type = XTTYPE_HOSTMASK,
+	.flags = XTOPT_INVERT},
 	{ }
 };
 #undef s
@@ -75,6 +87,12 @@ static void srh_init(struct xt_entry_match *m)
 
 	srhinfo->mt_flags = 0;
 	srhinfo->mt_invflags = 0;
+	memset(srhinfo->psid_addr.s6_addr, 0, sizeof(srhinfo->psid_addr.s6_addr));
+	memset(srhinfo->nsid_addr.s6_addr, 0, sizeof(srhinfo->nsid_addr.s6_addr));
+	memset(srhinfo->lsid_addr.s6_addr, 0, sizeof(srhinfo->lsid_addr.s6_addr));
+	memset(srhinfo->psid_msk.s6_addr, 0, sizeof(srhinfo->psid_msk.s6_addr));
+	memset(srhinfo->nsid_msk.s6_addr, 0, sizeof(srhinfo->nsid_msk.s6_addr));
+	memset(srhinfo->lsid_msk.s6_addr, 0, sizeof(srhinfo->lsid_msk.s6_addr));
 }
 
 static void srh_parse(struct xt_option_call *cb)
@@ -138,6 +156,27 @@ static void srh_parse(struct xt_option_call *cb)
 		if (cb->invert)
 			srhinfo->mt_invflags |= IP6T_SRH_INV_TAG;
 		break;
+	case O_SRH_PSID:
+		srhinfo->mt_flags |= IP6T_SRH_PSID;
+		srhinfo->psid_addr = cb->val.haddr.in6;
+		srhinfo->psid_msk  = cb->val.hmask.in6;
+		if (cb->invert)
+			srhinfo->mt_invflags |= IP6T_SRH_INV_PSID;
+		break;
+	case O_SRH_NSID:
+		srhinfo->mt_flags |= IP6T_SRH_NSID;
+		srhinfo->nsid_addr = cb->val.haddr.in6;
+		srhinfo->nsid_msk  = cb->val.hmask.in6;
+		if (cb->invert)
+			srhinfo->mt_invflags |= IP6T_SRH_INV_NSID;
+		break;
+	case O_SRH_LSID:
+		srhinfo->mt_flags |= IP6T_SRH_LSID;
+		srhinfo->lsid_addr = cb->val.haddr.in6;
+		srhinfo->lsid_msk  = cb->val.hmask.in6;
+		if (cb->invert)
+			srhinfo->mt_invflags |= IP6T_SRH_INV_LSID;
+		break;
 	}
 }
 
@@ -180,6 +219,18 @@ static void srh_print(const void *ip, const struct xt_entry_match *match,
 	if (srhinfo->mt_flags & IP6T_SRH_TAG)
 		printf(" tag:%s%d", srhinfo->mt_invflags & IP6T_SRH_INV_TAG ? "!" : "",
 			srhinfo->tag);
+	if (srhinfo->mt_flags & IP6T_SRH_PSID)
+		printf(" psid %s %s/%u", srhinfo->mt_invflags & IP6T_SRH_INV_PSID ? "!" : "",
+			xtables_ip6addr_to_numeric(&srhinfo->psid_addr),
+			xtables_ip6mask_to_cidr(&srhinfo->psid_msk));
+	if (srhinfo->mt_flags & IP6T_SRH_NSID)
+		printf(" nsid %s %s/%u", srhinfo->mt_invflags & IP6T_SRH_INV_NSID ? "!" : "",
+			xtables_ip6addr_to_numeric(&srhinfo->nsid_addr),
+			xtables_ip6mask_to_cidr(&srhinfo->nsid_msk));
+	if (srhinfo->mt_flags & IP6T_SRH_LSID)
+		printf(" lsid %s %s/%u", srhinfo->mt_invflags & IP6T_SRH_INV_LSID ? "!" : "",
+			xtables_ip6addr_to_numeric(&srhinfo->lsid_addr),
+			xtables_ip6mask_to_cidr(&srhinfo->lsid_msk));
 }
 
 static void srh_save(const void *ip, const struct xt_entry_match *match)
@@ -219,6 +270,18 @@ static void srh_save(const void *ip, const struct xt_entry_match *match)
 	if (srhinfo->mt_flags & IP6T_SRH_TAG)
 		printf("%s --srh-tag %u", (srhinfo->mt_invflags & IP6T_SRH_INV_TAG) ? " !" : "",
 			srhinfo->tag);
+	if (srhinfo->mt_flags & IP6T_SRH_PSID)
+		printf("%s --srh-psid %s/%u", srhinfo->mt_invflags & IP6T_SRH_INV_PSID ? " !" : "",
+			xtables_ip6addr_to_numeric(&srhinfo->psid_addr),
+			xtables_ip6mask_to_cidr(&srhinfo->psid_msk));
+	if (srhinfo->mt_flags & IP6T_SRH_NSID)
+		printf("%s --srh-nsid %s/%u", srhinfo->mt_invflags & IP6T_SRH_INV_NSID ? " !" : "",
+			xtables_ip6addr_to_numeric(&srhinfo->nsid_addr),
+			xtables_ip6mask_to_cidr(&srhinfo->nsid_msk));
+	if (srhinfo->mt_flags & IP6T_SRH_LSID)
+		printf("%s --srh-lsid %s/%u", srhinfo->mt_invflags & IP6T_SRH_INV_LSID ? " !" : "",
+			xtables_ip6addr_to_numeric(&srhinfo->lsid_addr),
+			xtables_ip6mask_to_cidr(&srhinfo->lsid_msk));
 }
 
 static struct xtables_match srh_mt6_reg = {
diff --git a/include/linux/netfilter_ipv6/ip6t_srh.h b/include/linux/netfilter_ipv6/ip6t_srh.h
index 087efa1..3d77241 100644
--- a/include/linux/netfilter_ipv6/ip6t_srh.h
+++ b/include/linux/netfilter_ipv6/ip6t_srh.h
@@ -16,7 +16,10 @@
 #define IP6T_SRH_LAST_GT        0x0100
 #define IP6T_SRH_LAST_LT        0x0200
 #define IP6T_SRH_TAG            0x0400
-#define IP6T_SRH_MASK           0x07FF
+#define IP6T_SRH_PSID           0x0800
+#define IP6T_SRH_NSID           0x1000
+#define IP6T_SRH_LSID           0x2000
+#define IP6T_SRH_MASK           0x3FFF
 
 /* Values for "mt_invflags" field in struct ip6t_srh */
 #define IP6T_SRH_INV_NEXTHDR    0x0001
@@ -30,7 +33,10 @@
 #define IP6T_SRH_INV_LAST_GT    0x0100
 #define IP6T_SRH_INV_LAST_LT    0x0200
 #define IP6T_SRH_INV_TAG        0x0400
-#define IP6T_SRH_INV_MASK       0x07FF
+#define IP6T_SRH_INV_PSID       0x0800
+#define IP6T_SRH_INV_NSID       0x1000
+#define IP6T_SRH_INV_LSID       0x2000
+#define IP6T_SRH_INV_MASK       0x3FFF
 
 /**
  *      struct ip6t_srh - SRH match options
@@ -39,6 +45,12 @@
  *      @ segs_left: Segments left field of SRH
  *      @ last_entry: Last entry field of SRH
  *      @ tag: Tag field of SRH
+ *      @ psid_addr: Address of previous SID in SRH SID list
+ *      @ nsid_addr: Address of NEXT SID in SRH SID list
+ *      @ lsid_addr: Address of LAST SID in SRH SID list
+ *      @ psid_msk: Mask of previous SID in SRH SID list
+ *      @ nsid_msk: Mask of next SID in SRH SID list
+ *      @ lsid_msk: MAsk of last SID in SRH SID list
  *      @ mt_flags: match options
  *      @ mt_invflags: Invert the sense of match options
  */
@@ -49,6 +61,12 @@ struct ip6t_srh {
 	__u8                    segs_left;
 	__u8                    last_entry;
 	__u16                   tag;
+	struct in6_addr		psid_addr;
+	struct in6_addr		nsid_addr;
+	struct in6_addr		lsid_addr;
+	struct in6_addr		psid_msk;
+	struct in6_addr		nsid_msk;
+	struct in6_addr		lsid_msk;
 	__u16                   mt_flags;
 	__u16                   mt_invflags;
 };
-- 
2.1.4

^ permalink raw reply related

* RE: [RFC PATCH v2 03/14] xsk: add umem fill queue support and mmap
From: Karlsson, Magnus @ 2018-04-23 10:26 UTC (permalink / raw)
  To: 'Michael S. Tsirkin'
  Cc: 'Björn Töpel', Duyck, Alexander H,
	'alexander.duyck@gmail.com',
	'john.fastabend@gmail.com', 'ast@fb.com',
	'brouer@redhat.com',
	'willemdebruijn.kernel@gmail.com',
	'daniel@iogearbox.net', 'netdev@vger.kernel.org',
	'michael.lundkvist@ericsson.com', Brandeburg, Jesse,
	Singhai, Anjali, Zhang, Qi Z,
	'ravineet.singh@ericsson.com'
In-Reply-To: <AFED4FBCE79F3548A8F74434195ACE39588D2083@IRSMSX107.ger.corp.intel.com>



> -----Original Message-----
> From: Karlsson, Magnus
> Sent: Thursday, April 12, 2018 5:20 PM
> To: Michael S. Tsirkin <mst@redhat.com>
> Cc: Björn Töpel <bjorn.topel@gmail.com>; Duyck, Alexander H
> <alexander.h.duyck@intel.com>; alexander.duyck@gmail.com;
> john.fastabend@gmail.com; ast@fb.com; brouer@redhat.com;
> willemdebruijn.kernel@gmail.com; daniel@iogearbox.net;
> netdev@vger.kernel.org; michael.lundkvist@ericsson.com; Brandeburg,
> Jesse <jesse.brandeburg@intel.com>; Singhai, Anjali
> <anjali.singhai@intel.com>; Zhang, Qi Z <qi.z.zhang@intel.com>;
> ravineet.singh@ericsson.com
> Subject: RE: [RFC PATCH v2 03/14] xsk: add umem fill queue support and
> mmap
> 
> 
> 
> > -----Original Message-----
> > From: Michael S. Tsirkin [mailto:mst@redhat.com]
> > Sent: Thursday, April 12, 2018 4:05 PM
> > To: Karlsson, Magnus <magnus.karlsson@intel.com>
> > Cc: Björn Töpel <bjorn.topel@gmail.com>; Duyck, Alexander H
> > <alexander.h.duyck@intel.com>; alexander.duyck@gmail.com;
> > john.fastabend@gmail.com; ast@fb.com; brouer@redhat.com;
> > willemdebruijn.kernel@gmail.com; daniel@iogearbox.net;
> > netdev@vger.kernel.org; michael.lundkvist@ericsson.com; Brandeburg,
> > Jesse <jesse.brandeburg@intel.com>; Singhai, Anjali
> > <anjali.singhai@intel.com>; Zhang, Qi Z <qi.z.zhang@intel.com>;
> > ravineet.singh@ericsson.com
> > Subject: Re: [RFC PATCH v2 03/14] xsk: add umem fill queue support and
> > mmap
> >
> > On Thu, Apr 12, 2018 at 07:38:25AM +0000, Karlsson, Magnus wrote:
> > > I think you are definitely right in that there are ways in which we
> > > can improve performance here. That said, the current queue performs
> > > slightly better than the previous one we had that was more or less a
> > > copy of one of your first virtio 1.1 proposals from little over a
> > > year ago. It had bidirectional queues and a valid flag in the
> > > descriptor itself. The reason we abandoned this was not poor
> > > performance (it was good), but a need to go to unidirectional
> > > queues. Maybe I should have only changed that aspect and kept the valid
> flag.
> >
> > Is there a summary about unidirectional queues anywhere?  I'm curious
> > to know whether there are any lessons here to be learned for virtio or
> ptr_ring.
> 
> I did a quick hack in which I used your ptr_ring for the fill queue instead of
> our head/tail based one. In the corner cases (usually empty or usually full),
> there is basically no difference. But for the case when the queue is always
> half full, the ptr_ring implementation boosts the performance from 5.6 to 5.7
> Mpps (as there is no cache line bouncing in this case) on my system (slower
> than Björn's that was used for the numbers in the RFC).
> 
> So I think this should be implemented properly so we can get some real
> numbers.
> Especially since 0.1 Mpps with copies will likely become much more with
> zero-copy as we are really chasing cycles there. We will get back a better
> evaluation in a few days.
> 
> Thanks: Magnus
> 
> > --
> > MST

Hi Michael,

Sorry for the late reply. Been travelling. Björn and I have now
made a real implementation of the ptr_ring principles in the
af_xdp code. We just added a switch in bind (only for the purpose
of this test) to be able to pick what ring implementation to use
from the user space test program. The main difference between our
version of ptr_ring and our head/tail ring is that the ptr_ring
version uses the idx field to signal if the entry is available or
not (idx == 0 indicating empty descriptor) and that it does not
use the head and tail pointers at all. Even though it is not
a "ring of pointers" in our implementation, we will still call it
ptr_ring for the purpose of this mail.

In summary, our experiments show that the two rings perform the
same in our micro benchmarks when the queues are balanced and
rarely full or empty, but the head/tail version performs better
for RX when the queues are not perfectly balanced. Why is that?
We do not exactly know, but there are a number of differences
between a ptr_ring in the kernel and one between user and kernel
space for the use in af_xdp.

* The user space descriptors have to be validated as we are
  communicating between user space and kernel space. Done slightly
  differently for the two rings due to the batching below.

* The RX and TX ring have descriptors that are larger than one
  pointer, so need to have barriers here even with ptr_ring. We can
  not rely on address dependency because it is not a pointer.

* Batching performed slightly differently in both versions. We
  avoid touching head and tail for as long as possible. At the
  worst it is once per batch, but it might be much less than that
  on the consumer side. The drawback with the accesses to the
  head/tail pointers is that it usually ends up to be a cache
  line bounce. But with ptr_ring, the drawback is that it is
  always N writes (setting idx = 0) for a batch size of N. The
  good thing though, is that these will not incur any cache
  line bouncing if the rings are balanced (well, they will be
  read by the producer at some point, but only once per traversal
  of the ring).

Something to note is that we think that the head/tail version
provides an easier-to-use user space interface since the indexes start
from 0 instead of 1 as in the ptr_ring case. With ptr_ring you
have to teach the user space application writer not to use index
0. With the head/tail version no such restriction is needed.

Here are just some of the results for a workload where user space
is faster than kernel space. This is for the case in which the user
space program has no problem keeping up with the kernel.

head/tail 16-batch

  sock0@p3p2:16 rxdrop
                 pps        
rx              9,782,718   
tx              0           

  sock0@p3p2:16 l2fwd
                 pps        
rx              2,504,235   
tx              2,504,232   


ptr_ring 16-batch

  sock0@p3p2:16 rxdrop
                 pps        
rx              9,519,373   
tx              0           

  sock0@p3p2:16 l2fwd
                 pps        
rx              2,519,265   
tx              2,519,265   


ptr_ring with batch sizes calculated as in ptr_ring.h

  sock0@p3p2:16 rxdrop
                 pps        
rx              7,470,658   
tx              0           
^C

  sock0@p3p2:16 l2fwd
                 pps        
rx              2,431,701   
tx              2,431,701   

/Magnus

^ permalink raw reply

* [PATCH net-next 2/2 v1] netns: isolate seqnums to use per-netns locks
From: Christian Brauner @ 2018-04-23 10:24 UTC (permalink / raw)
  To: ebiederm, davem, netdev, linux-kernel
  Cc: avagin, ktkhai, serge, gregkh, Christian Brauner
In-Reply-To: <20180423102443.16627-1-christian.brauner@ubuntu.com>

Now that it's possible to have a different set of uevents in different
network namespaces, per-network namespace uevent sequence numbers are
introduced. This increases performance as locking is now restricted to the
network namespace affected by the uevent rather than locking everything.
Testing revealed significant performance improvements. For details see
"Testing" below.

Since commit 692ec06 ("netns: send uevent messages") network namespaces not
owned by the intial user namespace can be sent uevents from a sufficiently
privileged userspace process.
In order to send a uevent into a network namespace not owned by the initial
user namespace we currently still need to take the *global mutex* that
locks the uevent socket list even though the list *only contains network
namespaces owned by the initial user namespace*. This needs to be done
because the uevent counter is a global variable. Taking the global lock is
performance sensitive since a user on the host can spawn a pool of n
process that each create their own new user and network namespaces and then
go on to inject uevents in parallel into the network namespace of all of
these processes. This can have a significant performance impact for the
host's udevd since it means that there can be a lot of delay between a
device being added and the corresponding uevent being sent out and
available for processing by udevd. It also means that each network
namespace not owned by the initial user namespace which userspace has sent
a uevent to will need to wait until the lock becomes available.

Implementation:
This patch gives each network namespace its own uevent sequence number.
Each network namespace not owned by the initial user namespace receives its
own mutex. The struct uevent_sock is opaque to callers outside of kobject.c
so the mutex *can* and *is* only ever accessed in lib/kobject.c. In this
file it is clearly documented which lock has to be taken. All network
namespaces owned by the initial user namespace will still share the same
lock since they are all served sequentially via the uevent socket list.
This decouples the locking and ensures that the host retrieves uevents as
fast as possible even if there are a lot of uevents injected into network
namespaces not owned by the initial user namespace.  In addition, each
network namespace not owned by the initial user namespace does not have to
wait on any other network namespace not sharing the same user namespace.

Testing:
Two 4.17-rc1 test kernels were compiled. One with per netns uevent seqnums
with decoupled locking and one without. To ensure that testing made sense
both kernels carried the patch to remove network namespaces not owned by
the initial user namespace from the uevent socket list.
Three tests were constructed. All of them showed significant performance
improvements with per-netns uevent sequence numbers and decoupled locking.

 # Testcase 1:
   Only Injecting Uevents into network namespaces not owned by the initial
   user namespace.
   - created 1000 new user namespace + network namespace pairs
   - opened a uevent listener in each of those namespace pairs
   - injected uevents into each of those network namespaces 10,000 times
     meaning 10,000,000 (10 million) uevents were injected. (The high
     number of uevent injections should get rid of a lot of jitter.)
     The injection was done by fork()ing 1000 uevent injectors in a simple
     for-loop to ensure that uevents were injected in parallel.
   - mean transaction time was calculated:
     - *without* uevent sequence number namespacing: 67 μs
     - *with* uevent sequence number namespacing:    55 μs
     - makes a difference of:                        12 μs
   - a t-test was performed on the two data vectors which revealed
     shows significant performance improvements:
     Welch Two Sample t-test
     data:  x1 and y1
     t = 405.16, df = 18883000, p-value < 2.2e-16
     alternative hypothesis: true difference in means is not equal to 0
     95 percent confidence interval:
     12.14949 12.26761
     sample estimates:
     mean of x mean of y
     68.48594  56.27739

 # Testcase 2:
   Injecting Uevents into network namespaces not owned by the initial user
   namespace and network namespaces owned by the initial user namespace.
   - created 500 new user namespace + network namespace pairs
   - created 500 new network namespace pairs
   - opened a uevent listener in each of those namespace pairs
   - injected uevents into each of those network namespaces 10,000 times
     meaning 10,000,000 (10 million) uevents were injected. (The high
     number of uevent injections should get rid of a lot of jitter.)
     The injection was done by fork()ing 1000 uevent injectors in a simple
     for-loop to ensure that uevents were injected in parallel.
   - mean transaction time was calculated:
     - *without* uevent sequence number namespacing: 572 μs
     - *with* uevent sequence number namespacing:    514 μs
     - makes a difference of:                         58 μs
   - a t-test was performed on the two data vectors which revealed
     shows significant performance improvements:
     Welch Two Sample t-test
     data:  x2 and y2
     t = 38.685, df = 19682000, p-value < 2.2e-16
     alternative hypothesis: true difference in means is not equal to 0
     95 percent confidence interval:
     55.10630 60.98815
     sample estimates:
     mean of x mean of y
     572.9684  514.9211

 # Testcase 3:
   Created 500 new user namespace + network namespace pairs *without uevent
   listeners*
   - created 500 new network namespace pairs *without uevent listeners*
   - injected uevents into each of those network namespaces 10,000 times
     meaning 10,000,000 (10 million) uevents were injected. (The high number
     of uevent injections should get rid of a lot of jitter.)
     The injection was done by fork()ing 1000 uevent injectors in a simple
     for-loop to ensure that uevents were injected in parallel.
    - mean transaction time was calculated:
      - *without* uevent sequence number namespacing: 206 μs
      - *with* uevent sequence number namespacing:    163 μs
      - makes a difference of:                         43 μs
    - a t-test was performed on the two data vectors which revealed
      shows significant performance improvements:
      Welch Two Sample t-test
      data:  x3 and y3
      t = 58.37, df = 17711000, p-value < 2.2e-16
      alternative hypothesis: true difference in means is not equal to 0
      95 percent confidence interval:
      41.77860 44.68178
      sample estimates:
      mean of x mean of y
      207.2632  164.0330

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
Changelog v0->v1:
* add detailed test results to the commit message
* account for kernels compiled without CONFIG_NET
---
 include/linux/kobject.h     |   2 +
 include/net/net_namespace.h |   3 ++
 kernel/ksysfs.c             |  11 +++-
 lib/kobject_uevent.c        | 104 +++++++++++++++++++++++++++++-------
 net/core/net_namespace.c    |  14 +++++
 5 files changed, 114 insertions(+), 20 deletions(-)

diff --git a/include/linux/kobject.h b/include/linux/kobject.h
index 7f6f93c3df9c..4e608968907f 100644
--- a/include/linux/kobject.h
+++ b/include/linux/kobject.h
@@ -36,8 +36,10 @@
 extern char uevent_helper[];
 #endif
 
+#ifndef CONFIG_NET
 /* counter to tag the uevent, read only except for the kobject core */
 extern u64 uevent_seqnum;
+#endif
 
 /*
  * The actions here must match the index to the string array
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 47e35cce3b64..e4e171b1ba69 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -85,6 +85,8 @@ struct net {
 	struct sock		*genl_sock;
 
 	struct uevent_sock	*uevent_sock;		/* uevent socket */
+	/* counter to tag the uevent, read only except for the kobject core */
+	u64                     uevent_seqnum;
 
 	struct list_head 	dev_base_head;
 	struct hlist_head 	*dev_name_head;
@@ -189,6 +191,7 @@ extern struct list_head net_namespace_list;
 
 struct net *get_net_ns_by_pid(pid_t pid);
 struct net *get_net_ns_by_fd(int fd);
+u64 get_ns_uevent_seqnum_by_vpid(void);
 
 #ifdef CONFIG_SYSCTL
 void ipx_register_sysctl(void);
diff --git a/kernel/ksysfs.c b/kernel/ksysfs.c
index 46ba853656f6..075ec814381b 100644
--- a/kernel/ksysfs.c
+++ b/kernel/ksysfs.c
@@ -19,6 +19,7 @@
 #include <linux/sched.h>
 #include <linux/capability.h>
 #include <linux/compiler.h>
+#include <net/net_namespace.h>
 
 #include <linux/rcupdate.h>	/* rcu_expedited and rcu_normal */
 
@@ -33,7 +34,15 @@ static struct kobj_attribute _name##_attr = \
 static ssize_t uevent_seqnum_show(struct kobject *kobj,
 				  struct kobj_attribute *attr, char *buf)
 {
-	return sprintf(buf, "%llu\n", (unsigned long long)uevent_seqnum);
+	u64 seqnum;
+
+	#ifdef CONFIG_NET
+		seqnum = get_ns_uevent_seqnum_by_vpid();
+	#else
+		seqnum = uevent_seqnum;
+	#endif
+
+	return sprintf(buf, "%llu\n", (unsigned long long)seqnum);
 }
 KERNEL_ATTR_RO(uevent_seqnum);
 
diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
index f5f5038787ac..5da20def556d 100644
--- a/lib/kobject_uevent.c
+++ b/lib/kobject_uevent.c
@@ -29,21 +29,42 @@
 #include <net/net_namespace.h>
 
 
+#ifndef CONFIG_NET
 u64 uevent_seqnum;
+#endif
+
 #ifdef CONFIG_UEVENT_HELPER
 char uevent_helper[UEVENT_HELPER_PATH_LEN] = CONFIG_UEVENT_HELPER_PATH;
 #endif
 
+/*
+ * Size a buffer needs to be in order to hold the largest possible sequence
+ * number stored in a u64 including \0 byte: 2^64 - 1 = 21 chars.
+ */
+#define SEQNUM_BUFSIZE (sizeof("SEQNUM=") + 21)
 struct uevent_sock {
 	struct list_head list;
 	struct sock *sk;
+	/*
+	 * This mutex protects uevent sockets and the uevent counter of
+	 * network namespaces *not* owned by init_user_ns.
+	 * For network namespaces owned by init_user_ns this lock is *not*
+	 * valid instead the global uevent_sock_mutex must be used!
+	 */
+	struct mutex sk_mutex;
 };
 
 #ifdef CONFIG_NET
 static LIST_HEAD(uevent_sock_list);
 #endif
 
-/* This lock protects uevent_seqnum and uevent_sock_list */
+/*
+ * This mutex protects uevent sockets and the uevent counter of network
+ * namespaces owned by init_user_ns.
+ * For network namespaces not owned by init_user_ns this lock is *not*
+ * valid instead the network namespace specific sk_mutex in struct
+ * uevent_sock must be used!
+ */
 static DEFINE_MUTEX(uevent_sock_mutex);
 
 /* the strings here must match the enum in include/linux/kobject.h */
@@ -253,6 +274,22 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
 
 	return 0;
 }
+
+static bool can_hold_seqnum(const struct kobj_uevent_env *env, size_t len)
+{
+	if (env->envp_idx >= ARRAY_SIZE(env->envp)) {
+		WARN(1, KERN_ERR "Failed to append sequence number. "
+		     "Too many uevent variables\n");
+		return false;
+	}
+
+	if ((env->buflen + len) > UEVENT_BUFFER_SIZE) {
+		WARN(1, KERN_ERR "Insufficient space to append sequence number\n");
+		return false;
+	}
+
+	return true;
+}
 #endif
 
 #ifdef CONFIG_UEVENT_HELPER
@@ -308,18 +345,22 @@ static int kobject_uevent_net_broadcast(struct kobject *kobj,
 
 	/* send netlink message */
 	list_for_each_entry(ue_sk, &uevent_sock_list, list) {
+		/* bump sequence number */
+		u64 seqnum = ++sock_net(ue_sk->sk)->uevent_seqnum;
 		struct sock *uevent_sock = ue_sk->sk;
+		char buf[SEQNUM_BUFSIZE];
 
 		if (!netlink_has_listeners(uevent_sock, 1))
 			continue;
 
 		if (!skb) {
-			/* allocate message with the maximum possible size */
+			/* calculate header length */
 			size_t len = strlen(action_string) + strlen(devpath) + 2;
 			char *scratch;
 
+			/* allocate message with the maximum possible size */
 			retval = -ENOMEM;
-			skb = alloc_skb(len + env->buflen, GFP_KERNEL);
+			skb = alloc_skb(len + env->buflen + SEQNUM_BUFSIZE, GFP_KERNEL);
 			if (!skb)
 				continue;
 
@@ -327,11 +368,24 @@ static int kobject_uevent_net_broadcast(struct kobject *kobj,
 			scratch = skb_put(skb, len);
 			sprintf(scratch, "%s@%s", action_string, devpath);
 
+			/* add env */
 			skb_put_data(skb, env->buf, env->buflen);
 
 			NETLINK_CB(skb).dst_group = 1;
 		}
 
+		/* prepare netns seqnum */
+		retval = snprintf(buf, SEQNUM_BUFSIZE, "SEQNUM=%llu", seqnum);
+		if (retval < 0 || retval >= SEQNUM_BUFSIZE)
+			continue;
+		retval++;
+
+		if (!can_hold_seqnum(env, retval))
+			continue;
+
+		/* append netns seqnum */
+		skb_put_data(skb, buf, retval);
+
 		retval = netlink_broadcast_filtered(uevent_sock, skb_get(skb),
 						    0, 1, GFP_KERNEL,
 						    kobj_bcast_filter,
@@ -339,8 +393,13 @@ static int kobject_uevent_net_broadcast(struct kobject *kobj,
 		/* ENOBUFS should be handled in userspace */
 		if (retval == -ENOBUFS || retval == -ESRCH)
 			retval = 0;
+
+		/* remove netns seqnum */
+		skb_trim(skb, env->buflen);
 	}
 	consume_skb(skb);
+#else
+	uevent_seqnum++;
 #endif
 	return retval;
 }
@@ -510,14 +569,7 @@ int kobject_uevent_env(struct kobject *kobj, enum kobject_action action,
 	}
 
 	mutex_lock(&uevent_sock_mutex);
-	/* we will send an event, so request a new sequence number */
-	retval = add_uevent_var(env, "SEQNUM=%llu", (unsigned long long)++uevent_seqnum);
-	if (retval) {
-		mutex_unlock(&uevent_sock_mutex);
-		goto exit;
-	}
-	retval = kobject_uevent_net_broadcast(kobj, env, action_string,
-					      devpath);
+	retval = kobject_uevent_net_broadcast(kobj, env, action_string, devpath);
 	mutex_unlock(&uevent_sock_mutex);
 
 #ifdef CONFIG_UEVENT_HELPER
@@ -605,17 +657,18 @@ int add_uevent_var(struct kobj_uevent_env *env, const char *format, ...)
 EXPORT_SYMBOL_GPL(add_uevent_var);
 
 #if defined(CONFIG_NET)
-static int uevent_net_broadcast(struct sock *usk, struct sk_buff *skb,
+static int uevent_net_broadcast(struct uevent_sock *ue_sk, struct sk_buff *skb,
 				struct netlink_ext_ack *extack)
 {
-	/* u64 to chars: 2^64 - 1 = 21 chars */
-	char buf[sizeof("SEQNUM=") + 21];
+	struct sock *usk = ue_sk->sk;
+	char buf[SEQNUM_BUFSIZE];
 	struct sk_buff *skbc;
 	int ret;
 
 	/* bump and prepare sequence number */
-	ret = snprintf(buf, sizeof(buf), "SEQNUM=%llu", ++uevent_seqnum);
-	if (ret < 0 || (size_t)ret >= sizeof(buf))
+	ret = snprintf(buf, SEQNUM_BUFSIZE, "SEQNUM=%llu",
+		       ++sock_net(ue_sk->sk)->uevent_seqnum);
+	if (ret < 0 || ret >= SEQNUM_BUFSIZE)
 		return -ENOMEM;
 	ret++;
 
@@ -668,9 +721,15 @@ static int uevent_net_rcv_skb(struct sk_buff *skb, struct nlmsghdr *nlh,
 		return -EPERM;
 	}
 
-	mutex_lock(&uevent_sock_mutex);
-	ret = uevent_net_broadcast(net->uevent_sock->sk, skb, extack);
-	mutex_unlock(&uevent_sock_mutex);
+	if (net->user_ns == &init_user_ns)
+		mutex_lock(&uevent_sock_mutex);
+	else
+		mutex_lock(&net->uevent_sock->sk_mutex);
+	ret = uevent_net_broadcast(net->uevent_sock, skb, extack);
+	if (net->user_ns == &init_user_ns)
+		mutex_unlock(&uevent_sock_mutex);
+	else
+		mutex_unlock(&net->uevent_sock->sk_mutex);
 
 	return ret;
 }
@@ -708,6 +767,13 @@ static int uevent_net_init(struct net *net)
 		mutex_lock(&uevent_sock_mutex);
 		list_add_tail(&ue_sk->list, &uevent_sock_list);
 		mutex_unlock(&uevent_sock_mutex);
+	} else {
+		/*
+		 * Uevent sockets and counters for network namespaces
+		 * not owned by the initial user namespace have their
+		 * own mutex.
+		 */
+		mutex_init(&ue_sk->sk_mutex);
 	}
 
 	return 0;
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index a11e03f920d3..8894638f5150 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -618,6 +618,20 @@ struct net *get_net_ns_by_pid(pid_t pid)
 }
 EXPORT_SYMBOL_GPL(get_net_ns_by_pid);
 
+u64 get_ns_uevent_seqnum_by_vpid(void)
+{
+	pid_t cur_pid;
+	struct net *net;
+
+	cur_pid = task_pid_vnr(current);
+	net = get_net_ns_by_pid(cur_pid);
+	if (IS_ERR(net))
+		return 0;
+
+	return net->uevent_seqnum;
+}
+EXPORT_SYMBOL_GPL(get_ns_uevent_seqnum_by_vpid);
+
 static __net_init int net_ns_net_init(struct net *net)
 {
 #ifdef CONFIG_NET_NS
-- 
2.17.0

^ permalink raw reply related

* [PATCH net-next 1/2 v1] netns: restrict uevents
From: Christian Brauner @ 2018-04-23 10:24 UTC (permalink / raw)
  To: ebiederm, davem, netdev, linux-kernel
  Cc: avagin, ktkhai, serge, gregkh, Christian Brauner
In-Reply-To: <20180423102443.16627-1-christian.brauner@ubuntu.com>

commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")

enabled sending hotplug events into all network namespaces back in 2010.
Over time the set of uevents that get sent into all network namespaces has
shrunk a little. We have now reached the point where hotplug events for all
devices that carry a namespace tag are filtered according to that
namespace. Specifically, they are filtered whenever the namespace tag of
the kobject does not match the namespace tag of the netlink socket. One
example are network devices. Uevents for network devices only show up in
the network namespaces these devices are moved to or created in.

However, any uevent for a kobject that does not have a namespace tag
associated with it will not be filtered and we will broadcast it into all
network namespaces. This behavior stopped making sense when user namespaces
were introduced.

This patch restricts uevents to the initial user namespace for a couple of
reasons that have been extensively discusses on the mailing list [1].
- Thundering herd:
  Broadcasting uevents into all network namespaces introduces significant
  overhead.
  All processes that listen to uevents running in non-initial user
  namespaces will end up responding to uevents that will be meaningless to
  them. Mainly, because non-initial user namespaces cannot easily manage
  devices unless they have a privileged host-process helping them out. This
  means that there will be a thundering herd of activity when there
  shouldn't be any.
- Uevents from non-root users are already filtered in userspace:
  Uevents are filtered by userspace in a user namespace because the
  received uid != 0. Instead the uid associated with the event will be
  65534 == "nobody" because the global root uid is not mapped.
  This means we can safely and without introducing regressions modify the
  kernel to not send uevents into all network namespaces whose owning user
  namespace is not the initial user namespace because we know that
  userspace will ignore the message because of the uid anyway. I have
  a) verified that is is true for every udev implementation out there b)
  that this behavior has been present in all udev implementations from the
  very beginning.
- Removing needless overhead/Increasing performance:
  Currently, the uevent socket for each network namespace is added to the
  global variable uevent_sock_list. The list itself needs to be protected
  by a mutex. So everytime a uevent is generated the mutex is taken on the
  list. The mutex is held *from the creation of the uevent (memory
  allocation, string creation etc. until all uevent sockets have been
  handled*. This is aggravated by the fact that for each uevent socket that
  has listeners the mc_list must be walked as well which means we're
  talking O(n^2) here. Given that a standard Linux workload usually has
  quite a lot of network namespaces and - in the face of containers - a lot
  of user namespaces this quickly becomes a performance problem (see
  "Thundering herd" above). By just recording uevent sockets of network
  namespaces that are owned by the initial user namespace we significantly
  increase performance in this codepath.
- Injecting uevents:
  There's a valid argument that containers might be interested in receiving
  device events especially if they are delegated to them by a privileged
  userspace process. One prime example are SR-IOV enabled devices that are
  explicitly designed to be handed of to other users such as VMs or
  containers.
  This use-case can now be correctly handled since
  commit 692ec06d7c92 ("netns: send uevent messages"). This commit
  introduced the ability to send uevents from userspace. As such we can let
  a sufficiently privileged (CAP_SYS_ADMIN in the owning user namespace of
  the network namespace of the netlink socket) userspace process make a
  decision what uevents should be sent. This removes the need to blindly
  broadcast uevents into all user namespaces and provides a performant and
  safe solution to this problem.
- Filtering logic:
  This patch filters by *owning user namespace of the network namespace a
  given task resides in* and not by user namespace of the task per se. This
  means if the user namespace of a given task is unshared but the network
  namespace is kept and is owned by the initial user namespace a listener
  that is opening the uevent socket in that network namespace can still
  listen to uevents.

[1]: https://lkml.org/lkml/2018/4/4/739
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
Changelog v0->v1:
* patch unchanged
---
 lib/kobject_uevent.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
index 15ea216a67ce..f5f5038787ac 100644
--- a/lib/kobject_uevent.c
+++ b/lib/kobject_uevent.c
@@ -703,9 +703,13 @@ static int uevent_net_init(struct net *net)
 
 	net->uevent_sock = ue_sk;
 
-	mutex_lock(&uevent_sock_mutex);
-	list_add_tail(&ue_sk->list, &uevent_sock_list);
-	mutex_unlock(&uevent_sock_mutex);
+	/* Restrict uevents to initial user namespace. */
+	if (sock_net(ue_sk->sk)->user_ns == &init_user_ns) {
+		mutex_lock(&uevent_sock_mutex);
+		list_add_tail(&ue_sk->list, &uevent_sock_list);
+		mutex_unlock(&uevent_sock_mutex);
+	}
+
 	return 0;
 }
 
@@ -713,9 +717,11 @@ static void uevent_net_exit(struct net *net)
 {
 	struct uevent_sock *ue_sk = net->uevent_sock;
 
-	mutex_lock(&uevent_sock_mutex);
-	list_del(&ue_sk->list);
-	mutex_unlock(&uevent_sock_mutex);
+	if (sock_net(ue_sk->sk)->user_ns == &init_user_ns) {
+		mutex_lock(&uevent_sock_mutex);
+		list_del(&ue_sk->list);
+		mutex_unlock(&uevent_sock_mutex);
+	}
 
 	netlink_kernel_release(ue_sk->sk);
 	kfree(ue_sk);
-- 
2.17.0

^ permalink raw reply related

* [PATCH net-next 0/2 v1] netns: uevent performance tweaks
From: Christian Brauner @ 2018-04-23 10:24 UTC (permalink / raw)
  To: ebiederm, davem, netdev, linux-kernel
  Cc: avagin, ktkhai, serge, gregkh, Christian Brauner

Hey everyone,

This is v1 of "netns: uevent performance tweaks". Like Eric requested, I
did extensive testing that prove significant performance improvements
when using per-netns uevent sequence numbers with decoupled locks. The
results and test descriptions were added to the commit message of
[PATCH 2/2 v1] netns: isolate seqnums to use per-netns locks.

This series deals with a bunch of performance improvements when sending
out uevents that have been extensively discussed here:
https://lkml.org/lkml/2018/4/10/592

- Only record uevent sockets from network namespaces owned by the
  initial user namespace in the global uevent socket list.
  Eric, this is the exact patch we agreed upon in
  https://lkml.org/lkml/2018/4/10/592.
  A very detailed rationale is present in the commit message for
  [PATCH 1/2] netns: restrict uevents
- Decouple the locking for network namespaces in the global uevent
  socket list from the locking for network namespaces not in the global
  uevent socket list.
  A very detailed rationale including performance test results is
  present in the commit message for
  [PATCH 2/2] netns: isolate seqnums to use per-netns locks

Thanks!
Christian

Christian Brauner (2):
  netns: restrict uevents
  netns: isolate seqnums to use per-netns locks

 include/linux/kobject.h     |   2 +
 include/net/net_namespace.h |   3 +
 kernel/ksysfs.c             |  11 +++-
 lib/kobject_uevent.c        | 122 ++++++++++++++++++++++++++++--------
 net/core/net_namespace.c    |  14 +++++
 5 files changed, 126 insertions(+), 26 deletions(-)

-- 
2.17.0

^ permalink raw reply

* Re: [PATCH net-next 2/2] netns: isolate seqnums to use per-netns locks
From: Christian Brauner @ 2018-04-23 10:12 UTC (permalink / raw)
  To: kbuild-all
  Cc: ebiederm, davem, netdev, linux-kernel, avagin, ktkhai, serge,
	gregkh
In-Reply-To: <201804231003.izXFVWTI%fengguang.wu@intel.com>

On Mon, Apr 23, 2018 at 10:39:50AM +0800, kbuild test robot wrote:
> Hi Christian,
> 
> Thank you for the patch! Yet something to improve:
> 
> [auto build test ERROR on net-next/master]
> 
> url:    https://github.com/0day-ci/linux/commits/Christian-Brauner/netns-uevent-performance-tweaks/20180420-013717
> config: alpha-alldefconfig (attached as .config)
> compiler: alpha-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
> reproduce:
>         wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>         chmod +x ~/bin/make.cross
>         # save the attached .config to linux build tree
>         make.cross ARCH=alpha 
> 
> All errors (new ones prefixed by >>):
> 
>    kernel/ksysfs.o: In function `uevent_seqnum_show':
> >> (.text+0x18c): undefined reference to `get_ns_uevent_seqnum_by_vpid'
>    (.text+0x19c): undefined reference to `get_ns_uevent_seqnum_by_vpid'

Right, this happens when CONFIG_NET=n. I am about to send out the second
version of the patch that includes all of the test results in the commit
message and also accounts for kernels compiled without CONFIG_NET.

Thanks!
Christian

> 
> ---
> 0-DAY kernel test infrastructure                Open Source Technology Center
> https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

^ permalink raw reply

* Re: How to detect libbfd when building tools/bpf?
From: Daniel Borkmann @ 2018-04-23 10:05 UTC (permalink / raw)
  To: Wang Sheng-Hui, ast, kstewart, gregkh, tglx, pombredanne
  Cc: netdev, acme, jakub.kicinski, quentin.monnet
In-Reply-To: <tencent_2F07B2EF6347A6E6521E4B9DECEEDBF88E05@qq.com>

On 04/22/2018 09:51 AM, Wang Sheng-Hui wrote:
> Sorry to trouble you !
> 
> I run debian and installed binutils-dev beforehand.
> Then I copied tools/build/feature/test-libbfd.c to t.c and run:
> ------------------------------------------------------------------------
> root@lazyfintech:~# cat t.c 
> #include <bfd.h>
> 
> extern int printf(const char *format, ...);
> 
> int main(void)
> {
> 	char symbol[4096] = "FieldName__9ClassNameFd";
> 	char *tmp;
> 
> 	tmp = bfd_demangle(0, symbol, 0);
> 
> 	printf("demangled symbol: {%s}\n", tmp);
> 
> 	return 0;
> }
> root@lazyfintech:~# gcc t.c -lbfd

Can you run with 'gcc t.c -lbfd -lz -liberty -ldl' since this is
what the Makefile for the feature test is doing:

  $(OUTPUT)test-libbfd.bin:
        $(BUILD) -DPACKAGE='"perf"' -lbfd -lz -liberty -ldl

Seems most likely that you're missing some of the remaining libs
to have the feature test properly detect it?

> root@lazyfintech:~# ./a.out 
> demangled symbol: {ClassName::FieldName}
> 
> 
> I thought libbfd can be reached from above. 
> But when I built tools/bpf, libbfd cannot be detected: 
> -------------------------------------------------------------------------
> /linux# make O=../buildkernel/ tools/bpf
> make[1]: Entering directory '/root/buildkernel'
>   DESCEND  bpf
> 
> Auto-detecting system features:
> ...                        libbfd: [ OFF ]
> ...        disassembler-four-args: [ on  ]
> 
>   DESCEND  bpftool
> 
> Auto-detecting system features:
> ...                        libbfd: [ OFF ]
> ...        disassembler-four-args: [ on  ]
> 
> make[1]: Leaving directory '/root/buildkernel'
> 
> 
> I wonder how should I do to get libbfd auto detected when  building bpf tool?
> 
> Regards,
> shenghui

^ permalink raw reply

* Re: [PATCH 2/3] xen netback: add fault injection facility
From: Wei Liu @ 2018-04-23  9:58 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: jakub.kicinski, hpa, mcroce, tglx, ggarcia, daniel, x86, mingo,
	xen-devel, axboe, konrad.wilk, amir.jer.levy, paul.durrant,
	stefanha, dsa, boris.ostrovsky, jgross, linux-block, wei.liu2,
	netdev, linux-kernel, davem, dwmw, roger.pau
In-Reply-To: <20180420104731.17823.97617.stgit@dev-dsk-staskins-1a-ca5afbf2.eu-west-1.amazon.com>

On Fri, Apr 20, 2018 at 10:47:31AM +0000, Stanislav Kinsburskii wrote:
>  
>  #include <linux/kthread.h>
>  #include <linux/if_vlan.h>
> @@ -1649,6 +1650,7 @@ static int __init netback_init(void)
>  			PTR_ERR(xen_netback_dbg_root));
>  #endif /* CONFIG_DEBUG_FS */
>  
> +	(void) xen_netbk_fi_init();

If you care about the return value, please propagate it to
netback_init's caller. Otherwise you can just make the function return
void.

> +
> +int xenvif_fi_init(struct xenvif *vif)
> +{
> +	struct dentry *parent;
> +	struct xenvif_fi *vfi;
> +	int fi, err = -ENOMEM;
> +
> +	parent = vif_fi_dir;
> +	if (!parent)
> +		return -ENOMEM;
> +
> +	vfi = kmalloc(sizeof(*vfi), GFP_KERNEL);
> +	if (!vfi)
> +		return -ENOMEM;
> +
> +	vfi->dir = debugfs_create_dir(vif->dev->name, parent);
> +	if (!vfi->dir)
> +		goto err_dir;
> +
> +	for (fi = 0; fi < XENVIF_FI_MAX; fi++) {
> +		vfi->faults[fi] = xen_fi_dir_add(vfi->dir,
> +				xenvif_fi_names[fi]);
> +		if (!vfi->faults[fi])
> +			goto err_fault;
> +	}
> +
> +	vif->fi_info = vfi;
> +	return 0;
> +
> +err_fault:
> +	for (; fi > 0; fi--)

fi >= 0

Wei.

^ permalink raw reply

* Re: bpf: samples/bpf/sockex2: Assertion `setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, prog_fd, sizeof(prog_fd[0])) == 0' failed.
From: Daniel Borkmann @ 2018-04-23  9:53 UTC (permalink / raw)
  To: Wang Sheng-Hui, ast, netdev
In-Reply-To: <tencent_8DE3019EA8BEBE9DCA1A4C791875FCBB7E07@qq.com>

On 04/23/2018 04:56 AM, Wang Sheng-Hui wrote:
> Sorry to trouble you!
> 
> I run samples/bpf/sockex2 failed with 
> "Assertion `setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, prog_fd, sizeof(prog_fd[0])) == 0' failed."
> 
> Then I run " strace ./sockex2" and got:
> ...
> bind(3, {sa_family=AF_PACKET, sll_protocol=htons(ETH_P_ALL), sll_ifindex=if_nametoindex("lo"), sll_hatype=ARPHRD_NETROM, sll_pkttype=PACKET_HOST, sll_halen=0}, 20) = 0
> setsockopt(3, SOL_SOCKET, SO_ATTACH_BPF, [0], 4) = -1 EINVAL (Invalid argument)
> write(2, "sockex2: /root/linux/samples/bpf"..., 156sockex2: /root/linux/samples/bpf/sockex2_user.c:35: main: Assertion `setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, prog_fd, sizeof(prog_fd[0])) == 0' failed.
> ) = 156
> mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb8ec4bf000
> rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0
> rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [], 8) = 0
> getpid()                                = 3513
> gettid()                                = 3513
> tgkill(3513, 3513, SIGABRT)             = 0
> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> --- SIGABRT {si_signo=SIGABRT, si_code=SI_TKILL, si_pid=3513, si_uid=0} ---
> +++ killed by SIGABRT +++
> Aborted

[...]
bpf(BPF_MAP_CREATE, {map_type=BPF_MAP_TYPE_HASH, key_size=4, value_size=16, max_entries=1024}, 72) = 4
bpf(BPF_PROG_LOAD, {prog_type=BPF_PROG_TYPE_SOCKET_FILTER, insn_cnt=201, insns=0x2676cd0, license="GPL", log_level=0, log_size=0, log_buf=0, kern_version=0}, 72) = 5
close(3)                                = 0
socket(AF_PACKET, SOCK_RAW|SOCK_CLOEXEC|SOCK_NONBLOCK, 768) = 3
access("/proc/net", R_OK)               = 0
access("/proc/net/unix", R_OK)          = 0
socket(AF_UNIX, SOCK_DGRAM|SOCK_CLOEXEC, 0) = 6
ioctl(6, SIOCGIFINDEX, {ifr_name="lo", }) = 0
close(6)                                = 0
bind(3, {sa_family=AF_PACKET, sll_protocol=htons(ETH_P_ALL), sll_ifindex=if_nametoindex("lo"), sll_hatype=ARPHRD_NETROM, sll_pkttype=PACKET_HOST, sll_halen=0}, 20) = 0
setsockopt(3, SOL_SOCKET, SO_ATTACH_BPF, [5], 4) = 0
pipe2([6, 7], O_CLOEXEC)                = 0
[...]

Works fine for me. The EINVAL in your case comes from the 'setsockopt(3, SOL_SOCKET,
SO_ATTACH_BPF, [0], 4)'. So you send fd 0 to SO_ATTACH_BPF, which is not a valid BPF
fd and therefore bails out here. You might want to debug bpf_load.c why it's failing
to load the program in your case, or check strace a bit further above where you do
the map and prog creation (as I copied in my case).

Cheers,
Daniel

^ permalink raw reply

* Re: [lkp-robot] [bpf] 4f7b25baf0: kernel_selftests.bpf.test_maps.fail
From: Daniel Borkmann @ 2018-04-23  9:41 UTC (permalink / raw)
  To: kernel test robot, John Fastabend; +Cc: ast, netdev, davem, lkp
In-Reply-To: <20180423063250.GO5563@yexl-desktop>

On 04/23/2018 08:32 AM, kernel test robot wrote:
> 
> FYI, we noticed the following commit (built with gcc-7):
> 
> commit: 4f7b25baf0cefa88aa267e8ff60150c4ac1479ef ("bpf: sockmap, refactor sockmap routines to work with hashmap")
> url: https://github.com/0day-ci/linux/commits/John-Fastabend/bpf-sockmap-free-memory-on-sock-close-with-cork-data/20180401-230416
> base: https://git.kernel.org/cgit/linux/kernel/git/bpf/bpf-next.git master

(fyi, refers to https://patchwork.ozlabs.org/patch/894017/ which is not in tree.)

> in testcase: kernel_selftests
> with following parameters:
> 
> 	group: kselftests-00
> 
> test-description: The kernel contains a set of "self tests" under the tools/testing/selftests/ directory. These are intended to be small unit tests to exercise individual code paths in the kernel.
> test-url: https://www.kernel.org/doc/Documentation/kselftest.txt
> 
> 
> on test machine: qemu-system-x86_64 -enable-kvm -cpu host -smp 4 -m 5G
> 
> caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
> 
> 
> selftests: test_maps
> ========================================
> Failed allowed duplicate programs in update ANY sockmap 0 '4:8'
> not ok 1..3 selftests:  test_maps [FAIL]
> 
> 
> 
> To reproduce:
> 
>         git clone https://github.com/intel/lkp-tests.git
>         cd lkp-tests
>         bin/lkp qemu -k <bzImage> job-script # job-script is attached in this email
> 
> 
> 
> Thanks,
> Xiaolong
> 

^ permalink raw reply

* Re: [PATCH bpf-next] bpf: btf: Clean up btf.h in uapi
From: Daniel Borkmann @ 2018-04-23  9:34 UTC (permalink / raw)
  To: Martin KaFai Lau, netdev; +Cc: Alexei Starovoitov, kernel-team
In-Reply-To: <20180421164823.1059397-1-kafai@fb.com>

On 04/21/2018 06:48 PM, Martin KaFai Lau wrote:
> This patch cleans up btf.h in uapi:
> 1) Rename "name" to "name_off" to better reflect it is an offset to the
>    string section instead of a char array.
> 2) Remove unused value BTF_FLAGS_COMPR and BTF_MAGIC_SWAP
> 
> Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>

Applied to bpf-next, thanks Martin!

^ permalink raw reply

* Re: [PATCH bpf-next] bpf: fix virtio-net's length calc for XDP_PASS
From: Daniel Borkmann @ 2018-04-23  9:30 UTC (permalink / raw)
  To: Jason Wang, Nikita V. Shirokov, Alexei Starovoitov,
	Michael S. Tsirkin
  Cc: netdev, David Ahern
In-Reply-To: <eced76a9-5480-8180-7e5a-84a62d0fdc8b@redhat.com>

On 04/23/2018 08:33 AM, Jason Wang wrote:
> On 2018年04月23日 12:16, Nikita V. Shirokov wrote:
>> In commit 6870de435b90 ("bpf: make virtio compatible w/
>> bpf_xdp_adjust_tail") i didn't account for vi->hdr_len during new
>> packet's length calculation after bpf_prog_run in receive_mergeable.
>> because of this all packets, if they were passed to the kernel,
>> were truncated by 12 bytes.
>>
>> Fixes:6870de435b90 ("bpf: make virtio compatible w/ bpf_xdp_adjust_tail")
>> Reported-by: David Ahern <dsahern@gmail.com>
>>
>> Signed-off-by: Nikita V. Shirokov <tehnerd@tehnerd.com>
[...]
> 
> Acked-by: Jason Wang <jasowang@redhat.com>

Applied to bpf-next, thanks everyone!

^ permalink raw reply

* Re: [PATCH/RFC 7/8] ARM: shmobile: Remove the ARCH_SHMOBILE Kconfig symbol
From: Simon Horman @ 2018-04-23  9:29 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: alsa-devel, Kuninori Morimoto, Catalin Marinas, Will Deacon,
	Liam Girdwood, Laurent Pinchart, devel, Mauro Carvalho Chehab,
	Vinod Koul, Magnus Damm, Russell King, linux-media, Arnd Bergmann,
	Mark Brown, Dan Williams, Jaroslav Kysela, linux-arm-kernel,
	Sergei Shtylyov, Greg Kroah-Hartman, Takashi Iwai, linux-kernel,
	linux-renesas-soc, netdev, dmaengine
In-Reply-To: <1524230914-10175-8-git-send-email-geert+renesas@glider.be>

On Fri, Apr 20, 2018 at 03:28:33PM +0200, Geert Uytterhoeven wrote:
> All drivers for Renesas ARM SoCs have gained proper ARCH_RENESAS
> platform dependencies.  Hence finish the conversion from ARCH_SHMOBILE
> to ARCH_RENESAS for Renesas 32-bit ARM SoCs, as started by commit
> 9b5ba0df4ea4f940 ("ARM: shmobile: Introduce ARCH_RENESAS").
> 
> Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
> ---
> This depends on the previous patches in this series, hence the RFC.

Thanks, I have marked this and the following patch as deferred.
Please repost or otherwise ping me once the dependencies are in place.

^ permalink raw reply

* Re: [RFC v2] virtio: support packed ring
From: Tiwei Bie @ 2018-04-23  9:29 UTC (permalink / raw)
  To: Jason Wang; +Cc: mst, wexu, virtualization, linux-kernel, netdev, jfreimann
In-Reply-To: <515e635b-bc80-9b8d-72f9-b390ae5103ec@redhat.com>

On Mon, Apr 23, 2018 at 01:42:14PM +0800, Jason Wang wrote:
> On 2018年04月01日 22:12, Tiwei Bie wrote:
> > Hello everyone,
> > 
> > This RFC implements packed ring support for virtio driver.
> > 
> > The code was tested with DPDK vhost (testpmd/vhost-PMD) implemented
> > by Jens at http://dpdk.org/ml/archives/dev/2018-January/089417.html
> > Minor changes are needed for the vhost code, e.g. to kick the guest.
> > 
> > TODO:
> > - Refinements and bug fixes;
> > - Split into small patches;
> > - Test indirect descriptor support;
> > - Test/fix event suppression support;
> > - Test devices other than net;
> > 
> > RFC v1 -> RFC v2:
> > - Add indirect descriptor support - compile test only;
> > - Add event suppression supprt - compile test only;
> > - Move vring_packed_init() out of uapi (Jason, MST);
> > - Merge two loops into one in virtqueue_add_packed() (Jason);
> > - Split vring_unmap_one() for packed ring and split ring (Jason);
> > - Avoid using '%' operator (Jason);
> > - Rename free_head -> next_avail_idx (Jason);
> > - Add comments for virtio_wmb() in virtqueue_add_packed() (Jason);
> > - Some other refinements and bug fixes;
> > 
> > Thanks!
> > 
> > Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
> > ---
> >   drivers/virtio/virtio_ring.c       | 1094 +++++++++++++++++++++++++++++-------
> >   include/linux/virtio_ring.h        |    8 +-
> >   include/uapi/linux/virtio_config.h |   12 +-
> >   include/uapi/linux/virtio_ring.h   |   61 ++
> >   4 files changed, 980 insertions(+), 195 deletions(-)
> > 
> > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > index 71458f493cf8..0515dca34d77 100644
> > --- a/drivers/virtio/virtio_ring.c
> > +++ b/drivers/virtio/virtio_ring.c
> > @@ -58,14 +58,15 @@
> 
> [...]
> 
> > +
> > +	if (vq->indirect) {
> > +		u32 len;
> > +
> > +		desc = vq->desc_state[head].indir_desc;
> > +		/* Free the indirect table, if any, now that it's unmapped. */
> > +		if (!desc)
> > +			goto out;
> > +
> > +		len = virtio32_to_cpu(vq->vq.vdev,
> > +				      vq->vring_packed.desc[head].len);
> > +
> > +		BUG_ON(!(vq->vring_packed.desc[head].flags &
> > +			 cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_INDIRECT)));
> 
> It looks to me spec does not force to keep VRING_DESC_F_INDIRECT here. So we
> can safely remove this BUG_ON() here.
> 
> > +		BUG_ON(len == 0 || len % sizeof(struct vring_packed_desc));
> 
> Len could be ignored for used descriptor according to the spec, so we need
> remove this BUG_ON() too.

Yeah, you're right! The BUG_ON() isn't right. I'll remove it.
And I think something related to this in the spec isn't very
clear currently.

In the spec, there are below words:

https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L272
"""
In descriptors with VIRTQ_DESC_F_INDIRECT set VIRTQ_DESC_F_WRITE
is reserved and is ignored by the device.
"""

So when device writes back an used descriptor in this case,
device may not set the VIRTQ_DESC_F_WRITE flag as the flag
is reserved and should be ignored.

https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L170
"""
Element Length is reserved for used descriptors without the
VIRTQ_DESC_F_WRITE flag, and is ignored by drivers.
"""

And this is the way how driver ignores the `len` in an used
descriptor.

https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L241
"""
To increase ring capacity the driver can store a (read-only
by the device) table of indirect descriptors anywhere in memory,
and insert a descriptor in the main virtqueue (with \field{Flags}
bit VIRTQ_DESC_F_INDIRECT on) that refers to a buffer element
containing this indirect descriptor table;
"""

So the indirect descriptors in the table are read-only by
the device. And the only descriptor which is writeable by
the device is the descriptor in the main virtqueue (with
Flags bit VIRTQ_DESC_F_INDIRECT on). So if we ignore the
`len` in this descriptor, we won't be able to get the
length of the data written by the device.

So I think the `len` in this descriptor will carry the
length of the data written by the device (if the buffers
are writable to the device) even if the VIRTQ_DESC_F_WRITE
isn't set by the device. How do you think?


> 
> The reason is we don't touch descriptor ring in the case of split, so
> BUG_ON()s may help there.
> 
> > +
> > +		for (j = 0; j < len / sizeof(struct vring_packed_desc); j++)
> > +			vring_unmap_one_packed(vq, &desc[j]);
> > +
> > +		kfree(desc);
> > +		vq->desc_state[head].indir_desc = NULL;
> > +	} else if (ctx) {
> > +		*ctx = vq->desc_state[head].indir_desc;
> > +	}
> > +
> > +out:
> > +	return vq->desc_state[head].num;
> > +}
> > +
> > +static inline bool more_used_split(const struct vring_virtqueue *vq)
> >   {
> >   	return vq->last_used_idx != virtio16_to_cpu(vq->vq.vdev, vq->vring.used->idx);
> >   }
> > +static inline bool more_used_packed(const struct vring_virtqueue *vq)
> > +{
> > +	u16 last_used, flags;
> > +	bool avail, used;
> > +
> > +	if (vq->vq.num_free == vq->vring_packed.num)
> > +		return false;
> > +
> > +	last_used = vq->last_used_idx;
> > +	flags = virtio16_to_cpu(vq->vq.vdev,
> > +				vq->vring_packed.desc[last_used].flags);
> > +	avail = flags & VRING_DESC_F_AVAIL(1);
> > +	used = flags & VRING_DESC_F_USED(1);
> > +
> > +	return avail == used;
> > +}
> 
> This looks interesting, spec said:
> 
> "
> Thus VIRTQ_DESC_F_AVAIL and VIRTQ_DESC_F_USED bits are different for an
> available descriptor and
> equal for a used descriptor.
> Note that this observation is mostly useful for sanity-checking as these are
> necessary but not sufficient
> conditions - for example, all descriptors are zero-initialized. To detect
> used and available descriptors it is
> possible for drivers and devices to keep track of the last observed value of
> VIRTQ_DESC_F_USED/VIRTQ_-
> DESC_F_AVAIL. Other techniques to detect
> VIRTQ_DESC_F_AVAIL/VIRTQ_DESC_F_USED bit changes
> might also be possible.
> "
> 
> So it looks to me it was not sufficient, looking at the example codes in
> spec, do we need to track last seen used_wrap_counter here?

I don't think we have to track used_wrap_counter in
driver. There was a discussion on this:

https://lists.oasis-open.org/archives/virtio-dev/201802/msg00177.html

And after that, below sentence was added (it's also
in the above words you quoted):

"""
Other techniques to detect
VIRTQ_DESC_F_AVAIL/VIRTQ_DESC_F_USED bit changes
might also be possible.
"""

Best regards,
Tiwei Bie


> 
> Thanks

^ permalink raw reply

* Re: [PATCH/RFC 7/8] ARM: shmobile: Remove the ARCH_SHMOBILE Kconfig symbol
From: Simon Horman @ 2018-04-23  9:20 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: alsa-devel, Kuninori Morimoto, Catalin Marinas, Will Deacon,
	Liam Girdwood, Laurent Pinchart, devel, Mauro Carvalho Chehab,
	Vinod Koul, Magnus Damm, Russell King, linux-media, Arnd Bergmann,
	Mark Brown, Dan Williams, Jaroslav Kysela, linux-arm-kernel,
	Sergei Shtylyov, Greg Kroah-Hartman, Takashi Iwai, linux-kernel,
	linux-renesas-soc, netdev, dmaengine
In-Reply-To: <1524230914-10175-8-git-send-email-geert+renesas@glider.be>

On Fri, Apr 20, 2018 at 03:28:33PM +0200, Geert Uytterhoeven wrote:
> All drivers for Renesas ARM SoCs have gained proper ARCH_RENESAS
> platform dependencies.  Hence finish the conversion from ARCH_SHMOBILE
> to ARCH_RENESAS for Renesas 32-bit ARM SoCs, as started by commit
> 9b5ba0df4ea4f940 ("ARM: shmobile: Introduce ARCH_RENESAS").
> 
> Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
> ---
> This depends on the previous patches in this series, hence the RFC.
> 
> JFTR, after this, the following symbols for drivers supporting only
> Renesas SuperH "SH-Mobile" SoCs can no longer be selected:
>   - CONFIG_KEYBOARD_SH_KEYSC,
>   - CONFIG_VIDEO_SH_VOU,
>   - CONFIG_VIDEO_SH_MOBILE_CEU,
>   - CONFIG_DRM_SHMOBILE[*],
>   - CONFIG_FB_SH_MOBILE_MERAM.
> (changes for a shmobile_defconfig .config)
> 
> [*] CONFIG_DRM_SHMOBILE has a dependency on ARM, but it was never wired
>     up.  From the use of sh_mobile_meram, I guess it was meant for
>     SH-Mobile AP4 on Mackerel or AP4EVB, which are long gone.
>     So the only remaining upstream platforms that could make use of it
>     are legacy SuperH SH-Mobile SoCs?

That sounds about right. If there is interest I can sift through my mail
archive and see if it yields any answers.

^ permalink raw reply

* Re: [PATCH bpf-next v3 4/8] bpf: add documentation for eBPF helpers (23-32)
From: Daniel Borkmann @ 2018-04-23  9:11 UTC (permalink / raw)
  To: Quentin Monnet, ast; +Cc: netdev, oss-drivers, linux-doc, linux-man
In-Reply-To: <b7ef2bc0-90f7-8bd9-2dd2-3b5ba2a6d4b0@netronome.com>

On 04/20/2018 08:54 PM, Quentin Monnet wrote:
> 2018-04-19 13:16 UTC+0200 ~ Daniel Borkmann <daniel@iogearbox.net>
>> On 04/17/2018 04:34 PM, Quentin Monnet wrote:
>>> Add documentation for eBPF helper functions to bpf.h user header file.
>>> This documentation can be parsed with the Python script provided in
>>> another commit of the patch series, in order to provide a RST document
>>> that can later be converted into a man page.
>>>
>>> The objective is to make the documentation easily understandable and
>>> accessible to all eBPF developers, including beginners.
>>>
>>> This patch contains descriptions for the following helper functions, all
>>> written by Daniel:
>>>
>>> - bpf_get_prandom_u32()
>>> - bpf_get_smp_processor_id()
>>> - bpf_get_cgroup_classid()
>>> - bpf_get_route_realm()
>>> - bpf_skb_load_bytes()
>>> - bpf_csum_diff()
>>> - bpf_skb_get_tunnel_opt()
>>> - bpf_skb_set_tunnel_opt()
>>> - bpf_skb_change_proto()
>>> - bpf_skb_change_type()
>>>
>>> v3:
>>> - bpf_get_prandom_u32(): Fix helper name :(. Add description, including
>>>   a note on the internal random state.
>>> - bpf_get_smp_processor_id(): Add description, including a note on the
>>>   processor id remaining stable during program run.
>>> - bpf_get_cgroup_classid(): State that CONFIG_CGROUP_NET_CLASSID is
>>>   required to use the helper. Add a reference to related documentation.
>>>   State that placing a task in net_cls controller disables cgroup-bpf.
>>> - bpf_get_route_realm(): State that CONFIG_CGROUP_NET_CLASSID is
>>>   required to use this helper.
>>> - bpf_skb_load_bytes(): Fix comment on current use cases for the helper.
>>>
>>> Cc: Daniel Borkmann <daniel@iogearbox.net>
>>> Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
>>> ---
>>>  include/uapi/linux/bpf.h | 152 +++++++++++++++++++++++++++++++++++++++++++++++
>>>  1 file changed, 152 insertions(+)
>>>
>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>> index c59bf5b28164..d748f65a8f58 100644
>>> --- a/include/uapi/linux/bpf.h
>>> +++ b/include/uapi/linux/bpf.h
> 
> [...]
> 
>>> @@ -615,6 +632,27 @@ union bpf_attr {
>>>   * 	Return
>>>   * 		0 on success, or a negative error in case of failure.
>>>   *
>>> + * u32 bpf_get_cgroup_classid(struct sk_buff *skb)
>>> + * 	Description
>>> + * 		Retrieve the classid for the current task, i.e. for the
>>> + * 		net_cls (network classifier) cgroup to which *skb* belongs.
>>> + *
>>> + * 		This helper is only available is the kernel was compiled with
>>> + * 		the **CONFIG_CGROUP_NET_CLASSID** configuration option set to
>>> + * 		"**y**" or to "**m**".
>>> + *
>>> + * 		Note that placing a task into the net_cls controller completely
>>> + * 		disables the execution of eBPF programs with the cgroup.
>>
>> I'm not sure I follow the above sentence, what do you mean by that?
> 
> Please see comment below.
> 
>> I would definitely also add here that this helper is limited to cgroups v1
>> only, and that it works on clsact TC egress hook but not the ingress one.
>>
>>> + * 		Also note that, in the above description, the "network
>>> + * 		classifier" cgroup does not designate a generic classifier, but
>>> + * 		a particular mechanism that provides an interface to tag
>>> + * 		network packets with a specific class identifier. See also the
>>
>> The "generic classifier" part is a bit strange to parse. I would probably
>> leave the first part out and explain that this provides a means to tag
>> packets based on a user-provided ID for all traffic coming from the tasks
>> belonging to the related cgroup.
> 
> In this paragraph and in the one above, I am trying to address Alexei
> comments to the RFC v2:
> 
>     please add that kernel should be configured with
>     CONFIG_NET_CLS_CGROUP=y|m
>     and mention Documentation/cgroup-v1/net_cls.txt
>     Otherwise 'network classifier' is way too generic.
>     I'd also mention that placing a task into net_cls controller
>     disables all of cgroup-bpf.
> 
> But I must admit I am struggling a bit on this helper. If I understand
> correctly, "network classifier" is too generic and could be confused
> with TC classifiers? Hence the attempt to avoid that in the second

I think if you just name it "net_cls cgroup" it should be good enough in case
the concern is that "network classifier" name would be misunderstood.

> paragraph... As for "placing a task into net_cls controller disables all
> of cgroup-bpf", again if I understand correctly, I think it refers to
> cgroup_sk_alloc_disable() being called in write_classid() in file
> net/core/netclassid_cgroup.c. But I am not familiar with it and cannot
> find a nice way to explain that in the doc :/.

Point here should be that there's cgroups v1 and v2 infra. Both are available
and you can use a mixture of them, but net_cls is v1 only whereas bpf progs on
cgroups is v2 only feature. And if you look at struct sock_cgroup_data, it's
a union for the socket, so you can only use one of them with regards to sockets.

>>> + * 		related kernel documentation, available from the Linux sources
>>> + * 		in file *Documentation/cgroup-v1/net_cls.txt*.
>>> + * 	Return
>>> + * 		The classid, or 0 for the default unconfigured classid.
>>> + *
>>>   * int bpf_skb_vlan_push(struct sk_buff *skb, __be16 vlan_proto, u16 vlan_tci)
>>>   * 	Description
>>>   * 		Push a *vlan_tci* (VLAN tag control information) of protocol
>>> @@ -734,6 +772,16 @@ union bpf_attr {
>>>   * 		are **TC_ACT_REDIRECT** on success or **TC_ACT_SHOT** on
>>>   * 		error.
>>>   *
>>> + * u32 bpf_get_route_realm(struct sk_buff *skb)
>>> + * 	Description
>>> + * 		Retrieve the realm or the route, that is to say the
>>> + * 		**tclassid** field of the destination for the *skb*. This
>>> + * 		helper is available only if the kernel was compiled with
>>> + * 		**CONFIG_IP_ROUTE_CLASSID** configuration option.
>>
>> Could mention that this is a similar user provided tag like in the net_cls
>> case with cgroups only that this applies to routes here (dst entries) which
>> hold this tag.
>>
>> Also, should say that this works with clsact TC egress hook or alternatively
>> conventional classful egress qdiscs, but not on TC ingress. In case of clsact
>> TC egress hook this has the advantage that the dst entry has not been dropped
>> yet in the xmit path. Therefore, the dst entry does not need to be artificially
>> held via netif_keep_dst() for a classful qdisc until the skb is freed.
> 
> Here I am not sure to follow, the advantage is for clsact, but this is
> only for a classful qdisc that we can avoid holding the dst entry? How

No, it's only for sch_clsact where we don't need to hold it. Take a look at
__dev_queue_xmit(), and where sch_handle_egress() is called. It's right before
the dev->priv_flags & IFF_XMIT_DST_RELEASE check where we either hold or drop
the dst from skb.

In cls_bpf_prog_from_efd() the tcf_block_netif_keep_dst() will handle this.
If you call cls_bpf e.g. out of htb via htb_classify() -> tcf_classify()
then you would have to hold it and thus call netif_keep_dst() so that the
test in __dev_queue_xmit() results in holding the dst via skb_dst_force().

Cheers,
Daniel

^ permalink raw reply

* Re: [PATCH] iptables: Per-net ns lock
From: Kirill Tkhai @ 2018-04-23  9:03 UTC (permalink / raw)
  To: Andrei Vagin; +Cc: fw, netdev, pablo, rstoyanov1, ptikhomirov
In-Reply-To: <20180420230620.GA23540@outlook.office365.com>

On 21.04.2018 02:06, Andrei Vagin wrote:
> On Fri, Apr 20, 2018 at 04:42:47PM +0300, Kirill Tkhai wrote:
>> Containers want to restore their own net ns,
>> while they may have no their own mnt ns.
>> This case they share host's /run/xtables.lock
>> file, but they may not have permission to open
>> it.
>>
>> Patch makes /run/xtables.lock to be per-namespace,
>> i.e., to refer to the caller task's net ns.
>>
>> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
>> ---
>>  iptables/xshared.c |    7 ++++++-
>>  1 file changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/iptables/xshared.c b/iptables/xshared.c
>> index 06db72d4..b6dbe4e7 100644
>> --- a/iptables/xshared.c
>> +++ b/iptables/xshared.c
>> @@ -254,7 +254,12 @@ static int xtables_lock(int wait, struct timeval *wait_interval)
>>  	time_left.tv_sec = wait;
>>  	time_left.tv_usec = 0;
>>  
>> -	fd = open(XT_LOCK_NAME, O_CREAT, 0600);
>> +	if (symlink("/proc/self/ns/net", XT_LOCK_NAME) != 0 &&
> 
> Any user can open this file and take the lock. Before this patch, the
> lock file could be opened only by the root user. It means that any user
> will be able to block all iptables operations. Do I miss something?

Yes, this is the idea. It looks like the only way to save compatibility
with old iptables and to allow to set rules from nested net namespaces.
Also, this allows to synchronize with containers, which have its own mount
namespace.

Comparing to existing interfaces in kernel, there is an example. Ordinary user
can open a file RO on a partition, and this prevents root from umounting it
But this is never considered as a problem, and nobody makes partitions available
only for root in 0600 mode to prevent this. There is lsof, and it's easy to find
the lock owner. The same with iptables. The lock is not a critical protection,
it's just a try for different users to synchronize between each other. Real protection
happens in setsockopt() path.

> [root@fc24 ~]# ln -s /proc/self/ns/net /run/xtables.lock2
> [root@fc24 ~]# ls -l /run/xtables.lock2
> lrwxrwxrwx 1 root root 17 Apr 21 01:52 /run/xtables.lock2 ->
> /proc/self/ns/net
> [root@fc24 ~]# ls -l /proc/self/ns/net 
> lrwxrwxrwx 1 root root 0 Apr 21 01:52 /proc/self/ns/net ->
> net:[4026531993]
> 
> Thanks,
> Andrei
> 
>> +	    errno != EEXIST) {
>> +		fprintf(stderr, "Fatal: can't create lock file\n");
> 
> 		fprintf(stderr, "Fatal: can't create lock file %s: %s\n",
> 			XT_LOCK_NAME, strerror(errno));
> 
>> +		return XT_LOCK_FAILED;
>> +	}
>> +	fd = open(XT_LOCK_NAME, O_RDONLY);
>>  	if (fd < 0) {
>>  		fprintf(stderr, "Fatal: can't open lock file %s: %s\n",
>>  			XT_LOCK_NAME, strerror(errno));

Kirill

^ permalink raw reply

* Re: Page allocator bottleneck
From: Tariq Toukan @ 2018-04-23  8:54 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Linux Kernel Network Developers, linux-mm, Mel Gorman,
	David Miller, Jesper Dangaard Brouer, Eric Dumazet,
	Alexei Starovoitov, Saeed Mahameed, Eran Ben Elisha,
	Andrew Morton, Michal Hocko
In-Reply-To: <127df719-b978-60b7-5d77-3c8efbf2ecff@mellanox.com>



On 22/04/2018 7:43 PM, Tariq Toukan wrote:
> 
> 
> On 21/04/2018 11:15 AM, Aaron Lu wrote:
>> Sorry to bring up an old thread...
>>
> 
> I want to thank you very much for bringing this up!
> 
>> On Thu, Nov 02, 2017 at 07:21:09PM +0200, Tariq Toukan wrote:
>>>
>>>
>>> On 18/09/2017 12:16 PM, Tariq Toukan wrote:
>>>>
>>>>
>>>> On 15/09/2017 1:23 PM, Mel Gorman wrote:
>>>>> On Thu, Sep 14, 2017 at 07:49:31PM +0300, Tariq Toukan wrote:
>>>>>> Insights: Major degradation between #1 and #2, not getting any
>>>>>> close to linerate! Degradation is fixed between #2 and #3. This is
>>>>>> because page allocator cannot stand the higher allocation rate. In
>>>>>> #2, we also see that the addition of rings (cores) reduces BW (!!),
>>>>>> as result of increasing congestion over shared resources.
>>>>>>
>>>>>
>>>>> Unfortunately, no surprises there.
>>>>>
>>>>>> Congestion in this case is very clear. When monitored in perf
>>>>>> top: 85.58% [kernel] [k] queued_spin_lock_slowpath
>>>>>>
>>>>>
>>>>> While it's not proven, the most likely candidate is the zone lock
>>>>> and that should be confirmed using a call-graph profile. If so, then
>>>>> the suggestion to tune to the size of the per-cpu allocator would
>>>>> mitigate the problem.
>>>>>
>>>> Indeed, I tuned the per-cpu allocator and bottleneck is released.
>>>>
>>>
>>> Hi all,
>>>
>>> After leaving this task for a while doing other tasks, I got back to 
>>> it now
>>> and see that the good behavior I observed earlier was not stable.
>>
>> I posted a patchset to improve zone->lock contention for order-0 pages
>> recently, it can almost eliminate 80% zone->lock contention for
>> will-it-scale/page_fault1 testcase when tested on a 2 sockets Intel
>> Skylake server and it doesn't require PCP size tune, so should have
>> some effects on your workload where one CPU does allocation while
>> another does free.
>>
> 
> That is great news. In our driver's memory scheme (and many others as 
> well) we allocate only order-0 pages (the only flow that does not do 
> that yet in upstream will do so very soon, we already have the patches 
> in our internal branch).
> Allocation of order-0 pages is not only the common case, but is the only 
> type of allocation in our data-path. Let's optimize it!
> 
> 
>> It did this by some disruptive changes:
>> 1 on free path, it skipped doing merge(so could be bad for mixed
>>    workloads where both 4K and high order pages are needed);
> 
> I think there are so many advantages to not using high order 
> allocations, especially in production servers that are not rebooted for 
> long periods and become fragmented.
> AFAIK, the community direction (at least in networking) is using order-0 
> pages in datapath, so optimizing their allocaiton is a very good idea. 
> Need of course to perf evaluate possible degradations, and see how 
> important these use cases are.
> 
>> 2 on allocation path, it avoided touching multiple cachelines.
>>
> 
> Great!
> 
>> RFC v2 patchset:
>> https://lkml.org/lkml/2018/3/20/171
>>
>> repo:
>> https://github.com/aaronlu/linux zone_lock_rfc_v2
>>
> 
> I will check them out first thing tomorrow!
> 
> p.s., I will be on vacation for a week starting Tuesday.
> I hope I can make some progress before that :)
> 
> Thanks,
> Tariq
> 

Hi,

I ran my tests with your patches.
Initial BW numbers are significantly higher than I documented back then 
in this mail-thread.
For example, in driver #2 (see original mail thread), with 6 rings, I 
now get 92Gbps (slightly less than linerate) in comparison to 64Gbps 
back then.

However, there were many kernel changes since then, I need to isolate 
your changes. I am not sure I can finish this today, but I will surely 
get to it next week after I'm back from vacation.

Still, when I increase the scale (more rings, i.e. more cpus), I see 
that queued_spin_lock_slowpath gets to 60%+ cpu. Still high, but lower 
than it used to be.

This should be root solved by the (orthogonal) changes planned in 
network subsystem, which will change the SKB allocation/free scheme so 
that SKBs are released on the originating cpu.

Thanks,
Tariq

>>> Recall: I work with a modified driver that allocates a page (4K) per 
>>> packet
>>> (MTU=1500), in order to simulate the stress on page-allocator in 200Gbps
>>> NICs.
>>>
>>> Performance is good as long as pages are available in the allocating 
>>> cores's
>>> PCP.
>>> Issue is that pages are allocated in one core, then free'd in another,
>>> making it's hard for the PCP to work efficiently, and both the allocator
>>> core and the freeing core need to access the buddy allocator very often.
>>>
>>> I'd like to share with you some testing numbers:
>>>
>>> Test: ./super_netperf 128 -H 24.134.0.51 -l 1000
>>>
>>> 100% cpu on all cores, top func in perf:
>>>     84.98%  [kernel]             [k] queued_spin_lock_slowpath
>>>
>>> system wide (all cores)
>>>             1135941      kmem:mm_page_alloc
>>>
>>>             2606629      kmem:mm_page_free
>>>
>>>                   0      kmem:mm_page_alloc_extfrag
>>>             4784616      kmem:mm_page_alloc_zone_locked
>>>
>>>                1337      kmem:mm_page_free_batched
>>>
>>>             6488213      kmem:mm_page_pcpu_drain
>>>
>>>             8925503      net:napi_gro_receive_entry
>>>
>>>
>>> Two types of cores:
>>> A core mostly running napi (8 such cores):
>>>              221875      kmem:mm_page_alloc
>>>
>>>               17100      kmem:mm_page_free
>>>
>>>                   0      kmem:mm_page_alloc_extfrag
>>>              766584      kmem:mm_page_alloc_zone_locked
>>>
>>>                  16      kmem:mm_page_free_batched
>>>
>>>                  35      kmem:mm_page_pcpu_drain
>>>
>>>             1340139      net:napi_gro_receive_entry
>>>
>>>
>>> Other core, mostly running user application (40 such):
>>>                   2      kmem:mm_page_alloc
>>>
>>>               38922      kmem:mm_page_free
>>>
>>>                   0      kmem:mm_page_alloc_extfrag
>>>                   1      kmem:mm_page_alloc_zone_locked
>>>
>>>                   8      kmem:mm_page_free_batched
>>>
>>>              107289      kmem:mm_page_pcpu_drain
>>>
>>>                  34      net:napi_gro_receive_entry
>>>
>>>
>>> As you can see, sync overhead is enormous.
>>>
>>> PCP-wise, a key improvement in such scenarios would be reached if we 
>>> could
>>> (1) keep and handle the allocated page on same cpu, or (2) somehow 
>>> get the
>>> page back to the allocating core's PCP in a fast-path, without going 
>>> through
>>> the regular buddy allocator paths.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox