Netdev List
 help / color / mirror / Atom feed
* [PATCH net-next 3/4] qed: Add support for multi function mode with 802.1ad tagging.
From: Sudarsana Reddy Kalluru @ 2018-05-06  1:43 UTC (permalink / raw)
  To: davem; +Cc: netdev, Ariel.Elior
In-Reply-To: <20180506014302.29110-1-sudarsana.kalluru@cavium.com>

The patch adds support for new Multi function mode wherein the traffic
classification is done based on the 802.1ad tagging and the outer vlan tag
provided by the management firmware.

Signed-off-by: Sudarsana Reddy Kalluru <Sudarsana.Kalluru@cavium.com>
Signed-off-by: Ariel Elior <ariel.elior@cavium.com>
---
 drivers/net/ethernet/qlogic/qed/qed_dev.c         | 64 ++++++++++++++++-------
 drivers/net/ethernet/qlogic/qed/qed_sp_commands.c |  5 ++
 2 files changed, 49 insertions(+), 20 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_dev.c b/drivers/net/ethernet/qlogic/qed/qed_dev.c
index 9b07d7f..95d00cb 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_dev.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_dev.c
@@ -1668,6 +1668,18 @@ int qed_hw_init(struct qed_dev *cdev, struct qed_hw_init_params *p_params)
 		if (rc)
 			return rc;
 
+		if (IS_PF(cdev) && test_bit(QED_MF_8021AD_TAGGING,
+					    &cdev->mf_bits)) {
+			STORE_RT_REG(p_hwfn, PRS_REG_TAG_ETHERTYPE_0_RT_OFFSET,
+				     ETH_P_8021AD);
+			STORE_RT_REG(p_hwfn, NIG_REG_TAG_ETHERTYPE_0_RT_OFFSET,
+				     ETH_P_8021AD);
+			STORE_RT_REG(p_hwfn, PBF_REG_TAG_ETHERTYPE_0_RT_OFFSET,
+				     ETH_P_8021AD);
+			STORE_RT_REG(p_hwfn, DORQ_REG_TAG1_ETHERTYPE_RT_OFFSET,
+				     ETH_P_8021AD);
+		}
+
 		qed_fill_load_req_params(&load_req_params,
 					 p_params->p_drv_load_params);
 		rc = qed_mcp_load_req(p_hwfn, p_hwfn->p_main_ptt,
@@ -2630,39 +2642,51 @@ static int qed_hw_get_nvm_info(struct qed_hwfn *p_hwfn, struct qed_ptt *p_ptt)
 		   link->pause.autoneg,
 		   p_caps->default_eee, p_caps->eee_lpi_timer);
 
-	/* Read Multi-function information from shmem */
-	addr = MCP_REG_SCRATCH + nvm_cfg1_offset +
-	       offsetof(struct nvm_cfg1, glob) +
-	       offsetof(struct nvm_cfg1_glob, generic_cont0);
+	if (IS_LEAD_HWFN(p_hwfn)) {
+		struct qed_dev *cdev = p_hwfn->cdev;
 
-	generic_cont0 = qed_rd(p_hwfn, p_ptt, addr);
+		/* Read Multi-function information from shmem */
+		addr = MCP_REG_SCRATCH + nvm_cfg1_offset +
+		       offsetof(struct nvm_cfg1, glob) +
+		       offsetof(struct nvm_cfg1_glob, generic_cont0);
 
-	mf_mode = (generic_cont0 & NVM_CFG1_GLOB_MF_MODE_MASK) >>
-		  NVM_CFG1_GLOB_MF_MODE_OFFSET;
+		generic_cont0 = qed_rd(p_hwfn, p_ptt, addr);
 
-	switch (mf_mode) {
-	case NVM_CFG1_GLOB_MF_MODE_MF_ALLOWED:
-		p_hwfn->cdev->mf_bits = BIT(QED_MF_OVLAN_CLSS);
-		break;
-	case NVM_CFG1_GLOB_MF_MODE_NPAR1_0:
-		p_hwfn->cdev->mf_bits = BIT(QED_MF_LLH_MAC_CLSS) |
+		mf_mode = (generic_cont0 & NVM_CFG1_GLOB_MF_MODE_MASK) >>
+			  NVM_CFG1_GLOB_MF_MODE_OFFSET;
+
+		switch (mf_mode) {
+		case NVM_CFG1_GLOB_MF_MODE_MF_ALLOWED:
+			cdev->mf_bits = BIT(QED_MF_OVLAN_CLSS);
+			break;
+		case NVM_CFG1_GLOB_MF_MODE_BD:
+			cdev->mf_bits = BIT(QED_MF_OVLAN_CLSS) |
+					BIT(QED_MF_LLH_PROTO_CLSS) |
+					BIT(QED_MF_8021AD_TAGGING);
+			break;
+		case NVM_CFG1_GLOB_MF_MODE_NPAR1_0:
+			cdev->mf_bits = BIT(QED_MF_LLH_MAC_CLSS) |
 					BIT(QED_MF_LLH_PROTO_CLSS) |
 					BIT(QED_MF_LL2_NON_UNICAST) |
 					BIT(QED_MF_INTER_PF_SWITCH);
-		break;
-	case NVM_CFG1_GLOB_MF_MODE_DEFAULT:
-		p_hwfn->cdev->mf_bits = BIT(QED_MF_LLH_MAC_CLSS) |
+			break;
+		case NVM_CFG1_GLOB_MF_MODE_DEFAULT:
+			cdev->mf_bits = BIT(QED_MF_LLH_MAC_CLSS) |
 					BIT(QED_MF_LLH_PROTO_CLSS) |
 					BIT(QED_MF_LL2_NON_UNICAST);
-		if (QED_IS_BB(p_hwfn->cdev))
-			p_hwfn->cdev->mf_bits |= BIT(QED_MF_NEED_DEF_PF);
-		break;
+			if (QED_IS_BB(p_hwfn->cdev))
+				cdev->mf_bits |= BIT(QED_MF_NEED_DEF_PF);
+			break;
+		}
+
+		DP_INFO(p_hwfn, "Multi function mode is 0x%lx\n",
+			cdev->mf_bits);
 	}
 
 	DP_INFO(p_hwfn, "Multi function mode is 0x%lx\n",
 		p_hwfn->cdev->mf_bits);
 
-	/* Read Multi-function information from shmem */
+	/* Read device capabilities information from shmem */
 	addr = MCP_REG_SCRATCH + nvm_cfg1_offset +
 		offsetof(struct nvm_cfg1, glob) +
 		offsetof(struct nvm_cfg1_glob, device_capabilities);
diff --git a/drivers/net/ethernet/qlogic/qed/qed_sp_commands.c b/drivers/net/ethernet/qlogic/qed/qed_sp_commands.c
index fbb3172..26bed26 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_sp_commands.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_sp_commands.c
@@ -346,6 +346,11 @@ int qed_sp_pf_start(struct qed_hwfn *p_hwfn,
 
 	p_ramrod->outer_tag_config.outer_tag.tci =
 		cpu_to_le16(p_hwfn->hw_info.ovlan);
+	if (test_bit(QED_MF_8021AD_TAGGING, &p_hwfn->cdev->mf_bits)) {
+		p_ramrod->outer_tag_config.outer_tag.tpid = ETH_P_8021AD;
+		p_ramrod->outer_tag_config.enable_stag_pri_change = 1;
+	}
+
 
 	/* Place EQ address in RAMROD */
 	DMA_REGPAIR_LE(p_ramrod->event_ring_pbl_addr,
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net-next 2/4] qed: Remove unused data member 'is_mf_default'.
From: Sudarsana Reddy Kalluru @ 2018-05-06  1:43 UTC (permalink / raw)
  To: davem; +Cc: netdev, Ariel.Elior
In-Reply-To: <20180506014302.29110-1-sudarsana.kalluru@cavium.com>

The data member 'is_mf_default' is not used by the qed/qede drivers,
removing the same.

Signed-off-by: Sudarsana Reddy Kalluru <Sudarsana.Kalluru@cavium.com>
Signed-off-by: Ariel Elior <ariel.elior@cavium.com>
---
 drivers/net/ethernet/qlogic/qed/qed_main.c | 2 --
 include/linux/qed/qed_if.h                 | 1 -
 2 files changed, 3 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_main.c b/drivers/net/ethernet/qlogic/qed/qed_main.c
index 307fe33..70bc563 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_main.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_main.c
@@ -264,8 +264,6 @@ int qed_fill_dev_info(struct qed_dev *cdev,
 	dev_info->pci_mem_end = cdev->pci_params.mem_end;
 	dev_info->pci_irq = cdev->pci_params.irq;
 	dev_info->rdma_supported = QED_IS_RDMA_PERSONALITY(p_hwfn);
-	dev_info->is_mf_default = !test_bit(QED_MF_LLH_MAC_CLSS,
-					    &cdev->mf_bits);
 	dev_info->dev_type = cdev->type;
 	ether_addr_copy(dev_info->hw_mac, hw_info->hw_mac_addr);
 
diff --git a/include/linux/qed/qed_if.h b/include/linux/qed/qed_if.h
index 5dac561..907976f 100644
--- a/include/linux/qed/qed_if.h
+++ b/include/linux/qed/qed_if.h
@@ -339,7 +339,6 @@ struct qed_dev_info {
 	u8		num_hwfns;
 
 	u8		hw_mac[ETH_ALEN];
-	bool		is_mf_default;
 
 	/* FW version */
 	u16		fw_major;
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net-next 0/4] qed*: Add support for new multi partitioning modes.
From: Sudarsana Reddy Kalluru @ 2018-05-06  1:42 UTC (permalink / raw)
  To: davem; +Cc: netdev, Ariel.Elior, Sudarsana Reddy Kalluru

From: Sudarsana Reddy Kalluru <Sudarsana.Kalluru@cavium.com>

The patch series simplifies the multi function (MF) mode implementation of
qed/qede drivers, and adds support for new MF modes.
 
Please consider applying it to net-next branch.

Sudarsana Reddy Kalluru (4):
  qed*: Refactor mf_mode to consist of bits.
  qed: Remove unused data member 'is_mf_default'.
  qed: Add support for multi function mode with 802.1ad tagging.
  qed: Add support for Unified Fabric Port.

 drivers/net/ethernet/qlogic/qed/qed.h             |  61 +++++++++++-
 drivers/net/ethernet/qlogic/qed/qed_dcbx.c        |  14 ++-
 drivers/net/ethernet/qlogic/qed/qed_dev.c         | 113 +++++++++++++++-------
 drivers/net/ethernet/qlogic/qed/qed_fcoe.c        |   3 +
 drivers/net/ethernet/qlogic/qed/qed_hsi.h         |  28 ++++++
 drivers/net/ethernet/qlogic/qed/qed_ll2.c         |  46 ++++++---
 drivers/net/ethernet/qlogic/qed/qed_main.c        |   4 +-
 drivers/net/ethernet/qlogic/qed/qed_mcp.c         |  78 +++++++++++++++
 drivers/net/ethernet/qlogic/qed/qed_mcp.h         |   8 ++
 drivers/net/ethernet/qlogic/qed/qed_sp.h          |  12 ++-
 drivers/net/ethernet/qlogic/qed/qed_sp_commands.c |  76 ++++++++++++---
 drivers/net/ethernet/qlogic/qede/qede_main.c      |   4 +-
 drivers/scsi/qedf/qedf_fip.c                      |   7 +-
 drivers/scsi/qedf/qedf_main.c                     |   2 +-
 drivers/scsi/qedi/qedi_iscsi.c                    |   2 +-
 include/linux/qed/qed_if.h                        |   3 +-
 include/linux/qed/qed_ll2_if.h                    |  10 +-
 17 files changed, 389 insertions(+), 82 deletions(-)

-- 
1.8.3.1

^ permalink raw reply

* [PATCH net-next 1/4] qed*: Refactor mf_mode to consist of bits.
From: Sudarsana Reddy Kalluru @ 2018-05-06  1:42 UTC (permalink / raw)
  To: davem; +Cc: netdev, Ariel.Elior
In-Reply-To: <20180506014302.29110-1-sudarsana.kalluru@cavium.com>

`mf_mode' field indicates the multi-partitioning mode the device is
configured to. This method doesn't scale very well, adding a new MF mode
requires going over all the existing conditions, and deciding whether those
are needed for the new mode or not.
The patch defines a set of bit-fields for modes which are derived according
to the mode info shared by the MFW and all the configuration would be made
according to those. To add a new mode, there would be a single place where
we'll need to go and choose which bits apply and which don't.

Signed-off-by: Sudarsana Reddy Kalluru <Sudarsana.Kalluru@cavium.com>
Signed-off-by: Ariel Elior <ariel.elior@cavium.com>
---
 drivers/net/ethernet/qlogic/qed/qed.h             | 41 ++++++++++++++++++++---
 drivers/net/ethernet/qlogic/qed/qed_dev.c         | 39 +++++++++++----------
 drivers/net/ethernet/qlogic/qed/qed_ll2.c         |  6 ++--
 drivers/net/ethernet/qlogic/qed/qed_main.c        |  6 ++--
 drivers/net/ethernet/qlogic/qed/qed_sp.h          |  3 +-
 drivers/net/ethernet/qlogic/qed/qed_sp_commands.c | 16 +++------
 drivers/net/ethernet/qlogic/qede/qede_main.c      |  4 +--
 include/linux/qed/qed_if.h                        |  2 +-
 8 files changed, 71 insertions(+), 46 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed.h b/drivers/net/ethernet/qlogic/qed/qed.h
index e07460a..c8f3507 100644
--- a/drivers/net/ethernet/qlogic/qed/qed.h
+++ b/drivers/net/ethernet/qlogic/qed/qed.h
@@ -439,6 +439,41 @@ struct qed_fw_data {
 	u32			init_ops_size;
 };
 
+enum qed_mf_mode_bit {
+	/* Supports PF-classification based on tag */
+	QED_MF_OVLAN_CLSS,
+
+	/* Supports PF-classification based on MAC */
+	QED_MF_LLH_MAC_CLSS,
+
+	/* Supports PF-classification based on protocol type */
+	QED_MF_LLH_PROTO_CLSS,
+
+	/* Requires a default PF to be set */
+	QED_MF_NEED_DEF_PF,
+
+	/* Allow LL2 to multicast/broadcast */
+	QED_MF_LL2_NON_UNICAST,
+
+	/* Allow Cross-PF [& child VFs] Tx-switching */
+	QED_MF_INTER_PF_SWITCH,
+
+	/* Unified Fabtic Port support enabled */
+	QED_MF_UFP_SPECIFIC,
+
+	/* Disable Accelerated Receive Flow Steering (aRFS) */
+	QED_MF_DISABLE_ARFS,
+
+	/* Use vlan for steering */
+	QED_MF_8021Q_TAGGING,
+
+	/* Use stag for steering */
+	QED_MF_8021AD_TAGGING,
+
+	/* Allow DSCP to TC mapping */
+	QED_MF_DSCP_TO_TC_MAP,
+};
+
 enum BAR_ID {
 	BAR_ID_0,		/* used for GRC */
 	BAR_ID_1		/* Used for doorbells */
@@ -669,10 +704,8 @@ struct qed_dev {
 	u8				num_funcs_in_port;
 
 	u8				path_id;
-	enum qed_mf_mode		mf_mode;
-#define IS_MF_DEFAULT(_p_hwfn)  (((_p_hwfn)->cdev)->mf_mode == QED_MF_DEFAULT)
-#define IS_MF_SI(_p_hwfn)       (((_p_hwfn)->cdev)->mf_mode == QED_MF_NPAR)
-#define IS_MF_SD(_p_hwfn)       (((_p_hwfn)->cdev)->mf_mode == QED_MF_OVLAN)
+
+	unsigned long			mf_bits;
 
 	int				pcie_width;
 	int				pcie_speed;
diff --git a/drivers/net/ethernet/qlogic/qed/qed_dev.c b/drivers/net/ethernet/qlogic/qed/qed_dev.c
index d2ad5e9..9b07d7f 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_dev.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_dev.c
@@ -1149,18 +1149,10 @@ static int qed_calc_hw_mode(struct qed_hwfn *p_hwfn)
 		return -EINVAL;
 	}
 
-	switch (p_hwfn->cdev->mf_mode) {
-	case QED_MF_DEFAULT:
-	case QED_MF_NPAR:
-		hw_mode |= 1 << MODE_MF_SI;
-		break;
-	case QED_MF_OVLAN:
+	if (test_bit(QED_MF_OVLAN_CLSS, &p_hwfn->cdev->mf_bits))
 		hw_mode |= 1 << MODE_MF_SD;
-		break;
-	default:
-		DP_NOTICE(p_hwfn, "Unsupported MF mode, init as DEFAULT\n");
+	else
 		hw_mode |= 1 << MODE_MF_SI;
-	}
 
 	hw_mode |= 1 << MODE_ASIC;
 
@@ -1557,7 +1549,6 @@ static int qed_hw_init_pf(struct qed_hwfn *p_hwfn,
 
 		/* send function start command */
 		rc = qed_sp_pf_start(p_hwfn, p_ptt, p_tunn,
-				     p_hwfn->cdev->mf_mode,
 				     allow_npar_tx_switch);
 		if (rc) {
 			DP_NOTICE(p_hwfn, "Function start ramrod failed\n");
@@ -2651,17 +2642,25 @@ static int qed_hw_get_nvm_info(struct qed_hwfn *p_hwfn, struct qed_ptt *p_ptt)
 
 	switch (mf_mode) {
 	case NVM_CFG1_GLOB_MF_MODE_MF_ALLOWED:
-		p_hwfn->cdev->mf_mode = QED_MF_OVLAN;
+		p_hwfn->cdev->mf_bits = BIT(QED_MF_OVLAN_CLSS);
 		break;
 	case NVM_CFG1_GLOB_MF_MODE_NPAR1_0:
-		p_hwfn->cdev->mf_mode = QED_MF_NPAR;
+		p_hwfn->cdev->mf_bits = BIT(QED_MF_LLH_MAC_CLSS) |
+					BIT(QED_MF_LLH_PROTO_CLSS) |
+					BIT(QED_MF_LL2_NON_UNICAST) |
+					BIT(QED_MF_INTER_PF_SWITCH);
 		break;
 	case NVM_CFG1_GLOB_MF_MODE_DEFAULT:
-		p_hwfn->cdev->mf_mode = QED_MF_DEFAULT;
+		p_hwfn->cdev->mf_bits = BIT(QED_MF_LLH_MAC_CLSS) |
+					BIT(QED_MF_LLH_PROTO_CLSS) |
+					BIT(QED_MF_LL2_NON_UNICAST);
+		if (QED_IS_BB(p_hwfn->cdev))
+			p_hwfn->cdev->mf_bits |= BIT(QED_MF_NEED_DEF_PF);
 		break;
 	}
-	DP_INFO(p_hwfn, "Multi function mode is %08x\n",
-		p_hwfn->cdev->mf_mode);
+
+	DP_INFO(p_hwfn, "Multi function mode is 0x%lx\n",
+		p_hwfn->cdev->mf_bits);
 
 	/* Read Multi-function information from shmem */
 	addr = MCP_REG_SCRATCH + nvm_cfg1_offset +
@@ -3462,7 +3461,7 @@ int qed_llh_add_mac_filter(struct qed_hwfn *p_hwfn,
 	u32 high = 0, low = 0, en;
 	int i;
 
-	if (!(IS_MF_SI(p_hwfn) || IS_MF_DEFAULT(p_hwfn)))
+	if (!test_bit(QED_MF_LLH_MAC_CLSS, &p_hwfn->cdev->mf_bits))
 		return 0;
 
 	qed_llh_mac_to_filter(&high, &low, p_filter);
@@ -3507,7 +3506,7 @@ void qed_llh_remove_mac_filter(struct qed_hwfn *p_hwfn,
 	u32 high = 0, low = 0;
 	int i;
 
-	if (!(IS_MF_SI(p_hwfn) || IS_MF_DEFAULT(p_hwfn)))
+	if (!test_bit(QED_MF_LLH_MAC_CLSS, &p_hwfn->cdev->mf_bits))
 		return;
 
 	qed_llh_mac_to_filter(&high, &low, p_filter);
@@ -3549,7 +3548,7 @@ void qed_llh_remove_mac_filter(struct qed_hwfn *p_hwfn,
 	u32 high = 0, low = 0, en;
 	int i;
 
-	if (!(IS_MF_SI(p_hwfn) || IS_MF_DEFAULT(p_hwfn)))
+	if (!test_bit(QED_MF_LLH_PROTO_CLSS, &p_hwfn->cdev->mf_bits))
 		return 0;
 
 	switch (type) {
@@ -3647,7 +3646,7 @@ void qed_llh_remove_mac_filter(struct qed_hwfn *p_hwfn,
 	u32 high = 0, low = 0;
 	int i;
 
-	if (!(IS_MF_SI(p_hwfn) || IS_MF_DEFAULT(p_hwfn)))
+	if (!test_bit(QED_MF_LLH_PROTO_CLSS, &p_hwfn->cdev->mf_bits))
 		return;
 
 	switch (type) {
diff --git a/drivers/net/ethernet/qlogic/qed/qed_ll2.c b/drivers/net/ethernet/qlogic/qed/qed_ll2.c
index 74fc626..3ac0ede 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_ll2.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_ll2.c
@@ -922,9 +922,9 @@ static int qed_sp_ll2_rx_queue_start(struct qed_hwfn *p_hwfn,
 	p_ramrod->queue_id = p_ll2_conn->queue_id;
 	p_ramrod->main_func_queue = p_ll2_conn->main_func_queue ? 1 : 0;
 
-	if ((IS_MF_DEFAULT(p_hwfn) || IS_MF_SI(p_hwfn)) &&
-	    p_ramrod->main_func_queue && (conn_type != QED_LL2_TYPE_ROCE) &&
-	    (conn_type != QED_LL2_TYPE_IWARP)) {
+	if (test_bit(QED_MF_LL2_NON_UNICAST, &p_hwfn->cdev->mf_bits) &&
+	    p_ramrod->main_func_queue && conn_type != QED_LL2_TYPE_ROCE &&
+	    conn_type != QED_LL2_TYPE_IWARP) {
 		p_ramrod->mf_si_bcast_accept_all = 1;
 		p_ramrod->mf_si_mcast_accept_all = 1;
 	} else {
diff --git a/drivers/net/ethernet/qlogic/qed/qed_main.c b/drivers/net/ethernet/qlogic/qed/qed_main.c
index d1d3787..307fe33 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_main.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_main.c
@@ -264,7 +264,8 @@ int qed_fill_dev_info(struct qed_dev *cdev,
 	dev_info->pci_mem_end = cdev->pci_params.mem_end;
 	dev_info->pci_irq = cdev->pci_params.irq;
 	dev_info->rdma_supported = QED_IS_RDMA_PERSONALITY(p_hwfn);
-	dev_info->is_mf_default = IS_MF_DEFAULT(&cdev->hwfns[0]);
+	dev_info->is_mf_default = !test_bit(QED_MF_LLH_MAC_CLSS,
+					    &cdev->mf_bits);
 	dev_info->dev_type = cdev->type;
 	ether_addr_copy(dev_info->hw_mac, hw_info->hw_mac_addr);
 
@@ -273,7 +274,8 @@ int qed_fill_dev_info(struct qed_dev *cdev,
 		dev_info->fw_minor = FW_MINOR_VERSION;
 		dev_info->fw_rev = FW_REVISION_VERSION;
 		dev_info->fw_eng = FW_ENGINEERING_VERSION;
-		dev_info->mf_mode = cdev->mf_mode;
+		dev_info->b_inter_pf_switch = test_bit(QED_MF_INTER_PF_SWITCH,
+						       &cdev->mf_bits);
 		dev_info->tx_switching = true;
 
 		if (hw_info->b_wol_support == QED_WOL_SUPPORT_PME)
diff --git a/drivers/net/ethernet/qlogic/qed/qed_sp.h b/drivers/net/ethernet/qlogic/qed/qed_sp.h
index ab4ad8a..7680222 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_sp.h
+++ b/drivers/net/ethernet/qlogic/qed/qed_sp.h
@@ -416,7 +416,6 @@ int qed_sp_init_request(struct qed_hwfn *p_hwfn,
  * @param p_hwfn
  * @param p_ptt
  * @param p_tunn
- * @param mode
  * @param allow_npar_tx_switch
  *
  * @return int
@@ -425,7 +424,7 @@ int qed_sp_init_request(struct qed_hwfn *p_hwfn,
 int qed_sp_pf_start(struct qed_hwfn *p_hwfn,
 		    struct qed_ptt *p_ptt,
 		    struct qed_tunnel_info *p_tunn,
-		    enum qed_mf_mode mode, bool allow_npar_tx_switch);
+		    bool allow_npar_tx_switch);
 
 /**
  * @brief qed_sp_pf_update - PF Function Update Ramrod
diff --git a/drivers/net/ethernet/qlogic/qed/qed_sp_commands.c b/drivers/net/ethernet/qlogic/qed/qed_sp_commands.c
index 5e927b6..fbb3172 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_sp_commands.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_sp_commands.c
@@ -306,7 +306,7 @@ static void qed_set_hw_tunn_mode_port(struct qed_hwfn *p_hwfn,
 int qed_sp_pf_start(struct qed_hwfn *p_hwfn,
 		    struct qed_ptt *p_ptt,
 		    struct qed_tunnel_info *p_tunn,
-		    enum qed_mf_mode mode, bool allow_npar_tx_switch)
+		    bool allow_npar_tx_switch)
 {
 	struct pf_start_ramrod_data *p_ramrod = NULL;
 	u16 sb = qed_int_get_sp_sb_id(p_hwfn);
@@ -339,18 +339,10 @@ int qed_sp_pf_start(struct qed_hwfn *p_hwfn,
 	p_ramrod->dont_log_ramrods	= 0;
 	p_ramrod->log_type_mask		= cpu_to_le16(0xf);
 
-	switch (mode) {
-	case QED_MF_DEFAULT:
-	case QED_MF_NPAR:
-		p_ramrod->mf_mode = MF_NPAR;
-		break;
-	case QED_MF_OVLAN:
+	if (test_bit(QED_MF_OVLAN_CLSS, &p_hwfn->cdev->mf_bits))
 		p_ramrod->mf_mode = MF_OVLAN;
-		break;
-	default:
-		DP_NOTICE(p_hwfn, "Unsupported MF mode, init as DEFAULT\n");
+	else
 		p_ramrod->mf_mode = MF_NPAR;
-	}
 
 	p_ramrod->outer_tag_config.outer_tag.tci =
 		cpu_to_le16(p_hwfn->hw_info.ovlan);
@@ -365,7 +357,7 @@ int qed_sp_pf_start(struct qed_hwfn *p_hwfn,
 
 	qed_tunn_set_pf_start_params(p_hwfn, p_tunn, &p_ramrod->tunnel_config);
 
-	if (IS_MF_SI(p_hwfn))
+	if (test_bit(QED_MF_INTER_PF_SWITCH, &p_hwfn->cdev->mf_bits))
 		p_ramrod->allow_npar_tx_switching = allow_npar_tx_switch;
 
 	switch (p_hwfn->hw_info.personality) {
diff --git a/drivers/net/ethernet/qlogic/qede/qede_main.c b/drivers/net/ethernet/qlogic/qede/qede_main.c
index a01e7d6..89c581c 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_main.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_main.c
@@ -199,7 +199,7 @@ static int qede_sriov_configure(struct pci_dev *pdev, int num_vfs_param)
 
 	/* Enable/Disable Tx switching for PF */
 	if ((rc == num_vfs_param) && netif_running(edev->ndev) &&
-	    qed_info->mf_mode != QED_MF_NPAR && qed_info->tx_switching) {
+	    !qed_info->b_inter_pf_switch && qed_info->tx_switching) {
 		vport_params->vport_id = 0;
 		vport_params->update_tx_switching_flg = 1;
 		vport_params->tx_switching_flg = num_vfs_param ? 1 : 0;
@@ -1928,7 +1928,7 @@ static int qede_start_queues(struct qede_dev *edev, bool clear_stats)
 	vport_update_params->update_vport_active_flg = 1;
 	vport_update_params->vport_active_flg = 1;
 
-	if ((qed_info->mf_mode == QED_MF_NPAR || pci_num_vf(edev->pdev)) &&
+	if ((qed_info->b_inter_pf_switch || pci_num_vf(edev->pdev)) &&
 	    qed_info->tx_switching) {
 		vport_update_params->update_tx_switching_flg = 1;
 		vport_update_params->tx_switching_flg = 1;
diff --git a/include/linux/qed/qed_if.h b/include/linux/qed/qed_if.h
index e53f9c7..5dac561 100644
--- a/include/linux/qed/qed_if.h
+++ b/include/linux/qed/qed_if.h
@@ -359,7 +359,7 @@ struct qed_dev_info {
 #define QED_MFW_VERSION_3_OFFSET	24
 
 	u32		flash_size;
-	u8		mf_mode;
+	bool		b_inter_pf_switch;
 	bool		tx_switching;
 	bool		rdma_supported;
 	u16		mtu;
-- 
1.8.3.1

^ permalink raw reply related

* Re: [PATCH net] net/tls: Don't recursively call push_record during tls_write_space callbacks
From: Andre Tomt @ 2018-05-06  1:06 UTC (permalink / raw)
  To: Dave Watson, David S. Miller, netdev, borisp, Aviad Yehezkel
In-Reply-To: <20180501200539.GA69207@davejwatson-mba.dhcp.thefacebook.com>

On 01. mai 2018 22:05, Dave Watson wrote:
> It is reported that in some cases, write_space may be called in
> do_tcp_sendpages, such that we recursively invoke do_tcp_sendpages again:
> 
> [  660.468802]  ? do_tcp_sendpages+0x8d/0x580
> [  660.468826]  ? tls_push_sg+0x74/0x130 [tls]
> [  660.468852]  ? tls_push_record+0x24a/0x390 [tls]
> [  660.468880]  ? tls_write_space+0x6a/0x80 [tls]
> ...
> 
> tls_push_sg already does a loop over all sending sg's, so ignore
> any tls_write_space notifications until we are done sending.
> We then have to call the previous write_space to wake up
> poll() waiters after we are done with the send loop.
> 
> Reported-by: Andre Tomt <andre@tomt.net>
> Signed-off-by: Dave Watson <davejwatson@fb.com>

Unfortunately it seems like this patch has a bug, while it fixed the 
kernel crashing it is causing some connections in my testbed to stall.

Making sure ctx->in_tcp_sendpages is also cleared before the return ret 
within the while(1) loop seems to fix it for me.


diff -Naurp a/net/tls/tls_main.c b/net/tls/tls_main.c
--- a/net/tls/tls_main.c	2018-05-06 02:23:41.638597066 +0200
+++ b/net/tls/tls_main.c	2018-05-06 01:59:14.378568139 +0200
@@ -135,6 +135,7 @@ retry:
  			offset -= sg->offset;
  			ctx->partially_sent_offset = offset;
  			ctx->partially_sent_record = (void *)sg;
+			ctx->in_tcp_sendpages = false;
  			return ret;
  		}

^ permalink raw reply

* BUG: please report to dccp@vger.kernel.org => prev = 0, last = 0 at net/dccp/ccids/lib/packet_history.c:LINE/tfrc_rx_his
From: syzbot @ 2018-05-06  0:57 UTC (permalink / raw)
  To: davem, dccp, garsilva, gerrit, linux-kernel, netdev,
	syzkaller-bugs

Hello,

syzbot found the following crash on:

HEAD commit:    c1c07416cdd4 Merge tag 'kbuild-fixes-v4.17' of git://git.k..
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=13d5de47800000
kernel config:  https://syzkaller.appspot.com/x/.config?x=5a1dc06635c10d27
dashboard link: https://syzkaller.appspot.com/bug?extid=99858724c0ba555a12ea
compiler:       gcc (GCC) 8.0.1 20180413 (experimental)
syzkaller repro:https://syzkaller.appspot.com/x/repro.syz?x=170afde7800000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=141b4be7800000

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+99858724c0ba555a12ea@syzkaller.appspotmail.com

random: sshd: uninitialized urandom read (32 bytes read)
random: sshd: uninitialized urandom read (32 bytes read)
random: sshd: uninitialized urandom read (32 bytes read)
random: sshd: uninitialized urandom read (32 bytes read)
BUG: please report to dccp@vger.kernel.org => prev = 0, last = 0 at  
net/dccp/ccids/lib/packet_history.c:425/tfrc_rx_hist_sample_rtt()
CPU: 0 PID: 4495 Comm: syz-executor551 Not tainted 4.17.0-rc3+ #34
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Call Trace:
  <IRQ>
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x1b9/0x294 lib/dump_stack.c:113
  tfrc_rx_hist_sample_rtt.cold.3+0x54/0x5c  
net/dccp/ccids/lib/packet_history.c:422
  ccid3_hc_rx_packet_recv+0x5c8/0xed0 net/dccp/ccids/ccid3.c:765
  ccid_hc_rx_packet_recv net/dccp/ccid.h:185 [inline]
  dccp_deliver_input_to_ccids+0xf0/0x280 net/dccp/input.c:180
  dccp_rcv_established+0x87/0xb0 net/dccp/input.c:378
  dccp_v4_do_rcv+0x153/0x180 net/dccp/ipv4.c:654
  sk_backlog_rcv include/net/sock.h:909 [inline]
  __sk_receive_skb+0x3a2/0xd60 net/core/sock.c:513
  dccp_v4_rcv+0x10e5/0x1f3f net/dccp/ipv4.c:875
  ip_local_deliver_finish+0x2e3/0xd80 net/ipv4/ip_input.c:215
  NF_HOOK include/linux/netfilter.h:288 [inline]
  ip_local_deliver+0x1e1/0x720 net/ipv4/ip_input.c:256
  dst_input include/net/dst.h:450 [inline]
  ip_rcv_finish+0x81b/0x2200 net/ipv4/ip_input.c:396
  NF_HOOK include/linux/netfilter.h:288 [inline]
  ip_rcv+0xb70/0x143d net/ipv4/ip_input.c:492
  __netif_receive_skb_core+0x26f5/0x3630 net/core/dev.c:4592
  __netif_receive_skb+0x2c/0x1e0 net/core/dev.c:4657
  process_backlog+0x219/0x760 net/core/dev.c:5337
  napi_poll net/core/dev.c:5735 [inline]
  net_rx_action+0x7b7/0x1930 net/core/dev.c:5801
  __do_softirq+0x2e0/0xaf5 kernel/softirq.c:285
  do_softirq_own_stack+0x2a/0x40 arch/x86/entry/entry_64.S:1046
  </IRQ>
  do_softirq.part.17+0x14d/0x190 kernel/softirq.c:329
  do_softirq arch/x86/include/asm/preempt.h:23 [inline]
  __local_bh_enable_ip+0x1ec/0x230 kernel/softirq.c:182
  local_bh_enable include/linux/bottom_half.h:32 [inline]
  rcu_read_unlock_bh include/linux/rcupdate.h:728 [inline]
  ip_finish_output2+0xab2/0x1840 net/ipv4/ip_output.c:231
  ip_finish_output+0x828/0xf80 net/ipv4/ip_output.c:317
  NF_HOOK_COND include/linux/netfilter.h:277 [inline]
  ip_output+0x21b/0x850 net/ipv4/ip_output.c:405
  dst_output include/net/dst.h:444 [inline]
  ip_local_out+0xc5/0x1b0 net/ipv4/ip_output.c:124
  ip_queue_xmit+0x9d7/0x1f70 net/ipv4/ip_output.c:504
  dccp_transmit_skb+0x999/0x12e0 net/dccp/output.c:142
  dccp_xmit_packet+0x250/0x790 net/dccp/output.c:281
  dccp_write_xmit+0x190/0x1f0 net/dccp/output.c:363
  dccp_sendmsg+0x8c7/0x1020 net/dccp/proto.c:818
  inet_sendmsg+0x19f/0x690 net/ipv4/af_inet.c:798
  sock_sendmsg_nosec net/socket.c:629 [inline]
  sock_sendmsg+0xd5/0x120 net/socket.c:639
  ___sys_sendmsg+0x525/0x940 net/socket.c:2117
  __sys_sendmmsg+0x240/0x6f0 net/socket.c:2212
  __do_sys_sendmmsg net/socket.c:2241 [inline]
  __se_sys_sendmmsg net/socket.c:2238 [inline]
  __x64_sys_sendmmsg+0x9d/0x100 net/socket.c:2238
  do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x445d09
RSP: 002b:00007f3c7eff5d88 EFLAGS: 00000293 ORIG_RAX: 0000000000000133
RAX: ffffffffffffffda RBX: 00000000006dac40 RCX: 0000000000445d09
RDX: 0000000000000001 RSI: 000000


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report.
If you forgot to add the Reported-by tag, once the fix for this bug is  
merged
into any tree, please reply to this email with:
#syz fix: exact-commit-title
If you want to test a patch for this bug, please reply with:
#syz test: git://repo/address.git branch
and provide the patch inline or as an attachment.
To mark this as a duplicate of another syzbot report, please reply with:
#syz dup: exact-subject-of-another-report
If it's a one-off invalid bug report, please reply with:
#syz invalid
Note: if the crash happens again, it will cause creation of a new bug  
report.
Note: all commands must start from beginning of the line in the email body.

^ permalink raw reply

* [PATCH bpf-next v5 4/4] bpf: bpftool, support for sockhash
From: John Fastabend @ 2018-05-05 23:25 UTC (permalink / raw)
  To: borkmann, ast; +Cc: netdev, John Fastabend
In-Reply-To: <1525562710-11603-1-git-send-email-john.fastabend@gmail.com>

This adds the SOCKHASH map type to bpftools so that we get correct
pretty printing.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
---
 tools/bpf/bpftool/map.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c
index af6766e..097b1a5 100644
--- a/tools/bpf/bpftool/map.c
+++ b/tools/bpf/bpftool/map.c
@@ -66,6 +66,7 @@
 	[BPF_MAP_TYPE_DEVMAP]		= "devmap",
 	[BPF_MAP_TYPE_SOCKMAP]		= "sockmap",
 	[BPF_MAP_TYPE_CPUMAP]		= "cpumap",
+	[BPF_MAP_TYPE_SOCKHASH]		= "sockhash",
 };
 
 static bool map_is_per_cpu(__u32 type)
-- 
1.9.1

^ permalink raw reply related

* [PATCH bpf-next v5 3/4] bpf: selftest additions for SOCKHASH
From: John Fastabend @ 2018-05-05 23:25 UTC (permalink / raw)
  To: borkmann, ast; +Cc: netdev, John Fastabend
In-Reply-To: <1525562710-11603-1-git-send-email-john.fastabend@gmail.com>

This runs existing SOCKMAP tests with SOCKHASH map type. To do this
we push programs into include file and build two BPF programs. One
for SOCKHASH and one for SOCKMAP.

We then run the entire test suite with each type.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
---
 tools/include/uapi/linux/bpf.h                     | 52 +++++++++++++++++++++-
 tools/testing/selftests/bpf/Makefile               |  2 +-
 tools/testing/selftests/bpf/test_sockhash_kern.c   |  4 ++
 tools/testing/selftests/bpf/test_sockmap.c         | 27 ++++++++---
 .../{test_sockmap_kern.c => test_sockmap_kern.h}   | 10 ++---
 5 files changed, 81 insertions(+), 14 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_sockhash_kern.c
 rename tools/testing/selftests/bpf/{test_sockmap_kern.c => test_sockmap_kern.h} (97%)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 83a95ae..f3d8d05 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -116,6 +116,7 @@ enum bpf_map_type {
 	BPF_MAP_TYPE_DEVMAP,
 	BPF_MAP_TYPE_SOCKMAP,
 	BPF_MAP_TYPE_CPUMAP,
+	BPF_MAP_TYPE_SOCKHASH,
 };
 
 enum bpf_prog_type {
@@ -1825,6 +1826,52 @@ struct bpf_stack_build_id {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
+ * int bpf_sock_hash_update(struct bpf_sock_ops_kern *skops, struct bpf_map *map, void *key, u64 flags)
+ *	Description
+ *		Add an entry to, or update a sockhash *map* referencing sockets.
+ *		The *skops* is used as a new value for the entry associated to
+ *		*key*. *flags* is one of:
+ *
+ *		**BPF_NOEXIST**
+ *			The entry for *key* must not exist in the map.
+ *		**BPF_EXIST**
+ *			The entry for *key* must already exist in the map.
+ *		**BPF_ANY**
+ *			No condition on the existence of the entry for *key*.
+ *
+ *		If the *map* has eBPF programs (parser and verdict), those will
+ *		be inherited by the socket being added. If the socket is
+ *		already attached to eBPF programs, this results in an error.
+ *	Return
+ *		0 on success, or a negative error in case of failure.
+ *
+ * int bpf_msg_redirect_hash(struct sk_msg_buff *msg, struct bpf_map *map, void *key, u64 flags)
+ *	Description
+ *		This helper is used in programs implementing policies at the
+ *		socket level. If the message *msg* is allowed to pass (i.e. if
+ *		the verdict eBPF program returns **SK_PASS**), redirect it to
+ *		the socket referenced by *map* (of type
+ *		**BPF_MAP_TYPE_SOCKHASH**) using hash *key*. Both ingress and
+ *		egress interfaces can be used for redirection. The
+ *		**BPF_F_INGRESS** value in *flags* is used to make the
+ *		distinction (ingress path is selected if the flag is present,
+ *		egress path otherwise). This is the only flag supported for now.
+ *	Return
+ *		**SK_PASS** on success, or **SK_DROP** on error.
+ *
+ * int bpf_sk_redirect_hash(struct sk_buff *skb, struct bpf_map *map, void *key, u64 flags)
+ *	Description
+ *		This helper is used in programs implementing policies at the
+ *		skb socket level. If the sk_buff *skb* is allowed to pass (i.e.
+ *		if the verdeict eBPF program returns **SK_PASS**), redirect it
+ *		to the socket referenced by *map* (of type
+ *		**BPF_MAP_TYPE_SOCKHASH**) using hash *key*. Both ingress and
+ *		egress interfaces can be used for redirection. The
+ *		**BPF_F_INGRESS** value in *flags* is used to make the
+ *		distinction (ingress path is selected if the flag is present,
+ *		egress otherwise). This is the only flag supported for now.
+ *	Return
+ *		**SK_PASS** on success, or **SK_DROP** on error.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -1895,7 +1942,10 @@ struct bpf_stack_build_id {
 	FN(xdp_adjust_tail),		\
 	FN(skb_get_xfrm_state),		\
 	FN(get_stack),			\
-	FN(skb_load_bytes_relative),
+	FN(skb_load_bytes_relative),	\
+	FN(sock_hash_update),		\
+	FN(msg_redirect_hash),		\
+	FN(sk_redirect_hash),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 9d76218..28316f1 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -33,7 +33,7 @@ TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o test
 	sample_map_ret0.o test_tcpbpf_kern.o test_stacktrace_build_id.o \
 	sockmap_tcp_msg_prog.o connect4_prog.o connect6_prog.o test_adjust_tail.o \
 	test_btf_haskv.o test_btf_nokv.o test_sockmap_kern.o test_tunnel_kern.o \
-	test_get_stack_rawtp.o
+	test_get_stack_rawtp.o test_sockmap_kern.o test_sockhash_kern.o
 
 # Order correspond to 'make run_tests' order
 TEST_PROGS := test_kmod.sh \
diff --git a/tools/testing/selftests/bpf/test_sockhash_kern.c b/tools/testing/selftests/bpf/test_sockhash_kern.c
new file mode 100644
index 0000000..3bf4ad4
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_sockhash_kern.c
@@ -0,0 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2018 Covalent IO, Inc. http://covalent.io
+#define TEST_MAP_TYPE BPF_MAP_TYPE_SOCKHASH
+#include "./test_sockmap_kern.h"
diff --git a/tools/testing/selftests/bpf/test_sockmap.c b/tools/testing/selftests/bpf/test_sockmap.c
index 29c022d..df7afc7 100644
--- a/tools/testing/selftests/bpf/test_sockmap.c
+++ b/tools/testing/selftests/bpf/test_sockmap.c
@@ -47,7 +47,8 @@
 #define S1_PORT 10000
 #define S2_PORT 10001
 
-#define BPF_FILENAME "test_sockmap_kern.o"
+#define BPF_SOCKMAP_FILENAME "test_sockmap_kern.o"
+#define BPF_SOCKHASH_FILENAME "test_sockmap_kern.o"
 #define CG_PATH "/sockmap"
 
 /* global sockets */
@@ -1260,9 +1261,8 @@ static int test_start_end(int cgrp)
 	BPF_PROG_TYPE_SK_MSG,
 };
 
-static int populate_progs(void)
+static int populate_progs(char *bpf_file)
 {
-	char *bpf_file = BPF_FILENAME;
 	struct bpf_program *prog;
 	struct bpf_object *obj;
 	int i = 0;
@@ -1306,11 +1306,11 @@ static int populate_progs(void)
 	return 0;
 }
 
-static int test_suite(void)
+static int __test_suite(char *bpf_file)
 {
 	int cg_fd, err;
 
-	err = populate_progs();
+	err = populate_progs(bpf_file);
 	if (err < 0) {
 		fprintf(stderr, "ERROR: (%i) load bpf failed\n", err);
 		return err;
@@ -1347,17 +1347,30 @@ static int test_suite(void)
 
 out:
 	printf("Summary: %i PASSED %i FAILED\n", passed, failed);
+	cleanup_cgroup_environment();
 	close(cg_fd);
 	return err;
 }
 
+static int test_suite(void)
+{
+	int err;
+
+	err = __test_suite(BPF_SOCKMAP_FILENAME);
+	if (err)
+		goto out;
+	err = __test_suite(BPF_SOCKHASH_FILENAME);
+out:
+	return err;
+}
+
 int main(int argc, char **argv)
 {
 	struct rlimit r = {10 * 1024 * 1024, RLIM_INFINITY};
 	int iov_count = 1, length = 1024, rate = 1;
 	struct sockmap_options options = {0};
 	int opt, longindex, err, cg_fd = 0;
-	char *bpf_file = BPF_FILENAME;
+	char *bpf_file = BPF_SOCKMAP_FILENAME;
 	int test = PING_PONG;
 
 	if (setrlimit(RLIMIT_MEMLOCK, &r)) {
@@ -1438,7 +1451,7 @@ int main(int argc, char **argv)
 		return -1;
 	}
 
-	err = populate_progs();
+	err = populate_progs(bpf_file);
 	if (err) {
 		fprintf(stderr, "populate program: (%s) %s\n",
 			bpf_file, strerror(errno));
diff --git a/tools/testing/selftests/bpf/test_sockmap_kern.c b/tools/testing/selftests/bpf/test_sockmap_kern.h
similarity index 97%
rename from tools/testing/selftests/bpf/test_sockmap_kern.c
rename to tools/testing/selftests/bpf/test_sockmap_kern.h
index 33de97e..2cc57a1 100644
--- a/tools/testing/selftests/bpf/test_sockmap_kern.c
+++ b/tools/testing/selftests/bpf/test_sockmap_kern.h
@@ -1,5 +1,5 @@
-// SPDX-License-Identifier: GPL-2.0
-// Copyright (c) 2017-2018 Covalent IO, Inc. http://covalent.io
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2017-2018 Covalent IO, Inc. http://covalent.io */
 #include <stddef.h>
 #include <string.h>
 #include <linux/bpf.h>
@@ -36,21 +36,21 @@
 })
 
 struct bpf_map_def SEC("maps") sock_map = {
-	.type = BPF_MAP_TYPE_SOCKMAP,
+	.type = TEST_MAP_TYPE,
 	.key_size = sizeof(int),
 	.value_size = sizeof(int),
 	.max_entries = 20,
 };
 
 struct bpf_map_def SEC("maps") sock_map_txmsg = {
-	.type = BPF_MAP_TYPE_SOCKMAP,
+	.type = TEST_MAP_TYPE,
 	.key_size = sizeof(int),
 	.value_size = sizeof(int),
 	.max_entries = 20,
 };
 
 struct bpf_map_def SEC("maps") sock_map_redir = {
-	.type = BPF_MAP_TYPE_SOCKMAP,
+	.type = TEST_MAP_TYPE,
 	.key_size = sizeof(int),
 	.value_size = sizeof(int),
 	.max_entries = 20,
-- 
1.9.1

^ permalink raw reply related

* [PATCH bpf-next v5 2/4] bpf: sockmap, add hash map support
From: John Fastabend @ 2018-05-05 23:25 UTC (permalink / raw)
  To: borkmann, ast; +Cc: netdev, John Fastabend
In-Reply-To: <1525562710-11603-1-git-send-email-john.fastabend@gmail.com>

Sockmap is currently backed by an array and enforces keys to be
four bytes. This works well for many use cases and was originally
modeled after devmap which also uses four bytes keys. However,
this has become limiting in larger use cases where a hash would
be more appropriate. For example users may want to use the 5-tuple
of the socket as the lookup key.

To support this add hash support.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
---
 include/linux/bpf.h       |   8 +
 include/linux/bpf_types.h |   1 +
 include/uapi/linux/bpf.h  |  52 ++++-
 kernel/bpf/core.c         |   1 +
 kernel/bpf/sockmap.c      | 494 ++++++++++++++++++++++++++++++++++++++++++++--
 kernel/bpf/verifier.c     |  14 +-
 net/core/filter.c         |  58 ++++++
 7 files changed, 610 insertions(+), 18 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 321969d..6dbe1aa 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -668,6 +668,7 @@ static inline void bpf_map_offload_map_free(struct bpf_map *map)
 
 #if defined(CONFIG_STREAM_PARSER) && defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_INET)
 struct sock  *__sock_map_lookup_elem(struct bpf_map *map, u32 key);
+struct sock  *__sock_hash_lookup_elem(struct bpf_map *map, void *key);
 int sock_map_prog(struct bpf_map *map, struct bpf_prog *prog, u32 type);
 #else
 static inline struct sock  *__sock_map_lookup_elem(struct bpf_map *map, u32 key)
@@ -675,6 +676,12 @@ static inline struct sock  *__sock_map_lookup_elem(struct bpf_map *map, u32 key)
 	return NULL;
 }
 
+static inline struct sock  *__sock_hash_lookup_elem(struct bpf_map *map,
+						    void *key)
+{
+	return NULL;
+}
+
 static inline int sock_map_prog(struct bpf_map *map,
 				struct bpf_prog *prog,
 				u32 type)
@@ -724,6 +731,7 @@ static inline void __xsk_map_flush(struct bpf_map *map)
 extern const struct bpf_func_proto bpf_get_stackid_proto;
 extern const struct bpf_func_proto bpf_get_stack_proto;
 extern const struct bpf_func_proto bpf_sock_map_update_proto;
+extern const struct bpf_func_proto bpf_sock_hash_update_proto;
 
 /* Shared helpers among cBPF and eBPF. */
 void bpf_user_rnd_init_once(void);
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index d7df1b32..b67f879 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -47,6 +47,7 @@
 BPF_MAP_TYPE(BPF_MAP_TYPE_DEVMAP, dev_map_ops)
 #if defined(CONFIG_STREAM_PARSER) && defined(CONFIG_INET)
 BPF_MAP_TYPE(BPF_MAP_TYPE_SOCKMAP, sock_map_ops)
+BPF_MAP_TYPE(BPF_MAP_TYPE_SOCKHASH, sock_hash_ops)
 #endif
 BPF_MAP_TYPE(BPF_MAP_TYPE_CPUMAP, cpu_map_ops)
 #if defined(CONFIG_XDP_SOCKETS)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 93d5a4e..98c0e8c 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -117,6 +117,7 @@ enum bpf_map_type {
 	BPF_MAP_TYPE_SOCKMAP,
 	BPF_MAP_TYPE_CPUMAP,
 	BPF_MAP_TYPE_XSKMAP,
+	BPF_MAP_TYPE_SOCKHASH,
 };
 
 enum bpf_prog_type {
@@ -1826,6 +1827,52 @@ struct bpf_stack_build_id {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
+ * int bpf_sock_hash_update(struct bpf_sock_ops_kern *skops, struct bpf_map *map, void *key, u64 flags)
+ *	Description
+ *		Add an entry to, or update a sockhash *map* referencing sockets.
+ *		The *skops* is used as a new value for the entry associated to
+ *		*key*. *flags* is one of:
+ *
+ *		**BPF_NOEXIST**
+ *			The entry for *key* must not exist in the map.
+ *		**BPF_EXIST**
+ *			The entry for *key* must already exist in the map.
+ *		**BPF_ANY**
+ *			No condition on the existence of the entry for *key*.
+ *
+ *		If the *map* has eBPF programs (parser and verdict), those will
+ *		be inherited by the socket being added. If the socket is
+ *		already attached to eBPF programs, this results in an error.
+ *	Return
+ *		0 on success, or a negative error in case of failure.
+ *
+ * int bpf_msg_redirect_hash(struct sk_msg_buff *msg, struct bpf_map *map, void *key, u64 flags)
+ *	Description
+ *		This helper is used in programs implementing policies at the
+ *		socket level. If the message *msg* is allowed to pass (i.e. if
+ *		the verdict eBPF program returns **SK_PASS**), redirect it to
+ *		the socket referenced by *map* (of type
+ *		**BPF_MAP_TYPE_SOCKHASH**) using hash *key*. Both ingress and
+ *		egress interfaces can be used for redirection. The
+ *		**BPF_F_INGRESS** value in *flags* is used to make the
+ *		distinction (ingress path is selected if the flag is present,
+ *		egress path otherwise). This is the only flag supported for now.
+ *	Return
+ *		**SK_PASS** on success, or **SK_DROP** on error.
+ *
+ * int bpf_sk_redirect_hash(struct sk_buff *skb, struct bpf_map *map, void *key, u64 flags)
+ *	Description
+ *		This helper is used in programs implementing policies at the
+ *		skb socket level. If the sk_buff *skb* is allowed to pass (i.e.
+ *		if the verdeict eBPF program returns **SK_PASS**), redirect it
+ *		to the socket referenced by *map* (of type
+ *		**BPF_MAP_TYPE_SOCKHASH**) using hash *key*. Both ingress and
+ *		egress interfaces can be used for redirection. The
+ *		**BPF_F_INGRESS** value in *flags* is used to make the
+ *		distinction (ingress path is selected if the flag is present,
+ *		egress otherwise). This is the only flag supported for now.
+ *	Return
+ *		**SK_PASS** on success, or **SK_DROP** on error.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -1896,7 +1943,10 @@ struct bpf_stack_build_id {
 	FN(xdp_adjust_tail),		\
 	FN(skb_get_xfrm_state),		\
 	FN(get_stack),			\
-	FN(skb_load_bytes_relative),
+	FN(skb_load_bytes_relative),	\
+	FN(sock_hash_update),		\
+	FN(msg_redirect_hash),		\
+	FN(sk_redirect_hash),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index d0d7d94..2194c6a 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -1707,6 +1707,7 @@ void bpf_user_rnd_init_once(void)
 const struct bpf_func_proto bpf_get_current_uid_gid_proto __weak;
 const struct bpf_func_proto bpf_get_current_comm_proto __weak;
 const struct bpf_func_proto bpf_sock_map_update_proto __weak;
+const struct bpf_func_proto bpf_sock_hash_update_proto __weak;
 
 const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void)
 {
diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 4eef5b1..306eb5d 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -60,6 +60,28 @@ struct bpf_stab {
 	struct bpf_sock_progs progs;
 };
 
+struct bucket {
+	struct hlist_head head;
+	raw_spinlock_t lock;
+};
+
+struct bpf_htab {
+	struct bpf_map map;
+	struct bucket *buckets;
+	atomic_t count;
+	u32 n_buckets;
+	u32 elem_size;
+	struct bpf_sock_progs progs;
+};
+
+struct htab_elem {
+	struct rcu_head rcu;
+	struct hlist_node hash_node;
+	u32 hash;
+	struct sock *sk;
+	char key[0];
+};
+
 enum smap_psock_state {
 	SMAP_TX_RUNNING,
 };
@@ -67,6 +89,8 @@ enum smap_psock_state {
 struct smap_psock_map_entry {
 	struct list_head list;
 	struct sock **entry;
+	struct htab_elem *hash_link;
+	struct bpf_htab *htab;
 };
 
 struct smap_psock {
@@ -195,6 +219,12 @@ static void bpf_tcp_release(struct sock *sk)
 	rcu_read_unlock();
 }
 
+static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
+{
+	atomic_dec(&htab->count);
+	kfree_rcu(l, rcu);
+}
+
 static void bpf_tcp_close(struct sock *sk, long timeout)
 {
 	void (*close_fun)(struct sock *sk, long timeout);
@@ -231,10 +261,16 @@ static void bpf_tcp_close(struct sock *sk, long timeout)
 	}
 
 	list_for_each_entry_safe(e, tmp, &psock->maps, list) {
-		osk = cmpxchg(e->entry, sk, NULL);
-		if (osk == sk) {
-			list_del(&e->list);
-			smap_release_sock(psock, sk);
+		if (e->entry) {
+			osk = cmpxchg(e->entry, sk, NULL);
+			if (osk == sk) {
+				list_del(&e->list);
+				smap_release_sock(psock, sk);
+			}
+		} else {
+			hlist_del_rcu(&e->hash_link->hash_node);
+			smap_release_sock(psock, e->hash_link->sk);
+			free_htab_elem(e->htab, e->hash_link);
 		}
 	}
 	write_unlock_bh(&sk->sk_callback_lock);
@@ -1526,12 +1562,14 @@ static struct bpf_map *sock_map_alloc(union bpf_attr *attr)
 	return ERR_PTR(err);
 }
 
-static void smap_list_remove(struct smap_psock *psock, struct sock **entry)
+static void smap_list_remove(struct smap_psock *psock,
+			     struct sock **entry,
+			     struct htab_elem *hash_link)
 {
 	struct smap_psock_map_entry *e, *tmp;
 
 	list_for_each_entry_safe(e, tmp, &psock->maps, list) {
-		if (e->entry == entry) {
+		if (e->entry == entry || e->hash_link == hash_link) {
 			list_del(&e->list);
 			break;
 		}
@@ -1569,7 +1607,7 @@ static void sock_map_free(struct bpf_map *map)
 		 * to be null and queued for garbage collection.
 		 */
 		if (likely(psock)) {
-			smap_list_remove(psock, &stab->sock_map[i]);
+			smap_list_remove(psock, &stab->sock_map[i], NULL);
 			smap_release_sock(psock, sock);
 		}
 		write_unlock_bh(&sock->sk_callback_lock);
@@ -1628,7 +1666,7 @@ static int sock_map_delete_elem(struct bpf_map *map, void *key)
 
 	if (psock->bpf_parse)
 		smap_stop_sock(psock, sock);
-	smap_list_remove(psock, &stab->sock_map[k]);
+	smap_list_remove(psock, &stab->sock_map[k], NULL);
 	smap_release_sock(psock, sock);
 out:
 	write_unlock_bh(&sock->sk_callback_lock);
@@ -1745,10 +1783,12 @@ static int __sock_map_ctx_update_elem(struct bpf_map *map,
 		new = true;
 	}
 
-	e = kzalloc(sizeof(*e), GFP_ATOMIC | __GFP_NOWARN);
-	if (!e) {
-		err = -ENOMEM;
-		goto out_progs;
+	if (map_link) {
+		e = kzalloc(sizeof(*e), GFP_ATOMIC | __GFP_NOWARN);
+		if (!e) {
+			err = -ENOMEM;
+			goto out_progs;
+		}
 	}
 
 	/* 3. At this point we have a reference to a valid psock that is
@@ -1782,6 +1822,7 @@ static int __sock_map_ctx_update_elem(struct bpf_map *map,
 	write_unlock_bh(&sock->sk_callback_lock);
 	return err;
 out_free:
+	kfree(e);
 	smap_release_sock(psock, sock);
 out_progs:
 	if (verdict)
@@ -1828,7 +1869,7 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
 		struct smap_psock *opsock = smap_psock_sk(osock);
 
 		write_lock_bh(&osock->sk_callback_lock);
-		smap_list_remove(opsock, &stab->sock_map[i]);
+		smap_list_remove(opsock, &stab->sock_map[i], NULL);
 		smap_release_sock(opsock, osock);
 		write_unlock_bh(&osock->sk_callback_lock);
 	}
@@ -1845,6 +1886,10 @@ int sock_map_prog(struct bpf_map *map, struct bpf_prog *prog, u32 type)
 		struct bpf_stab *stab = container_of(map, struct bpf_stab, map);
 
 		progs = &stab->progs;
+	} else if (map->map_type == BPF_MAP_TYPE_SOCKHASH) {
+		struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+
+		progs = &htab->progs;
 	} else {
 		return -EINVAL;
 	}
@@ -1905,11 +1950,19 @@ static int sock_map_update_elem(struct bpf_map *map,
 
 static void sock_map_release(struct bpf_map *map)
 {
-	struct bpf_stab *stab = container_of(map, struct bpf_stab, map);
 	struct bpf_sock_progs *progs;
 	struct bpf_prog *orig;
 
-	progs = &stab->progs;
+	if (map->map_type == BPF_MAP_TYPE_SOCKMAP) {
+		struct bpf_stab *stab = container_of(map, struct bpf_stab, map);
+
+		progs = &stab->progs;
+	} else {
+		struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+
+		progs = &htab->progs;
+	}
+
 	orig = xchg(&progs->bpf_parse, NULL);
 	if (orig)
 		bpf_prog_put(orig);
@@ -1922,6 +1975,390 @@ static void sock_map_release(struct bpf_map *map)
 		bpf_prog_put(orig);
 }
 
+static struct bpf_map *sock_hash_alloc(union bpf_attr *attr)
+{
+	struct bpf_htab *htab;
+	int i, err;
+	u64 cost;
+
+	if (!capable(CAP_NET_ADMIN))
+		return ERR_PTR(-EPERM);
+
+	/* check sanity of attributes */
+	if (attr->max_entries == 0 || attr->value_size != 4 ||
+	    attr->map_flags & ~SOCK_CREATE_FLAG_MASK)
+		return ERR_PTR(-EINVAL);
+
+	err = bpf_tcp_ulp_register();
+	if (err && err != -EEXIST)
+		return ERR_PTR(err);
+
+	htab = kzalloc(sizeof(*htab), GFP_USER);
+	if (!htab)
+		return ERR_PTR(-ENOMEM);
+
+	bpf_map_init_from_attr(&htab->map, attr);
+
+	htab->n_buckets = roundup_pow_of_two(htab->map.max_entries);
+	htab->elem_size = sizeof(struct htab_elem) +
+			  round_up(htab->map.key_size, 8);
+
+	if (htab->n_buckets == 0 ||
+	    htab->n_buckets > U32_MAX / sizeof(struct bucket))
+		goto free_htab;
+
+	cost = (u64) htab->n_buckets * sizeof(struct bucket) +
+	       (u64) htab->elem_size * htab->map.max_entries;
+
+	if (cost >= U32_MAX - PAGE_SIZE)
+		goto free_htab;
+
+	htab->map.pages = round_up(cost, PAGE_SIZE) >> PAGE_SHIFT;
+	err = bpf_map_precharge_memlock(htab->map.pages);
+	if (err)
+		goto free_htab;
+
+	err = -ENOMEM;
+	htab->buckets = bpf_map_area_alloc(
+				htab->n_buckets * sizeof(struct bucket),
+				htab->map.numa_node);
+	if (!htab->buckets)
+		goto free_htab;
+
+	for (i = 0; i < htab->n_buckets; i++) {
+		INIT_HLIST_HEAD(&htab->buckets[i].head);
+		raw_spin_lock_init(&htab->buckets[i].lock);
+	}
+
+	return &htab->map;
+free_htab:
+	kfree(htab);
+	return ERR_PTR(err);
+}
+
+static inline struct bucket *__select_bucket(struct bpf_htab *htab, u32 hash)
+{
+	return &htab->buckets[hash & (htab->n_buckets - 1)];
+}
+
+static inline struct hlist_head *select_bucket(struct bpf_htab *htab, u32 hash)
+{
+	return &__select_bucket(htab, hash)->head;
+}
+
+static void sock_hash_free(struct bpf_map *map)
+{
+	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+	int i;
+
+	synchronize_rcu();
+
+	/* At this point no update, lookup or delete operations can happen.
+	 * However, be aware we can still get a socket state event updates,
+	 * and data ready callabacks that reference the psock from sk_user_data
+	 * Also psock worker threads are still in-flight. So smap_release_sock
+	 * will only free the psock after cancel_sync on the worker threads
+	 * and a grace period expire to ensure psock is really safe to remove.
+	 */
+	rcu_read_lock();
+	for (i = 0; i < htab->n_buckets; i++) {
+		struct hlist_head *head = select_bucket(htab, i);
+		struct hlist_node *n;
+		struct htab_elem *l;
+
+		hlist_for_each_entry_safe(l, n, head, hash_node) {
+			struct sock *sock = l->sk;
+			struct smap_psock *psock;
+
+			hlist_del_rcu(&l->hash_node);
+			write_lock_bh(&sock->sk_callback_lock);
+			psock = smap_psock_sk(sock);
+			/* This check handles a racing sock event that can get
+			 * the sk_callback_lock before this case but after xchg
+			 * causing the refcnt to hit zero and sock user data
+			 * (psock) to be null and queued for garbage collection.
+			 */
+			if (likely(psock)) {
+				smap_list_remove(psock, NULL, l);
+				smap_release_sock(psock, sock);
+			}
+			write_unlock_bh(&sock->sk_callback_lock);
+			kfree(l);
+		}
+	}
+	rcu_read_unlock();
+	bpf_map_area_free(htab->buckets);
+	kfree(htab);
+}
+
+static struct htab_elem *alloc_sock_hash_elem(struct bpf_htab *htab,
+					      void *key, u32 key_size, u32 hash,
+					      struct sock *sk,
+					      struct htab_elem *old_elem)
+{
+	struct htab_elem *l_new;
+
+	if (atomic_inc_return(&htab->count) > htab->map.max_entries) {
+		if (!old_elem) {
+			atomic_dec(&htab->count);
+			return ERR_PTR(-E2BIG);
+		}
+	}
+	l_new = kmalloc_node(htab->elem_size, GFP_ATOMIC | __GFP_NOWARN,
+			     htab->map.numa_node);
+	if (!l_new)
+		return ERR_PTR(-ENOMEM);
+
+	memcpy(l_new->key, key, key_size);
+	l_new->sk = sk;
+	l_new->hash = hash;
+	return l_new;
+}
+
+static struct htab_elem *lookup_elem_raw(struct hlist_head *head,
+					 u32 hash, void *key, u32 key_size)
+{
+	struct htab_elem *l;
+
+	hlist_for_each_entry_rcu(l, head, hash_node) {
+		if (l->hash == hash && !memcmp(&l->key, key, key_size))
+			return l;
+	}
+
+	return NULL;
+}
+
+static inline u32 htab_map_hash(const void *key, u32 key_len)
+{
+	return jhash(key, key_len, 0);
+}
+
+static int sock_hash_get_next_key(struct bpf_map *map,
+				  void *key, void *next_key)
+{
+	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+	struct htab_elem *l, *next_l;
+	struct hlist_head *h;
+	u32 hash, key_size;
+	int i = 0;
+
+	WARN_ON_ONCE(!rcu_read_lock_held());
+
+	key_size = map->key_size;
+	if (!key)
+		goto find_first_elem;
+	hash = htab_map_hash(key, key_size);
+	h = select_bucket(htab, hash);
+
+	l = lookup_elem_raw(h, hash, key, key_size);
+	if (!l)
+		goto find_first_elem;
+	next_l = hlist_entry_safe(
+		     rcu_dereference_raw(hlist_next_rcu(&l->hash_node)),
+		     struct htab_elem, hash_node);
+	if (next_l) {
+		memcpy(next_key, next_l->key, key_size);
+		return 0;
+	}
+
+	/* no more elements in this hash list, go to the next bucket */
+	i = hash & (htab->n_buckets - 1);
+	i++;
+
+find_first_elem:
+	/* iterate over buckets */
+	for (; i < htab->n_buckets; i++) {
+		h = select_bucket(htab, i);
+
+		/* pick first element in the bucket */
+		next_l = hlist_entry_safe(
+				rcu_dereference_raw(hlist_first_rcu(h)),
+				struct htab_elem, hash_node);
+		if (next_l) {
+			/* if it's not empty, just return it */
+			memcpy(next_key, next_l->key, key_size);
+			return 0;
+		}
+	}
+
+	/* iterated over all buckets and all elements */
+	return -ENOENT;
+}
+
+static int sock_hash_ctx_update_elem(struct bpf_sock_ops_kern *skops,
+				     struct bpf_map *map,
+				     void *key, u64 map_flags)
+{
+	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+	struct bpf_sock_progs *progs = &htab->progs;
+	struct htab_elem *l_new = NULL, *l_old;
+	struct smap_psock_map_entry *e = NULL;
+	struct hlist_head *head;
+	struct smap_psock *psock;
+	u32 key_size, hash;
+	struct sock *sock;
+	struct bucket *b;
+	int err;
+
+	sock = skops->sk;
+
+	if (sock->sk_type != SOCK_STREAM ||
+	    sock->sk_protocol != IPPROTO_TCP)
+		return -EOPNOTSUPP;
+
+	if (unlikely(map_flags > BPF_EXIST))
+		return -EINVAL;
+
+	e = kzalloc(sizeof(*e), GFP_ATOMIC | __GFP_NOWARN);
+	if (!e)
+		return -ENOMEM;
+
+	WARN_ON_ONCE(!rcu_read_lock_held());
+	key_size = map->key_size;
+	hash = htab_map_hash(key, key_size);
+	b = __select_bucket(htab, hash);
+	head = &b->head;
+
+	err = __sock_map_ctx_update_elem(map, progs, sock, NULL, key);
+	if (err)
+		goto err;
+
+	/* bpf_map_update_elem() can be called in_irq() */
+	raw_spin_lock_bh(&b->lock);
+	l_old = lookup_elem_raw(head, hash, key, key_size);
+	if (l_old && map_flags == BPF_NOEXIST) {
+		err = -EEXIST;
+		goto bucket_err;
+	}
+	if (!l_old && map_flags == BPF_EXIST) {
+		err = -ENOENT;
+		goto bucket_err;
+	}
+
+	l_new = alloc_sock_hash_elem(htab, key, key_size, hash, sock, l_old);
+	if (IS_ERR(l_new)) {
+		err = PTR_ERR(l_new);
+		goto bucket_err;
+	}
+
+	psock = smap_psock_sk(sock);
+	if (unlikely(!psock)) {
+		err = -EINVAL;
+		goto bucket_err;
+	}
+
+	e->hash_link = l_new;
+	e->htab = container_of(map, struct bpf_htab, map);
+	list_add_tail(&e->list, &psock->maps);
+
+	/* add new element to the head of the list, so that
+	 * concurrent search will find it before old elem
+	 */
+	hlist_add_head_rcu(&l_new->hash_node, head);
+	if (l_old) {
+		psock = smap_psock_sk(l_old->sk);
+
+		hlist_del_rcu(&l_old->hash_node);
+		smap_list_remove(psock, NULL, l_old);
+		smap_release_sock(psock, l_old->sk);
+		free_htab_elem(htab, l_old);
+	}
+	raw_spin_unlock_bh(&b->lock);
+	return 0;
+bucket_err:
+	raw_spin_unlock_bh(&b->lock);
+err:
+	kfree(e);
+	psock = smap_psock_sk(sock);
+	if (psock)
+		smap_release_sock(psock, sock);
+	return err;
+}
+
+static int sock_hash_update_elem(struct bpf_map *map,
+				void *key, void *value, u64 flags)
+{
+	struct bpf_sock_ops_kern skops;
+	u32 fd = *(u32 *)value;
+	struct socket *socket;
+	int err;
+
+	socket = sockfd_lookup(fd, &err);
+	if (!socket)
+		return err;
+
+	skops.sk = socket->sk;
+	if (!skops.sk) {
+		fput(socket->file);
+		return -EINVAL;
+	}
+
+	err = sock_hash_ctx_update_elem(&skops, map, key, flags);
+	fput(socket->file);
+	return err;
+}
+
+static int sock_hash_delete_elem(struct bpf_map *map, void *key)
+{
+	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+	struct hlist_head *head;
+	struct bucket *b;
+	struct htab_elem *l;
+	u32 hash, key_size;
+	int ret = -ENOENT;
+
+	key_size = map->key_size;
+	hash = htab_map_hash(key, key_size);
+	b = __select_bucket(htab, hash);
+	head = &b->head;
+
+	raw_spin_lock_bh(&b->lock);
+	l = lookup_elem_raw(head, hash, key, key_size);
+	if (l) {
+		struct sock *sock = l->sk;
+		struct smap_psock *psock;
+
+		hlist_del_rcu(&l->hash_node);
+		write_lock_bh(&sock->sk_callback_lock);
+		psock = smap_psock_sk(sock);
+		/* This check handles a racing sock event that can get the
+		 * sk_callback_lock before this case but after xchg happens
+		 * causing the refcnt to hit zero and sock user data (psock)
+		 * to be null and queued for garbage collection.
+		 */
+		if (likely(psock)) {
+			smap_list_remove(psock, NULL, l);
+			smap_release_sock(psock, sock);
+		}
+		write_unlock_bh(&sock->sk_callback_lock);
+		free_htab_elem(htab, l);
+		ret = 0;
+	}
+	raw_spin_unlock_bh(&b->lock);
+	return ret;
+}
+
+struct sock  *__sock_hash_lookup_elem(struct bpf_map *map, void *key)
+{
+	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+	struct hlist_head *head;
+	struct htab_elem *l;
+	u32 key_size, hash;
+	struct bucket *b;
+	struct sock *sk;
+
+	key_size = map->key_size;
+	hash = htab_map_hash(key, key_size);
+	b = __select_bucket(htab, hash);
+	head = &b->head;
+
+	raw_spin_lock_bh(&b->lock);
+	l = lookup_elem_raw(head, hash, key, key_size);
+	sk = l ? l->sk : NULL;
+	raw_spin_unlock_bh(&b->lock);
+	return sk;
+}
+
 const struct bpf_map_ops sock_map_ops = {
 	.map_alloc = sock_map_alloc,
 	.map_free = sock_map_free,
@@ -1932,6 +2369,15 @@ static void sock_map_release(struct bpf_map *map)
 	.map_release_uref = sock_map_release,
 };
 
+const struct bpf_map_ops sock_hash_ops = {
+	.map_alloc = sock_hash_alloc,
+	.map_free = sock_hash_free,
+	.map_lookup_elem = sock_map_lookup,
+	.map_get_next_key = sock_hash_get_next_key,
+	.map_update_elem = sock_hash_update_elem,
+	.map_delete_elem = sock_hash_delete_elem,
+};
+
 BPF_CALL_4(bpf_sock_map_update, struct bpf_sock_ops_kern *, bpf_sock,
 	   struct bpf_map *, map, void *, key, u64, flags)
 {
@@ -1949,3 +2395,21 @@ static void sock_map_release(struct bpf_map *map)
 	.arg3_type	= ARG_PTR_TO_MAP_KEY,
 	.arg4_type	= ARG_ANYTHING,
 };
+
+BPF_CALL_4(bpf_sock_hash_update, struct bpf_sock_ops_kern *, bpf_sock,
+	   struct bpf_map *, map, void *, key, u64, flags)
+{
+	WARN_ON_ONCE(!rcu_read_lock_held());
+	return sock_hash_ctx_update_elem(bpf_sock, map, key, flags);
+}
+
+const struct bpf_func_proto bpf_sock_hash_update_proto = {
+	.func		= bpf_sock_hash_update,
+	.gpl_only	= false,
+	.pkt_access	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_CONST_MAP_PTR,
+	.arg3_type	= ARG_PTR_TO_MAP_KEY,
+	.arg4_type	= ARG_ANYTHING,
+};
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index d5e1a6c..99eeb04 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2093,6 +2093,13 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
 		    func_id != BPF_FUNC_msg_redirect_map)
 			goto error;
 		break;
+	case BPF_MAP_TYPE_SOCKHASH:
+		if (func_id != BPF_FUNC_sk_redirect_hash &&
+		    func_id != BPF_FUNC_sock_hash_update &&
+		    func_id != BPF_FUNC_map_delete_elem &&
+		    func_id != BPF_FUNC_msg_redirect_hash)
+			goto error;
+		break;
 	default:
 		break;
 	}
@@ -2130,11 +2137,14 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
 		break;
 	case BPF_FUNC_sk_redirect_map:
 	case BPF_FUNC_msg_redirect_map:
+	case BPF_FUNC_sock_map_update:
 		if (map->map_type != BPF_MAP_TYPE_SOCKMAP)
 			goto error;
 		break;
-	case BPF_FUNC_sock_map_update:
-		if (map->map_type != BPF_MAP_TYPE_SOCKMAP)
+	case BPF_FUNC_sk_redirect_hash:
+	case BPF_FUNC_msg_redirect_hash:
+	case BPF_FUNC_sock_hash_update:
+		if (map->map_type != BPF_MAP_TYPE_SOCKHASH)
 			goto error;
 		break;
 	default:
diff --git a/net/core/filter.c b/net/core/filter.c
index 14c6bb2..9c76a82 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2070,6 +2070,33 @@ int skb_do_redirect(struct sk_buff *skb)
 	.arg2_type      = ARG_ANYTHING,
 };
 
+BPF_CALL_4(bpf_sk_redirect_hash, struct sk_buff *, skb,
+	   struct bpf_map *, map, void *, key, u64, flags)
+{
+	struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
+
+	/* If user passes invalid input drop the packet. */
+	if (unlikely(flags & ~(BPF_F_INGRESS)))
+		return SK_DROP;
+
+	tcb->bpf.flags = flags;
+	tcb->bpf.sk_redir = __sock_hash_lookup_elem(map, key);
+	if (!tcb->bpf.sk_redir)
+		return SK_DROP;
+
+	return SK_PASS;
+}
+
+static const struct bpf_func_proto bpf_sk_redirect_hash_proto = {
+	.func           = bpf_sk_redirect_hash,
+	.gpl_only       = false,
+	.ret_type       = RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type      = ARG_CONST_MAP_PTR,
+	.arg3_type      = ARG_PTR_TO_MAP_KEY,
+	.arg4_type      = ARG_ANYTHING,
+};
+
 BPF_CALL_4(bpf_sk_redirect_map, struct sk_buff *, skb,
 	   struct bpf_map *, map, u32, key, u64, flags)
 {
@@ -2104,6 +2131,31 @@ struct sock *do_sk_redirect_map(struct sk_buff *skb)
 	.arg4_type      = ARG_ANYTHING,
 };
 
+BPF_CALL_4(bpf_msg_redirect_hash, struct sk_msg_buff *, msg,
+	   struct bpf_map *, map, void *, key, u64, flags)
+{
+	/* If user passes invalid input drop the packet. */
+	if (unlikely(flags & ~(BPF_F_INGRESS)))
+		return SK_DROP;
+
+	msg->flags = flags;
+	msg->sk_redir = __sock_hash_lookup_elem(map, key);
+	if (!msg->sk_redir)
+		return SK_DROP;
+
+	return SK_PASS;
+}
+
+static const struct bpf_func_proto bpf_msg_redirect_hash_proto = {
+	.func           = bpf_msg_redirect_hash,
+	.gpl_only       = false,
+	.ret_type       = RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type      = ARG_CONST_MAP_PTR,
+	.arg3_type      = ARG_PTR_TO_MAP_KEY,
+	.arg4_type      = ARG_ANYTHING,
+};
+
 BPF_CALL_4(bpf_msg_redirect_map, struct sk_msg_buff *, msg,
 	   struct bpf_map *, map, u32, key, u64, flags)
 {
@@ -4235,6 +4287,8 @@ static unsigned long bpf_xdp_copy(void *dst_buff, const void *src_buff,
 		return &bpf_sock_ops_cb_flags_set_proto;
 	case BPF_FUNC_sock_map_update:
 		return &bpf_sock_map_update_proto;
+	case BPF_FUNC_sock_hash_update:
+		return &bpf_sock_hash_update_proto;
 	default:
 		return bpf_base_func_proto(func_id);
 	}
@@ -4246,6 +4300,8 @@ static unsigned long bpf_xdp_copy(void *dst_buff, const void *src_buff,
 	switch (func_id) {
 	case BPF_FUNC_msg_redirect_map:
 		return &bpf_msg_redirect_map_proto;
+	case BPF_FUNC_msg_redirect_hash:
+		return &bpf_msg_redirect_hash_proto;
 	case BPF_FUNC_msg_apply_bytes:
 		return &bpf_msg_apply_bytes_proto;
 	case BPF_FUNC_msg_cork_bytes:
@@ -4277,6 +4333,8 @@ static unsigned long bpf_xdp_copy(void *dst_buff, const void *src_buff,
 		return &bpf_get_socket_uid_proto;
 	case BPF_FUNC_sk_redirect_map:
 		return &bpf_sk_redirect_map_proto;
+	case BPF_FUNC_sk_redirect_hash:
+		return &bpf_sk_redirect_hash_proto;
 	default:
 		return bpf_base_func_proto(func_id);
 	}
-- 
1.9.1

^ permalink raw reply related

* [PATCH bpf-next v5 1/4] bpf: sockmap, refactor sockmap routines to work with hashmap
From: John Fastabend @ 2018-05-05 23:25 UTC (permalink / raw)
  To: borkmann, ast; +Cc: netdev, John Fastabend
In-Reply-To: <1525562710-11603-1-git-send-email-john.fastabend@gmail.com>

This patch only refactors the existing sockmap code. This will allow
much of the psock initialization code path and bpf helper codes to
work for both sockmap bpf map types that are backed by an array, the
currently supported type, and the new hash backed bpf map type
sockhash.

Most the fallout comes from three changes,

  - Pushing bpf programs into an independent structure so we
    can use it from the htab struct in the next patch.
  - Generalizing helpers to use void *key instead of the hardcoded
    u32.
  - Instead of passing map/key through the metadata we now do
    the lookup inline. This avoids storing the key in the metadata
    which will be useful when keys can be longer than 4 bytes. We
    rename the sk pointers to sk_redir at this point as well to
    avoid any confusion between the current sk pointer and the
    redirect pointer sk_redir.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
---
 include/linux/filter.h |   3 +-
 include/net/tcp.h      |   3 +-
 kernel/bpf/sockmap.c   | 148 +++++++++++++++++++++++++++++--------------------
 net/core/filter.c      |  31 +++--------
 4 files changed, 98 insertions(+), 87 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index da7e165..9dbcb9d 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -515,9 +515,8 @@ struct sk_msg_buff {
 	int sg_end;
 	struct scatterlist sg_data[MAX_SKB_FRAGS];
 	bool sg_copy[MAX_SKB_FRAGS];
-	__u32 key;
 	__u32 flags;
-	struct bpf_map *map;
+	struct sock *sk_redir;
 	struct sk_buff *skb;
 	struct list_head list;
 };
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 833154e..089185a 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -814,9 +814,8 @@ struct tcp_skb_cb {
 #endif
 		} header;	/* For incoming skbs */
 		struct {
-			__u32 key;
 			__u32 flags;
-			struct bpf_map *map;
+			struct sock *sk_redir;
 			void *data_end;
 		} bpf;
 	};
diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 801273b..4eef5b1 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -48,14 +48,18 @@
 #define SOCK_CREATE_FLAG_MASK \
 	(BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY)
 
-struct bpf_stab {
-	struct bpf_map map;
-	struct sock **sock_map;
+struct bpf_sock_progs {
 	struct bpf_prog *bpf_tx_msg;
 	struct bpf_prog *bpf_parse;
 	struct bpf_prog *bpf_verdict;
 };
 
+struct bpf_stab {
+	struct bpf_map map;
+	struct sock **sock_map;
+	struct bpf_sock_progs progs;
+};
+
 enum smap_psock_state {
 	SMAP_TX_RUNNING,
 };
@@ -460,7 +464,7 @@ static int free_curr_sg(struct sock *sk, struct sk_msg_buff *md)
 static int bpf_map_msg_verdict(int _rc, struct sk_msg_buff *md)
 {
 	return ((_rc == SK_PASS) ?
-	       (md->map ? __SK_REDIRECT : __SK_PASS) :
+	       (md->sk_redir ? __SK_REDIRECT : __SK_PASS) :
 	       __SK_DROP);
 }
 
@@ -1091,7 +1095,7 @@ static int smap_verdict_func(struct smap_psock *psock, struct sk_buff *skb)
 	 * when we orphan the skb so that we don't have the possibility
 	 * to reference a stale map.
 	 */
-	TCP_SKB_CB(skb)->bpf.map = NULL;
+	TCP_SKB_CB(skb)->bpf.sk_redir = NULL;
 	skb->sk = psock->sock;
 	bpf_compute_data_pointers(skb);
 	preempt_disable();
@@ -1101,7 +1105,7 @@ static int smap_verdict_func(struct smap_psock *psock, struct sk_buff *skb)
 
 	/* Moving return codes from UAPI namespace into internal namespace */
 	return rc == SK_PASS ?
-		(TCP_SKB_CB(skb)->bpf.map ? __SK_REDIRECT : __SK_PASS) :
+		(TCP_SKB_CB(skb)->bpf.sk_redir ? __SK_REDIRECT : __SK_PASS) :
 		__SK_DROP;
 }
 
@@ -1371,7 +1375,6 @@ static int smap_init_sock(struct smap_psock *psock,
 }
 
 static void smap_init_progs(struct smap_psock *psock,
-			    struct bpf_stab *stab,
 			    struct bpf_prog *verdict,
 			    struct bpf_prog *parse)
 {
@@ -1449,14 +1452,13 @@ static void smap_gc_work(struct work_struct *w)
 	kfree(psock);
 }
 
-static struct smap_psock *smap_init_psock(struct sock *sock,
-					  struct bpf_stab *stab)
+static struct smap_psock *smap_init_psock(struct sock *sock, int node)
 {
 	struct smap_psock *psock;
 
 	psock = kzalloc_node(sizeof(struct smap_psock),
 			     GFP_ATOMIC | __GFP_NOWARN,
-			     stab->map.numa_node);
+			     node);
 	if (!psock)
 		return ERR_PTR(-ENOMEM);
 
@@ -1661,40 +1663,26 @@ static int sock_map_delete_elem(struct bpf_map *map, void *key)
  *  - sock_map must use READ_ONCE and (cmp)xchg operations
  *  - BPF verdict/parse programs must use READ_ONCE and xchg operations
  */
-static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
-				    struct bpf_map *map,
-				    void *key, u64 flags)
+
+static int __sock_map_ctx_update_elem(struct bpf_map *map,
+				      struct bpf_sock_progs *progs,
+				      struct sock *sock,
+				      struct sock **map_link,
+				      void *key)
 {
-	struct bpf_stab *stab = container_of(map, struct bpf_stab, map);
-	struct smap_psock_map_entry *e = NULL;
 	struct bpf_prog *verdict, *parse, *tx_msg;
-	struct sock *osock, *sock;
+	struct smap_psock_map_entry *e = NULL;
 	struct smap_psock *psock;
-	u32 i = *(u32 *)key;
 	bool new = false;
 	int err;
 
-	if (unlikely(flags > BPF_EXIST))
-		return -EINVAL;
-
-	if (unlikely(i >= stab->map.max_entries))
-		return -E2BIG;
-
-	sock = READ_ONCE(stab->sock_map[i]);
-	if (flags == BPF_EXIST && !sock)
-		return -ENOENT;
-	else if (flags == BPF_NOEXIST && sock)
-		return -EEXIST;
-
-	sock = skops->sk;
-
 	/* 1. If sock map has BPF programs those will be inherited by the
 	 * sock being added. If the sock is already attached to BPF programs
 	 * this results in an error.
 	 */
-	verdict = READ_ONCE(stab->bpf_verdict);
-	parse = READ_ONCE(stab->bpf_parse);
-	tx_msg = READ_ONCE(stab->bpf_tx_msg);
+	verdict = READ_ONCE(progs->bpf_verdict);
+	parse = READ_ONCE(progs->bpf_parse);
+	tx_msg = READ_ONCE(progs->bpf_tx_msg);
 
 	if (parse && verdict) {
 		/* bpf prog refcnt may be zero if a concurrent attach operation
@@ -1702,11 +1690,11 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
 		 * we increment the refcnt. If this is the case abort with an
 		 * error.
 		 */
-		verdict = bpf_prog_inc_not_zero(stab->bpf_verdict);
+		verdict = bpf_prog_inc_not_zero(progs->bpf_verdict);
 		if (IS_ERR(verdict))
 			return PTR_ERR(verdict);
 
-		parse = bpf_prog_inc_not_zero(stab->bpf_parse);
+		parse = bpf_prog_inc_not_zero(progs->bpf_parse);
 		if (IS_ERR(parse)) {
 			bpf_prog_put(verdict);
 			return PTR_ERR(parse);
@@ -1714,7 +1702,7 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
 	}
 
 	if (tx_msg) {
-		tx_msg = bpf_prog_inc_not_zero(stab->bpf_tx_msg);
+		tx_msg = bpf_prog_inc_not_zero(progs->bpf_tx_msg);
 		if (IS_ERR(tx_msg)) {
 			if (verdict)
 				bpf_prog_put(verdict);
@@ -1747,7 +1735,7 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
 			goto out_progs;
 		}
 	} else {
-		psock = smap_init_psock(sock, stab);
+		psock = smap_init_psock(sock, map->numa_node);
 		if (IS_ERR(psock)) {
 			err = PTR_ERR(psock);
 			goto out_progs;
@@ -1762,7 +1750,6 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
 		err = -ENOMEM;
 		goto out_progs;
 	}
-	e->entry = &stab->sock_map[i];
 
 	/* 3. At this point we have a reference to a valid psock that is
 	 * running. Attach any BPF programs needed.
@@ -1779,7 +1766,7 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
 		err = smap_init_sock(psock, sock);
 		if (err)
 			goto out_free;
-		smap_init_progs(psock, stab, verdict, parse);
+		smap_init_progs(psock, verdict, parse);
 		smap_start_sock(psock, sock);
 	}
 
@@ -1788,19 +1775,12 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
 	 * it with. Because we can only have a single set of programs if
 	 * old_sock has a strp we can stop it.
 	 */
-	list_add_tail(&e->list, &psock->maps);
-	write_unlock_bh(&sock->sk_callback_lock);
-
-	osock = xchg(&stab->sock_map[i], sock);
-	if (osock) {
-		struct smap_psock *opsock = smap_psock_sk(osock);
-
-		write_lock_bh(&osock->sk_callback_lock);
-		smap_list_remove(opsock, &stab->sock_map[i]);
-		smap_release_sock(opsock, osock);
-		write_unlock_bh(&osock->sk_callback_lock);
+	if (map_link) {
+		e->entry = map_link;
+		list_add_tail(&e->list, &psock->maps);
 	}
-	return 0;
+	write_unlock_bh(&sock->sk_callback_lock);
+	return err;
 out_free:
 	smap_release_sock(psock, sock);
 out_progs:
@@ -1815,23 +1795,69 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
 	return err;
 }
 
-int sock_map_prog(struct bpf_map *map, struct bpf_prog *prog, u32 type)
+static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
+				    struct bpf_map *map,
+				    void *key, u64 flags)
 {
 	struct bpf_stab *stab = container_of(map, struct bpf_stab, map);
+	struct bpf_sock_progs *progs = &stab->progs;
+	struct sock *osock, *sock;
+	u32 i = *(u32 *)key;
+	int err;
+
+	if (unlikely(flags > BPF_EXIST))
+		return -EINVAL;
+
+	if (unlikely(i >= stab->map.max_entries))
+		return -E2BIG;
+
+	sock = READ_ONCE(stab->sock_map[i]);
+	if (flags == BPF_EXIST && !sock)
+		return -ENOENT;
+	else if (flags == BPF_NOEXIST && sock)
+		return -EEXIST;
+
+	sock = skops->sk;
+	err = __sock_map_ctx_update_elem(map, progs, sock, &stab->sock_map[i],
+					 key);
+	if (err)
+		goto out;
+
+	osock = xchg(&stab->sock_map[i], sock);
+	if (osock) {
+		struct smap_psock *opsock = smap_psock_sk(osock);
+
+		write_lock_bh(&osock->sk_callback_lock);
+		smap_list_remove(opsock, &stab->sock_map[i]);
+		smap_release_sock(opsock, osock);
+		write_unlock_bh(&osock->sk_callback_lock);
+	}
+out:
+	return 0;
+}
+
+int sock_map_prog(struct bpf_map *map, struct bpf_prog *prog, u32 type)
+{
+	struct bpf_sock_progs *progs;
 	struct bpf_prog *orig;
 
-	if (unlikely(map->map_type != BPF_MAP_TYPE_SOCKMAP))
+	if (map->map_type == BPF_MAP_TYPE_SOCKMAP) {
+		struct bpf_stab *stab = container_of(map, struct bpf_stab, map);
+
+		progs = &stab->progs;
+	} else {
 		return -EINVAL;
+	}
 
 	switch (type) {
 	case BPF_SK_MSG_VERDICT:
-		orig = xchg(&stab->bpf_tx_msg, prog);
+		orig = xchg(&progs->bpf_tx_msg, prog);
 		break;
 	case BPF_SK_SKB_STREAM_PARSER:
-		orig = xchg(&stab->bpf_parse, prog);
+		orig = xchg(&progs->bpf_parse, prog);
 		break;
 	case BPF_SK_SKB_STREAM_VERDICT:
-		orig = xchg(&stab->bpf_verdict, prog);
+		orig = xchg(&progs->bpf_verdict, prog);
 		break;
 	default:
 		return -EOPNOTSUPP;
@@ -1880,16 +1906,18 @@ static int sock_map_update_elem(struct bpf_map *map,
 static void sock_map_release(struct bpf_map *map)
 {
 	struct bpf_stab *stab = container_of(map, struct bpf_stab, map);
+	struct bpf_sock_progs *progs;
 	struct bpf_prog *orig;
 
-	orig = xchg(&stab->bpf_parse, NULL);
+	progs = &stab->progs;
+	orig = xchg(&progs->bpf_parse, NULL);
 	if (orig)
 		bpf_prog_put(orig);
-	orig = xchg(&stab->bpf_verdict, NULL);
+	orig = xchg(&progs->bpf_verdict, NULL);
 	if (orig)
 		bpf_prog_put(orig);
 
-	orig = xchg(&stab->bpf_tx_msg, NULL);
+	orig = xchg(&progs->bpf_tx_msg, NULL);
 	if (orig)
 		bpf_prog_put(orig);
 }
diff --git a/net/core/filter.c b/net/core/filter.c
index 6877426..14c6bb2 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2079,9 +2079,10 @@ int skb_do_redirect(struct sk_buff *skb)
 	if (unlikely(flags & ~(BPF_F_INGRESS)))
 		return SK_DROP;
 
-	tcb->bpf.key = key;
 	tcb->bpf.flags = flags;
-	tcb->bpf.map = map;
+	tcb->bpf.sk_redir = __sock_map_lookup_elem(map, key);
+	if (!tcb->bpf.sk_redir)
+		return SK_DROP;
 
 	return SK_PASS;
 }
@@ -2089,16 +2090,8 @@ int skb_do_redirect(struct sk_buff *skb)
 struct sock *do_sk_redirect_map(struct sk_buff *skb)
 {
 	struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
-	struct sock *sk = NULL;
-
-	if (tcb->bpf.map) {
-		sk = __sock_map_lookup_elem(tcb->bpf.map, tcb->bpf.key);
 
-		tcb->bpf.key = 0;
-		tcb->bpf.map = NULL;
-	}
-
-	return sk;
+	return tcb->bpf.sk_redir;
 }
 
 static const struct bpf_func_proto bpf_sk_redirect_map_proto = {
@@ -2118,25 +2111,17 @@ struct sock *do_sk_redirect_map(struct sk_buff *skb)
 	if (unlikely(flags & ~(BPF_F_INGRESS)))
 		return SK_DROP;
 
-	msg->key = key;
 	msg->flags = flags;
-	msg->map = map;
+	msg->sk_redir = __sock_map_lookup_elem(map, key);
+	if (!msg->sk_redir)
+		return SK_DROP;
 
 	return SK_PASS;
 }
 
 struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
 {
-	struct sock *sk = NULL;
-
-	if (msg->map) {
-		sk = __sock_map_lookup_elem(msg->map, msg->key);
-
-		msg->key = 0;
-		msg->map = NULL;
-	}
-
-	return sk;
+	return msg->sk_redir;
 }
 
 static const struct bpf_func_proto bpf_msg_redirect_map_proto = {
-- 
1.9.1

^ permalink raw reply related

* [PATCH bpf-next v5 0/4] Hash support for sock
From: John Fastabend @ 2018-05-05 23:25 UTC (permalink / raw)
  To: borkmann, ast; +Cc: netdev, John Fastabend

In the original sockmap implementation we got away with using an
array similar to devmap. However, unlike devmap where an ifindex
has a nice 1:1 function into the map we have found some use cases
with sockets that need to be referenced using longer keys.

This series adds support for a sockhash map reusing as much of
the sockmap code as possible. I made the decision to add sockhash
specific helpers vs trying to generalize the existing helpers
because (a) they have sockmap in the name and (b) the keys are
different types. I prefer to be explicit here rather than play
type games or do something else tricky.

To test this we duplicate all the sockmap testing except swap out
the sockmap with a sockhash.

v2: fix file stats and add v2 tag
v3: move tool updates into test patch, move bpftool updates into
    its own patch, and fixup the test patch stats to catch the
    renamed file and provide only diffs +/- on that.
v4: Add documentation to UAPI bpf.h
v5: Add documentation to tools UAPI bpf.h

Just a note I pushed Dave's Acks through v4 into v5 due to small
size of change.

John Fastabend (4):
  bpf: sockmap, refactor sockmap routines to work with hashmap
  bpf: sockmap, add hash map support
  bpf: bpftool, support for sockhash
  bpf: selftest additions for SOCKHASH

 include/linux/bpf.h                                |   8 +
 include/linux/bpf_types.h                          |   1 +
 include/linux/filter.h                             |   3 +-
 include/net/tcp.h                                  |   3 +-
 include/uapi/linux/bpf.h                           |   6 +-
 kernel/bpf/core.c                                  |   1 +
 kernel/bpf/sockmap.c                               | 638 ++++++++++++++++++---
 kernel/bpf/verifier.c                              |  14 +-
 net/core/filter.c                                  |  89 ++-
 tools/bpf/bpftool/map.c                            |   1 +
 tools/include/uapi/linux/bpf.h                     |   6 +-
 tools/testing/selftests/bpf/Makefile               |   3 +-
 tools/testing/selftests/bpf/test_sockhash_kern.c   |   4 +
 tools/testing/selftests/bpf/test_sockmap.c         |  27 +-
 .../{test_sockmap_kern.c => test_sockmap_kern.h}   |   6 +-
 15 files changed, 695 insertions(+), 115 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_sockhash_kern.c
 rename tools/testing/selftests/bpf/{test_sockmap_kern.c => test_sockmap_kern.h} (98%)

-- 
1.9.1

^ permalink raw reply

* Re: WARNING in kernfs_add_one
From: Greg KH @ 2018-05-05 22:07 UTC (permalink / raw)
  To: linux-wireless, Eric Dumazet
  Cc: netdev, syzbot, linux-kernel, syzkaller-bugs, tj
In-Reply-To: <095bc684-0029-0b2e-5e28-23376174ca02@gmail.com>

On Sat, May 05, 2018 at 10:43:45AM -0700, Eric Dumazet wrote:
> 
> 
> On 05/05/2018 09:40 AM, Greg KH wrote:
> > On Sat, May 05, 2018 at 08:47:02AM -0700, syzbot wrote:
> >> Hello,
> >>
> >> syzbot found the following crash on:
> >>
> >> HEAD commit:    8fb11a9a8d51 net/ipv6: rename rt6_next to fib6_next
> >> git tree:       net-next
> >> console output: https://syzkaller.appspot.com/x/log.txt?x=14b27237800000
> >> kernel config:  https://syzkaller.appspot.com/x/.config?x=c416c61f3cd96be
> >> dashboard link: https://syzkaller.appspot.com/bug?extid=df47f81c226b31d89fb1
> >> compiler:       gcc (GCC) 8.0.1 20180413 (experimental)
> >> syzkaller repro:https://syzkaller.appspot.com/x/repro.syz?x=172fb3e7800000
> >> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=16552e57800000
> >>
> >> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> >> Reported-by: syzbot+df47f81c226b31d89fb1@syzkaller.appspotmail.com
> >>
> >> RBP: 00007fff808f3e10 R08: 0000000000000002 R09: 00007fff80003534
> >> R10: 0000000000000000 R11: 0000000000000246 R12: ffffffffffffffff
> >> R13: 0000000000000006 R14: 0000000000000000 R15: 0000000000000000
> >> ------------[ cut here ]------------
> >> kernfs: ns required in 'ieee80211' for 'phy3'
> > 
> > That's interesting, this looks like a netfilter bug (adding netdev to
> > the report here.)
> 
> 
> I do not see anything netfilter related here.
> 
> More likely wireless territory

Ugh, that's what I get for writing emails before coffee in the
morning...

Yes, you are right, this looks like a wireless issue.

Now cc: linux-wireless.

> > Yes, we can "tone down" the kernfs warning to just be an error message
> > in the log, but there might be something worse going on here.
> > 
> > Network developers, any idea?  Rest of the callback chain is here:
> > 
> > 
> >> WARNING: CPU: 0 PID: 4538 at fs/kernfs/dir.c:759 kernfs_add_one+0x406/0x4d0
> >> fs/kernfs/dir.c:758
> >> Kernel panic - not syncing: panic_on_warn set ...
> >>
> >> CPU: 0 PID: 4538 Comm: syz-executor486 Not tainted 4.17.0-rc3+ #33
> >> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> >> Google 01/01/2011
> >> Call Trace:
> >>  __dump_stack lib/dump_stack.c:77 [inline]
> >>  dump_stack+0x1b9/0x294 lib/dump_stack.c:113
> >>  panic+0x22f/0x4de kernel/panic.c:184
> >>  __warn.cold.8+0x163/0x1b3 kernel/panic.c:536
> >>  report_bug+0x252/0x2d0 lib/bug.c:186
> >>  fixup_bug arch/x86/kernel/traps.c:178 [inline]
> >>  do_error_trap+0x1de/0x490 arch/x86/kernel/traps.c:296
> >>  do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
> >>  invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:992
> >> RIP: 0010:kernfs_add_one+0x406/0x4d0 fs/kernfs/dir.c:758
> >> RSP: 0018:ffff8801ca9eece0 EFLAGS: 00010286
> >> RAX: 000000000000002d RBX: ffffffff87d5cee0 RCX: ffffffff8160ba7d
> >> RDX: 0000000000000000 RSI: ffffffff81610731 RDI: ffff8801ca9ee840
> >> RBP: ffff8801ca9eed20 R08: ffff8801d9538500 R09: 0000000000000006
> >> R10: ffff8801d9538500 R11: 0000000000000000 R12: ffff8801ad1cb6c0
> >> R13: ffffffff885da640 R14: 0000000000000020 R15: 0000000000000000
> >>  kernfs_create_link+0x112/0x180 fs/kernfs/symlink.c:41
> >>  sysfs_do_create_link_sd.isra.2+0x90/0x130 fs/sysfs/symlink.c:43
> >>  sysfs_do_create_link fs/sysfs/symlink.c:79 [inline]
> >>  sysfs_create_link+0x65/0xc0 fs/sysfs/symlink.c:91
> >>  device_add_class_symlinks drivers/base/core.c:1612 [inline]
> >>  device_add+0x7a0/0x16d0 drivers/base/core.c:1810
> >>  wiphy_register+0x178a/0x2430 net/wireless/core.c:806
> >>  ieee80211_register_hw+0x13cd/0x35d0 net/mac80211/main.c:1047
> >>  mac80211_hwsim_new_radio+0x1d9b/0x3410
> >> drivers/net/wireless/mac80211_hwsim.c:2772
> >>  hwsim_new_radio_nl+0x7a7/0xa60 drivers/net/wireless/mac80211_hwsim.c:3246
> >>  genl_family_rcv_msg+0x889/0x1120 net/netlink/genetlink.c:599
> >>  genl_rcv_msg+0xc6/0x170 net/netlink/genetlink.c:624
> >>  netlink_rcv_skb+0x172/0x440 net/netlink/af_netlink.c:2448
> >>  genl_rcv+0x28/0x40 net/netlink/genetlink.c:635
> >>  netlink_unicast_kernel net/netlink/af_netlink.c:1310 [inline]
> >>  netlink_unicast+0x58b/0x740 net/netlink/af_netlink.c:1336
> >>  netlink_sendmsg+0x9f0/0xfa0 net/netlink/af_netlink.c:1901
> >>  sock_sendmsg_nosec net/socket.c:629 [inline]
> >>  sock_sendmsg+0xd5/0x120 net/socket.c:639
> >>  ___sys_sendmsg+0x805/0x940 net/socket.c:2117
> >>  __sys_sendmsg+0x115/0x270 net/socket.c:2155
> >>  __do_sys_sendmsg net/socket.c:2164 [inline]
> >>  __se_sys_sendmsg net/socket.c:2162 [inline]
> >>  __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2162
> >>  do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
> >>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> >> RIP: 0033:0x4404c9
> >> RSP: 002b:00007fff808f3e08 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
> >> RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00000000004404c9
> >> RDX: 0000000000000000 RSI: 0000000020b3dfc8 RDI: 0000000000000005
> >> RBP: 00007fff808f3e10 R08: 0000000000000002 R09: 00007fff80003534
> >> R10: 0000000000000000 R11: 0000000000000246 R12: ffffffffffffffff
> >> R13: 0000000000000006 R14: 0000000000000000 R15: 0000000000000000
> >> Dumping ftrace buffer:
> >>    (ftrace buffer empty)
> >> Kernel Offset: disabled
> >> Rebooting in 86400 seconds..
> > 
> > 

Any ideas?

thanks,

greg k-h
> > 

^ permalink raw reply

* Re: [PATCH 8/8] rhashtable: don't hold lock on first table throughout insertion.
From: NeilBrown @ 2018-05-05 22:00 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Thomas Graf, netdev, linux-kernel
In-Reply-To: <20180505094117.pl7b6bbk6mtyri6d@gondor.apana.org.au>

[-- Attachment #1: Type: text/plain, Size: 2302 bytes --]

On Sat, May 05 2018, Herbert Xu wrote:

> On Fri, May 04, 2018 at 01:54:14PM +1000, NeilBrown wrote:
>> rhashtable_try_insert() currently hold a lock on the bucket in
>> the first table, while also locking buckets in subsequent tables.
>> This is unnecessary and looks like a hold-over from some earlier
>> version of the implementation.
>> 
>> As insert and remove always lock a bucket in each table in turn, and
>> as insert only inserts in the final table, there cannot be any races
>> that are not covered by simply locking a bucket in each table in turn.
>> 
>> When an insert call reaches that last table it can be sure that there
>> is no match entry in any other table as it has searched them all, and
>> insertion never happens anywhere but in the last table.  The fact that
>> code tests for the existence of future_tbl while holding a lock on
>> the relevant bucket ensures that two threads inserting the same key
>> will make compatible decisions about which is the "last" table.
>> 
>> This simplifies the code and allows the ->rehash field to be
>> discarded.
>> We still need a way to ensure that a dead bucket_table is never
>> re-linked by rhashtable_walk_stop().  This can be achieved by
>> setting the ->size to 1.  This still allows lookup code to work (and
>> correctly not find anything) but can never happen on an active bucket
>> table (as the minimum size is 4).
>> 
>> Signed-off-by: NeilBrown <neilb@suse.com>
>
> I'm not convinced this is safe.  There can be multiple tables
> in existence.  Unless you hold the lock on the first table, what
> is to stop two parallel inserters each from inserting into their
> "last" tables which may not be the same table?

The insert function must (and does) take the lock on the bucket before
testing if there is a "next" table.
If one inserter finds that it has locked the "last" table (because there
is no next) and successfully inserts, then the other inserter cannot
have locked that table yet, else it would have inserted.  When it does,
it will find what the first inserter inserted. 

Thanks,
NeilBrown

>
> Cheers,
> -- 
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [PATCH 6/8] rhashtable: further improve stability of rhashtable_walk
From: NeilBrown @ 2018-05-05 21:50 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Thomas Graf, netdev, linux-kernel
In-Reply-To: <20180505094212.hpqjyv5ijibviuwh@gondor.apana.org.au>

[-- Attachment #1: Type: text/plain, Size: 2307 bytes --]

On Sat, May 05 2018, Herbert Xu wrote:

> On Fri, May 04, 2018 at 01:54:14PM +1000, NeilBrown wrote:
>> If the sequence:
>>    obj = rhashtable_walk_next(iter);
>>    rhashtable_walk_stop(iter);
>>    rhashtable_remove_fast(ht, &obj->head, params);
>>    rhashtable_walk_start(iter);
>> 
>>  races with another thread inserting or removing
>>  an object on the same hash chain, a subsequent
>>  rhashtable_walk_next() is not guaranteed to get the "next"
>>  object. It is possible that an object could be
>>  repeated, or missed.
>> 
>>  This can be made more reliable by keeping the objects in a hash chain
>>  sorted by memory address.  A subsequent rhashtable_walk_next()
>>  call can reliably find the correct position in the list, and thus
>>  find the 'next' object.
>> 
>>  It is not possible (certainly not so easy) to achieve this with an
>>  rhltable as keeping the hash chain in order is not so easy.  When the
>>  first object with a given key is removed, it is replaced in the chain
>>  with the next object with the same key, and the address of that
>>  object may not be correctly ordered.
>>  No current user of rhltable_walk_enter() calls
>>  rhashtable_walk_start() more than once, so no current code
>>  could benefit from a more reliable walk of rhltables.
>> 
>>  This patch only attempts to improve walks for rhashtables.
>>  - a new object is always inserted after the last object with a
>>    smaller address, or at the start
>>  - when rhashtable_walk_start() is called, it records that 'p' is not
>>    'safe', meaning that it cannot be dereferenced.  The revalidation
>>    that was previously done here is moved to rhashtable_walk_next()
>>  - when rhashtable_walk_next() is called while p is not NULL and not
>>    safe, it walks the chain looking for the first object with an
>>    address greater than p and returns that.  If there is none, it moves
>>    to the next hash chain.
>> 
>> Signed-off-by: NeilBrown <neilb@suse.com>
>
> I'm a bit torn on this.  On the hand this is definitely an improvement
> over the status quo.  On the other this does not work on rhltable and
> we do have a way of fixing it for both rhashtable and rhltable.

Do we?  How could we fix it for both rhashtable and rhltable?

Thanks,
NeilBrown


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [PATCH 1/8] rhashtable: silence RCU warning in rhashtable_test.
From: NeilBrown @ 2018-05-05 21:49 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Thomas Graf, netdev, linux-kernel
In-Reply-To: <20180505091046.bd74tk75oieaasi5@gondor.apana.org.au>

[-- Attachment #1: Type: text/plain, Size: 847 bytes --]

On Sat, May 05 2018, Herbert Xu wrote:

> On Fri, May 04, 2018 at 01:54:14PM +1000, NeilBrown wrote:
>> print_ht in rhashtable_test calls rht_dereference() with neither
>> RCU protection or the mutex.  This triggers an RCU warning.
>> So take the mutex to silence the warning.
>> 
>> Signed-off-by: NeilBrown <neilb@suse.com>
>
> I don't think the mutex is actually needed but since we don't
> have rht_dereference_raw and this is just test code:

I agrees there is no functional need for the mutex.  It just needed to
silence the warning, so that other warnings are easier to see.

>
> Acked-by: Herbert Xu <herbert@gondor.apana.org.au>

thanks,
NeilBrown

> -- 
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [PATCH 4/8] rhashtable: fix race in nested_table_alloc()
From: NeilBrown @ 2018-05-05 21:48 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Thomas Graf, netdev, linux-kernel
In-Reply-To: <20180505092907.2qa3scf6bzvubmtt@gondor.apana.org.au>

[-- Attachment #1: Type: text/plain, Size: 948 bytes --]

On Sat, May 05 2018, Herbert Xu wrote:

> On Fri, May 04, 2018 at 01:54:14PM +1000, NeilBrown wrote:
>> If two threads run nested_table_alloc() at the same time
>> they could both allocate a new table.
>> Best case is that one of them will never be freed, leaking memory.
>> Worst case is hat entry get stored there before it leaks,
>> and the are lost from the table.
>> 
>> So use cmpxchg to detect the race and free the unused table.
>> 
>> Fixes: da20420f83ea ("rhashtable: Add nested tables")
>> Cc: stable@vger.kernel.org # 4.11+
>> Signed-off-by: NeilBrown <neilb@suse.com>
>
> What about the spinlock that's meant to be held around this
> operation?

The spinlock protects 2 or more buckets.  The nested table contains at
least 512 buckets, maybe more.
It is quite possible for two insertions into 2 different buckets to both
get their spinlock and both try to instantiate the same nested table.

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [PATCH 3/8] rhashtable: use cmpxchg() to protect ->future_tbl.
From: NeilBrown @ 2018-05-05 21:45 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Thomas Graf, netdev, linux-kernel
In-Reply-To: <20180505092725.rlwn77d3yhknspdw@gondor.apana.org.au>

[-- Attachment #1: Type: text/plain, Size: 1169 bytes --]

On Sat, May 05 2018, Herbert Xu wrote:

> On Fri, May 04, 2018 at 01:54:14PM +1000, NeilBrown wrote:
>> Rather than borrowing one of the bucket locks to
>> protect ->future_tbl updates, use cmpxchg().
>> This gives more freedom to change how bucket locking
>> is implemented.
>> 
>> Signed-off-by: NeilBrown <neilb@suse.com>
>
> This looks nice.
>
>> -	spin_unlock_bh(old_tbl->locks);
>> +	rcu_assign_pointer(tmp, new_tbl);
>
> Do we need this barrier since cmpxchg is supposed to provide memory
> barrier semantics?

It's hard to find documentation even for what cmpxchg() is meant do, let
alone what barriers is provides, but there does seem to be something
hidden in Documentation/core-api/atomic_ops.rst which suggests full
barrier semantics if the comparison succeeds.  I'll replace the
rcu_assign_pointer with a comment saying why it isn't needed.

Thanks,
NeilBrown

>
>> +	if (cmpxchg(&old_tbl->future_tbl, NULL, tmp) != NULL)
>> +		return -EEXIST;
>
> Thanks,
> -- 
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* [RFC PATCH 2/3] arcnet: com20020: Fixup missing SLOWARB bit
From: Andrea Greco @ 2018-05-05 21:37 UTC (permalink / raw)
  To: m.grzeschik; +Cc: Andrea Greco, netdev, linux-kernel

From: Andrea Greco <a.greco@4sigma.it>

If com20020 clock is major of 40Mhz SLOWARB bit is requested.

Signed-off-by: Andrea Greco <a.greco@4sigma.it>
---
 drivers/net/arcnet/com20020.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/arcnet/com20020.c b/drivers/net/arcnet/com20020.c
index f09ea77dd6a8..abd32ed8ec9b 100644
--- a/drivers/net/arcnet/com20020.c
+++ b/drivers/net/arcnet/com20020.c
@@ -102,6 +102,10 @@ int com20020_check(struct net_device *dev)
 	lp->setup = lp->clockm ? 0 : (lp->clockp << 1);
 	lp->setup2 = (lp->clockm << 4) | 8;
 
+	// If clock is major of 40Mhz, SLOWARB bit must be set
+	if (lp->clockm > 1)
+		lp->setup2 |= SLOWARB;
+
 	/* CHECK: should we do this for SOHARD cards ? */
 	/* Enable P1Mode for backplane mode */
 	lp->setup = lp->setup | P1MODE;
-- 
2.14.3

^ permalink raw reply related

* Re: [PATCH 2/8] rhashtable: remove nulls_base and related code.
From: NeilBrown @ 2018-05-05 21:37 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Thomas Graf, netdev, linux-kernel
In-Reply-To: <20180505091208.tnsxi6hdpjn456yz@gondor.apana.org.au>

[-- Attachment #1: Type: text/plain, Size: 2289 bytes --]

On Sat, May 05 2018, Herbert Xu wrote:

> On Fri, May 04, 2018 at 01:54:14PM +1000, NeilBrown wrote:
>> This "feature" is unused, undocumented, and untested and so
>> doesn't really belong.  If a use for the nulls marker
>> is found, all this code would need to be reviewed to
>> ensure it works as required.  It would be just as easy to
>> just add the code if/when it is needed instead.
>> 
>> This patch actually fixes a bug too.  The table resizing allows a
>> table to grow to 2^31 buckets, but the hash is truncated to 27 bits -
>> any growth beyond 2^27 is wasteful an ineffective.
>> 
>> This patch result in NULLS_MARKER(0) being used for all chains,
>> and leave the use of rht_is_a_null() to test for it.
>> 
>> Signed-off-by: NeilBrown <neilb@suse.com>
>
> I disagree.  This is a fundamental requirement for the use of
> rhashtable in certain networking systems such as TCP/UDP.  So
> we know that there will be a use for this.

I can see no evidence that this is required for anything, as it isn't
use and I'm fairly sure that in it's current form - it cannot be used.

Based on my best guess about how you might intend to use it, I suspect
it would be simpler to store the address of the bucket head in the nuls
rather than the hash and a magic number.  This would make it just as
easy to detect when a search reaches the end of the wrong chain, which I
presume is the purpose.
I would find that useful myself - if the search would repeat when that
happened - as I could then use SLAB_TYPESAFE_BY_RCU.

Were we to take this approach, all the code I've removed here would
still need to be removed.

>
> As to the bug fix, please separate it out of the patch and resubmit.

I don't know how to do that.  I don't know what is safe to change
without "breaking" the nulls_base code because that code is undocumented and
unused, so unmaintainable.
In general the kernel has, I believe, a policy against keeping unused
interfaces.  While that isn't firm and universal, is seems to apply
particularly well to unusable interfaces.

Thanks,
NeilBrown


>
> Thanks,
> -- 
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* [RFC PATCH 3/3] arcnet: com20020: Add ethtool support
From: Andrea Greco @ 2018-05-05 21:35 UTC (permalink / raw)
  To: m.grzeschik; +Cc: Andrea Greco, netdev, linux-kernel

From: Andrea Greco <a.greco@4sigma.it>

Setup ethtols for export com20020 diag register

Signed-off-by: Andrea Greco <a.greco@4sigma.it>
---
 drivers/net/arcnet/com20020-isa.c    |  1 +
 drivers/net/arcnet/com20020-membus.c |  1 +
 drivers/net/arcnet/com20020.c        | 29 +++++++++++++++++++++++++++++
 drivers/net/arcnet/com20020.h        |  1 +
 drivers/net/arcnet/com20020_cs.c     |  1 +
 include/uapi/linux/if_arcnet.h       |  6 ++++++
 6 files changed, 39 insertions(+)

diff --git a/drivers/net/arcnet/com20020-isa.c b/drivers/net/arcnet/com20020-isa.c
index 38fa60ddaf2e..44ab6dcccb58 100644
--- a/drivers/net/arcnet/com20020-isa.c
+++ b/drivers/net/arcnet/com20020-isa.c
@@ -154,6 +154,7 @@ static int __init com20020_init(void)
 		dev->dev_addr[0] = node;
 
 	dev->netdev_ops = &com20020_netdev_ops;
+	dev->ethtool_ops = &com20020_ethtool_ops;
 
 	lp = netdev_priv(dev);
 	lp->backplane = backplane;
diff --git a/drivers/net/arcnet/com20020-membus.c b/drivers/net/arcnet/com20020-membus.c
index 6e4a2f3a84f7..9eead734a3cf 100644
--- a/drivers/net/arcnet/com20020-membus.c
+++ b/drivers/net/arcnet/com20020-membus.c
@@ -91,6 +91,7 @@ static int com20020_probe(struct platform_device *pdev)
 
 	dev = alloc_arcdev(NULL);// Let autoassign name arc%d
 	dev->netdev_ops = &com20020_netdev_ops;
+	dev->ethtool_ops = &com20020_ethtool_ops;
 	lp = netdev_priv(dev);
 
 	lp->card_flags = ARC_CAN_10MBIT;/* pretend all of them can 10Mbit */
diff --git a/drivers/net/arcnet/com20020.c b/drivers/net/arcnet/com20020.c
index abd32ed8ec9b..2089b45e81c8 100644
--- a/drivers/net/arcnet/com20020.c
+++ b/drivers/net/arcnet/com20020.c
@@ -201,6 +201,34 @@ const struct net_device_ops com20020_netdev_ops = {
 	.ndo_set_rx_mode = com20020_set_mc_list,
 };
 
+static int com20020_ethtool_regs_len(struct net_device *netdev)
+{
+	return sizeof(struct com20020_ethtool_regs);
+}
+
+static void com20020_ethtool_regs_read(struct net_device *dev,
+				       struct ethtool_regs *regs, void *p)
+{
+	struct arcnet_local *lp;
+	struct com20020_ethtool_regs *com_reg;
+
+	lp = netdev_priv(dev);
+	memset(p, 0, sizeof(struct com20020_ethtool_regs));
+
+	regs->version = 1;
+
+	com_reg = p;
+
+	com_reg->status = lp->hw.status(dev) & 0xFF;
+	com_reg->diag_register = (lp->hw.status(dev) >> 8) & 0xFF;
+	com_reg->reconf_count = lp->num_recons;
+}
+
+const struct ethtool_ops com20020_ethtool_ops = {
+	.get_regs = com20020_ethtool_regs_read,
+	.get_regs_len  = com20020_ethtool_regs_len,
+};
+
 /* Set up the struct net_device associated with this card.  Called after
  * probing succeeds.
  */
@@ -402,6 +430,7 @@ static void com20020_set_mc_list(struct net_device *dev)
 EXPORT_SYMBOL(com20020_check);
 EXPORT_SYMBOL(com20020_found);
 EXPORT_SYMBOL(com20020_netdev_ops);
+EXPORT_SYMBOL(com20020_ethtool_ops);
 #endif
 
 MODULE_LICENSE("GPL");
diff --git a/drivers/net/arcnet/com20020.h b/drivers/net/arcnet/com20020.h
index 0bcc5d0a6903..a1024c8f8a1f 100644
--- a/drivers/net/arcnet/com20020.h
+++ b/drivers/net/arcnet/com20020.h
@@ -31,6 +31,7 @@
 int com20020_check(struct net_device *dev);
 int com20020_found(struct net_device *dev, int shared);
 extern const struct net_device_ops com20020_netdev_ops;
+extern const struct ethtool_ops com20020_ethtool_ops;
 
 /* The number of low I/O ports used by the card. */
 #define ARCNET_TOTAL_SIZE 8
diff --git a/drivers/net/arcnet/com20020_cs.c b/drivers/net/arcnet/com20020_cs.c
index cf607ffcf358..ae64f436fd54 100644
--- a/drivers/net/arcnet/com20020_cs.c
+++ b/drivers/net/arcnet/com20020_cs.c
@@ -233,6 +233,7 @@ static int com20020_config(struct pcmcia_device *link)
 	}
 
 	dev->irq = link->irq;
+	dev->ethtool_ops = &com20020_ethtool_ops;
 
 	ret = pcmcia_enable_device(link);
 	if (ret)
diff --git a/include/uapi/linux/if_arcnet.h b/include/uapi/linux/if_arcnet.h
index 683878036d76..790c0fa7386d 100644
--- a/include/uapi/linux/if_arcnet.h
+++ b/include/uapi/linux/if_arcnet.h
@@ -127,4 +127,10 @@ struct archdr {
 	} soft;
 };
 
+struct com20020_ethtool_regs {
+	__u8 status;
+	__u8 diag_register;
+	__u32 reconf_count;
+};
+
 #endif				/* _LINUX_IF_ARCNET_H */
-- 
2.14.3

^ permalink raw reply related

* [RFC PATCH 1/3] arcnet: com20020: Add memory map of com20020
From: Andrea Greco @ 2018-05-05 21:34 UTC (permalink / raw)
  To: m.grzeschik
  Cc: Andrea Greco, Rob Herring, Mark Rutland, netdev, devicetree,
	linux-kernel

From: Andrea Greco <a.greco@4sigma.it>

Add support for com20022I/com20020, memory mapped chip version.
Support bus: Intel 80xx and Motorola 68xx.
Bus size: Only 8 bit bus size is supported.
Added related device tree bindings

Signed-off-by: Andrea Greco <a.greco@4sigma.it>
---
 .../devicetree/bindings/net/smsc-com20020.txt      |  23 +++
 drivers/net/arcnet/Kconfig                         |  12 +-
 drivers/net/arcnet/Makefile                        |   1 +
 drivers/net/arcnet/arcdevice.h                     |  27 ++-
 drivers/net/arcnet/com20020-membus.c               | 191 +++++++++++++++++++++
 drivers/net/arcnet/com20020.c                      |   9 +-
 6 files changed, 253 insertions(+), 10 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/net/smsc-com20020.txt
 create mode 100644 drivers/net/arcnet/com20020-membus.c

diff --git a/Documentation/devicetree/bindings/net/smsc-com20020.txt b/Documentation/devicetree/bindings/net/smsc-com20020.txt
new file mode 100644
index 000000000000..39c5b19c55af
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/smsc-com20020.txt
@@ -0,0 +1,23 @@
+SMSC com20020, com20022I
+
+timeout: Arcnet timeout, checkout datashet
+clockp: Clock Prescaler, checkout datashet
+clockm: Clock multiplier, checkout datasheet
+
+phy-reset-gpios: Chip reset ppin
+phy-irq-gpios: Chip irq pin
+
+com20020_A@0 {
+    compatible = "smsc,com20020";
+
+	timeout	= <0x3>;
+	backplane = <0x0>;
+
+	clockp = <0x0>;
+	clockm = <0x3>;
+
+	phy-reset-gpios = <&gpio3 21 GPIO_ACTIVE_LOW>;
+	phy-irq-gpios = <&gpio2 10 GPIO_ACTIVE_LOW>;
+
+	status = "okay";
+};
diff --git a/drivers/net/arcnet/Kconfig b/drivers/net/arcnet/Kconfig
index 39bd16f3f86d..d39faf45be1e 100644
--- a/drivers/net/arcnet/Kconfig
+++ b/drivers/net/arcnet/Kconfig
@@ -3,7 +3,7 @@
 #
 
 menuconfig ARCNET
-	depends on NETDEVICES && (ISA || PCI || PCMCIA)
+	depends on NETDEVICES
 	tristate "ARCnet support"
 	---help---
 	  If you have a network card of this type, say Y and check out the
@@ -129,5 +129,15 @@ config ARCNET_COM20020_CS
 
 	  To compile this driver as a module, choose M here: the module will be
 	  called com20020_cs.  If unsure, say N.
+config ARCNET_COM20020_MEMORY_BUS
+	bool "Support for COM20020 on external memory"
+	depends on ARCNET_COM20020 && !(ARCNET_COM20020_PCI || ARCNET_COM20020_ISA || ARCNET_COM20020_CS)
+	help
+	  Say Y here if on your custom board mount com20020 or friends.
+
+	  Com20022I support arcnet bus 10Mbitps.
+	  This driver support only 8bit, and DMA is not supported is attached on your board at external interface bus.
+	  Supported bus Intel80xx / Motorola 68xx.
+	  This driver not work with other com20020 module: PCI or PCMCIA compiled as [M].
 
 endif # ARCNET
diff --git a/drivers/net/arcnet/Makefile b/drivers/net/arcnet/Makefile
index 53525e8ea130..19425c1e06f4 100644
--- a/drivers/net/arcnet/Makefile
+++ b/drivers/net/arcnet/Makefile
@@ -14,3 +14,4 @@ obj-$(CONFIG_ARCNET_COM20020) += com20020.o
 obj-$(CONFIG_ARCNET_COM20020_ISA) += com20020-isa.o
 obj-$(CONFIG_ARCNET_COM20020_PCI) += com20020-pci.o
 obj-$(CONFIG_ARCNET_COM20020_CS) += com20020_cs.o
+obj-$(CONFIG_ARCNET_COM20020_MEMORY_BUS) += com20020-membus.o
diff --git a/drivers/net/arcnet/arcdevice.h b/drivers/net/arcnet/arcdevice.h
index d09b2b46ab63..16c608269cca 100644
--- a/drivers/net/arcnet/arcdevice.h
+++ b/drivers/net/arcnet/arcdevice.h
@@ -201,7 +201,7 @@ struct ArcProto {
 	void (*rx)(struct net_device *dev, int bufnum,
 		   struct archdr *pkthdr, int length);
 	int (*build_header)(struct sk_buff *skb, struct net_device *dev,
-			    unsigned short ethproto, uint8_t daddr);
+				unsigned short ethproto, uint8_t daddr);
 
 	/* these functions return '1' if the skb can now be freed */
 	int (*prepare_tx)(struct net_device *dev, struct archdr *pkt,
@@ -326,9 +326,9 @@ struct arcnet_local {
 		void (*recontrigger) (struct net_device * dev, int enable);
 
 		void (*copy_to_card)(struct net_device *dev, int bufnum,
-				     int offset, void *buf, int count);
+					 int offset, void *buf, int count);
 		void (*copy_from_card)(struct net_device *dev, int bufnum,
-				       int offset, void *buf, int count);
+					   int offset, void *buf, int count);
 	} hw;
 
 	void __iomem *mem_start;	/* pointer to ioremap'ed MMIO */
@@ -360,7 +360,7 @@ struct net_device *alloc_arcdev(const char *name);
 int arcnet_open(struct net_device *dev);
 int arcnet_close(struct net_device *dev);
 netdev_tx_t arcnet_send_packet(struct sk_buff *skb,
-			       struct net_device *dev);
+				   struct net_device *dev);
 void arcnet_timeout(struct net_device *dev);
 
 /* I/O equivalents */
@@ -371,7 +371,23 @@ void arcnet_timeout(struct net_device *dev);
 #define BUS_ALIGN  1
 #endif
 
-/* addr and offset allow register like names to define the actual IO  address.
+#ifdef CONFIG_ARCNET_COM20020_MEMORY_BUS
+#define arcnet_inb(addr, offset)					\
+	ioread8((void __iomem *)(addr) + BUS_ALIGN * (offset))
+
+#define arcnet_outb(value, addr, offset)				\
+	iowrite8(value, (void __iomem *)(addr) + BUS_ALIGN * (offset))
+
+#define arcnet_insb(addr, offset, buffer, count)			\
+	ioread8_rep((void __iomem *)					\
+	(addr) + BUS_ALIGN * (offset), buffer, count)
+
+#define arcnet_outsb(addr, offset, buffer, count)			\
+	iowrite8_rep((void __iomem *)					\
+	(addr) + BUS_ALIGN * (offset), buffer, count)
+#else
+/**
+ * addr and offset allow register like names to define the actual IO  address.
  * A configuration option multiplies the offset for alignment.
  */
 #define arcnet_inb(addr, offset)					\
@@ -388,6 +404,7 @@ void arcnet_timeout(struct net_device *dev);
 	readb((addr) + (offset))
 #define arcnet_writeb(value, addr, offset)				\
 	writeb(value, (addr) + (offset))
+#endif
 
 #endif				/* __KERNEL__ */
 #endif				/* _LINUX_ARCDEVICE_H */
diff --git a/drivers/net/arcnet/com20020-membus.c b/drivers/net/arcnet/com20020-membus.c
new file mode 100644
index 000000000000..6e4a2f3a84f7
--- /dev/null
+++ b/drivers/net/arcnet/com20020-membus.c
@@ -0,0 +1,191 @@
+// SPDX-License-Identifier: (GPL-2.0 OR MIT)
+/* Linux ARCnet driver for com 20020.
+ *
+ * This datasheet:
+ * http://ww1.microchip.com/downloads/en/DeviceDoc/200223vrevc.pdf
+ * http://ww1.microchip.com/downloads/en/DeviceDoc/20020.pdf
+ *
+ * This driver support:
+ * - com20020,
+ * - com20022
+ * - com20022I-3v3
+ *
+ * This driver support only, 8bit read and write.
+ * DMA is not supported by this driver.
+ */
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/platform_device.h>
+#include <linux/netdevice.h>
+#include <linux/of_address.h>
+#include <linux/of_gpio.h>
+#include <linux/sizes.h>
+#include <linux/interrupt.h>
+#include <linux/ioport.h>
+#include <linux/random.h>
+
+#include <linux/delay.h>
+#include "arcdevice.h"
+#include "com20020.h"
+
+#define VERSION "arcnet: COM20020 MEMORY BUS support loaded.\n"
+
+static int com20020_probe(struct platform_device *pdev)
+{
+	struct device_node *np;
+	struct net_device *dev;
+	struct arcnet_local *lp;
+	struct resource res, *iores;
+	int ret, phy_reset, err;
+	u32 timeout, backplane, clockp, clockm;
+	void __iomem *ioaddr;
+
+	np = pdev->dev.of_node;
+
+	iores = platform_get_resource(pdev, IORESOURCE_MEM, 0);
+
+	if (of_address_to_resource(np, 0, &res))
+		return -EINVAL;
+
+	ret = of_property_read_u32(np, "timeout", &timeout);
+	if (ret) {
+		dev_err(&pdev->dev, "timeout is required param");
+		return ret;
+	}
+
+	ret = of_property_read_u32(np, "backplane", &backplane);
+	if (ret) {
+		dev_err(&pdev->dev, "backplane is required param");
+		return ret;
+	}
+
+	ret = of_property_read_u32(np, "clockp", &clockp);
+	if (ret) {
+		dev_err(&pdev->dev, "clockp is required param");
+		return ret;
+	}
+
+	ret = of_property_read_u32(np, "clockm", &clockm);
+	if (ret) {
+		dev_err(&pdev->dev, "clockm is required param");
+		return ret;
+	}
+
+	phy_reset = of_get_named_gpio(np, "phy-reset-gpios", 0);
+	if (phy_reset == -EPROBE_DEFER) {
+		return phy_reset;
+	} else if (!gpio_is_valid(phy_reset)) {
+		dev_err(&pdev->dev, "phy-reset-gpios not valid !");
+		return 0;
+	}
+
+	err = devm_gpio_request_one(&pdev->dev, phy_reset, GPIOF_OUT_INIT_LOW,
+				    "arcnet-phy-reset");
+	if (err) {
+		dev_err(&pdev->dev, "failed to get phy-reset-gpios: %d\n", err);
+		return err;
+	}
+
+	dev = alloc_arcdev(NULL);// Let autoassign name arc%d
+	dev->netdev_ops = &com20020_netdev_ops;
+	lp = netdev_priv(dev);
+
+	lp->card_flags = ARC_CAN_10MBIT;/* pretend all of them can 10Mbit */
+
+	// Force address to 0
+	// Will be set by user with `ip set dev arc0 address YOUR_NODE_ID`
+	dev->dev_addr[0] = 0;
+
+	// request to system this memory region
+	if (!devm_request_mem_region(&pdev->dev, res.start, resource_size(&res),
+				     lp->card_name))
+		return -EBUSY;
+
+	ioaddr = devm_ioremap(&pdev->dev, iores->start, resource_size(iores));
+	if (!ioaddr) {
+		dev_err(&pdev->dev, "ioremap fallied\n");
+		return -ENOMEM;
+	}
+
+	// Reset time is 5 * xTalFreq, minimal xtal is 10Mhz
+	// (5 * 1000) / 10Mhz = 500ns
+
+	gpio_set_value_cansleep(phy_reset, 0);
+	ndelay(500);
+	gpio_set_value_cansleep(phy_reset, 1);
+	ndelay(500);
+
+	/* Dummy access after Reset
+	 * ARCNET controller needs
+	 * this access to detect bustype
+	 */
+	arcnet_outb(0x00, ioaddr, COM20020_REG_W_COMMAND);
+	arcnet_inb(ioaddr, COM20020_REG_R_DIAGSTAT);
+
+	dev->base_addr = (unsigned long)ioaddr;
+	get_random_bytes(dev->dev_addr, sizeof(u8));
+
+	dev->irq = of_get_named_gpio(np, "phy-irq-gpios", 0);
+	if (dev->irq == -EPROBE_DEFER) {
+		return dev->irq;
+	} else if (!gpio_is_valid(dev->irq)) {
+		dev_err(&pdev->dev, "phy-irq-gpios not valid !");
+		return 0;
+	}
+	dev->irq = gpio_to_irq(dev->irq);
+
+	lp->backplane = backplane;
+	lp->clockp = clockp & 7;
+	lp->clockm = clockm & 3;
+	lp->timeout = timeout;
+	lp->hw.owner = THIS_MODULE;
+
+	if (arcnet_inb(ioaddr, COM20020_REG_R_STATUS) == 0xFF) {
+		ret = -EIO;
+		goto err_release_mem;
+	}
+
+	if (com20020_check(dev)) {
+		ret = -EIO;
+		goto err_release_mem;
+	}
+
+	ret = com20020_found(dev, IRQF_TRIGGER_FALLING);
+	if (ret)
+		goto err_release_mem;
+
+	dev_dbg(&pdev->dev, "probe Done\n");
+	return 0;
+
+err_release_mem:
+	devm_iounmap(&pdev->dev, (void __iomem *)ioaddr);
+	devm_release_mem_region(&pdev->dev, res.start, resource_size(&res));
+	dev_err(&pdev->dev, "probe failed!\n");
+	return ret;
+}
+
+static const struct of_device_id of_com20020_match[] = {
+	{ .compatible = "smsc,com20020",	},
+	{ },
+};
+
+MODULE_DEVICE_TABLE(of, of_com20020_match);
+
+static struct platform_driver of_com20020_driver = {
+	.driver			= {
+		.name		= "com20020-memory-bus",
+		.of_match_table = of_com20020_match,
+	},
+	.probe			= com20020_probe,
+};
+
+static int com20020_init(void)
+{
+	return platform_driver_register(&of_com20020_driver);
+}
+late_initcall(com20020_init);
+
+MODULE_LICENSE("GPL");
diff --git a/drivers/net/arcnet/com20020.c b/drivers/net/arcnet/com20020.c
index 78043a9c5981..f09ea77dd6a8 100644
--- a/drivers/net/arcnet/com20020.c
+++ b/drivers/net/arcnet/com20020.c
@@ -43,7 +43,7 @@
 #include "com20020.h"
 
 static const char * const clockrates[] = {
-	"XXXXXXX", "XXXXXXXX", "XXXXXX", "2.5 Mb/s",
+	"10 Mb/s", "XXXXXXXX", "XXXXXX", "2.5 Mb/s",
 	"1.25Mb/s", "625 Kb/s", "312.5 Kb/s", "156.25 Kb/s",
 	"Reserved", "Reserved", "Reserved"
 };
@@ -391,9 +391,10 @@ static void com20020_set_mc_list(struct net_device *dev)
 	}
 }
 
-#if defined(CONFIG_ARCNET_COM20020_PCI_MODULE) || \
-    defined(CONFIG_ARCNET_COM20020_ISA_MODULE) || \
-    defined(CONFIG_ARCNET_COM20020_CS_MODULE)
+#if	defined(CONFIG_ARCNET_COM20020_PCI_MODULE) || \
+	defined(CONFIG_ARCNET_COM20020_ISA_MODULE) || \
+	defined(CONFIG_ARCNET_COM20020_CS_MODULE)  || \
+	defined(CONFIG_ARCNET_COM20020_MEMORY_BUS)
 EXPORT_SYMBOL(com20020_check);
 EXPORT_SYMBOL(com20020_found);
 EXPORT_SYMBOL(com20020_netdev_ops);
-- 
2.14.3

^ permalink raw reply related

* Re: [PATCH net-next v2 03/13] net: phy: sfp: warn the user when no tx_disable pin is available
From: Andrew Lunn @ 2018-05-05 20:52 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: maxime.chevallier, ymarkman, jason, Antoine Tenart, netdev,
	gregory.clement, linux-kernel, linux, kishon, nadavh,
	thomas.petazzoni, miquel.raynal, stefanc, mw, davem,
	linux-arm-kernel, sebastian.hesselbarth
In-Reply-To: <44545AF4-64E3-4772-B5BA-43CCF2321025@gmail.com>

On Sat, May 05, 2018 at 01:38:31PM -0700, Florian Fainelli wrote:
> On May 4, 2018 10:14:25 AM PDT, Andrew Lunn <andrew@lunn.ch> wrote:
> >On Fri, May 04, 2018 at 10:07:53AM -0700, Florian Fainelli wrote:
> >> On 05/04/2018 06:56 AM, Antoine Tenart wrote:
> >> > In case no Tx disable pin is available the SFP modules will always
> >be
> >> > emitting. This could be an issue when using modules using laser as
> >their
> >> > light source as we would have no way to disable it when the fiber
> >is
> >> > removed. This patch adds a warning when registering an SFP cage
> >which do
> >> > not have its tx_disable pin wired or available.
> >> 
> >> Is this something that was done in a possibly earlier revision of a
> >> given board design and which was finally fixed? Nothing wrong with
> >the
> >> patch, but this seems like a pretty serious board design mistake,
> >that
> >> needs to be addressed.
> >
> >Hi Florian
> >
> >Zii Devel B is like this. Only the "Signal Detect" pin is wired to a
> >GPIO.
> 

> Good point, indeed. BTW what do you think about exposing the SFF's
> EEPROM and diagnostics through the standard ethtool operations even
> if we have to keep the description of the SFF as a fixed link in
> Device Tree because of the unfortunate wiring?

I believe in Antoine case, all the control plane is broken. He cannot
read the EEPROM, nor any of the modules pins via GPIOs.

For Zii Devel B, the EEPROM is accessible, and so is the SD pin. What
is missing is transmit disable. So i would expose it as an SFF module.

   Andrew

^ permalink raw reply

* Re: [PATCH net-next v2 03/13] net: phy: sfp: warn the user when no tx_disable pin is available
From: Florian Fainelli @ 2018-05-05 20:38 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Antoine Tenart, davem, kishon, linux, gregory.clement, jason,
	sebastian.hesselbarth, netdev, linux-kernel, thomas.petazzoni,
	maxime.chevallier, miquel.raynal, nadavh, stefanc, ymarkman, mw,
	linux-arm-kernel
In-Reply-To: <20180504171425.GA17233@lunn.ch>

On May 4, 2018 10:14:25 AM PDT, Andrew Lunn <andrew@lunn.ch> wrote:
>On Fri, May 04, 2018 at 10:07:53AM -0700, Florian Fainelli wrote:
>> On 05/04/2018 06:56 AM, Antoine Tenart wrote:
>> > In case no Tx disable pin is available the SFP modules will always
>be
>> > emitting. This could be an issue when using modules using laser as
>their
>> > light source as we would have no way to disable it when the fiber
>is
>> > removed. This patch adds a warning when registering an SFP cage
>which do
>> > not have its tx_disable pin wired or available.
>> 
>> Is this something that was done in a possibly earlier revision of a
>> given board design and which was finally fixed? Nothing wrong with
>the
>> patch, but this seems like a pretty serious board design mistake,
>that
>> needs to be addressed.
>
>Hi Florian
>
>Zii Devel B is like this. Only the "Signal Detect" pin is wired to a
>GPIO.

Good point, indeed. BTW what do you think about exposing the SFF's EEPROM and diagnostics through the standard ethtool operations even if we have to keep the description of the SFF as a fixed link in Device Tree because of the unfortunate wiring?

-- 
Florian

^ permalink raw reply

* Re: [PATCH net-next] net: phy: sfp: handle cases where neither BR,min nor BR,max is given
From: Florian Fainelli @ 2018-05-05 20:35 UTC (permalink / raw)
  To: Antoine Tenart, davem, linux
  Cc: netdev, linux-kernel, thomas.petazzoni, maxime.chevallier,
	gregory.clement, miquel.raynal, nadavh, stefanc, ymarkman, mw
In-Reply-To: <20180504152103.18152-1-antoine.tenart@bootlin.com>

On May 4, 2018 8:21:03 AM PDT, Antoine Tenart <antoine.tenart@bootlin.com> wrote:
>When computing the bitrate using values read from an SFP module EEPROM,
>we use the nominal BR plus BR,min and BR,max to determine the
>boundaries. But in some cases BR,min and BR,max aren't provided, which
>led the SFP code to end up having the nominal value for both the
>minimum
>and maximum bitrate values. When using a passive cable, the nominal
>value should be used as the maximum one, and there is no minimum one
>so we should use 0.
>
>Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com>
>---
>
>Hi Russell,
>
>I'm not completely sure about this patch as this case is not really
>specified in the specification. But the issue is there, and I've
>discuss
>this with others. It seemed logical (at least to us :)) to use the
>BR,nominal values as br_max and 0 as br_min when using a passive cable
>which only provides BR,nominal as this would be the highest rate at
>which the cable could work. And because it's passive, there should be
>no
>issues using it at a lower rate.
>
>I've tested this with one passive cable which only reports its
>BR,nominal (which was 10300) while it could be used when using
>1000baseX
>or 2500baseX modes.

Which SFP modules (vendor and model) exposed this out of curiosity? Russell and I already saw the Cotsworks modules having so e issues with checksums, so building a table of quirks would help. Thanks!

-- 
Florian

^ permalink raw reply

* Re: [PATCH v5 0/6] firmware_loader: cleanups for v4.18
From: Krzysztof Halasa @ 2018-05-05 20:26 UTC (permalink / raw)
  To: Luis R. Rodriguez
  Cc: gregkh, akpm, keescook, josh, teg, wagi, hdegoede, andresx7,
	zohar, kubakici, shuah, mfuzzey, dhowells, pali.rohar, tiwai,
	kvalo, arend.vanspriel, zajec5, nbroeking, markivx, broonie,
	dmitry.torokhov, dwmw2, torvalds, Abhay_Salunke, bjorn.andersson,
	jewalt, oneukum, cantabile.desu, ast, hare, jejb, martin.petersen,
	davem, maco, arve, tkjos
In-Reply-To: <20180504195835.GU27853@wotan.suse.de>

"Luis R. Rodriguez" <mcgrof@kernel.org> writes:

> So we can nuke CONFIG_WANXL_BUILD_FIRMWARE now?

I'm uncertain I understand why do you want it, or maybe what are you
trying to do at all.

And what use would wanxlfw.S (the assembly source) have if the option is
removed?

>> It's more about delivering the .S source for the firmware, I guess.
>> Nobody is expected to build it. The fw is about 2.5 KB and is directly
>> linked with the driver.
>
> :P Future work I guess would be to just use the firmware API and stuff
> it into linux-firmware?

Who's going to make it happen?
The last time I checked (several years ago), wanXL worked. Who's going
to test it after the change?

I assume linux-firmware could include fw source and there would be means
to build the binary.

Just to be sure: the wanXL firmware has exactly nothing to do with FW
loader, nothing depends on it (nor the other way around), it's just
(with the rest of the wanXL code) an old piece of a driver for an old
card.

The question is, what do we gain by messing with it?
-- 
Krzysztof Halasa

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox