Netdev List
 help / color / mirror / Atom feed
* [net-next 07/16] fm10k: prevent null pointer dereference of msix_entries table
From: Jeff Kirsher @ 2016-04-05  8:01 UTC (permalink / raw)
  To: davem; +Cc: Jacob Keller, netdev, nhorman, sassmann, jogreene, Jeff Kirsher
In-Reply-To: <1459843288-40623-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Jacob Keller <jacob.e.keller@intel.com>

According to the C standard dereferencing a variable before it is
checked invokes undefined behavior, and thus compilers are free to
assume the check for NULL isn't necessary. Prevent this by re-ordering
the NULL check of msix_entries in fm10k_free_mbx_irq.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/fm10k/fm10k_pci.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_pci.c b/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
index 6190a81..8c23fb3d 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
@@ -1143,14 +1143,16 @@ static irqreturn_t fm10k_msix_mbx_pf(int __always_unused irq, void *data)
 
 void fm10k_mbx_free_irq(struct fm10k_intfc *interface)
 {
-	struct msix_entry *entry = &interface->msix_entries[FM10K_MBX_VECTOR];
 	struct fm10k_hw *hw = &interface->hw;
+	struct msix_entry *entry;
 	int itr_reg;
 
 	/* no mailbox IRQ to free if MSI-X is not enabled */
 	if (!interface->msix_entries)
 		return;
 
+	entry = &interface->msix_entries[FM10K_MBX_VECTOR];
+
 	/* disconnect the mailbox */
 	hw->mbx.ops.disconnect(hw, &hw->mbx);
 
-- 
2.5.5

^ permalink raw reply related

* [net-next 14/16] fm10k: correctly clean up when init_queueing_scheme fails
From: Jeff Kirsher @ 2016-04-05  8:01 UTC (permalink / raw)
  To: davem; +Cc: Jacob Keller, netdev, nhorman, sassmann, jogreene, Jeff Kirsher
In-Reply-To: <1459843288-40623-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Jacob Keller <jacob.e.keller@intel.com>

Fix a kernel panic that occurs during surprise removal. Clear the
interface queue counts upon fm10k_init_msix_capability failure. This
prevents further code (fm10k_update_stats etc.) from attempting to
access unallocated queue vector or ring memory.

[  628.692648] BUG: unable to handle kernel NULL pointer dereference at 0000000000000068
[  628.692805] IP: [<ffffffffa0475caf>] fm10k_update_stats+0x7f/0x2c0 [fm10k]
[  628.693173] PGD 0
[  628.693759] Oops: 0000 [#1] SMP
[  628.699321] CPU: 10 PID: 8164 Comm: kworker/10:0 Tainted: G           OE  ------------   3.10.0-327.el7.x86_64 #1
[  628.700096] Hardware name: Supermicro X9DAi/X9DAi, BIOS 3.2 05/09/2015
[  628.700894] Workqueue: pciehp-1 pciehp_power_thread
[  628.701686] task: ffff88086559c500 ti: ffff8808593c0000 task.ti: ffff8808593c0000
[  628.702493] RIP: 0010:[<ffffffffa0475caf>]  [<ffffffffa0475caf>] fm10k_update_stats+0x7f/0x2c0 [fm10k]
[  628.703310] RSP: 0018:ffff8808593c3b00  EFLAGS: 00010282
[  628.704132] RAX: 0000000000000000 RBX: ffff880860760000 RCX: 0000000000000000
[  628.704963] RDX: ffff880860760b08 RSI: 0000000000000000 RDI: 0000000000000000
[  628.705794] RBP: ffff8808593c3b40 R08: 0000000000000000 R09: 0000000000000000
[  628.706604] R10: 0000000000000000 R11: ffff880860760c40 R12: 0000000000000080
[  628.707420] R13: ffff8808607608c0 R14: ffff880860779ec0 R15: ffff880860779f40
[  628.708238] FS:  0000000000000000(0000) GS:ffff88086f000000(0000) knlGS:0000000000000000
[  628.709071] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  628.709923] CR2: 0000000000000068 CR3: 000000000194a000 CR4: 00000000001407e0
[  628.710752] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  628.711596] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  628.712438] Stack:
[  628.713255]  ffff880860764458 ffff8808607608c0 ffff880860760000 ffff880860760000
[  628.714088]  0000000000000080 ffff8808607608c0 ffff880860779ec0 ffff880860779f40
[  628.714925]  ffff8808593c3b88 ffffffffa04780c5 ffff880860764458 0000000a8163cb5b
[  628.715752] Call Trace:
[  628.716560]  [<ffffffffa04780c5>] fm10k_down+0x155/0x1f0 [fm10k]
[  628.717367]  [<ffffffffa0479958>] fm10k_close+0x28/0xd0 [fm10k]
[  628.718184]  [<ffffffff81526365>] __dev_close_many+0x85/0xd0
[  628.718986]  [<ffffffff815264d8>] dev_close_many+0x98/0x120
[  628.719764]  [<ffffffff81527ab8>] rollback_registered_many+0xa8/0x230
[  628.720527]  [<ffffffff81527c80>] rollback_registered+0x40/0x70
[  628.721294]  [<ffffffff81529198>] unregister_netdevice_queue+0x48/0x80
[  628.722052]  [<ffffffff815291ec>] unregister_netdev+0x1c/0x30
[  628.722816]  [<ffffffffa04762b8>] fm10k_remove+0xd8/0xe0 [fm10k]
[  628.723581]  [<ffffffff81328c7b>] pci_device_remove+0x3b/0xb0
[  628.724340]  [<ffffffff813f5fbf>] __device_release_driver+0x7f/0xf0
[  628.725088]  [<ffffffff813f6053>] device_release_driver+0x23/0x30
[  628.725814]  [<ffffffff81321fe4>] pci_stop_bus_device+0x94/0xa0
[  628.726535]  [<ffffffff813220d2>] pci_stop_and_remove_bus_device+0x12/0x20
[  628.727249]  [<ffffffff8133de40>] pciehp_unconfigure_device+0xb0/0x1b0
[  628.727964]  [<ffffffff8133d822>] pciehp_disable_slot+0x52/0xd0
[  628.728664]  [<ffffffff8133d98a>] pciehp_power_thread+0xea/0x150
[  628.729358]  [<ffffffff8109d5fb>] process_one_work+0x17b/0x470
[  628.730036]  [<ffffffff8109e3cb>] worker_thread+0x11b/0x400
[  628.730730]  [<ffffffff8109e2b0>] ? rescuer_thread+0x400/0x400
[  628.731385]  [<ffffffff810a5aef>] kthread+0xcf/0xe0
[  628.732036]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
[  628.732674]  [<ffffffff81645858>] ret_from_fork+0x58/0x90
[  628.733289]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
[  628.733883] Code: 83 e8 01 48 8d 97 40 02 00 00 45 31 c0 4c 8d 9c c7 48 02 0
[  628.735202] RIP  [<ffffffffa0475caf>] fm10k_update_stats+0x7f/0x2c0 [fm10k]
[  628.735732]  RSP <ffff8808593c3b00>
[  628.736285] CR2: 0000000000000068
[  628.736846] ---[ end trace 9156088b311aff42 ]---

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/fm10k/fm10k_main.c | 35 ++++++++++++++++++++-------
 1 file changed, 26 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_main.c b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
index b87401c38..31179af 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_main.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
@@ -1581,6 +1581,20 @@ static void fm10k_set_num_queues(struct fm10k_intfc *interface)
 }
 
 /**
+ * fm10k_reset_num_queues - Reset the number of queues to zero
+ * @interface: board private structure
+ *
+ * This function should be called whenever we need to reset the number of
+ * queues after an error condition.
+ */
+static void fm10k_reset_num_queues(struct fm10k_intfc *interface)
+{
+	interface->num_tx_queues = 0;
+	interface->num_rx_queues = 0;
+	interface->num_q_vectors = 0;
+}
+
+/**
  * fm10k_alloc_q_vector - Allocate memory for a single interrupt vector
  * @interface: board private structure to initialize
  * @v_count: q_vectors allocated on interface, used for ring interleaving
@@ -1763,9 +1777,7 @@ static int fm10k_alloc_q_vectors(struct fm10k_intfc *interface)
 	return 0;
 
 err_out:
-	interface->num_tx_queues = 0;
-	interface->num_rx_queues = 0;
-	interface->num_q_vectors = 0;
+	fm10k_reset_num_queues(interface);
 
 	while (v_idx--)
 		fm10k_free_q_vector(interface, v_idx);
@@ -1785,9 +1797,7 @@ static void fm10k_free_q_vectors(struct fm10k_intfc *interface)
 {
 	int v_idx = interface->num_q_vectors;
 
-	interface->num_tx_queues = 0;
-	interface->num_rx_queues = 0;
-	interface->num_q_vectors = 0;
+	fm10k_reset_num_queues(interface);
 
 	while (v_idx--)
 		fm10k_free_q_vector(interface, v_idx);
@@ -1995,14 +2005,15 @@ int fm10k_init_queueing_scheme(struct fm10k_intfc *interface)
 	if (err) {
 		dev_err(&interface->pdev->dev,
 			"Unable to initialize MSI-X capability\n");
-		return err;
+		goto err_init_msix;
 	}
 
 	/* Allocate memory for queues */
 	err = fm10k_alloc_q_vectors(interface);
 	if (err) {
-		fm10k_reset_msix_capability(interface);
-		return err;
+		dev_err(&interface->pdev->dev,
+			"Unable to allocate queue vectors\n");
+		goto err_alloc_q_vectors;
 	}
 
 	/* Map rings to devices, and map devices to physical queues */
@@ -2012,6 +2023,12 @@ int fm10k_init_queueing_scheme(struct fm10k_intfc *interface)
 	fm10k_init_reta(interface);
 
 	return 0;
+
+err_alloc_q_vectors:
+	fm10k_reset_msix_capability(interface);
+err_init_msix:
+	fm10k_reset_num_queues(interface);
+	return err;
 }
 
 /**
-- 
2.5.5

^ permalink raw reply related

* [net-next 09/16] fm10k: base queue scheme covered by RSS
From: Jeff Kirsher @ 2016-04-05  8:01 UTC (permalink / raw)
  To: davem; +Cc: Jacob Keller, netdev, nhorman, sassmann, jogreene, Jeff Kirsher
In-Reply-To: <1459843288-40623-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Jacob Keller <jacob.e.keller@intel.com>

In fm10k_set_num_queues, we previously assigned the base template. This
would always be overwritten by either fm10k_set_qos_queues or
fm10k_set_rss_queues. In either case, we don't need the base values, so
we can just remove them.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/fm10k/fm10k_main.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_main.c b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
index db4353b..b87401c38 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_main.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
@@ -1572,13 +1572,11 @@ static bool fm10k_set_rss_queues(struct fm10k_intfc *interface)
  **/
 static void fm10k_set_num_queues(struct fm10k_intfc *interface)
 {
-	/* Start with base case */
-	interface->num_rx_queues = 1;
-	interface->num_tx_queues = 1;
-
+	/* Attempt to setup QoS and RSS first */
 	if (fm10k_set_qos_queues(interface))
 		return;
 
+	/* If we don't have QoS, just fallback to only RSS. */
 	fm10k_set_rss_queues(interface);
 }
 
-- 
2.5.5

^ permalink raw reply related

* [net-next 11/16] fm10k: free MBX IRQ before clearing interrupt scheme
From: Jeff Kirsher @ 2016-04-05  8:01 UTC (permalink / raw)
  To: davem; +Cc: Jacob Keller, netdev, nhorman, sassmann, jogreene, Jeff Kirsher
In-Reply-To: <1459843288-40623-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Jacob Keller <jacob.e.keller@intel.com>

During fm10k_io_error_detected we were clearing the interrupt scheme
before we freed the MBX IRQ. This causes a kernel panic because the MBX
IRQ are assigned after MSI-X initialization. Clearing the interrupt
scheme results in removing the MSI-X entry table. Fix this by freeing
the MBX IRQ before we clear the interrupt scheme, as we do elsewhere in
the driver.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/fm10k/fm10k_pci.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_pci.c b/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
index 3c7c819..da38af0 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
@@ -2274,11 +2274,11 @@ static pci_ers_result_t fm10k_io_error_detected(struct pci_dev *pdev,
 	if (netif_running(netdev))
 		fm10k_close(netdev);
 
+	fm10k_mbx_free_irq(interface);
+
 	/* free interrupts */
 	fm10k_clear_queueing_scheme(interface);
 
-	fm10k_mbx_free_irq(interface);
-
 	pci_disable_device(pdev);
 
 	/* Request a slot reset. */
-- 
2.5.5

^ permalink raw reply related

* [net-next 05/16] fm10k: cleanup SPACE_BEFORE_TAB checkpatch warning
From: Jeff Kirsher @ 2016-04-05  8:01 UTC (permalink / raw)
  To: davem; +Cc: Bruce Allan, netdev, nhorman, sassmann, jogreene, Jeff Kirsher
In-Reply-To: <1459843288-40623-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Bruce Allan <bruce.w.allan@intel.com>

Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/fm10k/fm10k_ptp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_ptp.c b/drivers/net/ethernet/intel/fm10k/fm10k_ptp.c
index b4945e8..1c1ccad 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_ptp.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_ptp.c
@@ -416,7 +416,7 @@ void fm10k_ptp_register(struct fm10k_intfc *interface)
 	/* This math is simply the inverse of the math in
 	 * fm10k_adjust_systime_pf applied to an adjustment value
 	 * of 2^30 - 1 which is the maximum value of the register:
-	 * 	max_ppb == ((2^30 - 1) * 5^9) / 2^31
+	 *	max_ppb == ((2^30 - 1) * 5^9) / 2^31
 	 */
 	ptp_caps->max_adj	= 976562;
 	ptp_caps->adjfreq	= fm10k_ptp_adjfreq;
-- 
2.5.5

^ permalink raw reply related

* [net-next 02/16] fm10k: cleanup remaining right-bit-shifted 1
From: Jeff Kirsher @ 2016-04-05  8:01 UTC (permalink / raw)
  To: davem; +Cc: Bruce Allan, netdev, nhorman, sassmann, jogreene, Jeff Kirsher
In-Reply-To: <1459843288-40623-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Bruce Allan <bruce.w.allan@intel.com>

Use BIT() macro instead.

Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/fm10k/fm10k.h         | 12 ++++++------
 drivers/net/ethernet/intel/fm10k/fm10k_ethtool.c | 20 +++++++++-----------
 drivers/net/ethernet/intel/fm10k/fm10k_main.c    | 20 ++++++++++----------
 drivers/net/ethernet/intel/fm10k/fm10k_netdev.c  |  2 +-
 drivers/net/ethernet/intel/fm10k/fm10k_pci.c     |  8 ++++----
 drivers/net/ethernet/intel/fm10k/fm10k_pf.c      | 12 ++++++------
 drivers/net/ethernet/intel/fm10k/fm10k_tlv.c     | 24 ++++++++++++------------
 drivers/net/ethernet/intel/fm10k/fm10k_type.h    |  8 ++++----
 8 files changed, 52 insertions(+), 54 deletions(-)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k.h b/drivers/net/ethernet/intel/fm10k/fm10k.h
index b34bb00..83f3867 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k.h
+++ b/drivers/net/ethernet/intel/fm10k/fm10k.h
@@ -262,12 +262,12 @@ struct fm10k_intfc {
 	unsigned long state;
 
 	u32 flags;
-#define FM10K_FLAG_RESET_REQUESTED		(u32)(1 << 0)
-#define FM10K_FLAG_RSS_FIELD_IPV4_UDP		(u32)(1 << 1)
-#define FM10K_FLAG_RSS_FIELD_IPV6_UDP		(u32)(1 << 2)
-#define FM10K_FLAG_RX_TS_ENABLED		(u32)(1 << 3)
-#define FM10K_FLAG_SWPRI_CONFIG			(u32)(1 << 4)
-#define FM10K_FLAG_DEBUG_STATS			(u32)(1 << 5)
+#define FM10K_FLAG_RESET_REQUESTED		(u32)(BIT(0))
+#define FM10K_FLAG_RSS_FIELD_IPV4_UDP		(u32)(BIT(1))
+#define FM10K_FLAG_RSS_FIELD_IPV6_UDP		(u32)(BIT(2))
+#define FM10K_FLAG_RX_TS_ENABLED		(u32)(BIT(3))
+#define FM10K_FLAG_SWPRI_CONFIG			(u32)(BIT(4))
+#define FM10K_FLAG_DEBUG_STATS			(u32)(BIT(5))
 	int xcast_mode;
 
 	/* Tx fast path data */
diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_ethtool.c b/drivers/net/ethernet/intel/fm10k/fm10k_ethtool.c
index 2f6a05b..28837ae 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_ethtool.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_ethtool.c
@@ -425,7 +425,7 @@ static void fm10k_get_regs(struct net_device *netdev,
 	u32 *buff = p;
 	u16 i;
 
-	regs->version = (1 << 24) | (hw->revision_id << 16) | hw->device_id;
+	regs->version = BIT(24) | (hw->revision_id << 16) | hw->device_id;
 
 	switch (hw->mac.type) {
 	case fm10k_mac_pf:
@@ -942,8 +942,8 @@ static int fm10k_mbx_test(struct fm10k_intfc *interface, u64 *data)
 		return 0;
 
 	/* loop through both nested and unnested attribute types */
-	for (attr_flag = (1 << FM10K_TEST_MSG_UNSET);
-	     attr_flag < (1 << (2 * FM10K_TEST_MSG_NESTED));
+	for (attr_flag = BIT(FM10K_TEST_MSG_UNSET);
+	     attr_flag < BIT(2 * FM10K_TEST_MSG_NESTED);
 	     attr_flag += attr_flag) {
 		/* generate message to be tested */
 		fm10k_tlv_msg_test_create(test_msg, attr_flag);
@@ -1005,7 +1005,7 @@ static u32 fm10k_get_priv_flags(struct net_device *netdev)
 	u32 priv_flags = 0;
 
 	if (interface->flags & FM10K_FLAG_DEBUG_STATS)
-		priv_flags |= 1 << FM10K_PRV_FLAG_DEBUG_STATS;
+		priv_flags |= BIT(FM10K_PRV_FLAG_DEBUG_STATS);
 
 	return priv_flags;
 }
@@ -1014,10 +1014,10 @@ static int fm10k_set_priv_flags(struct net_device *netdev, u32 priv_flags)
 {
 	struct fm10k_intfc *interface = netdev_priv(netdev);
 
-	if (priv_flags >= (1 << FM10K_PRV_FLAG_LEN))
+	if (priv_flags >= BIT(FM10K_PRV_FLAG_LEN))
 		return -EINVAL;
 
-	if (priv_flags & (1 << FM10K_PRV_FLAG_DEBUG_STATS))
+	if (priv_flags & BIT(FM10K_PRV_FLAG_DEBUG_STATS))
 		interface->flags |= FM10K_FLAG_DEBUG_STATS;
 	else
 		interface->flags &= ~FM10K_FLAG_DEBUG_STATS;
@@ -1145,7 +1145,7 @@ static unsigned int fm10k_max_channels(struct net_device *dev)
 
 	/* For QoS report channels per traffic class */
 	if (tcs > 1)
-		max_combined = 1 << (fls(max_combined / tcs) - 1);
+		max_combined = BIT((fls(max_combined / tcs) - 1));
 
 	return max_combined;
 }
@@ -1210,11 +1210,9 @@ static int fm10k_get_ts_info(struct net_device *dev,
 	else
 		info->phc_index = -1;
 
-	info->tx_types = (1 << HWTSTAMP_TX_OFF) |
-			 (1 << HWTSTAMP_TX_ON);
+	info->tx_types = BIT(HWTSTAMP_TX_OFF) | BIT(HWTSTAMP_TX_ON);
 
-	info->rx_filters = (1 << HWTSTAMP_FILTER_NONE) |
-			   (1 << HWTSTAMP_FILTER_ALL);
+	info->rx_filters = BIT(HWTSTAMP_FILTER_NONE) | BIT(HWTSTAMP_FILTER_ALL);
 
 	return 0;
 }
diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_main.c b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
index d411aa5..db4353b 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_main.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
@@ -401,10 +401,10 @@ static inline void fm10k_rx_checksum(struct fm10k_ring *ring,
 }
 
 #define FM10K_RSS_L4_TYPES_MASK \
-	((1ul << FM10K_RSSTYPE_IPV4_TCP) | \
-	 (1ul << FM10K_RSSTYPE_IPV4_UDP) | \
-	 (1ul << FM10K_RSSTYPE_IPV6_TCP) | \
-	 (1ul << FM10K_RSSTYPE_IPV6_UDP))
+	(BIT(FM10K_RSSTYPE_IPV4_TCP) | \
+	 BIT(FM10K_RSSTYPE_IPV4_UDP) | \
+	 BIT(FM10K_RSSTYPE_IPV6_TCP) | \
+	 BIT(FM10K_RSSTYPE_IPV6_UDP))
 
 static inline void fm10k_rx_hash(struct fm10k_ring *ring,
 				 union fm10k_rx_desc *rx_desc,
@@ -420,7 +420,7 @@ static inline void fm10k_rx_hash(struct fm10k_ring *ring,
 		return;
 
 	skb_set_hash(skb, le32_to_cpu(rx_desc->d.rss),
-		     ((1ul << rss_type) & FM10K_RSS_L4_TYPES_MASK) ?
+		     (BIT(rss_type) & FM10K_RSS_L4_TYPES_MASK) ?
 		     PKT_HASH_TYPE_L4 : PKT_HASH_TYPE_L3);
 }
 
@@ -1409,7 +1409,7 @@ static void fm10k_update_itr(struct fm10k_ring_container *ring_container)
 	 * accounts for changes in the ITR due to PCIe link speed.
 	 */
 	itr_round = ACCESS_ONCE(ring_container->itr_scale) + 8;
-	avg_wire_size += (1 << itr_round) - 1;
+	avg_wire_size += BIT(itr_round) - 1;
 	avg_wire_size >>= itr_round;
 
 	/* write back value and retain adaptive flag */
@@ -1511,17 +1511,17 @@ static bool fm10k_set_qos_queues(struct fm10k_intfc *interface)
 	/* set QoS mask and indices */
 	f = &interface->ring_feature[RING_F_QOS];
 	f->indices = pcs;
-	f->mask = (1 << fls(pcs - 1)) - 1;
+	f->mask = BIT(fls(pcs - 1)) - 1;
 
 	/* determine the upper limit for our current DCB mode */
 	rss_i = interface->hw.mac.max_queues / pcs;
-	rss_i = 1 << (fls(rss_i) - 1);
+	rss_i = BIT(fls(rss_i) - 1);
 
 	/* set RSS mask and indices */
 	f = &interface->ring_feature[RING_F_RSS];
 	rss_i = min_t(u16, rss_i, f->limit);
 	f->indices = rss_i;
-	f->mask = (1 << fls(rss_i - 1)) - 1;
+	f->mask = BIT(fls(rss_i - 1)) - 1;
 
 	/* configure pause class to queue mapping */
 	for (i = 0; i < pcs; i++)
@@ -1551,7 +1551,7 @@ static bool fm10k_set_rss_queues(struct fm10k_intfc *interface)
 
 	/* record indices and power of 2 mask for RSS */
 	f->indices = rss_i;
-	f->mask = (1 << fls(rss_i - 1)) - 1;
+	f->mask = BIT(fls(rss_i - 1)) - 1;
 
 	interface->num_rx_queues = rss_i;
 	interface->num_tx_queues = rss_i;
diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c b/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
index d09a8dd..0ff6874 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
@@ -1429,7 +1429,7 @@ struct net_device *fm10k_alloc_netdev(const struct fm10k_info *info)
 
 	/* configure default debug level */
 	interface = netdev_priv(dev);
-	interface->msg_enable = (1 << DEFAULT_DEBUG_LEVEL_SHIFT) - 1;
+	interface->msg_enable = BIT(DEFAULT_DEBUG_LEVEL_SHIFT) - 1;
 
 	/* configure default features */
 	dev->features |= NETIF_F_IP_CSUM |
diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_pci.c b/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
index 86700a4..c9324c7 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
@@ -579,7 +579,7 @@ static void fm10k_configure_tx_ring(struct fm10k_intfc *interface,
 	u64 tdba = ring->dma;
 	u32 size = ring->count * sizeof(struct fm10k_tx_desc);
 	u32 txint = FM10K_INT_MAP_DISABLE;
-	u32 txdctl = (1 << FM10K_TXDCTL_MAX_TIME_SHIFT) | FM10K_TXDCTL_ENABLE;
+	u32 txdctl = BIT(FM10K_TXDCTL_MAX_TIME_SHIFT) | FM10K_TXDCTL_ENABLE;
 	u8 reg_idx = ring->reg_idx;
 
 	/* disable queue to avoid issues while updating state */
@@ -730,7 +730,7 @@ static void fm10k_configure_rx_ring(struct fm10k_intfc *interface,
 	if (interface->pfc_en)
 		rx_pause = interface->pfc_en;
 #endif
-	if (!(rx_pause & (1 << ring->qos_pc)))
+	if (!(rx_pause & BIT(ring->qos_pc)))
 		rxdctl |= FM10K_RXDCTL_DROP_ON_EMPTY;
 
 	fm10k_write_reg(hw, FM10K_RXDCTL(reg_idx), rxdctl);
@@ -779,7 +779,7 @@ void fm10k_update_rx_drop_en(struct fm10k_intfc *interface)
 		u32 rxdctl = FM10K_RXDCTL_WRITE_BACK_MIN_DELAY;
 		u8 reg_idx = ring->reg_idx;
 
-		if (!(rx_pause & (1 << ring->qos_pc)))
+		if (!(rx_pause & BIT(ring->qos_pc)))
 			rxdctl |= FM10K_RXDCTL_DROP_ON_EMPTY;
 
 		fm10k_write_reg(hw, FM10K_RXDCTL(reg_idx), rxdctl);
@@ -1065,7 +1065,7 @@ static void fm10k_reset_drop_on_empty(struct fm10k_intfc *interface, u32 eicr)
 	if (maxholdq)
 		fm10k_write_reg(hw, FM10K_MAXHOLDQ(7), maxholdq);
 	for (q = 255;;) {
-		if (maxholdq & (1 << 31)) {
+		if (maxholdq & BIT(31)) {
 			if (q < FM10K_MAX_QUEUES_PF) {
 				interface->rx_overrun_pf++;
 				fm10k_write_reg(hw, FM10K_RXDCTL(q), rxdctl);
diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_pf.c b/drivers/net/ethernet/intel/fm10k/fm10k_pf.c
index 34a0b03..23de956 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_pf.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_pf.c
@@ -527,8 +527,8 @@ static s32 fm10k_configure_dglort_map_pf(struct fm10k_hw *hw,
 		return FM10K_ERR_PARAM;
 
 	/* determine count of VSIs and queues */
-	queue_count = 1 << (dglort->rss_l + dglort->pc_l);
-	vsi_count = 1 << (dglort->vsi_l + dglort->queue_l);
+	queue_count = BIT(dglort->rss_l + dglort->pc_l);
+	vsi_count = BIT(dglort->vsi_l + dglort->queue_l);
 	glort = dglort->glort;
 	q_idx = dglort->queue_b;
 
@@ -544,8 +544,8 @@ static s32 fm10k_configure_dglort_map_pf(struct fm10k_hw *hw,
 	}
 
 	/* determine count of PCs and queues */
-	queue_count = 1 << (dglort->queue_l + dglort->rss_l + dglort->vsi_l);
-	pc_count = 1 << dglort->pc_l;
+	queue_count = BIT(dglort->queue_l + dglort->rss_l + dglort->vsi_l);
+	pc_count = BIT(dglort->pc_l);
 
 	/* configure PC for Tx queues */
 	for (pc = 0; pc < pc_count; pc++) {
@@ -952,7 +952,7 @@ static s32 fm10k_iov_reset_resources_pf(struct fm10k_hw *hw,
 		return FM10K_ERR_PARAM;
 
 	/* clear event notification of VF FLR */
-	fm10k_write_reg(hw, FM10K_PFVFLREC(vf_idx / 32), 1 << (vf_idx % 32));
+	fm10k_write_reg(hw, FM10K_PFVFLREC(vf_idx / 32), BIT(vf_idx % 32));
 
 	/* force timeout and then disconnect the mailbox */
 	vf_info->mbx.timeout = 0;
@@ -1370,7 +1370,7 @@ s32 fm10k_iov_msg_lport_state_pf(struct fm10k_hw *hw, u32 **results,
 		mode = fm10k_iov_supported_xcast_mode_pf(vf_info, mode);
 
 		/* if mode is not currently enabled, enable it */
-		if (!(FM10K_VF_FLAG_ENABLED(vf_info) & (1 << mode)))
+		if (!(FM10K_VF_FLAG_ENABLED(vf_info) & BIT(mode)))
 			fm10k_update_xcast_mode_pf(hw, vf_info->glort, mode);
 
 		/* swap mode back to a bit flag */
diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_tlv.c b/drivers/net/ethernet/intel/fm10k/fm10k_tlv.c
index ab01bb3..b999897 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_tlv.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_tlv.c
@@ -222,7 +222,7 @@ s32 fm10k_tlv_attr_put_value(u32 *msg, u16 attr_id, s64 value, u32 len)
 	attr = &msg[FM10K_TLV_DWORD_LEN(*msg)];
 
 	if (len < 4) {
-		attr[1] = (u32)value & ((0x1ul << (8 * len)) - 1);
+		attr[1] = (u32)value & (BIT(8 * len) - 1);
 	} else {
 		attr[1] = (u32)value;
 		if (len > 4)
@@ -652,29 +652,29 @@ const struct fm10k_tlv_attr fm10k_tlv_msg_test_attr[] = {
  **/
 static void fm10k_tlv_msg_test_generate_data(u32 *msg, u32 attr_flags)
 {
-	if (attr_flags & (1 << FM10K_TEST_MSG_STRING))
+	if (attr_flags & BIT(FM10K_TEST_MSG_STRING))
 		fm10k_tlv_attr_put_null_string(msg, FM10K_TEST_MSG_STRING,
 					       test_str);
-	if (attr_flags & (1 << FM10K_TEST_MSG_MAC_ADDR))
+	if (attr_flags & BIT(FM10K_TEST_MSG_MAC_ADDR))
 		fm10k_tlv_attr_put_mac_vlan(msg, FM10K_TEST_MSG_MAC_ADDR,
 					    test_mac, test_vlan);
-	if (attr_flags & (1 << FM10K_TEST_MSG_U8))
+	if (attr_flags & BIT(FM10K_TEST_MSG_U8))
 		fm10k_tlv_attr_put_u8(msg, FM10K_TEST_MSG_U8,  test_u8);
-	if (attr_flags & (1 << FM10K_TEST_MSG_U16))
+	if (attr_flags & BIT(FM10K_TEST_MSG_U16))
 		fm10k_tlv_attr_put_u16(msg, FM10K_TEST_MSG_U16, test_u16);
-	if (attr_flags & (1 << FM10K_TEST_MSG_U32))
+	if (attr_flags & BIT(FM10K_TEST_MSG_U32))
 		fm10k_tlv_attr_put_u32(msg, FM10K_TEST_MSG_U32, test_u32);
-	if (attr_flags & (1 << FM10K_TEST_MSG_U64))
+	if (attr_flags & BIT(FM10K_TEST_MSG_U64))
 		fm10k_tlv_attr_put_u64(msg, FM10K_TEST_MSG_U64, test_u64);
-	if (attr_flags & (1 << FM10K_TEST_MSG_S8))
+	if (attr_flags & BIT(FM10K_TEST_MSG_S8))
 		fm10k_tlv_attr_put_s8(msg, FM10K_TEST_MSG_S8,  test_s8);
-	if (attr_flags & (1 << FM10K_TEST_MSG_S16))
+	if (attr_flags & BIT(FM10K_TEST_MSG_S16))
 		fm10k_tlv_attr_put_s16(msg, FM10K_TEST_MSG_S16, test_s16);
-	if (attr_flags & (1 << FM10K_TEST_MSG_S32))
+	if (attr_flags & BIT(FM10K_TEST_MSG_S32))
 		fm10k_tlv_attr_put_s32(msg, FM10K_TEST_MSG_S32, test_s32);
-	if (attr_flags & (1 << FM10K_TEST_MSG_S64))
+	if (attr_flags & BIT(FM10K_TEST_MSG_S64))
 		fm10k_tlv_attr_put_s64(msg, FM10K_TEST_MSG_S64, test_s64);
-	if (attr_flags & (1 << FM10K_TEST_MSG_LE_STRUCT))
+	if (attr_flags & BIT(FM10K_TEST_MSG_LE_STRUCT))
 		fm10k_tlv_attr_put_le_struct(msg, FM10K_TEST_MSG_LE_STRUCT,
 					     test_le, 8);
 }
diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_type.h b/drivers/net/ethernet/intel/fm10k/fm10k_type.h
index 854ebb1..5c05330 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_type.h
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_type.h
@@ -617,10 +617,10 @@ struct fm10k_vf_info {
 						 */
 };
 
-#define FM10K_VF_FLAG_ALLMULTI_CAPABLE	((u8)1 << FM10K_XCAST_MODE_ALLMULTI)
-#define FM10K_VF_FLAG_MULTI_CAPABLE	((u8)1 << FM10K_XCAST_MODE_MULTI)
-#define FM10K_VF_FLAG_PROMISC_CAPABLE	((u8)1 << FM10K_XCAST_MODE_PROMISC)
-#define FM10K_VF_FLAG_NONE_CAPABLE	((u8)1 << FM10K_XCAST_MODE_NONE)
+#define FM10K_VF_FLAG_ALLMULTI_CAPABLE	(u8)(BIT(FM10K_XCAST_MODE_ALLMULTI))
+#define FM10K_VF_FLAG_MULTI_CAPABLE	(u8)(BIT(FM10K_XCAST_MODE_MULTI))
+#define FM10K_VF_FLAG_PROMISC_CAPABLE	(u8)(BIT(FM10K_XCAST_MODE_PROMISC))
+#define FM10K_VF_FLAG_NONE_CAPABLE	(u8)(BIT(FM10K_XCAST_MODE_NONE))
 #define FM10K_VF_FLAG_CAPABLE(vf_info)	((vf_info)->vf_flags & (u8)0xF)
 #define FM10K_VF_FLAG_ENABLED(vf_info)	((vf_info)->vf_flags >> 4)
 #define FM10K_VF_FLAG_SET_MODE(mode)	((u8)0x10 << (mode))
-- 
2.5.5

^ permalink raw reply related

* [net-next 01/16] fm10k: Move constants to the right of binary operators
From: Jeff Kirsher @ 2016-04-05  8:01 UTC (permalink / raw)
  To: davem; +Cc: Bruce Allan, netdev, nhorman, sassmann, jogreene, Jeff Kirsher
In-Reply-To: <1459843288-40623-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Bruce Allan <bruce.w.allan@intel.com>

The semantic patch that makes this change is available
in scripts/coccinelle/misc/compare_const_fl.cocci.

More information about semantic patching is available at
http://coccinelle.lip6.fr/

Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/fm10k/fm10k_main.c |  2 +-
 drivers/net/ethernet/intel/fm10k/fm10k_pci.c  | 16 ++++++++--------
 drivers/net/ethernet/intel/fm10k/fm10k_pf.c   |  6 +++---
 3 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_main.c b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
index 4de17db..d411aa5 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_main.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
@@ -420,7 +420,7 @@ static inline void fm10k_rx_hash(struct fm10k_ring *ring,
 		return;
 
 	skb_set_hash(skb, le32_to_cpu(rx_desc->d.rss),
-		     (FM10K_RSS_L4_TYPES_MASK & (1ul << rss_type)) ?
+		     ((1ul << rss_type) & FM10K_RSS_L4_TYPES_MASK) ?
 		     PKT_HASH_TYPE_L4 : PKT_HASH_TYPE_L3);
 }
 
diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_pci.c b/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
index 4eb7a6f..86700a4 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
@@ -579,7 +579,7 @@ static void fm10k_configure_tx_ring(struct fm10k_intfc *interface,
 	u64 tdba = ring->dma;
 	u32 size = ring->count * sizeof(struct fm10k_tx_desc);
 	u32 txint = FM10K_INT_MAP_DISABLE;
-	u32 txdctl = FM10K_TXDCTL_ENABLE | (1 << FM10K_TXDCTL_MAX_TIME_SHIFT);
+	u32 txdctl = (1 << FM10K_TXDCTL_MAX_TIME_SHIFT) | FM10K_TXDCTL_ENABLE;
 	u8 reg_idx = ring->reg_idx;
 
 	/* disable queue to avoid issues while updating state */
@@ -903,8 +903,8 @@ static irqreturn_t fm10k_msix_mbx_vf(int __always_unused irq, void *data)
 
 	/* re-enable mailbox interrupt and indicate 20us delay */
 	fm10k_write_reg(hw, FM10K_VFITR(FM10K_MBX_VECTOR),
-			FM10K_ITR_ENABLE | (FM10K_MBX_INT_DELAY >>
-					    hw->mac.itr_scale));
+			(FM10K_MBX_INT_DELAY >> hw->mac.itr_scale) |
+			FM10K_ITR_ENABLE);
 
 	/* service upstream mailbox */
 	if (fm10k_mbx_trylock(interface)) {
@@ -1135,8 +1135,8 @@ static irqreturn_t fm10k_msix_mbx_pf(int __always_unused irq, void *data)
 
 	/* re-enable mailbox interrupt and indicate 20us delay */
 	fm10k_write_reg(hw, FM10K_ITR(FM10K_MBX_VECTOR),
-			FM10K_ITR_ENABLE | (FM10K_MBX_INT_DELAY >>
-					    hw->mac.itr_scale));
+			(FM10K_MBX_INT_DELAY >> hw->mac.itr_scale) |
+			FM10K_ITR_ENABLE);
 
 	return IRQ_HANDLED;
 }
@@ -1253,7 +1253,7 @@ static int fm10k_mbx_request_irq_vf(struct fm10k_intfc *interface)
 	int err;
 
 	/* Use timer0 for interrupt moderation on the mailbox */
-	u32 itr = FM10K_INT_MAP_TIMER0 | entry->entry;
+	u32 itr = entry->entry | FM10K_INT_MAP_TIMER0;
 
 	/* register mailbox handlers */
 	err = hw->mbx.ops.register_handlers(&hw->mbx, vf_mbx_data);
@@ -1420,8 +1420,8 @@ static int fm10k_mbx_request_irq_pf(struct fm10k_intfc *interface)
 	int err;
 
 	/* Use timer0 for interrupt moderation on the mailbox */
-	u32 mbx_itr = FM10K_INT_MAP_TIMER0 | entry->entry;
-	u32 other_itr = FM10K_INT_MAP_IMMEDIATE | entry->entry;
+	u32 mbx_itr = entry->entry | FM10K_INT_MAP_TIMER0;
+	u32 other_itr = entry->entry | FM10K_INT_MAP_IMMEDIATE;
 
 	/* register mailbox handlers */
 	err = hw->mbx.ops.register_handlers(&hw->mbx, pf_mbx_data);
diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_pf.c b/drivers/net/ethernet/intel/fm10k/fm10k_pf.c
index 62ccebc..34a0b03 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_pf.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_pf.c
@@ -711,8 +711,8 @@ static s32 fm10k_iov_assign_resources_pf(struct fm10k_hw *hw, u16 num_vfs,
 					FM10K_RXDCTL_WRITE_BACK_MIN_DELAY |
 					FM10K_RXDCTL_DROP_ON_EMPTY);
 			fm10k_write_reg(hw, FM10K_RXQCTL(vf_q_idx),
-					FM10K_RXQCTL_VF |
-					(i << FM10K_RXQCTL_VF_SHIFT));
+					(i << FM10K_RXQCTL_VF_SHIFT) |
+					FM10K_RXQCTL_VF);
 
 			/* map queue pair to VF */
 			fm10k_write_reg(hw, FM10K_TQMAP(qmap_idx), vf_q_idx);
@@ -987,7 +987,7 @@ static s32 fm10k_iov_reset_resources_pf(struct fm10k_hw *hw,
 	txqctl = ((u32)vf_vid << FM10K_TXQCTL_VID_SHIFT) |
 		 (vf_idx << FM10K_TXQCTL_TC_SHIFT) |
 		 FM10K_TXQCTL_VF | vf_idx;
-	rxqctl = FM10K_RXQCTL_VF | (vf_idx << FM10K_RXQCTL_VF_SHIFT);
+	rxqctl = (vf_idx << FM10K_RXQCTL_VF_SHIFT) | FM10K_RXQCTL_VF;
 
 	/* stop further DMA and reset queue ownership back to VF */
 	for (i = vf_q_idx; i < (queues_per_pool + vf_q_idx); i++) {
-- 
2.5.5

^ permalink raw reply related

* Re: [RESEND PATCH V4 2/3] IB/hns: Add HiSilicon RoCE driver support
From: oulijun @ 2016-04-05  7:32 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA,
	sean.hefty-ral2JQCrhuEAvxtiuMwx3w,
	hal.rosenstock-Re5JQEeQqe8AvxtiuMwx3w,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q,
	jeffrey.t.kirsher-ral2JQCrhuEAvxtiuMwx3w,
	jiri-VPRAkNaXOzVWk0Htik3J/w, ogerlitz-VPRAkNaXOzVWk0Htik3J/w,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	gongyangming-hv44wF8Li93QT0dZR+AlfA,
	xiaokun-hv44wF8Li93QT0dZR+AlfA,
	tangchaofei-hv44wF8Li93QT0dZR+AlfA,
	haifeng.wei-hv44wF8Li93QT0dZR+AlfA,
	yisen.zhuang-hv44wF8Li93QT0dZR+AlfA,
	yankejian-hv44wF8Li93QT0dZR+AlfA,
	lisheng011-hv44wF8Li93QT0dZR+AlfA,
	charles.chenxin-hv44wF8Li93QT0dZR+AlfA,
	linuxarm-hv44wF8Li93QT0dZR+AlfA
In-Reply-To: <20160402015830.GD8565-2ukJVAZIZ/Y@public.gmane.org>

Hi,  Leon Romanovsky
On 2016/4/2 9:58, Leon Romanovsky wrote:
> On Fri, Apr 01, 2016 at 05:21:31PM +0800, Lijun Ou wrote:
>> The driver for HiSilicon RoCE is a platform driver.
>> The driver will support multiple versions of hardware. Currently only "v1"
>> for hip06 SoC is supported.
>> The driver includes two parts: common driver and hardware-specific
>> operations. hns_roce_v1_hw.c and hns_roce_v1_hw.h are files for
>> hardware-specific operations only for v1 engine, and other files(.c and .h)
>> for common algorithm and common hardware operations.
>>
>> Signed-off-by: Lijun Ou <oulijun-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
>> Signed-off-by: Wei Hu(Xavier) <xavier.huwei-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
>> Signed-off-by: Znlong <zhaonenglong-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
>> ---
>>  MAINTAINERS                                        |    8 +
>>  drivers/infiniband/Kconfig                         |    1 +
>>  drivers/infiniband/hw/Makefile                     |    1 +
>>  drivers/infiniband/hw/hisilicon/hns/Kconfig        |   10 +
>>  drivers/infiniband/hw/hisilicon/hns/Makefile       |    9 +
>>  drivers/infiniband/hw/hisilicon/hns/hns_roce_ah.c  |  110 +
> 
> We are not adding name of company (hisilicon) for infiniband HW drivers
> drivers/infiniband/hw/hisilicon/hns/hns_roce_ah.c
> --->
> drivers/infiniband/hw/hns/hns_roce_ah.c
>
Surely, i will modify the location of RoCE driver code after disscussed in next patch

> 
>>  .../infiniband/hw/hisilicon/hns/hns_roce_alloc.c   |  239 ++
>  ^^^^^^
> Please fix you paths.
> 
>>  drivers/infiniband/hw/hisilicon/hns/hns_roce_cmd.c |  338 +++
>>  drivers/infiniband/hw/hisilicon/hns/hns_roce_cmd.h |   80 +
>>  .../infiniband/hw/hisilicon/hns/hns_roce_common.h  |  308 +++
>>  drivers/infiniband/hw/hisilicon/hns/hns_roce_cq.c  |  436 +++
>>  .../infiniband/hw/hisilicon/hns/hns_roce_device.h  |  794 ++++++
>>  drivers/infiniband/hw/hisilicon/hns/hns_roce_eq.c  |  758 ++++++
>>  drivers/infiniband/hw/hisilicon/hns/hns_roce_eq.h  |  132 +
>>  drivers/infiniband/hw/hisilicon/hns/hns_roce_icm.c |  578 ++++
>>  drivers/infiniband/hw/hisilicon/hns/hns_roce_icm.h |  112 +
>>  .../infiniband/hw/hisilicon/hns/hns_roce_main.c    | 1097 ++++++++
>>  drivers/infiniband/hw/hisilicon/hns/hns_roce_mr.c  |  605 +++++
>>  drivers/infiniband/hw/hisilicon/hns/hns_roce_pd.c  |  124 +
>>  drivers/infiniband/hw/hisilicon/hns/hns_roce_qp.c  |  841 ++++++
>>  .../infiniband/hw/hisilicon/hns/hns_roce_user.h    |   31 +
>>  .../infiniband/hw/hisilicon/hns/hns_roce_v1_hw.c   | 2832 ++++++++++++++++++++
>>  .../infiniband/hw/hisilicon/hns/hns_roce_v1_hw.h   |  985 +++++++
>                                           ^^^^^^
> Do you support v1 of RoCE or v1 of your HW?
> 
Here, v1 stands for hw, that is, we support v1 of our hw.
>>  23 files changed, 10429 insertions(+)
> 
> Please appreciate the effort needed to review such large patch and
> invest time and effort to divide this to number of small easy review patches.
> 
    Surely, i have pay attention to the patch, but i consider that it is not better to
split the patch into small patch. because it will the base function of RoCE.
    For your advice, i will make further efforts to taking a discussion how to reslove the question.

thanks
Lijun Ou
> .
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH (net.git)] stmmac: fix adjust link call in case of a switch is attached
From: Giuseppe Cavallaro @ 2016-04-05  6:46 UTC (permalink / raw)
  To: netdev; +Cc: Giuseppe Cavallaro, Alexandre TORGUE

While initializing the phy, the stmmac driver sets the
PHY_IGNORE_INTERRUPT so the PAL won't call the adjust hook
that is needed, on some platforms, e.g. STi, to invoke the glue.

The patch allows the PAL to poll the stmmac_adjust_link just one time
in case of a switch is attached, setting later the PHY_IGNORE_INTERRUPT
flag.
Moving this kind of logic inside the adjust_link it makes sense to
anticipate the check for EEE that will never initialized in this
scenario.

Reported-by: Gabriel Fernandez <gabriel.fernandez@linaro.org>
Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Tested-by: Gabriel Fernandez <gabriel.fernandez@linaro.org>
Cc: Alexandre TORGUE <alexandre.torgue@st.com>
---
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c |   22 +++++++++-----------
 1 files changed, 10 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 78464fa..fcbd4be 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -288,10 +288,6 @@ bool stmmac_eee_init(struct stmmac_priv *priv)
 	    (priv->pcs == STMMAC_PCS_RTBI))
 		goto out;
 
-	/* Never init EEE in case of a switch is attached */
-	if (priv->phydev->is_pseudo_fixed_link)
-		goto out;
-
 	/* MAC core supports the EEE feature. */
 	if (priv->dma_cap.eee) {
 		int tx_lpi_timer = priv->tx_lpi_timer;
@@ -771,10 +767,16 @@ static void stmmac_adjust_link(struct net_device *dev)
 
 	spin_unlock_irqrestore(&priv->lock, flags);
 
-	/* At this stage, it could be needed to setup the EEE or adjust some
-	 * MAC related HW registers.
-	 */
-	priv->eee_enabled = stmmac_eee_init(priv);
+	if (phydev->is_pseudo_fixed_link)
+		/* Stop PHY layer to call the hook to adjust the link in case
+		 * of a switch is attached to the stmmac driver.
+		 */
+		phydev->irq = PHY_IGNORE_INTERRUPT;
+	else
+		/* At this stage, init the EEE if supported.
+		 * Never called in case of fixed_link.
+		 */
+		priv->eee_enabled = stmmac_eee_init(priv);
 }
 
 /**
@@ -865,10 +867,6 @@ static int stmmac_init_phy(struct net_device *dev)
 		return -ENODEV;
 	}
 
-	/* If attached to a switch, there is no reason to poll phy handler */
-	if (phydev->is_pseudo_fixed_link)
-		phydev->irq = PHY_IGNORE_INTERRUPT;
-
 	pr_debug("stmmac_init_phy:  %s: attached to PHY (UID 0x%x)"
 		 " Link = %d\n", dev->name, phydev->phy_id, phydev->link);
 
-- 
1.7.4.4

^ permalink raw reply related

* Re: am335x: no multicast reception over VLAN
From: Yegor Yefremov @ 2016-04-05  6:22 UTC (permalink / raw)
  To: Mugunthan V N
  Cc: Grygorii Strashko, Peter Korsgaard, netdev,
	linux-omap@vger.kernel.org, drivshin, ml, David Miller
In-Reply-To: <57035729.8050704@ti.com>

Grygorii, Mugunthan,

On Tue, Apr 5, 2016 at 8:11 AM, Mugunthan V N <mugunthanvnm@ti.com> wrote:
> On Friday 01 April 2016 05:39 PM, Grygorii Strashko wrote:
>> On 03/31/2016 10:52 AM, Yegor Yefremov wrote:
>>> On Thu, Mar 31, 2016 at 8:37 AM, Mugunthan V N <mugunthanvnm@ti.com> wrote:
>>>> On Thursday 31 March 2016 01:17 AM, Peter Korsgaard wrote:
>>>>>>>>>> "Mugunthan" == Mugunthan V N <mugunthanvnm@ti.com> writes:
>>>>>
>>>>> Hi,
>>>>>
>>>>>   > You had received these packets as tcpdump will enable promiscuous mode
>>>>>   > so that you receive all the packets from the wire.
>>>>>
>>>>> FYI, you can use the -p option to tcpdump to not put the interface into
>>>>> promiscuous mode.
>>>>>
>>>>
>>>> Thanks for the information Peter Korsgaard.
>>>>
>>>> Yegor, can you provide tcpdump using -p as well in Grygorii commands.
>>>
>>> Before VLAN configuration:
>>>
>>> # switch-config -d
>>> cpsw hw version 1.12 (0)
>>> 0   : type: vlan , vid = 1, untag_force = 0x3, reg_mcast = 0x3,
>>> unreg_mcast = 0x0, member_list = 0x3
>>> 1   : type: mcast, vid = 1, addr = ff:ff:ff:ff:ff:ff, mcast_state = f,
>>> no super, port_mask = 0x3
>>> 2   : type: ucast, vid = 1, addr = 74:6a:8f:00:16:12, ucast_type =
>>> persistant, port_num = 0x0, Secure
>>> 3   : type: vlan , vid = 0, untag_force = 0x7, reg_mcast = 0x0,
>>> unreg_mcast = 0x0, member_list = 0x7
>>> 4   : type: mcast, vid = 1, addr = 01:00:5e:00:00:01, mcast_state = f,
>>> no super, port_mask = 0x3
>>> 5   : type: vlan , vid = 2, untag_force = 0x5, reg_mcast = 0x5,
>>> unreg_mcast = 0x0, member_list = 0x5
>>> 6   : type: mcast, vid = 2, addr = ff:ff:ff:ff:ff:ff, mcast_state = f,
>>> no super, port_mask = 0x5
>>> 7   : type: ucast, vid = 2, addr = 74:6a:8f:00:16:13, ucast_type =
>>> persistant, port_num = 0x0, Secure
>>> 8   : type: mcast, vid = 2, addr = 01:00:5e:00:00:01, mcast_state = f,
>>> no super, port_mask = 0x5
>>>
>>> After VLAN configuration:
>>>
>>> # switch-config -d
>>> cpsw hw version 1.12 (0)
>>> 0   : type: vlan , vid = 1, untag_force = 0x3, reg_mcast = 0x3,
>>> unreg_mcast = 0x0, member_list = 0x3
>>> 1   : type: mcast, vid = 1, addr = ff:ff:ff:ff:ff:ff, mcast_state = f,
>>> no super, port_mask = 0x3
>>> 2   : type: ucast, vid = 1, addr = 74:6a:8f:00:16:12, ucast_type =
>>> persistant, port_num = 0x0, Secure
>>> 3   : type: vlan , vid = 0, untag_force = 0x7, reg_mcast = 0x0,
>>> unreg_mcast = 0x0, member_list = 0x7
>>> 4   : type: mcast, vid = 1, addr = 01:00:5e:00:00:01, mcast_state = f,
>>> no super, port_mask = 0x3
>>> 5   : type: vlan , vid = 2, untag_force = 0x5, reg_mcast = 0x5,
>>> unreg_mcast = 0x0, member_list = 0x5
>>> 6   : type: mcast, vid = 2, addr = ff:ff:ff:ff:ff:ff, mcast_state = f,
>>> no super, port_mask = 0x5
>>> 7   : type: ucast, vid = 2, addr = 74:6a:8f:00:16:13, ucast_type =
>>> persistant, port_num = 0x0, Secure
>>> 8   : type: mcast, vid = 2, addr = 01:00:5e:00:00:01, mcast_state = f,
>>> no super, port_mask = 0x5
>>> 9   : type: vlan , vid = 100, untag_force = 0x0, reg_mcast = 0x5,
>>> unreg_mcast = 0x0, member_list = 0x5
>>> 10  : type: ucast, vid = 100, addr = 74:6a:8f:00:16:13, ucast_type =
>>> persistant, port_num = 0x0
>>> 11  : type: mcast, vid = 100, addr = ff:ff:ff:ff:ff:ff, mcast_state =
>>> f, no super, port_mask = 0x5
>>> 12  : type: mcast, vid = 2, addr = 01:80:c2:00:00:21, mcast_state = f,
>>> no super, port_mask = 0x5
>>>
>>> During mulitcast receive:
>>>
>>> # switch-config -d
>>> cpsw hw version 1.12 (0)
>>> 0   : type: vlan , vid = 1, untag_force = 0x3, reg_mcast = 0x3, unreg_mcast = 0x0, member_list = 0x3
>>> 1   : type: mcast, vid = 1, addr = ff:ff:ff:ff:ff:ff, mcast_state = f, no super, port_mask = 0x3
>>> 2   : type: ucast, vid = 1, addr = 74:6a:8f:00:16:12, ucast_type = persistant, port_num = 0x0, Secure
>>> 3   : type: vlan , vid = 0, untag_force = 0x7, reg_mcast = 0x0, unreg_mcast = 0x0, member_list = 0x7
>>
>> unreg_mcast = 0x0 means unregistered multicast packets will be dropped
>>
>>> 4   : type: mcast, vid = 1, addr = 01:00:5e:00:00:01, mcast_state = f, no super, port_mask = 0x3
>>> 5   : type: vlan , vid = 2, untag_force = 0x5, reg_mcast = 0x5, unreg_mcast = 0x0, member_list = 0x5
>>
>> unreg_mcast = 0x0
>>
>>> 6   : type: mcast, vid = 2, addr = ff:ff:ff:ff:ff:ff, mcast_state = f, no super, port_mask = 0x5
>>> 7   : type: ucast, vid = 2, addr = 74:6a:8f:00:16:13, ucast_type = persistant, port_num = 0x0, Secure
>>> 8   : type: mcast, vid = 2, addr = 01:00:5e:00:00:01, mcast_state = f, no super, port_mask = 0x5
>>> 9   : type: vlan , vid = 100, untag_force = 0x0, reg_mcast = 0x5, unreg_mcast = 0x0, member_list = 0x5
>>
>> unreg_mcast = 0x0
>>
>>> 10  : type: ucast, vid = 100, addr = 74:6a:8f:00:16:13, ucast_type = persistant, port_num = 0x0
>>> 11  : type: mcast, vid = 100, addr = ff:ff:ff:ff:ff:ff, mcast_state = f, no super, port_mask = 0x5
>>> 12  : type: mcast, vid = 2, addr = 01:80:c2:00:00:21, mcast_state = f, no super, port_mask = 0x5
>>> 13  : type: mcast, vid = 2, addr = 01:00:5e:03:1d:47, mcast_state = f, no super, port_mask = 0x5
>>
>> This is requested mcast address, but it's registered for vid=2 (propagated through .ndo_set_rx_mode())
>>
>>> 14  : type: ucast, vid = 100, addr = 66:22:04:bc:90:26, ucast_type = untouched , port_num = 0x2
>>
>> [...]
>>>
>>> Both tcpdumps with -p option showed no packets. If I execute ping, I
>>> can see related ICMP packets. addr = 66:22:04:bc:90:26 is PandaBoards
>>> MAC.
>>>
>>> Btw I've attached my test scripts (mcastr.py - multicast receiver and
>>> mcastt.py - multicast transmitter). Could you reproduce my setup?
>>>
>>
>> I was able to reproduce an issue with your script. As I understand, when cpsw receive the
>> mcast packet with dst_address=01:00:5e:03:1d:47 and vid=100 it hits
>> the case:
>> "if (Multicast packet) # destination address not found
>> then portmask is the logical “AND” of unreg_mcast_flood_mask and vlan_member_list
>> then goto Egress process"
>>
>> and as result packet is dropped (you can check eth1 statistic # ethtool -S eth1).
>>
>> Unfortunately, I was no able to configure mcast address properly in dual mac mode :(,
>> but probably Mugunthan can comment here - mcast addressess are offloaded to cpsw from
>
> I was able to add mcast address for eth0/eth1, but not finding a way to
> add mcast entries to eth1.100 interface. I gone through Linux network
> stack and didn't find a way where stack asks the driver to add mcast
> address for vlan interfaces. *_Network Experts_* can help us here.
>
>> Net core through  .ndo_set_rx_mode() and struct netdev_hw_addr doesn't contain any
>> information about vlan and cpsw uses default port vlan, which is vid=2 for eth1 in
>> dual mac mode.
>>
>>
>>
>> As W/A the allmulti flag can be used:
>>
>> # ifconfig eth1.100 allmulti
>
> For now this is the only possible option that can be used.

Thanks for working on this issue and finding this workaround.

I've attached an USB-to-Ethernet adapter (pagasus driver) to the same
system and could successfully send/receive multicasts over VLAN
interface (eth2.100).

Yegor

^ permalink raw reply

* Re: am335x: no multicast reception over VLAN
From: Mugunthan V N @ 2016-04-05  6:11 UTC (permalink / raw)
  To: Grygorii Strashko, Yegor Yefremov
  Cc: Peter Korsgaard, netdev, linux-omap@vger.kernel.org, drivshin, ml,
	David Miller
In-Reply-To: <56FE64EE.6050701@ti.com>

On Friday 01 April 2016 05:39 PM, Grygorii Strashko wrote:
> On 03/31/2016 10:52 AM, Yegor Yefremov wrote:
>> On Thu, Mar 31, 2016 at 8:37 AM, Mugunthan V N <mugunthanvnm@ti.com> wrote:
>>> On Thursday 31 March 2016 01:17 AM, Peter Korsgaard wrote:
>>>>>>>>> "Mugunthan" == Mugunthan V N <mugunthanvnm@ti.com> writes:
>>>>
>>>> Hi,
>>>>
>>>>   > You had received these packets as tcpdump will enable promiscuous mode
>>>>   > so that you receive all the packets from the wire.
>>>>
>>>> FYI, you can use the -p option to tcpdump to not put the interface into
>>>> promiscuous mode.
>>>>
>>>
>>> Thanks for the information Peter Korsgaard.
>>>
>>> Yegor, can you provide tcpdump using -p as well in Grygorii commands.
>>
>> Before VLAN configuration:
>>
>> # switch-config -d
>> cpsw hw version 1.12 (0)
>> 0   : type: vlan , vid = 1, untag_force = 0x3, reg_mcast = 0x3,
>> unreg_mcast = 0x0, member_list = 0x3
>> 1   : type: mcast, vid = 1, addr = ff:ff:ff:ff:ff:ff, mcast_state = f,
>> no super, port_mask = 0x3
>> 2   : type: ucast, vid = 1, addr = 74:6a:8f:00:16:12, ucast_type =
>> persistant, port_num = 0x0, Secure
>> 3   : type: vlan , vid = 0, untag_force = 0x7, reg_mcast = 0x0,
>> unreg_mcast = 0x0, member_list = 0x7
>> 4   : type: mcast, vid = 1, addr = 01:00:5e:00:00:01, mcast_state = f,
>> no super, port_mask = 0x3
>> 5   : type: vlan , vid = 2, untag_force = 0x5, reg_mcast = 0x5,
>> unreg_mcast = 0x0, member_list = 0x5
>> 6   : type: mcast, vid = 2, addr = ff:ff:ff:ff:ff:ff, mcast_state = f,
>> no super, port_mask = 0x5
>> 7   : type: ucast, vid = 2, addr = 74:6a:8f:00:16:13, ucast_type =
>> persistant, port_num = 0x0, Secure
>> 8   : type: mcast, vid = 2, addr = 01:00:5e:00:00:01, mcast_state = f,
>> no super, port_mask = 0x5
>>
>> After VLAN configuration:
>>
>> # switch-config -d
>> cpsw hw version 1.12 (0)
>> 0   : type: vlan , vid = 1, untag_force = 0x3, reg_mcast = 0x3,
>> unreg_mcast = 0x0, member_list = 0x3
>> 1   : type: mcast, vid = 1, addr = ff:ff:ff:ff:ff:ff, mcast_state = f,
>> no super, port_mask = 0x3
>> 2   : type: ucast, vid = 1, addr = 74:6a:8f:00:16:12, ucast_type =
>> persistant, port_num = 0x0, Secure
>> 3   : type: vlan , vid = 0, untag_force = 0x7, reg_mcast = 0x0,
>> unreg_mcast = 0x0, member_list = 0x7
>> 4   : type: mcast, vid = 1, addr = 01:00:5e:00:00:01, mcast_state = f,
>> no super, port_mask = 0x3
>> 5   : type: vlan , vid = 2, untag_force = 0x5, reg_mcast = 0x5,
>> unreg_mcast = 0x0, member_list = 0x5
>> 6   : type: mcast, vid = 2, addr = ff:ff:ff:ff:ff:ff, mcast_state = f,
>> no super, port_mask = 0x5
>> 7   : type: ucast, vid = 2, addr = 74:6a:8f:00:16:13, ucast_type =
>> persistant, port_num = 0x0, Secure
>> 8   : type: mcast, vid = 2, addr = 01:00:5e:00:00:01, mcast_state = f,
>> no super, port_mask = 0x5
>> 9   : type: vlan , vid = 100, untag_force = 0x0, reg_mcast = 0x5,
>> unreg_mcast = 0x0, member_list = 0x5
>> 10  : type: ucast, vid = 100, addr = 74:6a:8f:00:16:13, ucast_type =
>> persistant, port_num = 0x0
>> 11  : type: mcast, vid = 100, addr = ff:ff:ff:ff:ff:ff, mcast_state =
>> f, no super, port_mask = 0x5
>> 12  : type: mcast, vid = 2, addr = 01:80:c2:00:00:21, mcast_state = f,
>> no super, port_mask = 0x5
>>
>> During mulitcast receive:
>>
>> # switch-config -d
>> cpsw hw version 1.12 (0)
>> 0   : type: vlan , vid = 1, untag_force = 0x3, reg_mcast = 0x3, unreg_mcast = 0x0, member_list = 0x3
>> 1   : type: mcast, vid = 1, addr = ff:ff:ff:ff:ff:ff, mcast_state = f, no super, port_mask = 0x3
>> 2   : type: ucast, vid = 1, addr = 74:6a:8f:00:16:12, ucast_type = persistant, port_num = 0x0, Secure
>> 3   : type: vlan , vid = 0, untag_force = 0x7, reg_mcast = 0x0, unreg_mcast = 0x0, member_list = 0x7
> 
> unreg_mcast = 0x0 means unregistered multicast packets will be dropped
> 
>> 4   : type: mcast, vid = 1, addr = 01:00:5e:00:00:01, mcast_state = f, no super, port_mask = 0x3
>> 5   : type: vlan , vid = 2, untag_force = 0x5, reg_mcast = 0x5, unreg_mcast = 0x0, member_list = 0x5
> 
> unreg_mcast = 0x0 
> 
>> 6   : type: mcast, vid = 2, addr = ff:ff:ff:ff:ff:ff, mcast_state = f, no super, port_mask = 0x5
>> 7   : type: ucast, vid = 2, addr = 74:6a:8f:00:16:13, ucast_type = persistant, port_num = 0x0, Secure
>> 8   : type: mcast, vid = 2, addr = 01:00:5e:00:00:01, mcast_state = f, no super, port_mask = 0x5
>> 9   : type: vlan , vid = 100, untag_force = 0x0, reg_mcast = 0x5, unreg_mcast = 0x0, member_list = 0x5
> 
> unreg_mcast = 0x0 
> 
>> 10  : type: ucast, vid = 100, addr = 74:6a:8f:00:16:13, ucast_type = persistant, port_num = 0x0
>> 11  : type: mcast, vid = 100, addr = ff:ff:ff:ff:ff:ff, mcast_state = f, no super, port_mask = 0x5
>> 12  : type: mcast, vid = 2, addr = 01:80:c2:00:00:21, mcast_state = f, no super, port_mask = 0x5
>> 13  : type: mcast, vid = 2, addr = 01:00:5e:03:1d:47, mcast_state = f, no super, port_mask = 0x5
> 
> This is requested mcast address, but it's registered for vid=2 (propagated through .ndo_set_rx_mode())
> 
>> 14  : type: ucast, vid = 100, addr = 66:22:04:bc:90:26, ucast_type = untouched , port_num = 0x2
> 
> [...]
>>
>> Both tcpdumps with -p option showed no packets. If I execute ping, I
>> can see related ICMP packets. addr = 66:22:04:bc:90:26 is PandaBoards
>> MAC.
>>
>> Btw I've attached my test scripts (mcastr.py - multicast receiver and
>> mcastt.py - multicast transmitter). Could you reproduce my setup?
>>
> 
> I was able to reproduce an issue with your script. As I understand, when cpsw receive the
> mcast packet with dst_address=01:00:5e:03:1d:47 and vid=100 it hits
> the case:
> "if (Multicast packet) # destination address not found
> then portmask is the logical “AND” of unreg_mcast_flood_mask and vlan_member_list
> then goto Egress process"
> 
> and as result packet is dropped (you can check eth1 statistic # ethtool -S eth1).
> 
> Unfortunately, I was no able to configure mcast address properly in dual mac mode :(,
> but probably Mugunthan can comment here - mcast addressess are offloaded to cpsw from

I was able to add mcast address for eth0/eth1, but not finding a way to
add mcast entries to eth1.100 interface. I gone through Linux network
stack and didn't find a way where stack asks the driver to add mcast
address for vlan interfaces. *_Network Experts_* can help us here.

> Net core through  .ndo_set_rx_mode() and struct netdev_hw_addr doesn't contain any
> information about vlan and cpsw uses default port vlan, which is vid=2 for eth1 in 
> dual mac mode.
>  
> 
> 
> As W/A the allmulti flag can be used:
> 
> # ifconfig eth1.100 allmulti 

For now this is the only possible option that can be used.

Regards
Mugunthan V N

^ permalink raw reply

* Re: [RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program
From: Jesper Dangaard Brouer @ 2016-04-05  6:04 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Brenden Blanco, davem, netdev, tom, ogerlitz, daniel,
	john.fastabend, brouer
In-Reply-To: <20160404182724.GB68392@ast-mbp.thefacebook.com>

On Mon, 4 Apr 2016 11:27:27 -0700
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Sat, Apr 02, 2016 at 11:11:52PM -0700, Brenden Blanco wrote:
> > On Sat, Apr 02, 2016 at 10:23:31AM +0200, Jesper Dangaard Brouer wrote:
> > [...]  
> > > 
> > > I think you need to DMA sync RX-page before you can safely access
> > > packet data in page (on all arch's).
> > >   
> > Thanks, I will give that a try in the next spin.  
> > > > +			ethh = (struct ethhdr *)(page_address(frags[0].page) +
> > > > +						 frags[0].page_offset);
> > > > +			if (mlx4_call_bpf(prog, ethh, length)) {  
> > > 
> > > AFAIK length here covers all the frags[n].page, thus potentially
> > > causing the BPF program to access memory out of bound (crash).
> > > 
> > > Having several page fragments is AFAIK an optimization for jumbo-frames
> > > on PowerPC (which is a bit annoying for you use-case ;-)).
> > >   
> > Yeah, this needs some more work. I can think of some options:
> > 1. limit pseudo skb.len to first frag's length only, and signal to
> > program that the packet is incomplete
> > 2. for nfrags>1 skip bpf processing, but this could be functionally
> > incorrect for some use cases
> > 3. run the program for each frag
> > 4. reject ndo_bpf_set when frags are possible (large mtu?)
> > 
> > My preference is to go with 1, thoughts?  
> 
> hmm and what program will do with 'incomplete' packet?
> imo option 4 is only way here. If phys_dev bpf program already
> attached to netdev then mlx4_en_change_mtu() can reject jumbo mtus.
> My understanding of mlx4_en_calc_rx_buf is that mtu < 1514
> will have num_frags==1. That's the common case and one we
> want to optimize for.

I agree, we should only optimize for the common case, where
num_frags==1.


> If later we can find a way to change mlx4 driver to support
> phys_dev bpf programs with jumbo mtus, great.

For getting the DMA-buffer/packet-page writable, some change are needed
in this code path anyhow.  Lets look at that later, when touching that
code path.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* RE: [PATCH] mwifiex: add __GFP_REPEAT to skb allocation call
From: Amitkumar Karwar @ 2016-04-05  5:48 UTC (permalink / raw)
  To: Eric Dumazet, Wei-Ning Huang
  Cc: Kalle Valo, Linux Wireless, LKML, Nishant Sarmukadam,
	Sameer Nanda, netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Sonny Rao, Douglas Anderson
In-Reply-To: <1459256320.6473.160.camel-XN9IlZ5yJG9HTL0Zs8A6p+yfmBU6pStAUsxypvmhUTTZJqsBc5GL+g@public.gmane.org>

Hi Eric,

Thanks for the comments.

> From: Eric Dumazet [mailto:eric.dumazet@gmail.com]
> Sent: Tuesday, March 29, 2016 6:29 PM
> To: Wei-Ning Huang
> Cc: Kalle Valo; Linux Wireless; LKML; Amitkumar Karwar; Nishant
> Sarmukadam; Sameer Nanda; netdev@vger.kernel.org; Sonny Rao; Douglas
> Anderson
> Subject: Re: [PATCH] mwifiex: add __GFP_REPEAT to skb allocation call
> 
> On Tue, 2016-03-29 at 17:27 +0800, Wei-Ning Huang wrote:
> > Adding some chromium devs to the thread.
> >
> > In, http://lxr.free-electrons.com/source/mm/page_alloc.c#L3152
> >
> > The default mm retry allocation when 'order <=
> > PAGE_ALLOC_COSTLY_ORDER' of gfp_mask contains __GFP_REPEAT.
> > PAGE_ALLOC_COSTLY_ORDER is defined to be 3. On systems with page size
> > = 4K, this means memory compaction and retry is only done when the
> > size of allocation is <= 32K In mwifiex, the allocation size is 64K.
> 
> 
> 
> >  When we have system with
> > memory fragmentation and allocation failed, there will be no retry.
> > This is why we need to add __GFP_REPEAT here to allow the system to
> > perform memory compaction and retry allocation.
> >
> > Maybe Amit@marvell can comment on if this is a good fix on this issue.
> > I'm also aware that marvell is the progress of implementing
> > scatter/gatter for mwifiex, which can also fix the issue.
> 
> Before SG is implemented, you really need to copy incoming frames into
> smallest chunks (to get lowest skb->truesize) and leave the 64KB
> allocated stuff forever in the driver.

We do have a 64KB pre-allocated buffer for receiving Rx data in our driver.

> 
> __GFP_REPEAT wont really solve the issue.
> 
> It seems the problem comes from the fact that the drivers calls
> dev_kfree_skb_any() after calling mwifiex_deaggr_sdio_pkt(), instead of
> recycling this very precious 64KB skb once memory gets fragmented.

Our one time allocated 64k buffer read from firmware contains multiple data chunks. We have a feature called single port aggregation in which firmware attaches an aggregated buffer to single port. So sometimes a single data chunk can exceed 32k. dev_kfree_skb_any() is called to free those data chunks.

> 
> Another problem is that mwifiex_deaggr_sdio_pkt() uses
> mwifiex_alloc_dma_align_buf() with GFP_KERNEL | GFP_DMA
> 
> Really GFP_DMA makes no sense here, since the skb is going to be
> processed by the stack, which has no such requirement.
> 
> Please use normal skb allocations there.

Sure. I will submit a patch for this.

Regards,
Amitkumar

^ permalink raw reply

* Re: [PATCHv2 net-next 3/6] bridge: simplify the stp_state_store by calling store_bridge_parm
From: Toshiaki Makita @ 2016-04-05  5:08 UTC (permalink / raw)
  To: Xin Long, network dev, bridge; +Cc: nikolay, davem
In-Reply-To: <baf4788b6ac529a2132ace735fc1caf4989307da.1459827115.git.lucien.xin@gmail.com>

On 2016/04/05 12:32, Xin Long wrote:
> There are some repetitive codes in stp_state_store, we can remove
> them by calling store_bridge_parm.
> 
> Signed-off-by: Xin Long <lucien.xin@gmail.com>
> ---
>  net/bridge/br_sysfs_br.c | 24 +++++++-----------------
>  1 file changed, 7 insertions(+), 17 deletions(-)
> 
> diff --git a/net/bridge/br_sysfs_br.c b/net/bridge/br_sysfs_br.c
> index 137cd3b..9918763 100644
> --- a/net/bridge/br_sysfs_br.c
> +++ b/net/bridge/br_sysfs_br.c
> @@ -128,27 +128,17 @@ static ssize_t stp_state_show(struct device *d,
>  }
>  
>  
> +static int set_stp_state(struct net_bridge *br, unsigned long val)
> +{

You forgot to add rtnl lock here?
The missing lock is restored in patch 4, but at this point bisect could
break..

> +	br_stp_set_enabled(br, val);
> +	return 0;
> +}

Toshiaki Makita

^ permalink raw reply

* [PATCH net-next 6/8] samples/bpf: add tracepoint support to bpf loader
From: Alexei Starovoitov @ 2016-04-05  4:52 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, David S . Miller, Ingo Molnar, Daniel Borkmann,
	Arnaldo Carvalho de Melo, Wang Nan, Josef Bacik, Brendan Gregg,
	netdev, linux-kernel, kernel-team
In-Reply-To: <1459831974-2891931-1-git-send-email-ast@fb.com>

Recognize "tracepoint/" section name prefix and attach the program
to that tracepoint.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 samples/bpf/bpf_load.c | 26 +++++++++++++++++++++-----
 1 file changed, 21 insertions(+), 5 deletions(-)

diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index 58f86bd11b3d..022af71c2bb5 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -49,6 +49,7 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 	bool is_socket = strncmp(event, "socket", 6) == 0;
 	bool is_kprobe = strncmp(event, "kprobe/", 7) == 0;
 	bool is_kretprobe = strncmp(event, "kretprobe/", 10) == 0;
+	bool is_tracepoint = strncmp(event, "tracepoint/", 11) == 0;
 	enum bpf_prog_type prog_type;
 	char buf[256];
 	int fd, efd, err, id;
@@ -63,6 +64,8 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 		prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
 	} else if (is_kprobe || is_kretprobe) {
 		prog_type = BPF_PROG_TYPE_KPROBE;
+	} else if (is_tracepoint) {
+		prog_type = BPF_PROG_TYPE_TRACEPOINT;
 	} else {
 		printf("Unknown event '%s'\n", event);
 		return -1;
@@ -111,12 +114,23 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 			       event, strerror(errno));
 			return -1;
 		}
-	}
 
-	strcpy(buf, DEBUGFS);
-	strcat(buf, "events/kprobes/");
-	strcat(buf, event);
-	strcat(buf, "/id");
+		strcpy(buf, DEBUGFS);
+		strcat(buf, "events/kprobes/");
+		strcat(buf, event);
+		strcat(buf, "/id");
+	} else if (is_tracepoint) {
+		event += 11;
+
+		if (*event == 0) {
+			printf("event name cannot be empty\n");
+			return -1;
+		}
+		strcpy(buf, DEBUGFS);
+		strcat(buf, "events/");
+		strcat(buf, event);
+		strcat(buf, "/id");
+	}
 
 	efd = open(buf, O_RDONLY, 0);
 	if (efd < 0) {
@@ -304,6 +318,7 @@ int load_bpf_file(char *path)
 
 			if (memcmp(shname_prog, "kprobe/", 7) == 0 ||
 			    memcmp(shname_prog, "kretprobe/", 10) == 0 ||
+			    memcmp(shname_prog, "tracepoint/", 11) == 0 ||
 			    memcmp(shname_prog, "socket", 6) == 0)
 				load_and_attach(shname_prog, insns, data_prog->d_size);
 		}
@@ -320,6 +335,7 @@ int load_bpf_file(char *path)
 
 		if (memcmp(shname, "kprobe/", 7) == 0 ||
 		    memcmp(shname, "kretprobe/", 10) == 0 ||
+		    memcmp(shname, "tracepoint/", 11) == 0 ||
 		    memcmp(shname, "socket", 6) == 0)
 			load_and_attach(shname, data->d_buf, data->d_size);
 	}
-- 
2.8.0

^ permalink raw reply related

* [PATCH net-next 7/8] samples/bpf: tracepoint example
From: Alexei Starovoitov @ 2016-04-05  4:52 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, David S . Miller, Ingo Molnar, Daniel Borkmann,
	Arnaldo Carvalho de Melo, Wang Nan, Josef Bacik, Brendan Gregg,
	netdev, linux-kernel, kernel-team
In-Reply-To: <1459831974-2891931-1-git-send-email-ast@fb.com>

modify offwaketime to work with sched/sched_switch tracepoint
instead of kprobe into finish_task_switch

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 samples/bpf/offwaketime_kern.c | 26 ++++++++++++++++++++++----
 1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/samples/bpf/offwaketime_kern.c b/samples/bpf/offwaketime_kern.c
index c0aa5a9b9c48..983629a31c79 100644
--- a/samples/bpf/offwaketime_kern.c
+++ b/samples/bpf/offwaketime_kern.c
@@ -73,7 +73,7 @@ int waker(struct pt_regs *ctx)
 	return 0;
 }
 
-static inline int update_counts(struct pt_regs *ctx, u32 pid, u64 delta)
+static inline int update_counts(void *ctx, u32 pid, u64 delta)
 {
 	struct key_t key = {};
 	struct wokeby_t *woke;
@@ -100,15 +100,33 @@ static inline int update_counts(struct pt_regs *ctx, u32 pid, u64 delta)
 	return 0;
 }
 
+#if 1
+/* taken from /sys/kernel/debug/tracing/events/sched/sched_switch/format */
+struct sched_switch_args {
+	unsigned long long pad;
+	char prev_comm[16];
+	int prev_pid;
+	int prev_prio;
+	long long prev_state;
+	char next_comm[16];
+	int next_pid;
+	int next_prio;
+};
+SEC("tracepoint/sched/sched_switch")
+int oncpu(struct sched_switch_args *ctx)
+{
+	/* record previous thread sleep time */
+	u32 pid = ctx->prev_pid;
+#else
 SEC("kprobe/finish_task_switch")
 int oncpu(struct pt_regs *ctx)
 {
 	struct task_struct *p = (void *) PT_REGS_PARM1(ctx);
+	/* record previous thread sleep time */
+	u32 pid = _(p->pid);
+#endif
 	u64 delta, ts, *tsp;
-	u32 pid;
 
-	/* record previous thread sleep time */
-	pid = _(p->pid);
 	ts = bpf_ktime_get_ns();
 	bpf_map_update_elem(&start, &pid, &ts, BPF_ANY);
 
-- 
2.8.0

^ permalink raw reply related

* [PATCH net-next 3/8] bpf: register BPF_PROG_TYPE_TRACEPOINT program type
From: Alexei Starovoitov @ 2016-04-05  4:52 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, David S . Miller, Ingo Molnar, Daniel Borkmann,
	Arnaldo Carvalho de Melo, Wang Nan, Josef Bacik, Brendan Gregg,
	netdev, linux-kernel, kernel-team
In-Reply-To: <1459831974-2891931-1-git-send-email-ast@fb.com>

register tracepoint bpf program type and let it call the same set
of helper functions as BPF_PROG_TYPE_KPROBE

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/trace/bpf_trace.c | 45 +++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 43 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 3e4ffb3ace5f..3e5ebe3254d2 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -268,7 +268,7 @@ static const struct bpf_func_proto bpf_perf_event_output_proto = {
 	.arg5_type	= ARG_CONST_STACK_SIZE,
 };
 
-static const struct bpf_func_proto *kprobe_prog_func_proto(enum bpf_func_id func_id)
+static const struct bpf_func_proto *tracing_func_proto(enum bpf_func_id func_id)
 {
 	switch (func_id) {
 	case BPF_FUNC_map_lookup_elem:
@@ -295,12 +295,20 @@ static const struct bpf_func_proto *kprobe_prog_func_proto(enum bpf_func_id func
 		return &bpf_get_smp_processor_id_proto;
 	case BPF_FUNC_perf_event_read:
 		return &bpf_perf_event_read_proto;
+	default:
+		return NULL;
+	}
+}
+
+static const struct bpf_func_proto *kprobe_prog_func_proto(enum bpf_func_id func_id)
+{
+	switch (func_id) {
 	case BPF_FUNC_perf_event_output:
 		return &bpf_perf_event_output_proto;
 	case BPF_FUNC_get_stackid:
 		return &bpf_get_stackid_proto;
 	default:
-		return NULL;
+		return tracing_func_proto(func_id);
 	}
 }
 
@@ -332,9 +340,42 @@ static struct bpf_prog_type_list kprobe_tl = {
 	.type	= BPF_PROG_TYPE_KPROBE,
 };
 
+static const struct bpf_func_proto *tp_prog_func_proto(enum bpf_func_id func_id)
+{
+	switch (func_id) {
+	case BPF_FUNC_perf_event_output:
+	case BPF_FUNC_get_stackid:
+		return NULL;
+	default:
+		return tracing_func_proto(func_id);
+	}
+}
+
+static bool tp_prog_is_valid_access(int off, int size, enum bpf_access_type type)
+{
+	if (off < sizeof(void *) || off >= PERF_MAX_TRACE_SIZE)
+		return false;
+	if (type != BPF_READ)
+		return false;
+	if (off % size != 0)
+		return false;
+	return true;
+}
+
+static const struct bpf_verifier_ops tracepoint_prog_ops = {
+	.get_func_proto  = tp_prog_func_proto,
+	.is_valid_access = tp_prog_is_valid_access,
+};
+
+static struct bpf_prog_type_list tracepoint_tl = {
+	.ops	= &tracepoint_prog_ops,
+	.type	= BPF_PROG_TYPE_TRACEPOINT,
+};
+
 static int __init register_kprobe_prog_ops(void)
 {
 	bpf_register_prog_type(&kprobe_tl);
+	bpf_register_prog_type(&tracepoint_tl);
 	return 0;
 }
 late_initcall(register_kprobe_prog_ops);
-- 
2.8.0

^ permalink raw reply related

* [PATCH net-next 0/8] allow bpf attach to tracepoints
From: Alexei Starovoitov @ 2016-04-05  4:52 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, David S . Miller, Ingo Molnar, Daniel Borkmann,
	Arnaldo Carvalho de Melo, Wang Nan, Josef Bacik, Brendan Gregg,
	netdev, linux-kernel, kernel-team

Hi Steven, Peter,

last time we discussed bpf+tracepoints it was a year ago [1] and the reason
we didn't proceed with that approach was that bpf would make arguments
arg1, arg2 to trace_xx(arg1, arg2) call to be exposed to bpf program
and that was considered unnecessary extension of abi. Back then I wanted
to avoid the cost of buffer alloc and field assign part in all
of the tracepoints, but looks like when optimized the cost is acceptable.
So this new apporach doesn't expose any new abi to bpf program.
The program is looking at tracepoint fields after they were copied
by perf_trace_xx() and described in /sys/kernel/debug/tracing/events/xxx/format
We made a tool [2] that takes arguments from /sys/.../format and works as:
$ tplist.py -v random:urandom_read
    int got_bits;
    int pool_left;
    int input_left;
Then these fields can be copy-pasted into bpf program like:
struct urandom_read {
    __u64 hidden_pad;
    int got_bits;
    int pool_left;
    int input_left;
};
and the program can use it:
SEC("tracepoint/random/urandom_read")
int bpf_prog(struct urandom_read *ctx)
{
    return ctx->pool_left > 0 ? 1 : 0;
}
This way the program can access tracepoint fields faster than
equivalent bpf+kprobe program, which is the main goal of these patches.

Patch 1 and 2 are simple changes in perf core side, please review.
I'd like to take the whole set via net-next tree, since the rest of
the patches might conflict with other bpf work going on in net-next
and we want to avoid cross-tree merge conflicts.
Patch 7 is an example of access to tracepoint fields from bpf prog.
Patch 8 is a micro benchmark for bpf+kprobe vs bpf+tracepoint.

Note that for actual tracing tools the user doesn't need to
run tplist.py and copy-paste fields manually. The tools do it
automatically. Like argdist tool [3] can be used as:
$ argdist -H 't:block:block_rq_complete():u32:nr_sector'
where 'nr_sector' is name of tracepoint field taken from
/sys/kernel/debug/tracing/events/block/block_rq_complete/format
and appropriate bpf program is generated on the fly.

[1] http://thread.gmane.org/gmane.linux.kernel.api/8127/focus=8165
[2] https://github.com/iovisor/bcc/blob/master/tools/tplist.py
[3] https://github.com/iovisor/bcc/blob/master/tools/argdist.py

Alexei Starovoitov (8):
  perf: optimize perf_fetch_caller_regs
  perf, bpf: allow bpf programs attach to tracepoints
  bpf: register BPF_PROG_TYPE_TRACEPOINT program type
  bpf: support bpf_get_stackid() and bpf_perf_event_output() in
    tracepoint programs
  bpf: sanitize bpf tracepoint access
  samples/bpf: add tracepoint support to bpf loader
  samples/bpf: tracepoint example
  samples/bpf: add tracepoint vs kprobe performance tests

 include/linux/bpf.h                     |   2 +
 include/linux/perf_event.h              |   2 -
 include/linux/trace_events.h            |   1 +
 include/trace/perf.h                    |  18 +++-
 include/uapi/linux/bpf.h                |   1 +
 kernel/bpf/stackmap.c                   |   2 +-
 kernel/bpf/verifier.c                   |   6 +-
 kernel/events/core.c                    |  21 ++++-
 kernel/trace/bpf_trace.c                |  85 ++++++++++++++++-
 kernel/trace/trace_event_perf.c         |   4 +
 kernel/trace/trace_events.c             |  18 ++++
 samples/bpf/Makefile                    |   5 +
 samples/bpf/bpf_load.c                  |  26 +++++-
 samples/bpf/offwaketime_kern.c          |  26 +++++-
 samples/bpf/test_overhead_kprobe_kern.c |  41 ++++++++
 samples/bpf/test_overhead_tp_kern.c     |  36 +++++++
 samples/bpf/test_overhead_user.c        | 161 ++++++++++++++++++++++++++++++++
 17 files changed, 432 insertions(+), 23 deletions(-)
 create mode 100644 samples/bpf/test_overhead_kprobe_kern.c
 create mode 100644 samples/bpf/test_overhead_tp_kern.c
 create mode 100644 samples/bpf/test_overhead_user.c

-- 
2.8.0

^ permalink raw reply

* [PATCH net-next 5/8] bpf: sanitize bpf tracepoint access
From: Alexei Starovoitov @ 2016-04-05  4:52 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, David S . Miller, Ingo Molnar, Daniel Borkmann,
	Arnaldo Carvalho de Melo, Wang Nan, Josef Bacik, Brendan Gregg,
	netdev, linux-kernel, kernel-team
In-Reply-To: <1459831974-2891931-1-git-send-email-ast@fb.com>

during bpf program loading remember the last byte of ctx access
and at the time of attaching the program to tracepoint check that
the program doesn't access bytes beyond defined in tracepoint fields

This also disallows access to __dynamic_array fields, but can be
relaxed in the future.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/bpf.h          |  1 +
 include/linux/trace_events.h |  1 +
 kernel/bpf/verifier.c        |  6 +++++-
 kernel/events/core.c         |  8 ++++++++
 kernel/trace/trace_events.c  | 18 ++++++++++++++++++
 5 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 198f6ace70ec..b2365a6eba3d 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -131,6 +131,7 @@ struct bpf_prog_type_list {
 struct bpf_prog_aux {
 	atomic_t refcnt;
 	u32 used_map_cnt;
+	u32 max_ctx_offset;
 	const struct bpf_verifier_ops *ops;
 	struct bpf_map **used_maps;
 	struct bpf_prog *prog;
diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index 0810f81b6db2..97bd7da98567 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -569,6 +569,7 @@ extern int trace_define_field(struct trace_event_call *call, const char *type,
 			      int is_signed, int filter_type);
 extern int trace_add_event_call(struct trace_event_call *call);
 extern int trace_remove_event_call(struct trace_event_call *call);
+extern int trace_event_get_offsets(struct trace_event_call *call);
 
 #define is_signed_type(type)	(((type)(-1)) < (type)1)
 
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 2e08f8e9b771..58792fed5678 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -652,8 +652,12 @@ static int check_ctx_access(struct verifier_env *env, int off, int size,
 			    enum bpf_access_type t)
 {
 	if (env->prog->aux->ops->is_valid_access &&
-	    env->prog->aux->ops->is_valid_access(off, size, t))
+	    env->prog->aux->ops->is_valid_access(off, size, t)) {
+		/* remember the offset of last byte accessed in ctx */
+		if (env->prog->aux->max_ctx_offset < off + size)
+			env->prog->aux->max_ctx_offset = off + size;
 		return 0;
+	}
 
 	verbose("invalid bpf_context access off=%d size=%d\n", off, size);
 	return -EACCES;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 58fc9a7d1562..e7e3c2057582 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7131,6 +7131,14 @@ static int perf_event_set_bpf_prog(struct perf_event *event, u32 prog_fd)
 		return -EINVAL;
 	}
 
+	if (is_tracepoint) {
+		int off = trace_event_get_offsets(event->tp_event);
+
+		if (prog->aux->max_ctx_offset > off) {
+			bpf_prog_put(prog);
+			return -EACCES;
+		}
+	}
 	event->tp_event->prog = prog;
 
 	return 0;
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index 05ddc0820771..ced963049e0a 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -204,6 +204,24 @@ static void trace_destroy_fields(struct trace_event_call *call)
 	}
 }
 
+/*
+ * run-time version of trace_event_get_offsets_<call>() that returns the last
+ * accessible offset of trace fields excluding __dynamic_array bytes
+ */
+int trace_event_get_offsets(struct trace_event_call *call)
+{
+	struct ftrace_event_field *tail;
+	struct list_head *head;
+
+	head = trace_get_fields(call);
+	/*
+	 * head->next points to the last field with the largest offset,
+	 * since it was added last by trace_define_field()
+	 */
+	tail = list_first_entry(head, struct ftrace_event_field, link);
+	return tail->offset + tail->size;
+}
+
 int trace_event_raw_init(struct trace_event_call *call)
 {
 	int id;
-- 
2.8.0

^ permalink raw reply related

* [PATCH net-next 8/8] samples/bpf: add tracepoint vs kprobe performance tests
From: Alexei Starovoitov @ 2016-04-05  4:52 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, David S . Miller, Ingo Molnar, Daniel Borkmann,
	Arnaldo Carvalho de Melo, Wang Nan, Josef Bacik, Brendan Gregg,
	netdev, linux-kernel, kernel-team
In-Reply-To: <1459831974-2891931-1-git-send-email-ast@fb.com>

the first microbenchmark does
fd=open("/proc/self/comm");
for() {
  write(fd, "test");
}
and on 4 cpus in parallel:
                                      writes per sec
base (no tracepoints, no kprobes)         930k
with kprobe at __set_task_comm()          420k
with tracepoint at task:task_rename       730k

For kprobe + full bpf program manully fetches oldcomm, newcomm via bpf_probe_read.
For tracepint bpf program does nothing, since arguments are copied by tracepoint.

2nd microbenchmark does:
fd=open("/dev/urandom");
for() {
  read(fd, buf);
}
and on 4 cpus in parallel:
                                       reads per sec
base (no tracepoints, no kprobes)         300k
with kprobe at urandom_read()             279k
with tracepoint at random:urandom_read    290k

bpf progs attached to kprobe and tracepoint are noop.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 samples/bpf/Makefile                    |   5 +
 samples/bpf/test_overhead_kprobe_kern.c |  41 ++++++++
 samples/bpf/test_overhead_tp_kern.c     |  36 +++++++
 samples/bpf/test_overhead_user.c        | 161 ++++++++++++++++++++++++++++++++
 4 files changed, 243 insertions(+)
 create mode 100644 samples/bpf/test_overhead_kprobe_kern.c
 create mode 100644 samples/bpf/test_overhead_tp_kern.c
 create mode 100644 samples/bpf/test_overhead_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 502c9fc8db85..9959771bf808 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -19,6 +19,7 @@ hostprogs-y += lathist
 hostprogs-y += offwaketime
 hostprogs-y += spintest
 hostprogs-y += map_perf_test
+hostprogs-y += test_overhead
 
 test_verifier-objs := test_verifier.o libbpf.o
 test_maps-objs := test_maps.o libbpf.o
@@ -38,6 +39,7 @@ lathist-objs := bpf_load.o libbpf.o lathist_user.o
 offwaketime-objs := bpf_load.o libbpf.o offwaketime_user.o
 spintest-objs := bpf_load.o libbpf.o spintest_user.o
 map_perf_test-objs := bpf_load.o libbpf.o map_perf_test_user.o
+test_overhead-objs := bpf_load.o libbpf.o test_overhead_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -56,6 +58,8 @@ always += lathist_kern.o
 always += offwaketime_kern.o
 always += spintest_kern.o
 always += map_perf_test_kern.o
+always += test_overhead_tp_kern.o
+always += test_overhead_kprobe_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 
@@ -75,6 +79,7 @@ HOSTLOADLIBES_lathist += -lelf
 HOSTLOADLIBES_offwaketime += -lelf
 HOSTLOADLIBES_spintest += -lelf
 HOSTLOADLIBES_map_perf_test += -lelf -lrt
+HOSTLOADLIBES_test_overhead += -lelf -lrt
 
 # point this to your LLVM backend with bpf support
 LLC=$(srctree)/tools/bpf/llvm/bld/Debug+Asserts/bin/llc
diff --git a/samples/bpf/test_overhead_kprobe_kern.c b/samples/bpf/test_overhead_kprobe_kern.c
new file mode 100644
index 000000000000..468a66a92ef9
--- /dev/null
+++ b/samples/bpf/test_overhead_kprobe_kern.c
@@ -0,0 +1,41 @@
+/* Copyright (c) 2016 Facebook
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include <linux/version.h>
+#include <linux/ptrace.h>
+#include <uapi/linux/bpf.h>
+#include "bpf_helpers.h"
+
+#define _(P) ({typeof(P) val = 0; bpf_probe_read(&val, sizeof(val), &P); val;})
+
+SEC("kprobe/__set_task_comm")
+int prog(struct pt_regs *ctx)
+{
+	struct signal_struct *signal;
+	struct task_struct *tsk;
+	char oldcomm[16] = {};
+	char newcomm[16] = {};
+	u16 oom_score_adj;
+	u32 pid;
+
+	tsk = (void *)PT_REGS_PARM1(ctx);
+
+	pid = _(tsk->pid);
+	bpf_probe_read(oldcomm, sizeof(oldcomm), &tsk->comm);
+	bpf_probe_read(newcomm, sizeof(newcomm), (void *)PT_REGS_PARM2(ctx));
+	signal = _(tsk->signal);
+	oom_score_adj = _(signal->oom_score_adj);
+	return 0;
+}
+
+SEC("kprobe/urandom_read")
+int prog2(struct pt_regs *ctx)
+{
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
+u32 _version SEC("version") = LINUX_VERSION_CODE;
diff --git a/samples/bpf/test_overhead_tp_kern.c b/samples/bpf/test_overhead_tp_kern.c
new file mode 100644
index 000000000000..38f5c0b9da9f
--- /dev/null
+++ b/samples/bpf/test_overhead_tp_kern.c
@@ -0,0 +1,36 @@
+/* Copyright (c) 2016 Facebook
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include <uapi/linux/bpf.h>
+#include "bpf_helpers.h"
+
+/* from /sys/kernel/debug/tracing/events/task/task_rename/format */
+struct task_rename {
+	__u64 pad;
+	__u32 pid;
+	char oldcomm[16];
+	char newcomm[16];
+	__u16 oom_score_adj;
+};
+SEC("tracepoint/task/task_rename")
+int prog(struct task_rename *ctx)
+{
+	return 0;
+}
+
+/* from /sys/kernel/debug/tracing/events/random/urandom_read/format */
+struct urandom_read {
+	__u64 pad;
+	int got_bits;
+	int pool_left;
+	int input_left;
+};
+SEC("tracepoint/random/urandom_read")
+int prog2(struct urandom_read *ctx)
+{
+	return 0;
+}
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/test_overhead_user.c b/samples/bpf/test_overhead_user.c
new file mode 100644
index 000000000000..6872d67e151d
--- /dev/null
+++ b/samples/bpf/test_overhead_user.c
@@ -0,0 +1,161 @@
+/* Copyright (c) 2016 Facebook
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#define _GNU_SOURCE
+#include <sched.h>
+#include <stdio.h>
+#include <sys/types.h>
+#include <asm/unistd.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <assert.h>
+#include <sys/wait.h>
+#include <stdlib.h>
+#include <signal.h>
+#include <linux/bpf.h>
+#include <string.h>
+#include <time.h>
+#include "libbpf.h"
+#include "bpf_load.h"
+
+#define MAX_CNT 1000000
+
+static __u64 time_get_ns(void)
+{
+	struct timespec ts;
+
+	clock_gettime(CLOCK_MONOTONIC, &ts);
+	return ts.tv_sec * 1000000000ull + ts.tv_nsec;
+}
+
+static void test_task_rename(int cpu)
+{
+	__u64 start_time;
+	char buf[] = "test\n";
+	int i, fd;
+
+	fd = open("/proc/self/comm", O_WRONLY|O_TRUNC);
+	if (fd < 0) {
+		printf("couldn't open /proc\n");
+		exit(1);
+	}
+	start_time = time_get_ns();
+	for (i = 0; i < MAX_CNT; i++)
+		write(fd, buf, sizeof(buf));
+	printf("task_rename:%d: %lld events per sec\n",
+	       cpu, MAX_CNT * 1000000000ll / (time_get_ns() - start_time));
+	close(fd);
+}
+
+static void test_urandom_read(int cpu)
+{
+	__u64 start_time;
+	char buf[4];
+	int i, fd;
+
+	fd = open("/dev/urandom", O_RDONLY);
+	if (fd < 0) {
+		printf("couldn't open /dev/urandom\n");
+		exit(1);
+	}
+	start_time = time_get_ns();
+	for (i = 0; i < MAX_CNT; i++)
+		read(fd, buf, sizeof(buf));
+	printf("urandom_read:%d: %lld events per sec\n",
+	       cpu, MAX_CNT * 1000000000ll / (time_get_ns() - start_time));
+	close(fd);
+}
+
+static void loop(int cpu, int flags)
+{
+	cpu_set_t cpuset;
+
+	CPU_ZERO(&cpuset);
+	CPU_SET(cpu, &cpuset);
+	sched_setaffinity(0, sizeof(cpuset), &cpuset);
+
+	if (flags & 1)
+		test_task_rename(cpu);
+	if (flags & 2)
+		test_urandom_read(cpu);
+}
+
+static void run_perf_test(int tasks, int flags)
+{
+	pid_t pid[tasks];
+	int i;
+
+	for (i = 0; i < tasks; i++) {
+		pid[i] = fork();
+		if (pid[i] == 0) {
+			loop(i, flags);
+			exit(0);
+		} else if (pid[i] == -1) {
+			printf("couldn't spawn #%d process\n", i);
+			exit(1);
+		}
+	}
+	for (i = 0; i < tasks; i++) {
+		int status;
+
+		assert(waitpid(pid[i], &status, 0) == pid[i]);
+		assert(status == 0);
+	}
+}
+
+static void unload_progs(void)
+{
+	close(prog_fd[0]);
+	close(prog_fd[1]);
+	close(event_fd[0]);
+	close(event_fd[1]);
+}
+
+int main(int argc, char **argv)
+{
+	struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
+	char filename[256];
+	int num_cpu = 8;
+	int test_flags = ~0;
+
+	setrlimit(RLIMIT_MEMLOCK, &r);
+
+	if (argc > 1)
+		test_flags = atoi(argv[1]) ? : test_flags;
+	if (argc > 2)
+		num_cpu = atoi(argv[2]) ? : num_cpu;
+
+	if (test_flags & 0x3) {
+		printf("BASE\n");
+		run_perf_test(num_cpu, test_flags);
+	}
+
+	if (test_flags & 0xC) {
+		snprintf(filename, sizeof(filename),
+			 "%s_kprobe_kern.o", argv[0]);
+		if (load_bpf_file(filename)) {
+			printf("%s", bpf_log_buf);
+			return 1;
+		}
+		printf("w/KPROBE\n");
+		run_perf_test(num_cpu, test_flags >> 2);
+		unload_progs();
+	}
+
+	if (test_flags & 0x30) {
+		snprintf(filename, sizeof(filename),
+			 "%s_tp_kern.o", argv[0]);
+		if (load_bpf_file(filename)) {
+			printf("%s", bpf_log_buf);
+			return 1;
+		}
+		printf("w/TRACEPOINT\n");
+		run_perf_test(num_cpu, test_flags >> 4);
+		unload_progs();
+	}
+
+	return 0;
+}
-- 
2.8.0

^ permalink raw reply related

* [PATCH net-next 4/8] bpf: support bpf_get_stackid() and bpf_perf_event_output() in tracepoint programs
From: Alexei Starovoitov @ 2016-04-05  4:52 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, David S . Miller, Ingo Molnar, Daniel Borkmann,
	Arnaldo Carvalho de Melo, Wang Nan, Josef Bacik, Brendan Gregg,
	netdev, linux-kernel, kernel-team
In-Reply-To: <1459831974-2891931-1-git-send-email-ast@fb.com>

needs two wrapper functions to fetch 'struct pt_regs *' to convert
tracepoint bpf context into kprobe bpf context to reuse existing
helper functions

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/bpf.h      |  1 +
 kernel/bpf/stackmap.c    |  2 +-
 kernel/trace/bpf_trace.c | 42 +++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 21ee41b92e8a..198f6ace70ec 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -160,6 +160,7 @@ struct bpf_array {
 #define MAX_TAIL_CALL_CNT 32
 
 u64 bpf_tail_call(u64 ctx, u64 r2, u64 index, u64 r4, u64 r5);
+u64 bpf_get_stackid(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
 void bpf_fd_array_map_clear(struct bpf_map *map);
 bool bpf_prog_array_compatible(struct bpf_array *array, const struct bpf_prog *fp);
 const struct bpf_func_proto *bpf_get_trace_printk_proto(void);
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 499d9e933f8e..35114725cf30 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -116,7 +116,7 @@ free_smap:
 	return ERR_PTR(err);
 }
 
-static u64 bpf_get_stackid(u64 r1, u64 r2, u64 flags, u64 r4, u64 r5)
+u64 bpf_get_stackid(u64 r1, u64 r2, u64 flags, u64 r4, u64 r5)
 {
 	struct pt_regs *regs = (struct pt_regs *) (long) r1;
 	struct bpf_map *map = (struct bpf_map *) (long) r2;
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 3e5ebe3254d2..413ec5614180 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -340,12 +340,52 @@ static struct bpf_prog_type_list kprobe_tl = {
 	.type	= BPF_PROG_TYPE_KPROBE,
 };
 
+static u64 bpf_perf_event_output_tp(u64 r1, u64 r2, u64 index, u64 r4, u64 size)
+{
+	/*
+	 * r1 points to perf tracepoint buffer where first 8 bytes are hidden
+	 * from bpf program and contain a pointer to 'struct pt_regs'. Fetch it
+	 * from there and call the same bpf_perf_event_output() helper
+	 */
+	u64 ctx = *(long *)r1;
+
+	return bpf_perf_event_output(ctx, r2, index, r4, size);
+}
+
+static const struct bpf_func_proto bpf_perf_event_output_proto_tp = {
+	.func		= bpf_perf_event_output_tp,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_CONST_MAP_PTR,
+	.arg3_type	= ARG_ANYTHING,
+	.arg4_type	= ARG_PTR_TO_STACK,
+	.arg5_type	= ARG_CONST_STACK_SIZE,
+};
+
+static u64 bpf_get_stackid_tp(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+	u64 ctx = *(long *)r1;
+
+	return bpf_get_stackid(ctx, r2, r3, r4, r5);
+}
+
+static const struct bpf_func_proto bpf_get_stackid_proto_tp = {
+	.func		= bpf_get_stackid_tp,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_CONST_MAP_PTR,
+	.arg3_type	= ARG_ANYTHING,
+};
+
 static const struct bpf_func_proto *tp_prog_func_proto(enum bpf_func_id func_id)
 {
 	switch (func_id) {
 	case BPF_FUNC_perf_event_output:
+		return &bpf_perf_event_output_proto_tp;
 	case BPF_FUNC_get_stackid:
-		return NULL;
+		return &bpf_get_stackid_proto_tp;
 	default:
 		return tracing_func_proto(func_id);
 	}
-- 
2.8.0

^ permalink raw reply related

* [PATCH net-next 2/8] perf, bpf: allow bpf programs attach to tracepoints
From: Alexei Starovoitov @ 2016-04-05  4:52 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, David S . Miller, Ingo Molnar, Daniel Borkmann,
	Arnaldo Carvalho de Melo, Wang Nan, Josef Bacik, Brendan Gregg,
	netdev, linux-kernel, kernel-team
In-Reply-To: <1459831974-2891931-1-git-send-email-ast@fb.com>

introduce BPF_PROG_TYPE_TRACEPOINT program type and allow it to be
attached to tracepoints.
The tracepoint will copy the arguments in the per-cpu buffer and pass
it to the bpf program as its first argument.
The layout of the fields can be discovered by doing
'cat /sys/kernel/debug/tracing/events/sched/sched_switch/format'
prior to the compilation of the program with exception that first 8 bytes
are reserved and not accessible to the program. This area is used to store
the pointer to 'struct pt_regs' which some of the bpf helpers will use:
+---------+
| 8 bytes | hidden 'struct pt_regs *' (inaccessible to bpf program)
+---------+
| N bytes | static tracepoint fields defined in tracepoint/format (bpf readonly)
+---------+
| dynamic | __dynamic_array bytes of tracepoint (inaccessible to bpf yet)
+---------+

Not that all of the fields are already dumped to user space via perf ring buffer
and some application access it directly without consulting tracepoint/format.
Same rule applies here: static tracepoint fields should only be accessed
in a format defined in tracepoint/format. The order of fields and
field sizes are not an ABI.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/trace/perf.h            | 18 ++++++++++++++----
 include/uapi/linux/bpf.h        |  1 +
 kernel/events/core.c            | 13 +++++++++----
 kernel/trace/trace_event_perf.c |  3 +++
 4 files changed, 27 insertions(+), 8 deletions(-)

diff --git a/include/trace/perf.h b/include/trace/perf.h
index 26486fcd74ce..55feb69c873f 100644
--- a/include/trace/perf.h
+++ b/include/trace/perf.h
@@ -37,18 +37,19 @@ perf_trace_##call(void *__data, proto)					\
 	struct trace_event_call *event_call = __data;			\
 	struct trace_event_data_offsets_##call __maybe_unused __data_offsets;\
 	struct trace_event_raw_##call *entry;				\
+	struct bpf_prog *prog = event_call->prog;			\
 	struct pt_regs *__regs;						\
 	u64 __addr = 0, __count = 1;					\
 	struct task_struct *__task = NULL;				\
 	struct hlist_head *head;					\
 	int __entry_size;						\
 	int __data_size;						\
-	int rctx;							\
+	int rctx, event_type;						\
 									\
 	__data_size = trace_event_get_offsets_##call(&__data_offsets, args); \
 									\
 	head = this_cpu_ptr(event_call->perf_events);			\
-	if (__builtin_constant_p(!__task) && !__task &&			\
+	if (!prog && __builtin_constant_p(!__task) && !__task &&	\
 				hlist_empty(head))			\
 		return;							\
 									\
@@ -56,8 +57,9 @@ perf_trace_##call(void *__data, proto)					\
 			     sizeof(u64));				\
 	__entry_size -= sizeof(u32);					\
 									\
-	entry = perf_trace_buf_prepare(__entry_size,			\
-			event_call->event.type, &__regs, &rctx);	\
+	event_type = prog ? TRACE_EVENT_TYPE_MAX : event_call->event.type; \
+	entry = perf_trace_buf_prepare(__entry_size, event_type,	\
+				       &__regs, &rctx);			\
 	if (!entry)							\
 		return;							\
 									\
@@ -67,6 +69,14 @@ perf_trace_##call(void *__data, proto)					\
 									\
 	{ assign; }							\
 									\
+	if (prog) {							\
+		*(struct pt_regs **)entry = __regs;			\
+		if (!trace_call_bpf(prog, entry) || hlist_empty(head)) { \
+			perf_swevent_put_recursion_context(rctx);	\
+			return;						\
+		}							\
+		memset(&entry->ent, 0, sizeof(entry->ent));		\
+	}								\
 	perf_trace_buf_submit(entry, __entry_size, rctx, __addr,	\
 		__count, __regs, head, __task);				\
 }
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 23917bb47bf3..70eda5aeb304 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -92,6 +92,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_KPROBE,
 	BPF_PROG_TYPE_SCHED_CLS,
 	BPF_PROG_TYPE_SCHED_ACT,
+	BPF_PROG_TYPE_TRACEPOINT,
 };
 
 #define BPF_PSEUDO_MAP_FD	1
diff --git a/kernel/events/core.c b/kernel/events/core.c
index de24fbce5277..58fc9a7d1562 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6725,12 +6725,13 @@ int perf_swevent_get_recursion_context(void)
 }
 EXPORT_SYMBOL_GPL(perf_swevent_get_recursion_context);
 
-inline void perf_swevent_put_recursion_context(int rctx)
+void perf_swevent_put_recursion_context(int rctx)
 {
 	struct swevent_htable *swhash = this_cpu_ptr(&swevent_htable);
 
 	put_recursion_context(swhash->recursion, rctx);
 }
+EXPORT_SYMBOL_GPL(perf_swevent_put_recursion_context);
 
 void ___perf_sw_event(u32 event_id, u64 nr, struct pt_regs *regs, u64 addr)
 {
@@ -7104,6 +7105,7 @@ static void perf_event_free_filter(struct perf_event *event)
 
 static int perf_event_set_bpf_prog(struct perf_event *event, u32 prog_fd)
 {
+	bool is_kprobe, is_tracepoint;
 	struct bpf_prog *prog;
 
 	if (event->attr.type != PERF_TYPE_TRACEPOINT)
@@ -7112,15 +7114,18 @@ static int perf_event_set_bpf_prog(struct perf_event *event, u32 prog_fd)
 	if (event->tp_event->prog)
 		return -EEXIST;
 
-	if (!(event->tp_event->flags & TRACE_EVENT_FL_UKPROBE))
-		/* bpf programs can only be attached to u/kprobes */
+	is_kprobe = event->tp_event->flags & TRACE_EVENT_FL_UKPROBE;
+	is_tracepoint = event->tp_event->flags & TRACE_EVENT_FL_TRACEPOINT;
+	if (!is_kprobe && !is_tracepoint)
+		/* bpf programs can only be attached to u/kprobe or tracepoint */
 		return -EINVAL;
 
 	prog = bpf_prog_get(prog_fd);
 	if (IS_ERR(prog))
 		return PTR_ERR(prog);
 
-	if (prog->type != BPF_PROG_TYPE_KPROBE) {
+	if ((is_kprobe && prog->type != BPF_PROG_TYPE_KPROBE) ||
+	    (is_tracepoint && prog->type != BPF_PROG_TYPE_TRACEPOINT)) {
 		/* valid fd, but invalid bpf program type */
 		bpf_prog_put(prog);
 		return -EINVAL;
diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
index 7a68afca8249..7ada829029d3 100644
--- a/kernel/trace/trace_event_perf.c
+++ b/kernel/trace/trace_event_perf.c
@@ -284,6 +284,9 @@ void *perf_trace_buf_prepare(int size, unsigned short type,
 		*regs = this_cpu_ptr(&__perf_regs[*rctxp]);
 	raw_data = this_cpu_ptr(perf_trace_buf[*rctxp]);
 
+	if (type == TRACE_EVENT_TYPE_MAX)
+		return raw_data;
+
 	/* zero the dead bytes from align to not leak stack to user */
 	memset(&raw_data[size - sizeof(u64)], 0, sizeof(u64));
 
-- 
2.8.0

^ permalink raw reply related

* [PATCH net-next 1/8] perf: optimize perf_fetch_caller_regs
From: Alexei Starovoitov @ 2016-04-05  4:52 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, David S . Miller, Ingo Molnar, Daniel Borkmann,
	Arnaldo Carvalho de Melo, Wang Nan, Josef Bacik, Brendan Gregg,
	netdev, linux-kernel, kernel-team
In-Reply-To: <1459831974-2891931-1-git-send-email-ast@fb.com>

avoid memset in perf_fetch_caller_regs, since it's the critical path of all tracepoints.
It's called from perf_sw_event_sched, perf_event_task_sched_in and all of perf_trace_##call
with this_cpu_ptr(&__perf_regs[..]) which are zero initialized by perpcu_alloc and
subsequent call to perf_arch_fetch_caller_regs initializes the same fields on all archs,
so we can safely drop memset from all of the above cases and move it into
perf_ftrace_function_call that calls it with stack allocated pt_regs.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/perf_event.h      | 2 --
 kernel/trace/trace_event_perf.c | 1 +
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index f291275ffd71..e89f7199c223 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -882,8 +882,6 @@ static inline void perf_arch_fetch_caller_regs(struct pt_regs *regs, unsigned lo
  */
 static inline void perf_fetch_caller_regs(struct pt_regs *regs)
 {
-	memset(regs, 0, sizeof(*regs));
-
 	perf_arch_fetch_caller_regs(regs, CALLER_ADDR0);
 }
 
diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
index 00df25fd86ef..7a68afca8249 100644
--- a/kernel/trace/trace_event_perf.c
+++ b/kernel/trace/trace_event_perf.c
@@ -316,6 +316,7 @@ perf_ftrace_function_call(unsigned long ip, unsigned long parent_ip,
 
 	BUILD_BUG_ON(ENTRY_SIZE > PERF_MAX_TRACE_SIZE);
 
+	memset(&regs, 0, sizeof(regs));
 	perf_fetch_caller_regs(&regs);
 
 	entry = perf_trace_buf_prepare(ENTRY_SIZE, TRACE_FN, NULL, &rctx);
-- 
2.8.0

^ permalink raw reply related

* [net-next:master 58/68] net/ipv4/udp.c:1693:9: sparse: incompatible types in comparison expression (different address spaces)
From: kbuild test robot @ 2016-04-05  4:38 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: kbuild-all, netdev

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git master
head:   15f41e2ba13a6726632e44b1180e805a61e470ad
commit: ca065d0cf80fa547724440a8bf37f1e674d917c0 [58/68] udp: no longer use SLAB_DESTROY_BY_RCU
reproduce:
        # apt-get install sparse
        git checkout ca065d0cf80fa547724440a8bf37f1e674d917c0
        make ARCH=x86_64 allmodconfig
        make C=1 CF=-D__CHECK_ENDIAN__


sparse warnings: (new ones prefixed by >>)

   include/linux/compiler.h:232:8: sparse: attribute 'no_sanitize_address': unknown attribute
>> net/ipv4/udp.c:1693:9: sparse: incompatible types in comparison expression (different address spaces)
>> net/ipv4/udp.c:1693:9: sparse: incompatible types in comparison expression (different address spaces)
--
   include/linux/compiler.h:232:8: sparse: attribute 'no_sanitize_address': unknown attribute
>> net/ipv6/udp.c:743:9: sparse: incompatible types in comparison expression (different address spaces)
>> net/ipv6/udp.c:743:9: sparse: incompatible types in comparison expression (different address spaces)

vim +1693 net/ipv4/udp.c

  1677		struct udp_hslot *hslot = udp_hashslot(udptable, net, hnum);
  1678		unsigned int hash2 = 0, hash2_any = 0, use_hash2 = (hslot->count > 10);
  1679		unsigned int offset = offsetof(typeof(*sk), sk_node);
  1680		int dif = skb->dev->ifindex;
  1681		struct hlist_node *node;
  1682		struct sk_buff *nskb;
  1683	
  1684		if (use_hash2) {
  1685			hash2_any = udp4_portaddr_hash(net, htonl(INADDR_ANY), hnum) &
  1686				    udp_table.mask;
  1687			hash2 = udp4_portaddr_hash(net, daddr, hnum) & udp_table.mask;
  1688	start_lookup:
  1689			hslot = &udp_table.hash2[hash2];
  1690			offset = offsetof(typeof(*sk), __sk_common.skc_portaddr_node);
  1691		}
  1692	
> 1693		sk_for_each_entry_offset_rcu(sk, node, &hslot->head, offset) {
  1694			if (!__udp_is_mcast_sock(net, sk, uh->dest, daddr,
  1695						 uh->source, saddr, dif, hnum))
  1696				continue;
  1697	
  1698			if (!first) {
  1699				first = sk;
  1700				continue;
  1701			}

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

^ permalink raw reply

* Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864
From: Herbert Xu @ 2016-04-05  4:32 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Alexander Duyck, Tom Herbert, Jesse Gross, Eric Dumazet, Netdev,
	David Miller
In-Reply-To: <CAKgT0UfcqBUTY1SxawwKdnzS7qW7XggjzpKNcfT3cTSZ9DHMmA@mail.gmail.com>

On Mon, Apr 04, 2016 at 09:26:55PM -0700, Alexander Duyck wrote:
> 
> The problem is right now we are mangling the IP ID for outer headers
> on tunnels.  We end up totally ignoring the delta between the values
> so if you have two flows that get interleaved over the same tunnel GRO
> will currently mash the IP IDs for the two tunnels so that they end up
> overlapping.

Then it should be fixed.  I never reviewed those patches or I would
have objected at the time.

> The reason why I keep referencing RFC 6864 is because it specifies
> that the IP ID field must not be read if the DF bit is set, and that
> if we are manipulating headers we can handle the IP ID as though we
> are the transmitting station.  What this means is that if DF is not
> set we have to have unique values per packet, otherwise we can ignore
> the values if DF is set.

As I said GRO itself should not be visible.  The fact that it is
for tunnels is a bug.
 
> The question I would have is what are you really losing with increment
> from 0 versus fixed 0?  From what I see it is essentially just garbage
> in/garbage out.

GRO is meant to be lossless, that is, you should not be able to
detect its presence from the outside.  If you lose information then
you're breaking this rule and people will soon start asking for it
to be disabled in various situations.

I'm not against doing this per se but it should not be part of the
default configuration.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox