* [PATCH net-next 07/11] sfc: Use dev_kfree_skb() in efx_end_loopback()
From: Ben Hutchings @ 2012-07-11 23:17 UTC (permalink / raw)
To: David Miller; +Cc: netdev, linux-net-drivers
In-Reply-To: <1342048389.2613.59.camel@bwh-desktop.uk.solarflarecom.com>
Fix CID 102619 in the Coverity report on Linux.
efx_end_loopback() iterates over an array of skb pointers of which
some may be null (if efx_begin_loopback() failed). It should not use
dev_kfree_skb_irq(), which requires non-null pointers. In practice
this is safe because it does not run in interrupt context and
therefore always ends up calling dev_kfree_skb(), which does allow
null pointers. But we should make that explicit.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
drivers/net/ethernet/sfc/selftest.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/drivers/net/ethernet/sfc/selftest.c b/drivers/net/ethernet/sfc/selftest.c
index de4c006..ccc428f 100644
--- a/drivers/net/ethernet/sfc/selftest.c
+++ b/drivers/net/ethernet/sfc/selftest.c
@@ -488,7 +488,7 @@ static int efx_end_loopback(struct efx_tx_queue *tx_queue,
skb = state->skbs[i];
if (skb && !skb_shared(skb))
++tx_done;
- dev_kfree_skb_any(skb);
+ dev_kfree_skb(skb);
}
netif_tx_unlock_bh(efx->net_dev);
--
1.7.7.6
--
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
^ permalink raw reply related
* [PATCH net-next 08/11] sfc: Explain why efx_mcdi_exit_assertion() ignores result of efx_mcdi_rpc()
From: Ben Hutchings @ 2012-07-11 23:17 UTC (permalink / raw)
To: David Miller; +Cc: netdev, linux-net-drivers
In-Reply-To: <1342048389.2613.59.camel@bwh-desktop.uk.solarflarecom.com>
Fix CID 113952 in Coverity report on Linux.
This is the one instance where we don't, and shouldn't, check the
return code from efx_mcdi_rpc(). It wasn't immediately obvious to me
why we didn't, so I think an explanation is in order.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
drivers/net/ethernet/sfc/mcdi.c | 11 ++++++++---
1 files changed, 8 insertions(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/sfc/mcdi.c b/drivers/net/ethernet/sfc/mcdi.c
index 17b6463..fc5e7bb 100644
--- a/drivers/net/ethernet/sfc/mcdi.c
+++ b/drivers/net/ethernet/sfc/mcdi.c
@@ -1001,12 +1001,17 @@ static void efx_mcdi_exit_assertion(struct efx_nic *efx)
{
u8 inbuf[MC_CMD_REBOOT_IN_LEN];
- /* Atomically reboot the mcfw out of the assertion handler */
+ /* If the MC is running debug firmware, it might now be
+ * waiting for a debugger to attach, but we just want it to
+ * reboot. We set a flag that makes the command a no-op if it
+ * has already done so. We don't know what return code to
+ * expect (0 or -EIO), so ignore it.
+ */
BUILD_BUG_ON(MC_CMD_REBOOT_OUT_LEN != 0);
MCDI_SET_DWORD(inbuf, REBOOT_IN_FLAGS,
MC_CMD_REBOOT_FLAGS_AFTER_ASSERTION);
- efx_mcdi_rpc(efx, MC_CMD_REBOOT, inbuf, MC_CMD_REBOOT_IN_LEN,
- NULL, 0, NULL);
+ (void) efx_mcdi_rpc(efx, MC_CMD_REBOOT, inbuf, MC_CMD_REBOOT_IN_LEN,
+ NULL, 0, NULL);
}
int efx_mcdi_handle_assertion(struct efx_nic *efx)
--
1.7.7.6
--
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
^ permalink raw reply related
* [PATCH net-next 09/11] sfc: Disable VF queues during register self-test
From: Ben Hutchings @ 2012-07-11 23:18 UTC (permalink / raw)
To: David Miller; +Cc: netdev, linux-net-drivers
In-Reply-To: <1342048389.2613.59.camel@bwh-desktop.uk.solarflarecom.com>
Currently VF queues and drivers may remain active during this test.
This could cause memory corruption or spurious test failures.
Therefore we reset the port/function before running these tests on
Siena.
On Falcon this doesn't work: we have to do some additional
initialisation before some blocks will work again. So refactor the
reset/register-test sequence into an efx_nic_type method so
efx_selftest() doesn't have to consider such quirks.
In the process, fix another minor bug: Siena does not have an
'invisible' reset and the self-test currently fails to push the PHY
configuration after resetting. Passing RESET_TYPE_ALL to
efx_reset_{down,up}() fixes this.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
drivers/net/ethernet/sfc/falcon.c | 35 ++++++++++++++++--
drivers/net/ethernet/sfc/net_driver.h | 7 +++-
drivers/net/ethernet/sfc/nic.c | 3 --
drivers/net/ethernet/sfc/selftest.c | 62 ++++++++-------------------------
drivers/net/ethernet/sfc/siena.c | 29 +++++++++++++--
5 files changed, 76 insertions(+), 60 deletions(-)
diff --git a/drivers/net/ethernet/sfc/falcon.c b/drivers/net/ethernet/sfc/falcon.c
index 3a1ca2b..12b573a 100644
--- a/drivers/net/ethernet/sfc/falcon.c
+++ b/drivers/net/ethernet/sfc/falcon.c
@@ -25,9 +25,12 @@
#include "io.h"
#include "phy.h"
#include "workarounds.h"
+#include "selftest.h"
/* Hardware control for SFC4000 (aka Falcon). */
+static int falcon_reset_hw(struct efx_nic *efx, enum reset_type method);
+
static const unsigned int
/* "Large" EEPROM device: Atmel AT25640 or similar
* 8 KB, 16-bit address, 32 B write block */
@@ -1034,10 +1037,34 @@ static const struct efx_nic_register_test falcon_b0_register_tests[] = {
EFX_OWORD32(0x0003FF0F, 0x00000000, 0x00000000, 0x00000000) },
};
-static int falcon_b0_test_registers(struct efx_nic *efx)
+static int
+falcon_b0_test_chip(struct efx_nic *efx, struct efx_self_tests *tests)
{
- return efx_nic_test_registers(efx, falcon_b0_register_tests,
- ARRAY_SIZE(falcon_b0_register_tests));
+ enum reset_type reset_method = RESET_TYPE_INVISIBLE;
+ int rc, rc2;
+
+ mutex_lock(&efx->mac_lock);
+ if (efx->loopback_modes) {
+ /* We need the 312 clock from the PHY to test the XMAC
+ * registers, so move into XGMII loopback if available */
+ if (efx->loopback_modes & (1 << LOOPBACK_XGMII))
+ efx->loopback_mode = LOOPBACK_XGMII;
+ else
+ efx->loopback_mode = __ffs(efx->loopback_modes);
+ }
+ __efx_reconfigure_port(efx);
+ mutex_unlock(&efx->mac_lock);
+
+ efx_reset_down(efx, reset_method);
+
+ tests->registers =
+ efx_nic_test_registers(efx, falcon_b0_register_tests,
+ ARRAY_SIZE(falcon_b0_register_tests))
+ ? -1 : 1;
+
+ rc = falcon_reset_hw(efx, reset_method);
+ rc2 = efx_reset_up(efx, reset_method, rc == 0);
+ return rc ? rc : rc2;
}
/**************************************************************************
@@ -1818,7 +1845,7 @@ const struct efx_nic_type falcon_b0_nic_type = {
.get_wol = falcon_get_wol,
.set_wol = falcon_set_wol,
.resume_wol = efx_port_dummy_op_void,
- .test_registers = falcon_b0_test_registers,
+ .test_chip = falcon_b0_test_chip,
.test_nvram = falcon_test_nvram,
.revision = EFX_REV_FALCON_B0,
diff --git a/drivers/net/ethernet/sfc/net_driver.h b/drivers/net/ethernet/sfc/net_driver.h
index 8a9f6d4..55be2fd 100644
--- a/drivers/net/ethernet/sfc/net_driver.h
+++ b/drivers/net/ethernet/sfc/net_driver.h
@@ -68,6 +68,8 @@
#define EFX_TXQ_TYPES 4
#define EFX_MAX_TX_QUEUES (EFX_TXQ_TYPES * EFX_MAX_CHANNELS)
+struct efx_self_tests;
+
/**
* struct efx_special_buffer - An Efx special buffer
* @addr: CPU base address of the buffer
@@ -901,7 +903,8 @@ static inline unsigned int efx_port_num(struct efx_nic *efx)
* @get_wol: Get WoL configuration from driver state
* @set_wol: Push WoL configuration to the NIC
* @resume_wol: Synchronise WoL state between driver and MC (e.g. after resume)
- * @test_registers: Test read/write functionality of control registers
+ * @test_chip: Test registers. Should use efx_nic_test_registers(), and is
+ * expected to reset the NIC.
* @test_nvram: Test validity of NVRAM contents
* @revision: Hardware architecture revision
* @mem_map_size: Memory BAR mapped size
@@ -946,7 +949,7 @@ struct efx_nic_type {
void (*get_wol)(struct efx_nic *efx, struct ethtool_wolinfo *wol);
int (*set_wol)(struct efx_nic *efx, u32 type);
void (*resume_wol)(struct efx_nic *efx);
- int (*test_registers)(struct efx_nic *efx);
+ int (*test_chip)(struct efx_nic *efx, struct efx_self_tests *tests);
int (*test_nvram)(struct efx_nic *efx);
int revision;
diff --git a/drivers/net/ethernet/sfc/nic.c b/drivers/net/ethernet/sfc/nic.c
index 287738d..326d799 100644
--- a/drivers/net/ethernet/sfc/nic.c
+++ b/drivers/net/ethernet/sfc/nic.c
@@ -124,9 +124,6 @@ int efx_nic_test_registers(struct efx_nic *efx,
unsigned address = 0, i, j;
efx_oword_t mask, imask, original, reg, buf;
- /* Falcon should be in loopback to isolate the XMAC from the PHY */
- WARN_ON(!LOOPBACK_INTERNAL(efx));
-
for (i = 0; i < n_regs; ++i) {
address = regs[i].address;
mask = imask = regs[i].mask;
diff --git a/drivers/net/ethernet/sfc/selftest.c b/drivers/net/ethernet/sfc/selftest.c
index ccc428f..96068d1 100644
--- a/drivers/net/ethernet/sfc/selftest.c
+++ b/drivers/net/ethernet/sfc/selftest.c
@@ -120,19 +120,6 @@ static int efx_test_nvram(struct efx_nic *efx, struct efx_self_tests *tests)
return rc;
}
-static int efx_test_chip(struct efx_nic *efx, struct efx_self_tests *tests)
-{
- int rc = 0;
-
- /* Test register access */
- if (efx->type->test_registers) {
- rc = efx->type->test_registers(efx);
- tests->registers = rc ? -1 : 1;
- }
-
- return rc;
-}
-
/**************************************************************************
*
* Interrupt and event queue testing
@@ -699,8 +686,7 @@ int efx_selftest(struct efx_nic *efx, struct efx_self_tests *tests,
{
enum efx_loopback_mode loopback_mode = efx->loopback_mode;
int phy_mode = efx->phy_mode;
- enum reset_type reset_method = RESET_TYPE_INVISIBLE;
- int rc_test = 0, rc_reset = 0, rc;
+ int rc_test = 0, rc_reset, rc;
efx_selftest_async_cancel(efx);
@@ -737,44 +723,26 @@ int efx_selftest(struct efx_nic *efx, struct efx_self_tests *tests,
*/
netif_device_detach(efx->net_dev);
- mutex_lock(&efx->mac_lock);
- if (efx->loopback_modes) {
- /* We need the 312 clock from the PHY to test the XMAC
- * registers, so move into XGMII loopback if available */
- if (efx->loopback_modes & (1 << LOOPBACK_XGMII))
- efx->loopback_mode = LOOPBACK_XGMII;
- else
- efx->loopback_mode = __ffs(efx->loopback_modes);
- }
-
- __efx_reconfigure_port(efx);
- mutex_unlock(&efx->mac_lock);
-
- /* free up all consumers of SRAM (including all the queues) */
- efx_reset_down(efx, reset_method);
-
- rc = efx_test_chip(efx, tests);
- if (rc && !rc_test)
- rc_test = rc;
+ if (efx->type->test_chip) {
+ rc_reset = efx->type->test_chip(efx, tests);
+ if (rc_reset) {
+ netif_err(efx, hw, efx->net_dev,
+ "Unable to recover from chip test\n");
+ efx_schedule_reset(efx, RESET_TYPE_DISABLE);
+ return rc_reset;
+ }
- /* reset the chip to recover from the register test */
- rc_reset = efx->type->reset(efx, reset_method);
+ if ((tests->registers < 0) && !rc_test)
+ rc_test = -EIO;
+ }
/* Ensure that the phy is powered and out of loopback
* for the bist and loopback tests */
+ mutex_lock(&efx->mac_lock);
efx->phy_mode &= ~PHY_MODE_LOW_POWER;
efx->loopback_mode = LOOPBACK_NONE;
-
- rc = efx_reset_up(efx, reset_method, rc_reset == 0);
- if (rc && !rc_reset)
- rc_reset = rc;
-
- if (rc_reset) {
- netif_err(efx, drv, efx->net_dev,
- "Unable to recover from chip test\n");
- efx_schedule_reset(efx, RESET_TYPE_DISABLE);
- return rc_reset;
- }
+ __efx_reconfigure_port(efx);
+ mutex_unlock(&efx->mac_lock);
rc = efx_test_phy(efx, tests, flags);
if (rc && !rc_test)
diff --git a/drivers/net/ethernet/sfc/siena.c b/drivers/net/ethernet/sfc/siena.c
index 9f8d7ce..2354886 100644
--- a/drivers/net/ethernet/sfc/siena.c
+++ b/drivers/net/ethernet/sfc/siena.c
@@ -25,10 +25,12 @@
#include "workarounds.h"
#include "mcdi.h"
#include "mcdi_pcol.h"
+#include "selftest.h"
/* Hardware control for SFC9000 family including SFL9021 (aka Siena). */
static void siena_init_wol(struct efx_nic *efx);
+static int siena_reset_hw(struct efx_nic *efx, enum reset_type method);
static void siena_push_irq_moderation(struct efx_channel *channel)
@@ -154,10 +156,29 @@ static const struct efx_nic_register_test siena_register_tests[] = {
EFX_OWORD32(0xFFFFFFFF, 0xFFFFFFFF, 0x00000007, 0x00000000) },
};
-static int siena_test_registers(struct efx_nic *efx)
+static int siena_test_chip(struct efx_nic *efx, struct efx_self_tests *tests)
{
- return efx_nic_test_registers(efx, siena_register_tests,
- ARRAY_SIZE(siena_register_tests));
+ enum reset_type reset_method = reset_method;
+ int rc, rc2;
+
+ efx_reset_down(efx, reset_method);
+
+ /* Reset the chip immediately so that it is completely
+ * quiescent regardless of what any VF driver does.
+ */
+ rc = siena_reset_hw(efx, reset_method);
+ if (rc)
+ goto out;
+
+ tests->registers =
+ efx_nic_test_registers(efx, siena_register_tests,
+ ARRAY_SIZE(siena_register_tests))
+ ? -1 : 1;
+
+ rc = siena_reset_hw(efx, reset_method);
+out:
+ rc2 = efx_reset_up(efx, reset_method, rc == 0);
+ return rc ? rc : rc2;
}
/**************************************************************************
@@ -649,7 +670,7 @@ const struct efx_nic_type siena_a0_nic_type = {
.get_wol = siena_get_wol,
.set_wol = siena_set_wol,
.resume_wol = siena_init_wol,
- .test_registers = siena_test_registers,
+ .test_chip = siena_test_chip,
.test_nvram = efx_mcdi_nvram_test_all,
.revision = EFX_REV_SIENA_A0,
--
1.7.7.6
--
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
^ permalink raw reply related
* [PATCH net-next 10/11] sfc: Fix interface statistics running backward
From: Ben Hutchings @ 2012-07-11 23:18 UTC (permalink / raw)
To: David Miller; +Cc: netdev, linux-net-drivers
In-Reply-To: <1342048389.2613.59.camel@bwh-desktop.uk.solarflarecom.com>
Some interface statistics are computed in such a way that they can
sometimes decrease (and even underflow). Since the computed value
will never be greater than the true value, we fix this by only storing
the computed value when it increases.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
drivers/net/ethernet/sfc/falcon_xmac.c | 12 ++++++------
drivers/net/ethernet/sfc/nic.h | 18 ++++++++++++++++++
drivers/net/ethernet/sfc/siena.c | 8 ++++----
3 files changed, 28 insertions(+), 10 deletions(-)
diff --git a/drivers/net/ethernet/sfc/falcon_xmac.c b/drivers/net/ethernet/sfc/falcon_xmac.c
index 6106ef1..8333865 100644
--- a/drivers/net/ethernet/sfc/falcon_xmac.c
+++ b/drivers/net/ethernet/sfc/falcon_xmac.c
@@ -341,12 +341,12 @@ void falcon_update_stats_xmac(struct efx_nic *efx)
FALCON_STAT(efx, XgTxIpSrcErrPkt, tx_ip_src_error);
/* Update derived statistics */
- mac_stats->tx_good_bytes =
- (mac_stats->tx_bytes - mac_stats->tx_bad_bytes -
- mac_stats->tx_control * 64);
- mac_stats->rx_bad_bytes =
- (mac_stats->rx_bytes - mac_stats->rx_good_bytes -
- mac_stats->rx_control * 64);
+ efx_update_diff_stat(&mac_stats->tx_good_bytes,
+ mac_stats->tx_bytes - mac_stats->tx_bad_bytes -
+ mac_stats->tx_control * 64);
+ efx_update_diff_stat(&mac_stats->rx_bad_bytes,
+ mac_stats->rx_bytes - mac_stats->rx_good_bytes -
+ mac_stats->rx_control * 64);
}
void falcon_poll_xmac(struct efx_nic *efx)
diff --git a/drivers/net/ethernet/sfc/nic.h b/drivers/net/ethernet/sfc/nic.h
index f48ccf6..bab5cd9 100644
--- a/drivers/net/ethernet/sfc/nic.h
+++ b/drivers/net/ethernet/sfc/nic.h
@@ -294,6 +294,24 @@ extern bool falcon_xmac_check_fault(struct efx_nic *efx);
extern int falcon_reconfigure_xmac(struct efx_nic *efx);
extern void falcon_update_stats_xmac(struct efx_nic *efx);
+/* Some statistics are computed as A - B where A and B each increase
+ * linearly with some hardware counter(s) and the counters are read
+ * asynchronously. If the counters contributing to B are always read
+ * after those contributing to A, the computed value may be lower than
+ * the true value by some variable amount, and may decrease between
+ * subsequent computations.
+ *
+ * We should never allow statistics to decrease or to exceed the true
+ * value. Since the computed value will never be greater than the
+ * true value, we can achieve this by only storing the computed value
+ * when it increases.
+ */
+static inline void efx_update_diff_stat(u64 *stat, u64 diff)
+{
+ if ((s64)(diff - *stat) > 0)
+ *stat = diff;
+}
+
/* Interrupts and test events */
extern int efx_nic_init_interrupt(struct efx_nic *efx);
extern void efx_nic_enable_interrupts(struct efx_nic *efx);
diff --git a/drivers/net/ethernet/sfc/siena.c b/drivers/net/ethernet/sfc/siena.c
index 2354886..6bafd21 100644
--- a/drivers/net/ethernet/sfc/siena.c
+++ b/drivers/net/ethernet/sfc/siena.c
@@ -458,8 +458,8 @@ static int siena_try_update_nic_stats(struct efx_nic *efx)
MAC_STAT(tx_bytes, TX_BYTES);
MAC_STAT(tx_bad_bytes, TX_BAD_BYTES);
- mac_stats->tx_good_bytes = (mac_stats->tx_bytes -
- mac_stats->tx_bad_bytes);
+ efx_update_diff_stat(&mac_stats->tx_good_bytes,
+ mac_stats->tx_bytes - mac_stats->tx_bad_bytes);
MAC_STAT(tx_packets, TX_PKTS);
MAC_STAT(tx_bad, TX_BAD_FCS_PKTS);
MAC_STAT(tx_pause, TX_PAUSE_PKTS);
@@ -492,8 +492,8 @@ static int siena_try_update_nic_stats(struct efx_nic *efx)
MAC_STAT(tx_ip_src_error, TX_IP_SRC_ERR_PKTS);
MAC_STAT(rx_bytes, RX_BYTES);
MAC_STAT(rx_bad_bytes, RX_BAD_BYTES);
- mac_stats->rx_good_bytes = (mac_stats->rx_bytes -
- mac_stats->rx_bad_bytes);
+ efx_update_diff_stat(&mac_stats->rx_good_bytes,
+ mac_stats->rx_bytes - mac_stats->rx_bad_bytes);
MAC_STAT(rx_packets, RX_PKTS);
MAC_STAT(rx_good, RX_GOOD_PKTS);
MAC_STAT(rx_bad, RX_BAD_FCS_PKTS);
--
1.7.7.6
--
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
^ permalink raw reply related
* [PATCH net-next 11/11] sfc: Correct some comments on enum reset_type
From: Ben Hutchings @ 2012-07-11 23:19 UTC (permalink / raw)
To: David Miller; +Cc: netdev, linux-net-drivers
In-Reply-To: <1342048389.2613.59.camel@bwh-desktop.uk.solarflarecom.com>
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
drivers/net/ethernet/sfc/enum.h | 8 ++++----
1 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/drivers/net/ethernet/sfc/enum.h b/drivers/net/ethernet/sfc/enum.h
index d725a8f..182dbe2 100644
--- a/drivers/net/ethernet/sfc/enum.h
+++ b/drivers/net/ethernet/sfc/enum.h
@@ -136,10 +136,10 @@ enum efx_loopback_mode {
*
* Reset methods are numbered in order of increasing scope.
*
- * @RESET_TYPE_INVISIBLE: don't reset the PHYs or interrupts
- * @RESET_TYPE_ALL: reset everything but PCI core blocks
- * @RESET_TYPE_WORLD: reset everything, save & restore PCI config
- * @RESET_TYPE_DISABLE: disable NIC
+ * @RESET_TYPE_INVISIBLE: Reset datapath and MAC (Falcon only)
+ * @RESET_TYPE_ALL: Reset datapath, MAC and PHY
+ * @RESET_TYPE_WORLD: Reset as much as possible
+ * @RESET_TYPE_DISABLE: Reset datapath, MAC and PHY; leave NIC disabled
* @RESET_TYPE_TX_WATCHDOG: reset due to TX watchdog
* @RESET_TYPE_INT_ERROR: reset due to internal error
* @RESET_TYPE_RX_RECOVERY: reset to recover from RX datapath errors
--
1.7.7.6
--
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
^ permalink raw reply related
* Re: [RFC PATCH v2] tcp: TCP Small Queues
From: Eric Dumazet @ 2012-07-11 23:38 UTC (permalink / raw)
To: Rick Jones; +Cc: nanditad, netdev, mattmathis, codel, ncardwell, David Miller
In-Reply-To: <4FFDC48D.2030606@hp.com>
On Wed, 2012-07-11 at 11:23 -0700, Rick Jones wrote:
> On 07/11/2012 08:11 AM, Eric Dumazet wrote:
> >
> >
> > Tests using a single TCP flow.
> >
> > Tests on 10Gbit links :
> >
> >
> > echo 16384 >/proc/sys/net/ipv4/tcp_limit_output_bytes
> > OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.99.2 (192.168.99.2) port 0 AF_INET
> > tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
> > tcpi_rtt 1875 tcpi_rttvar 750 tcpi_snd_ssthresh 16 tpci_snd_cwnd 79
> > tcpi_reordering 53 tcpi_total_retrans 0
>
> I take it you hacked your local copy of netperf to emit those? Or did I
> leave some cruft behind in something I committed to the repository?
>
Yep, its a netperf-2.5.0 with a one line change to output these TCP_INFO
bits
> What was the ultimate limiter on throughput? I notice it didn't achieve
> link-rate on either 10 GbE nor 1 GbE.
>
My lab has one fast machine (source in this 10Gb test), and one slow
machine (Intel Q6600 quad core), both with ixgbe cards.
On Gigabit test, the receiver is a laptop.
> > Thats the plan : limiting numer of bytes in Qdisc, not number of bytes
> > in socket write queue.
>
> So the SO_SNDBUF can still grow rather larger than necessary? It is
> just that TCP will be nice to the other flows by not dumping all of it
> into the qdisc at once. Latency seen by the application itself is then
> unchanged since there will still be (potentially) as much stuff queued
> in the SO_SNDBUF as before right?
Of course SO_SNDBUF can grow if autotuning is enabled.
I think there is a bit of misunderstanding about this patch and what it
does.
It only makes sure the amount of packets (from socket write queue) are
cloned in qdisc/device queue in a limited way, not "as much as allowed"
^ permalink raw reply
* Re: [PATCH v3 net-next] tcp: TCP Small Queues
From: Eric Dumazet @ 2012-07-11 23:47 UTC (permalink / raw)
To: Andi Kleen; +Cc: nanditad, netdev, codel, mattmathis, ncardwell, David Miller
In-Reply-To: <m2r4sirvmg.fsf@firstfloor.org>
On Wed, 2012-07-11 at 11:38 -0700, Andi Kleen wrote:
> Eric Dumazet <eric.dumazet@gmail.com> writes:
> > +
> > + if (!sock_owned_by_user(sk)) {
> > + if ((1 << sk->sk_state) &
> > + (TCPF_ESTABLISHED | TCPF_FIN_WAIT1 |
> > + TCPF_CLOSING | TCPF_CLOSE_WAIT))
> > + tcp_write_xmit(sk,
> > + tcp_current_mss(sk),
> > + 0, 0,
> > + GFP_ATOMIC);
>
> Did you have any problems with the GFP_ATOMIC allocation failing here?
> I think you move some skb allocs from process context to ATOMIC, right?
>
> It relies on the VM somehow catching up in another context.
>
> May be interesting to try out under very high memory pressure.
AFAIK this patch has no effect on memory allocations, because :
At write() time, we use GFP_KERNEL allocations to build the socket write
queue.
And each skb is "precloned", that is allocated from skbuff_fclone_cache.
Then, when we select some skbs from write for transmission, we clone
them. And this cloning doesnt allocate memory, because we use the fast
cloning.
There are some pathological cases were we do allocate memory, but TSQ
should lower probabilities :
1) when we skb_clone(skb), we might need an allocation if the previous
clone is still in flight. This is a retransmit case, and TSQ only takes
care of transmits (of previously unsent skbs : their fast clone is
available)
2) When we need to split the skb because not enough cwnd is available.
As TSQ tends to limit number of skbs in qdisc, we lower the
probabilities of such splits. Anyway a TSO split only need an sk_buff,
this is not a lot of memory.
3) On bulk sends, most transmits are triggered by incoming ACKS, in
softirq context, so already using GFP_ATOMIC "allocations", if
ever needed (see point 1)
Thanks
^ permalink raw reply
* Re: [RFC PATCH v2] tcp: TCP Small Queues
From: Eric Dumazet @ 2012-07-11 23:49 UTC (permalink / raw)
To: Rick Jones; +Cc: nanditad, netdev, mattmathis, codel, ncardwell, David Miller
In-Reply-To: <4FFDC985.6050805@hp.com>
On Wed, 2012-07-11 at 11:44 -0700, Rick Jones wrote:
> On 07/11/2012 08:11 AM, Eric Dumazet wrote:
> >
> > Some bench results about the choice of 128KB being the default value:
>
> What were the starting/baseline figures?
>
> >
> > Tests using a single TCP flow.
> >
> > Tests on 10Gbit links :
> >
> >
> > echo 16384 >/proc/sys/net/ipv4/tcp_limit_output_bytes
> > OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.99.2 (192.168.99.2) port 0 AF_INET
> > tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
> > tcpi_rtt 1875 tcpi_rttvar 750 tcpi_snd_ssthresh 16 tpci_snd_cwnd 79
> > tcpi_reordering 53 tcpi_total_retrans 0
> > Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
> > Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
> > Size Size Size (sec) Util Util Util Util Demand Demand Units
> > Final Final % Method % Method
> > 392360 392360 16384 20.00 1389.53 10^6bits/s 0.52 S 4.30 S 0.737 1.014 usec/KB
>
> By the way, that double reporting of the local socket send size is fixed in:
>
> ------------------------------------------------------------------------
> r516 | raj | 2012-01-05 15:48:52 -0800 (Thu, 05 Jan 2012) | 1 line
>
> report the rsr_size_end in an omni stream test rather than a copy of the
> lss_size_end
>
> of netperf and later. Also, any idea why the local socket send size got
> so much larger with 1GbE than 10 GbE at that setting of
> tcp_limit_output_bytes?
The 10Gb receiver is a net-next kernel, but the 1Gb receiver is a 2.6.38
ubuntu kernel. They probably have very different TCP behavior.
^ permalink raw reply
* Re: [PATCH v3 net-next] tcp: TCP Small Queues
From: Eric Dumazet @ 2012-07-12 0:00 UTC (permalink / raw)
To: Nandita Dukkipati; +Cc: netdev, codel, ncardwell, David Miller, mattmathis
In-Reply-To: <CAB_+Fg496FgQZkJE8uf_4RX_wW+fz1p5Rq8rY=BuA=c6KPakpA@mail.gmail.com>
Note : netdev mailing list doesnt accept HTML ;)
On Wed, 2012-07-11 at 12:29 -0700, Nandita Dukkipati wrote:
> I have a couple of high level questions on the two solutions for
> maintain small queues at qdisc: 1) using codel's feedback versus 2)
> the approach in TSQ to explicitly limit bytes per-flow
> to tcp_limit_output_bytes.
>
>
> 1. multiple flows case: codel feedback (be it drop/ECN or direct
> feedback like in your patch with fq_codel: report congestion
> notification at enqueue time) seems to work in maintaining small
> queues regardless of the number of connections. TSQ probably won't
> achieve this goal when number of flows is large. Is this correct?
Problem with codel/fq_codel is that the congestion is delayed.
And my patch (signaling congestion to local tcp senders) raised several
issues.
I had then TSQ idea to have a more general mechanism (not depending on a
particular qdisc setup)
TCP Small Queue is effective even with a large number of flows because
with standard pfifo_fast, we can without TCP Small Queue store 1000
packets in Qdisc, each being 64KB. Thats 64 MBytes...
With TCP Small Queue, you can reach this level of occupancy using 500
flows (2 packets per flow)
Without TCP Small Queue, you can reach this level using 16 flows.
(This of course assumes TSO is on, but modern NIC are TSO enabled)
> 2. cwnd adjustment: codel feedback directly and explicitly controls
> snd_cwnd which in turn will also adjust the send buffer size to
> appropriate value. That's a nice property, which TSQ probably won't
> have because it doesn't explicitly adjust snd_cwnd.
>
> Considering these two points, why TSQ over the Codel feedback
> approach?
I dont think they compete. They are in fact complementary.
If you use codel/fq_codel + TSQ, you have both per flow limitation in
qdisc (TSQ) + sojourn time aware and multi flow aware feedback.
^ permalink raw reply
* [PATCH 1/2] tcp: Fix out of bounds access to tcpm_vals
From: Alexander Duyck @ 2012-07-12 0:18 UTC (permalink / raw)
To: netdev; +Cc: davem, jeffrey.t.kirsher, alexander.duyck, Alexander Duyck
The recent patch "tcp: Maintain dynamic metrics in local cache." introduced
an out of bounds access due to what appears to be a typo. I believe this
change should resolve the issue by replacing the access to RTAX_CWND with
TCP_METRIC_CWND.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
net/ipv4/tcp_metrics.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
index 1fd83d3..5a38a2d 100644
--- a/net/ipv4/tcp_metrics.c
+++ b/net/ipv4/tcp_metrics.c
@@ -412,7 +412,7 @@ void tcp_update_metrics(struct sock *sk)
max(tp->snd_cwnd >> 1, tp->snd_ssthresh));
if (!tcp_metric_locked(tm, TCP_METRIC_CWND)) {
val = tcp_metric_get(tm, TCP_METRIC_CWND);
- tcp_metric_set(tm, RTAX_CWND, (val + tp->snd_cwnd) >> 1);
+ tcp_metric_set(tm, TCP_METRIC_CWND, (val + tp->snd_cwnd) >> 1);
}
} else {
/* Else slow start did not finish, cwnd is non-sense,
^ permalink raw reply related
* [PATCH 2/2] net: Update alloc frag to reduce get/put page usage and recycle pages
From: Alexander Duyck @ 2012-07-12 0:18 UTC (permalink / raw)
To: netdev
Cc: davem, jeffrey.t.kirsher, alexander.duyck, Eric Dumazet,
Alexander Duyck
In-Reply-To: <20120712001804.26542.2889.stgit@gitlad.jf.intel.com>
This patch does several things.
First it reorders the netdev_alloc_frag code so that only one conditional
check is needed in most cases instead of 2.
Second it incorporates the atomic_set and atomic_sub_return logic from an
earlier proposed patch by Eric Dumazet allowing for a reduction in the
get_page/put_page overhead when dealing with frags.
Finally it also incorporates the page reuse code so that if the page count
is dropped to 0 we can just reinitialize the page and reuse it.
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
net/core/skbuff.c | 37 +++++++++++++++++++++++++------------
1 files changed, 25 insertions(+), 12 deletions(-)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 506f678..69f4add 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -296,9 +296,12 @@ EXPORT_SYMBOL(build_skb);
struct netdev_alloc_cache {
struct page *page;
unsigned int offset;
+ unsigned int pagecnt_bias;
};
static DEFINE_PER_CPU(struct netdev_alloc_cache, netdev_alloc_cache);
+#define NETDEV_PAGECNT_BIAS (PAGE_SIZE / SMP_CACHE_BYTES)
+
/**
* netdev_alloc_frag - allocate a page fragment
* @fragsz: fragment size
@@ -311,23 +314,33 @@ void *netdev_alloc_frag(unsigned int fragsz)
struct netdev_alloc_cache *nc;
void *data = NULL;
unsigned long flags;
+ unsigned int offset;
local_irq_save(flags);
nc = &__get_cpu_var(netdev_alloc_cache);
- if (unlikely(!nc->page)) {
-refill:
+ offset = nc->offset;
+ if (unlikely(offset < fragsz)) {
+ BUG_ON(PAGE_SIZE < fragsz);
+
+ if (likely(nc->page) &&
+ atomic_sub_and_test(nc->pagecnt_bias, &nc->page->_count))
+ goto recycle;
+
nc->page = alloc_page(GFP_ATOMIC | __GFP_COLD);
- nc->offset = 0;
- }
- if (likely(nc->page)) {
- if (nc->offset + fragsz > PAGE_SIZE) {
- put_page(nc->page);
- goto refill;
+ if (unlikely(!nc->page)) {
+ offset = 0;
+ goto end;
}
- data = page_address(nc->page) + nc->offset;
- nc->offset += fragsz;
- get_page(nc->page);
- }
+recycle:
+ atomic_set(&nc->page->_count, NETDEV_PAGECNT_BIAS);
+ nc->pagecnt_bias = NETDEV_PAGECNT_BIAS;
+ offset = PAGE_SIZE;
+ }
+ offset -= fragsz;
+ nc->pagecnt_bias--;
+ data = page_address(nc->page) + offset;
+end:
+ nc->offset = offset;
local_irq_restore(flags);
return data;
}
^ permalink raw reply related
* Re: [PATCH 2/2] bonding: debugfs and network namespaces are incompatible
From: Eric W. Biederman @ 2012-07-12 0:18 UTC (permalink / raw)
To: Jay Vosburgh; +Cc: netdev
In-Reply-To: <16903.1341947581@death.nxdomain>
Jay Vosburgh <fubar@us.ibm.com> writes:
> Eric W. Biederman <ebiederm@xmission.com> wrote:
>>I haven't run across any of those network devices, but if they create a
>>debugfs entry that embeds the device name it will be a problem.
>
> A quick grep suggests that cxgb4, skge, sky2, stmmac, ipoib and
> half a dozen of the wireless drivers all create files in debugfs. I did
> not check exhaustively, but at least some of them include the device
> name.
Yep. It looks like imperfect habits are common.
>>Last I looked any custom user space interface from network devices was
>>rare and bonding using debugfs is the first instance of using debugfs
>>from networking devices I have seen.
>>
>>I think the problem will be a little less severe for physical network
>>devices as they all start in the initial network namespace and so start
>>with distinct names.
>>
>>With bonding I can do "ip link add type bond" in any network namespace
>>and get another bond0. So name conflicts are very much expeted with all
>>virtual networking devices.
>
> Fair enough, although it is trivial to rename any network device
> such that a conflict would occur.
Actually for userspace and administrative reasons frequently it isn't
trivial to rename devices.
>>But if you know of any other networking devices using debugsfs that
>>code should probably get the same treatment as the bonding debugfs code.
>
> Is there no alternative than simply disabling debugfs whenever
> network namespaces are enabled? The information bonding displays via
> debugfs is useful, and having it unavailable on all distro kernels seems
> a bit harsh.
I took a good hard look at debugfs while writing this reply and debufs
scares me. It is the kind of code that just about wants to me to throw
in the towel seeing no hope of a good solid kernel.
I can definitely open a /sys/kernel/debug/bonding/bond0/rlb_hash_table
and delete the bond and then read the file. On a bad day that will oops
the kernel, as there is nothing holding a reference to the network
device. I think only the BOND_MODE_ALB check makes keeps the kernel
from oopsing in my quick tests.
The fact that debugfs is enabled in distro kernels is actually apalling
to me. debugfs makes it easy to oops the kernel.
There are lots of alternatives to debugfs on where to put information
and the bonding driver already uses most of them.
> Why is the logic already in the driver not sufficient? If the
> attempt to create the debugfs directory with the interface name fails,
> then it merely prints a warning and continues without the debugfs for
> that interface.
All I know for certain is the existing logic will eventually cause
someone doing something reasonable to send me a bug report.
I can see where you are coming from in that the bonding driver debugfs
code really was built to gracefully fail and ignore problems of instead
of just hapharzardly and sloppily ignore problems. At the same time
I can oops the kernel if I try with your debugfs in the bonding driver.
But it causes the code to fail and issue a warning. So if I don't
disable the code now, I expect I will get a bug report, and who
knows how many sill files in bonding will have in debugfs by then.
what silly things bonding may be doing in debugfs by then.
Eric
^ permalink raw reply
* Re: Add missing license tag to nci.
From: David Miller @ 2012-07-12 0:24 UTC (permalink / raw)
To: davej; +Cc: netdev, ilane, linux-wireless
In-Reply-To: <20120711183157.GA19417@redhat.com>
From: Dave Jones <davej@redhat.com>
Date: Wed, 11 Jul 2012 14:31:57 -0400
> nci: module license 'unspecified' taints kernel.
>
> Signed-off-by: Dave Jones <davej@redhat.com>
NFC patches go to linux-wireless, CC:'d
> diff --git a/net/nfc/nci/core.c b/net/nfc/nci/core.c
> index d560e6f..f18f207 100644
> --- a/net/nfc/nci/core.c
> +++ b/net/nfc/nci/core.c
> @@ -27,6 +27,7 @@
>
> #define pr_fmt(fmt) KBUILD_MODNAME ": %s: " fmt, __func__
>
> +#include <linux/module.h>
> #include <linux/types.h>
> #include <linux/workqueue.h>
> #include <linux/completion.h>
> @@ -878,3 +879,5 @@ static void nci_cmd_work(struct work_struct *work)
> jiffies + msecs_to_jiffies(NCI_CMD_TIMEOUT));
> }
> }
> +
> +MODULE_LICENSE("GPL");
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
^ permalink raw reply
* [RFC PATCH 0/2] Coalesce MMIO writes for transmits
From: Alexander Duyck @ 2012-07-12 0:25 UTC (permalink / raw)
To: netdev
Cc: davem, jeffrey.t.kirsher, edumazet, bhutchings, therbert,
alexander.duyck
This patch set is meant to address recent issues I found with ixgbe
performance being bound by Tx tail writes. With these changes in place
and the dispatch_limit set to 1 or more I see a significant increase in
performance.
In the case of one of my systems I saw the routing rate for 7 queues jump
from 10.5 to 11.7Mpps. The overall increase I have seen on most systems is
something on the order of about 15%. In the case of pktgen I have also
seen a noticeable increase as the previous limit for transmits was
~12.5Mpps, but with this patch set in place and the dispatch_limit enabled
the value increases to ~14.2Mpps.
I expected there to be an increase in latency, however so far I have not
ran into that. I have tried running NPtcp tests for latency and seen no
difference in the coalesced and non-coalesced transaction times. I welcome
any suggestions for tests I might run that might expose any latency issues
as a result of this patch.
---
Alexander Duyck (2):
ixgbe: Add functionality for delaying the MMIO write for Tx
net: Add new network device function to allow for MMIO batching
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 22 +++++++-
include/linux/netdevice.h | 57 +++++++++++++++++++++
net/core/dev.c | 67 +++++++++++++++++++++++++
net/core/net-sysfs.c | 36 +++++++++++++
4 files changed, 180 insertions(+), 2 deletions(-)
--
Thanks,
Alex
^ permalink raw reply
* [RFC PATCH 1/2] net: Add new network device function to allow for MMIO batching
From: Alexander Duyck @ 2012-07-12 0:26 UTC (permalink / raw)
To: netdev
Cc: davem, jeffrey.t.kirsher, edumazet, bhutchings, therbert,
alexander.duyck
In-Reply-To: <20120712002103.27846.73812.stgit@gitlad.jf.intel.com>
This change adds capabilities to the driver for batching the MMIO write
involved with transmits. Most of the logic is based off of the code for
the qdisc scheduling.
What I did is break the transmit path into two parts. We already had the
ndo_start_xmit function which has been there all along. The part I added
was ndo_complete_xmit which is meant to handle notifying the hardware that
frames are ready for delivery.
To control all of this I added a net sysfs value for the Tx queues called
dispatch_limit. When 0 it indicates that all frames will notify hardware
immediately. When 1 or more the netdev_complete_xmit call will queue up to
that number of packets, and when the value is exceeded it will notify the
hardware and reset the pending frame dispatch count.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
include/linux/netdevice.h | 57 ++++++++++++++++++++++++++++++++++++++
net/core/dev.c | 67 +++++++++++++++++++++++++++++++++++++++++++++
net/core/net-sysfs.c | 36 ++++++++++++++++++++++++
3 files changed, 160 insertions(+), 0 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 5a1a657..8d50fc4 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -522,6 +522,8 @@ enum netdev_queue_state_t {
__QUEUE_STATE_DRV_XOFF,
__QUEUE_STATE_STACK_XOFF,
__QUEUE_STATE_FROZEN,
+ __QUEUE_STATE_DELAYED,
+ __QUEUE_STATE_DISPATCH,
#define QUEUE_STATE_ANY_XOFF ((1 << __QUEUE_STATE_DRV_XOFF) | \
(1 << __QUEUE_STATE_STACK_XOFF))
#define QUEUE_STATE_ANY_XOFF_OR_FROZEN (QUEUE_STATE_ANY_XOFF | \
@@ -550,6 +552,7 @@ struct netdev_queue {
#if defined(CONFIG_XPS) && defined(CONFIG_NUMA)
int numa_node;
#endif
+ unsigned int dispatch_limit;
/*
* write mostly part
*/
@@ -561,6 +564,11 @@ struct netdev_queue {
unsigned long trans_start;
/*
+ * pointer to next Tx queue in dispatch_queue
+ */
+ struct netdev_queue *next_dispatch;
+
+ /*
* Number of TX timeouts for this queue
* (/sys/class/net/DEV/Q/trans_timeout)
*/
@@ -568,6 +576,8 @@ struct netdev_queue {
unsigned long state;
+ unsigned int dispatch_pending;
+
#ifdef CONFIG_BQL
struct dql dql;
#endif
@@ -924,6 +934,8 @@ struct net_device_ops {
int (*ndo_stop)(struct net_device *dev);
netdev_tx_t (*ndo_start_xmit) (struct sk_buff *skb,
struct net_device *dev);
+ void (*ndo_complete_xmit) (struct net_device *dev,
+ unsigned int queue);
u16 (*ndo_select_queue)(struct net_device *dev,
struct sk_buff *skb);
void (*ndo_change_rx_flags)(struct net_device *dev,
@@ -1760,6 +1772,9 @@ struct softnet_data {
unsigned int dropped;
struct sk_buff_head input_pkt_queue;
struct napi_struct backlog;
+
+ struct netdev_queue *dispatch_queue;
+ struct netdev_queue **dispatch_queue_tailp;
};
static inline void input_queue_head_incr(struct softnet_data *sd)
@@ -1779,6 +1794,44 @@ static inline void input_queue_tail_incr_save(struct softnet_data *sd,
DECLARE_PER_CPU_ALIGNED(struct softnet_data, softnet_data);
+static inline void netif_tx_delay_queue(struct netdev_queue *txq)
+{
+ set_bit(__QUEUE_STATE_DELAYED, &txq->state);
+}
+
+extern void __netif_tx_dispatch_queue(struct netdev_queue *txq);
+
+static inline void netif_tx_dispatch_queue(struct netdev_queue *txq)
+{
+ if (test_and_clear_bit(__QUEUE_STATE_DELAYED, &txq->state))
+ __netif_tx_dispatch_queue(txq);
+}
+
+static inline bool netif_tx_queue_delayed(const struct netdev_queue *txq)
+{
+ return test_bit(__QUEUE_STATE_DELAYED, &txq->state);
+}
+
+static inline void netdev_complete_xmit(struct netdev_queue *txq)
+{
+ struct net_device *dev = txq->dev;
+ const struct net_device_ops *ops = dev->netdev_ops;
+
+ if (txq->dispatch_pending < txq->dispatch_limit) {
+ if (netif_tx_queue_delayed(txq)) {
+ txq->dispatch_pending++;
+ return;
+ }
+
+ /* start of delayed write sequence */
+ netif_tx_delay_queue(txq);
+ }
+
+ txq->dispatch_pending = 0;
+
+ ops->ndo_complete_xmit(dev, txq - &dev->_tx[0]);
+}
+
extern void __netif_schedule(struct Qdisc *q);
static inline void netif_schedule_queue(struct netdev_queue *txq)
@@ -1973,6 +2026,7 @@ static inline void netdev_completed_queue(struct net_device *dev,
static inline void netdev_tx_reset_queue(struct netdev_queue *q)
{
+ clear_bit(__QUEUE_STATE_DELAYED, &q->state);
#ifdef CONFIG_BQL
clear_bit(__QUEUE_STATE_STACK_XOFF, &q->state);
dql_reset(&q->dql);
@@ -2482,6 +2536,9 @@ static inline void netif_tx_unlock_bh(struct net_device *dev)
} \
}
+#define HARD_TX_TRYLOCK(dev, txq) \
+ ((dev->features & NETIF_F_LLTX) || __netif_tx_trylock(txq))
+
static inline void netif_tx_disable(struct net_device *dev)
{
unsigned int i;
diff --git a/net/core/dev.c b/net/core/dev.c
index 93af533..a72669a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2032,6 +2032,27 @@ int netif_get_num_default_rss_queues(void)
}
EXPORT_SYMBOL(netif_get_num_default_rss_queues);
+static inline void __netif_tx_redispatch_queue(struct netdev_queue *txq)
+{
+ struct softnet_data *sd;
+ unsigned long flags;
+
+ local_irq_save(flags);
+ sd = &__get_cpu_var(softnet_data);
+ txq->next_dispatch = NULL;
+ sd->dispatch_queue = txq;
+ sd->dispatch_queue_tailp = &txq->next_dispatch;
+ raise_softirq_irqoff(NET_TX_SOFTIRQ);
+ local_irq_restore(flags);
+}
+
+void __netif_tx_dispatch_queue(struct netdev_queue *txq)
+{
+ if (!test_and_set_bit(__QUEUE_STATE_DISPATCH, &txq->state))
+ __netif_tx_redispatch_queue(txq);
+}
+EXPORT_SYMBOL(__netif_tx_dispatch_queue);
+
static inline void __netif_reschedule(struct Qdisc *q)
{
struct softnet_data *sd;
@@ -3268,6 +3289,41 @@ static void net_tx_action(struct softirq_action *h)
}
}
}
+
+ if (sd->dispatch_queue) {
+ struct netdev_queue *head;
+
+ local_irq_disable();
+ head = sd->dispatch_queue;
+ sd->dispatch_queue = NULL;
+ sd->dispatch_queue_tailp = &sd->dispatch_queue;
+ local_irq_enable();
+
+ while (head) {
+ struct netdev_queue *txq = head;
+ struct net_device *dev = txq->dev;
+ const struct net_device_ops *ops = dev->netdev_ops;
+
+ head = head->next_dispatch;
+
+ if (!HARD_TX_TRYLOCK(dev, txq)) {
+ __netif_tx_redispatch_queue(txq);
+ continue;
+ }
+
+ smp_mb__before_clear_bit();
+ clear_bit(__QUEUE_STATE_DISPATCH, &txq->state);
+
+ if (txq->dispatch_pending &&
+ !netif_tx_queue_delayed(txq)) {
+ int index = txq - &dev->_tx[0];
+
+ ops->ndo_complete_xmit(dev, index);
+ }
+
+ HARD_TX_UNLOCK(dev, txq);
+ }
+ }
}
#if (defined(CONFIG_BRIDGE) || defined(CONFIG_BRIDGE_MODULE)) && \
@@ -6485,6 +6541,15 @@ static int dev_cpu_callback(struct notifier_block *nfb,
oldsd->output_queue = NULL;
oldsd->output_queue_tailp = &oldsd->output_queue;
}
+
+ /* Append delayed xmit queue from offline CPU */
+ if (oldsd->dispatch_queue) {
+ *sd->dispatch_queue_tailp = oldsd->dispatch_queue;
+ sd->dispatch_queue_tailp = oldsd->dispatch_queue_tailp;
+ oldsd->dispatch_queue = NULL;
+ oldsd->dispatch_queue_tailp = &oldsd->dispatch_queue;
+ }
+
/* Append NAPI poll list from offline CPU. */
if (!list_empty(&oldsd->poll_list)) {
list_splice_init(&oldsd->poll_list, &sd->poll_list);
@@ -6772,6 +6837,8 @@ static int __init net_dev_init(void)
INIT_LIST_HEAD(&sd->poll_list);
sd->output_queue = NULL;
sd->output_queue_tailp = &sd->output_queue;
+ sd->dispatch_queue = NULL;
+ sd->dispatch_queue_tailp = &sd->dispatch_queue;
#ifdef CONFIG_RPS
sd->csd.func = rps_trigger_softirq;
sd->csd.info = sd;
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 42bb496..4f7eb58 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -997,11 +997,47 @@ static struct netdev_queue_attribute xps_cpus_attribute =
__ATTR(xps_cpus, S_IRUGO | S_IWUSR, show_xps_map, store_xps_map);
#endif /* CONFIG_XPS */
+static ssize_t show_dispatch_limit(struct netdev_queue *queue,
+ struct netdev_queue_attribute *attribute,
+ char *buf)
+{
+ unsigned int dispatch_limit;
+
+ spin_lock_irq(&queue->_xmit_lock);
+ dispatch_limit = queue->dispatch_limit;
+ spin_unlock_irq(&queue->_xmit_lock);
+
+ return sprintf(buf, "%u\n", dispatch_limit);
+}
+
+static ssize_t store_dispatch_limit(struct netdev_queue *queue,
+ struct netdev_queue_attribute *attribute,
+ const char *buf, size_t len)
+{
+ unsigned int dispatch_limit;
+ int err;
+
+ err = kstrtouint(buf, 10, &dispatch_limit);
+ if (err < 0)
+ return err;
+
+ spin_lock_irq(&queue->_xmit_lock);
+ queue->dispatch_limit = dispatch_limit;
+ spin_unlock_irq(&queue->_xmit_lock);
+
+ return len;
+}
+
+static struct netdev_queue_attribute dispatch_limit_attribute =
+ __ATTR(dispatch_limit, S_IRUGO | S_IWUSR,
+ show_dispatch_limit, store_dispatch_limit);
+
static struct attribute *netdev_queue_default_attrs[] = {
&queue_trans_timeout.attr,
#ifdef CONFIG_XPS
&xps_cpus_attribute.attr,
#endif
+ &dispatch_limit_attribute.attr,
NULL
};
^ permalink raw reply related
* [RFC PATCH 2/2] ixgbe: Add functionality for delaying the MMIO write for Tx
From: Alexander Duyck @ 2012-07-12 0:26 UTC (permalink / raw)
To: netdev
Cc: davem, jeffrey.t.kirsher, edumazet, bhutchings, therbert,
alexander.duyck
In-Reply-To: <20120712002103.27846.73812.stgit@gitlad.jf.intel.com>
This change makes it so that ixgbe can use the new framework for delaying
the MMIO writes in the transmit path. With this change in place we see a
significant reduction in CPU utilization and increase in overall packets
per second throughput for bulk traffic tests. In addition I have not seen
any increase in latency as a result of this patch.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 22 ++++++++++++++++++++--
1 files changed, 20 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 9ec65ee..e9b71b8 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -884,6 +884,8 @@ static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector *q_vector,
return true;
}
+ netif_tx_dispatch_queue(txring_txq(tx_ring));
+
netdev_tx_completed_queue(txring_txq(tx_ring),
total_packets, total_bytes);
@@ -5825,6 +5827,22 @@ static void ixgbe_service_task(struct work_struct *work)
ixgbe_service_event_complete(adapter);
}
+static void ixgbe_complete_xmit_frame(struct net_device *dev,
+ unsigned int index)
+{
+ struct ixgbe_adapter *adapter = netdev_priv(dev);
+ struct ixgbe_ring *tx_ring = adapter->tx_ring[index];
+
+ /* notify HW of packet */
+ writel(tx_ring->next_to_use, tx_ring->tail);
+
+ /*
+ * we need this if more than one processor can write to our tail
+ * at a time, it synchronizes IO on IA64/Altix systems
+ */
+ mmiowb();
+}
+
static int ixgbe_tso(struct ixgbe_ring *tx_ring,
struct ixgbe_tx_buffer *first,
u8 *hdr_len)
@@ -6150,8 +6168,7 @@ static void ixgbe_tx_map(struct ixgbe_ring *tx_ring,
tx_ring->next_to_use = i;
- /* notify HW of packet */
- writel(i, tx_ring->tail);
+ netdev_complete_xmit(txring_txq(tx_ring));
return;
dma_error:
@@ -6961,6 +6978,7 @@ static const struct net_device_ops ixgbe_netdev_ops = {
.ndo_open = ixgbe_open,
.ndo_stop = ixgbe_close,
.ndo_start_xmit = ixgbe_xmit_frame,
+ .ndo_complete_xmit = ixgbe_complete_xmit_frame,
#ifdef IXGBE_FCOE
.ndo_select_queue = ixgbe_select_queue,
#endif
^ permalink raw reply related
* Re: [PATCH 2/2] net: Update alloc frag to reduce get/put page usage and recycle pages
From: Eric Dumazet @ 2012-07-12 0:29 UTC (permalink / raw)
To: Alexander Duyck
Cc: netdev, davem, jeffrey.t.kirsher, alexander.duyck, Eric Dumazet
In-Reply-To: <20120712001810.26542.61967.stgit@gitlad.jf.intel.com>
On Wed, 2012-07-11 at 17:18 -0700, Alexander Duyck wrote:
> This patch does several things.
>
> First it reorders the netdev_alloc_frag code so that only one conditional
> check is needed in most cases instead of 2.
>
> Second it incorporates the atomic_set and atomic_sub_return logic from an
> earlier proposed patch by Eric Dumazet allowing for a reduction in the
> get_page/put_page overhead when dealing with frags.
>
> Finally it also incorporates the page reuse code so that if the page count
> is dropped to 0 we can just reinitialize the page and reuse it.
>
> Cc: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> ---
Hmm, I was working on a version using order-3 pages if available.
(or more exactly 32768 bytes chunks)
I am not sure how your version can help with typical 1500 allocations
(2 skbs per page)
^ permalink raw reply
* Re: [PATCH 1/2] tcp: Fix out of bounds access to tcpm_vals
From: David Miller @ 2012-07-12 0:32 UTC (permalink / raw)
To: alexander.h.duyck; +Cc: netdev, jeffrey.t.kirsher, alexander.duyck
In-Reply-To: <20120712001804.26542.2889.stgit@gitlad.jf.intel.com>
From: Alexander Duyck <alexander.h.duyck@intel.com>
Date: Wed, 11 Jul 2012 17:18:04 -0700
> The recent patch "tcp: Maintain dynamic metrics in local cache." introduced
> an out of bounds access due to what appears to be a typo. I believe this
> change should resolve the issue by replacing the access to RTAX_CWND with
> TCP_METRIC_CWND.
>
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Applied, thanks a lot.
How did you spot this, did you get a compiler warning?
I ask because while working on this, I at one point put the
tcp timestamp members after the metrics array in the
tcp_metrics_bucket struct. And I got a warning from gcc about
an array bounds violation that I could not figure out.
I am pretty certain this bug here is what it was warning about. And
the problem is that if you put the array at the end gcc doesn't warn
in order to handle things similar to what people use zero length
arrays for.
^ permalink raw reply
* Re: [PATCH v3 net-next] tcp: TCP Small Queues
From: Nandita Dukkipati @ 2012-07-12 0:46 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev, codel, ncardwell, David Miller, mattmathis
In-Reply-To: <1342051233.3265.8206.camel@edumazet-glaptop>
>> Considering these two points, why TSQ over the Codel feedback
>> approach?
>
> I dont think they compete. They are in fact complementary.
>
> If you use codel/fq_codel + TSQ, you have both per flow limitation in
> qdisc (TSQ) + sojourn time aware and multi flow aware feedback.
Makes sense. My conjecture is when using codel/fq_codel qdisc, the
need for TSQ will diminish. But as you said... good part of TSQ is it
limits per-flow queuing for any qdisc structure, even those not using
codel.
^ permalink raw reply
* Re: [PATCH 2/2] net: Update alloc frag to reduce get/put page usage and recycle pages
From: David Miller @ 2012-07-12 1:11 UTC (permalink / raw)
To: eric.dumazet
Cc: alexander.h.duyck, netdev, jeffrey.t.kirsher, alexander.duyck,
edumazet
In-Reply-To: <1342052967.3265.8210.camel@edumazet-glaptop>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 12 Jul 2012 02:29:27 +0200
> On Wed, 2012-07-11 at 17:18 -0700, Alexander Duyck wrote:
>> This patch does several things.
>>
>> First it reorders the netdev_alloc_frag code so that only one conditional
>> check is needed in most cases instead of 2.
>>
>> Second it incorporates the atomic_set and atomic_sub_return logic from an
>> earlier proposed patch by Eric Dumazet allowing for a reduction in the
>> get_page/put_page overhead when dealing with frags.
>>
>> Finally it also incorporates the page reuse code so that if the page count
>> is dropped to 0 we can just reinitialize the page and reuse it.
>>
>> Cc: Eric Dumazet <edumazet@google.com>
>> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
>> ---
>
>
> Hmm, I was working on a version using order-3 pages if available.
>
> (or more exactly 32768 bytes chunks)
>
> I am not sure how your version can help with typical 1500 allocations
> (2 skbs per page)
I'd like you two to sort things out before I apply anything, thanks :)
^ permalink raw reply
* Re: [PATCH v3 net-next] tcp: TCP Small Queues
From: David Miller @ 2012-07-12 1:14 UTC (permalink / raw)
To: eric.dumazet; +Cc: nanditad, netdev, codel, ncardwell, mattmathis
In-Reply-To: <1342021831.3265.8174.camel@edumazet-glaptop>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 11 Jul 2012 17:50:31 +0200
> This introduce TSQ (TCP Small Queues)
Applied, fingers cross, thanks :-)
^ permalink raw reply
* Re: [PATCH] tc_util: fix incorrect bare number process in get_rate.
From: Li Wei @ 2012-07-12 1:16 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: netdev
In-Reply-To: <20120711075129.4f81eea8@nehalam.linuxnetplumber.net>
于 2012-7-11 22:51, Stephen Hemminger 写道:
> On Wed, 11 Jul 2012 15:24:50 +0800
> Li Wei <lw@cn.fujitsu.com> wrote:
>
>>
>> As the comment and manpage indicated that the bare number means
>> bytes per second, so the division is not needed.
>> ---
>> tc/tc_util.c | 2 +-
>> 1 files changed, 1 insertions(+), 1 deletions(-)
>>
>> diff --git a/tc/tc_util.c b/tc/tc_util.c
>> index 926ed08..e8d89c1 100644
>> --- a/tc/tc_util.c
>> +++ b/tc/tc_util.c
>> @@ -153,7 +153,7 @@ int get_rate(unsigned *rate, const char *str)
>> return -1;
>>
>> if (*p == '\0') {
>> - *rate = bps / 8.; /* assume bytes/sec */
>> + *rate = bps; /* assume bytes/sec */
>> return 0;
>> }
>>
> Thanks for finding this. The documentation, code and comment do
> all need to be the same!
>
> But changing the code as you propose would break existing usage
> by scripts. Instead, the man page and comment need to change
> to match the reality of the existing application.
>
>
Well, I see, I'll send another patch to take care of this.
Thanks,
Wei
^ permalink raw reply
* Re: [PATCH 1/2] tcp: Fix out of bounds access to tcpm_vals
From: Alexander Duyck @ 2012-07-12 1:46 UTC (permalink / raw)
To: David Miller; +Cc: alexander.h.duyck, netdev, jeffrey.t.kirsher
In-Reply-To: <20120711.173249.1303803416502735349.davem@davemloft.net>
On 7/11/2012 5:32 PM, David Miller wrote:
> From: Alexander Duyck<alexander.h.duyck@intel.com>
> Date: Wed, 11 Jul 2012 17:18:04 -0700
>
>> The recent patch "tcp: Maintain dynamic metrics in local cache." introduced
>> an out of bounds access due to what appears to be a typo. I believe this
>> change should resolve the issue by replacing the access to RTAX_CWND with
>> TCP_METRIC_CWND.
>>
>> Signed-off-by: Alexander Duyck<alexander.h.duyck@intel.com>
> Applied, thanks a lot.
>
> How did you spot this, did you get a compiler warning?
>
> I ask because while working on this, I at one point put the
> tcp timestamp members after the metrics array in the
> tcp_metrics_bucket struct. And I got a warning from gcc about
> an array bounds violation that I could not figure out.
>
> I am pretty certain this bug here is what it was warning about. And
> the problem is that if you put the array at the end gcc doesn't warn
> in order to handle things similar to what people use zero length
> arrays for.
It came up as a compiler warning. I suspect it may have something to do
with the optimizations I had turned on since it complained that the
issue was in tcp_update_metrics but then reported it on the one line in
tcp_metric_set.
Thanks,
Alex
^ permalink raw reply
* Re: [PATCH 2/2] bonding: debugfs and network namespaces are incompatible
From: Jay Vosburgh @ 2012-07-12 1:57 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: netdev
In-Reply-To: <877gu9omrr.fsf@xmission.com>
Eric W. Biederman <ebiederm@xmission.com> wrote:
>Jay Vosburgh <fubar@us.ibm.com> writes:
>
>> Eric W. Biederman <ebiederm@xmission.com> wrote:
>
>>>I haven't run across any of those network devices, but if they create a
>>>debugfs entry that embeds the device name it will be a problem.
>>
>> A quick grep suggests that cxgb4, skge, sky2, stmmac, ipoib and
>> half a dozen of the wireless drivers all create files in debugfs. I did
>> not check exhaustively, but at least some of them include the device
>> name.
>
>Yep. It looks like imperfect habits are common.
>
>>>Last I looked any custom user space interface from network devices was
>>>rare and bonding using debugfs is the first instance of using debugfs
>>>from networking devices I have seen.
>>>
>>>I think the problem will be a little less severe for physical network
>>>devices as they all start in the initial network namespace and so start
>>>with distinct names.
>>>
>>>With bonding I can do "ip link add type bond" in any network namespace
>>>and get another bond0. So name conflicts are very much expeted with all
>>>virtual networking devices.
>>
>> Fair enough, although it is trivial to rename any network device
>> such that a conflict would occur.
>
>Actually for userspace and administrative reasons frequently it isn't
>trivial to rename devices.
Well, perhaps it's uncommon for users to do so, but "ip link set
dev eth0 name eth44" is pretty easy to do.
>>>But if you know of any other networking devices using debugsfs that
>>>code should probably get the same treatment as the bonding debugfs code.
>>
>> Is there no alternative than simply disabling debugfs whenever
>> network namespaces are enabled? The information bonding displays via
>> debugfs is useful, and having it unavailable on all distro kernels seems
>> a bit harsh.
>
>
>I took a good hard look at debugfs while writing this reply and debufs
>scares me. It is the kind of code that just about wants to me to throw
>in the towel seeing no hope of a good solid kernel.
>
>I can definitely open a /sys/kernel/debug/bonding/bond0/rlb_hash_table
>and delete the bond and then read the file. On a bad day that will oops
>the kernel, as there is nothing holding a reference to the network
>device. I think only the BOND_MODE_ALB check makes keeps the kernel
>from oopsing in my quick tests.
>
>The fact that debugfs is enabled in distro kernels is actually apalling
>to me. debugfs makes it easy to oops the kernel.
I'm not so sure things are that bad. I cannot unload the
bonding module while a program holds an open file descriptor on its
debugfs file (it appears to hold a reference to the module), so uses
that only remove the debugfs file on module unload shouldn't have a
problem.
The /proc file that bonding removes when an interface is
dynamically removed does not have this problem, as subsequent reads on
that file descriptor will fail. I suspect that's because
remove_proc_entry NULLs the proc_fops, whereas debugfs_remove does not
do the equivalent for its case. It may not be that simple, though; I'm
just looking at the code and have not tested anything.
>There are lots of alternatives to debugfs on where to put information
>and the bonding driver already uses most of them.
>
>> Why is the logic already in the driver not sufficient? If the
>> attempt to create the debugfs directory with the interface name fails,
>> then it merely prints a warning and continues without the debugfs for
>> that interface.
>
>All I know for certain is the existing logic will eventually cause
>someone doing something reasonable to send me a bug report.
>
>I can see where you are coming from in that the bonding driver debugfs
>code really was built to gracefully fail and ignore problems of instead
>of just hapharzardly and sloppily ignore problems. At the same time
>I can oops the kernel if I try with your debugfs in the bonding driver.
>
>But it causes the code to fail and issue a warning. So if I don't
>disable the code now, I expect I will get a bug report, and who
>knows how many sill files in bonding will have in debugfs by then.
>what silly things bonding may be doing in debugfs by then.
Or perhaps we can fix the debugfs support to function correctly
even in the face of network namespaces. For example, do namespaces have
a unique name or identifier than can go into the debugfs name?
-J
---
-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
^ permalink raw reply
* [PATCH] tc: man: change man page and comment to confirm to code's behavior.
From: Li Wei @ 2012-07-12 1:56 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: netdev
In-Reply-To: <20120711075129.4f81eea8@nehalam.linuxnetplumber.net>
Since the get_rate() code incorrectly interpreted bare number, the
behavior is not the same as man page and comment described.
We need to change the man page and comment for compatible with the
existing usage by scripts.
---
man/man8/tc.8 | 7 +++++--
tc/tc_util.c | 2 +-
2 files changed, 6 insertions(+), 3 deletions(-)
diff --git a/man/man8/tc.8 b/man/man8/tc.8
index 958ab98..f0e5613 100644
--- a/man/man8/tc.8
+++ b/man/man8/tc.8
@@ -259,6 +259,9 @@ All parameters accept a floating point number, possibly followed by a unit.
.P
Bandwidths or rates can be specified in:
.TP
+bps
+Bytes per second
+.TP
kbps
Kilobytes per second
.TP
@@ -271,8 +274,8 @@ Kilobits per second
mbit
Megabits per second
.TP
-bps or a bare number
-Bytes per second
+bit or a bare number
+Bits per second
.P
Amounts of data can be specified in:
.TP
diff --git a/tc/tc_util.c b/tc/tc_util.c
index 926ed08..ccf8fa4 100644
--- a/tc/tc_util.c
+++ b/tc/tc_util.c
@@ -153,7 +153,7 @@ int get_rate(unsigned *rate, const char *str)
return -1;
if (*p == '\0') {
- *rate = bps / 8.; /* assume bytes/sec */
+ *rate = bps / 8.; /* assume bits/sec */
return 0;
}
--
1.7.1
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox