Netdev List
* [PATCH net-2.6 5/7] bnx2x: perform statistics "action" before state transition.
From: Dmitry Kravkov @ 2011-02-28 13:37 UTC (permalink / raw)
  To: davem, netdev; +Cc: Eilon Greenstein, Vladislav Zolotarov


From: Vladislav Zolotarov <vladz@broadcom.com>
Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
---
 drivers/net/bnx2x/bnx2x_stats.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/bnx2x/bnx2x_stats.c b/drivers/net/bnx2x/bnx2x_stats.c
index bda60d5..3445ded 100644
--- a/drivers/net/bnx2x/bnx2x_stats.c
+++ b/drivers/net/bnx2x/bnx2x_stats.c
@@ -1239,14 +1239,14 @@ void bnx2x_stats_handle(struct bnx2x *bp, enum bnx2x_stats_event event)
 	if (unlikely(bp->panic))
 		return;
 
+	bnx2x_stats_stm[bp->stats_state][event].action(bp);
+
 	/* Protect a state change flow */
 	spin_lock_bh(&bp->stats_lock);
 	state = bp->stats_state;
 	bp->stats_state = bnx2x_stats_stm[state][event].next_state;
 	spin_unlock_bh(&bp->stats_lock);
 
-	bnx2x_stats_stm[state][event].action(bp);
-
 	if ((event != STATS_EVENT_UPDATE) || netif_msg_timer(bp))
 		DP(BNX2X_MSG_STATS, "state %d -> event %d -> state %d\n",
 		   state, event, bp->stats_state);
-- 
1.7.2.2






* [PATCH net-2.6 3/7] bnx2x: Fix ethtool -t link test for MF (non-pmf) devices.
From: Dmitry Kravkov @ 2011-02-28 13:37 UTC (permalink / raw)
  To: davem, netdev; +Cc: Eilon Greenstein


Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
---
 drivers/net/bnx2x/bnx2x_ethtool.c |   10 +++++-----
 1 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/net/bnx2x/bnx2x_ethtool.c b/drivers/net/bnx2x/bnx2x_ethtool.c
index 5b0fe7a..ef29199 100644
--- a/drivers/net/bnx2x/bnx2x_ethtool.c
+++ b/drivers/net/bnx2x/bnx2x_ethtool.c
@@ -1932,11 +1932,11 @@ static void bnx2x_self_test(struct net_device *dev,
 		buf[4] = 1;
 		etest->flags |= ETH_TEST_FL_FAILED;
 	}
-	if (bp->port.pmf)
-		if (bnx2x_link_test(bp, is_serdes) != 0) {
-			buf[5] = 1;
-			etest->flags |= ETH_TEST_FL_FAILED;
-		}
+
+	if (bnx2x_link_test(bp, is_serdes) != 0) {
+		buf[5] = 1;
+		etest->flags |= ETH_TEST_FL_FAILED;
+	}
 
 #ifdef BNX2X_EXTRA_DEBUG
 	bnx2x_panic_dump(bp);
-- 
1.7.2.2






* [PATCH net-2.6 4/7] bnx2x: properly configure coefficients for MinBW algorithm (NPAR mode).
From: Dmitry Kravkov @ 2011-02-28 13:37 UTC (permalink / raw)
  To: davem, netdev; +Cc: Eilon Greenstein



Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
---
 drivers/net/bnx2x/bnx2x.h      |   26 +++++++++++++++-----------
 drivers/net/bnx2x/bnx2x_main.c |    3 ++-
 2 files changed, 17 insertions(+), 12 deletions(-)

diff --git a/drivers/net/bnx2x/bnx2x.h b/drivers/net/bnx2x/bnx2x.h
index 653c624..368cfcd 100644
--- a/drivers/net/bnx2x/bnx2x.h
+++ b/drivers/net/bnx2x/bnx2x.h
@@ -1613,19 +1613,23 @@ static inline u32 reg_poll(struct bnx2x *bp, u32 reg, u32 expected, int ms,
 #define BNX2X_BTR			4
 #define MAX_SPQ_PENDING			8
 
-
-/* CMNG constants
-   derived from lab experiments, and not from system spec calculations !!! */
-#define DEF_MIN_RATE			100
+/* CMNG constants, as derived from system spec calculations */
+/* default MIN rate in case VNIC min rate is configured to zero - 100Mbps */
+#define DEF_MIN_RATE					100
 /* resolution of the rate shaping timer - 100 usec */
-#define RS_PERIODIC_TIMEOUT_USEC	100
-/* resolution of fairness algorithm in usecs -
-   coefficient for calculating the actual t fair */
-#define T_FAIR_COEF			10000000
+#define RS_PERIODIC_TIMEOUT_USEC			100
 /* number of bytes in single QM arbitration cycle -
-   coefficient for calculating the fairness timer */
-#define QM_ARB_BYTES			40000
-#define FAIR_MEM			2
+ * coefficient for calculating the fairness timer */
+#define QM_ARB_BYTES					160000
+/* resolution of Min algorithm 1:100 */
+#define MIN_RES						100
+/* how many bytes above threshold for the minimal credit of Min algorithm*/
+#define MIN_ABOVE_THRESH				32768
+/* Fairness algorithm integration time coefficient -
+ * for calculating the actual Tfair */
+#define T_FAIR_COEF	((MIN_ABOVE_THRESH +  QM_ARB_BYTES) * 8 * MIN_RES)
+/* Memory of fairness algorithm . 2 cycles */
+#define FAIR_MEM					2
 
 
 #define ATTN_NIG_FOR_FUNC		(1L << 8)
diff --git a/drivers/net/bnx2x/bnx2x_main.c b/drivers/net/bnx2x/bnx2x_main.c
index 203e9bf..032ae18 100644
--- a/drivers/net/bnx2x/bnx2x_main.c
+++ b/drivers/net/bnx2x/bnx2x_main.c
@@ -2015,7 +2015,8 @@ static void bnx2x_init_vn_minmax(struct bnx2x *bp, int vn)
 		m_fair_vn.vn_credit_delta =
 			max_t(u32, (vn_min_rate * (T_FAIR_COEF /
 						   (8 * bp->vn_weight_sum))),
-			      (bp->cmng.fair_vars.fair_threshold * 2));
+			      (bp->cmng.fair_vars.fair_threshold +
+							MIN_ABOVE_THRESH));
 		DP(NETIF_MSG_IFUP, "m_fair_vn.vn_credit_delta %d\n",
 		   m_fair_vn.vn_credit_delta);
 	}
-- 
1.7.2.2
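
As a worked check of the new constants, the sketch below evaluates T_FAIR_COEF and the vn_credit_delta expression from the hunks above in plain user-space C. The macro values are copied from the patch; the helper names t_fair_coef() and vn_credit_delta() and the sample arguments in the note below are illustrative assumptions, not part of the driver.

```c
#include <stdint.h>

/* Values copied from the patched bnx2x.h. */
#define QM_ARB_BYTES		160000
#define MIN_RES			100
#define MIN_ABOVE_THRESH	32768
#define T_FAIR_COEF	((MIN_ABOVE_THRESH + QM_ARB_BYTES) * 8 * MIN_RES)

/* Hypothetical helper: the coefficient as an unsigned 32-bit value. */
static uint32_t t_fair_coef(void)
{
	return (uint32_t)T_FAIR_COEF;
}

/*
 * Hypothetical user-space restatement of the max_t() expression in
 * bnx2x_init_vn_minmax(). vn_weight_sum must be non-zero. The product
 * can wrap around in 32 bits for large inputs, as any u32 arithmetic can.
 */
static uint32_t vn_credit_delta(uint32_t vn_min_rate, uint32_t vn_weight_sum,
				uint32_t fair_threshold)
{
	uint32_t by_rate = vn_min_rate * (t_fair_coef() / (8 * vn_weight_sum));
	uint32_t floor = fair_threshold + MIN_ABOVE_THRESH;

	return by_rate > floor ? by_rate : floor;
}
```

With these definitions T_FAIR_COEF evaluates to (32768 + 160000) * 8 * 100 = 154214400, which fits in a u32 (the old constant was 10000000). For the assumed inputs vn_min_rate = 100, vn_weight_sum = 400 and fair_threshold = 160000, vn_credit_delta() gives 100 * (154214400 / 3200) = 4819200.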






* [PATCH net-2.6 2/7] bnx2x: Fix nvram test for single port devices.
From: Dmitry Kravkov @ 2011-02-28 13:37 UTC (permalink / raw)
  To: davem, netdev; +Cc: Eilon Greenstein



Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
---
 drivers/net/bnx2x/bnx2x_ethtool.c |    2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/drivers/net/bnx2x/bnx2x_ethtool.c b/drivers/net/bnx2x/bnx2x_ethtool.c
index b3da295..5b0fe7a 100644
--- a/drivers/net/bnx2x/bnx2x_ethtool.c
+++ b/drivers/net/bnx2x/bnx2x_ethtool.c
@@ -1782,9 +1782,7 @@ static int bnx2x_test_nvram(struct bnx2x *bp)
 		{ 0x100, 0x350 }, /* manuf_info */
 		{ 0x450,  0xf0 }, /* feature_info */
 		{ 0x640,  0x64 }, /* upgrade_key_info */
-		{ 0x6a4,  0x64 },
 		{ 0x708,  0x70 }, /* manuf_key_info */
-		{ 0x778,  0x70 },
 		{     0,     0 }
 	};
 	__be32 buf[0x350 / 4];
-- 
1.7.2.2






* [PATCH net-2.6 1/7] bnx2x: (NPAR mode) Fix FW initialization
From: Dmitry Kravkov @ 2011-02-28 13:37 UTC (permalink / raw)
  To: davem, netdev; +Cc: Eilon Greenstein

 Fix FW initialization according to max BW stored in percents
 for NPAR mode. Protect HW from being configured to speed 0.


Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
---
 drivers/net/bnx2x/bnx2x_cmn.c     |   17 +++++++++--------
 drivers/net/bnx2x/bnx2x_cmn.h     |   20 ++++++++++++++++++++
 drivers/net/bnx2x/bnx2x_ethtool.c |   13 +++++++------
 drivers/net/bnx2x/bnx2x_main.c    |   15 ++++++++++++---
 4 files changed, 48 insertions(+), 17 deletions(-)

diff --git a/drivers/net/bnx2x/bnx2x_cmn.c b/drivers/net/bnx2x/bnx2x_cmn.c
index 710ce5d..a58baf3 100644
--- a/drivers/net/bnx2x/bnx2x_cmn.c
+++ b/drivers/net/bnx2x/bnx2x_cmn.c
@@ -703,19 +703,20 @@ u16 bnx2x_get_mf_speed(struct bnx2x *bp)
 {
 	u16 line_speed = bp->link_vars.line_speed;
 	if (IS_MF(bp)) {
-		u16 maxCfg = (bp->mf_config[BP_VN(bp)] &
-						FUNC_MF_CFG_MAX_BW_MASK) >>
-						FUNC_MF_CFG_MAX_BW_SHIFT;
-		/* Calculate the current MAX line speed limit for the DCC
-		 * capable devices
+		u16 maxCfg = bnx2x_extract_max_cfg(bp,
+						   bp->mf_config[BP_VN(bp)]);
+
+		/* Calculate the current MAX line speed limit for the MF
+		 * devices
 		 */
-		if (IS_MF_SD(bp)) {
+		if (IS_MF_SI(bp))
+			line_speed = (line_speed * maxCfg) / 100;
+		else { /* SD mode */
 			u16 vn_max_rate = maxCfg * 100;
 
 			if (vn_max_rate < line_speed)
 				line_speed = vn_max_rate;
-		} else /* IS_MF_SI(bp)) */
-			line_speed = (line_speed * maxCfg) / 100;
+		}
 	}
 
 	return line_speed;
diff --git a/drivers/net/bnx2x/bnx2x_cmn.h b/drivers/net/bnx2x/bnx2x_cmn.h
index 03eb4d6..326ba44 100644
--- a/drivers/net/bnx2x/bnx2x_cmn.h
+++ b/drivers/net/bnx2x/bnx2x_cmn.h
@@ -1044,4 +1044,24 @@ static inline void storm_memset_cmng(struct bnx2x *bp,
 void bnx2x_acquire_phy_lock(struct bnx2x *bp);
 void bnx2x_release_phy_lock(struct bnx2x *bp);
 
+/**
+ * Extracts MAX BW part from MF configuration.
+ *
+ * @param bp
+ * @param mf_cfg
+ *
+ * @return u16
+ */
+static inline u16 bnx2x_extract_max_cfg(struct bnx2x *bp, u32 mf_cfg)
+{
+	u16 max_cfg = (mf_cfg & FUNC_MF_CFG_MAX_BW_MASK) >>
+			      FUNC_MF_CFG_MAX_BW_SHIFT;
+	if (!max_cfg) {
+		BNX2X_ERR("Illegal configuration detected for Max BW - "
+			  "using 100 instead\n");
+		max_cfg = 100;
+	}
+	return max_cfg;
+}
+
 #endif /* BNX2X_CMN_H */
diff --git a/drivers/net/bnx2x/bnx2x_ethtool.c b/drivers/net/bnx2x/bnx2x_ethtool.c
index 5b44a8b..b3da295 100644
--- a/drivers/net/bnx2x/bnx2x_ethtool.c
+++ b/drivers/net/bnx2x/bnx2x_ethtool.c
@@ -238,7 +238,7 @@ static int bnx2x_set_settings(struct net_device *dev, struct ethtool_cmd *cmd)
 	speed |= (cmd->speed_hi << 16);
 
 	if (IS_MF_SI(bp)) {
-		u32 param = 0;
+		u32 param = 0, part;
 		u32 line_speed = bp->link_vars.line_speed;
 
 		/* use 10G if no link detected */
@@ -251,9 +251,11 @@ static int bnx2x_set_settings(struct net_device *dev, struct ethtool_cmd *cmd)
 				       REQ_BC_VER_4_SET_MF_BW);
 			return -EINVAL;
 		}
-		if (line_speed < speed) {
-			BNX2X_DEV_INFO("New speed should be less or equal "
-				       "to actual line speed\n");
+		part = (speed * 100) / line_speed;
+		if (line_speed < speed || !part) {
+			BNX2X_DEV_INFO("Speed setting should be in a range "
+				       "from 1%% to 100%% "
+				       "of actual line speed\n");
 			return -EINVAL;
 		}
 		/* load old values */
@@ -263,8 +265,7 @@ static int bnx2x_set_settings(struct net_device *dev, struct ethtool_cmd *cmd)
 		param &= FUNC_MF_CFG_MIN_BW_MASK;
 
 		/* set new MAX value */
-		param |= (((speed * 100) / line_speed)
-				 << FUNC_MF_CFG_MAX_BW_SHIFT)
+		param |= (part << FUNC_MF_CFG_MAX_BW_SHIFT)
 				  & FUNC_MF_CFG_MAX_BW_MASK;
 
 		bnx2x_fw_command(bp, DRV_MSG_CODE_SET_MF_BW, param);
diff --git a/drivers/net/bnx2x/bnx2x_main.c b/drivers/net/bnx2x/bnx2x_main.c
index d584d32..203e9bf 100644
--- a/drivers/net/bnx2x/bnx2x_main.c
+++ b/drivers/net/bnx2x/bnx2x_main.c
@@ -1974,13 +1974,22 @@ static void bnx2x_init_vn_minmax(struct bnx2x *bp, int vn)
 		vn_max_rate = 0;
 
 	} else {
+		u32 maxCfg = bnx2x_extract_max_cfg(bp, vn_cfg);
+
 		vn_min_rate = ((vn_cfg & FUNC_MF_CFG_MIN_BW_MASK) >>
 				FUNC_MF_CFG_MIN_BW_SHIFT) * 100;
-		/* If min rate is zero - set it to 1 */
+		/* If fairness is enabled (not all min rates are zeroes) and
+		   if current min rate is zero - set it to 1.
+		   This is a requirement of the algorithm. */
 		if (bp->vn_weight_sum && (vn_min_rate == 0))
 			vn_min_rate = DEF_MIN_RATE;
-		vn_max_rate = ((vn_cfg & FUNC_MF_CFG_MAX_BW_MASK) >>
-				FUNC_MF_CFG_MAX_BW_SHIFT) * 100;
+
+		if (IS_MF_SI(bp))
+			/* maxCfg in percents of linkspeed */
+			vn_max_rate = (bp->link_vars.line_speed * maxCfg) / 100;
+		else
+			/* maxCfg is absolute in 100Mb units */
+			vn_max_rate = maxCfg * 100;
 	}
 
 	DP(NETIF_MSG_IFUP,
-- 
1.7.2.2






* Re: [PATCH] iproute2: allow to specify truncation bits on auth algo
From: Nicolas Dichtel @ 2011-02-28 13:46 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: David Miller, herbert, netdev, christophe.gouault
In-Reply-To: <4D49879B.4060108@6wind.com>

Hi,

What is the status of this patch? It has been set to 'Superseded' in the
patchwork tool (http://patchwork.ozlabs.org/patch/81486/).
The kernel headers have been updated in iproute2; should I resend the patch?


Regards,
Nicolas

On 02/02/2011 17:34, Nicolas Dichtel wrote:
> On 02/02/2011 17:30, Nicolas Dichtel wrote:
>> On 28/01/2011 20:46, David Miller wrote:
>>> From: Nicolas Dichtel<nicolas.dichtel@6wind.com>
>>> Date: Fri, 28 Jan 2011 09:51:40 +0100
>>>
>>>> On 28/01/2011 05:51, Herbert Xu wrote:
>>>>> So perhaps an SA configuration flag is needed?
>>>> I agree. If David is ok, I will update the patch.
>>>
>>> Sounds good to me.
>> And the patch for iproute2.
> Sorry, two patches were mixed :(
>
> Here is the right one.
>
>
> Regards,
> Nicolas


* [PATCH] macb: don't use platform_set_drvdata() on a net_device
From: Jamie Iles @ 2011-02-28 14:05 UTC (permalink / raw)
  To: netdev; +Cc: Jamie Iles, Nicolas Ferre

Commit 71d6429 (Driver core: convert platform_{get,set}_drvdata to
static inline functions) now triggers a warning in the macb network
driver:

  CC      drivers/net/macb.o
drivers/net/macb.c: In function ‘macb_mii_init’:
drivers/net/macb.c:263: warning: passing argument 1 of ‘platform_set_drvdata’ from incompatible pointer type
include/linux/platform_device.h:138: note: expected ‘struct platform_device *’ but argument is of type ‘struct net_device *’

Use dev_set_drvdata() on the device embedded in the net_device instead.

Cc: Nicolas Ferre <nicolas.ferre@atmel.com>
Signed-off-by: Jamie Iles <jamie@jamieiles.com>
---
 drivers/net/macb.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/macb.c b/drivers/net/macb.c
index f69e73e..79ccb54 100644
--- a/drivers/net/macb.c
+++ b/drivers/net/macb.c
@@ -260,7 +260,7 @@ static int macb_mii_init(struct macb *bp)
 	for (i = 0; i < PHY_MAX_ADDR; i++)
 		bp->mii_bus->irq[i] = PHY_POLL;
 
-	platform_set_drvdata(bp->dev, bp->mii_bus);
+	dev_set_drvdata(&bp->dev->dev, bp->mii_bus);
 
 	if (mdiobus_register(bp->mii_bus))
 		goto err_out_free_mdio_irq;
-- 
1.7.4



* Re: SO_REUSEPORT - can it be done in kernel?
From: Thomas Graf @ 2011-02-28 14:13 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta,
	netdev
In-Reply-To: <20110228113659.GA20726@gondor.apana.org.au>

On Mon, Feb 28, 2011 at 07:36:59PM +0800, Herbert Xu wrote:
> But please do test them heavily, especially if you have an AMD
> NUMA machine as that's where scalability problems really show
> up.  Intel tends to be a lot more forgiving.  My last AMD machine
> blew up years ago :)

This is just a preliminary test result and not 100% reliable,
because halfway through the testing the machine reported memory
issues and disabled a DIMM before booting the tested kernels.

Nevertheless, bind 9.7.3:

2.6.38-rc5+: 62kqps
2.6.38-rc5+ w/ Herbert's patch: 442kqps

This is on a 2-node NUMA Intel Xeon X5560 @ 2.80GHz machine with 16 cores.

Again, this number is not 100% reliable but clearly shows that
the concept of the patch is working very well.

Will test Herbert's patch on the machine that did 650kqps with
SO_REUSEPORT and also on some AMD machines.


* Re: SO_REUSEPORT - can it be done in kernel?
From: Herbert Xu @ 2011-02-28 14:13 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta,
	netdev
In-Reply-To: <1298899971.2941.281.camel@edumazet-laptop>

On Mon, Feb 28, 2011 at 02:32:51PM +0100, Eric Dumazet wrote:
>
> Now, input path can run without finding socket locked by xmit path, so
> skb are queued into receive queue, not backlog one.

Indeed, I think this is what Dave alluded to earlier.  This will
eventually have to be dealt with but for now the data rate is low
enough that it isn't killing us.

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* Re: SO_REUSEPORT - can it be done in kernel?
From: Eric Dumazet @ 2011-02-28 14:22 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta,
	netdev
In-Reply-To: <20110228141327.GA22851@gondor.apana.org.au>

On Monday, 28 February 2011 at 22:13 +0800, Herbert Xu wrote:
> On Mon, Feb 28, 2011 at 02:32:51PM +0100, Eric Dumazet wrote:
> >
> > Now, input path can run without finding socket locked by xmit path, so
> > skb are queued into receive queue, not backlog one.
> 
> Indeed, I think this is what Dave alluded to earlier.  This will
> eventually have to be dealt with but for now the data rate is low
> enough that it isn't killing us.

Not sure how you read this ;)

I said that before your patches, a sender was spending a lot of time
transferring frames from the backlog to the receive queue right before
releasing the socket lock.

Now, the receive path doesn't slow down the senders, and vice versa.

:)




* Re: SO_REUSEPORT - can it be done in kernel?
From: Herbert Xu @ 2011-02-28 14:25 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta,
	netdev
In-Reply-To: <1298902926.2941.349.camel@edumazet-laptop>

On Mon, Feb 28, 2011 at 03:22:06PM +0100, Eric Dumazet wrote:
>
> Not sure how you read this ;)
> 
> I said that before your patches, a sender was spending a lot of time
> transferring frames from the backlog to the receive queue right before
> releasing the socket lock.
> 
> Now, the receive path doesn't slow down the senders, and vice versa.
> 
> :)

I understood what you wrote :)

I was just referring to an earlier message where Dave talked about
the UDP accounting patch forcing us to take the lock on every
packet.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* [PATCH] net: fix multithreaded signal handling in unix recv routines
From: Rainer Weikusat @ 2011-02-28 14:50 UTC (permalink / raw)
  To: netdev; +Cc: davem, linux-kernel

From: Rainer Weikusat <rweikusat@mobileactivedefense.com>

The unix_dgram_recvmsg and unix_stream_recvmsg routines in
net/unix/af_unix.c utilize mutex_lock(&u->readlock) calls in order to
serialize read operations of multiple threads on a single socket. This
implies that, if all n threads of a process block in an AF_UNIX recv
call trying to read data from the same socket, one of these threads
will be sleeping in state TASK_INTERRUPTIBLE and all others in state
TASK_UNINTERRUPTIBLE. Provided that a particular signal is supposed to
be handled by a signal handler defined by the process and that none of
these threads is blocking the signal, the complete_signal routine in
kernel/signal.c will select the 'first' such thread it happens to
encounter when deciding which thread to notify that a signal is
supposed to be handled. If this is one of the TASK_UNINTERRUPTIBLE
threads, the signal won't be handled until the one thread not blocking
on the u->readlock mutex is woken up because some data to process has
arrived (if this ever happens). The included patch fixes this by
changing mutex_lock to mutex_lock_interruptible and handling possible
error returns in the same way interruptions are handled by the actual
receive code.

Signed-off-by: Rainer Weikusat <rweikusat@mobileactivedefense.com>

---
diff -urp net-2.6/net/unix/af_unix.c net-2.6-patched//net/unix/af_unix.c
--- net-2.6/net/unix/af_unix.c	2011-02-16 22:19:43.338358559 +0000
+++ net-2.6-patched//net/unix/af_unix.c	2011-02-16 22:38:39.483543598 +0000
@@ -1724,7 +1724,11 @@ static int unix_dgram_recvmsg(struct kio
 
 	msg->msg_namelen = 0;
 
-	mutex_lock(&u->readlock);
+	err = mutex_lock_interruptible(&u->readlock);
+	if (err) {
+		err = sock_intr_errno(sock_rcvtimeo(sk, noblock));
+		goto out;
+	}
 
 	skb = skb_recv_datagram(sk, flags, noblock, &err);
 	if (!skb) {
@@ -1864,7 +1868,11 @@ static int unix_stream_recvmsg(struct ki
 		memset(&tmp_scm, 0, sizeof(tmp_scm));
 	}
 
-	mutex_lock(&u->readlock);
+	err = mutex_lock_interruptible(&u->readlock);
+	if (err) {
+		err = sock_intr_errno(timeo);
+		goto out;
+	}
 
 	do {
 		int chunk;
@@ -1895,11 +1903,12 @@ static int unix_stream_recvmsg(struct ki
 
 			timeo = unix_stream_data_wait(sk, timeo);
 
-			if (signal_pending(current)) {
+			if (signal_pending(current)
+			    ||  mutex_lock_interruptible(&u->readlock)) {
 				err = sock_intr_errno(timeo);
 				goto out;
 			}
-			mutex_lock(&u->readlock);
+
 			continue;
  unlock:
 			unix_state_unlock(sk);


* Re: SO_REUSEPORT - can it be done in kernel?
From: Eric Dumazet @ 2011-02-28 14:53 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta,
	netdev
In-Reply-To: <1298899971.2941.281.camel@edumazet-laptop>

On Monday, 28 February 2011 at 14:32 +0100, Eric Dumazet wrote:
> On Monday, 28 February 2011 at 19:36 +0800, Herbert Xu wrote:
> > On Sun, Feb 27, 2011 at 07:06:14PM +0800, Herbert Xu wrote:
> > > I'm working on this right now.
> > 
> > OK I think I was definitely on the right track.  With the send
> > patch made lockless I now get numbers which are even better than
> > those obtained with running named with multiple sockets.  That's
> > right, a single socket is now faster than what multiple sockets
> > were without the patch (of course, multiple sockets may still be
> > faster with the patch vs. a single socket for obvious reasons,
> > but I couldn't measure any significant difference).
> > 
> > Also worthy of note is that prior to the patch all CPUs showed
> > idleness (lazy bastards!), with the patch they're all maxed out.
> > 
> > In retrospect, the idleness was simply the result of the socket
> > lock scheduling away and was an indication of lock contention.
> > 
> 
> Now, input path can run without finding socket locked by xmit path, so
> skb are queued into receive queue, not backlog one.
> 
> > Here are the patches I used.  Please don't apply them yet as I intend
> > to clean them up quite a bit.
> > 
> > But please do test them heavily, especially if you have an AMD
> > NUMA machine as that's where scalability problems really show
> > up.  Intel tends to be a lot more forgiving.  My last AMD machine
> > blew up years ago :)
> 
> I am going to test them, thanks !
> 

First "sending only" tests on my 2x4x2 machine (two E5540@2.53GHz, quad
core, hyper threaded, NUMA kernel)

16 threads, each one sending 100,000 UDP frames using a _shared_ socket

I use the same destination IP, so I suffer a bit of dst refcount
contention.

(to dummy0 device to avoid contention on qdisc and device)
# ip ro get 10.2.2.21
10.2.2.21 dev dummy0  src 10.2.2.2 
    cache 

LOCKDEP enabled kernel

Before :

time ./udpflood -f -t 16 -l 100000 10.2.2.21

real	0m42.749s
user	0m1.010s
sys	1m38.039s

After :

time ./udpflood -f -t 16 -l 100000 10.2.2.21

real	0m1.167s
user	0m0.488s
sys	0m17.373s


With one thread only and 16*100000 frames :
# time ./udpflood -f -l 1600000 10.2.2.21

real	0m9.318s
user	0m0.238s
sys	0m9.052s

(We have some false sharing on atomic fields in struct file and socket,
but nothing to worry about.)

With LOCKDEP OFF :

16 threads :

# time ./udpflood -f -t 16 -l 100000 10.2.2.21

real	0m0.718s
user	0m0.376s
sys	0m10.963s

1 thread :

# time ./udpflood -f -l 1600000 10.2.2.21

real	0m1.514s
user	0m0.153s
sys	0m1.357s


"perf record/report" results for the 16 threads case (no lockdep)

# Events: 389K cpu-clock-msecs
#
# Overhead      Command        Shared Object                               Symbol
# ........  ...........  ...................  ...................................
#
     9.03%     udpflood  [kernel.kallsyms]    [k] sock_wfree
     8.58%     udpflood  [kernel.kallsyms]    [k] __ip_route_output_key
     8.52%     udpflood  [kernel.kallsyms]    [k] sock_alloc_send_pskb
     7.46%     udpflood  [kernel.kallsyms]    [k] sock_def_write_space
     6.76%     udpflood  [kernel.kallsyms]    [k] __xfrm_lookup
     6.18%      swapper  [kernel.kallsyms]    [k] acpi_idle_enter_bm
     5.66%     udpflood  [kernel.kallsyms]    [k] dst_release
     4.96%     udpflood  [kernel.kallsyms]    [k] udp_sendmsg
     3.48%     udpflood  [kernel.kallsyms]    [k] fget_light
     2.75%     udpflood  [kernel.kallsyms]    [k] sock_tx_timestamp
     2.40%     udpflood  [kernel.kallsyms]    [k] __ip_make_skb
     2.36%     udpflood  [kernel.kallsyms]    [k] fput
     1.87%      swapper  [kernel.kallsyms]    [k] _raw_spin_unlock_irqrestore
     1.81%     udpflood  [kernel.kallsyms]    [k] inet_sendmsg
     1.53%     udpflood  [kernel.kallsyms]    [k] sys_sendto
     1.50%     udpflood  [kernel.kallsyms]    [k] ip_finish_output
     1.31%     udpflood  [kernel.kallsyms]    [k] csum_partial_copy_generic
     1.30%     udpflood  udpflood             [.] do_thread
     1.28%     udpflood  [kernel.kallsyms]    [k] __ip_append_data
     1.08%     udpflood  [kernel.kallsyms]    [k] __memset
     1.05%     udpflood  [kernel.kallsyms]    [k] ip_route_output_flow
     0.91%     udpflood  [kernel.kallsyms]    [k] kfree
     0.88%     udpflood  [vdso]               [.] 0xffffe430
     0.83%     udpflood  [kernel.kallsyms]    [k] copy_user_generic_string
     0.78%     udpflood  libc-2.3.4.so        [.] __GI_memcpy
     0.77%     udpflood  [kernel.kallsyms]    [k] ia32_sysenter_target


What do you suggest to perform a bind based test ?





* Re: SO_REUSEPORT - can it be done in kernel?
From: Thomas Graf @ 2011-02-28 15:01 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Herbert Xu, David Miller, rick.jones2, therbert, wsommerfeld,
	daniel.baluta, netdev
In-Reply-To: <1298904783.2941.412.camel@edumazet-laptop>

On Mon, Feb 28, 2011 at 03:53:03PM +0100, Eric Dumazet wrote:
> What do you suggest to perform a bind based test ?

We use queryperf from BIND sources. I typically run 1 queryperf
instance per core on multiple machines.


* Re: fix multithreaded signal handling in unix recv routines/ background
From: Rainer Weikusat @ 2011-02-28 15:07 UTC (permalink / raw)
  To: netdev; +Cc: davem, linux-kernel
In-Reply-To: <877hck43hs.fsf@sapphire.mobileactivedefense.com>

Rainer Weikusat <rw@sapphire.mobileactivedefense.com> writes:
> The unix_dgram_recvmsg and unix_stream_recvmsg routines in
> net/unix/af_unix.c utilize mutex_lock(&u->readlock) calls

This is IMHO a more sensible place for additional information.

I noticed this because the intended termination processing sequence of
some program which is used as part of a 'content-filtering solution'
for mobile devices (aka iPhones, iPads etc) stopped working the first
time I tested the program in its intended 'actual execution
context'. The program is supposed to listen for 'URL classification
requests' on an AF_UNIX SOCK_SEQPACKET socket, pass these to a
third-party library which does the actual classification job and then
send a reply containing a list of categories associated with the URL
in question. It uses multiple threads which basically work as follows:

	1. initialize the library
	2. unblock termination signals
	3. block in read awaiting requests
	4. block termination signals
	5. process request and send reply
	6. goto 2

Upon termination, each thread needs to execute the library
finalization routine before exiting. This is supposed to work with the
help of a signal handler for 'termination signals' calling siglongjmp
to get the particular thread executing it out of the processing
loop. Afterwards, this thread (with termination signals again blocked)
does the finalization call, executes a kill(getpid(), SIGTERM) and
exits via pthread_exit. The SIGTERM should then be picked up by
another thread of the process which will then perform the same
shutdown sequence and signal the next thread, until all threads of the
process have terminated properly. An example program whose structure
is basically identical to that of the actual application which
demonstrates the problem is available here:

	http://mss-uk.mssgmbh.com/~rw/signal/signal-problem-app.c
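
A minimal user-space sketch of the per-thread loop described above, assuming POSIX threads and a plain file descriptor in place of the AF_UNIX SOCK_SEQPACKET socket. It is not the program at the URL above: the names worker_ctx, set_term_signals(), install_term_handler() and worker() are hypothetical, and the library initialization, classification and finalization calls are reduced to comments.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <setjmp.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Per-thread jump target (__thread is a GCC/Clang extension). */
static __thread sigjmp_buf term_jmp;

struct worker_ctx {
	int fd;		/* descriptor the requests are read from */
	int handled;	/* number of requests processed so far */
};

/* Block or unblock SIGTERM for the calling thread (SIG_BLOCK/SIG_UNBLOCK). */
static int set_term_signals(int how)
{
	sigset_t set;

	sigemptyset(&set);
	sigaddset(&set, SIGTERM);
	return pthread_sigmask(how, &set, NULL);
}

/* Handler: leave the request loop of the thread that got the signal. */
static void on_term(int sig)
{
	(void)sig;
	siglongjmp(term_jmp, 1);
}

static int install_term_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_term;
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGTERM, &sa, NULL);
}

static void *worker(void *arg)
{
	struct worker_ctx *ctx = arg;
	char req[512];
	int by_signal = 0;

	set_term_signals(SIG_BLOCK);
	/* 1. initialize the library (omitted) */

	if (sigsetjmp(term_jmp, 1) != 0) {
		by_signal = 1;	/* reached via siglongjmp(), SIGTERM blocked */
	} else {
		for (;;) {
			ssize_t n;

			set_term_signals(SIG_UNBLOCK);		/* 2. */
			n = read(ctx->fd, req, sizeof(req));	/* 3. */
			set_term_signals(SIG_BLOCK);		/* 4. */
			if (n <= 0)
				break;		/* EOF or error ends the loop */
			ctx->handled++;		/* 5. process and reply (stub) */
		}
	}

	/* library finalization (omitted) */
	if (by_signal)
		kill(getpid(), SIGTERM);	/* pass termination to the next thread */
	return NULL;
}
```

The sketch assumes that the main thread blocks SIGTERM before creating the workers, so that the handler only ever runs in a thread that has already called sigsetjmp(). Because sigsetjmp() is called with SIGTERM blocked and the mask saved, siglongjmp() returns with the signal blocked again, matching the description above.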

I've since given this some more thought and came to the conclusion
that this should also affect independent processes blocking on the same
AF_UNIX socket, even in the absence of any user-defined signal
handling. Another example program demonstrating this phenomenon can be
downloaded from

	http://mss-uk.mssgmbh.com/~rw/signal/signal-problem-fork-simple.c

This basically creates an 'unkillable' process, meaning, one which is
even immune to a SIGKILL.

I've also tested that the issue still occurs with 2.6.38-rc5 and that
it is fixed by the proposed patch. The program itself has meanwhile
been moved to the computers which are actually used by the customers
of my employer. This move included patching all the kernels running on
these machines in the way I suggested.


* Re: Bug in kvm_set_irq
From: Jean-Philippe Menil @ 2011-02-28 15:13 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: kvm, netdev, virtualization
In-Reply-To: <20110228113939.GH28006@redhat.com>

On 28/02/2011 12:39, Michael S. Tsirkin wrote:
> On Mon, Feb 28, 2011 at 11:40:43AM +0100, Jean-Philippe Menil wrote:
>> On 28/02/2011 11:11, Michael S. Tsirkin wrote:
>>> On Mon, Feb 28, 2011 at 09:56:46AM +0100, Jean-Philippe Menil wrote:
>>>> On 27/02/2011 18:00, Michael S. Tsirkin wrote:
>>>>> On Fri, Feb 25, 2011 at 10:07:22AM +0100, Jean-Philippe Menil wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Each time I try to use vhost_net, I'm facing a kernel bug.
>>>>>> I do a "modprobe vhost_net", and start the guest with vhost=on.
>>>>>>
>>>>>> Following is a trace with a kernel 2.6.37, but I had the same
>>>>>> problem with 2.6.36 (cf https://lkml.org/lkml/2010/11/30/29).
>>>>> 2.6.36 had a theoretical race that could explain this,
>>>>> but it should be ok in 2.6.37.
>>>>>
>>>>>> The bug only occurs with vhost_net loaded, so I don't know if this
>>>>>> is a bug in kvm module code or in the vhost_net code.
>>>>> It could be a bug in eventfd which is the interface
>>>>> used by both kvm and vhost_net.
>>>>> Just for fun, you can try 2.6.38 - eventfd code has been changed
>>>>> a lot in 2.6.38 and if it does not trigger there
>>>>> it's a hint that irqfd is the reason.
>>>>>
>>>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>>>> 685.243100] BUG: unable to handle kernel paging request at
>>>>>> 0000000000002458
>>>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>>>> 685.243250] IP: [<ffffffffa041aa8a>] kvm_set_irq+0x2a/0x130 [kvm]
>>>>> Could you run markup_oops/ ksymoops on this please?
>>>>> As far as I can see kvm_set_irq can only get a wrong
>>>>> kvm pointer. Unless there's some general memory corruption,
>>>>> I'd guess
>>>>>
>>>>> You can also try comparing the irqfd->kvm pointer in
>>>>> kvm_irqfd_assign, irqfd_wakeup and kvm_set_irq in
>>>>> virt/kvm/eventfd.c.
>>>>>
>>>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>>>> 685.243378] PGD 45d363067 PUD 45e77a067 PMD 0
>>>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>>>> 685.243556] Oops: 0000 [#1] SMP
>>>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>>>> 685.243692] last sysfs file:
>>>>>> /sys/devices/pci0000:00/0000:00:0d.0/0000:05:00.0/0000:06:00.0/irq
>>>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [  685.243777] CPU 0
>>>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>>>> 685.243820] Modules linked in: vhost_net macvtap macvlan tun
>>>>>> powernow_k8 mperf cpufreq_userspace cpufreq_stats cpufreq_powersave
>>>>>> cpufreq_ondemand freq_table cpufreq_conservative fuse xt_physdev
>>>>>> ip6t_LOG ip6table_filter ip6_tables ipt_LOG xt_multiport xt_limit
>>>>>> xt_tcpudp xt_state iptable_filter ip_tables x_tables
>>>>>> nf_conntrack_tftp nf_conntrack_ftp nf_conntrack_ipv4 nf_defrag_ipv4
>>>>>> 8021q bridge stp ext2 mbcache dm_round_robin dm_multipath
>>>>>> nf_conntrack_ipv6 nf_conntrack nf_defrag_ipv6 kvm_amd kvm ipv6
>>>>>> snd_pcm snd_timer snd soundcore snd_page_alloc tpm_tis tpm psmouse
>>>>>> dcdbas tpm_bios processor i2c_nforce2 shpchp pcspkr ghes serio_raw
>>>>>> joydev evdev pci_hotplug i2c_core hed button thermal_sys xfs
>>>>>> exportfs dm_mod sg sr_mod cdrom usbhid hid usb_storage ses sd_mod
>>>>>> enclosure megaraid_sas ohci_hcd lpfc scsi_transport_fc scsi_tgt bnx2
>>>>>> scsi_mod ehci_hcd [last unloaded: scsi_wait_scan]
>>>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [  685.246123]
>>>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>>>> 685.246123] Pid: 10, comm: kworker/0:1 Not tainted
>>>>>> 2.6.37-dsiun-110105 #17 0K543T/PowerEdge M605
>>>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>>>> 685.246123] RIP: 0010:[<ffffffffa041aa8a>]  [<ffffffffa041aa8a>]
>>>>>> kvm_set_irq+0x2a/0x130 [kvm]
>>>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>>>> 685.246123] RSP: 0018:ffff88045fc89d30  EFLAGS: 00010246
>>>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>>>> 685.246123] RAX: 0000000000000000 RBX: 000000000000001a RCX:
>>>>>> 0000000000000001
>>>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>>>> 685.246123] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
>>>>>> 0000000000000000
>>>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>>>> 685.246123] RBP: 0000000000000000 R08: 0000000000000001 R09:
>>>>>> ffff880856a91e48
>>>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>>>> 685.246123] R10: 0000000000000000 R11: 00000000ffffffff R12:
>>>>>> 0000000000000000
>>>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>>>> 685.246123] R13: 0000000000000001 R14: 0000000000000000 R15:
>>>>>> 0000000000000000
>>>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>>>> 685.246123] FS:  00007f617986c710(0000) GS:ffff88007f800000(0000)
>>>>>> knlGS:0000000000000000
>>>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>>>> 685.246123] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>>>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>>>> 685.246123] CR2: 0000000000002458 CR3: 000000045d197000 CR4:
>>>>>> 00000000000006f0
>>>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>>>> 685.246123] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>>>>>> 0000000000000000
>>>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>>>> 685.246123] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
>>>>>> 0000000000000400
>>>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>>>> 685.246123] Process kworker/0:1 (pid: 10, threadinfo
>>>>>> ffff88045fc88000, task ffff88085fc53c30)
>>>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [  685.246123] Stack:
>>>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>>>> 685.246123]  ffff88045fc89fd8 00000000000119c0 ffff88045fc88010
>>>>>> ffff88085fc53ee8
>>>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>>>> 685.246123]  ffff88045fc89fd8 ffff88085fc53ee0 ffff88085fc53c30
>>>>>> 00000000000119c0
>>>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>>>> 685.246123]  00000000000119c0 ffffffff8137f7ce ffff88007f80df40
>>>>>> 00000000ffffffff
>>>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>>>> 685.246123] Call Trace:
>>>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>>>> 685.246123]  [<ffffffff8137f7ce>] ? common_interrupt+0xe/0x13
>>>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>>>> 685.246123]  [<ffffffffa041bc30>] ? irqfd_inject+0x0/0x50 [kvm]
>>>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>>>> 685.246123]  [<ffffffffa041bc57>] ? irqfd_inject+0x27/0x50 [kvm]
>>>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>>>> 685.246123]  [<ffffffffa041bc30>] ? irqfd_inject+0x0/0x50 [kvm]
>>>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>>>> 685.246123]  [<ffffffff8106b6f2>] ? process_one_work+0x112/0x460
>>>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>>>> 685.246123]  [<ffffffff8106be25>] ? worker_thread+0x145/0x410
>>>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>>>> 685.246123]  [<ffffffff8103a3d0>] ? __wake_up_common+0x50/0x80
>>>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>>>> 685.246123]  [<ffffffff8106bce0>] ? worker_thread+0x0/0x410
>>>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>>>> 685.246123]  [<ffffffff8106bce0>] ? worker_thread+0x0/0x410
>>>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>>>> 685.246123]  [<ffffffff8106f786>] ? kthread+0x96/0xa0
>>>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>>>> 685.246123]  [<ffffffff81003ce4>] ? kernel_thread_helper+0x4/0x10
>>>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>>>> 685.246123]  [<ffffffff8106f6f0>] ? kthread+0x0/0xa0
>>>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>>>> 685.246123]  [<ffffffff81003ce0>] ? kernel_thread_helper+0x0/0x10
>>>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>>>> 685.246123] Code: ff 41 57 41 89 f7 41 56 41 55 41 89 cd 41 54 49 89
>>>>>> fc 55 53 89 d3 48 81 ec 98 00 00 00 8b 15 c6 79 03 00 85 d2 0f 85 c4
>>>>>> 00 00 00<4
>>>>>> 9>    8b 84 24 58 24 00 00 3b 98 28 01 00 00 73 5e 89 db 48 8b 84
>>>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>>>> 685.246123] RIP  [<ffffffffa041aa8a>] kvm_set_irq+0x2a/0x130 [kvm]
>>>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>>>> 685.246123]  RSP<ffff88045fc89d30>
>>>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>>>> 685.246123] CR2: 0000000000002458
>>>>>>
>>>>>>
>>>>>> If someone can help me, on how to solve this.
>>>>>>
>>>>>> Regards.
>>>>>> _______________________________________________
>>>>>> Virtualization mailing list
>>>>>> Virtualization@lists.linux-foundation.org
>>>>>> https://lists.linux-foundation.org/mailman/listinfo/virtualization
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>>>>> the body of a message to majordomo@vger.kernel.org
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>> Hi,
>>>>
>>>> thanks for your response.
>>>>
>>>> This is what markup_oops.pl returns:
>>>> "No matching code found"
>>> Well, let's try to understand what's there.
>>>
>>> Do objdump -ldS kvm.ko
>>> look for <kvm_set_irq>
>>>
>>> and paste the content from start of that function
>>> to offset 0x2a and a bit beyond.
>>>
>>> You can also upload your kvm.ko somewhere, I'll try to take a look.
>>>
>>>
>>>> So either this is not a vhost_net bug, or my oops is incomplete and
>>>> markup_oops can't find the right vma offset.
>>>>
>>>> I will try to compare the pointers you pointed out, even though it could be
>>>> a little difficult for me.
>>> Hmm you know how to add printk to code and rebuild, right?
>>>
>>>> Maybe I will try 2.6.38; I will wait for a response from the kvm team.
>>>>
>>>> Regards.
>>>>
>>>> -- 
>>>> Jean-Philippe Menil - Pôle réseau Service IRTS
>>>> DSI Université de Nantes
>>>> jean-philippe.menil@univ-nantes.fr
>>>> Tel : 02.53.48.49.27 - Fax : 02.53.48.49.09
>> So, here is the result for the objdump against the kvm.ko (the
>> kvm_set_irq part) :
> Can you try building with -g and adding -l and -S to objdump
> please? I'd rather make the tool do the legwork than
> do it manually.
>
>> 0000000000006a60<kvm_set_irq>:
>> kvm_set_irq():
>>      6a60:       41 57                   push   %r15
>>      6a62:       41 89 f7                mov    %esi,%r15d
>>      6a65:       41 56                   push   %r14
>>      6a67:       41 55                   push   %r13
>>      6a69:       41 89 cd                mov    %ecx,%r13d
>>      6a6c:       41 54                   push   %r12
>>      6a6e:       49 89 fc                mov    %rdi,%r12
>>      6a71:       55                      push   %rbp
>>      6a72:       53                      push   %rbx
>>      6a73:       89 d3                   mov    %edx,%ebx
>>      6a75:       48 81 ec 98 00 00 00    sub    $0x98,%rsp
>>      6a7c:       8b 15 00 00 00 00       mov    0x0(%rip),%edx
>> # 6a82<kvm_set_irq+0x22>
>>      6a82:       85 d2                   test   %edx,%edx
>>      6a84:       0f 85 c4 00 00 00       jne    6b4e<kvm_set_irq+0xee>
>>      6a8a:       49 8b 84 24 58 24 00    mov    0x2458(%r12),%rax
> OK, 0x6a8a is the offset.
> After you build with -g, try
>
> addr2line -e kvm.ko 0x6a8a
>
> and see which line this points to.
>
>
>>      6a91:       00
>>      6a92:       3b 98 28 01 00 00       cmp    0x128(%rax),%ebx
>>      6a98:       73 5e                   jae    6af8<kvm_set_irq+0x98>
>>      6a9a:       89 db                   mov    %ebx,%ebx
>>      6a9c:       48 8b 84 d8 30 01 00    mov    0x130(%rax,%rbx,8),%rax
>>      6aa3:       00
>>      6aa4:       48 85 c0                test   %rax,%rax
>>      6aa7:       74 4f                   je     6af8<kvm_set_irq+0x98>
>>      6aa9:       48 89 e2                mov    %rsp,%rdx
>>      6aac:       31 db                   xor    %ebx,%ebx
>>      6aae:       48 8b 08                mov    (%rax),%rcx
>>      6ab1:       83 c3 01                add    $0x1,%ebx
>>      6ab4:       0f 18 09                prefetcht0 (%rcx)
>>      6ab7:       48 8b 48 e0             mov    -0x20(%rax),%rcx
>>      6abb:       48 89 0a                mov    %rcx,(%rdx)
>>      6abe:       48 8b 48 e8             mov    -0x18(%rax),%rcx
>>      6ac2:       48 89 4a 08             mov    %rcx,0x8(%rdx)
>>      6ac6:       48 8b 48 f0             mov    -0x10(%rax),%rcx
>>      6aca:       48 89 4a 10             mov    %rcx,0x10(%rdx)
>>      6ace:       48 8b 48 f8             mov    -0x8(%rax),%rcx
>>      6ad2:       48 89 4a 18             mov    %rcx,0x18(%rdx)
>>      6ad6:       48 8b 08                mov    (%rax),%rcx
>>      6ad9:       48 89 4a 20             mov    %rcx,0x20(%rdx)
>>      6add:       48 8b 48 08             mov    0x8(%rax),%rcx
>>      6ae1:       48 89 4a 28             mov    %rcx,0x28(%rdx)
>>      6ae5:       48 8b 00                mov    (%rax),%rax
>>      6ae8:       48 83 c2 30             add    $0x30,%rdx
>>      6aec:       48 85 c0                test   %rax,%rax
>>      6aef:       75 bd                   jne    6aae<kvm_set_irq+0x4e>
>>      6af1:       eb 07                   jmp    6afa<kvm_set_irq+0x9a>
>>      6af3:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
>>      6af8:       31 db                   xor    %ebx,%ebx
>>      6afa:       bd ff ff ff ff          mov    $0xffffffff,%ebp
>>      6aff:       49 89 e6                mov    %rsp,%r14
>>      6b02:       85 db                   test   %ebx,%ebx
>>      6b04:       74 34                   je     6b3a<kvm_set_irq+0xda>
>>      6b06:       83 eb 01                sub    $0x1,%ebx
>>      6b09:       44 89 e9                mov    %r13d,%ecx
>>      6b0c:       44 89 fa                mov    %r15d,%edx
>>      6b0f:       48 63 c3                movslq %ebx,%rax
>>      6b12:       4c 89 e6                mov    %r12,%rsi
>>      6b15:       48 8d 04 40             lea    (%rax,%rax,2),%rax
>>      6b19:       48 c1 e0 04             shl    $0x4,%rax
>>      6b1d:       49 8d 3c 06             lea    (%r14,%rax,1),%rdi
>>      6b21:       ff 54 04 08             callq  *0x8(%rsp,%rax,1)
>>      6b25:       85 c0                   test   %eax,%eax
>>      6b27:       78 d9                   js     6b02<kvm_set_irq+0xa2>
>>      6b29:       85 ed                   test   %ebp,%ebp
>>      6b2b:       ba 00 00 00 00          mov    $0x0,%edx
>>      6b30:       0f 48 ea                cmovs  %edx,%ebp
>>      6b33:       85 db                   test   %ebx,%ebx
>>      6b35:       8d 2c 28                lea    (%rax,%rbp,1),%ebp
>>      6b38:       75 cc                   jne    6b06<kvm_set_irq+0xa6>
>>      6b3a:       48 81 c4 98 00 00 00    add    $0x98,%rsp
>>      6b41:       89 e8                   mov    %ebp,%eax
>>      6b43:       5b                      pop    %rbx
>>      6b44:       5d                      pop    %rbp
>>      6b45:       41 5c                   pop    %r12
>>      6b47:       41 5d                   pop    %r13
>>      6b49:       41 5e                   pop    %r14
>>      6b4b:       41 5f                   pop    %r15
>>      6b4d:       c3                      retq
>>      6b4e:       48 8b 2d 00 00 00 00    mov    0x0(%rip),%rbp
>> # 6b55<kvm_set_irq+0xf5>
>>      6b55:       48 85 ed                test   %rbp,%rbp
>>      6b58:       0f 84 2c ff ff ff       je     6a8a<kvm_set_irq+0x2a>
>>      6b5e:       48 8b 45 00             mov    0x0(%rbp),%rax
>>      6b62:       48 8b 7d 08             mov    0x8(%rbp),%rdi
>>      6b66:       48 83 c5 10             add    $0x10,%rbp
>>      6b6a:       44 89 f9                mov    %r15d,%ecx
>>      6b6d:       44 89 ea                mov    %r13d,%edx
>>      6b70:       89 de                   mov    %ebx,%esi
>>      6b72:       ff d0                   callq  *%rax
>>      6b74:       48 8b 45 00             mov    0x0(%rbp),%rax
>>      6b78:       48 85 c0                test   %rax,%rax
>>      6b7b:       75 e5                   jne    6b62<kvm_set_irq+0x102>
>>      6b7d:       e9 08 ff ff ff          jmpq   6a8a<kvm_set_irq+0x2a>
>>      6b82:       66 66 66 66 66 2e 0f    nopw   %cs:0x0(%rax,%rax,1)
>>      6b89:       1f 84 00 00 00 00 00
>>
>> I admit that this analysis is too complicated for me.
>> I can indeed rebuild a kernel with more printk calls and schedule a reboot.
>>
>> The kvm.ko is available through the following address:
>> http://filex.univ-nantes.fr/get?k=k1jKhQghdcHLz12Z50H
>>
>> Regards.
> This has no debug data. Can you rebuild with -g please?
>
> BTW if you want to rerun and get a more reliable backtrace,
> try enabling frame pointers (do you know how to?). But this will change the code,
> so the backtrace will no longer be valid; we will need
> a new one.
>
>> -- 
>> Jean-Philippe Menil - Pôle réseau Service IRTS
>> DSI Université de Nantes
>> jean-philippe.menil@univ-nantes.fr
>> Tel : 02.53.48.49.27 - Fax : 02.53.48.49.09
I rebooted the host with its new kernel (2.6.37.2), loaded vhost_net with
modprobe, and started three KVM guests.
The host hung within half an hour.

This time I got a general protection fault:

[ 2380.381225] general protection fault: 0000 [#1] SMP
[ 2380.381261] last sysfs file: /sys/devices/system/cpu/cpu11/cache/index2/shared_cpu_map
[ 2380.381309] CPU 0
[ 2380.381316] Modules linked in: vhost_net macvtap macvlan tun veth powernow_k8 mperf cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_ondemand freq_table cpufreq_conservative fuse xt_physdev ip6t_LOG ip6table_filter ip6_tables ipt_LOG xt_multiport xt_limit xt_tcpudp xt_state iptable_filter ip_tables x_tables nf_conntrack_tftp nf_conntrack_ftp nf_conntrack_ipv4 nf_defrag_ipv4 8021q bridge stp ext2 mbcache dm_round_robin dm_multipath nf_conntrack_ipv6 nf_conntrack nf_defrag_ipv6 kvm_amd kvm ipv6 snd_pcm snd_timer snd soundcore snd_page_alloc shpchp i2c_nforce2 pci_hotplug psmouse tpm_tis joydev pcspkr tpm evdev i2c_core dcdbas tpm_bios serio_raw processor ghes button hed thermal_sys xfs exportfs dm_mod sg sr_mod cdrom usbhid hid usb_storage sd_mod ses enclosure megaraid_sas lpfc ohci_hcd scsi_transport_fc scsi_tgt scsi_mod bnx2 ehci_hcd [last unloaded: scsi_wait_scan]
[ 2380.381839] Pid: 10, comm: kworker/0:1 Not tainted 2.6.37.2-dsiun-110105+ #2 Dell Inc. PowerEdge M605/0K543T
[ 2380.381902] RIP: 0010:[<ffffffffa037e877>]  [<ffffffffa037e877>] kvm_set_irq+0x37/0x140 [kvm]
[ 2380.381973] RSP: 0018:ffff88045fc85d00  EFLAGS: 00010246
[ 2380.382002] RAX: 000200740000029c RBX: 000000000000001a RCX: 0000000000000001
[ 2380.382035] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88045dbb7440
[ 2380.382068] RBP: ffff88045fc85dd0 R08: ffff88045fc84000 R09: 000000000000000c
[ 2380.382101] R10: 0000000000000036 R11: 00000000ffffffff R12: ffff88045dbb7440
[ 2380.382134] R13: ffff88045dbb7440 R14: ffffffffa037faa0 R15: 0000000000000001
[ 2380.382168] FS:  00007f0c97165720(0000) GS:ffff88007f800000(0000) knlGS:0000000000000000
[ 2380.382216] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 2380.382246] CR2: 00007f13bcc80b40 CR3: 000000045e96c000 CR4: 00000000000006f0
[ 2380.382279] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2380.382312] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 2380.382347] Process kworker/0:1 (pid: 10, threadinfo ffff88045fc84000, task ffff88085fc53c30)
[ 2380.382395] Stack:
[ 2380.382416]  00000000000119c0 00000000000119c0 00000000000119c0 ffff88085fc53c30
[ 2380.382466]  ffff88085fc53ee0 ffff88045fc85fd8 ffff88085fc53ee8 ffff88045fc84010
[ 2380.382516]  00000000000119c0 ffff88045fc85fd8 00000000000119c0 00000000000119c0
[ 2380.382566] Call Trace:
[ 2380.382600]  [<ffffffff813818ce>] ? common_interrupt+0xe/0x13
[ 2380.382648]  [<ffffffffa037faa0>] ? irqfd_inject+0x0/0x50 [kvm]
[ 2380.382694]  [<ffffffffa037faca>] irqfd_inject+0x2a/0x50 [kvm]
[ 2380.382729]  [<ffffffff8106b7bb>] process_one_work+0x11b/0x450
[ 2380.382762]  [<ffffffff8106bf37>] worker_thread+0x157/0x410
[ 2380.382796]  [<ffffffff8103a569>] ? __wake_up_common+0x59/0x90
[ 2380.382828]  [<ffffffff8106bde0>] ? worker_thread+0x0/0x410
[ 2380.382861]  [<ffffffff8106f996>] kthread+0x96/0xa0
[ 2380.382894]  [<ffffffff81003c64>] kernel_thread_helper+0x4/0x10
[ 2380.382927]  [<ffffffff8106f900>] ? kthread+0x0/0xa0
[ 2380.382958]  [<ffffffff81003c60>] ? kernel_thread_helper+0x0/0x10
[ 2380.382987] Code: 55 49 89 fd 41 54 53 89 d3 48 81 ec a8 00 00 00 8b 15 a6 75 03 00 89 b5 3c ff ff ff 85 d2 0f 85 d5 00 00 00 49 8b 85 58 24 00 00 <3b> 98 28 01 00 00 73 61 89 db 48 8b 84 d8 30 01 00 00 48 85 c0
[ 2380.383185] RIP  [<ffffffffa037e877>] kvm_set_irq+0x37/0x140 [kvm]
[ 2380.383231]  RSP <ffff88045fc85d00>

Running markup_oops.pl gives me the following:

vmaoffset = 18446744072102576128
  ffffffffa037e841:    48 89 e5                 mov    %rsp,%rbp
  ffffffffa037e844:    41 57                    push   %r15
  ffffffffa037e846:    41 89 cf                 mov    %ecx,%r15d  |  %r15 => 1  %ecx = 1
  ffffffffa037e849:    41 56                    push   %r14        |  %r14 => ffffffffa037faa0
  ffffffffa037e84b:    41 55                    push   %r13
  ffffffffa037e84d:    49 89 fd                 mov    %rdi,%r13   |  %edi = ffff88045dbb7440  %r13 => ffff88045dbb7440
  ffffffffa037e850:    41 54                    push   %r12        |  %r12 => ffff88045dbb7440
  ffffffffa037e852:    53                       push   %rbx
  ffffffffa037e853:    89 d3                    mov    %edx,%ebx   |  %ebx => 1a
  ffffffffa037e855:    48 81 ec a8 00 00 00     sub    $0xa8,%rsp
  ffffffffa037e85c:    8b 15 00 00 00 00        mov    0x0(%rip),%edx        # ffffffffa037e862 <kvm_set_irq+0x22>
  ffffffffa037e862:    89 b5 3c ff ff ff        mov    %esi,-0xc4(%rbp) |  %esi = 0
  ffffffffa037e868:    85 d2                    test   %edx,%edx   |  %edx => 0
  ffffffffa037e86a:    0f 85 d5 00 00 00        jne    ffffffffa037e945 <kvm_set_irq+0x105>
  ffffffffa037e870:    49 8b 85 58 24 00 00     mov    0x2458(%r13),%rax |  %eax => 200740000029c  %r13 = ffff88045dbb7440
*ffffffffa037e877:    3b 98 28 01 00 00        cmp    0x128(%rax),%ebx |  %eax = 200740000029c  %ebx = 1a <--- faulting instruction
  ffffffffa037e87d:    73 61                    jae    ffffffffa037e8e0 <kvm_set_irq+0xa0>
  ffffffffa037e87f:    89 db                    mov    %ebx,%ebx
  ffffffffa037e881:    48 8b 84 d8 30 01 00     mov    0x130(%rax,%rbx,8),%rax
  ffffffffa037e888:    00
  ffffffffa037e889:    48 85 c0                 test   %rax,%rax
  ffffffffa037e88c:    74 52                    je     ffffffffa037e8e0 <kvm_set_irq+0xa0>
  ffffffffa037e88e:    48 8d 95 40 ff ff ff     lea    -0xc0(%rbp),%rdx
  ffffffffa037e895:    31 db                    xor    %ebx,%ebx
  ffffffffa037e897:    48 8b 08                 mov    (%rax),%rcx
  ffffffffa037e89a:    83 c3 01                 add    $0x1,%ebx
  ffffffffa037e89d:    0f 18 09                 prefetcht0 (%rcx)
  ffffffffa037e8a0:    48 8b 48 e0              mov    -0x20(%rax),%rcx
  ffffffffa037e8a4:    48 89 0a                 mov    %rcx,(%rdx)
  ffffffffa037e8a7:    48 8b 48 e8              mov    -0x18(%rax),%rcx
  ffffffffa037e8ab:    48 89 4a 08              mov    %rcx,0x8(%rdx)
  ffffffffa037e8af:    48 8b 48 f0              mov    -0x10(%rax),%rcx
  ffffffffa037e8b3:    48 89 4a 10              mov    %rcx,0x10(%rdx)
  ffffffffa037e8b7:    48 8b 48 f8              mov    -0x8(%rax),%rcx
  ffffffffa037e8bb:    48 89 4a 18              mov    %rcx,0x18(%rdx)
  ffffffffa037e8bf:    48 8b 08                 mov    (%rax),%rcx

I have re-run markup_oops on the first oops (2.6.37.1), against the right
module this time (sorry about that). It gives me the following:

vmaoffset = 18446744072103215104
  ffffffffa041aa62:    41 89 f7                 mov    %esi,%r15d  |  %r15 => 0  %esi = 0
  ffffffffa041aa65:    41 56                    push   %r14        |  %r14 => 0
  ffffffffa041aa67:    41 55                    push   %r13
  ffffffffa041aa69:    41 89 cd                 mov    %ecx,%r13d  |  %ecx = 1  %r13 => 1
  ffffffffa041aa6c:    41 54                    push   %r12
  ffffffffa041aa6e:    49 89 fc                 mov    %rdi,%r12   |  %edi = 0  %r12 => 0
  ffffffffa041aa71:    55                       push   %rbp
  ffffffffa041aa72:    53                       push   %rbx
  ffffffffa041aa73:    89 d3                    mov    %edx,%ebx   |  %ebx => 1a
  ffffffffa041aa75:    48 81 ec 98 00 00 00     sub    $0x98,%rsp
  ffffffffa041aa7c:    8b 15 00 00 00 00        mov    0x0(%rip),%edx        # ffffffffa041aa82 <kvm_set_irq+0x22>
  ffffffffa041aa82:    85 d2                    test   %edx,%edx   |  %edx => 0
  ffffffffa041aa84:    0f 85 c4 00 00 00        jne    ffffffffa041ab4e <kvm_set_irq+0xee>
*ffffffffa041aa8a:    49 8b 84 24 58 24 00     mov    0x2458(%r12),%rax |  %eax = 0  %r12 = 0 <--- faulting instruction
  ffffffffa041aa91:    00
  ffffffffa041aa92:    3b 98 28 01 00 00        cmp    0x128(%rax),%ebx
  ffffffffa041aa98:    73 5e                    jae    ffffffffa041aaf8 <kvm_set_irq+0x98>
  ffffffffa041aa9a:    89 db                    mov    %ebx,%ebx
  ffffffffa041aa9c:    48 8b 84 d8 30 01 00     mov    0x130(%rax,%rbx,8),%rax
  ffffffffa041aaa3:    00
  ffffffffa041aaa4:    48 85 c0                 test   %rax,%rax
  ffffffffa041aaa7:    74 4f                    je     ffffffffa041aaf8 <kvm_set_irq+0x98>
  ffffffffa041aaa9:    48 89 e2                 mov    %rsp,%rdx
  ffffffffa041aaac:    31 db                    xor    %ebx,%ebx
  ffffffffa041aaae:    48 8b 08                 mov    (%rax),%rcx
  ffffffffa041aab1:    83 c3 01                 add    $0x1,%ebx
  ffffffffa041aab4:    0f 18 09                 prefetcht0 (%rcx)
  ffffffffa041aab7:    48 8b 48 e0              mov    -0x20(%rax),%rcx
  ffffffffa041aabb:    48 89 0a                 mov    %rcx,(%rdx)
  ffffffffa041aabe:    48 8b 48 e8              mov    -0x18(%rax),%rcx
  ffffffffa041aac2:    48 89 4a 08              mov    %rcx,0x8(%rdx)
  ffffffffa041aac6:    48 8b 48 f0              mov    -0x10(%rax),%rcx
  ffffffffa041aaca:    48 89 4a 10              mov    %rcx,0x10(%rdx)
  ffffffffa041aace:    48 8b 48 f8              mov    -0x8(%rax),%rcx
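
The two listings fault one instruction apart, which may explain why the
first crash showed up as a plain Oops with CR2 = 0x2458 and the second as a
general protection fault. In the first, %r12 (the kvm pointer) is 0, so the
load from 0x2458(%r12) touches an unmapped low address. In the second, that
load succeeds but returns 0x000200740000029c, and the following access at
0x128(%rax) uses a non-canonical address, which x86-64 reports as #GP rather
than as a page fault. A minimal sketch of that check, assuming 48-bit virtual
addresses (my assumption about this host); is_canonical is a hypothetical
helper, not part of markup_oops:

```python
def is_canonical(addr: int, va_bits: int = 48) -> bool:
    """Return True if bits 63..(va_bits-1) of a 64-bit address are all equal."""
    top = (addr & (2**64 - 1)) >> (va_bits - 1)   # the upper bits that must match
    all_ones = (1 << (64 - va_bits + 1)) - 1
    return top in (0, all_ones)

# Effective addresses taken from the two annotated listings.
first_fault = 0x0 + 0x2458                   # 0x2458(%r12) with %r12 = 0
second_fault = 0x000200740000029c + 0x128    # 0x128(%rax) in the 2.6.37.2 oops

for name, addr in (("first", first_fault), ("second", second_fault)):
    print(f"{name}: {addr:#018x} canonical={is_canonical(addr)}")
```

Either way the value reached through the kvm pointer is garbage, so both
traces are consistent with kvm_set_irq being handed a bad kvm pointer.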

It appears that the kernel I recompiled (make-pkg) with the debug
options for the kvm module doesn't have the debug information!
addr2line gives me "??:0".

I will retry with the right options.
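
For the next attempt, the address to pass to addr2line is the start of
kvm_set_irq inside kvm.ko plus the offset printed on the RIP line. Below is a
small sketch of that arithmetic using the numbers from the first oops. The
function start 0x6a60 comes from the objdump listing quoted earlier;
parse_rip_symbol and module_address are hypothetical helper names used only
for illustration.

```python
import re

def parse_rip_symbol(text: str):
    """Split 'kvm_set_irq+0x2a/0x130' into (symbol, offset, function length)."""
    m = re.fullmatch(r"(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)", text.strip())
    if m is None:
        raise ValueError(f"unrecognised symbol+offset: {text!r}")
    return m.group(1), int(m.group(2), 16), int(m.group(3), 16)

def module_address(symbol_start: int, offset: int) -> int:
    """Address inside the module file: symbol start plus offset from the oops."""
    return symbol_start + offset

symbol, offset, length = parse_rip_symbol("kvm_set_irq+0x2a/0x130")
file_addr = module_address(0x6a60, offset)   # 0x6a60: kvm_set_irq in objdump output
runtime_rip = 0xffffffffa041aa8a             # RIP value from the first oops
vmaoffset = runtime_rip - file_addr          # load address minus file address

print(f"addr2line -e kvm.ko {file_addr:#x}")
print(f"vmaoffset = {vmaoffset}")
```

The vmaoffset computed this way equals the value markup_oops printed for the
first oops, which suggests the tool matched the right module that time.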

Regards.

-- 
Jean-Philippe Menil - Pôle réseau Service IRTS
DSI Université de Nantes
jean-philippe.menil@univ-nantes.fr
Tel : 02.53.48.49.27 - Fax : 02.53.48.49.09


^ permalink raw reply

* Re: [PATCH 0/3] [RFC] Implement multiqueue (RX & TX) virtio-net
From: Krishna Kumar2 @ 2011-02-28 15:35 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: anthony, arnd, avi, davem, eric.dumazet, horms, kvm, netdev,
	rusty
In-Reply-To: <20110228073514.GA28006@redhat.com>

"Michael S. Tsirkin" <mst@redhat.com> wrote on 02/28/2011 01:05:15 PM:

> > This patch series is a continuation of an earlier one that
> > implemented guest MQ TX functionality.  This new patchset
> > implements both RX and TX MQ.  Qemu changes are not being
> > included at this time solely to aid in easier review.
> > Compatibility testing with old/new combinations of qemu/guest
> > and vhost was done without any issues.
> >
> > Some early TCP/UDP test results are at the bottom of this
> > post, I plan to submit more test results in the coming days.
> >
> > Please review and provide feedback on what can improve.
> >
> > Thanks!
> >
> > Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
>
>
> To help testing, could you post the qemu changes separately please?

Thanks Michael for your review and feedback. I will send the qemu
changes and respond to your comments tomorrow.

Thanks,

- KK


^ permalink raw reply

* Re: txqueuelen has wrong units; should be time
From: Hagen Paul Pfeifer @ 2011-02-28 15:38 UTC (permalink / raw)
  To: Albert Cahalan
  Cc: Jussi Kivilinna, Eric Dumazet, Mikael Abrahamsson, linux-kernel,
	netdev
In-Reply-To: <AANLkTimofhhH5omyk=HhkyaNG+MGqoac4rDf=dPuR7K-@mail.gmail.com>


On Sun, 27 Feb 2011 18:33:39 -0500, Albert Cahalan wrote:

> I suppose there is a need to allow at least 2 packets despite any
> time limits, so that it remains possible to use a traditional modem
> even if a huge packet takes several seconds to send.

That is a good point! We talk as if we knew every use case of
Linux, but that is not true at all. One of my customers, for example, operates
the Linux network stack on top of a proprietary MAC/driver
where the current packet-count queue behaviour is just fine. The
time-drop approach is unsuitable there because the bandwidth can vary over a
wide range (from 0 up to the maximum bandwidth) within a short time. Sufficient
buffering proves superior in this environment (IPv{4,6}/UDP only).
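
To make the trade-off concrete, here is a toy model of a time-limited queue
with the two-packet floor Albert mentions. It is only a sketch under my own
assumptions: TimeLimitedQueue, max_age and min_keep are invented names, and
nothing here reflects an existing qdisc.

```python
from collections import deque

class TimeLimitedQueue:
    """FIFO that drops packets older than max_age but always keeps min_keep."""

    def __init__(self, max_age: float, min_keep: int = 2):
        self.max_age = max_age
        self.min_keep = min_keep
        self._q = deque()        # entries are (enqueue_time, packet)
        self.dropped = []

    def enqueue(self, packet, now: float) -> None:
        self._q.append((now, packet))

    def dequeue(self, now: float):
        # Expire stale packets from the head while more than min_keep remain.
        while len(self._q) > self.min_keep and now - self._q[0][0] > self.max_age:
            self.dropped.append(self._q.popleft()[1])
        return self._q.popleft()[1] if self._q else None

def drain(min_keep: int):
    """Slow link: two packets queued at t=0, one transmit slot every 5 seconds."""
    q = TimeLimitedQueue(max_age=1.0, min_keep=min_keep)
    q.enqueue("a", now=0.0)
    q.enqueue("b", now=0.0)
    sent = [q.dequeue(now=5.0), q.dequeue(now=10.0)]
    return sent, q.dropped

print(drain(min_keep=2))   # with the floor
print(drain(min_keep=0))   # pure time limit
```

With the floor, both packets eventually leave the slow link. Without it, the
queue expires everything before the link is ready, which is the kind of
starvation a strongly varying bandwidth would provoke.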



Hagen

^ permalink raw reply

* Re: [PATCH] iproute2: allow to specify truncation bits on auth algo
From: Stephen Hemminger @ 2011-02-28 15:48 UTC (permalink / raw)
  To: nicolas.dichtel; +Cc: David Miller, herbert, netdev, christophe.gouault
In-Reply-To: <4D6BA724.2020504@6wind.com>

It is in the net-next branch of iproute.

-- 

^ permalink raw reply

* Re: txqueuelen has wrong units; should be time
From: John W. Linville @ 2011-02-28 16:11 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jussi Kivilinna, Albert Cahalan, Mikael Abrahamsson, linux-kernel,
	netdev
In-Reply-To: <1298837273.8726.128.camel@edumazet-laptop>

On Sun, Feb 27, 2011 at 09:07:53PM +0100, Eric Dumazet wrote:

> Qdisc should return to caller a good indication packet is queued or
> dropped at enqueue() time... not later (aka : never)
> 
> Accepting a packet at t0, and dropping it later at t0+limit without
> giving any indication to caller is a problem.

Can you elaborate on what problem this causes?  Is it any worse than
if the packet is dropped at some later hop?

Is there any API that could report the drop to the sender (at
least a local one) without having to wait for the ack timeout?
Should there be?
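
For context, the behaviour Eric describes (a verdict at enqueue() time) looks
roughly like the toy tail-drop FIFO below. This is a sketch only: TailDropFifo
is a made-up class, and the two constants simply mirror the values of the
kernel's NET_XMIT_SUCCESS and NET_XMIT_DROP return codes.

```python
from collections import deque

NET_XMIT_SUCCESS = 0   # same numeric values as the kernel's NET_XMIT_* codes
NET_XMIT_DROP = 1

class TailDropFifo:
    """Packet-count limited FIFO that reports the drop to the caller at once."""

    def __init__(self, limit: int):
        self.limit = limit
        self._q = deque()

    def enqueue(self, packet) -> int:
        if len(self._q) >= self.limit:
            return NET_XMIT_DROP       # caller learns about the drop here
        self._q.append(packet)
        return NET_XMIT_SUCCESS

    def dequeue(self):
        return self._q.popleft() if self._q else None

fifo = TailDropFifo(limit=2)
verdicts = [fifo.enqueue(p) for p in ("p1", "p2", "p3")]
print(verdicts)
```

A queue that accepts a packet and expires it later has already returned
success for it, so a local sender gets no such signal.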

John
-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.

^ permalink raw reply

* Re: SO_REUSEPORT - can it be done in kernel?
From: Eric Dumazet @ 2011-02-28 16:22 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Herbert Xu, David Miller, rick.jones2, therbert, wsommerfeld,
	daniel.baluta, netdev
In-Reply-To: <20110228141322.GF9763@canuck.infradead.org>

Le lundi 28 février 2011 à 09:13 -0500, Thomas Graf a écrit :
> On Mon, Feb 28, 2011 at 07:36:59PM +0800, Herbert Xu wrote:
> > But please do test them heavily, especially if you have an AMD
> > NUMA machine as that's where scalability problems really show
> > up.  Intel tends to be a lot more forgiving.  My last AMD machine
> > blew up years ago :)
> 
> This is just a preliminary test result and not 100% reliable
> because halfway through the testing the machine reported memory
> issues and disabled a DIMM before booting the tested kernels.
> 
> Nevertheless, bind 9.7.3:
> 
> 2.6.38-rc5+: 62kqps
> 2.6.38-rc5+ w/ Herbert's patch: 442kqps
> 
> This is on a 2 NUMA Intel Xeon X5560 @ 2.80GHz with 16 cores
> 
> Again, this number is not 100% reliable but clearly shows that
> the concept of the patch is working very well.
> 
> Will test Herbert's patch on the machine that did 650kqps with
> SO_REUSEPORT and also on some AMD machines.
> --

I suspect your queryperf input file hits many zones ?

With a single zone, my machine is able to give 250kps : most of the time
is consumed in bind code, dealing with rwlocks and false sharing
things...

(bind-9.7.2-P3)
Using two remote machines to perform queries, on bnx2x adapter, RSS
enabled : two cpus receive UDP frames for the same socket, so we also
hit false sharing in kernel receive path.


---------------------------------------------------------------------------------------------------------------------------------
   PerfTop:  558863 irqs/sec  kernel:40.8%  exact:  0.0% [1000Hz cpu-clock-msecs],  (all, 16 CPUs)
---------------------------------------------------------------------------------------------------------------------------------

             samples  pcnt function                      DSO
             _______ _____ _____________________________ ______________________________________

           137175.00 12.4% acpi_idle_enter_bm            [kernel.kallsyms]                     
            63784.00  5.8% _raw_spin_unlock_irqrestore   [kernel.kallsyms]                     
            54140.00  4.9% isc_rwlock_lock               /opt/src/bind-9.7.2-P3/bin/named/named
            32682.00  2.9% isc_rwlock_unlock             /opt/src/bind-9.7.2-P3/bin/named/named
            21823.00  2.0% dns_rbt_findnode              /opt/src/bind-9.7.2-P3/bin/named/named
            20306.00  1.8% __ticket_spin_lock            [kernel.kallsyms]                     
            16881.00  1.5% finish_task_switch            [kernel.kallsyms]                     
            15335.00  1.4% zone_find                     /opt/src/bind-9.7.2-P3/bin/named/named
            14082.00  1.3% decrement_reference           /opt/src/bind-9.7.2-P3/bin/named/named
            14064.00  1.3% __pthread_mutex_lock_internal /lib/tls/libpthread-2.3.4.so          
            13519.00  1.2% isc_stats_increment           /opt/src/bind-9.7.2-P3/bin/named/named
            13027.00  1.2% __GI_memcpy                   /lib/tls/libc-2.3.4.so                
            12516.00  1.1% dns_name_concatenate          /opt/src/bind-9.7.2-P3/bin/named/named
            12499.00  1.1% currentversion                /opt/src/bind-9.7.2-P3/bin/named/named
            11412.00  1.0% dns_name_fullcompare          /opt/src/bind-9.7.2-P3/bin/named/named
            10814.00  1.0% new_reference.clone.6         /opt/src/bind-9.7.2-P3/bin/named/named
            10580.00  1.0% attach                        /opt/src/bind-9.7.2-P3/bin/named/named
             9805.00  0.9% zone_zonecut_callback         /opt/src/bind-9.7.2-P3/bin/named/named



^ permalink raw reply

* Re: [PATCH net-2.6] bonding: drop frames received with master's source MAC
From: Andy Gospodarek @ 2011-02-28 16:32 UTC (permalink / raw)
  To: Nicolas de Pesloüan
  Cc: Andy Gospodarek, netdev, David Miller, Herbert Xu, Jay Vosburgh,
	Jiri Pirko
In-Reply-To: <4D683653.4050409@gmail.com>

On Sat, Feb 26, 2011 at 12:08:03AM +0100, Nicolas de Pesloüan wrote:
> Le 25/02/2011 23:24, Andy Gospodarek a écrit :
[...]
>>
>> I confirmed your suspicion, this breaks ARP monitoring.  I would still
>> welcome other opinions though as I think it would be nice to fix this as
>> low as possible.
>
> Why do you want to fix it earlier than in ndisc_recv_ns drop? Your 
> original idea of silently dropping the frame there seems perfect to me.
>

Maybe it's just me, but I cannot understand why we want a bunch of extra
packets floating up into the stack when they may only create issues for
the recipients of these duplicate frames.

Clearly my original patch needs to be refined so ARP monitoring still
works, but I would rather fix the issue there than in a higher layer.




* Re: ICMP reply uses wrong source address as destination
From: Jiri Kosina @ 2011-02-28 16:33 UTC (permalink / raw)
  To: Anders Nilsson Plymoth; +Cc: linux-kernel, netdev
In-Reply-To: <AANLkTi=2hEK9rR2h1hzWtud0DdJU5A4d9rBYD6aTLFx-@mail.gmail.com>


[ adding netdev@ to CC ]

On Mon, 28 Feb 2011, Anders Nilsson Plymoth wrote:

> Dear linux kernel enthusiasts,
> 
> I came upon an issue where ICMP reply packets were issued towards the
> IP address of the receiving interface, rather than the source IP
> address.
> Looking at the kernel code, I saw that this is caused by the following
> line in net/ipv4/icmp.c function icmp_reply:
> 
> daddr = ipc.addr = rt->rt_src;
> 
> For most cases the original line of code is ok, but in some situations
> the packet doesn't arrive at the kernel from the network device, but through
> some other mechanism such as a userspace application. In these cases the
> receiving device in the skb appears to be the loopback interface, not
> a physical device. icmp_reply will thus issue the reply to the
> loopback IP address, rather than the source IP address as it should.
> 
> While googling to see if this issue has been submitted, I found these
> two threads that address the same problem:
> 
> ([PATCH] 2.6.22.6 NETWORKING [IPV4]: Always use source addr in skb to
> reply packet) on 17 sep 2007
> AND
> ([PATCH RESEND] 2.6.22.6 networking [ipv4]: fix wrong destination when
> reply packetes) on 20 sep 2007).
> 
> Nothing came out of these threads, and some of the questions there are
> easy to answer: for example, this doesn't affect DNAT, and if the source
> IP address is not set then you should not issue a reply for echo and
> timestamp anyway.
> 
> As to the statement:
> "... which IP address should be used as the source
> 1. the destination address of the packet that generated the message
> 
> or.
> 
> 2. the IP address that the machine would use by default if the machine
> were to generate a new connection to the destination."
> These may be relevant questions, but the ICMP RFC clearly states the
> answer is 1. Option 2 may seem relevant to multi-homing, but it's not the
> role of the ICMP reply to resolve multi-homing issues.
> The following code will correct the issue.
> {
>    struct iphdr *ip = ip_hdr(skb);
>    daddr = ipc.addr = ip->saddr;
> }
> The only functions that use icmp_reply are icmp_echo and
> icmp_timestamp, and this change does not modify their behavior. After
> extensive testing, in regular setups and DNATed situations, I can
> verify this change works as intended.
> Thanks,
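
The difference between the two choices discussed in the quoted mail can be illustrated outside the kernel. The following is only a userspace sketch: the `toy_*` structures and both helper functions are hypothetical stand-ins, not the real `struct sk_buff`, `struct rtable` or `icmp_reply()` code. It assumes, as the mail describes, a locally injected packet whose route source ended up as the loopback address.

```c
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the kernel structures.
 * Addresses are plain host-order integers for readability. */
struct toy_iphdr { uint32_t saddr; uint32_t daddr; };
struct toy_route { uint32_t rt_src; };
struct toy_skb   { struct toy_iphdr ip; struct toy_route rt; };

/* Behaviour described in the mail: the reply destination is taken
 * from the route attached to the packet. */
static uint32_t reply_daddr_from_route(const struct toy_skb *skb)
{
	return skb->rt.rt_src;
}

/* Proposed behaviour: the reply destination is the source address
 * found in the IP header of the packet being answered. */
static uint32_t reply_daddr_from_header(const struct toy_skb *skb)
{
	return skb->ip.saddr;
}
```

With a header source of 10.0.0.1 and a route source of 127.0.0.1, the first helper would answer to loopback and the second to 10.0.0.1, which is the behaviour the proposal asks for.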

-- 
Jiri Kosina
SUSE Labs, Novell Inc.


* Re: txqueuelen has wrong units; should be time
From: Albert Cahalan @ 2011-02-28 16:37 UTC (permalink / raw)
  To: Hagen Paul Pfeifer
  Cc: Jussi Kivilinna, Eric Dumazet, Mikael Abrahamsson, linux-kernel,
	netdev
In-Reply-To: <cbb9e1113901f1a324359b6ed3f1a611@localhost>

On Mon, Feb 28, 2011 at 10:38 AM, Hagen Paul Pfeifer <hagen@jauu.net> wrote:
> On Sun, 27 Feb 2011 18:33:39 -0500, Albert Cahalan wrote:
>
>> I suppose there is a need to allow at least 2 packets despite any
>> time limits, so that it remains possible to use a traditional modem
>> even if a huge packet takes several seconds to send.
>
> That is a good point! We talk as if we knew every use case of Linux,
> but this is not true at all. One of my customers, for example, operates
> the Linux network stack functionality on top of a proprietary MAC/Driver
> where the current packet queue characteristic is just fine. The
> time-drop approach is unsuitable because the bandwidth can vary over a
> great range (0 to max. bandwidth) in a short amount of time. Sufficient
> buffering proves superior in this environment (only IPv{4,6}/UDP).

I don't think the current non-time queue is just fine for him.
I can see that time-based discard-on-enqueue would not be
fine either. He needs time-based discard-on-dequeue.
What would probably work for him:

On dequeue, discard all packets that are too old.
On enqueue, assume max bandwidth and discard all
packets that have no hope of surviving the dequeue check.
(the enqueue check is only to prevent wasting RAM)
Exception: always keep at least 2 packets.
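
The rules above can be sketched as a small userspace ring buffer. This is an illustration under stated assumptions, not kernel qdisc code: `struct tq`, `tq_enqueue()`, `tq_dequeue()`, `QCAP` and the field names are all hypothetical, and "always keep at least 2 packets" is interpreted here as "never time-drop when two or fewer packets are queued".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QCAP     64  /* hypothetical ring capacity */
#define MIN_KEEP 2   /* "always keep at least 2 packets" */

struct pkt { uint64_t enq_us; uint32_t len; };

struct tq {
	struct pkt ring[QCAP];
	size_t head, count;
	uint64_t backlog_bytes;      /* bytes currently queued */
	uint64_t max_age_us;         /* time limit per packet */
	uint64_t max_bytes_per_sec;  /* assumed best-case link rate, nonzero */
};

/* Enqueue check: assume max bandwidth and refuse packets that could not
 * leave the queue before the time limit. Returns 1 if queued, 0 if dropped. */
static int tq_enqueue(struct tq *q, uint64_t now_us, uint32_t len)
{
	assert(q->max_bytes_per_sec > 0);
	if (q->count == QCAP)
		return 0;
	if (q->count >= MIN_KEEP) {
		/* Best-case time to drain everything ahead of this packet. */
		uint64_t wait_us = q->backlog_bytes * 1000000ULL / q->max_bytes_per_sec;
		if (wait_us > q->max_age_us)
			return 0;
	}
	q->ring[(q->head + q->count) % QCAP] = (struct pkt){ now_us, len };
	q->count++;
	q->backlog_bytes += len;
	return 1;
}

/* Dequeue check: discard packets that are too old, except the last
 * MIN_KEEP. Returns 1 and fills *out if a packet is handed to the link. */
static int tq_dequeue(struct tq *q, uint64_t now_us, struct pkt *out)
{
	while (q->count > 0) {
		struct pkt p = q->ring[q->head];
		int stale = now_us > p.enq_us && now_us - p.enq_us > q->max_age_us;

		q->head = (q->head + 1) % QCAP;
		q->count--;
		q->backlog_bytes -= p.len;
		if (stale && q->count >= MIN_KEEP)
			continue;  /* too old, and enough packets remain */
		*out = p;
		return 1;
	}
	return 0;
}
```

The enqueue check only saves memory, as noted above; the dequeue check is what actually enforces the latency bound when the real bandwidth is lower than the assumed maximum.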

Better is something that would allow random drop.
The trouble here is that bandwidth varies greatly.
Some sort of undelete functionality is needed...?

Assuming the difficulty with implementing random drop
is solvable, I think this would work for the rest of us too.

Keeping the timeout really low is important because it isn't
OK to eat up all the latency tolerance in one hop. You have
an end-to-end budget of 20 ms for usable GUI rubber banding.
The budget for gaming is about 80 and for VoIP is about 150.


* Re: SO_REUSEPORT - can it be done in kernel?
From: Thomas Graf @ 2011-02-28 16:37 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Herbert Xu, David Miller, rick.jones2, therbert, wsommerfeld,
	daniel.baluta, netdev
In-Reply-To: <1298910174.2941.585.camel@edumazet-laptop>

On Mon, Feb 28, 2011 at 05:22:54PM +0100, Eric Dumazet wrote:
> Le lundi 28 février 2011 à 09:13 -0500, Thomas Graf a écrit :
> > On Mon, Feb 28, 2011 at 07:36:59PM +0800, Herbert Xu wrote:
> > > But please do test them heavily, especially if you have an AMD
> > > NUMA machine as that's where scalability problems really show
> > > up.  Intel tends to be a lot more forgiving.  My last AMD machine
> > > blew up years ago :)
> > 
> > This is just a preliminary test result and not 100% reliable
> > because halfway through the testing the machine reported memory
> > issues and disabled a DIMM before booting the tested kernels.
> > 
> > Nevertheless, bind 9.7.3:
> > 
> > 2.6.38-rc5+: 62kqps
> > 2.6.38-rc5+ w/ Herbert's patch: 442kqps
> > 
> > This is on a 2 NUMA Intel Xeon X5560 @ 2.80GHz with 16 cores
> > 
> > Again, this number is not 100% reliable but clearly shows that
> > the concept of the patch is working very well.
> > 
> > Will test Herbert's patch on the machine that did 650kqps with
> > SO_REUSEPORT and also on some AMD machines.
> > --
> 
> I suspect your queryperf input file hits many zones ?

No, we use a simple example.com zone with host[1-4] A records
resolving to 10.[1-4].0.1

> With a single zone, my machine is able to give 250kqps: most of the time
> is consumed in bind code, dealing with rwlocks and false sharing
> things...
> 
> (bind-9.7.2-P3)
> Using two remote machines to perform queries, on bnx2x adapter, RSS
> enabled : two cpus receive UDP frames for the same socket, so we also
> hit false sharing in kernel receive path.

How do you measure the qps? The output of queryperf? That is not always
accurate. I run rndc stats twice and then calculate the qps based on the
counter "queries resulted in successful answer" diff and timestamp diff.

The numbers differ a lot depending on the architecture we test on.

For example, on a 12-core AMD with 2 NUMA nodes:

2.6.32   named -n 1: 37.0kqps
         named:       3.8kqps (yes, no joke, the socket receive buffer is
                               always full and the kernel drops pkts)

2.6.38-rc5+ with Herbert's patches:
        named -n 1:  36.9kqps
        named:      222.0kqps


