Netdev List
 help / color / mirror / Atom feed
* Re: Suspend/resume - slow resume
From: Linus Torvalds @ 2011-04-18 18:49 UTC (permalink / raw)
  To: Francois Romieu
  Cc: Ciprian Docan, netdev, Linux Kernel Mailing List, Len Brown,
	Pavel Machek, Rafael, J. Wysocki, Greg KH
In-Reply-To: <20110418180813.GA18469@electric-eye.fr.zoreil.com>

[-- Attachment #1: Type: text/plain, Size: 3010 bytes --]

On Mon, Apr 18, 2011 at 11:08 AM, Francois Romieu <romieu@fr.zoreil.com> wrote:
> [...]
>>  - unload not on close, but on device unregister (iow not when you do
>> "ifconfig eth0 down", but when the "eth0" device really goes away)
>
> Without further action, the firmware(s) will thus be locked in until the
> driver is removed.

I do agree. It's a downside. Maybe doing it in "close()" is the right
thing, as long as we don't have that crazy "every four timer ticks"
situation with rtl8169_reinit_task.

As mentioned, the only real reason for me to be worried about the
close thing is that I don't have a good feel for what happens at boot
time. Are the setup scripts going to look at the interface lots of
times? On my desktop, I couldn't care less, but I try to keep boot
time in mind.

Maybe in practice there's just a single open at boot-time (for dhcp or
whatever), and I'm just worried for no good reason.

Without having looked at that whole rtl8169_reinit_task thing, I
probably wouldn't even worry about anything else doing something
similar ;)

> As long as it can be fixed... If the 60s delay is removed and the firmware
> loading emits some messages for programmer barbie, I am more than happy.

So the firmware loading timeout used to be ten seconds (which I
already think is excessive), but then commit
2f65168de7d68a5795e945e781d85b313bdc97b9 increased it to 60s because

    https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=174589

    The ipw driver sometimes takes a long time to load its firmware.
    Whilst the ipw driver should be using the async interface of
    the firmware loader to make this a non-issue, this is a minimal fix.

although when I actually look at that bugzilla entry, the _timestamps_
for the failed case do not seem to support this being a timeout.

Very odd.

But the real problem is that we do that timeout even in cases where it
cannot help, ie when people load firmware during early boot or during
suspend. So I think drivers/base/firmware_class.c should be made a bit
smarter.

We have a few cases where call_usermodehelper() fails immediately:

 - khelper_wq hasn't been set up yet
 - usermodehelper_disabled is set.

and in particular, during suspend/resume, that
"usermodehelper_disabled" flag will be set.

I don't think it is sensible to do a user request for firmware during
that time either, and that 60-second timeout is just silly. It's not
going to help.

Why doesn't the firmware loader class check the error return from the
kobject_uevent()? I'd expect that if that fails, we should just warn
and abort, rather than wait 60 seconds to time out. Greg?

TOTALLY UNTESTED PATCH ATTACHED!

Ciprian - does this get rid of the 60-second wait? Do you get a nice
kernel traceback in your dmesg instead?

> If someone can tell me where Realtek's firmware should be sent to as David
> W. seems to be busy, it will be perfect.

Hmm. Dunno about that.

                       Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 625 bytes --]

 drivers/base/firmware_class.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/base/firmware_class.c b/drivers/base/firmware_class.c
index 8c798ef7f13f..956dd34e59da 100644
--- a/drivers/base/firmware_class.c
+++ b/drivers/base/firmware_class.c
@@ -549,7 +549,8 @@ static int _request_firmware(const struct firmware **firmware_p,
 				  round_jiffies_up(jiffies +
 						   loading_timeout * HZ));
 
-		kobject_uevent(&fw_priv->dev.kobj, KOBJ_ADD);
+		if (WARN_ON(kobject_uevent(&fw_priv->dev.kobj, KOBJ_ADD)))
+			fw_load_abort(fw_priv);
 	}
 
 	wait_for_completion(&fw_priv->completion);

^ permalink raw reply related

* Re: r8169 :  always copying the rx buffer to new skb
From: Francois Romieu @ 2011-04-18 18:21 UTC (permalink / raw)
  To: John Lumby; +Cc: netdev, nic_swsd
In-Reply-To: <4DAC7001.9060800@hotmail.com>

John Lumby <johnlumby@hotmail.com> :
> Summary  -  current r8169 always memcpy's every received buffer to new skb,
>             and I'd like to propose that it should not,
>             which can improve throughput / reduce CPU utilization by
> around 10%

Disclaimer: I'll read your suggestion more carefully when I am back home.
I am late and I could be off-topic.

> I think (not sure) this was related to a bug

Short answer: it's mostly related to CVE-2009-4537 (see git log).

I may resurrect some alternate fix - i.e. detect corrupted Tx descriptors
and reset before things gets wrong - but it is not easy to prove it right
since it may be necessary to tailor it for each member of the 816x / 810x
family. Some input from Realtek may help though.

-- 
Ueimor

^ permalink raw reply

* Re: [PATCH v2] net: r8169: convert to hw_features
From: Francois Romieu @ 2011-04-18 18:08 UTC (permalink / raw)
  To: David Dillow; +Cc: netdev, Realtek, davem
In-Reply-To: <1302718764.3167.7.camel@lap75545.ornl.gov>

[...]
> I've attached my ancient patch, if it helps.

Thanks, it works way better now (see below). It is ok for me to use
you s-o-b on it ?

I have given it a try on 8169, 8168b, 8168d, 8102e and the load can
dramatically fall (more at 1000 Mbps than at 100 Mbps).

David (M.), should it go through net-next or be made ready for net-2.6 ?

Subject: [PATCH] r8169: TSO fixes.

- the MSS value is actually contained in a 11 bits wide (0x7ff) field.
  The extra bit in the former MSSMask did encompass the TSO command
  bit ("LargeSend") as well (0xfff). Oops.

- the Tx descriptor layout is not the same through the whole chipset
  family. The 8169 documentation, the 8168c documentation and Realtek's
  drivers (8.020.00, 1.019.00, 6.014.00) highlight two layouts:
  1. 8169, 8168 up to 8168b (included) and 8101
  2. {8102e, 8168c} and beyond

- notwithstanding the "first descriptor" and "last descriptor" bits, the
  same Tx descriptor content is enforced when a packet consists of several
  descriptors. The chipsets are documented to require it.

Credits go to David Dillow <dave@thedillows.org> for the original patch.

Signed-off-by: Francois Romieu <romieu@fr.zoreil.com>
Cc: Realtek <nic_swsd@realtek.com>
---
 drivers/net/r8169.c |  209 +++++++++++++++++++++++++++++++++------------------
 1 files changed, 137 insertions(+), 72 deletions(-)

diff --git a/drivers/net/r8169.c b/drivers/net/r8169.c
index 058524f..fb03e6f 100644
--- a/drivers/net/r8169.c
+++ b/drivers/net/r8169.c
@@ -134,47 +134,52 @@ enum mac_version {
 	RTL_GIGA_MAC_VER_33 = 0x21, // 8168E
 };
 
-#define _R(NAME,MAC,MASK) \
-	{ .name = NAME, .mac_version = MAC, .RxConfigMask = MASK }
+enum rtl_tx_desc_version {
+	RTL_TD_0	= 0,
+	RTL_TD_1	= 1,
+};
+
+#define _R(NAME,MAC,TD) \
+	{ .name = NAME, .mac_version = MAC, .txd_version = TD }
 
 static const struct {
 	const char *name;
 	u8 mac_version;
-	u32 RxConfigMask;	/* Clears the bits supported by this chip */
+	enum rtl_tx_desc_version txd_version;
 } rtl_chip_info[] = {
-	_R("RTL8169",		RTL_GIGA_MAC_VER_01, 0xff7e1880), // 8169
-	_R("RTL8169s",		RTL_GIGA_MAC_VER_02, 0xff7e1880), // 8169S
-	_R("RTL8110s",		RTL_GIGA_MAC_VER_03, 0xff7e1880), // 8110S
-	_R("RTL8169sb/8110sb",	RTL_GIGA_MAC_VER_04, 0xff7e1880), // 8169SB
-	_R("RTL8169sc/8110sc",	RTL_GIGA_MAC_VER_05, 0xff7e1880), // 8110SCd
-	_R("RTL8169sc/8110sc",	RTL_GIGA_MAC_VER_06, 0xff7e1880), // 8110SCe
-	_R("RTL8102e",		RTL_GIGA_MAC_VER_07, 0xff7e1880), // PCI-E
-	_R("RTL8102e",		RTL_GIGA_MAC_VER_08, 0xff7e1880), // PCI-E
-	_R("RTL8102e",		RTL_GIGA_MAC_VER_09, 0xff7e1880), // PCI-E
-	_R("RTL8101e",		RTL_GIGA_MAC_VER_10, 0xff7e1880), // PCI-E
-	_R("RTL8168b/8111b",	RTL_GIGA_MAC_VER_11, 0xff7e1880), // PCI-E
-	_R("RTL8168b/8111b",	RTL_GIGA_MAC_VER_12, 0xff7e1880), // PCI-E
-	_R("RTL8101e",		RTL_GIGA_MAC_VER_13, 0xff7e1880), // PCI-E 8139
-	_R("RTL8100e",		RTL_GIGA_MAC_VER_14, 0xff7e1880), // PCI-E 8139
-	_R("RTL8100e",		RTL_GIGA_MAC_VER_15, 0xff7e1880), // PCI-E 8139
-	_R("RTL8168b/8111b",	RTL_GIGA_MAC_VER_17, 0xff7e1880), // PCI-E
-	_R("RTL8101e",		RTL_GIGA_MAC_VER_16, 0xff7e1880), // PCI-E
-	_R("RTL8168cp/8111cp",	RTL_GIGA_MAC_VER_18, 0xff7e1880), // PCI-E
-	_R("RTL8168c/8111c",	RTL_GIGA_MAC_VER_19, 0xff7e1880), // PCI-E
-	_R("RTL8168c/8111c",	RTL_GIGA_MAC_VER_20, 0xff7e1880), // PCI-E
-	_R("RTL8168c/8111c",	RTL_GIGA_MAC_VER_21, 0xff7e1880), // PCI-E
-	_R("RTL8168c/8111c",	RTL_GIGA_MAC_VER_22, 0xff7e1880), // PCI-E
-	_R("RTL8168cp/8111cp",	RTL_GIGA_MAC_VER_23, 0xff7e1880), // PCI-E
-	_R("RTL8168cp/8111cp",	RTL_GIGA_MAC_VER_24, 0xff7e1880), // PCI-E
-	_R("RTL8168d/8111d",	RTL_GIGA_MAC_VER_25, 0xff7e1880), // PCI-E
-	_R("RTL8168d/8111d",	RTL_GIGA_MAC_VER_26, 0xff7e1880), // PCI-E
-	_R("RTL8168dp/8111dp",	RTL_GIGA_MAC_VER_27, 0xff7e1880), // PCI-E
-	_R("RTL8168dp/8111dp",	RTL_GIGA_MAC_VER_28, 0xff7e1880), // PCI-E
-	_R("RTL8105e",		RTL_GIGA_MAC_VER_29, 0xff7e1880), // PCI-E
-	_R("RTL8105e",		RTL_GIGA_MAC_VER_30, 0xff7e1880), // PCI-E
-	_R("RTL8168dp/8111dp",	RTL_GIGA_MAC_VER_31, 0xff7e1880), // PCI-E
-	_R("RTL8168e/8111e",	RTL_GIGA_MAC_VER_32, 0xff7e1880), // PCI-E
-	_R("RTL8168e/8111e",	RTL_GIGA_MAC_VER_33, 0xff7e1880)  // PCI-E
+	_R("RTL8169",		RTL_GIGA_MAC_VER_01, RTL_TD_0), // 8169
+	_R("RTL8169s",		RTL_GIGA_MAC_VER_02, RTL_TD_0), // 8169S
+	_R("RTL8110s",		RTL_GIGA_MAC_VER_03, RTL_TD_0), // 8110S
+	_R("RTL8169sb/8110sb",	RTL_GIGA_MAC_VER_04, RTL_TD_0), // 8169SB
+	_R("RTL8169sc/8110sc",	RTL_GIGA_MAC_VER_05, RTL_TD_0), // 8110SCd
+	_R("RTL8169sc/8110sc",	RTL_GIGA_MAC_VER_06, RTL_TD_0), // 8110SCe
+	_R("RTL8102e",		RTL_GIGA_MAC_VER_07, RTL_TD_1), // PCI-E
+	_R("RTL8102e",		RTL_GIGA_MAC_VER_08, RTL_TD_1), // PCI-E
+	_R("RTL8102e",		RTL_GIGA_MAC_VER_09, RTL_TD_1), // PCI-E
+	_R("RTL8101e",		RTL_GIGA_MAC_VER_10, RTL_TD_0), // PCI-E
+	_R("RTL8168b/8111b",	RTL_GIGA_MAC_VER_11, RTL_TD_0), // PCI-E
+	_R("RTL8168b/8111b",	RTL_GIGA_MAC_VER_12, RTL_TD_0), // PCI-E
+	_R("RTL8101e",		RTL_GIGA_MAC_VER_13, RTL_TD_0), // PCI-E 8139
+	_R("RTL8100e",		RTL_GIGA_MAC_VER_14, RTL_TD_0), // PCI-E 8139
+	_R("RTL8100e",		RTL_GIGA_MAC_VER_15, RTL_TD_0), // PCI-E 8139
+	_R("RTL8168b/8111b",	RTL_GIGA_MAC_VER_17, RTL_TD_0), // PCI-E
+	_R("RTL8101e",		RTL_GIGA_MAC_VER_16, RTL_TD_0), // PCI-E
+	_R("RTL8168cp/8111cp",	RTL_GIGA_MAC_VER_18, RTL_TD_1), // PCI-E
+	_R("RTL8168c/8111c",	RTL_GIGA_MAC_VER_19, RTL_TD_1), // PCI-E
+	_R("RTL8168c/8111c",	RTL_GIGA_MAC_VER_20, RTL_TD_1), // PCI-E
+	_R("RTL8168c/8111c",	RTL_GIGA_MAC_VER_21, RTL_TD_1), // PCI-E
+	_R("RTL8168c/8111c",	RTL_GIGA_MAC_VER_22, RTL_TD_1), // PCI-E
+	_R("RTL8168cp/8111cp",	RTL_GIGA_MAC_VER_23, RTL_TD_1), // PCI-E
+	_R("RTL8168cp/8111cp",	RTL_GIGA_MAC_VER_24, RTL_TD_1), // PCI-E
+	_R("RTL8168d/8111d",	RTL_GIGA_MAC_VER_25, RTL_TD_1), // PCI-E
+	_R("RTL8168d/8111d",	RTL_GIGA_MAC_VER_26, RTL_TD_1), // PCI-E
+	_R("RTL8168dp/8111dp",	RTL_GIGA_MAC_VER_27, RTL_TD_1), // PCI-E
+	_R("RTL8168dp/8111dp",	RTL_GIGA_MAC_VER_28, RTL_TD_1), // PCI-E
+	_R("RTL8105e",		RTL_GIGA_MAC_VER_29, RTL_TD_1), // PCI-E
+	_R("RTL8105e",		RTL_GIGA_MAC_VER_30, RTL_TD_1), // PCI-E
+	_R("RTL8168dp/8111dp",	RTL_GIGA_MAC_VER_31, RTL_TD_1), // PCI-E
+	_R("RTL8168e/8111e",	RTL_GIGA_MAC_VER_32, RTL_TD_1), // PCI-E
+	_R("RTL8168e/8111e",	RTL_GIGA_MAC_VER_33, RTL_TD_1)  // PCI-E
 };
 #undef _R
 
@@ -230,6 +235,9 @@ enum rtl_registers {
 	IntrStatus	= 0x3e,
 	TxConfig	= 0x40,
 	RxConfig	= 0x44,
+
+#define RTL_RX_CONFIG_MASK		0xff7e1880u
+
 	RxMissed	= 0x4c,
 	Cfg9346		= 0x50,
 	Config0		= 0x51,
@@ -452,21 +460,69 @@ enum rtl_register_content {
 	CounterDump	= 0x8,
 };
 
-enum desc_status_bit {
+enum rtl_desc_bit {
+	/* First doubleword. */
 	DescOwn		= (1 << 31), /* Descriptor is owned by NIC */
 	RingEnd		= (1 << 30), /* End of descriptor ring */
 	FirstFrag	= (1 << 29), /* First segment of a packet */
 	LastFrag	= (1 << 28), /* Final segment of a packet */
+};
+
+/* Generic case. */
+enum rtl_tx_desc_bit {
+	/* First doubleword. */
+	TD_LSO		= (1 << 27),		/* Large Send Offload */
+#define TD_MSS_MAX			0x07ffu	/* MSS value */
 
-	/* Tx private */
-	LargeSend	= (1 << 27), /* TCP Large Send Offload (TSO) */
-	MSSShift	= 16,        /* MSS value position */
-	MSSMask		= 0xfff,     /* MSS value + LargeSend bit: 12 bits */
-	IPCS		= (1 << 18), /* Calculate IP checksum */
-	UDPCS		= (1 << 17), /* Calculate UDP/IP checksum */
-	TCPCS		= (1 << 16), /* Calculate TCP/IP checksum */
-	TxVlanTag	= (1 << 17), /* Add VLAN tag */
+	/* Second doubleword. */
+	TxVlanTag	= (1 << 17),		/* Add VLAN tag */
+};
+
+/* 8169, 8168b and 810x except 8102e. */
+enum rtl_tx_desc_bit_0 {
+	/* First doubleword. */
+#define TD0_MSS_SHIFT			16	/* MSS position (11 bits) */
+	TD0_TCP_CS	= (1 << 16),		/* Calculate TCP/IP checksum */
+	TD0_UDP_CS	= (1 << 17),		/* Calculate UDP/IP checksum */
+	TD0_IP_CS	= (1 << 18),		/* Calculate IP checksum */
+};
+
+/* 8102e, 8168c and beyond. */
+enum rtl_tx_desc_bit_1 {
+	/* Second doubleword. */
+#define TD1_MSS_SHIFT			18	/* MSS position (11 bits) */
+	TD1_IP_CS	= (1 << 29),		/* Calculate IP checksum */
+	TD1_TCP_CS	= (1 << 30),		/* Calculate TCP/IP checksum */
+	TD1_UDP_CS	= (1 << 31),		/* Calculate UDP/IP checksum */
+};
 
+static const struct rtl_tx_desc_info {
+	struct {
+		u32 udp;
+		u32 tcp;
+	} checksum;
+	u16 mss_shift;
+	u16 opts_offset;
+} tx_desc_info [] = {
+	[RTL_TD_0] = {
+		.checksum = {
+			.udp	= TD0_IP_CS | TD0_UDP_CS,
+			.tcp	= TD0_IP_CS | TD0_TCP_CS
+		},
+		.mss_shift	= TD0_MSS_SHIFT,
+		.opts_offset	= 0
+	},
+	[RTL_TD_1] = {
+		.checksum = {
+			.udp	= TD1_IP_CS | TD1_UDP_CS,
+			.tcp	= TD1_IP_CS | TD1_TCP_CS
+		},
+		.mss_shift	= TD1_MSS_SHIFT,
+		.opts_offset	= 1
+	}
+};
+
+enum rtl_rx_desc_bit {
 	/* Rx private */
 	PID1		= (1 << 18), /* Protocol ID bit 1/2 */
 	PID0		= (1 << 17), /* Protocol ID bit 2/2 */
@@ -531,8 +587,8 @@ struct rtl8169_private {
 	struct napi_struct napi;
 	spinlock_t lock;		/* spin lock flag */
 	u32 msg_enable;
-	int chipset;
-	int mac_version;
+	u16 txd_version;
+	u16 mac_version;
 	u32 cur_rx; /* Index into the Rx descriptor buffer of next Rx pkt. */
 	u32 cur_tx; /* Index into the Tx descriptor buffer of next Rx pkt. */
 	u32 dirty_rx;
@@ -1288,7 +1344,7 @@ static int rtl8169_set_settings(struct net_device *dev, struct ethtool_cmd *cmd)
 
 static u32 rtl8169_fix_features(struct net_device *dev, u32 features)
 {
-	if (dev->mtu > MSSMask)
+	if (dev->mtu > TD_MSS_MAX)
 		features &= ~NETIF_F_ALL_TSO;
 
 	return features;
@@ -3194,7 +3250,7 @@ rtl8169_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 	struct mii_if_info *mii;
 	struct net_device *dev;
 	void __iomem *ioaddr;
-	unsigned int i;
+	int chipset, i;
 	int rc;
 
 	if (netif_msg_drv(&debug)) {
@@ -3336,7 +3392,8 @@ rtl8169_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 			"driver bug, MAC version not found in rtl_chip_info\n");
 		goto err_out_msi_4;
 	}
-	tp->chipset = i;
+	chipset = i;
+	tp->txd_version = rtl_chip_info[chipset].txd_version;
 
 	RTL_W8(Cfg9346, Cfg9346_Unlock);
 	RTL_W8(Config1, RTL_R8(Config1) | PMEnable);
@@ -3413,8 +3470,7 @@ rtl8169_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 	pci_set_drvdata(pdev, dev);
 
 	netif_info(tp, probe, dev, "%s at 0x%lx, %pM, XID %08x IRQ %d\n",
-		   rtl_chip_info[tp->chipset].name,
-		   dev->base_addr, dev->dev_addr,
+		   rtl_chip_info[chipset].name, dev->base_addr, dev->dev_addr,
 		   (u32)(RTL_R32(TxConfig) & 0x9cf0f8ff), dev->irq);
 
 	if ((tp->mac_version == RTL_GIGA_MAC_VER_27) ||
@@ -3572,7 +3628,7 @@ static void rtl_set_rx_tx_config_registers(struct rtl8169_private *tp)
 	void __iomem *ioaddr = tp->mmio_addr;
 	u32 cfg = rtl8169_rx_config;
 
-	cfg |= (RTL_R32(RxConfig) & rtl_chip_info[tp->chipset].RxConfigMask);
+	cfg |= (RTL_R32(RxConfig) & RTL_RX_CONFIG_MASK);
 	RTL_W32(RxConfig, cfg);
 
 	/* Set DMA burst size and Interframe Gap Time */
@@ -4564,7 +4620,7 @@ static void rtl8169_tx_timeout(struct net_device *dev)
 }
 
 static int rtl8169_xmit_frags(struct rtl8169_private *tp, struct sk_buff *skb,
-			      u32 opts1)
+			      u32 *opts)
 {
 	struct skb_shared_info *info = skb_shinfo(skb);
 	unsigned int cur_frag, entry;
@@ -4592,9 +4648,11 @@ static int rtl8169_xmit_frags(struct rtl8169_private *tp, struct sk_buff *skb,
 		}
 
 		/* anti gcc 2.95.3 bugware (sic) */
-		status = opts1 | len | (RingEnd * !((entry + 1) % NUM_TX_DESC));
+		status = opts[0] | len |
+			(RingEnd * !((entry + 1) % NUM_TX_DESC));
 
 		txd->opts1 = cpu_to_le32(status);
+		txd->opts2 = cpu_to_le32(opts[1]);
 		txd->addr = cpu_to_le64(mapping);
 
 		tp->tx_skb[entry].len = len;
@@ -4612,23 +4670,26 @@ err_out:
 	return -EIO;
 }
 
-static inline u32 rtl8169_tso_csum(struct sk_buff *skb, struct net_device *dev)
+static inline void rtl8169_tso_csum(struct rtl8169_private *tp,
+				    struct sk_buff *skb, u32 *opts)
 {
+	const struct rtl_tx_desc_info *info = tx_desc_info + tp->txd_version;
 	u32 mss = skb_shinfo(skb)->gso_size;
+	int offset = info->opts_offset;
 
-	if (mss)
-		return LargeSend | ((mss & MSSMask) << MSSShift);
-
-	if (skb->ip_summed == CHECKSUM_PARTIAL) {
+	if (mss) {
+		opts[0] |= TD_LSO;
+		opts[offset] |= min(mss, TD_MSS_MAX) << info->mss_shift;
+	} else if (skb->ip_summed == CHECKSUM_PARTIAL) {
 		const struct iphdr *ip = ip_hdr(skb);
 
 		if (ip->protocol == IPPROTO_TCP)
-			return IPCS | TCPCS;
+			opts[offset] |= info->checksum.tcp;
 		else if (ip->protocol == IPPROTO_UDP)
-			return IPCS | UDPCS;
-		WARN_ON(1);	/* we need a WARN() */
+			opts[offset] |= info->checksum.udp;
+		else
+			WARN_ON_ONCE(1);
 	}
-	return 0;
 }
 
 static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
@@ -4641,7 +4702,7 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
 	struct device *d = &tp->pci_dev->dev;
 	dma_addr_t mapping;
 	u32 status, len;
-	u32 opts1;
+	u32 opts[2];
 	int frags;
 
 	if (unlikely(TX_BUFFS_AVAIL(tp) < skb_shinfo(skb)->nr_frags)) {
@@ -4662,24 +4723,28 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
 
 	tp->tx_skb[entry].len = len;
 	txd->addr = cpu_to_le64(mapping);
-	txd->opts2 = cpu_to_le32(rtl8169_tx_vlan_tag(tp, skb));
 
-	opts1 = DescOwn | rtl8169_tso_csum(skb, dev);
+	opts[1] = cpu_to_le32(rtl8169_tx_vlan_tag(tp, skb));
+	opts[0] = DescOwn;
 
-	frags = rtl8169_xmit_frags(tp, skb, opts1);
+	rtl8169_tso_csum(tp, skb, opts);
+
+	frags = rtl8169_xmit_frags(tp, skb, opts);
 	if (frags < 0)
 		goto err_dma_1;
 	else if (frags)
-		opts1 |= FirstFrag;
+		opts[0] |= FirstFrag;
 	else {
-		opts1 |= FirstFrag | LastFrag;
+		opts[0] |= FirstFrag | LastFrag;
 		tp->tx_skb[entry].skb = skb;
 	}
 
+	txd->opts2 = cpu_to_le32(opts[1]);
+
 	wmb();
 
 	/* anti gcc 2.95.3 bugware (sic) */
-	status = opts1 | len | (RingEnd * !((entry + 1) % NUM_TX_DESC));
+	status = opts[0] | len | (RingEnd * !((entry + 1) % NUM_TX_DESC));
 	txd->opts1 = cpu_to_le32(status);
 
 	tp->cur_tx += frags + 1;
@@ -5167,7 +5232,7 @@ static void rtl_set_rx_mode(struct net_device *dev)
 	spin_lock_irqsave(&tp->lock, flags);
 
 	tmp = rtl8169_rx_config | rx_mode |
-	      (RTL_R32(RxConfig) & rtl_chip_info[tp->chipset].RxConfigMask);
+	      (RTL_R32(RxConfig) & RTL_RX_CONFIG_MASK);
 
 	if (tp->mac_version > RTL_GIGA_MAC_VER_06) {
 		u32 data = mc_filter[0];
-- 
1.7.4.2


^ permalink raw reply related

* Re: Suspend/resume - slow resume
From: Francois Romieu @ 2011-04-18 18:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ciprian Docan, netdev, Linux Kernel Mailing List, Len Brown,
	Pavel Machek, Rafael, J. Wysocki
In-Reply-To: <BANLkTi=tPc6NmUuTYqbVZduNRNbE=-c34Q@mail.gmail.com>

Linus Torvalds <torvalds@linux-foundation.org> :
[lots of silly bugs]

Ok. it should be fixed for wednesday.

[...]
>  - unload not on close, but on device unregister (iow not when you do
> "ifconfig eth0 down", but when the "eth0" device really goes away)

Without further action, the firmware(s) will thus be locked in until the
driver is removed. 

> But as mentioned, the above is (a) just my gut feel - please discuss -
> and (b) I really think that leaving this to the network driver has
> been shown to continually result in the drivers doing the firmware
> load/unload in all the wrong places.

As long as it can be fixed... If the 60s delay is removed and the firmware
loading emits some messages for programmer barbie, I am more than happy.

If someone can tell me where Realtek's firmware should be sent to as David
W. seems to be busy, it will be perfect.

-- 
"Lack of vision" Ueimor

^ permalink raw reply

* Re: r8169 :  always copying the rx buffer to new skb
From: Ben Hutchings @ 2011-04-18 17:27 UTC (permalink / raw)
  To: John Lumby; +Cc: netdev
In-Reply-To: <4DAC7001.9060800@hotmail.com>

On Mon, 2011-04-18 at 13:08 -0400, John Lumby wrote:
> Summary  -  current r8169 always memcpy's every received buffer to new skb,
>              and I'd like to propose that it should not,
>              which can improve throughput / reduce CPU utilization by 
> around 10%

At least some variants of the hardware have a bug in RX DMA scatter that
can be exploited for denial-of-service.  The workaround is to allocate
huge RX buffers (16K).  Then, to avoid allocation failures later on (and
to save memory) the buffers must be copied rather than passed up the
stack and reallocated.

[...]
> As a second proposal (which I made before),  I'd like to suggest that 
> the number of rx buffers allocated at open should be configurable by 
> module param.
[...]

No, it should implement the ethtool set_ringparam operation.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* r8169 :  always copying the rx buffer to new skb
From: John Lumby @ 2011-04-18 17:08 UTC (permalink / raw)
  To: netdev

Summary  -  current r8169 always memcpy's every received buffer to new skb,
             and I'd like to propose that it should not,
             which can improve throughput / reduce CPU utilization by 
around 10%

In the patch
   2010-10-09 Stanislaw Gruszka r8169: allocate with GFP_KERNEL flag 
when able to sleep
the code for making a decision
     "shall I copy the buffer to new skb or unhook it from ring to pass 
up and allocate new"
based on a module param called rx_copybreak,  was removed,  and instead 
we now always allocate naked data buffers (i.e. not wrapped in skb) and 
always memcpy each one to a new skb to pass up to netif.

I think (not sure) this was related to a bug
   Bug 629158 Network adapter "disappears" after resuming from acpi suspend
although the change from using GFP_ATOMIC to using GFP_KERNEL for 
initial allocation
was made (I believe) in
   2010-04-30 Eric Dumazet r8169: Fix rtl8169_rx_interrupt()

So I am not entirely certain of the motivation for the removal of 
rx_copybreak and the "always memcpy".  But I believe it is not 
necessary,  at least not on all (most?) systems, and have measured 11% 
increase in throughput (aggregate Mbits/sec) on a heavy network 
benchmark on my own machine,  on 2.6.39-rc2 with rps/rfs enabled, by 
patching the code back to the 2.6.39 equivalent of rx_copybreak.

oprofile shows the improvement is all or mostly obtained from avoiding 
the memcpy'ing.  And of course since the memcpy'ing is done in the 
driver before the netif/rps gets it,  all that memcpy'ing is done on 
CPU0    (I think true on any SMP machine with an rtl8169?)

The measurements are :
                              aggregate (rx+tx) Mbits/sec through the NIC
     2.6.39-rc2 unpatched                                  918
     2.6.39-rc2   patched,  rx_copybreak=64               1026

packet rates fluctuate around 60K - 70K pkts/sec rx plus 60K - 70K 
pkts/sec tx

I would like to propose that :
   .  switch back to wrapping databuffers in skb's (so we have the 
option of copying or unhooking)
   .  re-introduce rx_copybreak module param so each sysadmin can 
control if wished.

That is my main proposal,    and I would be interested to hear thoughts 
on that.

As a second proposal (which I made before),  I'd like to suggest that 
the number of rx buffers allocated at open should be configurable by 
module param.     This is not needed for my other proposal but may help 
reduce the possibility of temporary shortage of buffers.     There is 
really no justification for assuming that 256 buffers is the correct 
number for all systems from a netbook to a 32-way server.     I ran my 
"patched" measurement with num_rx_buffs=128 and there were no alloc 
failures logged.     I would say that if there is any enthusiasm for the 
main avoid_memcpy proposal,   then it is worth the extra small effort to 
do the num_rx_buffs as well.

My current patch (including configurable num_rx_buffs) -
lines deleted   120
lines added     668

BTW if this proposal is acceptable,  I'm willing to do the patch work 
but I have only one machine with a r8169 (actually a RTL8168c) to test on.

John Lumby

^ permalink raw reply

* Re: Driver build-testing (was: [PATCH] net: dm9000: Fix build)
From: Mark Brown @ 2011-04-18 16:07 UTC (permalink / raw)
  To: Micha? Miros?aw; +Cc: netdev
In-Reply-To: <20110418145102.GA2555@rere.qmqm.pl>

On Mon, Apr 18, 2011 at 04:51:02PM +0200, Micha? Miros?aw wrote:

> I wonder if allyesconfig/allmodconfig is supposed to include code that's
> known not to work for a particular architecture.

Well, it's certainly supposed to include things that aren't *useful*
which is more the point here.  all*config is clearly building a
configuration which isn't terribly useful for any particular machine.

^ permalink raw reply

* Re: DSCP values in TCP handshake
From: Mikael Abrahamsson @ 2011-04-18 15:57 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Eric Dumazet, Joe Buehler, netdev
In-Reply-To: <20110418083827.05dd2d43@nehalam>

On Mon, 18 Apr 2011, Stephen Hemminger wrote:

> If you want to set DSCP in the response, the application needs to apply
> set it on the listening socket.
>   dscp = 0x2e;
>   setsockopt(s, IPPROTO_IP, IP_TOS, &dscp, sizeof(dscp));

<http://tools.ietf.org/html/rfc2873> says this is ok, but I would like 
default to be that if incoming SYN has a certain DSCP value, the SYN+ACK 
should mirror this value if the application doesn't explicitly set 
anything else.

I was under the impression that mirroring was done historically, but this 
has changed? Looking at how my apache server is behaving in 2.6.32, it 
seems it uses 0x0 for the whole TOS byte by default. I send it 0x20 and it 
responds with 0x0. SSH does the same thing.

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se

^ permalink raw reply

* ixgbe: about ixgbe_clean_rx_irq
From: wangjun @ 2011-04-18 15:51 UTC (permalink / raw)
  To: netdev

hi:
     I  am  learning  the codes of  ixgbe-3.3.8,  and  i wonder why
the packet is invalid  in ixgbe_main.c:1645 ,
     and  if no RSC nor Packet Split  is  configured ,  we receive
larage packet that may span several descriptors and  buffers,  what
will the code do in this scene.

thanks !


1645    if (ixgbe_close_active_frag_list(skb) && !pkt_is_rsc) {
1646    /* if we got here without RSC the packet is invalid */
1647    dev_kfree_skb_any(skb);
1648    goto next_desc;
1649   }

^ permalink raw reply

* Re: DSCP values in TCP handshake
From: Stephen Hemminger @ 2011-04-18 15:38 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Joe Buehler, netdev
In-Reply-To: <1303135512.3137.335.camel@edumazet-laptop>

On Mon, 18 Apr 2011 16:05:12 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> Le lundi 18 avril 2011 à 13:48 +0000, Joe Buehler a écrit :
> > What does the LINUX network code put in the DSCP field of TCP SYN-ACK packets
> > that are replies to a SYN connection request?  I have a customer sending SYN
> > with 0x2e in the DSCP field but apparently getting DSCP 0x0 in the reply.
> 
> The other way (server->client) is depending on application and listener
> only.
> 
> If it uses a standard socket, standard bind() + listen(), TOS is 0

If you want to set DSCP in the response, the application needs to apply
set it on the listening socket. 
   dscp = 0x2e;
   setsockopt(s, IPPROTO_IP, IP_TOS, &dscp, sizeof(dscp));

^ permalink raw reply

* Fw: [Bug 33502] New: Caught 64-bit read from uninitialized memory in __alloc_skb
From: Stephen Hemminger @ 2011-04-18 15:30 UTC (permalink / raw)
  To: netdev



Begin forwarded message:

Date: Sun, 17 Apr 2011 19:29:39 GMT
From: bugzilla-daemon@bugzilla.kernel.org
To: shemminger@linux-foundation.org
Subject: [Bug 33502] New: Caught 64-bit read from uninitialized memory in __alloc_skb


https://bugzilla.kernel.org/show_bug.cgi?id=33502

           Summary: Caught 64-bit read from uninitialized memory in
                    __alloc_skb
           Product: Networking
           Version: 2.5
    Kernel Version: 2.6.39-rc3
          Platform: All
        OS/Version: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: IPV4
        AssignedTo: shemminger@linux-foundation.org
        ReportedBy: casteyde.christian@free.fr
        Regression: Yes


Acer Aspire 1511LMi
Athlon 64 3GHz in 64bits mode
Slackware 64 13.1

Since 2.6.39-rc3 with kmemcheck enabled, I get the following warning:
...
pcmcia_socket pcmcia_socket0: cs: memory probe 0x0c0000-0x0fffff: excluding
0xc0000-0xfffff
pcmcia_socket pcmcia_socket0: cs: memory probe 0x60000000-0x60ffffff: excluding
0x60000000-0x
60ffffff
pcmcia_socket pcmcia_socket0: cs: memory probe 0xa0000000-0xa0ffffff: excluding
0xa0000000-0x
a0ffffff
udev: renamed network interface wlan0 to eth1
WARNING: kmemcheck: Caught 64-bit read from uninitialized memory
(ffff88001b0bb800)
00b00b1b0088ffff0000000000000000cafe1dea20009b0000299a3100000000
 u u u u u u u u u u u u u u u u u u u u u u u u u u u u u u u u
 ^

Pid: 1511, comm: udevd Not tainted 2.6.39-rc3 #1 Acer,Inc. Aspire 1510  /Aspire
1510
RIP: 0010:[<ffffffff810c2f0c>]  [<ffffffff810c2f0c>]
__kmalloc_track_caller+0xbc/0x1d0
RSP: 0018:ffff88001d3a7a18  EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000010 RCX: 000000000000284f
RDX: 000000000000284e RSI: ffff88001fe5b160 RDI: ffffffff8177e39a
RBP: ffff88001d3a7a48 R08: 0000000000000000 R09: ffff88001b931100
R10: 0000000000000000 R11: 0000000000000003 R12: ffff88001b0bb800
R13: ffff88001f803840 R14: 00000000000004d0 R15: ffffffff814769c6
FS:  00007f6ee81f1700(0000) GS:ffffffff81a1b000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffff88001d0b3938 CR3: 000000001d38b000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff4ff0 DR7: 0000000000000400
 [<ffffffff8147ccf2>] __alloc_skb+0x72/0x190
 [<ffffffff814769c6>] sock_alloc_send_pskb+0x236/0x3a0
 [<ffffffff81476b40>] sock_alloc_send_skb+0x10/0x20
 [<ffffffff81523c18>] unix_dgram_sendmsg+0x298/0x770
 [<ffffffff814715f3>] sock_sendmsg+0xe3/0x110
 [<ffffffff81472603>] sys_sendmsg+0x243/0x3c0
 [<ffffffff815e7238>] system_call_fastpath+0x16/0x1b
 [<ffffffffffffffff>] 0xffffffffffffffff
Adding 506012k swap on /dev/sda1.  Priority:-1 extents:1 across:506012k 
EXT4-fs (sda2): re-mounted. Opts: (null)
EXT4-fs (sda2): re-mounted. Opts: (null)
EXT3-fs: barriers not enabled
kjournald starting.  Commit interval 5 seconds
EXT3-fs (sda3): using internal journal
EXT3-fs (sda3): mounted filesystem with writeback data mode
b43-phy0: Loading firmware version 478.104 (2008-07-01 00:50:23)
ADDRCONF(NETDEV_UP): eth1: link is not ready
b43-phy0: Loading firmware version 478.104 (2008-07-01 00:50:23)
...

I cannot build 2.6.39-rc2, and in 2.6.39-rc1 I used to have another warning but
not this one.

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.


-- 

^ permalink raw reply

* Re: Driver build-testing (was: [PATCH] net: dm9000: Fix build)
From: Randy Dunlap @ 2011-04-18 15:28 UTC (permalink / raw)
  To: Micha? Miros?aw; +Cc: Mark Brown, netdev
In-Reply-To: <20110418151729.GA4721@rere.qmqm.pl>

On Mon, 18 Apr 2011 17:17:29 +0200 Micha? Miros?aw wrote:

> On Mon, Apr 18, 2011 at 07:55:08AM -0700, Randy Dunlap wrote:
> > On Mon, 18 Apr 2011 16:51:02 +0200 Micha? Miros?aw wrote:
> > 
> > > On Mon, Apr 18, 2011 at 01:25:22PM +0100, Mark Brown wrote:
> > > > On Mon, Apr 18, 2011 at 01:12:03PM +0200, Micha? Miros?aw wrote:
> > > > > This brings the issue of build testing effectiveness. In current code
> > > > > there is no configuration that makes all drivers build. I would like
> > > > > to see something like 'make brokenconfig' that would allow most of
> > > > > the code to be built, and not necessarily working. Maybe someone has
> > > > > an idea how to implement that?
> > > > For the drivers that genuinely are rather platform specific this tends
> > > > to fail very badly as you need headers that only come along with the
> > > > architecture.
> > > > 
> > > > In the case of DM9000 if it fails to build on your platform then the
> > > > driver is just buggy - looking at the Kconfig I rather suspect that the
> > > > dependency on architectures should just be removed.
> > > I wonder if allyesconfig/allmodconfig is supposed to include code that's
> > > known not to work for a particular architecture.
> > all*config just use whatever is in all of the various Kconfig* files.
> > If they say "depends on $somearch", then so be it.  If not, then the
> > remaining dependencies are used.
> 
> Yes, I know how it works. I just wonder if removing those dependencies so
> that all drivers (even if known not to work) are built on all*config
> is acceptable. Or maybe there should be a config like 'disable all drivers
> that are known to build but not to work on this arch'?

AFAIK, it's always the case that we prefer not to have
	depends on $somearch
for drivers, but the reality is that lots of them do depend on $ARCH
for header files etc., so they are listed that way.  There's not much
that we can do about that.  I don't think that removing those dependencies
is acceptable.  OTOH, it may be acceptable to enable CONFIG_BROKEN so that
the drivers that depend on BROKEN can try to be built.


---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***

^ permalink raw reply

* Re: Driver build-testing (was: [PATCH] net: dm9000: Fix build)
From: Micha? Miros?aw @ 2011-04-18 15:17 UTC (permalink / raw)
  To: Randy Dunlap; +Cc: Mark Brown, netdev
In-Reply-To: <20110418075508.f7b14f43.rdunlap@xenotime.net>

On Mon, Apr 18, 2011 at 07:55:08AM -0700, Randy Dunlap wrote:
> On Mon, 18 Apr 2011 16:51:02 +0200 Micha? Miros?aw wrote:
> 
> > On Mon, Apr 18, 2011 at 01:25:22PM +0100, Mark Brown wrote:
> > > On Mon, Apr 18, 2011 at 01:12:03PM +0200, Micha? Miros?aw wrote:
> > > > This brings the issue of build testing effectiveness. In current code
> > > > there is no configuration that makes all drivers build. I would like
> > > > to see something like 'make brokenconfig' that would allow most of
> > > > the code to be built, and not necessarily working. Maybe someone has
> > > > an idea how to implement that?
> > > For the drivers that genuinely are rather platform specific this tends
> > > to fail very badly as you need headers that only come along with the
> > > architecture.
> > > 
> > > In the case of DM9000 if it fails to build on your platform then the
> > > driver is just buggy - looking at the Kconfig I rather suspect that the
> > > dependency on architectures should just be removed.
> > I wonder if allyesconfig/allmodconfig is supposed to include code that's
> > known not to work for a particular architecture.
> all*config just use whatever is in all of the various Kconfig* files.
> If they say "depends on $somearch", then so be it.  If not, then the
> remaining dependencies are used.

Yes, I know how it works. I just wonder if removing those dependencies so
that all drivers (even if known not to work) are built on all*config
is acceptable. Or maybe there should be a config like 'disable all drivers
that are known to build but not to work on this arch'?

Best Regards,
Michał Mirosław

^ permalink raw reply

* Re: Driver build-testing (was: [PATCH] net: dm9000: Fix build)
From: Randy Dunlap @ 2011-04-18 14:55 UTC (permalink / raw)
  To: Micha? Miros?aw; +Cc: Mark Brown, netdev
In-Reply-To: <20110418145102.GA2555@rere.qmqm.pl>

On Mon, 18 Apr 2011 16:51:02 +0200 Micha? Miros?aw wrote:

> On Mon, Apr 18, 2011 at 01:25:22PM +0100, Mark Brown wrote:
> > On Mon, Apr 18, 2011 at 01:12:03PM +0200, Micha? Miros?aw wrote:
> > > On Mon, Apr 18, 2011 at 12:04:37PM +0100, Mark Brown wrote:
> > > > Commit c88fcb (net: dm9000: convert to hw_features) broke the build of
> > > > the dm9000 driver since it merged functions which use different names
> > > > for the board info structure used for I/O operations without updating
> > > > all the references to use the same name. Fix that.
> > > This brings the issue of build testing effectiveness. In current code
> > > there is no configuration that makes all drivers build. I would like
> > > to see something like 'make brokenconfig' that would allow most of
> > > the code to be built, and not necessarily working. Maybe someone has
> > > an idea how to implement that?
> > For the drivers that genuinely are rather platform specific this tends
> > to fail very badly as you need headers that only come along with the
> > architecture.
> > 
> > In the case of DM9000 if it fails to build on your platform then the
> > driver is just buggy - looking at the Kconfig I rather suspect that the
> > dependency on architectures should just be removed.
> 
> I wonder if allyesconfig/allmodconfig is supposed to include code that's
> known not to work for a particular architecture.

all*config just use whatever is in all of the various Kconfig* files.
If they say "depends on $somearch", then so be it.  If not, then the
remaining dependencies are used.

---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***

^ permalink raw reply

* Re: Driver build-testing (was: [PATCH] net: dm9000: Fix build)
From: Micha? Miros?aw @ 2011-04-18 14:51 UTC (permalink / raw)
  To: Mark Brown; +Cc: netdev
In-Reply-To: <20110418122522.GC9462@sirena.org.uk>

On Mon, Apr 18, 2011 at 01:25:22PM +0100, Mark Brown wrote:
> On Mon, Apr 18, 2011 at 01:12:03PM +0200, Micha? Miros?aw wrote:
> > On Mon, Apr 18, 2011 at 12:04:37PM +0100, Mark Brown wrote:
> > > Commit c88fcb (net: dm9000: convert to hw_features) broke the build of
> > > the dm9000 driver since it merged functions which use different names
> > > for the board info structure used for I/O operations without updating
> > > all the references to use the same name. Fix that.
> > This brings the issue of build testing effectiveness. In current code
> > there is no configuration that makes all drivers build. I would like
> > to see something like 'make brokenconfig' that would allow most of
> > the code to be built, and not necessarily working. Maybe someone has
> > an idea how to implement that?
> For the drivers that genuinely are rather platform specific this tends
> to fail very badly as you need headers that only come along with the
> architecture.
> 
> In the case of DM9000 if it fails to build on your platform then the
> driver is just buggy - looking at the Kconfig I rather suspect that the
> dependency on architectures should just be removed.

I wonder if allyesconfig/allmodconfig is supposed to include code that's
known not to work for a particular architecture.

Best Regards,
Michał Mirosław

^ permalink raw reply

* Re: [PATCH] net: myri10ge: convert to hw_features
From: Jon Mason @ 2011-04-18 14:24 UTC (permalink / raw)
  To: Michał Mirosław; +Cc: David Miller, netdev, gallatin, brice
In-Reply-To: <20110418110347.GA18324@rere.qmqm.pl>

On Mon, Apr 18, 2011 at 01:03:47PM +0200, Michał Mirosław wrote:
> On Sun, Apr 17, 2011 at 11:30:53PM -0700, David Miller wrote:
> > From: Jon Mason <jon.mason@myri.com>
> > Date: Fri, 15 Apr 2011 13:29:22 -0500
> > 
> > > On Fri, Apr 15, 2011 at 04:50:50PM +0200, Michał Mirosław wrote:
> > >> Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>

Acked-by: Jon Mason <jon.mason@myri.com>

> > >> ---
> > >>  drivers/net/myri10ge/myri10ge.c |   66 +++++++-------------------------------
> > >>  1 files changed, 12 insertions(+), 54 deletions(-)
> > >> 
> > >> diff --git a/drivers/net/myri10ge/myri10ge.c b/drivers/net/myri10ge/myri10ge.c
> > >> index 1446de5..a48eb92 100644
> > >> --- a/drivers/net/myri10ge/myri10ge.c
> > >> +++ b/drivers/net/myri10ge/myri10ge.c
> > >> @@ -205,7 +205,6 @@ struct myri10ge_priv {
> > >>  	int tx_boundary;	/* boundary transmits cannot cross */
> > >>  	int num_slices;
> > >>  	int running;		/* running?             */
> > >> -	int csum_flag;		/* rx_csums?            */
> > > Get rid of MXGEFW_FLAGS_CKSUM in drivers/net/myri10ge/myri10ge_mcp.h,
> > > as this was the only thing using it.
> >  ...
> > > ethtool_op_set_tso does not support TSO6.  This would remove the
> > > enable/disable of that feature.
> > Michał please fix these issues and resubmit this patch, thanks!
> 
> There are no issues. MXGEFW_FLAGS_CKSUM is used elsewhere in the driver
> and TSO6 is handled by masking netdev->hw_features at devinit time.
> 
> BTW, ethtool_op_set_tso() is not used at all in new offload changing scheme.
> 
> Best Regards,
> Michał Mirosław
> 

^ permalink raw reply

* Re: [PATCH 12/12] mm: Throttle direct reclaimers if PF_MEMALLOC reserves are low and swap is backed by network storage
From: Mel Gorman @ 2011-04-18 14:08 UTC (permalink / raw)
  To: NeilBrown; +Cc: Linux-MM, Linux-Netdev, LKML, Peter Zijlstra
In-Reply-To: <20110418223251.7ab148bb@notabene.brown>

On Mon, Apr 18, 2011 at 10:32:51PM +1000, NeilBrown wrote:
> On Thu, 14 Apr 2011 11:41:38 +0100 Mel Gorman <mgorman@suse.de> wrote:
> 
> > If swap is backed by network storage such as NBD, there is a risk that a
> > large number of reclaimers can hang the system by consuming all
> > PF_MEMALLOC reserves. To avoid these hangs, the administrator must tune
> > min_free_kbytes in advance. This patch will throttle direct reclaimers
> > if half the PF_MEMALLOC reserves are in use as the system is at risk of
> > hanging. A message will be displayed so the administrator knows that
> > min_free_kbytes should be tuned to a higher value to avoid the
> > throttling in the future.
> > 
> 
> (I knew there was something else).
> 
> I understand that there are suggestions that direct reclaim should always be
> serialised as this reduces lock contention and improve data patterns (or
> something like that).
> 

AFAIK, this suggestion never got much beyond the "hand-waving" stage
of development. It tended to trip up on the fact that such a feature
could also throttle processes on machines with plenty of free clean
unmapped pagecache which would be undesirable.

> Would that make this patch redundant? 

Depends on how it was being serialised but ....

> Or does this provide some extra
> guarantee that the other proposal would not?
> 

This patch could be extended to serialise direct reclaims in situations
other than PFMEMALLOC is low if someone demonstrated the benefit.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: DSCP values in TCP handshake
From: Eric Dumazet @ 2011-04-18 14:05 UTC (permalink / raw)
  To: Joe Buehler; +Cc: netdev
In-Reply-To: <loom.20110418T154445-102@post.gmane.org>

Le lundi 18 avril 2011 à 13:48 +0000, Joe Buehler a écrit :
> What does the LINUX network code put in the DSCP field of TCP SYN-ACK packets
> that are replies to a SYN connection request?  I have a customer sending SYN
> with 0x2e in the DSCP field but apparently getting DSCP 0x0 in the reply.

The other way (server->client) is depending on application and listener
only.

If it uses a standard socket, standard bind() + listen(), TOS is 0




^ permalink raw reply

* Re: [PATCH 12/12] mm: Throttle direct reclaimers if PF_MEMALLOC reserves are low and swap is backed by network storage
From: Mel Gorman @ 2011-04-18 14:04 UTC (permalink / raw)
  To: NeilBrown; +Cc: Linux-MM, Linux-Netdev, LKML, Peter Zijlstra
In-Reply-To: <20110418223014.22a6e490@notabene.brown>

On Mon, Apr 18, 2011 at 10:30:14PM +1000, NeilBrown wrote:
> On Thu, 14 Apr 2011 11:41:38 +0100 Mel Gorman <mgorman@suse.de> wrote:
> 
> > If swap is backed by network storage such as NBD, there is a risk that a
> > large number of reclaimers can hang the system by consuming all
> > PF_MEMALLOC reserves. To avoid these hangs, the administrator must tune
> > min_free_kbytes in advance. This patch will throttle direct reclaimers
> > if half the PF_MEMALLOC reserves are in use as the system is at risk of
> > hanging. A message will be displayed so the administrator knows that
> > min_free_kbytes should be tuned to a higher value to avoid the
> > throttling in the future.
> 
> This sounds like a much simpler approach than all the pre-allocation.
> Is it certain to work? 

It should - at least I haven't conceived of a situation where it would
fail yet nor have I triggered the throttling logic during tests. The
logic was tested with a debugging patch that set the throttling
level higher.

> Are PF_MEMALLOC reserved only used from direct
> reclaim?
> 

No. They are also used by kswapd and by a task that is being OOM killed.

> Is printing a message for the admin really a good idea? 

Ordinarily no but initially I wanted to make it brutually obvious
when throttling is hit and what got hit. Ultimately it's more likely
to be a tracepoint.

> Auto-tuning is much
> better than requiring the sysadmin to tune.

That requires the memory reservation and pre-allocation patches. To
keep complexity down, I wanted to treat that as a separate series.

> Is throttling when we are low on memory really a problem that needs to be
> tuned away? Presumably we would get over the memory shortage fairly soon and
> the throttling would stop (??).
> 

It depends on what the administrator wants really. If they don't care
about the stall, they can simply ignore the problem because as you say,
it should get resolved quickly and the throttled processes continue.

> > +	if (printk_ratelimit())
> > +		printk(KERN_INFO "Throttling %s due to reclaim pressure on "
> > +				 "network storage\n",
> > +			current->comm);
> > +	do {
> > +		prepare_to_wait(&zone->zone_pgdat->pfmemalloc_wait, &wait,
> > +							TASK_INTERRUPTIBLE);
> > +		schedule();
> > +		finish_wait(&zone->zone_pgdat->pfmemalloc_wait, &wait);
> > +	} while (!pfmemalloc_watermark_ok(zone->zone_pgdat, high_zoneidx) &&
> > +			!fatal_signal_pending(current));
> > +}
> > +
> 
> This looks racing.  It is my understanding that you should always perform the
> test between the 'prepare_to_wait' and the 'schedule'. Otherwise the wakeup
> could happen just before the prepare_to_wait and you never wake from the
> schedule.
> If that doesn't apply in this case I would appreciate a comment explaining
> why.
> 

You're right, it's racy. Well spotted.

> 
> 
> >  unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> >  				gfp_t gfp_mask, nodemask_t *nodemask)
> >  {
> > @@ -2131,6 +2188,8 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> >  		.nodemask = nodemask,
> >  	};
> >  
> > +	throttle_direct_reclaim(gfp_mask, zonelist, nodemask);
> > +
> >  	trace_mm_vmscan_direct_reclaim_begin(order,
> >  				sc.may_writepage,
> >  				gfp_mask);
> > @@ -2482,6 +2541,13 @@ loop_again:
> >  			}
> >  
> >  		}
> > +
> > +		/* Wake throttled direct reclaimers if low watermark is met */
> > +		if (sk_memalloc_socks() &&
> > +				waitqueue_active(&pgdat->pfmemalloc_wait) &&
> > +				pfmemalloc_watermark_ok(pgdat, MAX_NR_ZONES - 1))
> > +			wake_up_interruptible(&pgdat->pfmemalloc_wait);
> > +
> 
> This test on sk_memalloc_socks looks ugly to me.
> The VM shouldn't be checking on some networking state.
> Do we really need the test?  It is not reasonable to always throttle direct
> reclaim when mem gets really low?

It's a micro-optimisation. Throttling is not currently necessary
as backing storage such as local block devices have mempools in
place that avoid dipping into the PF_MEMALLOC reserves. On a normal
configuration, that waitqueue will simply never be active so I can
remove the sk_memalloc_socks() test.

What about in slab though? A function call in the fast path is avoided
by using the sk_memalloc_socks tests which is nice.

> If we do need the test - could networking set some global flag in the VM
> which the VM can then test.

I'd like to keep the test in slab at least but adding a new global flag
feels like a waste. I could add a VM wrapper around sk_memalloc_socks()
that would effectively be a rename but that doesn't seem much better.

> Maybe one day we will have something other than network which needs special
> care with the last dregs of memory - then it could set the global flag too
> (in which case it should probably be a global counter).
> 

When/if that happens, the naming would become more obvious. Right now,
it's network-related so doesn't seem unreasonable to have a
network-related check.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* DSCP values in TCP handshake
From: Joe Buehler @ 2011-04-18 13:48 UTC (permalink / raw)
  To: netdev

What does the LINUX network code put in the DSCP field of TCP SYN-ACK packets
that are replies to a SYN connection request?  I have a customer sending SYN
with 0x2e in the DSCP field but apparently getting DSCP 0x0 in the reply.

Joe Buehler



^ permalink raw reply

* Re: BUG: unable to handle kernel NULL pointer dereference at 000002c0 / IP: [<c04c70f2>] in6_dev_finish_destroy+0x35/0x8c
From: Patrick McHardy @ 2011-04-18 13:34 UTC (permalink / raw)
  To: Simon Arlott
  Cc: Eric Dumazet, Linux Kernel Mailing List, netdev,
	Netfilter Development Mailinglist
In-Reply-To: <4DA86FE5.8080507@simon.arlott.org.uk>

[-- Attachment #1: Type: text/plain, Size: 785 bytes --]

Am 15.04.2011 18:18, schrieb Simon Arlott:
> On 15/04/11 14:24, Eric Dumazet wrote:
>> Hmm.. a more complete patch :
>>
>> diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
>> index 0857272..6f0bed0 100644
> 
> I applied the patch by recompiling and then reloading the nf_conntrack_ipv6
> module (temporarily flushing and then restoring all ip6tables rules).
> Then this happened 10 minutes later:
> 
> [33876.950100] BUG: unable to handle kernel NULL pointer dereference at 00000014
> [33876.951060] IP: [<f9b012bb>] nf_ct_frag6_gather+0x864/0x881 [nf_conntrack_ipv6]

nf_ct_frag6_reasm() can return NULL, so we need to check for a non-NULL
ret_skb before trying to set the device.

Does this patch (based on Eric's second version) help?


[-- Attachment #2: x --]
[-- Type: text/plain, Size: 872 bytes --]

diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
index 0857272..b7ecfce 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -576,7 +576,9 @@ struct sk_buff *nf_ct_frag6_gather(struct sk_buff *skb, u32 user)
 	if (fq->q.last_in == (INET_FRAG_FIRST_IN | INET_FRAG_LAST_IN) &&
 	    fq->q.meat == fq->q.len) {
 		ret_skb = nf_ct_frag6_reasm(fq, dev);
-		if (ret_skb == NULL)
+		if (ret_skb != NULL)
+			ret_skb->dev = dev;
+		else
 			pr_debug("Can't reassemble fragmented packets\n");
 	}
 	spin_unlock_bh(&fq->q.lock);
@@ -602,7 +604,7 @@ void nf_ct_frag6_output(unsigned int hooknum, struct sk_buff *skb,
 
 		s2 = s->next;
 		s->next = NULL;
-
+		s->dev = in;
 		NF_HOOK_THRESH(NFPROTO_IPV6, hooknum, s, in, out, okfn,
 			       NF_IP6_PRI_CONNTRACK_DEFRAG + 1);
 		s = s2;

^ permalink raw reply related

* Re: [PATCH 12/12] mm: Throttle direct reclaimers if PF_MEMALLOC reserves are low and swap is backed by network storage
From: NeilBrown @ 2011-04-18 12:32 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Linux-MM, Linux-Netdev, LKML, Peter Zijlstra
In-Reply-To: <1302777698-28237-13-git-send-email-mgorman@suse.de>

On Thu, 14 Apr 2011 11:41:38 +0100 Mel Gorman <mgorman@suse.de> wrote:

> If swap is backed by network storage such as NBD, there is a risk that a
> large number of reclaimers can hang the system by consuming all
> PF_MEMALLOC reserves. To avoid these hangs, the administrator must tune
> min_free_kbytes in advance. This patch will throttle direct reclaimers
> if half the PF_MEMALLOC reserves are in use as the system is at risk of
> hanging. A message will be displayed so the administrator knows that
> min_free_kbytes should be tuned to a higher value to avoid the
> throttling in the future.
> 

(I knew there was something else).

I understand that there are suggestions that direct reclaim should always be
serialised as this reduces lock contention and improve data patterns (or
something like that).

Would that make this patch redundant?  Or does this provide some extra
guarantee that the other proposal would not?

Thanks again,

NeilBrown

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH 12/12] mm: Throttle direct reclaimers if PF_MEMALLOC reserves are low and swap is backed by network storage
From: NeilBrown @ 2011-04-18 12:30 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Linux-MM, Linux-Netdev, LKML, Peter Zijlstra
In-Reply-To: <1302777698-28237-13-git-send-email-mgorman@suse.de>

On Thu, 14 Apr 2011 11:41:38 +0100 Mel Gorman <mgorman@suse.de> wrote:

> If swap is backed by network storage such as NBD, there is a risk that a
> large number of reclaimers can hang the system by consuming all
> PF_MEMALLOC reserves. To avoid these hangs, the administrator must tune
> min_free_kbytes in advance. This patch will throttle direct reclaimers
> if half the PF_MEMALLOC reserves are in use as the system is at risk of
> hanging. A message will be displayed so the administrator knows that
> min_free_kbytes should be tuned to a higher value to avoid the
> throttling in the future.

This sounds like a much simpler approach than all the pre-allocation.
Is it certain to work?  Are PF_MEMALLOC reserved only used from direct
reclaim?

Is printing a message for the admin really a good idea?  Auto-tuning is much
better than requiring the sysadmin to tune.
Is throttling when we are low on memory really a problem that needs to be
tuned away?  Presumably we would get over the memory shortage fairly soon and
the throttling would stop (??).

> +	if (printk_ratelimit())
> +		printk(KERN_INFO "Throttling %s due to reclaim pressure on "
> +				 "network storage\n",
> +			current->comm);
> +	do {
> +		prepare_to_wait(&zone->zone_pgdat->pfmemalloc_wait, &wait,
> +							TASK_INTERRUPTIBLE);
> +		schedule();
> +		finish_wait(&zone->zone_pgdat->pfmemalloc_wait, &wait);
> +	} while (!pfmemalloc_watermark_ok(zone->zone_pgdat, high_zoneidx) &&
> +			!fatal_signal_pending(current));
> +}
> +

This looks racing.  It is my understanding that you should always perform the
test between the 'prepare_to_wait' and the 'schedule'.  Otherwise the wakeup
could happen just before the prepare_to_wait and you never wake from the
schedule.
If that doesn't apply in this case I would appreciate a comment explaining
why.



>  unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>  				gfp_t gfp_mask, nodemask_t *nodemask)
>  {
> @@ -2131,6 +2188,8 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>  		.nodemask = nodemask,
>  	};
>  
> +	throttle_direct_reclaim(gfp_mask, zonelist, nodemask);
> +
>  	trace_mm_vmscan_direct_reclaim_begin(order,
>  				sc.may_writepage,
>  				gfp_mask);
> @@ -2482,6 +2541,13 @@ loop_again:
>  			}
>  
>  		}
> +
> +		/* Wake throttled direct reclaimers if low watermark is met */
> +		if (sk_memalloc_socks() &&
> +				waitqueue_active(&pgdat->pfmemalloc_wait) &&
> +				pfmemalloc_watermark_ok(pgdat, MAX_NR_ZONES - 1))
> +			wake_up_interruptible(&pgdat->pfmemalloc_wait);
> +

This test on sk_memalloc_socks looks ugly to me.
The VM shouldn't be checking on some networking state.

Do we really need the test?  It is not reasonable to always throttle direct
reclaim when mem gets really low?
If we do need the test - could networking set some global flag in the VM
which the VM can then test.
Maybe one day we will have something other than network which needs special
care with the last dregs of memory - then it could set the global flag too
(in which case it should probably be a global counter).

Thanks,
NeilBrown


>  		if (all_zones_ok || (order && pgdat_balanced(pgdat, balanced, *classzone_idx)))
>  			break;		/* kswapd: all done */
>  		/*

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: Driver build-testing (was: [PATCH] net: dm9000: Fix build)
From: Mark Brown @ 2011-04-18 12:25 UTC (permalink / raw)
  To: Micha? Miros?aw; +Cc: netdev
In-Reply-To: <20110418111203.GB18324@rere.qmqm.pl>

On Mon, Apr 18, 2011 at 01:12:03PM +0200, Micha? Miros?aw wrote:
> On Mon, Apr 18, 2011 at 12:04:37PM +0100, Mark Brown wrote:
> > Commit c88fcb (net: dm9000: convert to hw_features) broke the build of
> > the dm9000 driver since it merged functions which use different names
> > for the board info structure used for I/O operations without updating
> > all the references to use the same name. Fix that.

> This brings the issue of build testing effectiveness. In current code
> there is no configuration that makes all drivers build. I would like
> to see something like 'make brokenconfig' that would allow most of
> the code to be built, and not necessarily working. Maybe someone has
> an idea how to implement that?

For the drivers that genuinely are rather platform specific this tends
to fail very badly as you need headers that only come along with the
architecture.

In the case of DM9000 if it fails to build on your platform then the
driver is just buggy - looking at the Kconfig I rather suspect that the
dependency on architectures should just be removed.

^ permalink raw reply

* Re: [PATCH v2 01/27] HFI: skeleton driver
From: Ben Hutchings @ 2011-04-18 12:23 UTC (permalink / raw)
  To: dykmanj
  Cc: netdev, Piyush Chaudhary, Fu-Chung Chang, William S. Cadden,
	Wen C. Chen, Scot Sakolish, Jian Xiao, Carol L. Soto,
	Sarah J. Sheppard
In-Reply-To: <1303096919-7367-2-git-send-email-dykmanj@linux.vnet.ibm.com>

On Sun, 2011-04-17 at 23:21 -0400, dykmanj@linux.vnet.ibm.com wrote:
> From: Jim Dykman <dykmanj@linux.vnet.ibm.com>
> 
> Device driver Makefile & Kconfig plumbing plus simple mod_init and mod_exit
[...]
> --- /dev/null
> +++ b/drivers/net/hfi/core/hfidd_init.c
[...]
> +#include <linux/version.h>

Never include <linux/version.h> in an in-tree driver.

[...]
> +static int __init hfidd_mod_init(void)
> +{
> +	int			rc = 0;
> +
> +	rc = hfidd_create_class();
> +	if (rc < 0) {
> +		printk(KERN_ERR "%s: hfidd_mod_init: hfidd_create_class failed"
> +			" rc=%d\n", HFIDD_DEV_NAME, rc);
> +		return -1;

Should be 'return rc'.

[...]
> --- /dev/null
> +++ b/include/linux/hfi/hfidd_client.h
[...]
> +#ifndef _HFIDD_CLIENT_H_
> +#define _HFIDD_CLIENT_H_
> +
> +#define MAX_TORRENTS            1
> +#define MAX_HFI_PER_TORRENT     2
> +#define MAX_HFIS                (MAX_TORRENTS * MAX_HFI_PER_TORRENT)
[...]

Are you sure you want to expose these values to userland?  You can never
change them later.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox