Netdev List
 help / color / mirror / Atom feed
* [PATCH] ipvs: Remove unused variable "cs" from ip_vs_leave function.
From: Krzysztof Wilczynski @ 2011-10-18 19:59 UTC (permalink / raw)
  To: Simon Horman; +Cc: Patrick McHardy, netdev

This is to address the following warning during compilation time:

  net/netfilter/ipvs/ip_vs_core.c: In function ‘ip_vs_leave’:
  net/netfilter/ipvs/ip_vs_core.c:532: warning: unused variable ‘cs’

This variable is indeed no longer in use.

Signed-off-by: Krzysztof Wilczynski <krzysztof.wilczynski@linux.com>
---
 net/netfilter/ipvs/ip_vs_core.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
index 00ea1ad..4f7d89d 100644
--- a/net/netfilter/ipvs/ip_vs_core.c
+++ b/net/netfilter/ipvs/ip_vs_core.c
@@ -529,7 +529,7 @@ int ip_vs_leave(struct ip_vs_service *svc, struct sk_buff *skb,
 	   a cache_bypass connection entry */
 	ipvs = net_ipvs(net);
 	if (ipvs->sysctl_cache_bypass && svc->fwmark && unicast) {
-		int ret, cs;
+		int ret;
 		struct ip_vs_conn *cp;
 		unsigned int flags = (svc->flags & IP_VS_SVC_F_ONEPACKET &&
 				      iph.protocol == IPPROTO_UDP)?
-- 
1.7.7

^ permalink raw reply related

* Re: [PATCH 6/7] mlx4_en: Adding rxhash support
From: Ben Hutchings @ 2011-10-18 19:35 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Yevgeny Petrilin, Eric Dumazet, davem@davemloft.net,
	netdev@vger.kernel.org
In-Reply-To: <20111018083606.6367ff50@nehalam.linuxnetplumber.net>

On Tue, 2011-10-18 at 08:36 -0700, Stephen Hemminger wrote:
> On Tue, 18 Oct 2011 08:59:44 +0000
> Yevgeny Petrilin <yevgenyp@mellanox.co.il> wrote:
> 
> > > 
> > > What is the gain using random values ?
> > > 
> > > Usually, we tend to have same hardware in a single machine, or we use
> > > active-backup bonding mode, and an active slave flip can change rxhash
> > > values with litle effect, since this happens not often.
> > > 
> > > I really prefer not random values, because it allows to have replayable
> > > configurations : For a given tcp flow, the same rxhash value is given
> > > and same cpu target in RPS. Its way easier to tune your machine for some workloads.
> > > 
> > 
> > There is no gain in random values, 
> > I'll make the change to have static value for RSS function.
> > 
> > We might consider how to ensure consistency across the different drivers in this aspect.
> 
> The key should be part of the network device core. Almost all hardware just
> implements the Microsoft standard, and if all drivers used same key they should
> come up with the same hash.

It should, but that's not enough.  The core also needs to be responsible
for initialising the hash indirection table, determining how many RX
queues to create, and IRQ affinity hints.

> Although using the same key all the time makes testing easier.
> The risk of using the same key is that it makes it easier for an attacker to
> create a set of addresses that all map to the same CPU which would make a DoS
> attack work better.  Therefore the key should be randomly generated at boot time.

If I understand correctly, the core of the Toeplitz hash functions is
(pretending we have a wide enough type called bigint):

u32 toeplitz_hash(bigint input, bigint key, unsigned width)
{
	u32 hash = 0;
	unsigned i;

	for (i = 0; i < width; i++)
		if (input & ((bigint)1 << i))
		        hash ^= key >> (1 + i);
	return hash;
}

This is hardly a cryptographic hash!  And while the key probably should
be random it should not just be random *bits*.  For example, if any 32
consecutive bits of the key are zero then 1 bit of input will have no
effect on the hash at all.

There was also a proposal a while back that we should try to make the
hash symmetric w.r.t. RX and TX addresses, so that both directions of a
flow through a router/bridge are aligned.  I think this was to be done
by repeating a 16-bit pattern across the key.  Not sure whether that's
worthwhile.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: [PATCH] Fix guest memory leak and panic
From: David Miller @ 2011-10-18 19:12 UTC (permalink / raw)
  To: krkumar2; +Cc: rusty, mst, Ian.Campbell, netdev, linux-kernel, virtualization
In-Reply-To: <20111018080523.16861.55402.sendpatchset@krkumar2.in.ibm.com>

From: Krishna Kumar <krkumar2@in.ibm.com>
Date: Tue, 18 Oct 2011 13:35:23 +0530

> Commit 86ee8130 ("virtionet: convert to SKB paged frag API")
> introduced a bug in guest. During RX testing, guest runs out
> of memory within seconds, causing oom-killer; which then
> panics the system: "Kernel panic - not syncing: Out of memory
> and no killable processes...". /proc/meminfo just before the
> panic shows MemFree is a few MB's:
> 
> 	MemFree:         1928544 kB	(starts here)
> 	...
> 	...
> 	MemFree:           27488 kB
> 	MemFree:           26248 kB
> 	MemFree:           24636 kB
> 	MemFree:           22632 kB
> 	MemFree:           19580 kB
> 	MemFree:           17928 kB
> 	MemFree:           15548 kB
> 		(Panic)
> 
> The extra reference to the fragment pages causes those pages to
> not get freed in skb_release_data(). The following patch fixes
> the bug. I have not checked if any other converted driver has
> the same issue.
> 
> Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>

I'll wait for Ian's full audit, but Krishna please use more appropriate
subject lines in future patch submissions.

This patch is fixing a problem in the virtio_net driver, so please
mention that: "Subject: [PATCH] virtio_net: Fix guest memory leak and panic"

^ permalink raw reply

* Re: [PATCH 6/7] mlx4_en: Adding rxhash support
From: Eric Dumazet @ 2011-10-18 19:05 UTC (permalink / raw)
  To: Jesse Brandeburg
  Cc: Stephen Hemminger, Yevgeny Petrilin, davem@davemloft.net,
	netdev@vger.kernel.org
In-Reply-To: <CAEuXFEwzoC-L7Agr3Ssq9M-QN2w+t=rxjVvt9oJwQ2pQw7To8w@mail.gmail.com>

Le mardi 18 octobre 2011 à 11:49 -0700, Jesse Brandeburg a écrit :
> On Tue, Oct 18, 2011 at 8:36 AM, Stephen Hemminger
> <shemminger@vyatta.com> wrote:
> > On Tue, 18 Oct 2011 08:59:44 +0000
> > Yevgeny Petrilin <yevgenyp@mellanox.co.il> wrote:
> >> There is no gain in random values,
> >> I'll make the change to have static value for RSS function.
> >>
> >> We might consider how to ensure consistency across the different drivers in this aspect.
> >
> > The key should be part of the network device core. Almost all hardware just
> > implements the Microsoft standard, and if all drivers used same key they should
> > come up with the same hash.
> >
> > Although using the same key all the time makes testing easier.
> > The risk of using the same key is that it makes it easier for an attacker to
> > create a set of addresses that all map to the same CPU which would make a DoS
> > attack work better.  Therefore the key should be randomly generated at boot time.
> 
> Stephen, I respectfully disagree with your position here.  The risk of
> using the same key is that a malicious user could target a particular
> queue with a DoS attack, but how is that different than any single
> queue device?  NAPI protects a single queue against (a network
> interrupt based) DoS.  I do not think we should be generating a random
> key at boot time, and because of the way NAPI mitigates load, we are
> okay.  The gain from from the far simpler setup (and reproducability)
> outweighs the risk until someone can show damage due to this
> theoretical DoS attack.

Note : This policy could be up to the admin :

1) We could let admin chose a known hash for reproducability

   ethtool .... rss_hash xxxxxxxx:yyyyyyyy:zzzzzzzz:....

2) We could have a 'rss_perturb N ' ethtool option, to randomly reshufle
things every N seconds, for people really afraid ;)

^ permalink raw reply

* Re: Problem with ixgbe and TX locked on one cpu
From: Jesse Brandeburg @ 2011-10-18 18:57 UTC (permalink / raw)
  To: Paweł Staszewski; +Cc: Linux Network Development list, e1000-devel
In-Reply-To: <4E988B1E.5000606@itcare.pl>

CC: e1000-devel

2011/10/14 Paweł Staszewski <pstaszewski@itcare.pl>:
> Hello
>
> I have weird problem with ixgbe and irq affinity / rx-tx queue assignment

what application are you running, how are you using ixgbe? looks like
a router.  is something changing the skb->rx_queue entry (like
netfilter) or is there a layered device above ixgbe (bonding or ...) ?

why do your interrupts move after a long period? did you do it by
hand? we recommend disabling irqbalance and hand tuning interrupts
possibly with the set_irq_affinity.sh script.

> Statistics for my ethernet - ixgbe driver:
> ethtool -S eth4
> NIC statistics:
>     rx_packets: 5815535848808
>     tx_packets: 5811202378421
>     rx_bytes: 4791001750842200
>     tx_bytes: 4781190419358301
>     rx_pkts_nic: 5815535848827
>     tx_pkts_nic: 5811202378510
>     rx_bytes_nic: 4837563124411799
>     tx_bytes_nic: 4829987507084013
>     lsc_int: 8
>     tx_busy: 0
>     non_eop_descs: 0
>     rx_errors: 0
>     tx_errors: 0
>     rx_dropped: 0
>     tx_dropped: 0
>     multicast: 92494273
>     broadcast: 268718206
>     rx_no_buffer_count: 28829
>     collisions: 0
>     rx_over_errors: 0
>     rx_crc_errors: 0
>     rx_frame_errors: 0
>     hw_rsc_aggregated: 0
>     hw_rsc_flushed: 0
>     fdir_match: 0
>     fdir_miss: 0
>     rx_fifo_errors: 0
>     rx_missed_errors: 307051074
>     tx_aborted_errors: 0
>     tx_carrier_errors: 0
>     tx_fifo_errors: 0
>     tx_heartbeat_errors: 0
>     tx_timeout_count: 0
>     tx_restart_queue: 15926219
>     rx_long_length_errors: 298
>     rx_short_length_errors: 0
>     tx_flow_control_xon: 0
>     rx_flow_control_xon: 0
>     tx_flow_control_xoff: 0
>     rx_flow_control_xoff: 0
>     rx_csum_offload_errors: 54173917
>     alloc_rx_page_failed: 0
>     alloc_rx_buff_failed: 0
>     rx_no_dma_resources: 0
>     tx_queue_0_packets: 68694825
>     tx_queue_0_bytes: 9443750332
>     tx_queue_1_packets: 8410961
>     tx_queue_1_bytes: 2527763233
>     tx_queue_2_packets: 14411252
>     tx_queue_2_bytes: 1317132394
>     tx_queue_3_packets: 15013508147
>     tx_queue_3_bytes: 17364767277348
>     tx_queue_4_packets: 62779891
>     tx_queue_4_bytes: 63476596221
>     tx_queue_5_packets: 11176001
>     tx_queue_5_bytes: 2763600253
>     tx_queue_6_packets: 4416357
>     tx_queue_6_bytes: 611874984
>     tx_queue_7_packets: 8933405
>     tx_queue_7_bytes: 1837198524
>     tx_queue_8_packets: 13292669
>     tx_queue_8_bytes: 3241333510
>     tx_queue_9_packets: 10747236
>     tx_queue_9_bytes: 1805109931
>     tx_queue_10_packets: 5795935258380
>     tx_queue_10_bytes: 4763725304722245
>     tx_queue_11_packets: 12073934
>     tx_queue_11_bytes: 2982743045
>     tx_queue_12_packets: 10523764
>     tx_queue_12_bytes: 2637451199
>     tx_queue_13_packets: 12480552
>     tx_queue_13_bytes: 2434827407
>     tx_queue_14_packets: 7401777
>     tx_queue_14_bytes: 2413618099
>     tx_queue_15_packets: 8269270
>     tx_queue_15_bytes: 2854359576
>     rx_queue_0_packets: 361373769507
>     rx_queue_0_bytes: 298565751248279
>     rx_queue_1_packets: 369901571908
>     rx_queue_1_bytes: 303414679798160
>     rx_queue_2_packets: 362508961738
>     rx_queue_2_bytes: 299852439447157
>     rx_queue_3_packets: 363449272013
>     rx_queue_3_bytes: 299738390792515
>     rx_queue_4_packets: 361876234461
>     rx_queue_4_bytes: 297483366939732
>     rx_queue_5_packets: 361402926316
>     rx_queue_5_bytes: 297633876486533
>     rx_queue_6_packets: 362261522767
>     rx_queue_6_bytes: 298026696344647
>     rx_queue_7_packets: 361248593301
>     rx_queue_7_bytes: 296756459279986
>     rx_queue_8_packets: 361654143416
>     rx_queue_8_bytes: 298272433659520
>     rx_queue_9_packets: 362781764710
>     rx_queue_9_bytes: 298804803191595
>     rx_queue_10_packets: 361386593064
>     rx_queue_10_bytes: 297434987797644
>     rx_queue_11_packets: 369886597895
>     rx_queue_11_bytes: 302353350171712
>     rx_queue_12_packets: 361582732276
>     rx_queue_12_bytes: 298670408005971
>     rx_queue_13_packets: 365248093536
>     rx_queue_13_bytes: 302573023878287
>     rx_queue_14_packets: 366571142073
>     rx_queue_14_bytes: 302396739276514
>     rx_queue_15_packets: 362401929830
>     rx_queue_15_bytes: 299024344526029
>
> The problem is with queue 10
>     tx_queue_10_packets: 5795935258380
>     tx_queue_10_bytes: 4763725304722245
>
> as you can see most of the queue processing is used in queue 10
> Average difference is 1,854271229903958e-6  - compared to other queues
>
> and the problem is that almost all TX packet processing is on one CPU
> cat /proc/interrupts - in attached file
>
> Is this driver or kernel problem ?
>
> Kernel is: 2.6.38.2
>
> ixgbe driver is:
> ethtool -i eth4
> driver: ixgbe
> version: 3.2.9-k2
> firmware-version: 1.12-2
> bus-info: 0000:04:00.0

^ permalink raw reply

* Re: [PATCH 6/7] mlx4_en: Adding rxhash support
From: Jesse Brandeburg @ 2011-10-18 18:49 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Yevgeny Petrilin, Eric Dumazet, davem@davemloft.net,
	netdev@vger.kernel.org
In-Reply-To: <20111018083606.6367ff50@nehalam.linuxnetplumber.net>

On Tue, Oct 18, 2011 at 8:36 AM, Stephen Hemminger
<shemminger@vyatta.com> wrote:
> On Tue, 18 Oct 2011 08:59:44 +0000
> Yevgeny Petrilin <yevgenyp@mellanox.co.il> wrote:
>> There is no gain in random values,
>> I'll make the change to have static value for RSS function.
>>
>> We might consider how to ensure consistency across the different drivers in this aspect.
>
> The key should be part of the network device core. Almost all hardware just
> implements the Microsoft standard, and if all drivers used same key they should
> come up with the same hash.
>
> Although using the same key all the time makes testing easier.
> The risk of using the same key is that it makes it easier for an attacker to
> create a set of addresses that all map to the same CPU which would make a DoS
> attack work better.  Therefore the key should be randomly generated at boot time.

Stephen, I respectfully disagree with your position here.  The risk of
using the same key is that a malicious user could target a particular
queue with a DoS attack, but how is that different than any single
queue device?  NAPI protects a single queue against (a network
interrupt based) DoS.  I do not think we should be generating a random
key at boot time, and because of the way NAPI mitigates load, we are
okay.  The gain from from the far simpler setup (and reproducability)
outweighs the risk until someone can show damage due to this
theoretical DoS attack.

^ permalink raw reply

* Re: Bug#645589: linux-image-3.0.0-2-amd64: sky2 rx errors on 3.0, 2.6.32 works
From: Stephen Hemminger @ 2011-10-18 18:13 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: 645589, Antti Salmela, netdev
In-Reply-To: <1318909386.3340.91.camel@deadeye>

On Tue, 18 Oct 2011 04:43:06 +0100
Ben Hutchings <ben@decadent.org.uk> wrote:

> On Mon, 2011-10-17 at 10:40 +0300, Antti Salmela wrote:
> > Package: linux-2.6
> > Version: 3.0.0-5
> > Severity: normal
> > 
> > 
> > sky2 loses packets on 3.0 (-3 and -5) and 3.1-rc7, 2.6.32-38 and
> > setting interface to promiscuous works.
> > 
> > [   60.118244] sky2 0000:02:00.0: eth0: rx error, status 0xb92100 length 185
> > [   62.664370] sky2 0000:02:00.0: eth0: rx error, status 0x602100 length 96
> > [   63.370051] sky2 0000:02:00.0: eth0: rx error, status 0x422100 length 66
> > [   63.714672] sky2 0000:02:00.0: eth0: rx error, status 0x722100 length 114
> > [   64.513458] device eth0 entered promiscuous mode
> 
> It looks like this is a bug in accounting of VLAN tags, though I don't
> see what difference promiscuous mode should make.
> 
> The log messages show that status has the VLAN flag (bit 13) set and the
> length field (bits 16:28) equals the length passed into sky2_receive(),
> but that function expects the length field to be greater by VLAN_HLEN.
> 
> This device is:
> 
> [...]
> > 02:00.0 Ethernet controller [0200]: Marvell Technology Group Ltd. 88E8053 PCI-E Gigabit Ethernet Controller [11ab:4362] (rev 19)
> > 	Subsystem: ASUSTeK Computer Inc. Marvell 88E8053 Gigabit Ethernet controller PCIe (Asus) [1043:8142]
> > 	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
> > 	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> > 	Latency: 0, Cache Line Size: 16 bytes
> > 	Interrupt: pin A routed to IRQ 43
> > 	Region 0: Memory at cdefc000 (64-bit, non-prefetchable) [size=16K]
> > 	Region 2: I/O ports at c800 [size=256]
> > 	Expansion ROM at cdec0000 [disabled] [size=128K]
> > 	Capabilities: <access denied>
> > 	Kernel driver in use: sky2
> [...]

The accounting is supposed to be:
   MAC = total length of packet (including vlan)
   DMA = bytes dma'd to buffer (does not include vlan)
Looks like the code is incorrect for the case where hardware
VLAN stripping is disabled.  What happens is that status bit
still has the VLAN flag, but DMA engine leaves the VLAN tag
in the DMA buffer so the check fails.

Proper accounting would involve more state machine mechanics
about whether VLAN tag has already been seen in current receive
status ring.

For now probably best to do something like:

--- net-next.orig/drivers/net/ethernet/marvell/sky2.c	2011-10-18 11:09:04.108683763 -0700
+++ net-next/drivers/net/ethernet/marvell/sky2.c	2011-10-18 11:09:53.661264323 -0700
@@ -2543,7 +2543,8 @@ static struct sk_buff *sky2_receive(stru
 	struct sk_buff *skb = NULL;
 	u16 count = (status & GMR_FS_LEN) >> 16;
 
-	if (status & GMR_FS_VLAN)
+	if ((dev->features & NETIF_F_HW_VLAN_RX) &&
+	    (status & GMR_FS_VLAN))
 		count -= VLAN_HLEN;	/* Account for vlan tag */
 
 	netif_printk(sky2, rx_status, KERN_DEBUG, dev,

^ permalink raw reply

* [RESEND] [PATCH] ll_temac: Add support for ethtool
From: Ricardo Ribalda Delgado @ 2011-10-18 17:55 UTC (permalink / raw)
  To: davem, grant.likely, sfr, u.kleine-koenig, netdev, linux-kernel
  Cc: Ricardo Ribalda Delgado

This patch enables the ethtool interface. The implementation is done
using the libphy helper functions.

Reviewed-by: Grant Likely <grant.likely@secretlab.ca>
Signed-off-by: Ricardo Ribalda Delgado <ricardo.ribalda@gmail.com>
---
 drivers/net/ll_temac_main.c |   27 +++++++++++++++++++++++++++
 1 files changed, 27 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ll_temac_main.c b/drivers/net/ll_temac_main.c
index 728fe41..91a9804 100644
--- a/drivers/net/ll_temac_main.c
+++ b/drivers/net/ll_temac_main.c
@@ -957,6 +957,32 @@ static const struct attribute_group temac_attr_group = {
 	.attrs = temac_device_attrs,
 };
 
+/* ethtool support */
+static int temac_get_settings(struct net_device *ndev, struct ethtool_cmd *cmd)
+{
+	struct temac_local *lp = netdev_priv(ndev);
+	return phy_ethtool_gset(lp->phy_dev, cmd);
+}
+
+static int temac_set_settings(struct net_device *ndev, struct ethtool_cmd *cmd)
+{
+	struct temac_local *lp = netdev_priv(ndev);
+	return phy_ethtool_sset(lp->phy_dev, cmd);
+}
+
+static int temac_nway_reset(struct net_device *ndev)
+{
+	struct temac_local *lp = netdev_priv(ndev);
+	return phy_start_aneg(lp->phy_dev);
+}
+
+static const struct ethtool_ops temac_ethtool_ops = {
+	.get_settings = temac_get_settings,
+	.set_settings = temac_set_settings,
+	.nway_reset = temac_nway_reset,
+	.get_link = ethtool_op_get_link,
+};
+
 static int __devinit temac_of_probe(struct platform_device *op)
 {
 	struct device_node *np;
@@ -978,6 +1004,7 @@ static int __devinit temac_of_probe(struct platform_device *op)
 	ndev->flags &= ~IFF_MULTICAST;  /* clear multicast */
 	ndev->features = NETIF_F_SG | NETIF_F_FRAGLIST;
 	ndev->netdev_ops = &temac_netdev_ops;
+	ndev->ethtool_ops= &temac_ethtool_ops;
 #if 0
 	ndev->features |= NETIF_F_IP_CSUM; /* Can checksum TCP/UDP over IPv4. */
 	ndev->features |= NETIF_F_HW_CSUM; /* Can checksum all the packets. */
-- 
1.7.7

^ permalink raw reply related

* Re: [PATCH 6/7] mlx4_en: Adding rxhash support
From: Stephen Hemminger @ 2011-10-18 15:36 UTC (permalink / raw)
  To: Yevgeny Petrilin
  Cc: Eric Dumazet, davem@davemloft.net, netdev@vger.kernel.org
In-Reply-To: <953B660C027164448AE903364AC447D2235EEC80@MTLDAG01.mtl.com>

On Tue, 18 Oct 2011 08:59:44 +0000
Yevgeny Petrilin <yevgenyp@mellanox.co.il> wrote:

> > 
> > What is the gain using random values ?
> > 
> > Usually, we tend to have same hardware in a single machine, or we use
> > active-backup bonding mode, and an active slave flip can change rxhash
> > values with litle effect, since this happens not often.
> > 
> > I really prefer not random values, because it allows to have replayable
> > configurations : For a given tcp flow, the same rxhash value is given
> > and same cpu target in RPS. Its way easier to tune your machine for some workloads.
> > 
> 
> There is no gain in random values, 
> I'll make the change to have static value for RSS function.
> 
> We might consider how to ensure consistency across the different drivers in this aspect.

The key should be part of the network device core. Almost all hardware just
implements the Microsoft standard, and if all drivers used same key they should
come up with the same hash.

Although using the same key all the time makes testing easier.
The risk of using the same key is that it makes it easier for an attacker to
create a set of addresses that all map to the same CPU which would make a DoS
attack work better.  Therefore the key should be randomly generated at boot time.

^ permalink raw reply

* Re: [patch] pktgen: bug when calling ndelay in x86 architectures
From: Eric Dumazet @ 2011-10-18 14:47 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Daniel Turull, David Miller, netdev, Robert Olsson,
	Voravit Tanyingyong, Jens Laas
In-Reply-To: <1318946401.23980.6.camel@deadeye>

Le mardi 18 octobre 2011 à 15:00 +0100, Ben Hutchings a écrit :

> AIUI, the reason for limits on delays is not that it's bad practice to
> spin for so long, but that the delay calculations may overflow or
> otherwise become inaccurate.

OK, I can understand that, then a more appropriate patch would be :


diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index 796044a..28bbf5b 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -2145,9 +2145,12 @@ static void spin(struct pktgen_dev *pkt_dev, ktime_t spin_until)
 	}
 
 	start_time = ktime_now();
-	if (remaining < 100000)
-		ndelay(remaining);	/* really small just spin */
-	else {
+	if (remaining < 100000) {
+		if (remaining >= 10000)
+			udelay(remaining/1000);
+		else
+			ndelay(remaining);
+	} else {
 		/* see do_nanosleep */
 		hrtimer_init_sleeper(&t, current);
 		do {

^ permalink raw reply related

* Re: [patch] pktgen: bug when calling ndelay in x86 architectures
From: Ben Hutchings @ 2011-10-18 14:00 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Daniel Turull, David Miller, netdev, Robert Olsson,
	Voravit Tanyingyong, Jens Laas
In-Reply-To: <1318939007.2657.57.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>

On Tue, 2011-10-18 at 13:56 +0200, Eric Dumazet wrote:
> Le mardi 18 octobre 2011 à 13:08 +0200, Daniel Turull a écrit :
> > The value selected to delay the transmission in pktgen with the ndelay function should be lower.
> > In Linux/arch/x86/include/asm/delay.h and Linux/arch/sh/include/asm/delay.h
> > the maximal expected value for a constant is 20000 ns.
> > 
> > Signed-off-by: Daniel Turull <daniel.turull@gmail.com>
> > ---
> > diff --git a/net/core/pktgen.c b/net/core/pktgen.c
> > index 796044a..e17bd41 100644
> > --- a/net/core/pktgen.c
> > +++ b/net/core/pktgen.c
> > @@ -2145,7 +2145,7 @@ static void spin(struct pktgen_dev *pkt_dev, ktime_t spin_until)
> >  	}
> >  
> >  	start_time = ktime_now();
> > -	if (remaining < 100000)
> > +	if (remaining < 20000)
> >  		ndelay(remaining);	/* really small just spin */
> >  	else {
> >  		/* see do_nanosleep */
> 
> But 'remaining' is not a constant.
> 
> If we want exactly 40.000 packets per second rate (25 us between
> packets), your patch makes this not quite possible without
> CONFIG_HIGH_RES_TIMERS and probable high jitter because of scheduler
> effects.
> 
> pktgen is kind of special, we _want_ a cpu for our exclusive use.

AIUI, the reason for limits on delays is not that it's bad practice to
spin for so long, but that the delay calculations may overflow or
otherwise become inaccurate.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* [PATCH 6/6 V2] mlx4_en: Updating driver version
From: Yevgeny Petrilin @ 2011-10-18 11:51 UTC (permalink / raw)
  To: davem; +Cc: netdev, yevgenyp


Driver version updated to 1.5.4.2

Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
---
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index a6b5cf6..fca6616 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -51,8 +51,8 @@
 #include "en_port.h"
 
 #define DRV_NAME	"mlx4_en"
-#define DRV_VERSION	"1.5.4.1"
-#define DRV_RELDATE	"March 2011"
+#define DRV_VERSION	"1.5.4.2"
+#define DRV_RELDATE	"October 2011"
 
 #define MLX4_EN_MSG_LEVEL	(NETIF_MSG_LINK | NETIF_MSG_IFDOWN)
 
-- 
1.7.7

^ permalink raw reply related

* [PATCH 5/6 V2] mlx4_en: Adding rxhash support
From: Yevgeny Petrilin @ 2011-10-18 11:51 UTC (permalink / raw)
  To: davem; +Cc: netdev, yevgenyp


Moving to Toeplitz function in RSS calculation.
Reporting rxhash in skb.

Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
---
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c |    2 +-
 drivers/net/ethernet/mellanox/mlx4/en_rx.c     |   12 ++++++++++++
 2 files changed, 13 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index c4c4be4..78d776b 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -1084,7 +1084,7 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int port,
 
 	dev->vlan_features = dev->hw_features;
 
-	dev->hw_features |= NETIF_F_RXCSUM;
+	dev->hw_features |= NETIF_F_RXCSUM | NETIF_F_RXHASH;
 	dev->features = dev->hw_features | NETIF_F_HIGHDMA |
 			NETIF_F_HW_VLAN_TX | NETIF_F_HW_VLAN_RX |
 			NETIF_F_HW_VLAN_FILTER;
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index eed2a0a..0e368af 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -618,6 +618,9 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 						__vlan_hwaccel_put_tag(gro_skb, vid);
 					}
 
+					if (dev->features & NETIF_F_RXHASH)
+						gro_skb->rxhash = be32_to_cpu(cqe->immed_rss_invalid);
+
 					skb_record_rx_queue(gro_skb, cq->ring);
 					napi_gro_frags(&cq->napi);
 
@@ -651,6 +654,9 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 		skb->protocol = eth_type_trans(skb, dev);
 		skb_record_rx_queue(skb, cq->ring);
 
+		if (dev->features & NETIF_F_RXHASH)
+			skb->rxhash = be32_to_cpu(cqe->immed_rss_invalid);
+
 		if (be32_to_cpu(cqe->vlan_my_qpn) &
 		    MLX4_CQE_VLAN_PRESENT_MASK)
 			__vlan_hwaccel_put_tag(skb, be16_to_cpu(cqe->sl_vid));
@@ -834,6 +840,9 @@ int mlx4_en_config_rss_steer(struct mlx4_en_priv *priv)
 	int i, qpn;
 	int err = 0;
 	int good_qps = 0;
+	static const u32 rsskey[10] = { 0xD181C62C, 0xF7F4DB5B, 0x1983A2FC,
+				0x943E1ADB, 0xD9389E6B, 0xD1039C2C, 0xA74499AD,
+				0x593D56D9, 0xF3253C06, 0x2ADC1FFC};
 
 	en_dbg(DRV, priv, "Configuring rss steering\n");
 	err = mlx4_qp_reserve_range(mdev->dev, priv->rx_ring_num,
@@ -871,6 +880,9 @@ int mlx4_en_config_rss_steer(struct mlx4_en_priv *priv)
 					    (rss_map->base_qpn));
 	rss_context->default_qpn = cpu_to_be32(rss_map->base_qpn);
 	rss_context->flags = rss_mask;
+	rss_context->hash_fn = 1;
+	for (i = 0; i < 10; i++)
+		rss_context->rss_key[i] = rsskey[i];
 
 	if (priv->mdev->profile.udp_rss)
 		rss_context->base_qpn_udp = rss_context->default_qpn;
-- 
1.7.7

^ permalink raw reply related

* [PATCH 4/6 V2] mlx4_en: Recording rx queue for gro packets
From: Yevgeny Petrilin @ 2011-10-18 11:51 UTC (permalink / raw)
  To: davem; +Cc: netdev, yevgenyp


Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 1231b21..eed2a0a 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -618,6 +618,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 						__vlan_hwaccel_put_tag(gro_skb, vid);
 					}
 
+					skb_record_rx_queue(gro_skb, cq->ring);
 					napi_gro_frags(&cq->napi);
 
 					goto next;
-- 
1.7.7

^ permalink raw reply related

* [PATCH 3/6 V2] mlx4_en: Checksum counters per ring
From: Yevgeny Petrilin @ 2011-10-18 11:50 UTC (permalink / raw)
  To: davem; +Cc: netdev, yevgenyp


Not updating common counters from data path.
The checksum counters are per ring, summarizing them when collecting statistics.

Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
---
 drivers/net/ethernet/mellanox/mlx4/en_port.c |    6 ++++++
 drivers/net/ethernet/mellanox/mlx4/en_rx.c   |    6 +++---
 drivers/net/ethernet/mellanox/mlx4/en_tx.c   |    2 +-
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |    3 +++
 4 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_port.c b/drivers/net/ethernet/mellanox/mlx4/en_port.c
index 9d27555..03c84cd 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_port.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_port.c
@@ -214,15 +214,21 @@ int mlx4_en_DUMP_ETH_STATS(struct mlx4_en_dev *mdev, u8 port, u8 reset)
 
 	stats->rx_packets = 0;
 	stats->rx_bytes = 0;
+	priv->port_stats.rx_chksum_good = 0;
+	priv->port_stats.rx_chksum_none = 0;
 	for (i = 0; i < priv->rx_ring_num; i++) {
 		stats->rx_packets += priv->rx_ring[i].packets;
 		stats->rx_bytes += priv->rx_ring[i].bytes;
+		priv->port_stats.rx_chksum_good += priv->rx_ring[i].csum_ok;
+		priv->port_stats.rx_chksum_none += priv->rx_ring[i].csum_none;
 	}
 	stats->tx_packets = 0;
 	stats->tx_bytes = 0;
+	priv->port_stats.tx_chksum_offload = 0;
 	for (i = 0; i < priv->tx_ring_num; i++) {
 		stats->tx_packets += priv->tx_ring[i].packets;
 		stats->tx_bytes += priv->tx_ring[i].bytes;
+		priv->port_stats.tx_chksum_offload += priv->tx_ring[i].tx_csum;
 	}
 
 	stats->rx_errors = be64_to_cpu(mlx4_en_stats->PCS) +
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index fbf1dcf..1231b21 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -587,7 +587,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 		if (likely(dev->features & NETIF_F_RXCSUM)) {
 			if ((cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPOK)) &&
 			    (cqe->checksum == cpu_to_be16(0xffff))) {
-				priv->port_stats.rx_chksum_good++;
+				ring->csum_ok++;
 				/* This packet is eligible for LRO if it is:
 				 * - DIX Ethernet (type interpretation)
 				 * - TCP/IP (v4)
@@ -627,11 +627,11 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 				ip_summed = CHECKSUM_UNNECESSARY;
 			} else {
 				ip_summed = CHECKSUM_NONE;
-				priv->port_stats.rx_chksum_none++;
+				ring->csum_none++;
 			}
 		} else {
 			ip_summed = CHECKSUM_NONE;
-			priv->port_stats.rx_chksum_none++;
+			ring->csum_none++;
 		}
 
 		skb = mlx4_en_rx_skb(priv, rx_desc, skb_frags,
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index 6e03de0..f199460 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -695,7 +695,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 	if (likely(skb->ip_summed == CHECKSUM_PARTIAL)) {
 		tx_desc->ctrl.srcrb_flags |= cpu_to_be32(MLX4_WQE_CTRL_IP_CSUM |
 							 MLX4_WQE_CTRL_TCP_UDP_CSUM);
-		priv->port_stats.tx_chksum_offload++;
+		ring->tx_csum++;
 	}
 
 	if (unlikely(priv->validate_loopback)) {
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index 3b753f7..a6b5cf6 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -249,6 +249,7 @@ struct mlx4_en_tx_ring {
 	struct mlx4_srq dummy;
 	unsigned long bytes;
 	unsigned long packets;
+	unsigned long tx_csum;
 	spinlock_t comp_lock;
 	struct mlx4_bf bf;
 	bool bf_enabled;
@@ -275,6 +276,8 @@ struct mlx4_en_rx_ring {
 	void *rx_info;
 	unsigned long bytes;
 	unsigned long packets;
+	unsigned long csum_ok;
+	unsigned long csum_none;
 };
 
 
-- 
1.7.7

^ permalink raw reply related

* [PATCH 2/6 V2] mlx4_en: Controlling FCS header removal
From: Yevgeny Petrilin @ 2011-10-18 11:50 UTC (permalink / raw)
  To: davem; +Cc: netdev, yevgenyp


Canceling FCS removal where FW allows for better alignment
of incoming data.

Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c |    4 ++++
 drivers/net/ethernet/mellanox/mlx4/fw.c    |    1 +
 include/linux/mlx4/device.h                |    1 +
 3 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 37cc9e5..fbf1dcf 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -806,6 +806,10 @@ static int mlx4_en_config_rss_qp(struct mlx4_en_priv *priv, int qpn,
 				qpn, ring->cqn, context);
 	context->db_rec_addr = cpu_to_be64(ring->wqres.db.dma);
 
+	/* Cancel FCS removal if FW allows */
+	if (mdev->dev->caps.flags & MLX4_DEV_CAP_FLAG_FCS_KEEP)
+		context->param3 |= cpu_to_be32(1 << 29);
+
 	err = mlx4_qp_to_ready(mdev->dev, &ring->wqres.mtt, context, qp, state);
 	if (err) {
 		mlx4_qp_remove(mdev->dev, qp);
diff --git a/drivers/net/ethernet/mellanox/mlx4/fw.c b/drivers/net/ethernet/mellanox/mlx4/fw.c
index 7eb8ba8..ed452dd 100644
--- a/drivers/net/ethernet/mellanox/mlx4/fw.c
+++ b/drivers/net/ethernet/mellanox/mlx4/fw.c
@@ -101,6 +101,7 @@ static void dump_dev_cap_flags(struct mlx4_dev *dev, u64 flags)
 		[25] = "Router support",
 		[30] = "IBoE support",
 		[32] = "Unicast loopback support",
+		[34] = "FCS header control",
 		[38] = "Wake On LAN support",
 		[40] = "UDP RSS support",
 		[41] = "Unicast VEP steering support",
diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h
index 53ef894..2366f94 100644
--- a/include/linux/mlx4/device.h
+++ b/include/linux/mlx4/device.h
@@ -75,6 +75,7 @@ enum {
 	MLX4_DEV_CAP_FLAG_UD_MCAST	= 1LL << 21,
 	MLX4_DEV_CAP_FLAG_IBOE		= 1LL << 30,
 	MLX4_DEV_CAP_FLAG_UC_LOOPBACK	= 1LL << 32,
+	MLX4_DEV_CAP_FLAG_FCS_KEEP	= 1LL << 34,
 	MLX4_DEV_CAP_FLAG_WOL		= 1LL << 38,
 	MLX4_DEV_CAP_FLAG_UDP_RSS	= 1LL << 40,
 	MLX4_DEV_CAP_FLAG_VEP_UC_STEER	= 1LL << 41,
-- 
1.7.7

^ permalink raw reply related

* [PATCH 1/6 V2] mlx4: Fix vlan table overflow
From: Yevgeny Petrilin @ 2011-10-18 11:50 UTC (permalink / raw)
  To: davem; +Cc: netdev, yevgenyp


Prevent overflow when trying to register more Vlans then the Vlan table in
HW is configured to.
Need to take into acount that the first 2 entries are reserved.

Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
---
 drivers/net/ethernet/mellanox/mlx4/port.c |   15 ++++++++-------
 1 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/port.c b/drivers/net/ethernet/mellanox/mlx4/port.c
index 609e0ec..163a314 100644
--- a/drivers/net/ethernet/mellanox/mlx4/port.c
+++ b/drivers/net/ethernet/mellanox/mlx4/port.c
@@ -65,7 +65,7 @@ void mlx4_init_vlan_table(struct mlx4_dev *dev, struct mlx4_vlan_table *table)
 		table->entries[i] = 0;
 		table->refs[i]	 = 0;
 	}
-	table->max   = 1 << dev->caps.log_num_vlans;
+	table->max   = (1 << dev->caps.log_num_vlans) - MLX4_VLAN_REGULAR;
 	table->total = 0;
 }
 
@@ -354,6 +354,13 @@ int mlx4_register_vlan(struct mlx4_dev *dev, u8 port, u16 vlan, int *index)
 	int free = -1;
 
 	mutex_lock(&table->mutex);
+
+	if (table->total == table->max) {
+		/* No free vlan entries */
+		err = -ENOSPC;
+		goto out;
+	}
+
 	for (i = MLX4_VLAN_REGULAR; i < MLX4_MAX_VLAN_NUM; i++) {
 		if (free < 0 && (table->refs[i] == 0)) {
 			free = i;
@@ -375,12 +382,6 @@ int mlx4_register_vlan(struct mlx4_dev *dev, u8 port, u16 vlan, int *index)
 		goto out;
 	}
 
-	if (table->total == table->max) {
-		/* No free vlan entries */
-		err = -ENOSPC;
-		goto out;
-	}
-
 	/* Register new MAC */
 	table->refs[free] = 1;
 	table->entries[free] = cpu_to_be32(vlan | MLX4_VLAN_VALID);
-- 
1.7.7

^ permalink raw reply related

* [PATCH 0/6 V2] mlx4_en: Data path optimizations and bug fix
From: Yevgeny Petrilin @ 2011-10-18 11:50 UTC (permalink / raw)
  To: davem; +Cc: netdev, yevgenyp

Hello,
This is V2 of the patches sent yesterday.
The differences comparing to V1 are:
1. Remove patch 3/7 : "Incoming traffic alignment optimizations" following Eric's comments for a rework.
2. Using static seed value for Toeplitz function, rather then randomizing the value each time

 drivers/net/ethernet/mellanox/mlx4/en_netdev.c |    2 +-
 drivers/net/ethernet/mellanox/mlx4/en_port.c   |    6 ++++++
 drivers/net/ethernet/mellanox/mlx4/en_rx.c     |   23 ++++++++++++++++++++---
 drivers/net/ethernet/mellanox/mlx4/en_tx.c     |    2 +-
 drivers/net/ethernet/mellanox/mlx4/fw.c        |    1 +
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |    7 +++++--
 drivers/net/ethernet/mellanox/mlx4/port.c      |   15 ++++++++-------
 include/linux/mlx4/device.h                    |    1 +


Thanks,
Yevgeny

^ permalink raw reply

* Re: PROBLEM: System call 'sendmsg' of process ospfd (quagga) causes kernel oops
From: Herbert Xu @ 2011-10-18 13:45 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Elmar Vonlanthen, linux-kernel, netdev, Timo Teräs
In-Reply-To: <1318942560.2657.69.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>

On Tue, Oct 18, 2011 at 02:56:00PM +0200, Eric Dumazet wrote:
>
> I am ok by this way, but we might hit another similar problem elsewhere.
> 
> (igmp.c ip6_output, ...)
> 
> We effectively want to remove LL_ALLOCATED_SPACE() usage and obfuscate
> code...

Here's another idea, provide a helper to do the skb allocation
and the skb_reserve in one go.  That way this ugliness would only
need to be done once.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: Possible via-velocity bug.
From: Eric Dumazet @ 2011-10-18 13:38 UTC (permalink / raw)
  To: n00bsys0p; +Cc: linux-kernel, netdev
In-Reply-To: <j7js28$u0m$1@dough.gmane.org>

Le mardi 18 octobre 2011 à 13:39 +0100, n00bsys0p a écrit :
> Hi everyone, I have a nice simple question.
> 
> I've found what I think is a bug in the 3.0+ kernel's via-velocity LAN 
> driver, which causes the leaking of data into the video RAM during PXE boot.
> 
> How do I go about reporting this to the relevant people?
> 
> Thanks in advance!

Here and netdev is fine.

^ permalink raw reply

* Re: [PATCH] Fix guest memory leak and panic
From: Krishna Kumar2 @ 2011-10-18 13:22 UTC (permalink / raw)
  To: Ian Campbell
  Cc: davem@davemloft.net, linux-kernel@vger.kernel.org, mst@redhat.com,
	netdev@vger.kernel.org, rusty@rustcorp.com.au,
	virtualization@lists.linux-foundation.org
In-Reply-To: <1318935055.3385.0.camel@zakaz.uk.xensource.com>

Ian Campbell <Ian.Campbell@citrix.com> wrote on 10/18/2011 04:20:55 PM:

> > > Sigh, only one out of the ten callers of (__)skb_frag_set_page
expects
> > > skb_frag_set_page to take a new reference. I think that's pretty
> > > comprehensive evidence that the current behaviour is unexpected and
> > > wrong.
> >
> > Looks good!
>
> Thanks.

Tested this patch for virtio_net - hang/panic is fixed.
MemFree also doesn't reduce at end of test compared to
the start.

Tested-by: krkumar2@in.ibm.com

- KK

> > Does it make sense to commit both of these patches? The
> > reason being - my patch becomes a cleanup of set_skb_frag()
> > in virtio_net driver.
>
> I think your patch remains a valid cleanup but I don't maintain that
> code.
>
> Ian.
>
>

^ permalink raw reply

* Re: [PATCH] Fix guest memory leak and panic
From: Eric Dumazet @ 2011-10-18 13:20 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Krishna Kumar, rusty@rustcorp.com.au, mst@redhat.com,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	virtualization@lists.linux-foundation.org, davem@davemloft.net
In-Reply-To: <1318931232.16132.65.camel@zakaz.uk.xensource.com>

Le mardi 18 octobre 2011 à 10:47 +0100, Ian Campbell a écrit :
> On Tue, 2011-10-18 at 09:36 +0100, Ian Campbell wrote:
> > I think the best thing might be to remove the additional ref taking from
> > the setter function and audit the previous changes to ensure they
> > conform. I'll do that right away and post a fixup patch ASAP.
> 
> Sigh, only one out of the ten callers of (__)skb_frag_set_page expects
> skb_frag_set_page to take a new reference. I think that's pretty
> comprehensive evidence that the current behaviour is unexpected and
> wrong. 
> 
> Sorry about this.
> 
> Ian.
> 
> 8<---------------------------------------------------------
> 
> From 42c26b7ca640bd5cb6f9c3bc76db96c92ed8ff82 Mon Sep 17 00:00:00 2001
> From: Ian Campbell <ian.campbell@citrix.com>
> Date: Tue, 18 Oct 2011 09:59:37 +0100
> Subject: [PATCH] net: do not take an additional reference in skb_frag_set_page
> 
> I audited all of the callers in the tree and only one of them (pktgen) expects
> it to do so. Taking this reference is pretty obviously confusing and error
> prone.
> 
> In particular I looked at the following commits which switched callers of
> (__)skb_frag_set_page to the skb paged fragment api:
> 
> 6a930b9f163d7e6d9ef692e05616c4ede65038ec cxgb3: convert to SKB paged frag API.
> 5dc3e196ea21e833128d51eb5b788a070fea1f28 myri10ge: convert to SKB paged frag API.
> 0e0634d20dd670a89af19af2a686a6cce943ac14 vmxnet3: convert to SKB paged frag API.
> 86ee8130a46769f73f8f423f99dbf782a09f9233 virtionet: convert to SKB paged frag API.
> 4a22c4c919c201c2a7f4ee09e672435a3072d875 sfc: convert to SKB paged frag API.
> 18324d690d6a5028e3c174fc1921447aedead2b8 cassini: convert to SKB paged frag API.
> b061b39e3ae18ad75466258cf2116e18fa5bbd80 benet: convert to SKB paged frag API.
> b7b6a688d217936459ff5cf1087b2361db952509 bnx2: convert to SKB paged frag API.
> 804cf14ea5ceca46554d5801e2817bba8116b7e5 net: xfrm: convert to SKB frag APIs
> ea2ab69379a941c6f8884e290fdd28c93936a778 net: convert core to skb paged frag APIs
> 
> Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
> ---
>  include/linux/skbuff.h |    1 -
>  net/core/pktgen.c      |    1 +
>  2 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 64f8695..78741da 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -1765,7 +1765,6 @@ static inline void *skb_frag_address_safe(const skb_frag_t *frag)
>  static inline void __skb_frag_set_page(skb_frag_t *frag, struct page *page)
>  {
>  	frag->page = page;
> -	__skb_frag_ref(frag);
>  }
>  
>  /**
> diff --git a/net/core/pktgen.c b/net/core/pktgen.c
> index 796044a..c4effd4 100644
> --- a/net/core/pktgen.c
> +++ b/net/core/pktgen.c
> @@ -2603,6 +2603,7 @@ static void pktgen_finalize_skb(struct pktgen_dev *pkt_dev, struct sk_buff *skb,
>  					break;
>  			}
>  			skb_frag_set_page(skb, i, pkt_dev->page);
> +			skb_frag_ref(skb, i);
>  			skb_shinfo(skb)->frags[i].page_offset = 0;
>  			/*last fragment, fill rest of data*/
>  			if (i == (frags - 1))

I am ok with this patch, since it makes sense to let driver do the
page->_count change itself (it can use a batched add() in case a page is
splitted in many frags)

But I suggest using get_page(pkt_dev->page), this seems more obvious to
me [ This was how I wrote the thing ;) ]

^ permalink raw reply

* [PATCH] net: Fix driver name for mdio-gpio.c
From: Dirk Eibach @ 2011-10-18 13:04 UTC (permalink / raw)
  To: linux-kernel; +Cc: netdev, grant.likely, Dirk Eibach

Since commit
"7488876... dt/net: Eliminate users of of_platform_{,un}register_driver"
there are two platform drivers named "mdio-gpio" registered.
I renamed the of variant to "mdio-ofgpio".

Signed-off-by: Dirk Eibach <eibach@gdsys.de>
---
 drivers/net/phy/mdio-gpio.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/phy/mdio-gpio.c b/drivers/net/phy/mdio-gpio.c
index f2d122f..3d66d9f 100644
--- a/drivers/net/phy/mdio-gpio.c
+++ b/drivers/net/phy/mdio-gpio.c
@@ -241,7 +241,7 @@ MODULE_DEVICE_TABLE(of, mdio_ofgpio_match);
 
 static struct platform_driver mdio_ofgpio_driver = {
 	.driver = {
-		.name = "mdio-gpio",
+		.name = "mdio-ofgpio",
 		.owner = THIS_MODULE,
 		.of_match_table = mdio_ofgpio_match,
 	},
-- 
1.5.6.5

^ permalink raw reply related

* Re: PROBLEM: System call 'sendmsg' of process ospfd (quagga) causes kernel oops
From: Eric Dumazet @ 2011-10-18 12:56 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Elmar Vonlanthen, linux-kernel, netdev, Timo Teräs
In-Reply-To: <20111018114945.GA16359@gondor.apana.org.au>

Le mardi 18 octobre 2011 à 13:49 +0200, Herbert Xu a écrit :
> On Tue, Oct 18, 2011 at 01:37:58PM +0200, Eric Dumazet wrote:
> >
> > In the bug we try to fix, we have :
> > 
> > skb = sock_alloc_send_skb(sk, ... + LL_ALLOCATED_SPACE(rt->dst.dev) 
> > 
> > ... < increase of dev->needed_headroom by another cpu/task >
> > 
> > skb_reserve(skb, LL_RESERVED_SPACE(rt->dst.dev));
> 
> OK, in that case one fix would be to replace LL_ALLOCATED_SPACE
> with its two constiuents so that they may be stored in local
> variables for later use.
> 
> 	hlen = LL_HEADROOM(skb);
> 	tlen = LL_TAILROOM(skb);
> 	skb_alloc_send_skb(sk, ... + LL_ALIGN(hlen + tlen));
> 
> 	skb_reserve(skb, LL_ALIGN(hlen));
> 
> Cheers,

I am ok by this way, but we might hit another similar problem elsewhere.

(igmp.c ip6_output, ...)

We effectively want to remove LL_ALLOCATED_SPACE() usage and obfuscate
code...


[PATCH] raw: allow dev->needed_headroom dynamic change

It seems ip_gre is able to change dev->needed_headroom on the fly.

It triggers a BUG in raw_sendmsg()

skb = sock_alloc_send_skb(sk, ... + LL_ALLOCATED_SPACE(rt->dst.dev) 

< another cpu change dev->needed_headromm (making it bigger)

...
skb_reserve(skb, LL_RESERVED_SPACE(rt->dst.dev));

We end with LL_RESERVED_SPACE() being bigger than LL_ALLOCATED_SPACE()
-> we crash later because skb head is exhausted.

Bug introduced in commit 243aad83 in 2.6.34 (ip_gre: include route
header_len in max_headroom calculation)

Reported-by: Reported-by: Elmar Vonlanthen <evonlanthen@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Timo Teräs <timo.teras@iki.fi>
CC: Herbert Xu <herbert@gondor.apana.org.au>
---
 include/linux/netdevice.h |   10 +++++++---
 net/ipv4/raw.c            |    9 +++++++--
 net/ipv6/raw.c            |    9 +++++++--
 3 files changed, 21 insertions(+), 7 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ddee79b..dba2399 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -276,12 +276,16 @@ struct hh_cache {
  * LL_ALLOCATED_SPACE also takes into account the tailroom the device
  * may need.
  */
+#define LL_ALIGN(__len) (((__len)&~(HH_DATA_MOD - 1)) + HH_DATA_MOD)
+
 #define LL_RESERVED_SPACE(dev) \
-	((((dev)->hard_header_len+(dev)->needed_headroom)&~(HH_DATA_MOD - 1)) + HH_DATA_MOD)
+	LL_ALIGN((dev)->hard_header_len + (dev)->needed_headroom))
+
 #define LL_RESERVED_SPACE_EXTRA(dev,extra) \
-	((((dev)->hard_header_len+(dev)->needed_headroom+(extra))&~(HH_DATA_MOD - 1)) + HH_DATA_MOD)
+	LL_ALIGN((dev)->hard_header_len + (dev)->needed_headroom + (extra))
+
 #define LL_ALLOCATED_SPACE(dev) \
-	((((dev)->hard_header_len+(dev)->needed_headroom+(dev)->needed_tailroom)&~(HH_DATA_MOD - 1)) + HH_DATA_MOD)
+	LL_ALIGN((dev)->hard_header_len + (dev)->needed_headroom + (dev)->needed_tailroom)
 
 struct header_ops {
 	int	(*create) (struct sk_buff *skb, struct net_device *dev,
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 61714bd..4ed4eda 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -326,6 +326,9 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 *fl4,
 	unsigned int iphlen;
 	int err;
 	struct rtable *rt = *rtp;
+	unsigned int hard_header_len = rt->dst.dev->hard_header_len;
+	unsigned int needed_headroom = rt->dst.dev->needed_headroom;
+	unsigned int needed_tailroom = rt->dst.dev->needed_tailroom;
 
 	if (length > rt->dst.dev->mtu) {
 		ip_local_error(sk, EMSGSIZE, fl4->daddr, inet->inet_dport,
@@ -336,11 +339,13 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 *fl4,
 		goto out;
 
 	skb = sock_alloc_send_skb(sk,
-				  length + LL_ALLOCATED_SPACE(rt->dst.dev) + 15,
+				  length + LL_ALIGN(hard_header_len +
+						    needed_headroom +
+					            needed_tailroom) + 15,
 				  flags & MSG_DONTWAIT, &err);
 	if (skb == NULL)
 		goto error;
-	skb_reserve(skb, LL_RESERVED_SPACE(rt->dst.dev));
+	skb_reserve(skb, LL_ALIGN(hard_header_len + needed_headroom));
 
 	skb->priority = sk->sk_priority;
 	skb->mark = sk->sk_mark;
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index 343852e..eb0a797 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -610,6 +610,9 @@ static int rawv6_send_hdrinc(struct sock *sk, void *from, int length,
 	struct sk_buff *skb;
 	int err;
 	struct rt6_info *rt = (struct rt6_info *)*dstp;
+	unsigned int hard_header_len = rt->dst.dev->hard_header_len;
+	unsigned int needed_headroom = rt->dst.dev->needed_headroom;
+	unsigned int needed_tailroom = rt->dst.dev->needed_tailroom;
 
 	if (length > rt->dst.dev->mtu) {
 		ipv6_local_error(sk, EMSGSIZE, fl6, rt->dst.dev->mtu);
@@ -619,11 +622,13 @@ static int rawv6_send_hdrinc(struct sock *sk, void *from, int length,
 		goto out;
 
 	skb = sock_alloc_send_skb(sk,
-				  length + LL_ALLOCATED_SPACE(rt->dst.dev) + 15,
+				  length + LL_ALIGN(hard_header_len +
+						    needed_headroom +
+					            needed_tailroom) + 15,
 				  flags & MSG_DONTWAIT, &err);
 	if (skb == NULL)
 		goto error;
-	skb_reserve(skb, LL_RESERVED_SPACE(rt->dst.dev));
+	skb_reserve(skb, LL_ALIGN(hard_header_len + needed_headroom));
 
 	skb->priority = sk->sk_priority;
 	skb->mark = sk->sk_mark;

^ permalink raw reply related

* Re: [PATCH v13 0/6] flexcan: Add support for powerpc flexcan (freescale p1010)
From: Robin Holt @ 2011-10-18 12:30 UTC (permalink / raw)
  To: Kumar Gala
  Cc: Grant Likely, netdev-u79uwXL29TY76Z2rM5mHXA, U Bhaskar-B22300,
	socketcan-core-0fE9KPoRgkgATYTw5x5z8w, Marc Kleine-Budde,
	PPC list, David S. Miller, Wolfgang Grandegger
In-Reply-To: <D79CB818-C14E-4C8D-9A8D-42B39ADE20B2-XVmvHMARGAS8U2dJNN8I7kB+6BGkLq7r@public.gmane.org>

On Tue, Oct 18, 2011 at 06:43:13AM -0500, Kumar Gala wrote:
> 
> >> Robin,
> >> 
> >> Do you remember why we went with just 'fsl,p1010-flexcan' as the device tree compatible?  Do we feel the flex can on P1010 isn't the same as on MPC5xxx? or the ARM SoCs?
> > 
> > The decision was due to the fact there is no true "generic" fsl.flexcan
> > chip free of any SOC implementation and therefore not something which
> > could be separately defined.  That decision was made by Grant Likely.
> > I will inline that email below.
> > 
> > Robin
> 
> 
> Thanks, I'll look into this internally at FSL.  I think its confusing as hell to have "fsl,p1010-flexcan" in an ARM .dts and don't think any reasonable ARM customer of FSL would know to put a PPC SOC name in their .dts.  I'll ask the HW guys what's going on so we can come up with a bit more generic name so we don't have to constantly change this.  Even if its just:

Grants argument was that there should then be a fsl,zeba-flexcan which
would define each arm based soc.  The match string could be there and
the devicetree binding would match on each equivalent.

Robin

> 
> fsl,ppc-flexcan & fsl,arm-flexcan.
> 
> > On Mon, Aug 15, 2011 at 09:13:50AM -0600, Grant Likely wrote:
> >> On Mon, Aug 15, 2011 at 9:03 AM, Robin Holt <holt-sJ/iWh9BUns@public.gmane.org> wrote:
> >>> Grant,
> >>> 
> >>> Earlier, you had asked for a more specific name for the compatible
> >>> property of the Freescale flexcan device.  I still have not gotten a
> >>> more specific answer.  Hopefully Marc can give you more details about
> >>> the flexcan implementations.
> >> 
> >> If there is no ip core version, then just stick with the
> >> fsl,<soc>-flexcan name and drop "fsl,flexcan".  Marketing may say
> >> flexcan is flexcan, but hardware engineers like to change things.
> >> Trying to be too generic in compatible values will just lead to
> >> problems in the future.
> > 
> > Thanks,
> > Robin
> 
> - k

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox