Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [net-next,1/2] add iovnl netlink support
From: Arnd Bergmann @ 2010-04-21 13:17 UTC (permalink / raw)
  To: Scott Feldman; +Cc: Chris Wright, davem, netdev
In-Reply-To: <C7F35C03.29CF4%scofeldm@cisco.com>

On Tuesday 20 April 2010, Scott Feldman wrote:
> On 4/20/10 9:19 AM, "Arnd Bergmann" <arnd@arndb.de> wrote:
>
> >> In the case of devices that can do adjacent switch negotiations directly.
> > 
> > I thought the idea to deal with those devices was to beat sense into
> > the respective developers until they do the negotiation in software 8-)
> 
> When the device can do the negotiation directly with the switch, why does it
> make sense to bypass that and use software on the host?  I don't think we'd
> want to give up on link speed/duplex auto-negotiation and punt those setting
> back to the user/host like in the old days.

For the link negotiation, the card is the right place because it's necessary
to get the link working before the OS can talk to the switch.
For VDP, that's different because the hypervisor needs to talk to the switch
before the guest can communicate, so there is no interdependency.

More importantly, the card cannot possibly do the protocol by itself,
because the information that gets exchanged is specific to the hypervisor and
the guest, not to the hardware. What you have implemented is another protocol
between the hypervisor and the NIC that exchanges the exact same data that
then gets sent to the switch. We already need to have an implementation that
sends this data to the switch from user space for all cards that don't do
it in firmware, so doing an alternative path in the adapter really creates
more work for the users, and means that when we fix bugs or add features
to the common code, you don't get them ;-).

	Arnd

^ permalink raw reply

* Re: [PATCH 1/2] netfilter: xtables: inclusion of xt_SYSRQ
From: Jan Engelhardt @ 2010-04-21 13:35 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Netfilter Developer Mailing List, Linux Netdev List, John Haxby
In-Reply-To: <4BCEFACC.7020804@trash.net>


On Wednesday 2010-04-21 15:17, Patrick McHardy wrote:
>Jan Engelhardt wrote:
>> On Wednesday 2010-04-21 14:59, Patrick McHardy wrote:
>> 
>>> Jan Engelhardt wrote:
>>>> The SYSRQ target will allow to remotely invoke sysrq on the local
>>>> machine. Authentication is by means of a pre-shared key that can
>>>> either be transmitted plaintext or digest-secured.
>>> I really think this is pushing what netfilter is meant for a bit
>>> far. Its basically abusing the firewall ruleset to offer a network
>>> service.
>>>
>>> I can see that its useful to have this in the kernel instead of
>>> userspace, but why isn't this implemented as a stand-alone module?
>>> That seems like a better design to me and also makes it more useful
>>> by not depending on netfilter.
>> 
>> That sort of diverts from the earlier what-seemed-to-be-consensus.
>> 
>> Oh well, I would not mind holding the single commit up as long as the 
>> rest isn't blocked too :-)
>
>Then lets skip this one for now.

Well you raised the concern before -- namely that kdboe would have
the very same feature. And yet, kdboe was not part of the kernel.
Neither is the magical stand-alone module.
I really prefer to have it in rather than out, because I know
that's going to mess up maintenance-here-and-there. I'm already
having a big time with xtables-addons that still carries
xt_condition and SYSRQ for a while, and it does have some different
code lines than the kernel copy.

^ permalink raw reply

* Re: crash with bridge and inconsistent handling of NETDEV_TX_OK
From: Mikulas Patocka @ 2010-04-21 13:21 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, kaber
In-Reply-To: <20100420.190127.66303903.davem@davemloft.net>



On Tue, 20 Apr 2010, David Miller wrote:

> 
> I looked more at your crash report.
> 
> You shouldn't even be in this code path for other reasons, namely
> skb->next should be NULL.  But it's not in your case.  skb->next would
> only be non-NULL for GSO frames, which we've established we should not
> be seeing here.
> 
> Given that skb->next is non-NULL and the fraglists of this SKB are
> corrupted (next pointer is 0x18), I think we're getting memory
> corruption from somewhere else.  This also jives with the fact that
> this is not readily reproducable.

The crash happened just a few days after I started to use the machine for 
bridging. There were no unexplained crashes before. So I suspect that the 
cause is bridging or tg3.

> The whole ->ndo_start_xmit() return value stuff is unrelated to this
> issue, we shouldn't even be in this code path.  In fact, if reverting
> that TX flags handling commit makes your crashes go away it would be a
> huge surprise.

I thought that if some weird ->ndo_start_xmit() return values appeared, 
this would lead to misunderstanding who owns the skb, using of already 
deallocated skb and mentioned memory corruption. But I can't prove it.

Mikulas

^ permalink raw reply

* Re: [PATCH 1/2] netfilter: xtables: inclusion of xt_SYSRQ
From: Patrick McHardy @ 2010-04-21 13:17 UTC (permalink / raw)
  To: Jan Engelhardt
  Cc: Netfilter Developer Mailing List, Linux Netdev List, John Haxby
In-Reply-To: <alpine.LSU.2.01.1004211506280.10840@obet.zrqbmnf.qr>

Jan Engelhardt wrote:
> On Wednesday 2010-04-21 14:59, Patrick McHardy wrote:
> 
>> Jan Engelhardt wrote:
>>> The SYSRQ target will allow to remotely invoke sysrq on the local
>>> machine. Authentication is by means of a pre-shared key that can
>>> either be transmitted plaintext or digest-secured.
>> I really think this is pushing what netfilter is meant for a bit
>> far. Its basically abusing the firewall ruleset to offer a network
>> service.
>>
>> I can see that its useful to have this in the kernel instead of
>> userspace, but why isn't this implemented as a stand-alone module?
>> That seems like a better design to me and also makes it more useful
>> by not depending on netfilter.
> 
> That sort of diverts from the earlier what-seemed-to-be-consensus.
> 
> Oh well, I would not mind holding the single commit up as long as the 
> rest isn't blocked too :-)

Then lets skip this one for now.

^ permalink raw reply

* Re: ipv6: Fix tcp_v6_send_response checksum
From: Herbert Xu @ 2010-04-21 13:09 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, cratiu
In-Reply-To: <20100421.015823.266647388.davem@davemloft.net>

On Wed, Apr 21, 2010 at 01:58:23AM -0700, David Miller wrote:
> 
> If we're going to use CHECKSUM_PARTIAL for these things (which we are
> since commit 2e8e18ef52e7dd1af0a3bd1f7d990a1d0b249586 "tcp: Set
> CHECKSUM_UNNECESSARY in tcp_init_nondata_skb"), we can't be setting
> buff->csum as we always have been here in tcp_v6_send_response.  We
> need to leave it at zero.
> 
> Kill that line and checksums are good again.
> 
> Signed-off-by: David S. Miller <davem@davemloft.net>

That line is needed for the CHECKSUM_NONE case.  In fact, if we're
in the CHECKSUM_PARTIAL case, then that buff->csum calculation
should be a noop because it immediately gets overwritten when we
subsequently set csum_start/csum_offset.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH 1/2] netfilter: xtables: inclusion of xt_SYSRQ
From: Jan Engelhardt @ 2010-04-21 13:07 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Netfilter Developer Mailing List, Linux Netdev List, John Haxby
In-Reply-To: <4BCEF6B4.8090105@trash.net>

On Wednesday 2010-04-21 14:59, Patrick McHardy wrote:

>Jan Engelhardt wrote:
>> The SYSRQ target will allow to remotely invoke sysrq on the local
>> machine. Authentication is by means of a pre-shared key that can
>> either be transmitted plaintext or digest-secured.
>
>I really think this is pushing what netfilter is meant for a bit
>far. Its basically abusing the firewall ruleset to offer a network
>service.
>
>I can see that its useful to have this in the kernel instead of
>userspace, but why isn't this implemented as a stand-alone module?
>That seems like a better design to me and also makes it more useful
>by not depending on netfilter.

That sort of diverts from the earlier what-seemed-to-be-consensus.

Oh well, I would not mind holding the single commit up as long as the 
rest isn't blocked too :-)


^ permalink raw reply

* Re: [PATCH linux-next 1/2] irq: Add CPU mask affinity hint callback framework
From: Ben Hutchings @ 2010-04-21 12:59 UTC (permalink / raw)
  To: Peter P Waskiewicz Jr; +Cc: tglx, davem, arjan, netdev, linux-kernel
In-Reply-To: <20100420180112.1276.11906.stgit@ppwaskie-hc2.jf.intel.com>

On Tue, 2010-04-20 at 11:01 -0700, Peter P Waskiewicz Jr wrote:
> This patch adds a callback function pointer to the irq_desc
> structure, along with a registration function and a read-only
> proc entry for each interrupt.
> 
> This affinity_hint handle for each interrupt can be used by
> underlying drivers that need a better mechanism to control
> interrupt affinity.  The underlying driver can register a
> callback for the interrupt, which will allow the driver to
> provide the CPU mask for the interrupt to anything that
> requests it.  The intent is to extend the userspace daemon,
> irqbalance, to help hint to it a preferred CPU mask to balance
> the interrupt into.

Doesn't it make more sense to have the driver follow affinity decisions
made from user-space?  I realise that reallocating queues is disruptive
and we probably don't want irqbalance to trigger that, but there should
be a mechanism for the administrator to trigger it.

Looking at your patch for ixgbe:

[...]
> diff --git a/drivers/net/ixgbe/ixgbe_main.c
> b/drivers/net/ixgbe/ixgbe_main.c
> index 1b1419c..3e00d41 100644
> --- a/drivers/net/ixgbe/ixgbe_main.c
> +++ b/drivers/net/ixgbe/ixgbe_main.c
[...]
> @@ -1083,6 +1113,16 @@ static void ixgbe_configure_msix(struct ixgbe_adapter *adapter)
>                         q_vector->eitr = adapter->rx_eitr_param;
>  
>                 ixgbe_write_eitr(q_vector);
> +
> +               /*
> +                * Allocate the affinity_hint cpumask, assign the mask for
> +                * this vector, and register our affinity_hint callback.
> +                */
> +               alloc_cpumask_var(&q_vector->affinity_mask, GFP_KERNEL);
> +               cpumask_set_cpu(v_idx, q_vector->affinity_mask);
> +               irq_register_affinity_hint(adapter->msix_entries[v_idx].vector,
> +                                          adapter,
> +                                          &ixgbe_irq_affinity_callback);
>         }
>  
>         if (adapter->hw.mac.type == ixgbe_mac_82598EB)
[...]

This just assigns IRQs to the first n CPU threads.  Depending on the
enumeration order, this might result in assigning an IRQ to each of 2
threads on a core while leaving other cores unused!

irqbalance can already find the various IRQs associated with a single
net device by looking at the handler names.  So it can do at least as
well as this without such a hint.  Unless drivers have *useful* hints to
give, I don't see the point in adding this mechanism.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: [PATCH 1/2] netfilter: xtables: inclusion of xt_SYSRQ
From: Patrick McHardy @ 2010-04-21 12:59 UTC (permalink / raw)
  To: Jan Engelhardt; +Cc: netfilter-devel, Linux Netdev List
In-Reply-To: <1271845618-28569-2-git-send-email-jengelh@medozas.de>

Jan Engelhardt wrote:
> The SYSRQ target will allow to remotely invoke sysrq on the local
> machine. Authentication is by means of a pre-shared key that can
> either be transmitted plaintext or digest-secured.

I really think this is pushing what netfilter is meant for a bit
far. Its basically abusing the firewall ruleset to offer a network
service.

I can see that its useful to have this in the kernel instead of
userspace, but why isn't this implemented as a stand-alone module?
That seems like a better design to me and also makes it more useful
by not depending on netfilter.

> Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
> ---
>  net/netfilter/Kconfig    |   12 ++
>  net/netfilter/Makefile   |    1 +
>  net/netfilter/xt_SYSRQ.c |  354 ++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 367 insertions(+), 0 deletions(-)
>  create mode 100644 net/netfilter/xt_SYSRQ.c
> 
> diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
> index 673a6c8..bfd9b6f 100644
> --- a/net/netfilter/Kconfig
> +++ b/net/netfilter/Kconfig
> @@ -502,6 +502,18 @@ config NETFILTER_XT_TARGET_RATEEST
>  
>  	  To compile it as a module, choose M here.  If unsure, say N.
>  
> +config NETFILTER_XT_TARGET_SYSRQ
> +	tristate '"SYSRQ" - remote sysrq invocation'
> +	depends on NETFILTER_ADVANCED
> +	---help---
> +	This option enables the "SYSRQ" target which can be used to trigger
> +	sysrq from a remote machine using a magic UDP packet with a pre-shared
> +	password. This is useful when the receiving host has locked up in an
> +	Oops yet still can process incoming packets.
> +
> +	Besides plaintext packets, digest-secured SYSRQ requests will be
> +	supported when CONFIG_CRYPTO is enabled.
> +
>  config NETFILTER_XT_TARGET_TEE
>  	tristate '"TEE" - packet cloning to alternate destiantion'
>  	depends on NETFILTER_ADVANCED
> diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
> index 14e3a8f..f032195 100644
> --- a/net/netfilter/Makefile
> +++ b/net/netfilter/Makefile
> @@ -56,6 +56,7 @@ obj-$(CONFIG_NETFILTER_XT_TARGET_NFQUEUE) += xt_NFQUEUE.o
>  obj-$(CONFIG_NETFILTER_XT_TARGET_NOTRACK) += xt_NOTRACK.o
>  obj-$(CONFIG_NETFILTER_XT_TARGET_RATEEST) += xt_RATEEST.o
>  obj-$(CONFIG_NETFILTER_XT_TARGET_SECMARK) += xt_SECMARK.o
> +obj-$(CONFIG_NETFILTER_XT_TARGET_SYSRQ) += xt_SYSRQ.o
>  obj-$(CONFIG_NETFILTER_XT_TARGET_TPROXY) += xt_TPROXY.o
>  obj-$(CONFIG_NETFILTER_XT_TARGET_TCPMSS) += xt_TCPMSS.o
>  obj-$(CONFIG_NETFILTER_XT_TARGET_TCPOPTSTRIP) += xt_TCPOPTSTRIP.o
> diff --git a/net/netfilter/xt_SYSRQ.c b/net/netfilter/xt_SYSRQ.c
> new file mode 100644
> index 0000000..929b204
> --- /dev/null
> +++ b/net/netfilter/xt_SYSRQ.c
> @@ -0,0 +1,354 @@
> +/*
> + *	"SYSRQ" target extension for Netfilter
> + *	Copyright Â© Jan Engelhardt <jengelh [at] medozas de>, 2008 - 2010
> + *
> + *	Based upon the ipt_SYSRQ idea by Marek Zalem <marek [at] terminus sk>
> + *
> + *	This program is free software; you can redistribute it and/or
> + *	modify it under the terms of the GNU General Public License
> + *	version 2 or later as published by the Free Software Foundation.
> + */
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +#include <linux/in.h>
> +#include <linux/ip.h>
> +#include <linux/ipv6.h>
> +#include <linux/module.h>
> +#include <linux/skbuff.h>
> +#include <linux/sysrq.h>
> +#include <linux/udp.h>
> +#include <linux/netfilter_ipv4/ip_tables.h>
> +#include <linux/netfilter_ipv6/ip6_tables.h>
> +#include <linux/netfilter/x_tables.h>
> +#include <linux/crypto.h>
> +#include <linux/scatterlist.h>
> +#include <net/ip.h>
> +
> +#if defined(CONFIG_CRYPTO) || defined(CRYPTO_CONFIG_MODULE)
> +#	define WITH_CRYPTO 1
> +#endif
> +#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
> +#	define WITH_IPV6 1
> +#endif
> +
> +static bool sysrq_once;
> +static char sysrq_password[64];
> +static char sysrq_hash[16] = "sha1";
> +static long sysrq_seqno;
> +static int sysrq_debug;
> +module_param_string(password, sysrq_password, sizeof(sysrq_password),
> +	S_IRUSR | S_IWUSR);
> +module_param_string(hash, sysrq_hash, sizeof(sysrq_hash), S_IRUSR);
> +module_param_named(seqno, sysrq_seqno, long, S_IRUSR | S_IWUSR);
> +module_param_named(debug, sysrq_debug, int, S_IRUSR | S_IWUSR);
> +MODULE_PARM_DESC(password, "password for remote sysrq");
> +MODULE_PARM_DESC(hash, "hash algorithm, default sha1");
> +MODULE_PARM_DESC(seqno, "sequence number for remote sysrq");
> +MODULE_PARM_DESC(debug, "debugging: 0=off, 1=on");
> +
> +#ifdef WITH_CRYPTO
> +static struct crypto_hash *sysrq_tfm;
> +static int sysrq_digest_size;
> +static unsigned char *sysrq_digest_password;
> +static unsigned char *sysrq_digest;
> +static char *sysrq_hexdigest;
> +
> +/*
> + * The data is of the form "<requests>,<seqno>,<salt>,<hash>" where <requests>
> + * is a series of sysrq requests; <seqno> is a sequence number that must be
> + * greater than the last sequence number; <salt> is some random bytes; and
> + * <hash> is the hash of everything up to and including the preceding ","
> + * together with the password.
> + *
> + * For example
> + *
> + *   salt=$RANDOM
> + *   req="s,$(date +%s),$salt"
> + *   echo "$req,$(echo -n $req,secret | sha1sum | cut -c1-40)"
> + *
> + * You will want a better salt and password than that though :-)
> + */
> +static unsigned int sysrq_tg(const void *pdata, uint16_t len)
> +{
> +	const char *data = pdata;
> +	int i, n;
> +	struct scatterlist sg[2];
> +	struct hash_desc desc;
> +	int ret;
> +	long new_seqno = 0;
> +
> +	if (*sysrq_password == '\0') {
> +		if (!sysrq_once)
> +			pr_info("No password set\n");
> +		sysrq_once = true;
> +		return NF_DROP;
> +	}
> +	if (len == 0)
> +		return NF_DROP;
> +
> +	for (i = 0; sysrq_password[i] != '\0' &&
> +	     sysrq_password[i] != '\n'; ++i)
> +		/* loop */;
> +	sysrq_password[i] = '\0';
> +
> +	i = 0;
> +	for (n = 0; n < len - 1; ++n) {
> +		if (i == 1 && '0' <= data[n] && data[n] <= '9')
> +			new_seqno = 10L * new_seqno + data[n] - '0';
> +		if (data[n] == ',' && ++i == 3)
> +			break;
> +	}
> +	++n;
> +	if (i != 3) {
> +		if (sysrq_debug)
> +			pr_info("badly formatted request\n");
> +		return NF_DROP;
> +	}
> +	if (sysrq_seqno >= new_seqno) {
> +		if (sysrq_debug)
> +			pr_info("old sequence number ignored\n");
> +		return NF_DROP;
> +	}
> +
> +	desc.tfm   = sysrq_tfm;
> +	desc.flags = 0;
> +	ret = crypto_hash_init(&desc);
> +	if (ret != 0)
> +		goto hash_fail;
> +	sg_init_table(sg, 2);
> +	sg_set_buf(&sg[0], data, n);
> +	strcpy(sysrq_digest_password, sysrq_password);
> +	i = strlen(sysrq_digest_password);
> +	sg_set_buf(&sg[1], sysrq_digest_password, i);
> +	ret = crypto_hash_digest(&desc, sg, n + i, sysrq_digest);
> +	if (ret != 0)
> +		goto hash_fail;
> +
> +	for (i = 0; i < sysrq_digest_size; ++i) {
> +		sysrq_hexdigest[2*i] =
> +			"0123456789abcdef"[(sysrq_digest[i] >> 4) & 0xf];
> +		sysrq_hexdigest[2*i+1] =
> +			"0123456789abcdef"[sysrq_digest[i] & 0xf];
> +	}
> +	sysrq_hexdigest[2*sysrq_digest_size] = '\0';
> +	if (len - n < sysrq_digest_size) {
> +		if (sysrq_debug)
> +			pr_info("Short digest, expected %s\n",
> +				sysrq_hexdigest);
> +		return NF_DROP;
> +	}
> +	if (strncmp(data + n, sysrq_hexdigest, sysrq_digest_size) != 0) {
> +		if (sysrq_debug)
> +			pr_info("Bad digest, expected %s\n", sysrq_hexdigest);
> +		return NF_DROP;
> +	}
> +
> +	/* Now we trust the requester */
> +	sysrq_seqno = new_seqno;
> +	for (i = 0; i < len && data[i] != ','; ++i) {
> +		pr_info("SysRq %c\n", data[i]);
> +		handle_sysrq(data[i], NULL);
> +	}
> +	return NF_ACCEPT;
> +
> + hash_fail:
> +	pr_warning("digest failure\n");
> +	return NF_DROP;
> +}
> +#else
> +static unsigned int sysrq_tg(const void *pdata, uint16_t len)
> +{
> +	const char *data = pdata;
> +	char c;
> +
> +	if (*sysrq_password == '\0') {
> +		if (!sysrq_once)
> +			pr_info("No password set\n");
> +		sysrq_once = true;
> +		return NF_DROP;
> +	}
> +
> +	if (len == 0)
> +		return NF_DROP;
> +
> +	c = *data;
> +	if (strncmp(&data[1], sysrq_password, len - 1) != 0) {
> +		pr_warning("Failed attempt - password mismatch\n");
> +		return NF_DROP;
> +	}
> +
> +	handle_sysrq(c, NULL);
> +	return NF_ACCEPT;
> +}
> +#endif
> +
> +static unsigned int
> +sysrq_tg4(struct sk_buff *skb, const struct xt_target_param *par)
> +{
> +	const struct iphdr *iph;
> +	const struct udphdr *udph;
> +	uint16_t len;
> +
> +	if (skb_linearize(skb) < 0)
> +		return NF_DROP;
> +
> +	iph = ip_hdr(skb);
> +	if (iph->protocol != IPPROTO_UDP && iph->protocol != IPPROTO_UDPLITE)
> +		return NF_DROP;
> +
> +	udph = (const void *)iph + ip_hdrlen(skb);
> +	len  = ntohs(udph->len) - sizeof(struct udphdr);
> +
> +	if (sysrq_debug)
> +		pr_info(": %pI4:%u -> :%u len=%u\n", &iph->saddr,
> +			htons(udph->source), htons(udph->dest), len);
> +	return sysrq_tg((void *)udph + sizeof(struct udphdr), len);
> +}
> +
> +#ifdef WITH_IPV6
> +static unsigned int
> +sysrq_tg6(struct sk_buff *skb, const struct xt_target_param *par)
> +{
> +	const struct ipv6hdr *iph;
> +	const struct udphdr *udph;
> +	unsigned short frag_off;
> +	unsigned int th_off;
> +	uint16_t len;
> +
> +	if (skb_linearize(skb) < 0)
> +		return NF_DROP;
> +
> +	iph = ipv6_hdr(skb);
> +	if (ipv6_find_hdr(skb, &th_off, IPPROTO_UDP, &frag_off) < 0 ||
> +	    frag_off > 0)
> +		return NF_ACCEPT; /* sink it */
> +
> +	udph = (const void *)iph + th_off;
> +	len  = ntohs(udph->len) - sizeof(struct udphdr);
> +
> +	if (sysrq_debug)
> +		pr_info("%pI6:%hu -> :%hu len=%u\n", &iph->saddr,
> +			ntohs(udph->source), ntohs(udph->dest), len);
> +	return sysrq_tg(udph + sizeof(struct udphdr), len);
> +}
> +#endif
> +
> +static int sysrq_tg_check(const struct xt_tgchk_param *par)
> +{
> +	if (par->target->family == NFPROTO_IPV4) {
> +		const struct ipt_entry *entry = par->entryinfo;
> +
> +		if ((entry->ip.proto != IPPROTO_UDP &&
> +		    entry->ip.proto != IPPROTO_UDPLITE) ||
> +		    entry->ip.invflags & XT_INV_PROTO)
> +			goto out;
> +	} else if (par->target->family == NFPROTO_IPV6) {
> +		const struct ip6t_entry *entry = par->entryinfo;
> +
> +		if ((entry->ipv6.proto != IPPROTO_UDP &&
> +		    entry->ipv6.proto != IPPROTO_UDPLITE) ||
> +		    entry->ipv6.invflags & XT_INV_PROTO)
> +			goto out;
> +	}
> +
> +	return true;
> +
> + out:
> +	pr_info("only available for UDP and UDP-Lite");
> +	return false;
> +}
> +
> +static struct xt_target sysrq_tg_reg[] __read_mostly = {
> +	{
> +		.name       = "SYSRQ",
> +		.revision   = 1,
> +		.family     = NFPROTO_IPV4,
> +		.target     = sysrq_tg4,
> +		.checkentry = sysrq_tg_check,
> +		.me         = THIS_MODULE,
> +	},
> +#ifdef WITH_IPV6
> +	{
> +		.name       = "SYSRQ",
> +		.revision   = 1,
> +		.family     = NFPROTO_IPV6,
> +		.target     = sysrq_tg6,
> +		.checkentry = sysrq_tg_check,
> +		.me         = THIS_MODULE,
> +	},
> +#endif
> +};
> +
> +static void sysrq_crypto_exit(void)
> +{
> +#ifdef WITH_CRYPTO
> +	if (sysrq_tfm)
> +		crypto_free_hash(sysrq_tfm);
> +	if (sysrq_digest)
> +		kfree(sysrq_digest);
> +	if (sysrq_hexdigest)
> +		kfree(sysrq_hexdigest);
> +	if (sysrq_digest_password)
> +		kfree(sysrq_digest_password);
> +#endif
> +}
> +
> +static int __init sysrq_crypto_init(void)
> +{
> +#if defined(WITH_CRYPTO)
> +	struct timeval now;
> +	int ret;
> +
> +	sysrq_tfm = crypto_alloc_hash(sysrq_hash, 0, CRYPTO_ALG_ASYNC);
> +	if (IS_ERR(sysrq_tfm)) {
> +		pr_err("Could not find or load %s hash\n", sysrq_hash);
> +		sysrq_tfm = NULL;
> +		ret = PTR_ERR(sysrq_tfm);
> +		goto fail;
> +	}
> +	sysrq_digest_size = crypto_hash_digestsize(sysrq_tfm);
> +	sysrq_digest = kmalloc(sysrq_digest_size, GFP_KERNEL);
> +	ret = -ENOMEM;
> +	if (sysrq_digest == NULL)
> +		goto fail;
> +	sysrq_hexdigest = kmalloc(2 * sysrq_digest_size + 1, GFP_KERNEL);
> +	if (sysrq_hexdigest == NULL)
> +		goto fail;
> +	sysrq_digest_password = kmalloc(sizeof(sysrq_password), GFP_KERNEL);
> +	if (sysrq_digest_password == NULL)
> +		goto fail;
> +	do_gettimeofday(&now);
> +	sysrq_seqno = now.tv_sec;
> +	ret = xt_register_targets(sysrq_tg_reg, ARRAY_SIZE(sysrq_tg_reg));
> +	if (ret < 0)
> +		goto fail;
> +	return ret;
> +
> + fail:
> +	sysrq_crypto_exit();
> +	return ret;
> +#else
> +	pr_info("compiled without crypto\n");
> +#endif
> +	return -EINVAL;
> +}
> +
> +static int __init sysrq_tg_init(void)
> +{
> +	if (sysrq_crypto_init() < 0)
> +		pr_info("starting without crypto\n");
> +	return xt_register_targets(sysrq_tg_reg, ARRAY_SIZE(sysrq_tg_reg));
> +}
> +
> +static void __exit sysrq_tg_exit(void)
> +{
> +	sysrq_crypto_exit();
> +	xt_unregister_targets(sysrq_tg_reg, ARRAY_SIZE(sysrq_tg_reg));
> +}
> +
> +module_init(sysrq_tg_init);
> +module_exit(sysrq_tg_exit);
> +MODULE_DESCRIPTION("Xtables: triggering SYSRQ remotely");
> +MODULE_AUTHOR("Jan Engelhardt <jengelh@medozas.de>");
> +MODULE_LICENSE("GPL");
> +MODULE_ALIAS("ipt_SYSRQ");
> +MODULE_ALIAS("ip6t_SYSRQ");

--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] gianfar: Wait for both RX and TX to stop
From: Kumar Gala @ 2010-04-21 12:17 UTC (permalink / raw)
  To: David Miller; +Cc: timur.tabi, afleming, netdev
In-Reply-To: <20100420.223659.236667659.davem@davemloft.net>


On Apr 21, 2010, at 12:36 AM, David Miller wrote:

> From: Kumar Gala <galak@kernel.crashing.org>
> Date: Tue, 20 Apr 2010 23:22:19 -0500
> 
>> 
>> On Apr 20, 2010, at 8:06 PM, David Miller wrote:
>> 
>>> From: Timur Tabi <timur.tabi@gmail.com>
>>> Date: Tue, 20 Apr 2010 10:01:48 -0500
>>> 
>>>> On Mon, Apr 19, 2010 at 11:43 PM, Kumar Gala <galak@kernel.crashing.org> wrote:
>>>> 
>>>>> spin_event_timeout doesn't make sense for this.  The patch is fine.
>>>> 
>>>> Can you please elaborate on that?  I don't understand why you think
>>>> that.  spin_event_timeout() takes an expression and a timeout, and
>>>> loops over the expression calling cpu_relax(), just like this loop
>>>> does.
>>> 
>>> Indeed it does, Kumar this request seems reasonable.
>> 
>> Are we saying that cpu_relax() is useless and should be removed if we are spinning on a HW register?
> 
> Kumar, take a deep breath and a step back.
> 
> spin_event_timeout() does the cpu_relax() too, that's what Timur is
> trying to tell you.
> 
> The code will be basically identical as far as I can tell.

I understand, its more a sense that we are saying we want to time out for what I consider a catastrophic HW failure.

- k

^ permalink raw reply

* Re: [PATCH] tun: orphan an skb on tx
From: Michael S. Tsirkin @ 2010-04-21 11:45 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: stable@kernel.org, David S. Miller, Herbert Xu, Paul Moore,
	David Woodhouse, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, qemu-devel
In-Reply-To: <4BCEE599.1020501@siemens.com>

On Wed, Apr 21, 2010 at 01:46:33PM +0200, Jan Kiszka wrote:
> Michael S. Tsirkin wrote:
> > On Tue, Apr 13, 2010 at 05:59:44PM +0300, Michael S. Tsirkin wrote:
> >> The following situation was observed in the field:
> >> tap1 sends packets, tap2 does not consume them, as a result
> >> tap1 can not be closed. This happens because
> >> tun/tap devices can hang on to skbs undefinitely.
> >>
> >> As noted by Herbert, possible solutions include a timeout followed by a
> >> copy/change of ownership of the skb, or always copying/changing
> >> ownership if we're going into a hostile device.
> >>
> >> This patch implements the second approach.
> >>
> >> Note: one issue still remaining is that since skbs
> >> keep reference to tun socket and tun socket has a
> >> reference to tun device, we won't flush backlog,
> >> instead simply waiting for all skbs to get transmitted.
> >> At least this is not user-triggerable, and
> >> this was not reported in practice, my assumption is
> >> other devices besides tap complete an skb
> >> within finite time after it has been queued.
> >>
> >> A possible solution for the second issue
> >> would not to have socket reference the device,
> >> instead, implement dev->destructor for tun, and
> >> wait for all skbs to complete there, but this
> >> needs some thought, probably too risky for 2.6.34.
> >>
> >> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> >> Tested-by: Yan Vugenfirer <yvugenfi@redhat.com>
> >>
> >> ---
> >>
> >> Please review the below, and consider for 2.6.34,
> >> and stable trees.
> >>
> >>  drivers/net/tun.c |    4 ++++
> >>  1 files changed, 4 insertions(+), 0 deletions(-)
> >>
> >> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> >> index 96c39bd..4326520 100644
> >> --- a/drivers/net/tun.c
> >> +++ b/drivers/net/tun.c
> >> @@ -387,6 +387,10 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
> >>  		}
> >>  	}
> >>  
> >> +	/* Orphan the skb - required as we might hang on to it
> >> +	 * for indefinite time. */
> >> +	skb_orphan(skb);
> >> +
> >>  	/* Enqueue packet */
> >>  	skb_queue_tail(&tun->socket.sk->sk_receive_queue, skb);
> >>  	dev->trans_start = jiffies;
> >> -- 
> >> 1.7.0.2.280.gc6f05
> > 
> > This is commit 0110d6f22f392f976e84ab49da1b42f85b64a3c5 in net-2.6
> > Please cherry-pick this fix in stable kernels (2.6.32 and 2.6.33).
> > 
> > Thanks!
> > 
> 
> Before I forget again:
> 
> Tested-by: Jan Kiszka <jan.kiszka@siemens.com>
> (namely the field test of our customer)
> 
> Jan

Cool, thanks!

> -- 
> Siemens AG, Corporate Technology, CT T DE IT 1
> Corporate Competence Center Embedded Linux

^ permalink raw reply

* Selective MD5 Checksum Failuers
From: Bijay Singh @ 2010-04-21 11:42 UTC (permalink / raw)
  To: netdev@vger.kernel.org; +Cc: chrisw@redhat.com, davem@davemloft.net
In-Reply-To: <201004211326.00974.arnd@arndb.de>

[-- Attachment #1: Type: text/plain, Size: 660 bytes --]

I am noticing a strange and a troublesome behavior with tcp md5 checksums. Some selective packets are going out with invalid md5 checksums.

The only thing that is changing is the ack number (between the packets with valid and invalid md5 checksums), so while most packets have correct md5 checksums few 1 in 1000s have md5 checksums errors.

I am on 2.6.26 and I know that there have been significant changes since this version in this area. I have gone thru them but none of issues they address seem like the cause for this problem.

I have the scatter/gather and tcp segmentation disabled in the card.

The packet captures are attached.

Bijay

[-- Attachment #2: md5sum-2010-04-16.pcap1-754478487 --]
[-- Type: application/octet-stream, Size: 126 bytes --]

[-- Attachment #3: md5sum-2010-04-16.pcap1-754479689 --]
[-- Type: application/octet-stream, Size: 126 bytes --]

^ permalink raw reply

* Re: [PATCH] tun: orphan an skb on tx
From: Jan Kiszka @ 2010-04-21 11:46 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: stable@kernel.org, David S. Miller, Herbert Xu, Paul Moore,
	David Woodhouse, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, qemu-devel
In-Reply-To: <20100421113557.GA31606@redhat.com>

Michael S. Tsirkin wrote:
> On Tue, Apr 13, 2010 at 05:59:44PM +0300, Michael S. Tsirkin wrote:
>> The following situation was observed in the field:
>> tap1 sends packets, tap2 does not consume them, as a result
>> tap1 can not be closed. This happens because
>> tun/tap devices can hang on to skbs undefinitely.
>>
>> As noted by Herbert, possible solutions include a timeout followed by a
>> copy/change of ownership of the skb, or always copying/changing
>> ownership if we're going into a hostile device.
>>
>> This patch implements the second approach.
>>
>> Note: one issue still remaining is that since skbs
>> keep reference to tun socket and tun socket has a
>> reference to tun device, we won't flush backlog,
>> instead simply waiting for all skbs to get transmitted.
>> At least this is not user-triggerable, and
>> this was not reported in practice, my assumption is
>> other devices besides tap complete an skb
>> within finite time after it has been queued.
>>
>> A possible solution for the second issue
>> would not to have socket reference the device,
>> instead, implement dev->destructor for tun, and
>> wait for all skbs to complete there, but this
>> needs some thought, probably too risky for 2.6.34.
>>
>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>> Tested-by: Yan Vugenfirer <yvugenfi@redhat.com>
>>
>> ---
>>
>> Please review the below, and consider for 2.6.34,
>> and stable trees.
>>
>>  drivers/net/tun.c |    4 ++++
>>  1 files changed, 4 insertions(+), 0 deletions(-)
>>
>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>> index 96c39bd..4326520 100644
>> --- a/drivers/net/tun.c
>> +++ b/drivers/net/tun.c
>> @@ -387,6 +387,10 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>  		}
>>  	}
>>  
>> +	/* Orphan the skb - required as we might hang on to it
>> +	 * for indefinite time. */
>> +	skb_orphan(skb);
>> +
>>  	/* Enqueue packet */
>>  	skb_queue_tail(&tun->socket.sk->sk_receive_queue, skb);
>>  	dev->trans_start = jiffies;
>> -- 
>> 1.7.0.2.280.gc6f05
> 
> This is commit 0110d6f22f392f976e84ab49da1b42f85b64a3c5 in net-2.6
> Please cherry-pick this fix in stable kernels (2.6.32 and 2.6.33).
> 
> Thanks!
> 

Before I forget again:

Tested-by: Jan Kiszka <jan.kiszka@siemens.com>
(namely the field test of our customer)

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply

* Re: [PATCH] tun: orphan an skb on tx
From: Michael S. Tsirkin @ 2010-04-21 11:35 UTC (permalink / raw)
  To: stable
  Cc: David S. Miller, Herbert Xu, Paul Moore, David Woodhouse, netdev,
	linux-kernel, Jan Kiszka, qemu-devel
In-Reply-To: <20100413145944.GA7716@redhat.com>

On Tue, Apr 13, 2010 at 05:59:44PM +0300, Michael S. Tsirkin wrote:
> The following situation was observed in the field:
> tap1 sends packets, tap2 does not consume them, as a result
> tap1 can not be closed. This happens because
> tun/tap devices can hang on to skbs undefinitely.
> 
> As noted by Herbert, possible solutions include a timeout followed by a
> copy/change of ownership of the skb, or always copying/changing
> ownership if we're going into a hostile device.
> 
> This patch implements the second approach.
> 
> Note: one issue still remaining is that since skbs
> keep reference to tun socket and tun socket has a
> reference to tun device, we won't flush backlog,
> instead simply waiting for all skbs to get transmitted.
> At least this is not user-triggerable, and
> this was not reported in practice, my assumption is
> other devices besides tap complete an skb
> within finite time after it has been queued.
> 
> A possible solution for the second issue
> would not to have socket reference the device,
> instead, implement dev->destructor for tun, and
> wait for all skbs to complete there, but this
> needs some thought, probably too risky for 2.6.34.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Tested-by: Yan Vugenfirer <yvugenfi@redhat.com>
> 
> ---
> 
> Please review the below, and consider for 2.6.34,
> and stable trees.
> 
>  drivers/net/tun.c |    4 ++++
>  1 files changed, 4 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 96c39bd..4326520 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -387,6 +387,10 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>  		}
>  	}
>  
> +	/* Orphan the skb - required as we might hang on to it
> +	 * for indefinite time. */
> +	skb_orphan(skb);
> +
>  	/* Enqueue packet */
>  	skb_queue_tail(&tun->socket.sk->sk_receive_queue, skb);
>  	dev->trans_start = jiffies;
> -- 
> 1.7.0.2.280.gc6f05

This is commit 0110d6f22f392f976e84ab49da1b42f85b64a3c5 in net-2.6
Please cherry-pick this fix in stable kernels (2.6.32 and 2.6.33).

Thanks!

-- 
MST

^ permalink raw reply

* Re: PROBLEM: Linux kernel 2.6.31 IPv4 TCP fails to open huge amount of outgoing connections (unable to bind ... )
From: Eric Dumazet @ 2010-04-21 11:27 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Ben Greear, David Miller, Gaspar Chilingarov, netdev
In-Reply-To: <20100421095812.GA14778@ioremap.net>

Here is the patch I use now and my test application is now able to open
and connect 1000000 sockets (ulimit -n 1000000)

Trick is bind_conflict() must refuse a socket to bind to a port on a non
null IP if another socket already uses same port on same IP.

Plus the previous patch sent (check a conflict before exiting the search
loop)

What do you think ?

diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index e0a3e35..78cbc39 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -70,13 +70,17 @@ int inet_csk_bind_conflict(const struct sock *sk,
 		    (!sk->sk_bound_dev_if ||
 		     !sk2->sk_bound_dev_if ||
 		     sk->sk_bound_dev_if == sk2->sk_bound_dev_if)) {
+			const __be32 sk2_rcv_saddr = inet_rcv_saddr(sk2);
+
 			if (!reuse || !sk2->sk_reuse ||
 			    sk2->sk_state == TCP_LISTEN) {
-				const __be32 sk2_rcv_saddr = inet_rcv_saddr(sk2);
 				if (!sk2_rcv_saddr || !sk_rcv_saddr ||
 				    sk2_rcv_saddr == sk_rcv_saddr)
 					break;
-			}
+			} else if (reuse && sk2->sk_reuse &&
+				   sk2_rcv_saddr &&
+				   sk2_rcv_saddr == sk_rcv_saddr)
+				break;
 		}
 	}
 	return node != NULL;
@@ -120,9 +124,11 @@ again:
 						smallest_size = tb->num_owners;
 						smallest_rover = rover;
 						if (atomic_read(&hashinfo->bsockets) > (high - low) + 1) {
-							spin_unlock(&head->lock);
-							snum = smallest_rover;
-							goto have_snum;
+							if (!inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb)) {
+								spin_unlock(&head->lock);
+								snum = smallest_rover;
+								goto have_snum;
+							}
 						}
 					}
 					goto next;



^ permalink raw reply related

* Re: [net-next,1/2] add iovnl netlink support
From: Arnd Bergmann @ 2010-04-21 11:26 UTC (permalink / raw)
  To: Scott Feldman; +Cc: davem, netdev, chrisw
In-Reply-To: <C7F354F4.29CC8%scofeldm@cisco.com>

On Tuesday 20 April 2010, Scott Feldman wrote:
> On 4/20/10 6:48 AM, "Arnd Bergmann" <arnd@arndb.de> wrote:
> 
> > On Monday 19 April 2010, Scott Feldman wrote:
> > 
> >> IOV netlink (IOVNL) adds I/O Virtualization control support to a master
> >> device (MD) netdev interface.  The MD (e.g. SR-IOV PF) will set/get
> >> control settings on behalf of a slave netdevice (e.g. SR-IOV VF).  The
> >> design allows for the case where master and slave are the
> >> same netdev interface.
> > 
> > What is the reason for controlling the slave device through the master,
> > rather than talking to the slave directly? The kernel always knows
> > the master for each slave, so it seems to me that this information
> > is redundant.
> 
> The interface would allow talking to the slave directly. In fact, that's the
> example with enic port-profile in patch 2/2.  But, it would be nice not to
> rule out the case where the master proxies slave control and the master is
> under exclusively controlled by hypervisor.

Not sure I understand. Do you mean the case where this code runs in
the hypervisor (e.g. KVM), or a different scerario with the setup being
done in a guest driver?

So far, I have assumed that we would always do the setup on the host
side, which always has access to both the master, and a slave proxy.
In particular, your interface requires access to the slave AFAICT,
because otherwise the VF IFNAME does not have any significance.

Take the case where you use network namespaces and put the VF into
a separate namespace. With your interface, the PF is still in the
root namespace, but passing both interface names in this interface
won't help you because they are never visible in the same namespace
(e.g. both might be named eth0 in their respective containers).

> > Is this new interface only for the case that you have a switch integrated
> > in the NIC, or also for the case where you do an LLDP and EDP exchange
> > with an adjacent bridge and put the device into VEPA mode?
> 
> All of the above.  Basing this on netlink give us flexibility to work with
> user-space mgmt tools or directly with kernel netdev as in the enic case.
> Not trying to make assumptions about where (user-space, kernel) and by which
> entity sources or sinks the netlink msg.

ok.

> > Did you consider making this code an extension to the DCB interface
> > instead of a separate one? What was the reason for your decision
> > to keep it separate?
> 
> Considered it but DCB interface is well defined for DCB and it didn't seem
> right gluing on interfaces not specified within DCB.  I agree that there is
> some overlap in the sense that both interface are used to configure a netdev
> with some properties interesting for the data center, but the DCB interface
> is for local setting of the properties on the host whereas iovnl is about
> pushing the setting of those properties to the network for policy-based
> control.
>
> > Also, do you expect your interface to be supported by dcbd/lldpad,
> > or is there a good reason to create a new tool for iovnl?
> 
> Lldpad supporting this interface would seem right, for those cases where
> lldpad is responsible for configuring the netdev.

I believe we meant different things here, because I misunderstood the
intention of the code. My question was whether lldpad would send the
netlink messages to iovnl, but from what you and Chris write, the
real idea was that both lldpad and kernel/iovnl can receive the
same messages, right?

> >> + * @IOV_ATTR_CLIENT_NAME: client name (NLA_NUL_STRING)
> >> + * @IOV_ATTR_HOST_UUID: host UUID (NLA_NUL_STRING)
> > 
> > Can you elaborate more on what these do? Who is the 'client' and the 'host'
> > in this case, and why do you need to identify them?
> 
> Those are optional and useful, for example, by the network mgmt tool for
> presenting a view such as:
> 
>     - blade 1/2                     // know by host uuid
>         - vm-rhel5-eth0             // client name
>             - port-profile: xyz
> 
> Something like that.

Hmm, but how do they get from the device driver to the the network
management tool then? Also, these are similar to the attributes
that are passed in the 802.1Qbg VDP protocol, but not compatible.
If the idea is use the same netlink protocol for both your internal
representation and for the standard based protocol, I think we should
make them compatible.

Instead of a string identifying the port profile, this needs to pass
a four byte field for a VSI type (3 bytes) and VSI manager ID (1 byte).

There is also a UUID in VDP, but it identifies the guest, not the host,
so this is really confusing.

VDP also needs a list of MAC addresses and VLAN IDs (normally only
one of each), but that would be separate from what you tell the adapter,
see below: 

> >> + * @IOV_ATTR_MAC_ADDR: device station MAC address (NLA_U8[6])
> > 
> > Just one mac address? What happens if we want to assign multiple mac
> > addresses to the VF later? Also, how is this defined specifically?
> > Will a SIOCSIFHWADDR with a different MAC address on the VF fail
> > later, or is this just the default value?
> 
> Depends on how the VF wants to handle this.  For our use-case with enic we
> only need the port-profile op so I'm not sure what the best design is for
> mac+vlan on a VF.  Looking for advise from folks like yourself.  If it's not
> needed, let's scratch it.

In order to make VEPA work, it's absolutely required to impose a hard limit
on what MAC+VLAN IDs are visible to the VF, because the switch identifies
the guest by those and forwards any frames to/from that address according
to the VSI type.

However, I feel that we should strictly separate the steps of configuring
the adapter from talking to the switch. When we do the VDP association
in user land, we still need to set up the VLAN and MAC configuration for
the VF through a kernel interface. If we ignore the port profile stuff
for a moment, your netlink interface looks like a good fit for that.

Since it seems what you really want to do is to do the exchange with the
switch from here, maybe the hardware configuration part should be moved
the DCB interface?

	Arnd

^ permalink raw reply

* Re: [PATCH net-next-2.6] rps: consistent rxhash
From: Eric Dumazet @ 2010-04-21 11:16 UTC (permalink / raw)
  To: Franco Fichtner; +Cc: Tom Herbert, Changli Gao, David Miller, netdev
In-Reply-To: <4BCEDC29.6040303@lastsummer.de>

Le mercredi 21 avril 2010 à 13:06 +0200, Franco Fichtner a écrit :
> > 
> 
> Hashing for cpu distribution should be as minimal as it could possibly
> be with the least number operations needed to compute a hash, which 
> normally involves touching one cold cache line (ip header). If you add 
> the ports to your mix you have the luxury of solving static ip mappings,
> but only for protocols that support it. Usage of the destination port
> may also prove to be more or less pointless with a lot of http traffic,
> because it's most likely static. And you add another potential cold
> cache line access. For a lot of traffic scenarios, we'll have a bunch of
> internal ips and the internet on the other side, so having a simple hash
> based on a flavor if internal/external ip is more than enough to work
> with for distribution. If the network card can provide a complete hash
> all the better. Then this part of my point is void.
> 

But we already have to bring into our cpu cache one cache line, needed
in eth_type_trans() : (12+2 bytes of ethernet header)

TCP/UDP tuples are included into this cache line (64 bytes on current
popular arches)

Cost of rxhash is absolute noise into the picture.
A device provided hash, to be effective, would also make
eth_type_trans() call not done.




^ permalink raw reply

* Re: [PATCH net-next-2.6] rps: consistent rxhash
From: Franco Fichtner @ 2010-04-21 11:06 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Tom Herbert, Changli Gao, David Miller, netdev
In-Reply-To: <1271842773.7895.1692.camel@edumazet-laptop>

Eric Dumazet wrote:
> Le mercredi 21 avril 2010 à 11:29 +0200, Franco Fichtner a écrit :
>> Tom Herbert wrote:
>>>> I thought about this for some time...
>>>>
>>>> Do we really need the port numbers here at all? A simple
>>>> addr1^addr2 can provide a good enough pointer for
>>>> distribution amongst CPUs.
>>>>
>>> What about a server behind a TCP proxy?  Also, need to minimize
>>> collisions for RPS to be effective
>> What about routers? What about loopback? This all boils down to
>> the same issue of obscuring IP data by "magical" means and then
>> reattaching functionality by reaching for upper layer information.
>> It is necessary in some cases, but it can cripple performance
>> for other cases.
>>
>> The interesting thing is you don't need to deal with collisions
>> while distributing amonst cpus at all. You just need to make sure
>> the distribution algorithm keeps every single flow attached to
>> the correct cpu.
>>
>> All of the actual flow hashing, tracking and whatever else the
>> traffic needs to go through can be done locally by cpu x which
>> helps a lot with load distribution and cache issues in mind. It
>> also helps locking because there is no global flow lookup table.
>> Oh, and it also reduces collisions with every cpu you add for
>> receiving.
>>
>> I work with a lot of plain office and ISP traffic in mind daily,
>> so please don't misunderstand my motivation here. I'd hate to
>> see poor performance in scenarios in which there is a lot of
>> potential improvement.
>>
> 
> I am a bit lost by this conversation.
> 
> Are you saying something is wrong with current schem ?
> 
> What are exactly your suggestions ?
> 

Hashing for cpu distribution should be as minimal as it could possibly
be with the least number operations needed to compute a hash, which 
normally involves touching one cold cache line (ip header). If you add 
the ports to your mix you have the luxury of solving static ip mappings,
but only for protocols that support it. Usage of the destination port
may also prove to be more or less pointless with a lot of http traffic,
because it's most likely static. And you add another potential cold
cache line access. For a lot of traffic scenarios, we'll have a bunch of
internal ips and the internet on the other side, so having a simple hash
based on a flavor if internal/external ip is more than enough to work
with for distribution. If the network card can provide a complete hash
all the better. Then this part of my point is void.

But then, hashing for cpu distribution should have nothing todo with
real flow tracking with lookup tables for let's say a firewall or dpi
application, because that data is only needed by local cpu and can
be gathered after distribution. Simply put, the lookup for the flow, if
it is needed, does not belong to distribution. It can be outsourced to
the destination cpu or just simply be ignored, if the application
doesn't care.

> Tom replied to you that a hash derived from (addr1 ^ addr2) would not
> work in situations where all flows goes from machine A to machine B
> (all hashes would be the same)
> 

Yes, I see the point. And all I'm just asking if it's wise to optimize
for this particular scenario.

If you spin this idea further beyond flow tracking, maybe an application
also needs to do some kind of user tracking by ip. Wouldn't it make
sense to have user based flows on a more local basis, not a global one
because ports will get in the way?

> Current hash is probably more than enough to cover all situations.
> 

I agree with this, but would like to point out the phrasing "probably
more than enough". :)

Franco

^ permalink raw reply

* Re: [PATCH] can: Fix possible NULL pointer dereference in ems_usb.c
From: Wolfgang Grandegger @ 2010-04-21 10:27 UTC (permalink / raw)
  To: Hans J. Koch
  Cc: socketcan-core-0fE9KPoRgkgATYTw5x5z8w,
	netdev-u79uwXL29TY76Z2rM5mHXA, David Miller
In-Reply-To: <20100421101805.GA1995-hikPBsva6T+Nj9Bq2fkWzw@public.gmane.org>

Hans J. Koch wrote:
> On Tue, Apr 20, 2010 at 06:05:01PM -0700, David Miller wrote:
>> From: Wolfgang Grandegger <wg-5Yr1BZd7O62+XT7JhA+gdA@public.gmane.org>
>>> I think "dev_err(&intf->dev, ...)" should be used before
>>> SET_NETDEV_DEV(netdev, &intf->dev) is called. I see two "dev_err()"
>>> calls which need to be fixed.
>> Agreed.
> 
> Then it should probably look like this:

There are even more dev_err's to be fixed...

> From: "Hans J. Koch" <hjk-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org>
> To: netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Cc: socketcan-core-0fE9KPoRgkgATYTw5x5z8w@public.gmane.org,
> 	Sebastian Haas <haas-zsNKPWJ8Pib6hrUXjxyGrA@public.gmane.org>
> Subject: [PATCH] can: Fix possible NULL pointer dereference in ems_usb.c
> 
> In ems_usb_probe(), a pointer is dereferenced after making sure it is NULL...
> 
> This patch replaces netdev->dev.parent with &intf->dev in dev_err() calls to
> avoid this.
> 
> Signed-off-by: "Hans J. Koch" <hjk-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org>
> ---
>  drivers/net/can/usb/ems_usb.c |    8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> Index: net-next-2.6/drivers/net/can/usb/ems_usb.c
> ===================================================================
> --- net-next-2.6.orig/drivers/net/can/usb/ems_usb.c	2010-04-13 11:27:33.000000000 +0200
> +++ net-next-2.6/drivers/net/can/usb/ems_usb.c	2010-04-21 11:59:04.000000000 +0200
> @@ -1006,7 +1006,7 @@
>  
>  	netdev = alloc_candev(sizeof(struct ems_usb), MAX_TX_URBS);
>  	if (!netdev) {
> -		dev_err(netdev->dev.parent, "Couldn't alloc candev\n");
> +		dev_err(&intf->dev, "ems_usb: Couldn't alloc candev\n");
>  		return -ENOMEM;
>  	}
>  
> @@ -1036,20 +1036,20 @@
>  
>  	dev->intr_urb = usb_alloc_urb(0, GFP_KERNEL);
>  	if (!dev->intr_urb) {
> -		dev_err(netdev->dev.parent, "Couldn't alloc intr URB\n");
> +		dev_err(&intf->dev, "Couldn't alloc intr URB\n");
>  		goto cleanup_candev;
>  	}
>  
>  	dev->intr_in_buffer = kzalloc(INTR_IN_BUFFER_SIZE, GFP_KERNEL);
>  	if (!dev->intr_in_buffer) {
> -		dev_err(netdev->dev.parent, "Couldn't alloc Intr buffer\n");
> +		dev_err(&intf->dev, "Couldn't alloc Intr buffer\n");
>  		goto cleanup_intr_urb;
>  	}
>  
>  	dev->tx_msg_buffer = kzalloc(CPC_HEADER_SIZE +
>  				     sizeof(struct ems_cpc_msg), GFP_KERNEL);
>  	if (!dev->tx_msg_buffer) {
> -		dev_err(netdev->dev.parent, "Couldn't alloc Tx buffer\n");
> +		dev_err(&intf->dev, "Couldn't alloc Tx buffer\n");
>  		goto cleanup_intr_in_buffer;
>  	}

Acked-by: Wolfgang Grandegger <wg-5Yr1BZd7O62+XT7JhA+gdA@public.gmane.org>

^ permalink raw reply

* Re: PROBLEM: Linux kernel 2.6.31 IPv4 TCP fails to open huge amount of outgoing connections (unable to bind ... )
From: Eric Dumazet @ 2010-04-21 10:21 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Ben Greear, David Miller, Gaspar Chilingarov, netdev
In-Reply-To: <20100421095812.GA14778@ioremap.net>

Le mercredi 21 avril 2010 à 13:58 +0400, Evgeniy Polyakov a écrit :
> On Wed, Apr 21, 2010 at 11:02:15AM +0200, Eric Dumazet (eric.dumazet@gmail.com) wrote:
> > Le mercredi 21 avril 2010 à 12:25 +0400, Evgeniy Polyakov a écrit :
> > 
> > > I believe this is a useful patch, but it addresses a different issue.
> > > This path should not fire up when we bind to single address.
> > 
> > Well, the real problem is that following sequence can happen :
> > 
> > socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 5
> > setsockopt(5, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> > bind(5, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
> > getsockname(5, {sa_family=AF_INET, sin_port=htons(34000), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
> > socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 6
> > setsockopt(6, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> > bind(6, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
> > getsockname(6, {sa_family=AF_INET, sin_port=htons(34002), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
> > socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 7
> > setsockopt(7, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> > bind(7, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
> > getsockname(7, {sa_family=AF_INET, sin_port=htons(34001), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
> > socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 8
> > setsockopt(8, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> > bind(8, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
> > getsockname(8, {sa_family=AF_INET, sin_port=htons(34002), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
> > socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 9
> > setsockopt(9, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> > bind(9, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
> > getsockname(9, {sa_family=AF_INET, sin_port=htons(34000), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
> > socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 10
> > setsockopt(10, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> > bind(10, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
> > getsockname(10, {sa_family=AF_INET, sin_port=htons(34002), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
> > socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 11
> > setsockopt(11, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> > bind(11, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
> > getsockname(11, {sa_family=AF_INET, sin_port=htons(34001), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
> > 
> > 
> > Note ports are given several times for different sockets.
> > 
> > So several sockets are 'bound' to same IP:port values
> 
> That's kind of weird, since address is the same. I'm curious why bind
> conflict does not resolve that. Likely because socket is not yet
> connected.
> 
> > At connect() time, we refuse and say address is not available.
> 
> Yup. But isn't it a different problem from what Gaspar observed?
> Netcat connects after bind so it should not be an issue here.
> 

Well, just replace the "#if 0" in program, and i'll connect, and
reproduce same problem

conflict only checks :

                        if (!reuse || !sk2->sk_reuse ||
                            sk2->sk_state == TCP_LISTEN) {
                                const __be32 sk2_rcv_saddr =
inet_rcv_saddr(sk2);
                                if (!sk2_rcv_saddr || !sk_rcv_saddr ||
                                    sk2_rcv_saddr == sk_rcv_saddr)
                                        break;
                        }


ie it wont re-allocate a port only if a listener uses this port.

If sk2 is connected or close, it doesnt matter, and port can be reused.




^ permalink raw reply

* Re: [PATCH] can: Fix possible NULL pointer dereference in ems_usb.c
From: Hans J. Koch @ 2010-04-21 10:18 UTC (permalink / raw)
  To: David Miller
  Cc: socketcan-core-0fE9KPoRgkgATYTw5x5z8w,
	netdev-u79uwXL29TY76Z2rM5mHXA, wg-5Yr1BZd7O62+XT7JhA+gdA
In-Reply-To: <20100420.180501.08643239.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>

On Tue, Apr 20, 2010 at 06:05:01PM -0700, David Miller wrote:
> From: Wolfgang Grandegger <wg-5Yr1BZd7O62+XT7JhA+gdA@public.gmane.org>
> > I think "dev_err(&intf->dev, ...)" should be used before
> > SET_NETDEV_DEV(netdev, &intf->dev) is called. I see two "dev_err()"
> > calls which need to be fixed.
> 
> Agreed.

Then it should probably look like this:


From: "Hans J. Koch" <hjk-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org>
To: netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Cc: socketcan-core-0fE9KPoRgkgATYTw5x5z8w@public.gmane.org,
	Sebastian Haas <haas-zsNKPWJ8Pib6hrUXjxyGrA@public.gmane.org>
Subject: [PATCH] can: Fix possible NULL pointer dereference in ems_usb.c

In ems_usb_probe(), a pointer is dereferenced after making sure it is NULL...

This patch replaces netdev->dev.parent with &intf->dev in dev_err() calls to
avoid this.

Signed-off-by: "Hans J. Koch" <hjk-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org>
---
 drivers/net/can/usb/ems_usb.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

Index: net-next-2.6/drivers/net/can/usb/ems_usb.c
===================================================================
--- net-next-2.6.orig/drivers/net/can/usb/ems_usb.c	2010-04-13 11:27:33.000000000 +0200
+++ net-next-2.6/drivers/net/can/usb/ems_usb.c	2010-04-21 11:59:04.000000000 +0200
@@ -1006,7 +1006,7 @@
 
 	netdev = alloc_candev(sizeof(struct ems_usb), MAX_TX_URBS);
 	if (!netdev) {
-		dev_err(netdev->dev.parent, "Couldn't alloc candev\n");
+		dev_err(&intf->dev, "ems_usb: Couldn't alloc candev\n");
 		return -ENOMEM;
 	}
 
@@ -1036,20 +1036,20 @@
 
 	dev->intr_urb = usb_alloc_urb(0, GFP_KERNEL);
 	if (!dev->intr_urb) {
-		dev_err(netdev->dev.parent, "Couldn't alloc intr URB\n");
+		dev_err(&intf->dev, "Couldn't alloc intr URB\n");
 		goto cleanup_candev;
 	}
 
 	dev->intr_in_buffer = kzalloc(INTR_IN_BUFFER_SIZE, GFP_KERNEL);
 	if (!dev->intr_in_buffer) {
-		dev_err(netdev->dev.parent, "Couldn't alloc Intr buffer\n");
+		dev_err(&intf->dev, "Couldn't alloc Intr buffer\n");
 		goto cleanup_intr_urb;
 	}
 
 	dev->tx_msg_buffer = kzalloc(CPC_HEADER_SIZE +
 				     sizeof(struct ems_cpc_msg), GFP_KERNEL);
 	if (!dev->tx_msg_buffer) {
-		dev_err(netdev->dev.parent, "Couldn't alloc Tx buffer\n");
+		dev_err(&intf->dev, "Couldn't alloc Tx buffer\n");
 		goto cleanup_intr_in_buffer;
 	}

^ permalink raw reply

* Re: PROBLEM: Linux kernel 2.6.31 IPv4 TCP fails to open huge amount of outgoing connections (unable to bind ... )
From: Evgeniy Polyakov @ 2010-04-21  9:58 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Ben Greear, David Miller, Gaspar Chilingarov, netdev
In-Reply-To: <1271840535.7895.1612.camel@edumazet-laptop>

On Wed, Apr 21, 2010 at 11:02:15AM +0200, Eric Dumazet (eric.dumazet@gmail.com) wrote:
> Le mercredi 21 avril 2010 à 12:25 +0400, Evgeniy Polyakov a écrit :
> 
> > I believe this is a useful patch, but it addresses a different issue.
> > This path should not fire up when we bind to single address.
> 
> Well, the real problem is that following sequence can happen :
> 
> socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 5
> setsockopt(5, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> bind(5, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
> getsockname(5, {sa_family=AF_INET, sin_port=htons(34000), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
> socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 6
> setsockopt(6, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> bind(6, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
> getsockname(6, {sa_family=AF_INET, sin_port=htons(34002), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
> socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 7
> setsockopt(7, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> bind(7, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
> getsockname(7, {sa_family=AF_INET, sin_port=htons(34001), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
> socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 8
> setsockopt(8, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> bind(8, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
> getsockname(8, {sa_family=AF_INET, sin_port=htons(34002), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
> socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 9
> setsockopt(9, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> bind(9, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
> getsockname(9, {sa_family=AF_INET, sin_port=htons(34000), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
> socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 10
> setsockopt(10, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> bind(10, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
> getsockname(10, {sa_family=AF_INET, sin_port=htons(34002), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
> socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 11
> setsockopt(11, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> bind(11, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
> getsockname(11, {sa_family=AF_INET, sin_port=htons(34001), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
> 
> 
> Note ports are given several times for different sockets.
> 
> So several sockets are 'bound' to same IP:port values

That's kind of weird, since address is the same. I'm curious why bind
conflict does not resolve that. Likely because socket is not yet
connected.

> At connect() time, we refuse and say address is not available.

Yup. But isn't it a different problem from what Gaspar observed?
Netcat connects after bind so it should not be an issue here.

-- 
	Evgeniy Polyakov

^ permalink raw reply

* Re: ipv6: Fix tcp_v6_send_response checksum
From: Cosmin Ratiu @ 2010-04-21  9:42 UTC (permalink / raw)
  To: David Miller; +Cc: herbert, netdev
In-Reply-To: <20100421.004922.193694715.davem@davemloft.net>

On Wednesday 21 April 2010 10:49:22 David Miller wrote:
> I put this into net-2.6 and modified the commit message since, as we
> found, this incorrect transport header reset was added there to fix
> IPSEC.
> 
> I'm convinced that Cosmin didn't test the patch he actually sent out
> 
> :-)

I apologize. I've tested the patch on 2.6.7 (tcp_v6_send_ack then) and then 
redid the patch half-asleep against net-next.

Cosmin.

^ permalink raw reply

* Re: [PATCH net-next-2.6] rps: consistent rxhash
From: Eric Dumazet @ 2010-04-21  9:39 UTC (permalink / raw)
  To: Franco Fichtner; +Cc: Tom Herbert, Changli Gao, David Miller, netdev
In-Reply-To: <4BCEC593.4000007@lastsummer.de>

Le mercredi 21 avril 2010 à 11:29 +0200, Franco Fichtner a écrit :
> Tom Herbert wrote:
> >> I thought about this for some time...
> >>
> >> Do we really need the port numbers here at all? A simple
> >> addr1^addr2 can provide a good enough pointer for
> >> distribution amongst CPUs.
> >>
> > 
> > What about a server behind a TCP proxy?  Also, need to minimize
> > collisions for RPS to be effective
> 
> What about routers? What about loopback? This all boils down to
> the same issue of obscuring IP data by "magical" means and then
> reattaching functionality by reaching for upper layer information.
> It is necessary in some cases, but it can cripple performance
> for other cases.
> 
> The interesting thing is you don't need to deal with collisions
> while distributing amonst cpus at all. You just need to make sure
> the distribution algorithm keeps every single flow attached to
> the correct cpu.
> 
> All of the actual flow hashing, tracking and whatever else the
> traffic needs to go through can be done locally by cpu x which
> helps a lot with load distribution and cache issues in mind. It
> also helps locking because there is no global flow lookup table.
> Oh, and it also reduces collisions with every cpu you add for
> receiving.
> 
> I work with a lot of plain office and ISP traffic in mind daily,
> so please don't misunderstand my motivation here. I'd hate to
> see poor performance in scenarios in which there is a lot of
> potential improvement.
> 

I am a bit lost by this conversation.

Are you saying something is wrong with current schem ?

What are exactly your suggestions ?

Tom replied to you that a hash derived from (addr1 ^ addr2) would not
work in situations where all flows goes from machine A to machine B
(all hashes would be the same)

Current hash is probably more than enough to cover all situations.






^ permalink raw reply

* Re: [PATCH net-next-2.6] rps: consistent rxhash
From: Franco Fichtner @ 2010-04-21  9:29 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Eric Dumazet, Changli Gao, David Miller, netdev
In-Reply-To: <i2q65634d661004200809ydba56c35g785ba51900c379f7@mail.gmail.com>

Tom Herbert wrote:
>> I thought about this for some time...
>>
>> Do we really need the port numbers here at all? A simple
>> addr1^addr2 can provide a good enough pointer for
>> distribution amongst CPUs.
>>
> 
> What about a server behind a TCP proxy?  Also, need to minimize
> collisions for RPS to be effective

What about routers? What about loopback? This all boils down to
the same issue of obscuring IP data by "magical" means and then
reattaching functionality by reaching for upper layer information.
It is necessary in some cases, but it can cripple performance
for other cases.

The interesting thing is you don't need to deal with collisions
while distributing amonst cpus at all. You just need to make sure
the distribution algorithm keeps every single flow attached to
the correct cpu.

All of the actual flow hashing, tracking and whatever else the
traffic needs to go through can be done locally by cpu x which
helps a lot with load distribution and cache issues in mind. It
also helps locking because there is no global flow lookup table.
Oh, and it also reduces collisions with every cpu you add for
receiving.

I work with a lot of plain office and ISP traffic in mind daily,
so please don't misunderstand my motivation here. I'd hate to
see poor performance in scenarios in which there is a lot of
potential improvement.

Franco

^ permalink raw reply

* [PATCH 2/2][RESEND] ehea: fix possible DLPAR/mem deadlock
From: Thomas Klein @ 2010-04-21  9:11 UTC (permalink / raw)
  To: davem; +Cc: netdev, linuxppc-dev, linux-kernel, themann

Force serialization of userspace-triggered DLPAR/mem operations

Signed-off-by: Thomas Klein <tklein@de.ibm.com>
---

Patch created against net-2.6

diff -Nurp net-2.6.orig/drivers/net/ehea/ehea.h net-2.6/drivers/net/ehea/ehea.h
--- net-2.6.orig/drivers/net/ehea/ehea.h	2010-04-21 10:23:21.000000000 +0200
+++ net-2.6/drivers/net/ehea/ehea.h	2010-04-21 10:42:14.000000000 +0200
@@ -40,7 +40,7 @@
 #include <asm/io.h>
 
 #define DRV_NAME	"ehea"
-#define DRV_VERSION	"EHEA_0102"
+#define DRV_VERSION	"EHEA_0103"
 
 /* eHEA capability flags */
 #define DLPAR_PORT_ADD_REM 1
diff -Nurp net-2.6.orig/drivers/net/ehea/ehea_main.c net-2.6/drivers/net/ehea/ehea_main.c
--- net-2.6.orig/drivers/net/ehea/ehea_main.c	2010-04-21 10:43:14.000000000 +0200
+++ net-2.6/drivers/net/ehea/ehea_main.c	2010-04-21 10:42:14.000000000 +0200
@@ -2889,7 +2889,6 @@ static void ehea_rereg_mrs(struct work_s
 	int ret, i;
 	struct ehea_adapter *adapter;
 
-	mutex_lock(&dlpar_mem_lock);
 	ehea_info("LPAR memory changed - re-initializing driver");
 
 	list_for_each_entry(adapter, &adapter_list, list)
@@ -2959,7 +2958,6 @@ static void ehea_rereg_mrs(struct work_s
 		}
 	ehea_info("re-initializing driver complete");
 out:
-	mutex_unlock(&dlpar_mem_lock);
 	return;
 }
 
@@ -3542,7 +3540,14 @@ void ehea_crash_handler(void)
 static int ehea_mem_notifier(struct notifier_block *nb,
                              unsigned long action, void *data)
 {
+	int ret = NOTIFY_BAD;
 	struct memory_notify *arg = data;
+
+	if (!mutex_trylock(&dlpar_mem_lock)) {
+		ehea_info("ehea_mem_notifier must not be called parallelized");
+		goto out;
+	}
+
 	switch (action) {
 	case MEM_CANCEL_OFFLINE:
 		ehea_info("memory offlining canceled");
@@ -3551,14 +3556,14 @@ static int ehea_mem_notifier(struct noti
 		ehea_info("memory is going online");
 		set_bit(__EHEA_STOP_XFER, &ehea_driver_flags);
 		if (ehea_add_sect_bmap(arg->start_pfn, arg->nr_pages))
-			return NOTIFY_BAD;
+			goto out_unlock;
 		ehea_rereg_mrs(NULL);
 		break;
 	case MEM_GOING_OFFLINE:
 		ehea_info("memory is going offline");
 		set_bit(__EHEA_STOP_XFER, &ehea_driver_flags);
 		if (ehea_rem_sect_bmap(arg->start_pfn, arg->nr_pages))
-			return NOTIFY_BAD;
+			goto out_unlock;
 		ehea_rereg_mrs(NULL);
 		break;
 	default:
@@ -3566,8 +3571,12 @@ static int ehea_mem_notifier(struct noti
 	}
 
 	ehea_update_firmware_handles();
+	ret = NOTIFY_OK;
 
-	return NOTIFY_OK;
+out_unlock:
+	mutex_unlock(&dlpar_mem_lock);
+out:
+	return ret;
 }
 
 static struct notifier_block ehea_mem_nb = {

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox