Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH next-next-2.6] netdev: better dev_name_hash
From: Hagen Paul Pfeifer @ 2009-10-25 20:17 UTC (permalink / raw)
  To: Octavian Purdila; +Cc: netdev
In-Reply-To: <200910252158.53921.opurdila@ixiacom.com>

* Octavian Purdila | 2009-10-25 21:58:53 [+0200]:

>
>The current dev_name_hash is not very good at spreading entries when a
>large number of interfaces of the same type (e.g. ethXXXXX) are used.
>
>Here are some performance numbers for creating 16000 dummy interfaces with
>and without the patch (with per device sysctl entries disabled)
>
>    With patch                      Without patch
>
>    real    0m 2.27s                real    0m 4.32s
>    user    0m 0.00s                user    0m 0.00s
>    sys     0m 1.13s                sys     0m 2.16s

Can you rerun the test with jhash() as the hash function? Just for clearness -
the spreading of jhash should be more uniformly distributed. At the end: where
is the threshold where a more accurate hash function is superior.

HGN

-- 
Hagen Paul Pfeifer <hagen@jauu.net>  ||  http://jauu.net/
Telephone: +49 174 5455209           ||  Key Id: 0x98350C22
Key Fingerprint: 490F 557B 6C48 6D7E 5706 2EA2 4A22 8D45 9835 0C22

^ permalink raw reply

* Re: [PATCH 6/9] ser_gigaset: checkpatch cleanup
From: Alan Cox @ 2009-10-25 20:37 UTC (permalink / raw)
  To: Tilman Schmidt
  Cc: David Miller, Karsten Keil, Hansjoerg Lipp, netdev, linux-kernel,
	isdn4linux, i4ldeveloper
In-Reply-To: <20091023-patch-gigaset-06.tilman@imap.cc>

On Sun, 25 Oct 2009 20:30:27 +0100 (CET)
Tilman Schmidt <tilman@imap.cc> wrote:

> Duly uglified as demanded by checkpatch.pl.
> 
> Impact: cosmeti

Umm ??

> -		if (!(bcs->tx_skb = skb_dequeue(&bcs->squeue)))
> +		nextskb = skb_dequeue(&bcs->squeue);
> +		if (!nextskb)
>  			/* no skb either, nothing to do */
>  			return;
> +		bcs->tx_skb = nextskb;

This does not do the same thing as before

Previously the NULL case assigned to bcs->tx_skb, now it does not.

^ permalink raw reply

* Re: Where is the netdev-next-2.6 tree ?
From: David Miller @ 2009-10-25 21:19 UTC (permalink / raw)
  To: Sandeep.Kumar; +Cc: netdev
In-Reply-To: <9F4C7D19E8361D4C94921B95BE08B81B95012C@zin33exm22.fsl.freescale.net>

From: "Kumar Gopalpet-B05799" <Sandeep.Kumar@freescale.com>
Date: Sun, 25 Oct 2009 18:58:51 +0530

> Hi all,
>  
> What is the url for cloning the netdev's latest development tree ?
>  
> The clone from http://www.kernel.org/pub/scm/linux/kernel/git/davem/
> <BLOCKED::http://www.kernel.org/pub/scm/linux/kernel/git/davem/> 
> net-2.6.git/ <BLOCKED::outbind://12/net-2.6.git/>  and
> net-next-2.6.git/ <BLOCKED::outbind://12/net-next-2.6.git/>   failed.

I do not enable HTML access to my GIT trees, you must use the
usual GIT protocol to access my trees.

I'm sorry if you don't have real internet access, you'll need
to find a way around that.

^ permalink raw reply

* Re: [PATCH net-next] bnx2x: Do Tx handling in a separate tasklet.
From: David Miller @ 2009-10-25 21:20 UTC (permalink / raw)
  To: vladz; +Cc: eilong, netdev
In-Reply-To: <1256473182.11634.4.camel@lb-tlvb-vladz.il.broadcom.com>

From: "Vladislav Zolotarov" <vladz@broadcom.com>
Date: Sun, 25 Oct 2009 14:19:42 +0200

> This patch moves the 'Tx interrupt work' of each Tx queue from the hardIRQ
> context to the separate low-latency tasklet. Otherwise there is a possibility
> of a software lockup situation in a Tx softIRQ as it handles freeing all skb's
> 'freed' in (hard)IRQ context.
> 
> Signed-off-by: Vladislav Zolotarov <vladz@broadcom.com>
> Signed-off-by: Eilon Greenstein <eilong@broadcom.com>

Use NAPI, please...

^ permalink raw reply

* Re: [PATCH next-next-2.6] netdev: better dev_name_hash
From: Eric Dumazet @ 2009-10-25 21:24 UTC (permalink / raw)
  To: Octavian Purdila; +Cc: netdev
In-Reply-To: <200910252158.53921.opurdila@ixiacom.com>

Octavian Purdila a écrit :
> The current dev_name_hash is not very good at spreading entries when a
> large number of interfaces of the same type (e.g. ethXXXXX) are used.

Very true

> 
> Here are some performance numbers for creating 16000 dummy interfaces with
> and without the patch (with per device sysctl entries disabled)
> 
>     With patch                      Without patch
> 
>     real    0m 2.27s                real    0m 4.32s
>     user    0m 0.00s                user    0m 0.00s
>     sys     0m 1.13s                sys     0m 2.16s
> 

1) A program to show hash distribution would be more convincing,
and could show that 17 multiplier is better, not 31 :)

2) Why full_name_hash() not changed as well ?


$ cat test.c
#include <stdio.h>
#define init_name_hash()                0

/* partial hash update function. Assume roughly 4 bits per character */
static inline unsigned long
partial_name_hash(unsigned long c, unsigned long prevhash)
{
        return (prevhash + (c << 4) + (c >> 4)) * 11;
}

/*
 * Finally: cut down the number of bits to a int value (and try to avoid
 * losing bits)
 */
static inline unsigned long end_name_hash(unsigned long hash)
{
        return (unsigned int) hash;
}

/* Compute the hash for a name string. */
static inline unsigned int
full_name_hash(const unsigned char *name, unsigned int len)
{
        unsigned long hash = init_name_hash();
        while (len--)
                hash = partial_name_hash(*name++, hash);
        return end_name_hash(hash);
}

static inline unsigned int
string_hash31(const unsigned char *name, unsigned int len)
{
        unsigned hash = 0;
        int i;

        for (i = 0; i < len; ++i)
                hash = 31 * hash + name[i];
        return hash;
}

static inline unsigned int
string_hash17(const unsigned char *name, unsigned int len)
{
        unsigned hash = 0;
        int i;

        for (i = 0; i < len; ++i)
                hash = 17 * hash + name[i];
        return hash;
}

unsigned int hashdist1[256], hashdist2[256], hashdist3[256];

void print_dist(unsigned int *dist, const char *name)
{
        unsigned int i;

        printf("%s[] = {\n", name);
        for (i = 0; i < 256; i++) {
                printf("%d, ", dist[i]);
                if ((i & 15) == 15) putchar('\n');
        }
        printf("};\n");
}

int main()
{
        int i;
        unsigned int hash;
        unsigned char name[16];

        for (i = 0; i < 16000; i++) {
                unsigned int len = sprintf(name, "dummy%d", i);

                hash = full_name_hash(name, len);
                hashdist1[hash & 255]++;

                hash = string_hash31(name, len);
                hashdist2[hash & 255]++;

                hash = string_hash17(name, len);
                hashdist3[hash & 255]++;
                }
        print_dist(hashdist1, "full_name_hash");
        print_dist(hashdist2, "string_hash31");
        print_dist(hashdist3, "string_hash17");
        return 0;
}

$ gcc -o test test.c
$ ./test
full_name_hash[] = {
0, 0, 0, 375, 0, 0, 562, 0, 0, 0, 6, 0, 0, 0, 0, 56,
0, 0, 0, 374, 0, 0, 562, 0, 0, 0, 5, 0, 0, 0, 0, 57,
0, 0, 0, 375, 0, 0, 562, 0, 0, 0, 6, 1, 0, 0, 0, 56,
0, 0, 0, 375, 0, 0, 563, 0, 0, 0, 5, 1, 0, 0, 0, 56,
0, 0, 0, 375, 0, 0, 563, 0, 0, 0, 6, 1, 0, 0, 0, 56,
0, 0, 0, 375, 0, 0, 562, 0, 0, 0, 6, 0, 0, 0, 0, 57,
0, 0, 0, 375, 0, 0, 563, 0, 0, 0, 5, 0, 0, 0, 0, 56,
0, 0, 0, 376, 0, 0, 563, 0, 0, 0, 6, 1, 0, 0, 0, 56,
0, 0, 0, 375, 0, 0, 562, 0, 0, 0, 5, 1, 0, 0, 0, 57,
0, 0, 0, 374, 0, 0, 563, 0, 0, 0, 6, 1, 0, 0, 0, 56,
0, 0, 0, 375, 0, 0, 563, 0, 0, 0, 6, 1, 0, 0, 0, 56,
0, 0, 0, 376, 0, 0, 562, 0, 0, 0, 6, 0, 0, 0, 0, 56,
0, 0, 0, 375, 0, 0, 562, 0, 0, 0, 5, 0, 0, 0, 0, 56,
0, 0, 0, 375, 0, 0, 562, 0, 0, 0, 5, 1, 0, 0, 0, 56,
0, 0, 0, 375, 0, 0, 563, 0, 0, 0, 6, 1, 0, 0, 0, 57,
0, 0, 0, 375, 0, 0, 563, 0, 0, 0, 6, 1, 0, 0, 0, 56,
};
string_hash31[] = {
42, 54, 67, 81, 92, 103, 116, 125, 131, 131, 133, 128, 121, 110, 101, 87,
72, 59, 47, 36, 25, 17, 12, 8, 4, 4, 6, 7, 11, 15, 24, 32,
42, 54, 68, 79, 91, 105, 117, 127, 130, 135, 133, 129, 120, 113, 100, 86,
71, 58, 46, 33, 24, 16, 11, 6, 4, 4, 5, 7, 10, 16, 23, 32,
41, 54, 66, 78, 91, 104, 117, 123, 130, 133, 134, 127, 122, 112, 101, 86,
72, 61, 47, 35, 25, 19, 12, 8, 5, 6, 6, 7, 11, 16, 24, 31,
43, 54, 66, 78, 92, 106, 116, 125, 131, 136, 132, 129, 121, 112, 98, 84,
72, 58, 44, 32, 24, 16, 11, 6, 5, 5, 4, 7, 10, 17, 23, 32,
41, 54, 65, 78, 92, 105, 116, 123, 132, 133, 133, 127, 122, 111, 99, 86,
74, 60, 45, 35, 26, 19, 12, 8, 7, 5, 5, 7, 12, 17, 24, 31,
43, 54, 66, 80, 94, 107, 116, 126, 131, 135, 131, 129, 120, 110, 97, 85,
71, 56, 44, 33, 25, 16, 11, 7, 5, 3, 4, 7, 11, 17, 22, 32,
43, 54, 66, 80, 94, 105, 115, 124, 133, 132, 132, 127, 121, 110, 99, 87,
74, 59, 46, 37, 26, 19, 12, 9, 6, 4, 5, 8, 12, 16, 23, 32,
43, 53, 67, 81, 94, 105, 116, 128, 132, 135, 133, 130, 121, 112, 100, 88,
72, 57, 46, 34, 25, 17, 11, 7, 4, 3, 5, 8, 11, 16, 22, 33,
};
string_hash17[] = {
68, 73, 72, 69, 62, 57, 52, 53, 60, 66, 71, 69, 69, 68, 67, 64,
67, 68, 70, 68, 63, 56, 53, 51, 57, 64, 68, 71, 68, 67, 66, 65,
62, 65, 67, 68, 64, 59, 55, 52, 54, 60, 67, 69, 69, 65, 65, 61,
59, 60, 63, 64, 64, 60, 55, 54, 54, 61, 67, 72, 72, 71, 66, 64,
60, 59, 60, 64, 64, 62, 58, 56, 55, 59, 66, 70, 73, 69, 67, 62,
58, 53, 56, 57, 60, 59, 57, 54, 53, 56, 62, 69, 71, 72, 67, 64,
57, 53, 51, 54, 58, 60, 59, 57, 57, 56, 63, 69, 74, 74, 71, 65,
60, 53, 50, 52, 55, 58, 59, 58, 57, 58, 61, 68, 73, 80, 77, 73,
66, 59, 52, 52, 54, 60, 61, 58, 58, 58, 59, 64, 71, 73, 78, 71,
66, 57, 50, 46, 50, 54, 59, 61, 58, 59, 60, 65, 70, 75, 78, 80,
72, 65, 56, 50, 49, 53, 59, 63, 62, 60, 62, 63, 68, 72, 77, 77,
76, 67, 58, 49, 46, 49, 55, 60, 62, 62, 59, 62, 65, 70, 72, 78,
75, 73, 62, 53, 47, 47, 52, 58, 63, 62, 63, 61, 64, 67, 71, 73,
76, 72, 68, 57, 49, 46, 50, 57, 62, 65, 65, 65, 64, 67, 69, 73,
76, 76, 71, 65, 54, 49, 49, 55, 62, 65, 65, 64, 65, 63, 66, 67,
71, 71, 70, 63, 57, 49, 47, 52, 58, 65, 66, 67, 65, 67, 65, 67,
};


^ permalink raw reply

* Re: [RFC] make per interface sysctl entries configurable
From: Eric Dumazet @ 2009-10-25 21:37 UTC (permalink / raw)
  To: Octavian Purdila
  Cc: Benjamin LaHaise, Stephen Hemminger, Cosmin Ratiu, netdev
In-Reply-To: <200910251954.49700.opurdila@ixiacom.com>

Octavian Purdila a écrit :
> RFC patches are attached.
> 
> Another possible approach: add an interface flag and use it to decide whether 
> we want per interface sysctl entries or not.
> 

Hmm, could we speedup sysctl instead, adding rbtree or something ?



^ permalink raw reply

* Re: [PATCH next-next-2.6] netdev: better dev_name_hash
From: Octavian Purdila @ 2009-10-25 21:55 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <4AE4C1FA.7000002@gmail.com>

[-- Attachment #1: Type: Text/Plain, Size: 2271 bytes --]

On Sunday 25 October 2009 23:24:10 you wrote:
> Octavian Purdila a écrit :
> > The current dev_name_hash is not very good at spreading entries when a
> > large number of interfaces of the same type (e.g. ethXXXXX) are used.
> 
> Very true
> 
> > Here are some performance numbers for creating 16000 dummy interfaces
> > with and without the patch (with per device sysctl entries disabled)
> >
> >     With patch                      Without patch
> >
> >     real    0m 2.27s                real    0m 4.32s
> >     user    0m 0.00s                user    0m 0.00s
> >     sys     0m 1.13s                sys     0m 2.16s
> 
> 1) A program to show hash distribution would be more convincing,
> and could show that 17 multiplier is better, not 31 :)
> 

You beat me to it :) 

But anyways, I've attached mine as well , which compares orig, jhash, new17, 
new31, with different number of interfaces as well as different hash bits. 

The score it shows its the number of iterations needed to find all elements in 
the hash (good enough metric?)

My results shows that new17 is better or very close to jhash2. And I think its 
lighter then jhash too.

> 2) Why full_name_hash() not changed as well ?

I don't think it is going to be as easy to "benchmark" this, as it is used 
with more diverse inputs. 

AFAIK, full_name_hash is used by the dentry code, meaning that it cashes path 
names? If so, perhaps we can use find {/etc,/bin,/sbin,/usr} as input patterns? 

My results:

$ ./dev_name_hash ixint 255 0 8
score 1140
$ ./dev_name_hash ixint 255 1 8
score 379
$ ./dev_name_hash ixint 255 2 8
score 461
$ ./dev_name_hash ixint 255 3 8
score 329
$ ./dev_name_hash ixint 1000 0 8
score 26074
$ ./dev_name_hash ixint 1000 1 8
score 2887
$ ./dev_name_hash ixint 1000 2 8
score 2853
$ ./dev_name_hash ixint 1000 3 8
score 3713
$ ./dev_name_hash ixint 16000 0 8
score 3689828
$ ./dev_name_hash ixint 16000 1 8
score 516389
$ ./dev_name_hash ixint 16000 2 8
score 515292
$ ./dev_name_hash ixint 16000 3 8
score 554042
$ ./dev_name_hash ixint 16000 0 16
score 24741
$ ./dev_name_hash ixint 16000 1 16
score 17949
$ ./dev_name_hash ixint 16000 2 16
score 16715
$ ./dev_name_hash ixint 16000 3 16
score 18125

[-- Attachment #2: dev_name_hash.c --]
[-- Type: text/x-csrc, Size: 4994 bytes --]

#define _GNU_SOURCE
#include <string.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <net/if.h>

typedef unsigned int u32;
typedef unsigned char u8;

#define __jhash_mix(a, b, c) \
{ \
  a -= b; a -= c; a ^= (c>>13); \
  b -= c; b -= a; b ^= (a<<8); \
  c -= a; c -= b; c ^= (b>>13); \
  a -= b; a -= c; a ^= (c>>12);  \
  b -= c; b -= a; b ^= (a<<16); \
  c -= a; c -= b; c ^= (b>>5); \
  a -= b; a -= c; a ^= (c>>3);  \
  b -= c; b -= a; b ^= (a<<10); \
  c -= a; c -= b; c ^= (b>>15); \
}

/* The golden ration: an arbitrary value */
#define JHASH_GOLDEN_RATIO	0x9e3779b9

/* The most generic version, hashes an arbitrary sequence
 * of bytes.  No alignment or length assumptions are made about
 * the input key.
 */
static inline u32 jhash(const void *key, u32 length, u32 initval)
{
	u32 a, b, c, len;
	const u8 *k = key;

	len = length;
	a = b = JHASH_GOLDEN_RATIO;
	c = initval;

	while (len >= 12) {
		a += (k[0] +((u32)k[1]<<8) +((u32)k[2]<<16) +((u32)k[3]<<24));
		b += (k[4] +((u32)k[5]<<8) +((u32)k[6]<<16) +((u32)k[7]<<24));
		c += (k[8] +((u32)k[9]<<8) +((u32)k[10]<<16)+((u32)k[11]<<24));

		__jhash_mix(a,b,c);

		k += 12;
		len -= 12;
	}

	c += length;
	switch (len) {
	case 11: c += ((u32)k[10]<<24);
	case 10: c += ((u32)k[9]<<16);
	case 9 : c += ((u32)k[8]<<8);
	case 8 : b += ((u32)k[7]<<24);
	case 7 : b += ((u32)k[6]<<16);
	case 6 : b += ((u32)k[5]<<8);
	case 5 : b += k[4];
	case 4 : a += ((u32)k[3]<<24);
	case 3 : a += ((u32)k[2]<<16);
	case 2 : a += ((u32)k[1]<<8);
	case 1 : a += k[0];
	};

	__jhash_mix(a,b,c);

	return c;
}


/* Name hashing routines. Initial hash value */
/* Hash courtesy of the R5 hash in reiserfs modulo sign bits */
#define init_name_hash()		0

/* partial hash update function. Assume roughly 4 bits per character */
static inline unsigned long
partial_name_hash(unsigned long c, unsigned long prevhash)
{
	return (prevhash + (c << 4) + (c >> 4)) * 11;
}

/*
 * Finally: cut down the number of bits to a int value (and try to avoid
 * losing bits)
 */
static inline unsigned long end_name_hash(unsigned long hash)
{
	return (unsigned int) hash;
}

/* Compute the hash for a name string. */
static inline unsigned int
full_name_hash(const char *name, unsigned int len)
{
	unsigned long hash = init_name_hash();
	while (len--)
		hash = partial_name_hash(*name++, hash);
	return end_name_hash(hash);
}

unsigned int dev_name_hash_new17(const char *name)
{
	unsigned hash = 0;
	int len = strnlen(name, IFNAMSIZ);
	int i;

	for (i = 0; i < len; ++i)
		hash = 17 * hash + name[i];
	return hash;
}

unsigned int dev_name_hash_new31(const char *name)
{
	unsigned hash = 0;
	int len = strnlen(name, IFNAMSIZ);
	int i;

	for (i = 0; i < len; ++i)
		hash = 31 * hash + name[i];
	return hash;
}

unsigned int dev_name_hash_orig(const char *name)
{
	return full_name_hash(name, strnlen(name, IFNAMSIZ));
}

unsigned int dev_name_hash_jhash(const char *name)
{
	return jhash(name, strnlen(name, IFNAMSIZ), 0);
}

typedef unsigned int (*dev_hash_name)(const char *name);

dev_hash_name hfn[] = {
	dev_name_hash_orig,
	dev_name_hash_jhash,
	dev_name_hash_new17,
	dev_name_hash_new31,
};

struct hocs {
	unsigned value;
	int occurences;
};

#if 0
#define debug(x...) printf(x)
#else
#define debug(x...) 
#endif

int main(int argc, char **argv)
{
	char dev_name[64];
	int no, htype, i, j, score, hbits;
	unsigned *h;
	/* stores hash elements by value and number of occurences */
	struct hocs *hocs;

	if (argc != 5) {
		fprintf(stderr, "syntax: dev_name_hash prefix no_of_names hash_type hash_bits\n");
		return -1;
	}

	no = atoi(argv[2]);
	if (no <= 0) {
		fprintf(stderr, "invalid number: %d\n", no);
		return -1;
	}

	htype = atoi(argv[3]);
	if (htype < 0 || htype >= sizeof(hfn)) {
		fprintf(stderr, "invalid hash type: %d\n", htype);
		return -1;
	}

	hbits = atoi(argv[4]);
	if (hbits <= 0) {
		fprintf(stderr, "invalid hash bits: %d\n", hbits);
		return -1;
	}
	
	h = malloc(no * sizeof(unsigned));

	for(i = 0; i < no; i++) {
		snprintf(dev_name, sizeof(dev_name), "%s%d", argv[1], i);
		h[i] = hfn[htype](dev_name) & ((1 << hbits) - 1);
		debug("h[i] = %x\n", h[i]);
	}

	hocs = malloc(no * sizeof(struct hocs));
	memset(hocs, 0, no * sizeof(unsigned));

	/* find occurences */
	for(i = 0; i < no; i++) {
		unsigned hash, dup;

		hash= h[i];
		dup = 0;
		hocs[i].value = hash;
		
		/* did we count this value already? */
		for(j = 0; j < i; j++) {
			if (i == j)
				continue;
			if (hocs[j].value == hash) {
				dup = 1;
				break;
			}
		}
		if (dup)
			continue;

		hocs[i].occurences = 1;

		for(j = 0; j < no; j++) {
			if (i == j)
				continue;
			if (hash == h[j])
				hocs[i].occurences++;
		}
	}

	/* the score is the number of iterations required to find each element
	   once */
	score = 0;
	for(i = 0; i < no; i++) {
		debug("hocs[%d] value = %x occurences = %d\n", i, hocs[i].value, hocs[i].occurences);
		score += hocs[i].occurences * (hocs[i].occurences + 1) / 2;
	}

	printf("score %d\n", score);

	return 0;
}

^ permalink raw reply

* Re: [RFC] make per interface sysctl entries configurable
From: Octavian Purdila @ 2009-10-25 22:21 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Benjamin LaHaise, Stephen Hemminger, Cosmin Ratiu, netdev
In-Reply-To: <4AE4C50F.7060605@gmail.com>

On Sunday 25 October 2009 23:37:19 you wrote:
> Octavian Purdila a écrit :
> > RFC patches are attached.
> >
> > Another possible approach: add an interface flag and use it to decide
> > whether we want per interface sysctl entries or not.
> 
> Hmm, could we speedup sysctl instead, adding rbtree or something ?
> 

Very good point, I think this is the best solution for people using a 
moderately high number of interfaces (a few thousand).

But for really large setups there is another issue: memory consumption. In 
fact, in order to be able to scale to 128K interfaces and still have a 
significant amount of memory available to applications we also had to disable 
sysfs and #ifdef CONFIG_SYSFS struct device from net_device.

I would also argue that when you have such a large number of interfaces you 
don't need to change setting on a per interface basis. Or at least this is our 
case :)  and I suspect that the case with a large number of PPP interfaces is 
similar.

^ permalink raw reply

* Re: [RFC] make per interface sysctl entries configurable
From: Denys Fedoryschenko @ 2009-10-25 22:32 UTC (permalink / raw)
  To: Octavian Purdila
  Cc: Eric Dumazet, Benjamin LaHaise, Stephen Hemminger, Cosmin Ratiu,
	netdev
In-Reply-To: <200910260021.48785.opurdila@ixiacom.com>

On Monday 26 October 2009 00:21:48 Octavian Purdila wrote:

> Very good point, I think this is the best solution for people using a
> moderately high number of interfaces (a few thousand).
>
> But for really large setups there is another issue: memory consumption. In
> fact, in order to be able to scale to 128K interfaces and still have a
> significant amount of memory available to applications we also had to
> disable sysfs and #ifdef CONFIG_SYSFS struct device from net_device.
>
> I would also argue that when you have such a large number of interfaces you
> don't need to change setting on a per interface basis. Or at least this is
> our case :)  and I suspect that the case with a large number of PPP
> interfaces is similar.
I will add also, sysctl -a (over busybox) on pppoe with 2k interfaces takes 
ages to complete.

^ permalink raw reply

* Re: [PATCH next-next-2.6] netdev: better dev_name_hash
From: Hagen Paul Pfeifer @ 2009-10-25 22:41 UTC (permalink / raw)
  To: Octavian Purdila; +Cc: Eric Dumazet, netdev
In-Reply-To: <200910252355.32640.opurdila@ixiacom.com>

* Octavian Purdila | 2009-10-25 23:55:32 [+0200]:

>My results shows that new17 is better or very close to jhash2. And I think its 
>lighter then jhash too.

If new17 is very close to jhash/jhash2 then the cycles comes into play.
Anyway, there is already a very potent hash interface in form of jhash{,2}.

+1 for jhash2

HGN

PS: great work! ;)

PPS: http://libhashish.sourceforge.net/ have some real hash benchmarks in form
of avalanche test and some others too. It does not really matter here because
Jenkins performs _nearly_ perfect in all cases. ;)

-- 
Hagen Paul Pfeifer <hagen@jauu.net>  ||  http://jauu.net/
Telephone: +49 174 5455209           ||  Key Id: 0x98350C22
Key Fingerprint: 490F 557B 6C48 6D7E 5706 2EA2 4A22 8D45 9835 0C22

^ permalink raw reply

* Re: [PATCH next-next-2.6] netdev: better dev_name_hash
From: Octavian Purdila @ 2009-10-25 22:45 UTC (permalink / raw)
  To: Hagen Paul Pfeifer; +Cc: Eric Dumazet, netdev
In-Reply-To: <20091025224153.GB20987@nuttenaction>

On Monday 26 October 2009 00:41:54 you wrote:
> * Octavian Purdila | 2009-10-25 23:55:32 [+0200]:
> >My results shows that new17 is better or very close to jhash2. And I think
> > its lighter then jhash too.
> 
> If new17 is very close to jhash/jhash2 then the cycles comes into play.
> Anyway, there is already a very potent hash interface in form of jhash{,2}.
> 

Ah, sorry for the confusion, jhash2 was a typo, I've only tested jhash.

^ permalink raw reply

* Re: [PATCH 6/9] ser_gigaset: checkpatch cleanup
From: Tilman Schmidt @ 2009-10-25 23:36 UTC (permalink / raw)
  Cc: David Miller, Karsten Keil, Hansjoerg Lipp, netdev, linux-kernel,
	isdn4linux, i4ldeveloper
In-Reply-To: <20091025203705.273e1c14@lxorguk.ukuu.org.uk>

[-- Attachment #1: Type: text/plain, Size: 678 bytes --]

Alan Cox schrieb:

>> -		if (!(bcs->tx_skb = skb_dequeue(&bcs->squeue)))
>> +		nextskb = skb_dequeue(&bcs->squeue);
>> +		if (!nextskb)
>>  			/* no skb either, nothing to do */
>>  			return;
>> +		bcs->tx_skb = nextskb;
> 
> This does not do the same thing as before
> 
> Previously the NULL case assigned to bcs->tx_skb, now it does not.

bcs->tx_skb is guaranteed to be NULL already at this point because
of the enclosing "if (!bcs->tx_skb)".

Thanks,
Tilman

-- 
Tilman Schmidt                    E-Mail: tilman@imap.cc
Bonn, Germany
Diese Nachricht besteht zu 100% aus wiederverwerteten Bits.
Ungeöffnet mindestens haltbar bis: (siehe Rückseite)


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 254 bytes --]

^ permalink raw reply

* Re: [PATCH v2 8/8] Document future removal of sysctl_tcp_* options
From: Bill Fink @ 2009-10-26  0:21 UTC (permalink / raw)
  To: Gilad Ben-Yossef; +Cc: Eric Dumazet, William Allen Simpson, netdev
In-Reply-To: <4AE415DD.5050406@codefidence.com>

On Sun, 25 Oct 2009, Gilad Ben-Yossef wrote:

> Eric Dumazet wrote:
> 
> > Bill Fink a écrit :
> >
> >   
> >> And as mentioned previously, the global options can be quite useful
> >> in certain test scenarios.  I also agree the per route settings are
> >> a very useful addition.  I think the global and per route settings
> >> are complementary and shouldn't be thought of as in conflict with
> >> one another.
> >>     
> > Absolutely, global setting is a must when an admin wants a quick path.
> >
> > The more flexible would be to have two bits per route, plus
> > 2 bits on the global configuration.
> >
> > global conf:
> > 00 : timestamps OFF, unless a route setting is not 00
> > 01 : timestamps ON, unless a route setting is not 00
> > 10 : Force timestamps OFF, ignore route settings (emergency sysadmin request)
> > 11 : Force timestamps ON, ignore route settings 
> >
> > Route settings (used *only* if global setting is 0Y)
> > 00 : global conf is used
> > 01 : Force timestamps being OFF for this route
> > 10 : Force timestamps being ON for this route
> > 11 : complement global conf
> 
> Hey, I have no issue to re-spin the patch with this suggestion, if you 
> truly think this is valuable, but would you please consider the 
> nightmare of having to just explain this to someone?
> 
> It sounds to me way too complicated for what it does.
> 
> I still think having a global kill switch and per route options better 
> (basically use the exiting patch but not retire the global kill 
> switch|), but if you must Hgow about we leave the global sysctl as they 
> are and just have a two bit route option:
> 
> 0 Use global default
> 1 Off
> 2 On
> 
> It's kind of funny, because this is what the original patch from 
> Comsleep does and I thought it needlessly complicates things.
> 
> So, what do you say - which will it be?

I personally feel the 2-bit settings are overkill.  What i think
makes the most sense is for the global options to act as they always
have in the absence of any route specific settings, and for any
route specific settings to override the related global settings.
This is both simple and maintains backward compatibility.

						-Bill

^ permalink raw reply

* Re: [PATCH 6/9] ser_gigaset: checkpatch cleanup
From: Joe Perches @ 2009-10-26  0:54 UTC (permalink / raw)
  To: Tilman Schmidt
  Cc: David Miller, Karsten Keil, Hansjoerg Lipp, netdev, linux-kernel,
	isdn4linux, i4ldeveloper
In-Reply-To: <20091023-patch-gigaset-06.tilman@imap.cc>

On Sun, 2009-10-25 at 20:30 +0100, Tilman Schmidt wrote:
> Duly uglified as demanded by checkpatch.pl.
> diff --git a/drivers/isdn/gigaset/ser-gigaset.c b/drivers/isdn/gigaset/ser-gigaset.c
> index 3071a52..ac3409e 100644
> --- a/drivers/isdn/gigaset/ser-gigaset.c
> +++ b/drivers/isdn/gigaset/ser-gigaset.c
> @@ -164,9 +164,15 @@ static void gigaset_modem_fill(unsigned long data)
>  {
>  	struct cardstate *cs = (struct cardstate *) data;
>  	struct bc_state *bcs;
> +	struct sk_buff *nextskb;
>  	int sent = 0;
>  
> -	if (!cs || !(bcs = cs->bcs)) {
> +	if (!cs) {
> +		gig_dbg(DEBUG_OUTPUT, "%s: no cardstate", __func__);
> +		return;
> +	}
> +	bcs = cs->bcs;
> +	if (!bcs) {
>  		gig_dbg(DEBUG_OUTPUT, "%s: no cardstate", __func__);
>  		return;
>	}

perhaps:
	if (!cs || !cs->bcs) {
		gig_dbg(DEBUG_OUTPUT, "%s: no cardstate", __func__);
		return;
	}
	bcs = cs->bcs;

[]

> @@ -404,16 +412,20 @@ static void gigaset_device_release(struct device *dev)
>  static int gigaset_initcshw(struct cardstate *cs)
>  {
>  	int rc;
> +	struct ser_cardstate *scs;
>  
> -	if (!(cs->hw.ser = kzalloc(sizeof(struct ser_cardstate), GFP_KERNEL))) {
> +	scs = kzalloc(sizeof(struct ser_cardstate), GFP_KERNEL);
> +	if (!scs) {
>  		pr_err("out of memory\n");
>  		return 0;
>  	}
> +	cs->hw.ser = scs;

Why not no temporary and just:

	cs->hw.ser = kzalloc...
	if (!cs->hw.ser)

^ permalink raw reply

* Re: [PATCH 1/5] page allocator: Always wake kswapd when restarting an allocation attempt after direct reclaim failed
From: KOSAKI Motohiro @ 2009-10-26  1:11 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro-+CUm20s59erQFUHtdCDX3A, Frans Pop, Jiri Kosina,
	Sven Geggus, Karol Lewandowski, Tobias Oetiker, Rafael J. Wysocki,
	David Miller, Reinette Chatre, Kalle Valo, David Rientjes,
	Mohamed Abbas, Jens Axboe, John W. Linville, Pekka Enberg,
	Bartlomiej Zolnierkiewicz, Greg Kroah-Hartman,
	Stephan von Krawczynski, Kernel Testers List,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org"
In-Reply-To: <1256221356-26049-2-git-send-email-mel-wPRd99KPJ+uzQB+pC5nmwQ@public.gmane.org>

> If a direct reclaim makes no forward progress, it considers whether it
> should go OOM or not. Whether OOM is triggered or not, it may retry the
> application afterwards. In times past, this would always wake kswapd as well
> but currently, kswapd is not woken up after direct reclaim fails. For order-0
> allocations, this makes little difference but if there is a heavy mix of
> higher-order allocations that direct reclaim is failing for, it might mean
> that kswapd is not rewoken for higher orders as much as it did previously.
> 
> This patch wakes up kswapd when an allocation is being retried after a direct
> reclaim failure. It would be expected that kswapd is already awake, but
> this has the effect of telling kswapd to reclaim at the higher order as well.
> 
> Signed-off-by: Mel Gorman <mel-wPRd99KPJ+uzQB+pC5nmwQ@public.gmane.org>
> ---
>  mm/page_alloc.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index bf72055..dfa4362 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1817,9 +1817,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	if (NUMA_BUILD && (gfp_mask & GFP_THISNODE) == GFP_THISNODE)
>  		goto nopage;
>  
> +restart:
>  	wake_all_kswapd(order, zonelist, high_zoneidx);
>  
> -restart:
>  	/*
>  	 * OK, we're below the kswapd watermark and have kicked background
>  	 * reclaim. Now things get more complex, so set up alloc_flags according

I think this patch is correct. personally, I like to add some commnent
at restart label. but it isn't big matter.
	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro-+CUm20s59erQFUHtdCDX3A@public.gmane.org>

However, I have a question. __alloc_pages_slowpath() retry logic is,

1. try_to_free_pages() reclaimed some pages:
	-> wait awhile and goto rebalance
2. try_to_free_pages() didn't reclaimed any page:
	-> call out_of_memory() and goto restart

Then, case-1 should be fixed too? 
I mean,

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bf72055..5a27896 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1899,6 +1899,12 @@ rebalance:
 	if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
 		/* Wait for some write requests to complete then retry */
 		congestion_wait(BLK_RW_ASYNC, HZ/50);
+
+		/*
+		 * While we wait congestion wait, Amount of free memory can
+		 * be changed dramatically. Thus, we kick kswapd again.
+		 */
+		wake_all_kswapd(order, zonelist, high_zoneidx);
 		goto rebalance;
 	}
 
-------------------------------------------

?

^ permalink raw reply related

* Re: [PATCH 2/5] page allocator: Do not allow interrupts to use ALLOC_HARDER
From: KOSAKI Motohiro @ 2009-10-26  1:15 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Frans Pop, Jiri Kosina, Sven Geggus,
	Karol Lewandowski, Tobias Oetiker, Rafael J. Wysocki,
	David Miller, Reinette Chatre, Kalle Valo, David Rientjes,
	Mohamed Abbas, Jens Axboe, John W. Linville, Pekka Enberg,
	Bartlomiej Zolnierkiewicz, Greg Kroah-Hartman,
	Stephan von Krawczynski, Kernel Testers List, netdev
In-Reply-To: <1256221356-26049-3-git-send-email-mel@csn.ul.ie>

> Commit 341ce06f69abfafa31b9468410a13dbd60e2b237 altered watermark logic
> slightly by allowing rt_tasks that are handling an interrupt to set
> ALLOC_HARDER. This patch brings the watermark logic more in line with
> 2.6.30.
> 
> [rientjes@google.com: Spotted the problem]
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
> ---
>  mm/page_alloc.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index dfa4362..7f2aa3e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1769,7 +1769,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
>  		 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
>  		 */
>  		alloc_flags &= ~ALLOC_CPUSET;
> -	} else if (unlikely(rt_task(p)))
> +	} else if (unlikely(rt_task(p)) && !in_interrupt())
>  		alloc_flags |= ALLOC_HARDER;
>  
>  	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
> -- 
> 1.6.3.3

good catch.
	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>




--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH] virtio-net: fix data corruption with OOM
From: Rusty Russell @ 2009-10-26  1:41 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, kvm, virtualization
In-Reply-To: <20091025170340.GA22099@redhat.com>

On Mon, 26 Oct 2009 03:33:40 am Michael S. Tsirkin wrote:
> virtio net used to unlink skbs from send queues on error,
> but ever since 48925e372f04f5e35fec6269127c62b2c71ab794
> we do not do this. This causes guest data corruption and crashes
> with vhost since net core can requeue the skb or free it without
> it being taken off the list.
> 
> This patch fixes this by queueing the skb after successfull
> transmit.

I originally thought that this was racy: as soon as we do add_buf, we need to
make sure we're ready for the callback (for virtio_pci, it's ->kick, but we
shouldn't rely on that).

So a comment would be nice.  How's this?

Subject: virtio-net: fix data corruption with OOM
Date: Sun, 25 Oct 2009 19:03:40 +0200
From: "Michael S. Tsirkin" <mst@redhat.com>

virtio net used to unlink skbs from send queues on error,
but ever since 48925e372f04f5e35fec6269127c62b2c71ab794
we do not do this. This causes guest data corruption and crashes
with vhost since net core can requeue the skb or free it without
it being taken off the list.

This patch fixes this by queueing the skb after successful
transmit.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (+ comment)
---

Rusty, here's a fix for another data corrupter I saw.
This fixes a regression from 2.6.31, so definitely
2.6.32 I think. Comments?

 drivers/net/virtio_net.c |    8 +++++---
 1 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -516,8 +516,7 @@ again:
 	/* Free up any pending old buffers before queueing new ones. */
 	free_old_xmit_skbs(vi);
 
-	/* Put new one in send queue and do transmit */
-	__skb_queue_head(&vi->send, skb);
+	/* Try to transmit */
 	capacity = xmit_skb(vi, skb);
 
 	/* This can happen with OOM and indirect buffers. */
@@ -531,8 +530,17 @@ again:
 		}
 		return NETDEV_TX_BUSY;
 	}
+	vi->svq->vq_ops->kick(vi->svq);
 
-	vi->svq->vq_ops->kick(vi->svq);
+	/*
+	 * Put new one in send queue.  You'd expect we'd need this before
+	 * xmit_skb calls add_buf(), since the callback can be triggered
+	 * immediately after that.  But since the callback just triggers
+	 * another call back here, normal network xmit locking prevents the
+	 * race.
+	 */
+	__skb_queue_head(&vi->send, skb);
+
 	/* Don't wait up for transmitted skbs to be freed. */
 	skb_orphan(skb);
 	nf_reset(skb);

^ permalink raw reply

* linux-next: manual merge of the net tree with the i2c tree
From: Stephen Rothwell @ 2009-10-26  2:37 UTC (permalink / raw)
  To: David Miller, netdev
  Cc: linux-next, linux-kernel, Mika Kuoppala, Jean Delvare,
	Ben Hutchings

Hi all,

Today's linux-next merge of the net tree got a conflict in
drivers/net/sfc/sfe4001.c between commit
3f7c0648f727a6d5baf6117653e4001dc877b90b ("i2c: Prevent priority
inversion on top of bus lock") from the i2c tree and commit
c9597d4f89565b6562bd3026adbe6eac6c317f47 ("sfc: Merge sfe4001.c into
falcon_boards.c") from the net tree.

I have applied the following merge fixup patch (after removing
drivers/net/sfc/sfe4001.c) and can carry it as necessary.

-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au

From: Stephen Rothwell <sfr@canb.auug.org.au>
Date: Mon, 26 Oct 2009 12:24:55 +1100
Subject: [PATCH] net: merge fixup for drivers/net/sfc/falcon_boards.c

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
---
 drivers/net/sfc/falcon_boards.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/sfc/falcon_boards.c b/drivers/net/sfc/falcon_boards.c
index 99f7372..788b336 100644
--- a/drivers/net/sfc/falcon_boards.c
+++ b/drivers/net/sfc/falcon_boards.c
@@ -327,7 +327,7 @@ static int sfn4111t_reset(struct efx_nic *efx)
 	efx_oword_t reg;
 
 	/* GPIO 3 and the GPIO register are shared with I2C, so block that */
-	mutex_lock(&efx->i2c_adap.bus_lock);
+	rt_mutex_lock(&efx->i2c_adap.bus_lock);
 
 	/* Pull RST_N (GPIO 2) low then let it up again, setting the
 	 * FLASH_CFG_1 strap (GPIO 3) appropriately.  Only change the
@@ -343,7 +343,7 @@ static int sfn4111t_reset(struct efx_nic *efx)
 	efx_writeo(efx, &reg, FR_AB_GPIO_CTL);
 	msleep(1);
 
-	mutex_unlock(&efx->i2c_adap.bus_lock);
+	rt_mutex_unlock(&efx->i2c_adap.bus_lock);
 
 	ssleep(1);
 	return 0;
-- 
1.6.5

^ permalink raw reply related

* Re: PATCH 23/10]Optimize the upload speed for PPP connection.
From: Franko Fang @ 2009-10-26  2:44 UTC (permalink / raw)
  To: David Miller, william.allen.simpson
  Cc: netdev, linux-kernel, zihan, greg, haegar
In-Reply-To: <20091024.064603.249036129.davem@davemloft.net>

Thanks your advice.

But generally, PAGE_SIZE is 4096, whether it is too large or not?

If PAGE_SIZE is really appropriate, then I can resubmit the patch.

Thanks very much.
----- Original Message ----- 
From: "David Miller" <davem@davemloft.net>
To: <william.allen.simpson@gmail.com>
Cc: <huananhu@huawei.com>; <netdev@vger.kernel.org>; <linux-kernel@vger.kernel.org>; <zihan@huawei.com>; <greg@kroah.com>; <haegar@sdinet.de>
Sent: Saturday, October 24, 2009 9:46 PM
Subject: Re: PATCH 23/10]Optimize the upload speed for PPP connection.


> From: William Allen Simpson <william.allen.simpson@gmail.com>
> Date: Fri, 23 Oct 2009 07:46:08 -0400
> 
>> Concur.  I'd go further than that, my code usually made room for at
>> least
>> a full MTU (MRU) with HDLC escaping.  To minimize context switches,
>> that
>> should be 3014 ((1500 MRU + 2 FCS + 4 header) * 2 escapes + 2 flags).
>> 
>> Even in the old days, when memory was tight, context switches and
>> interrupt
>> time were more expensive, too.  PPP is supposed to scale to OC-192.
> 
> Actually I'd like to see ->obuf allocated externally and then
> make it simply PAGE_SIZE.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply

* Re: VLAN and ARP failure on tg3 drivers
From: Gertjan Hofman @ 2009-10-26  4:30 UTC (permalink / raw)
  To: Matt Carlson; +Cc: netdev@vger.kernel.org, Eric Dumazet, Benny Amorsen

Dear Matt, Eric, Benny,

Sorry about the slow response to your fast replies. I think Benny is correct, the 'problem' lies in the fact that we were using a VLAN ID of 0, without knowing its special significance. User error.

I tested it with other VLAN id's (>0) and it appears to work fine. We are not entirely sure we understand  why it used to work with VLAN ID 0 on the Broadcom chips and still does with a number of different cards (with >2.6.27 kernels).  What is the 'correct' behaviour for this incorrect usage ? When a PC returns the ARP response to the machine with the BroadCom card, it will have the destination address of the VLAN device, but presumably the VLAN ID tag set to zero.  Hmmm. I can live with not knowing the answer I guess.


Thanks again,

Gertjan



 

--- On Fri, 10/23/09, Matt Carlson <mcarlson@broadcom.com> wrote:

> From: Matt Carlson <mcarlson@broadcom.com>
> Subject: Re: VLAN and ARP failure on tg3 drivers
> To: "Gertjan Hofman" <gertjan_hofman@yahoo.com>
> Cc: "netdev@vger.kernel.org" <netdev@vger.kernel.org>
> Date: Friday, October 23, 2009, 3:35 PM
> On Thu, Oct 22, 2009 at 09:52:42PM
> -0700, Gertjan Hofman wrote:
> > Dear Kernel developers,
> > 
> > A couple of weeks ago we tried to migrate from a
> 2.6.24? kernel to a 2.6.29 kernel and noticed our VLAN
> application no longer works.? The problem is easy to
> replicate:
> > 
> > 1. connect 2 PC's with a cross-over cable
> > 2. set up a fixed IP address to both PC's? (say
> 192.168.0.[1,2])
> > 3. create a vlan:? vconfig? add eth0 0.
> > 4. set IP addresses on the VLAN devices? (say
> 192.168.1.[1,2])
> > 5. try ping one machine from the other.
> > 
> > I tried to dig into the problem by using un-patched
> kernel.org kernels with Ubuntu .config files.? Kernels up to
> 2.6.26 work fine, kernels after and including 2.6.27 fail.
> The problem is that ARP messages are being dropped. If the
> ARP table is updated by hand on each machine, the
> communication across the VLAN works fine.
> > 
> > At first I thought the kernel VLAN code was the
> problem (we had an earlier issue with a regression in
> 2.6.24) but it looks like the problem is actually with the
> tg3 driver.? Our system uses Broadcom ethernet chips. I
> tried the same experiments with combination of boards that
> have Broadcom and none-Broadcom and the only time I see it
> fail is with the tg3? driver loaded.
> > 
> > Snooping with WireShark shows that a ARP request from
> the non-Broadcom machine is seen and even answered, but
> never appears back on the network. If the Broadcom machine
> orginates the ARP message, it never arrives at the
> destination. I tried lowering the size of the MTU to 1492 as
> well as giving each VLAN device a different MAC. No deal.
> > 
> > I tried to look at tg3 patch changes from 2.6.26 to
> 2.6.27 but I am not familiar enough with the Git system to
> extract the appropiate changes.? I am a bit surprised that I
> am not seeing any references to this on the web, the
> combination of >2.6.27 kernels, Broadcom and VLAN cant be
> that uncommon.
> > 
> > I would be happy to provide more information and to
> try tests if any one can suggest them.
> > 
> > Sincerely,
> > 
> > Gertjan
> 
> I don't see any reason why your setup should fail, but it
> doesn't hurt
> to gather more info about the problem.
> 
> What device are you experiencing this problem with? 
> Is management
> firmware enabled?  (`ethtool -i ethx`)
> 
>






      

^ permalink raw reply

* Re: [RFC] make per interface sysctl entries configurable
From: Stephen Hemminger @ 2009-10-26  4:31 UTC (permalink / raw)
  To: Octavian Purdila; +Cc: Eric Dumazet, Benjamin LaHaise, Cosmin Ratiu, netdev
In-Reply-To: <200910251954.49700.opurdila@ixiacom.com>

On Sun, 25 Oct 2009 19:54:49 +0200
Octavian Purdila <opurdila@ixiacom.com> wrote:

> 
> RFC patches are attached.
> 
> Another possible approach: add an interface flag and use it to decide whether 
> we want per interface sysctl entries or not.
> 
> Benchmarks for creating 1000 interface (with the ndst module previously posted 
> on the list, ppc750 @800Mhz machine):
> 
> - without the patches:
> 
> real    4m 38.27s
> user    0m 0.00s
> sys     2m 18.90s
> 
> - with the patches:
> 
> real    0m 0.10s
> user    0m 0.00s
> sys     0m 0.05s
> 
> Thanks,
> tavi

I would rather optimize the algorithm than give up and make it
not available. It should be possible to do better by just using some
better programming.

-- 

^ permalink raw reply

* Re: [PATCH next-next-2.6] netdev: better dev_name_hash
From: Stephen Hemminger @ 2009-10-26  4:43 UTC (permalink / raw)
  To: Octavian Purdila; +Cc: netdev
In-Reply-To: <200910252158.53921.opurdila@ixiacom.com>

On Sun, 25 Oct 2009 21:58:53 +0200
Octavian Purdila <opurdila@ixiacom.com> wrote:

> 
> The current dev_name_hash is not very good at spreading entries when a
> large number of interfaces of the same type (e.g. ethXXXXX) are used.
> 
> Here are some performance numbers for creating 16000 dummy interfaces with
> and without the patch (with per device sysctl entries disabled)
> 
>     With patch                      Without patch
> 
>     real    0m 2.27s                real    0m 4.32s
>     user    0m 0.00s                user    0m 0.00s
>     sys     0m 1.13s                sys     0m 2.16s
> 
> Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
> ---
>  net/core/dev.c |    8 +++++++-
>  1 files changed, 7 insertions(+), 1 deletions(-)

My $.02 would for fixing full name hash because other usages would
benefit as well.  Inventing special case for network devices seems
unnecessary.

-- 

^ permalink raw reply

* Re: [PATCH v2 8/8] Document future removal of sysctl_tcp_* options
From: Eric Dumazet @ 2009-10-26  5:03 UTC (permalink / raw)
  To: Bill Fink; +Cc: Gilad Ben-Yossef, William Allen Simpson, netdev
In-Reply-To: <20091025202114.152b94b8.billfink@mindspring.com>

Bill Fink a écrit :
> On Sun, 25 Oct 2009, Gilad Ben-Yossef wrote:
> 
>> Eric Dumazet wrote:
>>
>>> Bill Fink a écrit :
>>>
>>>   
>>>> And as mentioned previously, the global options can be quite useful
>>>> in certain test scenarios.  I also agree the per route settings are
>>>> a very useful addition.  I think the global and per route settings
>>>> are complementary and shouldn't be thought of as in conflict with
>>>> one another.
>>>>     
>>> Absolutely, global setting is a must when an admin wants a quick path.
>>>
>>> The more flexible would be to have two bits per route, plus
>>> 2 bits on the global configuration.
>>>
>>> global conf:
>>> 00 : timestamps OFF, unless a route setting is not 00
>>> 01 : timestamps ON, unless a route setting is not 00
>>> 10 : Force timestamps OFF, ignore route settings (emergency sysadmin request)
>>> 11 : Force timestamps ON, ignore route settings 
>>>
>>> Route settings (used *only* if global setting is 0Y)
>>> 00 : global conf is used
>>> 01 : Force timestamps being OFF for this route
>>> 10 : Force timestamps being ON for this route
>>> 11 : complement global conf
>> Hey, I have no issue to re-spin the patch with this suggestion, if you 
>> truly think this is valuable, but would you please consider the 
>> nightmare of having to just explain this to someone?
>>
>> It sounds to me way too complicated for what it does.
>>
>> I still think having a global kill switch and per route options better 
>> (basically use the exiting patch but not retire the global kill 
>> switch|), but if you must Hgow about we leave the global sysctl as they 
>> are and just have a two bit route option:
>>
>> 0 Use global default
>> 1 Off
>> 2 On
>>
>> It's kind of funny, because this is what the original patch from 
>> Comsleep does and I thought it needlessly complicates things.
>>
>> So, what do you say - which will it be?
> 
> I personally feel the 2-bit settings are overkill.  What i think
> makes the most sense is for the global options to act as they always
> have in the absence of any route specific settings, and for any
> route specific settings to override the related global settings.
> This is both simple and maintains backward compatibility.

Backward compatibility is important, very important, if not the most
important thing. Then usability comes.

I know some busy servers where adding/changing a single route makes them
go crazy (because of ip route flush cache)

So if a route is overriding a global conf, and the admin wants to make an
emergency change during peak hours, he should do it by a global setting,
or he wont use at all this new stuff, and stay conservative.

Alternative would be to not trigger the flush of cache when changing
features flags.

^ permalink raw reply

* Re: [PATCH next-next-2.6] netdev: better dev_name_hash
From: Eric Dumazet @ 2009-10-26  5:28 UTC (permalink / raw)
  To: Hagen Paul Pfeifer; +Cc: Octavian Purdila, netdev
In-Reply-To: <20091025224153.GB20987@nuttenaction>

Hagen Paul Pfeifer a écrit :
> * Octavian Purdila | 2009-10-25 23:55:32 [+0200]:
> 
>> My results shows that new17 is better or very close to jhash2. And I think its 
>> lighter then jhash too.
> 
> If new17 is very close to jhash/jhash2 then the cycles comes into play.
> Anyway, there is already a very potent hash interface in form of jhash{,2}.
> 
> +1 for jhash2
> 

In fact, new10 should be the 'perfect' hash for the "eth%d"
netdev use, not jhash (way more expensive in cpu cycles BTW)

Most linux machines in the world have less than 10 interfaces, jhash
would be really overkill.


Thanks


[PATCH net-next-2.6] netdev: better dev_name_hash

Octavian Purdila pointed out that the current dev_name_hash is 
not very good at spreading entries when a large number of interfaces
of the same type (e.g. ethXXXXX) are used.

Here is hash distribution of 16000 "dummy%d" devices :

full_name_hash[] = {
0, 0, 0, 375, 0, 0, 562, 0, 0, 0, 6, 0, 0, 0, 0, 56,
0, 0, 0, 374, 0, 0, 562, 0, 0, 0, 5, 0, 0, 0, 0, 57,
0, 0, 0, 375, 0, 0, 562, 0, 0, 0, 6, 1, 0, 0, 0, 56,
0, 0, 0, 375, 0, 0, 563, 0, 0, 0, 5, 1, 0, 0, 0, 56,
0, 0, 0, 375, 0, 0, 563, 0, 0, 0, 6, 1, 0, 0, 0, 56,
0, 0, 0, 375, 0, 0, 562, 0, 0, 0, 6, 0, 0, 0, 0, 57,
0, 0, 0, 375, 0, 0, 563, 0, 0, 0, 5, 0, 0, 0, 0, 56,
0, 0, 0, 376, 0, 0, 563, 0, 0, 0, 6, 1, 0, 0, 0, 56,
0, 0, 0, 375, 0, 0, 562, 0, 0, 0, 5, 1, 0, 0, 0, 57,
0, 0, 0, 374, 0, 0, 563, 0, 0, 0, 6, 1, 0, 0, 0, 56,
0, 0, 0, 375, 0, 0, 563, 0, 0, 0, 6, 1, 0, 0, 0, 56,
0, 0, 0, 376, 0, 0, 562, 0, 0, 0, 6, 0, 0, 0, 0, 56,
0, 0, 0, 375, 0, 0, 562, 0, 0, 0, 5, 0, 0, 0, 0, 56,
0, 0, 0, 375, 0, 0, 562, 0, 0, 0, 5, 1, 0, 0, 0, 56,
0, 0, 0, 375, 0, 0, 563, 0, 0, 0, 6, 1, 0, 0, 0, 57,
0, 0, 0, 375, 0, 0, 563, 0, 0, 0, 6, 1, 0, 0, 0, 56,
};

Instead of using full_name_hash(), netdev should use a hash suited
to its typical uses, which are a common substring followed by a base 10 number.

new hash distribution :

string_hash10[] = {
62, 63, 61, 60, 61, 63, 61, 62, 64, 62, 61, 62, 62, 60, 60, 61,
61, 59, 60, 63, 61, 60, 62, 63, 62, 60, 60, 60, 59, 60, 61, 59,
58, 61, 61, 60, 60, 61, 61, 58, 58, 59, 58, 57, 58, 59, 58, 59,
60, 60, 59, 61, 63, 61, 60, 60, 62, 61, 60, 61, 61, 60, 61, 62,
61, 62, 63, 63, 62, 62, 64, 64, 61, 62, 63, 62, 62, 63, 64, 64,
64, 64, 64, 62, 64, 65, 62, 62, 63, 63, 62, 62, 63, 64, 62, 62,
64, 62, 63, 65, 64, 63, 63, 64, 64, 63, 63, 67, 65, 64, 66, 66,
66, 66, 66, 65, 64, 63, 65, 63, 63, 66, 66, 64, 64, 65, 65, 64,
63, 66, 64, 64, 65, 65, 63, 64, 65, 63, 62, 61, 64, 61, 61, 63,
65, 64, 63, 64, 62, 62, 62, 64, 61, 61, 63, 63, 63, 63, 65, 64,
62, 61, 63, 61, 61, 62, 61, 61, 62, 63, 62, 62, 63, 66, 62, 61,
62, 62, 62, 61, 62, 61, 61, 61, 64, 62, 63, 65, 63, 63, 63, 64,
62, 60, 60, 63, 61, 61, 63, 62, 63, 65, 65, 63, 62, 63, 65, 62,
62, 64, 63, 63, 65, 66, 65, 64, 64, 65, 63, 64, 66, 63, 63, 65,
66, 64, 63, 64, 66, 64, 63, 65, 63, 64, 66, 66, 64, 63, 64, 64,
62, 62, 64, 61, 60, 62, 63, 62, 61, 62, 63, 61, 62, 63, 60, 59,
}; 


Based on a previous patch from Octavian Purdila

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/core/dev.c |   15 ++++++++++++++-
 1 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index fa88dcd..e192068 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -196,9 +196,22 @@ EXPORT_SYMBOL(dev_base_lock);
 #define NETDEV_HASHBITS	8
 #define NETDEV_HASHENTRIES (1 << NETDEV_HASHBITS)
 
+/*
+ * Because of "eth%d" patterns, following hash is giving good distribution
+ */
+static inline unsigned int string_hash10(const char *name, unsigned int len)
+{
+	unsigned int i, hash = 0;
+
+	for (i = 0; i < len; i++)
+		hash = hash * 10 + name[i];
+
+	return hash;
+}
+
 static inline struct hlist_head *dev_name_hash(struct net *net, const char *name)
 {
-	unsigned hash = full_name_hash(name, strnlen(name, IFNAMSIZ));
+	unsigned hash = string_hash10(name, strnlen(name, IFNAMSIZ));
 	return &net->dev_name_head[hash & ((1 << NETDEV_HASHBITS) - 1)];
 }
 

^ permalink raw reply related

* Re: [PATCH next-next-2.6] netdev: better dev_name_hash
From: Stephen Hemminger @ 2009-10-26  6:30 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Octavian Purdila, netdev
In-Reply-To: <4AE4C1FA.7000002@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 1536 bytes --]

I overkilled this with more functions and compared filenames as well.


genarated names (dummyNNNN)
Algorithm             Time (us)     Ratio       Max   StdDev
kr_hash                  277925   152408.6   468448 543.19
string_hash31            329356     5859.4    16042  44.18
SuperFastHash            324570     4885.9    10502   2.29
djb2                     327908     5608.5    15210  38.08
string_hash17            326769     4883.6     9896   0.76
full_name_hash           343196    63921.0   140628 343.62
jhash_string             463801     4883.8    10085   1.02
sdbm                     398587     9801.7    29634  99.18

filesystem names
Algorithm             Time (us)     Ratio       Max   StdDev
kr_hash                  278840   152314.9   468717 543.01
string_hash31            331206     5802.1    16004  42.87
SuperFastHash            325938     4887.5    13528   2.88
djb2                     330621     5607.1    15333  38.05
string_hash17            331181     4884.9    13274   1.78
full_name_hash           347312    63972.2   141336 343.77
jhash_string             466799     4885.2    13275   1.92
sdbm                     403691     9771.7    29629  98.88

Ratio is the average number of buckets examined when scanning
the whole set of names.


1) Increased hash buckets to 1024 which seems reasonable if we are
   going to test that many names.
2) Increased name size to 256 so that longer filenames could be
   checked and name blocks were not in same cache line

* SuperFastHash is too big to put inline


[-- Attachment #2: hashtest.c --]
[-- Type: text/x-c++src, Size: 6872 bytes --]

#include <stdio.h>
#include <stdint.h>
#include <malloc.h>
#include <sys/types.h>
#include <sys/time.h>
#include <string.h>
#include <stdlib.h>
#include <math.h>

static unsigned int
kr_hash(const unsigned char *str, unsigned int len)
{
	unsigned int hash = 0;

        while (len--)
	    hash += *str++;

	return hash;
}

static unsigned int
sdbm(const unsigned char *name, unsigned int len)
{
        unsigned long hash = 0;

        while (len--)
            hash = *name++ + (hash << 6) + (hash << 16) - hash;

        return hash;
}

#define init_name_hash()                0

/* partial hash update function. Assume roughly 4 bits per character */
static inline unsigned long
partial_name_hash(unsigned long c, unsigned long prevhash)
{
        return (prevhash + (c << 4) + (c >> 4)) * 11;
}

/* Compute the hash for a name string. */
static unsigned int
full_name_hash(const unsigned char *name, unsigned int len)
{
        unsigned long hash = 0;
        while (len--)
                hash = partial_name_hash(*name++, hash);
        return hash;
}

static unsigned int
djb2(const unsigned char *str, unsigned int len)
{
	unsigned long hash = 5381;

        while (len--)
		hash = ((hash << 5) + hash) + *str++; /* hash * 33 + c */

        return hash;
}

static unsigned int
string_hash31(const unsigned char *name, unsigned int len)
{
        unsigned hash = 0;
        int i;

        for (i = 0; i < len; ++i)
                hash = 31 * hash + name[i];
        return hash;
}

static unsigned int
string_hash17(const unsigned char *name, unsigned int len)
{
        unsigned hash = 0;
        int i;

        for (i = 0; i < len; ++i)
                hash = 17 * hash + name[i];
        return hash;
}

/* will need to change to unaligned_access() */
static inline uint16_t get16bits(const unsigned char *data)
{
	return *(uint16_t *) data;
}

static inline unsigned int
SuperFastHash (const unsigned char * data, unsigned int len)
{
	uint32_t hash = len, tmp;
	int rem;

	rem = len & 3;
	len >>= 2;

	/* Main loop */
	for (;len > 0; len--) {
		hash  += get16bits (data);
		tmp    = (get16bits (data+2) << 11) ^ hash;
		hash   = (hash << 16) ^ tmp;
		data  += 2*sizeof (uint16_t);
		hash  += hash >> 11;
	}

	/* Handle end cases */
	switch (rem) {
        case 3: hash += get16bits (data);
                hash ^= hash << 16;
                hash ^= data[sizeof (uint16_t)] << 18;
                hash += hash >> 11;
                break;
        case 2: hash += get16bits (data);
                hash ^= hash << 11;
                hash += hash >> 17;
                break;
        case 1: hash += *data;
                hash ^= hash << 10;
                hash += hash >> 1;
	}

	/* Force "avalanching" of final 127 bits */
	hash ^= hash << 3;
	hash += hash >> 5;
	hash ^= hash << 4;
	hash += hash >> 17;
	hash ^= hash << 25;
	hash += hash >> 6;

	return hash;
}


/* NOTE: Arguments are modified. */
#define __jhash_mix(a, b, c) \
{ \
  a -= b; a -= c; a ^= (c>>13); \
  b -= c; b -= a; b ^= (a<<8); \
  c -= a; c -= b; c ^= (b>>13); \
  a -= b; a -= c; a ^= (c>>12);  \
  b -= c; b -= a; b ^= (a<<16); \
  c -= a; c -= b; c ^= (b>>5); \
  a -= b; a -= c; a ^= (c>>3);  \
  b -= c; b -= a; b ^= (a<<10); \
  c -= a; c -= b; c ^= (b>>15); \
}

/* The golden ration: an arbitrary value */
#define JHASH_GOLDEN_RATIO	0x9e3779b9

/* The most generic version, hashes an arbitrary sequence
 * of bytes.  No alignment or length assumptions are made about
 * the input key.
 */
static inline uint32_t jhash(const void *key, uint32_t length, uint32_t initval)
{
	uint32_t a, b, c, len;
	const uint8_t *k = key;

	len = length;
	a = b = JHASH_GOLDEN_RATIO;
	c = initval;

	while (len >= 12) {
		a += (k[0] +((uint32_t)k[1]<<8) +((uint32_t)k[2]<<16) +((uint32_t)k[3]<<24));
		b += (k[4] +((uint32_t)k[5]<<8) +((uint32_t)k[6]<<16) +((uint32_t)k[7]<<24));
		c += (k[8] +((uint32_t)k[9]<<8) +((uint32_t)k[10]<<16)+((uint32_t)k[11]<<24));

		__jhash_mix(a,b,c);

		k += 12;
		len -= 12;
	}

	c += length;
	switch (len) {
	case 11: c += ((uint32_t)k[10]<<24);
	case 10: c += ((uint32_t)k[9]<<16);
	case 9 : c += ((uint32_t)k[8]<<8);
	case 8 : b += ((uint32_t)k[7]<<24);
	case 7 : b += ((uint32_t)k[6]<<16);
	case 6 : b += ((uint32_t)k[5]<<8);
	case 5 : b += k[4];
	case 4 : a += ((uint32_t)k[3]<<24);
	case 3 : a += ((uint32_t)k[2]<<16);
	case 2 : a += ((uint32_t)k[1]<<8);
	case 1 : a += k[0];
	};

	__jhash_mix(a,b,c);

	return c;
}

static unsigned int
jhash_string(const unsigned char *name, unsigned int len)
{
	return jhash(name, len, 0);
}

#define TESTSIZE	10000000ul
#define HASHBITS	10
#define HASHSZ		(1<<HASHBITS)
#define HASHMSK		(HASHSZ-1)
#define IFNAMSIZ	256

static char *names[TESTSIZE];

static void generate_names(void)
{
	unsigned int i;
	char *buf = malloc(TESTSIZE * IFNAMSIZ);

	printf("\ngenarated names (dummyNNNN)\n");
	for (i = 0; i < TESTSIZE; i++) {
		snprintf(buf, IFNAMSIZ, "dummy%d", i);
		names[i] = buf;
		buf += IFNAMSIZ;
	}
}

static inline void measure(const char *name,  
			   unsigned int (*hash)(const unsigned char *, unsigned int))
{
	unsigned int i;
	struct timeval t0, t1;
	unsigned int dist[HASHSZ] = { 0 };
	unsigned long long ratio = 0;
	long us;
	unsigned long m = 0;
	const double avg = TESTSIZE / HASHSZ;
	double std = 0;

	gettimeofday(&t0, NULL);
	
	for (i = 0; i < TESTSIZE; i++) {
		const unsigned char *str = (const unsigned char *) names[i];
		unsigned int h = hash(str, strlen(names[i]));
		++dist[h & HASHMSK];
	}

	gettimeofday(&t1, NULL);
	us = (t1.tv_sec - t0.tv_sec) * 1000000;
	us += (t1.tv_usec - t0.tv_usec);

	for (i = 0; i < HASHSZ; i++) {
		unsigned int n = dist[i];
		unsigned long long s;
		double delta = (n - avg);

		s = n;
		s *= (1 + n);
		ratio += s / 2;
		if (n > m) m = n;
		
		std += delta * delta;
	}

	printf("%-20s %10ld %10.1f %8ld %6.2f\n", name, us, 
	       (double) ratio / (double)TESTSIZE, m, sqrt(std / TESTSIZE));

}
#define MEASURE(func)	measure(#func, &func)


static void filesystem_names(void)
{
	FILE *f = fopen("/tmp/names", "r");
	if (!f) { perror("/tmp/names"); exit(1); }
	unsigned int i = 0;
	char line[IFNAMSIZ];

	printf("\nfilesystem names\n");
	while (fgets(line, sizeof line, f) != NULL) {
		strncpy(names[i], line, IFNAMSIZ);
		if (++i == TESTSIZE)
			break;
	}
	fclose(f);

}


int main()
{

	generate_names();

	printf("Algorithm             Time (us)     Ratio       Max   StdDev\n");
	MEASURE(kr_hash);
	MEASURE(string_hash31);
	MEASURE(SuperFastHash);
	MEASURE(djb2);
	MEASURE(string_hash17);
	MEASURE(full_name_hash);
	MEASURE(jhash_string);
	MEASURE(sdbm);

	filesystem_names();

	printf("Algorithm             Time (us)     Ratio       Max   StdDev\n");
	MEASURE(kr_hash);
	MEASURE(string_hash31);
	MEASURE(SuperFastHash);
	MEASURE(djb2);
	MEASURE(string_hash17);
	MEASURE(full_name_hash);
	MEASURE(jhash_string);
	MEASURE(sdbm);

        return 0;
}

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox