Netdev List


* Re: [Pv-drivers] RFC: Network Plugin Architecture (NPA) for vmxnet3
From: Christoph Hellwig @ 2010-05-06 20:19 UTC (permalink / raw)
  To: Pankaj Thakkar
  Cc: Gleb Natapov, Christoph Hellwig, Dmitry Torokhov,
	pv-drivers@vmware.com, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	virtualization@lists.linux-foundation.org
In-Reply-To: <20100506180411.GC25364@vmware.com>

On Thu, May 06, 2010 at 11:04:11AM -0700, Pankaj Thakkar wrote:
> Plugin is x86 or x64 machine code. You write the plugin in C and compile it using gcc/ld to get the object file, we map the relevant sections only to the OS space. 

Which is simply not supportable for a cross-platform operating system
like Linux.


* Re: RTL-8110SC lockup with r8169
From: Francois Romieu @ 2010-05-06 20:20 UTC (permalink / raw)
  To: Pádraig Brady; +Cc: netdev, Glen Gray
In-Reply-To: <4BE1973D.8080502@draigBrady.com>

Pádraig Brady <P@draigBrady.com> :
[...]
> However the above code wasn't in the 2.6.32.10-90.fc12 driver we used.
> Also I've back-ported the latest r8169 driver from git to our kernel
> and it still has the same issue.

"latest" as "includes 908ba2bfd22253f26fa910cd855e4ccffb1467d0" ?

Otherwise you may save some time by trying the backport directly at:
http://userweb.kernel.org/~romieu/r8169/2.6.32.11-99.fc12/

> # dmesg | grep 8169
> # lspci -n | grep -v 8086:
> 01:04.0 0200: 10ec:8167 (rev 10)

8167, how come...

What hardware (lspci) does the host computer have?

--
Ueimor


* Re: [Pv-drivers] RFC: Network Plugin Architecture (NPA) for vmxnet3
From: Christoph Hellwig @ 2010-05-06 20:21 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Dmitry Torokhov, Christoph Hellwig, pv-drivers@vmware.com,
	Pankaj Thakkar, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	virtualization@lists.linux-foundation.org
In-Reply-To: <20100505105253.0a8bc465@nehalam>

On Wed, May 05, 2010 at 10:52:53AM -0700, Stephen Hemminger wrote:
> Let me put it bluntly. Any design that allows external code to run
> in the kernel is not going to be accepted.  Out of tree kernel modules are enough
> of a pain already, why do you expect the developers to add another
> interface.

Exactly.  Until our friends at VMware get this basic fact it's useless
to continue arguing.

Pankaj and Dmitry: you're fine to waste your time on this, but it's not
going to go anywhere until you address that fundamental problem.  The
first thing you need to fix in your architecture is to integrate the VF
function code into the kernel tree, and we can work from there.

Please post patches doing this if you want to resume the discussion.


* Re: [PATCH] ipv4: remove ip_rt_secret timer
From: Neil Horman @ 2010-05-06 20:25 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, davem, kuznet, jmorris, yoshfuji, kaber
In-Reply-To: <1273176614.2222.21.camel@edumazet-laptop>

On Thu, May 06, 2010 at 10:10:14PM +0200, Eric Dumazet wrote:
> 
> > Doing that doesn't solve my aim however, which is to avoid performing rt_genid
> > updates when no one is attacking you at all.  I completely agree that we can
> > start the gen_id at some random value (by forcing an initial invalidation),
> > however.  Beyond that however, if someone is managing to guess our secret value,
> > then we need to make our secret value more complex to determine.  Perhaps given
> > the reduction in the number of times we need to iterate our gen_id with the
> > timer gone, we can use something more heavyweight to determine the hash
> > secret (the cprng perhaps?).
> 
> Secrets that don't change are known to be honey pots for hackers.
> 
> I just don't see why we want to risk security regressions for something
> that proved to work well.
> 
Because we have two ways of doing the same thing now, and I don't see why we
should maintain code for both.  I get that a timer based invalidation works
well.  So does the statistical analysis.

> Cache invalidation is just a genid change nowadays, and doesn't have side
> effects.
> 
I disagree with this, changing a genid in and of itself is fast, yes, but it
creates a need for the cache to get repopulated, sending packets through the
slow routing path.  On high volume systems this causes a performance
degradation.  The timer approach makes that a periodic degradation, one that I
would like to avoid if possible.

I get that hackers like secrets to stay unchanged so that they can figure out
what they are.  It's not like we're leaving ourselves vulnerable here, we're just
rebuilding only when we need it, not every X seconds.  And if someone is
_really_ in need of a periodic rebuild, and can cope with the performance hit,
then they can still do that from user space, as I've pointed out.  We just don't
need to keep the code in the kernel any more.
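
To make the cost argument concrete, here is a minimal user-space sketch of
generation-id invalidation.  It is not the code in net/ipv4/route.c: the
names (struct cache, cache_lookup, slow_lookup, and so on) are made up for
illustration, and the "slow path" is a trivial stand-in for a full route
lookup.  It only shows that bumping the genid is O(1) while every entry
looked up afterwards has to be rebuilt.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CACHE_SLOTS 8

/* Hypothetical cache entry: valid only while its genid matches the cache's. */
struct entry {
	bool used;
	unsigned int key;
	unsigned int value;
	unsigned int genid;
};

struct cache {
	unsigned int genid;      /* current generation */
	unsigned long slow_hits; /* lookups that fell through to the slow path */
	struct entry slot[CACHE_SLOTS];
};

/* Stand-in for the full (expensive) lookup; deliberately trivial here. */
static unsigned int slow_lookup(unsigned int key)
{
	return key * 2u + 1u;
}

/* O(1) invalidation: nothing is freed, stale entries simply stop matching. */
static void cache_invalidate(struct cache *c)
{
	assert(c != NULL);
	c->genid++;
}

static unsigned int cache_lookup(struct cache *c, unsigned int key)
{
	struct entry *e;

	assert(c != NULL);
	e = &c->slot[key % CACHE_SLOTS];
	if (e->used && e->key == key && e->genid == c->genid)
		return e->value; /* fast path */

	/* Miss or stale generation: pay for the slow path and repopulate. */
	c->slow_hits++;
	e->used = true;
	e->key = key;
	e->value = slow_lookup(key);
	e->genid = c->genid;
	return e->value;
}
```

With a zero-initialized struct cache, two lookups of the same key cost one
slow-path hit; after cache_invalidate() the next lookup of that key costs
another.  A periodic timer makes every active flow pay that once per
interval, whether or not anyone is attacking.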

> Considering we do cache invalidation when routes are changed anyway, I
> don't get why we should avoid the invalidation once every xxx seconds...
> 
Who says routes are going to change that often?  I know you don't believe that the
former is a substitute for the latter.  As for why we should avoid periodic
invalidation, I've said it several times now.

> If you believe this cache invalidation has problems, maybe we should
> address them and not hide them ?
> 
Now you're just being intentionally obtuse.  Eric, you know perfectly good and
well what my reasons are for wanting to remove the rt_secret timer.  It's why we
did the statistical analysis code in the first place.  There's just not a large
need for it.  If you want to do periodic invalidation, fine, do it.  Just do it
in user space.  We have an on-demand strategy in the kernel that has been
working well for quite some time, and is superior in performance for 99% of the
use cases out there.  So let's lighten the maintenance workload for the code
that's not strictly needed anymore by getting rid of it.



* Re: [PATCH v21 020/100] c/r: documentation
From: Randy Dunlap @ 2010-05-06 20:27 UTC (permalink / raw)
  To: Oren Laadan
  Cc: Andrew Morton, containers, linux-kernel, Serge Hallyn,
	Matt Helsley, Pavel Emelyanov, linux-api, linux-mm, linux-fsdevel,
	netdev, Dave Hansen
In-Reply-To: <1272723382-19470-21-git-send-email-orenl@cs.columbia.edu>

On Sat,  1 May 2010 10:15:02 -0400 Oren Laadan wrote:

> Covers application checkpoint/restart, overall design, interfaces,
> usage, shared objects, and checkpoint image format.
> 
> Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
> Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
> Acked-by: Serge E. Hallyn <serue@us.ibm.com>
> Tested-by: Serge E. Hallyn <serue@us.ibm.com>
> ---
>  Documentation/checkpoint/checkpoint.c      |   38 +++
>  Documentation/checkpoint/readme.txt        |  370 ++++++++++++++++++++++++++++
>  Documentation/checkpoint/self_checkpoint.c |   69 +++++
>  Documentation/checkpoint/self_restart.c    |   40 +++
>  Documentation/checkpoint/usage.txt         |  247 +++++++++++++++++++
>  5 files changed, 764 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/checkpoint/checkpoint.c
>  create mode 100644 Documentation/checkpoint/readme.txt
>  create mode 100644 Documentation/checkpoint/self_checkpoint.c
>  create mode 100644 Documentation/checkpoint/self_restart.c
>  create mode 100644 Documentation/checkpoint/usage.txt

> diff --git a/Documentation/checkpoint/readme.txt b/Documentation/checkpoint/readme.txt
> new file mode 100644
> index 0000000..4fa5560
> --- /dev/null
> +++ b/Documentation/checkpoint/readme.txt
> @@ -0,0 +1,370 @@
> +
...
> +In contrast, when checkpointing a subtree of a container it is up to
> +the user to ensure that dependencies either don't exist or can be
> +safely ignored. This is useful, for instance, for HPC scenarios or
> +even a user that would like to periodically checkpoint a long-running

               who

> +batch job.
> +
...

> +
> +Checkpoint image format
> +=======================
> +
...

> +
> +The container configuration section containers information that is

                                       contains

> +global to the container. Security (LSM) configuration is one example.
> +Network configuration and container-wide mounts may also go here, so
> +that the userspace restart coordinator can re-create a suitable
> +environment.
> +
...

> +
> +Then the state of all tasks is saved, in the order that they appear in
> +the tasks array above. For each state, we save data like task_struct,
> +namespaces, open files, memory layout, memory contents, cpu state,

                                                           CPU (throughout, please)

> +signals and signal handlers, etc. For resources that are shared among
> +multiple processes, we first checkpoint said resource (and only once),
> +and in the task data we give a reference to it. More about shared
> +resources below.
> +
...

> +
> +Shared objects
> +==============
> +
> +Many resources may be shared by multiple tasks (e.g. file descriptors,
> +memory address space, etc), or even have multiple references from

                         etc.),

> +other resources (e.g. a single inode that represents two ends of a
> +pipe).
> +
...

> +Memory contents format
> +======================
> +
> +The memory contents of a given memory address space (->mm) is dumped

                                                              are (I think)

> +as a sequence of vma objects, represented by 'struct ckpt_hdr_vma'.
> +This header details the vma properties, and a reference to a file
> +(if file backed) or an inode (or shared memory) object.
> +
> +The vma header is followed by the actual contents - but only those
> +pages that need to be saved, i.e. dirty pages. They are written in
> +chunks of data, where each chunks contains a header that indicates

                              chunk

> +that number of pages in the chunk, followed by an array of virtual

   the

> +addresses and then an array of actual page contents. The last chunk
> +holds zero pages.
> +
...
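
For readers without the full patch at hand, the chunk layout described above
can be sketched as a small user-space walker.  This is an assumption-laden
illustration, not the format from the patch: the field widths (a 32-bit page
count, 64-bit virtual addresses), the absence of any other header fields,
and the name count_pages are all hypothetical.  It only demonstrates the
"header, address array, page array, terminated by a zero-page chunk"
structure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Walk a buffer of chunks laid out as:
 *   uint32_t nr_pages;
 *   uint64_t vaddr[nr_pages];
 *   unsigned char page[nr_pages][page_size];
 * and stop at the chunk whose nr_pages is zero.
 *
 * Returns the total number of pages, or -1 if the buffer is truncated.
 * (Hypothetical layout; the real headers in the patch may differ.)
 */
static long count_pages(const unsigned char *buf, size_t len, size_t page_size)
{
	size_t off = 0;
	long total = 0;

	assert(buf != NULL && page_size > 0);
	for (;;) {
		uint32_t nr;
		size_t body;

		if (len - off < sizeof(nr))
			return -1;
		memcpy(&nr, buf + off, sizeof(nr)); /* avoid unaligned reads */
		off += sizeof(nr);
		if (nr == 0)
			return total; /* terminating chunk */

		body = (size_t)nr * (sizeof(uint64_t) + page_size);
		if (len - off < body)
			return -1;
		off += body;
		total += (long)nr;
	}
}
```

A reader that restores memory would copy each page to its address instead of
just skipping over it, but the loop and the termination test stay the same.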

> +Kernel interfaces
> +=================
> +
> +* To checkpoint a vma, the 'struct vm_operations_struct' needs to
> +  provide a method ->checkpoint:
> +    int checkpoint(struct ckpt_ctx *, struct vma_struct *)
> +  Restart requires a matching (exported) restore:
> +    int restore(struct ckpt_ctx *, struct mm_struct *, struct ckpt_hdr_vma *)
> +
> +* To checkpoint a file, the 'struct file_operations' needs to provide
> +  the methods ->checkpoint and ->collect:
> +    int checkpoint(struct ckpt_ctx *, struct file *)
> +    int collect(struct ckpt_ctx *, struct file *)
> +  Restart requires a matching (exported) restore:
> +    int restore(struct ckpt_ctx *, struct ckpt_hdr_file *)
> +  For most file systems, generic_file_{checkpoint,restore}() can be
> +  used.
> +
> +* To checkpoint a socket, the 'struct proto_ops' needs to provide

     To checkpoint/restart a socket,

> +  the methods ->checkpoint, ->collect and ->restore:
> +    int checkpoint(struct ckpt_ctx *ctx, struct socket *sock);
> +    int collect(struct ckpt_ctx *ctx, struct socket *sock);
> +    int restore(struct ckpt_ctx *, struct socket *sock, struct ckpt_hdr_socket *h)


> diff --git a/Documentation/checkpoint/usage.txt b/Documentation/checkpoint/usage.txt
> new file mode 100644
> index 0000000..c6fc045
> --- /dev/null
> +++ b/Documentation/checkpoint/usage.txt
> @@ -0,0 +1,247 @@
> +
> +	      How to use Checkpoint-Restart
> +	=========================================
> +
> +
> +API
> +===
> +
> +The API consists of three new system calls:
> +
> +* long checkpoint(pid_t pid, int fd, unsigned long flag, int logfd);

                                                      flags,

> +
> + Checkpoint a (sub-)container whose root task is identified by @pid,
> + to the open file indicated by @fd. If @logfd isn't -1, it indicates
> + an open file to which error and debug messages are written. @flags
> + may be one or more of:
> +   - CHECKPOINT_SUBTREE : allow checkpoint of sub-container
> + (other value are not allowed).
> +
> + Returns: a positive checkpoint identifier (ckptid) upon success, 0 if
> + it returns from a restart, and -1 if an error occurs. The ckptid will
> + uniquely identify a checkpoint image, for as long as the checkpoint
> + is kept in the kernel (e.g. if one wishes to keep a checkpoint, or a
> + partial checkpoint, residing in kernel memory).
> +
> +* long sys_restart(pid_t pid, int fd, unsigned long flags, int logfd);
> +
> + Restart a process hierarchy from a checkpoint image that is read from
> + the blob stored in the file indicated by @fd.  If @logfd isn't -1, it
> + indicates an open file to which error and debug messages are written.
> + @flags will have future meaning (must be 0 for now). @pid indicates
> + the root of the hierarchy as seen in the coordinator's pid-namespace,
> + and is expected to be a child of the coordinator. @flags may be one
> + or more of:
> +   - RESTART_TASKSELF : (self) restart of a single process
> +   - RESTART_FROEZN : processes remain frozen once restart completes

                FROZEN ?

> +   - RESTART_GHOST : process is a ghost (placeholder for a pid)

about @flags:  Above says both of these:
a) @flags will have future meaning (must be 0 for now)
b) @flags may be one or more of:

so please decide which one it is ;)

> + (Note that this argument may mean 'ckptid' to identify an in-kernel
> + checkpoint image, with some @flags in the future).
> +
> + Returns: -1 if an error occurs, 0 on success when restarting from a
> + "self" checkpoint, and return value of system call at the time of the
> + checkpoint when restarting from an "external" checkpoint.
> +
...
> +
> +Sysctl/proc
> +===========
> +
> +/proc/sys/kernel/ckpt_unpriv_allowed		[default = 1]
> +  controls whether c/r operation is allowed for unprivileged users

                      C/R

> +
> +
> +Operation
> +=========
> +
> +The granularity of a checkpoint usually is a process hierarchy. The
> +'pid' argument is interpreted in the caller's pid namespace. So to
> +checkpoint a container whose init task (pid 1 in that pidns) appears
> +as pid 3497 the caller's pidns, the caller must use pid 3497. Passing
> +pid 1 will attempt to checkpoint the caller's container, and if the
> +caller isn't privileged and init is owned by root, it will fail.
> +
> +Unless the CHECKPOINT_SUBTREE flag is set, if the caller passes a pid
> +which does not refer to a container's init task, then sys_checkpoint()
> +would return -EINVAL.

   returns -EINVAL.

...

> +
> +
> +User tools
> +==========
> +
> +* checkpoint(1): a tool to perform a checkpoint of a container/subtree
> +* restart(1): a tool to restart a container/subtree
> +* ckptinfo: a tool to examine a checkpoint image
> +
> +It is best to use the dedicated user tools for checkpoint and restart.
> +
> +If you insist, then here is a code snippet that illustrates how a
> +checkpoint is initiated by a process inside a container - the logic is
> +similar to fork():
> +	...
> +	ckptid = checkpoint(0, ...);
> +	switch (crid) {

	       (ckptid) ?

> +	case -1:
> +		perror("checkpoint failed");
> +		break;
> +	default:
> +		fprintf(stderr, "checkpoint succeeded, CRID=%d\n", ret);

s/ret/ckptid/ ?

> +		/* proceed with execution after checkpoint */
> +		...
> +		break;
> +	case 0:
> +		fprintf(stderr, "returned after restart\n");
> +		/* proceed with action required following a restart */
> +		...
> +		break;
> +	}
> +	...
> +
> +And to initiate a restart, the process in an empty container can use
> +logic similar to execve():
> +	...
> +	if (restart(pid, ...) < 0)
> +		perror("restart failed");
> +	/* only get here if restart failed */
> +	...
> +
> +Note, that the code also supports "self" checkpoint, where a process

   Note that

> +can checkpoint itself. This mode does not capture the relationships of
> +the task with other tasks, or any shared resources. It is useful for
> +application that wish to be able to save and restore their state.

   applications

> +They will either not use (or care about) shared resources, or they
> +will be aware of the operations and adapt suitably after a restart.
> +The code above can also be used for "self" checkpoint.
> +
> +
> +You may find the following sample programs useful:
> +
> +* checkpoint.c: accepts a 'pid' and checkpoint that task to stdout

                                       checkpoints

> +* self_checkpoint.c: a simple test program doing self-checkpoint
> +* self_restart.c: restarts a (self-) checkpoint image from stdin
> +
> +See also the utilities 'checkpoint' and 'restart' (from user-cr).
> +
> +
> +"External" checkpoint
> +=====================
> +
> +To do "external" checkpoint, you need to first freeze that other task
> +either using the freezer cgroup.

eh?  cannot parse that.

> +
> +Restart does not preserve the original PID yet, (because we haven't
> +solved yet the fork-with-specific-pid issue). In a real scenario, you
> +probably want to first create a new names space, and have the init

                                       namespace,

> +task there call 'sys_restart()'.
> +
> +I tested it this way:

...

---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***



* Re: [PATCH] ipv4: remove ip_rt_secret timer (v2)
From: Neil Horman @ 2010-05-06 20:29 UTC (permalink / raw)
  To: netdev; +Cc: davem, kuznet, jmorris, yoshfuji, kaber
In-Reply-To: <20100506171639.GA5063@hmsreliant.think-freely.org>

Version 2 of this patch, taking Eric's comment about making the rt_genid non-zero
when a netns is created.  This makes sense, and helps prevent attackers from
guessing our initial secret value.



A while back there was a discussion regarding the rt_secret_interval timer.
Given that we've had the ability to do emergency route cache rebuilds for a while
now, based on a statistical analysis of the various hash chain lengths in the
cache, the use of the flush timer is somewhat redundant.  This patch removes the
rt_secret_interval sysctl, allowing us to rely solely on the statistical
analysis mechanism to determine the need for route cache flushes.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>


 include/net/netns/ipv4.h |    1 
 net/ipv4/route.c         |  111 ++++-------------------------------------------
 2 files changed, 11 insertions(+), 101 deletions(-)


diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index ae07fee..d68c3f1 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -55,7 +55,6 @@ struct netns_ipv4 {
 	int sysctl_rt_cache_rebuild_count;
 	int current_rt_cache_rebuild_count;
 
-	struct timer_list rt_secret_timer;
 	atomic_t rt_genid;
 
 #ifdef CONFIG_IP_MROUTE
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index a947428..e55a066 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -129,7 +129,6 @@ static int ip_rt_gc_elasticity __read_mostly	= 8;
 static int ip_rt_mtu_expires __read_mostly	= 10 * 60 * HZ;
 static int ip_rt_min_pmtu __read_mostly		= 512 + 20 + 20;
 static int ip_rt_min_advmss __read_mostly	= 256;
-static int ip_rt_secret_interval __read_mostly	= 10 * 60 * HZ;
 static int rt_chain_length_max __read_mostly	= 20;
 
 static struct delayed_work expires_work;
@@ -918,32 +917,11 @@ void rt_cache_flush_batch(void)
 	rt_do_flush(!in_softirq());
 }
 
-/*
- * We change rt_genid and let gc do the cleanup
- */
-static void rt_secret_rebuild(unsigned long __net)
-{
-	struct net *net = (struct net *)__net;
-	rt_cache_invalidate(net);
-	mod_timer(&net->ipv4.rt_secret_timer, jiffies + ip_rt_secret_interval);
-}
-
-static void rt_secret_rebuild_oneshot(struct net *net)
-{
-	del_timer_sync(&net->ipv4.rt_secret_timer);
-	rt_cache_invalidate(net);
-	if (ip_rt_secret_interval)
-		mod_timer(&net->ipv4.rt_secret_timer, jiffies + ip_rt_secret_interval);
-}
-
 static void rt_emergency_hash_rebuild(struct net *net)
 {
-	if (net_ratelimit()) {
+	if (net_ratelimit())
 		printk(KERN_WARNING "Route hash chain too long!\n");
-		printk(KERN_WARNING "Adjust your secret_interval!\n");
-	}
-
-	rt_secret_rebuild_oneshot(net);
+	rt_cache_invalidate(net);
 }
 
 /*
@@ -3101,48 +3079,6 @@ static int ipv4_sysctl_rtcache_flush(ctl_table *__ctl, int write,
 	return -EINVAL;
 }
 
-static void rt_secret_reschedule(int old)
-{
-	struct net *net;
-	int new = ip_rt_secret_interval;
-	int diff = new - old;
-
-	if (!diff)
-		return;
-
-	rtnl_lock();
-	for_each_net(net) {
-		int deleted = del_timer_sync(&net->ipv4.rt_secret_timer);
-		long time;
-
-		if (!new)
-			continue;
-
-		if (deleted) {
-			time = net->ipv4.rt_secret_timer.expires - jiffies;
-
-			if (time <= 0 || (time += diff) <= 0)
-				time = 0;
-		} else
-			time = new;
-
-		mod_timer(&net->ipv4.rt_secret_timer, jiffies + time);
-	}
-	rtnl_unlock();
-}
-
-static int ipv4_sysctl_rt_secret_interval(ctl_table *ctl, int write,
-					  void __user *buffer, size_t *lenp,
-					  loff_t *ppos)
-{
-	int old = ip_rt_secret_interval;
-	int ret = proc_dointvec_jiffies(ctl, write, buffer, lenp, ppos);
-
-	rt_secret_reschedule(old);
-
-	return ret;
-}
-
 static ctl_table ipv4_route_table[] = {
 	{
 		.procname	= "gc_thresh",
@@ -3251,13 +3187,6 @@ static ctl_table ipv4_route_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
-	{
-		.procname	= "secret_interval",
-		.data		= &ip_rt_secret_interval,
-		.maxlen		= sizeof(int),
-		.mode		= 0644,
-		.proc_handler	= ipv4_sysctl_rt_secret_interval,
-	},
 	{ }
 };
 
@@ -3336,34 +3265,18 @@ static __net_initdata struct pernet_operations sysctl_route_ops = {
 };
 #endif
 
-
-static __net_init int rt_secret_timer_init(struct net *net)
+static __net_init int rt_genid_init(struct net *net)
 {
-	atomic_set(&net->ipv4.rt_genid,
-			(int) ((num_physpages ^ (num_physpages>>8)) ^
-			(jiffies ^ (jiffies >> 7))));
-
-	net->ipv4.rt_secret_timer.function = rt_secret_rebuild;
-	net->ipv4.rt_secret_timer.data = (unsigned long)net;
-	init_timer_deferrable(&net->ipv4.rt_secret_timer);
-
-	if (ip_rt_secret_interval) {
-		net->ipv4.rt_secret_timer.expires =
-			jiffies + net_random() % ip_rt_secret_interval +
-			ip_rt_secret_interval;
-		add_timer(&net->ipv4.rt_secret_timer);
-	}
+	/*
+	 * This just serves to start off each new net namespace
+	 * with a non-zero rt_genid value, making it harder to guess
+	 */
+	rt_cache_invalidate(net);
 	return 0;
 }
 
-static __net_exit void rt_secret_timer_exit(struct net *net)
-{
-	del_timer_sync(&net->ipv4.rt_secret_timer);
-}
-
-static __net_initdata struct pernet_operations rt_secret_timer_ops = {
-	.init = rt_secret_timer_init,
-	.exit = rt_secret_timer_exit,
+static __net_initdata struct pernet_operations rt_genid_ops = {
+	.init = rt_genid_init,
 };
 
 
@@ -3424,9 +3337,6 @@ int __init ip_rt_init(void)
 	schedule_delayed_work(&expires_work,
 		net_random() % ip_rt_gc_interval + ip_rt_gc_interval);
 
-	if (register_pernet_subsys(&rt_secret_timer_ops))
-		printk(KERN_ERR "Unable to setup rt_secret_timer\n");
-
 	if (ip_rt_proc_init())
 		printk(KERN_ERR "Unable to create route proc files\n");
 #ifdef CONFIG_XFRM
@@ -3438,6 +3348,7 @@ int __init ip_rt_init(void)
 #ifdef CONFIG_SYSCTL
 	register_pernet_subsys(&sysctl_route_ops);
 #endif
+	register_pernet_subsys(&rt_genid_ops);
 	return rc;
 }
 


* Re: [PATCH  kernel 2.6.34-rc5] lib8390: to be SMP safe
From: Ken Kawasaki @ 2010-05-06 20:47 UTC (permalink / raw)
  To: netdev
In-Reply-To: <20100503194316.60c98272.ken_kawasaki@spring.nifty.jp>


Sorry, I am withdrawing this patch
and will test it again.


Best Regards
Ken

> 
> lib8390:
> 	write the value "ENISR_ALL" to register "EN0_IMR"
> 	after enable_irq_lockdep_irqrestore. 
> 
> 	This patch avoids frequent transmit errors on SMP systems.
> 
> 
> Signed-off-by: Ken Kawasaki <ken_kawasaki@spring.nifty.jp>
> 
> ---
> 
> --- linux-2.6.34-rc6/drivers/net/lib8390.c.orig	2010-05-02 16:49:57.000000000 +0900
> +++ linux-2.6.34-rc6/drivers/net/lib8390.c	2010-05-02 18:09:18.000000000 +0900
> @@ -367,9 +367,9 @@ static netdev_tx_t __ei_start_xmit(struc
>  				dev->name, ei_local->tx1, ei_local->tx2, ei_local->lasttx);
>  		ei_local->irqlock = 0;
>  		netif_stop_queue(dev);
> -		ei_outb_p(ENISR_ALL, e8390_base + EN0_IMR);
>  		spin_unlock(&ei_local->page_lock);
>  		enable_irq_lockdep_irqrestore(dev->irq, &flags);
> +		ei_outb_p(ENISR_ALL, e8390_base + EN0_IMR);
>  		dev->stats.tx_errors++;
>  		return NETDEV_TX_BUSY;
>  	}
> @@ -407,10 +407,10 @@ static netdev_tx_t __ei_start_xmit(struc
>  
>  	/* Turn 8390 interrupts back on. */
>  	ei_local->irqlock = 0;
> -	ei_outb_p(ENISR_ALL, e8390_base + EN0_IMR);
>  
>  	spin_unlock(&ei_local->page_lock);
>  	enable_irq_lockdep_irqrestore(dev->irq, &flags);
> +	ei_outb_p(ENISR_ALL, e8390_base + EN0_IMR);
>  
>  	dev_kfree_skb (skb);
>  	dev->stats.tx_bytes += send_length;



* Re: ixgbe and mac-vlans problem
From: Ben Greear @ 2010-05-06 20:49 UTC (permalink / raw)
  To: Tantilov, Emil S; +Cc: Arnd Bergmann, NetDev, Patrick McHardy
In-Reply-To: <EA929A9653AAE14F841771FB1DE5A1365FEA26A2D0@rrsmsx501.amr.corp.intel.com>

On 05/06/2010 10:51 AM, Tantilov, Emil S wrote:

> Hi Ben,
>
> We do have a patch in testing (see attached). It may not apply cleanly as it is on top of some other patches currently in validation. Let me know if it works for you.

It wasn't difficult to backport this patch to 2.6.31.12....

I just tested this on an 85998 NIC and 50 MAC-VLANs worked fine.

The NIC doesn't show as PROMISC in any way I can detect, but I guess
it must actually be in PROMISC mode:

[root@i7-1qc-1 ~]# cat /sys/class/net/eth11/flags
0x1003

[root@i7-1qc-1 ~]# ip link show dev eth11
2: eth11: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
     link/ether 00:e0:ed:11:25:12 brd ff:ff:ff:ff:ff:ff
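
For reference, the 0x1003 above can be decoded by hand.  The sketch below
uses the flag values from Linux's include/linux/if.h (IFF_UP = 0x1,
IFF_BROADCAST = 0x2, IFF_PROMISC = 0x100, IFF_MULTICAST = 0x1000) under
local names, so the FLAG_* identifiers and describe_flags() are made up for
illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Local copies of the Linux IFF_* values, renamed to avoid header clashes. */
enum {
	FLAG_UP        = 0x1,
	FLAG_BROADCAST = 0x2,
	FLAG_PROMISC   = 0x100,
	FLAG_MULTICAST = 0x1000,
};

/* Write a space-separated list of the flags set in 'flags' into buf. */
static int describe_flags(unsigned int flags, char *buf, size_t len)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "%s%s%s%s",
			(flags & FLAG_UP)        ? "UP "        : "",
			(flags & FLAG_BROADCAST) ? "BROADCAST " : "",
			(flags & FLAG_PROMISC)   ? "PROMISC "   : "",
			(flags & FLAG_MULTICAST) ? "MULTICAST " : "");
}
```

Feeding it 0x1003 yields UP, BROADCAST and MULTICAST only; the 0x100 bit is
clear, which matches what ip link shows.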

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



* Re: [PATCH] ipv4: remove ip_rt_secret timer (v2)
From: Eric Dumazet @ 2010-05-06 21:08 UTC (permalink / raw)
  To: Neil Horman; +Cc: netdev, davem, kuznet, jmorris, yoshfuji, kaber
In-Reply-To: <20100506202957.GE5063@hmsreliant.think-freely.org>

Le jeudi 06 mai 2010 à 16:29 -0400, Neil Horman a écrit :
> Version 2 of this patch, taking Eric's comment about making the rt_genid non-zero
> when a netns is created.  This makes sense, and helps prevent attackers from
> guessing our initial secret value.
> 
> 
> 
> A while back there was a discussion regarding the rt_secret_interval timer.
> Given that we've had the ability to do emergency route cache rebuilds for a while
> now, based on a statistical analysis of the various hash chain lengths in the
> cache, the use of the flush timer is somewhat redundant.  This patch removes the
> rt_secret_interval sysctl, allowing us to rely solely on the statistical
> analysis mechanism to determine the need for route cache flushes.
> 
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> 

> -
> -static __net_init int rt_secret_timer_init(struct net *net)
> +static __net_init int rt_genid_init(struct net *net)
>  {
> -	atomic_set(&net->ipv4.rt_genid,
> -			(int) ((num_physpages ^ (num_physpages>>8)) ^
> -			(jiffies ^ (jiffies >> 7))));
> -


> +	/*
> +	 * This just serves to start off each new net namespace
> +	 * with a non-zero rt_genid value, making it harder to guess
> +	 */
> +	rt_cache_invalidate(net);
>  	return 0;
>  }
>  

I am _sorry_ to be such a paranoiac guy.

Could you please feed more than 8 bits here ?

like :

get_random_bytes(&net->ipv4.rt_genid, sizeof(net->ipv4.rt_genid));

There is no need to comment this in the code; this kind of random init is
very common in the net tree.
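
To illustrate the difference in search space, here is a user-space sketch.
It assumes, going only by the "more than 8 bits" remark above, that one
invalidation advances the genid by a single random byte plus one; that
model and the function names are hypothetical, not copied from route.c.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed model: one invalidation adds (random byte + 1) to the genid. */
static uint32_t genid_after_invalidate(uint32_t genid, uint8_t shuffle)
{
	return genid + (uint32_t)shuffle + 1u;
}

/* Alternative: take the whole initial value from four random bytes. */
static uint32_t genid_from_random(const uint8_t rnd[4])
{
	uint32_t genid;

	assert(rnd != NULL);
	memcpy(&genid, rnd, sizeof(genid));
	return genid;
}
```

Starting from zero, the first function can only land on 256 distinct values
(1 through 256), whereas the second can produce any 32-bit value.  In the
kernel the random bytes would come from get_random_bytes() as suggested
above.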







* Re: 2.6.33.2: Turn tx power off/on for Atheros card
From: Luis R. Rodriguez @ 2010-05-06 22:16 UTC (permalink / raw)
  To: Yegor Yefremov
  Cc: linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <g2wf69abfc31005060752w6876439cm45f5be68001c8382-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Thu, May 6, 2010 at 7:52 AM, Yegor Yefremov
<yegorslists-gM/Ye1E23mwN+BqQ9rBEUg@public.gmane.org> wrote:
> On Wed, May 5, 2010 at 12:26 PM, Yegor Yefremov
> <yegorslists-gM/Ye1E23mwN+BqQ9rBEUg@public.gmane.org> wrote:
>> I'm using kernel 2.6.33.2 with AR2413 WLAN card. Issuing
>>
>> iwconfig wlan0 txpower off
>>
>> turns txpower off. I can see this status by iwconfig wlan0 and the
>> communication with AP terminates. But when I turn the txpower on
>>
>> iwconfig wlan0 txpower on
>>
>> nothing happens. Though iwconfig shows the previous tx power value.
>> Only ifconfig wlan0 down and then up recovers the transmission.
>>
>> Is it a known bug or am I doing something wrong?
>
> I did some debugging and found out that after iwconfig wlan0 txpower
> off dev_close() will be invoked, so that local->open_count will be 0.
> The next time txpower on is called, it checks whether
> local->open_count > 0 and this condition fails, so no hardware
> configuration will be made.
>
> I've made a quick and dirty hack, that opens the wireless device by
> enabling the txpower, if it was closed before. Is there any proper
> solution? Is it really necessary to close the device to turn txpower off?

Depends on the type of interfaces you have. For a monitor device it
makes no sense to close the device as you should be able to still RX.
It also is possible to TX over a monitor device using frame injection
so technically setting tx power to off would just mute it and would
seem useful.
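
The sequence Yegor describes can be modeled in a few lines of user-space C.
This is a toy state machine built only from his description above; it is not
the wireless stack code, and struct wdev_model and the model_* functions are
hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct wdev_model {
	int open_count; /* mirrors local->open_count in the report */
	bool want_tx;   /* the setting iwconfig would report */
	bool hw_tx;     /* what the hardware is actually doing */
};

/* Push the wanted setting to hardware, but only if the device is open. */
static void model_hw_config(struct wdev_model *w)
{
	assert(w != NULL);
	if (w->open_count > 0) /* the check that fails after "txpower off" */
		w->hw_tx = w->want_tx;
}

static void model_open(struct wdev_model *w)
{
	w->open_count++;
	model_hw_config(w);
}

static void model_close(struct wdev_model *w)
{
	if (w->open_count > 0)
		w->open_count--;
	if (w->open_count == 0)
		w->hw_tx = false;
}

static void model_set_txpower(struct wdev_model *w, bool on)
{
	w->want_tx = on;
	if (!on)
		model_close(w);     /* "txpower off" closes the device */
	else
		model_hw_config(w); /* "txpower on" never reopens it */
}
```

In this model, off followed by on leaves want_tx set but hw_tx clear, which
is the reported symptom; a later model_open() (ifconfig up) is what brings
transmission back.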

  Luis


* [PATCH 2.6.34-rc6] net: Improve ks8851 snl transmit performance
From: Ha, Tristram @ 2010-05-06 22:50 UTC (permalink / raw)
  To: Ben Dooks; +Cc: David Miller, netdev, linux-kernel, Abraham Arce, Sebastien Jan

From: Tristram Ha <Tristram.Ha@micrel.com>

Under heavy transmission the driver will put four 1514-byte packets in the
queue and stop the device transmit queue.  Only the last packet triggers the
transmit done interrupt and wakes up the device transmit queue.  That means
a bit of time is wasted while the CPU cannot send any more packets.

The new implementation triggers the transmit interrupt when the remaining
transmit buffer space is less than 3 packets.  The maximum transmit buffer
size is 6144 bytes.  This allows the device transmit queue to be restarted
sooner so that the CPU can send more packets.

For TCP receiving it also has the benefit of not triggering any transmit
interrupt at all.

There is a driver option no_tx_opt so that the driver can revert to the
original implementation.  This allows users to verify whether the transmit
performance actually improves.

Signed-off-by: Tristram Ha <Tristram.Ha@micrel.com>
---
This replaces the [patch 01/13] patch I submitted, which David objected to.

Other users with the Micrel KSZ8851 SNL chip: please verify whether the transmit performance actually improves.
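
As a rough illustration of the threshold, the user-space sketch below tracks
remaining buffer space and reports when a queued frame should request the
transmit done interrupt.  The numbers come from the description above; the
names (TX_BUF_SIZE, queue_frame) are made up for the example, and per-frame
header overhead in the chip's buffer is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TX_BUF_SIZE      6144u             /* maximum transmit buffer size */
#define MAX_FRAME        1514u             /* largest frame considered here */
#define TX_IRQ_THRESHOLD (3u * MAX_FRAME)  /* "less than 3 packets" */

/*
 * Account for a frame of 'needed' bytes and return true if this frame
 * should request a transmit done interrupt, i.e. the space left after
 * queueing it has dropped below the threshold.
 */
static bool queue_frame(unsigned int *tx_space, unsigned int needed)
{
	assert(tx_space != NULL && needed <= *tx_space);
	*tx_space -= needed;
	return *tx_space < TX_IRQ_THRESHOLD;
}
```

Starting from an empty 6144-byte buffer, the first full-size frame leaves
4630 bytes and requests no interrupt; the second leaves 3116 bytes, which is
below 4542, so it does.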

--- a/drivers/net/ks8851.c	2010-04-29 20:02:05.000000000 -0700
+++ b/drivers/net/ks8851.c	2010-05-06 15:30:40.000000000 -0700
@@ -74,6 +74,9 @@ union ks8851_tx_hdr {
  * @rxd: Space for receiving SPI data, in DMA-able space.
  * @txd: Space for transmitting SPI data, in DMA-able space.
  * @msg_enable: The message flags controlling driver output (see ethtool).
+ * @tx_space: The current available transmit buffer size.
+ * @tx_avail: The maximum available transmit buffer size.
+ * @tx_chk_cnt: Used to indicate how often to check the transmit buffer.
  * @fid: Incrementing frame id tag.
  * @rc_ier: Cached copy of KS_IER.
  * @rc_rxqcr: Cached copy of KS_RXQCR.
@@ -103,6 +106,8 @@ struct ks8851_net {
 
 	u32			msg_enable ____cacheline_aligned;
 	u16			tx_space;
+	u16			tx_avail;
+	u8			tx_chk_cnt;
 	u8			fid;
 
 	u16			rc_ier;
@@ -124,6 +129,7 @@ struct ks8851_net {
 };
 
 static int msg_enable;
+static int no_tx_opt;
 
 #define ks_info(_ks, _msg...) dev_info(&(_ks)->spidev->dev, _msg)
 #define ks_warn(_ks, _msg...) dev_warn(&(_ks)->spidev->dev, _msg)
@@ -580,10 +586,21 @@ static void ks8851_irq_work(struct work_
 
 		/* update our idea of how much tx space is available to the
 		 * system */
+		ks->tx_chk_cnt = 0;
 		ks->tx_space = ks8851_rdreg16(ks, KS_TXMIR);
 
 		if (netif_msg_intr(ks))
 			ks_dbg(ks, "%s: txspace %d\n", __func__, ks->tx_space);
+
+	/* Update tx space when packets are being transmitted. */
+	} else if (ks->tx_space < ks->tx_avail) {
+		ks->tx_chk_cnt++;
+
+		/* Read the transmit buffer register every 4th rx interrupt. */
+		if (4 == ks->tx_chk_cnt) {
+			ks->tx_chk_cnt = 0;
+			ks->tx_space = ks8851_rdreg16(ks, KS_TXMIR);
+		}
 	}
 
 	if (status & IRQ_RXI)
@@ -715,6 +732,7 @@ static void ks8851_tx_work(struct work_s
 	struct ks8851_net *ks = container_of(work, struct ks8851_net, tx_work);
 	struct sk_buff *txb;
 	bool last = skb_queue_empty(&ks->txq);
+	bool tx_irq;
 
 	mutex_lock(&ks->lock);
 
@@ -724,7 +742,11 @@ static void ks8851_tx_work(struct work_s
 
 		if (txb != NULL) {
 			ks8851_wrreg16(ks, KS_RXQCR, ks->rc_rxqcr | RXQCR_SDA);
-			ks8851_wrpkt(ks, txb, last);
+			if (ks->tx_avail)
+				tx_irq = (CHECKSUM_UNNECESSARY == txb->ip_summed);
+			else
+				tx_irq = last;
+			ks8851_wrpkt(ks, txb, tx_irq);
 			ks8851_wrreg16(ks, KS_RXQCR, ks->rc_rxqcr);
 			ks8851_wrreg16(ks, KS_TXQCR, TXQCR_METFE);
 
@@ -917,11 +939,17 @@ static netdev_tx_t ks8851_start_xmit(str
 		ret = NETDEV_TX_BUSY;
 	} else {
 		ks->tx_space -= needed;
+		/*
+		 * Indicate to enable transmit done interrupt when transmit
+		 * buffer is less than a certain size.
+		 */
+		if (ks->tx_avail && ks->tx_space < 1514 * 3)
+			skb->ip_summed = CHECKSUM_UNNECESSARY;
 		skb_queue_tail(&ks->txq, skb);
+		schedule_work(&ks->tx_work);
 	}
 
 	spin_unlock(&ks->statelock);
-	schedule_work(&ks->tx_work);
 
 	return ret;
 }
@@ -1224,7 +1252,6 @@ static int __devinit ks8851_probe(struct
 
 	ks->netdev = ndev;
 	ks->spidev = spi;
-	ks->tx_space = 6144;
 
 	mutex_init(&ks->lock);
 	spin_lock_init(&ks->statelock);
@@ -1279,6 +1306,10 @@ static int __devinit ks8851_probe(struct
 		goto err_id;
 	}
 
+	ks->tx_space = ks8851_rdreg16(ks, KS_TXMIR);
+	if (!no_tx_opt)
+		ks->tx_avail = ks->tx_space;
+
 	ks8851_read_selftest(ks);
 	ks8851_init_mac(ks);
 
@@ -1351,6 +1382,8 @@ MODULE_DESCRIPTION("KS8851 Network drive
 MODULE_AUTHOR("Ben Dooks <ben@simtec.co.uk>");
 MODULE_LICENSE("GPL");
 
+module_param(no_tx_opt, int, 0);
+MODULE_PARM_DESC(no_tx_opt, "No TX optimization");
 module_param_named(message, msg_enable, int, 0);
 MODULE_PARM_DESC(message, "Message verbosity level (0=none, 31=all)");
 MODULE_ALIAS("spi:ks8851");
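
As a rough illustration of the threshold described in the patch description, the sketch below models the transmit buffer accounting in plain user-space C. It is not driver code: `tx_model`, `tx_needed()` and `tx_model_queue()` are hypothetical names, and the per-frame overhead (a 4-byte header, rounded up to a multiple of 4) is an assumption about how the driver sizes each frame.

```c
#include <stdbool.h>
#include <stdint.h>

#define TX_BUF_SIZE   6144u             /* maximum transmit buffer size (from the description) */
#define MAX_FRAME     1514u             /* full-sized Ethernet frame */
#define IRQ_THRESHOLD (MAX_FRAME * 3u)  /* "less than 3 packets" of space left */

/* Hypothetical model of the driver's view of the free transmit buffer. */
struct tx_model {
	uint16_t tx_space;
};

/* Assumption: each frame costs its length plus a 4-byte header, rounded up to 4. */
static unsigned int tx_needed(unsigned int len)
{
	return (len + 4u + 3u) & ~3u;
}

/*
 * Account for one frame. Returns false when the frame does not fit.
 * On success, *want_irq says whether this frame should request the
 * transmit done interrupt because free space fell below the threshold.
 */
static bool tx_model_queue(struct tx_model *m, unsigned int len, bool *want_irq)
{
	unsigned int needed = tx_needed(len);

	if (needed > m->tx_space)
		return false;

	m->tx_space -= needed;
	*want_irq = m->tx_space < IRQ_THRESHOLD;
	return true;
}
```

Under these assumptions a 6144-byte buffer accepts four 1514-byte frames. The first leaves 4624 bytes free, which is above the 3 * 1514 = 4542 byte threshold, so only the second and later frames would ask for the interrupt.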

^ permalink raw reply

* Re: 3 packet TCP window limit?
From: Jerry Chu @ 2010-05-06 23:15 UTC (permalink / raw)
  To: dormando; +Cc: Lars Eggert, Rick Jones, Brian Bloniarz, netdev@vger.kernel.org
In-Reply-To: <q2ud1c2719f1005061613yf90cd7c6r46ee23cc49858e74@mail.gmail.com>

From: dormando <dormando@rydia.net>
>
> Date: Thu, May 6, 2010 at 1:51 AM
> Subject: Re: 3 packet TCP window limit?
> To: Lars Eggert <lars.eggert@nokia.com>
> Cc: Rick Jones <rick.jones2@hp.com>, Brian Bloniarz
> <bmb@athenacr.com>, "netdev@vger.kernel.org" <netdev@vger.kernel.org>
>
>
> > On 2010-5-5, at 23:31, dormando wrote:
> > > The RFC clearly states "around 4k",
> >
> > no, it doesn't. RFC3390 gives a very precise formula for calculating the initial window:
> >
> >       min (4*MSS, max (2*MSS, 4380 bytes))
> >
> > Please see the RFC for why. More reading at http://www.icir.org/floyd/tcp_init_win.html. I believe that Linux implements this behavior pretty faithfully.
>
> Sorry, paraphrasing :) Web nerds have been working around this for a long
> time now. Google talks about using HTTP chunked encoding responses to send
> an initial "frame" of a webpage in under 3 packets. Which immediately
> gives the browser something to render and primes the TCP connection for
> more web junk.
>
> > I'm surprised to hear that OpenBSD doesn't follow the RFC. Can you share a measurement? Are you sure the box you are measuring is using the default configuration?
>
> Yeah, default config. OBSD was giving me back 4 packets in the first
> window, while linux always gives back 3. The Big/IP is based on linux
> 2.4.21. If that kernel didn't have it wrong, they tuned it.
>
> Already nuked my dumps. If you're curious I'll re-create.
>
> > I don't think the RFC can be misread (it's pretty clear), and the
> > formula is also not exactly complicated. My guess would be that some
> > vendors have convinced themselves that using a slightly larger value is
> > OK, esp. if they can show customers that "their" TCP is "faster" than
> > some competitors' TCPs. An arms race between vendors in this space would
> > really not be good for anyone - it's clear that at some point, problems
> > due to overshoot will occur.
>
> I clearly remember some vendors bragging about doing this. That was a long
> time ago? Perhaps they stopped? If it's true they've been doing it for
> half a decade or more, and haven't broken anything someone would notice.
>
> The only reason why I set about tuning this is because our latency jumped
> while moving traffic from a commercial machine to a linux machine, and I
> had to figure out what they changed to do that. I've since turned the
> setting *back* to the standard, having confirmed what they did.
>
> Almost tempted to test this against a bunch of websites...
>
> > (We can definitely argue about whether the current RFC-recommended value
> > is too low, and Google and others are gathering data in support of
> > making a convincing and backed-up argument for increasing the initial
> > window to the IETF. Which is exactly the correct way of going about
> > this.)
>
> This sounds like fun. We have some diverse traffic, so I'm hoping we can
> contribute to that conversation. Still have a lot of reading to catch up
> with first :)

Yes please do.  Our presentation at Anaheim IETF can be found at
http://www.ietf.org/proceedings/10mar/slides/tcpm-4.pdf, with a paper describing
the details of our experiments at
http://code.google.com/speed/articles/tcp_initcwnd_paper.pdf.

We've gotten a lot of feedback from IETF and are planning to collect more data
to justify the proposal. But at this point we really need help from others as
the scope of the work is certainly not a one-company job. Help can be in the
form of more experiments/tests and/or simulations to study the effect of a
larger initcwnd. Please contact me directly or send your data to IETF's TCPM WG
list (http://www.ietf.org/mail-archive/web/tcpm/current/maillist.html).

Thanks,

Jerry
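
For reference, the RFC 3390 formula quoted earlier in this thread can be written out as a small C helper. This is only a sketch of the arithmetic, not the Linux implementation; the function names are made up for illustration.

```c
#include <stdint.h>

static uint32_t min_u32(uint32_t a, uint32_t b)
{
	return a < b ? a : b;
}

static uint32_t max_u32(uint32_t a, uint32_t b)
{
	return a > b ? a : b;
}

/* Initial window in bytes: min(4*MSS, max(2*MSS, 4380 bytes)). */
static uint32_t rfc3390_initial_window(uint32_t mss)
{
	return min_u32(4 * mss, max_u32(2 * mss, 4380));
}

/* The same window expressed as a count of full-sized segments. */
static uint32_t rfc3390_initial_segments(uint32_t mss)
{
	return rfc3390_initial_window(mss) / mss;
}
```

For a typical Ethernet MSS of 1460 bytes this gives 4380 bytes, which is three full-sized segments and matches the three packets observed from Linux above. A small MSS such as 536 bytes yields four segments instead.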

^ permalink raw reply

* Re: [PATCH] ipv4: remove ip_rt_secret timer (v2)
From: nhorman @ 2010-05-07  0:02 UTC (permalink / raw)
  To: Eric Dumazet, Neil Horman; +Cc: netdev, davem, kuznet, jmorris, yoshfuji, kaber
In-Reply-To: <1273180085.2222.33.camel@edumazet-laptop>


On Thu, 6 May 2010 17:08:05 -0400, Eric Dumazet wrote:

> Le jeudi 06 mai 2010 à 16:29 -0400, Neil Horman a écrit :
> > Version 2 of this patch, taking Eric's comment about making the rt_genid non-zero
> > when a netns is created.  This makes sense, and helps prevent attackers from
> > guessing our initial secret value
> > 
> > 
> > 
> > A while back there was a discussion regarding the rt_secret_interval timer.
> > Given that we've had the ability to do emergency route cache rebuilds for awhile
> > now, based on a statistical analysis of the
> > cache, the use of the flush timer is somewhat redundant.  This patch removes the
> > rt_secret_interval sysctl, allowing us to rely solely on the statistical
> > analysis mechanism to determine the need for route cache flushes.
> > 
> > Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> > 
> 
> > -
> > -static __net_init int rt_secret_timer_init(struct net *net)
> > +static __net_init int rt_genid_init(struct net *net)
> >  {
> > -	atomic_set(&net->ipv4.rt_genid,
> > -			(int) ((num_physpages ^ (num_physpages>>8)) ^
> > -			(jiffies ^ (jiffies >> 7))));
> > -
> 
> 
> > +	/*
> > +	 * This just serves to start off each new net namespace
> > +	 * with a non-zero rt_genid value, making it harder to guess
> > +	 */
> > +	rt_cache_invalidate(net);
> >  	return 0;
> >  }
> >  
> 
> I am _sorry_ to be such a paranoiac guy.
> 
Don't be sorry, I think your concern is valid, I just don't want to keep old code around when 
> Could you please feed more than 8 bits here ?
> 
> like :
> 
> get_random_bytes(&net->ipv4.rt_genid, sizeof(net->ipv4.rt_genid));
> 
Sure, I'm good with that. I'm not at my desk right now, but I'll do that in the morning.

> There is no need to comment this in the code, this kind of rnd init is
> very common in net tree.
>
Ok, copy that, I'll fix that up at the same time.

Thanks & regards
Neil

> 
> 
> 
> 
> 


^ permalink raw reply

* RE: ixgbe and mac-vlans problem
From: Tantilov, Emil S @ 2010-05-07  0:06 UTC (permalink / raw)
  To: Ben Greear; +Cc: Arnd Bergmann, NetDev, Patrick McHardy
In-Reply-To: <4BE32B4A.2030409@candelatech.com>

Ben Greear wrote:
> On 05/06/2010 10:51 AM, Tantilov, Emil S wrote:
> 
>> Hi Ben,
>> 
>> We do have a patch in testing (see attached). It may not apply
>> cleanly as it is on top of some other patches currently in
>> validation. Let me know if it works for you.  
> 
> It wasn't difficult to backport this patch to 2.6.31.12....
> 
> I just tested this on an 85998 NIC and 50 MAC-VLANs worked fine.
> 
> The NIC doesn't show as PROMISC in any way I can detect, but I guess
> it must actually be in PROMISC mode:

Yes the interface is in promisc mode. The driver sets the FCTRL.UPE bit
(unicast promisc mode) when the number of allowed rar_entries is exceeded. 

> 
> [root@i7-1qc-1 ~]# cat /sys/class/net/eth11/flags
> 0x1003
> 
> [root@i7-1qc-1 ~]# ip link show dev eth11
> 2: eth11: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast
>      state UP qlen 1000 link/ether 00:e0:ed:11:25:12 brd
> ff:ff:ff:ff:ff:ff 

The IFF_PROMISC flag is not set in this case. That's how the driver knows when promisc mode has been turned on by the user.

Thanks,
Emil
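
To make the flags value above concrete, here is a small user-space sketch that decodes the hex string printed by /sys/class/net/<dev>/flags. The `EX_IFF_*` names and both helper functions are hypothetical; the bit values are copied from the kernel's IFF_* definitions.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Local copies of the interface flag bits (same values as the kernel's IFF_*). */
#define EX_IFF_UP        0x1u
#define EX_IFF_BROADCAST 0x2u
#define EX_IFF_PROMISC   0x100u
#define EX_IFF_MULTICAST 0x1000u

/* Parse the hex string from the sysfs flags file, e.g. "0x1003". */
static unsigned int parse_if_flags(const char *text)
{
	return (unsigned int)strtoul(text, NULL, 16);
}

/* True when the given flag bit is present in the flags word. */
static bool if_flag_set(unsigned int flags, unsigned int flag)
{
	return (flags & flag) != 0;
}
```

Decoding 0x1003 this way shows UP, BROADCAST and MULTICAST set and PROMISC clear, which is consistent with the `ip link` output: the hardware unicast promiscuous bit is not reflected in the interface flags.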

^ permalink raw reply

* [PATCH 0/2] net-next/fec: bug fixing after introduced phylib supporting
From: Bryan Wu @ 2010-05-07  2:27 UTC (permalink / raw)
  To: davem, Sascha Hauer, Greg Ungerer, Amit Kucheria, netdev,
	linux-kernel

After phylib support was introduced, we found some critical issues in Ubuntu on
Freescale iMX51. The following 2 patches fix those bugs, which were recorded in our
Launchpad bug tracker.

Bryan Wu (2):
  netdev/fec: fix performance impact from mdio poll operation
  netdev/fec: fix ifconfig eth0 down hang issue

 drivers/net/fec.c |   73 +++++++++++++++++++++++++++--------------------------
 1 files changed, 37 insertions(+), 36 deletions(-)

^ permalink raw reply

* [PATCH 2/2] netdev/fec: fix ifconfig eth0 down hang issue
From: Bryan Wu @ 2010-05-07  2:27 UTC (permalink / raw)
  To: davem, Sascha Hauer, Greg Ungerer, Amit Kucheria, netdev,
	linux-kernel
In-Reply-To: <1273199239-11057-1-git-send-email-bryan.wu@canonical.com>

BugLink: http://bugs.launchpad.net/bugs/559065

In the fec open/close functions, we need to call phy_connect and phy_disconnect
before we start/stop the phy. Otherwise the system will hang.

Only call fec_enet_mii_probe() in the open function, because otherwise the
first open action causes a NULL pointer error.

Signed-off-by: Bryan Wu <bryan.wu@canonical.com>
---
 drivers/net/fec.c |   28 ++++++++++++++++------------
 1 files changed, 16 insertions(+), 12 deletions(-)

diff --git a/drivers/net/fec.c b/drivers/net/fec.c
index 9c58f6b..af4243f 100644
--- a/drivers/net/fec.c
+++ b/drivers/net/fec.c
@@ -678,6 +678,8 @@ static int fec_enet_mii_probe(struct net_device *dev)
 	struct phy_device *phy_dev = NULL;
 	int phy_addr;
 
+	fep->phy_dev = NULL;
+
 	/* find the first phy */
 	for (phy_addr = 0; phy_addr < PHY_MAX_ADDR; phy_addr++) {
 		if (fep->mii_bus->phy_map[phy_addr]) {
@@ -708,6 +710,11 @@ static int fec_enet_mii_probe(struct net_device *dev)
 	fep->link = 0;
 	fep->full_duplex = 0;
 
+	printk(KERN_INFO "%s: Freescale FEC PHY driver [%s] "
+		"(mii_bus:phy_addr=%s, irq=%d)\n", dev->name,
+		fep->phy_dev->drv->name, dev_name(&fep->phy_dev->dev),
+		fep->phy_dev->irq);
+
 	return 0;
 }
 
@@ -753,13 +760,8 @@ static int fec_enet_mii_init(struct platform_device *pdev)
 	if (mdiobus_register(fep->mii_bus))
 		goto err_out_free_mdio_irq;
 
-	if (fec_enet_mii_probe(dev) != 0)
-		goto err_out_unregister_bus;
-
 	return 0;
 
-err_out_unregister_bus:
-	mdiobus_unregister(fep->mii_bus);
 err_out_free_mdio_irq:
 	kfree(fep->mii_bus->irq);
 err_out_free_mdiobus:
@@ -912,7 +914,12 @@ fec_enet_open(struct net_device *dev)
 	if (ret)
 		return ret;
 
-	/* schedule a link state check */
+	/* Probe and connect to PHY when open the interface */
+	ret = fec_enet_mii_probe(dev);
+	if (ret) {
+		fec_enet_free_buffers(dev);
+		return ret;
+	}
 	phy_start(fep->phy_dev);
 	netif_start_queue(dev);
 	fep->opened = 1;
@@ -926,10 +933,12 @@ fec_enet_close(struct net_device *dev)
 
 	/* Don't know what to do yet. */
 	fep->opened = 0;
-	phy_stop(fep->phy_dev);
 	netif_stop_queue(dev);
 	fec_stop(dev);
 
+	if (fep->phy_dev)
+		phy_disconnect(fep->phy_dev);
+
         fec_enet_free_buffers(dev);
 
 	return 0;
@@ -1293,11 +1302,6 @@ fec_probe(struct platform_device *pdev)
 	if (ret)
 		goto failed_register;
 
-	printk(KERN_INFO "%s: Freescale FEC PHY driver [%s] "
-		"(mii_bus:phy_addr=%s, irq=%d)\n", ndev->name,
-		fep->phy_dev->drv->name, dev_name(&fep->phy_dev->dev),
-		fep->phy_dev->irq);
-
 	return 0;
 
 failed_register:
-- 
1.7.0.1

^ permalink raw reply related

* [PATCH 1/2] netdev/fec: fix performance impact from mdio poll operation
From: Bryan Wu @ 2010-05-07  2:27 UTC (permalink / raw)
  To: davem, Sascha Hauer, Greg Ungerer, Amit Kucheria, netdev,
	linux-kernel
In-Reply-To: <1273199239-11057-1-git-send-email-bryan.wu@canonical.com>

BugLink: http://bugs.launchpad.net/bugs/546649
BugLink: http://bugs.launchpad.net/bugs/457878

After phylib support was introduced, users experienced a performance drop. That is
because of the mdio polling operation of phylib. Use msleep to replace the busy
waiting cpu_relax() and remove the warning message.

Signed-off-by: Bryan Wu <bryan.wu@canonical.com>
Acked-by: Andy Whitcroft <apw@canonical.com>
---
 drivers/net/fec.c |   45 +++++++++++++++++++++------------------------
 1 files changed, 21 insertions(+), 24 deletions(-)

diff --git a/drivers/net/fec.c b/drivers/net/fec.c
index 2b1651a..9c58f6b 100644
--- a/drivers/net/fec.c
+++ b/drivers/net/fec.c
@@ -203,7 +203,7 @@ static void fec_stop(struct net_device *dev);
 #define FEC_MMFR_TA		(2 << 16)
 #define FEC_MMFR_DATA(v)	(v & 0xffff)
 
-#define FEC_MII_TIMEOUT		10000
+#define FEC_MII_TIMEOUT		10
 
 /* Transmitter timeout */
 #define TX_TIMEOUT (2 * HZ)
@@ -611,13 +611,29 @@ spin_unlock:
 /*
  * NOTE: a MII transaction is during around 25 us, so polling it...
  */
-static int fec_enet_mdio_read(struct mii_bus *bus, int mii_id, int regnum)
+static int fec_enet_mdio_poll(struct fec_enet_private *fep)
 {
-	struct fec_enet_private *fep = bus->priv;
 	int timeout = FEC_MII_TIMEOUT;
 
 	fep->mii_timeout = 0;
 
+	/* wait for end of transfer */
+	while (!(readl(fep->hwp + FEC_IEVENT) & FEC_ENET_MII)) {
+		msleep(1);
+		if (timeout-- < 0) {
+			fep->mii_timeout = 1;
+			break;
+		}
+	}
+
+	return 0;
+}
+
+static int fec_enet_mdio_read(struct mii_bus *bus, int mii_id, int regnum)
+{
+	struct fec_enet_private *fep = bus->priv;
+
+
 	/* clear MII end of transfer bit*/
 	writel(FEC_ENET_MII, fep->hwp + FEC_IEVENT);
 
@@ -626,15 +642,7 @@ static int fec_enet_mdio_read(struct mii_bus *bus, int mii_id, int regnum)
 		FEC_MMFR_PA(mii_id) | FEC_MMFR_RA(regnum) |
 		FEC_MMFR_TA, fep->hwp + FEC_MII_DATA);
 
-	/* wait for end of transfer */
-	while (!(readl(fep->hwp + FEC_IEVENT) & FEC_ENET_MII)) {
-		cpu_relax();
-		if (timeout-- < 0) {
-			fep->mii_timeout = 1;
-			printk(KERN_ERR "FEC: MDIO read timeout\n");
-			return -ETIMEDOUT;
-		}
-	}
+	fec_enet_mdio_poll(fep);
 
 	/* return value */
 	return FEC_MMFR_DATA(readl(fep->hwp + FEC_MII_DATA));
@@ -644,9 +652,6 @@ static int fec_enet_mdio_write(struct mii_bus *bus, int mii_id, int regnum,
 			   u16 value)
 {
 	struct fec_enet_private *fep = bus->priv;
-	int timeout = FEC_MII_TIMEOUT;
-
-	fep->mii_timeout = 0;
 
 	/* clear MII end of transfer bit*/
 	writel(FEC_ENET_MII, fep->hwp + FEC_IEVENT);
@@ -657,15 +662,7 @@ static int fec_enet_mdio_write(struct mii_bus *bus, int mii_id, int regnum,
 		FEC_MMFR_TA | FEC_MMFR_DATA(value),
 		fep->hwp + FEC_MII_DATA);
 
-	/* wait for end of transfer */
-	while (!(readl(fep->hwp + FEC_IEVENT) & FEC_ENET_MII)) {
-		cpu_relax();
-		if (timeout-- < 0) {
-			fep->mii_timeout = 1;
-			printk(KERN_ERR "FEC: MDIO write timeout\n");
-			return -ETIMEDOUT;
-		}
-	}
+	fec_enet_mdio_poll(fep);
 
 	return 0;
 }
-- 
1.7.0.1
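
The change from busy-waiting to sleeping between polls can be sketched in user space as follows. This is an illustration of the general pattern only: `poll_with_sleep()` and `countdown_done()` are hypothetical names, nanosleep() stands in for the kernel's msleep(), and unlike the helper in the patch (which records the timeout in fep->mii_timeout and returns 0) this sketch reports the timeout through its return value.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Callback that reports whether the awaited event has happened. */
typedef bool (*done_fn)(void *ctx);

/*
 * Check done() up to max_tries times, sleeping about 1 ms after each
 * unsuccessful check instead of spinning. One final check is made after
 * the last sleep. Returns 0 on success or -ETIMEDOUT if the event never
 * happened.
 */
static int poll_with_sleep(done_fn done, void *ctx, int max_tries)
{
	const struct timespec one_ms = { .tv_sec = 0, .tv_nsec = 1000000L };

	for (int i = 0; i < max_tries; i++) {
		if (done(ctx))
			return 0;
		nanosleep(&one_ms, NULL);
	}
	return done(ctx) ? 0 : -ETIMEDOUT;
}

/* Example event source: becomes "done" after *ctx unsuccessful checks. */
static bool countdown_done(void *ctx)
{
	int *remaining = ctx;

	if (*remaining > 0) {
		(*remaining)--;
		return false;
	}
	return true;
}
```

Sleeping trades latency for CPU time: each unsuccessful check costs at least a millisecond, so the retry count can be small, in the same spirit as the patch lowering FEC_MII_TIMEOUT from 10000 to 10.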


^ permalink raw reply related

* Re: virtio: put last_used and last_avail index into ring itself.
From: Rusty Russell @ 2010-05-07  3:05 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev, virtualization, kvm, linux-kernel, mingo, linux-mm, akpm,
	hpa, gregory.haskins, s.hetze, Daniel Walker, Eric Dumazet
In-Reply-To: <20100506062755.GC8363@redhat.com>

On Thu, 6 May 2010 03:57:55 pm Michael S. Tsirkin wrote:
> On Thu, May 06, 2010 at 10:22:12AM +0930, Rusty Russell wrote:
> > On Wed, 5 May 2010 03:52:36 am Michael S. Tsirkin wrote:
> > > What do you think?
> > 
> > I think everyone is settled on 128 byte cache lines for the foreseeable
> > future, so it's not really an issue.
> 
> You mean with 64 bit descriptors we will be bouncing a cache line
> between host and guest, anyway?

I'm confused by this entire thread.

Descriptors are 16 bytes.  They are at the start, so presumably aligned to
cache boundaries.

Available ring follows that at 2 bytes per entry, so it's also packed nicely
into cachelines.

Then there's padding to page boundary.  That puts us on a cacheline again
for the used ring; also 2 bytes per entry.

I don't see how any change in layout could be more cache friendly?
Rusty.
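
The layout described above can be checked with a little arithmetic. The sketch below computes where the available ring and the used ring start for a given ring size. `ring_offsets`, `align_up()` and `ring_layout()` are hypothetical names, and the two 16-bit header fields in front of the available ring entries are an assumption about the ring format rather than something stated in this message.

```c
#include <stddef.h>

#define DESC_SIZE        16u  /* bytes per descriptor */
#define AVAIL_ENTRY_SIZE 2u   /* bytes per available ring entry */

/* Byte offsets of each part of the ring from the start of the region. */
struct ring_offsets {
	size_t desc;
	size_t avail;
	size_t used;
};

/* Round v up to the next multiple of align (align must be a power of two). */
static size_t align_up(size_t v, size_t align)
{
	return (v + align - 1) & ~(align - 1);
}

static struct ring_offsets ring_layout(size_t num, size_t align)
{
	struct ring_offsets o;

	o.desc = 0;
	o.avail = num * DESC_SIZE;
	/* Assumed: two 16-bit header fields, then num 2-byte entries. */
	o.used = align_up(o.avail + AVAIL_ENTRY_SIZE * (2 + num), align);
	return o;
}
```

With 256 entries and 4096-byte pages, the descriptors fill exactly one page, the available ring starts at offset 4096, and the used ring starts at offset 8192, so each part begins on a cache line boundary for any common line size.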


^ permalink raw reply

* linux-next: build failure after merge of the suspend tree
From: Stephen Rothwell @ 2010-05-07  3:08 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: linux-next, linux-kernel, Helmut Schaa, John W. Linville,
	David Miller, netdev

Hi Rafael,

After merging the suspend tree, today's linux-next build (x86_64
allmodconfig) failed like this:

net/mac80211/scan.c: In function 'ieee80211_scan_state_decision':
net/mac80211/scan.c:510: error: implicit declaration of function 'pm_qos_requirement'

Caused by commit 62bad14fc6e0911a99882c261390968977d43283 ("PM QOS
update") from the suspend tree interacting with commit
df13cce53a7b28a81460e6bfc4857e9df4956141 ("mac80211: Improve software
scan timing") from the net tree.

I have added the following merge fixup patch and can carry it as
necessary:

From: Stephen Rothwell <sfr@canb.auug.org.au>
Date: Fri, 7 May 2010 13:02:54 +1000
Subject: [PATCH] wireless: update for pm_qos_requirement to pm_qos_request rename

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
---
 net/mac80211/scan.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/mac80211/scan.c b/net/mac80211/scan.c
index e14c441..e1b0be7 100644
--- a/net/mac80211/scan.c
+++ b/net/mac80211/scan.c
@@ -510,7 +510,7 @@ static int ieee80211_scan_state_decision(struct ieee80211_local *local,
 		bad_latency = time_after(jiffies +
 				ieee80211_scan_get_channel_time(next_chan),
 				local->leave_oper_channel_time +
-				usecs_to_jiffies(pm_qos_requirement(PM_QOS_NETWORK_LATENCY)));
+				usecs_to_jiffies(pm_qos_request(PM_QOS_NETWORK_LATENCY)));
 
 		listen_int_exceeded = time_after(jiffies +
 				ieee80211_scan_get_channel_time(next_chan),
-- 
1.7.1

-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

^ permalink raw reply related

* Re: ixgbe and mac-vlans problem
From: Ben Greear @ 2010-05-07  3:12 UTC (permalink / raw)
  To: Tantilov, Emil S; +Cc: Arnd Bergmann, NetDev, Patrick McHardy
In-Reply-To: <EA929A9653AAE14F841771FB1DE5A1365FEA26A969@rrsmsx501.amr.corp.intel.com>

On 05/06/2010 05:06 PM, Tantilov, Emil S wrote:
> Ben Greear wrote:
>> On 05/06/2010 10:51 AM, Tantilov, Emil S wrote:
>>
>>> Hi Ben,
>>>
>>> We do have a patch in testing (see attached). It may not apply
>>> cleanly as it is on top of some other patches currently in
>>> validation. Let me know if it works for you.
>>
>> It wasn't difficult to backport this patch to 2.6.31.12....
>>
>> I just tested this on an 85998 NIC and 50 MAC-VLANs worked fine.
>>
>> The NIC doesn't show as PROMISC in any way I can detect, but I guess
>> it must actually be in PROMISC mode:
>
> Yes the interface is in promisc mode. The driver sets the FCTRL.UPE bit
> (unicast promisc mode) when the number of allowed rar_entries is exceeded.

Is there any way to get this setting from ethtool or similar?  It would be nice
to know the actual PROMISC state of the NIC regardless of what user-space has or has not
configured.

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply

* Re: [v5 Patch 1/3] netpoll: add generic support for bridge and bonding devices
From: Cong Wang @ 2010-05-07  3:24 UTC (permalink / raw)
  To: David Miller
  Cc: mpm, linux-kernel, netdev, bridge, gospo, nhorman, jmoyer,
	shemminger, bonding-devel, fubar
In-Reply-To: <20100506.004457.71584133.davem@davemloft.net>

On 05/06/10 15:44, David Miller wrote:
> From: Matt Mackall<mpm@selenic.com>
> Date: Wed, 05 May 2010 21:05:30 -0500
>
>> On Wed, 2010-05-05 at 04:11 -0400, Amerigo Wang wrote:
>>> V5:
>>> Fix coding style problems pointed by David.
>>
>> Aside from my concern about the policy of disabling netpoll on
>> bridges/bonds with only partial netpoll support, I don't have any
>> remaining issues with this. But I'll leave it to other folks to ack the
>> underlying driver bits for this series.
>
> Yes the partial support handling is a thorny issue.
>
> But this patch set makes things better than they were before, because
> support over such devices didn't work at all previously.
>
> So I'll toss these patches into net-next-2.6, thanks everyone!

Thank you, David.


^ permalink raw reply

* Re: [PATCH] ipv4: udp: fix short packet and bad checksum logging
From: David Miller @ 2010-05-07  4:48 UTC (permalink / raw)
  To: eric.dumazet; +Cc: bjorn, netdev, stable
In-Reply-To: <1273157280.2853.9.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 06 May 2010 16:48:00 +0200

> Le jeudi 06 mai 2010 à 15:44 +0200, Bjørn Mork a écrit :
>> commit 2783ef23 moved the initialisation of saddr and daddr after
>> pskb_may_pull() to avoid a potential data corruption.  Unfortunately
>> also placing it after the short packet and bad checksum error paths,
>> where these variables are used for logging.  The result is bogus
>> output like
>> 
>> [92238.389505] UDP: short packet: From 2.0.0.0:65535 23715/178 to 0.0.0.0:65535
>> 
>> Moving the saddr and daddr initialisation above the error paths, while still
>> keeping it after the pskb_may_pull() to keep the fix from commit 2783ef23.
>> 
>> Signed-off-by: Bjørn Mork <bjorn@mork.no>
>> Cc: stable@kernel.org
>> ---
>>  net/ipv4/udp.c |    6 +++---
>>  1 files changed, 3 insertions(+), 3 deletions(-)
> 
> Well done :)
> 
> Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
> 
> To be backported to 2.6.29 and up kernels ;)

Applied to net-2.6 and queued up for -stable, thanks!

^ permalink raw reply

* Re: [PATCH] ipv6: udp: make short packet logging consistent with ipv4
From: David Miller @ 2010-05-07  4:50 UTC (permalink / raw)
  To: eric.dumazet; +Cc: bjorn, netdev
In-Reply-To: <1273157366.2853.10.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 06 May 2010 16:49:26 +0200

> Le jeudi 06 mai 2010 à 15:44 +0200, Bjørn Mork a écrit :
>> Adding addresses and ports to the short packet log message,
>> like ipv4/udp.c does it, makes these messages a lot more useful:
>> 
>> [  822.182450] UDPv6: short packet: From [2001:db8:ffb4:3::1]:47839 23715/178 to [2001:db8:ffb4:3:5054:ff:feff:200]:1234
>> 
>> This requires us to drop logging in case pskb_may_pull() fails,
>> which also is consistent with ipv4/udp.c
>> 
>> Signed-off-by: Bjørn Mork <bjorn@mork.no>
>> ---
>>  net/ipv6/udp.c |   11 ++++++++---
>>  1 files changed, 8 insertions(+), 3 deletions(-)
> 
> Acked-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied to net-next-2.6, thanks.

^ permalink raw reply

* Re: r8169 transmit queue time outs
From: Kyle McMartin @ 2010-05-07  4:51 UTC (permalink / raw)
  To: Francois Romieu; +Cc: Kyle McMartin, netdev
In-Reply-To: <20100506201024.GA3541@electric-eye.fr.zoreil.com>

On Thu, May 06, 2010 at 10:10:24PM +0200, Francois Romieu wrote:
> Kyle McMartin <kmcmartin@redhat.com> :
> [...]
> > Some of our users have been seeing their r8169 cards just up and stop
> > transmitting packets pretty quickly after boot with recent kernels.
> [...]
> > Pid: 0, comm: swapper Not tainted 2.6.31.5-127.fc12.i686.PAE #1
> 
> Can they upgrade to 2.6.32.11-99.fc12.i686 and try an out-of-tree build
> of the driver at http://userweb.kernel.org/~romieu/r8169/2.6.32.11-99.fc12/ ?
> 
> It should be quite close to the current git kernel.
> 

Thanks Francois, I've done a build for F-12 and F-13 with that driver
for the users, and updated the bugs with links to the builds.

I'll let you know if it helps things.

Thanks again, Kyle.

^ permalink raw reply

* Re: [PATCH net-next-2.6] net: Increase NET_SKB_PAD to 64 bytes
From: David Miller @ 2010-05-07  5:02 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, hadi, therbert, monstr, microblaze-uclinux
In-Reply-To: <1273037049.2304.7.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 05 May 2010 07:24:09 +0200

> eth_type_trans() & get_rps_cpus() currently need two 64bytes cache lines
> in packet to compute rxhash.
> 
> Increasing NET_SKB_PAD from 32 to 64 reduces the need to one cache line
> only, and makes RPS faster.
> 
> NET_IP_ALIGN(2) + ethernet_header(14) + IP_header(20/40) + ports(8)
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied, thanks Eric.

Seeing this made me go check who was overriding NET_IP_ALIGN or
NET_SKB_PAD.

The powerpc bits are legitimate, but the microblaze case is complete
bogosity.  It defines NET_IP_ALIGN to the default (2) and sets
NET_SKB_PAD to L1_CACHE_BYTES which on microblaze is 4 and
significantly smaller than the default.

So I'm going to delete them in net-next-2.6 like so:

--------------------
microblaze: Kill NET_SKB_PAD and NET_IP_ALIGN overrides.

NET_IP_ALIGN defaults to 2, no need to override.

NET_SKB_PAD is now 64, which is much larger than microblaze's
L1_CACHE_SIZE so no need to override that either.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 arch/microblaze/include/asm/system.h |   10 ----------
 1 files changed, 0 insertions(+), 10 deletions(-)

diff --git a/arch/microblaze/include/asm/system.h b/arch/microblaze/include/asm/system.h
index 48c4f03..b1e2f07 100644
--- a/arch/microblaze/include/asm/system.h
+++ b/arch/microblaze/include/asm/system.h
@@ -97,14 +97,4 @@ extern struct dentry *of_debugfs_root;
 
 #define arch_align_stack(x) (x)
 
-/*
- * MicroBlaze doesn't handle unaligned accesses in hardware.
- *
- * Based on this we force the IP header alignment in network drivers.
- * We also modify NET_SKB_PAD to be a cacheline in size, thus maintaining
- * cacheline alignment of buffers.
- */
-#define NET_IP_ALIGN	2
-#define NET_SKB_PAD	L1_CACHE_BYTES
-
 #endif /* _ASM_MICROBLAZE_SYSTEM_H */
-- 
1.7.0.4
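
The cache line argument in the quoted NET_SKB_PAD patch can be verified with a short calculation. This sketch assumes the packet buffer itself starts on a 64-byte cache line boundary; `cache_lines_touched()` and `rxhash_lines()` are hypothetical helpers, not kernel functions.

```c
#include <stddef.h>

#define CACHE_LINE 64u
#define IP_ALIGN   2u   /* NET_IP_ALIGN */
#define ETH_HLEN   14u  /* Ethernet header */
#define PORTS_LEN  8u   /* bytes of ports counted in the quoted sum */

/* Number of cache lines covered by len bytes starting at offset start. */
static size_t cache_lines_touched(size_t start, size_t len, size_t line)
{
	return (start + len - 1) / line - start / line + 1;
}

/*
 * Cache lines read to reach the ports, for a given headroom (pad) and
 * IP header length (20 for IPv4, 40 for IPv6). The Ethernet header is
 * assumed to start at pad + NET_IP_ALIGN from a line-aligned buffer.
 */
static size_t rxhash_lines(size_t pad, size_t ip_hlen)
{
	size_t start = pad + IP_ALIGN;
	size_t len = ETH_HLEN + ip_hlen + PORTS_LEN;

	return cache_lines_touched(start, len, CACHE_LINE);
}
```

Under that alignment assumption, a 32-byte pad makes the headers straddle two lines for both IPv4 and IPv6, while a 64-byte pad keeps them within a single line. In the IPv6 case the last byte lands at offset 127, the final byte of the second line.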


^ permalink raw reply related
