Netdev List
 help / color / mirror / Atom feed
* Re: [RFC] tcp md5 use of alloc_percpu
From: Eric Dumazet @ 2014-10-23  5:23 UTC (permalink / raw)
  To: David Ahern; +Cc: Crestez Dan Leonard, netdev
In-Reply-To: <544886C4.4020702@gmail.com>

On Wed, 2014-10-22 at 22:40 -0600, David Ahern wrote:
> On 10/22/14, 12:55 PM, Crestez Dan Leonard wrote:
> > Hello,
> >
> > It seems that the TCP MD5 feature allocates a percpu struct tcp_md5sig_pool and uses part of that memory for a scratch buffer to do crypto on. Here is the relevant code:
> 
> This is a forward port of a local change to address the problem (local 
> kernel version is 3.4 so perhaps my quick bump to top of tree is off but 
> it shows the general idea). Been on my to-do list to figure out why this 
> is needed, but it seems related to your problem:
> 
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 1bec4e76d88c..833a676bd4b0 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -2941,7 +2941,7 @@ struct tcp_md5sig_pool *tcp_get_md5sig_pool(void)
>      local_bh_disable();
>      p = ACCESS_ONCE(tcp_md5sig_pool);
>      if (p)
> -       return raw_cpu_ptr(p);
> +       return __va(per_cpu_ptr_to_phys(raw_cpu_ptr(p)));
> 
>      local_bh_enable();
>      return NULL;

per_cpu_ptr_to_phys() can be pretty expensive and should not be called
in fast path.

^ permalink raw reply

* Re: [RFC] tcp md5 use of alloc_percpu
From: Eric Dumazet @ 2014-10-23  5:38 UTC (permalink / raw)
  To: David Ahern; +Cc: Crestez Dan Leonard, netdev
In-Reply-To: <1414041833.2094.28.camel@edumazet-glaptop2.roam.corp.google.com>

On Wed, 2014-10-22 at 22:23 -0700, Eric Dumazet wrote:
> On Wed, 2014-10-22 at 22:40 -0600, David Ahern wrote:
> > On 10/22/14, 12:55 PM, Crestez Dan Leonard wrote:
> > > Hello,
> > >
> > > It seems that the TCP MD5 feature allocates a percpu struct tcp_md5sig_pool and uses part of that memory for a scratch buffer to do crypto on. Here is the relevant code:
> > 
> > This is a forward port of a local change to address the problem (local 
> > kernel version is 3.4 so perhaps my quick bump to top of tree is off but 
> > it shows the general idea). Been on my to-do list to figure out why this 
> > is needed, but it seems related to your problem:
> > 
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > index 1bec4e76d88c..833a676bd4b0 100644
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -2941,7 +2941,7 @@ struct tcp_md5sig_pool *tcp_get_md5sig_pool(void)
> >      local_bh_disable();
> >      p = ACCESS_ONCE(tcp_md5sig_pool);
> >      if (p)
> > -       return raw_cpu_ptr(p);
> > +       return __va(per_cpu_ptr_to_phys(raw_cpu_ptr(p)));
> > 
> >      local_bh_enable();
> >      return NULL;
> 
> per_cpu_ptr_to_phys() can be pretty expensive and should not be called
> in fast path.
> 

My updated patch would be :

 net/ipv4/tcp.c |   66 +++++++++++++++++++----------------------------
 1 file changed, 28 insertions(+), 38 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 1bec4e76d88c..af4dc16b61f6 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2868,61 +2868,51 @@ EXPORT_SYMBOL(compat_tcp_getsockopt);
 #endif
 
 #ifdef CONFIG_TCP_MD5SIG
-static struct tcp_md5sig_pool __percpu *tcp_md5sig_pool __read_mostly;
+static DEFINE_PER_CPU(struct tcp_md5sig_pool, *tcp_md5sig_pool);
 static DEFINE_MUTEX(tcp_md5sig_mutex);
-
-static void __tcp_free_md5sig_pool(struct tcp_md5sig_pool __percpu *pool)
-{
-	int cpu;
-
-	for_each_possible_cpu(cpu) {
-		struct tcp_md5sig_pool *p = per_cpu_ptr(pool, cpu);
-
-		if (p->md5_desc.tfm)
-			crypto_free_hash(p->md5_desc.tfm);
-	}
-	free_percpu(pool);
-}
+static bool tcp_md5sig_pool_populated = false;
 
 static void __tcp_alloc_md5sig_pool(void)
 {
 	int cpu;
-	struct tcp_md5sig_pool __percpu *pool;
-
-	pool = alloc_percpu(struct tcp_md5sig_pool);
-	if (!pool)
-		return;
 
 	for_each_possible_cpu(cpu) {
+		struct tcp_md5sig_pool *pool;
 		struct crypto_hash *hash;
 
-		hash = crypto_alloc_hash("md5", 0, CRYPTO_ALG_ASYNC);
-		if (IS_ERR_OR_NULL(hash))
-			goto out_free;
-
-		per_cpu_ptr(pool, cpu)->md5_desc.tfm = hash;
+		pool = per_cpu(tcp_md5sig_pool, cpu);
+		if (!pool) {
+			pool = kzalloc_node(sizeof(*pool), GFP_KERNEL,
+					    cpu_to_node(cpu));
+			if (!pool)
+				return;
+			per_cpu(tcp_md5sig_pool, cpu) = pool;
+		}
+		if (!pool->md5_desc.tfm) {
+			hash = crypto_alloc_hash("md5", 0, CRYPTO_ALG_ASYNC);
+			if (IS_ERR_OR_NULL(hash))
+				return;
+			pool->md5_desc.tfm = hash;
+		}
 	}
-	/* before setting tcp_md5sig_pool, we must commit all writes
-	 * to memory. See ACCESS_ONCE() in tcp_get_md5sig_pool()
+	/* before setting tcp_md5sig_pool_populated, we must commit all writes
+	 * to memory. See smp_rmb() in tcp_get_md5sig_pool()
 	 */
 	smp_wmb();
-	tcp_md5sig_pool = pool;
-	return;
-out_free:
-	__tcp_free_md5sig_pool(pool);
+	tcp_md5sig_pool_populated = true;
 }
 
 bool tcp_alloc_md5sig_pool(void)
 {
-	if (unlikely(!tcp_md5sig_pool)) {
+	if (unlikely(!tcp_md5sig_pool_populated)) {
 		mutex_lock(&tcp_md5sig_mutex);
 
-		if (!tcp_md5sig_pool)
+		if (!tcp_md5sig_pool_populated)
 			__tcp_alloc_md5sig_pool();
 
 		mutex_unlock(&tcp_md5sig_mutex);
 	}
-	return tcp_md5sig_pool != NULL;
+	return tcp_md5sig_pool_populated;
 }
 EXPORT_SYMBOL(tcp_alloc_md5sig_pool);
 
@@ -2936,13 +2926,13 @@ EXPORT_SYMBOL(tcp_alloc_md5sig_pool);
  */
 struct tcp_md5sig_pool *tcp_get_md5sig_pool(void)
 {
-	struct tcp_md5sig_pool __percpu *p;
-
 	local_bh_disable();
-	p = ACCESS_ONCE(tcp_md5sig_pool);
-	if (p)
-		return raw_cpu_ptr(p);
 
+	if (tcp_md5sig_pool_populated) {
+		/* coupled with smp_wmb() in __tcp_alloc_md5sig_pool */
+		smp_rmb();
+		return this_cpu_read(tcp_md5sig_pool);
+	}
 	local_bh_enable();
 	return NULL;
 }

^ permalink raw reply related

* Re: localed stuck in recent 3.18 git in copy_net_ns?
From: Yanko Kaneti @ 2014-10-23  6:09 UTC (permalink / raw)
  To: paulmck
  Cc: Josh Boyer, Eric W. Biederman, Cong Wang, Kevin Fenzi, netdev,
	Linux-Kernel@Vger. Kernel. Org
In-Reply-To: <20141022232421.GN4977@linux.vnet.ibm.com>

On Wed, 2014-10-22 at 16:24 -0700, Paul E. McKenney wrote:
> On Thu, Oct 23, 2014 at 01:40:32AM +0300, Yanko Kaneti wrote:
> > On Wed-10/22/14-2014 15:33, Josh Boyer wrote:
> > > On Wed, Oct 22, 2014 at 2:55 PM, Paul E. McKenney
> > > <paulmck@linux.vnet.ibm.com> wrote:
> 
> [ . . . ]
> 
> > > > Don't get me wrong -- the fact that this kthread appears to 
> > > > have
> > > > blocked within rcu_barrier() for 120 seconds means that 
> > > > something is
> > > > most definitely wrong here.  I am surprised that there are no 
> > > > RCU CPU
> > > > stall warnings, but perhaps the blockage is in the callback 
> > > > execution
> > > > rather than grace-period completion.  Or something is 
> > > > preventing this
> > > > kthread from starting up after the wake-up callback executes.  
> > > > Or...
> > > > 
> > > > Is this thing reproducible?
> > > 
> > > I've added Yanko on CC, who reported the backtrace above and can
> > > recreate it reliably.  Apparently reverting the RCU merge commit
> > > (d6dd50e) and rebuilding the latest after that does not show the
> > > issue.  I'll let Yanko explain more and answer any questions you 
> > > have.
> > 
> > - It is reproducible
> > - I've done another build here to double check and its definitely 
> > the rcu merge
> >   that's causing it.
> > 
> > Don't think I'll be able to dig deeper, but I can do testing if 
> > needed.
> 
> Please!  Does the following patch help?

Nope, doesn't seem to make a difference to the modprobe ppp_generic 
test


INFO: task kworker/u16:6:101 blocked for more than 120 seconds.
      Not tainted 3.18.0-0.rc1.git2.3.fc22.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
kworker/u16:6   D ffff88022067cec0 11680   101      2 0x00000000
Workqueue: netns cleanup_net
 ffff8802206939e8 0000000000000096 ffff88022067cec0 00000000001d5f00
 ffff880220693fd8 00000000001d5f00 ffff880223263480 ffff88022067cec0
 ffffffff82c51d60 7fffffffffffffff ffffffff81ee2698 ffffffff81ee2690
Call Trace:
 [<ffffffff8185e289>] schedule+0x29/0x70
 [<ffffffff818634ac>] schedule_timeout+0x26c/0x410
 [<ffffffff81028c4a>] ? native_sched_clock+0x2a/0xa0
 [<ffffffff81107afc>] ? mark_held_locks+0x7c/0xb0
 [<ffffffff81864530>] ? _raw_spin_unlock_irq+0x30/0x50
 [<ffffffff81107c8d>] ? trace_hardirqs_on_caller+0x15d/0x200
 [<ffffffff8185fcbc>] wait_for_completion+0x10c/0x150
 [<ffffffff810e5430>] ? wake_up_state+0x20/0x20
 [<ffffffff8112a799>] _rcu_barrier+0x159/0x200
 [<ffffffff8112a895>] rcu_barrier+0x15/0x20
 [<ffffffff81718f0f>] netdev_run_todo+0x6f/0x310
 [<ffffffff8170dad5>] ? rollback_registered_many+0x265/0x2e0
 [<ffffffff81725f7e>] rtnl_unlock+0xe/0x10
 [<ffffffff8170f936>] default_device_exit_batch+0x156/0x180
 [<ffffffff810fd8f0>] ? abort_exclusive_wait+0xb0/0xb0
 [<ffffffff817079e3>] ops_exit_list.isra.1+0x53/0x60
 [<ffffffff81708590>] cleanup_net+0x100/0x1f0
 [<ffffffff810ccff8>] process_one_work+0x218/0x850
 [<ffffffff810ccf5f>] ? process_one_work+0x17f/0x850
 [<ffffffff810cd717>] ? worker_thread+0xe7/0x4a0
 [<ffffffff810cd69b>] worker_thread+0x6b/0x4a0
 [<ffffffff810cd630>] ? process_one_work+0x850/0x850
 [<ffffffff810d39eb>] kthread+0x10b/0x130
 [<ffffffff81028cc9>] ? sched_clock+0x9/0x10
 [<ffffffff810d38e0>] ? kthread_create_on_node+0x250/0x250
 [<ffffffff8186527c>] ret_from_fork+0x7c/0xb0
 [<ffffffff810d38e0>] ? kthread_create_on_node+0x250/0x250
4 locks held by kworker/u16:6/101:
 #0:  ("%s""netns"){.+.+.+}, at: [<ffffffff810ccf5f>] process_one_work+0x17f/0x850
 #1:  (net_cleanup_work){+.+.+.}, at: [<ffffffff810ccf5f>] process_one_work+0x17f/0x850
 #2:  (net_mutex){+.+.+.}, at: [<ffffffff8170851c>] cleanup_net+0x8c/0x1f0
 #3:  (rcu_sched_state.barrier_mutex){+.+...}, at: [<ffffffff8112a675>] _rcu_barrier+0x35/0x200
INFO: task modprobe:1139 blocked for more than 120 seconds.
      Not tainted 3.18.0-0.rc1.git2.3.fc22.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
modprobe        D ffff880213ac1a40 13112  1139   1138 0x00000080
 ffff880036ab3be8 0000000000000096 ffff880213ac1a40 00000000001d5f00
 ffff880036ab3fd8 00000000001d5f00 ffff880223264ec0 ffff880213ac1a40
 ffff880213ac1a40 ffffffff81f8fb48 0000000000000246 ffff880213ac1a40
Call Trace:
 [<ffffffff8185e831>] schedule_preempt_disabled+0x31/0x80
 [<ffffffff81860083>] mutex_lock_nested+0x183/0x440
 [<ffffffff817083af>] ? register_pernet_subsys+0x1f/0x50
 [<ffffffff817083af>] ? register_pernet_subsys+0x1f/0x50
 [<ffffffffa06f3000>] ? 0xffffffffa06f3000
 [<ffffffff817083af>] register_pernet_subsys+0x1f/0x50
 [<ffffffffa06f3048>] br_init+0x48/0xd3 [bridge]
 [<ffffffff81002148>] do_one_initcall+0xd8/0x210
 [<ffffffff81153c52>] load_module+0x20c2/0x2870
 [<ffffffff8114ec30>] ? store_uevent+0x70/0x70
 [<ffffffff8110ac76>] ? lock_release_non_nested+0x3c6/0x3d0
 [<ffffffff811544e7>] SyS_init_module+0xe7/0x140
 [<ffffffff81865329>] system_call_fastpath+0x12/0x17
1 lock held by modprobe/1139:
 #0:  (net_mutex){+.+.+.}, at: [<ffffffff817083af>] 
register_pernet_subsys+0x1f/0x50
INFO: task modprobe:1209 blocked for more than 120 seconds.
      Not tainted 3.18.0-0.rc1.git2.3.fc22.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
modprobe        D ffff8800c5324ec0 13368  1209   1151 0x00000080
 ffff88020d14bbe8 0000000000000096 ffff8800c5324ec0 00000000001d5f00
 ffff88020d14bfd8 00000000001d5f00 ffff880223280000 ffff8800c5324ec0
 ffff8800c5324ec0 ffffffff81f8fb48 0000000000000246 ffff8800c5324ec0
Call Trace:
 [<ffffffff8185e831>] schedule_preempt_disabled+0x31/0x80
 [<ffffffff81860083>] mutex_lock_nested+0x183/0x440
 [<ffffffff817083fd>] ? register_pernet_device+0x1d/0x70
 [<ffffffff817083fd>] ? register_pernet_device+0x1d/0x70
 [<ffffffffa070f000>] ? 0xffffffffa070f000
 [<ffffffff817083fd>] register_pernet_device+0x1d/0x70
 [<ffffffffa070f020>] ppp_init+0x20/0x1000 [ppp_generic]
 [<ffffffff81002148>] do_one_initcall+0xd8/0x210
 [<ffffffff81153c52>] load_module+0x20c2/0x2870
 [<ffffffff8114ec30>] ? store_uevent+0x70/0x70
 [<ffffffff8110ac76>] ? lock_release_non_nested+0x3c6/0x3d0
 [<ffffffff811544e7>] SyS_init_module+0xe7/0x140
 [<ffffffff81865329>] system_call_fastpath+0x12/0x17
1 lock held by modprobe/1209:
 #0:  (net_mutex){+.+.+.}, at: [<ffffffff817083fd>] register_pernet_device+0x1d/0x70


>                 Thanx, Paul
> 
> ---------------------------------------------------------------------
> ---
> 
> rcu: More on deadlock between CPU hotplug and expedited grace periods
> 
> Commit dd56af42bd82 (rcu: Eliminate deadlock between CPU hotplug and
> expedited grace periods) was incomplete.  Although it did eliminate
> deadlocks involving synchronize_sched_expedited()'s acquisition of
> cpu_hotplug.lock via get_online_cpus(), it did nothing about the 
> similar
> deadlock involving acquisition of this same lock via 
> put_online_cpus().
> This deadlock became apparent with testing involving hibernation.
> 
> This commit therefore changes put_online_cpus() acquisition of this 
> lock
> to be conditional, and increments a new cpu_hotplug.puts_pending 
> field
> in case of acquisition failure.  Then cpu_hotplug_begin() checks for 
> this
> new field being non-zero, and applies any changes to 
> cpu_hotplug.refcount.
> 
> Reported-by: Jiri Kosina <jkosina@suse.cz>
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Tested-by: Jiri Kosina <jkosina@suse.cz>
> 
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index 356450f09c1f..90a3d017b90c 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -64,6 +64,8 @@ static struct {
>         * an ongoing cpu hotplug operation.
>         */
>         int refcount;
> +       /* And allows lockless put_online_cpus(). */
> +       atomic_t puts_pending;
> 
>  #ifdef CONFIG_DEBUG_LOCK_ALLOC
>         struct lockdep_map dep_map;
> @@ -113,7 +115,11 @@ void put_online_cpus(void)
>  {
>         if (cpu_hotplug.active_writer == current)
>         return;
> -       mutex_lock(&cpu_hotplug.lock);
> +       if (!mutex_trylock(&cpu_hotplug.lock)) {
> +       atomic_inc(&cpu_hotplug.puts_pending);
> +       cpuhp_lock_release();
> +       return;
> +       }
> 
>         if (WARN_ON(!cpu_hotplug.refcount))
>         cpu_hotplug.refcount++; /* try to fix things up */
> @@ -155,6 +161,12 @@ void cpu_hotplug_begin(void)
>         cpuhp_lock_acquire();
>         for (;;) {
>         mutex_lock(&cpu_hotplug.lock);
> +       if (atomic_read(&cpu_hotplug.puts_pending)) {
> +       int delta;
> +
> +       delta = atomic_xchg(&cpu_hotplug.puts_pending, 0);
> +       cpu_hotplug.refcount -= delta;
> +       }
>         if (likely(!cpu_hotplug.refcount))
>         break;
>         __set_current_state(TASK_UNINTERRUPTIBLE);
> 
> 

^ permalink raw reply

* Re: irq disable in __netdev_alloc_frag() ?
From: Alexei Starovoitov @ 2014-10-23  6:12 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Alexander Duyck, Eric Dumazet, Network Development
In-Reply-To: <1414041244.2094.26.camel@edumazet-glaptop2.roam.corp.google.com>

On Wed, Oct 22, 2014 at 10:14 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> Really, I doubt this will make any difference in real workloads
> where bottleneck is memory bandwidth, not the time taken by
> pushf/cli/popf.
>
> What difference do you have on a real workload like : netperf TCP_RR or
> UDP_RR ?

there are different workloads. l2/l3 switching is very real
as well. For tcp_rr on the host the difference will be
minimal, since the time is spent elsewhere. For cases
that don't use tcp stack, but do pure l2/l3 processing
all improvements will add up. imo the kernel should be
optimized for all use cases when possible.

^ permalink raw reply

* Re: [RFC] tcp md5 use of alloc_percpu
From: Jonathan Toppins @ 2014-10-23  6:58 UTC (permalink / raw)
  To: Eric Dumazet, David Ahern; +Cc: Crestez Dan Leonard, netdev
In-Reply-To: <1414042688.2094.30.camel@edumazet-glaptop2.roam.corp.google.com>

On 10/23/14, 1:38 AM, Eric Dumazet wrote:
> On Wed, 2014-10-22 at 22:23 -0700, Eric Dumazet wrote:
>> On Wed, 2014-10-22 at 22:40 -0600, David Ahern wrote:
>>> On 10/22/14, 12:55 PM, Crestez Dan Leonard wrote:
>>>> Hello,
>>>>
>>>> It seems that the TCP MD5 feature allocates a percpu struct tcp_md5sig_pool and uses part of that memory for a scratch buffer to do crypto on. Here is the relevant code:
>>>
>>> This is a forward port of a local change to address the problem (local 
>>> kernel version is 3.4 so perhaps my quick bump to top of tree is off but 
>>> it shows the general idea). Been on my to-do list to figure out why this 
>>> is needed, but it seems related to your problem:
>>>
>>> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
>>> index 1bec4e76d88c..833a676bd4b0 100644
>>> --- a/net/ipv4/tcp.c
>>> +++ b/net/ipv4/tcp.c
>>> @@ -2941,7 +2941,7 @@ struct tcp_md5sig_pool *tcp_get_md5sig_pool(void)
>>>      local_bh_disable();
>>>      p = ACCESS_ONCE(tcp_md5sig_pool);
>>>      if (p)
>>> -       return raw_cpu_ptr(p);
>>> +       return __va(per_cpu_ptr_to_phys(raw_cpu_ptr(p)));
>>>
>>>      local_bh_enable();
>>>      return NULL;
>>
>> per_cpu_ptr_to_phys() can be pretty expensive and should not be called
>> in fast path.
>>
> 
> My updated patch would be :
> 
>  net/ipv4/tcp.c |   66 +++++++++++++++++++----------------------------
>  1 file changed, 28 insertions(+), 38 deletions(-)
> 
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 1bec4e76d88c..af4dc16b61f6 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -2868,61 +2868,51 @@ EXPORT_SYMBOL(compat_tcp_getsockopt);
>  #endif
>  
>  #ifdef CONFIG_TCP_MD5SIG
> -static struct tcp_md5sig_pool __percpu *tcp_md5sig_pool __read_mostly;
> +static DEFINE_PER_CPU(struct tcp_md5sig_pool, *tcp_md5sig_pool);
>  static DEFINE_MUTEX(tcp_md5sig_mutex);
> -
> -static void __tcp_free_md5sig_pool(struct tcp_md5sig_pool __percpu *pool)
> -{
> -	int cpu;
> -
> -	for_each_possible_cpu(cpu) {
> -		struct tcp_md5sig_pool *p = per_cpu_ptr(pool, cpu);
> -
> -		if (p->md5_desc.tfm)
> -			crypto_free_hash(p->md5_desc.tfm);
> -	}
> -	free_percpu(pool);
> -}
> +static bool tcp_md5sig_pool_populated = false;
>  
>  static void __tcp_alloc_md5sig_pool(void)
>  {
>  	int cpu;
> -	struct tcp_md5sig_pool __percpu *pool;
> -
> -	pool = alloc_percpu(struct tcp_md5sig_pool);
> -	if (!pool)
> -		return;
>  
>  	for_each_possible_cpu(cpu) {
> +		struct tcp_md5sig_pool *pool;
>  		struct crypto_hash *hash;
>  
> -		hash = crypto_alloc_hash("md5", 0, CRYPTO_ALG_ASYNC);
> -		if (IS_ERR_OR_NULL(hash))
> -			goto out_free;
> -
> -		per_cpu_ptr(pool, cpu)->md5_desc.tfm = hash;
> +		pool = per_cpu(tcp_md5sig_pool, cpu);
> +		if (!pool) {
> +			pool = kzalloc_node(sizeof(*pool), GFP_KERNEL,
GFP_DMA | GFP_KERNEL
This memory will possibly be used in a DMA correct? (thinking crypto
hardware offload)
> +					    cpu_to_node(cpu));
> +			if (!pool)
> +				return;
> +			per_cpu(tcp_md5sig_pool, cpu) = pool;
> +		}
> +		if (!pool->md5_desc.tfm) {
> +			hash = crypto_alloc_hash("md5", 0, CRYPTO_ALG_ASYNC);
> +			if (IS_ERR_OR_NULL(hash))
> +				return;
> +			pool->md5_desc.tfm = hash;
> +		}
>  	}
> -	/* before setting tcp_md5sig_pool, we must commit all writes
> -	 * to memory. See ACCESS_ONCE() in tcp_get_md5sig_pool()
> +	/* before setting tcp_md5sig_pool_populated, we must commit all writes
> +	 * to memory. See smp_rmb() in tcp_get_md5sig_pool()
>  	 */
>  	smp_wmb();
> -	tcp_md5sig_pool = pool;
> -	return;
> -out_free:
> -	__tcp_free_md5sig_pool(pool);
> +	tcp_md5sig_pool_populated = true;
>  }
>  
>  bool tcp_alloc_md5sig_pool(void)
>  {
> -	if (unlikely(!tcp_md5sig_pool)) {
> +	if (unlikely(!tcp_md5sig_pool_populated)) {
>  		mutex_lock(&tcp_md5sig_mutex);
>  
> -		if (!tcp_md5sig_pool)
> +		if (!tcp_md5sig_pool_populated)
>  			__tcp_alloc_md5sig_pool();
>  
>  		mutex_unlock(&tcp_md5sig_mutex);
>  	}
> -	return tcp_md5sig_pool != NULL;
> +	return tcp_md5sig_pool_populated;
>  }
>  EXPORT_SYMBOL(tcp_alloc_md5sig_pool);
>  
> @@ -2936,13 +2926,13 @@ EXPORT_SYMBOL(tcp_alloc_md5sig_pool);
>   */
>  struct tcp_md5sig_pool *tcp_get_md5sig_pool(void)
>  {
> -	struct tcp_md5sig_pool __percpu *p;
> -
>  	local_bh_disable();
> -	p = ACCESS_ONCE(tcp_md5sig_pool);
> -	if (p)
> -		return raw_cpu_ptr(p);
>  
> +	if (tcp_md5sig_pool_populated) {
> +		/* coupled with smp_wmb() in __tcp_alloc_md5sig_pool */
> +		smp_rmb();
> +		return this_cpu_read(tcp_md5sig_pool);
> +	}
>  	local_bh_enable();
>  	return NULL;
>  }
> 
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply

* Re: [PATCH v2] drivers: net: xgene: Rewrite loop in xgene_enet_ecc_init()
From: Geert Uytterhoeven @ 2014-10-23  7:05 UTC (permalink / raw)
  To: David Miller
  Cc: Iyappan Subramanian, kchudgar, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <20141022.161257.290251379136182774.davem@davemloft.net>

On Wed, Oct 22, 2014 at 10:12 PM, David Miller <davem@davemloft.net> wrote:
> From: Geert Uytterhoeven <geert@linux-m68k.org>
> Date: Wed, 22 Oct 2014 21:50:06 +0200
>
>> On Wed, Oct 22, 2014 at 9:34 PM, David Miller <davem@davemloft.net> wrote:
>>> From: Geert Uytterhoeven <geert@linux-m68k.org>
>>> Date: Wed, 22 Oct 2014 09:39:41 +0200
>>>
>>>> drivers/net/ethernet/apm/xgene/xgene_enet_sgmac.c: In function ‘xgene_enet_ecc_init’:
>>>> drivers/net/ethernet/apm/xgene/xgene_enet_sgmac.c:126: warning: ‘data’ may be used uninitialized in this function
>>>>
>>>> Depending on the arbitrary value on the stack, the loop may terminate
>>>> too early, and cause a bogus -ENODEV failure.
>>>>
>>>> Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
>>>> ---
>>>> v2: Rewrite the loop instead of pre-initializing data.
>>>
>>> I hate to be a pest, but like the other patch of your's I think
>>> a do { } while() works best here because the intent is clearly
>>> to run the loop at least once, right?
>>
>> I wanted to avoid checking for "data != ~0U" twice: once to abort the loop,
>> and once to check if a timeout happened.
>
> Hmmm:
>
>         do {
>                 usleep_range(...);
>                 data = ...();
>                 if (data == ~0)
>                         return 0;
>         } while (++i < 10);
>
>         netdev_err(...);
>         return -ENODEV;
>
> Why would you have to check data twice?



-- 
Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply

* Re: [PATCH v2] drivers: net: xgene: Rewrite loop in xgene_enet_ecc_init()
From: Geert Uytterhoeven @ 2014-10-23  7:06 UTC (permalink / raw)
  To: David Miller
  Cc: Iyappan Subramanian, kchudgar, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <20141022.161257.290251379136182774.davem@davemloft.net>

On Wed, Oct 22, 2014 at 10:12 PM, David Miller <davem@davemloft.net> wrote:
> From: Geert Uytterhoeven <geert@linux-m68k.org>
> Date: Wed, 22 Oct 2014 21:50:06 +0200
>
>> On Wed, Oct 22, 2014 at 9:34 PM, David Miller <davem@davemloft.net> wrote:
>>> From: Geert Uytterhoeven <geert@linux-m68k.org>
>>> Date: Wed, 22 Oct 2014 09:39:41 +0200
>>>
>>>> drivers/net/ethernet/apm/xgene/xgene_enet_sgmac.c: In function ‘xgene_enet_ecc_init’:
>>>> drivers/net/ethernet/apm/xgene/xgene_enet_sgmac.c:126: warning: ‘data’ may be used uninitialized in this function
>>>>
>>>> Depending on the arbitrary value on the stack, the loop may terminate
>>>> too early, and cause a bogus -ENODEV failure.
>>>>
>>>> Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
>>>> ---
>>>> v2: Rewrite the loop instead of pre-initializing data.
>>>
>>> I hate to be a pest, but like the other patch of your's I think
>>> a do { } while() works best here because the intent is clearly
>>> to run the loop at least once, right?
>>
>> I wanted to avoid checking for "data != ~0U" twice: once to abort the loop,
>> and once to check if a timeout happened.
>
> Hmmm:
>
>         do {
>                 usleep_range(...);
>                 data = ...();
>                 if (data == ~0)
>                         return 0;
>         } while (++i < 10);
>
>         netdev_err(...);
>         return -ENODEV;
>
> Why would you have to check data twice?

Yes, that would work to.

Feel free to do s/for (i = 0; i < 10; i++)/do/ and s/}/} while (++i < 10);/

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply

* Re: [PATCH] igb: don't reuse pages with pfmemalloc flag
From: Roman Gushchin @ 2014-10-23  7:52 UTC (permalink / raw)
  To: Jeff Kirsher
  Cc: alexander.h.duyck@intel.com, netdev@vger.kernel.org,
	sassmann@kpanic.de, e1000-devel@lists.sourceforge.net,
	peter.p.waskiewicz.jr@intel.com, bruce.w.allan@intel.com,
	jesse.brandeburg@intel.com, davem@davemloft.net,
	john.ronciak@intel.com, gregkh@linuxfoundation.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <1414002630.2179.0.camel@jtkirshe-mobl>

Thank you!

Probably we should add it to stable trees too?

--
Regards,
Roman

22.10.2014, 22:30, "Jeff Kirsher" <jeffrey.t.kirsher@intel.com>:
> On Wed, 2014-10-22 at 17:50 +0400, Roman Gushchin wrote:
>>  Incoming packet is dropped silently by sk_filter(), if the skb was
>>  allocated from pfmemalloc reserves and the corresponding socket is
>>  not marked with the SOCK_MEMALLOC flag.
>>
>>  Igb driver allocates pages for DMA with __skb_alloc_page(), which
>>  calls alloc_pages_node() with the __GFP_MEMALLOC flag. So, in case
>>  of OOM condition, igb can get pages with pfmemalloc flag set.
>>
>>  If an incoming packet hits the pfmemalloc page and is large enough
>>  (small packets are copying into the memory, allocated with
>>  netdev_alloc_skb_ip_align(), so they are not affected), it will be
>>  dropped.
>>
>>  This behavior is ok under high memory pressure, but the problem is
>>  that the igb driver reuses these mapped pages. So, packets are still
>>  dropping even if all memory issues are gone and there is a plenty
>>  of free memory.
>>
>>  In my case, some TCP sessions hang on a small percentage (< 0.1%)
>>  of machines days after OOMs.
>>
>>  Fix this by avoiding reuse of such pages.
>>
>>  Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
>>  ---
>>   drivers/net/ethernet/intel/igb/igb_main.c | 6 +++++-
>>   1 file changed, 5 insertions(+), 1 deletion(-)
>
> Thanks Roman, I have added you patch to my queue.

------------------------------------------------------------------------------
_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel&#174; Ethernet, visit http://communities.intel.com/community/wired

^ permalink raw reply

* [PATCH net-next 2/3] gianfar: Fix the way the local advertising flow options are determined
From: Matei Pavaluca @ 2014-10-23  7:57 UTC (permalink / raw)
  To: netdev; +Cc: David S. Miller, Claudiu Manoil, Pavaluca Matei
In-Reply-To: <1414051077-4330-1-git-send-email-matei.pavaluca@freescale.com>

From: Pavaluca Matei <matei.pavaluca@freescale.com>

Local flow control options needed in order to resolve the negotiation
are incorrectly calculated.

Previously 'mii_advertise_flowctrl' was called to determine the local advertising
options, but these were determined based on FLOW_CTRL_RX/TX flags which are
never set through ethtool.
The patch simply translates from ethtool flow options to mii flow options.

Signed-off-by: Pavaluca Matei <matei.pavaluca@freescale.com>
---
 drivers/net/ethernet/freescale/gianfar.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/freescale/gianfar.c b/drivers/net/ethernet/freescale/gianfar.c
index 2485b74..329efca 100644
--- a/drivers/net/ethernet/freescale/gianfar.c
+++ b/drivers/net/ethernet/freescale/gianfar.c
@@ -3373,7 +3373,11 @@ static u32 gfar_get_flowctrl_cfg(struct gfar_private *priv)
 		if (phydev->asym_pause)
 			rmt_adv |= LPA_PAUSE_ASYM;
 
-		lcl_adv = mii_advertise_flowctrl(phydev->advertising);
+		lcl_adv = 0;
+		if (phydev->advertising & ADVERTISED_Pause)
+			lcl_adv |= ADVERTISE_PAUSE_CAP;
+		if (phydev->advertising & ADVERTISED_Asym_Pause)
+			lcl_adv |= ADVERTISE_PAUSE_ASYM;
 
 		flowctrl = mii_resolve_flowctrl_fdx(lcl_adv, rmt_adv);
 		if (flowctrl & FLOW_CTRL_TX)
-- 
1.7.11.7

^ permalink raw reply related

* [PATCH net-next 3/3] gianfar: Implement PAUSE frame generation support
From: Matei Pavaluca @ 2014-10-23  7:57 UTC (permalink / raw)
  To: netdev; +Cc: David S. Miller, Claudiu Manoil, Matei Pavaluca
In-Reply-To: <1414051077-4330-1-git-send-email-matei.pavaluca@freescale.com>

The hardware can automatically generate pause frames when the number
of free buffers drops under a certain threshold, but in order to do this,
the address of the last free buffer needs to be written to a specific
register for each RX queue.

This has to be done in 'gfar_clean_rx_ring' which is called for each
RX queue. In order not to impact performance, by adding a register write
for each incoming packet, this operation is done only when the PAUSE frame
transmission is enabled.

Whenever the link is readjusted, this capability is turned on or off.

Signed-off-by: Matei Pavaluca <matei.pavaluca@freescale.com>
---
 drivers/net/ethernet/freescale/gianfar.c         | 56 ++++++++++++++++++++++++
 drivers/net/ethernet/freescale/gianfar.h         | 35 ++++++++++++++-
 drivers/net/ethernet/freescale/gianfar_ethtool.c | 11 +++--
 3 files changed, 98 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/freescale/gianfar.c b/drivers/net/ethernet/freescale/gianfar.c
index 329efca..e3b9759 100644
--- a/drivers/net/ethernet/freescale/gianfar.c
+++ b/drivers/net/ethernet/freescale/gianfar.c
@@ -173,10 +173,12 @@ static void gfar_init_rxbdp(struct gfar_priv_rx_q *rx_queue, struct rxbd8 *bdp,
 static int gfar_init_bds(struct net_device *ndev)
 {
 	struct gfar_private *priv = netdev_priv(ndev);
+	struct gfar __iomem *regs = priv->gfargrp[0].regs;
 	struct gfar_priv_tx_q *tx_queue = NULL;
 	struct gfar_priv_rx_q *rx_queue = NULL;
 	struct txbd8 *txbdp;
 	struct rxbd8 *rxbdp;
+	u32 *rfbptr;
 	int i, j;
 
 	for (i = 0; i < priv->num_tx_queues; i++) {
@@ -201,6 +203,7 @@ static int gfar_init_bds(struct net_device *ndev)
 		txbdp->status |= TXBD_WRAP;
 	}
 
+	rfbptr = &regs->rfbptr0;
 	for (i = 0; i < priv->num_rx_queues; i++) {
 		rx_queue = priv->rx_queue[i];
 		rx_queue->cur_rx = rx_queue->rx_bd_base;
@@ -227,6 +230,8 @@ static int gfar_init_bds(struct net_device *ndev)
 			rxbdp++;
 		}
 
+		rx_queue->rfbptr = rfbptr;
+		rfbptr += 2;
 	}
 
 	return 0;
@@ -336,6 +341,20 @@ static void gfar_init_tx_rx_base(struct gfar_private *priv)
 	}
 }
 
+static void gfar_init_rqprm(struct gfar_private *priv)
+{
+	struct gfar __iomem *regs = priv->gfargrp[0].regs;
+	u32 __iomem *baddr;
+	int i;
+
+	baddr = &regs->rqprm0;
+	for (i = 0; i < priv->num_rx_queues; i++) {
+		gfar_write(baddr, priv->rx_queue[i]->rx_ring_size |
+			   (DEFAULT_RX_LFC_THR << FBTHR_SHIFT));
+		baddr++;
+	}
+}
+
 static void gfar_rx_buff_size_config(struct gfar_private *priv)
 {
 	int frame_size = priv->ndev->mtu + ETH_HLEN + ETH_FCS_LEN;
@@ -396,6 +415,13 @@ static void gfar_mac_rx_config(struct gfar_private *priv)
 	if (priv->ndev->features & NETIF_F_HW_VLAN_CTAG_RX)
 		rctrl |= RCTRL_VLEX | RCTRL_PRSDEP_INIT;
 
+	/* Clear the LFC bit */
+	gfar_write(&regs->rctrl, rctrl);
+	/* Init flow control threshold values */
+	gfar_init_rqprm(priv);
+	gfar_write(&regs->ptv, DEFAULT_LFC_PTVVAL);
+	rctrl |= RCTRL_LFC;
+
 	/* Init rctrl based on our settings */
 	gfar_write(&regs->rctrl, rctrl);
 }
@@ -2859,6 +2885,10 @@ int gfar_clean_rx_ring(struct gfar_priv_rx_q *rx_queue, int rx_work_limit)
 		/* Setup the new bdp */
 		gfar_new_rxbdp(rx_queue, bdp, newskb);
 
+		/* Update Last Free RxBD pointer for LFC */
+		if (rx_queue->rfbptr && priv->tx_actual_en)
+			gfar_write(rx_queue->rfbptr, (u32)bdp);
+
 		/* Update to the next pointer */
 		bdp = next_bd(bdp, base, rx_queue->rx_ring_size);
 
@@ -3393,6 +3423,11 @@ static noinline void gfar_update_link_state(struct gfar_private *priv)
 {
 	struct gfar __iomem *regs = priv->gfargrp[0].regs;
 	struct phy_device *phydev = priv->phydev;
+	struct gfar_priv_grp *grp = NULL;
+	struct gfar_priv_rx_q *rx_queue = NULL;
+	unsigned long rstat_rxf;
+	int i, j, num_act_queues;
+	struct rxbd8 *bdp;
 
 	if (unlikely(test_bit(GFAR_RESETTING, &priv->state)))
 		return;
@@ -3401,6 +3436,7 @@ static noinline void gfar_update_link_state(struct gfar_private *priv)
 		u32 tempval1 = gfar_read(&regs->maccfg1);
 		u32 tempval = gfar_read(&regs->maccfg2);
 		u32 ecntrl = gfar_read(&regs->ecntrl);
+		u32 tx_flow_oldval = (tempval & MACCFG1_TX_FLOW);
 
 		if (phydev->duplex != priv->oldduplex) {
 			if (!(phydev->duplex))
@@ -3445,6 +3481,26 @@ static noinline void gfar_update_link_state(struct gfar_private *priv)
 		tempval1 &= ~(MACCFG1_TX_FLOW | MACCFG1_RX_FLOW);
 		tempval1 |= gfar_get_flowctrl_cfg(priv);
 
+		/* Turn last free buffer recording on */
+		if ((tempval1 & MACCFG1_TX_FLOW) && !tx_flow_oldval) {
+			for (i = 0; i < priv->num_rx_queues; i++) {
+				rx_queue = priv->rx_queue[i];
+				bdp = rx_queue->cur_rx;
+				/* skip to previous bd */
+				bdp = skip_bd(bdp, rx_queue->rx_ring_size - 1,
+					      rx_queue->rx_bd_base,
+					      rx_queue->rx_ring_size);
+
+				if (rx_queue->rfbptr)
+					gfar_write(rx_queue->rfbptr, (u32)bdp);
+			}
+
+			priv->tx_actual_en = 1;
+		}
+
+		if (!(tempval1 & MACCFG1_TX_FLOW) && tx_flow_oldval)
+			priv->tx_actual_en = 0;
+
 		gfar_write(&regs->maccfg1, tempval1);
 		gfar_write(&regs->maccfg2, tempval);
 		gfar_write(&regs->ecntrl, ecntrl);
diff --git a/drivers/net/ethernet/freescale/gianfar.h b/drivers/net/ethernet/freescale/gianfar.h
index 6b00868..9ba16ad 100644
--- a/drivers/net/ethernet/freescale/gianfar.h
+++ b/drivers/net/ethernet/freescale/gianfar.h
@@ -99,6 +99,10 @@ extern const char gfar_driver_version[];
 #define GFAR_MAX_FIFO_STARVE	511
 #define GFAR_MAX_FIFO_STARVE_OFF 511
 
+#define FBTHR_SHIFT        24
+#define DEFAULT_RX_LFC_THR  16
+#define DEFAULT_LFC_PTVVAL  4
+
 #define DEFAULT_RX_BUFFER_SIZE  1536
 #define TX_RING_MOD_MASK(size) (size-1)
 #define RX_RING_MOD_MASK(size) (size-1)
@@ -273,6 +277,7 @@ extern const char gfar_driver_version[];
 
 #define RCTRL_TS_ENABLE 	0x01000000
 #define RCTRL_PAL_MASK		0x001f0000
+#define RCTRL_LFC		0x00004000
 #define RCTRL_VLEX		0x00002000
 #define RCTRL_FILREN		0x00001000
 #define RCTRL_GHTX		0x00000400
@@ -849,7 +854,32 @@ struct gfar {
 	u8	res23c[248];
 	u32	attr;		/* 0x.bf8 - Attributes Register */
 	u32	attreli;	/* 0x.bfc - Attributes Extract Length and Extract Index Register */
-	u8	res24[688];
+	u32	rqprm0;	/* 0x.c00 - Receive queue parameters register 0 */
+	u32	rqprm1;	/* 0x.c04 - Receive queue parameters register 1 */
+	u32	rqprm2;	/* 0x.c08 - Receive queue parameters register 2 */
+	u32	rqprm3;	/* 0x.c0c - Receive queue parameters register 3 */
+	u32	rqprm4;	/* 0x.c10 - Receive queue parameters register 4 */
+	u32	rqprm5;	/* 0x.c14 - Receive queue parameters register 5 */
+	u32	rqprm6;	/* 0x.c18 - Receive queue parameters register 6 */
+	u32	rqprm7;	/* 0x.c1c - Receive queue parameters register 7 */
+	u8	res24[36];
+	u32	rfbptr0; /* 0x.c44 - Last free RxBD pointer for ring 0 */
+	u8	res24a[4];
+	u32	rfbptr1; /* 0x.c4c - Last free RxBD pointer for ring 1 */
+	u8	res24b[4];
+	u32	rfbptr2; /* 0x.c54 - Last free RxBD pointer for ring 2 */
+	u8	res24c[4];
+	u32	rfbptr3; /* 0x.c5c - Last free RxBD pointer for ring 3 */
+	u8	res24d[4];
+	u32	rfbptr4; /* 0x.c64 - Last free RxBD pointer for ring 4 */
+	u8	res24e[4];
+	u32	rfbptr5; /* 0x.c6c - Last free RxBD pointer for ring 5 */
+	u8	res24f[4];
+	u32	rfbptr6; /* 0x.c74 - Last free RxBD pointer for ring 6 */
+	u8	res24g[4];
+	u32	rfbptr7; /* 0x.c7c - Last free RxBD pointer for ring 7 */
+	u8	res24h[4];
+	u8	res24x[556];
 	u32	isrg0;		/* 0x.eb0 - Interrupt steering group 0 register */
 	u32	isrg1;		/* 0x.eb4 - Interrupt steering group 1 register */
 	u32	isrg2;		/* 0x.eb8 - Interrupt steering group 2 register */
@@ -1009,6 +1039,7 @@ struct gfar_priv_rx_q {
 	/* RX Coalescing values */
 	unsigned char rxcoalescing;
 	unsigned long rxic;
+	u32 *rfbptr;
 };
 
 enum gfar_irqinfo_id {
@@ -1134,6 +1165,8 @@ struct gfar_private {
 		tx_pause_en:1,
 		rx_pause_en:1;
 
+	int tx_actual_en;
+
 	/* The total tx and rx ring size for the enabled queues */
 	unsigned int total_tx_ring_size;
 	unsigned int total_rx_ring_size;
diff --git a/drivers/net/ethernet/freescale/gianfar_ethtool.c b/drivers/net/ethernet/freescale/gianfar_ethtool.c
index 76d7070..0e8d8bf 100644
--- a/drivers/net/ethernet/freescale/gianfar_ethtool.c
+++ b/drivers/net/ethernet/freescale/gianfar_ethtool.c
@@ -531,7 +531,7 @@ static int gfar_spauseparam(struct net_device *dev,
 	struct gfar_private *priv = netdev_priv(dev);
 	struct phy_device *phydev = priv->phydev;
 	struct gfar __iomem *regs = priv->gfargrp[0].regs;
-	u32 oldadv, newadv;
+	u32 oldadv, newadv = 0;
 
 	if (!phydev)
 		return -ENODEV;
@@ -548,7 +548,7 @@ static int gfar_spauseparam(struct net_device *dev,
 		if (epause->tx_pause) {
 			priv->tx_pause_en = 1;
 			/* FLOW_CTRL_RX & TX */
-			newadv = ADVERTISED_Pause;
+			newadv |= ADVERTISED_Pause;
 		} else  /* FLOW_CTLR_RX */
 			newadv = ADVERTISED_Pause | ADVERTISED_Asym_Pause;
 	} else if (epause->tx_pause) {
@@ -579,8 +579,13 @@ static int gfar_spauseparam(struct net_device *dev,
 			u32 tempval;
 			tempval = gfar_read(&regs->maccfg1);
 			tempval &= ~(MACCFG1_TX_FLOW | MACCFG1_RX_FLOW);
-			if (priv->tx_pause_en)
+
+			priv->tx_actual_en = 0;
+			if (priv->tx_pause_en) {
+				priv->tx_actual_en = 1;
 				tempval |= MACCFG1_TX_FLOW;
+			}
+
 			if (priv->rx_pause_en)
 				tempval |= MACCFG1_RX_FLOW;
 			gfar_write(&regs->maccfg1, tempval);
-- 
1.7.11.7

^ permalink raw reply related

* Re: [PATCH 06/14] net: dsa: Add support for hardware monitoring
From: Richard Cochran @ 2014-10-23  8:24 UTC (permalink / raw)
  To: Guenter Roeck
  Cc: Florian Fainelli, netdev, David S. Miller, Andrew Lunn,
	linux-kernel@vger.kernel.org
In-Reply-To: <54488CE1.2000106@roeck-us.net>

On Wed, Oct 22, 2014 at 10:06:41PM -0700, Guenter Roeck wrote:
> On 10/22/2014 09:37 PM, Florian Fainelli wrote:
> >You probably want the number of temperature sensors to come from the
> >switch driver, and support arbitrary number of temperature sensors?
> 
> In that case we would need the number of sensors, pass a sensor index,
> and create a dynamic number of attributes. The code would get much more
> complex without real benefit unless there is such a chip - and if there is,
> we can still change the code at that time. I would prefer to keep it simple
> for now.

Better to do it correctly, right from the start. There *will* be such
a chip one day, and the person wanting to add it will appreciate the
solid foundation (even if that person ends up being you ;).

Thanks,
Richard

^ permalink raw reply

* [PATCH v3] drivers: net: xgene: Rewrite buggy loop in xgene_enet_ecc_init()
From: Geert Uytterhoeven @ 2014-10-23  8:25 UTC (permalink / raw)
  To: David S. Miller, Iyappan Subramanian, Keyur Chudgar
  Cc: netdev, linux-kernel, Geert Uytterhoeven

drivers/net/ethernet/apm/xgene/xgene_enet_sgmac.c: In function ‘xgene_enet_ecc_init’:
drivers/net/ethernet/apm/xgene/xgene_enet_sgmac.c:126: warning: ‘data’ may be used uninitialized in this function

Depending on the arbitrary value on the stack, the loop may terminate
too early, and cause a bogus -ENODEV failure.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
---
v3:
  - Use "do { ... } while (...);" instead of "for (...) { ... }",
v2:
  - Rewrite the loop instead of pre-initializing data.

 drivers/net/ethernet/apm/xgene/xgene_enet_sgmac.c | 16 +++++++---------
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_sgmac.c b/drivers/net/ethernet/apm/xgene/xgene_enet_sgmac.c
index e6d24c2101982444..c22f32622fa9a50a 100644
--- a/drivers/net/ethernet/apm/xgene/xgene_enet_sgmac.c
+++ b/drivers/net/ethernet/apm/xgene/xgene_enet_sgmac.c
@@ -124,20 +124,18 @@ static int xgene_enet_ecc_init(struct xgene_enet_pdata *p)
 {
 	struct net_device *ndev = p->ndev;
 	u32 data;
-	int i;
+	int i = 0;
 
 	xgene_enet_wr_diag_csr(p, ENET_CFG_MEM_RAM_SHUTDOWN_ADDR, 0);
-	for (i = 0; i < 10 && data != ~0U ; i++) {
+	do {
 		usleep_range(100, 110);
 		data = xgene_enet_rd_diag_csr(p, ENET_BLOCK_MEM_RDY_ADDR);
-	}
+		if (data == ~0U)
+			return 0;
+	} while (++i < 10);
 
-	if (data != ~0U) {
-		netdev_err(ndev, "Failed to release memory from shutdown\n");
-		return -ENODEV;
-	}
-
-	return 0;
+	netdev_err(ndev, "Failed to release memory from shutdown\n");
+	return -ENODEV;
 }
 
 static void xgene_enet_config_ring_if_assoc(struct xgene_enet_pdata *p)
-- 
1.9.1

^ permalink raw reply related

* [PATCH net-next 1/3] gianfar: Add flow control support flags
From: Matei Pavaluca @ 2014-10-23  7:57 UTC (permalink / raw)
  To: netdev; +Cc: David S. Miller, Claudiu Manoil, Pavaluca Matei

From: Pavaluca Matei <matei.pavaluca@freescale.com>

The phy device supports 802.3x flow control, but the specific flags are not set
in the phy initialisation code. Flow control flags need to be added to the
supported capabilities of the phydev by the driver.

This is needed in order for ethtool to work ('ethtool -A' code checks for these
flags)

Signed-off-by: Pavaluca Matei <matei.pavaluca@freescale.com>
---
 drivers/net/ethernet/freescale/gianfar.c | 3 +++
 drivers/net/ethernet/freescale/gianfar.h | 4 +---
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/freescale/gianfar.c b/drivers/net/ethernet/freescale/gianfar.c
index 4fdf0aa..2485b74 100644
--- a/drivers/net/ethernet/freescale/gianfar.c
+++ b/drivers/net/ethernet/freescale/gianfar.c
@@ -1687,6 +1687,9 @@ static int init_phy(struct net_device *dev)
 	priv->phydev->supported &= (GFAR_SUPPORTED | gigabit_support);
 	priv->phydev->advertising = priv->phydev->supported;
 
+	/* Add support for flow control, but don't advertise it by default */
+	priv->phydev->supported |= (SUPPORTED_Pause | SUPPORTED_Asym_Pause);
+
 	return 0;
 }
 
diff --git a/drivers/net/ethernet/freescale/gianfar.h b/drivers/net/ethernet/freescale/gianfar.h
index 2805cfb..6b00868 100644
--- a/drivers/net/ethernet/freescale/gianfar.h
+++ b/drivers/net/ethernet/freescale/gianfar.h
@@ -145,9 +145,7 @@ extern const char gfar_driver_version[];
 		| SUPPORTED_Autoneg \
 		| SUPPORTED_MII)
 
-#define GFAR_SUPPORTED_GBIT (SUPPORTED_1000baseT_Full \
-		| SUPPORTED_Pause \
-		| SUPPORTED_Asym_Pause)
+#define GFAR_SUPPORTED_GBIT SUPPORTED_1000baseT_Full
 
 /* TBI register addresses */
 #define MII_TBICON		0x11
-- 
1.7.11.7

^ permalink raw reply related

* [PATCH net-next v2 3/3] gianfar: Implement PAUSE frame generation support
From: Matei Pavaluca @ 2014-10-23  9:18 UTC (permalink / raw)
  To: netdev; +Cc: David S. Miller, Claudiu Manoil, Matei Pavaluca

The hardware can automatically generate pause frames when the number
of free buffers drops under a certain threshold, but in order to do this,
the address of the last free buffer needs to be written to a specific
register for each RX queue.

This has to be done in 'gfar_clean_rx_ring' which is called for each
RX queue. In order not to impact performance, by adding a register write
for each incoming packet, this operation is done only when the PAUSE frame
transmission is enabled.

Whenever the link is readjusted, this capability is turned on or off.

Signed-off-by: Matei Pavaluca <matei.pavaluca@freescale.com>
---
 
v2:
  - Added unlikely() in 'gfar_clean_rx_ring'
  - Removed some unused variables
 
 drivers/net/ethernet/freescale/gianfar.c         | 54 ++++++++++++++++++++++++
 drivers/net/ethernet/freescale/gianfar.h         | 34 ++++++++++++++-
 drivers/net/ethernet/freescale/gianfar_ethtool.c |  7 ++-
 3 files changed, 93 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/freescale/gianfar.c b/drivers/net/ethernet/freescale/gianfar.c
index 329efca..86dccb26 100644
--- a/drivers/net/ethernet/freescale/gianfar.c
+++ b/drivers/net/ethernet/freescale/gianfar.c
@@ -173,10 +173,12 @@ static void gfar_init_rxbdp(struct gfar_priv_rx_q *rx_queue, struct rxbd8 *bdp,
 static int gfar_init_bds(struct net_device *ndev)
 {
 	struct gfar_private *priv = netdev_priv(ndev);
+	struct gfar __iomem *regs = priv->gfargrp[0].regs;
 	struct gfar_priv_tx_q *tx_queue = NULL;
 	struct gfar_priv_rx_q *rx_queue = NULL;
 	struct txbd8 *txbdp;
 	struct rxbd8 *rxbdp;
+	u32 *rfbptr;
 	int i, j;
 
 	for (i = 0; i < priv->num_tx_queues; i++) {
@@ -201,6 +203,7 @@ static int gfar_init_bds(struct net_device *ndev)
 		txbdp->status |= TXBD_WRAP;
 	}
 
+	rfbptr = &regs->rfbptr0;
 	for (i = 0; i < priv->num_rx_queues; i++) {
 		rx_queue = priv->rx_queue[i];
 		rx_queue->cur_rx = rx_queue->rx_bd_base;
@@ -227,6 +230,8 @@ static int gfar_init_bds(struct net_device *ndev)
 			rxbdp++;
 		}
 
+		rx_queue->rfbptr = rfbptr;
+		rfbptr += 2;
 	}
 
 	return 0;
@@ -336,6 +341,20 @@ static void gfar_init_tx_rx_base(struct gfar_private *priv)
 	}
 }
 
+static void gfar_init_rqprm(struct gfar_private *priv)
+{
+	struct gfar __iomem *regs = priv->gfargrp[0].regs;
+	u32 __iomem *baddr;
+	int i;
+
+	baddr = &regs->rqprm0;
+	for (i = 0; i < priv->num_rx_queues; i++) {
+		gfar_write(baddr, priv->rx_queue[i]->rx_ring_size |
+			   (DEFAULT_RX_LFC_THR << FBTHR_SHIFT));
+		baddr++;
+	}
+}
+
 static void gfar_rx_buff_size_config(struct gfar_private *priv)
 {
 	int frame_size = priv->ndev->mtu + ETH_HLEN + ETH_FCS_LEN;
@@ -396,6 +415,13 @@ static void gfar_mac_rx_config(struct gfar_private *priv)
 	if (priv->ndev->features & NETIF_F_HW_VLAN_CTAG_RX)
 		rctrl |= RCTRL_VLEX | RCTRL_PRSDEP_INIT;
 
+	/* Clear the LFC bit */
+	gfar_write(&regs->rctrl, rctrl);
+	/* Init flow control threshold values */
+	gfar_init_rqprm(priv);
+	gfar_write(&regs->ptv, DEFAULT_LFC_PTVVAL);
+	rctrl |= RCTRL_LFC;
+
 	/* Init rctrl based on our settings */
 	gfar_write(&regs->rctrl, rctrl);
 }
@@ -2859,6 +2885,10 @@ int gfar_clean_rx_ring(struct gfar_priv_rx_q *rx_queue, int rx_work_limit)
 		/* Setup the new bdp */
 		gfar_new_rxbdp(rx_queue, bdp, newskb);
 
+		/* Update Last Free RxBD pointer for LFC */
+		if (unlikely(rx_queue->rfbptr && priv->tx_actual_en))
+			gfar_write(rx_queue->rfbptr, (u32)bdp);
+
 		/* Update to the next pointer */
 		bdp = next_bd(bdp, base, rx_queue->rx_ring_size);
 
@@ -3393,6 +3423,9 @@ static noinline void gfar_update_link_state(struct gfar_private *priv)
 {
 	struct gfar __iomem *regs = priv->gfargrp[0].regs;
 	struct phy_device *phydev = priv->phydev;
+	struct gfar_priv_rx_q *rx_queue = NULL;
+	int i;
+	struct rxbd8 *bdp;
 
 	if (unlikely(test_bit(GFAR_RESETTING, &priv->state)))
 		return;
@@ -3401,6 +3434,7 @@ static noinline void gfar_update_link_state(struct gfar_private *priv)
 		u32 tempval1 = gfar_read(&regs->maccfg1);
 		u32 tempval = gfar_read(&regs->maccfg2);
 		u32 ecntrl = gfar_read(&regs->ecntrl);
+		u32 tx_flow_oldval = (tempval & MACCFG1_TX_FLOW);
 
 		if (phydev->duplex != priv->oldduplex) {
 			if (!(phydev->duplex))
@@ -3445,6 +3479,26 @@ static noinline void gfar_update_link_state(struct gfar_private *priv)
 		tempval1 &= ~(MACCFG1_TX_FLOW | MACCFG1_RX_FLOW);
 		tempval1 |= gfar_get_flowctrl_cfg(priv);
 
+		/* Turn last free buffer recording on */
+		if ((tempval1 & MACCFG1_TX_FLOW) && !tx_flow_oldval) {
+			for (i = 0; i < priv->num_rx_queues; i++) {
+				rx_queue = priv->rx_queue[i];
+				bdp = rx_queue->cur_rx;
+				/* skip to previous bd */
+				bdp = skip_bd(bdp, rx_queue->rx_ring_size - 1,
+					      rx_queue->rx_bd_base,
+					      rx_queue->rx_ring_size);
+
+				if (rx_queue->rfbptr)
+					gfar_write(rx_queue->rfbptr, (u32)bdp);
+			}
+
+			priv->tx_actual_en = 1;
+		}
+
+		if (unlikely(!(tempval1 & MACCFG1_TX_FLOW) && tx_flow_oldval))
+			priv->tx_actual_en = 0;
+
 		gfar_write(&regs->maccfg1, tempval1);
 		gfar_write(&regs->maccfg2, tempval);
 		gfar_write(&regs->ecntrl, ecntrl);
diff --git a/drivers/net/ethernet/freescale/gianfar.h b/drivers/net/ethernet/freescale/gianfar.h
index 6b00868..b581b88 100644
--- a/drivers/net/ethernet/freescale/gianfar.h
+++ b/drivers/net/ethernet/freescale/gianfar.h
@@ -99,6 +99,10 @@ extern const char gfar_driver_version[];
 #define GFAR_MAX_FIFO_STARVE	511
 #define GFAR_MAX_FIFO_STARVE_OFF 511
 
+#define FBTHR_SHIFT        24
+#define DEFAULT_RX_LFC_THR  16
+#define DEFAULT_LFC_PTVVAL  4
+
 #define DEFAULT_RX_BUFFER_SIZE  1536
 #define TX_RING_MOD_MASK(size) (size-1)
 #define RX_RING_MOD_MASK(size) (size-1)
@@ -273,6 +277,7 @@ extern const char gfar_driver_version[];
 
 #define RCTRL_TS_ENABLE 	0x01000000
 #define RCTRL_PAL_MASK		0x001f0000
+#define RCTRL_LFC		0x00004000
 #define RCTRL_VLEX		0x00002000
 #define RCTRL_FILREN		0x00001000
 #define RCTRL_GHTX		0x00000400
@@ -849,7 +854,32 @@ struct gfar {
 	u8	res23c[248];
 	u32	attr;		/* 0x.bf8 - Attributes Register */
 	u32	attreli;	/* 0x.bfc - Attributes Extract Length and Extract Index Register */
-	u8	res24[688];
+	u32	rqprm0;	/* 0x.c00 - Receive queue parameters register 0 */
+	u32	rqprm1;	/* 0x.c04 - Receive queue parameters register 1 */
+	u32	rqprm2;	/* 0x.c08 - Receive queue parameters register 2 */
+	u32	rqprm3;	/* 0x.c0c - Receive queue parameters register 3 */
+	u32	rqprm4;	/* 0x.c10 - Receive queue parameters register 4 */
+	u32	rqprm5;	/* 0x.c14 - Receive queue parameters register 5 */
+	u32	rqprm6;	/* 0x.c18 - Receive queue parameters register 6 */
+	u32	rqprm7;	/* 0x.c1c - Receive queue parameters register 7 */
+	u8	res24[36];
+	u32	rfbptr0; /* 0x.c44 - Last free RxBD pointer for ring 0 */
+	u8	res24a[4];
+	u32	rfbptr1; /* 0x.c4c - Last free RxBD pointer for ring 1 */
+	u8	res24b[4];
+	u32	rfbptr2; /* 0x.c54 - Last free RxBD pointer for ring 2 */
+	u8	res24c[4];
+	u32	rfbptr3; /* 0x.c5c - Last free RxBD pointer for ring 3 */
+	u8	res24d[4];
+	u32	rfbptr4; /* 0x.c64 - Last free RxBD pointer for ring 4 */
+	u8	res24e[4];
+	u32	rfbptr5; /* 0x.c6c - Last free RxBD pointer for ring 5 */
+	u8	res24f[4];
+	u32	rfbptr6; /* 0x.c74 - Last free RxBD pointer for ring 6 */
+	u8	res24g[4];
+	u32	rfbptr7; /* 0x.c7c - Last free RxBD pointer for ring 7 */
+	u8	res24h[4];
+	u8	res24x[556];
 	u32	isrg0;		/* 0x.eb0 - Interrupt steering group 0 register */
 	u32	isrg1;		/* 0x.eb4 - Interrupt steering group 1 register */
 	u32	isrg2;		/* 0x.eb8 - Interrupt steering group 2 register */
@@ -1009,6 +1039,7 @@ struct gfar_priv_rx_q {
 	/* RX Coalescing values */
 	unsigned char rxcoalescing;
 	unsigned long rxic;
+	u32 *rfbptr;
 };
 
 enum gfar_irqinfo_id {
@@ -1099,6 +1130,7 @@ struct gfar_private {
 	unsigned int num_tx_queues;
 	unsigned int num_rx_queues;
 	unsigned int num_grps;
+	int tx_actual_en;
 
 	/* Network Statistics */
 	struct gfar_extra_stats extra_stats;
diff --git a/drivers/net/ethernet/freescale/gianfar_ethtool.c b/drivers/net/ethernet/freescale/gianfar_ethtool.c
index 76d7070..3e1a9c1 100644
--- a/drivers/net/ethernet/freescale/gianfar_ethtool.c
+++ b/drivers/net/ethernet/freescale/gianfar_ethtool.c
@@ -579,8 +579,13 @@ static int gfar_spauseparam(struct net_device *dev,
 			u32 tempval;
 			tempval = gfar_read(&regs->maccfg1);
 			tempval &= ~(MACCFG1_TX_FLOW | MACCFG1_RX_FLOW);
-			if (priv->tx_pause_en)
+
+			priv->tx_actual_en = 0;
+			if (priv->tx_pause_en) {
+				priv->tx_actual_en = 1;
 				tempval |= MACCFG1_TX_FLOW;
+			}
+
 			if (priv->rx_pause_en)
 				tempval |= MACCFG1_RX_FLOW;
 			gfar_write(&regs->maccfg1, tempval);
-- 
1.7.11.7

^ permalink raw reply related

* Re: [PATCH net-next 1/3] gianfar: Add flow control support flags
From: Claudiu Manoil @ 2014-10-23  9:36 UTC (permalink / raw)
  To: Matei Pavaluca, netdev; +Cc: David S. Miller
In-Reply-To: <1414051077-4330-1-git-send-email-matei.pavaluca@freescale.com>

On 10/23/2014 10:57 AM, Matei Pavaluca wrote:
> From: Pavaluca Matei <matei.pavaluca@freescale.com>
>
> The phy device supports 802.3x flow control, but the specific flags are not set
> in the phy initialisation code. Flow control flags need to be added to the
> supported capabilities of the phydev by the driver.
>
> This is needed in order for ethtool to work ('ethtool -A' code checks for these
> flags)
>
> Signed-off-by: Pavaluca Matei <matei.pavaluca@freescale.com>

Acked-by: Claudiu Manoil <claudiu.manoil@freescale.com>

^ permalink raw reply

* Re: [PATCH net-next 2/3] gianfar: Fix the way the local advertising flow options are determined
From: Claudiu Manoil @ 2014-10-23  9:36 UTC (permalink / raw)
  To: Matei Pavaluca, netdev; +Cc: David S. Miller
In-Reply-To: <1414051077-4330-2-git-send-email-matei.pavaluca@freescale.com>

On 10/23/2014 10:57 AM, Matei Pavaluca wrote:
> From: Pavaluca Matei <matei.pavaluca@freescale.com>
>
> Local flow control options needed in order to resolve the negotiation
> are incorrectly calculated.
>
> Previously 'mii_advertise_flowctrl' was called to determine the local advertising
> options, but these were determined based on FLOW_CTRL_RX/TX flags which are
> never set through ethtool.
> The patch simply translates from ethtool flow options to mii flow options.
>
> Signed-off-by: Pavaluca Matei <matei.pavaluca@freescale.com>

Acked-by: Claudiu Manoil <claudiu.manoil@freescale.com>

^ permalink raw reply

* Re: [PATCH net-next v2 3/3] gianfar: Implement PAUSE frame generation support
From: Claudiu Manoil @ 2014-10-23  9:36 UTC (permalink / raw)
  To: Matei Pavaluca, netdev; +Cc: David S. Miller
In-Reply-To: <1414055930-10331-1-git-send-email-matei.pavaluca@freescale.com>

On 10/23/2014 12:18 PM, Matei Pavaluca wrote:
> The hardware can automatically generate pause frames when the number
> of free buffers drops under a certain threshold, but in order to do this,
> the address of the last free buffer needs to be written to a specific
> register for each RX queue.
>
> This has to be done in 'gfar_clean_rx_ring' which is called for each
> RX queue. In order not to impact performance, by adding a register write
> for each incoming packet, this operation is done only when the PAUSE frame
> transmission is enabled.
>
> Whenever the link is readjusted, this capability is turned on or off.
>
> Signed-off-by: Matei Pavaluca <matei.pavaluca@freescale.com>
> ---
>
> v2:
>    - Added unlikely() in 'gfar_clean_rx_ring'
>    - Removed some unused variables

Acked-by: Claudiu Manoil <claudiu.manoil@freescale.com>

^ permalink raw reply

* [RFC PATCH v3 net-next] net: ipv6: Add a sysctl to make optimistic addresses useful candidates
From: Erik Kline @ 2014-10-23  9:59 UTC (permalink / raw)
  To: netdev; +Cc: ben, lorenzo, hannes, Erik Kline

Add a sysctl that causes an interface's optimistic addresses
to be considered equivalent to other non-deprecated addresses
for source address selection purposes.  Preferred addresses
will still take precedence over optimistic addresses, subject
to other ranking in the source address selection algorithm.

This is useful where different interfaces are connected to
different networks from different ISPs (e.g., a cell network
and a home wifi network).

The current behaviour complies with RFC 3484/6724, and it
makes sense if the host has only one interface, or has
multiple interfaces on the same network (same or cooperating
administrative domain(s), but not in the multiple distinct
networks case.

For example, if a mobile device has an IPv6 address on an LTE
network and then connects to IPv6-enabled wifi, while the wifi
IPv6 address is undergoing DAD, IPv6 connections will try use
the wifi default route with the LTE IPv6 address, and will get
stuck until they time out.

Also, because optimistic nodes can receive frames, issue
an RTM_NEWADDR as soon as DAD starts.  If DAD fails, a separate
RTM_DELADDR is always sent.

Also: add an entry in ip-sysctl.txt for optimistic_dad.

Signed-off-by: Erik Kline <ek@google.com>
---
 Documentation/networking/ip-sysctl.txt | 13 +++++++++
 include/linux/ipv6.h                   |  1 +
 include/uapi/linux/ipv6.h              |  1 +
 net/ipv6/addrconf.c                    | 52 ++++++++++++++++++++++++++++++++--
 4 files changed, 64 insertions(+), 3 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 0307e28..e03cf49 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -1452,6 +1452,19 @@ suppress_frag_ndisc - INTEGER
 	1 - (default) discard fragmented neighbor discovery packets
 	0 - allow fragmented neighbor discovery packets
 
+optimistic_dad - BOOLEAN
+	Whether to perform Optimistic Duplicate Address Detection (RFC 4429).
+		0: disabled (default)
+		1: enabled
+
+use_optimistic - BOOLEAN
+	If enabled, do not classify optimistic addresses as deprecated during
+	source address selection.  Preferred addresses will still be chosen
+	before optimistic addresses, subject to other ranking in the source
+	address selection algorithm.
+		0: disabled (default)
+		1: enabled
+
 icmp/*:
 ratelimit - INTEGER
 	Limit the maximal rates for sending ICMPv6 packets.
diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
index ff56053..7121a2e 100644
--- a/include/linux/ipv6.h
+++ b/include/linux/ipv6.h
@@ -42,6 +42,7 @@ struct ipv6_devconf {
 	__s32		accept_ra_from_local;
 #ifdef CONFIG_IPV6_OPTIMISTIC_DAD
 	__s32		optimistic_dad;
+	__s32		use_optimistic;
 #endif
 #ifdef CONFIG_IPV6_MROUTE
 	__s32		mc_forwarding;
diff --git a/include/uapi/linux/ipv6.h b/include/uapi/linux/ipv6.h
index efa2666..e863d08 100644
--- a/include/uapi/linux/ipv6.h
+++ b/include/uapi/linux/ipv6.h
@@ -164,6 +164,7 @@ enum {
 	DEVCONF_MLDV2_UNSOLICITED_REPORT_INTERVAL,
 	DEVCONF_SUPPRESS_FRAG_NDISC,
 	DEVCONF_ACCEPT_RA_FROM_LOCAL,
+	DEVCONF_USE_OPTIMISTIC,
 	DEVCONF_MAX
 };
 
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 725c763..879d7866 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -1170,6 +1170,9 @@ enum {
 	IPV6_SADDR_RULE_PRIVACY,
 	IPV6_SADDR_RULE_ORCHID,
 	IPV6_SADDR_RULE_PREFIX,
+#ifdef CONFIG_IPV6_OPTIMISTIC_DAD
+	IPV6_SADDR_RULE_NOT_OPTIMISTIC,
+#endif
 	IPV6_SADDR_RULE_MAX
 };
 
@@ -1197,6 +1200,15 @@ static inline int ipv6_saddr_preferred(int type)
 	return 0;
 }
 
+static inline bool ipv6_use_optimistic_addr(struct inet6_dev *idev)
+{
+#ifdef CONFIG_IPV6_OPTIMISTIC_DAD
+	return idev && idev->cnf.optimistic_dad && idev->cnf.use_optimistic;
+#else
+	return false;
+#endif
+}
+
 static int ipv6_get_saddr_eval(struct net *net,
 			       struct ipv6_saddr_score *score,
 			       struct ipv6_saddr_dst *dst,
@@ -1257,10 +1269,16 @@ static int ipv6_get_saddr_eval(struct net *net,
 		score->scopedist = ret;
 		break;
 	case IPV6_SADDR_RULE_PREFERRED:
+	    {
 		/* Rule 3: Avoid deprecated and optimistic addresses */
+		u8 avoid = IFA_F_DEPRECATED;
+
+		if (!ipv6_use_optimistic_addr(score->ifa->idev))
+			avoid |= IFA_F_OPTIMISTIC;
 		ret = ipv6_saddr_preferred(score->addr_type) ||
-		      !(score->ifa->flags & (IFA_F_DEPRECATED|IFA_F_OPTIMISTIC));
+		      !(score->ifa->flags & avoid);
 		break;
+	    }
 #ifdef CONFIG_IPV6_MIP6
 	case IPV6_SADDR_RULE_HOA:
 	    {
@@ -1306,6 +1324,14 @@ static int ipv6_get_saddr_eval(struct net *net,
 			ret = score->ifa->prefix_len;
 		score->matchlen = ret;
 		break;
+#ifdef CONFIG_IPV6_OPTIMISTIC_DAD
+	case IPV6_SADDR_RULE_NOT_OPTIMISTIC:
+		/* Optimistic addresses still have lower precedence than other
+		 * preferred addresses.
+		 */
+		ret = !(score->ifa->flags & IFA_F_OPTIMISTIC);
+		break;
+#endif
 	default:
 		ret = 0;
 	}
@@ -3222,8 +3248,15 @@ static void addrconf_dad_begin(struct inet6_ifaddr *ifp)
 	 * Optimistic nodes can start receiving
 	 * Frames right away
 	 */
-	if (ifp->flags & IFA_F_OPTIMISTIC)
+	if (ifp->flags & IFA_F_OPTIMISTIC) {
 		ip6_ins_rt(ifp->rt);
+		if (ipv6_use_optimistic_addr(idev)) {
+			/* Because optimistic nodes can use this address,
+			 * notify listeners. If DAD fails, RTM_DELADDR is sent.
+			 */
+			ipv6_ifa_notify(RTM_NEWADDR, ifp);
+		}
+	}
 
 	addrconf_dad_kick(ifp);
 out:
@@ -3354,7 +3387,11 @@ static void addrconf_dad_completed(struct inet6_ifaddr *ifp)
 	 *	Configure the address for reception. Now it is valid.
 	 */
 
-	ipv6_ifa_notify(RTM_NEWADDR, ifp);
+	/* If optimistic DAD is in use, the notification was already sent
+	 * in addrconf_dad_begin().
+	 */
+	if (!ipv6_use_optimistic_addr(ifp->idev))
+		ipv6_ifa_notify(RTM_NEWADDR, ifp);
 
 	/* If added prefix is link local and we are prepared to process
 	   router advertisements, start sending router solicitations.
@@ -4330,6 +4367,7 @@ static inline void ipv6_store_devconf(struct ipv6_devconf *cnf,
 	array[DEVCONF_ACCEPT_SOURCE_ROUTE] = cnf->accept_source_route;
 #ifdef CONFIG_IPV6_OPTIMISTIC_DAD
 	array[DEVCONF_OPTIMISTIC_DAD] = cnf->optimistic_dad;
+	array[DEVCONF_USE_OPTIMISTIC] = cnf->use_optimistic;
 #endif
 #ifdef CONFIG_IPV6_MROUTE
 	array[DEVCONF_MC_FORWARDING] = cnf->mc_forwarding;
@@ -5155,6 +5193,14 @@ static struct addrconf_sysctl_table
 			.proc_handler   = proc_dointvec,
 
 		},
+		{
+			.procname       = "use_optimistic",
+			.data           = &ipv6_devconf.use_optimistic,
+			.maxlen         = sizeof(int),
+			.mode           = 0644,
+			.proc_handler   = proc_dointvec,
+
+		},
 #endif
 #ifdef CONFIG_IPV6_MROUTE
 		{
-- 
2.1.0.rc2.206.gedb03e5

^ permalink raw reply related

* Re: [PATCH 1/3] xen-netback: make feature-rx-notify mandatory
From: Wei Liu @ 2014-10-23 11:16 UTC (permalink / raw)
  To: David Vrabel; +Cc: netdev, xen-devel, Ian Campbell, Wei Liu
In-Reply-To: <1413983335-8307-2-git-send-email-david.vrabel@citrix.com>

On Wed, Oct 22, 2014 at 02:08:53PM +0100, David Vrabel wrote:
> Frontends that do not provide feature-rx-notify may stall because
> netback depends on the notification from frontend to wake the guest Rx
> thread (even if can_queue is false).
> 
> This could be fixed but feature-rx-notify was introduced in 2006 and I
> am not aware of any frontends that do not implement this.
> 
> Signed-off-by: David Vrabel <david.vrabel@citrix.com>

Acked-by: Wei Liu <wei.liu2@citrix.com>

^ permalink raw reply

* Re: [PATCH] igb: don't reuse pages with pfmemalloc flag
From: Roman Gushchin @ 2014-10-23 11:21 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: jeffrey.t.kirsher@intel.com, jesse.brandeburg@intel.com,
	bruce.w.allan@intel.com, carolyn.wyborny@intel.com,
	donald.c.skidmore@intel.com, gregory.v.rose@intel.com,
	peter.p.waskiewicz.jr@intel.com, alexander.h.duyck@intel.com,
	john.ronciak@intel.com, tushar.n.dave@intel.com,
	davem@davemloft.net, sassmann@kpanic.de,
	gregkh@linuxfoundation.org, e1000-devel@lists.sourceforge.net,
	netdev@vger.kernel.org, "linux-kernel@vger.kernel.or
In-Reply-To: <1413992737.9031.1.camel@edumazet-glaptop2.roam.corp.google.com>

>
> Interesting...
>
> It seems we also need to clear skb->pfmemalloc in napi_reuse_skb()

Sounds reasonable, but are you sure, that we can just drop skb->pfmemalloc flag in napi_reuse_skb()?

>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply

* Re: [PATCH 1/3] xen-netback: make feature-rx-notify mandatory
From: Wei Liu @ 2014-10-23 11:32 UTC (permalink / raw)
  To: David Vrabel; +Cc: netdev, xen-devel, Ian Campbell, Wei Liu
In-Reply-To: <20141023111629.GE9188@zion.uk.xensource.com>

On Thu, Oct 23, 2014 at 12:16:29PM +0100, Wei Liu wrote:
> On Wed, Oct 22, 2014 at 02:08:53PM +0100, David Vrabel wrote:
> > Frontends that do not provide feature-rx-notify may stall because
> > netback depends on the notification from frontend to wake the guest Rx
> > thread (even if can_queue is false).
> > 
> > This could be fixed but feature-rx-notify was introduced in 2006 and I
> > am not aware of any frontends that do not implement this.
> > 
> > Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> 
> Acked-by: Wei Liu <wei.liu2@citrix.com>

While I can understand this patch by itself, can you elaborate a little
bit on how it affects later patches? Because what I'm thinking is that
this patch is not for stable while other two should go to stable.

Wei.

^ permalink raw reply

* Re: [PATCH 1/3] xen-netback: make feature-rx-notify mandatory
From: David Vrabel @ 2014-10-23 11:37 UTC (permalink / raw)
  To: Wei Liu; +Cc: netdev, xen-devel, Ian Campbell
In-Reply-To: <20141023113206.GF9188@zion.uk.xensource.com>

On 23/10/14 12:32, Wei Liu wrote:
> On Thu, Oct 23, 2014 at 12:16:29PM +0100, Wei Liu wrote:
>> On Wed, Oct 22, 2014 at 02:08:53PM +0100, David Vrabel wrote:
>>> Frontends that do not provide feature-rx-notify may stall because
>>> netback depends on the notification from frontend to wake the guest Rx
>>> thread (even if can_queue is false).
>>>
>>> This could be fixed but feature-rx-notify was introduced in 2006 and I
>>> am not aware of any frontends that do not implement this.
>>>
>>> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
>>
>> Acked-by: Wei Liu <wei.liu2@citrix.com>
> 
> While I can understand this patch by itself, can you elaborate a little
> bit on how it affects later patches? Because what I'm thinking is that
> this patch is not for stable while other two should go to stable.

>From the cover letter:

"The first patch is a prerequite.  Removing support for frontends with
feature-rx-notify makes it easier to reason about the correctness of
netback since it no longer has to support this outdated and broken
mode."

The other patches do not meet the stable kernel requirements (they're
too long one thing).

David

^ permalink raw reply

* Re: [PATCH 2/3] xen-netback: fix unlimited guest Rx internal queue and carrier flapping
From: Wei Liu @ 2014-10-23 11:40 UTC (permalink / raw)
  To: David Vrabel; +Cc: netdev, xen-devel, Ian Campbell, Wei Liu
In-Reply-To: <1413983335-8307-3-git-send-email-david.vrabel@citrix.com>

On Wed, Oct 22, 2014 at 02:08:54PM +0100, David Vrabel wrote:
> Netback needs to discard old to-guest skb's (guest Rx queue drain) and
> it needs detect guest Rx stalls (to disable the carrier so packets are
> discarded earlier), but the current implementation is very broken.
> 
> 1. The check in hard_start_xmit of the slot availability did not
>    consider the number of packets that were already in the guest Rx
>    queue.  This could allow the queue to grow without bound.
> 
>    The guest stops consuming packets and the ring was allowed to fill
>    leaving S slot free.  Netback queues a packet requiring more than S
>    slots (ensuring that the ring stays with S slots free).  Netback
>    queue indefinately packets provided that then require S or fewer
>    slots.
> 
> 2. The Rx stall detection is not triggered in this case since the
>    (host) Tx queue is not stopped.
> 
> 3. If the Tx queue is stopped and a guest Rx interrupt occurs, netback
>    will consider this an Rx purge event which may result in it taking
>    the carrier down unnecessarily.  It also considers a queue with
>    only 1 slot free as unstalled (even though the next packet might
>    not fit in this).
> 
> The internal guest Rx queue is limited by a byte length (to 512 Kib,
> enough for half the ring).  The (host) Tx queue is stopped and started
> based on this limit.  This sets an upper bound on the amount of memory
> used by packets on the internal queue.
> 
> This allows the estimatation of the number of slots for an skb to be
> removed (it wasn't a very good estimate anyway).  Instead, the guest

+1 for this.

> Rx thread just waits for enough free slots for a maximum sized packet.
> 
> skbs queued on the internal queue have an 'expires' time (set to the
> current time plus the drain timeout).  The guest Rx thread will detect
> when the skb at the head of the queue has expired and discard expired
> skbs.  This sets a clear upper bound on the length of time an skb can
> be queued for.  For a guest being destroyed the maximum time needed to
> wait for all the packets it sent to be dropped is still the drain
> timeout (10 s) since it will not be sending new packets.
> 
> Rx stall detection is reintroduced in a later commit.
> 
> Signed-off-by: David Vrabel <david.vrabel@citrix.com>

Reviewed-by: Wei Liu <wei.liu2@citrix.com>

^ permalink raw reply

* Re: [PATCH 1/3] xen-netback: make feature-rx-notify mandatory
From: Wei Liu @ 2014-10-23 11:44 UTC (permalink / raw)
  To: David Vrabel; +Cc: Wei Liu, netdev, xen-devel, Ian Campbell
In-Reply-To: <5448E892.9010307@citrix.com>

On Thu, Oct 23, 2014 at 12:37:54PM +0100, David Vrabel wrote:
> On 23/10/14 12:32, Wei Liu wrote:
> > On Thu, Oct 23, 2014 at 12:16:29PM +0100, Wei Liu wrote:
> >> On Wed, Oct 22, 2014 at 02:08:53PM +0100, David Vrabel wrote:
> >>> Frontends that do not provide feature-rx-notify may stall because
> >>> netback depends on the notification from frontend to wake the guest Rx
> >>> thread (even if can_queue is false).
> >>>
> >>> This could be fixed but feature-rx-notify was introduced in 2006 and I
> >>> am not aware of any frontends that do not implement this.
> >>>
> >>> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> >>
> >> Acked-by: Wei Liu <wei.liu2@citrix.com>
> > 
> > While I can understand this patch by itself, can you elaborate a little
> > bit on how it affects later patches? Because what I'm thinking is that
> > this patch is not for stable while other two should go to stable.
> 
> >From the cover letter:
> 
> "The first patch is a prerequite.  Removing support for frontends with
> feature-rx-notify makes it easier to reason about the correctness of
> netback since it no longer has to support this outdated and broken
> mode."
> 

I saw that.

I think you should make it a little bit clearer.

The queue is not guaranteed to stop if we keep this feature.

> The other patches do not meet the stable kernel requirements (they're
> too long one thing).
> 

Does length matter? I surely had written long patch for stable.

Wei.

> David

^ permalink raw reply

* Re: [PATCH 3/3] xen-netback: reintroduce guest Rx stall detection
From: Wei Liu @ 2014-10-23 11:49 UTC (permalink / raw)
  To: David Vrabel; +Cc: netdev, xen-devel, Ian Campbell, Wei Liu
In-Reply-To: <1413983335-8307-4-git-send-email-david.vrabel@citrix.com>

On Wed, Oct 22, 2014 at 02:08:55PM +0100, David Vrabel wrote:
> If a frontend not receiving packets it is useful to detect this and
> turn off the carrier so packets are dropped early instead of being
> queued and drained when they expire.
> 
> A to-guest queue is stalled if it doesn't have enough free slots for a
> an extended period of time (default 60 s).
> 
> If at least one queue is stalled, the carrier is turned off (in the
> expectation that the other queues will soon stall as well).  The
> carrier is only turned on once all queues are ready.
> 
> When the frontend connects, all the queues start in the stalled state
> and only become ready once the frontend queues enough Rx requests.
> 
> Signed-off-by: David Vrabel <david.vrabel@citrix.com>

Reviewed-by: Wei Liu <wei.liu2@citrix.com>

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox