Netdev List
 help / color / mirror / Atom feed
* Re: no reassembly for outgoing packets on RAW socket
From: Jan Engelhardt @ 2010-06-09 15:15 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Jiri Olsa, netdev, Netfilter Developer Mailing List
In-Reply-To: <4C0FA24A.7060907@trash.net>


On Wednesday 2010-06-09 16:16, Patrick McHardy wrote:
>>>> I'd like to be able to sendout a single IP packet with MF flag set.
>>>>
>>>> When using RAW sockets the packet will get stuck in the
>>>> netfilter (NF_INET_LOCAL_OUT nf_defrag_ipv4 reassembly unit)
>>>> and wont ever make it out..
>>>>
>>>> I made a change which bypass the outgoing reassembly for
>>>> RAW sockets, but I'm not sure wether it's too invasive..
>>>>       
>>> That would break reassembly (and thus connection tracking) for cases
>>> where its really intended.
>>>     
>>>> Is there any standard for RAW sockets behaviour?
>>>> Or another way around? :)
>>>>       
>>> You could use the NOTRACK target to bypass connection tracking.
>>
>> I tried the NOTRACK target, but the packet is still going
>> throught reassembly, because the RAW filter has lower priority
>> then the connection track defragmentation..
>
>Right.

Blech. That reminds me of
http://marc.info/?l=netfilter-devel&m=126581823826735&w=2

^ permalink raw reply

* Re: [PATCH v3] netfilter: Xtables: idletimer target implementation
From: Luciano Coelho @ 2010-06-09 15:11 UTC (permalink / raw)
  To: ext Patrick McHardy
  Cc: netfilter-devel@vger.kernel.org, netdev@vger.kernel.org,
	jengelh@medozas.de, Timo Teras
In-Reply-To: <4C0F9B17.10504@trash.net>

On Wed, 2010-06-09 at 15:45 +0200, ext Patrick McHardy wrote:
> luciano.coelho@nokia.com wrote:
> > From: Luciano Coelho <luciano.coelho@nokia.com>
> >
> > This patch implements an idletimer Xtables target that can be used to
> > identify when interfaces have been idle for a certain period of time.
> >
> > Timers are identified by labels and are created when a rule is set with a new
> > label.  The rules also take a timeout value (in seconds) as an option.  If
> > more than one rule uses the same timer label, the timer will be restarted
> > whenever any of the rules get a hit.
> >
> > One entry for each timer is created in sysfs.  This attribute contains the
> > timer remaining for the timer to expire.  The attributes are located under
> > the xt_idletimer class:
> >
> > /sys/class/xt_idletimer/timers/<label>
> >
> > When the timer expires, the target module sends a sysfs notification to the
> > userspace, which can then decide what to do (eg. disconnect to save power).
> >   
> 
> Basically this seems fine to me, some minor comments below.

Thanks a lot for reviewing this again.  My replies below.


> > +++ b/include/linux/netfilter/xt_IDLETIMER.h
> > @@ -0,0 +1,40 @@
> > +#ifndef _XT_IDLETIMER_H
> > +#define _XT_IDLETIMER_H
> > +
> > +#define MAX_LABEL_SIZE 32
> >   
> 
> This seems a bit generic, maybe better use MAX_IDLETIMER_LABER_SIZE.

True, I'll change it.


> > +struct idletimer_tg {
> > +	struct list_head entry;
> > +	struct timer_list timer;
> > +	struct work_struct work;
> > +
> > +	struct kobject *kobj;
> > +	struct idletimer_tg_attr *attr;
> > +
> > +	unsigned int refcnt;
> > +};
> > +
> > +struct idletimer_tg_attr {
> > +	struct attribute attr;
> > +	ssize_t	(*show)(struct kobject *kobj,
> > +			struct attribute *attr, char *buf);
> > +};
> > +
> > +static LIST_HEAD(idletimer_tg_list);
> >   
> 
> How does this work with multiple namespaces? It seems every namespace
> can bind to any timer.

I was implementing this solution for multiple namespaces (see the
previous versions of my patch), but the code started getting really
complicated.  Then I found out that sysfs and multiple namespaces are
not working very well together yet and decided to drop it for the time
being.  Of course this doesn't matter anymore, since the timers belong
to an independent class in sysfs, so I can easily add multiple namespace
support by adding struct net *net as part of the list key together with
the label.

Do you think it's okay to leave it like this for now and extend it for
multiple namespace support with a future patch?


> > +static ssize_t idletimer_tg_show(struct kobject *kobj, struct attribute *attr,
> > +				 char *buf)
> > +{
> > +	struct idletimer_tg *timer;
> > +	unsigned long expires = 0;
> > +
> > +	spin_lock_bh(&list_lock);
> > +	timer =	__idletimer_tg_find_by_label(attr->name);
> > +	if (timer)
> > +		expires = timer->timer.expires;
> > +	spin_unlock_bh(&list_lock);
> > +
> > +	if (expires > jiffies)
> >   
> 
> time_after()?

Of course!


> > +		return sprintf(buf, "%u\n",
> > +			       jiffies_to_msecs(expires - jiffies) / 1000);
> > +
> > +	return sprintf(buf, "0\n");
> > +}
> > +
> > +static void idletimer_tg_delete(const struct idletimer_tg_info *info)
> > +{
> >   
> 
> The only caller is the target cleanup function, why don't you just move
> everything there?

Good point.  It probably remained as a separate function as vestigial
code.


> > +static
> > +struct idletimer_tg *idletimer_tg_create(const struct idletimer_tg_info *info)
> > +{
> > +	struct idletimer_tg *timer;
> > +	struct idletimer_tg_attr *attr;
> > +
> > +	attr = kzalloc(sizeof(attr), GFP_KERNEL);
> >   
> 
> sizeof(*attr)

Ugh! Nice catch, I'll fix it.


> > +	if (!attr) {
> > +		pr_debug("couldn't alloc attribute\n");
> > +		return NULL;
> > +	}
> > +
> > +	attr->attr.name = kstrdup(info->label, GFP_KERNEL);
> > +	if (!attr->attr.name) {
> > +		pr_debug("couldn't alloc attribute name\n");
> > +		goto out_free_attr;
> > +	}
> > +	attr->attr.mode = S_IRUGO;
> > +	attr->show = idletimer_tg_show;
> > +
> > +	if (sysfs_create_file(idletimer_tg_kobj, &attr->attr)) {
> > +		pr_debug("couldn't add attr to sysfs\n");
> > +		goto out_free_name;
> > +	}
> > +
> > +	timer = kmalloc(sizeof(struct idletimer_tg), GFP_KERNEL);

I will also change this to kmalloc(sizeof(*timer), GFP_KERNEL) for
consistency.


> > +	if (!timer) {
> > +		pr_debug("couldn't alloc timer\n");
> > +		goto out_free_file;
> > +	}
> > +
> > +	spin_lock_bh(&list_lock);
> > +	list_add(&timer->entry, &idletimer_tg_list);
> > +
> > +	init_timer(&timer->timer);
> > +	setup_timer(&timer->timer, idletimer_tg_expired, (unsigned long) timer);
> > +	mod_timer(&timer->timer,
> > +		  msecs_to_jiffies(info->timeout * 1000) + jiffies);
> > +
> > +	timer->attr = attr;
> > +	timer->refcnt = 0;
> >   
> 
> Better fully set up the timer before activating it.

Ok.  Also, I don't need init_timer() here, since setup_timer() already
calls it.


> > +
> > +	INIT_WORK(&timer->work, idletimer_tg_work);
> > +	spin_unlock_bh(&list_lock);
> > +
> > +	return timer;
> > +
> > +out_free_file:
> > +	sysfs_remove_file(idletimer_tg_kobj, &attr->attr);
> > +out_free_name:
> > +	kfree(attr->attr.name);
> > +out_free_attr:
> > +	kfree(attr);
> > +	return NULL;
> > +}
> > +
> > +static void idletimer_tg_cleanup(void)
> > +{
> > +	struct idletimer_tg *timer;
> > +
> > +	spin_lock(&list_lock);
> > +	list_for_each_entry(timer, &idletimer_tg_list, entry) {
> >   
> 
> list_for_each_entry_safe(). This function seems unnecessary though, the
> module
> can't be unloaded while its still in use and cleanup will already happen
> when the
> rules are removed.

Right.  I'll remove the cleanup function, since it's unnecessary.


> > +static unsigned int idletimer_tg_target(struct sk_buff *skb,
> > +					 const struct xt_action_param *par)
> > +{
> > +	const struct idletimer_tg_info *info = par->targinfo;
> > +	struct idletimer_tg *timer;
> > +
> > +	pr_debug("resetting timer %s, timeout period %u\n",
> > +		 info->label, info->timeout);
> > +
> > +	spin_lock(&list_lock);
> >   
> 
> You need BH protection here as well for the output path.

True.  I'll add it.


> > +static int idletimer_tg_checkentry(const struct xt_tgchk_param *par)
> > +{
> > +	const struct idletimer_tg_info *info = par->targinfo;
> > +	struct idletimer_tg *timer;
> > +
> > +	pr_debug("checkentry targinfo %s\n", info->label);
> > +
> > +	if (info->timeout == 0) {
> > +		pr_debug("timeout value is zero\n");
> > +		return -EINVAL;
> > +	}
> > +
> > +	if (!info->label || strlen(info->label) == 0) {
> >   
> 
> !info->label is impossible. You should check for \0 termination instead.

Ooops! <blush> I'll fix it.


> 
> > +		pr_debug("label is missing\n");
> > +		return -EINVAL;
> > +	}
> > +
> > +	spin_lock(&list_lock);
> >   
> spin_lock_bh()

Yes, I'll add BH here too.


> > +	timer = __idletimer_tg_find_by_label(info->label);
> > +	if (!timer) {
> > +		spin_unlock(&list_lock);
> > +		timer = idletimer_tg_create(info);
> >   
> 
> How does this prevent creating the same timer twice?

The timer will only be created if __idletimer_tg_find_by_label() returns
NULL, which means that no timer with that label has been found.  "info"
won't be the same if info->label is different, right? Or can it change
on the fly?

I'll submit v4 with these changes later this evening.


-- 
Cheers,
Luca.


^ permalink raw reply

* Re: 2.6.35-rc2-git1 - include/linux/cgroup.h:534 invoked rcu_dereference_check() without protection!
From: Miles Lane @ 2010-06-09 15:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: paulmck, Vivek Goyal, Eric Paris, Lai Jiangshan, Ingo Molnar,
	LKML, nauman, eric.dumazet, netdev, Jens Axboe, Gui Jianfeng,
	Li Zefan, Johannes Berg
In-Reply-To: <1276004075.2987.208.camel@twins>

On Tue, Jun 8, 2010 at 9:34 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, 2010-06-08 at 09:14 -0400, Miles Lane wrote:
>
>>   CC      kernel/sched.o
>> kernel/sched.c: In function ‘task_group’:
>> kernel/sched.c:321: error: implicit declaration of function ‘task_rq’
>> kernel/sched.c:321: error: invalid type argument of ‘->’ (have ‘int’)
>> make[1]: *** [kernel/sched.o] Error 1
>>
>> I had to apply with fuzz.  Did it mess up?
>
>
> No, I probably did.. task_rq() is defined on line 636 or thereabouts,
> and this function landed around line 320.
>
> Ahh, and it compiled here because I have CGROUP_SCHED=y, but
> PROVE_RCU=n, so that whole check expression disappears and is never
> evaluated...
>
> /me fixes
>
> ---
> Subject: sched: PROVE_RCU vs cpu_cgroup
> From: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Date: Tue Jun 08 11:40:42 CEST 2010
>
> PROVE_RCU has a few issues with the cpu_cgroup because the scheduler
> typically holds rq->lock around the css rcu derefs but the generic
> cgroup code doesn't (and can't) know about that lock.
>
> Provide means to add extra checks to the css dereference and use that
> in the scheduler to annotate its users.
>
> The addition of rq->lock to these checks is correct because the
> cgroup_subsys::attach() method takes the rq->lock for each task it
> moves, therefore by holding that lock, we ensure the task is pinned to
> the current cgroup and the RCU dereference is valid.
>
> That leaves one genuine race in __sched_setscheduler() where we used
> task_group() without holding any of the required locks and thus raced
> with the cgroup code. Solve this by moving the check under the rq->lock.
>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  include/linux/cgroup.h |   20 +++++---
>  kernel/sched.c         |  115 +++++++++++++++++++++++++------------------------
>  2 files changed, 73 insertions(+), 62 deletions(-)
>
> Index: linux-2.6/include/linux/cgroup.h
> ===================================================================
> --- linux-2.6.orig/include/linux/cgroup.h
> +++ linux-2.6/include/linux/cgroup.h
> @@ -525,13 +525,21 @@ static inline struct cgroup_subsys_state
>        return cgrp->subsys[subsys_id];
>  }
>
> -static inline struct cgroup_subsys_state *task_subsys_state(
> -       struct task_struct *task, int subsys_id)
> +/*
> + * function to get the cgroup_subsys_state which allows for extra
> + * rcu_dereference_check() conditions, such as locks used during the
> + * cgroup_subsys::attach() methods.
> + */
> +#define task_subsys_state_check(task, subsys_id, __c)                  \
> +       rcu_dereference_check(task->cgroups->subsys[subsys_id],         \
> +                             rcu_read_lock_held() ||                   \
> +                             lockdep_is_held(&task->alloc_lock) ||     \
> +                             cgroup_lock_is_held() || (__c))
> +
> +static inline struct cgroup_subsys_state *
> +task_subsys_state(struct task_struct *task, int subsys_id)
>  {
> -       return rcu_dereference_check(task->cgroups->subsys[subsys_id],
> -                                    rcu_read_lock_held() ||
> -                                    lockdep_is_held(&task->alloc_lock) ||
> -                                    cgroup_lock_is_held());
> +       return task_subsys_state_check(task, subsys_id, false);
>  }
>
>  static inline struct cgroup* task_cgroup(struct task_struct *task,
> Index: linux-2.6/kernel/sched.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched.c
> +++ linux-2.6/kernel/sched.c
> @@ -306,52 +306,6 @@ static int init_task_group_load = INIT_T
>  */
>  struct task_group init_task_group;
>
> -/* return group to which a task belongs */
> -static inline struct task_group *task_group(struct task_struct *p)
> -{
> -       struct task_group *tg;
> -
> -#ifdef CONFIG_CGROUP_SCHED
> -       tg = container_of(task_subsys_state(p, cpu_cgroup_subsys_id),
> -                               struct task_group, css);
> -#else
> -       tg = &init_task_group;
> -#endif
> -       return tg;
> -}
> -
> -/* Change a task's cfs_rq and parent entity if it moves across CPUs/groups */
> -static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
> -{
> -       /*
> -        * Strictly speaking this rcu_read_lock() is not needed since the
> -        * task_group is tied to the cgroup, which in turn can never go away
> -        * as long as there are tasks attached to it.
> -        *
> -        * However since task_group() uses task_subsys_state() which is an
> -        * rcu_dereference() user, this quiets CONFIG_PROVE_RCU.
> -        */
> -       rcu_read_lock();
> -#ifdef CONFIG_FAIR_GROUP_SCHED
> -       p->se.cfs_rq = task_group(p)->cfs_rq[cpu];
> -       p->se.parent = task_group(p)->se[cpu];
> -#endif
> -
> -#ifdef CONFIG_RT_GROUP_SCHED
> -       p->rt.rt_rq  = task_group(p)->rt_rq[cpu];
> -       p->rt.parent = task_group(p)->rt_se[cpu];
> -#endif
> -       rcu_read_unlock();
> -}
> -
> -#else
> -
> -static inline void set_task_rq(struct task_struct *p, unsigned int cpu) { }
> -static inline struct task_group *task_group(struct task_struct *p)
> -{
> -       return NULL;
> -}
> -
>  #endif /* CONFIG_CGROUP_SCHED */
>
>  /* CFS-related fields in a runqueue */
> @@ -644,6 +598,49 @@ static inline int cpu_of(struct rq *rq)
>  #define cpu_curr(cpu)          (cpu_rq(cpu)->curr)
>  #define raw_rq()               (&__raw_get_cpu_var(runqueues))
>
> +#ifdef CONFIG_CGROUP_SCHED
> +
> +/*
> + * Return the group to which this tasks belongs.
> + *
> + * We use task_subsys_state_check() and extend the RCU verification
> + * with lockdep_is_held(&task_rq(p)->lock) because cpu_cgroup_attach()
> + * holds that lock for each task it moves into the cgroup. Therefore
> + * by holding that lock, we pin the task to the current cgroup.
> + */
> +static inline struct task_group *task_group(struct task_struct *p)
> +{
> +       struct cgroup_subsys_state *css;
> +
> +       css = task_subsys_state_check(p, cpu_cgroup_subsys_id,
> +                       lockdep_is_held(&task_rq(p)->lock));
> +       return container_of(css, struct task_group, css);
> +}
> +
> +/* Change a task's cfs_rq and parent entity if it moves across CPUs/groups */
> +static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
> +{
> +#ifdef CONFIG_FAIR_GROUP_SCHED
> +       p->se.cfs_rq = task_group(p)->cfs_rq[cpu];
> +       p->se.parent = task_group(p)->se[cpu];
> +#endif
> +
> +#ifdef CONFIG_RT_GROUP_SCHED
> +       p->rt.rt_rq  = task_group(p)->rt_rq[cpu];
> +       p->rt.parent = task_group(p)->rt_se[cpu];
> +#endif
> +}
> +
> +#else /* CONFIG_CGROUP_SCHED */
> +
> +static inline void set_task_rq(struct task_struct *p, unsigned int cpu) { }
> +static inline struct task_group *task_group(struct task_struct *p)
> +{
> +       return NULL;
> +}
> +
> +#endif /* CONFIG_CGROUP_SCHED */
> +
>  inline void update_rq_clock(struct rq *rq)
>  {
>        if (!rq->skip_clock_update)
> @@ -4465,16 +4462,6 @@ recheck:
>        }
>
>        if (user) {
> -#ifdef CONFIG_RT_GROUP_SCHED
> -               /*
> -                * Do not allow realtime tasks into groups that have no runtime
> -                * assigned.
> -                */
> -               if (rt_bandwidth_enabled() && rt_policy(policy) &&
> -                               task_group(p)->rt_bandwidth.rt_runtime == 0)
> -                       return -EPERM;
> -#endif
> -
>                retval = security_task_setscheduler(p, policy, param);
>                if (retval)
>                        return retval;
> @@ -4490,6 +4477,22 @@ recheck:
>         * runqueue lock must be held.
>         */
>        rq = __task_rq_lock(p);
> +
> +#ifdef CONFIG_RT_GROUP_SCHED
> +       if (user) {
> +               /*
> +                * Do not allow realtime tasks into groups that have no runtime
> +                * assigned.
> +                */
> +               if (rt_bandwidth_enabled() && rt_policy(policy) &&
> +                               task_group(p)->rt_bandwidth.rt_runtime == 0) {
> +                       __task_rq_unlock(rq);
> +                       raw_spin_unlock_irqrestore(&p->pi_lock, flags);
> +                       return -EPERM;
> +               }
> +       }
> +#endif
> +
>        /* recheck policy now with rq lock held */
>        if (unlikely(oldpolicy != -1 && oldpolicy != p->policy)) {
>                policy = oldpolicy = -1;
>
>

Sorry.  I misunderstood this message when I first read it.  I didn't
realize this message include a new version of the patch.
Anyhow, I just tried to apply the patch to 2.6.35-rc2-git3 and got this:

# patch -p1 -l -F 20 --dry-run < ../5.patch
patching file include/linux/cgroup.h
patching file kernel/sched.c
Hunk #1 succeeded at 306 with fuzz 1.
Hunk #3 FAILED at 4462.
Hunk #4 succeeded at 4487 with fuzz 3.
1 out of 4 hunks FAILED -- saving rejects to file kernel/sched.c.rej

^ permalink raw reply

* Re: [PATCH 1/2] Export firmware assigned labels of network devices to sysfs
From: Greg KH @ 2010-06-09 15:02 UTC (permalink / raw)
  To: Matt Domsch
  Cc: K, Narendra, netdev@vger.kernel.org,
	linux-hotplug@vger.kernel.org, linux-pci@vger.kernel.org,
	Hargrave, Jordan, Rose, Charles, Nijhawan, Vijay
In-Reply-To: <20100609041709.GA7280@auslistsprd01.us.dell.com>

On Tue, Jun 08, 2010 at 11:17:09PM -0500, Matt Domsch wrote:
> On Fri, May 28, 2010 at 11:51:40PM -0500, Domsch, Matt wrote:
> > On Fri, May 28, 2010 at 05:27:45PM -0500, Greg KH wrote:
> > > Care to post that ECN publically?  And no, the Linux Foundation does not
> > > have a PCI-SIG membership, the PCI-SIG keeps forbidding it.  Other
> > > operating systems are allowed to join but not Linux.  Strange but
> > > true...
> > 
> > I'm looking into it, and should know more next week.
> 
> I'm advised that I cannot post the ECN publically, due to it being an
> in-progress work item of a SIG working group, and therefore falls
> under the confidentiality rules that SIG members agree to.  Members of
> the PCI SIG have access, which unfortunately is not everyone.

Then we can't properly review this, sorry.  How about waiting until the
ECN is finalized?  Then we could review and possibly accept this.

thanks,

greg k-h

^ permalink raw reply

* Re: [ath5k-devel] [PATCH] [ath5k][leds] Ability to disable leds support. If leds support enabled do not force mac802.11 leds layer selection.
From: Dmytro Milinevskyy @ 2010-06-09 14:43 UTC (permalink / raw)
  To: Bob Copeland
  Cc: ath5k-devel, Kalle Valo, linux-wireless, GeunSik Lim, Jiri Slaby,
	Greg Kroah-Hartman, John W. Linville, Keng-Yu Lin, netdev,
	Jiri Kosina, Shahar Or, linux-kernel, Luca Verdesca
In-Reply-To: <AANLkTilrYYHjcMEhhrP7zjcLoo-TNczozxJSNRkDvSZy@mail.gmail.com>

Hi, Bob.

For instance I don't use 802.11 leds layer and trigger leds from
userspace according to my purposes.

-- Dima

On Wed, Jun 9, 2010 at 4:58 PM, Bob Copeland <me@bobcopeland.com> wrote:
> On Wed, Jun 9, 2010 at 3:31 AM, Dmytro Milinevskyy
> <milinevskyy@gmail.com> wrote:
>> However if the leds support was enabled do not force selection of 802.11 leds layer.
>
> I don't understand this part.  What's the point of enabling software LEDs
> if nothing triggers them?
>
> Also, we can probably build hardware LEDs (hw_set_ledstate) in regardless.
> It's a pure register write and doesn't require the rest of the LED machinery.
>
> I assume you are doing this to reduce the size of the module.  If so, can
> you include size(1) output in the commit message?
>
> --
> Bob Copeland %% www.bobcopeland.com
>

^ permalink raw reply

* Re: 2.6.35-rc2-git2: Reported regressions from 2.6.34
From: Linus Torvalds @ 2010-06-09 14:24 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Carl Worth, Eric Anholt, Venkatesh Pallipadi, Jens Axboe,
	Dave Airlie, Jesse Barnes, Ingo Molnar, David  Härdeman,
	Mauro Carvalho Chehab, Eric Dumazet, Linux Kernel Mailing List,
	Maciej Rutecki, Andrew Morton, Kernel Testers List,
	Network Development, Linux ACPI, Linux PM List, Linux SCSI List,
	Linux Wireless List, DRI
In-Reply-To: <201006091106.11232.rjw@sisk.pl>



On Wed, 9 Jun 2010, Rafael J. Wysocki wrote:
> > 
> > That has a "reverting the commit fixes it", and a confirmation from Nick 
> > Bowler.
> > 
> > Eric, Carl: should I just revert that commit? Or do you have a fix?
> 
> This one is reported to have been fixed already.  Closed now.

Heh. That "fixed already" is actually the revert. Carl acked it.

> This should be fixed by commit f8ed8b4c5d30b5214f185997131b06e35f6f7113, so
> closing now.

Good, that was in yesterday's drm pull.

		Linus

^ permalink raw reply

* Re: no reassembly for outgoing packets on RAW socket
From: Patrick McHardy @ 2010-06-09 14:16 UTC (permalink / raw)
  To: Jiri Olsa; +Cc: netdev, Netfilter Developer Mailing List
In-Reply-To: <20100607145558.GA1939@jolsa.lab.eng.brq.redhat.com>

Jiri Olsa wrote:
> On Fri, Jun 04, 2010 at 02:03:17PM +0200, Patrick McHardy wrote:
>   
>> Jiri Olsa wrote:
>>     
>>> hi,
>>>
>>> I'd like to be able to sendout a single IP packet with MF flag set.
>>>
>>> When using RAW sockets the packet will get stuck in the
>>> netfilter (NF_INET_LOCAL_OUT nf_defrag_ipv4 reassembly unit)
>>> and wont ever make it out..
>>>
>>> I made a change which bypass the outgoing reassembly for
>>> RAW sockets, but I'm not sure wether it's too invasive..
>>>       
>> That would break reassembly (and thus connection tracking) for cases
>> where its really intended.
>>
>>     
>>> Is there any standard for RAW sockets behaviour?
>>> Or another way around? :)
>>>       
>> You could use the NOTRACK target to bypass connection tracking.
>>     
>
> ok,
>
> I tried the NOTRACK target, but the packet is still going
> throught reassembly, because the RAW filter has lower priority
> then the connection track defragmentation..
>   

Right.
> I was able to get it bypassed by attached patch and following
> command:
>
> 	iptables -v -t raw -A OUTPUT -p icmp -j NOTRACK
>
> again, not sure if this is too invasive ;)
>   

Well, we can't change it in the mainline kernel.
> If this is not the way, I'd appreciatte any hint..  my goal is
> to put malformed packet on the wire (more frags bit set for a
> non fragmented packet)

I don't have any good suggestions besides adding a flag to the IPCB
and skipping defragmentation based on that.

^ permalink raw reply

* Re: [PATCH] ipvs: Add missing locking during connection table hashing and unhashing
From: Patrick McHardy @ 2010-06-09 14:12 UTC (permalink / raw)
  To: Sven Wegener
  Cc: Julian Anastasov, Simon Horman, Wensong Zhang, netdev, lvs-devel
In-Reply-To: <alpine.LNX.2.00.1006080828290.18946@titan.stealer.net>

Sven Wegener wrote:
> The code that hashes and unhashes connections from the connection table
> is missing locking of the connection being modified, which opens up a
> race condition and results in memory corruption when this race condition
> is hit.
>
> Here is what happens in pretty verbose form:
>
> CPU 0					CPU 1
> ------------				------------
> An active connection is terminated and
> we schedule ip_vs_conn_expire() on this
> CPU to expire this connection.
>
> 					IRQ assignment is changed to this CPU,
> 					but the expire timer stays scheduled on
> 					the other CPU.
>
> 					New connection from same ip:port comes
> 					in right before the timer expires, we
> 					find the inactive connection in our
> 					connection table and get a reference to
> 					it. We proper lock the connection in
> 					tcp_state_transition() and read the
> 					connection flags in set_tcp_state().
>
> ip_vs_conn_expire() gets called, we
> unhash the connection from our
> connection table and remove the hashed
> flag in ip_vs_conn_unhash(), without
> proper locking!
>
> 					While still holding proper locks we
> 					write the connection flags in
> 					set_tcp_state() and this sets the hashed
> 					flag again.
>
> ip_vs_conn_expire() fails to expire the
> connection, because the other CPU has
> incremented the reference count. We try
> to re-insert the connection into our
> connection table, but this fails in
> ip_vs_conn_hash(), because the hashed
> flag has been set by the other CPU. We
> re-schedule execution of
> ip_vs_conn_expire(). Now this connection
> has the hashed flag set, but isn't
> actually hashed in our connection table
> and has a dangling list_head.
>
> 					We drop the reference we held on the
> 					connection and schedule the expire timer
> 					for timeouting the connection on this
> 					CPU. Further packets won't be able to
> 					find this connection in our connection
> 					table.
>
> 					ip_vs_conn_expire() gets called again,
> 					we think it's already hashed, but the
> 					list_head is dangling and while removing
> 					the connection from our connection table
> 					we write to the memory location where
> 					this list_head points to.
>
> The result will probably be a kernel oops at some other point in time.
>
> Signed-off-by: Sven Wegener <sven.wegener@stealer.net>
> Cc: stable@kernel.org
> Acked-by: Simon Horman <horms@verge.net.au>
> ---
>  net/netfilter/ipvs/ip_vs_conn.c |    4 ++++
>  1 files changed, 4 insertions(+), 0 deletions(-)
>
> This race condition is pretty subtle, but it can be triggered remotely.
> It needs the IRQ assignment change or another circumstance where packets
> coming from the same ip:port for the same service are being processed on
> different CPUs. And it involves hitting the exact time at which
> ip_vs_conn_expire() gets called. It can be avoided by making sure that
> all packets from one connection are always processed on the same CPU and
> can be made harder to exploit by changing the connection timeouts to
> some custom values.
>
>   

Applied, thanks.


^ permalink raw reply

* Re: [v2 Patch 1/2] s2io: add dynamic LRO disable support
From: Ben Hutchings @ 2010-06-09 14:00 UTC (permalink / raw)
  To: Amerigo Wang
  Cc: netdev, nhorman, sgruszka, herbert.xu, Ramkrishna.Vepa, davem
In-Reply-To: <20100609100928.6573.14199.sendpatchset@localhost.localdomain>

On Wed, 2010-06-09 at 06:05 -0400, Amerigo Wang wrote:
[...]
> +static int s2io_ethtool_set_flags(struct net_device *dev, u32 data)
> +{
> +	struct s2io_nic *sp = netdev_priv(dev);
> +	int rc = 0;
> +	int changed = 0;
> +
> +	if (data & ETH_FLAG_LRO) {
> +		if (lro_enable) {
> +			if (!(dev->features & NETIF_F_LRO)) {
> +				dev->features |= NETIF_F_LRO;
> +				changed = 1;
> +			}
> +		} else
> +			rc = -EINVAL;
> +	} else if (dev->features & NETIF_F_LRO) {
> +		dev->features &= ~NETIF_F_LRO;
> +		changed = 1;
> +	}
> +
> +	if (changed && netif_running(dev)) {
> +		s2io_stop_all_tx_queue(sp);
> +		s2io_card_down(sp);
> +		sp->lro = dev->features & NETIF_F_LRO;
> +		rc = s2io_card_up(sp);
> +		s2io_start_all_tx_queue(sp);
[...]

Is it safe to call s2io_start_all_tx_queue() if s2io_card_up() failed?

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* Re: [ath5k-devel] [PATCH] [ath5k][leds] Ability to disable leds support. If leds support enabled do not force mac802.11 leds layer selection.
From: Bob Copeland @ 2010-06-09 13:58 UTC (permalink / raw)
  To: Dmytro Milinevskyy
  Cc: ath5k-devel, Kalle Valo, linux-wireless, GeunSik Lim, Jiri Slaby,
	Greg Kroah-Hartman, John W. Linville, Keng-Yu Lin, netdev,
	Jiri Kosina, Shahar Or, linux-kernel, Luca Verdesca
In-Reply-To: <4c0f4326.0a41df0a.7bde.ffffeef2@mx.google.com>

On Wed, Jun 9, 2010 at 3:31 AM, Dmytro Milinevskyy
<milinevskyy@gmail.com> wrote:
> However if the leds support was enabled do not force selection of 802.11 leds layer.

I don't understand this part.  What's the point of enabling software LEDs
if nothing triggers them?

Also, we can probably build hardware LEDs (hw_set_ledstate) in regardless.
It's a pure register write and doesn't require the rest of the LED machinery.

I assume you are doing this to reduce the size of the module.  If so, can
you include size(1) output in the commit message?

-- 
Bob Copeland %% www.bobcopeland.com

^ permalink raw reply

* Re: [PATCH v3] netfilter: Xtables: idletimer target implementation
From: Patrick McHardy @ 2010-06-09 13:45 UTC (permalink / raw)
  To: luciano.coelho; +Cc: netfilter-devel, netdev, jengelh, Timo Teras
In-Reply-To: <1275592445-15555-1-git-send-email-luciano.coelho@nokia.com>

luciano.coelho@nokia.com wrote:
> From: Luciano Coelho <luciano.coelho@nokia.com>
>
> This patch implements an idletimer Xtables target that can be used to
> identify when interfaces have been idle for a certain period of time.
>
> Timers are identified by labels and are created when a rule is set with a new
> label.  The rules also take a timeout value (in seconds) as an option.  If
> more than one rule uses the same timer label, the timer will be restarted
> whenever any of the rules get a hit.
>
> One entry for each timer is created in sysfs.  This attribute contains the
> timer remaining for the timer to expire.  The attributes are located under
> the xt_idletimer class:
>
> /sys/class/xt_idletimer/timers/<label>
>
> When the timer expires, the target module sends a sysfs notification to the
> userspace, which can then decide what to do (eg. disconnect to save power).
>   

Basically this seems fine to me, some minor comments below.

> +++ b/include/linux/netfilter/xt_IDLETIMER.h
> @@ -0,0 +1,40 @@
> +#ifndef _XT_IDLETIMER_H
> +#define _XT_IDLETIMER_H
> +
> +#define MAX_LABEL_SIZE 32
>   

This seems a bit generic, maybe better use MAX_IDLETIMER_LABER_SIZE.

> +
> +struct idletimer_tg_info {
> +	__u32 timeout;
> +
> +	char label[MAX_LABEL_SIZE];
> +};
> +
> +#endif
> new file mode 100644
> index 0000000..65c195e
> --- /dev/null
> +++ b/net/netfilter/xt_IDLETIMER.c
> @@ -0,0 +1,356 @@
> +/*
> + * linux/net/netfilter/xt_IDLETIMER.c
> + *
> + * Netfilter module to trigger a timer when packet matches.
> + * After timer expires a kevent will be sent.
> + *
> + * Copyright (C) 2004, 2010 Nokia Corporation
> + * Written by Timo Teras <ext-timo.teras@nokia.com>
> + *
> + * Converted to x_tables and reworked for upstream inclusion
> + * by Luciano Coelho <luciano.coelho@nokia.com>
> + *
> + * Contact: Luciano Coelho <luciano.coelho@nokia.com>
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * version 2 as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful, but
> + * WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA
> + * 02110-1301 USA
> + */
> +
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +
> +#include <linux/module.h>
> +#include <linux/timer.h>
> +#include <linux/list.h>
> +#include <linux/spinlock.h>
> +#include <linux/netfilter.h>
> +#include <linux/netfilter/x_tables.h>
> +#include <linux/netfilter/xt_IDLETIMER.h>
> +#include <linux/kobject.h>
> +#include <linux/workqueue.h>
> +#include <linux/sysfs.h>
> +
> +struct idletimer_tg {
> +	struct list_head entry;
> +	struct timer_list timer;
> +	struct work_struct work;
> +
> +	struct kobject *kobj;
> +	struct idletimer_tg_attr *attr;
> +
> +	unsigned int refcnt;
> +};
> +
> +struct idletimer_tg_attr {
> +	struct attribute attr;
> +	ssize_t	(*show)(struct kobject *kobj,
> +			struct attribute *attr, char *buf);
> +};
> +
> +static LIST_HEAD(idletimer_tg_list);
>   

How does this work with multiple namespaces? It seems every namespace
can bind to any timer.

> +static DEFINE_SPINLOCK(list_lock);
> +
> +static struct kobject *idletimer_tg_kobj;
> +
> +static
> +struct idletimer_tg *__idletimer_tg_find_by_label(const char *label)
> +{
> +	struct idletimer_tg *entry;
> +
> +	BUG_ON(!label);
> +
> +	list_for_each_entry(entry, &idletimer_tg_list, entry) {
> +		if (!strcmp(label, entry->attr->attr.name))
> +			return entry;
> +	}
> +
> +	return NULL;
> +}
> +
> +static ssize_t idletimer_tg_show(struct kobject *kobj, struct attribute *attr,
> +				 char *buf)
> +{
> +	struct idletimer_tg *timer;
> +	unsigned long expires = 0;
> +
> +	spin_lock_bh(&list_lock);
> +	timer =	__idletimer_tg_find_by_label(attr->name);
> +	if (timer)
> +		expires = timer->timer.expires;
> +	spin_unlock_bh(&list_lock);
> +
> +	if (expires > jiffies)
>   

time_after()?

> +		return sprintf(buf, "%u\n",
> +			       jiffies_to_msecs(expires - jiffies) / 1000);
> +
> +	return sprintf(buf, "0\n");
> +}
> +
> +static void idletimer_tg_delete(const struct idletimer_tg_info *info)
> +{
>   

The only caller is the target cleanup function, why don't you just move
everything there?

> +	struct idletimer_tg *timer;
> +
> +	spin_lock_bh(&list_lock);
> +	timer = __idletimer_tg_find_by_label(info->label);
> +	if (!timer) {
> +		spin_unlock_bh(&list_lock);
> +		return;
> +	}
> +
> +	if (--timer->refcnt == 0) {
> +		pr_debug("deleting timer %s\n", info->label);
> +
> +		list_del(&timer->entry);
> +		del_timer_sync(&timer->timer);
> +		spin_unlock_bh(&list_lock);
> +
> +		sysfs_remove_file(idletimer_tg_kobj, &timer->attr->attr);
> +		kfree(timer->attr->attr.name);
> +		kfree(timer->attr);
> +		kfree(timer);
> +	} else {
> +		spin_unlock_bh(&list_lock);
> +		pr_debug("decreased refcnt of timer %s to %u\n",
> +			 info->label, timer->refcnt);
> +	}
> +}
> +
> +static void idletimer_tg_work(struct work_struct *work)
> +{
> +	struct idletimer_tg *timer = container_of(work, struct idletimer_tg,
> +						  work);
> +
> +	sysfs_notify(idletimer_tg_kobj, NULL,
> +		     timer->attr->attr.name);
> +}
> +
> +static void idletimer_tg_expired(unsigned long data)
> +{
> +	struct idletimer_tg *timer = (struct idletimer_tg *) data;
> +
> +	pr_debug("timer %s expired\n",
> +		 timer->attr->attr.name);
> +
> +	schedule_work(&timer->work);
> +}
> +
> +static
> +struct idletimer_tg *idletimer_tg_create(const struct idletimer_tg_info *info)
> +{
> +	struct idletimer_tg *timer;
> +	struct idletimer_tg_attr *attr;
> +
> +	attr = kzalloc(sizeof(attr), GFP_KERNEL);
>   

sizeof(*attr)

> +	if (!attr) {
> +		pr_debug("couldn't alloc attribute\n");
> +		return NULL;
> +	}
> +
> +	attr->attr.name = kstrdup(info->label, GFP_KERNEL);
> +	if (!attr->attr.name) {
> +		pr_debug("couldn't alloc attribute name\n");
> +		goto out_free_attr;
> +	}
> +	attr->attr.mode = S_IRUGO;
> +	attr->show = idletimer_tg_show;
> +
> +	if (sysfs_create_file(idletimer_tg_kobj, &attr->attr)) {
> +		pr_debug("couldn't add attr to sysfs\n");
> +		goto out_free_name;
> +	}
> +
> +	timer = kmalloc(sizeof(struct idletimer_tg), GFP_KERNEL);
> +	if (!timer) {
> +		pr_debug("couldn't alloc timer\n");
> +		goto out_free_file;
> +	}
> +
> +	spin_lock_bh(&list_lock);
> +	list_add(&timer->entry, &idletimer_tg_list);
> +
> +	init_timer(&timer->timer);
> +	setup_timer(&timer->timer, idletimer_tg_expired, (unsigned long) timer);
> +	mod_timer(&timer->timer,
> +		  msecs_to_jiffies(info->timeout * 1000) + jiffies);
> +
> +	timer->attr = attr;
> +	timer->refcnt = 0;
>   

Better fully set up the timer before activating it.

> +
> +	INIT_WORK(&timer->work, idletimer_tg_work);
> +	spin_unlock_bh(&list_lock);
> +
> +	return timer;
> +
> +out_free_file:
> +	sysfs_remove_file(idletimer_tg_kobj, &attr->attr);
> +out_free_name:
> +	kfree(attr->attr.name);
> +out_free_attr:
> +	kfree(attr);
> +	return NULL;
> +}
> +
> +static void idletimer_tg_cleanup(void)
> +{
> +	struct idletimer_tg *timer;
> +
> +	spin_lock(&list_lock);
> +	list_for_each_entry(timer, &idletimer_tg_list, entry) {
>   

list_for_each_entry_safe(). This function seems unnecessary though, the
module
can't be unloaded while its still in use and cleanup will already happen
when the
rules are removed.

> +		pr_debug("deleting timer %s\n", timer->attr->attr.name);
> +
> +		list_del(&timer->entry);
> +		del_timer_sync(&timer->timer);
> +		kfree(timer->attr->attr.name);
> +		kfree(timer->attr);
> +		kfree(timer);
> +	}
> +	spin_unlock(&list_lock);
> +}
> +
> +/*
> + * The actual xt_tables plugin.
> + */
> +static unsigned int idletimer_tg_target(struct sk_buff *skb,
> +					 const struct xt_action_param *par)
> +{
> +	const struct idletimer_tg_info *info = par->targinfo;
> +	struct idletimer_tg *timer;
> +
> +	pr_debug("resetting timer %s, timeout period %u\n",
> +		 info->label, info->timeout);
> +
> +	spin_lock(&list_lock);
>   

You need BH protection here as well for the output path.

> +	timer = __idletimer_tg_find_by_label(info->label);
> +
> +	BUG_ON(!timer);
> +
> +	mod_timer(&timer->timer,
> +		  msecs_to_jiffies(info->timeout * 1000) + jiffies);
> +	spin_unlock(&list_lock);
> +
> +	return XT_CONTINUE;
> +}
> +
> +static int idletimer_tg_checkentry(const struct xt_tgchk_param *par)
> +{
> +	const struct idletimer_tg_info *info = par->targinfo;
> +	struct idletimer_tg *timer;
> +
> +	pr_debug("checkentry targinfo %s\n", info->label);
> +
> +	if (info->timeout == 0) {
> +		pr_debug("timeout value is zero\n");
> +		return -EINVAL;
> +	}
> +
> +	if (!info->label || strlen(info->label) == 0) {
>   

!info->label is impossible. You should check for \0 termination instead.

> +		pr_debug("label is missing\n");
> +		return -EINVAL;
> +	}
> +
> +	spin_lock(&list_lock);
>   
spin_lock_bh()

> +	timer = __idletimer_tg_find_by_label(info->label);
> +	if (!timer) {
> +		spin_unlock(&list_lock);
> +		timer = idletimer_tg_create(info);
>   

How does this prevent creating the same timer twice?

> +		if (!timer) {
> +			pr_debug("failed to create timer\n");
> +			return -ENOMEM;
> +		}
> +		spin_lock(&list_lock);
> +	}
> +
> +	timer->refcnt++;
> +	mod_timer(&timer->timer,
> +		  msecs_to_jiffies(info->timeout * 1000) + jiffies);
> +	spin_unlock(&list_lock);
> +
> +	return 0;
> +}
> +
> +static void idletimer_tg_destroy(const struct xt_tgdtor_param *par)
> +{
> +	const struct idletimer_tg_info *info = par->targinfo;
> +
> +	pr_debug("destroy targinfo %s\n", info->label);
> +
> +	idletimer_tg_delete(info);
> +}
> +
> +static struct xt_target idletimer_tg __read_mostly = {
> +	.name		= "IDLETIMER",
> +	.family		= NFPROTO_UNSPEC,
> +	.target		= idletimer_tg_target,
> +	.targetsize     = sizeof(struct idletimer_tg_info),
> +	.checkentry	= idletimer_tg_checkentry,
> +	.destroy        = idletimer_tg_destroy,
> +	.me		= THIS_MODULE,
> +};
> +
> +static struct class *idletimer_tg_class;
> +
> +static struct device *idletimer_tg_device;
> +
> +static int __init idletimer_tg_init(void)
> +{
> +	int err;
> +
> +	idletimer_tg_class = class_create(THIS_MODULE, "xt_idletimer");
> +	err = PTR_ERR(idletimer_tg_class);
> +	if (IS_ERR(idletimer_tg_class)) {
> +		pr_debug("couldn't register device class\n");
> +		goto out;
> +	}
> +
> +	idletimer_tg_device = device_create(idletimer_tg_class, NULL,
> +					    MKDEV(0, 0), NULL, "timers");
> +	err = PTR_ERR(idletimer_tg_device);
> +	if (IS_ERR(idletimer_tg_device)) {
> +		pr_debug("couldn't register system device\n");
> +		goto out_class;
> +	}
> +
> +	idletimer_tg_kobj = &idletimer_tg_device->kobj;
> +
> +	err =  xt_register_target(&idletimer_tg);
> +	if (err < 0) {
> +		pr_debug("couldn't register xt target\n");
> +		goto out_dev;
> +	}
> +
> +	return 0;
> +out_dev:
> +	device_destroy(idletimer_tg_class, MKDEV(0, 0));
> +out_class:
> +	class_destroy(idletimer_tg_class);
> +out:
> +	return err;
> +}
> +
> +static void __exit idletimer_tg_exit(void)
> +{
> +	xt_unregister_target(&idletimer_tg);
> +
> +	device_destroy(idletimer_tg_class, MKDEV(0, 0));
> +	class_destroy(idletimer_tg_class);
> +
> +	idletimer_tg_cleanup();
> +}
> +
> +module_init(idletimer_tg_init);
> +module_exit(idletimer_tg_exit);
> +
> +MODULE_AUTHOR("Timo Teras <ext-timo.teras@nokia.com>");
> +MODULE_AUTHOR("Luciano Coelho <luciano.coelho@nokia.com>");
> +MODULE_DESCRIPTION("Xtables: idle time monitor");
> +MODULE_LICENSE("GPL v2");
>   


^ permalink raw reply

* Re: 2.6.35-rc2, CONFIG_RPS is filling the dmesg log
From: Eric Dumazet @ 2010-06-09 13:42 UTC (permalink / raw)
  To: tim.gardner; +Cc: netdev
In-Reply-To: <4C0F96B4.2000307@canonical.com>

Le mercredi 09 juin 2010 à 07:27 -0600, Tim Gardner a écrit :
> On 06/08/2010 02:55 PM, Tim Gardner wrote:
> > With 2.6.35-rc2 my dmesg log is being flooded with messages like this:
> >
> > br0 received packet on queue 4, but number of RX queues is 1
> >
> > This machine is bridged for KVM and has 2 igb network adapters.
> >
> > The root cause appears to be CONFIG_RPS=y and the fact that none of the
> > drivers that call skb_record_rx_queue() perform their net device
> > allocation using alloc_netdev_mq(), thereby initializing num_rx_queues
> > to a maximum of 1.
> >
> > Given that this is early RPS days, is the warning in get_rps_cpu()
> > really necessary? It would appear that _all_ of the multi-receive queue
> > devices that call skb_record_rx_queue() will cause this log noise.
> >
> > By the way, how do you turn off CONFIG_RPS? The only way I could get it
> > disabled was to change the default in net/Kconfig to 'n'.
> >
> > rtg
> 
> This is the route that I'm taking with Ubuntu in the short term. I'll 
> have lots of server testers complaining pretty soon if I don't take care 
> of this now. It does keep my server logs from filling.
> 
> rtg
> 

Probably fine, but your commit message is not exact :

  So far no users of skb_record_rx_queue() use alloc_netdev_mq() for
  network device initialization, so don't print a warning about num_rx_queues
  imbalances in get_rps_cpu() unless they have actually been allocated.

In fact, drivers that use skb_record_rx_queue() did use alloc_netdev_mq().

Problem is : packets going thru bridge/bonding that are not yet
multiqueue enabled. If R[PF]S enabled for these "virtual devices",
we trigger the get_rps_cpu() warning.

Also, in a bonding setup, we still have a problem
because all tx packets will go thru tx queue 0 (dev_pick_tx() job)

(That might be good to know that for Ubuntu server testers)




^ permalink raw reply

* Re: 2.6.35-rc2, CONFIG_RPS is filling the dmesg log
From: Tim Gardner @ 2010-06-09 13:27 UTC (permalink / raw)
  To: netdev
In-Reply-To: <4C0EAE3E.4070708@canonical.com>

[-- Attachment #1: Type: text/plain, Size: 1093 bytes --]

On 06/08/2010 02:55 PM, Tim Gardner wrote:
> With 2.6.35-rc2 my dmesg log is being flooded with messages like this:
>
> br0 received packet on queue 4, but number of RX queues is 1
>
> This machine is bridged for KVM and has 2 igb network adapters.
>
> The root cause appears to be CONFIG_RPS=y and the fact that none of the
> drivers that call skb_record_rx_queue() perform their net device
> allocation using alloc_netdev_mq(), thereby initializing num_rx_queues
> to a maximum of 1.
>
> Given that this is early RPS days, is the warning in get_rps_cpu()
> really necessary? It would appear that _all_ of the multi-receive queue
> devices that call skb_record_rx_queue() will cause this log noise.
>
> By the way, how do you turn off CONFIG_RPS? The only way I could get it
> disabled was to change the default in net/Kconfig to 'n'.
>
> rtg

This is the route that I'm taking with Ubuntu in the short term. I'll 
have lots of server testers complaining pretty soon if I don't take care 
of this now. It does keep my server logs from filling.

rtg

-- 
Tim Gardner tim.gardner@canonical.com

[-- Attachment #2: 0001-net-Print-num_rx_queues-imbalance-warning-only-when-.patch --]
[-- Type: text/x-patch, Size: 1210 bytes --]

>From 02598ea1409568654a554fae3ac2c22ecc2474d0 Mon Sep 17 00:00:00 2001
From: Tim Gardner <tim.gardner@canonical.com>
Date: Tue, 8 Jun 2010 17:51:27 -0600
Subject: [PATCH] net: Print num_rx_queues imbalance warning only when there are allocated queues

BugLink: http://bugs.launchpad.net/bugs/591416

So far no users of skb_record_rx_queue() use alloc_netdev_mq() for
network device initialization, so don't print a warning about num_rx_queues
imbalances in get_rps_cpu() unless they have actually been allocated.

Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
---
 net/core/dev.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index d03470f..0852608 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2253,7 +2253,7 @@ static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
 	if (skb_rx_queue_recorded(skb)) {
 		u16 index = skb_get_rx_queue(skb);
 		if (unlikely(index >= dev->num_rx_queues)) {
-			if (net_ratelimit()) {
+			if (dev->num_rx_queues > 1 && net_ratelimit()) {
 				pr_warning("%s received packet on queue "
 					"%u, but number of RX queues is %u\n",
 					dev->name, index, dev->num_rx_queues);
-- 
1.7.0.4


^ permalink raw reply related

* Re: Bug#584238: linux-image-2.6.32-3-486: When using USB to Ethernet nework adapter later on get error: blocked for more than 120 seconds
From: Paul Chany @ 2010-06-09 13:15 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: 584238, Petko Manolov, netdev
In-Reply-To: <1275521053.2870.70.camel@localhost>

Ben Hutchings <ben@decadent.org.uk> writes:

> On Wed, 2010-06-02 at 16:45 +0200, Paul Chany wrote:
>> Package: linux-2.6
>> Version: 2.6.32-9
>> Severity: normal
>> Tags: squeeze
>> 
>> When using USB to Ethernet adapter later on get error on console:
>> 
>> [11531.988248] INOF: task khubd:618 blocked for more than 120 seconds.
>> [11531.988377] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
>> message.
>> [11531.990656] INFO: task pegasus:1156 blocked for more than 120 seconds.
>> [11531.990722] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
>> message.
>> 
>> After that I can't reconfigure anymore network, nor reboot the system with the command: 
>> sudo shutdown -t 1 -r now
>> because the system hangs forever.
>> 
>> It is not help if I unplug and plug in again the network adapter, it is not recognised by 
>> the kernel anymore.
>
> I'm forwarding this to the pegasus driver developers so they can comment
> on it.
>
> Debian kernel version 2.6.32-9 is closely based on stable kernel
> 2.6.32.9.

It seems that the developers dont' responds for this bugreport at all.
Is the pegasus kernel driver a dead project now?
There in their projectpage: http://pegasus2.sourceforge.net/
one can see that that pegasus is available
only for kernels 2.4 and 2.5 but not for 2.6, right?

In the meantime I can to reboot every hour or so because of this buggish
driver.

-- 
Regards,
Paul Chany
You can freely correct my English.
http://csanyi-pal.info

^ permalink raw reply

* Re: [PATCH] pkt_sched: gen_kill_estimator() rcu fixes
From: Eric Dumazet @ 2010-06-09 13:05 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Changli Gao, David Miller, netdev, Stephen Hemminger,
	Patrick McHardy
In-Reply-To: <20100609125035.GD7308@ff.dom.local>

Le mercredi 09 juin 2010 à 12:50 +0000, Jarek Poplawski a écrit :
> On Wed, Jun 09, 2010 at 02:09:23PM +0200, Eric Dumazet wrote:
> > Le mercredi 09 juin 2010 ?? 10:41 +0000, Jarek Poplawski a écrit :
> > 
> > > Spelling suggestion: xt_rateest_free_rcu?
> > > 
> > > Since it's only 3 places I'd consider adding little comments,
> > > who needs this call_rcu (like in qdisc_destroy).
> > > 
> > > Anyway this patch looks OK to me.
> > > 
> > 
> > Thanks for reviewing, you are right about typo and adding some comments.
> > 
> > What about following ?
> 
> Very nice! (But let's remember these comments aren't very precise
> wrt. bstats at the moment ;-)

Yes, I anticipated a bit next patch ;)

BTW, I think using qdisc spinlock to update stats or reading them is not
really needed, particularly in xt_rateest/xt_RATEEST case, where
per_cpu bstats would be good

We could have one seqcount_t used by (bytes/packets) bstats producer,
and one seqcount_t used by est_timer when computing rates.


bstats updates :


write_seqcount_begin(&bstats->abs_seq);
bstats->bytes += len;
bstats->packets++;
write_seqcount_end(&bstats->abs_seq);


est_timer() would really run lockless, only rcu protected.

(No more spinlock, only the actual rcu_read_lock())


do {
	seq = read_seqcount_begin(&bstats->abs_seq);
	abs_bytes = bstats->bytes;
	abs_packets = bstats->packets;
} while (read_seqcount_retry(&bstats->abs_seq, seq));

write_seqcount_begin(&s->rate_seq);
/* compute rate estimation and store it */
...
write_seqcount_end(&s->rate_seq);

This would mean bstats should be rcu protected too...




^ permalink raw reply

* Re: [PATCH v3] netfilter: Xtables: idletimer target implementation
From: Luciano Coelho @ 2010-06-09 13:00 UTC (permalink / raw)
  To: netfilter-devel@vger.kernel.org
  Cc: netdev@vger.kernel.org, jengelh@medozas.de, kaber@trash.net,
	Timo Teras
In-Reply-To: <1275592445-15555-1-git-send-email-luciano.coelho@nokia.com>

Hello,

On Thu, 2010-06-03 at 21:14 +0200, Coelho Luciano (Nokia-D/Helsinki)
wrote:
> From: Luciano Coelho <luciano.coelho@nokia.com>
> 
> This patch implements an idletimer Xtables target that can be used to
> identify when interfaces have been idle for a certain period of time.
> 
> Timers are identified by labels and are created when a rule is set with a new
> label.  The rules also take a timeout value (in seconds) as an option.  If
> more than one rule uses the same timer label, the timer will be restarted
> whenever any of the rules get a hit.
> 
> One entry for each timer is created in sysfs.  This attribute contains the
> timer remaining for the timer to expire.  The attributes are located under
> the xt_idletimer class:
> 
> /sys/class/xt_idletimer/timers/<label>
> 
> When the timer expires, the target module sends a sysfs notification to the
> userspace, which can then decide what to do (eg. disconnect to save power).
> 
> Cc: Timo Teras <timo.teras@iki.fi>
> Signed-off-by: Luciano Coelho <luciano.coelho@nokia.com>
> ---
> v2: Fixed according to Jan's comments
> v3: Changed to a device class in the virtual bus in sysfs
>     Removed unnecessary attribute group
>     Fixed missing deallocation in some error cases

Does this patch look fine now or is there something more I need to do to
get it applied?

If it's okay already, are you waiting for the merge window to close
before applying it?

Sorry to bother you with this, but we're depending on this feature and I
need to know if there's something more to be doen.

I'll provide the extension for the iptables tool pretty soon.


-- 
Cheers,
Luca.


^ permalink raw reply

* Re: [PATCH] pkt_sched: gen_kill_estimator() rcu fixes
From: Jarek Poplawski @ 2010-06-09 12:50 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Changli Gao, David Miller, netdev, Stephen Hemminger,
	Patrick McHardy
In-Reply-To: <1276085363.2442.125.camel@edumazet-laptop>

On Wed, Jun 09, 2010 at 02:09:23PM +0200, Eric Dumazet wrote:
> Le mercredi 09 juin 2010 ?? 10:41 +0000, Jarek Poplawski a écrit :
> 
> > Spelling suggestion: xt_rateest_free_rcu?
> > 
> > Since it's only 3 places I'd consider adding little comments,
> > who needs this call_rcu (like in qdisc_destroy).
> > 
> > Anyway this patch looks OK to me.
> > 
> 
> Thanks for reviewing, you are right about typo and adding some comments.
> 
> What about following ?

Very nice! (But let's remember these comments aren't very precise
wrt. bstats at the moment ;-)

Thanks,
Jarek P.

> 
> [PATCH v2] pkt_sched: gen_kill_estimator() rcu fixes
> 
> gen_kill_estimator() API is incomplete or not well documented, since
> caller should make sure an RCU grace period is respected before
> freeing stats_lock.
> 
> This was partially addressed in commit 5d944c640b4 
> (gen_estimator: deadlock fix), but same problem exist for all
> gen_kill_estimator() users, if lock they use is not already RCU
> protected.
> 
> A code review shows xt_RATEEST.c, act_api.c, act_police.c have this
> problem. Other are ok because they use qdisc lock, already RCU
> protected.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> ---
>  include/net/act_api.h              |    2 ++
>  include/net/netfilter/xt_rateest.h |    1 +
>  net/core/gen_estimator.c           |    1 +
>  net/netfilter/xt_RATEEST.c         |   12 +++++++++++-
>  net/sched/act_api.c                |   11 ++++++++++-
>  net/sched/act_police.c             |   12 +++++++++++-
>  6 files changed, 36 insertions(+), 3 deletions(-)
> 
> diff --git a/include/net/act_api.h b/include/net/act_api.h
> index c05fd71..bab385f 100644
> --- a/include/net/act_api.h
> +++ b/include/net/act_api.h
> @@ -20,6 +20,7 @@ struct tcf_common {
>  	struct gnet_stats_queue		tcfc_qstats;
>  	struct gnet_stats_rate_est	tcfc_rate_est;
>  	spinlock_t			tcfc_lock;
> +	struct rcu_head			tcfc_rcu;
>  };
>  #define tcf_next	common.tcfc_next
>  #define tcf_index	common.tcfc_index
> @@ -32,6 +33,7 @@ struct tcf_common {
>  #define tcf_qstats	common.tcfc_qstats
>  #define tcf_rate_est	common.tcfc_rate_est
>  #define tcf_lock	common.tcfc_lock
> +#define tcf_rcu		common.tcfc_rcu
>  
>  struct tcf_police {
>  	struct tcf_common	common;
> diff --git a/include/net/netfilter/xt_rateest.h b/include/net/netfilter/xt_rateest.h
> index ddbf37e..5e14277 100644
> --- a/include/net/netfilter/xt_rateest.h
> +++ b/include/net/netfilter/xt_rateest.h
> @@ -9,6 +9,7 @@ struct xt_rateest {
>  	struct gnet_estimator		params;
>  	struct gnet_stats_rate_est	rstats;
>  	struct gnet_stats_basic_packed	bstats;
> +	struct rcu_head			rcu;
>  };
>  
>  extern struct xt_rateest *xt_rateest_lookup(const char *name);
> diff --git a/net/core/gen_estimator.c b/net/core/gen_estimator.c
> index 785e527..9fbe7f7 100644
> --- a/net/core/gen_estimator.c
> +++ b/net/core/gen_estimator.c
> @@ -263,6 +263,7 @@ static void __gen_kill_estimator(struct rcu_head *head)
>   *
>   * Removes the rate estimator specified by &bstats and &rate_est.
>   *
> + * Note : Caller should respect an RCU grace period before freeing stats_lock
>   */
>  void gen_kill_estimator(struct gnet_stats_basic_packed *bstats,
>  			struct gnet_stats_rate_est *rate_est)
> diff --git a/net/netfilter/xt_RATEEST.c b/net/netfilter/xt_RATEEST.c
> index 69c01e1..de079ab 100644
> --- a/net/netfilter/xt_RATEEST.c
> +++ b/net/netfilter/xt_RATEEST.c
> @@ -60,13 +60,22 @@ struct xt_rateest *xt_rateest_lookup(const char *name)
>  }
>  EXPORT_SYMBOL_GPL(xt_rateest_lookup);
>  
> +static void xt_rateest_free_rcu(struct rcu_head *head)
> +{
> +	kfree(container_of(head, struct xt_rateest, rcu));
> +}
> +
>  void xt_rateest_put(struct xt_rateest *est)
>  {
>  	mutex_lock(&xt_rateest_mutex);
>  	if (--est->refcnt == 0) {
>  		hlist_del(&est->list);
>  		gen_kill_estimator(&est->bstats, &est->rstats);
> -		kfree(est);
> +		/*
> +		 * gen_estimator est_timer() might access est->lock or bstats,
> +		 * wait a RCU grace period before freeing 'est'
> +		 */
> +		call_rcu(&est->rcu, xt_rateest_free_rcu);
>  	}
>  	mutex_unlock(&xt_rateest_mutex);
>  }
> @@ -179,6 +188,7 @@ static int __init xt_rateest_tg_init(void)
>  static void __exit xt_rateest_tg_fini(void)
>  {
>  	xt_unregister_target(&xt_rateest_tg_reg);
> +	rcu_barrier(); /* Wait for completion of call_rcu()'s (xt_rateest_free_rcu) */
>  }
>  
>  
> diff --git a/net/sched/act_api.c b/net/sched/act_api.c
> index 972378f..23b25f8 100644
> --- a/net/sched/act_api.c
> +++ b/net/sched/act_api.c
> @@ -26,6 +26,11 @@
>  #include <net/act_api.h>
>  #include <net/netlink.h>
>  
> +static void tcf_common_free_rcu(struct rcu_head *head)
> +{
> +	kfree(container_of(head, struct tcf_common, tcfc_rcu));
> +}
> +
>  void tcf_hash_destroy(struct tcf_common *p, struct tcf_hashinfo *hinfo)
>  {
>  	unsigned int h = tcf_hash(p->tcfc_index, hinfo->hmask);
> @@ -38,7 +43,11 @@ void tcf_hash_destroy(struct tcf_common *p, struct tcf_hashinfo *hinfo)
>  			write_unlock_bh(hinfo->lock);
>  			gen_kill_estimator(&p->tcfc_bstats,
>  					   &p->tcfc_rate_est);
> -			kfree(p);
> +			/*
> +			 * gen_estimator est_timer() might access p->tcfc_lock
> +			 * or bstats, wait a RCU grace period before freeing p
> +			 */
> +			call_rcu(&p->tcfc_rcu, tcf_common_free_rcu);
>  			return;
>  		}
>  	}
> diff --git a/net/sched/act_police.c b/net/sched/act_police.c
> index 654f73d..537a487 100644
> --- a/net/sched/act_police.c
> +++ b/net/sched/act_police.c
> @@ -97,6 +97,11 @@ nla_put_failure:
>  	goto done;
>  }
>  
> +static void tcf_police_free_rcu(struct rcu_head *head)
> +{
> +	kfree(container_of(head, struct tcf_police, tcf_rcu));
> +}
> +
>  static void tcf_police_destroy(struct tcf_police *p)
>  {
>  	unsigned int h = tcf_hash(p->tcf_index, POL_TAB_MASK);
> @@ -113,7 +118,11 @@ static void tcf_police_destroy(struct tcf_police *p)
>  				qdisc_put_rtab(p->tcfp_R_tab);
>  			if (p->tcfp_P_tab)
>  				qdisc_put_rtab(p->tcfp_P_tab);
> -			kfree(p);
> +			/*
> +			 * gen_estimator est_timer() might access p->tcf_lock
> +			 * or bstats, wait a RCU grace period before freeing p
> +			 */
> +			call_rcu(&p->tcf_rcu, tcf_police_free_rcu);
>  			return;
>  		}
>  	}
> @@ -397,6 +406,7 @@ static void __exit
>  police_cleanup_module(void)
>  {
>  	tcf_unregister_action(&act_police_ops);
> +	rcu_barrier(); /* Wait for completion of call_rcu()'s (tcf_police_free_rcu) */
>  }
>  
>  module_init(police_init_module);
> 
> 

^ permalink raw reply

* Re: [PATCH nf-next-2.6 2/2] conntrack: per_cpu untracking
From: Patrick McHardy @ 2010-06-09 12:45 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Changli Gao, Netfilter Developers, netdev
In-Reply-To: <1276009946.2486.216.camel@edumazet-laptop>

Eric Dumazet wrote:
> Le mardi 08 juin 2010 à 16:52 +0200, Eric Dumazet a écrit :
>   
>> Le mardi 08 juin 2010 à 16:29 +0200, Patrick McHardy a écrit :
>>     
>>> On 04.06.2010 22:15, Eric Dumazet wrote:
>>>       
>>>> NOTRACK makes all cpus share a cache line on nf_conntrack_untracked
>>>> twice per packet, slowing down performance.
>>>>
>>>> This patch converts it to a per_cpu variable.
>>>>
>>>> We assume same cpu is used for a given packet, entering and exiting the
>>>> NOTRACK state.
>>>>         
>>> That doesn't seem to be a valid assumption, the conntrack entry is
>>> attached to the skb and processing in the output path might get
>>> preempted and rescheduled to a different CPU.
>>>       
>> Thats unfortunate.
>>
>> Ok, only choice then is to not change refcount on the untracked ct, and
>> keep a shared (read only after setup time) untrack structure.
>>
>>
>>     
>
> Oh well, re-reading my patch, I dont see why I said this in Changelog :)
>
> We lazily select the untrack structure in one cpu, then keep the pointer
> to this untrack structure, attached to ct.
>
> The (still atomic) increment / decrement of refcount is done on the
> saved pointer, not on actual per_cpu structure.
>
> So if a packet is rescheduled on a different CPU, second cpu will "only"
> dirty cache line of other cpu, it probably almost never happens...
>   

Indeed, you're right of course.
> Thanks
>
> [PATCH nf-next-2.6 2/2] conntrack: per_cpu untracking
>
> NOTRACK makes all cpus share a cache line on nf_conntrack_untracked
> twice per packet, slowing down performance.
>
> This patch converts it to a per_cpu variable.
>   
Applied, thanks Eric.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [RFC PATCH 1/2] mac80211: make max_network_latency notifier atomic safe
From: Johannes Berg @ 2010-06-09 12:27 UTC (permalink / raw)
  To: Florian Mickler
  Cc: pm list, james.bottomley-l3A5Bk7waGM,
	markgross-FexmcLCpFRVAfugRpC6u6w, mgross-VuQAYsv1563Yd54FQh9/CA,
	John W. Linville, David S. Miller, Javier Cardona, Jouni Malinen,
	Rui Paulo, Kalle Valo, linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, Thomas Gleixner
In-Reply-To: <20100609141643.14e9aedc-mGsOIKOveelVRbCss4o9kg@public.gmane.org>

On Wed, 2010-06-09 at 14:16 +0200, Florian Mickler wrote:

> That was also my first idea, but then I thought about qos and thought
> atomic notification are necessary.
> Do you see any value in having atomic notification? 
> 
> I have the following situation before my eyes:
> 
> 	Driver A gets an interrupt and needs (to service that
> 	interrupt) the cpu to guarantee a latency of X because the
> 	device is a bit icky.
> 
> Now, in that situation, if we don't immediately (without scheduling in
> between) notify the system to be in that latency-mode the driver won't
> function properly. Is this a realistic scene?
> 
> At the moment we only have process context notification and only 2
> listeners.
> 
> I think providing for atomic as well as "relaxed" notification could be
> useful. 
> 
> If atomic notification is deemed unnecessary, I have no
> problems to just use schedule_work() in update request.
> Anyway, it is probably best to split this. I.e. first make
> update_request callable from atomic contexts with doing the
> schedule_work in update_request and then
> as an add on provide for constraints_objects with atomic notifications.

Well I remember http://thread.gmane.org/gmane.linux.kernel/979935 where
Mark renamed things to "request" which seems to imply to me more of a
"please do this" than "I NEED IT NOW!!!!!".

johannes

--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [RFC PATCH 1/2] mac80211: make max_network_latency notifier atomic safe
From: Florian Mickler @ 2010-06-09 12:16 UTC (permalink / raw)
  To: Johannes Berg
  Cc: pm list, james.bottomley, markgross, mgross, John W. Linville,
	David S. Miller, Javier Cardona, Jouni Malinen, Rui Paulo,
	Kalle Valo, linux-wireless, linux-kernel, netdev, Thomas Gleixner
In-Reply-To: <1276080128.14580.5.camel@jlt3.sipsolutions.net>

On Wed, 09 Jun 2010 12:42:08 +0200
Johannes Berg <johannes@sipsolutions.net> wrote:

> On Wed, 2010-06-09 at 12:20 +0200, Florian Mickler wrote:
> 
> > A third possibility would be to make it dependent on the
> > type of the constraint, if blocking notifiers are allowed or not. 
> > But that would sacrifice API consistency (update_request for one
> > constraint is allowed to be called in interrupt context and
> > update_request for another would be not).
> 
> I don't see what's wrong with the fourth possibility: Allow calling
> pm_qos_update_request() from atomic context, but change _it_ to schedule
> off a work that calls the blocking notifier chain. That avoids the
> complexity in notify-API users since they have process context, and also
> in request-API users since they can call it from any context.
> 
> johannes

That was also my first idea, but then I thought about qos and thought
atomic notification are necessary.
Do you see any value in having atomic
notification? 

I have the following situation before my eyes:

	Driver A gets an interrupt and needs (to service that
	interrupt) the cpu to guarantee a latency of X because the
	device is a bit icky.

Now, in that situation, if we don't immediately (without scheduling in
between) notify the system to be in that latency-mode the driver won't
function properly. Is this a realistic scene?

At the moment we only have process context notification and only 2
listeners.

I think providing for atomic as well as "relaxed" notification could be
useful. 

If atomic notification is deemed unnecessary, I have no
problems to just use schedule_work() in update request.
Anyway, it is probably best to split this. I.e. first make
update_request callable from atomic contexts with doing the
schedule_work in update_request and then
as an add on provide for constraints_objects with atomic notifications.

Flo

^ permalink raw reply

* Re: [PATCH] pkt_sched: gen_kill_estimator() rcu fixes
From: Eric Dumazet @ 2010-06-09 12:09 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Changli Gao, David Miller, netdev, Stephen Hemminger,
	Patrick McHardy
In-Reply-To: <20100609104100.GB7308@ff.dom.local>

Le mercredi 09 juin 2010 à 10:41 +0000, Jarek Poplawski a écrit :

> Spelling suggestion: xt_rateest_free_rcu?
> 
> Since it's only 3 places I'd consider adding little comments,
> who needs this call_rcu (like in qdisc_destroy).
> 
> Anyway this patch looks OK to me.
> 

Thanks for reviewing, you are right about typo and adding some comments.

What about following ?

[PATCH v2] pkt_sched: gen_kill_estimator() rcu fixes

gen_kill_estimator() API is incomplete or not well documented, since
caller should make sure an RCU grace period is respected before
freeing stats_lock.

This was partially addressed in commit 5d944c640b4 
(gen_estimator: deadlock fix), but same problem exist for all
gen_kill_estimator() users, if lock they use is not already RCU
protected.

A code review shows xt_RATEEST.c, act_api.c, act_police.c have this
problem. Other are ok because they use qdisc lock, already RCU
protected.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/act_api.h              |    2 ++
 include/net/netfilter/xt_rateest.h |    1 +
 net/core/gen_estimator.c           |    1 +
 net/netfilter/xt_RATEEST.c         |   12 +++++++++++-
 net/sched/act_api.c                |   11 ++++++++++-
 net/sched/act_police.c             |   12 +++++++++++-
 6 files changed, 36 insertions(+), 3 deletions(-)

diff --git a/include/net/act_api.h b/include/net/act_api.h
index c05fd71..bab385f 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -20,6 +20,7 @@ struct tcf_common {
 	struct gnet_stats_queue		tcfc_qstats;
 	struct gnet_stats_rate_est	tcfc_rate_est;
 	spinlock_t			tcfc_lock;
+	struct rcu_head			tcfc_rcu;
 };
 #define tcf_next	common.tcfc_next
 #define tcf_index	common.tcfc_index
@@ -32,6 +33,7 @@ struct tcf_common {
 #define tcf_qstats	common.tcfc_qstats
 #define tcf_rate_est	common.tcfc_rate_est
 #define tcf_lock	common.tcfc_lock
+#define tcf_rcu		common.tcfc_rcu
 
 struct tcf_police {
 	struct tcf_common	common;
diff --git a/include/net/netfilter/xt_rateest.h b/include/net/netfilter/xt_rateest.h
index ddbf37e..5e14277 100644
--- a/include/net/netfilter/xt_rateest.h
+++ b/include/net/netfilter/xt_rateest.h
@@ -9,6 +9,7 @@ struct xt_rateest {
 	struct gnet_estimator		params;
 	struct gnet_stats_rate_est	rstats;
 	struct gnet_stats_basic_packed	bstats;
+	struct rcu_head			rcu;
 };
 
 extern struct xt_rateest *xt_rateest_lookup(const char *name);
diff --git a/net/core/gen_estimator.c b/net/core/gen_estimator.c
index 785e527..9fbe7f7 100644
--- a/net/core/gen_estimator.c
+++ b/net/core/gen_estimator.c
@@ -263,6 +263,7 @@ static void __gen_kill_estimator(struct rcu_head *head)
  *
  * Removes the rate estimator specified by &bstats and &rate_est.
  *
+ * Note : Caller should respect an RCU grace period before freeing stats_lock
  */
 void gen_kill_estimator(struct gnet_stats_basic_packed *bstats,
 			struct gnet_stats_rate_est *rate_est)
diff --git a/net/netfilter/xt_RATEEST.c b/net/netfilter/xt_RATEEST.c
index 69c01e1..de079ab 100644
--- a/net/netfilter/xt_RATEEST.c
+++ b/net/netfilter/xt_RATEEST.c
@@ -60,13 +60,22 @@ struct xt_rateest *xt_rateest_lookup(const char *name)
 }
 EXPORT_SYMBOL_GPL(xt_rateest_lookup);
 
+static void xt_rateest_free_rcu(struct rcu_head *head)
+{
+	kfree(container_of(head, struct xt_rateest, rcu));
+}
+
 void xt_rateest_put(struct xt_rateest *est)
 {
 	mutex_lock(&xt_rateest_mutex);
 	if (--est->refcnt == 0) {
 		hlist_del(&est->list);
 		gen_kill_estimator(&est->bstats, &est->rstats);
-		kfree(est);
+		/*
+		 * gen_estimator est_timer() might access est->lock or bstats,
+		 * wait a RCU grace period before freeing 'est'
+		 */
+		call_rcu(&est->rcu, xt_rateest_free_rcu);
 	}
 	mutex_unlock(&xt_rateest_mutex);
 }
@@ -179,6 +188,7 @@ static int __init xt_rateest_tg_init(void)
 static void __exit xt_rateest_tg_fini(void)
 {
 	xt_unregister_target(&xt_rateest_tg_reg);
+	rcu_barrier(); /* Wait for completion of call_rcu()'s (xt_rateest_free_rcu) */
 }
 
 
diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 972378f..23b25f8 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -26,6 +26,11 @@
 #include <net/act_api.h>
 #include <net/netlink.h>
 
+static void tcf_common_free_rcu(struct rcu_head *head)
+{
+	kfree(container_of(head, struct tcf_common, tcfc_rcu));
+}
+
 void tcf_hash_destroy(struct tcf_common *p, struct tcf_hashinfo *hinfo)
 {
 	unsigned int h = tcf_hash(p->tcfc_index, hinfo->hmask);
@@ -38,7 +43,11 @@ void tcf_hash_destroy(struct tcf_common *p, struct tcf_hashinfo *hinfo)
 			write_unlock_bh(hinfo->lock);
 			gen_kill_estimator(&p->tcfc_bstats,
 					   &p->tcfc_rate_est);
-			kfree(p);
+			/*
+			 * gen_estimator est_timer() might access p->tcfc_lock
+			 * or bstats, wait a RCU grace period before freeing p
+			 */
+			call_rcu(&p->tcfc_rcu, tcf_common_free_rcu);
 			return;
 		}
 	}
diff --git a/net/sched/act_police.c b/net/sched/act_police.c
index 654f73d..537a487 100644
--- a/net/sched/act_police.c
+++ b/net/sched/act_police.c
@@ -97,6 +97,11 @@ nla_put_failure:
 	goto done;
 }
 
+static void tcf_police_free_rcu(struct rcu_head *head)
+{
+	kfree(container_of(head, struct tcf_police, tcf_rcu));
+}
+
 static void tcf_police_destroy(struct tcf_police *p)
 {
 	unsigned int h = tcf_hash(p->tcf_index, POL_TAB_MASK);
@@ -113,7 +118,11 @@ static void tcf_police_destroy(struct tcf_police *p)
 				qdisc_put_rtab(p->tcfp_R_tab);
 			if (p->tcfp_P_tab)
 				qdisc_put_rtab(p->tcfp_P_tab);
-			kfree(p);
+			/*
+			 * gen_estimator est_timer() might access p->tcf_lock
+			 * or bstats, wait a RCU grace period before freeing p
+			 */
+			call_rcu(&p->tcf_rcu, tcf_police_free_rcu);
 			return;
 		}
 	}
@@ -397,6 +406,7 @@ static void __exit
 police_cleanup_module(void)
 {
 	tcf_unregister_action(&act_police_ops);
+	rcu_barrier(); /* Wait for completion of call_rcu()'s (tcf_police_free_rcu) */
 }
 
 module_init(police_init_module);



^ permalink raw reply related

* Re: [PATCH net-2.6 v2] pkt_sched: gen_estimator: add a new lock
From: Eric Dumazet @ 2010-06-09 11:55 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Changli Gao, David Miller, netdev, Stephen Hemminger,
	Patrick McHardy
In-Reply-To: <20100609113329.GC7308@ff.dom.local>

Le mercredi 09 juin 2010 à 11:33 +0000, Jarek Poplawski a écrit :
> On Wed, Jun 09, 2010 at 11:39:10AM +0200, Eric Dumazet wrote:
> ...
> > Here is v2 of patch, adding protection in gen_estimator_active() as
> > well.
> > 
> > [PATCH net-2.6 v2] pkt_sched: gen_estimator: add a new lock
> > 
> > gen_kill_estimator() / gen_new_estimator() is not always called with
> > RTNL held.
> > 
> > net/netfilter/xt_RATEEST.c is one user of these API that do not hold
> > RTNL, so random corruptions can occur between "tc" and "iptables".
> > 
> > Add a new fine grained lock instead of trying to use RTNL in netfilter.
> 
> Btw, maybe it's a different case, but RTNL happens in netfilter already
> (nf_conntrack_helper.c, nf_conntrack_proto.c).
> 
> It would be nice to mention Changli's argument that
> gen_replace_estimator isn't atomic operation after this change.
> 

See my next comment. This function is still used in qdisc domain only.
We could add ASSERT_RTNL() here, but its not a bug fix...

> > @@ -312,8 +315,14 @@ EXPORT_SYMBOL(gen_replace_estimator);
> >  bool gen_estimator_active(const struct gnet_stats_basic_packed *bstats,
> >  			  const struct gnet_stats_rate_est *rate_est)
> >  {
> > +	bool res;
> > +
> >  	ASSERT_RTNL();
> 
> This line should go away as well.
> 

Nope. This is really needed for safety. Of course, its a debugging knob
and you can remove it if you trust all callers to be good.

Caller should still use RTNL here, or else, as soon as we exit from
gen_estimator_active() with a 'true' result, some other qdisc user could
destroy the estimator.

Fortunatly up to 2.6.35, xt_RATEEST cannot reuse/delete an estimator
created by a qdisc user, and it doesnt use gen_estimator_active().

In a future patch, we should add RCU safety here instead of ASSERT_RTNL.
Thats to make sure caller of gen_estimator_active() is inside a
rcu_read_lock() section, and estimator cannot disappear under it, even
if another cpu/thread calls gen_kill_estimator() on same estimator.




^ permalink raw reply

* Re: [PATCH net-2.6 v2] pkt_sched: gen_estimator: add a new lock
From: Jarek Poplawski @ 2010-06-09 11:33 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Changli Gao, David Miller, netdev, Stephen Hemminger,
	Patrick McHardy
In-Reply-To: <1276076350.2442.90.camel@edumazet-laptop>

On Wed, Jun 09, 2010 at 11:39:10AM +0200, Eric Dumazet wrote:
...
> Here is v2 of patch, adding protection in gen_estimator_active() as
> well.
> 
> [PATCH net-2.6 v2] pkt_sched: gen_estimator: add a new lock
> 
> gen_kill_estimator() / gen_new_estimator() is not always called with
> RTNL held.
> 
> net/netfilter/xt_RATEEST.c is one user of these API that do not hold
> RTNL, so random corruptions can occur between "tc" and "iptables".
> 
> Add a new fine grained lock instead of trying to use RTNL in netfilter.

Btw, maybe it's a different case, but RTNL happens in netfilter already
(nf_conntrack_helper.c, nf_conntrack_proto.c).

It would be nice to mention Changli's argument that
gen_replace_estimator isn't atomic operation after this change.

> @@ -312,8 +315,14 @@ EXPORT_SYMBOL(gen_replace_estimator);
>  bool gen_estimator_active(const struct gnet_stats_basic_packed *bstats,
>  			  const struct gnet_stats_rate_est *rate_est)
>  {
> +	bool res;
> +
>  	ASSERT_RTNL();

This line should go away as well.

I'm OK with this patch unless there is an alternative xt_RATEEST RTNL
fix available.

Jarek P.

^ permalink raw reply

* Re: [v2 Patch 2/2] mlx4: add dynamic LRO disable support
From: Stanislaw Gruszka @ 2010-06-09 11:05 UTC (permalink / raw)
  To: Amerigo Wang, David Miller
  Cc: netdev, nhorman, herbert.xu, bhutchings, Ramkrishna.Vepa, davem
In-Reply-To: <20100609100938.6573.73536.sendpatchset@localhost.localdomain>

On Wed, Jun 09, 2010 at 06:05:34AM -0400, Amerigo Wang wrote:
> diff --git a/drivers/net/mlx4/en_ethtool.c b/drivers/net/mlx4/en_ethtool.c
> index d5afd03..2c77805 100644
> --- a/drivers/net/mlx4/en_ethtool.c
> +++ b/drivers/net/mlx4/en_ethtool.c
> @@ -387,6 +387,37 @@ static void mlx4_en_get_ringparam(struct net_device *dev,
>  	param->tx_pending = mdev->profile.prof[priv->port].tx_ring_size;
>  }
>  
> +static int mlx4_ethtool_op_set_flags(struct net_device *dev, u32 data)
> +{
> +	struct mlx4_en_priv *priv = netdev_priv(dev);
> +	struct mlx4_en_dev *mdev = priv->mdev;
> +	int rc = 0;
> +	int changed = 0;
> +
> +	if (data & ETH_FLAG_LRO) {
> +		if (!(dev->features & NETIF_F_LRO)) {
> +			dev->features |= NETIF_F_LRO;
> +			changed = 1;
> +			mdev->profile.num_lro = min_t(int, num_lro , MLX4_EN_MAX_LRO_DESCRIPTORS);

I do not understand why you override mdev->profile.num_lro in v2 patch.
If in Rx patch NETIF_F_LRO flag is used we do not need this IMHO.

> +		}
> +	} else if (dev->features & NETIF_F_LRO) {
> +		dev->features &= ~NETIF_F_LRO;
> +		changed = 1;
> +		mdev->profile.num_lro = 0;
> +	}
[snip]
>  const struct ethtool_ops mlx4_en_ethtool_ops = {
>  	.get_drvinfo = mlx4_en_get_drvinfo,
>  	.get_settings = mlx4_en_get_settings,
> @@ -415,7 +446,7 @@ const struct ethtool_ops mlx4_en_ethtool_ops = {
>  	.get_ringparam = mlx4_en_get_ringparam,
>  	.set_ringparam = mlx4_en_set_ringparam,
>  	.get_flags = ethtool_op_get_flags,
> -	.set_flags = ethtool_op_set_flags,
> +	.set_flags = mlx4_ethtool_op_set_flags,
>  };

Since we modify .set_flags, please assure we return -EOPNOTSUPP
if someone will try to setup ETH_FLAG_NTUPLE and ETH_FLAG_RXHASH.

BTW: seems default ethtool_op_set_flags introduce a bug on many
devices regarding ETH_FLAG_RXHASH. I think default should
be EOPNOTSUPP, and these few devices that actually support RXHASH
should have custom ethtool_ops->set_flags

Stanislaw

^ permalink raw reply

* Re: [Patch 2/2] mlx4: add dynamic LRO disable support
From: Stanislaw Gruszka @ 2010-06-09 10:49 UTC (permalink / raw)
  To: Cong Wang; +Cc: Ben Hutchings, netdev, herbert.xu, nhorman, davem
In-Reply-To: <4C0F5D97.5030605@redhat.com>

Hi Amerigo

Sorry for being silent in this thread before.

On Wed, Jun 09, 2010 at 05:23:35PM +0800, Cong Wang wrote:
>>>> Is that flag test actually unsafe - and if so, how is testing num_lro
>>>> any better?  Perhaps access to net_device::features should be wrapped
>>>> with ACCESS_ONCE() to ensure that reads and writes are atomic.
>>>>
>>>
>>> At least, I don't find there is any race with 'num_lro', thus
>>> no lock is needed.
>>
>> In both cases there is a race condition but it is harmless so long as
>> the read and the write are atomic.  There is a general assumption in
>> networking code that this is the case for int and long.  Personally I
>> would prefer to see this made explicit using ACCESS_ONCE(), but I don't
>> see any specific problem in mlx4 (not that I'm familiar with this driver
>> either).
>
> I read this email again.
>
> I think you misunderstood the race condition here. Even read and write
> are atomic here, the race still exists. One can just set NETIF_F_LRO
> asynchronously right before mlx4 check this flag in mlx4_en_process_rx_cq()
> which doesn't take rtnl_lock.

If so, it's better to stop device before modify LRO settings. I suggest
something like that in mlx4_ethtool_op_set_flags:

if (!!(data & ETH_FLAG_LRO) != !!(dev->features & NETIF_F_LRO)) {
	/* Need to toggle LRO */

	if (netdev_running(dev)) {
               mutex_lock(&mdev->state_lock);
               mlx4_en_stop_port(dev);
               rc = mlx4_en_start_port(dev);
               if (rc)
                       en_err(priv, "Failed to restart port\n");
	}

	dev->features ^= NETIF_F_LRO;

	if (netdev_running(dev))
               mutex_unlock(&mdev->state_lock);
}

Stanislaw

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox