* Re: IPv6: race condition in __ipv6_ifa_notify() and dst_free() ?
From: Jiri Bohac @ 2010-04-22 15:49 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, jbohac, yoshfuji, netdev, shemminger
In-Reply-To: <20100422142506.GA15858@gondor.apana.org.au>
On Thu, Apr 22, 2010 at 10:25:06PM +0800, Herbert Xu wrote:
> This patch fixes this by using the DADFAILED bit to synchronise
> the two paths while holding the ifp lock. It relies on the fact
> that the TENTATIVE bit is always set during DAD, and that the
> DADFAILED bit is only set on failure.
But the addr_dad_failure()->...->ipv6_del_addr() path will
still race with any other path calling ipv6_del_addr() (e.g. a
manual address removal). Won't it?
I still don't see why __ipv6_ifa_notify() needs to call
dst_free(). Shouldn't that be dst_release() instead, to drop the
reference obtained by dst_hold(&ifp->rt->u.dst)?
--
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, SUSE CZ
^ permalink raw reply
* Re: [PATCH linux-next 1/2] irq: Add CPU mask affinity hint callback framework
From: Ben Hutchings @ 2010-04-22 15:41 UTC (permalink / raw)
To: Peter P Waskiewicz Jr
Cc: tglx@linutronix.de, davem@davemloft.net, arjan@linux.jf.intel.com,
netdev@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <Pine.WNT.4.64.1004220459110.3324@PPWASKIE-MOBL2.amr.corp.intel.com>
On Thu, 2010-04-22 at 05:11 -0700, Peter P Waskiewicz Jr wrote:
> On Wed, 21 Apr 2010, Ben Hutchings wrote:
>
> > On Tue, 2010-04-20 at 11:01 -0700, Peter P Waskiewicz Jr wrote:
> >> This patch adds a callback function pointer to the irq_desc
> >> structure, along with a registration function and a read-only
> >> proc entry for each interrupt.
> >>
> >> This affinity_hint handle for each interrupt can be used by
> >> underlying drivers that need a better mechanism to control
> >> interrupt affinity. The underlying driver can register a
> >> callback for the interrupt, which will allow the driver to
> >> provide the CPU mask for the interrupt to anything that
> >> requests it. The intent is to extend the userspace daemon,
> >> irqbalance, to help hint to it a preferred CPU mask to balance
> >> the interrupt into.
> >
> > Doesn't it make more sense to have the driver follow affinity decisions
> > made from user-space? I realise that reallocating queues is disruptive
> > and we probably don't want irqbalance to trigger that, but there should
> > be a mechanism for the administrator to trigger it.
>
> The driver here would be assisting userspace (irqbalance) to provide
> better details how the HW is laid out with respect to flows. As it stands
> today, irqbalance is almost guaranteed to move interrups to CPUs that are
> not aligned with where applications are running for network adapters.
> This is very apparent when running at speeds in the 10 Gigabit range, or
> even multiple 1 Gigabit ports running at the same time.
I'm well aware that irqbalance isn't making good decisions at the
moment. The question is whether this will really help irqbalance to do
better.
[...]
> > This just assigns IRQs to the first n CPU threads. Depending on the
> > enumeration order, this might result in assigning an IRQ to each of 2
> > threads on a core while leaving other cores unused!
>
> This ixgbe patch is only meant to be an example of how you could use it.
> I didn't hammer out all the corner cases of interrupt alignment in it yet.
> However, ixgbe is already aligning Tx flows onto the CPU/queue pair the Tx
> occurred (i.e. Tx session from CPU 4 will be queued on Tx queue 4),
[...]
OK, now I remember ixgbe has this odd select_queue() implementation.
But this behaviour can result in reordering whenever a user thread
migrates, and in any case Dave discourages people from setting
select_queue(). So I see that these changes would be useful for ixgbe
(together with an update to irqbalance), but they don't seem to fit the
general direction of multiqueue networking on Linux.
(Actually, the hints seem to be incomplete. If there are more than 16
CPU threads then multiple CPU threads can map to the same queues, but it
looks like you only include the first in the queue's hint.)
An alternate approach is to use the RX queue index to drive TX queue
selection. I posted a patch to do that earlier this week. However I
haven't yet had a chance to try that on a suitably large system.
Ben.
--
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
^ permalink raw reply
* Re: [PATCH] RCU: don't turn off lockdep when find suspicious rcu_dereference_check() usage
From: Vivek Goyal @ 2010-04-22 14:56 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Miles Lane, Eric Paris, Lai Jiangshan, Ingo Molnar,
Peter Zijlstra, LKML, nauman, eric.dumazet, netdev, Jens Axboe,
Gui Jianfeng, Li Zefan
In-Reply-To: <20100421213543.GO2563@linux.vnet.ibm.com>
On Wed, Apr 21, 2010 at 02:35:43PM -0700, Paul E. McKenney wrote:
[..]
> > [ 3.116754] [ INFO: suspicious rcu_dereference_check() usage. ]
> > [ 3.116754] ---------------------------------------------------
> > [ 3.116754] kernel/cgroup.c:4432 invoked rcu_dereference_check()
> > without protection!
> > [ 3.116754]
> > [ 3.116754] other info that might help us debug this:
> > [ 3.116754]
> > [ 3.116754]
> > [ 3.116754] rcu_scheduler_active = 1, debug_locks = 1
> > [ 3.116754] 2 locks held by async/1/666:
> > [ 3.116754] #0: (&shost->scan_mutex){+.+.+.}, at:
> > [<ffffffff812df0a0>] __scsi_add_device+0x83/0xe4
> > [ 3.116754] #1: (&(&blkcg->lock)->rlock){......}, at:
> > [<ffffffff811f2e8d>] blkiocg_add_blkio_group+0x29/0x7f
> > [ 3.116754]
> > [ 3.116754] stack backtrace:
> > [ 3.116754] Pid: 666, comm: async/1 Not tainted 2.6.34-rc5 #18
> > [ 3.116754] Call Trace:
> > [ 3.116754] [<ffffffff81067fc2>] lockdep_rcu_dereference+0x9d/0xa5
> > [ 3.116754] [<ffffffff8107f9b1>] css_id+0x3f/0x51
> > [ 3.116754] [<ffffffff811f2e9c>] blkiocg_add_blkio_group+0x38/0x7f
> > [ 3.116754] [<ffffffff811f4e64>] cfq_init_queue+0xdf/0x2dc
> > [ 3.116754] [<ffffffff811e3445>] elevator_init+0xba/0xf5
> > [ 3.116754] [<ffffffff812dc02a>] ? scsi_request_fn+0x0/0x451
> > [ 3.116754] [<ffffffff811e696b>] blk_init_queue_node+0x12f/0x135
> > [ 3.116754] [<ffffffff811e697d>] blk_init_queue+0xc/0xe
> > [ 3.116754] [<ffffffff812dc49c>] __scsi_alloc_queue+0x21/0x111
> > [ 3.116754] [<ffffffff812dc5a4>] scsi_alloc_queue+0x18/0x64
> > [ 3.116754] [<ffffffff812de5a0>] scsi_alloc_sdev+0x19e/0x256
> > [ 3.116754] [<ffffffff812de73e>] scsi_probe_and_add_lun+0xe6/0x9c5
> > [ 3.116754] [<ffffffff81068922>] ? trace_hardirqs_on_caller+0x114/0x13f
> > [ 3.116754] [<ffffffff813ce0d6>] ? __mutex_lock_common+0x3e4/0x43a
> > [ 3.116754] [<ffffffff812df0a0>] ? __scsi_add_device+0x83/0xe4
> > [ 3.116754] [<ffffffff812d0a5c>] ? transport_setup_classdev+0x0/0x17
> > [ 3.116754] [<ffffffff812df0a0>] ? __scsi_add_device+0x83/0xe4
> > [ 3.116754] [<ffffffff812df0d5>] __scsi_add_device+0xb8/0xe4
> > [ 3.116754] [<ffffffff812ea9c5>] ata_scsi_scan_host+0x74/0x16e
> > [ 3.116754] [<ffffffff81057685>] ? autoremove_wake_function+0x0/0x34
> > [ 3.116754] [<ffffffff812e8e64>] async_port_probe+0xab/0xb7
> > [ 3.116754] [<ffffffff8105e1b5>] ? async_thread+0x0/0x1f4
> > [ 3.116754] [<ffffffff8105e2ba>] async_thread+0x105/0x1f4
> > [ 3.116754] [<ffffffff81033d79>] ? default_wake_function+0x0/0xf
> > [ 3.116754] [<ffffffff8105e1b5>] ? async_thread+0x0/0x1f4
> > [ 3.116754] [<ffffffff8105713e>] kthread+0x89/0x91
> > [ 3.116754] [<ffffffff81068922>] ? trace_hardirqs_on_caller+0x114/0x13f
> > [ 3.116754] [<ffffffff81003994>] kernel_thread_helper+0x4/0x10
> > [ 3.116754] [<ffffffff813cfcc0>] ? restore_args+0x0/0x30
> > [ 3.116754] [<ffffffff810570b5>] ? kthread+0x0/0x91
> > [ 3.116754] [<ffffffff81003990>] ? kernel_thread_helper+0x0/0x10
>
> I cannot convince myself that the above access is safe. Vivek, Nauman,
> thoughts?
Hi Paul,
blkiocg_add_blkio_group() is called from two paths.
First one is following. This path should be safe as it takes rcu read
lock.
cfq_get_cfqg()
rcu_read_lock()
cfq_find_alloc_cfqg()
blkiocg_add_blkio_group()
rcu_read_unlock()
Second one is as shown in above backtrace.
cfq_init_queue()
blkiocg_add_blkio_group().
This path is called at request queue and cfq initialization time and
we access only root cgroup (root blkio_cgroup). As root cgroup can't
go away, do we have to protect that call also using rcu_read_lock()?
So I guess it is not unsafe but propably we need to fix the warning, I
should wrap second call to blkiocg_add_blkio_group() with
rcu_read_lock/unlock pair?
Thanks
Vivek
^ permalink raw reply
* Re: [PATCH v5] net: batch skb dequeueing from softnet input_pkt_queue
From: Changli Gao @ 2010-04-22 14:54 UTC (permalink / raw)
To: Eric Dumazet; +Cc: David S. Miller, jamal, Tom Herbert, netdev
In-Reply-To: <1271946805.7895.5658.camel@edumazet-laptop>
On Thu, Apr 22, 2010 at 10:33 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> It does make a difference, Damn it.
>
> I really really start to think you dont read what I wrote, or you dont
> care.
I misunderstood it. Sorry.
>
> Damn, cant you update all the things at once, taking this lock only
> once ?
>
> You focus having an ultra precise count of pkt_queue.len, but we dont
> care at all ! We only want a _limit_, or else the box can be killed by
> DOS.
>
> If in practice this limit can be 2*limit, thats OK.
>
> Cant you understand this ?
>
>
> We need one limit. Not two limits.
>
> I already told you how to do it, but you ignored me and started yet
> another convoluted thing.
>
>
> process_backlog() transfert the queue to its own queue and reset pkt_len
> to 0 (Only once)
>
> End of story.
>
> Maximum packet queued to this cpu softnet_data will be 2 * old_limit.
>
> So what ?
>
Now, I think I really understand. We don't need a precious limit. So
only a additional queue is enough.
--
Regards,
Changli Gao(xiaosuo@gmail.com)
^ permalink raw reply
* Re: DDoS attack causing bad effect on conntrack searches
From: Eric Dumazet @ 2010-04-22 14:53 UTC (permalink / raw)
To: Patrick McHardy
Cc: Changli Gao, hawk, Linux Kernel Network Hackers, netfilter-devel,
Paul E McKenney
In-Reply-To: <1271946961.7895.5665.camel@edumazet-laptop>
Le jeudi 22 avril 2010 à 16:36 +0200, Eric Dumazet a écrit :
> If one hash slot is under attack, then there is a bug somewhere.
>
> If we cannot avoid this, we can fallback to a secure mode at the second
> retry, and take the spinlock.
>
> Tis way, most of lookups stay lockless (one pass), and some might take
> the slot lock to avoid the possibility of a loop.
>
> I suspect a bug elsewhere, quite frankly !
>
> We have a chain that have an end pointer that doesnt match the expected
> one.
>
On normal situation, we always finish the lookup :
1) If we found the thing we were looking at.
2) We get the list end (item not found), we then check if it is the
expected end.
It is _not_ the expected end only if some writer deleted/inserted an
element in _this_ chain during our lookup.
Because our lookup is lockless, we then have to redo it because we might
miss the object we are looking for.
If we can do the 'retry' a 10 times, it means the attacker was really
clever enough to inject new packets (new conntracks) at the right
moment, in the right hash chain, and this sounds so higly incredible
that I cannot believe it at all :)
^ permalink raw reply
* Re: DDoS attack causing bad effect on conntrack searches
From: Eric Dumazet @ 2010-04-22 14:36 UTC (permalink / raw)
To: Patrick McHardy
Cc: Changli Gao, hawk, Linux Kernel Network Hackers, netfilter-devel,
Paul E McKenney
In-Reply-To: <4BD04C74.9020402@trash.net>
Le jeudi 22 avril 2010 à 15:17 +0200, Patrick McHardy a écrit :
> Changli Gao wrote:
> > On Thu, Apr 22, 2010 at 8:58 PM, Jesper Dangaard Brouer <hawk@comx.dk> wrote:
> >> At an unnamed ISP, we experienced a DDoS attack against one of our
> >> customers. This attack also caused problems for one of our Linux
> >> based routers.
> >>
> >> The attack was "only" generating 300 kpps (packets per sec), which
> >> usually isn't a problem for this (fairly old) Linux Router. But the
> >> conntracking system chocked and reduced pps processing power to
> >> 40kpps.
> >>
> >> I do extensive RRD/graph monitoring of the machines. The IP conntrack
> >> searches in the period exploded, to a stunning 700.000 searches per
> >> sec.
> >>
> >> http://people.netfilter.org/hawk/DDoS/2010-04-12__001/conntrack_searches001.png
> >>
> >> First I though it might be caused by bad hashing, but after reading
> >> the kernel code (func: __nf_conntrack_find()), I think its caused by
> >> the loop restart (goto begin) of the conntrack search, running under
> >> local_bh_disable(). These RCU changes to conntrack were introduced in
> >> ea781f19 by Eric Dumazet.
> >>
> >> Code: net/netfilter/nf_conntrack_core.c
> >> Func: __nf_conntrack_find()
> >>
> >> struct nf_conntrack_tuple_hash *
> >> __nf_conntrack_find(struct net *net, const struct nf_conntrack_tuple *tuple)
> >> {
> >> struct nf_conntrack_tuple_hash *h;
> >> struct hlist_nulls_node *n;
> >> unsigned int hash = hash_conntrack(tuple);
> >>
> >> /* Disable BHs the entire time since we normally need to disable them
> >> * at least once for the stats anyway.
> >> */
> >> local_bh_disable();
> >> begin:
> >> hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnnode) {
> >> if (nf_ct_tuple_equal(tuple, &h->tuple)) {
> >> NF_CT_STAT_INC(net, found);
> >> local_bh_enable();
> >> return h;
> >> }
> >> NF_CT_STAT_INC(net, searched);
> >> }
> >> /*
> >> * if the nulls value we got at the end of this lookup is
> >> * not the expected one, we must restart lookup.
> >> * We probably met an item that was moved to another chain.
> >> */
> >> if (get_nulls_value(n) != hash)
> >> goto begin;
> >> local_bh_enable();
> >>
> >
> > We should add a retry limit there.
>
> We can't do that since that would allow false negatives.
If one hash slot is under attack, then there is a bug somewhere.
If we cannot avoid this, we can fallback to a secure mode at the second
retry, and take the spinlock.
Tis way, most of lookups stay lockless (one pass), and some might take
the slot lock to avoid the possibility of a loop.
I suspect a bug elsewhere, quite frankly !
We have a chain that have an end pointer that doesnt match the expected
one.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH v5] net: batch skb dequeueing from softnet input_pkt_queue
From: Eric Dumazet @ 2010-04-22 14:33 UTC (permalink / raw)
To: Changli Gao; +Cc: David S. Miller, jamal, Tom Herbert, netdev
In-Reply-To: <y2n412e6f7f1004220606id324dc9bj2cc04cfbad50a101@mail.gmail.com>
Le jeudi 22 avril 2010 à 21:06 +0800, Changli Gao a écrit :
> On Thu, Apr 22, 2010 at 7:37 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> > Please reorder things better.
> >
> > Most likely this function is called for one packet.
> >
> > In your version you take twice the rps_lock()/rps_unlock() path, so
> > it'll be slower.
> >
> > Once to 'transfert' one list to process list
> >
> > Once to be able to do the 'label out:' post processing.
> >
>
> It doesn't make any difference. We have to hold rps_lock to update
> input_pkt_queue_len, if we don't use another variable to record the
> length of the process queue, or atomic variable.
>
It does make a difference, Damn it.
I really really start to think you dont read what I wrote, or you dont
care.
Damn, cant you update all the things at once, taking this lock only
once ?
You focus having an ultra precise count of pkt_queue.len, but we dont
care at all ! We only want a _limit_, or else the box can be killed by
DOS.
If in practice this limit can be 2*limit, thats OK.
Cant you understand this ?
> I think it is better that using another variable to record the length
> of the process queue, and updating it before process_backlog()
> returns. For one packet, there is only one locking/unlocking. There is
> only one issue you concerned: cache miss due to sum the two queues'
> length. I'll change softnet_data to:
>
> struct softnet_data {
> struct Qdisc *output_queue;
> struct list_head poll_list;
> struct sk_buff *completion_queue;
> struct sk_buff_head process_queue;
>
> #ifdef CONFIG_RPS
> struct softnet_data *rps_ipi_list;
>
> /* Elements below can be accessed between CPUs for RPS */
> struct call_single_data csd ____cacheline_aligned_in_smp;
> struct softnet_data *rps_ipi_next;
> unsigned int cpu;
> unsigned int input_queue_head;
> #endif
> unsigned int process_queue_len;
> struct sk_buff_head input_pkt_queue;
> struct napi_struct backlog;
> };
>
> For one packets, we have to update process_queue_len in any way. For
> more packets, we only change process_queue_len just before
> process_backlog() returns. It means that process_queue_len change is
> batched.
>
We need one limit. Not two limits.
I already told you how to do it, but you ignored me and started yet
another convoluted thing.
process_backlog() transfert the queue to its own queue and reset pkt_len
to 0 (Only once)
End of story.
Maximum packet queued to this cpu softnet_data will be 2 * old_limit.
So what ?
^ permalink raw reply
* Re: IPv6: race condition in __ipv6_ifa_notify() and dst_free() ?
From: Herbert Xu @ 2010-04-22 14:25 UTC (permalink / raw)
To: David Miller; +Cc: jbohac, yoshfuji, netdev, shemminger
In-Reply-To: <20100422.004324.67422011.davem@davemloft.net>
On Thu, Apr 22, 2010 at 12:43:24AM -0700, David Miller wrote:
>
> Thanks Herbert.
No worries :)
BTW similar races exist in other NDISC receive functions, but
it's too late today so I'll look at this tomorrow unless someone
else wants to have a go at this.
ipv6: Prevent DAD races
The NDISC receive path has not been written in a way that handles
unexpected packets properly.
For example, if we get two identical simultaneous NA/NS packets
that result in a DAD failure, we may try to delete the same address
twice.
A similar problem occurs when we get a DAD failure just as we're
about to mark an address as having passed DAD.
This patch fixes this by using the DADFAILED bit to synchronise
the two paths while holding the ifp lock. It relies on the fact
that the TENTATIVE bit is always set during DAD, and that the
DADFAILED bit is only set on failure.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index de7a194..1d15d5e 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -1401,6 +1401,16 @@ static void addrconf_dad_stop(struct inet6_ifaddr *ifp, int dad_failed)
void addrconf_dad_failure(struct inet6_ifaddr *ifp)
{
struct inet6_dev *idev = ifp->idev;
+ int ignore;
+
+ spin_lock(&ifp->lock);
+ ignore = (ifp->flags & (IFA_F_DADFAILED|IFA_F_TENTATIVE)) ^
+ IFA_F_TENTATIVE;
+ ifp->flags |= IFA_F_DADFAILED;
+ spin_unlock(&ifp->lock);
+
+ if (ignore)
+ return;
if (net_ratelimit())
printk(KERN_INFO "%s: IPv6 duplicate address %pI6c detected!\n",
@@ -2789,7 +2799,10 @@ static void addrconf_dad_start(struct inet6_ifaddr *ifp, u32 flags)
read_lock_bh(&idev->lock);
if (ifp->dead)
goto out;
+
spin_lock_bh(&ifp->lock);
+ if (ifp->flags & IFA_F_DADFAILED)
+ goto unlock_ifp;
if (dev->flags&(IFF_NOARP|IFF_LOOPBACK) ||
idev->cnf.accept_dad < 1 ||
@@ -2824,6 +2837,7 @@ static void addrconf_dad_start(struct inet6_ifaddr *ifp, u32 flags)
ip6_ins_rt(ifp->rt);
addrconf_dad_kick(ifp);
+unlock_ifp:
spin_unlock_bh(&ifp->lock);
out:
read_unlock_bh(&idev->lock);
@@ -2841,6 +2855,11 @@ static void addrconf_dad_timer(unsigned long data)
goto out;
}
spin_lock_bh(&ifp->lock);
+ if (ifp->flags & IFA_F_DADFAILED) {
+ spin_unlock_bh(&ifp->lock);
+ read_unlock_bh(&idev->lock);
+ goto out;
+ }
if (ifp->probes == 0) {
/*
* DAD was successful
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply related
* Re: [PATCH v5] net: batch skb dequeueing from softnet input_pkt_queue
From: Eric Dumazet @ 2010-04-22 14:24 UTC (permalink / raw)
To: Changli Gao; +Cc: David Miller, hadi, therbert, netdev
In-Reply-To: <t2t412e6f7f1004220527mb5340bebye1d1fda0963dcf2a@mail.gmail.com>
Le jeudi 22 avril 2010 à 20:27 +0800, Changli Gao a écrit :
> On Thu, Apr 22, 2010 at 5:43 PM, David Miller <davem@davemloft.net> wrote:
> > From: Changli Gao <xiaosuo@gmail.com>
> > Date: Thu, 22 Apr 2010 17:09:17 +0800
> >
> >> + unsigned int input_pkt_queue_len;
> >> + struct sk_buff *input_pkt_queue_head;
> >> + struct sk_buff **input_pkt_queue_tailp;
> >> +
> >
> > Please do not ignore Stephen Hemminger's feedback.
> >
> > We already have enough odd SKB queue implementations, we
> > do not need yet another one in a core location. This makes
> > it harder and harder to eventually convert sk_buff to use
> > "struct list_head".
> >
> > Instead, use "struct sk_buff_head" and the lockless accessors
> > (__skb_insert, etc.) and initializer (__skb_queue_head_init).
> >
>
> If I want to keep softnet_data small, I have to access the internal
> fields of sk_buff_head, and modify them in a hack way. It doesn't
> sound good. If not, softnet_data will become:
>
> struct softnet_data {
> struct Qdisc *output_queue;
> struct list_head poll_list;
> struct sk_buff *completion_queue;
> struct sk_buff_head process_queue;
>
> #ifdef CONFIG_RPS
> struct softnet_data *rps_ipi_list;
>
> /* Elements below can be accessed between CPUs for RPS */
> struct call_single_data csd ____cacheline_aligned_in_smp;
> struct softnet_data *rps_ipi_next;
> unsigned int cpu;
> unsigned int input_queue_head;
> #endif
> unsigned int input_pkt_queue_len;
> struct sk_buff_head input_pkt_queue;
> struct napi_struct backlog;
> };
>
> Eric, do you think it is too fat?
No it is not, as long as we are careful to separate cache lines.
By the way, input_pkt_queue_len is already in input_ptk_queue (.len)
Thanks
^ permalink raw reply
* Re: Bug#577640: linux-image-2.6.33-2-amd64: Kernel warnings in netns thread
From: Martín Ferrari @ 2010-04-22 14:05 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Ben Hutchings, 577640, netdev, Eric W. Biederman, Alexey Dobriyan,
Mathieu Lacage
In-Reply-To: <m1sk6oky96.fsf@fess.ebiederm.org>
On Thu, Apr 22, 2010 at 04:38, Eric W. Biederman <ebiederm@xmission.com> wrote:
>>> > $ sudo ./startns bash
>>> > # ip l a type veth
>>> > # ip l s veth0 netns 1
>>> > # exit
> Then I should ask what is startns?
That's just a simple C program that calls unshare(CLONE_NEWNET) and
execs a shell.
> Either that is doing something different from my equivalent program, or I have
> patches to fix this, that just haven't been merged yet.
I have just downloaded and compiled 2.6.32-2 and 2.6.34-rc5 from
kernel.org using the .config from the debian package, and the oops is
reproducible in both.
This small C file reproduces the error every time:
$ cat netnsoops.c
#include <stdio.h>
#include <stdlib.h>
#define _GNU_SOURCE
#include <sched.h>
int main(int argc, char *argv[])
{
int c;
unsigned long flags = CLONE_NEWNET;
if(unshare(flags) == -1) {
perror("unshare");
return 1;
}
system("ip link add name FOO type veth peer name BAR");
system("ip link set FOO netns 1");
system("ip link show");
return 0;
}
--
Martín Ferrari
^ permalink raw reply
* Re: DDoS attack causing bad effect on conntrack searches
From: Jesper Dangaard Brouer @ 2010-04-22 13:31 UTC (permalink / raw)
To: Changli Gao
Cc: Eric Dumazet, Patrick McHardy, Linux Kernel Network Hackers,
netfilter-devel, Paul E McKenney
In-Reply-To: <q2h412e6f7f1004220613m488c2ee4r6d24a8d1e65997d4@mail.gmail.com>
I have added a stats counter to prove my case, which I think we should add to the kernel (to detect the case in the future).
The DDoS attack has disappeared, so I guess I'll try to see if I can reproduce the problem in my testlab.
[PATCH] net: netfilter conntrack extended with extra stat counter.
From: Jesper Dangaard Brouer <hawk@comx.dk>
I suspect an unfortunatly series of events occuring under a DDoS
attack, in function __nf_conntrack_find() nf_contrack_core.c.
Adding a stats counter to see if the search is restarted too often.
Signed-off-by: Jesper Dangaard Brouer <hawk@comx.dk>
---
include/linux/netfilter/nf_conntrack_common.h | 1 +
.../netfilter/nf_conntrack_l3proto_ipv4_compat.c | 7 ++++---
net/netfilter/nf_conntrack_core.c | 4 +++-
net/netfilter/nf_conntrack_standalone.c | 7 ++++---
4 files changed, 12 insertions(+), 7 deletions(-)
diff --git a/include/linux/netfilter/nf_conntrack_common.h b/include/linux/netfilter/nf_conntrack_common.h
index a8248ee..57ff312 100644
--- a/include/linux/netfilter/nf_conntrack_common.h
+++ b/include/linux/netfilter/nf_conntrack_common.h
@@ -93,6 +93,7 @@ struct ip_conntrack_stat
unsigned int expect_new;
unsigned int expect_create;
unsigned int expect_delete;
+ unsigned int search_restart;
};
/* call to create an explicit dependency on nf_conntrack. */
diff --git a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c
index 8668a3d..b94f510 100644
--- a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c
+++ b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c
@@ -336,12 +336,12 @@ static int ct_cpu_seq_show(struct seq_file *seq, void *v)
const struct ip_conntrack_stat *st = v;
if (v == SEQ_START_TOKEN) {
- seq_printf(seq, "entries searched found new invalid ignore delete delete_list insert insert_failed drop early_drop icmp_error expect_new expect_create expect_delete\n");
+ seq_printf(seq, "entries searched found new invalid ignore delete delete_list insert insert_failed drop early_drop icmp_error expect_new expect_create expect_delete search_restart\n");
return 0;
}
seq_printf(seq, "%08x %08x %08x %08x %08x %08x %08x %08x "
- "%08x %08x %08x %08x %08x %08x %08x %08x \n",
+ "%08x %08x %08x %08x %08x %08x %08x %08x %08x\n",
nr_conntracks,
st->searched,
st->found,
@@ -358,7 +358,8 @@ static int ct_cpu_seq_show(struct seq_file *seq, void *v)
st->expect_new,
st->expect_create,
- st->expect_delete
+ st->expect_delete,
+ st->search_restart
);
return 0;
}
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 4299db7..5ca8286 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -315,8 +315,10 @@ begin:
* not the expected one, we must restart lookup.
* We probably met an item that was moved to another chain.
*/
- if (get_nulls_value(n) != hash)
+ if (get_nulls_value(n) != hash) {
+ NF_CT_STAT_INC(net, search_restart);
goto begin;
+ }
local_bh_enable();
return NULL;
diff --git a/net/netfilter/nf_conntrack_standalone.c b/net/netfilter/nf_conntrack_standalone.c
index 1935153..4812750 100644
--- a/net/netfilter/nf_conntrack_standalone.c
+++ b/net/netfilter/nf_conntrack_standalone.c
@@ -245,12 +245,12 @@ static int ct_cpu_seq_show(struct seq_file *seq, void *v)
const struct ip_conntrack_stat *st = v;
if (v == SEQ_START_TOKEN) {
- seq_printf(seq, "entries searched found new invalid ignore delete delete_list insert insert_failed drop early_drop icmp_error expect_new expect_create expect_delete\n");
+ seq_printf(seq, "entries searched found new invalid ignore delete delete_list insert insert_failed drop early_drop icmp_error expect_new expect_create expect_delete search_restart\n");
return 0;
}
seq_printf(seq, "%08x %08x %08x %08x %08x %08x %08x %08x "
- "%08x %08x %08x %08x %08x %08x %08x %08x \n",
+ "%08x %08x %08x %08x %08x %08x %08x %08x %08x\n",
nr_conntracks,
st->searched,
st->found,
@@ -267,7 +267,8 @@ static int ct_cpu_seq_show(struct seq_file *seq, void *v)
st->expect_new,
st->expect_create,
- st->expect_delete
+ st->expect_delete,
+ st->search_restart
);
return 0;
}
--
Med venlig hilsen / Best regards
Jesper Brouer
ComX Networks A/S
Linux Network Kernel Developer
Cand. Scient Datalog / MSc.CS
Author of http://adsl-optimizer.dk
LinkedIn: http://www.linkedin.com/in/brouer
^ permalink raw reply related
* [PATCHv2] Socket filter ancilliary data access for skb->dev->type
From: Paul LeoNerd Evans @ 2010-04-22 13:32 UTC (permalink / raw)
To: netdev
[-- Attachment #1: Type: text/plain, Size: 1712 bytes --]
Add an SKF_AD_HATYPE field to the packet ancilliary data area, giving
access to skb->dev->type, as reported in the sll_hatype field.
When capturing packets on a PF_PACKET/SOCK_RAW socket bound to all
interfaces, there doesn't appear to be a way for the filter program to
actually find out the underlying hardware type the packet was captured
on. This patch adds such ability.
This patch also handles the case where skb->dev can be NULL, such as on
netlink sockets.
Signed-off-by: Paul Evans <leonerd@leonerd.org.uk>
---
diff -ur linux-2.6.33.2.orig/include/linux/filter.h linux-2.6.33.2/include/linux/filter.h
--- linux-2.6.33.2.orig/include/linux/filter.h 2010-04-02 00:02:33.000000000 +0100
+++ linux-2.6.33.2/include/linux/filter.h 2010-04-20 22:40:25.000000000 +0100
@@ -123,7 +123,8 @@
#define SKF_AD_NLATTR_NEST 16
#define SKF_AD_MARK 20
#define SKF_AD_QUEUE 24
-#define SKF_AD_MAX 28
+#define SKF_AD_HATYPE 28
+#define SKF_AD_MAX 32
#define SKF_NET_OFF (-0x100000)
#define SKF_LL_OFF (-0x200000)
diff -ur linux-2.6.33.2.orig/net/core/filter.c linux-2.6.33.2/net/core/filter.c
--- linux-2.6.33.2.orig/net/core/filter.c 2010-04-02 00:02:33.000000000 +0100
+++ linux-2.6.33.2/net/core/filter.c 2010-04-22 14:19:24.000000000 +0100
@@ -301,6 +301,8 @@
A = skb->pkt_type;
continue;
case SKF_AD_IFINDEX:
+ if (!skb->dev)
+ return 0;
A = skb->dev->ifindex;
continue;
case SKF_AD_MARK:
@@ -309,6 +311,11 @@
case SKF_AD_QUEUE:
A = skb->queue_mapping;
continue;
+ case SKF_AD_HATYPE:
+ if (!skb->dev)
+ return 0;
+ A = skb->dev->type;
+ continue;
case SKF_AD_NLATTR: {
struct nlattr *nla;
[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 190 bytes --]
^ permalink raw reply
* Re: DDoS attack causing bad effect on conntrack searches
From: Patrick McHardy @ 2010-04-22 13:17 UTC (permalink / raw)
To: Changli Gao
Cc: hawk, Eric Dumazet, Linux Kernel Network Hackers, netfilter-devel,
Paul E McKenney
In-Reply-To: <q2h412e6f7f1004220613m488c2ee4r6d24a8d1e65997d4@mail.gmail.com>
Changli Gao wrote:
> On Thu, Apr 22, 2010 at 8:58 PM, Jesper Dangaard Brouer <hawk@comx.dk> wrote:
>> At an unnamed ISP, we experienced a DDoS attack against one of our
>> customers. This attack also caused problems for one of our Linux
>> based routers.
>>
>> The attack was "only" generating 300 kpps (packets per sec), which
>> usually isn't a problem for this (fairly old) Linux Router. But the
>> conntracking system chocked and reduced pps processing power to
>> 40kpps.
>>
>> I do extensive RRD/graph monitoring of the machines. The IP conntrack
>> searches in the period exploded, to a stunning 700.000 searches per
>> sec.
>>
>> http://people.netfilter.org/hawk/DDoS/2010-04-12__001/conntrack_searches001.png
>>
>> First I though it might be caused by bad hashing, but after reading
>> the kernel code (func: __nf_conntrack_find()), I think its caused by
>> the loop restart (goto begin) of the conntrack search, running under
>> local_bh_disable(). These RCU changes to conntrack were introduced in
>> ea781f19 by Eric Dumazet.
>>
>> Code: net/netfilter/nf_conntrack_core.c
>> Func: __nf_conntrack_find()
>>
>> struct nf_conntrack_tuple_hash *
>> __nf_conntrack_find(struct net *net, const struct nf_conntrack_tuple *tuple)
>> {
>> struct nf_conntrack_tuple_hash *h;
>> struct hlist_nulls_node *n;
>> unsigned int hash = hash_conntrack(tuple);
>>
>> /* Disable BHs the entire time since we normally need to disable them
>> * at least once for the stats anyway.
>> */
>> local_bh_disable();
>> begin:
>> hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnnode) {
>> if (nf_ct_tuple_equal(tuple, &h->tuple)) {
>> NF_CT_STAT_INC(net, found);
>> local_bh_enable();
>> return h;
>> }
>> NF_CT_STAT_INC(net, searched);
>> }
>> /*
>> * if the nulls value we got at the end of this lookup is
>> * not the expected one, we must restart lookup.
>> * We probably met an item that was moved to another chain.
>> */
>> if (get_nulls_value(n) != hash)
>> goto begin;
>> local_bh_enable();
>>
>
> We should add a retry limit there.
We can't do that since that would allow false negatives.
^ permalink raw reply
* Re: DDoS attack causing bad effect on conntrack searches
From: Changli Gao @ 2010-04-22 13:13 UTC (permalink / raw)
To: hawk
Cc: Eric Dumazet, Patrick McHardy, Linux Kernel Network Hackers,
netfilter-devel, Paul E McKenney
In-Reply-To: <1271941082.14501.189.camel@jdb-workstation>
On Thu, Apr 22, 2010 at 8:58 PM, Jesper Dangaard Brouer <hawk@comx.dk> wrote:
>
> At an unnamed ISP, we experienced a DDoS attack against one of our
> customers. This attack also caused problems for one of our Linux
> based routers.
>
> The attack was "only" generating 300 kpps (packets per sec), which
> usually isn't a problem for this (fairly old) Linux Router. But the
> conntracking system chocked and reduced pps processing power to
> 40kpps.
>
> I do extensive RRD/graph monitoring of the machines. The IP conntrack
> searches in the period exploded, to a stunning 700.000 searches per
> sec.
>
> http://people.netfilter.org/hawk/DDoS/2010-04-12__001/conntrack_searches001.png
>
> First I though it might be caused by bad hashing, but after reading
> the kernel code (func: __nf_conntrack_find()), I think its caused by
> the loop restart (goto begin) of the conntrack search, running under
> local_bh_disable(). These RCU changes to conntrack were introduced in
> ea781f19 by Eric Dumazet.
>
> Code: net/netfilter/nf_conntrack_core.c
> Func: __nf_conntrack_find()
>
> struct nf_conntrack_tuple_hash *
> __nf_conntrack_find(struct net *net, const struct nf_conntrack_tuple *tuple)
> {
> struct nf_conntrack_tuple_hash *h;
> struct hlist_nulls_node *n;
> unsigned int hash = hash_conntrack(tuple);
>
> /* Disable BHs the entire time since we normally need to disable them
> * at least once for the stats anyway.
> */
> local_bh_disable();
> begin:
> hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnnode) {
> if (nf_ct_tuple_equal(tuple, &h->tuple)) {
> NF_CT_STAT_INC(net, found);
> local_bh_enable();
> return h;
> }
> NF_CT_STAT_INC(net, searched);
> }
> /*
> * if the nulls value we got at the end of this lookup is
> * not the expected one, we must restart lookup.
> * We probably met an item that was moved to another chain.
> */
> if (get_nulls_value(n) != hash)
> goto begin;
> local_bh_enable();
>
We should add a retry limit there.
--
Regards,
Changli Gao(xiaosuo@gmail.com)
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH] Socket filter ancilliary data access for skb->dev->type
From: Patrick McHardy @ 2010-04-22 13:13 UTC (permalink / raw)
To: Paul LeoNerd Evans; +Cc: netdev
In-Reply-To: <20100422131105.GS19334@cel.leo>
Paul LeoNerd Evans wrote:
> On Thu, Apr 22, 2010 at 02:28:46PM +0200, Patrick McHardy wrote:
>> I think we should be adding a check whether skb->dev is non-NULL here
>> since filters can also be attached to netlink sockets. The same applies
>> to SKF_AD_IFINDEX.
>
> What should the appropriate behaviour be here? Set A to some rogue value
> - 0 or -1 seem appropriate? Or, abort the filter entirely (such as in
> e.g. divide-by-zero, or invalid memory buffer access)?
>
> Either way that sounds simple enough, I can hack that in and resubmit.
I'd say we should abort execution.
^ permalink raw reply
* Re: [PATCH] Socket filter ancilliary data access for skb->dev->type
From: Paul LeoNerd Evans @ 2010-04-22 13:11 UTC (permalink / raw)
To: Patrick McHardy; +Cc: netdev
In-Reply-To: <4BD040FE.3000809@trash.net>
[-- Attachment #1: Type: text/plain, Size: 644 bytes --]
On Thu, Apr 22, 2010 at 02:28:46PM +0200, Patrick McHardy wrote:
> I think we should be adding a check whether skb->dev is non-NULL here
> since filters can also be attached to netlink sockets. The same applies
> to SKF_AD_IFINDEX.
What should the appropriate behaviour be here? Set A to some rogue value
- 0 or -1 seem appropriate? Or, abort the filter entirely (such as in
e.g. divide-by-zero, or invalid memory buffer access)?
Either way that sounds simple enough, I can hack that in and resubmit.
--
Paul "LeoNerd" Evans
leonerd@leonerd.org.uk
ICQ# 4135350 | Registered Linux# 179460
http://www.leonerd.org.uk/
[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 190 bytes --]
^ permalink raw reply
* Re: [PATCH v5] net: batch skb dequeueing from softnet input_pkt_queue
From: Changli Gao @ 2010-04-22 13:06 UTC (permalink / raw)
To: Eric Dumazet; +Cc: David S. Miller, jamal, Tom Herbert, netdev
In-Reply-To: <1271936227.7895.5285.camel@edumazet-laptop>
On Thu, Apr 22, 2010 at 7:37 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> Please reorder things better.
>
> Most likely this function is called for one packet.
>
> In your version you take twice the rps_lock()/rps_unlock() path, so
> it'll be slower.
>
> Once to 'transfert' one list to process list
>
> Once to be able to do the 'label out:' post processing.
>
It doesn't make any difference. We have to hold rps_lock to update
input_pkt_queue_len, if we don't use another variable to record the
length of the process queue, or atomic variable.
I think it is better that using another variable to record the length
of the process queue, and updating it before process_backlog()
returns. For one packet, there is only one locking/unlocking. There is
only one issue you concerned: cache miss due to sum the two queues'
length. I'll change softnet_data to:
struct softnet_data {
struct Qdisc *output_queue;
struct list_head poll_list;
struct sk_buff *completion_queue;
struct sk_buff_head process_queue;
#ifdef CONFIG_RPS
struct softnet_data *rps_ipi_list;
/* Elements below can be accessed between CPUs for RPS */
struct call_single_data csd ____cacheline_aligned_in_smp;
struct softnet_data *rps_ipi_next;
unsigned int cpu;
unsigned int input_queue_head;
#endif
unsigned int process_queue_len;
struct sk_buff_head input_pkt_queue;
struct napi_struct backlog;
};
For one packets, we have to update process_queue_len in any way. For
more packets, we only change process_queue_len just before
process_backlog() returns. It means that process_queue_len change is
batched.
--
Regards,
Changli Gao(xiaosuo@gmail.com)
^ permalink raw reply
* DDoS attack causing bad effect on conntrack searches
From: Jesper Dangaard Brouer @ 2010-04-22 12:58 UTC (permalink / raw)
To: Eric Dumazet, Patrick McHardy
Cc: Linux Kernel Network Hackers, netfilter-devel, Paul E McKenney
At an unnamed ISP, we experienced a DDoS attack against one of our
customers. This attack also caused problems for one of our Linux
based routers.
The attack was "only" generating 300 kpps (packets per sec), which
usually isn't a problem for this (fairly old) Linux Router. But the
conntracking system chocked and reduced pps processing power to
40kpps.
I do extensive RRD/graph monitoring of the machines. The IP conntrack
searches in the period exploded, to a stunning 700.000 searches per
sec.
http://people.netfilter.org/hawk/DDoS/2010-04-12__001/conntrack_searches001.png
First I though it might be caused by bad hashing, but after reading
the kernel code (func: __nf_conntrack_find()), I think its caused by
the loop restart (goto begin) of the conntrack search, running under
local_bh_disable(). These RCU changes to conntrack were introduced in
ea781f19 by Eric Dumazet.
Code: net/netfilter/nf_conntrack_core.c
Func: __nf_conntrack_find()
struct nf_conntrack_tuple_hash *
__nf_conntrack_find(struct net *net, const struct nf_conntrack_tuple *tuple)
{
struct nf_conntrack_tuple_hash *h;
struct hlist_nulls_node *n;
unsigned int hash = hash_conntrack(tuple);
/* Disable BHs the entire time since we normally need to disable them
* at least once for the stats anyway.
*/
local_bh_disable();
begin:
hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnnode) {
if (nf_ct_tuple_equal(tuple, &h->tuple)) {
NF_CT_STAT_INC(net, found);
local_bh_enable();
return h;
}
NF_CT_STAT_INC(net, searched);
}
/*
* if the nulls value we got at the end of this lookup is
* not the expected one, we must restart lookup.
* We probably met an item that was moved to another chain.
*/
if (get_nulls_value(n) != hash)
goto begin;
local_bh_enable();
return NULL;
}
>From the graphs:
http://people.netfilter.org/hawk/DDoS/2010-04-12__001/list.html
Its possible to see, that the problems are most likely caused by the
number of conntrack elements being deleted.
http://people.netfilter.org/hawk/DDoS/2010-04-12__001/conntrack_delete001.png
If you look closely at the graphs, you should be able to see, that
CPU1 is doing all the conntrack "searches", and CPU2 is doing most of
the conntrack "deletes" (and CPU1 is creating a lot of new entries).
The question is, how do we avoid this unfortunately behavior of the
delete process disturbing the search process (causing it into
looping)?
--
Med venlig hilsen / Best regards
Jesper Brouer
ComX Networks A/S
Linux Network Kernel Developer
Cand. Scient Datalog / MSc.CS
Author of http://adsl-optimizer.dk
LinkedIn: http://www.linkedin.com/in/brouer
Extra info: Conntrack tuning
-----------
I have tuned the conntrack system on these hosts. Firstly I have
increased the number of hash buckets for the conntrack system to
around 300.000.
cat /sys/module/nf_conntrack/parameters/hashsize
300032
Next I have increased the max conntracking elements to 900.000.
cat /proc/sys/net/nf_conntrack_max
900000
^ permalink raw reply
* [PATCH NEXT 7/8] qlcnic: protect resource access
From: Amit Kumar Salecha @ 2010-04-22 12:51 UTC (permalink / raw)
To: davem; +Cc: netdev, ameen.rahman
In-Reply-To: <1271940702-17064-1-git-send-email-amit.salecha@qlogic.com>
We do netif_device_attach, even if resource allocation fails.
Driver callbacks can be called, if device is attached.
All these callbacks need to be protected by ADAPTER_UP_MAGIC check.
Signed-off-by: Amit Kumar Salecha <amit.salecha@qlogic.com>
---
drivers/net/qlcnic/qlcnic_main.c | 11 +++++++++++
1 files changed, 11 insertions(+), 0 deletions(-)
diff --git a/drivers/net/qlcnic/qlcnic_main.c b/drivers/net/qlcnic/qlcnic_main.c
index bfc5510..ee573fe 100644
--- a/drivers/net/qlcnic/qlcnic_main.c
+++ b/drivers/net/qlcnic/qlcnic_main.c
@@ -208,6 +208,9 @@ qlcnic_napi_enable(struct qlcnic_adapter *adapter)
struct qlcnic_host_sds_ring *sds_ring;
struct qlcnic_recv_context *recv_ctx = &adapter->recv_ctx;
+ if (adapter->is_up != QLCNIC_ADAPTER_UP_MAGIC)
+ return;
+
for (ring = 0; ring < adapter->max_sds_rings; ring++) {
sds_ring = &recv_ctx->sds_rings[ring];
napi_enable(&sds_ring->napi);
@@ -222,6 +225,9 @@ qlcnic_napi_disable(struct qlcnic_adapter *adapter)
struct qlcnic_host_sds_ring *sds_ring;
struct qlcnic_recv_context *recv_ctx = &adapter->recv_ctx;
+ if (adapter->is_up != QLCNIC_ADAPTER_UP_MAGIC)
+ return;
+
for (ring = 0; ring < adapter->max_sds_rings; ring++) {
sds_ring = &recv_ctx->sds_rings[ring];
qlcnic_disable_int(sds_ring);
@@ -1573,6 +1579,11 @@ qlcnic_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
int frag_count, no_of_desc;
u32 num_txd = tx_ring->num_desc;
+ if (!test_bit(__QLCNIC_DEV_UP, &adapter->state)) {
+ netif_stop_queue(netdev);
+ return NETDEV_TX_BUSY;
+ }
+
frag_count = skb_shinfo(skb)->nr_frags + 1;
/* 4 fragments per cmd des */
--
1.6.0.2
^ permalink raw reply related
* [PATCH NEXT 6/8] qlcnic: fix rcv buffer leak
From: Amit Kumar Salecha @ 2010-04-22 12:51 UTC (permalink / raw)
To: davem; +Cc: netdev, ameen.rahman
In-Reply-To: <1271940702-17064-1-git-send-email-amit.salecha@qlogic.com>
Rcv producer value should be read in spin-lock.
Signed-off-by: Amit Kumar Salecha <amit.salecha@qlogic.com>
---
drivers/net/qlcnic/qlcnic_init.c | 8 +++++---
1 files changed, 5 insertions(+), 3 deletions(-)
diff --git a/drivers/net/qlcnic/qlcnic_init.c b/drivers/net/qlcnic/qlcnic_init.c
index 9ef9f58..1b621ca 100644
--- a/drivers/net/qlcnic/qlcnic_init.c
+++ b/drivers/net/qlcnic/qlcnic_init.c
@@ -1554,9 +1554,10 @@ qlcnic_post_rx_buffers(struct qlcnic_adapter *adapter, u32 ringid,
int producer, count = 0;
struct list_head *head;
+ spin_lock(&rds_ring->lock);
+
producer = rds_ring->producer;
- spin_lock(&rds_ring->lock);
head = &rds_ring->free_list;
while (!list_empty(head)) {
@@ -1578,13 +1579,13 @@ qlcnic_post_rx_buffers(struct qlcnic_adapter *adapter, u32 ringid,
producer = get_next_index(producer, rds_ring->num_desc);
}
- spin_unlock(&rds_ring->lock);
if (count) {
rds_ring->producer = producer;
writel((producer-1) & (rds_ring->num_desc-1),
rds_ring->crb_rcv_producer);
}
+ spin_unlock(&rds_ring->lock);
}
static void
@@ -1596,10 +1597,11 @@ qlcnic_post_rx_buffers_nodb(struct qlcnic_adapter *adapter,
int producer, count = 0;
struct list_head *head;
- producer = rds_ring->producer;
if (!spin_trylock(&rds_ring->lock))
return;
+ producer = rds_ring->producer;
+
head = &rds_ring->free_list;
while (!list_empty(head)) {
--
1.6.0.2
^ permalink raw reply related
* [PATCH NEXT 8/8] qlcnic: update version 5.0.2
From: Amit Kumar Salecha @ 2010-04-22 12:51 UTC (permalink / raw)
To: davem; +Cc: netdev, ameen.rahman
In-Reply-To: <1271940702-17064-1-git-send-email-amit.salecha@qlogic.com>
Update version to indicate IDC(fw recovery) changes.
Signed-off-by: Amit Kumar Salecha <amit.salecha@qlogic.com>
---
drivers/net/qlcnic/qlcnic.h | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/net/qlcnic/qlcnic.h b/drivers/net/qlcnic/qlcnic.h
index 6c1da71..2fba9cd 100644
--- a/drivers/net/qlcnic/qlcnic.h
+++ b/drivers/net/qlcnic/qlcnic.h
@@ -51,8 +51,8 @@
#define _QLCNIC_LINUX_MAJOR 5
#define _QLCNIC_LINUX_MINOR 0
-#define _QLCNIC_LINUX_SUBVERSION 1
-#define QLCNIC_LINUX_VERSIONID "5.0.1"
+#define _QLCNIC_LINUX_SUBVERSION 2
+#define QLCNIC_LINUX_VERSIONID "5.0.2"
#define QLCNIC_VERSION_CODE(a, b, c) (((a) << 24) + ((b) << 16) + (c))
#define _major(v) (((v) >> 24) & 0xff)
--
1.6.0.2
^ permalink raw reply related
* [PATCH NEXT 3/8] qlcnic: fix fw initialization responsibility
From: Amit Kumar Salecha @ 2010-04-22 12:51 UTC (permalink / raw)
To: davem; +Cc: netdev, ameen.rahman
In-Reply-To: <1271940702-17064-1-git-send-email-amit.salecha@qlogic.com>
Now any pci-func can start fw, whoever sees the reset ack first.
Before this, pci-func which sets the RESET state has the responsibility
to start fw.
Signed-off-by: Amit Kumar Salecha <amit.salecha@qlogic.com>
---
drivers/net/qlcnic/qlcnic_main.c | 66 +++++++++++++++++++++----------------
1 files changed, 37 insertions(+), 29 deletions(-)
diff --git a/drivers/net/qlcnic/qlcnic_main.c b/drivers/net/qlcnic/qlcnic_main.c
index ff7705b..0634990 100644
--- a/drivers/net/qlcnic/qlcnic_main.c
+++ b/drivers/net/qlcnic/qlcnic_main.c
@@ -2015,6 +2015,7 @@ err:
clear_bit(__QLCNIC_RESETTING, &adapter->state);
}
+/* Grab api lock, before checking state */
static int
qlcnic_check_drv_state(struct qlcnic_adapter *adapter)
{
@@ -2037,6 +2038,9 @@ qlcnic_can_start_firmware(struct qlcnic_adapter *adapter)
u8 dev_init_timeo = adapter->dev_init_timeo;
int portnum = adapter->portnum;
+ if (test_and_clear_bit(__QLCNIC_START_FW, &adapter->state))
+ return 1;
+
if (qlcnic_api_lock(adapter))
return -1;
@@ -2044,8 +2048,6 @@ qlcnic_can_start_firmware(struct qlcnic_adapter *adapter)
if (!(val & ((int)0x1 << (portnum * 4)))) {
val |= ((u32)0x1 << (portnum * 4));
QLCWR32(adapter, QLCNIC_CRB_DEV_REF_COUNT, val);
- } else if (test_and_clear_bit(__QLCNIC_START_FW, &adapter->state)) {
- goto start_fw;
}
prev_state = QLCRD32(adapter, QLCNIC_CRB_DEV_STATE);
@@ -2053,7 +2055,6 @@ qlcnic_can_start_firmware(struct qlcnic_adapter *adapter)
switch (prev_state) {
case QLCNIC_DEV_COLD:
-start_fw:
QLCWR32(adapter, QLCNIC_CRB_DEV_STATE, QLCNIC_DEV_INITIALIZING);
qlcnic_api_unlock(adapter);
return 1;
@@ -2113,51 +2114,59 @@ qlcnic_fwinit_work(struct work_struct *work)
{
struct qlcnic_adapter *adapter = container_of(work,
struct qlcnic_adapter, fw_work.work);
- int dev_state;
+ u32 dev_state = 0xf;
- if (test_bit(__QLCNIC_START_FW, &adapter->state)) {
+ if (qlcnic_api_lock(adapter))
+ goto err_ret;
- if (qlcnic_check_drv_state(adapter) &&
- (adapter->fw_wait_cnt++ < adapter->reset_ack_timeo)) {
- qlcnic_schedule_work(adapter,
- qlcnic_fwinit_work, FW_POLL_DELAY);
- return;
+ if (adapter->fw_wait_cnt++ > adapter->reset_ack_timeo) {
+ dev_err(&adapter->pdev->dev, "Reset:Failed to get ack %d sec\n",
+ adapter->reset_ack_timeo);
+ goto skip_ack_check;
+ }
+
+ if (!qlcnic_check_drv_state(adapter)) {
+skip_ack_check:
+ dev_state = QLCRD32(adapter, QLCNIC_CRB_DEV_STATE);
+ if (dev_state == QLCNIC_DEV_NEED_RESET) {
+ QLCWR32(adapter, QLCNIC_CRB_DEV_STATE,
+ QLCNIC_DEV_INITIALIZING);
+ set_bit(__QLCNIC_START_FW, &adapter->state);
+ QLCDB(adapter, DRV, "Restarting fw\n");
}
- QLCDB(adapter, DRV, "Resetting FW\n");
+ qlcnic_api_unlock(adapter);
+
if (!qlcnic_start_firmware(adapter)) {
qlcnic_schedule_work(adapter, qlcnic_attach_work, 0);
return;
}
-
goto err_ret;
}
- if (adapter->fw_wait_cnt++ > (adapter->dev_init_timeo / 2)) {
- dev_err(&adapter->pdev->dev,
- "Waiting for device to reset timeout\n");
- goto err_ret;
- }
+ qlcnic_api_unlock(adapter);
dev_state = QLCRD32(adapter, QLCNIC_CRB_DEV_STATE);
- QLCDB(adapter, HW, "Func waiting: Device state=%d\n", dev_state);
+ QLCDB(adapter, HW, "Func waiting: Device state=%u\n", dev_state);
switch (dev_state) {
- case QLCNIC_DEV_READY:
- if (!qlcnic_start_firmware(adapter)) {
- qlcnic_schedule_work(adapter, qlcnic_attach_work, 0);
- return;
- }
+ case QLCNIC_DEV_NEED_RESET:
+ qlcnic_schedule_work(adapter,
+ qlcnic_fwinit_work, FW_POLL_DELAY);
+ return;
case QLCNIC_DEV_FAILED:
break;
default:
- qlcnic_schedule_work(adapter,
- qlcnic_fwinit_work, 2 * FW_POLL_DELAY);
- return;
+ if (!qlcnic_start_firmware(adapter)) {
+ qlcnic_schedule_work(adapter, qlcnic_attach_work, 0);
+ return;
+ }
}
err_ret:
+ dev_err(&adapter->pdev->dev, "Fwinit work failed state=%u "
+ "fw_wait_cnt=%u\n", dev_state, adapter->fw_wait_cnt);
netif_device_attach(adapter->netdev);
qlcnic_clr_all_drv_state(adapter);
}
@@ -2202,6 +2211,7 @@ err_ret:
}
+/*Transit to RESET state from READY state only */
static void
qlcnic_dev_request_reset(struct qlcnic_adapter *adapter)
{
@@ -2212,10 +2222,8 @@ qlcnic_dev_request_reset(struct qlcnic_adapter *adapter)
state = QLCRD32(adapter, QLCNIC_CRB_DEV_STATE);
- if (state != QLCNIC_DEV_INITIALIZING &&
- state != QLCNIC_DEV_NEED_RESET) {
+ if (state == QLCNIC_DEV_READY) {
QLCWR32(adapter, QLCNIC_CRB_DEV_STATE, QLCNIC_DEV_NEED_RESET);
- set_bit(__QLCNIC_START_FW, &adapter->state);
QLCDB(adapter, DRV, "NEED_RESET state set\n");
}
--
1.6.0.2
^ permalink raw reply related
* [PATCH NEXT 2/8] qlcnic: fix defines as per IDC document
From: Amit Kumar Salecha @ 2010-04-22 12:51 UTC (permalink / raw)
To: davem; +Cc: netdev, ameen.rahman, Anirban Chakraborty
In-Reply-To: <1271940702-17064-1-git-send-email-amit.salecha@qlogic.com>
Different class of drivers co-exist for CNA device,
there is some minimal interaction that will be required amongst
the drivers for performing some device level operations.
All the driver should follow inter driver coexistence document.
Fixing polling interval and spelling mistake.
Signed-off-by: Anirban Chakraborty <anirban.chakraborty@qlogic.com>
Signed-off-by: Amit Kumar Salecha <amit.salecha@qlogic.com>
---
drivers/net/qlcnic/qlcnic_hdr.h | 22 +++++++++++-----------
drivers/net/qlcnic/qlcnic_main.c | 9 +++++++--
2 files changed, 18 insertions(+), 13 deletions(-)
diff --git a/drivers/net/qlcnic/qlcnic_hdr.h b/drivers/net/qlcnic/qlcnic_hdr.h
index 51fa3fb..8285a06 100644
--- a/drivers/net/qlcnic/qlcnic_hdr.h
+++ b/drivers/net/qlcnic/qlcnic_hdr.h
@@ -694,17 +694,18 @@ enum {
#define QLCNIC_CRB_DRV_STATE (QLCNIC_CAM_RAM(0x144))
#define QLCNIC_CRB_DRV_SCRATCH (QLCNIC_CAM_RAM(0x148))
#define QLCNIC_CRB_DEV_PARTITION_INFO (QLCNIC_CAM_RAM(0x14c))
-#define QLCNIC_CRB_DRV_IDC_VER (QLCNIC_CAM_RAM(0x14c))
+#define QLCNIC_CRB_DRV_IDC_VER (QLCNIC_CAM_RAM(0x174))
#define QLCNIC_ROM_DEV_INIT_TIMEOUT (0x3e885c)
#define QLCNIC_ROM_DRV_RESET_TIMEOUT (0x3e8860)
- /* Device State */
-#define QLCNIC_DEV_COLD 1
-#define QLCNIC_DEV_INITALIZING 2
-#define QLCNIC_DEV_READY 3
-#define QLCNIC_DEV_NEED_RESET 4
-#define QLCNIC_DEV_NEED_QUISCENT 5
-#define QLCNIC_DEV_FAILED 6
+/* Device State */
+#define QLCNIC_DEV_COLD 0x1
+#define QLCNIC_DEV_INITIALIZING 0x2
+#define QLCNIC_DEV_READY 0x3
+#define QLCNIC_DEV_NEED_RESET 0x4
+#define QLCNIC_DEV_NEED_QUISCENT 0x5
+#define QLCNIC_DEV_FAILED 0x6
+#define QLCNIC_DEV_QUISCENT 0x7
#define QLCNIC_RCODE_DRIVER_INFO 0x20000000
#define QLCNIC_RCODE_DRIVER_CAN_RELOAD 0x40000000
@@ -712,9 +713,8 @@ enum {
#define QLCNIC_FWERROR_PEGNUM(code) ((code) & 0xff)
#define QLCNIC_FWERROR_CODE(code) ((code >> 8) & 0xfffff)
-#define FW_POLL_DELAY (2 * HZ)
-#define FW_FAIL_THRESH 3
-#define FW_POLL_THRESH 10
+#define FW_POLL_DELAY (1 * HZ)
+#define FW_FAIL_THRESH 2
#define ISR_MSI_INT_TRIGGER(FUNC) (QLCNIC_PCIX_PS_REG(PCIX_MSI_F(FUNC)))
#define ISR_LEGACY_INT_TRIGGERED(VAL) (((VAL) & 0x300) == 0x200)
diff --git a/drivers/net/qlcnic/qlcnic_main.c b/drivers/net/qlcnic/qlcnic_main.c
index 5845dc0..ff7705b 100644
--- a/drivers/net/qlcnic/qlcnic_main.c
+++ b/drivers/net/qlcnic/qlcnic_main.c
@@ -2054,7 +2054,7 @@ qlcnic_can_start_firmware(struct qlcnic_adapter *adapter)
switch (prev_state) {
case QLCNIC_DEV_COLD:
start_fw:
- QLCWR32(adapter, QLCNIC_CRB_DEV_STATE, QLCNIC_DEV_INITALIZING);
+ QLCWR32(adapter, QLCNIC_CRB_DEV_STATE, QLCNIC_DEV_INITIALIZING);
qlcnic_api_unlock(adapter);
return 1;
@@ -2077,6 +2077,10 @@ start_fw:
case QLCNIC_DEV_FAILED:
qlcnic_api_unlock(adapter);
return -1;
+
+ case QLCNIC_DEV_INITIALIZING:
+ case QLCNIC_DEV_QUISCENT:
+ break;
}
qlcnic_api_unlock(adapter);
@@ -2208,7 +2212,8 @@ qlcnic_dev_request_reset(struct qlcnic_adapter *adapter)
state = QLCRD32(adapter, QLCNIC_CRB_DEV_STATE);
- if (state != QLCNIC_DEV_INITALIZING && state != QLCNIC_DEV_NEED_RESET) {
+ if (state != QLCNIC_DEV_INITIALIZING &&
+ state != QLCNIC_DEV_NEED_RESET) {
QLCWR32(adapter, QLCNIC_CRB_DEV_STATE, QLCNIC_DEV_NEED_RESET);
set_bit(__QLCNIC_START_FW, &adapter->state);
QLCDB(adapter, DRV, "NEED_RESET state set\n");
--
1.6.0.2
^ permalink raw reply related
* [PATCH NEXT 1/8] qlcnic: additional driver statistics
From: Amit Kumar Salecha @ 2010-04-22 12:51 UTC (permalink / raw)
To: davem; +Cc: netdev, ameen.rahman
In-Reply-To: <1271940702-17064-1-git-send-email-amit.salecha@qlogic.com>
Added additional driver statistics to track errors in rcv/tx path.
Signed-off-by: Amit Kumar Salecha <amit.salecha@qlogic.com>
---
drivers/net/qlcnic/qlcnic.h | 4 ++++
drivers/net/qlcnic/qlcnic_ethtool.c | 8 ++++++++
drivers/net/qlcnic/qlcnic_init.c | 7 ++++++-
drivers/net/qlcnic/qlcnic_main.c | 4 +++-
4 files changed, 21 insertions(+), 2 deletions(-)
diff --git a/drivers/net/qlcnic/qlcnic.h b/drivers/net/qlcnic/qlcnic.h
index 28c148c..6c1da71 100644
--- a/drivers/net/qlcnic/qlcnic.h
+++ b/drivers/net/qlcnic/qlcnic.h
@@ -428,6 +428,10 @@ struct qlcnic_adapter_stats {
u64 xmit_on;
u64 xmit_off;
u64 skb_alloc_failure;
+ u64 null_skb;
+ u64 null_rxbuf;
+ u64 rx_dma_map_error;
+ u64 tx_dma_map_error;
};
/*
diff --git a/drivers/net/qlcnic/qlcnic_ethtool.c b/drivers/net/qlcnic/qlcnic_ethtool.c
index 08d6f10..6cdc5eb 100644
--- a/drivers/net/qlcnic/qlcnic_ethtool.c
+++ b/drivers/net/qlcnic/qlcnic_ethtool.c
@@ -69,6 +69,14 @@ static const struct qlcnic_stats qlcnic_gstrings_stats[] = {
QLC_SIZEOF(stats.xmit_off), QLC_OFF(stats.xmit_off)},
{"skb_alloc_failure", QLC_SIZEOF(stats.skb_alloc_failure),
QLC_OFF(stats.skb_alloc_failure)},
+ {"null skb",
+ QLC_SIZEOF(stats.null_skb), QLC_OFF(stats.null_skb)},
+ {"null rxbuf",
+ QLC_SIZEOF(stats.null_rxbuf), QLC_OFF(stats.null_rxbuf)},
+ {"rx dma map error", QLC_SIZEOF(stats.rx_dma_map_error),
+ QLC_OFF(stats.rx_dma_map_error)},
+ {"tx dma map error", QLC_SIZEOF(stats.tx_dma_map_error),
+ QLC_OFF(stats.tx_dma_map_error)},
};
diff --git a/drivers/net/qlcnic/qlcnic_init.c b/drivers/net/qlcnic/qlcnic_init.c
index 01ce74e..9ef9f58 100644
--- a/drivers/net/qlcnic/qlcnic_init.c
+++ b/drivers/net/qlcnic/qlcnic_init.c
@@ -1287,6 +1287,7 @@ qlcnic_alloc_rx_skb(struct qlcnic_adapter *adapter,
rds_ring->dma_size, PCI_DMA_FROMDEVICE);
if (pci_dma_mapping_error(pdev, dma)) {
+ adapter->stats.rx_dma_map_error++;
dev_kfree_skb_any(skb);
buffer->skb = NULL;
return -ENOMEM;
@@ -1311,8 +1312,10 @@ static struct sk_buff *qlcnic_process_rxbuf(struct qlcnic_adapter *adapter,
PCI_DMA_FROMDEVICE);
skb = buffer->skb;
- if (!skb)
+ if (!skb) {
+ adapter->stats.null_skb++;
goto no_skb;
+ }
if (likely(adapter->rx_csum && cksum == STATUS_CKSUM_OK)) {
adapter->stats.csummed++;
@@ -1502,6 +1505,8 @@ qlcnic_process_rcv_ring(struct qlcnic_host_sds_ring *sds_ring, int max)
if (rxbuf)
list_add_tail(&rxbuf->list, &sds_ring->free_list[ring]);
+ else
+ adapter->stats.null_rxbuf++;
skip:
for (; desc_cnt > 0; desc_cnt--) {
diff --git a/drivers/net/qlcnic/qlcnic_main.c b/drivers/net/qlcnic/qlcnic_main.c
index e4fd5dc..5845dc0 100644
--- a/drivers/net/qlcnic/qlcnic_main.c
+++ b/drivers/net/qlcnic/qlcnic_main.c
@@ -1589,8 +1589,10 @@ qlcnic_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
pdev = adapter->pdev;
- if (qlcnic_map_tx_skb(pdev, skb, pbuf))
+ if (qlcnic_map_tx_skb(pdev, skb, pbuf)) {
+ adapter->stats.tx_dma_map_error++;
goto drop_packet;
+ }
pbuf->skb = skb;
pbuf->frag_count = frag_count;
--
1.6.0.2
^ permalink raw reply related
* [PATCH NEXT 0/8]qlcnic: inter driver coexistence fixes
From: Amit Kumar Salecha @ 2010-04-22 12:51 UTC (permalink / raw)
To: davem; +Cc: netdev, ameen.rahman
Hi
Different class of drivers co-exist for CNA device, there is some
minimal interaction that will be required amongst the drivers for
performing some device level operations. Operations such as
device initialization, device recovery and device diagnostics test,
can potentially affect other function drivers.
All these drivers should follow common protocol(IDC) to do such operation.
Fixing qlcnic driver according to Inter driver coexistence.
Sending series of 8 patches, please apply them in net-next.
-Amit
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox