* Re: DDoS attack causing bad effect on conntrack searches
From: Patrick McHardy @ 2010-04-23 10:56 UTC
To: Eric Dumazet
Cc: Changli Gao, hawk, Linux Kernel Network Hackers, netfilter-devel,
Paul E McKenney
In-Reply-To: <1271946961.7895.5665.camel@edumazet-laptop>
Eric Dumazet wrote:
> On Thursday, 22 April 2010 at 15:17 +0200, Patrick McHardy wrote:
>> Changli Gao wrote:
>>>> struct nf_conntrack_tuple_hash *
>>>> __nf_conntrack_find(struct net *net, const struct nf_conntrack_tuple *tuple)
>>>> ...
>>> We should add a retry limit there.
>> We can't do that since that would allow false negatives.
>
> If one hash slot is under attack, then there is a bug somewhere.
>
> If we cannot avoid this, we can fall back to a secure mode at the second
> retry, and take the spinlock.
>
> This way, most lookups stay lockless (one pass), and some might take
> the slot lock to avoid the possibility of a loop.
That sounds like a good idea. But let's wait for Jesper's test results
before we start fixing this problem :)
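
A minimal user-space sketch of the fallback discussed above, for illustration only. It models a chain that ends in a "nulls" marker carrying the bucket number, restarts the walk when the marker belongs to another bucket, and takes a lock once the lockless passes are used up. All names here (struct bucket, walk_once(), lookup(), MAX_LOCKLESS_PASSES) are hypothetical, the atomic_flag merely stands in for whatever lock serialises writers on that slot, and the model is single-threaded: it ignores RCU and memory ordering, which the real lookup depends on.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical chain model: the last "pointer" of a chain has bit 0 set
 * and carries the number of the bucket the chain belongs to. */
struct node {
	uintptr_t next;		/* next node, or a nulls marker */
	int key;
};

struct bucket {
	uintptr_t head;		/* first node, or a nulls marker */
	atomic_flag lock;	/* stand-in for the slot lock */
};

#define NULLS_MARKER(b)		(((uintptr_t)(b) << 1) | 1u)
#define MAX_LOCKLESS_PASSES	2	/* assumption: lock at the second retry */

static int is_nulls(uintptr_t p) { return (int)(p & 1u); }
static unsigned int nulls_value(uintptr_t p) { return (unsigned int)(p >> 1); }

/* One pass over a chain. Returns the match, or NULL with *end set to the
 * bucket number stored in the terminating nulls marker. */
static struct node *walk_once(uintptr_t p, int key, unsigned int *end)
{
	while (!is_nulls(p)) {
		struct node *n = (struct node *)p;

		if (n->key == key)
			return n;
		p = n->next;
	}
	*end = nulls_value(p);
	return NULL;
}

/* Lockless lookup with a restart counter and a locked last resort. */
static struct node *lookup(struct bucket *tbl, unsigned int b, int key,
			   unsigned int *restarts)
{
	unsigned int end = b;
	struct node *n;

	assert(tbl != NULL && restarts != NULL);
	*restarts = 0;
	for (;;) {
		n = walk_once(tbl[b].head, key, &end);
		if (n || end == b)
			return n;	/* hit, or a miss that ended in our own chain */
		if (++*restarts >= MAX_LOCKLESS_PASSES)
			break;		/* walked into another chain too often */
	}

	/* Second retry: exclude writers so the walk cannot be diverted. */
	while (atomic_flag_test_and_set_explicit(&tbl[b].lock, memory_order_acquire))
		;			/* spin */
	n = walk_once(tbl[b].head, key, &end);
	atomic_flag_clear_explicit(&tbl[b].lock, memory_order_release);
	return n;
}
```

The restart counter returned through `restarts` plays the role of the statistic discussed elsewhere in this thread: if it is almost always zero, the locked path is never taken and the common case stays a single lockless pass.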
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: DDoS attack causing bad effect on conntrack searches
From: Patrick McHardy @ 2010-04-23 10:55 UTC
To: Eric Dumazet
Cc: Jesper Dangaard Brouer, paulmck, Changli Gao, hawk,
Linux Kernel Network Hackers, Netfilter Developers
In-Reply-To: <1271970199.7895.6482.camel@edumazet-laptop>
Eric Dumazet wrote:
> On Thursday, 22 April 2010 at 22:38 +0200, Jesper Dangaard Brouer wrote:
>> On Thu, 22 Apr 2010, Eric Dumazet wrote:
>>
>>> On Thursday, 22 April 2010 at 08:51 -0700, Paul E. McKenney wrote:
>>>> On Thu, Apr 22, 2010 at 04:53:49PM +0200, Eric Dumazet wrote:
>>>>> On Thursday, 22 April 2010 at 16:36 +0200, Eric Dumazet wrote:
>>>>>
>>>>> If we can do the 'retry' 10 times, it means the attacker was really
>>>>> clever enough to inject new packets (new conntracks) at the right
>>>>> moment, in the right hash chain, and this sounds so highly incredible
>>>>> that I cannot believe it at all :)
>>>> Or maybe the DoS attack is injecting so many new conntracks that a large
>>>> fraction of the hash chains are being modified at any given time?
>>>>
>> I think it's plausible; there is a lot of modification going on.
>> Approx 40,000 deletes/sec and 40,000 inserts/sec.
>> The hash table has 300032 buckets, and with 80,000 modifications/sec, we are
>> (potentially) changing 26.6% of the hash chains each second.
>>
>
> OK, but a lookup lasts a fraction of a microsecond, unless interrupted by
> a hard irq.
>
> The probability of a change during a lookup should be very, very small.
>
> Note that the scenario for a restart is:
>
> The lookup goes through the chain.
> While it is examining one object, this object is deleted.
> The object is re-allocated by another cpu and inserted into a new chain.
I think another scenario that seems a bit more likely would be
that a new entry is added to the chain after it was fully searched.
Perhaps we could continue searching at the last position if the
last entry is not a nulls entry to improve this.
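
The two estimates quoted above can be put side by side with a little arithmetic. The sketch below is only a back-of-the-envelope model: it assumes uniform hashing and that every modification touches a different chain, and the 0.5 microsecond lookup time used in the usage note is an assumed stand-in for "a fraction of a microsecond", not a measured value. The function names are hypothetical.

```c
#include <assert.h>

/* Fraction of hash chains modified per second, assuming every modification
 * lands in a different chain (so this is an upper bound). */
static double chains_touched_per_sec(double mods_per_sec, double buckets)
{
	assert(buckets > 0.0);
	return mods_per_sec / buckets;
}

/* Expected number of modifications hitting one given chain while a single
 * lookup of duration lookup_sec walks it, assuming uniform hashing. */
static double mods_during_lookup(double mods_per_sec, double buckets,
				 double lookup_sec)
{
	assert(buckets > 0.0 && lookup_sec >= 0.0);
	return mods_per_sec * lookup_sec / buckets;
}
```

With 80000 modifications/sec and 300032 buckets, the first function gives roughly 0.267 (the 26.6% figure above), while the second gives roughly 1.3e-7 for an assumed 0.5 microsecond lookup. Both statements in the thread can therefore hold at once: many chains change every second, yet a change to the chain being walked during any single lookup is rare under these assumptions.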
* [PATCH net-next-2.6] l2tp: fix memory allocation
From: Jiri Pirko @ 2010-04-23 10:53 UTC
To: netdev; +Cc: davem, kleptog, jchapman
Since .size is set properly in "struct pernet_operations l2tp_net_ops",
allocating space for "struct l2tp_net" by hand is not correct and even
causes a memory leak.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
diff --git a/net/l2tp/l2tp_core.c b/net/l2tp/l2tp_core.c
index ecc7aea..1712af1 100644
--- a/net/l2tp/l2tp_core.c
+++ b/net/l2tp/l2tp_core.c
@@ -1617,14 +1617,9 @@ EXPORT_SYMBOL_GPL(l2tp_session_create);
static __net_init int l2tp_init_net(struct net *net)
{
- struct l2tp_net *pn;
- int err;
+ struct l2tp_net *pn = net_generic(net, l2tp_net_id);
int hash;
- pn = kzalloc(sizeof(*pn), GFP_KERNEL);
- if (!pn)
- return -ENOMEM;
-
INIT_LIST_HEAD(&pn->l2tp_tunnel_list);
spin_lock_init(&pn->l2tp_tunnel_list_lock);
@@ -1633,33 +1628,11 @@ static __net_init int l2tp_init_net(struct net *net)
spin_lock_init(&pn->l2tp_session_hlist_lock);
- err = net_assign_generic(net, l2tp_net_id, pn);
- if (err)
- goto out;
-
return 0;
-
-out:
- kfree(pn);
- return err;
-}
-
-static __net_exit void l2tp_exit_net(struct net *net)
-{
- struct l2tp_net *pn;
-
- pn = net_generic(net, l2tp_net_id);
- /*
- * if someone has cached our net then
- * further net_generic call will return NULL
- */
- net_assign_generic(net, l2tp_net_id, NULL);
- kfree(pn);
}
static struct pernet_operations l2tp_net_ops = {
.init = l2tp_init_net,
- .exit = l2tp_exit_net,
.id = &l2tp_net_id,
.size = sizeof(struct l2tp_net),
};
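
Why a hand-rolled allocation leaks here can be seen in a small user-space model of the per-namespace setup. This is a sketch under stated assumptions, not the kernel code: every identifier ending in _model is hypothetical, calloc()/free() stand in for kzalloc()/kfree(), and the model only mirrors the idea that when .size is non-zero the core allocates zeroed storage, publishes it under the ops id before calling ->init(), and frees it on exit.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define MAX_IDS_MODEL 4

struct net_model {
	void *gen[MAX_IDS_MODEL];	/* per-namespace generic pointers */
};

struct pernet_ops_model {
	int (*init)(struct net_model *net);
	int id;
	size_t size;
};

static void *net_generic_model(struct net_model *net, int id)
{
	assert(id >= 0 && id < MAX_IDS_MODEL);
	return net->gen[id];
}

/* Core side: allocate and publish the storage, then call ->init(). */
static int ops_init_model(const struct pernet_ops_model *ops,
			  struct net_model *net)
{
	if (ops->size) {
		void *data = calloc(1, ops->size);

		if (!data)
			return -ENOMEM;
		net->gen[ops->id] = data;
	}
	return ops->init ? ops->init(net) : 0;
}

/* Core side: the storage it allocated is also released by the core. */
static void ops_exit_model(const struct pernet_ops_model *ops,
			   struct net_model *net)
{
	if (ops->size) {
		free(net->gen[ops->id]);
		net->gen[ops->id] = NULL;
	}
}

/* Subsystem side: fetch the pre-allocated storage and initialise it. */
struct l2tp_net_model {
	int initialised;
};

enum { L2TP_ID_MODEL = 0 };

static int l2tp_init_model(struct net_model *net)
{
	struct l2tp_net_model *pn = net_generic_model(net, L2TP_ID_MODEL);

	pn->initialised = 1;
	return 0;
}
```

If the init callback allocated a second structure and assigned it over the published pointer, the block allocated by the core would no longer be reachable, which is the leak the commit message describes. With the storage owned by the core, no separate exit callback is needed just to free it.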
* Re: DDoS attack causing bad effect on conntrack searches
From: Patrick McHardy @ 2010-04-23 10:36 UTC
To: Eric Dumazet
Cc: Jesper Dangaard Brouer, paulmck, Changli Gao, hawk,
Linux Kernel Network Hackers, Netfilter Developers
In-Reply-To: <1271970893.7895.6507.camel@edumazet-laptop>
Eric Dumazet wrote:
> On Thursday, 22 April 2010 at 23:03 +0200, Eric Dumazet wrote:
>>> Guess I have to reproduce the DoS attack in a testlab (I will first have
>>> time Tuesday). So we can determine if it's bad hashing or restart of the
>>> search loop.
>>>
>
> Or very long chains, if attacker managed to find a jhash flaw.
That should be visible in the "searched" statistic.
> You could add a lookup_restart counter :
I've applied Jesper's equivalent patch.
* Re: DDoS attack causing bad effect on conntrack searches
From: Patrick McHardy @ 2010-04-23 10:35 UTC
To: Jesper Dangaard Brouer
Cc: Changli Gao, Eric Dumazet, Linux Kernel Network Hackers,
netfilter-devel, Paul E McKenney
In-Reply-To: <1271943066.14501.194.camel@jdb-workstation>
Jesper Dangaard Brouer wrote:
> I have added a stats counter to prove my case, which I think we should add to the kernel (to detect the case in the future).
> The DDoS attack has disappeared, so I guess I'll try to see if I can reproduce the problem in my testlab.
>
>
>
> [PATCH] net: netfilter conntrack extended with extra stat counter.
>
> From: Jesper Dangaard Brouer <hawk@comx.dk>
>
> I suspect an unfortunate series of events occurring under a DDoS
> attack, in function __nf_conntrack_find() in nf_conntrack_core.c.
>
> Adding a stats counter to see if the search is restarted too often.
Applied, thanks Jesper.
* Re: [PATCH v6] net: batch skb dequeueing from softnet input_pkt_queue
From: Eric Dumazet @ 2010-04-23 10:26 UTC
To: Changli Gao
Cc: David S. Miller, jamal, Tom Herbert, Stephen Hemminger, netdev
In-Reply-To: <1272010378-2955-1-git-send-email-xiaosuo@gmail.com>
On Friday, 23 April 2010 at 16:12 +0800, Changli Gao wrote:
> batch skb dequeueing from softnet input_pkt_queue.
>
> batch skb dequeueing from softnet input_pkt_queue to reduce potential lock
> contention when RPS is enabled.
>
> Note: in the worst case, the number of packets in a softnet_data may be double
> of netdev_max_backlog.
>
> Signed-off-by: Changli Gao <xiaosuo@gmail.com>
> ----
Oops, reading it again, I found process_backlog() was still taking the
lock twice, if only one packet is waiting in input_pkt_queue.
Possible fix, on top of your patch :
diff --git a/net/core/dev.c b/net/core/dev.c
index 0eddd23..0569be7 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3296,8 +3296,9 @@ static int process_backlog(struct napi_struct *napi, int quota)
#endif
napi->weight = weight_p;
local_irq_disable();
- while (1) {
+ while (work < quota) {
struct sk_buff *skb;
+ unsigned int qlen;
while ((skb = __skb_dequeue(&sd->process_queue))) {
local_irq_enable();
@@ -3308,13 +3309,15 @@ static int process_backlog(struct napi_struct *napi, int quota)
}
rps_lock(sd);
- input_queue_head_add(sd, skb_queue_len(&sd->input_pkt_queue));
- skb_queue_splice_tail_init(&sd->input_pkt_queue,
- &sd->process_queue);
- if (skb_queue_empty(&sd->process_queue)) {
+ qlen = skb_queue_len(&sd->input_pkt_queue);
+ if (qlen) {
+ input_queue_head_add(sd, qlen);
+ skb_queue_splice_tail_init(&sd->input_pkt_queue,
+ &sd->process_queue);
+ }
+ if (qlen < quota - work) {
__napi_complete(napi);
- rps_unlock(sd);
- break;
+ quota = work + qlen;
}
rps_unlock(sd);
}
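
The effect of this follow-up can be illustrated with a small user-space model that only counts lock acquisitions. It is a sketch under simplifying assumptions: queues are reduced to packet counters, no packets arrive while the loop runs, and IRQ enabling/disabling is left out. The struct and function names are hypothetical; the loop shapes follow the v6 patch and the fix above.

```c
#include <assert.h>

struct softnet_model {
	unsigned int input_qlen;	/* packets in input_pkt_queue */
	unsigned int process_qlen;	/* packets in process_queue */
	unsigned int locks;		/* number of rps_lock() calls */
	int napi_completed;
};

/* Loop shape of the v6 patch: the queue must be found empty under the
 * lock before the loop can finish. */
static int backlog_v6_model(struct softnet_model *sd, int quota)
{
	int work = 0;

	assert(quota > 0);
	for (;;) {
		while (sd->process_qlen) {
			sd->process_qlen--;	/* __netif_receive_skb() */
			if (++work >= quota)
				return work;
		}
		sd->locks++;			/* rps_lock() */
		sd->process_qlen += sd->input_qlen;	/* splice */
		sd->input_qlen = 0;
		if (!sd->process_qlen) {
			sd->napi_completed = 1;
			break;
		}
	}
	return work;
}

/* Loop shape with the fix: if the spliced batch fits in the remaining
 * quota, complete right away and shrink quota to end after that batch. */
static int backlog_fixed_model(struct softnet_model *sd, int quota)
{
	int work = 0;

	assert(quota > 0);
	while (work < quota) {
		unsigned int qlen;

		while (sd->process_qlen) {
			sd->process_qlen--;
			if (++work >= quota)
				return work;
		}
		sd->locks++;
		qlen = sd->input_qlen;
		sd->process_qlen += qlen;
		sd->input_qlen = 0;
		if ((int)qlen < quota - work) {
			sd->napi_completed = 1;
			quota = work + (int)qlen;
		}
	}
	return work;
}
```

With a single packet waiting and a quota of 64, the first model takes the lock twice (once to splice, once to discover the queue is empty), while the second takes it once and still processes the packet and completes NAPI.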
* Re: [RFC 2/2] phylib: Convert MDIO bitbang to new MDIO 45 format
From: Ben Hutchings @ 2010-04-23 10:22 UTC
To: Andy Fleming; +Cc: davem, netdev
In-Reply-To: <1271997497-6896-3-git-send-email-afleming@freescale.com>
On Thu, 2010-04-22 at 23:38 -0500, Andy Fleming wrote:
> Now that we've added somewhat more complete MDIO 45 support to the PHY
> Lib, convert the MDIO bitbang driver to use this new infrastructure.
>
> Signed-off-by: Andy Fleming <afleming@freescale.com>
> ---
> drivers/net/phy/mdio-bitbang.c | 23 +++++++++++------------
> 1 files changed, 11 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/net/phy/mdio-bitbang.c b/drivers/net/phy/mdio-bitbang.c
> index 2f6f02e..4c0c89b 100644
> --- a/drivers/net/phy/mdio-bitbang.c
> +++ b/drivers/net/phy/mdio-bitbang.c
[...]
> @@ -157,9 +154,10 @@ static int mdiobb_read(struct mii_bus *bus, int phy, int devad, int reg)
> struct mdiobb_ctrl *ctrl = bus->priv;
> int ret, i;
>
> - if (reg & MII_ADDR_C45) {
> - reg = mdiobb_cmd_addr(ctrl, phy, reg);
> - mdiobb_cmd(ctrl, MDIO_C45_READ, phy, reg);
> + /* Clause 22 PHYs only use devad = 0, and Clause 45 only use nonzero */
> + if (devad) {
> + mdiobb_cmd_addr(ctrl, phy, devad, reg);
> + mdiobb_cmd(ctrl, MDIO_C45_READ, phy, devad);
> } else
> mdiobb_cmd(ctrl, MDIO_READ, phy, reg);
>
[...]
I don't believe there's any protocol requirement in clause 45 that
devad != 0 (although the address is not allocated). In the mdio module
I played safe and defined MDIO_DEVAD_NONE == -1 to indicate a clause 22
request.
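
A minimal sketch of the sentinel approach, to make the difference from the "devad != 0" test concrete. MDIO_DEVAD_NONE is defined as (-1) in linux/mdio.h; the enum and the function name below are hypothetical and only show how the frame format could be chosen without treating device address 0 specially.

```c
#include <assert.h>

#define MDIO_DEVAD_NONE	(-1)	/* same value as in linux/mdio.h */

enum mdio_frame_model {
	MDIO_FRAME_C22,
	MDIO_FRAME_C45,
};

/* Choose the frame format from devad alone: the sentinel means clause 22,
 * and every 5-bit device address 0..31, including 0, means clause 45. */
static enum mdio_frame_model mdio_frame_for(int devad)
{
	assert(devad == MDIO_DEVAD_NONE || (devad >= 0 && devad <= 31));
	return devad == MDIO_DEVAD_NONE ? MDIO_FRAME_C22 : MDIO_FRAME_C45;
}
```

With this convention a clause 45 access to device address 0 stays expressible, which the "if (devad)" test in the quoted hunk cannot distinguish from a clause 22 access.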
Ben.
--
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
* Re: [PATCH linux-next 1/2] irq: Add CPU mask affinity hint callback framework
From: John Fastabend @ 2010-04-23 9:27 UTC
To: Ben Hutchings
Cc: Waskiewicz Jr, Peter P, tglx@linutronix.de, davem@davemloft.net,
arjan@linux.jf.intel.com, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org
In-Reply-To: <1271950900.2095.25.camel@achroite.uk.solarflarecom.com>
Ben Hutchings wrote:
> On Thu, 2010-04-22 at 05:11 -0700, Peter P Waskiewicz Jr wrote:
>> On Wed, 21 Apr 2010, Ben Hutchings wrote:
>>
>>> On Tue, 2010-04-20 at 11:01 -0700, Peter P Waskiewicz Jr wrote:
>>>> This patch adds a callback function pointer to the irq_desc
>>>> structure, along with a registration function and a read-only
>>>> proc entry for each interrupt.
>>>>
>>>> This affinity_hint handle for each interrupt can be used by
>>>> underlying drivers that need a better mechanism to control
>>>> interrupt affinity. The underlying driver can register a
>>>> callback for the interrupt, which will allow the driver to
>>>> provide the CPU mask for the interrupt to anything that
>>>> requests it. The intent is to extend the userspace daemon,
>>>> irqbalance, to help hint to it a preferred CPU mask to balance
>>>> the interrupt into.
>>> Doesn't it make more sense to have the driver follow affinity decisions
>>> made from user-space? I realise that reallocating queues is disruptive
>>> and we probably don't want irqbalance to trigger that, but there should
>>> be a mechanism for the administrator to trigger it.
>> The driver here would be assisting userspace (irqbalance) to provide
>> better details on how the HW is laid out with respect to flows. As it stands
>> today, irqbalance is almost guaranteed to move interrupts to CPUs that are
>> not aligned with where applications are running for network adapters.
>> This is very apparent when running at speeds in the 10 Gigabit range, or
>> even multiple 1 Gigabit ports running at the same time.
>
> I'm well aware that irqbalance isn't making good decisions at the
> moment. The question is whether this will really help irqbalance to do
> better.
>
FCoE is one example where these hints can really help irqbalance make
good decisions. By aligning the interrupt affinity with the FCoE
receive processing thread we can avoid context switching from the NET_RX
softirq to the receive processing thread.
Because the base driver knows which rx rings are being used for FCoE in
a particular configuration and their corresponding vectors it seems to
be in the best position to provide good hints to irqbalance. Also if
the mapping changes at some point the base driver will be aware of it.
> [...]
>>> This just assigns IRQs to the first n CPU threads. Depending on the
>>> enumeration order, this might result in assigning an IRQ to each of 2
>>> threads on a core while leaving other cores unused!
>> This ixgbe patch is only meant to be an example of how you could use it.
>> I didn't hammer out all the corner cases of interrupt alignment in it yet.
>> However, ixgbe is already aligning Tx flows onto the CPU/queue pair the Tx
>> occurred (i.e. Tx session from CPU 4 will be queued on Tx queue 4),
> [...]
>
> OK, now I remember ixgbe has this odd select_queue() implementation.
> But this behaviour can result in reordering whenever a user thread
> migrates, and in any case Dave discourages people from setting
> select_queue(). So I see that these changes would be useful for ixgbe
> (together with an update to irqbalance), but they don't seem to fit the
> general direction of multiqueue networking on Linux.
For DCB, setting select_queue() is useful because we want to map traffic
types to specific tx queues, not hash them across all queues. In this
case, where we are placing specific traffic on specific queues, it also
makes sense to align the interrupts for some types such as FCoE. There
shouldn't be any issues with user thread migration in this specific example.
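
The difference between the two selection policies can be sketched in a few lines. This is an illustration only: the traffic types, the queue-range table and both function names are hypothetical and are not taken from ixgbe or from the DCB code.

```c
#include <assert.h>

enum traffic_type_model {
	TT_LAN,
	TT_FCOE,
	TT_COUNT,
};

struct txq_range_model {
	unsigned int first;	/* first tx queue owned by this type */
	unsigned int count;	/* number of tx queues owned by this type */
};

/* Hash-based selection: any flow may land on any queue. */
static unsigned int pick_hashed(unsigned int flow_hash, unsigned int nr_queues)
{
	assert(nr_queues > 0);
	return flow_hash % nr_queues;
}

/* Type-based selection: each traffic type owns a fixed queue range, and
 * the hash only spreads flows inside that range. */
static unsigned int pick_by_type(const struct txq_range_model *map,
				 enum traffic_type_model tt,
				 unsigned int flow_hash)
{
	assert(tt < TT_COUNT && map[tt].count > 0);
	return map[tt].first + flow_hash % map[tt].count;
}
```

Because the queue for a given traffic type does not depend on which CPU the sending thread runs on, a thread migration does not move the traffic to another queue in this model, and the interrupt vectors of the queues owned by one type are known in advance.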
>
> (Actually, the hints seem to be incomplete. If there are more than 16
> CPU threads then multiple CPU threads can map to the same queues, but it
> looks like you only include the first in the queue's hint.)
>
> An alternate approach is to use the RX queue index to drive TX queue
> selection. I posted a patch to do that earlier this week. However I
> haven't yet had a chance to try that on a suitably large system.
>
I'll post an FCoE example patch soon and take a closer look at your
patch, but mapping TX/RX queues in sock's won't help for cases like FCoE.
Thanks,
John.
* Re: [PATCH v6] net: batch skb dequeueing from softnet input_pkt_queue
From: Eric Dumazet @ 2010-04-23 9:27 UTC
To: Changli Gao
Cc: David S. Miller, jamal, Tom Herbert, Stephen Hemminger, netdev
In-Reply-To: <1272010378-2955-1-git-send-email-xiaosuo@gmail.com>
On Friday, 23 April 2010 at 16:12 +0800, Changli Gao wrote:
> batch skb dequeueing from softnet input_pkt_queue.
>
> batch skb dequeueing from softnet input_pkt_queue to reduce potential lock
> contention when RPS is enabled.
>
> Note: in the worst case, the number of packets in a softnet_data may be double
> of netdev_max_backlog.
>
> Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Very good patch, Changli, thanks!
Let's see how it improves things for Jamal's benchmarks ;)
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> ----
> include/linux/netdevice.h | 6 +++--
> net/core/dev.c | 50 +++++++++++++++++++++++++++++++---------------
> 2 files changed, 38 insertions(+), 18 deletions(-)
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 3c5ed5f..6ae9f2b 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1387,6 +1387,7 @@ struct softnet_data {
> struct Qdisc *output_queue;
> struct list_head poll_list;
> struct sk_buff *completion_queue;
> + struct sk_buff_head process_queue;
>
> #ifdef CONFIG_RPS
> struct softnet_data *rps_ipi_list;
> @@ -1401,10 +1402,11 @@ struct softnet_data {
> struct napi_struct backlog;
> };
>
> -static inline void input_queue_head_incr(struct softnet_data *sd)
> +static inline void input_queue_head_add(struct softnet_data *sd,
> + unsigned int len)
> {
> #ifdef CONFIG_RPS
> - sd->input_queue_head++;
> + sd->input_queue_head += len;
> #endif
> }
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index a4a7c36..c1585f9 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2409,12 +2409,13 @@ static int enqueue_to_backlog(struct sk_buff *skb, int cpu,
> __get_cpu_var(netdev_rx_stat).total++;
>
> rps_lock(sd);
> - if (sd->input_pkt_queue.qlen <= netdev_max_backlog) {
> - if (sd->input_pkt_queue.qlen) {
> + if (skb_queue_len(&sd->input_pkt_queue) <= netdev_max_backlog) {
> + if (skb_queue_len(&sd->input_pkt_queue)) {
> enqueue:
> __skb_queue_tail(&sd->input_pkt_queue, skb);
> #ifdef CONFIG_RPS
> - *qtail = sd->input_queue_head + sd->input_pkt_queue.qlen;
> + *qtail = sd->input_queue_head +
> + skb_queue_len(&sd->input_pkt_queue);
> #endif
> rps_unlock(sd);
> local_irq_restore(flags);
> @@ -2934,13 +2935,21 @@ static void flush_backlog(void *arg)
> struct sk_buff *skb, *tmp;
>
> rps_lock(sd);
> - skb_queue_walk_safe(&sd->input_pkt_queue, skb, tmp)
> + skb_queue_walk_safe(&sd->input_pkt_queue, skb, tmp) {
> if (skb->dev == dev) {
> __skb_unlink(skb, &sd->input_pkt_queue);
> kfree_skb(skb);
> - input_queue_head_incr(sd);
> + input_queue_head_add(sd, 1);
> }
> + }
> rps_unlock(sd);
> +
> + skb_queue_walk_safe(&sd->process_queue, skb, tmp) {
> + if (skb->dev == dev) {
> + __skb_unlink(skb, &sd->process_queue);
> + kfree_skb(skb);
> + }
> + }
> }
>
> static int napi_gro_complete(struct sk_buff *skb)
> @@ -3286,24 +3295,30 @@ static int process_backlog(struct napi_struct *napi, int quota)
> }
> #endif
> napi->weight = weight_p;
> - do {
> + local_irq_disable();
> + while (1) {
> struct sk_buff *skb;
>
> - local_irq_disable();
> + while ((skb = __skb_dequeue(&sd->process_queue))) {
> + local_irq_enable();
> + __netif_receive_skb(skb);
> + if (++work >= quota)
> + return work;
> + local_irq_disable();
> + }
> +
> rps_lock(sd);
> - skb = __skb_dequeue(&sd->input_pkt_queue);
> - if (!skb) {
> + input_queue_head_add(sd, skb_queue_len(&sd->input_pkt_queue));
> + skb_queue_splice_tail_init(&sd->input_pkt_queue,
> + &sd->process_queue);
> + if (skb_queue_empty(&sd->process_queue)) {
> __napi_complete(napi);
> rps_unlock(sd);
> - local_irq_enable();
> break;
> }
> - input_queue_head_incr(sd);
> rps_unlock(sd);
> - local_irq_enable();
> -
> - __netif_receive_skb(skb);
> - } while (++work < quota);
> + }
> + local_irq_enable();
>
> return work;
> }
> @@ -5631,8 +5646,10 @@ static int dev_cpu_callback(struct notifier_block *nfb,
> /* Process offline CPU's input_pkt_queue */
> while ((skb = __skb_dequeue(&oldsd->input_pkt_queue))) {
> netif_rx(skb);
> - input_queue_head_incr(oldsd);
> + input_queue_head_add(oldsd, 1);
> }
> + while ((skb = __skb_dequeue(&oldsd->process_queue)))
> + netif_rx(skb);
>
> return NOTIFY_OK;
> }
> @@ -5851,6 +5868,7 @@ static int __init net_dev_init(void)
> struct softnet_data *sd = &per_cpu(softnet_data, i);
>
> skb_queue_head_init(&sd->input_pkt_queue);
> + skb_queue_head_init(&sd->process_queue);
> sd->completion_queue = NULL;
> INIT_LIST_HEAD(&sd->poll_list);
>
* Re: DDoS attack causing bad effect on conntrack searches
From: Eric Dumazet @ 2010-04-23 9:23 UTC
To: Jan Engelhardt
Cc: Jesper Dangaard Brouer, Patrick McHardy, hawk,
Linux Kernel Network Hackers, Netfilter Developers
In-Reply-To: <alpine.LSU.2.01.1004230955030.26168@obet.zrqbmnf.qr>
On Friday, 23 April 2010 at 09:55 +0200, Jan Engelhardt wrote:
> On Friday 2010-04-23 09:46, Eric Dumazet wrote:
> >Years ago, we had to manually change PAGE_OFFSET, and I remember some
> >machines with PAGE_OFFSET 0xA0000000 (1.5 GB LOWMEM),
> >or 0xB0000000 (1.25 GB), (PAE off)
>
> I notice that 0xB0000000, which is now known as LOWMEM_3G_OPT,
> is only available when PAE is off. Would you know the reason for
> that decision? Are some values unsuitable for PAE?
>
If PAE is on, PAGE_OFFSET must be a 1GB multiple.
This is because of hardware limitations.
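
A small sketch of the constraint and of the address-space figures quoted above. With PAE on 32-bit x86 the top-level page table has four entries, each covering 1 GiB; presumably that is the hardware limitation meant here, since a split on a 1 GiB boundary falls between two of those entries. The function names are hypothetical, and the second function simply reports everything at or above PAGE_OFFSET.

```c
#include <assert.h>
#include <stdint.h>

#define GIB_MODEL	(UINT64_C(1) << 30)

/* Non-zero if page_offset lies on a 1 GiB boundary. */
static int page_offset_ok_for_pae(uint32_t page_offset)
{
	return (page_offset % GIB_MODEL) == 0;
}

/* Kernel part of the 4 GiB address space in MiB, i.e. everything at or
 * above page_offset. */
static unsigned int kernel_space_mib(uint32_t page_offset)
{
	assert(page_offset != 0);
	return (unsigned int)(((UINT64_C(1) << 32) - page_offset) >> 20);
}
```

For 0xA0000000 and 0xB0000000 the second function yields 1536 MiB and 1280 MiB, matching the 1.5 GB and 1.25 GB figures above, and neither value passes the 1 GiB alignment check, whereas 0x80000000 and 0xC0000000 do.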
* Re: DDoS attack causing bad effect on conntrack searches
From: Jesper Dangaard Brouer @ 2010-04-23 8:40 UTC
To: David Miller
Cc: eric.dumazet, paulmck, Patrick McHardy, xiaosuo, netdev,
Netfilter Developers
In-Reply-To: <20100423.011845.254684857.davem@davemloft.net>
On Fri, 23 Apr 2010, David Miller wrote:
> This all reminds me of the namespace bug we dealt with
> a month or two ago.
>
> Jesper, you don't happen to be using network namespaces are you?
No, I don't use network namespaces.
(In .config CONFIG_NAMESPACES is not set.)
Cheers,
Jesper Brouer
--
-------------------------------------------------------------------
MSc. Master of Computer Science
Dept. of Computer Science, University of Copenhagen
Author of http://www.adsl-optimizer.dk
-------------------------------------------------------------------
* [PATCH v6] net: batch skb dequeueing from softnet input_pkt_queue
From: Changli Gao @ 2010-04-23 8:12 UTC
To: David S. Miller
Cc: jamal, Tom Herbert, Eric Dumazet, Stephen Hemminger, netdev,
Changli Gao
batch skb dequeueing from softnet input_pkt_queue.
batch skb dequeueing from softnet input_pkt_queue to reduce potential lock
contention when RPS is enabled.
Note: in the worst case, the number of packets in a softnet_data may be double
of netdev_max_backlog.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
----
include/linux/netdevice.h | 6 +++--
net/core/dev.c | 50 +++++++++++++++++++++++++++++++---------------
2 files changed, 38 insertions(+), 18 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 3c5ed5f..6ae9f2b 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1387,6 +1387,7 @@ struct softnet_data {
struct Qdisc *output_queue;
struct list_head poll_list;
struct sk_buff *completion_queue;
+ struct sk_buff_head process_queue;
#ifdef CONFIG_RPS
struct softnet_data *rps_ipi_list;
@@ -1401,10 +1402,11 @@ struct softnet_data {
struct napi_struct backlog;
};
-static inline void input_queue_head_incr(struct softnet_data *sd)
+static inline void input_queue_head_add(struct softnet_data *sd,
+ unsigned int len)
{
#ifdef CONFIG_RPS
- sd->input_queue_head++;
+ sd->input_queue_head += len;
#endif
}
diff --git a/net/core/dev.c b/net/core/dev.c
index a4a7c36..c1585f9 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2409,12 +2409,13 @@ static int enqueue_to_backlog(struct sk_buff *skb, int cpu,
__get_cpu_var(netdev_rx_stat).total++;
rps_lock(sd);
- if (sd->input_pkt_queue.qlen <= netdev_max_backlog) {
- if (sd->input_pkt_queue.qlen) {
+ if (skb_queue_len(&sd->input_pkt_queue) <= netdev_max_backlog) {
+ if (skb_queue_len(&sd->input_pkt_queue)) {
enqueue:
__skb_queue_tail(&sd->input_pkt_queue, skb);
#ifdef CONFIG_RPS
- *qtail = sd->input_queue_head + sd->input_pkt_queue.qlen;
+ *qtail = sd->input_queue_head +
+ skb_queue_len(&sd->input_pkt_queue);
#endif
rps_unlock(sd);
local_irq_restore(flags);
@@ -2934,13 +2935,21 @@ static void flush_backlog(void *arg)
struct sk_buff *skb, *tmp;
rps_lock(sd);
- skb_queue_walk_safe(&sd->input_pkt_queue, skb, tmp)
+ skb_queue_walk_safe(&sd->input_pkt_queue, skb, tmp) {
if (skb->dev == dev) {
__skb_unlink(skb, &sd->input_pkt_queue);
kfree_skb(skb);
- input_queue_head_incr(sd);
+ input_queue_head_add(sd, 1);
}
+ }
rps_unlock(sd);
+
+ skb_queue_walk_safe(&sd->process_queue, skb, tmp) {
+ if (skb->dev == dev) {
+ __skb_unlink(skb, &sd->process_queue);
+ kfree_skb(skb);
+ }
+ }
}
static int napi_gro_complete(struct sk_buff *skb)
@@ -3286,24 +3295,30 @@ static int process_backlog(struct napi_struct *napi, int quota)
}
#endif
napi->weight = weight_p;
- do {
+ local_irq_disable();
+ while (1) {
struct sk_buff *skb;
- local_irq_disable();
+ while ((skb = __skb_dequeue(&sd->process_queue))) {
+ local_irq_enable();
+ __netif_receive_skb(skb);
+ if (++work >= quota)
+ return work;
+ local_irq_disable();
+ }
+
rps_lock(sd);
- skb = __skb_dequeue(&sd->input_pkt_queue);
- if (!skb) {
+ input_queue_head_add(sd, skb_queue_len(&sd->input_pkt_queue));
+ skb_queue_splice_tail_init(&sd->input_pkt_queue,
+ &sd->process_queue);
+ if (skb_queue_empty(&sd->process_queue)) {
__napi_complete(napi);
rps_unlock(sd);
- local_irq_enable();
break;
}
- input_queue_head_incr(sd);
rps_unlock(sd);
- local_irq_enable();
-
- __netif_receive_skb(skb);
- } while (++work < quota);
+ }
+ local_irq_enable();
return work;
}
@@ -5631,8 +5646,10 @@ static int dev_cpu_callback(struct notifier_block *nfb,
/* Process offline CPU's input_pkt_queue */
while ((skb = __skb_dequeue(&oldsd->input_pkt_queue))) {
netif_rx(skb);
- input_queue_head_incr(oldsd);
+ input_queue_head_add(oldsd, 1);
}
+ while ((skb = __skb_dequeue(&oldsd->process_queue)))
+ netif_rx(skb);
return NOTIFY_OK;
}
@@ -5851,6 +5868,7 @@ static int __init net_dev_init(void)
struct softnet_data *sd = &per_cpu(softnet_data, i);
skb_queue_head_init(&sd->input_pkt_queue);
+ skb_queue_head_init(&sd->process_queue);
sd->completion_queue = NULL;
INIT_LIST_HEAD(&sd->poll_list);
* Re: [PATCH 1/2][RESEND] ehea: error handling improvement
From: Thomas Klein @ 2010-04-23 8:22 UTC
To: David Miller; +Cc: tklein, netdev, linuxppc-dev, linux-kernel, themann
In-Reply-To: <20100421.223620.257172362.davem@davemloft.net>
On 04/22/2010 07:36 AM, David Miller wrote:
> From: Thomas Klein<tklein@de.ibm.com>
> Date: Wed, 21 Apr 2010 11:10:55 +0200
>
>> Reset a port's resources only if they're actually in an error state
>>
>> Signed-off-by: Thomas Klein<tklein@de.ibm.com>
>> ---
>>
>> Patch created against net-2.6
>
> I thought you were sorry for wasting my time and that you were going
> to follow the directions I gave you last time, and I quote:
>
> --------------------
> 3) These are not appropriate for net-2.6 as we are deep in
> the -rcX series at this point and only the most diabolical
> bug fixes are appropriate. Therefore, please generate these
> against net-next-2.6, thanks.
> --------------------
>
> And here you are generating your patches against net-2.6. Heck, you
> even feel it's worth mentioning explicitly.
Guilty! Allows no excuse. Screwed it. Deeply sorry.
>
> Lucky for you the patches happen to apply cleanly to net-next-2.6 so
> I've put them there.
Thanks!
Thomas
* Re: DDoS attack causing bad effect on conntrack searches
From: David Miller @ 2010-04-23 8:18 UTC
To: eric.dumazet; +Cc: hawk, paulmck, kaber, xiaosuo, hawk, netdev, netfilter-devel
In-Reply-To: <20100423.011328.107238355.davem@davemloft.net>
From: David Miller <davem@davemloft.net>
Date: Fri, 23 Apr 2010 01:13:28 -0700 (PDT)
> I really can't see what might cause this behavior then.
This all reminds me of the namespace bug we dealt with
a month or two ago.
Jesper, you don't happen to be using network namespaces are you?
Because if so, the following might be your cure.
commit 5b3501faa8741d50617ce4191c20061c6ef36cb3
Author: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon Feb 8 11:16:56 2010 -0800
netfilter: nf_conntrack: per netns nf_conntrack_cachep
* [PATCH] can: Add driver for esd CAN-USB/2 device
From: Matthias Fuchs @ 2010-04-23 8:15 UTC
To: netdev-u79uwXL29TY76Z2rM5mHXA; +Cc: socketcan-core-0fE9KPoRgkgATYTw5x5z8w
This patch adds a driver for esd's USB high speed
CAN interface. The driver supports devices with
multiple CAN interfaces.
Signed-off-by: Matthias Fuchs <matthias.fuchs-iOnpLzIbIdM@public.gmane.org>
---
drivers/net/can/usb/Kconfig | 6 +
drivers/net/can/usb/Makefile | 1 +
drivers/net/can/usb/esd_usb2.c | 1107 ++++++++++++++++++++++++++++++++++++++++
3 files changed, 1114 insertions(+), 0 deletions(-)
create mode 100644 drivers/net/can/usb/esd_usb2.c
diff --git a/drivers/net/can/usb/Kconfig b/drivers/net/can/usb/Kconfig
index 97ff6fe..0452549 100644
--- a/drivers/net/can/usb/Kconfig
+++ b/drivers/net/can/usb/Kconfig
@@ -7,4 +7,10 @@ config CAN_EMS_USB
This driver is for the one channel CPC-USB/ARM7 CAN/USB interface
from EMS Dr. Thomas Wuensche (http://www.ems-wuensche.de).
+config CAN_ESD_USB2
+ tristate "ESD USB/2 CAN/USB interface"
+ ---help---
+ This driver supports the CAN-USB/2 interface
+ from esd electronic system design gmbh (http://www.esd.eu).
+
endmenu
diff --git a/drivers/net/can/usb/Makefile b/drivers/net/can/usb/Makefile
index 0afd51d..fce3cf1 100644
--- a/drivers/net/can/usb/Makefile
+++ b/drivers/net/can/usb/Makefile
@@ -3,5 +3,6 @@
#
obj-$(CONFIG_CAN_EMS_USB) += ems_usb.o
+obj-$(CONFIG_CAN_ESD_USB2) += esd_usb2.o
ccflags-$(CONFIG_CAN_DEBUG_DEVICES) := -DDEBUG
diff --git a/drivers/net/can/usb/esd_usb2.c b/drivers/net/can/usb/esd_usb2.c
new file mode 100644
index 0000000..c714ce9
--- /dev/null
+++ b/drivers/net/can/usb/esd_usb2.c
@@ -0,0 +1,1107 @@
+/*
+ * CAN driver for esd CAN-USB/2
+ *
+ * Copyright (C) 2010 Matthias Fuchs <matthias.fuchs-iOnpLzIbIdM@public.gmane.org>, esd gmbh
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published
+ * by the Free Software Foundation; version 2 of the License.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, write to the Free Software Foundation, Inc.,
+ * 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
+ */
+#include <linux/init.h>
+#include <linux/signal.h>
+#include <linux/slab.h>
+#include <linux/module.h>
+#include <linux/netdevice.h>
+#include <linux/usb.h>
+
+#include <linux/can.h>
+#include <linux/can/dev.h>
+#include <linux/can/error.h>
+
+MODULE_AUTHOR("Matthias Fuchs <matthias.fuchs-iOnpLzIbIdM@public.gmane.org>");
+MODULE_DESCRIPTION("CAN driver for esd CAN-USB/2 interfaces");
+MODULE_LICENSE("GPL v2");
+
+/* Define these values to match your devices */
+#define USB_ESDGMBH_VENDOR_ID 0x0ab4
+#define USB_CANUSB2_PRODUCT_ID 0x0010
+
+#define ESD_USB2_CAN_CLOCK 60000000
+#define ESD_USB2_MAX_NETS 2
+
+/* USB2 commands */
+#define CMD_VERSION 1 /* also used for VERSION_REPLY */
+#define CMD_CAN_RX 2 /* device to host only */
+#define CMD_CAN_TX 3 /* also used for TX_DONE */
+#define CMD_SETBAUD 4 /* also used for SETBAUD_REPLY */
+#define CMD_TS 5 /* also used for TS_REPLY */
+#define CMD_IDADD 6 /* also used for IDADD_REPLY */
+
+/* esd CAN message flags - dlc field */
+#define ESD_RTR 0x10
+
+/* esd CAN message flags - id field */
+#define ESD_EXTID 0x20000000
+#define ESD_EVENT 0x40000000
+#define ESD_IDMASK 0x1fffffff
+
+/* esd CAN event ids used by this driver */
+#define ESD_EV_CAN_ERROR_EXT 2
+
+/* baudrate message flags */
+#define ESD_USB2_UBR 0x80000000
+#define ESD_USB2_LOM 0x40000000
+#define ESD_USB2_NO_BAUDRATE 0x7fffffff
+#define ESD_USB2_TSEG1_MIN 1
+#define ESD_USB2_TSEG1_MAX 16
+#define ESD_USB2_TSEG1_SHIFT 16
+#define ESD_USB2_TSEG2_MIN 1
+#define ESD_USB2_TSEG2_MAX 8
+#define ESD_USB2_TSEG2_SHIFT 20
+#define ESD_USB2_SJW_MAX 4
+#define ESD_USB2_SJW_SHIFT 14
+#define ESD_USB2_BRP_MIN 1
+#define ESD_USB2_BRP_MAX 1024
+#define ESD_USB2_BRP_INC 1
+#define ESD_USB2_3_SAMPLES 0x00800000
+
+/* esd IDADD message */
+#define ESD_ID_ENABLE 0x80
+#define ESD_MAX_ID_SEGMENT 64
+
+/* SJA1000 ECC register (emulated by usb2 firmware) */
+#define SJA1000_ECC_SEG 0x1F
+#define SJA1000_ECC_DIR 0x20
+#define SJA1000_ECC_ERR 0x06
+#define SJA1000_ECC_BIT 0x00
+#define SJA1000_ECC_FORM 0x40
+#define SJA1000_ECC_STUFF 0x80
+#define SJA1000_ECC_MASK 0xc0
+
+/* esd bus state event codes */
+#define ESD_BUSSTATE_MASK 0xc0
+#define ESD_BUSSTATE_WARN 0x40
+#define ESD_BUSSTATE_ERRPASSIVE 0x80
+#define ESD_BUSSTATE_BUSOFF 0xc0
+
+#define RX_BUFFER_SIZE 1024
+#define MAX_RX_URBS 4
+#define MAX_TX_URBS 16 /* must be power of 2 */
+
+struct header_msg {
+ u8 len; /* len is always the total message length in 32bit words */
+ u8 cmd;
+ u8 rsvd[2];
+};
+
+struct version_msg {
+ u8 len;
+ u8 cmd;
+ u8 rsvd;
+ u8 flags;
+ __le32 drv_version;
+};
+
+struct version_reply_msg {
+ u8 len;
+ u8 cmd;
+ u8 nets;
+ u8 features;
+ __le32 version;
+ u8 name[16];
+ __le32 rsvd;
+ __le32 ts;
+};
+
+struct rx_msg {
+ u8 len;
+ u8 cmd;
+ u8 net;
+ u8 dlc;
+ __le32 ts;
+ __le32 id; /* upper 3 bits contain flags */
+ u8 data[8];
+};
+
+struct tx_msg {
+ u8 len;
+ u8 cmd;
+ u8 net;
+ u8 dlc;
+ __le32 hnd;
+ __le32 id; /* upper 3 bits contain flags */
+ u8 data[8];
+};
+
+struct tx_done_msg {
+ u8 len;
+ u8 cmd;
+ u8 net;
+ u8 status;
+ __le32 hnd;
+ __le32 ts;
+};
+
+struct id_filter_msg {
+ u8 len;
+ u8 cmd;
+ u8 net;
+ u8 option;
+ __le32 mask[65];
+};
+
+struct set_baudrate_msg {
+ u8 len;
+ u8 cmd;
+ u8 net;
+ u8 rsvd;
+ __le32 baud;
+};
+
+/* Main message type used between library and application */
+struct __attribute__ ((packed)) esd_usb2_msg {
+ union {
+ struct header_msg hdr;
+ struct version_msg version;
+ struct version_reply_msg version_reply;
+ struct rx_msg rx;
+ struct tx_msg tx;
+ struct tx_done_msg txdone;
+ struct set_baudrate_msg setbaud;
+ struct id_filter_msg filter;
+ } msg;
+};
+
+static struct usb_device_id esd_usb2_table[] = {
+ {USB_DEVICE(USB_ESDGMBH_VENDOR_ID, USB_CANUSB2_PRODUCT_ID)},
+ {}
+};
+MODULE_DEVICE_TABLE(usb, esd_usb2_table);
+
+struct esd_usb2_net_priv;
+
+struct esd_tx_urb_context {
+ struct esd_usb2_net_priv *priv;
+ u32 echo_index;
+ int dlc;
+};
+
+struct esd_usb2 {
+ struct usb_device *udev;
+ struct esd_usb2_net_priv *nets[ESD_USB2_MAX_NETS];
+
+ struct usb_anchor rx_submitted;
+
+ int net_count;
+ u32 version;
+ int rxinitdone;
+};
+
+struct esd_usb2_net_priv {
+ struct can_priv can; /* must be the first member */
+
+ atomic_t active_tx_jobs;
+ struct usb_anchor tx_submitted;
+ struct esd_tx_urb_context tx_contexts[MAX_TX_URBS];
+
+ int open_time;
+ struct esd_usb2 *usb2;
+ struct net_device *netdev;
+ int index;
+ u8 old_state;
+};
+
+static void esd_usb2_rx_event(struct esd_usb2_net_priv *priv,
+ struct esd_usb2_msg *msg)
+{
+ struct net_device_stats *stats = &priv->netdev->stats;
+ struct can_frame *cf;
+ struct sk_buff *skb;
+ u32 id = le32_to_cpu(msg->msg.rx.id) & ESD_IDMASK;
+
+ if (id == ESD_EV_CAN_ERROR_EXT) {
+ u8 state = msg->msg.rx.data[0];
+ u8 ecc = msg->msg.rx.data[1];
+ u8 txerr = msg->msg.rx.data[2];
+ u8 rxerr = msg->msg.rx.data[3];
+
+ skb = alloc_can_err_skb(priv->netdev, &cf);
+ if (skb == NULL) {
+ stats->rx_dropped++;
+ return;
+ }
+
+ if (state != priv->old_state) {
+ priv->old_state = state;
+
+ switch (state & ESD_BUSSTATE_MASK) {
+ case ESD_BUSSTATE_BUSOFF:
+ priv->can.state = CAN_STATE_BUS_OFF;
+ cf->can_id |= CAN_ERR_BUSOFF;
+ can_bus_off(priv->netdev);
+ break;
+ case ESD_BUSSTATE_WARN:
+ priv->can.state = CAN_STATE_ERROR_WARNING;
+ priv->can.can_stats.error_warning++;
+ break;
+ case ESD_BUSSTATE_ERRPASSIVE:
+ priv->can.state = CAN_STATE_ERROR_PASSIVE;
+ priv->can.can_stats.error_passive++;
+ break;
+ default:
+ priv->can.state = CAN_STATE_ERROR_ACTIVE;
+ break;
+ }
+ } else {
+ priv->can.can_stats.bus_error++;
+ stats->rx_errors++;
+
+ cf->can_id |= CAN_ERR_PROT | CAN_ERR_BUSERROR;
+
+ switch (ecc & SJA1000_ECC_MASK) {
+ case SJA1000_ECC_BIT:
+ cf->data[2] |= CAN_ERR_PROT_BIT;
+ break;
+ case SJA1000_ECC_FORM:
+ cf->data[2] |= CAN_ERR_PROT_FORM;
+ break;
+ case SJA1000_ECC_STUFF:
+ cf->data[2] |= CAN_ERR_PROT_STUFF;
+ break;
+ default:
+ cf->data[2] |= CAN_ERR_PROT_UNSPEC;
+ cf->data[3] = ecc & SJA1000_ECC_SEG;
+ break;
+ }
+
+ /* Error occurred during transmission? */
+ if (!(ecc & SJA1000_ECC_DIR))
+ cf->data[2] |= CAN_ERR_PROT_TX;
+
+ if (priv->can.state == CAN_STATE_ERROR_WARNING ||
+ priv->can.state == CAN_STATE_ERROR_PASSIVE) {
+ cf->data[1] = (txerr > rxerr) ?
+ CAN_ERR_CRTL_TX_PASSIVE :
+ CAN_ERR_CRTL_RX_PASSIVE;
+ }
+ }
+
+ netif_rx(skb);
+
+ stats->rx_packets++;
+ stats->rx_bytes += cf->can_dlc;
+ }
+}
+
+static void esd_usb2_rx_can_msg(struct esd_usb2_net_priv *priv,
+ struct esd_usb2_msg *msg)
+{
+ struct net_device_stats *stats = &priv->netdev->stats;
+ struct can_frame *cf;
+ struct sk_buff *skb;
+ int i;
+ u32 id;
+
+ if (!netif_device_present(priv->netdev))
+ return;
+
+ id = le32_to_cpu(msg->msg.rx.id);
+
+ if (id & ESD_EVENT) {
+ esd_usb2_rx_event(priv, msg);
+ } else {
+ skb = alloc_can_skb(priv->netdev, &cf);
+ if (skb == NULL) {
+ stats->rx_dropped++;
+ return;
+ }
+
+ cf->can_id = id & ESD_IDMASK;
+ cf->can_dlc = get_can_dlc(msg->msg.rx.dlc);
+
+ if (id & ESD_EXTID)
+ cf->can_id |= CAN_EFF_FLAG;
+
+ if (msg->msg.rx.dlc & ESD_RTR) {
+ cf->can_id |= CAN_RTR_FLAG;
+ } else {
+ for (i = 0; i < cf->can_dlc; i++)
+ cf->data[i] = msg->msg.rx.data[i];
+ }
+
+ netif_rx(skb);
+
+ stats->rx_packets++;
+ stats->rx_bytes += cf->can_dlc;
+ }
+
+ return;
+}
+
+static void esd_usb2_tx_done_msg(struct esd_usb2_net_priv *priv,
+ struct esd_usb2_msg *msg)
+{
+ struct net_device_stats *stats = &priv->netdev->stats;
+ struct net_device *netdev = priv->netdev;
+ struct esd_tx_urb_context *context;
+
+ if (!netif_device_present(netdev))
+ return;
+
+ context = &priv->tx_contexts[msg->msg.txdone.hnd & (MAX_TX_URBS - 1)];
+
+ if (!msg->msg.txdone.status) {
+ stats->tx_packets++;
+ stats->tx_bytes += context->dlc;
+ can_get_echo_skb(netdev, context->echo_index);
+ } else {
+ stats->tx_errors++;
+ can_free_echo_skb(netdev, context->echo_index);
+ }
+
+ /* Release context */
+ context->echo_index = MAX_TX_URBS;
+ atomic_dec(&priv->active_tx_jobs);
+
+ netif_wake_queue(netdev);
+}
+
+static void esd_usb2_read_bulk_callback(struct urb *urb)
+{
+ struct esd_usb2 *dev = urb->context;
+ int retval;
+ int pos = 0;
+ int i;
+
+ switch (urb->status) {
+ case 0: /* success */
+ break;
+
+ case -ENOENT:
+ case -ESHUTDOWN:
+ return;
+
+ default:
+ dev_info(dev->udev->dev.parent,
+ "Rx URB aborted (%d)\n", urb->status);
+ goto resubmit_urb;
+ }
+
+ while (pos < urb->actual_length) {
+ struct esd_usb2_msg *msg;
+
+ msg = (struct esd_usb2_msg *)(urb->transfer_buffer + pos);
+
+ switch (msg->msg.hdr.cmd) {
+ case CMD_CAN_RX:
+ esd_usb2_rx_can_msg(dev->nets[msg->msg.rx.net], msg);
+ break;
+
+ case CMD_CAN_TX:
+ esd_usb2_tx_done_msg(dev->nets[msg->msg.txdone.net],
+ msg);
+ break;
+ }
+
+ pos += msg->msg.hdr.len << 2;
+
+ if (pos > urb->actual_length) {
+ dev_err(dev->udev->dev.parent, "format error\n");
+ break;
+ }
+ }
+
+resubmit_urb:
+ usb_fill_bulk_urb(urb, dev->udev, usb_rcvbulkpipe(dev->udev, 1),
+ urb->transfer_buffer, RX_BUFFER_SIZE,
+ esd_usb2_read_bulk_callback, dev);
+
+ retval = usb_submit_urb(urb, GFP_ATOMIC);
+ if (retval == -ENODEV) {
+ for (i = 0; i < dev->net_count; i++) {
+ if (dev->nets[i])
+ netif_device_detach(dev->nets[i]->netdev);
+ }
+ } else if (retval) {
+ dev_err(dev->udev->dev.parent,
+ "failed resubmitting read bulk urb: %d\n", retval);
+ }
+
+ return;
+}
+
+/*
+ * callback for bulk OUT urb
+ */
+static void esd_usb2_write_bulk_callback(struct urb *urb)
+{
+ struct esd_tx_urb_context *context = urb->context;
+ struct esd_usb2_net_priv *priv;
+ struct esd_usb2 *dev;
+ struct net_device *netdev;
+ size_t size = sizeof(struct esd_usb2_msg);
+
+ BUG_ON(!context);
+
+ priv = context->priv;
+ netdev = priv->netdev;
+ dev = priv->usb2;
+
+ /* free up our allocated buffer */
+ usb_buffer_free(urb->dev, size,
+ urb->transfer_buffer, urb->transfer_dma);
+
+ if (!netif_device_present(netdev))
+ return;
+
+ if (urb->status)
+ dev_info(netdev->dev.parent, "Tx URB aborted (%d)\n",
+ urb->status);
+
+ netdev->trans_start = jiffies;
+}
+
+#ifdef CONFIG_SYSFS
+static ssize_t show_firmware(struct device *d,
+ struct device_attribute *attr, char *buf)
+{
+ struct usb_interface *intf = to_usb_interface(d);
+ struct esd_usb2 *dev = usb_get_intfdata(intf);
+
+ return sprintf(buf, "%d.%d.%d\n",
+ (dev->version >> 12) & 0xf,
+ (dev->version >> 8) & 0xf,
+ dev->version & 0xff);
+}
+static DEVICE_ATTR(firmware, S_IRUGO, show_firmware, NULL);
+
+static ssize_t show_hardware(struct device *d,
+ struct device_attribute *attr, char *buf)
+{
+ struct usb_interface *intf = to_usb_interface(d);
+ struct esd_usb2 *dev = usb_get_intfdata(intf);
+
+ return sprintf(buf, "%d.%d.%d\n",
+ (dev->version >> 28) & 0xf,
+ (dev->version >> 24) & 0xf,
+ (dev->version >> 16) & 0xff);
+}
+static DEVICE_ATTR(hardware, S_IRUGO, show_hardware, NULL);
+
+static ssize_t show_nets(struct device *d,
+ struct device_attribute *attr, char *buf)
+{
+ struct usb_interface *intf = to_usb_interface(d);
+ struct esd_usb2 *dev = usb_get_intfdata(intf);
+
+ return sprintf(buf, "%d", dev->net_count);
+}
+static DEVICE_ATTR(nets, S_IRUGO, show_nets, NULL);
+#endif
+
+static int esd_usb2_send_msg(struct esd_usb2 *dev, struct esd_usb2_msg *msg)
+{
+ int actual_length;
+
+ return usb_bulk_msg(dev->udev,
+ usb_sndbulkpipe(dev->udev, 2),
+ msg,
+ msg->msg.hdr.len << 2,
+ &actual_length,
+ 1000);
+}
+
+static int esd_usb2_wait_msg(struct esd_usb2 *dev,
+ struct esd_usb2_msg *msg)
+{
+ int actual_length;
+
+ return usb_bulk_msg(dev->udev,
+ usb_rcvbulkpipe(dev->udev, 1),
+ msg,
+ sizeof(*msg),
+ &actual_length,
+ 1000);
+}
+
+static int esd_usb2_setup_rx_urbs(struct esd_usb2 *dev)
+{
+ int i, err = 0;
+
+ if (dev->rxinitdone)
+ return 0;
+
+ for (i = 0; i < MAX_RX_URBS; i++) {
+ struct urb *urb = NULL;
+ u8 *buf = NULL;
+
+ /* create a URB, and a buffer for it */
+ urb = usb_alloc_urb(0, GFP_KERNEL);
+ if (!urb) {
+ dev_warn(dev->udev->dev.parent,
+ "No memory left for URBs\n");
+ err = -ENOMEM;
+ break;
+ }
+
+ buf = usb_buffer_alloc(dev->udev, RX_BUFFER_SIZE, GFP_KERNEL,
+ &urb->transfer_dma);
+ if (!buf) {
+ dev_warn(dev->udev->dev.parent,
+ "No memory left for USB buffer\n");
+ err = -ENOMEM;
+ goto freeurb;
+ }
+
+ usb_fill_bulk_urb(urb, dev->udev,
+ usb_rcvbulkpipe(dev->udev, 1),
+ buf, RX_BUFFER_SIZE,
+ esd_usb2_read_bulk_callback, dev);
+ urb->transfer_flags |= URB_NO_TRANSFER_DMA_MAP;
+ usb_anchor_urb(urb, &dev->rx_submitted);
+
+ err = usb_submit_urb(urb, GFP_KERNEL);
+ if (err) {
+ usb_unanchor_urb(urb);
+ usb_buffer_free(dev->udev, RX_BUFFER_SIZE, buf,
+ urb->transfer_dma);
+ }
+
+freeurb:
+ /* Drop reference, USB core will take care of freeing it */
+ usb_free_urb(urb);
+ if (err)
+ break;
+ }
+
+ /* Did we submit any URBs? */
+ if (i == 0) {
+ dev_err(dev->udev->dev.parent, "couldn't setup read URBs\n");
+ return err;
+ }
+
+ /* Warn if we couldn't submit all the URBs */
+ if (i < MAX_RX_URBS) {
+ dev_warn(dev->udev->dev.parent,
+ "rx performance may be slow\n");
+ }
+
+ dev->rxinitdone = 1;
+ return 0;
+}
+
+/*
+ * Start interface
+ */
+static int esd_usb2_start(struct esd_usb2_net_priv *priv)
+{
+ struct esd_usb2 *dev = priv->usb2;
+ struct net_device *netdev = priv->netdev;
+ struct esd_usb2_msg msg;
+ int err, i;
+
+ /*
+ * Enable all IDs
+ * The IDADD message takes up to 64 32 bit bitmasks (2048 bits).
+ * Each bit represents one 11 bit CAN identifier. A set bit
+ * enables reception of the corresponding CAN identifier. A cleared
+ * bit disables this identifier. An additional bitmask value
+ * following the CAN 2.0A bits is used to enable reception of
+ * extended CAN frames. Only the LSB of this final mask is checked
+ * for the complete 29 bit ID range. The IDADD message also allows
+ * filter configuration for an ID subset. In this case you can add
+ * the number of the starting bitmask (0..64) to the filter.option
+ * field followed by only some bitmasks.
+ */
+ msg.msg.hdr.cmd = CMD_IDADD;
+ msg.msg.hdr.len = 2 + ESD_MAX_ID_SEGMENT;
+ msg.msg.filter.net = priv->index;
+ msg.msg.filter.option = ESD_ID_ENABLE; /* start with segment 0 */
+ for (i = 0; i < ESD_MAX_ID_SEGMENT; i++)
+ msg.msg.filter.mask[i] = cpu_to_le32(0xffffffff);
+ /* enable 29bit extended IDs */
+ msg.msg.filter.mask[ESD_MAX_ID_SEGMENT] = cpu_to_le32(0x00000001);
+
+ err = esd_usb2_send_msg(dev, &msg);
+ if (err)
+ goto failed;
+
+ err = esd_usb2_setup_rx_urbs(dev);
+ if (err)
+ goto failed;
+
+ priv->can.state = CAN_STATE_ERROR_ACTIVE;
+
+ return 0;
+
+failed:
+ if (err == -ENODEV)
+ netif_device_detach(netdev);
+
+ dev_err(netdev->dev.parent, "couldn't start device: %d\n", err);
+
+ return err;
+}
+
+static void unlink_all_urbs(struct esd_usb2 *dev)
+{
+ struct esd_usb2_net_priv *priv;
+ int i, j;
+
+ usb_kill_anchored_urbs(&dev->rx_submitted);
+ for (i = 0; i < dev->net_count; i++) {
+ priv = dev->nets[i];
+ if (priv) {
+ usb_kill_anchored_urbs(&priv->tx_submitted);
+ atomic_set(&priv->active_tx_jobs, 0);
+
+ for (j = 0; j < MAX_TX_URBS; j++)
+ priv->tx_contexts[j].echo_index = MAX_TX_URBS;
+ }
+ }
+}
+
+static int esd_usb2_open(struct net_device *netdev)
+{
+ struct esd_usb2_net_priv *priv = netdev_priv(netdev);
+ int err;
+
+ /* common open */
+ err = open_candev(netdev);
+ if (err)
+ return err;
+
+ /* finally start device */
+ err = esd_usb2_start(priv);
+ if (err) {
+ dev_warn(netdev->dev.parent,
+ "couldn't start device: %d\n", err);
+ close_candev(netdev);
+ return err;
+ }
+
+ priv->open_time = jiffies;
+
+ netif_start_queue(netdev);
+
+ return 0;
+}
+
+static netdev_tx_t esd_usb2_start_xmit(struct sk_buff *skb,
+ struct net_device *netdev)
+{
+ struct esd_usb2_net_priv *priv = netdev_priv(netdev);
+ struct esd_usb2 *dev = priv->usb2;
+ struct esd_tx_urb_context *context = NULL;
+ struct net_device_stats *stats = &netdev->stats;
+ struct can_frame *cf = (struct can_frame *)skb->data;
+ struct esd_usb2_msg *msg;
+ struct urb *urb;
+ u8 *buf;
+ int i, err;
+ int ret = NETDEV_TX_OK;
+ size_t size = sizeof(struct esd_usb2_msg);
+
+ if (can_dropped_invalid_skb(netdev, skb))
+ return NETDEV_TX_OK;
+
+ /* create a URB, and a buffer for it, and copy the data to the URB */
+ urb = usb_alloc_urb(0, GFP_ATOMIC);
+ if (!urb) {
+ dev_err(netdev->dev.parent, "No memory left for URBs\n");
+ stats->tx_dropped++;
+ dev_kfree_skb(skb);
+ goto nourbmem;
+ }
+
+ buf = usb_buffer_alloc(dev->udev, size, GFP_ATOMIC, &urb->transfer_dma);
+ if (!buf) {
+ dev_err(netdev->dev.parent, "No memory left for USB buffer\n");
+ stats->tx_dropped++;
+ dev_kfree_skb(skb);
+ goto nobufmem;
+ }
+
+ msg = (struct esd_usb2_msg *)buf;
+
+ msg->msg.hdr.len = 3; /* minimal length */
+ msg->msg.hdr.cmd = CMD_CAN_TX;
+ msg->msg.tx.net = priv->index;
+ msg->msg.tx.dlc = cf->can_dlc;
+ msg->msg.tx.id = cpu_to_le32(cf->can_id & CAN_ERR_MASK);
+
+ if (cf->can_id & CAN_RTR_FLAG)
+ msg->msg.tx.dlc |= ESD_RTR;
+
+ if (cf->can_id & CAN_EFF_FLAG)
+ msg->msg.tx.id |= cpu_to_le32(ESD_EXTID);
+
+ for (i = 0; i < cf->can_dlc; i++)
+ msg->msg.tx.data[i] = cf->data[i];
+
+ msg->msg.hdr.len += (cf->can_dlc + 3) >> 2;
+
+ for (i = 0; i < MAX_TX_URBS; i++) {
+ if (priv->tx_contexts[i].echo_index == MAX_TX_URBS) {
+ context = &priv->tx_contexts[i];
+ break;
+ }
+ }
+
+ /*
+ * This should never happen.
+ */
+ if (!context) {
+ dev_warn(netdev->dev.parent, "couldn't find free context\n");
+ ret = NETDEV_TX_BUSY;
+ goto releasebuf;
+ }
+
+ context->priv = priv;
+ context->echo_index = i;
+ context->dlc = cf->can_dlc;
+
+ /* hnd must not be 0 */
+ msg->msg.tx.hnd = 0x80000000 | i; /* returned in TX done message */
+
+ usb_fill_bulk_urb(urb, dev->udev, usb_sndbulkpipe(dev->udev, 2), buf,
+ msg->msg.hdr.len << 2,
+ esd_usb2_write_bulk_callback, context);
+
+ urb->transfer_flags |= URB_NO_TRANSFER_DMA_MAP;
+
+ usb_anchor_urb(urb, &priv->tx_submitted);
+
+ can_put_echo_skb(skb, netdev, context->echo_index);
+
+ atomic_inc(&priv->active_tx_jobs);
+
+ err = usb_submit_urb(urb, GFP_ATOMIC);
+ if (err) {
+ can_free_echo_skb(netdev, context->echo_index);
+
+ atomic_dec(&priv->active_tx_jobs);
+ usb_unanchor_urb(urb);
+
+ stats->tx_dropped++;
+
+ if (err == -ENODEV)
+ netif_device_detach(netdev);
+ else
+ dev_warn(netdev->dev.parent, "failed tx_urb %d\n", err);
+
+ goto releasebuf;
+ }
+
+ netdev->trans_start = jiffies;
+
+ /* Slow down tx path */
+ if (atomic_read(&priv->active_tx_jobs) >= MAX_TX_URBS)
+ netif_stop_queue(netdev);
+
+ /*
+ * Release our reference to this URB, the USB core will eventually free
+ * it entirely.
+ */
+ usb_free_urb(urb);
+
+ return NETDEV_TX_OK;
+
+releasebuf:
+ usb_buffer_free(dev->udev, size, buf, urb->transfer_dma);
+
+nobufmem:
+ usb_free_urb(urb);
+
+nourbmem:
+ return ret;
+}
+
+static int esd_usb2_close(struct net_device *netdev)
+{
+ struct esd_usb2_net_priv *priv = netdev_priv(netdev);
+ struct esd_usb2_msg msg;
+ int i;
+
+ /* Disable all IDs (see esd_usb2_start()) */
+ msg.msg.hdr.cmd = CMD_IDADD;
+ msg.msg.hdr.len = 2 + ESD_MAX_ID_SEGMENT;
+ msg.msg.filter.net = priv->index;
+ msg.msg.filter.option = ESD_ID_ENABLE; /* start with segment 0 */
+ for (i = 0; i <= ESD_MAX_ID_SEGMENT; i++)
+ msg.msg.filter.mask[i] = 0;
+ esd_usb2_send_msg(priv->usb2, &msg);
+
+ /* set CAN controller to reset mode */
+ msg.msg.hdr.len = 2;
+ msg.msg.hdr.cmd = CMD_SETBAUD;
+ msg.msg.setbaud.net = priv->index;
+ msg.msg.setbaud.rsvd = 0;
+ msg.msg.setbaud.baud = cpu_to_le32(ESD_USB2_NO_BAUDRATE);
+ esd_usb2_send_msg(priv->usb2, &msg);
+
+ priv->can.state = CAN_STATE_STOPPED;
+
+ netif_stop_queue(netdev);
+
+ close_candev(netdev);
+
+ priv->open_time = 0;
+
+ return 0;
+}
+
+static const struct net_device_ops esd_usb2_netdev_ops = {
+ .ndo_open = esd_usb2_open,
+ .ndo_stop = esd_usb2_close,
+ .ndo_start_xmit = esd_usb2_start_xmit,
+};
+
+static struct can_bittiming_const esd_usb2_bittiming_const = {
+ .name = "esd_usb2",
+ .tseg1_min = ESD_USB2_TSEG1_MIN,
+ .tseg1_max = ESD_USB2_TSEG1_MAX,
+ .tseg2_min = ESD_USB2_TSEG2_MIN,
+ .tseg2_max = ESD_USB2_TSEG2_MAX,
+ .sjw_max = ESD_USB2_SJW_MAX,
+ .brp_min = ESD_USB2_BRP_MIN,
+ .brp_max = ESD_USB2_BRP_MAX,
+ .brp_inc = ESD_USB2_BRP_INC,
+};
+
+static int esd_usb2_set_bittiming(struct net_device *netdev)
+{
+ struct esd_usb2_net_priv *priv = netdev_priv(netdev);
+ struct can_bittiming *bt = &priv->can.bittiming;
+ struct esd_usb2_msg msg;
+ u32 canbtr;
+
+ canbtr = ESD_USB2_UBR;
+ canbtr |= (bt->brp - 1) & (ESD_USB2_BRP_MAX - 1);
+ canbtr |= ((bt->sjw - 1) & (ESD_USB2_SJW_MAX - 1))
+ << ESD_USB2_SJW_SHIFT;
+ canbtr |= ((bt->prop_seg + bt->phase_seg1 - 1)
+ & (ESD_USB2_TSEG1_MAX - 1))
+ << ESD_USB2_TSEG1_SHIFT;
+ canbtr |= ((bt->phase_seg2 - 1) & (ESD_USB2_TSEG2_MAX - 1))
+ << ESD_USB2_TSEG2_SHIFT;
+ if (priv->can.ctrlmode & CAN_CTRLMODE_3_SAMPLES)
+ canbtr |= ESD_USB2_3_SAMPLES;
+
+ msg.msg.hdr.len = 2;
+ msg.msg.hdr.cmd = CMD_SETBAUD;
+ msg.msg.setbaud.net = priv->index;
+ msg.msg.setbaud.rsvd = 0;
+ msg.msg.setbaud.baud = cpu_to_le32(canbtr);
+
+ dev_info(netdev->dev.parent, "setting BTR=%#x\n", canbtr);
+
+ return esd_usb2_send_msg(priv->usb2, &msg);
+}
+
+static int esd_usb2_set_mode(struct net_device *netdev, enum can_mode mode)
+{
+ struct esd_usb2_net_priv *priv = netdev_priv(netdev);
+
+ if (!priv->open_time)
+ return -EINVAL;
+
+ switch (mode) {
+ case CAN_MODE_START:
+ netif_wake_queue(netdev);
+ break;
+
+ default:
+ return -EOPNOTSUPP;
+ }
+
+ return 0;
+}
+
+static int esd_usb2_probe_one_net(struct usb_interface *intf, int index)
+{
+ struct esd_usb2 *dev = usb_get_intfdata(intf);
+ struct net_device *netdev;
+ struct esd_usb2_net_priv *priv;
+ int err;
+ int i;
+
+ netdev = alloc_candev(sizeof(*priv), MAX_TX_URBS);
+ if (!netdev) {
+ dev_err(&intf->dev, "couldn't alloc candev\n");
+ return -ENOMEM;
+ }
+
+ priv = netdev_priv(netdev);
+
+ init_usb_anchor(&priv->tx_submitted);
+ atomic_set(&priv->active_tx_jobs, 0);
+
+ for (i = 0; i < MAX_TX_URBS; i++)
+ priv->tx_contexts[i].echo_index = MAX_TX_URBS;
+
+ priv->usb2 = dev;
+ priv->netdev = netdev;
+ priv->index = index;
+
+ priv->can.state = CAN_STATE_STOPPED;
+ priv->can.clock.freq = ESD_USB2_CAN_CLOCK;
+ priv->can.bittiming_const = &esd_usb2_bittiming_const;
+ priv->can.do_set_bittiming = esd_usb2_set_bittiming;
+ priv->can.do_set_mode = esd_usb2_set_mode;
+ priv->can.ctrlmode_supported = CAN_CTRLMODE_3_SAMPLES;
+
+ netdev->flags |= IFF_ECHO; /* we support local echo */
+
+ netdev->netdev_ops = &esd_usb2_netdev_ops;
+
+ SET_NETDEV_DEV(netdev, &intf->dev);
+
+ err = register_candev(netdev);
+ if (err) {
+ dev_err(&intf->dev,
+ "couldn't register CAN device: %d\n", err);
+ free_candev(netdev);
+ return err;
+ }
+
+ dev->nets[index] = priv;
+ dev_info(netdev->dev.parent, "device %s registered\n", netdev->name);
+ return 0;
+}
+
+/*
+ * probe function for new USB2 devices
+ *
+ * check version information and number of available
+ * CAN interfaces
+ */
+static int esd_usb2_probe(struct usb_interface *intf,
+ const struct usb_device_id *id)
+{
+ struct esd_usb2 *dev;
+ struct esd_usb2_msg msg;
+ int i, err = -ENOMEM;
+
+ dev = kzalloc(sizeof(*dev), GFP_KERNEL);
+ if (!dev)
+ return -ENOMEM;
+
+ dev->udev = interface_to_usbdev(intf);
+
+ init_usb_anchor(&dev->rx_submitted);
+
+ usb_set_intfdata(intf, dev);
+
+ /* query number of CAN interfaces (nets) */
+ msg.msg.hdr.cmd = CMD_VERSION;
+ msg.msg.hdr.len = 2;
+ msg.msg.version.rsvd = 0;
+ msg.msg.version.flags = 0;
+ msg.msg.version.drv_version = 0;
+
+ if (esd_usb2_send_msg(dev, &msg) < 0) {
+ dev_err(&intf->dev, "sending version message failed\n");
+ goto free_dev;
+ }
+
+ if (esd_usb2_wait_msg(dev, &msg) < 0) {
+ dev_err(&intf->dev, "no version message answer\n");
+ goto free_dev;
+ }
+
+ dev->net_count = (int)msg.msg.version_reply.nets;
+ dev->version = le32_to_cpu(msg.msg.version_reply.version);
+
+#ifdef CONFIG_SYSFS
+ if (device_create_file(&intf->dev, &dev_attr_firmware))
+ dev_err(&intf->dev,
+ "Couldn't create device file for firmware\n");
+
+ if (device_create_file(&intf->dev, &dev_attr_hardware))
+ dev_err(&intf->dev,
+ "Couldn't create device file for hardware\n");
+
+ if (device_create_file(&intf->dev, &dev_attr_nets))
+ dev_err(&intf->dev,
+ "Couldn't create device file for nets\n");
+#endif
+
+ /* do per device probing */
+ for (i = 0; i < dev->net_count; i++)
+ esd_usb2_probe_one_net(intf, i);
+
+ return 0;
+
+free_dev:
+ kfree(dev);
+ return err;
+}
+
+/*
+ * called by the usb core when the device is removed from the system
+ */
+static void esd_usb2_disconnect(struct usb_interface *intf)
+{
+ struct esd_usb2 *dev = usb_get_intfdata(intf);
+ struct net_device *netdev;
+ int i;
+
+#ifdef CONFIG_SYSFS
+ device_remove_file(&intf->dev, &dev_attr_firmware);
+ device_remove_file(&intf->dev, &dev_attr_hardware);
+ device_remove_file(&intf->dev, &dev_attr_nets);
+#endif
+ usb_set_intfdata(intf, NULL);
+
+ if (dev) {
+ for (i = 0; i < dev->net_count; i++) {
+ if (dev->nets[i]) {
+ netdev = dev->nets[i]->netdev;
+ unregister_netdev(netdev);
+ free_candev(netdev);
+ }
+ }
+ unlink_all_urbs(dev);
+ }
+}
+
+/* usb specific object needed to register this driver with the usb subsystem */
+static struct usb_driver esd_usb2_driver = {
+ .name = "esd_usb2",
+ .probe = esd_usb2_probe,
+ .disconnect = esd_usb2_disconnect,
+ .id_table = esd_usb2_table,
+};
+
+static int __init esd_usb2_init(void)
+{
+ int err;
+
+ /* register this driver with the USB subsystem */
+ err = usb_register(&esd_usb2_driver);
+
+ if (err) {
+ err("usb_register failed. Error number %d\n", err);
+ return err;
+ }
+
+ return 0;
+}
+module_init(esd_usb2_init);
+
+static void __exit esd_usb2_exit(void)
+{
+ /* deregister this driver with the USB subsystem */
+ usb_deregister(&esd_usb2_driver);
+}
+module_exit(esd_usb2_exit);
--
1.5.6.3
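
For readers following esd_usb2_set_bittiming() in the patch above, the packing of the SETBAUD word can be tried out in user space. The sketch below is not part of the patch: it copies the shift and limit constants from the driver, while the function name pack_btr() and its parameter list are made up for illustration. It assumes the inputs are already within the ranges given by esd_usb2_bittiming_const.

```c
#include <stdint.h>

/* Constants copied from the patch (ESD_USB2_* prefix dropped for brevity). */
#define UBR          0x80000000u  /* "user bit rate" flag */
#define THREE_SAMPLES 0x00800000u
#define BRP_MAX      1024u
#define SJW_MAX      4u
#define SJW_SHIFT    14
#define TSEG1_MAX    16u
#define TSEG1_SHIFT  16
#define TSEG2_MAX    8u
#define TSEG2_SHIFT  20

/*
 * Hypothetical helper: pack CAN bit timing values into the 32-bit word
 * sent with CMD_SETBAUD. Each field is stored as "value - 1" and masked
 * to its width, exactly as the driver does.
 */
static uint32_t pack_btr(uint32_t brp, uint32_t sjw, uint32_t prop_seg,
			 uint32_t phase_seg1, uint32_t phase_seg2,
			 int three_samples)
{
	uint32_t btr = UBR;

	btr |= (brp - 1) & (BRP_MAX - 1);
	btr |= ((sjw - 1) & (SJW_MAX - 1)) << SJW_SHIFT;
	btr |= ((prop_seg + phase_seg1 - 1) & (TSEG1_MAX - 1)) << TSEG1_SHIFT;
	btr |= ((phase_seg2 - 1) & (TSEG2_MAX - 1)) << TSEG2_SHIFT;
	if (three_samples)
		btr |= THREE_SAMPLES;

	return btr;
}
```

With the 60 MHz clock from the patch, brp = 6 gives a 10 MHz time quantum clock. Using 20 quanta per bit (1 sync + 15 tseg1 + 4 tseg2) would then correspond to 500 kbit/s, and pack_btr(6, 1, 7, 8, 4, 0) evaluates to 0x803e0005.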
* Re: [PATCH] NIU support for skb->rxhash
From: David Miller @ 2010-04-23 8:14 UTC (permalink / raw)
To: eric.dumazet; +Cc: netdev
In-Reply-To: <20100422.141922.39169749.davem@davemloft.net>
From: David Miller <davem@davemloft.net>
Date: Thu, 22 Apr 2010 14:19:22 -0700 (PDT)
> Also I have some ideas about what we can do if we have
> just the rxhash. It seems we can avoid the type_trans
> overhead on the interrupting cpu.
>
> Things like eth_type_trans() become a netdev operation rather than
> something drivers statically call by hand. ->ndo_type_trans or
> similar.
>
> SKB has a state bit saying whether ->ndo_type_trans has been invoked
> yet on RX.
>
> Drivers pass raw SKBs up into the stack.
>
> We defer the ->ndo_type_trans as far as possible, for RPS when we have
> ->rxhash we can defer this all the way to the destination RPS cpu.
>
> If we lack ->rxhash, the source cpu will need to invoke
> ->ndo_type_trans before it can begin parsing the packet.
I looked into implementing this and it doesn't work. The
problem is GRO wants to look into the packet very early
and we want to batch GRO a set of packets into a big packet
before shooting them over to a remote cpu.
This reminds me that we can start using ->rxhash as a quick
mismatch check in the GRO flow matcher.
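
The "quick mismatch check" idea can be illustrated outside the kernel. The sketch below is only a user-space model, not GRO code: struct flow_key, struct pkt and same_flow() are hypothetical names, and it assumes that an rxhash of 0 means the driver supplied no hash. Two valid but different hashes rule out a match immediately; otherwise the full comparison still has to run.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified flow key; real GRO compares protocol headers. */
struct flow_key {
	uint32_t saddr;
	uint32_t daddr;
	uint16_t sport;
	uint16_t dport;
};

/* Hypothetical packet descriptor; rxhash == 0 is assumed to mean "no hash". */
struct pkt {
	uint32_t rxhash;
	struct flow_key key;
};

static bool same_flow(const struct pkt *a, const struct pkt *b)
{
	/* Cheap reject: two valid hashes that differ cannot be one flow. */
	if (a->rxhash && b->rxhash && a->rxhash != b->rxhash)
		return false;

	/* Equal or missing hashes prove nothing, so compare the full key. */
	return a->key.saddr == b->key.saddr &&
	       a->key.daddr == b->key.daddr &&
	       a->key.sport == b->key.sport &&
	       a->key.dport == b->key.dport;
}
```

The hash only ever short-circuits the negative case; a hash match is never treated as proof that two packets belong together.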
* Re: DDoS attack causing bad effect on conntrack searches
From: David Miller @ 2010-04-23 8:13 UTC (permalink / raw)
To: eric.dumazet; +Cc: hawk, paulmck, kaber, xiaosuo, hawk, netdev, netfilter-devel
In-Reply-To: <1272001478.7895.7545.camel@edumazet-laptop>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 23 Apr 2010 07:44:38 +0200
> Le jeudi 22 avril 2010 à 16:44 -0700, David Miller a écrit :
>> Eric, I wonder if we run into some kind of issue on 32-bit systems
>> because we always lose a bit of the conntrack hash value when we store
>> it into the 'nulls' area?
>>
>> Wouldn't that make the "get_nulls_value(n) != hash" fail?
>> --
>
>
> Well, 'hash' at this time is not the result of the jhash() transform [0
> - 0xFFFFFFFF], but a slot number in htable [0 - (300032-1)].
Aha, I see.
I really can't see what might cause this behavior then.
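
To make Eric's point concrete, here is a small user-space model of the 'nulls' end marker. It assumes the usual hlist_nulls encoding, where the marker word is (value << 1) | 1, and uses a uint32_t to stand in for a 32-bit pointer. make_nulls() is a made-up name for this sketch; is_a_nulls() and get_nulls_value() borrow the kernel helper names but take a plain integer here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Model of a 32-bit pointer-sized word holding a nulls end marker. */
static uint32_t make_nulls(uint32_t value)
{
	/* The low bit tags the word as a marker; the top bit of value is lost. */
	return (value << 1) | 1u;
}

static bool is_a_nulls(uint32_t word)
{
	return (word & 1u) != 0;
}

static uint32_t get_nulls_value(uint32_t word)
{
	return word >> 1;
}
```

A slot number such as 300031 fits comfortably in 31 bits and survives the round trip unchanged, so the "get_nulls_value(n) != hash" comparison is not affected. Only a full 32-bit value such as a raw jhash() result with the top bit set would come back truncated.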
* Re: DDoS attack causing bad effect on conntrack searches
From: Jan Engelhardt @ 2010-04-23 7:55 UTC (permalink / raw)
To: Eric Dumazet
Cc: Jesper Dangaard Brouer, Patrick McHardy, hawk,
Linux Kernel Network Hackers, Netfilter Developers
In-Reply-To: <1272008780.7895.7746.camel@edumazet-laptop>
On Friday 2010-04-23 09:46, Eric Dumazet wrote:
>Le vendredi 23 avril 2010 à 09:23 +0200, Jan Engelhardt a écrit :
>> On Thursday 2010-04-22 23:28, Jesper Dangaard Brouer wrote:
>>
>> > On Thu, 22 Apr 2010, Eric Dumazet wrote:
>> >
>> >> What exact version of kernel are you running ?
>> >
>> > 2.6.31.7-pvlan2G #3 SMP PREEMPT
>> > 32-bit kernel with 2G kernel mem (you showed me that trick).
>>
>> Since when is enabling 2G a trick? :)
>> There's CONFIG_VMSPLIT_2G (and 2G_OPT) for quite some time now.
>>
>
>Yes, when you know it, it's not a trick anymore :)
>
>Years ago, we had to manually change PAGE_OFFSET, and I remember some
>machines with PAGE_OFFSET 0xA0000000 (1.5 GB LOWMEM),
>or 0xB0000000 (1.25 GB), (PAE off)
I notice that 0xB0000000, which is now known as LOWMEM_3G_OPT,
is only available when PAE is off. Would you know the reason for
that decision? Are some values unsuitable for PAE?
* Re: DDoS attack causing bad effect on conntrack searches
From: Eric Dumazet @ 2010-04-23 7:46 UTC (permalink / raw)
To: Jan Engelhardt
Cc: Jesper Dangaard Brouer, Patrick McHardy, hawk,
Linux Kernel Network Hackers, Netfilter Developers
In-Reply-To: <alpine.LSU.2.01.1004230922120.24961@obet.zrqbmnf.qr>
Le vendredi 23 avril 2010 à 09:23 +0200, Jan Engelhardt a écrit :
> On Thursday 2010-04-22 23:28, Jesper Dangaard Brouer wrote:
>
> > On Thu, 22 Apr 2010, Eric Dumazet wrote:
> >
> >> What exact version of kernel are you running ?
> >
> > 2.6.31.7-pvlan2G #3 SMP PREEMPT
> > 32-bit kernel with 2G kernel mem (you showed me that trick).
>
> Since when is enabling 2G a trick? :)
> There's CONFIG_VMSPLIT_2G (and 2G_OPT) for quite some time now.
>
Yes, when you know it, it's not a trick anymore :)
Years ago, we had to manually change PAGE_OFFSET, and I remember some
machines with PAGE_OFFSET 0xA0000000 (1.5 GB LOWMEM),
or 0xB0000000 (1.25 GB), (PAE off)
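
The figures above follow from simple arithmetic on a 32-bit address space: everything from PAGE_OFFSET up to 4 GB belongs to the kernel. The helper below is only an illustration (kernel_window_mb() is a made-up name), and it reports the size of the kernel's virtual window; usable lowmem would be somewhat smaller, since part of that window is reserved for vmalloc and similar mappings.

```c
#include <stdint.h>

/*
 * Hypothetical helper: size in MB of the kernel part of a 32-bit
 * virtual address space for a given PAGE_OFFSET.
 */
static uint32_t kernel_window_mb(uint32_t page_offset)
{
	uint64_t top = UINT64_C(1) << 32;	/* 4 GB */

	return (uint32_t)((top - page_offset) >> 20);
}
```

For example, 0xA0000000 gives 1536 MB (1.5 GB), 0xB0000000 gives 1280 MB (1.25 GB), and the 2G split at 0x80000000 gives 2048 MB.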
* Re: DDoS attack causing bad effect on conntrack searches
From: Jan Engelhardt @ 2010-04-23 7:23 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Eric Dumazet, Patrick McHardy, hawk, Linux Kernel Network Hackers,
Netfilter Developers
In-Reply-To: <Pine.LNX.4.64.1004222323020.11449@ask.diku.dk>
On Thursday 2010-04-22 23:28, Jesper Dangaard Brouer wrote:
> On Thu, 22 Apr 2010, Eric Dumazet wrote:
>
>> What exact version of kernel are you running ?
>
> 2.6.31.7-pvlan2G #3 SMP PREEMPT
> 32-bit kernel with 2G kernel mem (you showed me that trick).
Since when is enabling 2G a trick? :)
There's CONFIG_VMSPLIT_2G (and 2G_OPT) for quite some time now.
* Re: [PATCH] bnx2x: add support for receive hashing
From: David Miller @ 2010-04-23 7:11 UTC (permalink / raw)
To: therbert; +Cc: netdev
In-Reply-To: <alpine.DEB.1.00.1004222249400.27016@pokey.mtv.corp.google.com>
From: Tom Herbert <therbert@google.com>
Date: Thu, 22 Apr 2010 22:54:16 -0700 (PDT)
> Add support to bnx2x to extract Toeplitz hash out of the receive descriptor
> for use in skb->rxhash.
>
> Signed-off-by: Tom Herbert <therbert@google.com>
Sweeeeet.
Applied, thanks Tom.
* Re: [RFC][PATCH v3 2/3] Provides multiple submits and asynchronous notifications.
From: xiaohui.xin @ 2010-04-23 7:08 UTC (permalink / raw)
To: mst; +Cc: arnd, netdev, kvm, linux-kernel, mingo, davem, jdike, Xin Xiaohui
In-Reply-To: <20100422094951.GB30532@redhat.com>
From: Xin Xiaohui <xiaohui.xin@intel.com>
The vhost-net backend currently supports only synchronous send/recv
operations. This patch adds multiple submits and asynchronous
notifications, which are needed for the zero-copy case.
Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
---
Michael,
>>>Can't vhost supply a kiocb completion callback that will handle the list?
>>Yes, thanks. And with it I also remove the vq->receivr finally.
>>Thanks
>>Xiaohui
>Nice progress. I commented on some minor issues below.
>Thanks!
The updated patch addressed your comments on the minor issues.
Thanks!
Thanks
Xiaohui
drivers/vhost/net.c | 236 +++++++++++++++++++++++++++++++++++++++++++++++-
drivers/vhost/vhost.c | 120 ++++++++++++++-----------
drivers/vhost/vhost.h | 14 +++
3 files changed, 314 insertions(+), 56 deletions(-)
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 38989d1..18f6c41 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -23,6 +23,8 @@
#include <linux/if_arp.h>
#include <linux/if_tun.h>
#include <linux/if_macvlan.h>
+#include <linux/mpassthru.h>
+#include <linux/aio.h>
#include <net/sock.h>
@@ -48,6 +50,7 @@ struct vhost_net {
struct vhost_dev dev;
struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
struct vhost_poll poll[VHOST_NET_VQ_MAX];
+ struct kmem_cache *cache;
/* Tells us whether we are polling a socket for TX.
* We only do this when socket buffer fills up.
* Protected by tx vq lock. */
@@ -92,11 +95,138 @@ static void tx_poll_start(struct vhost_net *net, struct socket *sock)
net->tx_poll_state = VHOST_NET_POLL_STARTED;
}
+struct kiocb *notify_dequeue(struct vhost_virtqueue *vq)
+{
+ struct kiocb *iocb = NULL;
+ unsigned long flags;
+
+ spin_lock_irqsave(&vq->notify_lock, flags);
+ if (!list_empty(&vq->notifier)) {
+ iocb = list_first_entry(&vq->notifier,
+ struct kiocb, ki_list);
+ list_del(&iocb->ki_list);
+ }
+ spin_unlock_irqrestore(&vq->notify_lock, flags);
+ return iocb;
+}
+
+static void handle_iocb(struct kiocb *iocb)
+{
+ struct vhost_virtqueue *vq = iocb->private;
+ unsigned long flags;
+
+ spin_lock_irqsave(&vq->notify_lock, flags);
+ list_add_tail(&iocb->ki_list, &vq->notifier);
+ spin_unlock_irqrestore(&vq->notify_lock, flags);
+}
+
+static int is_async_vq(struct vhost_virtqueue *vq)
+{
+ return (vq->link_state == VHOST_VQ_LINK_ASYNC);
+}
+
+static void handle_async_rx_events_notify(struct vhost_net *net,
+ struct vhost_virtqueue *vq,
+ struct socket *sock)
+{
+ struct kiocb *iocb = NULL;
+ struct vhost_log *vq_log = NULL;
+ int rx_total_len = 0;
+ unsigned int head, log = 0, in, out;
+ int size;
+
+ if (!is_async_vq(vq))
+ return;
+
+ if (sock->sk->sk_data_ready)
+ sock->sk->sk_data_ready(sock->sk, 0);
+
+ vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
+ vq->log : NULL;
+
+ while ((iocb = notify_dequeue(vq)) != NULL) {
+ vhost_add_used_and_signal(&net->dev, vq,
+ iocb->ki_pos, iocb->ki_nbytes);
+ size = iocb->ki_nbytes;
+ head = iocb->ki_pos;
+ rx_total_len += iocb->ki_nbytes;
+
+ if (iocb->ki_dtor)
+ iocb->ki_dtor(iocb);
+ kmem_cache_free(net->cache, iocb);
+
+ /* When logging is enabled, the log info must be recomputed:
+ * these buffers were queued asynchronously and may not have
+ * had their log info recorded when they were submitted.
+ */
+ if (unlikely(vq_log)) {
+ if (!log)
+ __vhost_get_vq_desc(&net->dev, vq, vq->iov,
+ ARRAY_SIZE(vq->iov),
+ &out, &in, vq_log,
+ &log, head);
+ vhost_log_write(vq, vq_log, log, size);
+ }
+ if (unlikely(rx_total_len >= VHOST_NET_WEIGHT)) {
+ vhost_poll_queue(&vq->poll);
+ break;
+ }
+ }
+}
+
+static void handle_async_tx_events_notify(struct vhost_net *net,
+ struct vhost_virtqueue *vq)
+{
+ struct kiocb *iocb = NULL;
+ int tx_total_len = 0;
+
+ if (!is_async_vq(vq))
+ return;
+
+ while ((iocb = notify_dequeue(vq)) != NULL) {
+ vhost_add_used_and_signal(&net->dev, vq,
+ iocb->ki_pos, 0);
+ tx_total_len += iocb->ki_nbytes;
+
+ if (iocb->ki_dtor)
+ iocb->ki_dtor(iocb);
+
+ kmem_cache_free(net->cache, iocb);
+ if (unlikely(tx_total_len >= VHOST_NET_WEIGHT)) {
+ vhost_poll_queue(&vq->poll);
+ break;
+ }
+ }
+}
+
+static struct kiocb *create_iocb(struct vhost_net *net,
+ struct vhost_virtqueue *vq,
+ unsigned head)
+{
+ struct kiocb *iocb = NULL;
+
+ if (!is_async_vq(vq))
+ return NULL;
+
+ iocb = kmem_cache_zalloc(net->cache, GFP_KERNEL);
+ if (!iocb)
+ return NULL;
+ iocb->private = vq;
+ iocb->ki_pos = head;
+ iocb->ki_dtor = handle_iocb;
+ if (vq == &net->dev.vqs[VHOST_NET_VQ_RX]) {
+ iocb->ki_user_data = vq->num;
+ iocb->ki_iovec = vq->hdr;
+ }
+ return iocb;
+}
+
/* Expects to be always run from workqueue - which acts as
* read-size critical section for our kind of RCU. */
static void handle_tx(struct vhost_net *net)
{
struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
+ struct kiocb *iocb = NULL;
unsigned head, out, in, s;
struct msghdr msg = {
.msg_name = NULL,
@@ -129,6 +259,8 @@ static void handle_tx(struct vhost_net *net)
tx_poll_stop(net);
hdr_size = vq->hdr_size;
+ handle_async_tx_events_notify(net, vq);
+
for (;;) {
head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
ARRAY_SIZE(vq->iov),
@@ -156,6 +288,13 @@ static void handle_tx(struct vhost_net *net)
/* Skip header. TODO: support TSO. */
s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, out);
msg.msg_iovlen = out;
+
+ if (is_async_vq(vq)) {
+ iocb = create_iocb(net, vq, head);
+ if (!iocb)
+ break;
+ }
+
len = iov_length(vq->iov, out);
/* Sanity check */
if (!len) {
@@ -165,12 +304,18 @@ static void handle_tx(struct vhost_net *net)
break;
}
/* TODO: Check specific error and bomb out unless ENOBUFS? */
- err = sock->ops->sendmsg(NULL, sock, &msg, len);
+ err = sock->ops->sendmsg(iocb, sock, &msg, len);
if (unlikely(err < 0)) {
+ if (is_async_vq(vq))
+ kmem_cache_free(net->cache, iocb);
vhost_discard_vq_desc(vq);
tx_poll_start(net, sock);
break;
}
+
+ if (is_async_vq(vq))
+ continue;
+
if (err != len)
pr_err("Truncated TX packet: "
" len %d != %zd\n", err, len);
@@ -182,6 +327,8 @@ static void handle_tx(struct vhost_net *net)
}
}
+ handle_async_tx_events_notify(net, vq);
+
mutex_unlock(&vq->mutex);
unuse_mm(net->dev.mm);
}
@@ -191,6 +338,7 @@ static void handle_tx(struct vhost_net *net)
static void handle_rx(struct vhost_net *net)
{
struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
+ struct kiocb *iocb = NULL;
unsigned head, out, in, log, s;
struct vhost_log *vq_log;
struct msghdr msg = {
@@ -211,7 +359,8 @@ static void handle_rx(struct vhost_net *net)
int err;
size_t hdr_size;
struct socket *sock = rcu_dereference(vq->private_data);
- if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
+ if (!sock || (skb_queue_empty(&sock->sk->sk_receive_queue) &&
+ vq->link_state == VHOST_VQ_LINK_SYNC))
return;
use_mm(net->dev.mm);
@@ -219,9 +368,17 @@ static void handle_rx(struct vhost_net *net)
vhost_disable_notify(vq);
hdr_size = vq->hdr_size;
+ /* In the async case, buffers that were submitted before write
+ * logging was enabled may carry no log info, so the log info is
+ * recomputed when needed. This is done in
+ * handle_async_rx_events_notify().
+ */
+
vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
vq->log : NULL;
+ handle_async_rx_events_notify(net, vq, sock);
+
for (;;) {
head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
ARRAY_SIZE(vq->iov),
@@ -250,6 +407,13 @@ static void handle_rx(struct vhost_net *net)
s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, in);
msg.msg_iovlen = in;
len = iov_length(vq->iov, in);
+
+ if (is_async_vq(vq)) {
+ iocb = create_iocb(net, vq, head);
+ if (!iocb)
+ break;
+ }
+
/* Sanity check */
if (!len) {
vq_err(vq, "Unexpected header len for RX: "
@@ -257,13 +421,20 @@ static void handle_rx(struct vhost_net *net)
iov_length(vq->hdr, s), hdr_size);
break;
}
- err = sock->ops->recvmsg(NULL, sock, &msg,
+
+ err = sock->ops->recvmsg(iocb, sock, &msg,
len, MSG_DONTWAIT | MSG_TRUNC);
/* TODO: Check specific error and bomb out unless EAGAIN? */
if (err < 0) {
+ if (is_async_vq(vq))
+ kmem_cache_free(net->cache, iocb);
vhost_discard_vq_desc(vq);
break;
}
+
+ if (is_async_vq(vq))
+ continue;
+
/* TODO: Should check and handle checksum. */
if (err > len) {
pr_err("Discarded truncated rx packet: "
@@ -289,6 +460,8 @@ static void handle_rx(struct vhost_net *net)
}
}
+ handle_async_rx_events_notify(net, vq, sock);
+
mutex_unlock(&vq->mutex);
unuse_mm(net->dev.mm);
}
@@ -342,6 +515,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)
vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT);
vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN);
n->tx_poll_state = VHOST_NET_POLL_DISABLED;
+ n->cache = NULL;
f->private_data = n;
@@ -405,6 +579,18 @@ static void vhost_net_flush(struct vhost_net *n)
vhost_net_flush_vq(n, VHOST_NET_VQ_RX);
}
+static void vhost_async_cleanup(struct vhost_net *n)
+{
+ /* clean the notifier */
+ struct vhost_virtqueue *vq = &n->dev.vqs[VHOST_NET_VQ_RX];
+ struct kiocb *iocb = NULL;
+ if (n->cache) {
+ while ((iocb = notify_dequeue(vq)) != NULL)
+ kmem_cache_free(n->cache, iocb);
+ kmem_cache_destroy(n->cache);
+ }
+}
+
static int vhost_net_release(struct inode *inode, struct file *f)
{
struct vhost_net *n = f->private_data;
@@ -421,6 +607,7 @@ static int vhost_net_release(struct inode *inode, struct file *f)
/* We do an extra flush before freeing memory,
* since jobs can re-queue themselves. */
vhost_net_flush(n);
+ vhost_async_cleanup(n);
kfree(n);
return 0;
}
@@ -472,21 +659,58 @@ static struct socket *get_tap_socket(int fd)
return sock;
}
-static struct socket *get_socket(int fd)
+static struct socket *get_mp_socket(int fd)
+{
+ struct file *file = fget(fd);
+ struct socket *sock;
+ if (!file)
+ return ERR_PTR(-EBADF);
+ sock = mp_get_socket(file);
+ if (IS_ERR(sock))
+ fput(file);
+ return sock;
+}
+
+static struct socket *get_socket(struct vhost_virtqueue *vq, int fd,
+ enum vhost_vq_link_state *state)
{
struct socket *sock;
/* special case to disable backend */
if (fd == -1)
return NULL;
+
+ *state = VHOST_VQ_LINK_SYNC;
+
sock = get_raw_socket(fd);
if (!IS_ERR(sock))
return sock;
sock = get_tap_socket(fd);
if (!IS_ERR(sock))
return sock;
+ sock = get_mp_socket(fd);
+ if (!IS_ERR(sock)) {
+ *state = VHOST_VQ_LINK_ASYNC;
+ return sock;
+ }
return ERR_PTR(-ENOTSOCK);
}
+static void vhost_init_link_state(struct vhost_net *n, int index)
+{
+ struct vhost_virtqueue *vq = n->vqs + index;
+
+ WARN_ON(!mutex_is_locked(&vq->mutex));
+ if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
+ INIT_LIST_HEAD(&vq->notifier);
+ spin_lock_init(&vq->notify_lock);
+ if (!n->cache) {
+ n->cache = kmem_cache_create("vhost_kiocb",
+ sizeof(struct kiocb), 0,
+ SLAB_HWCACHE_ALIGN, NULL);
+ }
+ }
+}
+
static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
{
struct socket *sock, *oldsock;
@@ -510,12 +734,14 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
r = -EFAULT;
goto err_vq;
}
- sock = get_socket(fd);
+ sock = get_socket(vq, fd, &vq->link_state);
if (IS_ERR(sock)) {
r = PTR_ERR(sock);
goto err_vq;
}
+ vhost_init_link_state(n, index);
+
/* start polling new socket */
oldsock = vq->private_data;
if (sock == oldsock)
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 3f10194..add77d3 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -860,61 +860,17 @@ static unsigned get_indirect(struct vhost_dev *dev, struct vhost_virtqueue *vq,
return 0;
}
-/* This looks in the virtqueue and for the first available buffer, and converts
- * it to an iovec for convenient access. Since descriptors consist of some
- * number of output then some number of input descriptors, it's actually two
- * iovecs, but we pack them into one and note how many of each there were.
- *
- * This function returns the descriptor number found, or vq->num (which
- * is never a valid descriptor number) if none was found. */
-unsigned vhost_get_vq_desc(struct vhost_dev *dev, struct vhost_virtqueue *vq,
- struct iovec iov[], unsigned int iov_size,
- unsigned int *out_num, unsigned int *in_num,
- struct vhost_log *log, unsigned int *log_num)
+/* This computes the log info for the buffer at descriptor index head */
+unsigned __vhost_get_vq_desc(struct vhost_dev *dev, struct vhost_virtqueue *vq,
+ struct iovec iov[], unsigned int iov_size,
+ unsigned int *out_num, unsigned int *in_num,
+ struct vhost_log *log, unsigned int *log_num,
+ unsigned int head)
{
struct vring_desc desc;
- unsigned int i, head, found = 0;
+ unsigned int i, found = 0;
- u16 last_avail_idx;
- int ret;
-
- /* Check it isn't doing very strange things with descriptor numbers. */
- last_avail_idx = vq->last_avail_idx;
- if (get_user(vq->avail_idx, &vq->avail->idx)) {
- vq_err(vq, "Failed to access avail idx at %p\n",
- &vq->avail->idx);
- return vq->num;
- }
-
- if ((u16)(vq->avail_idx - last_avail_idx) > vq->num) {
- vq_err(vq, "Guest moved used index from %u to %u",
- last_avail_idx, vq->avail_idx);
- return vq->num;
- }
-
- /* If there's nothing new since last we looked, return invalid. */
- if (vq->avail_idx == last_avail_idx)
- return vq->num;
+ unsigned int ret;
- /* Only get avail ring entries after they have been exposed by guest. */
- smp_rmb();
-
- /* Grab the next descriptor number they're advertising, and increment
- * the index we've seen. */
- if (get_user(head, &vq->avail->ring[last_avail_idx % vq->num])) {
- vq_err(vq, "Failed to read head: idx %d address %p\n",
- last_avail_idx,
- &vq->avail->ring[last_avail_idx % vq->num]);
- return vq->num;
- }
-
- /* If their number is silly, that's an error. */
- if (head >= vq->num) {
- vq_err(vq, "Guest says index %u > %u is available",
- head, vq->num);
- return vq->num;
- }
-
- /* When we start there are none of either input nor output. */
*out_num = *in_num = 0;
if (unlikely(log))
*log_num = 0;
@@ -978,8 +934,70 @@ unsigned vhost_get_vq_desc(struct vhost_dev *dev, struct vhost_virtqueue *vq,
*out_num += ret;
}
} while ((i = next_desc(&desc)) != -1);
+ return head;
+}
+
+/* This looks in the virtqueue and for the first available buffer, and converts
+ * it to an iovec for convenient access. Since descriptors consist of some
+ * number of output then some number of input descriptors, it's actually two
+ * iovecs, but we pack them into one and note how many of each there were.
+ *
+ * This function returns the descriptor number found, or vq->num (which
+ * is never a valid descriptor number) if none was found. */
+unsigned vhost_get_vq_desc(struct vhost_dev *dev, struct vhost_virtqueue *vq,
+ struct iovec iov[], unsigned int iov_size,
+ unsigned int *out_num, unsigned int *in_num,
+ struct vhost_log *log, unsigned int *log_num)
+{
+ struct vring_desc desc;
+ unsigned int i, head, found = 0;
+ u16 last_avail_idx;
+ unsigned int ret;
+
+ /* Check it isn't doing very strange things with descriptor numbers. */
+ last_avail_idx = vq->last_avail_idx;
+ if (get_user(vq->avail_idx, &vq->avail->idx)) {
+ vq_err(vq, "Failed to access avail idx at %p\n",
+ &vq->avail->idx);
+ return vq->num;
+ }
+
+ if ((u16)(vq->avail_idx - last_avail_idx) > vq->num) {
+ vq_err(vq, "Guest moved used index from %u to %u",
+ last_avail_idx, vq->avail_idx);
+ return vq->num;
+ }
+
+ /* If there's nothing new since last we looked, return invalid. */
+ if (vq->avail_idx == last_avail_idx)
+ return vq->num;
+
+ /* Only get avail ring entries after they have been exposed by guest. */
+ rmb();
+
+ /* Grab the next descriptor number they're advertising, and increment
+ * the index we've seen. */
+ if (get_user(head, &vq->avail->ring[last_avail_idx % vq->num])) {
+ vq_err(vq, "Failed to read head: idx %d address %p\n",
+ last_avail_idx,
+ &vq->avail->ring[last_avail_idx % vq->num]);
+ return vq->num;
+ }
+
+ /* If their number is silly, that's an error. */
+ if (head >= vq->num) {
+ vq_err(vq, "Guest says index %u > %u is available",
+ head, vq->num);
+ return vq->num;
+ }
+
+ ret = __vhost_get_vq_desc(dev, vq, iov, iov_size,
+ out_num, in_num,
+ log, log_num, head);
/* On success, increment avail index. */
+ if (ret == vq->num)
+ return ret;
vq->last_avail_idx++;
return head;
}
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 44591ba..3c9cbce 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -43,6 +43,11 @@ struct vhost_log {
u64 len;
};
+enum vhost_vq_link_state {
+ VHOST_VQ_LINK_SYNC = 0,
+ VHOST_VQ_LINK_ASYNC = 1,
+};
+
/* The virtqueue structure describes a queue attached to a device. */
struct vhost_virtqueue {
struct vhost_dev *dev;
@@ -96,6 +101,10 @@ struct vhost_virtqueue {
/* Log write descriptors */
void __user *log_base;
struct vhost_log log[VHOST_NET_MAX_SG];
+ /* Differentiate async socket for 0-copy from normal */
+ enum vhost_vq_link_state link_state;
+ struct list_head notifier;
+ spinlock_t notify_lock;
};
struct vhost_dev {
@@ -124,6 +133,11 @@ unsigned vhost_get_vq_desc(struct vhost_dev *, struct vhost_virtqueue *,
struct iovec iov[], unsigned int iov_count,
unsigned int *out_num, unsigned int *in_num,
struct vhost_log *log, unsigned int *log_num);
+unsigned __vhost_get_vq_desc(struct vhost_dev *, struct vhost_virtqueue *,
+ struct iovec iov[], unsigned int iov_count,
+ unsigned int *out_num, unsigned int *in_num,
+ struct vhost_log *log, unsigned int *log_num,
+ unsigned int head);
void vhost_discard_vq_desc(struct vhost_virtqueue *);
int vhost_add_used(struct vhost_virtqueue *, unsigned int head, int len);
--
1.5.4.4
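As a reading aid for the two notify loops in net.c, here is a minimal sketch
of the byte-budget idea they share: completed requests are consumed until the
running byte total reaches a weight, at which point the loop stops and asks
for another pass. This is a userspace model with hypothetical names (struct
completion, drain_completions(), WEIGHT_BUDGET); the 0x80000 value is assumed
to mirror VHOST_NET_WEIGHT, and the requeue flag stands in for
vhost_poll_queue().

```c
#include <assert.h>
#include <stddef.h>

/* Assumed to mirror VHOST_NET_WEIGHT; treat the value as illustrative. */
#define WEIGHT_BUDGET 0x80000

/* Hypothetical record of one completed request. */
struct completion {
	unsigned int head;	/* descriptor index */
	size_t nbytes;		/* bytes transferred */
};

struct drain_result {
	size_t processed;	/* completions consumed in this pass */
	size_t bytes;		/* bytes accounted in this pass */
	int requeue;		/* nonzero: budget hit, schedule another pass */
};

/*
 * Consume completions in order. As in the patch, the entry that pushes the
 * total to or past the budget is still processed before the loop stops.
 */
static struct drain_result drain_completions(const struct completion *done,
					     size_t n, size_t budget)
{
	struct drain_result r = { 0, 0, 0 };
	size_t i;

	assert(done || n == 0);
	for (i = 0; i < n; i++) {
		r.bytes += done[i].nbytes;
		r.processed++;
		if (r.bytes >= budget) {
			r.requeue = 1;
			break;
		}
	}
	return r;
}
```

The point of the budget is fairness: one busy queue cannot monopolize the
worker, because it yields after roughly WEIGHT_BUDGET bytes and is picked up
again on the next pass.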
^ permalink raw reply related
* Re: eSwitch management
From: Anirban Chakraborty @ 2010-04-23 5:57 UTC (permalink / raw)
To: Scott Feldman
Cc: David Miller, netdev@vger.kernel.org, chrisw@redhat.com,
Arnd Bergmann, Ameen Rahman, Amit Salecha, Rajesh Borundia
In-Reply-To: <C7F64614.2ADDF%scofeldm@cisco.com>
On Apr 22, 2010, at 6:29 PM, Scott Feldman wrote:
> On 4/22/10 5:47 PM, "Scott Feldman" <scofeldm@cisco.com> wrote:
>
>> On 4/22/10 4:16 PM, "Anirban Chakraborty" <anirban.chakraborty@qlogic.com>
>> wrote:
>>
>>> I am following the discussions on iovnl patch closely. While it is going to
>>> take some time for iovnl patch to be reviewed and accepted, what would be the
>>> interim approach to manage the eswitch in NIC? We need to add support in
>>> qlcnic driver to configure the eswitch in our 10G NIC. Some of the things
>>> that
>>> we need to set to the switch are setting a port's VLAN, tx bandwidth etc. We
>>> would like to set these parameters for a bunch of ports at the start of the
>>> day and set it to the eswitch.
>>
>> Are any of these settings covered in DCB? (net/dcb/dcbnl.c). Maybe you can
>> get a start there? Not sure not knowing your device requirements.
>
> Or maybe the RTM_SETLINK IFLA_VF_* ops in include/linux/if_link.h? Those
> seem like what you're looking for. I'm looking at moving iovnl here as well
> for port-profile.
It looks like ifla_vf_info does contain most of the data set. But if I use it, what NETLINK protocol family should I use in my driver to receive netlink messages? Do I need to create a private protocol family?
Thanks a lot,
Anirban
^ permalink raw reply
* [PATCH] bnx2x: add support for receive hashing
From: Tom Herbert @ 2010-04-23 5:54 UTC (permalink / raw)
To: davem, netdev
Add support to bnx2x to extract Toeplitz hash out of the receive descriptor
for use in skb->rxhash.
Signed-off-by: Tom Herbert <therbert@google.com>
---
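For context on what the hardware computes, here is a minimal software sketch
of a Toeplitz hash. It is not part of the patch and not the bnx2x firmware
algorithm as such: for every set bit of the input, the 32-bit window of the
key aligned with that bit is XORed into the result. The function name
toeplitz_hash() and the key array are illustrative; the key bytes are
believed to match the widely published RSS verification key, but any key
works for the sketch (the patch itself seeds the key registers with
random32()).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Example 40-byte key; assumed, any key of at least 4 bytes would do. */
static const uint8_t example_key[40] = {
	0x6d, 0x5a, 0x56, 0xda, 0x25, 0x5b, 0x0e, 0xc2,
	0x41, 0x67, 0x25, 0x3d, 0x43, 0xa3, 0x8f, 0xb0,
	0xd0, 0xca, 0x2b, 0xcb, 0xae, 0x7b, 0x30, 0xb4,
	0x77, 0xcb, 0x2d, 0xa3, 0x80, 0x30, 0xf2, 0x0c,
	0x6a, 0x42, 0xb7, 0x3b, 0xbe, 0xac, 0x01, 0xfa,
};

static uint32_t toeplitz_hash(const uint8_t *key, size_t key_len,
			      const uint8_t *data, size_t data_len)
{
	uint32_t hash = 0;
	uint32_t window;
	size_t next_key_bit = 32;	/* index of the next key bit to shift in */
	size_t i;
	int b;

	assert(key_len >= 4);
	/* The window starts as the first 32 key bits, most significant first. */
	window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
		 ((uint32_t)key[2] << 8) | (uint32_t)key[3];

	for (i = 0; i < data_len; i++) {
		for (b = 7; b >= 0; b--) {
			uint32_t bit = 0;

			/* A set input bit contributes the current key window. */
			if (data[i] & (1u << b))
				hash ^= window;

			/* Slide the window by one bit; pad with zeros past the key. */
			if (next_key_bit < key_len * 8)
				bit = (key[next_key_bit / 8] >>
				       (7 - next_key_bit % 8)) & 1u;
			window = (window << 1) | bit;
			next_key_bit++;
		}
	}
	return hash;
}
```

Because the hash is XOR-linear in the input, two inputs of equal length
satisfy hash(a ^ b) == hash(a) ^ hash(b), which is a convenient sanity check
for any implementation.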
diff --git a/drivers/net/bnx2x.h b/drivers/net/bnx2x.h
index 0819530..8bd2368 100644
--- a/drivers/net/bnx2x.h
+++ b/drivers/net/bnx2x.h
@@ -1330,7 +1330,7 @@ static inline u32 reg_poll(struct bnx2x *bp, u32 reg, u32 expected, int ms,
AEU_INPUTS_ATTN_BITS_MCP_LATCHED_UMP_TX_PARITY | \
AEU_INPUTS_ATTN_BITS_MCP_LATCHED_SCPAD_PARITY)
-#define MULTI_FLAGS(bp) \
+#define RSS_FLAGS(bp) \
(TSTORM_ETH_FUNCTION_COMMON_CONFIG_RSS_IPV4_CAPABILITY | \
TSTORM_ETH_FUNCTION_COMMON_CONFIG_RSS_IPV4_TCP_CAPABILITY | \
TSTORM_ETH_FUNCTION_COMMON_CONFIG_RSS_IPV6_CAPABILITY | \
diff --git a/drivers/net/bnx2x_main.c b/drivers/net/bnx2x_main.c
index 0c6dba2..613f727 100644
--- a/drivers/net/bnx2x_main.c
+++ b/drivers/net/bnx2x_main.c
@@ -1582,7 +1582,7 @@ static int bnx2x_rx_int(struct bnx2x_fastpath *fp, int budget)
struct sw_rx_bd *rx_buf = NULL;
struct sk_buff *skb;
union eth_rx_cqe *cqe;
- u8 cqe_fp_flags;
+ u8 cqe_fp_flags, cqe_fp_status_flags;
u16 len, pad;
comp_ring_cons = RCQ_BD(sw_comp_cons);
@@ -1598,6 +1598,7 @@ static int bnx2x_rx_int(struct bnx2x_fastpath *fp, int budget)
cqe = &fp->rx_comp_ring[comp_ring_cons];
cqe_fp_flags = cqe->fast_path_cqe.type_error_flags;
+ cqe_fp_status_flags = cqe->fast_path_cqe.status_flags;
DP(NETIF_MSG_RX_STATUS, "CQE type %x err %x status %x"
" queue %x vlan %x len %u\n", CQE_TYPE(cqe_fp_flags),
@@ -1727,6 +1728,12 @@ reuse_rx:
skb->protocol = eth_type_trans(skb, bp->dev);
+ if ((bp->dev->features & NETIF_F_RXHASH) &&
+ (cqe_fp_status_flags &
+ ETH_FAST_PATH_RX_CQE_RSS_HASH_FLG))
+ skb->rxhash = le32_to_cpu(
+ cqe->fast_path_cqe.rss_hash_result);
+
skb->ip_summed = CHECKSUM_NONE;
if (bp->rx_csum) {
if (likely(BNX2X_RX_CSUM_OK(cqe)))
@@ -5750,10 +5757,10 @@ static void bnx2x_init_internal_func(struct bnx2x *bp)
u32 offset;
u16 max_agg_size;
- if (is_multi(bp)) {
- tstorm_config.config_flags = MULTI_FLAGS(bp);
+ tstorm_config.config_flags = RSS_FLAGS(bp);
+
+ if (is_multi(bp))
tstorm_config.rss_result_mask = MULTI_MASK;
- }
/* Enable TPA if needed */
if (bp->flags & TPA_ENABLE_FLAG)
@@ -6629,10 +6636,8 @@ static int bnx2x_init_common(struct bnx2x *bp)
bnx2x_init_block(bp, PBF_BLOCK, COMMON_STAGE);
REG_WR(bp, SRC_REG_SOFT_RST, 1);
- for (i = SRC_REG_KEYRSS0_0; i <= SRC_REG_KEYRSS1_9; i += 4) {
- REG_WR(bp, i, 0xc0cac01a);
- /* TODO: replace with something meaningful */
- }
+ for (i = SRC_REG_KEYRSS0_0; i <= SRC_REG_KEYRSS1_9; i += 4)
+ REG_WR(bp, i, random32());
bnx2x_init_block(bp, SRCH_BLOCK, COMMON_STAGE);
#ifdef BCM_CNIC
REG_WR(bp, SRC_REG_KEYSEARCH_0, 0x63285672);
@@ -11001,6 +11006,11 @@ static int bnx2x_set_flags(struct net_device *dev, u32 data)
changed = 1;
}
+ if (data & ETH_FLAG_RXHASH)
+ dev->features |= NETIF_F_RXHASH;
+ else
+ dev->features &= ~NETIF_F_RXHASH;
+
if (changed && netif_running(dev)) {
bnx2x_nic_unload(bp, UNLOAD_NORMAL);
rc = bnx2x_nic_load(bp, LOAD_NORMAL);
^ permalink raw reply related
* Re: DDoS attack causing bad effect on conntrack searches
From: Eric Dumazet @ 2010-04-23 5:44 UTC (permalink / raw)
To: David Miller; +Cc: hawk, paulmck, kaber, xiaosuo, hawk, netdev, netfilter-devel
In-Reply-To: <20100422.164425.171794554.davem@davemloft.net>
Le jeudi 22 avril 2010 à 16:44 -0700, David Miller a écrit :
> Eric, I wonder if we run into some kind of issue on 32-bit systems
> because we always lose a bit of the conntrack hash value when we store
> it into the 'nulls' area?
>
> Wouldn't that make the "get_nulls_value(n) != hash" fail?
> --
Well, 'hash' at this point is not the result of the jhash() transform
[0 - 0xFFFFFFFF], but a slot number in the hash table [0 - (300032-1)].
And we can have a nulls_value up to 0x7FFFFFFF (31 bits):
static inline unsigned long get_nulls_value(const struct hlist_nulls_node *ptr)
{
return ((unsigned long)ptr) >> 1;
}
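To make the arithmetic concrete, here is a small userspace model of the
encoding on a 32-bit word. It is a sketch under stated assumptions, not the
kernel header: make_nulls32(), is_a_nulls32() and lookup_must_restart() are
hypothetical names, and get_nulls_value32() mirrors the function quoted
above with the pointer replaced by a uint32_t. The low bit marks the word as
an end-of-chain marker and the remaining 31 bits carry the slot number.

```c
#include <assert.h>
#include <stdint.h>

/* Encode a slot number as an end-of-chain marker: low bit set, value above. */
static uint32_t make_nulls32(uint32_t value)
{
	return 1u | (value << 1);
}

/* A real node pointer is at least 2-byte aligned, so its low bit is clear. */
static int is_a_nulls32(uint32_t word)
{
	return (int)(word & 1u);
}

/* Recover the stored value; only 31 bits survive on a 32-bit word. */
static uint32_t get_nulls_value32(uint32_t word)
{
	return word >> 1;
}

/*
 * Lockless lookup check: if the chain ended on a marker that belongs to a
 * different slot, the entry was moved mid-walk and the lookup must restart.
 */
static int lookup_must_restart(uint32_t end_marker, uint32_t slot)
{
	assert(is_a_nulls32(end_marker));
	return get_nulls_value32(end_marker) != slot;
}
```

A slot number such as 300031 fits in 31 bits and round-trips unchanged, so
the comparison behaves the same on 32-bit systems. Only a value with bit 31
set, such as a raw jhash() result, would lose its top bit.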
^ permalink raw reply