Netdev List
 help / color / mirror / Atom feed
* Re: [ath9k-devel] [PATCH v2 07/46] net/wireless: ath9k: fix DMA API usage
From: Ben Greear @ 2011-07-12  5:30 UTC (permalink / raw)
  To: Felix Fietkau
  Cc: Michał Mirosław, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA, Jouni Malinen,
	Senthil Balasubramanian, ath9k-devel-juf53994utBLZpfksSYvnA,
	Vasanthakumar Thiagarajan
In-Reply-To: <4E1BCF36.2010506-p3rKhJxN3npAfugRpC6u6w@public.gmane.org>

On 07/11/2011 09:36 PM, Felix Fietkau wrote:
> On 2011-07-11 8:52 AM, Michał Mirosław wrote:
>> Also constify buf_addr for ath9k_hw_process_rxdesc_edma() to verify
>> assumptions --- dma_sync_single_for_device() call can be removed.
>>
>> Signed-off-by: Michał Mirosław<mirq-linux-CoA6ZxLDdyEEUmgCuDUIdw@public.gmane.org>
>> ---
>> drivers/net/wireless/ath/ath9k/ar9003_mac.c | 4 ++--
>> drivers/net/wireless/ath/ath9k/ar9003_mac.h | 2 +-
>> drivers/net/wireless/ath/ath9k/recv.c | 10 +++-------
>> 3 files changed, 6 insertions(+), 10 deletions(-)
>>
>> diff --git a/drivers/net/wireless/ath/ath9k/recv.c b/drivers/net/wireless/ath/ath9k/recv.c
>> index 70dc8ec..c5f46d5 100644
>> --- a/drivers/net/wireless/ath/ath9k/recv.c
>> +++ b/drivers/net/wireless/ath/ath9k/recv.c
>> @@ -684,15 +684,11 @@ static bool ath_edma_get_buffers(struct ath_softc *sc,
>> BUG_ON(!bf);
>>
>> dma_sync_single_for_cpu(sc->dev, bf->bf_buf_addr,
>> - common->rx_bufsize, DMA_FROM_DEVICE);
>> + common->rx_bufsize, DMA_BIDIRECTIONAL);
>>
>> ret = ath9k_hw_process_rxdesc_edma(ah, NULL, skb->data);
>> - if (ret == -EINPROGRESS) {
>> - /*let device gain the buffer again*/
>> - dma_sync_single_for_device(sc->dev, bf->bf_buf_addr,
>> - common->rx_bufsize, DMA_FROM_DEVICE);
>> + if (ret == -EINPROGRESS)
>> return false;
>> - }
>>
>> __skb_unlink(skb,&rx_edma->rx_fifo);
>> if (ret == -EINVAL) {
> I have strong doubts about this change. On most MIPS devices, dma_sync_single_for_cpu is a no-op, whereas dma_sync_single_for_device flushes the cache range.
> With this change, the CPU could cache the DMA status part behind skb->data and that cache entry would not be flushed inbetween calls to this functions on the
> same buffer, likely leading to rx stalls.

At the very least, it would need heavy testing.  It took a very long time to get
the ath9k DMA issues (mostly?) resolved...so we shouldn't go mucking in this
code on theory...

Thanks,
Ben

>
> - Felix
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html


-- 
Ben Greear <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org>
Candela Technologies Inc  http://www.candelatech.com
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] [PATCH] Fix deadlock in af_packet while stressing raw ethernet socket interface
From: Ronny Meeus @ 2011-07-12  6:38 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20110711.202717.1764669415135298036.davem@davemloft.net>

On Tue, Jul 12, 2011 at 5:27 AM, David Miller <davem@davemloft.net> wrote:
> From: Ronny Meeus <ronny.meeus@gmail.com>
> Date: Sat, 11 Jun 2011 07:04:09 +0200
>
>> I was running a test: 1 application was sending raw Ethernet packets
>> on a physical looped interface while a second application was
>> receiving packets, so the latter application receives each packet 2
>> times (once while sending from the context of the first application
>> and a second time while receiving from the hardware).  After some
>> time, the test blocks due to a spinlock reentrance issue in
>> af_packet. Both the sending application and the softIRQ receiving
>> packets enter the spinlock code. After applying the patch below, the
>> issue is resolved.
>>
>> Signed-off-by: Ronny Meeus <ronny.meeus@gmail.com>
>
> The packet receive hooks should always be called with software
> interrupts disabled, it is a bug if this is not happening.  Your
> patch should not be necessary at all.
>
>

Can you be a bit more specific on where the software interrupts should
be disabled?

Below you find the information I get after switching on the spin_lock
issue detection in the kernel.
In this run also I-PIPE was active but this issue is also seen with
I-PIPE disabled.

[   96.450034] BUG: spinlock recursion on CPU#0, send_eth_socket/1907
[   96.540451]  lock: eaeb8c9c, .magic: dead4ead, .owner:
send_eth_socket/1907, .owner_cpu: 0
[   96.656060] Call Trace: (unreliable)0161000 success=90] [c000789c]
show_stack+0x78/0x18c160001
[   96.827988] [efff3dd0] [c01e2a50] spin_bug+0xa8/0xc0=0000162000 succ
[   96.920712] [efff3df0] [c01e2b9c] do_raw_spin_lock+0x70/0x1c4ount=0000163000
[   97.022823] [efff3e20] [c0388d88] _raw_spin_lock+0x10/0x2001
[   97.121800] [efff3e30] [c03391a0] tpacket_rcv+0x274/0x61c164001
[   97.219733] [efff3e80] [c02d5970] __netif_receive_skb+0x1a8/0x36c0000165001
[   97.326001] [efff3eb0] [c02d6234] netif_receive_skb+0x98/0xacess=0000166001
[   97.427060] [efff3ee0] [c029e6c4] ingress_rx_default_dqrr+0x42c/0x4b8
[   97.504194] [efff3f10] [c02ba524] qman_poll_dqrr+0x1e0/0x284
[   97.571948] [efff3f50] [c029ff3c] dpaa_eth_poll+0x34/0xd0
[   97.636579] [efff3f70] [c02d6514] net_rx_action+0xc0/0x1e8
[   97.702256] [efff3fa0] [c0034d28] __do_softirq+0x138/0x210
[   97.767928] [efff3ff0] [c0010128] call_do_softirq+0x14/0x24
[   97.834641] [ec479a90] [c0004a00] do_softirq+0xb4/0xec
[   97.896148] [ec479ab0] [c0034a44] irq_exit+0x60/0xb8
[   97.955573] [ec479ac0] [c000a490] __ipipe_do_IRQ+0x88/0xc0
[   98.021253] [ec479ae0] [c0070214] __ipipe_sync_stage+0x1f0/0x27c
[   98.093176] [ec479b20] [c0009f28] __ipipe_handle_irq+0x1b8/0x1e8
[   98.165106] [ec479b50] [c000a210] __ipipe_grab_irq+0x18c/0x1bc
[   98.234947] [ec479b80] [c0011520] __ipipe_ret_from_except+0x0/0xc
[   98.307915] --- Exception: 501 at __packet_get_status+0x48/0x70
[   98.307920]     LR = __packet_get_status+0x44/0x70
[   98.436082] [ec479c40] [00000578] 0x578 (unreliable)
[   98.495524] [ec479c50] [c0338360] packet_lookup_frame+0x48/0x70
[   98.566405] [ec479c60] [c03391b4] tpacket_rcv+0x288/0x61c
[   98.631037] [ec479cb0] [c02d762c] dev_hard_start_xmit+0x164/0x588
[   98.704003] [ec479cf0] [c0338d6c] packet_sendmsg+0x8c4/0x988
[   98.771758] [ec479d70] [c02c3838] sock_sendmsg+0x90/0xb4
[   98.835348] [ec479e40] [c02c4420] sys_sendto+0xdc/0x120
[   98.897891] [ec479f10] [c02c57d0] sys_socketcall+0x148/0x210
[   98.965648] [ec479f40] [c001084c] ret_from_syscall+0x0/0x3c
[   99.032361] --- Exception: c01 at 0x48051f00
[   99.032365]     LR = 0x4808e030
[  100.563009] BUG: spinlock lockup on CPU#0, send_eth_socket/1907, eaeb8c9c
[  100.644283] Call Trace:
[  100.673480] [efff3db0] [c000789c] show_stack+0x78/0x18c (unreliable)
[  100.749589] [efff3df0] [c01e2c94] do_raw_spin_lock+0x168/0x1c4
[  100.819430] [efff3e20] [c0388d88] _raw_spin_lock+0x10/0x20
[  100.885102] [efff3e30] [c03391a0] tpacket_rcv+0x274/0x61c
[  100.949733] [efff3e80] [c02d5970] __netif_receive_skb+0x1a8/0x36c
[  101.022699] [efff3eb0] [c02d6234] netif_receive_skb+0x98/0xac
[  101.091497] [efff3ee0] [c029e6c4] ingress_rx_default_dqrr+0x42c/0x4b8
[  101.168628] [efff3f10] [c02ba524] qman_poll_dqrr+0x1e0/0x284
[  101.236385] [efff3f50] [c029ff3c] dpaa_eth_poll+0x34/0xd0
[  101.301016] [efff3f70] [c02d6514] net_rx_action+0xc0/0x1e8
[  101.366692] [efff3fa0] [c0034d28] __do_softirq+0x138/0x210
[  101.432364] [efff3ff0] [c0010128] call_do_softirq+0x14/0x24
[  101.499078] [ec479a90] [c0004a00] do_softirq+0xb4/0xec
[  101.560585] [ec479ab0] [c0034a44] irq_exit+0x60/0xb8
[  101.620003] [ec479ac0] [c000a490] __ipipe_do_IRQ+0x88/0xc0
[  101.685678] [ec479ae0] [c0070214] __ipipe_sync_stage+0x1f0/0x27c
[  101.757601] [ec479b20] [c0009f28] __ipipe_handle_irq+0x1b8/0x1e8
[  101.829523] [ec479b50] [c000a210] __ipipe_grab_irq+0x18c/0x1bc
[  101.899363] [ec479b80] [c0011520] __ipipe_ret_from_except+0x0/0xc
[  101.972333] --- Exception: 501 at __packet_get_status+0x48/0x70
[  101.972338]     LR = __packet_get_status+0x44/0x70
[  102.100490] [ec479c40] [00000578] 0x578 (unreliable)
[  102.159929] [ec479c50] [c0338360] packet_lookup_frame+0x48/0x70
[  102.230810] [ec479c60] [c03391b4] tpacket_rcv+0x288/0x61c
[  102.295442] [ec479cb0] [c02d762c] dev_hard_start_xmit+0x164/0x588
[  102.368407] [ec479cf0] [c0338d6c] packet_sendmsg+0x8c4/0x988
[  102.436162] [ec479d70] [c02c3838] sock_sendmsg+0x90/0xb4
[  102.499747] [ec479e40] [c02c4420] sys_sendto+0xdc/0x120
[  102.562296] [ec479f10] [c02c57d0] sys_socketcall+0x148/0x210
[  102.630052] [ec479f40] [c001084c] ret_from_syscall+0x0/0x3c
[  102.696773] --- Exception: c01 at 0x48051f00
[  102.696777]     LR = 0x4808e030

Ronny

^ permalink raw reply

* Re: [PATCH] [PATCH] Fix deadlock in af_packet while stressing raw ethernet socket interface
From: David Miller @ 2011-07-12  6:42 UTC (permalink / raw)
  To: ronny.meeus; +Cc: netdev
In-Reply-To: <CAMJ=MEf8KkNNSvHy-pepg_GfkrnquKo+zW3GyB7RDAYO2H+SVA@mail.gmail.com>

From: Ronny Meeus <ronny.meeus@gmail.com>
Date: Tue, 12 Jul 2011 08:38:57 +0200

> Can you be a bit more specific on where the software interrupts should
> be disabled?

In all paths that lead to the packet capturing hooks, I can't
be any more specific than that, because that is the exact
requirement.

^ permalink raw reply

* Re: [PATCH] [PATCH] Fix deadlock in af_packet while stressing raw ethernet socket interface
From: Eric Dumazet @ 2011-07-12  6:58 UTC (permalink / raw)
  To: Ronny Meeus; +Cc: David Miller, netdev
In-Reply-To: <CAMJ=MEf8KkNNSvHy-pepg_GfkrnquKo+zW3GyB7RDAYO2H+SVA@mail.gmail.com>

Le mardi 12 juillet 2011 à 08:38 +0200, Ronny Meeus a écrit :
> On Tue, Jul 12, 2011 at 5:27 AM, David Miller <davem@davemloft.net> wrote:
> > From: Ronny Meeus <ronny.meeus@gmail.com>
> > Date: Sat, 11 Jun 2011 07:04:09 +0200
> >
> >> I was running a test: 1 application was sending raw Ethernet packets
> >> on a physical looped interface while a second application was
> >> receiving packets, so the latter application receives each packet 2
> >> times (once while sending from the context of the first application
> >> and a second time while receiving from the hardware).  After some
> >> time, the test blocks due to a spinlock reentrance issue in
> >> af_packet. Both the sending application and the softIRQ receiving
> >> packets enter the spinlock code. After applying the patch below, the
> >> issue is resolved.
> >>
> >> Signed-off-by: Ronny Meeus <ronny.meeus@gmail.com>
> >
> > The packet receive hooks should always be called with software
> > interrupts disabled, it is a bug if this is not happening.  Your
> > patch should not be necessary at all.
> >
> >
> 
> Can you be a bit more specific on where the software interrupts should
> be disabled?
> 
> Below you find the information I get after switching on the spin_lock
> issue detection in the kernel.
> In this run also I-PIPE was active but this issue is also seen with
> I-PIPE disabled.
> 

This seems a bug, but in softirq handling in your arch

> [   96.450034] BUG: spinlock recursion on CPU#0, send_eth_socket/1907
> [   96.540451]  lock: eaeb8c9c, .magic: dead4ead, .owner:
> send_eth_socket/1907, .owner_cpu: 0
> [   96.656060] Call Trace: (unreliable)0161000 success=90] [c000789c]
> show_stack+0x78/0x18c160001
> [   96.827988] [efff3dd0] [c01e2a50] spin_bug+0xa8/0xc0=0000162000 succ
> [   96.920712] [efff3df0] [c01e2b9c] do_raw_spin_lock+0x70/0x1c4ount=0000163000
> [   97.022823] [efff3e20] [c0388d88] _raw_spin_lock+0x10/0x2001
> [   97.121800] [efff3e30] [c03391a0] tpacket_rcv+0x274/0x61c164001
> [   97.219733] [efff3e80] [c02d5970] __netif_receive_skb+0x1a8/0x36c0000165001
> [   97.326001] [efff3eb0] [c02d6234] netif_receive_skb+0x98/0xacess=0000166001
> [   97.427060] [efff3ee0] [c029e6c4] ingress_rx_default_dqrr+0x42c/0x4b8
> [   97.504194] [efff3f10] [c02ba524] qman_poll_dqrr+0x1e0/0x284
> [   97.571948] [efff3f50] [c029ff3c] dpaa_eth_poll+0x34/0xd0
> [   97.636579] [efff3f70] [c02d6514] net_rx_action+0xc0/0x1e8
> [   97.702256] [efff3fa0] [c0034d28] __do_softirq+0x138/0x210
> [   97.767928] [efff3ff0] [c0010128] call_do_softirq+0x14/0x24
> [   97.834641] [ec479a90] [c0004a00] do_softirq+0xb4/0xec

We got an IRQ, and we start do_softirq() from irq_exit()

> [   97.896148] [ec479ab0] [c0034a44] irq_exit+0x60/0xb8
> [   97.955573] [ec479ac0] [c000a490] __ipipe_do_IRQ+0x88/0xc0
> [   98.021253] [ec479ae0] [c0070214] __ipipe_sync_stage+0x1f0/0x27c
> [   98.093176] [ec479b20] [c0009f28] __ipipe_handle_irq+0x1b8/0x1e8
> [   98.165106] [ec479b50] [c000a210] __ipipe_grab_irq+0x18c/0x1bc
> [   98.234947] [ec479b80] [c0011520] __ipipe_ret_from_except+0x0/0xc
> [   98.307915] --- Exception: 501 at __packet_get_status+0x48/0x70
> [   98.307920]     LR = __packet_get_status+0x44/0x70
> [   98.436082] [ec479c40] [00000578] 0x578 (unreliable)
> [   98.495524] [ec479c50] [c0338360] packet_lookup_frame+0x48/0x70
> [   98.566405] [ec479c60] [c03391b4] tpacket_rcv+0x288/0x61c

Here we were in BH disabled section, since dev_queue_xmit() contains :

	rcu_read_lock_bh()


> [   98.631037] [ec479cb0] [c02d762c] dev_hard_start_xmit+0x164/0x588
> [   98.704003] [ec479cf0] [c0338d6c] packet_sendmsg+0x8c4/0x988
> [   98.771758] [ec479d70] [c02c3838] sock_sendmsg+0x90/0xb4
> [   98.835348] [ec479e40] [c02c4420] sys_sendto+0xdc/0x120
> [   98.897891] [ec479f10] [c02c57d0] sys_socketcall+0x148/0x210
> [   98.965648] [ec479f40] [c001084c] ret_from_syscall+0x0/0x3c
> [   99.032361] --- Exception: c01 at 0x48051f00
> [   99.032365]     LR = 0x4808e030
> [  100.563009] BUG: spinlock lockup on CPU#0, send_eth_socket/1907, eaeb8c9c
> [  100.644283] Call Trace:
> [  100.673480] [efff3db0] [c000789c] show_stack+0x78/0x18c (unreliable)
> [  100.749589] [efff3df0] [c01e2c94] do_raw_spin_lock+0x168/0x1c4
> [  100.819430] [efff3e20] [c0388d88] _raw_spin_lock+0x10/0x20
> [  100.885102] [efff3e30] [c03391a0] tpacket_rcv+0x274/0x61c
> [  100.949733] [efff3e80] [c02d5970] __netif_receive_skb+0x1a8/0x36c
> [  101.022699] [efff3eb0] [c02d6234] netif_receive_skb+0x98/0xac
> [  101.091497] [efff3ee0] [c029e6c4] ingress_rx_default_dqrr+0x42c/0x4b8
> [  101.168628] [efff3f10] [c02ba524] qman_poll_dqrr+0x1e0/0x284
> [  101.236385] [efff3f50] [c029ff3c] dpaa_eth_poll+0x34/0xd0
> [  101.301016] [efff3f70] [c02d6514] net_rx_action+0xc0/0x1e8
> [  101.366692] [efff3fa0] [c0034d28] __do_softirq+0x138/0x210
> [  101.432364] [efff3ff0] [c0010128] call_do_softirq+0x14/0x24
> [  101.499078] [ec479a90] [c0004a00] do_softirq+0xb4/0xec
> [  101.560585] [ec479ab0] [c0034a44] irq_exit+0x60/0xb8
> [  101.620003] [ec479ac0] [c000a490] __ipipe_do_IRQ+0x88/0xc0
> [  101.685678] [ec479ae0] [c0070214] __ipipe_sync_stage+0x1f0/0x27c
> [  101.757601] [ec479b20] [c0009f28] __ipipe_handle_irq+0x1b8/0x1e8
> [  101.829523] [ec479b50] [c000a210] __ipipe_grab_irq+0x18c/0x1bc
> [  101.899363] [ec479b80] [c0011520] __ipipe_ret_from_except+0x0/0xc
> [  101.972333] --- Exception: 501 at __packet_get_status+0x48/0x70
> [  101.972338]     LR = __packet_get_status+0x44/0x70
> [  102.100490] [ec479c40] [00000578] 0x578 (unreliable)
> [  102.159929] [ec479c50] [c0338360] packet_lookup_frame+0x48/0x70
> [  102.230810] [ec479c60] [c03391b4] tpacket_rcv+0x288/0x61c
> [  102.295442] [ec479cb0] [c02d762c] dev_hard_start_xmit+0x164/0x588
> [  102.368407] [ec479cf0] [c0338d6c] packet_sendmsg+0x8c4/0x988
> [  102.436162] [ec479d70] [c02c3838] sock_sendmsg+0x90/0xb4
> [  102.499747] [ec479e40] [c02c4420] sys_sendto+0xdc/0x120
> [  102.562296] [ec479f10] [c02c57d0] sys_socketcall+0x148/0x210
> [  102.630052] [ec479f40] [c001084c] ret_from_syscall+0x0/0x3c
> [  102.696773] --- Exception: c01 at 0x48051f00
> [  102.696777]     LR = 0x4808e030
> 
> Ronny
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



^ permalink raw reply

* Re: [PATCH] [PATCH] Fix deadlock in af_packet while stressing raw ethernet socket interface
From: David Miller @ 2011-07-12  7:19 UTC (permalink / raw)
  To: eric.dumazet; +Cc: ronny.meeus, netdev
In-Reply-To: <1310453928.2860.32.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 12 Jul 2011 08:58:48 +0200

> This seems a bug, but in softirq handling in your arch

That's not the problem, it's the dev_queue_xmit_nit() code
path that interrupt is interrupting.

^ permalink raw reply

* Re: [PATCH] [PATCH] Fix deadlock in af_packet while stressing raw ethernet socket interface
From: Eric Dumazet @ 2011-07-12  8:23 UTC (permalink / raw)
  To: David Miller; +Cc: ronny.meeus, netdev
In-Reply-To: <20110712.001953.709404117315345178.davem@davemloft.net>

Le mardi 12 juillet 2011 à 00:19 -0700, David Miller a écrit :
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 12 Jul 2011 08:58:48 +0200
> 
> > This seems a bug, but in softirq handling in your arch
> 
> That's not the problem, it's the dev_queue_xmit_nit() code
> path that interrupt is interrupting.

Isnt it what I said ?

Unless I am mistaken, dev_queue_xmit_nit() is called from
dev_hard_start_xmit(), and from dev_queue_xmit(), therefore BH
are/should_be masked. Calling do_softirq() while BH are masked is the
bug.

Something must be messing the !in_interrupt() test done from irq_exit()




^ permalink raw reply

* Re: [PATCH] [PATCH] Fix deadlock in af_packet while stressing raw ethernet socket interface
From: David Miller @ 2011-07-12  8:26 UTC (permalink / raw)
  To: eric.dumazet; +Cc: ronny.meeus, netdev
In-Reply-To: <1310458986.4763.4.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 12 Jul 2011 10:23:06 +0200

> Le mardi 12 juillet 2011 à 00:19 -0700, David Miller a écrit :
>> From: Eric Dumazet <eric.dumazet@gmail.com>
>> Date: Tue, 12 Jul 2011 08:58:48 +0200
>> 
>> > This seems a bug, but in softirq handling in your arch
>> 
>> That's not the problem, it's the dev_queue_xmit_nit() code
>> path that interrupt is interrupting.
> 
> Isnt it what I said ?
> 
> Unless I am mistaken, dev_queue_xmit_nit() is called from
> dev_hard_start_xmit(), and from dev_queue_xmit(), therefore BH
> are/should_be masked. Calling do_softirq() while BH are masked is the
> bug.
> 
> Something must be messing the !in_interrupt() test done from irq_exit()

Aha, I see.  Yes, this seems to be where the problem are.

^ permalink raw reply

* [PATCH] ipv4: Inline neigh binding.
From: David Miller @ 2011-07-12  8:44 UTC (permalink / raw)
  To: netdev


Get rid of all of the useless and costly indirection
by doing the neigh hash table lookup directly inside
of the neighbour binding.

Rename from arp_bind_neighbour to rt_bind_neighbour.

Use new helpers {__,}ipv4_neigh_lookup()

In rt_bind_neighbour() get rid of useless tests which
are never true in the context this function is called,
namely dev is never NULL and the dst->neighbour is
always NULL.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 include/net/arp.h |   33 ++++++++++++++++++++++++++++++++-
 net/ipv4/arp.c    |   24 ------------------------
 net/ipv4/route.c  |   30 +++++++++++++++++++++++++++---
 3 files changed, 59 insertions(+), 28 deletions(-)

diff --git a/include/net/arp.h b/include/net/arp.h
index 723bde5..5e669e6 100644
--- a/include/net/arp.h
+++ b/include/net/arp.h
@@ -15,6 +15,38 @@ static inline u32 arp_hashfn(u32 key, const struct net_device *dev, u32 hash_rnd
 	return val * hash_rnd;
 }
 
+static inline struct neighbour *__ipv4_neigh_lookup(struct neigh_table *tbl, struct net_device *dev, u32 key)
+{
+	struct neigh_hash_table *nht;
+	struct neighbour *n;
+	u32 hash_val;
+
+	rcu_read_lock_bh();
+	nht = rcu_dereference_bh(tbl->nht);
+	hash_val = arp_hashfn(key, dev, nht->hash_rnd) >> (32 - nht->hash_shift);
+	for (n = rcu_dereference_bh(nht->hash_buckets[hash_val]);
+	     n != NULL;
+	     n = rcu_dereference_bh(n->next)) {
+		if (n->dev == dev && *(u32 *)n->primary_key == key) {
+			if (!atomic_inc_not_zero(&n->refcnt))
+				n = NULL;
+			break;
+		}
+	}
+	rcu_read_unlock_bh();
+
+	return n;
+}
+
+static inline struct neighbour *ipv4_neigh_lookup(struct neigh_table *tbl, struct net_device *dev, const __be32 *pkey)
+{
+	struct neighbour *n = __ipv4_neigh_lookup(tbl, dev,
+						  *(__force u32 *)pkey);
+	if (n)
+		return n;
+	return neigh_create(tbl, pkey, dev);
+}
+
 extern void	arp_init(void);
 extern int	arp_find(unsigned char *haddr, struct sk_buff *skb);
 extern int	arp_ioctl(struct net *net, unsigned int cmd, void __user *arg);
@@ -22,7 +54,6 @@ extern void     arp_send(int type, int ptype, __be32 dest_ip,
 			 struct net_device *dev, __be32 src_ip,
 			 const unsigned char *dest_hw,
 			 const unsigned char *src_hw, const unsigned char *th);
-extern int	arp_bind_neighbour(struct dst_entry *dst);
 extern int	arp_mc_map(__be32 addr, u8 *haddr, struct net_device *dev, int dir);
 extern void	arp_ifdown(struct net_device *dev);
 
diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
index 4412b57..3e55456 100644
--- a/net/ipv4/arp.c
+++ b/net/ipv4/arp.c
@@ -517,30 +517,6 @@ EXPORT_SYMBOL(arp_find);
 
 /* END OF OBSOLETE FUNCTIONS */
 
-int arp_bind_neighbour(struct dst_entry *dst)
-{
-	struct net_device *dev = dst->dev;
-	struct neighbour *n = dst->neighbour;
-
-	if (dev == NULL)
-		return -EINVAL;
-	if (n == NULL) {
-		__be32 nexthop = ((struct rtable *)dst)->rt_gateway;
-		if (dev->flags & (IFF_LOOPBACK | IFF_POINTOPOINT))
-			nexthop = 0;
-		n = __neigh_lookup_errno(
-#if defined(CONFIG_ATM_CLIP) || defined(CONFIG_ATM_CLIP_MODULE)
-					 dev->type == ARPHRD_ATM ?
-					 clip_tbl_hook :
-#endif
-					 &arp_tbl, &nexthop, dev);
-		if (IS_ERR(n))
-			return PTR_ERR(n);
-		dst->neighbour = n;
-	}
-	return 0;
-}
-
 /*
  * Check if we can use proxy ARP for this path
  */
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index a8ccd9b..c6388e8 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -108,6 +108,7 @@
 #ifdef CONFIG_SYSCTL
 #include <linux/sysctl.h>
 #endif
+#include <net/atmclip.h>
 
 #define RT_FL_TOS(oldflp4) \
     ((u32)(oldflp4->flowi4_tos & (IPTOS_RT_MASK | RTO_ONLINK)))
@@ -1006,6 +1007,29 @@ static int slow_chain_length(const struct rtable *head)
 	return length >> FRACT_BITS;
 }
 
+static int rt_bind_neighbour(struct rtable *rt)
+{
+	static const __be32 inaddr_any = 0;
+	struct net_device *dev = rt->dst.dev;
+	struct neigh_table *tbl = &arp_tbl;
+	const __be32 *nexthop;
+	struct neighbour *n;
+
+#if defined(CONFIG_ATM_CLIP) || defined(CONFIG_ATM_CLIP_MODULE)
+	if (dev->type == ARPHRD_ATM)
+		tbl = clip_tbl_hook;
+#endif
+	nexthop = &rt->rt_gateway;
+	if (dev->flags & (IFF_LOOPBACK | IFF_POINTOPOINT))
+		nexthop = &inaddr_any;
+	n = ipv4_neigh_lookup(tbl, dev, nexthop);
+	if (IS_ERR(n))
+		return PTR_ERR(n);
+	rt->dst.neighbour = n;
+
+	return 0;
+}
+
 static struct rtable *rt_intern_hash(unsigned hash, struct rtable *rt,
 				     struct sk_buff *skb, int ifindex)
 {
@@ -1042,7 +1066,7 @@ restart:
 
 		rt->dst.flags |= DST_NOCACHE;
 		if (rt->rt_type == RTN_UNICAST || rt_is_output_route(rt)) {
-			int err = arp_bind_neighbour(&rt->dst);
+			int err = rt_bind_neighbour(rt);
 			if (err) {
 				if (net_ratelimit())
 					printk(KERN_WARNING
@@ -1138,7 +1162,7 @@ restart:
 	   route or unicast forwarding path.
 	 */
 	if (rt->rt_type == RTN_UNICAST || rt_is_output_route(rt)) {
-		int err = arp_bind_neighbour(&rt->dst);
+		int err = rt_bind_neighbour(rt);
 		if (err) {
 			spin_unlock_bh(rt_hash_lock_addr(hash));
 
@@ -1599,7 +1623,7 @@ static int check_peer_redir(struct dst_entry *dst, struct inet_peer *peer)
 	rt->dst.neighbour = NULL;
 
 	rt->rt_gateway = peer->redirect_learned.a4;
-	if (arp_bind_neighbour(&rt->dst) ||
+	if (rt_bind_neighbour(rt) ||
 	    !(rt->dst.neighbour->nud_state & NUD_VALID)) {
 		if (rt->dst.neighbour)
 			neigh_event_send(rt->dst.neighbour, NULL);
-- 
1.7.5.4


^ permalink raw reply related

* softirqs are invoked while bottom halves are masked (was: Re: [PATCH] [PATCH] Fix deadlock in af_packet while stressing raw ethernet socket interface)
From: Thomas De Schampheleire @ 2011-07-12  9:23 UTC (permalink / raw)
  To: Eric Dumazet, linuxppc-dev; +Cc: Ronny Meeus, David Miller, netdev

Hi,

I'm adding the linuxppc-dev mailing list since this may be pointing to
an irq/softirq problem in the powerpc architecture-specific code...

On Tue, Jul 12, 2011 at 8:58 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le mardi 12 juillet 2011 à 08:38 +0200, Ronny Meeus a écrit :
>> On Tue, Jul 12, 2011 at 5:27 AM, David Miller <davem@davemloft.net> wrote:
>> > From: Ronny Meeus <ronny.meeus@gmail.com>
>> > Date: Sat, 11 Jun 2011 07:04:09 +0200
>> >
>> >> I was running a test: 1 application was sending raw Ethernet packets
>> >> on a physical looped interface while a second application was
>> >> receiving packets, so the latter application receives each packet 2
>> >> times (once while sending from the context of the first application
>> >> and a second time while receiving from the hardware).  After some
>> >> time, the test blocks due to a spinlock reentrance issue in
>> >> af_packet. Both the sending application and the softIRQ receiving
>> >> packets enter the spinlock code. After applying the patch below, the
>> >> issue is resolved.
>> >>
>> >> Signed-off-by: Ronny Meeus <ronny.meeus@gmail.com>
>> >
>> > The packet receive hooks should always be called with software
>> > interrupts disabled, it is a bug if this is not happening.  Your
>> > patch should not be necessary at all.
>> >
>> >
>>
>> Can you be a bit more specific on where the software interrupts should
>> be disabled?
>>
>> Below you find the information I get after switching on the spin_lock
>> issue detection in the kernel.
>> In this run also I-PIPE was active but this issue is also seen with
>> I-PIPE disabled.
>>
>
> This seems a bug, but in softirq handling in your arch
>
>> [   96.450034] BUG: spinlock recursion on CPU#0, send_eth_socket/1907
>> [   96.540451]  lock: eaeb8c9c, .magic: dead4ead, .owner:
>> send_eth_socket/1907, .owner_cpu: 0
>> [   96.656060] Call Trace: (unreliable)0161000 success=90] [c000789c]
>> show_stack+0x78/0x18c160001
>> [   96.827988] [efff3dd0] [c01e2a50] spin_bug+0xa8/0xc0=0000162000 succ
>> [   96.920712] [efff3df0] [c01e2b9c] do_raw_spin_lock+0x70/0x1c4ount=0000163000
>> [   97.022823] [efff3e20] [c0388d88] _raw_spin_lock+0x10/0x2001
>> [   97.121800] [efff3e30] [c03391a0] tpacket_rcv+0x274/0x61c164001
>> [   97.219733] [efff3e80] [c02d5970] __netif_receive_skb+0x1a8/0x36c0000165001
>> [   97.326001] [efff3eb0] [c02d6234] netif_receive_skb+0x98/0xacess=0000166001
>> [   97.427060] [efff3ee0] [c029e6c4] ingress_rx_default_dqrr+0x42c/0x4b8
>> [   97.504194] [efff3f10] [c02ba524] qman_poll_dqrr+0x1e0/0x284
>> [   97.571948] [efff3f50] [c029ff3c] dpaa_eth_poll+0x34/0xd0
>> [   97.636579] [efff3f70] [c02d6514] net_rx_action+0xc0/0x1e8
>> [   97.702256] [efff3fa0] [c0034d28] __do_softirq+0x138/0x210
>> [   97.767928] [efff3ff0] [c0010128] call_do_softirq+0x14/0x24
>> [   97.834641] [ec479a90] [c0004a00] do_softirq+0xb4/0xec
>
> We got an IRQ, and we start do_softirq() from irq_exit()
>
>> [   97.896148] [ec479ab0] [c0034a44] irq_exit+0x60/0xb8
>> [   97.955573] [ec479ac0] [c000a490] __ipipe_do_IRQ+0x88/0xc0
>> [   98.021253] [ec479ae0] [c0070214] __ipipe_sync_stage+0x1f0/0x27c
>> [   98.093176] [ec479b20] [c0009f28] __ipipe_handle_irq+0x1b8/0x1e8
>> [   98.165106] [ec479b50] [c000a210] __ipipe_grab_irq+0x18c/0x1bc
>> [   98.234947] [ec479b80] [c0011520] __ipipe_ret_from_except+0x0/0xc
>> [   98.307915] --- Exception: 501 at __packet_get_status+0x48/0x70
>> [   98.307920]     LR = __packet_get_status+0x44/0x70
>> [   98.436082] [ec479c40] [00000578] 0x578 (unreliable)
>> [   98.495524] [ec479c50] [c0338360] packet_lookup_frame+0x48/0x70
>> [   98.566405] [ec479c60] [c03391b4] tpacket_rcv+0x288/0x61c
>
> Here we were in BH disabled section, since dev_queue_xmit() contains :
>
>        rcu_read_lock_bh()
>
>
>> [   98.631037] [ec479cb0] [c02d762c] dev_hard_start_xmit+0x164/0x588
>> [   98.704003] [ec479cf0] [c0338d6c] packet_sendmsg+0x8c4/0x988
>> [   98.771758] [ec479d70] [c02c3838] sock_sendmsg+0x90/0xb4
>> [   98.835348] [ec479e40] [c02c4420] sys_sendto+0xdc/0x120
>> [   98.897891] [ec479f10] [c02c57d0] sys_socketcall+0x148/0x210
>> [   98.965648] [ec479f40] [c001084c] ret_from_syscall+0x0/0x3c
>> [   99.032361] --- Exception: c01 at 0x48051f00
>> [   99.032365]     LR = 0x4808e030
>> [  100.563009] BUG: spinlock lockup on CPU#0, send_eth_socket/1907, eaeb8c9c
>> [  100.644283] Call Trace:
>> [  100.673480] [efff3db0] [c000789c] show_stack+0x78/0x18c (unreliable)
>> [  100.749589] [efff3df0] [c01e2c94] do_raw_spin_lock+0x168/0x1c4
>> [  100.819430] [efff3e20] [c0388d88] _raw_spin_lock+0x10/0x20
>> [  100.885102] [efff3e30] [c03391a0] tpacket_rcv+0x274/0x61c
>> [  100.949733] [efff3e80] [c02d5970] __netif_receive_skb+0x1a8/0x36c
>> [  101.022699] [efff3eb0] [c02d6234] netif_receive_skb+0x98/0xac
>> [  101.091497] [efff3ee0] [c029e6c4] ingress_rx_default_dqrr+0x42c/0x4b8
>> [  101.168628] [efff3f10] [c02ba524] qman_poll_dqrr+0x1e0/0x284
>> [  101.236385] [efff3f50] [c029ff3c] dpaa_eth_poll+0x34/0xd0
>> [  101.301016] [efff3f70] [c02d6514] net_rx_action+0xc0/0x1e8
>> [  101.366692] [efff3fa0] [c0034d28] __do_softirq+0x138/0x210
>> [  101.432364] [efff3ff0] [c0010128] call_do_softirq+0x14/0x24
>> [  101.499078] [ec479a90] [c0004a00] do_softirq+0xb4/0xec
>> [  101.560585] [ec479ab0] [c0034a44] irq_exit+0x60/0xb8
>> [  101.620003] [ec479ac0] [c000a490] __ipipe_do_IRQ+0x88/0xc0
>> [  101.685678] [ec479ae0] [c0070214] __ipipe_sync_stage+0x1f0/0x27c
>> [  101.757601] [ec479b20] [c0009f28] __ipipe_handle_irq+0x1b8/0x1e8
>> [  101.829523] [ec479b50] [c000a210] __ipipe_grab_irq+0x18c/0x1bc
>> [  101.899363] [ec479b80] [c0011520] __ipipe_ret_from_except+0x0/0xc
>> [  101.972333] --- Exception: 501 at __packet_get_status+0x48/0x70
>> [  101.972338]     LR = __packet_get_status+0x44/0x70
>> [  102.100490] [ec479c40] [00000578] 0x578 (unreliable)
>> [  102.159929] [ec479c50] [c0338360] packet_lookup_frame+0x48/0x70
>> [  102.230810] [ec479c60] [c03391b4] tpacket_rcv+0x288/0x61c
>> [  102.295442] [ec479cb0] [c02d762c] dev_hard_start_xmit+0x164/0x588
>> [  102.368407] [ec479cf0] [c0338d6c] packet_sendmsg+0x8c4/0x988
>> [  102.436162] [ec479d70] [c02c3838] sock_sendmsg+0x90/0xb4
>> [  102.499747] [ec479e40] [c02c4420] sys_sendto+0xdc/0x120
>> [  102.562296] [ec479f10] [c02c57d0] sys_socketcall+0x148/0x210
>> [  102.630052] [ec479f40] [c001084c] ret_from_syscall+0x0/0x3c
>> [  102.696773] --- Exception: c01 at 0x48051f00
>> [  102.696777]     LR = 0x4808e030
>>


Note that the reason we are seeing this problem, may be because the
kernel we are using contains some patches from Freescale.
Specifically, in dev_queue_xmit(), support is added for hardware queue
handling, just before entering the rcu_read_lock_bh():

        if (dev->features & NETIF_F_HW_QDISC) {
                txq = dev_pick_tx(dev, skb);
                return dev_hard_start_xmit(skb, dev, txq);
        }

        /* Disable soft irqs for various locks below. Also
         * stops preemption for RCU.
         */
        rcu_read_lock_bh();

We just tried moving the escaping to dev_hard_start_xmit() after
taking the lock, but this gives a large number of other problems, e.g.

[   78.662428] BUG: sleeping function called from invalid context at
mm/slab.c:3101
[   78.751004] in_atomic(): 1, irqs_disabled(): 0, pid: 1908, name:
send_eth_socket
[   78.839582] Call Trace:
[   78.868784] [ec537b70] [c000789c] show_stack+0x78/0x18c (unreliable)
[   78.944905] [ec537bb0] [c0022900] __might_sleep+0x100/0x118
[   79.011636] [ec537bc0] [c00facc4] kmem_cache_alloc+0x48/0x118
[   79.080446] [ec537be0] [c02cd0e8] __alloc_skb+0x50/0x130
[   79.144047] [ec537c00] [c02cdf5c] skb_copy+0x44/0xc8
[   79.203478] [ec537c20] [c029f904] dpa_tx+0x154/0x758
[   79.262907] [ec537c80] [c02d78ec] dev_hard_start_xmit+0x424/0x588
[   79.335878] [ec537cc0] [c02d7aac] dev_queue_xmit+0x5c/0x3a4
[   79.402602] [ec537cf0] [c0338d4c] packet_sendmsg+0x8c4/0x988
[   79.470363] [ec537d70] [c02c3838] sock_sendmsg+0x90/0xb4
[   79.533960] [ec537e40] [c02c4420] sys_sendto+0xdc/0x120
[   79.596514] [ec537f10] [c02c57d0] sys_socketcall+0x148/0x210
[   79.664287] [ec537f40] [c001084c] ret_from_syscall+0x0/0x3c
[   79.731015] --- Exception: c01 at 0x48051f00
[   79.731019]     LR = 0x4808e030


Note that this may just be the cause for us seeing this problem. If
indeed the main problem is irq_exit() invoking softirqs in a locked
context, then this patch adding hardware queue support is not really
relevant.

Any suggestions from the developers at linuxppc-dev are very welcome...

Thomas

^ permalink raw reply

* Re: [ath9k-devel] [PATCH v2 07/46] net/wireless: ath9k: fix DMA API usage
From: Michał Mirosław @ 2011-07-12  9:55 UTC (permalink / raw)
  To: Felix Fietkau
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA, Jouni Malinen,
	Senthil Balasubramanian, ath9k-devel-xDcbHBWguxHbcTqmT+pZeQ,
	Vasanthakumar Thiagarajan, Ralf Baechle,
	linux-mips-6z/3iImG2C8G8FEW9MqTrA
In-Reply-To: <4E1BCF36.2010506-p3rKhJxN3npAfugRpC6u6w@public.gmane.org>

On Tue, Jul 12, 2011 at 12:36:06PM +0800, Felix Fietkau wrote:
> On 2011-07-11 8:52 AM, Michał Mirosław wrote:
> >Also constify buf_addr for ath9k_hw_process_rxdesc_edma() to verify
> >assumptions --- dma_sync_single_for_device() call can be removed.
> >
> >Signed-off-by: Michał Mirosław<mirq-linux-CoA6ZxLDdyEEUmgCuDUIdw@public.gmane.org>
> >---
> >  drivers/net/wireless/ath/ath9k/ar9003_mac.c |    4 ++--
> >  drivers/net/wireless/ath/ath9k/ar9003_mac.h |    2 +-
> >  drivers/net/wireless/ath/ath9k/recv.c       |   10 +++-------
> >  3 files changed, 6 insertions(+), 10 deletions(-)
> >
> >diff --git a/drivers/net/wireless/ath/ath9k/recv.c b/drivers/net/wireless/ath/ath9k/recv.c
> >index 70dc8ec..c5f46d5 100644
> >--- a/drivers/net/wireless/ath/ath9k/recv.c
> >+++ b/drivers/net/wireless/ath/ath9k/recv.c
> >@@ -684,15 +684,11 @@ static bool ath_edma_get_buffers(struct ath_softc *sc,
> >  	BUG_ON(!bf);
> >
> >  	dma_sync_single_for_cpu(sc->dev, bf->bf_buf_addr,
> >-				common->rx_bufsize, DMA_FROM_DEVICE);
> >+				common->rx_bufsize, DMA_BIDIRECTIONAL);
> >
> >  	ret = ath9k_hw_process_rxdesc_edma(ah, NULL, skb->data);
> >-	if (ret == -EINPROGRESS) {
> >-		/*let device gain the buffer again*/
> >-		dma_sync_single_for_device(sc->dev, bf->bf_buf_addr,
> >-				common->rx_bufsize, DMA_FROM_DEVICE);
> >+	if (ret == -EINPROGRESS)
> >  		return false;
> >-	}
> >
> >  	__skb_unlink(skb,&rx_edma->rx_fifo);
> >  	if (ret == -EINVAL) {
> I have strong doubts about this change. On most MIPS devices,
> dma_sync_single_for_cpu is a no-op, whereas
> dma_sync_single_for_device flushes the cache range. With this
> change, the CPU could cache the DMA status part behind skb->data and
> that cache entry would not be flushed inbetween calls to this
> functions on the same buffer, likely leading to rx stalls.

You're suggesting a platform implementation bug then. If the platform is not
cache-coherent, it should invalidate relevant CPU cache lines for sync_to_cpu
and unmap cases. Do other devices show such symptoms on MIPS systems?

I'm not familiar with the platform internals, so we should ask MIPS people.

[added Cc: linux-mips]

Best Regards,
Michał Mirosław
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: softirqs are invoked while bottom halves are masked
From: David Miller @ 2011-07-12 10:01 UTC (permalink / raw)
  To: patrickdepinguin+linuxppc; +Cc: eric.dumazet, linuxppc-dev, ronny.meeus, netdev
In-Reply-To: <CAAXf6LXHeiCT+n=Wpdf=QfUXUHUSd7fmx9y3kB7te8N3ON1fTg@mail.gmail.com>

From: Thomas De Schampheleire <patrickdepinguin+linuxppc@gmail.com>
Date: Tue, 12 Jul 2011 11:23:28 +0200

> Note that the reason we are seeing this problem, may be because the
> kernel we are using contains some patches from Freescale.
> Specifically, in dev_queue_xmit(), support is added for hardware queue
> handling, just before entering the rcu_read_lock_bh():
> 
>         if (dev->features & NETIF_F_HW_QDISC) {
>                 txq = dev_pick_tx(dev, skb);
>                 return dev_hard_start_xmit(skb, dev, txq);
>         }
> 
>         /* Disable soft irqs for various locks below. Also
>          * stops preemption for RCU.
>          */
>         rcu_read_lock_bh();
> 
> We just tried moving the escaping to dev_hard_start_xmit() after
> taking the lock, but this gives a large number of other problems, e.g.

This is definitely why you are seeing this behavior.

You cannot invoke dev_hard_start_xmit() without softirqs
being disabled.  It breaks everything.

This is what happens when you integrate networking patches
which were not reviewed and vetted on netdev.

^ permalink raw reply

* Re: [net-next 5/5] ixgbe: convert to ndo_fix_features
From: Michal Miroslaw @ 2011-07-12 10:09 UTC (permalink / raw)
  To: Skidmore, Donald C
  Cc: Kirsher, Jeffrey T, davem@davemloft.net, netdev@vger.kernel.org,
	gospo@redhat.com
In-Reply-To: <C0716E470783B24A8178C9E38CD2F34505DF1B3F@orsmsx509.amr.corp.intel.com>

On Mon, Jul 11, 2011 at 04:53:57PM -0700, Skidmore, Donald C wrote:
> >-----Original Message-----
> >From: Michal Miroslaw [mailto:mirq-linux@rere.qmqm.pl]
> >Sent: Saturday, July 09, 2011 5:11 AM
> >To: Kirsher, Jeffrey T
> >Cc: davem@davemloft.net; Skidmore, Donald C; netdev@vger.kernel.org;
> >gospo@redhat.com
> >Subject: Re: [net-next 5/5] ixgbe: convert to ndo_fix_features
> >
> >On Sat, Jul 09, 2011 at 04:50:40AM -0700, Jeff Kirsher wrote:
> >> From: Don Skidmore <donald.c.skidmore@intel.com>
> >>
> >> Private rx_csum flags are now duplicate of netdev->features &
> >> NETIF_F_RXCSUM.  We remove those duplicates and now use the
> >net_device_ops
> >> ndo_set_features.  This was based on the original patch submitted by
> >> Michal Miroslaw <mirq-linux@rere.qmqm.pl>.  I also removed the special
> >> case not requiring a reset for X540 hardware.  It is needed just as it
> >is
> >> in 82599 hardware.
> >[...]
> >> --- a/drivers/net/ixgbe/ixgbe_main.c
> >> +++ b/drivers/net/ixgbe/ixgbe_main.c
> >> @@ -7188,6 +7188,88 @@ int ixgbe_setup_tc(struct net_device *dev, u8
> >tc)
> >[...]
> >> +static int ixgbe_set_features(struct net_device *netdev, u32 data)
> >> +{
> >> +	struct ixgbe_adapter *adapter = netdev_priv(netdev);
> >> +	bool need_reset = false;
> >> +
> >> +#ifdef CONFIG_DCB
> >> +	if ((adapter->flags & IXGBE_FLAG_DCB_ENABLED) &&
> >> +	    !(data &  NETIF_F_HW_VLAN_RX))
> >> +		return -EINVAL;
> >> +#endif
> >> +	/* return error if RXHASH is being enabled when RSS is not
> >supported */
> >> +	if (!(adapter->flags & IXGBE_FLAG_RSS_ENABLED) &&
> >> +	     (data &  NETIF_F_RXHASH))
> >> +		return -EINVAL;
> >> +
> >> +	/* If Rx checksum is disabled, then RSC/LRO should also be
> >disabled */
> >> +	if (!(data & NETIF_F_RXCSUM)) {
> >> +		data &= ~NETIF_F_LRO;
> >> +		adapter->flags &= ~IXGBE_FLAG_RX_CSUM_ENABLED;
> >> +	} else {
> >> +		adapter->flags |= IXGBE_FLAG_RX_CSUM_ENABLED;
> >> +	}
> >
> >The checks above should be done in ndo_fix_features callback. The check
> >for
> >RSS_ENABLED for RXHASH is redundant, as the feature is removed in
> >probe()
> >in this case (see [MARK] below).
> Thanks for the comments Michal, it clears up a lot in my mind. :)
> 
> So in ndo_fix_features we would just turn off feature flags rather than return an error?  For example:
> 
> 	if (!(data & NETIF_F_RXCSUM)) 
> 		data &= ~NETIF_F_LRO;

Exactly.

> As for RSS_ENABLED/RXHASH check it was to cover the cases where RSS_ENABLED might have changed since probe.  This could happen with a resume were we don't get enough MSI-X vectors.  There are also paths in FCoE and DCB that get into code that could clear IXGBE_FLAG_RSS_ENABLED.

That makes sense now. The checks should be in ndo_fix_features and clear
features whenever they are unavailable.

> >> +
> >> +	/* if state changes we need to update adapter->flags and reset */
> >> +	if ((adapter->flags2 & IXGBE_FLAG2_RSC_CAPABLE) &&
> >> +	    (!!(data & NETIF_F_LRO) !=
> >> +	     !!(adapter->flags2 & IXGBE_FLAG2_RSC_ENABLED))) {

ndo_fix_features() should prevent the change if
!(adapter->flags2 & IXGBE_FLAG2_RSC_CAPABLE) is true.

> >> +		if ((data &  NETIF_F_LRO) &&
> >> +		    adapter->rx_itr_setting != 1 &&
> >> +		    adapter->rx_itr_setting > IXGBE_MAX_RSC_INT_RATE) {
> >> +			e_info(probe, "rx-usecs set too low, "
> >> +			       "not enabling RSC\n");

And should clear NETIF_F_LRO when above condition is met.

> >> +		} else {
> >> +			adapter->flags2 ^= IXGBE_FLAG2_RSC_ENABLED;
> >> +			switch (adapter->hw.mac.type) {
> >> +			case ixgbe_mac_X540:
> >> +			case ixgbe_mac_82599EB:
> >> +				need_reset = true;
> >> +				break;
> >> +			default:
> >> +				break;
> >> +			}
> >> +		}
> >> +	}
> >> +
> >> +	/*
> >> +	 * Check if Flow Director n-tuple support was enabled or disabled.
> >If
> >> +	 * the state changed, we need to reset.
> >> +	 */
> >> +	if (!(adapter->flags & IXGBE_FLAG_FDIR_PERFECT_CAPABLE)) {
> >> +		/* turn off ATR, enable perfect filters and reset */
> >> +		if (data & NETIF_F_NTUPLE) {
> >> +			adapter->flags &= ~IXGBE_FLAG_FDIR_HASH_CAPABLE;
> >> +			adapter->flags |= IXGBE_FLAG_FDIR_PERFECT_CAPABLE;
> >> +			need_reset = true;
> >> +		}
> >> +	} else if (!(data & NETIF_F_NTUPLE)) {
> >> +		/* turn off Flow Director, set ATR and reset */
> >> +		adapter->flags &= ~IXGBE_FLAG_FDIR_PERFECT_CAPABLE;
> >> +		if ((adapter->flags &  IXGBE_FLAG_RSS_ENABLED) &&
> >> +		    !(adapter->flags &  IXGBE_FLAG_DCB_ENABLED))
> >> +			adapter->flags |= IXGBE_FLAG_FDIR_HASH_CAPABLE;
> >> +		need_reset = true;
> >> +	}
> >
> >Part of the checks above should be in ndo_fix_features, too.
> >ndo_set_features
> >should just enable what it has been passed. What ndo_fix_features
> >callback
> >returns is further limited by generic checks, and then (if the resulting
> >set
> >is different than current dev->features) ndo_set_features is called.
> 
> I'm a little confused here.  From your comments I get the idea that ndo_fix_features() just modifies and error checks our feature set.  The result of this would then be just a change to the feature set (data variable in my case above).  Is that a correct assumption?  If so I'm confused as none of the two checks above change the feature set.  They do change driver flags to indicate the new state and mark whether we need a reset.  I don't believe we would want to do the reset until ndo_set_feature is called and if we broke up the setting of the driver flags into ndo_fix_features we would lose some state (i.e. if the IXGBE_FLAG2_RSC_ENABLED changed) that we need to decide if a reset is needed in ndo_set_features.  
> 
> Am I just missing something here?

I see that this changes only driver internal state, so your right in putting
this in ndo_set_features callback.

Can you draw a map of the relevant bits and what state should result from
them (a truth table if you will)? I have a hard time understanding the idea.
This changes _CAPABLE bits depending on _ENABLED which adds to confusion.

I would expect that:
 _CAPABLE are set whenever a config is possible, disregarding dependencies
 _ENABLED are set whenever a config is requested and possible
But it looks like what I described is reversed or just wrong.

Best Regards,
Michał Mirosław

^ permalink raw reply

* Re: softirqs are invoked while bottom halves are masked (was: Re: [PATCH] [PATCH] Fix deadlock in af_packet while stressing raw ethernet socket interface)
From: Eric Dumazet @ 2011-07-12 10:10 UTC (permalink / raw)
  To: Thomas De Schampheleire; +Cc: linuxppc-dev, Ronny Meeus, David Miller, netdev
In-Reply-To: <CAAXf6LXHeiCT+n=Wpdf=QfUXUHUSd7fmx9y3kB7te8N3ON1fTg@mail.gmail.com>

Le mardi 12 juillet 2011 à 11:23 +0200, Thomas De Schampheleire a
écrit :
> Hi,
> 
> I'm adding the linuxppc-dev mailing list since this may be pointing to
> an irq/softirq problem in the powerpc architecture-specific code...

> 
> Note that the reason we are seeing this problem, may be because the
> kernel we are using contains some patches from Freescale.
> Specifically, in dev_queue_xmit(), support is added for hardware queue
> handling, just before entering the rcu_read_lock_bh():
> 

Oh well, what a mess.

>         if (dev->features & NETIF_F_HW_QDISC) {
>                 txq = dev_pick_tx(dev, skb);



>                 return dev_hard_start_xmit(skb, dev, txq);
	This need to be :
		local_bh_disable();
		rc = dev_hard_start_xmit(skb, dev, txq);
		local_bh_enable();
		return rc;


>         }
> 
>         /* Disable soft irqs for various locks below. Also
>          * stops preemption for RCU.
>          */
>         rcu_read_lock_bh();
> 
> We just tried moving the escaping to dev_hard_start_xmit() after
> taking the lock, but this gives a large number of other problems, e.g.
> 
> [   78.662428] BUG: sleeping function called from invalid context at
> mm/slab.c:3101
> [   78.751004] in_atomic(): 1, irqs_disabled(): 0, pid: 1908, name:
> send_eth_socket
> [   78.839582] Call Trace:
> [   78.868784] [ec537b70] [c000789c] show_stack+0x78/0x18c (unreliable)
> [   78.944905] [ec537bb0] [c0022900] __might_sleep+0x100/0x118
> [   79.011636] [ec537bc0] [c00facc4] kmem_cache_alloc+0x48/0x118
> [   79.080446] [ec537be0] [c02cd0e8] __alloc_skb+0x50/0x130
> [   79.144047] [ec537c00] [c02cdf5c] skb_copy+0x44/0xc8
> [   79.203478] [ec537c20] [c029f904] dpa_tx+0x154/0x758

doing GFP_KERNEL allocations in dpa_tx() is wrong, for sure.


> [   79.262907] [ec537c80] [c02d78ec] dev_hard_start_xmit+0x424/0x588
> [   79.335878] [ec537cc0] [c02d7aac] dev_queue_xmit+0x5c/0x3a4
> [   79.402602] [ec537cf0] [c0338d4c] packet_sendmsg+0x8c4/0x988
> [   79.470363] [ec537d70] [c02c3838] sock_sendmsg+0x90/0xb4
> [   79.533960] [ec537e40] [c02c4420] sys_sendto+0xdc/0x120
> [   79.596514] [ec537f10] [c02c57d0] sys_socketcall+0x148/0x210
> [   79.664287] [ec537f40] [c001084c] ret_from_syscall+0x0/0x3c
> [   79.731015] --- Exception: c01 at 0x48051f00
> [   79.731019]     LR = 0x4808e030
> 
> 
> Note that this may just be the cause for us seeing this problem. If
> indeed the main problem is irq_exit() invoking softirqs in a locked
> context, then this patch adding hardware queue support is not really
> relevant.

irq_exit() is fine. This is because BH are not masked because of the
Freescale patches.

Really, suggesting an af_packet patch to solve a problem introduced in
an out of tree patch is insane.

You guys hould have clearly stated you were using an alien kernel.




^ permalink raw reply

* [patch net-next-2.6] net: allow multiple rx_handler registration
From: Jiri Pirko @ 2011-07-12 11:06 UTC (permalink / raw)
  To: netdev
  Cc: davem, shemminger, kaber, fubar, eric.dumazet, nicolas.2p.debian,
	andy, greearb, mirqus, equinox

For some net topos it is necessary to have multiple "soft-net-devices"
hooked on one netdev. For example very common is to have
eth<->(br+vlan). Vlan is not using rh_handler (yet) but it might be useful
for other setups.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
---
 drivers/net/bonding/bond_main.c |   14 ++++---
 drivers/net/bonding/bonding.h   |    9 +++-
 drivers/net/macvlan.c           |   35 +++++++++++-----
 include/linux/netdevice.h       |   63 +++++++++++++++++++++++++---
 net/bridge/br_if.c              |    5 +-
 net/bridge/br_input.c           |    5 +-
 net/bridge/br_private.h         |   28 ++++++++++---
 net/core/dev.c                  |   87 +++++++++++++++++++++++++++++++--------
 8 files changed, 193 insertions(+), 53 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 61265f7..f18af47 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1482,7 +1482,8 @@ static bool bond_should_deliver_exact_match(struct sk_buff *skb,
 	return false;
 }
 
-static rx_handler_result_t bond_handle_frame(struct sk_buff **pskb)
+static rx_handler_result_t bond_handle_frame(struct sk_buff **pskb,
+					     struct rx_handler *rx_handler)
 {
 	struct sk_buff *skb = *pskb;
 	struct slave *slave;
@@ -1494,7 +1495,7 @@ static rx_handler_result_t bond_handle_frame(struct sk_buff **pskb)
 
 	*pskb = skb;
 
-	slave = bond_slave_get_rcu(skb->dev);
+	slave = bond_slave_get(rx_handler);
 	bond = slave->bond;
 
 	if (bond->params.arp_interval)
@@ -1897,8 +1898,9 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
 	if (res)
 		goto err_close;
 
-	res = netdev_rx_handler_register(slave_dev, bond_handle_frame,
-					 new_slave);
+	res = netdev_rx_handler_register(slave_dev, &new_slave->rx_handler,
+					 bond_handle_frame,
+					 RX_HANDLER_PRIO_BOND);
 	if (res) {
 		pr_debug("Error %d calling netdev_rx_handler_register\n", res);
 		goto err_dest_symlinks;
@@ -1988,7 +1990,7 @@ int bond_release(struct net_device *bond_dev, struct net_device *slave_dev)
 	/* unregister rx_handler early so bond_handle_frame wouldn't be called
 	 * for this slave anymore.
 	 */
-	netdev_rx_handler_unregister(slave_dev);
+	netdev_rx_handler_unregister(slave_dev, &slave->rx_handler);
 	write_unlock_bh(&bond->lock);
 	synchronize_net();
 	write_lock_bh(&bond->lock);
@@ -2189,7 +2191,7 @@ static int bond_release_all(struct net_device *bond_dev)
 		/* unregister rx_handler early so bond_handle_frame wouldn't
 		 * be called for this slave anymore.
 		 */
-		netdev_rx_handler_unregister(slave_dev);
+		netdev_rx_handler_unregister(slave_dev, &slave->rx_handler);
 		synchronize_net();
 
 		if (bond_is_lb(bond)) {
diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
index 2936171..e732e16 100644
--- a/drivers/net/bonding/bonding.h
+++ b/drivers/net/bonding/bonding.h
@@ -172,6 +172,7 @@ struct vlan_entry {
 
 struct slave {
 	struct net_device *dev; /* first - useful for panic debug */
+	struct rx_handler rx_handler;
 	struct slave *next;
 	struct slave *prev;
 	struct bonding *bond; /* our master */
@@ -196,6 +197,11 @@ struct slave {
 #endif
 };
 
+#define bond_slave_get(rx_handler)			\
+	netdev_rx_handler_get_priv(rx_handler,		\
+				   struct slave,	\
+				   rx_handler)
+
 /*
  * Link pseudo-state only used internally by monitors
  */
@@ -253,9 +259,6 @@ struct bonding {
 #endif /* CONFIG_DEBUG_FS */
 };
 
-#define bond_slave_get_rcu(dev) \
-	((struct slave *) rcu_dereference(dev->rx_handler_data))
-
 /**
  * Returns NULL if the net_device does not belong to any of the bond's slaves
  *
diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index cc67cbe..49ca58b 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -34,19 +34,28 @@
 #define MACVLAN_HASH_SIZE	(1 << BITS_PER_BYTE)
 
 struct macvlan_port {
+	struct rx_handler	rx_handler;
 	struct net_device	*dev;
 	struct hlist_head	vlan_hash[MACVLAN_HASH_SIZE];
 	struct list_head	vlans;
 	struct rcu_head		rcu;
-	bool 			passthru;
+	bool			passthru;
 	int			count;
 };
 
+#define macvlan_port_get(rx_handler)				\
+	netdev_rx_handler_get_priv(rx_handler,			\
+				   struct macvlan_port,		\
+				   rx_handler)
+
+#define macvlan_port_get_by_dev(dev)					\
+	netdev_rx_handler_get_priv_by_prio(dev,				\
+					   RX_HANDLER_PRIO_MACVLAN,	\
+					   struct macvlan_port,		\
+					   rx_handler)
+
 static void macvlan_port_destroy(struct net_device *dev);
 
-#define macvlan_port_get_rcu(dev) \
-	((struct macvlan_port *) rcu_dereference(dev->rx_handler_data))
-#define macvlan_port_get(dev) ((struct macvlan_port *) dev->rx_handler_data)
 #define macvlan_port_exists(dev) (dev->priv_flags & IFF_MACVLAN_PORT)
 
 static struct macvlan_dev *macvlan_hash_lookup(const struct macvlan_port *port,
@@ -156,7 +165,8 @@ static void macvlan_broadcast(struct sk_buff *skb,
 }
 
 /* called under rcu_read_lock() from netif_receive_skb */
-static rx_handler_result_t macvlan_handle_frame(struct sk_buff **pskb)
+static rx_handler_result_t macvlan_handle_frame(struct sk_buff **pskb,
+						struct rx_handler *rx_handler)
 {
 	struct macvlan_port *port;
 	struct sk_buff *skb = *pskb;
@@ -167,7 +177,7 @@ static rx_handler_result_t macvlan_handle_frame(struct sk_buff **pskb)
 	unsigned int len = 0;
 	int ret = NET_RX_DROP;
 
-	port = macvlan_port_get_rcu(skb->dev);
+	port = macvlan_port_get(rx_handler);
 	if (is_multicast_ether_addr(eth->h_dest)) {
 		src = macvlan_hash_lookup(port, eth->h_source);
 		if (!src)
@@ -617,7 +627,9 @@ static int macvlan_port_create(struct net_device *dev)
 	for (i = 0; i < MACVLAN_HASH_SIZE; i++)
 		INIT_HLIST_HEAD(&port->vlan_hash[i]);
 
-	err = netdev_rx_handler_register(dev, macvlan_handle_frame, port);
+	err = netdev_rx_handler_register(dev, &port->rx_handler,
+					 macvlan_handle_frame,
+					 RX_HANDLER_PRIO_MACVLAN);
 	if (err)
 		kfree(port);
 	else
@@ -627,10 +639,11 @@ static int macvlan_port_create(struct net_device *dev)
 
 static void macvlan_port_destroy(struct net_device *dev)
 {
-	struct macvlan_port *port = macvlan_port_get(dev);
+	struct macvlan_dev *vlan = netdev_priv(dev);
+	struct macvlan_port *port = vlan->port;
 
 	dev->priv_flags &= ~IFF_MACVLAN_PORT;
-	netdev_rx_handler_unregister(dev);
+	netdev_rx_handler_unregister(dev, &port->rx_handler);
 	kfree_rcu(port, rcu);
 }
 
@@ -696,7 +709,7 @@ int macvlan_common_newlink(struct net *src_net, struct net_device *dev,
 		if (err < 0)
 			return err;
 	}
-	port = macvlan_port_get(lowerdev);
+	port = macvlan_port_get_by_dev(lowerdev);
 
 	/* Only 1 macvlan device can be created in passthru mode */
 	if (port->passthru)
@@ -818,7 +831,7 @@ static int macvlan_device_event(struct notifier_block *unused,
 	if (!macvlan_port_exists(dev))
 		return NOTIFY_DONE;
 
-	port = macvlan_port_get(dev);
+	port = macvlan_port_get_by_dev(dev);
 
 	switch (event) {
 	case NETDEV_CHANGE:
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 011eb89..126cd07 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -437,7 +437,51 @@ enum rx_handler_result {
 	RX_HANDLER_PASS,
 };
 typedef enum rx_handler_result rx_handler_result_t;
-typedef rx_handler_result_t rx_handler_func_t(struct sk_buff **pskb);
+
+struct rx_handler;
+typedef rx_handler_result_t rx_handler_func_t(struct sk_buff **pskb,
+					      struct rx_handler *rx_handler);
+
+enum rx_handler_prio {
+	RX_HANDLER_PRIO_BRIDGE,
+	RX_HANDLER_PRIO_BOND,
+	RX_HANDLER_PRIO_MACVLAN,
+};
+
+/*
+ * struct rx_handler should be embedded into
+ * private struct used by rx_handler
+ */
+struct rx_handler {
+	struct list_head	list;
+	rx_handler_func_t	*callback;
+	unsigned int		prio;
+};
+
+/**
+ * netdev_rx_handler_get_priv - get containing private structure of given
+ *				receive handler
+ * @rx_handler: receive_handler
+ * @type: the type of the container struct this is embedded in
+ * @member: the name of the member within the struct
+ */
+#define netdev_rx_handler_get_priv(rx_handler, type, member) \
+	container_of(rx_handler, type, member)
+
+/**
+ * netdev_rx_handler_get_priv_by_prio, netdev_rx_handler_get_priv_by_prio_rcu
+ *	- get containing private structure of given receive handler priority
+ * @dev: netdevice
+ * @type: the type of the container struct this is embedded in
+ * @member: the name of the member within the struct
+ */
+#define netdev_rx_handler_get_priv_by_prio(dev, prio, type, member)		\
+	netdev_rx_handler_get_priv(netdev_rx_handler_get_by_prio(dev, prio),	\
+				   type, member)
+
+#define netdev_rx_handler_get_priv_by_prio_rcu(dev, prio, type, member)		\
+	netdev_rx_handler_get_priv(netdev_rx_handler_get_by_prio_rcu(dev, prio),\
+				   type, member)
 
 extern void __napi_schedule(struct napi_struct *n);
 
@@ -1238,8 +1282,7 @@ struct net_device {
 #endif
 #endif
 
-	rx_handler_func_t __rcu	*rx_handler;
-	void __rcu		*rx_handler_data;
+	struct list_head	rx_handler_list;
 
 	struct netdev_queue __rcu *ingress_queue;
 
@@ -2082,10 +2125,18 @@ static inline void napi_free_frags(struct napi_struct *napi)
 	napi->skb = NULL;
 }
 
+extern struct rx_handler *
+netdev_rx_handler_get_by_prio(const struct net_device *dev,
+			      unsigned int prio);
+extern struct rx_handler *
+netdev_rx_handler_get_by_prio_rcu(const struct net_device *dev,
+				  unsigned int prio);
 extern int netdev_rx_handler_register(struct net_device *dev,
-				      rx_handler_func_t *rx_handler,
-				      void *rx_handler_data);
-extern void netdev_rx_handler_unregister(struct net_device *dev);
+				      struct rx_handler *rx_handler,
+			              rx_handler_func_t *callback,
+				      unsigned int prio);
+extern void netdev_rx_handler_unregister(struct net_device *dev,
+					 struct rx_handler *rx_handler);
 
 extern int		dev_valid_name(const char *name);
 extern int		dev_ioctl(struct net *net, unsigned int cmd, void __user *);
diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
index 1bacca4..4ee5d78 100644
--- a/net/bridge/br_if.c
+++ b/net/bridge/br_if.c
@@ -146,7 +146,7 @@ static void del_nbp(struct net_bridge_port *p)
 
 	dev->priv_flags &= ~IFF_BRIDGE_PORT;
 
-	netdev_rx_handler_unregister(dev);
+	netdev_rx_handler_unregister(dev, &p->rx_handler);
 	synchronize_net();
 
 	netdev_set_master(dev, NULL);
@@ -365,7 +365,8 @@ int br_add_if(struct net_bridge *br, struct net_device *dev)
 	if (err)
 		goto err3;
 
-	err = netdev_rx_handler_register(dev, br_handle_frame, p);
+	err = netdev_rx_handler_register(dev, &p->rx_handler, br_handle_frame,
+					 RX_HANDLER_PRIO_BRIDGE);
 	if (err)
 		goto err4;
 
diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
index f06ee39..1f729a0 100644
--- a/net/bridge/br_input.c
+++ b/net/bridge/br_input.c
@@ -142,7 +142,8 @@ static inline int is_link_local(const unsigned char *dest)
  * Return NULL if skb is handled
  * note: already called with rcu_read_lock
  */
-rx_handler_result_t br_handle_frame(struct sk_buff **pskb)
+rx_handler_result_t br_handle_frame(struct sk_buff **pskb,
+				    struct rx_handler *rx_handler)
 {
 	struct net_bridge_port *p;
 	struct sk_buff *skb = *pskb;
@@ -159,7 +160,7 @@ rx_handler_result_t br_handle_frame(struct sk_buff **pskb)
 	if (!skb)
 		return RX_HANDLER_CONSUMED;
 
-	p = br_port_get_rcu(skb->dev);
+	p = br_port_get(rx_handler);
 
 	if (unlikely(is_link_local(dest))) {
 		/* Pause frames shouldn't be passed up by driver anyway */
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index 54578f2..1a1ea40 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -108,6 +108,7 @@ struct net_bridge_mdb_htable
 
 struct net_bridge_port
 {
+	struct rx_handler		rx_handler;
 	struct net_bridge		*br;
 	struct net_device		*dev;
 	struct list_head		list;
@@ -152,18 +153,32 @@ struct net_bridge_port
 #endif
 };
 
+#define br_port_get(rx_handler)					\
+	netdev_rx_handler_get_priv(rx_handler,			\
+				   struct net_bridge_port,	\
+				   rx_handler)
+
 #define br_port_exists(dev) (dev->priv_flags & IFF_BRIDGE_PORT)
 
-static inline struct net_bridge_port *br_port_get_rcu(const struct net_device *dev)
+static inline struct net_bridge_port *
+br_port_get_rcu(const struct net_device *dev)
 {
-	struct net_bridge_port *port = rcu_dereference(dev->rx_handler_data);
-	return br_port_exists(dev) ? port : NULL;
+	if (unlikely(!br_port_exists(dev)))
+		return NULL;
+	return netdev_rx_handler_get_priv_by_prio_rcu(dev,
+						      RX_HANDLER_PRIO_BRIDGE,
+						      struct net_bridge_port,
+						      rx_handler);
 }
 
 static inline struct net_bridge_port *br_port_get_rtnl(struct net_device *dev)
 {
-	return br_port_exists(dev) ?
-		rtnl_dereference(dev->rx_handler_data) : NULL;
+	if (unlikely(!br_port_exists(dev)))
+		return NULL;
+	return netdev_rx_handler_get_priv_by_prio(dev,
+						  RX_HANDLER_PRIO_BRIDGE,
+						  struct net_bridge_port,
+						  rx_handler);
 }
 
 struct br_cpu_netstats {
@@ -382,7 +397,8 @@ extern u32 br_features_recompute(struct net_bridge *br, u32 features);
 
 /* br_input.c */
 extern int br_handle_frame_finish(struct sk_buff *skb);
-extern rx_handler_result_t br_handle_frame(struct sk_buff **pskb);
+extern rx_handler_result_t br_handle_frame(struct sk_buff **pskb,
+					   struct rx_handler *rx_handler);
 
 /* br_ioctl.c */
 extern int br_dev_ioctl(struct net_device *dev, struct ifreq *rq, int cmd);
diff --git a/net/core/dev.c b/net/core/dev.c
index 9ca1514..ea5e3fb 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3053,10 +3053,55 @@ out:
 #endif
 
 /**
+ *	netdev_rx_handler_get_by_prio - get receive handler struct by priority
+ *	@dev: net device
+ *	@prio: receive handler priority
+ *
+ *	Find and return receive handler for given priority.
+ *
+ *	The caller must hold the rtnl_mutex.
+ */
+struct rx_handler *
+netdev_rx_handler_get_by_prio(const struct net_device *dev, unsigned int prio)
+{
+	struct rx_handler *rx_handler;
+
+	ASSERT_RTNL();
+	list_for_each_entry(rx_handler, &dev->rx_handler_list, list)
+		if (rx_handler->prio == prio)
+			return rx_handler;
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(netdev_rx_handler_get_by_prio);
+
+/**
+ *	netdev_rx_handler_get_by_prio_rcu - get receive handler struct by priority
+ *	@dev: net device
+ *	@prio: receive handler priority
+ *
+ *	RCU variant to find and return receive handler for given priority.
+ *
+ *	The caller must hold the rcu_read_lock.
+ */
+struct rx_handler *
+netdev_rx_handler_get_by_prio_rcu(const struct net_device *dev,
+				  unsigned int prio)
+{
+	struct rx_handler *rx_handler;
+
+	list_for_each_entry_rcu(rx_handler, &dev->rx_handler_list, list)
+		if (rx_handler->prio == prio)
+			return rx_handler;
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(netdev_rx_handler_get_by_prio_rcu);
+
+/**
  *	netdev_rx_handler_register - register receive handler
  *	@dev: device to register a handler for
- *	@rx_handler: receive handler to register
- *	@rx_handler_data: data pointer that is used by rx handler
+ *	@rx_handler: receive handler structure to register
+ *	@callback: receive handler callback function to register
+ *	@prio: receive handler priority
  *
  *	Register a receive hander for a device. This handler will then be
  *	called from __netif_receive_skb. A negative errno code is returned
@@ -3067,17 +3112,24 @@ out:
  *	For a general description of rx_handler, see enum rx_handler_result.
  */
 int netdev_rx_handler_register(struct net_device *dev,
-			       rx_handler_func_t *rx_handler,
-			       void *rx_handler_data)
+			       struct rx_handler *rx_handler,
+			       rx_handler_func_t *callback, unsigned int prio)
 {
-	ASSERT_RTNL();
+	struct list_head *pos;
 
-	if (dev->rx_handler)
+	ASSERT_RTNL();
+	if (netdev_rx_handler_get_by_prio(dev, prio))
 		return -EBUSY;
+	list_for_each(pos, &dev->rx_handler_list) {
+		struct rx_handler *entry;
 
-	rcu_assign_pointer(dev->rx_handler_data, rx_handler_data);
-	rcu_assign_pointer(dev->rx_handler, rx_handler);
-
+		entry = list_entry(pos, struct rx_handler, list);
+		if (prio > entry->prio)
+			break;
+	}
+	rx_handler->callback = callback;
+	rx_handler->prio = prio;
+	list_add_rcu(&rx_handler->list, pos);
 	return 0;
 }
 EXPORT_SYMBOL_GPL(netdev_rx_handler_register);
@@ -3085,24 +3137,24 @@ EXPORT_SYMBOL_GPL(netdev_rx_handler_register);
 /**
  *	netdev_rx_handler_unregister - unregister receive handler
  *	@dev: device to unregister a handler from
+ *	@prio: handler priority
  *
  *	Unregister a receive hander from a device.
  *
  *	The caller must hold the rtnl_mutex.
  */
-void netdev_rx_handler_unregister(struct net_device *dev)
+void netdev_rx_handler_unregister(struct net_device *dev,
+				  struct rx_handler *rx_handler)
 {
-
 	ASSERT_RTNL();
-	rcu_assign_pointer(dev->rx_handler, NULL);
-	rcu_assign_pointer(dev->rx_handler_data, NULL);
+	list_del_rcu(&rx_handler->list);
 }
 EXPORT_SYMBOL_GPL(netdev_rx_handler_unregister);
 
 static int __netif_receive_skb(struct sk_buff *skb)
 {
 	struct packet_type *ptype, *pt_prev;
-	rx_handler_func_t *rx_handler;
+	struct rx_handler *rx_handler;
 	struct net_device *orig_dev;
 	struct net_device *null_or_dev;
 	bool deliver_exact = false;
@@ -3162,13 +3214,12 @@ another_round:
 ncls:
 #endif
 
-	rx_handler = rcu_dereference(skb->dev->rx_handler);
-	if (rx_handler) {
+	list_for_each_entry_rcu(rx_handler, &skb->dev->rx_handler_list, list) {
 		if (pt_prev) {
 			ret = deliver_skb(skb, pt_prev, orig_dev);
 			pt_prev = NULL;
 		}
-		switch (rx_handler(&skb)) {
+		switch (rx_handler->callback(&skb, rx_handler)) {
 		case RX_HANDLER_CONSUMED:
 			goto out;
 		case RX_HANDLER_ANOTHER:
@@ -5881,6 +5932,8 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 	INIT_LIST_HEAD(&dev->napi_list);
 	INIT_LIST_HEAD(&dev->unreg_list);
 	INIT_LIST_HEAD(&dev->link_watch_list);
+	INIT_LIST_HEAD(&dev->rx_handler_list);
+
 	dev->priv_flags = IFF_XMIT_DST_RELEASE;
 	setup(dev);
 
-- 
1.7.6


^ permalink raw reply related

* Re: [PATCH] bridge: mask forwarding of IEEE 802 local multicast groups
From: David Lamparter @ 2011-07-12 11:36 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Nick Carter, netdev, Michał Mirosław, David Lamparter,
	davem
In-Reply-To: <20110711082755.0b38a15a@nehalam.ftrdhcpuser.net>

On Mon, Jul 11, 2011 at 08:27:55AM -0700, Stephen Hemminger wrote:
> On Sun, 10 Jul 2011 17:04:30 +0100
> Nick Carter <ncarter100@gmail.com> wrote:
> 
> > Updated diffs so they apply to net-next (Original diffs were based off 2.6.38).
> > 
> > Any chance of getting these diffs applied?  The default behaviour of
> > the bridge code is unchanged.  They solve the problem of
> > authenticating a virtual 802.1x supplicant machine against an external
> > 802.1X authenticator.  It is also a general solution that allows the
> > forwarding of any combination of the IEEE 802 local multicast groups.
> > 
> > Signed-off-by: Nick Carter <ncarter100@gmail.com>
> > Reviewed-by: David Lamparter <equinox@diac24.net>
> 
> I am still undecided on this. Understand the need, but don't like idea
> of bridge behaving in non-conforming manner. Will see if IEEE 802 committee
> has any input.

The patch doesn't make the bridge behave nonconformant. The default mask
is 0, which just keeps the old behaviour.

If you set the lowest 3 bits, yes, you can break your network. But so
does enabling proxy_arp in most cases. And there are reasonable use
cases for it, both 802.1X forwarding and fully-transparent* packet
capture bridges benefit from it. And the latter is something I wouldn't
wish to move to userspace either.

Maybe we should add a warning if the lowest 3 bits are set, like
"you have enabled forwarding of STP/Pause/Bond frames. This can
thoroughly break your network."

* excl. pause frames, sadly - those get eaten by hw/driver...

> Also, don't want to build more knobs in with sysfs that are per-bridge.
> Eventually, the plan is to make all the setting per-port with sysctl's
> like IPv6.

This setting doesn't make sense per-port IMHO. Also, sysctl?!


-David


^ permalink raw reply

* Re: [patch net-next-2.6] net: allow multiple rx_handler registration
From: David Lamparter @ 2011-07-12 11:54 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, davem, shemminger, kaber, fubar, eric.dumazet,
	nicolas.2p.debian, andy, greearb, mirqus, equinox
In-Reply-To: <1310468761-2304-1-git-send-email-jpirko@redhat.com>

On Tue, Jul 12, 2011 at 01:06:01PM +0200, Jiri Pirko wrote:
> For some net topos it is necessary to have multiple "soft-net-devices"
> hooked on one netdev. For example very common is to have
> eth<->(br+vlan). Vlan is not using rh_handler (yet) but it might be useful
> for other setups.

I disagree strongly, especially with the use cases you're enabling in
this patch.

> +	res = netdev_rx_handler_register(slave_dev, &new_slave->rx_handler,
> +					 bond_handle_frame,
> +					 RX_HANDLER_PRIO_BOND);

> +	err = netdev_rx_handler_register(dev, &port->rx_handler,
> +					 macvlan_handle_frame,
> +					 RX_HANDLER_PRIO_MACVLAN);

> +	err = netdev_rx_handler_register(dev, &p->rx_handler, br_handle_frame,
> +					 RX_HANDLER_PRIO_BRIDGE);

> +enum rx_handler_prio {
> +	RX_HANDLER_PRIO_BRIDGE,
> +	RX_HANDLER_PRIO_BOND,
> +	RX_HANDLER_PRIO_MACVLAN,
> +};

These are all incompatible with each other to a varying degree and/or
don't make much sense. Let's look at them:

a) a device simultaneously being a bridge member and a bond slave
 -> Fully incompatible. Your bonding peer switch will start sending
    the bridge's packets on other bond member devices.

b) a device having macvlans and being a bond slave
 -> Fully incompatible. Same as above, packets to the macvlan will end
    up on other bond member devices.

c) bridge + macvlan
 -> Mostly useless. Add veth/tap devices to your bridge... as a bonus
    you get a proper MAC table.

This at least needs bonding support removed since bonding is essentially
incompatible with anything else w/ the same reasoning as above. Bonds
are as low-level as Pause frames. Never ever touch individual bond
slaves.

What does make sense is a device being member of multiple bridges, with
ebtables as solicitor for which bridge gets the packet. But that's not
possible with your patch...
+       if (netdev_rx_handler_get_by_prio(dev, prio))
                return -EBUSY;

I think your idea is good, but it needs WAY more proper consideration.


-David


^ permalink raw reply

* Re: softirqs are invoked while bottom halves are masked (was: Re: [PATCH] [PATCH] Fix deadlock in af_packet while stressing raw ethernet socket interface)
From: Ronny Meeus @ 2011-07-12 12:03 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Thomas De Schampheleire, linuxppc-dev, David Miller, netdev,
	afleming
In-Reply-To: <1310465411.3314.6.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>

On Tue, Jul 12, 2011 at 12:10 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le mardi 12 juillet 2011 à 11:23 +0200, Thomas De Schampheleire a
> écrit :
>> Hi,
>>
>> I'm adding the linuxppc-dev mailing list since this may be pointing to
>> an irq/softirq problem in the powerpc architecture-specific code...
>
>>
>> Note that the reason we are seeing this problem, may be because the
>> kernel we are using contains some patches from Freescale.
>> Specifically, in dev_queue_xmit(), support is added for hardware queue
>> handling, just before entering the rcu_read_lock_bh():
>>
>
> Oh well, what a mess.
>
>>         if (dev->features & NETIF_F_HW_QDISC) {
>>                 txq = dev_pick_tx(dev, skb);
>
>
>
>>                 return dev_hard_start_xmit(skb, dev, txq);
>        This need to be :
>                local_bh_disable();
>                rc = dev_hard_start_xmit(skb, dev, txq);
>                local_bh_enable();
>                return rc;
>
>
>>         }
>>
>>         /* Disable soft irqs for various locks below. Also
>>          * stops preemption for RCU.
>>          */
>>         rcu_read_lock_bh();
>>
>> We just tried moving the escaping to dev_hard_start_xmit() after
>> taking the lock, but this gives a large number of other problems, e.g.
>>
>> [   78.662428] BUG: sleeping function called from invalid context at
>> mm/slab.c:3101
>> [   78.751004] in_atomic(): 1, irqs_disabled(): 0, pid: 1908, name:
>> send_eth_socket
>> [   78.839582] Call Trace:
>> [   78.868784] [ec537b70] [c000789c] show_stack+0x78/0x18c (unreliable)
>> [   78.944905] [ec537bb0] [c0022900] __might_sleep+0x100/0x118
>> [   79.011636] [ec537bc0] [c00facc4] kmem_cache_alloc+0x48/0x118
>> [   79.080446] [ec537be0] [c02cd0e8] __alloc_skb+0x50/0x130
>> [   79.144047] [ec537c00] [c02cdf5c] skb_copy+0x44/0xc8
>> [   79.203478] [ec537c20] [c029f904] dpa_tx+0x154/0x758
>
> doing GFP_KERNEL allocations in dpa_tx() is wrong, for sure.
>
>
>> [   79.262907] [ec537c80] [c02d78ec] dev_hard_start_xmit+0x424/0x588
>> [   79.335878] [ec537cc0] [c02d7aac] dev_queue_xmit+0x5c/0x3a4
>> [   79.402602] [ec537cf0] [c0338d4c] packet_sendmsg+0x8c4/0x988
>> [   79.470363] [ec537d70] [c02c3838] sock_sendmsg+0x90/0xb4
>> [   79.533960] [ec537e40] [c02c4420] sys_sendto+0xdc/0x120
>> [   79.596514] [ec537f10] [c02c57d0] sys_socketcall+0x148/0x210
>> [   79.664287] [ec537f40] [c001084c] ret_from_syscall+0x0/0x3c
>> [   79.731015] --- Exception: c01 at 0x48051f00
>> [   79.731019]     LR = 0x4808e030
>>
>>
>> Note that this may just be the cause for us seeing this problem. If
>> indeed the main problem is irq_exit() invoking softirqs in a locked
>> context, then this patch adding hardware queue support is not really
>> relevant.
>
> irq_exit() is fine. This is because BH are not masked because of the
> Freescale patches.
>
> Really, suggesting an af_packet patch to solve a problem introduced in
> an out of tree patch is insane.
>
> You guys hould have clearly stated you were using an alien kernel.
>
>
>
>

Sorry for not mentioning we were using a patched kernel.
I was not aware that the code involved was patched by the FreeScale
patches we applied. The code found in the stack dumps is not
implemented in FSL specific files.

While reading the code of af_packet I saw that the spin_lock_bh is
used in several places while this is not the case in the tpacket_rcv
function. Since we had a locking issue in that code, I thought that my
patch would be OK.
I was not aware that for that specific function (tpacket_rcv) a
different lock primitive must be used. A suggestion for improvement:
it would be better to document this pre-condition in the code.

After doing the change you proposed our code now looks like:

>---if (dev->features & NETIF_F_HW_QDISC) {
>--->---txq = dev_pick_tx(dev, skb);
>--->---local_bh_disable();
>--->---rc = dev_hard_start_xmit(skb, dev, txq);
>--->---local_bh_enable();
>--->---return rc;
>---}

>---/* Disable soft irqs for various locks below. Also
>--- * stops preemption for RCU.
>--- */
>---rcu_read_lock_bh();

but we still see the issue "BUG: sleeping function called from invalid context":

[   91.015989] BUG: sleeping function called from invalid context at
include/linux/skbuff.h:786
[   91.117096] in_atomic(): 1, irqs_disabled(): 0, pid: 1865, name: NMTX_T1842
[   91.200461] Call Trace:
[   91.229672] [ec58bbd0] [c000789c] show_stack+0x78/0x18c (unreliable)
[   91.305791] [ec58bc10] [c0022900] __might_sleep+0x100/0x118
[   91.372524] [ec58bc20] [c029f8d8] dpa_tx+0x128/0x758
[   91.431957] [ec58bc80] [c02d78ec] dev_hard_start_xmit+0x424/0x588
[   91.504952] [ec58bcc0] [c02d7ab0] dev_queue_xmit+0x60/0x3ac
[   91.571692] [ec58bcf0] [c0338d54] packet_sendmsg+0x8c4/0x988
[   91.639457] [ec58bd70] [c02c3838] sock_sendmsg+0x90/0xb4
[   91.703066] [ec58be40] [c02c4420] sys_sendto+0xdc/0x120
[   91.765646] [ec58bf10] [c02c57d0] sys_socketcall+0x148/0x210
[   91.833420] [ec58bf40] [c001084c] ret_from_syscall+0x0/0x3c
[   91.900153] --- Exception: c01 at 0x4824df00
[   91.900157]     LR = 0x4828a030


The FreeScale patch that introduced this code was created by Andy
Fleming <afleming@freescale.com> (Put in CC).

The purpose of the patch is:
"
Subject: [PATCH] net: Add support for handling queueing in hardware

The QDisc code does a bunch of locking which is unnecessary if
you have hardware which handles all of the queueing.  Add
support for this, and skip over all of the queueing code if
the feature is enabled on a given device.
"

Ronny

^ permalink raw reply

* Re: softirqs are invoked while bottom halves are masked
From: David Miller @ 2011-07-12 12:08 UTC (permalink / raw)
  To: ronny.meeus
  Cc: eric.dumazet, patrickdepinguin+linuxppc, linuxppc-dev, netdev,
	afleming
In-Reply-To: <CAMJ=MEeC1hoqufs7AfFRn3yJoC8mdw7v+14N+7e=wQuJefm4_w@mail.gmail.com>

From: Ronny Meeus <ronny.meeus@gmail.com>
Date: Tue, 12 Jul 2011 14:03:04 +0200

> but we still see the issue "BUG: sleeping function called from invalid context":
> 
> [   91.015989] BUG: sleeping function called from invalid context at
> include/linux/skbuff.h:786
> [   91.117096] in_atomic(): 1, irqs_disabled(): 0, pid: 1865, name: NMTX_T1842
> [   91.200461] Call Trace:
> [   91.229672] [ec58bbd0] [c000789c] show_stack+0x78/0x18c (unreliable)
> [   91.305791] [ec58bc10] [c0022900] __might_sleep+0x100/0x118
> [   91.372524] [ec58bc20] [c029f8d8] dpa_tx+0x128/0x758

Because this dpa driver's transmit method is doing something else that
is not allowed in software interrupt context.

You must remove all things that might sleep in this driver's
->ndo_start_xmit method, and I do mean everything.

^ permalink raw reply

* Re: softirqs are invoked while bottom halves are masked
From: David Miller @ 2011-07-12 12:13 UTC (permalink / raw)
  To: ronny.meeus
  Cc: eric.dumazet, patrickdepinguin+linuxppc, linuxppc-dev, netdev,
	afleming
In-Reply-To: <20110712.050817.1253941735409335652.davem@davemloft.net>

From: David Miller <davem@davemloft.net>
Date: Tue, 12 Jul 2011 05:08:17 -0700 (PDT)

> From: Ronny Meeus <ronny.meeus@gmail.com>
> Date: Tue, 12 Jul 2011 14:03:04 +0200
> 
>> but we still see the issue "BUG: sleeping function called from invalid context":
>> 
>> [   91.015989] BUG: sleeping function called from invalid context at
>> include/linux/skbuff.h:786
>> [   91.117096] in_atomic(): 1, irqs_disabled(): 0, pid: 1865, name: NMTX_T1842
>> [   91.200461] Call Trace:
>> [   91.229672] [ec58bbd0] [c000789c] show_stack+0x78/0x18c (unreliable)
>> [   91.305791] [ec58bc10] [c0022900] __might_sleep+0x100/0x118
>> [   91.372524] [ec58bc20] [c029f8d8] dpa_tx+0x128/0x758
> 
> Because this dpa driver's transmit method is doing something else that
> is not allowed in software interrupt context.
> 
> You must remove all things that might sleep in this driver's
> ->ndo_start_xmit method, and I do mean everything.

Also this whole HW QOS feature bit facility is beyond bogus.

What if the user enables a qdisc that the hardware can't handle, or a
configuration of a hw supported qdisc that the hardware can't support?

What if I have packet classification and packet actions enabled in the
packet scheduler?

These changes are terrible, and we really need you guys to sort out
your problems with these changes yoursleves because your wounds are
entirely self-inflicted and totally not our problem.

^ permalink raw reply

* RE: [PATCH] [v3][net][bna] Fix call trace when interrupts are disabled while sleeping function kzalloc is called
From: Ivan Vecera @ 2011-07-12 12:42 UTC (permalink / raw)
  To: Shyam_Iyer; +Cc: shyam.iyer.t, netdev, rmody, ddutt, huangj, davem
In-Reply-To: <DBFB1B45AF80394ABD1C807E9F28D157063779959F@BLRX7MCDC203.AMER.DELL.COM>

On Fri, 2011-07-08 at 03:23 +0530, Shyam_Iyer@Dell.com wrote:
> 
> > -----Original Message-----
> > From: Ivan Vecera [mailto:ivecera@redhat.com]
> > Sent: Thursday, July 07, 2011 11:14 AM
> > To: Shyam Iyer
> > Cc: netdev@vger.kernel.org; rmody@brocade.com; ddutt@brocade.com;
> > huangj@brocade.com; davem@davemloft.net; Iyer, Shyam
> > Subject: Re: [PATCH] [v3][net][bna] Fix call trace when interrupts are
> > disabled while sleeping function kzalloc is called
> > 
> > Small note, the root of the problem was that non-atomic allocation was
> > requested with IRQs disabled. Your patch description does not contain
> > why were the IRQs disabled.
> > The function bnad_mbox_irq_alloc incorrectly uses 'flags' var for two
> > different things, 1) to save current CPU flags and 2) for request_irq
> > call.
> > First the spin_lock_irqsave disables the IRQs and saves _all_ CPU flags
> > (including one that enables/disables interrupts) to 'flags'. Then the
> > 'flags' is overwritten by 0 or 0x80 (IRQF_SHARED). Finally the
> > spin_unlock_irqrestore should restore saved flags, but these flags are
> > now either 0x00 or 0x80. The interrupt bit value in flags register on
> > x86 arch is 0x100.
> > This means that the interrupt bit is zero (IRQs disabled) after
> > spin_unlock_irqrestore so the request_irq function is called with
> > disabled interrupts.
> 
> That seems to make a lot more sense.. and that way I don't have to initialize the flags variable outside of the spin_lock_irqsave/restore bracket to preserve the irqs not being disabled when allocating in request_irq
> 
> Would the below patch make sense instead...? (Note that since davem has accepted the earlier one this is on top of the already committed patch)
> 
Yes, this is much cleaner...


^ permalink raw reply

* Re: [PATCH v2 07/46] net/wireless: ath9k: fix DMA API usage
From: Felix Fietkau @ 2011-07-12 12:54 UTC (permalink / raw)
  To: Michał Mirosław
  Cc: linux-mips@linux-mips.org, netdev@vger.kernel.org,
	linux-wireless@vger.kernel.org, Ralf Baechle, Jouni Malinen,
	Senthil Balasubramanian, ath9k-devel@lists.ath9k.org,
	Vasanthakumar Thiagarajan
In-Reply-To: <20110712095541.GA6236@rere.qmqm.pl>

On 12.07.2011, at 17:55, Michał Mirosław <mirq-linux@rere.qmqm.pl> wrote:

> On Tue, Jul 12, 2011 at 12:36:06PM +0800, Felix Fietkau wrote:
>> On 2011-07-11 8:52 AM, Michał Mirosław wrote:
>>> Also constify buf_addr for ath9k_hw_process_rxdesc_edma() to verify
>>> assumptions --- dma_sync_single_for_device() call can be removed.
>>> 
>>> Signed-off-by: Michał Mirosław<mirq-linux@rere.qmqm.pl>
>>> ---
>>> drivers/net/wireless/ath/ath9k/ar9003_mac.c |    4 ++--
>>> drivers/net/wireless/ath/ath9k/ar9003_mac.h |    2 +-
>>> drivers/net/wireless/ath/ath9k/recv.c       |   10 +++-------
>>> 3 files changed, 6 insertions(+), 10 deletions(-)
>>> 
>>> diff --git a/drivers/net/wireless/ath/ath9k/recv.c b/drivers/net/wireless/ath/ath9k/recv.c
>>> index 70dc8ec..c5f46d5 100644
>>> --- a/drivers/net/wireless/ath/ath9k/recv.c
>>> +++ b/drivers/net/wireless/ath/ath9k/recv.c
>>> @@ -684,15 +684,11 @@ static bool ath_edma_get_buffers(struct ath_softc *sc,
>>>    BUG_ON(!bf);
>>> 
>>>    dma_sync_single_for_cpu(sc->dev, bf->bf_buf_addr,
>>> -                common->rx_bufsize, DMA_FROM_DEVICE);
>>> +                common->rx_bufsize, DMA_BIDIRECTIONAL);
>>> 
>>>    ret = ath9k_hw_process_rxdesc_edma(ah, NULL, skb->data);
>>> -    if (ret == -EINPROGRESS) {
>>> -        /*let device gain the buffer again*/
>>> -        dma_sync_single_for_device(sc->dev, bf->bf_buf_addr,
>>> -                common->rx_bufsize, DMA_FROM_DEVICE);
>>> +    if (ret == -EINPROGRESS)
>>>        return false;
>>> -    }
>>> 
>>>    __skb_unlink(skb,&rx_edma->rx_fifo);
>>>    if (ret == -EINVAL) {
>> I have strong doubts about this change. On most MIPS devices,
>> dma_sync_single_for_cpu is a no-op, whereas
>> dma_sync_single_for_device flushes the cache range. With this
>> change, the CPU could cache the DMA status part behind skb->data and
>> that cache entry would not be flushed inbetween calls to this
>> functions on the same buffer, likely leading to rx stalls.
> 
> You're suggesting a platform implementation bug then. If the platform is not
> cache-coherent, it should invalidate relevant CPU cache lines for sync_to_cpu
> and unmap cases. Do other devices show such symptoms on MIPS systems?
> 
> I'm not familiar with the platform internals, so we should ask MIPS people.
I only mentioned MIPS to describe the potential side effect of this change. From my current understanding of the DMA API, it would be wrong on other platforms as well. I believe the _for_device function needs to be used to transfer ownership of the buffer back to the device, before calling _for_cpu again later for another read.
This is definitely required in this case, because when the return code is -EINPROGRESS, the driver waits for the hardware to complete this buffer, and the next call has to fetch the memory again after the device has updated it.
This is handled properly by the current code without your change.

- Felix
_______________________________________________
ath9k-devel mailing list
ath9k-devel@lists.ath9k.org
https://lists.ath9k.org/mailman/listinfo/ath9k-devel

^ permalink raw reply

* Re: [ath9k-devel] [PATCH v2 07/46] net/wireless: ath9k: fix DMA API usage
From: Michał Mirosław @ 2011-07-12 13:03 UTC (permalink / raw)
  To: Felix Fietkau
  Cc: Felix Fietkau, netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Jouni Malinen, Senthil Balasubramanian,
	ath9k-devel-xDcbHBWguxHbcTqmT+pZeQ@public.gmane.org,
	Vasanthakumar Thiagarajan, Ralf Baechle,
	linux-mips-6z/3iImG2C8G8FEW9MqTrA@public.gmane.org
In-Reply-To: <B4765EFC-B5C9-4E2D-BE00-ED5519D13A4E-Vt+b4OUoWG0@public.gmane.org>

On Tue, Jul 12, 2011 at 08:54:32PM +0800, Felix Fietkau wrote:
> On 12.07.2011, at 17:55, Michał Mirosław <mirq-linux-CoA6ZxLDdyEEUmgCuDUIdw@public.gmane.org> wrote:
> 
> > On Tue, Jul 12, 2011 at 12:36:06PM +0800, Felix Fietkau wrote:
> >> On 2011-07-11 8:52 AM, Michał Mirosław wrote:
> >>> Also constify buf_addr for ath9k_hw_process_rxdesc_edma() to verify
> >>> assumptions --- dma_sync_single_for_device() call can be removed.
> >>> 
> >>> Signed-off-by: Michał Mirosław<mirq-linux-CoA6ZxLDdyEEUmgCuDUIdw@public.gmane.org>
> >>> ---
> >>> drivers/net/wireless/ath/ath9k/ar9003_mac.c |    4 ++--
> >>> drivers/net/wireless/ath/ath9k/ar9003_mac.h |    2 +-
> >>> drivers/net/wireless/ath/ath9k/recv.c       |   10 +++-------
> >>> 3 files changed, 6 insertions(+), 10 deletions(-)
> >>> 
> >>> diff --git a/drivers/net/wireless/ath/ath9k/recv.c b/drivers/net/wireless/ath/ath9k/recv.c
> >>> index 70dc8ec..c5f46d5 100644
> >>> --- a/drivers/net/wireless/ath/ath9k/recv.c
> >>> +++ b/drivers/net/wireless/ath/ath9k/recv.c
> >>> @@ -684,15 +684,11 @@ static bool ath_edma_get_buffers(struct ath_softc *sc,
> >>>    BUG_ON(!bf);
> >>> 
> >>>    dma_sync_single_for_cpu(sc->dev, bf->bf_buf_addr,
> >>> -                common->rx_bufsize, DMA_FROM_DEVICE);
> >>> +                common->rx_bufsize, DMA_BIDIRECTIONAL);
> >>> 
> >>>    ret = ath9k_hw_process_rxdesc_edma(ah, NULL, skb->data);
> >>> -    if (ret == -EINPROGRESS) {
> >>> -        /*let device gain the buffer again*/
> >>> -        dma_sync_single_for_device(sc->dev, bf->bf_buf_addr,
> >>> -                common->rx_bufsize, DMA_FROM_DEVICE);
> >>> +    if (ret == -EINPROGRESS)
> >>>        return false;
> >>> -    }
> >>> 
> >>>    __skb_unlink(skb,&rx_edma->rx_fifo);
> >>>    if (ret == -EINVAL) {
> >> I have strong doubts about this change. On most MIPS devices,
> >> dma_sync_single_for_cpu is a no-op, whereas
> >> dma_sync_single_for_device flushes the cache range. With this
> >> change, the CPU could cache the DMA status part behind skb->data and
> >> that cache entry would not be flushed inbetween calls to this
> >> functions on the same buffer, likely leading to rx stalls.
> > 
> > You're suggesting a platform implementation bug then. If the platform is not
> > cache-coherent, it should invalidate relevant CPU cache lines for sync_to_cpu
> > and unmap cases. Do other devices show such symptoms on MIPS systems?
> > 
> > I'm not familiar with the platform internals, so we should ask MIPS people.
> I only mentioned MIPS to describe the potential side effect of this change. From my current understanding of the DMA API, it would be wrong on other platforms as well. I believe the _for_device function needs to be used to transfer ownership of the buffer back to the device, before calling _for_cpu again later for another read.

What you're saying reminds the wording in DMA-API-HOWTO.txt that I find
wrong (or at least misleading) compared to what DMA-API.txt describes.
DMA sync calls do not transfer the ownership of the buffer - they are
cache synchronization points, ownership passing is handled entirely by
the driver.

> This is definitely required in this case, because when the return code is -EINPROGRESS, the driver waits for the hardware to complete this buffer, and the next call has to fetch the memory again after the device has updated it.

Correctness of this access should be provided by sync_to_cpu() call.

Best Regards,
Michał Mirosław
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [patch net-next-2.6] net: allow multiple rx_handler registration
From: Jiri Pirko @ 2011-07-12 13:20 UTC (permalink / raw)
  To: David Lamparter
  Cc: netdev, davem, shemminger, kaber, fubar, eric.dumazet,
	nicolas.2p.debian, andy, greearb, mirqus
In-Reply-To: <20110712115422.GD616804@jupiter.n2.diac24.net>

Tue, Jul 12, 2011 at 01:54:22PM CEST, equinox@diac24.net wrote:
>On Tue, Jul 12, 2011 at 01:06:01PM +0200, Jiri Pirko wrote:
>> For some net topos it is necessary to have multiple "soft-net-devices"
>> hooked on one netdev. For example very common is to have
>> eth<->(br+vlan). Vlan is not using rh_handler (yet) but it might be useful
>> for other setups.
>
>I disagree strongly, especially with the use cases you're enabling in
>this patch.
>
>> +	res = netdev_rx_handler_register(slave_dev, &new_slave->rx_handler,
>> +					 bond_handle_frame,
>> +					 RX_HANDLER_PRIO_BOND);
>
>> +	err = netdev_rx_handler_register(dev, &port->rx_handler,
>> +					 macvlan_handle_frame,
>> +					 RX_HANDLER_PRIO_MACVLAN);
>
>> +	err = netdev_rx_handler_register(dev, &p->rx_handler, br_handle_frame,
>> +					 RX_HANDLER_PRIO_BRIDGE);
>
>> +enum rx_handler_prio {
>> +	RX_HANDLER_PRIO_BRIDGE,
>> +	RX_HANDLER_PRIO_BOND,
>> +	RX_HANDLER_PRIO_MACVLAN,
>> +};
>
>These are all incompatible with each other to a varying degree and/or
>don't make much sense. Let's look at them:
>
>a) a device simultaneously being a bridge member and a bond slave
> -> Fully incompatible. Your bonding peer switch will start sending
>    the bridge's packets on other bond member devices.

Not possible. See netdev_set_master(). Anyway, before rx_handler was
introduced, this was possible and no one cared.

>
>b) a device having macvlans and being a bond slave
> -> Fully incompatible. Same as above, packets to the macvlan will end
>    up on other bond member devices.
>
>c) bridge + macvlan
> -> Mostly useless. Add veth/tap devices to your bridge... as a bonus
>    you get a proper MAC table.
>
>This at least needs bonding support removed since bonding is essentially
>incompatible with anything else w/ the same reasoning as above. Bonds
>are as low-level as Pause frames. Never ever touch individual bond
>slaves.
>
>What does make sense is a device being member of multiple bridges, with
>ebtables as solicitor for which bridge gets the packet. But that's not
>possible with your patch...
>+       if (netdev_rx_handler_get_by_prio(dev, prio))
>                return -EBUSY;
>
>I think your idea is good, but it needs WAY more proper consideration.

This patch doen't introduce anything new which wasn't possible before
rx_handler times. Anyway removing bond from using rx_handler as you
suggested pushes us back.

The rationale of this patch is to have all in one place, clean
architecture. The rest of problems, like what can be
used with what in one time etc can be easily sorted out by follow-up
patches.

And to your idea about multi-bridge support, br co needs to be
adjusted as well. And in relation with PRIO, my idea (inspired from RFC
of this patch comments) is to allow users to change priorities
dynamically from userspace. Also then it could be a range of prios for
bridge for example.

Hope that I cleared that out for you.

Jirka

>
>
>-David
>

^ permalink raw reply

* Re: [ath9k-devel] [PATCH v2 07/46] net/wireless: ath9k: fix DMA API usage
From: Felix Fietkau @ 2011-07-12 14:21 UTC (permalink / raw)
  To: Michał Mirosław
  Cc: Felix Fietkau, netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Jouni Malinen, Senthil Balasubramanian,
	ath9k-devel-xDcbHBWguxHbcTqmT+pZeQ@public.gmane.org,
	Vasanthakumar Thiagarajan, Ralf Baechle,
	linux-mips-6z/3iImG2C8G8FEW9MqTrA@public.gmane.org
In-Reply-To: <20110712130316.GA8621-CoA6ZxLDdyEEUmgCuDUIdw@public.gmane.org>

On 12.07.2011, at 21:03, Michał Mirosław <mirq-linux-CoA6ZxLDdyE@public.gmane.orgm.pl> wrote:

> On Tue, Jul 12, 2011 at 08:54:32PM +0800, Felix Fietkau wrote:
>> On 12.07.2011, at 17:55, Michał Mirosław <mirq-linux@rere.qmqm.pl> wrote:
>> 
>>> On Tue, Jul 12, 2011 at 12:36:06PM +0800, Felix Fietkau wrote:
>>>> On 2011-07-11 8:52 AM, Michał Mirosław wrote:
>>>>> Also constify buf_addr for ath9k_hw_process_rxdesc_edma() to verify
>>>>> assumptions --- dma_sync_single_for_device() call can be removed.
>>>>> 
>>>>> Signed-off-by: Michał Mirosław<mirq-linux-CoA6ZxLDdyEEUmgCuDUIdw@public.gmane.org>
>>>>> ---
>>>>> drivers/net/wireless/ath/ath9k/ar9003_mac.c |    4 ++--
>>>>> drivers/net/wireless/ath/ath9k/ar9003_mac.h |    2 +-
>>>>> drivers/net/wireless/ath/ath9k/recv.c       |   10 +++-------
>>>>> 3 files changed, 6 insertions(+), 10 deletions(-)
>>>>> 
>>>>> diff --git a/drivers/net/wireless/ath/ath9k/recv.c b/drivers/net/wireless/ath/ath9k/recv.c
>>>>> index 70dc8ec..c5f46d5 100644
>>>>> --- a/drivers/net/wireless/ath/ath9k/recv.c
>>>>> +++ b/drivers/net/wireless/ath/ath9k/recv.c
>>>>> @@ -684,15 +684,11 @@ static bool ath_edma_get_buffers(struct ath_softc *sc,
>>>>>   BUG_ON(!bf);
>>>>> 
>>>>>   dma_sync_single_for_cpu(sc->dev, bf->bf_buf_addr,
>>>>> -                common->rx_bufsize, DMA_FROM_DEVICE);
>>>>> +                common->rx_bufsize, DMA_BIDIRECTIONAL);
>>>>> 
>>>>>   ret = ath9k_hw_process_rxdesc_edma(ah, NULL, skb->data);
>>>>> -    if (ret == -EINPROGRESS) {
>>>>> -        /*let device gain the buffer again*/
>>>>> -        dma_sync_single_for_device(sc->dev, bf->bf_buf_addr,
>>>>> -                common->rx_bufsize, DMA_FROM_DEVICE);
>>>>> +    if (ret == -EINPROGRESS)
>>>>>       return false;
>>>>> -    }
>>>>> 
>>>>>   __skb_unlink(skb,&rx_edma->rx_fifo);
>>>>>   if (ret == -EINVAL) {
>>>> I have strong doubts about this change. On most MIPS devices,
>>>> dma_sync_single_for_cpu is a no-op, whereas
>>>> dma_sync_single_for_device flushes the cache range. With this
>>>> change, the CPU could cache the DMA status part behind skb->data and
>>>> that cache entry would not be flushed inbetween calls to this
>>>> functions on the same buffer, likely leading to rx stalls.
>>> 
>>> You're suggesting a platform implementation bug then. If the platform is not
>>> cache-coherent, it should invalidate relevant CPU cache lines for sync_to_cpu
>>> and unmap cases. Do other devices show such symptoms on MIPS systems?
>>> 
>>> I'm not familiar with the platform internals, so we should ask MIPS people.
>> I only mentioned MIPS to describe the potential side effect of this change. From my current understanding of the DMA API, it would be wrong on other platforms as well. I believe the _for_device function needs to be used to transfer ownership of the buffer back to the device, before calling _for_cpu again later for another read.
> 
> What you're saying reminds the wording in DMA-API-HOWTO.txt that I find
> wrong (or at least misleading) compared to what DMA-API.txt describes.
> DMA sync calls do not transfer the ownership of the buffer - they are
> cache synchronization points, ownership passing is handled entirely by
> the driver.
What I meant was that the DMA sync calls reflect the ownership transfer of the memory regions. In this case ownership is transferred between device and CPU multiple times and the code reflects that.

> 
>> This is definitely required in this case, because when the return code is -EINPROGRESS, the driver waits for the hardware to complete this buffer, and the next call has to fetch the memory again after the device has updated it.
> 
> Correctness of this access should be provided by sync_to_cpu() call.
At least in MIPS I'm sure it isn't. If I remember correctly, it also isn't on ARM, so I'm pretty sure that either your understanding of the API is incorrect, or arch code does not implement it properly. In either case, this change (and probably also the p54 one) should not be merged.

- Felix--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [patch net-next-2.6] net: allow multiple rx_handler registration
From: David Lamparter @ 2011-07-12 14:29 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: David Lamparter, netdev, davem, shemminger, kaber, fubar,
	eric.dumazet, nicolas.2p.debian, andy, greearb, mirqus
In-Reply-To: <20110712132005.GA2300@minipsycho.brq.redhat.com>

On Tue, Jul 12, 2011 at 03:20:08PM +0200, Jiri Pirko wrote:
> Tue, Jul 12, 2011 at 01:54:22PM CEST, equinox@diac24.net wrote:
> >On Tue, Jul 12, 2011 at 01:06:01PM +0200, Jiri Pirko wrote:
> >> For some net topos it is necessary to have multiple "soft-net-devices"
> >> hooked on one netdev. For example very common is to have
> >> eth<->(br+vlan). Vlan is not using rh_handler (yet) but it might be useful
> >> for other setups.
> >
> >I disagree strongly, especially with the use cases you're enabling in
> >this patch.
> >
> >> +	res = netdev_rx_handler_register(slave_dev, &new_slave->rx_handler,
> >> +					 bond_handle_frame,
> >> +					 RX_HANDLER_PRIO_BOND);
> >
> >> +	err = netdev_rx_handler_register(dev, &port->rx_handler,
> >> +					 macvlan_handle_frame,
> >> +					 RX_HANDLER_PRIO_MACVLAN);
> >
> >> +	err = netdev_rx_handler_register(dev, &p->rx_handler, br_handle_frame,
> >> +					 RX_HANDLER_PRIO_BRIDGE);
> >
> >> +enum rx_handler_prio {
> >> +	RX_HANDLER_PRIO_BRIDGE,
> >> +	RX_HANDLER_PRIO_BOND,
> >> +	RX_HANDLER_PRIO_MACVLAN,
> >> +};
> >
> >These are all incompatible with each other to a varying degree and/or
> >don't make much sense. Let's look at them:
> >
> >a) a device simultaneously being a bridge member and a bond slave
> > -> Fully incompatible. Your bonding peer switch will start sending
> >    the bridge's packets on other bond member devices.
> 
> Not possible. See netdev_set_master(). Anyway, before rx_handler was
> introduced, this was possible and no one cared.

I don't see how this is related. I'm talking about the other end of your
bond. Like for example the 802.3ad capable switch you're bonding to.

> >b) a device having macvlans and being a bond slave
> > -> Fully incompatible. Same as above, packets to the macvlan will end
> >    up on other bond member devices.
> >
> >c) bridge + macvlan
> > -> Mostly useless. Add veth/tap devices to your bridge... as a bonus
> >    you get a proper MAC table.
> >
> >This at least needs bonding support removed since bonding is essentially
> >incompatible with anything else w/ the same reasoning as above. Bonds
> >are as low-level as Pause frames. Never ever touch individual bond
> >slaves.
> >
> >What does make sense is a device being member of multiple bridges, with
> >ebtables as solicitor for which bridge gets the packet. But that's not
> >possible with your patch...
> >+       if (netdev_rx_handler_get_by_prio(dev, prio))
> >                return -EBUSY;
> >
> >I think your idea is good, but it needs WAY more proper consideration.
> 
> This patch doen't introduce anything new which wasn't possible before
> rx_handler times. Anyway removing bond from using rx_handler as you
> suggested pushes us back.

I would actually consider this a regression, if the clashing rx_handler
is the only thing that gets bonding an 'exclusive' hold of the device.

> The rationale of this patch is to have all in one place, clean
> architecture. The rest of problems, like what can be
> used with what in one time etc can be easily sorted out by follow-up
> patches.

Yes, I see what you're trying to do. But if your patch goes back to
allowing broken combinations, I think we need to have those follow-up
patches right here with this patch.

> And to your idea about multi-bridge support, br co needs to be
> adjusted as well. And in relation with PRIO, my idea (inspired from RFC
> of this patch comments) is to allow users to change priorities
> dynamically from userspace. Also then it could be a range of prios for
> bridge for example.

Hoping I can convey my point,


-David


P.S.: Could you please provide some sample usage cases for this feature?

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox