Netdev List
* Re: [PATCH 2/2] packet: Add fanout support.
From: David Miller @ 2011-07-05  7:48 UTC (permalink / raw)
  To: victor; +Cc: eric.dumazet, netdev
In-Reply-To: <4E12B5A6.2020802@inliniac.net>

From: Victor Julien <victor@inliniac.net>
Date: Tue, 05 Jul 2011 08:56:38 +0200

> On 07/05/2011 08:21 AM, Eric Dumazet wrote:
>> On Monday 04 July 2011 at 21:20 -0700, David Miller wrote:
>>> Fanouts allow packet capturing to be demuxed to a set of AF_PACKET
>>> sockets.  Two fanout policies are implemented:
>>>
>>> 1) Hashing based upon skb->rxhash
>> 
>> ...
>> 
>>> +
>>> +static struct sock *fanout_demux_hash(struct packet_fanout *f, struct sk_buff *skb)
>>> +{
>>> +	u32 idx, hash = skb->rxhash;
>>> +
>>> +	idx = ((u64)hash * f->num_members) >> 32;
>>> +
>>> +	return f->arr[idx];
>>> +}
>>> +
>> 
>> rxhash is 0 unless skb_get_rxhash() was called, or some NIC set it in RX
>> path.
>> 
> 
> Is this still also true for IP fragments?

I have a plan to fix this.  But what I've posted will work as you want
it to for everything else.

^ permalink raw reply

* Re: [PATCH 2/2] packet: Add fanout support.
From: David Miller @ 2011-07-05  7:46 UTC (permalink / raw)
  To: eric.dumazet; +Cc: victor, netdev
In-Reply-To: <1309846875.2720.43.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 05 Jul 2011 08:21:15 +0200

> rxhash is 0 unless skb_get_rxhash() was called, or some NIC set it in RX
> path.

CONFIG_RPS is effectively on all the time for SMP builds.

If you want to make it a hard enable in that situation,
I fully support such a change. :-)


^ permalink raw reply

* Re: [PATCH 2/2] packet: Add fanout support.
From: Victor Julien @ 2011-07-05  7:06 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev
In-Reply-To: <1309849214.2720.45.camel@edumazet-laptop>

On 07/05/2011 09:00 AM, Eric Dumazet wrote:
> On Tuesday 05 July 2011 at 08:56 +0200, Victor Julien wrote:
> 
>> Is this still also true for IP fragments?
>>
> 
> This point was already raised. IP fragments have rxhash = 0, obviously,
> since we dont have full information (source / destination ports for
> example)

Sure, just seeing if something was changed here as that wasn't
immediately obvious to me from the code.

-- 
---------------------------------------------
Victor Julien
http://www.inliniac.net/
PGP: http://www.inliniac.net/victorjulien.asc
---------------------------------------------


^ permalink raw reply

* Re: [PATCH 2/2] packet: Add fanout support.
From: Eric Dumazet @ 2011-07-05  7:00 UTC (permalink / raw)
  To: Victor Julien; +Cc: David Miller, netdev
In-Reply-To: <4E12B5A6.2020802@inliniac.net>

On Tuesday 05 July 2011 at 08:56 +0200, Victor Julien wrote:

> Is this still also true for IP fragments?
> 

This point was already raised. IP fragments have rxhash = 0, obviously,
since we don't have full information (source / destination ports, for
example).




^ permalink raw reply

* Re: [PATCH 2/2] packet: Add fanout support.
From: Victor Julien @ 2011-07-05  6:56 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev
In-Reply-To: <1309846875.2720.43.camel@edumazet-laptop>

On 07/05/2011 08:21 AM, Eric Dumazet wrote:
> On Monday 04 July 2011 at 21:20 -0700, David Miller wrote:
>> Fanouts allow packet capturing to be demuxed to a set of AF_PACKET
>> sockets.  Two fanout policies are implemented:
>>
>> 1) Hashing based upon skb->rxhash
> 
> ...
> 
>> +
>> +static struct sock *fanout_demux_hash(struct packet_fanout *f, struct sk_buff *skb)
>> +{
>> +	u32 idx, hash = skb->rxhash;
>> +
>> +	idx = ((u64)hash * f->num_members) >> 32;
>> +
>> +	return f->arr[idx];
>> +}
>> +
> 
> rxhash is 0 unless skb_get_rxhash() was called, or some NIC set it in RX
> path.
> 

Is this still also true for IP fragments?

-- 
---------------------------------------------
Victor Julien
http://www.inliniac.net/
PGP: http://www.inliniac.net/victorjulien.asc
---------------------------------------------


^ permalink raw reply

* Re: [PATCH 2/2] packet: Add fanout support.
From: Eric Dumazet @ 2011-07-05  6:21 UTC (permalink / raw)
  To: David Miller; +Cc: victor, netdev
In-Reply-To: <20110704.212014.236340473910292460.davem@davemloft.net>

On Monday 04 July 2011 at 21:20 -0700, David Miller wrote:
> Fanouts allow packet capturing to be demuxed to a set of AF_PACKET
> sockets.  Two fanout policies are implemented:
> 
> 1) Hashing based upon skb->rxhash

...

> +
> +static struct sock *fanout_demux_hash(struct packet_fanout *f, struct sk_buff *skb)
> +{
> +	u32 idx, hash = skb->rxhash;
> +
> +	idx = ((u64)hash * f->num_members) >> 32;
> +
> +	return f->arr[idx];
> +}
> +

rxhash is 0 unless skb_get_rxhash() was called, or some NIC set it in RX
path.
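
For illustration, the index computation in the quoted fanout_demux_hash()
can be reproduced in user space. This is only a sketch, not kernel code:
fanout_index() is a hypothetical stand-in name. It shows that the
multiply-and-shift scales a 32-bit hash into [0, num_members), and that a
hash of 0 always selects slot 0, so packets without an rxhash would all go
to the first socket in the array.

```c
#include <stdint.h>

/* Hypothetical user-space stand-in for the index computation quoted above.
 * Scales a 32-bit hash into the range [0, num_members) without a modulo:
 * the 64-bit product is at most (2^32 - 1) * num_members, so the upper
 * 32 bits are always smaller than num_members. */
uint32_t fanout_index(uint32_t hash, uint32_t num_members)
{
	return (uint32_t)(((uint64_t)hash * num_members) >> 32);
}
```

With four members, a hash of 0 gives index 0, 0x80000000 gives index 2,
and 0xffffffff gives index 3.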




^ permalink raw reply

* RE: bnx2: FTQ dump on heavy workload(bnx2-2.0.23b + kernel 2.6.32.36)
From: MaoXiaoyun @ 2011-07-05  6:02 UTC (permalink / raw)
  To: mchan, netdev; +Cc: davidch
In-Reply-To: <C27F8246C663564A84BB7AB343977242667C64FA19@IRVEXCHCCR01.corp.ad.broadcom.com>


 
Before trying the debug patch, I plan to run a test with disable_msi=1 first.

Also, is there somewhere I can get the latest PRG document?

Thanks for your help.
 
 

----------------------------------------
> From: mchan@broadcom.com
> To: tinnycloud@hotmail.com; netdev@vger.kernel.org
> CC: davidch@broadcom.com
> Date: Mon, 4 Jul 2011 10:04:25 -0700
> Subject: Re: bnx2: FTQ dump on heavy workload(bnx2-2.0.23b + kernel 2.6.32.36)
>
> MaoXiaoyun wrote:
>
> > Could it be caused by the similar timeout as
> > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-
> > 2.6.git;a=commit;h=c441b8d2cb2194b05550a558d6d95d8944e56a84.
>
> Based on the register dump below, it is not caused by the MSI-X issue.
>
> >
> > Maybe timeout still happens in my test scenerino.
> >
> > Well, from the patch, BNX2_MISC_ECO_HW_CTL is defined 0x000008cc. But I
> > cannot find
> > the defines in programmer reference Guide.(NetXtremeII-PG203-R.pdf).
> > Could some help
> > to point out for me or is the doc is out of date.
>
> I will request the document to be updated to describe that register. We
> are increasing the register read and write timeout value to workaround
> the problem of the MSI-X table being updated while there is a pending
> MSI-X. Without the patch, the write to unmask the MSI-X table entry can
> be dropped by the chip.
>
> >
> > Also, is there a way to comfirm whether the timeout really happen?
> > (which regisiter
> > shall I read?) Or is there a bigger timeout I can set?
>
> Again, the register dump shows that it is not caused by this issue. I'll
> send you some additional debug patch to try to debug the problem.
>
> Thanks.
> >
> > thanks.
> >
> > ----------------------------------------
> > > From: tinnycloud@hotmail.com
> > > To: netdev@vger.kernel.org
> > > Subject: bnx2: FTQ dump on heavy workload(bnx2-2.0.23b + kernel
> > 2.6.32.36)
> > > Date: Mon, 4 Jul 2011 15:40:01 +0800
> > >
> > >
> > > Hi:
> > >
> > > I met bnx2 FTQ dump over and over again during my testing on Xen live
> > migration which generate
> > > heavy network workload.
> > >
> > > I have two physcial machine, both have xen 4.0.1 installed, and
> > kernel 2.6.32.36, bnx2 2.0.23b.
> > > I start 15 Virtual Machines totoally, and doing migration between the
> > host over and over again,
> > > about 16hours, the network will not work, and sometimes, it can reset
> > successfully, sometimes, it
> > > cause kernel crash.
> > >
> > > I've tried debug some, add code in the driver. below is the code when
> > FTQ happened.
> > > It looks like the NIC is stop transmit the packets, and cause
> > timeout.
> > >
> > > BTW, cpu max_cstate=1 in my grub.
> > >
> > > Thanks.
> > >
> > > --------------
> > > static void
> > > bnx2_tx_timeout(struct net_device *dev)
> > > {
> > > struct bnx2 *bp = netdev_priv(dev);
> > > struct bnx2_napi *bnapi = &bp->bnx2_napi[0];
> > > struct bnx2_tx_ring_info *txr = &bnapi->tx_ring;
> > > struct bnx2_rx_ring_info *rxr = &bnapi->rx_ring;
> > > int i ;
> > > bnx2_dump_ftq(bp);
> > > bnx2_dump_state(bp);
> > > if (stop_on_tx_timeout) {
> > > printk(KERN_WARNING PFX
> > > "%s: prevent chip reset during tx timeout\n",
> > > bp->dev->name);
> > > smp_rmb();
> > > printk("last status idx %d \n", bnapi->last_status_idx);
> > > printk("hw_tx_cons %d, txr->hw_tx_conds %d txr->tx_prod %d txr-
> > >tx_cons %d\n",
> > > bnx2_get_hw_tx_cons(bnapi), txr->hw_tx_cons, txr->tx_prod, txr-
> > >tx_cons);
> > > printk("hw_rx_cons %d, txr->hw_rx_conds %d\n",
> > bnx2_get_hw_rx_cons(bnapi), rxr->rx_cons);
> > > printk("sblk->status_attn_bits %d\n",bnapi->status_blk.msi-
> > >status_attn_bits);
> > > printk("sblk->status_attn_bits_ack %d\n",bnapi->status_blk.msi-
> > >status_attn_bits_ack);
> > > printk("bnx2_tx_avail %d \n",(bnx2_tx_avail(bp, txr)));
> > > printk("sblk->status_tx_quick_consumer_index0 %d\n",bnapi-
> > >status_blk.msi->status_tx_quick_consumer_index0);
> > > printk("sblk->status_tx_quick_consumer_index1 %d\n",bnapi-
> > >status_blk.msi->status_tx_quick_consumer_index1);
> > > printk("sblk->status_tx_quick_consumer_index2 %d\n",bnapi-
> > >status_blk.msi->status_tx_quick_consumer_index2);
> > > printk("sblk->status_tx_quick_consumer_index3 %d\n",bnapi-
> > >status_blk.msi->status_tx_quick_consumer_index3);
> > > printk("sblk->status_rx_quick_consumer_index0 %d\n",bnapi-
> > >status_blk.msi->status_rx_quick_consumer_index0);
> > > printk("sblk->status_rx_quick_consumer_index1 %d\n",bnapi-
> > >status_blk.msi->status_rx_quick_consumer_index1);
> > > printk("sblk->status_rx_quick_consumer_index2 %d\n",bnapi-
> > >status_blk.msi->status_rx_quick_consumer_index2);
> > > printk("sblk->status_rx_quick_consumer_index3 %d\n",bnapi-
> > >status_blk.msi->status_rx_quick_consumer_index3);
> > > printk("sblk->status_rx_quick_consumer_index4 %d\n",bnapi-
> > >status_blk.msi->status_rx_quick_consumer_index4);
> > > printk("sblk->status_rx_quick_consumer_index5 %d\n",bnapi-
> > >status_blk.msi->status_rx_quick_consumer_index5);
> > > printk("sblk->status_rx_quick_consumer_index6 %d\n",bnapi-
> > >status_blk.msi->status_rx_quick_consumer_index6);
> > > printk("sblk->status_rx_quick_consumer_index7 %d\n",bnapi-
> > >status_blk.msi->status_rx_quick_consumer_index7);
> > > printk("sblk->status_rx_quick_consumer_index8 %d\n",bnapi-
> > >status_blk.msi->status_rx_quick_consumer_index8);
> > > printk("sblk->status_rx_quick_consumer_index9 %d\n",bnapi-
> > >status_blk.msi->status_rx_quick_consumer_index9);
> > > printk("sblk->status_rx_quick_consumer_index10 %d\n",bnapi-
> > >status_blk.msi->status_rx_quick_consumer_index10);
> > > printk("sblk->status_rx_quick_consumer_index11 %d\n",bnapi-
> > >status_blk.msi->status_rx_quick_consumer_index11);
> > > printk("sblk->status_rx_quick_consumer_index12 %d\n",bnapi-
> > >status_blk.msi->status_rx_quick_consumer_index12);
> > > printk("sblk->status_rx_quick_consumer_index13 %d\n",bnapi-
> > >status_blk.msi->status_rx_quick_consumer_index13);
> > > printk("sblk->status_rx_quick_consumer_index14 %d\n",bnapi-
> > >status_blk.msi->status_rx_quick_consumer_index14);
> > > printk("sblk->status_rx_quick_consumer_index15 %d\n",bnapi-
> > >status_blk.msi->status_rx_quick_consumer_index15);
> > > printk("sblk->status_completion_producer_index %d\n",bnapi-
> > >status_blk.msi->status_completion_producer_index);
> > > printk("sblk->status_cmd_consumer_index %d\n",bnapi->status_blk.msi-
> > >status_cmd_consumer_index);
> > > printk("sblk->status_idx %d\n",bnapi->status_blk.msi->status_idx);
> > > printk("sblk->status_unused %d\n",bnapi->status_blk.msi-
> > >status_unused);
> > > printk("sblk->status_blk_num %d\n",bnapi->status_blk.msi-
> > >status_blk_num);
> > > is_timedout = 1;
> > > for (i = 0; i < bp->irq_nvecs; i++) {
> > > bnapi = &bp->bnx2_napi[i];
> > > bnx2_tx_int(bp, bnapi, 0);
> > > }
> > > return;
> > > }
> > > -----------------
> > >
> > > -------------FTQ log in /var/log/message
> > > ------------[ cut here ]------------
> > > WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x105/0x16a()
> > > Hardware name: Tecal RH2285
> > > Modules linked in: iptable_filter ip_tables nfs fscache nfs_acl
> > auth_rpcgss bridge stp llc autofs4 ipmi_devintf ipmi_si ipmi_msghandler
> > lockd sunrpc ipv6 xenfs dm_multipath fuse xen_netback xen_blkback
> > blktap blkback_pagemap loop nbd video output sbs sbshc parport_pc lp
> > parport snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq
> > snd_seq_device snd_pcm_oss snd_mixer_oss bnx2 serio_raw snd_pcm
> > snd_timer snd soundcore snd_page_alloc i2c_i801 iTCO_wdt
> > iTCO_vendor_support i2c_core pata_acpi ata_generic pcspkr ata_piix
> > shpchp mptsas mptscsih mptbase [last unloaded: freq_table]
> > > Pid: 0, comm: swapper Not tainted 2.6.32.36xen #1
> > > Call Trace:
> > > <IRQ> [<ffffffff813ba154>] ? dev_watchdog+0x105/0x16a
> > > [<ffffffff81056666>] warn_slowpath_common+0x7c/0x94
> > > [<ffffffff81056738>] warn_slowpath_fmt+0xa4/0xa6
> > > [<ffffffff81080bfa>] ? clockevents_program_event+0x78/0x81
> > > [<ffffffff81081fce>] ? tick_program_event+0x2a/0x2c
> > > [<ffffffff813b951d>] ? __netif_tx_lock+0x1b/0x24
> > > [<ffffffff813b95a8>] ? netif_tx_lock+0x46/0x6e
> > > [<ffffffff813a3ed1>] ? netdev_drivername+0x48/0x4f
> > > [<ffffffff813ba154>] dev_watchdog+0x105/0x16a
> > > [<ffffffff81063d98>] run_timer_softirq+0x156/0x1f8
> > > [<ffffffff813ba04f>] ? dev_watchdog+0x0/0x16a
> > > [<ffffffff8105d6f0>] __do_softirq+0xd7/0x19e
> > > [<ffffffff81013eac>] call_softirq+0x1c/0x30
> > > [<ffffffff8101564b>] do_softirq+0x46/0x87
> > > [<ffffffff8105d575>] irq_exit+0x3b/0x7a
> > > [<ffffffff8128dcfe>] xen_evtchn_do_upcall+0x38/0x46
> > > [<ffffffff81013efe>] xen_do_hypervisor_callback+0x1e/0x30
> > > <EOI> [<ffffffff8103f642>] ? pick_next_task_idle+0x18/0x22
> > > [<ffffffff810093aa>] ? hypercall_page+0x3aa/0x1000
> > > [<ffffffff810093aa>] ? hypercall_page+0x3aa/0x1000
> > > [<ffffffff8100f1bb>] ? xen_safe_halt+0x10/0x1a
> > > [<ffffffff81019e14>] ? default_idle+0x39/0x56
> > > [<ffffffff81011cd0>] ? cpu_idle+0x5d/0x8c
> > > [<ffffffff8143375d>] ? cpu_bringup_and_idle+0x13/0x15
> > > ---[ end trace 867bb8f6cd959b03 ]---
> > > bnx2: <--- start FTQ dump on peth0 --->
> > > bnx2: peth0: BNX2_RV2P_PFTQ_CTL 10000
> > > bnx2: peth0: BNX2_RV2P_TFTQ_CTL 20000
> > > bnx2: peth0: BNX2_RV2P_MFTQ_CTL 4000
> > > bnx2: peth0: BNX2_TBDR_FTQ_CTL 1004002
> > > bnx2: peth0: BNX2_TDMA_FTQ_CTL 4010002
> > > bnx2: peth0: BNX2_TXP_FTQ_CTL 2410002
> > > bnx2: peth0: BNX2_TPAT_FTQ_CTL 10002
> > > bnx2: peth0: BNX2_RXP_CFTQ_CTL 8000
> > > bnx2: peth0: BNX2_RXP_FTQ_CTL 100000
> > > bnx2: peth0: BNX2_COM_COMXQ_FTQ_CTL 10000
> > > bnx2: peth0: BNX2_COM_COMTQ_FTQ_CTL 20000
> > > bnx2: peth0: BNX2_COM_COMQ_FTQ_CTL 10000
> > > bnx2: peth0: BNX2_CP_CPQ_FTQ_CTL 4000
> > > bnx2: peth0: TXP mode b84c state 80005000 evt_mask 500 pc 8000d60 pc
> > 8000d60 instr 8f860000
> > > bnx2: peth0: TPAT mode b84c state 80009000 evt_mask 500 pc 8000a5c pc
> > 8000a5c instr 10400016
> > > bnx2: peth0: RXP mode b84c state 80001000 evt_mask 500 pc 8004c14 pc
> > 8004c14 instr 10e00088
> > > bnx2: peth0: COM mode b8cc state 80000000 evt_mask 500 pc 8000b28 pc
> > 8000a9c instr 8c530000
> > > bnx2: peth0: CP mode b8cc state 80000000 evt_mask 500 pc 8000c50 pc
> > 8000c58 instr 8ca50020
> > > bnx2: <--- end FTQ dump on peth0 --->
> > > bnx2: peth0 DEBUG: intr_sem[0]
> > > bnx2: peth0 DEBUG: intr_sem[0] PCI_CMD[20100406]
> > > bnx2: peth0 DEBUG: PCI_PM[19002008] PCI_MISC_CFG[92000088]
> > > bnx2: peth0 DEBUG: EMAC_TX_STATUS[00000008] EMAC_RX_STATUS[00000000]
> > > bnx2: peth0 RPM_MGMT_PKT_CTRL[40000088]
> > > bnx2: peth0 DEBUG: MCP_STATE_P0[0007e10e] MCP_STATE_P1[0003e00e]
> > > bnx2: peth0 DEBUG: HC_STATS_INTERRUPT_STATUS[01ff0000]
> > > bnx2: peth0 DEBUG: PBA[00000000]
> > > BNX2_PCICFG_INT_ACK_CMD[00013ce1]
> > > bnx2: peth0: prevent chip reset during tx timeout
> > > last status idx 2426
> > > hw_tx_cons 32474, txr->hw_tx_conds 32474 txr->tx_prod 32641 txr-
> > >tx_cons 32474
> > > hw_rx_cons 19665, txr->hw_rx_conds 19665
> > > sblk->status_attn_bits 1
> > > sblk->status_attn_bits_ack 1
> > > bnx2_tx_avail 88
> > > sblk->status_tx_quick_consumer_index0 32474
> > > sblk->status_tx_quick_consumer_index1 0
> > > sblk->status_tx_quick_consumer_index2 0
> > > sblk->status_tx_quick_consumer_index3 0
> > > sblk->status_rx_quick_consumer_index0 19665
> > > sblk->status_rx_quick_consumer_index1 0
> > > sblk->status_rx_quick_consumer_index2 0
> > > sblk->status_rx_quick_consumer_index3 0
> > > sblk->status_rx_quick_consumer_index4 0
> > > sblk->status_rx_quick_consumer_index5 0
> > > sblk->status_rx_quick_consumer_index6 0
> > > sblk->status_rx_quick_consumer_index7 0
> > > sblk->status_rx_quick_consumer_index8 0
> > > sblk->status_rx_quick_consumer_index9 0
> > > sblk->status_rx_quick_consumer_index10 0
> > > sblk->status_rx_quick_consumer_index11 0
> > > sblk->status_rx_quick_consumer_index12 0
> > > sblk->status_rx_quick_consumer_index13 0
> > > sblk->status_rx_quick_consumer_index14 0
> > > sblk->status_rx_quick_consumer_index15 0
> > > sblk->status_completion_producer_index 0
> > > sblk->status_cmd_consumer_index 0
> > > sblk->status_idx 2426
> > > sblk->status_unused 0
> > > sblk->status_blk_num 0
> > > hw_cons 32474 sw_cons 32474 ffff8801d27f85c0 bnapi
> > > return hw_cons 32474 sw_cons 32474 ffff8801d27f85c0 bnapi
> > > hw_cons 3628 sw_cons 3625 ffff8801d27f8bc0 bnapi
> > > return hw_cons 3628 sw_cons 3625 ffff8801d27f8bc0 bnapi
> > > hw_cons 62094 sw_cons 62090 ffff8801d27f91c0 bnapi
> > > return hw_cons 62094 sw_cons 62090 ffff8801d27f91c0 bnapi
> > > hw_cons 3184 sw_cons 3173 ffff8801d27f97c0 bnapi
> > > return hw_cons 3184 sw_cons 3173 ffff8801d27f97c0 bnapi
> > > hw_cons 0 sw_cons 0 ffff8801d27f9dc0 bnapi
> > > return hw_cons 0 sw_cons 0 ffff8801d27f9dc0 bnapi
> >
>
>

^ permalink raw reply

* Re: [Bugme-new] [Bug 38102] New: BUG kmalloc-2048: Poison overwritten
From: Eric Dumazet @ 2011-07-05  5:59 UTC (permalink / raw)
  To: Alexey Zaytsev
  Cc: Michael Büsch, Andrew Morton, netdev, Gary Zambrano,
	bugme-daemon, David S. Miller, Pekka Pietikainen,
	Florian Schirmer, Felix Fietkau, Michael Buesch
In-Reply-To: <1309844009.2720.39.camel@edumazet-laptop>

On Tuesday 05 July 2011 at 07:33 +0200, Eric Dumazet wrote:
> On Tuesday 05 July 2011 at 09:18 +0400, Alexey Zaytsev wrote:
> 
> > Actually, I've added a trace to show b44_init_rings and b44_free_rings
> > calls, and they are only called once, right after the driver is
> > loaded. So it can't be related to START_RFO. Will attach the diff and
> > dmesg.
> 
> Thanks
> 
> I was wondering if DMA could be faster if providing word aligned
> addresses, could you try :
> 
> -#define RX_PKT_OFFSET          (RX_HEADER_LEN + 2)
> +#define RX_PKT_OFFSET          (RX_HEADER_LEN + NET_IP_ALIGN)
> 
> (On x86, we now have NET_IP_ALIGN = 0 since commit ea812ca1)
> 

I suspect a hardware bug.

You could force copybreak, so that b44 only touches what is effectively
private memory.

diff --git a/drivers/net/b44.c b/drivers/net/b44.c
index a69331e..62a0599 100644
--- a/drivers/net/b44.c
+++ b/drivers/net/b44.c
@@ -75,7 +75,7 @@
 	  (BP)->tx_cons - (BP)->tx_prod - TX_RING_GAP(BP))
 #define NEXT_TX(N)		(((N) + 1) & (B44_TX_RING_SIZE - 1))
 
-#define RX_PKT_OFFSET		(RX_HEADER_LEN + 2)
+#define RX_PKT_OFFSET		(RX_HEADER_LEN + NET_IP_ALIGN)
 #define RX_PKT_BUF_SZ		(1536 + RX_PKT_OFFSET)
 
 /* minimum number of free TX descriptors required to wake up TX process */
@@ -829,6 +829,7 @@ static int b44_rx(struct b44 *bp, int budget)
 	}
 
 	bp->rx_cons = cons;
+	wmb();
 	bw32(bp, B44_DMARX_PTR, cons * sizeof(struct dma_desc));
 
 	return received;
@@ -848,6 +849,7 @@ static int b44_poll(struct napi_struct *napi, int budget)
 		/* spin_unlock(&bp->tx_lock); */
 	}
 	if (bp->istat & ISTAT_RFO) {	/* fast recovery, in ~20msec */
+		pr_err("b44: ISTAT_RFO !\n");
 		bp->istat &= ~ISTAT_RFO;
 		b44_disable_ints(bp);
 		ssb_device_enable(bp->sdev, 0); /* resets ISTAT_RFO */
@@ -2155,7 +2157,7 @@ static int __devinit b44_init_one(struct ssb_device *sdev,
 	bp = netdev_priv(dev);
 	bp->sdev = sdev;
 	bp->dev = dev;
-	bp->force_copybreak = 0;
+	bp->force_copybreak = 1;
 
 	bp->msg_enable = netif_msg_init(b44_debug, B44_DEF_MSG_ENABLE);
 



^ permalink raw reply related

* Re: [PATCH] net/core: Make urgent data inline by default
From: Esa-Pekka Pyokkimies @ 2011-07-05  5:41 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20110704.163838.1746904172195321123.davem@davemloft.net>

On Tue, 05 Jul 2011 02:38:38 +0300, David Miller <davem@davemloft.net> wrote:

> There is no way we can make this change, we've had the default
> we currently have for 18+ years.  Breaking applications is a
> very real possibility.
>
> It doesn't matter what some RFC says.

I understand. However, the urgent pointer is a very niche feature and I
don't think it would really break much. FTP and telnet both want the
urgent data inline anyway. I haven't found any application that uses the
"1-byte" urgent data, which can by some chance be overwritten by the next
urgent data if you didn't read it in time. The reason I want this change
is that attack detection is very difficult when a byte can be missing
because the URG flag was set, and I think the damage done by crackers
outweighs the damage to applications.
But I guess you decide. At least I tried.

Esa-Pekka
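
For reference, an application that wants urgent data left in the normal
stream can already opt in per socket with the standard SO_OOBINLINE socket
option, independent of the kernel default discussed here. A minimal sketch,
where enable_oob_inline() is a hypothetical helper name:

```c
#include <sys/socket.h>

/* Hypothetical helper: ask the kernel to leave urgent (out-of-band) data
 * in the normal receive stream for this socket.
 * Returns 0 on success, -1 with errno set on failure. */
int enable_oob_inline(int fd)
{
	int on = 1;

	return setsockopt(fd, SOL_SOCKET, SO_OOBINLINE, &on, sizeof(on));
}
```

The helper would be called once on a TCP socket, before data is read.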

^ permalink raw reply

* A GRO question
From: Li Yu @ 2011-07-05  5:41 UTC (permalink / raw)
  To: netdev@vger.kernel.org

Hi,

	I have a question about the GRO implementation that confuses me.

	In compare_ether_header(), which is called from __napi_gro_receive(),
we assume that NAPI_GRO_CB(skb)->frag0 starts with a MAC/L2 header.

	However, further along, in dev_gro_receive() -> ptype->gro_receive
[inet_gro_receive], we use the same address as the IPv4/L3 header, as below:

        off = skb_gro_offset(skb); // should still be zero at this point, as I understand it
        hlen = off + sizeof(*iph);      
        iph = skb_gro_header_fast(skb, off);  //just return NAPI_GRO_CB(skb)->frag0 + 0

	So did we forget to update NAPI_GRO_CB(skb)->data_offset here, or am I
missing something?

	Also, from my reading of the igb source code, if rx_ring->rx_buffer_len < 1024
(if we use a large MTU), the igb driver uses header split mode. In that case the
MAC header should be saved in skb->data by skb_put(skb, igb_get_hlen(rx_ring, rx_desc)),
and the rest of the data is loaded by the skb_fill_page_desc() call below it, so
NAPI_GRO_CB(skb)->frag0 should start with the L3 header.

	Thanks.

Yu

^ permalink raw reply

* Re: [Bugme-new] [Bug 38102] New: BUG kmalloc-2048: Poison overwritten
From: Eric Dumazet @ 2011-07-05  5:33 UTC (permalink / raw)
  To: Alexey Zaytsev
  Cc: Michael Büsch, Andrew Morton, netdev, Gary Zambrano,
	bugme-daemon, David S. Miller, Pekka Pietikainen,
	Florian Schirmer, Felix Fietkau, Michael Buesch
In-Reply-To: <CAB9v_DEiPFzQt-TgeVC3r3Y7YFwApLK_NHkDahFOKpibtABrZg@mail.gmail.com>

On Tuesday 05 July 2011 at 09:18 +0400, Alexey Zaytsev wrote:

> Actually, I've added a trace to show b44_init_rings and b44_free_rings
> calls, and they are only called once, right after the driver is
> loaded. So it can't be related to START_RFO. Will attach the diff and
> dmesg.

Thanks

I was wondering whether DMA could be faster when given word-aligned
addresses. Could you try:

-#define RX_PKT_OFFSET          (RX_HEADER_LEN + 2)
+#define RX_PKT_OFFSET          (RX_HEADER_LEN + NET_IP_ALIGN)

(On x86, we now have NET_IP_ALIGN = 0 since commit ea812ca1)
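
To make the alignment trade-off concrete, here is a small user-space
sketch of where the frame and the IP header end up for each pad value.
It assumes RX_HEADER_LEN is 28 (a value not stated in this thread), a
14-byte Ethernet header, and a receive buffer that itself starts on a
4-byte boundary. Both helper names are hypothetical.

```c
#define ETH_HEADER_LEN 14u /* standard Ethernet header size */

/* Offset of the first frame byte modulo 4, for a given RX header length
 * and pad placed in front of the frame. 0 means word aligned. */
unsigned int frame_misalignment(unsigned int rx_header_len, unsigned int pad)
{
	return (rx_header_len + pad) % 4u;
}

/* Offset of the IP header (just after the Ethernet header) modulo 4. */
unsigned int ip_header_misalignment(unsigned int rx_header_len, unsigned int pad)
{
	return (rx_header_len + pad + ETH_HEADER_LEN) % 4u;
}
```

Under these assumptions, a pad of 2 puts the frame at offset 30 (not word
aligned) and the IP header at offset 44 (word aligned). A pad of 0 does the
opposite: the frame starts at 28 (aligned) and the IP header at 42 (not
aligned).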




^ permalink raw reply

* Re: [Bugme-new] [Bug 38102] New: BUG kmalloc-2048: Poison overwritten
From: Alexey Zaytsev @ 2011-07-05  5:18 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Michael Büsch, Andrew Morton, netdev, Gary Zambrano,
	bugme-daemon, David S. Miller, Pekka Pietikainen,
	Florian Schirmer, Felix Fietkau, Michael Buesch
In-Reply-To: <1309842642.2720.36.camel@edumazet-laptop>

On Tue, Jul 5, 2011 at 09:10, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tuesday 05 July 2011 at 08:57 +0400, Alexey Zaytsev wrote:
>
>> Ran tcpdump. You are right, I was wrong. Sorry for the noise.
>
> Thanks for testing ;)
>
> It would be nice to know if the memory scribbles start after or before
> one RFO triggers.
>
> I can see this calls b44_init_rings() without really stopping the device
> before. This seems very suspect to me.
>

Actually, I've added a trace to show b44_init_rings and b44_free_rings
calls, and they are only called once, right after the driver is
loaded. So it can't be related to START_RFO. Will attach the diff and
dmesg.

^ permalink raw reply

* Re: [Bugme-new] [Bug 38102] New: BUG kmalloc-2048: Poison overwritten
From: Eric Dumazet @ 2011-07-05  5:10 UTC (permalink / raw)
  To: Alexey Zaytsev
  Cc: Michael Büsch, Andrew Morton, netdev, Gary Zambrano,
	bugme-daemon, David S. Miller, Pekka Pietikainen,
	Florian Schirmer, Felix Fietkau, Michael Buesch
In-Reply-To: <CAB9v_DFM8PcujUB-YzcK49DS7T6Bz2FLDtkVdEYt8an1oPYVFw@mail.gmail.com>

On Tuesday 05 July 2011 at 08:57 +0400, Alexey Zaytsev wrote:

> Ran tcpdump. You are right, I was wrong. Sorry for the noise.

Thanks for testing ;)

It would be nice to know whether the memory scribbles start before or
after an RFO triggers.

I can see that the RFO recovery path calls b44_init_rings() without
really stopping the device first. This seems very suspect to me.



diff --git a/drivers/net/b44.c b/drivers/net/b44.c
index a69331e..b22dd4c 100644
--- a/drivers/net/b44.c
+++ b/drivers/net/b44.c
@@ -829,6 +829,7 @@ static int b44_rx(struct b44 *bp, int budget)
 	}
 
 	bp->rx_cons = cons;
+	wmb();
 	bw32(bp, B44_DMARX_PTR, cons * sizeof(struct dma_desc));
 
 	return received;
@@ -848,6 +849,7 @@ static int b44_poll(struct napi_struct *napi, int budget)
 		/* spin_unlock(&bp->tx_lock); */
 	}
 	if (bp->istat & ISTAT_RFO) {	/* fast recovery, in ~20msec */
+		pr_err("b44: ISTAT_RFO !\n");
 		bp->istat &= ~ISTAT_RFO;
 		b44_disable_ints(bp);
 		ssb_device_enable(bp->sdev, 0); /* resets ISTAT_RFO */
 



^ permalink raw reply related

* Re: [Bugme-new] [Bug 38102] New: BUG kmalloc-2048: Poison overwritten
From: Alexey Zaytsev @ 2011-07-05  4:57 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Michael Büsch, Andrew Morton, netdev, Gary Zambrano,
	bugme-daemon, David S. Miller, Pekka Pietikainen,
	Florian Schirmer, Felix Fietkau, Michael Buesch
In-Reply-To: <1309840708.2720.31.camel@edumazet-laptop>

On Tue, Jul 5, 2011 at 08:38, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tuesday 05 July 2011 at 08:29 +0400, Alexey Zaytsev wrote:
>> On Tue, Jul 5, 2011 at 08:25, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> > On Tuesday 05 July 2011 at 08:17 +0400, Alexey Zaytsev wrote:
>> >> >
>> >> Check out starting at packet 302893. 383 _identical_ ACKs were sent
>> >> out by the b44 machine within 30 milliseconds.
>> >
>> >
>> > As I said, b44 driver lost at least 200 consecutive frames (source says
>> > recovery takes about 20 ms)
>> >
>> > TCP then do its normal job.
>> >
>>
>> From my understanding, after a frame is lost, TCP would be waiting for
>> a retransmit. Or at least, it would not be sending 400 duplicate ACKs
>> for the single last frame received, right? Let me run tcpdump on the
>> b44 side now. I'm quite sure I won't see any ACK dups leaving the
>> stack.
>
> Wow, I believe you are on a wrong track. Honestly.
>
> Try to unpplug the wire for 100ms, and watch your "duplicate acks
> disease".
>
> Thats exactly what is happening with b44 driver doing a "fast recovery"
> right now.
>
> Thats a moot point. Running tcpdump on your b44 machine will kill your
> performance even more, it wont solve the b44 bug.
>
> If you prefer to 'fix tcp', please open another thread.

Ran tcpdump. You are right, I was wrong. Sorry for the noise.

^ permalink raw reply

* Re: [PATCH] greth: greth_set_mac_add would corrupt the MAC address.
From: David Miller @ 2011-07-05  4:39 UTC (permalink / raw)
  To: kristoffer; +Cc: netdev
In-Reply-To: <1309770483-16026-1-git-send-email-kristoffer@gaisler.com>

From: Kristoffer Glembo <kristoffer@gaisler.com>
Date: Mon,  4 Jul 2011 11:08:03 +0200

> The MAC address was set using the signed char sockaddr->sa_addr
> field and thus the address could be corrupted through sign extension.
> 
> Signed-off-by: Kristoffer Glembo <kristoffer@gaisler.com>

Applied, thanks!
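
As an illustration of the sign-extension problem described in the commit
message, the sketch below packs two address bytes into a register-sized
value, once through a signed char pointer and once through an unsigned
one. It is not the greth driver code; pack_signed() and pack_unsigned()
are hypothetical names used only for this example.

```c
#include <stdint.h>

/* Buggy variant: a byte >= 0x80 read as signed char is negative, and
 * converting it to a 32-bit value yields 0xFFFFFFxx, which then
 * overwrites the higher address bits when ORed in. */
uint32_t pack_signed(const signed char *a)
{
	return ((uint32_t)a[0] << 8) | (uint32_t)a[1];
}

/* Fixed variant: unsigned bytes are zero-extended, so each byte only
 * occupies its own 8 bits. */
uint32_t pack_unsigned(const unsigned char *a)
{
	return ((uint32_t)a[0] << 8) | (uint32_t)a[1];
}
```

For the byte pair 0x00, 0x80 the signed variant produces 0xFFFFFF80,
while the unsigned variant produces the intended 0x0080.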

^ permalink raw reply

* Re: [Bugme-new] [Bug 38102] New: BUG kmalloc-2048: Poison overwritten
From: Eric Dumazet @ 2011-07-05  4:38 UTC (permalink / raw)
  To: Alexey Zaytsev
  Cc: Michael Büsch, Andrew Morton, netdev, Gary Zambrano,
	bugme-daemon, David S. Miller, Pekka Pietikainen,
	Florian Schirmer, Felix Fietkau, Michael Buesch
In-Reply-To: <CAB9v_DEOhZh37aqx1qrLnrz5+tqjcjgBx-DP6M_0NkygZ1LjcQ@mail.gmail.com>

On Tuesday 05 July 2011 at 08:29 +0400, Alexey Zaytsev wrote:
> On Tue, Jul 5, 2011 at 08:25, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Tuesday 05 July 2011 at 08:17 +0400, Alexey Zaytsev wrote:
> >> >
> >> Check out starting at packet 302893. 383 _identical_ ACKs were sent
> >> out by the b44 machine within 30 milliseconds.
> >
> >
> > As I said, b44 driver lost at least 200 consecutive frames (source says
> > recovery takes about 20 ms)
> >
> > TCP then do its normal job.
> >
> 
> From my understanding, after a frame is lost, TCP would be waiting for
> a retransmit. Or at least, it would not be sending 400 duplicate ACKs
> for the single last frame received, right? Let me run tcpdump on the
> b44 side now. I'm quite sure I won't see any ACK dups leaving the
> stack.

Wow, I believe you are on the wrong track. Honestly.

Try to unplug the wire for 100ms, and watch your "duplicate acks
disease".

That's exactly what is happening with the b44 driver doing a "fast recovery"
right now.

That's a moot point. Running tcpdump on your b44 machine will kill your
performance even more; it won't solve the b44 bug.

If you prefer to 'fix tcp', please open another thread.




^ permalink raw reply

* Re: [PATCH] net: bind() fix error return on wrong address family
From: David Miller @ 2011-07-05  4:38 UTC (permalink / raw)
  To: meissner
  Cc: kuznet, pekkas, jmorris, yoshfuji, kaber, netdev, linux-kernel,
	meissner, max
In-Reply-To: <1309779029-15403-1-git-send-email-meissner@novell.com>

From: Marcus Meissner <meissner@novell.com>
Date: Mon,  4 Jul 2011 13:30:29 +0200

> Reinhard Max also pointed out that the error should EAFNOSUPPORT according
> to POSIX.
> 
> The Linux manpages have it as EINVAL, some other OSes (Minix, HPUX, perhaps BSD) use
> EAFNOSUPPORT. Windows uses WSAEFAULT according to MSDN.
> 
> Other protocols error values in their af bind() methods in current mainline git as far
> as a brief look shows:
> 	EAFNOSUPPORT: atm, appletalk, l2tp, llc, phonet, rxrpc
> 	EINVAL: ax25, bluetooth, decnet, econet, ieee802154, iucv, netlink, netrom, packet, rds, rose, unix, x25, 
> 	No check?: can/raw, ipv6/raw, irda, l2tp/l2tp_ip
> 
> Signed-off-by: Marcus Meissner <meissner@suse.de>
> Cc: Reinhard Max <max@suse.de>

Applied to net-2.6, thanks.
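
A user-space probe along the following lines can show which error a given
system reports. This is a sketch under assumptions: it binds an AF_INET
datagram socket with an address whose family field is deliberately wrong,
and bind_wrong_family_errno() is a hypothetical helper name. Whether the
result is EAFNOSUPPORT or EINVAL depends on the kernel in use.

```c
#include <errno.h>
#include <string.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Bind an AF_INET socket using an address whose family field is wrong.
 * Returns the errno reported by bind(), 0 if bind() unexpectedly
 * succeeded, or -1 if the socket could not be created. */
int bind_wrong_family_errno(void)
{
	struct sockaddr_in sin;
	int fd, err = 0;

	fd = socket(AF_INET, SOCK_DGRAM, 0);
	if (fd < 0)
		return -1;

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET6; /* deliberately not AF_INET */
	sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

	if (bind(fd, (struct sockaddr *)&sin, sizeof(sin)) < 0)
		err = errno; /* save before close() can change it */

	close(fd);
	return err;
}
```

Comparing the return value against EAFNOSUPPORT and EINVAL shows which
convention the running kernel follows.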

^ permalink raw reply

* Re: [PATCH 2/2] packet: Add fanout support.
From: David Miller @ 2011-07-05  4:36 UTC (permalink / raw)
  To: eric.dumazet; +Cc: victor, netdev
In-Reply-To: <1309840429.2720.26.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 05 Jul 2011 06:33:49 +0200

> On Monday 04 July 2011 at 21:20 -0700, David Miller wrote:
>> +#define PACKET_FANOUT_MAX	2048
 ...
>> +	struct sock		*arr[PACKET_FANOUT_MAX];
> 
> Thats about 16Kbytes, yet you use kzalloc()
> 
>> +	spinlock_t		lock;
>> +	atomic_t		sk_ref;
>> +	struct packet_type	prot_hook ____cacheline_aligned_in_smp;
>> +};
>> +
> 
> Maybe use a dynamic array ? I suspect most uses wont even reach 16
> sockets anyway...

True.  Another option, for now, is to just make PACKET_FANOUT_MAX more
reasonable.  I'll make it something like 256.
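
For scale, here is a rough sketch of the arithmetic. The helper name
fanout_arr_bytes() is hypothetical and not part of the patch; it only
counts the pointer array, not the rest of struct packet_fanout.  With
8-byte pointers, 2048 slots come to 16384 bytes and 256 slots to 2048.

```c
#include <stddef.h>

/* Hypothetical helper: bytes taken by an array of socket pointers
 * for a given member limit (array only, not the whole struct).
 */
size_t fanout_arr_bytes(size_t max_members)
{
	return max_members * sizeof(void *);
}
```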

Thanks!


^ permalink raw reply

* Re: [PATCH 2/2] packet: Add fanout support.
From: Eric Dumazet @ 2011-07-05  4:33 UTC (permalink / raw)
  To: David Miller; +Cc: victor, netdev
In-Reply-To: <20110704.212014.236340473910292460.davem@davemloft.net>

On Monday, 4 July 2011 at 21:20 -0700, David Miller wrote:
> Fanouts allow packet capturing to be demuxed to a set of AF_PACKET
> sockets.  Two fanout policies are implemented:
> 
> 1) Hashing based upon skb->rxhash
> 
> 2) Pure round-robin
> 
> An AF_PACKET socket must be fully bound before it tries to add itself
> to a fanout.  All AF_PACKET sockets trying to join the same fanout
> must all have the same bind settings.
> 
> Fanouts are identified (within a network namespace) by a 16-bit ID.
> The first socket to try to add itself to a fanout with a particular
> ID, creates that fanout.  When the last socket leaves the fanout
> (which happens only when the socket is closed), that fanout is
> destroyed.
> 
> Signed-off-by: David S. Miller <davem@davemloft.net>
> ---
>  include/linux/if_packet.h |    4 +
>  net/packet/af_packet.c    |  250 ++++++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 249 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/if_packet.h b/include/linux/if_packet.h
> index 7b31863..1efa1cb 100644
> --- a/include/linux/if_packet.h
> +++ b/include/linux/if_packet.h
> @@ -49,6 +49,10 @@ struct sockaddr_ll {
>  #define PACKET_VNET_HDR			15
>  #define PACKET_TX_TIMESTAMP		16
>  #define PACKET_TIMESTAMP		17
> +#define PACKET_FANOUT			18
> +
> +#define PACKET_FANOUT_HASH		0
> +#define PACKET_FANOUT_LB		1
>  
>  struct tpacket_stats {
>  	unsigned int	tp_packets;
> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> index bb281bf..7db1e12 100644
> --- a/net/packet/af_packet.c
> +++ b/net/packet/af_packet.c
> @@ -187,9 +187,11 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg);
>  
>  static void packet_flush_mclist(struct sock *sk);
>  
> +struct packet_fanout;
>  struct packet_sock {
>  	/* struct sock has to be the first member of packet_sock */
>  	struct sock		sk;
> +	struct packet_fanout	*fanout;
>  	struct tpacket_stats	stats;
>  	struct packet_ring_buffer	rx_ring;
>  	struct packet_ring_buffer	tx_ring;
> @@ -212,6 +214,24 @@ struct packet_sock {
>  	struct packet_type	prot_hook ____cacheline_aligned_in_smp;
>  };
>  
> +#define PACKET_FANOUT_MAX	2048
> +
> +struct packet_fanout {
> +#ifdef CONFIG_NET_NS
> +	struct net		*net;
> +#endif
> +	int			num_members;
> +	u16			id;
> +	u8			type;
> +	u8			pad;
> +	atomic_t		rr_cur;
> +	struct list_head	list;
> +	struct sock		*arr[PACKET_FANOUT_MAX];

That's about 16 Kbytes, yet you use kzalloc()

> +	spinlock_t		lock;
> +	atomic_t		sk_ref;
> +	struct packet_type	prot_hook ____cacheline_aligned_in_smp;
> +};
> +

Maybe use a dynamic array? I suspect most uses won't even reach 16
sockets anyway...


^ permalink raw reply

* Re: [Bugme-new] [Bug 38102] New: BUG kmalloc-2048: Poison overwritten
From: Alexey Zaytsev @ 2011-07-05  4:29 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Michael Büsch, Andrew Morton, netdev, Gary Zambrano,
	bugme-daemon, David S. Miller, Pekka Pietikainen,
	Florian Schirmer, Felix Fietkau, Michael Buesch
In-Reply-To: <1309839928.2720.23.camel@edumazet-laptop>

On Tue, Jul 5, 2011 at 08:25, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tuesday, 5 July 2011 at 08:17 +0400, Alexey Zaytsev wrote:
>> >
>> Check out starting at packet 302893. 383 _identical_ ACKs were sent
>> out by the b44 machine within 30 milliseconds.
>
>
> As I said, b44 driver lost at least 200 consecutive frames (source says
> recovery takes about 20 ms)
>
> TCP then does its normal job.
>

From my understanding, after a frame is lost, TCP would be waiting for
a retransmit. Or at least, it would not be sending 400 duplicate ACKs
for the single last frame received, right? Let me run tcpdump on the
b44 side now. I'm quite sure I won't see any ACK dups leaving the
stack.

^ permalink raw reply

* Re: [Bugme-new] [Bug 38102] New: BUG kmalloc-2048: Poison overwritten
From: Eric Dumazet @ 2011-07-05  4:25 UTC (permalink / raw)
  To: Alexey Zaytsev
  Cc: Michael Büsch, Andrew Morton, netdev, Gary Zambrano,
	bugme-daemon, David S. Miller, Pekka Pietikainen,
	Florian Schirmer, Felix Fietkau, Michael Buesch
In-Reply-To: <CAB9v_DGSFAG9V0jqem+tDP3G-N8v6Z+_6oKdPwL-ZwhfhCOZnw@mail.gmail.com>

On Tuesday, 5 July 2011 at 08:17 +0400, Alexey Zaytsev wrote:
> >
> Check out starting at packet 302893. 383 _identical_ ACKs were sent
> out by the b44 machine within 30 milliseconds.


As I said, b44 driver lost at least 200 consecutive frames (source says
recovery takes about 20 ms)

TCP then does its normal job.




^ permalink raw reply

* Re: [Bugme-new] [Bug 38102] New: BUG kmalloc-2048: Poison overwritten
From: Eric Dumazet @ 2011-07-05  4:21 UTC (permalink / raw)
  To: Alexey Zaytsev
  Cc: Michael Büsch, Andrew Morton, netdev, Gary Zambrano,
	bugme-daemon, David S. Miller, Pekka Pietikainen,
	Florian Schirmer, Felix Fietkau, Michael Buesch
In-Reply-To: <1309837443.2720.8.camel@edumazet-laptop>

On Tuesday, 5 July 2011 at 05:44 +0200, Eric Dumazet wrote:

> Maybe we should do instead a fast dequeue of packets (recycling them
> instead of pushing them to upper stack) in case too many packets are
> ready to be delivered, and always make sure NIC has a reserve of
> available buffers for DMA accesses, before it can assert ISTAT_RFO
> 
> 

Another way would be to add Explicit Congestion Notification when too
many packets are received in a burst, but unfortunately not enough TCP
flows are ECN ready :)




^ permalink raw reply

* [PATCH 2/2] packet: Add fanout support.
From: David Miller @ 2011-07-05  4:20 UTC (permalink / raw)
  To: victor; +Cc: netdev


Fanouts allow packet capturing to be demuxed to a set of AF_PACKET
sockets.  Two fanout policies are implemented:

1) Hashing based upon skb->rxhash

2) Pure round-robin

An AF_PACKET socket must be fully bound before it tries to add itself
to a fanout.  All AF_PACKET sockets trying to join the same fanout
must all have the same bind settings.

Fanouts are identified (within a network namespace) by a 16-bit ID.
The first socket to try to add itself to a fanout with a particular
ID, creates that fanout.  When the last socket leaves the fanout
(which happens only when the socket is closed), that fanout is
destroyed.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 include/linux/if_packet.h |    4 +
 net/packet/af_packet.c    |  250 ++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 249 insertions(+), 5 deletions(-)

diff --git a/include/linux/if_packet.h b/include/linux/if_packet.h
index 7b31863..1efa1cb 100644
--- a/include/linux/if_packet.h
+++ b/include/linux/if_packet.h
@@ -49,6 +49,10 @@ struct sockaddr_ll {
 #define PACKET_VNET_HDR			15
 #define PACKET_TX_TIMESTAMP		16
 #define PACKET_TIMESTAMP		17
+#define PACKET_FANOUT			18
+
+#define PACKET_FANOUT_HASH		0
+#define PACKET_FANOUT_LB		1
 
 struct tpacket_stats {
 	unsigned int	tp_packets;
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index bb281bf..7db1e12 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -187,9 +187,11 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg);
 
 static void packet_flush_mclist(struct sock *sk);
 
+struct packet_fanout;
 struct packet_sock {
 	/* struct sock has to be the first member of packet_sock */
 	struct sock		sk;
+	struct packet_fanout	*fanout;
 	struct tpacket_stats	stats;
 	struct packet_ring_buffer	rx_ring;
 	struct packet_ring_buffer	tx_ring;
@@ -212,6 +214,24 @@ struct packet_sock {
 	struct packet_type	prot_hook ____cacheline_aligned_in_smp;
 };
 
+#define PACKET_FANOUT_MAX	2048
+
+struct packet_fanout {
+#ifdef CONFIG_NET_NS
+	struct net		*net;
+#endif
+	int			num_members;
+	u16			id;
+	u8			type;
+	u8			pad;
+	atomic_t		rr_cur;
+	struct list_head	list;
+	struct sock		*arr[PACKET_FANOUT_MAX];
+	spinlock_t		lock;
+	atomic_t		sk_ref;
+	struct packet_type	prot_hook ____cacheline_aligned_in_smp;
+};
+
 struct packet_skb_cb {
 	unsigned int origlen;
 	union {
@@ -227,6 +247,9 @@ static inline struct packet_sock *pkt_sk(struct sock *sk)
 	return (struct packet_sock *)sk;
 }
 
+static void __fanout_unlink(struct sock *sk, struct packet_sock *po);
+static void __fanout_link(struct sock *sk, struct packet_sock *po);
+
 /* register_prot_hook must be invoked with the po->bind_lock held,
  * or from a context in which asynchronous accesses to the packet
  * socket is not possible (packet_create()).
@@ -235,7 +258,10 @@ static void register_prot_hook(struct sock *sk)
 {
 	struct packet_sock *po = pkt_sk(sk);
 	if (!po->running) {
-		dev_add_pack(&po->prot_hook);
+		if (po->fanout)
+			__fanout_link(sk, po);
+		else
+			dev_add_pack(&po->prot_hook);
 		sock_hold(sk);
 		po->running = 1;
 	}
@@ -253,7 +279,10 @@ static void __unregister_prot_hook(struct sock *sk, bool sync)
 	struct packet_sock *po = pkt_sk(sk);
 
 	po->running = 0;
-	__dev_remove_pack(&po->prot_hook);
+	if (po->fanout)
+		__fanout_unlink(sk, po);
+	else
+		__dev_remove_pack(&po->prot_hook);
 	__sock_put(sk);
 
 	if (sync) {
@@ -388,6 +417,195 @@ static void packet_sock_destruct(struct sock *sk)
 	sk_refcnt_debug_dec(sk);
 }
 
+static int fanout_rr_next(struct packet_fanout *f)
+{
+	int x = atomic_read(&f->rr_cur) + 1;
+
+	if (x >= f->num_members)
+		x = 0;
+
+	return x;
+}
+
+static struct sock *fanout_demux_hash(struct packet_fanout *f, struct sk_buff *skb)
+{
+	u32 idx, hash = skb->rxhash;
+
+	idx = ((u64)hash * f->num_members) >> 32;
+
+	return f->arr[idx];
+}
+
+static struct sock *fanout_demux_lb(struct packet_fanout *f, struct sk_buff *skb)
+{
+	int cur, old;
+
+	cur = atomic_read(&f->rr_cur);
+	while ((old = atomic_cmpxchg(&f->rr_cur, cur,
+				     fanout_rr_next(f))) != cur)
+		cur = old;
+	return f->arr[cur];
+}
+
+static int packet_rcv_fanout_hash(struct sk_buff *skb, struct net_device *dev,
+				  struct packet_type *pt, struct net_device *orig_dev)
+{
+	struct packet_fanout *f = pt->af_packet_priv;
+	struct packet_sock *po;
+	struct sock *sk;
+
+	if (!net_eq(dev_net(dev), read_pnet(&f->net))) {
+		kfree_skb(skb);
+		return 0;
+	}
+
+	sk = fanout_demux_hash(f, skb);
+	po = pkt_sk(sk);
+
+	return po->prot_hook.func(skb, dev, &po->prot_hook, orig_dev);
+}
+
+static int packet_rcv_fanout_lb(struct sk_buff *skb, struct net_device *dev,
+				struct packet_type *pt, struct net_device *orig_dev)
+{
+	struct packet_fanout *f = pt->af_packet_priv;
+	struct packet_sock *po;
+	struct sock *sk;
+
+	if (!net_eq(dev_net(dev), read_pnet(&f->net))) {
+		kfree_skb(skb);
+		return 0;
+	}
+
+	sk = fanout_demux_lb(f, skb);
+	po = pkt_sk(sk);
+
+	return po->prot_hook.func(skb, dev, &po->prot_hook, orig_dev);
+}
+
+static DEFINE_MUTEX(fanout_mutex);
+static LIST_HEAD(fanout_list);
+
+static void __fanout_link(struct sock *sk, struct packet_sock *po)
+{
+	struct packet_fanout *f = po->fanout;
+
+	spin_lock(&f->lock);
+	f->arr[f->num_members] = sk;
+	smp_wmb();
+	f->num_members++;
+	spin_unlock(&f->lock);
+}
+
+static void __fanout_unlink(struct sock *sk, struct packet_sock *po)
+{
+	struct packet_fanout *f = po->fanout;
+	int i;
+
+	spin_lock(&f->lock);
+	for (i = 0; i < f->num_members; i++) {
+		if (f->arr[i] == sk)
+			break;
+	}
+	BUG_ON(i >= f->num_members);
+	f->arr[i] = f->arr[f->num_members - 1];
+	f->num_members--;
+	spin_unlock(&f->lock);
+}
+
+static int fanout_add(struct sock *sk, u16 id, u8 type)
+{
+	struct packet_sock *po = pkt_sk(sk);
+	struct packet_fanout *f, *match;
+	int err;
+
+	switch (type) {
+	case PACKET_FANOUT_HASH:
+	case PACKET_FANOUT_LB:
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	if (!po->running)
+		return -EINVAL;
+
+	if (po->fanout)
+		return -EALREADY;
+
+	mutex_lock(&fanout_mutex);
+	match = NULL;
+	list_for_each_entry(f, &fanout_list, list) {
+		if (f->id == id &&
+		    read_pnet(&f->net) == sock_net(sk)) {
+			match = f;
+			break;
+		}
+	}
+	if (!match) {
+		match = kzalloc(sizeof(*match), GFP_KERNEL);
+		if (match) {
+			write_pnet(&match->net, sock_net(sk));
+			match->id = id;
+			match->type = type;
+			atomic_set(&match->rr_cur, 0);
+			INIT_LIST_HEAD(&match->list);
+			spin_lock_init(&match->lock);
+			atomic_set(&match->sk_ref, 0);
+			match->prot_hook.type = po->prot_hook.type;
+			match->prot_hook.dev = po->prot_hook.dev;
+			switch (type) {
+			case PACKET_FANOUT_HASH:
+				match->prot_hook.func = packet_rcv_fanout_hash;
+				break;
+			case PACKET_FANOUT_LB:
+				match->prot_hook.func = packet_rcv_fanout_lb;
+				break;
+			}
+			match->prot_hook.af_packet_priv = match;
+			dev_add_pack(&match->prot_hook);
+			list_add(&match->list, &fanout_list);
+		}
+	}
+	err = -ENOMEM;
+	if (match) {
+		err = -EINVAL;
+		if (match->type == type &&
+		    match->prot_hook.type == po->prot_hook.type &&
+		    match->prot_hook.dev == po->prot_hook.dev) {
+			err = -ENOSPC;
+			if (atomic_read(&match->sk_ref) < PACKET_FANOUT_MAX) {
+				__dev_remove_pack(&po->prot_hook);
+				po->fanout = match;
+				atomic_inc(&match->sk_ref);
+				__fanout_link(sk, po);
+				err = 0;
+			}
+		}
+	}
+	mutex_unlock(&fanout_mutex);
+	return err;
+}
+
+static void fanout_release(struct sock *sk)
+{
+	struct packet_sock *po = pkt_sk(sk);
+	struct packet_fanout *f;
+
+	f = po->fanout;
+	if (!f)
+		return;
+
+	po->fanout = NULL;
+
+	mutex_lock(&fanout_mutex);
+	if (atomic_dec_and_test(&f->sk_ref)) {
+		list_del(&f->list);
+		dev_remove_pack(&f->prot_hook);
+		kfree(f);
+	}
+	mutex_unlock(&fanout_mutex);
+}
 
 static const struct proto_ops packet_ops;
 
@@ -1398,6 +1616,8 @@ static int packet_release(struct socket *sock)
 	if (po->tx_ring.pg_vec)
 		packet_set_ring(sk, &req, 1, 1);
 
+	fanout_release(sk);
+
 	synchronize_net();
 	/*
 	 *	Now the socket is dead. No more input will appear.
@@ -1421,9 +1641,9 @@ static int packet_release(struct socket *sock)
 static int packet_do_bind(struct sock *sk, struct net_device *dev, __be16 protocol)
 {
 	struct packet_sock *po = pkt_sk(sk);
-	/*
-	 *	Detach an existing hook if present.
-	 */
+
+	if (po->fanout)
+		return -EINVAL;
 
 	lock_sock(sk);
 
@@ -2133,6 +2353,17 @@ packet_setsockopt(struct socket *sock, int level, int optname, char __user *optv
 		po->tp_tstamp = val;
 		return 0;
 	}
+	case PACKET_FANOUT:
+	{
+		int val;
+
+		if (optlen != sizeof(val))
+			return -EINVAL;
+		if (copy_from_user(&val, optval, sizeof(val)))
+			return -EFAULT;
+
+		return fanout_add(sk, val & 0xffff, val >> 16);
+	}
 	default:
 		return -ENOPROTOOPT;
 	}
@@ -2231,6 +2462,15 @@ static int packet_getsockopt(struct socket *sock, int level, int optname,
 		val = po->tp_tstamp;
 		data = &val;
 		break;
+	case PACKET_FANOUT:
+		if (len > sizeof(int))
+			len = sizeof(int);
+		val = (po->fanout ?
+		       ((u32)po->fanout->id |
+			((u32)po->fanout->type << 16)) :
+		       0);
+		data = &val;
+		break;
 	default:
 		return -ENOPROTOOPT;
 	}
-- 
1.7.5.4


^ permalink raw reply related

* [PATCH 1/2] packet: Add helpers to register/unregister ->prot_hook
From: David Miller @ 2011-07-05  4:20 UTC (permalink / raw)
  To: victor; +Cc: netdev


Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/packet/af_packet.c |  103 +++++++++++++++++++++++++++--------------------
 1 files changed, 59 insertions(+), 44 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 461b16f..bb281bf 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -222,6 +222,55 @@ struct packet_skb_cb {
 
 #define PACKET_SKB_CB(__skb)	((struct packet_skb_cb *)((__skb)->cb))
 
+static inline struct packet_sock *pkt_sk(struct sock *sk)
+{
+	return (struct packet_sock *)sk;
+}
+
+/* register_prot_hook must be invoked with the po->bind_lock held,
+ * or from a context in which asynchronous accesses to the packet
+ * socket is not possible (packet_create()).
+ */
+static void register_prot_hook(struct sock *sk)
+{
+	struct packet_sock *po = pkt_sk(sk);
+	if (!po->running) {
+		dev_add_pack(&po->prot_hook);
+		sock_hold(sk);
+		po->running = 1;
+	}
+}
+
+/* {,__}unregister_prot_hook() must be invoked with the po->bind_lock
+ * held.   If the sync parameter is true, we will temporarily drop
+ * the po->bind_lock and do a synchronize_net to make sure no
+ * asynchronous packet processing paths still refer to the elements
+ * of po->prot_hook.  If the sync parameter is false, it is the
+ * callers responsibility to take care of this.
+ */
+static void __unregister_prot_hook(struct sock *sk, bool sync)
+{
+	struct packet_sock *po = pkt_sk(sk);
+
+	po->running = 0;
+	__dev_remove_pack(&po->prot_hook);
+	__sock_put(sk);
+
+	if (sync) {
+		spin_unlock(&po->bind_lock);
+		synchronize_net();
+		spin_lock(&po->bind_lock);
+	}
+}
+
+static void unregister_prot_hook(struct sock *sk, bool sync)
+{
+	struct packet_sock *po = pkt_sk(sk);
+
+	if (po->running)
+		__unregister_prot_hook(sk, sync);
+}
+
 static inline __pure struct page *pgv_to_page(void *addr)
 {
 	if (is_vmalloc_addr(addr))
@@ -324,11 +373,6 @@ static inline void packet_increment_head(struct packet_ring_buffer *buff)
 	buff->head = buff->head != buff->frame_max ? buff->head+1 : 0;
 }
 
-static inline struct packet_sock *pkt_sk(struct sock *sk)
-{
-	return (struct packet_sock *)sk;
-}
-
 static void packet_sock_destruct(struct sock *sk)
 {
 	skb_queue_purge(&sk->sk_error_queue);
@@ -1337,15 +1381,7 @@ static int packet_release(struct socket *sock)
 	spin_unlock_bh(&net->packet.sklist_lock);
 
 	spin_lock(&po->bind_lock);
-	if (po->running) {
-		/*
-		 * Remove from protocol table
-		 */
-		po->running = 0;
-		po->num = 0;
-		__dev_remove_pack(&po->prot_hook);
-		__sock_put(sk);
-	}
+	unregister_prot_hook(sk, false);
 	if (po->prot_hook.dev) {
 		dev_put(po->prot_hook.dev);
 		po->prot_hook.dev = NULL;
@@ -1392,15 +1428,7 @@ static int packet_do_bind(struct sock *sk, struct net_device *dev, __be16 protoc
 	lock_sock(sk);
 
 	spin_lock(&po->bind_lock);
-	if (po->running) {
-		__sock_put(sk);
-		po->running = 0;
-		po->num = 0;
-		spin_unlock(&po->bind_lock);
-		dev_remove_pack(&po->prot_hook);
-		spin_lock(&po->bind_lock);
-	}
-
+	unregister_prot_hook(sk, true);
 	po->num = protocol;
 	po->prot_hook.type = protocol;
 	if (po->prot_hook.dev)
@@ -1413,9 +1441,7 @@ static int packet_do_bind(struct sock *sk, struct net_device *dev, __be16 protoc
 		goto out_unlock;
 
 	if (!dev || (dev->flags & IFF_UP)) {
-		dev_add_pack(&po->prot_hook);
-		sock_hold(sk);
-		po->running = 1;
+		register_prot_hook(sk);
 	} else {
 		sk->sk_err = ENETDOWN;
 		if (!sock_flag(sk, SOCK_DEAD))
@@ -1542,9 +1568,7 @@ static int packet_create(struct net *net, struct socket *sock, int protocol,
 
 	if (proto) {
 		po->prot_hook.type = proto;
-		dev_add_pack(&po->prot_hook);
-		sock_hold(sk);
-		po->running = 1;
+		register_prot_hook(sk);
 	}
 
 	spin_lock_bh(&net->packet.sklist_lock);
@@ -2240,9 +2264,7 @@ static int packet_notifier(struct notifier_block *this, unsigned long msg, void
 			if (dev->ifindex == po->ifindex) {
 				spin_lock(&po->bind_lock);
 				if (po->running) {
-					__dev_remove_pack(&po->prot_hook);
-					__sock_put(sk);
-					po->running = 0;
+					__unregister_prot_hook(sk, false);
 					sk->sk_err = ENETDOWN;
 					if (!sock_flag(sk, SOCK_DEAD))
 						sk->sk_error_report(sk);
@@ -2259,11 +2281,8 @@ static int packet_notifier(struct notifier_block *this, unsigned long msg, void
 		case NETDEV_UP:
 			if (dev->ifindex == po->ifindex) {
 				spin_lock(&po->bind_lock);
-				if (po->num && !po->running) {
-					dev_add_pack(&po->prot_hook);
-					sock_hold(sk);
-					po->running = 1;
-				}
+				if (po->num)
+					register_prot_hook(sk);
 				spin_unlock(&po->bind_lock);
 			}
 			break;
@@ -2530,10 +2549,8 @@ static int packet_set_ring(struct sock *sk, struct tpacket_req *req,
 	was_running = po->running;
 	num = po->num;
 	if (was_running) {
-		__dev_remove_pack(&po->prot_hook);
 		po->num = 0;
-		po->running = 0;
-		__sock_put(sk);
+		__unregister_prot_hook(sk, false);
 	}
 	spin_unlock(&po->bind_lock);
 
@@ -2564,11 +2581,9 @@ static int packet_set_ring(struct sock *sk, struct tpacket_req *req,
 	mutex_unlock(&po->pg_vec_lock);
 
 	spin_lock(&po->bind_lock);
-	if (was_running && !po->running) {
-		sock_hold(sk);
-		po->running = 1;
+	if (was_running) {
 		po->num = num;
-		dev_add_pack(&po->prot_hook);
+		register_prot_hook(sk);
 	}
 	spin_unlock(&po->bind_lock);
 
-- 
1.7.5.4


^ permalink raw reply related

* [PATCH 0/2] AF_PACKET fanout support
From: David Miller @ 2011-07-05  4:20 UTC (permalink / raw)
  To: victor; +Cc: netdev


This is a fully functional version, I've tested both hash and
load-balance modes successfully.  I plan to commit this to
net-next-2.6 very soon.

Below is a test program that other people can play with
if they want.  It basically creates 4 threads, and creates
an AF_PACKET fanout amongst them.  Each thread prints out
its pid in parentheses every time it receives 10 packets.
After each thread processes 10,000 packets, it exits.

Try things like "./test eth0 hash", "./test eth0 lb", etc.
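
As a rough user-space illustration of what the two modes do (a sketch
only, not the kernel code): "hash" scales a 32-bit hash into the range
[0, num_members) with a multiply and shift, and "lb" hands out slots in
round-robin order using a compare-and-swap loop.  The function names
demux_hash_index() and demux_rr_index() are made up for this example,
C11 atomics stand in for the kernel's atomic_t, and num_members is
assumed to be non-zero.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hash mode: map a 32-bit hash onto [0, num_members) without a modulo. */
unsigned int demux_hash_index(uint32_t hash, unsigned int num_members)
{
	return (unsigned int)(((uint64_t)hash * num_members) >> 32);
}

/* Round-robin mode: return the current slot and advance the cursor.
 * On a failed compare-and-swap, cur is reloaded with the value another
 * thread stored, and the next slot is recomputed from it.
 */
unsigned int demux_rr_index(atomic_uint *cursor, unsigned int num_members)
{
	unsigned int cur = atomic_load(cursor);
	unsigned int next;

	do {
		next = (cur + 1 >= num_members) ? 0 : cur + 1;
	} while (!atomic_compare_exchange_weak(cursor, &cur, next));

	return cur;
}
```

A hash of 0 always selects slot 0, so packets that carry no hash all
land on the first member.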

Signed-off-by: David S. Miller <davem@davemloft.net>

--------------------
#include <stddef.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

#include <sys/types.h>
#include <sys/wait.h>
#include <sys/socket.h>
#include <sys/ioctl.h>

#include <unistd.h>

#include <linux/if_ether.h>
#include <linux/if_packet.h>

#include <net/if.h>
#include <arpa/inet.h>

static const char *device_name;
static int fanout_type;
static int fanout_id;

#ifndef PACKET_FANOUT
#define PACKET_FANOUT		18
#define PACKET_FANOUT_HASH		0
#define PACKET_FANOUT_LB		1
#endif

/* Returns a bound socket that has joined the fanout, or -1 on failure.
 * (Returning EXIT_FAILURE, i.e. 1, would look like a valid descriptor
 * to the "fd < 0" check in the caller.)
 */
static int setup_socket(void)
{
	int err, fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_IP));
	struct sockaddr_ll ll;
	struct ifreq ifr;
	int fanout_arg;

	if (fd < 0) {
		perror("socket");
		return -1;
	}

	/* ifr is zeroed first, so the name stays NUL-terminated. */
	memset(&ifr, 0, sizeof(ifr));
	strncpy(ifr.ifr_name, device_name, IFNAMSIZ - 1);
	err = ioctl(fd, SIOCGIFINDEX, &ifr);
	if (err < 0) {
		perror("SIOCGIFINDEX");
		close(fd);
		return -1;
	}

	memset(&ll, 0, sizeof(ll));
	ll.sll_family = AF_PACKET;
	ll.sll_ifindex = ifr.ifr_ifindex;
	err = bind(fd, (struct sockaddr *) &ll, sizeof(ll));
	if (err < 0) {
		perror("bind");
		close(fd);
		return -1;
	}

	/* Low 16 bits: fanout ID.  Bits 16 and up: fanout type. */
	fanout_arg = (fanout_id | (fanout_type << 16));
	err = setsockopt(fd, SOL_PACKET, PACKET_FANOUT,
			 &fanout_arg, sizeof(fanout_arg));
	if (err) {
		perror("setsockopt");
		close(fd);
		return -1;
	}

	return fd;
}

static void fanout_thread(void)
{
	int fd = setup_socket();
	int limit = 10000;

	if (fd < 0)
		exit(EXIT_FAILURE);

	while (limit-- > 0) {
		char buf[1600];
		int err;

		err = read(fd, buf, sizeof(buf));
		if (err < 0) {
			perror("read");
			exit(EXIT_FAILURE);
		}
		if ((limit % 10) == 0)
			fprintf(stdout, "(%d) \n", getpid());
	}

	fprintf(stdout, "%d: Received 10000 packets\n", getpid());

	close(fd);
	exit(0);
}

int main(int argc, char **argp)
{
	int fd, err;
	int i;

	if (argc != 3) {
		fprintf(stderr, "Usage: %s INTERFACE {hash|lb}\n", argp[0]);
		return EXIT_FAILURE;
	}

	if (!strcmp(argp[2], "hash"))
		fanout_type = PACKET_FANOUT_HASH;
	else if (!strcmp(argp[2], "lb"))
		fanout_type = PACKET_FANOUT_LB;
	else {
		fprintf(stderr, "Unknown fanout type [%s]\n", argp[2]);
		exit(EXIT_FAILURE);
	}

	device_name = argp[1];
	fanout_id = getpid() & 0xffff;

	for (i = 0; i < 4; i++) {
		pid_t pid = fork();

		switch (pid) {
		case 0:
			fanout_thread();

		case -1:
			perror("fork");
			exit(EXIT_FAILURE);
		}
	}

	for (i = 0; i < 4; i++) {
		int status;

		wait(&status);
	}

	return 0;
}
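
The PACKET_FANOUT option value built in setup_socket() carries the
fanout ID in the low 16 bits and the fanout type above them, and the
kernel side splits it again with "val & 0xffff" and "val >> 16".  A
small sketch of that round trip follows; the helper names are
hypothetical and exist only for this illustration.

```c
#include <stdint.h>

/* Build the option value: ID in bits 0-15, type in bits 16 and up. */
int fanout_pack(uint16_t id, uint8_t type)
{
	return (int)id | ((int)type << 16);
}

/* Recover the 16-bit fanout ID, as the setsockopt handler does. */
uint16_t fanout_unpack_id(int val)
{
	return (uint16_t)(val & 0xffff);
}

/* Recover the fanout type, truncated to 8 bits as in fanout_add(). */
uint8_t fanout_unpack_type(int val)
{
	return (uint8_t)(val >> 16);
}
```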

^ permalink raw reply

