* Re: [PATCH 2/2] packet: Add fanout support.
From: David Miller @ 2011-07-05 7:48 UTC
To: victor; +Cc: eric.dumazet, netdev
In-Reply-To: <4E12B5A6.2020802@inliniac.net>
From: Victor Julien <victor@inliniac.net>
Date: Tue, 05 Jul 2011 08:56:38 +0200
> On 07/05/2011 08:21 AM, Eric Dumazet wrote:
>> On Monday, 04 July 2011 at 21:20 -0700, David Miller wrote:
>>> Fanouts allow packet capturing to be demuxed to a set of AF_PACKET
>>> sockets. Two fanout policies are implemented:
>>>
>>> 1) Hashing based upon skb->rxhash
>>
>> ...
>>
>>> +
>>> +static struct sock *fanout_demux_hash(struct packet_fanout *f, struct sk_buff *skb)
>>> +{
>>> + u32 idx, hash = skb->rxhash;
>>> +
>>> + idx = ((u64)hash * f->num_members) >> 32;
>>> +
>>> + return f->arr[idx];
>>> +}
>>> +
>>
>> rxhash is 0 unless skb_get_rxhash() was called, or some NIC set it in RX
>> path.
>>
>
> Is this still also true for IP fragments?
I have a plan to fix this. But what I've posted will work as you want
it to for everything else.
* Re: [PATCH 2/2] packet: Add fanout support.
From: David Miller @ 2011-07-05 7:46 UTC
To: eric.dumazet; +Cc: victor, netdev
In-Reply-To: <1309846875.2720.43.camel@edumazet-laptop>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 05 Jul 2011 08:21:15 +0200
> rxhash is 0 unless skb_get_rxhash() was called, or some NIC set it in RX
> path.
CONFIG_RPS is effectively on all the time for SMP builds.
If you want to make it a hard enable in that situation,
I fully support such a change. :-)
* Re: [PATCH 2/2] packet: Add fanout support.
From: Victor Julien @ 2011-07-05 7:06 UTC
To: Eric Dumazet; +Cc: David Miller, netdev
In-Reply-To: <1309849214.2720.45.camel@edumazet-laptop>
On 07/05/2011 09:00 AM, Eric Dumazet wrote:
> On Tuesday, 05 July 2011 at 08:56 +0200, Victor Julien wrote:
>
>> Is this still also true for IP fragments?
>>
>
> This point was already raised. IP fragments have rxhash = 0, obviously,
> since we don't have full information (source / destination ports, for
> example).
Sure, just seeing if something was changed here as that wasn't
immediately obvious to me from the code.
--
---------------------------------------------
Victor Julien
http://www.inliniac.net/
PGP: http://www.inliniac.net/victorjulien.asc
---------------------------------------------
* Re: [PATCH 2/2] packet: Add fanout support.
From: Eric Dumazet @ 2011-07-05 7:00 UTC
To: Victor Julien; +Cc: David Miller, netdev
In-Reply-To: <4E12B5A6.2020802@inliniac.net>
On Tuesday, 05 July 2011 at 08:56 +0200, Victor Julien wrote:
> Is this still also true for IP fragments?
>
This point was already raised. IP fragments have rxhash = 0, obviously,
since we don't have full information (source / destination ports, for
example).
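To illustrate the point about fragments, here is a hedged user-space sketch of a flow hash that gives up on fragmented IPv4 packets, since only the first fragment carries the L4 ports. The names sketch_flow_hash and is_ip_fragment are hypothetical, the SKETCH_ constants mirror the IPv4 "more fragments" flag and fragment-offset mask, and the mixing step is a placeholder rather than the kernel's actual hash.

```c
#include <stdint.h>

#define SKETCH_IP_MF     0x2000u /* IPv4 "more fragments" flag */
#define SKETCH_IP_OFFSET 0x1fffu /* IPv4 fragment offset mask */

/* frag_off is the IPv4 flags/fragment-offset field in host byte order. */
int is_ip_fragment(uint16_t frag_off)
{
	return (frag_off & (SKETCH_IP_MF | SKETCH_IP_OFFSET)) != 0;
}

/* Hypothetical flow hash: returns 0 ("no hash") for any fragment. */
uint32_t sketch_flow_hash(uint32_t saddr, uint32_t daddr,
			  uint16_t sport, uint16_t dport, uint16_t frag_off)
{
	uint32_t h;

	if (is_ip_fragment(frag_off))
		return 0; /* ports are not available on every fragment */

	/* Placeholder mixing, NOT the kernel's hash function. */
	h = saddr ^ daddr ^ (((uint32_t)sport << 16) | dport);
	h ^= h >> 16;
	h *= 0x45d9f3bu;
	h ^= h >> 16;
	return h ? h : 1; /* keep 0 reserved for "no hash" */
}
```

Under this model every fragment of a flow shares the hash value 0, regardless of which flow it belongs to.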
* Re: [PATCH 2/2] packet: Add fanout support.
From: Victor Julien @ 2011-07-05 6:56 UTC
To: Eric Dumazet; +Cc: David Miller, netdev
In-Reply-To: <1309846875.2720.43.camel@edumazet-laptop>
On 07/05/2011 08:21 AM, Eric Dumazet wrote:
> On Monday, 04 July 2011 at 21:20 -0700, David Miller wrote:
>> Fanouts allow packet capturing to be demuxed to a set of AF_PACKET
>> sockets. Two fanout policies are implemented:
>>
>> 1) Hashing based upon skb->rxhash
>
> ...
>
>> +
>> +static struct sock *fanout_demux_hash(struct packet_fanout *f, struct sk_buff *skb)
>> +{
>> + u32 idx, hash = skb->rxhash;
>> +
>> + idx = ((u64)hash * f->num_members) >> 32;
>> +
>> + return f->arr[idx];
>> +}
>> +
>
> rxhash is 0 unless skb_get_rxhash() was called, or some NIC set it in RX
> path.
>
Is this still also true for IP fragments?
--
---------------------------------------------
Victor Julien
http://www.inliniac.net/
PGP: http://www.inliniac.net/victorjulien.asc
---------------------------------------------
* Re: [PATCH 2/2] packet: Add fanout support.
From: Eric Dumazet @ 2011-07-05 6:21 UTC
To: David Miller; +Cc: victor, netdev
In-Reply-To: <20110704.212014.236340473910292460.davem@davemloft.net>
On Monday, 04 July 2011 at 21:20 -0700, David Miller wrote:
> Fanouts allow packet capturing to be demuxed to a set of AF_PACKET
> sockets. Two fanout policies are implemented:
>
> 1) Hashing based upon skb->rxhash
...
> +
> +static struct sock *fanout_demux_hash(struct packet_fanout *f, struct sk_buff *skb)
> +{
> + u32 idx, hash = skb->rxhash;
> +
> + idx = ((u64)hash * f->num_members) >> 32;
> +
> + return f->arr[idx];
> +}
> +
rxhash is 0 unless skb_get_rxhash() was called, or some NIC set it in RX
path.
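A minimal user-space sketch of the index computation in the quoted fanout_demux_hash() shows why an unset hash matters: with a hash of 0 the multiply-and-shift always yields index 0, so every such packet would go to the first socket in the fanout. The function name fanout_index and the standalone types are illustrative and not part of the patch.

```c
#include <stdint.h>

/* Map a 32-bit hash onto [0, num_members) without a division, using the
 * same multiply-and-shift as the quoted fanout_demux_hash(). */
uint32_t fanout_index(uint32_t hash, uint32_t num_members)
{
	return (uint32_t)(((uint64_t)hash * num_members) >> 32);
}
```

For example, with four members a hash of 0x80000000 selects index 2, while a hash of 0 selects index 0 for any member count.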
* RE: bnx2: FTQ dump on heavy workload(bnx2-2.0.23b + kernel 2.6.32.36)
From: MaoXiaoyun @ 2011-07-05 6:02 UTC
To: mchan, netdev; +Cc: davidch
In-Reply-To: <C27F8246C663564A84BB7AB343977242667C64FA19@IRVEXCHCCR01.corp.ad.broadcom.com>
Before trying the debug patch, I plan to run a test with disable_msi=1 first.
Also, is there a place where I can get the latest PRG document?
Thanks for your help.
----------------------------------------
> From: mchan@broadcom.com
> To: tinnycloud@hotmail.com; netdev@vger.kernel.org
> CC: davidch@broadcom.com
> Date: Mon, 4 Jul 2011 10:04:25 -0700
> Subject: Re: bnx2: FTQ dump on heavy workload(bnx2-2.0.23b + kernel 2.6.32.36)
>
> MaoXiaoyun wrote:
>
> > Could it be caused by the similar timeout as
> > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=c441b8d2cb2194b05550a558d6d95d8944e56a84.
>
> Based on the register dump below, it is not caused by the MSI-X issue.
>
> >
> > Maybe the timeout still happens in my test scenario.
> >
> > Well, from the patch, BNX2_MISC_ECO_HW_CTL is defined as 0x000008cc. But I
> > cannot find the definition in the Programmer's Reference Guide
> > (NetXtremeII-PG203-R.pdf). Could someone help point it out for me, or is
> > the document out of date?
>
> I will request the document to be updated to describe that register. We
> are increasing the register read and write timeout value to workaround
> the problem of the MSI-X table being updated while there is a pending
> MSI-X. Without the patch, the write to unmask the MSI-X table entry can
> be dropped by the chip.
>
> >
> > Also, is there a way to confirm whether the timeout really happens
> > (which register shall I read)? Or is there a bigger timeout I can set?
>
> Again, the register dump shows that it is not caused by this issue. I'll
> send you some additional debug patch to try to debug the problem.
>
> Thanks.
> >
> > thanks.
> >
> > ----------------------------------------
> > > From: tinnycloud@hotmail.com
> > > To: netdev@vger.kernel.org
> > > Subject: bnx2: FTQ dump on heavy workload(bnx2-2.0.23b + kernel 2.6.32.36)
> > > Date: Mon, 4 Jul 2011 15:40:01 +0800
> > >
> > >
> > > Hi:
> > >
> > > I hit the bnx2 FTQ dump over and over again while testing Xen live
> > > migration, which generates a heavy network workload.
> > >
> > > I have two physical machines, both with Xen 4.0.1, kernel 2.6.32.36,
> > > and bnx2 2.0.23b installed. I start 15 virtual machines in total and
> > > migrate them between the hosts over and over again. After about
> > > 16 hours the network stops working; sometimes it resets
> > > successfully, and sometimes it causes a kernel crash.
> > >
> > > I tried some debugging by adding code to the driver. Below is the code
> > > that runs when the FTQ dump happens. It looks like the NIC stops
> > > transmitting packets, which causes the timeout.
> > >
> > > BTW, cpu max_cstate=1 in my grub.
> > >
> > > Thanks.
> > >
> > > --------------
> > > static void
> > > bnx2_tx_timeout(struct net_device *dev)
> > > {
> > > struct bnx2 *bp = netdev_priv(dev);
> > > struct bnx2_napi *bnapi = &bp->bnx2_napi[0];
> > > struct bnx2_tx_ring_info *txr = &bnapi->tx_ring;
> > > struct bnx2_rx_ring_info *rxr = &bnapi->rx_ring;
> > > int i ;
> > > bnx2_dump_ftq(bp);
> > > bnx2_dump_state(bp);
> > > if (stop_on_tx_timeout) {
> > > printk(KERN_WARNING PFX
> > > "%s: prevent chip reset during tx timeout\n",
> > > bp->dev->name);
> > > smp_rmb();
> > > printk("last status idx %d \n", bnapi->last_status_idx);
> > > printk("hw_tx_cons %d, txr->hw_tx_conds %d txr->tx_prod %d txr->tx_cons %d\n",
> > > bnx2_get_hw_tx_cons(bnapi), txr->hw_tx_cons, txr->tx_prod, txr->tx_cons);
> > > printk("hw_rx_cons %d, txr->hw_rx_conds %d\n", bnx2_get_hw_rx_cons(bnapi), rxr->rx_cons);
> > > printk("sblk->status_attn_bits %d\n",bnapi->status_blk.msi->status_attn_bits);
> > > printk("sblk->status_attn_bits_ack %d\n",bnapi->status_blk.msi->status_attn_bits_ack);
> > > printk("bnx2_tx_avail %d \n",(bnx2_tx_avail(bp, txr)));
> > > printk("sblk->status_tx_quick_consumer_index0 %d\n",bnapi->status_blk.msi->status_tx_quick_consumer_index0);
> > > printk("sblk->status_tx_quick_consumer_index1 %d\n",bnapi->status_blk.msi->status_tx_quick_consumer_index1);
> > > printk("sblk->status_tx_quick_consumer_index2 %d\n",bnapi->status_blk.msi->status_tx_quick_consumer_index2);
> > > printk("sblk->status_tx_quick_consumer_index3 %d\n",bnapi->status_blk.msi->status_tx_quick_consumer_index3);
> > > printk("sblk->status_rx_quick_consumer_index0 %d\n",bnapi->status_blk.msi->status_rx_quick_consumer_index0);
> > > printk("sblk->status_rx_quick_consumer_index1 %d\n",bnapi->status_blk.msi->status_rx_quick_consumer_index1);
> > > printk("sblk->status_rx_quick_consumer_index2 %d\n",bnapi->status_blk.msi->status_rx_quick_consumer_index2);
> > > printk("sblk->status_rx_quick_consumer_index3 %d\n",bnapi->status_blk.msi->status_rx_quick_consumer_index3);
> > > printk("sblk->status_rx_quick_consumer_index4 %d\n",bnapi->status_blk.msi->status_rx_quick_consumer_index4);
> > > printk("sblk->status_rx_quick_consumer_index5 %d\n",bnapi->status_blk.msi->status_rx_quick_consumer_index5);
> > > printk("sblk->status_rx_quick_consumer_index6 %d\n",bnapi->status_blk.msi->status_rx_quick_consumer_index6);
> > > printk("sblk->status_rx_quick_consumer_index7 %d\n",bnapi->status_blk.msi->status_rx_quick_consumer_index7);
> > > printk("sblk->status_rx_quick_consumer_index8 %d\n",bnapi->status_blk.msi->status_rx_quick_consumer_index8);
> > > printk("sblk->status_rx_quick_consumer_index9 %d\n",bnapi->status_blk.msi->status_rx_quick_consumer_index9);
> > > printk("sblk->status_rx_quick_consumer_index10 %d\n",bnapi->status_blk.msi->status_rx_quick_consumer_index10);
> > > printk("sblk->status_rx_quick_consumer_index11 %d\n",bnapi->status_blk.msi->status_rx_quick_consumer_index11);
> > > printk("sblk->status_rx_quick_consumer_index12 %d\n",bnapi->status_blk.msi->status_rx_quick_consumer_index12);
> > > printk("sblk->status_rx_quick_consumer_index13 %d\n",bnapi->status_blk.msi->status_rx_quick_consumer_index13);
> > > printk("sblk->status_rx_quick_consumer_index14 %d\n",bnapi->status_blk.msi->status_rx_quick_consumer_index14);
> > > printk("sblk->status_rx_quick_consumer_index15 %d\n",bnapi->status_blk.msi->status_rx_quick_consumer_index15);
> > > printk("sblk->status_completion_producer_index %d\n",bnapi->status_blk.msi->status_completion_producer_index);
> > > printk("sblk->status_cmd_consumer_index %d\n",bnapi->status_blk.msi->status_cmd_consumer_index);
> > > printk("sblk->status_idx %d\n",bnapi->status_blk.msi->status_idx);
> > > printk("sblk->status_unused %d\n",bnapi->status_blk.msi->status_unused);
> > > printk("sblk->status_blk_num %d\n",bnapi->status_blk.msi->status_blk_num);
> > > is_timedout = 1;
> > > for (i = 0; i < bp->irq_nvecs; i++) {
> > > bnapi = &bp->bnx2_napi[i];
> > > bnx2_tx_int(bp, bnapi, 0);
> > > }
> > > return;
> > > }
> > > -----------------
> > >
> > > -------------FTQ log in /var/log/message
> > > ------------[ cut here ]------------
> > > WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x105/0x16a()
> > > Hardware name: Tecal RH2285
> > > Modules linked in: iptable_filter ip_tables nfs fscache nfs_acl
> > auth_rpcgss bridge stp llc autofs4 ipmi_devintf ipmi_si ipmi_msghandler
> > lockd sunrpc ipv6 xenfs dm_multipath fuse xen_netback xen_blkback
> > blktap blkback_pagemap loop nbd video output sbs sbshc parport_pc lp
> > parport snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq
> > snd_seq_device snd_pcm_oss snd_mixer_oss bnx2 serio_raw snd_pcm
> > snd_timer snd soundcore snd_page_alloc i2c_i801 iTCO_wdt
> > iTCO_vendor_support i2c_core pata_acpi ata_generic pcspkr ata_piix
> > shpchp mptsas mptscsih mptbase [last unloaded: freq_table]
> > > Pid: 0, comm: swapper Not tainted 2.6.32.36xen #1
> > > Call Trace:
> > > <IRQ> [<ffffffff813ba154>] ? dev_watchdog+0x105/0x16a
> > > [<ffffffff81056666>] warn_slowpath_common+0x7c/0x94
> > > [<ffffffff81056738>] warn_slowpath_fmt+0xa4/0xa6
> > > [<ffffffff81080bfa>] ? clockevents_program_event+0x78/0x81
> > > [<ffffffff81081fce>] ? tick_program_event+0x2a/0x2c
> > > [<ffffffff813b951d>] ? __netif_tx_lock+0x1b/0x24
> > > [<ffffffff813b95a8>] ? netif_tx_lock+0x46/0x6e
> > > [<ffffffff813a3ed1>] ? netdev_drivername+0x48/0x4f
> > > [<ffffffff813ba154>] dev_watchdog+0x105/0x16a
> > > [<ffffffff81063d98>] run_timer_softirq+0x156/0x1f8
> > > [<ffffffff813ba04f>] ? dev_watchdog+0x0/0x16a
> > > [<ffffffff8105d6f0>] __do_softirq+0xd7/0x19e
> > > [<ffffffff81013eac>] call_softirq+0x1c/0x30
> > > [<ffffffff8101564b>] do_softirq+0x46/0x87
> > > [<ffffffff8105d575>] irq_exit+0x3b/0x7a
> > > [<ffffffff8128dcfe>] xen_evtchn_do_upcall+0x38/0x46
> > > [<ffffffff81013efe>] xen_do_hypervisor_callback+0x1e/0x30
> > > <EOI> [<ffffffff8103f642>] ? pick_next_task_idle+0x18/0x22
> > > [<ffffffff810093aa>] ? hypercall_page+0x3aa/0x1000
> > > [<ffffffff810093aa>] ? hypercall_page+0x3aa/0x1000
> > > [<ffffffff8100f1bb>] ? xen_safe_halt+0x10/0x1a
> > > [<ffffffff81019e14>] ? default_idle+0x39/0x56
> > > [<ffffffff81011cd0>] ? cpu_idle+0x5d/0x8c
> > > [<ffffffff8143375d>] ? cpu_bringup_and_idle+0x13/0x15
> > > ---[ end trace 867bb8f6cd959b03 ]---
> > > bnx2: <--- start FTQ dump on peth0 --->
> > > bnx2: peth0: BNX2_RV2P_PFTQ_CTL 10000
> > > bnx2: peth0: BNX2_RV2P_TFTQ_CTL 20000
> > > bnx2: peth0: BNX2_RV2P_MFTQ_CTL 4000
> > > bnx2: peth0: BNX2_TBDR_FTQ_CTL 1004002
> > > bnx2: peth0: BNX2_TDMA_FTQ_CTL 4010002
> > > bnx2: peth0: BNX2_TXP_FTQ_CTL 2410002
> > > bnx2: peth0: BNX2_TPAT_FTQ_CTL 10002
> > > bnx2: peth0: BNX2_RXP_CFTQ_CTL 8000
> > > bnx2: peth0: BNX2_RXP_FTQ_CTL 100000
> > > bnx2: peth0: BNX2_COM_COMXQ_FTQ_CTL 10000
> > > bnx2: peth0: BNX2_COM_COMTQ_FTQ_CTL 20000
> > > bnx2: peth0: BNX2_COM_COMQ_FTQ_CTL 10000
> > > bnx2: peth0: BNX2_CP_CPQ_FTQ_CTL 4000
> > > bnx2: peth0: TXP mode b84c state 80005000 evt_mask 500 pc 8000d60 pc 8000d60 instr 8f860000
> > > bnx2: peth0: TPAT mode b84c state 80009000 evt_mask 500 pc 8000a5c pc 8000a5c instr 10400016
> > > bnx2: peth0: RXP mode b84c state 80001000 evt_mask 500 pc 8004c14 pc 8004c14 instr 10e00088
> > > bnx2: peth0: COM mode b8cc state 80000000 evt_mask 500 pc 8000b28 pc 8000a9c instr 8c530000
> > > bnx2: peth0: CP mode b8cc state 80000000 evt_mask 500 pc 8000c50 pc 8000c58 instr 8ca50020
> > > bnx2: <--- end FTQ dump on peth0 --->
> > > bnx2: peth0 DEBUG: intr_sem[0]
> > > bnx2: peth0 DEBUG: intr_sem[0] PCI_CMD[20100406]
> > > bnx2: peth0 DEBUG: PCI_PM[19002008] PCI_MISC_CFG[92000088]
> > > bnx2: peth0 DEBUG: EMAC_TX_STATUS[00000008] EMAC_RX_STATUS[00000000]
> > > bnx2: peth0 RPM_MGMT_PKT_CTRL[40000088]
> > > bnx2: peth0 DEBUG: MCP_STATE_P0[0007e10e] MCP_STATE_P1[0003e00e]
> > > bnx2: peth0 DEBUG: HC_STATS_INTERRUPT_STATUS[01ff0000]
> > > bnx2: peth0 DEBUG: PBA[00000000]
> > > BNX2_PCICFG_INT_ACK_CMD[00013ce1]
> > > bnx2: peth0: prevent chip reset during tx timeout
> > > last status idx 2426
> > > hw_tx_cons 32474, txr->hw_tx_conds 32474 txr->tx_prod 32641 txr->tx_cons 32474
> > > hw_rx_cons 19665, txr->hw_rx_conds 19665
> > > sblk->status_attn_bits 1
> > > sblk->status_attn_bits_ack 1
> > > bnx2_tx_avail 88
> > > sblk->status_tx_quick_consumer_index0 32474
> > > sblk->status_tx_quick_consumer_index1 0
> > > sblk->status_tx_quick_consumer_index2 0
> > > sblk->status_tx_quick_consumer_index3 0
> > > sblk->status_rx_quick_consumer_index0 19665
> > > sblk->status_rx_quick_consumer_index1 0
> > > sblk->status_rx_quick_consumer_index2 0
> > > sblk->status_rx_quick_consumer_index3 0
> > > sblk->status_rx_quick_consumer_index4 0
> > > sblk->status_rx_quick_consumer_index5 0
> > > sblk->status_rx_quick_consumer_index6 0
> > > sblk->status_rx_quick_consumer_index7 0
> > > sblk->status_rx_quick_consumer_index8 0
> > > sblk->status_rx_quick_consumer_index9 0
> > > sblk->status_rx_quick_consumer_index10 0
> > > sblk->status_rx_quick_consumer_index11 0
> > > sblk->status_rx_quick_consumer_index12 0
> > > sblk->status_rx_quick_consumer_index13 0
> > > sblk->status_rx_quick_consumer_index14 0
> > > sblk->status_rx_quick_consumer_index15 0
> > > sblk->status_completion_producer_index 0
> > > sblk->status_cmd_consumer_index 0
> > > sblk->status_idx 2426
> > > sblk->status_unused 0
> > > sblk->status_blk_num 0
> > > hw_cons 32474 sw_cons 32474 ffff8801d27f85c0 bnapi
> > > return hw_cons 32474 sw_cons 32474 ffff8801d27f85c0 bnapi
> > > hw_cons 3628 sw_cons 3625 ffff8801d27f8bc0 bnapi
> > > return hw_cons 3628 sw_cons 3625 ffff8801d27f8bc0 bnapi
> > > hw_cons 62094 sw_cons 62090 ffff8801d27f91c0 bnapi
> > > return hw_cons 62094 sw_cons 62090 ffff8801d27f91c0 bnapi
> > > hw_cons 3184 sw_cons 3173 ffff8801d27f97c0 bnapi
> > > return hw_cons 3184 sw_cons 3173 ffff8801d27f97c0 bnapi
> > > hw_cons 0 sw_cons 0 ffff8801d27f9dc0 bnapi
> > > return hw_cons 0 sw_cons 0 ffff8801d27f9dc0 bnapi
> >
>
>
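The TX ring numbers in the quoted log can be cross-checked with a little arithmetic. The sketch below is a simplified model, not the driver's bnx2_tx_avail(): it assumes 16-bit producer and consumer indices that wrap, and a ring size of 255, which is only inferred from the log (tx_prod 32641 minus tx_cons 32474 gives 167 descriptors in flight, and 167 plus the reported bnx2_tx_avail of 88 is 255). The name tx_avail_sketch is hypothetical.

```c
#include <stdint.h>

/* Free TX descriptors, given producer/consumer indices that wrap at 16 bits.
 * ring_size is an assumption supplied by the caller. */
uint32_t tx_avail_sketch(uint16_t tx_prod, uint16_t tx_cons, uint32_t ring_size)
{
	uint16_t in_flight = (uint16_t)(tx_prod - tx_cons); /* wraps mod 65536 */

	return ring_size - in_flight;
}
```

If that reading is right, 167 descriptors were queued while the hardware consumer index stayed at 32474, which would be consistent with the NIC no longer completing transmits.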
* Re: [Bugme-new] [Bug 38102] New: BUG kmalloc-2048: Poison overwritten
From: Eric Dumazet @ 2011-07-05 5:59 UTC
To: Alexey Zaytsev
Cc: Michael Büsch, Andrew Morton, netdev, Gary Zambrano,
bugme-daemon, David S. Miller, Pekka Pietikainen,
Florian Schirmer, Felix Fietkau, Michael Buesch
In-Reply-To: <1309844009.2720.39.camel@edumazet-laptop>
On Tuesday, 05 July 2011 at 07:33 +0200, Eric Dumazet wrote:
> On Tuesday, 05 July 2011 at 09:18 +0400, Alexey Zaytsev wrote:
>
> > Actually, I've added a trace to show b44_init_rings and b44_free_rings
> > calls, and they are only called once, right after the driver is
> > loaded. So it can't be related to START_RFO. Will attach the diff and
> > dmesg.
>
> Thanks
>
> I was wondering if DMA could be faster when given word-aligned
> addresses. Could you try:
>
> -#define RX_PKT_OFFSET (RX_HEADER_LEN + 2)
> +#define RX_PKT_OFFSET (RX_HEADER_LEN + NET_IP_ALIGN)
>
> (On x86, we now have NET_IP_ALIGN = 0 since commit ea812ca1)
>
I suspect a hardware bug.
You could force copybreak, so that b44 only touches memory that is
effectively private to the driver.
diff --git a/drivers/net/b44.c b/drivers/net/b44.c
index a69331e..62a0599 100644
--- a/drivers/net/b44.c
+++ b/drivers/net/b44.c
@@ -75,7 +75,7 @@
(BP)->tx_cons - (BP)->tx_prod - TX_RING_GAP(BP))
#define NEXT_TX(N) (((N) + 1) & (B44_TX_RING_SIZE - 1))
-#define RX_PKT_OFFSET (RX_HEADER_LEN + 2)
+#define RX_PKT_OFFSET (RX_HEADER_LEN + NET_IP_ALIGN)
#define RX_PKT_BUF_SZ (1536 + RX_PKT_OFFSET)
/* minimum number of free TX descriptors required to wake up TX process */
@@ -829,6 +829,7 @@ static int b44_rx(struct b44 *bp, int budget)
}
bp->rx_cons = cons;
+ wmb();
bw32(bp, B44_DMARX_PTR, cons * sizeof(struct dma_desc));
return received;
@@ -848,6 +849,7 @@ static int b44_poll(struct napi_struct *napi, int budget)
/* spin_unlock(&bp->tx_lock); */
}
if (bp->istat & ISTAT_RFO) { /* fast recovery, in ~20msec */
+ pr_err("b44: ISTAT_RFO !\n");
bp->istat &= ~ISTAT_RFO;
b44_disable_ints(bp);
ssb_device_enable(bp->sdev, 0); /* resets ISTAT_RFO */
@@ -2155,7 +2157,7 @@ static int __devinit b44_init_one(struct ssb_device *sdev,
bp = netdev_priv(dev);
bp->sdev = sdev;
bp->dev = dev;
- bp->force_copybreak = 0;
+ bp->force_copybreak = 1;
bp->msg_enable = netif_msg_init(b44_debug, B44_DEF_MSG_ENABLE);
* Re: [PATCH] net/core: Make urgent data inline by default
From: Esa-Pekka Pyokkimies @ 2011-07-05 5:41 UTC
To: David Miller; +Cc: netdev
In-Reply-To: <20110704.163838.1746904172195321123.davem@davemloft.net>
On Tue, 05 Jul 2011 02:38:38 +0300, David Miller <davem@davemloft.net> wrote:
> There is no way we can make this change, we've had the default
> we currently have for 18+ years. Breaking applications is a
> very real possibility.
>
> It doesn't matter what some RFC says.
I understand. However, the urgent pointer is a very niche feature and I don't
think it would really break much. FTP and telnet both want the urgent data
inline anyway. I haven't found any application which uses the "1-byte" urgent
data, which can by some chance be overwritten by the next urgent data if you
didn't read it in time. The reason I would want this change is that attack
detection is very difficult when a byte can be missing due to the URG flag
being set, and I think the damage done by crackers is greater than the damage
to applications.
But I guess you decide. At least I tried.
Esa-Pekka
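For reference, an application that wants urgent data delivered in the normal byte stream can already opt in per socket with the standard SO_OOBINLINE option, independent of the kernel default discussed here. A minimal sketch follows; the helper name enable_oob_inline is illustrative.

```c
#include <sys/socket.h>

/* Ask the kernel to deliver TCP urgent data inline with normal data.
 * Returns 0 on success, -1 with errno set on failure. */
int enable_oob_inline(int fd)
{
	int one = 1;

	return setsockopt(fd, SOL_SOCKET, SO_OOBINLINE, &one, sizeof(one));
}
```

This only helps for endpoints that set the option themselves; it does not address a monitoring tool that has to guess what the receiving application did.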
* A GRO question
From: Li Yu @ 2011-07-05 5:41 UTC
To: netdev@vger.kernel.org
Hi,
I have a question about the GRO implementation that confuses me.
compare_ether_header(), which is called in __napi_gro_receive(), assumes
that NAPI_GRO_CB(skb)->frag0 starts with a MAC/L2 header.
However, further along, in dev_gro_receive() -> ptype->gro_receive
[inet_gro_receive], we use the same address as the IPv4/L3 header, as below:
off = skb_gro_offset(skb); // should still be zero at this point, as I read it
hlen = off + sizeof(*iph);
iph = skb_gro_header_fast(skb, off); // just returns NAPI_GRO_CB(skb)->frag0 + 0
So did we forget to update NAPI_GRO_CB(skb)->data_offset here, or am I
missing something?
Also, as I understand the igb source code, if rx_ring->rx_buffer_len < 1024
(if we used a large MTU), the igb driver uses header split mode. In that case
the MAC header should be stored in skb->data via
skb_put(skb, igb_get_hlen(rx_ring, rx_desc)), and the rest of the data is
attached by the skb_fill_page_desc() call that follows. So
NAPI_GRO_CB(skb)->frag0 should start with the L3 header.
Thanks.
Yu
* Re: [Bugme-new] [Bug 38102] New: BUG kmalloc-2048: Poison overwritten
From: Eric Dumazet @ 2011-07-05 5:33 UTC
To: Alexey Zaytsev
Cc: Michael Büsch, Andrew Morton, netdev, Gary Zambrano,
bugme-daemon, David S. Miller, Pekka Pietikainen,
Florian Schirmer, Felix Fietkau, Michael Buesch
In-Reply-To: <CAB9v_DEiPFzQt-TgeVC3r3Y7YFwApLK_NHkDahFOKpibtABrZg@mail.gmail.com>
On Tuesday, 05 July 2011 at 09:18 +0400, Alexey Zaytsev wrote:
> Actually, I've added a trace to show b44_init_rings and b44_free_rings
> calls, and they are only called once, right after the driver is
> loaded. So it can't be related to START_RFO. Will attach the diff and
> dmesg.
Thanks
I was wondering if DMA could be faster when given word-aligned
addresses. Could you try:
-#define RX_PKT_OFFSET (RX_HEADER_LEN + 2)
+#define RX_PKT_OFFSET (RX_HEADER_LEN + NET_IP_ALIGN)
(On x86, we now have NET_IP_ALIGN = 0 since commit ea812ca1)
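The trade-off behind the 2-byte pad can be sketched with simple arithmetic: a 14-byte Ethernet header leaves the IP header 2 bytes off a 4-byte boundary unless the frame itself starts 2 bytes in. The helper below is illustrative only; it assumes the receive buffer is 4-byte aligned and that RX_HEADER_LEN is a multiple of 4 (28 is used in the example as an assumed value).

```c
#include <stddef.h>

#define SKETCH_ETH_HLEN 14 /* Ethernet header length in bytes */

/* Byte misalignment (0..3) of the IP header when the frame starts
 * frame_offset bytes into a 4-byte-aligned buffer. */
size_t ip_header_misalignment(size_t frame_offset)
{
	return (frame_offset + SKETCH_ETH_HLEN) % 4;
}
```

With an offset of 28 + 2 the IP header is aligned but the frame start is not; with 28 + 0 the frame start is word aligned and the IP header is off by 2, which is tolerated on x86.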
* Re: [Bugme-new] [Bug 38102] New: BUG kmalloc-2048: Poison overwritten
From: Alexey Zaytsev @ 2011-07-05 5:18 UTC
To: Eric Dumazet
Cc: Michael Büsch, Andrew Morton, netdev, Gary Zambrano,
bugme-daemon, David S. Miller, Pekka Pietikainen,
Florian Schirmer, Felix Fietkau, Michael Buesch
In-Reply-To: <1309842642.2720.36.camel@edumazet-laptop>
On Tue, Jul 5, 2011 at 09:10, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tuesday, 05 July 2011 at 08:57 +0400, Alexey Zaytsev wrote:
>
>> Ran tcpdump. You are right, I was wrong. Sorry for the noise.
>
> Thanks for testing ;)
>
> It would be nice to know if the memory scribbles start after or before
> one RFO triggers.
>
> I can see this calls b44_init_rings() without really stopping the device
> before. This seems very suspect to me.
>
Actually, I've added a trace to show b44_init_rings and b44_free_rings
calls, and they are only called once, right after the driver is
loaded. So it can't be related to START_RFO. Will attach the diff and
dmesg.
* Re: [Bugme-new] [Bug 38102] New: BUG kmalloc-2048: Poison overwritten
From: Eric Dumazet @ 2011-07-05 5:10 UTC
To: Alexey Zaytsev
Cc: Michael Büsch, Andrew Morton, netdev, Gary Zambrano,
bugme-daemon, David S. Miller, Pekka Pietikainen,
Florian Schirmer, Felix Fietkau, Michael Buesch
In-Reply-To: <CAB9v_DFM8PcujUB-YzcK49DS7T6Bz2FLDtkVdEYt8an1oPYVFw@mail.gmail.com>
On Tuesday, 05 July 2011 at 08:57 +0400, Alexey Zaytsev wrote:
> Ran tcpdump. You are right, I was wrong. Sorry for the noise.
Thanks for testing ;)
It would be nice to know if the memory scribbles start after or before
one RFO triggers.
I can see this calls b44_init_rings() without really stopping the device
before. This seems very suspect to me.
diff --git a/drivers/net/b44.c b/drivers/net/b44.c
index a69331e..b22dd4c 100644
--- a/drivers/net/b44.c
+++ b/drivers/net/b44.c
@@ -829,6 +829,7 @@ static int b44_rx(struct b44 *bp, int budget)
}
bp->rx_cons = cons;
+ wmb();
bw32(bp, B44_DMARX_PTR, cons * sizeof(struct dma_desc));
return received;
@@ -848,6 +849,7 @@ static int b44_poll(struct napi_struct *napi, int budget)
/* spin_unlock(&bp->tx_lock); */
}
if (bp->istat & ISTAT_RFO) { /* fast recovery, in ~20msec */
+ pr_err("b44: ISTAT_RFO !\n");
bp->istat &= ~ISTAT_RFO;
b44_disable_ints(bp);
ssb_device_enable(bp->sdev, 0); /* resets ISTAT_RFO */
* Re: [Bugme-new] [Bug 38102] New: BUG kmalloc-2048: Poison overwritten
From: Alexey Zaytsev @ 2011-07-05 4:57 UTC
To: Eric Dumazet
Cc: Michael Büsch, Andrew Morton, netdev, Gary Zambrano,
bugme-daemon, David S. Miller, Pekka Pietikainen,
Florian Schirmer, Felix Fietkau, Michael Buesch
In-Reply-To: <1309840708.2720.31.camel@edumazet-laptop>
On Tue, Jul 5, 2011 at 08:38, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tuesday, 05 July 2011 at 08:29 +0400, Alexey Zaytsev wrote:
>> On Tue, Jul 5, 2011 at 08:25, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> > On Tuesday, 05 July 2011 at 08:17 +0400, Alexey Zaytsev wrote:
>> >> >
>> >> Check out starting at packet 302893. 383 _identical_ ACKs were sent
>> >> out by the b44 machine within 30 milliseconds.
>> >
>> >
>> > As I said, b44 driver lost at least 200 consecutive frames (source says
>> > recovery takes about 20 ms)
>> >
>> > TCP then does its normal job.
>> >
>>
>> From my understanding, after a frame is lost, TCP would be waiting for
>> a retransmit. Or at least, it would not be sending 400 duplicate ACKs
>> for the single last frame received, right? Let me run tcpdump on the
>> b44 side now. I'm quite sure I won't see any ACK dups leaving the
>> stack.
>
> Wow, I believe you are on the wrong track. Honestly.
>
> Try unplugging the wire for 100 ms, and watch your "duplicate ACKs
> disease".
>
> That's exactly what is happening with the b44 driver doing a "fast recovery"
> right now.
>
> That's a moot point. Running tcpdump on your b44 machine will kill your
> performance even more; it won't solve the b44 bug.
>
> If you prefer to 'fix tcp', please open another thread.
Ran tcpdump. You are right, I was wrong. Sorry for the noise.
* Re: [PATCH] greth: greth_set_mac_add would corrupt the MAC address.
From: David Miller @ 2011-07-05 4:39 UTC
To: kristoffer; +Cc: netdev
In-Reply-To: <1309770483-16026-1-git-send-email-kristoffer@gaisler.com>
From: Kristoffer Glembo <kristoffer@gaisler.com>
Date: Mon, 4 Jul 2011 11:08:03 +0200
> The MAC address was set using the signed char sockaddr->sa_addr
> field and thus the address could be corrupted through sign extension.
>
> Signed-off-by: Kristoffer Glembo <kristoffer@gaisler.com>
Applied, thanks!
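The class of bug fixed here can be reproduced in a few lines of user-space C. The sketch below is not the greth code; the function names and the 16-bit packing are hypothetical, chosen only to show how a signed byte with its top bit set is sign-extended and overwrites the higher bits when bytes are OR-ed together.

```c
#include <stdint.h>

/* Buggy variant: each signed byte is sign-extended to int before packing. */
uint32_t pack_msb_signed(const signed char *addr)
{
	return ((uint32_t)(int)addr[0] << 8) | (uint32_t)(int)addr[1];
}

/* Fixed variant: treat the address bytes as unsigned. */
uint32_t pack_msb_unsigned(const unsigned char *addr)
{
	return ((uint32_t)addr[0] << 8) | addr[1];
}
```

For address bytes 00:80, the unsigned variant yields 0x0080 while the signed variant yields 0xFFFFFF80.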
* Re: [Bugme-new] [Bug 38102] New: BUG kmalloc-2048: Poison overwritten
From: Eric Dumazet @ 2011-07-05 4:38 UTC
To: Alexey Zaytsev
Cc: Michael Büsch, Andrew Morton, netdev, Gary Zambrano,
bugme-daemon, David S. Miller, Pekka Pietikainen,
Florian Schirmer, Felix Fietkau, Michael Buesch
In-Reply-To: <CAB9v_DEOhZh37aqx1qrLnrz5+tqjcjgBx-DP6M_0NkygZ1LjcQ@mail.gmail.com>
On Tuesday, 05 July 2011 at 08:29 +0400, Alexey Zaytsev wrote:
> On Tue, Jul 5, 2011 at 08:25, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Tuesday, 05 July 2011 at 08:17 +0400, Alexey Zaytsev wrote:
> >> >
> >> Check out starting at packet 302893. 383 _identical_ ACKs were sent
> >> out by the b44 machine within 30 milliseconds.
> >
> >
> > As I said, b44 driver lost at least 200 consecutive frames (source says
> > recovery takes about 20 ms)
> >
> > TCP then does its normal job.
> >
>
> From my understanding, after a frame is lost, TCP would be waiting for
> a retransmit. Or at least, it would not be sending 400 duplicate ACKs
> for the single last frame received, right? Let me run tcpdump on the
> b44 side now. I'm quite sure I won't see any ACK dups leaving the
> stack.
Wow, I believe you are on the wrong track. Honestly.
Try unplugging the wire for 100 ms, and watch your "duplicate ACKs
disease".
That's exactly what is happening with the b44 driver doing a "fast recovery"
right now.
That's a moot point. Running tcpdump on your b44 machine will kill your
performance even more; it won't solve the b44 bug.
If you prefer to 'fix tcp', please open another thread.
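The duplicate ACK burst follows from ordinary receiver behaviour, which a toy model can show. The sketch below is a deliberate simplification (one immediate duplicate ACK per out-of-order segment, no SACK, no delayed ACKs, no retransmissions), and count_dup_acks is a hypothetical name.

```c
/* delivered[i] != 0 means segment i reached the receiver.
 * Returns how many duplicate ACKs the receiver would send. */
unsigned int count_dup_acks(const unsigned char *delivered, unsigned int n)
{
	unsigned int next_expected = 0, dups = 0, i;

	for (i = 0; i < n; i++) {
		if (!delivered[i])
			continue;        /* segment lost on the wire */
		if (i == next_expected)
			next_expected++; /* in order: the ACK advances */
		else
			dups++;          /* hole ahead: repeat the same ACK */
	}
	return dups;
}
```

In this model, every segment that arrives after a run of lost frames produces one more identical ACK, so a few hundred segments in flight behind the hole give a few hundred duplicates.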
* Re: [PATCH] net: bind() fix error return on wrong address family
From: David Miller @ 2011-07-05 4:38 UTC
To: meissner
Cc: kuznet, pekkas, jmorris, yoshfuji, kaber, netdev, linux-kernel,
meissner, max
In-Reply-To: <1309779029-15403-1-git-send-email-meissner@novell.com>
From: Marcus Meissner <meissner@novell.com>
Date: Mon, 4 Jul 2011 13:30:29 +0200
> Reinhard Max also pointed out that the error should be EAFNOSUPPORT according
> to POSIX.
>
> The Linux manpages have it as EINVAL, some other OSes (Minix, HPUX, perhaps BSD) use
> EAFNOSUPPORT. Windows uses WSAEFAULT according to MSDN.
>
> Error values used by other protocols in their AF bind() methods in current
> mainline git, as far as a brief look shows:
> EAFNOSUPPORT: atm, appletalk, l2tp, llc, phonet, rxrpc
> EINVAL: ax25, bluetooth, decnet, econet, ieee802154, iucv, netlink, netrom, packet, rds, rose, unix, x25,
> No check?: can/raw, ipv6/raw, irda, l2tp/l2tp_ip
>
> Signed-off-by: Marcus Meissner <meissner@suse.de>
> Cc: Reinhard Max <max@suse.de>
Applied to net-2.6, thanks.
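The behaviour under discussion amounts to a family check at the top of a bind() handler. A hedged sketch of that check, with a hypothetical helper name and no claim to match the applied patch line for line:

```c
#include <errno.h>
#include <sys/socket.h>

/* Return 0 if the caller's address family matches what the protocol
 * expects, otherwise a negative errno in the kernel's style. */
int check_bind_family(sa_family_t got, sa_family_t expected)
{
	if (got != expected)
		return -EAFNOSUPPORT; /* the value POSIX specifies, per the thread */
	return 0;
}
```

Applications that tested specifically for EINVAL on a wrong family would see a different errno after such a change.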
* Re: [PATCH 2/2] packet: Add fanout support.
From: David Miller @ 2011-07-05 4:36 UTC
To: eric.dumazet; +Cc: victor, netdev
In-Reply-To: <1309840429.2720.26.camel@edumazet-laptop>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 05 Jul 2011 06:33:49 +0200
> On Monday, 04 July 2011 at 21:20 -0700, David Miller wrote:
>> +#define PACKET_FANOUT_MAX 2048
...
>> + struct sock *arr[PACKET_FANOUT_MAX];
>
> That's about 16 Kbytes, yet you use kzalloc()
>
>> + spinlock_t lock;
>> + atomic_t sk_ref;
>> + struct packet_type prot_hook ____cacheline_aligned_in_smp;
>> +};
>> +
>
> Maybe use a dynamic array? I suspect most uses won't even reach 16
> sockets anyway...
True. Another option, for now, is to just make PACKET_FANOUT_MAX more
reasonable. I'll make it something like 256.
Thanks!
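The sizes being discussed are easy to check. The sketch below only computes the footprint of the pointer array; struct sock is left opaque and fanout_arr_bytes is an illustrative name. On a platform with 8-byte pointers, 2048 entries come to 16384 bytes and 256 entries to 2048 bytes.

```c
#include <stddef.h>

struct sock; /* opaque here; only the pointer size matters */

/* Bytes occupied by an array of max_members socket pointers. */
size_t fanout_arr_bytes(size_t max_members)
{
	return max_members * sizeof(struct sock *);
}
```

Dropping the limit from 2048 to 256 therefore shrinks the embedded array by a factor of eight without changing the structure layout otherwise.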
* Re: [PATCH 2/2] packet: Add fanout support.
From: Eric Dumazet @ 2011-07-05 4:33 UTC
To: David Miller; +Cc: victor, netdev
In-Reply-To: <20110704.212014.236340473910292460.davem@davemloft.net>
On Monday, 04 July 2011 at 21:20 -0700, David Miller wrote:
> Fanouts allow packet capturing to be demuxed to a set of AF_PACKET
> sockets. Two fanout policies are implemented:
>
> 1) Hashing based upon skb->rxhash
>
> 2) Pure round-robin
>
> An AF_PACKET socket must be fully bound before it tries to add itself
> to a fanout. All AF_PACKET sockets trying to join the same fanout
> must all have the same bind settings.
>
> Fanouts are identified (within a network namespace) by a 16-bit ID.
> The first socket to try to add itself to a fanout with a particular
> ID, creates that fanout. When the last socket leaves the fanout
> (which happens only when the socket is closed), that fanout is
> destroyed.
>
> Signed-off-by: David S. Miller <davem@davemloft.net>
> ---
> include/linux/if_packet.h | 4 +
> net/packet/af_packet.c | 250 ++++++++++++++++++++++++++++++++++++++++++++-
> 2 files changed, 249 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/if_packet.h b/include/linux/if_packet.h
> index 7b31863..1efa1cb 100644
> --- a/include/linux/if_packet.h
> +++ b/include/linux/if_packet.h
> @@ -49,6 +49,10 @@ struct sockaddr_ll {
> #define PACKET_VNET_HDR 15
> #define PACKET_TX_TIMESTAMP 16
> #define PACKET_TIMESTAMP 17
> +#define PACKET_FANOUT 18
> +
> +#define PACKET_FANOUT_HASH 0
> +#define PACKET_FANOUT_LB 1
>
> struct tpacket_stats {
> unsigned int tp_packets;
> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> index bb281bf..7db1e12 100644
> --- a/net/packet/af_packet.c
> +++ b/net/packet/af_packet.c
> @@ -187,9 +187,11 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg);
>
> static void packet_flush_mclist(struct sock *sk);
>
> +struct packet_fanout;
> struct packet_sock {
> /* struct sock has to be the first member of packet_sock */
> struct sock sk;
> + struct packet_fanout *fanout;
> struct tpacket_stats stats;
> struct packet_ring_buffer rx_ring;
> struct packet_ring_buffer tx_ring;
> @@ -212,6 +214,24 @@ struct packet_sock {
> struct packet_type prot_hook ____cacheline_aligned_in_smp;
> };
>
> +#define PACKET_FANOUT_MAX 2048
> +
> +struct packet_fanout {
> +#ifdef CONFIG_NET_NS
> + struct net *net;
> +#endif
> + int num_members;
> + u16 id;
> + u8 type;
> + u8 pad;
> + atomic_t rr_cur;
> + struct list_head list;
> + struct sock *arr[PACKET_FANOUT_MAX];
That's about 16 Kbytes, yet you use kzalloc()
> + spinlock_t lock;
> + atomic_t sk_ref;
> + struct packet_type prot_hook ____cacheline_aligned_in_smp;
> +};
> +
Maybe use a dynamic array? I suspect most uses won't even reach 16
sockets anyway...
^ permalink raw reply
* Re: [Bugme-new] [Bug 38102] New: BUG kmalloc-2048: Poison overwritten
From: Alexey Zaytsev @ 2011-07-05 4:29 UTC (permalink / raw)
To: Eric Dumazet
Cc: Michael Büsch, Andrew Morton, netdev, Gary Zambrano,
bugme-daemon, David S. Miller, Pekka Pietikainen,
Florian Schirmer, Felix Fietkau, Michael Buesch
In-Reply-To: <1309839928.2720.23.camel@edumazet-laptop>
On Tue, Jul 5, 2011 at 08:25, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tuesday 05 July 2011 at 08:17 +0400, Alexey Zaytsev wrote:
>> >
>> Check out starting at packet 302893. 383 _identical_ ACKs were sent
>> out by the b44 machine within 30 milliseconds.
>
>
> As I said, b44 driver lost at least 200 consecutive frames (source says
> recovery takes about 20 ms)
>
> TCP then does its normal job.
>
From my understanding, after a frame is lost, TCP would be waiting for
a retransmit. Or at least, it would not be sending 400 duplicate ACKs
for the single last frame received, right? Let me run tcpdump on the
b44 side now. I'm quite sure I won't see any ACK dups leaving the
stack.
^ permalink raw reply
* Re: [Bugme-new] [Bug 38102] New: BUG kmalloc-2048: Poison overwritten
From: Eric Dumazet @ 2011-07-05 4:25 UTC (permalink / raw)
To: Alexey Zaytsev
Cc: Michael Büsch, Andrew Morton, netdev, Gary Zambrano,
bugme-daemon, David S. Miller, Pekka Pietikainen,
Florian Schirmer, Felix Fietkau, Michael Buesch
In-Reply-To: <CAB9v_DGSFAG9V0jqem+tDP3G-N8v6Z+_6oKdPwL-ZwhfhCOZnw@mail.gmail.com>
On Tuesday 05 July 2011 at 08:17 +0400, Alexey Zaytsev wrote:
> >
> Check out starting at packet 302893. 383 _identical_ ACKs were sent
> out by the b44 machine within 30 milliseconds.
As I said, b44 driver lost at least 200 consecutive frames (source says
recovery takes about 20 ms)
TCP then does its normal job.
^ permalink raw reply
* Re: [Bugme-new] [Bug 38102] New: BUG kmalloc-2048: Poison overwritten
From: Eric Dumazet @ 2011-07-05 4:21 UTC (permalink / raw)
To: Alexey Zaytsev
Cc: Michael Büsch, Andrew Morton, netdev, Gary Zambrano,
bugme-daemon, David S. Miller, Pekka Pietikainen,
Florian Schirmer, Felix Fietkau, Michael Buesch
In-Reply-To: <1309837443.2720.8.camel@edumazet-laptop>
On Tuesday 05 July 2011 at 05:44 +0200, Eric Dumazet wrote:
> Maybe we should do instead a fast dequeue of packets (recycling them
> instead of pushing them to upper stack) in case too many packets are
> ready to be delivered, and always make sure NIC has a reserve of
> available buffers for DMA accesses, before it can assert ISTAT_RFO
>
>
Another way would be to add Explicit Congestion Notification when too
many packets are received in a burst, but unfortunately not enough TCP
flows are ECN ready :)
^ permalink raw reply
* [PATCH 2/2] packet: Add fanout support.
From: David Miller @ 2011-07-05 4:20 UTC (permalink / raw)
To: victor; +Cc: netdev
Fanouts allow packet capturing to be demuxed to a set of AF_PACKET
sockets. Two fanout policies are implemented:
1) Hashing based upon skb->rxhash
2) Pure round-robin
An AF_PACKET socket must be fully bound before it tries to add itself
to a fanout. All AF_PACKET sockets trying to join the same fanout
must all have the same bind settings.
Fanouts are identified (within a network namespace) by a 16-bit ID.
The first socket to try to add itself to a fanout with a particular
ID, creates that fanout. When the last socket leaves the fanout
(which happens only when the socket is closed), that fanout is
destroyed.
Signed-off-by: David S. Miller <davem@davemloft.net>
---
include/linux/if_packet.h | 4 +
net/packet/af_packet.c | 250 ++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 249 insertions(+), 5 deletions(-)
diff --git a/include/linux/if_packet.h b/include/linux/if_packet.h
index 7b31863..1efa1cb 100644
--- a/include/linux/if_packet.h
+++ b/include/linux/if_packet.h
@@ -49,6 +49,10 @@ struct sockaddr_ll {
#define PACKET_VNET_HDR 15
#define PACKET_TX_TIMESTAMP 16
#define PACKET_TIMESTAMP 17
+#define PACKET_FANOUT 18
+
+#define PACKET_FANOUT_HASH 0
+#define PACKET_FANOUT_LB 1
struct tpacket_stats {
unsigned int tp_packets;
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index bb281bf..7db1e12 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -187,9 +187,11 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg);
static void packet_flush_mclist(struct sock *sk);
+struct packet_fanout;
struct packet_sock {
/* struct sock has to be the first member of packet_sock */
struct sock sk;
+ struct packet_fanout *fanout;
struct tpacket_stats stats;
struct packet_ring_buffer rx_ring;
struct packet_ring_buffer tx_ring;
@@ -212,6 +214,24 @@ struct packet_sock {
struct packet_type prot_hook ____cacheline_aligned_in_smp;
};
+#define PACKET_FANOUT_MAX 2048
+
+struct packet_fanout {
+#ifdef CONFIG_NET_NS
+ struct net *net;
+#endif
+ int num_members;
+ u16 id;
+ u8 type;
+ u8 pad;
+ atomic_t rr_cur;
+ struct list_head list;
+ struct sock *arr[PACKET_FANOUT_MAX];
+ spinlock_t lock;
+ atomic_t sk_ref;
+ struct packet_type prot_hook ____cacheline_aligned_in_smp;
+};
+
struct packet_skb_cb {
unsigned int origlen;
union {
@@ -227,6 +247,9 @@ static inline struct packet_sock *pkt_sk(struct sock *sk)
return (struct packet_sock *)sk;
}
+static void __fanout_unlink(struct sock *sk, struct packet_sock *po);
+static void __fanout_link(struct sock *sk, struct packet_sock *po);
+
/* register_prot_hook must be invoked with the po->bind_lock held,
* or from a context in which asynchronous accesses to the packet
* socket is not possible (packet_create()).
@@ -235,7 +258,10 @@ static void register_prot_hook(struct sock *sk)
{
struct packet_sock *po = pkt_sk(sk);
if (!po->running) {
- dev_add_pack(&po->prot_hook);
+ if (po->fanout)
+ __fanout_link(sk, po);
+ else
+ dev_add_pack(&po->prot_hook);
sock_hold(sk);
po->running = 1;
}
@@ -253,7 +279,10 @@ static void __unregister_prot_hook(struct sock *sk, bool sync)
struct packet_sock *po = pkt_sk(sk);
po->running = 0;
- __dev_remove_pack(&po->prot_hook);
+ if (po->fanout)
+ __fanout_unlink(sk, po);
+ else
+ __dev_remove_pack(&po->prot_hook);
__sock_put(sk);
if (sync) {
@@ -388,6 +417,195 @@ static void packet_sock_destruct(struct sock *sk)
sk_refcnt_debug_dec(sk);
}
+static int fanout_rr_next(struct packet_fanout *f)
+{
+ int x = atomic_read(&f->rr_cur) + 1;
+
+ if (x >= f->num_members)
+ x = 0;
+
+ return x;
+}
+
+static struct sock *fanout_demux_hash(struct packet_fanout *f, struct sk_buff *skb)
+{
+ u32 idx, hash = skb->rxhash;
+
+ idx = ((u64)hash * f->num_members) >> 32;
+
+ return f->arr[idx];
+}
+
+static struct sock *fanout_demux_lb(struct packet_fanout *f, struct sk_buff *skb)
+{
+ int cur, old;
+
+ cur = atomic_read(&f->rr_cur);
+ while ((old = atomic_cmpxchg(&f->rr_cur, cur,
+ fanout_rr_next(f))) != cur)
+ cur = old;
+ return f->arr[cur];
+}
+
+static int packet_rcv_fanout_hash(struct sk_buff *skb, struct net_device *dev,
+ struct packet_type *pt, struct net_device *orig_dev)
+{
+ struct packet_fanout *f = pt->af_packet_priv;
+ struct packet_sock *po;
+ struct sock *sk;
+
+ if (!net_eq(dev_net(dev), read_pnet(&f->net))) {
+ kfree_skb(skb);
+ return 0;
+ }
+
+ sk = fanout_demux_hash(f, skb);
+ po = pkt_sk(sk);
+
+ return po->prot_hook.func(skb, dev, &po->prot_hook, orig_dev);
+}
+
+static int packet_rcv_fanout_lb(struct sk_buff *skb, struct net_device *dev,
+ struct packet_type *pt, struct net_device *orig_dev)
+{
+ struct packet_fanout *f = pt->af_packet_priv;
+ struct packet_sock *po;
+ struct sock *sk;
+
+ if (!net_eq(dev_net(dev), read_pnet(&f->net))) {
+ kfree_skb(skb);
+ return 0;
+ }
+
+ sk = fanout_demux_lb(f, skb);
+ po = pkt_sk(sk);
+
+ return po->prot_hook.func(skb, dev, &po->prot_hook, orig_dev);
+}
+
+static DEFINE_MUTEX(fanout_mutex);
+static LIST_HEAD(fanout_list);
+
+static void __fanout_link(struct sock *sk, struct packet_sock *po)
+{
+ struct packet_fanout *f = po->fanout;
+
+ spin_lock(&f->lock);
+ f->arr[f->num_members] = sk;
+ smp_wmb();
+ f->num_members++;
+ spin_unlock(&f->lock);
+}
+
+static void __fanout_unlink(struct sock *sk, struct packet_sock *po)
+{
+ struct packet_fanout *f = po->fanout;
+ int i;
+
+ spin_lock(&f->lock);
+ for (i = 0; i < f->num_members; i++) {
+ if (f->arr[i] == sk)
+ break;
+ }
+ BUG_ON(i >= f->num_members);
+ f->arr[i] = f->arr[f->num_members - 1];
+ f->num_members--;
+ spin_unlock(&f->lock);
+}
+
+static int fanout_add(struct sock *sk, u16 id, u8 type)
+{
+ struct packet_sock *po = pkt_sk(sk);
+ struct packet_fanout *f, *match;
+ int err;
+
+ switch (type) {
+ case PACKET_FANOUT_HASH:
+ case PACKET_FANOUT_LB:
+ break;
+ default:
+ return -EINVAL;
+ }
+
+ if (!po->running)
+ return -EINVAL;
+
+ if (po->fanout)
+ return -EALREADY;
+
+ mutex_lock(&fanout_mutex);
+ match = NULL;
+ list_for_each_entry(f, &fanout_list, list) {
+ if (f->id == id &&
+ read_pnet(&f->net) == sock_net(sk)) {
+ match = f;
+ break;
+ }
+ }
+ if (!match) {
+ match = kzalloc(sizeof(*match), GFP_KERNEL);
+ if (match) {
+ write_pnet(&match->net, sock_net(sk));
+ match->id = id;
+ match->type = type;
+ atomic_set(&match->rr_cur, 0);
+ INIT_LIST_HEAD(&match->list);
+ spin_lock_init(&match->lock);
+ atomic_set(&match->sk_ref, 0);
+ match->prot_hook.type = po->prot_hook.type;
+ match->prot_hook.dev = po->prot_hook.dev;
+ switch (type) {
+ case PACKET_FANOUT_HASH:
+ match->prot_hook.func = packet_rcv_fanout_hash;
+ break;
+ case PACKET_FANOUT_LB:
+ match->prot_hook.func = packet_rcv_fanout_lb;
+ break;
+ }
+ match->prot_hook.af_packet_priv = match;
+ dev_add_pack(&match->prot_hook);
+ list_add(&match->list, &fanout_list);
+ }
+ }
+ err = -ENOMEM;
+ if (match) {
+ err = -EINVAL;
+ if (match->type == type &&
+ match->prot_hook.type == po->prot_hook.type &&
+ match->prot_hook.dev == po->prot_hook.dev) {
+ err = -ENOSPC;
+ if (atomic_read(&match->sk_ref) < PACKET_FANOUT_MAX) {
+ __dev_remove_pack(&po->prot_hook);
+ po->fanout = match;
+ atomic_inc(&match->sk_ref);
+ __fanout_link(sk, po);
+ err = 0;
+ }
+ }
+ }
+ mutex_unlock(&fanout_mutex);
+ return err;
+}
+
+static void fanout_release(struct sock *sk)
+{
+ struct packet_sock *po = pkt_sk(sk);
+ struct packet_fanout *f;
+
+ f = po->fanout;
+ if (!f)
+ return;
+
+ po->fanout = NULL;
+
+ mutex_lock(&fanout_mutex);
+ if (atomic_dec_and_test(&f->sk_ref)) {
+ list_del(&f->list);
+ dev_remove_pack(&f->prot_hook);
+ kfree(f);
+ }
+ mutex_unlock(&fanout_mutex);
+}
static const struct proto_ops packet_ops;
@@ -1398,6 +1616,8 @@ static int packet_release(struct socket *sock)
if (po->tx_ring.pg_vec)
packet_set_ring(sk, &req, 1, 1);
+ fanout_release(sk);
+
synchronize_net();
/*
* Now the socket is dead. No more input will appear.
@@ -1421,9 +1641,9 @@ static int packet_release(struct socket *sock)
static int packet_do_bind(struct sock *sk, struct net_device *dev, __be16 protocol)
{
struct packet_sock *po = pkt_sk(sk);
- /*
- * Detach an existing hook if present.
- */
+
+ if (po->fanout)
+ return -EINVAL;
lock_sock(sk);
@@ -2133,6 +2353,17 @@ packet_setsockopt(struct socket *sock, int level, int optname, char __user *optv
po->tp_tstamp = val;
return 0;
}
+ case PACKET_FANOUT:
+ {
+ int val;
+
+ if (optlen != sizeof(val))
+ return -EINVAL;
+ if (copy_from_user(&val, optval, sizeof(val)))
+ return -EFAULT;
+
+ return fanout_add(sk, val & 0xffff, val >> 16);
+ }
default:
return -ENOPROTOOPT;
}
@@ -2231,6 +2462,15 @@ static int packet_getsockopt(struct socket *sock, int level, int optname,
val = po->tp_tstamp;
data = &val;
break;
+ case PACKET_FANOUT:
+ if (len > sizeof(int))
+ len = sizeof(int);
+ val = (po->fanout ?
+ ((u32)po->fanout->id |
+ ((u32)po->fanout->type << 16)) :
+ 0);
+ data = &val;
+ break;
default:
return -ENOPROTOOPT;
}
--
1.7.5.4
^ permalink raw reply related
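Editorial note: the two demux policies in the patch above can be tried out in user space. The following is a sketch under stated assumptions, not the kernel code: demux_hash_index and demux_rr_index are hypothetical names, C11 atomics stand in for the kernel's atomic_t and atomic_cmpxchg(), and the round-robin successor is computed from the value being compared rather than re-read from the cursor as fanout_rr_next() does.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hash policy: scale a 32-bit hash onto [0, num_members) with a
 * multiply and shift instead of a modulo, as fanout_demux_hash() does.
 */
static uint32_t demux_hash_index(uint32_t hash, uint32_t num_members)
{
	assert(num_members > 0);
	return (uint32_t)(((uint64_t)hash * num_members) >> 32);
}

/* Round-robin policy: return the current slot and advance the shared
 * cursor with a compare-and-swap loop, wrapping at num_members.
 * Assumes the cursor starts inside [0, num_members).
 */
static int demux_rr_index(atomic_int *rr_cur, int num_members)
{
	int cur = atomic_load(rr_cur);
	int next;

	assert(num_members > 0);
	do {
		next = cur + 1;
		if (next >= num_members)
			next = 0;
		/* On failure, cur is reloaded with the observed value. */
	} while (!atomic_compare_exchange_weak(rr_cur, &cur, next));

	return cur;
}
```

Note that a hash of 0 always lands in slot 0 under the multiply-and-shift mapping, so packets without a computed hash all go to the first member.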
* [PATCH 1/2] packet: Add helpers to register/unregister ->prot_hook
From: David Miller @ 2011-07-05 4:20 UTC (permalink / raw)
To: victor; +Cc: netdev
Signed-off-by: David S. Miller <davem@davemloft.net>
---
net/packet/af_packet.c | 103 +++++++++++++++++++++++++++--------------------
1 files changed, 59 insertions(+), 44 deletions(-)
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 461b16f..bb281bf 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -222,6 +222,55 @@ struct packet_skb_cb {
#define PACKET_SKB_CB(__skb) ((struct packet_skb_cb *)((__skb)->cb))
+static inline struct packet_sock *pkt_sk(struct sock *sk)
+{
+ return (struct packet_sock *)sk;
+}
+
+/* register_prot_hook must be invoked with the po->bind_lock held,
+ * or from a context in which asynchronous accesses to the packet
+ * socket is not possible (packet_create()).
+ */
+static void register_prot_hook(struct sock *sk)
+{
+ struct packet_sock *po = pkt_sk(sk);
+ if (!po->running) {
+ dev_add_pack(&po->prot_hook);
+ sock_hold(sk);
+ po->running = 1;
+ }
+}
+
+/* {,__}unregister_prot_hook() must be invoked with the po->bind_lock
+ * held. If the sync parameter is true, we will temporarily drop
+ * the po->bind_lock and do a synchronize_net to make sure no
+ * asynchronous packet processing paths still refer to the elements
+ * of po->prot_hook. If the sync parameter is false, it is the
+ * callers responsibility to take care of this.
+ */
+static void __unregister_prot_hook(struct sock *sk, bool sync)
+{
+ struct packet_sock *po = pkt_sk(sk);
+
+ po->running = 0;
+ __dev_remove_pack(&po->prot_hook);
+ __sock_put(sk);
+
+ if (sync) {
+ spin_unlock(&po->bind_lock);
+ synchronize_net();
+ spin_lock(&po->bind_lock);
+ }
+}
+
+static void unregister_prot_hook(struct sock *sk, bool sync)
+{
+ struct packet_sock *po = pkt_sk(sk);
+
+ if (po->running)
+ __unregister_prot_hook(sk, sync);
+}
+
static inline __pure struct page *pgv_to_page(void *addr)
{
if (is_vmalloc_addr(addr))
@@ -324,11 +373,6 @@ static inline void packet_increment_head(struct packet_ring_buffer *buff)
buff->head = buff->head != buff->frame_max ? buff->head+1 : 0;
}
-static inline struct packet_sock *pkt_sk(struct sock *sk)
-{
- return (struct packet_sock *)sk;
-}
-
static void packet_sock_destruct(struct sock *sk)
{
skb_queue_purge(&sk->sk_error_queue);
@@ -1337,15 +1381,7 @@ static int packet_release(struct socket *sock)
spin_unlock_bh(&net->packet.sklist_lock);
spin_lock(&po->bind_lock);
- if (po->running) {
- /*
- * Remove from protocol table
- */
- po->running = 0;
- po->num = 0;
- __dev_remove_pack(&po->prot_hook);
- __sock_put(sk);
- }
+ unregister_prot_hook(sk, false);
if (po->prot_hook.dev) {
dev_put(po->prot_hook.dev);
po->prot_hook.dev = NULL;
@@ -1392,15 +1428,7 @@ static int packet_do_bind(struct sock *sk, struct net_device *dev, __be16 protoc
lock_sock(sk);
spin_lock(&po->bind_lock);
- if (po->running) {
- __sock_put(sk);
- po->running = 0;
- po->num = 0;
- spin_unlock(&po->bind_lock);
- dev_remove_pack(&po->prot_hook);
- spin_lock(&po->bind_lock);
- }
-
+ unregister_prot_hook(sk, true);
po->num = protocol;
po->prot_hook.type = protocol;
if (po->prot_hook.dev)
@@ -1413,9 +1441,7 @@ static int packet_do_bind(struct sock *sk, struct net_device *dev, __be16 protoc
goto out_unlock;
if (!dev || (dev->flags & IFF_UP)) {
- dev_add_pack(&po->prot_hook);
- sock_hold(sk);
- po->running = 1;
+ register_prot_hook(sk);
} else {
sk->sk_err = ENETDOWN;
if (!sock_flag(sk, SOCK_DEAD))
@@ -1542,9 +1568,7 @@ static int packet_create(struct net *net, struct socket *sock, int protocol,
if (proto) {
po->prot_hook.type = proto;
- dev_add_pack(&po->prot_hook);
- sock_hold(sk);
- po->running = 1;
+ register_prot_hook(sk);
}
spin_lock_bh(&net->packet.sklist_lock);
@@ -2240,9 +2264,7 @@ static int packet_notifier(struct notifier_block *this, unsigned long msg, void
if (dev->ifindex == po->ifindex) {
spin_lock(&po->bind_lock);
if (po->running) {
- __dev_remove_pack(&po->prot_hook);
- __sock_put(sk);
- po->running = 0;
+ __unregister_prot_hook(sk, false);
sk->sk_err = ENETDOWN;
if (!sock_flag(sk, SOCK_DEAD))
sk->sk_error_report(sk);
@@ -2259,11 +2281,8 @@ static int packet_notifier(struct notifier_block *this, unsigned long msg, void
case NETDEV_UP:
if (dev->ifindex == po->ifindex) {
spin_lock(&po->bind_lock);
- if (po->num && !po->running) {
- dev_add_pack(&po->prot_hook);
- sock_hold(sk);
- po->running = 1;
- }
+ if (po->num)
+ register_prot_hook(sk);
spin_unlock(&po->bind_lock);
}
break;
@@ -2530,10 +2549,8 @@ static int packet_set_ring(struct sock *sk, struct tpacket_req *req,
was_running = po->running;
num = po->num;
if (was_running) {
- __dev_remove_pack(&po->prot_hook);
po->num = 0;
- po->running = 0;
- __sock_put(sk);
+ __unregister_prot_hook(sk, false);
}
spin_unlock(&po->bind_lock);
@@ -2564,11 +2581,9 @@ static int packet_set_ring(struct sock *sk, struct tpacket_req *req,
mutex_unlock(&po->pg_vec_lock);
spin_lock(&po->bind_lock);
- if (was_running && !po->running) {
- sock_hold(sk);
- po->running = 1;
+ if (was_running) {
po->num = num;
- dev_add_pack(&po->prot_hook);
+ register_prot_hook(sk);
}
spin_unlock(&po->bind_lock);
--
1.7.5.4
^ permalink raw reply related
* [PATCH 0/2] AF_PACKET fanout support
From: David Miller @ 2011-07-05 4:20 UTC (permalink / raw)
To: victor; +Cc: netdev
This is a fully functional version, I've tested both hash and
load-balance modes successfully. I plan to commit this to
net-next-2.6 very soon.
Below is a test program that other people can play with
if they want. It basically creates 4 threads, and creates
an AF_PACKET fanout amongst them. Each thread prints out
its pid in parentheses every time it receives 10 packets.
After each thread processes 10,000 packets, it exits.
Try things like "./test eth0 hash", "./test eth0 lb", etc.
Signed-off-by: David S. Miller <davem@davemloft.net>
--------------------
#include <stddef.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

#include <sys/types.h>
#include <sys/wait.h>
#include <sys/socket.h>
#include <sys/ioctl.h>

#include <unistd.h>

#include <arpa/inet.h>		/* htons() */
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <net/if.h>

static const char *device_name;
static int fanout_type;
static int fanout_id;

#ifndef PACKET_FANOUT
#define PACKET_FANOUT		18
#define PACKET_FANOUT_HASH	0
#define PACKET_FANOUT_LB	1
#endif

/* Create an AF_PACKET socket, bind it to device_name and join the
 * fanout.  Returns the socket fd, or -1 on failure.
 */
static int setup_socket(void)
{
	int err, fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_IP));
	struct sockaddr_ll ll;
	struct ifreq ifr;
	int fanout_arg;

	if (fd < 0) {
		perror("socket");
		return -1;
	}

	memset(&ifr, 0, sizeof(ifr));
	/* ifr was zeroed above, so the name stays NUL-terminated. */
	strncpy(ifr.ifr_name, device_name, IFNAMSIZ - 1);
	err = ioctl(fd, SIOCGIFINDEX, &ifr);
	if (err < 0) {
		perror("SIOCGIFINDEX");
		return -1;
	}

	memset(&ll, 0, sizeof(ll));
	ll.sll_family = AF_PACKET;
	ll.sll_ifindex = ifr.ifr_ifindex;
	err = bind(fd, (struct sockaddr *) &ll, sizeof(ll));
	if (err < 0) {
		perror("bind");
		return -1;
	}

	/* Low 16 bits carry the fanout ID, the bits above it the type. */
	fanout_arg = (fanout_id | (fanout_type << 16));
	err = setsockopt(fd, SOL_PACKET, PACKET_FANOUT,
			 &fanout_arg, sizeof(fanout_arg));
	if (err) {
		perror("setsockopt");
		return -1;
	}

	return fd;
}

/* Per-child receive loop; never returns. */
static void fanout_thread(void)
{
	int fd = setup_socket();
	int limit = 10000;

	if (fd < 0)
		exit(EXIT_FAILURE);

	while (limit-- > 0) {
		char buf[1600];
		int err;

		err = read(fd, buf, sizeof(buf));
		if (err < 0) {
			perror("read");
			exit(EXIT_FAILURE);
		}
		if ((limit % 10) == 0)
			fprintf(stdout, "(%d) \n", getpid());
	}

	fprintf(stdout, "%d: Received 10000 packets\n", getpid());

	close(fd);
	exit(0);
}

int main(int argc, char **argp)
{
	int i;

	if (argc != 3) {
		fprintf(stderr, "Usage: %s INTERFACE {hash|lb}\n", argp[0]);
		return EXIT_FAILURE;
	}

	if (!strcmp(argp[2], "hash"))
		fanout_type = PACKET_FANOUT_HASH;
	else if (!strcmp(argp[2], "lb"))
		fanout_type = PACKET_FANOUT_LB;
	else {
		fprintf(stderr, "Unknown fanout type [%s]\n", argp[2]);
		exit(EXIT_FAILURE);
	}

	device_name = argp[1];
	fanout_id = getpid() & 0xffff;

	for (i = 0; i < 4; i++) {
		pid_t pid = fork();

		switch (pid) {
		case 0:
			fanout_thread();	/* does not return */
			break;
		case -1:
			perror("fork");
			exit(EXIT_FAILURE);
		}
	}

	for (i = 0; i < 4; i++) {
		int status;

		wait(&status);
	}

	return 0;
}
^ permalink raw reply
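Editorial note: the PACKET_FANOUT option value used by the test program packs the 16-bit fanout ID into the low half of an int and the policy type into the bits above it, and the kernel side splits it again with "val & 0xffff" and "val >> 16". A small sketch of that encoding follows; the helper names fanout_arg_pack, fanout_arg_id and fanout_arg_type are hypothetical and exist in neither the patch nor the test program.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers illustrating the option value layout. */

/* Build the value passed to setsockopt(..., PACKET_FANOUT, ...). */
static int fanout_arg_pack(uint16_t id, uint8_t type)
{
	int arg = (int)id | ((int)type << 16);

	assert((arg & 0xffff) == id);
	return arg;
}

/* Recover the fanout ID, mirroring "val & 0xffff" in the patch. */
static uint16_t fanout_arg_id(int arg)
{
	return (uint16_t)(arg & 0xffff);
}

/* Recover the policy type, mirroring "val >> 16" narrowed to a u8. */
static uint8_t fanout_arg_type(int arg)
{
	return (uint8_t)(arg >> 16);
}
```

With ID 0x1234 and type 1 (PACKET_FANOUT_LB in the patch), the packed value is 0x11234. Because the ID is derived from getpid() & 0xffff in the test program, two unrelated runs could in principle pick the same ID within one network namespace.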