Netdev List
* Re: [Bugme-new] [Bug 35862] New: arp requests from wrong src IP
From: David Miller @ 2011-05-26  1:52 UTC (permalink / raw)
  To: akpm; +Cc: netdev, bugzilla-daemon, bugme-daemon, matare
In-Reply-To: <20110525163137.6f04f26e.akpm@linux-foundation.org>

From: Andrew Morton <akpm@linux-foundation.org>
Date: Wed, 25 May 2011 16:31:37 -0700

>> I switched a host's ip address from 137.226.164.13 to 137.226.164.2. The .13 IP
>> now belongs to the host that had .2 before (I swapped them). Now both hosts
>> still arp from their old IPs although ifconfig as well as ip clearly tell
>> otherwise. Examining the host which now has 137.226.164.13:
>> 
>> # ip addr show dev eth0
>> 4: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
>>     link/ether 00:e0:81:41:1f:e4 brd ff:ff:ff:ff:ff:ff
>>     inet 137.226.164.2/24 brd 137.226.164.255 scope global eth0
>>     inet 192.168.23.2/24 brd 137.226.164.255 scope global eth0:0

If you keep the old IP address around it remains as the "primary"
IP address.

You have to explicitly remove the original IP address from the
interface first, then add the new one, in order for the new
one to become the "primary".

Not a bug, please close this.


* Re: [patch 1/1] net: convert %p usage to %pK
From: David Miller @ 2011-05-26  1:50 UTC (permalink / raw)
  To: kees.cook
  Cc: eric.dumazet, joe, mingo, akpm, netdev, drosenberg, a.p.zijlstra,
	eparis, eugeneteo, jmorris, tgraf
In-Reply-To: <20110525232921.GD19633@outflux.net>

From: Kees Cook <kees.cook@canonical.com>
Date: Wed, 25 May 2011 16:29:21 -0700

> Hi David,
> 
> On Tue, May 24, 2011 at 03:58:01AM -0400, David Miller wrote:
>> From: Eric Dumazet <eric.dumazet@gmail.com>
>> Date: Tue, 24 May 2011 09:45:01 +0200
>> 
>> > On Tuesday, 24 May 2011 at 00:35 -0700, Joe Perches wrote:
>> > 
>> >> I think it'd be better without the casts
>> >> using the standard kernel.h macros.
>> >> 
>> >> 	void *ptr;
>> >> 
>> >> 	ptr = maybe_hide_ptr(sk);
>> >> 	r->id.idiag_cookie[0] = lower_32_bits(ptr);
>> >> 	r->id.idiag_cookie[1] = upper_32_bits(ptr);
>> >> 
>> > 
>> > I am not sure I want to patch lower_32_bits() and upper_32_bits() for
>> > this.
>> > 
>> > They don't work on pointers, but on "numbers", according to the kerneldoc
>> > Andrew wrote years ago. gcc agrees:
>> > 
>> > net/ipv4/inet_diag.c: In function ‘inet_csk_diag_fill’:
>> > net/ipv4/inet_diag.c:119: warning: cast from pointer to integer of different size
>> > net/ipv4/inet_diag.c:120: error: invalid operands to binary >>
>> > make[1]: *** [net/ipv4/inet_diag.o] Error 1
>> 
>> Also you can't do this, the "cookie" is used by the kernel in future
>> lookups to find sockets.
>> 
>> The kernel pointer is part of the API, so sorry you can't "hide"
>> kernel pointers in this case without really breaking user visible
>> things.
> 
> But this is precisely what we're trying to control with kptr_restrict.
> Setting kptr_restrict will make inet_diag (and some details of similar
> things in /proc) meaningless. Based on the name, "diag" isn't going to be
> used in normal operation, and kptr_restrict is 0 by default, so only system
> owners interested in this will enable it and effectively disable inet_diag.

Are you kidding me?

inet_diag is the standard way to dump sockets using netlink.
It's not a special obscure debugging facility, it's for real
users.

And the encoded kernel pointer here is used as a shortcut to looking
up precise sockets.
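
To make the cookie point concrete, here is a minimal userspace sketch, assuming the cookie is simply the socket's address split into two 32-bit words (the thread itself only shows the low word being filled). The helper names cookie_from_ptr and cookie_matches are hypothetical and are not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

static_assert(sizeof(uintptr_t) <= sizeof(uint64_t),
	      "a pointer must fit in two 32-bit cookie words");

/* Hypothetical helper: split an address into the two 32-bit halves
 * that an idiag_cookie[2]-style array would carry. */
static void cookie_from_ptr(const void *ptr, uint32_t cookie[2])
{
	uint64_t v = (uint64_t)(uintptr_t)ptr;

	cookie[0] = (uint32_t)v;		/* low 32 bits */
	cookie[1] = (uint32_t)(v >> 32);	/* high 32 bits, 0 on 32-bit hosts */
}

/* Hypothetical helper: a later lookup only succeeds if the cookie handed
 * back by user space still encodes the same address. */
static int cookie_matches(const void *ptr, const uint32_t cookie[2])
{
	uint32_t expect[2];

	cookie_from_ptr(ptr, expect);
	return expect[0] == cookie[0] && expect[1] == cookie[1];
}
```

If the value written into the cookie were hidden or zeroed, a comparison like cookie_matches() could no longer identify the socket, which is the breakage described above.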


* atl1c suspend issue - remove_proc_entry: removing non-empty directory 'irq/44', leaking at least 'smp_affinity_list'
From: Parag Warudkar @ 2011-05-26  1:50 UTC (permalink / raw)
  To: netdev; +Cc: linux-kernel

Got this on suspend:

[  115.182723] cfg80211:     (2402000 KHz - 2472000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
[  115.182732] cfg80211:     (2457000 KHz - 2482000 KHz @ 20000 KHz), (300 mBi, 2000 mBm)
[  115.182740] cfg80211:     (2474000 KHz - 2494000 KHz @ 20000 KHz), (300 mBi, 2000 mBm)
[  115.182747] cfg80211:     (5170000 KHz - 5250000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
[  115.182755] cfg80211:     (5735000 KHz - 5835000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
[  115.292922] ------------[ cut here ]------------
[  115.292949] WARNING: at fs/proc/generic.c:849 remove_proc_entry+0x26e/0x280()
[  115.292959] Hardware name: 0876                            
[  115.292969] remove_proc_entry: removing non-empty directory 'irq/44', leaking at least 'smp_affinity_list'
[  115.292979] Modules linked in: cryptd aes_x86_64 aes_generic parport_pc ppdev nls_utf8 udf crc_itu_t fuse binfmt_misc joydev snd_hda_codec_hdmi snd_hda_codec_realtek i915 snd_hda_intel snd_hda_codec arc4 drm_kms_helper drm snd_hwdep i2c_algo_bit iwlagn snd_pcm snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq cfbcopyarea cfbimgblt snd_timer cfbfillrect mac80211 snd_seq_device snd uvcvideo usb_storage soundcore cfg80211 videodev ideapad_laptop v4l2_compat_ioctl32 psmouse video snd_page_alloc serio_raw intel_ips sparse_keymap btusb mac_hid lp bluetooth parport ext4 mbcache jbd2 ahci libahci libata atl1c
[  115.293170] Pid: 749, comm: NetworkManager Not tainted 2.6.39+ #11
[  115.293179] Call Trace:
[  115.293204]  [<ffffffff8105875f>] warn_slowpath_common+0x7f/0xc0
[  115.293218]  [<ffffffff81058856>] warn_slowpath_fmt+0x46/0x50
[  115.293232]  [<ffffffff8119f84e>] remove_proc_entry+0x26e/0x280
[  115.293251]  [<ffffffff811f59e0>] ? sprintf+0x40/0x50
[  115.293270]  [<ffffffff810b7db7>] unregister_irq_proc+0xb7/0xe0
[  115.293285]  [<ffffffff810b3a4c>] free_desc+0x2c/0x70
[  115.293297]  [<ffffffff810b3ada>] irq_free_descs+0x4a/0x90
[  115.293314]  [<ffffffff81029c9b>] free_irq_at+0x3b/0x50
[  115.293329]  [<ffffffff8102bc7b>] destroy_irq+0x7b/0x90
[  115.293343]  [<ffffffff8102bf0e>] native_teardown_msi_irq+0xe/0x10
[  115.293359]  [<ffffffff8122282f>] default_teardown_msi_irqs+0x6f/0x90
[  115.293374]  [<ffffffff81222216>] free_msi_irqs+0x96/0x130
[  115.293387]  [<ffffffff81222ec5>] pci_disable_msi+0x45/0x50
[  115.293414]  [<ffffffffa0002ef7>] atl1c_down+0xc7/0x110 [atl1c]
[  115.293434]  [<ffffffffa00034f8>] atl1c_close+0x28/0x50 [atl1c]
[  115.293452]  [<ffffffff8138f666>] __dev_close_many+0x86/0xd0
[  115.293467]  [<ffffffff8138f6e6>] __dev_close+0x36/0x50
[  115.293480]  [<ffffffff81395681>] __dev_change_flags+0xa1/0x180
[  115.293492]  [<ffffffff81395828>] dev_change_flags+0x28/0x70
[  115.293508]  [<ffffffff813a3220>] do_setlink+0x200/0x9f0
[  115.293527]  [<ffffffff81047ef0>] ? update_curr+0x100/0x1a0
[  115.293541]  [<ffffffff81205390>] ? nla_parse+0x30/0xd0
[  115.293555]  [<ffffffff813a3aff>] rtnl_setlink+0xef/0x130
[  115.293572]  [<ffffffff813a16bf>] rtnetlink_rcv_msg+0x20f/0x240
[  115.293587]  [<ffffffff813a14b0>] ? rtnetlink_net_init+0x50/0x50
[  115.293604]  [<ffffffff813bb8e9>] netlink_rcv_skb+0xa9/0xd0
[  115.293620]  [<ffffffff813a2055>] rtnetlink_rcv+0x25/0x40
[  115.293635]  [<ffffffff813bb223>] netlink_unicast+0x2d3/0x2f0
[  115.293648]  [<ffffffff8138998d>] ? memcpy_fromiovec+0x7d/0xa0
[  115.293662]  [<ffffffff813bb462>] netlink_sendmsg+0x222/0x360
[  115.293678]  [<ffffffff8137d0cf>] sock_sendmsg+0xef/0x120
[  115.293695]  [<ffffffff814201dd>] ? unix_dgram_sendmsg+0x5cd/0x650
[  115.293711]  [<ffffffff8137d0cf>] ? sock_sendmsg+0xef/0x120
[  115.293725]  [<ffffffff8137ecc0>] ? move_addr_to_kernel+0x50/0x60
[  115.293738]  [<ffffffff81389a32>] ? verify_iovec+0x82/0xf0
[  115.293751]  [<ffffffff8137e79d>] __sys_sendmsg+0x1dd/0x340
[  115.293765]  [<ffffffff8137b893>] ? sock_destroy_inode+0x33/0x40
[  115.293783]  [<ffffffff8112dfd0>] ? kmem_cache_free+0x20/0xe0
[  115.293798]  [<ffffffff8137f796>] ? sys_sendto+0x156/0x190
[  115.293813]  [<ffffffff8115d65f>] ? mntput+0x1f/0x30
[  115.293827]  [<ffffffff8137fc19>] sys_sendmsg+0x49/0x90
[  115.293847]  [<ffffffff81487c82>] system_call_fastpath+0x16/0x1b
[  115.293858] ---[ end trace e0ec9dc53f93f46e ]---
[  116.649267] EXT4-fs (sda6): re-mounted. Opts: errors=remount-ro,commit=0
[  116.652785] EXT4-fs (sda1): re-mounted. Opts: commit=0
[  117.994536] PM: Syncing filesystems ... done.
[  117.996922] PM: Preparing system for mem sleep
[  118.449261] Freezing user space processes ... (elapsed 0.01 seconds) done.
[  118.462504] Freezing remaining freezable tasks ... (elapsed 0.01 seconds) done.
[  118.475800] PM: Entering mem sleep
[  118.475902] Suspending console(s) (use no_console_suspend to debug)
[  118.476562] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[  118.478045] sd 0:0:0:0: [sda] Stopping disk
[  118.608869] ehci_hcd 0000:00:1a.0: PCI INT A disabled
[  118.608903] ehci_hcd 0000:00:1d.0: PCI INT A disabled


* Re: [PATCH 1/1] IPVS : bug in ip_vs_ftp, same list head used in all netns.
From: Simon Horman @ 2011-05-26  1:48 UTC (permalink / raw)
  To: Hans Schillstrom; +Cc: ja, wensong, lvs-devel, netdev, netfilter-devel, hans
In-Reply-To: <1306239065-17271-1-git-send-email-hans.schillstrom@ericsson.com>

On Tue, May 24, 2011 at 02:11:05PM +0200, Hans Schillstrom wrote:
> When ip_vs was adapted to netns, the ftp application was not adapted
> correctly.
> This is a fix to avoid kernel errors; in the long term another solution
> might be chosen, i.e. the ports that the ftp app uses should be per netns.
> 
> Signed-off-by: Hans Schillstrom <hans.schillstrom@ericsson.com>

Julian, do you have any thoughts on this?

> ---
>  include/net/ip_vs.h            |    3 ++-
>  net/netfilter/ipvs/ip_vs_ftp.c |   27 +++++++++++++++++++--------
>  2 files changed, 21 insertions(+), 9 deletions(-)
> 
> diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
> index 4fff432..481f856 100644
> --- a/include/net/ip_vs.h
> +++ b/include/net/ip_vs.h
> @@ -797,7 +797,8 @@ struct netns_ipvs {
>  	struct list_head	rs_table[IP_VS_RTAB_SIZE];
>  	/* ip_vs_app */
>  	struct list_head	app_list;
> -
> +	/* ip_vs_ftp */
> +	struct ip_vs_app	*ftp_app;
>  	/* ip_vs_proto */
>  	#define IP_VS_PROTO_TAB_SIZE	32	/* must be power of 2 */
>  	struct ip_vs_proto_data *proto_data_table[IP_VS_PROTO_TAB_SIZE];
> diff --git a/net/netfilter/ipvs/ip_vs_ftp.c b/net/netfilter/ipvs/ip_vs_ftp.c
> index 6b5dd6d..af63553 100644
> --- a/net/netfilter/ipvs/ip_vs_ftp.c
> +++ b/net/netfilter/ipvs/ip_vs_ftp.c
> @@ -411,25 +411,35 @@ static struct ip_vs_app ip_vs_ftp = {
>  static int __net_init __ip_vs_ftp_init(struct net *net)
>  {
>  	int i, ret;
> -	struct ip_vs_app *app = &ip_vs_ftp;
> +	struct ip_vs_app *app;
> +	struct netns_ipvs *ipvs = net_ipvs(net);
> +
> +	app = kmemdup(&ip_vs_ftp, sizeof(struct ip_vs_app), GFP_KERNEL);
> +	if (!app)
> +		return -ENOMEM;
> +	INIT_LIST_HEAD(&app->a_list);
> +	INIT_LIST_HEAD(&app->incs_list);
> +	ipvs->ftp_app = app;
>  
>  	ret = register_ip_vs_app(net, app);
>  	if (ret)
> -		return ret;
> +		goto err_exit;
>  
>  	for (i=0; i<IP_VS_APP_MAX_PORTS; i++) {
>  		if (!ports[i])
>  			continue;
>  		ret = register_ip_vs_app_inc(net, app, app->protocol, ports[i]);
>  		if (ret)
> -			break;
> +			goto err_unreg;
>  		pr_info("%s: loaded support on port[%d] = %d\n",
>  			app->name, i, ports[i]);
>  	}
> +	return 0;
>  
> -	if (ret)
> -		unregister_ip_vs_app(net, app);
> -
> +err_unreg:
> +	unregister_ip_vs_app(net, app);
> +err_exit:
> +	kfree(ipvs->ftp_app);
>  	return ret;
>  }
>  /*
> @@ -437,9 +447,10 @@ static int __net_init __ip_vs_ftp_init(struct net *net)
>   */
>  static void __ip_vs_ftp_exit(struct net *net)
>  {
> -	struct ip_vs_app *app = &ip_vs_ftp;
> +	struct netns_ipvs *ipvs = net_ipvs(net);
>  
> -	unregister_ip_vs_app(net, app);
> +	unregister_ip_vs_app(net, ipvs->ftp_app);
> +	kfree(ipvs->ftp_app);
>  }
>  
>  static struct pernet_operations ip_vs_ftp_ops = {
> -- 
> 1.7.2.3
> 


* Re: [GIT PULL] Namespace file descriptors for 2.6.40
From: Eric W. Biederman @ 2011-05-25 23:40 UTC (permalink / raw)
  To: C Anthony Risinger
  Cc: Serge E. Hallyn, Linux Containers, netdev, linux-kernel
In-Reply-To: <BANLkTinbw6pZjhMscfXFMArd=XU=VC=+eQ@mail.gmail.com>

C Anthony Risinger <anthony@xtfx.me> writes:

> On Wed, May 25, 2011 at 4:38 PM, Serge E. Hallyn <serge@hallyn.com> wrote:
>> Quoting C Anthony Risinger (anthony@xtfx.me):
>>> On Mon, May 23, 2011 at 4:05 PM, Eric W. Biederman
>>> <ebiederm@xmission.com> wrote:
>>> >
>>> > This tree adds the files /proc/<pid>/ns/net, /proc/<pid>/ns/ipc,
>>> > /proc/<pid>/ns/uts that can be opened to refer to the namespaces of a
>>> > process at the time those files are opened, and can be bind mounted to
>>> > keep the specified namespace alive without a process.
>>> >
>>> > This tree adds the setns system call that can be used to change the
>>> > specified namespace of a process to the namespace specified by a system
>>> > call.
>>>
>>> i just have a quick question regarding these, apologies if wrong place
>>> to respond -- i trimmed to lists only.
>>>
>>> if i understand correctly, mount namespaces (for example), allow one
>>> to build such constructs as "private /tmp" and similar that even
>>> `root` cannot access ... and there are many reasons `root` does not
>>> deserve to completely know/interact with user processes (FUSE makes a
>>> good example ... just because i [user] have SSH access to a machine,
>>> why should `root`?)
>>>
>>> would these /proc additions break such guarantees?  IOW, would it now
>>> become possible for `root` to inject stuff into my private namespaces,
>>> and/or has these guarantees never existed and i am mistaken?  is there
>>> any kind of ACL mechanism that endows the origin process (or similar)
>>> with the ability to dictate who can hold and/or interact with these
>>> references?
>>
>> If for instance you have a file open in your private /tmp, then root
>> in another mounts ns can open the file through /proc/$$/fd/N anyway.
>> If it's a directory, he can now traverse the whole fs.
>
> aaah right :-( ... there's always another way isn't there ... curse
> you Linux for being so flexible! (just kidding baby i love you)

Even more significant, access to the new files is guarded by the
ptrace access checks.  And if root can ptrace your process, root
can remote-control your process.

> this seems like a more fundamental issue then?  or should i not expect
> to be able to achieve separation like this?  i ask in the context of
> OS virt via cgroups + namespaces, eg. LXC et al, because i'm about to
> perform a massive overhaul to our crusty sub-2.6.18 infrastructure and
> i've used/followed these technologies for couple years now ... and
> it's starting to feel like "the right time".

I don't think anything really new is allowed, but we haven't designed
anything that radically reduces the power of root either.

At some point we may have the user namespace done and that should
give you a root like user with vastly reduced powers, but we aren't
there yet.

Eric
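
A rough sketch of how a process might use the /proc/<pid>/ns/* files and the setns system call described in the pull request, assuming a C library that provides a setns() wrapper (older libraries would need syscall(2) instead). ns_path and join_ns are hypothetical helper names. Opening the file is where the ptrace-style access check mentioned above would apply.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: build "/proc/<pid>/ns/<ns>" into buf.
 * Returns 0 on success, -1 if the path does not fit. */
static int ns_path(char *buf, size_t len, long pid, const char *ns)
{
	int n;

	assert(buf != NULL && ns != NULL);
	n = snprintf(buf, len, "/proc/%ld/ns/%s", pid, ns);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Hypothetical helper: move the calling process into one namespace
 * ("net", "ipc" or "uts") of process pid. Returns 0 on success, -1 on error. */
static int join_ns(long pid, const char *ns)
{
	char path[64];
	int fd, ret;

	if (ns_path(path, sizeof(path), pid, ns) < 0)
		return -1;

	fd = open(path, O_RDONLY);	/* fails if the access check denies us */
	if (fd < 0)
		return -1;

	ret = setns(fd, 0);		/* 0: accept whatever namespace type fd refers to */
	close(fd);
	return ret;
}
```

Because the file descriptor keeps the namespace alive, a caller could also hold fd open instead of closing it, much like the bind-mount approach described in the quoted announcement.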


* [PATCH] af-packet:  Add flag to distinguish VID 0 from no-vlan.
From: greearb @ 2011-05-25 23:36 UTC (permalink / raw)
  To: netdev; +Cc: Ben Greear

From: Ben Greear <greearb@candelatech.com>

Currently, user-space cannot determine if a 0 tp_vlan_tci
means there is no VLAN tag or the VLAN ID was zero.

Add flag to make this explicit.  User-space can check for
TP_STATUS_VLAN_VALID || tp_vlan_tci > 0, which will be backwards
compatible. Older code would have just checked for tp_vlan_tci,
so it will work no worse than before.

Signed-off-by: Ben Greear <greearb@candelatech.com>
---
:100644 100644 72bfa5a... 6d66ce1... M	include/linux/if_packet.h
:100644 100644 658edd1... 885d76d... M	net/packet/af_packet.c
 include/linux/if_packet.h |    1 +
 net/packet/af_packet.c    |    7 ++++++-
 2 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/include/linux/if_packet.h b/include/linux/if_packet.h
index 72bfa5a..6d66ce1 100644
--- a/include/linux/if_packet.h
+++ b/include/linux/if_packet.h
@@ -70,6 +70,7 @@ struct tpacket_auxdata {
 #define TP_STATUS_COPY		0x2
 #define TP_STATUS_LOSING	0x4
 #define TP_STATUS_CSUMNOTREADY	0x8
+#define TP_STATUS_VLAN_VALID   0x10 /* auxdata has valid tp_vlan_tci */
 
 /* Tx ring - header status */
 #define TP_STATUS_AVAILABLE	0x0
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 658edd1..885d76d 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -1772,7 +1772,12 @@ static int packet_recvmsg(struct kiocb *iocb, struct socket *sock,
 		aux.tp_snaplen = skb->len;
 		aux.tp_mac = 0;
 		aux.tp_net = skb_network_offset(skb);
-		aux.tp_vlan_tci = vlan_tx_tag_get(skb);
+		if (vlan_tx_tag_present(skb)) {
+			aux.tp_vlan_tci = vlan_tx_tag_get(skb);
+			aux.tp_status |= TP_STATUS_VLAN_VALID;
+		}
+		else
+			aux.tp_vlan_tci = 0;
 
 		put_cmsg(msg, SOL_PACKET, PACKET_AUXDATA, sizeof(aux), &aux);
 	}
-- 
1.7.3.4
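
The backwards-compatible check described in the commit message could look roughly like this on the user-space side. This is a sketch under assumptions: struct aux_sketch is a cut-down stand-in for struct tpacket_auxdata, the flag value is copied from the patch, and has_vlan_tag and vlan_id are hypothetical helper names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TP_STATUS_VLAN_VALID 0x10	/* value taken from the patch */

static_assert((TP_STATUS_VLAN_VALID & 0xf) == 0,
	      "flag must not overlap the low four status bits");

/* Stand-in for struct tpacket_auxdata; only the fields used here. */
struct aux_sketch {
	uint32_t tp_status;
	uint16_t tp_vlan_tci;
};

/* True if the packet carried a VLAN tag. The flag covers patched kernels
 * (including VID 0 / priority 0 tags); the non-zero TCI test keeps the old
 * behaviour on kernels that never set the flag. */
static bool has_vlan_tag(const struct aux_sketch *aux)
{
	return (aux->tp_status & TP_STATUS_VLAN_VALID) || aux->tp_vlan_tci != 0;
}

/* The VLAN ID is the low 12 bits of the TCI. */
static uint16_t vlan_id(const struct aux_sketch *aux)
{
	return aux->tp_vlan_tci & 0x0fff;
}
```

On an unpatched kernel a tag whose TCI is entirely zero is still reported as "no tag", which is the "no worse than before" case.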



* Re: [patch 1/1] net: convert %p usage to %pK
From: Kees Cook @ 2011-05-25 23:29 UTC (permalink / raw)
  To: David Miller
  Cc: eric.dumazet, joe, mingo, akpm, netdev, drosenberg, a.p.zijlstra,
	eparis, eugeneteo, jmorris, tgraf
In-Reply-To: <20110524.035801.1555795213632087107.davem@davemloft.net>

Hi David,

On Tue, May 24, 2011 at 03:58:01AM -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 24 May 2011 09:45:01 +0200
> 
> > On Tuesday, 24 May 2011 at 00:35 -0700, Joe Perches wrote:
> > 
> >> I think it'd be better without the casts
> >> using the standard kernel.h macros.
> >> 
> >> 	void *ptr;
> >> 
> >> 	ptr = maybe_hide_ptr(sk);
> >> 	r->id.idiag_cookie[0] = lower_32_bits(ptr);
> >> 	r->id.idiag_cookie[1] = upper_32_bits(ptr);
> >> 
> > 
> > I am not sure I want to patch lower_32_bits() and upper_32_bits() for
> > this.
> > 
> > They don't work on pointers, but on "numbers", according to the kerneldoc
> > Andrew wrote years ago. gcc agrees:
> > 
> > net/ipv4/inet_diag.c: In function ‘inet_csk_diag_fill’:
> > net/ipv4/inet_diag.c:119: warning: cast from pointer to integer of different size
> > net/ipv4/inet_diag.c:120: error: invalid operands to binary >>
> > make[1]: *** [net/ipv4/inet_diag.o] Error 1
> 
> Also you can't do this, the "cookie" is used by the kernel in future
> lookups to find sockets.
> 
> The kernel pointer is part of the API, so sorry you can't "hide"
> kernel pointers in this case without really breaking user visible
> things.

But this is precisely what we're trying to control with kptr_restrict.
Setting kptr_restrict will make inet_diag (and some details of similar
things in /proc) meaningless. Based on the name, "diag" isn't going to be
used in normal operation, and kptr_restrict is 0 by default, so only system
owners interested in this will enable it and effectively disable inet_diag.

It seems like everything that fills idiag_cookie needs to be adjusted, not
just the one instance, too:

$ fgrep 'idiag_cookie[0] = ' net/ipv4/inet_diag.c
	r->id.idiag_cookie[0] = (u32)(unsigned long)sk;
	r->id.idiag_cookie[0] = (u32)(unsigned long)tw;
	r->id.idiag_cookie[0] = (u32)(unsigned long)req;

-Kees

-- 
Kees Cook
Ubuntu Security Team


* Re: [Bugme-new] [Bug 35862] New: arp requests from wrong src IP
From: Andrew Morton @ 2011-05-25 23:31 UTC (permalink / raw)
  To: netdev; +Cc: bugzilla-daemon, bugme-daemon, matare
In-Reply-To: <bug-35862-10286@https.bugzilla.kernel.org/>


(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Wed, 25 May 2011 23:27:48 GMT
bugzilla-daemon@bugzilla.kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=35862
> 
>            Summary: arp requests from wrong src IP
>            Product: Networking
>            Version: 2.5
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: IPV4
>         AssignedTo: shemminger@linux-foundation.org
>         ReportedBy: matare@lih.rwth-aachen.de
>         Regression: No
> 
> 
> I switched a host's ip address from 137.226.164.13 to 137.226.164.2. The .13 IP
> now belongs to the host that had .2 before (I swapped them). Now both hosts
> still arp from their old IPs although ifconfig as well as ip clearly tell
> otherwise. Examining the host which now has 137.226.164.13:
> 
> # ip addr show dev eth0
> 4: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
>     link/ether 00:e0:81:41:1f:e4 brd ff:ff:ff:ff:ff:ff
>     inet 137.226.164.2/24 brd 137.226.164.255 scope global eth0
>     inet 192.168.23.2/24 brd 137.226.164.255 scope global eth0:0
> 
> but arping defaults to the old src IP (.13). I can manually correct this with
> the -s parameter, but it looks like linux still believes that 137.226.164.13 is
> this host's ip address. When I try to manually correct the arp table:
> # arp -s 137.226.164.13 00:30:48:70:91:95
> SIOCSARP: Invalid argument
> # arp -n 137.226.164.13
> 137.226.164.13 (137.226.164.13) -- no entry
> 
> And this is what arping does:
> # tcpdump -ieth0 -c1 -s0 -vvv -n arp & (sleep 1; arping 137.226.164.13 &>
> /dev/null)
> [1] 2217
> tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535
> bytes
> 01:14:37.785126 arp who-has 137.226.164.13 (ff:ff:ff:ff:ff:ff) tell
> 137.226.164.13
> 
> Also, ifconfig doesn't even show the second IP address:
> # ifconfig eth0
> eth0      Link encap:Ethernet  HWaddr 00:e0:81:41:1f:e4  
>           inet addr:137.226.164.2  Bcast:137.226.164.255  Mask:255.255.255.0
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:103996345 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:122352625 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000 
>           RX bytes:52478932087 (48.8 GiB)  TX bytes:110248931949 (102.6 GiB)
>           Interrupt:24
> 
> What's going on here? If this is by design, it's very unintuitive behaviour.
> 



* Re: [RFC 01/01]af_packet: Enhance network capture visibility
From: chetan loke @ 2011-05-25 23:24 UTC (permalink / raw)
  To: Ben Greear; +Cc: netdev, loke.chetan
In-Reply-To: <4DDD8C5E.7040207@candelatech.com>

On Wed, May 25, 2011 at 7:10 PM, Ben Greear <greearb@candelatech.com> wrote:
> On 05/25/2011 04:03 PM, chetan loke wrote:
>>
>> This patch is not complete and is intended to:
>> a) demonstrate the improvements
>> b) gather suggestions
>>
>>
>> Signed-off-by: Chetan Loke<lokec@ccs.neu.edu>
>
>> +struct tpacket3_hdr {
>> +       __u32           tp_status;
>> +       __u32           tp_len;
>> +       __u32           tp_snaplen;
>> +       __u16           tp_mac;
>> +       __u16           tp_net;
>> +       __u32           tp_sec;
>> +       __u32           tp_nsec;
>> +       __u16           tp_vlan_tci;
>> +       long            tp_next_offset;
>> +};
>
> Use fixed-size variables, like __u64 instead of 'long'.  That way,
> you have the same sized msgs on 32 and 64-bit systems.
>

Thanks Ben.


The intent is to also introduce something like

typedef struct {
		uint64_t pkt_sliced:1;
		uint64_t crc_error:1;
		uint64_t code_violation:1; /* if frame had code violation */
		uint64_t num_mpls_labels:4;
		uint64_t num_vlans:3;
		uint64_t l2_type:6;
		uint64_t l3_type:4;
		uint64_t l4_type:4;
		uint64_t l7_type:8;
		uint64_t rsvd:32;
}feature_s1;

typedef struct {
	union {
		feature_s1 f_s1;
               /* future feature goes here */
	}u1;
}feature_variants;

And then embed feature_variants in the pkt_desc.
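
As a cross-check of the proposed layout, the field widths in feature_s1 add up to exactly 64 bits. The sketch below expresses the same fields with explicit shifts instead of bit-fields. It assumes the fields are packed from the least significant bit upward, which the C standard does not guarantee for bit-fields, so a driver or FPGA sharing this word would need to agree on the order. The F_*_SHIFT names and the field_get/field_set helpers are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Shifts derived from the widths in feature_s1, low bits first (assumed order). */
enum {
	F_PKT_SLICED_SHIFT = 0,		/* 1 bit */
	F_CRC_ERROR_SHIFT  = 1,		/* 1 bit */
	F_CODE_VIOL_SHIFT  = 2,		/* 1 bit */
	F_NUM_MPLS_SHIFT   = 3,		/* 4 bits */
	F_NUM_VLANS_SHIFT  = 7,		/* 3 bits */
	F_L2_TYPE_SHIFT    = 10,	/* 6 bits */
	F_L3_TYPE_SHIFT    = 16,	/* 4 bits */
	F_L4_TYPE_SHIFT    = 20,	/* 4 bits */
	F_L7_TYPE_SHIFT    = 24,	/* 8 bits */
	F_RSVD_SHIFT       = 32,	/* 32 bits */
};

static_assert(F_RSVD_SHIFT + 32 == 64, "fields must fill exactly 64 bits");

/* Extract a field of 'width' bits (width < 64) starting at 'shift'. */
static uint64_t field_get(uint64_t word, unsigned shift, unsigned width)
{
	return (word >> shift) & ((UINT64_C(1) << width) - 1);
}

/* Return 'word' with the field at 'shift' replaced by 'val' (truncated to width). */
static uint64_t field_set(uint64_t word, unsigned shift, unsigned width,
			  uint64_t val)
{
	uint64_t mask = ((UINT64_C(1) << width) - 1) << shift;

	return (word & ~mask) | ((val << shift) & mask);
}
```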


Once we have the proposed non-static frame format in place then I am
hoping some vendor can borrow this format, enhance their capture
driver and DMA the data directly in the block. This way we can also
attempt to standardize the block-capture format on linux and make it
easier for smaller FPGA shops.


>
> Thanks,
> Ben
>

Chetan


* [PATCH] af-packet:  Use existing netdev reference for bound sockets.
From: greearb @ 2011-05-25 23:15 UTC (permalink / raw)
  To: netdev; +Cc: Ben Greear

From: Ben Greear <greearb@candelatech.com>

This saves a network device lookup on each packet transmitted,
for sockets that are bound to a network device.

Signed-off-by: Ben Greear <greearb@candelatech.com>
---
:100644 100644 4005b24... 658edd1... M	net/packet/af_packet.c
 net/packet/af_packet.c |   26 +++++++++++++++++++-------
 1 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 4005b24..658edd1 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -987,8 +987,9 @@ static int tpacket_fill_skb(struct packet_sock *po, struct sk_buff *skb,
 static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 {
 	struct sk_buff *skb;
-	struct net_device *dev;
+	struct net_device *dev = NULL;
 	__be16 proto;
+	bool need_rls_dev = false;
 	int ifindex, err, reserve = 0;
 	void *ph;
 	struct sockaddr_ll *saddr = (struct sockaddr_ll *)msg->msg_name;
@@ -1002,6 +1003,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 	err = -EBUSY;
 	if (saddr == NULL) {
 		ifindex	= po->ifindex;
+		dev = po->prot_hook.dev;
 		proto	= po->num;
 		addr	= NULL;
 	} else {
@@ -1017,7 +1019,10 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 		addr	= saddr->sll_addr;
 	}
 
-	dev = dev_get_by_index(sock_net(&po->sk), ifindex);
+	if (!dev) {
+		dev = dev_get_by_index(sock_net(&po->sk), ifindex);
+		need_rls_dev = true;
+	}
 	err = -ENXIO;
 	if (unlikely(dev == NULL))
 		goto out;
@@ -1103,7 +1108,8 @@ out_status:
 	__packet_set_status(po, ph, status);
 	kfree_skb(skb);
 out_put:
-	dev_put(dev);
+	if (need_rls_dev)
+		dev_put(dev);
 out:
 	mutex_unlock(&po->pg_vec_lock);
 	return err;
@@ -1139,8 +1145,9 @@ static int packet_snd(struct socket *sock,
 	struct sock *sk = sock->sk;
 	struct sockaddr_ll *saddr = (struct sockaddr_ll *)msg->msg_name;
 	struct sk_buff *skb;
-	struct net_device *dev;
+	struct net_device *dev = NULL;
 	__be16 proto;
+	bool need_rls_dev = false;
 	unsigned char *addr;
 	int ifindex, err, reserve = 0;
 	struct virtio_net_hdr vnet_hdr = { 0 };
@@ -1161,6 +1168,7 @@ static int packet_snd(struct socket *sock,
 
 	if (saddr == NULL) {
 		ifindex	= po->ifindex;
+		dev = po->prot_hook.dev;
 		proto	= po->num;
 		addr	= NULL;
 	} else {
@@ -1174,8 +1182,11 @@ static int packet_snd(struct socket *sock,
 		addr	= saddr->sll_addr;
 	}
 
+	if (!dev) {
+		dev = dev_get_by_index(sock_net(sk), ifindex);
+		need_rls_dev = true;
+	}
 
-	dev = dev_get_by_index(sock_net(sk), ifindex);
 	err = -ENXIO;
 	if (dev == NULL)
 		goto out_unlock;
@@ -1315,14 +1326,15 @@ static int packet_snd(struct socket *sock,
 	if (err > 0 && (err = net_xmit_errno(err)) != 0)
 		goto out_unlock;
 
-	dev_put(dev);
+	if (need_rls_dev)
+		dev_put(dev);
 
 	return len;
 
 out_free:
 	kfree_skb(skb);
 out_unlock:
-	if (dev)
+	if (dev && need_rls_dev)
 		dev_put(dev);
 out:
 	return err;
-- 
1.7.3.4
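
The reference-handling pattern in this patch (borrow the device cached at bind time, otherwise look it up and remember to release it) can be illustrated outside the kernel. The following is only a sketch: struct dev_sketch and the *_sketch functions are hypothetical stand-ins for struct net_device, dev_get_by_index() and dev_put(), and the "lookup" is simulated by passing in the device it would return.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct dev_sketch {
	int refcnt;
};

/* Stand-in for dev_get_by_index(): takes a reference if the device exists. */
static struct dev_sketch *dev_get_sketch(struct dev_sketch *found)
{
	if (found)
		found->refcnt++;
	return found;
}

/* Stand-in for dev_put(): drops a reference taken earlier. */
static void dev_put_sketch(struct dev_sketch *dev)
{
	assert(dev->refcnt > 0);
	dev->refcnt--;
}

/* 'bound' is the device cached by bind (may be NULL); 'lookup_result' is
 * what an index lookup would return for an unbound socket. */
static int send_sketch(struct dev_sketch *bound, struct dev_sketch *lookup_result)
{
	struct dev_sketch *dev = bound;
	bool need_rls_dev = false;

	if (!dev) {
		dev = dev_get_sketch(lookup_result);
		need_rls_dev = true;
	}
	if (!dev)
		return -ENXIO;

	/* ... build and transmit the packet on dev ... */

	if (need_rls_dev)		/* only drop the reference we took ourselves */
		dev_put_sketch(dev);
	return 0;
}
```

The sketch says nothing about what keeps the bound device alive while it is borrowed; in the patch that relies on the reference the socket already holds through its protocol hook.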



* Re: [RFC 01/01]af_packet: Enhance network capture visibility
From: Ben Greear @ 2011-05-25 23:10 UTC (permalink / raw)
  To: chetan loke; +Cc: netdev
In-Reply-To: <BANLkTimYVUkUWA2XPix2nUL-=rnQKghZQA@mail.gmail.com>

On 05/25/2011 04:03 PM, chetan loke wrote:
> This patch is not complete and is intended to:
> a) demonstrate the improvements
> b) gather suggestions
>
>
> Signed-off-by: Chetan Loke<lokec@ccs.neu.edu>

> +struct tpacket3_hdr {
> +	__u32		tp_status;
> +	__u32		tp_len;
> +	__u32		tp_snaplen;
> +	__u16		tp_mac;
> +	__u16		tp_net;
> +	__u32		tp_sec;
> +	__u32		tp_nsec;
> +	__u16		tp_vlan_tci;
> +	long		tp_next_offset;
> +};

Use fixed-size variables, like __u64 instead of 'long'.  That way,
you have the same sized msgs on 32 and 64-bit systems.

I didn't look at the rest of it in any detail, so no comment there.
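
To illustrate the sizing concern: 'long' is 4 bytes on typical 32-bit ABIs and 8 bytes on 64-bit ones, so a header containing it differs between the two. A sketch of a fixed-layout alternative follows; struct hdr_fixed_sketch is a hypothetical cut-down header showing only the tail of tpacket3_hdr, and the explicit padding field is an assumption rather than something proposed in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tail of the header with fixed-width types only. */
struct hdr_fixed_sketch {
	uint16_t tp_vlan_tci;
	uint16_t tp_pad[3];		/* explicit padding up to an 8-byte boundary */
	uint64_t tp_next_offset;	/* was 'long' in the RFC */
};

/* With explicit padding the layout no longer depends on how strictly the
 * ABI aligns 64-bit members. */
static_assert(offsetof(struct hdr_fixed_sketch, tp_next_offset) == 8,
	      "offset must be identical on 32- and 64-bit builds");
static_assert(sizeof(struct hdr_fixed_sketch) == 16,
	      "size must be identical on 32- and 64-bit builds");
```

In kernel headers the same effect is usually obtained with the __u64 or __aligned_u64 types rather than <stdint.h> names.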

Thanks,
Ben


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



* [RFC 01/01]af_packet: Enhance network capture visibility
From: chetan loke @ 2011-05-25 23:03 UTC (permalink / raw)
  To: netdev, loke.chetan

This patch is not complete and is intended to:
a) demonstrate the improvements
b) gather suggestions


Signed-off-by: Chetan Loke <lokec@ccs.neu.edu>

-----------------------
 include/linux/if_packet.h |   27 ++
 net/packet/af_packet.c    |  637 ++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 632 insertions(+), 32 deletions(-)

-----------------------

diff --git a/include/linux/if_packet.h b/include/linux/if_packet.h
index 72bfa5a..1452f47 100644
--- a/include/linux/if_packet.h
+++ b/include/linux/if_packet.h
@@ -55,6 +55,17 @@ struct tpacket_stats {
 	unsigned int	tp_drops;
 };

+struct tpacket_stats_v3 {
+	unsigned int	tp_packets;
+	unsigned int	tp_drops;
+	unsigned int	tp_plug_q_cnt;
+};
+
+union tpacket_stats_u {
+	struct tpacket_stats stats1;
+	struct tpacket_stats_v3 stats3;
+};
+
 struct tpacket_auxdata {
 	__u32		tp_status;
 	__u32		tp_len;
@@ -102,11 +113,27 @@ struct tpacket2_hdr {
 	__u16		tp_vlan_tci;
 };

+
+struct tpacket3_hdr {
+	__u32		tp_status;
+	__u32		tp_len;
+	__u32		tp_snaplen;
+	__u16		tp_mac;
+	__u16		tp_net;
+	__u32		tp_sec;
+	__u32		tp_nsec;
+	__u16		tp_vlan_tci;
+	long		tp_next_offset;
+};
+
 #define TPACKET2_HDRLEN		(TPACKET_ALIGN(sizeof(struct tpacket2_hdr)) + sizeof(struct sockaddr_ll))

+#define TPACKET3_HDRLEN		(TPACKET_ALIGN(sizeof(struct tpacket3_hdr)) + sizeof(struct sockaddr_ll))
+
 enum tpacket_versions {
 	TPACKET_V1,
 	TPACKET_V2,
+	TPACKET_V3
 };

 /*
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 91cb1d7..8e0bc51 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -164,6 +164,57 @@ struct packet_mreq_max {
 static int packet_set_ring(struct sock *sk, struct tpacket_req *req,
 		int closing, int tx_ring);

+
+#define V3_ALIGNMENT	(4)
+#define ALIGN_4(x)		(((x)+V3_ALIGNMENT-1)&~(V3_ALIGNMENT-1))
+
+
+struct bd_ts{
+	unsigned int ts_sec;
+	union {
+		unsigned int u1_i1[1];
+		struct {
+			unsigned int ts_usec;
+		}ts_s1;
+		struct {
+			unsigned int ts_nsec;
+		}ts_s2;
+	} ts_u1;
+}__attribute__ ((__packed__));
+
+struct  block_desc{
+	uint32_t		block_status;
+	uint32_t		num_pkts;
+	struct bd_ts	ts_first_pkt;
+	struct bd_ts	ts_last_pkt;
+	long			offset_to_first_pkt;
+	uint32_t		seq_num;
+} __attribute__ ((__packed__));
+
+struct kbdq_core{
+	struct pgv		*pkbdq;
+	unsigned int	hdrlen;
+	unsigned char	reset_pending_on_curr_blk;
+	unsigned char   delete_blk_timer;
+	unsigned short	kactive_blk_num;
+	unsigned short	hole_bytes_size;
+	char			*pkblk_start;
+	char			*pkblk_end;
+	int				kblk_size;
+	unsigned int	knum_blocks;
+	unsigned int	knxt_seq_num;
+	char			*prev;
+	char			*nxt_offset;
+	/* last_kactive_blk_num:
+	 * trick to see if user-space has caught up
+	 * in order to avoid refreshing timer when every single pkt arrives.
+	 */
+	unsigned short	last_kactive_blk_num;
+#define DEFAULT_PRB_RETIRE_TMO	(4)
+	unsigned short  retire_blk_tmo;
+	struct timer_list retire_blk_timer;
+};
+
 #define PGV_FROM_VMALLOC 1
 struct pgv {
 	char *buffer;
@@ -179,11 +230,16 @@ struct packet_ring_buffer {
 	unsigned int		pg_vec_order;
 	unsigned int		pg_vec_pages;
 	unsigned int		pg_vec_len;
-
+	struct kbdq_core			prb_bdqc;
 	atomic_t		pending;
 };

 struct packet_sock;
+
+static void prb_open_block(struct kbdq_core *pkc1,struct block_desc *pbd1);
+static void prb_retire_rx_blk_timer_expired(unsigned long data);
+static void _prb_refresh_rx_retire_blk_timer(struct kbdq_core *pkc);
+static void prb_init_blk_timer(struct packet_sock *po,struct kbdq_core *pkc,void (*func) (unsigned long));
 static int tpacket_snd(struct packet_sock *po, struct msghdr *msg);

 static void packet_flush_mclist(struct sock *sk);
@@ -192,6 +248,7 @@ struct packet_sock {
 	/* struct sock has to be the first member of packet_sock */
 	struct sock		sk;
 	struct tpacket_stats	stats;
+	union  tpacket_stats_u	stats_u;
 	struct packet_ring_buffer	rx_ring;
 	struct packet_ring_buffer	tx_ring;
 	int			copy_thresh;
@@ -223,7 +280,14 @@ struct packet_skb_cb {

 #define PACKET_SKB_CB(__skb)	((struct packet_skb_cb *)((__skb)->cb))

-static inline __pure struct page *pgv_to_page(void *addr)
+#define GET_PBDQC_FROM_RB(x)				((struct kbdq_core *)(&(x)->prb_bdqc))
+#define GET_CURR_PBLOCK_DESC_FROM_CORE(x)	((struct block_desc *)((x)->pkbdq[(x)->kactive_blk_num].buffer))
+#define GET_PBLOCK_DESC(x,bid)				((struct block_desc *)((x)->pkbdq[(bid)].buffer))
+
+#define INCREMENT_PRB_BLK_NUM(x) \
+	(((x)->kactive_blk_num < ((x)->knum_blocks-1)) ? ((x)->kactive_blk_num+1) : 0)
+
+static inline struct page *pgv_to_page(void *addr)
 {
 	if (is_vmalloc_addr(addr))
 		return vmalloc_to_page(addr);
@@ -248,8 +312,12 @@ static void __packet_set_status(struct packet_sock *po, void *frame, int status)
 		h.h2->tp_status = status;
 		flush_dcache_page(pgv_to_page(&h.h2->tp_status));
 		break;
+	case TPACKET_V3:
+		pr_err("<%s> TPACKET version not supported.Who is calling?.Dumping stack.\n",__func__);
+		dump_stack();
+		break;
 	default:
-		pr_err("TPACKET version not supported\n");
+		pr_err("<%s> TPACKET version not supported\n",__func__);
 		BUG();
 	}

@@ -274,6 +342,10 @@ static int __packet_get_status(struct packet_sock *po, void *frame)
 	case TPACKET_V2:
 		flush_dcache_page(pgv_to_page(&h.h2->tp_status));
 		return h.h2->tp_status;
+	case TPACKET_V3:
+		pr_err("<%s> TPACKET version:%d not supported.Dumping stack.\n",__func__,po->tp_version);
+		dump_stack();
+		return 0;
 	default:
 		pr_err("TPACKET version not supported\n");
 		BUG();
@@ -309,9 +381,234 @@ static inline void *packet_current_frame(struct packet_sock *po,
 		struct packet_ring_buffer *rb,
 		int status)
 {
-	return packet_lookup_frame(po, rb, rb->head, status);
+	switch (po->tp_version) {
+		case TPACKET_V1:
+		case TPACKET_V2:
+			return packet_lookup_frame(po, rb, rb->head, status);
+		case TPACKET_V3:
+			pr_err("<%s> TPACKET version:%d not supported.Dumping stack.\n",__func__,po->tp_version);
+			dump_stack();
+			return 0;
+		default:
+			pr_err("<%s> TPACKET version not supported\n",__func__);
+			BUG();
+			return 0;
+	}
+}
+
+static void prb_flush_block(struct block_desc *pbd1)
+{
+	flush_dcache_page(pgv_to_page(pbd1));
+}
+
+/* Side effect:
+ * 1)flush the block-header
+ * 2)Increment active_blk_num
+ */
+static void prb_close_block(struct kbdq_core *pkc1,struct block_desc *pbd1)
+{
+	
+	//long size = pkc1->pkblk_end - pkc1->nxt_offset;
+	pbd1->block_status = TP_STATUS_USER;
+
+	/* Get the ts of the last pkt */
+	if (pbd1->num_pkts) {
+		struct tpacket3_hdr *ph = (struct tpacket3_hdr *)pkc1->prev;
+		pbd1->ts_last_pkt.ts_sec		= ph->tp_sec;
+		pbd1->ts_last_pkt.ts_u1.ts_s2.ts_nsec	= ph->tp_nsec;
+	} else {
+		/* Ok, we tmo'd - so get the current time */
+		struct timespec ts;
+		getnstimeofday(&ts);
+		pbd1->ts_last_pkt.ts_sec		= ts.tv_sec;
+		pbd1->ts_last_pkt.ts_u1.ts_s2.ts_nsec	= ts.tv_nsec;
+	}
+
+	prb_flush_block(pbd1);
+	pkc1->kactive_blk_num = INCREMENT_PRB_BLK_NUM(pkc1);
+}
+
+static inline void prb_unplug_queue(struct kbdq_core *pkc) {
+	pkc->reset_pending_on_curr_blk=0;
+}
+
+/* Side effect of opening a block:
+ * 1) prb_queue is unplugged.
+ * 2) retire_blk_timer is refreshed.
+ */
+static void prb_open_block(struct kbdq_core *pkc1,struct block_desc *pbd1)
+{
+	struct timespec ts;
+
+	pbd1->block_status	= TP_STATUS_KERNEL;
+	getnstimeofday(&ts);
+	pbd1->num_pkts		= 0;
+	pbd1->ts_first_pkt.ts_sec				= ts.tv_sec;
+	pbd1->ts_first_pkt.ts_u1.ts_s2.ts_nsec	= ts.tv_nsec;
+	pkc1->pkblk_start	= (char *)pbd1;
+	pbd1->seq_num		= pkc1->knxt_seq_num++;
+	pkc1->nxt_offset	= (char *)(pkc1->pkblk_start + sizeof(struct block_desc));
+	
+	pbd1->offset_to_first_pkt    = (long)sizeof(struct block_desc);
+
+	pkc1->prev			= pkc1->nxt_offset;
+	pkc1->pkblk_end		= pkc1->pkblk_start + pkc1->kblk_size;
+
+	prb_unplug_queue(pkc1);
+	_prb_refresh_rx_retire_blk_timer(pkc1);
+}
+
+static inline void prb_plug_queue(struct kbdq_core *pkc,struct packet_sock *po) {
+	pkc->reset_pending_on_curr_blk=1;
+	po->stats_u.stats3.tp_plug_q_cnt++;
+}
+
+static void *prb_try_next_block(struct kbdq_core *pkc,struct packet_sock *po)
+{
+	struct block_desc *pbd = GET_CURR_PBLOCK_DESC_FROM_CORE(pkc);
+
+	/* close current block */
+	if (likely(TP_STATUS_KERNEL == pbd->block_status)) {
+		prb_close_block(pkc,pbd);
+	} else {
+		printk("<%s> ERROR - pbd[%d]:%p\n",__func__,pkc->kactive_blk_num,pbd);
+		BUG();
+	}
+
+	/* Get the next block num */
+	pbd = GET_CURR_PBLOCK_DESC_FROM_CORE(pkc);
+	
+	smp_mb();
+	
+	/* If the curr_block is currently in_use then plug the queue */
+	if (TP_STATUS_USER == pbd->block_status) {
+		    prb_plug_queue(pkc,po);
+			return NULL;
+	}
+	/* open next block */
+	prb_open_block(pkc,pbd);
+	return (void *)pkc->nxt_offset;
+}
+
+#define TOTAL_PKT_LEN_INCL_ALIGN(length) (ALIGN_4((length)))
+
+static void prb_fill_curr_block(char *curr,struct kbdq_core *pkc,struct block_desc *pbd,unsigned int len)
+{
+	struct tpacket3_hdr *ppd;
+	struct tpacket3_hdr *prev;
+
+	ppd  = (struct tpacket3_hdr *)curr;
+	prev = (struct tpacket3_hdr *)pkc->prev;
+	/* lets do pd_s1 for V4 header */
+	//ppd->pd_u1.pd_s1.nxt_offset = 0;
+	//((struct tpacket3_hdr *)pkc->prev)->pd_u1.pd_s1.next_offset = (char *)ppd - pkc->prev;
+	ppd->tp_next_offset = 0;
+	if (pkc->prev > (char *)ppd) {
+		printk("<%s> curr:0x%p len:%d pkc->prev:%p \n",__func__,curr,len,pkc->prev);
+		BUG();
+	}
+	prev->tp_next_offset = (long)ppd - (long)pkc->prev;
+	pkc->prev = curr;
+	pkc->nxt_offset += TOTAL_PKT_LEN_INCL_ALIGN(len);
+	pbd->num_pkts += 1;
+}
+
+static inline int prb_curr_blk_in_use(struct kbdq_core *pkc,struct block_desc *pbd) {
+
+	return (TP_STATUS_USER == pbd->block_status);
+}
+
+static inline int prb_queue_plugged(struct kbdq_core *pkc) {
+	return pkc->reset_pending_on_curr_blk;
+}
+
+/* Assumes caller has the sk->rx_queue.lock */
+static void *__packet_lookup_frame_in_block(struct packet_ring_buffer *rb,
+		int status,unsigned int len,struct packet_sock *po)
+{
+	struct kbdq_core *pkc  = GET_PBDQC_FROM_RB(rb);
+	struct block_desc *pbd = GET_CURR_PBLOCK_DESC_FROM_CORE(pkc);
+	char *curr, *end;
+	
+	if (prb_queue_plugged(pkc)) {
+		if (prb_curr_blk_in_use(pkc,pbd)) {
+			return NULL;
+		} else {
+			/* open-block unplugs the queue. Unplugging is a side effect */
+			prb_open_block(pkc,pbd);
+		}
+	}
+
+	smp_mb();
+
+	curr = pkc->nxt_offset;
+	end  = (char *) ( (char *)pbd + pkc->kblk_size);
+	
+	/* first try the current block */
+	if (curr+TOTAL_PKT_LEN_INCL_ALIGN(len) < end) {
+		prb_fill_curr_block(curr,pkc,pbd,len);
+		return (void *)curr;
+	}
+	
+	/* Then try the next block. */
+	if ((curr = (char *)prb_try_next_block(pkc,po))) {
+		pbd = GET_CURR_PBLOCK_DESC_FROM_CORE(pkc);
+		prb_fill_curr_block(curr,pkc,pbd,len);
+		return (void *)curr;
+	}
+
+	/* no free blocks are available - user_space hasn't caught up yet */
+	return NULL;
+}
+
+static inline void *packet_current_rx_frame(struct packet_sock *po,
+		struct packet_ring_buffer *rb,
+		int status, unsigned int len)
+{
+	char *curr=NULL;
+	switch (po->tp_version) {
+		case TPACKET_V1:
+		case TPACKET_V2:
+			curr = packet_lookup_frame(po, rb, rb->head, status);
+			return curr;
+		case TPACKET_V3:
+			return __packet_lookup_frame_in_block(rb, status,len,po);
+		default:
+			pr_err("<%s> TPACKET version:%d not supported\n",__func__,po->tp_version);
+			BUG();
+			return 0;
+	}
+}
+
+static inline void *prb_lookup_block(struct packet_sock *po,
+		struct packet_ring_buffer *rb,unsigned int previous,
+		int status)
+{
+	struct kbdq_core *pkc  = GET_PBDQC_FROM_RB(rb);
+	struct block_desc *pbd = GET_PBLOCK_DESC(pkc,previous);
+
+	if (status != pbd->block_status)
+		return NULL;
+	return pbd;
+}
+
+static inline int prb_previous_blk_num(struct packet_ring_buffer *rb)
+{
+	unsigned int prev = rb->prb_bdqc.kactive_blk_num ? (rb->prb_bdqc.kactive_blk_num-1) : (rb->prb_bdqc.knum_blocks-1);
+	return prev;
+}
+
+/* Assumes caller has held the rx_queue.lock */
+static inline void* __prb_previous_block(struct packet_sock *po,
+		struct packet_ring_buffer *rb,
+		int status)
+{
+
+	unsigned int previous = prb_previous_blk_num(rb);
+	return prb_lookup_block(po,rb,previous,status);
 }

+
 static inline void *packet_previous_frame(struct packet_sock *po,
 		struct packet_ring_buffer *rb,
 		int status)
@@ -320,11 +617,38 @@ static inline void *packet_previous_frame(struct packet_sock *po,
 	return packet_lookup_frame(po, rb, previous, status);
 }

+static inline void *packet_previous_rx_frame(struct packet_sock *po,
+		struct packet_ring_buffer *rb,
+		int status)
+{
+	if (po->tp_version <= TPACKET_V2)
+		return packet_previous_frame(po,rb,status);
+	
+	return __prb_previous_block(po,rb,status);
+}
+
 static inline void packet_increment_head(struct packet_ring_buffer *buff)
 {
 	buff->head = buff->head != buff->frame_max ? buff->head+1 : 0;
 }

+static inline void packet_increment_rx_head(struct packet_sock *po,struct packet_ring_buffer *rb)
+{
+	switch (po->tp_version) {
+		case TPACKET_V1:
+		case TPACKET_V2:
+			return packet_increment_head(rb);
+		case TPACKET_V3:
+			pr_err("<%s> TPACKET version:%d not supported.Dumping stack.\n",__func__,po->tp_version);
+			dump_stack();
+			return;
+		default:
+			pr_err("<%s> TPACKET version not supported\n",__func__);
+			BUG();
+			return;
+	}
+}
+
 static inline struct packet_sock *pkt_sk(struct sock *sk)
 {
 	return (struct packet_sock *)sk;
@@ -663,6 +987,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 	union {
 		struct tpacket_hdr *h1;
 		struct tpacket2_hdr *h2;
+		struct tpacket3_hdr *h3;
 		void *raw;
 	} h;
 	u8 *skb_head = skb->data;
@@ -715,29 +1040,31 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 		macoff = netoff - maclen;
 	}

-	if (macoff + snaplen > po->rx_ring.frame_size) {
-		if (po->copy_thresh &&
-		    atomic_read(&sk->sk_rmem_alloc) + skb->truesize <
-		    (unsigned)sk->sk_rcvbuf) {
-			if (skb_shared(skb)) {
-				copy_skb = skb_clone(skb, GFP_ATOMIC);
-			} else {
-				copy_skb = skb_get(skb);
-				skb_head = skb->data;
+	if (po->tp_version <= TPACKET_V2) {
+		if (macoff + snaplen > po->rx_ring.frame_size) {
+			if (po->copy_thresh &&
+				atomic_read(&sk->sk_rmem_alloc) + skb->truesize <
+				(unsigned)sk->sk_rcvbuf) {
+				if (skb_shared(skb)) {
+					copy_skb = skb_clone(skb, GFP_ATOMIC);
+				} else {
+					copy_skb = skb_get(skb);
+					skb_head = skb->data;
+				}
+				if (copy_skb)
+					skb_set_owner_r(copy_skb, sk);
 			}
-			if (copy_skb)
-				skb_set_owner_r(copy_skb, sk);
+			snaplen = po->rx_ring.frame_size - macoff;
+			if ((int)snaplen < 0)
+				snaplen = 0;
 		}
-		snaplen = po->rx_ring.frame_size - macoff;
-		if ((int)snaplen < 0)
-			snaplen = 0;
 	}
-
 	spin_lock(&sk->sk_receive_queue.lock);
-	h.raw = packet_current_frame(po, &po->rx_ring, TP_STATUS_KERNEL);
+	h.raw = packet_current_rx_frame(po, &po->rx_ring, TP_STATUS_KERNEL,(macoff+snaplen));
 	if (!h.raw)
 		goto ring_is_full;
-	packet_increment_head(&po->rx_ring);
+	if (TPACKET_V3 != po->tp_version)
+		packet_increment_rx_head(po,&po->rx_ring);
 	po->stats.tp_packets++;
 	if (copy_skb) {
 		status |= TP_STATUS_COPY;
@@ -789,6 +1116,21 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 		h.h2->tp_vlan_tci = vlan_tx_tag_get(skb);
 		hdrlen = sizeof(*h.h2);
 		break;
+	case TPACKET_V3:
+		/* tp_nxt_offset is already populated above. So DONT clear those fields here */
+		h.h3->tp_len = skb->len;
+		h.h3->tp_snaplen = snaplen;
+		h.h3->tp_mac = macoff;
+		h.h3->tp_net = netoff;
+		if (skb->tstamp.tv64)
+			ts = ktime_to_timespec(skb->tstamp);
+		else
+			getnstimeofday(&ts);
+		h.h3->tp_sec  = ts.tv_sec;
+		h.h3->tp_nsec = ts.tv_nsec;
+		h.h3->tp_vlan_tci = vlan_tx_tag_get(skb);
+		hdrlen = sizeof(*h.h3);
+		break;	
 	default:
 		BUG();
 	}
@@ -804,7 +1146,8 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 	else
 		sll->sll_ifindex = dev->ifindex;

-	__packet_set_status(po, h.raw, status);
+	if (po->tp_version <= TPACKET_V2)
+		__packet_set_status(po, h.raw, status);
 	smp_mb();
 #if ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE == 1
 	{
@@ -815,7 +1158,6 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 			flush_dcache_page(pgv_to_page(start));
 	}
 #endif
-
 	sk->sk_data_ready(sk, 0);

 drop_n_restore:
@@ -1984,6 +2326,7 @@ packet_setsockopt(struct socket *sock, int level, int optname, char __user *optv
 		switch (val) {
 		case TPACKET_V1:
 		case TPACKET_V2:
+		case TPACKET_V3:
 			po->tp_version = val;
 			return 0;
 		default:
@@ -2082,6 +2425,7 @@ static int packet_getsockopt(struct socket *sock, int level, int optname,
 	struct packet_sock *po = pkt_sk(sk);
 	void *data;
 	struct tpacket_stats st;
+	union tpacket_stats_u st_u;

 	if (level != SOL_PACKET)
 		return -ENOPROTOOPT;
@@ -2094,15 +2438,25 @@ static int packet_getsockopt(struct socket *sock, int level, int optname,

 	switch (optname) {
 	case PACKET_STATISTICS:
-		if (len > sizeof(struct tpacket_stats))
-			len = sizeof(struct tpacket_stats);
+		if (po->tp_version == TPACKET_V3) {
+			len = sizeof(struct tpacket_stats_v3);
+		} else {
+			if (len > sizeof(struct tpacket_stats))
+				len = sizeof(struct tpacket_stats);
+		}
 		spin_lock_bh(&sk->sk_receive_queue.lock);
-		st = po->stats;
+		if (po->tp_version == TPACKET_V3) {
+			memcpy(&st_u.stats3,&po->stats,sizeof(struct tpacket_stats));
+			st_u.stats3.tp_plug_q_cnt  = po->stats_u.stats3.tp_plug_q_cnt;
+			st_u.stats3.tp_packets += po->stats.tp_drops;
+			data = &st_u.stats3;
+		} else {
+			st = po->stats;
+			st.tp_packets += st.tp_drops;
+			data = &st;
+		}
 		memset(&po->stats, 0, sizeof(st));
 		spin_unlock_bh(&sk->sk_receive_queue.lock);
-		st.tp_packets += st.tp_drops;
-
-		data = &st;
 		break;
 	case PACKET_AUXDATA:
 		if (len > sizeof(int))
@@ -2143,6 +2497,9 @@ static int packet_getsockopt(struct socket *sock, int level, int optname,
 		case TPACKET_V2:
 			val = sizeof(struct tpacket2_hdr);
 			break;
+		case TPACKET_V3:
+			val = sizeof(struct tpacket3_hdr);
+			break;
 		default:
 			return -EINVAL;
 		}
@@ -2293,7 +2650,7 @@ static unsigned int packet_poll(struct file *file, struct socket *sock,

 	spin_lock_bh(&sk->sk_receive_queue.lock);
 	if (po->rx_ring.pg_vec) {
-		if (!packet_previous_frame(po, &po->rx_ring, TP_STATUS_KERNEL))
+		if (!packet_previous_rx_frame(po, &po->rx_ring, TP_STATUS_KERNEL))
 			mask |= POLLIN | POLLRDNORM;
 	}
 	spin_unlock_bh(&sk->sk_receive_queue.lock);
@@ -2396,7 +2753,6 @@ static struct pgv *alloc_pg_vec(struct tpacket_req *req, int order)
 	pg_vec = kcalloc(block_nr, sizeof(struct pgv), GFP_KERNEL);
 	if (unlikely(!pg_vec))
 		goto out;
-
 	for (i = 0; i < block_nr; i++) {
 		pg_vec[i].buffer = alloc_one_pg_vec_page(order);
 		if (unlikely(!pg_vec[i].buffer))
@@ -2412,6 +2768,197 @@ out_free_pgvec:
 	goto out;
 }

+
+static void prb_del_retire_blk_timer(struct kbdq_core *pkc)
+{
+	del_timer_sync(&pkc->retire_blk_timer);
+}
+
+static void prb_shutdown_retire_blk_timer(struct packet_sock *po, int tx_ring,struct sk_buff_head *rb_queue)
+{
+	struct kbdq_core *pkc;
+
+	pkc	= tx_ring ? &po->tx_ring.prb_bdqc : &po->rx_ring.prb_bdqc;
+	
+	spin_lock(&rb_queue->lock);
+	pkc->delete_blk_timer=1;
+	spin_unlock(&rb_queue->lock);
+
+	prb_del_retire_blk_timer(pkc);
+}
+
+/*  Increment the blk_num and then invoke this func to refresh the timer.
+ *  We do it in this order so that if a timer is about
+ *  to fire then it will fail the blk_num check.
+ *  Assumes sk_buff_head lock is held.
+ */
+static void _prb_refresh_rx_retire_blk_timer(struct kbdq_core *pkc)
+{
+	pkc->last_kactive_blk_num = pkc->kactive_blk_num;
+	mod_timer(&pkc->retire_blk_timer,jiffies+msecs_to_jiffies(pkc->retire_blk_tmo));
+}
+
+/* close current block and open next block or plug the queue */
+static inline void prb_retire_curr_block(struct kbdq_core *pkc,struct packet_sock *po)
+{
+	prb_try_next_block(pkc,po);
+}
+
+/*
+ * Timer logic:
+ * 1) We refresh the timer only when we open a block.
+ *    By doing this we don't waste cycles refreshing the timer
+ *    on packet-by-packet basis.
+ * With a 1MB block-size, on a 1Gbps line, it will take
+ * ~8 ms to fill a block.
+ * So, if the user sets the 'tmo' to 10ms then the timer will never fire(which is what we want)!
+ * However, the user could choose to close a block early and that's fine.
+ *
+ * But when the timer does fire, we check whether or not to refresh it.
+ * Since the tmo granularity is in msecs, it is not too expensive
+ * to refresh the timer every '8' msecs.
+ * Either the user can set the 'tmo' or we can derive it based on
+ * a) line-speed and b) block-size
+ */
+static void prb_retire_rx_blk_timer_expired(unsigned long data)
+{
+	struct packet_sock *po = (struct packet_sock *)data;
+	struct kbdq_core *pkc = &po->rx_ring.prb_bdqc;
+	unsigned short tmo;
+	unsigned int plugged;
+	struct block_desc *pbd;
+
+	spin_lock(&po->sk.sk_receive_queue.lock);
+
+	plugged = prb_queue_plugged(pkc);
+	pbd = GET_CURR_PBLOCK_DESC_FROM_CORE(pkc);
+
+	/* We read the tmo so that user-space can change it anytime they want.
+	 * But, the changes will take effect only when:
+	 * i) the timer expires (this code path) or
+	 * ii) a new block is opened.
+	 */
+	tmo = pkc->retire_blk_tmo;
+	if (pkc->last_kactive_blk_num == pkc->kactive_blk_num &&
+		!plugged) {
+		if (TP_STATUS_KERNEL == pbd->block_status) {
+			prb_retire_curr_block(pkc,po);
+		}
+	}
+	pkc->last_kactive_blk_num = pkc->kactive_blk_num;
+	
+	if (pkc->delete_blk_timer)
+		goto out;
+
+	if (plugged) {
+		/* Case 1. queue was plugged because user-space was lagging behind */
+		if (prb_curr_blk_in_use(pkc,pbd)) {
+			/* Ok, user-space is still behind. But we still want to refresh the timer */
+			/* if-check added for code readability */
+		} else {
+			/* Case 2. queue was plugged, user-space caught up and now the link went idle && the timer fired.
+			 * We don't have a block to close and we cannot close the current block because
+			 * the timer wasn't really meant for this block. So we just open this block and restart the timer.
+			 * open-block unplugs the queue, restarts timer. Unplugging/refreshing-timer is a side effect.
+			 */
+			prb_open_block(pkc,pbd);
+			goto out;
+		}
+	}
+
+	mod_timer(&pkc->retire_blk_timer,jiffies+msecs_to_jiffies(tmo));
+
+out:
+	spin_unlock(&po->sk.sk_receive_queue.lock);
+}
+
+static void prb_init_blk_timer(struct packet_sock *po,struct kbdq_core *pkc,void (*func) (unsigned long))
+{
+
+	init_timer(&pkc->retire_blk_timer);
+	pkc->retire_blk_timer.data		= (long)po;
+	pkc->retire_blk_timer.function	= func;
+	pkc->retire_blk_timer.expires	= jiffies;
+}
+
+static void prb_setup_retire_blk_timer(struct packet_sock *po,int tx_ring)
+{
+	struct kbdq_core *pkc;
+
+	if (tx_ring)
+		BUG();
+
+	pkc	 = tx_ring ? &po->tx_ring.prb_bdqc : &po->rx_ring.prb_bdqc;
+	prb_init_blk_timer(po,pkc,prb_retire_rx_blk_timer_expired);
+}
+
+static int prb_calc_retire_blk_tmo(struct packet_sock *po, int blk_size_in_bytes)
+{
+	struct net_device *dev;
+	unsigned int mbits=0,msec=0,div=0,tmo=0;
+
+	dev = dev_get_by_index(sock_net(&po->sk), po->ifindex);
+	if (unlikely(dev == NULL)) {
+		return DEFAULT_PRB_RETIRE_TMO;
+	}
+
+    if (dev->ethtool_ops && dev->ethtool_ops->get_settings) {
+		struct ethtool_cmd ecmd = { .cmd = ETHTOOL_GSET, };
+
+        if (!dev->ethtool_ops->get_settings(dev, &ecmd)) {
+			switch(ecmd.speed) {
+				case SPEED_10000:
+					msec = 1;
+					div=10000/1000;
+					break;
+                case SPEED_1000:
+                    msec = 1;
+					div = 1000/1000;
+					break;
+                /* If the link speed is so low you don't really need to care about perf anyways */
+				case SPEED_100:
+				case SPEED_10:
+				default:
+					return DEFAULT_PRB_RETIRE_TMO;
+            }
+        }
+    }
+
+	mbits = (blk_size_in_bytes * 8) / (1024 * 1024);
+
+	if (div)
+		mbits /= div;
+
+	tmo = mbits * msec;
+
+	if (div)
+		return (tmo+1);
+	return tmo;
+}
+
+static void init_prb_bdqc(struct packet_sock *po,struct packet_ring_buffer *rb,struct pgv *pg_vec,struct tpacket_req *req,int tx_ring)
+{
+
+	struct kbdq_core *p1 = &rb->prb_bdqc;
+	struct block_desc *pbd;
+
+	memset(p1,0x0,sizeof(*p1));
+	p1->pkbdq			= pg_vec;
+	pbd					= (struct block_desc *)pg_vec[0].buffer;
+	p1->pkblk_start		= (char *)pg_vec[0].buffer;
+	
+	p1->kblk_size		= req->tp_block_size;
+	p1->knum_blocks		= req->tp_block_nr;
+	p1->hdrlen			= po->tp_hdrlen;
+	
+	p1->last_kactive_blk_num = 0;
+	po->stats_u.stats3.tp_plug_q_cnt = 0;
+	p1->retire_blk_tmo = prb_calc_retire_blk_tmo(po,req->tp_block_size);
+
+	prb_setup_retire_blk_timer(po,tx_ring);
+	prb_open_block(p1,pbd);
+}
+
 static int packet_set_ring(struct sock *sk, struct tpacket_req *req,
 		int closing, int tx_ring)
 {
@@ -2421,7 +2968,14 @@ static int packet_set_ring(struct sock *sk, struct tpacket_req *req,
 	struct packet_ring_buffer *rb;
 	struct sk_buff_head *rb_queue;
 	__be16 num;
-	int err;
+	int err=-EINVAL;
+
+	/* Opening a Tx-ring is NOT supported post TPACKET_V2 */
+	if (!closing && tx_ring && (po->tp_version > TPACKET_V2)) {
+		pr_err("<%s> Tx-ring is not supported on version:%d.Dumping stack.\n",__func__,po->tp_version);
+		dump_stack();
+		goto out;
+	}

 	rb = tx_ring ? &po->tx_ring : &po->rx_ring;
 	rb_queue = tx_ring ? &sk->sk_write_queue : &sk->sk_receive_queue;
@@ -2447,6 +3001,9 @@ static int packet_set_ring(struct sock *sk, struct tpacket_req *req,
 		case TPACKET_V2:
 			po->tp_hdrlen = TPACKET2_HDRLEN;
 			break;
+		case TPACKET_V3:
+			po->tp_hdrlen = TPACKET3_HDRLEN;
+			break;
 		}

 		err = -EINVAL;
@@ -2472,6 +3029,15 @@ static int packet_set_ring(struct sock *sk, struct tpacket_req *req,
 		pg_vec = alloc_pg_vec(req, order);
 		if (unlikely(!pg_vec))
 			goto out;
+		switch (po->tp_version) {
+			case TPACKET_V3:
+				/* Transmit path is not supported. We checked it above but just being paranoid */
+				if (!tx_ring)
+					init_prb_bdqc(po,rb,pg_vec,req,tx_ring);
+				break;
+			default:
+				break;
+		}
 	}
 	/* Done */
 	else {
@@ -2529,10 +3095,17 @@ static int packet_set_ring(struct sock *sk, struct tpacket_req *req,
 	}
 	spin_unlock(&po->bind_lock);

+	if (closing && (po->tp_version > TPACKET_V2)) {
+		/* Because we don't support block-based V3 on tx-ring */
+		if (!tx_ring)	
+			prb_shutdown_retire_blk_timer(po,tx_ring,rb_queue);
+	}
+
 	release_sock(sk);

 	if (pg_vec)
 		free_pg_vec(pg_vec, order, req->tp_block_nr);
+	
 out:
 	return err;
 }

^ permalink raw reply related

* [RFC 00/01]af_packet: Enhance network capture visibility
From: chetan loke @ 2011-05-25 23:02 UTC (permalink / raw)
  To: netdev, loke.chetan

Hello,

Please review the RFC/patchset. Any feedback is appreciated.

The patch set is not complete and is intended to:
a) demonstrate the improvements
b) gather suggestions

This patch attempts to i) improve network capture visibility by
increasing packet density and ii) assist in analyzing multiple
(aggregated) capture ports.

With the current af_packet->rx::mmap based approach, the element size
in the block needs to be statically configured. There is nothing wrong
with this config/implementation, but the traffic profile cannot be known
in advance, so it would be nice if that configuration wasn't static.
Normally, one would configure the element-size to be '2048' so that you
can at least capture the entire 'MTU-size'.
But if the traffic profile varies then we end up either i) wasting
memory or ii) getting a sliced frame. In the first case the packet
density is much lower than it could be.
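To make the density argument concrete, here is a rough sketch of the
arithmetic. The 1MB block size and 2048-byte frame size are the values
used in the test setup further down; the block-header and per-packet
header sizes are illustrative parameters only, not the real sizes of
the structures in the patch.

```c
#include <stddef.h>

/* Round x up to a multiple of 4, in the spirit of ALIGN_4() in the patch. */
static size_t align4(size_t x)
{
	return (x + 3) & ~(size_t)3;
}

/* Fixed-slot ring: every packet occupies one frame_size slot. */
static size_t pkts_per_block_fixed(size_t block_size, size_t frame_size)
{
	return block_size / frame_size;
}

/*
 * Packed ring: packets are laid out back to back after a block header.
 * blk_hdr and per_pkt_hdr are assumed, illustrative sizes.
 */
static size_t pkts_per_block_packed(size_t block_size, size_t blk_hdr,
				    size_t per_pkt_hdr, size_t pkt_len)
{
	return (block_size - blk_hdr) / align4(per_pkt_hdr + pkt_len);
}
```

With these assumptions, pkts_per_block_fixed(1 << 20, 2048) gives 512
packets per block regardless of packet size, while
pkts_per_block_packed(1 << 20, 64, 64, 64) gives 8191 for 64-byte
packets.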

Enhancement:
E1) Enhance tpacket_rcv so that it can dump/copy the packets one after another.
E2) Also implement a basic timeout mechanism to close the current
block. That way, user-space won't be blocked forever on an idle link.
This is a much needed feature while monitoring multiple ports.
Look at 3) below.
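As a rough guide to picking the timeout in E2), the time needed to fill
one block at line rate can be estimated as below. This is only a sketch
of the arithmetic, not the calculation the patch performs; the function
name is made up for illustration.

```c
#include <stdint.h>

/*
 * Milliseconds needed to fill block_bytes at link_mbps, rounded up.
 * A link running at link_mbps Mbit/s moves link_mbps * 1000 bits per ms.
 */
static uint32_t block_fill_time_ms(uint32_t block_bytes, uint32_t link_mbps)
{
	uint64_t bits = (uint64_t)block_bytes * 8;
	uint64_t bits_per_ms = (uint64_t)link_mbps * 1000;

	return (uint32_t)((bits + bits_per_ms - 1) / bits_per_ms);
}
```

For a 1MB block this gives 9 ms at 1Gbps (the exact figure is about
8.4 ms) and 1 ms at 10Gbps. A timeout above the fill time only fires
when the link is idle or lightly loaded.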

Why is such an enhancement needed?
1) Spin-waiting/polling on a per-packet basis to see if a packet is
ready to be consumed does not scale while monitoring multiple ports.
poll() is not performance friendly either.
2) Also, a user-space packet capture interface typically hands
multiple packets to another user-space protocol-decoder.

   -----------------
   protocol-decoder
          T2
   -----------------
   =================
       ship pkts
   =================
           ^
           |
           v
   -----------------
   pkt-capture logic
          T1
   -----------------
   =================
    nic/adp/sock IF
   =================
           ^
           |
           v
		
T1 and T2 are user-space threads. If the hand-off between T1 and T2
happens on a per-pkt basis then the solution does NOT scale.

However, one can argue that T1 can coalesce packets and then pass off a
single chunk to T2. But T1's packet consumption granularity is still at
an individual packet level and that is something that needs to be
addressed to avoid excessive polling.
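A minimal sketch of what block-granularity consumption could look like
in T1 follows. The structures and status values are simplified
stand-ins invented for this example (they are not the layouts from the
patch), and the memory barriers a real ring shared with the kernel
would need are left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_BLK_KERNEL 0u	/* hypothetical: kernel is still filling */
#define DEMO_BLK_USER   1u	/* hypothetical: block handed to user space */

struct demo_blk_hdr {
	uint32_t status;
	uint32_t num_pkts;
	uint32_t offset_to_first_pkt;	/* from start of block */
};

struct demo_pkt_hdr {
	uint32_t next_offset;	/* bytes to the next packet header */
	uint32_t snaplen;	/* captured bytes following this header */
};

typedef void (*demo_pkt_cb)(const uint8_t *data, uint32_t len, void *arg);

/*
 * Check one status word per block, then walk every packet in it.
 * Returns the number of packets consumed (0 if the block is not ready).
 */
static uint32_t demo_consume_block(uint8_t *block, demo_pkt_cb cb, void *arg)
{
	struct demo_blk_hdr bh;
	uint8_t *p;
	uint32_t i;

	memcpy(&bh, block, sizeof(bh));
	if (bh.status != DEMO_BLK_USER)
		return 0;

	p = block + bh.offset_to_first_pkt;
	for (i = 0; i < bh.num_pkts; i++) {
		struct demo_pkt_hdr ph;

		memcpy(&ph, p, sizeof(ph));
		if (cb)
			cb(p + sizeof(ph), ph.snaplen, arg);
		p += ph.next_offset;
	}

	/* Hand the whole block back to the producer in one step. */
	bh.status = DEMO_BLK_KERNEL;
	memcpy(block, &bh, sizeof(bh));
	return bh.num_pkts;
}
```

The point of the sketch is that the status check happens once per
block, so the polling cost is amortized over every packet in it.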


3) Port aggregation:
   Multiple ports are viewed/analyzed as one logical pipe.
   Example:
   3.1) up-stream    path can be tapped in eth1
   3.2) down-stream  path can be tapped in eth2
   3.3) Network TAP splits Rx/Tx paths and then feeds to eth1,eth2.

   If both eth1,eth2 need to be viewed as one logical channel,
   then that implies we need to timesort the packets as they come across
   eth1,eth2.

   3.4) But the following issues further complicate the problem:
        3.4.1) What if one stream is bursty and the other is flowing
               at line rate?
        3.4.2) How long do we wait before we can actually make a
               decision in the app-space and bail out from the spin-wait?

   Solution:
   3.5) Once we receive a block from each of the ports, we can
compare the timestamps from the block-descriptors and then easily sort
the packets and feed the pointers to the decoders.
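One possible way to use the block timestamps, sketched under the
assumption that each block carries a first-packet and a last-packet
timestamp: if the time spans of two blocks do not overlap, the blocks
can be ordered wholesale, and a packet-by-packet merge is only needed
when they do overlap. The types and names here are invented for the
example.

```c
#include <stdint.h>

struct demo_ts {
	uint32_t sec;
	uint32_t nsec;
};

/* Time range covered by one block (first and last packet). */
struct demo_blk_span {
	struct demo_ts first;
	struct demo_ts last;
};

/* Returns <0 if a is earlier than b, 0 if equal, >0 if later. */
static int demo_ts_cmp(struct demo_ts a, struct demo_ts b)
{
	if (a.sec != b.sec)
		return a.sec < b.sec ? -1 : 1;
	if (a.nsec != b.nsec)
		return a.nsec < b.nsec ? -1 : 1;
	return 0;
}

/* Non-zero when the two blocks cover non-overlapping time ranges. */
static int demo_blocks_disjoint(const struct demo_blk_span *a,
				const struct demo_blk_span *b)
{
	return demo_ts_cmp(a->last, b->first) < 0 ||
	       demo_ts_cmp(b->last, a->first) < 0;
}
```

When demo_blocks_disjoint() is true, the block with the earlier span
can be handed to the decoder in its entirety before the other one.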


------------------------------
Performance results:
------------------------------

Setup:
S1) Ran 3 pktgen sessions from 3 worker VMs (VM0-VM2).
S2) Each pktgen session was configured to send 40 million 64-byte packets.
S3) Ran the patched kernel on the probe-VM (VM3).
S4) rx-mmap application code:
   BLOCK_SIZE: 1MB
   FRAME_SIZE: 2048 bytes
   NUM_BLOCKS: 64

Note: TPACKET_V3 doesn't really care about FRAME_SIZE.
      But the code was untouched to ensure minimal disruption.

Numbers from VM3(tpacket_stats):

Case P1) TPACKET_V0[V1](existing model):

recieved 84909875 packets, dropped 5760817
Pkts seen by the app:79149058


Case P2) TPACKET_V3(enhanced model):
recieved 102562944 packets, dropped 2 plug_q_cnt 12
Pkts seen by the app:102562942

PS: plug_q_cnt is interpreted as "The tpacket_rcv code got blocked only
12 times during the entire capture process. Blocked implies that the
user-space process took some time to catch up."

Note: In both cases, VM3 should have seen ~120 million packets, but
it only sees around 90-100M pkts. The hypervisor is dropping
~20%-30% of the traffic. We can ignore this because in the non-virtual
world there could be limitations on the host side too.
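The figures above can be cross-checked with a few lines of arithmetic.
The helper names below are invented for this sketch; the inputs are
the tpacket_stats numbers quoted for P1) and P2) and the 3 x 40 million
packets sent by pktgen.

```c
#include <stdint.h>

struct capture_run {
	uint64_t received;	/* packets counted by the kernel, drops included */
	uint64_t dropped;	/* packets the kernel could not queue */
};

static double pct(uint64_t part, uint64_t whole)
{
	return whole ? 100.0 * (double)part / (double)whole : 0.0;
}

/* Packets that actually reached the application. */
static uint64_t seen_by_app(struct capture_run r)
{
	return r.received - r.dropped;
}

/* Share of kernel-visible packets lost inside af_packet. */
static double drop_pct(struct capture_run r)
{
	return pct(r.dropped, r.received);
}

/* Share of sent packets that never reached the probe kernel. */
static double upstream_loss_pct(struct capture_run r, uint64_t sent)
{
	return pct(sent - r.received, sent);
}
```

For P1) (received 84909875, dropped 5760817) seen_by_app() gives
79149058, matching the application count, and drop_pct() comes out at
roughly 6.8%. For P2) (received 102562944, dropped 2) seen_by_app()
gives 102562942.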


Summary:

A) In P2), notice how the VM keeps up and so it now has more visibility
than in the P1) case.
So,
  A.1] P2) almost always has around 10%-20% higher visibility than P1).
  A.2] P2) almost always captures ~98-99% of the traffic as seen by the kernel.
  A.3] P1), on the other hand, drops anywhere around ~7-10% of the traffic.
  A.4] P1) also has 10%-20% lower visibility because it
         i) loses frames due to the static frame size format
         ii) has to poll/spin-wait for a single packet.


  Regards
  Chetan Loke

^ permalink raw reply

* [GIT] Networking
From: David Miller @ 2011-05-25 22:52 UTC (permalink / raw)
  To: torvalds; +Cc: akpm, netdev, linux-kernel


The majority of the bits here are just a merge with John Linville's
queued up wireless stuff.  This has been in his tree for more than
a week and I was just waiting for him to get back from a conference
to send the pull request to me.

Other noteworthy bits:

1) Erroneous socket filters can log kernel messages without control,
   fix from Joe Perches.

2) Fix regression in the locking of interface dumping, from Eric Dumazet.

3) Fix crash in bridging due to improperly initialized route object,
   also from Eric.

4) IP fragments give erroneous congestion notification signals in
   SFQ packet scheduler, also from Eric.

5) Rest of networking %pK conversions, from Dan Rosenberg via Andrew
   Morton.

6) When the RTNL mutex is held, synchronize_net() can use
   synchronize_rcu_expedited().  From Eric Dumazet.

7) Fix IGMP source filter clearing when users of the group still
   exist, from Veaceslav Falico.

8) __dst_destroy_metrics_generic() forgets to set "read-only" bit
   in the encoded pointer.  Fix from Eric Dumazet.

9) dev_disable_lro() needs to propagate to underlying physical device
   of a VLAN, from Neil Horman.

10) ASCONF memory leak in SCTP, fix from Wei Yongjun.

11) SFQ packet scheduler's ->peek() method returns different packets
    than ->dequeue() would, fix from Eric Dumazet.

12) Fix bonding deadlock in ALB mode, from Neil Horman.

Please pull, thanks a lot!

The following changes since commit 2a651c7f8d377cf88271374315cbb5fe82eac784:

  Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs (2011-05-25 09:21:56 -0700)

are available in the git repository at:

  master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6.git master

Alexey Dobriyan (1):
      airo: correct proc entry creation interfaces

Alexey Orishko (1):
      CDC NCM: release interfaces fix in unbind()

Breno Leitao (1):
      ehea: Fix multicast registration on semi-promiscuous mode

Christian Lamparter (2):
      p54usb: add zoom 4410 usbid
      carl9170: advertise interface combinations

Dan Rosenberg (1):
      net: convert %p usage to %pK

Daniel Halperin (1):
      iwlwifi: remove unused parameter from iwl_hcmd_queue_reclaim

David S. Miller (3):
      ipv6: Fix return of xfrm6_tunnel_rcv()
      bug.h: Fix build with CONFIG_PRINTK disabled.
      Merge branch 'for-davem' of ssh://master.kernel.org/.../linville/wireless-next-2.6

Dmitry Kravkov (2):
      bnx2x: fix inverted condition
      bnx2x: protect sequence increment with mutex

Eric Dumazet (8):
      net: ping: cleanups ping_v4_unhash()
      snap: remove one synchronize_net()
      sch_sfq: avoid giving spurious NET_XMIT_CN signals
      net: use synchronize_rcu_expedited()
      net: fix __dst_destroy_metrics_generic()
      bridge: initialize fake_rtable metrics
      sch_sfq: fix peek() implementation
      net: hold rtnl again in dump callbacks

Felix Fietkau (3):
      ath9k: fix ad-hoc mode beacon selection
      ath9k: fix ad-hoc nexttbtt calculation
      ath9k: implement .tx_last_beacon()

Flavio Leitner (1):
      bonding: documentation and code cleanup for resend_igmp

Ian Campbell (1):
      xen: netfront: hold RTNL when updating features.

Javier Cardona (2):
      mac80211: Deactivate mesh path timers when freeing nodes
      mac80211: Don't sleep when growing the mesh path

Joe Perches (2):
      bug.h: Add WARN_RATELIMIT
      net: filter: Use WARN_RATELIMIT

Johannes Berg (10):
      iwlagn: prepare for multi-TB commands
      iwlagn: clean up TXQ indirection
      iwlagn: remove unused pad argument
      iwlagn: support multiple TBs per command
      iwlagn: remove set but unused vars
      iwlagn: change default beacon interval
      mac80211: verify IBSS in interface combinations
      mac80211: add missing rcu_barrier
      mac80211: fix and simplify mesh locking
      mac80211: annotate and fix RCU in mesh code

John W. Linville (2):
      Merge branch 'wireless-next-2.6' of git://git.kernel.org/.../iwlwifi/iwlwifi-2.6
      Merge ssh://master.kernel.org/.../linville/wireless-next-2.6 into for-davem

Jouni Malinen (1):
      cfg80211: Use consistent BSS matching between scan and sme

Larry Finger (1):
      rtlwifi: rtl8192c-common: rtl8192ce: Fix for HT40 regression

Luciano Coelho (1):
      nl80211: remove some stack variables in trigger_scan and start_sched_scan

Marc Yang (5):
      mwifiex: reduce CPU usage by tracking tx_pkts_queued
      mwifiex: reduce CPU usage by tracking highest_queued_prio
      mwifiex: check mwifiex_wmm_lists_empty() before dequeue
      mwifiex: CPU mips optimization with NO_PKT_PRIO_TID
      mwifiex: adjust high/low water marks for tx_pending queue

Meelis Roos (1):
      Add Fujitsu 1000base-SX PCI ID to tg3

Mike Frysinger (1):
      net/irda: convert bfin_sir to common Blackfin UART header

Mohammed Shafi Shajakhan (2):
      ath_hw: Fix bssid mask documentation
      ath9k: use PS wakeup before REG_READ

Neil Horman (3):
      net: move is_vlan_dev into public header file (v2)
      net: make dev_disable_lro use physical device if passed a vlan dev (v2)
      bonding: prevent deadlock on slave store with alb mode (v3)

Prarit Bhargava (1):
      isdn: netjet - blacklist Digium TDM400P

Rafał Miłecki (8):
      b43: rename b43_wldev's field with ssb_device to sdev
      bcma: add PCI ID of the card found in Thinkpad X120e
      b43: add helpers for block R/W ops
      b43: make b43_wireless_init less bus specific
      b43: dma: cache translation (routing bits)
      b43: add helper for finding GPIO device
      b43: separate ssb core reset
      b43: read PHY info only when needed (for PHY-A)

Rajkumar Manoharan (2):
      mac80211: abort scan_work immediately when the device goes down
      ath9k: Fix power save wrappers in debug ops

Randy Dunlap (2):
      wireless: fix cfg80211.h new kernel-doc warnings
      wireless: fix fatal kernel-doc error + warning in mac80211.h

Rhyland Klein (1):
      net: rfkill: add generic gpio rfkill driver

Sathya Perla (1):
      be2net: hash key for rss-config cmd not set

Stephen Hemminger (1):
      dst: catch uninitialized metrics

Sujith Manoharan (9):
      ath9k_htc: Fix mode selection
      ath9k_htc: Fix station flags
      ath9k_htc: Recalculate the BSSID mask on interface
      ath9k_htc: Fix RX filter calculation
      ath9k_htc: Fix BSSID calculation
      ath9k_htc: Fix max subframe handling
      ath9k_htc: Change credit limit for UB94/95
      ath9k_htc: Fix packet timeout
      ath9k: Drag the driver to the year 2011

Ulrich Hecht (1):
      via-velocity: don't annotate MAC registers as packed

Veaceslav Falico (1):
      igmp: call ip_mc_clear_src() only when we have no users of ip_mc_list

Wei Yongjun (1):
      sctp: fix memory leak of the ASCONF queue when free asoc

Wey-Yi Guy (8):
      iwlagn: more ucode error log info
      iwlagn: add testmode trace command
      iwlagn: add eeprom command to testmode
      iwlagn: add testmode set fixed rate command
      iwlagn: clear STATUS_HCMD_ACTIVE bit if fail enqueue
      iwlagn: alwasy send RXON with disassociate falge before associate
      iwlagn: remove unused old_assoc parameter
      iwlagn: dbg_fixed_rate only used when CONFIG_MAC80211_DEBUGFS enabled

 Documentation/networking/bonding.txt               |   13 +-
 drivers/bcma/host_pci.c                            |    1 +
 drivers/isdn/hardware/mISDN/netjet.c               |    6 +
 drivers/net/benet/be_cmds.c                        |    3 +-
 drivers/net/bnx2x/bnx2x_cmn.c                      |    2 +-
 drivers/net/bnx2x/bnx2x_main.c                     |    3 +-
 drivers/net/bonding/bond_alb.c                     |    4 -
 drivers/net/bonding/bond_main.c                    |   28 +-
 drivers/net/bonding/bond_sysfs.c                   |   16 +-
 drivers/net/ehea/ehea_main.c                       |    2 +-
 drivers/net/irda/bfin_sir.c                        |   59 ++--
 drivers/net/irda/bfin_sir.h                        |   63 +----
 drivers/net/tg3.c                                  |    1 +
 drivers/net/usb/cdc_ncm.c                          |   73 ++---
 drivers/net/via-velocity.h                         |    2 +-
 drivers/net/wireless/airo.c                        |   33 +--
 drivers/net/wireless/ath/ath9k/ahb.c               |    2 +-
 drivers/net/wireless/ath/ath9k/ani.c               |    2 +-
 drivers/net/wireless/ath/ath9k/ani.h               |    2 +-
 drivers/net/wireless/ath/ath9k/ar5008_initvals.h   |    2 +-
 drivers/net/wireless/ath/ath9k/ar5008_phy.c        |    2 +-
 drivers/net/wireless/ath/ath9k/ar9001_initvals.h   |    2 +-
 drivers/net/wireless/ath/ath9k/ar9002_calib.c      |    2 +-
 drivers/net/wireless/ath/ath9k/ar9002_hw.c         |    2 +-
 drivers/net/wireless/ath/ath9k/ar9002_initvals.h   |    2 +-
 drivers/net/wireless/ath/ath9k/ar9002_mac.c        |    2 +-
 drivers/net/wireless/ath/ath9k/ar9002_phy.c        |    2 +-
 drivers/net/wireless/ath/ath9k/ar9002_phy.h        |    2 +-
 .../net/wireless/ath/ath9k/ar9003_2p2_initvals.h   |    2 +-
 drivers/net/wireless/ath/ath9k/ar9003_calib.c      |    2 +-
 drivers/net/wireless/ath/ath9k/ar9003_eeprom.c     |    2 +-
 drivers/net/wireless/ath/ath9k/ar9003_eeprom.h     |   16 +
 drivers/net/wireless/ath/ath9k/ar9003_hw.c         |    2 +-
 drivers/net/wireless/ath/ath9k/ar9003_mac.c        |    2 +-
 drivers/net/wireless/ath/ath9k/ar9003_mac.h        |    2 +-
 drivers/net/wireless/ath/ath9k/ar9003_paprd.c      |    2 +-
 drivers/net/wireless/ath/ath9k/ar9003_phy.c        |    2 +-
 drivers/net/wireless/ath/ath9k/ar9003_phy.h        |    2 +-
 drivers/net/wireless/ath/ath9k/ar9485_initvals.h   |    2 +-
 drivers/net/wireless/ath/ath9k/ath9k.h             |    5 +-
 drivers/net/wireless/ath/ath9k/beacon.c            |   48 ++-
 drivers/net/wireless/ath/ath9k/btcoex.c            |    2 +-
 drivers/net/wireless/ath/ath9k/btcoex.h            |    2 +-
 drivers/net/wireless/ath/ath9k/calib.c             |    2 +-
 drivers/net/wireless/ath/ath9k/calib.h             |    2 +-
 drivers/net/wireless/ath/ath9k/common.c            |    2 +-
 drivers/net/wireless/ath/ath9k/common.h            |    2 +-
 drivers/net/wireless/ath/ath9k/debug.c             |   10 +-
 drivers/net/wireless/ath/ath9k/debug.h             |    2 +-
 drivers/net/wireless/ath/ath9k/eeprom.c            |    2 +-
 drivers/net/wireless/ath/ath9k/eeprom.h            |    2 +-
 drivers/net/wireless/ath/ath9k/eeprom_4k.c         |    2 +-
 drivers/net/wireless/ath/ath9k/eeprom_9287.c       |    2 +-
 drivers/net/wireless/ath/ath9k/eeprom_def.c        |    2 +-
 drivers/net/wireless/ath/ath9k/gpio.c              |    2 +-
 drivers/net/wireless/ath/ath9k/hif_usb.c           |    2 +-
 drivers/net/wireless/ath/ath9k/hif_usb.h           |    4 +-
 drivers/net/wireless/ath/ath9k/htc.h               |   25 +-
 drivers/net/wireless/ath/ath9k/htc_drv_beacon.c    |    2 +-
 drivers/net/wireless/ath/ath9k/htc_drv_gpio.c      |    2 +-
 drivers/net/wireless/ath/ath9k/htc_drv_init.c      |    9 +-
 drivers/net/wireless/ath/ath9k/htc_drv_main.c      |   79 +++--
 drivers/net/wireless/ath/ath9k/htc_drv_txrx.c      |    6 +-
 drivers/net/wireless/ath/ath9k/htc_hst.c           |    2 +-
 drivers/net/wireless/ath/ath9k/htc_hst.h           |    2 +-
 drivers/net/wireless/ath/ath9k/hw-ops.h            |    2 +-
 drivers/net/wireless/ath/ath9k/hw.c                |    2 +-
 drivers/net/wireless/ath/ath9k/hw.h                |    2 +-
 drivers/net/wireless/ath/ath9k/init.c              |    2 +-
 drivers/net/wireless/ath/ath9k/mac.c               |    2 +-
 drivers/net/wireless/ath/ath9k/mac.h               |    2 +-
 drivers/net/wireless/ath/ath9k/main.c              |   42 +++-
 drivers/net/wireless/ath/ath9k/pci.c               |    2 +-
 drivers/net/wireless/ath/ath9k/phy.h               |    2 +-
 drivers/net/wireless/ath/ath9k/rc.c                |    2 +-
 drivers/net/wireless/ath/ath9k/rc.h                |    2 +-
 drivers/net/wireless/ath/ath9k/recv.c              |    2 +-
 drivers/net/wireless/ath/ath9k/reg.h               |    2 +-
 drivers/net/wireless/ath/ath9k/wmi.c               |    2 +-
 drivers/net/wireless/ath/ath9k/wmi.h               |    2 +-
 drivers/net/wireless/ath/ath9k/xmit.c              |    2 +-
 drivers/net/wireless/ath/carl9170/carl9170.h       |    4 +
 drivers/net/wireless/ath/carl9170/fw.c             |   19 +-
 drivers/net/wireless/ath/carl9170/main.c           |   10 +-
 drivers/net/wireless/ath/hw.c                      |   10 +-
 drivers/net/wireless/b43/b43.h                     |   24 +-
 drivers/net/wireless/b43/dma.c                     |   37 +-
 drivers/net/wireless/b43/leds.c                    |    4 +-
 drivers/net/wireless/b43/lo.c                      |    4 +-
 drivers/net/wireless/b43/main.c                    |  194 ++++++-----
 drivers/net/wireless/b43/phy_a.c                   |   16 +-
 drivers/net/wireless/b43/phy_common.c              |    8 +-
 drivers/net/wireless/b43/phy_g.c                   |   48 ++--
 drivers/net/wireless/b43/phy_lp.c                  |   22 +-
 drivers/net/wireless/b43/phy_n.c                   |   24 +-
 drivers/net/wireless/b43/pio.c                     |   30 +-
 drivers/net/wireless/b43/rfkill.c                  |    6 +-
 drivers/net/wireless/b43/sdio.c                    |    4 +-
 drivers/net/wireless/b43/sysfs.c                   |    4 +-
 drivers/net/wireless/b43/tables_lpphy.c            |    4 +-
 drivers/net/wireless/b43/wa.c                      |    4 +-
 drivers/net/wireless/b43/xmit.c                    |    2 +-
 drivers/net/wireless/iwlwifi/iwl-1000.c            |    4 -
 drivers/net/wireless/iwlwifi/iwl-2000.c            |    8 +-
 drivers/net/wireless/iwlwifi/iwl-5000.c            |   12 +-
 drivers/net/wireless/iwlwifi/iwl-6000.c            |   12 +-
 drivers/net/wireless/iwlwifi/iwl-agn-calib.c       |   14 +-
 drivers/net/wireless/iwlwifi/iwl-agn-lib.c         |   14 +-
 drivers/net/wireless/iwlwifi/iwl-agn-rs.c          |   86 +++--
 drivers/net/wireless/iwlwifi/iwl-agn-rxon.c        |    9 +-
 drivers/net/wireless/iwlwifi/iwl-agn-sta.c         |    4 +-
 drivers/net/wireless/iwlwifi/iwl-agn-tx.c          |   16 +-
 drivers/net/wireless/iwlwifi/iwl-agn-ucode.c       |    6 +-
 drivers/net/wireless/iwlwifi/iwl-agn.c             |  250 +++-----------
 drivers/net/wireless/iwlwifi/iwl-agn.h             |   13 +-
 drivers/net/wireless/iwlwifi/iwl-commands.h        |    5 +-
 drivers/net/wireless/iwlwifi/iwl-core.h            |   10 -
 drivers/net/wireless/iwlwifi/iwl-dev.h             |   66 +++--
 drivers/net/wireless/iwlwifi/iwl-devtrace.h        |   58 +++-
 drivers/net/wireless/iwlwifi/iwl-eeprom.c          |    7 +-
 drivers/net/wireless/iwlwifi/iwl-hcmd.c            |    9 +-
 drivers/net/wireless/iwlwifi/iwl-led.c             |    4 +-
 drivers/net/wireless/iwlwifi/iwl-sta.c             |   12 +-
 drivers/net/wireless/iwlwifi/iwl-sv-open.c         |  177 ++++++++++-
 drivers/net/wireless/iwlwifi/iwl-testmode.h        |   34 ++
 drivers/net/wireless/iwlwifi/iwl-tx.c              |  364 ++++++++++++++------
 drivers/net/wireless/iwmc3200wifi/rx.c             |    4 +-
 drivers/net/wireless/mwifiex/11n_aggr.c            |    4 +
 drivers/net/wireless/mwifiex/main.h                |    9 +-
 drivers/net/wireless/mwifiex/txrx.c                |    4 +-
 drivers/net/wireless/mwifiex/wmm.c                 |   59 +++-
 drivers/net/wireless/p54/p54usb.c                  |    1 +
 drivers/net/wireless/rndis_wlan.c                  |    3 +-
 drivers/net/wireless/rtlwifi/ps.c                  |    2 +-
 drivers/net/wireless/rtlwifi/rtl8192c/phy_common.c |    2 +-
 drivers/net/wireless/rtlwifi/rtl8192ce/phy.c       |   69 ++++
 drivers/net/wireless/rtlwifi/rtl8192ce/phy.h       |    1 +
 drivers/net/wireless/rtlwifi/rtl8192ce/sw.c        |    1 +
 drivers/net/xen-netfront.c                         |    2 +
 drivers/staging/ath6kl/os/linux/cfg80211.c         |    2 +-
 drivers/staging/brcm80211/brcmfmac/wl_cfg80211.c   |    4 +-
 drivers/staging/wlan-ng/cfg80211.c                 |    2 +-
 fs/proc/generic.c                                  |    1 +
 include/asm-generic/bug.h                          |   37 ++
 include/linux/if_vlan.h                            |    5 +
 include/linux/rfkill-gpio.h                        |   43 +++
 include/net/cfg80211.h                             |    8 +-
 include/net/dst.h                                  |    2 +
 net/802/psnap.c                                    |    1 -
 net/8021q/vlan.h                                   |    5 -
 net/atm/proc.c                                     |    4 +-
 net/bridge/br_netfilter.c                          |    6 +-
 net/can/bcm.c                                      |    6 +-
 net/core/dev.c                                     |   12 +-
 net/core/dst.c                                     |    2 +-
 net/core/fib_rules.c                               |    1 +
 net/core/filter.c                                  |    4 +-
 net/core/rtnetlink.c                               |    9 +-
 net/ipv4/igmp.c                                    |   10 +-
 net/ipv4/ping.c                                    |    3 -
 net/ipv4/raw.c                                     |    2 +-
 net/ipv4/tcp_ipv4.c                                |    6 +-
 net/ipv4/udp.c                                     |    2 +-
 net/ipv6/raw.c                                     |    2 +-
 net/ipv6/tcp_ipv6.c                                |    6 +-
 net/ipv6/udp.c                                     |    2 +-
 net/ipv6/xfrm6_tunnel.c                            |    2 +-
 net/key/af_key.c                                   |    2 +-
 net/mac80211/iface.c                               |    4 +-
 net/mac80211/main.c                                |   22 +-
 net/mac80211/mesh.h                                |    7 +-
 net/mac80211/mesh_pathtbl.c                        |  204 +++++++----
 net/mac80211/scan.c                                |    5 +
 net/netlink/af_netlink.c                           |    2 +-
 net/packet/af_packet.c                             |    2 +-
 net/phonet/socket.c                                |    2 +-
 net/rfkill/Kconfig                                 |    9 +
 net/rfkill/Makefile                                |    1 +
 net/rfkill/rfkill-gpio.c                           |  227 ++++++++++++
 net/sched/sch_sfq.c                                |   22 +-
 net/sctp/associola.c                               |   16 +
 net/sctp/proc.c                                    |    4 +-
 net/unix/af_unix.c                                 |    2 +-
 net/wireless/core.h                                |    5 +-
 net/wireless/nl80211.c                             |   12 +-
 net/wireless/sme.c                                 |   19 +-
 net/wireless/util.c                                |    2 +-
 187 files changed, 2050 insertions(+), 1204 deletions(-)
 create mode 100644 include/linux/rfkill-gpio.h
 create mode 100644 net/rfkill/rfkill-gpio.c

^ permalink raw reply

* Re: [PATCH V5 2/6 net-next] netdevice.h: Add zero-copy flag in netdevice
From: Shirley Ma @ 2011-05-25 22:49 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Michał Mirosław, Ben Hutchings, David Miller,
	Eric Dumazet, Avi Kivity, Arnd Bergmann, netdev, kvm,
	linux-kernel
In-Reply-To: <20110519234154.GA13784@redhat.com>

On Fri, 2011-05-20 at 02:41 +0300, Michael S. Tsirkin wrote:
> So the requirements are
> - data must be released in a timely fashion (e.g. unlike virtio-net
>   tun or bridge)
The current patch doesn't enable tun zero-copy; tun will copy the data,
so it's not an issue now. We can disallow attaching macvtap to a bridge
when zero-copy is enabled.

> - SG support
> - HIGHDMA support (on arches where this makes sense)

This can be checked by device flags.

> - no filtering based on data (data is mapped in guest)

> - on fast path no calls to skb_copy, skb_clone, pskb_copy,
>   pskb_expand_head as these are slow

Any call to skb_copy, skb_clone, pskb_copy or pskb_expand_head will do a
copy. The performance should be the same as in the non-zero-copy case.
I have finished and tested patch V6 and will send it out for review tomorrow.

I am looking at whether there are cases where the skb remains the same
for filtering.

> First 2 requirements are a must, all other requirements
> are just dependencies to make sure zero copy will be faster
> than non zero copy.
> Using a new feature bit is probably the simplest approach to
> this. macvtap on top of most physical NICs most likely works
> correctly so it seems a bit more work than it needs to be,
> but it's also the safest one I think ... 

For "macvtap/vhost zero-copy" we can use SG & HIGHDMA to enable it; it
looks safe to me once skb_copy, skb_clone, pskb_copy and
pskb_expand_head are patched.
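
A rough sketch of that flag check, in userspace C rather than kernel
code. The two flag values are the ones netdevice.h defines at this
point; the helper name zerocopy_capable() is made up for illustration
and is not part of any posted patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag values copied from include/linux/netdevice.h of this era. */
#define NETIF_F_SG	(1U << 0)	/* Scatter/gather IO. */
#define NETIF_F_HIGHDMA	(1U << 5)	/* Can DMA to high memory. */

/* Hypothetical helper: the lower device must offer both SG and HIGHDMA. */
static bool zerocopy_capable(uint32_t features)
{
	const uint32_t need = NETIF_F_SG | NETIF_F_HIGHDMA;

	return (features & need) == need;
}

/* Usage: SG alone is not enough, both flags together are. */
static void zerocopy_capable_example(void)
{
	assert(!zerocopy_capable(NETIF_F_SG));
	assert(zerocopy_capable(NETIF_F_SG | NETIF_F_HIGHDMA));
}
```

Comparing the masked value against the full mask, rather than testing
for non-zero, is what makes this an "all of these flags" check.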

To extend zero-copy to other uses, we can add a new feature bit
later.

Is that reasonable?

Thanks
Shirley

^ permalink raw reply

* [PATCHv3] net: Abstract features usage.
From: Mahesh Bandewar @ 2011-05-25 22:43 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, Mahesh Bandewar, Tom Herbert, Michał Mirosław,
	Stephen Hemminger
In-Reply-To: <1306288544-1700-1-git-send-email-maheshb@google.com>

Define macros to set/clear/test bits for feature set usage. This eliminates
direct use of these fields and makes them easier to manage in the future.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
---
Changes since v2:
 Added the include which accidentally went into the other patch.

Changes since v1:
 Split the patch into two pieces.

 include/linux/netdev_features.h |   64 +++++++++++++++++++++++++++++++++++++++
 include/linux/netdevice.h       |    9 +++++
 2 files changed, 73 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/netdev_features.h

diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
new file mode 100644
index 0000000..3043c4d
--- /dev/null
+++ b/include/linux/netdev_features.h
@@ -0,0 +1,64 @@
+#ifndef _NETDEV_FEATURES_H
+#define _NETDEV_FEATURES_H
+
+/* Forward declarations */
+struct net_device;
+
+typedef unsigned long *nd_feature_t;
+
+static inline void _nd_set_feature(u32 *old_field,
+		unsigned long *new_field, int bit)
+{
+	if (bit < 32)
+		*old_field |= (1 << bit);
+	set_bit(bit, new_field);
+}
+
+static inline void _nd_clear_feature(u32 *old_field,
+		unsigned long *new_field, int bit)
+{
+	if (bit < 32)
+		*old_field &= ~(1 << bit);
+
+	clear_bit(bit, new_field);
+}
+
+static inline bool _nd_test_feature(u32 old_field,
+		unsigned long *new_field, int bit)
+{
+	if (bit < 32)
+		return (old_field & (1U << bit)) != 0;
+
+	return test_bit(bit, new_field) != 0;
+}
+
+#define netdev_set_active_feature(dev, bit)	\
+	_nd_set_feature(&(dev)->features, (dev)->active_feature, (bit))
+#define netdev_clear_active_feature(dev, bit)	\
+	_nd_clear_feature(&(dev)->features, (dev)->active_feature, (bit))
+#define netdev_test_active_feature(dev, bit)	\
+	_nd_test_feature((dev)->features, (dev)->active_feature, (bit))
+
+#define netdev_set_offered_feature(dev, bit)	\
+	_nd_set_feature(&(dev)->hw_features, (dev)->offered_feature, (bit))
+#define netdev_clear_offered_feature(dev, bit)	\
+	_nd_clear_feature(&(dev)->hw_features, (dev)->offered_feature, (bit))
+#define netdev_test_offered_feature(dev, bit)	\
+	_nd_test_feature((dev)->hw_features, (dev)->offered_feature, (bit))
+
+#define netdev_set_vlan_feature(dev, bit)	\
+	_nd_set_feature(&(dev)->vlan_features, (dev)->vlan_feature, (bit))
+#define netdev_clear_vlan_feature(dev, bit)	\
+	_nd_clear_feature(&(dev)->vlan_features, (dev)->vlan_feature, (bit))
+#define netdev_test_vlan_feature(dev, bit)	\
+	_nd_test_feature((dev)->vlan_features, (dev)->vlan_feature, (bit))
+
+#define netdev_set_wanted_feature(dev, bit)	\
+	_nd_set_feature(&(dev)->wanted_features, (dev)->wanted_feature, (bit))
+#define netdev_clear_wanted_feature(dev, bit)	\
+	_nd_clear_feature(&(dev)->wanted_features, (dev)->wanted_feature, (bit))
+#define netdev_test_wanted_feature(dev, bit)	\
+	_nd_test_feature((dev)->wanted_features, (dev)->wanted_feature, (bit))
+
+
+#endif	/* _NETDEV_FEATURES_H */
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 9bb5872..ca31706 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -51,6 +51,7 @@
 #ifdef CONFIG_DCB
 #include <net/dcbnl.h>
 #endif
+#include <linux/netdev_features.h>
 
 struct vlan_group;
 struct netpoll_info;
@@ -1078,6 +1079,14 @@ struct net_device {
 	/* mask of features inheritable by VLAN devices */
 	u32			vlan_features;
 
+#define DEV_FEATURE_WORDS	BITS_TO_LONGS(ND_FEATURE_NUM_BITS)
+#define DEV_FEATURE_BITS	(DEV_FEATURE_WORDS * BITS_PER_LONG)
+
+	DECLARE_BITMAP(active_feature, DEV_FEATURE_BITS);
+	DECLARE_BITMAP(offered_feature, DEV_FEATURE_BITS);
+	DECLARE_BITMAP(wanted_feature, DEV_FEATURE_BITS);
+	DECLARE_BITMAP(vlan_feature, DEV_FEATURE_BITS);
+
 #define BIT2FLAG(bit)		(1 << (bit))
 
 #define NETIF_F_SG		BIT2FLAG(NETIF_F_SG_BIT)
-- 
1.7.3.1
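
For readers following along, here is a minimal userspace sketch of the
dual-write idea behind the helpers in this patch: each update goes to
the legacy u32 mask (for bits below 32) and to a wider bitmap. It is
not the patch code. The names nd_set(), nd_clear() and nd_test() are
made up, and plain C bit operations stand in for the kernel's
set_bit()/clear_bit()/test_bit(). Note that a masked test is compared
against zero, because the masked value equals 1 only for bit 0.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

#define BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)

/* Set a feature bit in the legacy mask (if it fits) and in the bitmap. */
static void nd_set(uint32_t *old_field, unsigned long *bitmap, unsigned int bit)
{
	if (bit < 32)
		*old_field |= UINT32_C(1) << bit;
	bitmap[bit / BITS_PER_LONG] |= 1UL << (bit % BITS_PER_LONG);
}

/* Clear a feature bit in both representations. */
static void nd_clear(uint32_t *old_field, unsigned long *bitmap, unsigned int bit)
{
	if (bit < 32)
		*old_field &= ~(UINT32_C(1) << bit);
	bitmap[bit / BITS_PER_LONG] &= ~(1UL << (bit % BITS_PER_LONG));
}

/* Test a bit: legacy mask for bits below 32, bitmap otherwise. */
static bool nd_test(uint32_t old_field, const unsigned long *bitmap,
		    unsigned int bit)
{
	if (bit < 32)
		return (old_field & (UINT32_C(1) << bit)) != 0;
	return ((bitmap[bit / BITS_PER_LONG] >> (bit % BITS_PER_LONG)) & 1UL) != 0;
}

/* Usage: bit 5 lands in both fields, bit 40 only in the bitmap. */
static void nd_example(void)
{
	uint32_t legacy = 0;
	unsigned long map[2] = { 0, 0 };

	nd_set(&legacy, map, 5);
	nd_set(&legacy, map, 40);
	assert(legacy == 32u);
	assert(nd_test(legacy, map, 5) && nd_test(legacy, map, 40));
	nd_clear(&legacy, map, 5);
	assert(!nd_test(legacy, map, 5));
}
```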


^ permalink raw reply related

* Re: [RFC] af-packet: Save reference to bound network device.
From: David Miller @ 2011-05-25 22:42 UTC (permalink / raw)
  To: greearb; +Cc: netdev
In-Reply-To: <4DDD8487.6070000@candelatech.com>

From: Ben Greear <greearb@candelatech.com>
Date: Wed, 25 May 2011 15:36:55 -0700

> I can't see where the code holds any reference to prot_hook.dev.
> (It just assigns the pointer and then does a dev_put()).
> 
> Maybe it gets away with it because a NETDEV_UNREGISTER event
> is always sent?

I think that is precisely the property it is depending upon.

It may seem sketchy, but as far as I can tell it's completely
legal.
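
A toy userspace model of that property, for illustration only. The
fake_* names are invented, and the locking the real packet_notifier()
does under bind_lock is left out. The socket keeps a borrowed pointer
with no reference held, and that is safe only because every unregister
path notifies listeners before the device can go away:

```c
#include <assert.h>
#include <stddef.h>

struct fake_dev {
	int ifindex;
	int registered;
};

struct fake_sock {
	int ifindex;
	struct fake_dev *dev;	/* borrowed pointer, no reference held */
};

enum fake_event { FAKE_NETDEV_UNREGISTER };

/* Listener: drop the borrowed pointer when its device is unregistered. */
static void fake_notifier(struct fake_sock *po, enum fake_event ev,
			  struct fake_dev *dev)
{
	if (ev == FAKE_NETDEV_UNREGISTER && po->dev == dev) {
		po->ifindex = -1;
		po->dev = NULL;
	}
}

/* Unregister: notify every listener first, only then retire the device. */
static void fake_unregister(struct fake_dev *dev, struct fake_sock *socks,
			    size_t n)
{
	for (size_t i = 0; i < n; i++)
		fake_notifier(&socks[i], FAKE_NETDEV_UNREGISTER, dev);
	dev->registered = 0;
}

/* Usage: after unregister no socket still points at the device. */
static void fake_example(void)
{
	struct fake_dev eth = { 4, 1 };
	struct fake_sock socks[2] = { { 4, &eth }, { 4, &eth } };

	fake_unregister(&eth, socks, 2);
	assert(socks[0].dev == NULL && socks[1].dev == NULL);
	assert(socks[0].ifindex == -1 && !eth.registered);
}
```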

^ permalink raw reply

* [PATCHv3] net: Define enum for the bits used in features.
From: Mahesh Bandewar @ 2011-05-25 22:42 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, Mahesh Bandewar, Tom Herbert, Michał Mirosław,
	Stephen Hemminger
In-Reply-To: <1306288567-1773-1-git-send-email-maheshb@google.com>

A little cleanup: define an enum for all feature bits in use, and redefine
the flags in terms of those enum values.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
---
Changes since v2:
 (1) Removed the include which was part of the other patch (split mishap).
 (2) Changed the enums to add NETIF_F_ prefix.

Changes since v1:
 Split the patch into two pieces.

 include/linux/netdevice.h |   99 +++++++++++++++++++++++++++++++--------------
 1 files changed, 69 insertions(+), 30 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ca333e7..9bb5872 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -981,6 +981,49 @@ struct net_device_ops {
 };
 
 /*
+ * Net device feature bits; if you change something,
+ * also update netdev_features_strings[] in ethtool.c
+ */
+enum netdev_features {
+	NETIF_F_SG_BIT,			/* Scatter/gather IO. */
+	NETIF_F_IP_CSUM_BIT,		/* Can checksum TCP/UDP over IPv4. */
+	NETIF_F_NO_CSUM_BIT,		/* Does not require checksum. F.e. loopback. */
+	NETIF_F_HW_CSUM_BIT,		/* Can checksum all the packets. */
+	NETIF_F_IPV6_CSUM_BIT,		/* Can checksum TCP/UDP over IPV6 */
+	NETIF_F_HIGHDMA_BIT,		/* Can DMA to high memory. */
+	NETIF_F_FRAGLIST_BIT,		/* Scatter/gather IO. */
+	NETIF_F_HW_VLAN_TX_BIT,		/* Transmit VLAN hw acceleration */
+	NETIF_F_HW_VLAN_RX_BIT,		/* Receive VLAN hw acceleration */
+	NETIF_F_HW_VLAN_FILTER_BIT,	/* Receive filtering on VLAN */
+	NETIF_F_VLAN_CHALLENGED_BIT,	/* Device cannot handle VLAN packets */
+	NETIF_F_GSO_BIT,		/* Enable software GSO. */
+	NETIF_F_LLTX_BIT,		/* LockLess TX - deprecated. Please */
+					/* do not use LLTX in new drivers */
+	NETIF_F_NETNS_LOCAL_BIT,	/* Does not change network namespaces */
+	NETIF_F_GRO_BIT,		/* Generic receive offload */
+	NETIF_F_LRO_BIT,		/* large receive offload */
+	RESERVED16_BIT,			/* the GSO_MASK reserved bit 16 */
+	RESERVED17_BIT,			/* the GSO_MASK reserved bit 17 */
+	RESERVED18_BIT,			/* the GSO_MASK reserved bit 18 */
+	RESERVED19_BIT,			/* the GSO_MASK reserved bit 19 */
+	RESERVED20_BIT,			/* the GSO_MASK reserved bit 20 */
+	RESERVED21_BIT,			/* the GSO_MASK reserved bit 21 */
+	RESERVED22_BIT,			/* the GSO_MASK reserved bit 22 */
+	RESERVED23_BIT,			/* the GSO_MASK reserved bit 23 */
+	NETIF_F_FCOE_CRC_BIT,		/* FCoE CRC32 */
+	NETIF_F_SCTP_CSUM_BIT,		/* SCTP checksum offload */
+	NETIF_F_FCOE_MTU_BIT,		/* Supports max FCoE MTU, 2158 bytes*/
+	NETIF_F_NTUPLE_BIT,		/* N-tuple filters supported */
+	NETIF_F_RXHASH_BIT,		/* Receive hashing offload */
+	NETIF_F_RXCSUM_BIT,		/* Receive checksumming offload */
+	NETIF_F_NOCACHE_COPY_BIT,	/* Use no-cache copyfromuser */
+	NETIF_F_LOOPBACK_BIT,		/* Enable loopback */
+
+	/* Add your bit above this */
+	ND_FEATURE_NUM_BITS		/* (LAST VALUE) Total bits in use */
+};
+
+/*
  *	The DEVICE structure.
  *	Actually, this whole structure is a big mistake.  It mixes I/O
  *	data with strictly "high-level" data, and it has to know about
@@ -1035,36 +1078,32 @@ struct net_device {
 	/* mask of features inheritable by VLAN devices */
 	u32			vlan_features;
 
-	/* Net device feature bits; if you change something,
-	 * also update netdev_features_strings[] in ethtool.c */
-
-#define NETIF_F_SG		1	/* Scatter/gather IO. */
-#define NETIF_F_IP_CSUM		2	/* Can checksum TCP/UDP over IPv4. */
-#define NETIF_F_NO_CSUM		4	/* Does not require checksum. F.e. loopack. */
-#define NETIF_F_HW_CSUM		8	/* Can checksum all the packets. */
-#define NETIF_F_IPV6_CSUM	16	/* Can checksum TCP/UDP over IPV6 */
-#define NETIF_F_HIGHDMA		32	/* Can DMA to high memory. */
-#define NETIF_F_FRAGLIST	64	/* Scatter/gather IO. */
-#define NETIF_F_HW_VLAN_TX	128	/* Transmit VLAN hw acceleration */
-#define NETIF_F_HW_VLAN_RX	256	/* Receive VLAN hw acceleration */
-#define NETIF_F_HW_VLAN_FILTER	512	/* Receive filtering on VLAN */
-#define NETIF_F_VLAN_CHALLENGED	1024	/* Device cannot handle VLAN packets */
-#define NETIF_F_GSO		2048	/* Enable software GSO. */
-#define NETIF_F_LLTX		4096	/* LockLess TX - deprecated. Please */
-					/* do not use LLTX in new drivers */
-#define NETIF_F_NETNS_LOCAL	8192	/* Does not change network namespaces */
-#define NETIF_F_GRO		16384	/* Generic receive offload */
-#define NETIF_F_LRO		32768	/* large receive offload */
-
-/* the GSO_MASK reserves bits 16 through 23 */
-#define NETIF_F_FCOE_CRC	(1 << 24) /* FCoE CRC32 */
-#define NETIF_F_SCTP_CSUM	(1 << 25) /* SCTP checksum offload */
-#define NETIF_F_FCOE_MTU	(1 << 26) /* Supports max FCoE MTU, 2158 bytes*/
-#define NETIF_F_NTUPLE		(1 << 27) /* N-tuple filters supported */
-#define NETIF_F_RXHASH		(1 << 28) /* Receive hashing offload */
-#define NETIF_F_RXCSUM		(1 << 29) /* Receive checksumming offload */
-#define NETIF_F_NOCACHE_COPY	(1 << 30) /* Use no-cache copyfromuser */
-#define NETIF_F_LOOPBACK	(1 << 31) /* Enable loopback */
+#define BIT2FLAG(bit)		(1 << (bit))
+
+#define NETIF_F_SG		BIT2FLAG(NETIF_F_SG_BIT)
+#define NETIF_F_IP_CSUM		BIT2FLAG(NETIF_F_IP_CSUM_BIT)
+#define NETIF_F_NO_CSUM		BIT2FLAG(NETIF_F_NO_CSUM_BIT)
+#define NETIF_F_HW_CSUM		BIT2FLAG(NETIF_F_HW_CSUM_BIT)
+#define NETIF_F_IPV6_CSUM	BIT2FLAG(NETIF_F_IPV6_CSUM_BIT)
+#define NETIF_F_HIGHDMA		BIT2FLAG(NETIF_F_HIGHDMA_BIT)
+#define NETIF_F_FRAGLIST	BIT2FLAG(NETIF_F_FRAGLIST_BIT)
+#define NETIF_F_HW_VLAN_TX	BIT2FLAG(NETIF_F_HW_VLAN_TX_BIT)
+#define NETIF_F_HW_VLAN_RX	BIT2FLAG(NETIF_F_HW_VLAN_RX_BIT)
+#define NETIF_F_HW_VLAN_FILTER	BIT2FLAG(NETIF_F_HW_VLAN_FILTER_BIT)
+#define NETIF_F_VLAN_CHALLENGED	BIT2FLAG(NETIF_F_VLAN_CHALLENGED_BIT)
+#define NETIF_F_GSO		BIT2FLAG(NETIF_F_GSO_BIT)
+#define NETIF_F_LLTX		BIT2FLAG(NETIF_F_LLTX_BIT)
+#define NETIF_F_NETNS_LOCAL	BIT2FLAG(NETIF_F_NETNS_LOCAL_BIT)
+#define NETIF_F_GRO		BIT2FLAG(NETIF_F_GRO_BIT)
+#define NETIF_F_LRO		BIT2FLAG(NETIF_F_LRO_BIT)
+#define NETIF_F_FCOE_CRC	BIT2FLAG(NETIF_F_FCOE_CRC_BIT)
+#define NETIF_F_SCTP_CSUM	BIT2FLAG(NETIF_F_SCTP_CSUM_BIT)
+#define NETIF_F_FCOE_MTU	BIT2FLAG(NETIF_F_FCOE_MTU_BIT)
+#define NETIF_F_NTUPLE		BIT2FLAG(NETIF_F_NTUPLE_BIT)
+#define NETIF_F_RXHASH		BIT2FLAG(NETIF_F_RXHASH_BIT)
+#define NETIF_F_RXCSUM		BIT2FLAG(NETIF_F_RXCSUM_BIT)
+#define NETIF_F_NOCACHE_COPY	BIT2FLAG(NETIF_F_NOCACHE_COPY_BIT)
+#define NETIF_F_LOOPBACK	BIT2FLAG(NETIF_F_LOOPBACK_BIT)
 
 	/* Segmentation offload features */
 #define NETIF_F_GSO_SHIFT	16
-- 
1.7.3.1
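
As a sanity check on the renumbering, the following standalone C11
sketch repeats the enum from this patch and asserts at compile time
that the bit positions reproduce the old literal flag values. It is
not part of the patch. BIT2FLAG_U() is a made-up unsigned variant of
the patch's BIT2FLAG(), used here so that bit 31 does not shift into
the sign bit of an int:

```c
#include <assert.h>	/* static_assert (C11) */

enum netdev_features {
	NETIF_F_SG_BIT,
	NETIF_F_IP_CSUM_BIT,
	NETIF_F_NO_CSUM_BIT,
	NETIF_F_HW_CSUM_BIT,
	NETIF_F_IPV6_CSUM_BIT,
	NETIF_F_HIGHDMA_BIT,
	NETIF_F_FRAGLIST_BIT,
	NETIF_F_HW_VLAN_TX_BIT,
	NETIF_F_HW_VLAN_RX_BIT,
	NETIF_F_HW_VLAN_FILTER_BIT,
	NETIF_F_VLAN_CHALLENGED_BIT,
	NETIF_F_GSO_BIT,
	NETIF_F_LLTX_BIT,
	NETIF_F_NETNS_LOCAL_BIT,
	NETIF_F_GRO_BIT,
	NETIF_F_LRO_BIT,
	RESERVED16_BIT,		/* bits 16..23 belong to the GSO mask */
	RESERVED17_BIT,
	RESERVED18_BIT,
	RESERVED19_BIT,
	RESERVED20_BIT,
	RESERVED21_BIT,
	RESERVED22_BIT,
	RESERVED23_BIT,
	NETIF_F_FCOE_CRC_BIT,
	NETIF_F_SCTP_CSUM_BIT,
	NETIF_F_FCOE_MTU_BIT,
	NETIF_F_NTUPLE_BIT,
	NETIF_F_RXHASH_BIT,
	NETIF_F_RXCSUM_BIT,
	NETIF_F_NOCACHE_COPY_BIT,
	NETIF_F_LOOPBACK_BIT,
	ND_FEATURE_NUM_BITS	/* total bits in use */
};

/* Hypothetical unsigned variant of the patch's BIT2FLAG(). */
#define BIT2FLAG_U(bit)	(1U << (bit))

/* The enum positions must reproduce the old #define values exactly. */
static_assert(BIT2FLAG_U(NETIF_F_SG_BIT) == 1u, "SG");
static_assert(BIT2FLAG_U(NETIF_F_GSO_BIT) == 2048u, "GSO");
static_assert(BIT2FLAG_U(NETIF_F_LRO_BIT) == 32768u, "LRO");
static_assert(NETIF_F_FCOE_CRC_BIT == 24, "FCOE_CRC follows the GSO mask");
static_assert(NETIF_F_LOOPBACK_BIT == 31, "LOOPBACK is the top u32 bit");
static_assert(ND_FEATURE_NUM_BITS == 32, "32 bits in use");
```

The eight RESERVED entries are what keep NETIF_F_FCOE_CRC_BIT at
position 24; dropping any of them would silently shift every later flag.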


^ permalink raw reply related

* Re: [RFC] af-packet: Save reference to bound network device.
From: Ben Greear @ 2011-05-25 22:36 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20110525.181418.1100603684033986711.davem@davemloft.net>

On 05/25/2011 03:14 PM, David Miller wrote:
> From: Ben Greear<greearb@candelatech.com>
> Date: Wed, 25 May 2011 15:05:10 -0700
>
>> Doesn't this piece of code take care of that?
>> I tested with rmmod..but of course I could have missed something.
>>
>> @@ -2266,6 +2284,10 @@ static int packet_notifier(struct
>> notifier_block *this, unsigned long msg, void
>>   				}
>>   				if (msg == NETDEV_UNREGISTER) {
>>   					po->ifindex = -1;
>> +					if (po->bound_dev) {
>> +						dev_put(po->bound_dev);
>> +						po->bound_dev = NULL;
>> +					}
>>   					po->prot_hook.dev = NULL;
>>   				}
>>   				spin_unlock(&po->bind_lock);
>>
>
> Indeed, it should, thanks for pointing that out.
>
> Wait a second, why do you need to store the device a second
> time, can't you get at po->prot_hook.dev in all the necessary
> spots?

I can't see where the code holds any reference to prot_hook.dev.
(It just assigns the pointer and then does a dev_put()).

Maybe it gets away with it because a NETDEV_UNREGISTER event
is always sent?

Or, maybe we should hold a ref to it?

Thanks,
Ben


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply

* Re: [RFC] af-packet: Save reference to bound network device.
From: Ben Greear @ 2011-05-25 22:22 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20110525.181418.1100603684033986711.davem@davemloft.net>

On 05/25/2011 03:14 PM, David Miller wrote:
> From: Ben Greear<greearb@candelatech.com>
> Date: Wed, 25 May 2011 15:05:10 -0700
>
>> Doesn't this piece of code take care of that?
>> I tested with rmmod..but of course I could have missed something.
>>
>> @@ -2266,6 +2284,10 @@ static int packet_notifier(struct
>> notifier_block *this, unsigned long msg, void
>>   				}
>>   				if (msg == NETDEV_UNREGISTER) {
>>   					po->ifindex = -1;
>> +					if (po->bound_dev) {
>> +						dev_put(po->bound_dev);
>> +						po->bound_dev = NULL;
>> +					}
>>   					po->prot_hook.dev = NULL;
>>   				}
>>   				spin_unlock(&po->bind_lock);
>>
>
> Indeed, it should, thanks for pointing that out.
>
> Wait a second, why do you need to store the device a second
> time, can't you get at po->prot_hook.dev in all the necessary
> spots?

I think so...I'll poke at the code a bit and run some more
tests using that instead...

Thanks,
Ben


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply

* IFB and iptables
From: Jérôme Poulin @ 2011-05-25 22:21 UTC (permalink / raw)
  To: netdev

Hi,

I'm trying to convert my IMQ based script to use the IFB device instead.
Things appear to work quite right however the u32 classifier isn't
aware of any connection tracking and I was wondering if it is at all
possible to use match from iptables like layer7 when you use the IFB
device?

I need the IFB device / IMQ because I want to classify my IPv6
traffic, which is carried in an IPv4 SIT tunnel, and mix the contents
of the SIT tunnel with the eth0 traffic minus protocol 41.

Thanks.

^ permalink raw reply

* Re: [RFC] af-packet: Save reference to bound network device.
From: David Miller @ 2011-05-25 22:14 UTC (permalink / raw)
  To: greearb; +Cc: netdev
In-Reply-To: <4DDD7D16.6030907@candelatech.com>

From: Ben Greear <greearb@candelatech.com>
Date: Wed, 25 May 2011 15:05:10 -0700

> Doesn't this piece of code take care of that?
> I tested with rmmod..but of course I could have missed something.
> 
> @@ -2266,6 +2284,10 @@ static int packet_notifier(struct notifier_block *this, unsigned long msg, void
>  				}
>  				if (msg == NETDEV_UNREGISTER) {
>  					po->ifindex = -1;
> +					if (po->bound_dev) {
> +						dev_put(po->bound_dev);
> +						po->bound_dev = NULL;
> +					}
>  					po->prot_hook.dev = NULL;
>  				}
>  				spin_unlock(&po->bind_lock);
> 

Indeed, it should, thanks for pointing that out.

Wait a second, why do you need to store the device a second
time, can't you get at po->prot_hook.dev in all the necessary
spots?

^ permalink raw reply

* Re: [GIT PULL] Namespace file descriptors for 2.6.40
From: Michał Mirosław @ 2011-05-25 22:11 UTC (permalink / raw)
  To: C Anthony Risinger
  Cc: Serge E. Hallyn, Eric W. Biederman, Linux Containers, netdev,
	linux-kernel
In-Reply-To: <BANLkTinbw6pZjhMscfXFMArd=XU=VC=+eQ@mail.gmail.com>

2011/5/25 C Anthony Risinger <anthony@xtfx.me>:
> On Wed, May 25, 2011 at 4:38 PM, Serge E. Hallyn <serge@hallyn.com> wrote:
>> Quoting C Anthony Risinger (anthony@xtfx.me):
[...]
>>> if i understand correctly, mount namespaces (for example), allow one
>>> to build such constructs as "private /tmp" and similar that even
>>> `root` cannot access ... and there are many reasons `root` does not
>>> deserve to completely know/interact with user processes (FUSE makes a
>>> good example ... just because i [user] have SSH access to a machine,
>>> why should `root`?)
>> If for instance you have a file open in your private /tmp, then root
>> in another mounts ns can open the file through /proc/$$/fd/N anyway.
>> If it's a directory, he can now traverse the whole fs.
> aaah right :-( ... there's always another way isn't there ... curse
> you Linux for being so flexible! (just kidding baby i love you)
>
> this seems like a more fundamental issue then?  or should i not expect
> to be able to achieve separation like this?  i ask in the context of
> OS virt via cgroups + namespaces, eg. LXC et al, because i'm about to
> perform a massive overhaul to our crusty sub-2.6.18 infrastructure and
> i've used/followed these technologies for a couple of years now ... and
> it's starting to feel like "the right time".

You either trust the admin or don't use the machine. There is no third way.

Best Regards,
Michał Mirosław

^ permalink raw reply

* Re: [RFC] af-packet: Save reference to bound network device.
From: Ben Greear @ 2011-05-25 22:05 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20110525.180113.1194226831134092545.davem@davemloft.net>

On 05/25/2011 03:01 PM, David Miller wrote:
> From: greearb@candelatech.com
> Date: Wed, 25 May 2011 14:56:42 -0700
>
>> From: Ben Greear <greearb@candelatech.com>
>>
>> This saves a network device lookup on each packet transmitted,
>> for sockets that are bound to a network device.
>>
>> Signed-off-by: Ben Greear <greearb@candelatech.com>
>
> You can't hold onto devices like this unless you also add a netdev
> event notifier that will release it.  Otherwise we'll hang on net
> driver module unload until the packet socket is closed.
>
> I don't think you really want to walk all pf-packet sockets on netdev
> events just to do this.

Doesn't this piece of code take care of that?
I tested with rmmod..but of course I could have missed something.

@@ -2266,6 +2284,10 @@ static int packet_notifier(struct notifier_block *this, unsigned long msg, void
  				}
  				if (msg == NETDEV_UNREGISTER) {
  					po->ifindex = -1;
+					if (po->bound_dev) {
+						dev_put(po->bound_dev);
+						po->bound_dev = NULL;
+					}
  					po->prot_hook.dev = NULL;
  				}
  				spin_unlock(&po->bind_lock);


>
> dev_get_by_index{,_rcu}() is insanely cheap, I doubt it's showing up
> on your profiles at all.

I admit it was a small change...maybe 5Mbps (from 165 to 170Mbps in
this particular test), but it did seem to improve things a bit.

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply

