* Re: [Bugme-new] [Bug 35862] New: arp requests from wrong src IP
From: David Miller @ 2011-05-26 1:52 UTC (permalink / raw)
To: akpm; +Cc: netdev, bugzilla-daemon, bugme-daemon, matare
In-Reply-To: <20110525163137.6f04f26e.akpm@linux-foundation.org>
From: Andrew Morton <akpm@linux-foundation.org>
Date: Wed, 25 May 2011 16:31:37 -0700
>> I switched a host's ip address from 137.226.164.13 to 137.226.164.2. The .13 IP
>> now belongs to the host that had .2 before (I swapped them). Now both hosts
>> still arp from their old IPs although ifconfig as well as ip clearly tell
>> otherwise. Examining the host which now has 137.226.164.13:
>>
>> # ip addr show dev eth0
>> 4: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
>> link/ether 00:e0:81:41:1f:e4 brd ff:ff:ff:ff:ff:ff
>> inet 137.226.164.2/24 brd 137.226.164.255 scope global eth0
>> inet 192.168.23.2/24 brd 137.226.164.255 scope global eth0:0
If you keep the old IP address around, it remains the "primary"
IP address.
You have to explicitly remove the original IP address from the
interface first, then add the new one, in order for the new
one to become the "primary".
Not a bug, please close this.
* Re: [patch 1/1] net: convert %p usage to %pK
From: David Miller @ 2011-05-26 1:50 UTC (permalink / raw)
To: kees.cook
Cc: eric.dumazet, joe, mingo, akpm, netdev, drosenberg, a.p.zijlstra,
eparis, eugeneteo, jmorris, tgraf
In-Reply-To: <20110525232921.GD19633@outflux.net>
From: Kees Cook <kees.cook@canonical.com>
Date: Wed, 25 May 2011 16:29:21 -0700
> Hi David,
>
> On Tue, May 24, 2011 at 03:58:01AM -0400, David Miller wrote:
>> From: Eric Dumazet <eric.dumazet@gmail.com>
>> Date: Tue, 24 May 2011 09:45:01 +0200
>>
>> > On Tuesday, 24 May 2011 at 00:35 -0700, Joe Perches wrote:
>> >
>> >> I think it'd be better without the casts
>> >> using the standard kernel.h macros.
>> >>
>> >> void *ptr;
>> >>
>> >> ptr = maybe_hide_ptr(sk);
>> >> r->id.idiag_cookie[0] = lower_32_bits(ptr);
>> >> r->id.idiag_cookie[1] = upper_32_bits(ptr);
>> >>
>> >
>> > I am not sure I want to patch lower_32_bits() and upper_32_bits() for
>> > this.
>> >
>> > They don't work on pointers, but on "numbers", according to the kerneldoc
>> > Andrew wrote years ago. gcc agrees:
>> >
>> > net/ipv4/inet_diag.c: In function ‘inet_csk_diag_fill’:
>> > net/ipv4/inet_diag.c:119: warning: cast from pointer to integer of different size
>> > net/ipv4/inet_diag.c:120: error: invalid operands to binary >>
>> > make[1]: *** [net/ipv4/inet_diag.o] Error 1
>>
>> Also you can't do this, the "cookie" is used by the kernel in future
>> lookups to find sockets.
>>
>> The kernel pointer is part of the API, so sorry you can't "hide"
>> kernel pointers in this case without really breaking user visible
>> things.
>
> But this is precisely what we're trying to control with kptr_restrict.
> Setting kptr_restrict will make inet_diag (and some details of similar
> things in /proc) meaningless. Based on the name, "diag" isn't going to be
> used in normal operation, and kptr_restrict is 0 by default, so only system
> owners interested in this will enable it and effectively disable inet_diag.
Are you kidding me?
inet_diag is the standard way to dump sockets using netlink.
It's not a special obscure debugging facility, it's for real
users.
And the encoded kernel pointer here is used as a shortcut to looking
up precise sockets.
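
A minimal userspace sketch of the idea, assuming a two-word cookie like the idiag_cookie pair quoted above: the object's address is split into two 32-bit words, and a later request is matched by re-encoding the candidate object and comparing. The names struct cookie, cookie_from_ptr() and cookie_matches() are hypothetical and are not the kernel's code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a two-word cookie such as idiag_cookie[2]. */
struct cookie {
	uint32_t word[2];
};

/* The encoding below assumes an address fits in 64 bits. */
static_assert(sizeof(uintptr_t) <= sizeof(uint64_t),
	      "address must fit in two 32-bit words");

/* Encode an object's address into a two-word cookie. */
static struct cookie cookie_from_ptr(const void *obj)
{
	uintptr_t v = (uintptr_t)obj;
	struct cookie c;

	c.word[0] = (uint32_t)v;
	/* Widen first so the shift is defined even with 32-bit pointers. */
	c.word[1] = (uint32_t)((uint64_t)v >> 32);
	return c;
}

/* Return 1 if the cookie identifies obj, 0 otherwise. */
static int cookie_matches(const struct cookie *c, const void *obj)
{
	struct cookie want = cookie_from_ptr(obj);

	return c->word[0] == want.word[0] && c->word[1] == want.word[1];
}
```

If the encoded value were hidden or zeroed, cookie_matches() in this sketch could no longer tell two objects apart, which is the user-visible breakage described above.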
* atl1c suspend issue - remove_proc_entry: removing non-empty directory 'irq/44', leaking at least 'smp_affinity_list'
From: Parag Warudkar @ 2011-05-26 1:50 UTC (permalink / raw)
To: netdev; +Cc: linux-kernel
Got this on suspend :
[ 115.182723] cfg80211: (2402000 KHz - 2472000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
[ 115.182732] cfg80211: (2457000 KHz - 2482000 KHz @ 20000 KHz), (300 mBi, 2000 mBm)
[ 115.182740] cfg80211: (2474000 KHz - 2494000 KHz @ 20000 KHz), (300 mBi, 2000 mBm)
[ 115.182747] cfg80211: (5170000 KHz - 5250000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
[ 115.182755] cfg80211: (5735000 KHz - 5835000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
[ 115.292922] ------------[ cut here ]------------
[ 115.292949] WARNING: at fs/proc/generic.c:849 remove_proc_entry+0x26e/0x280()
[ 115.292959] Hardware name: 0876
[ 115.292969] remove_proc_entry: removing non-empty directory 'irq/44', leaking at least 'smp_affinity_list'
[ 115.292979] Modules linked in: cryptd aes_x86_64 aes_generic parport_pc ppdev nls_utf8 udf crc_itu_t fuse binfmt_misc joydev snd_hda_codec_hdmi snd_hda_codec_realtek i915 snd_hda_intel snd_hda_codec arc4 drm_kms_helper drm snd_hwdep i2c_algo_bit iwlagn snd_pcm snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq cfbcopyarea cfbimgblt snd_timer cfbfillrect mac80211 snd_seq_device snd uvcvideo usb_storage soundcore cfg80211 videodev ideapad_laptop v4l2_compat_ioctl32 psmouse video snd_page_alloc serio_raw intel_ips sparse_keymap btusb mac_hid lp bluetooth parport ext4 mbcache jbd2 ahci libahci libata atl1c
[ 115.293170] Pid: 749, comm: NetworkManager Not tainted 2.6.39+ #11
[ 115.293179] Call Trace:
[ 115.293204] [<ffffffff8105875f>] warn_slowpath_common+0x7f/0xc0
[ 115.293218] [<ffffffff81058856>] warn_slowpath_fmt+0x46/0x50
[ 115.293232] [<ffffffff8119f84e>] remove_proc_entry+0x26e/0x280
[ 115.293251] [<ffffffff811f59e0>] ? sprintf+0x40/0x50
[ 115.293270] [<ffffffff810b7db7>] unregister_irq_proc+0xb7/0xe0
[ 115.293285] [<ffffffff810b3a4c>] free_desc+0x2c/0x70
[ 115.293297] [<ffffffff810b3ada>] irq_free_descs+0x4a/0x90
[ 115.293314] [<ffffffff81029c9b>] free_irq_at+0x3b/0x50
[ 115.293329] [<ffffffff8102bc7b>] destroy_irq+0x7b/0x90
[ 115.293343] [<ffffffff8102bf0e>] native_teardown_msi_irq+0xe/0x10
[ 115.293359] [<ffffffff8122282f>] default_teardown_msi_irqs+0x6f/0x90
[ 115.293374] [<ffffffff81222216>] free_msi_irqs+0x96/0x130
[ 115.293387] [<ffffffff81222ec5>] pci_disable_msi+0x45/0x50
[ 115.293414] [<ffffffffa0002ef7>] atl1c_down+0xc7/0x110 [atl1c]
[ 115.293434] [<ffffffffa00034f8>] atl1c_close+0x28/0x50 [atl1c]
[ 115.293452] [<ffffffff8138f666>] __dev_close_many+0x86/0xd0
[ 115.293467] [<ffffffff8138f6e6>] __dev_close+0x36/0x50
[ 115.293480] [<ffffffff81395681>] __dev_change_flags+0xa1/0x180
[ 115.293492] [<ffffffff81395828>] dev_change_flags+0x28/0x70
[ 115.293508] [<ffffffff813a3220>] do_setlink+0x200/0x9f0
[ 115.293527] [<ffffffff81047ef0>] ? update_curr+0x100/0x1a0
[ 115.293541] [<ffffffff81205390>] ? nla_parse+0x30/0xd0
[ 115.293555] [<ffffffff813a3aff>] rtnl_setlink+0xef/0x130
[ 115.293572] [<ffffffff813a16bf>] rtnetlink_rcv_msg+0x20f/0x240
[ 115.293587] [<ffffffff813a14b0>] ? rtnetlink_net_init+0x50/0x50
[ 115.293604] [<ffffffff813bb8e9>] netlink_rcv_skb+0xa9/0xd0
[ 115.293620] [<ffffffff813a2055>] rtnetlink_rcv+0x25/0x40
[ 115.293635] [<ffffffff813bb223>] netlink_unicast+0x2d3/0x2f0
[ 115.293648] [<ffffffff8138998d>] ? memcpy_fromiovec+0x7d/0xa0
[ 115.293662] [<ffffffff813bb462>] netlink_sendmsg+0x222/0x360
[ 115.293678] [<ffffffff8137d0cf>] sock_sendmsg+0xef/0x120
[ 115.293695] [<ffffffff814201dd>] ? unix_dgram_sendmsg+0x5cd/0x650
[ 115.293711] [<ffffffff8137d0cf>] ? sock_sendmsg+0xef/0x120
[ 115.293725] [<ffffffff8137ecc0>] ? move_addr_to_kernel+0x50/0x60
[ 115.293738] [<ffffffff81389a32>] ? verify_iovec+0x82/0xf0
[ 115.293751] [<ffffffff8137e79d>] __sys_sendmsg+0x1dd/0x340
[ 115.293765] [<ffffffff8137b893>] ? sock_destroy_inode+0x33/0x40
[ 115.293783] [<ffffffff8112dfd0>] ? kmem_cache_free+0x20/0xe0
[ 115.293798] [<ffffffff8137f796>] ? sys_sendto+0x156/0x190
[ 115.293813] [<ffffffff8115d65f>] ? mntput+0x1f/0x30
[ 115.293827] [<ffffffff8137fc19>] sys_sendmsg+0x49/0x90
[ 115.293847] [<ffffffff81487c82>] system_call_fastpath+0x16/0x1b
[ 115.293858] ---[ end trace e0ec9dc53f93f46e ]---
[ 116.649267] EXT4-fs (sda6): re-mounted. Opts: errors=remount-ro,commit=0
[ 116.652785] EXT4-fs (sda1): re-mounted. Opts: commit=0
[ 117.994536] PM: Syncing filesystems ... done.
[ 117.996922] PM: Preparing system for mem sleep
[ 118.449261] Freezing user space processes ... (elapsed 0.01 seconds) done.
[ 118.462504] Freezing remaining freezable tasks ... (elapsed 0.01 seconds) done.
[ 118.475800] PM: Entering mem sleep
[ 118.475902] Suspending console(s) (use no_console_suspend to debug)
[ 118.476562] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[ 118.478045] sd 0:0:0:0: [sda] Stopping disk
[ 118.608869] ehci_hcd 0000:00:1a.0: PCI INT A disabled
[ 118.608903] ehci_hcd 0000:00:1d.0: PCI INT A disabled
* Re: [PATCH 1/1] IPVS : bug in ip_vs_ftp, same list head used in all netns.
From: Simon Horman @ 2011-05-26 1:48 UTC (permalink / raw)
To: Hans Schillstrom; +Cc: ja, wensong, lvs-devel, netdev, netfilter-devel, hans
In-Reply-To: <1306239065-17271-1-git-send-email-hans.schillstrom@ericsson.com>
On Tue, May 24, 2011 at 02:11:05PM +0200, Hans Schillstrom wrote:
> When ip_vs was adapted to netns the ftp application was not adapted
> in a correct way.
> However this is a fix to avoid kernel errors. In the long term another solution
> might be chosen, i.e. the ports that the ftp app uses should be per netns.
>
> Signed-off-by: Hans Schillstrom <hans.schillstrom@ericsson.com>
Julian, do you have any thoughts on this?
> ---
> include/net/ip_vs.h | 3 ++-
> net/netfilter/ipvs/ip_vs_ftp.c | 27 +++++++++++++++++++--------
> 2 files changed, 21 insertions(+), 9 deletions(-)
>
> diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
> index 4fff432..481f856 100644
> --- a/include/net/ip_vs.h
> +++ b/include/net/ip_vs.h
> @@ -797,7 +797,8 @@ struct netns_ipvs {
> struct list_head rs_table[IP_VS_RTAB_SIZE];
> /* ip_vs_app */
> struct list_head app_list;
> -
> + /* ip_vs_ftp */
> + struct ip_vs_app *ftp_app;
> /* ip_vs_proto */
> #define IP_VS_PROTO_TAB_SIZE 32 /* must be power of 2 */
> struct ip_vs_proto_data *proto_data_table[IP_VS_PROTO_TAB_SIZE];
> diff --git a/net/netfilter/ipvs/ip_vs_ftp.c b/net/netfilter/ipvs/ip_vs_ftp.c
> index 6b5dd6d..af63553 100644
> --- a/net/netfilter/ipvs/ip_vs_ftp.c
> +++ b/net/netfilter/ipvs/ip_vs_ftp.c
> @@ -411,25 +411,35 @@ static struct ip_vs_app ip_vs_ftp = {
> static int __net_init __ip_vs_ftp_init(struct net *net)
> {
> int i, ret;
> - struct ip_vs_app *app = &ip_vs_ftp;
> + struct ip_vs_app *app;
> + struct netns_ipvs *ipvs = net_ipvs(net);
> +
> + app = kmemdup(&ip_vs_ftp, sizeof(struct ip_vs_app), GFP_KERNEL);
> + if (!app)
> + return -ENOMEM;
> + INIT_LIST_HEAD(&app->a_list);
> + INIT_LIST_HEAD(&app->incs_list);
> + ipvs->ftp_app = app;
>
> ret = register_ip_vs_app(net, app);
> if (ret)
> - return ret;
> + goto err_exit;
>
> for (i=0; i<IP_VS_APP_MAX_PORTS; i++) {
> if (!ports[i])
> continue;
> ret = register_ip_vs_app_inc(net, app, app->protocol, ports[i]);
> if (ret)
> - break;
> + goto err_unreg;
> pr_info("%s: loaded support on port[%d] = %d\n",
> app->name, i, ports[i]);
> }
> + return 0;
>
> - if (ret)
> - unregister_ip_vs_app(net, app);
> -
> +err_unreg:
> + unregister_ip_vs_app(net, app);
> +err_exit:
> + kfree(ipvs->ftp_app);
> return ret;
> }
> /*
> @@ -437,9 +447,10 @@ static int __net_init __ip_vs_ftp_init(struct net *net)
> */
> static void __ip_vs_ftp_exit(struct net *net)
> {
> - struct ip_vs_app *app = &ip_vs_ftp;
> + struct netns_ipvs *ipvs = net_ipvs(net);
>
> - unregister_ip_vs_app(net, app);
> + unregister_ip_vs_app(net, ipvs->ftp_app);
> + kfree(ipvs->ftp_app);
> }
>
> static struct pernet_operations ip_vs_ftp_ops = {
> --
> 1.7.2.3
>
* Re: [GIT PULL] Namespace file descriptors for 2.6.40
From: Eric W. Biederman @ 2011-05-25 23:40 UTC (permalink / raw)
To: C Anthony Risinger
Cc: Serge E. Hallyn, Linux Containers, netdev, linux-kernel
In-Reply-To: <BANLkTinbw6pZjhMscfXFMArd=XU=VC=+eQ@mail.gmail.com>
C Anthony Risinger <anthony@xtfx.me> writes:
> On Wed, May 25, 2011 at 4:38 PM, Serge E. Hallyn <serge@hallyn.com> wrote:
>> Quoting C Anthony Risinger (anthony@xtfx.me):
>>> On Mon, May 23, 2011 at 4:05 PM, Eric W. Biederman
>>> <ebiederm@xmission.com> wrote:
>>> >
>>> > This tree adds the files /proc/<pid>/ns/net, /proc/<pid>/ns/ipc,
>>> > /proc/<pid>/ns/uts that can be opened to refer to the namespaces of a
>>> > process at the time those files are opened, and can be bind mounted to
>>> > keep the specified namespace alive without a process.
>>> >
>>> > This tree adds the setns system call that can be used to change the
>>> > specified namespace of a process to the namespace specified by a system
>>> > call.
>>>
>>> i just have a quick question regarding these, apologies if wrong place
>>> to respond -- i trimmed to lists only.
>>>
>>> if i understand correctly, mount namespaces (for example), allow one
>>> to build such constructs as "private /tmp" and similar that even
>>> `root` cannot access ... and there are many reasons `root` does not
>>> deserve to completely know/interact with user processes (FUSE makes a
>>> good example ... just because i [user] have SSH access to a machine,
>>> why should `root`?)
>>>
>>> would these /proc additions break such guarantees? IOW, would it now
>>> become possible for `root` to inject stuff into my private namespaces,
>>> and/or has these guarantees never existed and i am mistaken? is there
>>> any kind of ACL mechanism that endows the origin process (or similar)
>>> with the ability to dictate who can hold and/or interact with these
>>> references?
>>
>> If for instance you have a file open in your private /tmp, then root
>> in another mounts ns can open the file through /proc/$$/fd/N anyway.
>> If it's a directory, he can now traverse the whole fs.
>
> aaah right :-( ... there's always another way isn't there ... curse
> you Linux for being so flexible! (just kidding baby i love you)
Even more significantly, access to the new files is guarded by the
ptrace access checks. And if root can ptrace your process, root
can remote-control your process.
> this seems like a more fundamental issue then? or should i not expect
> to be able to achieve separation like this? i ask in the context of
> OS virt via cgroups + namespaces, eg. LXC et al, because i'm about to
> perform a massive overhaul to our crusty sub-2.6.18 infrastructure and
> i've used/followed these technologies for couple years now ... and
> it's starting to feel like "the right time".
I don't think anything really new is allowed, but we haven't designed
anything that radically reduces the power of root either.
At some point we may have the user namespace done and that should
give you a root like user with vastly reduced powers, but we aren't
there yet.
Eric
* [PATCH] af-packet: Add flag to distinguish VID 0 from no-vlan.
From: greearb @ 2011-05-25 23:36 UTC (permalink / raw)
To: netdev; +Cc: Ben Greear
From: Ben Greear <greearb@candelatech.com>
Currently, user-space cannot determine if a 0 tp_vlan_tci
means there is no VLAN tag or the VLAN ID was zero.
Add flag to make this explicit. User-space can check for
TP_STATUS_VLAN_VALID || tp_vlan_tci > 0, which will be backwards
compatible. Older code would have just checked for tp_vlan_tci,
so it will work no worse than before.
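
A minimal sketch of the user-space check described above, assuming the status word and TCI have already been read from the PACKET_AUXDATA control message. The helper name vlan_tag_present() is hypothetical; the flag value is the one from this patch, defined locally in case the installed headers predate it.

```c
#include <assert.h>
#include <stdint.h>

/* Value from this patch; defined here only if the system header lacks it. */
#ifndef TP_STATUS_VLAN_VALID
#define TP_STATUS_VLAN_VALID 0x10
#endif

/*
 * Hypothetical helper: returns 1 if the packet carried a VLAN tag.
 * New kernels set TP_STATUS_VLAN_VALID, so a TCI of 0 can be trusted.
 * Old kernels never set the flag, so fall back to a non-zero TCI.
 */
static int vlan_tag_present(uint32_t tp_status, uint16_t tp_vlan_tci)
{
	return (tp_status & TP_STATUS_VLAN_VALID) || tp_vlan_tci > 0;
}
```

On an old kernel a tagged packet with a TCI of 0 is still reported as untagged, which is the "no worse than before" case.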
Signed-off-by: Ben Greear <greearb@candelatech.com>
---
:100644 100644 72bfa5a... 6d66ce1... M include/linux/if_packet.h
:100644 100644 658edd1... 885d76d... M net/packet/af_packet.c
include/linux/if_packet.h | 1 +
net/packet/af_packet.c | 7 ++++++-
2 files changed, 7 insertions(+), 1 deletions(-)
diff --git a/include/linux/if_packet.h b/include/linux/if_packet.h
index 72bfa5a..6d66ce1 100644
--- a/include/linux/if_packet.h
+++ b/include/linux/if_packet.h
@@ -70,6 +70,7 @@ struct tpacket_auxdata {
#define TP_STATUS_COPY 0x2
#define TP_STATUS_LOSING 0x4
#define TP_STATUS_CSUMNOTREADY 0x8
+#define TP_STATUS_VLAN_VALID 0x10 /* auxdata has valid tp_vlan_tci */
/* Tx ring - header status */
#define TP_STATUS_AVAILABLE 0x0
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 658edd1..885d76d 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -1772,7 +1772,12 @@ static int packet_recvmsg(struct kiocb *iocb, struct socket *sock,
aux.tp_snaplen = skb->len;
aux.tp_mac = 0;
aux.tp_net = skb_network_offset(skb);
- aux.tp_vlan_tci = vlan_tx_tag_get(skb);
+ if (vlan_tx_tag_present(skb)) {
+ aux.tp_vlan_tci = vlan_tx_tag_get(skb);
+ aux.tp_status |= TP_STATUS_VLAN_VALID;
+ }
+ else
+ aux.tp_vlan_tci = 0;
put_cmsg(msg, SOL_PACKET, PACKET_AUXDATA, sizeof(aux), &aux);
}
--
1.7.3.4
* Re: [patch 1/1] net: convert %p usage to %pK
From: Kees Cook @ 2011-05-25 23:29 UTC (permalink / raw)
To: David Miller
Cc: eric.dumazet, joe, mingo, akpm, netdev, drosenberg, a.p.zijlstra,
eparis, eugeneteo, jmorris, tgraf
In-Reply-To: <20110524.035801.1555795213632087107.davem@davemloft.net>
Hi David,
On Tue, May 24, 2011 at 03:58:01AM -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 24 May 2011 09:45:01 +0200
>
> > On Tuesday, 24 May 2011 at 00:35 -0700, Joe Perches wrote:
> >
> >> I think it'd be better without the casts
> >> using the standard kernel.h macros.
> >>
> >> void *ptr;
> >>
> >> ptr = maybe_hide_ptr(sk);
> >> r->id.idiag_cookie[0] = lower_32_bits(ptr);
> >> r->id.idiag_cookie[1] = upper_32_bits(ptr);
> >>
> >
> > I am not sure I want to patch lower_32_bits() and upper_32_bits() for
> > this.
> >
> > They don't work on pointers, but on "numbers", according to the kerneldoc
> > Andrew wrote years ago. gcc agrees:
> >
> > net/ipv4/inet_diag.c: In function ‘inet_csk_diag_fill’:
> > net/ipv4/inet_diag.c:119: warning: cast from pointer to integer of different size
> > net/ipv4/inet_diag.c:120: error: invalid operands to binary >>
> > make[1]: *** [net/ipv4/inet_diag.o] Error 1
>
> Also you can't do this, the "cookie" is used by the kernel in future
> lookups to find sockets.
>
> The kernel pointer is part of the API, so sorry you can't "hide"
> kernel pointers in this case without really breaking user visible
> things.
But this is precisely what we're trying to control with kptr_restrict.
Setting kptr_restrict will make inet_diag (and some details of similar
things in /proc) meaningless. Based on the name, "diag" isn't going to be
used in normal operation, and kptr_restrict is 0 by default, so only system
owners interested in this will enable it and effectively disable inet_diag.
It seems like everything that fills idiag_cookie needs to be adjusted, not
just the one instance, too:
$ fgrep 'idiag_cookie[0] = ' net/ipv4/inet_diag.c
r->id.idiag_cookie[0] = (u32)(unsigned long)sk;
r->id.idiag_cookie[0] = (u32)(unsigned long)tw;
r->id.idiag_cookie[0] = (u32)(unsigned long)req;
-Kees
--
Kees Cook
Ubuntu Security Team
* Re: [Bugme-new] [Bug 35862] New: arp requests from wrong src IP
From: Andrew Morton @ 2011-05-25 23:31 UTC (permalink / raw)
To: netdev; +Cc: bugzilla-daemon, bugme-daemon, matare
In-Reply-To: <bug-35862-10286@https.bugzilla.kernel.org/>
(switched to email. Please respond via emailed reply-to-all, not via the
bugzilla web interface).
On Wed, 25 May 2011 23:27:48 GMT
bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=35862
>
> Summary: arp requests from wrong src IP
> Product: Networking
> Version: 2.5
> Platform: All
> OS/Version: Linux
> Tree: Mainline
> Status: NEW
> Severity: normal
> Priority: P1
> Component: IPV4
> AssignedTo: shemminger@linux-foundation.org
> ReportedBy: matare@lih.rwth-aachen.de
> Regression: No
>
>
> I switched a host's ip address from 137.226.164.13 to 137.226.164.2. The .13 IP
> now belongs to the host that had .2 before (I swapped them). Now both hosts
> still arp from their old IPs although ifconfig as well as ip clearly tell
> otherwise. Examining the host which now has 137.226.164.13:
>
> # ip addr show dev eth0
> 4: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
> link/ether 00:e0:81:41:1f:e4 brd ff:ff:ff:ff:ff:ff
> inet 137.226.164.2/24 brd 137.226.164.255 scope global eth0
> inet 192.168.23.2/24 brd 137.226.164.255 scope global eth0:0
>
> but arping defaults to the old src IP (.13). I can manually correct this with
> the -s parameter, but it looks like linux still believes that 137.226.164.13 is
> this host's ip address. When I try to manually correct the arp table:
> # arp -s 137.226.164.13 00:30:48:70:91:95
> SIOCSARP: Invalid argument
> # arp -n 137.226.164.13
> 137.226.164.13 (137.226.164.13) -- no entry
>
> And this is what arping does:
> # tcpdump -ieth0 -c1 -s0 -vvv -n arp & (sleep 1; arping 137.226.164.13 &>
> /dev/null)
> [1] 2217
> tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535
> bytes
> 01:14:37.785126 arp who-has 137.226.164.13 (ff:ff:ff:ff:ff:ff) tell
> 137.226.164.13
>
> Also, ifconfig doesn't even show the second IP address:
> # ifconfig eth0
> eth0 Link encap:Ethernet HWaddr 00:e0:81:41:1f:e4
> inet addr:137.226.164.2 Bcast:137.226.164.255 Mask:255.255.255.0
> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
> RX packets:103996345 errors:0 dropped:0 overruns:0 frame:0
> TX packets:122352625 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:1000
> RX bytes:52478932087 (48.8 GiB) TX bytes:110248931949 (102.6 GiB)
> Interrupt:24
>
> What's going on here? If this is by design, it's very unintuitive behaviour.
>
* Re: [RFC 01/01]af_packet: Enhance network capture visibility
From: chetan loke @ 2011-05-25 23:24 UTC (permalink / raw)
To: Ben Greear; +Cc: netdev, loke.chetan
In-Reply-To: <4DDD8C5E.7040207@candelatech.com>
On Wed, May 25, 2011 at 7:10 PM, Ben Greear <greearb@candelatech.com> wrote:
> On 05/25/2011 04:03 PM, chetan loke wrote:
>>
>> This patch is not complete and is intended to:
>> a) demonstrate the improvements
>> b) gather suggestions
>>
>>
>> Signed-off-by: Chetan Loke<lokec@ccs.neu.edu>
>
>> +struct tpacket3_hdr {
>> + __u32 tp_status;
>> + __u32 tp_len;
>> + __u32 tp_snaplen;
>> + __u16 tp_mac;
>> + __u16 tp_net;
>> + __u32 tp_sec;
>> + __u32 tp_nsec;
>> + __u16 tp_vlan_tci;
>> + long tp_next_offset;
>> +};
>
> Use fixed-size variables, like __u64 instead of 'long'. That way,
> you have the same sized msgs on 32 and 64-bit systems.
>
Thanks Ben.
The intent is to also introduce something like
typedef struct {
uint64_t pkt_sliced:1;
uint64_t crc_error:1;
uint64_t code_violation:1; /* if frame had code violation */
uint64_t num_mpls_labels:4;
uint64_t num_vlans:3;
uint64_t l2_type:6;
uint64_t l3_type:4;
uint64_t l4_type:4;
uint64_t l7_type:8;
uint64_t rsvd:32;
}feature_s1;
typedef struct {
union {
feature_s1 f_s1;
/* future feature goes here */
}u1;
}feature_variants;
And then embed feature_variants in the pkt_desc.
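
A rough sketch of that embedding, under assumptions: the field names and widths are the ones listed above (they add up to 64 bits), while struct pkt_desc_sketch and the raw member are hypothetical and only illustrate where the feature word could sit. Bit-field ordering and uint64_t bit-fields are implementation-defined in C, so hardware that DMAs this word would need the layout pinned down per compiler and ABI.

```c
#include <assert.h>
#include <stdint.h>

/* Widths as proposed: 1+1+1+4+3+6+4+4+8+32 = 64 bits. */
typedef struct {
	uint64_t pkt_sliced:1;
	uint64_t crc_error:1;
	uint64_t code_violation:1;	/* if frame had code violation */
	uint64_t num_mpls_labels:4;
	uint64_t num_vlans:3;
	uint64_t l2_type:6;
	uint64_t l3_type:4;
	uint64_t l4_type:4;
	uint64_t l7_type:8;
	uint64_t rsvd:32;
} feature_s1;

typedef struct {
	union {
		feature_s1 f_s1;
		uint64_t raw;	/* hypothetical: whole-word view for copying */
		/* future feature goes here */
	} u1;
} feature_variants;

/* Hypothetical descriptor; the real pkt_desc is not shown in this thread. */
struct pkt_desc_sketch {
	uint32_t len;
	uint32_t snaplen;
	feature_variants features;
};

/* Holds with GCC and Clang; other compilers may pack differently. */
static_assert(sizeof(feature_s1) == 8, "feature word should be 64 bits");
```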
Once we have the proposed non-static frame format in place then I am
hoping some vendor can borrow this format, enhance their capture
driver and DMA the data directly in the block. This way we can also
attempt to standardize the block-capture format on linux and make it
easier for smaller FPGA shops.
>
> Thanks,
> Ben
>
Chetan
* [PATCH] af-packet: Use existing netdev reference for bound sockets.
From: greearb @ 2011-05-25 23:15 UTC (permalink / raw)
To: netdev; +Cc: Ben Greear
From: Ben Greear <greearb@candelatech.com>
This saves a network device lookup on each packet transmitted,
for sockets that are bound to a network device.
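
The pattern can be sketched in plain C with a toy refcount, assuming a bound socket already holds its own long-lived reference to the device. All toy_* names are hypothetical stand-ins, not kernel APIs: the send path takes and drops a reference only when it had to do the lookup itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a refcounted device; not the kernel's struct net_device. */
struct toy_dev {
	int refcnt;
};

/* Models a lookup by index that takes a reference on success. */
static struct toy_dev *toy_dev_get(struct toy_dev *table, size_t n, size_t idx)
{
	if (idx >= n)
		return NULL;
	table[idx].refcnt++;
	return &table[idx];
}

static void toy_dev_put(struct toy_dev *dev)
{
	dev->refcnt--;
}

/* Send path: prefer the device cached at bind time, else look it up. */
static int toy_send(struct toy_dev *bound, struct toy_dev *table,
		    size_t n, size_t idx)
{
	struct toy_dev *dev = bound;
	bool need_put = false;

	if (!dev) {
		dev = toy_dev_get(table, n, idx);
		need_put = true;
	}
	if (!dev)
		return -1;

	/* ... transmit using dev ... */

	/* Drop only the reference taken above, never the socket's own. */
	if (need_put)
		toy_dev_put(dev);
	return 0;
}
```

In both paths the reference count is the same before and after the call; the bound path simply skips the lookup.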
Signed-off-by: Ben Greear <greearb@candelatech.com>
---
:100644 100644 4005b24... 658edd1... M net/packet/af_packet.c
net/packet/af_packet.c | 26 +++++++++++++++++++-------
1 files changed, 19 insertions(+), 7 deletions(-)
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 4005b24..658edd1 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -987,8 +987,9 @@ static int tpacket_fill_skb(struct packet_sock *po, struct sk_buff *skb,
static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
{
struct sk_buff *skb;
- struct net_device *dev;
+ struct net_device *dev = NULL;
__be16 proto;
+ bool need_rls_dev = false;
int ifindex, err, reserve = 0;
void *ph;
struct sockaddr_ll *saddr = (struct sockaddr_ll *)msg->msg_name;
@@ -1002,6 +1003,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
err = -EBUSY;
if (saddr == NULL) {
ifindex = po->ifindex;
+ dev = po->prot_hook.dev;
proto = po->num;
addr = NULL;
} else {
@@ -1017,7 +1019,10 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
addr = saddr->sll_addr;
}
- dev = dev_get_by_index(sock_net(&po->sk), ifindex);
+ if (!dev) {
+ dev = dev_get_by_index(sock_net(&po->sk), ifindex);
+ need_rls_dev = true;
+ }
err = -ENXIO;
if (unlikely(dev == NULL))
goto out;
@@ -1103,7 +1108,8 @@ out_status:
__packet_set_status(po, ph, status);
kfree_skb(skb);
out_put:
- dev_put(dev);
+ if (need_rls_dev)
+ dev_put(dev);
out:
mutex_unlock(&po->pg_vec_lock);
return err;
@@ -1139,8 +1145,9 @@ static int packet_snd(struct socket *sock,
struct sock *sk = sock->sk;
struct sockaddr_ll *saddr = (struct sockaddr_ll *)msg->msg_name;
struct sk_buff *skb;
- struct net_device *dev;
+ struct net_device *dev = NULL;
__be16 proto;
+ bool need_rls_dev = false;
unsigned char *addr;
int ifindex, err, reserve = 0;
struct virtio_net_hdr vnet_hdr = { 0 };
@@ -1161,6 +1168,7 @@ static int packet_snd(struct socket *sock,
if (saddr == NULL) {
ifindex = po->ifindex;
+ dev = po->prot_hook.dev;
proto = po->num;
addr = NULL;
} else {
@@ -1174,8 +1182,11 @@ static int packet_snd(struct socket *sock,
addr = saddr->sll_addr;
}
+ if (!dev) {
+ dev = dev_get_by_index(sock_net(sk), ifindex);
+ need_rls_dev = true;
+ }
- dev = dev_get_by_index(sock_net(sk), ifindex);
err = -ENXIO;
if (dev == NULL)
goto out_unlock;
@@ -1315,14 +1326,15 @@ static int packet_snd(struct socket *sock,
if (err > 0 && (err = net_xmit_errno(err)) != 0)
goto out_unlock;
- dev_put(dev);
+ if (need_rls_dev)
+ dev_put(dev);
return len;
out_free:
kfree_skb(skb);
out_unlock:
- if (dev)
+ if (dev && need_rls_dev)
dev_put(dev);
out:
return err;
--
1.7.3.4
* Re: [RFC 01/01]af_packet: Enhance network capture visibility
From: Ben Greear @ 2011-05-25 23:10 UTC (permalink / raw)
To: chetan loke; +Cc: netdev
In-Reply-To: <BANLkTimYVUkUWA2XPix2nUL-=rnQKghZQA@mail.gmail.com>
On 05/25/2011 04:03 PM, chetan loke wrote:
> This patch is not complete and is intended to:
> a) demonstrate the improvements
> b) gather suggestions
>
>
> Signed-off-by: Chetan Loke<lokec@ccs.neu.edu>
> +struct tpacket3_hdr {
> + __u32 tp_status;
> + __u32 tp_len;
> + __u32 tp_snaplen;
> + __u16 tp_mac;
> + __u16 tp_net;
> + __u32 tp_sec;
> + __u32 tp_nsec;
> + __u16 tp_vlan_tci;
> + long tp_next_offset;
> +};
Use fixed-size variables, like __u64 instead of 'long'. That way,
you have the same sized msgs on 32 and 64-bit systems.
I didn't look at the rest of it in any detail, so no comment there.
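
A sketch of what the fixed-size suggestion could look like, with assumptions stated: struct hdr_fixed is a hypothetical userspace mirror of the proposed header using <stdint.h> types, and the explicit pad member is an addition here. The padding matters because a 64-bit field is commonly 4-byte aligned on 32-bit x86 but 8-byte aligned on 64-bit, so swapping 'long' for a 64-bit type alone may still leave the two layouts different.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the proposed header with fixed-width fields. */
struct hdr_fixed {
	uint32_t status;
	uint32_t len;
	uint32_t snaplen;
	uint16_t mac;
	uint16_t net;
	uint32_t sec;
	uint32_t nsec;
	uint16_t vlan_tci;
	uint16_t pad[3];	/* explicit padding up to an 8-byte boundary */
	uint64_t next_offset;	/* same width on 32- and 64-bit builds */
};

/* With the padding spelled out, common ABIs agree on the layout. */
static_assert(offsetof(struct hdr_fixed, next_offset) == 32,
	      "next_offset should start at byte 32");
static_assert(sizeof(struct hdr_fixed) == 40,
	      "header should be 40 bytes");
```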
Thanks,
Ben
--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
* [RFC 01/01]af_packet: Enhance network capture visibility
From: chetan loke @ 2011-05-25 23:03 UTC (permalink / raw)
To: netdev, loke.chetan
This patch is not complete and is intended to:
a) demonstrate the improvements
b) gather suggestions
Signed-off-by: Chetan Loke <lokec@ccs.neu.edu>
-----------------------
include/linux/if_packet.h | 27 ++
net/packet/af_packet.c | 637 ++++++++++++++++++++++++++++++++++++++++++---
2 files changed, 632 insertions(+), 32 deletions(-)
-----------------------
diff --git a/include/linux/if_packet.h b/include/linux/if_packet.h
index 72bfa5a..1452f47 100644
--- a/include/linux/if_packet.h
+++ b/include/linux/if_packet.h
@@ -55,6 +55,17 @@ struct tpacket_stats {
unsigned int tp_drops;
};
+struct tpacket_stats_v3 {
+ unsigned int tp_packets;
+ unsigned int tp_drops;
+ unsigned int tp_plug_q_cnt;
+};
+
+union tpacket_stats_u {
+ struct tpacket_stats stats1;
+ struct tpacket_stats_v3 stats3;
+};
+
struct tpacket_auxdata {
__u32 tp_status;
__u32 tp_len;
@@ -102,11 +113,27 @@ struct tpacket2_hdr {
__u16 tp_vlan_tci;
};
+
+struct tpacket3_hdr {
+ __u32 tp_status;
+ __u32 tp_len;
+ __u32 tp_snaplen;
+ __u16 tp_mac;
+ __u16 tp_net;
+ __u32 tp_sec;
+ __u32 tp_nsec;
+ __u16 tp_vlan_tci;
+ long tp_next_offset;
+};
+
#define TPACKET2_HDRLEN (TPACKET_ALIGN(sizeof(struct tpacket2_hdr)) + sizeof(struct sockaddr_ll))
+#define TPACKET3_HDRLEN (TPACKET_ALIGN(sizeof(struct tpacket3_hdr)) + sizeof(struct sockaddr_ll))
+
enum tpacket_versions {
TPACKET_V1,
TPACKET_V2,
+ TPACKET_V3
};
/*
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 91cb1d7..8e0bc51 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -164,6 +164,57 @@ struct packet_mreq_max {
static int packet_set_ring(struct sock *sk, struct tpacket_req *req,
int closing, int tx_ring);
+
+#define V3_ALIGNMENT (4)
+#define ALIGN_4(x) (((x)+V3_ALIGNMENT-1)&~(V3_ALIGNMENT-1))
+
+
+struct bd_ts{
+ unsigned int ts_sec;
+ union {
+ unsigned int u1_i1[1];
+ struct {
+ unsigned int ts_usec;
+ }ts_s1;
+ struct {
+ unsigned int ts_nsec;
+ }ts_s2;
+ } ts_u1;
+}__attribute__ ((__packed__));
+
+struct block_desc{
+ uint32_t block_status;
+ uint32_t num_pkts;
+ struct bd_ts ts_first_pkt;
+ struct bd_ts ts_last_pkt;
+ long offset_to_first_pkt;
+ uint32_t seq_num;
+} __attribute__ ((__packed__));
+
+struct kbdq_core{
+ struct pgv *pkbdq;
+ unsigned int hdrlen;
+ unsigned char reset_pending_on_curr_blk;
+ unsigned char delete_blk_timer;
+ unsigned short kactive_blk_num;
+ unsigned short hole_bytes_size;
+ char *pkblk_start;
+ char *pkblk_end;
+ int kblk_size;
+ unsigned int knum_blocks;
+ unsigned int knxt_seq_num;
+ char *prev;
+ char *nxt_offset;
+ /* last_kactive_blk_num:
+ * trick to see if user-space has caught up
+ * in order to avoid refreshing timer when every single pkt arrives.
+ */
+ unsigned short last_kactive_blk_num;
+#define DEFAULT_PRB_RETIRE_TMO (4)
+ unsigned short retire_blk_tmo;
+ struct timer_list retire_blk_timer;
+};
+
#define PGV_FROM_VMALLOC 1
struct pgv {
char *buffer;
@@ -179,11 +230,16 @@ struct packet_ring_buffer {
unsigned int pg_vec_order;
unsigned int pg_vec_pages;
unsigned int pg_vec_len;
-
+ struct kbdq_core prb_bdqc;
atomic_t pending;
};
struct packet_sock;
+
+static void prb_open_block(struct kbdq_core *pkc1,struct block_desc *pbd1);
+static void prb_retire_rx_blk_timer_expired(unsigned long data);
+static void _prb_refresh_rx_retire_blk_timer(struct kbdq_core *pkc);
+static void prb_init_blk_timer(struct packet_sock *po,struct kbdq_core *pkc,void (*func) (unsigned long));
static int tpacket_snd(struct packet_sock *po, struct msghdr *msg);
static void packet_flush_mclist(struct sock *sk);
@@ -192,6 +248,7 @@ struct packet_sock {
/* struct sock has to be the first member of packet_sock */
struct sock sk;
struct tpacket_stats stats;
+ union tpacket_stats_u stats_u;
struct packet_ring_buffer rx_ring;
struct packet_ring_buffer tx_ring;
int copy_thresh;
@@ -223,7 +280,14 @@ struct packet_skb_cb {
#define PACKET_SKB_CB(__skb) ((struct packet_skb_cb *)((__skb)->cb))
-static inline __pure struct page *pgv_to_page(void *addr)
+#define GET_PBDQC_FROM_RB(x) ((struct kbdq_core *)(&(x)->prb_bdqc))
+#define GET_CURR_PBLOCK_DESC_FROM_CORE(x) ((struct block_desc *)((x)->pkbdq[(x)->kactive_blk_num].buffer))
+#define GET_PBLOCK_DESC(x,bid) ((struct block_desc *)((x)->pkbdq[(bid)].buffer))
+
+#define INCREMENT_PRB_BLK_NUM(x) \
+ (((x)->kactive_blk_num < ((x)->knum_blocks-1)) ? ((x)->kactive_blk_num+1) : 0)
+
+static inline struct page *pgv_to_page(void *addr)
{
if (is_vmalloc_addr(addr))
return vmalloc_to_page(addr);
@@ -248,8 +312,12 @@ static void __packet_set_status(struct packet_sock *po, void *frame, int status)
h.h2->tp_status = status;
flush_dcache_page(pgv_to_page(&h.h2->tp_status));
break;
+ case TPACKET_V3:
+ pr_err("<%s> TPACKET version not supported.Who is calling?.Dumping stack.\n",__func__);
+ dump_stack();
+ break;
default:
- pr_err("TPACKET version not supported\n");
+ pr_err("<%s> TPACKET version not supported\n",__func__);
BUG();
}
@@ -274,6 +342,10 @@ static int __packet_get_status(struct packet_sock
*po, void *frame)
case TPACKET_V2:
flush_dcache_page(pgv_to_page(&h.h2->tp_status));
return h.h2->tp_status;
+ case TPACKET_V3:
+ pr_err("<%s> TPACKET version:%d not supported.Dumping stack.\n",__func__,po->tp_version);
+ dump_stack();
+ return 0;
default:
pr_err("TPACKET version not supported\n");
BUG();
@@ -309,9 +381,234 @@ static inline void *packet_current_frame(struct
packet_sock *po,
struct packet_ring_buffer *rb,
int status)
{
- return packet_lookup_frame(po, rb, rb->head, status);
+ switch (po->tp_version) {
+ case TPACKET_V1:
+ case TPACKET_V2:
+ return packet_lookup_frame(po, rb, rb->head, status);
+ case TPACKET_V3:
+ pr_err("<%s> TPACKET version:%d not supported.Dumping stack.\n",__func__,po->tp_version);
+ dump_stack();
+ return 0;
+ default:
+ pr_err("<%s> TPACKET version not supported\n",__func__);
+ BUG();
+ return 0;
+ }
+}
+
+static void prb_flush_block(struct block_desc *pbd1)
+{
+ flush_dcache_page(pgv_to_page(pbd1));
+}
+
+/* Side effect:
+ * 1)flush the block-header
+ * 2)Increment active_blk_num
+ */
+static void prb_close_block(struct kbdq_core *pkc1,struct block_desc *pbd1)
+{
+
+ //long size = pkc1->pkblk_end - pkc1->nxt_offset;
+ pbd1->block_status = TP_STATUS_USER;
+
+ /* Get the ts of the last pkt */
+ if (pbd1->num_pkts) {
+ struct tpacket3_hdr *ph = (struct tpacket3_hdr *)pkc1->prev;
+ pbd1->ts_last_pkt.ts_sec = ph->tp_sec;
+ pbd1->ts_last_pkt.ts_s2.ts_nsec = ph->tp_nsec;
+ } else {
+ /* Ok, we tmo'd - so get the current time */
+ struct timespec ts;
+ getnstimeofday(&ts);
+ pbd1->ts_last_pkt.ts_sec = ts.tv_sec;
+ pbd1->ts_last_pkt.ts_s2.ts_nsec = ts.tv_nsec;
+ }
+
+ prb_flush_block(pbd1);
+ pkc1->kactive_blk_num = INCREMENT_PRB_BLK_NUM(pkc1);
+}
+
+static inline void prb_unplug_queue(struct kbdq_core *pkc) {
+ pkc->reset_pending_on_curr_blk=0;
+}
+
+/* Side effect of opening a block:
+ * 1) prb_queue is unplugged.
+ * 2) retire_blk_timer is refreshed.
+ */
+static void prb_open_block(struct kbdq_core *pkc1,struct block_desc *pbd1)
+{
+ struct timespec ts;
+
+ pbd1->block_status = TP_STATUS_KERNEL;
+ getnstimeofday(&ts);
+ pbd1->num_pkts = 0;
+ pbd1->ts_first_pkt.ts_sec = ts.tv_sec;
+ pbd1->ts_first_pkt.ts_u1.ts_s2.ts_nsec = ts.tv_nsec;
+ pkc1->pkblk_start = (char *)pbd1;
+ pbd1->seq_num = pkc1->knxt_seq_num++;
+ pkc1->nxt_offset = (char *)(pkc1->pkblk_start + sizeof(struct block_desc));
+
+ pbd1->offset_to_first_pkt = (long)sizeof(struct block_desc);
+
+ pkc1->prev = pkc1->nxt_offset;
+ pkc1->pkblk_end = pkc1->pkblk_start + pkc1->kblk_size;
+
+ prb_unplug_queue(pkc1);
+ _prb_refresh_rx_retire_blk_timer(pkc1);
+}
+
+static inline void prb_plug_queue(struct kbdq_core *pkc,struct
packet_sock *po) {
+ pkc->reset_pending_on_curr_blk=1;
+ po->stats_u.stats3.tp_plug_q_cnt++;
+}
+
+static void *prb_try_next_block(struct kbdq_core *pkc,struct packet_sock *po)
+{
+ struct block_desc *pbd = GET_CURR_PBLOCK_DESC_FROM_CORE(pkc);
+
+ /* close current block */
+ if (likely(TP_STATUS_KERNEL == pbd->block_status)) {
+ prb_close_block(pkc,pbd);
+ } else {
+ printk("<%s> ERROR - pbd[%d]:%p\n",__func__,pkc->kactive_blk_num,pbd);
+ BUG();
+ }
+
+ /* Get the next block num */
+ pbd = GET_CURR_PBLOCK_DESC_FROM_CORE(pkc);
+
+ smp_mb();
+
+ /* If the curr_block is currently in_use then plug the queue */
+ if (TP_STATUS_USER == pbd->block_status) {
+ prb_plug_queue(pkc,po);
+ return NULL;
+ }
+ /* open next block */
+ prb_open_block(pkc,pbd);
+ return (void *)pkc->nxt_offset;
+}
+
+#define TOTAL_PKT_LEN_INCL_ALIGN(length) (ALIGN_4((length)))
+
+static void prb_fill_curr_block(char *curr,struct kbdq_core
*pkc,struct block_desc *pbd,unsigned int len)
+{
+ struct tpacket3_hdr *ppd;
+ struct tpacket3_hdr *prev;
+
+ ppd = (struct tpacket3_hdr *)curr;
+ prev = (struct tpacket3_hdr *)pkc->prev;
+ /* lets do pd_s1 for for V4 header */
+ //ppd->pd_u1.pd_s1.nxt_offset = 0;
+ //((struct tpacket3_hdr *)pkc->prev)->pd_u1.pd_s1.next_offset = (char *)ppd - pkc->prev;
+ ppd->tp_next_offset = 0;
+ if (pkc->prev > (char *)ppd) {
+ printk("<%s> curr:0x%p len:%d pkc->prev:%p \n",__func__,curr,len,pkc->prev);
+ BUG();
+ }
+ prev->tp_next_offset = (long)ppd - (long)pkc->prev;
+ pkc->prev = curr;
+ pkc->nxt_offset += TOTAL_PKT_LEN_INCL_ALIGN(len);
+ pbd->num_pkts += 1;
+}
+
+static inline int prb_curr_blk_in_use(struct kbdq_core *pkc,struct
block_desc *pbd) {
+
+ return (TP_STATUS_USER == pbd->block_status);
+}
+
+static inline int prb_queue_plugged(struct kbdq_core *pkc) {
+ return pkc->reset_pending_on_curr_blk;
+}
+
+/* Assumes caller has the sk->rx_queue.lock */
+static void *__packet_lookup_frame_in_block(struct packet_ring_buffer *rb,
+ int status,unsigned int len,struct packet_sock *po)
+{
+ struct kbdq_core *pkc = GET_PBDQC_FROM_RB(rb);
+ struct block_desc *pbd = GET_CURR_PBLOCK_DESC_FROM_CORE(pkc);
+ char *curr, *end;
+
+ if (prb_queue_plugged(pkc)) {
+ if (prb_curr_blk_in_use(pkc,pbd)) {
+ return NULL;
+ } else {
+ /* open-block unplugs the queue. Unplugging is a side effect */
+ prb_open_block(pkc,pbd);
+ }
+ }
+
+ smp_mb();
+
+ curr = pkc->nxt_offset;
+ end = (char *) ( (char *)pbd + pkc->kblk_size);
+
+ /* first try the current block */
+ if (curr+TOTAL_PKT_LEN_INCL_ALIGN(len) < end) {
+ prb_fill_curr_block(curr,pkc,pbd,len);
+ return (void *)curr;
+ }
+
+ /* Then try the next block. */
+ if ((curr = (char *)prb_try_next_block(pkc,po))) {
+ pbd = GET_CURR_PBLOCK_DESC_FROM_CORE(pkc);
+ prb_fill_curr_block(curr,pkc,pbd,len);
+ return (void *)curr;
+ }
+
+ /* no free blocks are available - user_space hasn't caught up yet */
+ return NULL;
+}
+
+static inline void *packet_current_rx_frame(struct packet_sock *po,
+ struct packet_ring_buffer *rb,
+ int status, unsigned int len)
+{
+ char *curr=NULL;
+ switch (po->tp_version) {
+ case TPACKET_V1:
+ case TPACKET_V2:
+ curr = packet_lookup_frame(po, rb, rb->head, status);
+ return curr;
+ case TPACKET_V3:
+ return __packet_lookup_frame_in_block(rb, status,len,po);
+ default:
+ pr_err("<%s> TPACKET version:%d not supported\n",__func__,po->tp_version);
+ BUG();
+ return 0;
+ }
+}
+
+static inline void *prb_lookup_block(struct packet_sock *po,
+ struct packet_ring_buffer *rb,unsigned int previous,
+ int status)
+{
+ struct kbdq_core *pkc = GET_PBDQC_FROM_RB(rb);
+ struct block_desc *pbd = GET_PBLOCK_DESC(pkc,previous);
+
+ if (status != pbd->block_status)
+ return NULL;
+ return pbd;
+}
+
+static inline int prb_previous_blk_num(struct packet_ring_buffer *rb)
+{
+ unsigned int prev = rb->prb_bdqc.kactive_blk_num ?
(rb->prb_bdqc.kactive_blk_num-1) : (rb->prb_bdqc.knum_blocks-1);
+ return prev;
+}
+
+/* Assumes caller has held the rx_queue.lock */
+static inline void* __prb_previous_block(struct packet_sock *po,
+ struct packet_ring_buffer *rb,
+ int status)
+{
+
+ unsigned int previous = prb_previous_blk_num(rb);
+ return prb_lookup_block(po,rb,previous,status);
}
+
static inline void *packet_previous_frame(struct packet_sock *po,
struct packet_ring_buffer *rb,
int status)
@@ -320,11 +617,38 @@ static inline void *packet_previous_frame(struct
packet_sock *po,
return packet_lookup_frame(po, rb, previous, status);
}
+static inline void *packet_previous_rx_frame(struct packet_sock *po,
+ struct packet_ring_buffer *rb,
+ int status)
+{
+ if (po->tp_version <= TPACKET_V2)
+ return packet_previous_frame(po,rb,status);
+
+ return __prb_previous_block(po,rb,status);
+}
+
static inline void packet_increment_head(struct packet_ring_buffer *buff)
{
buff->head = buff->head != buff->frame_max ? buff->head+1 : 0;
}
+static inline void packet_increment_rx_head(struct packet_sock
*po,struct packet_ring_buffer *rb)
+{
+ switch (po->tp_version) {
+ case TPACKET_V1:
+ case TPACKET_V2:
+ return packet_increment_head(rb);
+ case TPACKET_V3:
+ pr_err("<%s> TPACKET version:%d not supported.Dumping stack.\n",__func__,po->tp_version);
+ dump_stack();
+ return;
+ default:
+ pr_err("<%s> TPACKET version not supported\n",__func__);
+ BUG();
+ return;
+ }
+}
+
static inline struct packet_sock *pkt_sk(struct sock *sk)
{
return (struct packet_sock *)sk;
@@ -663,6 +987,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct
net_device *dev,
union {
struct tpacket_hdr *h1;
struct tpacket2_hdr *h2;
+ struct tpacket3_hdr *h3;
void *raw;
} h;
u8 *skb_head = skb->data;
@@ -715,29 +1040,31 @@ static int tpacket_rcv(struct sk_buff *skb,
struct net_device *dev,
macoff = netoff - maclen;
}
- if (macoff + snaplen > po->rx_ring.frame_size) {
- if (po->copy_thresh &&
- atomic_read(&sk->sk_rmem_alloc) + skb->truesize <
- (unsigned)sk->sk_rcvbuf) {
- if (skb_shared(skb)) {
- copy_skb = skb_clone(skb, GFP_ATOMIC);
- } else {
- copy_skb = skb_get(skb);
- skb_head = skb->data;
+ if (po->tp_version <= TPACKET_V2) {
+ if (macoff + snaplen > po->rx_ring.frame_size) {
+ if (po->copy_thresh &&
+ atomic_read(&sk->sk_rmem_alloc) + skb->truesize <
+ (unsigned)sk->sk_rcvbuf) {
+ if (skb_shared(skb)) {
+ copy_skb = skb_clone(skb, GFP_ATOMIC);
+ } else {
+ copy_skb = skb_get(skb);
+ skb_head = skb->data;
+ }
+ if (copy_skb)
+ skb_set_owner_r(copy_skb, sk);
}
- if (copy_skb)
- skb_set_owner_r(copy_skb, sk);
+ snaplen = po->rx_ring.frame_size - macoff;
+ if ((int)snaplen < 0)
+ snaplen = 0;
}
- snaplen = po->rx_ring.frame_size - macoff;
- if ((int)snaplen < 0)
- snaplen = 0;
}
-
spin_lock(&sk->sk_receive_queue.lock);
- h.raw = packet_current_frame(po, &po->rx_ring, TP_STATUS_KERNEL);
+ h.raw = packet_current_rx_frame(po, &po->rx_ring,
TP_STATUS_KERNEL,(macoff+snaplen));
if (!h.raw)
goto ring_is_full;
- packet_increment_head(&po->rx_ring);
+ if (TPACKET_V3 != po->tp_version)
+ packet_increment_rx_head(po,&po->rx_ring);
po->stats.tp_packets++;
if (copy_skb) {
status |= TP_STATUS_COPY;
@@ -789,6 +1116,21 @@ static int tpacket_rcv(struct sk_buff *skb,
struct net_device *dev,
h.h2->tp_vlan_tci = vlan_tx_tag_get(skb);
hdrlen = sizeof(*h.h2);
break;
+ case TPACKET_V3:
+ /* tp_nxt_offset is already populated above. So DONT clear those
fields here */
+ h.h3->tp_len = skb->len;
+ h.h3->tp_snaplen = snaplen;
+ h.h3->tp_mac = macoff;
+ h.h3->tp_net = netoff;
+ if (skb->tstamp.tv64)
+ ts = ktime_to_timespec(skb->tstamp);
+ else
+ getnstimeofday(&ts);
+ h.h3->tp_sec = ts.tv_sec;
+ h.h3->tp_nsec = ts.tv_nsec;
+ h.h3->tp_vlan_tci = vlan_tx_tag_get(skb);
+ hdrlen = sizeof(*h.h3);
+ break;
default:
BUG();
}
@@ -804,7 +1146,8 @@ static int tpacket_rcv(struct sk_buff *skb,
struct net_device *dev,
else
sll->sll_ifindex = dev->ifindex;
- __packet_set_status(po, h.raw, status);
+ if (po->tp_version <= TPACKET_V2)
+ __packet_set_status(po, h.raw, status);
smp_mb();
#if ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE == 1
{
@@ -815,7 +1158,6 @@ static int tpacket_rcv(struct sk_buff *skb,
struct net_device *dev,
flush_dcache_page(pgv_to_page(start));
}
#endif
-
sk->sk_data_ready(sk, 0);
drop_n_restore:
@@ -1984,6 +2326,7 @@ packet_setsockopt(struct socket *sock, int
level, int optname, char __user *optv
switch (val) {
case TPACKET_V1:
case TPACKET_V2:
+ case TPACKET_V3:
po->tp_version = val;
return 0;
default:
@@ -2082,6 +2425,7 @@ static int packet_getsockopt(struct socket
*sock, int level, int optname,
struct packet_sock *po = pkt_sk(sk);
void *data;
struct tpacket_stats st;
+ union tpacket_stats_u st_u;
if (level != SOL_PACKET)
return -ENOPROTOOPT;
@@ -2094,15 +2438,25 @@ static int packet_getsockopt(struct socket
*sock, int level, int optname,
switch (optname) {
case PACKET_STATISTICS:
- if (len > sizeof(struct tpacket_stats))
- len = sizeof(struct tpacket_stats);
+ if (po->tp_version == TPACKET_V3) {
+ len = sizeof(struct tpacket_stats_v3);
+ } else {
+ if (len > sizeof(struct tpacket_stats))
+ len = sizeof(struct tpacket_stats);
+ }
spin_lock_bh(&sk->sk_receive_queue.lock);
- st = po->stats;
+ if (po->tp_version == TPACKET_V3) {
+ memcpy(&st_u.stats3,&po->stats,sizeof(struct tpacket_stats));
+ st_u.stats3.tp_plug_q_cnt = po->stats_u.stats3.tp_plug_q_cnt;
+ st_u.stats3.tp_packets += po->stats.tp_drops;
+ data = &st_u.stats3;
+ } else {
+ st = po->stats;
+ st.tp_packets += st.tp_drops;
+ data = &st;
+ }
memset(&po->stats, 0, sizeof(st));
spin_unlock_bh(&sk->sk_receive_queue.lock);
- st.tp_packets += st.tp_drops;
-
- data = &st;
break;
case PACKET_AUXDATA:
if (len > sizeof(int))
@@ -2143,6 +2497,9 @@ static int packet_getsockopt(struct socket
*sock, int level, int optname,
case TPACKET_V2:
val = sizeof(struct tpacket2_hdr);
break;
+ case TPACKET_V3:
+ val = sizeof(struct tpacket3_hdr);
+ break;
default:
return -EINVAL;
}
@@ -2293,7 +2650,7 @@ static unsigned int packet_poll(struct file
*file, struct socket *sock,
spin_lock_bh(&sk->sk_receive_queue.lock);
if (po->rx_ring.pg_vec) {
- if (!packet_previous_frame(po, &po->rx_ring, TP_STATUS_KERNEL))
+ if (!packet_previous_rx_frame(po, &po->rx_ring, TP_STATUS_KERNEL))
mask |= POLLIN | POLLRDNORM;
}
spin_unlock_bh(&sk->sk_receive_queue.lock);
@@ -2396,7 +2753,6 @@ static struct pgv *alloc_pg_vec(struct
tpacket_req *req, int order)
pg_vec = kcalloc(block_nr, sizeof(struct pgv), GFP_KERNEL);
if (unlikely(!pg_vec))
goto out;
-
for (i = 0; i < block_nr; i++) {
pg_vec[i].buffer = alloc_one_pg_vec_page(order);
if (unlikely(!pg_vec[i].buffer))
@@ -2412,6 +2768,197 @@ out_free_pgvec:
goto out;
}
+
+static void prb_del_retire_blk_timer(struct kbdq_core *pkc)
+{
+ del_timer_sync(&pkc->retire_blk_timer);
+}
+
+static void prb_shutdown_retire_blk_timer(struct packet_sock *po, int
tx_ring,struct sk_buff_head *rb_queue)
+{
+ struct kbdq_core *pkc;
+
+ pkc = tx_ring ? &po->tx_ring.prb_bdqc : &po->rx_ring.prb_bdqc;
+
+ spin_lock(&rb_queue->lock);
+ pkc->delete_blk_timer=1;
+ spin_unlock(&rb_queue->lock);
+
+ prb_del_retire_blk_timer(pkc);
+}
+
+/* Increment the blk_num and then invoke this func to refresh the timer.
+ * We do it in this order so that if a timer is about
+ * to fire then it will fail the blk_num check.
+ * Assumes sk_buff_head lock is held.
+ */
+static void _prb_refresh_rx_retire_blk_timer(struct kbdq_core *pkc)
+{
+ pkc->last_kactive_blk_num = pkc->kactive_blk_num;
+ mod_timer(&pkc->retire_blk_timer,jiffies+msecs_to_jiffies(pkc->retire_blk_tmo));
+}
+
+/* close current block and open next block or plug the queue */
+static inline void prb_retire_curr_block(struct kbdq_core *pkc,struct
packet_sock *po)
+{
+ prb_try_next_block(pkc,po);
+}
+
+/*
+ * Timer logic:
+ * 1) We refresh the timer only when we open a block.
+ * By doing this we don't waste cycles refreshing the timer
+ * on packet-by-packet basis.
+ * With a 1MB block-size, on a 1Gbps line, it will take
+ * ~8 ms to fill a block.
+ * So, if the user sets the 'tmo' to 10ms then the timer will never
fire(which is what we want)!
+ * However, the user could choose to close a block early and that's fine.
+ *
+ * But when the timer does fire, we check whether or not to refresh it.
+ * Since the tmo granularity is in msecs, it is not too expensive
+ * to refresh the timer every '8' msecs.
+ * Either the user can set the 'tmo' or we can derive it based on
+ * a) line-speed and b) block-size
+ */
+static void prb_retire_rx_blk_timer_expired(unsigned long data)
+{
+ struct packet_sock *po = (struct packet_sock *)data;
+ struct kbdq_core *pkc = &po->rx_ring.prb_bdqc;
+ unsigned short tmo;
+ unsigned int plugged;
+ struct block_desc *pbd;
+
+ spin_lock(&po->sk.sk_receive_queue.lock);
+
+ plugged = prb_queue_plugged(pkc);
+ pbd = GET_CURR_PBLOCK_DESC_FROM_CORE(pkc);
+
+ /* We read the tmo so that user-space can change it anytime they want.
+ * But, the changes will take effect only when:
+ * i) Either when the timer expires(this code path) or
+ * ii)When a new block is opened.
+ */
+ tmo = pkc->retire_blk_tmo;
+ if (pkc->last_kactive_blk_num == pkc->kactive_blk_num &&
+ !plugged) {
+ if (TP_STATUS_KERNEL == pbd->block_status) {
+ prb_retire_curr_block(pkc,po);
+ }
+ }
+ pkc->last_kactive_blk_num = pkc->kactive_blk_num;
+
+ if (pkc->delete_blk_timer)
+ goto out;
+
+ if (plugged) {
+ /* Case 1. queue was plugged because user-space was lagging behind */
+ if (prb_curr_blk_in_use(pkc,pbd)) {
+ /* Ok, user-space is still behind. But we still want to refresh the timer */
+ /* if-check added for code readability */
+ } else {
+ /* Case 2. queue was plugged, user-space caught up and now the
link went idle && the timer fired.
+ * We don't have a block to close and we cannot close the current
block because
+ * the timer wasn't really meant for this block. So we just open
this block and restart the timer.
+ * open-block unplugs the queue, restarts timer.
Unplugging/refreshing-timer is a side effect.
+ */
+ prb_open_block(pkc,pbd);
+ goto out;
+ }
+ }
+
+ mod_timer(&pkc->retire_blk_timer,jiffies+msecs_to_jiffies(tmo));
+
+out:
+ spin_unlock(&po->sk.sk_receive_queue.lock);
+}
+
+static void prb_init_blk_timer(struct packet_sock *po,struct
kbdq_core *pkc,void (*func) (unsigned long))
+{
+
+ init_timer(&pkc->retire_blk_timer);
+ pkc->retire_blk_timer.data = (long)po;
+ pkc->retire_blk_timer.function = func;
+ pkc->retire_blk_timer.expires = jiffies;
+}
+
+static void prb_setup_retire_blk_timer(struct packet_sock *po,int tx_ring)
+{
+ struct kbdq_core *pkc;
+
+ if (tx_ring)
+ BUG();
+
+ pkc = tx_ring ? &po->tx_ring.prb_bdqc : &po->rx_ring.prb_bdqc;
+ prb_init_blk_timer(po,pkc,prb_retire_rx_blk_timer_expired);
+}
+
+static int prb_calc_retire_blk_tmo(struct packet_sock *po, int
blk_size_in_bytes)
+{
+ struct net_device *dev;
+ unsigned int mbits=0,msec=0,div=0,tmo=0;
+
+ dev = dev_get_by_index(sock_net(&po->sk), po->ifindex);
+ if (unlikely(dev == NULL)) {
+ return DEFAULT_PRB_RETIRE_TMO;
+ }
+
+ if (dev->ethtool_ops && dev->ethtool_ops->get_settings) {
+ struct ethtool_cmd ecmd = { .cmd = ETHTOOL_GSET, };
+
+ if (!dev->ethtool_ops->get_settings(dev, &ecmd)) {
+ switch(ecmd.speed) {
+ case SPEED_10000:
+ msec = 1;
+ div=10000/1000;
+ break;
+ case SPEED_1000:
+ msec = 1;
+ div = 1000/1000;
+ break;
+ /* If the link speed is so low you don't really need
to care about perf anyways */
+ case SPEED_100:
+ case SPEED_10:
+ default:
+ return DEFAULT_PRB_RETIRE_TMO;
+ }
+ }
+ }
+
+ mbits = (blk_size_in_bytes * 8) / (1024 * 1024);
+
+ if (div)
+ mbits /= div;
+
+ tmo = mbits * msec;
+
+ if (div)
+ return (tmo+1);
+ return tmo;
+}
+
+static void init_prb_bdqc(struct packet_sock *po,struct
packet_ring_buffer *rb,struct pgv *pg_vec,struct tpacket_req *req,int
tx_ring)
+{
+
+ struct kbdq_core *p1 = &rb->prb_bdqc;
+ struct block_desc *pbd;
+
+ memset(p1,0x0,sizeof(*p1));
+ p1->pkbdq = pg_vec;
+ pbd = (struct block_desc *)pg_vec[0].buffer;
+ p1->pkblk_start = (char *)pg_vec[0].buffer;
+
+ p1->kblk_size = req->tp_block_size;
+ p1->knum_blocks = req->tp_block_nr;
+ p1->hdrlen = po->tp_hdrlen;
+
+ p1->last_kactive_blk_num = 0;
+ po->stats_u.stats3.tp_plug_q_cnt = 0;
+ p1->retire_blk_tmo = prb_calc_retire_blk_tmo(po,req->tp_block_size);
+
+ prb_setup_retire_blk_timer(po,tx_ring);
+ prb_open_block(p1,pbd);
+}
+
static int packet_set_ring(struct sock *sk, struct tpacket_req *req,
int closing, int tx_ring)
{
@@ -2421,7 +2968,14 @@ static int packet_set_ring(struct sock *sk,
struct tpacket_req *req,
struct packet_ring_buffer *rb;
struct sk_buff_head *rb_queue;
__be16 num;
- int err;
+ int err=-EINVAL;
+
+ /* Opening a Tx-ring is NOT supported post TPACKET_V2 */
+ if (!closing && tx_ring && (po->tp_version > TPACKET_V2)) {
+ pr_err("<%s> Tx-ring is not supported on version:%d.Dumping stack.\n",__func__,po->tp_version);
+ dump_stack();
+ goto out;
+ }
rb = tx_ring ? &po->tx_ring : &po->rx_ring;
rb_queue = tx_ring ? &sk->sk_write_queue : &sk->sk_receive_queue;
@@ -2447,6 +3001,9 @@ static int packet_set_ring(struct sock *sk,
struct tpacket_req *req,
case TPACKET_V2:
po->tp_hdrlen = TPACKET2_HDRLEN;
break;
+ case TPACKET_V3:
+ po->tp_hdrlen = TPACKET3_HDRLEN;
+ break;
}
err = -EINVAL;
@@ -2472,6 +3029,15 @@ static int packet_set_ring(struct sock *sk,
struct tpacket_req *req,
pg_vec = alloc_pg_vec(req, order);
if (unlikely(!pg_vec))
goto out;
+ switch (po->tp_version) {
+ case TPACKET_V3:
+ /* Transmit path is not supported. We checked it above but just
being paranoid */
+ if (!tx_ring)
+ init_prb_bdqc(po,rb,pg_vec,req,tx_ring);
+ break;
+ default:
+ break;
+ }
}
/* Done */
else {
@@ -2529,10 +3095,17 @@ static int packet_set_ring(struct sock *sk,
struct tpacket_req *req,
}
spin_unlock(&po->bind_lock);
+ if (closing && (po->tp_version > TPACKET_V2)) {
+ /* Because we don't support block-based V3 on tx-ring */
+ if (!tx_ring)
+ prb_shutdown_retire_blk_timer(po,tx_ring,rb_queue);
+ }
+
release_sock(sk);
if (pg_vec)
free_pg_vec(pg_vec, order, req->tp_block_nr);
+
out:
return err;
}
* [RFC 00/01]af_packet: Enhance network capture visibility
From: chetan loke @ 2011-05-25 23:02 UTC (permalink / raw)
To: netdev, loke.chetan
Hello,
Please review the RFC/patchset. Any feedback is appreciated.
The patch set is not complete and is intended to:
a) demonstrate the improvements
b) gather suggestions
This patch attempts to i) improve network capture visibility by
increasing packet density and ii) assist in analyzing multiple (aggregated)
capture ports.
With the current af_packet->rx::mmap based approach, the element size
in the block needs to be statically configured. There is nothing wrong
with this config/implementation, but the traffic profile cannot be
known in advance, so it would be nice if that configuration wasn't
static. Normally, one would configure the element-size to be '2048' so
that you can at least capture the entire 'MTU-size'.
But if the traffic profile varies then we would end up either
i) wasting memory or ii) getting a sliced frame. In other words,
the packet density will be much less in the first case.
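To put rough numbers on the density argument, here is a minimal sketch that compares how many packets fit in one block with fixed-size slots versus packets laid out back to back. The function names and the hdr_len parameter are hypothetical and only for illustration; the sketch also ignores the per-block descriptor, so the packed count is an upper bound.

```c
#include <assert.h>
#include <stddef.h>

/* Round len up to a multiple of 4, similar in spirit to the patch's
 * TOTAL_PKT_LEN_INCL_ALIGN(). Hypothetical helper. */
static size_t demo_align4(size_t len)
{
    return (len + 3u) & ~(size_t)3u;
}

/* Fixed-slot model: every packet occupies one frame_size slot,
 * no matter how small the packet is. */
static size_t demo_pkts_per_block_fixed(size_t block_size, size_t frame_size)
{
    assert(frame_size > 0);
    return block_size / frame_size;
}

/* Packed model: each packet uses hdr_len + pkt_len bytes, aligned to 4.
 * hdr_len is an assumed per-packet header size, not a value from the patch. */
static size_t demo_pkts_per_block_packed(size_t block_size, size_t hdr_len,
                                         size_t pkt_len)
{
    size_t slot = demo_align4(hdr_len + pkt_len);

    assert(slot > 0);
    return block_size / slot;
}
```

With a 1MB block, 2048-byte slots hold 512 packets regardless of packet size, while 64-byte packets with an assumed 32-byte header could be packed roughly 20 times more densely.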
Enhancement:
E1) Enhance tpacket_rcv so that it can dump/copy the packets one after another.
E2) Also implement a basic timeout mechanism to close the current
block. That way, user-space won't be blocked forever on an idle link.
This is a much needed feature while monitoring multiple ports.
Look at 3) below.
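For a feel of what a sensible block timeout looks like, the timer comment in the patch estimates that a 1MB block takes ~8 ms to fill on a 1Gbps line. A minimal sketch of that arithmetic follows; the function name is hypothetical and this is not the patch's prb_calc_retire_blk_tmo(), just the underlying calculation.

```c
#include <assert.h>

/* Time in microseconds to fill block_bytes at link_mbps.
 * A rate of 1 Mbit/s (10^6 bits/s) is exactly 1 bit per microsecond,
 * so bits / Mbit/s gives microseconds directly. Result is rounded down. */
static unsigned long long demo_block_fill_time_us(unsigned long long block_bytes,
                                                  unsigned long long link_mbps)
{
    assert(link_mbps > 0);
    return (block_bytes * 8ULL) / link_mbps;
}
```

A timeout set slightly above the fill time means the timer rarely fires on a busy link and only retires blocks early when the link goes quiet.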
Why is such enhancement needed?
1) Well, spin-waiting/polling on a per-packet basis to see if it's
ready to be consumed does not scale while monitoring multiple ports.
poll() is not performance friendly either.
2) Also, typically a user-space packet capture interface hands off
multiple packets to another user-space protocol-decoder.
----------------
protocol-decoder
T2
----------------
========
ship pkts
========
^
|
v
-----------------
pkt-capture logic
T1
-----------------
================
nic/adp/sock IF
================
^
|
V
T1 and T2 are user-space threads. If the hand-off between T1 and T2
happens on a per-pkt basis then the solution does NOT scale.
However, one can argue that T1 can coalesce packets and then pass off a
single chunk to T2. But T1's packet consumption granularity is still at
an individual packet level and that is something that needs to be
addressed to avoid excessive polling.
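As a rough illustration of block-level consumption, the sketch below walks every packet in one block after a single hand-off, following a per-packet next-offset chain. The struct layouts here are simplified and hypothetical (the real block_desc and tpacket3_hdr are defined by the patch); only the idea of num_pkts, an offset to the first packet, and a next offset that is 0 on the last packet is taken from the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified layouts for illustration only. */
struct demo_block_desc {
    uint32_t num_pkts;            /* packets stored in this block */
    uint32_t offset_to_first_pkt; /* from the start of the block */
};

struct demo_pkt_hdr {
    uint32_t next_offset; /* distance to the next header, 0 on the last packet */
    uint32_t snaplen;     /* captured bytes that follow this header */
};

/* Visit every packet in one block. Returns the total captured bytes and
 * stores the number of packets visited in *pkts_seen. */
static uint64_t demo_walk_block(const unsigned char *block, size_t block_size,
                                uint32_t *pkts_seen)
{
    struct demo_block_desc bd;
    struct demo_pkt_hdr ph;
    uint64_t bytes = 0;
    size_t off;
    uint32_t i;

    assert(block_size >= sizeof(bd));
    memcpy(&bd, block, sizeof(bd)); /* memcpy avoids alignment assumptions */
    off = bd.offset_to_first_pkt;
    *pkts_seen = 0;

    for (i = 0; i < bd.num_pkts; i++) {
        /* Stop if the next header would run past the end of the block. */
        if (off > block_size || block_size - off < sizeof(ph))
            break;
        memcpy(&ph, block + off, sizeof(ph));
        bytes += ph.snaplen;
        (*pkts_seen)++;
        if (ph.next_offset == 0)
            break;
        off += ph.next_offset;
    }
    return bytes;
}
```

T1 would poll once per block and then pass the whole block (or the pointers gathered in this loop) to T2, instead of waking up per packet.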
3) Port aggregation:
Multiple ports are viewed/analyzed as one logical pipe.
Example:
3.1) up-stream path can be tapped in eth1
3.2) down-stream path can be tapped in eth2
3.3) Network TAP splits Rx/Tx paths and then feeds to eth1,eth2.
If both eth1,eth2 need to be viewed as one logical channel,
then that implies we need to timesort the packets as they come across
eth1,eth2.
3.4) But the following issues further complicate the problem:
3.4.1) What if one stream is bursty and the other is flowing
at line rate?
3.4.2) How long do we wait before we can actually make a
decision in the app-space and bail-out from the spin-wait?
Solution:
3.5) Once we receive a block from multiple ports, then we can
compare the timestamps from the block-descriptor and then easily sort
the packets and feed the pointers to the decoders.
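The time-sorting step in 3.5 could look something like the sketch below: two per-port timestamp sequences, each already in arrival order, are merged into one logical stream. The struct and function names are hypothetical; a real consumer would merge packet pointers rather than bare timestamps.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical timestamp type for illustration. */
struct demo_ts {
    uint32_t sec;
    uint32_t nsec;
};

/* Returns -1, 0 or 1 if a is earlier than, equal to, or later than b. */
static int demo_ts_cmp(struct demo_ts a, struct demo_ts b)
{
    if (a.sec != b.sec)
        return a.sec < b.sec ? -1 : 1;
    if (a.nsec != b.nsec)
        return a.nsec < b.nsec ? -1 : 1;
    return 0;
}

/* Merge two per-port sequences into out[], which must have room for
 * n0 + n1 entries. out_port[i] records the source port (0 or 1) of out[i].
 * Ties go to port 0. Returns the number of merged entries. */
static size_t demo_merge_ports(const struct demo_ts *p0, size_t n0,
                               const struct demo_ts *p1, size_t n1,
                               struct demo_ts *out, int *out_port)
{
    size_t i = 0, j = 0, k = 0;

    assert(out && out_port);
    while (i < n0 || j < n1) {
        int take0 = (j == n1) ||
                    (i < n0 && demo_ts_cmp(p0[i], p1[j]) <= 0);

        if (take0) {
            out[k] = p0[i++];
            out_port[k] = 0;
        } else {
            out[k] = p1[j++];
            out_port[k] = 1;
        }
        k++;
    }
    return k;
}
```

The same comparison applied to the block descriptors gives a cheap shortcut: if one port's block ends before the other port's block begins, the earlier block can be handed to the decoder whole without a per-packet merge.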
------------------------------
Performance results:
------------------------------
Setup:
S1) Ran 3 pktgen sessions from 3 worker VMs (VM0-VM2).
S2) Each pktgen session was configured to send 40 million 64-byte packets.
S3) Ran patched kernel on the probe-VM (VM3).
S4) rx-mmap application code:
BLOCK_SIZE: 1MB
FRAME_SIZE: 2048 bytes
NUM_BLOCKS: 64
Note: TPACKET_V3 doesn't really care about FRAME_SIZE.
But the code was untouched to ensure minimal disruption.
Numbers from VM3(tpacket_stats):
Case P1) TPACKET_V0[V1](existing model):
recieved 84909875 packets, dropped 5760817
Pkts seen by the app:79149058
Case P2) TPACKET_V3(enhanced model):
recieved 102562944 packets, dropped 2 plug_q_cnt 12
Pkts seen by the app:102562942
PS: plug_q_cnt is interpreted as "The tpacket_rcv code got blocked only
12 times during the entire capture process. Blocked implies the user-space
process took some time to catch up."
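The "Pkts seen by the app" figures follow from the two counters, since the reported received count includes the drops. A small sketch of that arithmetic, with hypothetical helper names:

```c
#include <assert.h>
#include <stdint.h>

/* Packets delivered to the application: received minus dropped. */
static uint64_t demo_pkts_seen(uint64_t received, uint64_t dropped)
{
    assert(dropped <= received);
    return received - dropped;
}

/* Drop rate in basis points (hundredths of a percent), rounded down. */
static uint64_t demo_drop_bp(uint64_t received, uint64_t dropped)
{
    assert(received > 0);
    return (dropped * 10000u) / received;
}
```

Applied to the numbers above, case P1 works out to 678 basis points (about 6.8%) dropped, while case P2 rounds down to 0.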
Note: In both cases, VM3 should have seen ~120 million packets. But
notice it only sees around 90-100M pkts. The hypervisor is dropping
~20%-30% of the traffic. We can ignore this because in the non-virtual
world, there could be limitations on the host side too.
Summary:
A) In P2) notice how the VM keeps up and so it now has more visibility
than the P1) case.
So,
A.1] P2) almost always has around 10%-20% higher visibility than P1.
A.2] P2) almost always captures ~98-99% of the traffic as seen by the kernel.
A.3] P1) on the other hand drops anywhere around ~7-10% of the traffic.
A.4] P1) also has 10%-20% lower visibility because
i) it loses frames due to the static frame size format
ii) has to poll/spin-wait for a single packet.
Regards
Chetan Loke
* [GIT] Networking
From: David Miller @ 2011-05-25 22:52 UTC (permalink / raw)
To: torvalds; +Cc: akpm, netdev, linux-kernel
The majority of the bits here are just a merge with John Linville's
queued up wireless stuff. This has been in his tree for more than
a week and I was just waiting for him to get back from a conference
to send the pull request to me.
Other noteworthy bits:
1) Erroneous socket filters can log kernel messages without control,
fix from Joe Perches.
2) Fix regression in the locking of interface dumping, from Eric Dumazet.
3) Fix crash in bridging due to improperly initialized route object,
also from Eric.
4) IP fragments give erroneous congestion notification signals in
SFQ packet scheduler, also from Eric.
5) Rest of networking %pK conversions, from Dan Rosenberg via Andrew
Morton.
6) When the RTNL mutex is held, synchronize_net() can use
synchronize_rcu_expedited(). From Eric Dumazet.
7) Fix IGMP source filter clearing when users of the group still
exist, from Veaceslav Falico.
8) __dst_destroy_metrics_generic() forgets to set "read-only" bit
in the encoded pointer. Fix from Eric Dumazet.
9) dev_disable_lro() needs to propagate to underlying physical device
of a VLAN, from Neil Horman.
10) ASCONF memory leak in SCTP, fix from Wei Yongjun.
11) SFQ packet scheduler's ->peek() method returns different packets
than ->dequeue() would, fix from Eric Dumazet.
12) Fix bonding deadlock in ALB mode, from Neil Horman.
Please pull, thanks a lot!
The following changes since commit 2a651c7f8d377cf88271374315cbb5fe82eac784:
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs (2011-05-25 09:21:56 -0700)
are available in the git repository at:
master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6.git master
Alexey Dobriyan (1):
airo: correct proc entry creation interfaces
Alexey Orishko (1):
CDC NCM: release interfaces fix in unbind()
Breno Leitao (1):
ehea: Fix multicast registration on semi-promiscuous mode
Christian Lamparter (2):
p54usb: add zoom 4410 usbid
carl9170: advertise interface combinations
Dan Rosenberg (1):
net: convert %p usage to %pK
Daniel Halperin (1):
iwlwifi: remove unused parameter from iwl_hcmd_queue_reclaim
David S. Miller (3):
ipv6: Fix return of xfrm6_tunnel_rcv()
bug.h: Fix build with CONFIG_PRINTK disabled.
Merge branch 'for-davem' of ssh://master.kernel.org/.../linville/wireless-next-2.6
Dmitry Kravkov (2):
bnx2x: fix inverted condition
bnx2x: protect sequence increment with mutex
Eric Dumazet (8):
net: ping: cleanups ping_v4_unhash()
snap: remove one synchronize_net()
sch_sfq: avoid giving spurious NET_XMIT_CN signals
net: use synchronize_rcu_expedited()
net: fix __dst_destroy_metrics_generic()
bridge: initialize fake_rtable metrics
sch_sfq: fix peek() implementation
net: hold rtnl again in dump callbacks
Felix Fietkau (3):
ath9k: fix ad-hoc mode beacon selection
ath9k: fix ad-hoc nexttbtt calculation
ath9k: implement .tx_last_beacon()
Flavio Leitner (1):
bonding: documentation and code cleanup for resend_igmp
Ian Campbell (1):
xen: netfront: hold RTNL when updating features.
Javier Cardona (2):
mac80211: Deactivate mesh path timers when freeing nodes
mac80211: Don't sleep when growing the mesh path
Joe Perches (2):
bug.h: Add WARN_RATELIMIT
net: filter: Use WARN_RATELIMIT
Johannes Berg (10):
iwlagn: prepare for multi-TB commands
iwlagn: clean up TXQ indirection
iwlagn: remove unused pad argument
iwlagn: support multiple TBs per command
iwlagn: remove set but unused vars
iwlagn: change default beacon interval
mac80211: verify IBSS in interface combinations
mac80211: add missing rcu_barrier
mac80211: fix and simplify mesh locking
mac80211: annotate and fix RCU in mesh code
John W. Linville (2):
Merge branch 'wireless-next-2.6' of git://git.kernel.org/.../iwlwifi/iwlwifi-2.6
Merge ssh://master.kernel.org/.../linville/wireless-next-2.6 into for-davem
Jouni Malinen (1):
cfg80211: Use consistent BSS matching between scan and sme
Larry Finger (1):
rtlwifi: rtl8192c-common: rtl8192ce: Fix for HT40 regression
Luciano Coelho (1):
nl80211: remove some stack variables in trigger_scan and start_sched_scan
Marc Yang (5):
mwifiex: reduce CPU usage by tracking tx_pkts_queued
mwifiex: reduce CPU usage by tracking highest_queued_prio
mwifiex: check mwifiex_wmm_lists_empty() before dequeue
mwifiex: CPU mips optimization with NO_PKT_PRIO_TID
mwifiex: adjust high/low water marks for tx_pending queue
Meelis Roos (1):
Add Fujitsu 1000base-SX PCI ID to tg3
Mike Frysinger (1):
net/irda: convert bfin_sir to common Blackfin UART header
Mohammed Shafi Shajakhan (2):
ath_hw: Fix bssid mask documentation
ath9k: use PS wakeup before REG_READ
Neil Horman (3):
net: move is_vlan_dev into public header file (v2)
net: make dev_disable_lro use physical device if passed a vlan dev (v2)
bonding: prevent deadlock on slave store with alb mode (v3)
Prarit Bhargava (1):
isdn: netjet - blacklist Digium TDM400P
Rafał Miłecki (8):
b43: rename b43_wldev's field with ssb_device to sdev
bcma: add PCI ID of the card found in Thinkpad X120e
b43: add helpers for block R/W ops
b43: make b43_wireless_init less bus specific
b43: dma: cache translation (routing bits)
b43: add helper for finding GPIO device
b43: separate ssb core reset
b43: read PHY info only when needed (for PHY-A)
Rajkumar Manoharan (2):
mac80211: abort scan_work immediately when the device goes down
ath9k: Fix power save wrappers in debug ops
Randy Dunlap (2):
wireless: fix cfg80211.h new kernel-doc warnings
wireless: fix fatal kernel-doc error + warning in mac80211.h
Rhyland Klein (1):
net: rfkill: add generic gpio rfkill driver
Sathya Perla (1):
be2net: hash key for rss-config cmd not set
Stephen Hemminger (1):
dst: catch uninitialized metrics
Sujith Manoharan (9):
ath9k_htc: Fix mode selection
ath9k_htc: Fix station flags
ath9k_htc: Recalculate the BSSID mask on interface
ath9k_htc: Fix RX filter calculation
ath9k_htc: Fix BSSID calculation
ath9k_htc: Fix max subframe handling
ath9k_htc: Change credit limit for UB94/95
ath9k_htc: Fix packet timeout
ath9k: Drag the driver to the year 2011
Ulrich Hecht (1):
via-velocity: don't annotate MAC registers as packed
Veaceslav Falico (1):
igmp: call ip_mc_clear_src() only when we have no users of ip_mc_list
Wei Yongjun (1):
sctp: fix memory leak of the ASCONF queue when free asoc
Wey-Yi Guy (8):
iwlagn: more ucode error log info
iwlagn: add testmode trace command
iwlagn: add eeprom command to testmode
iwlagn: add testmode set fixed rate command
iwlagn: clear STATUS_HCMD_ACTIVE bit if fail enqueue
iwlagn: alwasy send RXON with disassociate falge before associate
iwlagn: remove unused old_assoc parameter
iwlagn: dbg_fixed_rate only used when CONFIG_MAC80211_DEBUGFS enabled
Documentation/networking/bonding.txt | 13 +-
drivers/bcma/host_pci.c | 1 +
drivers/isdn/hardware/mISDN/netjet.c | 6 +
drivers/net/benet/be_cmds.c | 3 +-
drivers/net/bnx2x/bnx2x_cmn.c | 2 +-
drivers/net/bnx2x/bnx2x_main.c | 3 +-
drivers/net/bonding/bond_alb.c | 4 -
drivers/net/bonding/bond_main.c | 28 +-
drivers/net/bonding/bond_sysfs.c | 16 +-
drivers/net/ehea/ehea_main.c | 2 +-
drivers/net/irda/bfin_sir.c | 59 ++--
drivers/net/irda/bfin_sir.h | 63 +----
drivers/net/tg3.c | 1 +
drivers/net/usb/cdc_ncm.c | 73 ++---
drivers/net/via-velocity.h | 2 +-
drivers/net/wireless/airo.c | 33 +--
drivers/net/wireless/ath/ath9k/ahb.c | 2 +-
drivers/net/wireless/ath/ath9k/ani.c | 2 +-
drivers/net/wireless/ath/ath9k/ani.h | 2 +-
drivers/net/wireless/ath/ath9k/ar5008_initvals.h | 2 +-
drivers/net/wireless/ath/ath9k/ar5008_phy.c | 2 +-
drivers/net/wireless/ath/ath9k/ar9001_initvals.h | 2 +-
drivers/net/wireless/ath/ath9k/ar9002_calib.c | 2 +-
drivers/net/wireless/ath/ath9k/ar9002_hw.c | 2 +-
drivers/net/wireless/ath/ath9k/ar9002_initvals.h | 2 +-
drivers/net/wireless/ath/ath9k/ar9002_mac.c | 2 +-
drivers/net/wireless/ath/ath9k/ar9002_phy.c | 2 +-
drivers/net/wireless/ath/ath9k/ar9002_phy.h | 2 +-
.../net/wireless/ath/ath9k/ar9003_2p2_initvals.h | 2 +-
drivers/net/wireless/ath/ath9k/ar9003_calib.c | 2 +-
drivers/net/wireless/ath/ath9k/ar9003_eeprom.c | 2 +-
drivers/net/wireless/ath/ath9k/ar9003_eeprom.h | 16 +
drivers/net/wireless/ath/ath9k/ar9003_hw.c | 2 +-
drivers/net/wireless/ath/ath9k/ar9003_mac.c | 2 +-
drivers/net/wireless/ath/ath9k/ar9003_mac.h | 2 +-
drivers/net/wireless/ath/ath9k/ar9003_paprd.c | 2 +-
drivers/net/wireless/ath/ath9k/ar9003_phy.c | 2 +-
drivers/net/wireless/ath/ath9k/ar9003_phy.h | 2 +-
drivers/net/wireless/ath/ath9k/ar9485_initvals.h | 2 +-
drivers/net/wireless/ath/ath9k/ath9k.h | 5 +-
drivers/net/wireless/ath/ath9k/beacon.c | 48 ++-
drivers/net/wireless/ath/ath9k/btcoex.c | 2 +-
drivers/net/wireless/ath/ath9k/btcoex.h | 2 +-
drivers/net/wireless/ath/ath9k/calib.c | 2 +-
drivers/net/wireless/ath/ath9k/calib.h | 2 +-
drivers/net/wireless/ath/ath9k/common.c | 2 +-
drivers/net/wireless/ath/ath9k/common.h | 2 +-
drivers/net/wireless/ath/ath9k/debug.c | 10 +-
drivers/net/wireless/ath/ath9k/debug.h | 2 +-
drivers/net/wireless/ath/ath9k/eeprom.c | 2 +-
drivers/net/wireless/ath/ath9k/eeprom.h | 2 +-
drivers/net/wireless/ath/ath9k/eeprom_4k.c | 2 +-
drivers/net/wireless/ath/ath9k/eeprom_9287.c | 2 +-
drivers/net/wireless/ath/ath9k/eeprom_def.c | 2 +-
drivers/net/wireless/ath/ath9k/gpio.c | 2 +-
drivers/net/wireless/ath/ath9k/hif_usb.c | 2 +-
drivers/net/wireless/ath/ath9k/hif_usb.h | 4 +-
drivers/net/wireless/ath/ath9k/htc.h | 25 +-
drivers/net/wireless/ath/ath9k/htc_drv_beacon.c | 2 +-
drivers/net/wireless/ath/ath9k/htc_drv_gpio.c | 2 +-
drivers/net/wireless/ath/ath9k/htc_drv_init.c | 9 +-
drivers/net/wireless/ath/ath9k/htc_drv_main.c | 79 +++--
drivers/net/wireless/ath/ath9k/htc_drv_txrx.c | 6 +-
drivers/net/wireless/ath/ath9k/htc_hst.c | 2 +-
drivers/net/wireless/ath/ath9k/htc_hst.h | 2 +-
drivers/net/wireless/ath/ath9k/hw-ops.h | 2 +-
drivers/net/wireless/ath/ath9k/hw.c | 2 +-
drivers/net/wireless/ath/ath9k/hw.h | 2 +-
drivers/net/wireless/ath/ath9k/init.c | 2 +-
drivers/net/wireless/ath/ath9k/mac.c | 2 +-
drivers/net/wireless/ath/ath9k/mac.h | 2 +-
drivers/net/wireless/ath/ath9k/main.c | 42 +++-
drivers/net/wireless/ath/ath9k/pci.c | 2 +-
drivers/net/wireless/ath/ath9k/phy.h | 2 +-
drivers/net/wireless/ath/ath9k/rc.c | 2 +-
drivers/net/wireless/ath/ath9k/rc.h | 2 +-
drivers/net/wireless/ath/ath9k/recv.c | 2 +-
drivers/net/wireless/ath/ath9k/reg.h | 2 +-
drivers/net/wireless/ath/ath9k/wmi.c | 2 +-
drivers/net/wireless/ath/ath9k/wmi.h | 2 +-
drivers/net/wireless/ath/ath9k/xmit.c | 2 +-
drivers/net/wireless/ath/carl9170/carl9170.h | 4 +
drivers/net/wireless/ath/carl9170/fw.c | 19 +-
drivers/net/wireless/ath/carl9170/main.c | 10 +-
drivers/net/wireless/ath/hw.c | 10 +-
drivers/net/wireless/b43/b43.h | 24 +-
drivers/net/wireless/b43/dma.c | 37 +-
drivers/net/wireless/b43/leds.c | 4 +-
drivers/net/wireless/b43/lo.c | 4 +-
drivers/net/wireless/b43/main.c | 194 ++++++-----
drivers/net/wireless/b43/phy_a.c | 16 +-
drivers/net/wireless/b43/phy_common.c | 8 +-
drivers/net/wireless/b43/phy_g.c | 48 ++--
drivers/net/wireless/b43/phy_lp.c | 22 +-
drivers/net/wireless/b43/phy_n.c | 24 +-
drivers/net/wireless/b43/pio.c | 30 +-
drivers/net/wireless/b43/rfkill.c | 6 +-
drivers/net/wireless/b43/sdio.c | 4 +-
drivers/net/wireless/b43/sysfs.c | 4 +-
drivers/net/wireless/b43/tables_lpphy.c | 4 +-
drivers/net/wireless/b43/wa.c | 4 +-
drivers/net/wireless/b43/xmit.c | 2 +-
drivers/net/wireless/iwlwifi/iwl-1000.c | 4 -
drivers/net/wireless/iwlwifi/iwl-2000.c | 8 +-
drivers/net/wireless/iwlwifi/iwl-5000.c | 12 +-
drivers/net/wireless/iwlwifi/iwl-6000.c | 12 +-
drivers/net/wireless/iwlwifi/iwl-agn-calib.c | 14 +-
drivers/net/wireless/iwlwifi/iwl-agn-lib.c | 14 +-
drivers/net/wireless/iwlwifi/iwl-agn-rs.c | 86 +++--
drivers/net/wireless/iwlwifi/iwl-agn-rxon.c | 9 +-
drivers/net/wireless/iwlwifi/iwl-agn-sta.c | 4 +-
drivers/net/wireless/iwlwifi/iwl-agn-tx.c | 16 +-
drivers/net/wireless/iwlwifi/iwl-agn-ucode.c | 6 +-
drivers/net/wireless/iwlwifi/iwl-agn.c | 250 +++-----------
drivers/net/wireless/iwlwifi/iwl-agn.h | 13 +-
drivers/net/wireless/iwlwifi/iwl-commands.h | 5 +-
drivers/net/wireless/iwlwifi/iwl-core.h | 10 -
drivers/net/wireless/iwlwifi/iwl-dev.h | 66 +++--
drivers/net/wireless/iwlwifi/iwl-devtrace.h | 58 +++-
drivers/net/wireless/iwlwifi/iwl-eeprom.c | 7 +-
drivers/net/wireless/iwlwifi/iwl-hcmd.c | 9 +-
drivers/net/wireless/iwlwifi/iwl-led.c | 4 +-
drivers/net/wireless/iwlwifi/iwl-sta.c | 12 +-
drivers/net/wireless/iwlwifi/iwl-sv-open.c | 177 ++++++++++-
drivers/net/wireless/iwlwifi/iwl-testmode.h | 34 ++
drivers/net/wireless/iwlwifi/iwl-tx.c | 364 ++++++++++++++------
drivers/net/wireless/iwmc3200wifi/rx.c | 4 +-
drivers/net/wireless/mwifiex/11n_aggr.c | 4 +
drivers/net/wireless/mwifiex/main.h | 9 +-
drivers/net/wireless/mwifiex/txrx.c | 4 +-
drivers/net/wireless/mwifiex/wmm.c | 59 +++-
drivers/net/wireless/p54/p54usb.c | 1 +
drivers/net/wireless/rndis_wlan.c | 3 +-
drivers/net/wireless/rtlwifi/ps.c | 2 +-
drivers/net/wireless/rtlwifi/rtl8192c/phy_common.c | 2 +-
drivers/net/wireless/rtlwifi/rtl8192ce/phy.c | 69 ++++
drivers/net/wireless/rtlwifi/rtl8192ce/phy.h | 1 +
drivers/net/wireless/rtlwifi/rtl8192ce/sw.c | 1 +
drivers/net/xen-netfront.c | 2 +
drivers/staging/ath6kl/os/linux/cfg80211.c | 2 +-
drivers/staging/brcm80211/brcmfmac/wl_cfg80211.c | 4 +-
drivers/staging/wlan-ng/cfg80211.c | 2 +-
fs/proc/generic.c | 1 +
include/asm-generic/bug.h | 37 ++
include/linux/if_vlan.h | 5 +
include/linux/rfkill-gpio.h | 43 +++
include/net/cfg80211.h | 8 +-
include/net/dst.h | 2 +
net/802/psnap.c | 1 -
net/8021q/vlan.h | 5 -
net/atm/proc.c | 4 +-
net/bridge/br_netfilter.c | 6 +-
net/can/bcm.c | 6 +-
net/core/dev.c | 12 +-
net/core/dst.c | 2 +-
net/core/fib_rules.c | 1 +
net/core/filter.c | 4 +-
net/core/rtnetlink.c | 9 +-
net/ipv4/igmp.c | 10 +-
net/ipv4/ping.c | 3 -
net/ipv4/raw.c | 2 +-
net/ipv4/tcp_ipv4.c | 6 +-
net/ipv4/udp.c | 2 +-
net/ipv6/raw.c | 2 +-
net/ipv6/tcp_ipv6.c | 6 +-
net/ipv6/udp.c | 2 +-
net/ipv6/xfrm6_tunnel.c | 2 +-
net/key/af_key.c | 2 +-
net/mac80211/iface.c | 4 +-
net/mac80211/main.c | 22 +-
net/mac80211/mesh.h | 7 +-
net/mac80211/mesh_pathtbl.c | 204 +++++++----
net/mac80211/scan.c | 5 +
net/netlink/af_netlink.c | 2 +-
net/packet/af_packet.c | 2 +-
net/phonet/socket.c | 2 +-
net/rfkill/Kconfig | 9 +
net/rfkill/Makefile | 1 +
net/rfkill/rfkill-gpio.c | 227 ++++++++++++
net/sched/sch_sfq.c | 22 +-
net/sctp/associola.c | 16 +
net/sctp/proc.c | 4 +-
net/unix/af_unix.c | 2 +-
net/wireless/core.h | 5 +-
net/wireless/nl80211.c | 12 +-
net/wireless/sme.c | 19 +-
net/wireless/util.c | 2 +-
187 files changed, 2050 insertions(+), 1204 deletions(-)
create mode 100644 include/linux/rfkill-gpio.h
create mode 100644 net/rfkill/rfkill-gpio.c
^ permalink raw reply
* Re: [PATCH V5 2/6 net-next] netdevice.h: Add zero-copy flag in netdevice
From: Shirley Ma @ 2011-05-25 22:49 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Michał Mirosław, Ben Hutchings, David Miller,
Eric Dumazet, Avi Kivity, Arnd Bergmann, netdev, kvm,
linux-kernel
In-Reply-To: <20110519234154.GA13784@redhat.com>
On Fri, 2011-05-20 at 02:41 +0300, Michael S. Tsirkin wrote:
> So the requirements are
> - data must be released in a timely fashion (e.g. unlike virtio-net
> tun or bridge)
The current patch doesn't enable tun zero-copy; tun will copy the data, so it's
not an issue now. We can disallow attaching macvtap to a bridge when
zero-copy is enabled.
> - SG support
> - HIGHDMA support (on arches where this makes sense)
This can be checked by device flags.
> - no filtering based on data (data is mapped in guest)
> - on fast path no calls to skb_copy, skb_clone, pskb_copy,
> pskb_expand_head as these are slow
Any calls to skb_copy, skb_clone, pskb_copy, pskb_expand_head will do a
copy. The performance should be the same as the non-zero-copy case before.
I have done/tested the patch V6, will send it out for review tomorrow.
I am looking at where there are some cases, skb remains the same for
filtering.
> First 2 requirements are a must, all other requirements
> are just dependencies to make sure zero copy will be faster
> than non zero copy.
> Using a new feature bit is probably the simplest approach to
> this. macvtap on top of most physical NICs most likely works
> correctly so it seems a bit more work than it needs to be,
> but it's also the safest one I think ...
For "macvtap/vhost zero-copy" we can use SG & HIGHDMA to enable it; it
looks safe to me once skb_copy, skb_clone, pskb_copy, and
pskb_expand_head are patched.
To extend zero-copy in other usages, we can have a new feature bit
later.
Is that reasonable?
Thanks
Shirley
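
A minimal userspace sketch of the device-flag check proposed above, assuming zero-copy is gated on both SG and HIGHDMA. The flag values mirror the NETIF_F_SG and NETIF_F_HIGHDMA definitions quoted in the netdevice.h patch later in this listing; the `DEMO_` names and `demo_can_zero_copy()` are hypothetical and not part of any posted patch. It also requires HIGHDMA unconditionally, which is a simplification of the "on arches where this makes sense" condition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative copies of the two feature flags discussed in this thread. */
#define DEMO_F_SG      UINT32_C(1)   /* scatter/gather IO */
#define DEMO_F_HIGHDMA UINT32_C(32)  /* can DMA to high memory */

static_assert((DEMO_F_SG & DEMO_F_HIGHDMA) == 0,
	      "feature flags must be distinct bits");

/* Hypothetical helper: true only when every required flag is set. */
static bool demo_can_zero_copy(uint32_t dev_features)
{
	const uint32_t required = DEMO_F_SG | DEMO_F_HIGHDMA;

	/* Compare against the full mask so that one flag alone is not enough. */
	return (dev_features & required) == required;
}
```

A caller would pass the lower device's feature mask and fall back to the copying path whenever the helper returns false.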
^ permalink raw reply
* [PATCHv3] net: Abstract features usage.
From: Mahesh Bandewar @ 2011-05-25 22:43 UTC (permalink / raw)
To: David Miller
Cc: netdev, Mahesh Bandewar, Tom Herbert, Michał Mirosław,
Stephen Hemminger
In-Reply-To: <1306288544-1700-1-git-send-email-maheshb@google.com>
Define macros to set, clear, and test bits in the feature sets. This eliminates
direct use of these fields and makes them easier to manage in the future.
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
---
Changes since v2:
Added the include which accidentally went into the other patch.
Changes since v1:
Split the patch into two pieces.
include/linux/netdev_features.h | 64 +++++++++++++++++++++++++++++++++++++++
include/linux/netdevice.h | 9 +++++
2 files changed, 73 insertions(+), 0 deletions(-)
create mode 100644 include/linux/netdev_features.h
diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
new file mode 100644
index 0000000..3043c4d
--- /dev/null
+++ b/include/linux/netdev_features.h
@@ -0,0 +1,64 @@
+#ifndef _NETDEV_FEATURES_H
+#define _NETDEV_FEATURES_H
+
+/* Forward declarations */
+struct net_device;
+
+typedef unsigned long *nd_feature_t;
+
+static inline void _nd_set_feature(u32 *old_field,
+ unsigned long *new_field, int bit)
+{
+ if (bit < 32)
+ *old_field |= (1 << bit);
+ set_bit(bit, new_field);
+}
+
+static inline void _nd_clear_feature(u32 *old_field,
+ unsigned long *new_field, int bit)
+{
+ if (bit < 32)
+ *old_field &= ~(1 << bit);
+
+ clear_bit(bit, new_field);
+}
+
+static inline bool _nd_test_feature(u32 old_field,
+ unsigned long *new_field, int bit)
+{
+ if (bit < 32)
+ return (old_field & (1 << bit)) != 0;
+
+ return test_bit(bit, new_field) != 0;
+}
+
+#define netdev_set_active_feature(dev, bit) \
+ _nd_set_feature(&(dev)->features, (dev)->active_feature, (bit))
+#define netdev_clear_active_feature(dev, bit) \
+ _nd_clear_feature(&(dev)->features, (dev)->active_feature, (bit))
+#define netdev_test_active_feature(dev, bit) \
+ _nd_test_feature((dev)->features, (dev)->active_feature, (bit))
+
+#define netdev_set_offered_feature(dev, bit) \
+ _nd_set_feature(&(dev)->hw_features, (dev)->offered_feature, (bit))
+#define netdev_clear_offered_feature(dev, bit) \
+ _nd_clear_feature(&(dev)->hw_features, (dev)->offered_feature, (bit))
+#define netdev_test_offered_feature(dev, bit) \
+ _nd_test_feature((dev)->hw_features, (dev)->offered_feature, (bit))
+
+#define netdev_set_vlan_feature(dev, bit) \
+ _nd_set_feature(&(dev)->vlan_features, (dev)->vlan_feature, (bit))
+#define netdev_clear_vlan_feature(dev, bit) \
+ _nd_clear_feature(&(dev)->vlan_features, (dev)->vlan_feature, (bit))
+#define netdev_test_vlan_feature(dev, bit) \
+ _nd_test_feature((dev)->vlan_features, (dev)->vlan_feature, (bit))
+
+#define netdev_set_wanted_feature(dev, bit) \
+ _nd_set_feature(&(dev)->wanted_features, (dev)->wanted_feature, (bit))
+#define netdev_clear_wanted_feature(dev, bit) \
+ _nd_clear_feature(&(dev)->wanted_features, (dev)->wanted_feature, (bit))
+#define netdev_test_wanted_feature(dev, bit) \
+ _nd_test_feature((dev)->wanted_features, (dev)->wanted_feature, (bit))
+
+
+#endif /* _NETDEV_FEATURES_H */
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 9bb5872..ca31706 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -51,6 +51,7 @@
#ifdef CONFIG_DCB
#include <net/dcbnl.h>
#endif
+#include <linux/netdev_features.h>
struct vlan_group;
struct netpoll_info;
@@ -1078,6 +1079,14 @@ struct net_device {
/* mask of features inheritable by VLAN devices */
u32 vlan_features;
+#define DEV_FEATURE_WORDS BITS_TO_LONGS(ND_FEATURE_NUM_BITS)
+#define DEV_FEATURE_BITS (DEV_FEATURE_WORDS * BITS_PER_LONG)
+
+ DECLARE_BITMAP(active_feature, DEV_FEATURE_BITS);
+ DECLARE_BITMAP(offered_feature, DEV_FEATURE_BITS);
+ DECLARE_BITMAP(wanted_feature, DEV_FEATURE_BITS);
+ DECLARE_BITMAP(vlan_feature, DEV_FEATURE_BITS);
+
#define BIT2FLAG(bit) (1 << (bit))
#define NETIF_F_SG BIT2FLAG(NETIF_F_SG_BIT)
--
1.7.3.1
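
A minimal userspace sketch of the dual-write idea behind these helpers, assuming a legacy 32-bit mask kept in sync with a wider bitmap. All `demo_` names are hypothetical stand-ins; the sketch does not use the kernel's set_bit(), clear_bit(), or test_bit(), and it is not atomic. Testing a masked bit is written as a comparison against zero, because for any bit other than bit 0 the masked value equals the mask itself rather than 1.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_FEATURE_BITS  64
#define DEMO_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define DEMO_WORDS \
	((DEMO_FEATURE_BITS + DEMO_BITS_PER_LONG - 1) / DEMO_BITS_PER_LONG)

struct demo_features {
	uint32_t legacy;                  /* old 32-bit mask */
	unsigned long bitmap[DEMO_WORDS]; /* new, wider bitmap */
};

static inline void demo_set_feature(struct demo_features *f, int bit)
{
	assert(bit >= 0 && bit < DEMO_FEATURE_BITS);
	if (bit < 32)
		f->legacy |= UINT32_C(1) << bit;
	f->bitmap[bit / DEMO_BITS_PER_LONG] |= 1UL << (bit % DEMO_BITS_PER_LONG);
}

static inline void demo_clear_feature(struct demo_features *f, int bit)
{
	assert(bit >= 0 && bit < DEMO_FEATURE_BITS);
	if (bit < 32)
		f->legacy &= ~(UINT32_C(1) << bit);
	f->bitmap[bit / DEMO_BITS_PER_LONG] &= ~(1UL << (bit % DEMO_BITS_PER_LONG));
}

static inline bool demo_test_feature(const struct demo_features *f, int bit)
{
	assert(bit >= 0 && bit < DEMO_FEATURE_BITS);
	if (bit < 32)
		return (f->legacy & (UINT32_C(1) << bit)) != 0;
	return (f->bitmap[bit / DEMO_BITS_PER_LONG] &
		(1UL << (bit % DEMO_BITS_PER_LONG))) != 0;
}
```

Keeping both representations updated in one helper lets existing readers of the 32-bit field keep working while new code moves to the bitmap.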
^ permalink raw reply related
* Re: [RFC] af-packet: Save reference to bound network device.
From: David Miller @ 2011-05-25 22:42 UTC (permalink / raw)
To: greearb; +Cc: netdev
In-Reply-To: <4DDD8487.6070000@candelatech.com>
From: Ben Greear <greearb@candelatech.com>
Date: Wed, 25 May 2011 15:36:55 -0700
> I can't see where the code holds any reference to prot_hook.dev.
> (It just assigns the pointer and then does a dev_put()).
>
> Maybe it gets away with it because a NETDEV_UNREGISTER event
> is always sent?
I think that is precisely the property it is depending upon.
It may seem sketchy, but as far as I can tell it's completely
legal.
^ permalink raw reply
* [PATCHv3] net: Define enum for the bits used in features.
From: Mahesh Bandewar @ 2011-05-25 22:42 UTC (permalink / raw)
To: David Miller
Cc: netdev, Mahesh Bandewar, Tom Herbert, Michał Mirosław,
Stephen Hemminger
In-Reply-To: <1306288567-1773-1-git-send-email-maheshb@google.com>
A small cleanup: define an enum for all the feature bits in use, and use those
enum values to redefine the flags.
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
---
Changes since v2:
(1) Removed the include which was part of the other patch (split mishap).
(2) Changed the enums to add NETIF_F_ prefix.
Changes since v1:
Split the patch into two pieces.
include/linux/netdevice.h | 99 +++++++++++++++++++++++++++++++--------------
1 files changed, 69 insertions(+), 30 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ca333e7..9bb5872 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -981,6 +981,49 @@ struct net_device_ops {
};
/*
+ * Net device feature bits; if you change something,
+ * also update netdev_features_strings[] in ethtool.c
+ */
+enum netdev_features {
+ NETIF_F_SG_BIT, /* Scatter/gather IO. */
+ NETIF_F_IP_CSUM_BIT, /* Can checksum TCP/UDP over IPv4. */
+ NETIF_F_NO_CSUM_BIT, /* Does not require checksum. F.e. loopback. */
+ NETIF_F_HW_CSUM_BIT, /* Can checksum all the packets. */
+ NETIF_F_IPV6_CSUM_BIT, /* Can checksum TCP/UDP over IPV6 */
+ NETIF_F_HIGHDMA_BIT, /* Can DMA to high memory. */
+ NETIF_F_FRAGLIST_BIT, /* Scatter/gather IO. */
+ NETIF_F_HW_VLAN_TX_BIT, /* Transmit VLAN hw acceleration */
+ NETIF_F_HW_VLAN_RX_BIT, /* Receive VLAN hw acceleration */
+ NETIF_F_HW_VLAN_FILTER_BIT, /* Receive filtering on VLAN */
+ NETIF_F_VLAN_CHALLENGED_BIT, /* Device cannot handle VLAN packets */
+ NETIF_F_GSO_BIT, /* Enable software GSO. */
+ NETIF_F_LLTX_BIT, /* LockLess TX - deprecated. Please */
+ /* do not use LLTX in new drivers */
+ NETIF_F_NETNS_LOCAL_BIT, /* Does not change network namespaces */
+ NETIF_F_GRO_BIT, /* Generic receive offload */
+ NETIF_F_LRO_BIT, /* large receive offload */
+ RESERVED16_BIT, /* the GSO_MASK reserved bit 16 */
+ RESERVED17_BIT, /* the GSO_MASK reserved bit 17 */
+ RESERVED18_BIT, /* the GSO_MASK reserved bit 18 */
+ RESERVED19_BIT, /* the GSO_MASK reserved bit 19 */
+ RESERVED20_BIT, /* the GSO_MASK reserved bit 20 */
+ RESERVED21_BIT, /* the GSO_MASK reserved bit 21 */
+ RESERVED22_BIT, /* the GSO_MASK reserved bit 22 */
+ RESERVED23_BIT, /* the GSO_MASK reserved bit 23 */
+ NETIF_F_FCOE_CRC_BIT, /* FCoE CRC32 */
+ NETIF_F_SCTP_CSUM_BIT, /* SCTP checksum offload */
+ NETIF_F_FCOE_MTU_BIT, /* Supports max FCoE MTU, 2158 bytes*/
+ NETIF_F_NTUPLE_BIT, /* N-tuple filters supported */
+ NETIF_F_RXHASH_BIT, /* Receive hashing offload */
+ NETIF_F_RXCSUM_BIT, /* Receive checksumming offload */
+ NETIF_F_NOCACHE_COPY_BIT, /* Use no-cache copyfromuser */
+ NETIF_F_LOOPBACK_BIT, /* Enable loopback */
+
+ /* Add your bit above this */
+ ND_FEATURE_NUM_BITS /* (LAST VALUE) Total bits in use */
+};
+
+/*
* The DEVICE structure.
* Actually, this whole structure is a big mistake. It mixes I/O
* data with strictly "high-level" data, and it has to know about
@@ -1035,36 +1078,32 @@ struct net_device {
/* mask of features inheritable by VLAN devices */
u32 vlan_features;
- /* Net device feature bits; if you change something,
- * also update netdev_features_strings[] in ethtool.c */
-
-#define NETIF_F_SG 1 /* Scatter/gather IO. */
-#define NETIF_F_IP_CSUM 2 /* Can checksum TCP/UDP over IPv4. */
-#define NETIF_F_NO_CSUM 4 /* Does not require checksum. F.e. loopack. */
-#define NETIF_F_HW_CSUM 8 /* Can checksum all the packets. */
-#define NETIF_F_IPV6_CSUM 16 /* Can checksum TCP/UDP over IPV6 */
-#define NETIF_F_HIGHDMA 32 /* Can DMA to high memory. */
-#define NETIF_F_FRAGLIST 64 /* Scatter/gather IO. */
-#define NETIF_F_HW_VLAN_TX 128 /* Transmit VLAN hw acceleration */
-#define NETIF_F_HW_VLAN_RX 256 /* Receive VLAN hw acceleration */
-#define NETIF_F_HW_VLAN_FILTER 512 /* Receive filtering on VLAN */
-#define NETIF_F_VLAN_CHALLENGED 1024 /* Device cannot handle VLAN packets */
-#define NETIF_F_GSO 2048 /* Enable software GSO. */
-#define NETIF_F_LLTX 4096 /* LockLess TX - deprecated. Please */
- /* do not use LLTX in new drivers */
-#define NETIF_F_NETNS_LOCAL 8192 /* Does not change network namespaces */
-#define NETIF_F_GRO 16384 /* Generic receive offload */
-#define NETIF_F_LRO 32768 /* large receive offload */
-
-/* the GSO_MASK reserves bits 16 through 23 */
-#define NETIF_F_FCOE_CRC (1 << 24) /* FCoE CRC32 */
-#define NETIF_F_SCTP_CSUM (1 << 25) /* SCTP checksum offload */
-#define NETIF_F_FCOE_MTU (1 << 26) /* Supports max FCoE MTU, 2158 bytes*/
-#define NETIF_F_NTUPLE (1 << 27) /* N-tuple filters supported */
-#define NETIF_F_RXHASH (1 << 28) /* Receive hashing offload */
-#define NETIF_F_RXCSUM (1 << 29) /* Receive checksumming offload */
-#define NETIF_F_NOCACHE_COPY (1 << 30) /* Use no-cache copyfromuser */
-#define NETIF_F_LOOPBACK (1 << 31) /* Enable loopback */
+#define BIT2FLAG(bit) (1 << (bit))
+
+#define NETIF_F_SG BIT2FLAG(NETIF_F_SG_BIT)
+#define NETIF_F_IP_CSUM BIT2FLAG(NETIF_F_IP_CSUM_BIT)
+#define NETIF_F_NO_CSUM BIT2FLAG(NETIF_F_NO_CSUM_BIT)
+#define NETIF_F_HW_CSUM BIT2FLAG(NETIF_F_HW_CSUM_BIT)
+#define NETIF_F_IPV6_CSUM BIT2FLAG(NETIF_F_IPV6_CSUM_BIT)
+#define NETIF_F_HIGHDMA BIT2FLAG(NETIF_F_HIGHDMA_BIT)
+#define NETIF_F_FRAGLIST BIT2FLAG(NETIF_F_FRAGLIST_BIT)
+#define NETIF_F_HW_VLAN_TX BIT2FLAG(NETIF_F_HW_VLAN_TX_BIT)
+#define NETIF_F_HW_VLAN_RX BIT2FLAG(NETIF_F_HW_VLAN_RX_BIT)
+#define NETIF_F_HW_VLAN_FILTER BIT2FLAG(NETIF_F_HW_VLAN_FILTER_BIT)
+#define NETIF_F_VLAN_CHALLENGED BIT2FLAG(NETIF_F_VLAN_CHALLENGED_BIT)
+#define NETIF_F_GSO BIT2FLAG(NETIF_F_GSO_BIT)
+#define NETIF_F_LLTX BIT2FLAG(NETIF_F_LLTX_BIT)
+#define NETIF_F_NETNS_LOCAL BIT2FLAG(NETIF_F_NETNS_LOCAL_BIT)
+#define NETIF_F_GRO BIT2FLAG(NETIF_F_GRO_BIT)
+#define NETIF_F_LRO BIT2FLAG(NETIF_F_LRO_BIT)
+#define NETIF_F_FCOE_CRC BIT2FLAG(NETIF_F_FCOE_CRC_BIT)
+#define NETIF_F_SCTP_CSUM BIT2FLAG(NETIF_F_SCTP_CSUM_BIT)
+#define NETIF_F_FCOE_MTU BIT2FLAG(NETIF_F_FCOE_MTU_BIT)
+#define NETIF_F_NTUPLE BIT2FLAG(NETIF_F_NTUPLE_BIT)
+#define NETIF_F_RXHASH BIT2FLAG(NETIF_F_RXHASH_BIT)
+#define NETIF_F_RXCSUM BIT2FLAG(NETIF_F_RXCSUM_BIT)
+#define NETIF_F_NOCACHE_COPY BIT2FLAG(NETIF_F_NOCACHE_COPY_BIT)
+#define NETIF_F_LOOPBACK BIT2FLAG(NETIF_F_LOOPBACK_BIT)
/* Segmentation offload features */
#define NETIF_F_GSO_SHIFT 16
--
1.7.3.1
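
A minimal userspace sketch of the enum-to-flag mapping, useful for checking that positional enum values reproduce the old numeric flags. The `DEMO_` names are hypothetical, and the enum is abbreviated with explicit values to skip entries, whereas the patch lists every bit positionally. The sketch shifts an unsigned constant, since shifting a signed 1 into bit 31 overflows int in standard C.

```c
#include <assert.h>
#include <stdint.h>

enum demo_feature_bit {
	DEMO_SG_BIT,                 /* 0 */
	DEMO_IP_CSUM_BIT,            /* 1 */
	DEMO_NO_CSUM_BIT,            /* 2 */
	DEMO_HW_CSUM_BIT,            /* 3 */
	DEMO_IPV6_CSUM_BIT,          /* 4 */
	DEMO_HIGHDMA_BIT,            /* 5 */
	DEMO_GSO_RESERVED_BIT = 16,  /* first of the bits reserved by GSO_MASK */
	DEMO_FCOE_CRC_BIT = 24,      /* first bit after the reserved range */
	DEMO_LOOPBACK_BIT = 31,
	DEMO_NUM_BITS                /* total bits in use */
};

#define DEMO_BIT2FLAG(bit) (UINT32_C(1) << (bit))

/* Compile-time checks that bit positions match the old numeric flags. */
static_assert(DEMO_BIT2FLAG(DEMO_HIGHDMA_BIT) == 32,
	      "HIGHDMA must keep its old value");
static_assert(DEMO_BIT2FLAG(DEMO_FCOE_CRC_BIT) == (UINT32_C(1) << 24),
	      "FCOE_CRC must follow the reserved GSO bits");

/* Hypothetical function form of the macro, with a range check. */
static uint32_t demo_bit2flag(enum demo_feature_bit bit)
{
	assert(bit < DEMO_NUM_BITS);
	return UINT32_C(1) << bit;
}
```

Compile-time checks of this kind catch an entry inserted in the wrong place, which would otherwise silently renumber every flag after it.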
^ permalink raw reply related
* Re: [RFC] af-packet: Save reference to bound network device.
From: Ben Greear @ 2011-05-25 22:36 UTC (permalink / raw)
To: David Miller; +Cc: netdev
In-Reply-To: <20110525.181418.1100603684033986711.davem@davemloft.net>
On 05/25/2011 03:14 PM, David Miller wrote:
> From: Ben Greear<greearb@candelatech.com>
> Date: Wed, 25 May 2011 15:05:10 -0700
>
>> Doesn't this piece of code take care of that?
>> I tested with rmmod..but of course I could have missed something.
>>
>> @@ -2266,6 +2284,10 @@ static int packet_notifier(struct
>> notifier_block *this, unsigned long msg, void
>> }
>> if (msg == NETDEV_UNREGISTER) {
>> po->ifindex = -1;
>> + if (po->bound_dev) {
>> + dev_put(po->bound_dev);
>> + po->bound_dev = NULL;
>> + }
>> po->prot_hook.dev = NULL;
>> }
>> spin_unlock(&po->bind_lock);
>>
>
> Indeed, it should, thanks for pointing that out.
>
> Wait a second, why do you need to store the device a second
> time, can't you get at po->prot_hook.dev in all the necessary
> spots?
I can't see where the code holds any reference to prot_hook.dev.
(It just assigns the pointer and then does a dev_put()).
Maybe it gets away with it because a NETDEV_UNREGISTER event
is always sent?
Or, maybe we should hold a ref to it?
Thanks,
Ben
--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
^ permalink raw reply
* Re: [RFC] af-packet: Save reference to bound network device.
From: Ben Greear @ 2011-05-25 22:22 UTC (permalink / raw)
To: David Miller; +Cc: netdev
In-Reply-To: <20110525.181418.1100603684033986711.davem@davemloft.net>
On 05/25/2011 03:14 PM, David Miller wrote:
> From: Ben Greear<greearb@candelatech.com>
> Date: Wed, 25 May 2011 15:05:10 -0700
>
>> Doesn't this piece of code take care of that?
>> I tested with rmmod..but of course I could have missed something.
>>
>> @@ -2266,6 +2284,10 @@ static int packet_notifier(struct
>> notifier_block *this, unsigned long msg, void
>> }
>> if (msg == NETDEV_UNREGISTER) {
>> po->ifindex = -1;
>> + if (po->bound_dev) {
>> + dev_put(po->bound_dev);
>> + po->bound_dev = NULL;
>> + }
>> po->prot_hook.dev = NULL;
>> }
>> spin_unlock(&po->bind_lock);
>>
>
> Indeed, it should, thanks for pointing that out.
>
> Wait a second, why do you need to store the device a second
> time, can't you get at po->prot_hook.dev in all the necessary
> spots?
I think so...I'll poke at the code a bit and run some more
tests using that instead...
Thanks,
Ben
--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
^ permalink raw reply
* IFB and iptables
From: Jérôme Poulin @ 2011-05-25 22:21 UTC (permalink / raw)
To: netdev
Hi,
I'm trying to convert my IMQ-based script to use the IFB device instead.
Things appear to work correctly; however, the u32 classifier isn't
aware of any connection tracking, and I was wondering whether it is at all
possible to use iptables matches such as layer7 with the IFB
device.
My need for the IFB device / IMQ arises because I want to classify my
IPv6 traffic, which is in an IPv4 SIT tunnel, and mix the content of the
SIT tunnel with eth0 minus protocol 41.
Thanks.
^ permalink raw reply
* Re: [RFC] af-packet: Save reference to bound network device.
From: David Miller @ 2011-05-25 22:14 UTC (permalink / raw)
To: greearb; +Cc: netdev
In-Reply-To: <4DDD7D16.6030907@candelatech.com>
From: Ben Greear <greearb@candelatech.com>
Date: Wed, 25 May 2011 15:05:10 -0700
> Doesn't this piece of code take care of that?
> I tested with rmmod..but of course I could have missed something.
>
> @@ -2266,6 +2284,10 @@ static int packet_notifier(struct
> notifier_block *this, unsigned long msg, void
> }
> if (msg == NETDEV_UNREGISTER) {
> po->ifindex = -1;
> + if (po->bound_dev) {
> + dev_put(po->bound_dev);
> + po->bound_dev = NULL;
> + }
> po->prot_hook.dev = NULL;
> }
> spin_unlock(&po->bind_lock);
>
Indeed, it should, thanks for pointing that out.
Wait a second, why do you need to store the device a second
time, can't you get at po->prot_hook.dev in all the necessary
spots?
^ permalink raw reply
* Re: [GIT PULL] Namespace file descriptors for 2.6.40
From: Michał Mirosław @ 2011-05-25 22:11 UTC (permalink / raw)
To: C Anthony Risinger
Cc: Serge E. Hallyn, Eric W. Biederman, Linux Containers, netdev,
linux-kernel
In-Reply-To: <BANLkTinbw6pZjhMscfXFMArd=XU=VC=+eQ@mail.gmail.com>
2011/5/25 C Anthony Risinger <anthony@xtfx.me>:
> On Wed, May 25, 2011 at 4:38 PM, Serge E. Hallyn <serge@hallyn.com> wrote:
>> Quoting C Anthony Risinger (anthony@xtfx.me):
[...]
>>> if i understand correctly, mount namespaces (for example), allow one
>>> to build such constructs as "private /tmp" and similar that even
>>> `root` cannot access ... and there are many reasons `root` does not
>>> deserve to completely know/interact with user processes (FUSE makes a
>>> good example ... just because i [user] have SSH access to a machine,
>>> why should `root`?)
>> If for instance you have a file open in your private /tmp, then root
>> in another mounts ns can open the file through /proc/$$/fd/N anyway.
>> If it's a directory, he can now traverse the whole fs.
> aaah right :-( ... there's always another way isn't there ... curse
> you Linux for being so flexible! (just kidding baby i love you)
>
> this seems like a more fundamental issue then? or should i not expect
> to be able to achieve separation like this? i ask in the context of
> OS virt via cgroups + namespaces, eg. LXC et al, because i'm about to
> perform a massive overhaul to our crusty sub-2.6.18 infrastructure and
> i've used/followed these technologies for couple years now ... and
> it's starting to feel like "the right time".
You either trust the admin or don't use the machine. There is no third way.
Best Regards,
Michał Mirosław
^ permalink raw reply
* Re: [RFC] af-packet: Save reference to bound network device.
From: Ben Greear @ 2011-05-25 22:05 UTC (permalink / raw)
To: David Miller; +Cc: netdev
In-Reply-To: <20110525.180113.1194226831134092545.davem@davemloft.net>
On 05/25/2011 03:01 PM, David Miller wrote:
> From: greearb@candelatech.com
> Date: Wed, 25 May 2011 14:56:42 -0700
>
>> From: Ben Greear<greearb@candelatech.com>
>>
>> This saves a network device lookup on each packet transmitted,
>> for sockets that are bound to a network device.
>>
>> Signed-off-by: Ben Greear<greearb@candelatech.com>
>
> You can't hold onto devices like this unless you also add a netdev
> event notifier that will release it. Otherwise we'll hang on net
> driver module unload until the packet socket is closed.
>
> I don't think you really want to walk all pf-packet sockets on netdev
> events just to do this.
Doesn't this piece of code take care of that?
I tested with rmmod..but of course I could have missed something.
@@ -2266,6 +2284,10 @@ static int packet_notifier(struct notifier_block *this, unsigned long msg, void
}
if (msg == NETDEV_UNREGISTER) {
po->ifindex = -1;
+ if (po->bound_dev) {
+ dev_put(po->bound_dev);
+ po->bound_dev = NULL;
+ }
po->prot_hook.dev = NULL;
}
spin_unlock(&po->bind_lock);
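
The hold-and-release pattern in the hunk above can be modeled in userspace roughly as follows. This is a sketch under stated assumptions: `demo_dev`, `demo_sock`, and the `demo_` functions are hypothetical stand-ins, the counter is a plain int rather than the kernel's per-device reference count, and there is no locking.

```c
#include <assert.h>
#include <stddef.h>

struct demo_dev {
	int refcnt;
};

struct demo_sock {
	struct demo_dev *bound_dev; /* cached device, holds one reference */
	int ifindex;
};

static void demo_dev_hold(struct demo_dev *d)
{
	d->refcnt++;
}

static void demo_dev_put(struct demo_dev *d)
{
	assert(d->refcnt > 0);
	d->refcnt--;
}

/* Bind: drop any previously cached device, then take a reference. */
static void demo_bind(struct demo_sock *s, struct demo_dev *d, int ifindex)
{
	if (s->bound_dev)
		demo_dev_put(s->bound_dev);
	demo_dev_hold(d);
	s->bound_dev = d;
	s->ifindex = ifindex;
}

/* Unregister event: release the cached reference so unload is not blocked. */
static void demo_on_unregister(struct demo_sock *s, struct demo_dev *d)
{
	if (s->bound_dev != d)
		return;
	demo_dev_put(s->bound_dev);
	s->bound_dev = NULL;
	s->ifindex = -1;
}
```

The point of the model is that every hold taken at bind time has a matching put on the unregister path, so the device's count returns to its prior value even if the socket is never closed.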
>
> dev_get_by_index{,_rcu}() is insanely cheap, I doubt it's showing up
> on your profiles at all.
I admit it was a small change...maybe 5Mbps (from 165 to 170Mbps in
this particular test), but it did seem to improve things a bit.
Thanks,
Ben
--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
^ permalink raw reply