* Re: Adding Support for SG,GSO,GRO
From: Ben Hutchings @ 2010-12-10 16:20 UTC (permalink / raw)
To: Michał Mirosław
Cc: David Lamparter, David Miller, srk, netdev, Jens Axboe
In-Reply-To: <AANLkTi=cVufUAHn6LRs_vG-cW2cY87SdP9MOE7i5Ru09@mail.gmail.com>
On Fri, 2010-12-10 at 17:01 +0100, Michał Mirosław wrote:
[...]
> The question is do we really want good checksum for bogus data?
[...]
It's not bogus data. It's a snapshot of the file contents at some
arbitrary point in time.
Now please stop wasting your own time and that of the networking
maintainers, and remember to check for checksum offload next time you
need to select a network controller.
Ben.
--
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
* Re: [PATCH] iproute2: ip: add wilcard support for device matching
From: Stephen Hemminger @ 2010-12-10 16:18 UTC (permalink / raw)
To: Octavian Purdila; +Cc: netdev, Lucian Adrian Grijincu, Vlad Dogaru
In-Reply-To: <1291993092-8675-1-git-send-email-opurdila@ixiacom.com>
On Fri, 10 Dec 2010 16:58:12 +0200
Octavian Purdila <opurdila@ixiacom.com> wrote:
> Allow the users to specify a wildcard when selecting a device:
>
> $ ip link set dev dummy* up
>
> We do this by expanding the original command line into multiple lines,
> which we then feed via a pipe to a forked ip process run in batch
> mode.
>
Seems like feature creep. Can't you do this with a bash completion
script instead?
* Re: [PATCH] Fix build system and configure script to use for cross build. Optional IPv6.
From: Stephen Hemminger @ 2010-12-10 16:14 UTC (permalink / raw)
To: Serj Kalichev; +Cc: netdev
In-Reply-To: <1291995271-9912-1-git-send-email-serj.kalichev@gmail.com>
On Fri, 10 Dec 2010 18:34:31 +0300
Serj Kalichev <serj.kalichev@gmail.com> wrote:
> The Makefiles and configure script understand the external variables like
> CC, CFLAGS, LDFLAGS etc. So it can be used for cross build easily. The
> configure script use CC instead hardcoded gcc and search for the xtables
> within specified SYSROOT but not on the host. Two checks were added. The
> check for the IPv6 support and check for the Berkeley DB availability.
> The iproute2 can be build without IPv6 now.
>
> Signed-off-by: Serj Kalichev <serj.kalichev@gmail.com>
IPv6 support should not be optional.
--
* Re: [PATCH] iproute2: add dynamic index and name hashes
From: Daniel Baluta @ 2010-12-10 16:09 UTC (permalink / raw)
To: Octavian Purdila; +Cc: netdev, Lucian Adrian Grijincu, Vlad Dogaru
In-Reply-To: <1291993164-8793-1-git-send-email-opurdila@ixiacom.com>
> + if (!idx_hash) {
> + idx_hash = malloc((1<<hbits) * sizeof(struct idxmap_head));
> + if (!idx_hash)
> + return -1;
> + }
> +
> + if (!name_hash) {
> + name_hash = malloc((1<<hbits) * sizeof(struct idxmap_head));
> + if (!name_hash)
Can you reach this point with idx_hash non-null? Well, then avoid
memory leaks by freeing it.
> + return -1;
> + }
thanks,
Daniel.
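A minimal sketch of the cleanup suggested above: allocate both tables, and
free the first one if the second allocation fails. The helper name
alloc_hashes() and its signature are hypothetical, and calloc() is used here
instead of the patch's malloc() so that every bucket starts as an empty chain.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the bucket head type used in the patch. */
struct idxmap_head {
	void *next;
};

/*
 * Allocate both tables, or neither.
 * Returns 0 on success and -1 on failure; on failure nothing is leaked
 * and the output pointers are left untouched.
 */
int alloc_hashes(struct idxmap_head **idx, struct idxmap_head **name,
		 size_t buckets)
{
	struct idxmap_head *i, *n;

	i = calloc(buckets, sizeof(*i));
	if (!i)
		return -1;

	n = calloc(buckets, sizeof(*n));
	if (!n) {
		free(i);	/* do not leak the first table */
		return -1;
	}

	*idx = i;
	*name = n;
	return 0;
}
```

Allocating into locals and publishing both pointers only at the end keeps the
caller's state consistent: either both tables exist or neither does.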
* Re: [PATCH] kptr_restrict for hiding kernel pointers from unprivileged users
From: Peter Zijlstra @ 2010-12-10 16:05 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Dan Rosenberg, linux-kernel, linux-security-module, netdev
In-Reply-To: <1291865039.2795.46.camel@edumazet-laptop>
On Thu, 2010-12-09 at 04:23 +0100, Eric Dumazet wrote:
> > + if (kptr_restrict) {
> > + if (in_interrupt())
> > + WARN(1, "%%pK used in interrupt context.\n");
>
> So caller can not block BH ?
>
> This seems wrong to me, please consider :
>
> normal process context :
>
> spin_lock_bh() ...
>
> for (...)
> {xxx}printf( ... "%pK" ...)
>
> spin_unlock_bh();
That's a bug in in_interrupt(), one I've been pointing out for a long
while. Luckily we recently grew the infrastructure to deal with it.
If you write it as: if (in_irq() || in_serving_softirq() || in_nmi())
you'll not trigger for the above example.
Ideally in_serving_softirq() wouldn't exist and in_softirq() would do
what in_serving_softirq() does -- which would make it symmetric with the
hardirq functions -- but nobody has found time to audit all in_softirq()
users.
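The distinction above can be illustrated with a simplified user-space model
of preempt_count. The bit layout, the MODEL_ constants and the model_ helpers
are assumptions made for illustration, not the kernel's definitions; the only
property relied on is that "bottom halves disabled" and "serving a softirq"
are accounted in different bits of the softirq field.

```c
#include <stdbool.h>

/* Assumed, simplified layout of the counter (illustration only). */
#define MODEL_SOFTIRQ_SHIFT		8
#define MODEL_HARDIRQ_SHIFT		16
#define MODEL_NMI_SHIFT			26

#define MODEL_SOFTIRQ_OFFSET		(1UL << MODEL_SOFTIRQ_SHIFT)
#define MODEL_SOFTIRQ_DISABLE_OFFSET	(2 * MODEL_SOFTIRQ_OFFSET)
#define MODEL_SOFTIRQ_MASK		(0xffUL << MODEL_SOFTIRQ_SHIFT)
#define MODEL_HARDIRQ_MASK		(0x3ffUL << MODEL_HARDIRQ_SHIFT)
#define MODEL_NMI_MASK			(1UL << MODEL_NMI_SHIFT)

bool model_in_irq(unsigned long pc)
{
	return (pc & MODEL_HARDIRQ_MASK) != 0;
}

bool model_in_nmi(unsigned long pc)
{
	return (pc & MODEL_NMI_MASK) != 0;
}

/* True while serving a softirq, but not when BHs are merely disabled. */
bool model_in_serving_softirq(unsigned long pc)
{
	return (pc & MODEL_SOFTIRQ_MASK & MODEL_SOFTIRQ_OFFSET) != 0;
}

/* The broad test: also true under spin_lock_bh() in process context. */
bool model_in_interrupt(unsigned long pc)
{
	return (pc & (MODEL_HARDIRQ_MASK | MODEL_SOFTIRQ_MASK |
		      MODEL_NMI_MASK)) != 0;
}

/* The narrower test proposed in the message above. */
bool model_strict_check(unsigned long pc)
{
	return model_in_irq(pc) || model_in_serving_softirq(pc) ||
	       model_in_nmi(pc);
}
```

In this model a process that has only taken spin_lock_bh() adds
MODEL_SOFTIRQ_DISABLE_OFFSET to the counter, so model_in_interrupt() is true
while model_strict_check() stays false.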
* Re: Adding Support for SG,GSO,GRO
From: Michał Mirosław @ 2010-12-10 16:01 UTC (permalink / raw)
To: David Lamparter; +Cc: David Miller, bhutchings, srk, netdev, Jens Axboe
In-Reply-To: <20101210143140.GD3536057@jupiter.n2.diac24.net>
On 10 December 2010 at 15:31, David Lamparter
<equinox@diac24.net> wrote:
> On Fri, Dec 10, 2010 at 03:18:11PM +0100, Michał Mirosław wrote:
>> I'm trying to understand the dependency because it looks artificial for me.
>
> You have the data you want to send in the RAM, somewhere, possibly
> scattered. The application calls sendfile(). The kernel puts the
> transmission in the network card's queue, which might already have lots
> of entries.
>
> A millisecond later - an eternity for the CPU - the card decides to do
> the transmission.
>
> However, the data might have changed in the meantime.
>
> sendfile() is defined so that it works asynchronously, that means if you
> change the data while it is in the queue, you get unpredictable results.
>
> But, what you should NOT get is packets with an invalid checksum.
> Whatever data you are sending, it needs to have a correct checksum.
>
> Now, if the card does the checksum itself, everything is fine. But what
> are you supposed to do if the card can't checksum? Call back the kernel
> at the point where the card does the TX? That's pointless (and racy).
> Pre-calculate the Checksum at submission time? Doesn't work, you would
> have to make a copy of the data, so it doesn't change anymore, so the
> checksum stays correct. But not copying the data is the whole point of
> sendfile().
The question is do we really want good checksum for bogus data? I
think that what matters is that good data (it had not changed between
queuing and sending, and so the checksum does not depend on whether
hardware or software calculated it) need to be accompanied by good
checksum. For broken data it would be even better to not send
anything, but that's not always possible. Bad checksum in this case is
actually a good thing as it clearly shows that something is broken in
the sender and avoids accepting the data as valid at the receiving
end.
sendfile() is supposed to replace read()/write() loops to avoid
copying data buffers. Whatever the optimizations, it has no additional
caveat that it might corrupt the transfer. Fixing the checksum in case
data gets corrupted is not the right thing, I think.
The change of file data might happen if sendfile() submits page from
pagecache when something writes to the file, or when an application
modifies vmsplice()-submitted memory. In current kernel sendfile() is
a wrapper around splice(), and so has a kernel pipe buffer between
file and the socket. Can this hide the possible data changes?
Best Regards,
Michał Mirosław
* Re: [PATCH] [Bug 24472] Kernel panic - not syncing: Fatal Exception
From: Jarek Poplawski @ 2010-12-10 15:55 UTC (permalink / raw)
To: Andrej Ota
Cc: Paweł Staszewski, Andrew Morton, netdev, Paul Mackerras,
bugzilla-daemon, bugme-daemon, pstaszewski, Eric Dumazet,
David Miller
In-Reply-To: <4D023DE4.8000400@ota.si>
On Fri, Dec 10, 2010 at 03:49:08PM +0100, Andrej Ota wrote:
> Move kfree_skb which was causing memory corruption to new location, while still keeping appropriate return value for function __pppoe_xmit. Prevents memory corruption and consequent kernel panic when PPPoE peer terminates the link.
Andrej, a slight misunderstanding - probably I should be more explicit.
I sent this link, which explains why return shouldn't be zero:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=db7bf6d97c6956b7eb0f22131cb5c37bd41f33c0
So the simplest fix is to revert this one change only.
If you disagree with this let me know.
You should also fix the subject to something more meaningful, e.g.:
[PATCH] pppoe: Fix kernel panic caused by __pppoe_xmit
Please break lines in the changelog at around 70 characters and add that
it fixes commit 55c95e738da85373965cb03b4f975d0fd559865b.
Thanks,
Jarek P.
>
> Signed-off-by: Andrej Ota [andrej@ota.si]
> Reported-by: Pawel Staszewski [pstaszewski@artcom.pl]
> ---
> drivers/net/pppoe.c | 5 +++--
> 1 files changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/pppoe.c b/drivers/net/pppoe.c
> index d72fb05..1a21dce 100644
> --- a/drivers/net/pppoe.c
> +++ b/drivers/net/pppoe.c
> @@ -924,8 +924,10 @@ static int __pppoe_xmit(struct sock *sk, struct sk_buff *skb)
> /* Copy the data if there is no space for the header or if it's
> * read-only.
> */
> - if (skb_cow_head(skb, sizeof(*ph) + dev->hard_header_len))
> + if (skb_cow_head(skb, sizeof(*ph) + dev->hard_header_len)) {
> + kfree_skb(skb);
> goto abort;
> + }
>
> __skb_push(skb, sizeof(*ph));
> skb_reset_network_header(skb);
> @@ -947,7 +949,6 @@ static int __pppoe_xmit(struct sock *sk, struct sk_buff *skb)
> return 1;
>
> abort:
> - kfree_skb(skb);
> return 0;
> }
>
> ---
>
> Andrej Ota.
* Re: [net-next-2.6 03/27] Documentation/networking/igb.txt: update documentation
From: Ben Hutchings @ 2010-12-10 15:50 UTC (permalink / raw)
To: Jeff Kirsher; +Cc: davem, davem, netdev, gospo, bphilips
In-Reply-To: <1291974667-30254-4-git-send-email-jeffrey.t.kirsher@intel.com>
On Fri, 2010-12-10 at 01:50 -0800, Jeff Kirsher wrote:
> Update Intel Wired LAN igb documentation.
>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> ---
> Documentation/networking/igb.txt | 22 +++++++++++++++++++---
> 1 files changed, 19 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/networking/igb.txt b/Documentation/networking/igb.txt
> index ab2d718..393bdb7 100644
> --- a/Documentation/networking/igb.txt
> +++ b/Documentation/networking/igb.txt
> @@ -36,6 +36,7 @@ Default Value: 0
> This parameter adds support for SR-IOV. It causes the driver to spawn up to
> max_vfs worth of virtual function.
>
> +
> Additional Configurations
> =========================
>
> @@ -60,7 +61,8 @@ Additional Configurations
> Ethtool
> -------
> The driver utilizes the ethtool interface for driver configuration and
> - diagnostics, as well as displaying statistical information.
> + diagnostics, as well as displaying statistical information. The latest
> + version of Ethtool can be found at:
>
> http://sourceforge.net/projects/gkernel.
Please update this to:
http://ftp.kernel.org/pub/software/network/ethtool/
> @@ -103,8 +105,8 @@ Additional Configurations
>
> NOTE: You need to have inet_lro enabled via either the CONFIG_INET_LRO or
> CONFIG_INET_LRO_MODULE kernel config option. Additionally, if
> - CONFIG_INET_LRO_MODULE is used, the inet_lro module needs to be loaded
> - before the igb driver.
> + CONFIG_INET_LRO_MODULE is used, the inet_lro module needs to be loaded before
> + the igb driver.
This should be removed as you don't use inet_lro any more.
> You can verify that the driver is using LRO by looking at these counters in
> Ethtool:
> @@ -116,6 +118,20 @@ Additional Configurations
>
> NOTE: IPv6 and UDP are not supported by LRO.
>
> + MAC and VLAN anti-spoofing feature
> + ----------------------------------
> + When a malicious driver attempts to send a spoofed packet, it is dropped by
> + the hardware and not transmitted. An interrupt is sent to the PF driver
> + notifying it of the spoof attempt.
> +
> + When a spoofed packet is detected the PF driver will send the following
> + message to the system log (displayed by the "dmesg" command):
> +
> + Spoof event(s) detected on VF(n)
> +
> + Where n=the VF that attempted to do the spoofing.
I can't see that message in the PF driver code; does this actually apply
to the in-tree driver? Also I hope this is rate-limited.
Ben.
--
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
* Workqueues vs. kernel threads for processing asynchronous socket events
From: Martin Lucina @ 2010-12-10 15:27 UTC (permalink / raw)
To: netdev; +Cc: Martin Sustrik
Hi,
I'm trying to find the best mechanism to process events from kernel space
sockets in an asynchronous manner. The work in progress code I have at the
moment tries to at least call kernel_accept() on a bound TCP socket when it
gets called by the underlying sk->sk_data_ready callback.
The current approach I have is to use a workqueue and try to schedule work
inside the callback, but this has the kernel complaining about "scheduling
while atomic", so it doesn't look like it's the right approach? Am I
allowed to call schedule_work() from the context of a sk->sk_data_ready
callback or not?
Sunrpc/knfsd appears to use a different approach where
svc_tcp_listen_data_ready() sets the appropriate state and then calls
wake_up_interruptible_all(sk_sleep(sk)) -- it's not clear who this is
waking up, the nfsd kernel thread or someone else?
Any advice on what is the best future-proof approach to use for this kind
of thing in a new project?
-mato
* [PATCH] iproute2: initialize the ll_map only once
From: Octavian Purdila @ 2010-12-10 14:59 UTC (permalink / raw)
To: netdev; +Cc: Lucian Adrian Grijincu, Vlad Dogaru, Octavian Purdila
Avoid initializing the LL map (which involves a costly RTNL dump)
multiple times. This can happen when running in batch mode.
Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
---
lib/ll_map.c | 8 ++++++++
1 files changed, 8 insertions(+), 0 deletions(-)
diff --git a/lib/ll_map.c b/lib/ll_map.c
index 9831322..9c6144a 100644
--- a/lib/ll_map.c
+++ b/lib/ll_map.c
@@ -266,6 +266,11 @@ unsigned ll_name_to_index(const char *name)
int ll_init_map(struct rtnl_handle *rth)
{
+ static int initialized;
+
+ if (initialized)
+ return 0;
+
if (rtnl_wilddump_request(rth, AF_UNSPEC, RTM_GETLINK) < 0) {
perror("Cannot send dump request");
exit(1);
@@ -275,5 +280,8 @@ int ll_init_map(struct rtnl_handle *rth)
fprintf(stderr, "Dump terminated\n");
exit(1);
}
+
+ initialized = 1;
+
return 0;
}
--
1.7.1
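The guard added by this patch can be sketched in isolation as follows. The
names init_map_once(), costly_dump() and dump_call_count() are hypothetical
stand-ins (the real function performs an RTNL dump); the counter exists only
to make the effect observable.

```c
/* Hypothetical counter: how many times the expensive step actually ran. */
static int dump_calls;

/* Hypothetical stand-in for the costly RTNL link dump. */
static int costly_dump(void)
{
	dump_calls++;
	return 0;
}

/* Run the expensive initialization at most once per process. */
int init_map_once(void)
{
	static int initialized;

	if (initialized)
		return 0;

	if (costly_dump() < 0)
		return -1;	/* flag stays clear, so a later call retries */

	initialized = 1;
	return 0;
}

int dump_call_count(void)
{
	return dump_calls;
}
```

Setting the flag only after the dump succeeds means a failed first attempt
does not silently disable later ones; the patch itself exits on failure, so
the difference only matters for callers that keep running.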
* [PATCH] iproute2: add dynamic index and name hashes
From: Octavian Purdila @ 2010-12-10 14:59 UTC (permalink / raw)
To: netdev; +Cc: Lucian Adrian Grijincu, Vlad Dogaru, Octavian Purdila
The hash sizes start at 16 entries and grow up to 2^20 entries. The
hashes double when the number of entries in the LL map is greater
than the hash size.
Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
---
lib/ll_map.c | 125 +++++++++++++++++++++++++++++++++++++++++++++++----------
1 files changed, 103 insertions(+), 22 deletions(-)
diff --git a/lib/ll_map.c b/lib/ll_map.c
index b8b49aa..9831322 100644
--- a/lib/ll_map.c
+++ b/lib/ll_map.c
@@ -26,7 +26,8 @@ extern unsigned int if_nametoindex (const char *);
struct idxmap
{
- struct idxmap * next;
+ struct idxmap *idx_next;
+ struct idxmap *name_next;
unsigned index;
int type;
int alen;
@@ -35,31 +36,100 @@ struct idxmap
char name[16];
};
-static struct idxmap *idxmap[16];
+struct idxmap_head {
+ struct idxmap *next;
+};
+
+static int hbits = 4, entries;
+static struct idxmap_head *idx_hash, *name_hash;
+
+static unsigned int name_hashfn(const char *name)
+{
+ const unsigned char *c = (const unsigned char *)name;
+ unsigned int hash = 0;
+
+ while (*c)
+ hash = 31*hash + *c++;
+
+ return hash;
+}
+
+static inline unsigned int htrunc(unsigned int value, int bits)
+{
+ return value % (1<<bits);
+}
+
+void grow_hashes(void)
+{
+ struct idxmap_head *new_idx_hash, *new_name_hash;
+ struct idxmap *next, *im;
+ int hidx, hname, i, new_size = 1<<(hbits+1);
+
+ if (hbits == 20)
+ return;
+
+ new_idx_hash = calloc(new_size, sizeof(struct idxmap_head));
+ new_name_hash = calloc(new_size, sizeof(struct idxmap_head));
+
+ if (!new_idx_hash || !new_name_hash)
+ return;
+
+ for (i = 0; i < (1<<hbits); i++)
+ for (im = idx_hash[i].next;
+ im != NULL && (next = im->idx_next, 1); im = next) {
+ hidx = htrunc(im->index, hbits + 1);
+ im->idx_next = new_idx_hash[hidx].next;
+ new_idx_hash[hidx].next = im;
+
+ hname = htrunc(name_hashfn(im->name), hbits + 1);
+ im->name_next = new_name_hash[hname].next;
+ new_name_hash[hname].next = im;
+ }
+
+ free(idx_hash);
+ idx_hash = new_idx_hash;
+
+ free(name_hash);
+ name_hash = new_name_hash;
+
+ hbits = hbits + 1;
+}
int ll_remember_index(const struct sockaddr_nl *who,
struct nlmsghdr *n, void *arg)
{
- int h;
+ int hidx, hname;
struct ifinfomsg *ifi = NLMSG_DATA(n);
- struct idxmap *im, **imp;
+ struct idxmap *im;
struct rtattr *tb[IFLA_MAX+1];
+ if (!idx_hash) {
+ idx_hash = calloc(1<<hbits, sizeof(struct idxmap_head));
+ if (!idx_hash)
+ return -1;
+ }
+
+ if (!name_hash) {
+ name_hash = calloc(1<<hbits, sizeof(struct idxmap_head));
+ if (!name_hash)
+ return -1;
+ }
+
if (n->nlmsg_type != RTM_NEWLINK)
return 0;
if (n->nlmsg_len < NLMSG_LENGTH(sizeof(ifi)))
return -1;
-
memset(tb, 0, sizeof(tb));
parse_rtattr(tb, IFLA_MAX, IFLA_RTA(ifi), IFLA_PAYLOAD(n));
if (tb[IFLA_IFNAME] == NULL)
return 0;
- h = ifi->ifi_index&0xF;
+ hidx = htrunc(ifi->ifi_index, hbits);
+ hname = htrunc(name_hashfn(RTA_DATA(tb[IFLA_IFNAME])), hbits);
- for (imp=&idxmap[h]; (im=*imp)!=NULL; imp = &im->next)
+ for (im = idx_hash[hidx].next; im != NULL; im = im->idx_next)
if (im->index == ifi->ifi_index)
break;
@@ -67,9 +137,15 @@ int ll_remember_index(const struct sockaddr_nl *who,
im = malloc(sizeof(*im));
if (im == NULL)
return 0;
- im->next = *imp;
+
+ entries++;
im->index = ifi->ifi_index;
- *imp = im;
+
+ im->idx_next = idx_hash[hidx].next;
+ idx_hash[hidx].next = im;
+
+ im->name_next = name_hash[hname].next;
+ name_hash[hname].next = im;
}
im->type = ifi->ifi_type;
@@ -85,6 +161,10 @@ int ll_remember_index(const struct sockaddr_nl *who,
memset(im->addr, 0, sizeof(im->addr));
}
strcpy(im->name, RTA_DATA(tb[IFLA_IFNAME]));
+
+ if (entries > (1<<hbits))
+ grow_hashes();
+
return 0;
}
@@ -94,7 +174,7 @@ const char *ll_idx_n2a(unsigned idx, char *buf)
if (idx == 0)
return "*";
- for (im = idxmap[idx&0xF]; im; im = im->next)
+ for (im = idx_hash[htrunc(idx, hbits)].next; im; im = im->idx_next)
if (im->index == idx)
return im->name;
snprintf(buf, 16, "if%d", idx);
@@ -115,7 +195,7 @@ int ll_index_to_type(unsigned idx)
if (idx == 0)
return -1;
- for (im = idxmap[idx&0xF]; im; im = im->next)
+ for (im = idx_hash[htrunc(idx, hbits)].next; im; im = im->idx_next)
if (im->index == idx)
return im->type;
return -1;
@@ -128,7 +208,7 @@ unsigned ll_index_to_flags(unsigned idx)
if (idx == 0)
return 0;
- for (im = idxmap[idx&0xF]; im; im = im->next)
+ for (im = idx_hash[htrunc(idx, hbits)].next; im; im = im->idx_next)
if (im->index == idx)
return im->flags;
return 0;
@@ -142,7 +222,7 @@ unsigned ll_index_to_addr(unsigned idx, unsigned char *addr,
if (idx == 0)
return 0;
- for (im = idxmap[idx&0xF]; im; im = im->next) {
+ for (im = idx_hash[htrunc(idx, hbits)].next; im; im = im->idx_next) {
if (im->index == idx) {
if (alen > sizeof(im->addr))
alen = sizeof(im->addr);
@@ -155,25 +235,26 @@ unsigned ll_index_to_addr(unsigned idx, unsigned char *addr,
return 0;
}
+
unsigned ll_name_to_index(const char *name)
{
static char ncache[16];
static int icache;
struct idxmap *im;
- int i;
+ int hname;
unsigned idx;
if (name == NULL)
return 0;
if (icache && strcmp(name, ncache) == 0)
return icache;
- for (i=0; i<16; i++) {
- for (im = idxmap[i]; im; im = im->next) {
- if (strcmp(im->name, name) == 0) {
- icache = im->index;
- strcpy(ncache, name);
- return im->index;
- }
+
+ hname = htrunc(name_hashfn(name), hbits);
+ for (im = name_hash[hname].next; im; im = im->name_next) {
+ if (strcmp(im->name, name) == 0) {
+ icache = im->index;
+ strcpy(ncache, name);
+ return im->index;
}
}
@@ -190,7 +271,7 @@ int ll_init_map(struct rtnl_handle *rth)
exit(1);
}
- if (rtnl_dump_filter(rth, ll_remember_index, &idxmap, NULL, NULL) < 0) {
+ if (rtnl_dump_filter(rth, ll_remember_index, NULL, NULL, NULL) < 0) {
fprintf(stderr, "Dump terminated\n");
exit(1);
}
--
1.7.1
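A minimal, self-contained sketch of the doubling scheme described in this
patch, keyed on the index only (the second, name-keyed chain is left out to
keep it short). All names here (struct table, table_init(), table_insert(),
table_find()) are hypothetical. It assumes zero-initialized bucket arrays via
calloc() and walks all 1 << bits old buckets when rehashing.

```c
#include <stdlib.h>

struct entry {
	struct entry *next;
	unsigned index;
};

struct table {
	struct entry **buckets;
	int bits;		/* the table holds 1 << bits buckets */
	int max_bits;		/* growth stops at this size */
	unsigned entries;
};

static unsigned bucket_of(unsigned value, int bits)
{
	return value & ((1u << bits) - 1);
}

int table_init(struct table *t, int bits, int max_bits)
{
	t->buckets = calloc((size_t)1 << bits, sizeof(*t->buckets));
	if (!t->buckets)
		return -1;
	t->bits = bits;
	t->max_bits = max_bits;
	t->entries = 0;
	return 0;
}

/* Double the bucket array and relink every entry; keep the old one on failure. */
static void table_grow(struct table *t)
{
	struct entry **nb, *e, *next;
	unsigned i, h;

	if (t->bits >= t->max_bits)
		return;
	nb = calloc((size_t)1 << (t->bits + 1), sizeof(*nb));
	if (!nb)
		return;

	for (i = 0; i < (1u << t->bits); i++) {	/* visit every old bucket */
		for (e = t->buckets[i]; e; e = next) {
			next = e->next;
			h = bucket_of(e->index, t->bits + 1);
			e->next = nb[h];
			nb[h] = e;
		}
	}
	free(t->buckets);
	t->buckets = nb;
	t->bits++;
}

struct entry *table_find(const struct table *t, unsigned index)
{
	struct entry *e;

	for (e = t->buckets[bucket_of(index, t->bits)]; e; e = e->next)
		if (e->index == index)
			return e;
	return NULL;
}

/* Insert index if absent; grow once entries exceed the bucket count. */
int table_insert(struct table *t, unsigned index)
{
	struct entry *e = table_find(t, index);
	unsigned h;

	if (e)
		return 0;
	e = malloc(sizeof(*e));
	if (!e)
		return -1;
	h = bucket_of(index, t->bits);
	e->index = index;
	e->next = t->buckets[h];
	t->buckets[h] = e;
	if (++t->entries > (1u << t->bits))
		table_grow(t);
	return 0;
}
```

Starting from 16 buckets, inserting 100 distinct indexes grows the table
three times (at 17, 33 and 65 entries), ending with 128 buckets.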
* [PATCH] iproute2: ip: add wilcard support for device matching
From: Octavian Purdila @ 2010-12-10 14:58 UTC (permalink / raw)
To: netdev; +Cc: Lucian Adrian Grijincu, Vlad Dogaru, Octavian Purdila
Allow the users to specify a wildcard when selecting a device:
$ ip link set dev dummy* up
We do this by expanding the original command line into multiple lines,
which we then feed via a pipe to a forked ip process run in batch
mode.
Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
---
ip/ip.c | 70 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 70 insertions(+), 0 deletions(-)
diff --git a/ip/ip.c b/ip/ip.c
index b127d57..2e26488 100644
--- a/ip/ip.c
+++ b/ip/ip.c
@@ -18,6 +18,8 @@
#include <netinet/in.h>
#include <string.h>
#include <errno.h>
+#include <sys/types.h>
+#include <sys/wait.h>
#include "SNAPSHOT.h"
#include "utils.h"
@@ -139,10 +141,72 @@ static int batch(const char *name)
return ret;
}
+int main(int argc, char **argv);
+
+int expand_dev_pattern(int argc, char **argv, int pos)
+{
+ FILE *proc;
+ size_t n = 0;
+ char scanf_pattern[64], *line = NULL, *dev_base = argv[pos];
+ int p[2], i, dev_no;
+ pid_t pid;
+
+ *strchr(dev_base, '*') = 0;
+ snprintf(scanf_pattern, sizeof(scanf_pattern), " %s%%d:", dev_base);
+
+ if (pipe(p) < 0) {
+ fprintf(stderr, "pipe() failed: %s\n", strerror(errno));
+ return -1;
+ }
+
+ pid = fork();
+ switch (pid) {
+ case -1:
+ fprintf(stderr, "fork failed: %s\n", strerror(errno));
+ return -1;
+ case 0:
+ {
+ char *nargv[] = { argv[0], "-b", "-" };
+ int ret;
+
+ dup2(p[0], 0); close(p[0]); close(p[1]);
+ ret = main(3, nargv);
+ exit(ret);
+ }
+ default:
+ dup2(p[1], 1); close(p[0]); close(p[1]);
+ }
+
+ proc = fopen("/proc/net/dev", "r");
+ if (!proc) {
+ fprintf(stderr, "can't open /proc/net/dev\n");
+ return -1;
+ }
+
+ while (getline(&line, &n, proc) > 0) {
+ if (sscanf(line, scanf_pattern, &dev_no) == 1) {
+ for (i = 1; i < argc; i++)
+ if (i != pos)
+ printf("%s ", argv[i]);
+ else
+ printf("%s%d ", dev_base, dev_no);
+ printf("\n");
+ }
+ }
+ free(line);
+
+ fflush(stdout); close(1);
+
+ waitpid(pid, NULL, 0);
+
+ return 0;
+}
int main(int argc, char **argv)
{
char *basename;
+ int i = 0;
+
basename = strrchr(argv[0], '/');
if (basename == NULL)
@@ -150,6 +214,12 @@ int main(int argc, char **argv)
else
basename++;
+ for (i = 1; i < argc - 1; i++) {
+ if (matches(argv[i], "dev") == 0 && strchr(argv[i+1], '*')) {
+ return expand_dev_pattern(argc, argv, i+1);
+ }
+ }
+
while (argc > 1) {
char *opt = argv[1];
if (strcmp(opt,"--") == 0) {
--
1.7.1
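The expansion step can also be sketched with fnmatch(3) instead of the
sscanf() pattern used in the patch. This is a swapped-in technique, not what
the patch does: fnmatch() accepts any suffix after the prefix, whereas the
patch only matches a numeric suffix. The helpers dev_matches() and
build_batch_line() are hypothetical names.

```c
#include <fnmatch.h>
#include <stdio.h>
#include <string.h>

/* Return 1 if the interface name matches the shell-style pattern. */
int dev_matches(const char *pattern, const char *name)
{
	return fnmatch(pattern, name, 0) == 0;
}

/*
 * Build one batch-mode line in buf from argv[1..argc-1], replacing
 * argv[pos] with the concrete device name.
 * Returns 0 on success, -1 if the line does not fit in buf.
 */
int build_batch_line(char *buf, size_t len, int argc, char **argv,
		     int pos, const char *dev)
{
	size_t used = 0;
	int i, n;

	if (len == 0)
		return -1;
	buf[0] = '\0';

	for (i = 1; i < argc; i++) {
		n = snprintf(buf + used, len - used, "%s%s",
			     i > 1 ? " " : "",
			     i == pos ? dev : argv[i]);
		if (n < 0 || (size_t)n >= len - used)
			return -1;	/* truncated or encoding error */
		used += (size_t)n;
	}
	return 0;
}
```

For the command line "ip link set dev dummy* up" and a device named dummy0,
build_batch_line() produces "link set dev dummy0 up", which is the form a
batch-mode reader would expect.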
* [PATCH] [Bug 24472] Kernel panic - not syncing: Fatal Exception
From: Andrej Ota @ 2010-12-10 14:49 UTC (permalink / raw)
To: Jarek Poplawski
Cc: Paweł Staszewski, Andrew Morton, netdev, Paul Mackerras,
bugzilla-daemon, bugme-daemon, pstaszewski, Eric Dumazet,
David Miller
In-Reply-To: <20101210091505.GA7868@ff.dom.local>
Move kfree_skb which was causing memory corruption to new location, while still keeping appropriate return value for function __pppoe_xmit. Prevents memory corruption and consequent kernel panic when PPPoE peer terminates the link.
Signed-off-by: Andrej Ota [andrej@ota.si]
Reported-by: Pawel Staszewski [pstaszewski@artcom.pl]
---
drivers/net/pppoe.c | 5 +++--
1 files changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/net/pppoe.c b/drivers/net/pppoe.c
index d72fb05..1a21dce 100644
--- a/drivers/net/pppoe.c
+++ b/drivers/net/pppoe.c
@@ -924,8 +924,10 @@ static int __pppoe_xmit(struct sock *sk, struct sk_buff *skb)
/* Copy the data if there is no space for the header or if it's
* read-only.
*/
- if (skb_cow_head(skb, sizeof(*ph) + dev->hard_header_len))
+ if (skb_cow_head(skb, sizeof(*ph) + dev->hard_header_len)) {
+ kfree_skb(skb);
goto abort;
+ }
__skb_push(skb, sizeof(*ph));
skb_reset_network_header(skb);
@@ -947,7 +949,6 @@ static int __pppoe_xmit(struct sock *sk, struct sk_buff *skb)
return 1;
abort:
- kfree_skb(skb);
return 0;
}
---
Andrej Ota.
* Re: Adding Support for SG,GSO,GRO
From: David Lamparter @ 2010-12-10 14:31 UTC (permalink / raw)
To: Michał Mirosław; +Cc: David Miller, bhutchings, srk, netdev
In-Reply-To: <AANLkTi=7NLkHW6c88gUcyW8i0Wwmf2Cw4NdmRiGci4kE@mail.gmail.com>
On Fri, Dec 10, 2010 at 03:18:11PM +0100, Michał Mirosław wrote:
> I'm trying to understand the dependency because it looks artificial for me.
You have the data you want to send in the RAM, somewhere, possibly
scattered. The application calls sendfile(). The kernel puts the
transmission in the network card's queue, which might already have lots
of entries.
A millisecond later - an eternity for the CPU - the card decides to do
the transmission.
However, the data might have changed in the meantime.
sendfile() is defined so that it works asynchronously, that means if you
change the data while it is in the queue, you get unpredictable results.
But, what you should NOT get is packets with an invalid checksum.
Whatever data you are sending, it needs to have a correct checksum.
Now, if the card does the checksum itself, everything is fine. But what
are you supposed to do if the card can't checksum? Call back the kernel
at the point where the card does the TX? That's pointless (and racy).
Pre-calculate the Checksum at submission time? Doesn't work, you would
have to make a copy of the data, so it doesn't change anymore, so the
checksum stays correct. But not copying the data is the whole point of
sendfile().
You see why SG without HW checksum is useless here?
-David
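The race described above can be sketched in user space: a checksum computed
at submission time stops matching as soon as the underlying bytes change. The
function below is a plain 16-bit one's-complement sum in the style of
RFC 1071, written for packet-sized buffers; inet_csum() and
csum_still_valid() are hypothetical names, and nothing here is kernel code.

```c
#include <stddef.h>
#include <stdint.h>

/* 16-bit one's-complement Internet checksum over a small buffer. */
uint16_t inet_csum(const uint8_t *data, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)data[i] << 8) | data[i + 1];
	if (len & 1)
		sum += (uint32_t)data[len - 1] << 8;	/* pad odd byte with zero */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);	/* fold carries back in */
	return (uint16_t)~sum;
}

/* Return 1 if the checksum stored at submission time still matches the data. */
int csum_still_valid(const uint8_t *data, size_t len, uint16_t stored)
{
	return inet_csum(data, len) == stored;
}
```

If the page is modified between the call that stores the checksum and the
moment the bytes are read for transmission, csum_still_valid() returns 0.
That is the situation a software checksum cannot avoid without copying the
data first.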
* Re: Adding Support for SG,GSO,GRO
From: Michał Mirosław @ 2010-12-10 14:18 UTC (permalink / raw)
To: David Miller; +Cc: bhutchings, srk, netdev
In-Reply-To: <20101209.113806.71114756.davem@davemloft.net>
2010/12/9 David Miller <davem@davemloft.net>:
> From: Michał Mirosław <mirqus@gmail.com>
> Date: Thu, 9 Dec 2010 19:47:57 +0100
>> Isn't that condition too broad? If the data could change after packet
>> is submitted to the driver then results would be unpredictable and
>> allow sending wrong data with correct (because hw-calculated)
>> checksum.
> They are intentionally like that, without question.
>
> Otherwise we'd need to interlock with all application mapped,
> filesystem, and other page writes while sending any page over the
> network.
>
> We absolutely do not want to have to freeze every page we try to send
> via sendfile() or similar, the cost is just too high.
>
> If the application or networked filesystem needs such synchronization,
> it provides it for itself.
>
> For example, SAMBA only uses sendfile() when the file has an op-lock
> held on it.
>
> The checksum requirement for using SG is not going away, so continuing
> to discuss along the lines of removing that requirement is not a good
> use of your time I don't think.
I'm trying to understand the dependency because it looks artificial to me.
Unless I totally misunderstood, you say that we accept bogus data
being sent using sendfile(). If so, then we might as well allow a
broken checksum (calculated by the CPU before the data changed and was
sent to the network).
If the splice/sendfile is taken out of the picture, are there any
other scenarios, when data pages could be changed after
ndo_start_xmit() entry and before TX DMA completion (between dma_map
.. dma_unmap)? And is it really what can happen with splice/sendfile?
Best Regards,
Michał Mirosław
* [PATCH v2] bridge: Fix return values of br_multicast_add_group/br_multicast_new_group
From: Tobias Klauser @ 2010-12-10 13:18 UTC (permalink / raw)
To: Stephen Hemminger, David S. Miller, bridge; +Cc: netdev
In-Reply-To: <20101209082924.6d797871@nehalam>
If br_multicast_new_group returns NULL, we would return 0 (no error) to
the caller of br_multicast_add_group, which is not what we want. Instead
br_multicast_new_group should return ERR_PTR(-ENOMEM) in this case.
Also propagate the error number returned by br_mdb_rehash properly.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
---
net/bridge/br_multicast.c | 10 ++++++----
1 files changed, 6 insertions(+), 4 deletions(-)
diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c
index 326e599..85a0398 100644
--- a/net/bridge/br_multicast.c
+++ b/net/bridge/br_multicast.c
@@ -654,11 +654,13 @@ static struct net_bridge_mdb_entry *br_multicast_new_group(
struct net_bridge_mdb_htable *mdb;
struct net_bridge_mdb_entry *mp;
int hash;
+ int err;
mdb = rcu_dereference_protected(br->mdb, 1);
if (!mdb) {
- if (br_mdb_rehash(&br->mdb, BR_HASH_SIZE, 0))
- return NULL;
+ err = br_mdb_rehash(&br->mdb, BR_HASH_SIZE, 0);
+ if (err)
+ return ERR_PTR(err);
goto rehash;
}
@@ -680,7 +682,7 @@ rehash:
mp = kzalloc(sizeof(*mp), GFP_ATOMIC);
if (unlikely(!mp))
- goto out;
+ return ERR_PTR(-ENOMEM);
mp->br = br;
mp->addr = *group;
@@ -713,7 +715,7 @@ static int br_multicast_add_group(struct net_bridge *br,
mp = br_multicast_new_group(br, port, group);
err = PTR_ERR(mp);
- if (unlikely(IS_ERR(mp) || !mp))
+ if (IS_ERR(mp))
goto err;
if (!port) {
--
1.7.0.4
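The reason a NULL return turned into "no error" can be seen in a small
user-space imitation of the error-pointer idiom. The model_ helpers below are
assumptions written for illustration and only mimic the behaviour of
ERR_PTR(), PTR_ERR() and IS_ERR(); new_group() and add_group() are
hypothetical.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define MODEL_MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
static inline void *model_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

/* Decode the value stored in a pointer; NULL decodes to 0. */
static inline long model_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

/* True only for pointers in the top MODEL_MAX_ERRNO addresses. */
static inline int model_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MODEL_MAX_ERRNO;
}

struct group {
	int id;
};

/* Returns a valid pointer or an encoded errno, never NULL. */
struct group *new_group(int id, int simulate_oom)
{
	struct group *g;

	if (simulate_oom)
		return model_err_ptr(-ENOMEM);
	g = malloc(sizeof(*g));
	if (!g)
		return model_err_ptr(-ENOMEM);
	g->id = id;
	return g;
}

/* Caller side: a single model_is_err() test is enough. */
int add_group(int id, int simulate_oom)
{
	struct group *g = new_group(id, simulate_oom);

	if (model_is_err(g))
		return (int)model_ptr_err(g);
	free(g);
	return 0;
}
```

Because NULL decodes to 0 and is not an error pointer, a callee that returns
NULL on failure makes the caller report success; returning an encoded -ENOMEM
removes that ambiguity.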
* Re: [PATCH] Sysctl interface to UNIX_INFLIGHT_TRIGGER_GC v.3
From: Eric Dumazet @ 2010-12-10 13:04 UTC (permalink / raw)
To: pavel; +Cc: Shan Wei, netdev
In-Reply-To: <4D0221A2.4010602@pavlinux.ru>
On Friday, 10 December 2010 at 15:48 +0300, Pavel Vasilyev wrote:
> On 10.12.2010 06:45, Shan Wei wrote:
> > Pavel Vasilyev wrote, at 12/10/2010 01:26 AM:
> >> Sysctl interface to UNIX_INFLIGHT_TRIGGER_GC.
> >> IMHO convenient for testing.
> >>
> >> +inflight_trigger_gc - INTEGER
> >> + The maximal number of inflight sockets for force garbage collect.
> >> +
> >> + Default: 16000
> >
> > 1) For lower payload and enough memory, it's not necessary to force garbage collection.
> > So set it to 0, disable gc.
>
>
> May be, set default to 2000, and zero to disable
>
zero to disable ?
Maybe you missed commit 9915672d41273f5b77 intent.
If you have no limit (like old kernels), you can freeze your machine,
even if it has terabytes of ram, running a single program, even as a non
root user.
When we discussed about the fix, we said a limit was needed, obviously.
Now you'll have to prove we need to make it a sysctl (yet
another /proc/sys/net parameter, yet another documentation to add...)
Even changing default from 16000 to 2000 must be for a valid reason (a
real use case)
* Re: [PATCH] Sysctl interface to UNIX_INFLIGHT_TRIGGER_GC v.3
From: Pavel Vasilyev @ 2010-12-10 12:48 UTC (permalink / raw)
To: Shan Wei; +Cc: netdev
In-Reply-To: <4D01A26C.8060608@cn.fujitsu.com>
[-- Attachment #1: Type: text/plain, Size: 641 bytes --]
On 10.12.2010 06:45, Shan Wei wrote:
> Pavel Vasilyev wrote, at 12/10/2010 01:26 AM:
>> Sysctl interface to UNIX_INFLIGHT_TRIGGER_GC.
>> IMHO convenient for testing.
>>
>> +inflight_trigger_gc - INTEGER
>> + The maximal number of inflight sockets for force garbage collect.
>> +
>> + Default: 16000
>
> 1) For lower payload and enough memory, it's not necessary to force garbage collection.
> So set it to 0, disable gc.
Maybe set the default to 2000, and zero to disable.
> 2) Copy your patch to the mail, for other guys to review it.
Where do I find other guys? :)
--
Pavel.
[-- Attachment #2: sysctl.inflight_trigger_gc.patch --]
[-- Type: text/x-patch, Size: 2897 bytes --]
Documentation/networking/ip-sysctl.txt | 6 ++++++
include/net/af_unix.h | 1 +
net/unix/garbage.c | 8 +++++---
net/unix/sysctl_net_unix.c | 9 +++++++++
4 files changed, 21 insertions(+), 3 deletions(-)
diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 3c5e465..f0c4b6b 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -1463,6 +1463,12 @@ max_dgram_qlen - INTEGER
Default: 10
+inflight_trigger_gc - INTEGER
+ The maximum number of inflight sockets before a garbage collection is forced.
+
+ 0 - disable forced garbage collection.
+
+ Default: 2000
UNDOCUMENTED:
diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 18e5c3f..ea580e4 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -15,6 +15,7 @@ extern struct sock *unix_get_socket(struct file *filp);
#define UNIX_HASH_SIZE 256
extern unsigned int unix_tot_inflight;
+extern unsigned int sysctl_inflight_trigger_gc;
struct unix_address {
atomic_t refcnt;
diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index f89f83b..c2f3e98 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -94,7 +94,7 @@ static DEFINE_SPINLOCK(unix_gc_lock);
static DECLARE_WAIT_QUEUE_HEAD(unix_gc_wait);
unsigned int unix_tot_inflight;
-
+unsigned int sysctl_inflight_trigger_gc = 2000;
struct sock *unix_get_socket(struct file *filp)
{
@@ -259,7 +259,6 @@ static void inc_inflight_move_tail(struct unix_sock *u)
}
static bool gc_in_progress = false;
-#define UNIX_INFLIGHT_TRIGGER_GC 16000
void wait_for_unix_gc(void)
{
@@ -267,8 +266,11 @@ void wait_for_unix_gc(void)
* If number of inflight sockets is insane,
* force a garbage collect right now.
*/
- if (unix_tot_inflight > UNIX_INFLIGHT_TRIGGER_GC && !gc_in_progress)
+ if (sysctl_inflight_trigger_gc &&
+ (unix_tot_inflight > sysctl_inflight_trigger_gc
+ && !gc_in_progress))
unix_gc();
+
wait_event(unix_gc_wait, gc_in_progress == false);
}
diff --git a/net/unix/sysctl_net_unix.c b/net/unix/sysctl_net_unix.c
index 397cffe..c807235 100644
--- a/net/unix/sysctl_net_unix.c
+++ b/net/unix/sysctl_net_unix.c
@@ -23,6 +23,13 @@ static ctl_table unix_table[] = {
.mode = 0644,
.proc_handler = proc_dointvec
},
+ {
+ .procname = "inflight_trigger_gc",
+ .data = &sysctl_inflight_trigger_gc,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec
+ },
{ }
};
@@ -41,6 +48,8 @@ int __net_init unix_sysctl_register(struct net *net)
goto err_alloc;
table[0].data = &net->unx.sysctl_max_dgram_qlen;
+ table[1].data = &sysctl_inflight_trigger_gc;
+
net->unx.ctl = register_net_sysctl_table(net, unix_path, table);
if (net->unx.ctl == NULL)
goto err_reg;
---
Signed-off-by: Pavel Vasilyev <pavel@pavlinux.ru>
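
The behaviour described in this thread (force a collection once the inflight count exceeds a limit, with zero meaning "never force") can be sketched in userspace as follows. This is only an illustration under stated assumptions: should_force_gc() and check_examples() are hypothetical names, the limit is passed as a plain parameter rather than read from a sysctl, and nothing here is kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: decide whether a collection should be forced.
 * A limit of 0 disables the forced run entirely; otherwise the run is
 * forced only when the count exceeds the limit and no collection is
 * already in progress.
 */
static bool should_force_gc(unsigned int tot_inflight, bool gc_in_progress,
			    unsigned int limit)
{
	if (limit == 0)
		return false;
	return tot_inflight > limit && !gc_in_progress;
}

/* Small self-check of the three cases discussed in the thread. */
static void check_examples(void)
{
	assert(should_force_gc(16001, false, 16000));	/* over the limit */
	assert(!should_force_gc(16001, true, 16000));	/* already collecting */
	assert(!should_force_gc(1000000, false, 0));	/* zero disables */
}
```

Note that the test on the limit is "limit is non-zero", so a zero setting short-circuits the comparison instead of enabling it.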
^ permalink raw reply related
* [patch] isdn: return -EFAULT if copy_from_user() fails
From: Dan Carpenter @ 2010-12-10 12:40 UTC (permalink / raw)
To: Karsten Keil; +Cc: David S. Miller, netdev, kernel-janitors
We should be returning -EFAULT here.
Mostly this patch is to silence a smatch warning. The upper levels
of this driver turn all non-zero return values from isar_load_firmware()
into 1.
Signed-off-by: Dan Carpenter <error27@gmail.com>
diff --git a/drivers/isdn/hisax/isar.c b/drivers/isdn/hisax/isar.c
index 2e72227..9cd4829 100644
--- a/drivers/isdn/hisax/isar.c
+++ b/drivers/isdn/hisax/isar.c
@@ -212,9 +212,9 @@ isar_load_firmware(struct IsdnCardState *cs, u_char __user *buf)
cs->debug &= ~(L1_DEB_HSCX | L1_DEB_HSCX_FIFO);
#endif
- if ((ret = copy_from_user(&size, p, sizeof(int)))) {
+ if (copy_from_user(&size, p, sizeof(int))) {
printk(KERN_ERR"isar_load_firmware copy_from_user ret %d\n", ret);
- return ret;
+ return -EFAULT;
}
p += sizeof(int);
printk(KERN_DEBUG"isar_load_firmware size: %d\n", size);
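
For context, copy_from_user() returns the number of bytes it could not copy, not an errno, which is why the raw return value is not a useful error code. A minimal userspace sketch of the pattern follows; mock_copy_from_user() and read_size() are hypothetical stand-ins written for this illustration, and the "readable" parameter merely simulates a partially accessible source buffer.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/*
 * Hypothetical stand-in for copy_from_user(): copies at most `readable`
 * bytes and returns the number of bytes NOT copied (0 means success).
 */
static unsigned long mock_copy_from_user(void *dst, const void *src,
					 unsigned long n,
					 unsigned long readable)
{
	unsigned long done = n < readable ? n : readable;

	assert(dst != NULL && src != NULL);
	memcpy(dst, src, done);
	return n - done;
}

/* Returns 0 on success or -EFAULT, never a leftover byte count. */
static int read_size(int *size, const void *src, unsigned long readable)
{
	if (mock_copy_from_user(size, src, sizeof(*size), readable))
		return -EFAULT;
	return 0;
}
```

A caller that propagated the helper's return value directly would hand a positive byte count to code that expects a negative errno.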
^ permalink raw reply related
* Re: [net-next-2.6 25/27] e1000e: static analysis tools complain of a possible null ptr p dereference
From: Joe Perches @ 2010-12-10 12:44 UTC (permalink / raw)
To: Jeff Kirsher; +Cc: davem, davem, Bruce Allan, netdev, gospo, bphilips
In-Reply-To: <1291975585-30576-2-git-send-email-jeffrey.t.kirsher@intel.com>
On Fri, 2010-12-10 at 02:06 -0800, Jeff Kirsher wrote:
> diff --git a/drivers/net/e1000e/ethtool.c b/drivers/net/e1000e/ethtool.c
[]
> + default:
> + data[i] = 0;
> + continue;
> + break;
Using
continue;
break;
is odd and unhelpful.
Just continue; is sufficient and clear.
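
A small standalone sketch of the point: inside a switch nested in a loop, continue already leaves the switch and starts the next iteration, so a following break can never run. The enum, the values, and fill_stats() below are hypothetical and only loosely modelled on the quoted hunk.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stat kinds for the illustration. */
enum stat_kind { STAT_NETDEV, STAT_ADAPTER, STAT_UNKNOWN };

/*
 * Fills data[] from kinds[]. Unknown kinds get 0 and skip the step after
 * the switch. Returns how many entries went through that step.
 */
static size_t fill_stats(const enum stat_kind *kinds, size_t n,
			 unsigned long *data)
{
	size_t processed = 0;

	assert(kinds != NULL && data != NULL);
	for (size_t i = 0; i < n; i++) {
		unsigned long value;

		switch (kinds[i]) {
		case STAT_NETDEV:
			value = 10;
			break;
		case STAT_ADAPTER:
			value = 20;
			break;
		default:
			data[i] = 0;
			continue;	/* next iteration; no break needed */
		}
		data[i] = value + 1;	/* skipped for unknown kinds */
		processed++;
	}
	return processed;
}
```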
^ permalink raw reply
* [PATCH 1/1] dccp: remove unused macros
From: Gerrit Renker @ 2010-12-10 11:59 UTC (permalink / raw)
To: davem; +Cc: dccp, netdev, Shan Wei
In-Reply-To: <1291982371-5666-1-git-send-email-gerrit@erg.abdn.ac.uk>
From: Shan Wei <shanwei@cn.fujitsu.com>
Remove macros which have been unused since the initial implementation
(commit 7c657876b63cb1d8a2ec06f8fc6c37bb8412e66c, [DCCP]: Initial
implementation from Tue Aug 9 20:14:34 2005 -0700).
Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
Acked-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
---
net/dccp/dccp.h | 8 --------
1 files changed, 0 insertions(+), 8 deletions(-)
diff --git a/net/dccp/dccp.h b/net/dccp/dccp.h
index 48ad5d9..4508705 100644
--- a/net/dccp/dccp.h
+++ b/net/dccp/dccp.h
@@ -93,9 +93,6 @@ extern void dccp_time_wait(struct sock *sk, int state, int timeo);
#define DCCP_FALLBACK_RTT (USEC_PER_SEC / 5)
#define DCCP_SANE_RTT_MAX (3 * USEC_PER_SEC)
-/* Maximal interval between probes for local resources. */
-#define DCCP_RESOURCE_PROBE_INTERVAL ((unsigned)(HZ / 2U))
-
/* sysctl variables for DCCP */
extern int sysctl_dccp_request_retries;
extern int sysctl_dccp_retries1;
@@ -203,12 +200,7 @@ struct dccp_mib {
DECLARE_SNMP_STAT(struct dccp_mib, dccp_statistics);
#define DCCP_INC_STATS(field) SNMP_INC_STATS(dccp_statistics, field)
#define DCCP_INC_STATS_BH(field) SNMP_INC_STATS_BH(dccp_statistics, field)
-#define DCCP_INC_STATS_USER(field) SNMP_INC_STATS_USER(dccp_statistics, field)
#define DCCP_DEC_STATS(field) SNMP_DEC_STATS(dccp_statistics, field)
-#define DCCP_ADD_STATS_BH(field, val) \
- SNMP_ADD_STATS_BH(dccp_statistics, field, val)
-#define DCCP_ADD_STATS_USER(field, val) \
- SNMP_ADD_STATS_USER(dccp_statistics, field, val)
/*
* Checksumming routines
^ permalink raw reply related
* net-next-2.6 [Patch 1/1] dccp: dead code elimination
From: Gerrit Renker @ 2010-12-10 11:59 UTC (permalink / raw)
To: davem; +Cc: dccp, netdev
In-Reply-To: <4D01FBBE.2030705@cn.fujitsu.com>
Dave,
can you please consider the attached patch? It does indeed remove dead code.
I have also placed it into a fresh (today's) copy of net-next-2.6, on
git://eden-feed.erg.abdn.ac.uk/net-next-2.6 [subtree 'dccp']
Shan,
I have edited the commit message (s/marcos/macros/). In future, could
you please also cc: your patches to netdev@vger? Thank you.
Gerrit
^ permalink raw reply
* Re: [RFC PATCH V2 5/5] Add TX zero copy in macvtap
From: Eric Dumazet @ 2010-12-10 10:27 UTC (permalink / raw)
To: Shirley Ma
Cc: Avi Kivity, Arnd Bergmann, mst, xiaohui.xin, netdev, kvm,
linux-kernel
In-Reply-To: <1291976026.2167.49.camel@localhost.localdomain>
Le vendredi 10 décembre 2010 à 02:13 -0800, Shirley Ma a écrit :
> + while (len) {
> + f = &skb_shinfo(skb)->frags[i];
> + f->page = page[i];
> + f->page_offset = base & ~PAGE_MASK;
> + f->size = min_t(int, len, PAGE_SIZE - f->page_offset);
> + skb->data_len += f->size;
> + skb->len += f->size;
> + skb->truesize += f->size;
> + skb_shinfo(skb)->nr_frags++;
> + /* increase sk_wmem_alloc */
> + atomic_add(f->size, &skb->sk->sk_wmem_alloc);
> + base += f->size;
> + len -= f->size;
> + i++;
> + }
You could do a single atomic_add() outside of the loop, and factor out
many things...
atomic_add(len, &skb->sk->sk_wmem_alloc);
skb->data_len += len;
skb->len += len;
skb->truesize += len;
while (len) {
...
}
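
The same idea in a self-contained userspace sketch, assuming 4 KiB pages: the byte counters are bumped once with the total length, and the loop only records per-fragment sizes. struct sketch_buf, add_frags(), and the SKETCH_* constants are hypothetical names; the wmem field is a plain counter standing in for sk_wmem_alloc, with no atomics involved.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumed page size */
#define SKETCH_MAX_FRAGS 18	/* assumed fragment limit */

struct sketch_buf {
	unsigned long len;	/* stands in for skb->len */
	unsigned long data_len;	/* stands in for skb->data_len */
	unsigned long truesize;	/* stands in for skb->truesize */
	unsigned long wmem;	/* stands in for sk_wmem_alloc */
	size_t nr_frags;
	unsigned long frag_size[SKETCH_MAX_FRAGS];
};

/* Returns 0 on success, -1 if the range needs more fragments than remain. */
static int add_frags(struct sketch_buf *b, unsigned long base,
		     unsigned long len)
{
	unsigned long off = base % SKETCH_PAGE_SIZE;
	unsigned long needed;

	assert(b != NULL);
	if (len == 0)
		return 0;
	needed = (off + len + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
	if (needed > SKETCH_MAX_FRAGS - b->nr_frags)
		return -1;

	/* One update per counter instead of one per fragment. */
	b->wmem += len;
	b->data_len += len;
	b->len += len;
	b->truesize += len;

	while (len) {
		unsigned long room = SKETCH_PAGE_SIZE - off;
		unsigned long size = len < room ? len : room;

		b->frag_size[b->nr_frags++] = size;
		len -= size;
		off = 0;	/* later fragments start on a page boundary */
	}
	return 0;
}
```

The capacity check runs before any counter is touched, so a rejected request leaves the totals unchanged.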
^ permalink raw reply
* Re: [RFC PATCH V2 0/5] macvtap TX zero copy between guest and host kernel
From: Shirley Ma @ 2010-12-10 10:16 UTC (permalink / raw)
To: Avi Kivity; +Cc: Arnd Bergmann, mst, xiaohui.xin, netdev, kvm, linux-kernel
In-Reply-To: <1291974691.2167.24.camel@localhost.localdomain>
This patch has been built and tested against the most recent Linus git
tree, but I haven't run checkpatch yet. I would like to know first whether
this approach is acceptable or not.
Thanks
Shirley
^ permalink raw reply
* [RFC PATCH V2 5/5] Add TX zero copy in macvtap
From: Shirley Ma @ 2010-12-10 10:13 UTC (permalink / raw)
To: Avi Kivity, Arnd Bergmann, mst; +Cc: xiaohui.xin, netdev, kvm, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 6174 bytes --]
macvtap enables zero-copy only when the buffer size is greater than GOODCOPY_LEN (128).
Signed-off-by: Shirley Ma <xma@us.ibm.com>
---
drivers/net/macvtap.c | 128 ++++++++++++++++++++++++++++++++++++++++++++-----
1 files changed, 116 insertions(+), 12 deletions(-)
diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index 4256727..2ec9692 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -60,6 +60,7 @@ static struct proto macvtap_proto = {
*/
static dev_t macvtap_major;
#define MACVTAP_NUM_DEVS 65536
+#define GOODCOPY_LEN (L1_CACHE_BYTES < 128 ? 128 : L1_CACHE_BYTES)
static struct class *macvtap_class;
static struct cdev macvtap_cdev;
@@ -338,6 +339,7 @@ static int macvtap_open(struct inode *inode, struct file *file)
{
struct net *net = current->nsproxy->net_ns;
struct net_device *dev = dev_get_by_index(net, iminor(inode));
+ struct macvlan_dev *vlan = netdev_priv(dev);
struct macvtap_queue *q;
int err;
@@ -367,6 +369,16 @@ static int macvtap_open(struct inode *inode, struct file *file)
q->flags = IFF_VNET_HDR | IFF_NO_PI | IFF_TAP;
q->vnet_hdr_sz = sizeof(struct virtio_net_hdr);
+ /*
+ * so far only VM uses macvtap, enable zero copy between guest
+ * kernel and host kernel when lower device supports high memory
+ * DMA
+ */
+ if (vlan) {
+ if (vlan->lowerdev->features & NETIF_F_ZEROCOPY)
+ sock_set_flag(&q->sk, SOCK_ZEROCOPY);
+ }
+
err = macvtap_set_queue(dev, file, q);
if (err)
sock_put(&q->sk);
@@ -431,6 +443,80 @@ static inline struct sk_buff *macvtap_alloc_skb(struct sock *sk, size_t prepad,
return skb;
}
+/* set skb frags from iovec, this can move to core network code for reuse */
+static int zerocopy_sg_from_iovec(struct sk_buff *skb, const struct iovec *from,
+ int offset, size_t count)
+{
+ int len = iov_length(from, count) - offset;
+ int copy = skb_headlen(skb);
+ int size, offset1 = 0;
+ int i = 0;
+ skb_frag_t *f;
+
+ /* Skip over from offset */
+ while (offset >= from->iov_len) {
+ offset -= from->iov_len;
+ ++from;
+ --count;
+ }
+
+ /* copy up to skb headlen */
+ while (copy > 0) {
+ size = min_t(unsigned int, copy, from->iov_len - offset);
+ if (copy_from_user(skb->data + offset1, from->iov_base + offset,
+ size))
+ return -EFAULT;
+ if (copy > size) {
+ ++from;
+ --count;
+ }
+ copy -= size;
+ offset1 += size;
+ offset = 0;
+ }
+
+ if (len == offset1)
+ return 0;
+
+ while (count--) {
+ struct page *page[MAX_SKB_FRAGS];
+ int num_pages;
+ unsigned long base;
+
+ len = from->iov_len - offset1;
+ if (!len) {
+ offset1 = 0;
+ ++from;
+ continue;
+ }
+ base = (unsigned long)from->iov_base + offset1;
+ size = ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
+ num_pages = get_user_pages_fast(base, size, 0, &page[i]);
+ if ((num_pages != size) ||
+ (num_pages > MAX_SKB_FRAGS - skb_shinfo(skb)->nr_frags))
+ /* put_page is in skb free */
+ return -EFAULT;
+ while (len) {
+ f = &skb_shinfo(skb)->frags[i];
+ f->page = page[i];
+ f->page_offset = base & ~PAGE_MASK;
+ f->size = min_t(int, len, PAGE_SIZE - f->page_offset);
+ skb->data_len += f->size;
+ skb->len += f->size;
+ skb->truesize += f->size;
+ skb_shinfo(skb)->nr_frags++;
+ /* increase sk_wmem_alloc */
+ atomic_add(f->size, &skb->sk->sk_wmem_alloc);
+ base += f->size;
+ len -= f->size;
+ i++;
+ }
+ offset1 = 0;
+ ++from;
+ }
+ return 0;
+}
+
/*
* macvtap_skb_from_vnet_hdr and macvtap_skb_to_vnet_hdr should
* be shared with the tun/tap driver.
@@ -514,17 +600,19 @@ static int macvtap_skb_to_vnet_hdr(const struct sk_buff *skb,
/* Get packet from user space buffer */
-static ssize_t macvtap_get_user(struct macvtap_queue *q,
- const struct iovec *iv, size_t count,
- int noblock)
+static ssize_t macvtap_get_user(struct macvtap_queue *q, struct msghdr *m,
+ const struct iovec *iv, unsigned long total_len,
+ size_t count, int noblock)
{
struct sk_buff *skb;
struct macvlan_dev *vlan;
- size_t len = count;
+ unsigned long len = total_len;
int err;
struct virtio_net_hdr vnet_hdr = { 0 };
int vnet_hdr_len = 0;
+ int copylen, zerocopy;
+ zerocopy = sock_flag(&q->sk, SOCK_ZEROCOPY) && (len > GOODCOPY_LEN);
if (q->flags & IFF_VNET_HDR) {
vnet_hdr_len = q->vnet_hdr_sz;
@@ -550,12 +638,28 @@ static ssize_t macvtap_get_user(struct macvtap_queue *q,
if (unlikely(len < ETH_HLEN))
goto err;
- skb = macvtap_alloc_skb(&q->sk, NET_IP_ALIGN, len, vnet_hdr.hdr_len,
- noblock, &err);
+ if (zerocopy)
+ copylen = vnet_hdr.hdr_len;
+ else
+ copylen = len;
+
+ skb = macvtap_alloc_skb(&q->sk, NET_IP_ALIGN, copylen,
+ vnet_hdr.hdr_len, noblock, &err);
if (!skb)
goto err;
-
- err = skb_copy_datagram_from_iovec(skb, 0, iv, vnet_hdr_len, len);
+
+ if (zerocopy)
+ err = zerocopy_sg_from_iovec(skb, iv, vnet_hdr_len, count);
+ else
+ err = skb_copy_datagram_from_iovec(skb, 0, iv, vnet_hdr_len,
+ len);
+ if (sock_flag(&q->sk, SOCK_ZEROCOPY)) {
+ struct skb_ubuf_info *pend =
+ (struct skb_ubuf_info *)m->msg_control;
+
+ skb_shinfo(skb)->ubuf.callback = pend->callback;
+ skb_shinfo(skb)->ubuf.desc = pend->desc;
+ }
if (err)
goto err_kfree;
@@ -577,7 +681,7 @@ static ssize_t macvtap_get_user(struct macvtap_queue *q,
kfree_skb(skb);
rcu_read_unlock_bh();
- return count;
+ return total_len;
err_kfree:
kfree_skb(skb);
@@ -599,8 +703,8 @@ static ssize_t macvtap_aio_write(struct kiocb *iocb, const struct iovec *iv,
ssize_t result = -ENOLINK;
struct macvtap_queue *q = file->private_data;
- result = macvtap_get_user(q, iv, iov_length(iv, count),
- file->f_flags & O_NONBLOCK);
+ result = macvtap_get_user(q, NULL, iv, iov_length(iv, count), count,
+ file->f_flags & O_NONBLOCK);
return result;
}
@@ -813,7 +917,7 @@ static int macvtap_sendmsg(struct kiocb *iocb, struct socket *sock,
struct msghdr *m, size_t total_len)
{
struct macvtap_queue *q = container_of(sock, struct macvtap_queue, sock);
- return macvtap_get_user(q, m->msg_iov, total_len,
+ return macvtap_get_user(q, m, m->msg_iov, total_len, m->msg_iovlen,
m->msg_flags & MSG_DONTWAIT);
}
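
Two pieces of arithmetic in this patch can be checked in isolation: the number of pages a user buffer spans, and the size test that gates zero-copy. The sketch below assumes 4 KiB pages and a GOODCOPY_LEN of 128 (its value when L1_CACHE_BYTES is at most 128); pages_spanned(), use_zerocopy(), and the SKETCH_* macros are hypothetical names for this illustration, not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_PAGE_SHIFT 12			/* assumed 4 KiB pages */
#define SKETCH_PAGE_SIZE (1UL << SKETCH_PAGE_SHIFT)
#define SKETCH_PAGE_MASK (~(SKETCH_PAGE_SIZE - 1))
#define SKETCH_GOODCOPY_LEN 128UL		/* assumed threshold */

/* Number of pages touched by the byte range [base, base + len). */
static unsigned long pages_spanned(unsigned long base, unsigned long len)
{
	assert(len > 0);
	/* offset within the first page + length, rounded up to whole pages */
	return ((base & ~SKETCH_PAGE_MASK) + len + ~SKETCH_PAGE_MASK)
		>> SKETCH_PAGE_SHIFT;
}

/* Zero-copy is used only when the socket allows it and len is large enough. */
static bool use_zerocopy(bool sock_zerocopy, unsigned long len)
{
	return sock_zerocopy && len > SKETCH_GOODCOPY_LEN;
}
```

For example, a 2-byte buffer that starts on the last byte of a page spans two pages, and a buffer of exactly 128 bytes still takes the copy path.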
^ permalink raw reply related