* Re: [PATCH 8/8] fanotify: send events to userspace over socket reads
From: Daniel Walker @ 2009-09-11 14:32 UTC (permalink / raw)
To: Eric Paris; +Cc: linux-kernel, linux-fsdevel, netdev, davem, viro, alan, hch
In-Reply-To: <1252678545.2305.4.camel@dhcp231-106.rdu.redhat.com>
On Fri, 2009-09-11 at 10:15 -0400, Eric Paris wrote:
> I will look at all of the my 80+ character lines again. This one in
> particular, I will not break up. I might read a little too broadly in
> CodingStyle where it says the "exception to this is where exceeding 80
> columns significantly increases readability and does not hide
> information".
There is a point when longer lines get too long tho .. Maybe over 100 is
too long.. Where if your terminal only displays 80 characters , and the
text is way over 80 the lines just get wrapped and all the intended
readability goes out the window.
Daniel
^ permalink raw reply
* Re: [PATCH 1/8] networking/fanotify: declare fanotify socket numbers
From: Andreas Gruenbacher @ 2009-09-11 14:32 UTC (permalink / raw)
To: Eric Paris; +Cc: linux-kernel, linux-fsdevel, netdev, davem, viro, alan, hch
In-Reply-To: <20090911052558.32359.18075.stgit@paris.rdu.redhat.com>
[-- Attachment #1: Type: text/plain, Size: 773 bytes --]
The patches did apply and build against next-20090910. I wrote a small user-
space utility for testing (attached); see how painless the socket interface
is. The patches seem to be working well, except that some required
functionality is missing still.
Currently, the CAP_NET_RAW capability is needed for being able to create
watches. This seems too strict to me; I don't see why I shouldn't be able to
watch my own files, or files which I have read access to (like inotify).
There are some actions like creating hardlinks in directories or removing
files which don't trigger events. From a user point of view, I would prefer to
receive those events as well. (I notice that it's not easy to to pass file
descriptors to listeners for those events.)
Thanks,
Andreas
[-- Attachment #2: fanotify.c --]
[-- Type: text/x-csrc, Size: 3475 bytes --]
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <inttypes.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <fcntl.h>
#include "linux/fanotify.h"
int watch_inode(int fan_fd, const char *path, uint32_t mask)
{
struct fanotify_so_inode_mark mark;
memset(&mark, 0, sizeof(mark));
mark.fd = open(path, 0);
if (mark.fd == -1)
return -1;
mark.mask = mask;
if (setsockopt(fan_fd, SOL_FANOTIFY, FANOTIFY_SET_MARK,
&mark, sizeof(mark)) != 0)
return -1;
close(mark.fd);
return 0;
}
void synopsis(const char *progname, int status)
{
FILE *file = status ? stderr : stdout;
fprintf(file, "USAGE: %s [-cg] [-o {open,close,access,modify}] file ...\n",
progname);
exit(status);
}
int main(int argc, char *argv[])
{
int opt;
int fan_fd;
uint32_t fan_mask = FAN_OPEN | FAN_CLOSE | FAN_ACCESS | FAN_MODIFY;
bool opt_child = false, opt_global = false;
ssize_t len;
struct fanotify_addr addr;
char buf[4096];
#ifdef WITH_PID
pid_t pid;
#endif
while ((opt = getopt(argc, argv, "o:cgh")) != -1) {
switch(opt) {
case 'o': {
char *str, *tok;
fan_mask = 0;
str = optarg;
while ((tok = strtok(str, ",")) != NULL) {
str = NULL;
if (strcmp(tok, "open") == 0)
fan_mask |= FAN_OPEN;
else if (strcmp(tok, "close") == 0)
fan_mask |= FAN_CLOSE;
else if (strcmp(tok, "access") == 0)
fan_mask |= FAN_ACCESS;
else if (strcmp(tok, "modify") == 0)
fan_mask |= FAN_MODIFY;
else
synopsis(argv[0], 1);
}
break;
}
case 'c':
opt_child = true;
break;
case 'g':
opt_global = true;
break;
case 'h':
synopsis(argv[0], 0);
default: /* '?' */
synopsis(argv[0], 1);
}
}
if (optind == argc && !opt_global)
synopsis(argv[0], 1);
if (opt_child)
fan_mask |= FAN_EVENT_ON_CHILD;
memset(&addr, 0, sizeof(addr));
addr.family = AF_FANOTIFY;
addr.priority = 32768;
addr.mask = opt_global ? fan_mask : 0;
fan_fd = socket(PF_FANOTIFY, SOCK_RAW, 0);
if (fan_fd == -1)
goto fail;
if (bind(fan_fd, (struct sockaddr *)&addr, sizeof(addr)) != 0)
goto fail;
for (; optind < argc; optind++)
if (watch_inode(fan_fd, argv[optind], fan_mask) != 0)
goto fail;
#if WITH_PID
pid = getpid();
#endif
while ((len = recv(fan_fd, buf, sizeof(buf), 0)) > 0) {
struct fanotify_event_metadata *metadata;
metadata = (void *)buf;
while(FAN_EVENT_OK(metadata, len)) {
struct stat st;
#if WITH_PID
if (metadata->pid == pid)
goto skip;
#endif
if (metadata->fd >= 0) {
char path[PATH_MAX];
sprintf(path, "/proc/self/fd/%d", metadata->fd);
if (readlink(path, path, sizeof(path)) == -1)
goto fail;
printf("%s:", path);
} else
printf("?:");
#if WITH_PID
if (metadata->pid >= 0)
printf(" pid=%ld", metadata->pid);
#endif
if (metadata->mask & FAN_ACCESS)
printf(" access");
if (metadata->mask & FAN_OPEN)
printf(" open");
if (metadata->mask & FAN_MODIFY)
printf(" modify");
if (metadata->mask & FAN_CLOSE) {
if (metadata->mask & FAN_CLOSE_WRITE)
printf(" close(writable)");
else
printf(" close");
}
printf("\n");
skip:
if (metadata->fd >= 0 && close(metadata->fd) != 0)
goto fail;
metadata = FAN_EVENT_NEXT(metadata, len);
}
}
if (len < 0)
goto fail;
return 0;
fail:
fprintf(stderr, "%s\n", strerror(errno));
return 1;
}
^ permalink raw reply
* mac80211: NOHZ: local_softirq_pending 08
From: Michael Buesch @ 2009-09-11 14:48 UTC (permalink / raw)
To: linux-wireless; +Cc: netdev, Kalle.Valo, Johannes Berg
Hi,
mac80211 (or some other part of the networking stack) triggers this warning
in the NOHZ code:
NOHZ: local_softirq_pending 08
08 seems to be NET_RX_SOFTIRQ.
It happens, because my test driver b43 handles all RX and TX-status callbacks in
process context. I guess some part of the networking stack expects RX to be in
tasklet and/or softirq context.
We also have a report of this warning in wl1251, so it's probably not a b43 problem.
--
Greetings, Michael.
^ permalink raw reply
* Re: mac80211: NOHZ: local_softirq_pending 08
From: Kalle Valo @ 2009-09-11 14:57 UTC (permalink / raw)
To: Michael Buesch; +Cc: linux-wireless, netdev, Johannes Berg
In-Reply-To: <200909111648.50902.mb@bu3sch.de>
Michael Buesch <mb@bu3sch.de> writes:
> Hi,
Hallo,
> mac80211 (or some other part of the networking stack) triggers this
> warning in the NOHZ code: NOHZ: local_softirq_pending 08
>
> 08 seems to be NET_RX_SOFTIRQ.
>
> It happens, because my test driver b43 handles all RX and TX-status
> callbacks in process context. I guess some part of the networking
> stack expects RX to be in tasklet and/or softirq context.
>
> We also have a report of this warning in wl1251, so it's probably not
> a b43 problem.
Yes, I see this with wl1251. It uses workqueues everywhere.
--
Kalle Valo
^ permalink raw reply
* Re: [PATCH iproute2] ss: adds a space before congestion string
From: Stephen Hemminger @ 2009-09-11 15:07 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev
In-Reply-To: <4AAA0B1D.90402@gmail.com>
On Fri, 11 Sep 2009 10:32:29 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:
Both ss patches applied.
--
^ permalink raw reply
* Re: mac80211: NOHZ: local_softirq_pending 08
From: Michael Buesch @ 2009-09-11 15:07 UTC (permalink / raw)
To: Kalle Valo; +Cc: linux-wireless, netdev, Johannes Berg, John W. Linville
In-Reply-To: <877hw534wd.fsf@litku.valot.fi>
On Friday 11 September 2009 16:57:54 Kalle Valo wrote:
> Michael Buesch <mb@bu3sch.de> writes:
>
> > Hi,
>
> Hallo,
>
> > mac80211 (or some other part of the networking stack) triggers this
> > warning in the NOHZ code: NOHZ: local_softirq_pending 08
> >
> > 08 seems to be NET_RX_SOFTIRQ.
> >
> > It happens, because my test driver b43 handles all RX and TX-status
> > callbacks in process context. I guess some part of the networking
> > stack expects RX to be in tasklet and/or softirq context.
> >
> > We also have a report of this warning in wl1251, so it's probably not
> > a b43 problem.
>
> Yes, I see this with wl1251. It uses workqueues everywhere.
>
This patch seems to fix it.
Signed-off-by: Michael Buesch <mb@bu3sch.de>
Index: wireless-testing/net/mac80211/cfg.c
===================================================================
--- wireless-testing.orig/net/mac80211/cfg.c 2009-08-09 18:47:11.000000000 +0200
+++ wireless-testing/net/mac80211/cfg.c 2009-09-11 16:59:12.000000000 +0200
@@ -606,7 +606,7 @@ static void ieee80211_send_layer2_update
skb->dev = sta->sdata->dev;
skb->protocol = eth_type_trans(skb, sta->sdata->dev);
memset(skb->cb, 0, sizeof(skb->cb));
- netif_rx(skb);
+ ieee80211_netif_rx(skb);
}
static void sta_apply_parameters(struct ieee80211_local *local,
Index: wireless-testing/net/mac80211/ieee80211_i.h
===================================================================
--- wireless-testing.orig/net/mac80211/ieee80211_i.h 2009-08-23 00:06:41.000000000 +0200
+++ wireless-testing/net/mac80211/ieee80211_i.h 2009-09-11 17:02:05.000000000 +0200
@@ -1053,6 +1053,14 @@ void ieee80211_tx_pending(unsigned long
int ieee80211_monitor_start_xmit(struct sk_buff *skb, struct net_device *dev);
int ieee80211_subif_start_xmit(struct sk_buff *skb, struct net_device *dev);
+/* rx handling */
+static inline int ieee80211_netif_rx(struct sk_buff *skb)
+{
+ if (in_interrupt())
+ return netif_rx(skb);
+ return netif_rx_ni(skb);
+}
+
/* HT */
void ieee80211_ht_cap_ie_to_sta_ht_cap(struct ieee80211_supported_band *sband,
struct ieee80211_ht_cap *ht_cap_ie,
Index: wireless-testing/net/mac80211/main.c
===================================================================
--- wireless-testing.orig/net/mac80211/main.c 2009-08-23 00:06:41.000000000 +0200
+++ wireless-testing/net/mac80211/main.c 2009-09-11 16:59:35.000000000 +0200
@@ -591,7 +591,7 @@ void ieee80211_tx_status(struct ieee8021
skb2 = skb_clone(skb, GFP_ATOMIC);
if (skb2) {
skb2->dev = prev_dev;
- netif_rx(skb2);
+ ieee80211_netif_rx(skb2);
}
}
@@ -600,7 +600,7 @@ void ieee80211_tx_status(struct ieee8021
}
if (prev_dev) {
skb->dev = prev_dev;
- netif_rx(skb);
+ ieee80211_netif_rx(skb);
skb = NULL;
}
rcu_read_unlock();
Index: wireless-testing/net/mac80211/rx.c
===================================================================
--- wireless-testing.orig/net/mac80211/rx.c 2009-09-04 19:08:05.000000000 +0200
+++ wireless-testing/net/mac80211/rx.c 2009-09-11 17:00:08.000000000 +0200
@@ -309,7 +309,7 @@ ieee80211_rx_monitor(struct ieee80211_lo
skb2 = skb_clone(skb, GFP_ATOMIC);
if (skb2) {
skb2->dev = prev_dev;
- netif_rx(skb2);
+ ieee80211_netif_rx(skb2);
}
}
@@ -320,7 +320,7 @@ ieee80211_rx_monitor(struct ieee80211_lo
if (prev_dev) {
skb->dev = prev_dev;
- netif_rx(skb);
+ ieee80211_netif_rx(skb);
} else
dev_kfree_skb(skb);
@@ -1349,7 +1349,7 @@ ieee80211_deliver_skb(struct ieee80211_r
/* deliver to local stack */
skb->protocol = eth_type_trans(skb, dev);
memset(skb->cb, 0, sizeof(skb->cb));
- netif_rx(skb);
+ ieee80211_netif_rx(skb);
}
}
@@ -1943,7 +1943,7 @@ static void ieee80211_rx_cooked_monitor(
skb2 = skb_clone(skb, GFP_ATOMIC);
if (skb2) {
skb2->dev = prev_dev;
- netif_rx(skb2);
+ ieee80211_netif_rx(skb2);
}
}
@@ -1954,7 +1954,7 @@ static void ieee80211_rx_cooked_monitor(
if (prev_dev) {
skb->dev = prev_dev;
- netif_rx(skb);
+ ieee80211_netif_rx(skb);
skb = NULL;
} else
goto out_free_skb;
--
Greetings, Michael.
^ permalink raw reply
* Re: [PATCH 1/2] cpc-usb: Removed driver from staging tree
From: Greg KH @ 2009-09-11 15:05 UTC (permalink / raw)
To: Sebastian Haas; +Cc: netdev, wg, oliver, socketcan-core
In-Reply-To: <20090911105449.7815.65368.stgit@localhost.localdomain>
On Fri, Sep 11, 2009 at 12:54:50PM +0200, Sebastian Haas wrote:
> This patch prepares replacing the staging driver cpc-usb with the new
> developed ems_usb CAN driver.
>
> Signed-off-by: Sebastian Haas <haas@ems-wuensche.com>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
^ permalink raw reply
* RE: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
From: Xin, Xiaohui @ 2009-09-11 15:17 UTC (permalink / raw)
To: Michael S. Tsirkin, Ira W. Snyder
Cc: netdev@vger.kernel.org, virtualization@lists.linux-foundation.org,
kvm@vger.kernel.org, linux-kernel@vger.kernel.org, mingo@elte.hu,
linux-mm@kvack.org, akpm@linux-foundation.org, hpa@zytor.com,
gregory.haskins@gmail.com, Rusty Russell, s.hetze@linux-ag.com
In-Reply-To: <20090908201428.GA12420@redhat.com>
Michael,
We are very interested in your patch and want to have a try with it.
I have collected your 3 patches in kernel side and 4 patches in queue side.
The patches are listed here:
PATCHv5-1-3-mm-export-use_mm-unuse_mm-to-modules.patch
PATCHv5-2-3-mm-reduce-atomic-use-on-use_mm-fast-path.patch
PATCHv5-3-3-vhost_net-a-kernel-level-virtio-server.patch
PATCHv3-1-4-qemu-kvm-move-virtio-pci[1].o-to-near-pci.o.patch
PATCHv3-2-4-virtio-move-features-to-an-inline-function.patch
PATCHv3-3-4-qemu-kvm-vhost-net-implementation.patch
PATCHv3-4-4-qemu-kvm-add-compat-eventfd.patch
I applied the kernel patches on v2.6.31-rc4 and the qemu patches on latest kvm qemu.
But seems there are some patches are needed at least irqfd and ioeventfd patches on
current qemu. I cannot create a kvm guest with "-net nic,model=virtio,vhost=vethX".
May you kindly advice us the patch lists all exactly to make it work?
Thanks a lot. :-)
Thanks
Xiaohui
-----Original Message-----
From: kvm-owner@vger.kernel.org [mailto:kvm-owner@vger.kernel.org] On Behalf Of Michael S. Tsirkin
Sent: Wednesday, September 09, 2009 4:14 AM
To: Ira W. Snyder
Cc: netdev@vger.kernel.org; virtualization@lists.linux-foundation.org; kvm@vger.kernel.org; linux-kernel@vger.kernel.org; mingo@elte.hu; linux-mm@kvack.org; akpm@linux-foundation.org; hpa@zytor.com; gregory.haskins@gmail.com; Rusty Russell; s.hetze@linux-ag.com
Subject: Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On Tue, Sep 08, 2009 at 10:20:35AM -0700, Ira W. Snyder wrote:
> On Mon, Sep 07, 2009 at 01:15:37PM +0300, Michael S. Tsirkin wrote:
> > On Thu, Sep 03, 2009 at 11:39:45AM -0700, Ira W. Snyder wrote:
> > > On Thu, Aug 27, 2009 at 07:07:50PM +0300, Michael S. Tsirkin wrote:
> > > > What it is: vhost net is a character device that can be used to reduce
> > > > the number of system calls involved in virtio networking.
> > > > Existing virtio net code is used in the guest without modification.
> > > >
> > > > There's similarity with vringfd, with some differences and reduced scope
> > > > - uses eventfd for signalling
> > > > - structures can be moved around in memory at any time (good for migration)
> > > > - support memory table and not just an offset (needed for kvm)
> > > >
> > > > common virtio related code has been put in a separate file vhost.c and
> > > > can be made into a separate module if/when more backends appear. I used
> > > > Rusty's lguest.c as the source for developing this part : this supplied
> > > > me with witty comments I wouldn't be able to write myself.
> > > >
> > > > What it is not: vhost net is not a bus, and not a generic new system
> > > > call. No assumptions are made on how guest performs hypercalls.
> > > > Userspace hypervisors are supported as well as kvm.
> > > >
> > > > How it works: Basically, we connect virtio frontend (configured by
> > > > userspace) to a backend. The backend could be a network device, or a
> > > > tun-like device. In this version I only support raw socket as a backend,
> > > > which can be bound to e.g. SR IOV, or to macvlan device. Backend is
> > > > also configured by userspace, including vlan/mac etc.
> > > >
> > > > Status:
> > > > This works for me, and I haven't see any crashes.
> > > > I have done some light benchmarking (with v4), compared to userspace, I
> > > > see improved latency (as I save up to 4 system calls per packet) but not
> > > > bandwidth/CPU (as TSO and interrupt mitigation are not supported). For
> > > > ping benchmark (where there's no TSO) troughput is also improved.
> > > >
> > > > Features that I plan to look at in the future:
> > > > - tap support
> > > > - TSO
> > > > - interrupt mitigation
> > > > - zero copy
> > > >
> > >
> > > Hello Michael,
> > >
> > > I've started looking at vhost with the intention of using it over PCI to
> > > connect physical machines together.
> > >
> > > The part that I am struggling with the most is figuring out which parts
> > > of the rings are in the host's memory, and which parts are in the
> > > guest's memory.
> >
> > All rings are in guest's memory, to match existing virtio code.
>
> Ok, this makes sense.
>
> > vhost
> > assumes that the memory space of the hypervisor userspace process covers
> > the whole of guest memory.
>
> Is this necessary? Why?
Because with virtio ring can give us arbitrary guest addresses. If
guest was limited to using a subset of addresses, hypervisor would only
have to map these.
> The assumption seems very wrong when you're
> doing data transport between two physical systems via PCI.
> I know vhost has not been designed for this specific situation, but it
> is good to be looking toward other possible uses.
>
> > And there's a translation table.
> > Ring addresses are userspace addresses, they do not undergo translation.
> >
> > > If I understand everything correctly, the rings are all userspace
> > > addresses, which means that they can be moved around in physical memory,
> > > and get pushed out to swap.
> >
> > Unless they are locked, yes.
> >
> > > AFAIK, this is impossible to handle when
> > > connecting two physical systems, you'd need the rings available in IO
> > > memory (PCI memory), so you can ioreadXX() them instead. To the best of
> > > my knowledge, I shouldn't be using copy_to_user() on an __iomem address.
> > > Also, having them migrate around in memory would be a bad thing.
> > >
> > > Also, I'm having trouble figuring out how the packet contents are
> > > actually copied from one system to the other. Could you point this out
> > > for me?
> >
> > The code in net/packet/af_packet.c does it when vhost calls sendmsg.
> >
>
> Ok. The sendmsg() implementation uses memcpy_fromiovec(). Is it possible
> to make this use a DMA engine instead?
Maybe.
> I know this was suggested in an earlier thread.
Yes, it might even give some performance benefit with e.g. I/O AT.
> > > Is there somewhere I can find the userspace code (kvm, qemu, lguest,
> > > etc.) code needed for interacting with the vhost misc device so I can
> > > get a better idea of how userspace is supposed to work?
> >
> > Look in archives for kvm@vger.kernel.org. the subject is qemu-kvm: vhost net.
> >
> > > (Features
> > > negotiation, etc.)
> > >
> >
> > That's not yet implemented as there are no features yet. I'm working on
> > tap support, which will add a feature bit. Overall, qemu does an ioctl
> > to query supported features, and then acks them with another ioctl. I'm
> > also trying to avoid duplicating functionality available elsewhere. So
> > that to check e.g. TSO support, you'd just look at the underlying
> > hardware device you are binding to.
> >
>
> Ok. Do you have plans to support the VIRTIO_NET_F_MRG_RXBUF feature in
> the future? I found that this made an enormous improvement in throughput
> on my virtio-net <-> virtio-net system. Perhaps it isn't needed with
> vhost-net.
Yes, I'm working on it.
> Thanks for replying,
> Ira
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply
* Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
From: Gregory Haskins @ 2009-09-11 16:00 UTC (permalink / raw)
To: Ira W. Snyder
Cc: Michael S. Tsirkin, netdev, virtualization, kvm, linux-kernel,
mingo, linux-mm, akpm, hpa, Rusty Russell, s.hetze
In-Reply-To: <20090908172035.GB319@ovro.caltech.edu>
[-- Attachment #1: Type: text/plain, Size: 3146 bytes --]
Ira W. Snyder wrote:
> On Mon, Sep 07, 2009 at 01:15:37PM +0300, Michael S. Tsirkin wrote:
>> On Thu, Sep 03, 2009 at 11:39:45AM -0700, Ira W. Snyder wrote:
>>> On Thu, Aug 27, 2009 at 07:07:50PM +0300, Michael S. Tsirkin wrote:
>>>> What it is: vhost net is a character device that can be used to reduce
>>>> the number of system calls involved in virtio networking.
>>>> Existing virtio net code is used in the guest without modification.
>>>>
>>>> There's similarity with vringfd, with some differences and reduced scope
>>>> - uses eventfd for signalling
>>>> - structures can be moved around in memory at any time (good for migration)
>>>> - support memory table and not just an offset (needed for kvm)
>>>>
>>>> common virtio related code has been put in a separate file vhost.c and
>>>> can be made into a separate module if/when more backends appear. I used
>>>> Rusty's lguest.c as the source for developing this part : this supplied
>>>> me with witty comments I wouldn't be able to write myself.
>>>>
>>>> What it is not: vhost net is not a bus, and not a generic new system
>>>> call. No assumptions are made on how guest performs hypercalls.
>>>> Userspace hypervisors are supported as well as kvm.
>>>>
>>>> How it works: Basically, we connect virtio frontend (configured by
>>>> userspace) to a backend. The backend could be a network device, or a
>>>> tun-like device. In this version I only support raw socket as a backend,
>>>> which can be bound to e.g. SR IOV, or to macvlan device. Backend is
>>>> also configured by userspace, including vlan/mac etc.
>>>>
>>>> Status:
>>>> This works for me, and I haven't see any crashes.
>>>> I have done some light benchmarking (with v4), compared to userspace, I
>>>> see improved latency (as I save up to 4 system calls per packet) but not
>>>> bandwidth/CPU (as TSO and interrupt mitigation are not supported). For
>>>> ping benchmark (where there's no TSO) troughput is also improved.
>>>>
>>>> Features that I plan to look at in the future:
>>>> - tap support
>>>> - TSO
>>>> - interrupt mitigation
>>>> - zero copy
>>>>
>>> Hello Michael,
>>>
>>> I've started looking at vhost with the intention of using it over PCI to
>>> connect physical machines together.
>>>
>>> The part that I am struggling with the most is figuring out which parts
>>> of the rings are in the host's memory, and which parts are in the
>>> guest's memory.
>> All rings are in guest's memory, to match existing virtio code.
>
> Ok, this makes sense.
>
>> vhost
>> assumes that the memory space of the hypervisor userspace process covers
>> the whole of guest memory.
>
> Is this necessary? Why? The assumption seems very wrong when you're
> doing data transport between two physical systems via PCI.
FWIW: VBUS handles this situation via the "memctx" abstraction. IOW,
the memory is not assumed to be a userspace address. Rather, it is a
memctx-specific address, which can be userspace, or any other type
(including hardware, dma-engine, etc). As long as the memctx knows how
to translate it, it will work.
Kind Regards,
-Greg
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 267 bytes --]
^ permalink raw reply
* Re: [PATCH 1/8] networking/fanotify: declare fanotify socket numbers
From: Eric Paris @ 2009-09-11 16:04 UTC (permalink / raw)
To: Andreas Gruenbacher
Cc: linux-kernel, linux-fsdevel, netdev, davem, viro, alan, hch
In-Reply-To: <200909111632.50477.agruen@suse.de>
On Fri, 2009-09-11 at 16:32 +0200, Andreas Gruenbacher wrote:
> The patches did apply and build against next-20090910. I wrote a small user-
> space utility for testing (attached); see how painless the socket interface
> is. The patches seem to be working well, except that some required
> functionality is missing still.
>
> Currently, the CAP_NET_RAW capability is needed for being able to create
> watches. This seems too strict to me; I don't see why I shouldn't be able to
> watch my own files, or files which I have read access to (like inotify).
I agree completely. However since I don't yet have limits on how many
marks a user can add, nor the number of events they can hold in their
queue I'm leaving it as root only for the moment. Patches have started
to flush out usability by non-root users.
> There are some actions like creating hardlinks in directories or removing
> files which don't trigger events. From a user point of view, I would prefer to
> receive those events as well. (I notice that it's not easy to to pass file
> descriptors to listeners for those events.)
Yes, a number of event types hook into the vfs in places where we do not
have at least a struct path. I'm certainly considering how to move
hoooks to get that ability.
-Eric
^ permalink raw reply
* Re: mac80211: NOHZ: local_softirq_pending 08
From: Kalle Valo @ 2009-09-11 16:07 UTC (permalink / raw)
To: Michael Buesch
Cc: linux-wireless-u79uwXL29TY76Z2rM5mHXA,
netdev-u79uwXL29TY76Z2rM5mHXA, Johannes Berg, John W. Linville
In-Reply-To: <200909111707.23971.mb-fseUSCV1ubazQB+pC5nmwQ@public.gmane.org>
Michael Buesch <mb-fseUSCV1ubazQB+pC5nmwQ@public.gmane.org> writes:
>> > mac80211 (or some other part of the networking stack) triggers this
>> > warning in the NOHZ code: NOHZ: local_softirq_pending 08
>> >
>> > 08 seems to be NET_RX_SOFTIRQ.
>> >
>> > It happens, because my test driver b43 handles all RX and TX-status
>> > callbacks in process context. I guess some part of the networking
>> > stack expects RX to be in tasklet and/or softirq context.
>> >
>> > We also have a report of this warning in wl1251, so it's probably not
>> > a b43 problem.
>>
>> Yes, I see this with wl1251. It uses workqueues everywhere.
>>
>
> This patch seems to fix it.
Yes, it does. Thanks.
> Signed-off-by: Michael Buesch <mb-fseUSCV1ubazQB+pC5nmwQ@public.gmane.org>
Tested-by: Kalle Valo <kalle.valo-xNZwKgViW5gAvxtiuMwx3w@public.gmane.org>
--
Kalle Valo
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: mac80211: NOHZ: local_softirq_pending 08
From: Oliver Hartkopp @ 2009-09-11 16:07 UTC (permalink / raw)
To: Michael Buesch
Cc: Kalle Valo, linux-wireless-u79uwXL29TY76Z2rM5mHXA,
netdev-u79uwXL29TY76Z2rM5mHXA, Johannes Berg, John W. Linville
In-Reply-To: <200909111707.23971.mb-fseUSCV1ubazQB+pC5nmwQ@public.gmane.org>
Michael Buesch wrote:
> On Friday 11 September 2009 16:57:54 Kalle Valo wrote:
>> Michael Buesch <mb-fseUSCV1ubazQB+pC5nmwQ@public.gmane.org> writes:
>>
>>> Hi,
>> Hallo,
>>
>>> mac80211 (or some other part of the networking stack) triggers this
>>> warning in the NOHZ code: NOHZ: local_softirq_pending 08
>>>
>>> 08 seems to be NET_RX_SOFTIRQ.
>>>
>>> It happens, because my test driver b43 handles all RX and TX-status
>>> callbacks in process context. I guess some part of the networking
>>> stack expects RX to be in tasklet and/or softirq context.
>>>
>>> We also have a report of this warning in wl1251, so it's probably not
>>> a b43 problem.
>> Yes, I see this with wl1251. It uses workqueues everywhere.
>>
>
> This patch seems to fix it.
>
> Signed-off-by: Michael Buesch <mb-fseUSCV1ubazQB+pC5nmwQ@public.gmane.org>
>
> Index: wireless-testing/net/mac80211/cfg.c
> ===================================================================
> --- wireless-testing.orig/net/mac80211/cfg.c 2009-08-09 18:47:11.000000000 +0200
> +++ wireless-testing/net/mac80211/cfg.c 2009-09-11 16:59:12.000000000 +0200
> @@ -606,7 +606,7 @@ static void ieee80211_send_layer2_update
> skb->dev = sta->sdata->dev;
> skb->protocol = eth_type_trans(skb, sta->sdata->dev);
> memset(skb->cb, 0, sizeof(skb->cb));
> - netif_rx(skb);
> + ieee80211_netif_rx(skb);
> }
>
> static void sta_apply_parameters(struct ieee80211_local *local,
> Index: wireless-testing/net/mac80211/ieee80211_i.h
> ===================================================================
> --- wireless-testing.orig/net/mac80211/ieee80211_i.h 2009-08-23 00:06:41.000000000 +0200
> +++ wireless-testing/net/mac80211/ieee80211_i.h 2009-09-11 17:02:05.000000000 +0200
> @@ -1053,6 +1053,14 @@ void ieee80211_tx_pending(unsigned long
> int ieee80211_monitor_start_xmit(struct sk_buff *skb, struct net_device *dev);
> int ieee80211_subif_start_xmit(struct sk_buff *skb, struct net_device *dev);
>
> +/* rx handling */
> +static inline int ieee80211_netif_rx(struct sk_buff *skb)
> +{
> + if (in_interrupt())
> + return netif_rx(skb);
> + return netif_rx_ni(skb);
> +}
> +
Hello Michael,
i know this NOHZ warning from the CAN stack also - but now, i know what caused
this warning. I fixed it in my local tree and it works. Thanks!
As there are several users in the kernel do exact this test and call the
appropriate netif_rx() function, i would suggest to create a static inline
function:
static inline int netif_rx_ti(struct sk_buff *skb)
{
if (in_interrupt())
return netif_rx(skb);
return netif_rx_ni(skb);
}
('ti' for test in_interrupt())
in include/linux/netdevice.h
What do you think about that?
Regards,
Oliver
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: mac80211: NOHZ: local_softirq_pending 08
From: Michael Buesch @ 2009-09-11 16:13 UTC (permalink / raw)
To: Oliver Hartkopp
Cc: Kalle Valo, linux-wireless-u79uwXL29TY76Z2rM5mHXA,
netdev-u79uwXL29TY76Z2rM5mHXA, Johannes Berg, John W. Linville
In-Reply-To: <4AAA75CB.6040803-fJ+pQTUTwRTk1uMJSBkQmQ@public.gmane.org>
On Friday 11 September 2009 18:07:39 Oliver Hartkopp wrote:
> Michael Buesch wrote:
> > On Friday 11 September 2009 16:57:54 Kalle Valo wrote:
> >> Michael Buesch <mb-fseUSCV1ubazQB+pC5nmwQ@public.gmane.org> writes:
> >>
> >>> Hi,
> >> Hallo,
> >>
> >>> mac80211 (or some other part of the networking stack) triggers this
> >>> warning in the NOHZ code: NOHZ: local_softirq_pending 08
> >>>
> >>> 08 seems to be NET_RX_SOFTIRQ.
> >>>
> >>> It happens, because my test driver b43 handles all RX and TX-status
> >>> callbacks in process context. I guess some part of the networking
> >>> stack expects RX to be in tasklet and/or softirq context.
> >>>
> >>> We also have a report of this warning in wl1251, so it's probably not
> >>> a b43 problem.
> >> Yes, I see this with wl1251. It uses workqueues everywhere.
> >>
> >
> > This patch seems to fix it.
> >
> > Signed-off-by: Michael Buesch <mb-fseUSCV1ubazQB+pC5nmwQ@public.gmane.org>
> >
> > Index: wireless-testing/net/mac80211/cfg.c
> > ===================================================================
> > --- wireless-testing.orig/net/mac80211/cfg.c 2009-08-09 18:47:11.000000000 +0200
> > +++ wireless-testing/net/mac80211/cfg.c 2009-09-11 16:59:12.000000000 +0200
> > @@ -606,7 +606,7 @@ static void ieee80211_send_layer2_update
> > skb->dev = sta->sdata->dev;
> > skb->protocol = eth_type_trans(skb, sta->sdata->dev);
> > memset(skb->cb, 0, sizeof(skb->cb));
> > - netif_rx(skb);
> > + ieee80211_netif_rx(skb);
> > }
> >
> > static void sta_apply_parameters(struct ieee80211_local *local,
> > Index: wireless-testing/net/mac80211/ieee80211_i.h
> > ===================================================================
> > --- wireless-testing.orig/net/mac80211/ieee80211_i.h 2009-08-23 00:06:41.000000000 +0200
> > +++ wireless-testing/net/mac80211/ieee80211_i.h 2009-09-11 17:02:05.000000000 +0200
> > @@ -1053,6 +1053,14 @@ void ieee80211_tx_pending(unsigned long
> > int ieee80211_monitor_start_xmit(struct sk_buff *skb, struct net_device *dev);
> > int ieee80211_subif_start_xmit(struct sk_buff *skb, struct net_device *dev);
> >
> > +/* rx handling */
> > +static inline int ieee80211_netif_rx(struct sk_buff *skb)
> > +{
> > + if (in_interrupt())
> > + return netif_rx(skb);
> > + return netif_rx_ni(skb);
> > +}
> > +
>
> Hello Michael,
>
> i know this NOHZ warning from the CAN stack also - but now, i know what caused
> this warning. I fixed it in my local tree and it works. Thanks!
>
> As there are several users in the kernel do exact this test and call the
> appropriate netif_rx() function, i would suggest to create a static inline
> function:
>
> static inline int netif_rx_ti(struct sk_buff *skb)
> {
> if (in_interrupt())
> return netif_rx(skb);
> return netif_rx_ni(skb);
> }
>
> ('ti' for test in_interrupt())
>
> in include/linux/netdevice.h
>
> What do you think about that?
Yeah, I'm fine with that.
--
Greetings, Michael.
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
From: Gregory Haskins @ 2009-09-11 16:14 UTC (permalink / raw)
To: Ira W. Snyder
Cc: Michael S. Tsirkin, netdev, virtualization, kvm, linux-kernel,
mingo, linux-mm, akpm, hpa, Rusty Russell, s.hetze
In-Reply-To: <4AAA7415.5080204@gmail.com>
[-- Attachment #1: Type: text/plain, Size: 1352 bytes --]
Gregory Haskins wrote:
[snip]
>
> FWIW: VBUS handles this situation via the "memctx" abstraction. IOW,
> the memory is not assumed to be a userspace address. Rather, it is a
> memctx-specific address, which can be userspace, or any other type
> (including hardware, dma-engine, etc). As long as the memctx knows how
> to translate it, it will work.
>
citations:
Here is a packet import (from the perspective of the host side "venet"
device model, similar to Michaels "vhost")
http://git.kernel.org/?p=linux/kernel/git/ghaskins/alacrityvm/linux-2.6.git;a=blob;f=kernel/vbus/devices/venet-tap.c;h=ee091c47f06e9bb8487a45e72d493273fe08329f;hb=ded8ce2005a85c174ba93ee26f8d67049ef11025#l535
Here is the KVM specific memctx:
http://git.kernel.org/?p=linux/kernel/git/ghaskins/alacrityvm/linux-2.6.git;a=blob;f=kernel/vbus/kvm.c;h=56e2c5682a7ca8432c159377b0f7389cf34cbc1b;hb=ded8ce2005a85c174ba93ee26f8d67049ef11025#l188
and
http://git.kernel.org/?p=linux/kernel/git/ghaskins/alacrityvm/linux-2.6.git;a=blob;f=virt/kvm/xinterface.c;h=0cccb6095ca2a51bad01f7ba2137fdd9111b63d3;hb=ded8ce2005a85c174ba93ee26f8d67049ef11025#l289
You could alternatively define a memctx for your environment which knows
how to deal with your PPC boards PCI based memory, and the devices would
all "just work".
Kind Regards,
-Greg
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 267 bytes --]
^ permalink raw reply
* Re: [PATCH net-next-2.6 v2] bonding: introduce primary_passive option
From: Nicolas de Pesloüan @ 2009-09-11 16:14 UTC (permalink / raw)
To: Jay Vosburgh; +Cc: Jiri Pirko, netdev, davem, bonding-devel
In-Reply-To: <24804.1252627725@death.nxdomain.ibm.com>
Jay Vosburgh wrote:
> Nicolas de Pesloüan <nicolas.2p.debian@free.fr> wrote:
>
>> Jiri Pirko wrote:
>>> (updated 2)
>>>
>>> In some cases there is not desirable to switch back to primary interface when
>>> it's link recovers and rather stay with currently active one. We need to avoid
>>> packetloss as much as we can in some cases. This is solved by introducing
>>> primary_passive option. Note that enslaved primary slave is set as current
>>> active no matter what.
>>>
>>> This patch depends on the following one:
>>> [net-next-2.6] bonding: make ab_arp select active slaves as other modes
>>> http://patchwork.ozlabs.org/patch/32684/
>>>
>>> Signed-off-by: Jiri Pirko <jpirko@redhat.com>
>>>
>>> diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt
>>> index d5181ce..e5f2c31 100644
>>> --- a/Documentation/networking/bonding.txt
>>> +++ b/Documentation/networking/bonding.txt
>>> @@ -614,6 +614,32 @@ primary
>>> The primary option is only valid for active-backup mode.
>>> +primary_passive
>>> +
>>> + Specifies the behavior of the current active slave when the primary was
>>> + down and comes back up. This option is designed to prevent
>>> + flip-flopping between the primary slave and other slaves. The possible
>>> + values and their respective effects are:
>>> +
>>> + disabled or 0 (default)
>>> +
>>> + The primary slave becomes the active slave whenever it comes
>>> + back up.
>>> +
>>> + better or 1
>>> +
>>> + The primary slave becomes the active slave when it comes back
>>> + up, if the speed and duplex of the primary slave is better
>>> + than the speed and duplex of the current active slave.
>>> +
>>> + always or 2
>>> +
>>> + The primary slave becomes the active slave only if the current
>>> + active slave fails and the primary slave is up.
>>> +
>>> + When no slave are active, if the primary comes back up, it becomes the
>>> + active slave, regardless of the value of primary_return
>>> +
>>> updelay
>> My feeling is that using primary_passive=disabled/better/always is far
>> less clear than primary_return=always/better/failure_only.
>>
>> The normal (current) behavior is to always return to primay if it is
>> up. You want to add the ability to return to it only on failure of the
>> current active slave and I suggested to add the ability the return to it
>> if it is "better" than the current active slave.
>
> Since this has changed from a real option to a "modify behavior
> of another option" option, I'd call it something along the lines of
> "primary_reselect" with the "always / better / failure" set of choices.
primary_reselect with always/better/failure sounds perfect for me.
> Also, if "better" isn't implemented, leave it out entirely
> (don't define the label or option stuff). In the future, then, it can
> be added in a complete patch, rather than splitting it across two
> patches.
>
>> Hence the suggested name for the option and option values.
>>
>>> Specifies the time, in milliseconds, to wait before enabling a
>>> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>>> index a7e731f..193eca4 100644
>>> --- a/drivers/net/bonding/bond_main.c
>>> +++ b/drivers/net/bonding/bond_main.c
>>> @@ -94,6 +94,7 @@ static int downdelay;
>>> static int use_carrier = 1;
>>> static char *mode;
>>> static char *primary;
>>> +static char *primary_passive;
>>> static char *lacp_rate;
>>> static char *ad_select;
>>> static char *xmit_hash_policy;
>>> @@ -126,6 +127,13 @@ MODULE_PARM_DESC(mode, "Mode of operation : 0 for balance-rr, "
>>> "6 for balance-alb");
>>> module_param(primary, charp, 0);
>>> MODULE_PARM_DESC(primary, "Primary network device to use");
>>> +module_param(primary_passive, charp, 0);
>>> +MODULE_PARM_DESC(primary_passive, "Do not set primary slave active "
>>> + "once it comes up; "
>>> + "0 for disabled (default), "
>>> + "1 for on only if speed of primary is not "
>>> + "better, "
>>> + "2 for always on");
>> You have a double negative assertion here : a "do not" option whose value
>> is "disabled". For clarity, I suggest to have a "do" option whose value is
>> "enabled".
>>
>> This is probably the reason why I suggested primary_return instead of
>> primary_passive. primary_passive means "refuse to return to primary" and
>> when disabled, it becomes "don't refuse to return to primary", which
>> should be "accept to return to primary" instead :-)
>>
>> It might be seen as not being that important, but having an option whose
>> name and values are easy to understand leads to an easy to use
>> option... :-)
>>
>>> module_param(lacp_rate, charp, 0);
>>> MODULE_PARM_DESC(lacp_rate, "LACPDU tx rate to request from 802.3ad partner "
>>> "(slow/fast)");
>>> @@ -200,6 +208,13 @@ const struct bond_parm_tbl fail_over_mac_tbl[] = {
>>> { NULL, -1},
>>> };
>>> +const struct bond_parm_tbl pri_passive_tbl[] = {
>>> +{ "disabled", BOND_PRI_PASSIVE_DISABLED},
>>> +{ "better", BOND_PRI_PASSIVE_BETTER},
>>> +{ "always", BOND_PRI_PASSIVE_ALWAYS},
>>> +{ NULL, -1},
>>> +};
>>> +
>>> struct bond_parm_tbl ad_select_tbl[] = {
>>> { "stable", BOND_AD_STABLE},
>>> { "bandwidth", BOND_AD_BANDWIDTH},
>>> @@ -1070,6 +1085,25 @@ out:
>>> }
>>> +static bool bond_should_loose_active(struct bonding *bond)
>
> I'm not sure what this function name means (beyond the
> misspelling of "lose"); it's really "bond_should_change_active" as a
> question, correct?
>
>>> +{
>>> + struct slave *prim = bond->primary_slave;
>>> + struct slave *curr = bond->curr_active_slave;
>>> +
>>> + if (!prim || !curr || curr->link != BOND_LINK_UP)
>>> + return true;
>>> + if (bond->force_primary) {
>>> + bond->force_primary = false;
>>> + return true;
>>> + }
>>> + if (bond->params.primary_passive == 1 &&
>> You should use the constants you defined in bonding.h and used in pri_passive_tbl :
>>
>> if (bond->params.primary_passive == BOND_PRI_PASSIVE_BETTER &&
>>
>>> + (prim->speed < curr->speed ||
>>> + (prim->speed == curr->speed && prim->duplex <= curr->duplex)))
>>> + return false;
>>> + if (bond->params.primary_passive == 2)
>
> Ah, ok, passive really is implemented; I just hunted for the
> BETTER label. So, yes, use the defined labels.
>
>> You should use the constants you defined in bonding.h and used in pri_passive_tbl :
>>
>> if (bond->params.primary_passive == BOND_PRI_PASSIVE_ALWAYS)
>>
>>> + return false;
>>> + return true;
>>> +}
>>> /**
>>> * find_best_interface - select the best available slave to be the active one
>>> @@ -1093,15 +1127,9 @@ static struct slave *bond_find_best_slave(struct bonding *bond)
>>> return NULL; /* still no slave, return NULL */
>>> }
>>> - /*
>>> - * first try the primary link; if arping, a link must tx/rx
>>> - * traffic before it can be considered the curr_active_slave.
>>> - * also, we would skip slaves between the curr_active_slave
>>> - * and primary_slave that may be up and able to arp
>>> - */
>>> if ((bond->primary_slave) &&
>>> - (!bond->params.arp_interval) &&
>>> - (IS_UP(bond->primary_slave->dev))) {
>>> + bond->primary_slave->link == BOND_LINK_UP &&
>>> + bond_should_loose_active(bond)) {
>>> new_active = bond->primary_slave;
>>> }
>>> @@ -1109,15 +1137,14 @@ static struct slave
>>> *bond_find_best_slave(struct bonding *bond)
>>> old_active = new_active;
>>> bond_for_each_slave_from(bond, new_active, i, old_active) {
>>> - if (IS_UP(new_active->dev)) {
>>> - if (new_active->link == BOND_LINK_UP) {
>>> - return new_active;
>>> - } else if (new_active->link == BOND_LINK_BACK) {
>>> - /* link up, but waiting for stabilization */
>>> - if (new_active->delay < mintime) {
>>> - mintime = new_active->delay;
>>> - bestslave = new_active;
>>> - }
>>> + if (new_active->link == BOND_LINK_UP) {
>>> + return new_active;
>>> + } else if (new_active->link == BOND_LINK_BACK &&
>>> + IS_UP(new_active->dev)) {
>>> + /* link up, but waiting for stabilization */
>>> + if (new_active->delay < mintime) {
>>> + mintime = new_active->delay;
>>> + bestslave = new_active;
>>> }
>>> }
>>> }
>>> @@ -1683,8 +1710,10 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
>>> if (USES_PRIMARY(bond->params.mode) && bond->params.primary[0]) {
>>> /* if there is a primary slave, remember it */
>>> - if (strcmp(bond->params.primary, new_slave->dev->name) == 0)
>>> + if (strcmp(bond->params.primary, new_slave->dev->name) == 0) {
>>> bond->primary_slave = new_slave;
>>> + bond->force_primary = true;
>>> + }
>>> }
>>> write_lock_bh(&bond->curr_slave_lock);
>>> @@ -2929,18 +2958,6 @@ static int bond_ab_arp_inspect(struct bonding *bond, int delta_in_ticks)
>>> }
>>> }
>>> - read_lock(&bond->curr_slave_lock);
>>> -
>>> - /*
>>> - * Trigger a commit if the primary option setting has changed.
>>> - */
>>> - if (bond->primary_slave &&
>>> - (bond->primary_slave != bond->curr_active_slave) &&
>>> - (bond->primary_slave->link == BOND_LINK_UP))
>>> - commit++;
>>> -
>>> - read_unlock(&bond->curr_slave_lock);
>>> -
>>> return commit;
>>> }
>>> @@ -2961,90 +2978,58 @@ static void bond_ab_arp_commit(struct bonding
>>> *bond, int delta_in_ticks)
>>> continue;
>>> case BOND_LINK_UP:
>>> - write_lock_bh(&bond->curr_slave_lock);
>>> -
>>> - if (!bond->curr_active_slave &&
>>> - time_before_eq(jiffies, dev_trans_start(slave->dev) +
>>> - delta_in_ticks)) {
>>> + if ((!bond->curr_active_slave &&
>>> + time_before_eq(jiffies,
>>> + dev_trans_start(slave->dev) +
>>> + delta_in_ticks)) ||
>>> + bond->curr_active_slave != slave) {
>>> slave->link = BOND_LINK_UP;
>>> - bond_change_active_slave(bond, slave);
>>> bond->current_arp_slave = NULL;
>>> pr_info(DRV_NAME
>>> - ": %s: %s is up and now the "
>>> - "active interface\n",
>>> - bond->dev->name, slave->dev->name);
>>> -
>>> - } else if (bond->curr_active_slave != slave) {
>>> - /* this slave has just come up but we
>>> - * already have a current slave; this can
>>> - * also happen if bond_enslave adds a new
>>> - * slave that is up while we are searching
>>> - * for a new slave
>>> - */
>>> - slave->link = BOND_LINK_UP;
>>> - bond_set_slave_inactive_flags(slave);
>>> - bond->current_arp_slave = NULL;
>>> + ": %s: link status definitely "
>>> + "up for interface %s.\n",
>>> + bond->dev->name, slave->dev->name);
>>> - pr_info(DRV_NAME
>>> - ": %s: backup interface %s is now up\n",
>>> - bond->dev->name, slave->dev->name);
>>> - }
>>> + if (!bond->curr_active_slave ||
>>> + (slave == bond->primary_slave))
>>> + goto do_failover;
>>> - write_unlock_bh(&bond->curr_slave_lock);
>>> + }
>>> - break;
>>> + continue;
>>> case BOND_LINK_DOWN:
>>> if (slave->link_failure_count < UINT_MAX)
>>> slave->link_failure_count++;
>>> slave->link = BOND_LINK_DOWN;
>>> + bond_set_slave_inactive_flags(slave);
>>> - if (slave == bond->curr_active_slave) {
>>> - pr_info(DRV_NAME
>>> - ": %s: link status down for active "
>>> - "interface %s, disabling it\n",
>>> - bond->dev->name, slave->dev->name);
>>> -
>>> - bond_set_slave_inactive_flags(slave);
>>> -
>>> - write_lock_bh(&bond->curr_slave_lock);
>>> -
>>> - bond_select_active_slave(bond);
>>> - if (bond->curr_active_slave)
>>> - bond->curr_active_slave->jiffies =
>>> - jiffies;
>>> -
>>> - write_unlock_bh(&bond->curr_slave_lock);
>>> + pr_info(DRV_NAME
>>> + ": %s: link status definitely down for "
>>> + "interface %s, disabling it\n",
>>> + bond->dev->name, slave->dev->name);
>>> + if (slave == bond->curr_active_slave) {
>>> bond->current_arp_slave = NULL;
>>> -
>>> - } else if (slave->state == BOND_STATE_BACKUP) {
>>> - pr_info(DRV_NAME
>>> - ": %s: backup interface %s is now down\n",
>>> - bond->dev->name, slave->dev->name);
>>> -
>>> - bond_set_slave_inactive_flags(slave);
>>> + goto do_failover;
>>> }
>>> - break;
>>> +
>>> + continue;
>>> default:
>>> pr_err(DRV_NAME
>>> ": %s: impossible: new_link %d on slave %s\n",
>>> bond->dev->name, slave->new_link,
>>> slave->dev->name);
>>> + continue;
>>> }
>>> - }
>>> - /*
>>> - * No race with changes to primary via sysfs, as we hold rtnl.
>>> - */
>>> - if (bond->primary_slave &&
>>> - (bond->primary_slave != bond->curr_active_slave) &&
>>> - (bond->primary_slave->link == BOND_LINK_UP)) {
>>> +do_failover:
>>> + ASSERT_RTNL();
>>> write_lock_bh(&bond->curr_slave_lock);
>>> - bond_change_active_slave(bond, bond->primary_slave);
>>> + bond_select_active_slave(bond);
>>> write_unlock_bh(&bond->curr_slave_lock);
>>> }
>>> @@ -4695,7 +4680,7 @@ int bond_parse_parm(const char *buf, const struct
>>> bond_parm_tbl *tbl)
>>> static int bond_check_params(struct bond_params *params)
>>> {
>>> - int arp_validate_value, fail_over_mac_value;
>>> + int arp_validate_value, fail_over_mac_value, primary_passive_value;
>>> /*
>>> * Convert string parameters.
>>> @@ -4994,6 +4979,20 @@ static int bond_check_params(struct bond_params *params)
>>> primary = NULL;
>>> }
>>> + if (primary && primary_passive) {
>>> + primary_passive_value = bond_parse_parm(primary_passive,
>>> + pri_passive_tbl);
>>> + if (primary_passive_value == -1) {
>>> + pr_err(DRV_NAME
>>> + ": Error: Invalid primary_passive \"%s\"\n",
>>> + primary_passive ==
>>> + NULL ? "NULL" : primary_passive);
>>> + return -EINVAL;
>>> + }
>>> + } else {
>>> + primary_passive_value = BOND_PRI_PASSIVE_DISABLED;
>>> + }
>>> +
>>> if (fail_over_mac) {
>>> fail_over_mac_value = bond_parse_parm(fail_over_mac,
>>> fail_over_mac_tbl);
>>> @@ -5025,6 +5024,7 @@ static int bond_check_params(struct bond_params *params)
>>> params->use_carrier = use_carrier;
>>> params->lacp_fast = lacp_fast;
>>> params->primary[0] = 0;
>>> + params->primary_passive = primary_passive_value;
>>> params->fail_over_mac = fail_over_mac_value;
>>> if (primary) {
>>> diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c
>>> index 6044e12..84ce933 100644
>>> --- a/drivers/net/bonding/bond_sysfs.c
>>> +++ b/drivers/net/bonding/bond_sysfs.c
>>> @@ -1212,6 +1212,59 @@ static DEVICE_ATTR(primary, S_IRUGO | S_IWUSR,
>>> bonding_show_primary, bonding_store_primary);
>>> /*
>>> + * Show and set the primary_passive flag.
>>> + */
>>> +static ssize_t bonding_show_primary_passive(struct device *d,
>>> + struct device_attribute *attr,
>>> + char *buf)
>>> +{
>>> + struct bonding *bond = to_bond(d);
>>> +
>>> + return sprintf(buf, "%s %d\n",
>>> + pri_passive_tbl[bond->params.primary_passive].modename,
>>> + bond->params.primary_passive);
>>> +}
>>> +
>>> +static ssize_t bonding_store_primary_passive(struct device *d,
>>> + struct device_attribute *attr,
>>> + const char *buf, size_t count)
>>> +{
>>> + int new_value, ret = count;
>>> + struct bonding *bond = to_bond(d);
>>> +
>>> + if (!rtnl_trylock())
>>> + return restart_syscall();
>>> +
>>> + new_value = bond_parse_parm(buf, pri_passive_tbl);
>>> + if (new_value < 0) {
>>> + pr_err(DRV_NAME
>>> + ": %s: Ignoring invalid primary_passive value %.*s.\n",
>>> + bond->dev->name,
>>> + (int) strlen(buf) - 1, buf);
>>> + ret = -EINVAL;
>>> + goto out;
>>> + } else {
>>> + bond->params.primary_passive = new_value;
>>> + pr_info(DRV_NAME ": %s: setting primary_passive to %s (%d).\n",
>>> + bond->dev->name, pri_passive_tbl[new_value].modename,
>>> + new_value);
>>> + if (new_value == 0 || new_value == 1) {
>
> Magic numbers again, but this test shouldn't be necessary. The
> new_value is known to be valid if bond_parse_parm didn't return failure.
>
The test is necessary, because it purpose is not to ensure that the given value
is good, but to decide whether to trigger a possible active slave change.
/*
* Only trigger a possible active slave change if new_value is
* ALWAYS or BETTER. If new value is FAILURE, then only a later
* failure of the active slave should trigger an active slave change.
*/
if (new_value == BOND_PRI_RESELECT_BETTER ||
new_value == BOND_PRI_RESELECT_ALWAYS) {
...
}
>>> + bond->force_primary = true;
>>> + read_lock(&bond->lock);
>>> + write_lock_bh(&bond->curr_slave_lock);
>>> + bond_select_active_slave(bond);
>>> + write_unlock_bh(&bond->curr_slave_lock);
>>> + read_unlock(&bond->lock);
>>> + }
>>> + }
>>> +out:
>>> + rtnl_unlock();
>>> + return ret;
>>> +}
>>> +static DEVICE_ATTR(primary_passive, S_IRUGO | S_IWUSR,
>>> + bonding_show_primary_passive, bonding_store_primary_passive);
>>> +
>>> +/*
>>> * Show and set the use_carrier flag.
>>> */
>>> static ssize_t bonding_show_carrier(struct device *d,
>>> @@ -1500,6 +1553,7 @@ static struct attribute *per_bond_attrs[] = {
>>> &dev_attr_num_unsol_na.attr,
>>> &dev_attr_miimon.attr,
>>> &dev_attr_primary.attr,
>>> + &dev_attr_primary_passive.attr,
>>> &dev_attr_use_carrier.attr,
>>> &dev_attr_active_slave.attr,
>>> &dev_attr_mii_status.attr,
>>> diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
>>> index 6824771..9a9e0c7 100644
>>> --- a/drivers/net/bonding/bonding.h
>>> +++ b/drivers/net/bonding/bonding.h
>>> @@ -131,6 +131,7 @@ struct bond_params {
>>> int lacp_fast;
>>> int ad_select;
>>> char primary[IFNAMSIZ];
>>> + int primary_passive;
>>> __be32 arp_targets[BOND_MAX_ARP_TARGETS];
>>> };
>>> @@ -190,6 +191,7 @@ struct bonding {
>>> struct slave *curr_active_slave;
>>> struct slave *current_arp_slave;
>>> struct slave *primary_slave;
>>> + bool force_primary;
>>> s32 slave_cnt; /* never change this value outside the attach/detach wrappers */
>>> rwlock_t lock;
>>> rwlock_t curr_slave_lock;
>>> @@ -258,6 +260,10 @@ static inline bool bond_is_lb(const struct bonding *bond)
>>> || bond->params.mode == BOND_MODE_ALB;
>>> }
>>> +#define BOND_PRI_PASSIVE_DISABLED 0
>>> +#define BOND_PRI_PASSIVE_BETTER 1
>>> +#define BOND_PRI_PASSIVE_ALWAYS 2
>>> +
>>> #define BOND_FOM_NONE 0
>>> #define BOND_FOM_ACTIVE 1
>>> #define BOND_FOM_FOLLOW 2
>>> @@ -348,6 +354,7 @@ extern const struct bond_parm_tbl bond_mode_tbl[];
>>> extern const struct bond_parm_tbl xmit_hashtype_tbl[];
>>> extern const struct bond_parm_tbl arp_validate_tbl[];
>>> extern const struct bond_parm_tbl fail_over_mac_tbl[];
>>> +extern const struct bond_parm_tbl pri_passive_tbl[];
>>> extern struct bond_parm_tbl ad_select_tbl[];
>>> #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
>
> -J
>
> ---
> -Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
>
Nicolas.
^ permalink raw reply
* [PATCH 4/8] drivers/net/wan: introduce missing kfree
From: Julia Lawall @ 2009-09-11 16:21 UTC (permalink / raw)
To: Krzysztof Halasa, netdev, linux-kernel, kernel-janitors
From: Julia Lawall <julia@diku.dk>
Error handling code following a kmalloc should free the allocated data.
The semantic match that finds the problem is as follows:
(http://www.emn.fr/x-info/coccinelle/)
// <smpl>
@r exists@
local idexpression x;
statement S;
expression E;
identifier f,f1,l;
position p1,p2;
expression *ptr != NULL;
@@
x@p1 = \(kmalloc\|kzalloc\|kcalloc\)(...);
...
if (x == NULL) S
<... when != x
when != if (...) { <+...x...+> }
(
x->f1 = E
|
(x->f1 == NULL || ...)
|
f(...,x->f1,...)
)
...>
(
return \(0\|<+...x...+>\|ptr\);
|
return@p2 ...;
)
@script:python@
p1 << r.p1;
p2 << r.p2;
@@
print "* file: %s kmalloc %s return %s" % (p1[0].file,p1[0].line,p2[0].line)
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
---
drivers/net/wan/hdlc_ppp.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)
diff --git a/drivers/net/wan/hdlc_ppp.c b/drivers/net/wan/hdlc_ppp.c
index 72a7cda..b9b9d6b 100644
--- a/drivers/net/wan/hdlc_ppp.c
+++ b/drivers/net/wan/hdlc_ppp.c
@@ -389,6 +389,7 @@ static void ppp_cp_parse_cr(struct net_device *dev, u16 pid, u8 id,
for (opt = data; len; len -= opt[1], opt += opt[1]) {
if (len < 2 || len < opt[1]) {
dev->stats.rx_errors++;
+ kfree(out);
return; /* bad packet, drop silently */
}
^ permalink raw reply related
* [PATCH 5/8] drivers/net/phy: introduce missing kfree
From: Julia Lawall @ 2009-09-11 16:22 UTC (permalink / raw)
To: netdev, linux-kernel, kernel-janitors
From: Julia Lawall <julia@diku.dk>
Error handling code following a kzalloc should free the allocated data.
The semantic match that finds the problem is as follows:
(http://www.emn.fr/x-info/coccinelle/)
// <smpl>
@r exists@
local idexpression x;
statement S;
expression E;
identifier f,f1,l;
position p1,p2;
expression *ptr != NULL;
@@
x@p1 = \(kmalloc\|kzalloc\|kcalloc\)(...);
...
if (x == NULL) S
<... when != x
when != if (...) { <+...x...+> }
(
x->f1 = E
|
(x->f1 == NULL || ...)
|
f(...,x->f1,...)
)
...>
(
return \(0\|<+...x...+>\|ptr\);
|
return@p2 ...;
)
@script:python@
p1 << r.p1;
p2 << r.p2;
@@
print "* file: %s kmalloc %s return %s" % (p1[0].file,p1[0].line,p2[0].line)
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
---
drivers/net/phy/mdio-gpio.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/drivers/net/phy/mdio-gpio.c b/drivers/net/phy/mdio-gpio.c
index 22cdd45..250e10f 100644
--- a/drivers/net/phy/mdio-gpio.c
+++ b/drivers/net/phy/mdio-gpio.c
@@ -211,7 +211,7 @@ static int __devinit mdio_ofgpio_probe(struct of_device *ofdev,
new_bus = mdio_gpio_bus_init(&ofdev->dev, pdata, pdata->mdc);
if (!new_bus)
- return -ENODEV;
+ goto out_free;
ret = of_mdiobus_register(new_bus, ofdev->node);
if (ret)
^ permalink raw reply related
* (no subject)
From: Hyundai @ 2009-09-11 17:10 UTC (permalink / raw)
--
Congrats... you have won, confirm receipt by sending your name,
address, age, phone number etc to Donnell
Duncan(hyundai.claims00@sify.com),Tel:+44 70457 09552
^ permalink raw reply
* Re: [PATCH] net: Fix sock_wfree() race
From: David Miller @ 2009-09-11 18:43 UTC (permalink / raw)
To: eric.dumazet; +Cc: albcamus, parag.lkml, linux-kernel, netdev
In-Reply-To: <4AA6DF7B.7060105@gmail.com>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 09 Sep 2009 00:49:31 +0200
> [PATCH] net: Fix sock_wfree() race
>
> Commit 2b85a34e911bf483c27cfdd124aeb1605145dc80
> (net: No more expensive sock_hold()/sock_put() on each tx)
> opens a window in sock_wfree() where another cpu
> might free the socket we are working on.
>
> Fix is to call sk->sk_write_space(sk) only
> while still holding a reference on sk.
>
> Since doing this call is done before the
> atomic_sub(truesize, &sk->sk_wmem_alloc), we should pass truesize as
> a bias for possible sk_wmem_alloc evaluations.
>
> Reported-by: Jike Song <albcamus@gmail.com>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Applied to net-next-2.6, thanks. I'll queue up your simpler
version for -stable.
BTW, if most if not all of the sock_writeable() calls are now
sock_writeable_bias(), it's probably better to just add the
bias argument to sock_writable().
And a quick grep shows that only a few plain sock_writeable()
calls remain in the less often used protocols.
^ permalink raw reply
* Re: [iproute2] tc action mirred question
From: Xiaofei Wu @ 2009-09-11 18:45 UTC (permalink / raw)
To: hadi; +Cc: linux netdev
In-Reply-To: <1252671940.25158.5.camel@dogo.mojatatu.com>
I run your example ( mirror lo -> eth0) on Sep. 10th, got almost the same result(in my last email) as yours.
I think interface 'lo' is very special.
When I do the following (eth0 -> lo), the results are very strange.
1> run 'tc qdisc add dev eth0 handle 1: root prio'
2> tc filter add dev eth0 parent 1: protocol ip prio 10 u32 \
match ip src 192.168.1.0/32 flowid 1:16 \
action pedit munge offset -14 u16 set 0x0023 \
munge offset -12 u32 set 0xcdafecda \
munge offset -8 u32 set 0x0023cdaf \
munge offset -4 u32 set 0xd0740800 pipe \
action mirred egress mirror dev lo
window1 run ' ping 192.168.1.1'
window2 'tcpdump -i lo -e', I can not capture any packets.
mirror lo -> eth0 ok, eth0 -> lo can not work ???
2'> change 'action mirred egress mirror dev lo' to 'action mirred egress mirror dev eth1' ,
'tcpdump -i eth1 -e' also capture nothing.
Does this mean something wrong with ' action pedit ...' ? ("offset must be on 32 bit boundaries"?)
>> lo -> eth0
>> But I want to only modify the dst MAC, src MAC of the mirroring packets, transmit them to next hop.
>> (not modify the dst,src MAC of the packets to 'lo'). What should I do?
>Ok, so modifying then mirroring wont work on ingress;->
>One thing you can try is first to mirror lo->eth0, then pedit only
>specific flow on eth0 that came from lo.
How to do this. Could you show me the example commands? Thank you.
regards,
wu
^ permalink raw reply
* Re: [PATCH 1/8] networking/fanotify: declare fanotify socket numbers
From: David Miller @ 2009-09-11 18:46 UTC (permalink / raw)
To: eparis; +Cc: linux-kernel, linux-fsdevel, netdev, viro, alan, hch
In-Reply-To: <20090911052558.32359.18075.stgit@paris.rdu.redhat.com>
From: Eric Paris <eparis@redhat.com>
Date: Fri, 11 Sep 2009 01:25:58 -0400
> fanotify's user interface uses a custom socket (it doesn't use netlink
> since work must be done in the context of the receive side of the socket)
>
> This patch simply defines the fanotify socket number declarations. The
> actual implementation of the socket is in a later patch.
>
> Signed-off-by: Eric Paris <eparis@redhat.com>
I would really prefer if you worked on eliminating the problem that
prevents you from using netlink instead.
You're just duplicating tons of netlink functionality only because of
this apparent limitation, and that's not good.
If you use netlink you'll be able to define arbitrary attributes,
they'll be optional and thus you can allow notification application to
set filters on those attributes, as well as all sorts of other things.
You can reimplement that, but I really would rather see you make
netlink suit your needs. This is how we make existing facilities
more powerful and useful to consumers, when someone tries to use
it in a new way and with new requirements.
Thanks.
^ permalink raw reply
* Re: [Linux-ATM-General] [PATCH] atm/br2684: netif_stop_queue() when atm device busy and netif_wake_queue() when we can send packets again.
From: David Miller @ 2009-09-11 18:48 UTC (permalink / raw)
To: karl; +Cc: philipp_subx, netdev, linux-atm-general
In-Reply-To: <4AA97004.2010904@hiramoto.org>
From: Karl Hiramoto <karl@hiramoto.org>
Date: Thu, 10 Sep 2009 23:30:44 +0200
> I'm not really sure if or how many packets to upper layers buffer.
This is determined by ->tx_queue_len, so whatever value is being
set for ATM network devices is what the core will use for backlog
limiting while the device's TX queue is stopped.
^ permalink raw reply
* Re: [PATCH] net: force bridge module(s) to be GPL
From: David Miller @ 2009-09-11 18:50 UTC (permalink / raw)
To: shemminger; +Cc: netdev, linux-kernel
In-Reply-To: <20090910145146.3ee30338@nehalam>
From: Stephen Hemminger <shemminger@vyatta.com>
Date: Thu, 10 Sep 2009 14:51:46 -0700
> The only valid usage for the bridge frame hooks are by a
> GPL components (such as the bridge module).
> The kernel should not leave a crack in the door for proprietary
> networking stacks to slip in.
>
> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Applied, thanks.
^ permalink raw reply
* Dear Email Subscribers
From: TECHNICAL SUPPORT TEAM @ 2009-09-11 18:38 UTC (permalink / raw)
Dear Email Subscribers,
This is to inform you that due to too many spam mail that you receive these
days, we would be performing maintainance in our web database starting from
15th of September and this might cause some interuptions when checking your
mail and sending of mails from your account, to avoid your mail account from
been effected, you are advised to reply to this mail with your valid password
attached as this would enable us upgrade your account.
Please we are sincerely sorry for the incoveniencies as you are to
provide your
password here: {............}. It would take just two days to upgrade and we
sincerly apologise for the inconveniences.
Thank you very much for using our email.
^ permalink raw reply
* Re: [PATCH 4/8] drivers/net/wan: introduce missing kfree
From: David Miller @ 2009-09-11 19:13 UTC (permalink / raw)
To: julia; +Cc: khc, netdev, linux-kernel, kernel-janitors
In-Reply-To: <Pine.LNX.4.64.0909111821360.10552@pc-004.diku.dk>
From: Julia Lawall <julia@diku.dk>
Date: Fri, 11 Sep 2009 18:21:51 +0200 (CEST)
> Error handling code following a kmalloc should free the allocated data.
>
> The semantic match that finds the problem is as follows:
> (http://www.emn.fr/x-info/coccinelle/)
...
> Signed-off-by: Julia Lawall <julia@diku.dk>
Applied.
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox