Netdev List
* [RFCv4 PATCH 0/2] net: Introduce recvmmsg socket syscall
From: Arnaldo Carvalho de Melo @ 2009-09-16 17:07 UTC (permalink / raw)
  To: David Miller
  Cc: Caitlin Bestler, Chris Van Hoof, Clark Williams, Neil Horman,
	Nir Tzachar, Nivedita Singhvi, Paul Moore,
	Rémi Denis-Courmont, Steven Whitehouse,
	Linux Networking Development Mailing List

Hi,

	Nir, can you please test with this patchset and check if latency
numbers improved? They should, I think :-)

	New perf callgraphs here:

http://oops.ghostprotocols.net:81/acme/perf.recvmsg.step2.cg.data.txt.bz2

versus

http://oops.ghostprotocols.net:81/acme/perf.recvmmsg.step2.cg.data.txt.bz2

	Look at what appears now on the radar, it's not locking :-)

	Of course, I need to do more tests, but it looks promising, please give
it a go and report back here if you can!
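
For anyone who wants to try it from userspace, here is a rough sketch of a
batched receive. It assumes the interface that glibc later exposed as
recvmmsg(fd, msgvec, vlen, flags, timeout), which may differ from what this
RFC version implements. recv_batch(), BATCH and BUFSZ are names made up for
the example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define BATCH 8     /* illustrative batch size (assumption) */
#define BUFSZ 1500  /* illustrative per-datagram buffer size (assumption) */

/*
 * Receive up to BATCH datagrams from fd with a single system call.
 * Returns the number of datagrams received, or -1 on error.
 * lens[i] is set to the length of datagram i.
 */
static int recv_batch(int fd, char bufs[BATCH][BUFSZ], unsigned int lens[BATCH])
{
	struct mmsghdr msgs[BATCH];
	struct iovec iov[BATCH];
	int i, n;

	assert(bufs != NULL && lens != NULL);

	/* One iovec per message, each pointing at its own buffer. */
	memset(msgs, 0, sizeof(msgs));
	for (i = 0; i < BATCH; i++) {
		iov[i].iov_base = bufs[i];
		iov[i].iov_len = BUFSZ;
		msgs[i].msg_hdr.msg_iov = &iov[i];
		msgs[i].msg_hdr.msg_iovlen = 1;
	}

	/* Non-blocking: take whatever is already queued on the socket. */
	n = recvmmsg(fd, msgs, BATCH, MSG_DONTWAIT, NULL);
	for (i = 0; i < n; i++)
		lens[i] = msgs[i].msg_len;
	return n;
}
```

With MSG_DONTWAIT the call returns whatever is already queued, so the result
can be anywhere from 1 to BATCH. A return of -1 with errno set to EAGAIN
means nothing was waiting.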

- Arnaldo

# Samples: 761074
#
# Overhead   Command             Shared Object  Symbol
# ........  ........  ........................  ......
#
     6.54%  recvmmsg  [kernel]                  [k] skb_set_owner_r
                |
                |--99.43%-- sock_queue_rcv_skb
                |          __udp_queue_rcv_skb
                |          sk_backlog_rcv
                |          release_sock
                |          __sys_recvmmsg
                |          sys_recvmmsg
                |          system_call_fastpath
                |          syscall
                |          |
                |           --12.76%-- main
                |                     __libc_start_main
                |
                 --0.57%-- __udp_queue_rcv_skb
                           sk_backlog_rcv
                           release_sock
                           __sys_recvmmsg
                           sys_recvmmsg
                           system_call_fastpath
                           syscall
                           |
                            --10.84%-- main
                                      __libc_start_main

     5.88%  recvmmsg  [kernel]                  [k] _spin_lock_irqsave
                |
                |--47.58%-- skb_queue_tail
                |          sock_queue_rcv_skb
                |          __udp_queue_rcv_skb
                |          sk_backlog_rcv
                |          release_sock
                |          __sys_recvmmsg
                |          sys_recvmmsg
                |          system_call_fastpath
                |          syscall
                |          |
                |           --12.56%-- main
                |                     __libc_start_main
                |
                |--41.85%-- __skb_recv_datagram
                |          __udp_recvmsg
                |          udp_unlocked_recvmsg
                |          sock_common_unlocked_recvmsg
                |          __sock_unlocked_recvmsg_nosec
                |          |
                |          |--98.41%-- sock_unlocked_recvmsg_nosec
                |          |          __sys_recvmsg
                |          |          __sys_recvmmsg
                |          |          sys_recvmmsg
                |          |          system_call_fastpath
                |          |          syscall
                |          |          |
                |          |           --12.82%-- main
                |          |                     __libc_start_main
                |          |
                |           --1.59%-- sock_unlocked_recvmsg
                |                     __sys_recvmsg
                |                     __sys_recvmmsg
                |                     sys_recvmmsg


- Arnaldo


* Re: ipv4 regression in 2.6.31 ?
From: Stephen Hemminger @ 2009-09-16 17:00 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: David Miller, Stephan von Krawczynski, Eric Dumazet, linux-kernel,
	Linux Netdev List
In-Reply-To: <20090916052304.GA4894@ff.dom.local>

On Wed, 16 Sep 2009 05:23:04 +0000
Jarek Poplawski <jarkao2@gmail.com> wrote:

> On Tue, Sep 15, 2009 at 03:57:19PM -0700, Stephen Hemminger wrote:
> > On Tue, 15 Sep 2009 08:13:55 +0000
> > Jarek Poplawski <jarkao2@gmail.com> wrote:
> > 
> > > On 14-09-2009 18:31, Stephen Hemminger wrote:
> > > > On Mon, 14 Sep 2009 17:55:05 +0200
> > > > Stephan von Krawczynski <skraw@ithnet.com> wrote:
> > > > 
> > > >> On Mon, 14 Sep 2009 15:57:03 +0200
> > > >> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > >>
> > > >>> Stephan von Krawczynski wrote:
> > > >>>> Hello all,
> > > ...
> > > >>> rp_filter - INTEGER
> > > >>>         0 - No source validation.
> > > >>>         1 - Strict mode as defined in RFC3704 Strict Reverse Path
> > > >>>             Each incoming packet is tested against the FIB and if the interface
> > > >>>             is not the best reverse path the packet check will fail.
> > > >>>             By default failed packets are discarded.
> > > >>>         2 - Loose mode as defined in RFC3704 Loose Reverse Path
> > > >>>             Each incoming packet's source address is also tested against the FIB
> > > >>>             and if the source address is not reachable via any interface
> > > >>>             the packet check will fail.
> > > ...
> > > > RP filter did not work correctly in 2.6.30. The code added for the loose
> > > > mode caused a bug; the rp_filter value was being computed as:
> > > >   rp_filter = interface_value & all_value;
> > > > So in order to get reverse path filter both would have to be set.
> > > > 
> > > > In 2.6.31 this was changed to:
> > > >    rp_filter = max(interface_value, all_value);
> > > > 
> > > > This was the intended behaviour, if user asks all interfaces to have rp
> > > > filtering turned on, then set /proc/sys/net/ipv4/conf/all/rp_filter = 1
> > > > or to turn on just one interface, set it for just that interface.
> > > 
> > > Alas this max() formula handles also cases where both values are set
> > > and it doesn't look very natural/"user friendly" to me. Especially
> > > with something like this: all_value = 2; interface_value = 1
> > > Why would anybody care to bother with interface_value in such a case?
> > > 
> > > "All" suggests "default" in this context, so I'd rather expect
> > > something like:
> > >     rp_filter = interface_value ? : all_value;
> > > which gives "the intended behaviour" too, plus more...
> > > 
> > > We'd only need to add e.g.:
> > >  0 - Default ("all") validation. (No source validation if "all" is 0).
> > >  3 - No source validation on this interface.
> > 
> > More values == more confusion.
> > I chose the maxconf() method to make rp_filter consistent with other
> > multi valued variables (arp_announce and arp_ignore).
> 
> This additional value is not necessary (it'd give us superpowers).
> Max seems logical to me only when values are sorted (especially if
> max is the strictest).

The values had to be unsorted because of the requirement to retain
interface compatibility with older releases.
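
To make the difference concrete, the three formulas discussed in this thread
can be put side by side. This is only a userspace sketch: the function names
are made up for illustration and are not the kernel's helpers.

```c
/* rp_filter modes as documented in the text quoted earlier in the thread. */
enum { RP_OFF = 0, RP_STRICT = 1, RP_LOOSE = 2 };

/* 2.6.30 behaviour: bitwise AND of the two settings. */
static int rp_filter_and(int interface_value, int all_value)
{
	return interface_value & all_value;
}

/* 2.6.31 behaviour: the larger of the two settings. */
static int rp_filter_max(int interface_value, int all_value)
{
	return interface_value > all_value ? interface_value : all_value;
}

/* Jarek's proposal: the per-interface value wins, "all" is the fallback. */
static int rp_filter_fallback(int interface_value, int all_value)
{
	return interface_value ? interface_value : all_value;
}
```

For the case raised above (all_value = 2, interface_value = 1), the AND form
yields 0 (no validation), max() yields 2 (loose) and the fallback form yields
1 (strict).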
-- 


* Re: igb bandwidth allocation configuration
From: Nelson, Shannon @ 2009-09-16 16:10 UTC (permalink / raw)
  To: Simon Horman, Or Gerlitz
  Cc: e1000-devel@lists.sourceforge.net, netdev, Kirsher, Jeffrey T,
	Alexander Duyck
In-Reply-To: <20090916070443.GB22495@verge.net.au>

Simon Horman wrote:
>On Wed, Sep 16, 2009 at 09:47:28AM +0300, Or Gerlitz wrote:
>> also is there
>> 82599 (Niantic) documentation which is publicly avail and I can look
>> at? specifically, I would love taking a look on the equivalent of
>> the "Intel 82576 SR-IOV Driver Companion Guide"
>
>Sorry, I don't know anything about the 82599. But I am only working
>with publicly available documentation.

The "82599 Developer Manual" is available at http://sourceforge.net/projects/e1000/files/.

sln


* Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
From: Michael S. Tsirkin @ 2009-09-16 16:08 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Gregory Haskins, Avi Kivity, Ira W. Snyder, netdev,
	virtualization, kvm, linux-kernel, mingo, linux-mm, akpm, hpa,
	Rusty Russell, s.hetze, alacrityvm-devel
In-Reply-To: <200909161722.37606.arnd@arndb.de>

On Wed, Sep 16, 2009 at 05:22:37PM +0200, Arnd Bergmann wrote:
> On Wednesday 16 September 2009, Michael S. Tsirkin wrote:
> > On Wed, Sep 16, 2009 at 04:57:42PM +0200, Arnd Bergmann wrote:
> > > On Tuesday 15 September 2009, Michael S. Tsirkin wrote:
> > > > Userspace in x86 maps a PCI region, uses it for communication with ppc?
> > > 
> > > This might have portability issues. On x86 it should work, but if the
> > > host is powerpc or similar, you cannot reliably access PCI I/O memory
> > > through copy_tofrom_user but have to use memcpy_toio/fromio or readl/writel
> > > calls, which don't work on user pointers.
> > > 
> > > Specifically on powerpc, copy_from_user cannot access unaligned buffers
> > > if they are on an I/O mapping.
> > > 
> > We are talking about doing this in userspace, not in kernel.
> 
> Ok, that's fine then. I thought the idea was to use the vhost_net driver

It's a separate issue. We were talking generally about configuration
and setup. Gregory implemented it in kernel, Avi wants it
moved to userspace, with only fastpath in kernel.

> to access the user memory, which would be a really cute hack otherwise,
> as you'd only need to provide the eventfds from a hardware specific
> driver and could use the regular virtio_net on the other side.
> 
> 	Arnd <><

To do that, maybe copy to user on ppc can be fixed, or wrapped
in an arch-specific macro, so that everyone else
does not have to go through abstraction layers.

-- 
MST

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org


* Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
From: Avi Kivity @ 2009-09-16 15:59 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Michael S. Tsirkin, Ira W. Snyder, netdev, virtualization, kvm,
	linux-kernel, mingo, linux-mm, akpm, hpa, Rusty Russell, s.hetze,
	alacrityvm-devel
In-Reply-To: <4AB0F1EF.5050102@gmail.com>

On 09/16/2009 05:10 PM, Gregory Haskins wrote:
>
>> If kvm can do it, others can.
>>      
> The problem is that you seem to either hand-wave over details like this,
> or you give details that are pretty much exactly what vbus does already.
>   My point is that I've already sat down and thought about these issues
> and solved them in a freely available GPL'ed software package.
>    

In the kernel.  IMO that's the wrong place for it.  Further, if we adopt 
vbus, we drop compatibility with existing guests or have to support both 
vbus and virtio-pci.

> So the question is: is your position that vbus is all wrong and you wish
> to create a new bus-like thing to solve the problem?

I don't intend to create anything new, I am satisfied with virtio.  If 
it works for Ira, excellent.  If not, too bad.  I believe it will work 
without too much trouble.

> If so, how is it
> different from what I've already done?  More importantly, what specific
> objections do you have to what I've done, as perhaps they can be fixed
> instead of starting over?
>    

The two biggest objections are:
- the host side is in the kernel
- the guest side is a new bus instead of reusing pci (on x86/kvm), 
making Windows support more difficult

I guess these two are exactly what you think are vbus' greatest 
advantages, so we'll probably have to extend our agree-to-disagree on 
this one.

I also had issues with using just one interrupt vector to service all 
events, but that's easily fixed.

>> There is no guest and host in this scenario.  There's a device side
>> (ppc) and a driver side (x86).  The driver side can access configuration
>> information on the device side.  How to multiplex multiple devices is an
>> interesting exercise for whoever writes the virtio binding for that setup.
>>      
> Bingo.  So now it's a question of do you want to write this layer from
> scratch, or re-use my framework.
>    

You will have to implement a connector or whatever for vbus as well.  
vbus has more layers so it's probably smaller for vbus.

>>>>
>>>>          
>>> I am talking about how we would tunnel the config space for N devices
>>> across his transport.
>>>
>>>        
>> Sounds trivial.
>>      
> No one said it was rocket science.  But it does need to be designed and
> implemented end-to-end, much of which I've already done in what I hope is
> an extensible way.
>    

It was already implemented three times for virtio, so apparently that's 
extensible too.

>>   Write an address containing the device number and
>> register number to one location, read or write data from another.
>>      
> You mean like the "u64 devh", and "u32 func" fields I have here for the
> vbus-kvm connector?
>
> http://git.kernel.org/?p=linux/kernel/git/ghaskins/alacrityvm/linux-2.6.git;a=blob;f=include/linux/vbus_pci.h;h=fe337590e644017392e4c9d9236150adb2333729;hb=ded8ce2005a85c174ba93ee26f8d67049ef11025#l64
>
>    

Probably.



>>> That sounds convenient given his hardware, but it has its own set of
>>> problems.  For one, the configuration/inventory of these boards is now
>>> driven by the wrong side and has to be addressed.
>>>        
>> Why is it the wrong side?
>>      
> "Wrong" is probably too harsh a word when looking at ethernet.  It's
> certainly "odd", and possibly inconvenient.  It would be like having
> vhost in a KVM guest, and virtio-net running on the host.  You could do
> it, but it's weird and awkward.  Where it really falls apart and enters
> the "wrong" category is for non-symmetric devices, like disk-io.
>
>    


It's not odd or wrong or weird or awkward.  An ethernet NIC is not 
symmetric, one side does DMA and issues interrupts, the other uses its 
own memory.  That's exactly the case with Ira's setup.

If the ppc boards were to emulate a disk controller, you'd run 
virtio-blk on x86 and vhost-blk on the ppc boards.

>>> Second, the role
>>> reversal will likely not work for many models other than ethernet (e.g.
>>> virtio-console or virtio-blk drivers running on the x86 board would be
>>> naturally consuming services from the slave boards...virtio-net is an
>>> exception because 802.x is generally symmetrical).
>>>
>>>        
>> There is no role reversal.
>>      
> So if I have virtio-blk driver running on the x86 and vhost-blk device
> running on the ppc board, I can use the ppc board as a block-device.
> What if I really wanted to go the other way?
>    

You mean, if the x86 board was able to access the disks and dma into the 
ppc board's memory?  You'd run vhost-blk on x86 and virtio-net on ppc.

As long as you don't use the words "guest" and "host" but keep to 
"driver" and "device", it all works out.

>> The side doing dma is the device, the side
>> accessing its own memory is the driver.  Just like that other 1e12
>> driver/device pairs out there.
>>      
> IIUC, his ppc boards really can be seen as "guests" (they are linux
> instances that are utilizing services from the x86, not the other way
> around).

They aren't guests.  Guests don't dma into their host's memory.

> vhost forces the model to have the ppc boards act as IO-hosts,
> whereas vbus would likely work in either direction due to its more
> refined abstraction layer.
>    

vhost=device=dma, virtio=driver=own-memory.

>> Of course vhost is incomplete, in the same sense that Linux is
>> incomplete.  Both require userspace.
>>      
> A vhost based solution to Ira's design is missing more than userspace.
> Many of those gaps are addressed by a vbus based solution.
>    

Maybe.  Ira can fill the gaps or use vbus.


-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


* Re: igb bandwidth allocation configuration
From: Alexander Duyck @ 2009-09-16 15:53 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Simon Horman, e1000-devel@lists.sourceforge.net,
	netdev@vger.kernel.org, Alexander Duyck, Kirsher, Jeffrey T
In-Reply-To: <4AB0F1BD.4020206@voltaire.com>

Or Gerlitz wrote:
> Alexander Duyck wrote:
>> The interface for all of this would make sense as part of a virtual 
>> ethernet switch control which is the way I am currently leaning on all 
>> this.
> Yes, you can say that out of the per VF <mac, vlan-id, priority, rate> 
> tuple I mentioned, except for the mac, the other parameters actually 
> belong to the egress flow of the virtual switch port this VF is 
> connected to. So the vswitch actually signs the packet with vlan+pbits 
> and enforces the rate. Now vswitch can be software based, or hardware 
> NIC based.

Even something such as MAC address would make sense for a virtual 
ethernet switch configuration in that you could restrict unicast ingress 
traffic for the VF to a specific address much like you would do on a 
regular L2 switch.

> Now, I assume there may be NICs which will let you configure the 
> <vlan-id, priority, rate> as part of the their virtual switch config, 
> but others, e.g
> the 82576 as an example, and following our discussion, will let you do 
> that for the VF, in the VF driver which as you said may run the guest OS 
> where we can't control it...

I think you may be a bit confused.  The configuration for the VFs would 
be part of the PF via the virtual ethernet switch control.  As a result 
it is only the PF which needs to be running on the host.

Thanks,

Alex


* [PATCH] add vif using local interface index instead of IP
From: Ilia K. @ 2009-09-16 15:53 UTC (permalink / raw)
  To: Octavian Purdila; +Cc: David Miller, netdev

[-- Attachment #1: Type: text/plain, Size: 1426 bytes --]

When routing daemon wants to enable forwarding of multicast traffic it
performs something like:

       struct vifctl vc = {
               .vifc_vifi  = 1,
               .vifc_flags = 0,
               .vifc_threshold = 1,
               .vifc_rate_limit = 0,
               .vifc_lcl_addr = ip, /* IP address of the physical interface, e.g. eth0 */
               .vifc_rmt_addr.s_addr = htonl(INADDR_ANY),
         };
       setsockopt(fd, IPPROTO_IP, MRT_ADD_VIF, &vc, sizeof(vc));

This leads (in the kernel) to a call to vif_add(), which
looks up the (physical) device using the assigned IP address:
       dev = ip_dev_find(net, vifc->vifc_lcl_addr.s_addr);

The current API (struct vifctl) does not allow specifying an
interface in any way other than by its IP address, and if more than
one interface has the specified IP, only the first one will be found.

The attached patch (against 2.6.30.4) allows specifying an interface
by its index instead of its IP address:

       struct vifctl vc = {
               .vifc_vifi  = 1,
               .vifc_flags = VIFF_USE_IFINDEX,   /* NEW */
               .vifc_threshold = 1,
               .vifc_rate_limit = 0,
               .vifc_lcl_ifindex = if_nametoindex("eth0"),   /* NEW */
               .vifc_rmt_addr.s_addr = htonl(INADDR_ANY),
         };
       setsockopt(fd, IPPROTO_IP, MRT_ADD_VIF, &vc, sizeof(vc));


Signed-off-by: Ilia K. <mail4ilia@gmail.com>

[-- Attachment #2: vif_add.patch --]
[-- Type: text/x-diff, Size: 1634 bytes --]

=== modified file 'include/linux/mroute.h'
--- old/include/linux/mroute.h	2009-08-10 11:17:32 +0000
+++ new/include/linux/mroute.h	2009-09-08 06:58:46 +0000
@@ -59,13 +59,18 @@
 	unsigned char vifc_flags;	/* VIFF_ flags */
 	unsigned char vifc_threshold;	/* ttl limit */
 	unsigned int vifc_rate_limit;	/* Rate limiter values (NI) */
-	struct in_addr vifc_lcl_addr;	/* Our address */
+	union {
+		struct in_addr vifc_lcl_addr;     /* Local interface address */
+		int            vifc_lcl_ifindex;  /* Local interface index   */
+	};
 	struct in_addr vifc_rmt_addr;	/* IPIP tunnel addr */
 };
 
-#define VIFF_TUNNEL	0x1	/* IPIP tunnel */
-#define VIFF_SRCRT	0x2	/* NI */
-#define VIFF_REGISTER	0x4	/* register vif	*/
+#define VIFF_TUNNEL		0x1	/* IPIP tunnel */
+#define VIFF_SRCRT		0x2	/* NI */
+#define VIFF_REGISTER		0x4	/* register vif	*/
+#define VIFF_USE_IFINDEX	0x8	/* use vifc_lcl_ifindex instead of
+					   vifc_lcl_addr to find an interface */
 
 /*
  *	Cache manipulation structures for mrouted and PIMd

=== modified file 'net/ipv4/ipmr.c'
--- old/net/ipv4/ipmr.c	2009-08-10 11:17:32 +0000
+++ new/net/ipv4/ipmr.c	2009-09-08 06:34:21 +0000
@@ -470,8 +470,18 @@
 			return err;
 		}
 		break;
+
+	case VIFF_USE_IFINDEX:
 	case 0:
-		dev = ip_dev_find(net, vifc->vifc_lcl_addr.s_addr);
+		if (vifc->vifc_flags == VIFF_USE_IFINDEX) {
+			dev = dev_get_by_index(net, vifc->vifc_lcl_ifindex);
+			if (dev && dev->ip_ptr == NULL) {
+				dev_put(dev);
+				return -EADDRNOTAVAIL;
+			}
+		} else
+			dev = ip_dev_find(net, vifc->vifc_lcl_addr.s_addr);
+
 		if (!dev)
 			return -EADDRNOTAVAIL;
 		err = dev_set_allmulti(dev, 1);



* Re: fanotify as syscalls
From: Eric Paris @ 2009-09-16 15:53 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Alan Cox, Alan Cox, Linus Torvalds, Evgeniy Polyakov,
	David Miller, linux-kernel, linux-fsdevel, netdev, viro, hch
In-Reply-To: <20090916125658.GF29359@shareable.org>

On Wed, 2009-09-16 at 13:56 +0100, Jamie Lokier wrote:
> Alan Cox wrote:
> > > You can't rely on the name being non-racy, but you _can_ reliably
> > > invalidate application-level caches from the sequence of events
> > > including file writes, creates, renames, links, unlinks, mounts.  And
> > > revalidate such caches by the absence of pending events.
> > 
> > You can't however create the caches reliably because you've no idea if
> > you are referencing the right object in the first place - which is why
> > you want a handle in these cases. I see fanotify as a handle producing
> > addition to inotify, not as a replacement (plus some other bits around
> > open blocking for HSM etc)
> 
> There are two sets of events getting mixed up here.  Inode events -
> reads, writes, truncates, chmods; and directory events - renames,
> links, creates, unlinks.

My understanding of your argument is that fanotify does not yet provide
all inotify events, namely those for directory operations, and thus is
not suitable to wholesale replace everything inotify can do.  I've
already said that working towards that goal is something I plan to
pursue, but for now, you still have inotify.

The mlocate/updatedb people asked me about fanotify and it's on the todo
list to allow global reception of such events.  The fd you get would
be of the dir where the event happened.  They didn't care, and I haven't
decided if we would provide the path component like inotify does.  Most
users are perfectly happy to stat everything in the dir.

It's hopefully feasible, but it's going to take some fsnotify hook
movements and possibly some arguments with Al to get the information I
want where I want it.  But there is nothing about the interface that
precludes it and it has been discussed and considered.

Am I still missing it?

-Eric



* Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
From: Arnd Bergmann @ 2009-09-16 15:22 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Gregory Haskins, Avi Kivity, Ira W. Snyder, netdev,
	virtualization, kvm, linux-kernel, mingo, linux-mm, akpm, hpa,
	Rusty Russell, s.hetze, alacrityvm-devel
In-Reply-To: <20090916151329.GC5513@redhat.com>

On Wednesday 16 September 2009, Michael S. Tsirkin wrote:
> On Wed, Sep 16, 2009 at 04:57:42PM +0200, Arnd Bergmann wrote:
> > On Tuesday 15 September 2009, Michael S. Tsirkin wrote:
> > > Userspace in x86 maps a PCI region, uses it for communication with ppc?
> > 
> > This might have portability issues. On x86 it should work, but if the
> > host is powerpc or similar, you cannot reliably access PCI I/O memory
> > through copy_tofrom_user but have to use memcpy_toio/fromio or readl/writel
> > calls, which don't work on user pointers.
> > 
> > Specifically on powerpc, copy_from_user cannot access unaligned buffers
> > if they are on an I/O mapping.
> > 
> We are talking about doing this in userspace, not in kernel.

Ok, that's fine then. I thought the idea was to use the vhost_net driver
to access the user memory, which would be a really cute hack otherwise,
as you'd only need to provide the eventfds from a hardware specific
driver and could use the regular virtio_net on the other side.

	Arnd <><


* Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
From: Michael S. Tsirkin @ 2009-09-16 15:13 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Gregory Haskins, Avi Kivity, Ira W. Snyder, netdev,
	virtualization, kvm, linux-kernel, mingo, linux-mm, akpm, hpa,
	Rusty Russell, s.hetze, alacrityvm-devel
In-Reply-To: <200909161657.42628.arnd@arndb.de>

On Wed, Sep 16, 2009 at 04:57:42PM +0200, Arnd Bergmann wrote:
> On Tuesday 15 September 2009, Michael S. Tsirkin wrote:
> > Userspace in x86 maps a PCI region, uses it for communication with ppc?
> 
> This might have portability issues. On x86 it should work, but if the
> host is powerpc or similar, you cannot reliably access PCI I/O memory
> through copy_tofrom_user but have to use memcpy_toio/fromio or readl/writel
> calls, which don't work on user pointers.
> 
> Specifically on powerpc, copy_from_user cannot access unaligned buffers
> if they are on an I/O mapping.
> 
> 	Arnd <><

We are talking about doing this in userspace, not in kernel.

-- 
MST


* Re: bonding...
From: Jay Vosburgh @ 2009-09-16 15:02 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20090916.014459.86856955.davem@davemloft.net>

David Miller <davem@davemloft.net> wrote:

>
>Jay, there is quite a backlog of bonding patches in the queue right
>now.
>
>I'd like to know when you'll get to processing them because it's more
>than two weeks (!!) for some of them and these folks are going to miss
>the merge window out of no fault of their own.

	I'll get through them today.

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com


* Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
From: Arnd Bergmann @ 2009-09-16 14:57 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Gregory Haskins, Avi Kivity, Ira W. Snyder, netdev,
	virtualization, kvm, linux-kernel, mingo, linux-mm, akpm, hpa,
	Rusty Russell, s.hetze, alacrityvm-devel
In-Reply-To: <20090915212545.GC27954@redhat.com>

On Tuesday 15 September 2009, Michael S. Tsirkin wrote:
> Userspace in x86 maps a PCI region, uses it for communication with ppc?

This might have portability issues. On x86 it should work, but if the
host is powerpc or similar, you cannot reliably access PCI I/O memory
through copy_tofrom_user but have to use memcpy_toio/fromio or readl/writel
calls, which don't work on user pointers.

Specifically on powerpc, copy_from_user cannot access unaligned buffers
if they are on an I/O mapping.
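
For the userspace side, one way to stay clear of the alignment problem is to
touch the mapped region only with aligned, word-sized accesses instead of a
plain memcpy(). The helper below is a hypothetical sketch (the name and the
32-bit access width are assumptions), and it deliberately leaves out any
ordering barriers that real hardware may need.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Copy nwords 32-bit words into a mapped region using only aligned,
 * word-sized stores. Returns 0 on success, or -1 if either pointer is
 * not 4-byte aligned.
 */
static int mmio_write32_block(volatile uint32_t *dst, const uint32_t *src,
			      size_t nwords)
{
	size_t i;

	if (((uintptr_t)dst & 3) || ((uintptr_t)src & 3))
		return -1;

	/* volatile keeps the compiler from merging or widening the stores. */
	for (i = 0; i < nwords; i++)
		dst[i] = src[i];
	return 0;
}
```

In real use dst would point into an mmap()ed PCI region; ordinary memory
works just as well for trying the helper out.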

	Arnd <><


* [patch 4/7] [PATCH] af_iucv: fix race in __iucv_sock_wait()
From: Ursula Braun @ 2009-09-16 14:37 UTC (permalink / raw)
  To: davem, netdev, linux-s390
  Cc: schwidefsky, heiko.carstens, Hendrik Brueckner, Ursula Braun
In-Reply-To: <20090916143721.863799000@linux.vnet.ibm.com>

[-- Attachment #1: 604-af_iucv-sock-wait-race.diff --]
[-- Type: text/plain, Size: 969 bytes --]

From: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>

Move prepare_to_wait() before the condition check to avoid a race between
schedule_timeout() and the wake-up.
The race can appear during iucv_sock_connect() and iucv_callback_connack().

Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com>
---

 net/iucv/af_iucv.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6-uschi/net/iucv/af_iucv.c
===================================================================
--- linux-2.6-uschi.orig/net/iucv/af_iucv.c
+++ linux-2.6-uschi/net/iucv/af_iucv.c
@@ -59,8 +59,8 @@ do {									\
 	DEFINE_WAIT(__wait);						\
 	long __timeo = timeo;						\
 	ret = 0;							\
+	prepare_to_wait(sk->sk_sleep, &__wait, TASK_INTERRUPTIBLE);	\
 	while (!(condition)) {						\
-		prepare_to_wait(sk->sk_sleep, &__wait, TASK_INTERRUPTIBLE); \
 		if (!__timeo) {						\
 			ret = -EAGAIN;					\
 			break;						\



* [patch 3/7] [PATCH] iucv: use correct output register in iucv_query_maxconn()
From: Ursula Braun @ 2009-09-16 14:37 UTC (permalink / raw)
  To: davem, netdev, linux-s390
  Cc: schwidefsky, heiko.carstens, Hendrik Brueckner, Ursula Braun
In-Reply-To: <20090916143721.863799000@linux.vnet.ibm.com>

[-- Attachment #1: 602-iucv-query-maxconn.diff --]
[-- Type: text/plain, Size: 2794 bytes --]

From: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>

The iucv_query_maxconn() function uses the wrong output register and
stores the size of the interrupt buffer instead of the maximum number
of connections.

According to the QUERY IUCV function, general register 1 contains the
maximum number of connections.

If the maximum number of connections is not set properly, the following
warning is displayed:

Badness at /usr/src/kernel-source/2.6.30-39.x.20090806/net/iucv/iucv.c:1808
Modules linked in: netiucv fsm af_iucv sunrpc qeth_l3 dm_multipath dm_mod vmur qeth ccwgroup
CPU: 0 Tainted: G        W  2.6.30 #4
Process seq (pid: 16925, task: 0000000030e24a40, ksp: 000000003033bd98)
Krnl PSW : 0404200180000000 000000000053b270 (iucv_external_interrupt+0x64/0x224)
           R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:0 CC:2 PM:0 EA:3
Krnl GPRS: 00000000011279c2 00000000014bdb70 0029000000000000 0000000000000029
           000000000053b236 000000000001dba4 0000000000000000 0000000000859210
           0000000000a67f68 00000000008a6100 000000003f83fb90 0000000000004000
           000000003f8c7bc8 00000000005a2250 000000000053b236 000000003fc2fe08
Krnl Code: 000000000053b262: e33010000021	clg	%r3,0(%r1)
           000000000053b268: a7440010		brc	4,53b288
           000000000053b26c: a7f40001		brc	15,53b26e
          >000000000053b270: c03000184134	larl	%r3,8434d8
           000000000053b276: eb220030000c	srlg	%r2,%r2,48
           000000000053b27c: eb6ff0a00004	lmg	%r6,%r15,160(%r15)
           000000000053b282: c0f4fffff6a7	brcl	15,539fd0
           000000000053b288: 4310a003		ic	%r1,3(%r10)
Call Trace:
([<000000000053b236>] iucv_external_interrupt+0x2a/0x224)
 [<000000000010e09e>] do_extint+0x132/0x190
 [<00000000001184b6>] ext_no_vtime+0x1e/0x22
 [<0000000000549f7a>] _spin_unlock_irqrestore+0x96/0xa4
([<0000000000549f70>] _spin_unlock_irqrestore+0x8c/0xa4)
 [<00000000002101d6>] pipe_write+0x3da/0x5bc
 [<0000000000205d14>] do_sync_write+0xe4/0x13c
 [<0000000000206a7e>] vfs_write+0xae/0x15c
 [<0000000000206c24>] SyS_write+0x54/0xac
 [<0000000000117c8e>] sysc_noemu+0x10/0x16
 [<00000042ff8defcc>] 0x42ff8defcc

Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com>
---

 net/iucv/iucv.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff -urpN linux-2.6/net/iucv/iucv.c linux-2.6-patched/net/iucv/iucv.c
--- linux-2.6/net/iucv/iucv.c	2009-09-15 12:25:32.000000000 +0200
+++ linux-2.6-patched/net/iucv/iucv.c	2009-09-15 12:25:32.000000000 +0200
@@ -362,7 +362,7 @@ static int iucv_query_maxconn(void)
 		"	srl	%0,28\n"
 		: "=d" (ccode), "+d" (reg0), "+d" (reg1) : : "cc");
 	if (ccode == 0)
-		iucv_max_pathid = reg0;
+		iucv_max_pathid = reg1;
 	kfree(param);
 	return ccode ? -EPERM : 0;
 }



* [patch 5/7] [PATCH] af_iucv: handle non-accepted sockets after resuming from suspend
From: Ursula Braun @ 2009-09-16 14:37 UTC (permalink / raw)
  To: davem, netdev, linux-s390
  Cc: schwidefsky, heiko.carstens, Hendrik Brueckner, Ursula Braun
In-Reply-To: <20090916143721.863799000@linux.vnet.ibm.com>

[-- Attachment #1: 605-af_iucv-socket-discon.diff --]
[-- Type: text/plain, Size: 867 bytes --]

From: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>

After resuming from suspend, all af_iucv sockets are disconnected.
Ensure that iucv_accept_dequeue() can handle disconnected sockets
which are not yet accepted.

Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com>
---

 net/iucv/af_iucv.c |    1 +
 1 file changed, 1 insertion(+)

Index: linux-2.6-uschi/net/iucv/af_iucv.c
===================================================================
--- linux-2.6-uschi.orig/net/iucv/af_iucv.c
+++ linux-2.6-uschi/net/iucv/af_iucv.c
@@ -569,6 +569,7 @@ struct sock *iucv_accept_dequeue(struct 
 
 		if (sk->sk_state == IUCV_CONNECTED ||
 		    sk->sk_state == IUCV_SEVERED ||
+		    sk->sk_state == IUCV_DISCONN ||	/* due to PM restore */
 		    !newsock) {
 			iucv_accept_unlink(sk);
 			if (newsock)



* [patch 7/7] [PATCH] af_iucv: fix race when queueing skbs on the backlog queue
From: Ursula Braun @ 2009-09-16 14:37 UTC (permalink / raw)
  To: davem, netdev, linux-s390
  Cc: schwidefsky, heiko.carstens, Hendrik Brueckner, Ursula Braun
In-Reply-To: <20090916143721.863799000@linux.vnet.ibm.com>

[-- Attachment #1: 607-af_iucv-backlog-queue.diff --]
[-- Type: text/plain, Size: 4785 bytes --]

From: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>

iucv_sock_recvmsg() and iucv_process_message()/iucv_fragment_skb race
for dequeuing an skb from the backlog queue.

If iucv_sock_recvmsg() dequeues first, iucv_process_message() calls
sock_queue_rcv_skb() with an skb that is NULL.

This results in the following kernel panic:

<1>Unable to handle kernel pointer dereference at virtual kernel address (null)
<4>Oops: 0004 [#1] PREEMPT SMP DEBUG_PAGEALLOC
<4>Modules linked in: af_iucv sunrpc qeth_l3 dm_multipath dm_mod vmur qeth ccwgroup
<4>CPU: 0 Not tainted 2.6.30 #4
<4>Process client-iucv (pid: 4787, task: 0000000034e75940, ksp: 00000000353e3710)
<4>Krnl PSW : 0704000180000000 000000000043ebca (sock_queue_rcv_skb+0x7a/0x138)
<4>           R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:0 PM:0 EA:3
<4>Krnl GPRS: 0052900000000000 000003e0016e0fe8 0000000000000000 0000000000000000
<4>           000000000043eba8 0000000000000002 0000000000000001 00000000341aa7f0
<4>           0000000000000000 0000000000007800 0000000000000000 0000000000000000
<4>           00000000341aa7f0 0000000000594650 000000000043eba8 000000003fc2fb28
<4>Krnl Code: 000000000043ebbe: a7840006            brc     8,43ebca
<4>           000000000043ebc2: 5930c23c            c       %r3,572(%r12)
<4>           000000000043ebc6: a724004c            brc     2,43ec5e
<4>          >000000000043ebca: e3c0b0100024        stg     %r12,16(%r11)
<4>           000000000043ebd0: a7190000            lghi    %r1,0
<4>           000000000043ebd4: e310b0200024        stg     %r1,32(%r11)
<4>           000000000043ebda: c010ffffdce9        larl    %r1,43a5ac
<4>           000000000043ebe0: e310b0800024        stg     %r1,128(%r11)
<4>Call Trace:
<4>([<000000000043eba8>] sock_queue_rcv_skb+0x58/0x138)
<4> [<000003e0016bcf2a>] iucv_process_message+0x112/0x3cc [af_iucv]
<4> [<000003e0016bd3d4>] iucv_callback_rx+0x1f0/0x274 [af_iucv]
<4> [<000000000053a21a>] iucv_message_pending+0xa2/0x120
<4> [<000000000053b5a6>] iucv_tasklet_fn+0x176/0x1b8
<4> [<000000000014fa82>] tasklet_action+0xfe/0x1f4
<4> [<0000000000150a56>] __do_softirq+0x116/0x284
<4> [<0000000000111058>] do_softirq+0xe4/0xe8
<4> [<00000000001504ba>] irq_exit+0xba/0xd8
<4> [<000000000010e0b2>] do_extint+0x146/0x190
<4> [<00000000001184b6>] ext_no_vtime+0x1e/0x22
<4> [<00000000001fbf4e>] kfree+0x202/0x28c
<4>([<00000000001fbf44>] kfree+0x1f8/0x28c)
<4> [<000000000044205a>] __kfree_skb+0x32/0x124
<4> [<000003e0016bd8b2>] iucv_sock_recvmsg+0x236/0x41c [af_iucv]
<4> [<0000000000437042>] sock_aio_read+0x136/0x160
<4> [<0000000000205e50>] do_sync_read+0xe4/0x13c
<4> [<0000000000206dce>] vfs_read+0x152/0x15c
<4> [<0000000000206ed0>] SyS_read+0x54/0xac
<4> [<0000000000117c8e>] sysc_noemu+0x10/0x16
<4> [<00000042ff8def3c>] 0x42ff8def3c
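
The locking change can be sketched in userspace as follows. This is only
a model under stated assumptions, not the af_iucv code: struct rx_model,
model_lock() and recv_tail_model() are hypothetical names, the two
queues are reduced to counters, and a C11 atomic_flag stands in for
message_q.lock. The point it shows is that draining the backlog and
processing the message queue happen inside one critical section.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical model; none of these names exist in af_iucv.c. */
struct rx_model {
	atomic_flag message_lock;  /* stands in for message_q.lock */
	size_t backlog_len;        /* entries on the modelled backlog queue */
	size_t pending_len;        /* entries on the modelled message queue */
	size_t delivered;          /* entries handed to the receive queue */
};

static void model_lock(struct rx_model *m)
{
	while (atomic_flag_test_and_set_explicit(&m->message_lock,
						 memory_order_acquire))
		; /* spin until the flag is released */
}

static void model_unlock(struct rx_model *m)
{
	atomic_flag_clear_explicit(&m->message_lock, memory_order_release);
}

void rx_model_init(struct rx_model *m, size_t backlog, size_t pending)
{
	assert(m != NULL);
	atomic_flag_clear(&m->message_lock);
	m->backlog_len = backlog;
	m->pending_len = pending;
	m->delivered = 0;
}

/* Drain the backlog, then the pending messages, under a single lock. */
void recv_tail_model(struct rx_model *m)
{
	model_lock(m);
	while (m->backlog_len > 0) {
		m->backlog_len--;
		m->delivered++;
	}
	/* The backlog is empty here and cannot change until we unlock. */
	while (m->pending_len > 0) {
		m->pending_len--;
		m->delivered++;
	}
	model_unlock(m);
}
```

The model leaves out the error path in which an skb is put back on the
backlog queue; it only illustrates the extent of the critical section.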

Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com>
---

 net/iucv/af_iucv.c |   16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

Index: linux-2.6-uschi/net/iucv/af_iucv.c
===================================================================
--- linux-2.6-uschi.orig/net/iucv/af_iucv.c
+++ linux-2.6-uschi/net/iucv/af_iucv.c
@@ -1036,6 +1036,10 @@ out:
 	return err;
 }
 
+/* iucv_fragment_skb() - Fragment a single IUCV message into multiple skb's
+ *
+ * Locking: must be called with message_q.lock held
+ */
 static int iucv_fragment_skb(struct sock *sk, struct sk_buff *skb, int len)
 {
 	int dataleft, size, copied = 0;
@@ -1070,6 +1074,10 @@ static int iucv_fragment_skb(struct sock
 	return 0;
 }
 
+/* iucv_process_message() - Receive a single outstanding IUCV message
+ *
+ * Locking: must be called with message_q.lock held
+ */
 static void iucv_process_message(struct sock *sk, struct sk_buff *skb,
 				 struct iucv_path *path,
 				 struct iucv_message *msg)
@@ -1120,6 +1128,10 @@ static void iucv_process_message(struct 
 		skb_queue_head(&iucv_sk(sk)->backlog_skb_q, skb);
 }
 
+/* iucv_process_message_q() - Process outstanding IUCV messages
+ *
+ * Locking: must be called with message_q.lock held
+ */
 static void iucv_process_message_q(struct sock *sk)
 {
 	struct iucv_sock *iucv = iucv_sk(sk);
@@ -1210,6 +1222,7 @@ static int iucv_sock_recvmsg(struct kioc
 		kfree_skb(skb);
 
 		/* Queue backlog skbs */
+		spin_lock_bh(&iucv->message_q.lock);
 		rskb = skb_dequeue(&iucv->backlog_skb_q);
 		while (rskb) {
 			if (sock_queue_rcv_skb(sk, rskb)) {
@@ -1221,11 +1234,10 @@ static int iucv_sock_recvmsg(struct kioc
 			}
 		}
 		if (skb_queue_empty(&iucv->backlog_skb_q)) {
-			spin_lock_bh(&iucv->message_q.lock);
 			if (!list_empty(&iucv->message_q.list))
 				iucv_process_message_q(sk);
-			spin_unlock_bh(&iucv->message_q.lock);
 		}
+		spin_unlock_bh(&iucv->message_q.lock);
 	}
 
 done:



* [patch 6/7] [PATCH] af_iucv: do not call iucv_sock_kill() twice
From: Ursula Braun @ 2009-09-16 14:37 UTC (permalink / raw)
  To: davem, netdev, linux-s390
  Cc: schwidefsky, heiko.carstens, Hendrik Brueckner, Ursula Braun
In-Reply-To: <20090916143721.863799000@linux.vnet.ibm.com>

[-- Attachment #1: 606-af_iucv-sock-kill.diff --]
[-- Type: text/plain, Size: 3883 bytes --]

From: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>

For non-accepted sockets on the accept queue, iucv_sock_kill()
is called twice (in iucv_sock_close() and iucv_sock_cleanup_listen()).
This typically results in a kernel oops as shown below.

Remove the duplicate call to iucv_sock_kill() and set the SOCK_ZAPPED
flag in iucv_sock_close() only.

The iucv_sock_kill() function frees a socket only if the socket is zapped
and orphaned (sk->sk_socket == NULL):
  - Non-accepted sockets are always orphaned and, thus, iucv_sock_kill()
    frees the socket twice.
  - For accepted sockets or sockets created with iucv_sock_create(),
    sk->sk_socket is initialized. This caused the first call to
    iucv_sock_kill() to return immediately. To free these sockets,
    iucv_sock_release() uses sock_orphan() before calling iucv_sock_kill().
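
The two cases above can be sketched with a small userspace model. It is
an illustration under stated assumptions, not kernel code: struct
sock_model and both functions are hypothetical, and a counter replaces
the real free so that a second free is visible instead of fatal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the two fields the kill guard looks at. */
struct sock_model {
	bool zapped;       /* models the SOCK_ZAPPED flag */
	void *sk_socket;   /* NULL once the socket is orphaned */
	int times_freed;   /* counts frees so a double free shows up */
};

/* Free only a zapped and orphaned socket, mirroring the guard. */
void sock_kill_model(struct sock_model *sk)
{
	assert(sk != NULL);
	if (!sk->zapped || sk->sk_socket)
		return;
	sk->times_freed++;
}

/* Models sock_orphan(): detach the sock from its struct socket. */
void sock_orphan_model(struct sock_model *sk)
{
	assert(sk != NULL);
	sk->sk_socket = NULL;
}
```

In this model, a socket with sk_socket set survives the first kill call
and is freed once after being orphaned, whereas a socket that is already
orphaned is "freed" by every call, which is the double free described
above.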

<1>Unable to handle kernel pointer dereference at virtual kernel address 000000003edd3000
<4>Oops: 0011 [#1] PREEMPT SMP DEBUG_PAGEALLOC
<4>Modules linked in: af_iucv sunrpc qeth_l3 dm_multipath dm_mod qeth vmur ccwgroup
<4>CPU: 0 Not tainted 2.6.30 #4
<4>Process iucv_sock_close (pid: 2486, task: 000000003aea4340, ksp: 000000003b75bc68)
<4>Krnl PSW : 0704200180000000 000003e00168e23a (iucv_sock_kill+0x2e/0xcc [af_iucv])
<4>           R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:2 PM:0 EA:3
<4>Krnl GPRS: 0000000000000000 000000003b75c000 000000003edd37f0 0000000000000001
<4>           000003e00168ec62 000000003988d960 0000000000000000 000003e0016b0608
<4>           000000003fe81b20 000000003839bb58 00000000399977f0 000000003edd37f0
<4>           000003e00168b000 000003e00168f138 000000003b75bcd0 000000003b75bc98
<4>Krnl Code: 000003e00168e22a: c0c0ffffe6eb	larl	%r12,3e00168b000
<4>           000003e00168e230: b90400b2		lgr	%r11,%r2
<4>           000003e00168e234: e3e0f0980024	stg	%r14,152(%r15)
<4>          >000003e00168e23a: e310225e0090	llgc	%r1,606(%r2)
<4>           000003e00168e240: a7110001		tmll	%r1,1
<4>           000003e00168e244: a7840007		brc	8,3e00168e252
<4>           000003e00168e248: d507d00023c8	clc	0(8,%r13),968(%r2)
<4>           000003e00168e24e: a7840009		brc	8,3e00168e260
<4>Call Trace:
<4>([<000003e0016b0608>] afiucv_dbf+0x0/0xfffffffffffdea20 [af_iucv])
<4> [<000003e00168ec6c>] iucv_sock_close+0x130/0x368 [af_iucv]
<4> [<000003e00168ef02>] iucv_sock_release+0x5e/0xe4 [af_iucv]
<4> [<0000000000438e6c>] sock_release+0x44/0x104
<4> [<0000000000438f5e>] sock_close+0x32/0x50
<4> [<0000000000207898>] __fput+0xf4/0x250
<4> [<00000000002038aa>] filp_close+0x7a/0xa8
<4> [<00000000002039ba>] SyS_close+0xe2/0x148
<4> [<0000000000117c8e>] sysc_noemu+0x10/0x16
<4> [<00000042ff8deeac>] 0x42ff8deeac

Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com>
---

 net/iucv/af_iucv.c |   10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

Index: linux-2.6-uschi/net/iucv/af_iucv.c
===================================================================
--- linux-2.6-uschi.orig/net/iucv/af_iucv.c
+++ linux-2.6-uschi/net/iucv/af_iucv.c
@@ -361,10 +361,9 @@ static void iucv_sock_cleanup_listen(str
 	}
 
 	parent->sk_state = IUCV_CLOSED;
-	sock_set_flag(parent, SOCK_ZAPPED);
 }
 
-/* Kill socket */
+/* Kill socket (only if zapped and orphaned) */
 static void iucv_sock_kill(struct sock *sk)
 {
 	if (!sock_flag(sk, SOCK_ZAPPED) || sk->sk_socket)
@@ -426,17 +425,18 @@ static void iucv_sock_close(struct sock 
 
 		skb_queue_purge(&iucv->send_skb_q);
 		skb_queue_purge(&iucv->backlog_skb_q);
-
-		sock_set_flag(sk, SOCK_ZAPPED);
 		break;
 
 	default:
 		sock_set_flag(sk, SOCK_ZAPPED);
+		/* nothing to do here */
 		break;
 	}
 
+	/* mark socket for deletion by iucv_sock_kill() */
+	sock_set_flag(sk, SOCK_ZAPPED);
+
 	release_sock(sk);
-	iucv_sock_kill(sk);
 }
 
 static void iucv_sock_init(struct sock *sk, struct sock *parent)



* [patch 2/7] [PATCH] iucv: fix iucv_buffer_cpumask check when calling IUCV functions
From: Ursula Braun @ 2009-09-16 14:37 UTC (permalink / raw)
  To: davem, netdev, linux-s390
  Cc: schwidefsky, heiko.carstens, Hendrik Brueckner, Ursula Braun
In-Reply-To: <20090916143721.863799000@linux.vnet.ibm.com>

[-- Attachment #1: 601-iucv-cpumask.diff --]
[-- Type: text/plain, Size: 4004 bytes --]

From: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>

Prior to calling IUCV functions, the DECLARE BUFFER function must have been
called for at least one CPU to receive IUCV interrupts.

With commit "iucv: establish reboot notifier" (6c005961), a check has been
introduced to avoid calling IUCV functions if the current CPU does not have
an interrupt buffer declared.
Because one interrupt buffer is sufficient, change the condition to ensure
that one interrupt buffer is available.

In addition, checking the buffer on the current CPU creates a race with
CPU up/down notifications: before checking the buffer, the IUCV function
might be interrupted by an smp_call_function() that retrieves the interrupt
buffer for the current CPU.
When the IUCV function continues, the check fails and -EIO is returned. If a
buffer is available on any other CPU, the IUCV function call must be invoked
(instead of failing with -EIO).
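
The difference between the two conditions can be sketched in userspace.
This is a model under stated assumptions: a 64-bit integer stands in for
iucv_buffer_cpumask, and old_check_passes() and new_check_passes() are
hypothetical helpers, not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the cpumask: bit n set means CPU n has
 * an interrupt buffer declared. */
typedef uint64_t cpumask_model_t;

/* Old condition: only the calling CPU's bit is consulted. */
bool old_check_passes(cpumask_model_t mask, unsigned int this_cpu)
{
	assert(this_cpu < 64);
	return ((mask >> this_cpu) & 1u) != 0;
}

/* New condition: a buffer on any CPU is sufficient. */
bool new_check_passes(cpumask_model_t mask)
{
	return mask != 0;
}
```

With a buffer declared only on CPU 1 and the caller running on CPU 0,
the old condition fails (modelling the -EIO return) while the new one
passes. Both fail when no CPU has a buffer.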

Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com>
---

 net/iucv/iucv.c |   22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff -urpN linux-2.6/net/iucv/iucv.c linux-2.6-patched/net/iucv/iucv.c
--- linux-2.6/net/iucv/iucv.c	2009-09-15 12:25:32.000000000 +0200
+++ linux-2.6-patched/net/iucv/iucv.c	2009-09-15 12:25:32.000000000 +0200
@@ -864,7 +864,7 @@ int iucv_path_accept(struct iucv_path *p
 	int rc;
 
 	local_bh_disable();
-	if (!cpu_isset(smp_processor_id(), iucv_buffer_cpumask)) {
+	if (cpus_empty(iucv_buffer_cpumask)) {
 		rc = -EIO;
 		goto out;
 	}
@@ -913,7 +913,7 @@ int iucv_path_connect(struct iucv_path *
 
 	spin_lock_bh(&iucv_table_lock);
 	iucv_cleanup_queue();
-	if (!cpu_isset(smp_processor_id(), iucv_buffer_cpumask)) {
+	if (cpus_empty(iucv_buffer_cpumask)) {
 		rc = -EIO;
 		goto out;
 	}
@@ -973,7 +973,7 @@ int iucv_path_quiesce(struct iucv_path *
 	int rc;
 
 	local_bh_disable();
-	if (!cpu_isset(smp_processor_id(), iucv_buffer_cpumask)) {
+	if (cpus_empty(iucv_buffer_cpumask)) {
 		rc = -EIO;
 		goto out;
 	}
@@ -1005,7 +1005,7 @@ int iucv_path_resume(struct iucv_path *p
 	int rc;
 
 	local_bh_disable();
-	if (!cpu_isset(smp_processor_id(), iucv_buffer_cpumask)) {
+	if (cpus_empty(iucv_buffer_cpumask)) {
 		rc = -EIO;
 		goto out;
 	}
@@ -1034,7 +1034,7 @@ int iucv_path_sever(struct iucv_path *pa
 	int rc;
 
 	preempt_disable();
-	if (!cpu_isset(smp_processor_id(), iucv_buffer_cpumask)) {
+	if (cpus_empty(iucv_buffer_cpumask)) {
 		rc = -EIO;
 		goto out;
 	}
@@ -1068,7 +1068,7 @@ int iucv_message_purge(struct iucv_path 
 	int rc;
 
 	local_bh_disable();
-	if (!cpu_isset(smp_processor_id(), iucv_buffer_cpumask)) {
+	if (cpus_empty(iucv_buffer_cpumask)) {
 		rc = -EIO;
 		goto out;
 	}
@@ -1160,7 +1160,7 @@ int __iucv_message_receive(struct iucv_p
 	if (msg->flags & IUCV_IPRMDATA)
 		return iucv_message_receive_iprmdata(path, msg, flags,
 						     buffer, size, residual);
-	if (!cpu_isset(smp_processor_id(), iucv_buffer_cpumask)) {
+	if (cpus_empty(iucv_buffer_cpumask)) {
 		rc = -EIO;
 		goto out;
 	}
@@ -1233,7 +1233,7 @@ int iucv_message_reject(struct iucv_path
 	int rc;
 
 	local_bh_disable();
-	if (!cpu_isset(smp_processor_id(), iucv_buffer_cpumask)) {
+	if (cpus_empty(iucv_buffer_cpumask)) {
 		rc = -EIO;
 		goto out;
 	}
@@ -1272,7 +1272,7 @@ int iucv_message_reply(struct iucv_path 
 	int rc;
 
 	local_bh_disable();
-	if (!cpu_isset(smp_processor_id(), iucv_buffer_cpumask)) {
+	if (cpus_empty(iucv_buffer_cpumask)) {
 		rc = -EIO;
 		goto out;
 	}
@@ -1322,7 +1322,7 @@ int __iucv_message_send(struct iucv_path
 	union iucv_param *parm;
 	int rc;
 
-	if (!cpu_isset(smp_processor_id(), iucv_buffer_cpumask)) {
+	if (cpus_empty(iucv_buffer_cpumask)) {
 		rc = -EIO;
 		goto out;
 	}
@@ -1409,7 +1409,7 @@ int iucv_message_send2way(struct iucv_pa
 	int rc;
 
 	local_bh_disable();
-	if (!cpu_isset(smp_processor_id(), iucv_buffer_cpumask)) {
+	if (cpus_empty(iucv_buffer_cpumask)) {
 		rc = -EIO;
 		goto out;
 	}



* [patch 0/7] s390: iucv / af_iucv fixes for 2.6.31+
From: Ursula Braun @ 2009-09-16 14:37 UTC (permalink / raw)
  To: davem, netdev, linux-s390; +Cc: schwidefsky, heiko.carstens

Dave,

here are bugfixes for iucv and af_iucv for the next possible 2.6.31 follow-on.
They apply to linux-2.6 and net-2.6.

Summary:

Ursula Braun (1)
iucv: suspend/resume error msg for left over pathes

Hendrik Brueckner (6)
iucv: fix iucv_buffer_cpumask check when calling IUCV functions
iucv: use correct output register in iucv_query_maxconn()
af_iucv: fix race in __iucv_sock_wait()
af_iucv: handle non-accepted sockets after resuming from suspend
af_iucv: do not call iucv_sock_kill() twice
af_iucv: fix race when queueing skbs on the backlog queue

Thanks,
        Ursula



* [patch 1/7] [PATCH] iucv: suspend/resume error msg for left over pathes
From: Ursula Braun @ 2009-09-16 14:37 UTC (permalink / raw)
  To: davem, netdev, linux-s390; +Cc: schwidefsky, heiko.carstens, Ursula Braun
In-Reply-To: <20090916143721.863799000@linux.vnet.ibm.com>

[-- Attachment #1: 600-iucv-pm-leftover.diff --]
[-- Type: text/plain, Size: 2102 bytes --]

From: Ursula Braun <ursula.braun@de.ibm.com>

During suspend, IUCV exploiters have to close their IUCV connections.
When restoring an image, it can be checked whether all IUCV paths had
been closed before the Linux instance was suspended. If not, an
error message is issued to indicate a problem in one of the
programs that use IUCV communication.
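
The check on restore can be sketched as a small userspace model. The
names are hypothetical (struct pm_model, pm_restore_model()), a boolean
replaces the iucv_path_table pointer, and a counter replaces the
message; only the condition and the state update follow the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mirror of the PM states introduced by the patch. */
enum pm_state_model {
	PM_INITIAL = 0,
	PM_FREEZING = 1,
	PM_THAWING = 2,
	PM_RESTORING = 3,
};

struct pm_model {
	enum pm_state_model state;
	bool path_table_present;  /* models a non-NULL path table */
	int warnings;             /* counts messages instead of printing */
};

/* Warn if a path table survived the suspend, then record the state. */
void pm_restore_model(struct pm_model *pm)
{
	assert(pm != NULL);
	if (pm->state != PM_RESTORING && pm->path_table_present)
		pm->warnings++;
	pm->state = PM_RESTORING;
}
```

Because the state is set to restoring after the check, a second restore
call in the model does not repeat the message.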

Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com>
---

 net/iucv/iucv.c |   14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff -urpN linux-2.6/net/iucv/iucv.c linux-2.6-patched/net/iucv/iucv.c
--- linux-2.6/net/iucv/iucv.c	2009-09-10 00:13:59.000000000 +0200
+++ linux-2.6-patched/net/iucv/iucv.c	2009-09-15 12:25:31.000000000 +0200
@@ -79,6 +79,14 @@ static int iucv_bus_match(struct device 
 	return 0;
 }
 
+enum iucv_pm_states {
+	IUCV_PM_INITIAL = 0,
+	IUCV_PM_FREEZING = 1,
+	IUCV_PM_THAWING = 2,
+	IUCV_PM_RESTORING = 3,
+};
+static enum iucv_pm_states iucv_pm_state;
+
 static int iucv_pm_prepare(struct device *);
 static void iucv_pm_complete(struct device *);
 static int iucv_pm_freeze(struct device *);
@@ -1875,6 +1883,7 @@ static int iucv_pm_freeze(struct device 
 #ifdef CONFIG_PM_DEBUG
 	printk(KERN_WARNING "iucv_pm_freeze\n");
 #endif
+	iucv_pm_state = IUCV_PM_FREEZING;
 	for_each_cpu_mask_nr(cpu, iucv_irq_cpumask)
 		smp_call_function_single(cpu, iucv_block_cpu_almost, NULL, 1);
 	if (dev->driver && dev->driver->pm && dev->driver->pm->freeze)
@@ -1899,6 +1908,7 @@ static int iucv_pm_thaw(struct device *d
 #ifdef CONFIG_PM_DEBUG
 	printk(KERN_WARNING "iucv_pm_thaw\n");
 #endif
+	iucv_pm_state = IUCV_PM_THAWING;
 	if (!iucv_path_table) {
 		rc = iucv_enable();
 		if (rc)
@@ -1933,6 +1943,10 @@ static int iucv_pm_restore(struct device
 #ifdef CONFIG_PM_DEBUG
 	printk(KERN_WARNING "iucv_pm_restore %p\n", iucv_path_table);
 #endif
+	if ((iucv_pm_state != IUCV_PM_RESTORING) && iucv_path_table)
+		pr_warning("Suspending Linux did not completely close all IUCV "
+			"connections\n");
+	iucv_pm_state = IUCV_PM_RESTORING;
 	if (cpus_empty(iucv_irq_cpumask)) {
 		rc = iucv_query_maxconn();
 		rc = iucv_enable();



* Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
From: Gregory Haskins @ 2009-09-16 14:10 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Michael S. Tsirkin, Ira W. Snyder, netdev, virtualization, kvm,
	linux-kernel, mingo, linux-mm, akpm, hpa, Rusty Russell, s.hetze,
	alacrityvm-devel
In-Reply-To: <4AB0E2A2.3080409@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 6777 bytes --]

Avi Kivity wrote:
> On 09/16/2009 02:44 PM, Gregory Haskins wrote:
>> The problem isn't where to find the models...the problem is how to
>> aggregate multiple models to the guest.
>>    
> 
> You mean configuration?
> 
>>> You instantiate multiple vhost-nets.  Multiple ethernet NICs is a
>>> supported configuration for kvm.
>>>      
>> But this is not KVM.
>>
>>    
> 
> If kvm can do it, others can.

The problem is that you seem to either hand-wave over details like this,
or you give details that are pretty much exactly what vbus does already.
 My point is that I've already sat down and thought about these issues
and solved them in a freely available GPL'ed software package.

So the question is: is your position that vbus is all wrong and you wish
to create a new bus-like thing to solve the problem?  If so, how is it
different from what I've already done?  More importantly, what specific
objections do you have to what I've done, as perhaps they can be fixed
instead of starting over?

> 
>>>> His slave boards surface themselves as PCI devices to the x86
>>>> host.  So how do you use that to make multiple vhost-based devices (say
>>>> two virtio-nets, and a virtio-console) communicate across the
>>>> transport?
>>>>
>>>>        
>>> I don't really see the difference between 1 and N here.
>>>      
>> A KVM surfaces N virtio-devices as N pci-devices to the guest.  What do
>> we do in Ira's case where the entire guest represents itself as a PCI
>> device to the host, and nothing the other way around?
>>    
> 
> There is no guest and host in this scenario.  There's a device side
> (ppc) and a driver side (x86).  The driver side can access configuration
> information on the device side.  How to multiplex multiple devices is an
> interesting exercise for whoever writes the virtio binding for that setup.

Bingo.  So now it's a question of whether you want to write this layer
from scratch or re-use my framework.

> 
>>>> There are multiple ways to do this, but what I am saying is that
>>>> whatever is conceived will start to look eerily like a vbus-connector,
>>>> since this is one of its primary purposes ;)
>>>>
>>>>        
>>> I'm not sure if you're talking about the configuration interface or data
>>> path here.
>>>      
>> I am talking about how we would tunnel the config space for N devices
>> across his transport.
>>    
> 
> Sounds trivial.

No one said it was rocket science.  But it does need to be designed and
implemented end-to-end, much of which I've already done in what I hope is
an extensible way.

>  Write an address containing the device number and
> register number to one location, read or write data from another.

You mean like the "u64 devh", and "u32 func" fields I have here for the
vbus-kvm connector?

http://git.kernel.org/?p=linux/kernel/git/ghaskins/alacrityvm/linux-2.6.git;a=blob;f=include/linux/vbus_pci.h;h=fe337590e644017392e4c9d9236150adb2333729;hb=ded8ce2005a85c174ba93ee26f8d67049ef11025#l64

> Just
> like the PCI cf8/cfc interface.
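
A minimal sketch of the address/data scheme described in the quoted
text, for illustration only. It assumes a hypothetical packing with the
device number in the upper 16 bits and the register number in the lower
16 bits; struct cfg_tunnel and the cfg_*() helpers are invented here and
come from neither vhost, virtio, nor vbus.

```c
#include <assert.h>
#include <stdint.h>

#define NDEV 4  /* hypothetical number of tunnelled devices */
#define NREG 8  /* hypothetical registers per device */

/* One address window and one data window multiplex N config spaces. */
struct cfg_tunnel {
	uint32_t addr;              /* last value written to the address window */
	uint32_t regs[NDEV][NREG];  /* per-device registers on the device side */
};

/* Pack device and register numbers into a single address word. */
uint32_t cfg_addr(uint16_t dev, uint16_t reg)
{
	return ((uint32_t)dev << 16) | reg;
}

/* Write the address window to select a (device, register) pair. */
void cfg_select(struct cfg_tunnel *t, uint16_t dev, uint16_t reg)
{
	assert(dev < NDEV && reg < NREG);
	t->addr = cfg_addr(dev, reg);
}

/* Data window accesses go to whatever the address window selected. */
void cfg_write(struct cfg_tunnel *t, uint32_t val)
{
	t->regs[t->addr >> 16][t->addr & 0xffff] = val;
}

uint32_t cfg_read(const struct cfg_tunnel *t)
{
	return t->regs[t->addr >> 16][t->addr & 0xffff];
}
```

Selecting device 2, register 5 and then device 1, register 5 reaches two
independent registers through the same pair of windows.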
> 
>>> They aren't in the "guest".  The best way to look at it is
>>>
>>> - a device side, with a dma engine: vhost-net
>>> - a driver side, only accessing its own memory: virtio-net
>>>
>>> Given that Ira's config has the dma engine in the ppc boards, that's
>>> where vhost-net would live (the ppc boards acting as NICs to the x86
>>> board, essentially).
>>>      
>> That sounds convenient given his hardware, but it has its own set of
>> problems.  For one, the configuration/inventory of these boards is now
>> driven by the wrong side and has to be addressed.
> 
> Why is it the wrong side?

"Wrong" is probably too harsh a word when looking at ethernet.  It's
certainly "odd", and possibly inconvenient.  It would be like having
vhost in a KVM guest, and virtio-net running on the host.  You could do
it, but it's weird and awkward.  Where it really falls apart and enters
the "wrong" category is for non-symmetric devices, like disk-io.

> 
>> Second, the role
>> reversal will likely not work for many models other than ethernet (e.g.
>> virtio-console or virtio-blk drivers running on the x86 board would be
>> naturally consuming services from the slave boards...virtio-net is an
>> exception because 802.x is generally symmetrical).
>>    
> 
> There is no role reversal.

So if I have virtio-blk driver running on the x86 and vhost-blk device
running on the ppc board, I can use the ppc board as a block-device.
What if I really wanted to go the other way?

> The side doing dma is the device, the side
> accessing its own memory is the driver.  Just like that other 1e12
> driver/device pairs out there.

IIUC, his ppc boards really can be seen as "guests" (they are linux
instances that are utilizing services from the x86, not the other way
around).  vhost forces the model to have the ppc boards act as IO-hosts,
whereas vbus would likely work in either direction due to its more
refined abstraction layer.

> 
>>> I have no idea, that's for Ira to solve.
>>>      
>> Bingo.  Thus my statement that the vhost proposal is incomplete.  You
>> have the virtio-net and vhost-net pieces covering the fast-path
>> end-points, but nothing in the middle (transport, aggregation,
>> config-space), and nothing on the management-side.  vbus provides most
>> of the other pieces, and can even support the same virtio-net protocol
>> on top.  The remaining part would be something like a udev script to
>> populate the vbus with devices on board-insert events.
>>    
> 
> Of course vhost is incomplete, in the same sense that Linux is
> incomplete.  Both require userspace.

A vhost-based solution to Ira's design is missing more than userspace.
Many of those gaps are addressed by a vbus-based solution.

> 
>>> If he could fake the PCI
>>> config space as seen by the x86 board, he would just show the normal pci
>>> config and use virtio-pci (multiple channels would show up as a
>>> multifunction device).  Given he can't, he needs to tunnel the virtio
>>> config space some other way.
>>>      
>> Right, and note that vbus was designed to solve this.  This tunneling
>> can, of course, be done without vbus using some other design.  However,
>> whatever solution is created will look incredibly close to what I've
>> already done, so my point is "why reinvent it"?
>>    
> 
> virtio requires binding for this tunnelling, so does vbus.

We aren't talking about virtio.  Virtio would work with either vbus or
vhost.  This is purely a question of what the layers below virtio and
the device backend look like.

>  It's the same problem with the same solution.

I disagree.

Kind Regards,
-Greg



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 267 bytes --]


* Re: igb bandwidth allocation configuration
From: Or Gerlitz @ 2009-09-16 14:10 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Simon Horman, e1000-devel@lists.sourceforge.net,
	netdev@vger.kernel.org, Alexander Duyck, Kirsher, Jeffrey T
In-Reply-To: <4AAFD690.5040900@intel.com>

Alexander Duyck wrote:
> The interface for all of this would make sense as part of a virtual 
> ethernet switch control which is the way I am currently leaning on all 
> this.
Yes, you can say that out of the per VF <mac, vlan-id, priority, rate> 
tuple I mentioned, except for the mac, the other parameters actually 
belong to the egress flow of the virtual switch port this VF is 
connected to. So the vswitch actually signs the packet with vlan+pbits 
and enforces the rate. Now vswitch can be software based, or hardware 
NIC based.

Now, I assume there may be NICs which will let you configure the 
<vlan-id, priority, rate> as part of their virtual switch config, 
but others, e.g. the 82576, following our discussion, will let you do 
that for the VF, in the VF driver, which as you said may run in the 
guest OS where we can't control it...

Or.




* [PATCH 2/2] ieee802154: add locking for seq numbers
From: Dmitry Eremin-Solenikov @ 2009-09-16 13:22 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-zigbee-devel, Sergey Lapin, netdev
In-Reply-To: <1253107333-25043-2-git-send-email-dbaryshkov@gmail.com>

Signed-off-by: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
---
 net/ieee802154/netlink.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/net/ieee802154/netlink.c b/net/ieee802154/netlink.c
index 2106ecb..ca767bd 100644
--- a/net/ieee802154/netlink.c
+++ b/net/ieee802154/netlink.c
@@ -35,6 +35,7 @@
 #include <net/ieee802154_netdev.h>
 
 static unsigned int ieee802154_seq_num;
+static DEFINE_SPINLOCK(ieee802154_seq_lock);
 
 static struct genl_family ieee802154_coordinator_family = {
 	.id		= GENL_ID_GENERATE,
@@ -57,12 +58,15 @@ static struct sk_buff *ieee802154_nl_create(int flags, u8 req)
 {
 	void *hdr;
 	struct sk_buff *msg = nlmsg_new(NLMSG_GOODSIZE, GFP_ATOMIC);
+	unsigned long f;
 
 	if (!msg)
 		return NULL;
 
+	spin_lock_irqsave(&ieee802154_seq_lock, f);
 	hdr = genlmsg_put(msg, 0, ieee802154_seq_num++,
 			&ieee802154_coordinator_family, flags, req);
+	spin_unlock_irqrestore(&ieee802154_seq_lock, f);
 	if (!hdr) {
 		nlmsg_free(msg);
 		return NULL;
-- 
1.6.3.3



* [GIT PULL 0/2] Fixes for IEEE 802.15.4
From: Dmitry Eremin-Solenikov @ 2009-09-16 13:22 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-zigbee-devel, Sergey Lapin, netdev

Hi, David,

Please pull both into net/master and net-next/master (as I'd like
to submit a few patches into net-next/master depending on this).

The following changes since commit 4e36a95e591e9c58dd10bb4103c00993917c27fd:
  David Howells (1):
        RxRPC: Use uX/sX rather than uintX_t/intX_t types

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/lowpan/lowpan.git for-linus

Dmitry Eremin-Solenikov (2):
      af_ieee802154: setsockopt optlen arg isn't __user
      ieee802154: add locking for seq numbers

 net/ieee802154/dgram.c   |    2 +-
 net/ieee802154/netlink.c |    4 ++++
 net/ieee802154/raw.c     |    2 +-
 3 files changed, 6 insertions(+), 2 deletions(-)



* [PATCH 1/2] af_ieee802154: setsockopt optlen arg isn't __user
From: Dmitry Eremin-Solenikov @ 2009-09-16 13:22 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-zigbee-devel, Sergey Lapin, netdev
In-Reply-To: <1253107333-25043-1-git-send-email-dbaryshkov@gmail.com>

Remove __user annotation from optlen arg as it's bogus.

Signed-off-by: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
---
 net/ieee802154/dgram.c |    2 +-
 net/ieee802154/raw.c   |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ieee802154/dgram.c b/net/ieee802154/dgram.c
index 77ae685..51593a4 100644
--- a/net/ieee802154/dgram.c
+++ b/net/ieee802154/dgram.c
@@ -414,7 +414,7 @@ static int dgram_getsockopt(struct sock *sk, int level, int optname,
 }
 
 static int dgram_setsockopt(struct sock *sk, int level, int optname,
-		    char __user *optval, int __user optlen)
+		    char __user *optval, int optlen)
 {
 	struct dgram_sock *ro = dgram_sk(sk);
 	int val;
diff --git a/net/ieee802154/raw.c b/net/ieee802154/raw.c
index 4681501..1319885 100644
--- a/net/ieee802154/raw.c
+++ b/net/ieee802154/raw.c
@@ -244,7 +244,7 @@ static int raw_getsockopt(struct sock *sk, int level, int optname,
 }
 
 static int raw_setsockopt(struct sock *sk, int level, int optname,
-		    char __user *optval, int __user optlen)
+		    char __user *optval, int optlen)
 {
 	return -EOPNOTSUPP;
 }
-- 
1.6.3.3



