* Re: [PATCH v3 2/3] hvc_init(): Enforce one-time initialization.
From: Miche Baker-Harvey @ 2011-12-06 17:05 UTC (permalink / raw)
To: Amit Shah
Cc: Stephen Rothwell, xen-devel, Konrad Rzeszutek Wilk,
Benjamin Herrenschmidt, linux-kernel, virtualization,
Anton Blanchard, Mike Waychison, ppc-dev, Greg Kroah-Hartman,
Eric Northrup
In-Reply-To: <20111205105452.GB27683@amit-x200.redhat.com>
Amit,
Ah, indeed. I am not using MSI-X, so virtio_pci::vp_try_to_find_vqs()
calls vp_request_intx() and sets up an interrupt callback. From
there, when an interrupt occurs, the stack looks something like this:
virtio_pci::vp_interrupt()
virtio_pci::vp_vring_interrupt()
virtio_ring::vring_interrupt()
vq->vq.callback() <-- in this case, that's virtio_console::control_intr()
workqueue::schedule_work()
workqueue::queue_work()
queue_work_on(get_cpu()) <-- queues the work on the current CPU.
I'm not doing anything to keep multiple control message from being
sent concurrently to the guest, and we will take those interrupts on
any CPU. I've confirmed that the two instances of
handle_control_message() are occurring on different CPUs.
Should this work? I don't see anywhere that QEMU is serializing the
sending of data to the control queue in the guest, and there's no
serialization in
the control_intr. I don't understand why you are not seeing the
concurrent execution of handle_control_message(). Are you taking all
your interrupts on a single CPU, maybe? Or is there some other
serialization in user space?
Miche
On Mon, Dec 5, 2011 at 2:54 AM, Amit Shah <amit.shah@redhat.com> wrote:
> On (Tue) 29 Nov 2011 [09:50:41], Miche Baker-Harvey wrote:
>> Good grief! Sorry for the spacing mess-up! Here's a resend with reformatting.
>>
>> Amit,
>> We aren't using either QEMU or kvmtool, but we are using KVM. All
>
> So it's a different userspace? Any chance this different userspace is
> causing these problems to appear? Esp. since I couldn't reproduce
> with qemu.
>
>> the issues we are seeing happen when we try to establish multiple
>> virtioconsoles at boot time. The command line isn't relevant, but I
>> can tell you the protocol that's passing between the host (kvm) and
>> the guest (see the end of this message).
>>
>> We do go through the control_work_handler(), but it's not
>> providing synchronization. Here's a trace of the
>> control_work_handler() and handle_control_message() calls; note that
>> there are two concurrent calls to control_work_handler().
>
> Ah; how does that happen? control_work_handler() should just be
> invoked once, and if there are any more pending work items to be
> consumed, they should be done within the loop inside
> control_work_handler().
>
>> I decorated control_work_handler() with a "lifetime" marker, and
>> passed this value to handle_control_message(), so we can see which
>> control messages are being handled from which instance of
>> the control_work_handler() thread.
>>
>> Notice that we enter control_work_handler() a second time before
>> the handling of the second PORT_ADD message is complete. The
>> first CONSOLE_PORT message is handled by the second
>> control_work_handler() call, but the second is handled by the first
>> control_work_handler() call.
>>
>> root@myubuntu:~# dmesg | grep MBH
>> [3371055.808738] control_work_handler #1
>> [3371055.809372] + #1 handle_control_message PORT_ADD
>> [3371055.810169] - handle_control_message PORT_ADD
>> [3371055.810170] + #1 handle_control_message PORT_ADD
>> [3371055.810244] control_work_handler #2
>> [3371055.810245] + #2 handle_control_message CONSOLE_PORT
>> [3371055.810246] got hvc_ports_mutex
>> [3371055.810578] - handle_control_message PORT_ADD
>> [3371055.810579] + #1 handle_control_message CONSOLE_PORT
>> [3371055.810580] trylock of hvc_ports_mutex failed
>> [3371055.811352] got hvc_ports_mutex
>> [3371055.811370] - handle_control_message CONSOLE_PORT
>> [3371055.816609] - handle_control_message CONSOLE_PORT
>>
>> So, I'm guessing the bug is that there shouldn't be two instances of
>> control_work_handler() running simultaneously?
>
> Yep, I assumed we did that but apparently not. Do you plan to chase
> this one down?
>
> Amit
>
^ permalink raw reply
* Re: [Xen-devel] [PATCH RFC V3 2/4] kvm hypervisor : Add a hypercall to KVM hypervisor to support pv-ticketlocks
From: Konrad Rzeszutek Wilk @ 2011-12-06 16:49 UTC (permalink / raw)
To: Raghavendra K T
Cc: Jeremy Fitzhardinge, KVM, Peter Zijlstra, Virtualization,
Raghavendra K T, H. Peter Anvin, Dave Hansen, Xen, Dave Jiang,
x86, Ingo Molnar, Avi Kivity, Stefano Stabellini,
Jeremy Fitzhardinge, Sasha Levin, Sedat Dilek, Thomas Gleixner,
Yinghai Lu, Konrad Rzeszutek Wilk, Greg Kroah-Hartman,
Srivatsa Vaddagiri, LKML, Suzuki
In-Reply-To: <4EDBB6C2.9050303@linux.vnet.ibm.com>
On Sun, Dec 04, 2011 at 11:36:58PM +0530, Raghavendra K T wrote:
> On 12/02/2011 01:20 AM, Raghavendra K T wrote:
> >>Have you tested it on AMD machines? There are some differences in the
> >>hypercall infrastructure there.
> >
> >Yes. 'll test on AMD machine and update on that.
> >
>
> I tested the code on 64 bit Dual-Core AMD Opteron machine, and it is
> working.
I am not that familiar with how KVM does migration, but do you need any
special flags in QEMU to when migrating a KVM-pv-spinlock enabled guest
to another machine?
^ permalink raw reply
* Re: [net-next RFC PATCH 5/5] virtio-net: flow director support
From: Michael S. Tsirkin @ 2011-12-06 16:14 UTC (permalink / raw)
To: Sridhar Samudrala
Cc: krkumar2, kvm, virtualization, levinsasha928, netdev, bhutchings
In-Reply-To: <4EDE37FE.5090409@us.ibm.com>
On Tue, Dec 06, 2011 at 07:42:54AM -0800, Sridhar Samudrala wrote:
> On 12/6/2011 5:15 AM, Stefan Hajnoczi wrote:
> >On Tue, Dec 6, 2011 at 10:21 AM, Jason Wang<jasowang@redhat.com> wrote:
> >>On 12/06/2011 05:18 PM, Stefan Hajnoczi wrote:
> >>>On Tue, Dec 6, 2011 at 6:33 AM, Jason Wang<jasowang@redhat.com> wrote:
> >>>>On 12/05/2011 06:55 PM, Stefan Hajnoczi wrote:
> >>>>>On Mon, Dec 5, 2011 at 8:59 AM, Jason Wang<jasowang@redhat.com>
> >>>>> wrote:
> >>>The vcpus are just threads and may not be bound to physical CPUs, so
> >>>what is the big picture here? Is the guest even in the position to
> >>>set the best queue mappings today?
> >>
> >>Not sure it could publish the best mapping but the idea is to make sure the
> >>packets of a flow were handled by the same guest vcpu and may be the same
> >>vhost thread in order to eliminate the packet reordering and lock
> >>contention. But this assumption does not take the bouncing of vhost or vcpu
> >>threads which would also affect the result.
> >Okay, this is why I'd like to know what the big picture here is. What
> >solution are you proposing? How are we going to have everything from
> >guest application, guest kernel, host threads, and host NIC driver
> >play along so we get the right steering up the entire stack. I think
> >there needs to be an answer to that before changing virtio-net to add
> >any steering mechanism.
> >
> >
> Yes. Also the current model of a vhost thread per VM's interface
> doesn't help with packet steering
> all the way from the guest to the host physical NIC.
>
> I think we need to have vhost thread(s) per-CPU that can handle
> packets to/from physical NIC's
> TX/RX queues.
> Currently we have a single vhost thread for a VM's i/f
> that handles all the packets from
> various flows coming from a multi-queue physical NIC.
>
> Thanks
> Sridhar
It's not hard to try that:
1. revert c23f3445e68e1db0e74099f264bc5ff5d55ebdeb
this will convert our thread to a workqueue
2. convert the workqueue to a per-cpu one
It didn't work that well in the past, but YMMV
On the surface I'd say a single thread makes some sense
as long as guest uses a single queue.
--
MST
^ permalink raw reply
* Re: [net-next RFC PATCH 5/5] virtio-net: flow director support
From: Sridhar Samudrala @ 2011-12-06 15:42 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: krkumar2, kvm, mst, netdev, virtualization, levinsasha928,
bhutchings
In-Reply-To: <CAJSP0QXsLwvH5xYj6h0E_V4VLg6DuUc-GKXu9esEYzL2MFcFGw@mail.gmail.com>
On 12/6/2011 5:15 AM, Stefan Hajnoczi wrote:
> On Tue, Dec 6, 2011 at 10:21 AM, Jason Wang<jasowang@redhat.com> wrote:
>> On 12/06/2011 05:18 PM, Stefan Hajnoczi wrote:
>>> On Tue, Dec 6, 2011 at 6:33 AM, Jason Wang<jasowang@redhat.com> wrote:
>>>> On 12/05/2011 06:55 PM, Stefan Hajnoczi wrote:
>>>>> On Mon, Dec 5, 2011 at 8:59 AM, Jason Wang<jasowang@redhat.com>
>>>>> wrote:
>>> The vcpus are just threads and may not be bound to physical CPUs, so
>>> what is the big picture here? Is the guest even in the position to
>>> set the best queue mappings today?
>>
>> Not sure it could publish the best mapping but the idea is to make sure the
>> packets of a flow were handled by the same guest vcpu and may be the same
>> vhost thread in order to eliminate the packet reordering and lock
>> contention. But this assumption does not take the bouncing of vhost or vcpu
>> threads which would also affect the result.
> Okay, this is why I'd like to know what the big picture here is. What
> solution are you proposing? How are we going to have everything from
> guest application, guest kernel, host threads, and host NIC driver
> play along so we get the right steering up the entire stack. I think
> there needs to be an answer to that before changing virtio-net to add
> any steering mechanism.
>
>
Yes. Also the current model of a vhost thread per VM's interface
doesn't help with packet steering
all the way from the guest to the host physical NIC.
I think we need to have vhost thread(s) per-CPU that can handle packets
to/from physical NIC's
TX/RX queues. Currently we have a single vhost thread for a VM's i/f
that handles all the packets from
various flows coming from a multi-queue physical NIC.
Thanks
Sridhar
^ permalink raw reply
* Re: [net-next RFC PATCH 5/5] virtio-net: flow director support
From: Stefan Hajnoczi @ 2011-12-06 13:15 UTC (permalink / raw)
To: Jason Wang
Cc: krkumar2, kvm, mst, netdev, virtualization, levinsasha928,
bhutchings
In-Reply-To: <4EDDECA9.8060808@redhat.com>
On Tue, Dec 6, 2011 at 10:21 AM, Jason Wang <jasowang@redhat.com> wrote:
> On 12/06/2011 05:18 PM, Stefan Hajnoczi wrote:
>>
>> On Tue, Dec 6, 2011 at 6:33 AM, Jason Wang<jasowang@redhat.com> wrote:
>>>
>>> On 12/05/2011 06:55 PM, Stefan Hajnoczi wrote:
>>>>
>>>> On Mon, Dec 5, 2011 at 8:59 AM, Jason Wang<jasowang@redhat.com>
>>>> wrote:
>> The vcpus are just threads and may not be bound to physical CPUs, so
>> what is the big picture here? Is the guest even in the position to
>> set the best queue mappings today?
>
>
> Not sure it could publish the best mapping but the idea is to make sure the
> packets of a flow were handled by the same guest vcpu and may be the same
> vhost thread in order to eliminate the packet reordering and lock
> contention. But this assumption does not take the bouncing of vhost or vcpu
> threads which would also affect the result.
Okay, this is why I'd like to know what the big picture here is. What
solution are you proposing? How are we going to have everything from
guest application, guest kernel, host threads, and host NIC driver
play along so we get the right steering up the entire stack. I think
there needs to be an answer to that before changing virtio-net to add
any steering mechanism.
Stefan
^ permalink raw reply
* Re: [PATCH] virtio-ring: Use threshold for switching to indirect descriptors
From: Rusty Russell @ 2011-12-06 12:03 UTC (permalink / raw)
To: Avi Kivity
Cc: markmc, kvm, Michael S. Tsirkin, linux-kernel, virtualization,
Sasha Levin
In-Reply-To: <4EDDE73D.1080209@redhat.com>
On Tue, 06 Dec 2011 11:58:21 +0200, Avi Kivity <avi@redhat.com> wrote:
> On 12/06/2011 07:07 AM, Rusty Russell wrote:
> > Yes, but the hypervisor/trusted party would simply have to do the copy;
> > the rings themselves would be shared A would say "copy this to/from B's
> > ring entry N" and you know that A can't have changed B's entry.
>
> Sorry, I don't follow. How can the rings be shared? If A puts a gpa in
> A's address space into the ring, there's no way B can do anything with
> it, it's an opaque number. Xen solves this with an extra layer of
> indirection (grant table handles) that cost extra hypercalls to map or
> copy.
It's not symmetric. B can see the desc and avail pages R/O, and the
used page R/W. It needs to ask the something to copy in/out of
descriptors, though, because they're an opaque number, and it doesn't
have access. ie. the existence of the descriptor in the ring *implies*
a grant.
Perhaps this could be generalized further into a "connect these two
rings", but I'm not sure. Descriptors with both read and write parts
are tricky.
> > Every driver really wants to put a pointer in there. We have an array
> > to map desc. numbers to cookies inside the virtio core.
> >
> > We really want 64 bits.
>
> With multiqueue, it may be cheaper to do the extra translation locally
> than to ship the cookie across cores (or, more likely, it will make no
> difference).
Indeed.
> However, moving pointers only works if you trust the other side. That
> doesn't work if we manage to share a ring.
Yes, that part needs to be trusted too.
> > I'm just not sure how the host would even know to hint.
>
> For JBOD storage, a good rule of thumb is (number of spindles) x 3.
> With less, you might leave an idle spindle; with more, you're just
> adding latency. This assumes you're using indirects so ring entry ==
> request. The picture is muddier with massive battery-backed RAID
> controllers or flash.
>
> For networking, you want (ring size) * min(expected packet size, page
> size) / (link bandwidth) to be something that doesn't get the
> bufferbloat people after your blood.
OK, so while neither side knows, the host knows slightly more.
Now I think about it, from a spec POV, saying it's a "hint" is useless,
as it doesn't tell the driver what to do with it. I'll say it's a
maximum, which keeps it simple.
Cheers,
Rusty.
^ permalink raw reply
* Re: [net-next RFC PATCH 5/5] virtio-net: flow director support
From: Jason Wang @ 2011-12-06 10:21 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: krkumar2, kvm, mst, netdev, virtualization, levinsasha928,
bhutchings
In-Reply-To: <CAJSP0QX5dDkpX+cRcQut2mb6K91zeqGLRrZBGAWT_r2p685gaQ@mail.gmail.com>
On 12/06/2011 05:18 PM, Stefan Hajnoczi wrote:
> On Tue, Dec 6, 2011 at 6:33 AM, Jason Wang<jasowang@redhat.com> wrote:
>> On 12/05/2011 06:55 PM, Stefan Hajnoczi wrote:
>>> On Mon, Dec 5, 2011 at 8:59 AM, Jason Wang<jasowang@redhat.com> wrote:
>>>> +static int virtnet_set_fd(struct net_device *dev, u32 pfn)
>>>> +{
>>>> + struct virtnet_info *vi = netdev_priv(dev);
>>>> + struct virtio_device *vdev = vi->vdev;
>>>> +
>>>> + if (virtio_has_feature(vdev, VIRTIO_NET_F_HOST_FD)) {
>>>> + vdev->config->set(vdev,
>>>> + offsetof(struct virtio_net_config_fd,
>>>> addr),
>>>> +&pfn, sizeof(u32));
>>> Please use the virtio model (i.e. virtqueues) instead of shared
>>> memory. Mapping a page breaks the virtio abstraction.
>>
>> Using control virtqueue is more suitable but there's are also some problems:
>>
>> One problem is the interface, if we use control virtqueue, we need a
>> interface between the backend and tap/macvtap to change the flow mapping.
>> But qemu and vhost_net only know about the file descriptor, more
>> informations or interfaces need to be exposed in order to let ethtool or
>> ioctl work.
> QEMU could provide map a shared page with tap/macvtap. The difference
> would be that the guest<->host interface is still virtio and QEMU
> pokes values into the shared page on behalf of the guest.
This makes sense.
>> Another problem is the delay introduced by ctrl vq, as the ctrl vq would be
>> used in the critical path in guest and it use busy wait to get the response,
>> the delay is not neglectable.
> Then you need to find a better way of doing this. Can the host
> automatically associate the flow from the tx virtqueue packets are
> transmitted on? Does it make sense to add a virtio_net_hdr field that
> updates the queue mapping?
It can but it can not properly handling the the packet re-ordering
caused by the moving of guest applications among guest cpus. One more
problem for virtio_net_hdr is we need to build a empty packet when there
no other packet to send.
One solution is to introduce unblock cmd for ctrl vq.
> The vcpus are just threads and may not be bound to physical CPUs, so
> what is the big picture here? Is the guest even in the position to
> set the best queue mappings today?
Not sure it could publish the best mapping but the idea is to make sure
the packets of a flow were handled by the same guest vcpu and may be the
same vhost thread in order to eliminate the packet reordering and lock
contention. But this assumption does not take the bouncing of vhost or
vcpu threads which would also affect the result.
Anyway, the mapping from guest was an important reference.
> Stefan
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH] virtio-ring: Use threshold for switching to indirect descriptors
From: Avi Kivity @ 2011-12-06 9:58 UTC (permalink / raw)
To: Rusty Russell
Cc: markmc, kvm, Michael S. Tsirkin, linux-kernel, virtualization,
Sasha Levin
In-Reply-To: <87pqg2p9t8.fsf@rustcorp.com.au>
On 12/06/2011 07:07 AM, Rusty Russell wrote:
> On Mon, 05 Dec 2011 11:52:54 +0200, Avi Kivity <avi@redhat.com> wrote:
> > On 12/05/2011 02:10 AM, Rusty Russell wrote:
> > > On Sun, 04 Dec 2011 17:16:59 +0200, Avi Kivity <avi@redhat.com> wrote:
> > > > On 12/04/2011 05:11 PM, Michael S. Tsirkin wrote:
> > > > > > There's also the used ring, but that's a
> > > > > > mistake if you have out of order completion. We should have used copying.
> > > > >
> > > > > Seems unrelated... unless you want used to be written into
> > > > > descriptor ring itself?
> > > >
> > > > The avail/used rings are in addition to the regular ring, no? If you
> > > > copy descriptors, then it goes away.
> > >
> > > There were two ideas which drove the current design:
> > >
> > > 1) The Van-Jacobson style "no two writers to same cacheline makes rings
> > > fast" idea. Empirically, this doesn't show any winnage.
> >
> > Write/write is the same as write/read or read/write. Both cases have to
> > send a probe and wait for the result. What we really need is to
> > minimize cache line ping ponging, and the descriptor pool fails that
> > with ooo completion. I doubt it's measurable though except with the
> > very fastest storage providers.
>
> The claim was that going exclusive->shared->exclusive was cheaper than
> exclusive->invalid->exclusive. When VJ said it, it seemed convincing :)
IIRC at some point in time certain transitions were forced to go through
main memory; perhaps that was behind this. I think these days
everything goes through the coherency protocol; not sure though.
> > > 2) Allowing a generic inter-guest copy mechanism, so we could have
> > > genuinely untrusted driver domains. Yet noone ever did this so it's
> > > hardly a killer feature :(
> >
> > It's still a goal, though not an important one. But we have to
> > translate rings anyway, don't, since buffers are in guest physical
> > addresses, and we're moving into an address space that doesn't map those.
>
> Yes, but the hypervisor/trusted party would simply have to do the copy;
> the rings themselves would be shared A would say "copy this to/from B's
> ring entry N" and you know that A can't have changed B's entry.
Sorry, I don't follow. How can the rings be shared? If A puts a gpa in
A's address space into the ring, there's no way B can do anything with
it, it's an opaque number. Xen solves this with an extra layer of
indirection (grant table handles) that cost extra hypercalls to map or copy.
> > I thought of having a vhost-copy driver that could do ring translation,
> > using a dma engine for the copy.
>
> As long as we get the length of data written from the vhost-copy driver
> (ie. not just the network header). Otherwise a malicious other guest
> can send short packets, and a local process can read uninitialized
> memory. And pre-zeroing the buffers for this corner case sucks.
Right. It's important to get the metadata visible at the right layer.
> > Also, stuff the cookie into len_and_flags as well.
>
> Every driver really wants to put a pointer in there. We have an array
> to map desc. numbers to cookies inside the virtio core.
>
> We really want 64 bits.
With multiqueue, it may be cheaper to do the extra translation locally
than to ship the cookie across cores (or, more likely, it will make no
difference).
However, moving pointers only works if you trust the other side. That
doesn't work if we manage to share a ring.
> > > 2) Stick with physically-contiguous rings, but use them of size (2^n)-1.
> > > Makes the indexing harder, but that -1 lets us stash the indices in
> > > the first entry and makes the ring a nice 2^n size.
> >
> > Allocate at lease a cache line for those. The 2^n size is not really
> > material, a division is never necessary.
>
> We free-run our indices, so we *do* a division (by truncation). If we
> limit indices to ringsize, then we have to handle empty/full confusion.
>
> It's nice for simple OSes if things pack nicely into pages, but it's not
> a killer feature IMHO.
Agree.
> > > > > > 16kB worth of descriptors is 1024 entries. With 4kB buffers, that's 4MB
> > > > > > worth of data, or 4 ms at 10GbE line speed. With 1500 byte buffers it's
> > > > > > just 1.5 ms. In any case I think it's sufficient.
> > > > >
> > > > > Right. So I think that without indirect, we waste about 3 entries
> > > > > per packet for virtio header and transport etc headers.
> > > >
> > > > That does suck. Are there issues in increasing the ring size? Or
> > > > making it discontiguous?
> > >
> > > Because the qemu implementation is broken.
> >
> > I was talking about something else, but this is more important. Every
> > time we make a simplifying assumption, it turns around and bites us, and
> > the code becomes twice as complicated as it would have been in the first
> > place, and the test matrix explodes.
>
> True, though we seem to be improving. But this is why I don't want
> optional features in the spec; I want us always to exercise all of it.
Yes, and I think the main reason we're improving is that we actually
have a spec these days. Agree about optional features, though of course
we accumulate them by not thinking of everything ahead of time, I don't
see how we can fix that.
> > > We currently use small rings: the guest can't negotiate so qemu has to
> > > offer a lowest-common-denominator value. The new virtio-pci layout
> > > fixes this, and lets the guest set the ring size.
> >
> > Ok good. Note the figuring out the best ring size needs some info from
> > the host, but that can be had from other channels.
>
> We have a ringsize field; should the host initialize it as a hint? Or
> as a maximum allowable?
>
> I'm just not sure how the host would even know to hint.
For JBOD storage, a good rule of thumb is (number of spindles) x 3.
With less, you might leave an idle spindle; with more, you're just
adding latency. This assumes you're using indirects so ring entry ==
request. The picture is muddier with massive battery-backed RAID
controllers or flash.
For networking, you want (ring size) * min(expected packet size, page
size) / (link bandwidth) to be something that doesn't get the
bufferbloat people after your blood.
> > > > Can you take a peek at how Xen manages its rings? They have the same
> > > > problems we do.
> > >
> > > Yes, I made some mistakes, but I did steal from them in the first
> > > place...
> >
> > There was a bit of second system syndrome there.
>
> But also some third syndrome: Xen had gone through their great device
> rewrite, and things like feature bits were a direct response against
> that complexity. We did well there.
Yes, feature bits really proved themselves.
--
error compiling committee.c: too many arguments to function
^ permalink raw reply
* Re: [net-next RFC PATCH 5/5] virtio-net: flow director support
From: Stefan Hajnoczi @ 2011-12-06 9:18 UTC (permalink / raw)
To: Jason Wang
Cc: krkumar2, kvm, mst, netdev, virtualization, levinsasha928,
bhutchings
In-Reply-To: <4EDDB71F.9080908@redhat.com>
On Tue, Dec 6, 2011 at 6:33 AM, Jason Wang <jasowang@redhat.com> wrote:
> On 12/05/2011 06:55 PM, Stefan Hajnoczi wrote:
>>
>> On Mon, Dec 5, 2011 at 8:59 AM, Jason Wang<jasowang@redhat.com> wrote:
>>>
>>> +static int virtnet_set_fd(struct net_device *dev, u32 pfn)
>>> +{
>>> + struct virtnet_info *vi = netdev_priv(dev);
>>> + struct virtio_device *vdev = vi->vdev;
>>> +
>>> + if (virtio_has_feature(vdev, VIRTIO_NET_F_HOST_FD)) {
>>> + vdev->config->set(vdev,
>>> + offsetof(struct virtio_net_config_fd,
>>> addr),
>>> +&pfn, sizeof(u32));
>>
>> Please use the virtio model (i.e. virtqueues) instead of shared
>> memory. Mapping a page breaks the virtio abstraction.
>
>
> Using control virtqueue is more suitable but there's are also some problems:
>
> One problem is the interface, if we use control virtqueue, we need a
> interface between the backend and tap/macvtap to change the flow mapping.
> But qemu and vhost_net only know about the file descriptor, more
> informations or interfaces need to be exposed in order to let ethtool or
> ioctl work.
QEMU could provide map a shared page with tap/macvtap. The difference
would be that the guest<->host interface is still virtio and QEMU
pokes values into the shared page on behalf of the guest.
> Another problem is the delay introduced by ctrl vq, as the ctrl vq would be
> used in the critical path in guest and it use busy wait to get the response,
> the delay is not neglectable.
Then you need to find a better way of doing this. Can the host
automatically associate the flow from the tx virtqueue packets are
transmitted on? Does it make sense to add a virtio_net_hdr field that
updates the queue mapping?
The vcpus are just threads and may not be bound to physical CPUs, so
what is the big picture here? Is the guest even in the position to
set the best queue mappings today?
Stefan
^ permalink raw reply
* Re: [net-next RFC PATCH 5/5] virtio-net: flow director support
From: Jason Wang @ 2011-12-06 7:25 UTC (permalink / raw)
To: Ben Hutchings; +Cc: krkumar2, kvm, mst, netdev, virtualization, levinsasha928
In-Reply-To: <1323117745.2887.31.camel@bwh-desktop>
On 12/06/2011 04:42 AM, Ben Hutchings wrote:
> On Mon, 2011-12-05 at 16:59 +0800, Jason Wang wrote:
>> In order to let the packets of a flow to be passed to the desired
>> guest cpu, we can co-operate with devices through programming the flow
>> director which was just a hash to queue table.
>>
>> This kinds of co-operation is done through the accelerate RFS support,
>> a device specific flow sterring method virtnet_fd() is used to modify
>> the flow director based on rfs mapping. The desired queue were
>> calculated through reverse mapping of the irq affinity table. In order
>> to parallelize the ingress path, irq affinity of rx queue were also
>> provides by the driver.
>>
>> In addition to accelerate RFS, we can also use the guest scheduler to
>> balance the load of TX and reduce the lock contention on egress path,
>> so the processor_id() were used to tx queue selection.
> [...]
>> +#ifdef CONFIG_RFS_ACCEL
>> +
>> +int virtnet_fd(struct net_device *net_dev, const struct sk_buff *skb,
>> + u16 rxq_index, u32 flow_id)
>> +{
>> + struct virtnet_info *vi = netdev_priv(net_dev);
>> + u16 *table = NULL;
>> +
>> + if (skb->protocol != htons(ETH_P_IP) || !skb->rxhash)
>> + return -EPROTONOSUPPORT;
> Why only IPv4?
Oops, IPv6 should work also.
>> + table = kmap_atomic(vi->fd_page);
>> + table[skb->rxhash& TAP_HASH_MASK] = rxq_index;
>> + kunmap_atomic(table);
>> +
>> + return 0;
>> +}
>> +#endif
> This is not a proper implementation of ndo_rx_flow_steer. If you steer
> a flow by changing the RSS table this can easily cause packet reordering
> in other flows. The filtering should be more precise, ideally matching
> exactly a single flow by e.g. VID and IP 5-tuple.
>
> I think you need to add a second hash table which records exactly which
> flow is supposed to be steered. Also, you must call
> rps_may_expire_flow() to check whether an entry in this table may be
> replaced; otherwise you can cause packet reordering in the flow that was
> previously being steered.
>
> Finally, this function must return the table index it assigned, so that
> rps_may_expire_flow() works.
Thanks for the explanation, how about document this briefly in scaling.txt?
>> +static u16 virtnet_select_queue(struct net_device *dev, struct sk_buff *skb)
>> +{
>> + int txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) :
>> + smp_processor_id();
>> +
>> + /* As we make use of the accelerate rfs which let the scheduler to
>> + * balance the load, it make sense to choose the tx queue also based on
>> + * theprocessor id?
>> + */
>> + while (unlikely(txq>= dev->real_num_tx_queues))
>> + txq -= dev->real_num_tx_queues;
>> + return txq;
>> +}
> [...]
>
> Don't do this, let XPS handle it.
>
> Ben.
>
^ permalink raw reply
* Re: [net-next RFC PATCH 2/5] tuntap: simple flow director support
From: Jason Wang @ 2011-12-06 7:21 UTC (permalink / raw)
To: Ben Hutchings; +Cc: krkumar2, kvm, mst, netdev, virtualization, levinsasha928
In-Reply-To: <1323115763.2887.12.camel@bwh-desktop>
On 12/06/2011 04:09 AM, Ben Hutchings wrote:
> On Mon, 2011-12-05 at 16:58 +0800, Jason Wang wrote:
>> This patch adds a simple flow director to tun/tap device. It is just a
>> page that contains the hash to queue mapping which could be changed by
>> user-space. The backend (tap/macvtap) would query this table to get
>> the desired queue of a packets when it send packets to userspace.
> This is just flow hashing (RSS), not flow steering.
>
>> The page address were set through a new kind of ioctl - TUNSETFD and
>> were pinned until device exit or another new page were specified.
> [...]
>
> You should implement ethtool ETHTOOL_{G,S}RXFHINDIR instead.
>
> Ben.
>
I'm not fully understanding this. The page belongs to guest, and the
idea is to let guest driver can easily change any entry. Looks like if
ethtool_set_rxfh_indir() is used, this kind of change is not easy as it
needs one copy and can only accept the whole table as its parameters.
^ permalink raw reply
* Re: [PATCH RFC V3 4/4] kvm : pv-ticketlocks support for linux guests running on KVM hypervisor
From: Raghavendra K T @ 2011-12-06 6:54 UTC (permalink / raw)
To: Konrad Rzeszutek Wilk
Cc: Jeremy Fitzhardinge, Raghavendra K T, Peter Zijlstra,
Virtualization, H. Peter Anvin, Jeremy Fitzhardinge, Dave Jiang,
KVM, x86, Ingo Molnar, Avi Kivity, Rik van Riel,
Stefano Stabellini, Srivatsa Vaddagiri, Xen, Sasha Levin,
Sedat Dilek, Thomas Gleixner, Yinghai Lu, Greg Kroah-Hartman,
LKML, Dave Hansen, Suzuki Poulose
In-Reply-To: <20111206032741.GF6647@phenom.dumpdata.com>
On 12/06/2011 08:57 AM, Konrad Rzeszutek Wilk wrote:
>> +static inline void add_stats(enum kvm_contention_stat var, int val)
>
> You probably want 'int val' to be 'u32 val' as that is the type
> in contention_stats.
>
Yes. Thanks for pointing, as its cumulative. It is indeed u32 in #else
:).I 'll change that.
^ permalink raw reply
* Re: [net-next RFC PATCH 5/5] virtio-net: flow director support
From: Jason Wang @ 2011-12-06 6:33 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: krkumar2, kvm, mst, netdev, virtualization, levinsasha928,
bhutchings
In-Reply-To: <CAJSP0QXfHC8j2Kk9x6wyBMubo2=YvpaLAShJtwMWA2mtro6UhA@mail.gmail.com>
On 12/05/2011 06:55 PM, Stefan Hajnoczi wrote:
> On Mon, Dec 5, 2011 at 8:59 AM, Jason Wang<jasowang@redhat.com> wrote:
>> +static int virtnet_set_fd(struct net_device *dev, u32 pfn)
>> +{
>> + struct virtnet_info *vi = netdev_priv(dev);
>> + struct virtio_device *vdev = vi->vdev;
>> +
>> + if (virtio_has_feature(vdev, VIRTIO_NET_F_HOST_FD)) {
>> + vdev->config->set(vdev,
>> + offsetof(struct virtio_net_config_fd, addr),
>> +&pfn, sizeof(u32));
> Please use the virtio model (i.e. virtqueues) instead of shared
> memory. Mapping a page breaks the virtio abstraction.
Using control virtqueue is more suitable but there's are also some problems:
One problem is the interface, if we use control virtqueue, we need a
interface between the backend and tap/macvtap to change the flow
mapping. But qemu and vhost_net only know about the file descriptor,
more informations or interfaces need to be exposed in order to let
ethtool or ioctl work.
Another problem is the delay introduced by ctrl vq, as the ctrl vq would
be used in the critical path in guest and it use busy wait to get the
response, the delay is not neglectable.
> Stefan
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH] virtio-ring: Use threshold for switching to indirect descriptors
From: Rusty Russell @ 2011-12-06 5:07 UTC (permalink / raw)
To: Avi Kivity
Cc: markmc, kvm, Michael S. Tsirkin, linux-kernel, virtualization,
Sasha Levin
In-Reply-To: <4EDC9476.3000301@redhat.com>
On Mon, 05 Dec 2011 11:52:54 +0200, Avi Kivity <avi@redhat.com> wrote:
> On 12/05/2011 02:10 AM, Rusty Russell wrote:
> > On Sun, 04 Dec 2011 17:16:59 +0200, Avi Kivity <avi@redhat.com> wrote:
> > > On 12/04/2011 05:11 PM, Michael S. Tsirkin wrote:
> > > > > There's also the used ring, but that's a
> > > > > mistake if you have out of order completion. We should have used copying.
> > > >
> > > > Seems unrelated... unless you want used to be written into
> > > > descriptor ring itself?
> > >
> > > The avail/used rings are in addition to the regular ring, no? If you
> > > copy descriptors, then it goes away.
> >
> > There were two ideas which drove the current design:
> >
> > 1) The Van-Jacobson style "no two writers to same cacheline makes rings
> > fast" idea. Empirically, this doesn't show any winnage.
>
> Write/write is the same as write/read or read/write. Both cases have to
> send a probe and wait for the result. What we really need is to
> minimize cache line ping ponging, and the descriptor pool fails that
> with ooo completion. I doubt it's measurable though except with the
> very fastest storage providers.
The claim was that going exclusive->shared->exclusive was cheaper than
exclusive->invalid->exclusive. When VJ said it, it seemed convincing :)
> > 2) Allowing a generic inter-guest copy mechanism, so we could have
> > genuinely untrusted driver domains. Yet noone ever did this so it's
> > hardly a killer feature :(
>
> It's still a goal, though not an important one. But we have to
> translate rings anyway, don't, since buffers are in guest physical
> addresses, and we're moving into an address space that doesn't map those.
Yes, but the hypervisor/trusted party would simply have to do the copy;
the rings themselves would be shared A would say "copy this to/from B's
ring entry N" and you know that A can't have changed B's entry.
> I thought of having a vhost-copy driver that could do ring translation,
> using a dma engine for the copy.
As long as we get the length of data written from the vhost-copy driver
(ie. not just the network header). Otherwise a malicious other guest
can send short packets, and a local process can read uninitialized
memory. And pre-zeroing the buffers for this corner case sucks.
> > So if we're going to revisit and drop those requirements, I'd say:
> >
> > 1) Shared device/driver rings like Xen. Xen uses device-specific ring
> > contents, I'd be tempted to stick to our pre-headers, and a 'u64
> > addr; u64 len_and_flags; u64 cookie;' generic style. Then use
> > the same ring for responses. That's a slight space-win, since
> > we're 24 bytes vs 26 bytes now.
>
> Let's cheat and have inline contents. Take three bits from
> len_and_flags to specify additional descriptors as inline data.
Nice, I like this optimization.
> Also, stuff the cookie into len_and_flags as well.
Every driver really wants to put a pointer in there. We have an array
to map desc. numbers to cookies inside the virtio core.
We really want 64 bits.
> > 2) Stick with physically-contiguous rings, but use them of size (2^n)-1.
> > Makes the indexing harder, but that -1 lets us stash the indices in
> > the first entry and makes the ring a nice 2^n size.
>
> Allocate at lease a cache line for those. The 2^n size is not really
> material, a division is never necessary.
We free-run our indices, so we *do* a division (by truncation). If we
limit indices to ringsize, then we have to handle empty/full confusion.
It's nice for simple OSes if things pack nicely into pages, but it's not
a killer feature IMHO.
> > > > > 16kB worth of descriptors is 1024 entries. With 4kB buffers, that's 4MB
> > > > > worth of data, or 4 ms at 10GbE line speed. With 1500 byte buffers it's
> > > > > just 1.5 ms. In any case I think it's sufficient.
> > > >
> > > > Right. So I think that without indirect, we waste about 3 entries
> > > > per packet for virtio header and transport etc headers.
> > >
> > > That does suck. Are there issues in increasing the ring size? Or
> > > making it discontiguous?
> >
> > Because the qemu implementation is broken.
>
> I was talking about something else, but this is more important. Every
> time we make a simplifying assumption, it turns around and bites us, and
> the code becomes twice as complicated as it would have been in the first
> place, and the test matrix explodes.
True, though we seem to be improving. But this is why I don't want
optional features in the spec; I want us always to exercise all of it.
> > We can often put the virtio
> > header at the head of the packet. In practice, the qemu implementation
> > insists the header be a single descriptor.
> >
> > (At least, it used to, perhaps it has now been fixed. We need a
> > VIRTIO_NET_F_I_NOW_CONFORM_TO_THE_DAMN_SPEC_SORRY_I_SUCK bit).
>
> We'll run out of bits in no time.
We had one already: VIRTIO_F_BAD_FEATURE. We haven't used it in a long
time (if ever), and I just removed it from the latest version of the
spec.
But we can cheat: we can add this as a requirement to The New Ring
Layout. And document the old behaviour as a footnote.
> > We currently use small rings: the guest can't negotiate so qemu has to
> > offer a lowest-common-denominator value. The new virtio-pci layout
> > fixes this, and lets the guest set the ring size.
>
> Ok good. Note the figuring out the best ring size needs some info from
> the host, but that can be had from other channels.
We have a ringsize field; should the host initialize it as a hint? Or
as a maximum allowable?
I'm just not sure how the host would even know to hint.
> > > Can you take a peek at how Xen manages its rings? They have the same
> > > problems we do.
> >
> > Yes, I made some mistakes, but I did steal from them in the first
> > place...
>
> There was a bit of second system syndrome there.
But also some third syndrome: Xen had gone through their great device
rewrite, and things like feature bits were a direct response against
that complexity. We did well there.
> And I don't understand how the ring/pool issue didn't surface during
> review, it seems so obvious now but completely eluded me then.
The idea was we'd just enlarge the rings. Only later did that prove
naive, and many things came in to fix that. The cacheline issue is
speculative at this point.
But we are learning, and making fewer mistakes; we could have learn some
things earlier (I was remiss in not seeking the Xen devs' advice on the
ring itself, for example).
I honestly don't think it gets much better than this in Real Life.
I'm working on the virtio-pci capability layout patch now, then I'll
look at a ring rewrite.
Thanks,
Rusty.
^ permalink raw reply
* Re: [PATCH RFC V3 4/4] kvm : pv-ticketlocks support for linux guests running on KVM hypervisor
From: Konrad Rzeszutek Wilk @ 2011-12-06 3:27 UTC (permalink / raw)
To: Raghavendra K T
Cc: Peter Zijlstra, Virtualization, H. Peter Anvin,
Jeremy Fitzhardinge, Dave Jiang, KVM, x86, Ingo Molnar,
Avi Kivity, Rik van Riel, Stefano Stabellini, Srivatsa Vaddagiri,
Xen, Sasha Levin, Sedat Dilek, Thomas Gleixner, Yinghai Lu,
Greg Kroah-Hartman, LKML, Dave Hansen, Suzuki Poulose
In-Reply-To: <20111130090038.23386.4505.sendpatchset@oc5400248562.ibm.com>
On Wed, Nov 30, 2011 at 02:30:38PM +0530, Raghavendra K T wrote:
> This patch extends Linux guests running on KVM hypervisor to support
> pv-ticketlocks.
> During smp_boot_cpus paravirtualied KVM guest detects if the hypervisor has
> required feature (KVM_FEATURE_KICK_VCPU) to support pv-ticketlocks. If so,
> support for pv-ticketlocks is registered via pv_lock_ops.
>
> Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
> Signed-off-by: Suzuki Poulose <suzuki@in.ibm.com>
> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
> ---
> diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> index 8b1d65d..7e419ad 100644
> --- a/arch/x86/include/asm/kvm_para.h
> +++ b/arch/x86/include/asm/kvm_para.h
> @@ -195,10 +195,21 @@ void kvm_async_pf_task_wait(u32 token);
> void kvm_async_pf_task_wake(u32 token);
> u32 kvm_read_and_reset_pf_reason(void);
> extern void kvm_disable_steal_time(void);
> -#else
> -#define kvm_guest_init() do { } while (0)
> +
> +#ifdef CONFIG_PARAVIRT_SPINLOCKS
> +void __init kvm_spinlock_init(void);
> +#else /* CONFIG_PARAVIRT_SPINLOCKS */
> +static void kvm_spinlock_init(void)
> +{
> +}
> +#endif /* CONFIG_PARAVIRT_SPINLOCKS */
> +
> +#else /* CONFIG_KVM_GUEST */
> +#define kvm_guest_init() do {} while (0)
> #define kvm_async_pf_task_wait(T) do {} while(0)
> #define kvm_async_pf_task_wake(T) do {} while(0)
> +#define kvm_spinlock_init() do {} while (0)
> +
> static inline u32 kvm_read_and_reset_pf_reason(void)
> {
> return 0;
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index a9c2116..dffeea3 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -33,6 +33,7 @@
> #include <linux/sched.h>
> #include <linux/slab.h>
> #include <linux/kprobes.h>
> +#include <linux/debugfs.h>
> #include <asm/timer.h>
> #include <asm/cpu.h>
> #include <asm/traps.h>
> @@ -545,6 +546,7 @@ static void __init kvm_smp_prepare_boot_cpu(void)
> #endif
> kvm_guest_cpu_init();
> native_smp_prepare_boot_cpu();
> + kvm_spinlock_init();
> }
>
> static void __cpuinit kvm_guest_cpu_online(void *dummy)
> @@ -627,3 +629,248 @@ static __init int activate_jump_labels(void)
> return 0;
> }
> arch_initcall(activate_jump_labels);
> +
> +#ifdef CONFIG_PARAVIRT_SPINLOCKS
> +
> +enum kvm_contention_stat {
> + TAKEN_SLOW,
> + TAKEN_SLOW_PICKUP,
> + RELEASED_SLOW,
> + RELEASED_SLOW_KICKED,
> + NR_CONTENTION_STATS
> +};
> +
> +#ifdef CONFIG_KVM_DEBUG_FS
> +
> +static struct kvm_spinlock_stats
> +{
> + u32 contention_stats[NR_CONTENTION_STATS];
> +
> +#define HISTO_BUCKETS 30
> + u32 histo_spin_blocked[HISTO_BUCKETS+1];
> +
> + u64 time_blocked;
> +} spinlock_stats;
> +
> +static u8 zero_stats;
> +
> +static inline void check_zero(void)
> +{
> + u8 ret;
> + u8 old = ACCESS_ONCE(zero_stats);
> + if (unlikely(old)) {
> + ret = cmpxchg(&zero_stats, old, 0);
> + /* This ensures only one fellow resets the stat */
> + if (ret == old)
> + memset(&spinlock_stats, 0, sizeof(spinlock_stats));
> + }
> +}
> +
> +static inline void add_stats(enum kvm_contention_stat var, int val)
You probably want 'int val' to be 'u32 val' as that is the type
in contention_stats.
^ permalink raw reply
* Re: [PATCH] xen-blkfront: Use kcalloc instead of kzalloc to allocate array
From: Konrad Rzeszutek Wilk @ 2011-12-06 3:27 UTC (permalink / raw)
To: Thomas Meyer; +Cc: xen-devel, linux-kernel, virtualization
In-Reply-To: <1322600880.1534.293.camel@localhost.localdomain>
On Tue, Nov 29, 2011 at 10:08:00PM +0100, Thomas Meyer wrote:
> The advantage of kcalloc is, that will prevent integer overflows which could
> result from the multiplication of number of elements and size and it is also
> a bit nicer to read.
>
> The semantic patch that makes this change is available
> in https://lkml.org/lkml/2011/11/25/107
>
> Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
> ---
>
> diff -u -p a/drivers/block/cciss_scsi.c b/drivers/block/cciss_scsi.c
> --- a/drivers/block/cciss_scsi.c 2011-11-28 19:36:47.343430551 +0100
> +++ b/drivers/block/cciss_scsi.c 2011-11-28 19:49:24.922716381 +0100
> @@ -534,10 +534,10 @@ adjust_cciss_scsi_table(ctlr_info_t *h,
> int nadded, nremoved;
> struct Scsi_Host *sh = NULL;
>
> - added = kzalloc(sizeof(*added) * CCISS_MAX_SCSI_DEVS_PER_HBA,
> - GFP_KERNEL);
> - removed = kzalloc(sizeof(*removed) * CCISS_MAX_SCSI_DEVS_PER_HBA,
> + added = kcalloc(CCISS_MAX_SCSI_DEVS_PER_HBA, sizeof(*added),
> GFP_KERNEL);
> + removed = kcalloc(CCISS_MAX_SCSI_DEVS_PER_HBA, sizeof(*removed),
> + GFP_KERNEL);
>
It looks like you mixed two patches together.
> if (!added || !removed) {
> dev_warn(&h->pdev->dev,
> @@ -1191,8 +1191,8 @@ cciss_update_non_disk_devices(ctlr_info_
>
> ld_buff = kzalloc(reportlunsize, GFP_KERNEL);
> inq_buff = kmalloc(OBDR_TAPE_INQ_SIZE, GFP_KERNEL);
> - currentsd = kzalloc(sizeof(*currentsd) *
> - (CCISS_MAX_SCSI_DEVS_PER_HBA+1), GFP_KERNEL);
> + currentsd = kcalloc(CCISS_MAX_SCSI_DEVS_PER_HBA + 1,
> + sizeof(*currentsd), GFP_KERNEL);
> if (ld_buff == NULL || inq_buff == NULL || currentsd == NULL) {
> printk(KERN_ERR "cciss: out of memory\n");
> goto out;
> diff -u -p a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> --- a/drivers/block/xen-blkfront.c 2011-11-13 11:07:22.680095573 +0100
> +++ b/drivers/block/xen-blkfront.c 2011-11-28 19:49:29.109460410 +0100
> @@ -156,7 +156,7 @@ static int xlbd_reserve_minors(unsigned
> if (end > nr_minors) {
> unsigned long *bitmap, *old;
>
> - bitmap = kzalloc(BITS_TO_LONGS(end) * sizeof(*bitmap),
> + bitmap = kcalloc(BITS_TO_LONGS(end), sizeof(*bitmap),
> GFP_KERNEL);
But this parts looks good.
Can you respin it with just that part please?
> if (bitmap == NULL)
> return -ENOMEM;
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
^ permalink raw reply
* Re: [Xen-devel] [PATCH RFC V3 1/4] debugfs: Add support to print u32 array in debugfs
From: Konrad Rzeszutek Wilk @ 2011-12-06 0:13 UTC (permalink / raw)
To: Raghavendra K T
Cc: KVM, Peter Zijlstra, Virtualization, H. Peter Anvin, Xen,
Dave Jiang, x86, Ingo Molnar, Avi Kivity, Rik van Riel,
Stefano Stabellini, Srivatsa Vaddagiri, Jeremy Fitzhardinge,
Sasha Levin, Sedat Dilek, Thomas Gleixner, Yinghai Lu,
Konrad Rzeszutek Wilk, Greg Kroah-Hartman, LKML, Dave Hansen,
Suzuki Poulose
In-Reply-To: <20111130085939.23386.1242.sendpatchset@oc5400248562.ibm.com>
On Wed, Nov 30, 2011 at 02:29:39PM +0530, Raghavendra K T wrote:
> Add debugfs support to print u32-arrays in debugfs. Move the code from Xen to debugfs
> to make the code common for other users as well.
>
> Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
> Signed-off-by: Suzuki Poulose <suzuki@in.ibm.com>
> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Looks good to me.
> ---
> diff --git a/arch/x86/xen/debugfs.c b/arch/x86/xen/debugfs.c
> index 7c0fedd..c8377fb 100644
> --- a/arch/x86/xen/debugfs.c
> +++ b/arch/x86/xen/debugfs.c
> @@ -19,107 +19,3 @@ struct dentry * __init xen_init_debugfs(void)
> return d_xen_debug;
> }
>
> -struct array_data
> -{
> - void *array;
> - unsigned elements;
> -};
> -
> -static int u32_array_open(struct inode *inode, struct file *file)
> -{
> - file->private_data = NULL;
> - return nonseekable_open(inode, file);
> -}
> -
> -static size_t format_array(char *buf, size_t bufsize, const char *fmt,
> - u32 *array, unsigned array_size)
> -{
> - size_t ret = 0;
> - unsigned i;
> -
> - for(i = 0; i < array_size; i++) {
> - size_t len;
> -
> - len = snprintf(buf, bufsize, fmt, array[i]);
> - len++; /* ' ' or '\n' */
> - ret += len;
> -
> - if (buf) {
> - buf += len;
> - bufsize -= len;
> - buf[-1] = (i == array_size-1) ? '\n' : ' ';
> - }
> - }
> -
> - ret++; /* \0 */
> - if (buf)
> - *buf = '\0';
> -
> - return ret;
> -}
> -
> -static char *format_array_alloc(const char *fmt, u32 *array, unsigned array_size)
> -{
> - size_t len = format_array(NULL, 0, fmt, array, array_size);
> - char *ret;
> -
> - ret = kmalloc(len, GFP_KERNEL);
> - if (ret == NULL)
> - return NULL;
> -
> - format_array(ret, len, fmt, array, array_size);
> - return ret;
> -}
> -
> -static ssize_t u32_array_read(struct file *file, char __user *buf, size_t len,
> - loff_t *ppos)
> -{
> - struct inode *inode = file->f_path.dentry->d_inode;
> - struct array_data *data = inode->i_private;
> - size_t size;
> -
> - if (*ppos == 0) {
> - if (file->private_data) {
> - kfree(file->private_data);
> - file->private_data = NULL;
> - }
> -
> - file->private_data = format_array_alloc("%u", data->array, data->elements);
> - }
> -
> - size = 0;
> - if (file->private_data)
> - size = strlen(file->private_data);
> -
> - return simple_read_from_buffer(buf, len, ppos, file->private_data, size);
> -}
> -
> -static int xen_array_release(struct inode *inode, struct file *file)
> -{
> - kfree(file->private_data);
> -
> - return 0;
> -}
> -
> -static const struct file_operations u32_array_fops = {
> - .owner = THIS_MODULE,
> - .open = u32_array_open,
> - .release= xen_array_release,
> - .read = u32_array_read,
> - .llseek = no_llseek,
> -};
> -
> -struct dentry *xen_debugfs_create_u32_array(const char *name, mode_t mode,
> - struct dentry *parent,
> - u32 *array, unsigned elements)
> -{
> - struct array_data *data = kmalloc(sizeof(*data), GFP_KERNEL);
> -
> - if (data == NULL)
> - return NULL;
> -
> - data->array = array;
> - data->elements = elements;
> -
> - return debugfs_create_file(name, mode, parent, data, &u32_array_fops);
> -}
> diff --git a/arch/x86/xen/debugfs.h b/arch/x86/xen/debugfs.h
> index e281320..12ebf33 100644
> --- a/arch/x86/xen/debugfs.h
> +++ b/arch/x86/xen/debugfs.h
> @@ -3,8 +3,4 @@
>
> struct dentry * __init xen_init_debugfs(void);
>
> -struct dentry *xen_debugfs_create_u32_array(const char *name, mode_t mode,
> - struct dentry *parent,
> - u32 *array, unsigned elements);
> -
> #endif /* _XEN_DEBUGFS_H */
> diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
> index fc506e6..14a8961 100644
> --- a/arch/x86/xen/spinlock.c
> +++ b/arch/x86/xen/spinlock.c
> @@ -286,7 +286,7 @@ static int __init xen_spinlock_debugfs(void)
> debugfs_create_u64("time_blocked", 0444, d_spin_debug,
> &spinlock_stats.time_blocked);
>
> - xen_debugfs_create_u32_array("histo_blocked", 0444, d_spin_debug,
> + debugfs_create_u32_array("histo_blocked", 0444, d_spin_debug,
> spinlock_stats.histo_spin_blocked, HISTO_BUCKETS + 1);
>
> return 0;
> diff --git a/fs/debugfs/file.c b/fs/debugfs/file.c
> index 90f7657..df44ccf 100644
> --- a/fs/debugfs/file.c
> +++ b/fs/debugfs/file.c
> @@ -18,6 +18,7 @@
> #include <linux/pagemap.h>
> #include <linux/namei.h>
> #include <linux/debugfs.h>
> +#include <linux/slab.h>
>
> static ssize_t default_read_file(struct file *file, char __user *buf,
> size_t count, loff_t *ppos)
> @@ -525,3 +526,130 @@ struct dentry *debugfs_create_blob(const char *name, mode_t mode,
> return debugfs_create_file(name, mode, parent, blob, &fops_blob);
> }
> EXPORT_SYMBOL_GPL(debugfs_create_blob);
> +
> +struct array_data {
> + void *array;
> + u32 elements;
> +};
> +
> +static int u32_array_open(struct inode *inode, struct file *file)
> +{
> + file->private_data = NULL;
> + return nonseekable_open(inode, file);
> +}
> +
> +static size_t format_array(char *buf, size_t bufsize, const char *fmt,
> + u32 *array, u32 array_size)
> +{
> + size_t ret = 0;
> + u32 i;
> +
> + for (i = 0; i < array_size; i++) {
> + size_t len;
> +
> + len = snprintf(buf, bufsize, fmt, array[i]);
> + len++; /* ' ' or '\n' */
> + ret += len;
> +
> + if (buf) {
> + buf += len;
> + bufsize -= len;
> + buf[-1] = (i == array_size-1) ? '\n' : ' ';
> + }
> + }
> +
> + ret++; /* \0 */
> + if (buf)
> + *buf = '\0';
> +
> + return ret;
> +}
> +
> +static char *format_array_alloc(const char *fmt, u32 *array,
> + u32 array_size)
> +{
> + size_t len = format_array(NULL, 0, fmt, array, array_size);
> + char *ret;
> +
> + ret = kmalloc(len, GFP_KERNEL);
> + if (ret == NULL)
> + return NULL;
> +
> + format_array(ret, len, fmt, array, array_size);
> + return ret;
> +}
> +
> +static ssize_t u32_array_read(struct file *file, char __user *buf, size_t len,
> + loff_t *ppos)
> +{
> + struct inode *inode = file->f_path.dentry->d_inode;
> + struct array_data *data = inode->i_private;
> + size_t size;
> +
> + if (*ppos == 0) {
> + if (file->private_data) {
> + kfree(file->private_data);
> + file->private_data = NULL;
> + }
> +
> + file->private_data = format_array_alloc("%u", data->array,
> + data->elements);
> + }
> +
> + size = 0;
> + if (file->private_data)
> + size = strlen(file->private_data);
> +
> + return simple_read_from_buffer(buf, len, ppos,
> + file->private_data, size);
> +}
> +
> +static int u32_array_release(struct inode *inode, struct file *file)
> +{
> + kfree(file->private_data);
> +
> + return 0;
> +}
> +
> +static const struct file_operations u32_array_fops = {
> + .owner = THIS_MODULE,
> + .open = u32_array_open,
> + .release = u32_array_release,
> + .read = u32_array_read,
> + .llseek = no_llseek,
> +};
> +
> +/**
> + * debugfs_create_u32_array - create a debugfs file that is used to read u32
> + * array.
> + * @name: a pointer to a string containing the name of the file to create.
> + * @mode: the permission that the file should have.
> + * @parent: a pointer to the parent dentry for this file. This should be a
> + * directory dentry if set. If this parameter is %NULL, then the
> + * file will be created in the root of the debugfs filesystem.
> + * @array: u32 array that provides data.
> + * @elements: total number of elements in the array.
> + *
> + * This function creates a file in debugfs with the given name that exports
> + * @array as data. If the @mode variable is so set it can be read from.
> + * Writing is not supported. Seek within the file is also not supported.
> + * Once array is created its size can not be changed.
> + *
> + * The function returns a pointer to dentry on success. If debugfs is not
> + * enabled in the kernel, the value -%ENODEV will be returned.
> + */
> +struct dentry *debugfs_create_u32_array(const char *name, mode_t mode,
> + struct dentry *parent,
> + u32 *array, u32 elements)
> +{
> + struct array_data *data = kmalloc(sizeof(*data), GFP_KERNEL);
> +
> + if (data == NULL)
> + return NULL;
> +
> + data->array = array;
> + data->elements = elements;
> +
> + return debugfs_create_file(name, mode, parent, data, &u32_array_fops);
> +}
> +EXPORT_SYMBOL_GPL(debugfs_create_u32_array);
> diff --git a/include/linux/debugfs.h b/include/linux/debugfs.h
> index e7d9b20..253e2fb 100644
> --- a/include/linux/debugfs.h
> +++ b/include/linux/debugfs.h
> @@ -74,6 +74,10 @@ struct dentry *debugfs_create_blob(const char *name, mode_t mode,
> struct dentry *parent,
> struct debugfs_blob_wrapper *blob);
>
> +struct dentry *debugfs_create_u32_array(const char *name, mode_t mode,
> + struct dentry *parent,
> + u32 *array, u32 elements);
> +
> bool debugfs_initialized(void);
>
> #else
> @@ -193,6 +197,13 @@ static inline bool debugfs_initialized(void)
> return false;
> }
>
> +struct dentry *debugfs_create_u32_array(const char *name, mode_t mode,
> + struct dentry *parent,
> + u32 *array, u32 elements)
> +{
> + return ERR_PTR(-ENODEV);
> +}
> +
> #endif
>
> #endif
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel
^ permalink raw reply
* Re: [net-next RFC PATCH 5/5] virtio-net: flow director support
From: Ben Hutchings @ 2011-12-05 20:42 UTC (permalink / raw)
To: Jason Wang; +Cc: krkumar2, kvm, mst, netdev, virtualization, levinsasha928
In-Reply-To: <20111205085925.6116.94352.stgit@dhcp-8-146.nay.redhat.com>
On Mon, 2011-12-05 at 16:59 +0800, Jason Wang wrote:
> In order to let the packets of a flow to be passed to the desired
> guest cpu, we can co-operate with devices through programming the flow
> director which was just a hash to queue table.
>
> This kinds of co-operation is done through the accelerate RFS support,
> a device specific flow sterring method virtnet_fd() is used to modify
> the flow director based on rfs mapping. The desired queue were
> calculated through reverse mapping of the irq affinity table. In order
> to parallelize the ingress path, irq affinity of rx queue were also
> provides by the driver.
>
> In addition to accelerate RFS, we can also use the guest scheduler to
> balance the load of TX and reduce the lock contention on egress path,
> so the processor_id() were used to tx queue selection.
[...]
> +#ifdef CONFIG_RFS_ACCEL
> +
> +int virtnet_fd(struct net_device *net_dev, const struct sk_buff *skb,
> + u16 rxq_index, u32 flow_id)
> +{
> + struct virtnet_info *vi = netdev_priv(net_dev);
> + u16 *table = NULL;
> +
> + if (skb->protocol != htons(ETH_P_IP) || !skb->rxhash)
> + return -EPROTONOSUPPORT;
Why only IPv4?
> + table = kmap_atomic(vi->fd_page);
> + table[skb->rxhash & TAP_HASH_MASK] = rxq_index;
> + kunmap_atomic(table);
> +
> + return 0;
> +}
> +#endif
This is not a proper implementation of ndo_rx_flow_steer. If you steer
a flow by changing the RSS table this can easily cause packet reordering
in other flows. The filtering should be more precise, ideally matching
exactly a single flow by e.g. VID and IP 5-tuple.
I think you need to add a second hash table which records exactly which
flow is supposed to be steered. Also, you must call
rps_may_expire_flow() to check whether an entry in this table may be
replaced; otherwise you can cause packet reordering in the flow that was
previously being steered.
Finally, this function must return the table index it assigned, so that
rps_may_expire_flow() works.
> +static u16 virtnet_select_queue(struct net_device *dev, struct sk_buff *skb)
> +{
> + int txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) :
> + smp_processor_id();
> +
> + /* As we make use of the accelerate rfs which let the scheduler to
> + * balance the load, it make sense to choose the tx queue also based on
> + * theprocessor id?
> + */
> + while (unlikely(txq >= dev->real_num_tx_queues))
> + txq -= dev->real_num_tx_queues;
> + return txq;
> +}
[...]
Don't do this, let XPS handle it.
Ben.
--
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
^ permalink raw reply
* Re: [PATCHv2 RFC] virtio-pci: flexible configuration layout
From: Michael S. Tsirkin @ 2011-12-05 20:20 UTC (permalink / raw)
To: Jesse Barnes
Cc: Krishna Kumar, kvm, Pawel Moll, Wang Sheng-Hui,
Alexey Kardashevskiy, lkml - Kernel Mailing List, virtualization,
Christian Borntraeger, Sasha Levin, Amit Shah, Linus Torvalds
In-Reply-To: <20111205111605.73b60faa@jbarnes-desktop>
On Mon, Dec 05, 2011 at 11:16:05AM -0800, Jesse Barnes wrote:
> On Mon, 14 Nov 2011 20:18:55 +0200
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>
> > Add a flexible mechanism to specify virtio configuration layout, using
> > pci vendor-specific capability. A separate capability is used for each
> > of common, device specific and data-path accesses.
> >
> > Warning: compiled only.
> > This patch also needs to be split up, pci_iomap changes
> > also need arch updates for non-x86.
> > There might also be more spec changes.
> >
> > Posting here for early feedback, and to allow Sasha to
> > proceed with his "kvm tool" work.
> >
> > Changes from v1:
> > Updated to match v3 of the spec, see:
> > Subject: [PATCHv3 RFC] virtio-spec: flexible configuration layout
> > Message-ID: <20111110122436.GA13144@redhat.com>
> > In-Reply-To: <20111109195901.GA28155@redhat.com>
>
> Looks like this conflicts with your other iomap changes... I didn't
> check your latest tree; do you just add another patch on top for the
> virtio changes now?
>
> Thanks,
Yes. Rusty asked for more changes so that isn't yet pushed.
> --
> Jesse Barnes, Intel Open Source Technology Center
^ permalink raw reply
* Re: [net-next RFC PATCH 3/5] macvtap: flow director support
From: Ben Hutchings @ 2011-12-05 20:11 UTC (permalink / raw)
To: Jason Wang; +Cc: krkumar2, kvm, mst, netdev, virtualization, levinsasha928
In-Reply-To: <20111205085906.6116.84426.stgit@dhcp-8-146.nay.redhat.com>
Similarly, macvtap chould implement the ethtool {get,set}_rxfh_indir
operations.
Ben.
--
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
^ permalink raw reply
* Re: [net-next RFC PATCH 2/5] tuntap: simple flow director support
From: Ben Hutchings @ 2011-12-05 20:09 UTC (permalink / raw)
To: Jason Wang; +Cc: krkumar2, kvm, mst, netdev, virtualization, levinsasha928
In-Reply-To: <20111205085857.6116.99252.stgit@dhcp-8-146.nay.redhat.com>
On Mon, 2011-12-05 at 16:58 +0800, Jason Wang wrote:
> This patch adds a simple flow director to tun/tap device. It is just a
> page that contains the hash to queue mapping which could be changed by
> user-space. The backend (tap/macvtap) would query this table to get
> the desired queue of a packets when it send packets to userspace.
This is just flow hashing (RSS), not flow steering.
> The page address were set through a new kind of ioctl - TUNSETFD and
> were pinned until device exit or another new page were specified.
[...]
You should implement ethtool ETHTOOL_{G,S}RXFHINDIR instead.
Ben.
--
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
^ permalink raw reply
* Re: [PATCHv2 RFC] virtio-pci: flexible configuration layout
From: Jesse Barnes @ 2011-12-05 19:16 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Krishna Kumar, kvm, Pawel Moll, Wang Sheng-Hui,
Alexey Kardashevskiy, lkml - Kernel Mailing List, virtualization,
Christian Borntraeger, Sasha Levin, Amit Shah, Linus Torvalds
In-Reply-To: <20111114181854.GA24953@redhat.com>
[-- Attachment #1.1: Type: text/plain, Size: 1026 bytes --]
On Mon, 14 Nov 2011 20:18:55 +0200
"Michael S. Tsirkin" <mst@redhat.com> wrote:
> Add a flexible mechanism to specify virtio configuration layout, using
> pci vendor-specific capability. A separate capability is used for each
> of common, device specific and data-path accesses.
>
> Warning: compiled only.
> This patch also needs to be split up, pci_iomap changes
> also need arch updates for non-x86.
> There might also be more spec changes.
>
> Posting here for early feedback, and to allow Sasha to
> proceed with his "kvm tool" work.
>
> Changes from v1:
> Updated to match v3 of the spec, see:
> Subject: [PATCHv3 RFC] virtio-spec: flexible configuration layout
> Message-ID: <20111110122436.GA13144@redhat.com>
> In-Reply-To: <20111109195901.GA28155@redhat.com>
Looks like this conflicts with your other iomap changes... I didn't
check your latest tree; do you just add another patch on top for the
virtio changes now?
Thanks,
--
Jesse Barnes, Intel Open Source Technology Center
[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 836 bytes --]
[-- Attachment #2: Type: text/plain, Size: 183 bytes --]
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PATCH v2] Add virtio-scsi to the virtio spec
From: Paolo Bonzini @ 2011-12-05 17:26 UTC (permalink / raw)
To: Rusty Russell, virtualization
Cc: linux-scsi, LKML, Stefan Hajnoczi, Michael S. Tsirkin
In-Reply-To: <1323105965-6391-1-git-send-email-pbonzini@redhat.com>
For simplicity, instead of including the whole spec, I am just including
the diff from v1.
--- virtio-spec.txt.v1 2011-11-30 12:21:01.472479754 +0100
+++ virtio-spec.txt 2011-12-05 14:07:02.645044924 +0100
@@ -1,10 +1,9 @@
Appendix H: SCSI Host Device
-The virtio SCSI host device groups together one or more simple
-virtual devices (ie. disk), and allows communicating to these
-devices using the SCSI protocol. An instance of the device
-represents a SCSI host with possibly many buses (also known as
-channels or paths), targets and LUNs attached.
+The virtio SCSI host device groups together one or more virtual
+logical units (such as disks), and allows communicating to them using
+the SCSI protocol. An instance of the device represents a SCSI host to
+which many targets and LUNs are attached.
The virtio SCSI device services two kinds of requests:
@@ -39,6 +38,8 @@
struct virtio_scsi_config {
u32 num_queues;
u32 seg_max;
+ u32 max_sectors;
+ u32 cmd_per_lun;
u32 event_info_size;
u32 sense_size;
u32 cdb_size;
@@ -47,14 +48,22 @@
u32 max_lun;
};
- num_queues is the total number of virtqueues exposed by the
- device. The driver is free to use only one request queue, or
- it can use more to achieve better performance.
+ num_queues is the total number of request virtqueues exposed by
+ the device. The driver is free to use only one request queue,
+ or it can use more to achieve better performance.
seg_max is the maximum number of segments that can be in a
command. A bidirectional command can include seg_max input
segments and seg_max output segments.
+ max_sectors is a hint to the guest about the maximum transfer
+ size it should use.
+
+ cmd_per_lun is a hint to the guest about the maximum number of
+ linked commands it should send to one LUN. The actual value
+ to be used is the minimum of cmd_per_lun and the virtqueue
+ size.
+
event_info_size is the maximum size that the device will fill
for buffers that the driver places in the eventq. The driver
should always put buffers at least of this size. It is
@@ -72,9 +81,7 @@
restored to the default when the device is reset.
max_channel, max_target and max_lun can be used by the driver
- as hints for scanning the logical units on the host. In the
- current version of the spec, they will always be respectively
- 0, 255 and 16383.
+ as hints to constrain scanning the logical units on the host.
Device Initialization
=====================
@@ -93,13 +100,15 @@
================================
The driver queues requests to an arbitrary request queue, and
-they are used by the device on that same queue. In this version
-of the spec, commands placed on different queue will be consumed
-with _no_ order constraints.
+they are used by the device on that same queue. It is the
+responsibility of the driver to ensure strict request ordering
+for commands placed on different queues, because they will be
+consumed with _no_ order constraints.
Requests have the following format:
struct virtio_scsi_req_cmd {
+ // Read-only
u8 lun[8];
u64 id;
u8 task_attr;
@@ -107,6 +116,7 @@
u8 crn;
char cdb[cdb_size];
char dataout[];
+ // Write-only part
u32 sense_len;
u32 residual;
u16 status_qualifier;
@@ -122,10 +132,11 @@
#define VIRTIO_SCSI_S_ABORTED 2
#define VIRTIO_SCSI_S_BAD_TARGET 3
#define VIRTIO_SCSI_S_RESET 4
- #define VIRTIO_SCSI_S_TRANSPORT_FAILURE 5
- #define VIRTIO_SCSI_S_TARGET_FAILURE 6
- #define VIRTIO_SCSI_S_NEXUS_FAILURE 7
- #define VIRTIO_SCSI_S_FAILURE 8
+ #define VIRTIO_SCSI_S_BUSY 5
+ #define VIRTIO_SCSI_S_TRANSPORT_FAILURE 6
+ #define VIRTIO_SCSI_S_TARGET_FAILURE 7
+ #define VIRTIO_SCSI_S_NEXUS_FAILURE 8
+ #define VIRTIO_SCSI_S_FAILURE 9
/* task_attr */
#define VIRTIO_SCSI_S_SIMPLE 0
@@ -134,22 +145,21 @@
#define VIRTIO_SCSI_S_ACA 3
The lun field addresses a target and logical unit in the
-virtio-scsi device's SCSI domain. In this version of the spec,
-the only supported value of the LUN field is: first byte set to
-1, second byte set to target, third and fourth byte representing
-a single level LUN structure, followed by four zero bytes. With
-this representation, a virtio-scsi device can serve up to 256
-targets and 16384 LUNs per target.
+virtio-scsi device's SCSI domain. The only supported format for
+the LUN field is: first byte set to 1, second byte set to target,
+third and fourth byte representing a single level LUN structure,
+followed by four zero bytes. With this representation, a
+virtio-scsi device can serve up to 256 targets and 16384 LUNs per
+target.
The id field is the command identifier (“tag”).
-Task_attr, prio and crn should be left to zero: command priority
-is explicitly not supported by this version of the device;
-task_attr defines the task attribute as in the table above, but
-all task attributes may be mapped to SIMPLE by the device; crn
-may also be provided by clients, but is generally expected to be
-0. The maximum CRN value defined by the protocol is 255, since
-CRN is stored in an 8-bit integer.
+task_attr, prio and crn should be left to zero. task_attr defines
+the task attribute as in the table above, but all task attributes
+may be mapped to SIMPLE by the device; crn may also be provided
+by clients, but is generally expected to be 0. The maximum CRN
+value defined by the protocol is 255, since CRN is stored in an
+8-bit integer.
All of these fields are defined in SAM. They are always
read-only, as are the cdb and dataout field. The cdb_size is
@@ -167,7 +177,7 @@
processed partially and the datain field was not processed at
all.
-The status byte is written by the device to be the status
+The status byte is written by the device to be the status
code as defined in SAM.
The response byte is written by the device to be one of the
@@ -180,14 +190,14 @@
VIRTIO_SCSI_S_UNDERRUN if the content of the CDB requires
transferring more data than is available in the data buffers.
- VIRTIO_SCSI_S_ABORTED if the request was cancelled due to a
- task management function.
+ VIRTIO_SCSI_S_ABORTED if the request was cancelled due to an
+ ABORT TASK or ABORT TASK SET task management function.
VIRTIO_SCSI_S_BAD_TARGET if the request was never processed
because the target indicated by the lun field does not exist.
VIRTIO_SCSI_S_RESET if the request was cancelled due to a bus
- or device reset.
+ or device reset (including a task management function).
VIRTIO_SCSI_S_TRANSPORT_FAILURE if the request failed due to a
problem in the connection between the host and the target
@@ -199,6 +209,9 @@
VIRTIO_SCSI_S_NEXUS_FAILURE if the nexus is suffering a failure
but retrying on other paths might yield a different result.
+ VIRTIO_SCSI_S_BUSY if the request failed but retrying on the
+ same path should work.
+
VIRTIO_SCSI_S_FAILURE for other host or guest error. In
particular, if neither dataout nor datain is empty, and the
VIRTIO_SCSI_F_INOUT feature has not been negotiated, the
@@ -220,11 +233,12 @@
/* response values valid for all commands */
#define VIRTIO_SCSI_S_OK 0
#define VIRTIO_SCSI_S_BAD_TARGET 3
- #define VIRTIO_SCSI_S_TRANSPORT_FAILURE 5
- #define VIRTIO_SCSI_S_TARGET_FAILURE 6
- #define VIRTIO_SCSI_S_NEXUS_FAILURE 7
- #define VIRTIO_SCSI_S_FAILURE 8
- #define VIRTIO_SCSI_S_INCORRECT_LUN 11
+ #define VIRTIO_SCSI_S_BUSY 5
+ #define VIRTIO_SCSI_S_TRANSPORT_FAILURE 6
+ #define VIRTIO_SCSI_S_TARGET_FAILURE 7
+ #define VIRTIO_SCSI_S_NEXUS_FAILURE 8
+ #define VIRTIO_SCSI_S_FAILURE 9
+ #define VIRTIO_SCSI_S_INCORRECT_LUN 12
The type identifies the remaining fields.
@@ -245,31 +259,31 @@
struct virtio_scsi_ctrl_tmf
{
+ // Read-only part
u32 type;
u32 subtype;
u8 lun[8];
u64 id;
- u8 additional[];
+ // Write-only part
u8 response;
}
/* command-specific response values */
#define VIRTIO_SCSI_S_FUNCTION_COMPLETE 0
- #define VIRTIO_SCSI_S_FUNCTION_SUCCEEDED 9
- #define VIRTIO_SCSI_S_FUNCTION_REJECTED 10
+ #define VIRTIO_SCSI_S_FUNCTION_SUCCEEDED 10
+ #define VIRTIO_SCSI_S_FUNCTION_REJECTED 11
The type is VIRTIO_SCSI_T_TMF; the subtype field defines. All
fields except response are filled by the driver. The subtype
field must always be specified and identifies the requested
- task management function. Other fields may be irrelevant for
- the requested TMF are ignored. The lun field is in the same
- format specified for request queues; the single level LUN is
- ignored when the task management function addresses a whole I_T
- nexus. When relevant, the value of the id field is matched
- against the id values passed on the requestq.
+ task management function.
- Note that since ACA is not supported by this version of the
- spec, VIRTIO_SCSI_T_TMF_CLEAR_ACA is always a no-operation.
+ Other fields may be irrelevant for the requested TMF; if so,
+ they are ignored but they should still be present. The lun
+ field is in the same format specified for request queues; the
+ single level LUN is ignored when the task management function
+ addresses a whole I_T nexus. When relevant, the value of the id
+ field is matched against the id values passed on the requestq.
The outcome of the task management function is written by the
device in the response field. The command-specific response
@@ -280,9 +294,11 @@
#define VIRTIO_SCSI_T_AN_QUERY 1
struct virtio_scsi_ctrl_an {
+ // Read-only part
u32 type;
u8 lun[8];
u32 event_requested;
+ // Write-only part
u32 event_actual;
u8 response;
}
@@ -312,9 +328,11 @@
#define VIRTIO_SCSI_T_AN_SUBSCRIBE 2
struct virtio_scsi_ctrl_an {
+ // Read-only part
u32 type;
u8 lun[8];
u32 event_requested;
+ // Write-only part
u32 event_actual;
u8 response;
}
@@ -339,9 +357,13 @@
The eventq is used by the device to report information on logical
units that are attached to it. The driver should always leave a
-few buffers ready in the eventq. The device will end up dropping
-events if it finds no buffer ready. 10-15 buffers should be
-enough.
+few buffers ready in the eventq. In general, the device will not
+queue events to cope with an empty eventq, and will end up
+dropping events if it finds no buffer ready. However, when
+reporting events for many LUNs (e.g. when a whole target
+disappears), the device can throttle events to avoid dropping
+them. For this reason, placing 10-15 buffers on the event queue
+should be enough.
Buffers are placed in the eventq and filled by the device when
interesting events occur. The buffers should be strictly
@@ -356,6 +378,7 @@
#define VIRTIO_SCSI_T_EVENTS_MISSED 0x80000000
struct virtio_scsi_event {
+ // Write-only part
u32 event;
...
}
@@ -391,7 +414,8 @@
#define VIRTIO_SCSI_T_TRANSPORT_RESET 1
- struct virtio_scsi_reset {
+ struct virtio_scsi_event_reset {
+ // Write-only part
u32 event;
u8 lun[8];
u32 reason;
@@ -454,7 +478,8 @@
#define VIRTIO_SCSI_T_ASYNC_NOTIFY 2
- struct virtio_scsi_an_event {
+ struct virtio_scsi_event_an {
+ // Write-only part
u32 event;
u8 lun[8];
u32 reason;
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* [PATCH v2] Add virtio-scsi to the virtio spec
From: Paolo Bonzini @ 2011-12-05 17:26 UTC (permalink / raw)
To: Rusty Russell, virtualization
Cc: linux-scsi, LKML, Stefan Hajnoczi, Michael S. Tsirkin
Hi all,
here is the specification for a virtio-based SCSI host (controller, HBA,
you name it). The virtio SCSI host is the basis of an alternative
storage stack for KVM. This stack would overcome several limitations of
the current solution, virtio-blk:
1) scalability limitations: virtio-blk-over-PCI puts a strong upper
limit on the number of devices that can be added to a guest. Common
configurations have a limit of ~30 devices. While this can be worked
around by implementing a PCI-to-PCI bridge, or by using multifunction
virtio-blk devices, these solutions either have not been implemented
yet, or introduce management restrictions. On the other hand, the SCSI
architecture is well known for its scalability and virtio-scsi supports
advanced feature such as multiqueueing.
2) limited flexibility: virtio-blk does not support all possible storage
scenarios. For example, it only allows limited SCSI passthrough.
In principle, virtio-scsi provides anything that the underlying SCSI
target (be it emulated by QEMU, physical storage, iSCSI or the in-kernel
target) supports.
3) limited extensibility: over the time, many features have been added
to virtio-blk. Each such change requires modifications to the virtio
specification, to the guest drivers, and to the device model in the
host. The virtio-scsi spec has been written to follow SAM conventions,
and exposing new features to the guest will only require changes to the
host's SCSI target implementation.
This includes all the changes suggested by Hannes, Rusty and Ben in
reply to the previous version of the draft.
Here is the lyx version. The PDF version is at
http://people.redhat.com/pbonzini/virtio-spec.pdf and the text version
of the spec (interdiff from v1) is in a reply to this message.
--- virtio-spec.lyx.saved 2011-11-29 14:00:59.782659120 +0100
+++ virtio-spec.lyx 2011-12-05 14:02:41.946266491 +0100
@@ -56,6 +56,7 @@
\html_math_output 0
\html_css_as_file 0
\html_be_strict false
+\author 1531152142 "pbonzini"
\end_header
\begin_body
@@ -321,7 +322,7 @@
\begin_layout Standard
\begin_inset Tabular
-<lyxtabular version="3" rows="8" columns="3">
+<lyxtabular version="3" rows="9" columns="3">
<features tabularvalignment="middle">
<column alignment="center" valignment="top" width="0">
<column alignment="center" valignment="top" width="0">
@@ -530,6 +531,41 @@
</cell>
</row>
<row>
+<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
+\begin_inset Text
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322650850
+7
+\end_layout
+
+\end_inset
+</cell>
+<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
+\begin_inset Text
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322650855
+SCSI host
+\end_layout
+
+\end_inset
+</cell>
+<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
+\begin_inset Text
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322650861
+Appendix H
+\end_layout
+
+\end_inset
+</cell>
+</row>
+<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text
@@ -6427,6 +6463,2172 @@
\end_layout
\begin_layout Chapter*
+
+\change_inserted 1531152142 1322571716
+Appendix H: SCSI Host Device
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 1531152142 1323090161
+The virtio SCSI host device groups together one or more virtual logical
+ units (such as disks), and allows communicating to them using the SCSI
+ protocol.
+ An instance of the device represents a SCSI host to which many targets
+ and LUNs are attached.
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 1531152142 1322571726
+The virtio SCSI device services two kinds of requests:
+\end_layout
+
+\begin_layout Itemize
+
+\change_inserted 1531152142 1322571726
+command requests for a logical unit;
+\end_layout
+
+\begin_layout Itemize
+
+\change_inserted 1531152142 1322571726
+task management functions related to a logical unit, target or command.
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 1531152142 1322571726
+The device is also able to send out notifications about added and removed
+ logical units.
+ Together, these capabilities provide a SCSI transport protocol that uses
+ virtqueues as the transfer medium.
+ In the transport protocol, the virtio driver acts as the initiator, while
+ the virtio SCSI host provides one or more targets that receive and process
+ the requests.
+
+\end_layout
+
+\begin_layout Section*
+
+\change_inserted 1531152142 1322571697
+Configuration
+\end_layout
+
+\begin_layout Description
+
+\change_inserted 1531152142 1322651166
+Subsystem
+\begin_inset space ~
+\end_inset
+
+Device
+\begin_inset space ~
+\end_inset
+
+ID 7
+\end_layout
+
+\begin_layout Description
+
+\change_inserted 1531152142 1322571777
+Virtqueues 0:controlq; 1:eventq; 2..n:request queues.
+\end_layout
+
+\begin_layout Description
+
+\change_inserted 1531152142 1322571813
+Feature
+\begin_inset space ~
+\end_inset
+
+bits
+\end_layout
+
+\begin_deeper
+\begin_layout Description
+
+\change_inserted 1531152142 1322653523
+VIRTIO_SCSI_F_INOUT
+\begin_inset space ~
+\end_inset
+
+(0) A single request can include both read-only and write-only data buffers.
+\end_layout
+
+\end_deeper
+\begin_layout Description
+
+\change_inserted 1531152142 1322651190
+Device
+\begin_inset space ~
+\end_inset
+
+configuration
+\begin_inset space ~
+\end_inset
+
+layout All fields of this configuration are always available.
+
+\series bold
+sense_size
+\series default
+ and
+\series bold
+cdb_size
+\series default
+ are writable by the guest.
+\begin_inset listings
+inline false
+status open
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322571919
+
+struct virtio_scsi_config {
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322575810
+
+ u32 num_queues;
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322934205
+
+ u32 seg_max;
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322934257
+
+ u32 max_sectors;
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322934350
+
+ u32 cmd_per_lun;
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322575811
+
+ u32 event_info_size;
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322575811
+
+ u32 sense_size;
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322575812
+
+ u32 cdb_size;
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322576412
+
+ u16 max_channel;
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322576413
+
+ u16 max_target;
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322576414
+
+ u32 max_lun;
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322571878
+
+};
+\end_layout
+
+\end_inset
+
+
+\end_layout
+
+\begin_deeper
+\begin_layout Description
+
+\change_inserted 1531152142 1322724976
+num_queues is the total number of request virtqueues exposed by the device.
+ The driver is free to use only one request queue, or it can use more to
+ achieve better performance.
+\end_layout
+
+\begin_layout Description
+
+\change_inserted 1531152142 1322934245
+seg_max is the maximum number of segments that can be in a command.
+ A bidirectional command can include
+\series bold
+seg_max
+\series default
+ input segments and
+\series bold
+seg_max
+\series default
+output segments.
+\end_layout
+
+\begin_layout Description
+
+\change_inserted 1531152142 1322934739
+max_sectors is a hint to the guest about the maximum transfer size it should
+ use.
+\end_layout
+
+\begin_layout Description
+
+\change_inserted 1531152142 1322934845
+cmd_per_lun is a hint to the guest about the maximum number of linked commands
+ it should send to one LUN.
+ The actual value to be used is the minimum of
+\series bold
+cmd_per_lun
+\series default
+ and the virtqueue size.
+\change_unchanged
+
+\end_layout
+
+\begin_layout Description
+
+\change_inserted 1531152142 1322571959
+event_info_size is the maximum size that the device will fill for buffers
+ that the driver places in the eventq.
+ The driver should always put buffers at least of this size.
+ It is written by the device depending on the set of negotated features.
+\end_layout
+
+\begin_layout Description
+
+\change_inserted 1531152142 1322571997
+sense_size is the maximum size of the sense data that the device will write.
+ The default value is written by the device and will always be 96, but the
+ driver can modify it.
+ It is restored to the default when the device is reset.
+\end_layout
+
+\begin_layout Description
+
+\change_inserted 1531152142 1322575599
+cdb_size is the maximum size of the CDB that the driver will write.
+ The default value is written by the device and will always be 32, but the
+ driver can likewise modify it.
+ It is restored to the default when the device is reset.
+\end_layout
+
+\begin_layout Description
+
+\change_inserted 1531152142 1322724997
+max_channel,
+\begin_inset space \space{}
+\end_inset
+
+max_target
+\series medium
+
+\begin_inset space ~
+\end_inset
+
+and
+\begin_inset space \space{}
+\end_inset
+
+
+\series default
+max_lun can be used by the driver as hints to constrain scanning the logical
+ units on the host.
+\change_unchanged
+
+\end_layout
+
+\end_deeper
+\begin_layout Section*
+
+\change_inserted 1531152142 1322571959
+Device Initialization
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 1531152142 1322572042
+The initialization routine should first of all discover the device's virtqueues.
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 1531152142 1322572054
+If the driver uses the eventq, it should then place at least a buffer in
+ the eventq.
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 1531152142 1322572042
+The driver can immediately issue requests (for example, INQUIRY or REPORT
+ LUNS) or task management functions (for example, I_T RESET).
+
+\end_layout
+
+\begin_layout Section*
+
+\change_inserted 1531152142 1322572348
+Device Operation: request queues
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 1531152142 1322725031
+The driver queues requests to an arbitrary request queue, and they are used
+ by the device on that same queue.
+ It is the responsibility of the driver to ensure strict request ordering
+ for commands placed on different queues, because they will be consumed
+ with no order constraints.
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 1531152142 1322572395
+Requests have the following format:
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 1531152142 1322572526
+\begin_inset listings
+inline false
+status open
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322725766
+
+struct virtio_scsi_req_cmd {
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322725783
+
+ // Read-only
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322572417
+
+ u8 lun[8];
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322572419
+
+ u64 id;
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322572420
+
+ u8 task_attr;
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322572422
+
+ u8 prio;
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322572425
+
+ u8 crn;
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322572426
+
+ char cdb[cdb_size];
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322572410
+
+ char dataout[];
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322725797
+
+ // Write-only part
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322572429
+
+ u32 sense_len;
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322572430
+
+ u32 residual;
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322572432
+
+ u16 status_qualifier;
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322572434
+
+ u8 status;
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322572435
+
+ u8 response;
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322572437
+
+ u8 sense[sense_size];
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322572439
+
+ char datain[];
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322572471
+
+};
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322572410
+
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322572476
+
+/* command-specific response values */
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322572480
+
+#define VIRTIO_SCSI_S_OK 0
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322572483
+
+#define VIRTIO_SCSI_S_UNDERRUN 1
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322572489
+
+#define VIRTIO_SCSI_S_ABORTED 2
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322572491
+
+#define VIRTIO_SCSI_S_BAD_TARGET 3
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322572494
+
+#define VIRTIO_SCSI_S_RESET 4
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322665324
+
+#define VIRTIO_SCSI_S_BUSY 5
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322665325
+
+#define VIRTIO_SCSI_S_TRANSPORT_FAILURE 6
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322665326
+
+#define VIRTIO_SCSI_S_TARGET_FAILURE 7
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322665326
+
+#define VIRTIO_SCSI_S_NEXUS_FAILURE 8
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322665326
+
+#define VIRTIO_SCSI_S_FAILURE 9
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322572502
+
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322572507
+
+/* task_attr */
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322572510
+
+#define VIRTIO_SCSI_S_SIMPLE 0
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322572513
+
+#define VIRTIO_SCSI_S_ORDERED 1
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322572516
+
+#define VIRTIO_SCSI_S_HEAD 2
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322572504
+
+#define VIRTIO_SCSI_S_ACA 3
+\end_layout
+
+\end_inset
+
+
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 1531152142 1322725062
+The
+\series bold
+lun
+\series default
+ field addresses a target and logical unit in the virtio-scsi device's SCSI
+ domain.
+ The only supported format for the LUN field is: first byte set to 1, second
+ byte set to target, third and fourth byte representing a single level LUN
+ structure, followed by four zero bytes.
+ With this representation, a virtio-scsi device can serve up to 256 targets
+ and 16384 LUNs per target.
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 1531152142 1322572562
+The
+\series bold
+id
+\series default
+ field is the command identifier (
+\begin_inset Quotes eld
+\end_inset
+
+tag
+\begin_inset Quotes erd
+\end_inset
+
+).
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 1531152142 1322725122
+
+\series bold
+task_attr
+\series default
+,
+\series bold
+prio
+\series default
+ and
+\series bold
+crn
+\series default
+ should be left to zero.
+
+\series bold
+task_attr
+\series default
+ defines the task attribute as in the table above, but all task attributes
+ may be mapped to SIMPLE by the device;
+\series bold
+crn
+\series default
+ may also be provided by clients, but is generally expected to be 0.
+ The maximum CRN value defined by the protocol is 255, since CRN is stored
+ in an 8-bit integer.
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 1531152142 1322572647
+All of these fields are defined in SAM.
+ They are always read-only, as are the
+\series bold
+cdb
+\series default
+ and
+\series bold
+dataout
+\series default
+ field.
+ The
+\series bold
+cdb_size
+\series default
+ is taken from the configuration space.
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 1531152142 1322572919
+
+\series bold
+sense
+\series default
+ and subsequent fields are always write-only.
+ The
+\series bold
+sense_len
+\series default
+ field indicates the number of bytes actually written to the sense buffer.
+ The
+\series bold
+residual
+\series default
+ field indicates the residual size, calculated as
+\begin_inset Quotes eld
+\end_inset
+
+data_length - number_of_transferred_bytes
+\begin_inset Quotes erd
+\end_inset
+
+, for read or write operations.
+ For bidirectional commands, the number_of_transferred_bytes includes both
+ read and written bytes.
+ A residual field that is less than the size of datain means that the dataout
+ field was processed entirely.
+ A residual field that exceeds the size of datain means that the dataout
+ field was processed partially and the datain field was not processed at
+ all.
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 1531152142 1322572971
+The
+\series bold
+status
+\series default
+ byte is written by the device to be the status code as defined in SAM.
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 1531152142 1322572971
+The
+\series bold
+response
+\series default
+ byte is written by the device to be one of the following:
+\end_layout
+
+\begin_layout Description
+
+\change_inserted 1531152142 1322572971
+VIRTIO_SCSI_S_OK when the request was completed and the status byte is filled
+ with a SCSI status code (not necessarily "GOOD").
+\end_layout
+
+\begin_layout Description
+
+\change_inserted 1531152142 1322572971
+VIRTIO_SCSI_S_UNDERRUN if the content of the CDB requires transferring more
+ data than is available in the data buffers.
+\end_layout
+
+\begin_layout Description
+
+\change_inserted 1531152142 1322652973
+VIRTIO_SCSI_S_ABORTED if the request was cancelled due to an ABORT TASK
+ or ABORT TASK SET task management function.
+\end_layout
+
+\begin_layout Description
+
+\change_inserted 1531152142 1322573041
+VIRTIO_SCSI_S_BAD_TARGET if the request was never processed because the
+ target indicated by the
+\series bold
+lun
+\series default
+ field does not exist.
+\end_layout
+
+\begin_layout Description
+
+\change_inserted 1531152142 1322653176
+VIRTIO_SCSI_S_RESET if the request was cancelled due to a bus or device
+ reset (including a task management function).
+\end_layout
+
+\begin_layout Description
+
+\change_inserted 1531152142 1322572971
+VIRTIO_SCSI_S_TRANSPORT_FAILURE if the request failed due to a problem in
+ the connection between the host and the target (severed link).
+\end_layout
+
+\begin_layout Description
+
+\change_inserted 1531152142 1322572971
+VIRTIO_SCSI_S_TARGET_FAILURE if the target is suffering a failure and the
+ guest should not retry on other paths.
+\end_layout
+
+\begin_layout Description
+
+\change_inserted 1531152142 1322572971
+VIRTIO_SCSI_S_NEXUS_FAILURE if the nexus is suffering a failure but retrying
+ on other paths might yield a different result.
+\end_layout
+
+\begin_layout Description
+
+\change_inserted 1531152142 1322664259
+VIRTIO_SCSI_S_BUSY if the request failed but retrying on the same path should
+ work.
+\end_layout
+
+\begin_layout Description
+
+\change_inserted 1531152142 1322664258
+VIRTIO_SCSI_S_FAILURE for other host or guest error.
+ In particular, if neither dataout nor datain is empty, and the VIRTIO_SCSI_F_IN
+OUT feature has not been negotiated, the request will be immediately returned
+ with a response equal to VIRTIO_SCSI_S_FAILURE.
+
+\end_layout
+
+\begin_layout Section*
+
+\change_inserted 1531152142 1322573130
+Device Operation: controlq
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 1531152142 1322573193
+The controlq is used for other SCSI transport operations.
+ Requests have the following format:
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 1531152142 1322573233
+\begin_inset listings
+inline false
+status open
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322573243
+
+struct virtio_scsi_ctrl {
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322573246
+
+ u32 type;
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322573248
+
+ ...
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322573250
+
+ u8 response;
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322574229
+
+};
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322574230
+
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322574236
+
+/* response values valid for all commands */
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322574310
+
+#define VIRTIO_SCSI_S_OK 0
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322665338
+
+#define VIRTIO_SCSI_S_BAD_TARGET 3
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322665342
+
+#define VIRTIO_SCSI_S_BUSY 5
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322665355
+
+#define VIRTIO_SCSI_S_TRANSPORT_FAILURE 6
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322665357
+
+#define VIRTIO_SCSI_S_TARGET_FAILURE 7
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322665358
+
+#define VIRTIO_SCSI_S_NEXUS_FAILURE 8
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322665359
+
+#define VIRTIO_SCSI_S_FAILURE 9
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322665363
+
+#define VIRTIO_SCSI_S_INCORRECT_LUN 12
+\end_layout
+
+\end_inset
+
+
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 1531152142 1322573193
+The
+\series bold
+type
+\series default
+ identifies the remaining fields.
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 1531152142 1322573193
+The following commands are defined:
+\end_layout
+
+\begin_layout Description
+
+\change_inserted 1531152142 1322576973
+Task
+\begin_inset space \space{}
+\end_inset
+
+management
+\begin_inset space \space{}
+\end_inset
+
+function
+\begin_inset space ~
+\end_inset
+
+
+\begin_inset Newline newline
+\end_inset
+
+
+\begin_inset listings
+inline false
+status open
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322573683
+
+#define VIRTIO_SCSI_T_TMF 0
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322573683
+
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322573683
+
+#define VIRTIO_SCSI_T_TMF_ABORT_TASK 0
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322573683
+
+#define VIRTIO_SCSI_T_TMF_ABORT_TASK_SET 1
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322573683
+
+#define VIRTIO_SCSI_T_TMF_CLEAR_ACA 2
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322573683
+
+#define VIRTIO_SCSI_T_TMF_CLEAR_TASK_SET 3
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322573683
+
+#define VIRTIO_SCSI_T_TMF_I_T_NEXUS_RESET 4
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322573683
+
+#define VIRTIO_SCSI_T_TMF_LOGICAL_UNIT_RESET 5
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322573683
+
+#define VIRTIO_SCSI_T_TMF_QUERY_TASK 6
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322573683
+
+#define VIRTIO_SCSI_T_TMF_QUERY_TASK_SET 7
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322573683
+
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322573683
+
+struct virtio_scsi_ctrl_tmf
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322573683
+
+{
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322725821
+
+ // Read-only part
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322725810
+
+ u32 type;
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322573683
+
+ u32 subtype;
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322573683
+
+ u8 lun[8];
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322573683
+
+ u64 id;
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322725832
+
+ // Write-only part
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322573683
+
+ u8 response;
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322573683
+
+}
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322573683
+
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322573683
+
+/* command-specific response values */
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322573683
+
+#define VIRTIO_SCSI_S_FUNCTION_COMPLETE 0
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322665370
+
+#define VIRTIO_SCSI_S_FUNCTION_SUCCEEDED 10
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322665370
+
+#define VIRTIO_SCSI_S_FUNCTION_REJECTED 11
+\end_layout
+
+\end_inset
+
+
+\end_layout
+
+\begin_deeper
+\begin_layout Standard
+
+\change_inserted 1531152142 1322725968
+The type is VIRTIO_SCSI_T_TMF; the subtype field defines.
+ All fields except
+\series bold
+response
+\series default
+ are filled by the driver.
+ The
+\series bold
+subtype
+\series default
+ field must always be specified and identifies the requested task management
+ function.
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 1531152142 1322725982
+Other fields may be irrelevant for the requested TMF; if so, they are ignored
+ but they should still be present.
+ The
+\series bold
+lun
+\series default
+ field is in the same format specified for request queues; the single level
+ LUN is ignored when the task management function addresses a whole I_T
+ nexus.
+ When relevant, the value of the
+\series bold
+id
+\series default
+ field is matched against the id values passed on the requestq.
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 1531152142 1322574270
+The outcome of the task management function is written by the device in
+ the response field.
+ The command-specific response values map 1-to-1 with those defined in SAM.
+\end_layout
+
+\end_deeper
+\begin_layout Description
+
+\change_inserted 1531152142 1322576979
+Asynchronous
+\begin_inset space \space{}
+\end_inset
+
+notification
+\begin_inset space \space{}
+\end_inset
+
+query
+\begin_inset space ~
+\end_inset
+
+
+\begin_inset Newline newline
+\end_inset
+
+
+\begin_inset listings
+inline false
+status open
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322574160
+
+#define VIRTIO_SCSI_T_AN_QUERY 1
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322574160
+
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322574160
+
+struct virtio_scsi_ctrl_an {
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322725848
+
+ // Read-only part
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322725843
+
+ u32 type;
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322574160
+
+ u8 lun[8];
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322574160
+
+ u32 event_requested;
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322725838
+
+ // Write-only part
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322725838
+
+ u32 event_actual;
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322574160
+
+ u8 response;
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322574160
+
+}
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322574160
+
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322574160
+
+#define VIRTIO_SCSI_EVT_ASYNC_OPERATIONAL_CHANGE 2
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322574160
+
+#define VIRTIO_SCSI_EVT_ASYNC_POWER_MGMT 4
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322574160
+
+#define VIRTIO_SCSI_EVT_ASYNC_EXTERNAL_REQUEST 8
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322574160
+
+#define VIRTIO_SCSI_EVT_ASYNC_MEDIA_CHANGE 16
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322574160
+
+#define VIRTIO_SCSI_EVT_ASYNC_MULTI_HOST 32
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322574160
+
+#define VIRTIO_SCSI_EVT_ASYNC_DEVICE_BUSY 64
+\end_layout
+
+\end_inset
+
+
+\end_layout
+
+\begin_deeper
+\begin_layout Standard
+
+\change_inserted 1531152142 1322574687
+By sending this command, the driver asks the device which events the given
+ LUN can report, as described in paragraphs 6.6 and A.6 of the SCSI MMC specificat
+ion.
+ The driver writes the events it is interested in into the event_requested;
+ the device responds by writing the events that it supports into event_actual.
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 1531152142 1322574688
+The
+\series bold
+type
+\series default
+ is VIRTIO_SCSI_T_AN_QUERY.
+ The
+\series bold
+lun
+\series default
+ and
+\series bold
+event_requested
+\series default
+ fields are written by the driver.
+ The
+\series bold
+event_actual
+\series default
+ and
+\series bold
+response
+\series default
+ fields are written by the device.
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 1531152142 1322574345
+No command-specific values are defined for the response byte.
+\end_layout
+
+\end_deeper
+\begin_layout Description
+
+\change_inserted 1531152142 1322576981
+Asynchronous
+\begin_inset space \space{}
+\end_inset
+
+notification
+\begin_inset space \space{}
+\end_inset
+
+subscription
+\begin_inset space ~
+\end_inset
+
+
+\begin_inset Newline newline
+\end_inset
+
+
+\begin_inset listings
+inline false
+status open
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322574354
+
+#define VIRTIO_SCSI_T_AN_SUBSCRIBE 2
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322574342
+
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322574342
+
+struct virtio_scsi_ctrl_an {
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322725858
+
+ // Read-only part
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322574342
+
+ u32 type;
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322574342
+
+ u8 lun[8];
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322574342
+
+ u32 event_requested;
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322725864
+
+ // Write-only part
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322574342
+
+ u32 event_actual;
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322574342
+
+ u8 response;
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322574342
+
+}
+\end_layout
+
+\end_inset
+
+
+\end_layout
+
+\begin_deeper
+\begin_layout Standard
+
+\change_inserted 1531152142 1322574708
+By sending this command, the driver asks the specified LUN to report events
+ for its physical interface, again as described in the SCSI MMC specification.
+ The driver writes the events it is interested in into the event_requested;
+ the device responds by writing the events that it supports into event_actual.
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 1531152142 1322574709
+Event types are the same as for the asynchronous notification query message.
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 1531152142 1322574710
+The
+\series bold
+type
+\series default
+ is VIRTIO_SCSI_T_AN_SUBSCRIBE.
+ The
+\series bold
+lun
+\series default
+ and
+\series bold
+event_requested
+\series default
+ fields are written by the driver.
+ The
+\series bold
+event_actual
+\series default
+ and
+\series bold
+response
+\series default
+ fields are written by the device.
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 1531152142 1322574419
+No command-specific values are defined for the response byte.
+\end_layout
+
+\end_deeper
+\begin_layout Section*
+
+\change_inserted 1531152142 1322574433
+Device Operation: eventq
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 1531152142 1322653610
+The eventq is used by the device to report information on logical units
+ that are attached to it.
+ The driver should always leave a few buffers ready in the eventq.
+ In general, the device will not queue events to cope with an empty eventq,
+ and will end up dropping events if it finds no buffer ready.
+ However, when reporting events for many LUNs (e.g.
+ when a whole target disappears), the device can throttle events to avoid
+ dropping them.
+ For this reason, placing 10-15 buffers on the event queue should be enough.
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 1531152142 1322574442
+Buffers are placed in the eventq and filled by the device when interesting
+ events occur.
+ The buffers should be strictly write-only (device-filled) and the size
+ of the buffers should be at least the value given in the device's configuration
+ information.
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 1531152142 1322574487
+Buffers returned by the device on the eventq will be referred to as "events"
+ in the rest of this section.
+ Events have the following format:
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 1531152142 1322574508
+\begin_inset listings
+inline false
+status open
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322574500
+
+#define VIRTIO_SCSI_T_EVENTS_MISSED 0x80000000
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322574500
+
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322574500
+
+struct virtio_scsi_event {
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322725871
+
+ // Write-only part
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322574500
+
+ u32 event;
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322574500
+
+ ...
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322574500
+
+}
+\end_layout
+
+\end_inset
+
+
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 1531152142 1322574516
+If bit 31 is set in the event field, the device failed to report an event
+ due to missing buffers.
+ In this case, the driver should poll the logical units for unit attention
+ conditions, and/or do whatever form of bus scan is appropriate for the
+ guest operating system.
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 1531152142 1322574521
+Other data that the device writes to the buffer depends on the contents
+ of the event field.
+ The following events are defined:
+\end_layout
+
+\begin_layout Description
+
+\change_inserted 1531152142 1322653652
+No
+\begin_inset space \space{}
+\end_inset
+
+event
+\begin_inset space ~
+\end_inset
+
+
+\begin_inset Newline newline
+\end_inset
+
+
+\begin_inset listings
+inline false
+status open
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322574545
+
+#define VIRTIO_SCSI_T_NO_EVENT 0
+\end_layout
+
+\end_inset
+
+
+\end_layout
+
+\begin_deeper
+\begin_layout Standard
+
+\change_inserted 1531152142 1322576984
+This event is fired in the following cases:
+\end_layout
+
+\begin_layout Itemize
+
+\change_inserted 1531152142 1322574588
+When the device detects in the eventq a buffer that is shorter than what
+ is indicated in the configuration field, it might use it immediately and
+ put this dummy value in the event field.
+ A well-written driver will never observe this situation.
+\end_layout
+
+\begin_layout Itemize
+
+\change_inserted 1531152142 1322574604
+When events are dropped, the device may signal this event as soon as the
+ drivers makes a buffer available, in order to request action from the driver.
+ In this case, of course, this event will be reported with the VIRTIO_SCSI_T_EVE
+NTS_MISSED flag.
+
+\end_layout
+
+\end_deeper
+\begin_layout Description
+
+\change_inserted 1531152142 1322576985
+Transport
+\begin_inset space \space{}
+\end_inset
+
+reset
+\begin_inset space ~
+\end_inset
+
+
+\begin_inset Newline newline
+\end_inset
+
+
+\begin_inset listings
+inline false
+status open
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322574628
+
+#define VIRTIO_SCSI_T_TRANSPORT_RESET 1
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322574628
+
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322725908
+
+struct virtio_scsi_event_reset {
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322725876
+
+ // Write-only part
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322574628
+
+ u32 event;
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322574628
+
+ u8 lun[8];
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322574628
+
+ u32 reason;
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322574628
+
+}
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322574628
+
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322574628
+
+#define VIRTIO_SCSI_EVT_RESET_HARD 0
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322574628
+
+#define VIRTIO_SCSI_EVT_RESET_RESCAN 1
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322574628
+
+#define VIRTIO_SCSI_EVT_RESET_REMOVED 2
+\end_layout
+
+\end_inset
+
+
+\end_layout
+
+\begin_deeper
+\begin_layout Standard
+
+\change_inserted 1531152142 1322574756
+By sending this event, the device signals that a logical unit on a target
+ has been reset, including the case of a new device appearing or disappearing
+ on the bus.The device fills in all fields.
+ The
+\series bold
+event
+\series default
+ field is set to VIRTIO_SCSI_T_TRANSPORT_RESET.
+ The
+\series bold
+lun
+\series default
+ field addresses a logical unit in the SCSI host.
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 1531152142 1322577082
+The
+\series bold
+reason
+\series default
+ value is one of the three #define values appearing above:
+\end_layout
+
+\begin_layout Itemize
+
+\change_inserted 1531152142 1322577449
+
+\series bold
+VIRTIO_SCSI_EVT_RESET_REMOVED
+\series default
+ (
+\begin_inset Quotes eld
+\end_inset
+
+LUN/target removed
+\begin_inset Quotes erd
+\end_inset
+
+) is used if the target or logical unit is no longer able to receive commands.
+\end_layout
+
+\begin_layout Itemize
+
+\change_inserted 1531152142 1322577452
+
+\series bold
+VIRTIO_SCSI_EVT_RESET_HARD
+\series default
+ (
+\begin_inset Quotes eld
+\end_inset
+
+LUN hard reset
+\begin_inset Quotes erd
+\end_inset
+
+) is used if the logical unit has been reset, but is still present.
+\end_layout
+
+\begin_layout Itemize
+
+\change_inserted 1531152142 1322577446
+
+\series bold
+VIRTIO_SCSI_EVT_RESET_RESCAN
+\series default
+ (
+\begin_inset Quotes eld
+\end_inset
+
+rescan LUN/target
+\begin_inset Quotes erd
+\end_inset
+
+) is used if a target or logical unit has just appeared on the device.
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 1531152142 1322577382
+The
+\begin_inset Quotes eld
+\end_inset
+
+removed
+\begin_inset Quotes erd
+\end_inset
+
+ and
+\begin_inset Quotes eld
+\end_inset
+
+rescan
+\begin_inset Quotes erd
+\end_inset
+
+ events, when sent for LUN 0, may apply to the entire target.
+ After receiving them the driver should ask the initiator to rescan the
+ target, in order to detect the case when an entire target has appeared
+ or disappeared.
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 1531152142 1322577057
+Events will also be reported via sense codes (this obviously does not apply
+ to newly appeared buses or targets, since the application has never discovered
+ them):
+\end_layout
+
+\begin_layout Itemize
+
+\change_inserted 1531152142 1322577457
+\begin_inset Quotes eld
+\end_inset
+
+LUN/target removed
+\begin_inset Quotes erd
+\end_inset
+
+ maps to sense key ILLEGAL REQUEST, asc 0x25, ascq 0x00 (LOGICAL UNIT NOT
+ SUPPORTED)
+\end_layout
+
+\begin_layout Itemize
+
+\change_inserted 1531152142 1322577460
+\begin_inset Quotes eld
+\end_inset
+
+LUN hard reset
+\begin_inset Quotes erd
+\end_inset
+
+ maps to sense key UNIT ATTENTION, asc 0x29 (POWER ON, RESET OR BUS DEVICE
+ RESET OCCURRED)
+\end_layout
+
+\begin_layout Itemize
+
+\change_inserted 1531152142 1322577462
+\begin_inset Quotes eld
+\end_inset
+
+rescan LUN/target
+\begin_inset Quotes erd
+\end_inset
+
+ maps to sense key UNIT ATTENTION, asc 0x3f, ascq 0x0e (REPORTED LUNS DATA
+ HAS CHANGED)
+\change_unchanged
+
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 1531152142 1322575482
+The preferred way to detect transport reset is always to use events, because
+ sense codes are only seen by the driver when it sends a SCSI command to
+ the logical unit or target.
+ However, in case events are dropped, the initiator will still be able to
+ synchronize with the actual state of the controller if the driver asks
+ the initiator to rescan of the SCSI bus.
+ During the rescan, the initiator will be able to observe the above sense
+ codes, and it will process them as if it the driver had received the equivalent
+ event.
+
+\end_layout
+
+\end_deeper
+\begin_layout Description
+
+\change_inserted 1531152142 1322576987
+Asynchronous
+\begin_inset space \space{}
+\end_inset
+
+notification
+\begin_inset space ~
+\end_inset
+
+
+\begin_inset Newline newline
+\end_inset
+
+
+\begin_inset listings
+inline false
+status open
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322575505
+
+#define VIRTIO_SCSI_T_ASYNC_NOTIFY 2
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322575505
+
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322725913
+
+struct virtio_scsi_event_an {
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322725880
+
+ // Write-only part
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322575505
+
+ u32 event;
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322575505
+
+ u8 lun[8];
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322575505
+
+ u32 reason;
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1322575505
+
+}
+\end_layout
+
+\end_inset
+
+
+\end_layout
+
+\begin_deeper
+\begin_layout Standard
+
+\change_inserted 1531152142 1322575520
+By sending this event, the device signals that an asynchronous event was
+ fired from a physical interface.
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 1531152142 1322575546
+All fields are written by the device.
+ The
+\series bold
+event
+\series default
+ field is set to VIRTIO_SCSI_T_ASYNC_NOTIFY.
+ The
+\series bold
+lun
+\series default
+ field addresses a logical unit in the SCSI host.
+ The
+\series bold
+reason
+\series default
+ field is a subset of the events that the driver has subscribed to via the
+ "Asynchronous notification subscription" command.
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 1531152142 1322575520
+When dropped events are reported, the driver should poll for asynchronous
+ events manually using SCSI commands.
+\change_unchanged
+
+\end_layout
+
+\end_deeper
+\begin_layout Chapter*
Appendix X: virtio-mmio
\end_layout
^ permalink raw reply
* Re: [net-next RFC PATCH 5/5] virtio-net: flow director support
From: Stefan Hajnoczi @ 2011-12-05 10:55 UTC (permalink / raw)
To: Jason Wang
Cc: krkumar2, kvm, mst, netdev, virtualization, levinsasha928,
bhutchings
In-Reply-To: <20111205085925.6116.94352.stgit@dhcp-8-146.nay.redhat.com>
On Mon, Dec 5, 2011 at 8:59 AM, Jason Wang <jasowang@redhat.com> wrote:
> +static int virtnet_set_fd(struct net_device *dev, u32 pfn)
> +{
> + struct virtnet_info *vi = netdev_priv(dev);
> + struct virtio_device *vdev = vi->vdev;
> +
> + if (virtio_has_feature(vdev, VIRTIO_NET_F_HOST_FD)) {
> + vdev->config->set(vdev,
> + offsetof(struct virtio_net_config_fd, addr),
> + &pfn, sizeof(u32));
Please use the virtio model (i.e. virtqueues) instead of shared
memory. Mapping a page breaks the virtio abstraction.
Stefan
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox