[Qemu-devel] scsi-generic and max request size


* [Qemu-devel] scsi-generic and max request size
@ 2010-12-21  3:25 Benjamin Herrenschmidt
  2010-12-21  3:38 ` ronnie sahlberg
  0 siblings, 1 reply; 16+ messages in thread
From: Benjamin Herrenschmidt @ 2010-12-21  3:25 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Hannes Reinecke

Hi folks !

There's an odd problem I've encountered with my scsi host (basically a
powerpc "vscsi" compatible with IBM PAPR).

When using /dev/sg (ie, scsi-generic), there seems to be no way I can
find to retrieve the underlying driver's max request transfer size.

This can normally be obtained with the BLKSECTGET ioctl under Linux (I'm
not familiar with other OSes here). However, this is a bit buggy as
well, ie, afaik, this doesn't work with 32-bit binaries on 64-bit
kernels (the compat ioctl doesn't seem to work on /dev/sg).
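
For reference, a minimal sketch of the ioctl query described above, written
in Python for brevity. The constant is BLKSECTGET as defined in
<linux/fs.h>; the helper name query_blksectget is hypothetical. It assumes
the fd is a /dev/sg node and that the sg driver writes a C int holding a
byte count (block device nodes are understood to report 512-byte sectors
in a smaller type instead, so this is not a drop-in for them).

```python
import array
import fcntl

# BLKSECTGET is _IO(0x12, 103) in <linux/fs.h>, i.e. 0x1267.
BLKSECTGET = 0x1267


def query_blksectget(fd):
    """Return the value BLKSECTGET reports for fd, or None on failure.

    Assumption: fd refers to a /dev/sg node and the driver stores an int
    (believed to be a byte count for sg). Hypothetical helper, not QEMU code.
    """
    buf = array.array("i", [0])
    try:
        # mutate_flag=True lets the kernel write the result into buf.
        fcntl.ioctl(fd, BLKSECTGET, buf, True)
    except OSError:
        return None  # ioctl not supported on this descriptor
    return buf[0]
```

A None result is exactly the situation complained about here: the caller
then has no limit to hand to the upper layer.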

For now, qemu doesn't pass that from its bdev layer, which means that
scsi-generic doesn't pass it to its own "upper" layer either.

What that means is twofold I suppose:

 - For real SCSI HBAs, how do you limit the transfer size anyways ? You
can't start breaking up user requests without taking risks with tags
etc...

 - For vscsi, I can expose the limit I want via the SRP interface, but
scsi-generic doesn't tell me what it is :-)

This is a real problem in practice. IE. the USB CD-ROM on this POWER7
blade limits transfers to 0x1e000 bytes for example and the Linux "sr"
driver on the guest is going to try to give me bigger requests than that
if I don't start limiting them, which will cause all sorts of errors.
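
To put a number on that limit, a small worked example. The 2048-byte
block size is an assumption (the usual size for data CD/DVD media), and
the function name is made up for illustration.

```python
def max_blocks_per_command(limit_bytes, block_size):
    """Largest whole number of logical blocks that fits in limit_bytes."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return limit_bytes // block_size


HOST_LIMIT = 0x1E000   # bytes, the figure quoted above (120 KiB)
CDROM_BLOCK = 2048     # assumption: typical data CD/DVD block size

# 0x1e000 bytes is 122880 bytes, i.e. 60 blocks of 2048 bytes.
print(max_blocks_per_command(HOST_LIMIT, CDROM_BLOCK))
```

Any READ from the guest asking for more than that many blocks in one
command would exceed what the host driver accepts.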

Cheers,
Ben.


* Re: [Qemu-devel] scsi-generic and max request size
  2010-12-21  3:25 [Qemu-devel] scsi-generic and max request size Benjamin Herrenschmidt
@ 2010-12-21  3:38 ` ronnie sahlberg
  2010-12-21  3:52   ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 16+ messages in thread
From: ronnie sahlberg @ 2010-12-21  3:38 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: Kevin Wolf, qemu-devel, Hannes Reinecke

Ben,

Since it is a scsi device you can try the Inquiry command with
pagecode 0xb0  :  Block Limits VPD Page.
That page shows optimal and maximum request sizes.

This is for SBC, in the Vital Product Data chapter.

Unfortunately this page is not mandatory so some devices might not
understand it. :-(

sg_inq --page=0x00 /dev/sg?
will show you what inq pages your device supports.
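
As an illustration of what that page carries, here is a hedged sketch that
builds the INQUIRY CDB (EVPD bit set, page 0xb0) and decodes the transfer
length fields from a raw response. The field offsets follow my reading of
the SBC Block Limits page layout; the function names are hypothetical and
sending the CDB (e.g. via SG_IO) is left out.

```python
import struct


def build_inquiry_vpd_cdb(page_code, alloc_len=64):
    """6-byte INQUIRY CDB with EVPD=1 asking for the given VPD page."""
    # opcode 0x12, EVPD bit, page code, 16-bit allocation length, control
    return struct.pack(">BBBHB", 0x12, 0x01, page_code, alloc_len, 0x00)


def parse_block_limits(page):
    """Decode transfer-length fields (in logical blocks) from VPD page 0xB0.

    Assumed layout: bytes 6-7 optimal granularity, bytes 8-11 maximum
    transfer length, bytes 12-15 optimal transfer length, all big-endian.
    A maximum of 0 means the device reports no limit.
    """
    if len(page) < 16 or page[1] != 0xB0:
        raise ValueError("not a Block Limits VPD page")
    granularity, max_len, opt_len = struct.unpack_from(">HII", page, 6)
    return {
        "optimal_granularity": granularity,
        "max_transfer_length": max_len,
        "optimal_transfer_length": opt_len,
    }
```

Note these values describe the device itself, not whatever the host HBA
driver imposes on top of it.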


regards
ronnie sahlberg



* Re: [Qemu-devel] scsi-generic and max request size
  2010-12-21  3:38 ` ronnie sahlberg
@ 2010-12-21  3:52   ` Benjamin Herrenschmidt
  2010-12-21  8:44     ` Hannes Reinecke
  0 siblings, 1 reply; 16+ messages in thread
From: Benjamin Herrenschmidt @ 2010-12-21  3:52 UTC (permalink / raw)
  To: ronnie sahlberg; +Cc: Kevin Wolf, qemu-devel, Hannes Reinecke

On Tue, 2010-12-21 at 14:38 +1100, ronnie sahlberg wrote:
> Ben,
> 
> Since it is a scsi device you can try the Inquiry command with
> pagecode 0xb0  :  Block Limit VPD Page.
> That pages show optimal and maximum request sizes.
> 
> This is for SBC, in the Vital Product Data chapter.
> 
> Unfortunately this page is not mandatory so some devices might not
> understand it. :-(
> 
> sg_inq --page=0x00 /dev/sg?
> will show you what inq pages your device supports.

Well, that won't help much figuring what the limit is since in most cases
the limit seems to come from the host linux HBA (ie, usb-storage for
example artificially clamps the max request size to deal with bogus
USB-ATA bridges).

As for using this to try to "inform" the guest OS as to what the limit
is, this could be done by "patching" the result of that command on the
fly in qemu, but that is nasty, and would only work if the guest OS
actually uses the said command in the first place. AFAIK, neither sr.c
nor sd.c do in Linux.

So back to square 1 ... my vscsi (and virtio-blk too btw) can
technically pass a max size to the guest, but we don't have a way to
interrogate scsi-generic (and the underlying block driver) which is the
main issue (that plus the fact that the ioctl seems to be broken in
"compat" mode for /dev/sg specifically)...

Cheers,
Ben.



* Re: [Qemu-devel] scsi-generic and max request size
  2010-12-21  3:52   ` Benjamin Herrenschmidt
@ 2010-12-21  8:44     ` Hannes Reinecke
  2010-12-21 22:05       ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 16+ messages in thread
From: Hannes Reinecke @ 2010-12-21  8:44 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: Kevin Wolf, qemu-devel, ronnie sahlberg

On 12/21/2010 04:52 AM, Benjamin Herrenschmidt wrote:
> On Tue, 2010-12-21 at 14:38 +1100, ronnie sahlberg wrote:
>> Ben,
>>
>> Since it is a scsi device you can try the Inquiry command with
>> pagecode 0xb0  :  Block Limit VPD Page.
>> That pages show optimal and maximum request sizes.
>>
>> This is for SBC, in the Vital Product Data chapter.
>>
>> Unfortunately this page is not mandatory so some devices might not
>> understand it. :-(
>>
>> sg_inq --page=0x00 /dev/sg?
>> will show you what inq pages your device supports.
> 
> Well, that won't help much figuring what the limit is since in most case
> the limit seems to come from the host linux HBA (ie, usb-storage for
> example artificially clamps the max request size to deal with bogus
> USB-ATA bridges).
> 
Indeed. The request size is pretty much limited by the driver/scsi
layer, so the above page won't help much here.

> As for using this to try to "inform" the guest OS as to what the limit
> is, this could be done by "patching" the result of that command on the
> fly in qemu, but that is nasty, and would only work if the guest OS
> actually uses the said command in the first place. AFAIK, neither sr.c
> nor sd.c do in Linux.
> 
And you'll be getting yelled at by hch to boot.

> So back to square 1 ... my vscsi (and virtio-blk too btw) can
> technically pass a max size to the guest, but we don't have a way to
> interrogate scsi-generic (and the underlying block driver) which is the
> main issue (that plus the fact that the ioctl seems to be broken in
> "compat" mode for /dev/sg specifically)...
> 
Ah, the warm and fuzzy feeling of knowing I'm not alone in this ...

This is basically the same issue I brought up with the first
submission round of my megasas emulation.

As we're passing scatter-gather lists directly to the underlying
device we might end up sending a request which is improperly
formatted. The linux block layer has three limits onto which a
request has to be formatted:
- Max length of the scatter-gather list (max_sectors)
- Max overall request size (max_segments)
- Max length of individual sg elements (max_segment_size)

Newer kernels export these limits; they have been exported with
commit c77a5710b7e23847bfdb81fcaa10b585f65c960a.
For older kernels, however, we're being left in the dark here.
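
A rough sketch of reading those exported limits, assuming the attribute
names found under /sys/block/<dev>/queue/ on recent kernels. The helper
name and the idea of passing the queue directory explicitly are
illustrative only; attributes missing on older kernels come back as None.

```python
import os

# Assumed sysfs attribute names under /sys/block/<dev>/queue/.
# max_sectors_kb and max_hw_sectors_kb are in KiB, max_segments is a
# count, max_segment_size is in bytes.
LIMIT_FILES = (
    "max_sectors_kb",
    "max_hw_sectors_kb",
    "max_segments",
    "max_segment_size",
)


def read_queue_limits(queue_dir):
    """Read block queue limits from a sysfs queue directory."""
    limits = {}
    for name in LIMIT_FILES:
        try:
            with open(os.path.join(queue_dir, name)) as f:
                limits[name] = int(f.read().strip())
        except (OSError, ValueError):
            limits[name] = None  # not exported here (e.g. older kernel)
    return limits
```

For example, read_queue_limits("/sys/block/sr0/queue") would return a
dict with whatever the running kernel exposes for that device.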

So on newer kernel we probably could be doing a quick check on the
block queue limits and reformat the I/O if required.

Instead of reformatting we could be sending each element of an sg
list individually. Thereby we would be introducing some slowdown as
the sg lists have to be reassembled again by the lower layers, but
we would be insulated from any sg list mismatch.
However, this won't cover requests with too large sg elements.
For those we could probably use some simple divide-by-two algorithm
on the element to make them fit.

But seeing we have to split the I/O requests anyway we might as well
use the divide-by-two algorithm for the sg lists, too.
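
The divide-by-two idea for oversized elements could look roughly like the
sketch below. Elements are modelled as (address, length) tuples, which is
an assumption for illustration and not how Qemu represents sg lists.

```python
def halve_until_fits(elements, max_segment_size):
    """Split (address, length) sg elements in half until each one fits."""
    if max_segment_size <= 0:
        raise ValueError("max_segment_size must be positive")
    out = []
    stack = list(reversed(elements))  # pop() then yields original order
    while stack:
        addr, length = stack.pop()
        if length <= max_segment_size:
            out.append((addr, length))
            continue
        first = length // 2
        # Push the second half first so the first half is handled next,
        # which keeps the output in ascending address order.
        stack.append((addr + first, length - first))
        stack.append((addr, first))
    return out
```

Note that this only addresses the per-element size. It increases the
number of elements, so the list length limit still has to be checked
separately.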

Easiest would be if we could just transfer the available bits and
push the request back to the guest as a partial completion.
Sadly the I/O stack on the guest will choose to interpret this as an
I/O error instead of retrying the remainder :-(

So in the long run I fear we have to implement some sort of I/O
request splitting in Qemu, using the values from sysfs.

Cheers,

Hannes
--
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)


* Re: [Qemu-devel] scsi-generic and max request size
  2010-12-21  8:44     ` Hannes Reinecke
@ 2010-12-21 22:05       ` Benjamin Herrenschmidt
  2010-12-22 13:54         ` Hannes Reinecke
  0 siblings, 1 reply; 16+ messages in thread
From: Benjamin Herrenschmidt @ 2010-12-21 22:05 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: Kevin Wolf, qemu-devel, ronnie sahlberg

> > So back to square 1 ... my vscsi (and virtio-blk too btw) can
> > technically pass a max size to the guest, but we don't have a way to
> > interrogate scsi-generic (and the underlying block driver) which is the
> > main issue (that plus the fact that the ioctl seems to be broken in
> > "compat" mode for /dev/sg specifically)...
> > 
> Ah, the warm and fuzzy feeling knowing to be not alone in this ...
> 
> This is basically the same issue I brought up with the first
> submission round of my megasas emulation.

heh.

> As we're passing scatter-gather lists directly to the underlying
> device we might end up sending a request which is improperly
> formatted. The linux block layer has three limits onto which a
> request has to be formatted:
> - Max length of the scatter-gather list (max_sectors)
> - Max overall request size (max_segments)

Didn't you swap the 2 above ? max_sectors is the max overall req. size
and max_segments the max number of SG elements afaik :-)

> - Max length of individual sg elements (max_segment_size)

> newer kernels export these limits; they have been exported with
> commit c77a5710b7e23847bfdb81fcaa10b585f65c960a.
> For older kernels, however, we're being left in the dark here.

Well, first of all, "sg" is not there so that doesn't help with the
scsi-generic problem much, then parsing sysfs... yuck.

> So on newer kernel we probably could be doing a quick check on the
> block queue limits and reformat the I/O if required.

Maybe but then, "sg" isn't there. We "could" I suppose use "sr" as an
indication tho when we know it's a cdrom.

> Instead of reformatting we could be sending each element of an sg
> list individually. Thereby we would be introducing some slowdown as
> the sg lists have to be reassembled again by the lower layers, but
> we would be insulated from any sg list mismatch.
> However, this won't cover requests with too large sg elements.
> For those we could probably use some simple divide-by-two algorithm
> on the element to make them fit.

How can we ? We need a single request to match a single sg list anyways
no ?

Let's say you get a READ10 from the guest for 200KB and your underlying
max_sectors is 128KB. How do you want to "break that up" ? The only way
would be to make it two different READ10's and that's a can of worms
especially if you start putting tags into the picture...
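
Purely to make the arithmetic of that example concrete, a sketch of what
"two different READ10's" would mean at the CDB level. It assumes 512-byte
blocks and a hypothetical helper name, and it deliberately ignores the
ordering and tag problems mentioned above, which are the hard part.

```python
import struct

READ_10 = 0x28  # SCSI READ(10) opcode


def split_read10(lba, num_blocks, block_size, max_bytes):
    """Return READ(10) CDBs that together cover the original transfer."""
    per_cmd = min(max_bytes // block_size, 0xFFFF)  # 16-bit length field
    if per_cmd == 0:
        raise ValueError("limit is smaller than one block")
    cdbs = []
    while num_blocks > 0:
        count = min(num_blocks, per_cmd)
        # opcode, flags, 32-bit LBA, group number, 16-bit length, control
        cdbs.append(struct.pack(">BBIBHB", READ_10, 0, lba, 0, count, 0))
        lba += count
        num_blocks -= count
    return cdbs


# 200 KiB at 512 bytes per block is 400 blocks; a 128 KiB limit allows
# 256 blocks per command, so this yields commands of 256 and 144 blocks.
cdbs = split_read10(0, 400, 512, 128 * 1024)
```

The guest still believes it issued a single command, so completion and
error reporting for the two halves would have to be merged by the
emulation layer.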

> But seeing we have to split the I/O requests anyway we might as well
> use the divide-by-two algorithm for the sg lists, too.
> 
> Easiest would be if we could just transfer the available bits and
> push the request back to the guest as a partial completion.
> Sadly the I/O stack on the guest will choose to interpret this as an
> I/O error instead of retrying the remainder :-(
> 
> So in the long run I fear we have to implement some sort of I/O
> request splitting in Qemu, using the values from sysfs.

So in my case, I'm happy for the time being to continue doing bounce
buffering and so my only problem at the moment is the max request size
(aka max_sectors). Also I -can- tell the guest what my limitation is,
it's part of the vscsi login protocol. I can look into doing DMA
directly to the guest SG lists later maybe.

However, I can't quite figure out how to reliably obtain that
information in my driver since on one hand, the ioctl doesn't seem to
work in mixed 32/64-bit environments, and on the other hand, sysfs
doesn't seem to have anything for "sg" in /sys/class/block... Besides,
those are both Linux-isms... so we'd have to be extra careful there too.

Cheers,
Ben.


* Re: [Qemu-devel] scsi-generic and max request size
  2010-12-22 13:54         ` Hannes Reinecke
@ 2010-12-22 13:27           ` Christoph Hellwig
  2010-12-22 22:06             ` Benjamin Herrenschmidt
  2010-12-22 23:19             ` Alexander Graf
  2010-12-22 21:59           ` Benjamin Herrenschmidt
  1 sibling, 2 replies; 16+ messages in thread
From: Christoph Hellwig @ 2010-12-22 13:27 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: Kevin Wolf, qemu-devel, ronnie sahlberg

On Wed, Dec 22, 2010 at 02:54:54PM +0100, Hannes Reinecke wrote:
> Most modern HBAs are using separate codepaths for streaming/block I/O
> anyway,

That's not true at all.  Every normal HBA just passes normal SCSI
commands to the SCSI targets.  It's just raid adapters that take special
commands, and the megaraid one is extremely special as it actually
emulates a few SCSI commands even in RAID mode, which almost no other
HBA does.  Strictly speaking we should not allow scsi-generic with
megaraid_sas, except for the separate passthrough channels that the real
hardware has for things like tape drives.

> However, since Alex Graf is facing similar problems with the AHCI HBA of
> his maybe we could retry again ...

AHCI is an ATA adapter, and should never be used with scsi-generic for
disks.  Only for the ATAPI-attached cdroms/tapes/etc it could be used,
although it's quite pointless.


* Re: [Qemu-devel] scsi-generic and max request size
  2010-12-21 22:05       ` Benjamin Herrenschmidt
@ 2010-12-22 13:54         ` Hannes Reinecke
  2010-12-22 13:27           ` Christoph Hellwig
  2010-12-22 21:59           ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 16+ messages in thread
From: Hannes Reinecke @ 2010-12-22 13:54 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: Kevin Wolf, qemu-devel, ronnie sahlberg

On 12/21/2010 11:05 PM, Benjamin Herrenschmidt wrote:
>>> So back to square 1 ... my vscsi (and virtio-blk too btw) can
>>> technically pass a max size to the guest, but we don't have a way to
>>> interrogate scsi-generic (and the underlying block driver) which is the
>>> main issue (that plus the fact that the ioctl seems to be broken in
>>> "compat" mode for /dev/sg specifically)...
>>>
>> Ah, the warm and fuzzy feeling knowing to be not alone in this ...
>>
>> This is basically the same issue I brought up with the first
>> submission round of my megasas emulation.
> 
> heh.
> 
>> As we're passing scatter-gather lists directly to the underlying
>> device we might end up sending a request which is improperly
>> formatted. The linux block layer has three limits onto which a
>> request has to be formatted:
>> - Max length of the scatter-gather list (max_sectors)
>> - Max overall request size (max_segments)
> 
> Didn't you swap the 2 above ? max_sectors is the max overall req. size
> and max_segments the max number of SG elements afaik :-)
> 
Yeah, could be. 'twas only meant for illustration anyway.

>> - Max length of individual sg elements (max_segment_size)
> 
>> newer kernels export these limits; they have been exported with
>> commit c77a5710b7e23847bfdb81fcaa10b585f65c960a.
>> For older kernels, however, we're being left in the dark here.
> 
> Well, first of all, "sg" is not there so that doesn't help with the
> scsi-generic problem much, then parsing sysfs... yuck.
> 
Well, sort of. 'sg' doesn't have any block queue limits directly as the
block queue is attached to the block device (surprise, surprise :-).
But nevertheless any commands sent via SG_IO are being placed on the
block queue, hence the same limits apply here, too.

>> So on newer kernel we probably could be doing a quick check on the
>> block queue limits and reformat the I/O if required.
> 
> Maybe but then, "sg" isn't there. We "could" I suppose use "sr" as an
> indication tho when we know it's a cdrom.
> 
If it were me I would be using
>> Instead of reformatting we could be sending each element of an sg
>> list individually. Thereby we would be introducing some slowdown as
>> the sg lists have to be reassembled again by the lower layers, but
>> we would be insulated from any sg list mismatch.
>> However, this won't cover requests with too large sg elements.
>> For those we could probably use some simple divide-by-two algorithm
>> on the element to make them fit.
> 
> How can we ? We need a single request to match a single sg list anyways
> no ?
> 
Yes, true. That's what I was trying to illustrate here.

> Let's say you get a READ10 from the guest for 200Kb and your underlying
> max_sectors is 128Kb. How do you want to "break that up" ? The only way
> would be to make it two different READ10's and that's a can of worms
> especially if you start putting tags into the picture...
> 
Precisely. Hence I didn't try to implement anything in that area :-)

>> But seeing we have to split the I/O requests anyway we might as well
>> use the divide-by-two algorithm for the sg lists, too.
>>
>> Easiest would be if we could just transfer the available bits and
>> push the request back to the guest as a partial completion.
>> Sadly the I/O stack on the guest will choose to interpret this as an
>> I/O error instead of retrying the remainder :-(
>>
>> So in the long run I fear we have to implement some sort of I/O
>> request splitting in Qemu, using the values from sysfs.
> 
> So in my case, I'm happy for the time being to continue doing bounce
> buffering and so my only problem at the moment is the max request size
> (aka max_sectors). Also I -can- tell the guest what my limitation is,
> it's part of the vscsi login protocol. I can look into doing DMA
> directly to the guest SG lists later maybe.
> 
> However, I can't quite figure out how to reliably obtain that
> information in my driver since on one hand, the ioctl doesn't seem to
> work in mixed 32/64-bit environments, and on the other hand, sysfs
> doesn't seem to have anything for "sg" in /sys/class/block... Besides,
> those are both Linux-isms... so we'd have to be extra careful there too.
> 
Yes. I've been bashing my head against this, too.

IMO the whole problem arises from the fact that we're deliberately
destroying information here.
Most modern HBAs are using separate codepaths for streaming/block I/O
anyway, but when using 'scsi-generic' we are forced to discard this
information. We have to fake a SCSI READ/WRITE command, and send it via
SG_IO to the underlying device and keep fingers crossed that we're not
exceeding any device limitations.

The whole problem would just go away if we could use the standard block
read()/write() calls here. Then the iovec would be placed _as
scatter-gather list_ on the request-queue and the block layer would take
care of the whole issue.

I've tried to advocate this approach once, but (again) was being told
that it's a misuse of scsi-generic and I should be using scsi-disk instead.

However, since Alex Graf is facing similar problems with the AHCI HBA of
his maybe we could retry again ...

Cheers,

Hannes
--
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)


* Re: [Qemu-devel] scsi-generic and max request size
  2010-12-22 13:54         ` Hannes Reinecke
  2010-12-22 13:27           ` Christoph Hellwig
@ 2010-12-22 21:59           ` Benjamin Herrenschmidt
  2010-12-22 23:23             ` Alexander Graf
  1 sibling, 1 reply; 16+ messages in thread
From: Benjamin Herrenschmidt @ 2010-12-22 21:59 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Kevin Wolf, Christoph Hellwig, qemu-devel, ronnie sahlberg

On Wed, 2010-12-22 at 14:54 +0100, Hannes Reinecke wrote:

> Well, sort of. 'sg' doesn't have any block queue limits directly as the
> block queue is attached to the block device (surprise, surprise :-).
> But nevertheless any commands sent via SG_IO are being placed on the
> block queue, hence the same limits apply here, too.

Right, tho is there a "simple" way to map sg to the appropriate block
driver to retrieve the info via sysfs ? It looks possible from a quick
peek there but it also looks like an ungodly mess.
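
One possible way to do that mapping, sketched under the assumption that
sysfs lays things out as /sys/class/scsi_generic/sgN/device/block/<name>/
(this may differ between kernel versions, and the helper name is made
up). It returns the queue directory of the first block device found
behind the sg node, or None if no block driver is bound.

```python
import os


def block_queue_dir_for_sg(sg_name, sysfs_root="/sys"):
    """Return the queue/ dir of the block device behind sg_name, or None.

    Assumed layout: <sysfs_root>/class/scsi_generic/<sg_name>/device/block/
    """
    block_dir = os.path.join(
        sysfs_root, "class", "scsi_generic", sg_name, "device", "block"
    )
    try:
        names = sorted(os.listdir(block_dir))
    except OSError:
        return None  # no block driver bound, or a different sysfs layout
    for name in names:
        queue = os.path.join(block_dir, name, "queue")
        if os.path.isdir(queue):
            return queue
    return None
```

Devices with no upper-level block driver (tape changers, for instance)
would yield None here, so this cannot be the only source of limits.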

> If it were me I would be using

I think you meant to type more here :-)

> > However, I can't quite figure out how to reliably obtain that
> > information in my driver since on one hand, the ioctl doesn't seem to
> > work in mixed 32/64-bit environments, and on the other hand, sysfs
> > doesn't seem to have anything for "sg" in /sys/class/block... Besides,
> > those are both Linux-isms... so we'd have to be extra careful there too.
> > 
> Yes. I've been bashing my head against this, too.

Christoph, any suggestion there ?

> IMO the whole problem arises from the fact that we're deliberately
> destroying information here.
> Most modern HBAs are using separate codepaths for streaming/block I/O
> anyway, but when using 'scsi-generic' we are forced to discard this
> information. We have to fake a SCSI READ/WRITE command, and send it via
> SG_IO to the underlying device and keep fingers crossed that we're not
> exceeding any device limitations.

I wouldn't say it like that no.

It's a transport problem. In my case I'm not "faking" anything, vscsi is
just a transport (a variant of SRP). The problem is that when
'emulating' a HW HBA, you have no way to express the intrinsic
limitations of the underlying HBA, but that's not a problem I have with
vscsi which is meant to be a transport and as such does have means to
convey that sort of information (tho in my case, I have some issues due
to assumptions/bugs in the existing ibm vscsi client driver but that's a
different topic).

So I think there's a significant difference here between emulating a HW
HBA and doing something like vscsi. The former has problems that cannot
be easily solved I believe. The latter's problems, on the other hand, can be
solved, the means to do so are there, but we have to deal with
"interface" issues ... plumbing problems.

The non-working compat ioctl is one, the fact that "sg" has
no /sys/class/block (or /sys/block) entries is another, etc... Ie, we
are faced with a problem with Linux not exposing that information in
an easy to retrieve way, and no proper cross-platform way to obtain
that information either.

> The whole problem would just go away if we could use the standard block
> read()/write() calls here. Then the iovec would be placed _as
> scatter-gather list_ on the request-queue and the block layer would take
> care of the whole issue.

That would be somewhat cheating with the concept of just being a SCSI
transport layer :-) You would interpret some requests and turn them into
something else. That would be "interesting" when your user starts using
tags and makes assumptions about what's in flight and what not etc...

> I've tried to advocate this approach once, but (again) was being told
> that it's a misuse of scsi-generic and I should be using scsi-disk instead.
> 
> However, since Alex Graf is facing similar problems with the AHCI HBA of
> his maybe we could retry again ...

Again, I'd say different problems :-) To some extent scsi-disk will
solve the issues with basic read/write operations, but there's some more
nasty SCSI commands that you want through for things like DVD burning
for example, unless we start building higher level abstractions into the
kernel. So you -still- end up acting somewhat as a SCSI transport layer,
and potentially hit the problem with limits again.

Cheers,
Ben.


* Re: [Qemu-devel] scsi-generic and max request size
  2010-12-22 13:27           ` Christoph Hellwig
@ 2010-12-22 22:06             ` Benjamin Herrenschmidt
  2010-12-22 23:19             ` Alexander Graf
  1 sibling, 0 replies; 16+ messages in thread
From: Benjamin Herrenschmidt @ 2010-12-22 22:06 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Kevin Wolf, Hannes Reinecke, ronnie sahlberg, qemu-devel

On Wed, 2010-12-22 at 14:27 +0100, Christoph Hellwig wrote:
> On Wed, Dec 22, 2010 at 02:54:54PM +0100, Hannes Reinecke wrote:
> > Most modern HBAs are using separate codepaths for streaming/block I/O
> > anyway,
> 
> That's not true at all.  Every normal HBA just passes normal SCSI
> commands to the SCSI targets.  It's just raid adapters that take special
> commands, and the megaraid one is extremely special as it actually
> emulates a few SCSI commands even in RAID mode, which almost no other
> HBA does.  Strictly speaking we should not allow scsi-generic with
> megaraid_sas, except for the separate passthrough channels that the real
> hardware has for things like tape drives.

Actually, I would put it differently here.

scsi-generic is -fundamentally- busted for HBA HW emulation since you
simply cannot convey the limits of the underlying real HBA.

If you are on top of usb-storage with a 120K max_sectors and try to
emulate a piece of HBA with no such limitation how in hell do you make
your guest know not to give you >120K requests at a time and what do you
do if it does ? You're stuffed basically...

Hence, the only way scsi-generic can make sense imho, is for something
like vscsi which I'm doing now, which is just a transport and does have
the ability to convey to the client/guest some of those limitations...
provided it can get to them in the first place (see the discussion, it's
really non trivial, which makes /dev/sg even less useful even for normal
userspace :-)

In the Megaraid case, the fact that it has this separate read/write
channel on the contrary should make it -easier- to solve that problem
typically by allowing the emulation layer to construct sequences of
READ/WRITE requests that match the limitations. I.e. it makes
scsi-generic a possibility where it would otherwise be (and is) broken in
unfixable ways with other HBA emulation.

> > However, since Alex Graf is facing similar problems with the AHCI HBA of
> > his maybe we could retry again ...
> 
> AHCI is a ATA adapter, and should never be used with scsi-generic for
> disks.  Only for the ATAPI-attached cdroms/tapes/etc it could be used,
> although it's quite pointless.

Right, but in that case (cdroms etc...) it would have the exact same
problem. I'm not familiar with AHCI HW, and so I don't know whether
there's a way for the HW to convey "limits" to the driver, but if not,
then operating via scsi-generic would be busted the same way anything
else is.

Basically, scsi-generic cannot work for emulating an HBA. In fact, I
would go as far as saying that it's not possible to generically emulate
an HBA that just passes through any SCSI command, simply due to the
inability to convey those limits.

vscsi is a special case (and other "paravirt" drivers that may exist)
because being explicitly designed for acting as such transports, they
-do- convey the necessary limit information. I don't know iscsi but I
would be surprised if it didn't provide similar facilities.

So what we need here is a way for qemu to retrieve those reliably when
using scsi-generic. That's the missing piece of the puzzle on my side.

Cheers,
Ben.


* Re: [Qemu-devel] scsi-generic and max request size
  2010-12-22 13:27           ` Christoph Hellwig
  2010-12-22 22:06             ` Benjamin Herrenschmidt
@ 2010-12-22 23:19             ` Alexander Graf
  1 sibling, 0 replies; 16+ messages in thread
From: Alexander Graf @ 2010-12-22 23:19 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Kevin Wolf, Hannes Reinecke, ronnie sahlberg, qemu-devel

On 22.12.2010, at 14:27, Christoph Hellwig wrote:

> On Wed, Dec 22, 2010 at 02:54:54PM +0100, Hannes Reinecke wrote:
>> Most modern HBAs are using separate codepaths for streaming/block I/O
>> anyway,
> 
> That's not true at all.  Every normal HBA just passes normal SCSI
> commands to the SCSI targets.  It's just raid adapters that take special
> commands, and the megaraid one is extremely special as it actually
> emulates a few SCSI commands even in RAID mode, which almost no other
> HBA does.  Strictly speaking we should not allow scsi-generic with
> megaraid_sas, except for the separate passthrough channels that the real
> hardware has for things like tape drives.
> 
>> However, since Alex Graf is facing similar problems with the AHCI HBA of
>> his maybe we could retry again ...
> 
> AHCI is an ATA adapter, and should never be used with scsi-generic for
> disks.  Only for the ATAPI-attached cdroms/tapes/etc it could be used,
> although it's quite pointless.

It's not 100% pointless - ATAPI passthrough is a feature requested by users.
If we were to model ATAPI properly, it would end up using whatever SCSI layers we have below - which means ATAPI passthrough would be a mere matter of replacing the "scsi-cdrom" backend with a "scsi-passthrough" backend.

Now for the fun part. ATAPI can also do NCQ. So if we were to do ATAPI passthrough on a CD-ROM with NCQ, we would end up with exactly the same situation as megasas: NCQ goes through the normal read/write path of a block backend, while passthrough would do SG_IO.

Alex


* Re: [Qemu-devel] scsi-generic and max request size
  2010-12-22 21:59           ` Benjamin Herrenschmidt
@ 2010-12-22 23:23             ` Alexander Graf
  2010-12-22 23:35               ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 16+ messages in thread
From: Alexander Graf @ 2010-12-22 23:23 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Kevin Wolf, Christoph Hellwig, Hannes Reinecke, ronnie sahlberg,
	qemu-devel


On 22.12.2010, at 22:59, Benjamin Herrenschmidt wrote:

> On Wed, 2010-12-22 at 14:54 +0100, Hannes Reinecke wrote:
> 
>> Well, sort of. 'sg' doesn't have any block queue limits directly as the
>> block queue is attached to the block device (surprise, surprise :-).
>> But nevertheless any commands sent via SG_IO are being placed on the
>> block queue, hence the same limits apply here, too.
> 
> Right, though is there a "simple" way to map sg to the appropriate block
> driver to retrieve the info via sysfs? It looks possible from a quick
> peek there but it also looks like an ungodly mess.
> 
>> If it were me I would be using
> 
> I think you meant to type more here :-)
> 
>>> However, I can't quite figure out how to reliably obtain that
>>> information in my driver since on one hand, the ioctl doesn't seem to
>>> work in mixed 32/64-bit environments, and on the other hand, sysfs
>>> doesn't seem to have anything for "sg" in /sys/class/block... Besides,
>>> those are both Linux-isms... so we'd have to be extra careful there too.
>>> 
>> Yes. I've been bashing my head against this, too.
> 
> Christoph, any suggestion there ?
> 
>> IMO the whole problem arises from the fact that we're deliberately
>> destroying information here.
>> Most modern HBAs are using separate codepaths for streaming/block I/O
>> anyway, but when using 'scsi-generic' we are forced to discard this
>> information. We have to fake a SCSI READ/WRITE command, and send it via
>> SG_IO to the underlying device and keep fingers crossed that we're not
>> exceeding any device limitations.
> 
> I wouldn't say it like that no.
> 
> It's a transport problem. In my case I'm not "faking" anything, vscsi is
> just a transport (a variant of SRP). The problem is that when
> 'emulating' a HW HBA, you have no way to express the intrinsic
> limitations of the underlying HBA, but that's not a problem I have with
> vscsi which is meant to be a transport and as such does have means to
> convey that sort of information (tho in my case, I have some issues due
> to assumptions/bugs in the existing ibm vscsi client driver but that's a
> different topic).
> 
> So I think there's a significant difference here between emulating a HW
> HBA and doing something like vscsi. The former has problems that cannot
> be easily solved I believe. The latter's problems, on the other hand, can be
> solved; the means to do so are there, but we have to deal with
> "interface" issues ... plumbing problems.
> 
> The non-working compat ioctl is one, the fact that "sg" has
> no /sys/class/block (or /sys/block) entries is another, etc... I.e., we
> are faced with a problem with Linux not exposing that information in
> an easy-to-retrieve way, and no proper cross-platform way to obtain
> it either.

Why would you care about cross-platform here? Not saying I fully understand what information exactly you're lacking. But it's either SG_IO max request size in which case you don't need any equivalent on other platforms, as it's not available anywhere else. Or it's something else in which case you can just set it to some "safe" small default value and call it a day :).


Alex


* Re: [Qemu-devel] scsi-generic and max request size
  2010-12-22 23:23             ` Alexander Graf
@ 2010-12-22 23:35               ` Benjamin Herrenschmidt
  2010-12-22 23:39                 ` Alexander Graf
  0 siblings, 1 reply; 16+ messages in thread
From: Benjamin Herrenschmidt @ 2010-12-22 23:35 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Kevin Wolf, Christoph Hellwig, Hannes Reinecke, ronnie sahlberg,
	qemu-devel

On Thu, 2010-12-23 at 00:23 +0100, Alexander Graf wrote:
> > The non-working compat ioctl is one, the fact that "sg" has
> > no /sys/class/block (or /sys/block) entries is another, etc... I.e., we
> > are faced with a problem with Linux not exposing that information in
> > an easy-to-retrieve way, and no proper cross-platform way to obtain
> > it either.
> 
> Why would you care about cross-platform here? Not saying I fully
> understand what information exactly you're lacking. But it's either
> SG_IO max request size in which case you don't need any equivalent on
> other platforms, as it's not available anywhere else. Or it's
> something else in which case you can just set it to some "safe" small
> default value and call it a day :).

Well, do we support something like scsi-generic on Windows or BSD
hosts? Dunno... just asking :-) They -have- mechanisms (at least
Windows does) to pass SCSI requests down the stack. In that case, they'll
have similar limitations (at the very least the max request size).

So we'd want some way to expose that... but if scsi-generic today is
Linux-only, then I can try to add Linux-isms in there as a stop-gap to
try to at least retrieve the max req. size, which is the main issue for
me right now... at least until I start trying to have the SG_IO
read/write directly into guest memory without bouncing :-) At that
point, the SG limits might become trouble as well.

Cheers,
Ben.
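
One possible shape for such a Linux-only stop-gap is to walk sysfs from the
sg node to the queue of the block device bound to the same SCSI device. This
is only a sketch: the path layout
/sys/class/scsi_generic/sgN/device/block/<name>/queue/max_sectors_kb is an
assumption about the running kernel (the thread notes that "sg" has no entry
of its own under /sys/class/block), and the function names are hypothetical.

```python
import glob
import os


def parse_max_sectors_kb(text):
    """Convert the contents of a max_sectors_kb file (KiB) to bytes."""
    return int(text.strip()) * 1024


def sg_queue_limit_paths(sg_name, sysfs_root="/sys"):
    """List candidate max_sectors_kb files for an sg node (assumed layout)."""
    pattern = os.path.join(
        sysfs_root, "class", "scsi_generic", sg_name,
        "device", "block", "*", "queue", "max_sectors_kb",
    )
    return sorted(glob.glob(pattern))


def sg_max_bytes_from_sysfs(sg_name, sysfs_root="/sys"):
    """Return the limit in bytes, or None if no block queue is found."""
    for path in sg_queue_limit_paths(sg_name, sysfs_root):
        with open(path) as f:
            return parse_max_sectors_kb(f.read())
    return None
```

A device with no block driver bound would have no such directory, so the
caller has to handle the None case, for instance by falling back to a
conservative default. For scale, a file containing 120 would correspond to
the 0x1e000-byte limit mentioned at the start of the thread.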


* Re: [Qemu-devel] scsi-generic and max request size
  2010-12-22 23:35               ` Benjamin Herrenschmidt
@ 2010-12-22 23:39                 ` Alexander Graf
  2010-12-22 23:44                   ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 16+ messages in thread
From: Alexander Graf @ 2010-12-22 23:39 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Kevin Wolf, Christoph Hellwig, Hannes Reinecke, ronnie sahlberg,
	qemu-devel


On 23.12.2010, at 00:35, Benjamin Herrenschmidt wrote:

> On Thu, 2010-12-23 at 00:23 +0100, Alexander Graf wrote:
>>> The non-working compat ioctl is one, the fact that "sg" has
>>> no /sys/class/block (or /sys/block) entries is another, etc... I.e., we
>>> are faced with a problem with Linux not exposing that information in
>>> an easy-to-retrieve way, and no proper cross-platform way to obtain
>>> it either.
>> 
>> Why would you care about cross-platform here? Not saying I fully
>> understand what information exactly you're lacking. But it's either
>> SG_IO max request size in which case you don't need any equivalent on
>> other platforms, as it's not available anywhere else. Or it's
>> something else in which case you can just set it to some "safe" small
>> default value and call it a day :).
> 
> Well, do we support something like scsi-generic on Windows or BSD
> hosts? Dunno... just asking :-) They -have- mechanisms (at least
> Windows does) to pass SCSI requests down the stack. In that case, they'll
> have similar limitations (at the very least the max request size).
> 
> So we'd want some way to expose that... but if scsi-generic today is
> Linux-only, then I can try to add Linux-isms in there as a stop-gap to
> try to at least retrieve the max req. size, which is the main issue for
> me right now... at least until I start trying to have the SG_IO
> read/write directly into guest memory without bouncing :-) At that
> point, the SG limits might become trouble as well.

This all belongs in the block layer. If you create a callback function or property in the block struct, Windows can implement its own limits when someone sits down to implement SG_IO on Windows.


Alex
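
The callback idea can be sketched roughly as follows. This is not QEMU's
block layer API; the class, the function and the default value are all
hypothetical and only illustrate the split between a generic query and a
per-host implementation, with a "safe" small default when the host cannot
report a limit.

```python
# Arbitrary placeholder for a "safe" small default; not a value from the thread.
SAFE_DEFAULT_MAX_BYTES = 64 * 1024


class PassthroughBackend:
    """Hypothetical block-layer object with an optional per-host limit hook."""

    def __init__(self, name, query_host_limit=None):
        self.name = name
        # Each host platform would plug in its own callable here.
        self._query_host_limit = query_host_limit

    def max_transfer_bytes(self):
        """Return the host limit, or the default if none is available."""
        if self._query_host_limit is not None:
            limit = self._query_host_limit()
            if limit:
                return limit
        return SAFE_DEFAULT_MAX_BYTES


def advertised_limit(backend, transport_limit):
    """What the emulated transport could expose to the guest."""
    return min(backend.max_transfer_bytes(), transport_limit)
```

A Linux host would pass a callable that performs the sg query, while a host
without an implementation passes nothing and gets the default. For example,
PassthroughBackend("cd", lambda: 0x1e000) combined with a 1 MiB transport
limit would advertise 0x1e000 bytes.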


* Re: [Qemu-devel] scsi-generic and max request size
  2010-12-22 23:39                 ` Alexander Graf
@ 2010-12-22 23:44                   ` Benjamin Herrenschmidt
  2010-12-22 23:49                     ` Alexander Graf
  0 siblings, 1 reply; 16+ messages in thread
From: Benjamin Herrenschmidt @ 2010-12-22 23:44 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Kevin Wolf, Christoph Hellwig, Hannes Reinecke, ronnie sahlberg,
	qemu-devel

On Thu, 2010-12-23 at 00:39 +0100, Alexander Graf wrote:
> This all belongs in the block layer. If you create a callback
> function or property in the block struct, Windows can implement its
> own limits when someone sits down to implement SG_IO on Windows.

Right, and it seems we do have "generic" ways to interrogate those
limits... except they seem to be broken for "sg" :-)

Also I've spotted some oddities where the ioctl for the max request size
sometimes does put_user on an int * and sometimes on a short * ... oops...

Cheers,
Ben.
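
To see why the int * versus short * mismatch matters, the following sketch
simulates a kernel that writes one width into a zeroed 4-byte buffer while
the caller reads back the other width. It is a pure simulation with a
hypothetical helper name; no ioctl is issued and nothing here claims which
width a given driver actually uses.

```python
import struct


def simulate_readback(value, kernel_fmt, caller_fmt, byte_order):
    """Write `value` as kernel_fmt into a zeroed buffer, read as caller_fmt.

    byte_order is "<" (little-endian) or ">" (big-endian);
    the formats are "H" (unsigned short) or "i" (int).
    """
    buf = bytearray(4)  # zero-initialised caller buffer
    struct.pack_into(byte_order + kernel_fmt, buf, 0, value)
    return struct.unpack_from(byte_order + caller_fmt, buf, 0)[0]
```

With a limit of 240 sectors written as a short and read back as an int, a
little-endian host happens to get 240, but a big-endian host (such as a
POWER machine) gets 240 << 16. The reverse mismatch silently truncates the
value, so the result depends on both the driver and the host byte order.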


* Re: [Qemu-devel] scsi-generic and max request size
  2010-12-22 23:44                   ` Benjamin Herrenschmidt
@ 2010-12-22 23:49                     ` Alexander Graf
  2010-12-23  0:00                       ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 16+ messages in thread
From: Alexander Graf @ 2010-12-22 23:49 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Kevin Wolf, Christoph Hellwig, Hannes Reinecke, ronnie sahlberg,
	qemu-devel


On 23.12.2010, at 00:44, Benjamin Herrenschmidt wrote:

> On Thu, 2010-12-23 at 00:39 +0100, Alexander Graf wrote:
>> This all belongs in the block layer. If you create a callback
>> function or property in the block struct, Windows can implement its
>> own limits when someone sits down to implement SG_IO on Windows.
> 
> Right, and it seems we do have "generic" ways to interrogate those
> limits... except they seem to be broken for "sg" :-)
> 
> Also I've spotted some oddities where the ioctl for the max request size
> sometimes does put_user on an int * and sometimes on a short * ... oops...

Congratulations on finding lots of Linux bugs :). Look at it this way: you'll most likely be the very first person actually using sg properly. So after you're done, others won't have to fix it :).


Alex


* Re: [Qemu-devel] scsi-generic and max request size
  2010-12-22 23:49                     ` Alexander Graf
@ 2010-12-23  0:00                       ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 16+ messages in thread
From: Benjamin Herrenschmidt @ 2010-12-23  0:00 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Kevin Wolf, Christoph Hellwig, Hannes Reinecke, ronnie sahlberg,
	qemu-devel

On Thu, 2010-12-23 at 00:49 +0100, Alexander Graf wrote:
> 
> Congratulations on finding lots of Linux bugs :). Look at it
> this way: you'll most likely be the very first person actually using
> sg properly. So after you're done, others won't have to fix it :).

Hahah, I doubt it :-) Makes me wonder whether "sg" can be used properly
to be honest...

Cheers,
Ben.


end of thread, other threads:[~2010-12-23  0:00 UTC | newest]

Thread overview: 16+ messages
2010-12-21  3:25 [Qemu-devel] scsi-generic and max request size Benjamin Herrenschmidt
2010-12-21  3:38 ` ronnie sahlberg
2010-12-21  3:52   ` Benjamin Herrenschmidt
2010-12-21  8:44     ` Hannes Reinecke
2010-12-21 22:05       ` Benjamin Herrenschmidt
2010-12-22 13:54         ` Hannes Reinecke
2010-12-22 13:27           ` Christoph Hellwig
2010-12-22 22:06             ` Benjamin Herrenschmidt
2010-12-22 23:19             ` Alexander Graf
2010-12-22 21:59           ` Benjamin Herrenschmidt
2010-12-22 23:23             ` Alexander Graf
2010-12-22 23:35               ` Benjamin Herrenschmidt
2010-12-22 23:39                 ` Alexander Graf
2010-12-22 23:44                   ` Benjamin Herrenschmidt
2010-12-22 23:49                     ` Alexander Graf
2010-12-23  0:00                       ` Benjamin Herrenschmidt
