* [RFC] Netlink and user-space buffer pointers
[not found] ` <20060418160121.GA2707@us.ibm.com>
@ 2006-04-19 12:57 ` James Smart
2006-04-19 16:22 ` Patrick McHardy
` (2 more replies)
0 siblings, 3 replies; 16+ messages in thread
From: James Smart @ 2006-04-19 12:57 UTC (permalink / raw)
To: linux-scsi, netdev, linux-kernel
Folks,
To take netlink to where we want to use it within the SCSI subsystem (as
the mechanism of choice to replace ioctls), we're going to need to pass
user-space buffer pointers.
What is the best, portable manner to pass a pointer between user and kernel
space within a netlink message ? The example I've seen is in the iscsi
target code - and it's passed between user-kernel space as a u64, then
typecast to a void *, and later within the bio_map_xxx functions, as an
unsigned long. I assume we are going to continue with this method ?
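Roughly, the pattern in question looks like the sketch below. The struct and
field names are invented for illustration; only the cast chain mirrors what
the iscsi code does:

  #include <stdint.h>

  /* hypothetical netlink payload -- illustration only */
  struct foo_msg {
          uint64_t buf;           /* user buffer pointer, carried as a u64 */
          uint32_t buf_len;
  };

  static void fill_msg(struct foo_msg *msg, void *ubuf, uint32_t len)
  {
          /* cast via unsigned long first so 32-bit builds don't warn */
          msg->buf = (uint64_t)(unsigned long)ubuf;
          msg->buf_len = len;
  }

  /*
   * Kernel side (sketch): the value comes back out as
   *
   *      void *p = (void *)(unsigned long)msg->buf;
   *      unsigned long uaddr = (unsigned long)p;
   *
   * which is the form the bio_map_xxx helpers take.
   */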
-- james s
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC] Netlink and user-space buffer pointers
2006-04-19 12:57 ` [RFC] Netlink and user-space buffer pointers James Smart
@ 2006-04-19 16:22 ` Patrick McHardy
2006-04-19 17:08 ` James Smart
2006-04-19 16:26 ` Stephen Hemminger
2006-04-19 21:32 ` Mike Christie
2 siblings, 1 reply; 16+ messages in thread
From: Patrick McHardy @ 2006-04-19 16:22 UTC (permalink / raw)
To: James.Smart; +Cc: linux-scsi, netdev, linux-kernel
James Smart wrote:
> To take netlink to where we want to use it within the SCSI subsystem (as
> the mechanism of choice to replace ioctls), we're going to need to pass
> user-space buffer pointers.
>
> What is the best, portable manner to pass a pointer between user and kernel
> space within a netlink message ? The example I've seen is in the iscsi
> target code - and it's passed between user-kernel space as a u64, then
> typecast to a void *, and later within the bio_map_xxx functions, as an
> unsigned long. I assume we are going to continue with this method ?
This might be problematic: since there is a shared receive queue in
the kernel, a netlink message might get processed in the context of
a different process. I didn't find any spots where ISCSI passes
pointers over netlink, can you point me to it?
Besides that, netlink protocols should use fixed-size,
architecture-independent types, so u64 would be the best choice for
pointers.
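For concreteness, a hypothetical payload along those lines would contain only
fixed-width members, with the u64 placed first (or explicitly padded) so that
32-bit and 64-bit userspace agree on the layout; the struct below is purely
illustrative:

  /* illustrative only -- no longs, no pointers, no implicit padding;
   * __u64/__u32 come from <linux/types.h> */
  struct scsi_nl_request {
          __u64   buf;            /* user buffer address carried as a u64 */
          __u32   buf_len;
          __u32   request_id;
  };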
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC] Netlink and user-space buffer pointers
2006-04-19 12:57 ` [RFC] Netlink and user-space buffer pointers James Smart
2006-04-19 16:22 ` Patrick McHardy
@ 2006-04-19 16:26 ` Stephen Hemminger
2006-04-19 17:05 ` James Smart
2006-04-19 21:32 ` Mike Christie
2 siblings, 1 reply; 16+ messages in thread
From: Stephen Hemminger @ 2006-04-19 16:26 UTC (permalink / raw)
To: James.Smart; +Cc: linux-scsi, netdev, linux-kernel
On Wed, 19 Apr 2006 08:57:25 -0400
James Smart <James.Smart@Emulex.Com> wrote:
> Folks,
>
> To take netlink to where we want to use it within the SCSI subsystem (as
> the mechanism of choice to replace ioctls), we're going to need to pass
> user-space buffer pointers.
This changes the design of netlink. It is desired that netlink
can be used remotely over the network, as well as queued.
The current design is message based, not RPC based. By including a
user-space pointer, you are making the message dependent on the
context of the process handling it.
Please rethink your design.
> What is the best, portable manner to pass a pointer between user and kernel
> space within a netlink message ? The example I've seen is in the iscsi
> target code - and it's passed between user-kernel space as a u64, then
> typecast to a void *, and later within the bio_map_xxx functions, as an
> unsigned long. I assume we are going to continue with this method ?
>
> -- james s
> -
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC] Netlink and user-space buffer pointers
2006-04-19 16:26 ` Stephen Hemminger
@ 2006-04-19 17:05 ` James Smart
0 siblings, 0 replies; 16+ messages in thread
From: James Smart @ 2006-04-19 17:05 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: linux-scsi, netdev, linux-kernel
Stephen Hemminger wrote:
> On Wed, 19 Apr 2006 08:57:25 -0400
> James Smart <James.Smart@Emulex.Com> wrote:
>
>> Folks,
>>
>> To take netlink to where we want to use it within the SCSI subsystem (as
>> the mechanism of choice to replace ioctls), we're going to need to pass
>> user-space buffer pointers.
>
> This changes the design of netlink. It is desired that netlink
> can be done remotely over the network as well as queueing.
> The current design is message based, not RPC based. By including a
> user-space pointer, you are making the message dependent on the
> context as it is process.
>
> Please rethink your design.
I assume that the message receiver has some way to determine where the
message originated (via the sk_buff), and thus could reject it if it
didn't meet the right criteria. True ? You just have to be cognizant
that it is usable from a remote entity - which is a very good thing.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC] Netlink and user-space buffer pointers
2006-04-19 16:22 ` Patrick McHardy
@ 2006-04-19 17:08 ` James Smart
2006-04-19 17:16 ` Patrick McHardy
0 siblings, 1 reply; 16+ messages in thread
From: James Smart @ 2006-04-19 17:08 UTC (permalink / raw)
To: Patrick McHardy; +Cc: linux-scsi, netdev, linux-kernel
Patrick McHardy wrote:
> This might be problematic, since there is a shared receive-queue in
> the kernel netlink message might get processed in the context of
> a different process. I didn't find any spots where ISCSI passes
> pointers over netlink, can you point me to it?
Please explain... Would the pid be set erroneously as well ? Ignoring
the kernel-user space pointer issue, we're going to have a tight
pid + request_id relationship being maintained across multiple messages.
We'll also be depending on the pid events for clean up if an app dies.
So I hope pid is consistent.
-- james s
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC] Netlink and user-space buffer pointers
2006-04-19 17:08 ` James Smart
@ 2006-04-19 17:16 ` Patrick McHardy
0 siblings, 0 replies; 16+ messages in thread
From: Patrick McHardy @ 2006-04-19 17:16 UTC (permalink / raw)
To: James.Smart; +Cc: linux-scsi, netdev, linux-kernel
James Smart wrote:
>
>
> Patrick McHardy wrote:
>
>> This might be problematic, since there is a shared receive-queue in
>> the kernel netlink message might get processed in the context of
>> a different process. I didn't find any spots where ISCSI passes
>> pointers over netlink, can you point me to it?
>
>
> Please explain... Would the pid be set erroneously as well ? Ignoring
> the kernel-user space pointer issue, we're going to have a tight
> pid + request_id relationship being maintained across multiple messages.
> We'll also be depending on the pid events for clean up if an app dies.
> So I hope pid is consistent.
The PID contained in the netlink message itself is correct, current->pid
might not be.
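In other words, a kernel-side handler should take the originator from the
netlink header (or the skb's control block), never from current. A minimal
sketch, not the exact callback signature of any particular subsystem:

  /* sketch: identify the sender from the message, not the current task */
  static u32 foo_sender_pid(struct sk_buff *skb, struct nlmsghdr *nlh)
  {
          /* both reflect the sending socket; current->pid may belong to
           * whatever process happened to trigger queue processing */
          return nlh->nlmsg_pid ? nlh->nlmsg_pid : NETLINK_CB(skb).pid;
  }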
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC] Netlink and user-space buffer pointers
2006-04-19 12:57 ` [RFC] Netlink and user-space buffer pointers James Smart
2006-04-19 16:22 ` Patrick McHardy
2006-04-19 16:26 ` Stephen Hemminger
@ 2006-04-19 21:32 ` Mike Christie
2006-04-20 14:33 ` James Smart
2 siblings, 1 reply; 16+ messages in thread
From: Mike Christie @ 2006-04-19 21:32 UTC (permalink / raw)
To: James.Smart; +Cc: linux-scsi, netdev, linux-kernel
James Smart wrote:
> Folks,
>
> To take netlink to where we want to use it within the SCSI subsystem (as
> the mechanism of choice to replace ioctls), we're going to need to pass
> user-space buffer pointers.
>
> What is the best, portable manner to pass a pointer between user and kernel
> space within a netlink message ? The example I've seen is in the iscsi
> target code - and it's passed between user-kernel space as a u64, then
> typecast to a void *, and later within the bio_map_xxx functions, as an
> unsigned long. I assume we are going to continue with this method ?
>
I do not know if it is needed. For the target code, we use the
bio_map_xxx functions to avoid having to copy the command data, which is
needed for decent performance. We have also been trying to figure out
ways of getting out of using netlink to send the command info (cdb, tag
info, etc.) around, because in some of Tomo's tests using mmapped packet
sockets he was able to improve performance by removing that copy from
the kernel to userspace. We had problems with that, though, and other
nice interfaces like relayfs only allow us to pass data from the kernel
to userspace, so we still need another interface to pass things from
userspace to the kernel. Still working on this though. If someone knows
an interface, please let us know.
For the tasks you want to do for the fc class, is performance critical?
If not, you could do what the iscsi class (for the netdev people this is
drivers/scsi/scsi_transport_iscsi.c) does and just suffer a couple of
copies. For iscsi we do this in userspace to send down a login pdu:
/*
 * xmitbuf is a buffer that is large enough for the iscsi_event,
 * iscsi pdu (hdr_size) and iscsi pdu data (data_size)
 */
memset(xmitbuf, 0, sizeof(*ev) + hdr_size + data_size);
xmitlen = sizeof(*ev);
ev = xmitbuf;
ev->type = ISCSI_UEVENT_SEND_PDU;
ev->transport_handle = transport_handle;
ev->u.send_pdu.sid = sid;
ev->u.send_pdu.cid = cid;
ev->u.send_pdu.hdr_size = hdr_size;
ev->u.send_pdu.data_size = data_size;
then later we do sendmsg() to send the xmitbuf down to the kernel iscsi
driver. I think there may be issues with struct packing, 32-bit
userspace on 64-bit kernels, and other fun things like this, so the
iscsi pdu and iscsi event have to be defined carefully, and I guess we
are back to some of the problems with ioctls :(
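For completeness, that sendmsg() step looks roughly like the following. It is
simplified and the netlink header handling is from memory, so treat it as a
sketch rather than the exact iscsid code; it assumes <sys/socket.h>,
<linux/netlink.h>, <string.h>, <unistd.h> and the iscsi_if.h definitions:

  struct sockaddr_nl dst = { .nl_family = AF_NETLINK }; /* nl_pid 0 == kernel */
  char nlbuf[NLMSG_SPACE(xmitlen + hdr_size + data_size)];
  struct nlmsghdr *nlh = (struct nlmsghdr *)nlbuf;
  struct iovec iov;
  struct msghdr msg;

  memset(nlbuf, 0, sizeof(nlbuf));
  nlh->nlmsg_len = NLMSG_LENGTH(xmitlen + hdr_size + data_size);
  nlh->nlmsg_type = ev->type;          /* illustrative */
  nlh->nlmsg_pid = getpid();
  memcpy(NLMSG_DATA(nlh), xmitbuf, xmitlen + hdr_size + data_size);

  iov.iov_base = nlh;
  iov.iov_len = nlh->nlmsg_len;
  memset(&msg, 0, sizeof(msg));
  msg.msg_name = &dst;
  msg.msg_namelen = sizeof(dst);
  msg.msg_iov = &iov;
  msg.msg_iovlen = 1;

  /* nl_fd: a socket(AF_NETLINK, SOCK_RAW, NETLINK_ISCSI) descriptor */
  sendmsg(nl_fd, &msg, 0);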
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC] Netlink and user-space buffer pointers
2006-04-19 21:32 ` Mike Christie
@ 2006-04-20 14:33 ` James Smart
2006-04-20 17:45 ` Mike Christie
0 siblings, 1 reply; 16+ messages in thread
From: James Smart @ 2006-04-20 14:33 UTC (permalink / raw)
To: Mike Christie; +Cc: linux-scsi, netdev, linux-kernel
Mike Christie wrote:
> For the tasks you want to do for the fc class is performance critical?
No, it should not be.
> If not, you could do what the iscsi class (for the netdev people this is
> drivers/scsi/scsi_transport_iscsi.c) does and just suffer a couple
> copies. For iscsi we do this in userspace to send down a login pdu:
>
> /*
> * xmitbuf is a buffer that is large enough for the iscsi_event,
> * iscsi pdu (hdr_size) and iscsi pdu data (data_size)
> */
Well, the real difference is that the payload of the "message" is actually
the payload of the SCSI command or ELS/CT request. Thus, the payload may
range in size from a few hundred bytes to several kbytes (> 1 page) to
Mbytes in size. Rather than buffer all of this and push it over the socket,
with the extra copies that implies, it would be best to have the LLDD simply
DMA the payload as on a typical SCSI command. Additionally, there will be
response data that can be several kbytes in length.
> ... I think there may be issues with packing structs or 32 bit
> userspace and 64 bit kernels and other fun things like this so the iscsi
> pdu and iscsi event have to be defined correctly and I guess we are back
> to some of the problems with ioctls :(
Agreed. In this use of netlink, there isn't much of a win for netlink over
ioctls. It all comes down to 2 things: a) a proper, portable message
definition; and b) what do you do with that non-portable user-space buffer
pointer?
-- james s
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC] Netlink and user-space buffer pointers
2006-04-20 14:33 ` James Smart
@ 2006-04-20 17:45 ` Mike Christie
2006-04-20 17:52 ` Mike Christie
` (3 more replies)
0 siblings, 4 replies; 16+ messages in thread
From: Mike Christie @ 2006-04-20 17:45 UTC (permalink / raw)
To: James.Smart; +Cc: linux-scsi, netdev, linux-kernel
James Smart wrote:
>
> Mike Christie wrote:
>> For the tasks you want to do for the fc class is performance critical?
>
> No, it should not be.
>
>> If not, you could do what the iscsi class (for the netdev people this is
>> drivers/scsi/scsi_transport_iscsi.c) does and just suffer a couple
>> copies. For iscsi we do this in userspace to send down a login pdu:
>>
>> /*
>> * xmitbuf is a buffer that is large enough for the iscsi_event,
>> * iscsi pdu (hdr_size) and iscsi pdu data (data_size)
>> */
>
> Well, the real difference is that the payload of the "message" is actually
> the payload of the SCSI command or ELS/CT Request. Thus, the payload may
I am not sure I follow. For iscsi, everything after the iscsi_event
struct can be the iscsi request that is to be transmitted. The payload
will not normally be Mbytes, but it is not a couple of bytes either.
> range in size from a few hundred bytes to several kbytes (> 1 page) to
> Mbyte's in size. Rather than buffer all of this, and push it over the
> socket,
> thus the extra copies - it would best to have the LLDD simply DMA the
> payload like on a typical SCSI command. Additionally, there will be
> response data that can be several kbytes in length.
>
Once you have got the buffer to the class, the class can create a
scatterlist for the LLD to DMA from, I thought. iscsi does not do this
just because it is software right now. For qla4xxx we do not need
something like what you are talking about (see below for what I was
thinking about for the initiators). If you are saying the extra step of
the copy is plain dumb, I agree, but this happens (you have to suffer
some copying and cannot do dio) for sg io as well in some cases. I think
for the sg driver copy_*_user is the default.
Instead of netlink for scsi commands and transport requests....
For scsi commands, could we just use sg io, or is there something special
about the command you want to send? If you can use sg io for scsi
commands, then maybe for transport-level requests (in my example, an iscsi
pdu) we could modify something like sg/bsg/the block layer's scsi_ioctl.c
to send transport requests down to the classes and encapsulate them in
some new struct transport_requests, or use the existing struct request and
do that thing people keep talking about, using the request/request_queue
for message passing.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC] Netlink and user-space buffer pointers
2006-04-20 17:45 ` Mike Christie
@ 2006-04-20 17:52 ` Mike Christie
2006-04-20 17:58 ` Mike Christie
` (2 subsequent siblings)
3 siblings, 0 replies; 16+ messages in thread
From: Mike Christie @ 2006-04-20 17:52 UTC (permalink / raw)
To: James.Smart; +Cc: linux-scsi, netdev, linux-kernel
Mike Christie wrote:
> James Smart wrote:
>> Mike Christie wrote:
>>> For the tasks you want to do for the fc class is performance critical?
>> No, it should not be.
>>
>>> If not, you could do what the iscsi class (for the netdev people this is
>>> drivers/scsi/scsi_transport_iscsi.c) does and just suffer a couple
>>> copies. For iscsi we do this in userspace to send down a login pdu:
>>>
>>> /*
>>> * xmitbuf is a buffer that is large enough for the iscsi_event,
>>> * iscsi pdu (hdr_size) and iscsi pdu data (data_size)
>>> */
>> Well, the real difference is that the payload of the "message" is actually
>> the payload of the SCSI command or ELS/CT Request. Thus, the payload may
>
> I am not sure I follow. For iscsi, everything after the iscsi_event
> struct can be the iscsi request that is to be transmitted. The payload
> will not normally be Mbytes but it is not a couple if bytes.
>
>> range in size from a few hundred bytes to several kbytes (> 1 page) to
>> Mbyte's in size. Rather than buffer all of this, and push it over the
>> socket,
>> thus the extra copies - it would best to have the LLDD simply DMA the
>> payload like on a typical SCSI command. Additionally, there will be
>> response data that can be several kbytes in length.
>>
>
> Once you have got the buffer to the class, the class can create a
> scatterlist to DMA from for the LLD. I thought. iscsi does not do this
> just because it is software right now. For qla4xxx we do not need
That should be, we do need.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC] Netlink and user-space buffer pointers
2006-04-20 17:45 ` Mike Christie
2006-04-20 17:52 ` Mike Christie
@ 2006-04-20 17:58 ` Mike Christie
2006-04-20 20:03 ` James Smart
2006-04-20 20:18 ` Douglas Gilbert
3 siblings, 0 replies; 16+ messages in thread
From: Mike Christie @ 2006-04-20 17:58 UTC (permalink / raw)
To: James.Smart; +Cc: linux-scsi, netdev, linux-kernel
Mike Christie wrote:
> James Smart wrote:
>> Mike Christie wrote:
>>> For the tasks you want to do for the fc class is performance critical?
>> No, it should not be.
>>
>>> If not, you could do what the iscsi class (for the netdev people this is
>>> drivers/scsi/scsi_transport_iscsi.c) does and just suffer a couple
>>> copies. For iscsi we do this in userspace to send down a login pdu:
>>>
>>> /*
>>> * xmitbuf is a buffer that is large enough for the iscsi_event,
>>> * iscsi pdu (hdr_size) and iscsi pdu data (data_size)
>>> */
>> Well, the real difference is that the payload of the "message" is actually
>> the payload of the SCSI command or ELS/CT Request. Thus, the payload may
>
> I am not sure I follow. For iscsi, everything after the iscsi_event
> struct can be the iscsi request that is to be transmitted. The payload
> will not normally be Mbytes but it is not a couple if bytes.
>
>> range in size from a few hundred bytes to several kbytes (> 1 page) to
>> Mbyte's in size. Rather than buffer all of this, and push it over the
>> socket,
>> thus the extra copies - it would best to have the LLDD simply DMA the
>> payload like on a typical SCSI command. Additionally, there will be
>> response data that can be several kbytes in length.
>>
>
> Once you have got the buffer to the class, the class can create a
> scatterlist to DMA from for the LLD. I thought. iscsi does not do this
> just because it is software right now. For qla4xxx we do not need
> something like what you are talking about (see below for what I was
> thinking about for the initiators). If you are saying the extra step of
> the copy is plain dumb, I agree, but this happens (you have to suffer
> some copy and cannot do dio) for sg io as well in some cases. I think
> for the sg driver the copy_*_user is the default.
>
> Instead of netlink for scsi commands and transport requests....
>
> For scsi commands could we just use sg io, or is there something special
> about the command you want to send? If you can use sg io for scsi
> commands, maybe for transport level requests (in my example iscsi pdu)
> we could modify something like sg/bsg/block layer scsi_ioctl.c to send
> down transport requests to the classes and encapsulate them in some new
> struct transport_requests or use the existing struct request but do that
> thing people keep taling about using the request/request_queue for
> message passing.
And just to be complete, the problem with this is that it is tied to the
request queue and so you cannot just send a transport level request
unless it is tied to the device. But for the target stuff we added a
request queue to the host so we could inject requests (the idea was to
send down those magic message requests) at a higher level. To be able
to use that for sg io though it would require some more code and magic
as you know.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC] Netlink and user-space buffer pointers
2006-04-20 17:45 ` Mike Christie
2006-04-20 17:52 ` Mike Christie
2006-04-20 17:58 ` Mike Christie
@ 2006-04-20 20:03 ` James Smart
2006-04-20 20:35 ` Mike Christie
2006-04-20 23:44 ` Andrew Vasquez
2006-04-20 20:18 ` Douglas Gilbert
3 siblings, 2 replies; 16+ messages in thread
From: James Smart @ 2006-04-20 20:03 UTC (permalink / raw)
To: Mike Christie; +Cc: linux-scsi, netdev, linux-kernel
Note: We've transitioned off topic. If what this means is "there isn't a good
way except by ioctls (which still isn't easily portable) or system calls",
then that's ok. Then at least we know the limits and can look at other
implementation alternatives.
Mike Christie wrote:
> James Smart wrote:
>> Mike Christie wrote:
>>> For the tasks you want to do for the fc class is performance critical?
>> No, it should not be.
>>
>>> If not, you could do what the iscsi class (for the netdev people this is
>>> drivers/scsi/scsi_transport_iscsi.c) does and just suffer a couple
>>> copies. For iscsi we do this in userspace to send down a login pdu:
>>>
>>> /*
>>> * xmitbuf is a buffer that is large enough for the iscsi_event,
>>> * iscsi pdu (hdr_size) and iscsi pdu data (data_size)
>>> */
>> Well, the real difference is that the payload of the "message" is actually
>> the payload of the SCSI command or ELS/CT Request. Thus, the payload may
>
> I am not sure I follow. For iscsi, everything after the iscsi_event
> struct can be the iscsi request that is to be transmitted. The payload
> will not normally be Mbytes but it is not a couple if bytes.
True... For a large read/write, it will eventually total what the i/o
request size was, and you did have to push it through the socket.
What this discussion really comes down to is the difference between initiator
offload and what a target does.
The initiator offloads the "full" i/o from the users - e.g. send command,
get response. In the initiator case, the user isn't aware of each and
every IU that makes up the i/o. As it's on an i/o basis, the LLDD doing
the offload needs the full buffer sitting and ready. DMA is preferred so
the buffer doesn't have to be consuming socket/kernel/driver buffers while
it's pending - plus speed.
In the target case, the target controls each IU and its size, thus it
only has to have access to as much buffer space as it wants to push the next
IU. The i/o can be "paced" by the target. Unfortunately, this is an entirely
different use model than users of a scsi initiator expect, and it won't map
well into replacing things like our sg_io ioctls.
> Instead of netlink for scsi commands and transport requests....
>
> For scsi commands could we just use sg io, or is there something special
> about the command you want to send? If you can use sg io for scsi
> commands, maybe for transport level requests (in my example iscsi pdu)
> we could modify something like sg/bsg/block layer scsi_ioctl.c to send
> down transport requests to the classes and encapsulate them in some new
> struct transport_requests or use the existing struct request but do that
> thing people keep taling about using the request/request_queue for
> message passing.
Well - there's 2 parts to this answer:
First: ioctls are considered dangerous/bad practice, and therefore it would
be nice to find a replacement mechanism that eliminates them. If that
mechanism has some of the cool features that netlink does, even better.
Using sg io in the manner you indicate wouldn't remove the ioctl use.
Note: I have OEMs/users that are very confused about the community's statement
about ioctls. They've heard they are bad, should never be allowed, and will no
longer be supported, yet they are at the heart of DM and sg io and other
subsystems. Other than a "grandfathered" explanation, they don't understand
why the rules bend for one piece of code but not for another. To them, all
the features are just as critical regardless of who is providing them.
Second: transport level i/o could be done like you suggest, and we've
prototyped some of this as well. However, there's something very wrong
about putting "block device" wrappers and settings around something that
is not a block device. In general, it's a heck of a lot of overhead and
still doesn't solve the real issue - how to portably pass that user buffer
in to/out of the kernel.
-- james s
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC] Netlink and user-space buffer pointers
2006-04-20 17:45 ` Mike Christie
` (2 preceding siblings ...)
2006-04-20 20:03 ` James Smart
@ 2006-04-20 20:18 ` Douglas Gilbert
3 siblings, 0 replies; 16+ messages in thread
From: Douglas Gilbert @ 2006-04-20 20:18 UTC (permalink / raw)
To: Mike Christie; +Cc: James.Smart, linux-scsi, netdev, linux-kernel
Mike Christie wrote:
> James Smart wrote:
>
>>Mike Christie wrote:
>>
>>>For the tasks you want to do for the fc class is performance critical?
>>
>>No, it should not be.
>>
>>
>>>If not, you could do what the iscsi class (for the netdev people this is
>>>drivers/scsi/scsi_transport_iscsi.c) does and just suffer a couple
>>>copies. For iscsi we do this in userspace to send down a login pdu:
>>>
>>> /*
>>> * xmitbuf is a buffer that is large enough for the iscsi_event,
>>> * iscsi pdu (hdr_size) and iscsi pdu data (data_size)
>>> */
>>
>>Well, the real difference is that the payload of the "message" is actually
>>the payload of the SCSI command or ELS/CT Request. Thus, the payload may
>
>
> I am not sure I follow. For iscsi, everything after the iscsi_event
> struct can be the iscsi request that is to be transmitted. The payload
> will not normally be Mbytes but it is not a couple if bytes.
>
>
>>range in size from a few hundred bytes to several kbytes (> 1 page) to
>>Mbyte's in size. Rather than buffer all of this, and push it over the
>>socket,
>>thus the extra copies - it would best to have the LLDD simply DMA the
>>payload like on a typical SCSI command. Additionally, there will be
>>response data that can be several kbytes in length.
>>
>
>
> Once you have got the buffer to the class, the class can create a
> scatterlist to DMA from for the LLD. I thought. iscsi does not do this
> just because it is software right now. For qla4xxx we do not need
> something like what you are talking about (see below for what I was
> thinking about for the initiators). If you are saying the extra step of
> the copy is plain dumb, I agree, but this happens (you have to suffer
> some copy and cannot do dio) for sg io as well in some cases. I think
> for the sg driver the copy_*_user is the default.
Mike,
Indirect IO is the default in the sg driver because:
- it has always been thus
- the sg driver is less constrained (e.g. max number
of scatg elements is a bigger issue with dio)
- the only alignment to worry about is byte
alignment (some folks would like bit alignment
but you can't please everybody)
- there is no need for the sg driver to pin user
pages in memory (as there is with direct IO and
mmaped-IO)
> Instead of netlink for scsi commands and transport requests....
With a netlink based pass through one might:
- improve on the SG_IO ioctl and add things like
tags that are currently missing
- introduce a proper SCSI task management function
pass through (no request queue please)
- make other pass throughs for SAS: SMP and STP
- have an alternative to sysfs for various control
functions in a HBA (e.g. in SAS: link and hard
reset) and fetching performance data from a HBA
Apart from how to get data efficiently between the HBA
and the user space, another major issue is the flexibility
of the bind() in s_netlink (storage netlink??).
> For scsi commands could we just use sg io, or is there something special
> about the command you want to send? If you can use sg io for scsi
> commands, maybe for transport level requests (in my example iscsi pdu)
> we could modify something like sg/bsg/block layer scsi_ioctl.c to send
> down transport requests to the classes and encapsulate them in some new
> struct transport_requests or use the existing struct request but do that
> thing people keep taling about using the request/request_queue for
> message passing.
Some SG_IO ioctl users want up to 32 MB in one transaction
and others want their data fast. Many pass through users
view the kernel as an impediment (not so much as "the way"
as "in the way").
Doug Gilbert
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC] Netlink and user-space buffer pointers
2006-04-20 20:03 ` James Smart
@ 2006-04-20 20:35 ` Mike Christie
2006-04-20 20:40 ` Mike Christie
2006-04-20 23:44 ` Andrew Vasquez
1 sibling, 1 reply; 16+ messages in thread
From: Mike Christie @ 2006-04-20 20:35 UTC (permalink / raw)
To: James.Smart; +Cc: linux-scsi, netdev, linux-kernel
James Smart wrote:
> Note: We've transitioned off topic. If what this means is "there isn't a
> good
> way except by ioctls (which still isn't easily portable) or system calls",
> then that's ok. Then at least we know the limits and can look at other
> implementation alternatives.
>
> Mike Christie wrote:
>> James Smart wrote:
>>> Mike Christie wrote:
>>>> For the tasks you want to do for the fc class is performance critical?
>>> No, it should not be.
>>>
>>>> If not, you could do what the iscsi class (for the netdev people
>>>> this is
>>>> drivers/scsi/scsi_transport_iscsi.c) does and just suffer a couple
>>>> copies. For iscsi we do this in userspace to send down a login pdu:
>>>>
>>>> /*
>>>> * xmitbuf is a buffer that is large enough for the iscsi_event,
>>>> * iscsi pdu (hdr_size) and iscsi pdu data (data_size)
>>>> */
>>> Well, the real difference is that the payload of the "message" is
>>> actually
>>> the payload of the SCSI command or ELS/CT Request. Thus, the payload may
>>
>> I am not sure I follow. For iscsi, everything after the iscsi_event
>> struct can be the iscsi request that is to be transmitted. The payload
>> will not normally be Mbytes but it is not a couple if bytes.
>
> True... For a large read/write - it will eventually total what the i/o
> request size was, and you did have to push it through the socekt.
> What this discussion really comes down to is the difference between
> initiator
> offload and what a target does.
>
> The initiator offloads the "full" i/o from the users - e.g. send command,
> get response. In the initiator case, the user isn't aware of each and
> every IU that makes up the i/o. As it's on an i/o basis, the LLDD doing
> the offload needs the full buffer sitting and ready. DMA is preferred so
> the buffer doesn't have to be consuming socket/kernel/driver buffers while
> it's pending - plus speed.
>
> In the target case, the target controls each IU and it's size, thus it
> only has to have access to as much buffer space as it wants to push the
> next
> IU. The i/o can be "paced" by the target. Unfortunately, this is an
> entirely
> different use model than users of a scsi initiator expect, and it won't map
> well into replacing things like our sg_io ioctls.
I am not talking about the target here. For the open-iscsi initiator
that is in mainline, which I referenced in the example, we send pdus from
userspace to the LLD. In the future, for initiators that offload some
iscsi processing and will log in from userspace, or have userspace monitor
the transport by doing iscsi pings, we need to be able to send these pdus.
And the iscsi pdu cannot be broken up at the iscsi level (it can at
the interconnect level though). From the iscsi host level they have to go
out like a scsi command would, in that the LLD cannot decide to send out
multiple pdus for the pdu that userspace sends down.
I do agree with you that a target can break down a scsi command into
multiple transport-level packets as it sees fit.
>
>> Instead of netlink for scsi commands and transport requests....
>>
>> For scsi commands could we just use sg io, or is there something special
>> about the command you want to send? If you can use sg io for scsi
>> commands, maybe for transport level requests (in my example iscsi pdu)
>> we could modify something like sg/bsg/block layer scsi_ioctl.c to send
>> down transport requests to the classes and encapsulate them in some new
>> struct transport_requests or use the existing struct request but do that
>> thing people keep taling about using the request/request_queue for
>> message passing.
>
> Well - there's 2 parts to this answer:
>
> First : IOCTL's are considered dangerous/bad practice and therefore it
> would
Yeah, I am not trying to kill ioctls. I go where the community goes.
What I am trying to do is just reuse the sg io mapping code so that we do
not end up with sg, st, target, the block layer's scsi_ioctl.c and bsg all
doing similar things.
> be nice to find a replacement mechanism that eliminates them. If that
> mechanism has some of the cool features that netlink does, even better.
> Using sg io, in the manner you indicate, wouldn't remove the ioctl use.
> Note: I have OEMs/users that are very confused about the community's
> statement
> about ioctls. They've heard they are bad, should never be allowed,
> will no
> be longer supported, but yet they are at the heart of DM and sg io and
> other
> subsystems. Other than a "grandfathered" explanation, they don't
> understand
> why the rules bend for one piece of code but not for another. To them,
> all
> the features are just as critical regardless of whose providing them.
>
> Second: transport level i/o could be done like you suggest, and we've
> prototyped some of this as well. However, there's something very wrong
> about putting "block device" wrappers and settings around something that
> is not a block device. In general, it's a heck of a lot of overhead and
> still doesn't solve the real issue - how to portably pass that user
> buffer
I am not talking about putting block device wrappers on it. This is where
the magic part and the message passing come in. A while back I made the
request queue a class (only sent the patch to Jens and did not follow up).
The original reason was for the io scheduler swap and some multipath
stuff, but with that the request queue would need a block device to be
exposed to userspace through sysfs and would not need a block device to
send messages through. You just need a way to communicate between
userspace and the kernel, but it does not have to be through a block
device. I think this path has other benefits in that you could do
userspace-level scanning as well, since you do not need the block device
and ULD like we do today.
> in to/out of the kernel.
>
>
> -- james s
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC] Netlink and user-space buffer pointers
2006-04-20 20:35 ` Mike Christie
@ 2006-04-20 20:40 ` Mike Christie
0 siblings, 0 replies; 16+ messages in thread
From: Mike Christie @ 2006-04-20 20:40 UTC (permalink / raw)
To: James.Smart; +Cc: linux-scsi, netdev, linux-kernel
Mike Christie wrote:
> James Smart wrote:
>> Note: We've transitioned off topic. If what this means is "there isn't a
>> good
>> way except by ioctls (which still isn't easily portable) or system calls",
>> then that's ok. Then at least we know the limits and can look at other
>> implementation alternatives.
>>
>> Mike Christie wrote:
>>> James Smart wrote:
>>>> Mike Christie wrote:
>>>>> For the tasks you want to do for the fc class is performance critical?
>>>> No, it should not be.
>>>>
>>>>> If not, you could do what the iscsi class (for the netdev people
>>>>> this is
>>>>> drivers/scsi/scsi_transport_iscsi.c) does and just suffer a couple
>>>>> copies. For iscsi we do this in userspace to send down a login pdu:
>>>>>
>>>>> /*
>>>>> * xmitbuf is a buffer that is large enough for the iscsi_event,
>>>>> * iscsi pdu (hdr_size) and iscsi pdu data (data_size)
>>>>> */
>>>> Well, the real difference is that the payload of the "message" is
>>>> actually
>>>> the payload of the SCSI command or ELS/CT Request. Thus, the payload may
>>> I am not sure I follow. For iscsi, everything after the iscsi_event
>>> struct can be the iscsi request that is to be transmitted. The payload
>>> will not normally be Mbytes but it is not a couple if bytes.
>> True... For a large read/write - it will eventually total what the i/o
>> request size was, and you did have to push it through the socekt.
>> What this discussion really comes down to is the difference between
>> initiator
>> offload and what a target does.
>>
>> The initiator offloads the "full" i/o from the users - e.g. send command,
>> get response. In the initiator case, the user isn't aware of each and
>> every IU that makes up the i/o. As it's on an i/o basis, the LLDD doing
>> the offload needs the full buffer sitting and ready. DMA is preferred so
>> the buffer doesn't have to be consuming socket/kernel/driver buffers while
>> it's pending - plus speed.
>>
>> In the target case, the target controls each IU and it's size, thus it
>> only has to have access to as much buffer space as it wants to push the
>> next
>> IU. The i/o can be "paced" by the target. Unfortunately, this is an
>> entirely
>> different use model than users of a scsi initiator expect, and it won't map
>> well into replacing things like our sg_io ioctls.
>
>
> I am not talking about the target here. For the open-iscsi initiator
> that is in mainline that I referecnced in the example we send pdus from
> userpsace to the LLD. In the future, initaitors that offload some iscsi
> processing and will login from userspace or have userspace monitor the
> transport by doing iscsi pings, we need to be able to send these pdus.
> And the iscsi pdu cannot be broken up at the iscsi level (they can at
> the interconect level though). From the iscsi host level they have to go
> out like a scsi command would in that the LLD cannot decide to send out
> mutiple pdus for he pdu that userspace sends down.
>
> I do agree with you that targets can break down a scsi command into
> multiple transport level packets as it sees fit.
>
Oh yeah is
FC IU == iscsi tcp packet
or
FC IU == iscsi pdu
?
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC] Netlink and user-space buffer pointers
2006-04-20 20:03 ` James Smart
2006-04-20 20:35 ` Mike Christie
@ 2006-04-20 23:44 ` Andrew Vasquez
1 sibling, 0 replies; 16+ messages in thread
From: Andrew Vasquez @ 2006-04-20 23:44 UTC (permalink / raw)
To: James Smart; +Cc: Mike Christie, linux-scsi, netdev, linux-kernel
On Thu, 20 Apr 2006, James Smart wrote:
> Note: We've transitioned off topic. If what this means is "there isn't a
> good
> way except by ioctls (which still isn't easily portable) or system calls",
> then that's ok. Then at least we know the limits and can look at other
> implementation alternatives.
this topic has been brought up many times in the past, most recently:
http://thread.gmane.org/gmane.linux.drivers.openib/19525/focus=19525
http://thread.gmane.org/gmane.linux.kernel/387375/focus=387455
where it was suggested that the pathscale folks use some blend of sysfs,
netlink sockets and debugfs:
http://kerneltrap.org/node/4394
> >>Mike Christie wrote:
> >Instead of netlink for scsi commands and transport requests....
> >
> >For scsi commands could we just use sg io, or is there something special
> >about the command you want to send? If you can use sg io for scsi
> >commands, maybe for transport level requests (in my example iscsi pdu)
> >we could modify something like sg/bsg/block layer scsi_ioctl.c to send
> >down transport requests to the classes and encapsulate them in some new
> >struct transport_requests or use the existing struct request but do that
> >thing people keep taling about using the request/request_queue for
> >message passing.
>
> Well - there's 2 parts to this answer:
>
> First : IOCTL's are considered dangerous/bad practice and therefore it would
> be nice to find a replacement mechanism that eliminates them. If that
> mechanism has some of the cool features that netlink does, even better.
> Using sg io, in the manner you indicate, wouldn't remove the ioctl use.
> Note: I have OEMs/users that are very confused about the community's
> statement
> about ioctls. They've heard they are bad, should never be allowed, will no
> be longer supported, but yet they are at the heart of DM and sg io and
> other
> subsystems. Other than a "grandfathered" explanation, they don't
> understand
> why the rules bend for one piece of code but not for another. To them, all
> the features are just as critical regardless of whose providing them.
I believe it to be the same for most hardware vendors' customers...
> Second: transport level i/o could be done like you suggest, and we've
> prototyped some of this as well. However, there's something very wrong
> about putting "block device" wrappers and settings around something that
> is not a block device.
Eeww... no wrappers. Your netlink prototypes certainly get the FC
transport further along, but it would also be nice if there could be some
subsystem consensus on *the* interface.
I honestly don't know which interface is *best*, but from an HBA
vendor's perspective, managing per-request locally allocated memory is
undesirable.
Thanks,
av
^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread
Thread overview: 16+ messages
[not found] <1145306661.4151.0.camel@localhost.localdomain>
[not found] ` <20060418160121.GA2707@us.ibm.com>
2006-04-19 12:57 ` [RFC] Netlink and user-space buffer pointers James Smart
2006-04-19 16:22 ` Patrick McHardy
2006-04-19 17:08 ` James Smart
2006-04-19 17:16 ` Patrick McHardy
2006-04-19 16:26 ` Stephen Hemminger
2006-04-19 17:05 ` James Smart
2006-04-19 21:32 ` Mike Christie
2006-04-20 14:33 ` James Smart
2006-04-20 17:45 ` Mike Christie
2006-04-20 17:52 ` Mike Christie
2006-04-20 17:58 ` Mike Christie
2006-04-20 20:03 ` James Smart
2006-04-20 20:35 ` Mike Christie
2006-04-20 20:40 ` Mike Christie
2006-04-20 23:44 ` Andrew Vasquez
2006-04-20 20:18 ` Douglas Gilbert